Provision table not updated after failed jobs, reaping unsuccessful #148

alex-hancock opened this issue Aug 5, 2017 · 0 comments
Problem

I'm experiencing an issue where Consonance loses track of multiple provisioned EC2 instances and fails to reap them.

I run 16 jobs (all intended to fail), each on its own c4.4xlarge, while the Youxia config allows a maximum of 15 instances at a time and a batch size of 7. When the first batch of 7 fails, only one provision is set to FAILED; the others are left at RUNNING. Although all of those instances terminate successfully (confirmed via the EC2 console), the second batch of 7 and the final batch of 2 never get IP addresses recorded in the Provision table, and their provision status never moves past START. The job status for every job eventually reaches FAILED, but none of the instances in the last two batches are ever reaped.

Below is the SQL query I use to join the Job and Provision tables, along with its output.

postgres=# SELECT
job.job_uuid,
job.provision_uuid,             
job.status AS job_status,
job.flavour,
job.create_timestamp AS created_timestamp,
job.update_timestamp AS job_update,
provision.ip_address,
provision.status AS provision_status
FROM job, provision
WHERE job.job_uuid=provision.job_uuid
ORDER BY job.update_timestamp;
               job_uuid               |   provision_uuid    | job_status |  flavour   |    created_timestamp    |         job_update         |  ip_address   | provision_status 
--------------------------------------+---------------------+------------+------------+-------------------------+----------------------------+---------------+------------------
 a777855e-50d6-418a-b9c4-180e61f03a01 | i-019d809ee75730049 | FAILED     | c4.4xlarge | 2017-08-04 23:10:14.691 | 2017-08-04 23:24:52.488541 | 172.31.16.88  | FAILED
 3432cdba-a841-486d-9254-68e67d6ede66 | i-0b3a0c23cb7ed8de0 | FAILED     | c4.4xlarge | 2017-08-04 23:10:02.037 | 2017-08-04 23:25:37.918831 | 172.31.29.64  | RUNNING
 7c3a8f15-1784-4e8b-b9bd-16c85c4f8c34 | i-00617c0267a1d8396 | FAILED     | c4.4xlarge | 2017-08-04 23:10:08.198 | 2017-08-04 23:28:43.963945 | 172.31.31.45  | RUNNING
 8b3e1ea4-0120-4d5a-9750-b9e89b80aecf | i-093b77ffa878827c5 | FAILED     | c4.4xlarge | 2017-08-04 23:10:08.104 | 2017-08-04 23:30:25.860921 | 172.31.17.212 | RUNNING
 6500d1c9-16fd-4eb2-a56b-32ea34e72e2e | i-056cebc65e0dcd6bf | FAILED     | c4.4xlarge | 2017-08-04 23:10:14.806 | 2017-08-04 23:30:34.37672  | 172.31.31.215 | RUNNING
 02779cb3-14d8-4b28-b31b-33240f7cf5ab | i-0837f2465ec7a7329 | FAILED     | c4.4xlarge | 2017-08-04 23:10:02.037 | 2017-08-04 23:30:47.270624 | 172.31.18.6   | RUNNING
 9a60b12f-eaa0-48a1-959d-5a9a14e2dd09 | i-0d555364e4d6b0f43 | FAILED     | c4.4xlarge | 2017-08-04 23:10:25.426 | 2017-08-04 23:30:58.547793 | 172.31.25.46  | RUNNING
 4457078c-54d2-4f22-94c0-066362e7c001 | i-02332abfa4fd538d8 | FAILED     | c4.4xlarge | 2017-08-04 23:10:30.62  | 2017-08-04 23:35:48.382019 |               | START
 e8c48430-68d2-4436-9c90-2b5035edb9c6 | i-02c9c6dea3b18861e | FAILED     | c4.4xlarge | 2017-08-04 23:10:37.348 | 2017-08-04 23:35:50.609924 |               | START
 4476a64d-8d2c-47ef-8c75-3dc7f6a07614 | i-0f721f57d34598b05 | FAILED     | c4.4xlarge | 2017-08-04 23:10:20.018 | 2017-08-04 23:36:23.810358 |               | START
 9e132ddd-554e-408c-8f86-5698d6b21a49 | i-001bfe7836c7d99af | FAILED     | c4.4xlarge | 2017-08-04 23:10:25.352 | 2017-08-04 23:38:27.233367 |               | START
 5be73d0f-d9c6-4e6d-b326-d2845d71cecf | i-02b8ed7eb273d3309 | FAILED     | c4.4xlarge | 2017-08-04 23:10:42.833 | 2017-08-04 23:39:44.279985 |               | START
 16c9f06c-dff3-4947-b097-b40dfa87dd3e | i-02f82d771cae508e4 | FAILED     | c4.4xlarge | 2017-08-04 23:10:19.643 | 2017-08-04 23:41:05.5611   |               | START
 bad95583-b07b-4687-a5a1-fb0663e8646b | i-0240d0144dec11daa | FAILED     | c4.4xlarge | 2017-08-04 23:10:30.954 | 2017-08-04 23:41:16.039627 |               | START
 8bbd902b-9b12-4f9e-b9e6-472223c06141 | i-072fc17dbf3913fb8 | FAILED     | c4.4xlarge | 2017-08-04 23:10:37.92  | 2017-08-04 23:48:39.109857 |               | START
 ec471fe7-72fe-41a7-ae1f-90c047341fb9 | i-0619b6e725951f5f2 | FAILED     | c4.4xlarge | 2017-08-04 23:10:42.916 | 2017-08-04 23:50:42.482837 |               | START

The container provisioner log notes that it cannot find an IP address for those instances, and the only items added to the kill list are the old jobs, which are removed once Consonance detects that they have already been terminated. Additionally, thousands of unacknowledged messages pile up in the RabbitMQ CleanupVMs queue.
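
As a quick way to gauge how many provisions are stuck in this state, a count along these lines works against the same schema as the query above (assuming an unassigned ip_address is stored as NULL or an empty string):

postgres=# SELECT count(*) AS stuck_provisions
FROM provision
WHERE status = 'START'
  AND (ip_address IS NULL OR ip_address = '');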

Possible Patch

Setting breakpoints at the reaping calls reveals that this line in the call method of the CleanupVMs class in ContainerProvisionerThreads.java runs before any other reaping call. Since none of the nodes in the cluster are "orphans", this seems suspicious.

We commented out lines 395-410 of that file and rebuilt the jar. With our patched v16, we ran the same batch of 16 jobs on c4.4xlarge instances and saw none of the above issues: every provision was updated to RUNNING in step with the corresponding job status, set to FAILED after the job ended, and terminated accordingly. The resulting Job-Provision join is below:

postgres=# SELECT
job.job_uuid, 
job.provision_uuid,
job.status AS job_status,
job.flavour,
job.create_timestamp AS created_timestamp,
job.update_timestamp AS job_update,
provision.ip_address,
provision.status AS provision_status
FROM job, provision
WHERE job.job_uuid=provision.job_uuid  
ORDER BY job.update_timestamp;
               job_uuid               |   provision_uuid    | job_status |  flavour   |    created_timestamp    |         job_update         |  ip_address   | provision_status 
--------------------------------------+---------------------+------------+------------+-------------------------+----------------------------+---------------+------------------
 cd22448b-e020-4d1f-b546-ece7d23baf26 | i-028b3dacc84dcd633 | FAILED     | c4.4xlarge | 2017-08-05 00:39:28.629 | 2017-08-05 00:54:27.846297 | 172.31.24.51  | FAILED
 0a64434e-e74b-4624-9c7a-0d4cad0b2d1f | i-0dbc488fdc9d5ffe6 | FAILED     | c4.4xlarge | 2017-08-05 00:39:23.857 | 2017-08-05 00:59:58.888847 | 172.31.31.164 | FAILED
 1ea2ec1f-4e28-4737-89f7-ef98964af909 | i-05b3cc2f5f5c667a0 | FAILED     | c4.4xlarge | 2017-08-05 00:39:33.375 | 2017-08-05 01:00:01.041602 | 172.31.25.229 | FAILED
 b00353f5-b8bc-4498-9534-2ce787cc7d5c | i-005f5fb84a85768f7 | FAILED     | c4.4xlarge | 2017-08-05 00:39:14.439 | 2017-08-05 01:00:03.928182 | 172.31.20.21  | FAILED
 8293e6ed-ae30-4ac4-ade9-8d08415bdb18 | i-08d17ebfca27f0b23 | FAILED     | c4.4xlarge | 2017-08-05 00:39:19.021 | 2017-08-05 01:00:06.23457  | 172.31.24.254 | FAILED
 9dbdf2ed-67e6-4b81-af98-0ed848c676e7 | i-0a9aec76be2773aba | FAILED     | c4.4xlarge | 2017-08-05 00:39:47.077 | 2017-08-05 01:00:08.122962 | 172.31.22.6   | FAILED
 4bc6ae94-eec9-4e54-bbfd-91bfe2a0ce90 | i-050aca3da447e4a96 | FAILED     | c4.4xlarge | 2017-08-05 00:40:08.557 | 2017-08-05 01:00:09.660584 | 172.31.17.145 | FAILED
 3b8087a8-d964-4722-9209-1fa70ab8dad2 | i-0e09067239fbee752 | FAILED     | c4.4xlarge | 2017-08-05 00:40:13.718 | 2017-08-05 01:05:11.680508 | 172.31.16.160 | FAILED
 bcbff0ed-7282-4bfe-9e3b-bcef8a9de65f | i-050a2d500544d658e | FAILED     | c4.4xlarge | 2017-08-05 00:39:51.832 | 2017-08-05 01:08:29.564504 | 172.31.29.218 | FAILED
 ee1eaf33-524e-4d00-9326-131956f171c8 | i-0e44cb8618944961f | FAILED     | c4.4xlarge | 2017-08-05 00:39:42.349 | 2017-08-05 01:08:31.100526 | 172.31.23.29  | FAILED
 851e046e-04aa-4e1b-bc6a-7689971be7ab | i-075b986bb91635c49 | FAILED     | c4.4xlarge | 2017-08-05 00:39:37.903 | 2017-08-05 01:08:32.463062 | 172.31.29.85  | FAILED
 a444e873-4c8c-4233-a7a0-e50c2f51a2f0 | i-0b69bc43695d23c50 | FAILED     | c4.4xlarge | 2017-08-05 00:40:19.185 | 2017-08-05 01:08:33.991364 | 172.31.23.246 | FAILED
 f16828de-d9f5-4d2e-9755-1b9c7fe6659f | i-0d6c73106d865950c | FAILED     | c4.4xlarge | 2017-08-05 00:40:03.429 | 2017-08-05 01:08:35.269804 | 172.31.24.113 | FAILED
 93932a49-ff8a-4605-90bb-eb981e93ff87 | i-0fd181b160bb3898e | FAILED     | c4.4xlarge | 2017-08-05 00:39:58.232 | 2017-08-05 01:11:14.82567  | 172.31.24.240 | FAILED
 a4ae09ea-54f3-4fed-88a6-18f919545793 | i-0572ded166b330abb | FAILED     | c4.4xlarge | 2017-08-05 00:40:24.571 | 2017-08-05 01:13:52.419499 | 172.31.24.36  | FAILED
 954524f8-345e-4b08-91ef-9b936803442f | i-0ce7cc86ba8ab33e3 | FAILED     | c4.4xlarge | 2017-08-05 00:40:29.599 | 2017-08-05 01:17:24.727598 | 172.31.27.12  | FAILED

I'm not inclined to say the commented-out bits are strictly responsible for the reaping failure, but removing them appears to have fixed the bug. I'd appreciate any insight into why the orphan reaper might be involved in the failure, why the failure exists in the first place, and what a more effective solution might look like.
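
For anyone wanting to check their own deployment for the same inconsistency, this query (reusing the join above) lists rows where the job has reached FAILED but its provision never did:

postgres=# SELECT job.job_uuid, job.provision_uuid, provision.status AS provision_status
FROM job, provision
WHERE job.job_uuid=provision.job_uuid
  AND job.status='FAILED'
  AND provision.status<>'FAILED';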
