Problem
I'm experiencing an issue where Consonance loses track of multiple provisioned EC2 instances and fails to reap those instances.
I run 16 jobs (all intended to fail), each on its own c4.4xlarge, with the Youxia config allowing a maximum of 15 instances at a time and a batch size of 7. When the first batch of 7 fails, only one provisioned instance is set to FAILED; the others are left on RUNNING. Although all of these instances terminate successfully (confirmed via the EC2 console), the second batch of 7 and the final batch of 2 never get IP addresses listed in the Provision table, and their Provision status never changes from START. The job status for all 16 jobs eventually turns to FAILED, but none of the instances in the last two batches are ever reaped.
Below is the SQL command I use to merge columns from the Job and Provision tables, and its output.
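Roughly, the merge is a join along these lines (table and column names here are illustrative and may not match the actual Consonance schema):

```sql
-- Illustrative sketch only: the real Consonance table/column names may differ.
SELECT j.job_uuid,
       j.status     AS job_status,
       p.status     AS provision_status,
       p.ip_address
FROM job j
LEFT JOIN provision p ON p.job_uuid = j.job_uuid
ORDER BY j.job_uuid;
```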
The container provisioner log notes that it cannot find an IP address listed for those instances, and the only items added to the kill list are the old jobs, which are removed once Consonance detects that their instances have already been terminated. Additionally, the RabbitMQ queue for CleanupVMs accumulates thousands of unacknowledged messages.
Possible Patch
Setting numerous breakpoints at the reaping calls reveals that this line in the call() method of the CleanupVMs class in ContainerProvisionerThreads.java is hit before any other reaping call. Since none of the nodes in the cluster are "orphans", this seems suspicious.
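To illustrate why this worries me (a rough conceptual sketch, not the actual CleanupVMs/Youxia code): if the orphan check compares instances reported by the cloud against IP addresses recorded in the Provision table, then freshly provisioned instances whose IPs were never written to that table would be indistinguishable from genuine orphans and could end up queued for cleanup:

```java
// Conceptual sketch only -- NOT the actual CleanupVMs/Youxia code.
// Shows how an orphan check keyed on recorded IP addresses could
// misclassify instances whose IPs never reach the Provision table.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class OrphanCheckSketch {
    /**
     * Returns the instances that would be scheduled for reaping: any instance
     * reported by the cloud whose IP is absent from the provision table.
     */
    static List<String> findOrphans(Set<String> cloudInstanceIps, Set<String> provisionTableIps) {
        List<String> killList = new ArrayList<>();
        for (String ip : cloudInstanceIps) {
            // A freshly provisioned instance whose IP was never written to the
            // Provision table looks identical to a genuine orphan here.
            if (!provisionTableIps.contains(ip)) {
                killList.add(ip);
            }
        }
        return killList;
    }
}
```

If something along these lines is happening, it might also explain the pile-up of CleanupVMs messages, but that is only a guess on my part.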
We commented out lines 395-410 of the file above and rebuilt the jar. With our patched v16 we ran the same set of 16 jobs on c4.4xlarge instances and saw none of the above issues: every provisioned instance was updated to RUNNING along with the corresponding Job status, set to FAILED after its job ended, and terminated accordingly. The resulting Job-Provision merge output is here:
I'm not inclined to say the commented-out code is strictly responsible for the reaping failure, but removing it appears to fix the bug. I'd appreciate any insight into why the Orphan Reaper might be involved in the failure, why the failure exists in the first place, and what a more effective solution might be.