feat: With this PR we add the possibility to have multiple connection pools in Orca #4619
Conversation
The following commits need their title changed. Please format your commit title into the conventional form; this allows us to easily generate changelogs & determine semantic version numbers when cutting releases. You can read more about commit conventions here.
LGTM 🚀 Thank you!
Let's make sure to document this change alongside SQL configs for other services. Otherwise LGTM, and would be good to get @dbyron0's take as well.
I appreciate the effort here. Sounds like we've been running into similar struggles with the scalability of orca's database. We solve it a bit differently...also with multiple connection pools, but configured in a way that works with the current mechanism (i.e. the "write" connection pool is still named default).
Besides that difference though, we found that using the read replica for some read operations doesn't work...or at least it doesn't work without some other significant changes to teach orca to be aware of replication lag. When one task writes something to the database and another task runs immediately afterwards and reads, the current code expects to get back exactly what was written, but with replication lag that doesn't always happen.
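For readers unfamiliar with the named-pool approach, a minimal sketch of what such a config might look like is below. The property names are modeled on Spinnaker's SQL configuration style but are illustrative, not the exact keys from this PR:

```yaml
# Hypothetical sketch: two named connection pools, with the write pool
# keeping the name "default" so existing code paths are unaffected.
# Property names are illustrative, not the exact keys from this PR.
sql:
  connectionPools:
    default:            # write pool; keeps the existing name
      jdbcUrl: jdbc:mysql://writer.example.com:3306/orca
      user: orca_service
    read:               # read-only pool pointed at the replica
      jdbcUrl: jdbc:mysql://reader.example.com:3306/orca
      user: orca_service
```

Keeping the write pool named `default` is the key design point: code that never asks for a specific pool behaves exactly as before.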
We're still working through the steps to get the changes to handle that rolled out in prod and gain confidence in it.
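The stale-read hazard described above can be illustrated with a toy simulation (plain Python, no real database; all names are invented for illustration):

```python
# Toy illustration of replication lag: the replica applies writes only
# when replicate() runs, so a read issued between a write and the next
# replication cycle sees stale data.
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []          # pending changes not yet replicated

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}

def replicate(primary, replica):
    # Apply all pending changes from the primary's log to the replica.
    for key, value in primary.log:
        replica.data[key] = value
    primary.log.clear()

primary, replica = Primary(), Replica()

# Task A writes; Task B reads from the replica immediately afterwards:
primary.write("stage:1:status", "SUCCEEDED")
stale = replica.data.get("stage:1:status")   # lag: write not applied yet

replicate(primary, replica)
fresh = replica.data.get("stage:1:status")   # visible after replication
```

Code that assumes `fresh` semantics on every read breaks the moment reads are routed to a lagging replica, which is why the read/write split can't be applied blindly.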
We have been using a read replica for a subset of operations though. Lemme see if I can get a PR for that.
Curious about the places you've found that don't handle async reads after writes. I know the correlation ids are one we hit. Our solution was to modify those to ignore a unique-constraint failure when the post-read insert runs. I'd not traced all the places that look for a similar read-after-write operation that would be impacted, but I THOUGHT most of those stages either ignore constraint failures OR retry the read on the next queue operation. That said, it would NOT surprise me if there are more such places.
Per slack, SF is intending at some point to contribute more PRs around using RO replicas in Orca. Closing this for now.
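The "ignore unique-constraint failures on the post-read insert" pattern mentioned above can be sketched like this (Python/sqlite3 as a stand-in for illustration; Orca's actual code is JVM-based and the table/column names here are invented):

```python
import sqlite3

# Stand-in for the pattern: a task reads, finds no correlation id, and
# tries to insert one. If a concurrent task inserted it first, the
# unique constraint fires; we treat that as "already claimed", not an error.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE correlation_ids (id TEXT PRIMARY KEY, execution_id TEXT)"
)

def claim_correlation_id(conn, correlation_id, execution_id):
    """Return True if we inserted the row, False if it already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO correlation_ids (id, execution_id) VALUES (?, ?)",
                (correlation_id, execution_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Unique constraint hit: another writer got there first.
        return False

first = claim_correlation_id(conn, "corr-123", "exec-a")
second = claim_correlation_id(conn, "corr-123", "exec-b")
```

This makes the insert idempotent under races, which is what lets the preceding read tolerate staleness for this particular code path.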
With this PR we add the possibility to have multiple connection pools in Orca: one for writes and one for reads.
Orca's operations are primarily READ operations from the database (almost 85% of the SQL transactions are SELECT statements). For high-scale customers with hundreds of applications/pipelines with big execution contexts, this translates to extreme pressure on the backend database and high network utilization, transferring high volumes of data from the database to the Orca pods.
We noticed that for high-scale customers we have twice doubled the instance types, and we have now reached a situation where the Writer endpoint of the database is being throttled on network bandwidth (hitting the maximum of 20 Gbps).
On the flip side, the Reader instance of the Orca database sits idle, since Orca doesn't support read-only operations through the SQL connection pools. Having Orca split the traffic between READS and WRITES across the database endpoints will dramatically improve performance and utilization, and also provide cost savings for high-scale customers.
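The read/write split can be pictured as a tiny router that hands read-only statements to the replica pool and everything else to the writer. This is an illustrative sketch, not Orca's actual API; the pool objects here are plain strings standing in for real DataSources:

```python
# Minimal sketch of read/write pool routing. Plain SELECTs go to the
# replica pool; mutations and locking reads (SELECT ... FOR UPDATE)
# must still hit the writer.
class PoolRouter:
    def __init__(self, write_pool, read_pool):
        self.pools = {"write": write_pool, "read": read_pool}

    def pool_for(self, sql):
        statement = sql.lstrip().upper()
        if statement.startswith("SELECT") and "FOR UPDATE" not in statement:
            return self.pools["read"]
        return self.pools["write"]

router = PoolRouter(write_pool="writer-endpoint", read_pool="reader-endpoint")
a = router.pool_for("SELECT body FROM pipeline_executions WHERE id = ?")
b = router.pool_for("UPDATE pipeline_executions SET status = ? WHERE id = ?")
c = router.pool_for("SELECT * FROM queue_messages FOR UPDATE")
```

Since roughly 85% of Orca's statements are SELECTs, even this coarse routing moves the bulk of the network traffic off the writer endpoint, subject to the replication-lag caveats discussed in the review comments.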
We have tested this in our environment; below you can find the statistics.
[Screenshots attached to the PR: image (11), image (10), image (9), image (12) — expired private-image links omitted.]