
feat: use etags for caching to avoid backend/db load when idle #2731

Merged (34 commits) on Sep 17, 2024

Conversation

@corneliusroemer (Contributor) commented Sep 9, 2024

Co-authored with @anna-parker and Claude Sonnet

related to: #2092

preview URL: https://table-update-tracker.loculus.org

Summary

  1. Adds an UpdateTrackerTable to the database, recording the last time each table in the database was modified.

(screenshot: iTerm2 2024-09-09 19 46 36)

  2. Adds an optional If-None-Match header to get-released-data. The endpoint accepts an etag and returns a lastDatabaseWriteETag corresponding to the lastDatabaseWrite. If the supplied etag differs from the current lastDatabaseWriteETag, get-released-data responds with all released data and the current lastDatabaseWriteETag; if the etag is equal to the current lastDatabaseWriteETag, get-released-data responds with 304: NOT_MODIFIED.
  3. Updates silo_import_job.sh and silo_import_wrapper.sh to use the If-None-Match header. This significantly reduces the load on the database and backend; the times below are from after preprocessing has finished (CPU is in mCPU):
$ kubectl top pod --sum --containers --all-namespaces | egrep "backend|s-database" | egrep "main|tracker-anya" | sort -hk4
(bs-mbpas-0092: Wed Sep 11 19:51:30 2024)

prev-table-update-tracker-anya   loculus-backend-c859b6f4b-gj5k5                                   backend                              19m          421Mi
prev-table-update-tracker-anya   loculus-database-67f7d88557-9mf6v                                 database                             101m         326Mi
prev-main                        loculus-database-645647c886-kh5vf                                 database                             113m         234Mi
prev-main                        loculus-backend-55f5cdb6cb-p7p4f                                  backend                              300m         483Mi
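The polling flow described in item 3 can be sketched roughly as follows. This is a minimal illustration of the decision logic only; the URL, file paths, and function names here are hypothetical placeholders, not the actual contents of silo_import_job.sh:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of etag-based conditional polling.
# BACKEND_URL and ETAG_FILE are illustrative assumptions.
set -euo pipefail

ETAG_FILE="${ETAG_FILE:-/tmp/released_data.etag}"

# Decide whether new data must be processed, given the status code of a
# conditional GET: 304 means the cached data is still current, anything
# else (normally 200) means the backend has newer data.
needs_refresh() {
  [ "$1" != "304" ]
}

# In the real job, the status code would come from a conditional request
# along these lines:
#   status=$(curl -s -o released.ndjson -w '%{http_code}' \
#     -H "If-None-Match: $(cat "$ETAG_FILE" 2>/dev/null || true)" \
#     "$BACKEND_URL/get-released-data")
# On 200, the script would store the new etag and rebuild the SILO index;
# on 304, it sleeps and polls again, leaving backend and database idle.
```

On a 200 response the new lastDatabaseWriteETag would be persisted for the next poll, which is what lets an idle instance settle into cheap 304 round trips.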

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@corneliusroemer corneliusroemer added the preview Triggers a deployment to argocd label Sep 9, 2024
@anna-parker (Contributor) commented Sep 9, 2024

Ideas from pair programming: we can combine this with an HTTP header, If-Modified-Since (or, possibly better, ETag), for the get-released-data endpoint and also for the preprocessing pipeline.

Update silo_import_job.sh to first check for changes before fetching all released data.

@corneliusroemer (Contributor, Author)

One issue we have right now is that the "current_processing_pipeline" table seems to be updated without any real changes. It still shows this:

loculus=# select * from table_update_tracker;
 groups_table                       | 2024-09-09 17:14:34.637079
 user_groups_table                  | 2024-09-09 17:14:34.637079
 current_processing_pipeline        | 2024-09-09 17:46:29.561886
 sequence_entries_preprocessed_data | 2024-09-09 17:46:13.487679
 sequence_entries                   | 2024-09-09 17:23:21.609788
 sequence_upload_aux_table          | 2024-09-09 17:15:51.885098
 metadata_upload_aux_table          | 2024-09-09 17:15:51.885098

loculus=# select * from current_processing_pipeline;
       1 | 2024-09-09 17:13:09.196022

We either have to make the trigger function smarter, so it only fires on actual changes (not on UPDATE commands that don't change anything), or we use the timestamp from that table directly instead of the "last update" table.

@corneliusroemer (Contributor, Author)

Same with sequence_entries_preprocessed_data: it also gets an UPDATE issued without any real underlying change, it seems.

For now, we can probably look up the timestamp directly in the version table, and we might be able to ignore the preprocessed data table entirely.

@anna-parker (Contributor)

Hmm, I wonder if the fact that current_processing_pipeline and sequence_entries_preprocessed_data are continuously getting updated is actually a sign of a bug in the backend: every update call means a database transaction that is not required.

@anna-parker (Contributor) commented Sep 10, 2024

sequence_entries_preprocessed_data

Just looked into this with @chaoran-chen - thanks! useNewerProcessingPipelineIfPossible is constantly performing an update transaction - this is where a change would be required. I checked with ChatGPT, and even better than splitting the query into a check and an update, we can do this with a conditional update:

-- Only issue the UPDATE when a value would actually change, so no-op
-- updates don't touch the row (and don't fire the update trigger).
UPDATE users
SET name = 'John Doe', email = 'john.doe@example.com'
WHERE (name != 'John Doe' OR email != 'john.doe@example.com')
AND user_id = 123;

Implemented in 3fe4d5f

@corneliusroemer (Contributor, Author)

Great! Much easier to get rid of these spurious updates than I feared!

What about the continuous updates to sequence_entries_preprocessed_data - is that also covered by your commit? I'll merge it into this one after checks.

@anna-parker (Contributor) commented Sep 10, 2024

After further investigation, the conditional update that ChatGPT suggested does not seem to work - it still causes a database update every 10 seconds even when there are no changes. Additionally, cleanUpStaleSequencesInProcessing always performs a DELETE on the database every minute, even when there is nothing to delete.

@corneliusroemer corneliusroemer added the format_me Triggers github_actions to format website code on PR label Sep 10, 2024
@fengelniederhammer (Contributor) left a comment

Is it possible to add a unit test that checks that it works? The new SQL file seems at least medium complex.

@corneliusroemer (Contributor, Author)

ETag works on the client side as well:

(screenshot: Brave Browser 2024-09-12 00 35 24)

@corneliusroemer (Contributor, Author)

This decreases backend/database load (and hence costs) by a lot! As I expected when opening #2092 🎉

(screenshot: iTerm2 2024-09-12 10 03 36)

corneliusroemer and others added 3 commits September 16, 2024 09:41
…/UpdateTrackerTable.kt

Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
…/UpdateTrackerTable.kt

Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
…_tracker.sql

Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
@fengelniederhammer (Contributor) left a comment

A few minor things left, but looks good 👍

@corneliusroemer corneliusroemer enabled auto-merge (squash) September 17, 2024 18:22
@corneliusroemer corneliusroemer changed the title feat: add table_update_tracker tracking last change per table feat: use etags for caching to avoid backend/db load when idle Sep 17, 2024
@corneliusroemer corneliusroemer merged commit 5cd7069 into main Sep 17, 2024
18 checks passed
@corneliusroemer corneliusroemer deleted the table-update-tracker branch September 17, 2024 18:24
Labels
format_me Triggers github_actions to format website code on PR preview Triggers a deployment to argocd