Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jdbc] Refactor make-cached-row-num->i->thunk #179

Closed

Conversation

alexander-yakushev
Copy link
Contributor

@alexander-yakushev alexander-yakushev commented Sep 13, 2024

I made two changes to make-cached-row-num->i->thunk:

  1. Apparently, the cache implemented in this function was not working because the cached value was just discarded, and the new value was always recomputed and re-cached.
  2. I removed one level of closures/thunking in the implementation while keeping the interface the same.

I also have a few questions for some future potential changes:

  1. Why does the internal function take current-row-num if it is in fact only used for clearing the cache? Can (.getRow rset) be used instead, like in other functions?
  2. What is it caching anyway? It looks to me that read-column-thunk (which eventually produces the actual values that are then put into cached-values atom) doesn't seem to perform anything worth caching. This is probably the reason why nobody noticed yet that cache wasn't actually working.

If cache was not needed here, the function and approach can be simplified much more. If backward compatibility of these particular functions is not a concern (they look like a part of the internal API), then even more improvements are possible.

Copy link

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.12%. Comparing base (e14a45e) to head (bf301c2).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
src/toucan2/jdbc/read.clj 84.61% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
- Coverage   83.23%   83.12%   -0.12%     
==========================================
  Files          37       37              
  Lines        2511     2506       -5     
  Branches      215      215              
==========================================
- Hits         2090     2083       -7     
- Misses        206      208       +2     
  Partials      215      215              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@camsaul
Copy link
Owner

camsaul commented Oct 4, 2024

  1. Why does the internal function take current-row-num if it is in fact only used for clearing the cache? Can (.getRow rset) be used instead, like in other functions?

Yeah, it's only used for clearing the cache. I guess (.getRow rset) would be fine. I think my thinking must have been calling it once and passing it around was better than calling it 20 times for 20 columns or something like that but maybe I was imagining stuff

  1. What is it caching anyway? It looks to me that read-column-thunk (which eventually produces the actual values that are then put into cached-values atom) doesn't seem to perform anything worth caching. This is probably the reason why nobody noticed yet that cache wasn't actually working.

The idea is that if you get the value of say id twice then the first time does (.getInteger rset 1) or whatever and caches it and then the second time returns the cached value instead of fetching that value from the ResultSet again. But maybe this is not really important or the overhead of the caching stuff simply is not worth it

(defn- read-column-value [conn model ^ResultSet rset i]
(let [i (int i)
rsmeta (.getMetaData rset)
thunk (read-column-thunk conn model rset rsmeta i)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is actually somewhat important here to calculate these thunks just once for the entire result set instead of once per value. The overhead of calling the read-column-thunk multimethod on every single value in the results can get pretty high.

This whole read-column-thunk idea came about when I did something similar in Metabase. Some of these read-column-thunk methods have to look at ResultSetMetaData to determine how to read a column out or make other decisions like that that only need to happen once per ResultSet rather than once per row. Here's an example with Postgres in Metabase:

https://github.com/metabase/metabase/blob/f49407fcbdf18f40e247437f41b3bbefd1e227bf/src/metabase/driver/postgres.clj#L795-L803

Both timestamp and timestamp with time zone come back as java.sql.Types/TIMESTAMP so you need to look at the ResultSetMetaData to determine what the actual column type is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your intent and agree that it would be beneficial. The problem is, if I'm reading the code correctly, the process that you describe is not actually cached right now. Obtaining metadata and calling read-column-thunk happens here https://github.com/camsaul/toucan2/blob/master/src/toucan2/jdbc/read.clj#L134, inside the cached lambda, and so it happens for each row.

I can try to rewrite this code so that it is properly cached and reused.

Copy link
Owner

@camsaul camsaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating result-set-thunk once per ResultSet instead of once per row is actually an important optimization

@alexander-yakushev
Copy link
Contributor Author

The idea is that if you get the value of say id twice then the first time does (.getInteger rset 1) or whatever and caches it and then the second time returns the cached value instead of fetching that value from the ResultSet again. But maybe this is not really important or the overhead of the caching stuff simply is not worth it

After benchmarking various Metabase endpoints that interact with the database, I strongly suspect that the caching overhead outweighs the benefit. ResultSet getters like .getInteger and .getString are lightweight enough so that looking up cached values in the hashmap are already not faster than accessing ResultSet directly. But the caching machinery that surrounds it is quite expensive to run.

I will make a separate PR that removes the value caching completely. Let's see how it goes.

@alexander-yakushev
Copy link
Contributor Author

Closing this in favor of #183 and #189.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants