
Conversation


@mkleczek mkleczek commented Oct 10, 2025

Fixes #4360
Fixes #3704

  • Update changelog

@mkleczek (Contributor Author)

Wow, it looks like it is too fast now :) - the test_big_schema assertion that schema reloading takes more than 10s now fails quite dramatically.

@wolfgangwalther (Member)

Haha, that's nice. We should do some kind of consistency check as well, that the schema cache still returns all the same things. But... looks great so far. It's exactly the kind of problem I assumed to exist with these relationships...

"plan"
]
assert plan_dur > 10000.0
assert plan_dur < 10000.0
Member

hrhr, nice change :D

Member

While the change is nice, it does break the test expectation. The plan duration itself is not the primary test outcome - we are testing whether other requests wait on the schema cache reload here. So a high plan duration is a requirement to make the test effective.

We will either need to throw the test away, rewrite it, or... I don't know.

Contributor Author

Got it - I've taken a look at the test and understand its purpose now.

Testing this by measuring response time is brittle, to put it politely.

Without deeper changes to the way things work, I don't see a way to keep this test.

One way would be to keep the timestamp of the last schema load and provide it in a response header. But that wouldn't really let us test that concurrent requests wait for the schema cache to load.
I tend to think this kind of property should be tested with some kind of chaos testing or fuzzing that only provides statistical guarantees about the system's behavior.

I'm not sure what to do in this PR, though. The sensible thing would be to simply comment out this test for now.

WDYT?

Member

I'd certainly not comment it out. Either we can fix it or we should just remove it - we can always add it back. But commented-out code is just waste.

I'll defer to @steve-chavez for this, though.

Member

Testing this by measuring response time is brittle, to put it politely.

Yeah, it was the easiest way to prove that at the time. It's more desirable to have requests wait than to flood the server logs with errors or reply quickly with a failure (users just see PostgREST as unreliable in those cases).

One way would be to keep the timestamp of the last schema load and provide it in a response header. But that wouldn't really let us test that concurrent requests wait for the schema cache to load.
I tend to think this kind of property should be tested with some kind of chaos testing or fuzzing that only provides statistical guarantees about the system's behavior.

Another option would be to inject a sleep like we do here:

_ <-
let sleepCall = SQL.Statement "select pg_sleep($1 / 1000.0)" (param HE.int4) HD.noResult prepared in
whenJust configInternalSCSleep (`SQL.statement` sleepCall) -- only used for testing

But only for the relationships-loading part - currently the above sleep applies to the whole schema cache load.

Then this test could be removed from test/io/test_big_schema.py too.

Contributor Author

@steve-chavez - yeah, that makes the test more robust (it no longer depends on sluggish schema loading performance).

Done.

Introduced internal-schema-cache-relationship-load-sleep and implemented delayed loading of relationships (see ec31bdd)

I left the tests in test_big_schema.py - moving them to the IO tests is probably a good idea, but I wouldn't do that as part of this PR; it is getting out of hand anyway.

WDYT?

@mkleczek (Contributor Author) Oct 16, 2025

This was trickier than I thought. In the end I've introduced three internal configuration properties that add delays in various phases of schema cache loading.
See f8ff7c8
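
As a rough illustration of the approach (a minimal sketch with made-up names, using a plain threadDelay instead of the pg_sleep statement pattern quoted earlier; the real property names and wiring are in f8ff7c8):

import Control.Concurrent (threadDelay)
import Data.Foldable (for_)

-- hypothetical grouping of the three internal, testing-only delay settings
data SchemaCacheDelays = SchemaCacheDelays
  { delayBeforeRelsLoad :: Maybe Int -- milliseconds, before building relationships
  , delayBeforeViewKeys :: Maybe Int -- milliseconds, before resolving view key dependencies
  , delayBeforeSwap     :: Maybe Int -- milliseconds, before publishing the new cache
  }

-- only used for testing: pause if the corresponding internal setting is present
sleepIfConfigured :: Maybe Int -> IO ()
sleepIfConfigured ms = for_ ms (threadDelay . (* 1000))

main :: IO ()
main = sleepIfConfigured (delayBeforeRelsLoad (SchemaCacheDelays (Just 50) Nothing Nothing))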

Member

@mkleczek I think those are great! 🔥

How about separating them into another PR? Looks like they can be merged independently.

I would also suggest prefixing these commits with test: instead of refactor:


@mkleczek mkleczek force-pushed the index-rels-in-addm2mrels branch from 6f03a38 to 9c9e641 Compare October 10, 2025 20:03
@mkleczek (Contributor Author)

Haha, that's nice. We should do some kind of consistency check as well, that the schema cache still returns all the same things. But... looks great so far. It's exactly the kind of problem I assumed to exist with these relationships...

I am wondering what test we could add... This is a kind of refactoring that does not change any behavior (except performance) and should be covered by existing tests. WDYT?


wolfgangwalther commented Oct 10, 2025

I wouldn't add an automated test for it, but it should be simple to load the big schema fixtures from that IO test and then run --dump-schema on it before and after this change, then look at the diff. It might need to be piped through jq to sort keys or so, and it might still end up with a diff based on some ordering, not sure. But that should at least tell us whether... just all relationships are missing or so (which is extremely unlikely).

@steve-chavez (Member)

I think @MHC2000 will be happy about this

I can confirm the schema privately shared on #3704 (comment)

Goes from ~7 minutes:

06/Oct/2025:16:06:50 -0500: Schema cache queried in 349.6 milliseconds
06/Oct/2025:16:06:50 -0500: Schema cache loaded 919 Relations, 3312 Relationships, 249 Functions, 0 Domain Representations, 4 Media Type Handlers, 1196 Timezones
06/Oct/2025:16:13:43 -0500: Schema cache loaded in 413123.5 milliseconds

To ~14 seconds:

10/Oct/2025:16:02:15 -0500: Schema cache queried in 416.6 milliseconds
10/Oct/2025:16:02:15 -0500: Schema cache loaded 919 Relations, 3312 Relationships, 249 Functions, 0 Domain Representations, 4 Media Type Handlers, 1196 Timezones
10/Oct/2025:16:02:29 -0500: Schema cache loaded in 14115.5 milliseconds

@mkleczek mkleczek force-pushed the index-rels-in-addm2mrels branch 2 times, most recently from 9c9e641 to deeae82 Compare October 11, 2025 05:55
@mkleczek mkleczek changed the title from "perf: Index relations in addM2MRels to change complexity from O(n*n) to O(n)" to "perf: Index various lists in SchemaCache to change complexity from O(n*n) to O(n)" Oct 11, 2025
@mkleczek (Contributor Author)

I've updated the patch to cover more cases:

  • addViewM2OAndO2ORels
  • addViewPrimaryKeys

@mkleczek (Contributor Author)

I think @MHC2000 will be happy about this

I can confirm the schema privately shared on #3704 (comment)

Goes from ~7 minutes:
[...]
To ~14 seconds:

@steve-chavez can you check again with the latest version of the patch? It should improve the times even more.

@mkleczek mkleczek force-pushed the index-rels-in-addm2mrels branch from be211a9 to e9fcffe Compare October 11, 2025 11:43
@steve-chavez (Member)

@mkleczek It's now down to ~2 seconds 🚀

11/Oct/2025:07:02:14 -0500: Schema cache queried in 448.0 milliseconds
11/Oct/2025:07:02:14 -0500: Schema cache loaded 919 Relations, 3312 Relationships, 249 Functions, 0 Domain Representations, 4 Media Type Handlers, 1196 Timezones
11/Oct/2025:07:02:16 -0500: Schema cache loaded in 2447.1 milliseconds


mkleczek commented Oct 11, 2025

@mkleczek It's now down to ~2 seconds 🚀

Amazing, thanks for checking.

I guess it is in a mergeable state - patch code coverage is somewhat low, but I think the changes touched code that was not covered by tests originally. That should probably be fixed, but I am not sure this PR is the right place for it.
(Interestingly, overall code coverage is better than before.)

@mkleczek mkleczek force-pushed the index-rels-in-addm2mrels branch from e9fcffe to f22151f Compare October 12, 2025 06:09
Comment on lines 557 to 560
-filter (\(ViewKeyDependency _ viewQi _ dep _) -> dep == PKDep && viewQi == QualifiedIdentifier sch vw) keyDeps
+fold $ HM.lookup (PKDep, QualifiedIdentifier sch vw) indexedDeps
Member

The only diff in the schema-cache I get is here. The big schema has many of these:

@@ -858435,8 +858435,8 @@
         "tableIsView": true,
         "tableName": "v_pop_ohnekoord",
         "tablePKCols": [
-          "ap_id",
-          "id"
+          "id",
+          "ap_id"
         ],
         "tableSchema": "apflora",
         "tableUpdatable": false

Aka, the order of tablePKCols is changed.

I'm not exactly sure whether we rely on this order anywhere.

Contributor Author

I looked at the source of HM.fromListWith op and indeed: for duplicate keys it calls newValue `op` existingValue, so it reverses the list order.

Not sure if we depend on this order anyway, but I will add the list reversal back just in case.
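
To make that concrete, a standalone toy example (not PostgREST code) showing why the extra reverse restores the original order:

import qualified Data.HashMap.Strict as HM

main :: IO ()
main = do
  -- two key dependencies for the same (dep-type, view) key, in source order
  let deps    = [(("pk", "v_pop"), ["ap_id"]), (("pk", "v_pop"), ["id"])]
      grouped = HM.fromListWith (++) deps
  print (HM.lookup ("pk", "v_pop") grouped)                -- Just ["id","ap_id"]: order reversed
  print (HM.lookup ("pk", "v_pop") (fmap reverse grouped)) -- Just ["ap_id","id"]: original order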

Contributor Author

@wolfgangwalther I added fmap reverse on line 565 - that should fix this issue.

Can you re-check? (Or you could show me a quick way to generate this diff myself.)

Member

The command I used was:

PGRST_DB_SCHEMAS=apflora postgrest-with-postgresql-17 -f test/io/big_schema.sql postgrest-run --dump-schema | jq

Pipe this to a file on main, then do the same on your branch, then diff the two.

@mkleczek (Contributor Author) Oct 13, 2025

Thanks, @wolfgangwalther - I ran it after 39b404e and it fixes the issue: no more differences between main and this branch.

Comment on lines -480 to +483
-viewRels Relationship{relTable,relForeignTable,relCardinality=card} =
-  if isM2O card || isO2O card then
+viewRels Relationship{relTable,relForeignTable,relCardinality=card} | isM2O card || isO2O card =
Member

This seems like an unrelated refactor.

@mkleczek (Contributor Author) Oct 13, 2025

Indeed.
The previous version required returning an empty list in multiple places (one for the else branch of the if and another for the fall-through pattern match case). Fixing it was too tempting to resist :)

Do you think it is worth doing it in a separate PR?

Member

No need for a separate PR, but a separate commit would be good - that will make things much easier to understand whenever we look at this again in months or so.


else Nothing
| Relationship jt1 t _ (M2O cons1 cols) _ tblIsView <- rels
, Relationship jt2 ft _ (M2O cons2 fcols) _ fTblisView <- rels
, jt1 == jt2
Member

I don't understand the removal of jt1 == jt2 yet. Is this another unrelated refactor, or related to the change here?

@mkleczek (Contributor Author) Oct 13, 2025

It is not needed because we look up in the hash map using jt1 as the key (i.e. we changed filtering by equality to a hash-map lookup) - that's the crux of this PR.
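
To spell the idea out with a self-contained toy example (illustrative names, not the actual addM2MRels code): instead of re-filtering the whole relationship list for every element, index it once by junction table and look matches up by key.

import qualified Data.HashMap.Strict as HM

data Rel = Rel { junction :: String, target :: String } deriving Show

-- O(n*n): for every rel, scan the whole list again, filtering by junction-table equality
pairsByFilter :: [Rel] -> [(Rel, Rel)]
pairsByFilter rels =
  [ (r1, r2) | r1 <- rels, r2 <- filter (\r -> junction r == junction r1) rels ]

-- O(n): group the rels by junction table once, then each rel does an O(1) lookup
pairsByLookup :: [Rel] -> [(Rel, Rel)]
pairsByLookup rels =
  [ (r1, r2) | r1 <- rels, r2 <- HM.lookupDefault [] (junction r1) indexed ]
  where indexed = HM.fromListWith (++) [ (junction r, [r]) | r <- rels ]

main :: IO ()
main = do
  let rels = [Rel "jt1" "a", Rel "jt1" "b", Rel "jt2" "c"]
  -- both produce the same pairs (up to ordering), but the second avoids the quadratic scan
  print (length (pairsByFilter rels), length (pairsByLookup rels))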

@mkleczek mkleczek force-pushed the index-rels-in-addm2mrels branch 2 times, most recently from cbcd7b8 to bb8423b Compare October 15, 2025 13:57
-- so we don't need to know about the other references.
-- * We need to choose a single reference for each column, otherwise we'd output too many columns in location headers etc.
takeFirstPK = mapMaybe (head . snd)
indexedDeps = fmap reverse $ HM.fromListWith (++) $ fmap ((keyDepType &&& keyDepView) &&& pure) keyDeps
Member

The reverse is quite mysterious, maybe add a comment that it was done to preserve backwards compat?

@mkleczek (Contributor Author) Oct 16, 2025

Good idea. Done.

Member

Before we decide to keep it, can we first figure out whether we actually need it? Do we rely on the order of these or not?

Member

can we first figure out whether we actually need it?

I don't think we can figure that out. Overall, users have been sensitive to OpenAPI changes, so IMO if we can easily maintain backwards compat we should do it.

@wolfgangwalther If you recall #1701, I tried to remove this legacy hack we have (IIRC, it's incorrect since it doesn't consider composite keys):

n = catMaybes
[ Just "Note:"
, if pk then Just "This is a Primary Key.<pk/>" else Nothing
, fk
]

But that broke vue-postgrest so we kept it.

Member

I'm pretty sure it does consider composite keys.

But this brings up an interesting question: does this cause any observable OpenAPI output differences? I have not tested that; I have only tested the schema dump, which is just an internal representation. So my question was not really about whether this would change anything in the OpenAPI output, but whether it would change anything internal.

But yes, testing the OpenAPI output for differences would be a great idea as well.

Contributor Author

I would suggest we leave it like this in this PR. Let's not make perfect the enemy of the good. This PR brings significant performance gains and I think it is worth merging it, and possibly taking care of removing fmap reverse here and updating the tests appropriately in the future.

Member

But yes, testing the OpenAPI output for differences would be a great idea as well.

I compared it against the main branch with both follow-privileges and ignore-privileges settings. It returns the same OpenAPI output in both branches. What's funny is that it even returns the same output when the fmap reverse is removed.

and possibly taking care of removing fmap reverse here and updating the tests appropriately in the future.

So I agree here. Having the same internal schema should be enough, and no queries were modified either, so it's not likely that anything else would change (maybe add a TODO so we don't forget to check in the future?).

@mkleczek mkleczek force-pushed the index-rels-in-addm2mrels branch 3 times, most recently from 138eba9 to f8ff7c8 Compare October 16, 2025 16:54
…n*n) to O(n)

* rels in addM2MRels
* keyDeps in addViewPrimaryKeys
* keyDeps in addViewM2OAndO2ORels
Also changed pattern matching to point-free usage of record function in findViewPKCols so that variable names match between code and comment.
@mkleczek mkleczek force-pushed the index-rels-in-addm2mrels branch from f8ff7c8 to 45ef1a4 Compare October 18, 2025 05:12

mkleczek commented Oct 18, 2025

Hmm... I've rebased the PR on top of main and one test in postgrest-test-memory now fails. I've got no idea what's going on.

Locally:

CI reported all green on the pre-rebase version of this PR.

Looks like there is some flakiness in postgrest-test-memory. @steve-chavez are you able to provide any insights?

@steve-chavez (Member)

Looks like there is some flakiness in postgrest-test-memory. @steve-chavez are you able to provide any insights?

I restarted the CI job and it passed; it's flakiness. Although I've never had all the memory tests failing locally.


Successfully merging this pull request may close these issues:

  • Schema cache reload with large relationship count causes 100+ second API blocking
  • Slow schema cache loading and double caching schema
