perf: Cache dbTables FuzzySet per schema #4472

mkleczek · 2025-11-17T05:44:21Z

Calculation of hint message when requested relation is not present in schema cache requires creation of a FuzzySet (to use fuzzy search to find candidate tables). For schemas with many tables it is costly. This patch introduces dbTablesFuzzyIndex in SchemaCache to memoize the FuzzySet creation.

Related to #4463

steve-chavez · 2025-11-17T19:01:04Z

Definitely looks related to #4463. Have you tested the impact of this PR on a schema with lots of tables?

Looks like we'll need the same test setup for #4462.

steve-chavez · 2025-11-17T23:14:52Z

I think I've managed to repro, with 125K tables:

17/Nov/2025:17:56:40 -0500: Schema cache queried in 5230.2 milliseconds
17/Nov/2025:17:56:40 -0500: Schema cache loaded 125271 Relations, 232 Relationships, 149 Functions, 15 Domain Representations, 45 Media Type Handlers, 1196 Timezones
17/Nov/2025:17:56:40 -0500: Schema cache loaded in 237.2 milliseconds

The server dies when trying to compute the hint:

$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

# correct table names do work
$ curl localhost:3000/projects
[{"id":1,"name":"Windows 7","client_id":1}, 
 {"id":2,"name":"Windows 10","client_id":1}, 
 {"id":3,"name":"IOS","client_id":2}, 
 {"id":4,"name":"OSX","client_id":2}, 
 {"id":5,"name":"Orphan","client_id":null}]$

steve-chavez · 2025-11-17T23:29:02Z

The server dies when trying to compute the hint:
$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

Same result with this PR. I'll add a test case.

mkleczek · 2025-11-17T23:37:19Z

The server dies when trying to compute the hint:
$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

Same result with this PR. I'll add a test case.

Yeah... this PR won't help with the first hint computation (FuzzySet is created anyway). It would only help with subsequent hints (without this PR FuzzySet is created each time).

Seems like FuzzySet implementation is memory hungry.

Also - it depends on the number of schemas as with this PR there is a separate FuzzySet per schema created (doesn't help if all tables are in a single schema).

mkleczek · 2025-11-18T07:41:55Z

@steve-chavez
I've spent a little bit of time researching the problem and it seems we do not have any good options for databases with a large number of relations.

My suggestion would be to either:

Get rid of fuzzy relation search altogether.
Only search when the number of relations is small enough (we need to come up with the right number).
Have a configuration property to turn on/off fuzzy search.
Hit the database with dummy SELECT 1 FROM non_existing_table LIMIT 0 or similar and retrieve information from the reported error (if there is no error that means the schema cache is stale).

Aside:
To be honest I have a gut feeling we do something wrong here trying to re-implement catalog search in PostgREST - this is the role of the DBMS. I am not even sure if #3869 is the right direction as ideally we should be doing just syntactical transformations of requests to queries (and if possible not have the schema cache a all). I can even imagine the implementation that delegates query construction to the database itself (ie. have SQL generating SQL) and only caching them for subsequent executions.

OTOH I understand that if we already have the schema cache then let's use it - trade-offs everywhere :)

wolfgangwalther · 2025-11-18T09:42:40Z

Get rid of fuzzy relation search altogether.

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

steve-chavez · 2025-11-18T16:43:59Z

Get rid of fuzzy relation search altogether.

From my experience maintaining a postgREST-as-a-service, before the fuzzy search hint there were lots of support tickets that came from typos but users were quick to to blame postgREST and suspect a malfunction. It was a huge time sink. After the fuzzy search, those complaints just stopped.

Nowadays with typescript clients and other typed clients we have in the ecosystem, maybe this is not such a big problem but I'd still argue have the fuzzy search is a net win for 99% use cases.

Only search when the number of relations is small enough (we need to come up with the right number).

This looks like the viable short-term solution. Most use cases don't have that many tables too.

I am not even sure if #3869 is the right direction as ideally we should be doing just syntactical transformations of requests to queries (and if possible not have the schema cache a all)

IIRC, #3869 was mostly for better error messages, but we have other features that might also reuse it. To not use the schema cache we'd have to implement resource embedding and other features in PostgreSQL itself (possible but would take a loot of time).

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

Could we instead vendor the dependency? 🤔 (see laserpants/fuzzyset-haskell#9)

steve-chavez · 2025-11-18T17:30:25Z

$ curl localhost:3000/xprojects
curl: (52) Empty reply from server

Besides the above, I'm also getting an "empty reply from server" on / (openAPI). I guess the empty replies happens because of the space leak protection #2122?

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

Looking at the algorithms used as inspiration on laserpants/fuzzyset-haskell#2 (comment), I wonder if it's optimized for our use case. Maybe we can come up with a better library? (cc @taimoorzaeem)

mkleczek · 2025-11-18T17:32:56Z

Get rid of fuzzy relation search altogether.

From my experience maintaining a postgREST-as-a-service, before the fuzzy search hint there were lots of support tickets that came from typos but users were quick to to blame postgREST and suspect a malfunction. It was a huge time sink. After the fuzzy search, those complaints just stopped.

Makes sense.

[...]

+1 for this also from a dependency perspective: We have to pin an older version of the fuzzy-thing dependency, because a newer version causes even more regression. I'd like to get rid of that dependency, ideally.

Could we instead vendor the dependency? 🤔 (see laserpants/fuzzyset-haskell#9)

I think it makes sense to merge this PR regardless of these decisions as it is a quite obvious optimization (even though it does not fix the OOM issue).
@steve-chavez @wolfgangwalther WDYT?

steve-chavez · 2025-11-18T18:01:20Z

I think it makes sense to merge this PR regardless of these decisions as it is a quite obvious optimization (even though it does not fix the OOM issue).
@steve-chavez @wolfgangwalther WDYT?

I don't think we should merge without a test that proves what's being improved (since it doesn't solve #4463).

taimoorzaeem · 2025-11-18T18:16:47Z

Looking at the algorithms used as inspiration on laserpants/fuzzyset-haskell#2 (comment), I wonder if it's optimized for our use case. Maybe we can come up with a better library? (cc @taimoorzaeem)

I will be happy to write a library 👍 . I would just need to do some feasibility, maybe a better library already exist on hackage? Needs some investigation.

mkleczek · 2025-11-18T18:41:37Z

I think it makes sense to merge this PR regardless of these decisions as it is a quite obvious optimization (even though it does not fix the OOM issue).
@steve-chavez @wolfgangwalther WDYT?

I don't think we should merge without a test that proves what's being improved (since it doesn't solve #4463).

Done.
Updated mixed load test to include requests for non-existing relation and the fixture to create 1000 tables.

See: https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508

mkleczek · 2025-11-18T18:49:11Z

Looking at the algorithms used as inspiration on laserpants/fuzzyset-haskell#2 (comment), I wonder if it's optimized for our use case. Maybe we can come up with a better library? (cc @taimoorzaeem)

I will be happy to write a library 👍 . I would just need to do some feasibility, maybe a better library already exist on hackage? Needs some investigation.

As I mentioned earlier - I've done some research and I couldn't find any interesting alternatives. Fuzzy search algorithms are divided into two subgroups: online and offline.

Online algorithms do not require additional memory but require scanning the whole list (so are O(n) at least).

Offline algorithms require creating an index. Best results are achieved using indexes based on n-grams - this is exactly what FuzzySet we use is. The problem with n-grams based index is that it is big (for each word we keep multiple entries: one per n-gram found in the word).

I am skeptical we can come up with a solution to 125k tables...

steve-chavez · 2025-11-19T16:14:45Z

Done.
Updated mixed load test to include requests for non-existing relation and the fixture to create 1000 tables.
See: https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508

@mkleczek We should not modify the mixed loadtest for this, that should stay the same across versions so we can see regressions. Additionally, that doesn't prove there's an enhancement here.

Test should be like:

A new dedicated fixture (like we have fortest/io/replica.sql).
A test that triggers a not found error
The test should measure there's a speed up via Server-Timing or other means (for this it might help to increase the number of tables to a big number)

mkleczek · 2025-11-19T16:40:18Z

Done.
Updated mixed load test to include requests for non-existing relation and the fixture to create 1000 tables.
See: https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508

@mkleczek We should not modify the mixed loadtest for this, that should stay the same across versions so we can see regressions.

I am not sure I understand this.

The old mixed loadtest does not account for this scenario, and without it it is not a good regression test (the proof is that we didn't catch performance regression introduced by #3869 )

So I would say adding the "table not found" scenario to it is an important addition - it makes it more realistic and covering wider set of scenarios.

Additionally, that doesn't prove there's an enhancement here.

https://github.com/PostgREST/postgrest/actions/runs/19477317041/attempts/1#summary-55740304508
shows more than 20% increase in throughput (108 vs 82 req/s) and around 10% less CPU usage comparing to main and 14.

Am I missing something here?

Test should be like:

A new dedicated fixture (like we have fortest/io/replica.sql).

A test that triggers a not found error

The test should measure there's a speed up via Server-Timing or other means (for this it might help to increase the number of tables to a big number)

Hmm...
Do we run io tests against different branches (similar to load tests)?
If not then I don't really understand how we could show improvement vs main or 14.

I thought load tests (not io tests) are supposed to test performance?

@steve-chavez could you advice on the way you would like it to be tested?

steve-chavez · 2025-11-19T22:30:23Z

I am not sure I understand this.
The old mixed loadtest does not account for this scenario, and without it it is not a good regression test (the proof is that we didn't catch performance regression introduced by #3869 )

The problem is that "mixed" is already conflating too many scenarios (see #4123), with this addition is even worse as we also conflate "successful traffic" with "error traffic" (note that now we have errors.0: 404 Not Found in the loadtest results).

Another problem is that loadtests are slow, so we shouldn't add too many of them (IMO a dedicated one for errors doesn't seem worth it) if we can instead have an io test.

Hmm...
Do we run io tests against different branches (similar to load tests)?
If not then I don't really understand how we could show improvement vs main or 14.
I thought load tests (not io tests) are supposed to test performance?

We do have perf related tests on:

postgrest/test/io/test_big_schema.py

Lines 10 to 36 in 91abcd4

    
           def test_schema_cache_load_max_duration(defaultenv): 
        
               "schema cache load should not surpass a max_duration of elapsed milliseconds" 
        
               max_duration = 500.0 
        
               env = { 
        
                   **defaultenv, 
        
                   "PGRST_DB_SCHEMAS": "apflora", 
        
                   "PGRST_DB_POOL": "2", 
        
                   "PGRST_DB_ANON_ROLE": "postgrest_test_anonymous", 
        
               } 
        
               with run(env=env, wait_max_seconds=30, no_startup_stdout=False) as postgrest: 
        
                   log_lines = postgrest.read_stdout(nlines=50) 
        
                   schema_cache_lines = [ 
        
                       line for line in log_lines if "Schema cache loaded in" in line 
        
                   ] 
        
                   match = re.search( 
        
                       r"Schema cache loaded in ([0-9]+(?:\.[0-9])?) milliseconds", 
        
                       schema_cache_lines[-1], 
        
                   ) 
        
                   assert match, f"unexpected log format: {schema_cache_lines[-1]}" 
        
                   duration_ms = float(match.group(1)) 
        
                   assert duration_ms < max_duration

It looks enough to have an upper bound of expected time for computing the hint to not have regressions.

@steve-chavez could you advice on the way you would like it to be tested?

I'd suggest adding the fixtures on test/io/test_big_schema.py too, it seems fit. Roughly following the steps I mentioned above, I'm thinking measuring Server-Timing should be enough. You could also add this test on another PR, to see the current server-timing numbers and then updating the numbers on this PR, to confirm the improvement.

mkleczek · 2025-11-20T16:24:29Z

@steve-chavez could you advice on the way you would like it to be tested?

I'd suggest adding the fixtures on test/io/test_big_schema.py too, it seems fit. Roughly following the steps I mentioned above, I'm thinking measuring Server-Timing should be enough. You could also add this test on another PR, to see the current server-timing numbers and then updating the numbers on this PR, to confirm the improvement.

@steve-chavez
Done. (see test_second_request_for_non_existent_table_should_be_quick in test_big_schema.py)

test/io/test_big_schema.py

steve-chavez · 2025-11-20T23:49:14Z

test/io/test_big_schema.py

+
+
+def test_second_request_for_non_existent_table_should_be_quick(defaultenv):
+    "requesting a non-existent relationship the second time should be quick"


Q: Why is the second time quicker? Does the fuzzy index get populated after hitting an error?

From the code, it looks like the fuzzy index is only populated when the schema cache is built.

Q: Why is the second time quicker? Does the fuzzy index get populated after hitting an error?

From the code, it looks like the fuzzy index is only populated when the schema cache is built.

That's because of Haskell laziness.

Ok, then to avoid confusion I suggest:

Suggested change

"requesting a non-existent relationship the second time should be quick"

"requesting a non-existent relationship should be quick after the schema cache is loaded (2nd request)"

On second thought, not sure if I understand this.

When we do a request with resource embedding the schema cache is already loaded and works on the first request. Why does this change for the fuzzy index and we need to make 2 requests?

Maybe what we need is some comments about this optimization, I suggest adding those on the type definition #4472 (comment)

On second thought, not sure if I understand this.

When we do a request with resource embedding the schema cache is already loaded and works on the first request. Why does this change for the fuzzy index and we need to make 2 requests?

This is indeed tricky: schema cache is loaded lazily as well. But there are these two lines in AppState.retryingSchemaCacheLoad:

(t, _) <- timeItT $ observer $ SchemaCacheSummaryObs $ showSummary sCache observer $ SchemaCacheLoadedObs t

which cause evaluation of SchemaCache fields.

I've decided not to add dbTablesFuzzyIndex to schema cache summary and leave its evaluation till first use.

Introduced the type alias, added comments to SchemaCache field and updated the test description.

src/PostgREST/Error.hs

Calculation of hint message when requested relation is not present in schema cache requires creation of a FuzzySet (to use fuzzy search to find candidate tables). For schemas with many tables it is costly. This patch introduces dbTablesFuzzyIndex in SchemaCache to memoize the FuzzySet creation.

taimoorzaeem · 2025-11-23T09:31:30Z

test/io/test_big_schema.py

+    with run(env=env, wait_max_seconds=30) as postgrest:
+        response = postgrest.session.get("/unknown-table")
+        assert response.status_code == 404
+        data = response.json()
+        assert data["code"] == "PGRST205"
+        first_duration = response.elapsed.total_seconds()
+        response = postgrest.session.get("/unknown-table")
+        assert response.elapsed.total_seconds() < first_duration / 20


I noticed something strange on my system. When I run the test with get endpoint changed to /table-with-a-weird-name, the test fails with an exception.

FAILED test/io/test_big_schema.py::test_second_request_for_non_existent_table_should_be_quick - requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected( - 'Remote end closed connection without response'))

@taimoorzaeem I'm afraid the same thing happens when running this test against main.

This is expected as this PR only makes sure the FuzzySet is created once instead of every time hint is calculated. Hint calculation logic and the data structures stay the same.

IMHO it looks like the library we use for fuzzy search has reliability issues and we should look for other solutions.

@taimoorzaeem I found https://hackage-content.haskell.org/package/fuzzily-0.2.1.0/docs/Text-Fuzzily.html and https://hackage.haskell.org/package/fuzzyfind-3.0.2/docs/Text-FuzzyFind.html but they both implement online fuzzy search. For us means that hint calculation time is at least linear in the number of tables (which I don't think is a good idea as it will almost certainly fail with timeout for large schemas).

I think the most scalable solution would be to implement the SymSpell algorithm. See: ref1, ref2, ref3.

I think a haskell package for this would be great for the entire haskell community.

I think the most scalable solution would be to implement the SymSpell algorithm. See: ref1, ref2, ref3.

The problem with SymSpell is that it is memory hungry (ie. for 100,000 words and edit distance of 2 it requires 1,500,000 entries in the dictionary). Building the index is also very costly.
You can minimize memory requirements with perfect hashing but it makes building the index even more costly.

In general, this is a tough problem. It is even tougher if the spelling dictionary is dynamic (relation names in the schema cache can be reloaded and require index rebuilding).

I think a haskell package for this would be great for the entire haskell community.

Undoubtedly. You can read about SOTA techniques for example here: https://towardsdatascience.com/spelling-correction-how-to-make-an-accurate-and-fast-corrector-dc6d0bcbba5f/
Implementing this in Haskell would be a very interesting task.

Having said that, when you think about the whole architecture from the high level, PostgREST does not seem to be the right place to implement in-memory text search engine. Especially that it is a component that is supposed to be scaled easily (ie. start new instances quickly). In such scenarios you want to externalize state, ie. have a separate component implementing text search algorithms. But you already have such a component: PostgreSQL itself!

IMHO these are possible paths for PostgREST:

Stay with current architecture (ie. query building based on in-memory schema cache outside of the database) and implement limited but cheap spell checking.

Move query building to the database itself and use database facilities to handle spell checking (somewhat not a viable solution as it would be a complete rewrite).

Stay with current architecture and externalize spell checking

by invoking external system when calculating hint (there are multiple questions here: should it be some specialized search engine - we don't want to have such a dependency. should it be PostgreSQL - then see below)

by letting PostgreSQL handle misspelled relations (that in essence means reverting fix: handle queries on non-existing table gracefully #3869)

It just seems to me we've just hit the wall with our current architecture here and the only thing we can do is to admit it and live with deficiency in spell checking or re-architect and rewrite PostgREST.

One mitigation would be to work on fuzzyset library and try to optimize it as much as possible (I've taken a quick look at the source and found some minor optimization opportunities). The question is really: what schema sizes are "normal" for PostgREST and what sizes are outside of what PostgREST supports?

@taimoorzaeem @steve-chavez does the above make sense?

Great analysis Michal! Many options to go from here, but I think the first thing we need to do is to have reproducible production level testing setup, only then we can take a sure fire decision.

mkleczek force-pushed the dbtables-fuzzyset-cache branch from 3149071 to a1b92a6 Compare November 17, 2025 06:06

mkleczek mentioned this pull request Nov 17, 2025

Revert "fix: handle queries on non-existing table gracefully " #4468

Draft

mkleczek force-pushed the dbtables-fuzzyset-cache branch from a1b92a6 to 20b391f Compare November 18, 2025 18:39

mkleczek force-pushed the dbtables-fuzzyset-cache branch from 20b391f to 2057eaa Compare November 19, 2025 05:20

mkleczek force-pushed the dbtables-fuzzyset-cache branch 2 times, most recently from ff6d3eb to c2e1d94 Compare November 20, 2025 16:22

steve-chavez reviewed Nov 20, 2025

View reviewed changes

test/io/test_big_schema.py Outdated Show resolved Hide resolved

steve-chavez reviewed Nov 20, 2025

View reviewed changes

mkleczek force-pushed the dbtables-fuzzyset-cache branch from c2e1d94 to ee52b3f Compare November 21, 2025 05:28

steve-chavez reviewed Nov 21, 2025

View reviewed changes

src/PostgREST/Error.hs Outdated Show resolved Hide resolved

mkleczek force-pushed the dbtables-fuzzyset-cache branch 2 times, most recently from f8d5cf5 to 6226416 Compare November 21, 2025 19:43

taimoorzaeem reviewed Nov 22, 2025

View reviewed changes

src/PostgREST/Error.hs Outdated Show resolved Hide resolved

mkleczek force-pushed the dbtables-fuzzyset-cache branch from 6226416 to ccd9664 Compare November 22, 2025 09:23

mkleczek force-pushed the dbtables-fuzzyset-cache branch from ccd9664 to 465d076 Compare November 22, 2025 09:23

taimoorzaeem reviewed Nov 23, 2025

View reviewed changes

mkleczek mentioned this pull request Nov 24, 2025

Add configuration to enable turning off spell checking of relation and function names #4499

Open



		def test_second_request_for_non_existent_table_should_be_quick(defaultenv):
		"requesting a non-existent relationship the second time should be quick"

Uh oh!

perf: Cache dbTables FuzzySet per schema #4472

Are you sure you want to change the base?

perf: Cache dbTables FuzzySet per schema #4472

Conversation

mkleczek commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

steve-chavez commented Nov 17, 2025

Uh oh!

steve-chavez commented Nov 17, 2025

Uh oh!

steve-chavez commented Nov 17, 2025

Uh oh!

mkleczek commented Nov 17, 2025

Uh oh!

mkleczek commented Nov 18, 2025

Uh oh!

wolfgangwalther commented Nov 18, 2025

Uh oh!

steve-chavez commented Nov 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

steve-chavez commented Nov 18, 2025

Uh oh!

mkleczek commented Nov 18, 2025

Uh oh!

steve-chavez commented Nov 18, 2025

Uh oh!

taimoorzaeem commented Nov 18, 2025

Uh oh!

mkleczek commented Nov 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkleczek commented Nov 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

steve-chavez commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkleczek commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

steve-chavez commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkleczek commented Nov 20, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkleczek Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkleczek Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkleczek Nov 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkleczek Nov 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

mkleczek commented Nov 17, 2025 •

edited

Loading

steve-chavez commented Nov 18, 2025 •

edited

Loading

mkleczek commented Nov 18, 2025 •

edited

Loading

mkleczek commented Nov 18, 2025 •

edited

Loading

steve-chavez commented Nov 19, 2025 •

edited

Loading

mkleczek commented Nov 19, 2025 •

edited

Loading

steve-chavez commented Nov 19, 2025 •

edited

Loading

mkleczek Nov 21, 2025 •

edited

Loading

mkleczek Nov 21, 2025 •

edited

Loading

mkleczek Nov 23, 2025 •

edited

Loading

mkleczek Nov 23, 2025 •

edited

Loading

mkleczek Nov 24, 2025 •

edited

Loading