Replies: 7 comments 16 replies
-
Anecdata:
@KyleAMathews, @samwillis, @quolpr -- have you run into any performance walls with SQLite in the browser once you have a sizeable dataset? It isn't horrible, but it doesn't seem fast enough to drive a reactive UI once tables reach 10k-100k+ rows, even when indexed.
-
Linking @AlexErrant's perf problems with SQLite: vlcn-io/js#27. @AlexErrant - do you have any other anecdotal or hard data where you've hit performance walls with SQLite in the browser?
-
Some more anecdata -- in-memory SQLite vs. JS memory: https://youtu.be/I6n65YlBWfA
-
An interesting comparison would be how long the same actions (seeding, limit, filter, etc.) take with native SQLite (in-memory + on-disk). If those are in fact significantly faster than in the browser, then perhaps the WASM build is missing some optimization.
-
I need to do some more benchmarking on querying large datasets, but I can comment on the impact of foreign keys.

We have found that doing large inserts (NNk rows in one transaction) while enforcing foreign key relations causes up to a 100x slowdown. If you can be sure that your insert does not contain invalid foreign keys, temporarily disabling enforcement (PRAGMA foreign_keys = OFF) avoids that cost.

@tantaman, the seeding demo - is it seeding multiple tables, i.e. issues and comments?

On a somewhat related note, you can also get a ~10% speedup on large inserts by sorting the rows in JS by the foreign key value before the insert. I suspect this helps with building indexes, but that's a guess at this point.
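A minimal sketch of the two tricks above (disable FK enforcement during a trusted bulk load, and pre-sort rows by foreign key). It uses Python's stdlib sqlite3 purely to illustrate the SQLite behavior discussed; the table names (`issue`, `comment`) and row counts are invented for the example, not taken from the seeding demo.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Python's sqlite3 defaults foreign_keys to OFF; turn it on to mirror an
# app that normally enforces relations.
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
    CREATE TABLE issue (id INTEGER PRIMARY KEY);
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY,
        issue_id INTEGER REFERENCES issue(id)
    );
""")
with db:  # commit parent rows so the PRAGMA below runs outside a transaction
    db.executemany("INSERT INTO issue (id) VALUES (?)",
                   [(i,) for i in range(100)])

rows = [(i, i % 100) for i in range(10_000)]

# 1) We trust these rows, so skip FK checks for the duration of the load.
#    (PRAGMA foreign_keys is a no-op inside an open transaction, so this
#    must happen between transactions.)
db.execute("PRAGMA foreign_keys = OFF")

# 2) Sort by the foreign-key column first -- the ~10% gain reported above,
#    plausibly from a more index-friendly insert order.
rows.sort(key=lambda r: r[1])

with db:  # one transaction for the whole batch, as in the thread
    db.executemany("INSERT INTO comment (id, issue_id) VALUES (?, ?)", rows)

db.execute("PRAGMA foreign_keys = ON")
print(db.execute("SELECT COUNT(*) FROM comment").fetchone()[0])  # 10000
```

The same pattern applies verbatim in the WASM builds, since the pragma and transaction semantics are SQLite's, not the host language's.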
-
Here's an Observable notebook that does some micro-benchmarking of SQLite reads/writes against vanilla in-memory interactions: https://observablehq.com/@tantaman/in-memory-sqlite-perf
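In the same spirit as that notebook (which benchmarks WASM SQLite against plain JS structures), here is a rough stand-in using Python's stdlib sqlite3: the same aggregate computed via SQL and via a plain in-memory list. The dataset size and query are arbitrary choices for illustration.

```python
import sqlite3
import time

N = 50_000
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
with db:
    db.executemany("INSERT INTO t VALUES (?, ?)",
                   ((i, i * 2) for i in range(N)))

# Aggregate through SQLite...
start = time.perf_counter()
sql_total = db.execute("SELECT SUM(v) FROM t WHERE v % 3 = 0").fetchone()[0]
sql_time = time.perf_counter() - start

# ...and the same aggregate over a vanilla in-memory list.
data = [(i, i * 2) for i in range(N)]
start = time.perf_counter()
mem_total = sum(v for _, v in data if v % 3 == 0)
mem_time = time.perf_counter() - start

assert sql_total == mem_total  # same answer either way
print(f"sqlite: {sql_time:.4f}s, plain list: {mem_time:.4f}s")
```

Absolute numbers will differ from the browser, but the shape of the comparison (per-row VM dispatch vs. a tight native loop over a flat structure) is the same question the notebook probes.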
-
I'm quite interested in the possibility of using materialite as a layer in front of a SQLite database. This could be with it embedded into an ORM, or with a SQL parser that translates (a subset of) SQL into a materialite query. Ideally the workflow would be something like:
I think it would be more than fine to only support a subset of SQL queries and fall back to a full refresh from the database when a query isn't supported. I don't see the performance of SQLite itself being an issue; it's the potential speedup from skipping a full re-query that makes materialite so exciting. With the above workflow you could have very large datasets that you don't want to fully load into memory, while still getting the incremental result updates that materialite enables. Is this something that you think is possible? A couple of things to consider:
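A toy sketch of the proposed split, in Python with stdlib sqlite3: writes always go to SQLite as the source of truth; when the query's predicate is understood, the materialized result is patched incrementally, otherwise it falls back to a full re-query. The class and method names (`LiveQuery`, `apply_write`) are invented for illustration and are not materialite's actual API.

```python
import sqlite3

class LiveQuery:
    """Materialized query result: incremental when possible, else re-query."""

    def __init__(self, db, sql, predicate=None):
        self.db = db
        self.sql = sql              # full-refresh fallback query
        self.predicate = predicate  # row filter if the query is "understood"
        self.rows = list(db.execute(sql))

    def apply_write(self, table, row):
        # SQLite remains the source of truth for every write.
        self.db.execute(f"INSERT INTO {table} VALUES (?, ?)", row)
        if self.predicate is not None:
            # Supported subset: patch the cached result incrementally,
            # skipping the full re-query.
            if self.predicate(row):
                self.rows.append(row)
        else:
            # Unsupported query: fall back to a full refresh.
            self.rows = list(self.db.execute(self.sql))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE item (id INTEGER, qty INTEGER)")
q = LiveQuery(db, "SELECT id, qty FROM item WHERE qty > 5",
              predicate=lambda r: r[1] > 5)
q.apply_write("item", (1, 10))  # matches -> appended incrementally
q.apply_write("item", (2, 3))   # doesn't match -> result unchanged
print(q.rows)  # [(1, 10)]
```

The hard parts the real thing would need (deletes/updates, joins, ordering, and deciding which SQL shapes are incrementally maintainable) are exactly the "things to consider" above; this only shows the insert-only happy path.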
-
This is just a scratch pad of threads to pull on and not a rigorous investigation yet.
The purpose of this is to collect evidence to decide between:
Non performance related considerations (draft): https://github.com/vlcn-io/docs/blob/main/pages/blog/sqlite-isnt-it.mdx
Threads to pull on:
- statements
- path

Related:

Updates:
- 12/27/2023 - Micro benchmarks