-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to integrate DuckDB for text-based search & filtering #71
Comments
There's also some movement here I should mention. Transformations on a working branch are allowed to return promises. Since duckdb-wasm queries are async, that means you could package an Arrow record batch on a tile, send it to the duckDB with arbitrary SQL, and then get a record batch back. |
So this is a little more straightforward now, although the API is still in flux. Here's what the steps would be on this new branch.
[...]
scatterplot._root.transformations['has dog in it'] = async function(tile) {
const output = new Float32Array(tile.record_batch.numRows)
// the column being searched.
const all_rows = tile.record_batch.getChild("full_text")
const all_strings = all_rows.toArray(); // The reason to use duckdb is that this function, deserializing a lot of text from arrow UTF-8 to javascript UTF-16 strings, is *extremely* slow.
let i = 0;'
for (let string of all_strings) {
if (string.match(/dog/) {
output[i] = 1;// Store a match as a float.
}
}
return output // This array will be attached to the record batch at render-time, lazily.
} (It's allowed for the promise not to be async, which is what you'd want in this simple JS regex case. But with duckdb you would ship the array of full_text to the db, and then get back the results of a db query as arrow.) And then actually plot it, which causes the transformation to be run on tiles as they're needed. scatterplot.plotAPI({encoding: {
foreground: {
field: 'has dog in it',
op: 'eq',
a: 1
}}}) |
If anyone's following along at home, David and I got a prototype of this running over the weekend. https://observablehq.com/d/cae8e4a3a8b7d4db The hardest part turned out to be misalignment of Arrow versions among |
Adding my Slack convo with @bmschmidt here for public reference
Ben's general guidance:
What I would do is create a new column that is a float of ‘1’ or ‘0’ in duckdb for every tile in the dataset.
That happens by applying a transformation to the data, but you hopefully won’t need to do it manually.
But there are a couple ways: if the percentage of matches in the dataset is pretty low, you could use add_sparse_identifiers to add a set of matching ids.
You could also manually define a transformation to do it.
add_tiled_column
does something quite similar: it defines a new transformation on the Dataset that consumes a tile. As with that function, you probably don’t want to do something like run a different duckdb search for every tile; instead, run a single search, and then write some logic that can create an object called records which keys from the tile identifier to a JS native Float32 Array (which is usually not too hard to pull off an Arrow float32),This is a new part of the codebase and not documented yet, so let me know if that’s not enough to go on.
The point of defining transformations at the tile level like this is that it staggers the load of updating the WebGL buffers across multiple animation ticks; it would also be possible to pull back the abstraction a little bit and simply update the record batches in place by adding a column for each one all at the same time when the duckdb results came back. For less than a million rows, it’s probably about the same--but at higher numbers of points, it would get painful waiting for transformations to be applied way down the chain.
All this requires no changes to the core deepscatter code. But integrating duckdb might require getting a little into the weeds with Arrow, since on the JS side the core unit in deepscatter is basically not the ‘point’ but actually the tile, which corresponds to a single Arrow record batch.
Only in the last couple months have I settled on a strategy for altering these record batches, which is this new ‘transformations’ binding.
The text was updated successfully, but these errors were encountered: