
add native duckdb loader with tests #3035

Open
mpr1255 wants to merge 7 commits into saulpw:develop from mpr1255:duckdb-loader

Conversation

@mpr1255 commented Mar 19, 2026

Problem:

  • add DuckDB as a native core loader on develop
  • make the loader submission match the published core loader checklist

Approach:

  • add a native DuckDB index/table/view loader with exec-sql, saver support, and deferred commit handling
  • keep focused cmdlog coverage for open, view, exec-sql, PK commits, no-PK safety, alternate schemas, composite PKs, and save roundtrip
  • make the canonical loader replay open the checked-in sample_data/benchmark.duckdb fixture and save the replayed result to tests/golden/load-duckdb.tsv

Checklist:

  • followed the core loader checklist: https://www.visidata.org/docs/contributing/#loader
  • open_duckdb returns the DuckDB sheet
  • explicit rowtype and # rowdef are present
  • the duckdb dependency is noted in requirements.txt
  • a small sample DuckDB dataset is checked into sample_data/
  • tests/load-duckdb.vd opens the checked-in sample dataset
  • tests/golden/load-duckdb.tsv is generated from replaying that loader test
  • dev/formats.jsonl includes the DuckDB section

Risks:

  • DuckDB cmdlog tests require a Python environment with the duckdb package installed
  • deferred writes remain intentionally disabled for tables without a primary key
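
The no-PK and composite-PK behaviors above can be sketched as a small helper. This is a hypothetical illustration, not the loader's actual code: `pk_where` and its signature are invented names, but the logic matches the stated policy of refusing deferred writes without a primary key and joining composite keys with AND.

```python
# Hypothetical sketch of the PK-based commit guard described above;
# pk_where() is an illustrative name, not the loader's actual API.

def pk_where(pk_cols, row):
    """Build a parameterized WHERE clause targeting one row by its PK.

    Refuses to proceed when the table has no primary key, since an
    UPDATE/DELETE could not be guaranteed to touch exactly one row.
    """
    if not pk_cols:
        raise ValueError('deferred writes disabled: table has no primary key')
    clause = ' AND '.join(f'"{c}" = ?' for c in pk_cols)  # composite PKs join with AND
    params = [row[c] for c in pk_cols]
    return clause, params

# composite primary key example
clause, params = pk_where(['org_id', 'user_id'], {'org_id': 7, 'user_id': 42, 'name': 'x'})
print(clause)   # "org_id" = ? AND "user_id" = ?
print(params)   # [7, 42]
```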

Testing:

  • PYTHON=.venv/bin/python dev/test.sh load-duckdb load-duckdb-view duckdb-exec-sql duckdb-commit-changes duckdb-commit-no-pk duckdb-open-alt-schema duckdb-commit-composite-pk duckdb-save-sheet
  • PYTHON=.venv/bin/python dev/test-all.sh tests/test-vdx.sh

@mpr1255 (Author) commented Mar 19, 2026

Hey just a note -- all this was done with codex 4.5xhigh + claude code opus 4.6 max effort. I used gpt 4.5 pro for some reviews. I also have various skills to do code reviews. So I bounced all the models off each other for a while, reviewed various outputs, asked for more investigation and added some additional tests.

This was mainly to solve a pain point for me -- I love visidata and wanted to use it for duckdb files. I have it aliased as 'vd' now and it seems to work great. Hope it's useful and not more AI slop.


CLAassistant commented Mar 19, 2026

CLA assistant check
All committers have signed the CLA.

@saulpw (Owner) commented Mar 24, 2026

Thanks, @mpr1255, it's much appreciated! and thank you for the disclosure, great job. I just did a writeup this weekend about this (https://visidata.org/ai), it sounds like this is AI level 6 or 7. Do you understand most of the code and overall technical choices (like you could find your way around without AI, level 6) or did you not even look at the code really (level 7)?

@mpr1255 (Author) commented Mar 24, 2026

Yeah probably closer to 7 than 6, I'd say? I looked at the code but didn't seriously read it. I mostly bossed the LLMs around into doing that. Like I would say "compare it closely to the sqlite loader -- do we differ only where absolutely necessary? tell me how." and then like, "ok is this as simple as it can be? No LLM slop??" [I realise the irony]. Then I would get codex and claude to check each other's work, and send it to 5.4 Pro for extra AI power.

Here is an example. Claude tells me

_iterfetch() is unnecessary. Since VisiData stores all rows in self.rows anyway, batched fetching saves no memory. result.fetchall() does the same thing in one line. The sqlite loader doesn't batch either — it just yields from the cursor directly.

I'm a bit suss, so I feed this to codex to confirm. It spelunks in the codebase and 10 minutes later says:

  1. No, replacing _iterfetch() with fetchall() would be a real regression, not just cleanup. The current helper at visidata/loaders/duckdb.py:45 feeds rows incrementally into VisiData’s load loop at visidata/sheets.py:317, which appends each row into self.rows at visidata/sheets.py:320. yield from result.fetchall() would first materialize the entire result before yielding the first row, so you lose incremental loading and you also defeat the early max_rows cutoff for precious sheets at visidata/sheets.py:318 and visidata/basesheet.py:105.

  2. It is true that VisiData still keeps all rows in memory long-term. self.rows is reset and filled in visidata/sheets.py:314, so steady-state memory is O(n) either way. But _iterfetch() still reduces peak transient memory. I measured a VisiData-like append loop on DuckDB 1.5.0 with 1,000,000 4-int rows: fetchall() peaked at 191.1 MB, fetchmany(1000) at 183.4 MB. That is the extra whole-result Python list from fetchall().

  3. The sqlite comparison is misleading. SQLite can yield rows directly from its cursor at visidata/loaders/sqlite.py:112. DuckDB does not expose that same path here: con.execute() returns a non-iterable connection object in practice, so some adapter loop is needed. DuckDB’s Python API also documents fetchall() as fetching all rows and fetchmany() as fetching the next set: https://duckdb.org/docs/stable/clients/python/reference/

Now -- am I following all the rabbitholes there? No. I am doing a "lgtm" and just trying to ensure the tests pass and that I did the most minimal possible implementation. Also, I used it a bunch and it behaved normally.

BTW, one other big issue: visidata will obviously fill memory if you try to open a gigantic file. But duckdb seems to behave worse because it shoves things in so fast. So I turned on lazy loading for the main db view (before you click into a table) and surface the metadata DuckDB already tracks for row/column counts. So you can see what tables are in your db without having to calculate them fresh each time, for no memory cost. FYI.

