[NEM-363] Add built-context indexing flow #114
Conversation
JulienArzul
left a comment
I think it'd be nice to use Pydantic to create the context type from the plugin rather than adding a new library (cattrs) that does the same thing.
Looks good otherwise 🚀
| """Summary of an indexing run over built contexts.""" | ||
|
|
||
| total: int | ||
| indexed: int |
Instead of int for each of these properties, should we return a list of DatasourceId?
We can still keep indexed as a calculated property for quick access if we want:
from dataclasses import dataclass

@dataclass
class IndexSummary:
    """Summary of an indexing run over built contexts."""

    indexed: set[DatasourceId]
    skipped: set[DatasourceId]
    failed: set[DatasourceId]

    @property
    def number_indexed(self) -> int:
        return len(self.indexed)

    ...
At the very least, I think we should be able to know which datasources failed.
That is already being logged in the exception catch.
But logs may or may not be shown depending on who is calling: I'm guessing we will still show info logs in the CLI console (but since the CLI is going to live in a separate repo, we shouldn't make any strong assumptions about this). And what about the agents? I don't think they would output the logs anywhere.
And since the failed datasources aren't returned in Python, they aren't usable by callers of the method. The most obvious usage in Python would be to retry indexing only the datasources that failed.
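For example, a caller could retry just the failures; a hypothetical sketch, assuming the IndexSummary shape suggested above:

# Hypothetical retry flow: IndexSummary is assumed to expose the failed ids.
summary = manager.index_built_contexts()
if summary.failed:
    # Re-run indexing only for the datasources that failed the first time.
    summary = manager.index_built_contexts(datasource_ids=list(summary.failed))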
import logging
from datetime import datetime

import cattrs
We're already using Pydantic in the project, which in my understanding does the same thing. It would be better IMO not to bring in another library and end up with two different ways of creating classes from YAML.
Interesting. I wasn't aware that Pydantic could do this work for non-Pydantic models, and I wanted to avoid forcing users to use Pydantic. I will test without the new library and, if it fits, I'll remove the added dependency.
Here is the code that does this in Pydantic:
return TypeAdapter(plugin.config_file_type).validate_python(config)
(using a TypeAdapter to handle non-Pydantic models)
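For illustration, TypeAdapter can validate plain dataclasses too; a minimal sketch (ExampleConfig is a made-up stand-in for a plugin's config_file_type):

from dataclasses import dataclass
from pydantic import TypeAdapter

@dataclass
class ExampleConfig:  # hypothetical stand-in for a plugin config type
    name: str
    retries: int = 3

# Validates a plain dict into the dataclass; no BaseModel subclass needed.
config = TypeAdapter(ExampleConfig).validate_python({"name": "demo"})
assert config == ExampleConfig(name="demo", retries=3)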
converter = cattrs.Converter()
converter.register_structure_hook(datetime, lambda v, _: v)
build_datasource_context = converter.structure(raw_context, BuiltDatasourceContext)
❓ What is used for the context: Any inside this class? A dictionary?
We could potentially skip this step to avoid the awkward build_datasource_context object with the wrong content for "context".
That would mean reading the attributes of BuiltDatasourceContext directly from the raw dictionary:
typed_context = converter.structure(raw_context.get("context", {}), context_type)
I'm not sure what you mean here. I'm doing this deliberately so that I can use the other items from the BuiltDatasourceContext object (datasource_type, datasource_id and the context itself)
What I find weird is that we're creating a build_datasource_context whose context attribute hasn't been converted (and, I'm guessing, will be a dict).
It's prone to mistakes IMO: calling build_datasource_context.context will return a dict and not the expected type, and you have to convert that dict (which you do just below) to the right type.
At the very least, I would hide that intermediate object by extracting it into a method that creates the proper BuiltDatasourceContext. Something like:

def parse_built_datasource_context(raw_context: dict, context_type: type) -> BuiltDatasourceContext:
    typed_context = converter.structure(raw_context["context"], context_type)
    # Not sure if you can exclude the `context` attribute from this
    built_datasource_context = converter.structure(raw_context, BuiltDatasourceContext)
    built_datasource_context.context = typed_context
    return built_datasource_context
Or actually, the problem might be that BuiltDatasourceContext should now be a generic BuiltDatasourceContext[T].
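Something along these lines (a rough sketch; the fields are guesses based on the attributes mentioned earlier in the thread):

from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass
class BuiltDatasourceContext(Generic[T]):
    datasource_type: str          # assumed field
    datasource_id: DatasourceId   # assumed field
    context: T                    # typed by the plugin's context_type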
I'm sorry, I again didn't understand what you mean here. I have modified this part and am no longer using cattrs, but I can't tell if this addresses what you are saying.
Ah, I hadn't looked at the new code with Pydantic.
I'm not fully sure what Pydantic does, but similarly, I assume it has no way of knowing what type to use for BuiltDatasourceContext.context the first time it creates that Python object, so you still need your two steps: first building the BuiltDatasourceContext (with an invalid context type), then typing the context specifically.
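If the class were generic as suggested above, the two steps could in principle collapse into a single parametrized validation (an untested sketch, assuming Pydantic v2's support for generic dataclasses):

from pydantic import TypeAdapter

# Parametrizing the generic tells Pydantic how to validate the nested
# "context" field in one pass.
adapter = TypeAdapter(BuiltDatasourceContext[plugin.context_type])
built = adapter.validate_python(raw_context)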
if Path(context_file_name).suffix not in DatasourceId.ALLOWED_YAML_SUFFIXES:
if (
    Path(context_file_name).suffix not in DatasourceId.ALLOWED_YAML_SUFFIXES
    or context_file_name == "all_results.yaml"
👍 well spotted
if not chunk_embeddings:
    raise ValueError("chunk_embeddings must be a non-empty list")

# Outside the transaction due to duckdb limitations.
That's annoying...
But I guess since we're in a local context, there shouldn't be any concurrency on the DB and we can probably live with it. The only potential problem would be if something fails within the following transaction: we would have deleted all previously existing contexts without adding the new ones. That's not great, but it is something we can deal with, as long as we notify the user that this datasource failed to be indexed.
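A sketch of the failure handling being described (all helper names are hypothetical):

try:
    delete_existing_contexts(datasource_id)  # runs outside the transaction
    with transaction():
        insert_chunks_and_embeddings(chunks, embeddings)
except Exception:
    # The datasource may now be left without indexed contexts, so surface
    # the failure to the user instead of only logging it.
    failed.add(datasource_id)
    logger.exception("Indexing failed for %s", datasource_id)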
I think it's highly likely that none of this will be relevant once we start working on perf optimizations, so it's fine.
try:
    logger.info(f"Indexing datasource {context.datasource_id}")

    datasource_type = read_datasource_type_from_context_file(
Nit: Since you're getting a DatasourceContext as input, you have already read the context as a string, so we don't really need to re-read it from the file system (which is what this function does).
We could either:
- replicate what that function does (finding the line with the type attribute and parsing that one only)
- or simply parse the full YAML string, since you'll do it afterwards anyway (see the sketch below)
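A sketch of the second option (the "type" key name is an assumption):

import yaml

def read_datasource_type(context_yaml: str) -> str:
    # Parse the YAML string we already hold in memory instead of
    # re-reading the context file from disk.
    data = yaml.safe_load(context_yaml)
    return data["type"]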
Good call!
datasource_ids: list[DatasourceId] | None = None,
chunk_embedding_mode: ChunkEmbeddingMode = ChunkEmbeddingMode.EMBEDDABLE_TEXT_ONLY,
) -> IndexSummary:
    """Index built datasource contexts into duckdb.
Nit: I'm not sure we should mention DuckDB in externally facing docs?
    The summary of the index operation.
    """
engine: DatabaoContextEngine = self.get_engine_for_project()
contexts: list[DatasourceContext] = engine.get_all_contexts()
Improvement for another PR: we should probably have an API in the engine to get only the datasource contexts from a list of ids.
Right now, we only have:
- get one datasource context
- get all datasource contexts
We should add:
- get multiple datasource contexts (a possible signature is sketched below)
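A hypothetical signature for that API, mirroring the existing ones:

def get_contexts(self, datasource_ids: list[DatasourceId]) -> list[DatasourceContext]:
    """Return only the contexts for the requested datasource ids."""
    ...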
Indeed, we could add that. I considered implementing it for this PR, but I'm not sure it actually saves much IO. It might, though, especially if we have contexts that are very big.
if datasource_ids is not None:
    wanted_paths = {d.datasource_path for d in datasource_ids}
    contexts = [c for c in contexts if c.datasource_id.datasource_path in wanted_paths]
Wouldn't you be able to simply check if c.datasource_id in datasource_ids?
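i.e. something like this (assuming DatasourceId implements value equality):

contexts = [c for c in contexts if c.datasource_id in datasource_ids]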
c2 = DatasourceContext(DatasourceId.from_string_repr("other/b.yaml"), context="B")

engine = mocker.Mock()
engine.get_all_contexts.return_value = [c1, c2]
IMO it would be interesting to make this a full end-to-end test rather than testing the very small amount of code within the ProjectManager that only filters which datasource contexts to use.
(I think all the other tests in this class are end-to-end tests, since this is the entry point.)
There is already a helper function called given_output_dir_with_built_contexts that can create the contexts for you in the output folder, so it shouldn't be hard code-wise; see the rough sketch below.
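Something like this, roughly (the helper and constructor signatures are assumptions):

def test_index_filters_to_requested_datasources(tmp_path):
    # Hypothetical end-to-end variant: real built contexts on disk
    # instead of a mocked engine.
    given_output_dir_with_built_contexts(tmp_path, ["ds/a.yaml", "other/b.yaml"])
    manager = DatabaoContextProjectManager(tmp_path)  # assumed constructor
    summary = manager.index_built_contexts(
        datasource_ids=[DatasourceId.from_string_repr("ds/a.yaml")]
    )
    assert summary.indexed == 1  # only ds/a.yaml should be indexed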
This PR introduces a full "index built contexts" workflow. After the datasource context is built, users can now run index to read the generated context files from output/, generate embeddings and persist them into DuckDB.

Changes
- dce index command: a new index command that indexes built context files into DuckDB.
- DatabaoContextProjectManager.index_built_contexts(...)
- BuildService.index_built_context(...): parses each built context file (BuildDatasourceContext) and deserializes the context into the plugin's expected context_type.
- PersistenceService.write_chunks_and_embeddings(..., override=True) now deletes old embeddings and chunks for a datasource before inserting new ones.
- context_type: with yaml.safe_load(), the payload is always in Python primitives, but the chunkers are intentionally written against typed context objects. Because of that, each plugin now declares context_type to tell the indexing pipeline what type to reconstruct before calling the chunking operation.
- cattrs provides structured conversion from unstructured data into Python types, which fits our needs and avoids boilerplate deserialize methods that may be tough to maintain as the project grows.
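For illustration, a plugin declaration might look like this (all names are hypothetical):

from dataclasses import dataclass

@dataclass
class MyTableContext:
    # Typed context object the chunkers are written against.
    table_name: str
    columns: list[str]

class MyPlugin:
    # Tells the indexing pipeline which type to reconstruct from the
    # yaml.safe_load() primitives before chunking.
    context_type = MyTableContext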