
Conversation

@EngineerNV
Owner

Summary

  • switch the ingestion step to token-aware chunking with metadata-preserving overlap (see the sketch after this list)
  • document the new chunking strategy and provide a sample Pokémon knowledge base
  • update repository ignores so the sample corpus file is tracked
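
For illustration, here is a minimal sketch of what a metadata-preserving chunk record could look like after ingestion. The field names, the corpus filename, and the chunk size are assumptions for this sketch, not the actual schema in scripts/00_ingest.py:

# Hypothetical chunk record emitted by the ingest step; all names are illustrative.
chunk = {
    "source": "data/corpus/pokemon.md",    # assumed filename for the sample corpus
    "headers": ["# Pokémon", "## Types"],  # header trail carried along as metadata
    "text": "... roughly one window of body text, including the overlapped tokens ...",
    "token_count": 256,                    # illustrative window size in tokens
}

The point of keeping the header trail on every record is that a chunk whose overlap spills across a section boundary still reports the section it started in.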

Testing

  • python scripts/00_ingest.py
  • python scripts/01_build_index.py (fails: Hugging Face model download blocked by network restrictions)

https://chatgpt.com/codex/tasks/task_e_68fe777c3e188321b73869f4dec43017

import os

CORPUS_DIR = os.environ.get("CORPUS_DIR", os.path.join("data", "corpus"))
HEADER_LEVELS = ["#", "##", "###"]  # order matters

# Configure chunking in token units so the ingestion step aligns with embedding models.
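
For context, a minimal sketch of how token-aware chunking with overlap might be wired to settings like these. The tokenizer model, window size, and function name are assumptions for illustration, not the code that was actually merged:

from transformers import AutoTokenizer  # assumes a Hugging Face tokenizer is available

CHUNK_TOKENS = 256                        # illustrative window size in tokens
OVERLAP_TOKENS = int(CHUNK_TOKENS * 0.2)  # 20% overlap between consecutive windows

def chunk_by_tokens(text, tokenizer, size=CHUNK_TOKENS, overlap=OVERLAP_TOKENS):
    """Slide a token window across the text, stepping by size - overlap tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = size - overlap
    chunks = []
    for start in range(0, max(len(ids), 1), step):
        window = ids[start:start + size]
        if not window:
            break
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
    return chunks

# Assumed embedding model; loading it downloads from Hugging Face and needs network access.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

With these illustrative numbers the step is 205 tokens, so adjacent chunks share about 51 tokens and text near a chunk boundary is embedded twice.
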
@EngineerNV (Owner, Author) commented:

@copilot we need to review that the token ingestion is done correctly and that we are setting up our build script for success. TODO: Review these changes - do they achieve the functionality of creating our chunks for embeddings? TODO: Test the build script along with the ingest script to ensure that the repo is functionally sound after these changes.


Copilot AI commented Oct 30, 2025

@EngineerNV I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.

…gestion

Switched from chunking on markdown headers to chunking by token count, with a 20% overlap between consecutive chunks.
@EngineerNV merged commit b885aba into main Oct 30, 2025
@EngineerNV deleted the codex/change-context-extraction-to-token-numbers branch October 30, 2025 21:32