
Conversation

@EngineerNV
Owner

Summary

  • switch the ingestion step to token-aware chunking with metadata-preserving overlap (see the sketch after this list)
  • document the new chunking strategy and provide a sample Pokémon knowledge base
  • update repository ignores so the sample corpus file is tracked
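
For illustration, here is a minimal sketch of what a metadata-preserving chunk record could look like after ingestion. The field names, the corpus filename, and the chunk size are assumptions for this sketch, not the actual schema in scripts/00_ingest.py:

# Hypothetical chunk record emitted by the ingest step; all names are illustrative.
chunk = {
    "source": "data/corpus/pokemon.md",    # assumed filename for the sample corpus
    "headers": ["# Pokémon", "## Types"],  # header trail carried along as metadata
    "text": "... roughly one window of body text, including the overlapped tokens ...",
    "token_count": 256,                    # illustrative window size in tokens
}

The point of keeping the header trail on every record is that a chunk whose overlap spills across a section boundary still reports the section it started in.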

Testing

  • python scripts/00_ingest.py
  • python scripts/01_build_index.py (fails: Hugging Face model download blocked by network restrictions)

https://chatgpt.com/codex/tasks/task_e_68fe777c3e188321b73869f4dec43017

import os

CORPUS_DIR = os.environ.get("CORPUS_DIR", os.path.join("data", "corpus"))
HEADER_LEVELS = ["#", "##", "###"]  # order matters

# Configure chunking in token units so the ingestion step aligns with embedding models.
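
For context, a minimal sketch of how token-aware chunking with overlap might be wired to settings like these. The tokenizer model, window size, and function name are assumptions for illustration, not the code that was actually merged:

from transformers import AutoTokenizer  # assumes a Hugging Face tokenizer is available

CHUNK_TOKENS = 256                        # illustrative window size in tokens
OVERLAP_TOKENS = int(CHUNK_TOKENS * 0.2)  # 20% overlap between consecutive windows

def chunk_by_tokens(text, tokenizer, size=CHUNK_TOKENS, overlap=OVERLAP_TOKENS):
    """Slide a token window across the text, stepping by size - overlap tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = size - overlap
    chunks = []
    for start in range(0, max(len(ids), 1), step):
        window = ids[start:start + size]
        if not window:
            break
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
    return chunks

# Assumed embedding model; loading it downloads from Hugging Face and needs network access.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

With these illustrative numbers the step is 205 tokens, so adjacent chunks share about 51 tokens and text near a chunk boundary is embedded twice.
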
@EngineerNV (Owner, Author) commented:

@copilot we need to review that the token ingestion is done correctly and that we are setting up our build script for success. TODO: Review these changes - do they achieve the functionality of creating our chunks for embeddings? TODO: Test the build script along with the ingest script to ensure that the repo is functionally sound after these changes.


Copilot AI commented Oct 30, 2025

@EngineerNV I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.

…gestion

Switched from chunking on markdown headers to chunking by token count, with a 20% overlap between consecutive chunks.
@EngineerNV merged commit b885aba into main Oct 30, 2025
@EngineerNV deleted the codex/change-context-extraction-to-token-numbers branch October 30, 2025 21:32