@ppinchuk
Collaborator

Add option to process local documents, similar to how we allow processing known URLs

@ppinchuk ppinchuk self-assigned this Oct 28, 2025
@ppinchuk ppinchuk requested a review from castelao as a code owner October 28, 2025 18:33
@ppinchuk ppinchuk added the enhancement (Update to logic or general code improvements) and topic-python-llm (Issues/pull requests related to LLMs) labels Oct 28, 2025
@ppinchuk ppinchuk requested a review from Copilot October 28, 2025 18:33
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds functionality to process local documents in addition to web-based URLs. The implementation introduces a new load_local_docs utility and corresponding file loader services for HTML and PDF files, enabling users to provide local file paths alongside URL-based sources.

Key Changes

  • Added AsyncLocalFileLoader support for processing local PDF and HTML files
  • Introduced known_local_docs parameter to allow users to specify local document paths
  • Modified processing pipeline to check local documents before URL-based searches
  • Updated class references from AsyncFileLoader to AsyncWebFileLoader for clarity
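The "local documents first" ordering described in the list above could look roughly like the sketch below. This is only an illustration under stated assumptions: `classify_local_doc`, `find_documents`, and `search_web` are hypothetical names that do not come from the PR, and the real `AsyncLocalFileLoader` interface and the pipeline's behavior after local documents are found may differ.

```python
import asyncio
from pathlib import Path

# Hypothetical mapping from file suffix to document kind. The PR only
# states that local PDF and HTML files are supported.
SUPPORTED_SUFFIXES = {".pdf": "pdf", ".html": "html", ".htm": "html"}


def classify_local_doc(path):
    """Return "pdf" or "html" for a supported path, else None."""
    return SUPPORTED_SUFFIXES.get(Path(path).suffix.lower())


async def search_web(query):
    """Stand-in for the URL-based search step (assumption, not the real API)."""
    await asyncio.sleep(0)
    return [f"web-result-for:{query}"]


async def find_documents(query, known_local_docs=None):
    """Check user-supplied local documents first, then fall back to the web.

    Whether the real pipeline stops after finding local documents is an
    assumption made for this sketch.
    """
    local = [
        (str(path), classify_local_doc(path))
        for path in (known_local_docs or [])
        if classify_local_doc(path) is not None
    ]
    if local:
        return local
    return await search_web(query)
```

Called as `asyncio.run(find_documents("wind ordinance", ["ordinance.pdf"]))`, the sketch returns the classified local file without ever reaching the search stand-in.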

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| compass/utilities/io.py | New utility module for loading local documents |
| compass/services/cpu.py | Added functions to read local PDF files (with and without OCR) |
| compass/services/threaded.py | Added HTMLFileLoader service and functions to read local HTML files |
| compass/scripts/process.py | Updated main processing function to support local documents and modified processing order |
| compass/scripts/download.py | Added load_known_docs function for loading local documents |
| compass/utilities/nt.py | Added known_local_docs field to ProcessKwargs namedtuple |
| compass/utilities/__init__.py | Exported new load_local_docs function |
| compass/validation/location.py | Updated class reference from AsyncFileLoader to AsyncWebFileLoader |
| compass/validation/content.py | Made legal_text_validator parameter optional in parse_by_chunks |
| compass/extraction/apply.py | Added logic to skip legal text and date validation for known documents |
| compass/web/website_crawl.py | Updated class references and fixed documentation typo |
| tests/python/unit/utilities/test_utilities_io.py | New test file for local document loading functionality |
| tests/python/unit/utilities/test_utilities_base.py | Minor docstring correction |
| tests/python/unit/validation/test_validation_location.py | Added missing @pytest.mark.asyncio decorator |
| tests/python/integration/test_integrated.py | Updated class references from AsyncFileLoader to AsyncWebFileLoader |
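
The optional-validator change noted for compass/validation/content.py and the validation skip noted for compass/extraction/apply.py follow a common pattern, sketched below. The function name `filter_chunks` and its behavior are assumptions for illustration only; this is not the signature or logic of the real `parse_by_chunks`.

```python
def filter_chunks(chunks, legal_text_validator=None):
    """Keep chunks that pass validation, or all chunks if no validator is given.

    Hypothetical helper: passing None models a user-supplied "known"
    document whose contents are trusted, so the legal-text check is skipped.
    """
    if legal_text_validator is None:
        return list(chunks)
    return [chunk for chunk in chunks if legal_text_validator(chunk)]


def looks_like_legal_text(chunk):
    """Toy validator (assumption): flags chunks that mention an ordinance."""
    return "ordinance" in chunk.lower()
```

With this shape, callers handling known documents simply omit the validator, while the web-search path keeps passing one.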

@ppinchuk ppinchuk merged commit cb6aa37 into main Oct 28, 2025
13 checks passed
@ppinchuk ppinchuk deleted the pp/local_docs branch October 28, 2025 19:27