Introduce Microsoft.Extensions.DataIngestion.Abstractions #6949
Pull Request Overview
This PR introduces the Microsoft.Extensions.DataIngestion.Abstractions library, implementing the APIs approved in the referenced GitHub issues. The library provides abstractions for processing documents from various formats into structured chunks suitable for data ingestion scenarios (e.g., RAG pipelines).
Key changes:
- Core document representation classes (`IngestionDocument`, `IngestionDocumentElement` and its specialized types)
- Processing pipeline abstractions (`IngestionDocumentReader`, `IngestionDocumentProcessor`, `IngestionChunker`, `IngestionChunkProcessor`, `IngestionChunkWriter`)
- Support classes (`IngestionChunk`) for representing processed content chunks
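
To make the read, chunk, write flow concrete, here is a minimal sketch of how such stages could compose. It does not use the types added by this PR: `SketchDocument`, `SketchChunk`, and `SketchPipeline` are hypothetical stand-ins, and the paragraph-splitting reader and character-budget chunker are assumptions chosen only for illustration. The document-processor and chunk-processor stages are omitted for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration only; these are NOT the types added by this PR.
public sealed record SketchDocument(string Id, IReadOnlyList<string> Paragraphs);
public sealed record SketchChunk(string DocumentId, string Content);

public static class SketchPipeline
{
    // Reader stage (assumed behavior): split raw text into paragraphs on blank lines.
    public static SketchDocument Read(string id, string text)
    {
        string[] paragraphs = text
            .Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(p => p.Trim())
            .Where(p => p.Length > 0)
            .ToArray();
        return new SketchDocument(id, paragraphs);
    }

    // Chunker stage (assumed behavior): group consecutive paragraphs and start
    // a new chunk when adding the next paragraph would exceed maxChars.
    public static IEnumerable<SketchChunk> Chunk(SketchDocument document, int maxChars)
    {
        var current = new List<string>();
        int length = 0;
        foreach (string paragraph in document.Paragraphs)
        {
            if (current.Count > 0 && length + paragraph.Length > maxChars)
            {
                yield return new SketchChunk(document.Id, string.Join("\n", current));
                current.Clear();
                length = 0;
            }
            current.Add(paragraph);
            length += paragraph.Length;
        }
        if (current.Count > 0)
        {
            yield return new SketchChunk(document.Id, string.Join("\n", current));
        }
    }

    // Writer stage: collects chunks in memory; a real writer would persist them.
    public static List<SketchChunk> Write(IEnumerable<SketchChunk> chunks) => chunks.ToList();
}

public static class Program
{
    public static void Main()
    {
        SketchDocument doc = SketchPipeline.Read(
            "doc-1", "First paragraph.\n\nSecond paragraph.\n\nThird.");
        List<SketchChunk> stored = SketchPipeline.Write(SketchPipeline.Chunk(doc, maxChars: 35));
        foreach (SketchChunk chunk in stored)
        {
            string flattened = chunk.Content.Replace("\n", " | ");
            Console.WriteLine("[" + chunk.DocumentId + "] " + flattened);
        }
    }
}
```

Keeping each stage behind its own small function mirrors the separation the PR describes: a reader, chunker, or writer can be swapped without touching the others.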
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Microsoft.Extensions.DataIngestion.Abstractions.csproj | Project file defining multi-targeting (including netstandard2.0) and conditional package references |
| IngestionDocument.cs | Core document container with section management and content enumeration |
| IngestionDocumentElement.cs | Base class and specialized element types (Section, Paragraph, Header, Footer, Table, Image) |
| IngestionDocumentReader.cs | Abstract reader with file/stream overloads and extensive media type mapping |
| IngestionDocumentProcessor.cs | Abstract processor for document transformation pipeline |
| IngestionChunk.cs | Generic chunk representation with metadata support and validation |
| IngestionChunker.cs | Abstract chunker for splitting documents into chunks |
| IngestionChunkProcessor.cs | Abstract processor for chunk transformation pipeline |
| IngestionChunkWriter.cs | Abstract writer with disposable pattern for chunk output |
| Microsoft.Extensions.DataIngestion.Tests.csproj | Test project configuration with analyzer suppressions |
| IngestionDocumentTests.cs | Unit tests for document enumeration and validation |
…xt PRs in parallel
db5d273 to 7bbb1ef
The APIs were approved in #6893 (comment) and #6895 (comment).