@adamsitnik adamsitnik commented Oct 22, 2025

The APIs were approved in #6893 (comment) and #6895 (comment).

@adamsitnik adamsitnik self-assigned this Oct 22, 2025
@Copilot Copilot AI review requested due to automatic review settings October 22, 2025 14:23

Copilot AI left a comment

Pull Request Overview

This PR introduces the Microsoft.Extensions.DataIngestion.Abstractions library, implementing the APIs approved in the referenced GitHub issues. The library provides abstractions for processing documents from various formats into structured chunks suitable for data ingestion scenarios (e.g., RAG pipelines).

Key changes:

  • Core document representation classes (IngestionDocument, IngestionDocumentElement and its specialized types)
  • Processing pipeline abstractions (IngestionDocumentReader, IngestionDocumentProcessor, IngestionChunker, IngestionChunkProcessor, IngestionChunkWriter)
  • Supporting type (IngestionChunk) for representing processed content chunks
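
The stage order implied by the abstractions above (reader, document processor, chunker, chunk processor, writer) can be illustrated with a minimal, language-neutral sketch. This is not the library's API: the PR targets .NET, and every name, signature, and behavior below (`Document`, `Chunk`, `run_pipeline`, paragraph-based chunking, the `length` metadata key) is a hypothetical stand-in chosen only to show how data might flow between the stages.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document: an identifier plus paragraphs."""
    identifier: str
    paragraphs: list[str]


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk: content, source document id, metadata."""
    content: str
    document_id: str
    metadata: dict[str, str] = field(default_factory=dict)


def read(identifier: str, text: str) -> Document:
    # Reader stage: turn raw input into a structured document
    # (assumption: paragraphs are separated by blank lines).
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return Document(identifier, paragraphs)


def process_document(doc: Document) -> Document:
    # Document-processor stage: transform the document before chunking
    # (here: collapse runs of whitespace inside each paragraph).
    return Document(doc.identifier, [" ".join(p.split()) for p in doc.paragraphs])


def chunk(doc: Document) -> list[Chunk]:
    # Chunker stage: split the document into chunks (here: one per paragraph).
    return [Chunk(p, doc.identifier) for p in doc.paragraphs]


def process_chunk(c: Chunk) -> Chunk:
    # Chunk-processor stage: enrich each chunk (here: record its length).
    c.metadata["length"] = str(len(c.content))
    return c


def write(chunks: list[Chunk], store: list[Chunk]) -> None:
    # Writer stage: persist chunks (here: append to an in-memory list).
    store.extend(chunks)


def run_pipeline(identifier: str, text: str) -> list[Chunk]:
    # Wire the five stages together in order.
    store: list[Chunk] = []
    doc = process_document(read(identifier, text))
    write([process_chunk(c) for c in chunk(doc)], store)
    return store
```

Calling `run_pipeline("doc-1", "First   paragraph.\n\nSecond\nparagraph.")` in this sketch yields two chunks, one per paragraph, each tagged with the id of its source document. The actual library expresses these stages as abstract classes, so concrete readers, chunkers, and writers are supplied by implementations rather than by free functions as here.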

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| Microsoft.Extensions.DataIngestion.Abstractions.csproj | Project file defining multi-targeting (including netstandard2.0) and conditional package references |
| IngestionDocument.cs | Core document container with section management and content enumeration |
| IngestionDocumentElement.cs | Base class and specialized element types (Section, Paragraph, Header, Footer, Table, Image) |
| IngestionDocumentReader.cs | Abstract reader with file/stream overloads and extensive media type mapping |
| IngestionDocumentProcessor.cs | Abstract processor for document transformation pipeline |
| IngestionChunk.cs | Generic chunk representation with metadata support and validation |
| IngestionChunker.cs | Abstract chunker for splitting documents into chunks |
| IngestionChunkProcessor.cs | Abstract processor for chunk transformation pipeline |
| IngestionChunkWriter.cs | Abstract writer with disposable pattern for chunk output |
| Microsoft.Extensions.DataIngestion.Tests.csproj | Test project configuration with analyzer suppressions |
| IngestionDocumentTests.cs | Unit tests for document enumeration and validation |

@adamsitnik adamsitnik force-pushed the dataIngestionAbstractions branch from db5d273 to 7bbb1ef Compare October 23, 2025 10:40
@adamsitnik adamsitnik merged commit 72f930a into dotnet:main Oct 23, 2025
6 checks passed
@adamsitnik adamsitnik deleted the dataIngestionAbstractions branch October 23, 2025 11:22