Transform any document into LLM-ready Markdown with this powerful C#/.NET library!
MarkItDown is a comprehensive document conversion library that transforms diverse file formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, and more) into clean, high-quality Markdown. Perfect for AI workflows, RAG (Retrieval-Augmented Generation) systems, content processing pipelines, and text analytics applications.
Why MarkItDown for .NET?
- Built for modern C# developers - Native .NET 9 library with async/await throughout
- LLM-optimized output - Clean Markdown that AI models love to consume
- Zero-friction NuGet package - Just `dotnet add package ManagedCode.MarkItDown` and go
- Stream-based processing - Handle large documents efficiently without temporary files
- Highly extensible - Add custom converters or integrate with AI services for captions/transcription
This is a high-fidelity C# port of Microsoft's original MarkItDown Python library, reimagined for the .NET ecosystem with modern async patterns, improved performance, and enterprise-ready features.
- Perfect for RAG systems - Convert documents to searchable, contextual Markdown chunks
- Token-efficient - Clean output maximizes your LLM token budget
- Structured data preservation - Tables, headers, and lists maintain semantic meaning
- Metadata extraction - Rich document properties for enhanced context
- Native performance - Built from the ground up for .NET, not a wrapper
- Modern async/await - Non-blocking I/O with full cancellation support
- Memory efficient - Stream-based processing avoids loading entire files into memory
- Enterprise ready - Proper error handling, logging, and configuration options
- 22+ file formats supported - From Office documents to web pages to archives
- Batch processing ready - Handle hundreds of documents efficiently
- Extensible architecture - Add custom converters for proprietary formats
- Smart format detection - Automatic MIME type and encoding detection
- Features
- Format Support
- Quick Start
- Usage
- Architecture
- Development & Contributing
- Roadmap
- Performance
- Configuration
- License
- Acknowledgments
- Support
- Modern .NET - Targets .NET 9.0 with up-to-date language features
- NuGet Package - Drop-in dependency for libraries and automation pipelines
- Async/Await - Fully asynchronous pipeline for responsive apps
- LLM-Optimized - Markdown tailored for AI ingestion and summarisation
- Extensible - Register custom converters or plug in additional caption/transcription services
- Smart Detection - Automatic MIME, charset, and file-type guessing (including data/file URIs)
- High Performance - Stream-friendly, minimal allocations, zero temp files
Format | Extension | Status | Description |
---|---|---|---|
HTML | `.html`, `.htm` | ✅ Supported | Full HTML to Markdown conversion |
Plain Text | `.txt`, `.md` | ✅ Supported | Direct text processing |
PDF | `.pdf` | ✅ Supported | Adobe PDF documents with text extraction |
Word | `.docx` | ✅ Supported | Microsoft Word documents with formatting |
Excel | `.xlsx` | ✅ Supported | Microsoft Excel spreadsheets as tables |
PowerPoint | `.pptx` | ✅ Supported | Microsoft PowerPoint presentations |
Images | `.jpg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp` | ✅ Supported | Exif metadata extraction + optional captions |
Audio | `.wav`, `.mp3`, `.m4a`, `.mp4` | ✅ Supported | Metadata extraction + optional transcription |
CSV | `.csv` | ✅ Supported | Comma-separated values as Markdown tables |
JSON | `.json`, `.jsonl`, `.ndjson` | ✅ Supported | Structured JSON data with formatting |
XML | `.xml`, `.xsd`, `.xsl`, `.rss`, `.atom` | ✅ Supported | XML documents with structure preservation |
EPUB | `.epub` | ✅ Supported | E-book files with metadata and content |
Email | `.eml` | ✅ Supported | Email files with headers, content, and attachment info |
ZIP | `.zip` | ✅ Supported | Archive processing with recursive file conversion |
Jupyter Notebook | `.ipynb` | ✅ Supported | Python notebooks with code and markdown cells |
RSS/Atom Feeds | `.rss`, `.atom`, `.xml` | ✅ Supported | Web feeds with structured content and metadata |
YouTube URLs | YouTube links | ✅ Supported | Video metadata extraction and link formatting |
Wikipedia Pages | wikipedia.org | ✅ Supported | Article-only extraction with clean Markdown |
Bing SERPs | bing.com/search | ✅ Supported | Organic result summarisation |
- Headers (H1-H6) → Markdown headers
- Bold/Strong text → `**bold**`
- Italic/Emphasis text → `*italic*`
- Links → `[text](url)`
- Images → `![alt](src)`
- Lists (ordered/unordered)
- Tables with header detection and Markdown table output
- Code blocks and inline code
- Blockquotes, sections, semantic containers
- Text extraction with page separation
- Header detection based on formatting
- List item recognition
- Title extraction from document content
- Word (.docx): Headers, paragraphs, tables, bold/italic formatting
- Excel (.xlsx): Spreadsheet data as Markdown tables with sheet organization
- PowerPoint (.pptx): Slide-by-slide content with title recognition
- Automatic table formatting with headers
- Proper escaping of special characters
- Support for various CSV dialects
- Handles quoted fields and embedded commas
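
The CSV behaviour listed above (header row, escaping, quoted fields, alternative delimiters) can be illustrated with a small standalone sketch. It is not the library's `CsvConverter`. The helper names `ParseCsvLine`, `EscapeCell`, and `CsvToMarkdownTable` are hypothetical, and the sketch assumes that a record never spans multiple lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Split one CSV line into fields, honouring double-quoted fields and "" escapes.
// Simplification (assumption): a record never spans multiple lines.
List<string> ParseCsvLine(string line, char delimiter = ',')
{
    var fields = new List<string>();
    var current = new StringBuilder();
    var inQuotes = false;

    for (var i = 0; i < line.Length; i++)
    {
        var c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
            {
                current.Append('"'); // "" inside a quoted field is a literal quote
                i++;
            }
            else if (c == '"') inQuotes = false;
            else current.Append(c);
        }
        else if (c == '"') inQuotes = true;
        else if (c == delimiter)
        {
            fields.Add(current.ToString());
            current.Clear();
        }
        else current.Append(c);
    }

    fields.Add(current.ToString());
    return fields;
}

// Escape characters that would otherwise break a Markdown table cell.
string EscapeCell(string value) =>
    value.Replace("\\", "\\\\").Replace("|", "\\|").Trim();

// The first row becomes the table header; short rows are padded with empty cells.
string CsvToMarkdownTable(string csv, char delimiter = ',')
{
    var rows = csv
        .Split('\n', StringSplitOptions.RemoveEmptyEntries)
        .Select(line => ParseCsvLine(line.TrimEnd('\r'), delimiter))
        .ToList();
    if (rows.Count == 0) return string.Empty;

    var header = rows[0];
    var sb = new StringBuilder();
    sb.AppendLine("| " + string.Join(" | ", header.Select(EscapeCell)) + " |");
    sb.AppendLine("| " + string.Join(" | ", header.Select(_ => "---")) + " |");
    foreach (var row in rows.Skip(1))
    {
        var cells = Enumerable.Range(0, header.Count)
            .Select(i => i < row.Count ? EscapeCell(row[i]) : string.Empty);
        sb.AppendLine("| " + string.Join(" | ", cells) + " |");
    }
    return sb.ToString();
}

var table = CsvToMarkdownTable("name,note\n\"Smith, Jane\",a|b\n");
Console.Write(table);
```

Passing `delimiter: ';'` or `delimiter: '\t'` covers the most common dialect variations in this sketch.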
- Structured Format: Converts JSON objects to readable Markdown with proper hierarchy
- JSON Lines Support: Processes `.jsonl` and `.ndjson` files line by line
- Data Type Preservation: Maintains JSON data types (strings, numbers, booleans, null)
- Nested Objects: Handles complex nested structures with proper indentation
- Structure Preservation: Maintains XML hierarchy as Markdown headings
- Attributes Handling: Converts XML attributes to Markdown lists
- Multiple Formats: Supports XML, XSD, XSL, RSS, and Atom feeds
- CDATA Support: Properly handles CDATA sections as code blocks
- Metadata Extraction: Extracts title, author, publisher, and other Dublin Core metadata
- Content Order: Processes content files in proper reading order using spine information
- HTML Processing: Converts XHTML content using the HTML converter
- Table of Contents: Maintains document structure from the original EPUB
- Recursive Processing: Extracts and converts all supported files within archives
- Structure Preservation: Maintains original file paths and organization
- Multi-Format Support: Processes different file types within the same archive
- Error Handling: Continues processing even if individual files fail
- Size Limits: Protects against memory issues with large files
- Cell Type Support: Processes markdown, code, and raw cells appropriately
- Metadata Extraction: Extracts notebook title, kernel information, and language details
- Code Output Handling: Captures and formats execution results, streams, and errors
- Syntax Highlighting: Preserves language information for proper code block formatting
- Multi-Format Support: Handles RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds
- Feed Metadata: Extracts title, description, last update date, and author information
- Article Processing: Converts feed items with proper title linking and content formatting
- Date Formatting: Normalizes publication dates across different feed formats
- URL Recognition: Supports standard and shortened YouTube URLs (youtube.com, youtu.be)
- Metadata Extraction: Extracts video ID and URL parameters with descriptions
- Embed Integration: Provides thumbnail images and multiple access methods
- Parameter Parsing: Decodes common YouTube URL parameters (playlist, timestamps, etc.)
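
As a rough illustration of the URL recognition and parameter parsing described above, the following sketch extracts a video ID, playlist, and start time from standard and shortened links. It is not the code of `YouTubeUrlConverter`. `ParseYouTubeUrl` is a hypothetical helper, and the set of hosts it accepts is an assumption.

```csharp
using System;
using System.Web; // HttpUtility.ParseQueryString

// Hypothetical helper: returns the video ID plus common parameters for
// youtube.com/watch?v=... and youtu.be/... links, or nulls for anything else.
(string? VideoId, string? Playlist, string? Start) ParseYouTubeUrl(string url)
{
    if (!Uri.TryCreate(url, UriKind.Absolute, out var uri))
        return (null, null, null);

    var host = uri.Host.ToLowerInvariant();
    if (host.StartsWith("www.", StringComparison.Ordinal))
        host = host.Substring(4);

    var query = HttpUtility.ParseQueryString(uri.Query);
    string? videoId = host switch
    {
        "youtu.be" => uri.AbsolutePath.Trim('/'),        // short link: ID is the path
        "youtube.com" or "m.youtube.com" => query["v"],  // standard link: ID is ?v=
        _ => null
    };

    if (string.IsNullOrEmpty(videoId))
        return (null, null, null);

    // "list" carries the playlist ID and "t" the start timestamp, when present.
    return (videoId, query["list"], query["t"]);
}

var parsed = ParseYouTubeUrl("https://www.youtube.com/watch?v=abc123&list=PL1&t=42");
Console.WriteLine($"Video: {parsed.VideoId}, playlist: {parsed.Playlist}, start: {parsed.Start}");
```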
- Support for JPEG, PNG, GIF, BMP, TIFF, WebP
- Exif metadata extraction via `exiftool` (optional)
- Optional multimodal image captioning hook (LLM integration ready)
- Graceful fallback when metadata/captioning unavailable
- Handles WAV/MP3/M4A/MP4 containers
- Extracts key metadata (artist, album, duration, channels, etc.)
- Optional transcription delegate for speech-to-text results
- Markdown summary highlighting metadata and transcript
Install via NuGet Package Manager:
# Package Manager Console
Install-Package ManagedCode.MarkItDown
# .NET CLI
dotnet add package ManagedCode.MarkItDown
# PackageReference (add to your .csproj)
<PackageReference Include="ManagedCode.MarkItDown" Version="0.0.3" />
- .NET 9.0 SDK or later
- Compatible with .NET 9 apps and libraries
using MarkItDown;
// Create converter instance
var markItDown = new MarkItDown();
// Convert any file to Markdown
var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);
// That's it! MarkItDown handles format detection automatically
RAG System Document Ingestion
using MarkItDown;
using Microsoft.Extensions.Logging;
// Set up logging to track conversion progress
using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = loggerFactory.CreateLogger<MarkItDown>();
var markItDown = new MarkItDown(logger: logger);
// Convert documents for vector database ingestion
string[] documents = { "report.pdf", "data.xlsx", "webpage.html" };
var markdownChunks = new List<string>();
foreach (var doc in documents)
{
try
{
var result = await markItDown.ConvertAsync(doc);
markdownChunks.Add($"# Document: {result.Title ?? Path.GetFileName(doc)}\n\n{result.Markdown}");
logger.LogInformation("Converted {Document} ({Length} characters)", doc, result.Markdown.Length);
}
catch (UnsupportedFormatException ex)
{
logger.LogWarning("Skipped unsupported file {Document}: {Error}", doc, ex.Message);
}
}
// markdownChunks now ready for embedding and vector storage
Batch Email Processing
using MarkItDown;
var markItDown = new MarkItDown();
var emailFolder = @"C:\Emails\Exports";
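var outputFolder = @"C:\ProcessedEmails";
Directory.CreateDirectory(outputFolder); // Ensure the output folder exists
// Directory.EnumerateFiles is synchronous, so a plain foreach is enough
foreach (var emlFile in Directory.EnumerateFiles(emailFolder, "*.eml"))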
{
var result = await markItDown.ConvertAsync(emlFile);
// Extract metadata
Console.WriteLine($"Email: {result.Title}");
Console.WriteLine($"Converted to {result.Markdown.Length} characters of Markdown");
// Save processed version
var outputPath = Path.Combine(outputFolder, Path.ChangeExtension(Path.GetFileName(emlFile), ".md"));
await File.WriteAllTextAsync(outputPath, result.Markdown);
}
Web Content Processing
using MarkItDown;
using Microsoft.Extensions.Logging;
using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
using var httpClient = new HttpClient();
var markItDown = new MarkItDown(
logger: loggerFactory.CreateLogger<MarkItDown>(),
httpClient: httpClient);
// Convert web pages directly
var urls = new[]
{
"https://en.wikipedia.org/wiki/Machine_learning",
"https://docs.microsoft.com/en-us/dotnet/csharp/",
"https://github.com/microsoft/semantic-kernel"
};
foreach (var url in urls)
{
var result = await markItDown.ConvertFromUrlAsync(url);
Console.WriteLine($"Title: {result.Title}");
Console.WriteLine($"Source: {url}");
Console.WriteLine($"Content: {result.Markdown.Length} characters");
Console.WriteLine("---");
}
- PDF Support: Provided via PdfPig (bundled)
- Office Documents: Provided via DocumentFormat.OpenXml (bundled)
- Image metadata: Install ExifTool for richer output (`brew install exiftool`, `choco install exiftool`)
- Image captions: Supply an `ImageCaptioner` delegate (e.g., calls to an LLM or vision service)
- Audio transcription: Supply an `AudioTranscriber` delegate (e.g., Azure Cognitive Services, OpenAI Whisper)

Note: External tools are optional; MarkItDown degrades gracefully when they are absent.
using MarkItDown;
// Convert a DOCX file and print the Markdown
var markItDown = new MarkItDown();
DocumentConverterResult result = await markItDown.ConvertAsync("report.docx");
Console.WriteLine(result.Markdown);
using System.IO;
using System.Text;
using MarkItDown;
using var stream = File.OpenRead("invoice.html");
var streamInfo = new StreamInfo(
mimeType: "text/html",
extension: ".html",
charset: Encoding.UTF8,
fileName: "invoice.html");
var markItDown = new MarkItDown();
var result = await markItDown.ConvertAsync(stream, streamInfo);
Console.WriteLine(result.Title);
using MarkItDown;
// Convert an EML file to Markdown
var markItDown = new MarkItDown();
DocumentConverterResult result = await markItDown.ConvertAsync("message.eml");
// The result includes email headers and content
Console.WriteLine($"Subject: {result.Title}");
Console.WriteLine(result.Markdown);
// Output includes:
// # Email
// **Subject:** Important Project Update
// **From:** sender@example.com
// **To:** recipient@example.com
// **Date:** 2024-01-15 10:30:00 +00:00
//
// ## Message Content
// [Email body content converted to Markdown]
//
// ## Attachments (if any)
// - file.pdf (application/pdf) - 1.2 MB
using MarkItDown;
using Microsoft.Extensions.Logging;
using var loggerFactory = LoggerFactory.Create(static builder => builder.AddConsole());
using var httpClient = new HttpClient();
var markItDown = new MarkItDown(
logger: loggerFactory.CreateLogger<MarkItDown>(),
httpClient: httpClient);
DocumentConverterResult urlResult = await markItDown.ConvertFromUrlAsync("https://contoso.example/blog");
Console.WriteLine(urlResult.Title);
using Azure;
using MarkItDown;
var options = new MarkItDownOptions
{
// Plug in your own services (Azure AI, OpenAI, etc.)
ImageCaptioner = async (bytes, info, token) =>
await myCaptionService.DescribeAsync(bytes, info, token),
AudioTranscriber = async (bytes, info, token) =>
await speechClient.TranscribeAsync(bytes, info, token),
DocumentIntelligence = new DocumentIntelligenceOptions
{
Endpoint = "https://<your-resource>.cognitiveservices.azure.com/",
Credential = new AzureKeyCredential("<document-intelligence-key>")
}
};
var markItDown = new MarkItDown(options);
Create your own format converters by implementing `IDocumentConverter`:
using System.IO;
using MarkItDown;
public sealed class MyCustomConverter : IDocumentConverter
{
public int Priority => ConverterPriority.SpecificFileFormat;
public bool AcceptsInput(StreamInfo streamInfo) =>
string.Equals(streamInfo.Extension, ".mycustom", StringComparison.OrdinalIgnoreCase);
public Task<DocumentConverterResult> ConvertAsync(
Stream stream,
StreamInfo streamInfo,
CancellationToken cancellationToken = default)
{
stream.Seek(0, SeekOrigin.Begin);
using var reader = new StreamReader(stream, leaveOpen: true);
var markdown = "# Converted from custom format\n\n" + reader.ReadToEnd();
return Task.FromResult(new DocumentConverterResult(markdown, "Custom document"));
}
}
var markItDown = new MarkItDown();
markItDown.RegisterConverter(new MyCustomConverter());
using MarkItDown;
public class PowerBIConverter : IDocumentConverter
{
public int Priority => 150; // Between HTML and PlainText
public bool AcceptsInput(StreamInfo streamInfo) =>
streamInfo.Extension?.ToLowerInvariant() == ".pbix" ||
streamInfo.MimeType?.Contains("powerbi") == true;
public async Task<DocumentConverterResult> ConvertAsync(
Stream stream,
StreamInfo streamInfo,
CancellationToken cancellationToken = default)
{
// Custom PowerBI file processing logic here
var markdown = await ProcessPowerBIFile(stream, cancellationToken);
return new DocumentConverterResult(markdown, "PowerBI Report");
}
private async Task<string> ProcessPowerBIFile(Stream stream, CancellationToken cancellationToken)
{
// Implementation details...
await Task.Delay(100, cancellationToken); // Placeholder
return "# PowerBI Report\n\nProcessed PowerBI content here...";
}
}
using MarkItDown;
using Microsoft.Extensions.Logging;
public class DocumentProcessor
{
private readonly MarkItDown _markItDown;
private readonly ILogger<DocumentProcessor> _logger;
public DocumentProcessor(ILogger<DocumentProcessor> logger)
{
_logger = logger;
_markItDown = new MarkItDown(logger: logger);
}
public async Task<List<ProcessedDocument>> ProcessDirectoryAsync(
string directoryPath,
string outputPath,
IProgress<ProcessingProgress>? progress = null)
{
var files = Directory.EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories)
.Where(f => !Path.GetFileName(f).StartsWith('.'))
.ToList();
var results = new List<ProcessedDocument>();
var processed = 0;
await Parallel.ForEachAsync(files, new ParallelOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount
},
async (file, cancellationToken) =>
{
try
{
var result = await _markItDown.ConvertAsync(file, cancellationToken: cancellationToken);
var outputFile = Path.Combine(outputPath,
Path.ChangeExtension(Path.GetRelativePath(directoryPath, file), ".md"));
Directory.CreateDirectory(Path.GetDirectoryName(outputFile)!);
await File.WriteAllTextAsync(outputFile, result.Markdown, cancellationToken);
lock (results)
{
results.Add(new ProcessedDocument(file, outputFile, result.Markdown.Length));
processed++;
progress?.Report(new ProcessingProgress(processed, files.Count, file));
}
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed to process {File}", file);
}
});
return results;
}
}
public record ProcessedDocument(string InputPath, string OutputPath, int CharacterCount);
public record ProcessingProgress(int Processed, int Total, string CurrentFile);
using System.Text; // StringBuilder, used by SplitIntoChunks
using MarkItDown;
using Microsoft.Extensions.VectorData;
public class DocumentIndexer
{
private readonly MarkItDown _markItDown;
private readonly IVectorStore _vectorStore;
public DocumentIndexer(IVectorStore vectorStore)
{
_vectorStore = vectorStore;
_markItDown = new MarkItDown();
}
public async Task IndexDocumentAsync(string filePath)
{
// Convert to Markdown
var result = await _markItDown.ConvertAsync(filePath);
// Split into chunks for better vector search
var chunks = SplitIntoChunks(result.Markdown, maxChunkSize: 500);
// The collection is typed to DocumentChunk (with string keys) so that UpsertAsync accepts the records built below
var collection = _vectorStore.GetCollection<string, DocumentChunk>("documents");
for (int i = 0; i < chunks.Count; i++)
{
var document = new DocumentChunk
{
Id = $"{Path.GetFileName(filePath)}_{i}",
Content = chunks[i],
Title = result.Title ?? Path.GetFileName(filePath),
Source = filePath,
ChunkIndex = i
};
await collection.UpsertAsync(document);
}
}
private List<string> SplitIntoChunks(string markdown, int maxChunkSize)
{
// Simple line-based chunking: a new chunk starts once adding a line would exceed maxChunkSize characters
var chunks = new List<string>();
var lines = markdown.Split('\n');
var currentChunk = new StringBuilder();
foreach (var line in lines)
{
if (currentChunk.Length + line.Length > maxChunkSize && currentChunk.Length > 0)
{
chunks.Add(currentChunk.ToString().Trim());
currentChunk.Clear();
}
currentChunk.AppendLine(line);
}
if (currentChunk.Length > 0)
chunks.Add(currentChunk.ToString().Trim());
return chunks;
}
}
public class DocumentChunk
{
public string Id { get; set; } = "";
public string Content { get; set; } = "";
public string Title { get; set; } = "";
public string Source { get; set; } = "";
public int ChunkIndex { get; set; }
}
// Azure Functions example
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.Extensions.Logging;
using MarkItDown;
public class DocumentConversionFunction
{
private readonly MarkItDown _markItDown;
private readonly ILogger<DocumentConversionFunction> _logger;
public DocumentConversionFunction(ILogger<DocumentConversionFunction> logger)
{
_logger = logger;
_markItDown = new MarkItDown(logger: logger);
}
[Function("ConvertDocument")]
public async Task<HttpResponseData> ConvertDocument(
[HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req)
{
try
{
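// Note: HttpRequestData has no built-in ReadFormAsync; this assumes a multipart-parsing extension method is available
var formData = await req.ReadFormAsync();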
var file = formData.Files.FirstOrDefault();
if (file == null)
{
var badResponse = req.CreateResponse(System.Net.HttpStatusCode.BadRequest);
await badResponse.WriteStringAsync("No file uploaded");
return badResponse;
}
var streamInfo = new StreamInfo(
extension: Path.GetExtension(file.FileName),
fileName: file.FileName,
mimeType: file.ContentType
);
var result = await _markItDown.ConvertAsync(file.OpenReadStream(), streamInfo);
var response = req.CreateResponse(System.Net.HttpStatusCode.OK);
// WriteAsJsonAsync sets the Content-Type header itself, so it is not added manually here
await response.WriteAsJsonAsync(new
{
title = result.Title,
markdown = result.Markdown,
characterCount = result.Markdown.Length
});
return response;
}
catch (UnsupportedFormatException ex)
{
var response = req.CreateResponse(System.Net.HttpStatusCode.UnsupportedMediaType);
await response.WriteStringAsync($"Unsupported file format: {ex.Message}");
return response;
}
catch (Exception ex)
{
_logger.LogError(ex, "Document conversion failed");
var response = req.CreateResponse(System.Net.HttpStatusCode.InternalServerError);
await response.WriteStringAsync("Internal server error");
return response;
}
}
}
- `MarkItDown` - Main entry point for conversions
- `IDocumentConverter` - Interface for format-specific converters
- `DocumentConverterResult` - Contains the converted Markdown and optional metadata
- `StreamInfo` - Metadata about the input stream (MIME type, extension, charset, etc.)
- `ConverterRegistration` - Associates converters with priority for selection
MarkItDown includes these converters in priority order:
- `YouTubeUrlConverter` - Video metadata from YouTube URLs
- `HtmlConverter` - HTML to Markdown using AngleSharp
- `WikipediaConverter` - Clean article extraction from Wikipedia pages
- `BingSerpConverter` - Search result summaries from Bing
- `RssFeedConverter` - RSS/Atom feeds with article processing
- `JsonConverter` - Structured JSON data with formatting
- `JupyterNotebookConverter` - Python notebooks with code and markdown cells
- `CsvConverter` - CSV files as Markdown tables
- `EpubConverter` - E-book content and metadata
- `EmlConverter` - Email files with headers and attachments
- `XmlConverter` - XML documents with structure preservation
- `ZipConverter` - Archive processing with recursive conversion
- `PdfConverter` - PDF text extraction using PdfPig
- `DocxConverter` - Microsoft Word documents
- `XlsxConverter` - Microsoft Excel spreadsheets
- `PptxConverter` - Microsoft PowerPoint presentations
- `AudioConverter` - Audio metadata and optional transcription
- `ImageConverter` - Image metadata via ExifTool and optional captions
- `PlainTextConverter` - Plain text, Markdown, and other text formats (fallback)
- Priority-based dispatch (lower values processed first)
- Automatic stream sniffing via `StreamInfoGuesser`
- Manual overrides via `MarkItDownOptions` or `StreamInfo`
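
The idea of priority-based dispatch, where lower values are tried first and a catch-all acts as the fallback, can be sketched independently of the library. The registrations, priority numbers, and the `SelectConverter` helper below are hypothetical and only illustrate the selection rule. The real registry works on `StreamInfo` rather than a bare extension.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical registrations: (name, priority, predicate over the file extension).
// The priority values are illustrative, not the library's actual constants.
var registrations = new List<(string Name, int Priority, Func<string, bool> Accepts)>
{
    ("PlainText", 1000, ext => true),               // catch-all fallback, tried last
    ("Html", 100, ext => ext is ".html" or ".htm"),
    ("Pdf", 200, ext => ext == ".pdf"),
};

// Try converters in ascending priority order and return the first that accepts the input.
// OrderBy is a stable sort, so equal priorities keep their registration order.
string? SelectConverter(string extension) =>
    registrations
        .OrderBy(r => r.Priority)
        .Where(r => r.Accepts(extension.ToLowerInvariant()))
        .Select(r => r.Name)
        .FirstOrDefault();

Console.WriteLine(SelectConverter(".HTML"));
Console.WriteLine(SelectConverter(".unknown"));
```

Because the fallback accepts everything, its high priority value is what keeps it from shadowing the format-specific converters.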
using MarkItDown;
var markItDown = new MarkItDown();
try
{
var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);
}
catch (UnsupportedFormatException ex)
{
// File format not supported by any converter
Console.WriteLine($"Cannot process this file type: {ex.Message}");
}
catch (FileNotFoundException ex)
{
// File path doesn't exist
Console.WriteLine($"File not found: {ex.Message}");
}
catch (UnauthorizedAccessException ex)
{
// Permission issues
Console.WriteLine($"Access denied: {ex.Message}");
}
catch (MarkItDownException ex)
{
// General conversion errors (corrupt files, parsing issues, etc.)
Console.WriteLine($"Conversion failed: {ex.Message}");
if (ex.InnerException != null)
Console.WriteLine($"Details: {ex.InnerException.Message}");
}
File Format Detection Issues:
// Force specific format detection
var streamInfo = new StreamInfo(
mimeType: "application/pdf", // Explicit MIME type
extension: ".pdf", // Explicit extension
fileName: "document.pdf" // Original filename
);
var result = await markItDown.ConvertAsync(stream, streamInfo);
Memory Issues with Large Files:
// Use cancellation tokens to prevent runaway processing
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));
try
{
var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
}
catch (OperationCanceledException)
{
Console.WriteLine("Conversion timed out - file may be too large or complex");
}
Network Issues (URLs):
// Configure HttpClient for better reliability
using var httpClient = new HttpClient();
httpClient.Timeout = TimeSpan.FromSeconds(30);
httpClient.DefaultRequestHeaders.Add("User-Agent", "MarkItDown/1.0");
var markItDown = new MarkItDown(httpClient: httpClient);
Logging for Diagnostics:
using Microsoft.Extensions.Logging;
using var loggerFactory = LoggerFactory.Create(builder =>
builder.AddConsole().SetMinimumLevel(LogLevel.Debug));
var logger = loggerFactory.CreateLogger<MarkItDown>();
var markItDown = new MarkItDown(logger: logger);
// Now you'll see detailed conversion progress in console output
If you're familiar with the original Python library, here are the key differences:
Python | C#/.NET | Notes |
---|---|---|
`MarkItDown()` | `new MarkItDown()` | Similar constructor |
`markitdown.convert("file.pdf")` | `await markItDown.ConvertAsync("file.pdf")` | Async pattern |
`markitdown.convert(stream, file_extension=".pdf")` | `await markItDown.ConvertAsync(stream, streamInfo)` | StreamInfo object |
`markitdown.convert_url("https://...")` | `await markItDown.ConvertFromUrlAsync("https://...")` | Async URL conversion |
`llm_client=...` parameter | `ImageCaptioner`, `AudioTranscriber` delegates | More flexible callback system |
Plugin system | Not yet implemented | Planned for future release |
Example Migration:
# Python version
import markitdown
md = markitdown.MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
// C# version
using MarkItDown;
var markItDown = new MarkItDown();
var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);
# Clone the repository
git clone https://github.com/managedcode/markitdown.git
cd markitdown
# Build the solution
dotnet build
# Run tests
dotnet test
# Create NuGet package
dotnet pack --configuration Release
dotnet test --collect:"XPlat Code Coverage"
The command emits standard test results plus a Cobertura coverage report at `tests/MarkItDown.Tests/TestResults/<guid>/coverage.cobertura.xml`. Tools such as ReportGenerator can turn this into HTML or Markdown dashboards.
├── src/
│   └── MarkItDown/                # Core library
│       ├── Converters/            # Format-specific converters (HTML, PDF, audio, etc.)
│       ├── MarkItDown.cs          # Main conversion engine
│       ├── StreamInfoGuesser.cs   # MIME/charset/extension detection helpers
│       ├── MarkItDownOptions.cs   # Runtime configuration flags
│       └── ...                    # Shared utilities (UriUtilities, MimeMapping, etc.)
├── tests/
│   └── MarkItDown.Tests/          # xUnit + Shouldly tests, Python parity vectors
├── Directory.Build.props          # Shared build + packaging settings
└── README.md                      # This document
- Fork the repository.
- Create a feature branch (`git checkout -b feature/my-feature`).
- Add tests with xUnit/Shouldly mirroring relevant Python vectors.
- Run `dotnet test` (CI enforces green builds + coverage upload).
- Update docs or samples if behaviour changes.
- Submit a pull request for review.
- Azure Document Intelligence converter (options already scaffolded)
- Outlook `.msg` ingestion via MIT-friendly dependencies
- Performance optimizations and memory usage improvements
- Enhanced test coverage mirroring Python test vectors
- Plugin discovery & sandboxing for custom converters
- Built-in LLM caption/transcription providers (OpenAI, Azure AI)
- Incremental/streaming conversion APIs for large documents
- Cloud-native integration samples (Azure Functions, AWS Lambda)
- Command-line interface (CLI) for batch processing
MarkItDown is designed for high-performance document processing in production environments:
Feature | Benefit | Impact |
---|---|---|
Stream-based processing | No temporary files created | Faster I/O, lower disk usage |
Async/await throughout | Non-blocking operations | Better scalability, responsive UIs |
Memory efficient | Smart buffer reuse | Lower memory footprint for large documents |
Fast format detection | Lightweight MIME/extension sniffing | Quick routing to appropriate converter |
Parallel processing ready | Thread-safe converter instances | Handle multiple documents concurrently |
MarkItDown's performance depends on:
- Document size and complexity - Larger files with more formatting take longer to process
- File format - Some formats (like PDF) require more processing than others (like plain text)
- Available system resources - Memory, CPU, and I/O capabilities
- Optional services - Image captioning and audio transcription add processing time
Performance will vary based on your specific documents and environment. For production workloads, we recommend benchmarking with your actual document types and sizes.
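
One way to act on the benchmarking recommendation is a small timing helper built on `Stopwatch`. The sketch below is an assumption-laden starting point, not part of the library. `MeasureMedianAsync` is a hypothetical helper, and `Task.Delay` stands in for a real conversion call so that the block runs on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Runs an async operation several times and returns the median wall-clock time.
// One unmeasured warm-up run reduces JIT and cold-cache noise.
async Task<TimeSpan> MeasureMedianAsync(Func<Task> operation, int iterations = 5)
{
    await operation(); // warm-up, not measured

    var samples = new List<TimeSpan>();
    for (var i = 0; i < iterations; i++)
    {
        var stopwatch = Stopwatch.StartNew();
        await operation();
        stopwatch.Stop();
        samples.Add(stopwatch.Elapsed);
    }

    samples.Sort();
    return samples[samples.Count / 2];
}

// Placeholder workload; in a real benchmark this lambda would call the converter
// on a representative document instead.
var median = await MeasureMedianAsync(() => Task.Delay(10));
Console.WriteLine($"Median: {median.TotalMilliseconds:F1} ms");
```

In practice the lambda would wrap something like `markItDown.ConvertAsync("sample.pdf")`, run once per representative document type and size. For rigorous numbers, a dedicated tool such as BenchmarkDotNet is a better fit than hand-rolled timing.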
// 1. Reuse MarkItDown instances (they're thread-safe)
var markItDown = new MarkItDown();
await Task.WhenAll(
markItDown.ConvertAsync("file1.pdf"),
markItDown.ConvertAsync("file2.docx"),
markItDown.ConvertAsync("file3.html")
);
// 2. Use cancellation tokens for timeouts
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
// 3. Configure HttpClient for web content (reuse connections)
using var httpClient = new HttpClient();
var webMarkItDown = new MarkItDown(httpClient: httpClient); // Separate name: markItDown is already declared above
// 4. Pre-specify StreamInfo to skip format detection
var streamInfo = new StreamInfo(mimeType: "application/pdf", extension: ".pdf");
var pdfResult = await markItDown.ConvertAsync(stream, streamInfo); // Separate name: result is already declared above
var options = new MarkItDownOptions
{
EnableBuiltins = true, // Use built-in converters (default: true)
EnablePlugins = false, // Plugin system (reserved for future use)
ExifToolPath = "/usr/local/bin/exiftool" // Path to exiftool binary (optional)
};
var markItDown = new MarkItDown(options);
using Azure;
using OpenAI;
var options = new MarkItDownOptions
{
// Azure AI Vision for image captions
ImageCaptioner = async (bytes, info, token) =>
{
var client = new VisionServiceClient("your-endpoint", new AzureKeyCredential("your-key"));
var result = await client.AnalyzeImageAsync(bytes, token);
return $"Image: {result.Description?.Captions?.FirstOrDefault()?.Text ?? "Visual content"}";
},
// OpenAI Whisper for audio transcription
AudioTranscriber = async (bytes, info, token) =>
{
var client = new OpenAIClient("your-api-key");
using var stream = new MemoryStream(bytes);
var result = await client.AudioEndpoint.CreateTranscriptionAsync(
stream,
Path.GetFileName(info.FileName) ?? "audio",
cancellationToken: token);
return result.Text;
},
// Azure Document Intelligence for enhanced PDF/form processing
DocumentIntelligence = new DocumentIntelligenceOptions
{
Endpoint = "https://your-resource.cognitiveservices.azure.com/",
Credential = new AzureKeyCredential("your-document-intelligence-key"),
ApiVersion = "2023-10-31-preview"
}
};
var markItDown = new MarkItDown(options);
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.DependencyInjection;
// Set up dependency injection
var services = new ServiceCollection();
services.AddLogging(builder => builder.AddConsole().SetMinimumLevel(LogLevel.Information));
services.AddHttpClient();
var serviceProvider = services.BuildServiceProvider();
var logger = serviceProvider.GetRequiredService<ILogger<MarkItDown>>();
var httpClientFactory = serviceProvider.GetRequiredService<IHttpClientFactory>();
var options = new MarkItDownOptions
{
// Graceful degradation for image processing
ImageCaptioner = async (bytes, info, token) =>
{
try
{
// Your AI service call here
return await CallVisionServiceAsync(bytes, token);
}
catch (Exception ex)
{
logger.LogWarning("Image captioning failed: {Error}", ex.Message);
return $"[Image: {info.FileName ?? "unknown"}]"; // Fallback
}
}
};
var markItDown = new MarkItDown(options, logger, httpClientFactory.CreateClient());
This project is licensed under the MIT License - see the LICENSE file for details.
This project is a comprehensive C# port of the original Microsoft MarkItDown Python library, created by the Microsoft AutoGen team. We've reimagined it specifically for the .NET ecosystem while maintaining compatibility with the original's design philosophy and capabilities.
Key differences in this .NET version:
- Native .NET performance - Built from scratch in C#, not a Python wrapper
- Modern async patterns - Full async/await support with cancellation tokens
- NuGet ecosystem integration - Easy installation and dependency management
- Enterprise features - Comprehensive logging, error handling, and configuration
- Enhanced performance - Stream-based processing and memory optimizations
Maintained by: ManagedCode team
Original inspiration: Microsoft AutoGen team
License: MIT (same as the original Python version)
We're committed to maintaining feature parity with the upstream Python project while delivering the performance and developer experience that .NET developers expect.
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: Create an issue for support

⭐ Star this repository if you find it useful!
Made with ❤️ by ManagedCode