MarkItDown

.NET NuGet License: MIT

🚀 Transform any document into LLM-ready Markdown with this powerful C#/.NET library!

MarkItDown is a comprehensive document conversion library that transforms diverse file formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, and more) into clean, high-quality Markdown. Perfect for AI workflows, RAG (Retrieval-Augmented Generation) systems, content processing pipelines, and text analytics applications.

Why MarkItDown for .NET?

  • 🎯 Built for modern C# developers - Native .NET 9 library with async/await throughout
  • 🧠 LLM-optimized output - Clean Markdown that AI models love to consume
  • 📦 Zero-friction NuGet package - Just dotnet add package ManagedCode.MarkItDown and go
  • 🔄 Stream-based processing - Handle large documents efficiently without temporary files
  • 🛠️ Highly extensible - Add custom converters or integrate with AI services for captions/transcription

This is a high-fidelity C# port of Microsoft's original MarkItDown Python library, reimagined for the .NET ecosystem with modern async patterns, improved performance, and enterprise-ready features.

🌟 Why Choose MarkItDown?

For AI & LLM Applications

  • Perfect for RAG systems - Convert documents to searchable, contextual Markdown chunks
  • Token-efficient - Clean output maximizes your LLM token budget
  • Structured data preservation - Tables, headers, and lists maintain semantic meaning
  • Metadata extraction - Rich document properties for enhanced context

For .NET Developers

  • Native performance - Built from the ground up for .NET, not a wrapper
  • Modern async/await - Non-blocking I/O with full cancellation support
  • Memory efficient - Stream-based processing avoids loading entire files into memory
  • Enterprise ready - Proper error handling, logging, and configuration options

For Content Processing

  • 22+ file formats supported - From Office documents to web pages to archives
  • Batch processing ready - Handle hundreds of documents efficiently
  • Extensible architecture - Add custom converters for proprietary formats
  • Smart format detection - Automatic MIME type and encoding detection

Features

✨ Modern .NET - Targets .NET 9.0 with up-to-date language features
πŸ“¦ NuGet Package - Drop-in dependency for libraries and automation pipelines
πŸ”„ Async/Await - Fully asynchronous pipeline for responsive apps
🧠 LLM-Optimized - Markdown tailored for AI ingestion and summarisation
πŸ”§ Extensible - Register custom converters or plug additional caption/transcription services
🧭 Smart Detection - Automatic MIME, charset, and file-type guessing (including data/file URIs)
⚑ High Performance - Stream-friendly, minimal allocations, zero temp files

📋 Format Support

| Format | Extension | Status | Description |
| --- | --- | --- | --- |
| HTML | .html, .htm | ✅ Supported | Full HTML to Markdown conversion |
| Plain Text | .txt, .md | ✅ Supported | Direct text processing |
| PDF | .pdf | ✅ Supported | Adobe PDF documents with text extraction |
| Word | .docx | ✅ Supported | Microsoft Word documents with formatting |
| Excel | .xlsx | ✅ Supported | Microsoft Excel spreadsheets as tables |
| PowerPoint | .pptx | ✅ Supported | Microsoft PowerPoint presentations |
| Images | .jpg, .png, .gif, .bmp, .tiff, .webp | ✅ Supported | Exif metadata extraction + optional captions |
| Audio | .wav, .mp3, .m4a, .mp4 | ✅ Supported | Metadata extraction + optional transcription |
| CSV | .csv | ✅ Supported | Comma-separated values as Markdown tables |
| JSON | .json, .jsonl, .ndjson | ✅ Supported | Structured JSON data with formatting |
| XML | .xml, .xsd, .xsl, .rss, .atom | ✅ Supported | XML documents with structure preservation |
| EPUB | .epub | ✅ Supported | E-book files with metadata and content |
| Email | .eml | ✅ Supported | Email files with headers, content, and attachment info |
| ZIP | .zip | ✅ Supported | Archive processing with recursive file conversion |
| Jupyter Notebook | .ipynb | ✅ Supported | Python notebooks with code and markdown cells |
| RSS/Atom Feeds | .rss, .atom, .xml | ✅ Supported | Web feeds with structured content and metadata |
| YouTube URLs | YouTube links | ✅ Supported | Video metadata extraction and link formatting |
| Wikipedia Pages | wikipedia.org | ✅ Supported | Article-only extraction with clean Markdown |
| Bing SERPs | bing.com/search | ✅ Supported | Organic result summarisation |

HTML Conversion Features (AngleSharp powered)

  • Headers (H1-H6) → Markdown headers
  • Bold/Strong text → **bold**
  • Italic/Emphasis text → *italic*
  • Links → [text](url)
  • Images → ![alt](src)
  • Lists (ordered/unordered)
  • Tables with header detection and Markdown table output
  • Code blocks and inline code
  • Blockquotes, sections, semantic containers

PDF Conversion Features

  • Text extraction with page separation
  • Header detection based on formatting
  • List item recognition
  • Title extraction from document content

Office Documents (DOCX/XLSX/PPTX)

  • Word (.docx): Headers, paragraphs, tables, bold/italic formatting
  • Excel (.xlsx): Spreadsheet data as Markdown tables with sheet organization
  • PowerPoint (.pptx): Slide-by-slide content with title recognition

CSV Conversion Features

  • Automatic table formatting with headers
  • Proper escaping of special characters
  • Support for various CSV dialects
  • Handles quoted fields and embedded commas
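
The table formatting and escaping described above can be sketched as follows. This is a minimal illustration, not the library's CsvConverter: the CsvMarkdown class and its methods are hypothetical names, and the rows are assumed to have already been split by a CSV parser that handles quoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical rows, as a CSV parser might return them after handling quotes.
var rows = new List<string[]>
{
    new[] { "Name", "Note" },
    new[] { "Widget, large", "a|b" },
};

Console.WriteLine(CsvMarkdown.ToTable(rows));

// Illustrative helper; not part of the MarkItDown API.
static class CsvMarkdown
{
    public static string ToTable(IReadOnlyList<string[]> rows)
    {
        if (rows.Count == 0) return string.Empty;

        var sb = new StringBuilder();
        AppendRow(sb, rows[0]);
        // The separator row marks the first row as the table header.
        sb.Append('|').Append(string.Concat(rows[0].Select(_ => " --- |"))).Append('\n');
        foreach (var row in rows.Skip(1)) AppendRow(sb, row);
        return sb.ToString();
    }

    // A literal pipe would end the cell, and a line break would end the row.
    public static string EscapeCell(string value) =>
        value.Replace("|", "\\|").Replace("\r", "").Replace("\n", " ").Trim();

    private static void AppendRow(StringBuilder sb, string[] cells) =>
        sb.Append("| ").Append(string.Join(" | ", cells.Select(EscapeCell))).Append(" |\n");
}
```

Rows shorter than the header are not padded in this sketch; a production converter would likely normalise column counts first.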

JSON Conversion Features

  • Structured Format: Converts JSON objects to readable Markdown with proper hierarchy
  • JSON Lines Support: Processes .jsonl and .ndjson files line by line
  • Data Type Preservation: Maintains JSON data types (strings, numbers, booleans, null)
  • Nested Objects: Handles complex nested structures with proper indentation
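
To make the line-by-line JSON Lines handling concrete, here is a rough sketch built on System.Text.Json. It is an assumption about how such processing could look, not the library's JsonConverter; the JsonLines class and the "Record N" heading layout are invented for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

var input = "{\"id\":1,\"ok\":true}\n\n{\"id\":2,\"tags\":[\"a\",null]}\n";
Console.WriteLine(JsonLines.ToMarkdown(new StringReader(input)));

// Illustrative helper; not part of the MarkItDown API.
static class JsonLines
{
    public static string ToMarkdown(TextReader reader)
    {
        var sb = new StringBuilder();
        var index = 0;
        string? line;
        while ((line = reader.ReadLine()) is not null)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;   // blank lines are not records

            using var doc = JsonDocument.Parse(line);        // throws JsonException on malformed input
            index++;
            sb.Append("## Record ").Append(index).Append("\n\n");

            if (doc.RootElement.ValueKind == JsonValueKind.Object)
            {
                // GetRawText keeps the JSON spelling, so strings stay quoted
                // and numbers, booleans, and null remain distinguishable.
                foreach (var prop in doc.RootElement.EnumerateObject())
                    sb.Append("- **").Append(prop.Name).Append("**: `")
                      .Append(prop.Value.GetRawText()).Append("`\n");
            }
            else
            {
                sb.Append("- `").Append(doc.RootElement.GetRawText()).Append("`\n");
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

Each line is parsed independently, so one large file never has to be held in memory as a single JSON document.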

XML Conversion Features

  • Structure Preservation: Maintains XML hierarchy as Markdown headings
  • Attributes Handling: Converts XML attributes to Markdown lists
  • Multiple Formats: Supports XML, XSD, XSL, RSS, and Atom feeds
  • CDATA Support: Properly handles CDATA sections as code blocks

EPUB Conversion Features

  • Metadata Extraction: Extracts title, author, publisher, and other Dublin Core metadata
  • Content Order: Processes content files in proper reading order using spine information
  • HTML Processing: Converts XHTML content using the HTML converter
  • Table of Contents: Maintains document structure from the original EPUB

ZIP Archive Features

  • Recursive Processing: Extracts and converts all supported files within archives
  • Structure Preservation: Maintains original file paths and organization
  • Multi-Format Support: Processes different file types within the same archive
  • Error Handling: Continues processing even if individual files fail
  • Size Limits: Protects against memory issues with large files
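
The combination of per-entry conversion, size limits, and continue-on-failure can be sketched with System.IO.Compression. This is only an outline under stated assumptions: ZipWalker, the maxEntryBytes parameter, and the convert delegate are hypothetical and do not mirror the library's ZipConverter.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Threading.Tasks;

// Build a tiny archive in memory so the sketch runs without files on disk.
using var buffer = new MemoryStream();
using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
{
    AddEntry(zip, "docs/a.txt", "hello");
    AddEntry(zip, "docs/b.bin", "??");
    AddEntry(zip, "docs/huge.txt", new string('x', 2000));
}
buffer.Position = 0;

var markdown = await ZipWalker.ConvertAsync(
    buffer,
    maxEntryBytes: 1024,
    convert: async (stream, name) =>
    {
        // Stand-in for a real converter lookup: only text is "supported" here.
        if (name.EndsWith(".bin", StringComparison.Ordinal))
            throw new NotSupportedException("no converter for this entry");
        using var reader = new StreamReader(stream);
        return await reader.ReadToEndAsync();
    });
Console.WriteLine(markdown);

static void AddEntry(ZipArchive zip, string name, string text)
{
    using var writer = new StreamWriter(zip.CreateEntry(name).Open());
    writer.Write(text);
}

// Illustrative helper; not part of the MarkItDown API.
static class ZipWalker
{
    public static async Task<string> ConvertAsync(
        Stream archiveStream, long maxEntryBytes, Func<Stream, string, Task<string>> convert)
    {
        var sb = new StringBuilder();
        using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read, leaveOpen: true);
        foreach (var entry in archive.Entries)
        {
            if (entry.FullName.EndsWith('/')) continue;      // directory entry
            sb.Append("## File: ").Append(entry.FullName).Append("\n\n");

            if (entry.Length > maxEntryBytes)                // declared uncompressed size
            {
                sb.Append("*Skipped: entry exceeds size limit.*\n\n");
                continue;
            }
            try
            {
                using var entryStream = entry.Open();
                sb.Append(await convert(entryStream, entry.FullName)).Append("\n\n");
            }
            catch (Exception ex)
            {
                // One bad entry should not abort the whole archive.
                sb.Append("*Conversion failed: ").Append(ex.Message).Append("*\n\n");
            }
        }
        return sb.ToString();
    }
}
```

Nested archives could be handled by having the convert delegate call ZipWalker.ConvertAsync again. Note that entry.Length comes from the archive header, so a hardened implementation would also cap the bytes actually read.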

Jupyter Notebook Conversion Features

  • Cell Type Support: Processes markdown, code, and raw cells appropriately
  • Metadata Extraction: Extracts notebook title, kernel information, and language details
  • Code Output Handling: Captures and formats execution results, streams, and errors
  • Syntax Highlighting: Preserves language information for proper code block formatting
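
As a rough illustration of the cell handling listed above, the sketch below reads a notebook with System.Text.Json and emits Markdown cells as-is and code cells as fenced blocks tagged with the notebook language. NotebookMarkdown is a hypothetical name, outputs and kernel metadata are deliberately left out, and raw cells are treated like Markdown for brevity.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.Json;

var notebook = """
{
  "metadata": { "language_info": { "name": "python" } },
  "cells": [
    { "cell_type": "markdown", "source": ["# Title\n", "Some text"] },
    { "cell_type": "code", "source": "print(1)" }
  ]
}
""";

Console.WriteLine(NotebookMarkdown.Convert(notebook));

// Illustrative helper; not part of the MarkItDown API.
static class NotebookMarkdown
{
    public static string Convert(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;

        // The language tag for code fences, when the notebook declares one.
        var language = root.TryGetProperty("metadata", out var meta)
            && meta.TryGetProperty("language_info", out var info)
            && info.TryGetProperty("name", out var name)
                ? name.GetString() ?? ""
                : "";

        var fence = new string('`', 3);   // Markdown code fence
        var sb = new StringBuilder();
        foreach (var cell in root.GetProperty("cells").EnumerateArray())
        {
            var source = ReadSource(cell.GetProperty("source"));
            if (cell.GetProperty("cell_type").GetString() == "code")
                sb.Append(fence).Append(language).Append('\n')
                  .Append(source).Append('\n').Append(fence).Append("\n\n");
            else
                sb.Append(source).Append("\n\n");
        }
        return sb.ToString();
    }

    // The notebook format stores "source" either as one string or as a list of lines.
    private static string ReadSource(JsonElement source) =>
        source.ValueKind == JsonValueKind.Array
            ? string.Concat(source.EnumerateArray().Select(e => e.GetString()))
            : source.GetString() ?? "";
}
```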

RSS/Atom Feed Conversion Features

  • Multi-Format Support: Handles RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds
  • Feed Metadata: Extracts title, description, last update date, and author information
  • Article Processing: Converts feed items with proper title linking and content formatting
  • Date Formatting: Normalizes publication dates across different feed formats

YouTube URL Conversion Features

  • URL Recognition: Supports standard and shortened YouTube URLs (youtube.com, youtu.be)
  • Metadata Extraction: Extracts video ID and URL parameters with descriptions
  • Embed Integration: Provides thumbnail images and multiple access methods
  • Parameter Parsing: Decodes common YouTube URL parameters (playlist, timestamps, etc.)
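
URL recognition for the two host styles mentioned above might look like the following sketch. YouTubeUrl.TryGetVideoId is a hypothetical helper, not the library's YouTubeUrlConverter, and it ignores other path shapes such as /embed/ or /shorts/.

```csharp
using System;
using System.Web;   // HttpUtility.ParseQueryString

Console.WriteLine(YouTubeUrl.TryGetVideoId("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42s")); // dQw4w9WgXcQ
Console.WriteLine(YouTubeUrl.TryGetVideoId("https://youtu.be/dQw4w9WgXcQ?t=42"));                 // dQw4w9WgXcQ
Console.WriteLine(YouTubeUrl.TryGetVideoId("https://example.com/watch?v=abc") ?? "(not YouTube)");

// Illustrative helper; not part of the MarkItDown API.
static class YouTubeUrl
{
    public static string? TryGetVideoId(string url)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri)) return null;

        var host = uri.Host.ToLowerInvariant();
        if (host == "youtu.be")
        {
            // Short links carry the ID as the path: https://youtu.be/<id>
            var id = uri.AbsolutePath.Trim('/');
            return id.Length > 0 ? id : null;
        }
        if (host == "youtube.com" || host.EndsWith(".youtube.com", StringComparison.Ordinal))
        {
            // Standard links carry the ID in the "v" query parameter.
            var id = HttpUtility.ParseQueryString(uri.Query)["v"];
            return string.IsNullOrEmpty(id) ? null : id;
        }
        return null;
    }
}
```

Other parameters such as t (timestamp) or list (playlist) can be read from the same parsed query collection.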

Image Conversion Features

  • Support for JPEG, PNG, GIF, BMP, TIFF, WebP
  • Exif metadata extraction via exiftool (optional)
  • Optional multimodal image captioning hook (LLM integration ready)
  • Graceful fallback when metadata/captioning unavailable

Audio Conversion Features

  • Handles WAV/MP3/M4A/MP4 containers
  • Extracts key metadata (artist, album, duration, channels, etc.)
  • Optional transcription delegate for speech-to-text results
  • Markdown summary highlighting metadata and transcript

🚀 Quick Start

Installation

Install via NuGet Package Manager:

# Package Manager Console
Install-Package ManagedCode.MarkItDown

# .NET CLI
dotnet add package ManagedCode.MarkItDown

# PackageReference (add to your .csproj)
<PackageReference Include="ManagedCode.MarkItDown" Version="0.0.3" />

Prerequisites

  • .NET 9.0 SDK or later
  • Compatible with .NET 9 apps and libraries

πŸƒβ€β™‚οΈ 60-Second Quick Start

using MarkItDown;

// Create converter instance
var markItDown = new MarkItDown();

// Convert any file to Markdown
var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);

// That's it! MarkItDown handles format detection automatically

📚 Real-World Examples

RAG System Document Ingestion

using MarkItDown;
using Microsoft.Extensions.Logging;

// Set up logging to track conversion progress
using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = loggerFactory.CreateLogger<MarkItDown>();
var markItDown = new MarkItDown(logger: logger);

// Convert documents for vector database ingestion
string[] documents = { "report.pdf", "data.xlsx", "webpage.html" };
var markdownChunks = new List<string>();

foreach (var doc in documents)
{
    try 
    {
        var result = await markItDown.ConvertAsync(doc);
        markdownChunks.Add($"# Document: {result.Title ?? Path.GetFileName(doc)}\n\n{result.Markdown}");
        logger.LogInformation("Converted {Document} ({Length} characters)", doc, result.Markdown.Length);
    }
    catch (UnsupportedFormatException ex)
    {
        logger.LogWarning("Skipped unsupported file {Document}: {Error}", doc, ex.Message);
    }
}

// markdownChunks now ready for embedding and vector storage

Batch Email Processing

using MarkItDown;

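var markItDown = new MarkItDown();
var emailFolder = @"C:\Emails\Exports";
var outputFolder = @"C:\ProcessedEmails";
Directory.CreateDirectory(outputFolder); // no-op if the folder already exists

// Directory.EnumerateFiles is synchronous, so a plain foreach is sufficient
foreach (var emlFile in Directory.EnumerateFiles(emailFolder, "*.eml"))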
{
    var result = await markItDown.ConvertAsync(emlFile);
    
    // Extract metadata
    Console.WriteLine($"Email: {result.Title}");
    Console.WriteLine($"Converted to {result.Markdown.Length} characters of Markdown");
    
    // Save processed version
    var outputPath = Path.Combine(outputFolder, Path.ChangeExtension(Path.GetFileName(emlFile), ".md"));
    await File.WriteAllTextAsync(outputPath, result.Markdown);
}

Web Content Processing

using MarkItDown;
using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
using var httpClient = new HttpClient();

var markItDown = new MarkItDown(
    logger: loggerFactory.CreateLogger<MarkItDown>(),
    httpClient: httpClient);

// Convert web pages directly
var urls = new[] 
{
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://docs.microsoft.com/en-us/dotnet/csharp/",
    "https://github.com/microsoft/semantic-kernel"
};

foreach (var url in urls)
{
    var result = await markItDown.ConvertFromUrlAsync(url);
    Console.WriteLine($"📄 {result.Title}");
    Console.WriteLine($"🔗 Source: {url}");
    Console.WriteLine($"📝 Content: {result.Markdown.Length} characters");
    Console.WriteLine("---");
}

Optional Dependencies for Advanced Features

  • PDF Support: Provided via PdfPig (bundled)
  • Office Documents: Provided via DocumentFormat.OpenXml (bundled)
  • Image metadata: Install ExifTool for richer output (brew install exiftool, choco install exiftool)
  • Image captions: Supply an ImageCaptioner delegate (e.g., calls to an LLM or vision service)
  • Audio transcription: Supply an AudioTranscriber delegate (e.g., Azure Cognitive Services, OpenAI Whisper)

Note: External tools are optional; MarkItDown degrades gracefully when they are absent.

💻 Usage

Convert a local file

using MarkItDown;

// Convert a DOCX file and print the Markdown
var markItDown = new MarkItDown();
DocumentConverterResult result = await markItDown.ConvertAsync("report.docx");
Console.WriteLine(result.Markdown);

Convert a stream with metadata overrides

using System.IO;
using System.Text;
using MarkItDown;

using var stream = File.OpenRead("invoice.html");
var streamInfo = new StreamInfo(
    mimeType: "text/html",
    extension: ".html",
    charset: Encoding.UTF8,
    fileName: "invoice.html");

var markItDown = new MarkItDown();
var result = await markItDown.ConvertAsync(stream, streamInfo);
Console.WriteLine(result.Title);

Convert email files (EML)

using MarkItDown;

// Convert an EML file to Markdown
var markItDown = new MarkItDown();
DocumentConverterResult result = await markItDown.ConvertAsync("message.eml");

// The result includes email headers and content
Console.WriteLine($"Subject: {result.Title}");
Console.WriteLine(result.Markdown);
// Output includes:
// # Email
// **Subject:** Important Project Update
// **From:** sender@example.com
// **To:** recipient@example.com
// **Date:** 2024-01-15 10:30:00 +00:00
// 
// ## Message Content
// [Email body content converted to Markdown]
// 
// ## Attachments (if any)
// - file.pdf (application/pdf) - 1.2 MB

Convert content from HTTP/HTTPS

using MarkItDown;
using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(static builder => builder.AddConsole());
using var httpClient = new HttpClient();

var markItDown = new MarkItDown(
    logger: loggerFactory.CreateLogger<MarkItDown>(),
    httpClient: httpClient);

DocumentConverterResult urlResult = await markItDown.ConvertFromUrlAsync("https://contoso.example/blog");
Console.WriteLine(urlResult.Title);

Customise the pipeline with options

using Azure;
using MarkItDown;

var options = new MarkItDownOptions
{
    // Plug in your own services (Azure AI, OpenAI, etc.).
    // myCaptionService and speechClient are placeholders for clients you provide.
    ImageCaptioner = async (bytes, info, token) =>
        await myCaptionService.DescribeAsync(bytes, info, token),
    AudioTranscriber = async (bytes, info, token) =>
        await speechClient.TranscribeAsync(bytes, info, token),
    DocumentIntelligence = new DocumentIntelligenceOptions
    {
        Endpoint = "https://<your-resource>.cognitiveservices.azure.com/",
        Credential = new AzureKeyCredential("<document-intelligence-key>")
    }
};

var markItDown = new MarkItDown(options);

Custom converters

Create your own format converters by implementing IDocumentConverter:

using System.IO;
using MarkItDown;

public sealed class MyCustomConverter : IDocumentConverter
{
    public int Priority => ConverterPriority.SpecificFileFormat;

    public bool AcceptsInput(StreamInfo streamInfo) =>
        string.Equals(streamInfo.Extension, ".mycustom", StringComparison.OrdinalIgnoreCase);

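    public async Task<DocumentConverterResult> ConvertAsync(
        Stream stream,
        StreamInfo streamInfo,
        CancellationToken cancellationToken = default)
    {
        // Rewind only when the stream supports it; network streams often do not.
        if (stream.CanSeek)
            stream.Seek(0, SeekOrigin.Begin);

        // leaveOpen: the caller owns the stream and is responsible for disposing it.
        using var reader = new StreamReader(stream, leaveOpen: true);
        var body = await reader.ReadToEndAsync(cancellationToken);
        var markdown = "# Converted from custom format\n\n" + body;
        return new DocumentConverterResult(markdown, "Custom document");
    }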
}

var markItDown = new MarkItDown();
markItDown.RegisterConverter(new MyCustomConverter());

🎯 Advanced Usage Patterns

Custom Format Converters

using MarkItDown;

public class PowerBIConverter : IDocumentConverter
{
    public int Priority => 150; // Between HTML and PlainText

    public bool AcceptsInput(StreamInfo streamInfo) =>
        streamInfo.Extension?.ToLowerInvariant() == ".pbix" ||
        streamInfo.MimeType?.Contains("powerbi") == true;

    public async Task<DocumentConverterResult> ConvertAsync(
        Stream stream, 
        StreamInfo streamInfo, 
        CancellationToken cancellationToken = default)
    {
        // Custom PowerBI file processing logic here
        var markdown = await ProcessPowerBIFile(stream, cancellationToken);
        return new DocumentConverterResult(markdown, "PowerBI Report");
    }
    
    private async Task<string> ProcessPowerBIFile(Stream stream, CancellationToken cancellationToken)
    {
        // Implementation details...
        await Task.Delay(100, cancellationToken); // Placeholder
        return "# PowerBI Report\n\nProcessed PowerBI content here...";
    }
}

Batch Processing with Progress Tracking

using MarkItDown;
using Microsoft.Extensions.Logging;

public class DocumentProcessor
{
    private readonly MarkItDown _markItDown;
    private readonly ILogger<DocumentProcessor> _logger;

    public DocumentProcessor(ILogger<DocumentProcessor> logger)
    {
        _logger = logger;
        _markItDown = new MarkItDown(logger: logger);
    }

    public async Task<List<ProcessedDocument>> ProcessDirectoryAsync(
        string directoryPath, 
        string outputPath,
        IProgress<ProcessingProgress>? progress = null)
    {
        var files = Directory.EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories)
            .Where(f => !Path.GetFileName(f).StartsWith('.'))
            .ToList();

        var results = new List<ProcessedDocument>();
        var processed = 0;

        await Parallel.ForEachAsync(files, new ParallelOptions 
        { 
            MaxDegreeOfParallelism = Environment.ProcessorCount 
        },
        async (file, cancellationToken) =>
        {
            try
            {
                var result = await _markItDown.ConvertAsync(file, cancellationToken: cancellationToken);
                var outputFile = Path.Combine(outputPath, 
                    Path.ChangeExtension(Path.GetRelativePath(directoryPath, file), ".md"));
                
                Directory.CreateDirectory(Path.GetDirectoryName(outputFile)!);
                await File.WriteAllTextAsync(outputFile, result.Markdown, cancellationToken);
                
                lock (results)
                {
                    results.Add(new ProcessedDocument(file, outputFile, result.Markdown.Length));
                    processed++;
                    progress?.Report(new ProcessingProgress(processed, files.Count, file));
                }
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Failed to process {File}", file);
            }
        });

        return results;
    }
}

public record ProcessedDocument(string InputPath, string OutputPath, int CharacterCount);
public record ProcessingProgress(int Processed, int Total, string CurrentFile);

Integration with Vector Databases

using System.Text; // StringBuilder
using MarkItDown;
using Microsoft.Extensions.VectorData;

public class DocumentIndexer
{
    private readonly MarkItDown _markItDown;
    private readonly IVectorStore _vectorStore;

    public DocumentIndexer(IVectorStore vectorStore)
    {
        _vectorStore = vectorStore;
        _markItDown = new MarkItDown();
    }

    public async Task IndexDocumentAsync(string filePath)
    {
        // Convert to Markdown
        var result = await _markItDown.ConvertAsync(filePath);
        
        // Split into chunks for better vector search
        var chunks = SplitIntoChunks(result.Markdown, maxChunkSize: 500);
        
        // The record type must match what is upserted below (string keys, DocumentChunk records)
        var collection = _vectorStore.GetCollection<string, DocumentChunk>("documents");
        
        for (int i = 0; i < chunks.Count; i++)
        {
            var document = new DocumentChunk
            {
                Id = $"{Path.GetFileName(filePath)}_{i}",
                Content = chunks[i],
                Title = result.Title ?? Path.GetFileName(filePath),
                Source = filePath,
                ChunkIndex = i
            };

            await collection.UpsertAsync(document);
        }
    }
    
    private List<string> SplitIntoChunks(string markdown, int maxChunkSize)
    {
        // Smart chunking logic that preserves markdown structure
        var chunks = new List<string>();
        var lines = markdown.Split('\n');
        var currentChunk = new StringBuilder();
        
        foreach (var line in lines)
        {
            if (currentChunk.Length + line.Length > maxChunkSize && currentChunk.Length > 0)
            {
                chunks.Add(currentChunk.ToString().Trim());
                currentChunk.Clear();
            }
            currentChunk.AppendLine(line);
        }
        
        if (currentChunk.Length > 0)
            chunks.Add(currentChunk.ToString().Trim());
            
        return chunks;
    }
}

public class DocumentChunk
{
    public string Id { get; set; } = "";
    public string Content { get; set; } = "";
    public string Title { get; set; } = "";
    public string Source { get; set; } = "";
    public int ChunkIndex { get; set; }
}

Cloud Function Integration

// Azure Functions example
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.Extensions.Logging;
using MarkItDown;

public class DocumentConversionFunction
{
    private readonly MarkItDown _markItDown;
    private readonly ILogger<DocumentConversionFunction> _logger;

    public DocumentConversionFunction(ILogger<DocumentConversionFunction> logger)
    {
        _logger = logger;
        _markItDown = new MarkItDown(logger: logger);
    }

    [Function("ConvertDocument")]
    public async Task<HttpResponseData> ConvertDocument(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req)
    {
        try
        {
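            // Note: ReadFormAsync is assumed to be a multipart-parsing helper you supply;
            // HttpRequestData itself exposes the raw request Body stream.
            var formData = await req.ReadFormAsync();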
            var file = formData.Files.FirstOrDefault();
            
            if (file == null)
            {
                var badResponse = req.CreateResponse(System.Net.HttpStatusCode.BadRequest);
                await badResponse.WriteStringAsync("No file uploaded");
                return badResponse;
            }

            var streamInfo = new StreamInfo(
                extension: Path.GetExtension(file.FileName),
                fileName: file.FileName,
                mimeType: file.ContentType
            );

            var result = await _markItDown.ConvertAsync(file.OpenReadStream(), streamInfo);
            
            var response = req.CreateResponse(System.Net.HttpStatusCode.OK);
            response.Headers.Add("Content-Type", "application/json");
            
            await response.WriteAsJsonAsync(new 
            { 
                title = result.Title,
                markdown = result.Markdown,
                characterCount = result.Markdown.Length
            });
            
            return response;
        }
        catch (UnsupportedFormatException ex)
        {
            var response = req.CreateResponse(System.Net.HttpStatusCode.UnsupportedMediaType);
            await response.WriteStringAsync($"Unsupported file format: {ex.Message}");
            return response;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Document conversion failed");
            var response = req.CreateResponse(System.Net.HttpStatusCode.InternalServerError);
            await response.WriteStringAsync("Internal server error");
            return response;
        }
    }
}

🏗️ Architecture

Core Components

  • MarkItDown - Main entry point for conversions
  • IDocumentConverter - Interface for format-specific converters
  • DocumentConverterResult - Contains the converted Markdown and optional metadata
  • StreamInfo - Metadata about the input stream (MIME type, extension, charset, etc.)
  • ConverterRegistration - Associates converters with priority for selection

Built-in Converters

MarkItDown includes these converters in priority order:

  • YouTubeUrlConverter - Video metadata from YouTube URLs
  • HtmlConverter - HTML to Markdown using AngleSharp
  • WikipediaConverter - Clean article extraction from Wikipedia pages
  • BingSerpConverter - Search result summaries from Bing
  • RssFeedConverter - RSS/Atom feeds with article processing
  • JsonConverter - Structured JSON data with formatting
  • JupyterNotebookConverter - Python notebooks with code and markdown cells
  • CsvConverter - CSV files as Markdown tables
  • EpubConverter - E-book content and metadata
  • EmlConverter - Email files with headers and attachments
  • XmlConverter - XML documents with structure preservation
  • ZipConverter - Archive processing with recursive conversion
  • PdfConverter - PDF text extraction using PdfPig
  • DocxConverter - Microsoft Word documents
  • XlsxConverter - Microsoft Excel spreadsheets
  • PptxConverter - Microsoft PowerPoint presentations
  • AudioConverter - Audio metadata and optional transcription
  • ImageConverter - Image metadata via ExifTool and optional captions
  • PlainTextConverter - Plain text, Markdown, and other text formats (fallback)

Converter Priority & Detection

  • Priority-based dispatch (lower values processed first)
  • Automatic stream sniffing via StreamInfoGuesser
  • Manual overrides via MarkItDownOptions or StreamInfo
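
Priority-based dispatch can be pictured with the small sketch below. It uses stand-in types (ISketchConverter, ExtensionConverter, Dispatcher) rather than the library's IDocumentConverter and registration classes, and it assumes selection simply means "first accepting converter in ascending priority order"; the real engine may differ, for example by falling through to the next converter on failure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var converters = new List<ISketchConverter>
{
    new ExtensionConverter("plain-text", 1000, null),   // fallback: accepts anything
    new ExtensionConverter("html", 100, ".html"),
};

Console.WriteLine(Dispatcher.Select(converters, ".html")?.Name);  // html
Console.WriteLine(Dispatcher.Select(converters, ".log")?.Name);   // plain-text

// Stand-in for a converter contract; names are hypothetical.
interface ISketchConverter
{
    string Name { get; }
    int Priority { get; }
    bool Accepts(string extension);
}

// Accepts one extension, or everything when Extension is null.
sealed record ExtensionConverter(string Name, int Priority, string? Extension) : ISketchConverter
{
    public bool Accepts(string extension) =>
        Extension is null || string.Equals(Extension, extension, StringComparison.OrdinalIgnoreCase);
}

static class Dispatcher
{
    public static ISketchConverter? Select(IEnumerable<ISketchConverter> converters, string extension) =>
        converters
            .OrderBy(c => c.Priority)                    // lower value is tried first; OrderBy is stable
            .FirstOrDefault(c => c.Accepts(extension));
}
```

Giving the catch-all converter the highest numeric priority keeps it from shadowing format-specific converters, regardless of registration order.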

🚨 Error Handling & Troubleshooting

Common Exceptions

using MarkItDown;

var markItDown = new MarkItDown();

try
{
    var result = await markItDown.ConvertAsync("document.pdf");
    Console.WriteLine(result.Markdown);
}
catch (UnsupportedFormatException ex)
{
    // File format not supported by any converter
    Console.WriteLine($"Cannot process this file type: {ex.Message}");
}
catch (FileNotFoundException ex)
{
    // File path doesn't exist
    Console.WriteLine($"File not found: {ex.Message}");
}
catch (UnauthorizedAccessException ex)
{
    // Permission issues
    Console.WriteLine($"Access denied: {ex.Message}");
}
catch (MarkItDownException ex)
{
    // General conversion errors (corrupt files, parsing issues, etc.)
    Console.WriteLine($"Conversion failed: {ex.Message}");
    if (ex.InnerException != null)
        Console.WriteLine($"Details: {ex.InnerException.Message}");
}

Troubleshooting Tips

File Format Detection Issues:

// Force specific format detection
var streamInfo = new StreamInfo(
    mimeType: "application/pdf",  // Explicit MIME type
    extension: ".pdf",            // Explicit extension
    fileName: "document.pdf"      // Original filename
);

var result = await markItDown.ConvertAsync(stream, streamInfo);
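
When it is unclear which MIME type to force, checking the first few bytes of the stream can help. The sketch below is a hypothetical helper (MagicBytes.GuessMimeType), not the library's StreamInfoGuesser, and it recognises only three well-known signatures.

```csharp
using System;
using System.IO;

// "%PDF-" is the signature at the start of a PDF file.
using var pdf = new MemoryStream("%PDF-1.7\n"u8.ToArray());
Console.WriteLine(MagicBytes.GuessMimeType(pdf) ?? "unknown");   // application/pdf
Console.WriteLine(pdf.Position);                                 // 0 (the stream was rewound)

// Illustrative helper; not part of the MarkItDown API.
static class MagicBytes
{
    private static ReadOnlySpan<byte> Pdf => "%PDF-"u8;
    private static ReadOnlySpan<byte> Zip => new byte[] { 0x50, 0x4B, 0x03, 0x04 };
    private static ReadOnlySpan<byte> Png => new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    public static string? GuessMimeType(Stream stream)
    {
        if (!stream.CanSeek) return null;    // cannot peek without consuming data

        Span<byte> buffer = stackalloc byte[8];
        var start = stream.Position;
        var read = stream.Read(buffer);      // may return fewer than 8 bytes
        stream.Position = start;             // leave the stream where we found it
        ReadOnlySpan<byte> header = buffer[..read];

        if (header.StartsWith(Pdf)) return "application/pdf";
        if (header.StartsWith(Zip)) return "application/zip";
        if (header.StartsWith(Png)) return "image/png";
        return null;
    }
}
```

DOCX, XLSX, PPTX, and EPUB files are ZIP containers, so they report as application/zip here; the file extension is still needed to tell them apart. The guessed value can then be passed as the mimeType argument of StreamInfo.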

Memory Issues with Large Files:

// Use cancellation tokens to prevent runaway processing
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

try 
{
    var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
}
catch (OperationCanceledException)
{
    Console.WriteLine("Conversion timed out - file may be too large or complex");
}

Network Issues (URLs):

// Configure HttpClient for better reliability
using var httpClient = new HttpClient();
httpClient.Timeout = TimeSpan.FromSeconds(30);
httpClient.DefaultRequestHeaders.Add("User-Agent", "MarkItDown/1.0");

var markItDown = new MarkItDown(httpClient: httpClient);

Logging for Diagnostics:

using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(builder => 
    builder.AddConsole().SetMinimumLevel(LogLevel.Debug));

var logger = loggerFactory.CreateLogger<MarkItDown>();
var markItDown = new MarkItDown(logger: logger);

// Now you'll see detailed conversion progress in console output

🔄 Development & Contributing

Migration from Python MarkItDown

If you're familiar with the original Python library, here are the key differences:

| Python | C#/.NET | Notes |
| --- | --- | --- |
| MarkItDown() | new MarkItDown() | Similar constructor |
| markitdown.convert("file.pdf") | await markItDown.ConvertAsync("file.pdf") | Async pattern |
| markitdown.convert(stream, file_extension=".pdf") | await markItDown.ConvertAsync(stream, streamInfo) | StreamInfo object |
| markitdown.convert_url("https://...") | await markItDown.ConvertFromUrlAsync("https://...") | Async URL conversion |
| llm_client=... parameter | ImageCaptioner, AudioTranscriber delegates | More flexible callback system |
| Plugin system | Not yet implemented | Planned for future release |

Example Migration:

# Python version
import markitdown
md = markitdown.MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

// C# version
using MarkItDown;
var markItDown = new MarkItDown();
var result = await markItDown.ConvertAsync("document.pdf");
Console.WriteLine(result.Markdown);

Building from Source

# Clone the repository
git clone https://github.com/managedcode/markitdown.git
cd markitdown

# Build the solution
dotnet build

# Run tests
dotnet test

# Create NuGet package
dotnet pack --configuration Release

Tests & Coverage

dotnet test --collect:"XPlat Code Coverage"

The command emits standard test results plus a Cobertura coverage report at tests/MarkItDown.Tests/TestResults/<guid>/coverage.cobertura.xml. Tools such as ReportGenerator can turn this into HTML or Markdown dashboards.

Project Structure

├── src/
│   └── MarkItDown/                 # Core library
│       ├── Converters/             # Format-specific converters (HTML, PDF, audio, etc.)
│       ├── MarkItDown.cs          # Main conversion engine
│       ├── StreamInfoGuesser.cs   # MIME/charset/extension detection helpers
│       ├── MarkItDownOptions.cs   # Runtime configuration flags
│       └── ...                    # Shared utilities (UriUtilities, MimeMapping, etc.)
├── tests/
│   └── MarkItDown.Tests/          # xUnit + Shouldly tests, Python parity vectors
├── Directory.Build.props          # Shared build + packaging settings
└── README.md                      # This document

Contributing Guidelines

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/my-feature).
  3. Add tests with xUnit/Shouldly mirroring relevant Python vectors.
  4. Run dotnet test (CI enforces green builds + coverage upload).
  5. Update docs or samples if behaviour changes.
  6. Submit a pull request for review.

🗺️ Roadmap

🎯 Near-Term

  • Azure Document Intelligence converter (options already scaffolded)
  • Outlook .msg ingestion via MIT-friendly dependencies
  • Performance optimizations and memory usage improvements
  • Enhanced test coverage mirroring Python test vectors

🎯 Future Ideas

  • Plugin discovery & sandboxing for custom converters
  • Built-in LLM caption/transcription providers (OpenAI, Azure AI)
  • Incremental/streaming conversion APIs for large documents
  • Cloud-native integration samples (Azure Functions, AWS Lambda)
  • Command-line interface (CLI) for batch processing

📈 Performance

MarkItDown is designed for high-performance document processing in production environments:

🚀 Performance Characteristics

| Feature | Benefit | Impact |
| --- | --- | --- |
| Stream-based processing | No temporary files created | Faster I/O, lower disk usage |
| Async/await throughout | Non-blocking operations | Better scalability, responsive UIs |
| Memory efficient | Smart buffer reuse | Lower memory footprint for large documents |
| Fast format detection | Lightweight MIME/extension sniffing | Quick routing to appropriate converter |
| Parallel processing ready | Thread-safe converter instances | Handle multiple documents concurrently |

📊 Performance Considerations

MarkItDown's performance depends on:

  • Document size and complexity - Larger files with more formatting take longer to process
  • File format - Some formats (like PDF) require more processing than others (like plain text)
  • Available system resources - Memory, CPU, and I/O capabilities
  • Optional services - Image captioning and audio transcription add processing time

Performance will vary based on your specific documents and environment. For production workloads, we recommend benchmarking with your actual document types and sizes.
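
One lightweight way to follow the benchmarking advice is a small timing helper like the sketch below. Bench.MedianAsync is a hypothetical name, and Task.Delay stands in for a real conversion call so the example runs on its own; for serious measurements a dedicated benchmarking tool is a better fit.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Task.Delay is a placeholder workload for this sketch.
var median = await Bench.MedianAsync(() => Task.Delay(20), runs: 5);
Console.WriteLine($"median: {median.TotalMilliseconds:F1} ms");

// Illustrative helper; not part of the MarkItDown API.
static class Bench
{
    public static async Task<TimeSpan> MedianAsync(Func<Task> action, int runs = 5)
    {
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        await action();                       // warm-up run, not measured

        var samples = new TimeSpan[runs];
        for (var i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            await action();
            samples[i] = sw.Elapsed;
        }

        Array.Sort(samples);
        return samples[runs / 2];             // middle sample (upper middle for even counts)
    }
}
```

To time an actual conversion, the lambda could be replaced with something like () => markItDown.ConvertAsync("your-document.pdf"), using one of your own representative files.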

⚡ Optimization Tips

// 1. Reuse a single MarkItDown instance (instances are thread-safe)
var markItDown = new MarkItDown();
var results = await Task.WhenAll(
    markItDown.ConvertAsync("file1.pdf"),
    markItDown.ConvertAsync("file2.docx"),
    markItDown.ConvertAsync("file3.html")
);

// 2. Use cancellation tokens for timeouts
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var largeResult = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);

// 3. Supply your own HttpClient for web content so connections are reused
//    (share one long-lived client rather than creating one per conversion)
using var httpClient = new HttpClient();
var webMarkItDown = new MarkItDown(httpClient: httpClient);

// 4. Pre-specify StreamInfo to skip format detection
//    (`stream` is any readable Stream you already hold, e.g. File.OpenRead("report.pdf"))
var streamInfo = new StreamInfo(mimeType: "application/pdf", extension: ".pdf");
var streamResult = await markItDown.ConvertAsync(stream, streamInfo);
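
To check whether these tips pay off for your own documents, as the benchmarking recommendation above suggests, a small timing harness is usually enough. The sketch below is one possible shape, built only on System.Diagnostics.Stopwatch. `ConversionBenchmark` and `MeasureAsync` are hypothetical names, not MarkItDown APIs, and the conversion is passed in as a delegate.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical harness (not part of MarkItDown): times one conversion per input
// after an optional number of untimed warm-up runs.
public static class ConversionBenchmark
{
    public static async Task<List<(string Input, TimeSpan Elapsed)>> MeasureAsync(
        IEnumerable<string> inputs,
        Func<string, Task> convert,
        int warmupRuns = 1)
    {
        var timings = new List<(string Input, TimeSpan Elapsed)>();

        foreach (var input in inputs)
        {
            // Warm-up runs absorb JIT compilation and cold-cache effects.
            for (var i = 0; i < warmupRuns; i++)
            {
                await convert(input);
            }

            var stopwatch = Stopwatch.StartNew();
            await convert(input);
            stopwatch.Stop();

            timings.Add((input, stopwatch.Elapsed));
        }

        return timings;
    }
}
```

A call such as `await ConversionBenchmark.MeasureAsync(new[] { "sample.pdf", "sample.docx" }, path => markItDown.ConvertAsync(path));` returns one timing per file. Inputs are measured sequentially on purpose so that the numbers are not skewed by contention. For statistically rigorous results, a dedicated benchmarking library is a better fit than this sketch.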

🔧 Configuration

Basic Configuration

var options = new MarkItDownOptions
{
    EnableBuiltins = true,      // Use built-in converters (default: true)
    EnablePlugins = false,      // Plugin system (reserved for future use)
    ExifToolPath = "/usr/local/bin/exiftool"  // Path to exiftool binary (optional)
};

var markItDown = new MarkItDown(options);

Advanced AI Integration

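using System.IO;    // MemoryStream, Path
using System.Linq;  // FirstOrDefault
using Azure;
using OpenAI;

// Note: the client types and calls below (VisionServiceClient, OpenAIClient.AudioEndpoint)
// are illustrative; adapt them to the SDK packages and versions you actually reference.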

var options = new MarkItDownOptions
{
    // Azure AI Vision for image captions
    ImageCaptioner = async (bytes, info, token) =>
    {
        var client = new VisionServiceClient("your-endpoint", new AzureKeyCredential("your-key"));
        var result = await client.AnalyzeImageAsync(bytes, token);
        return $"Image: {result.Description?.Captions?.FirstOrDefault()?.Text ?? "Visual content"}";
    },
    
    // OpenAI Whisper for audio transcription  
    AudioTranscriber = async (bytes, info, token) =>
    {
        var client = new OpenAIClient("your-api-key");
        using var stream = new MemoryStream(bytes);
        var result = await client.AudioEndpoint.CreateTranscriptionAsync(
            stream, 
            Path.GetFileName(info.FileName) ?? "audio", 
            cancellationToken: token);
        return result.Text;
    },
    
    // Azure Document Intelligence for enhanced PDF/form processing
    DocumentIntelligence = new DocumentIntelligenceOptions
    {
        Endpoint = "https://your-resource.cognitiveservices.azure.com/",
        Credential = new AzureKeyCredential("your-document-intelligence-key"),
        ApiVersion = "2023-10-31-preview"
    }
};

var markItDown = new MarkItDown(options);

Production Configuration with Error Handling

using Microsoft.Extensions.Logging;
using Microsoft.Extensions.DependencyInjection;

// Set up dependency injection
var services = new ServiceCollection();
services.AddLogging(builder => builder.AddConsole().SetMinimumLevel(LogLevel.Information));
services.AddHttpClient();

var serviceProvider = services.BuildServiceProvider();
var logger = serviceProvider.GetRequiredService<ILogger<MarkItDown>>();
var httpClientFactory = serviceProvider.GetRequiredService<IHttpClientFactory>();

var options = new MarkItDownOptions
{
    // Graceful degradation for image processing
    ImageCaptioner = async (bytes, info, token) =>
    {
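        try
        {
            // Your AI service call here (CallVisionServiceAsync is a placeholder for your own method)
            return await CallVisionServiceAsync(bytes, token);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Let cancellation propagate; log the full exception and degrade gracefully
            logger.LogWarning(ex, "Image captioning failed for {FileName}", info.FileName);
            return $"[Image: {info.FileName ?? "unknown"}]";  // Fallback
        }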
    }
};

var markItDown = new MarkItDown(options, logger, httpClientFactory.CreateClient());
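
Since optional services such as image captioning add processing time, it can also help to cap how long each call may run. The sketch below wraps a delegate of the `(bytes, info, token)` shape used above with a per-call timeout and a fallback value. `ServiceCallGuards` and `WithTimeout` are hypothetical names, and the delegate signature is an assumption inferred from the examples in this section, not a confirmed MarkItDown type.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of MarkItDown): adds a per-call timeout to an
// optional-service delegate and returns a fallback when the timeout fires.
public static class ServiceCallGuards
{
    public static Func<byte[], TInfo, CancellationToken, Task<string?>> WithTimeout<TInfo>(
        Func<byte[], TInfo, CancellationToken, Task<string?>> inner,
        TimeSpan timeout,
        Func<TInfo, string?> fallback)
    {
        return async (bytes, info, callerToken) =>
        {
            // Link to the caller's token so outer cancellation still works,
            // then add the per-call deadline on top of it.
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
            cts.CancelAfter(timeout);

            try
            {
                return await inner(bytes, info, cts.Token);
            }
            catch (OperationCanceledException) when (!callerToken.IsCancellationRequested)
            {
                // Only the timeout fired: degrade gracefully instead of failing.
                return fallback(info);
            }
        };
    }
}
```

The wrapper relies on the inner call honouring the token it receives; a service call that ignores cancellation will still run to completion. If the caller's own token is cancelled, the exception propagates unchanged so that the whole conversion stops.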

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This project is a comprehensive C# port of the original Microsoft MarkItDown Python library, created by the Microsoft AutoGen team. We've reimagined it specifically for the .NET ecosystem while maintaining compatibility with the original's design philosophy and capabilities.

Key differences in this .NET version:

  • 🎯 Native .NET performance - Built from scratch in C#, not a Python wrapper
  • 🔄 Modern async patterns - Full async/await support with cancellation tokens
  • 📦 NuGet ecosystem integration - Easy installation and dependency management
  • 🛠️ Enterprise features - Comprehensive logging, error handling, and configuration
  • 🚀 Enhanced performance - Stream-based processing and memory optimizations

Maintained by: ManagedCode team
Original inspiration: Microsoft AutoGen team
License: MIT (same as the original Python version)

We're committed to maintaining feature parity with the upstream Python project while delivering the performance and developer experience that .NET developers expect.

📞 Support


⭐ Star this repository if you find it useful!

Made with ❤️ by ManagedCode
