MCP Document Converter

mcp-name: io.github.xt765/mcp-document-converter

MCP (Model Context Protocol) Document Converter - A powerful MCP tool for converting documents between multiple formats, enabling AI agents to easily transform documents.

GitHub Gitee CSDN PyPI MCP Registry Python 3.10+ License: MIT

Features

  • Multi-format Support: Supports 5 mainstream document formats: Markdown, HTML, DOCX, PDF, and Text
  • Bidirectional Conversion: Every input format can be converted to every output format (5×5 = 25 conversion combinations)
  • MCP Protocol: Compliant with the MCP standard, so it can be used as a tool by MCP clients such as Trae IDE
  • Plugin Architecture: Easy to extend with new parsers and renderers
  • Syntax Highlighting: HTML and PDF outputs support code syntax highlighting
  • Style Customization: Support for custom CSS styles
  • Metadata Preservation: Preserves document title, author, creation time, and other metadata during conversion

Supported Formats

Input Formats (Parsers)

| Format | Extensions | MIME Type | Features |
|---|---|---|---|
| Markdown | .md, .markdown, .mdown, .mkd | text/markdown | YAML Front Matter, GFM extensions |
| HTML | .html, .htm | text/html | Semantic tag parsing |
| DOCX | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Styles, tables, images |
| PDF | .pdf | application/pdf | Text extraction and structure recognition |
| Text | .txt, .text | text/plain | Auto encoding detection and structure recognition |
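
The extension column above is what format auto-detection would key on. As a minimal sketch (the `detect_format` helper and the `EXTENSION_TO_FORMAT` mapping are hypothetical names for illustration, not part of the package's API), extension-based detection could look like this:

```python
from pathlib import Path
from typing import Optional

# Extension-to-format mapping copied from the input formats table.
# The dictionary name and the helper below are illustrative only.
EXTENSION_TO_FORMAT = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".mdown": "markdown",
    ".mkd": "markdown",
    ".html": "html",
    ".htm": "html",
    ".docx": "docx",
    ".pdf": "pdf",
    ".txt": "text",
    ".text": "text",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is unknown."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    return EXTENSION_TO_FORMAT.get(suffix)


print(detect_format("notes.MD"))
print(detect_format("archive.zip"))
```

When the extension is missing or misleading, passing source_format explicitly to convert_document avoids relying on detection.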

Output Formats (Renderers)

| Format | Extension | MIME Type | Features |
|---|---|---|---|
| HTML | .html | text/html | Beautiful styling, code highlighting, responsive design |
| Markdown | .md | text/markdown | Standard Markdown format, YAML Front Matter |
| DOCX | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word document format, style preservation |
| PDF | .pdf | application/pdf | Generated with WeasyPrint, pagination support |
| Text | .txt | text/plain | Plain text, basic formatting preserved |

Conversion Matrix

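| Source \ Target | HTML | PDF | Markdown | DOCX | Text |
|---|---|---|---|---|---|
| Markdown | ✅ | ✅ | ✅ | ✅ | ✅ |
| HTML | ✅ | ✅ | ✅ | ✅ | ✅ |
| DOCX | ✅ | ✅ | ✅ | ✅ | ✅ |
| PDF | ✅ | ✅ | ✅ | ✅ | ✅ |
| Text | ✅ | ✅ | ✅ | ✅ | ✅ |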
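
Because every parser can be paired with every renderer, the matrix is the full cross product of the five input and five output formats. The sketch below models that with plain dictionaries; `build_matrix` and `is_supported` are hypothetical helpers, and the real get_conversion_matrix and can_convert tools may return a different shape.

```python
from itertools import product

SOURCE_FORMATS = ["markdown", "html", "docx", "pdf", "text"]
TARGET_FORMATS = ["html", "pdf", "markdown", "docx", "text"]


def build_matrix(sources, targets):
    """Map each source format to the set of target formats it can reach."""
    matrix = {source: set() for source in sources}
    for source, target in product(sources, targets):
        matrix[source].add(target)
    return matrix


def is_supported(matrix, source, target):
    """Illustrative stand-in for a can_convert-style check."""
    return target in matrix.get(source, set())


matrix = build_matrix(SOURCE_FORMATS, TARGET_FORMATS)
pair_count = sum(len(targets) for targets in matrix.values())
print(pair_count)  # 25
```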

Installation

Using pip (Recommended)

pip install mcp-document-converter

From Source

git clone https://github.com/xt765/mcp-document-converter.git
cd mcp-document-converter
pip install -e .

MCP Tools

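The server's main tool is convert_document, described below. The helper tools list_supported_formats, get_conversion_matrix, can_convert, and get_format_info are covered under Usage.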

convert_document

Convert a document from one format to another.

Arguments:

  • source_path (string, required): Path to the source document.
  • target_format (string, required): Target format (html, pdf, markdown, docx, text).
  • output_path (string, optional): Path for the output file.
  • source_format (string, optional): Format of the source file (auto-detected if not provided).
  • options (object, optional): Additional options like template, css, and preserve_metadata.

Configuration

Using in Trae IDE / Claude Desktop

Add the following to your MCP configuration file:

Option 1: Using PyPI (Recommended)

{
  "mcpServers": {
    "mcp-document-converter": {
      "command": "uvx",
      "args": [
        "mcp-document-converter"
      ]
    }
  }
}

Option 2: Using GitHub repository

{
  "mcpServers": {
    "mcp-document-converter": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/xt765/mcp-document-converter",
        "mcp-document-converter"
      ]
    }
  }
}

Option 3: Using Gitee repository (Faster access in China)

{
  "mcpServers": {
    "mcp-document-converter": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://gitee.com/xt765/mcp-document-converter",
        "mcp-document-converter"
      ]
    }
  }
}

Option 4: Using pip (Manual installation)

First install the package:

pip install mcp-document-converter

Then add to configuration:

{
  "mcpServers": {
    "mcp-document-converter": {
      "command": "mcp-document-converter",
      "args": []
    }
  }
}
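
If the configuration file already lists other servers, the new entry has to be merged under the existing "mcpServers" key rather than replacing it. The following is a minimal sketch of that merge on an already-loaded dictionary; `add_server_entry` is a hypothetical helper, and the location of the configuration file depends on the client, so reading and writing the file is left out.

```python
import json


def add_server_entry(config: dict, name: str, command: str, args: list) -> dict:
    """Return a copy of config with one entry added under "mcpServers"."""
    servers = dict(config.get("mcpServers", {}))  # copy so the input is not mutated
    servers[name] = {"command": command, "args": list(args)}
    return {**config, "mcpServers": servers}


# Example input: a config that already contains an unrelated (made-up) server.
existing = {"mcpServers": {"other-server": {"command": "other", "args": []}}}
updated = add_server_entry(
    existing,
    name="mcp-document-converter",
    command="uvx",
    args=["mcp-document-converter"],
)
print(json.dumps(updated, indent=2))
```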

Usage

As an MCP Tool

After configuration, AI assistants can directly call the following tools:

1. convert_document (Recommended)

Use a unified interface to convert any supported document type.

# Markdown to HTML
convert_document(
    source_path="document.md",
    target_format="html"
)

# HTML to PDF
convert_document(
    source_path="document.html",
    target_format="pdf"
)

# DOCX to Markdown
convert_document(
    source_path="document.docx",
    target_format="markdown"
)

# Conversion with options
convert_document(
    source_path="document.md",
    target_format="html",
    output_path="output.html",
    options={
        "css": "body { font-family: Arial; }",
        "preserve_metadata": True
    }
)

2. list_supported_formats

List all supported document formats.

list_supported_formats()

3. get_conversion_matrix

Get the complete format conversion matrix.

get_conversion_matrix()

4. can_convert

Check if conversion from source format to target format is supported.

can_convert(source_format="markdown", target_format="pdf")

5. get_format_info

Get detailed information about a specific format.

get_format_info(format="markdown")

As a Python Library

from mcp_document_converter import DocumentConverter
from mcp_document_converter.registry import get_registry
from mcp_document_converter.parsers import MarkdownParser, HTMLParser
from mcp_document_converter.renderers import HTMLRenderer, PDFRenderer

# Register parsers and renderers
registry = get_registry()
registry.register_parser(MarkdownParser())
registry.register_parser(HTMLParser())
registry.register_renderer(HTMLRenderer())
registry.register_renderer(PDFRenderer())

# Create converter
converter = DocumentConverter(registry)

# Convert document
result = converter.convert(
    source="input.md",
    target_format="html",
    output_path="output.html"
)

if result.success:
    print(f"✅ Conversion successful: {result.output_path}")
else:
    print(f"❌ Conversion failed: {result.error_message}")

Tool Interface Details

convert_document

Convert a document from one format to another.

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| source_path | string | Yes | Source file path, supports absolute or relative paths |
| target_format | string | Yes | Target format: html, pdf, markdown, docx, text |
| output_path | string | No | Output file path (optional, defaults to source filename) |
| source_format | string | No | Source format (optional, auto-detected from file extension) |
| options | object | No | Conversion options |

Options:

| Option | Type | Default | Description |
|---|---|---|---|
| template | string | - | Template name |
| css | string | - | Custom CSS styles |
| preserve_metadata | boolean | true | Whether to preserve metadata |
| extract_images | boolean | true | Whether to extract images |

Example:

{
  "source_path": "/path/to/document.md",
  "target_format": "html",
  "output_path": "/path/to/output.html",
  "options": {
    "css": "body { font-family: Arial; }",
    "preserve_metadata": true
  }
}
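
On the wire, an MCP client sends these arguments inside a JSON-RPC 2.0 request using the protocol's "tools/call" method. The sketch below only builds that envelope to show where the arguments go; `build_tool_call` is a hypothetical helper, and in practice the MCP client library constructs and sends this message for you.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP "tools/call" method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    request_id=1,
    tool_name="convert_document",
    arguments={
        "source_path": "/path/to/document.md",
        "target_format": "html",
        "output_path": "/path/to/output.html",
        "options": {
            "css": "body { font-family: Arial; }",
            "preserve_metadata": True,
        },
    },
)
print(json.dumps(request, indent=2))
```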

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    MCP Document Converter                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Parsers                          Renderers                     │
│   ┌─────────────┐                  ┌─────────────┐              │
│   │ Markdown    │ ───────────────→ │ HTML        │              │
│   │ DOCX        │ ───────────────→ │ PDF         │              │
│   │ HTML        │ ───────────────→ │ Markdown    │              │
│   │ PDF         │ ───────────────→ │ DOCX        │              │
│   │ Text        │ ───────────────→ │ Text        │              │
│   └─────────────┘                  └─────────────┘              │
│          ↓                                ↓                     │
│   ┌─────────────────────────────────────────────────────┐       │
│   │         Intermediate Representation (IR)             │       │
│   │  - Document Tree                                     │       │
│   │  - Metadata                                          │       │
│   │  - Assets (images, attachments, etc.)                │       │
│   └─────────────────────────────────────────────────────┘       │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Core Components

  1. DocumentIR (Intermediate Representation): Unified abstraction for all documents, containing document tree, metadata, assets, etc.
  2. BaseParser (Parser Base Class): Defines the parser interface, parses various formats into DocumentIR
  3. BaseRenderer (Renderer Base Class): Defines the renderer interface, renders DocumentIR into various formats
  4. ConverterRegistry (Registry): Manages all parsers and renderers, provides format lookup and auto-matching
  5. DocumentConverter (Conversion Engine): Coordinates parsers and renderers to complete document conversion
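
The parse-to-IR, render-from-IR flow can be illustrated with a toy pipeline. Everything here is a stand-in written for this example: `SketchIR`, `SketchRegistry`, `parse_text`, `render_html`, and `convert` are not the package's classes, and the real DocumentIR carries a node tree, metadata, and assets rather than a flat list of paragraphs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SketchIR:
    """Stand-in for DocumentIR: a title plus a flat list of paragraphs."""
    title: str = ""
    paragraphs: List[str] = field(default_factory=list)


def parse_text(source: str) -> SketchIR:
    """Toy parser: first non-empty line is the title, the rest are paragraphs."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    return SketchIR(title=lines[0] if lines else "", paragraphs=lines[1:])


def render_html(doc: SketchIR) -> str:
    """Toy renderer: a heading and one <p> per paragraph (no HTML escaping)."""
    body = "".join(f"<p>{p}</p>" for p in doc.paragraphs)
    return f"<h1>{doc.title}</h1>{body}"


class SketchRegistry:
    """Stand-in for ConverterRegistry: lookup tables keyed by format name."""

    def __init__(self) -> None:
        self.parsers: Dict[str, Callable[[str], SketchIR]] = {}
        self.renderers: Dict[str, Callable[[SketchIR], str]] = {}


def convert(registry: SketchRegistry, source: str,
            source_format: str, target_format: str) -> str:
    """Parse into the IR, then render from it; parser and renderer never meet."""
    ir = registry.parsers[source_format](source)
    return registry.renderers[target_format](ir)


registry = SketchRegistry()
registry.parsers["text"] = parse_text
registry.renderers["html"] = render_html
html = convert(registry, "Title\n\nHello World", "text", "html")
print(html)  # <h1>Title</h1><p>Hello World</p>
```

The point of routing everything through one intermediate representation is that a new format needs one parser and one renderer, not a dedicated converter for every source and target pair.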

Extension Development

Adding a New Parser

from typing import List, Union
from pathlib import Path
from mcp_document_converter.core.parser import BaseParser
from mcp_document_converter.core.ir import DocumentIR, Node, NodeType

class MyParser(BaseParser):
    @property
    def supported_extensions(self) -> List[str]:
        return [".myext"]
    
    @property
    def format_name(self) -> str:
        return "myformat"
    
    @property
    def mime_types(self) -> List[str]:
        return ["application/x-myformat"]
    
    def parse(self, source: Union[str, Path, bytes], **options) -> DocumentIR:
        # Read source file
        content = self._read_source(source)
        
        # Parse into DocumentIR
        document = DocumentIR()
        document.title = "My Document"
        
        # Add content nodes
        document.add_node(Node(
            type=NodeType.PARAGRAPH,
            content=[Node(type=NodeType.TEXT, content="Hello World")]
        ))
        
        return document

Adding a New Renderer

from typing import Any
from mcp_document_converter.core.renderer import BaseRenderer
from mcp_document_converter.core.ir import DocumentIR

class MyRenderer(BaseRenderer):
    @property
    def output_extension(self) -> str:
        return ".myext"
    
    @property
    def format_name(self) -> str:
        return "myformat"
    
    @property
    def mime_type(self) -> str:
        return "application/x-myformat"
    
    def render(self, document: DocumentIR, **options: Any) -> str:
        # Render DocumentIR to target format
        parts = []
        
        if document.title:
            parts.append(f"# {document.title}")
        
        for node in document.content:
            # Render each node
            pass
        
        return "\n".join(parts)
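
The renderer skeleton leaves the per-node step empty. One way to fill it in is a recursive walk, sketched below with a stand-in node type. `SketchNode`, `render_node`, and `render_document` are hypothetical names; the sketch assumes only what the parser example shows, namely that a node's content is either a string or a list of child nodes, and the "heading" rule is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SketchNode:
    """Stand-in for Node: content is either text or a list of child nodes."""
    type: str
    content: Union[str, List["SketchNode"]]


def render_node(node: SketchNode) -> str:
    """Render one node to plain text, recursing into child nodes."""
    if isinstance(node.content, str):
        return node.content
    inner = "".join(render_node(child) for child in node.content)
    # Hypothetical rule: headings get a "## " prefix, other nodes pass through.
    return f"## {inner}" if node.type == "heading" else inner


def render_document(title: str, nodes: List[SketchNode]) -> str:
    """Mirror the skeleton above: optional title line, then one line per node."""
    parts = [f"# {title}"] if title else []
    parts.extend(render_node(node) for node in nodes)
    return "\n".join(parts)


paragraph = SketchNode("paragraph", [SketchNode("text", "Hello World")])
output = render_document("My Document", [paragraph])
print(output)
```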

Registering Extensions

from mcp_document_converter.registry import get_registry

# Register new parser and renderer
registry = get_registry()
registry.register_parser(MyParser())
registry.register_renderer(MyRenderer())

Testing

# Run all tests
python tests/test_conversion.py

# Run a specific test (the :: node ID syntax requires pytest)
pytest tests/test_conversion.py::test_markdown_to_html

Environment Variables

| Variable | Description | Default |
|---|---|---|
| MCP_CONVERTER_LOG_LEVEL | Log level | INFO |
| MCP_CONVERTER_TEMP_DIR | Temporary files directory | System temp directory |
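
A wrapper script that wants to honor the same variables can resolve them with the defaults listed above. This is a sketch under assumptions: `load_settings` is a hypothetical helper, and how the server itself reads and validates these variables is not documented here.

```python
import logging
import os
import tempfile

# Standard logging level names; unknown values fall back to INFO in this sketch.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def load_settings(env=None) -> dict:
    """Resolve the two documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    level_name = env.get("MCP_CONVERTER_LOG_LEVEL", "INFO").upper()
    temp_dir = env.get("MCP_CONVERTER_TEMP_DIR") or tempfile.gettempdir()
    return {
        "log_level": LEVELS.get(level_name, logging.INFO),
        "temp_dir": temp_dir,
    }


# Pass a plain dict instead of os.environ to try values without exporting them.
settings = load_settings({"MCP_CONVERTER_LOG_LEVEL": "debug"})
print(settings["log_level"] == logging.DEBUG)  # True
```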

Dependencies

Core Dependencies

  • mcp >= 1.0.0 - MCP protocol implementation
  • pydantic >= 2.0.0 - Data validation

Parser Dependencies

  • markdown >= 3.5.0 - Markdown parsing
  • beautifulsoup4 >= 4.12.0 - HTML parsing
  • python-docx >= 1.1.0 - DOCX parsing
  • PyPDF2 >= 3.0.0 - PDF parsing
  • chardet >= 5.0.0 - Encoding detection
  • pyyaml >= 6.0.0 - YAML parsing

Renderer Dependencies

  • weasyprint >= 60.0 - PDF rendering
  • pygments >= 2.17.0 - Code highlighting
  • jinja2 >= 3.1.0 - Template engine

License

MIT License

Contributing

Issues and Pull Requests are welcome!
