likhit is a public MarkItDown plugin that adds Nepal-specific document support.
The default path is powered by MarkItDown, with likhit intercepting born-digital Nepali PDFs that need Nepal-specific repair before Markdown is emitted. That repair layer handles Kalimati broken-CMap fixes, Devanagari reordering and spacing normalization, and legacy Nepali font remapping where applicable.
pip install markitdown-likhit
likhit is primarily used as a MarkItDown plugin.
Once installed, enable plugins when creating a MarkItDown instance:
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=True)
result = md.convert("path/to/nepali-document.pdf")
print(result.text_content)
You can also use likhit through the standard MarkItDown CLI:
markitdown --use-plugins path/to/nepali-document.pdf
To write the output to a file:
markitdown --use-plugins path/to/nepali-document.pdf -o output.md
To verify the plugin is registered:
markitdown --list-plugins
You should see likhit in the output.
This package also installs a small helper CLI that runs MarkItDown with the likhit plugin enabled and writes Markdown files for you:
likhit-save path/to/nepali-document.pdf --out output.md
Convert multiple files into a directory:
likhit-save samples/pressrelease.pdf samples/kanunpatrika.pdf --out-dir converted/
likhit intercepts only the formats where it adds behavior beyond MarkItDown:
- PDF: Detected automatically by scanning embedded fonts. If any font is classified as `broken_cmap` (Kalimati variants) or `legacy_remap` (Preeti, Kantipur, PCS Nepali, Sagarmatha, Himali), likhit's repair pipeline runs. All other PDFs fall through to MarkItDown's built-in converter.
- DOC: Legacy Microsoft Word `.doc` files are handled by likhit's extraction pipeline.
- DOCX: Left to MarkItDown's built-in Word converter.
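The PDF routing rule above can be pictured as a small classifier over embedded font names. This is only a sketch: the class labels and font families come from the list above, but the function names and the substring-matching rule are assumptions made for illustration, not likhit's actual detection code.

```python
from typing import Iterable, Optional

# Font families taken from the list above. Matching by substring is an
# assumption for this sketch, not likhit's real detection logic.
BROKEN_CMAP_FAMILIES = ("kalimati",)
LEGACY_REMAP_FAMILIES = ("preeti", "kantipur", "pcsnepali", "sagarmatha", "himali")


def classify_font(font_name: str) -> Optional[str]:
    """Return 'broken_cmap', 'legacy_remap', or None for an embedded font name."""
    # Subset fonts usually carry a tag prefix such as "ABCDEF+Kalimati"; drop it.
    base = font_name.split("+", 1)[-1].lower()
    # Ignore separators so "PCS Nepali" and "PCS-Nepali" compare equal.
    normalized = base.replace("-", "").replace("_", "").replace(" ", "")
    if any(key in normalized for key in BROKEN_CMAP_FAMILIES):
        return "broken_cmap"
    if any(key in normalized for key in LEGACY_REMAP_FAMILIES):
        return "legacy_remap"
    return None


def needs_nepali_repair(font_names: Iterable[str]) -> bool:
    """True if any embedded font falls into one of the two repair classes."""
    return any(classify_font(name) is not None for name in font_names)
```

With this rule, a PDF embedding only fonts such as Arial would fall through to MarkItDown's built-in converter, while a single Kalimati or Preeti font would route the whole document to the repair pipeline.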
- Generic Nepali born-digital PDFs
- Legacy `.doc` files
For image-dominant or scanned PDFs, likhit can use markitdown-ocr when OCR is configured.
Required model configuration:
export MARKITDOWN_OCR_MODEL="your-model-name"
You can also provide the model through OPENAI_MODEL or GEMINI_MODEL.
Authentication options:
- OpenAI-compatible provider with a standard OpenAI key:
export OPENAI_API_KEY="your-api-key"
- OpenAI-compatible provider with a custom base URL:
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-provider.example/v1/"
export MARKITDOWN_OCR_MODEL="your-model-name"
- Gemini using the OpenAI compatibility endpoint:
export GEMINI_API_KEY="your-gemini-api-key"
export GEMINI_MODEL="gemini-2.5-flash"
When GEMINI_API_KEY is set, likhit automatically uses Gemini's OpenAI-compatible base URL unless you explicitly override OPENAI_BASE_URL.
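The way these variables combine can be sketched as a small resolver. The variable names are the ones listed above, and the Gemini base URL is Google's documented OpenAI-compatible endpoint. The function name, the precedence among the three model variables, and the choice of OPENAI_API_KEY over GEMINI_API_KEY when both are set are assumptions for illustration; likhit's actual resolution order may differ.

```python
import os
from typing import Mapping, Optional, TypedDict

# Google's documented OpenAI-compatible endpoint for Gemini.
GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"


class OcrSettings(TypedDict):
    model: Optional[str]
    api_key: Optional[str]
    base_url: Optional[str]


def resolve_ocr_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Resolve OCR settings from environment variables (hypothetical helper)."""
    # Assumed precedence: the dedicated OCR variable wins over provider defaults.
    model = (
        env.get("MARKITDOWN_OCR_MODEL")
        or env.get("OPENAI_MODEL")
        or env.get("GEMINI_MODEL")
    )
    gemini_key = env.get("GEMINI_API_KEY")
    # An explicit OPENAI_BASE_URL always overrides the Gemini default.
    base_url = env.get("OPENAI_BASE_URL") or (
        GEMINI_OPENAI_BASE_URL if gemini_key else None
    )
    # Assumption: prefer the OpenAI key when both keys are present.
    api_key = env.get("OPENAI_API_KEY") or gemini_key
    return {"model": model, "api_key": api_key, "base_url": base_url}
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration before exporting it, for example `resolve_ocr_settings({"GEMINI_API_KEY": "k", "GEMINI_MODEL": "gemini-2.5-flash"})`.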
Optional variables:
export MARKITDOWN_OCR_PROMPT="Custom OCR instructions"
The pipeline is:
- MarkItDown loads the plugin when `enable_plugins=True` or `--use-plugins` is used.
- For PDFs that need Nepali repair, likhit scans fonts and runs its repair pipeline.
- After extraction, likhit checks whether the document matches a known structure such as a single-column notice or a dense two-column layout.
- If a known structure is detected, likhit applies its structure-aware ordering and paragraph assembly.
- Otherwise, MarkItDown handles the default conversion path.
- When the PDF needs Nepali repair, likhit repairs the text first:
  - Kalimati broken-CMap repair
  - Devanagari reordering
  - Devanagari spacing normalization
  - Legacy-font remapping through `npttf2utf`
- likhit assembles repaired text blocks into Markdown.
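To make two of the repair steps above concrete, here is a minimal sketch of Devanagari reordering and spacing normalization chained as an ordered pipeline. It is an illustration under stated assumptions, not likhit's implementation: all function names are hypothetical, the reordering handles only the unambiguous case of a vowel sign I (ि) that has no consonant before it to attach to, and the Kalimati CMap repair and `npttf2utf` remapping steps are omitted.

```python
import re
from typing import Callable, Sequence

I_MATRA = "\u093F"         # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"          # DEVANAGARI SIGN VIRAMA
CONSONANT = "[\u0915-\u0939]"

# A vowel sign I stored in visual order sits before its consonant cluster.
# Only match when no consonant (or nukta) precedes it, so that text already
# in logical order is left untouched.
_VISUAL_I = re.compile(
    f"(?<![\u0915-\u0939\u093C]){I_MATRA}((?:{CONSONANT}{VIRAMA})*{CONSONANT})"
)
# Spaces wrongly inserted before a dependent sign (avagraha U+093D excluded).
_SPACE_BEFORE_MARK = re.compile(
    "[ \t]+(?=[\u0900-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963])"
)
_MULTISPACE = re.compile(r"[ \t]{2,}")


def reorder_i_matra(text: str) -> str:
    """Move a leading vowel sign I to after the consonant cluster it belongs to."""
    return _VISUAL_I.sub(lambda m: m.group(1) + I_MATRA, text)


def normalize_spacing(text: str) -> str:
    """Drop spaces before dependent signs, then collapse repeated spaces."""
    text = _SPACE_BEFORE_MARK.sub("", text)
    return _MULTISPACE.sub(" ", text)


# Reordering runs first: removing the space in front of a still-misplaced
# vowel sign would glue it onto the previous word.
DEFAULT_STEPS: Sequence[Callable[[str], str]] = (reorder_i_matra, normalize_spacing)


def repair_text(text: str, steps: Sequence[Callable[[str], str]] = DEFAULT_STEPS) -> str:
    """Apply each repair step in order and return the repaired text."""
    for step in steps:
        text = step(text)
    return text
```

Keeping each step a plain `str -> str` function makes the order explicit and lets additional steps be added to the sequence without changing `repair_text`.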
- src/likhit/_plugin.py: MarkItDown plugin entry point and converter registration
- src/likhit/converters/: plugin converters for Nepali PDF and legacy DOC inputs
- src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
- src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
- src/likhit/extractors/: extraction strategies (PDF, DOC)
  - font_based.py: PDF extraction with Nepali font repair
  - docx_based.py: legacy DOC text extraction
- src/likhit/handlers/: structure-aware handlers and detection logic
- src/likhit/renderers/: Markdown rendering
- tests/: conversion, extraction, and plugin coverage
- tests/integration/: end-to-end integration tests
- tests/integration/test_data/: committed test fixtures (PDF, DOCX, DOC samples)
Run all tests:
poetry run pytest
- MarkItDown: https://github.com/microsoft/markitdown
- MarkItDown sample plugin: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin