
feat: add Rust MCP server + fix KaTeX rendering issues #2

Open
userFRM wants to merge 5 commits into tonydavis629:main from userFRM:feat/mcp-and-katex-fixes


userFRM commented Feb 5, 2026

Summary

  • Rust MCP server that uses the markxiv library directly — no dependency on markxiv.org
  • KaTeX rendering fixes for common LaTeX commands that break in browser math rendering
  • Figure preservation with ar5iv links, addressing issue #1 (Feature request: add links to figures)

Figure Links (closes #1)

Previously, <figure> blocks from pandoc output were stripped entirely. Now:

  1. extract_figure_captions() converts <figure> blocks into numbered markdown blockquotes preserving caption text:
    > **Figure 1:** Architecture of the proposed model
    
  2. add_ar5iv_figure_links() enriches each figure reference with a link to the paper's ar5iv HTML page where figures are viewable:
    > **Figure 1:** Architecture of the proposed model — [view on ar5iv](https://ar5iv.labs.arxiv.org/html/2107.02789)
    

Since ar5iv renames figure files to x1.png, x2.png etc. (unpredictable from LaTeX source), we link to the full ar5iv page rather than individual images.
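For reviewers, here is a minimal sketch of the link-enrichment step described above. It is not the code in this PR: the signature, the bare `arxiv_id` argument, and the line-prefix check are assumptions made for illustration, and it uses only the standard library.

```rust
/// Hypothetical sketch: append an ar5iv page link to every figure-caption
/// blockquote. `arxiv_id` is assumed to be a bare identifier ("2107.02789").
fn add_ar5iv_figure_links(markdown: &str, arxiv_id: &str) -> String {
    let url = format!("https://ar5iv.labs.arxiv.org/html/{arxiv_id}");
    markdown
        .lines()
        .map(|line| {
            // Only touch lines shaped like `> **Figure N:** caption`, and skip
            // lines that already carry a link so the pass can be re-run safely.
            if line.starts_with("> **Figure ") && !line.contains("ar5iv.labs.arxiv.org") {
                // U+2014 is the separator used in the example output above.
                format!("{line} \u{2014} [view on ar5iv]({url})")
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n") // note: a trailing newline in the input is not preserved
}

fn main() {
    let md = "Intro\n> **Figure 1:** Architecture of the proposed model\nMore text";
    println!("{}", add_ar5iv_figure_links(md, "2107.02789"));
}
```

Linking to the page rather than to an image means the function needs nothing beyond the arXiv ID, which is the point of the design choice described above.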

KaTeX Fixes

Added regex-based post-processing in sanitize_markdown():

| Issue | Before | After |
| --- | --- | --- |
| Missing subscript | \mathcal{X}{Y} | \mathcal{X}_{Y} |
| Unsupported command | \textsc{Algo} | \textbf{Algo} |
| Algorithm pseudo-code | \Call{Solve}{x} | \textbf{Solve}(x) |
| Unsupported font | \mathbbm{1} | \mathbb{1} |
| Angle brackets in math | $a < b$ → HTML stripped | Math blocks preserved verbatim |
| Display math inline | text $$x^2$$ text | $$x^2$$ on own line |
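The first four rewrites in the table can be sketched as follows. The PR implements them with the regex crate; this illustration swaps in plain string scanning so it runs with the standard library alone. The helper `take_group` is hypothetical, and escaped braces (`\{`) and commands nested inside `\Call` arguments are not handled.

```rust
/// Hypothetical helper: split a leading balanced `{...}` group off `s`,
/// returning (inner, rest). Escaped braces are not treated specially.
fn take_group(s: &str) -> Option<(&str, &str)> {
    if !s.starts_with('{') {
        return None;
    }
    let mut depth = 0usize;
    for (i, c) in s.char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Some((&s[1..i], &s[i + 1..]));
                }
            }
            _ => {}
        }
    }
    None // unbalanced braces
}

/// Sketch of the four command rewrites listed in the table.
fn fix_katex_commands(input: &str) -> String {
    // Plain renames: \mathbbm -> \mathbb and \textsc -> \textbf.
    let renamed = input
        .replace("\\mathbbm{", "\\mathbb{")
        .replace("\\textsc{", "\\textbf{");
    let mut out = String::with_capacity(renamed.len() + 8);
    let mut rest = renamed.as_str();
    loop {
        // Find whichever of the two structural commands comes first.
        let (pos, is_call) = match (rest.find("\\Call{"), rest.find("\\mathcal{")) {
            (Some(a), Some(b)) if a < b => (a, true),
            (Some(a), None) => (a, true),
            (_, Some(b)) => (b, false),
            (None, None) => break,
        };
        out.push_str(&rest[..pos]);
        let cmd_len = if is_call { "\\Call".len() } else { "\\mathcal".len() };
        let after_cmd = &rest[pos + cmd_len..]; // starts with '{'
        let first = take_group(after_cmd);
        let second = first.and_then(|(_, tail)| take_group(tail));
        match (first, second) {
            // \Call{Name}{args} -> \textbf{Name}(args)
            (Some((name, _)), Some((args, tail))) if is_call => {
                out.push_str(&format!("\\textbf{{{name}}}({args})"));
                rest = tail;
            }
            // \mathcal{X}{Y} -> \mathcal{X}_{Y}; the {Y} group is copied later.
            (Some((sym, tail)), _) if !is_call => {
                out.push_str(&format!("\\mathcal{{{sym}}}"));
                if tail.starts_with('{') {
                    out.push('_');
                }
                rest = tail;
            }
            // Anything else: keep the command name as-is and move on.
            _ => {
                out.push_str(&rest[pos..pos + cmd_len]);
                rest = after_cmd;
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    println!("{}", fix_katex_commands("\\Call{Solve}{x} over \\mathcal{X}{Y}"));
}
```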

Also added a search() method to the ArxivClient trait for arXiv keyword search.

MCP Server

A Rust-native MCP binary (markxiv-mcp) built with rmcp v0.14. Uses the markxiv library directly (no HTTP calls to markxiv.org).

Tools:

  • convert_paper — full arXiv paper → markdown via local pandoc pipeline
  • get_paper_metadata — title/authors/abstract lookup via arXiv API
  • search_papers — keyword search via arXiv Atom API
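To illustrate what a keyword search against the arXiv Atom API looks like, the sketch below builds a query URL. The endpoint and the `search_query`, `start`, and `max_results` parameters are part of the public arXiv API. The function names and the choice of the `all:` field prefix are assumptions and may not match what search_papers actually sends.

```rust
/// Percent-encode everything except RFC 3986 unreserved characters.
fn percent_encode(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}

/// Hypothetical helper: build an arXiv API URL that searches all fields.
fn build_search_url(query: &str, max_results: u32) -> String {
    format!(
        "http://export.arxiv.org/api/query?search_query=all:{}&start=0&max_results={}",
        percent_encode(query),
        max_results
    )
}

fn main() {
    println!("{}", build_search_url("graph neural networks", 5));
}
```

The response is an Atom feed with one entry per paper, which is what parse_atom_search_results() would then consume.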

Claude Desktop config:

{
  "mcpServers": {
    "markxiv": {
      "command": "/path/to/markxiv-mcp"
    }
  }
}

Requirements: pandoc + pdftotext installed locally. The MCP binary can be distributed pre-built so end users don't need a Rust toolchain.

How this differs from other arXiv MCP servers

Most arXiv MCP servers (roughly ten or more exist) fetch raw LaTeX, scrape HTML, or do basic PDF text extraction. markxiv-mcp instead runs pandoc locally on the paper's LaTeX source, the same pipeline that powers markxiv.org, which yields higher-fidelity markdown.

Sanitize Pipeline

The sanitize_markdown() pipeline now has 4 stages:

  1. extract_figure_captions — convert <figure> blocks to numbered markdown blockquotes
  2. fix_katex_commands — regex fixes for unsupported LaTeX commands
  3. normalize_display_math — ensure $$...$$ blocks are on their own lines
  4. strip_html_tags_preserve_math — remove remaining HTML while preserving math verbatim
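Stages 3 and 4 are the subtle ones, so here is a rough standard-library sketch of how they could work. The function names match the pipeline, but the bodies are illustrative assumptions rather than the code in this PR: escaped dollars (`\$`) are ignored, and a `<` counts as a tag only when followed by a letter, `/`, or `!`.

```rust
/// Stage 3 sketch: put every closed `$$...$$` block on its own line.
fn normalize_display_math(input: &str) -> String {
    let mut out = String::with_capacity(input.len() + 8);
    let mut rest = input;
    while let Some(start) = rest.find("$$") {
        let body = &rest[start + 2..];
        let Some(len) = body.find("$$") else { break }; // unclosed: leave tail as-is
        out.push_str(rest[..start].trim_end_matches(' '));
        if !out.is_empty() && !out.ends_with('\n') {
            out.push('\n');
        }
        out.push_str("$$");
        out.push_str(&body[..len]);
        out.push_str("$$");
        rest = body[len + 2..].trim_start_matches(' ');
        if !rest.is_empty() && !rest.starts_with('\n') {
            out.push('\n');
        }
    }
    out.push_str(rest);
    out
}

/// Stage 4 sketch: drop HTML tags, copying `$...$` and `$$...$$` verbatim.
fn strip_html_tags_preserve_math(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(c) = rest.chars().next() {
        let tag_like = c == '<'
            && matches!(rest[1..].chars().next(),
                Some(n) if n.is_ascii_alphabetic() || n == '/' || n == '!');
        if c == '$' {
            // Math: copy through the matching closing delimiter untouched.
            let delim = if rest.starts_with("$$") { "$$" } else { "$" };
            let body = &rest[delim.len()..];
            match body.find(delim) {
                Some(end) => {
                    let total = 2 * delim.len() + end;
                    out.push_str(&rest[..total]);
                    rest = &rest[total..];
                }
                None => {
                    // Unclosed delimiter: keep it literally and carry on.
                    out.push_str(delim);
                    rest = body;
                }
            }
        } else if tag_like {
            match rest.find('>') {
                Some(end) => rest = &rest[end + 1..], // skip the whole tag
                None => {
                    out.push(c);
                    rest = &rest[1..];
                }
            }
        } else {
            out.push(c);
            rest = &rest[c.len_utf8()..];
        }
    }
    out
}

fn main() {
    let raw = "<p>If $a < b$ then $$x^2$$ holds.</p>";
    println!("{}", strip_html_tags_preserve_math(&normalize_display_math(raw)));
}
```

Running stage 3 before stage 4 means the tag stripper only ever sees display math that already sits on its own line, while inline `$a < b$` survives because the stripper never looks inside math delimiters.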

Test Plan

  • cargo test --lib — all 50 unit tests pass
  • cargo build -p markxiv-mcp — MCP binary compiles
  • cargo build — main library compiles
  • Test MCP tools manually with Claude Desktop or MCP inspector
  • Verify KaTeX fixes against real papers with known rendering issues

- Add regex-based post-processing in sanitize_markdown():
  - fix_katex_commands(): fixes \mathcal subscripts, \textsc→\textbf,
    \Call macro, \mathbbm→\mathbb
  - protect_math_angle_brackets(): replaces < and > with \langle/\rangle
    inside math delimiters before HTML stripping
- Add SearchResult type and search() method to ArxivClient trait
- Add parse_atom_search_results() for multi-entry Atom feed parsing
- Add regex dependency
- Add 7 unit tests for new KaTeX fix functions

Add a Rust-native MCP server (markxiv-mcp) that uses the markxiv
library directly, with no dependency on markxiv.org or any web service.

Tools exposed:
- convert_paper: full arXiv paper → markdown via pandoc pipeline
- get_paper_metadata: title/authors/abstract lookup
- search_papers: keyword search via arXiv Atom API

Built with rmcp 0.14 (stdio transport). Users need pandoc + pdftotext
installed locally. The MCP binary can be distributed pre-built so end
users don't need a Rust toolchain.

The protect_math_angle_brackets() approach injected \langle/\rangle
into all math blocks, which breaks inside text-mode commands like
\texttt{<name>} where \langle is undefined.

New approach: strip_html_tags_preserve_math() copies $...$ and $$...$$
blocks verbatim so angle brackets survive for KaTeX to handle natively.
This fixes the ParseError on papers with \texttt{<...>} in math.

Adds normalize_display_math() to the sanitize pipeline to ensure
$$...$$ display math blocks are isolated on their own lines. Prevents
markdown renderers from misparsing inline $$...$$ as two $ inline math
delimiters, which caused "Can't use function '$' in math mode" errors.

Instead of stripping <figure> blocks entirely, extract captions
into markdown blockquotes and enrich them with ar5iv viewing links.