feat: add Rust MCP server + fix KaTeX rendering issues #2
Open
userFRM wants to merge 5 commits into tonydavis629:main
- Add regex-based post-processing in sanitize_markdown():
  - fix_katex_commands(): fixes \mathcal subscripts, \textsc→\textbf, \Call macro, \mathbbm→\mathbb
  - protect_math_angle_brackets(): replaces < and > with \langle/\rangle inside math delimiters before HTML stripping
- Add SearchResult type and search() method to ArxivClient trait
- Add parse_atom_search_results() for multi-entry Atom feed parsing
- Add regex dependency
- Add 7 unit tests for new KaTeX fix functions
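To make the KaTeX rewrites in this commit concrete, here is a minimal sketch of the kind of substitution fix_katex_commands() is described as performing. It is not the PR's code: the PR uses the regex crate, while this stand-in uses plain string handling to stay dependency-free. The function names are hypothetical, the \Call macro rewrite is omitted, and the \mathcal argument is assumed to contain no nested braces.

```rust
/// Hypothetical, dependency-free stand-in for `fix_katex_commands()`.
/// Covers \mathbbm -> \mathbb, \textsc -> \textbf and the \mathcal
/// subscript fix; the \Call macro rewrite is left out of this sketch.
fn fix_katex_commands_sketch(input: &str) -> String {
    // KaTeX has no \mathbbm or \textsc, so map them to supported commands.
    let s = input.replace("\\mathbbm{", "\\mathbb{");
    let s = s.replace("\\textsc{", "\\textbf{");
    fix_mathcal_subscripts_sketch(&s)
}

/// Rewrites `\mathcal{X}{Y}` as `\mathcal{X}_{Y}`.
/// Assumption: the first argument contains no nested braces.
fn fix_mathcal_subscripts_sketch(input: &str) -> String {
    const CMD: &str = "\\mathcal{";
    let mut out = String::with_capacity(input.len() + 8);
    let mut rest = input;
    while let Some(start) = rest.find(CMD) {
        let arg_start = start + CMD.len();
        match rest[arg_start..].find('}') {
            Some(rel_close) => {
                // Byte index just past the closing brace of the first argument.
                let after = arg_start + rel_close + 1;
                out.push_str(&rest[..after]);
                // A brace group directly after the argument is the lost subscript.
                if rest[after..].starts_with('{') {
                    out.push('_');
                }
                rest = &rest[after..];
            }
            // Unbalanced input: copy the remainder unchanged.
            None => break,
        }
    }
    out.push_str(rest);
    out
}
```

A regex-based version would express the same rules as patterns; the hand-written scan above only shows which inputs change and which are left alone.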
Add a Rust-native MCP server (markxiv-mcp) that uses the markxiv library directly, with no dependency on markxiv.org or any web service.
Tools exposed:
- convert_paper: full arXiv paper → markdown via pandoc pipeline
- get_paper_metadata: title/authors/abstract lookup
- search_papers: keyword search via arXiv Atom API
Built with rmcp 0.14 (stdio transport). Users need pandoc + pdftotext installed locally. The MCP binary can be distributed pre-built so end users don't need a Rust toolchain.
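For the search_papers tool, the request against the arXiv Atom API could look roughly like the sketch below. It only builds the query URL for the public export.arxiv.org/api/query endpoint. The function name, the choice of the all: field prefix and the fixed start=0 are assumptions for illustration, not something the PR states.

```rust
/// Hypothetical helper: builds an arXiv API query URL for a keyword search.
/// Assumptions: searches all fields (`all:`) and always starts at result 0.
fn arxiv_search_url_sketch(keywords: &str, max_results: usize) -> String {
    let mut query = String::new();
    for byte in keywords.trim().bytes() {
        match byte {
            // Unreserved URL characters are copied through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                query.push(byte as char)
            }
            // Spaces separate search terms and are sent as '+'.
            b' ' => query.push('+'),
            // Everything else is percent-encoded byte by byte.
            other => query.push_str(&format!("%{other:02X}")),
        }
    }
    format!(
        "http://export.arxiv.org/api/query?search_query=all:{query}&start=0&max_results={max_results}"
    )
}
```

The response to such a request is an Atom feed with one entry per paper, which is the input parse_atom_search_results() is described as handling.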
The protect_math_angle_brackets() approach injected \langle/\rangle
into all math blocks, which breaks inside text-mode commands like
\texttt{<name>} where \langle is undefined.
New approach: strip_html_tags_preserve_math() copies $...$ and $$...$$
blocks verbatim so angle brackets survive for KaTeX to handle natively.
This fixes the ParseError on papers with \texttt{<...>} in math.
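The idea of copying math blocks verbatim while stripping tags elsewhere can be sketched as a single left-to-right scan. This is an illustration under simplifying assumptions, not the PR's implementation: the function name is hypothetical, escaped dollars (\$) are not handled, and anything between < and > outside math is treated as a tag.

```rust
/// Hypothetical sketch of `strip_html_tags_preserve_math()`:
/// copies `$...$` and `$$...$$` spans unchanged and drops `<...>` elsewhere.
fn strip_html_tags_preserve_math_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while !rest.is_empty() {
        if rest.starts_with('$') {
            // Prefer the display delimiter when both would match.
            let delim = if rest.starts_with("$$") { "$$" } else { "$" };
            let body = &rest[delim.len()..];
            if let Some(end) = body.find(delim) {
                // Copy opening delimiter, body and closing delimiter verbatim.
                let total = delim.len() + end + delim.len();
                out.push_str(&rest[..total]);
                rest = &rest[total..];
                continue;
            }
            // No closing delimiter: fall through and treat '$' as plain text.
        }
        if rest.starts_with('<') {
            if let Some(end) = rest.find('>') {
                // Skip the whole tag, including the angle brackets.
                rest = &rest[end + 1..];
                continue;
            }
        }
        // Ordinary character: copy it and advance by its UTF-8 width.
        if let Some(ch) = rest.chars().next() {
            out.push(ch);
            rest = &rest[ch.len_utf8()..];
        }
    }
    out
}
```

Because the angle brackets inside \texttt{<name>} are never rewritten, KaTeX receives the math exactly as pandoc emitted it.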
Adds normalize_display_math() to the sanitize pipeline to ensure $$...$$ display math blocks are isolated on their own lines. Prevents markdown renderers from misparsing inline $$...$$ as two $ inline math delimiters, which caused "Can't use function '$' in math mode" errors.
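A rough sketch of that normalization step follows. The function name and the exact whitespace policy (a single newline before and after each block, surrounding spaces trimmed) are assumptions; the PR may insert blank lines or handle edge cases differently.

```rust
/// Hypothetical sketch of `normalize_display_math()`:
/// puts every `$$...$$` block on its own line.
fn normalize_display_math_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len() + 16);
    let mut rest = input;
    while let Some(open) = rest.find("$$") {
        let body_start = open + 2;
        // Stop at an unmatched opening delimiter and copy the rest as-is.
        let Some(len) = rest[body_start..].find("$$") else { break };
        let block_end = body_start + len + 2;

        // Emit the text before the block without trailing spaces,
        // then make sure the block starts on a fresh line.
        out.push_str(rest[..open].trim_end_matches(' '));
        if !out.is_empty() && !out.ends_with('\n') {
            out.push('\n');
        }
        out.push_str(&rest[open..block_end]);

        // Make sure whatever follows the block starts on a fresh line too.
        rest = rest[block_end..].trim_start_matches(' ');
        if !rest.is_empty() && !rest.starts_with('\n') {
            out.push('\n');
        }
    }
    out.push_str(rest);
    out
}
```

Input that already has its display math on separate lines passes through unchanged, so the step is safe to run more than once.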
Instead of stripping <figure> blocks entirely, extract captions into markdown blockquotes and enrich them with ar5iv viewing links.
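As an illustration of the caption extraction described in this commit, the sketch below turns each <figure> block into a numbered blockquote with an ar5iv link. Everything specific here is an assumption: the function names, the exact blockquote wording, the attribute-free <figcaption> tag, and the ar5iv.labs.arxiv.org/html/<id> URL form. The PR splits this work between extract_figure_captions() and add_ar5iv_figure_links(); the sketch merges both for brevity.

```rust
/// Hypothetical sketch: replaces each `<figure>...</figure>` block with
/// a numbered markdown blockquote that links to the paper's ar5iv page.
fn extract_figure_captions_sketch(input: &str, arxiv_id: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    let mut n = 0u32;
    while let Some(open) = rest.find("<figure") {
        // Leave an unterminated block untouched.
        let Some(close_rel) = rest[open..].find("</figure>") else { break };
        let block = &rest[open..open + close_rel];
        out.push_str(&rest[..open]);
        n += 1;
        // Collapse whitespace so a multi-line caption fits one blockquote line.
        let caption = between(block, "<figcaption>", "</figcaption>")
            .unwrap_or("")
            .split_whitespace()
            .collect::<Vec<_>>()
            .join(" ");
        out.push_str(&format!(
            "> **Figure {n}:** {caption} ([view on ar5iv](https://ar5iv.labs.arxiv.org/html/{arxiv_id}))"
        ));
        rest = &rest[open + close_rel + "</figure>".len()..];
    }
    out.push_str(rest);
    out
}

/// Returns the text between the first `start` marker and the next `end` marker.
fn between<'a>(s: &'a str, start: &str, end: &str) -> Option<&'a str> {
    let from = s.find(start)? + start.len();
    let to = s[from..].find(end)? + from;
    Some(&s[from..to])
}
```

Linking to the page rather than to an image file sidesteps the need to guess ar5iv's renamed figure filenames.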
Summary
Figure Links (closes #1)
Previously, <figure> blocks from pandoc output were stripped entirely. Now:
- extract_figure_captions() converts <figure> blocks into numbered markdown blockquotes, preserving the caption text.
- add_ar5iv_figure_links() enriches each figure reference with a link to the paper's ar5iv HTML page, where the figures are viewable.
Since ar5iv renames figure files to x1.png, x2.png, etc. (unpredictable from the LaTeX source), we link to the full ar5iv page rather than to individual images.
KaTeX Fixes
Added regex-based post-processing in sanitize_markdown():
- \mathcal{X}{Y} → \mathcal{X}_{Y}
- \textsc{Algo} → \textbf{Algo}
- \Call{Solve}{x} → \textbf{Solve}(x)
- \mathbbm{1} → \mathbb{1}
- $a < b$ (previously mangled by HTML stripping) → copied verbatim inside math delimiters
- text $$x^2$$ text → $$x^2$$ on own line
Also added a search() method to the ArxivClient trait for arXiv keyword search.
MCP Server
A Rust-native MCP binary (markxiv-mcp) built with rmcp v0.14. Uses the markxiv library directly (no HTTP calls to markxiv.org).
Tools:
- convert_paper: full arXiv paper → markdown via local pandoc pipeline
- get_paper_metadata: title/authors/abstract lookup via arXiv API
- search_papers: keyword search via arXiv Atom API
Claude Desktop config:
    {
      "mcpServers": {
        "markxiv": {
          "command": "/path/to/markxiv-mcp"
        }
      }
    }
Requirements: pandoc + pdftotext installed locally. The MCP binary can be distributed pre-built so end users don't need a Rust toolchain.
How this differs from other arXiv MCP servers
Most existing arXiv MCP servers (10 or more) either fetch raw LaTeX, scrape HTML, or do basic PDF text extraction. markxiv-mcp runs pandoc locally on the actual LaTeX source, the same pipeline that powers markxiv.org, which yields higher-fidelity markdown output.
Sanitize Pipeline
The sanitize_markdown() pipeline now has 4 stages, including:
- converting <figure> blocks to numbered markdown blockquotes
- ensuring $$...$$ blocks are on their own lines
Test Plan
- cargo test --lib: all 50 unit tests pass
- cargo build -p markxiv-mcp: MCP binary compiles
- cargo build: main library compiles