Server data from the Official MCP Registry
PDF-to-Markdown router. Per-page backend selection + confidence scoring for RAG ingestion.
Valid MCP server (1 strong, 4 medium validity signals). No known CVEs in dependencies. Package registry verified. Imported from the Official MCP Registry.
9 files analyzed · 1 issue found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
Set these up before or after installing:
Environment variable: GEMINI_API_KEY
Add this to your MCP configuration file:
{
"mcpServers": {
"io-github-nameetp-pdfmux": {
"env": {
"GEMINI_API_KEY": "your-gemini-api-key-here"
},
"args": [
"pdfmux",
"serve"
],
"command": "uvx"
}
}
}
From the project's GitHub README.
Universal PDF extraction orchestrator. Routes each page to the best backend, audits the output, re-extracts failures. 5 rule-based extractors + BYOK LLM fallback. One CLI. One API. Zero config.
PDF ──> pdfmux router ──> best extractor per page ──> audit ──> re-extract failures ──> Markdown / JSON / chunks
|
├─ PyMuPDF (digital text, 0.01s/page)
├─ OpenDataLoader (complex layouts, 0.05s/page)
├─ RapidOCR (scanned pages, CPU-only)
├─ Docling (tables, 97.9% TEDS)
├─ Surya (heavy OCR fallback)
└─ YOUR LLM (Gemini / Claude / GPT-4o / Ollama — BYOK via 5-line YAML)
pip install pdfmux
That's it. Handles digital PDFs out of the box. Add backends for harder documents:
pip install "pdfmux[ocr]" # RapidOCR — scanned/image pages (~200MB, CPU-only)
pip install "pdfmux[tables]" # Docling — table-heavy docs (~500MB)
pip install "pdfmux[opendataloader]" # OpenDataLoader — complex layouts (Java 11+)
pip install "pdfmux[llm]" # LLM fallback — Gemini, Claude, GPT-4o, Ollama
pip install "pdfmux[all]" # everything
Requires Python 3.11+.
# zero config — just works
pdfmux convert invoice.pdf
# invoice.pdf -> invoice.md (2 pages, 95% confidence, via pymupdf4llm)
# RAG-ready chunks with token limits
pdfmux convert report.pdf --chunk --max-tokens 500
# cost-aware extraction with budget cap
pdfmux convert report.pdf --mode economy --budget 0.50
# schema-guided structured extraction (5 built-in presets)
pdfmux convert invoice.pdf --schema invoice
# BYOK any LLM for hardest pages
pdfmux convert scan.pdf --llm-provider claude
# batch a directory
pdfmux convert ./docs/ -o ./output/
import pdfmux
# text -> markdown
text = pdfmux.extract_text("report.pdf")
# structured data -> dict with tables, key-values, metadata
data = pdfmux.extract_json("report.pdf")
# RAG chunks -> list of dicts with token estimates
chunks = pdfmux.chunk("report.pdf", max_tokens=500)
┌─────────────────────────────┐
│ Segment Detector │
│ text / tables / images / │
│ formulas / headers per page │
└─────────────┬───────────────┘
│
┌────────────────────────────────────────┐
│ Router Engine │
│ │
│ economy ── balanced ── premium │
│ (minimize $) (default) (max quality)│
│ budget caps: --budget 0.50 │
└────────────────────┬───────────────────┘
│
┌──────────┬──────────┬────────┴────────┬──────────┐
│ │ │ │ │
PyMuPDF OpenData RapidOCR Docling LLM
digital Loader scanned tables (BYOK)
0.01s/pg complex CPU-only 97.9% any provider
layouts TEDS
│ │ │ │ │
└──────────┴──────────┴────────┬────────┴──────────┘
│
┌────────────────────────────────────────┐
│ Quality Auditor │
│ │
│ 4-signal dynamic confidence scoring │
│ per-page: good / bad / empty │
│ if bad -> re-extract with next backend│
└────────────────────┬───────────────────┘
│
┌────────────────────────────────────────┐
│ Output Pipeline │
│ │
│ heading injection (font-size analysis)│
│ table extraction + normalization │
│ text cleanup + merge │
│ confidence score (honest, not inflated)│
└────────────────────────────────────────┘
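The route, audit, and re-extract flow in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not pdfmux's actual implementation: the `Attempt` type, the `extract_with_fallback` helper, the two stand-in backends, and the 0.80 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Attempt:
    """Result of one backend on one page (hypothetical type for this sketch)."""
    backend: str
    text: str
    confidence: float  # 0.0 to 1.0


# A backend is any callable that turns raw page data into an Attempt.
Backend = Callable[[bytes], Attempt]


def extract_with_fallback(
    page: bytes, backends: Sequence[Backend], threshold: float = 0.80
) -> Optional[Attempt]:
    """Try backends in order and stop at the first one that passes the audit.

    If no attempt reaches the threshold, the best attempt seen is returned.
    """
    best: Optional[Attempt] = None
    for backend in backends:
        attempt = backend(page)
        if best is None or attempt.confidence > best.confidence:
            best = attempt
        if attempt.confidence >= threshold:
            break  # audit passed, no re-extraction needed
    return best


# Stand-in backends (assumptions): a fast text reader that fails on image
# pages, and a slower OCR pass that always returns something.
def fast_text(page: bytes) -> Attempt:
    text = "" if page.startswith(b"IMG") else page.decode("utf-8", errors="ignore")
    return Attempt("fast_text", text, 0.95 if text else 0.0)


def slow_ocr(page: bytes) -> Attempt:
    return Attempt("slow_ocr", "recognized text", 0.88)


pages = [b"Hello digital page", b"IMG\x00\x01"]
results = [extract_with_fallback(p, [fast_text, slow_ocr]) for p in pages]
for number, result in enumerate(results, start=1):
    print(f"page {number}: {result.backend} ({result.confidence:.0%})")
```

Ordering the backends from cheapest to most expensive means the slow path only runs for pages that fail the audit.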
| Feature | What it does | Command |
|---|---|---|
| Zero-config extraction | Routes to best backend automatically | pdfmux convert file.pdf |
| RAG chunking | Section-aware chunks with token estimates | pdfmux convert file.pdf --chunk --max-tokens 500 |
| Cost modes | economy / balanced / premium with budget caps | pdfmux convert file.pdf --mode economy --budget 0.50 |
| Schema extraction | 5 built-in presets (invoice, receipt, contract, resume, paper) | pdfmux convert file.pdf --schema invoice |
| BYOK LLM | Gemini, Claude, GPT-4o, Ollama, any OpenAI-compatible API | pdfmux convert file.pdf --llm-provider claude |
| Benchmark | Eval all installed extractors against ground truth | pdfmux benchmark |
| Doctor | Show installed backends, coverage gaps, recommendations | pdfmux doctor |
| MCP server | AI agents read PDFs via stdio or HTTP | pdfmux serve |
| Batch processing | Convert entire directories | pdfmux convert ./docs/ |
| Streaming | Bounded-memory page iteration for large files | for page in ext.extract("500pg.pdf") |
pdfmux convert
pdfmux convert <file-or-dir> [options]
Options:
-o, --output PATH Output file or directory
-f, --format FORMAT markdown | json | csv | llm (default: markdown)
-q, --quality QUALITY fast | standard | high (default: standard)
-s, --schema SCHEMA JSON schema file or preset (invoice, receipt, contract, resume, paper)
--chunk Output RAG-ready chunks
--max-tokens N Max tokens per chunk (default: 500)
--mode MODE economy | balanced | premium (default: balanced)
--budget AMOUNT Max spend per document in USD
--llm-provider PROVIDER LLM backend: gemini | claude | openai | ollama
--confidence Include confidence score in output
--stdout Print to stdout instead of file
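When calling the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only flags listed in the options above; the `build_convert_command` helper itself is hypothetical and not part of pdfmux.

```python
import shlex
from typing import List, Optional


def build_convert_command(
    source: str,
    *,
    output: Optional[str] = None,
    fmt: Optional[str] = None,
    chunk: bool = False,
    max_tokens: Optional[int] = None,
    mode: Optional[str] = None,
    budget: Optional[float] = None,
) -> List[str]:
    """Build an argv list for `pdfmux convert` (helper name is an assumption)."""
    cmd = ["pdfmux", "convert", source]
    if output is not None:
        cmd += ["--output", output]
    if fmt is not None:
        cmd += ["--format", fmt]
    if chunk:
        cmd.append("--chunk")
    if max_tokens is not None:
        cmd += ["--max-tokens", str(max_tokens)]
    if mode is not None:
        cmd += ["--mode", mode]
    if budget is not None:
        cmd += ["--budget", f"{budget:.2f}"]
    return cmd


cmd = build_convert_command("report.pdf", mode="economy", budget=0.5)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where pdfmux is installed.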
pdfmux serve
Start the MCP server for AI agent integration.
pdfmux serve # stdio mode (Claude Desktop, Cursor)
pdfmux serve --http 8080 # HTTP mode
pdfmux doctor
pdfmux doctor
# ┌──────────────────┬─────────────┬─────────┬──────────────────────────────────┐
# │ Extractor │ Status │ Version │ Install │
# ├──────────────────┼─────────────┼─────────┼──────────────────────────────────┤
# │ PyMuPDF │ installed │ 1.25.3 │ │
# │ OpenDataLoader │ installed │ 0.3.1 │ │
# │ RapidOCR │ installed │ 3.0.6 │ │
# │ Docling │ missing │ -- │ pip install pdfmux[tables] │
# │ Surya │ missing │ -- │ pip install pdfmux[ocr-heavy] │
# │ LLM (Gemini) │ configured │ -- │ GEMINI_API_KEY set │
# └──────────────────┴─────────────┴─────────┴──────────────────────────────────┘
pdfmux benchmark
pdfmux benchmark report.pdf
# ┌──────────────────┬────────┬────────────┬─────────────┬──────────────────────┐
# │ Extractor │ Time │ Confidence │ Output │ Status │
# ├──────────────────┼────────┼────────────┼─────────────┼──────────────────────┤
# │ PyMuPDF │ 0.02s │ 95% │ 3,241 chars │ all pages good │
# │ Multi-pass │ 0.03s │ 95% │ 3,241 chars │ all pages good │
# │ RapidOCR │ 4.20s │ 88% │ 2,891 chars │ ok │
# │ OpenDataLoader │ 0.12s │ 97% │ 3,310 chars │ best │
# └──────────────────┴────────┴────────────┴─────────────┴──────────────────────┘
import pdfmux
text = pdfmux.extract_text("report.pdf") # -> str (markdown)
text = pdfmux.extract_text("report.pdf", quality="fast") # PyMuPDF only, instant
text = pdfmux.extract_text("report.pdf", quality="high") # LLM-assisted
data = pdfmux.extract_json("report.pdf")
# data["page_count"] -> 12
# data["confidence"] -> 0.91
# data["ocr_pages"] -> [2, 5, 8]
# data["pages"][0]["key_values"] -> [{"key": "Date", "value": "2026-02-28"}]
# data["pages"][0]["tables"] -> [{"headers": [...], "rows": [...]}]
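A short sketch of post-processing that result follows. It assumes only the field names shown in the comments above (`pages`, `key_values`, `tables`, `headers`, `rows`); the sample dictionary and both helper functions are made up for illustration and do not come from pdfmux.

```python
from typing import Any, Dict, List


def collect_key_values(data: Dict[str, Any]) -> Dict[str, str]:
    """Flatten per-page key/value pairs into one dict; the first occurrence wins."""
    merged: Dict[str, str] = {}
    for page in data.get("pages", []):
        for pair in page.get("key_values", []):
            merged.setdefault(pair["key"], pair["value"])
    return merged


def table_rows_as_dicts(table: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn a {"headers": [...], "rows": [...]} table into one dict per row."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


# Made-up sample that mirrors the documented shape.
sample = {
    "page_count": 2,
    "confidence": 0.91,
    "pages": [
        {
            "key_values": [{"key": "Date", "value": "2026-02-28"}],
            "tables": [{"headers": ["Item", "Qty"], "rows": [["Widget", "3"]]}],
        },
        {
            "key_values": [
                {"key": "Date", "value": "2026-03-01"},
                {"key": "Vendor", "value": "Acme"},
            ],
            "tables": [],
        },
    ],
}

fields = collect_key_values(sample)
rows = table_rows_as_dicts(sample["pages"][0]["tables"][0])
print(fields)
print(rows)
```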
chunks = pdfmux.chunk("report.pdf", max_tokens=500)
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
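Because each chunk carries a token estimate, chunks can be grouped into batches before they are sent to an embedding API. The sketch below is one possible approach, assuming chunks are dicts with the `tokens` key used above; the `batch_by_tokens` helper and the sample data are hypothetical.

```python
from typing import Any, Dict, List

Chunk = Dict[str, Any]


def batch_by_tokens(chunks: List[Chunk], max_batch_tokens: int) -> List[List[Chunk]]:
    """Greedily pack chunks, in order, into batches under a token limit.

    A single chunk larger than the limit still gets a batch of its own.
    """
    batches: List[List[Chunk]] = []
    current: List[Chunk] = []
    used = 0
    for chunk in chunks:
        tokens = chunk["tokens"]
        if current and used + tokens > max_batch_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Made-up chunks that mirror the documented keys.
sample_chunks = [
    {"title": "Summary", "tokens": 400, "page_start": 1, "page_end": 1},
    {"title": "Revenue", "tokens": 300, "page_start": 2, "page_end": 3},
    {"title": "Costs", "tokens": 500, "page_start": 4, "page_end": 5},
    {"title": "Outlook", "tokens": 100, "page_start": 6, "page_end": 6},
]

batches = batch_by_tokens(sample_chunks, max_batch_tokens=800)
for batch in batches:
    print([c["title"] for c in batch], sum(c["tokens"] for c in batch))
```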
data = pdfmux.extract_json("invoice.pdf", schema="invoice")
# Uses built-in invoice preset: extracts date, vendor, line items, totals
# Also accepts a path to a custom JSON Schema file
from pdfmux.extractors import get_extractor
ext = get_extractor("fast")
for page in ext.extract("large-500-pages.pdf"): # Iterator[PageResult]
    process(page.text) # process() is your own handler; constant memory, even on 500-page PDFs
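The reason streaming keeps memory bounded is that each page is handled and released before the next one is produced. The sketch below shows the consumer side of that pattern with a stand-in generator, so it runs without pdfmux; the `Page` type, `fake_pages`, and `stream_to_markdown` are assumptions made for the example.

```python
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, TextIO, Tuple


@dataclass(frozen=True)
class Page:
    """Stand-in for a page result; keeps only the fields used here."""
    page_num: int
    text: str
    confidence: float


def fake_pages(n: int) -> Iterator[Page]:
    """Yield pages one at a time, the way a streaming extractor would."""
    for i in range(n):
        yield Page(page_num=i, text=f"Text of page {i}", confidence=0.9)


def stream_to_markdown(pages: Iterable[Page], out: TextIO) -> Tuple[int, float]:
    """Write each page as it arrives and keep only running totals in memory."""
    count = 0
    total_confidence = 0.0
    for page in pages:
        out.write(page.text + "\n\n")
        count += 1
        total_confidence += page.confidence
    mean = total_confidence / count if count else 0.0
    return count, mean


buffer = io.StringIO()  # a real run would write to an open file instead
count, mean_confidence = stream_to_markdown(fake_pages(500), buffer)
print(count, round(mean_confidence, 2))
```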
from pdfmux import (
# Enums
Quality, # FAST, STANDARD, HIGH
OutputFormat, # MARKDOWN, JSON, CSV, LLM
PageQuality, # GOOD, BAD, EMPTY
# Data objects (frozen dataclasses)
PageResult, # page: text, page_num, confidence, quality, extractor
DocumentResult, # document: pages, source, confidence, extractor_used
Chunk, # chunk: title, text, page_start, page_end, tokens
# Errors
PdfmuxError, # base -- catch this for all pdfmux errors
FileError, # file not found, unreadable, not a PDF
ExtractionError, # extraction failed
ExtractorNotAvailable,# requested backend not installed
FormatError, # invalid output format
AuditError, # audit could not complete
)
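Since `PdfmuxError` is described as the base class, specific errors can be caught first and the base class last. The sketch below illustrates that ordering with locally defined stand-in classes so that it runs without pdfmux installed. The stand-ins only mirror the names above; treating every error as a direct subclass of `PdfmuxError` is an assumption, and the `describe_failure` helper is hypothetical.

```python
from typing import Callable


# Stand-ins that mirror the exported names; with pdfmux installed these
# would come from `from pdfmux import ...` instead.
class PdfmuxError(Exception):
    pass


class FileError(PdfmuxError):
    pass


class ExtractorNotAvailable(PdfmuxError):
    pass


class ExtractionError(PdfmuxError):
    pass


def describe_failure(action: Callable[[], str]) -> str:
    """Run an extraction callable and map each error type to a next step."""
    try:
        return action()
    except FileError as exc:
        return f"skip file: {exc}"
    except ExtractorNotAvailable as exc:
        return f"install backend: {exc}"
    except PdfmuxError as exc:  # catch-all for the remaining pdfmux errors
        return f"extraction failed: {exc}"


def missing_file() -> str:
    raise FileError("report.pdf not found")


def missing_backend() -> str:
    raise ExtractorNotAvailable("docling")


def broken_page() -> str:
    raise ExtractionError("page 4 unreadable")


for action in (missing_file, missing_backend, broken_page, lambda: "ok"):
    print(describe_failure(action))
```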
pip install langchain-pdfmux
from langchain_pdfmux import PDFMuxLoader
loader = PDFMuxLoader("report.pdf", quality="standard")
docs = loader.load() # -> list[Document] with confidence metadata
pip install llama-index-readers-pdfmux
from llama_index.readers.pdfmux import PDFMuxReader
reader = PDFMuxReader(quality="standard")
docs = reader.load_data("report.pdf") # -> list[Document]
Listed on mcpservers.org. One-line setup:
{
"mcpServers": {
"pdfmux": {
"command": "npx",
"args": ["-y", "pdfmux-mcp"]
}
}
}
Or via Claude Code:
claude mcp add pdfmux -- npx -y pdfmux-mcp
Tools exposed: convert_pdf, analyze_pdf, extract_structured, get_pdf_metadata, batch_convert.
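MCP clients invoke these tools with JSON-RPC 2.0 messages: `tools/list` to discover each tool's input schema and `tools/call` to run one. The sketch below only builds the message bodies. The `path` argument name is a placeholder, not pdfmux's documented schema; the real argument names should be read from the `tools/list` response. A real session also starts with an `initialize` handshake, which is omitted here.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

_ids = count(1)  # JSON-RPC request ids must be unique within a session


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it exposes and what arguments they accept.
list_tools = jsonrpc_request("tools/list")

# Call one tool. "path" is a hypothetical argument name used for illustration.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "convert_pdf", "arguments": {"path": "report.pdf"}},
)

print(json.dumps(list_tools))
print(json.dumps(call_tool, indent=2))
```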
pdfmux supports any LLM via 5 lines of YAML. Bring your own keys -- nothing leaves your machine unless you configure it to.
# ~/.pdfmux/llm.yaml
provider: claude # gemini | claude | openai | ollama | any OpenAI-compatible
model: claude-sonnet-4-20250514
api_key: ${ANTHROPIC_API_KEY}
base_url: https://api.anthropic.com # optional, for custom endpoints
max_cost_per_page: 0.02 # budget cap
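The `${ANTHROPIC_API_KEY}` placeholder keeps the secret out of the file itself. How pdfmux resolves it internally is not documented here; the sketch below shows one common way to expand such placeholders with the standard library, using a plain dict in place of the parsed YAML. The `resolve_placeholders` helper is hypothetical.

```python
from string import Template
from typing import Any, Dict, Mapping


def resolve_placeholders(config: Mapping[str, Any], env: Mapping[str, str]) -> Dict[str, Any]:
    """Expand ${NAME} placeholders in string values; fail loudly if NAME is unset."""
    resolved: Dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, str):
            try:
                resolved[key] = Template(value).substitute(env)
            except KeyError as exc:
                raise ValueError(
                    f"{key}: environment variable {exc.args[0]} is not set"
                ) from None
        else:
            resolved[key] = value  # numbers and other types pass through unchanged
    return resolved


# The YAML above, written out as the dict a YAML parser would produce.
config = {
    "provider": "claude",
    "model": "claude-sonnet-4-20250514",
    "api_key": "${ANTHROPIC_API_KEY}",
    "base_url": "https://api.anthropic.com",
    "max_cost_per_page": 0.02,
}

resolved = resolve_placeholders(config, {"ANTHROPIC_API_KEY": "sk-test"})
print(resolved["api_key"])
```

In practice `os.environ` would be passed as `env`. Raising on a missing variable avoids sending the literal placeholder string as an API key.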
Supported providers:
| Provider | Models | Local? | Cost |
|---|---|---|---|
| Gemini | 2.5 Flash, 2.5 Pro | No | ~$0.01/page |
| Claude | Sonnet, Opus | No | ~$0.015/page |
| GPT-4o | GPT-4o, GPT-4o-mini | No | ~$0.01/page |
| Ollama | Any local model | Yes | Free |
| Custom | Any OpenAI-compatible API | Configurable | Varies |
Tested on opendataloader-bench -- 200 real-world PDFs across financial reports, legal filings, academic papers, and scanned documents.
| Engine | Overall | Reading Order | Tables (TEDS) | Headings | Requires |
|---|---|---|---|---|---|
| opendataloader hybrid | 0.909 | 0.935 | 0.928 | 0.828 | API calls ($) |
| pdfmux | 0.905 | 0.920 | 0.911 | 0.852 | CPU only, $0 |
| docling | 0.877 | 0.900 | 0.887 | 0.802 | ~500MB models |
| marker | 0.861 | 0.890 | 0.808 | 0.796 | GPU recommended |
| opendataloader local | 0.844 | 0.913 | 0.494 | 0.761 | CPU only |
| mineru | 0.831 | 0.857 | 0.873 | 0.743 | GPU + ~2GB models |
#2 overall, #1 among free tools. 99.5% of the paid #1 score at zero cost per page. Best heading detection of any engine tested. Image table OCR extracts tables embedded as images.
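The ranking and the relative-score claim can be checked directly from the table. The sketch below uses the rounded Overall and Headings values as printed; with those inputs the ratio of pdfmux to the top score comes out at about 0.9956, in line with the figure quoted above.

```python
# Overall and Headings scores copied from the benchmark table above.
overall = {
    "opendataloader hybrid": 0.909,
    "pdfmux": 0.905,
    "docling": 0.877,
    "marker": 0.861,
    "opendataloader local": 0.844,
    "mineru": 0.831,
}
headings = {
    "opendataloader hybrid": 0.828,
    "pdfmux": 0.852,
    "docling": 0.802,
    "marker": 0.796,
    "opendataloader local": 0.761,
    "mineru": 0.743,
}

# Rank engines by overall score, best first.
ranked = sorted(overall, key=overall.get, reverse=True)
rank_of_pdfmux = ranked.index("pdfmux") + 1

# Relative score against the top-ranked engine.
ratio = overall["pdfmux"] / overall[ranked[0]]

# Engine with the best heading detection.
best_headings = max(headings, key=headings.get)

print(f"pdfmux rank: #{rank_of_pdfmux}")
print(f"relative to #1: {ratio:.4f}")
print(f"best headings: {best_headings}")
```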
Every result includes a 4-signal confidence score.
When confidence drops below 80%, pdfmux tells you exactly what went wrong and how to fix it:
Page 4: 32% confidence. 0 chars extracted from image-heavy page.
-> Install pdfmux[ocr] for RapidOCR support on 6 image-heavy pages.
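The same 80% threshold can be applied in your own pipeline to decide which pages need a second look. The sketch below assumes each page is a dict with `page_num`, `confidence`, and `text` keys, mirroring the `PageResult` fields listed earlier; the `low_confidence_report` helper and the sample pages are hypothetical.

```python
from typing import Any, Dict, List


def low_confidence_report(pages: List[Dict[str, Any]], threshold: float = 0.80) -> List[str]:
    """Return one warning line for every page whose confidence is below the threshold."""
    lines = []
    for page in pages:
        if page["confidence"] < threshold:
            lines.append(
                f"Page {page['page_num']}: {page['confidence']:.0%} confidence. "
                f"{len(page['text'])} chars extracted."
            )
    return lines


# Made-up pages: one clean digital page and one image-only page.
sample_pages = [
    {"page_num": 3, "confidence": 0.94, "text": "Quarterly results and outlook"},
    {"page_num": 4, "confidence": 0.32, "text": ""},
]

warnings = low_confidence_report(sample_pages)
for line in warnings:
    print(line)
```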
| Mode | Behavior | Typical cost |
|---|---|---|
| economy | Rule-based backends only. No LLM calls. | $0/page |
| balanced | LLM only for pages that fail rule-based extraction. | ~$0.002/page avg |
| premium | LLM on every page for maximum quality. | ~$0.01/page |
Set a hard budget cap: --budget 0.50 stops LLM calls when spend reaches $0.50 per document.
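A budget cap of this kind can be modelled as a running total that gates each LLM call. The sketch below is an illustration under stated assumptions, not pdfmux's implementation: it assumes a fixed cost per page and that a call is made only if it still fits within the budget. `Decimal` is used so that repeated cent-level additions do not accumulate floating-point error.

```python
from decimal import Decimal
from typing import List, Sequence, Tuple


def plan_llm_calls(
    failed_pages: Sequence[int], cost_per_page: Decimal, budget: Decimal
) -> Tuple[List[int], List[int], Decimal]:
    """Split failed pages into those sent to the LLM and those skipped by the cap."""
    spent = Decimal("0")
    sent: List[int] = []
    skipped: List[int] = []
    for page in failed_pages:
        if spent + cost_per_page <= budget:
            sent.append(page)
            spent += cost_per_page
        else:
            skipped.append(page)  # budget exhausted, keep the rule-based result
    return sent, skipped, spent


# Eight pages failed rule-based extraction; the cap allows five LLM calls.
sent, skipped, spent = plan_llm_calls(
    [3, 4, 7, 9, 12, 15, 18, 21],
    cost_per_page=Decimal("0.01"),
    budget=Decimal("0.05"),
)
print(f"sent={sent} skipped={skipped} spent=${spent}")
```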
pdfmux is not another PDF extractor. It is the orchestration layer that picks the right extractor per page, verifies the result, and retries failures.
| Tool | Good at | Limitation |
|---|---|---|
| PyMuPDF | Fast digital text | Cannot handle scans or image layouts |
| Docling | Tables (97.9% accuracy) | Slow on non-table documents |
| Marker | GPU ML extraction | Needs GPU, overkill for digital PDFs |
| Unstructured | Enterprise platform | Complex setup, paid tiers |
| LlamaParse | Cloud-native | Requires API keys, not local |
| Reducto | High accuracy | $0.015/page, closed source |
| pdfmux | Orchestrates all of the above | Routes per page, audits, re-extracts |
Open source Reducto alternative: what costs $0.015/page elsewhere is free with pdfmux's rule-based backends, or ~$0.002/page average with BYOK LLM fallback.
git clone https://github.com/NameetP/pdfmux.git
cd pdfmux
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest # 151 tests
ruff check src/ tests/
ruff format src/ tests/
Create a feature branch (git checkout -b feature/your-feature) and make sure pytest and ruff check pass.