Server data from the Official MCP Registry
Local OCR & image analysis via Apple Vision — no cloud, no API keys, ~97% fewer tokens on PDFs.
macos-vision-mcp is a well-architected MCP server that provides local OCR and image analysis using Apple's Vision Framework. The codebase is clean, properly typed with TypeScript, and has no authentication requirements (as intended for a local, offline tool). Permissions are appropriate for its purpose: file reading and system APIs. Minor code quality suggestions exist but do not represent security vulnerabilities. Supply chain analysis found 3 known vulnerabilities in dependencies (0 critical, 3 high severity). Package verification found 1 issue.
4 files analyzed · 8 issues found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
This plugin requests these system permissions. Most are normal for its category.
Add this to your MCP configuration file:
{
"mcpServers": {
"io-github-woladi-macos-vision-mcp": {
"args": [
"-y",
"macos-vision-mcp"
],
"command": "npx"
}
}
}
From the project's GitHub README.
Cut document token costs by ~97% with local, private, offline OCR for any MCP client — no API keys, no uploads.
Pre-extracts text and image data locally before your AI ever sees it — cutting token usage by ~97% on real documents and returning structured paragraphs, lines, and bounding boxes so the model can reconstruct the document into Markdown, HTML, DOCX, or any other format. Files never leave your Mac: no cloud API, no API keys, no network requests.
How the ~97% is measured: a 44-page scanned PDF sent as page images costs ~73,500 tokens; the same file run through analyze_document returns ~2,400 tokens of extracted text and structure (raw page-image tokens vs. extracted-text tokens). Your numbers vary with page density and tokenizer. Treat 97% as the order of magnitude, not a guarantee.
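The arithmetic behind that figure can be checked directly. The sketch below uses only the two token counts quoted above; the function name tokenReduction is illustrative and not part of the package.

```typescript
// Token counts quoted in the README for a 44-page scanned PDF.
const pageImageTokens = 73_500; // PDF sent as page images
const extractedTokens = 2_400; // analyze_document output

// Fraction of tokens saved by sending extracted text instead of page images.
function tokenReduction(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be positive");
  return (before - after) / before;
}

const reduction = tokenReduction(pageImageTokens, extractedTokens);
console.log(`${(reduction * 100).toFixed(1)}% fewer tokens`); // 96.7% fewer tokens
```

Plugging in your own before/after counts gives the figure for your documents and tokenizer.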
Contents: Quick Start · What you get · Why it's different · Available Tools · Usage · Example workflows · Configuration · Privacy layer
npm install, powered by Apple Vision Framework, same engine as Live Text in Photos.app.
❌ Without macos-vision-mcp:
✅ With macos-vision-mcp:
Most OCR options for LLMs either ship your documents to a cloud vision API or make you stand up and tune your own engine. This runs on Apple's on-device Vision framework — the same engine behind Live Text in Photos.app — so extraction is free, private, and instant.
| | macos-vision-mcp | Cloud vision OCR (GPT-4o, Google Vision, Mistral OCR) | Tesseract-based MCP |
|---|---|---|---|
| Cost | $0 — no per-page or per-token fees | Per-call / per-page billing | $0, but self-hosted |
| Offline | Yes, after install | No — every page hits the network | Yes |
| Privacy | Files never leave your Mac | Documents uploaded to a third party | Local |
| Setup | One command, no keys | API key + billing account | Install + language data + tuning |
| Quality | Apple Vision (strong on clean scans, receipts, screenshots) | Generally high | Varies; weaker on poor scans |
The trade-off is honest: it's macOS-only, and on heavily skewed or low-contrast scans a cloud model may still read more. For the common case — invoices, contracts, receipts, screenshots, clean PDFs — you get cloud-grade extraction with zero cost, zero setup, and nothing leaving your machine.
macos-vision-mcp acts as a local pre-processing layer between your documents and the cloud. Useful for:
Instead of sending the raw document to your AI, you extract the text and structure locally first. The model then works only with the extracted text — never the original file.
Add to your MCP client (example for Claude Code):
claude mcp add macos-vision-mcp -- npx -y macos-vision-mcp
Using Claude Desktop or Cursor? Jump to Configuration ↓
Restart your client. npx fetches the package on first run, caches it, and the tools appear automatically — no separate install step. This is the convention used by most MCP servers and recommended by Anthropic, Cursor, and other clients.
Note: On first run, the package downloads prebuilt Swift helper binaries (vision-helper, pdf-helper) from its GitHub Releases (~300 KB, ~1–2s). Subsequent invocations hit the npx cache and start instantly. Xcode Command Line Tools are only required as a fallback when the download can't reach the network. Set MACOS_VISION_SKIP_DOWNLOAD=1 to force local compilation with swiftc.
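The decision described in the note can be summarized as a small function. This is a sketch of the documented behaviour only, not the package's actual code: the names HelperSource and chooseHelperSource, and the downloadOk flag standing in for "the release download succeeded", are hypothetical. Only the MACOS_VISION_SKIP_DOWNLOAD variable comes from the note.

```typescript
type HelperSource = "prebuilt-download" | "local-swiftc";

// Decide where the Swift helper binaries come from.
// `env` is passed in (rather than read from the process) to keep the sketch self-contained.
// `downloadOk` is a hypothetical flag: true when the prebuilt binaries could be fetched.
function chooseHelperSource(
  env: Record<string, string | undefined>,
  downloadOk: boolean,
): HelperSource {
  // MACOS_VISION_SKIP_DOWNLOAD=1 forces local compilation with swiftc.
  if (env["MACOS_VISION_SKIP_DOWNLOAD"] === "1") return "local-swiftc";
  // Otherwise prefer the prebuilt download and compile locally only as a fallback.
  return downloadOk ? "prebuilt-download" : "local-swiftc";
}

console.log(chooseHelperSource({}, true)); // prebuilt-download
console.log(chooseHelperSource({ MACOS_VISION_SKIP_DOWNLOAD: "1" }, true)); // local-swiftc
```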
Prefer instant cold-starts (no npx cache lookup)? Install globally with npm install -g macos-vision-mcp and use the alternative config shown at the bottom of Configuration.
| Tool | What it does | Example prompt |
|---|---|---|
| ocr_image | Extract text from an image or PDF (JPG, PNG, HEIC, TIFF, PDF). Returns plain text, or per-page paragraphs + text blocks with lineId / paragraphId and bounding boxes. Accepts start_page / max_pages for partial PDF OCR. | "Read the text from ~/Desktop/screenshot.png" |
| detect_faces | Detect human faces and return their count and positions. | "How many people are in this photo?" |
| detect_barcodes | Read QR codes, EAN, UPC, Code128, PDF417, Aztec, and other 1D/2D codes. | "What does the QR code in /tmp/qr.jpg say?" |
| detect_document | Detect the four corner points of a document in a photo (paper, receipt, ID). Useful as a crop / deskew hint before OCR. | "Find the document corners in ~/Desktop/receipt.jpg" |
| classify_image | Classify image content into 1000+ categories with confidence scores. | "What is in this image?" |
| analyze_document | Returns structured JSON with reading-order paragraphs, raw text blocks (bbox / confidence), faces, barcodes, and rectangles, ready for the model to reconstruct into Markdown, HTML, or anything else. Also accepts start_page / max_pages for long PDFs. | "Reconstruct ~/Desktop/scan.pdf as clean Markdown" |
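For long PDFs, the start_page / max_pages arguments suggest processing the file in windows. The sketch below only computes the windows. It assumes 0-based page indices (the sample output elsewhere in this README shows "page": 0, but the indexing of start_page is not stated), and the names PageWindow and pageWindows are illustrative.

```typescript
// One request window. start_page / max_pages are the argument names from the
// tools table; the type itself is illustrative.
interface PageWindow {
  start_page: number;
  max_pages: number;
}

// Split a PDF of `pageCount` pages into consecutive windows of at most
// `windowSize` pages. Assumes start_page is 0-based (an assumption).
function pageWindows(pageCount: number, windowSize: number): PageWindow[] {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  if (!Number.isInteger(windowSize) || windowSize < 1) {
    throw new RangeError("windowSize must be a positive integer");
  }
  const windows: PageWindow[] = [];
  for (let start = 0; start < pageCount; start += windowSize) {
    // The last window may be shorter than windowSize.
    windows.push({ start_page: start, max_pages: Math.min(windowSize, pageCount - start) });
  }
  return windows;
}

console.log(pageWindows(44, 10));
```

For the 44-page example this yields five windows, the last one covering the remaining 4 pages.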
Use the tool name explicitly in your prompt to guarantee local processing:
Extract text from an image or PDF:
Use ocr_image to extract text from ~/Desktop/invoice.pdf
Detect faces in a photo:
Use detect_faces on ~/Photos/team.jpg and tell me how many people are in it
Classify image content:
Use classify_image on ~/Downloads/unknown.jpg
Full document analysis + reconstruction:
Use analyze_document on ~/Desktop/report.pdf and reconstruct it as clean Markdown
The tool returns structured JSON; the model picks the output format you ask for (Markdown, HTML, DOCX outline, etc.) without any extra dependencies — no Ollama, no cloud LLM, no extra tooling.
Real-world combinations that work out of the box once the server is connected:
- analyze_document returns reading-order paragraphs and bounding boxes; the model renders Markdown ready to drop into a docs site, knowledge base, or RAG pipeline.
- Run analyze_document, then send only the structured JSON upstream. The original document never leaves your Mac.
- Run ocr_image on a phone photo; the model normalizes amount / date / merchant and pipes the result straight into your expense tool's API.
- detect_barcodes returns the decoded value plus symbology in one round trip.
- detect_document returns the four corner points so you (or a downstream tool) can deskew and crop the image before reading the text.
{
"source": { "path": "...", "pageCount": 1, "isPdf": false },
"pages": [
{
"page": 0,
// primary surface for reconstruction — reading-order paragraphs joined with "\n"
"paragraphs": [
{ "paragraphId": 0, "lineIds": [0], "text": "ACME COFFEE" },
{ "paragraphId": 1, "lineIds": [1, 2], "text": "12 Main St\nPortland, OR" },
],
// spatial fallback — raw blocks with page-local 0–1 bbox, confidence, line/paragraph membership
"textBlocks": [
{
"text": "ACME COFFEE",
"lineId": 0,
"paragraphId": 0,
"confidence": 0.99,
"bbox": { "x": 0.21, "y": 0.04, "width": 0.58, "height": 0.06 },
},
],
"faces": [],
"barcodes": [],
"rectangles": [],
},
],
"summary": {
"totalTextBlocks": 8,
"totalParagraphs": 2,
"totalFaces": 0,
"totalBarcodes": 0,
"totalRectangles": 0,
},
}
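The bbox values in the sample are page-local fractions between 0 and 1, so they need scaling before they can be used to crop or draw on a rendered page. A minimal sketch, assuming you know the rendered page size in pixels; the names BBox and toPixels are illustrative, and no axis flip is applied because the sample does not state which corner the origin sits in.

```typescript
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Scale a page-local 0–1 bbox to pixel units for a page of the given size.
// The result keeps whatever origin convention the input uses.
function toPixels(bbox: BBox, pageWidth: number, pageHeight: number): BBox {
  return {
    x: Math.round(bbox.x * pageWidth),
    y: Math.round(bbox.y * pageHeight),
    width: Math.round(bbox.width * pageWidth),
    height: Math.round(bbox.height * pageHeight),
  };
}

// The "ACME COFFEE" block from the sample, on a hypothetical 1000 x 2000 px render.
const acme: BBox = { x: 0.21, y: 0.04, width: 0.58, height: 0.06 };
console.log(toPixels(acme, 1000, 2000)); // { x: 210, y: 80, width: 580, height: 120 }
```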
Use paragraphs[].text for the 95% case (rebuild Markdown/HTML/plain text directly). Reach for textBlocks[] when you need spatial context — multi-column layouts, tables, forms, IDs.
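In practice the model does this reconstruction for you, but the paragraphs-first path is simple enough to sketch by hand. The types below cover only the fields used and are modelled on the sample output above. The function name toMarkdown and the choice of a horizontal rule between pages are illustrative, not something the tool prescribes.

```typescript
// Minimal types for the fields used here, modelled on the sample output.
interface Paragraph {
  paragraphId: number;
  lineIds: number[];
  text: string;
}
interface Page {
  page: number;
  paragraphs: Paragraph[];
}
interface AnalyzeResult {
  pages: Page[];
}

// Join each page's reading-order paragraphs into Markdown: a blank line
// between paragraphs and a horizontal rule between pages.
function toMarkdown(result: AnalyzeResult): string {
  return result.pages
    .map((p) =>
      p.paragraphs
        // A line break inside a paragraph becomes a Markdown hard break ("  \n").
        .map((para) => para.text.split("\n").join("  \n"))
        .join("\n\n"),
    )
    .join("\n\n---\n\n");
}

const sample: AnalyzeResult = {
  pages: [
    {
      page: 0,
      paragraphs: [
        { paragraphId: 0, lineIds: [0], text: "ACME COFFEE" },
        { paragraphId: 1, lineIds: [1, 2], text: "12 Main St\nPortland, OR" },
      ],
    },
  ],
};
console.log(toMarkdown(sample));
```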
Notes:
- ocr_image in blocks mode returns the same per-page shape minus the detection sections: { pages: [{ page, paragraphs, textBlocks }] }.
- paragraphId / lineId reset on every page.
- When you need spatial context, use textBlocks[] and reconstruct from the bounding boxes.
All examples below use npx -y, the recommended default. No prior npm install is needed; the package is fetched and cached on first run, and updates are picked up automatically when the npx cache rolls over.
claude mcp add macos-vision-mcp -- npx -y macos-vision-mcp
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"macos-vision-mcp": {
"command": "npx",
"args": ["-y", "macos-vision-mcp"]
}
}
}
Add to ~/.cursor/mcp.json:
{
"mcpServers": {
"macos-vision-mcp": {
"command": "npx",
"args": ["-y", "macos-vision-mcp"]
}
}
}
If you'd rather skip the npx cache lookup on cold starts — or you want to pin a specific version — install once:
npm install -g macos-vision-mcp
…then use "command": "macos-vision-mcp" (no args) in any of the JSON configs above, or claude mcp add macos-vision-mcp -- macos-vision-mcp for Claude Code. Note that global installs can break when switching Node versions with nvm / asdf / volta — re-run npm install -g after switching.
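The two install styles differ only in the server entry. The sketch below builds that entry and merges it into an existing config object without touching other servers. The type and function names (ServerEntry, McpConfig, serverEntry, withVisionServer) are illustrative; the command and args values are the ones shown in the configs above.

```typescript
interface ServerEntry {
  command: string;
  args?: string[];
}
interface McpConfig {
  mcpServers: Record<string, ServerEntry>;
}

// Build the entry for either install style described above.
function serverEntry(install: "npx" | "global"): ServerEntry {
  return install === "npx"
    ? { command: "npx", args: ["-y", "macos-vision-mcp"] }
    : { command: "macos-vision-mcp" }; // global install: no args
}

// Return a copy of `config` with the macos-vision-mcp entry added or replaced,
// leaving any other configured servers untouched.
function withVisionServer(config: McpConfig, install: "npx" | "global"): McpConfig {
  return {
    mcpServers: { ...config.mcpServers, "macos-vision-mcp": serverEntry(install) },
  };
}

console.log(JSON.stringify(withVisionServer({ mcpServers: {} }, "global"), null, 2));
```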
If macos-vision-mcp saved you tokens or kept a document on your Mac, consider starring the repo — it helps others find it.
Contributions are welcome. Please follow Conventional Commits for commit messages — this project uses release-it with @release-it/conventional-changelog to automate releases.
git clone <repo>
cd macos-vision-mcp
npm install
npm run dev # watch mode
MIT — Adrian Wolczuk