Server data from the Official MCP Registry
Schema-driven document extraction with local OCR + LLM. Document in, Structured JSON out.
Valid MCP server (1 strong, 3 medium validity signals). 3 known CVEs in dependencies (1 critical, 1 high severity). Package registry verified. Imported from the Official MCP Registry.
7 files analyzed · 4 issues found
Add this to your MCP configuration file:

```json
{
  "mcpServers": {
    "io-github-arknill-docpick": {
      "args": ["docpick"],
      "command": "uvx"
    }
  }
}
```

From the project's GitHub README.
Document in, Structured JSON out. Locally. With your schema.
docpick is a lightweight, schema-driven document extraction pipeline that combines local OCR engines with local LLMs to extract structured JSON from any document — invoices, receipts, bills of lading, tax forms, and more.
```shell
pip install docpick           # core (LLM extraction only)
pip install docpick[paddle]   # + PaddleOCR (recommended)
pip install docpick[easyocr]  # + EasyOCR (Korean-optimized)
pip install docpick[got]      # + GOT-OCR2.0 (GPU, vision-language)
pip install docpick[all]      # all OCR backends
```
Requirements: Python 3.11+ / LLM endpoint (vLLM, Ollama, or OpenAI-compatible)
```python
from docpick import DocpickPipeline
from docpick.schemas import InvoiceSchema

pipeline = DocpickPipeline()
result = pipeline.extract("invoice.pdf", schema=InvoiceSchema)

print(result.data)        # Structured dict matching schema
print(result.validation)  # Validation errors/warnings
print(result.confidence)  # Per-field confidence scores
```
```shell
# Extract structured data
docpick extract invoice.pdf --schema invoice --output result.json

# OCR only (no LLM)
docpick ocr document.png --lang ko,en

# Validate extracted JSON
docpick validate result.json --schema invoice

# Batch process a directory
docpick batch ./documents/ --schema invoice --output ./results/ --concurrency 4

# List available schemas
docpick schemas list

# Show schema details
docpick schemas show invoice
```
| Schema | Document Type | Key Validations |
|---|---|---|
| `invoice` | Commercial invoices | Line item sums, tax ID checkdigit, date order |
| `receipt` | Retail/restaurant receipts | Total = subtotal + tax + tip |
| `bill_of_lading` | Ocean/air B/L | Container weight sums, ISO 6346, HS code format |
| `purchase_order` | Purchase orders | PO total = line items, delivery date order |
| `kr_tax_invoice` | Korean e-tax invoice (세금계산서) | Business number checkdigit (x2), supply/tax/total sums |
| `bank_statement` | Bank statements | IBAN mod97, period date order |
| `id_document` | Passport/ID (ICAO 9303) | MRZ, ISO 3166 country codes, date ranges |
| `certificate_of_origin` | Certificate of Origin | ISO 3166 alpha-2 country codes |
Define your own schema with Pydantic:
```python
from pydantic import BaseModel
from docpick import DocpickPipeline
from docpick.validation.rules import SumEqualsRule, RequiredFieldRule

class MyDocument(BaseModel):
    """Custom document schema."""
    company_name: str | None = None
    total_amount: float | None = None
    tax_amount: float | None = None
    net_amount: float | None = None
    items: list[dict] | None = None

    class ValidationRules:
        rules = [
            RequiredFieldRule("company_name"),
            SumEqualsRule(["net_amount", "tax_amount"], "total_amount"),
        ]

pipeline = DocpickPipeline()
result = pipeline.extract("my_document.pdf", schema=MyDocument)
```
Or use a JSON Schema file:
```shell
docpick extract document.pdf --schema my_schema.json
```
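As a sketch, a `my_schema.json` mirroring the `MyDocument` Pydantic model above might look like the following; the field names and nullability are illustrative, not a documented docpick format beyond standard JSON Schema:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyDocument",
  "type": "object",
  "properties": {
    "company_name": { "type": ["string", "null"] },
    "total_amount": { "type": ["number", "null"] },
    "tax_amount": { "type": ["number", "null"] },
    "net_amount": { "type": ["number", "null"] },
    "items": { "type": ["array", "null"], "items": { "type": "object" } }
  },
  "required": ["company_name"]
}
```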
| Algorithm | Use Case |
|---|---|
| `kr_business_number` | Korean business registration number (10 digits) |
| `luhn` | Credit card numbers |
| `iso_6346` | Shipping container numbers |
| `iban_mod97` | International bank account numbers |
| `awb_mod7` | Air waybill numbers |
| `mrz` | Machine Readable Zone (passport/ID) |
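Two of these are standard, publicly specified algorithms; a minimal reference implementation in plain Python (independent of docpick's internal API) looks like:

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right (subtracting 9
    if the result exceeds 9); the digit sum must be divisible by 10."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def iban_valid(iban: str) -> bool:
    """IBAN mod-97 (ISO 13616): move the first 4 chars to the end, map
    A-Z to 10-35, then the whole number mod 97 must equal 1."""
    s = iban.replace(" ", "").upper()
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)  # '0'-'9'→0-9, 'A'-'Z'→10-35
    return int(numeric) % 97 == 1
```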
| Rule | Description |
|---|---|
| `SumEqualsRule` | Sum of fields equals target (with tolerance) |
| `DateBeforeRule` | Date A must precede Date B |
| `RequiredFieldRule` | Field must be non-null and non-empty |
| `FieldEqualsRule` | Two fields must be equal |
| `RangeRule` | Numeric field within min/max bounds |
| `RegexRule` | Field matches regex pattern |
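The tolerance on `SumEqualsRule` matters because OCR'd monetary amounts often carry rounding noise. The underlying check can be sketched in plain Python; this is an illustrative re-implementation, not docpick's actual code:

```python
import math

def sum_equals(record: dict, parts: list[str], target: str,
               tolerance: float = 0.01) -> bool:
    """True when the named part fields sum to the target field within tolerance."""
    if any(record.get(f) is None for f in parts + [target]):
        return False  # a missing field fails the check
    return math.isclose(sum(record[f] for f in parts), record[target],
                        abs_tol=tolerance)

# A receipt-style check: total = subtotal + tax + tip
receipt = {"subtotal": 41.50, "tax": 3.32, "tip": 8.00, "total": 52.82}
```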
Validate consistency across related documents (e.g., Invoice + B/L + Packing List):
```python
from docpick.validation.cross_document import create_trade_document_validator

validator = create_trade_document_validator()
result = validator.validate({
    "invoice": invoice_data,
    "bl": bl_data,
    "packing_list": packing_list_data,
    "certificate": certificate_data,
})
print(result.is_valid)
```
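The kind of consistency such a validator enforces can be illustrated without docpick: for instance, the invoice number should match across invoice and packing list, and gross weights should agree between B/L and packing list. The field names and tolerance below are hypothetical:

```python
def check_trade_set(invoice: dict, bl: dict, packing_list: dict) -> list[str]:
    """Collect cross-document mismatches; an empty list means consistent."""
    errors = []
    if invoice.get("invoice_number") != packing_list.get("invoice_number"):
        errors.append("invoice number differs between invoice and packing list")
    if abs(bl.get("gross_weight_kg", 0.0)
           - packing_list.get("gross_weight_kg", 0.0)) > 0.5:
        errors.append("gross weight differs between B/L and packing list")
    return errors

invoice = {"invoice_number": "INV-001"}
bl = {"gross_weight_kg": 1200.0}
packing_list = {"invoice_number": "INV-001", "gross_weight_kg": 1200.2}
```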
| Engine | Type | GPU | Languages | Best For |
|---|---|---|---|---|
| PaddleOCR | Traditional OCR | Optional | 111 | General documents (default) |
| EasyOCR | Traditional OCR | Optional | 80+ | Korean text |
| GOT-OCR2.0 | Vision-Language | Required | Multi | Complex layouts |
| VLM | Vision-Language | Required | Multi | Direct image → JSON |
The default `auto` engine uses confidence-based fallback: if the Tier 1 average confidence falls below the threshold (default 0.7), it automatically escalates to Tier 2.
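That escalation policy is simple to express in code. In this sketch, `run_ocr` and `run_vlm` are placeholder stubs standing in for the real engine calls:

```python
def run_ocr(image):
    """Placeholder Tier 1 engine: returns (text, per-line confidences)."""
    return "ACME CORP INVOICE", [0.95, 0.91]

def run_vlm(image):
    """Placeholder Tier 2 vision-language engine."""
    return "ACME CORP INVOICE (VLM)"

def extract_text(image, threshold: float = 0.7) -> str:
    """Run Tier 1 OCR first; escalate to the Tier 2 vision-language
    engine when the average confidence drops below the threshold."""
    text, confidences = run_ocr(image)
    avg = sum(confidences) / len(confidences) if confidences else 0.0
    return text if avg >= threshold else run_vlm(image)
```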
| Provider | Endpoint | Default Model |
|---|---|---|
| vLLM | http://localhost:8000/v1 | Qwen/Qwen3.5-32B-AWQ |
| Ollama | http://localhost:11434 | qwen3.5:7b |
Configure via CLI or YAML:
```shell
docpick config set llm.provider ollama
docpick config set llm.base_url http://localhost:11434
docpick config set llm.model qwen3.5:7b
```
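The same settings expressed in YAML might look like this; the key layout is inferred from the dotted CLI keys above, and the exact config file location is an assumption:

```yaml
llm:
  provider: ollama
  base_url: http://localhost:11434
  model: qwen3.5:7b
```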
The pipeline is designed to be resilient: failures are collected in `result.errors` rather than raised, so partial extractions remain usable.

```python
result = pipeline.extract("damaged.pdf", schema=InvoiceSchema)
if result.errors:
    print("Pipeline warnings:", result.errors)
if result.data:
    print("Partial extraction:", result.data)
```
Process entire directories with parallel workers:
```python
from docpick.batch import BatchProcessor
from docpick.schemas import InvoiceSchema

processor = BatchProcessor(concurrency=4)
result = processor.process_directory(
    "./invoices/",
    schema=InvoiceSchema,
    recursive=True,
)
print(f"Processed {result.succeeded}/{result.total} files")
for path, extraction in result.results.items():
    print(f"{path}: {extraction.data.get('total_amount')}")
```
```mermaid
flowchart TD
    A["📄 Document\n(PDF / Image)"] --> B["DocumentLoader\n(pypdfium2)"]
    B --> C["Tier 1: OCR\n(PaddleOCR / EasyOCR)\nCPU"]
    C --> D{"Confidence\n≥ threshold?"}
    D -->|"yes"| F["LLM Extractor\n(vLLM / Ollama)\nSchema prompt"]
    D -->|"no"| E["Tier 2: VLM\n(GOT / VLM)\nGPU"]
    E --> F
    F --> G["Pydantic Validation"]
    G --> H["✅ ExtractionResult"]
```
Apache 2.0 — all dependencies are Apache 2.0 or MIT licensed.
Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.