Server data from the Official MCP Registry
Find & fetch research datasets across 12 archives, omics registries, and literature sources.
Valid MCP server (1 strong, 1 medium validity signals). 3 known CVEs in dependencies (0 critical, 3 high severity). Package registry verified. Imported from the Official MCP Registry. Trust signals: trusted author (8/9 approved).
3 files analyzed · 4 issues found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
Set these up before or after installing:
Environment variable: NCBI_API_KEY
Environment variable: LLM_API_BASE
Environment variable: LLM_API_KEY
Environment variable: LLM_MODEL
Environment variable: EMBEDDING_API_BASE
Environment variable: EMBEDDING_API_KEY
Environment variable: EMBEDDING_MODEL
Environment variable: UNPAYWALL_EMAIL
Add this to your MCP configuration file:
{
  "mcpServers": {
    "io-github-musharna-data-aggregator-mcp": {
      "env": {
        "LLM_MODEL": "your-llm-model-here",
        "LLM_API_KEY": "your-llm-api-key-here",
        "LLM_API_BASE": "your-llm-api-base-here",
        "NCBI_API_KEY": "your-ncbi-api-key-here",
        "EMBEDDING_MODEL": "your-embedding-model-here",
        "UNPAYWALL_EMAIL": "your-unpaywall-email-here",
        "EMBEDDING_API_KEY": "your-embedding-api-key-here",
        "EMBEDDING_API_BASE": "your-embedding-api-base-here"
      },
      "args": [
        "data-aggregator-mcp"
      ],
      "command": "uvx"
    }
  }
}

From the project's GitHub README.
One MCP server to find and fetch research data across archives, omics registries, and literature, behind a single normalized model.
Search one query across 12 sources: Zenodo, DataCite (Dryad /
Figshare / Dataverse / OSF / OpenNeuro / Mendeley), NCBI omics
(GEO / SRA / BioProject), literature (PubMed / OpenAIRE), HuggingFace
datasets, DataONE (eco / environmental), OmicsDI (proteomics /
metabolomics), DANDI (neurophysiology), CZ CELLxGENE (single-cell),
OpenML (ML datasets), RCSB PDB (structures), and the GWAS Catalog.
Results are deduplicated, normalized, and cross-linked. Resolve any hit to its file
manifest, citation, trust signals, and the data it points at. Fetch it to
disk with checksum verification.
mcp-name: io.github.musharna/data-aggregator-mcp
Most data MCPs wrap a single source. This one unifies them behind six tools
and one DataResource model, so an agent searches once and gets back comparable
records:
- organism="Orobanche aegyptiaca" also matches Phelipanche aegyptiaca (NCBI Taxonomy), so a species rename doesn't cost you results.
- resolve: metrics (citations / views / downloads / likes), version status (is_latest / superseded_by), and last_updated freshness, surfaced wherever the source exposes them.
- resolve(format="croissant") or "ro-crate" hands a dataset to an ML or research-packaging pipeline as standard JSON-LD.
- operate reads the schema, previews rows, or runs a read-only SQL SELECT against a remote Parquet/CSV/TSV without downloading it (Parquet footer + DuckDB httpfs range reads). Optional [operate] extra; base install is unchanged.
- relate takes a handful of resolved ids and reports how they connect (shared accession, shared cross-identifier, an explicit link, or version lineage), naming the literal shared value as evidence. Metadata hints only: it never reads files or executes a join.

Full rationale and a comparison vs. single-source servers, breadth gateways, and ML-dataset tools: docs/POSITIONING.md.
Run with no install:
uvx data-aggregator-mcp
Register with Claude Code:
claude mcp add data-aggregator -- uvx data-aggregator-mcp
A typical agent flow:
search("drought stress RNA-seq", organism="Sorghum bicolor")
→ [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ] # deduped, taxa-normalized
resolve("sra:SRX079566")
→ DataResource{ files: [ENA FASTQ urls…], access: "open", taxa: [...] }
fetch("sra:SRX079566", dest="./data")
→ ["./data/SRX079566_1.fastq.gz", …] # md5-verified
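The ids in this flow carry a source prefix, and resolve routes by id shape. The sketch below shows one way such routing could work; it is an illustration under assumptions, not the server's code. The prefix set is copied from the id examples in this README, while the `route` function and the `"doi"` label for bare DOIs are hypothetical.

```python
import re

# Prefixes taken from the id examples in this README; the set is illustrative.
KNOWN_PREFIXES = {
    "zenodo", "datacite", "sra", "geo", "bioproject",
    "pubmed", "openaire", "hf", "dataone", "omicsdi",
}
# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def route(resource_id: str) -> tuple[str, str]:
    """Split an id into (source, local_id).

    "doi" is a hypothetical label used here for bare DOIs.
    """
    rid = resource_id.strip()
    if DOI_RE.match(rid):
        return ("doi", rid)
    # Split on the first colon only, so nested ids keep their remainder intact.
    prefix, sep, rest = rid.partition(":")
    if sep and rest and prefix.lower() in KNOWN_PREFIXES:
        return (prefix.lower(), rest)
    raise ValueError(f"unrecognized id shape: {resource_id!r}")
```

Splitting on the first colon only keeps nested ids such as `dataone:doi:10.5063/F1HT2M7Q` and `omicsdi:pride:PXD000001` in one piece after the source prefix.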
pip install data-aggregator-mcp
data-aggregator-mcp # or: python -m data_aggregator_mcp
To use the operate tool (query remote tabular files in place), install the
optional extra:
pip install "data-aggregator-mcp[operate]"
Add to a client's MCP config (e.g. Claude Desktop claude_desktop_config.json):
{
  "mcpServers": {
    "data-aggregator": {
      "command": "uvx",
      "args": ["data-aggregator-mcp"],
      "env": { "NCBI_API_KEY": "your-optional-key" }
    }
  }
}
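If you would rather script this than edit the file by hand, the following sketch merges the entry above into an existing config file. It assumes the client reads a JSON file with a top-level `mcpServers` object, as in the example. The helper names `server_entry` and `register` are hypothetical, and only environment variables that are actually set are copied.

```python
import json
from pathlib import Path

# Optional variables named in this README; unset or empty ones are skipped.
OPTIONAL_ENV = [
    "NCBI_API_KEY", "UNPAYWALL_EMAIL",
    "EMBEDDING_API_BASE", "EMBEDDING_API_KEY", "EMBEDDING_MODEL",
    "LLM_API_BASE", "LLM_API_KEY", "LLM_MODEL",
]


def server_entry(environ):
    """Build the `data-aggregator` entry from a mapping such as os.environ."""
    env = {name: environ[name] for name in OPTIONAL_ENV if environ.get(name)}
    entry = {"command": "uvx", "args": ["data-aggregator-mcp"]}
    if env:
        entry["env"] = env
    return entry


def register(config_path, environ):
    """Merge the entry into the config file, leaving other servers untouched."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config.setdefault("mcpServers", {})["data-aggregator"] = server_entry(environ)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `register(path, os.environ)` after `import os` would pick up whatever keys are exported in the current shell.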
| Source | Discover | Fetch | Checksum |
|---|---|---|---|
| Zenodo | ✓ | ✓ | md5 |
| DataCite: Figshare | ✓ | ✓ | md5 |
| DataCite: Dataverse | ✓ | ✓ | md5 |
| DataCite: OSF | ✓ | ✓ | md5 |
| DataCite: Dryad | ✓ | manifest only¹ | sha-256 (listed) |
| DataCite: Mendeley & others | ✓ | - | - |
| NCBI SRA | ✓ | ✓ (ENA FASTQ) | md5 |
| NCBI GEO | ✓ | ✓ (suppl/) | none² |
| NCBI BioProject | ✓ | → SRA links | - |
| PubMed / OpenAIRE | ✓ | ✓ (OA full text) | none² |
| HuggingFace datasets | ✓ | ✓ (resolve URL) | none |
| DataONE (eco/env) | ✓ | ✓ (Member Node) | md5 / sha-256 |
| OmicsDI: PRIDE | ✓ | ✓ (HTTPS FTP) | size only |
| OmicsDI: MetaboLights | ✓ | ✓ (HTTPS FTP) | none |
| OmicsDI: other MS repos | ✓ | - | - |
| DataCite: OpenNeuro | ✓ | ✓ (snapshot) | none² |
| DANDI (neurophysiology) | ✓ | ✓ (302→S3) | none² |
| CZ CELLxGENE (single-cell) | ✓ | ✓ (H5AD/RDS) | none² |
| OpenML (ML datasets) | ✓ | ✓ (ARFF) | md5 |
| RCSB PDB (structures) | ✓ | ✓ (.cif/.pdb) | none² |
| GWAS Catalog | ✓ | → PMID bridge | - |

¹ Dryad downloads are token / bot-challenge gated, so fetch fails loud;
resolve still lists the files.
² No upstream checksum: fetch verifies content-type instead (rejects an HTML
page served in place of a binary).
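The two verification paths in the table (a checksum where the source publishes one, a content check where it does not) can be pictured with the sketch below. It is a minimal illustration under assumptions, not the server's implementation: `verify_stream` and `FetchRejected` are hypothetical names, and the HTML check inspects the leading bytes of the payload as a stand-in for the content-type check the footnote describes.

```python
import hashlib


class FetchRejected(Exception):
    """Raised when a download fails a size, checksum, or content check."""


def looks_like_html(head: bytes) -> bool:
    # Crude sniff: an HTML document normally opens with a doctype or <html> tag.
    start = head.lstrip().lower()
    return start.startswith((b"<!doctype html", b"<html"))


def verify_stream(chunks, expected_md5=None, max_bytes=None):
    """Consume byte chunks and return (total_bytes, md5_hex), or raise FetchRejected."""
    digest = hashlib.md5()
    total = 0
    head = b""
    for chunk in chunks:
        total += len(chunk)
        # Stop as soon as the size guard is exceeded instead of reading on.
        if max_bytes is not None and total > max_bytes:
            raise FetchRejected(f"exceeded max_bytes={max_bytes}")
        if len(head) < 512:
            head += chunk[: 512 - len(head)]
        digest.update(chunk)
    actual = digest.hexdigest()
    if expected_md5 is not None:
        if actual != expected_md5.lower():
            raise FetchRejected(f"md5 mismatch: expected {expected_md5}, got {actual}")
    elif looks_like_html(head):
        # No upstream checksum: fall back to rejecting an HTML page.
        raise FetchRejected("got an HTML page where a binary file was expected")
    return total, actual
```

In practice `chunks` would come from a streaming HTTP response; any iterable of `bytes` works for the sketch.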
search(query?, size?, sources?, organism?, disease?, tissue?, chemical?, assay?, kind?, published_after?, published_before?, rank?, cursor?, collapse_mirrors?, understand?, multi_query?, provenance?)
Fan out across all wired sources in parallel and return compact DataResource
records, deduped by DOI. Per-source failures land in errors{} and are never silently
dropped.
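The fan-out behaviour described above (parallel calls, DOI-based deduplication, failures collected per source) could look roughly like the sketch below. It is a simplified illustration: the `fan_out` and `dedupe_by_doi` names, the thread pool, and the plain-dict record shape are assumptions, not the server's internals.

```python
from concurrent.futures import ThreadPoolExecutor


def dedupe_by_doi(records):
    """Keep the first record per DOI (case-insensitive); keep records without a DOI."""
    seen, unique = set(), []
    for record in records:
        doi = (record.get("doi") or "").strip().lower()
        if doi and doi in seen:
            continue
        if doi:
            seen.add(doi)
        unique.append(record)
    return unique


def fan_out(query, sources):
    """Query every source in parallel.

    `sources` maps a source name to a callable(query) -> list of record dicts.
    A failing source is recorded under errors[name] instead of being dropped.
    """
    results, errors = [], {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        # Collect in submission order so the merged list is deterministic.
        for name, future in futures.items():
            try:
                results.extend(future.result())
            except Exception as exc:
                errors[name] = f"{type(exc).__name__}: {exc}"
    return {"results": dedupe_by_doi(results), "errors": errors}
```

Returning the error map alongside the results lets a caller see that a source timed out rather than inferring it from missing records.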
- organism: expand the query with NCBI-Taxonomy synonyms; the expansion is echoed in taxon_expansion, and results carry normalized taxa[] ({taxid, name}) plus a described_in link to plant-genomics-mcp for plant taxa.
- sources: restrict the fan-out, e.g. ["omics"].
- size: max results (1–50).
- kind: keep only dataset / sequencing_run / study / publication / software.
- published_after / published_before: filter by publication year.
- rank: relevance (default) or semantic (re-rank the fetched page by embedding similarity to the query; needs EMBEDDING_API_BASE, degrades to relevance order otherwise).
- understand: opt into LLM query understanding (default false). A free-text query is normalized into a focused keyword query: conversational fluff ("I'm looking for…", "where can I find…") is stripped while the scientific and entity terms are kept so they still match by text. The LLM also detects structured entities (organism/disease/tissue/chemical/assay, kind). These are echoed in query_understanding.extracted for transparency but not auto-applied, because ANDing LLM-inferred facets across free-text keyword upstreams over-constrains and hurts recall. Only the cleaned keyword_core and explicit year scopes are applied; the ontology resolvers still run on the facets you pass (the LLM proposes, you dispose). Needs an LLM endpoint (LLM_API_BASE); with none configured the search runs unchanged and notes it in errors['understand']. Effectiveness is query- and model-dependent, hence opt-in / default-off; validate the recall lift on your own corpus and LLM (see the eval harness below). On our small verified set multi_query= is the stronger, always-safe recall lever; understand= is approximately neutral with a weak local model.
- multi_query: opt into diverse multi-query recall expansion (default false). An LLM generates a few deliberately diverse reformulations of your query (different facets/synonyms/framings, not paraphrases), each is fanned out across every source, and the deduped union is re-ranked against your original query, surfacing relevant records a single keyword query would miss. Bounded at MAX_QUERY_VARIANTS (4, incl. the original, which is always kept so recall never drops below baseline), so it costs at most N× the upstream calls. Composes with understand= (which structures variant 0). The variants used are echoed in query_expansion. Needs an LLM endpoint (LLM_API_BASE); with none configured the search runs as a normal single query and notes it in errors['multi_query'].
- cursor: opaque token from a prior result's next_cursor; pages forward across every source. In cursor mode the other params are read from the token, so query is optional.

resolve(id, cite?, format?, trust?, fair?, use?)
Full record + files manifest. Routes by id shape: zenodo:7654321, a bare DOI,
datacite:10.5061/dryad.x, an omics id (sra:SRX079566, geo:GSE332789,
bioproject:PRJNA1468572), a literature id (pubmed:34320281, openaire:<id>),
a HuggingFace id (hf:owner/name), a DataONE id (dataone:doi:10.5063/F1HT2M7Q),
or an OmicsDI id (omicsdi:pride:PXD000001). Attaches, where available:
- files[]: ENA FASTQ manifest (SRA), GEO suppl/, or the host repo's native manifest (Figshare / Dataverse / OSF / Dryad).
- links[]: paper-to-data links. pubmed: → sra: / geo: / bioproject: (NCBI elink); openaire: → datacite: (ScholeXplorer Scholix).
- access / license: normalized status (open / embargoed / restricted / closed / unknown) and license where the source exposes it.
- identifiers: normalized {pmid, pmcid, doi}, plus an open-access full-text FileEntry (EuropePMC XML, or an Unpaywall PDF fallback) for papers.
- citation: pass cite=<format>: bibtex, ris, csl-json, or any CSL style name (apa, mla, vancouver, …). DOI records use content negotiation; others render CSL-JSON from metadata. Off by default; failures degrade quietly.
- metrics (citations / views / downloads / likes), is_latest / superseded_by (derived from version links), and last_updated freshness, where the source provides them.
- trust=true: attach retraction status (via Crossref) under trust{}. One extra Crossref call; meaningful for DOI-bearing records only.
- fair=true: attach an RDA-grounded FAIRness score (0–100 + F/A/I/R sub-scores + actionable gaps) computed from the record metadata under fair{}. Pure/local, with no extra network call.
- use=<intent>: attach a licence-compatibility advisory under license_compat{} for the intended use (commercial / redistribute / modify / ml-training). Returns ALLOW/REVIEW/DENY with the governing clause. Metadata-derived advisory, not legal advice; an absent/unrecognized licence yields REVIEW.
- format: pass format="croissant" (file-level Croissant JSON-LD), "ro-crate" (minimal RO-Crate 1.1), or "provenance" (one-call RO-Crate 1.1 data-availability dossier bundling version-currency, licence+SPDX, FAIR score, and retraction status) to attach a standard manifest under the matching field.

fetch(id, dest?, files?, max_bytes?, force?, extract?)
Download files to disk and return their paths. Streams under a max_bytes guard
(force to override) with md5 verification wherever a checksum exists.
- files: restrict to a subset of the resolved manifest.
- extract: unpack downloaded zip / tar archives in place, guarded against path traversal and runaway extracted size. Off by default.
- Sources without an upstream checksum (GEO suppl/, literature full text) get a content-type sniff that fails loud if a declared binary is actually an HTML page.
- Sources that do not support fetch raise FetchNotSupportedError.

list_sources()
Wired sources with their capabilities: layer, kinds, supported filters,
fetchability, operable flag, id examples, auth, and rate limits.
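The two guards that fetch's extract option mentions, path traversal and runaway extracted size, can be illustrated for zip archives with the standard library. This is a sketch of the general technique, not the server's code: `safe_extract_zip`, `UnsafeArchiveError`, and the default size limit are hypothetical.

```python
import zipfile
from pathlib import Path


class UnsafeArchiveError(Exception):
    """Raised when an archive would escape the destination or unpack too large."""


def safe_extract_zip(archive_path, dest, max_total_bytes=1_000_000_000):
    """Extract a zip into `dest` after checking every member; return extracted file paths."""
    dest_root = Path(dest).resolve()
    with zipfile.ZipFile(archive_path) as zf:
        infos = zf.infolist()
        # Declared sizes come from the archive itself, so treat this as a first
        # line of defence rather than a guarantee against crafted archives.
        total = sum(info.file_size for info in infos)
        if total > max_total_bytes:
            raise UnsafeArchiveError(f"extracted size {total} exceeds {max_total_bytes}")
        for info in infos:
            target = (dest_root / info.filename).resolve()
            # Reject absolute paths and "../" members that land outside dest.
            if not target.is_relative_to(dest_root):
                raise UnsafeArchiveError(f"member escapes destination: {info.filename!r}")
        zf.extractall(dest_root)
    return [dest_root / info.filename for info in infos if not info.is_dir()]
```

Every member is validated before anything is written, so a rejected archive leaves the destination untouched.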
operate(op, id, file?, query?, n?, columns?)
Inspect or query a remote tabular file (Parquet / CSV / TSV) without
downloading it. Addresses a file by catalog id + file name (defaults to the
first tabular file on the resolved record). Ops:
- schema: column names + types (reads the Parquet footer / sniffs the CSV header; no full load).
- preview: a small sample of rows.
- head: the first n rows (default 20), optionally restricted to columns.
- sql: a read-only SELECT (the file is the view data), e.g. SELECT col, count(*) FROM data GROUP BY 1.
- peek: per-column profile via DuckDB SUMMARIZE (type, null-rate, approximate distinct count, min/max, numeric quartiles) without downloading the file. Like head/sql, it scans the whole file and honors the source-size ceiling.

Backed by the Parquet footer reader + DuckDB httpfs range reads. sql runs in
a locked-down DuckDB (read-only, local filesystem disabled, single-SELECT
validation, row / wall-clock caps). Requires the optional [operate] extra
(pip install data-aggregator-mcp[operate]); without it, operate returns a
clear install-the-extra message and the other five tools are unaffected.
Any HuggingFace dataset with a datasets-server converted view is operable
(schema / preview / head / sql): resolve surfaces the auto-converted
Parquet files (source="hf-datasets-server") even for datasets stored as
JSON/JSONL/arrow, so pass file=<config>/<split>/...parquet to pick a split when
there are several.
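To give a feel for what single-SELECT validation with a row cap involves, here is a deliberately conservative sketch. It is not the server's validator: the keyword list, the `validate_select` and `run_capped` names, and the use of the standard-library sqlite3 module (standing in for DuckDB so the example runs without the optional extra) are all assumptions.

```python
import re
import sqlite3

# Keywords that have no place in a read-only query. Matching on raw text is
# conservative: it also rejects harmless queries that merely mention them.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma|install|load)\b",
    re.IGNORECASE,
)


def validate_select(sql: str) -> str:
    """Return the statement if it is a single SELECT/WITH query, else raise ValueError."""
    stmt = sql.strip().rstrip(";").strip()
    if not stmt or ";" in stmt:
        raise ValueError("exactly one statement is allowed")
    if not re.match(r"(?is)^(select|with)\b", stmt):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(stmt):
        raise ValueError("statement contains a non-read-only keyword")
    return stmt


def run_capped(conn: sqlite3.Connection, sql: str, max_rows: int = 1000):
    """Validate the query, then wrap it so at most max_rows rows come back."""
    stmt = validate_select(sql)
    cursor = conn.execute(f"SELECT * FROM ({stmt}) LIMIT ?", (max_rows,))
    return cursor.fetchall()
```

Text-level checks like these are only one layer; the README also lists a read-only engine, a disabled local filesystem, and wall-clock caps, which this sketch does not attempt to reproduce.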
relate(ids)
Cross-resource join/harmonization hints. Given 2–10 resource ids, relate resolves
each (TTL-cached) and reports how they relate and on what key they could be joined:
- shared_accession: same BioProject/SRA/GEO accession on ≥2 records → joinable key.
- shared_identifier: same doi/pmid/pmcid across records → same work / paper-to-data link.
- explicit_link: one record's links[] points at another input record.
- version_lineage: one record supersedes another (dedupe, don't join, those).

Hints only. relate never reads file columns, fetches files, or executes a
join/merge/conversion; every hint names the shared value as evidence. Per-id resolve
failures are reported in errors, not fatal; an empty result carries an explanatory
note.
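The four hint types can be derived from record metadata alone, as the sketch below tries to show. It works on plain dicts whose `identifiers`, `links`, and `superseded_by` keys mirror field names mentioned in this README; the `accessions` key, the hint dict shape, and the `relate_hints` name are hypothetical.

```python
from itertools import combinations


def relate_hints(records):
    """Return metadata-only hints for 2-10 already-resolved records (plain dicts)."""
    if not 2 <= len(records) <= 10:
        raise ValueError("relate expects between 2 and 10 records")
    ids = {record["id"] for record in records}
    hints = []
    for a, b in combinations(records, 2):
        pair = [a["id"], b["id"]]
        # Same accession on both records: a candidate join key.
        shared = set(a.get("accessions", [])) & set(b.get("accessions", []))
        for accession in sorted(shared):
            hints.append({"type": "shared_accession", "ids": pair, "evidence": accession})
        # Same DOI / PMID / PMCID: same work, or a paper and its data.
        for key in ("doi", "pmid", "pmcid"):
            value = a.get("identifiers", {}).get(key)
            if value and value == b.get("identifiers", {}).get(key):
                hints.append({"type": "shared_identifier", "ids": pair,
                              "evidence": f"{key}={value}"})
    for record in records:
        for target in record.get("links", []):
            if target in ids and target != record["id"]:
                hints.append({"type": "explicit_link", "ids": [record["id"], target],
                              "evidence": target})
        newer = record.get("superseded_by")
        if newer in ids:
            hints.append({"type": "version_lineage", "ids": [record["id"], newer],
                          "evidence": newer})
    return hints
```

Each hint carries the literal shared value in `evidence`, which matches the README's point that relate reports why two records connect without touching their files.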
Three workflow prompts surface in clients (e.g. /mcp__data_aggregator__* in
Claude Code):
- find_data: find datasets for a topic, optionally scoped to an organism.
- data_behind_paper: find the datasets / accessions behind a paper.
- search_resolve_fetch: walk the end-to-end search → resolve → fetch flow.

All optional, set via environment variables:
- NCBI_API_KEY: raises the NCBI E-utilities rate limit (3 → 10 req/s) used by the omics, literature, and taxonomy lookups.
- UNPAYWALL_EMAIL: enables the Unpaywall fallback leg of literature full-text retrieval (the EuropePMC leg works without it).
- EMBEDDING_API_BASE / EMBEDDING_API_KEY / EMBEDDING_MODEL: an OpenAI-compatible embeddings endpoint enabling rank=semantic. If absent, semantic re-rank degrades to relevance order. Key is optional (keyless local servers supported); model defaults to text-embedding-3-small.
- LLM_API_BASE / LLM_API_KEY / LLM_MODEL: an OpenAI-compatible /chat/completions endpoint enabling search(understand=true) (NL→structured query rewriting) and search(multi_query=true) (diverse multi-query recall expansion). If absent, both run the raw query unchanged and note it in errors['understand'] / errors['multi_query']. Key is optional (keyless local servers supported); model defaults to gpt-4o-mini (a passthrough string; set it to whatever your endpoint serves). multi_query fans out at most MAX_QUERY_VARIANTS (4, incl. the original) variants, bounding the N× cost.

To measure the recall lift of understand=true / multi_query=true on a small
labeled set, run the gated eval harnesses (need a live LLM endpoint):
DATA_AGGREGATOR_MCP_LIVE=1 LLM_API_BASE=... python scripts/eval_understand.py
DATA_AGGREGATOR_MCP_LIVE=1 LLM_API_BASE=... python scripts/eval_multi_query.py
They print per-query and mean recall@20 (understand / multi-query off vs. on). See
the fixtures at scripts/eval_understand_fixture.json and
scripts/eval_multi_query_fixture.json.
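For readers unfamiliar with the metric, recall@20 is the share of a query's known-relevant records that appear in the top 20 results. The sketch below shows the arithmetic only; the function names and the input shape are hypothetical and are not taken from the eval scripts.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant ids that appear among the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("a labeled query needs at least one relevant id")
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def mean_recall(runs, k=20):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


def recall_lift(runs_off, runs_on, k=20):
    """Mean recall@k with the feature on minus mean recall@k with it off."""
    return mean_recall(runs_on, k) - mean_recall(runs_off, k)
```

A positive `recall_lift` on your own labeled queries is the signal the README suggests checking before enabling understand= or multi_query= by default.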
uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q # real-API probes
The README demo (examples/assets/demo.svg) is recorded network-free from
examples/_demo_stdio.py β see the header of that file to re-record.
MIT; see LICENSE.