Web search that doesn't wreck your AI's memory.
Valid MCP server (1 strong, 1 medium validity signals). 4 known CVEs in dependencies (0 critical, 3 high severity) ⚠️ Package registry links to a different repository than scanned source. Imported from the Official MCP Registry.
5 files analyzed · 5 issues found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
This plugin requests these system permissions. Most are normal for its category.
Add this to your MCP configuration file:
{
"mcpServers": {
"io-github-annibale-x-mcp-webgate": {
"args": [
"mcp-webgate"
],
"command": "uvx"
}
}
}From the project's GitHub README.
Web search that doesn't wreck your AI's memory.
mcp-webgate is an MCP server that gives your AI clean, bounded web content — across all major AI clients:
What is mcp-webgate? When your AI uses a standard "fetch URL" tool, it gets the raw HTML of the page — ads, menus, scripts, cookie banners and all. A single news article can dump 200,000 tokens of garbage into the AI's memory, wiping out your entire conversation.
mcp-webgate is a protective filter that sits between your AI and the web:
The result: clean, bounded, useful web content — always.
Searching for "mcp model context protocol" with LLM features on:
Query → LLM expands to 5 search variants → 20 pages found, 13 fetched in parallel
Raw HTML downloaded 5.16 MB (~1,290,000 tokens)
After cleaning 52.1 KB ( ~13,000 tokens) — 99% noise stripped
After LLM summary 5.8 KB ( ~1,450 tokens) — structured report with citations
13 sources distilled into ~1,450 tokens. A single naive fetch of just one of those pages (e.g. a security blog at 563 KB) would dump ~140,000 tokens of raw HTML into your AI's context. webgate processes all 13 and delivers a clean briefing that fits in a footnote.
This is an intensive case (5 queries × 5 results). A typical search with 3–5 results still saves 95%+ of context compared to raw fetching — and your AI gets structured, ranked content instead of a wall of HTML soup.
uvxpip install uv
uvx runs Python tools without installing them permanently. You only need to do this once.
The easiest option is SearXNG — free, no account, runs locally:
docker run -d -p 8080:8080 --name searxng searxng/searxng
No Docker? Use a cloud backend instead (Brave, Tavily, Exa, SerpAPI) — see Backends.
See the Integrations table for your specific client. As a quick example, for Claude Desktop:
Open the config file:
~/Library/Application Support/Claude/claude_desktop_config.json%APPDATA%\Claude\claude_desktop_config.jsonAdd this:
{
"mcpServers": {
"webgate": {
"command": "uvx",
"args": ["mcp-webgate"],
"env": {
"WEBGATE_DEFAULT_BACKEND": "searxng",
"WEBGATE_SEARXNG_URL": "http://localhost:8080"
}
}
}
}
Restart the client after editing.
Search the web for: latest news on AI regulation
The AI will use webgate_query automatically. You're done.
Your question
↓
Search backend (SearXNG / Brave / Tavily / Exa / SerpAPI)
↓ [deduplicate URLs, block binary files, filter domains]
Fetch pages in parallel (streaming — hard size cap per page)
↓ [optional: retry failed pages from reserve pool]
Strip HTML junk (menus, ads, scripts, footers — lxml)
↓
Clean up text (invisible chars, unicode junk, BiDi tricks)
↓
BM25 reranking (best-matching results first — always active)
↓ [optional: LLM reranking]
Cap total output to budget
↓ [optional: LLM summarization → compact Markdown report]
Clean result lands in your AI's context
webgate gives your AI three tools:
webgate_fetch — read a single pageUse this when you already know the URL you want. The AI passes the URL and gets back the cleaned text — up to max_query_budget characters (default 32,000).
{ "url": "https://example.com/article", "max_chars": 32000 }
{
"url": "https://example.com/article",
"title": "Article Title",
"text": "cleaned text...",
"truncated": true,
"char_count": 12450
}
webgate_query — search + fetch + cleanRuns a full search cycle. Pass one query (or several) and get back cleaned, ranked results.
{ "queries": "how to set up a VPN on Linux", "num_results_per_query": 5 }
Multiple queries run in parallel and are merged:
{
"queries": ["VPN Linux setup", "best VPN Linux 2024"],
"num_results_per_query": 5
}
Output without LLM — returns cleaned page content for each result:
{
"sources": [
{ "id": 1, "title": "...", "url": "...", "content": "cleaned text...", "truncated": false }
],
"snippet_pool": [ { "id": 6, "title": "...", "url": "...", "snippet": "..." } ],
"stats": { "fetched": 5, "total_chars": 18200, "per_page_limit": 6400 }
}
Output with LLM summarization — returns a compact Markdown report:
{
"summary": "## How to set up a VPN on Linux\n\nTo install...[1][2]",
"citations": [{ "id": 1, "title": "...", "url": "..." }],
"stats": { "fetched": 5, "total_chars": 58000 }
}
Output when LLM fails — error reason shown, full sources returned as fallback:
{
"llm_summary_error": "ReadTimeout: LLM did not respond in time",
"sources": [ "..." ],
"stats": { "..." : "..." }
}
snippet_pool contains extra results from the search that were not fetched (search-engine snippet only). The AI can use these to decide if more fetches are worthwhile.
webgate_onboarding — how-to guideReturns a JSON guide explaining how to use webgate effectively. The AI should call this once at the start of a session if in doubt about which tool to use.
Most frontier models follow MCP tool instructions automatically. Smaller or local models sometimes ignore the server-provided guidance and fall back to a built-in fetch tool instead — returning raw HTML that floods the context with noise.
If you notice this happening, add an explicit instruction block to your system prompt:
You have access to webgate tools for web search and page retrieval.
Follow these rules in every session:
- To search the web: use webgate_query — never use a built-in fetch, browser, or HTTP tool
- To retrieve a URL: use webgate_fetch — never fetch URLs directly
- Built-in fetch tools return raw HTML that floods your context; webgate returns clean, bounded text
At the start of each session, call webgate_onboarding to read the full operational guide.
This works because user system prompt instructions take precedence over MCP server-level guidance, making the constraint explicit at the highest-priority layer the model sees.
Tip: if your client supports named system prompts or prompt templates, save the block above as a reusable preset so you don't have to paste it every time.
This section explains what the key parameters do and when to change them. The defaults work well for most cases — only tweak if you have a specific reason.
webgate measures text in characters (not tokens). A rough conversion for English text:
4 characters ≈ 1 token
| Characters | Approximate tokens |
|---|---|
| 8,000 | ~2,000 |
| 32,000 | ~8,000 |
| 96,000 | ~24,000 |
webgate_fetch budgetWhen you fetch a single URL, the ceiling is max_query_budget (default 32,000 chars). The tool parameter max_chars can request less, but never more than this ceiling.
Why max_query_budget and not max_result_length? Because you're fetching one page — the "total output" IS that one page, so the right limit is the overall context budget, not the per-page cap designed for multi-source queries.
webgate_query budget — without LLMWith no LLM, the cleaned sources go directly to your AI's context. webgate distributes max_query_budget across all fetched pages so the total never exceeds the budget:
Per-page limit =
max_query_budget÷ number of results (capped atmax_result_length)
| Results fetched | Per-page limit | Total output |
|---|---|---|
| 1 | 8,000 (cap) | ≤ 8,000 |
| 5 | 6,400 | ≤ 32,000 |
| 10 | 3,200 | ≤ 32,000 |
| 20 | 1,600 | ≤ 32,000 |
The total output is always at most max_query_budget, regardless of how many results you request — the per-page share automatically shrinks to compensate.
webgate_query budget — with LLM summarizationWhen a secondary LLM is summarizing, it compresses the content before passing the result to your primary AI. This means it's safe — and beneficial — to give it more raw material to work from.
webgate scales up the input using input_budget_factor (default 3):
LLM input budget =
max_query_budget×input_budget_factorDefault: 32,000 × 3 = 96,000 chars
| Results fetched | LLM input / page | Total LLM input | Output to your AI |
|---|---|---|---|
| 1 | 96,000 | 96,000 | compact report |
| 5 | 19,200 | 96,000 | compact report |
| 10 | 9,600 | 96,000 | compact report |
| 20 | 4,800 | 96,000 | compact report |
The secondary LLM sees much more content per page. Your primary AI sees only the final report — typically 1,000–3,000 tokens — regardless of how many sources were processed. This is the main efficiency advantage of LLM mode.
| Symptom | Fix |
|---|---|
| AI responses feel slow, too much text | Reduce max_query_budget (e.g. 16000) |
| AI answers are shallow or miss details | Increase max_query_budget (e.g. 48000) |
| LLM summary is thin or misses things | Increase input_budget_factor (e.g. 5) |
| LLM summary times out or is very slow | Reduce input_budget_factor (e.g. 2) or reduce results_per_query |
fetch returns too little of a long page | Increase max_query_budget (e.g. 64000) |
| Pages are slow to download | Reduce max_download_mb (e.g. 1, already default) |
| Server downloads too much garbage | Reduce max_download_mb (e.g. 1) |
Optional, opt-in. When llm.enabled = false (the default), webgate is fully deterministic. Enable the [llm] block to unlock three extra capabilities.
| Situation | Recommended setup | Typical latency overhead |
|---|---|---|
| Fast answers, general research | LLM disabled (default) — BM25-ranked clean sources, zero latency overhead | none |
| Deep research on a complex topic | Summarization on — get a cited Markdown report instead of raw pages | +5–30s |
| Broad topic, one query isn't enough | Expansion + Summarization — LLM generates variants and synthesizes all results | +6–35s |
| Result order matters more than speed | LLM reranking on — semantic ordering at the cost of one extra LLM call per query | +1–5s |
Privacy: with LLM disabled, no data leaves your machine except web requests. With LLM enabled, cleaned search results (not raw HTML) are sent to the configured base_url. Point it at a local Ollama instance to keep everything on-device.
Latency trade-off: each enabled feature adds one LLM round-trip per query. Expansion adds ~1–5s; summarization adds ~5–30s depending on model and content volume. For interactive use, summarization with a fast local model (e.g. Gemma 3 4B) is a good starting point.
[llm]
enabled = true
base_url = "http://localhost:11434/v1" # Ollama, OpenAI, LM Studio, vLLM, Groq...
api_key = "" # empty for local models
model = "gemma3:27b"
timeout = 60 # local 27B+ models may need up to 60s
Or with env vars:
"env": {
"WEBGATE_LLM_ENABLED": "true",
"WEBGATE_LLM_BASE_URL": "http://localhost:11434/v1",
"WEBGATE_LLM_MODEL": "gemma3:27b",
"WEBGATE_LLM_TIMEOUT": "60"
}
base_url accepts any OpenAI-compatible endpoint: OpenAI, Ollama, LM Studio, vLLM, Together AI, Groq, and others.
When you send a single query and expansion_enabled = true, the LLM automatically generates complementary search variants before hitting the backend. If you already pass multiple queries, this step is skipped.
"best laptop for programming"
↓ expansion
["best laptop for programming 2024", "developer laptop recommendations", "laptop specs for coding"]
↓ all search in parallel
Falls back silently to your original query if the LLM fails.
When summarization_enabled = true, the LLM reads all fetched pages and writes a structured Markdown report with inline citations. Your AI receives the report instead of the raw text.
summary + citations (lean output — no raw content passed to your AI)llm_summary_error with the reason + full sources as fallback (your AI can still work with the cleaned content)The report length target is max_summary_words. When 0 (default), it is derived from max_query_budget / 5 — e.g. with a 32k budget, the target is ~6,400 words.
Results are always reranked by BM25 (keyword overlap, zero cost) before being returned. Optionally, the LLM can do a second pass for semantic relevance:
| Tier | When | Cost |
|---|---|---|
| BM25 (deterministic) | Always | Zero — pure math |
| LLM-assisted | llm_rerank_enabled = true | One LLM call per query |
LLM reranking adds latency proportional to your LLM response time. Enable it only if result ordering matters more than speed.
Pipeline: clean → BM25 rerank → (LLM rerank) → (LLM summarize) → output
mcp-webgate works with all major AI clients:
| Platform | Configuration Guide | Notes |
|---|---|---|
| Claude Desktop | IDE Integration | Desktop application |
| Claude Code | IDE Integration | CLI coding agent |
| Zed Editor | IDE Integration | Native MCP support |
| Cursor | IDE Integration | Requires Agent mode |
| Windsurf | IDE Integration | Global config only |
| VSCode | IDE Integration | Via Copilot or MCP extension |
| Gemini CLI | Agent Integration | Google's CLI agent |
| Claude CLI | Agent Integration | Anthropic's CLI agent |
uvx mcp-webgate
pip install mcp-webgate
# or
uv add mcp-webgate
Ready-to-use config files are in examples/.
CLI args > env vars > webgate.toml > defaults
Config is read once at startup; restart the server to apply changes.
You can configure webgate in three ways — mix and match as needed:
webgate.toml — checked at startup in ./webgate.toml then ~/webgate.tomlWEBGATE_* prefix, always strings (MCP JSON requirement)--kebab-case, integers stay integers, ideal for multi-instance setupswebgate.toml)[server]
max_download_mb = 1 # how many MB to download per page before cutting off
max_result_length = 8000 # max chars per page in multi-source queries (no LLM)
max_query_budget = 32000 # total char budget for a fetch, or input pool for a query
max_search_queries = 5 # max parallel queries per call
results_per_query = 5 # results to fetch per query
search_timeout = 8 # seconds before giving up on a page
oversampling_factor = 2 # fetch 2× more candidates than needed (dedup reserve)
auto_recovery_fetch = false # retry failed fetches from reserve pool
max_total_results = 20 # hard cap: never fetch more than this many pages total
blocked_domains = ["reddit.com", "pinterest.com"]
allowed_domains = [] # if non-empty, only these domains are allowed
adaptive_budget = false # [EXPERIMENTAL] proportional char allocation based on BM25 rank
adaptive_budget_fetch_factor = 3 # generous pre-rank fetch multiplier
[backends]
default = "searxng"
[backends.searxng]
url = "http://localhost:8080"
[backends.brave]
api_key = "BSA..."
[backends.tavily]
api_key = "tvly-..."
search_depth = "basic"
[llm]
enabled = true
base_url = "http://localhost:11434/v1"
api_key = ""
model = "llama3.2"
timeout = 60
expansion_enabled = true
summarization_enabled = true
llm_rerank_enabled = false
max_summary_words = 0 # 0 = max_query_budget / 5 (e.g. 6400 with budget 32000)
input_budget_factor = 3 # LLM input = max_query_budget × factor (default: 96000)
With env vars (all values must be strings):
{
"mcpServers": {
"webgate": {
"command": "uvx",
"args": ["mcp-webgate"],
"env": {
"WEBGATE_DEFAULT_BACKEND": "searxng",
"WEBGATE_SEARXNG_URL": "http://localhost:8080",
"WEBGATE_LLM_ENABLED": "true",
"WEBGATE_LLM_TIMEOUT": "60"
}
}
}
}
With CLI args (integers stay integers — ideal for running independent instances in Zed, Cursor, etc.):
{
"mcpServers": {
"webgate": {
"command": "uvx",
"args": [
"mcp-webgate",
"--searxng-url", "http://localhost:8080",
"--llm-enabled",
"--llm-model", "gemma3:27b",
"--llm-timeout", "60"
]
}
}
}
Boolean flags support --flag / --no-flag syntax (e.g. --llm-enabled, --no-llm-rerank-enabled).
| CLI argument | Env var | Default | Description |
|---|---|---|---|
--default-backend | WEBGATE_DEFAULT_BACKEND | searxng | Active backend |
--searxng-url | WEBGATE_SEARXNG_URL | http://localhost:8080 | SearXNG instance URL |
--brave-api-key | WEBGATE_BRAVE_API_KEY | (empty) | Brave Search API key |
--tavily-api-key | WEBGATE_TAVILY_API_KEY | (empty) | Tavily API key |
--exa-api-key | WEBGATE_EXA_API_KEY | (empty) | Exa API key |
--serpapi-api-key | WEBGATE_SERPAPI_API_KEY | (empty) | SerpAPI key |
--serpapi-engine | WEBGATE_SERPAPI_ENGINE | google | SerpAPI engine (google, bing, ...) |
--serpapi-gl | WEBGATE_SERPAPI_GL | us | SerpAPI country code |
--serpapi-hl | WEBGATE_SERPAPI_HL | en | SerpAPI language |
--max-download-mb | WEBGATE_MAX_DOWNLOAD_MB | 1 | Per-page download size cap (MB) |
--max-result-length | WEBGATE_MAX_RESULT_LENGTH | 8000 | Per-page char cap (no-LLM queries) |
--max-query-budget | WEBGATE_MAX_QUERY_BUDGET | 32000 | Total char budget for fetch and query |
--max-search-queries | WEBGATE_MAX_SEARCH_QUERIES | 5 | Max queries per call |
--results-per-query | WEBGATE_RESULTS_PER_QUERY | 5 | Default results fetched per query |
--search-timeout | WEBGATE_SEARCH_TIMEOUT | 8 | HTTP request timeout (seconds) |
--oversampling-factor | WEBGATE_OVERSAMPLING_FACTOR | 2 | Search result multiplier for dedup reserve |
--auto-recovery-fetch | WEBGATE_AUTO_RECOVERY_FETCH | false | Enable gap-filler (Round 2 fetch) |
--max-total-results | WEBGATE_MAX_TOTAL_RESULTS | 20 | Hard cap on total results per call |
--debug | WEBGATE_DEBUG | false | Enable structured debug logging |
--log-file | WEBGATE_LOG_FILE | (empty) | Log file path (empty = stderr) |
--trace | WEBGATE_TRACE | false | Include content in summarized citations; also activates debug logging |
--adaptive-budget | WEBGATE_ADAPTIVE_BUDGET | false | [EXPERIMENTAL] Proportional char allocation based on BM25 rank |
--adaptive-budget-fetch-factor | WEBGATE_ADAPTIVE_BUDGET_FETCH_FACTOR | 3 | [EXPERIMENTAL] Generous pre-rank fetch multiplier |
--llm-enabled | WEBGATE_LLM_ENABLED | false | Enable LLM features |
--llm-base-url | WEBGATE_LLM_BASE_URL | http://localhost:11434/v1 | OpenAI-compatible endpoint |
--llm-api-key | WEBGATE_LLM_API_KEY | (empty) | API key (empty for local models) |
--llm-model | WEBGATE_LLM_MODEL | llama3.2 | Model name |
--llm-timeout | WEBGATE_LLM_TIMEOUT | 30 | LLM request timeout (seconds) |
--llm-expansion-enabled | WEBGATE_LLM_EXPANSION_ENABLED | true | Auto-expand queries into variants |
--llm-summarization-enabled | WEBGATE_LLM_SUMMARIZATION_ENABLED | true | LLM summary with citations |
--llm-rerank-enabled | WEBGATE_LLM_RERANK_ENABLED | false | LLM-assisted reranking |
--llm-max-summary-words | WEBGATE_LLM_MAX_SUMMARY_WORDS | 0 | Summary word target (0 = auto) |
--llm-input-budget-factor | WEBGATE_LLM_INPUT_BUDGET_FACTOR | 3 | LLM input budget multiplier |
| Backend | Auth | Notes |
|---|---|---|
| SearXNG | none | Self-hosted, recommended |
| Brave Search | API key | High quality, free tier available |
| Tavily | API key | AI-oriented snippets, free tier available |
| Exa | API key | Neural/semantic search, free tier available |
| SerpAPI | API key | Proxy for Google, Bing, DuckDuckGo and more, free tier available |
docker run -d -p 8080:8080 --name searxng searxng/searxng
Then set WEBGATE_SEARXNG_URL=http://localhost:8080.
Exa uses neural (semantic) search by default — the primary reason to use it over keyword backends. use_autoprompt is hardcoded to false (not user-configurable) because mcp-webgate handles query expansion via its own LLM expander.
engine selects the underlying search engine (google, bing, duckduckgo, yandex, yahoo). gl and hl significantly affect result quality for non-English queries.
When enabled, every tool call logs a structured entry:
fetch: URL, raw KB downloaded, clean KB returned, elapsed msquery: queries used, results requested/fetched/failed, raw MB, clean KB, total elapsed msexport WEBGATE_DEBUG=true # log to stderr
export WEBGATE_LOG_FILE=/tmp/wg.log # or log to file
These protections are always active — they are the core value proposition and cannot be disabled.
| What could go wrong | How webgate stops it |
|---|---|
| Page dumps 2 MB of HTML | max_download_mb hard cap — download stops mid-stream, never buffered |
| Cleaned text is still huge | max_result_length char cap per page |
| Many results flood the context | max_query_budget distributes a fixed total across all results |
| Too many pages fetched | max_total_results hard cap |
| PDF / ZIP / DOCX requested | Binary extension filter runs before any network request |
| Slow or hanging connections | search_timeout + 5s connect timeout |
| Invisible Unicode tricks in content | Full regex sterilization pipeline (zero-width, BiDi, etc.) |
| Rate limiting (429 / 502 / 503) | Exponential retry backoff, respects Retry-After header |
| Unwanted domains | blocked_domains / allowed_domains filter |
mcp-webgate is in beta. Core functionality is stable and the server is used in production, but the configuration API may still change before 1.0.
Feedback is very welcome. If something doesn't work as expected, behaves oddly, or you have a use case that isn't covered:
Bug reports, configuration questions, and feature requests all help shape the roadmap.
Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines on:
MIT License — see LICENSE for details.
Need help? Check the documentation or open an issue on GitHub.
Be the first to review this server!
by Modelcontextprotocol · Developer Tools
Read, search, and manipulate Git repositories programmatically
by Toleno · Developer Tools
Toleno Network MCP Server — Manage your Toleno mining account with Claude AI using natural language.
by mcp-marketplace · Developer Tools
Create, build, and publish Python MCP servers to PyPI — conversationally.
by Microsoft · Content & Media
Convert files (PDF, Word, Excel, images, audio) to Markdown for LLM consumption
by mcp-marketplace · Developer Tools
Scaffold, build, and publish TypeScript MCP servers to npm — conversationally
by mcp-marketplace · Finance
Free stock data and market news for any MCP-compatible AI assistant.