Methodology

Complete transparency on how we discover, analyze, and score every MCP server.

TL;DR

What we score: MCP server quality across up to 6 dimensions — Schema Quality, Protocol Compliance, Reliability, Docs & Maintenance, Security Hygiene, and Schema Interpretability — weighted into a 0–100 composite with a letter grade.

How: We discover servers from 8 sources, analyze source code + tool schemas via LLM, probe remote/local servers for protocol compliance, monitor uptime, and evaluate schema clarity with a 3-model AI consensus panel. Grades show dimension coverage: A (5/5) means 5 of 5 applicable dimensions scored.

What we don't score: Functional correctness (does the tool return the right answer), capability risk (is the tool dangerous to use), or MCP resources/prompts (not yet widely adopted). Latency measurements are from a single European probe location.

Weights are editorial: Based on engineering judgment about what matters most for server consumers. The scoring engine is open source — given the same inputs, it always produces the same outputs.

Scoring Independence

MCP Scoreboard is an open-source project. The scoring methodology is applied uniformly to every server. There is no way to pay for or influence scores. The only way to improve your score is to improve your server.

The scoring engine is fully open source — given the same inputs, it always produces the same outputs.

Overview

MCP Scoreboard provides independent quality scores for every public MCP server we can find. Every server on the scoreboard is measured through a progressive visibility model — the more quality data we collect, the more complete the picture becomes. Our unified analysis pipeline has four stages:

  1. Discover — ingest servers from 8 sources and deduplicate into a single catalog.
  2. Analyze — run static analysis on source code, extract tool schemas via LLM, discover endpoints, probe remote and local servers, and evaluate tool usability with multi-model AI consensus.
  3. Score — combine up to 6 weighted categories into a 0–100 composite score, applying runtime validation, schema drift detection, and red flag caps.
  4. Monitor — re-probe, re-analyze, and re-score continuously. Track changes over time.

Server authors who claim ownership unlock additional tools — publisher notes, improvement checklists, and score change alerts — but claiming has no effect on scores.

Scope

MCP Scoreboard scores MCP server quality across the dimensions where meaningful data exists at ecosystem scale. The MCP specification defines three server primitives — tools, resources, and prompts — plus client-side features like sampling, roots, and elicitation.

Today, >95% of MCP servers in the wild are tool-only. Our scoring reflects this reality: tool schema quality, protocol compliance, and implementation patterns are the primary signals. As resource and prompt adoption grows, scoring dimensions will expand to cover them.

Client-side features (sampling, roots, elicitation) are out of scope — we score servers, not clients. Functional correctness (does a tool return the right answer?) is not tested; this would require domain-specific ground truth that doesn't exist at ecosystem scale. Our scores measure implementation quality, not output accuracy.

Visibility Levels

Every server is assigned a visibility level that communicates how much quality data we have collected. Servers on the leaderboard are sorted by visibility tier first, then by composite score within each tier, so well-understood servers naturally rise to the top.

Level | Requirements | Composite?
Complete | All applicable standard dimensions plus Schema Interpretability evaluation (6/6 dims) | Yes + Grade
Verified | All applicable standard dimensions with live probe data (deep probe or sandbox), no agent eval (4–5 dims) | Yes + Grade
Assessed | All applicable standard dimensions scored from static analysis alone, no live probe (typically 3 dims for local-only servers) | Yes + Grade
Limited | 2 or more dimensions scored, but not all applicable dimensions filled | Yes + Grade
Unscored | 0 or 1 dimension — not enough data to compute a meaningful composite | No

How Visibility Affects Scoring

All servers with 2+ scored dimensions receive a weighted composite score and letter grade. Weights are renormalized across the available dimensions. This means a local server with 3 dimensions (Schema Quality, Docs & Maintenance, Security) can earn an A+ with a 95/100 score, while a remote server with all 6 dimensions might earn an A with a 92/100 — both are valid assessments of the data we have.
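Weight renormalization can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the engine's actual code — the function and dictionary names are hypothetical — but the weights match the Category Weights table:

```python
# Base category weights (without Schema Interpretability).
BASE_WEIGHTS = {
    "schema_quality": 0.25,
    "protocol_compliance": 0.20,
    "reliability": 0.20,
    "docs_maintenance": 0.15,
    "security_hygiene": 0.20,
}

def composite(scores):
    """Weighted composite over whichever dimensions have scores (0-100 each).
    Missing dimensions simply drop out; the remaining weights renormalize."""
    weights = {k: BASE_WEIGHTS[k] for k in scores if k in BASE_WEIGHTS}
    total = sum(weights.values())
    return round(sum(scores[k] * w for k, w in weights.items()) / total, 1)

# A local server scored on 3 dimensions (weights 25/15/20 renormalize over 60):
local = composite({"schema_quality": 98, "docs_maintenance": 90, "security_hygiene": 95})
# local == 95.0
```

Note that the missing dimensions neither help nor hurt: the composite is the weighted average of only what was measured.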

The visibility level communicates how much of the quality picture you're seeing. A “Complete” server's score reflects all 6 quality dimensions including real-time protocol probing and AI agent evaluation. A “Limited” server's score reflects only 2–3 dimensions and may change significantly as more data is collected.

Leaderboard sorting respects this hierarchy: Complete servers appear above Verified at the same score, Verified above Assessed, and so on. Within the same tier, higher composite scores win.

Dimension Checklist

Each server's detail page shows a dimension checklist indicating which quality dimensions have been scored, which are pending, and which don't apply to that server type. This helps server owners understand exactly what data is being used in their score and what actions would unlock additional dimensions.

How to Read a Score

Every grade on the scoreboard includes a dimension indicator showing how many quality dimensions contributed to that score. For example:

  • A+ (6/6) — All 6 dimensions scored (schema, protocol, reliability, docs, security, schema interpretability). This is the most complete assessment.
  • A (3/3) — 3 of 3 applicable dimensions scored. For a local-only server, protocol compliance and reliability don't apply, so 3/3 means fully assessed for its type.
  • B (2/5) — Only 2 of 5 applicable dimensions have data. This score will likely change as more data is collected.

The denominator is dimensions applicable (varies by server type), not always 6. A local server with 3/3 dimensions is as complete as a remote server with 5/5 — both have full coverage for their type. The leaderboard groups servers by visibility tier so you compare like with like.

Classification

Every server is automatically classified into a category and tagged with target platforms.

Categories

We classify servers into 14 categories based on keyword matching across the server name, description, registry namespace, and repo path: AI/ML, Database, DevTools, Cloud, Communication, Productivity, Search, Monitoring, Data, Finance, Media, E-commerce, Browser, and Other.

Target Platforms

More than 135 keywords are mapped to platform labels (e.g., "postgres" → PostgreSQL, "openai" → OpenAI). Matching uses a four-tier weighted search: name (3×), repo path (2×), registry namespace (2×), and description (1×).

Verified Publishers

Servers from known organizations (Anthropic, OpenAI, Microsoft, Google, Cloudflare, Stripe, Supabase, etc.) receive a verified publisher badge. This is detected from the registry namespace or GitHub organization. The verified publisher list is managed in the database and can be updated without code deploys.

Composite Scoring

Servers with two or more scored categories receive a weighted composite score from 0 to 100. The score is normalized over the categories that apply to each server.

Category Weights

Category | Without Interpretability | With Interpretability | Source
Schema Quality | 25% | 20% | Schema completeness (60%) + description quality (40%)
Protocol Compliance | 20% | 18% | Avg of reachability, schema validity, error handling, fuzz score, invocation smoke test (+10 auth bonus)
Reliability | 20% | 18% | Uptime (60–70%) + p50 latency (25–30%) + p95 latency (0–15%)
Docs & Maintenance | 15% | 12% | Documentation (30%) + maintenance pulse (30%) + dependency health (15%) + license (15%) + versions (10%)
Security Hygiene | 20% | 17% | Secret count, transport risk, credential sensitivity, distribution clarity, behavioral security analysis
Schema Interpretability | N/A | 15% | Multi-model LLM consensus evaluation (derived from per-tool metrics)

Weight Rationale

Category weights are editorial choices based on engineering judgment about what matters most for MCP server consumers. They are not empirically derived from user studies or adoption data (yet). Here's why each weight is what it is:

  • Schema Quality (25%): The primary interface between a server and its consumers. Agents, IDEs, and users all depend on schemas first. Poor schemas cascade into every downstream interaction.
  • Protocol Compliance (20%): A server that doesn't follow the MCP protocol correctly is unreliable regardless of other qualities. Error handling and fuzz resilience protect real users.
  • Reliability (20%): For remote servers, uptime and latency are table stakes. A perfect server that's down 30% of the time is a 70%-quality server in practice.
  • Docs & Maintenance (15%): Documentation and maintenance matter but are secondary to the interface and runtime behavior. A well-documented server with broken schemas is still broken.
  • Security Hygiene (20%): Security patterns affect trust. Weighted equally with protocol compliance because both are about implementation correctness.
  • Schema Interpretability (15%, enhanced only): Cross-model AI consensus is a strong proxy for schema clarity. Weighted lower than direct quality measurements because it's an indirect signal.

We welcome community input on weight calibration. The scoring engine is open source — you can re-weight categories and re-run analysis with your own priorities.

Security Scoring Detail

The Security Hygiene score measures whether a server follows secure implementation patterns appropriate to its deployment model. A high security score does NOT mean the server's capabilities are safe for all contexts. A filesystem MCP server can score well on security hygiene while still being dangerous to use without user consent. Always review tool descriptions and capability flags before granting a server access.

The score is computed from five sub-components based on the server's registry metadata and source code analysis:

Sub-component | Max Points | What it measures
Secret Env Vars | 20 | How many API keys, tokens, or passwords are required. 0 secrets = 20 pts, 1 = 16, 2 = 11, 3–4 = 6, 5+ = 0
Transport Risk | 15 | STDIO-only (local) = 15 pts, SSE = 9, remote = 6
Credential Sensitivity | 15 | Type of credentials required. DB passwords and cloud keys are weighted more heavily than API tokens
Distribution Clarity | 15 | Published package + source repo = 15, source repo only = 12, package only = 8, neither = 3
Behavioral Security | 35 | LLM-analyzed source code behavior: prompt injection patterns, data exfiltration risk, hardcoded credentials, dangerous operations, and scope creep. Defaults to 50/100 (18 pts) when no source analysis is available.
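The two table-driven sub-components translate directly into code. This is a minimal sketch of the point bands above — the function names are hypothetical, not the project's actual API:

```python
def secret_env_points(n_secrets):
    """Secret Env Vars sub-score (max 20): 0 = 20, 1 = 16, 2 = 11, 3-4 = 6, 5+ = 0."""
    bands = [(0, 20), (1, 16), (2, 11), (4, 6)]
    for upper_limit, pts in bands:
        if n_secrets <= upper_limit:
            return pts
    return 0  # five or more required secrets

def transport_points(transport):
    """Transport Risk sub-score: STDIO-only = 15, SSE = 9, other remote = 6."""
    return {"stdio": 15, "sse": 9}.get(transport, 6)
```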

Template Description Penalty

Servers with generic placeholder descriptions (“A Model Context Protocol server”, “MCP server for...”, “TODO: add description”) receive a −15 point penalty on their Schema Quality score.

Local vs Remote

Not all categories apply to every server. Protocol Compliance requires probe data or source code inference signals. Reliability requires accumulated probe history from ongoing monitoring. When a category doesn't apply, its weight is redistributed proportionally among the remaining categories.

Category | Remote + Monitored | With Protocol Data | Base (3 dims)
Schema Quality | 25% | ~31% | ~42%
Protocol Compliance | 20% | ~25% | N/A
Reliability | 20% | N/A | N/A
Docs & Maintenance | 15% | ~19% | ~25%
Security Hygiene | 20% | ~25% | ~33%

  • Remote + Monitored: Remote servers with accumulated probe history get all 5 dimensions.
  • With Protocol Data: Servers with protocol signals from live probes, sandbox probing, or source code inference — but without reliability monitoring history — get 4 dimensions.
  • Base: Servers with only static analysis and no protocol signals get 3 dimensions.

Letter Grades

Grades are assigned to all servers with 2 or more scored dimensions:

Grade | Score Range | Meaning
A+ | 95–100 | Exceptional
A | 85–94 | Excellent
B | 70–84 | Good
C | 55–69 | Acceptable
D | 40–54 | Below Standard
F | 0–39 | Failing
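The grade bands are a straightforward threshold lookup. A minimal sketch (hypothetical function name):

```python
def letter_grade(score):
    """Map a 0-100 composite to its grade band per the table above."""
    for threshold, grade in [(95, "A+"), (85, "A"), (70, "B"), (55, "C"), (40, "D")]:
        if score >= threshold:
            return grade
    return "F"
```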

Flags & Badges

Red Flags

Red flags surface potential quality or security concerns. They appear as pills on server listings and detail pages.

Flag | Severity | Condition
Dead Repository | Critical | GitHub repo returns 404 or has been deleted
Archived Repository | Critical | Repo is archived and no longer maintained
No Source Code | Warning | No repository URL or source link found
Sensitive Credentials | Warning | 3+ sensitive environment variables required
High Config Demand | Warning | 5+ potentially-sensitive environment variables (non-sensitive config like PORT, HOST, LOG_LEVEL is filtered out)
Stale Project | Warning | No commits in over 12 months
Staging Endpoint | Warning | Remote URL contains localhost or staging patterns
Template Description | Warning | Description is a default placeholder (−15 to Schema Quality)
Duplicate Description | Warning | Same description shared by 3+ servers
Schema Drift | Warning | Source code and runtime probe expose different sets of tools
Ambiguous Schemas | Warning | 2+ tools where AI models disagree significantly (spread > 25) on how to interpret the schema
Outdated Spec | Warning | Server implements the original MCP spec version (2024-11-05) and may lack newer protocol features
Stale Analysis | Warning | Static analysis data is over 60 days old, meaning scores may not reflect the current state of the codebase
Prompt Injection | Critical | Tool descriptions contain patterns consistent with prompt injection (attempts to manipulate LLM behavior)
Exfiltration Risk | Critical | Source code analysis detected patterns that could exfiltrate user data to external services

Flag Score Caps

Critical flags impose hard caps on the composite score, regardless of how well the server scores in other categories. Individual category scores remain visible for diagnostics.

Flag | Max Composite | Effect
Dead Repository | 0 | Score forced to 0 — server is unusable
Archived Repository | 40 | Maximum D grade — no longer maintained
Staging Endpoint | 55 | Maximum C grade — not a production endpoint
Exfiltration Risk | 25 | Severe security concern — potential data exfiltration
Prompt Injection | 30 | Severe security concern — tool descriptions may manipulate LLM behavior
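Applying the caps reduces to taking the minimum of the computed composite and the strictest cap among the flags present. A minimal sketch with hypothetical names:

```python
# Caps from the table above, keyed by hypothetical flag identifiers.
FLAG_CAPS = {
    "dead_repository": 0,
    "archived_repository": 40,
    "staging_endpoint": 55,
    "exfiltration_risk": 25,
    "prompt_injection": 30,
}

def apply_flag_caps(composite, flags):
    """Cap the composite at the strictest limit among critical flags present.
    Category scores stay untouched -- only the composite is capped."""
    return min([composite] + [FLAG_CAPS[f] for f in flags if f in FLAG_CAPS])
```

For example, a server computing a 92 composite but carrying the Archived Repository flag reports 40.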

Badges

Badges provide quick visual indicators of specific qualities. They're grouped by category:

  • Schema: README, Changelog, Examples, Contributing Guide, SECURITY.md, Namespace Match, Installable
  • Protocol: Reachable, Schema Valid, Tool Count, Error Handling Quality, Auth Discovery
  • Reliability: 99%+ Uptime, 95%+ Uptime, Degraded, Fast (<200ms), Slow (>500ms), Local Only
  • Maintenance: Active Development, Stale, Abandoned, Regular Releases, CI/CD, Lock File, Semver
  • Security: No Secrets Required, Few Secrets, Many Secrets, STDIO Only, Published Package

Data Collection

We discover MCP servers from 8 independent sources. Each source is ingested daily and contributes different metadata.

Source | Type | Key Data
Official MCP Registry | REST API | Canonical registry_id, repo URL, remote endpoints, transport type, env vars
awesome-mcp-servers | GitHub README | Curated list with language badges and category tags
Glama.ai | REST API | 17K+ servers, license info, hosting attributes, remote capability
PulseMCP | REST API | Visitor estimates, official flags, version info
GitHub Topics | Search API | Repos tagged mcp-server or model-context-protocol
Smithery.ai | REST API | 2,200+ servers, deployment URLs for remote-deployed servers
best-of-mcp-servers | CSV snapshots | Project rank scores, contributor counts, category labels
Docker Hub MCP Catalog | REST API | Official mcp/* images, pull counts, verified publisher status

We also enrich servers with download data from npm and PyPI on a weekly basis.

Deduplication

The same MCP server often appears in multiple sources under different names or URLs. Our deduplication engine merges them into a single canonical record.

  • Primary key: Normalized GitHub repo URL (https://github.com/{owner}/{repo}, stripped of .git and trailing slashes)
  • Fallback key: Registry ID for servers without a GitHub repo
  • Source priority: Official Registry fields take precedence, then aggregator-specific metadata is layered on
  • Source tracking: Each server tracks which sources reported it, visible on the detail page
  • External URLs: Links to Glama, Smithery, PulseMCP, Docker Hub, npm, and PyPI are preserved and never overwritten with blanks
  • Lifecycle: Servers that disappear from all sources are deactivated (not deleted)
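The primary-key normalization can be sketched as follows. This is an illustrative reconstruction — case-folding the whole URL is a simplifying assumption here (GitHub routes owner/repo paths case-insensitively), and the function name is hypothetical:

```python
from urllib.parse import urlparse

def canonical_repo_key(url):
    """Normalize a GitHub URL to https://github.com/{owner}/{repo},
    stripping .git suffixes, trailing slashes, and sub-paths."""
    path = urlparse(url.strip()).path.strip("/")
    if path.lower().endswith(".git"):
        path = path[:-4]
    owner, _, rest = path.partition("/")
    repo = rest.split("/")[0]  # drop /tree/..., /blob/... sub-paths
    return f"https://github.com/{owner}/{repo}".lower()

# Both forms collapse to the same canonical key:
a = canonical_repo_key("https://github.com/Acme/server.git")
b = canonical_repo_key("https://github.com/acme/server/")
# a == b == "https://github.com/acme/server"
```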

Pipeline Architecture

After discovery, each server flows through a pipeline of analysis phases. The path depends on what the server exposes:

Has GitHub repo + remote endpoint:
  analysis → remote_probe → scoring → agent_eval → scored
Has GitHub repo + local entry point:
  analysis → sandbox_probe → scoring → agent_eval → scored
Has GitHub repo only:
  analysis → scoring → agent_eval → scored
No GitHub repo:
  scoring → scored

The analysis phase is comprehensive: it runs static code analysis, extracts tool schemas via LLM, and discovers remote endpoints from README files — all in a single pass. This integrated approach avoids redundant GitHub API calls and gets servers to their scoring phase faster.

Servers flow through phases independently. A pipeline tick runs every 2 minutes, dispatching work for all phases concurrently. Failed servers are automatically retried with exponential backoff.

Static Analysis (Tier 1)

Static analysis examines the server's GitHub repository without connecting to the server itself. It runs hourly in batches, dispatching up to 1,000 servers per cycle. Each server requires ~10 GitHub API calls. A freshness check skips repos that haven't changed since the last scan.

Tool Schema Extraction

A unified LLM-based extraction module identifies and parses tool definitions from source code. This is a critical input for Schema Quality scoring and the foundation for agent usability evaluation.

  1. File selection: Source files are selected in multiple stages. First, manifest files (package.json, pyproject.toml, setup.py) are checked for MCP SDK dependencies (@modelcontextprotocol/sdk, mcp, etc.) to confirm the project is an MCP server. Next, files matching MCP-relevant keywords (tool, server, mcp, handler, route, api, endpoint, function, core, plugin, etc.) are selected. An import-based scan then adds files containing MCP SDK imports (from mcp, import mcp, from "@modelcontextprotocol). For confirmed MCP projects, common entry points (src/index.ts, main.py, server.py, etc.) are always included. Fallback stages check source directories (src/, lib/, packages/, servers/, plugins/) and then any source file in the repo. The top 20 files are selected (25KB per file). For confirmed MCP projects with few tool-pattern matches, additional source files are included to enable behavioral security analysis and spec version detection. Supported extensions: .py, .ts, .js, .mjs, .tsx, .jsx, .go, .rs.
  2. LLM extraction: The selected source files are sent to a lightweight LLM (see Models Used) which extracts structured tool definitions: name, description, and full inputSchema (JSON Schema for parameters).
  3. Endpoint discovery: The server's README is parsed for remote endpoint URLs using a separate LLM call. Discovered endpoints are validated against a domain blocklist (documentation sites, package registries, API catalogs like Glama and Smithery), checked for MCP-specific URL patterns, and verified via DNS resolution. Placeholder and example domains are rejected. Invalid endpoints are not stored, preventing repeated failed probes.

The extracted tool schemas are stored and used downstream by the scoring engine, agent usability evaluation, and runtime drift detection. All LLM calls include automatic retry with backoff on rate-limit responses (up to 3 attempts).

Source Code Inference

For servers with source files but no live endpoint (the majority of the catalog), three focused LLM inference passes run on the same source code to infer protocol compliance signals. This closes the data gap for servers that can't be probed remotely.

  1. Error Handling Analysis — evaluates try/catch patterns, MCP error response formatting, timeout handling, and graceful degradation. Produces an error handling score (0–100).
  2. Input Validation & Fuzz Resilience — evaluates schema validation (Zod, Pydantic, JSON Schema), type checking, bounds checking, null handling, and injection prevention. Produces a fuzz score (0–100) and schema validity assessment.
  3. Security & Auth Patterns — evaluates authentication mechanisms, secret handling (env vars vs hardcoded), HTTPS enforcement, and safe file access. Produces an auth discovery assessment.

These inferred signals are used as a fallback when no live probe or sandbox probe data exists, allowing the Protocol Compliance dimension to be scored for source-code-only servers. When a live probe exists but failed to connect (producing no protocol data), inference data supplements the probe results.

Source inference uses lightweight LLMs for all three scoring passes (see Models Used for exact versions). The same source text budget as tool extraction is used. Each pass is a separate, focused LLM call to maximize accuracy per dimension.

Seven Sub-Metrics

In addition to tool extraction, seven sub-metrics are computed from the repository:

Schema Completeness (0–100)

Do tools have typed parameters and descriptions? We scan source files for tool definition patterns and check for schema markers: inputSchema, parameters, properties, description, type, required.

  • Base 40 points when tool definitions are found
  • Up to +60 based on the ratio of schema markers present
  • Capped at 60 if no inputSchema or parameters found
  • Capped at 70 if no description markers found
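The base/bonus/cap logic above can be sketched directly. This is an illustrative reconstruction of the heuristic, not the engine's actual code — in particular, treating marker detection as plain substring matching on the selected source text is an assumption:

```python
MARKERS = ("inputSchema", "parameters", "properties", "description", "type", "required")

def schema_completeness(source, has_tool_definitions):
    """Base 40 when tool definitions are found, up to +60 for marker
    coverage, capped at 60 without inputSchema/parameters and at 70
    without description markers."""
    if not has_tool_definitions:
        return 0
    found = sum(marker in source for marker in MARKERS)
    score = 40 + round(60 * found / len(MARKERS))
    if "inputSchema" not in source and "parameters" not in source:
        score = min(score, 60)
    if "description" not in source:
        score = min(score, 70)
    return score
```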

Runtime override: When runtime tool schemas from a probe produce a higher score than the static analysis, the runtime score is always preferred. This prevents penalizing servers whose tools are well-defined but registered dynamically at runtime. See Runtime Validation for details.

Description Quality (0–100)

Is the repo description clear and the README useful? We check:

  • Repo description length and action-word usage
  • README length, heading count, code examples, and usage sections

Documentation Coverage (0–100)

Does the repo include the documentation artifacts developers expect?

  • README (+25), CHANGELOG (+15), examples/ directory (+15), CONTRIBUTING guide (+10)
  • LICENSE file (+10), docs/ directory (+7)
  • Provenance signals: SECURITY.md (+3), CODE_OF_CONDUCT (+3), namespace/owner match (+7), installable package (+5)

Maintenance Pulse (0–100)

Is the project actively maintained? Scored using a three-signal stability model that avoids penalizing mature, stable projects:

  • Vitality (40 pts): Push recency (up to 25 pts), commit count (up to 10 pts), and a stability floor (minimum 20 pts for mature projects with ≥50 stars, releases, and low issue ratio)
  • Release Discipline (30 pts): Release count (up to 10 pts), release recency (up to 10 pts), and release notes quality (up to 10 pts)
  • Community Health (30 pts): Star tiers (up to 10 pts), issue responsiveness (up to 10 pts, neutral at 5 for projects with 0 issues), and fork engagement (up to 10 pts)

Stability floor: Mature projects (50+ stars, has releases, low issue ratio) receive a minimum vitality score of 20, preventing them from being penalized for infrequent commits. Stability is not the same as neglect.

Dependency Health (0–100)

How well are dependencies managed?

  • Dependency manifest present (+30)
  • Lock file present (+25)
  • CI/CD configuration (+20)
  • Dependency automation (Renovate, Dependabot) (+25)

License Clarity (0–100)

Is there a recognized license? Known SPDX licenses (MIT, Apache-2.0, GPL-3.0, etc.) score 100. Non-standard but identified licenses score 70. Custom licenses score 40. No license scores 0.

Version Hygiene (0–100)

Does the project follow release best practices? Scored on: GitHub releases (+30), semver tag ratio (+35), release notes (+15), and pre-release usage (+10).

Protocol Probes (Tier 2)

Protocol probes test a server against the MCP specification. They apply to remote servers with endpoints (Streamable HTTP or SSE) and to local servers via sandbox probing.

Connection & Initialization

Can the server be reached? Does it complete the MCP handshake with proper capabilities declaration? Timing is recorded: connection, initialization, and ping latency in milliseconds.

Tool Discovery & Schema Validation

Does tools/list return well-formed tool definitions matching the MCP JSON Schema spec? We record tool count, schema validity, and any schema issues. The full tool schemas are captured as tools_json for use in agent usability evaluation and runtime validation.

Error Handling

Does the server return proper JSON-RPC 2.0 error codes for bad inputs? We test with:

  • Unknown tool name — should return a proper error, not crash
  • Missing required parameters — should return a proper error
  • Wrong parameter types — should return a proper error
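What counts as a "proper error" is a well-formed JSON-RPC 2.0 error object. A minimal sketch of such a check — the function name is hypothetical, and the error code shown is illustrative (the exact code is up to the server):

```python
def is_proper_jsonrpc_error(resp):
    """True when a bad request produced a well-formed JSON-RPC 2.0 error:
    an `error` object with an integer code and a string message,
    rather than a result payload or a crashed connection."""
    err = resp.get("error")
    return (
        resp.get("jsonrpc") == "2.0"
        and isinstance(err, dict)
        and isinstance(err.get("code"), int)
        and isinstance(err.get("message"), str)
    )

# Calling an unknown tool should yield a response shaped like this:
resp = {"jsonrpc": "2.0", "id": 7,
        "error": {"code": -32602, "message": "Unknown tool: no_such_tool"}}
# is_proper_jsonrpc_error(resp) -> True
```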

Fuzz Testing

How does the server handle adversarial inputs? Tested with oversized strings, empty strings, Unicode edge cases, null values in required fields, and boundary values. Each test is scored as "proper_error" (good), "accepted" (ok), or "crash" (bad).

Auth Discovery

If authentication is required, does the server implement OAuth discovery per the MCP auth spec? We check for .well-known/oauth-protected-resource. Servers that successfully implement auth discovery earn a +10 bonus on their protocol compliance score (capped at 100). Servers without auth discovery are not penalized.

Sandbox Probing

Local servers (stdio transport, no remote endpoint) can still earn protocol compliance scores through sandbox probing. We detect the server's entry point from static analysis, then:

  1. Clone the repository into an isolated Docker container
  2. Install dependencies and start the server using the detected entry point
  3. Run the full protocol probe suite (tool discovery, schema validation, error handling, fuzz testing)

Sandbox probing allows local-only servers to demonstrate protocol compliance and earn a more complete score. The probe results are equivalent to a deep probe on a remote endpoint.

Source Code Inference Fallback

Servers without a remote endpoint or sandbox probe can still earn protocol compliance scores through source code inference. When no live probe data exists, the LLM-inferred signals (error handling score, schema validity, fuzz resilience, auth discovery) are used to construct a synthetic protocol result. This means a well-written local server can demonstrate protocol awareness from its source code alone.

For details on probe frequency, consent, and rate limiting, see our Probe Policy.

Reliability Monitoring (Tier 3)

Reliability is measured over time through repeated health checks. This is what differentiates a scoreboard from a one-time scan.

Fast Probes (rolling cycle)

A lightweight health check: connect, initialize, ping. The probe task runs every 5 minutes and cycles through the fleet — each individual server is probed roughly once every 15–30 minutes depending on fleet size. This powers uptime calculations and incident detection.

Uptime

Percentage of successful probes over rolling windows. We track 24-hour, 7-day, and 30-day uptime. The 7-day figure is used in the composite score.

Latency Profile

Connection time percentiles (p50, p95) computed from the 7-day window of fast probe results. Only successful probes (where the server was reachable) are included.

The reliability score combines uptime and latency using a weighted blend:

  • With p95 data: Uptime × 60% + p50 latency × 25% + p95 latency × 15%
  • Without p95: Uptime × 70% + p50 latency × 30%

Latency is scored using a continuous curve (not step thresholds) for smoother differentiation:

Latency | Score | Note
100ms | 100 | Excellent
200ms | 90 | Very good
500ms | 70 | Acceptable
1,000ms | 50 | Slow
2,000ms | 25 | Very slow
5,000ms+ | 10 | Minimum

Data quality gate: A minimum of 10 probe results is required before a reliability score is generated. This prevents a single successful probe from producing a misleading 100% uptime score.
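Putting the blend and the curve together, the reliability score looks roughly like this. The exact shape of the continuous curve between the documented points isn't published, so linear interpolation is an assumption here, as are the function names:

```python
# Breakpoints taken from the latency table above.
LATENCY_CURVE = [(100, 100), (200, 90), (500, 70), (1000, 50), (2000, 25), (5000, 10)]

def latency_score(ms):
    """Piecewise-linear interpolation through the documented breakpoints."""
    if ms <= 100:
        return 100.0
    for (x0, y0), (x1, y1) in zip(LATENCY_CURVE, LATENCY_CURVE[1:]):
        if ms <= x1:
            return y0 + (y1 - y0) * (ms - x0) / (x1 - x0)
    return 10.0  # floor at 5,000ms and beyond

def reliability_score(uptime_pct, p50_ms, p95_ms=None):
    """Uptime x 60% + p50 x 25% + p95 x 15%, or 70/30 without p95 data."""
    if p95_ms is None:
        return 0.70 * uptime_pct + 0.30 * latency_score(p50_ms)
    return (0.60 * uptime_pct
            + 0.25 * latency_score(p50_ms)
            + 0.15 * latency_score(p95_ms))
```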

Schema Interpretability (Tier 4)

For servers with at least “Assessed” visibility and at least 1 tool, we run an AI-powered evaluation: can diverse AI models correctly interpret these tool schemas and generate valid inputs?

Schema Interpretability measures whether AI models can understand tool schemas well enough to use them correctly. This is a necessary condition for agent usability but does not measure end-to-end task success. If three cheap models from different families all struggle to understand a schema, that schema will cause problems for real agent use.

Three-Judge Consensus

Three different LLMs independently evaluate the same tool schemas. Each model assesses every tool on three dimensions:

  • Clarity: Can you understand what this tool does from its name and description?
  • Input Completeness: Are parameters well-defined with types, descriptions, and constraints?
  • Confidence: Would you construct a correct call on the first try?

Using three diverse models (from different providers) means the score reflects the general agent experience, not one model's quirks. We intentionally use mid-tier models because schemas that only work with frontier models will fail for the majority of real-world agent deployments. If three different model families all struggle with your schema, that's a meaningful signal.

Derived Scoring

Rather than relying on a model's self-reported “overall score” (which can be inconsistent across models), we derive the score from the per-tool metrics. For each tool, the clarity, input completeness, and confidence scores are averaged, then the tool averages are combined into a single server-level score. This produces more stable, reproducible results.
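The derivation is a simple two-level average. A minimal sketch (hypothetical function and key names):

```python
def derived_server_score(tool_evals):
    """Average clarity, input completeness, and confidence per tool,
    then average the per-tool results into one server-level score."""
    per_tool = [(t["clarity"] + t["input_completeness"] + t["confidence"]) / 3
                for t in tool_evals]
    return round(sum(per_tool) / len(per_tool), 1)

tools = [
    {"clarity": 90, "input_completeness": 85, "confidence": 95},  # tool avg 90
    {"clarity": 60, "input_completeness": 55, "confidence": 65},  # tool avg 60
]
# derived_server_score(tools) -> 75.0
```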

Score Blending

Individual model scores are blended using a minimum-weighted strategy — the lowest score carries the most weight. This ensures the final score reflects the worst-case agent experience:

Models | Weighting
3 models | Lowest × 50% + Middle × 30% + Highest × 20%
2 models | Lowest × 60% + Highest × 40%

Consensus Levels

Spread | Consensus | Penalty
≤ 10 points | High | None
≤ 25 points | Medium | None
> 25 points | Low | × 0.90

The 10% penalty on low consensus reflects genuine uncertainty about the tool quality.
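Blending and the consensus penalty combine into one small function. This sketch follows the two tables above; the function name is hypothetical:

```python
def blend_scores(model_scores):
    """Minimum-weighted blend of per-model scores, then the 10%
    low-consensus penalty when the spread exceeds 25 points."""
    s = sorted(model_scores)
    if len(s) >= 3:
        blended = s[0] * 0.50 + s[1] * 0.30 + s[2] * 0.20
    elif len(s) == 2:
        blended = s[0] * 0.60 + s[1] * 0.40
    else:
        blended = s[0]
    if s[-1] - s[0] > 25:  # low consensus
        blended *= 0.90
    return round(blended, 1)

# Agreement (spread 20):    blend_scores([80, 90, 100]) -> 87.0
# Disagreement (spread 40): blend_scores([60, 90, 100]) -> 69.3
```

Note how the disagreement case is penalized twice, in effect: the lowest score already dominates the blend, and the spread triggers the × 0.90 multiplier on top.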

Per-Tool Consensus Analysis

Beyond the server-level score, we analyze agreement at the individual tool level. For each tool evaluated by 2+ models, we compare scores across all three dimensions (clarity, input completeness, confidence). Tools where any single dimension has a spread greater than 25 points across models are flagged as ambiguous.

Ambiguous tools indicate schemas that some AI models can interpret correctly while others cannot — a sign that the tool's description, parameter names, or type constraints need improvement. When 2+ tools are flagged as ambiguous, the server receives an Ambiguous Schemas flag.

The per-tool breakdown is visible in the Schema Interpretability modal on each server's detail page, showing each tool's average score and highlighting disagreements.

Runtime Validation & Drift Detection

Static analysis reads source code. Protocol probes observe runtime behavior. These two views of the same server can diverge — and when they do, it matters.

Runtime Schema Override

When a server has both static analysis results and runtime tool schemas (from a deep probe or sandbox probe), the runtime schema completeness score is always preferred when it's higher. This means:

  • Python servers that register tools dynamically at startup are scored on their actual runtime schemas, not on what pattern matching found in source code
  • Servers using code generation or plugin systems get credit for their real tool definitions
  • Static analysis is never used to lower a score that runtime data would support

Schema Drift Detection

We compare the set of tool names found in static analysis against those discovered at runtime. When the sets differ, this is flagged as schema drift:

  • Missing at runtime: Tools defined in source code that don't appear when the server actually runs. May indicate dead code, conditional registration, or tools that fail to load.
  • Extra at runtime: Tools that appear at runtime but aren't in the source code. Common with plugin systems, dynamic registration, or tools loaded from configuration.
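Drift detection is a set comparison over tool names. A minimal sketch, assuming both tool lists are available as sets of strings:

```python
def schema_drift(static_tools: set[str], runtime_tools: set[str]) -> dict:
    """Compare tool names found in static analysis against those
    discovered at runtime; nonempty differences indicate drift."""
    return {
        "missing_at_runtime": sorted(static_tools - runtime_tools),
        "extra_at_runtime": sorted(runtime_tools - static_tools),
        "has_drift": static_tools != runtime_tools,
    }
```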

Schema drift data is stored per-server and used to generate the Schema Drift flag. Drift isn't inherently bad — dynamic tool registration is a valid pattern — but significant divergence between source and runtime can indicate maintenance issues or surprises for users reading the source code.

Score Lifecycle

Scores are not static snapshots. Every server is periodically re-evaluated to incorporate scoring methodology improvements and detect changes.

What Triggers a Rescore

  • Code push detected: An hourly check compares GitHub pushed_at timestamps. When a repo has new commits, the server is re-analyzed and rescored with the current engine and models.
  • Probe data accumulation: Remote servers that accumulate enough probe history (10+ fast probes) are rescored to incorporate reliability data.
  • Owner request: Claimed server owners can request a priority rescore (once per 24 hours) after making improvements.
  • Periodic freshness sweep: The oldest scored servers are cycled back through the pipeline weekly, ensuring the entire fleet is refreshed approximately every 8 weeks.
  • Methodology updates: When scoring formula changes are deployed (weight adjustments, new dimensions, restructured sub-components), a fleet-wide rescore is triggered so changes apply uniformly.

Score Freshness

Every server's API response includes last_scored_at and score_age_days so consumers know how fresh a score is. The typical refresh cycle is ~8 weeks for the full fleet, with more frequently updated servers (active repos, remote endpoints) rescored sooner.

Ongoing Maintenance

Scores are not static. The system continuously re-evaluates every server.

| Task | Frequency | What It Does |
|---|---|---|
| Server Discovery | Daily | Ingest all 8 sources, deduplicate, create/update/deactivate servers |
| Static Analysis | Hourly | Dispatch up to 1,000 servers; skip unchanged repos. Includes tool extraction and endpoint discovery. |
| Fast Probes | Every 5 min (rolling) | Health check remote endpoints on a rolling cycle: connect, initialize, ping. Each server probed ~every 15–30 min. |
| Deep Probes | Daily | Full protocol compliance scan: tools, error handling, fuzz, auth |
| Schema Interpretability Eval | Daily | Three-judge LLM evaluation with per-tool consensus analysis |
| Download Enrichment | Weekly | Fetch npm and PyPI download counts |
| Score Computation | Daily | Recompute all composite scores, grades, flags, badges, and drift detection |

Stale Data Decay

Static analysis scores gradually decay as data ages, ensuring scores reflect current code rather than stale snapshots. Our hourly analysis cycle skips repositories that haven't changed since the last scan, which means stable projects with no recent commits will see their analysis age and decay apply. This is a known trade-off: we prioritize pipeline throughput over re-confirming unchanged results.

In practice, the decay is mild for the first 30 days (0.90×) and most active projects push commits well within that window. If you believe your stable project's score has decayed unfairly, open an issue or request a priority rescore to refresh the analysis timestamp.

  • ≤ 7 days: Full value (1.0×)
  • ≤ 14 days: 0.97×
  • ≤ 30 days: 0.90×
  • ≤ 60 days: 0.75×
  • ≤ 90 days: 0.55×
  • > 90 days: 0.40× (floor)
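The decay schedule above maps directly to a lookup function (a sketch of the published thresholds, not the engine's exact code):

```python
def decay_multiplier(age_days: int) -> float:
    """Map static analysis age in days to its decay multiplier."""
    for limit, mult in [(7, 1.0), (14, 0.97), (30, 0.90), (60, 0.75), (90, 0.55)]:
        if age_days <= limit:
            return mult
    return 0.40  # floor beyond 90 days
```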

Score History

A daily snapshot of every server's scores is stored. This powers the score trend charts on server detail pages and lets us detect significant changes.

Score Drop Alerts

When a server's composite score drops by 10+ points between runs, an alert is generated. This helps catch regressions from broken endpoints, deleted repos, or dependency issues.

Infrastructure Monitoring

We monitor our own probe infrastructure: alerts fire if fast probes haven't run in 15 minutes, if probe success rate drops below 50%, or if deep probes haven't run in 36 hours.

Open Source Engine

The scoring engine that computes all quality scores is open source and available on PyPI as mcp-scoring-engine. You can run it locally to understand exactly how your server is scored, or integrate it into your CI/CD pipeline.

  • Install: pip install mcp-scoring-engine
  • Usage: The engine accepts server metadata, static analysis results, and optional probe data, then produces the same composite score, grade, visibility level, flags, and badges shown on this scoreboard.
  • Deterministic: Given the same inputs, the engine always produces the same outputs. No randomness, no server-side state.
  • Transparency: Every scoring rule, weight, flag condition, and grade threshold documented on this page is implemented in the open source code. The engine is the source of truth.

The scoreboard runs the same engine version deployed to production. The current version is shown in the API response headers.

Known Limitations

We believe in transparency about what our methodology can and cannot measure. These are known gaps we're actively working to improve.

Spec Version Detection

We detect which MCP specification version a server implements by analyzing its SDK dependencies and source code patterns. Servers using MCP SDK < 1.1.0 are mapped to the original spec (2024-11-05), SDK 1.1.x–1.5.x to spec 2025-06-18, and SDK ≥ 1.6.0 to spec 2025-11-25. Source code markers (streamable HTTP, elicitation, tasks, structured output) provide additional signals.
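The SDK-version mapping can be sketched as a comparison on the (major, minor) version tuple. This is an illustration of the mapping stated above, ignoring the additional source-marker signals:

```python
def spec_from_sdk_version(version: str) -> str:
    """Map an MCP SDK version string to the spec revision it targets."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (1, 1):
        return "2024-11-05"  # original spec
    if (major, minor) < (1, 6):
        return "2025-06-18"
    return "2025-11-25"
```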

Limitation: servers that pin old SDK versions but implement newer features through custom code may be incorrectly classified as using an older spec. Servers not using a recognized SDK cannot be spec-versioned from dependencies alone.

Behavioral Security Analysis

Source code is analyzed by an LLM to detect behavioral security concerns: prompt injection patterns in tool descriptions, data exfiltration risk, hardcoded credentials, dangerous operations (file deletion, process execution), and scope creep beyond stated purpose. The analysis produces a behavioral security score (0–100) that contributes 35% of the security category.

Limitation: LLM-based analysis may produce false positives for legitimate patterns (e.g., a file management server that intentionally deletes files) and false negatives for obfuscated malicious code. Servers without source code receive a neutral default score of 50/100 (contributing 18 of 35 possible security points). This is not a passing grade — no-source servers are capped at “Limited” visibility and cannot score above the mid-range because they also miss schema quality and docs/maintenance dimensions. The neutral default prevents penalizing closed-source servers that may be perfectly safe, while the visibility cap ensures they never appear alongside fully-analyzed servers.

Invocation Smoke Tests

During deep probes, we attempt to call each discovered tool with schema-valid inputs generated from the tool's JSON Schema definition. The invocation smoke test score reflects the percentage of tools that return a non-error structured response within 10 seconds. This is included as a component of the Protocol Compliance category.
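The smoke test score itself is a simple pass rate. A minimal sketch, assuming one pass/fail result has already been collected per tool:

```python
def smoke_test_score(results: list[bool]) -> float:
    """Percentage of tools that returned a non-error structured
    response within the timeout (one pass/fail entry per tool)."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```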

Limitation: generated inputs satisfy schema constraints but may not be semantically valid (e.g., a valid URL format pointing to a nonexistent resource). Tools that require specific external state (databases, APIs, files) may fail even when implemented correctly. The score reflects invocation success, not functional correctness — a server could have perfect scores and return wrong data for every call.

Sandbox Probing Coverage

Local-only servers are probed inside Docker containers: we clone the repository, install dependencies, and run the server in an isolated sandbox. Mock environment variables are injected for common patterns (API keys, tokens, database URLs) so servers that require credentials can still be probed. If the initial install fails, a minimal no-deps fallback is attempted for Python projects to maximize probing coverage.

Limitation: not all servers can be sandbox-probed. Complex build systems, native dependencies, or non-standard project layouts may prevent successful installation. Servers installed via the no-deps fallback may be missing runtime dependencies and fail during probing. The mock environment approach means servers requiring real external services (actual database connections, third-party APIs) will fail at runtime even if they install successfully.

Endpoint Discovery Coverage

Remote endpoints are discovered from source code and README files using LLM-based extraction. Discovered URLs are validated against domain blocklists, DNS resolution, and MCP-specific pattern checks before being stored. Endpoints from third-party registries (Smithery, Glama) are proxy URLs that typically do not support direct MCP connections and are rejected.

Limitation: not all servers have discoverable remote endpoints. Servers behind authentication, private networks, or requiring specific deployment cannot be probed remotely. LLM-based endpoint extraction from READMEs may miss endpoints in non-standard documentation formats. Some servers are designed to run locally only and have no remote endpoint at all.

Visibility-Based Scoring

Composite scores are computed from however many dimensions are available (2–6), with weights renormalized proportionally.

Limitation: a server scored on 3 dimensions can achieve the same composite score as one scored on 6, even though the 6-dimension score is structurally more informative. To keep coverage visible, grades display as A (3/3) or A (6/6), and the leaderboard groups servers by visibility tier for fair comparison.
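Proportional renormalization can be sketched as below (dimension names are illustrative, not the scoreboard's canonical identifiers):

```python
def renormalize(weights: dict[str, float], available: set[str]) -> dict[str, float]:
    """Keep only the available dimensions and rescale their weights
    so they sum to 1 again, preserving their relative proportions."""
    kept = {dim: w for dim, w in weights.items() if dim in available}
    total = sum(kept.values())
    return {dim: w / total for dim, w in kept.items()}
```

For example, weights of 0.5/0.3/0.2 restricted to the first two dimensions rescale to 0.625/0.375.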

Geographic Probe Bias

All latency measurements are taken from a single probe location in Europe (Hetzner). Servers hosted in other regions may show higher latency scores that reflect geographic distance rather than server quality. This affects the latency component of the Reliability dimension.

Mitigation: we plan to add multi-region probe infrastructure in the future. Until then, uptime (which is geography-independent) is weighted more heavily than latency in the reliability score.

Static Analysis Scope

We analyze up to 20 source files (25KB per file) per server. This covers the vast majority of tool definitions in practice, but very large monorepos or deeply nested project structures may have tools in files outside this budget.

Functional Correctness

MCP Scoreboard does not test functional correctness — whether a tool's output is accurate for a given input. This would require domain-specific ground truth that doesn't exist at ecosystem scale. A server could have perfect implementation quality scores and still return wrong data. Our scores measure how well a server is built, not whether its outputs are correct.

Security vs. Safety

The Security Hygiene dimension measures secure implementation patterns (secret handling, transport choice, behavioral analysis). It does NOT measure capability risk — whether a server's tools are dangerous to use. A filesystem MCP server that properly handles secrets and uses appropriate transport can score well on security hygiene while still providing powerful capabilities that require user consent. Always review a server's tool descriptions before granting access.

Models Used

LLMs are used in several pipeline stages: tool schema extraction, endpoint discovery, source code inference, and schema interpretability evaluation. All models are pinned to specific versions — we never use latest aliases in production scoring. Model versions are listed below and updated only after running evaluation benchmarks to verify consistency.

| Pipeline Stage | Task | Provider | Model |
|---|---|---|---|
| Schema Interpretability | Agent Judge A | OpenAI | gpt-5-nano |
| Schema Interpretability | Agent Judge B | Google | gemini-2.0-flash |
| Schema Interpretability | Agent Judge C | Mistral | mistral-small-2501 |
| Static Analysis | Endpoint Discovery | Mistral | mistral-small-2501 |
| Static Analysis | Tool Schema Extraction | Mistral | mistral-small-2501 |
| Source Inference | Error Handling Inference | OpenAI | gpt-4.1-nano |
| Source Inference | Input Validation Inference | OpenAI | gpt-4.1-nano |
| Source Inference | Security Pattern Inference | OpenAI | gpt-4.1-nano |

Why no Anthropic models? Our pipeline tasks (schema extraction, endpoint discovery, security analysis) require high-throughput, low-cost inference. We select models based on cost-per-token at the required quality tier for each task. Anthropic's current model lineup doesn't include a nano/micro-tier model competitive on price for these high-volume batch workloads. If that changes, we'll evaluate Claude models the same way we evaluate any other provider.

Model Transition Policy

When upgrading a model version, we run the full evaluation benchmark on both old and new versions and compare aggregate score distributions. A model upgrade is only deployed if the aggregate score shift is within tolerance (<3%). Transition details are published on this page when they occur.

Validation & Accuracy

We publish accuracy metrics for our LLM-dependent pipeline stages so users can assess confidence in the scores.

Pipeline Coverage

As of March 2026, the catalog contains 24,000+ servers discovered across 8 ingestion sources.

| Signal | Coverage | Approx. Servers | Notes |
|---|---|---|---|
| Source file extraction | 83–87% | ~20,000 | Servers where source files are successfully retrieved from GitHub |
| Spec version detection | 67% | ~16,000 | Accuracy of SDK manifest + source marker detection |
| Behavioral security | 83–87% | ~20,000 | Servers receiving a behavioral security score (requires source files) |
| Sandbox probing | ~40% | ~3,200 | Of ~8,000 local-only servers attempted; failures are external deps, OOM, or disk space |

These figures are from our v2 evaluation benchmark (March 2026). If you notice scoring inconsistencies or have suggestions for improving our validation approach, we welcome feedback via GitHub Issues.

Probe Policy

MCP Scoreboard probes remote MCP servers to measure protocol compliance and reliability. We take this responsibility seriously.

What We Probe

  • Fast probes (health checks): Connect, initialize, ping. Read-only. No tool calls. Run on a rolling schedule across the fleet — not every server every 5 minutes.
  • Deep probes (protocol compliance): Tool enumeration, schema validation, error handling tests, fuzz testing, auth discovery. Run daily per server.
  • Invocation smoke tests: Schema-valid tool calls as part of deep probes. We only call tools with read-like semantics (get, list, search, read, fetch, describe). Tools with names suggesting mutation (delete, write, send, create, update, execute) are never called. This is a name-based heuristic, not a security guarantee — we also inspect tool descriptions for mutation semantics, but a tool named fetch_and_purge would pass the name filter. The smoke test is a quality signal, not a safety boundary.
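The name-based filter described above can be sketched as a simple prefix and substring check (the exact keyword lists here are illustrative of the categories named in the text):

```python
READ_PREFIXES = ("get", "list", "search", "read", "fetch", "describe")
MUTATION_HINTS = ("delete", "write", "send", "create", "update", "execute")

def safe_to_smoke_test(tool_name: str) -> bool:
    """Name-based heuristic only -- not a security guarantee.
    A name like 'fetch_and_purge' would still pass this filter."""
    name = tool_name.lower()
    if any(hint in name for hint in MUTATION_HINTS):
        return False
    return name.startswith(READ_PREFIXES)
```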

Rate Limiting & Consent

  • Maximum 1 concurrent connection per server
  • We respect Retry-After headers and back off on rate-limit responses
  • Our probe identifies itself as: MCPScoreboard/1.0 (+https://mcpscoreboard.com/scoreboard/methodology/)

Opt-Out

Server operators who wish to be excluded from probing have two options:

  • Programmatic: Add a .well-known/mcp-scoreboard JSON file to your server's endpoint with {"optout": true}. Our probes check for this before connecting and will skip opted-out servers within one probe cycle.
  • Manual: Contact the project maintainers via GitHub Issues with your server URL. We'll exclude it within 24 hours.

Opted-out servers are excluded from probing only. If your server has a public GitHub repository, static analysis (source code, tool schemas) may still produce a partial score. To remove a server from the scoreboard entirely, include {"optout": true, "delist": true} or mention delisting in your request.
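Parsing the .well-known/mcp-scoreboard document might look like the sketch below (an assumption about how the flags combine: delist only takes effect alongside optout, per the payload shown above):

```python
import json

def optout_status(payload: str) -> tuple[bool, bool]:
    """Parse a .well-known/mcp-scoreboard document into
    (skip_probing, delist) flags. Malformed JSON means no opt-out."""
    try:
        doc = json.loads(payload)
    except (json.JSONDecodeError, TypeError):
        return (False, False)
    if not isinstance(doc, dict):
        return (False, False)
    optout = doc.get("optout") is True
    delist = optout and doc.get("delist") is True
    return (optout, delist)
```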

Disputing a Score

If you believe your server's score is incorrect — for example, a false-positive security flag, a sandbox probe failure due to transient infrastructure issues, or an LLM misclassification — contact the project maintainers via GitHub Issues with your server URL and the specific dimension you're disputing. We'll investigate and respond within 72 hours.