Most agent-readiness tools answer one question: does your site serve the right well-known files? It is a useful question, and a much smaller one than the question that actually matters.
A few free scanners (Cloudflare’s isitagentready.com is one) hit a domain, check a handful of well-known endpoints, and tell you which ones came back with a 200. That is a fine litmus test, and we run all of those checks inside Deep Scan as well.
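For scale, that static half is a few dozen lines of code. Here is a rough sketch of the kind of check those scanners run, in Python; the path list is illustrative only, not any particular tool's catalog or the AgentReady one:

```python
import requests

# Illustrative path list only; real scanners and the AgentReady catalog
# define their own.
WELL_KNOWN_PATHS = [
    "/robots.txt",
    "/sitemap.xml",
    "/llms.txt",
    "/.well-known/oauth-authorization-server",
]

def static_scan(domain: str) -> dict[str, bool]:
    """Report which well-known paths answer with a 200."""
    results = {}
    for path in WELL_KNOWN_PATHS:
        try:
            resp = requests.get(f"https://{domain}{path}", timeout=10)
            results[path] = resp.status_code == 200
        except requests.RequestException:
            results[path] = False
    return results

print(static_scan("example.com"))
```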
What a checklist cannot tell you is whether an agent will actually call you. Whether your OAuth flow completes from a real client. Whether your MCP tool descriptions are sharp enough for an LLM to plan with. Whether a user can finish a real task, end-to-end, through your product. Deep Scan was built for the bigger question, and today we’re walking through how it works.
Standards-compliant is not the same as agent-ready
The well-known files a static scanner checks are the connective tissue of the agentic web: robots.txt, sitemap, llms.txt, OAuth metadata, MCP server card, A2A agent card, Web Bot Auth, API catalog, agent-skills, x402, ACP, UCP. We respect that work. Every one of those files lives inside our catalog and contributes to the score.
Shipping the file is the cheap half of the problem. A site can serve a perfectly valid /.well-known/oauth-authorization-server while the actual login flow rejects PKCE S256, mints tokens with the wrong audience, or hands back a refresh token that instantly fails. A site can publish an MCP server card while its Streamable HTTP transport drops the connection 30 seconds in. A site can put llms.txt at the right path while the content reads like marketing copy that no agent can act on.
A checklist confirms a file exists. A benchmark confirms the system works.
That distinction is the entire reason Deep Scan exists. Across the thousands of products we have scanned, the finding has the same shape every time: the gap between “ships an MCP endpoint” and “ships an MCP endpoint a real agent can use” is the gap that decides whether you get picked. A static scanner cannot see that gap. Deep Scan is built to.
What one scan actually does
A scan is a pipeline, not a request. From the moment you submit a domain, Deep Scan walks the 5 layers AgentReady defines, in order, and at each one it does the work an agent would have to do for real.
- Discovery, both halves. The implicit half dispatches customer-style queries (“best CRM”, “API for transcribing audio”) through ChatGPT, Claude, Gemini, Perplexity, and ora-agent, and records whether you show up, with what positioning, against which competitors. The explicit half walks the well-known surface end-to-end. Static scanners stop at the explicit half. For us this is the cheap layer.
- Identity, what an agent thinks you are. An LLM reads your product cold (llms.txt, agent-skills, MCP descriptions, the homepage) and tries to summarize what you do, who you serve, and what actions you expose. We score the gap between that summary and what your product actually does. A perfectly valid llms.txt can still flunk this layer if the content reads like a brochure no agent can act on.
- Auth and access, executed, not parsed. We run the real OAuth dance with PKCE S256, validate audience and scopes, walk the consent step, and exercise the refresh path; the PKCE piece is sketched just after this list. If your IdP mints a token your MCP server then rejects, the mismatch lands on the scorecard. Web Bot Auth is verified the same way: signed request issued, response graded.
- Integration, the live handshakes. A real MCP Streamable HTTP session opens, tools are listed, each is called with sane arguments, and the trace is captured; a minimal version of that handshake is also sketched below. If you advertise x402, the scan walks the 402 + retry handshake. If you advertise A2A, the scan negotiates against your agent card. Streaming, webhooks, and function calling all get the same treatment.
- User experience, the multi-turn part. Each product gets a small set of realistic goals (“book a demo”, “create a ticket”, “complete a checkout”) dispatched through ChatGPT, Claude, and ora-agent as multi-turn conversations. Scoring is goal-success, not UI-presence. This is the most expensive part of the scan, and the part with the most signal.
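Two of those layers are worth making concrete. First, the PKCE S256 piece of the auth layer: generate a verifier, derive the challenge, and confirm that the advertised metadata actually claims S256 before the flow even starts. This is a hedged sketch of that shape, not Deep Scan's client; the executed check goes on to drive the redirect, consent step, token exchange, and refresh against a live client.

```python
import base64
import hashlib
import secrets

import requests

def make_pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge (RFC 7636)."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

def metadata_claims_s256(domain: str) -> bool:
    """Serving valid metadata is the cheap half; it also has to advertise S256."""
    meta = requests.get(
        f"https://{domain}/.well-known/oauth-authorization-server", timeout=10
    ).json()
    return "S256" in meta.get("code_challenge_methods_supported", [])

# The executed check then runs the authorization redirect, consent step,
# token exchange, and refresh with these values.
verifier, challenge = make_pkce_pair()
```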
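Second, the first handshake of the integration layer: open an MCP session and list tools over JSON-RPC. Again a sketch, assuming a hypothetical /mcp endpoint and plain JSON responses; a production client also sends the notifications/initialized message, echoes the Mcp-Session-Id header, and handles SSE-streamed responses.

```python
import requests

def list_mcp_tools(endpoint: str) -> list[dict]:
    """Minimal MCP Streamable HTTP probe: initialize, then tools/list.
    Simplified: assumes plain JSON responses; a real client also sends the
    notifications/initialized message, echoes the Mcp-Session-Id header,
    and handles SSE streaming."""
    headers = {"Accept": "application/json, text/event-stream"}

    def rpc(method: str, params: dict, msg_id: int) -> dict:
        body = {"jsonrpc": "2.0", "id": msg_id, "method": method, "params": params}
        resp = requests.post(endpoint, json=body, headers=headers, timeout=30)
        resp.raise_for_status()
        return resp.json()

    rpc("initialize", {
        "protocolVersion": "2025-03-26",
        "capabilities": {},
        "clientInfo": {"name": "scan-sketch", "version": "0.0.1"},
    }, 1)
    listed = rpc("tools/list", {}, 2)
    return listed.get("result", {}).get("tools", [])

# Hypothetical endpoint path; real servers advertise their own.
print(list_mcp_tools("https://example.com/mcp"))
```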
A single scan runs hundreds of LLM calls, several live OAuth flows, a handful of payment handshakes, and a set of multi-turn task completions. Every layer tolerates things going wrong (rate limits, dropped MCP sessions, OAuth throttling, upstream timeouts) without silently passing or failing a check. If we couldn’t actually run something, the scorecard says so. Most of the engineering went into the pieces that don’t show up on the score.
How we know what we’re seeing
Spawning agents is the easy part. Knowing what they did and what it means is the part that took the most engineering. A few pieces of machinery sit on top of every scan.
- Trace-level grading. Every agent session leaves a structured trace: which tool was called, with what arguments, in what order, what came back. Checks are graded against the trace, not the final string the agent typed; a check passes only if the right thing happened, not if the agent claimed it did (a minimal sketch follows this list). Security teams use the same shape of grading to red-team LLMs for jailbreaks; we use it to ask whether a real agent can use the product at all.
- LLM-as-judge per check. An LLM reads each result in product context and marks it N/A if the check is genuinely irrelevant: agent payment rails for a free open-source library, A2A for a product with no other agents to talk to. N/A drops out of the denominator, so a site is never penalized for missing a protocol it had no reason to ship.
- Multi-agent crawler identities. The explicit-discovery phase is replayed under 6 crawler identities (ChatGPT-User, ClaudeBot, Google-Extended, DeepSeek, ora-agent, OpenClaw). A site that scores 90 against ChatGPT-User and 30 against OpenClaw is not an agent-ready site; it is a site that recognizes one brand. The delta is its own line item.
- Many phrasings of the same intent. Every customer query gets asked many ways across a category-aware taxonomy. The score is the median across them, not whichever phrasing landed best (that rule is sketched after this list too).
- Layers have to agree with each other. When two parts of your stack should match (the MCP server card lists a tool, the live handshake should expose it; llms.txt advertises an action, the OAuth flow should support it) we score the disagreement. A product that advertises more than it can deliver shows up that way.
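To make trace-level grading concrete: a check passes only if the expected tool call appears in the structured trace with acceptable arguments, no matter what the agent typed at the end. The trace shape and check below are hypothetical, not Deep Scan's internal schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any

@dataclass
class Trace:
    calls: list[ToolCall] = field(default_factory=list)
    final_message: str = ""

def grade_tool_call(trace: Trace, tool: str,
                    args_ok: Callable[[dict[str, Any]], bool]) -> bool:
    """Pass only if the tool was actually called with acceptable arguments;
    the agent claiming success in its final message is not enough."""
    return any(c.name == tool and args_ok(c.arguments) for c in trace.calls)

# The agent says it created a ticket, but the trace has no such call: fail.
trace = Trace(calls=[], final_message="Done! I created the ticket for you.")
assert not grade_tool_call(trace, "create_ticket", lambda args: "title" in args)
```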
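And the phrasing rule is small enough to state in code. A sketch, assuming each phrasing already carries a 0-to-1 presence score:

```python
from statistics import median

def intent_score(scores_by_phrasing: dict[str, float]) -> float:
    """Score one customer intent as the median across its phrasings,
    so a single lucky phrasing cannot carry the result."""
    return median(scores_by_phrasing.values())

print(intent_score({
    "best CRM": 0.9,
    "CRM for small sales teams": 0.4,
    "which CRM should my startup use": 0.5,
}))  # 0.5, not the 0.9 the best phrasing landed
```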
Reproducibility matters. The same scan run twice converges within a couple of points, and a held-out set of sites with known scores catches drift after every pipeline change. That is what lets the same number land on a procurement form, a leaderboard, and an AgentReady conformance report without anyone having to ask whether the runs agree.
Mapped 1:1 to AgentReady
Deep Scan is not a separate framework. It is the reference implementation of the AgentReady standard. Every check maps to a layer and weight defined by the spec, and the scoring math (5 layers, 110 checks, 0-100 score, letter grade, N/A exclusion) is identical to the v1 draft. We wrote about why the standard was overdue last month; this is the implementation underneath it.
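In code, that scoring shape looks roughly like this: per-layer pass rates over applicable checks only (N/A leaves the denominator), a weighted roll-up to 0-100, and a letter-grade cut. The weights and grade boundaries below are placeholders for illustration; the normative values live in the v1 draft at agentready.org.

```python
def layer_score(results: list[str]) -> float | None:
    """Pass rate over applicable checks; 'na' results leave the denominator."""
    applicable = [r for r in results if r != "na"]
    if not applicable:
        return None  # whole layer inapplicable for this product
    return 100 * sum(r == "pass" for r in applicable) / len(applicable)

def overall(layers: dict[str, list[str]], weights: dict[str, float]) -> tuple[float, str]:
    """Weighted roll-up of applicable layers to a 0-100 score and a grade."""
    live = {k: s for k, s in ((k, layer_score(v)) for k, v in layers.items()) if s is not None}
    score = sum(weights[k] * s for k, s in live.items()) / sum(weights[k] for k in live)
    # Placeholder grade boundaries, not the spec's.
    grade = ("A" if score >= 90 else "B" if score >= 75 else
             "C" if score >= 60 else "D" if score >= 40 else "F")
    return round(score, 1), grade

# Illustrative weights and results only.
print(overall(
    {"discovery": ["pass", "pass", "fail"],
     "identity": ["pass", "na"],
     "auth": ["pass", "fail", "fail"],
     "integration": ["pass", "pass", "pass", "na"],
     "ux": ["pass", "fail"]},
    {"discovery": 0.2, "identity": 0.15, "auth": 0.2, "integration": 0.25, "ux": 0.2},
))  # (70.0, 'C') under these placeholder numbers
```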
That mapping matters because the score is portable. Another implementation that runs the same conformance suite converges on the same number, the way two conformant compilers agree on what a program means. Methodology updates are not ours to make unilaterally either. The spec is versioned and governed in the open at agentready.org.
When to use which
If the question is “do my well-known files exist and respond”, run a static scanner. They are fast and answer that question well.
If the question is whether an agent picks your product (whether your MCP server hands back tools an LLM can plan with, whether your OAuth flow completes from a real client, whether a user can finish a task end-to-end, whether an unfamiliar agent can even read your site) that is the question Deep Scan was built for.
Both flavors live inside the same framework. The static checks a basic scanner runs are a strict subset of the AgentReady spec and account for a small share of the score. The rest is the work that decides whether agents move from reading your pages to actually using your product.
Run a scan at /#scan. The full check catalog and scoring math are at /methodology. The standard, its conformance suite, and the governance model are at agentready.org. If you maintain a protocol or ship agent infrastructure, we want your comments on the v1 draft.