Deep Scan v1 bet that the right primitive was real agents running at every layer. The bet held. v1.1 is what we built once we saw which questions those agents still weren’t being asked.
Two surfaces moved. Discovery used to lean on static files (sitemap, robots, llms.txt) as the answer to “can agents find you.” That is the easy half. The harder half is whether answer engines pick you when a user asks. v1.1 reshapes Layer 1 into a real AEO/GEO benchmark across five engines.
MCP got the same treatment from the other side. v1 confirmed your endpoint responds and your tools list. v1.1 takes the guidelines Anthropic has been publishing about what good MCP servers look like and turns them into graded checks. Sharp names, plan-ready schemas, real auth metadata, conformant Apps surfaces, bi-directional registry verification. The bar went up, and some scores will drop with it.
From file-exists to system-works
The clearest way to read v1.1 is as a climb up a depth ladder. Every check answers one of four questions: does the thing exist, is it shaped correctly, does it advertise the right contract, does it behave the way an agent needs. v1 mostly reached the first two rungs. v1.1 populates the top two.
- L1 · Presence (v1) - Does the file exist. Does the endpoint return 200. Does the path resolve. Checks: Sitemap exists, llms.txt exists, MCP well-known discovery.
- L2 · Format (v1) - Is the response shape RFC-correct. Is the schema parseable. Are required fields present. Checks: API catalog (RFC 9727), JSON-LD structured data, llms.txt formatting.
- L3 · Contract (v1.1) - Do the names, descriptions, and schemas give an LLM enough to plan with. Do declared capabilities match what gets exposed. Checks: MCP tool naming, MCP tool descriptions, MCP parameter schemas, MCP tool annotations, MCP server-card.json, llms-full.txt quality.
- L4 · Behavior (v1.1) - Do real handshakes complete. Do agents actually pick you. Is the trace consistent with what the surface advertised. Checks: Category share of voice, Knowledge cutoff coverage, Citation quality vs. mention, MCP resource quality, MCP Apps view CSP, Listed in MCP registries.
A v1 check confirmed the file existed. A v1.1 check confirms the system works.
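Mechanically, each rung refuses to grade until the one below it holds. A minimal sketch of that progression for a single surface, in TypeScript - every name here (checkLlmsTxt, RungResult) is illustrative, not Deep Scan's actual code:

```ts
// Hypothetical sketch of the four-rung ladder applied to one surface (llms.txt).
type Rung = "presence" | "format" | "contract" | "behavior";

interface RungResult {
  rung: Rung;
  pass: boolean;
  evidence: string; // every check ships the evidence string that decided it
}

async function checkLlmsTxt(origin: string): Promise<RungResult[]> {
  const results: RungResult[] = [];

  // L1 Presence: does the path resolve with a 200?
  const res = await fetch(new URL("/llms.txt", origin));
  results.push({ rung: "presence", pass: res.ok, evidence: `GET /llms.txt returned ${res.status}` });
  if (!res.ok) return results; // no point grading a file that isn't there

  // L2 Format: is the file shaped the way the convention expects?
  const body = await res.text();
  const hasTitle = /^# \S/m.test(body);
  results.push({
    rung: "format",
    pass: hasTitle,
    evidence: hasTitle ? "title line present" : "missing title line",
  });

  // L3 Contract: do the listed links carry descriptions an agent can plan from?
  const described = body.match(/^- \[[^\]]+\]\([^)]+\):\s*\S/gm) ?? [];
  results.push({
    rung: "contract",
    pass: described.length > 0,
    evidence: `${described.length} links with descriptions`,
  });

  // L4 Behavior: do the advertised links actually resolve? (A real agent run
  // in production; a HEAD probe stands in here.)
  const urls = [...body.matchAll(/\((https?:\/\/[^)]+)\)/g)].map((m) => m[1]).slice(0, 5);
  const live = (
    await Promise.all(urls.map((u) => fetch(u, { method: "HEAD" }).then((r) => r.ok, () => false)))
  ).filter(Boolean).length;
  results.push({
    rung: "behavior",
    pass: urls.length > 0 && live === urls.length,
    evidence: `${live}/${urls.length} advertised links resolve`,
  });

  return results;
}
```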
Discovery is now a real AEO/GEO benchmark
Layer 1 was where the gap was widest. It was named Discovery but most of the points came from foundation files: sitemap, robots, llms.txt, JSON-LD. Those still matter, but they belong to the layer that measures whether an agent understands you, not whether an agent finds you. They moved to Identity. Discovery now answers one question: when a user asks an answer engine in your category, does the answer name you.
[Screenshot: AI search results for “best payment processing API for SaaS” - real results from a Deep Scan of stripe.com, May 13, 2026]
AEO (Answer Engine Optimization) is the live benchmark: do answer engines reach for you when someone describes a need in your category. GEO (Generative Engine Optimization) is the structural one - not just whether frontier models carry your brand in their weights, but whether your content is the kind retrieval systems quote when composing an answer.
Category share of voice handles the live half: a relevant user intent goes out to multiple answer engines and your rank comes back against the competitors that actually surfaced for that intent. Knowledge cutoff coverage handles the structural half: frontier models describe the product from training memory only, no context provided. Being recalled and being cited are different scores. Now you have both.
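The live half reduces to a small computation once the engine answers are in hand: fan one intent out, collect the brands each engine named, come back with rank and field. A sketch under assumed shapes - EngineAnswer and shareOfVoice are illustration, not the scanner's internals:

```ts
// Hypothetical shapes: one answer per engine, brands in order of mention.
interface EngineAnswer {
  engine: string;         // e.g. one of the five answer engines
  rankedBrands: string[]; // brands the answer named, best first
}

function shareOfVoice(answers: EngineAnswer[], brand: string) {
  const norm = (s: string) => s.toLowerCase();
  const ranks = answers
    .map((a) => a.rankedBrands.findIndex((b) => norm(b) === norm(brand)) + 1)
    .filter((r) => r > 0); // 0 means the engine never named you

  return {
    engines: answers.length,
    named: ranks.length, // how many engines reached for you at all
    meanRank: ranks.length ? ranks.reduce((s, r) => s + r, 0) / ranks.length : null,
    // the competitors that actually surfaced for this intent
    field: [...new Set(answers.flatMap((a) => a.rankedBrands))].filter((b) => norm(b) !== norm(brand)),
  };
}
```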
MCP gets graded on the principles, not just the protocol
v1 asked a binary question: do you have an MCP server. v1.1 asks the question the people writing the spec keep returning to: do you have an MCP server an agent can plan with. Sharp scoped names, descriptions that explain when and how to use a tool, schemas an LLM can fill, capability annotations that match reality, conformant Apps surfaces. v1.1 turns that guidance into graded checks across the entire MCP surface.
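What “plan-ready” means is concrete enough to sketch. The tool shape below follows the MCP tools/list response; the thresholds and the gradeTool name are our own illustration, not the published rubric:

```ts
// Tool shape per the MCP tools/list response; grading heuristics are illustrative.
interface Tool {
  name: string;
  description?: string;
  inputSchema: {
    type: "object";
    properties?: Record<string, { description?: string }>;
    required?: string[];
  };
  annotations?: {
    readOnlyHint?: boolean;
    destructiveHint?: boolean;
    idempotentHint?: boolean;
    openWorldHint?: boolean;
  };
}

function gradeTool(tool: Tool): string[] {
  const issues: string[] = [];

  // Sharp, scoped names: no vague catch-alls an LLM can't route to.
  if (!/^[a-z][a-z0-9_]*$/.test(tool.name)) issues.push("name is not a sharp snake_case identifier");

  // Descriptions that explain when and how to use the tool.
  if (!tool.description || tool.description.trim().length < 40) issues.push("description too thin to plan with");

  // Schemas an LLM can fill: every parameter described.
  for (const [key, prop] of Object.entries(tool.inputSchema.properties ?? {})) {
    if (!prop.description) issues.push(`parameter "${key}" has no description`);
  }

  // Capability annotations must exist here; whether they match observed
  // behavior is the runtime half of the check.
  if (!tool.annotations) issues.push("no capability annotations declared");

  return issues; // empty list = full credit on the contract rung
}
```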
The pattern repeats wherever we look. We are not asking whether the path responds; we are asking whether the contract the path advertises holds up under an agent that actually tries to use it. A few examples of the shift:
- mcp-tool-listing · Identity - was: servers with fewer than 3 tools failed automatically. Now: code-mode dispatchers (one search tool plus one command/code/SQL executor) earn full credit; granularity expressed through executable inputs is recognized.
- mcp-tool-annotations · Identity - was: no check. Now: capability annotations (read-only, destructive, idempotent, open-world) are required to match observed behavior.
- mcp-registry-listed · Discovery - was: a lexical name match in any MCP registry passed. Now: a registry entry must verify bi-directionally - the entry resolves back to the product, the product links to the canonical entry, or the registry curates the verification on its side (see the sketch after this list).
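Bi-directional verification is easy to state in code. A minimal sketch, assuming a simplified RegistryEntry shape - real registries expose this differently:

```ts
// Hypothetical registry entry; field names are assumptions, not a real registry's API.
interface RegistryEntry {
  name: string;
  homepage?: string;  // where the entry claims to point
  endpoint?: string;  // the MCP endpoint the entry lists
  verified?: boolean; // curated verification on the registry side
}

const sameHost = (a: string, b: string) =>
  new URL(a).hostname.replace(/^www\./, "") === new URL(b).hostname.replace(/^www\./, "");

async function registryVerifies(entry: RegistryEntry, productOrigin: string, entryUrl: string): Promise<boolean> {
  // Path 1: the entry resolves back to the product.
  if (entry.homepage && sameHost(entry.homepage, productOrigin)) return true;
  if (entry.endpoint && sameHost(entry.endpoint, productOrigin)) return true;

  // Path 2: the product links to the canonical entry.
  const home = await fetch(productOrigin).then((r) => r.text());
  if (home.includes(entryUrl)) return true;

  // Path 3: curated verification on the registry side.
  // Note: a lexical match on entry.name alone never reaches a `return true`.
  return entry.verified === true;
}
```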
The MCP surface itself got a more careful read. Most brands ship more than one MCP - an authenticated product MCP at api.example.com/mcp for actions, a public docs MCP at developers.example.com/mcp for search. v1 folded both into a single winner and graded them on one rubric; docs servers got docked for having too few tools, product servers coasted on bare reachability. v1.1 discovers sibling MCPs, classifies each as product or docs, and grades each against the rubric that fits. A new bonus rewards brands that ship both.
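How the classifier might draw the product/docs line, sketched as a verb heuristic over tool names - the real rubric is richer than this:

```ts
// Illustrative heuristic only: mutating verbs suggest a product MCP,
// read/search verbs suggest a docs MCP.
type McpKind = "product" | "docs";

function classifyMcp(toolNames: string[]): McpKind {
  const actionVerb = /^(create|update|delete|cancel|refund|send|execute|run)_/;
  const actions = toolNames.filter((n) => actionVerb.test(n)).length;
  return actions > 0 ? "product" : "docs";
}

// classifyMcp(["create_payment", "refund_charge"]) -> "product"
// classifyMcp(["search_docs", "get_page"])         -> "docs"
```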
The rest is plumbing. The MCP App surface (servers that render UI inside the host chat) gets its own depth checks. Domain scans now run the MCP checks too when the scanner finds a co-hosted endpoint. And when an MCP handshake returns 401 or 403, the score page no longer pretends to grade what it can’t see - it labels the server authenticated and renders the introspection checks as N/A.
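The auth-gated path, sketched: a simplified initialize handshake against a streamable HTTP endpoint, with a 401 or 403 short-circuiting to “authenticated” and introspection rendered N/A. The probeMcp name and CheckState shape are illustrative:

```ts
type CheckState = "pass" | "fail" | "na";

async function probeMcp(endpoint: string): Promise<{ label: string; introspection: CheckState }> {
  // Minimal MCP initialize request over streamable HTTP.
  const res = await fetch(endpoint, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      accept: "application/json, text/event-stream",
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2025-03-26",
        capabilities: {},
        clientInfo: { name: "scanner", version: "0.0.1" },
      },
    }),
  });

  if (res.status === 401 || res.status === 403) {
    // Don't pretend to grade what we can't see.
    return { label: "authenticated", introspection: "na" };
  }
  return { label: "public", introspection: res.ok ? "pass" : "fail" };
}
```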
The question shifted from “do you have an MCP” to “do you have an MCP an agent can plan with.”
Hardening, almost everywhere else
The other half of v1.1 is depth. Confirming that an endpoint exists was never enough - v1.1 evaluates the protocol quality behind it. Does the schema hold up. Do the declared capabilities match observed behavior. Is the contract tight enough for an agent to plan from. Every check that used to stop at presence now goes one rung further: from existence to format, from format to contract, from contract to behavior. Fewer false positives, more signal per check.
Yes, some scores will drop
A more demanding benchmark is a more honest benchmark. v1.1 catches signals v1 missed and rejects signals v1 was too quick to credit. We owe anyone reading their scorecard a clear answer on where the math moved.
Likely to drop: things that passed on lexical resemblance or weak proxies. Knowledge cutoff coverage scores that relied on name similarity alone, or that were quietly being handed the answer. Listed in MCP registries without a verifiable link back to the product. Payment-protocol claims that depended on CDN defaults. Warning-tier credit that used to match a clean pass.
Likely to rise: products that were doing the thoughtful thing and getting penalized for it. MCP servers that put granularity inside executable inputs instead of in a long flat tool list. Domain scans that ship a co-hosted MCP endpoint and now pick up credit for that surface. MCP Apps that locked down their view origin and shipped a real Content Security Policy.
The movement is uneven. Most sites shift a few points either way. The bigger movement is at the tails, where the scorecard is now telling a story the old one couldn’t.
What’s next, and where to look
Deep Scan ships in versions because the answer keeps changing. v1.0 bet that real agents at every layer was the right primitive. v1.1 is the depth check on what those agents have been finding. v1.2 will go after the layers we haven’t rebuilt yet - User Experience next, then deeper on auth and payment rails as the standards firm up. The MCP work is not done either.
A score from a few months ago is not your score today. Rescan periodically. Treat the number as a signal of where you stand on this version of this benchmark, not as a static badge. Every check still maps to the same AgentReady requirement; what changes between releases is the evidence we accept and the credit we give for it. The standard is open at agentready.org.
Run a fresh scan at /#scan. The full check catalog is at /methodology and the Discovery frame is at /aeo. If your number moved and you want to know why, every check ships with the evidence string that decided it.