Deep Scan v1 bet that the right primitive was real agents running at every layer. The bet held. v1.1 is what we built once we saw which questions those agents still weren’t being asked.
Two surfaces moved. Discovery used to lean on static files (sitemap, robots, llms.txt) as the answer to “can agents find you.” That is the easy half. The harder half is whether answer engines pick you when a user asks. v1.1 reshapes Layer 1 into a real AEO/GEO benchmark across five engines.
MCP got the same treatment from the other side. v1 confirmed your endpoint responds and your tools list. v1.1 takes the guidelines Anthropic has been publishing about what good MCP servers look like and turns them into graded checks. Sharp names, plan-ready schemas, real auth metadata, conformant Apps surfaces, bi-directional registry verification. The bar went up, and some scores will drop with it.
From file-exists to system-works
The clearest way to read v1.1 is as a climb up a depth ladder. Every check answers one of four questions: does the thing exist, is it shaped correctly, does it advertise the right contract, does it behave the way an agent needs. v1 mostly reached the first two rungs. v1.1 populates the top two.
- L1 · Presence (v1) - Does the file exist. Does the endpoint return 200. Does the path resolve. Checks: Sitemap exists, llms.txt exists, MCP well-known discovery.
- L2 · Format (v1) - Is the response shape RFC-correct. Is the schema parseable. Are required fields present. Checks: API catalog (RFC 9727), JSON-LD structured data, llms.txt formatting.
- L3 · Contract (v1.1) - Do the names, descriptions, and schemas give an LLM enough to plan with. Do declared capabilities match what gets exposed. Checks: MCP tool naming, MCP tool descriptions, MCP parameter schemas, MCP tool annotations, MCP server-card.json, llms-full.txt quality.
- L4 · Behavior (v1.1) - Do real handshakes complete. Do agents actually pick you. Is the trace consistent with what the surface advertised. Checks: Category share of voice, Knowledge cutoff coverage, Citation quality vs. mention, MCP resource quality, MCP Apps view CSP, Listed in MCP registries.
A v1 check confirmed the file existed. A v1.1 check confirms the system works.
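Mechanically, each rung refuses to grade until the one below it holds. A minimal sketch of that progression for a single surface, in TypeScript - every name here (checkLlmsTxt, RungResult) is illustrative, not Deep Scan's actual code:

```ts
// Hypothetical sketch of the four-rung ladder applied to one surface (llms.txt).
type Rung = "presence" | "format" | "contract" | "behavior";

interface RungResult {
  rung: Rung;
  pass: boolean;
  evidence: string; // every check ships the evidence string that decided it
}

async function checkLlmsTxt(origin: string): Promise<RungResult[]> {
  const results: RungResult[] = [];

  // L1 Presence: does the path resolve with a 200?
  const res = await fetch(new URL("/llms.txt", origin));
  results.push({ rung: "presence", pass: res.ok, evidence: `GET /llms.txt returned ${res.status}` });
  if (!res.ok) return results; // no point grading a file that isn't there

  // L2 Format: is the file shaped the way the convention expects?
  const body = await res.text();
  const hasTitle = /^# \S/m.test(body);
  results.push({
    rung: "format",
    pass: hasTitle,
    evidence: hasTitle ? "title line present" : "missing title line",
  });

  // L3 Contract: do the listed links carry descriptions an agent can plan from?
  const described = body.match(/^- \[[^\]]+\]\([^)]+\):\s*\S/gm) ?? [];
  results.push({
    rung: "contract",
    pass: described.length > 0,
    evidence: `${described.length} links with descriptions`,
  });

  // L4 Behavior: do the advertised links actually resolve? (A real agent run
  // in production; a HEAD probe stands in here.)
  const urls = [...body.matchAll(/\((https?:\/\/[^)]+)\)/g)].map((m) => m[1]).slice(0, 5);
  const live = (
    await Promise.all(urls.map((u) => fetch(u, { method: "HEAD" }).then((r) => r.ok, () => false)))
  ).filter(Boolean).length;
  results.push({
    rung: "behavior",
    pass: urls.length > 0 && live === urls.length,
    evidence: `${live}/${urls.length} advertised links resolve`,
  });

  return results;
}
```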
Discovery is now a real AEO/GEO benchmark
Layer 1 was where the gap was widest. It was named Discovery but most of the points came from foundation files: sitemap, robots, llms.txt, JSON-LD. Those still matter, but they belong to the layer that measures whether an agent understands you, not whether an agent finds you. They moved to Identity. Discovery now answers one question: when a user asks an answer engine in your category, does the answer name you.
[Screenshot: AI search results for “best payment processing API for SaaS” - real results from a Deep Scan of stripe.com, May 13, 2026]
AEO (Answer Engine Optimization) is the live benchmark: do answer engines reach for you when someone describes a need in your category. GEO (Generative Engine Optimization) is the structural one - not just whether frontier models carry your brand in their weights, but whether your content is the kind retrieval systems quote when composing an answer.
Category share of voice handles the live half: a relevant user intent goes out to multiple answer engines and your rank comes back against the competitors that actually surfaced for that intent. Knowledge cutoff coverage handles the structural half: frontier models describe the product from training memory only, no context provided. Being recalled and being cited are different scores. Now you have both.
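The live half reduces to a small computation once the engine answers are in hand: fan one intent out, collect the brands each engine named, come back with rank and field. A sketch under assumed shapes - EngineAnswer and shareOfVoice are illustration, not the scanner's internals:

```ts
// Hypothetical shapes: one answer per engine, brands in order of mention.
interface EngineAnswer {
  engine: string;         // e.g. one of the five answer engines
  rankedBrands: string[]; // brands the answer named, best first
}

function shareOfVoice(answers: EngineAnswer[], brand: string) {
  const norm = (s: string) => s.toLowerCase();
  const ranks = answers
    .map((a) => a.rankedBrands.findIndex((b) => norm(b) === norm(brand)) + 1)
    .filter((r) => r > 0); // 0 means the engine never named you

  return {
    engines: answers.length,
    named: ranks.length, // how many engines reached for you at all
    meanRank: ranks.length ? ranks.reduce((s, r) => s + r, 0) / ranks.length : null,
    // the competitors that actually surfaced for this intent
    field: [...new Set(answers.flatMap((a) => a.rankedBrands))].filter((b) => norm(b) !== norm(brand)),
  };
}
```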
MCP gets graded on the principles, not just the protocol
v1 asked a binary question: do you have an MCP server. v1.1 asks the question the people writing the spec keep returning to: do you have an MCP server an agent can plan with. Sharp scoped names, descriptions that explain when and how to use a tool, schemas an LLM can fill, capability annotations that match reality, conformant Apps surfaces. v1.1 turns that guidance into graded checks across the entire MCP surface.
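What “plan-ready” means is concrete enough to sketch. The tool shape below follows the MCP tools/list response; the thresholds and the gradeTool name are our own illustration, not the published rubric:

```ts
// Tool shape per the MCP tools/list response; grading heuristics are illustrative.
interface Tool {
  name: string;
  description?: string;
  inputSchema: {
    type: "object";
    properties?: Record<string, { description?: string }>;
    required?: string[];
  };
  annotations?: {
    readOnlyHint?: boolean;
    destructiveHint?: boolean;
    idempotentHint?: boolean;
    openWorldHint?: boolean;
  };
}

function gradeTool(tool: Tool): string[] {
  const issues: string[] = [];

  // Sharp, scoped names: no vague catch-alls an LLM can't route to.
  if (!/^[a-z][a-z0-9_]*$/.test(tool.name)) issues.push("name is not a sharp snake_case identifier");

  // Descriptions that explain when and how to use the tool.
  if (!tool.description || tool.description.trim().length < 40) issues.push("description too thin to plan with");

  // Schemas an LLM can fill: every parameter described.
  for (const [key, prop] of Object.entries(tool.inputSchema.properties ?? {})) {
    if (!prop.description) issues.push(`parameter "${key}" has no description`);
  }

  // Capability annotations must exist here; whether they match observed
  // behavior is the runtime half of the check.
  if (!tool.annotations) issues.push("no capability annotations declared");

  return issues; // empty list = full credit on the contract rung
}
```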
The pattern repeats wherever we look. We are not asking whether the path responds; we are asking whether the contract the path advertises holds up under an agent that actually tries to use it. A few examples of the shift:
- mcp-tool-listing · Identity - was: servers with fewer than 3 tools failed automatically. Now: code-mode dispatchers (one search tool plus one command/code/SQL executor) earn full credit; granularity expressed through executable inputs is recognized.
- mcp-tool-annotations · Identity - was: no check. Now: capability annotations (read-only, destructive, idempotent, open-world) are required to match observed behavior.
- mcp-registry-listed · Discovery - was: a lexical name match in any MCP registry passed. Now: a registry entry must verify bi-directionally - the entry resolves back to the product, the product links to the canonical entry, or the registry curates the verification on its side (see the sketch after this list).
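Bi-directional verification is easy to state in code. A minimal sketch, assuming a simplified RegistryEntry shape - real registries expose this differently:

```ts
// Hypothetical registry entry; field names are assumptions, not a real registry's API.
interface RegistryEntry {
  name: string;
  homepage?: string;  // where the entry claims to point
  endpoint?: string;  // the MCP endpoint the entry lists
  verified?: boolean; // curated verification on the registry side
}

const sameHost = (a: string, b: string) =>
  new URL(a).hostname.replace(/^www\./, "") === new URL(b).hostname.replace(/^www\./, "");

async function registryVerifies(entry: RegistryEntry, productOrigin: string, entryUrl: string): Promise<boolean> {
  // Path 1: the entry resolves back to the product.
  if (entry.homepage && sameHost(entry.homepage, productOrigin)) return true;
  if (entry.endpoint && sameHost(entry.endpoint, productOrigin)) return true;

  // Path 2: the product links to the canonical entry.
  const home = await fetch(productOrigin).then((r) => r.text());
  if (home.includes(entryUrl)) return true;

  // Path 3: curated verification on the registry side.
  // Note: a lexical match on entry.name alone never reaches a `return true`.
  return entry.verified === true;
}
```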
The MCP surface itself got a more careful read. Most brands ship more than one MCP - an authenticated product MCP at api.example.com/mcp for actions, a public docs MCP at developers.example.com/mcp for search. v1 folded both into a single winner and graded them on one rubric; docs servers got docked for having too few tools, product servers coasted on bare reachability. v1.1 discovers sibling MCPs, classifies each as product or docs, and grades each against the rubric that fits. A new bonus rewards brands that ship both.
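How the classifier might draw the product/docs line, sketched as a verb heuristic over tool names - the real rubric is richer than this:

```ts
// Illustrative heuristic only: mutating verbs suggest a product MCP,
// read/search verbs suggest a docs MCP.
type McpKind = "product" | "docs";

function classifyMcp(toolNames: string[]): McpKind {
  const actionVerb = /^(create|update|delete|cancel|refund|send|execute|run)_/;
  const actions = toolNames.filter((n) => actionVerb.test(n)).length;
  return actions > 0 ? "product" : "docs";
}

// classifyMcp(["create_payment", "refund_charge"]) -> "product"
// classifyMcp(["search_docs", "get_page"])         -> "docs"
```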
The rest is plumbing. The MCP App surface (servers that render UI inside the host chat) gets its own depth checks. Domain scans now run the MCP checks too when the scanner finds a co-hosted endpoint. And when an MCP handshake returns 401 or 403, the score page no longer pretends to grade what it can’t see - it labels the server authenticated and renders the introspection checks as N/A.
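The auth-gated path, sketched: a simplified initialize handshake against a streamable HTTP endpoint, with a 401 or 403 short-circuiting to “authenticated” and introspection rendered N/A. The probeMcp name and CheckState shape are illustrative:

```ts
type CheckState = "pass" | "fail" | "na";

async function probeMcp(endpoint: string): Promise<{ label: string; introspection: CheckState }> {
  // Minimal MCP initialize request over streamable HTTP.
  const res = await fetch(endpoint, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      accept: "application/json, text/event-stream",
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2025-03-26",
        capabilities: {},
        clientInfo: { name: "scanner", version: "0.0.1" },
      },
    }),
  });

  if (res.status === 401 || res.status === 403) {
    // Don't pretend to grade what we can't see.
    return { label: "authenticated", introspection: "na" };
  }
  return { label: "public", introspection: res.ok ? "pass" : "fail" };
}
```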
The question shifted from “do you have an MCP” to “do you have an MCP an agent can plan with.”
Hardening, almost everywhere else
The other half of v1.1 is depth. Confirming that an endpoint exists was never enough - v1.1 evaluates the protocol quality behind it. Does the schema hold up. Do the declared capabilities match observed behavior. Is the contract tight enough for an agent to plan from. Every check that used to stop at presence now goes one rung further: from existence to format, from format to contract, from contract to behavior. Fewer false positives, more signal per check.
Yes, some scores will drop
A more demanding benchmark is a more honest benchmark. v1.1 catches signals v1 missed and rejects signals v1 was too quick to credit. We owe anyone reading their scorecard a clear answer on where the math moved.
Likely to drop: things that passed on lexical resemblance or weak proxies. Knowledge cutoff coverage scores that relied on name similarity alone, or that were quietly being handed the answer. Listed in MCP registries without a verifiable link back to the product. Payment-protocol claims that depended on CDN defaults. Warning-tier credit that used to match a clean pass.
Likely to rise: products that were doing the thoughtful thing and getting penalized for it. MCP servers that put granularity inside executable inputs instead of in a long flat tool list. Domain scans that ship a co-hosted MCP endpoint and now pick up credit for that surface. MCP Apps that locked down their view origin and shipped a real Content Security Policy.
The movement is uneven. Most sites shift a few points either way. The bigger movement is at the tails, where the scorecard is now telling a story the old one couldn’t.
What’s next, and where to look
Deep Scan ships in versions because the answer keeps changing. v1.0 bet that real agents at every layer was the right primitive. v1.1 is the depth check on what those agents have been finding. v1.2 will go after the layers we haven’t rebuilt yet - User Experience next, then deeper on auth and payment rails as the standards firm up. The MCP work is not done either.
A score from a few months ago is not your score today. Rescan periodically. Treat the number as a signal of where you stand on this version of this benchmark, not as a static badge. Every check still maps to the same AgentReady requirement; what changes between releases is the evidence we accept and the credit we give for it. The standard is open at agentready.org.
Run a fresh scan at /#scan. The full check catalog is at /methodology and the Discovery frame is at /aeo. If your number moved and you want to know why, every check ships with the evidence string that decided it.