Robots.txt for the Agentic Web

The robots.txt standard was designed in 1994 to answer a simple question: should this crawler index my page?

Thirty-two years later, we're using the same binary mechanism to manage a fundamentally different relationship. AI agents don't just index pages: they read, reason over, extract from, and act on web content. Yet the conversation is still stuck on a single axis: block or allow.

The Current State

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

Website owners face an all-or-nothing choice. Allow everything and lose control, or block everything and become invisible to the fastest-growing discovery channel on the web.

There's no way to say: "Yes, you can read my content, and here's a better way to do it."

The Proposal: SOM Directives

We propose extending robots.txt with directives that let websites advertise semantic representations of their content.

User-agent: *
Allow: /

# Semantic Object Model available
SOM-Endpoint: https://cache.example.com/v1/som
SOM-Format: SOM/1.0
SOM-Scope: main-content
SOM-Freshness: 3600

When an AI agent sees these directives, it no longer has to fetch and parse the full HTML page (often 50,000+ tokens). Instead, it can query the SOM endpoint and get a clean, structured representation (around 3,000 tokens).
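As a minimal sketch of the discovery step, an agent could pull the SOM directives out of a robots.txt body as shown below. The function name `parse_som_directives` is hypothetical, and the code assumes SOM directives apply file-wide (like `Sitemap`) rather than per `User-agent` group, which the proposal text does not specify.

```python
def parse_som_directives(robots_txt: str) -> dict[str, str]:
    """Collect SOM-* directives from a robots.txt body.

    Assumption: directives are treated as file-wide, not scoped to a
    User-agent group. Keys are lowercased for case-insensitive lookup.
    """
    directives: dict[str, str] = {}
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so URL values stay intact.
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key.startswith("som-"):
            directives[key] = value.strip()
    return directives


SAMPLE = """\
User-agent: *
Allow: /

# Semantic Object Model available
SOM-Endpoint: https://cache.example.com/v1/som
SOM-Format: SOM/1.0
SOM-Scope: main-content
SOM-Freshness: 3600
"""

if __name__ == "__main__":
    for name, value in parse_som_directives(SAMPLE).items():
        print(f"{name} = {value}")
```

Standard rules such as `Allow` are skipped here on purpose; an agent would still evaluate them with its regular robots.txt parser before touching the SOM endpoint.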

New Directives

Directive	Description
`SOM-Endpoint`	Base URL of the SOM service. Agents append `?url=` followed by the URL of the target page.
`SOM-Format`	Format of the representation: `SOM/1.0`, `markdown`, `accessibility-tree`
`SOM-Scope`	Content coverage: `full-page`, `main-content`, `article-body`
`SOM-Freshness`	Max age in seconds of a cached SOM (default: 86400)
`SOM-Token-Budget`	Suggested max tokens, helping agents estimate costs before fetching
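The sketch below shows one way an agent might act on these directives: building the request URL and applying the default freshness of 86400 seconds. The helper names are hypothetical, the directives are assumed to be already parsed into a dict with lowercased keys, and percent-encoding the target URL is an assumption; the table only says that agents append `?url=`.

```python
from urllib.parse import urlencode

DEFAULT_FRESHNESS_SECONDS = 86400  # default stated for SOM-Freshness


def build_som_url(endpoint: str, page_url: str) -> str:
    """Append ?url=<target page> to the SOM endpoint.

    Assumption: the target URL is percent-encoded so that its own
    query string does not collide with the endpoint's.
    """
    return f"{endpoint}?{urlencode({'url': page_url})}"


def cached_som_is_fresh(age_seconds: float, directives: dict[str, str]) -> bool:
    """Return True if a cached SOM is within the advertised max age."""
    max_age = int(directives.get("som-freshness", DEFAULT_FRESHNESS_SECONDS))
    return age_seconds <= max_age


def within_token_budget(directives: dict[str, str], agent_limit: int) -> bool:
    """Compare the suggested SOM-Token-Budget with the agent's own limit.

    If the site advertises no budget, the agent cannot estimate the
    cost up front; this sketch chooses to proceed in that case.
    """
    budget = directives.get("som-token-budget")
    return budget is None or int(budget) <= agent_limit


if __name__ == "__main__":
    directives = {
        "som-endpoint": "https://cache.example.com/v1/som",
        "som-freshness": "3600",
    }
    print(build_som_url(directives["som-endpoint"],
                        "https://example.com/blog/post?id=1"))
    print(cached_som_is_fresh(1800, directives))
```

Keeping these checks separate from parsing lets an agent fall back to a normal HTML fetch whenever the endpoint is missing, the cache is stale, or the budget exceeds what it is willing to spend.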

Why This Matters

For website owners: Control without blocking. Direct agents to a representation you control - exclude ads, paywalls, and noise. Include what you want highlighted.

For agent developers: 10-16x fewer tokens, better extraction quality, no headless browser needed. A single HTTP request instead of Chrome + JS execution + DOM parsing.

For the web ecosystem: A cooperative alternative to the current adversarial dynamic where publishers block and agents circumvent.

Relationship to Existing Standards

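SOM directives are purely additive to RFC 9309 (Robots Exclusion Protocol). Existing User-agent, Allow, and Disallow rules, along with widely used non-standard extensions such as Crawl-delay, continue to work unchanged. RFC 9309 allows crawlers to interpret additional records beyond the ones it defines, so agents that don't understand SOM directives simply ignore them.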
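The backward-compatibility claim can be checked against an existing parser. The sketch below feeds a robots.txt containing SOM directives to Python's standard-library `urllib.robotparser`, which knows nothing about SOM. The agent name `ExampleAgent`, the `/private/` rule, and the page URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

# Semantic Object Model available
SOM-Endpoint: https://cache.example.com/v1/som
SOM-Format: SOM/1.0
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body with the stock standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    # The unknown SOM-* lines are skipped; the Disallow rule still applies.
    print(rules.can_fetch("ExampleAgent", "https://example.com/article"))
    print(rules.can_fetch("ExampleAgent", "https://example.com/private/report"))
```

The first check is allowed and the second is blocked, which is the same outcome the file would produce without the SOM lines.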

This complements our Schema.org proposal for WebPageSemanticRepresentation (per-page signaling) and the SOM specification being incubated at the W3C.

Get Involved

This proposal is being developed within the W3C Web Content Browser for AI Agents Community Group.