Why SOM: The Case for a Semantic Web Format for AI Agents
The web was built for humans looking at pixels. AI agents don't need pixels. They need meaning.
Every day, millions of agent API calls send raw HTML to language models, paying for CSS classes, script tags, tracking pixels, and layout divs that carry zero semantic value. SOM fixes this.
The Problem
A typical web page weighs 300-500KB of HTML. Between 80% and 95% of that markup is presentation: class names, inline styles, script blocks, SVG paths, tracking pixels, and deeply nested layout divs. None of it carries meaning.
But when an AI agent reads a web page, all of that noise goes straight into the context window. And context windows cost money.
Here's the deeper issue: the DOM is a rendering tree, not a meaning tree. It tells you WHERE things go on screen, not WHAT things are. A <div> with twelve CSS classes might be a navigation link, a button, a heading, or a decorative container. The DOM doesn't know and doesn't care. It was designed to paint pixels, not convey semantics.
AI agents deserve better input than a rendering tree with the renderer removed.
What SOM Is
SOM (Semantic Object Model) is a structured JSON representation of web content designed for machine consumption. It takes the meaningful content of a web page and expresses it in a format that LLMs can process efficiently.
Instead of this:
<div class="sc-1234 flex items-center gap-2">
  <a href="/about" class="text-blue-500 hover:underline font-medium tracking-tight">About</a>
</div>
SOM gives you this:
{
  "role": "link",
  "text": "About",
  "attrs": { "href": "/about" },
  "actions": ["click"]
}
Same information. Fraction of the tokens. And the agent actually knows it's a clickable link.
Key Properties
- Semantic roles (link, button, heading, paragraph, form, input) instead of div/span/a
- Actionable attributes only (href, value, placeholder) instead of class, style, data-*
- Region-based structure (navigation, content, form, footer) instead of arbitrary nesting
- Explicit interactivity: every interactive element is marked with its available actions (click, type, select)
- Structured data extraction: JSON-LD, OpenGraph, and meta tags normalized into a clean object
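To make these properties concrete, here is a minimal sketch of how a SOM element and a region could be modeled in Python. The `role`, `text`, `attrs`, and `actions` fields come from the example above; the `Region` class, its `kind` and `elements` fields, and the `interactive` helper are illustrative assumptions, not names taken from the SOM spec.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class SomNode:
    role: str                                     # semantic role, e.g. "link", "button"
    text: str = ""                                # visible text content
    attrs: dict = field(default_factory=dict)     # actionable attributes only (href, value, placeholder)
    actions: list = field(default_factory=list)   # e.g. ["click"], ["type"], ["select"]


@dataclass
class Region:
    # Hypothetical container: the spec may name or shape regions differently.
    kind: str                                     # "navigation", "content", "form", "footer"
    elements: list = field(default_factory=list)


def interactive(region):
    """Return the elements in a region that expose at least one action."""
    return [el for el in region.elements if el.actions]


nav = Region(kind="navigation", elements=[
    SomNode(role="link", text="About", attrs={"href": "/about"}, actions=["click"]),
    SomNode(role="heading", text="Site menu"),
])

print(json.dumps(asdict(nav), indent=2))
print([el.text for el in interactive(nav)])
```

Because interactivity is an explicit field rather than something inferred from tag names or CSS, finding what an agent can act on is a one-line filter.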
The Numbers
We benchmarked SOM against raw HTML on 49 real-world websites. Not toy examples. Real production pages from Google Cloud, Reddit, Stripe, The New York Times, and 45 others.
| Model (price per 1M input tokens) | HTML Cost | SOM Cost | Savings |
|---|---|---|---|
| GPT-4 ($10/M) | $50,397 | $3,042 | $47,355 (94%) |
| GPT-4o ($2.50/M) | $12,599 | $761 | $11,839 (94%) |
| Claude Sonnet ($3/M) | $15,119 | $913 | $14,207 (94%) |
Best case: cloud.google.com compressed 116.9x, from 464K tokens down to 4K. Even minimal sites like postgresql.org still showed 1.2x compression.
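The savings column is plain arithmetic on the two cost columns, and it is easy to recheck. The sketch below recomputes it from the rounded dollar figures in the table, so a result can differ from the published value by a dollar; the `savings` helper is just an illustrative name.

```python
# Figures copied from the table above: (HTML cost, SOM cost) in USD.
costs = {
    "GPT-4": (50_397, 3_042),
    "GPT-4o": (12_599, 761),
    "Claude Sonnet": (15_119, 913),
}


def savings(html_cost, som_cost):
    """Return (absolute savings, fractional savings, HTML-to-SOM cost ratio)."""
    saved = html_cost - som_cost
    return saved, saved / html_cost, html_cost / som_cost


for model, (html_cost, som_cost) in costs.items():
    saved, fraction, ratio = savings(html_cost, som_cost)
    print(f"{model}: ${saved:,} saved ({fraction:.0%}), {ratio:.1f}x cheaper")
```

Every row works out to about 94% savings, which is the same thing as a cost ratio of roughly 16.6x between raw HTML and SOM.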
See the full benchmark with all 49 sites →
Why Not Just Strip Tags?
Common objection: "Just use BeautifulSoup or Cheerio to strip HTML tags. Problem solved."
Not quite. Tag stripping is the wrong tool for this job:
- Loses structure. You can't tell a navigation link from a content link from a footer link. They all become plain text.
- Loses interactivity. You don't know what's clickable, typeable, or selectable. An agent needs to act on pages, not just read them.
- Loses hierarchy. Headings, sections, and regions disappear. The page becomes a flat wall of text.
- Lossy in the wrong direction. Tag stripping removes structure but keeps text noise: text from hidden elements, screen-reader-only labels scattered everywhere, and inline script content that leaks through.
SOM is selective. It removes noise but preserves meaning. A stripped page is text. A SOM page is a structured document with roles, regions, and actions.
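The difference is easy to see on a tiny page. The sketch below uses only Python's standard `html.parser`: `TagStripper` does naive tag stripping, while `LinkExtractor` is a toy role-preserving pass that keeps each link's target and the region it sits in. `LinkExtractor` and its `region` key are assumptions made for illustration; this is not how Plasmate generates SOM, only a small demonstration of what stripping throws away.

```python
from html.parser import HTMLParser

HTML = (
    '<nav><div class="flex gap-2">'
    '<a href="/about" class="text-blue-500">About</a></div></nav>'
    '<p>Read <a href="/docs">the docs</a>.</p>'
)


class TagStripper(HTMLParser):
    """Naive tag stripping: keeps text, drops everything else."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class LinkExtractor(HTMLParser):
    """Toy role-preserving pass: emits link nodes tagged with their region."""

    def __init__(self):
        super().__init__()
        self.region = "content"   # default region outside <nav>
        self.current = None       # link node currently being built
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.region = "navigation"
        elif tag == "a":
            self.current = {
                "role": "link",
                "text": "",
                "attrs": {"href": dict(attrs).get("href")},
                "actions": ["click"],
                "region": self.region,
            }

    def handle_endtag(self, tag):
        if tag == "nav":
            self.region = "content"
        elif tag == "a" and self.current is not None:
            self.nodes.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data


stripper = TagStripper()
stripper.feed(HTML)
stripper.close()
stripped = "".join(stripper.chunks)

extractor = LinkExtractor()
extractor.feed(HTML)
extractor.close()
links = extractor.nodes

print(stripped)
for link in links:
    print(link["region"], link["text"], link["attrs"]["href"])
```

The stripped output is one run of text with no link targets and no hint of which words were navigation. The extracted nodes still say which link lives in the navigation, which lives in the content, and where each one goes.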
Why Not the Accessibility Tree?
Accessibility trees are designed for screen readers. They solve a related but fundamentally different problem.
- Browser-dependent. You need a full browser runtime to generate an accessibility tree. SOM works from raw HTML, no browser required.
- Visual layout information. Accessibility trees include bounding boxes, visual states, and layout hints that agents don't need.
- Verbose. Every DOM node gets an accessibility role, even purely decorative ones. The tree inherits the DOM's depth and redundancy.
- Not designed for action. Accessibility trees describe what things are for human assistive technology. SOM describes what things are AND what an agent can do with them.
SOM is purpose-built for agent consumption: flat regions, semantic roles, explicit action annotations. It's what you'd design if you started from "what does an LLM need?" instead of "what does a screen reader need?"
Why Not Screenshots + Vision?
Vision models can look at screenshots. So why not just send a screenshot?
- Token cost. Image tokens are 4-10x more expensive than text tokens. A screenshot of a page costs far more than its SOM representation.
- Hallucination. Vision models hallucinate UI elements. They'll "see" buttons that aren't there and miss ones that are.
- No structured data. You can't extract JSON-LD, form values, or link targets from pixels.
- No interaction model. You can't identify elements by selector from a screenshot. You can't tell the model "click the third link in the navigation" if all it has is an image.
Screenshots are appropriate for visual verification: "does this page look right?" They're not appropriate for primary page understanding. SOM gives agents the structured, actionable data they actually need.
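With structured data, "click the third link in the navigation" becomes a lookup instead of a guess. Here is a minimal sketch over a hand-written SOM-like document; the top-level `regions`, `kind`, and `elements` keys and the `nth_link` helper are assumptions for illustration, and the real layout is defined by the SOM spec.

```python
# Hypothetical SOM-like document; key names above the element level are assumed.
page = {
    "regions": [
        {"kind": "navigation", "elements": [
            {"role": "link", "text": "Home", "attrs": {"href": "/"}, "actions": ["click"]},
            {"role": "link", "text": "Pricing", "attrs": {"href": "/pricing"}, "actions": ["click"]},
            {"role": "link", "text": "About", "attrs": {"href": "/about"}, "actions": ["click"]},
        ]},
        {"kind": "content", "elements": [
            {"role": "heading", "text": "Welcome"},
        ]},
    ]
}


def nth_link(page, region_kind, n):
    """Return the n-th (1-based) clickable link in the named region, or None."""
    for region in page["regions"]:
        if region["kind"] != region_kind:
            continue
        links = [
            el for el in region["elements"]
            if el["role"] == "link" and "click" in el.get("actions", [])
        ]
        if 1 <= n <= len(links):
            return links[n - 1]
    return None


target = nth_link(page, "navigation", 3)
print(target["text"], target["attrs"]["href"])
```

The result is an exact element with an exact target, which is something a screenshot cannot give you without a second pass over the page.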
SOM as a Standard
SOM isn't locked inside Plasmate. It's an open specification designed to be consumed by any tool, framework, or agent.
- SOM Spec v1.0 is published and stable
- Standalone parsers available on npm (som-parser) and PyPI (som-parser)
- Zero dependency on Plasmate to consume SOM output
- JSON Schema validation available for tooling and CI
- Apache 2.0 licensed with no IP restrictions
You can generate SOM with Plasmate and consume it with anything. Or build your own SOM generator. The format is the standard, not the tool.
Who Benefits
- Agent framework developers (Browser Use, LangChain, CrewAI): lower token costs, faster inference, structured page data out of the box
- Enterprise AI teams: predictable, structured web data instead of HTML soup. No more prompt-engineering around broken DOM structures.
- Web scraping at scale: roughly 16x lower LLM costs in the benchmark above. When you're processing millions of pages, 94% savings is the difference between viable and bankrupt.
- Tool-use agents: explicit action annotations tell the model exactly what's clickable, typeable, and selectable. No guessing.
Get Started
Try SOM in under a minute:
# Install Plasmate
cargo install plasmate
# Fetch any URL as SOM
plasmate fetch https://example.com
Use SOM in your project:
# Node.js
npm install som-parser
# Python
pip install som-parser