Why SOM: The Case for a Semantic Web Format for AI Agents

The web was built for humans looking at pixels. AI agents don't need pixels. They need meaning.

Agents that send raw HTML to language models pay for CSS classes, script tags, tracking pixels, and layout divs that carry little or no semantic value for the task at hand. SOM is designed to fix this.

The Problem

Web pages commonly contain presentation and runtime markup—class names, inline styles, script blocks, SVG paths, tracking elements, and deeply nested layout containers—that is not useful to every agent task.

When an AI agent reads a raw web page, all of that markup goes straight into the context window, whether the task needs it or not. And context windows cost money.

How much of that material can be removed depends on the page, the SOM selector and budget, JavaScript mode, serialization, and the downstream tokenizer. Measure the exact workflow rather than applying a universal savings percentage.

Here's the deeper issue: the DOM is a rendering tree, not a meaning tree. It tells you WHERE things go on screen, not WHAT things are. A <div> with twelve CSS classes might be a navigation link, a button, a heading, or a decorative container. The DOM doesn't know and doesn't care. It was designed to paint pixels, not convey semantics.

AI agents deserve better input than a rendering tree with the renderer removed.

What SOM Is

SOM (Semantic Object Model) is a structured JSON representation of web content designed for machine consumption. It takes the meaningful content of a web page and expresses it in a format that LLMs can process efficiently.

Instead of this:

Raw HTML

<div class="sc-1234 flex items-center gap-2">
  <a href="/about" class="text-blue-500 hover:underline
     font-medium tracking-tight">About</a>
</div>

SOM gives you this:

SOM Output

{
  "role": "link",
  "text": "About",
  "attrs": { "href": "/about" },
  "actions": ["click"]
}

The example preserves the link's meaning and action while omitting its presentation classes. Whether it uses fewer tokens, and by how much, depends on the surrounding page and tokenizer.
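
One way to check this for a given page is to compare serialized sizes directly. The sketch below is a minimal illustration, not part of any SOM tooling: it reuses the two snippets above and compares UTF-8 byte counts. The helper names (utf8_size, byte_ratio) are hypothetical, and bytes are only a proxy; a token comparison would need the downstream model's own tokenizer.

```python
import json

# The two snippets from the example above, reproduced as Python values.
RAW_HTML = '''<div class="sc-1234 flex items-center gap-2">
  <a href="/about" class="text-blue-500 hover:underline
     font-medium tracking-tight">About</a>
</div>'''

SOM_NODE = {
    "role": "link",
    "text": "About",
    "attrs": {"href": "/about"},
    "actions": ["click"],
}


def utf8_size(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def byte_ratio(raw_html: str, som: dict) -> float:
    """Ratio of raw HTML bytes to compactly serialized SOM bytes."""
    # Compact separators avoid counting pretty-printing whitespace as content.
    compact = json.dumps(som, separators=(",", ":"), ensure_ascii=False)
    return utf8_size(raw_html) / utf8_size(compact)


if __name__ == "__main__":
    print(f"byte ratio (HTML / SOM): {byte_ratio(RAW_HTML, SOM_NODE):.2f}")
```

The serialization choice matters: pretty-printed JSON would shift the ratio, which is one reason to measure the exact format your agent actually sends.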

Key Properties

Semantic roles (link, button, heading, paragraph, form, input) instead of div/span/a
Actionable attributes only (href, value, placeholder) instead of class, style, data-*
Region-based structure (navigation, content, form, footer) instead of arbitrary nesting
Explicit interactivity: every interactive element is marked with its available actions (click, type, select)
Structured data extraction: JSON-LD, OpenGraph, and meta tags normalized into a clean object
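
To make these properties concrete, here is a minimal sketch of how an agent might list everything it can act on, grouped by region. The document shape is an assumption made for illustration: the top-level "regions" list and the per-region "elements" key are hypothetical and not taken from the SOM spec. Only the element fields (role, text, attrs, actions) come from the example above.

```python
from typing import Any

# Hypothetical document shape, for illustration only. The "regions" and
# "elements" keys are assumptions; consult the SOM spec for the real layout.
SAMPLE_DOC: dict[str, Any] = {
    "regions": [
        {
            "role": "navigation",
            "elements": [
                {"role": "link", "text": "About",
                 "attrs": {"href": "/about"}, "actions": ["click"]},
                {"role": "link", "text": "Docs",
                 "attrs": {"href": "/docs"}, "actions": ["click"]},
            ],
        },
        {
            "role": "content",
            "elements": [
                {"role": "heading", "text": "Welcome"},
                {"role": "input",
                 "attrs": {"placeholder": "Search"}, "actions": ["type"]},
            ],
        },
    ]
}


def interactive_elements(doc: dict[str, Any]) -> list[tuple[str, str, list[str]]]:
    """Return (region role, element label, actions) for each element with actions."""
    found = []
    for region in doc.get("regions", []):
        for element in region.get("elements", []):
            actions = element.get("actions", [])
            if not actions:
                continue  # Elements without actions are treated as read-only.
            attrs = element.get("attrs", {})
            # Fall back to placeholder or href when the element has no text.
            label = element.get("text") or attrs.get("placeholder") or attrs.get("href", "")
            found.append((region.get("role", "unknown"), label, actions))
    return found


if __name__ == "__main__":
    for region_role, label, actions in interactive_elements(SAMPLE_DOC):
        print(f"{region_role}: {label!r} -> {', '.join(actions)}")
```

Because regions and actions are explicit fields, the agent does not have to infer them from class names or nesting depth.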

Retained Output-Size Evidence

The retained v0.5.1 public-web snapshots attempted 98 URLs per run. The non-JavaScript snapshot recorded a 9.98x median serialized-byte ratio over 83 successful inputs; the JavaScript snapshot recorded a 9.32x median over 82 successful inputs. Blocked and failed URLs remain in the full denominator.

These are historical observational byte ratios. They are not universal token savings, cost savings, latency, or task-success claims, and the legacy snapshots predate the current provenance/corpus-digest schema. See the retained non-JavaScript snapshot, the JavaScript snapshot, and the benchmark policy.
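
For readers who want to compute a comparable figure on their own corpus, the sketch below shows one way to report a median byte ratio while keeping failed URLs in the attempted count. It is not the benchmark's actual code: the record format and the summarize helper are assumptions, and the sample numbers are placeholders, not measurements.

```python
from statistics import median
from typing import Optional

# Hypothetical per-URL records: (raw HTML bytes, SOM bytes), or None when the
# fetch was blocked or failed. Placeholder values, not real measurements.
RESULTS: list[Optional[tuple[int, int]]] = [
    (120_000, 10_000),
    (45_000, 9_000),
    None,
    (80_000, 4_000),
]


def summarize(results: list[Optional[tuple[int, int]]]) -> tuple[int, int, Optional[float]]:
    """Return (attempted, succeeded, median byte ratio over successful inputs)."""
    ratios = []
    for pair in results:
        if pair is None:
            continue  # Failed inputs stay in the attempted count only.
        raw_bytes, som_bytes = pair
        if som_bytes > 0:  # Treat empty output as unsuccessful, avoiding division by zero.
            ratios.append(raw_bytes / som_bytes)
    med = median(ratios) if ratios else None
    return len(results), len(ratios), med


if __name__ == "__main__":
    attempted, succeeded, med = summarize(RESULTS)
    print(f"attempted={attempted} succeeded={succeeded} median_ratio={med}")
```

Reporting attempted and succeeded counts next to the median keeps the denominator visible, so a high ratio cannot hide a high failure rate.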

Why Not Just Strip Tags?

Common objection: "Just use BeautifulSoup or Cheerio to strip HTML tags. Problem solved."

Not quite. Tag stripping is the wrong tool for this job:

Loses structure. You can't tell a navigation link from a content link from a footer link. They all become plain text.
Loses interactivity. You don't know what's clickable, typeable, or selectable. An agent needs to act on pages, not just read them.
Loses hierarchy. Headings, sections, and regions disappear. The page becomes a flat wall of text.
Lossy in the wrong direction. Tag stripping removes structure but can keep text noise: text from hidden elements, visually hidden labels scattered through the page, and inline script or style content that leaks through.

SOM is selective. It removes noise but preserves meaning. A stripped page is text. A SOM page is a structured document with roles, regions, and actions.
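
The loss is easy to reproduce. The sketch below uses Python's standard-library html.parser as a stand-in for BeautifulSoup or Cheerio (an assumption made so the example has no dependencies) and strips a small made-up page that has a navigation link, a content link, a button, and a footer link.

```python
from html.parser import HTMLParser

# A small made-up page with links in three regions plus a button.
PAGE = """
<nav><a href="/about">About</a></nav>
<main><p>Read the <a href="/docs">docs</a>.</p><button>Sign up</button></main>
<footer><a href="/terms">Terms</a></footer>
"""


class TextOnly(HTMLParser):
    """Collect text nodes and discard every tag and attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html: str) -> str:
    """Return the page's text with all markup removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(strip_tags(PAGE))
```

In the stripped output, nothing distinguishes the navigation link from the footer link, the link targets are gone, and the button is indistinguishable from ordinary text.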

Why Not the Accessibility Tree?

Accessibility trees are designed for screen readers. They solve a related but fundamentally different problem.

Browser-dependent. You need a full browser runtime to generate an accessibility tree. SOM works from raw HTML, no browser required.
Visual layout information. Accessibility trees include bounding boxes, visual states, and layout hints that agents don't need.
Verbose. Full accessibility-tree dumps often include generic or ignored nodes for layout containers, so the tree can inherit much of the DOM's depth and redundancy.
Not designed for action. Accessibility trees describe what things are for human assistive technology. SOM describes what things are AND what an agent can do with them.

SOM is purpose-built for agent consumption: flat regions, semantic roles, explicit action annotations. It's what you'd design if you started from "what does an LLM need?" instead of "what does a screen reader need?"

Why Not Screenshots + Vision?

Vision models can look at screenshots. So why not just send a screenshot?

Representation cost varies. Image and text tokenization depends on the model and input. Measure both paths for the selected model instead of assuming a fixed multiplier.
Hallucination. Vision models can hallucinate UI elements: they may "see" buttons that aren't there and miss ones that are.
No structured data. You can't extract JSON-LD, form values, or link targets from pixels.
No interaction model. You can't identify elements by selector from a screenshot. You can't tell the model "click the third link in the navigation" if all it has is an image.

Screenshots are appropriate for visual verification: "does this page look right?" They're not appropriate for primary page understanding. SOM gives agents the structured, actionable data they actually need.

SOM as a Standard

SOM isn't locked inside Plasmate. It's an open specification designed to be consumed by any tool, framework, or agent.

SOM Spec v1.0 is published and stable
Standalone parsers available on npm (som-parser) and PyPI (som-parser)
Zero dependency on Plasmate to consume SOM output
JSON Schema validation available for tooling and CI
Apache 2.0 licensed with no IP restrictions

You can generate SOM with Plasmate and consume it with anything. Or build your own SOM generator. The format is the standard, not the tool.

Who Benefits

Agent framework developers (Browser Use, LangChain, CrewAI): structured page data out of the box, with potential token and latency savings that should be measured per workload
Enterprise AI teams: predictable, structured web data instead of HTML soup. No more prompt-engineering around broken DOM structures.
Web processing at scale: structured output can omit irrelevant markup, but cost impact must be measured on the actual corpus, tokenizer, and task.
Tool-use agents: explicit action annotations tell the model exactly what's clickable, typeable, and selectable. No guessing.

Get Started

Try SOM in under a minute:

# Install Plasmate
cargo install plasmate

# Fetch any URL as SOM
plasmate fetch https://example.com

Use SOM in your project:

# Node.js
npm install som-parser

# Python
pip install som-parser