LlamaIndex Integration

Use Plasmate as a document reader in your LlamaIndex RAG pipelines. It produces structured SOM output that uses 10-16x fewer tokens than raw HTML readers.

Source: llama-index-readers-plasmate

Installation

pip install plasmate llama-index-readers-plasmate

Quick Start

from llama_index.readers.plasmate import PlasmateWebReader

reader = PlasmateWebReader()
documents = reader.load_data(urls=[
    "https://example.com",
    "https://docs.python.org/3/",
])

# Use in your RAG pipeline
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is this page about?")

Why Plasmate for RAG?

Standard web readers (SimpleWebPageReader, BeautifulSoupWebReader) return raw HTML or basic extracted text. Plasmate returns structured semantic content.

Configuration

PlasmateWebReader(
    binary="plasmate",    # Path to plasmate binary
    timeout=30,           # Timeout per page in seconds
    budget=None,          # Optional SOM token budget
    javascript=True,      # Enable JS execution
)
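
As a rough illustration of how these options might be combined, the sketch below collects them into named presets. The preset names, the timeout values inside them, and the helpers reader_kwargs and make_reader are hypothetical; only the four keyword arguments and their defaults come from the configuration shown above.

```python
from typing import Any, Dict, Optional

# Defaults as listed in the configuration above.
DEFAULTS: Dict[str, Any] = {
    "binary": "plasmate",
    "timeout": 30,
    "budget": None,
    "javascript": True,
}

# Hypothetical presets: the names and values are illustrative assumptions.
PRESETS: Dict[str, Dict[str, Any]] = {
    "static": {"javascript": False, "timeout": 10},   # plain HTML pages
    "dynamic": {"javascript": True, "timeout": 60},   # JS-heavy pages
}


def reader_kwargs(preset: str, budget: Optional[int] = None) -> Dict[str, Any]:
    """Merge the documented defaults with a preset and an optional budget."""
    kwargs = dict(DEFAULTS)
    kwargs.update(PRESETS[preset])
    if budget is not None:
        kwargs["budget"] = budget
    return kwargs


def make_reader(preset: str, budget: Optional[int] = None):
    """Build a reader; the import is deferred so the helpers above work without the package."""
    from llama_index.readers.plasmate import PlasmateWebReader

    return PlasmateWebReader(**reader_kwargs(preset, budget))
```

Calling make_reader("static", budget=2000) would then construct a reader with JS execution disabled and a SOM token budget of 2000, assuming the package is installed.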

Document Metadata

Each loaded document includes metadata:

Field              Description
url                Source URL
title              Page title
html_bytes         Original HTML size in bytes
som_bytes          SOM output size in bytes
element_count      Total SOM elements
compression_ratio  HTML → SOM ratio
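
The metadata fields above can be aggregated across a batch of documents. The following is a minimal sketch, assuming each document exposes a metadata dictionary with the html_bytes and som_bytes keys listed above. The function summarize_compression is hypothetical, and the sample objects are stand-ins with made-up sizes so the snippet runs without fetching any pages.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_compression(documents: Iterable[Any]) -> Dict[str, float]:
    """Sum HTML and SOM sizes over all documents and compute the overall ratio."""
    html_total = 0
    som_total = 0
    for doc in documents:
        html_total += int(doc.metadata.get("html_bytes", 0))
        som_total += int(doc.metadata.get("som_bytes", 0))
    # Avoid dividing by zero when no SOM output was recorded.
    ratio = html_total / som_total if som_total else 0.0
    return {"html_bytes": html_total, "som_bytes": som_total, "ratio": ratio}


# Stand-in documents with invented sizes; real ones would come from reader.load_data().
sample = [
    SimpleNamespace(metadata={"url": "https://example.com", "html_bytes": 12000, "som_bytes": 1000}),
    SimpleNamespace(metadata={"url": "https://docs.python.org/3/", "html_bytes": 8000, "som_bytes": 1000}),
]

summary = summarize_compression(sample)
print(summary)
```

In a real pipeline, the list returned by reader.load_data() could be passed in place of sample, provided the metadata keys match the table above.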