LlamaIndex Integration

Use Plasmate as a document reader in your LlamaIndex RAG pipelines - structured SOM output for 10-16x fewer tokens than raw HTML readers.

Source: llama-index-readers-plasmate

Installation

pip install plasmate llama-index-readers-plasmate

Quick Start

from llama_index.readers.plasmate import PlasmateWebReader

reader = PlasmateWebReader()
documents = reader.load_data(urls=[
    "https://example.com",
    "https://docs.python.org/3/",
])

# Use in your RAG pipeline
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is this page about?")

Why Plasmate for RAG?

Standard web readers (SimpleWebPageReader, BeautifulSoupWebReader) return raw HTML or basic text extraction. Plasmate returns structured semantic content:

10-16x fewer tokens per page - cheaper embeddings and queries
Preserved document hierarchy - headings, sections, lists stay structured
Clean text extraction - no scripts, styles, or layout noise
Metadata included - title, language, byte sizes, element counts

Configuration

PlasmateWebReader(
    binary="plasmate",    # Path to plasmate binary
    timeout=30,           # Timeout per page in seconds
    budget=None,          # Optional SOM token budget
    javascript=True,      # Enable JS execution
)

Document Metadata

Each loaded document includes metadata: