Scrapy Integration

Use Plasmate as a drop-in downloader middleware for Scrapy, a widely used Python web scraping framework.

Install

pip install plasmate scrapy-plasmate

Setup

Add to your Scrapy project's settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_plasmate.PlasmateDownloaderMiddleware': 543,
}

Usage

import scrapy
from scrapy_plasmate.utils import extract_text, extract_links

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        som = response.meta.get('plasmate_som', {})
        yield {
            'url': response.url,
            'title': som.get('title', ''),
            'text': extract_text(som),
            'links': extract_links(som),
        }

How It Works

The middleware intercepts each request and routes it through Plasmate instead of Scrapy's default HTTP downloader. The resulting SOM is stored in response.meta['plasmate_som'], where your spider callbacks can read it.

If Plasmate fails for any URL, the middleware falls back to the standard Scrapy downloader automatically.
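Because of the fallback, a callback may receive a response that carries no SOM. The sketch below shows one way to handle both cases. It assumes that the 'plasmate_som' key is simply missing (or empty) after a fallback, which the text above does not state explicitly, and title_from is a hypothetical helper, not part of scrapy_plasmate. It uses only the standard library so it can be tried outside a Scrapy project.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_from(meta, html):
    """Hypothetical helper: prefer the SOM title, else parse the raw HTML.

    Assumption: after a fallback to the standard downloader,
    meta has no usable 'plasmate_som' entry.
    """
    som = meta.get("plasmate_som")
    if som and som.get("title"):
        return som["title"].strip()

    # Fallback path: no SOM available, so read <title> from the HTML body.
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()
```

Inside a spider callback this could be called as title_from(response.meta, response.text), so the same parse method works whether or not Plasmate handled the request.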

Utilities

from scrapy_plasmate.utils import (
    extract_text,      # All text content
    extract_links,     # All links with text
    extract_headings,  # All headings with levels
    extract_tables,    # Table data
)

Settings

Setting              Default  Description
PLASMATE_ENABLED     True     Enable/disable the middleware
PLASMATE_TIMEOUT     30       Timeout in seconds per request
PLASMATE_JAVASCRIPT  True     Enable JavaScript execution
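
As an illustration, these settings could be overridden in settings.py alongside the middleware entry. The values below are examples chosen for a slow, mostly static site, not recommendations.

```python
# settings.py (illustrative values only)

# Keep the middleware active; set to False to bypass Plasmate entirely.
PLASMATE_ENABLED = True

# Allow slow pages up to 60 seconds instead of the default 30.
PLASMATE_TIMEOUT = 60

# Skip JavaScript execution for pages that render without it.
PLASMATE_JAVASCRIPT = False
```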