Scrapy Integration
Use Plasmate as a drop-in downloader middleware for Scrapy, a widely used Python web scraping framework.
Install
pip install plasmate scrapy-plasmate
Setup
Add to your Scrapy project's settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_plasmate.PlasmateDownloaderMiddleware': 543,
}
Usage
import scrapy
from scrapy_plasmate.utils import extract_text, extract_links


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        som = response.meta.get('plasmate_som', {})
        yield {
            'url': response.url,
            'title': som.get('title', ''),
            'text': extract_text(som),
            'links': extract_links(som),
        }
How It Works
The middleware intercepts requests and routes them through Plasmate instead of the default HTTP downloader. The SOM is stored in response.meta['plasmate_som'] for easy access in your spider.
If Plasmate fails for any URL, the middleware falls back to the standard Scrapy downloader automatically.
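The fallback behaviour described above can be sketched without Scrapy installed. This is an illustrative outline, not the actual scrapy_plasmate implementation: the class name FallbackMiddlewareSketch and the fetch_som callable are hypothetical, and the request and response objects are simple stand-ins. The one Scrapy fact it relies on is that a downloader middleware's process_request returning None tells Scrapy to continue with its default downloader.

```python
from types import SimpleNamespace


class FallbackMiddlewareSketch:
    """Illustrative only; NOT the real scrapy_plasmate middleware."""

    def __init__(self, fetch_som, enabled=True):
        # fetch_som is a hypothetical callable: url -> SOM dict (may raise).
        self.fetch_som = fetch_som
        self.enabled = enabled

    def process_request(self, request, spider=None):
        if not self.enabled:
            # None means "keep going": Scrapy uses its default downloader.
            return None
        try:
            som = self.fetch_som(request.url)
        except Exception:
            # Plasmate failed for this URL: fall back to the default downloader.
            return None
        # Attach the SOM so the spider can read it from the meta dict.
        request.meta['plasmate_som'] = som
        # Stand-in for a real Scrapy response object.
        return SimpleNamespace(url=request.url, meta=request.meta)


def fake_plasmate(url):
    # Hypothetical successful fetch returning a tiny SOM.
    return {'title': 'Example'}


def broken_plasmate(url):
    # Hypothetical failing fetch.
    raise RuntimeError('Plasmate unavailable')


ok_request = SimpleNamespace(url='https://example.com', meta={})
ok_response = FallbackMiddlewareSketch(fake_plasmate).process_request(ok_request)

failed_request = SimpleNamespace(url='https://example.com', meta={})
failed_response = FallbackMiddlewareSketch(broken_plasmate).process_request(failed_request)
```

In the failing case the method returns None and leaves meta untouched, which is why the spider example reads the SOM with response.meta.get('plasmate_som', {}) rather than indexing it directly.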
Utilities
from scrapy_plasmate.utils import (
    extract_text,      # All text content
    extract_links,     # All links with text
    extract_headings,  # All headings with levels
    extract_tables,    # Table data
)
Settings
| Setting | Default | Description |
|---|---|---|
| PLASMATE_ENABLED | True | Enable/disable the middleware |
| PLASMATE_TIMEOUT | 30 | Timeout in seconds per request |
| PLASMATE_JAVASCRIPT | True | Enable JavaScript execution |
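
As a sketch, a settings.py that registers the middleware and overrides two of the defaults from the table might look like the following. The values 60 and False are arbitrary examples chosen for illustration, not recommendations.

```python
# settings.py (fragment)

# Register the Plasmate downloader middleware, as shown in Setup.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_plasmate.PlasmateDownloaderMiddleware': 543,
}

# Keep the middleware active (this is the documented default).
PLASMATE_ENABLED = True

# Example override: allow up to 60 seconds per request instead of 30.
PLASMATE_TIMEOUT = 60

# Example override: skip JavaScript execution for static pages.
PLASMATE_JAVASCRIPT = False
```

Setting PLASMATE_ENABLED to False turns the middleware off without removing it from DOWNLOADER_MIDDLEWARES.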