Scrapefold — turn any URL into clean markdown

One-click install for AI agents

Scrapefold ships a CLI (scrapefold) and an MCP server (scrapefold-mcp) with four tools: scrape_url, crawl_site, list_engines, and classify_url. All four definitions cost your agent ≈750 tokens (budget-tested). Need one fact from a long page? scrape_url(focus="query") returns only the relevant blocks. Failures come back structured, never as error-page HTML posing as content. One command registers it in your client:

$ pip install "scrapefold[mcp]"
$ scrapefold install claude   # Claude Code
$ scrapefold install codex    # Codex CLI
$ scrapefold install cursor   # Cursor
$ scrapefold install vscode   # VS Code

Any other client: paste the config JSON above into its MCP settings. Full agent instructions live at scrapefold.com/install.md — fetchable by any agent.
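
The focus option on scrape_url can be pictured as block-level filtering. The following is a minimal sketch under stated assumptions, not Scrapefold's implementation: it assumes blocks are separated by blank lines and that relevance is plain word overlap with the query. The helper name focus_blocks and the sample page are hypothetical.

```python
import re


def focus_blocks(markdown: str, query: str, top_k: int = 2) -> list[str]:
    """Return up to top_k blocks that share the most words with the query.

    Hypothetical helper: the real focus logic may rank blocks differently.
    """
    terms = set(re.findall(r"\w+", query.lower()))
    # Treat every blank-line-separated chunk as one block.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    scored = []
    for index, block in enumerate(blocks):
        words = set(re.findall(r"\w+", block.lower()))
        overlap = len(terms & words)
        if overlap:
            scored.append((overlap, index, block))
    # Highest overlap first; ties keep document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [block for _, _, block in scored[:top_k]]


# Made-up page used only to exercise the helper.
PAGE = """# Acme

Welcome to Acme, the anvil company.

## Pricing

The Pro plan pricing is $20 per month.

## Contact

Email us any time."""

if __name__ == "__main__":
    for block in focus_blocks(PAGE, "pricing"):
        print(block)
```

Returning only the matching blocks is what keeps the token cost low: the agent never sees the unrelated sections of the page.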

Why Scrapefold?

Every scraping vendor has trade-offs. Scrapefold lets you switch between them with one line — and escalates from free local engines to paid APIs only as far as a site forces it.

Try a new vendor

Before: rewrite your pipeline

After: change one string — engines=("firecrawl",)

Cascade on block pages

Before: hand-roll try/except chains

After: built-in is_suspicious + ladder escalation

Whole-site crawl

Before: build sitemap parser + BFS + dedup

After: await crawl_site(root, opts)

LLM-ready output

Before: strip HTML by hand

After: result.markdown always populated
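
The ladder escalation mentioned above can be sketched as a loop over engines ordered from cheapest to most expensive. This is an illustration under assumptions, not the library's code: looks_blocked is a hypothetical stand-in for the is_suspicious check, the Attempt type and both stub engines are invented for the example, and the block-page phrases are only examples.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Example phrases only; a real detector would use more signals.
BLOCK_MARKERS = ("just a moment", "access denied", "verify you are human")


@dataclass
class Attempt:
    engine: str
    ok: bool
    markdown: str = ""
    error: Optional[str] = None


def looks_blocked(text: str) -> bool:
    """Heuristic stand-in: empty output or a known challenge phrase."""
    lowered = text.lower()
    return not lowered.strip() or any(m in lowered for m in BLOCK_MARKERS)


def escalate(url: str, ladder: list[tuple[str, Callable[[str], str]]]) -> Attempt:
    """Try engines in order and stop at the first clean result."""
    last_error = "no engines configured"
    for name, fetch in ladder:
        try:
            text = fetch(url)
        except Exception as exc:  # a failing engine moves us up the ladder
            last_error = f"{name}: {exc}"
            continue
        if looks_blocked(text):
            last_error = f"{name}: block page detected"
            continue
        return Attempt(engine=name, ok=True, markdown=text)
    # Structured failure instead of passing a block page off as content.
    return Attempt(engine=ladder[-1][0] if ladder else "", ok=False, error=last_error)


# Stub engines for illustration; real engines would perform network I/O.
def cheap_engine(url: str) -> str:
    return "Just a moment... checking your browser"


def stealth_engine(url: str) -> str:
    return "# Example\n\nReal page content."


LADDER = [("cheap", cheap_engine), ("stealth", stealth_engine)]

if __name__ == "__main__":
    print(escalate("https://example.com", LADDER))
```

Ordering the ladder by cost means a paid tier is reached only after every cheaper engine has either failed or returned something that looks like a block page.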

30 engines, one interface

Local engines are free and fast; SaaS engines add premium proxies and stealth. The router picks the cheapest tier that works. Ratings: ★★★ excellent · ★★☆ good · ★☆☆ basic.

requests local

Static HTML · ultra-fast · ★★★

scrapling_fast local

TLS-impersonation HTTP · ★★★

scrapling_stealth local

JS render + stealth · ★★★

crawl4ai local

JS render · native markdown · ★★★

PixelRAG local · visual

pixelshot + VLM/OCR reader · ★★★

cloakbrowser local

Anti-fingerprint browser · ★★★

selenium local

Classic JS rendering · ★★☆

Jina Reader saas · free tier

Direct URL → markdown · ★★★

Firecrawl saas

LLM-ready markdown + stealth · ★★★

ScrapingBee saas

Premium proxy + JS · ★★★

Scrapingdog saas

Affordable proxy + browser · ★★★

ScraperAPI saas

Proxy + JS render · native markdown · AI Parser → JSON · ★★★

Exa saas · search

Search · Contents · Answer · Agent · ★★★

Serper saas

Fast, cheap scrape · native markdown + JSON-LD · ★★★

Maxun local · self-hosted

No-code robot runs → structured JSON · ★★★

Cloudflare BR saas

Browser rendering at the edge · ★★★

Oxylabs saas

Web Scraper API · residential geo · ★★★

Anysite saas

General-purpose · native markdown · ★★★

Scrape Creators saas · site

Social-media JSON APIs · ★★★

SocialCrawl saas · site

Social-data JSON gateway · ★★★

Telegram local · social

Public channel previews · ★★☆

TGStat saas · telegram

Telegram posts + channels · ★★★

Telemetr saas · telegram

Telegram analytics API · ★★★

LabelUp saas · social

Cross-platform stats · ★★★

Apify Actor saas · site

Universal social actors · ★★☆

Apify (LinkedIn) saas · site

Vendor-managed actor runs · ★★☆

Outscraper saas · site

Niche aggregator scrapes · ★★☆

Wayback local · fallback

Dead-link recovery via archive.org, honestly marked · ★★☆
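
For the Wayback fallback, one plausible building block is the public archive.org availability endpoint. The sketch below assumes that endpoint and its documented JSON response shape; whether Scrapefold's wayback engine uses it is an assumption, and the helper names are hypothetical. No network call is made here: the sample payload follows the shape shown in the Internet Archive documentation.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def availability_url(page_url: str) -> str:
    """Build the availability lookup URL for a page."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


def closest_snapshot(payload: dict) -> Optional[dict]:
    """Extract the closest snapshot, labelled so archived content is never
    mistaken for a live fetch."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {
        "url": closest["url"],
        "timestamp": closest["timestamp"],
        "source": "archive.org",
    }


# Sample response in the documented shape (not fetched live).
SAMPLE = json.loads("""
{
  "url": "example.com",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20130919044612/http://example.com/",
      "timestamp": "20130919044612"
    }
  }
}
""")

if __name__ == "__main__":
    print(availability_url("example.com"))
    print(closest_snapshot(SAMPLE))
```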

How to choose

Or skip the decision entirely — call scrape(url) and let the router pick.

Static blog or documentation site: requests (zero deps, sub-second)
JS-rendered SPA, no anti-bot: scrapling_fast (free) or Jina Reader (free tier)
Cloudflare / Datadome / PerimeterX: scrapling_stealth (free) → Firecrawl / ScrapingBee (paid)
Site that emits clean markdown via API: Jina Reader (direct markdown, no parsing)
Visual layouts, tables, charts, or screenshots: PixelRAG (local pixelshot tiles + reader markdown / JSON)
LinkedIn / niche social: Exa public people/company search + Apify (LinkedIn) actor fallback
Structured fields straight from a page: ScraperAPI (AI Parser fills the json slot)
IP-geofenced targets: Oxylabs (residential pool + geo_location)
Page is gone (404) or paywalled now: wayback (archive.org snapshot, marked source=archive.org)
Need an MCP server for AI agents: scrapefold-mcp (built-in; register with scrapefold install claude)
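
A few rows of the guide above can be written down as a small rule function that returns an engine tuple of the kind passed as engines=(...). This is a sketch only: pick_engines and its flags are hypothetical, it covers just four of the rows, and it uses only engine names that appear in lowercase on this page.

```python
def pick_engines(
    *, gone: bool = False, anti_bot: bool = False, needs_js: bool = False
) -> tuple[str, ...]:
    """Map coarse site traits to an ordered engine tuple (hypothetical helper)."""
    if gone:
        # Dead or paywalled page: fall back to an archived snapshot.
        return ("wayback",)
    if anti_bot:
        # Free stealth engine first, paid API only if that is not enough.
        return ("scrapling_stealth", "firecrawl")
    if needs_js:
        return ("scrapling_fast",)
    # Static HTML: the plain HTTP engine is the cheapest option.
    return ("requests",)


if __name__ == "__main__":
    print(pick_engines())
    print(pick_engines(anti_bot=True))
```

The checks run from most to least restrictive, so a page that is both gone and behind an anti-bot wall still resolves to the archive fallback.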

Quickstart

Install one extra per vendor, or scrapefold[all] for everything.

import asyncio
from scrapefold import scrape, crawl_site, ScrapeOptions

async def main():
    # Single URL, auto-engine — router picks the cheapest tier that works
    result = await scrape("https://example.com")
    print(result.markdown)        # always populated
    print(result.engine)          # which engine actually fetched it

    # Cloudflare-protected site — same call, router auto-escalates
    result = await scrape(
        "https://protected.example.com",
        opts=ScrapeOptions(render_js=True, stealth=True),
    )

    # Whole-site crawl with disk cache
    crawl = await crawl_site(
        "https://docs.example.com",
        opts=ScrapeOptions(max_pages=50, max_depth=3),
        output="site.md",
    )

asyncio.run(main())

# CLI
$ scrapefold scrape https://example.com
$ scrapefold scrape https://example.com --focus "pricing"  # only relevant blocks — saves tokens
$ scrapefold crawl https://docs.example.com --max-pages 50 --output site.md
$ scrapefold list-engines
$ scrapefold doctor            # health check: engines, MCP extra
$ scrapefold update --check    # self-update via PyPI
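
The crawl_site call in the quickstart is bounded by max_pages and max_depth. A breadth-first crawl with deduplication under those two limits can be sketched as follows. This is not the library's crawler: it walks an in-memory link graph instead of fetching pages, and normalize, bfs_crawl, links_for, and the SITE graph are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    """Drop the fragment and any trailing slash so duplicates collapse."""
    return urldefrag(url).url.rstrip("/")


def bfs_crawl(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 50,
    max_depth: int = 3,
) -> list[str]:
    """Visit pages breadth-first, each at most once, within both limits."""
    start = normalize(root)
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # visit this page but do not follow its links
        for link in get_links(url):
            target = normalize(link)
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Made-up link graph standing in for real pages.
SITE = {
    "https://docs.example.com": ["https://docs.example.com/a/", "https://docs.example.com/b"],
    "https://docs.example.com/a": ["https://docs.example.com/b#intro", "https://docs.example.com/c"],
    "https://docs.example.com/b": ["https://docs.example.com"],
    "https://docs.example.com/c": [],
}


def links_for(url: str) -> list[str]:
    return SITE.get(url, [])


if __name__ == "__main__":
    print(bfs_crawl("https://docs.example.com/", links_for))
```

Marking a URL as seen when it is queued, rather than when it is visited, keeps the same page from being queued twice through different links.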

Built by & ecosystem

Scrapefold, the scraping engine behind Datatera, is built and maintained by Mike Sadofyev (CEO, Datatera.ai) alongside a small ecosystem of AI-data tooling. Connect on LinkedIn, X, or GitHub.

Datatera.ai platform

AI-powered data transformation & document processing

Docfold open source · sibling

Turn any document into structured data

Orquesta AI platform

AI orchestration & agent management

AI Agent Labs services

AI agent services & location intelligence