---
name: web-scraping-pipeline
description: Build web scraping pipelines that feed clean content into Claude Code workflows. Use when Claude needs to ingest web content, crawl sites for RAG, or extract structured data from pages. Trigger when users mention web scraping, Firecrawl, crawling websites, content ingestion, or extracting data from URLs.
---
# Web Scraping Pipeline
Three patterns for getting web content into Claude — from simple single-URL to full site crawls.
## Pattern 1: Firecrawl (LLM-optimized, recommended)
Firecrawl converts any URL to clean markdown — perfect for Claude context.
```python
# pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_KEY")

# Single page → markdown
result = app.scrape_url(
    "https://docs.example.com/page",
    params={"formats": ["markdown"]},
)
markdown_content = result["markdown"]

# Crawl an entire site (for crawls, formats belong under scrapeOptions)
crawl_result = app.crawl_url(
    "https://docs.example.com",
    params={"limit": 50, "scrapeOptions": {"formats": ["markdown"]}},
    poll_interval=5,
)
pages = crawl_result["data"]

# Feed into Claude
for page in pages:
    # page["markdown"] is clean, ready for Claude context
    pass
```
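To close the loop, here is a minimal sketch of handing scraped markdown to Claude with the `anthropic` Python SDK. The model name is a placeholder and the summarization prompt is an assumption; `markdown_content` comes from the scrape above.
```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder: use whichever Claude model you target
    max_tokens=1024,
    messages=[
        {"role": "user", "content": f"Summarize this page:\n\n{markdown_content}"}
    ],
)
print(message.content[0].text)
```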
## Pattern 2: Playwright (JavaScript-heavy sites)
```python
# pip install playwright && python -m playwright install chromium
from playwright.async_api import async_playwright

async def scrape_dynamic(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.inner_text("body")
        await browser.close()
        return content
```
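`scrape_dynamic` is a coroutine, so it has to be driven by an event loop. A minimal usage sketch (the URL is hypothetical):
```python
import asyncio

text = asyncio.run(scrape_dynamic("https://app.example.com/dashboard"))
print(text[:500])
```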
## Pattern 3: Scrapy (large-scale structured crawl)
```python
# pip install scrapy
import scrapy

class DocsSpider(scrapy.Spider):
    name = "docs"
    start_urls = ["https://docs.example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "content": " ".join(response.css("p::text").getall()),
        }
        for link in response.css("a::attr(href)"):
            yield response.follow(link, self.parse)
```
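A sketch of running the spider from a script instead of the `scrapy` CLI, using Scrapy's `CrawlerProcess`; the `FEEDS` setting and output filename are assumptions:
```python
from scrapy.crawler import CrawlerProcess

# Write every yielded item to docs.json (filename is arbitrary)
process = CrawlerProcess(settings={"FEEDS": {"docs.json": {"format": "json"}}})
process.crawl(DocsSpider)
process.start()  # blocks until the crawl finishes
```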
## Choosing a pattern
| Need | Use |
|---|---|
| Clean markdown for Claude context | Firecrawl |
| Dynamic JS-rendered content | Playwright |
| Large-scale structured crawl | Scrapy |
| One-off URL extraction | Firecrawl single-URL |
## Pipeline: scrape → chunk → embed → retrieve
```python
from uuid import uuid4

# 1. Scrape: see Pattern 1; crawl_result["data"] gives the page list
pages = firecrawl_crawl(site_url)
# 2. Chunk (naive: truncates each page to its first 2000 chars)
chunks = [page["markdown"][:2000] for page in pages]
# 3. Embed + store; `collection` is a Chroma collection (see rag-pipeline-setup skill)
for chunk in chunks:
    collection.add(documents=[chunk], ids=[str(uuid4())])
# 4. Retrieve at query time
results = collection.query(query_texts=[user_question], n_results=3)
```
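The `[:2000]` slice above silently drops everything past the first 2,000 characters of each page. A sketch of overlapping character-window chunking that keeps whole pages; the window and overlap sizes are assumptions to tune for your embedding model:
```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so no page content is dropped."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

chunks = [c for page in pages for c in chunk_text(page["markdown"])]
```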
## Related skills
rag-pipeline-setup, playwright-flow-recorder, vector-db-integration