
web-scraper

Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.

Compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs.

search#analysis
- Downloads: 4
- Updated: 2026/03/10
- Author: Admin
- Visibility: Public
SKILL.md preview

---
name: web-scraper
description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs.
---

# Web Scraper

Fetch, search, and extract content from websites.

## When to use this skill

- User asks to fetch or read a webpage / URL
- User wants to search the internet for information
- User needs to extract links, tables, or structured data from a website
- User asks to crawl a JavaScript-rendered (dynamic) page
- User wants web content converted to clean Markdown for analysis

## Scripts overview

| Script | Purpose | Dependencies |
|---|---|---|
| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | `requests`, `beautifulsoup4`, `readability-lxml`, `html2text` |
| `search_web.py` | Search the web via DuckDuckGo | `ddgs` |
| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | `crawl4ai` |
| `extract_links.py` | Extract and categorize all links from a page | `requests`, `beautifulsoup4` |
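
The table above can be read as a small routing rule: start with the static fetcher and switch to the headless-browser crawler only when a page needs JavaScript. The following is a minimal sketch of that rule, not part of the skill itself. The task labels and the `choose_script` helper are hypothetical, and the `scripts/` prefix is assumed to apply to every script in the table.

```python
# Script file names come from the table above. The "scripts/" prefix is an
# assumption; the task labels used as keys are hypothetical.
SCRIPTS = {
    "fetch": "scripts/fetch_page.py",
    "search": "scripts/search_web.py",
    "dynamic": "scripts/crawl_dynamic.py",
    "links": "scripts/extract_links.py",
}


def choose_script(task: str, needs_javascript: bool = False) -> str:
    """Return the script path for a task label."""
    if task == "fetch" and needs_javascript:
        # JS-rendered pages need the headless-browser crawler.
        return SCRIPTS["dynamic"]
    if task not in SCRIPTS:
        raise ValueError(f"unknown task: {task!r}")
    return SCRIPTS[task]
```

Defaulting to the static fetcher keeps the heavier `crawl4ai` dependency optional for most requests.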

## Steps

### 1. Install dependencies (first time only)

For lightweight scraping (static pages, search, link extraction):
```bash
pip install requests beautifulsoup4 readability-lxml html2text ddgs
```

For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):
```bash
pip install crawl4ai
crawl4ai-setup
```

> **Note**: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install if you actually need dynamic page support.
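
Before running the install commands above, it can help to see which packages are already present. The sketch below is one possible check using only the standard library. The `missing_packages` helper and the two dictionaries are illustrative names, not part of the skill. The import names (`bs4` for `beautifulsoup4`, `readability` for `readability-lxml`) are the modules those packages normally provide.

```python
import importlib.util

# pip package name -> importable module name
LIGHTWEIGHT = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "readability-lxml": "readability",
    "html2text": "html2text",
    "ddgs": "ddgs",
}
DYNAMIC = {"crawl4ai": "crawl4ai"}


def missing_packages(packages: dict[str, str]) -> list[str]:
    """Return pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(LIGHTWEIGHT)
    if missing:
        # Print the install command rather than running it automatically.
        print("pip install " + " ".join(missing))
    else:
        print("lightweight dependencies are installed")
```

Checking `DYNAMIC` separately mirrors the split between the two install commands, so the Chromium download is only suggested when dynamic crawling is actually wanted.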

> **CRITICAL — Dependency Error Recovery**: If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the command above, then **re-run the EXACT SAME script command that failed**. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.
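
The recovery rule above (install, then re-run the identical command) could be automated roughly as follows. This is a sketch under stated assumptions: `run_with_recovery` and `IMPORT_ERROR_MARKERS` are hypothetical names, and it assumes the failing script reports the missing module on stderr, as an uncaught Python `ImportError` normally does.

```python
import subprocess

# Substrings that indicate a missing-module failure in a Python traceback.
IMPORT_ERROR_MARKERS = ("ImportError", "ModuleNotFoundError", "No module named")


def run_with_recovery(command, install_command, runner=subprocess.run):
    """Run command; on a missing-module failure, install and re-run it unchanged."""
    result = runner(command, capture_output=True, text=True)
    stderr = result.stderr or ""
    if result.returncode != 0 and any(m in stderr for m in IMPORT_ERROR_MARKERS):
        # Install the dependencies, failing loudly if pip itself fails.
        runner(install_command, check=True)
        # Re-run the exact same command, not an inline substitute.
        result = runner(command, capture_output=True, text=True)
    return result
```

The `runner` parameter exists only so the function can be exercised without spawning processes. In normal use it stays at its `subprocess.run` default.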

### 2. Fetch a web page (static — recommended first choice)

Use this for most websites. It's fast, lightweight, and works for articles, docs, blogs, etc.

```bash
python scripts/fetch_page.py "URL"
```
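
When the command above needs to be driven from Python rather than a shell, a thin wrapper is usually enough. The sketch below assumes the script prints its Markdown to stdout, which the preview does not state explicitly. `fetch_markdown` is a hypothetical helper name.

```python
import subprocess
import sys


def fetch_markdown(url, script="scripts/fetch_page.py", runner=subprocess.run):
    """Run the fetch script for one URL and return whatever it printed."""
    # Passing a list avoids shell quoting problems with '&' or '?' in URLs.
    command = [sys.executable, script, url]
    # check=True raises CalledProcessError if the script exits non-zero.
    result = runner(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Using `sys.executable` runs the script with the same interpreter, and therefore the same installed packages, as the calling process.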

Options:
- `--raw` — Output full page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"` — Extract only elements matching the CSS selector (e.g

Preview truncated. Download the full skill package to view the complete file contents.

Install command

npx skills add web-scraper
