---
name: web-scraper
description: Scrape web pages, search the internet, and extract structured content using Python. Use when the user wants to fetch a webpage, search for information online, extract links, or crawl JavaScript-rendered dynamic pages.
compatibility: Requires Python 3. Lightweight mode needs requests, beautifulsoup4, readability-lxml, html2text. Dynamic mode needs crawl4ai. Search needs ddgs (formerly duckduckgo-search).
---

# Web Scraper

Fetch, search, and extract content from websites.

## When to use this skill

- User asks to fetch or read a webpage / URL
- User wants to search the internet for information
- User needs to extract links, tables, or structured data from a website
- User asks to crawl a JavaScript-rendered (dynamic) page
- User wants web content converted to clean Markdown for analysis

## Scripts overview

| Script | Purpose | Dependencies |
|---|---|---|
| `fetch_page.py` | Fetch a URL and extract readable content as Markdown | `requests`, `beautifulsoup4`, `readability-lxml`, `html2text` |
| `search_web.py` | Search the web via DuckDuckGo | `ddgs` |
| `crawl_dynamic.py` | Crawl JS-rendered pages with a headless browser | `crawl4ai` |
| `extract_links.py` | Extract and categorize all links from a page | `requests`, `beautifulsoup4` |

## Steps

### 1. Install dependencies (first time only)

For lightweight scraping (static pages, search, link extraction):

```bash
pip install requests beautifulsoup4 readability-lxml html2text ddgs
```

For dynamic / JavaScript-rendered pages (heavier, installs Playwright + Chromium):

```bash
pip install crawl4ai
crawl4ai-setup
```

> **Note**: `crawl4ai-setup` downloads a Chromium browser (~150 MB). Only install it if you actually need dynamic page support.

> **CRITICAL: Dependency Error Recovery**: If ANY script below fails with an `ImportError` or "module not found" error, install the missing dependencies using the matching command above, then **re-run the EXACT SAME script command that failed**. Do NOT write inline Python code (`python -c "..."`) or your own ad-hoc scripts as a substitute. These scripts handle encoding, error handling, and output formatting that inline code will miss.

### 2. Fetch a web page (static, recommended first choice)

Use this for most websites. It is fast, lightweight, and works for articles, docs, blogs, etc.

```bash
python scripts/fetch_page.py "URL"
```

Options:

- `--raw`: Output full page Markdown instead of extracted article content
- `--selector "CSS_SELECTOR"`: Extract only elements matching the CSS selector (e.g
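
The split between lightweight and dynamic dependencies in step 1 can be checked up front, before any script fails with an `ImportError`. The block below is a minimal sketch of such a pre-flight check, not part of the skill itself: the names `LIGHTWEIGHT`, `DYNAMIC`, `missing_packages`, and `install_hint` are hypothetical, and the import-name mapping (`bs4` for `beautifulsoup4`, `readability` for `readability-lxml`) is assumed from how those packages are normally imported. It only reports what is missing; the actual fetching and crawling should still go through the scripts in `scripts/`.

```python
import importlib.util

# Import name -> pip package name, taken from the install commands in step 1.
# These dictionaries and helpers are illustrative, not shipped with the skill.
LIGHTWEIGHT = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "readability": "readability-lxml",
    "html2text": "html2text",
    "ddgs": "ddgs",
}
DYNAMIC = {"crawl4ai": "crawl4ai"}


def missing_packages(required):
    """Return the sorted pip names of packages whose import name is not found."""
    return sorted(
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    )


def install_hint(required):
    """Build a pip command for whatever is missing, or None if nothing is."""
    missing = missing_packages(required)
    return "pip install " + " ".join(missing) if missing else None


if __name__ == "__main__":
    for label, required in (("lightweight", LIGHTWEIGHT), ("dynamic", DYNAMIC)):
        hint = install_hint(required)
        print(f"{label}: {hint or 'all dependencies present'}")
```

If the check prints a `pip install ...` line, run it, then re-run the same script command that was about to be used. Note that a present `crawl4ai` package does not prove that `crawl4ai-setup` has already downloaded the browser; this sketch does not cover that step.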