[firecrawl] Firecrawl Advanced Scraping Guide

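Overview

Firecrawl goes beyond a simple web crawler. Its advanced options cover complex websites, PDFs, dynamic pages, and structured data extraction, which makes it practical for real scraping work. This post covers Firecrawl's advanced scraping options, practical usage, example requests, and links to the official documentation in one place.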

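What Is Advanced Scraping in Firecrawl?

Firecrawl exposes several endpoints, including /scrape, /crawl, and /map, and handles everything from a single page to a large site, as well as PDFs, dynamic content, and LLM-based structured extraction. The advanced options let you narrow a request down to just the data you need, quickly and accurately.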

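Key Options and Parameters

formats: output formats to return (markdown, html, rawHtml, links, screenshot, json)
onlyMainContent: return only the main content (true) or the full page (false)
includeTags/excludeTags: HTML tags, classes, or IDs to include or exclude
waitFor: time to wait for the page to load (ms)
timeout: maximum time to wait (ms)
parsePDF: whether to parse PDFs automatically
extract: options for LLM-based structured extraction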

Example: Combining Several Options

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": ["markdown", "links", "html", "rawHtml", "screenshot"],
      "includeTags": ["h1", "p", "a", ".main-content"],
      "excludeTags": ["#ad", "#footer"],
      "onlyMainContent": false,
      "waitFor": 1000,
      "timeout": 15000
    }'
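
The same request can be assembled from a script. The following is a minimal sketch using only the Python standard library; it builds the POST request from the curl example but does not send it. The function name build_scrape_request is hypothetical, and the endpoint, headers, and option values are copied from the example above.

```python
import json
import urllib.request

API_URL = "https://api.firecrawl.dev/v1/scrape"  # endpoint taken from the curl example


def build_scrape_request(api_key: str, url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "url": url,
        "formats": ["markdown", "links", "html", "rawHtml", "screenshot"],
        "includeTags": ["h1", "p", "a", ".main-content"],
        "excludeTags": ["#ad", "#footer"],
        "onlyMainContent": False,
        "waitFor": 1000,   # ms to wait for the page to load
        "timeout": 15000,  # ms before giving up
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_scrape_request("YOUR_API_KEY", "https://docs.firecrawl.dev")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before spending API credits.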

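Using PDF, Dynamic Page, and LLM Extraction

PDF scraping: PDF links are supported by default; control parsing with parsePDF: true/false
Dynamic pages: combine actions such as wait, click, write, and press to extract dynamically rendered content
LLM extraction: use a prompt/schema to extract structured data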
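
As an illustration of the action sequence mentioned above (wait, click, write, press), here is a hedged sketch of a request body for a /scrape call on a dynamic page. The key names "actions", "type", "milliseconds", "selector", "text", and "key" are assumptions that do not appear in this post, so check them against the API reference for your version. The CSS selector and the helper name build_actions_payload are hypothetical.

```python
import json


def build_actions_payload(url: str, query: str) -> dict:
    """Body for a scrape that types a query and submits it before capturing the page.

    All field names inside "actions" are assumed, not taken from this post.
    """
    return {
        "url": url,
        "formats": ["markdown"],
        "actions": [
            {"type": "wait", "milliseconds": 2000},             # let the page settle
            {"type": "click", "selector": "input[name='q']"},   # hypothetical selector
            {"type": "write", "text": query},                   # type into the focused field
            {"type": "press", "key": "ENTER"},                  # submit
            {"type": "wait", "milliseconds": 3000},             # wait for results to render
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_actions_payload("https://example.com", "firecrawl"), indent=2))
```

The order of the list matters here: each step is written on the assumption that actions run sequentially, so a wait follows every step that triggers rendering.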

LLM Extraction Example

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://firecrawl.dev",
      "formats": ["markdown", "json"],
      "json": {
        "prompt": "Extract the features of the product"
      }
    }'
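
The example above passes only a prompt. Since the text also mentions a schema, the sketch below shows one way a JSON Schema might be attached to the same request body. Placing "schema" next to "prompt" inside the "json" object is an assumption; the exact key name and location may differ between API versions. FEATURES_SCHEMA and build_json_extract_payload are hypothetical names.

```python
import json
from typing import Optional

# Hypothetical JSON Schema describing the shape we want back.
FEATURES_SCHEMA = {
    "type": "object",
    "properties": {
        "features": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["features"],
}


def build_json_extract_payload(url: str, prompt: str, schema: Optional[dict] = None) -> dict:
    """Body for an LLM extraction request, mirroring the curl example."""
    payload = {
        "url": url,
        "formats": ["markdown", "json"],
        "json": {"prompt": prompt},
    }
    if schema is not None:
        # Assumed placement: alongside "prompt" in the same object.
        payload["json"]["schema"] = schema
    return payload


if __name__ == "__main__":
    body = build_json_extract_payload(
        "https://firecrawl.dev",
        "Extract the features of the product",
        FEATURES_SCHEMA,
    )
    print(json.dumps(body, indent=2))
```

A schema is worth adding when downstream code depends on specific fields, because a prompt alone leaves the output shape up to the model.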

Practical Examples: Crawling and Mapping

Crawling multiple pages

curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "includePaths": ["^/blog/.*$", "^/docs/.*$"],
      "excludePaths": ["^/admin/.*$", "^/private/.*$"],
      "maxDepth": 2,
      "limit": 1000
    }'
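
To preview which URLs the includePaths and excludePaths patterns from the request above would let through, the patterns can be tried locally before starting a crawl. This is only an approximation: it assumes the patterns are matched against the URL path and that exclusions take precedence, neither of which is confirmed in this post. The helper name path_allowed is hypothetical.

```python
import re
from urllib.parse import urlparse

INCLUDE_PATHS = ["^/blog/.*$", "^/docs/.*$"]
EXCLUDE_PATHS = ["^/admin/.*$", "^/private/.*$"]


def path_allowed(url: str, include=INCLUDE_PATHS, exclude=EXCLUDE_PATHS) -> bool:
    """Return True if the URL's path passes the include/exclude regex filters.

    Assumption: exclusions win over inclusions, and an empty include list
    means "allow everything that is not excluded".
    """
    path = urlparse(url).path
    if any(re.search(pattern, path) for pattern in exclude):
        return False
    if not include:
        return True
    return any(re.search(pattern, path) for pattern in include)


if __name__ == "__main__":
    for candidate in (
        "https://docs.firecrawl.dev/blog/launch",
        "https://docs.firecrawl.dev/admin/users",
        "https://docs.firecrawl.dev/pricing",
    ):
        print(candidate, path_allowed(candidate))
```

Note that a pattern such as ^/blog/.*$ requires the trailing slash, so the bare path /blog does not match it in this sketch.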

Mapping (extracting related links within a site)

curl -X POST https://api.firecrawl.dev/v1/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'
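
Once the map call returns, the usual next step is to pick out the links worth scraping. The sketch below filters a parsed response by keyword. The response shape, a JSON object with a "success" flag and a "links" array of URL strings, is an assumption not shown in this post, and links_matching is a hypothetical helper.

```python
from typing import List


def links_matching(map_response: dict, keyword: str) -> List[str]:
    """Return links from a parsed /map response that contain the keyword.

    Assumes the response looks like {"success": true, "links": ["https://...", ...]}.
    """
    if not map_response.get("success"):
        return []
    return [link for link in map_response.get("links", []) if keyword in link]


if __name__ == "__main__":
    # Hand-written sample in the assumed shape, not real API output.
    sample = {
        "success": True,
        "links": [
            "https://docs.firecrawl.dev/features/scrape",
            "https://docs.firecrawl.dev/features/crawl",
            "https://docs.firecrawl.dev/introduction",
        ],
    }
    print(links_matching(sample, "/features/"))
```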

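Option Summary (Table)

Option	Description	Example value
formats	Output formats to return	["markdown", "html", "links"]
onlyMainContent	Whether to return only the main content	true/false
includeTags	Tags/classes/IDs to include	["h1", ".main-content"]
excludeTags	Tags/classes/IDs to exclude	["#ad", "#footer"]
waitFor	Time to wait for the page to load (ms)	1000
timeout	Maximum time to wait (ms)	15000
parsePDF	Whether to parse PDFs automatically	true/false
prompt/schema	Prompt/schema for LLM extraction	"Extract the features of the product"
includePaths	Paths to include in the crawl (regex)	["^/blog/.*$", "^/docs/.*$"]
excludePaths	Paths to exclude from the crawl (regex)	["^/admin/.*$", "^/private/.*$"]
maxDepth	Maximum crawl depth	2
limit	Maximum number of pages to crawl	1000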