Building a Data Collection Pipeline

Architecture

Sources → Collector → Cleaner → Transformer → Storage
   ↑                                             ↓
Scheduler ←──── Status monitor ←──────────── Feedback
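As a rough illustration of the flow above, the stages can be modelled as plain functions chained by a small runner. Every name here (`collect`, `clean`, `transform`, `run_pipeline`) is hypothetical, the collector returns canned data instead of fetching anything, and an in-memory list stands in for real storage. The scheduler and monitoring loop are reduced to a status dictionary that the runner returns.

```python
from typing import Iterable

Item = dict

def collect(sources: Iterable[dict]) -> list[Item]:
    # Hypothetical collector: real code would fetch each source over the network.
    return [{"url": s["url"], "title": "  Example   title "} for s in sources]

def clean(item: Item) -> Item:
    # Trim the title and collapse runs of whitespace.
    return {**item, "title": " ".join(item["title"].split())}

def transform(item: Item) -> Item:
    # Add a derived field; stands in for any schema mapping.
    return {**item, "title_length": len(item["title"])}

def run_pipeline(sources: Iterable[dict], storage: list) -> dict:
    """Run collect -> clean -> transform -> store and return counters for monitoring."""
    status = {"collected": 0, "stored": 0, "errors": 0}
    for raw in collect(sources):
        status["collected"] += 1
        try:
            storage.append(transform(clean(raw)))
            status["stored"] += 1
        except (KeyError, TypeError):
            # A malformed item is counted, not fatal, so the run can continue.
            status["errors"] += 1
    return status

storage: list = []
status = run_pipeline([{"type": "web", "url": "https://example.com/news"}], storage)
```

A scheduler could read the returned counters to decide whether to retry or back off, which is one way to close the feedback loop shown in the diagram.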

Step 1: Define data sources

{
  "sources": [
    {"type": "rss", "url": "https://hnrss.org/newest?points=100"},
    {"type": "api", "url": "https://api.github.com/repos/{owner}/{repo}/releases"},
    {"type": "web", "url": "https://example.com/news", "selector": ".article-item"}
  ]
}
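One possible way to consume this config is sketched below: parse the JSON, check that each entry has the fields its type needs, and fill the `{owner}` and `{repo}` placeholders. The helper `load_sources`, the `REQUIRED_FIELDS` table, and the sample values `example-owner` / `example-repo` are assumptions made for illustration; the source only shows the config itself.

```python
import json

# Fields each source type needs. That "web" requires a selector is an
# assumption inferred from the example config.
REQUIRED_FIELDS = {"rss": {"url"}, "api": {"url"}, "web": {"url", "selector"}}

def load_sources(config_text: str, **url_params: str) -> list[dict]:
    """Parse the config, validate each entry, and fill URL placeholders."""
    sources = []
    for entry in json.loads(config_text)["sources"]:
        kind = entry.get("type")
        if kind not in REQUIRED_FIELDS:
            raise ValueError(f"unknown source type: {kind!r}")
        missing = REQUIRED_FIELDS[kind] - set(entry)
        if missing:
            raise ValueError(f"{kind} source is missing {sorted(missing)}")
        # Fill placeholders such as {owner} and {repo}; URLs without
        # placeholders pass through unchanged.
        sources.append({**entry, "url": entry["url"].format(**url_params)})
    return sources

CONFIG = """
{
  "sources": [
    {"type": "rss", "url": "https://hnrss.org/newest?points=100"},
    {"type": "api", "url": "https://api.github.com/repos/{owner}/{repo}/releases"},
    {"type": "web", "url": "https://example.com/news", "selector": ".article-item"}
  ]
}
"""

# Hypothetical placeholder values, used only to show the substitution.
sources = load_sources(CONFIG, owner="example-owner", repo="example-repo")
```

Validating at load time means a misconfigured source fails before any collection run starts, rather than partway through one.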

Step 2: Cleaning rules

{
  "cleaning": [
    {"field": "title", "rules": ["strip", "normalize_whitespace"]},
    {"field": "content", "rules": ["strip_html", "remove_ads", "truncate:5000"]},
    {"field": "date", "rules": ["parse_iso8601", "timezone:UTC"]},
    {"field": "tags", "rules": ["lowercase", "deduplicate"]}
  ]
}
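The rule names above could be interpreted by a small dispatcher like the one sketched here. The source does not define the rules' semantics, so every implementation is an assumption: `strip_html` uses a naive regular expression, `remove_ads` is a pass-through placeholder because ad removal is site-specific, and `timezone:UTC` treats naive timestamps as already being in UTC.

```python
import html
import re
from datetime import datetime, timezone

def strip_html(text: str) -> str:
    # Naive tag removal for illustration; a real HTML parser is more robust.
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def to_utc(value: datetime) -> datetime:
    # Assumption: a timestamp without an offset is already in UTC.
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)

RULES = {
    "strip": lambda s: s.strip(),
    "normalize_whitespace": lambda s: " ".join(s.split()),
    "strip_html": strip_html,
    "remove_ads": lambda s: s,  # placeholder: no ad removal is performed
    # Before Python 3.11, fromisoformat() rejects a trailing "Z".
    "parse_iso8601": datetime.fromisoformat,
    "lowercase": lambda tags: [t.lower() for t in tags],
    # dict.fromkeys keeps the first occurrence and preserves order.
    "deduplicate": lambda tags: list(dict.fromkeys(tags)),
}

def apply_rule(value, rule: str):
    """Apply one rule; parameterized rules use the form "name:argument"."""
    name, _, arg = rule.partition(":")
    if name == "truncate":
        return value[: int(arg)]
    if name == "timezone":
        if arg != "UTC":
            raise ValueError(f"unsupported timezone in this sketch: {arg!r}")
        return to_utc(value)
    return RULES[name](value)

def clean_item(item: dict, cleaning: list[dict]) -> dict:
    """Return a cleaned copy of item, applying each field's rules in order."""
    cleaned = dict(item)
    for spec in cleaning:
        field = spec["field"]
        if field in cleaned:
            for rule in spec["rules"]:
                cleaned[field] = apply_rule(cleaned[field], rule)
    return cleaned

CLEANING = [
    {"field": "title", "rules": ["strip", "normalize_whitespace"]},
    {"field": "content", "rules": ["strip_html", "remove_ads", "truncate:5000"]},
    {"field": "date", "rules": ["parse_iso8601", "timezone:UTC"]},
    {"field": "tags", "rules": ["lowercase", "deduplicate"]},
]

cleaned = clean_item(
    {
        "title": "  Hello   world \n",
        "content": "<p>Hi &amp; bye</p>",
        "date": "2024-01-01T08:00:00+08:00",
        "tags": ["Python", "python", "ETL"],
    },
    CLEANING,
)
```

Rule order matters: `lowercase` runs before `deduplicate` so that "Python" and "python" collapse into one tag.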

Step 3: Deduplication strategy

# Deduplicate by URL: keep only the first item seen for each URL
def dedupe_by_url(new_items):
    seen_urls = set()
    for item in new_items:
        if item['url'] not in seen_urls:
            seen_urls.add(item['url'])
            yield item

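# Deduplicate by content similarity (threshold 0.95)
from difflib import SequenceMatcher

def dedupe_by_similarity(candidates, existing, threshold=0.95):
    for new_item in candidates:
        # Keep a candidate only if no existing item is more than `threshold` similar
        if not any(SequenceMatcher(None, new_item['content'], old['content']).ratio() > threshold for old in existing):
            yield new_item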
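The similarity check compares every candidate with every existing item, and `ratio()` is the expensive part. One possible refinement, sketched below, first checks `real_quick_ratio()` and `quick_ratio()`, which difflib documents as upper bounds on `ratio()`, so most non-matches are rejected cheaply. The helper names are hypothetical. Unlike the check described above, this version also compares each candidate against items already accepted in the same batch, which is an added assumption about the desired behavior.

```python
from difflib import SequenceMatcher

def is_near_duplicate(text: str, others, threshold: float = 0.95) -> bool:
    """Return True if text is more than `threshold` similar to any string in others."""
    for other in others:
        matcher = SequenceMatcher(None, text, other)
        # Both quick variants are upper bounds on ratio(), so a value at or
        # below the threshold rules out a match without the full comparison.
        if (matcher.real_quick_ratio() > threshold
                and matcher.quick_ratio() > threshold
                and matcher.ratio() > threshold):
            return True
    return False

def dedupe_by_similarity_fast(candidates, existing, threshold: float = 0.95):
    """Yield candidates that are not near-duplicates of existing or earlier items."""
    seen = [old["content"] for old in existing]
    for item in candidates:
        if not is_near_duplicate(item["content"], seen, threshold):
            seen.append(item["content"])  # later candidates are checked against it
            yield item

base = "The quick brown fox jumps over the lazy dog near the river bank"
other = "Completely unrelated text about database indexing and query plans"
existing = [{"content": base}]
candidates = [{"content": base + "!"}, {"content": other}, {"content": other + "."}]
unique = list(dedupe_by_similarity_fast(candidates, existing))
```

For long texts, `SequenceMatcher`'s automatic junk heuristic (`autojunk`, on by default for sequences of 200 or more items) can lower `ratio()`; passing `autojunk=False` avoids that at some cost in speed.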
