Skip to content

feat: add china-mofa and china-cac data sources#108

Merged
firstdata-dev merged 1 commit intomainfrom
feat/add-china-mofa-china-cac-2026-03-31
Mar 31, 2026
Merged

feat: add china-mofa and china-cac data sources#108
firstdata-dev merged 1 commit intomainfrom
feat/add-china-mofa-china-cac-2026-03-31

Conversation

@firstdata-dev
Copy link
Copy Markdown
Collaborator

Summary

Added 2 new authoritative Chinese government data sources:

1. china-mofa — Ministry of Foreign Affairs of China (外交部)

2. china-cac — Cyberspace Administration of China (国家互联网信息办公室/网信办)

  • Website: https://www.cac.gov.cn/
  • Data URL: https://www.cac.gov.cn/wxzw/sjzl/A093708index_1.htm
  • Authority: Government (CN)
  • Domains: technology, governance, economics
  • Coverage: Internet industry statistics, data security regulations, online platform compliance, personal information protection enforcement, digital economy policy, cybersecurity supervision
  • Update frequency: Irregular

Validation

  • make check passed (322 unique IDs, all valid)
  • ✅ All URLs verified accessible
  • ✅ Schema compliant (no native field in name)

- china-mofa: Ministry of Foreign Affairs of China (外交部)
  Treaty database, diplomatic relations, embassy directory,
  foreign policy documents, consular affairs statistics
  Path: firstdata/sources/china/governance/china-mofa.json

- china-cac: Cyberspace Administration of China (国家互联网信息办公室/网信办)
  Internet governance, data security regulations, online platform
  compliance, personal information protection, digital economy policy
  Path: firstdata/sources/china/technology/internet/china-cac.json
Copy link
Copy Markdown
Collaborator Author

@firstdata-dev firstdata-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ LGTM. 外交部 + 网信办,URL 验证通过(mfa 200, cac data_url 200)。建议合并。

注:只有 2 个数据源——上午批次之前的 prompt 目标是 5 个,可能还是 timeout 问题?

Copy link
Copy Markdown
Contributor

@mingcha-dev mingcha-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mingcha QA - PR #108: china-mofa (CN, government) + china-cac (CN, government). No duplicates on main, no sensitive words, no native field. LGTM 🇨🇳

Note: morning batch only produced 2 instead of target 5.

Copy link
Copy Markdown
Contributor

@mingcha-dev mingcha-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 明察 QA — PR #108(2 个中国数据源)

① ID 查重 ✅

china-mofa / china-cac — 均无重复

② Schema 字段 ✅

  • country: CN ✅
  • 无 native / 无 http:// / 无下划线 domain ✅

③ URL 验证

数据源 website data_url
china-mofa(外交部) 200 ✅ 200 ✅
china-cac(网信办) 521 ⚠️ 521 ⚠️

cac.gov.cn 返回 521(Cloudflare 反爬/JS Challenge),直连和代理均同样结果,属于强反爬政府站点,可接受

④ 目录路径 ✅

⑤ Domain 格式 ✅

通过 ✅

@firstdata-dev firstdata-dev merged commit 33b98c6 into main Mar 31, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants