Geo Decision Matrix

CPU + GPU 混合架構的地理決策系統 — Spark ETL + Parquet + RAPIDS

專案簡介

地圖 API 供應商遷移評估系統，展示如何用混合架構處理 50 萬 ~ 500 萬筆 GPS 數據。

方案	時間	狀態
Node.js 單機	數小時 / OOM	❌
Dask Multi-GPU（雙卡）	420s	❌ 通訊 overhead
Spark + RAPIDS（單卡）	150s	✅

架構

Raw Data (500K GPS)
       ↓
┌──────────────────────────────┐
│  Stage 1: Spark (CPU)        │  145s
│  - ETL 清洗                   │
│  - Window Function           │
│  - Haversine (Native Func)   │
└──────────────────────────────┘
       ↓ Parquet (零拷貝)
┌──────────────────────────────┐
│  Stage 2: RAPIDS (GPU)       │  5s
│  - K-Means 聚類               │
│  - StandardScaler            │
└──────────────────────────────┘
       ↓
  Decision Matrix API

快速開始

前置需求

Docker + NVIDIA Container Toolkit
NVIDIA GPU（CUDA 11.8+）

執行

# 1. 啟動容器
docker-compose up -d
docker exec -it geo_decision_matrix bash

# 2. 執行 Pipeline
./run_pipeline.sh

# 3. 效能測試
./benchmark.sh

專案結構

geo-decision-matrix/
├── src/
│   ├── 1_data_gen.py              # 資料生成（支援 500K~5M）
│   ├── 4_decision_matrix.py       # Spark ETL (Native Functions)
│   ├── 6_ai_clustering.py         # RAPIDS GPU 聚類
│   ├── api.py                     # FastAPI Gateway
│   ├── celery_worker.py           # GPU Worker（熱啟動）
│   ├── benchmark_udf_vs_native.py # UDF vs Native 效能對比
│   └── 9_dashboard.py             # Streamlit 儀表板
├── doc/
│   ├── tech.md                    # 技術規格
│   ├── 3.2_article_content.md     # 混合架構文章
│   └── 3.2_social_posts.md        # 社群貼文
├── outputs/                       # 視覺化圖表
├── Dockerfile
├── docker-compose.yml
├── run_pipeline.sh
└── benchmark.sh

核心技術

1. Parquet vs CSV

特性	CSV	Parquet
記憶體存取	跳躍（32次）	連續（1次）
GPU 加速	❌	✅ 32x
格式	字串解析	二進制直讀

2. GPU 熱啟動

# 模型在 Worker 啟動時載入一次
model = pipeline("sentiment-analysis", device=0)  # 15s

@app.task
def analyze(driver_id):
    return model(driver_id)  # 0.25s

3. Spark Native Functions vs Python UDF

# ❌ Python UDF (JVM ↔ Python serialization overhead)
haversine_udf = F.udf(calculate_haversine, DoubleType())

# ✅ Native Functions (runs in JVM, Tungsten optimized)
distance = F.lit(R) * 2 * F.atan2(F.sqrt(a), F.sqrt(1-a))

4. 微服務 API

# 非同步提交
curl -X POST http://localhost:8000/analyze \
  -d '{"driver_id": "D_1234", "speed": 25.5}'

# 輪詢結果
curl http://localhost:8000/result/{task_id}

Benchmark

Total: 150s
├─ Spark ETL:  145s (97%)
│  ├─ 讀取:     12s
│  ├─ 清洗:    108s
│  └─ 輸出:     25s
└─ RAPIDS GPU:   5s (3%)
   ├─ 讀取:    0.8s
   ├─ K-Means: 0.5s
   └─ 輸出:    3.7s

常見問題

CUDA out of memory

import rmm
rmm.reinitialize(managed_memory=True)

WSL 2 NCCL 錯誤

export CUDA_VISIBLE_DEVICES=0
export NCCL_P2P_DISABLE=1

License

MIT

關鍵字：Spark, RAPIDS, Parquet, GPU, Data Engineering, ETL, cuML, FastAPI, Celery

作者：Blake 更新：2026-02-01

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
content		content
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
benchmark.sh		benchmark.sh
decision_matrix_map.html		decision_matrix_map.html
docker-compose.yml		docker-compose.yml
main.py		main.py
readme.md		readme.md
requirements.txt		requirements.txt
run_pipeline.sh		run_pipeline.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Geo Decision Matrix

專案簡介

架構

快速開始

前置需求

執行

專案結構

核心技術

1. Parquet vs CSV

2. GPU 熱啟動

3. Spark Native Functions vs Python UDF

4. 微服務 API

Benchmark

常見問題

CUDA out of memory

WSL 2 NCCL 錯誤

相關文章

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Geo Decision Matrix

專案簡介

架構

快速開始

前置需求

執行

專案結構

核心技術

1. Parquet vs CSV

2. GPU 熱啟動

3. Spark Native Functions vs Python UDF

4. 微服務 API

Benchmark

常見問題

CUDA out of memory

WSL 2 NCCL 錯誤

相關文章

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages