jyusiwong/faiss_addr_sim_engine

**faiss-addr-sim-engine**

High-performance, production-ready address similarity engine with built-in data generation, FAISS indexing, and system monitoring.


📄 Description

faiss-addr-sim-engine offers an end-to-end pipeline to generate realistic, human-like address variants and perform large-scale cosine-similarity searches using FAISS:

  • Parallelized Address Generator
    Applies configurable omission rates and typos to Chinese addresses, producing raw vs. human-like pairs.

  • Standardization & FAISS Indexer
    Chunks 300K+ unique addresses, vectorizes with HashingVectorizer, normalizes (L2), compresses to .npz, and builds CPU/GPU indices.

  • Multi-Process Similarity Search
    Uses multiprocessing.Pool, falls back automatically from GPU to CPU, and merges per-row results into a gzipped CSV.

  • System Monitoring
    Logs CPU%, RAM, and (optionally) GPU stats to CSV. Generates benchmark and resource-usage plots automatically.
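
The vectorize, normalize, and search steps above can be illustrated with a small standard-library sketch. It is not the project's code: a CRC32-based character n-gram hasher stands in for scikit-learn's HashingVectorizer, and a plain dot product stands in for a FAISS inner-product index. The values of N_FEATURES and NGRAM_RANGE, the function names, and the sample addresses are all assumptions made for illustration. The point it shows is that once vectors are L2-normalized, their dot product equals their cosine similarity.

```python
import heapq
import math
import zlib

# Hypothetical values; the project sets these in Cosine_Similarity_v1.py.
N_FEATURES = 2 ** 12
NGRAM_RANGE = (1, 3)


def char_ngrams(text, ngram_range=NGRAM_RANGE):
    """Yield every character n-gram of text for n in the inclusive range."""
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hash_vectorize(text, n_features=N_FEATURES):
    """Hash n-grams into a sparse {bucket: weight} vector and L2-normalize it."""
    vec = {}
    for gram in char_ngrams(text):
        bucket = zlib.crc32(gram.encode("utf-8")) % n_features
        vec[bucket] = vec.get(bucket, 0.0) + 1.0
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return vec  # empty input gives an empty vector
    return {bucket: w / norm for bucket, w in vec.items()}


def cosine(a, b):
    """Dot product of two L2-normalized sparse vectors (their cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(bucket, 0.0) for bucket, w in a.items())


def top_k(query, corpus, k=4):
    """Return the k (score, address) pairs most similar to the query."""
    q = hash_vectorize(query)
    scored = ((cosine(q, hash_vectorize(doc)), doc) for doc in corpus)
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    # Made-up sample addresses, used only to exercise the functions.
    corpus = ["九龍旺角彌敦道100號", "香港中環皇后大道中1號", "九龍旺角彌敦道100号"]
    for score, address in top_k("九龍旺角彌敦道100號", corpus, k=2):
        print(f"{score:.3f}  {address}")
```

At the scale the README describes (300K+ addresses), the per-query loop over the corpus is what a FAISS index would replace. The normalization step is what makes an inner-product search return cosine scores.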


✨ Key Features

Address Generator
  • Configurable omission/typo rates per component: 分區 (district), 地區 (area), 城鎮 (town), 道路 (road), 屋苑名稱 (estate name)
  • Randomized floor, unit, and ordering for human realism
  • Batch CSV output with progress logging
FAISS Indexing
  • Chunks of 1,000–50,000 vectors, L2-normalized
  • Compressed storage via `.npz`
  • CPU & GPU support with automatic fallback
Similarity Search
  • Parallel across CPU cores (or single GPU worker)
  • Top-K cosine similarity via FAISS or dot-product fallback
  • Temporary CSV per row, then merged & gzipped
System Monitoring & Benchmarking
  • Real-time logging: CPU%, RAM MB, GPU util/mem via NVIDIA-SMI
  • Automated visualizations: per-address time/RAM deltas & system usage
  • Robust error capture & resource-aware throttling
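
The monitoring feature could look roughly like the sketch below, which samples CPU, RAM, and optional GPU statistics and appends them to a CSV. Several parts are assumptions: the README does not say which library reads CPU and RAM, so psutil is used if installed and skipped otherwise. The CSV column names, the function names, and the sampling loop are hypothetical. The nvidia-smi query flags are the standard ones for machine-readable output.

```python
import csv
import shutil
import subprocess
import time

try:
    import psutil  # third-party; an assumption, the README does not name it
except ImportError:
    psutil = None

GPU_QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one 'util, mem' line from nvidia-smi into (util %, memory MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return float(util), float(mem)


def gpu_stats():
    """Return (util %, memory MiB) for the first GPU, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(GPU_QUERY, capture_output=True, text=True,
                             check=True, timeout=5).stdout
        return parse_gpu_line(out.splitlines()[0])
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return None  # treat any GPU query failure as "no GPU stats"


def sample():
    """Collect one row: timestamp, CPU %, RAM MB, GPU util, GPU memory."""
    # psutil's first cpu_percent(interval=None) call returns 0.0 by design.
    cpu = psutil.cpu_percent(interval=None) if psutil else None
    ram_mb = psutil.virtual_memory().used / 2 ** 20 if psutil else None
    gpu_util, gpu_mem = gpu_stats() or (None, None)
    return [time.time(), cpu, ram_mb, gpu_util, gpu_mem]


def monitor(path, interval_sec=1.0, n_samples=3):
    """Write n_samples rows to a CSV, pausing interval_sec between them."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        # Hypothetical column names.
        writer.writerow(["timestamp", "cpu_percent", "ram_mb",
                         "gpu_util", "gpu_mem_mb"])
        for _ in range(n_samples):
            writer.writerow(sample())
            fh.flush()  # keep the log usable if the process is killed
            time.sleep(interval_sec)
```

Returning None from gpu_stats() instead of raising keeps GPU logging optional, which matches the "(optionally) GPU stats" wording in the description.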

🚀 Quickstart

  1. Clone & Install
    git clone https://github.com/Jyusi/faiss-addr-sim-engine.git
    cd faiss-addr-sim-engine
    pip install -r requirements.txt
    
  2. Generate Addresses
    python address_generator.py \
      --input input_addresses.csv \
      --output all_generated_addresses.csv \
      --batches 15
    
  3. Build FAISS Index
    python Cosine_Similarity_v1.py --build-index
    
  4. Run Similarity Search
    python Cosine_Similarity_v1.py --search \
      --input all_generated_addresses.csv \
      --output similarity_results.csv.gz \
      --top-k 4
    
  5. View logs and plots
    • address_generator.log and system_monitoring.csv
    • benchmark_report.csv and benchmark_visualisation.png
    • system_monitoring_visualisation.png
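
The search step writes a gzipped CSV, which Key Features describes as merged from temporary per-row CSVs. The sketch below shows one way that merge and a quick read-back could work with only the standard library. The temp-file naming pattern (row_*.csv), the function names, and the assumption that every temp file starts with the same header row are hypothetical. The project's actual file layout may differ.

```python
import csv
import glob
import gzip
import os


def merge_to_gzip(tmp_dir, out_path, pattern="row_*.csv"):
    """Concatenate per-row CSVs into one gzipped CSV with a single header.

    Returns the number of data rows written.
    """
    paths = sorted(glob.glob(os.path.join(tmp_dir, pattern)))
    header_written = False
    n_rows = 0
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as fh:
                reader = csv.reader(fh)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty temp files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows


def read_gzip_csv(path, limit=5):
    """Return up to `limit` rows (header included) from a gzipped CSV."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        return [row for _, row in zip(range(limit), csv.reader(fh))]


if __name__ == "__main__":
    # Peek at the file produced by the search step in the Quickstart.
    for row in read_gzip_csv("similarity_results.csv.gz"):
        print(row)
```

Opening the output in text mode ("wt"/"rt") lets the csv module work directly on the compressed stream, so the merged file never has to exist uncompressed on disk.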

⚙️ Configuration

  • OMISSION_RATES & TYPO_RATES: Tune per-component in address_generator.py.
  • N_FEATURES, NGRAM_RANGE, STD_BATCH_SIZE: Adjust in Cosine_Similarity_v1.py.
  • N_PARALLEL_HUMAN_ADDRESSES: Control worker count.
  • MONITORING_INTERVAL_SEC: Change monitoring frequency.
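
To make the per-component tuning concrete, here is a minimal sketch of how OMISSION_RATES and TYPO_RATES might drive address generation. Everything beyond the two setting names is an assumption: the dictionary shape, the rate values, the component order, the humanize and apply_typo helpers, and the adjacent-character swap used as a typo model are illustrative only and are not taken from address_generator.py.

```python
import random

# Hypothetical rates for illustration; the real values live in address_generator.py.
OMISSION_RATES = {"分區": 0.5, "地區": 0.3, "城鎮": 0.2, "道路": 0.05, "屋苑名稱": 0.1}
TYPO_RATES = {"分區": 0.02, "地區": 0.02, "城鎮": 0.02, "道路": 0.05, "屋苑名稱": 0.05}

# Assumed ordering of components when an address is assembled.
COMPONENT_ORDER = ["分區", "地區", "城鎮", "道路", "屋苑名稱"]


def apply_typo(text, rng):
    """Swap two adjacent characters, a simple stand-in for a typo model."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def humanize(components, rng, omission_rates=OMISSION_RATES, typo_rates=TYPO_RATES):
    """Build a human-like address by randomly dropping or corrupting components."""
    parts = []
    for name in COMPONENT_ORDER:
        value = components.get(name, "")
        # Drop the component with probability omission_rates[name].
        if not value or rng.random() < omission_rates.get(name, 0.0):
            continue
        # Corrupt the component with probability typo_rates[name].
        if rng.random() < typo_rates.get(name, 0.0):
            value = apply_typo(value, rng)
        parts.append(value)
    return "".join(parts)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so runs are repeatable
    raw = {"分區": "九龍", "地區": "旺角", "道路": "彌敦道"}  # made-up sample
    for _ in range(3):
        print(humanize(raw, rng))
```

Raising a component's omission rate makes the generated variants shorter and harder to match, so these two settings effectively control how difficult the similarity search benchmark is.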

🤝 Contributing

  • Fork & create a feature branch.
  • Write tests for new functionality.
  • Keep logging robust and resource-aware.
  • Submit a PR with detailed descriptions and benchmarks.
