High-performance, production-ready address similarity engine with built-in data generation, FAISS indexing, and system monitoring.
faiss-addr-sim-engine offers an end-to-end pipeline to generate realistic, human-like address variants and perform large-scale cosine-similarity searches using FAISS:
- Parallelized Address Generator: Applies configurable omission rates and typos to Chinese addresses, producing raw vs. human-like pairs.
- Standardization & FAISS Indexer: Chunks 300K+ unique addresses, vectorizes with `HashingVectorizer`, L2-normalizes, compresses to `.npz`, and builds CPU/GPU indices.
- Multi-Process Similarity Search: Uses `multiprocessing.Pool`, falls back automatically from GPU to CPU, and merges per-row results into a gzipped CSV.
- System Monitoring: Logs CPU%, RAM, and (optionally) GPU stats to CSV. Generates benchmark and resource-usage plots automatically.
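The vectorize, normalize, and search steps can be pictured with a small standard-library sketch. It is not the project's code: a CRC32-based character n-gram hasher stands in for scikit-learn's `HashingVectorizer`, and a brute-force dot product stands in for a FAISS index. The values of `N_FEATURES` and `NGRAM_RANGE` and the sample strings are made up for illustration.

```python
import math
import zlib
from heapq import nlargest

N_FEATURES = 2 ** 12   # hypothetical value, not the project's setting
NGRAM_RANGE = (1, 2)   # hypothetical character n-gram range


def hashed_ngrams(text, n_features=N_FEATURES, ngram_range=NGRAM_RANGE):
    """Hash character n-grams into a sparse {bucket: weight} vector, L2-normalized."""
    vec = {}
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(text) - n + 1):
            bucket = zlib.crc32(text[i:i + n].encode("utf-8")) % n_features
            vec[bucket] = vec.get(bucket, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else vec


def dot(a, b):
    """Sparse dot product; for unit-length vectors this is the cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def top_k(query, corpus, k=4):
    """Return the k (score, text) pairs most similar to the query, best first."""
    q = hashed_ngrams(query)
    scored = [(dot(q, hashed_ngrams(doc)), doc) for doc in corpus]
    return nlargest(k, scored)


# Made-up sample strings, for illustration only.
corpus = ["Flat A 12/F Tower 1 Example Garden", "Flat B 3/F Example Court"]
print(top_k("Flat A 12F Tower 1 Example Gdn", corpus, k=1))
```

Because every vector is L2-normalized, ranking by dot product is the same as ranking by cosine similarity, which is why a plain dot product can serve as a fallback when no index is available.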
- Configurable omission/typo rates per component: 分區 (district), 地區 (area), 城鎮 (town), 道路 (road), 屋苑名稱 (estate name)
- Randomized floor, unit, and ordering for human realism
- Batch CSV output with progress logging
- Chunks of 1,000–50,000 vectors, L2-normalized
- Compressed storage via `.npz`
- CPU & GPU support with automatic fallback
- Parallel across CPU cores (or single GPU worker)
- Top-K cosine similarity via FAISS or dot-product fallback
- Temporary CSV per row, then merged & gzipped
- Real-time logging: CPU%, RAM (MB), and GPU utilization/memory via `nvidia-smi`
- Automated visualizations: per-address time/RAM deltas & system usage
- Robust error capture & resource-aware throttling
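As a rough illustration of the per-component omission and typo rates listed above, the sketch below drops or corrupts each address component with a configured probability. The rate values, the English component keys, and the adjacent-character swap used as a typo model are all assumptions made for this example; the real settings live in `OMISSION_RATES` and `TYPO_RATES` in `address_generator.py`.

```python
import random

# Hypothetical rates and keys, for illustration only.
OMISSION_RATES = {"district": 0.5, "road": 0.1, "estate": 0.2}
TYPO_RATES = {"district": 0.0, "road": 0.1, "estate": 0.1}


def add_typo(text, rng):
    """Swap two adjacent characters; a simple stand-in for a typo model."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def humanize(components, rng, omission=OMISSION_RATES, typo=TYPO_RATES):
    """Omit or corrupt each component according to its rate, then join the rest."""
    parts = []
    for name, value in components.items():
        if rng.random() < omission.get(name, 0.0):
            continue  # this component is left out of the variant
        if rng.random() < typo.get(name, 0.0):
            value = add_typo(value, rng)
        parts.append(value)
    return " ".join(parts)


# Made-up address components; a seeded generator keeps runs repeatable.
raw = {"district": "Example District", "road": "Example Road", "estate": "Example Garden"}
print(humanize(raw, random.Random(42)))
```

Passing a seeded `random.Random` instance rather than using the module-level functions makes each worker's output reproducible, which helps when variants are generated in parallel batches.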
- Clone & Install

      git clone https://github.com/Jyusi/faiss-addr-sim-engine.git
      cd faiss-addr-sim-engine
      pip install -r requirements.txt

- Generate Addresses

      python address_generator.py \
        --input input_addresses.csv \
        --output all_generated_addresses.csv \
        --batches 15

- Build FAISS Index

      python Cosine_Similarity_v1.py --build-index

- Run Similarity Search

      python Cosine_Similarity_v1.py --search \
        --input all_generated_addresses.csv \
        --output similarity_results.csv.gz \
        --top-k 4

- View logs and plots
  - `address_generator.log` and `system_monitoring.csv`
  - `benchmark_report.csv` and `benchmark_visualisation.png`
  - `system_monitoring_visualisation.png`
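Once the search has written `similarity_results.csv.gz`, a quick way to check the output is to read the header and the first few rows. The helper below is a hypothetical convenience, not part of the repository, and it assumes the file is UTF-8 encoded with a header row; the actual column names are whatever the search script emits.

```python
import csv
import gzip


def preview_results(path, limit=5):
    """Return the header and up to `limit` data rows from a gzipped CSV file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        rows = [row for _, row in zip(range(limit), reader)]
    return header, rows


# Example usage, assuming the search step has already produced the file:
# header, rows = preview_results("similarity_results.csv.gz")
# print(header)
# for row in rows:
#     print(row)
```

Opening the file in text mode through `gzip.open` streams it row by row, so the preview does not need to decompress the whole result set first.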
- `OMISSION_RATES` & `TYPO_RATES`: Tune per component in `address_generator.py`.
- `N_FEATURES`, `NGRAM_RANGE`, `STD_BATCH_SIZE`: Adjust in `Cosine_Similarity_v1.py`.
- `N_PARALLEL_HUMAN_ADDRESSES`: Control worker count.
- `MONITORING_INTERVAL_SEC`: Change monitoring frequency.
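To show what a setting like `MONITORING_INTERVAL_SEC` controls, here is a minimal sampling loop that writes one CSV row per interval. It is a sketch under assumptions, not the project's monitor: the `sample` callable, the column names, and the interval value are hypothetical, and the placeholder sampler returns constants where a real one might read CPU and RAM figures from a library such as `psutil`.

```python
import csv
import sys
import time

MONITORING_INTERVAL_SEC = 0.01  # hypothetical value chosen to keep the demo fast


def monitor(sample, out, n_samples, interval=MONITORING_INTERVAL_SEC):
    """Write a header plus n_samples rows of (timestamp, cpu, ram) to `out`."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "cpu_percent", "ram_mb"])  # assumed column names
    for _ in range(n_samples):
        cpu, ram = sample()
        writer.writerow([f"{time.time():.3f}", cpu, ram])
        time.sleep(interval)


def placeholder_sample():
    """Stand-in sampler returning constant values instead of real measurements."""
    return 0.0, 0.0


monitor(placeholder_sample, sys.stdout, n_samples=2)
```

A shorter interval gives finer-grained plots at the cost of a larger log and slightly more overhead from the monitor itself.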
- Fork & create a feature branch.
- Write tests for new functionality.
- Keep logging robust and resource-aware.
- Submit a PR with detailed descriptions and benchmarks.