diff --git a/README.md b/README.md
index 9d3c9e95..33ae7cdd 100644
--- a/README.md
+++ b/README.md
@@ -65,7 +65,7 @@ chatpack wa chat.txt --from "Alice" --after 2024-01-01 -f json
 
 ## Features
 
-- 🚀 **Fast** — 20K+ messages/sec
+- 🚀 **Fast** — 1.6M+ messages/sec (full pipeline)
 - 📱 **Multi-platform** — Telegram, WhatsApp, Instagram, Discord
 - 🔀 **Smart merge** — Consecutive messages from same sender → one entry
 - 🎯 **Filters** — By date, by sender
@@ -255,10 +255,16 @@ chatpack tg chat.json -o out.csv # Custom output path
 
 | Metric | Value |
 |--------|-------|
-| Speed | 20-50K messages/sec |
+| Full pipeline | 1.6-1.7 M messages/sec |
+| Parsing (Instagram) | 2.6-2.8 M messages/sec |
+| Parsing (Telegram) | 1.4-2.0 M messages/sec |
+| Parsing (Discord) | 1.5-1.8 M messages/sec |
+| Operations (merge/filter) | 11-14 M messages/sec |
 | CSV compression | 13x (92% token reduction) |
 | Tested file size | 500MB+ |
 
+> Run `cargo bench --bench parsing` to reproduce benchmarks.
+
 ## License
 
 [MIT](LICENSE) © [Mukhammedali Berektassuly](https://berektassuly.com)
\ No newline at end of file
diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md
index 38664aa5..85d45d27 100644
--- a/docs/BENCHMARKS.md
+++ b/docs/BENCHMARKS.md
@@ -36,6 +36,68 @@ Tested with Telegram export (34,478 messages), measured with OpenAI tokenizer (c
 
 ---
 
+## Criterion Benchmark Results
+
+All benchmarks run with `cargo bench --bench parsing` on release build.
+
+### Parsing Performance
+
+| Platform | 100 msgs | 1K msgs | 10K msgs | 50K msgs |
+|----------|----------|---------|----------|----------|
+| **Instagram** | 37.9 µs | 377 µs | 3.6 ms | 18.0 ms |
+| **Telegram** | 49.2 µs | 487 µs | 7.2 ms | 32.0 ms |
+| **Discord** | 60.1 µs | 567 µs | 5.6 ms | 32.1 ms |
+| **WhatsApp** | 2.6 ms | 4.4 ms | 22.3 ms | 102.2 ms |
+
+### Throughput (messages/second)
+
+| Platform | Throughput |
+|----------|------------|
+| **Instagram** | 2.6-2.8 M/s |
+| **Telegram** | 1.4-2.0 M/s |
+| **Discord** | 1.5-1.8 M/s |
+| **WhatsApp** | 38-489 K/s |
+
+> WhatsApp uses regex-based text parsing, hence slower than JSON parsers.
+
+### Operations Performance
+
+| Operation | 100 msgs | 1K msgs | 10K msgs | 100K msgs |
+|-----------|----------|---------|----------|-----------|
+| **Merge consecutive** | 8.9 µs | 84 µs | 690 µs | 7.0 ms |
+| **Filter by sender** | 8.2 µs | 75 µs | 744 µs | 7.4 ms |
+| **Filter by date** | 8.1 µs | 77 µs | 764 µs | 7.4 ms |
+
+| Operation | Throughput |
+|-----------|------------|
+| Merge consecutive | 11-14 M/s |
+| Filter by sender | 12-13 M/s |
+| Filter by date | 12-13 M/s |
+
+### Output Format Performance
+
+| Format | 100 msgs | 1K msgs | 10K msgs |
+|--------|----------|---------|----------|
+| **CSV** | 8.1 µs | 77 µs | 874 µs |
+| **JSONL** | 10.4 µs | 102 µs | 998 µs |
+| **JSON** | 16.6 µs | 158 µs | 1.5 ms |
+
+| Format | Throughput |
+|--------|------------|
+| CSV | 11-12 M/s |
+| JSONL | 9-10 M/s |
+| JSON | 6.0-6.6 M/s |
+
+### Full Pipeline (Parse → Merge → Filter → Output)
+
+| Messages | Time | Throughput |
+|----------|------|------------|
+| 1K | 602 µs | 1.66 M/s |
+| 10K | 5.9 ms | 1.70 M/s |
+| 50K | 29.8 ms | 1.68 M/s |
+
+---
+
 ## Message Merging
 
 Consecutive messages from the same sender are merged into one entry.
@@ -58,34 +120,6 @@ Consecutive messages from the same sender are merged into one entry.
 
 ---
 
-## Processing Speed
-
-### By platform (real data)
-
-| Platform | Messages | File Size | Time | Throughput |
-|----------|----------|-----------|------|------------|
-| Telegram | 34,478 | ~10 MB | 0.21s | 162K msg/s |
-| Discord TXT | 1,232 | 646 KB | 0.01s | 85K msg/s |
-
-### By output format (34K Telegram messages)
-
-| Format | Time | Speed |
-|--------|------|-------|
-| CSV | 0.21s | **162K msg/s** |
-| JSON | 0.18s | **186K msg/s** |
-| JSONL | 0.26s | **131K msg/s** |
-
-### By operation (34K messages)
-
-| Operation | Time |
-|-----------|------|
-| Parse JSON | 0.15-0.22s |
-| Merge | 0.00-0.01s |
-| Write output | 0.03-0.04s |
-| **Total** | **0.18-0.26s** |
-
----
-
 ## Memory Usage
 
 chatpack loads entire file into memory. Expected usage:
@@ -153,26 +187,23 @@ Toxic data generator with:
 
 ---
 
-## Your Own Benchmarks
-
-Run the included stress test:
+## Run Your Own Benchmarks
 
 ```bash
-# Generate 100K toxic messages
-cargo run --release --bin gen_test -- 100000 heavy_test.json telegram
+# Run all benchmarks
+cargo bench --bench parsing
 
-# Process and see stats
-./target/release/chatpack tg heavy_test.json
-```
+# Run specific benchmark
+cargo bench --bench parsing -- telegram_parsing
 
-Output includes:
-```
-⚡ Performance:
- Total time: 3.57s
- Throughput: 28011 messages/sec
+# Save baseline for comparison
+cargo bench --bench parsing -- --save-baseline main
+
+# Compare against baseline
+cargo bench --bench parsing -- --baseline main
 ```
 
 ---
 
 *Last updated: December 2025*
-*Contributions welcome! Add your benchmarks via PR.*
\ No newline at end of file
+*Benchmarks run on Linux with Criterion.rs*
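
The throughput figures added in this diff can be cross-checked against the timing columns, since throughput is the message count divided by the elapsed time. A minimal Rust sketch follows, using the rounded Full Pipeline timings from the new table. The helper `throughput_per_sec` is hypothetical and is not part of chatpack or Criterion.

```rust
// Hypothetical helper for illustration only (not a chatpack or Criterion API):
// converts a message count and an elapsed time in microseconds into messages/second.
fn throughput_per_sec(messages: u64, elapsed_micros: f64) -> f64 {
    messages as f64 / (elapsed_micros / 1_000_000.0)
}

fn main() {
    // (label, message count, elapsed microseconds), taken from the
    // rounded values in the Full Pipeline table of the diff.
    let rows = [
        ("1K", 1_000_u64, 602.0_f64),
        ("10K", 10_000, 5_900.0),
        ("50K", 50_000, 29_800.0),
    ];

    for (label, messages, micros) in rows {
        let rate = throughput_per_sec(messages, micros);
        // Report in millions of messages per second, as the tables do.
        println!("{}: {:.2} M msgs/s", label, rate / 1_000_000.0);
    }
}
```

With the rounded timings this gives roughly 1.66, 1.69 and 1.68 M msgs/s. The table lists 1.70 M/s for the 10K row, which presumably reflects the unrounded Criterion measurement rather than the displayed 5.9 ms.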