This repository contains the core data collection and processing pipeline for the network observability project.
It captures, transforms, and enriches network traffic data before sending it to Arkime for indexing and visualization.
The pipeline automates the entire workflow of collecting, decoding, and enriching network traffic data.
It acts as the bridge between raw network traffic (captured via Mitmproxy) and the visualization layer (Arkime Dashboard).
The main purpose of this module is to:
- Intercept and log HTTPS network traffic in real-time.
- Convert captured logs into a standard packet capture format (`.pcap`).
- Analyze traffic using nDPI (Deep Packet Inspection).
- Publish and consume data asynchronously using RabbitMQ.
- Enrich Arkime sessions with metadata such as application type, risk level, and category through the Flask API.
```text
network-data-pipeline/
│
├── producer_of_logs.py   # Captures and publishes .mitm logs to RabbitMQ
├── consumer_of_logs.py   # Converts .mitm to .pcap and performs enrichment
├── flask_api.py          # Flask web service for cache, enrich, and debug endpoints
├── run_pipeline.py       # Orchestrates the pipeline (Flask, Producer, Consumer)
├── requirements.txt      # Python dependencies
└── README.md
```
- Uses Mitmproxy to intercept HTTPS traffic from clients configured with a local proxy.
- Saves captured traffic into `.mitm` log files.
- Publishes metadata and file paths into a RabbitMQ queue (`traffic_queue`) for downstream processing.

Example:

```bash
python3 producer_of_logs.py 180
```

Collects and logs traffic for 180 seconds.
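For orientation, here is a minimal sketch of the producer's flow, assuming `mitmdump` is on the PATH and RabbitMQ runs on localhost; apart from the `traffic_queue` name and the 180-second default, the file paths and message fields are illustrative rather than the repository's actual code:

```python
import json
import subprocess
import sys
import time

import pika


def capture_and_publish(duration: int) -> None:
    log_path = f"/tmp/capture_{int(time.time())}.mitm"

    # Record proxied HTTPS flows with mitmdump for the requested duration.
    proc = subprocess.Popen(["mitmdump", "-w", log_path])
    time.sleep(duration)
    proc.terminate()
    proc.wait()

    # Publish the log's path and metadata so the consumer can pick it up.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="traffic_queue", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="traffic_queue",
        body=json.dumps({"path": log_path, "captured_at": time.time()}),
    )
    connection.close()


if __name__ == "__main__":
    capture_and_publish(int(sys.argv[1]) if len(sys.argv) > 1 else 180)
```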
- Listens to the `traffic_queue` in RabbitMQ.
- Downloads `.mitm` files from the producer.
- Converts them to `.pcap` format using a custom parser (mitm2pcap).
- Analyzes packets using nDPI to detect application types and protocols.
- Sends processed `.pcap` data to the Flask API at `/update-cache`.
- Finally, streams the processed traffic to Arkime using `pcap-over-ip`.
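A hedged sketch of the consumer loop follows; the `mitm2pcap` conversion and the nDPI analysis are stubbed out, since their exact interfaces live in this repository:

```python
import json

import pika
import requests

FLASK_URL = "http://127.0.0.1:5000/update-cache"


def convert_and_analyze(mitm_path: str) -> list:
    """Stub for the repository's mitm2pcap conversion and nDPI analysis;
    the real implementation returns per-flow records."""
    return []


def handle_message(channel, method, properties, body):
    msg = json.loads(body)
    flows = convert_and_analyze(msg["path"])

    # Hand the parsed flows to the Flask API's TTL cache.
    requests.post(FLASK_URL, json={"flows": flows}, timeout=10)
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="traffic_queue", durable=True)
channel.basic_consume(queue="traffic_queue", on_message_callback=handle_message)
channel.start_consuming()
```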
The Flask API serves as the central hub that connects the pipeline with Arkime and the Wise Service.
| Endpoint | Method | Description |
|---|---|---|
| `/update-cache` | POST | Receives parsed flows and stores them temporarily in a TTL cache. |
| `/enrich` | POST | Provides enrichment data (e.g., risk, category, app) for the Arkime Wise Service. |
| `/debug-cache` | GET | Displays the current cached sessions and debug info. |
This layer ensures data integrity, synchronization, and accessibility between the components.
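A minimal sketch of these three endpoints, assuming flows arrive keyed by a session identifier (the `session_id` field name and the cache sizing below are assumptions, not the repository's actual code):

```python
from datetime import datetime, timezone

from cachetools import TTLCache
from flask import Flask, jsonify, request

app = Flask(__name__)
cache = TTLCache(maxsize=10_000, ttl=600)  # flows expire after 10 minutes
state = {"last_update": None}


@app.post("/update-cache")
def update_cache():
    # Key each flow by a session identifier; the exact field name is an assumption.
    for flow in request.get_json().get("flows", []):
        cache[flow["session_id"]] = flow
    state["last_update"] = datetime.now(timezone.utc).isoformat()
    return jsonify({"cached": len(cache)})


@app.post("/enrich")
def enrich():
    # Look up the cached flow and return only the enrichment fields.
    flow = cache.get(request.get_json().get("session_id"), {})
    return jsonify({
        "app": flow.get("app"),
        "category": flow.get("category"),
        "risk_level": flow.get("risk_level"),
    })


@app.get("/debug-cache")
def debug_cache():
    return jsonify({"flows_cached": len(cache), "last_update": state["last_update"]})


if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)  # localhost only, per the security notes
```

The TTL cache keeps enrichment lookups fast while letting stale sessions age out automatically instead of growing without bound.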
- Runs the entire pipeline in a single command.
- Automatically starts Flask, Producer, and Consumer processes.
- Waits for the defined duration, then gracefully terminates all subprocesses.
Example:

```bash
python3 run_pipeline.py 180
```
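Conceptually, the orchestrator boils down to something like the sketch below; startup ordering checks and shutdown grace periods in the real script may differ:

```python
import subprocess
import sys
import time

DURATION = int(sys.argv[1]) if len(sys.argv) > 1 else 180

# Start each stage as a child process, in dependency order.
procs = [
    subprocess.Popen(["python3", "flask_api.py"]),
    subprocess.Popen(["python3", "producer_of_logs.py", str(DURATION)]),
    subprocess.Popen(["python3", "consumer_of_logs.py"]),
]

try:
    time.sleep(DURATION)
finally:
    # Graceful shutdown: SIGTERM each child first, then wait for all of them.
    for proc in procs:
        proc.terminate()
    for proc in procs:
        proc.wait()
```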
Make sure the following are installed:

```bash
sudo apt install python3 python3-pip rabbitmq-server mitmproxy git build-essential
pip3 install -r requirements.txt
```

Typical dependencies include:
```text
flask
requests
cachetools
scapy
pika
psutil
```
Enable IP forwarding so the proxy can route client traffic:

```bash
sudo sysctl -w net.ipv4.ip_forward=1
```

Run each component manually in separate terminals:

```bash
# Terminal 1
python3 flask_api.py

# Terminal 2
python3 producer_of_logs.py 180

# Terminal 3
python3 consumer_of_logs.py
```

Or launch everything at once:

```bash
python3 run_pipeline.py 180
```

| Step | Action | Tool/Module |
|---|---|---|
| 1 | Capture HTTPS traffic | Mitmproxy (via Producer) |
| 2 | Publish metadata | RabbitMQ |
| 3 | Consume and convert logs | Consumer |
| 4 | Analyze packets | nDPI |
| 5 | Cache and enrich data | Flask API |
| 6 | Stream enriched sessions | Arkime |
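Step 6 relies on PCAP-over-IP, which is simply the raw bytes of a pcap stream sent over a TCP connection. A minimal sender sketch, assuming Arkime's listener address and the conventional port 57012 (both are assumptions; check your Arkime configuration):

```python
import socket


def stream_pcap(pcap_path: str, host: str = "127.0.0.1", port: int = 57012) -> None:
    """Send a finished .pcap file to a PCAP-over-IP listener as a raw byte stream."""
    with socket.create_connection((host, port)) as sock, open(pcap_path, "rb") as f:
        while chunk := f.read(65536):
            sock.sendall(chunk)
```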
After running the pipeline:
- `.mitm` logs appear in `/tmp/*.mitm`
- `.pcap` files are generated for Arkime
- Cached flow data can be checked with:

```bash
curl http://127.0.0.1:5000/debug-cache
```

Sample JSON output:
```json
{
  "flows_cached": 154,
  "last_update": "2025-10-17T14:32:06Z"
}
```

Once the pipeline is running:
- The Consumer sends `.pcap` data to Arkime via `pcap-over-ip`.
- Arkime indexes the flows.
- When viewing sessions, Arkime queries `/enrich` on the Flask API (or Wise Service) to display custom fields such as `app`, `category`, and `risk_level`.
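An enrichment lookup can also be reproduced by hand; the session key format below is an assumption, and the real Wise query shape may differ:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5000/enrich",
    json={"session_id": "10.0.0.5:51022-93.184.216.34:443"},  # hypothetical key
    timeout=5,
)
print(resp.json())  # e.g. {"app": "...", "category": "...", "risk_level": "..."}
```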
- Always capture traffic with user consent.
- Use TLS certificates properly with Mitmproxy.
- Restrict API access to localhost or trusted networks.
- Consider using HTTPS for the Flask API in production.
