NOTE: this was a failure
- Log streams are too parallel and noisy to give a clean state X -> state Y sequence, so the JEPA-style setup does not map cleanly onto this problem.
- e.g. your sliding window may be 10 logs, but your training data ends up being just 500 logs of "received request", because that's the nature of production systems, I guess
- The idea works but requires too much data preparation for me to keep going.
- And if I did all that preprocessing, the logs may as well have become a trace, and at that point you may as well cluster with simple classification on the full trace
An experimental Joint-Embedding Predictive Architecture (JEPA) approach to log clustering.
Unlike traditional methods that group logs based on syntax, jepacluster groups logs based on their future consequences. It learns a world model of your system logs, where the similarity between two logs is defined by the similarity of the future system states they lead to.
jepacluster uses a sliding window of preceding logs to generate embeddings. This ensures that clusters represent stateful transitions rather than isolated strings. A log's position in latent space is determined by the trajectory that led to it and the future it predicts.
In short, jepacluster takes a sequence of preceding logs plus a target log, and assigns the target log to a cluster.
Target Log: `[INFO] Connection closed`

| Preceding Sequence | Target Log | Cluster |
|---|---|---|
| `[INFO] Auth success` -> `[INFO] Data stream end` | `[INFO] Connection closed` | "Graceful Shutdown" |
| `[ERROR] Timeout` -> `[WARN] Retry limit reached` | `[INFO] Connection closed` | "Resource Exhaustion" |
Standard clustering would see the string `Connection closed` and put both into a single "Network Info" bucket.
jepacluster would recognize that the first sequence predicts a clean exit state, while the second predicts an unhealthy state. It separates them into different clusters because their causal trajectories have nothing in common.
- Training (The World Model): The model watches sequences of logs. It takes a context ($x$) and attempts to predict the latent representation of the future ($y$); see the sketch after this list.
- The Encoder: Through this predictive task, the encoder learns to compress logs into vectors that encode future system state.
- Inference: Live logs are passed through the trained encoder.
- Clustering: We run HDBSCAN on the resulting vectors.
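For concreteness, here is a minimal sketch of one such training step in PyTorch. The module shapes, the EMA target encoder, and the MSE loss in latent space are assumptions drawn from standard JEPA recipes, not confirmed details of this repo's implementation; the inputs are assumed to be already-featurized fixed-size tensors.

```python
import copy
import torch
import torch.nn as nn

latent_dim = 64

# Context encoder f(x): embeds a featurized window of preceding logs.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, latent_dim))

# Target encoder: an EMA copy of the encoder, never updated by gradients
# (an assumption; standard JEPA recipes use this to avoid collapse).
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

# Predictor g: maps the context embedding to the predicted future embedding.
predictor = nn.Sequential(
    nn.Linear(latent_dim, latent_dim), nn.ReLU(), nn.Linear(latent_dim, latent_dim)
)

opt = torch.optim.AdamW([*encoder.parameters(), *predictor.parameters()], lr=1e-3)

def train_step(x: torch.Tensor, y: torch.Tensor, ema: float = 0.99) -> float:
    """One predictive step: embed context x, predict the latent of future y."""
    pred = predictor(encoder(x))      # predicted future state, in latent space
    with torch.no_grad():
        tgt = target_encoder(y)       # actual future state, no gradient
    loss = nn.functional.mse_loss(pred, tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update keeps the target encoder a slow-moving copy of the encoder.
    with torch.no_grad():
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.mul_(ema).add_(p, alpha=1 - ema)
    return loss.item()

# Toy usage with random stand-in features for a context window and its future.
x, y = torch.randn(32, 128), torch.randn(32, 128)
print(train_step(x, y))
```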
- Put training logs in `data/` as `.txt` files. The loader searches recursively, so you can organize logs into subfolders if needed.
- Adjust `config.yaml` for your dataset. The main knobs are `window_size`, `latent_dim`, the training hyperparameters, and the UMAP/HDBSCAN clustering settings.
- Train the model from the project root: `python src/jepacluster/main.py --train --data_dir data/ --model_dir models/ --config_file config.yaml`
- Trained artifacts are saved in `models/`.
- Run inference with the same entrypoint and the path to the model of interest: `python src/jepacluster/main.py --infer --infer_dir infer_data/ --model_path models/jepacluster-v1.pt --config_file config.yaml` (a sketch of the clustering stage follows below).
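As a rough illustration of the clustering stage, the sketch below runs UMAP and then HDBSCAN on a stand-in embedding matrix using the `umap-learn` and `hdbscan` packages. The parameter values are placeholders for whatever is set in `config.yaml`, and the random matrix stands in for the encoder's outputs.

```python
import numpy as np
import umap
import hdbscan

# Stand-in for the encoder's per-log embeddings (n_logs x latent_dim).
embeddings = np.random.rand(1000, 64)

# UMAP reduces the latent vectors to a dimensionality HDBSCAN handles well.
reduced = umap.UMAP(n_components=5, n_neighbors=15, min_dist=0.0).fit_transform(embeddings)

# HDBSCAN assigns one cluster label per log; label -1 marks noise/outliers.
labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(reduced)
print(f"found {labels.max() + 1} clusters, {np.sum(labels == -1)} noise points")
```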