NOTE: this was a failure
- Log streams are too parallel and noisy to give a clean state X -> state Y sequence, so the JEPA-style setup does not map cleanly onto this problem.
- e.g. your sliding window may be 10 logs, but your training data ends up being just 500 logs of "received request", because that's the nature of production systems, I guess
- The idea works but requires too much data preparation for me to keep going.
- And if I did all that preprocessing, the logs may as well have become a trace, and at that point you may as well cluster with simple classification on the full trace
An experimental Joint-Embedding Predictive Architecture (JEPA) approach to log clustering.
Unlike traditional methods that group logs based on syntax, jepacluster groups logs based on their future consequences. It learns a world model of your system logs, where the similarity between two logs is defined by the similarity of the future system states they lead to.
jepacluster uses a sliding window of preceding logs to generate embeddings. This ensures that clusters represent stateful transitions rather than isolated strings. A log's position in latent space is determined by the trajectory that led to it and the future it predicts.
In short, jepacluster takes a sequence of preceding logs plus a target log, and assigns the target log to a cluster.
Target Log: `[INFO] Connection closed`

| Preceding Sequence | Target Log | Cluster |
|---|---|---|
| `[INFO] Auth success` -> `[INFO] Data stream end` | `[INFO] Connection closed` | "Graceful Shutdown" |
| `[ERROR] Timeout` -> `[WARN] Retry limit reached` | `[INFO] Connection closed` | "Resource Exhaustion" |
Standard clustering would see the string `Connection closed` and put both into a single "Network Info" bucket.
jepacluster would recognize that the first sequence predicts a clean exit state, while the second predicts an unhealthy state. It separates them into different clusters because their causal trajectories have nothing in common.
- Training (The World Model): The model watches sequences of logs. It takes a context ($x$) and attempts to predict the latent representation of the future ($y$); see the sketch after this list.
- The Encoder: Through this predictive task, the encoder learns to compress logs into vectors that encode future system state.
- Inference: Live logs are passed through the trained encoder.
- Clustering: We run HDBSCAN on the resulting vectors.
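For concreteness, here is a minimal sketch of one such training step in PyTorch. The module shapes, the EMA target encoder, and the MSE loss in latent space are assumptions drawn from standard JEPA recipes, not confirmed details of this repo's implementation; the inputs are assumed to be already-featurized fixed-size tensors.

```python
import copy
import torch
import torch.nn as nn

latent_dim = 64

# Context encoder f(x): embeds a featurized window of preceding logs.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, latent_dim))

# Target encoder: an EMA copy of the encoder, never updated by gradients
# (an assumption; standard JEPA recipes use this to avoid collapse).
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

# Predictor g: maps the context embedding to the predicted future embedding.
predictor = nn.Sequential(
    nn.Linear(latent_dim, latent_dim), nn.ReLU(), nn.Linear(latent_dim, latent_dim)
)

opt = torch.optim.AdamW([*encoder.parameters(), *predictor.parameters()], lr=1e-3)

def train_step(x: torch.Tensor, y: torch.Tensor, ema: float = 0.99) -> float:
    """One predictive step: embed context x, predict the latent of future y."""
    pred = predictor(encoder(x))      # predicted future state, in latent space
    with torch.no_grad():
        tgt = target_encoder(y)       # actual future state, no gradient
    loss = nn.functional.mse_loss(pred, tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update keeps the target encoder a slow-moving copy of the encoder.
    with torch.no_grad():
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.mul_(ema).add_(p, alpha=1 - ema)
    return loss.item()

# Toy usage with random stand-in features for a context window and its future.
x, y = torch.randn(32, 128), torch.randn(32, 128)
print(train_step(x, y))
```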
- Put training logs in `data/` as `.txt` files. The loader searches recursively, so you can organize logs into subfolders if needed.
- Adjust `config.yaml` for your dataset. The main knobs are `window_size`, `latent_dim`, the training hyperparameters, and the UMAP/HDBSCAN clustering settings.
- Train the model from the project root: `python src/jepacluster/main.py --train --data_dir data/ --model_dir models/ --config_file config.yaml`
- Trained artifacts are saved in `models/`.
- Run inference with the same entrypoint and the path to the model of interest: `python src/jepacluster/main.py --infer --infer_dir infer_data/ --model_path models/jepacluster-v1.pt --config_file config.yaml` (a sketch of the clustering stage follows below).
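As a rough illustration of the clustering stage, the sketch below runs UMAP and then HDBSCAN on a stand-in embedding matrix using the `umap-learn` and `hdbscan` packages. The parameter values are placeholders for whatever is set in `config.yaml`, and the random matrix stands in for the encoder's outputs.

```python
import numpy as np
import umap
import hdbscan

# Stand-in for the encoder's per-log embeddings (n_logs x latent_dim).
embeddings = np.random.rand(1000, 64)

# UMAP reduces the latent vectors to a dimensionality HDBSCAN handles well.
reduced = umap.UMAP(n_components=5, n_neighbors=15, min_dist=0.0).fit_transform(embeddings)

# HDBSCAN assigns one cluster label per log; label -1 marks noise/outliers.
labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(reduced)
print(f"found {labels.max() + 1} clusters, {np.sum(labels == -1)} noise points")
```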