T-JEPA: A Text-Joint Embedding Predictive Architecture for Grounded Reasoning and Internal Knowledge Distillation

Author: Senthil Vasan (16 y/o Independent Researcher)
Version: 2.0 (Revised)
Date: September 2025

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, yet they exhibit fundamental limitations, including a propensity for factual hallucination, poor causal reasoning, and a lack of a robust underlying world model. This paper introduces the revised Text-Joint Embedding Predictive Architecture (T-JEPA), a simplified yet powerful framework that addresses these core challenges through a clean separation of understanding and expression.

The revised T-JEPA consists of two main components: (1) A JEPA Encoder Core that builds a rich, abstract world model by learning to predict missing information in a latent representation space, and (2) A Decoder Head that translates the encoder's abstract representations into fluent language through cross-attention mechanisms. Additionally, we introduce an iterative distillation process where larger JEPA models transfer knowledge to smaller ones, enabling efficient scaling and deployment.

1. Introduction

The proliferation of Large Language Models (LLMs) has marked a milestone in artificial intelligence, but their architectural paradigm suffers from inherent weaknesses, including the lack of an internal world model, factual hallucination, and poor extrapolation.

T-JEPA addresses these limitations through a fundamental principle: thinking and speaking are distinct cognitive processes that should be modeled separately.

1.1 Contributions

Our contributions are threefold:

  1. We present a simplified architecture that combines a self-supervised JEPA encoder for building a latent world model with a generative decoder connected through cross-attention mechanisms.

  2. We introduce a streamlined two-phase training methodology that separates unsupervised world model learning from supervised language generation training.

  3. We propose an iterative distillation framework for transferring knowledge from large JEPA models to smaller, more efficient ones while preserving the quality of the learned world representations.

2. The Revised T-JEPA Architecture

The revised T-JEPA is a cleaner, more focused architecture with two primary components working in harmony.

2.1 The JEPA Encoder Core: "The Thinker"

The heart of T-JEPA remains a non-generative encoder tasked with building a world model:

Online Encoder: Processes a masked version of the input text using variable-span masking. During pre-training, contiguous spans of text (ranging from single sentences to multiple paragraphs) are masked, forcing the model to rely on its understanding of both local coherence and long-range semantic dependencies.

Target Encoder: Processes the full, unmasked input text. Its weights are maintained as an exponential moving average of the online encoder's weights, providing stable targets for training and preventing representational collapse.

Predictor Network: A smaller network that takes the output of the online encoder and attempts to predict the latent representation of the masked chunks, as generated by the target encoder.

The training objective minimizes the Mean Squared Error between the predicted representations and the target representations in the latent space, forcing the model to learn deep semantic and causal relationships.
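As a minimal sketch of this objective (not the paper's implementation): single linear maps stand in for the online encoder, target encoder, and predictor; the masked span's target latent is mean-pooled; and the gradients of the latent-space MSE are written out by hand. All names, sizes, and the pooling scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOK, D_TOK, D_LAT = 12, 16, 8          # toy sizes: tokens, token dim, latent dim

# Single linear maps stand in for the transformer stacks of the real model.
W_online = rng.normal(0.0, 0.2, (D_TOK, D_LAT))   # online encoder
W_target = W_online.copy()                         # target encoder (EMA copy)
W_pred   = rng.normal(0.0, 0.2, (D_LAT, D_LAT))   # predictor network

def jepa_step(tokens, mask, lr=0.1, ema=0.99):
    """One pre-training step: predict the masked span's latent from context."""
    global W_online, W_target, W_pred
    z_ctx  = (tokens[~mask] @ W_online).mean(axis=0)   # pooled context latent
    z_pred = z_ctx @ W_pred                            # predicted span latent
    # The target encoder sees the FULL input; its output is a fixed target.
    z_tgt  = (tokens @ W_target)[mask].mean(axis=0)
    err    = z_pred - z_tgt
    loss   = np.mean(err ** 2)                         # MSE in latent space
    # Manual gradients of the MSE through the two trained linear maps.
    g_pred   = np.outer(z_ctx, err) * 2.0 / err.size
    x_bar    = tokens[~mask].mean(axis=0)
    g_online = np.outer(x_bar, err @ W_pred.T) * 2.0 / err.size
    W_pred   -= lr * g_pred
    W_online -= lr * g_online
    # EMA update keeps targets stable and prevents representational collapse.
    W_target  = ema * W_target + (1.0 - ema) * W_online
    return loss

tokens = rng.normal(size=(N_TOK, D_TOK))   # toy "token" embeddings
mask = np.zeros(N_TOK, dtype=bool)
mask[4:8] = True                           # contiguous masked span
losses = [jepa_step(tokens, mask) for _ in range(200)]
```

Note that no gradient ever flows into `W_target`; it only tracks the online weights through the exponential moving average, which is the mechanism the text credits with preventing collapse.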

2.2 The Decoder Head: "The Speaker"

A streamlined decoder architecture that focuses purely on language generation:

Standard Transformer Decoder: Uses the proven transformer decoder architecture for autoregressive token generation.

Cross-Attention Integration: The decoder connects to the JEPA encoder's output through cross-attention layers. At each generation step, the decoder queries the rich "thought vector" produced by the encoder, ensuring that every generated token is grounded in the semantic understanding of the original input.

Single Objective: The decoder is trained using standard cross-entropy loss for token generation, simplifying the training process and eliminating potential gradient conflicts.
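The grounding mechanism can be illustrated with a minimal NumPy sketch of one cross-attention layer: decoder states form the queries, and the encoder's output supplies keys and values. Shapes and projection matrices are invented for illustration; a real decoder would use multi-head attention inside a full transformer layer.

```python
import numpy as np

def cross_attention(dec_h, thought, Wq, Wk, Wv):
    """Decoder hidden states query the frozen encoder's thought vectors."""
    Q = dec_h   @ Wq                    # queries from the decoder
    K = thought @ Wk                    # keys from the encoder output
    V = thought @ Wv                    # values from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over encoder positions
    return w @ V                        # each token grounded in the thought vector

rng = np.random.default_rng(1)
D = 8
dec_h   = rng.normal(size=(5, D))       # 5 decoder positions
thought = rng.normal(size=(3, D))       # 3 encoder "thought" vectors
Wq, Wk, Wv = (rng.normal(0.0, 0.3, (D, D)) for _ in range(3))
out = cross_attention(dec_h, thought, Wq, Wk, Wv)
```

Each row of `out` is a convex combination of the encoder's value vectors, which is what "every generated token is grounded in the semantic understanding of the original input" amounts to mechanically.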

3. Training Methodology: A Two-Phase Approach

The revised T-JEPA employs a clean two-phase training strategy:

Phase 1: Unsupervised Pre-training

The JEPA Encoder Core is trained on large corpora of unlabeled text using the variable-span masking and prediction objective. This phase builds the foundational world model through predictive learning in the latent space, with no language generation capability active.

Objective: Learn rich semantic representations that capture the underlying structure of language and concepts.

Data: Trillions of tokens of unlabeled text from diverse sources.

Duration: This phase continues until the encoder demonstrates strong semantic understanding as measured by downstream evaluation tasks.

Phase 2: Supervised Fine-tuning

The JEPA encoder is frozen to preserve the learned world model. Only the decoder and cross-attention layers are trained on instruction-following datasets.

Encoder Status: Completely frozen; no gradient updates are applied, preserving the world model.

Training Data: Labeled question-answer pairs, instruction-following datasets, and conversational data similar to current LLM training.

Process:

  1. Input question is processed by the frozen JEPA encoder to produce a "thought vector"
  2. The decoder uses cross-attention to query this thought vector while generating the response
  3. Only the decoder and cross-attention parameters are updated via backpropagation

Objective: Learn to translate the encoder's semantic understanding into coherent, helpful language.
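The freezing policy above can be sketched with a hypothetical parameter registry (the names `encoder.W`, `cross_attn.Wq`, and `decoder.W` are invented for illustration): the optimizer simply skips any parameter marked frozen, so the encoder never moves during Phase 2.

```python
# Hypothetical parameter registry for the two-phase setup.
params = {
    "encoder.W":     {"value": 1.0, "frozen": True},   # world model: never updated
    "cross_attn.Wq": {"value": 0.5, "frozen": False},  # trained in Phase 2
    "decoder.W":     {"value": 0.2, "frozen": False},  # trained in Phase 2
}

def sgd_step(params, grads, lr=0.1):
    """Apply gradients only to unfrozen (decoder / cross-attention) parameters."""
    for name, p in params.items():
        if not p["frozen"]:
            p["value"] -= lr * grads.get(name, 0.0)

# Even though a gradient reaches the encoder, it is discarded.
grads = {"encoder.W": 9.9, "cross_attn.Wq": 0.4, "decoder.W": 0.8}
sgd_step(params, grads)
```

In a real framework the same effect is usually achieved by disabling gradient tracking on the encoder's parameters before constructing the optimizer.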

4. Iterative JEPA Distillation: Large-to-Small Knowledge Transfer

4.1 The Distillation Framework

Instead of using external teacher models, we implement an iterative distillation process where larger JEPA models teach smaller ones:

Teacher Model: A large, fully trained T-JEPA model (e.g., 70B parameters)
Student Model: A smaller T-JEPA architecture (e.g., 7B parameters)

4.2 Distillation Process

Step 1: Representation Alignment

  • Both teacher and student JEPA encoders process the same input text
  • The student's latent representations are projected to match the teacher's dimensionality if needed
  • MSE loss is computed between teacher and student "thought vectors"

Step 2: Iterative Training

  • The student JEPA encoder is trained to minimize the distance between its representations and the teacher's
  • Multiple iterations allow the student to gradually align with the teacher's world model
  • The process continues until convergence or satisfactory performance is achieved
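Steps 1 and 2 can be sketched with toy linear encoders (an illustrative assumption, not the paper's implementation): the student's latents are projected up to the teacher's width, and both the student encoder and the projection head are trained to minimize the MSE against the frozen teacher's "thought vectors".

```python
import numpy as np

rng = np.random.default_rng(2)
D_IN, D_TEACHER, D_STUDENT = 32, 16, 8

W_teacher = rng.normal(0.0, 0.2, (D_IN, D_TEACHER))       # frozen teacher encoder
W_student = rng.normal(0.0, 0.2, (D_IN, D_STUDENT))       # student encoder (trained)
W_proj    = rng.normal(0.0, 0.2, (D_STUDENT, D_TEACHER))  # dimension-matching head

def distill_step(x, lr=0.05):
    """Align student 'thought vectors' with the teacher's via latent-space MSE."""
    global W_student, W_proj
    z_t = x @ W_teacher                   # teacher representations (no gradients)
    z_s = x @ W_student                   # student representations
    z_p = z_s @ W_proj                    # project up to the teacher's dimensionality
    err = z_p - z_t
    loss = np.mean(err ** 2)
    # Manual MSE gradients for the projection head and the student encoder.
    g_proj    = z_s.T @ err * 2.0 / err.size
    g_student = x.T @ (err @ W_proj.T) * 2.0 / err.size
    W_proj    -= lr * g_proj
    W_student -= lr * g_student
    return loss

x = rng.normal(size=(64, D_IN))           # a batch of toy input embeddings
losses = [distill_step(x) for _ in range(200)]
```

Iterating this step over the corpus is the "iterative training" of Step 2; the teacher weights are never touched, mirroring the frozen-encoder discipline of Phase 2.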

Step 3: Student Decoder Training

  • Once the student encoder is distilled, its decoder is trained on the same labeled datasets
  • The frozen student encoder provides thought vectors for decoder training
  • This maintains consistency with the original two-phase approach

4.3 Advantages of JEPA-to-JEPA Distillation

Preserved World Model Quality: The student learns the same conceptual understanding as the teacher, just in a more compact form.

Scalable Deployment: Multiple model sizes can be created from a single large teacher, enabling deployment across different computational constraints.

Consistent Architecture: Both teacher and student use identical architectures, eliminating alignment issues between different model types.

5. Inference Process: Think Then Speak

5.1 The Thinking Phase

When presented with a query, the entire input is processed by the frozen JEPA Encoder:

Input: "Who wrote the book 'Sapiens'?"
        ↓
┌──────────────────┐
│ JEPA ENCODER     │ ← "The Thinker" (Frozen)
│ (World Model)    │
└──────────────────┘
        ↓
┌──────────────────┐
│ "Thought Vector" │ ← Dense semantic embedding
└──────────────────┘

5.2 The Speaking Phase

The decoder generates the response through autoregressive token prediction, with each step grounded by cross-attention to the thought vector:

Token Generation Loop:
Decoder Input: [Previous Tokens] + Cross-Attention(Thought Vector)
        ↓
┌────────────────────┐
│ DECODER HEAD       │
└────────────────────┘
        ↓
Output: Next Token

This continues until an end-of-sequence token is generated.
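The full think-then-speak loop can be sketched as follows. Everything here is a deliberately toy stand-in: the "encoder" and output head are single matrices, cross-attention is reduced to mixing the thought vector into the decoder state, and the vocabulary, token IDs, and greedy decoding rule are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
VOCAB, D, EOS = 10, 8, 0                    # toy vocabulary; token 0 is <eos>

E     = rng.normal(0.0, 0.5, (VOCAB, D))    # token embedding table
W_enc = rng.normal(0.0, 0.5, (D, D))        # stand-in for the frozen JEPA encoder
W_out = rng.normal(0.0, 0.5, (D, VOCAB))    # decoder output head

def generate(prompt_ids, max_len=20):
    """Think once, then speak: encode the query, then decode token by token."""
    # Thinking phase: one forward pass through the frozen encoder.
    thought = (E[prompt_ids] @ W_enc).mean(axis=0)
    out, h = [], np.zeros(D)
    for _ in range(max_len):
        # Speaking phase: each step is grounded by the same thought vector
        # (a toy substitute for querying it via cross-attention).
        h = np.tanh(h + thought)
        logits = h @ W_out
        tok = int(np.argmax(logits))        # greedy next-token choice
        out.append(tok)
        if tok == EOS:                      # stop at end-of-sequence
            break
        h = h + E[tok]                      # feed the chosen token back in
    return out

resp = generate([3, 7, 2])
```

The key structural point survives the simplification: the encoder runs exactly once per query, and every decoding step reads from that single fixed representation.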

6. Theoretical Advantages

6.1 Simplified Training

No Gradient Conflicts: The clean separation between phases eliminates multi-objective optimization issues.

Stable World Model: Once pre-trained, the world model remains fixed, ensuring consistent semantic understanding.

Standard Fine-tuning: The second phase uses familiar supervised learning techniques.

6.2 Enhanced Reasoning

Rich Semantic Representations: The JEPA encoder learns deep conceptual relationships rather than surface patterns.

Grounded Generation: Every token is generated with reference to semantic understanding, not just statistical probability.

6.3 Efficient Scaling

Distillation Pipeline: Large models can efficiently transfer knowledge to smaller variants.

Deployment Flexibility: Multiple model sizes enable deployment across various computational constraints.

Consistent Performance: Distilled models maintain the reasoning capabilities of their teachers.

7. Applications and Use Cases

7.1 Question Answering Systems

T-JEPA's separation of understanding and expression makes it ideal for factual question answering, where the model must first comprehend the query's semantic content before generating an appropriate response.

7.2 Educational Applications

The frozen world model provides consistent conceptual understanding across interactions, enabling reliable tutoring systems that can explain concepts from first principles.

7.3 Scientific Reasoning

The JEPA encoder's predictive world model enables reasoning about causal relationships and scientific principles, making it valuable for research assistance and hypothesis generation.

8. Technical Challenges and Considerations

8.1 World Model Quality

The effectiveness of the entire system depends on the quality of the world model learned during Phase 1. Insufficient or biased pre-training data could limit the model's reasoning capabilities.

8.2 Cross-Attention Efficiency

The decoder must efficiently query the thought vector at each generation step. Optimization of the cross-attention mechanism is crucial for inference speed.

8.3 Distillation Fidelity

Ensuring that smaller models capture the essential aspects of larger models' world representations requires careful tuning of the distillation process.

9. Future Work

9.1 Multi-Modal Extensions

The JEPA encoder architecture can be extended to other modalities (vision, audio) while maintaining the same decoder, enabling unified multi-modal reasoning.

9.2 Dynamic World Model Updates

Investigating methods to update the frozen world model with new information while preserving existing knowledge.

9.3 Hierarchical Distillation

Exploring multi-level distillation where very large models teach medium models, which then teach small models, creating a knowledge transfer hierarchy.

10. Conclusion

The revised T-JEPA architecture provides a clean, focused approach to building AI systems that separate understanding from expression. By eliminating complex multi-objective training and focusing on a simple two-phase approach, we create a more implementable path toward reliable, reasoning-capable AI systems.

The iterative JEPA distillation framework enables efficient scaling and deployment while preserving the quality of learned world models. This architecture represents a practical step toward AI systems that truly understand rather than merely pattern-match, providing a foundation for more reliable and capable artificial intelligence.

The simplicity of this approach, combined with its theoretical advantages in reasoning and factual grounding, makes it a promising direction for the next generation of language models that prioritize understanding over mere statistical fluency.
