
LeJEPA: Predictive Learning With Isotropic Latent Spaces

Self-supervised learning methods such as MAE, SimCLR, BYOL, DINO, and iBOT all aim to learn useful representations from unlabeled data. Most of them either reconstruct pixels, which forces the model to capture low-level details that are irrelevant for semantic understanding, or rely on contrastive and distillation objectives that need careful heuristics to avoid collapse.

LeJEPA approaches representation learning differently:

Instead of reconstructing pixels, the model predicts latent representations of the input, and those representations are regularized to live in a well-conditioned, isotropic space.
These animations demonstrate LeJEPA’s ability to predict future latent representations for different types of motion. The first shows a dog moving through a scene, highlighting semantic dynamics and object consistency; the second shows a person riding a bicycle, illustrating the model’s capacity to capture dynamic human motion and maintain object coherence in a more complex activity. In both cases the model operates on high-level features directly in latent space rather than on reconstructed pixels (as reported in the LeJEPA GitHub repository).
If you find LeJEPA useful or interesting, consider giving the project a star ⭐ on the official LeJEPA GitHub repository. Citation details are provided in the References section.

At the center of LeJEPA is SIGReg (Sketched Isotropic Gaussian Regularization), a mathematically principled method that shapes the geometry of the embedding space and removes the need for many heuristic collapse-prevention tricks.

The overall goal is to learn an encoder f that produces latent vectors:

$$ z = f(x) $$

1. JEPA View Construction and Masking

Given an input image x, two views are created:

  • a context view \( x_c \)
  • a target view \( x_t \)

Both can be produced by cropping, masking, or spatial perturbations. Masking is still used, but only to define the prediction task—not to stabilize training.

The encoder produces:

$$ z_c = f_\theta(x_c), \quad z_t = f_\xi(x_t) $$

where:

  • \( f_\theta \) is the student encoder
  • \( f_\xi \) is the teacher encoder, updated via momentum (EMA): \( \xi \leftarrow \tau \xi + (1 - \tau)\, \theta \)

The teacher provides stable target representations. The target feature map is patch-wise:

$$ z_t = \{ z_t^{(1)}, \dots, z_t^{(N)} \} $$

We define a set \( M \subset \{1, \dots, N\} \) of masked/hidden positions to predict.
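
As a concrete illustration, here is a minimal PyTorch-style sketch of this setup: a student encoder, an EMA teacher copy, and a random mask over patch positions. All names and sizes (ToyEncoder, ema_update, tau, the mask ratio) are illustrative assumptions, not the official LeJEPA code.

```python
# Minimal sketch of the JEPA setup: a student encoder, an EMA teacher copy,
# and a random mask M over patch positions. Names and sizes are assumptions.
import copy
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in encoder mapping N patch tokens to N latent vectors of size d."""
    def __init__(self, token_dim=768, d=256):
        super().__init__()
        self.proj = nn.Linear(token_dim, d)

    def forward(self, patch_tokens):           # (B, N, token_dim) -> (B, N, d)
        return self.proj(patch_tokens)

student = ToyEncoder()                          # f_theta
teacher = copy.deepcopy(student)                # f_xi starts as a copy of f_theta
for p in teacher.parameters():
    p.requires_grad_(False)                     # the teacher receives no gradients

@torch.no_grad()
def ema_update(teacher, student, tau=0.996):
    """xi <- tau * xi + (1 - tau) * theta."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(tau).add_(p_s, alpha=1 - tau)

# A random set M of hidden patch positions defines the prediction task.
B, N, mask_ratio = 8, 196, 0.5
positions = torch.stack(
    [torch.randperm(N)[: int(N * mask_ratio)] for _ in range(B)]
)                                               # (B, |M|) indices of masked patches
```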

2. Latent Prediction

A predictor network \( P_\theta \) takes context features and predicts the latent vectors at masked positions:

$$ \hat{z}_c(m) = P_\theta(z_c, m), \quad m \in M $$

The JEPA prediction loss is:

$$ L_{pred} = \sum_{m \in M} \| \hat{z}_c(m) - z_t(m) \|_2^2 $$

Sometimes cosine distance is used:

$$ L_{cos} = 1 - \frac{\hat{z}_c(m) \cdot z_t(m)}{\| \hat{z}_c(m) \| \, \| z_t(m) \|} $$

The key idea is that the model never predicts raw pixels: the objective is \( \hat{z} \approx z_t \) rather than \( \hat{x} \approx x \), which focuses learning on semantics rather than texture.
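
The sketch below turns the equations above into code: a predictor conditioned on a position query, plus the L2 and cosine variants of the loss. The simple mean-pooled-context MLP design here is an illustrative assumption; real JEPA predictors are typically transformer modules.

```python
# Sketch of a latent predictor P_theta and the loss L_pred over masked positions.
# The MLP-with-position-query architecture is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    def __init__(self, d=256, n_patches=196):
        super().__init__()
        self.pos_embed = nn.Embedding(n_patches, d)   # one query per patch position
        self.net = nn.Sequential(
            nn.Linear(2 * d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
        )

    def forward(self, z_c, positions):
        # z_c: (B, N, d) context features; positions: (B, |M|) indices in M.
        ctx = z_c.mean(dim=1, keepdim=True).expand(-1, positions.shape[1], -1)
        query = self.pos_embed(positions)             # (B, |M|, d)
        return self.net(torch.cat([ctx, query], dim=-1))

def pred_loss(z_hat, z_t, use_cosine=False):
    """L2 distance, or optionally 1 - cosine similarity, to the teacher targets."""
    if use_cosine:
        return (1 - F.cosine_similarity(z_hat, z_t, dim=-1)).mean()
    return F.mse_loss(z_hat, z_t)
```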

3. SIGReg: Sketched Isotropic Gaussian Regularization

A central insight of LeJEPA is that representations should not only be predictable, but should also have well-conditioned geometry. Let the empirical covariance of embeddings be:

$$ \Sigma_z = \mathbb{E}[z z^\top] - \mathbb{E}[z] \mathbb{E}[z]^\top $$

Without proper constraints, encoders may produce:

  • degenerate directions
  • anisotropic scaling
  • collapsed or low-rank features

SIGReg regularizes the covariance to be close to a scaled identity matrix:

$$ \Sigma_z \approx \sigma^2 I $$

This does not enforce that z is Gaussian distributed. It only ensures that the covariance is isotropic.

To apply this efficiently, SIGReg uses random sketching matrices \( S \in \mathbb{R}^{d_s \times d} \) (with \( d_s \ll d \)):

$$ \tilde{z} = S z $$

and regularizes the sketched covariance:

$$ \tilde{\Sigma}_z = \mathbb{E}[\tilde{z} \tilde{z}^\top] $$

The SIGReg loss is:

$$ L_{SIGReg} = \| \tilde{\Sigma}_z - \sigma^2 I \|_F^2 $$

This encourages:

  • equal variance per dimension
  • decorrelated latent directions
  • stable scaling
  • well-conditioned embeddings

All of this is achieved without modifying the architecture.
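
The regularizer can be written in a few lines. The sketch below follows the Frobenius-norm formulation given above (with batch-level centering added), not the official SIGReg implementation; the sketch dimension and the way S is drawn are assumptions.

```python
# SIGReg-style sketch: project embeddings with a random matrix S and penalize
# the deviation of the sketched covariance from sigma^2 * I.
import torch

def sigreg_loss(z, d_sketch=16, sigma=1.0):
    """z: (B, d) batch of embeddings. Returns || Sigma_tilde - sigma^2 I ||_F^2."""
    B, d = z.shape
    S = torch.randn(d_sketch, d, device=z.device) / d ** 0.5  # S in R^{d_s x d}, redrawn each call
    z_tilde = z @ S.T                              # sketched embeddings, (B, d_s)
    z_tilde = z_tilde - z_tilde.mean(dim=0)        # center before estimating covariance
    cov = z_tilde.T @ z_tilde / (B - 1)            # sketched covariance, (d_s, d_s)
    target = sigma ** 2 * torch.eye(d_sketch, device=z.device)
    return ((cov - target) ** 2).sum()             # squared Frobenius norm
```

Because the penalty acts on the \( d_s \times d_s \) sketched covariance rather than the full \( d \times d \) one, its cost grows only linearly with the embedding dimension \( d \).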

4. Why Isotropic Embeddings Matter

Suppose downstream tasks require learning a predictor h(z). If z has poorly conditioned covariance, prediction becomes unstable:

  • gradients scale unevenly
  • optimization becomes anisotropic
  • certain features dominate others
  • collapse prevention becomes fragile

With SIGReg:

$$ \Sigma_z = \sigma^2 I \quad \Rightarrow \quad \text{all directions equal} $$

This ensures:

  • better linear separability
  • easier optimization
  • stable learning dynamics
  • more robust multi-step prediction
  • more uniform information content

From a theoretical standpoint, isotropy minimizes upper bounds on prediction error and promotes optimal conditioning of the encoder Jacobian.
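
A tiny numerical illustration (not from the paper) makes the conditioning argument concrete: an isotropic covariance has condition number close to 1, while an anisotropic one can be arbitrarily ill-conditioned, which is exactly what makes gradient-based learning of h(z) uneven.

```python
# Compare the condition number of an isotropic vs. an anisotropic covariance.
import torch

torch.manual_seed(0)
iso = torch.randn(10_000, 8)                                           # Sigma roughly = I
aniso = iso * torch.tensor([5.0, 2.0, 1.0, 1.0, 0.5, 0.2, 0.1, 0.01])  # uneven scales

def condition_number(z):
    cov = torch.cov(z.T)                       # (d, d) empirical covariance
    eig = torch.linalg.eigvalsh(cov)
    return (eig.max() / eig.min()).item()

print(condition_number(iso))    # ~1: every direction carries comparable variance
print(condition_number(aniso))  # huge: a few directions dominate the geometry
```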

5. Combined Objective

The overall LeJEPA training loss is:

$$ L = L_{pred} + \lambda L_{SIGReg} $$

  • \( L_{pred} \) ensures semantic predictability
  • \( L_{SIGReg} \) ensures isotropic geometry

The two complement each other: prediction shapes semantic content, and SIGReg shapes structure.
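
Putting the pieces together, a single training step under the combined objective could look like the sketch below, reusing the hypothetical helpers from the earlier snippets (student, teacher, Predictor, pred_loss, sigreg_loss, ema_update); the value of λ and the optimizer settings are placeholders.

```python
# One LeJEPA-style training step: L = L_pred + lambda * L_SIGReg, then EMA update.
import torch

lam = 0.1                                       # placeholder weight on the SIGReg term
predictor = Predictor()
opt = torch.optim.AdamW(
    list(student.parameters()) + list(predictor.parameters()), lr=1e-4
)

def train_step(ctx_tokens, tgt_tokens, positions):
    z_c = student(ctx_tokens)                   # (B, N, d) context features
    with torch.no_grad():
        z_t = teacher(tgt_tokens)               # stable teacher targets
    idx = positions.unsqueeze(-1).expand(-1, -1, z_t.shape[-1])
    z_t_masked = torch.gather(z_t, 1, idx)      # teacher latents at masked positions
    z_hat = predictor(z_c, positions)           # predicted latents, (B, |M|, d)

    loss = pred_loss(z_hat, z_t_masked) + lam * sigreg_loss(z_c.flatten(0, 1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)                # momentum update of f_xi
    return loss.item()
```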

6. Multi-step Latent Prediction

JEPA frameworks often extend prediction across time:

$$ \hat{z}^{t+1} = P_\theta(z^t), \quad \hat{z}^{t+2} = P_\theta(\hat{z}^{t+1}), \dots $$

leading to:

$$ L_{temporal} = \sum_{k=1}^K \| \hat{z}^{t+k} - z^{t+k} \|_2^2 $$

LeJEPA itself is not inherently temporal, but its formulation supports this extension naturally.
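
For completeness, a multi-step rollout loss could be sketched as follows, assuming a hypothetical step_predictor that maps one latent state to the next; LeJEPA does not prescribe this component, and it is shown only to illustrate the extension above.

```python
# Multi-step latent rollout: feed predictions back in and compare to teacher latents.
import torch
import torch.nn.functional as F

def temporal_loss(step_predictor, z_seq):
    """z_seq: (K + 1, B, d) teacher latents for consecutive frames t, ..., t + K."""
    loss = 0.0
    z_hat = z_seq[0]                            # start the rollout from z^t
    for k in range(1, z_seq.shape[0]):
        z_hat = step_predictor(z_hat)           # \hat{z}^{t+k} = P(\hat{z}^{t+k-1})
        loss = loss + F.mse_loss(z_hat, z_seq[k])
    return loss / (z_seq.shape[0] - 1)
```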

7. Relation to DINO, iBOT, and Masked Autoencoders

LeJEPA can be contrasted with other popular self-supervised learning methods:

Method    Key Mechanism                         Learns
DINO      Self-distillation, teacher–student    Global or patch embeddings
iBOT      Masked token prediction               Patch-level latent codes
MAE       Pixel reconstruction                  Low-level appearance
LeJEPA    Latent prediction + SIGReg            Isotropic semantic embeddings

Key differences of LeJEPA:

  • It avoids pixel-space losses entirely.
  • It does not rely on contrastive mechanisms.
  • It explicitly regularizes embedding geometry using SIGReg.

Key Takeaways

LeJEPA is designed to make self-supervised learning both principled and practical. Its core strategies include:

  • Predicting latent features rather than raw pixels, emphasizing semantic understanding.
  • Using a teacher–student architecture with masking to define meaningful prediction tasks.
  • Applying SIGReg to ensure the latent space has isotropic covariance and well-conditioned embeddings.
  • Combining prediction and geometric regularization into a single, complementary objective:
$$ L = L_{pred} + \lambda L_{SIGReg} $$

With these design choices, LeJEPA provides a self-supervised learning framework that is:

  • stable and robust,
  • theoretically grounded,
  • efficient and scalable,
  • and highly effective for a wide range of downstream tasks.

References

Balestriero, R., & LeCun, Y. (2025). LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. arXiv preprint arXiv:2511.08544.

License & Attribution

This blog includes content based on the LeJEPA GitHub repository, which is licensed under the Apache License 2.0.

You must cite the original work if you use LeJEPA in research:

@misc{balestriero2025lejepaprovablescalableselfsupervised,
  title={LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics}, 
  author={Randall Balestriero and Yann LeCun},
  year={2025},
  eprint={2511.08544},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.08544}, 
}
    
