Sequence Networks Explained

Sequence networks can take a sequence as input, produce a sequence as output, or both. We categorize them into three main types:

  1. Vec2Seq
  2. Seq2Vec
  3. Seq2Seq

1. Vec2Seq (Sequence Generation)

$$ f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{N_{\infty}\cdot C} $$
$$ p(y_{1:T}|x)=\sum_{h_{1:T}}p(y_{1:T},h_{1:T}|x)=\sum_{h_{1:T}}\prod_{t=1}^{T}p(y_{t}|h_{t})p(h_{t}|h_{t-1},y_{t-1},x) $$
Notation:
\(x \in \mathbb{R}^D\): conditioning input vector
\(\mathbb{R}^{N_\infty \cdot C}\): space of variable-length sequences of \(C\)-dimensional outputs
\(h_t\): hidden state at time \(t\)
\(p(h_1|h_0,y_0,x) = p(h_1|x)\): initial hidden state distribution

For categorical and real-valued outputs, respectively:

$$ p(y_t|h_t) = \text{Cat}(y_t | \text{softmax}(W_{hy} h_t + b_y)) $$
$$ p(y_t|h_t) = \mathcal{N}(y_t | W_{hy} h_t + b_y, \sigma^2 I) $$
This generative model is called a Recurrent Neural Network (RNN).
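As a concrete illustration, here is a minimal NumPy sketch of Vec2Seq generation with categorical outputs. The dimensions \(D, H, C\), the tanh nonlinearity, and the random (untrained) weights are all assumptions for the sake of a runnable toy, not part of the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions): input D, hidden H, vocabulary C, length T.
D, H, C, T = 4, 8, 5, 6

# Randomly initialized parameters of the conditional generative RNN.
W_xh = rng.normal(scale=0.1, size=(H, D))   # conditions h_t on the input vector x
W_hh = rng.normal(scale=0.1, size=(H, H))   # recurrence h_{t-1} -> h_t
W_yh = rng.normal(scale=0.1, size=(H, C))   # feeds the previous output y_{t-1} back in
W_hy = rng.normal(scale=0.1, size=(C, H))   # readout h_t -> logits
b_h = np.zeros(H)
b_y = np.zeros(C)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(k, n):
    v = np.zeros(n)
    v[k] = 1.0
    return v

# Vec2Seq generation: h_t = tanh(W_xh x + W_hh h_{t-1} + W_yh y_{t-1} + b_h),
# then sample y_t ~ Cat(softmax(W_hy h_t + b_y)).
x = rng.normal(size=D)        # conditioning vector
h = np.zeros(H)               # h_0
y_prev = np.zeros(C)          # y_0 (no previous token)
samples = []
for t in range(T):
    h = np.tanh(W_xh @ x + W_hh @ h + W_yh @ y_prev + b_h)
    probs = softmax(W_hy @ h + b_y)          # p(y_t | h_t)
    y_t = int(rng.choice(C, p=probs))
    samples.append(y_t)
    y_prev = one_hot(y_t, C)

print(samples)
```

With trained weights the same loop performs ancestral sampling from \(p(y_{1:T}|x)\); here it simply demonstrates the recurrence and the categorical readout.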

2. Seq2Vec (Sequence Classification)

$$ f_{\theta}:\mathbb{R}^{T D} \to \mathbb{R}^{C} $$

Output is a class label: \(y \in \{1, \dots, C\}\)

$$ p(y|x_{1:T}) = \text{Cat}(y | \text{softmax}(W h_T)) $$
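A minimal sketch of this classifier, assuming an untrained unidirectional tanh RNN with hypothetical sizes \(D, H, C, T\); only the final state \(h_T\) is passed to the softmax readout.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes (assumptions): input dim D, hidden H, classes C, length T.
D, H, C, T = 4, 8, 3, 10

W_xh = rng.normal(scale=0.1, size=(H, D))
W_hh = rng.normal(scale=0.1, size=(H, H))
b_h  = np.zeros(H)
W    = rng.normal(scale=0.1, size=(C, H))   # readout applied to the final state h_T

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=(T, D))   # input sequence x_{1:T}
h = np.zeros(H)
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)   # unidirectional recurrence

probs = softmax(W @ h)        # p(y | x_{1:T}) = Cat(y | softmax(W h_T))
print(probs, probs.argmax())
```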

Better results are often obtained when the hidden states depend on both past and future context, which gives rise to Bidirectional RNNs:

$$ h_t^{\rightarrow} = \varphi(W_{xh}^{\rightarrow} x_t + W_{hh}^{\rightarrow} h_{t-1}^{\rightarrow} + b_h^{\rightarrow}) $$
$$ h_t^{\leftarrow} = \varphi(W_{xh}^{\leftarrow} x_t + W_{hh}^{\leftarrow} h_{t+1}^{\leftarrow} + b_h^{\leftarrow}) $$
$$ h_t = [h_t^{\rightarrow}, h_t^{\leftarrow}] $$
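The sketch below implements these bidirectional recurrences directly, again with random tanh RNN parameters and hypothetical sizes \(D, H, T\): the forward pass runs left to right, the backward pass right to left, and the two state sequences are concatenated per time step.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H, T = 4, 8, 10   # hypothetical sizes (assumptions)

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# Separate parameters for the forward (->) and backward (<-) passes.
W_xh_f, W_hh_f, b_f = init((H, D)), init((H, H)), np.zeros(H)
W_xh_b, W_hh_b, b_b = init((H, D)), init((H, H)), np.zeros(H)

x = rng.normal(size=(T, D))   # input sequence x_{1:T}

# Forward states: h_t^-> depends on h_{t-1}^->
h_f = np.zeros((T, H))
h = np.zeros(H)
for t in range(T):
    h = np.tanh(W_xh_f @ x[t] + W_hh_f @ h + b_f)
    h_f[t] = h

# Backward states: h_t^<- depends on h_{t+1}^<-
h_b = np.zeros((T, H))
h = np.zeros(H)
for t in reversed(range(T)):
    h = np.tanh(W_xh_b @ x[t] + W_hh_b @ h + b_b)
    h_b[t] = h

# Concatenate: h_t = [h_t^->, h_t^<-], giving shape (T, 2H)
h_bi = np.concatenate([h_f, h_b], axis=1)
print(h_bi.shape)
```

A classifier can then apply the softmax readout to the concatenated state at the last step (or to a pooled state) instead of the unidirectional \(h_T\).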

3. Seq2Seq (Sequence-to-Sequence Models)

Seq2Seq models map an input sequence \(x_{1:T}\) to an output sequence \(y_{1:T'}\). They are commonly implemented as encoder-decoder architectures built from RNNs (e.g., LSTM or GRU cells) or Transformers.

$$ p(y_{1:T'}|x_{1:T}) = \prod_{t=1}^{T'} p(y_t | y_{< t}, x_{1:T}) $$
Notation:
\(x_{1:T}\): input sequence of length \(T\)
\(y_{1:T'}\): output sequence of length \(T'\)
\(y_{< t}\): previously generated tokens
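The following sketch makes the factorization concrete with a bare-bones RNN encoder-decoder: the encoder compresses \(x_{1:T}\) into a hidden state, and the decoder emits each \(y_t\) conditioned on that state and the previously generated tokens. The vocabulary size, hidden size, greedy decoding, and random weights are all assumptions for illustration; a real system would use trained LSTM/GRU or Transformer layers.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes (assumptions): vocab C, hidden H, input length T, output length T'.
C, H, T, T_out = 5, 8, 6, 4

def init(shape):
    return rng.normal(scale=0.1, size=shape)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(k, n):
    v = np.zeros(n)
    v[k] = 1.0
    return v

# Encoder parameters (read x_{1:T}) and decoder parameters (emit y_{1:T'}).
We_xh, We_hh = init((H, C)), init((H, H))
Wd_yh, Wd_hh, Wd_hy = init((H, C)), init((H, H)), init((C, H))

x_tokens = rng.integers(0, C, size=T)       # input sequence x_{1:T} (token ids)

# Encoder: compress x_{1:T} into the final hidden state.
h = np.zeros(H)
for t in range(T):
    h = np.tanh(We_xh @ one_hot(x_tokens[t], C) + We_hh @ h)

# Decoder: p(y_t | y_{<t}, x_{1:T}) via the state initialized from the encoder.
y_prev = np.zeros(C)
y_out = []
for t in range(T_out):
    h = np.tanh(Wd_yh @ y_prev + Wd_hh @ h)
    probs = softmax(Wd_hy @ h)              # p(y_t | y_{<t}, x_{1:T})
    y_t = int(probs.argmax())               # greedy decoding
    y_out.append(y_t)
    y_prev = one_hot(y_t, C)

print(y_out)
```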

Seq2Seq models are widely used in machine translation, text summarization, and speech recognition.
