Teaching One Model to See, Find, and Read

Here is a question worth sitting with: when you read a scanned page, your brain does not run three separate programs — one to identify the title, another to draw mental boxes around each word, a third to actually read the letters. Something more elegant happens. Your visual system extracts features once, and those same features serve every level of understanding simultaneously. The strokes that tell you "this is a word" are the same strokes that tell you "this word is hello."

Trident v4 is built on this exact observation. Layout detection, word detection, and text recognition should not be three separate models stitched together with brittle handoffs. They should be three heads on one shared backbone, learning together, sharing features, fighting for the same parameters. This blog walks through that model end-to-end — the math, the architecture, the loss functions, and the training loop — for someone who wants to understand not just what we did but why every piece had to be there.

1. The Problem and the Notation

A document image is a complicated thing. At 1024×1024 resolution it is over three million numbers, but almost all of that data is redundant relative to what we actually want — a structured output that says "here is the title, here are the paragraphs, here is each word, and here is the text inside each word." The model's job is to take the raw pixel grid and produce that structured output in a single forward pass.

Let me set up notation that will stay consistent through the whole post. Each input batch is

$$\mathbf{X} \in \mathbb{R}^{B \times 3 \times H \times W}, \quad H = W = 1024, \quad B = 4$$

For each image \( b \in \{1, \dots, B\} \) we have a variable number of word annotations \( N_b \), capped at \( N_{\max} = 200 \). Each word \( i \) comes with a bounding box \( \mathbf{b}_i^{\text{w}} \in [0, 1]^4 \) (normalized corners) and a target text sequence \( \mathbf{y}_i \in \mathcal{V}^T \) of length \( T = 32 \), where \( \mathcal{V} \) is the character vocabulary including four special tokens: \( \langle\text{pad}\rangle, \langle\text{sos}\rangle, \langle\text{eos}\rangle, \langle\text{unk}\rangle \).

The whole model is a function \( f_\theta \) that maps an input image to three structured outputs simultaneously: layout regions, word boxes, and word text. The parameters \( \theta \) are shared across all three tasks via a common backbone — and that sharing is the entire point of the architecture.
Figure: One image, three structured outputs. The 1024 × 1024 × 3 input (≈3M numbers) flows through the shared ResNet-50 backbone and FPN (~30M parameters total, trained jointly) into three heads: H₁ layout regions (title, text_block, table, …) from stride 32, H₂ word boxes from stride 8, and H₃ word text via per-word RoIAlign.

2. The Data Pipeline

Before any neural network sees an image, the data goes through a careful preprocessing pipeline. Two things matter here: getting the geometry right so boxes line up with what the model actually sees, and getting the regularization right so the model does not memorize specific pixels.

2.1 The Annotation Format

Every training sample is a tuple of an image and its annotations:

$$\mathcal{D}_b = \big( I_b, \;\{(\mathbf{b}_i^{\text{w}}, \mathbf{y}_i)\}_{i=1}^{N_b}, \;\{(\mathbf{b}_j^{\ell}, c_j)\}_{j=1}^{M_b} \big)$$

where \( I_b \) is the original image, each word has a pixel-space bounding box \( \mathbf{b}_i^{\text{w}} = (x_1, y_1, x_2, y_2) \) and a character-token sequence \( \mathbf{y}_i \), and each layout region has a box \( \mathbf{b}_j^{\ell} \) and a class label \( c_j \in \{0, \dots, 4\} \). Currently the layout boxes are filled with random dummy values since we have not annotated real layout yet — and that turns out to matter for how we set up training.

2.2 Image Preprocessing

Three transformations happen in order:

$$I_b' = \text{Normalize}\!\Big(\text{Resize}_{1024 \times 1024}\!\big(\text{Aug}(I_b)\big)\Big)$$

The augmentation step Aug applies color jitter, occasional grayscale conversion, and Gaussian blur. Crucially, all of these are photometric — they change pixel values but not pixel locations. That means the bounding boxes do not need to be transformed alongside the image, which avoids a whole category of bugs. Resize maps to 1024×1024, and Normalize applies ImageNet statistics \( \mu = (0.485, 0.456, 0.406) \), \( \sigma = (0.229, 0.224, 0.225) \).

After these, on training only, we apply random erasing — wiping out small random patches of the normalized tensor:

$$I_b'' = \mathbb{1}[u > p_e] \cdot I_b' + \mathbb{1}[u \le p_e] \cdot \text{Erase}(I_b'), \quad u \sim \mathcal{U}(0, 1)$$

with \( p_e = 0.5 \). This is the single biggest regularizer against the model memorizing entire pages — by destroying random pieces of the input, we force the network to recognize words even when context is missing. Without it, with only 3,000 unique pages in the dataset, the recognition head learns the page rather than the words on the page.
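
As a concrete reference, here is a minimal torchvision sketch of the training-time pipeline. The jitter strengths and blur kernel are illustrative assumptions; the 15% grayscale probability comes from the pipeline figure below, and everything else follows the equations above.

from torchvision import transforms

# Photometric augmentations only: pixel values change, pixel locations do not,
# so bounding boxes stay valid without any coordinate transform.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # assumed strengths
    transforms.RandomGrayscale(p=0.15),          # probability from the figure below
    transforms.GaussianBlur(kernel_size=3),      # assumed kernel size
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    transforms.RandomErasing(p=0.5),             # training only, on the normalized tensor
])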

2.3 Box and Token Encoding

Bounding boxes get normalized to \( [0, 1] \) by dividing by image dimensions:

$$\tilde{\mathbf{b}}_i^{\text{w}} = (x_1/W_0, \; y_1/H_0, \; x_2/W_0, \; y_2/H_0)$$

And each word's text gets tokenized character by character, with start and end markers added:

$$\mathbf{y}_i = [\langle\text{sos}\rangle, c_1, c_2, \dots, c_{L_i}, \langle\text{eos}\rangle, \langle\text{pad}\rangle, \dots] \in \{0, \dots, V-1\}^T$$

The pad tokens fill the rest of the fixed length \( T = 32 \). During loss computation, they are ignored — this is what lets us batch words of different lengths through the same decoder without the loss being polluted by predictions on padding.
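
A minimal encoder for this scheme might look like the following sketch; char_to_idx and the special-token indices are hypothetical names, but the truncation and padding logic follows the equation above.

# Hypothetical special-token indices; actual values come from the vocabulary.
PAD, SOS, EOS, UNK = 0, 1, 2, 3

def encode_word(text, char_to_idx, T=32):
    """[sos] c1 ... cL [eos] [pad] ... -- always exactly T tokens."""
    tokens = [SOS]
    for ch in text[: T - 2]:                  # leave room for sos and eos
        tokens.append(char_to_idx.get(ch, UNK))
    tokens.append(EOS)
    tokens += [PAD] * (T - len(tokens))       # pad positions are ignored by the loss
    return tokens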

Figure: Data pipeline, from raw page to training tensor. 1) raw page at variable size; 2) photometric augmentation (ColorJitter, 15% grayscale, GaussianBlur, RandomErasing) that leaves boxes unchanged; 3) resize to 1024 × 1024 with ImageNet μ, σ normalization; 4) encoding of boxes to [0, 1] and text to length-32 token sequences; 5) tensors X: B × 3 × 1024², B_w: B × N × 4, Y: B × N × 32, ready for the backbone.

3. The Model Architecture

Now we get to the part where pixels become predictions. The architecture has four pieces: a backbone that extracts multi-scale features, a feature pyramid that fuses them, three heads that solve different tasks, and an RoIAlign operation that bridges detection and recognition. Every piece is there for a reason.

Figure: Full architecture. Image X (B × 3 × 1024 × 1024) → ResNet-50 backbone producing c₃ (stride 8, B × 512 × 128 × 128), c₄ (stride 16, B × 1024 × 64 × 64), c₅ (stride 32, B × 2048 × 32 × 32) → FPN top-down fusion (lateral 1×1, upsample, smooth 3×3, GroupNorm) into 256-channel P₃/P₄/P₅ → H₂ detection (FCOS) on P₃ feeding H₃ recognition via RoIAlign, and H₁ layout (FCOS, 5 region classes) on P₅.

3.1 Backbone: ResNet-50

The backbone is a ResNet-50 with ImageNet-V2 pretrained weights. We pull out three intermediate feature maps:

$$\mathbf{c}_3 = f^{(3)}(\mathbf{X}), \quad \mathbf{c}_4 = f^{(4)}(\mathbf{c}_3), \quad \mathbf{c}_5 = f^{(5)}(\mathbf{c}_4)$$

For a 1024×1024 input, the spatial sizes are 128×128, 64×64, and 32×32 respectively, with channel counts 512, 1024, and 2048. The deeper the layer, the smaller the spatial resolution but the more semantically rich the features. This is a fundamental tradeoff in convolutional networks — and it is exactly the tradeoff the FPN exists to resolve.
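
One way to tap these three maps from a torchvision ResNet-50 is create_feature_extractor; a minimal sketch:

import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Expose the outputs of layer2 / layer3 / layer4 as c3 / c4 / c5.
backbone = create_feature_extractor(
    resnet50(weights=ResNet50_Weights.IMAGENET1K_V2),
    return_nodes={"layer2": "c3", "layer3": "c4", "layer4": "c5"},
)

feats = backbone(torch.randn(1, 3, 1024, 1024))
# feats["c3"]: 1 x 512  x 128 x 128   (stride 8)
# feats["c4"]: 1 x 1024 x 64  x 64    (stride 16)
# feats["c5"]: 1 x 2048 x 32  x 32    (stride 32)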

3.2 Feature Pyramid Network (FPN)

Different tasks need features at different scales. A whole paragraph fits comfortably in the 32×32 grid of \( \mathbf{c}_5 \). A single word, maybe 20 pixels tall on a 1024-pixel image, would correspond to less than one cell on that grid — completely undetectable. We need finer features for words.

The FPN solves this by fusing features top-down. Each level gets a 1×1 lateral projection to 256 channels, then the higher level is upsampled and added in:

$$\mathbf{p}_5 = \text{GN}(\hat{\mathbf{p}}_5), \quad \hat{\mathbf{p}}_k = \text{Conv}_{1\times 1}(\mathbf{c}_k)$$
$$\mathbf{p}_4 = \text{GN}\!\Big(\text{Conv}_{3\times 3}\big(\hat{\mathbf{p}}_4 + \text{Upsample}(\mathbf{p}_5)\big)\Big)$$
$$\mathbf{p}_3 = \text{GN}\!\Big(\text{Conv}_{3\times 3}\big(\hat{\mathbf{p}}_3 + \text{Upsample}(\mathbf{p}_4)\big)\Big)$$

The intuition: \( \mathbf{p}_3 \) inherits the semantic richness of deeper features but keeps the high spatial resolution of \( \mathbf{c}_3 \). After the FPN, all three feature maps have the same channel count (256), but P₃ becomes the workhorse for everything word-related.

The FPN is what makes word detection possible at all. On P₅ alone (stride 32), a typical word occupies less than one feature cell — there is literally nowhere for the model to make a positive prediction. On P₃ (stride 8), the same word covers about 2.5 cells, which is enough for FCOS-style dense detection to work. This is not a small detail — without the FPN, the detection head simply does not learn.
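
A minimal module implementing exactly these three equations (the GroupNorm group count is an assumption):

import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Top-down fusion of (c3, c4, c5) into 256-channel (p3, p4, p5)."""
    def __init__(self, in_ch=(512, 1024, 2048), out_ch=256, groups=32):
        super().__init__()
        self.lat3, self.lat4, self.lat5 = (nn.Conv2d(c, out_ch, 1) for c in in_ch)
        self.smooth3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.smooth4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.gn3, self.gn4, self.gn5 = (nn.GroupNorm(groups, out_ch) for _ in range(3))

    def forward(self, c3, c4, c5):
        p5 = self.gn5(self.lat5(c5))
        p4 = self.gn4(self.smooth4(self.lat4(c4) + F.interpolate(p5, scale_factor=2)))
        p3 = self.gn3(self.smooth3(self.lat3(c3) + F.interpolate(p4, scale_factor=2)))
        return p3, p4, p5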

3.3 The Dense Heads (Layout H₁ and Detection H₂)

Both layout and word detection use the same head architecture, just with different input features and different numbers of output classes. The structure is FCOS-style: anchor-free, fully convolutional, predicting per-pixel classification scores plus bounding box offsets.

Given input feature \( \mathbf{f} \), a small tower of two 3×3 conv blocks (with GroupNorm and ReLU) produces \( \mathbf{t} \). Three parallel 3×3 convs then output:

$$\mathbf{s}^{\text{cls}} = W_{\text{cls}} * \mathbf{t} \in \mathbb{R}^{B \times K_{\text{cls}} \times H_k \times W_k}$$
$$\mathbf{s}^{\text{reg}} = \text{ReLU}(W_{\text{reg}} * \mathbf{t}) \in \mathbb{R}^{B \times 4 \times H_k \times W_k}$$
$$\mathbf{s}^{\text{ctr}} = W_{\text{ctr}} * \mathbf{t} \in \mathbb{R}^{B \times 1 \times H_k \times W_k}$$

The regression channels predict, for each spatial location, the four distances \( (\ell, t, r, b) \) from that location to the edges of the box it belongs to. The centerness channel predicts how central the location is within that box — this gets used at inference to down-weight predictions near box edges, which tend to be lower quality.

For detection (H₂), \( K_{\text{cls}} = 1 \) — just "is this a word or not" — and the input is P₃. For layout (H₁), \( K_{\text{cls}} = 5 \) for the five region classes, and the input is P₅.
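
In code the head is just the tower plus three parallel convolutions; a sketch (GroupNorm group count assumed):

import torch.nn as nn
import torch.nn.functional as F

class DenseHead(nn.Module):
    """FCOS-style head: shared tower, then cls / reg / centerness branches."""
    def __init__(self, in_ch=256, num_classes=1, groups=32):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.GroupNorm(groups, in_ch), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.GroupNorm(groups, in_ch), nn.ReLU(),
        )
        self.cls = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # K_cls logits per cell
        self.reg = nn.Conv2d(in_ch, 4, 3, padding=1)            # (l, t, r, b) distances
        self.ctr = nn.Conv2d(in_ch, 1, 3, padding=1)            # centerness logit

    def forward(self, f):
        t = self.tower(f)
        return self.cls(t), F.relu(self.reg(t)), self.ctr(t)

det_head    = DenseHead(num_classes=1)   # H2 on P3: word vs. background
layout_head = DenseHead(num_classes=5)   # H1 on P5: five region classes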

3.4 RoIAlign — The Bridge to Recognition

Word detection gives us locations. To recognize the text inside each location, we need to extract features specifically for each word. RoIAlign does this — it takes the P₃ feature map and the list of word boxes, and produces a fixed-size feature tensor for each box:

$$\mathbf{r}_i = \text{RoIAlign}\!\Big(\mathbf{p}_3, \;\tilde{\mathbf{b}}_i^{\text{w}} \cdot (W_3, H_3, W_3, H_3)\Big) \in \mathbb{R}^{256 \times 8 \times 32}$$

The output is always 8×32 spatially regardless of how big the word was originally. This is what lets us batch all words from the page through a single recognition head with the same shape. Stacking all words across the batch gives \( \mathbf{R} \in \mathbb{R}^{K \times 256 \times 8 \times 32} \) where \( K = \sum_b N_b \).
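
torchvision's roi_align does this directly; a sketch, assuming p3 has shape (B, 256, 128, 128) and boxes_per_image is a list of per-image (N_b, 4) tensors of normalized corners (sampling_ratio and aligned=True are assumed settings):

from torchvision.ops import roi_align

# Scale normalized boxes into P3's feature-grid coordinates (W3 = H3 = 128).
boxes_p3 = [b * 128.0 for b in boxes_per_image]

# Fixed 8 x 32 output per word, regardless of original box size.
R = roi_align(p3, boxes_p3, output_size=(8, 32), sampling_ratio=2, aligned=True)
# R: (K, 256, 8, 32) with K = total words across the batch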

Figure: RoIAlign. Variable-size word boxes on the P₃ feature map (256 × 128 × 128) are bilinearly sampled into identical (256 × 8 × 32) tensors and stacked into R (K × 256 × 8 × 32) — which is why the recognition head can batch them.

3.5 Recognition Head H₃ — The Transformer Decoder

This is where pixels become text. The architecture is a 4-layer Transformer decoder where the "memory" comes from the RoI features and the "queries" come from the partial sequence of characters generated so far.

First, the RoI feature gets flattened spatially and projected to the decoder's hidden dimension:

$$\mathbf{m}_i = \text{PE}_{\text{enc}}\!\Big(\text{Conv}_{1\times 1}(\mathbf{r}_i)\Big) \in \mathbb{R}^{256 \times 256}$$

The first 256 is the sequence length (8 × 32 = 256 spatial tokens), the second is the hidden dimension. PE is sinusoidal positional encoding — it gives the decoder a sense of where in the word crop each feature token came from.

During training, we use teacher forcing: the decoder gets the ground-truth sequence shifted right by one as input, and predicts the same sequence shifted left:

$$\mathbf{e}_{i,t} = \text{Dropout}_{0.3}\!\Big(\text{PE}_{\text{dec}}\!\big(\mathbf{E}[y_{i,t}^{\text{in}}]\big)\Big)$$

where \( \mathbf{E} \in \mathbb{R}^{V \times 256} \) is the token embedding matrix. A causal mask prevents the decoder from attending to future positions:

$$M_{t,s} = \begin{cases} 0 & \text{if } s \le t \\ -\infty & \text{if } s > t \end{cases}$$

The decoder layers do masked self-attention on the partial sequence, cross-attention to the RoI memory, and an FFN. After 4 layers, a linear projection maps to vocabulary logits:

$$\mathbf{z}_{i,t} = W_{\text{out}} \, \mathbf{h}_{i,t} \in \mathbb{R}^V$$

Stacked across all words and time steps: \( \mathbf{Z} \in \mathbb{R}^{K \times (T-1) \times V} \). This is the raw output the loss will work on.
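
A minimal sketch of this head with nn.TransformerDecoder — hidden size 256 and 4 layers as described; the head count and FFN width are assumptions, and pos_enc / pos_dec stand for precomputed sinusoidal tables:

import torch.nn as nn

class RecognitionHead(nn.Module):
    def __init__(self, vocab_size, d=256, num_layers=4, num_heads=8):
        super().__init__()
        self.proj = nn.Conv2d(256, d, 1)          # RoI channels -> hidden dim
        self.embed = nn.Embedding(vocab_size, d)
        self.drop = nn.Dropout(0.3)
        layer = nn.TransformerDecoderLayer(d, num_heads, dim_feedforward=4 * d,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, r, y_in, pos_enc, pos_dec):
        # r: (K, 256, 8, 32) -> memory of 256 spatial tokens: (K, 256, d)
        m = self.proj(r).flatten(2).transpose(1, 2) + pos_enc
        e = self.drop(self.embed(y_in) + pos_dec[: y_in.size(1)])
        mask = nn.Transformer.generate_square_subsequent_mask(y_in.size(1)).to(e.device)
        h = self.decoder(e, m, tgt_mask=mask)     # masked self-attn + cross-attn + FFN
        return self.out(h)                        # (K, T-1, V) logits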

4. The Loss — Three Objectives, One Gradient

Training a multi-task model is mostly about getting the loss right. Three different tasks contribute three different losses, and they all flow back through the shared backbone. The total loss is a weighted sum:

$$\boxed{\;\mathcal{L}_{\text{total}} = \lambda_{\text{layout}} \,\mathcal{L}_{\text{layout}} + \lambda_{\text{det}} \,\mathcal{L}_{\text{det}} + \lambda_{\text{rec}} \,\mathcal{L}_{\text{rec}}\;}$$

Currently \( \lambda_{\text{layout}} = 0 \) — we have not annotated real layout regions yet, so this would just be noise — and \( \lambda_{\text{det}} = \lambda_{\text{rec}} = 1 \). Once real layout annotations exist, the layout weight comes back to 1.

4.1 The Dense Head Loss (FCOS-Style)

The same loss formula serves both H₁ and H₂. For each spatial location \( (b, h, w) \) on the feature map, we determine whether it falls inside any ground-truth box. If multiple boxes contain it, we assign the smallest one — this is FCOS's tie-breaking rule, and it prevents large boxes from swallowing positive locations that should belong to nearby small boxes.

For positive locations, the regression target is the four distances to box edges:

$$\mathbf{r}^*_{bhw} = (g_x - x_1, \; g_y - y_1, \; x_2 - g_x, \; y_2 - g_y)$$

And the centerness target is:

$$s^*_{bhw} = \sqrt{\frac{\min(\ell, r)}{\max(\ell, r)} \cdot \frac{\min(t, b)}{\max(t, b)}} \;\in [0, 1]$$

This is 1 at the exact center of a box and decays toward 0 at the edges. The model learns to predict it, and at inference we multiply it into the classification score to suppress edge-of-box predictions.

Figure: FCOS-style target assignment on P₃ (stride 8), where each cell is one (h, w) location. Each positive cell inside a ground-truth word box gets cls target 1 ("it's a word"), reg target (ℓ, t, r, b) — the distances to the four box edges — and a centerness target in [0, 1]; a cell inside multiple boxes is assigned to the smallest, preventing large boxes from swallowing nearby small ones.

The classification loss is focal loss, which down-weights easy examples and focuses learning on the hard ones:

$$\mathcal{L}_{\text{cls}} = \frac{1}{N_{\text{pos}}} \sum_{b,h,w,k} \alpha_t \cdot (1 - p_t)^\gamma \cdot \text{BCE}(s^{\text{cls}}_{bkhw}, c^*_{bkhw})$$

with \( \alpha = 0.25 \) and \( \gamma = 2 \). Without focal loss, the millions of easy-negative background locations would drown out the few hundred true word locations — the model would learn "everything is background" and call it a day.

The regression loss only applies on positive locations and uses GIoU:

$$\mathcal{L}_{\text{reg}} = \frac{1}{N_{\text{pos}}} \sum_{(b,h,w) \in \text{pos}} \big(1 - \text{GIoU}(\mathbf{r}^{\text{pred}}_{bhw}, \mathbf{r}^*_{bhw})\big)$$

where

$$\text{GIoU}(B_p, B_g) = \text{IoU}(B_p, B_g) - \frac{|C \setminus (B_p \cup B_g)|}{|C|}$$

and \( C \) is the smallest box that encloses both \( B_p \) and \( B_g \). GIoU goes beyond plain IoU because it penalizes predicted boxes that do not even overlap the ground truth, by accounting for how much "wasted space" the enclosing box contains. This gives a useful gradient even when the prediction is far from correct.

The centerness loss is straightforward BCE on the centerness target:

$$\mathcal{L}_{\text{ctr}} = \frac{1}{N_{\text{pos}}} \sum_{(b,h,w) \in \text{pos}} \text{BCE}\!\big(\sigma(s^{\text{ctr}}_{bhw}), \, s^*_{bhw}\big)$$

And the head loss combines all three:

$$\mathcal{L}_{\text{head}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}} + \mathcal{L}_{\text{ctr}}$$
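
torchvision ships both focal loss and GIoU loss; here is a sketch of the three terms, operating on flattened locations after target assignment, with the (ℓ, t, r, b) distances already decoded back into corner-format boxes:

import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

def fcos_loss_terms(cls_logits, boxes_pred, ctr_logits, cls_tgt, boxes_tgt, ctr_tgt, pos):
    """cls_logits/cls_tgt: (L, K_cls); boxes_*: (L, 4) as (x1, y1, x2, y2);
    ctr_*: (L,); pos: boolean mask of positive locations."""
    n_pos = pos.sum().clamp(min=1)
    l_cls = sigmoid_focal_loss(cls_logits, cls_tgt, alpha=0.25, gamma=2.0,
                               reduction="sum") / n_pos
    l_reg = generalized_box_iou_loss(boxes_pred[pos], boxes_tgt[pos],
                                     reduction="sum") / n_pos
    l_ctr = F.binary_cross_entropy_with_logits(ctr_logits[pos], ctr_tgt[pos],
                                               reduction="sum") / n_pos
    return l_cls + l_reg + l_ctr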

4.2 The Recognition Loss

This is cross-entropy with label smoothing on the predicted character sequences, ignoring pad tokens:

\[ \mathcal{L}_{\text{rec}} = -\frac{1}{K \cdot T'} \sum_{i=1}^{K} \sum_{t=1}^{T-1} \mathbb{1}\!\left[y_{i,t}^{\text{out}} \neq \text{pad}\right] \sum_{v=1}^{V} q_v^\epsilon\!\left(y_{i,t}^{\text{out}}\right) \log p_\theta\!\left(v \mid y_{i,\,<t},\, \mathbf{m}_i\right) \]

Here \( T' \) denotes the effective number of supervised (non-pad) positions, so the normalizer \( K \cdot T' \) counts only real characters. The smoothed label distribution is:

\[ q_v^\epsilon(y) = (1 - \epsilon)\cdot \mathbb{1}[v = y] + \frac{\epsilon}{V}, \quad \epsilon = 0.1 \]

Label smoothing prevents the model from becoming too confident — instead of pushing toward 100% probability on the correct character, it pushes toward 90% on correct and a tiny mass on each other character. This dramatically helps generalization, especially when the training data is small. With only 3,000 unique pages, every regularizer counts.
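
Both the smoothing and the pad masking are built into PyTorch's cross-entropy; a sketch (PAD as in the tokenizer sketch above):

import torch.nn as nn

# ignore_index drops pad positions from the loss; label_smoothing=0.1
# spreads epsilon mass across the vocabulary as in the equation.
recog_loss = nn.CrossEntropyLoss(ignore_index=PAD, label_smoothing=0.1)

# rec_logits: (K, T-1, V); targets: (K, T-1) -- flatten both for CE.
loss = recog_loss(rec_logits.reshape(-1, rec_logits.size(-1)), targets.reshape(-1))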

5. Optimization — How the Weights Actually Move

Having the loss is half the battle. The other half is moving the weights to minimize it without breaking everything along the way. Three pieces matter here: the optimizer, the learning rate schedule, and gradient clipping.

5.1 The Optimizer

We use AdamW, which is Adam with decoupled weight decay:

$$\theta_{t+1} = \theta_t - \eta_t \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \varepsilon} - \eta_t \cdot \lambda_{\text{wd}} \cdot \theta_t$$

with \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), \( \varepsilon = 10^{-8} \), and weight decay \( \lambda_{\text{wd}} = 5 \times 10^{-4} \). The decoupling means weight decay acts as true L2 regularization regardless of the adaptive learning rate, which matters more than it sounds — vanilla Adam with weight decay in the loss term gets distorted by the adaptive denominator, and AdamW fixes that.

5.2 The Learning Rate Schedule

Two phases. First, linear warmup for 3 epochs from 0 to \( \eta_{\max} = 10^{-4} \):

$$\eta_e = \eta_{\max} \cdot \frac{e + 1}{E_w}, \quad e \in \{0, 1, 2\}$$

Then cosine annealing for the remaining epochs, decaying smoothly to \( \eta_{\min} = 10^{-6} \):

$$\eta_e = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\bigg(1 + \cos\!\Big(\pi \cdot \frac{e - E_w}{E - E_w}\Big)\bigg)$$

Warmup matters because the optimizer's running estimates \( \hat{\mathbf{m}}, \hat{\mathbf{v}} \) are unreliable in the first few steps. Starting at full learning rate amplifies that noise into bad early updates the model spends epochs recovering from. Linear warmup gives those estimates time to stabilize before any large step happens.
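
In PyTorch this is a LinearLR warmup chained into CosineAnnealingLR via SequentialLR; a sketch with the stated hyperparameters (E = 80 total epochs from the schedule figure below; LinearLR's interpolation is close to, though not bit-exact with, the warmup formula above):

from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=5e-4)

E, E_w = 80, 3   # total epochs and warmup epochs
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1 / E_w, end_factor=1.0, total_iters=E_w),
        CosineAnnealingLR(optimizer, T_max=E - E_w, eta_min=1e-6),
    ],
    milestones=[E_w],
)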

Figure: Learning rate schedule. Phase 1: linear warmup from 0 to η_max = 10⁻⁴ over 3 epochs. Phase 2: cosine annealing down to η_min = 10⁻⁶ over the remaining 77 epochs (80 total).

5.3 Gradient Clipping

Before each step, we rescale the gradient if its norm exceeds a threshold:

$$\mathbf{g} \leftarrow \mathbf{g} \cdot \min\!\Big(1, \;\frac{\tau}{\|\mathbf{g}\|_2}\Big), \quad \tau = 1.0$$

This prevents single bad batches from blowing up the optimizer state. It is cheap and almost always a good idea on transformer-based models.

5.4 The Training Loop

Putting it all together, one epoch looks like:

for epoch in range(E):
    for X, B_w, Y, B_l, C_l in train_loader:
        c3, c4, c5 = backbone(X)
        p3, p4, p5 = fpn(c3, c4, c5)
        layout_out = layout_head(p5)                 # H1
        det_out    = det_head(p3)                    # H2
        R          = roi_align_words(p3, B_w)        # Section 3.4
        rec_logits = rec_head(R, Y[:, :-1])          # H3, teacher forcing

        L_layout = dense_head_loss(layout_out, B_l, C_l)
        L_det    = dense_head_loss(det_out, B_w, ones)   # single "word" class
        L_rec    = recog_loss(rec_logits, Y[:, 1:])

        L = lambda_l * L_layout + lambda_d * L_det + lambda_r * L_rec

        optimizer.zero_grad()
        L.backward()
        clip_grad_norm_(model.parameters(), 1.0)     # Section 5.3
        optimizer.step()
    scheduler.step()

5.5 Best-Checkpoint Selection

After every epoch we evaluate on a held-out 10% validation split. The best model is the one that minimizes validation recognition loss:

$$\theta^* = \arg\min_{e} \mathcal{L}_{\text{rec}}^{\text{val}}(e)$$

Note that we use the recognition loss specifically, not the total loss. The total loss is contaminated by layout (currently random) and detection (which has its own dynamics), so picking based on it would lead us astray. Recognition loss is the cleanest signal we have right now. Early stopping triggers if validation recognition loss does not improve for 10 consecutive epochs, which prevents wasting compute once the model has plateaued.

6. Inference — From Pixels to Predictions

At training time, the recognition head sees the ground-truth sequence shifted right and predicts the next character at each position. At inference there is no ground truth — the model has to generate text autoregressively, one character at a time:

$$\hat{y}_{i,1} = \arg\max_v \, p_\theta(v \mid \langle\text{sos}\rangle, \mathbf{m}_i)$$
$$\hat{y}_{i,t+1} = \arg\max_v \, p_\theta(v \mid \hat{y}_{i,1:t}, \mathbf{m}_i)$$

The loop continues until either an end-of-sequence token is generated or the maximum length \( T - 1 = 31 \) is hit. Greedy decoding is what we use, though beam search would be a straightforward upgrade if it ever matters in practice.
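
A sketch of the greedy loop, reusing the RecognitionHead from Section 3.5 with a growing prefix (call rec_head.eval() first to disable dropout; a KV cache would avoid recomputing earlier steps, but is omitted for clarity):

import torch

@torch.no_grad()
def greedy_decode(rec_head, r, pos_enc, pos_dec, T=32):
    """Decode one batch of word crops r: (K, 256, 8, 32) -> token sequences."""
    K = r.size(0)
    seq = torch.full((K, 1), SOS, dtype=torch.long, device=r.device)
    for _ in range(T - 1):
        logits = rec_head(r, seq, pos_enc, pos_dec)        # (K, t, V)
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy pick per word
        seq = torch.cat([seq, nxt], dim=1)
        if (nxt == EOS).all():                             # every word has finished
            break
    return seq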

Figure: Greedy autoregressive decoding. The RoI memory mᵢ (256 × 256) stays fixed while the decoder runs one forward pass per character (⟨sos⟩ → "h" → "e" → "l" → "l" → "o"), stopping when ⟨eos⟩ is generated or the length reaches T − 1 = 31 — five forward passes for the word "hello".

For word boxes there are two options. The model has its own H₂ that produces dense predictions on P₃, which can be post-processed with thresholding and NMS into a list of boxes. But for the current pipeline we use DocTR's detector for word boxes and only use Trident v4 for recognition — H₂ is still training and not yet competitive with DocTR's purpose-built detector. Once H₂ converges, the whole pipeline collapses into a single forward pass through Trident v4.

7. The Bigger Picture: Why Unified Beats Modular

It is worth stepping back and asking: why does any of this matter beyond an interesting set of architectural choices?

The unified design exists because of one core observation: detecting words and recognizing them require the same kinds of features. Local strokes, edges, character shapes — these are useful for both saying "yes, there is a word here" and saying "and that word is hello." Running one shared backbone instead of three separate ones (one for layout, one for detection, one for recognition) saves compute and lets the features be jointly optimized for all three tasks. The gradients from recognition flow back through the same layers that produce detection features, and vice versa. They sharpen each other.

The FPN exists because the three tasks need features at different scales. Layout regions are big — P₅ at stride 32 is fine. Word boxes are small — they need P₃ at stride 8. Characters within word crops are even smaller, but RoIAlign handles that by extracting a fixed-size feature per word at high resolution from P₃. One pyramid serves three masters.

The current bottleneck is data, not architecture. With around 3,000 unique pages, the recognition head can memorize whole pages instead of generalizing — train recognition loss drops near zero while validation loss climbs. Augmentation (RandomErasing, ColorJitter), dropout (0.3), and label smoothing (0.1) are the regularizers fighting that memorization. The real fix is more data — and that is where the next phase of this project goes.

Two failure modes worth knowing about. First, when the layout head is trained on dummy random boxes, its loss never converges meaningfully — it sits at a plateau because there is no real signal. We zero out its weight in the total loss for now to keep it from polluting the gradient. Second, on a stride-32 feature map alone, individual words occupy less than one cell, so the detection head produces no positive locations and gets stuck at its initial loss forever. The FPN with P₃ at stride 8 is what unblocks this. We learned both of these the hard way from training v1, which had neither fix.

Looking forward, four pieces of work would meaningfully improve this model: more annotated pages (10k+ unique pages would change everything), real layout annotations to turn H₁ on, a beam search decoder for recognition, and self-distillation from a larger frozen teacher to compress the model. The architecture is solid; what it needs now is data.

But perhaps the most important insight is the simplest one: every modern document understanding system is, at some level, doing something like this. The question is just how the three tasks are coupled — separate models with hand-engineered handoffs, or one model with shared features and joint optimization. The shared-backbone bet is what Trident v4 makes, and the math we walked through is what makes that bet feasible.
