
Conditional Diffusion for Brightfield to Fluorescence Image Translation

Fluorescence microscopy provides critical biological insights, but acquiring fluorescence images is often time-consuming, expensive, and phototoxic to live cells. This post describes a conditional diffusion model that translates Brightfield (BF) images into corresponding fluorescence channels (red or green) within a unified, probabilistic generative framework.

Instead of predicting fluorescence directly, the model learns how to iteratively denoise fluorescence images conditioned on Brightfield structure and channel identity.
Example image modalities:

  • Brightfield (BF): structural cell morphology captured without fluorescence labeling.
  • Green fluorescence: complementary fluorescence channel with distinct biological specificity.
  • Red fluorescence: fluorescent signal highlighting a specific cellular marker.

Given a single Brightfield image, the conditional diffusion model can generate either fluorescence modality (red or green) by conditioning on the desired output channel.

1. Problem Setup

Given:

  • Brightfield image \( x_{\text{BF}} \)
  • Fluorescence image \( x_0 \) (Red or Green)

The goal is to learn:

$$ p(x_0 \mid x_{\text{BF}}, c) $$

where \( c \) is a condition vector indicating the desired fluorescence channel:

  • Red channel → \( c = [1, 0] \)
  • Green channel → \( c = [0, 1] \)

2. Data Preparation

Each training sample consists of a triplet:

  • Brightfield image
  • Red fluorescence image
  • Green fluorescence image

Preprocessing:

  • All images are resized to 256 × 256
  • Pixel values are normalized to [-1, 1]
  • One fluorescence channel is selected per training iteration
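A minimal preprocessing sketch, assuming PIL inputs, torchvision transforms, and single-channel fluorescence targets; the helper names (`to_model_range`, `make_sample`) are illustrative, not the repository's actual code.

```python
import torch
from torchvision import transforms

# Resize to 256 x 256 and convert to a tensor in [0, 1].
base_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def to_model_range(img):
    """Map an image to a tensor in the [-1, 1] range used by the diffusion model."""
    return base_tf(img) * 2.0 - 1.0

def make_sample(bf_img, red_img, green_img, channel):
    """Build one training sample: Brightfield input plus the selected fluorescence target."""
    bf = to_model_range(bf_img)                                          # (3, 256, 256) RGB Brightfield
    target = to_model_range(red_img if channel == "red" else green_img)  # (1, 256, 256) fluorescence target
    c = torch.tensor([1.0, 0.0]) if channel == "red" else torch.tensor([0.0, 1.0])
    return bf, target, c
```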

3. Input Construction

At each diffusion step, the model receives a 6-channel input:

  • Noisy fluorescence image \( x_t \) (1 channel)
  • Brightfield RGB image (3 channels)
  • Condition map \( c \) broadcast spatially (2 channels)

This produces:

$$ \text{Input} \in \mathbb{R}^{6 \times 256 \times 256} $$
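A sketch of how this 6-channel input could be assembled for a batch; the tensor shapes follow the description above, and the function name is illustrative.

```python
import torch

def build_unet_input(x_t, bf, c):
    """
    Concatenate the noisy target with its conditioning signals.

    x_t : (B, 1, 256, 256) noisy fluorescence image at timestep t
    bf  : (B, 3, 256, 256) Brightfield RGB image
    c   : (B, 2)           one-hot channel condition ([1, 0] = red, [0, 1] = green)
    """
    # Broadcast the one-hot condition into two constant spatial maps.
    c_map = c[:, :, None, None].expand(-1, -1, x_t.shape[-2], x_t.shape[-1])
    return torch.cat([x_t, bf, c_map], dim=1)   # (B, 6, 256, 256)
```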

4. Forward Diffusion Process

Noise is gradually added to the clean fluorescence image:

$$ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon $$

where:

  • \( \epsilon \sim \mathcal{N}(0, I) \)
  • \( t \) is sampled uniformly from \( \{1, \ldots, T\} \)
  • \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \), where \( \alpha_t = 1 - \beta_t \) follows a predefined noise schedule
The model is trained to predict the added noise, not the image directly.
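A minimal sketch of the forward (noising) step, assuming a linear \( \beta \) schedule with \( T = 1000 \) steps; the actual schedule used in training may differ.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed linear beta schedule
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product \bar{alpha}_t

def q_sample(x0, t, noise=None):
    """Sample x_t from q(x_t | x_0) by blending the clean image with Gaussian noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise
```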

5. Conditional UNet Denoiser

The denoiser is a UNet that receives the noisy fluorescence image, the timestep, the Brightfield image, and the channel condition, and predicts the noise component:

$$ \hat{\epsilon} = \epsilon_\theta(x_t, t, x_{\text{BF}}, c) $$

From this, a clean fluorescence estimate is reconstructed:

$$ \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \, \hat{\epsilon}}{\sqrt{\bar{\alpha}_t}} $$
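Reusing the schedule from the forward-diffusion sketch above, this reconstruction can be written as:

```python
def predict_x0(x_t, t, eps_hat):
    """Invert the noising step to recover an estimate of the clean fluorescence image."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return (x_t - (1.0 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```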

6. Training Losses

Two complementary losses guide training:

6.1 Noise Prediction Loss

$$ \mathcal{L}_{\text{denoise}} = \| \hat{\epsilon} - \epsilon \|_1 $$

6.2 Perceptual Loss (VGG16)

$$ \mathcal{L}_{\text{perc}} = \| \text{VGG}(\hat{x}_0) - \text{VGG}(x_0) \|_1 $$
Perceptual loss encourages biologically meaningful structures, not just pixel accuracy.

Total Loss

$$ \mathcal{L} = \mathcal{L}_{\text{denoise}} + 0.01 \cdot \mathcal{L}_{\text{perc}} $$
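A minimal sketch of the combined objective; the exact VGG16 layer cut used for the perceptual term is an assumption.

```python
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG16 feature extractor for the perceptual term (the layer cut is an assumption).
vgg_features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def diffusion_loss(eps_hat, eps, x0_hat, x0, perc_weight=0.01):
    """L1 noise-prediction loss plus a weighted VGG16 perceptual loss."""
    l_denoise = F.l1_loss(eps_hat, eps)

    # VGG16 expects 3-channel inputs roughly in [0, 1]: repeat the single
    # fluorescence channel and rescale from [-1, 1].
    def prep(x):
        return (x.repeat(1, 3, 1, 1) + 1.0) / 2.0

    l_perc = F.l1_loss(vgg_features(prep(x0_hat)), vgg_features(prep(x0)))
    return l_denoise + perc_weight * l_perc
```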

7. Optimization & EMA

Training details:

  • Optimizer: AdamW
  • Exponential Moving Average (EMA) of weights
$$ \theta_{\text{EMA}} \leftarrow 0.995 \cdot \theta_{\text{EMA}} + 0.005 \cdot \theta $$
EMA weights produce smoother and more stable fluorescence predictions at inference.
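A sketch of the EMA update, applied after each optimizer step; the EMA copy would be created once with `copy.deepcopy(model)`.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.995):
    """theta_ema <- decay * theta_ema + (1 - decay) * theta, parameter by parameter."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage (sketch):
# ema_model = copy.deepcopy(model)
# ... after each optimizer.step():
# ema_update(ema_model, model)
```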

8. Inference: Reverse Diffusion

At inference time, fluorescence images are generated by running the reverse diffusion process, starting from pure Gaussian noise.

$$ x_T \sim \mathcal{N}(0, I) $$

For each timestep \( t = T, T-1, \ldots, 1 \), the model predicts the noise component \( \hat{\epsilon}_t = \epsilon_\theta(x_t, t, x_{\text{BF}}, c) \), conditioned on the Brightfield image and the desired fluorescence channel.

The reverse DDPM update is given by:

$$ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon}_t \right) + \sigma_t z $$

where:

  • \( z \sim \mathcal{N}(0, I) \) if \( t > 1 \), and \( z = 0 \) if \( t = 1 \)
  • \( \alpha_t = 1 - \beta_t \)
  • \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \)
  • \( \sigma_t = \sqrt{\beta_t} \) (or an equivalent variance schedule)
By conditioning the reverse process on the Brightfield image and a channel indicator, the model generates structurally consistent fluorescence images while allowing stochastic variability across samples.

As \( t \) decreases, noise is gradually removed and biologically meaningful fluorescence structures emerge, guided by Brightfield morphology and the specified fluorescence modality.
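Putting the pieces together, a sampling loop could look like the sketch below, reusing the schedule tensors and `build_unet_input` helper from the earlier sketches; the UNet's forward signature is assumed.

```python
import torch

@torch.no_grad()
def sample_fluorescence(model, bf, c, shape=(1, 1, 256, 256)):
    """Generate a fluorescence image from pure noise, conditioned on the BF image and channel c."""
    x_t = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(build_unet_input(x_t, bf, c), t_batch)

        alpha_t, a_bar_t = alphas[t], alpha_bars[t]
        mean = (x_t - (1.0 - alpha_t) / (1.0 - a_bar_t).sqrt() * eps_hat) / alpha_t.sqrt()

        # Add noise at every step except the last (sigma_t = sqrt(beta_t)).
        x_t = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean
    return x_t.clamp(-1.0, 1.0)
```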

Important Losses

  • Pixel-wise reconstruction loss (intensity matching)
    Pros: Enforces accurate intensity matching and preserves overall fluorescence levels.
    Cons: Sensitive to misalignment and produces overly smooth (blurry) outputs.
  • Perceptual loss (structure matching)
    Pros: Preserves tissue morphology and high-level structural features.
    Cons: Depends on pre-trained features and may miss fine biological details.
  • Structural similarity (SSIM) loss (visual similarity matching)
    Pros: Maintains cellular structure and contrast consistent with human perception.
    Cons: Weak at enforcing absolute intensity accuracy.
  • Laplacian (edge/high-frequency) loss (fine detail matching; a sketch follows this list)
    Pros: Enhances sharp edges and fine cellular boundaries.
    Cons: Amplifies noise and is sensitive to registration errors.
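As an illustration of the last item, a Laplacian edge loss could be sketched as follows (single-channel images assumed); SSIM would typically come from an existing library rather than being hand-rolled.

```python
import torch
import torch.nn.functional as F

# 3x3 discrete Laplacian kernel for extracting edges / high-frequency content.
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_loss(pred, target):
    """L1 distance between Laplacian-filtered prediction and target (edge matching)."""
    k = _LAPLACIAN.to(device=pred.device, dtype=pred.dtype)
    return F.l1_loss(F.conv2d(pred, k, padding=1), F.conv2d(target, k, padding=1))
```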

My Work: Conditional Diffusion Framework

  • Conditional Diffusion Framework used to translate BF → Red/Green fluorescence
  • UNet backbone for epsilon prediction, conditioned on:
    • Noisy fluorescence image
    • BF RGB image
    • One-hot fluorescence type (red/green)
  • 6-channel input enabling multi-modal feature learning
  • DDPM noise schedule applied during forward diffusion to corrupt fluorescence targets
  • Model learns to predict noise (𝜖-prediction) at each timestep
  • Reconstruction of clean fluorescence from predicted 𝜖
  • Loss functions:
    • L1 denoising loss
    • VGG16 perceptual loss (weighted)
  • EMA (Exponential Moving Average) of weights for stable inference
  • Reverse diffusion process generates final fluorescence output from pure noise at inference

Dataset

Dataset: 4 sets (8 folders) from different environments.
Training/Validation: 3 sets, split 80/20 per folder.
Testing: 1 held-out set.
Sample counts: Train: 159 | Val: 42 | Test: 51.
Training epochs: 100.
Dataset source: Kaggle - Brightfield vs Fluorescent Staining Dataset

Results

Training and validation losses over 100 epochs. Both curves decrease steadily, indicating stable learning and good generalization.

Each result shows 5 images from left to right: BF input, Red GT, Red Pred, Green GT, Green Pred.

Future Improvements

Due to GPU limitations, our current results are limited to 256×256 resolution and a moderate UNet size. For better results, the following improvements can be considered:

  • Increase image size (256 → 512+) – captures finer cellular details.
  • Use more training data – improves generalization and robustness.
  • Deeper/wider UNet – enhances feature extraction and captures complex structures.
  • Diffusion + GAN loss – generates sharper outputs and preserves high-frequency features.
  • Additional loss functions – e.g., SSIM and Laplacian loss can further improve structural similarity and edge fidelity.

GitHub Repository

This repository contains the complete code for training and evaluating the conditional diffusion model that translates Brightfield (BF) images into Red and Green fluorescence channels. It includes data preprocessing, model architecture, training scripts, and inference examples.
