
Conditional Diffusion for Brightfield to Fluorescence Image Translation

Fluorescence microscopy provides critical biological insights, but acquiring fluorescence images is often time-consuming, expensive, and phototoxic to live cells. This post describes a conditional diffusion model that translates Brightfield (BF) images into corresponding fluorescence channels (red or green) within a unified, probabilistic generative framework.

Instead of predicting fluorescence directly, the model learns how to iteratively denoise fluorescence images conditioned on Brightfield structure and channel identity.
Example image modalities:

  • Brightfield (BF): structural cell morphology captured without fluorescence labeling.
  • Green fluorescence: complementary fluorescence channel with distinct biological specificity.
  • Red fluorescence: fluorescent signal highlighting a specific cellular marker.

Given a single Brightfield image, the conditional diffusion model can generate either fluorescence modality (red or green) by conditioning on the desired output channel.

1. Problem Setup

Given:

  • Brightfield image \( x_{\text{BF}} \)
  • Fluorescence image \( x_0 \) (Red or Green)

The goal is to learn:

$$ p(x_0 \mid x_{\text{BF}}, c) $$

where \( c \) is a condition vector indicating the desired fluorescence channel:

  • Red channel → \( c = [1, 0] \)
  • Green channel → \( c = [0, 1] \)

2. Data Preparation

Each training sample consists of a triplet:

  • Brightfield image
  • Red fluorescence image
  • Green fluorescence image

Preprocessing:

  • All images are resized to 256 × 256
  • Pixel values are normalized to [-1, 1]
  • One fluorescence channel is selected per training iteration
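A minimal preprocessing sketch, assuming PIL inputs, torchvision transforms, and single-channel fluorescence targets; the helper names (`to_model_range`, `make_sample`) are illustrative, not the repository's actual code.

```python
import torch
from torchvision import transforms

# Resize to 256 x 256 and convert to a tensor in [0, 1].
base_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def to_model_range(img):
    """Map an image to a tensor in the [-1, 1] range used by the diffusion model."""
    return base_tf(img) * 2.0 - 1.0

def make_sample(bf_img, red_img, green_img, channel):
    """Build one training sample: Brightfield input plus the selected fluorescence target."""
    bf = to_model_range(bf_img)                                          # (3, 256, 256) RGB Brightfield
    target = to_model_range(red_img if channel == "red" else green_img)  # (1, 256, 256) fluorescence target
    c = torch.tensor([1.0, 0.0]) if channel == "red" else torch.tensor([0.0, 1.0])
    return bf, target, c
```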

3. Input Construction

At each diffusion step, the model receives a 6-channel input:

  • Noisy fluorescence image \( x_t \) (1 channel)
  • Brightfield RGB image (3 channels)
  • Condition map \( c \) broadcast spatially (2 channels)

This produces:

$$ \text{Input} \in \mathbb{R}^{6 \times 256 \times 256} $$
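A sketch of how this 6-channel input could be assembled for a batch; the tensor shapes follow the description above, and the function name is illustrative.

```python
import torch

def build_unet_input(x_t, bf, c):
    """
    Concatenate the noisy target with its conditioning signals.

    x_t : (B, 1, 256, 256) noisy fluorescence image at timestep t
    bf  : (B, 3, 256, 256) Brightfield RGB image
    c   : (B, 2)           one-hot channel condition ([1, 0] = red, [0, 1] = green)
    """
    # Broadcast the one-hot condition into two constant spatial maps.
    c_map = c[:, :, None, None].expand(-1, -1, x_t.shape[-2], x_t.shape[-1])
    return torch.cat([x_t, bf, c_map], dim=1)   # (B, 6, 256, 256)
```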

4. Forward Diffusion Process

Noise is gradually added to the clean fluorescence image:

$$ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon $$

where:

  • \( \epsilon \sim \mathcal{N}(0, I) \)
  • \( t \) is sampled uniformly from \( \{1, \ldots, T\} \)
  • \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \), where \( \alpha_t = 1 - \beta_t \) follows a predefined noise schedule
The model is trained to predict the added noise, not the image directly.
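A minimal sketch of the forward (noising) step, assuming a linear \( \beta \) schedule with \( T = 1000 \) steps; the actual schedule used in training may differ.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed linear beta schedule
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product \bar{alpha}_t

def q_sample(x0, t, noise=None):
    """Sample x_t from q(x_t | x_0) by blending the clean image with Gaussian noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise
```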

5. Conditional UNet Denoiser

The denoiser is a UNet that receives the noisy fluorescence image, the timestep, the Brightfield image, and the channel condition, and predicts the noise component:

$$ \hat{\epsilon} = \epsilon_\theta(x_t, t, x_{\text{BF}}, c) $$

From this, a clean fluorescence estimate is reconstructed:

$$ \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \, \hat{\epsilon}}{\sqrt{\bar{\alpha}_t}} $$
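Reusing the schedule from the forward-diffusion sketch above, this reconstruction can be written as:

```python
def predict_x0(x_t, t, eps_hat):
    """Invert the noising step to recover an estimate of the clean fluorescence image."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return (x_t - (1.0 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```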

6. Training Losses

Two complementary losses guide training:

6.1 Noise Prediction Loss

$$ \mathcal{L}_{\text{denoise}} = \| \hat{\epsilon} - \epsilon \|_1 $$

6.2 Perceptual Loss (VGG16)

$$ \mathcal{L}_{\text{perc}} = \| \text{VGG}(\hat{x}_0) - \text{VGG}(x_0) \|_1 $$
Perceptual loss encourages biologically meaningful structures, not just pixel accuracy.

Total Loss

$$ \mathcal{L} = \mathcal{L}_{\text{denoise}} + 0.01 \cdot \mathcal{L}_{\text{perc}} $$
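A minimal sketch of the combined objective; the exact VGG16 layer cut used for the perceptual term is an assumption.

```python
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG16 feature extractor for the perceptual term (the layer cut is an assumption).
vgg_features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def diffusion_loss(eps_hat, eps, x0_hat, x0, perc_weight=0.01):
    """L1 noise-prediction loss plus a weighted VGG16 perceptual loss."""
    l_denoise = F.l1_loss(eps_hat, eps)

    # VGG16 expects 3-channel inputs roughly in [0, 1]: repeat the single
    # fluorescence channel and rescale from [-1, 1].
    def prep(x):
        return (x.repeat(1, 3, 1, 1) + 1.0) / 2.0

    l_perc = F.l1_loss(vgg_features(prep(x0_hat)), vgg_features(prep(x0)))
    return l_denoise + perc_weight * l_perc
```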

7. Optimization & EMA

Training details:

  • Optimizer: AdamW
  • Exponential Moving Average (EMA) of weights
$$ \theta_{\text{EMA}} \leftarrow 0.995 \cdot \theta_{\text{EMA}} + 0.005 \cdot \theta $$
EMA weights produce smoother and more stable fluorescence predictions at inference.
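A sketch of the EMA update, applied after each optimizer step; the EMA copy would be created once with `copy.deepcopy(model)`.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.995):
    """theta_ema <- decay * theta_ema + (1 - decay) * theta, parameter by parameter."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage (sketch):
# ema_model = copy.deepcopy(model)
# ... after each optimizer.step():
# ema_update(ema_model, model)
```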

8. Inference: Reverse Diffusion

At inference time, fluorescence images are generated by running the reverse diffusion process, starting from pure Gaussian noise.

$$ x_T \sim \mathcal{N}(0, I) $$

For each timestep \( t = T, T-1, \ldots, 1 \), the model predicts the noise component \( \hat{\epsilon}_t = \epsilon_\theta(x_t, t, x_{\text{BF}}, c) \), conditioned on the Brightfield image and the desired fluorescence channel.

The reverse DDPM update is given by:

$$ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon}_t \right) + \sigma_t z $$

where:

  • \( z \sim \mathcal{N}(0, I) \) if \( t > 1 \), and \( z = 0 \) if \( t = 1 \)
  • \( \alpha_t = 1 - \beta_t \)
  • \( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \)
  • \( \sigma_t = \sqrt{\beta_t} \) (or an equivalent variance schedule)
By conditioning the reverse process on the Brightfield image and a channel indicator, the model generates structurally consistent fluorescence images while allowing stochastic variability across samples.

As \( t \) decreases, noise is gradually removed and biologically meaningful fluorescence structures emerge, guided by Brightfield morphology and the specified fluorescence modality.
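Putting the pieces together, a sampling loop could look like the sketch below, reusing the schedule tensors and `build_unet_input` helper from the earlier sketches; the UNet's forward signature is assumed.

```python
import torch

@torch.no_grad()
def sample_fluorescence(model, bf, c, shape=(1, 1, 256, 256)):
    """Generate a fluorescence image from pure noise, conditioned on the BF image and channel c."""
    x_t = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(build_unet_input(x_t, bf, c), t_batch)

        alpha_t, a_bar_t = alphas[t], alpha_bars[t]
        mean = (x_t - (1.0 - alpha_t) / (1.0 - a_bar_t).sqrt() * eps_hat) / alpha_t.sqrt()

        # Add noise at every step except the last (sigma_t = sqrt(beta_t)).
        x_t = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean
    return x_t.clamp(-1.0, 1.0)
```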

Important Losses

  • Pixel-wise reconstruction loss (intensity matching)
    Pros: Enforces accurate intensity matching and preserves overall fluorescence levels.
    Cons: Sensitive to misalignment and produces overly smooth (blurry) outputs.
  • Perceptual loss (structure matching)
    Pros: Preserves tissue morphology and high-level structural features.
    Cons: Depends on pre-trained features and may miss fine biological details.
  • Structural similarity (SSIM) loss (visual similarity matching)
    Pros: Maintains cellular structure and contrast consistent with human perception.
    Cons: Weak at enforcing absolute intensity accuracy.
  • Laplacian (edge/high-frequency) loss (fine detail matching; a sketch follows this list)
    Pros: Enhances sharp edges and fine cellular boundaries.
    Cons: Amplifies noise and is sensitive to registration errors.
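As an illustration of the last item, a Laplacian edge loss could be sketched as follows (single-channel images assumed); SSIM would typically come from an existing library rather than being hand-rolled.

```python
import torch
import torch.nn.functional as F

# 3x3 discrete Laplacian kernel for extracting edges / high-frequency content.
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_loss(pred, target):
    """L1 distance between Laplacian-filtered prediction and target (edge matching)."""
    k = _LAPLACIAN.to(device=pred.device, dtype=pred.dtype)
    return F.l1_loss(F.conv2d(pred, k, padding=1), F.conv2d(target, k, padding=1))
```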

My Work: Conditional Diffusion Framework

  • Conditional Diffusion Framework used to translate BF → Red/Green fluorescence
  • UNet backbone for epsilon prediction, conditioned on:
    • Noisy fluorescence image
    • BF RGB image
    • One-hot fluorescence type (red/green)
  • 6-channel input enabling multi-modal feature learning
  • DDPM noise schedule applied during forward diffusion to corrupt fluorescence targets
  • Model learns to predict noise (𝜖-prediction) at each timestep
  • Reconstruction of clean fluorescence from predicted 𝜖
  • Loss functions:
    • L1 denoising loss
    • VGG16 perceptual loss (weighted)
  • EMA (Exponential Moving Average) of weights for stable inference
  • Reverse diffusion process generates final fluorescence output from pure noise at inference

Dataset

Dataset: 4 sets (8 folders) from different environments.
Training/Validation: 3 sets, split 80/20 per folder.
Testing: 1 held-out set.
Sample counts: Train: 159 | Val: 42 | Test: 51.
Training epochs: 100.
Dataset source: Kaggle - Brightfield vs Fluorescent Staining Dataset

Results

Training and validation losses over 100 epochs. Both curves decrease steadily, indicating stable learning and good generalization.

Each result shows 5 images from left to right: BF input, Red GT, Red Pred, Green GT, Green Pred.

Future Improvements

Due to GPU limitations, our current results are limited to 256×256 resolution and a moderate UNet size. For better results, the following improvements can be considered:

  • Increase image size (256 → 512+) – captures finer cellular details.
  • Use more training data – improves generalization and robustness.
  • Deeper/wider UNet – enhances feature extraction and captures complex structures.
  • Diffusion + GAN loss – generates sharper outputs and preserves high-frequency features.
  • Additional loss functions – e.g., SSIM and Laplacian loss can further improve structural similarity and edge fidelity.

GitHub Repository

This repository contains the complete code for training and evaluating the conditional diffusion model that translates Brightfield (BF) images into Red and Green fluorescence channels. It includes data preprocessing, model architecture, training scripts, and inference examples.
