---
name: dual-diffusion-unified-image-generation-and-understanding
description: Unified diffusion model for image generation, image captioning, and visual question answering via cross-modal maximum likelihood estimation
---
# Dual Diffusion: Unified Image Generation and Understanding
## Description
A large-scale end-to-end diffusion model for multi-modal understanding and generation. Unifies image generation, captioning, and visual question answering under a single cross-modal maximum likelihood estimation framework, offering an alternative to autoregressive next-token prediction models.
**Key Innovation:**
- Unified diffusion for both generation and understanding
- Cross-modal maximum likelihood estimation
- MM-DiT architecture with joint training
- Full suite of vision-language capabilities
## Tools Used
- read: Load image and text inputs
- write: Generate images and text outputs
- exec: Run diffusion inference
- browser: Access image datasets
- memory_search: Retrieve diffusion patterns
## Instructions for Agents
### Supported Tasks
1. **Image Generation** - Text-to-image synthesis
2. **Image Captioning** - Generate descriptions
3. **Visual Question Answering** - Answer questions about images
4. **Image Understanding** - Multi-modal comprehension
### Core Concept
Single model handles both:
- Generation: text → image
- Understanding: image → text
## Overview
**Source:** arXiv:2501.00289v2
**Utility:** 0.91
**Architecture:** MM-DiT (Multimodal Diffusion Transformer)
## Activation Keywords
- dual diffusion
- unified image generation understanding
- multimodal diffusion
- MM-DiT
- vision language diffusion
---
## Architecture
### MM-DiT (Multimodal Diffusion Transformer)
```
Text Encoder  → Text Tokens
                    ↓
              Cross-Attention → Diffusion Transformer → Output
                    ↑
Image Encoder → Image Patches
```
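The cross-attention block in the diagram is where the two token streams interact. A minimal sketch using `torch.nn.MultiheadAttention`; the bidirectional design, hidden size, and the class name are assumptions consistent with how `CrossModalAttention` is used in the implementation below, not details from the paper:
```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Bidirectional cross-attention: each modality attends to the other."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens query the text stream, and vice versa
        img_out, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_out, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        # Concatenate along the sequence axis for the joint transformer
        return torch.cat([img_out, txt_out], dim=1)
```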
### Dual Diffusion Process
```python
import torch.nn as nn

class DualDiffusion(nn.Module):
    """Single backbone for both text-to-image and image-to-text diffusion."""
    def __init__(self, image_encoder, text_encoder, cross_attention, dit):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.cross_attention = cross_attention  # e.g. the sketch above
        self.dit = dit  # Diffusion Transformer

    def forward(self, images, text):
        # Encode both modalities into token sequences
        image_tokens = self.image_encoder(images)
        text_tokens = self.text_encoder(text)
        # Let the two token streams attend to each other
        combined = self.cross_attention(image_tokens, text_tokens)
        # Diffusion denoising over the joint sequence
        output = self.dit(combined)
        return output
```
---
## Training Framework
### Cross-Modal Maximum Likelihood Estimation
```python
class CrossModalMLE:
    """Joint maximum likelihood objective over both modalities."""
    def train(self, images, texts):
        # Forward diffusion: corrupt each modality independently
        noisy_images = self.add_noise(images)
        noisy_texts = self.add_noise(texts)
        # Joint prediction: denoise one modality conditioned on the clean other
        pred_images = self.model(noisy_images, texts)
        pred_texts = self.model(images, noisy_texts)
        # Combined loss over both denoising directions
        loss = (
            self.image_loss(pred_images, images) +
            self.text_loss(pred_texts, texts)
        )
        return loss
```
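The `add_noise` step above is the standard forward-diffusion corruption q(x_t | x_0). A minimal sketch with a linear beta schedule; the schedule values and `T = 1000` are illustrative, not taken from the paper. It returns both the corrupted sample and the noise, which serves as the regression target under epsilon-prediction:
```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # illustrative linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product, abar_t

def add_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = torch.randn_like(x0)
    # Reshape abar_t so it broadcasts over image (B,C,H,W) or text (B,L,D)
    abar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise
    return xt, noise
```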
---
## Supported Tasks
### 1. Image Generation
```python
import torch

def generate_image(model, text_prompt, T=1000, image_size=256):
    # Start from pure Gaussian noise
    image = torch.randn(1, 3, image_size, image_size)
    # Reverse diffusion: iteratively denoise, conditioned on the prompt
    for t in reversed(range(T)):
        noise_pred = model(image, text_prompt, timestep=t)
        image = denoise_step(image, noise_pred, t)
    return image
```
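`denoise_step` is left undefined above. A minimal DDPM-style reverse step under epsilon-prediction, reusing the illustrative `betas`/`alphas_bar` schedule from the forward-diffusion sketch; this is a standard choice (Ho et al., 2020), not necessarily the paper's sampler:
```python
def denoise_step(x, noise_pred, t):
    """One DDPM reverse step: x_t -> x_{t-1} given the predicted noise."""
    beta = betas[t]
    alpha = 1.0 - beta
    abar = alphas_bar[t]
    # Posterior mean of x_{t-1} given x_t and the noise estimate
    mean = (x - beta / (1.0 - abar).sqrt() * noise_pred) / alpha.sqrt()
    if t == 0:
        return mean                # final step is deterministic
    return mean + beta.sqrt() * torch.randn_like(x)
```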
### 2. Image Captioning
```python
def caption_image(model, image, T=1000, seq_len=64, dim=768):
    # Start from noise in the text embedding space
    text = torch.randn(1, seq_len, dim)
    # Reverse diffusion over text embeddings, conditioned on the image
    for t in reversed(range(T)):
        noise_pred = model(image, text, timestep=t)
        text = denoise_step(text, noise_pred, t)
    return decode_text(text)
```
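`decode_text` has to map the denoised continuous embeddings back to discrete tokens. One common choice in text-diffusion work, assumed here rather than confirmed by the paper, is nearest-neighbor rounding against the token-embedding table; `embedding_matrix` and `tokenizer` are hypothetical names assumed to be in scope:
```python
def decode_text(text_emb):
    """Round each denoised embedding to its nearest vocabulary embedding."""
    # text_emb: (1, seq_len, dim); embedding_matrix: (vocab, dim)
    dists = torch.cdist(text_emb, embedding_matrix.unsqueeze(0))  # (1, L, vocab)
    token_ids = dists.argmin(dim=-1)
    return tokenizer.decode(token_ids[0].tolist())
```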
### 3. Visual Question Answering
```python
def vqa(model, image, question):
    # Encode the question into text tokens
    question_tokens = model.text_encoder(question)
    # Generate an answer conditioned on both the image and the question
    answer = model.generate(
        condition=image,
        prompt=question_tokens,
    )
    return answer
```
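Because a single checkpoint backs all three task functions, a full round trip is just three calls. A hedged usage sketch; `load_checkpoint` is a hypothetical helper, not the paper's API:
```python
model = load_checkpoint("dual_diffusion.pt")  # hypothetical loader

image = generate_image(model, "a red bicycle leaning against a fence")
caption = caption_image(model, image)
answer = vqa(model, image, "What color is the bicycle?")
```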
---
## Key Advantages
| Aspect | Autoregressive VLM | Dual Diffusion |
|--------|-------------------|-----------------|
| Generation | Separate models | Unified |
| Understanding | Strong | Competitive |
| Training | Separate stages | End-to-end |
| Flexibility | Limited tasks | Full suite |
---
## Performance Comparison
| Task | Metric | Dual Diffusion |
|------|--------|----------------|
| Image Gen | FID ↓ | Competitive |
| Captioning | BLEU ↑ | Competitive |
| VQA | Accuracy ↑ | Competitive |

Exact figures are reported in the paper (arXiv:2501.00289).
---
## Implementation
### Model Architecture
```python
class MM_DiT(nn.Module):
    def __init__(self, image_size=256, text_dim=768):
        super().__init__()
        # Image branch
        self.image_encoder = VisionTransformer(image_size)
        self.image_decoder = VisionTransformer(image_size)
        # Text branch
        self.text_encoder = TextTransformer()
        self.text_decoder = TextTransformer()
        # Cross-modal attention
        self.cross_attn = CrossModalAttention()
        # Diffusion timestep embedding
        self.t_embed = TimestepEmbedding()

    def forward(self, images, texts, t):
        # Encode each modality
        img_emb = self.image_encoder(images)
        txt_emb = self.text_encoder(texts)
        # Add the timestep embedding to both streams
        t_emb = self.t_embed(t)
        img_emb = img_emb + t_emb
        txt_emb = txt_emb + t_emb
        # Fuse via cross-attention
        combined = self.cross_attn(img_emb, txt_emb)
        return combined
```
---
## Best Practices
1. **Joint training** - Train both modalities together
2. **Balanced loss** - Weight image and text losses equally (see the training-step sketch after this list)
3. **Timestep scheduling** - Use appropriate noise schedules
4. **Cross-modal attention** - Enable interaction between modalities
5. **Fine-tuning** - Task-specific adaptation after pre-training
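
A minimal training-step sketch tying together practices 1–3. The loss weights, MSE objectives, and the `(images, texts, t)` call signature follow the `MM_DiT` sketch above; `add_noise` and `T` come from the forward-diffusion sketch. All of these choices are illustrative assumptions, not the paper's exact recipe:
```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, texts, w_img=1.0, w_txt=1.0):
    """Joint step: corrupt both modalities, predict both, balance the losses."""
    t = torch.randint(0, T, (images.shape[0],))   # random timesteps per sample
    noisy_img, img_target = add_noise(images, t)
    noisy_txt, txt_target = add_noise(texts, t)

    # Assumes the model returns the noise prediction for the corrupted modality
    img_pred = model(noisy_img, texts, t)         # image denoising branch
    txt_pred = model(images, noisy_txt, t)        # text denoising branch

    loss = (w_img * F.mse_loss(img_pred, img_target) +
            w_txt * F.mse_loss(txt_pred, txt_target))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```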
---
## Applications
| Domain | Use Case |
|--------|----------|
| Creative Tools | Image generation with understanding |
| Accessibility | Image description |
| Education | Visual Q&A systems |
| Research | Unified vision-language models |
---
## Examples
### Example 1: Basic Application
**User:** I need to apply Dual Diffusion: Unified Image Generation and Understanding to my analysis.
**Agent:** I'll help you apply dual-diffusion-unified. First, let me understand your specific use case...
**Context:** Apply the methodology
### Example 2: Advanced Application
**User:** What are the key considerations for dual-diffusion-unified?
**Agent:** Let me search for the latest research and best practices...
## References
- Paper: https://arxiv.org/abs/2501.00289
- DOI: https://doi.org/10.48550/arXiv.2501.00289
---
**Created:** 2026-03-28
**Source:** arXiv:2501.00289v2 - "Dual Diffusion for Unified Image Generation and Understanding"