MMaDA: Multimodal Large Diffusion Language Models
Yang, Ling, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. “MMaDA: Multimodal Large Diffusion Language Models.” arXiv preprint arXiv:2505.15809 (2025).
What story does this paper tell?
Tasks
Task 1: Textual Reasoning
Task 2: Multimodal Reasoning
Task 3: World Knowledge-Aware Text-to-Image Generation
Training Datasets
Foundational Language and Multimodal Data
- Basic text generation:
- RefinedWeb
- Multimodal understanding and generation:
- ImageNet
- Conceptual 12M (CC12M)
- Segment Anything
- LAION-Aesthetics 12M UMAP
- JourneyDB
Instruction Data
- Textual instruction:
- Alpaca
- Visual instruction:
- LLaVA-1.5
Reasoning Data
- Textual mathematical and logical reasoning:
- ReasonFlux
- LIMO
- s1k
- OpenThoughts
- AceMath-Instruct
- Multimodal reasoning:
- Responses to GeoQA and CLEVR generated by the LMM-R1 model
- World knowledge-aware image generation:
- Factual (item, description) pairs generated by GPT-4.1, spanning science, culture, and landmarks
Reinforcement Learning Data
- The mathematical and logical reasoning datasets listed under Reasoning Data
Evaluation Datasets
Text Generation
- MMLU
- GSM8K
Multimodal Understanding
- POPE
- MME
- Flickr30k
- VQAv2
- GQA
- MMMU
Image Generation
- GenEval
- WISE
Model Architecture
Unified Tokenization across Modalities
- Text tokenization:
- Model: LLaDA’s tokenizer
- Input: raw text
- Output: a sequence of discrete text tokens
- Visual tokenization:
- Model: Show-o’s pretrained image quantizer, based on MAGVIT-v2
- Input: images
- Output: sequences of discrete semantic tokens
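To make the idea of a shared token space concrete, here is a minimal sketch of how text token ids and image codebook indices could be merged into one sequence. The vocabulary sizes, the offset scheme, and the special tokens (`START_OF_IMAGE`, `END_OF_IMAGE`, `MASK`) are illustrative assumptions, not values or names taken from MMaDA, LLaDA, or Show-o.

```python
from typing import List

# All sizes and special tokens below are hypothetical, chosen for illustration.
TEXT_VOCAB_SIZE = 1000      # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 512   # assumed size of the image quantizer codebook

# Assumed layout: [text ids | image ids | special tokens]
SPECIAL_BASE = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
START_OF_IMAGE = SPECIAL_BASE
END_OF_IMAGE = SPECIAL_BASE + 1
MASK = SPECIAL_BASE + 2


def image_code_to_token(code: int) -> int:
    """Shift an image codebook index into the shared vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def build_unified_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text tokens and offset image tokens into one sequence."""
    for tid in text_ids:
        if not 0 <= tid < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {tid}")
    image_tokens = [image_code_to_token(c) for c in image_codes]
    return list(text_ids) + [START_OF_IMAGE] + image_tokens + [END_OF_IMAGE]


def modality_of(token: int) -> str:
    """Report which part of the shared vocabulary a token id belongs to."""
    if 0 <= token < TEXT_VOCAB_SIZE:
        return "text"
    if token < SPECIAL_BASE:
        return "image"
    return "special"
```

With a layout like this, a single model can treat both modalities as one stream of discrete ids, which is what allows one masking and prediction procedure to cover text and images alike.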
Unified Training (Modeling) Objective and Probabilistic Formulation
- Masked Token Denoising: Predict the discrete masked tokens.
- Model both textual and visual modalities under a shared probabilistic formulation.
- Align the noise corruption and semantic recovery processes across modalities.
- Mask Token Predictor (for both text and image):
- Model: MMaDA
- Input: a token sequence in which some text or image tokens are replaced by the mask token
- Output: predicted tokens for the masked positions
- Loss function: unified cross-entropy computed only on the masked text or image tokens
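The masked-token objective can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the masking ratio `t` is applied independently per token and the loss is scaled by `1/t`, following the LLaDA-style formulation that MMaDA builds on; `MASK`, `corrupt`, `masked_cross_entropy`, and `uniform_predictor` are hypothetical names, and the uniform predictor merely stands in for the real model.

```python
import math
import random
from typing import Callable, Dict, List

MASK = -1  # hypothetical id for the mask token

Distribution = Dict[int, float]
Predictor = Callable[[List[int]], List[Distribution]]


def corrupt(tokens: List[int], t: float, rng: random.Random) -> List[int]:
    """Replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_cross_entropy(
    clean: List[int], noisy: List[int], predict: Predictor, t: float
) -> float:
    """Cross-entropy summed over masked positions only, scaled by 1/t.

    `predict(noisy)` must return one distribution (token id -> probability)
    per position. Unmasked positions contribute nothing to the loss.
    """
    distributions = predict(noisy)
    loss = 0.0
    for target, observed, dist in zip(clean, noisy, distributions):
        if observed == MASK:
            loss -= math.log(dist[target])
    return loss / t


def uniform_predictor(vocab_size: int) -> Predictor:
    """Stand-in for the model: predicts a uniform distribution everywhere."""
    def predict(noisy: List[int]) -> List[Distribution]:
        return [{tok: 1.0 / vocab_size for tok in range(vocab_size)} for _ in noisy]
    return predict


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [1, 2, 3, 4]   # could be text ids, image ids, or a mix of both
    t = 0.5
    noisy = corrupt(clean, t, rng)
    print(noisy, masked_cross_entropy(clean, noisy, uniform_predictor(8), t))
```

Because the loss only looks at token ids and mask positions, the same function applies unchanged whether the sequence holds text tokens, image tokens, or both.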
- Author: Der Steppenwolf
- Created at: 2025-05-24 19:02:46
- Updated at : 2025-06-22 20:46:50
- Link: https://st143575.github.io/steppenwolf.github.io/2025/05/24/MMaDA-Multimodal-Large-Diffusion-Language-Models/
- License: This work is licensed under CC BY-NC-SA 4.0.