MMaDA: Multimodal Large Diffusion Language Models

Der Steppenwolf

Yang, Ling, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. “MMaDA: Multimodal Large Diffusion Language Models.” arXiv preprint arXiv:2505.15809 (2025).

What story does this paper tell?

Tasks

Task 1: Textual Reasoning

Task 2: Multimodal Reasoning

Task 3: World Knowledge-Aware Text-to-Image Generation

Training Datasets

Foundational Language and Multimodal Data

  • Basic text generation:
    • RefinedWeb
  • Multimodal understanding and generation:
    • ImageNet
    • Conceptual 12M
    • Segment Anything
    • LAION-Aesthetics-12M-UMAP
    • JourneyDB

Instruction Data

  • Textual instruction:
    • Alpaca
  • Visual instruction:
    • LLaVA-1.5

Reasoning Data

  • Textual mathematical and logical reasoning:
    • ReasonFlux
    • LIMO
    • s1k
    • OpenThoughts
    • AceMath-Instruct
  • Multimodal reasoning:
    • responses to GeoQA and CLEVR generated by the LMM-R1 model
  • World knowledge-aware image generation:
    • factual (item, description) pairs generated by GPT-4.1, spanning science, culture, and landmarks

Reinforcement Learning Data

  • the mathematical and logical reasoning datasets listed under Reasoning Data above

Evaluation Datasets

Text Generation

  • MMLU
  • GSM8K

Multimodal Understanding

  • POPE
  • MME
  • Flickr30k
  • VQAv2
  • GQA
  • MMMU

Image Generation

  • GenEval
  • WISE

Model Architecture

Unified Tokenization across Modalities

  • Text tokenization:
    • Model: LLaDA’s tokenizer
    • Input: raw text
    • Output: a sequence of discrete text tokens
  • Visual tokenization:
    • Show-o’s pretrained image quantizer, based on MAGVIT-v2
    • Input: images
    • Output: sequences of discrete semantic tokens
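
To make the interface concrete, here is a minimal Python sketch of the unified tokenization. The wrapper names (`llada_tokenizer`, `magvit_quantizer`), the vocabulary sizes, and the shared-vocabulary offset are illustrative assumptions, not MMaDA's actual implementation:

```python
# Hypothetical sketch: both modalities end up as ids in one shared discrete
# vocabulary, so a single mask-token predictor can handle text and images.

TEXT_VOCAB_SIZE = 126_000    # placeholder; the real size comes from LLaDA's tokenizer
IMAGE_CODEBOOK_SIZE = 8_192  # assumed MAGVIT-v2 codebook size

def tokenize_text(llada_tokenizer, text: str) -> list[int]:
    """Raw text -> discrete text token ids in [0, TEXT_VOCAB_SIZE)."""
    return llada_tokenizer.encode(text)

def tokenize_image(magvit_quantizer, image) -> list[int]:
    """Image -> discrete semantic token ids, offset past the text vocabulary."""
    codes = magvit_quantizer.quantize(image)      # ids in [0, IMAGE_CODEBOOK_SIZE)
    return [TEXT_VOCAB_SIZE + c for c in codes]   # ids start at TEXT_VOCAB_SIZE

def build_sequence(text_ids: list[int], image_ids: list[int]) -> list[int]:
    """One flat sequence over the shared vocabulary, fed to the same model."""
    return text_ids + image_ids
```

Offsetting the image codebook past the text vocabulary is one common layout for unified discrete models; the key point is that both modalities become interchangeable token ids for a single predictor.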

Unified Training (Modeling) Objective and Probabilistic Formulation

  • Masked Token Denoising: predict the original discrete tokens at the masked positions.
  • Model both textual and visual modalities under a shared probabilistic formulation.
  • Align the noise corruption and semantic recovery processes across modalities.
  • Mask Token Predictor (for both text and image):
    • Model: MMaDA
    • Input: a partially masked sequence of text and/or image tokens
    • Output: predictions of the original tokens at the masked positions
    • Loss function: unified cross-entropy on the masked text or image tokens
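
This objective follows LLaDA's masked-diffusion formulation, which MMaDA extends to image tokens: sample a masking ratio $t \sim U(0,1)$, replace each token by `[MASK]` independently with probability $t$, and train the model to recover the originals,

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\left[x_t^i = \texttt{[MASK]}\right]\log p_\theta\!\left(x_0^i \mid x_t\right)\right],
$$

where $x_0$ is the clean token sequence (text, image, or both) and $x_t$ its partially masked version. A minimal PyTorch sketch of this loss follows; the model interface and the normalization by sequence length are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x0: torch.Tensor, mask_id: int) -> torch.Tensor:
    """Unified cross-entropy on masked tokens (LLaDA-style sketch, not MMaDA's code).

    x0: (batch, seq_len) discrete token ids; text and image tokens share one vocabulary.
    """
    b, L = x0.shape
    # Sample one masking ratio t per sequence, then mask each token independently.
    t = torch.rand(b, 1, device=x0.device).clamp(min=1e-3)  # avoid division by ~0
    masked = torch.rand(b, L, device=x0.device) < t         # (b, L) bool
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)

    logits = model(xt)  # assumed to return (b, L, vocab_size)
    # Per-token cross-entropy against the clean sequence.
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (b, L)
    # Keep only masked positions and reweight by 1/t, matching the objective above.
    loss = (ce * masked.float() / t).sum() / (b * L)
    return loss
```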