Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Der Steppenwolf

Bai, Jinze, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).

Motivation

  • Current open-source Large Vision-Language Models (LVLMs) suffer from inadequate training and optimization, lagging far behind proprietary models.
  • The majority of open-source LVLMs perceive the image in a coarse-grained manner and lack fine-grained visual understanding ability.

Main Contribution

Qwen-VL series, a set of large-scale multilingual multi-task LVLMs based on Qwen-7B LM

  • Qwen-VL: pre-trained checkpoint
  • Qwen-VL-Chat: instruction-tuned checkpoint
  • a visual receptor including
    • a language-aligned visual encoder
    • a position-aware adapter
  • an input-output interface
  • a 3-stage training pipeline
  • a multilingual (English, Chinese) multimodal (text, image) corpus

Model Architecture

Large Language Model

  • Architecture: Transformer decoder (LLaMA-style)
  • Model: Qwen-7B (pretrained checkpoint)
  • Model serves as text encoder.

Visual Encoder

  • Architecture: Vision Transformer (ViT)
  • Model: ViT-bigG (initialized from OpenCLIP's pre-trained weights)
  • Input: a set of resized images
  • Model splits input images into patches with a stride of 14.
  • Output: a set of image features
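
For example, at the multi-task pre-training resolution of 448 × 448, a stride of 14 yields (448/14)² = 32² = 1024 patch features; the adapter described next compresses these to a fixed length of 256.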

Position-Aware Vision-Language Adapter

  • Motivation:
    • efficiency issues caused by long image feature sequences
    • potential loss of positional details when compressing those sequences
  • Architecture: a randomly initialized single-layer cross-attention module
  • Functionality: compress image features
  • Cross-attention operation:
    • queries: trainable embeddings
    • keys: image features from the visual encoder
    • compresses the visual feature sequence to a fixed length of 256
  • 2D absolute positional encodings are incorporated into the query-key pairs.
  • Input: query embeddings + image features from the visual encoder
  • Output: compressed image features of length 256
  • The output is then fed into the text encoder (a minimal sketch of the adapter follows this list).
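
A minimal sketch of such an adapter, assuming PyTorch. The class name, the hidden width of 1664 (ViT-bigG's feature dimension), the head count, and the use of learned parameters to stand in for the paper's 2D absolute positional encodings are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class PositionAwareAdapter(nn.Module):
    """Single-layer cross-attention that compresses a long sequence of
    ViT patch features to a fixed 256 visual tokens (illustrative sketch)."""

    def __init__(self, vis_dim=1664, num_queries=256, grid_size=32):
        super().__init__()
        # 256 trainable query embeddings: the compressed sequence length
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        # Randomly initialized single-layer cross-attention module
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=16, batch_first=True)
        # Learned stand-ins for 2D absolute positional encodings,
        # injected into the query-key pairs (one per patch / per query)
        self.key_pos = nn.Parameter(torch.randn(grid_size * grid_size, vis_dim) * 0.02)
        self.query_pos = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)

    def forward(self, image_feats):
        # image_feats: (B, N_patches, vis_dim) from the visual encoder;
        # N_patches must match grid_size**2 for the positional table
        B = image_feats.size(0)
        q = (self.queries + self.query_pos).unsqueeze(0).expand(B, -1, -1)
        k = image_feats + self.key_pos  # keep positional detail despite compression
        out, _ = self.cross_attn(query=q, key=k, value=image_feats)
        return out  # (B, 256, vis_dim): compressed visual features for the LM
```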

Input and Output

Image Input

  • Relevant model components: visual encoder, VL adapter
  • Special tokens: `<img>`, `</img>`
    • to differentiate between image feature input and text feature input

Bounding Box Input and Output

  • Relevant model components: visual encoder, text encoder
  • Bounding boxes are normalized to the range [0, 1000) and transformed into a specific string format: `(X_topleft,Y_topleft),(X_bottomright,Y_bottomright)` (see the sketch after this list).
  • This string is tokenized as text and does not require an additional positional vocabulary.
  • Special tokens:
    • `<box>`, `</box>`: added to the beginning and end of the bounding box string, respectively
      • to distinguish between bounding box strings and regular text strings
    • `<ref>`, `</ref>`: added to the text strings which are referred to by the bounding box
      • to associate bounding boxes with their corresponding descriptive texts
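
A small illustrative helper for producing this string format (the function name and the integer rounding are assumptions; the `<ref>`/`<box>` tokens and the [0, 1000) normalization follow the paper):

```python
def box_to_string(x0, y0, x1, y1, img_w, img_h, ref_text=None):
    """Normalize a pixel-space box to [0, 1000) and render it as
    <box>(X_topleft,Y_topleft),(X_bottomright,Y_bottomright)</box>,
    optionally preceded by the <ref>...</ref> phrase it grounds."""
    nx0 = int(x0 / img_w * 1000)
    ny0 = int(y0 / img_h * 1000)
    nx1 = int(x1 / img_w * 1000)
    ny1 = int(y1 / img_h * 1000)
    box = f"<box>({nx0},{ny0}),({nx1},{ny1})</box>"
    return f"<ref>{ref_text}</ref>{box}" if ref_text else box

# e.g. box_to_string(100, 200, 300, 400, 640, 480, "the dog")
# -> "<ref>the dog</ref><box>(156,416),(468,833)</box>"
```

Because the coordinates are plain digits, the resulting string is tokenized like any other text, which is why no additional positional vocabulary is needed.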

Main Features of Qwen-VL Series

Support:

  • multiple image inputs
  • multilingual conversation
  • multi-round dialogue
  • text-reading (OCR)
  • localization
  • fine-grained visual recognition and understanding

Training

Pre-training

  • Pre-training data: image-text pairs

    • large-scale
    • weakly labeled
    • web-crawled
    • publicly available data + in-house data
    • cleaned
  • original dataset size: 5B

  • cleaned dataset size: 1.4B (77.3% English text data, 22.7% Chinese text data)

  • Freeze text encoder, only optimize visual encoder and VL adapter.

  • Resize input images to 224 × 224.

  • Training objective: minimize cross-entropy of text tokens (a minimal sketch follows below)
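
A minimal sketch of this objective, assuming PyTorch: standard next-token cross-entropy in which positions whose target is not a text token (e.g., image positions) are masked out of the loss. Variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def text_token_loss(logits, labels, is_text):
    """logits: (B, T, V); labels: (B, T); is_text: (B, T) bool mask
    marking positions whose token is text (not image or padding)."""
    # Shift for next-token prediction: position t predicts token t+1.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:]
    mask = is_text[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Average over supervised (text) positions only.
    return (loss * mask.reshape(-1)).sum() / mask.sum().clamp(min=1)
```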

Multi-task Pre-training

  • Training data: interleaved image-text pairs

    • high-quality with fine-grained VL annotation
    • larger input resolution: 448 × 448
  • Ablate window attention vs. global attention in the visual encoder at the higher resolution.

  • Jointly optimize the text encoder, visual encoder, and VL adapter.

  • Training objective: minimize cross-entropy of text tokens

Supervised Fine-tuning (SFT)

  • Fine-tuning data:

    • multi-modal instructions
      • curated from caption data and dialogue data
      • generated through LLM self-instruction
      • addresses only single-image dialogue and reasoning
      • limited to image content comprehension
    • additional dialogue data
      • manual annotation + model generation + strategic concatenation
      • incorporate localization and multi-image comprehension abilities
    • Data size: 350K
  • Freeze visual encoder, optimize text encoder and VL adapter.
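
Across the three stages, the trainability schedule can be summarized as follows; a compact sketch assuming PyTorch, with illustrative module names (`visual_encoder`, `adapter`, `llm`) rather than Qwen-VL's actual API:

```python
# Which components are optimized in each training stage.
STAGES = {
    "pretrain":  {"visual_encoder": True,  "adapter": True, "llm": False},
    "multitask": {"visual_encoder": True,  "adapter": True, "llm": True},
    "sft":       {"visual_encoder": False, "adapter": True, "llm": True},
}

def configure_stage(model, stage: str):
    """Freeze/unfreeze each component according to the training stage."""
    for name, trainable in STAGES[stage].items():
        for p in getattr(model, name).parameters():
            p.requires_grad = trainable
```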

Evaluation

Image Caption and General Visual Question Answering

Text-Oriented Visual Question Answering

Referring Expression Comprehension

Few-Shot Learning on Vision-Language Tasks

Instruction Following in Real-World User Behavior
