Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Der Steppenwolf

Bai, Jinze, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).

Motivation

  • Current open-source Large Vision-Language Models (LVLMs) suffer from inadequate training and optimization, lagging far behind proprietary models.
  • The majority of open-source LVLMs perceive the image in a coarse-grained manner and lack fine-grained visual understanding ability.

Main Contribution

Qwen-VL series, a set of large-scale multilingual multi-task LVLMs based on Qwen-7B LM

  • Qwen-VL: pre-trained checkpoint
  • Qwen-VL-Chat: instruction-tuned checkpoint
  • a visual receptor including
    • a language-aligned visual encoder
    • a position-aware adapter
  • an input-output interface
  • a 3-stage training pipeline
  • a multilingual (English, Chinese) multimodal (text, image) corpus

Model Architecture

Large Language Model

  • Architecture: Transformer decoder (LLaMA-style)
  • Model: Qwen-7B (pretrained checkpoint)
  • Model serves as text encoder.

Visual Encoder

  • Architecture: Vision Transformer (ViT)
  • Model: ViT-bigG (initialized from OpenCLIP's pre-trained weights)
  • Input: a set of resized images
  • Model splits input images into patches with a stride of 14.
  • Output: a set of image features
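
For example, at the multi-task pre-training resolution of 448 × 448, a stride of 14 yields (448/14)² = 32² = 1024 patch features; the adapter described next compresses these to a fixed length of 256.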

Position-Aware Vision-Language Adapter

  • Motivation:
    • efficiency issues caused by long image feature sequences
    • potential loss of positional details when compressing those sequences
  • Architecture: a randomly initialized single-layer cross-attention module
  • Functionality: compress image features
  • Cross-attention operation:
    • queries: trainable embeddings
    • keys: image features from the visual encoder
    • compresses the visual feature sequence to a fixed length of 256
  • 2D absolute positional encodings are incorporated into the query-key pairs.
  • Input: query embeddings + image features from the visual encoder
  • Output: compressed image features of length 256
  • The output is then fed into the text encoder (a minimal sketch of the adapter follows this list).
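
A minimal sketch of such an adapter, assuming PyTorch. The class name, the hidden width of 1664 (ViT-bigG's feature dimension), the head count, and the use of learned parameters to stand in for the paper's 2D absolute positional encodings are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class PositionAwareAdapter(nn.Module):
    """Single-layer cross-attention that compresses a long sequence of
    ViT patch features to a fixed 256 visual tokens (illustrative sketch)."""

    def __init__(self, vis_dim=1664, num_queries=256, grid_size=32):
        super().__init__()
        # 256 trainable query embeddings: the compressed sequence length
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        # Randomly initialized single-layer cross-attention module
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=16, batch_first=True)
        # Learned stand-ins for 2D absolute positional encodings,
        # injected into the query-key pairs (one per patch / per query)
        self.key_pos = nn.Parameter(torch.randn(grid_size * grid_size, vis_dim) * 0.02)
        self.query_pos = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)

    def forward(self, image_feats):
        # image_feats: (B, N_patches, vis_dim) from the visual encoder;
        # N_patches must match grid_size**2 for the positional table
        B = image_feats.size(0)
        q = (self.queries + self.query_pos).unsqueeze(0).expand(B, -1, -1)
        k = image_feats + self.key_pos  # keep positional detail despite compression
        out, _ = self.cross_attn(query=q, key=k, value=image_feats)
        return out  # (B, 256, vis_dim): compressed visual features for the LM
```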

Input and Output

Image Input

  • Relevant model components: visual encoder, VL adapter
  • Special tokens: `<img>`, `</img>`
    • to differentiate between image feature input and text feature input

Bounding Box Input and Output

  • Relevant model components: visual encoder, text encoder
  • Bounding boxes are normalized to the range [0, 1000) and transformed into a specific string format: `(X_topleft,Y_topleft),(X_bottomright,Y_bottomright)` (see the sketch after this list).
  • This string is tokenized as text and does not require an additional positional vocabulary.
  • Special tokens:
    • `<box>`, `</box>`: added to the beginning and end of the bounding box string, respectively
      • to distinguish between bounding box strings and regular text strings
    • `<ref>`, `</ref>`: added to the text strings which are referred to by the bounding box
      • to associate bounding boxes with their corresponding descriptive texts
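
A small illustrative helper for producing this string format (the function name and the integer rounding are assumptions; the `<ref>`/`<box>` tokens and the [0, 1000) normalization follow the paper):

```python
def box_to_string(x0, y0, x1, y1, img_w, img_h, ref_text=None):
    """Normalize a pixel-space box to [0, 1000) and render it as
    <box>(X_topleft,Y_topleft),(X_bottomright,Y_bottomright)</box>,
    optionally preceded by the <ref>...</ref> phrase it grounds."""
    nx0 = int(x0 / img_w * 1000)
    ny0 = int(y0 / img_h * 1000)
    nx1 = int(x1 / img_w * 1000)
    ny1 = int(y1 / img_h * 1000)
    box = f"<box>({nx0},{ny0}),({nx1},{ny1})</box>"
    return f"<ref>{ref_text}</ref>{box}" if ref_text else box

# e.g. box_to_string(100, 200, 300, 400, 640, 480, "the dog")
# -> "<ref>the dog</ref><box>(156,416),(468,833)</box>"
```

Because the coordinates are plain digits, the resulting string is tokenized like any other text, which is why no additional positional vocabulary is needed.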

Main Features of Qwen-VL Series

Support:

  • multiple image inputs
  • multilingual conversation
  • multi-round dialogue
  • text-reading (OCR)
  • localization
  • fine-grained visual recognition and understanding

Training

Pre-training

  • Pre-training data: image-text pairs

    • large-scale
    • weakly labeled
    • web-crawled
    • publicly available data + in-house data
    • cleaned
  • original dataset size: 5B

  • cleaned dataset size: 1.4B (77.3% English text data, 22.7% Chinese text data)

  • Freeze text encoder, only optimize visual encoder and VL adapter.

  • Resize input images to 224 × 224.

  • Training objective: minimize cross-entropy of text tokens (a minimal sketch follows below)
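
A minimal sketch of this objective, assuming PyTorch: standard next-token cross-entropy in which positions whose target is not a text token (e.g., image positions) are masked out of the loss. Variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def text_token_loss(logits, labels, is_text):
    """logits: (B, T, V); labels: (B, T); is_text: (B, T) bool mask
    marking positions whose token is text (not image or padding)."""
    # Shift for next-token prediction: position t predicts token t+1.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:]
    mask = is_text[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Average over supervised (text) positions only.
    return (loss * mask.reshape(-1)).sum() / mask.sum().clamp(min=1)
```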

Multi-task Pre-training

  • Training data: interleaved image-text pairs

    • high-quality with fine-grained VL annotation
    • larger input resolution: 448 × 448
  • Ablate window attention vs. global attention in the visual encoder at the higher resolution.

  • Jointly optimize the text encoder, visual encoder, and VL adapter.

  • Training objective: minimize cross-entropy of text tokens

Supervised Fine-tuning (SFT)

  • Fine-tuning data:

    • multi-modal instructions
      • curated from caption data and dialogue data
      • generated through LLM self-instruction
      • addresses only single-image dialogue and reasoning
      • limited to image content comprehension
    • additional dialogue data
      • manual annotation + model generation + strategic concatenation
      • incorporate localization and multi-image comprehension abilities
    • Data size: 350K
  • Freeze visual encoder, optimize text encoder and VL adapter.
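
Across the three stages, the trainability schedule can be summarized as follows; a compact sketch assuming PyTorch, with illustrative module names (`visual_encoder`, `adapter`, `llm`) rather than Qwen-VL's actual API:

```python
# Which components are optimized in each training stage.
STAGES = {
    "pretrain":  {"visual_encoder": True,  "adapter": True, "llm": False},
    "multitask": {"visual_encoder": True,  "adapter": True, "llm": True},
    "sft":       {"visual_encoder": False, "adapter": True, "llm": True},
}

def configure_stage(model, stage: str):
    """Freeze/unfreeze each component according to the training stage."""
    for name, trainable in STAGES[stage].items():
        for p in getattr(model, name).parameters():
            p.requires_grad = trainable
```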

Evaluation

Image Caption and General Visual Question Answering

Text-Oriented Visual Question Answering

Referring Expression Comprehension

Few-Shot Learning on Vision-Language Tasks

Instruction Following in Real-World User Behavior
