📌 Rethinking Multimodality from an Industry Perspective: Captioning Is Far More Important Than You Think

Published November 29, 2025

ArXiv: https://arxiv.org/abs/2511.21025
GitHub: https://github.com/bronyayang/CaptionQA
HuggingFace: https://huggingface.co/datasets/Borise/CaptionQA

[Figure: CaptionQA pipeline overview]

🔰 Introduction

This post serves as a more flexible, narrative extension of our paper—something closer to a technical blog. The first three sections focus on why we created CaptionQA. If you'd prefer to jump straight to the benchmark details, feel free to skip ahead to Section 4.

1. Why Did We Build This Benchmark?

After working on multimodal systems across several companies and product lines, I’ve noticed a striking pattern:

👉 Industry relies heavily on image/document captioning, yet our understanding of captioning remains surprisingly shallow.

I’ve worked on multimodal problems in various contexts—search & retrieval, content understanding, intelligent assistants, and agent systems. Across all these teams, one recurring product request kept showing up:

Can we first generate a good caption for this image/document/product?

However, problems arise the moment teams attempt to evaluate whether their captions are “good enough”:

  • There is no truly plug-and-play captioning benchmark available today.
  • Most existing evaluations focus on early MSCOCO-style short captions, far from real product use cases.
  • Even tools inspired by our prior work (e.g., concept extraction and concept matching in CCEval-style setups in HallE-Control) often fail to transfer directly to real-world production scenarios.

The more we discussed captioning with people from both academia and industry, the clearer the gap became:

🎯 In industry

Captioning is infrastructure: It powers search, ranking, content understanding, recommendation systems, agent state representation, and more.

🎓 In academia

Caption evaluation receives very little attention, especially evaluations that are:

  • easy to understand
  • generalizable
  • cost-efficient
  • practically useful for real-world systems

This gap is the reason CaptionQA exists.

CaptionQA aims to provide academia with an industrial perspective, and to provide industry with a more scientific, grounded evaluation tool—a high-density, cross-domain caption benchmark that more closely reflects real-world needs.

Our Design Principles

When designing CaptionQA, we followed five core principles:

  1. Return to the definition of captioning
    Our evaluation must align with what we believe captions should be—not just what legacy datasets measure.

  2. Prefer native images and avoid data leakage
    Most popular multimodal datasets have been overly exposed and are likely present in model pretraining data.

  3. Keep evaluation scores simple and interpretable
    Overly complicated scoring schemes make benchmarks harder to trust and harder to adopt.

  4. Make the evaluation comprehensive yet cost-efficient
    Teams need fast results and effective error analysis, without enormous compute costs.

  5. Ensure the method is general and highly transferable
    Industrial use cases vary wildly; a benchmark that cannot be adapted is a benchmark that won’t be used.

2. What Is a Caption? And Why Is It So Critical in Industry?

The task of captioning actually has a long history.

In the early days (e.g., Google's early Show and Tell work), the motivation for generating captions was simple: teach models to mimic how humans summarize what they see. As multimodal LLMs evolved, the community shifted toward detailed captions, heavily influenced by traditions in object detection and scene graphs. Under this academic framing, a “good” caption is often viewed as an object-centric description that:

  • Enumerates all objects
  • Describes each object’s attributes
  • Sometimes includes relationships between objects

This definition has persisted for years in academia.

But industry needs something very different

In real-world applications, captioning needs are diverse, concrete, and deeply task-dependent—far beyond the academic interpretation. Anyone who has worked on multimodal products knows that captioning is used in ways that are much more complex than the traditional “describe the image” paradigm.

After working across multiple companies and product lines—and after talking with many multimodal teams—I gradually realized an important truth:

Almost every multimodal system relies on captioning, but the purpose is not to “describe images”—the goal is to convert visual information into useful, consumable text.

In production systems, captions function as textual interfaces to vision, supporting downstream components such as retrieval, ranking, summarization, recommendation, and agent reasoning. This reality is precisely why current caption benchmarks—which assume that captioning is merely object listing and attribute description—fail to reflect industry needs.

1) Search & Recommendation: Converting Images → Captions → Text-Based Systems

In e-commerce, short-video platforms, and social networks, a very common pipeline is:

  • Product image → caption
  • Video frame → caption
  • User post → caption

Why? Because:

  • User queries are inherently text
  • Most companies’ retrieval/ranking systems are fundamentally text-based
  • And the vast majority of companies simply do not have multimodal search infrastructure

This point is crucial:

We often assume that “everyone has multimodal search,” but in reality only a few tech giants do.
Most companies’ search and recommendation stacks remain purely text-driven.

Therefore, converting images into captions becomes the only practical way for many companies to unlock multimodal capabilities.
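
To make this pipeline concrete, here is a minimal sketch of caption-based retrieval. The `caption_image` stub and the use of scikit-learn TF-IDF are illustrative stand-ins for a real captioning model and a production text-search stack; none of this is part of CaptionQA itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def caption_image(image_path: str) -> str:
    # Placeholder: in practice this calls whatever captioning model you deploy.
    return f"caption for {image_path}"

# Offline: convert every image into a caption and index the captions as plain text.
image_paths = ["product_001.jpg", "product_002.jpg", "product_003.jpg"]
captions = [caption_image(p) for p in image_paths]      # image -> caption

vectorizer = TfidfVectorizer()
caption_matrix = vectorizer.fit_transform(captions)     # text-only index

# Online: user queries are text, so they hit the same text index directly.
def search(query: str, top_k: int = 3):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, caption_matrix)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [(image_paths[i], float(scores[i])) for i in ranked]
```

In practice the captions would be written into whatever search index the company already runs (Elasticsearch, BM25, and so on); the point is only that once images become captions, the existing text stack applies unchanged.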

2) ToB / Document Tasks: Databases Cannot Store Images — They Store Captions

In enterprise (ToB) scenarios, document-related tasks represent a massive category:

  • Reports
  • Financial statements
  • News articles
  • Contracts
  • Manuals, etc.

Databases cannot directly apply query / join / rule logic to images. Therefore, companies typically need:

  • OCR
  • Document understanding
  • Information extraction
  • And ultimately, converting the page’s content into a caption-like text representation

In ToB applications, caption-like text has effectively become infrastructure.
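
As a toy illustration of what “caption-like text as infrastructure” means, the sketch below stores page-level captions in SQLite and filters them with ordinary SQL, something that cannot be done against raw image bytes. The table and values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (doc_id TEXT, page INTEGER, caption TEXT)")

# Captions produced by an OCR / document-understanding stage (hypothetical values).
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("report_q3", 1, "Revenue table: Q3 2024 revenue 12.4M USD, up 8% YoY."),
        ("report_q3", 2, "Bar chart comparing regional sales; EMEA leads at 41%."),
        ("contract_17", 1, "Termination clause: 90-day written notice required."),
    ],
)

# Ordinary query / join / rule logic now works on the caption text.
rows = conn.execute(
    "SELECT doc_id, page FROM pages WHERE caption LIKE '%revenue%'"
).fetchall()
print(rows)  # [('report_q3', 1)]
```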

3) Privacy & Compliance: Many Companies “Cannot Store Images—Only Captions”

For privacy and compliance reasons, some companies are strictly prohibited from:

  • Storing user images
  • Storing user videos
  • Accessing any multimodal data that has not gone through security review

As a result, the only thing they are allowed to retain is a privacy-sanitized caption: auditable, controllable, and indexable.

This leads to an interesting phenomenon:

In some large enterprises, the lifecycle of image data is extremely short.
The only representation that persists is the caption, not the image.

4) A Key Component in Agent Systems: A “New” Use Case

In emerging multimodal agent systems and embodied AI, captions are becoming a core element of the workflow.

Captions increasingly act as:

  • the textual carrier of visual signals inside CoT reasoning
  • the serialized representation of agent state
  • the bridge between perception and decision-making

This trend has grown rapidly in the past two years, and I believe it will become one of the most important emerging directions—one that we cannot afford to ignore.

Different Companies Have Completely Different Caption Requirements

Across industry, “captioning” is not one task — it is dozens of different tasks.

E-commerce

  • Brand, price, size/specifications, material
  • Attribute-heavy descriptions
  • Strong emphasis on product properties

Social Platforms

  • Natural images
  • Event-centric (what happened)
  • Object-centric (what is present)

ToB / Document Workflows

  • OCR
  • Table structure extraction
  • Layout understanding
  • Business field extraction

Short-Video Platforms

  • Scene transitions
  • Actions
  • Object–event sequences

Album / Smartphone Manufacturers

  • Portraits, beautification, geolocation
  • Multi-scene blending

This highlights a key point:

In industry, captioning is not a single task — it is dozens of tasks.
Academic captioning research covers only the simplest one.

Caption Is Fundamentally an “Information Carrier,” Not a Description

Most of the time, we don’t use captions to describe an image. We use captions because:

We need a textual carrier to extract and transport the information inside images into downstream tasks. A caption happens to be the most convenient, safe, compact, and controllable representation.

In other words: whatever the downstream task cares about is exactly what the caption should express. It does not need to be infinitely detailed, nor cover every piece of information. The goal is not “the longer the better,” but rather:

The more accurately the caption captures task-relevant information, the better.

The idea of a “detailed caption” is fundamentally vague — there is no upper bound. But industry requirements are extremely clear: short and effective.

3. The Current Gap Between Academia and Industry

After switching between academia and industry over the years, I realized something increasingly clear:

What academia calls “captioning technology” is almost entirely different from the “captioning capability” that industry truly needs.

1) Academia Treats Captioning as a “Description Task,” While Industry Treats Captioning as an “Information Interface”

In academia, captioning means:

  • generating a descriptive sentence
  • optimizing BLEU/CIDEr
  • climbing the leaderboard

But in industry, captions function as:

  • input to search systems
  • input to recommendation systems
  • input to document-structuring pipelines
  • a normalized, storable data representation
  • part of the agent’s state and intermediate reasoning

Industry does not care about “how good the description sounds.” Industry cares about:

Whether the caption can reliably power downstream tasks.

2) Academia Optimizes for “More Details,” While Industry Optimizes for “More Effectiveness”

In academia, a “detailed caption” essentially has no upper bound—longer sentences, more objects, more attributes. But industry wants captions that are:

  • short and precise
  • free from hallucination
  • focused on key information
  • low latency
  • effective for the target task

In other words:

Industry does not need to “cover all information,” but to cover only the information required by the task.

These are fundamentally different optimization goals.

3) Academia Evaluates “Language Quality,” While Industry Evaluates “Task Outcome”

Academia uses BLEU / ROUGE / CIDEr. But industry evaluates:

  • whether search becomes more accurate
  • whether attribute extraction becomes more stable
  • whether document fields become more complete
  • whether agents can plan the next step more reliably

Whether a caption “sounds human-like” is irrelevant. The critical question is:

Does the caption improve task performance?


4) Agent Scenarios Make This Gap Even More Obvious

In multimodal agents, the model must integrate visual information into the reasoning loop. But LLM reasoning is inherently language-based, so the pipeline becomes:

image → structured language (caption) → chain-of-thought reasoning

Here, captions are not “descriptions.” Captions are:

  • state summaries
  • tool inputs
  • intermediate reasoning steps
  • foundational signals for action planning

Yet academia has almost no caption benchmarks designed for agent settings, which means academic caption research drifts further and further away from real industrial needs.

4. How Does CaptionQA Evaluate Captioning?

After laying all the groundwork, we can finally reach the core question:

How do we turn the messy, diverse, structure-varying task of captioning into a measurable, interpretable, scalable evaluation framework?

The CaptionQA evaluation pipeline is extremely simple — only three steps:


1) Use Any Model to Generate Captions

(prompts interchangeable, models interchangeable)

You may choose to use:

  • our provided short / simple / long / taxonomy prompts, or
  • your own custom-designed prompts

We do not restrict caption style, format, or length — the model is free to express itself.


2) A Fixed Evaluation Model (Qwen-2.5-72B) Answers Our Carefully Designed QA

Key point:

The evaluator only sees the caption — not the image.

This is the core principle of CaptionQA:

A caption is the textual substitute for an image. If a caption truly captures the image, it must support image-level QA.

Our QA covers object attributes, relationships, layout, states, OCR information, actions, and many more concept categories. For each question, we record one of three outcomes:

  • “cannot answer” → coverage not sufficient
  • wrong answer → hallucination
  • correct answer → faithfulness & accuracy

3) Final Score Is Extremely Simple: Pure Accuracy (0–100)

We intentionally avoid complex metrics like BLEU / ROUGE / CIDEr.

Because:

  • accuracy is interpretable
  • accuracy is easy to debug
  • accuracy allows fair cross-model comparison
  • accuracy is friendly to product teams, managers, and researchers alike

In short:

A good evaluation metric should be understood by everyone instantly.
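
Putting the three steps together, the whole evaluation loop fits in a few lines. This is a minimal sketch, not the released code: `generate_caption` stands in for whatever captioning model and prompt you choose, and `ask_evaluator` stands in for the fixed text-only evaluator (Qwen-2.5-72B in our setup), which sees only the caption.

```python
def generate_caption(image, prompt: str) -> str:
    """Step 1: any captioning model, any prompt (placeholder)."""
    ...

def ask_evaluator(caption: str, question: str, options: list[str]) -> str:
    """Step 2: the text-only evaluator answers from the caption alone;
    it never sees the image (placeholder for an LLM call)."""
    ...

def captionqa_score(dataset, prompt: str) -> float:
    """Step 3: plain accuracy over all questions, scaled to 0-100."""
    correct, total = 0, 0
    for item in dataset:                      # item: one image plus its dense QA set
        caption = generate_caption(item["image"], prompt)
        for qa in item["questions"]:
            pred = ask_evaluator(caption, qa["question"], qa["options"])
            correct += int(pred == qa["answer"])
            total += 1
    return 100.0 * correct / total
```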


5. Why Choose QA (Instead of Concept Extraction)? Why This Design?

We experimented with many approaches (extraction, matching, NLP tools) and eventually settled on QA, for four reasons:


1) QA Has Extremely High Information Density and Unlimited Expandability

If you want to evaluate something, you can directly ask it — without designing complex rules, tokenizers, POS taggers, or concept parsers.


2) QA Is Very Friendly for Human Annotation — and for LLM Auto-Generation

Teaching annotators to perform “concept extraction” is extremely difficult. But labeling QA is simple, direct, and cost-efficient.


3) QA Has a Short Evaluation Chain and High Stability

Concept extraction usually involves extraction → matching → scoring, while QA is simply caption → answer → accuracy. The shorter the chain, the more stable the evaluation.


4) QA Is a Unified Task Format (LLMs Excel at This)

LLMs are inherently good at QA and can naturally scale QA to large volumes. You don’t need complicated prompt engineering to make the model “guess” concepts.


High-Density (Dense QA) Is Our Core Idea

For every domain, we construct a concept schema, and for each image we create at least 50 questions on average. For natural images, e-commerce, documents, and embodied AI, we design schemas that represent what practitioners actually care about. Captions should cover these concepts, so we generate large numbers of questions accordingly; a minimal sketch of this schema-driven generation follows the list below. (We only release 25% of the QA set publicly. For full-density evaluation, please submit your captions so our team can run them.)

This high-density QA design allows us to:

  • use significantly less data
  • cover many more conceptual dimensions
  • converge much faster
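
Here is the minimal sketch of schema-driven QA generation mentioned above. The mini schema and the `llm_generate_questions` helper are made up for illustration; the actual pipeline, prompts, and filtering code live in the GitHub repo.

```python
# A tiny, illustrative slice of a domain schema (the released schemas are richer).
ECOMMERCE_SCHEMA = {
    "product_attributes": ["brand", "material", "size", "color"],
    "text_in_image": ["printed text", "logos", "labels"],
    "layout_and_state": ["object placement", "packaging state"],
}

def llm_generate_questions(image, category: str, concepts: list[str]) -> list[dict]:
    """Hypothetical LLM call that drafts multiple-choice QA for one concept category."""
    ...

def build_dense_qa(image, schema: dict, per_category: int = 20) -> list[dict]:
    qa_pairs = []
    for category, concepts in schema.items():
        drafted = llm_generate_questions(image, category, concepts) or []
        qa_pairs.extend(drafted[:per_category])   # cap per category, then human-filter
    return qa_pairs                               # target: roughly 50 questions per image
```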

[Figure: question density per image]

Low-Cost: Stable Evaluation With Very Little Data

With only 100 images, a model’s scores across different domains already become stable. This means:

  • no need for massive evaluation datasets
  • evaluation can run entirely on local machines
  • any company — no matter the size — can afford to use it

This property is extremely important for real industrial use.
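
One way to sanity-check this on your own model is to subsample images and watch how the overall score converges. Below is a quick sketch, assuming you already have a per-image accuracy list from a CaptionQA run.

```python
import random
import statistics

def score_stability(per_image_acc, sizes=(25, 50, 100, 200), trials=50):
    """Report the mean and spread of the benchmark score when only n images are used."""
    for n in sizes:
        if n > len(per_image_acc):
            continue
        scores = [statistics.mean(random.sample(per_image_acc, n)) for _ in range(trials)]
        print(f"n={n:4d}  mean={statistics.mean(scores):5.2f}  stdev={statistics.pstdev(scores):4.2f}")
```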

[Figure: model performance vs. number of evaluation images]

Easy Transfer: You Can Replicate CaptionQA to Any Domain

We open-source:

  • the QA generation pipeline
  • QA filtering and cleaning code
  • annotation guidelines
  • domain schema templates

Researchers can simply swap in a new schema to extend CaptionQA to any project or domain they care about. CaptionQA makes it possible for every field to have its own domain-specific caption benchmark.


Self-Collected Data and Multi-Domain Coverage

We manually collected and filtered 658 images across four domains, and invited experts from each domain to help refine the schemas. Based on their input, we chose the four domains that best reflect real industry needs:

  • natural
  • e-commerce
  • document
  • embodied AI

These domains together cover a very wide spectrum of caption use cases.

[Figure: CaptionQA domain and concept taxonomy]

Agent-Ready Evaluation

Another major reason we choose a QA-based framework is that captions are essentially mandatory in multimodal agent systems. In an agent workflow, the caption becomes a representation of the multimodal state — a compact snapshot of what the model “knows” so far.

Captions are stored as intermediate information, and most downstream tasks in agent systems take the form of QA. The agent either answers questions or uses the caption as part of its reasoning steps. In many cases, QA is simply the next step in the chain of thought.

We believe future MM agents will rely even more heavily on systems like CaptionQA to diagnose and quantify the quality of these intermediate caption states.
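
To make “caption as agent state” concrete, here is a minimal, hypothetical agent step in which the caption is the only visual information carried into the reasoning loop. The function names are placeholders, not an existing framework.

```python
def caption_observation(image) -> str:
    """Perception: compress the current visual observation into a caption."""
    ...

def plan_next_action(state_caption: str, goal: str, history: list[str]) -> str:
    """Reasoning: a text-only LLM plans from captions, not from raw pixels."""
    ...

def agent_step(image, goal: str, memory: list[str]) -> str:
    state = caption_observation(image)            # caption = serialized visual state
    memory.append(state)                          # only the caption is stored, never the image
    return plan_next_action(state, goal, memory)  # a wrong caption means a wrong plan
```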

6. CaptionQA Evaluation Results



1. Model Comparison

When reading any benchmark, the first question is of course: “Which models are stronger?”
We evaluated mainstream open-source and closed-source vision-language models using CaptionQA, and the results reveal several interesting patterns.

Open-Source Models: Qwen3-VL and GLM-4.1V Form a Consistent Top Tier

Across prompt types (Long / Simple / Taxonomy), Qwen3-VL and GLM-4.1V reliably occupy the top two spots among open-source models.

  • Qwen3-VL ranks #1 overall across all settings except the Short prompt.
  • GLM-4.1V delivers the best Document-domain performance among all open-source models.

Because the Document domain requires non-trivial OCR (tables, charts, layout understanding, structured text), GLM’s strong showing here aligns with expectations.

Closed-Source Models: GPT-5 and Gemini-2.5-Pro Remain in the Lead

If we temporarily ignore short-prompt results (closed-source models are not always optimized for it), the overall trend is:

  • GPT-5 and Gemini-2.5-Pro form the top tier.
  • In the Document domain, GPT-5 clearly outperforms Gemini-2.5-Pro, suggesting GPT-5 is more mature in handling complex document understanding (OCR, diagrams, layout).

Open-Source vs Closed-Source: The Gap Is Closing Quickly

A particularly notable observation:

Qwen3-VL’s overall performance is now very close to GPT-5 / Gemini-2.5-Pro, especially in domains such as Natural / E-commerce / Embodied AI.

In other words:

On the “caption-as-information-interface” foundation, open-source models have already entered first-tier competitiveness.

This is a very encouraging signal for the entire industry.


2. Differences Across Prompts

In CaptionQA, we evaluated four common caption prompts — Short / Simple / Long / Taxonomy — covering the spectrum from traditional captioning to the instruction-following style widely used in modern MLLMs.
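
For orientation, only the Simple prompt is quoted verbatim later in this post (“Describe this image in detail.”); the other three strings below are paraphrases of each prompt type, not the exact released wording.

```python
# Only "simple" is the exact string quoted in this post; the rest are
# illustrative paraphrases of each prompt type, not the released prompts.
CAPTION_PROMPTS = {
    "short":    "Describe this image in one sentence.",                      # paraphrase
    "simple":   "Describe this image in detail.",                            # quoted below
    "long":     "Describe this image in as much detail as possible.",        # paraphrase
    "taxonomy": "Describe this image, covering: <concept schema inserted>",  # paraphrase
}
```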

Below is a comparison of the average output lengths for the four prompt types:

[Figure: average caption length for each prompt type]

Although longer prompts lead to longer captions, model performance does not simply improve with longer outputs. We observed several trends that are worth sharing.

Short Prompt: Traditional short captions can no longer meet modern multimodal needs

Short prompts resemble early-era one-sentence captioning (e.g., classic CLIP-style captions). Our results show:

  • The generated captions are too short
  • Information coverage is extremely limited
  • CaptionQA scores are consistently biased downward
  • Little to no utility for downstream tasks

This matches our expectations: short captions are largely unusable in modern multimodal applications, especially when tasks require fine-grained semantic detail.

Simple Prompt: “Describe this image in detail” is the most balanced and stable setting

The Simple prompt corresponds to the most widely adopted detailed caption format:

“Describe this image in detail.”

Its characteristics:

  • Clearly longer captions
  • Higher information density
  • More complete concept coverage
  • Stronger model performance across domains

Many modern MLLMs are actually trained with this kind of prompt, so it naturally reflects their “true” captioning capabilities. We recommend Simple as the default prompt and treat it as the standard setting for CaptionQA.

Long Prompt: Captions get longer, but information density does not improve

The Long prompt is designed to push models to “write as long as possible.” Indeed, average length grows significantly, from roughly 356 to 510 words, but performance barely improves. The reason is simple:

Models have an upper bound on information density. Longer captions mostly repeat or expand wording, without adding new information.

This implies:

  • A long caption ≠ a good caption
  • Visual understanding has a ceiling that cannot be exceeded by verbose writing
  • Chasing longer captions yields diminishing returns

This also explains why simply “writing more” when building caption datasets does not lead to substantial quality improvements.

Taxonomy Prompt: Giving models the “test coverage” actually makes them fail

This was the most surprising part of our study. Our intuition was: “If we give models the QA concept schema, they could ‘fill in the blanks’ and cover more information.” But results show the opposite:

  • Model scores drop significantly across all domains
  • Many models show unstable instruction-following
  • Severe task drift appears in generated captions
  • Models focus on “format following” rather than image understanding

This reveals a very practical issue:

Even modern MLLMs still struggle with complex structured instructions, especially when the instruction resembles a schema.

Although multimodal pretraining and post-training often include structured prompts, few works have examined this failure mode systematically. CaptionQA makes this weakness clearly visible.

🌟 The failure of the Taxonomy prompt reveals a major future challenge for multimodal Agents

In many Agent systems, instructions tend to be:

  • Automatically generated
  • Structurally complex
  • Very long
  • Multi-stage and nested

This leads to a problem of instruction scale: instructions become increasingly long, complex, and difficult for models to follow. This raises several important questions:

  • How can models remain reliable when facing future ultra-long, auto-generated instructions?
  • How can we avoid task drift?
  • How can we achieve both good captioning and robust instruction-following at the same time?

This was an unexpected finding during our CaptionQA study, and we believe it is an extremely valuable direction for future research.


3. Comparison with VQA

📌 VQA vs CaptionQA: Why are models strong at VQA but still weak at captioning?

We divide the model’s behavior into two separate capabilities:

  • QA-on-image: Answering questions by directly looking at the image (similar to traditional VQA)
  • QA-on-caption: Answering questions only from the caption, without seeing the image (CaptionQA)
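
The gap we plot below is just the difference between these two accuracies per model; a trivial sketch with placeholder numbers (not results from the paper):

```python
def vqa_caption_gap(qa_on_image: dict, qa_on_caption: dict) -> dict:
    """Per-model gap between answering from the image and answering from the caption."""
    return {m: qa_on_image[m] - qa_on_caption[m] for m in qa_on_image}

# Placeholder numbers purely for illustration:
print(vqa_caption_gap({"model_a": 95.0}, {"model_a": 86.0}))  # {'model_a': 9.0}
```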

[Figure: QA-on-image vs. QA-on-caption accuracy across models]

As we move toward the right side of the figure, the gap grows larger, meaning:

The model’s visual understanding ability (VQA) is much stronger than its ability to express that understanding through captions.

Stronger models show a smaller VQA → Caption gap, but the gap is still significant

For top-tier models such as GPT-5, Gemini-2.5-Pro, and Qwen3-VL, we observe:

  • QA-on-image: 95%–98%
  • QA-on-caption: 85%–90%

A gap of 9%–11%.

This means:

Stronger models are indeed better at extracting and stating visual information clearly.

Mid-tier and many open-source models show much larger gaps (20%–30%+)

Some open-source models (especially mid-tier ones) exhibit:

  • QA-on-image: around 90%
  • QA-on-caption: 60%–75%

Gaps can exceed 30%.

This indicates:

Models “understand,” but they cannot “say it clearly.”

In other words:

  • Their visual perception is not bad
  • But their caption generation is extremely unstable
  • Information is missing, disorganized, or drifting (task drift is very common)
  • Once you rely on the caption rather than the raw image, all these issues surface immediately

This explains a common industry phenomenon:

Many models perform extremely well on VQA leaderboards, but produce unusable captions in real-world applications.

This reveals an overlooked truth: Captioning is the weakest, most neglected link in the capability chain

Although captioning is one of the most fundamental multimodal tasks, the ecosystem has evolved such that:

  • The research community focuses far more on VQA than captioning
  • Companies benchmark visual understanding (“can it recognize?”)
  • Few evaluate visual expression (“can it articulate clearly?”)
  • Multimodal pretraining treats captioning as a side effect
  • There are very few dedicated caption objectives
  • RL and instruction tuning rarely target captioning

This directly leads to:

Models can understand, but cannot express.

Yet in real applications—search, documents, recommendation, agent systems—the caption is often the only usable information carrier. This is exactly one of the core questions CaptionQA aims to answer:

“Is the model lacking visual understanding, or lacking expression?” CaptionQA cleanly separates the two.


4. To improve captioning, we must treat captioning as an independent task, not a byproduct

From all our findings, the conclusion is clear:

Captioning must be re-prioritized, re-defined, and re-trained.

Future directions include:

  • Clearer caption task definitions across domains
  • Paired data mixing complex instructions × caption
  • More domains, more diversity, and more dense caption supervision
  • Stronger caption RL to reduce task drift
  • Specialized modeling for compressed & structured information
  • Teaching models how to generate “high-density, structured” captions

This is especially crucial for Agent systems:

Captioning is not “describing an image”—
it is the state representation for Agents.
If the caption is wrong, the entire Agent policy will be wrong.

Captioning is becoming a foundational requirement for future multimodal systems.

We include many more experiments and details in our original paper’s appendix. If you spot issues or want to extend our dataset, we welcome feedback and community contributions—we genuinely hope CaptionQA can help the open-source community. Feel free to leave comments or open an issue. We look forward to discussions!

Finally, here is the BibTeX citation:

@misc{yang2025captionqacaptionusefulimage,
      title={CaptionQA: Is Your Caption as Useful as the Image Itself?},
      author={Shijia Yang and Yunong Liu and Bohan Zhai and Ximeng Sun and Zicheng Liu and Emad Barsoum and Manling Li},
      year={2025},
      eprint={2511.21025},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21025},
}
