ProVision: Programmatic Generation of Multimodal Instruction Data for Enhanced Model Training

The Challenge of Multimodal Instruction Data

Recent advances in multimodal language models (MLMs) like GPT-4V and BLIP have enabled sophisticated image-based reasoning, such as answering questions like "How many students are raising their hands in this image?" However, training these models requires high-quality instruction data—images paired with corresponding questions and answers—which is difficult to generate at scale.

Existing approaches face key limitations:

  • Manual annotation is expensive and time-consuming.
  • Proprietary models used for synthetic data generation are costly, prone to hallucinations, and struggle with scalability, interpretability, and factual accuracy.

Introducing ProVision: A Scalable, Programmatic Solution

To address these challenges, we developed ProVision, a framework that automatically synthesizes multimodal instruction data using scene graphs and human-written Python programs.

How It Works:

  1. Scene Graph Representation – Each image is parsed into a structured scene graph, where objects and attributes are nodes, and edges define their relationships.
  2. Programmatic Data Generation – Using Python-based generators and text templates, ProVision systematically produces diverse, accurate Q&A pairs from scene graphs.
    • Example: For an image of a busy street, ProVision can generate:
      • “What is the relationship between the pedestrian and the car?”
      • “Which object is closer to the red building: the car or the pedestrian?”
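The two steps above can be sketched in a few lines of Python. The scene-graph schema and the two toy generators below are illustrative assumptions for the busy-street example, not ProVision's actual implementation:

```python
# Minimal sketch of scene-graph-driven Q&A generation.
# The graph schema and generator functions are illustrative assumptions.

# A toy scene graph: objects (with attribute lists) and directed relations.
scene_graph = {
    "objects": {
        "pedestrian": {"attributes": ["walking"]},
        "car": {"attributes": ["red", "parked"]},
        "building": {"attributes": ["red", "tall"]},
    },
    "relations": [
        ("pedestrian", "next to", "car"),
        ("car", "in front of", "building"),
    ],
}

def relation_qa_generator(graph):
    """Yield a (question, answer) pair for each relation edge,
    filled into a fixed text template."""
    for subj, rel, obj in graph["relations"]:
        question = f"What is the relationship between the {subj} and the {obj}?"
        answer = f"The {subj} is {rel} the {obj}."
        yield question, answer

def attribute_qa_generator(graph):
    """Yield a (question, answer) pair about each object's attributes."""
    for name, props in graph["objects"].items():
        if props["attributes"]:
            question = f"What attributes does the {name} have?"
            answer = ", ".join(props["attributes"])
            yield question, answer

# Running every registered generator over the graph yields the Q&A pairs.
qa_pairs = list(relation_qa_generator(scene_graph)) + list(attribute_qa_generator(scene_graph))
for q, a in qa_pairs:
    print(q, "->", a)
```

Because answers are read directly off the graph rather than sampled from a model, every pair is factually grounded in the annotation, and adding a new question type is just adding another generator function.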

Key Advantages Over Traditional Methods:
  • Interpretability – Rule-based generation ensures factual correctness.
  • Scalability – New data generators can be added to expand question types.
  • Flexibility – Works with both annotated and automatically generated scene graphs.

ProVision-10M: A Large-Scale Multimodal Dataset

Our framework integrates 24 single-image and 14 multi-image instruction generators, producing over 10 million high-quality Q&A pairs—publicly released as ProVision-10M.

Performance Improvements in Fine-Tuning MLMs

We evaluated ProVision-10M by incorporating it into:

  • LLaVA-1.5 (single-image tasks)
  • Mantis-SigLIP-8B (multi-image tasks)

Results:

  • Manually annotated scene graphs yielded the highest performance gains.
  • When used in both pretraining and fine-tuning of xGen-MM-4B (BLIP-3), ProVision data improved performance by 1.6% on average across 11 benchmarks, outperforming baselines without our data.

Future Directions

ProVision opens new possibilities for scalable, high-quality multimodal training data. Future work could:

  • Expand data generators to cover new question types.
  • Improve scene graph generation for higher accuracy.
  • Extend to video-based instruction data and other modalities.

By enabling systematic, rule-based instruction synthesis, ProVision provides a cost-effective, transparent, and scalable alternative to traditional data generation methods—helping advance the next generation of multimodal AI.
