ProVision: Programmatic Generation of Multimodal Instruction Data for Enhanced Model Training

The Challenge of Multimodal Instruction Data

Recent advances in multimodal language models (MLMs) like GPT-4V and BLIP have enabled sophisticated image-based reasoning, such as answering complex queries like “How many students are raising their hands in this image?” However, training these models requires high-quality instruction data—visual content paired with corresponding questions and answers—which is difficult to generate at scale.

Existing approaches face key limitations:

  • Manual annotation is expensive and time-consuming.
  • Proprietary models used for synthetic data generation are costly, prone to hallucinations, and struggle with scalability, interpretability, and factual accuracy.

Introducing ProVision: A Scalable, Programmatic Solution

To address these challenges, we developed ProVision, a framework that automatically synthesizes multimodal instruction data using scene graphs and human-written Python programs.

How It Works:

  1. Scene Graph Representation – Each image is parsed into a structured scene graph, where objects and attributes are nodes, and edges define their relationships.
  2. Programmatic Data Generation – Using Python-based generators and text templates, ProVision systematically produces diverse, accurate Q&A pairs from scene graphs.
    • Example: For an image of a busy street, ProVision can generate:
      • “What is the relationship between the pedestrian and the car?”
      • “Which object is closer to the red building: the car or the pedestrian?”
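The two steps above can be sketched in a few lines of Python. This is an illustrative sketch, not ProVision’s actual API: the scene graph is modeled as a plain dictionary, and a single template-based generator turns each relation edge into a Q&A pair.

```python
# Illustrative sketch (not ProVision's real interface): a scene graph as a
# dictionary of objects, attributes, and relation edges, plus one
# template-based generator that emits Q&A pairs from the relations.

scene_graph = {
    "objects": {
        "pedestrian": {"attributes": ["walking"]},
        "car": {"attributes": ["red"]},
    },
    "relations": [
        # (subject, predicate, object)
        ("pedestrian", "next to", "car"),
    ],
}

def relation_qa_pairs(graph):
    """Yield (question, answer) pairs, one per relation edge."""
    for subj, pred, obj in graph["relations"]:
        question = f"What is the relationship between the {subj} and the {obj}?"
        answer = f"The {subj} is {pred} the {obj}."
        yield question, answer

for q, a in relation_qa_pairs(scene_graph):
    print(q)
    print(a)
```

Because the answer is computed directly from the graph rather than sampled from a model, every generated pair is factually consistent with the annotation by construction.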

Key Advantages Over Traditional Methods:
  • Interpretability – Rules-based generation ensures factual correctness.
  • Scalability – New data generators can be added to expand question types.
  • Flexibility – Works with both annotated and automatically generated scene graphs.
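The scalability point can be illustrated with a minimal generator registry. The names and registration mechanism here are assumptions for the sketch, not ProVision’s real interface: each generator is a function from a scene graph to Q&A pairs, so extending the framework means registering one more function.

```python
# Hypothetical registry sketch (names are assumptions, not ProVision's
# actual API): new question types are added by registering new generator
# functions, without touching the existing pipeline.

GENERATORS = []

def register(fn):
    """Decorator that adds a generator function to the registry."""
    GENERATORS.append(fn)
    return fn

@register
def attribute_questions(graph):
    # One Q&A pair per (object, attribute) annotation in the graph.
    for name, info in graph["objects"].items():
        for attr in info.get("attributes", []):
            yield (f"Which attribute describes the {name}?", attr)

def generate_all(graph):
    """Run every registered generator over one scene graph."""
    return [qa for gen in GENERATORS for qa in gen(graph)]

graph = {"objects": {"car": {"attributes": ["red"]}}}
print(generate_all(graph))  # [('Which attribute describes the car?', 'red')]
```

Adding a second generator (e.g. for counting or spatial questions) would be another decorated function, which is how a library of 24 single-image and 14 multi-image generators can grow incrementally.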

ProVision-10M: A Large-Scale Multimodal Dataset

Our framework integrates 24 single-image and 14 multi-image instruction generators, producing over 10 million high-quality Q&A pairs—publicly released as ProVision-10M.

Performance Improvements in Fine-Tuning MLMs

We evaluated ProVision-10M by incorporating it into:

  • LLaVA-1.5 (single-image tasks)
  • Mantis-SigLIP-8B (multi-image tasks)

Results:

  • Manually annotated scene graphs yielded the highest performance gains.
  • When used in both the pretraining and fine-tuning of xGen-MM-4B (BLIP-3), ProVision data improved performance by 1.6% on average across 11 benchmarks, outperforming the same model trained without our data.

Future Directions

ProVision opens new possibilities for scalable, high-quality multimodal training data. Future work could:

  • Expand data generators to cover new question types.
  • Improve scene graph generation for higher accuracy.
  • Extend to video-based instruction data and other modalities.

By enabling systematic, rule-based instruction synthesis, ProVision provides a cost-effective, transparent, and scalable alternative to traditional data generation methods—helping advance the next generation of multimodal AI.

