ProVision: Programmatic Generation of Multimodal Instruction Data for Enhanced Model Training
The Challenge of Multimodal Instruction Data
Recent advances in multimodal language models (MLMs) like GPT-4V and BLIP have enabled sophisticated image-based reasoning, such as answering complex queries like “How many students are raising their hands in this image?” However, training these models requires high-quality instruction data—visual content paired with corresponding questions and answers—which is difficult to generate at scale.
Existing approaches face key limitations:
- Manual annotation is expensive and time-consuming.
- Proprietary models used for synthetic data generation are costly, prone to hallucinations, and struggle with scalability, interpretability, and factual accuracy.
Introducing ProVision: A Scalable, Programmatic Solution
To address these challenges, we developed ProVision, a framework that automatically synthesizes multimodal instruction data using scene graphs and human-written Python programs.
How It Works:
- Scene Graph Representation – Each image is parsed into a structured scene graph, where objects and attributes are nodes, and edges define their relationships.
- Programmatic Data Generation – Using Python-based generators and text templates, ProVision systematically produces diverse, accurate Q&A pairs from scene graphs.
- Example: For an image of a busy street, ProVision can generate:
- “What is the relationship between the pedestrian and the car?”
- “Which object is closer to the red building: the car or the pedestrian?”
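The pipeline above can be illustrated with a minimal sketch. The scene-graph schema and generator function below are simplified assumptions for illustration, not ProVision's actual data structures or released code:

```python
# Hypothetical sketch of scene-graph-based Q&A generation.
# The schema and template are illustrative assumptions, not the
# released ProVision implementation.

# A minimal scene graph: objects with attributes, plus relation triples.
scene_graph = {
    "objects": {
        "pedestrian": {"attributes": []},
        "car": {"attributes": ["parked"]},
        "building": {"attributes": ["red"]},
    },
    "relations": [
        ("pedestrian", "next to", "car"),
    ],
}

def relation_qa(graph):
    """Turn each relation triple into a templated Q&A pair."""
    pairs = []
    for subj, rel, obj in graph["relations"]:
        question = f"What is the relationship between the {subj} and the {obj}?"
        answer = f"The {subj} is {rel} the {obj}."
        pairs.append((question, answer))
    return pairs

for q, a in relation_qa(scene_graph):
    print(q)
    print(a)
```

Because every answer is read directly out of the graph by a deterministic rule, the Q&A pairs are correct by construction whenever the scene graph itself is accurate.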
Key Advantages Over Traditional Methods:
✔ Interpretability – Rule-based generation ensures factual correctness.
✔ Scalability – New data generators can be added to expand question types.
✔ Flexibility – Works with both annotated and automatically generated scene graphs.
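The scalability advantage can be sketched with a simple generator registry: new question types are added by registering another function, with no change to the rest of the pipeline. The registry pattern and function names here are assumptions for illustration, not ProVision's actual API:

```python
# Hypothetical sketch of pluggable Q&A generators; the registry
# pattern is an illustrative assumption, not ProVision's actual API.

GENERATORS = []

def register(fn):
    """Decorator that adds a Q&A generator to the pipeline."""
    GENERATORS.append(fn)
    return fn

@register
def color_qa(graph):
    """Ask about color attributes annotated on each object."""
    colors = {"red", "blue", "green"}
    return [
        (f"What color is the {name}?", attr)
        for name, info in graph["objects"].items()
        for attr in info["attributes"]
        if attr in colors
    ]

def generate_all(graph):
    """Run every registered generator over one scene graph."""
    return [pair for gen in GENERATORS for pair in gen(graph)]

graph = {
    "objects": {"building": {"attributes": ["red"]}},
    "relations": [],
}
print(generate_all(graph))
```

Adding a new question type is then just another decorated function, which is how a framework of this shape could grow from a handful of generators to the 24 single-image and 14 multi-image generators described below.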
ProVision-10M: A Large-Scale Multimodal Dataset
Our framework integrates 24 single-image and 14 multi-image instruction generators, producing over 10 million high-quality Q&A pairs—publicly released as ProVision-10M.
Performance Improvements in Fine-Tuning MLMs
We evaluated ProVision-10M by incorporating it into:
- LLaVA-1.5 (single-image tasks)
- Mantis-SigLIP-8B (multi-image tasks)
Results:
- Manually annotated scene graphs yielded the highest performance gains.
- When used in both pretraining and fine-tuning of xGen-MM-4B (BLIP-3), ProVision data improved performance by 1.6% on average across 11 benchmarks, outperforming baselines without our data.
Future Directions
ProVision opens new possibilities for scalable, high-quality multimodal training data. Future work could:
- Expand data generators to cover new question types.
- Improve scene graph generation for higher accuracy.
- Extend to video-based instruction data and other modalities.
By enabling systematic, rule-based instruction synthesis, ProVision provides a cost-effective, transparent, and scalable alternative to traditional data generation methods—helping advance the next generation of multimodal AI.