ProVision: Programmatic Generation of Multimodal Instruction Data for Enhanced Model Training
The Challenge of Multimodal Instruction Data
Recent advances in multimodal language models (MLMs) like GPT-4V and BLIP have enabled sophisticated image-based reasoning, such as answering complex queries like “How many students are raising their hands in this image?” However, training these models requires high-quality instruction data—visual content paired with corresponding questions and answers—which is difficult to generate at scale.
Existing approaches face key limitations:
- Manual annotation is expensive and time-consuming.
- Proprietary models used for synthetic data generation are costly, prone to hallucinations, and struggle with scalability, interpretability, and factual accuracy.
Introducing ProVision: A Scalable, Programmatic Solution
To address these challenges, we developed ProVision, a framework that automatically synthesizes multimodal instruction data using scene graphs and human-written Python programs.
How It Works:
- Scene Graph Representation – Each image is parsed into a structured scene graph, where objects and attributes are nodes, and edges define their relationships.
- Programmatic Data Generation – Using Python-based generators and text templates, ProVision systematically produces diverse, accurate Q&A pairs from scene graphs.
- Example: For an image of a busy street, ProVision can generate:
- “What is the relationship between the pedestrian and the car?”
- “Which object is closer to the red building: the car or the pedestrian?”
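The pipeline above can be illustrated with a minimal sketch. The scene-graph schema and generator function below are simplified assumptions for illustration, not ProVision's actual data structures or released code:

```python
# Hypothetical sketch of scene-graph-based Q&A generation.
# The schema and template are illustrative assumptions, not the
# released ProVision implementation.

# A minimal scene graph: objects with attributes, plus relation triples.
scene_graph = {
    "objects": {
        "pedestrian": {"attributes": []},
        "car": {"attributes": ["parked"]},
        "building": {"attributes": ["red"]},
    },
    "relations": [
        ("pedestrian", "next to", "car"),
    ],
}

def relation_qa(graph):
    """Turn each relation triple into a templated Q&A pair."""
    pairs = []
    for subj, rel, obj in graph["relations"]:
        question = f"What is the relationship between the {subj} and the {obj}?"
        answer = f"The {subj} is {rel} the {obj}."
        pairs.append((question, answer))
    return pairs

for q, a in relation_qa(scene_graph):
    print(q)
    print(a)
```

Because every answer is read directly out of the graph by a deterministic rule, the Q&A pairs are correct by construction whenever the scene graph itself is accurate.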
Key Advantages Over Traditional Methods:
✔ Interpretability – Rule-based generation ensures factual correctness.
✔ Scalability – New data generators can be added to expand question types.
✔ Flexibility – Works with both annotated and automatically generated scene graphs.
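The scalability advantage can be sketched with a simple generator registry: new question types are added by registering another function, with no change to the rest of the pipeline. The registry pattern and function names here are assumptions for illustration, not ProVision's actual API:

```python
# Hypothetical sketch of pluggable Q&A generators; the registry
# pattern is an illustrative assumption, not ProVision's actual API.

GENERATORS = []

def register(fn):
    """Decorator that adds a Q&A generator to the pipeline."""
    GENERATORS.append(fn)
    return fn

@register
def color_qa(graph):
    """Ask about color attributes annotated on each object."""
    colors = {"red", "blue", "green"}
    return [
        (f"What color is the {name}?", attr)
        for name, info in graph["objects"].items()
        for attr in info["attributes"]
        if attr in colors
    ]

def generate_all(graph):
    """Run every registered generator over one scene graph."""
    return [pair for gen in GENERATORS for pair in gen(graph)]

graph = {
    "objects": {"building": {"attributes": ["red"]}},
    "relations": [],
}
print(generate_all(graph))
```

Adding a new question type is then just another decorated function, which is how a framework of this shape could grow from a handful of generators to the 24 single-image and 14 multi-image generators described below.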
ProVision-10M: A Large-Scale Multimodal Dataset
Our framework integrates 24 single-image and 14 multi-image instruction generators, producing over 10 million high-quality Q&A pairs—publicly released as ProVision-10M.
Performance Improvements in Fine-Tuning MLMs
We evaluated ProVision-10M by incorporating it into:
- LLaVA-1.5 (single-image tasks)
- Mantis-SigLIP-8B (multi-image tasks)
Results:
- Manually annotated scene graphs yielded the highest performance gains.
- When used in both pretraining and fine-tuning of xGen-MM-4B (BLIP-3), ProVision data improved performance by 1.6% on average across 11 benchmarks, outperforming baselines without our data.
Future Directions
ProVision opens new possibilities for scalable, high-quality multimodal training data. Future work could:
- Expand data generators to cover new question types.
- Improve scene graph generation for higher accuracy.
- Extend to video-based instruction data and other modalities.
By enabling systematic, rule-based instruction synthesis, ProVision provides a cost-effective, transparent, and scalable alternative to traditional data generation methods—helping advance the next generation of multimodal AI.