Description: MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Figure 1. MMWorld covers seven broad disciplines and 69 subdisciplines, focusing on the evaluation of multi-faceted reasoning beyond perception (e.g., explanation, counterfactual thinking, future prediction, domain expertise). On the right is a video sample from the Health & Medicine discipline.
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models": interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit that videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two uni
Figure 2. Comparison between MMWorld and previous benchmarks for real-world video understanding on a variety of criteria. Multi-faceted reasoning includes Explanation (Explain.), Counterfactual Thinking (Counter.), Future Prediction (Future.), and Domain Expertise (Domain.). MMWorld is the first multi-discipline and multitask video understanding benchmark that covers wider reasoning questions, and it also includes first-party data annotations.