Mementos

A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

1 University of Maryland, College Park
2 UNC-Chapel Hill
*Indicates Equal Advising

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs’ sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations or misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs’ sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations.

Leaderboard

Recall, Precision, and F1 scores of Object and Behavior on the validation set of Mementos.

| # | Model | Input type | Source | Date | Avg | Object-Recall | Object-Precision | Object-F1 | Behavior-Recall | Behavior-Precision | Behavior-F1 |
|---|-------|------------|--------|------|-----|---------------|------------------|-----------|-----------------|--------------------|-------------|
| 1 | GPT-4V 🥇 | Sequential | Link | 2024-01-20 | 45.68 | 60.24 | 54.13 | 55.36 | 42.36 | 29.40 | 32.58 |
| 2 | Gemini 🥈 | Sequential | Link | 2024-01-20 | 33.98 | 38.36 | 43.12 | 38.91 | 26.28 | 31.01 | 26.18 |
| 3 | LLaVA-1.5 🥉 | Combined | Link | 2024-01-20 | 32.78 | 36.90 | 46.14 | 39.29 | 22.09 | 29.22 | 23.01 |
| 4 | Chat-UniVi | Sequential | Link | 2024-01-20 | 31.69 | 39.09 | 38.26 | 37.06 | 25.36 | 26.67 | 23.74 |
| 5 | Gemini | Combined | Link | 2024-01-20 | 30.44 | 33.28 | 39.47 | 34.42 | 26.76 | 25.38 | 23.33 |
| 6 | GPT-4V | Combined | Link | 2024-01-20 | 30.13 | 35.41 | 36.34 | 34.46 | 30.70 | 20.82 | 23.07 |
| 7 | mPLUG_Owl-v2 | Combined | Link | 2024-01-20 | 28.26 | 28.51 | 40.65 | 32.20 | 19.74 | 27.81 | 20.64 |
| 8 | InstructBLIP | Combined | Link | 2024-01-20 | 27.10 | 27.37 | 33.86 | 28.77 | 23.98 | 25.69 | 22.92 |
| 9 | Chat-UniVi | Combined | Link | 2024-01-20 | 25.67 | 30.14 | 32.24 | 29.86 | 20.32 | 21.97 | 19.52 |
| 10 | Video-LLaMA-2 | Sequential | Link | 2024-01-20 | 21.13 | 25.59 | 23.50 | 23.35 | 16.21 | 21.47 | 16.62 |
| 11 | MiniGPT4 | Combined | Link | 2024-01-20 | 18.73 | 25.33 | 17.95 | 20.01 | 16.02 | 17.82 | 15.26 |
| 12 | MiniGPT5 | Combined | Link | 2024-01-20 | 18.28 | 24.58 | 17.69 | 19.44 | 15.04 | 17.93 | 15.02 |

💡 Sequential means frames from the image sequence are input sequentially for reasoning.

💡 Combined means all frames from an image sequence are combined into one composite image that is used as the MLLM input (a minimal sketch of this composition is shown below).
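The sketch below illustrates one way to build such a composite image, assuming each frame of a sequence is stored as a separate image file. The `combine_frames` helper and the fixed tile height are our own assumptions, not the benchmark's exact preprocessing.

```python
# A minimal sketch (not the authors' exact preprocessing) of the "Combined"
# input format: paste all frames of a sequence side by side into one image.
from PIL import Image

def combine_frames(frame_paths, tile_height=448):
    """Resize each frame to a common height and concatenate horizontally."""
    frames = []
    for path in frame_paths:
        img = Image.open(path).convert("RGB")
        new_width = int(img.width * tile_height / img.height)
        frames.append(img.resize((new_width, tile_height)))
    composite = Image.new("RGB", (sum(f.width for f in frames), tile_height))
    x_offset = 0
    for f in frames:
        composite.paste(f, (x_offset, 0))
        x_offset += f.width
    return composite

# Example (hypothetical file names):
# combined = combine_frames(["seq01_frame0.png", "seq01_frame1.png", "seq01_frame2.png"])
```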

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to Evaluation.

Mementos Dataset

Overview

Mementos is a comprehensive benchmark designed to evaluate the reasoning capability of Multimodal Large Language Models (MLLMs) over image sequences. It includes 4,761 image sequences of varying lengths. The image sequences in Mementos are categorized into three domains: Daily-life, Robotics, and Comics. This diverse collection is crucial for evaluating the comprehensive time-varying reasoning abilities of MLLMs. Specifically, the robotics data, closely associated with embodied AI or real-world contexts, and the comic-style storyboard data, rich in stylistic and episodic diversity, significantly enhance the benchmark’s relevance and robustness.

Examples of hallucinations by GPT-4V in three domains on Mementos: Daily-life, Robotics, and Comics.

Detailed evaluation results of different MLLMs on different domains of Mementos.

Dataset Statistics

All the data are divided into training and validation sets.

  • training: 4,062 image sequences used for MLLM training or finetuning.
  • validation: 699 image sequences for evaluation.
You can download the dataset from the Dataset page.

GPT-4-assisted Evaluation

We employ a GPT-4-assisted evaluation procedure: after an MLLM produces a description for an image sequence, we extract behavior and object keywords from both AI-generated and human-annotated descriptions using GPT-4, then use keyword matching to assess the degree of behavioral and object hallucinations.
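The sketch below illustrates the matching step under simplifying assumptions: the object or behavior keyword lists have already been extracted by GPT-4 from the model-generated and human-annotated descriptions, and matching is exact set intersection. The `match_scores` helper is hypothetical, not the authors' exact implementation.

```python
# A minimal sketch of keyword matching for one description pair. Keywords are
# assumed to be already extracted (e.g., by GPT-4) and normalized; exact string
# matching is a simplifying assumption.
def match_scores(predicted_keywords, reference_keywords):
    """Return (recall, precision, F1) between predicted and reference keywords."""
    pred, ref = set(predicted_keywords), set(reference_keywords)
    hits = len(pred & ref)
    recall = hits / len(ref) if ref else 0.0
    precision = hits / len(pred) if pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return recall, precision, f1

# Example with hypothetical object keywords for one image sequence:
# r, p, f1 = match_scores(["dog", "ball", "car"], ["dog", "ball", "park"])
```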

Evaluation Results on Existing MLLMs

GPT-4V with sequential input demonstrates the best reasoning capability over image sequences among all evaluated MLLMs. Among open-source models, LLaVA-1.5 performs best, nearly matching or even surpassing the black-box model Gemini in object comprehension, but its ability to infer behaviors from image sequences is weaker than that of Gemini and GPT-4V.

Citation

If you find our work useful, please consider citing the paper as follows:


@misc{wang2024mementos,
      title={Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences}, 
      author={Xiyao Wang and Yuhang Zhou and Xiaoyu Liu and Hongjin Lu and Yuancheng Xu and Feihong He and Jaehong Yoon and Taixi Lu and Gedas Bertasius and Mohit Bansal and Huaxiu Yao and Furong Huang},
      year={2024},
      eprint={2401.10529},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
      }