VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

Introduction

Recent advancements in Multimodal Large Language Models (MLLMs) have extended their capabilities to video understanding. Yet these models are often plagued by "hallucinations", where irrelevant or nonsensical content is generated, deviating from the actual video context. This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types, intrinsic and extrinsic, with further subcategories for detailed analysis: object-relation, temporal, semantic detail, extrinsic factual, and extrinsic non-factual hallucinations. We adopt an adversarial binary VideoQA method for comprehensive evaluation, in which pairs of basic and hallucinated questions are crafted strategically. By evaluating eleven LVLMs on VideoHallucer, we reveal that (i) the majority of current models exhibit significant issues with hallucinations; (ii) while scaling datasets and parameters improves models' ability to detect basic visual cues and counterfactuals, it provides limited benefit for detecting extrinsic factual hallucinations; (iii) existing models are more adept at detecting facts than identifying hallucinations. As a byproduct, these analyses inform the development of our self-PEP framework, which achieves an average improvement of 5.38% in hallucination resistance across all model architectures.

Leaderboard

Accuracy scores on VideoHallucer.

#	Model	LLM	Frames	Date	Yes/No Bias: Pct. Diff (∼ 0)	Yes/No Bias: FP Ratio (∼ 0.5)	Accuracy: Basic	Accuracy: Hallucinated	Accuracy: Overall
1	Human	-	-	2024-05-25	0.02	0.42	90	88.8	85
2	GPT-4o	-	16	2024-06-18	-0.02	0.43	75.1	74.2	53.3
3	PLLaVA-34B	Yi-34B	16	2024-05-25	0.18	0.78	90.8	50.8	45
4	PLLaVA-13B	Vicuna-13B-1.5	16	2024-05-25	0.17	0.72	87.5	48.6	41.2
5	PLLaVA	Vicuna-7B-1.5	16	2024-05-25	0.06	0.53	75.1	55.5	38.1
6	Gemini 1.5 Pro	-	1 fps (-128)	2024-05-25	0.15	0.62	83.6	42.3	37.8
7	LLaMA-VID-13B	Vicuna-13B-v1.5	1 fps	2024-05-25	0.21	0.72	85.2	36.9	29.2
8	LLaVA-NeXT-Video-DPO-34B	Yi-34B	4	2024-05-25	0.07	0.55	73.6	51.6	32.3
9	LLaVA-NeXT-Video-DPO	Vicuna-7B-1.5	4	2024-05-25	-0.04	0.40	62.5	60.9	32.0
10	MiniGPT4-Video	Mistral-7B	(-45)	2024-05-25	0.18	0.62	79.4	28.6	22.3
11	LLaMA-VID	Vicuna-7B-v1.5	1 fps	2024-05-25	0.29	0.83	89.9	26.6	21
12	VideoLaVIT	LLaMA2-7B	16	2024-05-25	0.36	0.91	94.9	21.3	18.9
13	Video-LLaVA	Vicuna-7B-v1.5	8	2024-05-25	0.36	0.91	95.1	20.3	17.8
14	ShareGPT4Video	LLaMA3-8B	16	2024-06-25	0.31	0.79	88.5	20.0	15.8
15	Video-LLaMA2	LLaMA2-7B	8	2024-05-25	0.36	0.84	90.9	12.7	10
16	VideoChat2	Vicuna-7B-v0	4	2024-05-25	-0.24	0.15	29.7	25.8	7.8
17	VideoChatGPT	LLaMA-7B	100	2024-05-25	0.40	0.89	92.8	10.4	6.4
18	Video-LLaMA2-13B	LLaMA2-13B	8	2024-05-25	0.36	0.79	88.3	3.8	3.3
19	Valley2	LLaMA2-7B	8	2024-05-25	-0.07	0.29	44.4	11.5	2.8
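The leaderboard columns can be illustrated with a minimal sketch. The metric definitions here are assumptions, since this page does not spell them out: "Overall" is taken to mean that both questions of a basic/hallucinated pair are answered correctly, "Pct. Diff" is the share of "yes" predictions minus the share of "yes" ground truths, and "FP Ratio" is the fraction of wrong answers that were a wrong "yes". The `PairResult` type, the `score` function, and the default ground truths ("yes" for basic, "no" for hallucinated) are hypothetical names and choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """Model answers for one basic/hallucinated question pair (hypothetical type)."""
    basic_pred: str          # model answer to the basic question
    halluc_pred: str         # model answer to the hallucinated question
    basic_gt: str = "yes"    # assumed ground truth for basic questions
    halluc_gt: str = "no"    # assumed ground truth for hallucinated questions


def score(pairs: List[PairResult]) -> Dict[str, float]:
    """Compute leaderboard-style metrics under the assumptions stated above."""
    if not pairs:
        raise ValueError("need at least one question pair")
    n = len(pairs)
    basic_ok = [p.basic_pred == p.basic_gt for p in pairs]
    halluc_ok = [p.halluc_pred == p.halluc_gt for p in pairs]

    # Flatten pairs into individual (prediction, ground truth) answers
    # so the yes/no bias metrics are computed over every question.
    answers = [(p.basic_pred, p.basic_gt) for p in pairs]
    answers += [(p.halluc_pred, p.halluc_gt) for p in pairs]
    yes_pred = sum(pred == "yes" for pred, _ in answers)
    yes_gt = sum(gt == "yes" for _, gt in answers)
    fp = sum(pred == "yes" and gt == "no" for pred, gt in answers)
    fn = sum(pred == "no" and gt == "yes" for pred, gt in answers)

    return {
        "basic": sum(basic_ok) / n,
        "hallucinated": sum(halluc_ok) / n,
        # A pair counts toward overall accuracy only if both answers are right.
        "overall": sum(b and h for b, h in zip(basic_ok, halluc_ok)) / n,
        "pct_diff": (yes_pred - yes_gt) / len(answers),
        # With no errors at all, report the unbiased value 0.5.
        "fp_ratio": fp / (fp + fn) if fp + fn else 0.5,
    }
```

Under these assumptions, a model that answers "yes" to nearly everything scores high on Basic, low on Hallucinated, and shows a positive Pct. Diff with an FP Ratio close to 1, which matches the pattern of several rows in the table. Because Overall requires both answers in a pair to be correct, it can never exceed the lower of the Basic and Hallucinated scores.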

Accuracy scores on the different VideoHallucer settings.

VideoHallucer Dataset

Statistics

Statistic	Object-Relation	Temporal	Semantic-Detail	Extrinsic-Factual	Extrinsic-Nonfactual	All
Questions	200	200	200	200	200	1800
Videos	183	165	400	200	200	948
Avg. Question Length	23.7	69.2	27.3	92.3	94.7	61.4
Avg. Video Length (s)	7.0	33.8	13.5	187.0	187.0	85.6

Data Samples

Object Hallucination

Relation Hallucination (spatial)

Relation Hallucination (action)

Temporal Hallucination (absolute)

Temporal Hallucination (relative)

Semantic Detail Hallucination (attribution)

Semantic Detail Hallucination (camera)

Semantic Detail Hallucination (count)

Semantic Detail Hallucination (event)

Semantic Detail Hallucination (OCR)

Semantic Detail Hallucination (scene)

Extrinsic Factual Hallucination (instruction)

Extrinsic Factual Hallucination (course)

Extrinsic Non-factual Hallucination (instruction)

Extrinsic Non-factual Hallucination (course)

Experiment Results

Correlation of Accuracy on VideoHallucer and Coverage Score. We find that overall accuracy on VideoHallucer correlates positively with the caption-based Coverage score.

Comparison of Hallucination Detection and Fact Detection Results for Extrinsic Hallucination. We find that existing models are better at fact detection than at hallucination detection.

Results of the Self-explain Strategy on Extrinsic Factual Hallucination. With an additional explanation step, most models hallucinate less often.

Comparison between Video-language Models and Image-language Models. We highlight the top-1 model by accuracy on VideoHallucer. We find that image-language models perform better at detecting object-relation hallucinations, even though the dataset includes dynamic interactions.

BibTeX

@article{videohallucer,
  title={VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models},
  author={Wang, Yuxuan and Wang, Yueqian and Zhao, Dongyan and Xie, Cihang and Zheng, Zilong},
  journal={arxiv},
  year={2024}
}
