Arrow Research Search

Author name cluster

Zheqi He

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
1 author row

Possible papers (3)

NeurIPS 2025 · Conference Paper

Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

  • Xuannan Liu
  • Zekun Li
  • Zheqi He
  • Peipei Li
  • Shuhan Xia
  • Xing Cui
  • Huaibo Huang
  • Xi Yang

The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
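The abstract describes a judge-based metric (RJScore) that combines a judge model's confidence with a calibrated decision threshold, and reports attack success rates over the judged responses. A minimal sketch of that general shape, assuming hypothetical names (`rjscore_decisions`, `attack_success_rate`, a scalar per-response confidence) that are not from the paper:

```python
# Hypothetical sketch of a confidence-plus-threshold judging metric in the
# spirit of RJScore; the real method's judge prompting and calibration
# procedure are not specified here. All names are illustrative assumptions.

def rjscore_decisions(judge_confidences, threshold=0.5):
    """Map per-response judge confidences (probability the output is harmful)
    to binary harmful/safe decisions via a calibrated threshold."""
    return [c >= threshold for c in judge_confidences]

def attack_success_rate(decisions):
    """Fraction of responses judged harmful, i.e. successful attacks."""
    return sum(decisions) / len(decisions)

confs = [0.9, 0.2, 0.7, 0.4]          # toy judge confidences
decisions = rjscore_decisions(confs)   # [True, False, True, False]
print(attack_success_rate(decisions))  # → 0.5
```

In practice the threshold would be fit so that the judge's binary decisions agree with human annotations, which is what "human-aligned decision threshold calibration" suggests.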

IJCAI 2024 · Conference Paper

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

  • Zheqi He
  • Xinya Wu
  • Pengfei Zhou
  • Richeng Xuan
  • Guang Liu
  • Xi Yang
  • Qiannan Zhu
  • Hua Huang

Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT-4V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.
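The abstract says Positional Error Variance quantifies position bias in multiple-choice answering. One plausible reading is to measure accuracy with the correct option rotated through each answer slot and take the variance across slots; the sketch below assumes that reading, and the function name and data layout are illustrative, not the paper's:

```python
# Hypothetical position-bias probe in the spirit of "Positional Error
# Variance": re-ask each multiple-choice question with the correct answer
# placed in every option slot (A, B, C, D), record per-slot accuracy, and
# report the variance. Zero variance would mean no position bias.
from statistics import pvariance

def positional_error_variance(accuracy_by_slot):
    """Population variance of per-slot accuracies."""
    return pvariance(accuracy_by_slot)

# Toy per-slot accuracies: the model favors earlier options.
acc = [0.80, 0.72, 0.65, 0.55]
print(positional_error_variance(acc))  # ≈ 0.00845
```

A model with a strong "always pick A" tendency would show high accuracy for slot A only, and hence a large variance.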

AAAI 2017 · Short Paper

SReN: Shape Regression Network for Comic Storyboard Extraction

  • Zheqi He
  • Yafeng Zhou
  • Yongtao Wang
  • Zhi Tang

The goal of storyboard extraction is to decompose a comic image into storyboards, which is a fundamental step in comic image understanding and in producing digital comic documents suitable for mobile reading. Most existing approaches are based on hand-crafted low-level visual patterns such as edge segments and line segments, which do not capture high-level vision information. To overcome this drawback, we propose a novel architecture based on a deep convolutional neural network, named Shape Regression Network (SReN), to detect storyboards within comic images. Firstly, we use Fast R-CNN to generate rectangular bounding boxes as storyboard proposals. Then we train a deep neural network to predict quadrangles for these proposals. Unlike existing object detection methods, which only output rectangular bounding boxes, SReN can produce more precise quadrangle bounding boxes. Experimental results on 7,382 comic pages demonstrate that SReN outperforms the state-of-the-art methods by more than 10% in terms of F1 score and page correction rate.