Arrow Research

Author name cluster

Zhonghong Ou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

AAAI Conference 2026 · Conference Paper

CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

  • Kaiwen Xue
  • Chenglong Li
  • Zhonghong Ou
  • Guoxin Zhang
  • Kaoyan Lu
  • Shuai Lyu
  • Yifan Zhu
  • Ping Zong

Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity in a way that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering multiple dimensions, from creative idea to process to product; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedback annotations, and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine the human feedback, activating stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
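
The GPT-based feedback-refinement step described in the abstract can be sketched as below. This is a minimal illustration: `call_llm`, `FeedbackRecord`, and the prompt wording are all hypothetical stand-ins, not the paper's published pipeline.

```python
# Illustrative sketch of refining raw human feedback into tuning instructions.
# `call_llm` and `FeedbackRecord` are hypothetical; the actual CreMIT
# prompts and schema are not given in this abstract.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    artifact: str        # the creative idea, process, or product
    human_feedback: str  # raw annotator comment on its creativity

REFINE_PROMPT = (
    "Rewrite the following human feedback on a creative work as an "
    "instruction-response pair for training a creativity evaluator.\n"
    "Work: {artifact}\nFeedback: {human_feedback}"
)

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call (e.g., to GPT)."""
    return "Instruction: ...\nResponse: ..."  # placeholder output

def build_instruction_data(records):
    """Refine each feedback record into a multi-typed tuning instruction."""
    return [call_llm(REFINE_PROMPT.format(artifact=r.artifact,
                                          human_feedback=r.human_feedback))
            for r in records]
```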

AAAI Conference 2026 · Conference Paper

Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

  • Jiangfeng Sun
  • Sihao He
  • Zhonghong Ou
  • Meina Song

Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multi-view contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely-used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU’s interpretability and its ability to capture nuanced emotional patterns through semantically-grounded interactions.
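
The semantic-anchor idea can be pictured as a contrastive objective that pulls each modality's embedding toward an anchor pooled from global text semantics. The PyTorch sketch below is a minimal illustration under assumed shapes; the names and the plain InfoNCE form are assumptions, and the paper's multi-view objective is richer than this.

```python
# Sketch of anchor-based cross-modal contrastive alignment (illustrative).
# Matched pairs share a batch index; all other pairs act as negatives.
import torch
import torch.nn.functional as F

def anchor_infonce(anchor, modality, temperature=0.07):
    """anchor: (B, D) global text semantics; modality: (B, D) one modality."""
    a = F.normalize(anchor, dim=-1)
    m = F.normalize(modality, dim=-1)
    logits = a @ m.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(a.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

B, D = 8, 256
anchor = torch.randn(B, D)                # pooled textual semantics
loss = sum(anchor_infonce(anchor, torch.randn(B, D))  # acoustic, visual views
           for _ in range(2))
```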

ICML Conference 2025 · Conference Paper

Efficient Robotic Policy Learning via Latent Space Backward Planning

  • Dongxiu Liu
  • Haoyi Niu
  • Zhihao Wang
  • Jinliang Zheng
  • Yinan Zheng
  • Zhonghong Ou
  • Jianming Hu
  • Jianxiong Li

Current robotic planning methods often rely on predicting multi-frame images with full pixel details. While this fine-grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real-time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse-grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off-task predictions due to accumulation errors, leading to misalignment with long-term goals. This raises a critical question: Can robotic planning be both efficient and accurate enough for real-time control in long-horizon, multi-stage tasks? To address this, we propose a Backward Planning scheme in Latent space (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on-task prediction along the entire planning horizon. The subgoal-conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real-robot long-horizon experiments, we show that LBP outperforms existing fine-grained and forward planning methods, achieving SOTA performance. Project Page: https://lbp-authors.github.io
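
The core recursion is simple to picture: ground the final latent goal, then repeatedly predict a subgoal between the current latent state and the most recently planned subgoal. The sketch below is a minimal illustration with assumed names and a stub predictor, not the authors' implementation.

```python
# Sketch of backward subgoal planning in latent space (illustrative).
import torch
import torch.nn as nn

class MidpointPredictor(nn.Module):
    """Stub for a learned predictor of a subgoal between state and goal."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
    def forward(self, z_now, z_goal):
        return self.net(torch.cat([z_now, z_goal], dim=-1))

def backward_plan(z_now, z_final, predictor, depth=3):
    """Recursively predict subgoals from the final goal back toward z_now;
    returns them ordered from nearest in time to the grounded final goal."""
    subgoals = [z_final]
    for _ in range(depth):
        subgoals.append(predictor(z_now, subgoals[-1]))  # step toward z_now
    return list(reversed(subgoals))

dim = 64
plan = backward_plan(torch.randn(1, dim), torch.randn(1, dim),
                     MidpointPredictor(dim))
```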

AAAI Conference 2025 · Conference Paper

LS-TGNN: Long and Short-Term Temporal Graph Neural Network for Session-Based Recommendation

  • Zhonghong Ou
  • Xiao Zhang
  • Yifan Zhu
  • Shuai Lyu
  • Jiahao Liu
  • Tu Ao

Session-Based Recommendation (SBR) based on Graph Neural Networks (GNNs) has become a new paradigm for recommender systems, and plays a fundamental role in e-commerce and other relevant domains. Existing graph aggregation methods primarily form node representations by capturing basic relationships between neighboring and central nodes. Despite their encouraging results, the global relationships of items and user intentions within sessions typically change over time, which degrades the effectiveness of existing embedding schemes. To resolve this challenge, we propose a Long and Short-Term Temporal Graph Neural Network (LS-TGNN) for SBR. LS-TGNN employs a novel temporal session graph to aggregate neighborhood information, and models user interests from both long and short-term perspectives. Specifically, we design long-term and short-term encoders to model the long and short-term interests of users, respectively. To better model the interests of users across different time dimensions, we introduce an item-granularity method that distinguishes between long and short-term interests. Extensive experiments on three widely used datasets demonstrate that LS-TGNN outperforms existing methods by a large margin.
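
The long/short-term split can be illustrated by encoding the whole session for long-term interest and only the most recent items for short-term intent, then fusing the two. The sketch below uses assumed GRU encoders and a fixed split point; the paper's temporal session graph and item-granularity mechanism are not reproduced here.

```python
# Illustrative long- and short-term session encoders (not LS-TGNN itself).
import torch
import torch.nn as nn

class LongShortEncoder(nn.Module):
    def __init__(self, n_items, dim, short_len=3):
        super().__init__()
        self.emb = nn.Embedding(n_items, dim)
        self.long_enc = nn.GRU(dim, dim, batch_first=True)   # whole session
        self.short_enc = nn.GRU(dim, dim, batch_first=True)  # recent items
        self.short_len = short_len
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, session):                   # session: (B, T) item ids
        x = self.emb(session)
        _, h_long = self.long_enc(x)               # long-term interest
        _, h_short = self.short_enc(x[:, -self.short_len:])  # short-term intent
        return self.fuse(torch.cat([h_long[-1], h_short[-1]], dim=-1))

model = LongShortEncoder(n_items=1000, dim=32)
scores = model(torch.randint(0, 1000, (4, 10))) @ model.emb.weight.t()
```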

IJCAI Conference 2025 · Conference Paper

Towards Recognizing Spatial-temporal Collaboration of EEG Phase Brain Networks for Emotion Understanding

  • Jiangfeng Sun
  • Kaiwen Xue
  • Qika Lin
  • Yufei Qiao
  • Yifan Zhu
  • Zhonghong Ou
  • Meina Song

Emotion recognition from EEG signals is crucial for understanding complex brain dynamics. Existing methods typically rely on static frequency bands and graph convolutional networks (GCNs) to model brain connectivity. However, EEG signals are inherently non-stationary and exhibit substantial individual variability, making static-band approaches inadequate for capturing their dynamic properties. Moreover, spatial-temporal dependencies in EEG often lead to feature degradation during node aggregation, ultimately limiting recognition performance. To address these challenges, we propose the Spatial-Temporal Electroencephalograph Collaboration framework (Stella). Our approach introduces an Adaptive Bands Selection module (ABS) that dynamically extracts low- and high-frequency components, generating dual-path features comprising phase brain networks for connectivity modeling and time-series representations for local dynamics. To further mitigate feature degradation, the Fourier Graph Operator (FGO) operates in the spectral domain, while the Spatial-Temporal Encoder (STE) enhances representation stability and density. Extensive experiments on benchmark EEG datasets demonstrate that Stella achieves state-of-the-art performance in emotion recognition, offering valuable insights for graph-based modeling of non-stationary neural signals. The code is available at https://github.com/sun2017bupt/EEGBrainNetwork.
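
The dual-path front end can be pictured as splitting each EEG window into low- and high-frequency components. The sketch below uses a fixed FFT cutoff purely for illustration; the paper's ABS module selects bands adaptively, and the phase-network and encoder stages are not shown.

```python
# Illustrative low/high band split of EEG via FFT (fixed cutoff, not ABS).
import numpy as np

def band_split(x, fs, cutoff_hz=13.0):
    """x: (channels, samples) EEG; returns (low, high) time-domain paths."""
    spec = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    low = np.fft.irfft(np.where(freqs <= cutoff_hz, spec, 0), n=x.shape[-1])
    high = np.fft.irfft(np.where(freqs > cutoff_hz, spec, 0), n=x.shape[-1])
    return low, high

eeg = np.random.randn(62, 256)        # 62 channels, 1 s at 256 Hz
low, high = band_split(eeg, fs=256)
phase = np.angle(np.fft.rfft(low))    # phase features for brain-network graphs
```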

AAAI Conference 2025 · Conference Paper

TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval

  • Shuai Lyu
  • Zijing Tian
  • Zhonghong Ou
  • Yifan Zhu
  • Xiao Zhang
  • Qiankun Ha
  • Haoran Luo
  • Meina Song

Cross-modal retrieval maps data across different modalities via semantic relevance. Existing approaches implicitly assume that data pairs are well-aligned and ignore widely existing annotation noise, i.e., noisy correspondence (NC), which inevitably causes performance degradation. Despite attempts that employ the co-teaching paradigm with identical architectures to provide distinct data perspectives, the differences between these architectures primarily stem from random initialization, so the models become increasingly homogeneous as training progresses. Consequently, the additional information brought by this paradigm is severely limited. To resolve this problem, we introduce Tripartite Learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval. We design a tripartite cooperative learning mechanism comprising a Coordinator, a Master, and an Assistant model. The Coordinator distributes data, and the Assistant model supports the Master model's noisy label prediction with diverse data. Moreover, we introduce a soft label estimation method based on mutual information variation, which quantifies the noise in new samples and assigns corresponding soft labels. We also present a new loss function to enhance robustness and optimize training effectiveness. Extensive experiments on three widely used datasets demonstrate that, even as the noise ratio increases, TSVC exhibits significant advantages in retrieval accuracy and maintains stable training performance.
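
The tripartite division of labor can be sketched structurally: a Coordinator hands disjoint data slices to the Master and Assistant, whose agreement is blended into a soft label per pair. Everything below is an assumed stand-in; in particular, the paper derives soft labels from mutual information variation, which is only stubbed here.

```python
# Structural sketch of the tripartite loop (illustrative stand-ins only).
import torch

def coordinator_split(batch, ratio=0.5):
    """Coordinator: hand disjoint slices to the Master and the Assistant."""
    k = int(len(batch) * ratio)
    return batch[:k], batch[k:]

def soft_label(master_sim, assistant_sim):
    """Blend two models' pair similarities into a soft label in [0, 1].
    TSVC instead derives this weight from mutual information variation."""
    return 0.5 * (torch.sigmoid(master_sim) + torch.sigmoid(assistant_sim))

sims = torch.randn(16)                        # stand-in image-text similarities
master_part, assistant_part = coordinator_split(sims)
labels = soft_label(master_part, assistant_part[:len(master_part)])
```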