Arrow Research search

Author name cluster

Hao Fei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

37 papers
1 author row

Possible papers (37)

AAAI Conference 2026 Conference Paper

DragNeXt: Rethinking Drag-Based Image Editing

  • Yuan Zhou
  • Junbao Zhou
  • Qingshan Xu
  • Kesen Zhao
  • Yuxuan Wang
  • Hao Fei
  • Richang Hong
  • Hanwang Zhang

Drag-Based Image Editing (DBIE), which allows users to manipulate images by directly dragging objects within them, has recently attracted much attention from the community. However, it faces two key challenges: (i) point-based drag is often highly ambiguous and difficult to align with user intentions; (ii) current DBIE methods primarily rely on alternating between motion supervision and point tracking, which is not only cumbersome but also fails to produce high-quality results. These limitations motivate us to explore DBIE from a new perspective---unifying it as a Latent Region Optimization (LRO) problem that aims to use region-level geometric transformations to optimize latent code to realize drag manipulation. Thus, by specifying the areas and types of geometric transformations, we can effectively address the ambiguity issue. We also propose a simple yet effective editing framework, dubbed DragNeXt. It solves LRO through Progressive Backward Self-Intervention (PBSI), simplifying the overall procedure of the alternating workflow while further enhancing quality by fully leveraging region-level structure information and progressive guidance from intermediate drag states. We validate DragNeXt on our NextBench, and extensive experiments demonstrate that our proposed method can significantly outperform existing approaches.

AAAI Conference 2026 Conference Paper

Orthogonal Spatial-temporal Distributional Transfer for 4D Generation

  • Wei Liu
  • Shengqiong Wu
  • Bobo Li
  • Haoyu Zhao
  • Hao Fei
  • Mong-Li Lee
  • Wynne Hsu

In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Further, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features for better 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.

AAAI Conference 2025 Conference Paper

Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning

  • Shengqiong Wu
  • Hao Fei
  • Liangming Pan
  • William Yang Wang
  • Shuicheng Yan
  • Tat-Seng Chua

Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various vision-language tasks. However, MLLMs face significant challenges with hallucinations, i.e., misleading outputs that do not align with the input data. While existing efforts have been made to combat MLLM hallucinations, several pivotal challenges are still unsolved. First, while current approaches aggressively focus on addressing errors at the perception level, another important type at the cognition level requiring factual commonsense can be overlooked. In addition, existing methods might fall short in finding a more effective way to represent visual input, which is yet a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and cause hallucinations, while unfortunately, this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements in multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.

AAAI Conference 2025 Conference Paper

CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

  • Zihui Cheng
  • Qiguang Chen
  • Jin Zhang
  • Hao Fei
  • Xiaocheng Feng
  • Wanxiang Che
  • Min Li
  • Libo Qin

Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

AAAI Conference 2025 Conference Paper

Divide-Solve-Combine: An Interpretable and Accurate Prompting Framework for Zero-shot Multi-Intent Detection

  • Libo Qin
  • Qiguang Chen
  • Jingxuan Zhou
  • Jin Wang
  • Hao Fei
  • Wanxiang Che
  • Min Li

Zero-shot multi-intent detection is capable of capturing multiple intents within a single utterance without any training data, and has gained increasing attention. Building on the success of large language models (LLMs), dominant approaches in the literature explore prompting techniques to enable zero-shot multi-intent detection. While significant advancements have been witnessed, the existing prompting approaches still face two major issues: lacking explicit reasoning and lacking interpretability. Therefore, in this paper, we introduce Divide-Solve-Combine Prompting (DSCP) to address the above issues. Specifically, DSCP explicitly decomposes multi-intent detection into three components: (1) single-intent division prompting, which decomposes an input query into distinct sub-sentences, each containing a single intent; (2) intent-by-intent solution prompting, which solves each sub-sentence recurrently; and (3) multi-intent combination prompting, which combines the sub-sentence results to obtain the final multi-intent result. By decomposition, DSCP allows the model to track the explicit reasoning process and improves interpretability. In addition, we propose an interactive divide-solve-combine prompting (Inter-DSCP) to naturally capture the interaction capabilities of large language models. Experimental results on two standard multi-intent benchmarks (i.e., MixATIS and MixSNIPS) reveal that both DSCP and Inter-DSCP obtain substantial improvements over baselines, achieving superior performance and higher interpretability.
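
To make the three-stage decomposition above concrete, here is a minimal illustrative sketch of how the divide, solve, and combine prompts could be chained. The prompt wording and the call_llm callable are placeholder assumptions for demonstration, not the templates used in the paper.

    # Hypothetical sketch of a Divide-Solve-Combine prompting flow; the prompt
    # strings and the `call_llm` interface are illustrative assumptions only.
    def divide_solve_combine(utterance, call_llm):
        # (1) Divide: split the utterance into single-intent sub-sentences.
        division = call_llm(
            "Split the following utterance into sub-sentences, one intent each, "
            "one per line:\n" + utterance
        )
        sub_sentences = [s.strip() for s in division.splitlines() if s.strip()]

        # (2) Solve: label the intent of each sub-sentence independently.
        intents = [
            call_llm("What single intent does this sentence express?\n" + s)
            for s in sub_sentences
        ]

        # (3) Combine: merge the per-sentence intents into the final label set.
        return call_llm(
            "Combine these intents into one deduplicated list: " + "; ".join(intents)
        )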

IJCAI Conference 2025 Conference Paper

Improving Consistency Identification in Task-oriented Dialogue Through Multi-Agent Collaboration

  • Peng Wang
  • Shuo Li
  • Ruoxi Zhou
  • Qiguang Chen
  • Xiao Xu
  • Hao Fei
  • Dagang Li
  • Wanxiang Che

Consistency identification in task-oriented dialog (CI-ToD) typically consists of three sub-tasks: User Query Inconsistency (QI) identification, Dialogue History Inconsistency (HI) identification, and Knowledge Base Inconsistency (KBI) identification, which aim to determine inconsistent relationships between the system response and the user query, dialogue history, and knowledge base. Previous approaches focus on the exploration of deep learning models for CI-ToD. While these models achieve remarkable progress, they still rely on large amounts of labeled data, which is hard to obtain in real-world scenarios. Motivated by this, in this paper, we aim to explore large language models for CI-ToD, which do not require any training data. In addition, we further introduce a multi-agent collaboration framework (MAC-CIToD) to model the interaction across three sub-tasks in CI-ToD, including (1) the Full Connection paradigm, (2) the Cycle Connection paradigm, and (3) the Central Connection paradigm, which effectively builds interaction across QI, HI, and KBI. Experiments on the standard benchmark reveal that our framework achieves superior performance. Additionally, we compare MAC-CIToD with the most advanced trained approaches and find that its zero-shot performance on most metrics even surpasses that of models after training on the CI-ToD dataset.

NeurIPS Conference 2025 Conference Paper

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

  • Kai Liu
  • Jungang Li
  • Yuchong Sun
  • Shengqiong Wu
  • Jianzhang Gao
  • Daoan Zhang
  • Wei Zhang
  • Sheng Jin

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

AAAI Conference 2025 Conference Paper

Multi-Granular Multimodal Clue Fusion for Meme Understanding

  • Li Zheng
  • Hao Fei
  • Ting Dai
  • Zuquan Peng
  • Fei Li
  • Huisheng Ma
  • Chong Teng
  • Donghong Ji

With the continuous emergence of various social media platforms frequently used in daily life, the multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes from various perspectives by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. Despite making progress, limitations persist due to the loss of fine-grained metaphorical visual clues and the neglect of the weak correlation between text and images. To overcome these limitations, we propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU. Firstly, we design an object-level semantic mining module to extract object-level image feature clues, achieving fine-grained feature clue extraction and enhancing the model's ability to capture metaphorical details and semantics. Secondly, we propose a brand-new global-local cross-modal interaction model to address the weak correlation between text and images. This model facilitates effective interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations through a bidirectional cross-modal attention mechanism. Finally, we devise a dual-semantic guided training strategy to enhance the model's understanding and alignment of multimodal representations in the semantic space. Experiments conducted on the widely-used MET-MEME bilingual dataset demonstrate significant improvements over state-of-the-art baselines. Specifically, there is an 8.14% increase in precision for the offensiveness detection task, and respective accuracy enhancements of 3.53%, 3.89%, and 3.52% for the metaphor recognition, sentiment analysis, and intention detection tasks. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MMU.

NeurIPS Conference 2025 Conference Paper

MuSLR: Multimodal Symbolic Logical Reasoning

  • Jundong Xu
  • Hao Fei
  • Yuhui Zhang
  • Liangming Pan
  • Qijun Huang
  • Qian Liu
  • Preslav Nakov
  • Min-Yen Kan

Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1’s Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements.

AAAI Conference 2025 Conference Paper

VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence

  • Hao Li
  • Hao Fei
  • Zechao Hu
  • Zhengwei Yang
  • Zheng Wang

Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model’s social intelligence level. While impressive multiple-choice question (MCQ) accuracy is achieved by current solutions, increasing evidence shows that they are largely, and in some cases entirely, dependent on language modality, overlooking visual context. Additionally, the closed-set nature further prevents the exploration of whether and to what extent the reasoning path behind selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy to provide the model with more relevant visual frames. We then enhance the model’s interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and also credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.

NeurIPS Conference 2025 Conference Paper

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

  • Haidong Xu
  • Guangwei Xu
  • Zhedong Zheng
  • Xiatian Zhu
  • Wei Ji
  • Xiangtai Li
  • Ruijie Guo
  • Meishan Zhang

This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.

NeurIPS Conference 2025 Conference Paper

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

  • Zihui Cheng
  • Qiguang Chen
  • Xiao Xu
  • Jiaqi Wang
  • Weiyun Wang
  • Hao Fei
  • Yidong Wang
  • Alex Jinpeng Wang

Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, depending only on clarity and conciseness of expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers, enabling more advanced visual information transmission. We hope that the visual thoughts can inspire further breakthroughs for future MCoT research.

NeurIPS Conference 2024 Conference Paper

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

  • Mingrui Wu
  • Xinyue Cai
  • Jiayi Ji
  • Jiale Li
  • OuCheng Huang
  • Hao Fei
  • Guannan Jiang
  • Xiaoshuai Sun

In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through learnable latent variable optimization. We observe that attention, as the core module of MLLMs, connects text prompt tokens and visual tokens, ultimately determining the final results. Our approach involves adjusting visual tokens from the MLP output during inference, controlling the attention response to ensure text prompt tokens attend to visual tokens in referring regions. We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point. The results demonstrate that our method exhibits out-of-domain generalization and interpretability.
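
As a rough, self-contained illustration of the training-free idea described above (and not the authors' implementation), the toy sketch below optimizes a learnable latent added to a set of random "visual tokens" so that a single text query's attention mass concentrates on a hypothetical referring region; the dimensions, energy term, and optimizer settings are all assumptions made for demonstration.

    # Toy analogue of latent-variable optimization against an attention-based energy.
    import torch

    torch.manual_seed(0)
    d, n_vis = 32, 16
    text_q = torch.randn(1, d)                      # one text-prompt query vector
    vis_tokens = torch.randn(n_vis, d)              # frozen "visual" tokens
    region = torch.zeros(n_vis); region[3:7] = 1.0  # hypothetical referring-region mask

    latent = torch.zeros(n_vis, d, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=0.1)
    for _ in range(100):
        attn = torch.softmax(text_q @ (vis_tokens + latent).T / d ** 0.5, dim=-1)
        energy = -(attn * region).sum()             # lower energy = more attention inside the region
        opt.zero_grad(); energy.backward(); opt.step()

    print("attention mass inside region:", float((attn * region).sum()))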

AAAI Conference 2024 Conference Paper

Harnessing Holistic Discourse Features and Triadic Interaction for Sentiment Quadruple Extraction in Dialogues

  • Bobo Li
  • Hao Fei
  • Lizi Liao
  • Yu Zhao
  • Fangfang Su
  • Fei Li
  • Donghong Ji

Dialogue Aspect-based Sentiment Quadruple (DiaASQ) is a newly-emergent task aiming to extract the sentiment quadruple (i.e., targets, aspects, opinions, and sentiments) from conversations. While showing promising performance, the prior DiaASQ approach unfortunately falls prey to the key crux of DiaASQ, including insufficient modeling of discourse features, and lacking quadruple extraction, which hinders further task improvement. To this end, we introduce a novel framework that not only capitalizes on comprehensive discourse feature modeling, but also captures the intrinsic interaction for optimal quadruple extraction. On the one hand, drawing upon multiple discourse features, our approach constructs a token-level heterogeneous graph and enhances token interactions through a heterogeneous attention network. We further propose a novel triadic scorer, strengthening weak token relations within a quadruple, thereby enhancing the cohesion of the quadruple extraction. Experimental results on the DiaASQ benchmark showcase that our model significantly outperforms existing baselines across both English and Chinese datasets. Our code is available at https://bit.ly/3v27pqA.

AAAI Conference 2024 Conference Paper

Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction

  • Kangkang Lu
  • Yanhua Yu
  • Hao Fei
  • Xuan Li
  • Zixuan Yang
  • Zirui Guo
  • Meiyu Liang
  • Mengran Yin

In recent years, spectral graph neural networks, characterized by polynomial filters, have garnered increasing attention and have achieved remarkable performance in tasks such as node classification. These models typically assume that eigenvalues for the normalized Laplacian matrix are distinct from each other, thus expecting a polynomial filter to have a high fitting ability. However, this paper empirically observes that normalized Laplacian matrices frequently possess repeated eigenvalues. Moreover, we theoretically establish that the number of distinguishable eigenvalues plays a pivotal role in determining the expressive power of spectral graph neural networks. In light of this observation, we propose an eigenvalue correction strategy that can free polynomial filters from the constraints of repeated eigenvalue inputs. Concretely, the proposed eigenvalue correction strategy enhances the uniform distribution of eigenvalues, thus mitigating repeated eigenvalues, and improving the fitting capacity and expressive power of polynomial filters. Extensive experimental results on both synthetic and real-world datasets demonstrate the superiority of our method.
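
The repeated-eigenvalue phenomenon, and the idea of spreading eigenvalues toward a more uniform distribution before they enter a polynomial filter, can be illustrated with a small NumPy sketch; the blending rule and the weight beta below are illustrative assumptions, not the exact correction strategy proposed in the paper.

    # A 4-cycle graph: its normalized Laplacian has eigenvalue 1 with multiplicity 2.
    import numpy as np

    def normalized_laplacian(adj):
        d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
        return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

    def correct_eigenvalues(eigvals, beta=0.5):
        # Blend the sorted eigenvalues with an equispaced grid on [0, 2] so that
        # repeated values become distinguishable to a polynomial filter.
        uniform = np.linspace(0.0, 2.0, len(eigvals))
        return (1 - beta) * np.sort(eigvals) + beta * uniform

    adj = np.array([[0, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [1, 0, 1, 0]], dtype=float)
    lam, _ = np.linalg.eigh(normalized_laplacian(adj))
    print(np.round(lam, 3))                        # [0. 1. 1. 2.]  <- repeated eigenvalue
    print(np.round(correct_eigenvalues(lam), 3))   # [0. 0.833 1.167 2.]  <- now distinct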

NeurIPS Conference 2024 Conference Paper

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

  • Tao Zhang
  • Xiangtai Li
  • Hao Fei
  • Haobo Yuan
  • Shengqiong Wu
  • Shunping Ji
  • Chen Change Loy
  • Shuicheng Yan

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.

AAAI Conference 2024 Conference Paper

Reverse Multi-Choice Dialogue Commonsense Inference with Graph-of-Thought

  • Li Zheng
  • Hao Fei
  • Fei Li
  • Bobo Li
  • Lizi Liao
  • Donghong Ji
  • Chong Teng

With the proliferation of dialogic data across the Internet, the Dialogue Commonsense Multi-choice Question Answering (DC-MCQ) task has emerged as a response to the challenge of comprehending user queries and intentions. Although prevailing methodologies exhibit effectiveness in addressing single-choice questions, they encounter difficulties in handling multi-choice queries due to the heightened intricacy and informational density. In this paper, inspired by the human cognitive process of progressively excluding options, we propose a three-step Reverse Exclusion Graph-of-Thought (ReX-GoT) framework, including Option Exclusion, Error Analysis, and Combine Information. Specifically, our ReX-GoT mimics human reasoning by gradually excluding irrelevant options and learning the reasons for option errors to choose the optimal path of the GoT and ultimately infer the correct answer. By progressively integrating intricate clues, our method effectively reduces the difficulty of multi-choice reasoning and provides a novel solution for DC-MCQ. Extensive experiments on the CICERO and CICERO_v2 datasets validate the significant improvement of our approach on the DC-MCQ task. In the zero-shot setting, our model outperforms the best baseline by 17.67% in terms of F1 score for the multi-choice task. Most strikingly, our GPT3.5-based ReX-GoT framework achieves a remarkable 39.44% increase in F1 score.

NeurIPS Conference 2024 Conference Paper

RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation

  • Changli Wu
  • Qi Chen
  • Jiayi Ji
  • Haowei Wang
  • Yiwei Ma
  • You Huang
  • Hao Fei
  • Xiaoshuai Sun

3D Referring Expression Segmentation (3D-RES) aims to segment 3D objects by correlating referring expressions with point clouds. However, traditional approaches frequently encounter issues like over-segmentation or mis-segmentation, due to insufficient emphasis on spatial information of instances. In this paper, we introduce a Rule-Guided Spatial Awareness Network (RG-SAN) by utilizing solely the spatial information of the target instance for supervision. This approach enables the network to accurately depict the spatial relationships among all entities described in the text, thus enhancing the reasoning capabilities. The RG-SAN consists of the Text-driven Localization Module (TLM) and the Rule-guided Weak Supervision (RWS) strategy. The TLM initially locates all mentioned instances and iteratively refines their positional information. The RWS strategy, acknowledging that only target objects have supervised positional information, employs dependency tree rules to precisely guide the core instance’s positioning. Extensive testing on the ScanRefer benchmark has shown that RG-SAN not only establishes new performance benchmarks, with an mIoU increase of 5.1 points, but also exhibits significant improvements in robustness when processing descriptions with spatial ambiguity. All codes are available at https://github.com/sosppxo/RG-SAN.

NeurIPS Conference 2024 Conference Paper

Synergistic Dual Spatial-aware Generation of Image-to-text and Text-to-image

  • Yu Zhao
  • Hao Fei
  • Xiangtai Li
  • Libo Qin
  • Jiayi Ji
  • Hongyuan Zhu
  • Meishan Zhang
  • Min Zhang

In the visual spatial understanding (VSU) field, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. Within this dual framework, we propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D→image and 3D→text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD³) framework, which utilizes the intermediate features of the 3D→X processes to guide the hard X→3D processes, such that the overall ST2I and SI2T will benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances.

NeurIPS Conference 2024 Conference Paper

Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration

  • Kaihang Pan
  • Zhaoyu Fan
  • Juncheng Li
  • Qifan Yu
  • Hao Fei
  • Siliang Tang
  • Richang Hong
  • Hanwang Zhang

The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for effective knowledge editing. Current methods, including intrinsic knowledge editing and external knowledge resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic knowledge editing and external knowledge resorting. Both types of knowledge are conceptualized as vectorized key-value memories, with the corresponding editing processes resembling the assimilation and accommodation phases of human cognition, conducted at the same semantic levels. Within such a unified framework, we further promote knowledge collaboration by disentangling the knowledge representations into the semantic and truthfulness spaces. Extensive experiments validate the effectiveness of our method, which ensures that the post-edit MLLM simultaneously maintains excellent reliability, generality, and locality. The code for UniKE is available at https://github.com/beepkh/UniKE.

NeurIPS Conference 2024 Conference Paper

Unified Generative and Discriminative Training for Multi-modal Large Language Models

  • Wei Chow
  • Juncheng Li
  • Qifan Yu
  • Kaihang Pan
  • Hao Fei
  • Zhiqi Ge
  • Shuai Yang
  • Siliang Tang

In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation. This paper addresses these challenges by proposing a unified approach that integrates the strengths of both paradigms. Considering interleaved image-text sequences as the general format of input samples, we introduce a structure-induced training strategy that imposes semantic relationships between input samples and the MLLM’s hidden state. This approach enhances the MLLM’s ability to capture global semantics and distinguish fine-grained semantics. By leveraging dynamic sequence alignment within the Dynamic Time Warping framework and integrating a novel kernel for fine-grained semantic differentiation, our method effectively balances generative and discriminative tasks. Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art results in multiple generative tasks, especially those requiring cognitive and discrimination abilities. Additionally, our method surpasses discriminative benchmarks in interleaved and fine-grained retrieval tasks. By employing a retrieval-augmented generation strategy, our approach further enhances performance in some generative tasks within one model, offering a promising direction for future research in vision-language modeling.

NeurIPS Conference 2024 Conference Paper

Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

  • Hao Fei
  • Shengqiong Wu
  • Hanwang Zhang
  • Tat-Seng Chua
  • Shuicheng Yan

Recent developments of vision large language models (LLMs) have seen remarkable progress, yet still encounter challenges towards multimodal generalists, such as coarse-grained instance-level understanding, lack of unified support for both images and videos, and insufficient coverage across various vision tasks. In this paper we present Vitron, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Building on top of an LLM backbone, Vitron incorporates encoders for images, videos, and pixel-level regional visuals within its frontend modules, while employing state-of-the-art visual specialists as its backend, via which Vitron supports a spectrum of vision end tasks, spanning visual comprehension to visual generation, from low level to high level. To ensure an effective and precise message passing from LLM to backend modules for function invocation, we propose a novel hybrid method by simultaneously integrating discrete textual instructions and continuous signal embeddings. Further, we design various pixel-level spatiotemporal vision-language alignment learning for Vitron to reach the best fine-grained visual capability. Finally, a cross-task synergy module is devised to learn to maximize the task-invariant fine-grained visual features, enhancing the synergy between different visual tasks. Demonstrated over 12 visual tasks and evaluated across 22 datasets, Vitron showcases its extensive capabilities in the four main vision task clusters. Overall, this work illuminates the great potential of developing a more unified multimodal generalist.

NeurIPS Conference 2024 Conference Paper

What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

  • Libo Qin
  • Qiguang Chen
  • Hao Fei
  • Zhi Chen
  • Min Li
  • Wanxiang Che

Recently, rapid advancements in Multi-Modal In-Context Learning (MM-ICL) have achieved notable success, which is capable of achieving superior performance across various tasks without requiring additional parameter tuning. However, the underlying rules for the effectiveness of MM-ICL remain under-explored. To fill this gap, this work aims to investigate the research question: "What factors affect the performance of MM-ICL?" To this end, we conduct extensive experiments on the three core steps of MM-ICL including demonstration retrieval, demonstration ordering, and prompt construction using 6 vision large language models and 20 strategies. Our findings highlight (1) the necessity of a multi-modal retriever for demonstration retrieval, (2) the importance of intra-demonstration ordering over inter-demonstration ordering, and (3) the enhancement of task comprehension through introductory instructions in prompts. We hope this study can serve as a foundational guide for optimizing MM-ICL strategies in future research.

NeurIPS Conference 2023 Conference Paper

Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion

  • Shengqiong Wu
  • Hao Fei
  • Hanwang Zhang
  • Tat-Seng Chua

In this work, we investigate the task of text-to-image (T2I) synthesis under the abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts. Inspired by human imagination intuition, we propose a novel scene-graph hallucination (SGH) mechanism for effective abstract-to-intricate T2I synthesis. SGH carries out scene hallucination by expanding the initial scene graph (SG) of the input prompt with more feasible specific scene structures, in which the structured semantic representation of SG ensures high controllability of the intrinsic scene imagination. To approach the T2I synthesis, we deliberately build an SG-based hallucination diffusion system. First, we implement the SGH module based on the discrete diffusion technique, which evolves the SG structure by iteratively adding new scene elements. Then, we utilize another continuous-state diffusion model as the T2I synthesizer, where the overt image-generating process is navigated by the underlying semantic scene structure induced from the SGH module. On the benchmark COCO dataset, our system outperforms the existing best-performing T2I model by a significant margin, especially improving on the abstract-to-intricate T2I generation. Further in-depth analyses reveal how our methods advance.

NeurIPS Conference 2023 Conference Paper

VPGTrans: Transfer Visual Prompt Generator across LLMs

  • Ao Zhang
  • Hao Fei
  • Yuan Yao
  • Wei Ji
  • Li Li
  • Zhiyuan Liu
  • Tat-Seng Chua

Since developing a new multimodal LLM (MLLM) by pre-training on tremendous image-text pairs from scratch can be exceedingly resource-consuming, connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) becomes a feasible paradigm. However, further tuning the VPG component of the MLLM still incurs significant computational costs, such as thousands of GPU hours and millions of training data points. An alternative solution is transferring an existing VPG from one MLLM to the target MLLM. In this work, we investigate VPG transferability across LLMs for the first time, aiming to reduce the cost of VPG training. Specifically, we explore VPG transfer across different LLM sizes (e.g., small-to-large) and types. We identify key factors to maximize transfer efficiency, based on which we develop a simple yet highly effective two-stage transfer framework, called VPGTrans. Notably, it enables VPG transfer from BLIP-2 OPT 2.7B to BLIP-2 OPT 6.7B with less than 10% of the GPU hours using only 10.7% of the training data compared to training a VPG for OPT 6.7B from scratch. Furthermore, we provide a series of intriguing findings and discuss potential explanations behind them. Finally, we showcase the practical value of our VPGTrans approach, by customizing two novel MLLMs, including VL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs.

IJCAI Conference 2022 Conference Paper

Conversational Semantic Role Labeling with Predicate-Oriented Latent Graph

  • Hao Fei
  • Shengqiong Wu
  • Meishan Zhang
  • Yafeng Ren
  • Donghong Ji

Conversational semantic role labeling (CSRL) is a newly proposed task that uncovers the shallow semantic structures in a dialogue text. Unfortunately, several important characteristics of the CSRL task have been overlooked by existing works, such as structural information integration and near-neighbor influence. In this work, we investigate the integration of a latent graph for CSRL. We propose to automatically induce a predicate-oriented latent graph (POLar) with a predicate-centered Gaussian mechanism, by which the nearer and informative words to the predicate will be allocated with more attention. The POLar structure is then dynamically pruned and refined so as to best fit the task need. We additionally introduce an effective dialogue-level pre-trained language model, CoDiaBERT, for better supporting multiple utterance sentences and handling the speaker coreference issue in CSRL. Our system outperforms the best-performing baselines on three benchmark CSRL datasets by large margins, especially achieving over 4% F1 score improvements on the cross-utterance argument detection. Further analyses are presented to better understand the effectiveness of our proposed methods.

IJCAI Conference 2022 Conference Paper

Global Inference with Explicit Syntactic and Discourse Structures for Dialogue-Level Relation Extraction

  • Hao Fei
  • Jingye Li
  • Shengqiong Wu
  • Chenliang Li
  • Donghong Ji
  • Fei Li

Recent research attention for relation extraction has been paid to the dialogue scenario, i.e., dialogue-level relation extraction (DiaRE). Existing DiaRE methods either simply concatenate the utterances in a dialogue into a long piece of text, or employ naive words, sentences or entities to build dialogue graphs, while the structural characteristics in dialogues have not been fully utilized. In this work, we investigate a novel dialogue-level mixed dependency graph (D2G) and an argument reasoning graph (ARG) for DiaRE with a global relation reasoning mechanism. First, we model the entire dialogue into a unified and coherent D2G by explicitly integrating both syntactic and discourse structures, which enables richer semantic and feature learning for relation extraction. Second, we stack an ARG graph on top of D2G to further focus on argument inter-dependency learning and argument representation refinement, for sufficient argument relation inference. In our global reasoning framework, D2G and ARG work collaboratively, iteratively performing lexical, syntactic and semantic information exchange and representation learning over the entire dialogue context. On two DiaRE benchmarks, our framework shows considerable improvements over the current state-of-the-art baselines. Further analyses show that the model effectively solves the long-range dependence issue, and meanwhile gives explainable predictions.

IJCAI Conference 2022 Conference Paper

Inheriting the Wisdom of Predecessors: A Multiplex Cascade Framework for Unified Aspect-based Sentiment Analysis

  • Hao Fei
  • Fei Li
  • Chenliang Li
  • Shengqiong Wu
  • Jingye Li
  • Donghong Ji

So far, aspect-based sentiment analysis (ABSA) has involved a total of seven subtasks, among which, however, the interactions have not been sufficiently explored. This work presents a novel multiplex cascade framework for unified ABSA that maintains such interactions. First, we model all seven subtasks as a hierarchical dependency in the easy-to-hard order, based on which we then propose a multiplex decoding mechanism, transferring the sentiment layouts and clues in lower tasks to upper ones. The multiplex strategy enables highly-efficient subtask interflows and avoids repetitive training; meanwhile it sufficiently utilizes the existing data without requiring any further annotation. Further, based on the characteristics of aspect-opinion term extraction and pairing, we enhance our multiplex framework by integrating POS tag and syntactic dependency information for term boundary and pairing identification. The proposed Syntax-aware Multiplex (SyMux) framework enhances ABSA performance on 28 subtasks (7×4 datasets) by large margins.

NeurIPS Conference 2022 Conference Paper

LasUIE: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model

  • Hao Fei
  • Shengqiong Wu
  • Jingye Li
  • Bobo Li
  • Fei Li
  • Libo Qin
  • Meishan Zhang
  • Min Zhang

Universally modeling all typical information extraction tasks (UIE) with one generative language model (GLM) has revealed great potential in the latest study, where various IE predictions are unified into a linearized hierarchical expression under a GLM. Syntactic structure information, a type of effective feature which has been extensively utilized in the IE community, should also be beneficial to UIE. In this work, we propose a novel structure-aware GLM, fully unleashing the power of syntactic knowledge for UIE. A heterogeneous structure inductor is explored to unsupervisedly induce rich heterogeneous structural representations by post-training an existing GLM. In particular, a structural broadcaster is devised to compact various latent trees into explicit high-order forests, helping to guide a better generation during decoding. We finally introduce a task-oriented structure fine-tuning mechanism, further adjusting the learned structures to most coincide with the end-task's need. Over 12 IE benchmarks across 7 tasks, our system shows significant improvements over the baseline UIE system. Further in-depth analyses show that our GLM learns rich task-adaptive structural bias that greatly resolves the cruxes of UIE: the long-range dependency issue and boundary identification.

AAAI Conference 2022 Conference Paper

Mastering the Explicit Opinion-Role Interaction: Syntax-Aided Neural Transition System for Unified Opinion Role Labeling

  • Shengqiong Wu
  • Hao Fei
  • Fei Li
  • Meishan Zhang
  • Yijiang Liu
  • Chong Teng
  • Donghong Ji

Unified opinion role labeling (ORL) aims to detect all possible opinion structures of ‘opinion-holder-target’ in one shot, given a text. The existing transition-based unified method, unfortunately, is subject to longer opinion terms and fails to solve the term overlap issue. Current top performance has been achieved by employing the span-based graph model, which however still suffers from both high model complexity and insufficient interaction among opinions and roles. In this work, we investigate a novel solution by revisiting the transition architecture, and augmenting it with a pointer network (PointNet). The framework parses out all opinion structures in linear-time complexity, meanwhile breaks through the limitation of any length of terms with PointNet. To achieve the explicit opinion-role interactions, we further propose a unified dependency-opinion graph (UDOG), co-modeling the syntactic dependency structure and the partial opinion-role structure. We then devise a relation-centered graph aggregator (RCGA) to encode the multi-relational UDOG, where the resulting high-order representations are used to promote the predictions in the vanilla transition system. Our model achieves new state-of-the-art results on the MPQA benchmark. Analyses further demonstrate the superiority of our methods on both efficacy and efficiency.

AAAI Conference 2022 Conference Paper

Unified Named Entity Recognition as Word-Word Relation Classification

  • Jingye Li
  • Hao Fei
  • Jiang Liu
  • Shengqiong Wu
  • Meishan Zhang
  • Chong Teng
  • Donghong Ji
  • Fei Li

So far, named entity recognition (NER) has been involved with three major types, including flat, overlapped (aka. nested), and discontinuous NER, which have mostly been studied individually. Recently, a growing interest has been built for unified NER, tackling the above three jobs concurrently with one single model. Current best-performing methods mainly include span-based and sequence-to-sequence models, where unfortunately the former merely focus on boundary identification and the latter may suffer from exposure bias. In this work, we present a novel alternative by modeling the unified NER as word-word relation classification, namely W2NER. The architecture resolves the kernel bottleneck of unified NER by effectively modeling the neighboring relations between entity words with Next-Neighboring-Word (NNW) and Tail-Head-Word-* (THW-*) relations. Based on the W2NER scheme we develop a neural framework, in which the unified NER is modeled as a 2D grid of word pairs. We then propose multi-granularity 2D convolutions for better refining the grid representations. Finally, a co-predictor is used to sufficiently reason the word-word relations. We perform extensive experiments on 14 widely-used benchmark datasets for flat, overlapped, and discontinuous NER (8 English and 6 Chinese datasets), where our model beats all the current top-performing baselines, pushing the state-of-the-art performances of unified NER.
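
To illustrate the word-word relation scheme described above, the following sketch builds a toy word-pair label grid with NNW and THW-* relations for a sentence containing a discontinuous mention; the example sentence, entity type, and helper function are hypothetical and not taken from the paper's released code.

    # Build a toy W2NER-style label grid from entities given as ordered word-index lists.
    NONE, NNW = "NONE", "NNW"

    def build_grid(num_words, entities):
        grid = [[NONE] * num_words for _ in range(num_words)]
        for indices, etype in entities:
            for a, b in zip(indices, indices[1:]):          # successive words inside an entity
                grid[a][b] = NNW
            grid[indices[-1]][indices[0]] = "THW-" + etype  # tail row points back to head column
        return grid

    # "have aching in legs and shoulders": the flat mention "aching in legs" (1, 2, 3)
    # and the discontinuous mention "aching in shoulders" (1, 2, 5), both of type ADE.
    grid = build_grid(6, [([1, 2, 3], "ADE"), ([1, 2, 5], "ADE")])
    print(grid[1][2], grid[2][3], grid[2][5], grid[3][1], grid[5][1])
    # NNW NNW NNW THW-ADE THW-ADE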

AAAI Conference 2021 Conference Paper

Encoder-Decoder Based Unified Semantic Role Labeling with Label-Aware Syntax

  • Hao Fei
  • Fei Li
  • Bobo Li
  • Donghong Ji

Currently the unified semantic role labeling (SRL) that achieves predicate identification and argument role labeling in an end-to-end manner has received growing interest. Recent works show that leveraging the syntax knowledge significantly enhances the SRL performances. In this paper, we investigate a novel unified SRL framework based on the sequence-to-sequence architecture with double enhancement in both the encoder and decoder sides. In the encoder side, we propose a novel label-aware graph convolutional network (LA-GCN) to encode both the syntactic dependent arcs and labels into BERT-based word representations. In the decoder side, we creatively design a pointer-network-based model for detecting predicates, arguments and roles jointly. Our pointer-net decoder is able to make decisions by consulting all the input elements in a global view, and meanwhile it is syntactic-aware by incorporating the syntax information from LA-GCN. Besides, a high-order interacted attention is introduced into the decoder for leveraging previously recognized triplets to help the current decision. Empirical experiments show that our framework significantly outperforms all existing graph-based methods on the CoNLL09 and Universal Proposition Bank datasets. In-depth analysis demonstrates that our model can effectively capture the correlations between syntactic and SRL structures.

AAAI Conference 2021 Conference Paper

End-to-end Semantic Role Labeling with Neural Transition-based Model

  • Hao Fei
  • Meishan Zhang
  • Bobo Li
  • Donghong Ji

End-to-end semantic role labeling (SRL) has received increasing interest. It performs the two subtasks of SRL, predicate identification and argument role labeling, jointly. Recent work is mostly focused on graph-based neural models, while the transition-based framework with neural networks, which has been widely used in a number of closely-related tasks, has not been studied for the joint task yet. In this paper, we present the first work of transition-based neural models for end-to-end SRL. Our transition model incrementally discovers all sentential predicates as well as their arguments by a set of transition actions. The actions of the two subtasks are executed mutually for full interactions. Besides, we suggest high-order compositions to extract non-local features, which can enhance the proposed transition model further. Experimental results on CoNLL09 and Universal Proposition Bank show that our final model can produce state-of-the-art performance, and meanwhile keeps highly efficient in decoding. We also conduct detailed experimental analysis for a deep understanding of our proposed model.

IJCAI Conference 2021 Conference Paper

Learn from Syntax: Improving Pair-wise Aspect and Opinion Terms Extraction with Rich Syntactic Knowledge

  • Shengqiong Wu
  • Hao Fei
  • Yafeng Ren
  • Donghong Ji
  • Jingye Li

In this paper, we propose to enhance the pair-wise aspect and opinion terms extraction (PAOTE) task by incorporating rich syntactic knowledge. We first build a syntax fusion encoder for encoding syntactic features, including a label-aware graph convolutional network (LAGCN) for modeling the dependency edges and labels, as well as the POS tags unifiedly, and a local-attention module encoding POS tags for better term boundary detection. During pairing, we then adopt Biaffine and Triaffine scoring for high-order aspect-opinion term pairing, in the meantime re-harnessing the syntax-enriched representations in LAGCN for syntactic-aware scoring. Experimental results on four benchmark datasets demonstrate that our model outperforms current state-of-the-art baselines, meanwhile yielding explainable predictions with syntactic knowledge.

AAAI Conference 2021 Conference Paper

Rethinking Boundaries: End-To-End Recognition of Discontinuous Mentions with Pointer Networks

  • Hao Fei
  • Donghong Ji
  • Bobo Li
  • Yijiang Liu
  • Yafeng Ren
  • Fei Li

A majority of research interest in irregular (e.g., nested or discontinuous) named entity recognition (NER) has been paid to nested entities, while discontinuous entities have received limited attention. Existing work for discontinuous NER, however, either suffers from decoding ambiguity or makes predictions using only token-level local features. In this work, we present an innovative model for discontinuous NER based on pointer networks, where the pointer simultaneously decides whether a token at each decoding frame constitutes an entity mention and where the next constituent token is. Our model has three major merits compared with previous work: (1) The pointer mechanism is memory-augmented, which enhances the mention boundary detection and interactions between the current decision and prior recognized mentions. (2) The encoder-decoder architecture can linearize the complexity of structure prediction, and thus reduce search costs. (3) The model makes every decision using global information, i.e., by consulting all the input, encoder and previous decoder output in a global view. Experimental results on the CADEC and ShARe13 datasets show that our model outperforms flat and hypergraph models as well as a state-of-the-art transition-based model for discontinuous NER. Further in-depth analysis demonstrates that our model performs well in recognizing various entities including flat, overlapping and discontinuous ones. More crucially, our model is effective on boundary detection, which is the kernel source to NER.

AAAI Conference 2020 Conference Paper

Latent Emotion Memory for Multi-Label Emotion Classification

  • Hao Fei
  • Yue Zhang
  • Yafeng Ren
  • Donghong Ji

Identifying multiple emotions in a sentence is an important research topic. Existing methods usually model the problem as a multi-label classification task. However, previous methods have two issues, limiting the performance of the task. First, these models do not consider the prior emotion distribution in a sentence. Second, they fail to effectively capture the context information closely related to the corresponding emotion. In this paper, we propose a Latent Emotion Memory network (LEM) for multi-label emotion classification. The proposed model can learn the latent emotion distribution without external knowledge, and can effectively leverage it into the classification network. Experimental results on two benchmark datasets show that the proposed model outperforms strong baselines, achieving the state-of-the-art performance.