Arrow Research search

Author name cluster

Fei Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

85 papers
2 author rows

Possible papers

85

AAAI Conference 2026 Conference Paper

InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

  • Yuhang Liu
  • Zeyu Liu
  • Shuanghe Zhu
  • Pengxiang Li
  • Congkai Xie
  • Jiasheng Wang
  • Xueyu Hu
  • Xiaotian Han

The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from the first principles of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding.
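The abstract only states the efficiency principle η = U/C; a minimal sketch of what such a reward could look like, assuming a first-hit utility over a multi-answer rollout and cost equal to the number of answers generated (these concrete choices are illustrative assumptions, not the paper's actual AER definition):

```python
# Hypothetical efficiency-style exploration reward, eta = U / C.
# U and C below are assumed: U = 1 if any sampled answer grounds correctly,
# C = number of answers the policy generated for this instruction.

def adaptive_exploration_reward(answers, is_correct):
    """Reward a multi-answer rollout by efficiency: utility per unit cost."""
    cost = len(answers)                        # C: answers generated
    utility = 1.0 if any(is_correct(a) for a in answers) else 0.0  # U: any hit?
    return utility / max(cost, 1)              # eta = U / C

# A rollout that finds a correct element with fewer samples scores higher,
# pushing the policy toward broad but efficient exploration.
reward_fast = adaptive_exploration_reward(["(120,40)"], lambda a: a == "(120,40)")
reward_slow = adaptive_exploration_reward(["(0,0)", "(5,5)", "(120,40)"],
                                          lambda a: a == "(120,40)")
```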

AAAI Conference 2026 Conference Paper

JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction

  • Yuhao Zhan
  • Yuqing Zhang
  • Jing Yuan
  • Qixiang Ma
  • Zhiqi Yang
  • Yu Gu
  • Zemin Liu
  • Fei Wu

Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework to validate correction edits for grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as a benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false positive decoupling and fluency scoring, resulting in state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding the BEA19 single-reference dataset, which contains 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.

EAAI Journal 2025 Journal Article

Adversarial-Causal Representation Learning Networks for Machine fault diagnosis under unseen conditions based on vibration and acoustic signals

  • Fei Wu
  • Zhuohang Xiang
  • Dengyu Xiao
  • Yaodong Hao
  • Yi Qin
  • Huayan Pu
  • Jun Luo

To address the challenges of obtaining diverse data, domain generalization (DG) methods for fault diagnosis have been developed. Domain adversarial methods are currently the most popular, due to their ability to handle data from unknown domains without requiring target domain information. However, their capacity to extract domain-irrelevant features remains limited, often resulting in accuracy below 90% in many DG scenarios. This limitation stems from their inability to fully capture global dependencies, causing feature entanglement and redundant dependencies. To address these issues, we propose a novel intelligent fault diagnosis method called Adversarial-Causal Representation Learning Networks (ACRLN), which is based on causal learning. Through a spatial-mask domain adversarial method, ACRLN significantly enhances data utilization by fully capturing the global dependencies that are often ignored by domain adversarial algorithms. At the same time, causal learning is integrated into the ACRLN to further accomplish feature decoupling and the reduction of redundant dependencies. This is achieved through a channel feature orthogonality method combined with a loss function rooted in correlation analysis. Moreover, it adeptly addresses the spill-over effect often encountered in causal learning. Finally, ACRLN achieves better results and proves its effectiveness by comparison with several state-of-the-art fault diagnosis and DG algorithms on multiple datasets.

AILAW Journal 2025 Journal Article

An LLMs-based neuro-symbolic legal judgment prediction framework for civil cases

  • Bin Wei
  • Yaoyao Yu
  • Leilei Gan
  • Fei Wu

In recent years, the field of AI & Law has increasingly focused on predicting legal judgments, particularly in civil cases. While traditional neural network methods are highly effective at automatically learning patterns from large datasets, they often suffer from a lack of interpretability. To address this limitation, we propose a neuro-symbolic framework for legal judgment prediction, based on large language models (LLMs). This framework combines legal knowledge (e.g., legal rules), represented through first-order logic rules, with deep neural networks (DNNs), using a discrepancy loss to minimize prediction differences between the two components. By integrating the logic module during end-to-end training, knowledge is effectively transferred to the model parameters. Additionally, we develop a Chain-of-Thought prompt that uses LLMs to extract fact elements from legal cases. These elements act as logical variables within the rules, supporting the reasoning process in the logic module and improving overall interpretability. To validate the effectiveness of this framework, we conduct extensive experiments on a large dataset of private lending cases. The results demonstrate that the framework not only improves predictive performance but also enhances the interpretability of judgment predictions.

NeurIPS Conference 2025 Conference Paper

Curriculum Model Merging: Harmonizing Chemical LLMs for Enhanced Cross-Task Generalization

  • Baoyi He
  • Luotian Yuan
  • Ying Wei
  • Fei Wu

The emergence of large language models (LLMs) has opened new opportunities for AI-driven chemical problem solving. However, existing chemical LLMs are typically tailored to specific task formats or narrow domains, limiting their capacity to integrate knowledge and generalize across tasks. Model merging offers a promising route for efficiently combining specialized LLMs into a unified model without access to original training data, which is urgently needed in the chemical domain where in-house data and privacy preservation are critical. However, effective model merging in the chemical domain poses unique challenges: (1) significant disparities among chemical LLMs due to task-specific specialization, and (2) a highly imbalanced distribution of chemical LLMs in targeted downstream tasks, where some are over-benchmarked while others remain underexplored. These challenges intensify model inconsistencies such as parameter interference and accumulated fine-tuning noise, which collectively hinder effective model merging. To this end, we propose Curriculum Model Merging (CMM), a curriculum-based framework that progressively merges expert chemical LLMs in a moderate and continual manner. CMM aims to harmonize their inconsistencies while preserving their domain-specific expertise. Comprehensive experiments on two benchmark datasets show that CMM effectively consolidates task-specific expertise and outperforms the state-of-the-art methods by 29.03% in terms of overall average performance. Moreover, CMM facilitates chemical knowledge generalization across prediction and generative tasks without sacrificing robustness, exhibiting promising merging performance under both expert-abundant and expert-sparse scenarios.

IJCAI Conference 2025 Conference Paper

Device-Cloud Collaborative Correction for On-Device Recommendation

  • Tianyu Zhan
  • Shengyu Zhang
  • Zheqi Lv
  • Jieming Zhu
  • Jiwei Li
  • Fan Wu
  • Fei Wu

With the rapid development of recommendation models and device computing power, device-based recommendation has become an important research area due to its better real-time performance and privacy protection. Previously, Transformer-based sequential recommendation models have been widely applied in this field because they outperform Recurrent Neural Network (RNN)-based recommendation models in terms of performance. However, as the length of interaction sequences increases, Transformer-based models introduce significantly more space and computational overhead compared to RNN-based models, posing challenges for device-based recommendation. To balance real-time performance and high performance on devices, we propose Device-Cloud Collaborative Correction Framework for On-Device Recommendation (CoCorrRec). CoCorrRec uses a self-correction network (SCN) to correct parameters with extremely low time cost. By updating model parameters during testing based on the input token, it achieves performance comparable to current optimal but more complex Transformer-based models. Furthermore, to prevent SCN from overfitting, we design a global correction network (GCN) that processes hidden states uploaded from devices and provides a global correction solution. Extensive experiments on multiple datasets show that CoCorrRec outperforms existing Transformer-based and RNN-based device recommendation models in terms of performance, with fewer parameters and lower FLOPs, thereby achieving a balance between real-time performance and high efficiency. Code is available at https://github.com/Yuzt-zju/CoCorrRec.

NeurIPS Conference 2025 Conference Paper

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

  • Baoqi Pei
  • Yifei Huang
  • Jilan Xu
  • Yuping He
  • Guo Chen
  • Fei Wu
  • Jiangmiao Pang
  • Yu Qiao

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand–object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks.

IJCAI Conference 2025 Conference Paper

ExpTalk: Diverse Emotional Expression via Adaptive Disentanglement and Refined Alignment for Speech-Driven 3D Facial Animation

  • Zhan Qu
  • Shengyu Zhang
  • Mengze Li
  • Zhuo Chen
  • Chengfei Lv
  • Zhou Zhao
  • Fei Wu

Speech-driven 3D facial animation aims to create lifelike facial expressions that synchronize accurately with speech. Despite significant progress, many existing methods focus on generating facial animation with a fixed emotional state, neglecting the diverse transformations of facial emotions under a given speech input. To solve this issue, we focus on exploring the refined alignment between speech representations and multiple domains in facial expression information. We aim to disentangle the spoken-language and emotional facial priors from speech expressions, to guide the refinement of the facial vertices based on speech. To accomplish this objective, we propose ExpTalk, which first applies an Adaptive Disentanglement Variational Autoencoder (AD-VAE) to decouple facial expression aligned with the spoken language and emotions of speech through contrastive learning. Then a Refined Alignment Diffusion (RAD) is employed to iteratively refine the decoupled facial expression priors through diffusion-based perturbations, producing facial animations that align with the emotional variations of the given speech. Extensive experiments prove the effectiveness of our ExpTalk, which surpasses state-of-the-art methods by a large margin.

AAAI Conference 2025 Conference Paper

FedCFA: Alleviating Simpson’s Paradox in Model Aggregation with Counterfactual Federated Learning

  • Zhonghua Jiang
  • Jimin Xu
  • Shengyu Zhang
  • Tao Shen
  • Jiwei Li
  • Kun Kuang
  • Haibin Cai
  • Fei Wu

Federated learning (FL) is a promising technology for data privacy and distributed optimization, but it suffers from data imbalance and heterogeneity among clients. Existing FL methods try to solve the problems by aligning the client with the server model or by correcting the client model with control variables. These methods excel on IID and general Non-IID data but perform mediocrely in Simpson's Paradox scenarios. Simpson's Paradox refers to the phenomenon that a trend observed on the global dataset disappears or reverses on a subset, which may mean that the global model obtained through aggregation in FL does not accurately reflect the distribution of global data. Thus, we propose FedCFA, a novel FL framework employing counterfactual learning to generate counterfactual samples by replacing critical factors of local data with global average data, aligning local data distributions with the global distribution and mitigating Simpson's Paradox effects. In addition, to improve counterfactual sample quality, we introduce a factor decorrelation (FDC) loss to reduce the correlation among features and thus improve the independence of extracted factors. We conduct extensive experiments on six datasets and verify that our method outperforms other FL methods in terms of efficiency and global model accuracy under limited communication rounds.
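A minimal sketch of the replacement step described above, under two simplifying assumptions: "factors" are raw feature coordinates, and "critical" means the coordinates that deviate most from the global average. The real method learns its factors end-to-end; this only illustrates how swapping them for global averages pulls a local sample toward the global distribution.

```python
import numpy as np

def counterfactual_samples(local_x, global_mean, k):
    """Replace the k most deviant features of each local sample with the
    global average (assumed stand-in for FedCFA's learned critical factors)."""
    x_cf = local_x.copy()
    deviation = np.abs(local_x - global_mean)        # per-feature deviation
    top_k = np.argsort(-deviation, axis=1)[:, :k]    # k most deviant features
    rows = np.arange(local_x.shape[0])[:, None]
    x_cf[rows, top_k] = global_mean[top_k]           # splice in global values
    return x_cf

# Toy client whose first feature is far from the global mean:
local = np.array([[10.0, 0.2, 0.1], [9.0, 0.0, 0.3]])
g_mean = np.array([1.0, 0.1, 0.2])
cf = counterfactual_samples(local, g_mean, k=1)      # only the largest outlier moves
```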

NeurIPS Conference 2025 Conference Paper

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

  • Yanggan Gu
  • Yuanyi Wang
  • Zhaoyi Yan
  • Yiming Zhang
  • Qi Zhou
  • Fei Wu
  • Hongxia Yang

Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA), a critical phase for enhancing LLM performance, largely unexplored. The few existing fusion methods for the PA phase, such as WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing the complex vocabulary alignment challenges of previous works while maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
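A hedged sketch of the core idea: keep the DPO objective but swap the single reference model's sequence log-probability for a fusion of several source models' sequence log-probabilities. The max-margin fusion shown here (take the most confident source per response, after clipping) is one plausible reading of the abstract; the paper's exact fusion rule and clipping details are not reproduced.

```python
import math

def fused_ref_logp(source_logps, clip_min=-50.0):
    """Sequence-level fusion: most confident source, with probability clipping
    (both choices are illustrative assumptions)."""
    return max(max(lp, clip_min) for lp in source_logps)

def infifpo_style_loss(policy_chosen, policy_rejected,
                       sources_chosen, sources_rejected, beta=0.1):
    """DPO-shaped loss with the reference replaced by the fused sources."""
    ref_chosen = fused_ref_logp(sources_chosen)
    ref_rejected = fused_ref_logp(sources_rejected)
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

# The loss falls as the pivot prefers the chosen response more strongly
# than the fused sources do (toy sequence log-probabilities):
loss = infifpo_style_loss(-5.0, -9.0, [-6.0, -7.0], [-6.5, -8.0])
```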

NeurIPS Conference 2025 Conference Paper

InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

  • Yuanyi Wang
  • Zhaoyi Yan
  • Yiming Zhang
  • Qi Zhou
  • Yanggan Gu
  • Fei Wu
  • Hongxia Yang

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
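The graph construction described above (top-k logits per position, outer products aggregated into a co-activation graph) can be sketched directly; the vocabulary size, k, and toy logits below are assumptions for demonstration, and the Gromov-Wasserstein approximation is not shown.

```python
import numpy as np

def logits_coactivation_graph(logits, k):
    """logits: (seq_len, vocab) array -> (vocab, vocab) co-activation graph.
    Nodes are vocabulary channels; edges quantify their joint activations."""
    seq_len, vocab = logits.shape
    graph = np.zeros((vocab, vocab))
    for pos in range(seq_len):
        top = np.argsort(-logits[pos])[:k]      # top-k channels at this position
        vals = np.zeros(vocab)
        vals[top] = logits[pos, top]            # sparse top-k logit vector
        graph += np.outer(vals, vals)           # aggregate outer products
    return graph

rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(4, 6))            # 4 positions, 6-token toy vocab
g = logits_coactivation_graph(toy_logits, k=2)  # symmetric (vocab, vocab) graph
```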

NeurIPS Conference 2025 Conference Paper

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

  • Kaihang Pan
  • Yang Wu
  • Wendong Bu
  • Shen Kai
  • Juncheng Li
  • Yingting Wang
  • Yunfei Li
  • Siliang Tang

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.

AAAI Conference 2025 Conference Paper

Knowledge Is Power: Harnessing Large Language Models for Enhanced Cognitive Diagnosis

  • Zhiang Dong
  • Jingyuan Chen
  • Fei Wu

Cognitive Diagnosis Models (CDMs) are designed to assess students' cognitive states by analyzing their performance across a series of exercises. However, existing CDMs often struggle with diagnosing infrequent students and exercises due to a lack of rich prior knowledge. With the advancement in large language models (LLMs), which possess extensive domain knowledge, their integration into cognitive diagnosis presents a promising opportunity. Despite this potential, integrating LLMs with CDMs poses significant challenges. LLMs are not well-suited for capturing the fine-grained collaborative interactions between students and exercises, and the disparity between the semantic space of LLMs and the behavioral space of CDMs hinders effective integration. To address these issues, we propose a novel Knowledge-enhanced Cognitive Diagnosis (KCD) framework, a model-agnostic framework that utilizes LLMs to enhance CDMs and is compatible with various CDM architectures. The KCD framework operates in two stages: LLM Diagnosis and Cognitive Level Alignment. In the LLM Diagnosis stage, both students and exercises are diagnosed to achieve comprehensive and detailed modeling. In the Cognitive Level Alignment stage, we bridge the gap between the CDMs' behavioral space and the LLMs' semantic space using contrastive learning and mask-reconstruction approaches. Experiments on several real-world datasets demonstrate the effectiveness of our proposed framework.

AAAI Conference 2025 Conference Paper

MergeNet: Knowledge Migration Across Heterogeneous Models, Tasks, and Modalities

  • Kunxi Li
  • Tianyu Zhan
  • Kairui Fu
  • Shengyu Zhang
  • Kun Kuang
  • Jiwei Li
  • Zhou Zhao
  • Fan Wu

In this study, we focus on heterogeneous knowledge transfer across entirely different model architectures, tasks, and modalities. Existing knowledge transfer methods (e.g., backbone sharing, knowledge distillation) often hinge on shared elements within model structures or task-specific features/labels, limiting transfers to complex model types or tasks. To overcome these challenges, we present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models, facilitating the direct interaction, extraction, and application of knowledge within these parameter spaces. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters and adeptly learning to identify and map parameters into the target model. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage, including the training trajectory knowledge of the source model. Extensive experiments on heterogeneous knowledge transfer demonstrate significant improvements in challenging settings, where representative approaches may falter or prove less applicable.

NeurIPS Conference 2025 Conference Paper

Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

  • Jinluan Yang
  • Dingnan Jin
  • Anke Tang
  • Li Shen
  • Didi Zhu
  • Zhengyu Chen
  • Ziyu Zhao
  • Daixin Wang

Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI. Existing methods like data mixture strategies face limitations, including heavy reliance on expert knowledge and conflicting optimization signals. While model merging offers parameter-level conflict-resolution strategies through integrating specialized models' parameters, its potential for 3H optimization remains underexplored. This paper systematically compares the effectiveness of model merging and data mixture methods in constructing 3H-aligned LLMs for the first time, revealing previously overlooked collaborative and conflict relationships among the 3H dimensions and discussing the advantages and drawbacks of data mixture (data-level) and model merging (parameter-level) methods in mitigating conflicts for balanced 3H optimization. Specifically, we propose a novel Reweighting Enhanced task Singular Merging method (RESM), which uses outlier weighting and sparsity-aware rank selection strategies to address the challenges of preference noise accumulation and layer sparsity adaptation inherent in 3H-aligned LLM merging. Extensive evaluations verify the effectiveness and robustness of RESM compared to previous data mixture (2%-5% gain) and model merging (1%-3% gain) methods in achieving balanced LLM alignment.

NeurIPS Conference 2025 Conference Paper

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

  • Wenxiang Guo
  • Changhao Pan
  • Zhiyuan Zhu
  • Xintong Hu
  • Yu Zhang
  • Li Tang
  • Rui Yang
  • Han Wang

Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.

NeurIPS Conference 2025 Conference Paper

MS-Bench: Evaluating LMMs in Ancient Manuscript Study through a Dunhuang Case Study

  • Yuqing Zhang
  • Yue Han
  • Shuanghe Zhu
  • Haoxiang Wu
  • Hangqi Li
  • Shengyu Zhang
  • Junchi Yan
  • Zemin Liu

Analyzing ancient manuscripts has traditionally been a labor-intensive and time-consuming task for philologists. While recent advancements in LMMs have demonstrated their potential across diverse domains, their effectiveness in manuscript study remains underexplored. In this paper, we introduce MS-Bench, the first comprehensive benchmark co-developed with archaeologists, comprising 5,076 high-resolution images from the 4th to the 14th century and 9,982 expert-curated questions across nine sub-tasks aligned with archaeological workflows. Through four prompting strategies, we systematically evaluate 32 LMMs on their effectiveness, robustness, and cultural contextualization. Our analysis reveals scale-driven performance and reliability improvements, prompting strategies' impact on performance (CoT has a two-sided effect, while visual retrieval-augmented prompts provide a consistent boost), and task-specific preferences depending on the LMM's visual capabilities. Although current LMMs are not yet capable of replacing domain expertise, they demonstrate promising potential to accelerate manuscript research through future human-AI collaboration.

AAAI Conference 2025 Conference Paper

Optimize Incompatible Parameters Through Compatibility-aware Knowledge Integration

  • Zheqi Lv
  • Keming Ye
  • Zishu Wei
  • Qi Tian
  • Shengyu Zhang
  • Wenqiao Zhang
  • Wenjie Wang
  • Kun Kuang

Deep neural networks have become foundational to advancements in multiple domains, including recommendation systems, natural language processing, and so on. Despite their successes, these models often contain incompatible parameters that can be underutilized or detrimental to model performance, particularly when faced with specific, varying data distributions. Existing research excels in removing such parameters or merging the outputs of multiple different pretrained models. However, the former focuses on efficiency rather than performance, while the latter requires several times more computing and storage resources to support inference. In this paper, we aim to explicitly improve these incompatible parameters by leveraging the complementary strengths of different models, thereby directly enhancing the models without any additional parameters. Specifically, we propose Compatibility-aware Knowledge Integration (CKI), which consists of Parameter Compatibility Assessment and Parameter Splicing: the former evaluates the knowledge content of multiple models, and the latter integrates that knowledge into one model. The integrated model can be used directly for inference or for further fine-tuning. Extensive experiments on various recommendation and language datasets show that CKI can effectively optimize incompatible parameters under multiple tasks and settings to break through the training limit of the original model without increasing the inference cost.
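As a rough illustration of the splicing step: element-wise, keep each parameter from whichever model is judged more compatible. The per-parameter scores are taken as given here; the paper's Parameter Compatibility Assessment that would produce them is learned, not hand-coded, so this is a sketch of the integration step only.

```python
import numpy as np

def splice_parameters(theta_a, theta_b, score_a, score_b):
    """Element-wise splice: take each parameter from the model whose
    (assumed, externally supplied) compatibility score is higher."""
    keep_a = score_a >= score_b              # where model A is more compatible
    return np.where(keep_a, theta_a, theta_b)

# Two toy parameter vectors and hypothetical compatibility scores:
theta_a = np.array([1.0, 2.0, 3.0])
theta_b = np.array([10.0, 20.0, 30.0])
score_a = np.array([0.9, 0.1, 0.8])
score_b = np.array([0.2, 0.5, 0.3])
spliced = splice_parameters(theta_a, theta_b, score_a, score_b)
```

The spliced tensor has the same shape as either input, so the integrated model adds no parameters and no extra inference cost, matching the abstract's stated goal.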

NeurIPS Conference 2025 Conference Paper

Vinci: Deep Thinking in Text-to-Image Generation using Unified Model with Reinforcement Learning

  • Wang Lin
  • Wentao Hu
  • Liyu Jia
  • Kaihang Pan
  • Majun Zhang
  • Zhou Zhao
  • Fei Wu
  • Jingyuan Chen

With the continuous development of large language models and reasoning chain technologies, the potential of deep reasoning based on reinforcement learning has shown remarkable promise in multi-task scenarios. However, existing unified models have yet to achieve end-to-end integration in image generation and understanding tasks, limiting the model's self-reflection ability and the realization of cross-modal reasoning chains. To address this, we propose Vinci, a novel framework designed to enable interleaved image generation and understanding through deep reasoning capabilities. We leverage a small amount of multimodal chain-of-thought (MCoT) data for cold-start and employ reinforcement learning to guide the integration of image generation and understanding tasks. Additionally, we introduce a momentum-based reward function, which dynamically adjusts the reward distribution by considering historical improvements, ensuring the stability of the model across multiple generations. Experimental results demonstrate that integrating MCoT can achieve a +22% improvement over the base model on GenEval, effectively enhancing both image generation quality and instruction alignment capabilities.

NeurIPS Conference 2024 Conference Paper

$E^3$: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset

  • Wang Lin
  • Yueying Feng
  • Wenkang Han
  • Tao Jin
  • Zhou Zhao
  • Fei Wu
  • Chang Yao
  • Jingyuan Chen

Understanding human emotions is fundamental to enhancing human-computer interaction, especially for embodied agents that mimic human behavior. Traditional emotion analysis often takes a third-person perspective, limiting the ability of agents to interact naturally and empathetically. To address this gap, this paper presents $E^3$ for Exploring Embodied Emotion, the first massive first-person view video dataset. $E^3$ contains more than $50$ hours of video, capturing $8$ different emotion types in diverse scenarios and languages. The dataset features videos recorded by individuals in their daily lives, capturing a wide range of real-world emotions conveyed through visual, acoustic, and textual modalities. By leveraging this dataset, we define $4$ core benchmark tasks - emotion recognition, emotion classification, emotion localization, and emotion reasoning - supported by more than $80$k manually crafted annotations, providing a comprehensive resource for training and evaluating emotion analysis models. We further present Emotion-LlaMa, which complements visual modality with acoustic modality to enhance the understanding of emotion in first-person videos. The results of comparison experiments with a large number of baselines demonstrate the superiority of Emotion-LlaMa and set a new benchmark for embodied emotion analysis. We expect that $E^3$ can promote advances in multimodal understanding, robotics, and augmented reality, and provide a solid foundation for the development of more empathetic and context-aware embodied agents.

IROS Conference 2024 Conference Paper

3D Object Detection via Stereo Pyramid Transformers with Rich Semantic Feature Fusion

  • Rongqi Gu
  • Chu Yang
  • Yaohan Lu
  • Peigen Liu
  • Fei Wu
  • Guang Chen 0001

Camera-based 3D object detectors, prized for their broader applicability and cost-effectiveness compared to LiDAR sensors, still grapple with the inherently ill-posed nature of depth extraction from images. In this work, we present a novel approach that employs a transformer-based backbone and a fused geometry volume to bolster feature richness and elevate detection accuracy. Firstly, we propose the Stereo Pyramid Transformer backbone to extract features from stereo images, which can capture global information and establish cross-image semantic connections. Then, to tackle the challenge posed by small baseline binocular cameras, we propose to fuse stereo geometry volumes constructed by Stereo Plane Sweeping Volume (SPSV), Monocular Semantic Volume (MSV), and Lifted Volume (LV) to create finely detailed feature volumes. Through extensive experiments on both the KITTI and our datasets, our approach not only surpasses all existing transformer-based stereo 3D detection methods but also marks a significant milestone by achieving comparable performance with the leading CNN-based 3D detectors for the first time.

NeurIPS Conference 2024 Conference Paper

Action Imitation in Common Action Space for Customized Action Image Synthesis

  • Wang Lin
  • Jingyuan Chen
  • Jiaxin Shi
  • Zirun Guo
  • Yichen Zhu
  • Zehan Wang
  • Tao Jin
  • Zhou Zhao

We propose a novel method, TwinAct, to tackle the challenge of decoupling actions and actors in order to customize the text-guided diffusion models (TGDMs) for few-shot action image generation. TwinAct addresses the limitations of existing methods that struggle to decouple actions from other semantics (e.g., the actor's appearance) due to the lack of an effective inductive bias with few exemplar images. Our approach introduces a common action space, which is a textual embedding space focused solely on actions, enabling precise customization without actor-related details. Specifically, TwinAct involves three key steps: 1) Building common action space based on a set of representative action phrases; 2) Imitating the customized action within the action space; and 3) Generating highly adaptable customized action images in diverse contexts with action similarity loss. To comprehensively evaluate TwinAct, we construct a novel benchmark, which provides sample images with various forms of actions. Extensive experiments demonstrate TwinAct's superiority in generating accurate, context-independent customized actions while maintaining the identity consistency of different subjects, including animals, humans, and even customized actors.

AAAI Conference 2024 Conference Paper

Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction

  • Jianping Zhu
  • Xin Guo
  • Yang Chen
  • Yao Yang
  • Wenbo Li
  • Bo Jin
  • Fei Wu

Long sequence prediction has broad and significant application value in fields such as finance, wind power, and weather. However, the complex long-term dependencies of long sequence data and the potential domain shift problems limit the effectiveness of traditional models in practical scenarios. To this end, we propose an Adaptive Meta-Learning Probabilistic Inference Framework (AMPIF) based on sequence decomposition, which can effectively enhance the long sequence prediction ability of various basic models. Specifically, first, we decouple complex sequences into seasonal and trend components through a frequency domain decomposition module. Then, we design an adaptive meta-learning task construction strategy, which divides the seasonal and trend components into different tasks through a clustering-matching approach. Finally, we design a dual-stream amortized network (ST-DAN) to capture shared information between seasonal-trend tasks and use the support set to generate task-specific parameters for rapid generalization learning on the query set. We conducted extensive experiments on six datasets, including wind power and finance scenarios, and the results show that our method significantly outperforms baseline methods in prediction accuracy, interpretability, and algorithm stability and can effectively enhance the long sequence prediction capabilities of base models. The source code is publicly available at https://github.com/Zhu-JP/AMPIF.
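As context for the sequence decomposition step the abstract mentions, a common generic way to split a series into trend and seasonal components is a frequency-domain low-pass filter. The sketch below is illustrative only and is not AMPIF's actual decomposition module; the cutoff parameter `keep` and the helper name `freq_decompose` are assumptions for this example.

```python
import numpy as np

def freq_decompose(x, keep=3):
    """Split a 1-D series into a trend (low-frequency) component and a
    seasonal (residual) component via a hard low-pass filter in the FFT
    domain. `keep` is the number of low-frequency bins retained for the trend."""
    spec = np.fft.rfft(x)
    low = np.zeros_like(spec)
    low[:keep] = spec[:keep]              # keep only the lowest frequencies
    trend = np.fft.irfft(low, n=len(x))
    seasonal = x - trend                  # residual carries the periodic part
    return trend, seasonal

t = np.arange(200, dtype=float)
x = 0.05 * t + np.sin(2 * np.pi * t / 20)   # linear trend plus seasonality
trend, seasonal = freq_decompose(x)
```

By construction the two components sum back to the original series exactly, so each component can then be routed to its own prediction task, in the spirit of the clustering-matching strategy described above.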

AAAI Conference 2024 Conference Paper

Contrastive Balancing Representation Learning for Heterogeneous Dose-Response Curves Estimation

  • Minqin Zhu
  • Anpeng Wu
  • Haoxuan Li
  • Ruoxuan Xiong
  • Bo Li
  • Xiaoqing Yang
  • Xuan Qin
  • Peng Zhen

Estimating the individuals' potential response to varying treatment doses is crucial for decision-making in areas such as precision medicine and management science. Most recent studies predict counterfactual outcomes by learning a covariate representation that is independent of the treatment variable. However, such independence constraints neglect much of the covariate information that is useful for counterfactual prediction, especially when the treatment variables are continuous. To tackle the above issue, in this paper, we first theoretically demonstrate the importance of the balancing and prognostic representations for unbiased estimation of the heterogeneous dose-response curves, that is, the learned representations are constrained to satisfy the conditional independence between the covariates and both of the treatment variables and the potential responses. Based on this, we propose a novel Contrastive balancing Representation learning Network using a partial distance measure, called CRNet, for estimating the heterogeneous dose-response curves without losing the continuity of treatments. Extensive experiments are conducted on synthetic and real-world datasets demonstrating that our proposal significantly outperforms previous methods.

AAAI Conference 2024 Conference Paper

De-biased Attention Supervision for Text Classification with Causality

  • Yiquan Wu
  • Yifei Liu
  • Ziyu Zhao
  • Weiming Lu
  • Yating Zhang
  • Changlong Sun
  • Fei Wu
  • Kun Kuang

In text classification models, while the unsupervised attention mechanism can enhance performance, it often produces attention distributions that are puzzling to humans, such as assigning high weight to seemingly insignificant conjunctions. Recently, numerous studies have explored Attention Supervision (AS) to guide the model toward more interpretable attention distributions. However, such AS can impact classification performance, especially in specialized domains. In this paper, we address this issue from a causality perspective. Firstly, we leverage the causal graph to reveal two biases in the AS: 1) Bias caused by the label distribution of the dataset. 2) Bias caused by the words' different occurrence ranges that some words can occur across labels while others only occur in a particular label. We then propose a novel De-biased Attention Supervision (DAS) method to eliminate these biases with causal techniques. Specifically, we adopt backdoor adjustment on the label-caused bias and reduce the word-caused bias by subtracting the direct causal effect of the word. Through extensive experiments on two professional text classification datasets (e.g., medicine and law), we demonstrate that our method achieves improved classification accuracy along with more coherent attention distributions.

AAAI Conference 2024 Conference Paper

Learning to Reweight for Generalizable Graph Neural Network

  • Zhengyu Chen
  • Teng Xiao
  • Kun Kuang
  • Zheqi Lv
  • Min Zhang
  • Jinluan Yang
  • Chengqiang Lu
  • Hongxia Yang

Graph Neural Networks (GNNs) show promising results for graph tasks. However, existing GNNs' generalization ability will degrade when there exist distribution shifts between testing and training graph data. The fundamental reason for the severe degeneration is that most GNNs are designed based on the i.i.d. hypothesis. In such a setting, GNNs tend to exploit subtle statistical correlations existing in the training set for predictions, even when they are spurious correlations. In this paper, we study the problem of the generalization ability of GNNs in Out-Of-Distribution (OOD) settings. To solve this problem, we propose the Learning to Reweight for Generalizable Graph Neural Network (L2R-GNN) to enhance the generalization ability for achieving satisfactory performance on unseen testing graphs whose distributions differ from the training graphs. We propose a novel nonlinear graph decorrelation method, which can substantially improve the out-of-distribution generalization ability and compares favorably to previous methods in restraining the over-reduced sample size. The variables of graph representation are clustered based on the stability of their correlations, and the graph decorrelation method learns weights to remove correlations between the variables of different clusters rather than any two variables. Besides, we introduce an effective stochastic algorithm based on bi-level optimization for the L2R-GNN framework, which enables simultaneously learning the optimal weights and GNN parameters, and avoids the over-fitting issue. Experiments show that L2R-GNN greatly outperforms baselines on various graph prediction benchmarks under distribution shifts.

AAAI Conference 2024 Conference Paper

Null Space Matters: Range-Null Decomposition for Consistent Multi-Contrast MRI Reconstruction

  • Jiacheng Chen
  • Jiawei Jiang
  • Fei Wu
  • Jianwei Zheng

Consistency and interpretability have long been the critical issues in MRI reconstruction. While interpretability has been dramatically improved with the employment of deep unfolding networks (DUNs), current methods still suffer from inconsistencies and generate inferior anatomical structure. Especially in multi-contrast scenes, different imaging protocols often exacerbate the concerned issue. In this paper, we propose a range-null decomposition-assisted DUN architecture to ensure consistency while still providing desirable interpretability. Given the input decomposed, we argue that the inconsistency could be analytically relieved by feeding solely the null-space component into proximal mapping, while leaving the range-space counterpart fixed. More importantly, a correlation decoupling scheme is further proposed to narrow the information gap for multi-contrast fusion, which dynamically borrows isotropic features from the opponent while maintaining the modality-specific ones. Specifically, the two features are attached to different frequencies and learned individually by the newly designed isotropy encoder and anisotropy encoder. The former strives for the contrast-shared information, while the latter serves to capture the contrast-specific features. The quantitative and qualitative results show that our proposal outperforms most cutting-edge methods by a large margin. Codes will be released on https://github.com/chenjiachengzzz/RNU.

AAAI Conference 2024 Conference Paper

RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction

  • Yemin Yu
  • Luotian Yuan
  • Ying Wei
  • Hanyu Gao
  • Fei Wu
  • Zhihua Wang
  • Xinhai Ye

Machine learning-assisted retrosynthesis prediction models have been gaining widespread adoption, though their performances oftentimes degrade significantly when deployed in real-world applications embracing out-of-distribution (OOD) molecules or reactions. Despite steady progress on standard benchmarks, our understanding of existing retrosynthesis prediction models under the premise of distribution shifts remains stagnant. To this end, we first formally sort out two types of distribution shifts in retrosynthesis prediction and construct two groups of benchmark datasets. Next, through comprehensive experiments, we systematically compare state-of-the-art retrosynthesis prediction models on the two groups of benchmarks, revealing the limitations of previous in-distribution evaluation and re-examining the advantages of each model. More remarkably, we are motivated by the above empirical insights to propose two model-agnostic techniques that can improve the OOD generalization of arbitrary off-the-shelf retrosynthesis prediction algorithms. Our preliminary experiments show their high potential with an average performance improvement of 4.6%, and the established benchmarks serve as a foothold for further retrosynthesis prediction research towards OOD generalization.

NeurIPS Conference 2024 Conference Paper

Revisiting Score Propagation in Graph Out-of-Distribution Detection

  • Longfei Ma
  • Yiyou Sun
  • Kaize Ding
  • Zemin Liu
  • Fei Wu

The field of graph learning has been substantially advanced by the development of deep learning models, in particular graph neural networks. However, one salient yet largely under-explored challenge is detecting Out-of-Distribution (OOD) nodes on graphs. Prevailing OOD detection techniques developed in other domains, such as computer vision, do not cater to the interconnected nature of graphs. This work aims to fill this gap by exploring the potential of a simple yet effective method -- OOD score propagation, which propagates OOD scores among neighboring nodes along the graph structure. This post hoc solution can be easily integrated with existing OOD scoring functions, showcasing its excellent flexibility and effectiveness in most scenarios. However, the conditions under which score propagation proves beneficial remain not fully elucidated. Our study meticulously derives these conditions and, inspired by this discovery, introduces an innovative edge augmentation strategy with theoretical guarantee. Empirical evaluations affirm the superiority of our proposed method, outperforming strong OOD detection baselines in various scenarios and settings.
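The post hoc propagation idea the abstract describes can be illustrated generically: smooth per-node OOD scores over the row-normalized adjacency matrix for a few steps. This is a minimal sketch of the general technique, not the paper's exact scheme; the mixing weight `alpha` and the step count are assumed values.

```python
import numpy as np

def propagate_scores(scores, adj, alpha=0.5, steps=2):
    """Post hoc OOD score propagation: repeatedly mix each node's score
    with the mean score of its neighbors along the graph structure."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                  # isolated nodes keep their own score
    norm_adj = adj / deg                 # row-normalized adjacency
    s = np.asarray(scores, dtype=float)
    for _ in range(steps):
        s = alpha * s + (1 - alpha) * norm_adj @ s
    return s

# Toy path graph 0-1-2: node 2 has a high raw OOD score but sits next to
# low-score (in-distribution) neighbors, so propagation pulls it down.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
scores = np.array([0.1, 0.2, 0.9])
smoothed = propagate_scores(scores, adj)
```

Because the smoothed score can be computed from any base OOD scoring function's output, the approach plugs in after scoring, which matches the "post hoc" framing above.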

AAAI Conference 2023 Conference Paper

DE-net: Dynamic Text-Guided Image Editing Adversarial Networks

  • Ming Tao
  • Bing-Kun Bao
  • Hao Tang
  • Fei Wu
  • Longhui Wei
  • Qi Tian

Text-guided image editing models have shown remarkable results. However, there remain two problems. First, they employ fixed manipulation modules for various editing requirements (e.g., color changing, texture changing, content adding and removing), which results in over-editing or insufficient editing. Second, they do not clearly distinguish between text-required and text-irrelevant parts, which leads to inaccurate editing. To solve these limitations, we propose: (i) a Dynamic Editing Block (DEBlock) that composes different editing modules dynamically for various editing requirements. (ii) a Composition Predictor (Comp-Pred), which predicts the composition weights for DEBlock according to the inference on target texts and source images. (iii) a Dynamic text-adaptive Convolution Block (DCBlock) that queries source image features to distinguish text-required parts and text-irrelevant parts. Extensive experiments demonstrate that our DE-Net achieves excellent performance and manipulates source images more correctly and accurately.

NeurIPS Conference 2023 Conference Paper

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

  • Junkun Yuan
  • Xinyu Zhang
  • Hao Zhou
  • Jian Wang
  • Zhongwei Qiu
  • Zhiyin Shao
  • Shaofeng Zhang
  • Sifan Long

Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior - human parts - into pre-training. Specifically, we employ this prior to guide the mask sampling process. Image patches, corresponding to human part regions, have high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image. We term the entire method as HAP. HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.

AAAI Conference 2023 Conference Paper

Learning Chemical Rules of Retrosynthesis with Pre-training

  • Yinjie Jiang
  • Ying Wei
  • Fei Wu
  • Zhengxing Huang
  • Kun Kuang
  • Zhihua Wang

Retrosynthesis aided by artificial intelligence has been a very active and burgeoning area of research, for its critical role in drug discovery as well as material science. Three categories of solutions, i.e., template-based, template-free, and semi-template methods, constitute mainstream solutions to this problem. In this paper, we focus on template-free methods which are known to be less bothered by the template generalization issue and the atom mapping challenge. Among several remaining problems regarding template-free methods, failing to conform to chemical rules is pronounced. To address the issue, we seek a pre-training solution to empower the pre-trained model with chemical rules encoded. Concretely, we enforce the atom conservation rule via a molecule reconstruction pre-training task, and the reaction rule that dictates reaction centers via a reaction type guided contrastive pre-training task. In our empirical evaluation, the proposed pre-training solution substantially improves the single-step retrosynthesis accuracies in three downstream datasets.

AAAI Conference 2023 Conference Paper

Learning Instrumental Variable from Data Fusion for Treatment Effect Estimation

  • Anpeng Wu
  • Kun Kuang
  • Ruoxuan Xiong
  • Minqin Zhu
  • Yuxuan Liu
  • Bo Li
  • Furui Liu
  • Zhihua Wang

The advent of the big data era brought new opportunities and challenges to draw treatment effect in data fusion, that is, a mixed dataset collected from multiple sources (each source with an independent treatment assignment mechanism). Due to possibly omitted source labels and unmeasured confounders, traditional methods cannot estimate individual treatment assignment probability and infer treatment effect effectively. Therefore, we propose to reconstruct the source label and model it as a Group Instrumental Variable (GIV) to implement IV-based Regression for treatment effect estimation. In this paper, we conceptualize this line of thought and develop a unified framework (Meta-EM) to (1) map the raw data into a representation space to construct Linear Mixed Models for the assigned treatment variable; (2) estimate the distribution differences and model the GIV for the different treatment assignment mechanisms; and (3) adopt an alternating training strategy to iteratively optimize the representations and the joint distribution to model GIV for IV regression. Empirical results demonstrate the advantages of our Meta-EM compared with state-of-the-art methods. The project page with the code and the Supplementary materials is available at https://github.com/causal-machine-learning-lab/meta-em.

NeurIPS Conference 2023 Conference Paper

PTADisc: A Cross-Course Dataset Supporting Personalized Learning in Cold-Start Scenarios

  • Liya Hu
  • Zhiang Dong
  • Jingyuan Chen
  • Guifeng Wang
  • Zhihua Wang
  • Zhou Zhao
  • Fei Wu

The focus of our work is on diagnostic tasks in personalized learning, such as cognitive diagnosis and knowledge tracing. The goal of these tasks is to assess students' latent proficiency on knowledge concepts through analyzing their historical learning records. However, existing research has been limited to single-course scenarios; cross-course studies have not been explored due to a lack of dataset. We address this issue by constructing PTADisc, a Diverse, Immense, Student-centered dataset that emphasizes its sufficient Cross-course information for personalized learning. PTADisc includes 74 courses, 1,530,100 students, 4,054 concepts, 225,615 problems, and over 680 million student response logs. Based on PTADisc, we developed a model-agnostic Cross-Course Learner Modeling Framework (CCLMF) which utilizes relationships between students' proficiency across courses to alleviate the difficulty of diagnosing student knowledge state in cold-start scenarios. CCLMF uses a meta network to generate personalized mapping functions between courses. The experimental results on PTADisc verify the effectiveness of CCLMF with an average improvement of 4.2% on AUC. We also report the performance of baseline models for cognitive diagnosis and knowledge tracing over PTADisc, demonstrating that our dataset supports a wide scope of research in personalized learning. Additionally, PTADisc contains valuable programming logs and student-group information that are worth exploring in the future.

NeurIPS Conference 2023 Conference Paper

Two Heads are Better Than One: A Simple Exploration Framework for Efficient Multi-Agent Reinforcement Learning

  • Jiahui Li
  • Kun Kuang
  • Baoxiang Wang
  • Xingchen Li
  • Fei Wu
  • Jun Xiao
  • Long Chen

Exploration strategy plays an important role in reinforcement learning, especially in sparse-reward tasks. In cooperative multi-agent reinforcement learning (MARL), designing a suitable exploration strategy is much more challenging due to the large state space and the complex interaction among agents. Currently, mainstream exploration methods in MARL either contribute to exploring the unfamiliar states which are large and sparse, or measure the interaction among agents with high computational costs. We found an interesting phenomenon that different kinds of exploration play different roles in different MARL scenarios, and choosing a suitable one is often more effective than designing an exquisite algorithm. In this paper, we propose an exploration method that incorporates Curiosity-based and Influence-based exploration (COIN), which is simple but effective in various situations. First, COIN measures the influence of each agent on the other agents based on mutual information theory and designs it as intrinsic rewards which are applied to each individual value function. Moreover, COIN computes the curiosity-based intrinsic rewards via prediction errors, which are added to the extrinsic reward. To integrate the two kinds of intrinsic rewards, COIN utilizes a novel framework in which they complement each other and lead to sufficient and effective exploration on cooperative MARL tasks. We perform extensive experiments on different challenging benchmarks, and results across different scenarios show the superiority of our method.
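The reward shaping sketched in the abstract can be illustrated generically: a curiosity bonus computed as a forward model's prediction error is added to the extrinsic reward. This is a toy sketch, not COIN's implementation; the coefficient `beta_cur`, the helper names, and the toy prediction values are all assumptions for illustration (the influence-based term, per the abstract, would instead be applied to each agent's individual value function).

```python
import numpy as np

def curiosity_bonus(pred_next, actual_next):
    """Curiosity-style intrinsic reward: squared prediction error of a
    forward model that tries to predict the next observation."""
    diff = np.asarray(pred_next, dtype=float) - np.asarray(actual_next, dtype=float)
    return float(np.sum(diff ** 2))

def shaped_reward(r_ext, pred_next, actual_next, beta_cur=0.1):
    """Extrinsic reward plus a scaled curiosity bonus (beta_cur is an
    assumed illustrative coefficient)."""
    return r_ext + beta_cur * curiosity_bonus(pred_next, actual_next)

# A poorly predicted transition earns a larger shaped reward, encouraging
# the team to visit states its forward model does not yet understand.
r = shaped_reward(1.0, pred_next=[0.0, 0.0], actual_next=[1.0, 1.0])
# curiosity_bonus = (0-1)^2 + (0-1)^2 = 2.0, so r = 1.0 + 0.1 * 2.0 = 1.2
```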

AAAI Conference 2023 Conference Paper

Video-Audio Domain Generalization via Confounder Disentanglement

  • Shengyu Zhang
  • Xusheng Feng
  • Wenyan Fan
  • Wenjing Fang
  • Fuli Feng
  • Wei Ji
  • Shuo Li
  • Li Wang

Existing video-audio understanding models are trained and evaluated in an intra-domain setting, facing performance degeneration in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlation to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both class-level and domain-level using half-sibling regression and unpaired domain transformation, which essentially identifies domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect a VADG-Action dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, i.e., multi-source DG, single-source DG, and qualitative analysis, validating the rationality of our causal analysis and the effectiveness of the DeVADG framework.

EAAI Journal 2022 Journal Article

Adversarial domain adaptation network with pseudo-siamese feature extractors for cross-bearing fault transfer diagnosis

  • Qunwang Yao
  • Quan Qian
  • Yi Qin
  • Liang Guo
  • Fei Wu

The traditional domain adaptation model just uses a single (siamese) feature extractor for mapping the source domain and target domain data to a feature space simultaneously, but it may not be well suited for cross-machine feature mapping. To improve the performance of cross-bearing fault transfer diagnosis, an adversarial domain adaptation network with pseudo-siamese feature extractors (PSFEN) is proposed. The core idea is to construct a pair of feature extractors with the same structure but not sharing parameters, which form a pair of pseudo-siamese feature extractors. When the source domain data differs greatly from the target domain data in cross-machine transfer diagnosis, the pair of pseudo-siamese feature extractors is used to extract the features of the source and target domains respectively, so that some characteristics exclusive to each domain can be captured in addition to the common characteristics. It is theoretically analyzed that the distribution discrepancy obtained by the pseudo-siamese feature extractors can be closer to its actual upper limit. By reducing this more realistic supremum, the domain adaptation can be better achieved, thus improving the transfer diagnosis accuracy. Then, a distance metric of maximum mean discrepancy and an unbalanced adversarial training algorithm are integrated to train the pseudo-siamese feature extractors and reduce the discrepancy between the source and target domains. The effectiveness of the proposed method is verified by experiments on six cross-bearing fault transfer diagnosis tasks. The comparative results show that the proposed method has much higher diagnostic accuracy compared to six classical models.

AIIM Journal 2022 Journal Article

Attribute-aware interpretation learning for thyroid ultrasound diagnosis

  • Ming Kong
  • Qing Guo
  • Shuowen Zhou
  • Mengze Li
  • Kun Kuang
  • Zhengxing Huang
  • Fei Wu
  • Xiaohong Chen

Thyroid nodule diagnosis from ultrasound images is a critical computer-aided diagnosis task. Previous works tried to imitate the doctor's diagnosis logic by considering the key attributes to improve the diagnosis performance and explaining the conclusion. However, their clinical feasibilities are still ambiguous because of the ignorance of the correlation between attribute features and global characteristics, as well as the lack of clinical effectiveness evaluation of result interpretations. Following the common logic of ultrasonic investigation, we design a novel Attribute-Aware Interpretation Learning (AAIL) model, consisting of attribute properties discovery module and attribute-global feature fusion module. Adequate result interpretation ensures reliability and transparency of diagnostic conclusions, including the visualization of attribute features and the relationship between attributes and the global feature. Extensive experiments on a practical dataset demonstrate the model's effectiveness, and an innovative human-computer collaborative experiment demonstrates the auxiliary diagnostic ability of the interpretations that can benefit professional doctors.

NeurIPS Conference 2022 Conference Paper

ConfounderGAN: Protecting Image Data Privacy with Causal Confounder

  • Qi Tian
  • Kun Kuang
  • Kelu Jiang
  • Furui Liu
  • Zhihua Wang
  • Fei Wu

The success of deep learning is partly attributed to the availability of massive data downloaded freely from the Internet. However, it also means that users' private data may be collected by commercial organizations without consent and used to train their models. Therefore, it's important and necessary to develop a method or tool to prevent unauthorized data exploitation. In this paper, we propose ConfounderGAN, a generative adversarial network (GAN) that can make personal image data unlearnable to protect the data privacy of its owners. Specifically, the noise produced by the generator for each image has the confounder property. It can build spurious correlations between images and labels, so that the model cannot learn the correct mapping from images to labels in this noise-added dataset. Meanwhile, the discriminator is used to ensure that the generated noise is small and imperceptible, thereby retaining the normal utility of the encrypted image for humans. The experiments are conducted on six image classification datasets, including three natural object datasets and three medical datasets. The results demonstrate that our method not only outperforms state-of-the-art methods in standard settings, but can also be applied to fast encryption scenarios. Moreover, we show a series of transferability and stability experiments to further illustrate the effectiveness and superiority of our method.

AAAI Conference 2022 Conference Paper

Feature Distillation Interaction Weighting Network for Lightweight Image Super-resolution

  • Guangwei Gao
  • Wenjie Li
  • Juncheng Li
  • Fei Wu
  • Huimin Lu
  • Yi Yu

Convolutional neural network-based single-image super-resolution (SISR) has made great progress in recent years. However, it is difficult to apply these methods to real-world scenarios due to the computational and memory cost. Meanwhile, how to take full advantage of the intermediate features under the constraints of limited parameters and calculations is also a huge challenge. To alleviate these issues, we propose a lightweight yet efficient Feature Distillation Interaction Weighted Network (FDIWN). Specifically, FDIWN utilizes a series of specially designed Feature Shuffle Weighted Groups (FSWG) as the backbone, and several novel mutual Wide-residual Distillation Interaction Blocks (WDIB) form an FSWG. In addition, Wide Identical Residual Weighting (WIRW) units and Wide Convolutional Residual Weighting (WCRW) units are introduced into WDIB for better feature distillation. Moreover, a Wide-Residual Distillation Connection (WRDC) framework and a Self-Calibration Fusion (SCF) unit are proposed to interact features with different scales more flexibly and efficiently. Extensive experiments show that our FDIWN is superior to other models, striking a good balance between model performance and efficiency. The code is available at https://github.com/IVIPLab/FDIWN.

NeurIPS Conference 2022 Conference Paper

GRASP: Navigating Retrosynthetic Planning with Goal-driven Policy

  • Yemin Yu
  • Ying Wei
  • Kun Kuang
  • Zhengxing Huang
  • Huaxiu Yao
  • Fei Wu

Retrosynthetic planning occupies a crucial position in synthetic chemistry and, accordingly, drug discovery, which aims to find synthetic pathways of a target molecule through a sequential decision-making process on a set of feasible reactions. While the majority of recent works focus on the prediction of feasible reactions at each step, there have been limited attempts toward improving the sequential decision-making policy. Existing strategies rely on either the expensive and high-variance value estimation by online rollout, or a settled value estimation neural network pre-trained with simulated pathways of limited diversity and no negative feedback. Besides, how to return multiple candidate pathways that are not only diverse but also desirable for chemists (e.g., affordable building block materials) remains an open challenge. To this end, we propose a Goal-dRiven Actor-critic retroSynthetic Planning (GRASP) framework, where we identify the policy that performs goal-driven retrosynthesis navigation toward a user-demand objective. Our experiments on the benchmark Pistachio dataset and a chemists-designed dataset demonstrate that the framework outperforms state-of-the-art approaches by up to 32.2% on search efficiency and 5.6% on quality. Remarkably, our user studies show that GRASP successfully plans pathways that accomplish the prescribed goal (building block materials).

IJCAI Conference 2022 Conference Paper

RoSA: A Robust Self-Aligned Framework for Node-Node Graph Contrastive Learning

  • Yun Zhu
  • Jianhao Guo
  • Fei Wu
  • Siliang Tang

Graph contrastive learning has made significant progress recently. However, existing works have rarely explored non-aligned node-node contrasting. In this paper, we propose a novel graph contrastive learning method named RoSA that focuses on utilizing non-aligned augmented views for node-level representation learning. First, we leverage the earth mover's distance to model the minimum effort to transform the distribution of one view to the other as our contrastive objective, which does not require alignment between views. Then we introduce adversarial training as an auxiliary method to increase sampling diversity and enhance the robustness of our model. Experimental results show that RoSA outperforms a series of graph contrastive learning frameworks on homophilous, non-homophilous and dynamic graphs, which validates the effectiveness of our work. To the best of our knowledge, RoSA is the first work to focus on the non-aligned node-node graph contrastive learning problem. Our code is available at: https://github.com/ZhuYun97/RoSA
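The earth mover's distance objective mentioned in this abstract can be made concrete with a toy computation. The snippet below is an illustrative sketch only, not the authors' RoSA implementation: it computes the exact EMD between two sets of node embeddings with uniform weights by solving the underlying transport problem as a linear program, which notably works even when the two views contain different numbers of nodes (i.e., no alignment is required).

```python
import numpy as np
from scipy.optimize import linprog

def emd(x, y):
    """Exact earth mover's distance between two point clouds with uniform
    weights; x: (n, d), y: (m, d). Solved as a small transport LP."""
    n, m = len(x), len(y)
    # pairwise Euclidean ground cost between the two embedding sets
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    c = cost.ravel()
    # equality constraints: each source has mass 1/n, each target mass 1/m
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums of the transport plan
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column sums of the transport plan
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

A differentiable Sinkhorn approximation would replace the LP in practice; the LP form is used here only because it is exact and easy to verify.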

IJCAI Conference 2022 Conference Paper

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

  • Zhenhui Ye
  • Zhou Zhao
  • Yi Ren
  • Fei Wu

The recent progress in non-autoregressive text-to-speech (NAR-TTS) has made fast and high-quality speech synthesis possible. However, current NAR-TTS models usually use phoneme sequence as input and thus cannot understand the tree-structured syntactic information of the input sequence, which hurts the prosody modeling. To this end, we propose SyntaSpeech, a syntax-aware and light-weight NAR-TTS model, which integrates tree-structured syntactic information into the prosody modeling modules in PortaSpeech. Specifically, 1) We build a syntactic graph based on the dependency tree of the input sentence, then process the text encoding with a syntactic graph encoder to extract the syntactic information. 2) We incorporate the extracted syntactic encoding with PortaSpeech to improve the prosody prediction. 3) We introduce a multi-length discriminator to replace the flow-based post-net in PortaSpeech, which simplifies the training pipeline and improves the inference speed, while keeping the naturalness of the generated audio. Experiments on three datasets not only show that the tree-structured syntactic information grants SyntaSpeech the ability to synthesize better audio with expressive prosody, but also demonstrate the generalization ability of SyntaSpeech to adapt to multiple languages and multi-speaker text-to-speech. Ablation studies demonstrate the necessity of each component in SyntaSpeech. Source code and audio samples are available at https://syntaspeech.github.io.

AAAI Conference 2021 Conference Paper

Judgment Prediction via Injecting Legal Knowledge into Neural Networks

  • Leilei Gan
  • Kun Kuang
  • Yi Yang
  • Fei Wu

Legal Judgment Prediction (LJP) is a key problem in legal artificial intelligence, which aims to predict a law case's judgment based on a given text describing the facts of the case. Most previous works treat LJP as a text classification task and generally adopt deep neural network (DNN) based methods to solve it. However, existing DNN-based models are data-hungry, and it is hard to explain which legal knowledge a prediction is based on. Thus, injecting legal knowledge into neural networks to interpret the model and improve performance remains a significant problem. In this paper, we propose to represent declarative legal knowledge as a set of first-order logic rules and integrate these logic rules into a co-attention network-based model explicitly. The use of logic rules enhances neural networks with direct logical reasoning capabilities and makes the model more interpretable. We take the private loan scenario as a case study and demonstrate the effectiveness of the proposed method through comprehensive experiments and analyses conducted on the collected dataset.

AAAI Conference 2021 Conference Paper

MANGO: A Mask Attention Guided One-Stage Scene Text Spotter

  • Liang Qiao
  • Ying Chen
  • Zhanzhan Cheng
  • Yunlu Xu
  • Yi Niu
  • Shiliang Pu
  • Fei Wu

Recently, end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region of interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such a framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, in this paper we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operations. Concretely, a position-aware mask attention module is developed to generate attention weights for each text instance and its characters. It allows different text instances in an image to be allocated to different feature map channels, which are further grouped as a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g., a rectangular bounding box) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.

AAAI Conference 2021 Short Paper

Modeling High-order Interactions across Multi-interests for Micro-video Recommendation (Student Abstract)

  • Dong Yao
  • Shengyu Zhang
  • Zhou Zhao
  • Wenyan Fan
  • Jieming Zhu
  • Xiuqiang He
  • Fei Wu

Personalized recommendation systems have become pervasive in various video platforms. Many effective methods have been proposed, but most of them do not capture users' multi-level interest traits and the dependencies between their viewed micro-videos well. To solve these problems, we propose a Self-over-Co Attention module to enhance the user's interest representation. In particular, we first use co-attention to model correlation patterns across different levels and then use self-attention to model correlation patterns within a specific level. Experimental results on filtered public datasets verify that our presented module is useful.

AAAI Conference 2021 Conference Paper

SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition

  • Chengwei Zhang
  • Yunlu Xu
  • Zhanzhan Cheng
  • Shiliang Pu
  • Yi Niu
  • Fei Wu
  • Futai Zou

Arbitrary text appearance poses a great challenge in scene text recognition tasks. Existing works mostly handle the problem in consideration of shape distortion, including perspective distortions, line curvature or other style variations. Rectification (i.e., spatial transformers) as the preprocessing stage is one popular approach and has been extensively studied. However, chromatic difficulties in complex scenes have not received much attention. In this work, we introduce a new learnable geometric-unrelated rectification, the Structure-Preserving Inner Offset Network (SPIN), which allows the color manipulation of source data within the network. This differentiable module can be inserted before any recognition architecture to ease the downstream tasks, giving neural networks the ability to actively transform input intensity rather than only perform spatial rectification. It can also serve as a complementary module to known spatial transformations and work in both independent and collaborative ways with them. Extensive experiments show the proposed transformation outperforms existing rectification networks and has performance comparable with the state of the art.

IJCAI Conference 2020 Conference Paper

Dress like an Internet Celebrity: Fashion Retrieval in Videos

  • Hongrui Zhao
  • Jin Yu
  • Yanan Li
  • Donghui Wang
  • Jie Liu
  • Hongxia Yang
  • Fei Wu

Nowadays, both online shopping and video sharing have grown exponentially. Although internet celebrities in videos are an ideal exhibition for fashion corporations to sell their products, audiences do not always know where to buy the fashion products in videos, which is a cross-domain problem called video-to-shop. In this paper, we propose a novel deep neural network, called the Detect, Pick, and Retrieval Network (DPRNet), to bridge the gap between fashion products in videos and audiences. For the video side, we have modified the traditional object detector so that it automatically picks out the best object proposals for every commodity in videos without duplication, to promote the performance of the video-to-shop task. For the fashion retrieval side, a simple but effective multi-task loss network obtains new state-of-the-art results on DeepFashion. Extensive experiments conducted on a new large-scale cross-domain video-to-shop dataset show that DPRNet is efficient and outperforms the state-of-the-art methods on the video-to-shop task.

IJCAI Conference 2020 Conference Paper

Polar Relative Positional Encoding for Video-Language Segmentation

  • Ke Ning
  • Lingxi Xie
  • Fei Wu
  • Qi Tian

In this paper, we tackle a challenging task named video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in video frames. To accurately denote a target object, the given sentence usually refers to multiple attributes, such as nearby objects with spatial relations, etc. In this paper, we propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a ``linguistic'' way, i.e., in terms of direction and range. Sentence features can interact with positional embeddings in a more direct way to extract the implied relative positional relations. We also propose parameterized functions for these positional embeddings to adapt to real-valued directions and ranges. With PRPE, we design a Polar Attention Module (PAM) as the basic module for vision-language fusion. Our method outperforms the previous best method by a large margin of 11.4% absolute improvement in terms of mAP on the challenging A2D Sentences dataset. Our method also achieves competitive performance on the J-HMDB Sentences dataset.
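The core geometric idea in this abstract, representing relative position as a direction and a range, can be sketched in a few lines. This is an illustrative computation of polar relative coordinates with respect to a reference point, not the paper's parameterized embedding functions.

```python
import numpy as np

def polar_relative(coords, ref):
    """Polar relative position of each point w.r.t. a reference point.
    coords: (n, 2) array of (x, y) positions; ref: (2,) reference point.
    Returns (angle, radius): direction in radians and range (distance)."""
    d = coords - ref
    angle = np.arctan2(d[:, 1], d[:, 0])   # direction of each offset
    radius = np.linalg.norm(d, axis=1)     # range of each offset
    return angle, radius
```

In the paper, such (direction, range) pairs feed into learned embedding functions rather than being used raw; the sketch only shows the coordinate transform itself.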

NeurIPS Conference 2020 Conference Paper

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

  • Xiaoya Li
  • Yuxian Meng
  • Mingxin Zhou
  • Qinghong Han
  • Fei Wu
  • Jiwei Li

While the self-attention mechanism has been widely used across a variety of tasks, it has the unfortunate property of a quadratic cost with respect to the input length, which makes it difficult to deal with long inputs. In this paper, we present a method for accelerating and structuring self-attention: Sparse Adaptive Connection (SAC). In SAC, we regard the input sequence as a graph, and attention operations are performed between linked nodes. In contrast with previous self-attention models with pre-defined structures (edges), the model learns to construct attention edges to improve task-specific performance. In this way, the model is able to select the most salient nodes and reduce the quadratic complexity regardless of the sequence length. Based on SAC, we show that previous variants of self-attention models are its special cases. Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate SAC is competitive with state-of-the-art models while significantly reducing memory cost.
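The edge-restricted attention described in this abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the SAC model itself (SAC additionally *learns* which edges to construct): attention scores are simply masked so that each node attends only to its neighbors in a given adjacency matrix, with self-loops assumed so every row has at least one edge.

```python
import numpy as np

def sparse_attention(q, k, v, adj):
    """Single-head attention where node i attends only to nodes j with
    adj[i, j] = 1. q, k, v: (n, d) arrays; adj: (n, n) 0/1 mask that
    should include self-loops so every row has at least one valid edge."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(adj > 0, scores, -np.inf)     # mask non-edges
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    w = np.exp(scores)                              # masked entries become 0
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With a dense all-ones mask this reduces to ordinary full self-attention; the memory saving comes from materializing only the linked entries, which the dense sketch above does not attempt.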

IJCAI Conference 2020 Conference Paper

SEBF: A Single-Chain based Extension Model of Blockchain for Fintech

  • Yimu Ji
  • Weiheng Gu
  • Fei Chen
  • Xiaoying Xiao
  • Jing Sun
  • Shangdong Liu
  • Jing He
  • Yunyao Li

The traditional blockchain has the shortcoming that a single chain can only deal with one or a few specific data types. The research question of how to enable a blockchain to deal with various data types has not been well studied. In this paper, we propose a single-chain based extension model of blockchain for fintech (SEBF). For the financial environment, we design a four-layer architecture for this model. By employing an external trusted oracle group and a financial regulator agency, a variety of data types can be effectively stored in the blockchain, such that data type extension based on a single chain is realized. The experimental results indicate that the proposed model can improve the efficiency of simplified payment verification.

AAAI Conference 2020 Conference Paper

Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting

  • Liang Qiao
  • Sanli Tang
  • Zhanzhan Cheng
  • Yunlu Xu
  • Yi Niu
  • Shiliang Pu
  • Fei Wu

Many approaches have recently been proposed to detect irregular scene text and have achieved promising results. However, their localization results may not serve the subsequent text recognition part well, mainly for two reasons: 1) recognizing arbitrary shaped text is still a challenging task, and 2) prevalent non-trainable pipeline strategies between text detection and text recognition lead to suboptimal performance. To handle this incompatibility problem, in this paper we propose an end-to-end trainable text spotting approach named Text Perceptron. Concretely, Text Perceptron first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information. Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies without extra parameters. It unites text detection and the following recognition part into a whole framework, and helps the whole network achieve global optimization. Experiments show that our method achieves competitive performance on two standard text benchmarks, i.e., ICDAR 2013 and ICDAR 2015, and also obviously outperforms existing methods on the irregular text benchmarks SCUT-CTW1500 and Total-Text.

AAAI Conference 2019 Conference Paper

Cross-Relation Cross-Bag Attention for Distantly-Supervised Relation Extraction

  • Yujin Yuan
  • Liyuan Liu
  • Siliang Tang
  • Zhongfei Zhang
  • Yueting Zhuang
  • Shiliang Pu
  • Fei Wu
  • Xiang Ren

Distant supervision leverages knowledge bases to automatically label instances, thus allowing us to train a relation extractor without human annotations. However, the generated training data typically contain massive noise, and may result in poor performance with vanilla supervised learning. In this paper, we propose to conduct multi-instance learning with a novel Cross-relation Cross-bag Selective Attention (C2SA), which leads to noise-robust training for the distantly supervised relation extractor. Specifically, we employ sentence-level selective attention to reduce the effect of noisy or mismatched sentences, while the correlation among relations is captured to improve the quality of the attention weights. Moreover, instead of treating all entity pairs equally, we try to pay more attention to entity pairs of higher quality. Similarly, we adopt the selective attention mechanism to achieve this goal. Experiments with two types of relation extractors demonstrate the superiority of the proposed approach over the state of the art, while further ablation studies verify our intuitions and demonstrate the effectiveness of our proposed two techniques.

AAAI Conference 2019 Short Paper

Heterogeneous Attributed Network Embedding with Graph Convolutional Networks

  • Yueyang Wang
  • Ziheng Duan
  • Binbing Liao
  • Fei Wu
  • Yueting Zhuang

Network embedding, which assigns nodes in networks to low-dimensional representations, has received increasing attention in recent years. However, most existing approaches, especially the spectral-based methods, only consider the attributes in homogeneous networks. They are weak for heterogeneous attributed networks that involve different node types as well as rich node attributes and are common in real-world scenarios. In this paper, we propose HANE, a novel network embedding method based on Graph Convolutional Networks, that leverages both the heterogeneity and the node attributes to generate high-quality embeddings. The experiments on a real-world dataset show the effectiveness of our method.

AAAI Conference 2019 Conference Paper

Segregated Temporal Assembly Recurrent Networks for Weakly Supervised Multiple Action Detection

  • Yunlu Xu
  • Chengwei Zhang
  • Zhanzhan Cheng
  • Jianwen Xie
  • Yi Niu
  • Shiliang Pu
  • Fei Wu

This paper proposes a segregated temporal assembly recurrent (STAR) network for weakly-supervised multiple action detection. The model learns from untrimmed videos with only video-level label supervision and predicts the intervals of multiple actions. Specifically, we first assemble video clips according to class labels by an attention mechanism that learns class-variable attention weights and thus helps relieve noise from background or other actions. Secondly, we build temporal relationships between actions by feeding the assembled features into an enhanced recurrent neural network. Finally, we transform the output of the recurrent neural network into the corresponding action distribution. In order to generate more precise temporal proposals, we design a score term called segregated temporal gradient-weighted class activation mapping (ST-GradCAM) fused with attention weights. Experiments on the THUMOS'14 and ActivityNet1.3 datasets show that our approach outperforms the state-of-the-art weakly-supervised method, and performs on par with fully-supervised counterparts.

AAAI Conference 2019 Conference Paper

Spatio-Temporal Graph Routing for Skeleton-Based Action Recognition

  • Bin Li
  • Xi Li
  • Zhongfei Zhang
  • Fei Wu

With its representation effectiveness, skeleton-based human action recognition has received considerable research attention and has a wide range of real applications. In this area, many existing methods typically rely on a fixed physical-connectivity skeleton structure for recognition, which is incapable of well capturing the intrinsic high-order correlations among skeleton joints. In this paper, we propose a novel spatio-temporal graph routing (STGR) scheme for skeleton-based action recognition, which adaptively learns the intrinsic high-order connectivity relationships for physically-apart skeleton joints. Specifically, the scheme is composed of two components: a spatial graph router (SGR) and a temporal graph router (TGR). The SGR aims to discover the connectivity relationships among the joints based on sub-group clustering along the spatial dimension, while the TGR explores the structural information by measuring the correlation degrees between temporal joint node trajectories. The proposed scheme is naturally and seamlessly incorporated into the framework of graph convolutional networks (GCNs) to produce a set of skeleton-joint-connectivity graphs, which are further fed into the classification networks. Moreover, an insightful analysis of the receptive field of graph nodes is provided to explain the necessity of our method. Experimental results on two benchmark datasets (NTU-RGB+D and Kinetics) demonstrate the effectiveness against the state-of-the-art.

AAAI Conference 2018 Conference Paper

A Semantic QA-Based Approach for Text Summarization Evaluation

  • Ping Chen
  • Fei Wu
  • Tong Wang
  • Wei Ding

Many Natural Language Processing and Computational Linguistics applications involve the generation of new texts based on some existing texts, such as summarization, text simplification and machine translation. However, there has been a serious problem haunting these applications for decades: how to automatically and accurately assess their quality. In this paper, we present some preliminary results on one especially useful and challenging problem in NLP system evaluation: how to pinpoint content differences between two text passages (especially for large passages such as articles and books). Our idea is intuitive and very different from existing approaches. We treat one text passage as a small knowledge base, and ask it a large number of questions to exhaustively identify all content points in it. By comparing the correctly answered questions from two text passages, we are able to compare their content precisely. The experiment using the 2007 DUC summarization corpus clearly shows promising results.

IJCAI Conference 2018 Conference Paper

Attentional Image Retweet Modeling via Multi-Faceted Ranking Network Learning

  • Zhou Zhao
  • Lingtao Meng
  • Jun Xiao
  • Min Yang
  • Fei Wu
  • Deng Cai
  • Xiaofei He
  • Yueting Zhuang

Retweet prediction is a challenging problem in social media sites (SMS). In this paper, we study the problem of image retweet prediction in social media, which predicts the image sharing behavior in which users repost image tweets from their followees. Unlike previous studies, we learn a user preference ranking model from users' past retweeted image tweets in SMS. We first propose a heterogeneous image retweet modeling network (IRM) that exploits users' past retweeted image tweets with associated contexts, their following relations in SMS and the preferences of their followees. We then develop a novel attentional multi-faceted ranking network learning framework with multi-modal neural networks for the proposed heterogeneous IRM network to learn the joint image tweet representations and user preference representations for the prediction task. The extensive experiments on a large-scale dataset from the Twitter site show that our method achieves better performance than other state-of-the-art solutions to the problem.

AAAI Conference 2018 Conference Paper

Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction

  • Huaxiu Yao
  • Fei Wu
  • Jintao Ke
  • Xianfeng Tang
  • Yitian Jia
  • Siyu Lu
  • Pinghua Gong
  • Jieping Ye

Taxi demand prediction is an important building block to enabling intelligent transportation systems in a smart city. An accurate prediction model can help the city pre-allocate resources to meet travel demand and to reduce empty taxis on streets which waste energy and worsen the traffic congestion. With the increasing popularity of taxi requesting services such as Uber and Didi Chuxing (in China), we are able to collect large-scale taxi demand data continuously. How to utilize such big data to improve the demand prediction is an interesting and critical real-world problem. Traditional demand prediction methods mostly rely on time series forecasting techniques, which fail to model the complex non-linear spatial and temporal relations. Recent advances in deep learning have shown superior performance on traditionally challenging tasks such as image classification by learning the complex features and correlations from large-scale data. This breakthrough has inspired researchers to explore deep learning techniques on traffic prediction problems. However, existing methods on traffic prediction have only considered spatial relation (e.g., using CNN) or temporal relation (e.g., using LSTM) independently. We propose a Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework to model both spatial and temporal relations. Specifically, our proposed model consists of three views: temporal view (modeling correlations between future demand values with near time points via LSTM), spatial view (modeling local spatial correlation via local CNN), and semantic view (modeling correlations among regions sharing similar temporal patterns). Experiments on large-scale real taxi demand data demonstrate effectiveness of our approach over state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Dynamic Network Embedding by Modeling Triadic Closure Process

  • Lekui Zhou
  • Yang Yang
  • Xiang Ren
  • Fei Wu
  • Yueting Zhuang

Network embedding, which aims to learn the low-dimensional representations of vertices, is an important task and has attracted considerable research efforts recently. In the real world, networks, like social networks and biological networks, are dynamic and evolve over time. However, almost all the existing network embedding methods focus on static networks while ignoring network dynamics. In this paper, we present a novel representation learning approach, DynamicTriad, to preserve both structural information and evolution patterns of a given network. The general idea of our approach is to impose the triad, which is a group of three vertices and is one of the basic units of networks. In particular, we model how a closed triad, which consists of three vertices connected with each other, develops from an open triad that has two of three vertices not connected with each other. This triadic closure process is a fundamental mechanism in the formation and evolution of networks, thereby making our model able to capture the network dynamics and to learn representation vectors for each vertex at different time steps. Experimental results on three real-world networks demonstrate that, compared with several state-of-the-art techniques, DynamicTriad achieves substantial gains in several application scenarios. For example, our approach can effectively be applied to help identify telephone frauds in a mobile network, and to predict whether a user will repay her loans or not in a loan network.

IJCAI Conference 2018 Conference Paper

HST-LSTM: A Hierarchical Spatial-Temporal Long-Short Term Memory Network for Location Prediction

  • Dejiang Kong
  • Fei Wu

The wide use of positioning technology has made mining the movements of people feasible, and plenty of trajectory data have been accumulated. How to efficiently leverage these data for location prediction has become an increasingly popular research topic, as it is fundamental to location-based services (LBS). The existing methods often focus either on long-term (days or months) visit prediction (i.e., the recommendation of points of interest) or on real-time location prediction (i.e., trajectory prediction). In this paper, we are interested in the location prediction problem under a weak real-time condition and aim to predict users' movement in the next minutes or hours. We propose a Spatial-Temporal Long-Short Term Memory (ST-LSTM) model which naturally combines spatial-temporal influence into LSTM to mitigate the problem of data sparsity. Further, we employ a hierarchical extension of the proposed ST-LSTM (HST-LSTM) in an encoder-decoder manner which models the contextual historic visit information in order to boost the prediction performance. The proposed HST-LSTM is evaluated on a real-world trajectory data set and the experimental results demonstrate the effectiveness of the proposed model.

AAAI Conference 2018 Short Paper

Multi-Label Community-Based Question Classification via Personalized Sequence Memory Network Learning

  • Xinyu Duan
  • Shengyu Zhang
  • Zhou Zhao
  • Fei Wu
  • Yueting Zhuang

Multi-label community-based question classification is a challenging problem in Community-based Question Answering (CQA), arising in many real applications such as question navigation and expert finding. Most of the existing approaches treat the problem as a content-based tag suggestion task, which suffers from the textual sparsity issue. In this paper, we consider the problem from the viewpoint of personalized sequence learning. We introduce a personalized sequence memory network that leverages not only the semantics of questions but also the personalized information of askers to provide a sequence tag learning function that captures high-order tag dependency. The experiment on a real-world dataset shows the effectiveness of our method.

IJCAI Conference 2018 Conference Paper

Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks

  • Zhou Zhao
  • Zhu Zhang
  • Shuwen Xiao
  • Zhou Yu
  • Jun Yu
  • Deng Cai
  • Fei Wu
  • Yueting Zhuang

Open-ended long-form video question answering is a challenging problem in visual information retrieval, which automatically generates the natural language answer from the referenced long-form video content according to the question. However, existing video question answering works mainly focus on short-form video question answering, due to the lack of modeling of the semantic representation of long-form video contents. In this paper, we consider the problem of long-form video question answering from the viewpoint of adaptive hierarchical reinforced encoder-decoder network learning. We propose the adaptive hierarchical encoder network to learn the joint representation of the long-form video contents according to the question with adaptive video segmentation. We then develop the reinforced decoder network to generate the natural language answer for open-ended video question answering. We construct a large-scale long-form video question answering dataset. The extensive experiments show the effectiveness of our method.

AAAI Conference 2018 Conference Paper

Representation Learning for Scale-Free Networks

  • Rui Feng
  • Yang Yang
  • Wenjie Hu
  • Fei Wu
  • Yueting Zhang

Network embedding aims to learn the low-dimensional representations of vertexes in a network, while the structure and inherent properties of the network are preserved. Existing network embedding works primarily focus on preserving the microscopic structure, such as the first- and second-order proximity of vertexes, while the macroscopic scale-free property is largely ignored. The scale-free property depicts the fact that vertex degrees follow a heavy-tailed distribution (i.e., only a few vertexes have high degrees) and is a critical property of real-world networks, such as social networks. In this paper, we study the problem of learning representations for scale-free networks. We first theoretically analyze the difficulty of embedding and reconstructing a scale-free network in the Euclidean space, by converting our problem to the sphere packing problem. Then, we propose the "degree penalty" principle for designing scale-free property preserving network embedding algorithms: punishing the proximity between high-degree vertexes. We introduce two implementations of our principle by utilizing spectral techniques and a skip-gram model, respectively. Extensive experiments on six datasets show that our algorithms are able to not only reconstruct heavy-tailed degree distributions, but also outperform state-of-the-art embedding models in various network mining tasks, such as vertex classification and link prediction.
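The "degree penalty" principle stated in this abstract can be illustrated with a toy regularizer. This is a hypothetical sketch of the idea only (penalizing embedding proximity between vertex pairs, weighted by the product of their degrees, so high-degree pairs are punished most), not either of the paper's two actual implementations.

```python
import numpy as np

def degree_penalty(emb, adj, beta=1.0):
    """Toy degree-penalty regularizer: sum over distinct vertex pairs of
    (deg_i * deg_j) * similarity(i, j)^2, scaled by beta.
    emb: (n, d) embeddings; adj: (n, n) 0/1 adjacency matrix."""
    deg = adj.sum(axis=1)                  # vertex degrees
    sim = emb @ emb.T                      # inner-product proximity
    weight = np.outer(deg, deg)            # degree-product weights
    mask = ~np.eye(len(adj), dtype=bool)   # exclude self-pairs
    return beta * float((weight * sim ** 2)[mask].sum())
```

Adding such a term to an embedding objective pushes high-degree vertexes apart, which is the stated intent of the principle; the paper realizes it inside spectral and skip-gram objectives instead of as a standalone term.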

AAAI Conference 2018 Conference Paper

Urban Dreams of Migrants: A Case Study of Migrant Integration in Shanghai

  • Yang Yang
  • Chenhao Tan
  • Zongtao Liu
  • Fei Wu
  • Yueting Zhuang

Unprecedented human mobility has driven the rapid urbanization around the world. In China, the fraction of population dwelling in cities increased from 17.9% to 52.6% between 1978 and 2012. Such large-scale migration poses challenges for policymakers and important questions for researchers. To investigate the process of migrant integration, we employ a one-month complete dataset of telecommunication metadata in Shanghai with 54 million users and 698 million call logs. We find systematic differences between locals and migrants in their mobile communication networks and geographical locations. For instance, migrants have more diverse contacts and move around the city with a larger radius than locals after they settle down. By distinguishing new migrants (who recently moved to Shanghai) from settled migrants (who have been in Shanghai for a while), we demonstrate the integration process of new migrants in their first three weeks. Moreover, we formulate classification problems to predict whether a person is a migrant. Our classifier is able to achieve an F1-score of 0.82 when distinguishing settled migrants from locals, but it remains challenging to identify new migrants because of class imbalance. This classification setup holds promise for identifying new migrants who will successfully integrate into locals (new migrants that are misclassified as locals).

IJCAI Conference 2017 Conference Paper

Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks

  • Jun Xiao
  • Hao Ye
  • Xiangnan He
  • Hanwang Zhang
  • Fei Wu
  • Tat-Seng Chua

Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating second-order feature interactions. Despite their effectiveness, FMs can be hindered by modelling all feature interactions with the same weight, as not all feature interactions are equally useful and predictive. For example, interactions with useless features may even introduce noise and adversely degrade performance. In this work, we improve FM by discriminating the importance of different feature interactions. We propose a novel model named Attentional Factorization Machine (AFM), which learns the importance of each feature interaction from data via a neural attention network. Extensive experiments on two real-world datasets demonstrate the effectiveness of AFM. Empirically, on the regression task AFM betters FM with an 8.6% relative improvement, and consistently outperforms the state-of-the-art deep learning methods Wide&Deep [Cheng et al., 2016] and DeepCross [Shan et al., 2016] with a much simpler structure and fewer model parameters. Our implementation of AFM is publicly available at: https://github.com/hexiangnan/attentional_factorization_machine
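The attention-weighted pairwise interactions that AFM adds on top of an FM can be sketched as a minimal NumPy forward pass. The function name, parameter shapes, and attention-network size below are illustrative assumptions, not the authors' released implementation:

```python
import numpy as np

def afm_forward(x, w0, w, V, W, b, h, p):
    """Illustrative AFM forward pass: a softmax attention network
    re-weights each pairwise interaction (v_i * v_j) x_i x_j."""
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Element-wise interacted vectors for each feature pair
    inter = np.array([V[i] * V[j] * x[i] * x[j] for i, j in pairs])
    # One-hidden-layer attention network scores each interaction
    hidden = np.maximum(W @ inter.T + b[:, None], 0.0)  # ReLU
    logits = h @ hidden
    att = np.exp(logits - logits.max())
    att /= att.sum()                                    # softmax weights
    # Bias + linear part + attention-pooled interactions projected by p
    return w0 + w @ x + p @ (att @ inter)
```

With uniform attention weights this reduces to a plain second-order FM, which is the comparison the abstract draws.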

IJCAI Conference 2017 Conference Paper

Discriminant Tensor Dictionary Learning with Neighbor Uncorrelation for Image Set Based Classification

  • Fei Wu
  • Xiao-Yuan Jing
  • Wangmeng Zuo
  • Ruiping Wang
  • Xiaoke Zhu

Image set based classification (ISC) has attracted lots of research interest in recent years. Several ISC methods have been developed, and dictionary learning technique based methods obtain state-of-the-art performance. However, existing ISC methods usually transform the image samples of a set into vectors for subsequent processing, which breaks the inherent spatial structure of the image samples and the set. In this paper, we utilize a tensor to model an image set with two spatial modes and one set mode, which can fully explore the intrinsic structure of the image set. We propose a novel ISC approach, named discriminant tensor dictionary learning with neighbor uncorrelation (DTDLNU), which jointly learns two spatial dictionaries and one set dictionary. The spatial and set dictionaries are composed of set-specific sub-dictionaries corresponding to the class labels, such that the reconstruction error is discriminative. To obtain dictionaries with favorable discriminative power, DTDLNU designs a neighbor-uncorrelated discriminant tensor dictionary term, which minimizes the within-class scatter of the training sets in the projected tensor space and reduces tensor dictionary correlation among set-specific sub-dictionaries corresponding to neighbor sets from different classes. Experiments on three challenging datasets demonstrate the effectiveness of DTDLNU.

IJCAI Conference 2017 Conference Paper

Group-wise Deep Co-saliency Detection

  • Lina Wei
  • Shanshan Zhao
  • Omar El Farouk Bourahla
  • Xi Li
  • Fei Wu

In this paper, we propose an end-to-end group-wise deep co-saliency detection approach to address the co-salient object discovery problem, based on a fully convolutional network (FCN) with group input and group output. The proposed approach captures the group-wise interaction information for group images by learning a semantics-aware image representation based on a convolutional neural network, which adaptively learns the group-wise features for co-saliency detection. Furthermore, the proposed approach discovers the collaborative and interactive relationships between group-wise feature representation and single-image individual feature representation, and models them in a collaborative learning framework. Finally, we set up a unified end-to-end deep learning scheme to jointly optimize the process of group-wise feature representation learning and the collaborative learning, leading to more reliable and robust co-saliency detection results. Experimental results demonstrate the effectiveness of our approach in comparison with state-of-the-art approaches.

AAAI Conference 2017 Conference Paper

Learning Heterogeneous Dictionary Pair with Feature Projection Matrix for Pedestrian Video Retrieval via Single Query Image

  • Xiaoke Zhu
  • Xiao-Yuan Jing
  • Fei Wu
  • Yunhong Wang
  • Wangmeng Zuo
  • Wei-Shi Zheng

Person re-identification (re-id) plays an important role in video surveillance and forensics applications. In many cases, person re-id needs to be conducted between an image and a video clip, e.g., re-identifying a suspect from large quantities of pedestrian videos given a single image of him. We call re-id in this scenario image-to-video person re-id (IVPR). In practice, image and video are usually represented with different features, and there usually exist large variations between frames within each video. These factors make matching between image and video a very challenging task. In this paper, we propose a joint feature projection matrix and heterogeneous dictionary pair learning (PHDL) approach for IVPR. Specifically, PHDL jointly learns an intra-video projection matrix and a pair of heterogeneous image and video dictionaries. With the learned projection matrix, the influence of variations within each video on the matching can be reduced. With the learned dictionary pair, the heterogeneous image and video features can be transformed into coding coefficients with the same dimension, such that the matching can be conducted using coding coefficients. Furthermore, to ensure that the obtained coding coefficients have favorable discriminability, PHDL designs a point-to-set coefficient discriminant term. Experiments on the public iLIDS-VID and PRID 2011 datasets demonstrate the effectiveness of the proposed approach.

AAAI Conference 2017 Conference Paper

Multi-Kernel Low-Rank Dictionary Pair Learning for Multiple Features Based Image Classification

  • Xiaoke Zhu
  • Xiao-Yuan Jing
  • Fei Wu
  • Di Wu
  • Li Cheng
  • Sen Li
  • Ruimin Hu

Dictionary learning (DL) is an effective feature learning technique, and has led to interesting results in many classification tasks. Recently, by combining DL with multiple kernel learning (a crucial and effective technique for combining information from different feature representations), a few multi-kernel DL methods have been presented to solve the classification problem based on multiple feature representations. However, how to improve the representation capability and discriminability of a multi-kernel dictionary has not been well studied. In this paper, we propose a novel multi-kernel DL approach, named multi-kernel low-rank dictionary pair learning (MKLDPL). Specifically, MKLDPL jointly learns a kernel synthesis dictionary and a kernel analysis dictionary by exploiting the class label information. The learned synthesis and analysis dictionaries work together to implement the coding and reconstruction of samples in the kernel space. To enhance the discriminability of the learned multi-kernel dictionaries, MKLDPL imposes a low-rank regularization on the analysis dictionary, which can make samples from the same class have similar representations. We apply MKLDPL to the multiple features based image classification task. Experimental results demonstrate the effectiveness of the proposed approach.

AAAI Conference 2017 Conference Paper

Multiset Feature Learning for Highly Imbalanced Data Classification

  • Fei Wu
  • Xiao-Yuan Jing
  • Shiguang Shan
  • Wangmeng Zuo
  • Jing-Yu Yang

With the expansion of data, increasing amounts of imbalanced data have emerged. When the imbalance ratio of data is high, most existing imbalanced learning methods decline in classification performance. To address this problem, a few highly imbalanced learning methods have been presented. However, most of them are still sensitive to the high imbalance ratio. This work aims to provide an effective solution for the highly imbalanced data classification problem. We conduct highly imbalanced learning from the perspective of feature learning. We partition the majority class into multiple blocks, each balanced with the minority class, and combine each block with the minority class to construct a balanced sample set. Multiset feature learning (MFL) is performed on these sets to learn discriminant features. We thus propose an uncorrelated cost-sensitive multiset learning (UCML) approach. UCML provides a multiple sets construction strategy, incorporates the cost-sensitive factor into MFL, and designs a weighted uncorrelated constraint to remove the correlation among multiset features. Experiments on five highly imbalanced datasets indicate that UCML outperforms state-of-the-art imbalanced learning methods.
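The multiple sets construction step described above (partition the majority class into minority-sized blocks, pair each block with the full minority class) is easy to sketch. The function name and label convention below are illustrative assumptions, and the MFL step itself is omitted:

```python
import numpy as np

def build_balanced_sets(X_maj, X_min, seed=0):
    """Sketch of the multiset construction: split the majority class
    into blocks of (at most) minority size and pair each block with
    the minority samples to form balanced training sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_maj))   # shuffle before blocking
    k = len(X_min)
    n_blocks = int(np.ceil(len(X_maj) / k))
    sets = []
    for b in range(n_blocks):
        block = X_maj[idx[b * k:(b + 1) * k]]
        X = np.vstack([block, X_min])
        # Label convention (assumed): 0 = majority, 1 = minority
        y = np.concatenate([np.zeros(len(block)), np.ones(k)])
        sets.append((X, y))
    return sets
```

Each returned set is roughly balanced, so a per-set feature learner no longer faces the original high imbalance ratio.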

AAAI Conference 2017 Conference Paper

Semi-Supervised Multi-View Correlation Feature Learning with Application to Webpage Classification

  • Xiao-Yuan Jing
  • Fei Wu
  • Xiwei Dong
  • Shiguang Shan
  • Songcan Chen

Webpage classification has attracted a lot of research interest. Webpage data is often multi-view and high-dimensional, and the webpage classification application is usually semi-supervised. Due to these characteristics, using the semi-supervised multi-view feature learning (SMFL) technique to deal with the webpage classification problem has recently received much attention. However, there still exists room for improvement for this kind of feature learning technique. How to effectively utilize the correlation information among the multiple views of webpage data is an important research topic. Correlation analysis on multi-view data can facilitate extraction of the complementary information. In this paper, we propose a novel SMFL approach, named semi-supervised multi-view correlation feature learning (SMCFL), for webpage classification. SMCFL seeks a discriminant common space by learning a multi-view shared transformation in a semi-supervised manner. In the discriminant space, the correlation between intra-class samples is maximized, and the correlation between inter-class samples and the global correlation among both labeled and unlabeled samples are minimized simultaneously. We transform the matrix-variable based nonconvex objective function of SMCFL into a convex quadratic programming problem with one real variable, and can achieve a globally optimal solution. Experiments on widely used datasets demonstrate the effectiveness and efficiency of the proposed approach.

AAAI Conference 2016 Conference Paper

Community-Based Question Answering via Heterogeneous Social Network Learning

  • Hanyin Fang
  • Fei Wu
  • Zhou Zhao
  • Xinyu Duan
  • Yueting Zhuang
  • Martin Ester

Community-based question answering (cQA) sites have accumulated vast amounts of questions and corresponding crowdsourced answers over time. How to efficiently share the underlying information and knowledge from reliable (usually highly-reputable) answerers has become an increasingly popular research topic. A major challenge in cQA tasks is the accurate matching of high-quality answers w.r.t. given questions. Many traditional approaches recommend corresponding answers merely depending on the content similarity between questions and answers, and therefore suffer from the sparsity bottleneck of cQA data. In this paper, we propose a novel framework which encodes not only the contents of question-answer (Q-A) pairs but also the social interaction cues in the community to boost the cQA tasks. More specifically, our framework collaboratively utilizes the rich interaction among questions, answers and answerers to learn the relative quality rank of different answers w.r.t. the same question. Moreover, the information in heterogeneous social networks is comprehensively employed to enhance the quality of question-answering (QA) matching by our deep random walk learning framework. Extensive experiments on a large-scale dataset from a real world cQA site show that leveraging the heterogeneous social information indeed achieves better performance than other state-of-the-art cQA methods.

IJCAI Conference 2016 Conference Paper

Diverse Image Captioning via GroupTalk

  • Zhuhao Wang
  • Fei Wu
  • Weiming Lu
  • Jun Xiao
  • Xi Li
  • Zitong Zhang
  • Yueting Zhuang

Generally speaking, different persons tend to describe images from various aspects due to their individually subjective perception. As a result, generating appropriate descriptions of images with both diversity and high quality is of great importance. In this paper, we propose a framework called GroupTalk to learn multiple image caption distributions simultaneously and effectively mimic the diversity of the image captions written by human beings. In particular, a novel iterative update strategy is proposed to separate training sentence samples into groups and learn their distributions at the same time. Furthermore, we introduce an efficient classifier to solve the problem brought about by the non-linear and discontinuous nature of language distributions, which would otherwise impair performance. Experiments on several benchmark datasets show that GroupTalk naturally diversifies the generated captions of each image without sacrificing accuracy.

IJCAI Conference 2016 Conference Paper

Self-Paced Boost Learning for Classification

  • Te Pi
  • Xi Li
  • Zhongfei Zhang
  • Deyu Meng
  • Fei Wu
  • Jun Xiao
  • Yueting Zhuang

Effectiveness and robustness are two essential aspects of supervised learning studies. For effective learning, ensemble methods are developed to build a strong effective model from ensemble of weak models. For robust learning, self-paced learning (SPL) is proposed to learn in a self-controlled pace from easy samples to complex ones. Motivated by simultaneously enhancing the learning effectiveness and robustness, we propose a unified framework, Self-Paced Boost Learning (SPBL). With an adaptive from-easy-to-hard pace in boosting process, SPBL asymptotically guides the model to focus more on the insufficiently learned samples with higher reliability. Via a max-margin boosting optimization with self-paced sample selection, SPBL is capable of capturing the intrinsic inter-class discriminative patterns while ensuring the reliability of the samples involved in learning. We formulate SPBL as a fully-corrective optimization for classification. The experiments on several real-world datasets show the superiority of SPBL in terms of both effectiveness and robustness.

IJCAI Conference 2016 Conference Paper

Video-Based Person Re-Identification by Simultaneously Learning Intra-Video and Inter-Video Distance Metrics

  • Xiaoke Zhu
  • Xiao-Yuan Jing
  • Fei Wu
  • Hui Feng

Video-based person re-identification (re-id) is an important application in practice. However, only a few methods have been presented for this problem. Since large variations exist between different pedestrian videos, as well as within each video, it is challenging to conduct re-identification between pedestrian videos. In this paper, we propose a simultaneous intra-video and inter-video distance learning (SI2DL) approach for video-based person re-id. Specifically, SI2DL simultaneously learns an intra-video distance metric and an inter-video distance metric from the training videos. The intra-video distance metric is designed to make each video more compact, and the inter-video one to ensure that the distance between two truly matching videos is smaller than that between two wrongly matched videos. To enhance the discriminability of the learned metrics, we design a video relationship model, i.e., the video triplet, for SI2DL. Experiments on the public iLIDS-VID and PRID 2011 image sequence datasets show that our approach achieves state-of-the-art performance.

AAAI Conference 2015 Conference Paper

Metric Learning Driven Multi-Task Structured Output Optimization for Robust Keypoint Tracking

  • Liming Zhao
  • Xi Li
  • Jun Xiao
  • Fei Wu
  • Yueting Zhuang

As an important and challenging problem in computer vision and graphics, keypoint-based object tracking is typically formulated in a spatio-temporal statistical learning framework. However, most existing keypoint trackers are incapable of effectively modeling and balancing the following three aspects in a simultaneous manner: temporal model coherence across frames, spatial model consistency within frames, and discriminative feature construction. To address this issue, we propose a robust keypoint tracker based on spatio-temporal multi-task structured output optimization driven by discriminative metric learning. Consequently, temporal model coherence is characterized by multi-task structured keypoint model learning over several adjacent frames, while spatial model consistency is modeled by solving a geometric verification based structured learning problem. Discriminative feature construction is enabled by metric learning to ensure the intra-class compactness and inter-class separability. Finally, the above three modules are simultaneously optimized in a joint learning scheme. Experimental results have demonstrated the effectiveness of our tracker.

IJCAI Conference 2015 Conference Paper

Sketch the Storyline with CHARCOAL: A Non-Parametric Approach

  • Siliang Tang
  • Fei Wu
  • Si Li
  • Weiming Lu
  • Zhongfei Zhang
  • Yueting Zhuang

Generating a coherent synopsis and revealing the development threads for news stories from the increasing amounts of news content remains a formidable challenge. In this paper, we propose a hddCRP (hybrid distance-dependent Chinese Restaurant Process) based HierARChical tOpic model for news Article cLustering, abbreviated as CHARCOAL. Given a bunch of news articles, the outcome of CHARCOAL is threefold: 1) it aggregates relevant news articles into clusters (i.e., stories); 2) it disentangles the chain links (i.e., storylines) between articles in their describing story; 3) it discerns the topic that each story is assigned to (e.g., the Malaysia Airlines Flight 370 story belongs to the aircraft accident topic and U.S. presidential election stories belong to the politics topic). CHARCOAL completes this task by utilizing a hddCRP as prior, and the entities (e.g., names of persons, organizations, or locations) that appear in news articles as clues. Moreover, the non-parametric nature of CHARCOAL enables our model to adaptively learn the appropriate number of stories and topics from the news corpus. The experimental analysis and results demonstrate both the interpretability and the superiority of the proposed approach.

AAAI Conference 2015 Conference Paper

Structured Embedding via Pairwise Relations and Long-Range Interactions in Knowledge Base

  • Fei Wu
  • Jun Song
  • Yi Yang
  • Xi Li
  • Zhongfei Zhang
  • Yueting Zhuang

We consider the problem of embedding entities and relations of knowledge bases into low-dimensional continuous vector spaces (distributed representations). Unlike most existing approaches, which are primarily efficient for modelling pairwise relations between entities, we attempt to explicitly model both pairwise relations and long-range interactions between entities, by interpreting them as linear operators on the low-dimensional embeddings of the entities. Therefore, in this paper we introduce path ranking to capture the long-range interactions of a knowledge graph while at the same time preserving its pairwise relations; we call this structured embedding via pairwise relations and long-range interactions (referred to as SePLi). Compared with state-of-the-art models, SePLi achieves better embedding performance.

IJCAI Conference 2015 Conference Paper

Web Page Classification Based on Uncorrelated Semi-Supervised Intra-View and Inter-View Manifold Discriminant Feature Extraction

  • Xiao-Yuan Jing
  • Qian Liu
  • Fei Wu
  • Baowen Xu
  • Yangping Zhu
  • Songcan Chen

Web page classification has attracted increasing research interest. It is intrinsically a multi-view and semi-supervised application, since web pages usually contain two or more types of data, such as text, hyperlinks and images, and unlabeled pages are generally much more numerous than labeled ones. Web page data is commonly high-dimensional. Thus, how to extract useful features from this kind of data in the multi-view semi-supervised scenario is important for web page classification. To our knowledge, only one method has been specially presented for this topic. And for the few semi-supervised multi-view feature extraction methods on other applications, there still exists much room for improvement. In this paper, we firstly design a feature extraction schema called semi-supervised intra-view and inter-view manifold discriminant (SI2MD) learning, which sufficiently utilizes the intra-view and inter-view discriminant information of labeled samples and the local neighborhood structures of unlabeled samples. We then design a semi-supervised uncorrelation constraint for the SI2MD schema to remove the multi-view correlation in the semi-supervised scenario. By combining the SI2MD schema with the constraint, we propose an uncorrelated semi-supervised intra-view and inter-view manifold discriminant (USI2MD) learning approach for web page classification. Experiments on public web page databases validate the proposed approach.

AAAI Conference 2014 Conference Paper

Uncorrelated Multi-View Discrimination Dictionary Learning for Recognition

  • Xiao-Yuan Jing
  • Rui-Min Hu
  • Fei Wu
  • Xi-Lin Chen
  • Qian Liu
  • Yong-Fang Yao

Dictionary learning (DL) has now become an important feature learning technique that achieves state-of-the-art recognition performance. Due to the sparse characteristic of data in real-world applications, DL uses a set of learned dictionary bases to represent the linear decomposition of a data point. Fisher discrimination DL (FDDL) is a representative supervised DL method, which constructs a structured dictionary whose atoms correspond to the class labels. Recent years have witnessed a growing interest in multi-view (more than two views) feature learning techniques. Although some multi-view (or multi-modal) DL methods have been presented, there still exists much room for improvement. How to enhance the total discriminability of dictionaries and reduce their redundancy is a crucial research topic. To boost the performance of the multi-view DL technique, we propose an uncorrelated multi-view discrimination DL (UMD2L) approach for recognition. By making dictionary atoms correspond to the class labels such that the obtained reconstruction error is discriminative, UMD2L aims to jointly learn multiple dictionaries with totally favorable discriminative power. Furthermore, we design an uncorrelated constraint for multi-view DL, so as to reduce the redundancy among dictionaries learned from different views. Experiments on several public datasets demonstrate the effectiveness of the proposed approach.

AAAI Conference 2013 Conference Paper

Supervised Coupled Dictionary Learning with Group Structures for Multi-modal Retrieval

  • Yue Zhuang
  • Yan Wang
  • Fei Wu
  • Yin Zhang
  • Wei Lu

A better similarity mapping function across heterogeneous high-dimensional features is very desirable for many applications involving multi-modal data. In this paper, we introduce coupled dictionary learning (DL) into supervised sparse coding for multi-modal (cross-media) retrieval. We call this Supervised coupled dictionary learning with group structures for Multi-Modal retrieval (SliM2). SliM2 formulates the multi-modal mapping as a constrained dictionary learning problem. By utilizing the intrinsic power of DL to deal with heterogeneous features, SliM2 extends unimodal DL to multi-modal DL. Moreover, the label information is employed in SliM2 to discover the shared structure within each modality for the same class by a mixed norm (i.e., ℓ1/ℓ2-norm). As a result, multi-modal retrieval is conducted via a set of jointly learned mapping functions across multi-modal data. The experimental results show the effectiveness of our proposed model when applied to cross-media retrieval.

AAAI Conference 2013 Conference Paper

Supervised Nonnegative Tensor Factorization with Maximum-Margin Constraint

  • Fei Wu
  • Xu Tan
  • Yi Yang
  • Dacheng Tao
  • Siliang Tang
  • Yueting Zhuang

Non-negative tensor factorization (NTF) has attracted great attention in the machine learning community. In this paper, we extend traditional non-negative tensor factorization into a supervised discriminative decomposition, referred to as Supervised Non-negative Tensor Factorization with Maximum-Margin Constraint (SNTFM2). SNTFM2 formulates the optimal discriminative factorization of non-negative tensorial data as a coupled least-squares optimization problem via a maximum-margin method. As a result, SNTFM2 not only faithfully approximates the tensorial data by additive combinations of the basis, but also obtains a strong generalization power for discriminative analysis (in particular for classification in this paper). The experimental results show the superiority of our proposed model over state-of-the-art techniques on both toy and real world data sets.

IJCAI Conference 2013 Conference Paper

Synthesizing Union Tables from the Web

  • Xiao Ling
  • Alon Halevy
  • Fei Wu
  • Cong Yu

Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, hundreds of millions of high quality structured data tables can now be explored by users. In this paper, we argue that those efforts only scratch the surface of the true value of structured data on the Web, and study the challenging problem of synthesizing tables from the Web, i.e., producing never-before-seen tables from raw tables on the Web. Table synthesis offers an important semantic advantage: when a set of related tables are combined into a single union table, powerful mechanisms, such as temporal or geographical comparison and visualization, can be employed to understand and mine the underlying data holistically. We focus on one fundamental task of table synthesis, namely, table stitching. Within a given site, many tables with identical schemas can be scattered across many pages. The task of table stitching involves combining such tables into a single meaningful union table and identifying extra attributes and values for its rows so that rows from different original tables can be distinguished. Specifically, we first define the notion of stitchable tables and identify collections of tables that can be stitched. Second, we design an effective algorithm for extracting hidden attributes that are essential for the stitching process and for aligning values of those attributes across tables to synthesize new columns. We also assign meaningful names to these synthesized columns. Experiments on real world tables demonstrate the effectiveness of our approach.

AAAI Conference 2010 Conference Paper

Multi-Task Sparse Discriminant Analysis (MtSDA) with Overlapping Categories

  • Yahong Han
  • Fei Wu
  • Jinzhu Jia
  • Yueting Zhuang
  • Bin Yu

Multi-task learning aims at combining information across tasks to boost prediction performance, especially when the number of training samples is small and the number of predictors is very large. In this paper, we first extend the Sparse Discriminant Analysis (SDA) of Clemmensen et al. We call this Multi-task Sparse Discriminant Analysis (MtSDA). MtSDA formulates multi-label prediction as a quadratic optimization problem, whereas SDA obtains single labels via a nearest class mean rule. Second, we propose a class of equicorrelation matrices to use in MtSDA, which includes the identity matrix. MtSDA with both matrices is compared with single-task learning (SVM and LDA+SVM) and multi-task learning (HSML). The comparisons are made on real data sets in terms of AUC and F-measure. The results show that MtSDA outperforms the other methods substantially almost all the time, and in some cases MtSDA with the equicorrelation matrix substantially outperforms MtSDA with the identity matrix.