Arrow Research search

Author name cluster

Long Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

63 papers
2 author rows

Possible papers

JBHI Journal 2026 Journal Article

A General Global and Local Pre-Training Framework for 3D Medical Image Segmentation

  • Jianguo Ju
  • Ziyu Guan
  • Hao Lei
  • Dandan Qiu
  • Haoming Zhang
  • Long Chen
  • Fei Xie
  • Wei Zhao

Accurate target segmentation from computed tomography (CT) scans is crucial for surgical robots to perform clinical surgeries successfully. However, the lack of medical image data and annotations has been the biggest obstacle to learning robust medical image segmentation models. Self-supervised learning can effectively address this problem by providing a strategy to pre-train a model with unlabeled data, and then fine-tune downstream tasks with limited labeled data. Existing self-supervised methods fail to simultaneously utilize the abundant global anatomical structure information and local feature differences in medical imaging. In this work, we propose a new strategy for the pre-training framework, which uses the three-dimensional anatomical structure of medical images and specific task and background cues to segment volumetric medical images with limited annotations. Specifically, we propose (1) learning intrinsic patterns of volumetric medical image structures through multiple sub-tasks, and (2) designing a multi-level background cube contrastive learning strategy to enhance the target feature representation by exploiting the differences between the specific target and background. We conduct extensive evaluations on two publicly available datasets. Under limited annotation settings, the proposed method yields significant improvements compared to other self-supervised learning techniques. The proposed method achieves within 6% of the baseline performance using only five labeled CT volumes for training.

EAAI Journal 2026 Journal Article

A rapid image-based detection method for coalmine dust concentration and mass dispersion via multi-task deep learning

  • Haoran Fu
  • Xiaoyan Gong
  • Long Chen
  • Xinyu Wang
  • Yuxuan Xue
  • Wansu Yong
  • Hao Feng

Rapid and simultaneous detection of multiple coalmine dust parameters is crucial for accurate dust control. Existing methods often suffer from single detection indicators and delayed responses, making it difficult to effectively implement dust prevention and control measures. This study leverages artificial intelligence to propose an image-based detection method for coalmine dust, built upon multi-task deep learning. The method enables real-time detection of total dust concentration, respiratory dust concentration, and mass dispersion. An image preprocessing strategy is designed, and shared features are selected using the maximal information coefficient and variance inflation factor, with a multidimensional feature library constructed. The proposed architecture integrates a shared layer and a Set Transformer layer, both based on multi-head attention, to optimize inter-task representation consistency and enhance adaptability to dynamic variations in particle count. To jointly optimize the network architecture and training performance, an adaptive loss optimization mechanism and an Optuna-based hyperparameter tuning strategy are introduced. On an independent test set, the method is compared with multi-level baselines and evaluated via ablation studies. A dust particle image detection device is developed, and based on a fully mechanized tunneling face of a coalmine in Shaanxi Province, an experimental platform is constructed for application analysis. The results show that, under coal-dust conditions, the method achieves an average response cycle of 5.5620 s, and the maximum average relative error across all outputs is 7.3201%, meeting engineering requirements for real-time performance and detection accuracy. Overall, the method offers robust theoretical and technical support for intelligent dust monitoring.

TCS Journal 2026 Journal Article

Analysis of key reuse security for Aigis.KEM

  • Ke Wang
  • Haodong Jiang
  • Zhenfeng Zhang
  • Long Chen
  • Huiqin Xie

Key reuse security is an important security property considered in the NIST post-quantum cryptography algorithm standardization. At PKC’20, Zhang et al. proposed Aigis.KEM, a key encapsulation mechanism based on asymmetric MLWE. Aigis.KEM provides flexible parameter selection, has high comprehensive performance, and won the first prize of China’s national cryptographic algorithm competition. However, its key reuse security is currently unclear. This paper studies the key reuse security of Aigis.KEM. Aigis.KEM is derived from the public key encryption scheme Aigis.PKE, so we first assess its key reuse resilience using key recovery under plaintext-checking attack (KR-PCA). Then, we optimize the attack and propose a two-positional KR-PCA attack to further approach the lower bound of attack complexity. We also verify these attacks through experiments, and discuss further optimization and improvement. Finally, based on the KR-PCA attacks on Aigis.PKE, we further propose practical attacks on Aigis.KEM by utilizing side-channel attacks or fault-injection attacks. In response to these attacks, we explore possible countermeasures. The work helps to clarify the potential risks of Aigis.KEM and guide its application in practice.

EAAI Journal 2026 Journal Article

Control-allocation-based hierarchical coordinated control for the transient mode transition of a compound power-split hybrid electric vehicle

  • Dehua Shi
  • Xiangwei Rong
  • Shaohua Wang
  • Chunfang Yin
  • Long Chen
  • Jiajia Wang

To address the engineering challenges of controlling and allocating torque in the transient over-actuated system during mode transitions of compound power split hybrid electric vehicles, the paper proposes an intelligent optimal torque distribution control scheme. First, the mode transition logic of the over-actuated system is established by analyzing the timing and sequence of dual-clutch operations. Based on this logic, a hierarchical coordinated control strategy is developed. At the upper layer, a controller and observer are designed using finite-time theory, with speed-tracking errors as state inputs, to guarantee transient performance during mode transitions. At the lower layer, simulated annealing is integrated with optimal control allocation to ensure optimal torque distribution in the over-actuated system. Finally, simulation and hardware-in-the-loop test results demonstrate that, compared with the terminal sliding mode control method based on control allocation, the proposed approach reduces jerk, slippage work, and transition time by 19.1%, 52.7%, and 16.7%, respectively. The control framework provides a novel intelligent solution for mode transition control in hybrid electric vehicles.

EAAI Journal 2026 Journal Article

Electro-optical and infrared multi-sensor fusion based airborne target perception: A unified framework

  • Zhouyu Zhang
  • Chenyuan He
  • Yingfeng Cai
  • Long Chen
  • Hai Wang
  • Can Zhong
  • Yiqun Zhang

This paper presents a unified framework for airborne target perception, designed for unmanned aerial vehicles (UAVs) operating in non-cooperative airspace environments. The core contribution to artificial intelligence lies in the integration of electro-optical and infrared (EO/IR) sensors using a convolutional sparse representation-based image fusion algorithm, along with a novel spatiotemporal detection method that combines conditional random fields and motion history analysis. The engineering application focuses on real-time airborne Sense and Avoid (SAA) capabilities for small UAVs, where a local-angle-based collision avoidance path planning method is proposed to address the limitations of monocular vision-based perception. To validate the proposed framework, a distributed digital simulation and verification system is developed based on virtual camera feeds and local network communication. This system supports closed-loop testing of visual perception, target detection, and path planning in realistic airspace environments. Experiments conducted in three representative airport scenarios (Illinois State Hospital, Shanghai Pudong International Airport, and New York John F. Kennedy International Airport) demonstrate the framework’s effectiveness in enhancing visual quality under low illumination conditions, improving detection accuracy, and enabling robust and safe autonomous navigation. Specifically, the proposed system achieves a target detection accuracy of 94.6% and reduces the false alarm rate to 2.1%, while successfully generating collision-free paths in 97.8% of dynamic encounters. Compared to existing state-of-the-art EO/IR fusion-based perception systems, our framework improves detection precision by 4.3% on average and increases planning robustness by 5.6% in complex airspace environments. These results validate both the effectiveness and the generalizability of the unified framework for real-world UAV SAA tasks.

AAAI Conference 2026 Conference Paper

Enhancing Diffusion Policies with Distribution-Matching Generator in Offline Reinforcement Learning

  • Xuemin Hu
  • Shen Li
  • Yingfen Xu
  • Bo Tang
  • Long Chen

Offline reinforcement learning (RL) can learn policies from pre-collected offline datasets without interacting with the environment, but it suffers from the issue of out-of-distribution (OOD). Recent methods use the generative adversarial paradigm to learn policies, but easily fail to handle the conflict of fooling the discriminator and maximizing expected returns. In this paper, we propose a novel offline RL method named Distribution-Matching Generator-based Diffusion Policies (DMGDP). A distribution matching-based policy learning method is first developed, where the diffusion serves as the policy generator, to handle the conflict of fooling the discriminator and maximizing expected returns. Furthermore, a policy confidence mechanism based on discriminator regularization is designed to prevent the agent from taking OOD actions, with the aim of robust generative adversarial learning. We conducted extensive experiments on the D4RL benchmarks, and the results demonstrate that DMGDP outperforms state-of-the-art methods.

AAAI Conference 2026 Conference Paper

Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning

  • Haomiao Tang
  • Jinpeng Wang
  • Minyi Zhao
  • GuangHao Meng
  • Ruisheng Luo
  • Long Chen
  • Shu-Tao Xia

Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model's robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings capturing detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive the comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG's effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.

AAAI Conference 2026 Conference Paper

LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models

  • Long Chen
  • Xiaotian Song
  • Yanan Sun

Spiking Large Language Models (LLMs) have emerged as an energy-efficient alternative to conventional LLMs through their event-driven computation. To effectively obtain spiking LLMs, researchers develop different ANN-to-SNN conversion methods by leveraging pre-trained ANN parameters while inheriting the energy efficiency of SNNs. However, existing conversion methods struggle with extreme activation outliers and incompatible nonlinear operations of ANN-based LLMs. To address this, we propose a loss-less ANN-SNN conversion for fully spike-driven LLMs, termed LAS. Specifically, LAS introduces two novel neurons to convert the activation outliers and nonlinear operations of ANN-based LLMs. Moreover, LAS tailors the spike-equivalent Transformer components for spiking LLMs, which can ensure full spiking conversion without any loss of performance. Experimental results on six language models and two vision-language models demonstrate that LAS achieves loss-less conversion. Notably, on OPT-66B, LAS even improves accuracy by 2% on the WSC task. In addition, the parameter and ablation studies further verify the effectiveness of LAS.

AAAI Conference 2026 Conference Paper

Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image

  • Yuxuan Wang
  • Xuanyu Yi
  • Qingshan Xu
  • Yuan Zhou
  • Long Chen
  • Hanwang Zhang

Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality image-conditioned 3DGS personalization that significantly outperforms existing methods.

AAAI Conference 2026 Conference Paper

Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension

  • Lin Li
  • Wei Chen
  • Jiahui Li
  • Kwang-Ting Cheng
  • Long Chen

Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning. However, they remain limited in visual relation understanding, struggling even with binary relation detection, let alone N-ary relations involving multiple semantic roles. The core reason is the lack of modeling for structural semantic dependencies among multi-entities, leading to over-reliance on language priors (e.g., defaulting to "person drinks a milk" if a person is merely holding it). To this end, we propose Relation-R1, the first unified relation comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-rewards optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Furthermore, we investigate the impact of various CoT strategies within this framework, demonstrating that a specific-to-general progressive approach in CoT guidance further improves generalization, especially in capturing synonymous N-ary relations. Extensive experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.
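The group-relative part of GRPO mentioned in the abstract above can be illustrated with a small sketch. It assumes the commonly described normalization, in which each sampled response's reward is standardized against the other responses drawn for the same prompt. The reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are illustrative assumptions, not details taken from the Relation-R1 paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Hypothetical helper: the advantage of each response is its reward
    minus the group mean, divided by the group standard deviation.
    eps guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.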

AAAI Conference 2026 Conference Paper

Spatial-Frequency Spiking Neural Network for Underwater Object Detection

  • Long Chen
  • Wei Miao
  • Xin Gao
  • Yunzhi Zhuge
  • Hongming Xu
  • Yaxin Li
  • Qi Xu

Underwater object detection presents significant challenges due to the unique visual degradations in underwater environments, such as low contrast, poor visibility, and blurry object boundaries. While ANNs have achieved impressive detection accuracy, their high computational cost and power consumption limit their deployment in resource-constrained underwater platforms. In this work, we propose a Spatial-Frequency Spiking Neural Network (SFSNN) that combines the energy-efficient and event-driven nature of Spiking Neural Networks (SNNs) with the discriminative power of spatial-frequency analysis. SFSNN introduces a novel spatial-frequency spiking module that integrates spatial and frequency-domain representations, enhancing edge and texture features crucial for object detection in murky waters. Furthermore, we adapt the YOLOX architecture into a spike-based detector via ANN-to-SNN conversion using signed spiking neurons. Extensive experiments on the RUOD dataset demonstrate that SFSNN achieves superior performance over both SNN- and ANN-based detection models, offering a compelling solution for low-power underwater object detection.

TMLR Journal 2026 Journal Article

Towards Customized Knowledge Distillation for Efficient Dense Image Predictions

  • Dong Zhang
  • Pingcheng Dong
  • Long Chen
  • Kwang-Ting Cheng

It has been revealed that efficient dense image prediction (EDIP) models designed for AI chips, trained using the knowledge distillation (KD) framework, encounter two key challenges, including maintaining boundary region completeness and ensuring target region connectivity, despite their favorable real-time capacity to recognize the main object regions. In this work, we propose a customized boundary and context knowledge distillation (BCKD) method for EDIPs, which facilitates the targeted KD from large accurate teacher models to compact small student models. Specifically, the boundary distillation focuses on extracting explicit object-level boundaries from the hierarchical feature maps to enhance the student model's mask quality in boundary regions. Meanwhile, the context distillation leverages self-relations as a bridge to transfer implicit pixel-level contexts from the teacher model to the student model, ensuring strong connectivity in target regions. Our method is specifically designed for the EDIP tasks and is characterized by its simplicity and efficiency. Theoretical analysis and extensive experimental results across semantic segmentation, object detection, and instance segmentation on five representative datasets demonstrate the effectiveness of BCKD, resulting in well-defined object boundaries and smooth connecting regions.

YNIMG Journal 2026 Journal Article

Transcutaneous auricular vagus nerve stimulation facilitates visuomotor association learning: Behavioral and electrophysiological evidence

  • Long Chen
  • Chenghu Tang
  • Huixin Gao
  • Lei Zhang
  • Shengcui Cheng
  • Zhongpeng Wang
  • Shuang Liu
  • Dong Ming

Associating visual cues with appropriate motor responses is a fundamental adaptive skill. Transcutaneous auricular vagus nerve stimulation (taVNS) may enhance visuomotor association (VMA) learning, though its neural mechanisms remain unclear. Electroencephalogram (EEG), with its millisecond temporal resolution, offers unique advantages for elucidating the neurodynamic of VMA plasticity. This single-blind, sham-controlled, between-subjects study investigated whether taVNS facilitates VMA learning through behavioral and EEG analysis. Participants (each group N = 19) performed a VMA task (associating five oracle pictures with five keyboard keys) before and after 20-min active/sham taVNS. Behavioral results revealed that compared to the sham group, the active group exhibited shorter reaction time, higher response accuracy and larger learning curve integration, confirming the positive effect of taVNS on VMA learning. Neurophysiologically, taVNS reduced the P200 and P300 amplitudes, enhanced N170 negativity and attenuated error-related negativity. Cross-regional-frequency phase-amplitude coupling results demonstrated enhanced synchronization of frontal-parietal-occipital neural cross-frequency activity. Additionally, parietal-occipital θ, α, β band inter-trial phase coherence was enhanced in the active group. These findings demonstrate that taVNS enhances VMA acquisition through optimizing visual and error processing efficiency. This study establishes a neurophysiological basis for taVNS's cognitive enhancement potential, suggesting its utility in rehabilitative paradigms targeting associative learning deficits.

AAAI Conference 2026 Conference Paper

VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

  • Qimao Chen
  • Fang Li
  • Shaoqing Xu
  • Zhiyi Lai
  • Zixun Xie
  • Yuechen Luo
  • Shengyin Jiang
  • Hanbing Li

The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions, including safety-critical scenario generation and closed-loop learning, often rely on rule-based heuristics, resampling methods, and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such a two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.

AAAI Conference 2026 Conference Paper

What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions

  • Lingfeng Zhang
  • Haoxiang Fu
  • Xiaoshuai Hao
  • Shuyi Zhang
  • Qiang Zhang
  • Rui Liu
  • Long Chen
  • Wenbo Ding

Embodied navigation is a fundamental capability that enables embodied agents to effectively interact with the physical world in various complex environments. However, a significant gap remains between current embodied navigation tasks and real-world requirements, as existing methods often struggle to integrate high-level human instructions with spatial understanding. To address this gap, we propose a new task of embodied navigation called spatial navigation, which encompasses two key components: spatial object navigation (SpON) for object-specific guidance and spatial area navigation (SpAN) for navigating to designated areas. Specifically, SpON guides agents to specific objects by leveraging spatial relationships and contextual understanding, while SpAN focuses on navigating to defined areas within complex environments. Together, these components significantly enhance agents’ navigation capabilities, enabling more effective interactions in real-world scenarios. To support this task, we have generated a spatial navigation dataset consisting of 10K trajectories within the simulator. This dataset includes high-level human instructions, detailed observations, and corresponding navigation actions, providing a comprehensive resource to enhance agent training and performance. Building on the spatial navigation dataset, we introduce SpNav, a hierarchical navigation framework. Specifically, SpNav employs a vision-language model (VLM) to interpret high-level human instructions and accurately identify goal objects or areas within the observation range, achieving precise point-to-point navigation using a map and enhancing the agent’s ability to operate effectively in complex environments by bridging the gap between perception and action. Extensive experiments show that SpNav achieves state-of-the-art (SOTA) performance in spatial navigation tasks across both simulated and real-world environments, validating the effectiveness of our method.

AAAI Conference 2025 Conference Paper

3D Annotation-Free Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving

  • Boyi Sun
  • Yuhang Liu
  • Xingxia Wang
  • Bin Tian
  • Long Chen
  • Fei-Yue Wang

Point cloud data labeling is considered a time-consuming and expensive task in autonomous driving, whereas annotation-free learning training can avoid it by learning point cloud representations from unannotated data. In this paper, we propose AFOV, a novel 3D Annotation-Free framework assisted by 2D Open-Vocabulary segmentation models. It consists of two stages: In the first stage, we innovatively integrate high-quality textual and image features of 2D open-vocabulary models and propose the Tri-Modal contrastive Pre-training (TMP). In the second stage, spatial mapping between point clouds and images is utilized to generate pseudo-labels, enabling cross-modal knowledge distillation. Besides, we introduce the Approximate Flat Interaction (AFI) to address the noise during alignment and label confusion. To validate the superiority of AFOV, extensive experiments are conducted on multiple related datasets. We achieved a record-breaking 47.73% mIoU on the annotation-free 3D segmentation task in nuScenes, surpassing the previous best model by 3.13% mIoU. Meanwhile, the performance of fine-tuning with 1% data on nuScenes and SemanticKITTI reached a remarkable 51.75% mIoU and 48.14% mIoU, outperforming all previous pre-trained models.

EAAI Journal 2025 Journal Article

A deep learning based iterative denoising algorithm for multiple frequency lines recovery

  • Qifan Shen
  • Xinwei Luo
  • Long Chen

Passive detection technology constitutes a crucial research direction in underwater acoustic target detection. It has been the subject of ongoing investigations to address the pressing need for stealth capabilities. The most formidable hurdle that all types of detectors must overcome is the extraction of line spectral components relevant to the target, given the convoluted underwater environment teeming with significant noise pollution. In this paper, a pioneering deep learning-based algorithm, known as the Additive Diffusion Probabilistic Denoising Model (ADPDM), is proposed to rectify the performance inadequacies of neural network-based approaches when operating under low signal-to-noise ratios (SNRs). To begin with, the ADPDM was ingeniously crafted. It was designed to astutely modify the representation of underwater signals by transforming the generative inference process of the diffusion model into a deterministic recovery strategy. Subsequently, the ADPDM was expanded into the complex-valued time–frequency joint domain, in order to take full advantage of the multi-dimensional information representation brought about by the lofargram. Moreover, an accelerating inference algorithm was adopted and calibrated to be fully compatible with the ADPDM framework. In contrast to the prevailing frequency line trackers that predominantly concentrate on discerning the frequency positions of the line spectrum, the ADPDM is dedicated to unearthing and reconstructing the latent line spectrum components concealed within the observed signal. This, in turn, paves the way for more effective subsequent detection or estimation operations. Empirical results demonstrated that the frequency lines within the signal enhanced by the ADPDM can be detected with remarkable efficacy, even when a relatively less sophisticated tracker is employed. On the basis of these findings, the detection performance metrics of the ADPDM have been shown to outstrip those of the current state-of-the-art (SOTA) methods, both those founded on deep learning and the hidden Markov model (HMM), across the entire spectrum of experimental SNRs.

ICLR Conference 2025 Conference Paper

Accelerated Over-Relaxation Heavy-Ball Method: Achieving Global Accelerated Convergence with Broad Generalization

  • Jingrong Wei
  • Long Chen

The heavy-ball momentum method accelerates gradient descent with a momentum term but lacks accelerated convergence for general smooth strongly convex problems. This work introduces the Accelerated Over-Relaxation Heavy-Ball (AOR-HB) method, the first variant with provable global and accelerated convergence for such problems. AOR-HB closes a long-standing theoretical gap, extends to composite convex optimization and min-max problems, and achieves optimal complexity bounds. It offers three key advantages: (1) broad generalization ability, (2) potential to reshape acceleration techniques, and (3) conceptual clarity and elegance compared to existing methods.
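For context, the classical heavy-ball iteration that this abstract builds on can be sketched as follows. This is the standard Polyak update, x_{k+1} = x_k - alpha * grad f(x_k) + beta * (x_k - x_{k-1}), applied to a small diagonal quadratic; it is not the AOR-HB method, whose update rule is not given on this page. The test problem, the function name `heavy_ball`, and the iteration count are illustrative assumptions. The step size and momentum are the textbook choices for a quadratic with curvature between mu and L.

```python
import math


def heavy_ball(grad, x0, alpha, beta, iters):
    """Classical (Polyak) heavy-ball iteration on a list-valued iterate."""
    x_prev = list(x0)
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Gradient step plus a momentum term built from the previous move.
        x_next = [xi - alpha * gi + beta * (xi - xpi)
                  for xi, gi, xpi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


# Illustrative problem: f(x) = 0.5 * sum(d_i * x_i^2) - sum(b_i * x_i),
# whose minimizer is x_i = b_i / d_i.
diag = [1.0, 100.0]
b = [1.0, 1.0]


def grad(x):
    return [d * xi - bi for d, xi, bi in zip(diag, x, b)]


mu, L = min(diag), max(diag)
alpha = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
beta = ((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2

x_star = heavy_ball(grad, [0.0, 0.0], alpha, beta, iters=200)
```

On quadratics these parameters give the accelerated rate; the gap the abstract refers to concerns general smooth strongly convex functions, where the same iteration has no such global guarantee.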

NeurIPS Conference 2025 Conference Paper

Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

  • Wei Chen
  • Xin Yan
  • Bin Wen
  • Fan Yang
  • Tingting Gao
  • Di Zhang
  • Long Chen

Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious 'hallucination' issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize "robust" hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO’s hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.
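The training-free contrastive decoding baseline described in this abstract can be sketched roughly as below. This is a generic illustration of the idea (amplify the logits from the clean input and subtract the logits from a distorted input), not the DCD method itself. The logit values, the weight `alpha`, and the function names are hypothetical.

```python
def contrastive_logits(clean, distorted, alpha=1.0):
    """Combine next-token logits from a clean and a distorted input.

    Tokens that stay likely under the distorted input are treated as
    candidates for hallucination and are pushed down.
    """
    return [(1.0 + alpha) * c - alpha * d for c, d in zip(clean, distorted)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits over a three-token vocabulary.
clean = [2.0, 1.9, 0.1]      # token 0 narrowly preferred with the real image
distorted = [2.5, 1.0, 0.1]  # token 0 even stronger once the image is degraded

adjusted = contrastive_logits(clean, distorted, alpha=1.0)
greedy_before = argmax(clean)
greedy_after = argmax(adjusted)
```

In this toy case the greedy choice moves from token 0 to token 1, because token 0 owes its score to the language prior rather than to the image.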

NeurIPS Conference 2025 Conference Paper

Interaction-Centric Knowledge Infusion and Transfer for Open Vocabulary Scene Graph Generation

  • Lin Li
  • Chuhan ZHANG
  • Dong Zhang
  • Chong Sun
  • Chen Li
  • Long Chen

Open-vocabulary scene graph generation (OVSGG) extends traditional SGG by recognizing novel objects and relationships beyond predefined categories, leveraging the knowledge from pre-trained large-scale models. Existing OVSGG methods always adopt a two-stage pipeline: 1) Infusing knowledge into large-scale models via pre-training on large datasets; 2) Transferring knowledge from pre-trained models with fully annotated scene graphs during supervised fine-tuning. However, due to a lack of explicit interaction modeling, these methods struggle to distinguish between interacting and non-interacting instances of the same object category. This limitation induces critical issues in both stages of OVSGG: it generates noisy pseudo-supervision from mismatched objects during knowledge infusion, and causes ambiguous query matching during knowledge transfer. To this end, in this paper, we propose an interACtion-Centric end-to-end OVSGG framework (ACC) in an interaction-driven paradigm to minimize these mismatches. For interaction-centric knowledge infusion, ACC employs a bidirectional interaction prompt for robust pseudo-supervision generation to enhance the model's interaction knowledge. For interaction-centric knowledge transfer, ACC first adopts interaction-guided query selection that prioritizes pairing interacting objects to reduce interference from non-interacting ones. Then, it integrates interaction-consistent knowledge distillation to bolster robustness by pushing relational foreground away from the background while retaining general knowledge. Extensive experimental results on three benchmarks show that ACC achieves state-of-the-art performance, demonstrating the potential of interaction-centric paradigms for real-world applications.

AAAI Conference 2025 Conference Paper

Learning Causal Transition Matrix for Instance-dependent Label Noise

  • Jiahui Li
  • Tai-Wei Chang
  • Kun Kuang
  • Ximing Li
  • Long Chen
  • Jun Zhou

Noisy labels are both inevitable and problematic in machine learning methods, as they negatively impact models' generalization ability by causing overfitting. In the context of learning with noise, the transition matrix plays a crucial role in the design of statistically consistent algorithms. However, the transition matrix is often considered unidentifiable. One strand of methods typically addresses this problem by assuming that the transition matrix is instance-independent; that is, the probability of mislabeling a particular instance is not influenced by its characteristics or attributes. This assumption is clearly invalid in complex real-world scenarios. To better understand the transition relationship and relax this assumption, we propose to study the data generation process of noisy labels from a causal perspective. We discover that an unobservable latent variable can affect either the instance itself, the label annotation procedure, or both, which complicates the identification of the transition matrix. To address various scenarios, we have unified these observations within a new causal graph. In this graph, the input instance is divided into a noise-resistant component and a noise-sensitive component based on whether they are affected by the latent variable. These two components contribute to identifying the “causal transition matrix”, which approximates the true transition matrix with theoretical guarantee. In line with this, we have designed a novel training framework that explicitly models this causal relationship and, as a result, achieves a more accurate model for inferring the clean label.

IROS Conference 2025 Conference Paper

Learning Perceptive Humanoid Locomotion over Challenging Terrain

  • Wandong Sun
  • Baoshi Cao
  • Long Chen
  • Yongbo Su
  • Yang Liu 0054
  • Zongwu Xie
  • Hong Liu 0002

Humanoid robots are engineered to navigate terrains akin to those encountered by humans, which necessitates human-like locomotion and perceptual abilities. Currently, the most reliable controllers for humanoid motion rely exclusively on proprioception, a reliance that becomes both dangerous and unreliable when coping with rugged terrain. Although the integration of height maps into perception can enable proactive gait planning, robust utilization of this information remains a significant challenge, especially when exteroceptive perception is noisy. To surmount these challenges, we propose a solution based on a teacher-student distillation framework. In this paradigm, an oracle policy accesses noise-free data to establish an optimal reference policy, while the student policy not only imitates the teacher’s actions but also simultaneously trains a world model with a variational information bottleneck for sensor denoising and state estimation. Extensive evaluations demonstrate that our approach markedly enhances performance in scenarios characterized by unreliable terrain estimations. Moreover, we conducted rigorous testing in both challenging urban settings and off-road environments, where the model successfully traversed 2 km of varied terrain without external intervention.

NeurIPS Conference 2025 Conference Paper

Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

  • Yanghao Wang
  • Long Chen

Although today's pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling a Gaussian noise, DC utilizes the differences of denoising effects with different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different random sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we first explore the role of noise in DC, and conclude that there are some "good noises" that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: 1) Frequency Matching: noise should destroy the specific frequency signals; 2) Spatial Matching: noise should destroy the specific spatial areas. Regarding both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep $t$, optimize one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that adopts an image as input and outputs image-specific noise offset. The sum of optimized noise and noise offset will be used in DC to replace random noise. Extensive ablations on various datasets demonstrate the effectiveness of NoOp. It is worth noting that our noise optimization is orthogonal to existing optimization methods (e.g., prompt tuning), so NoOp can even benefit from these methods to further boost performance.

AAAI Conference 2025 Conference Paper

Open-World Multimodal Understanding and Generation with Efficiently Finetuned Foundation Models

  • Long Chen

With the astonishing ability of different pretrained foundation models (e.g., large language models (LLMs), vision-language models, diffusion models), today’s AI research and development tendency has been revolutionized. In this talk, I will answer two questions: Q1: How can we efficiently train or fine-tune foundation models? Q2: How can we build strong open-world multimodal understanding and generation models with these pretrained foundation models?

NeurIPS Conference 2025 Conference Paper

ReSim: Reliable World Simulation for Autonomous Driving

  • Jiazhi Yang
  • Kashyap Chitta
  • Shenyuan Gao
  • Long Chen
  • Yuqian Shao
  • Xiaosong Jia
  • Hongyang Li
  • Andreas Geiger

How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates reward from ReSim’s simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

NeurIPS Conference 2025 Conference Paper

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models

  • Xianda Guo
  • Ruijun Zhang
  • Yiqun Duan
  • Yuhang He
  • Dujun Nie
  • Wenke Huang
  • Chenming Zhang
  • Shuai Liu

Accurate spatial reasoning in outdoor environments—covering geometry, object pose, and inter-object relationships—is fundamental to downstream tasks such as mapping, motion forecasting, and high-level planning in autonomous driving. We introduce SURDS, a large-scale benchmark designed to systematically evaluate the spatial reasoning capabilities of vision language models (VLMs). Built on the nuScenes dataset, SURDS comprises 41,080 vision–question–answer training instances and 9,250 evaluation samples, spanning six spatial categories: orientation, depth estimation, pixel-level localization, pairwise distance, lateral ordering, and front–behind relations. We benchmark leading general-purpose VLMs, including GPT, Gemini, and Qwen, revealing persistent limitations in fine-grained spatial understanding. To address these deficiencies, we go beyond static evaluation and explore whether alignment techniques can improve spatial reasoning performance. Specifically, we propose a reinforcement learning–based alignment scheme leveraging spatially grounded reward signals—capturing both perception-level accuracy (location) and reasoning consistency (logic). We further incorporate final-answer correctness and output-format rewards to guide fine-grained policy adaptation. Our GRPO-aligned variant achieves an overall score of 40.80 on the SURDS benchmark. Notably, it outperforms proprietary systems such as GPT-4o (13.30) and Gemini-2.0-flash (35.71). To the best of our knowledge, this is the first study to demonstrate that reinforcement learning–based alignment can significantly and consistently enhance the spatial reasoning capabilities of VLMs in real-world driving contexts. We release the SURDS benchmark, evaluation toolkit, and GRPO alignment code through: https://github.com/XiandaGuo/Drive-MLLM.

NeurIPS Conference 2024 Conference Paper

$\text{Di}^2\text{Pose}$: Discrete Diffusion Model for Occluded 3D Human Pose Estimation

  • Weiquan Wang
  • Jun Xiao
  • Chunping Wang
  • Wei Liu
  • Zhao Wang
  • Long Chen

Diffusion models have demonstrated their effectiveness in addressing the inherent uncertainty and indeterminacy in monocular 3D human pose estimation (HPE). Despite their strengths, the need for large search spaces and the corresponding demand for substantial training data make these models prone to generating biomechanically unrealistic poses. This challenge is particularly noticeable in occlusion scenarios, where the complexity of inferring 3D structures from 2D images intensifies. In response to these limitations, we introduce the **Di**screte **Di**ffusion **Pose** (**$\text{Di}^2\text{Pose}$**), a novel framework designed for occluded 3D HPE that capitalizes on the benefits of a discrete diffusion model. Specifically, **$\text{Di}^2\text{Pose}$** employs a two-stage process: it first converts 3D poses into a discrete representation through a pose quantization step, which is subsequently modeled in latent space through a discrete diffusion process. This methodological innovation restrictively confines the search space towards physically viable configurations and enhances the model’s capability to comprehend how occlusions affect human pose within the latent space. Extensive evaluations conducted on various benchmarks (e.g., Human3.6M, 3DPW, and 3DPW-Occ) have demonstrated its effectiveness.

AAAI Conference 2024 Conference Paper

Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities

  • Hammad Ayyubi
  • Christopher Thomas
  • Lovish Chum
  • Rahul Lokesh
  • Long Chen
  • Yulei Niu
  • Xudong Lin
  • Xuande Feng

Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus, on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist due to the same events being referred to on many semantic levels. For example, the abstract event of "war" manifests at a lower semantic level through subevents "tanks firing" (in video) and airplane "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method which demonstrates improved performance on this task and highlight opportunities for future research. Data: https://github.com/hayyubi/multihieve

IJCAI Conference 2024 Conference Paper

ClothPPO: A Proximal Policy Optimization Enhancing Framework for Robotic Cloth Manipulation with Observation-Aligned Action Spaces

  • Libing Yang
  • Yang Li
  • Long Chen

Vision-based robotic cloth unfolding has made great progress recently. However, prior works predominantly rely on value learning and have not fully explored policy-based techniques. Recently, the success of reinforcement learning on the large language model has shown that the policy gradient algorithm can enhance policy with huge action space. In this paper, we introduce ClothPPO, a framework that employs a policy gradient algorithm based on actor-critic architecture to enhance a pre-trained model with huge 10^6 action spaces aligned with observation in the task of unfolding clothes. To this end, we redefine the cloth manipulation problem as a partially observable Markov decision process. A supervised pre-training stage is employed to train a baseline model of our policy. In the second stage, the Proximal Policy Optimization (PPO) is utilized to guide the supervised model within the observation-aligned action space. By optimizing and updating the strategy, our proposed method increases the garment's surface area for cloth unfolding under the soft-body manipulation task. Experimental results show that our proposed framework can further improve the unfolding performance of other state-of-the-art methods. Our project is available at https://vpx-ecnu.github.io/ClothPPO-website/.

JBHI Journal 2024 Journal Article

Enhancing Motor Sequence Learning via Transcutaneous Auricular Vagus Nerve Stimulation (taVNS): An EEG Study

  • Long Chen
  • Chenghu Tang
  • Zhongpeng Wang
  • Lei Zhang
  • Bin Gu
  • Xiuyun Liu
  • Dong Ming

Motor learning plays a crucial role in human life, and various neuromodulation methods have been utilized to strengthen or improve it. Transcutaneous auricular vagus nerve stimulation (taVNS) has gained increasing attention due to its non-invasive nature, affordability and ease of implementation. Although the potential of taVNS for regulating motor learning has been suggested, its actual regulatory effect has yet to be fully explored. Electroencephalogram (EEG) analysis provides an in-depth understanding of cognitive processes involved in motor learning so as to offer methodological support for regulation of motor learning. To investigate the effect of taVNS on motor learning, this study recruited 22 healthy subjects to participate in a single-blind, sham-controlled, and within-subject serial reaction time task (SRTT) experiment. Every subject took part in two sessions at least one week apart and received a 20-minute active/sham taVNS in each session. Behavioral indicators, as well as EEG characteristics during the task state, were extracted and analyzed. The results revealed that compared to the sham group, the active group showed higher learning performance. Additionally, the EEG results indicated that after taVNS, the motor-related cortical potential amplitudes and alpha-gamma modulation index decreased significantly and functional connectivity based on partial directed coherence towards frontal lobe was enhanced. These findings suggest that taVNS can improve motor learning, mainly through enhancing cognitive and memory functions rather than simple movement learning. This study confirms the positive regulatory effect of taVNS on motor learning, which is particularly promising as it offers a potential avenue for enhancing motor skills and facilitating rehabilitation.

JBHI Journal 2024 Journal Article

Influence of Transcutaneous Vagus Nerve Stimulation on Motor Planning: A Resting-State and Task-State EEG Study

  • Long Chen
  • Jiatong He
  • Jiasheng Zhang
  • Zhongpeng Wang
  • Lei Zhang
  • Bin Gu
  • Xiuyun Liu
  • Dong Ming

Transcutaneous vagus nerve stimulation (tVNS) shows a potential regulatory role for motor planning. Still, existing research mainly focuses on behavioral studies, and the neural modulation mechanism needs to be clarified. Therefore, we designed a multi-condition (active or sham, pre or under, difficult or easy, left-hand or right-hand) motor planning experiment to explore the effect of online tVNS (i.e., tVNS synchronized with the tasks). Twenty-eight subjects were recruited and randomly assigned to active and sham groups. Both groups performed the same tasks in the experiment, and task-state EEG and 5-min eye-open resting-state EEG were collected separately. The results showed that the changes in event-related potential (ERP) and movement-related cortical potential (MRCP) amplitudes were more significant for the left-hand difficult task (LD) under active-tVNS. According to the power spectrum results, active-tVNS significantly modulated the activities of the contralateral motor cortex at beta and gamma bands in the resting state. The functional connectivity based on partial directed coherence (PDC) showed significant changes in the parietal lobe after active-tVNS. These findings suggest that tVNS is a promising way to improve motor planning ability.

NeurIPS Conference 2024 Conference Paper

LLMs Can Evolve Continually on Modality for $\mathbb{X}$-Modal Reasoning

  • Jiazuo Yu
  • Haomiao Xiong
  • Lu Zhang
  • Haiwen Diao
  • Yunzhi Zhuge
  • Lanqing Hong
  • Dong Wang
  • Huchuan Lu

Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose **PathWeave**, a flexible and scalable framework with modal-**path** s**w**itching and **e**xp**a**nsion abilities that enables MLLMs to continually **ev**olve on modalities for $\mathbb{X}$-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called **C**ontinual **L**earning of **M**odality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave.

EAAI Journal 2024 Journal Article

Neural network energy management strategy for plug-in hybrid electric combine harvesters based on quasi-periodic samples

  • Shuofeng Weng
  • Chaochun Yuan
  • Youguo He
  • Jie Shen
  • Long Chen
  • Lizhang Xu
  • Zhihao Zhu
  • Qiuye Yu

Energy management strategies are crucial for Plug-in Hybrid Electric Combine Harvester (PHECH). However, many existing approaches rely on rigid, preset rules that struggle to adjust to the PHECH operational conditions. This paper first introduces a power estimation model tailored to the quasi-periodic process of harvester activity. Then, Dynamic Programming (DP) is applied to derive optimal samples of engine power ratio across various scenarios. Building on the samples, a Neural Network (NN) is developed to enhance the strategy's economic and real-time performance. Simulation tests evaluate the proposed algorithm's efficacy and its energy conservation potential. The findings suggest that, compared to fuel-driven harvesters, the NN strategy achieves similar energy cost savings to the DP approach, exceeding 11%, which is better than the Charge Depleting and Charge Sustaining (CDCS) strategy's 7.22% and the MPC-ECMS strategy's 7.87%. Moreover, the NN strategy reduces the time expense to roughly one-fifth of that required by the DP approach.

AAAI Conference 2024 Conference Paper

SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network

  • Yuhang He
  • Zhuangzhuang Dai
  • Niki Trigoni
  • Long Chen
  • Andrew Markham

In this paper, we study an underexplored, yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by systematically proposing a novel end-to-end trainable neural network (which we call DyDecNet, consisting of a dyadic decomposition front-end and backbone network), and quantifying the difficulty level of counting depending on sound polyphonicity. The dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain time-frequency representation in multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's carried frequency response, with the higher-half child filter encoding the detail and lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. To better quantify sound counting difficulty level, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on various datasets to show its superiority, and we further show dyadic decomposition network can be used as a general front-end to tackle other acoustic tasks.

EAAI Journal 2024 Journal Article

Toward efficient and lightweight sea–land segmentation for remote sensing images

  • Xun Ji
  • Longbin Tang
  • Long Chen
  • Li-Ying Hao
  • Hui Guo

Sea–land segmentation is of great significance for autonomous coastline monitoring, which is fundamental research in the remote sensing community. Due to the diverse contents and easily confused sea–land boundaries contained in remote sensing images, it is always challenging to achieve precise sea–land segmentation for complex scenarios. Although existing deep learning-based methods have exhibited promising performance, excessive computational load and insufficient use of hierarchical features remain unresolved. In this paper, we contribute to addressing the problems by developing an efficient and lightweight convolutional neural network (CNN) termed E-Net. On the one hand, the proposed network adopts a novel E-shaped architecture that reforms the conventional U-codec structure to make full use of hierarchical features at different depths, so that the sea–land segmentation effect can be significantly improved without excessive computational load. On the other hand, a contextual aggregation attention mechanism (CA2M) is designed to further facilitate efficient aggregation and transmission of contextual information, so that the fuzzy and irregular sea–land boundaries can be accurately distinguished. Extensive experiments reveal that our approach not only produces superior sea–land segmentation effect but also demonstrates promising computational efficiency. Specifically, the proposed E-Net achieves state-of-the-art sea–land segmentation performance with 92.78% and 93.62% mean Intersection over Union (mIoU) on the SLSD and HRSC2016 datasets, respectively, while the frames per second (FPS) reaches 108.032 with as low as 52.287G floating point operations per second (FLOPs).
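The mIoU metric quoted in the abstract can be illustrated with a short sketch. The function name, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the paper's exact evaluation protocol is not stated here.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean Intersection over Union across classes for flat label lists.
    Classes that appear in neither `pred` nor `target` are skipped
    (an assumed convention, not necessarily the paper's)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Hypothetical six-pixel example: 0 = sea, 1 = land.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target)
```

Here each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs are 0.5 and `score` is 0.5.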

IJCAI Conference 2023 Conference Paper

Discrepancy-Guided Reconstruction Learning for Image Forgery Detection

  • Zenan Shi
  • Haipeng Chen
  • Long Chen
  • Dong Zhang

In this paper, we propose a novel image forgery detection paradigm for boosting the model learning capacity on both forgery-sensitive and genuine compact visual patterns. Compared to the existing methods that only focus on the discrepant-specific patterns (e.g., noises, textures, and frequencies), our method has a greater generalization. Specifically, we first propose a Discrepancy-Guided Encoder (DisGE) to extract forgery-sensitive visual patterns. DisGE consists of two branches, where the mainstream backbone branch is used to extract general semantic features, and the accessorial discrepant external attention branch is used to extract explicit forgery cues. Besides, a Double-Head Reconstruction (DouHR) module is proposed to enhance genuine compact visual patterns in different granular spaces. Under DouHR, we further introduce a Discrepancy-Aggregation Detector (DisAD) to aggregate these genuine compact visual patterns, such that the forgery detection capability on unknown patterns can be improved. Extensive experimental results on four challenging datasets validate the effectiveness of our proposed method against state-of-the-art competitors.

AAAI Conference 2023 Conference Paper

Progressive Deep Multi-View Comprehensive Representation Learning

  • Cai Xu
  • Wei Zhao
  • Jinglong Zhao
  • Ziyu Guan
  • Yaming Yang
  • Long Chen
  • Xiangyu Song

Multi-view Comprehensive Representation Learning (MCRL) aims to synthesize information from multiple views to learn comprehensive representations of data items. Prevalent deep MCRL methods typically concatenate synergistic view-specific representations or average aligned view-specific representations in the fusion stage. However, synergistic fusion methods inevitably degenerate or even fail when partial views are missing in real-world applications, and alignment-based fusion methods usually cannot fully exploit the complementarity of multi-view data. To eliminate all these drawbacks, in this work we present a Progressive Deep Multi-view Fusion (PDMF) method. Considering the multi-view comprehensive representation should contain complete information and the view-specific data contain partial information, we deem that it is unstable to directly learn the mapping from partial information to complete information. Hence, PDMF employs a progressive learning strategy, which contains the pre-training and fine-tuning stages. In the pre-training stage, PDMF decodes the auxiliary comprehensive representation to the view-specific data. It also captures the consistency and complementarity by learning the relations between the dimensions of the auxiliary comprehensive representation and all views. In the fine-tuning stage, PDMF learns the mapping from the original data to the comprehensive representation with the help of the auxiliary comprehensive representation and relations. Experiments conducted on a synthetic toy dataset and 4 real-world datasets show that PDMF outperforms state-of-the-art baseline methods. The code is released at https://github.com/winterant/PDMF.

EAAI Journal 2023 Journal Article

Robust and fuzzy ensemble framework via spectral learning for random projection-based fuzzy-c-means clustering

  • Zhaoyin Shi
  • Long Chen
  • Junwei Duan
  • Guangyong Chen
  • Kai Zhao

The ensembles of random projection-based fuzzy-c-means (RP-FCM) can handle high-dimensional data efficiently. However, the performance of these ensemble frameworks is still hindered by some issues, such as misaligned membership matrices, information loss of co-similar matrices, large storage space, unstable ensemble results due to the additional re-clustering, etc. To address these issues, we propose a robust and fuzzy ensemble framework via spectral learning for RP-FCM clustering. After using random projection to generate different dimensional datasets and obtaining the membership matrices via fuzzy-c-means, we first convert these membership matrices into regularized graphs and approximate the affinity matrices of these graphs by spectral matrices. This step not only avoids the alignment problems of membership matrices but also excludes the storage of large-scale graphs. The spectral matrices of the same size are used as the features of membership matrices for the ensemble, avoiding the possible information loss by applying co-similar matrix transformations. More importantly, an optimization model is designed in our framework to learn the fusion of spectral features. In this model, the proportion of each base clustering is adjusted adaptively through a fuzzification exponent, and the effect of outliers is also suppressed by a robust norm. Finally, the Laplacian rank constraint in the model guarantees the ensemble can achieve the exact final partition. An efficient algorithm for this model is derived, and its time complexity and convergence are also analyzed. Competitive experimental results on benchmark data demonstrate the effectiveness of the proposed ensemble framework in comparison to state-of-the-art methods.

NeurIPS Conference 2023 Conference Paper

Two Heads are Better Than One: A Simple Exploration Framework for Efficient Multi-Agent Reinforcement Learning

  • Jiahui Li
  • Kun Kuang
  • Baoxiang Wang
  • Xingchen Li
  • Fei Wu
  • Jun Xiao
  • Long Chen

Exploration strategy plays an important role in reinforcement learning, especially in sparse-reward tasks. In cooperative multi-agent reinforcement learning (MARL), designing a suitable exploration strategy is much more challenging due to the large state space and the complex interaction among agents. Currently, mainstream exploration methods in MARL either contribute to exploring the unfamiliar states, which are large and sparse, or measure the interaction among agents at high computational cost. We found an interesting phenomenon: different kinds of exploration play different roles in different MARL scenarios, and choosing a suitable one is often more effective than designing an exquisite algorithm. In this paper, we propose an exploration method that incorporates CuriOsity-based and INfluence-based exploration (COIN), which is simple but effective in various situations. First, COIN measures the influence of each agent on the other agents based on mutual information theory and designs it as intrinsic rewards which are applied to each individual value function. Moreover, COIN computes the curiosity-based intrinsic rewards via prediction errors which are added to the extrinsic reward. For integrating the two kinds of intrinsic rewards, COIN utilizes a novel framework in which they complement each other and lead to a sufficient and effective exploration on cooperative MARL tasks. We perform extensive experiments on different challenging benchmarks, and results across different scenarios show the superiority of our method.

NeurIPS Conference 2023 Conference Paper

Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models

  • Lin Li
  • Jun Xiao
  • Guikun Chen
  • Jian Shao
  • Yueting Zhuang
  • Long Chen

Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.

EAAI Journal 2022 Journal Article

Efficient kernel fuzzy clustering via random Fourier superpixel and graph prior for color image segmentation

  • Long Chen
  • Yin-Ping Zhao
  • Chuanbin Zhang

Kernel fuzzy clustering algorithms can explore the non-linear relations of pixels in an image. However, most kernel-based methods are computationally expensive for color image segmentation and neglect the inherent locality information in images. To alleviate these limitations, this paper proposes a novel kernel fuzzy clustering framework for fast color image segmentation. More specifically, we first design a new superpixel generation method that uses random Fourier maps to approximate Gaussian kernels and explicitly represent high-dimensional features of pixels. Clustering superpixels instead of the much larger number of pixels speeds up the segmentation of a color image significantly. More importantly, the features of superpixels used by fuzzy clustering are also calculated in the approximated kernel space, and the local relationships between superpixels are depicted as a graph prior and appended to the objective function of fuzzy clustering as a Kullback–Leibler divergence term. This results in a new fuzzy clustering model that can further improve the accuracy of the image segmentation. Experiments on synthetic and real-world color image datasets verify the superiority and high efficiency of the proposed approach.
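
The random Fourier map mentioned in the abstract above can be illustrated with the standard random Fourier feature construction. This is a minimal sketch of that general technique, not the paper's superpixel method: the function names, the bandwidth `sigma`, the feature count, and the two sample pixels are all hypothetical choices made for illustration.

```python
import math
import random


def make_rff_map(dim, num_features, sigma, seed=0):
    """Return a function mapping a vector to random Fourier features.

    The inner product of two mapped vectors approximates the Gaussian
    kernel exp(-||x - y||^2 / (2 * sigma^2)).
    """
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/sigma^2); phases from U[0, 2*pi).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return feature_map


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel value, used here only for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Two hypothetical RGB pixels scaled to [0, 1].
pixel_a = [0.20, 0.40, 0.60]
pixel_b = [0.25, 0.35, 0.70]
phi = make_rff_map(dim=3, num_features=2000, sigma=0.5, seed=0)
approx = dot(phi(pixel_a), phi(pixel_b))   # explicit-feature estimate
exact = gaussian_kernel(pixel_a, pixel_b, sigma=0.5)
```

With explicit features, averages over groups of pixels can be taken directly in the approximated kernel space, which is what makes clustering on a small number of regions cheap compared with evaluating a full kernel matrix.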

NeurIPS Conference 2022 Conference Paper

Respecting Transfer Gap in Knowledge Distillation

  • Yulei Niu
  • Long Chen
  • Chang Zhou
  • Hanwang Zhang

Knowledge distillation (KD) is essentially a process of transferring a teacher model's behavior, e.g., network response, to a student model. The network response serves as additional supervision to formulate the machine domain, which uses the data collected from the human domain as a transfer set. Traditional KD methods hold an underlying assumption that the data collected in both human domain and machine domain are both independent and identically distributed (IID). We point out that this naive assumption is unrealistic and there is indeed a transfer gap between the two domains. Although the gap offers the student model external knowledge from the machine domain, the imbalanced teacher knowledge would make us incorrectly estimate how much to transfer from teacher to student per sample on the non-IID transfer set. To tackle this challenge, we propose Inverse Probability Weighting Distillation (IPWD) that estimates the propensity of a training sample belonging to the machine domain, and assigns its inverse amount to compensate for under-represented samples. Experiments on CIFAR-100 and ImageNet demonstrate the effectiveness of IPWD for both two-stage distillation and one-stage self-distillation.
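
The inverse-weighting idea in the abstract above can be sketched generically as follows. This is a minimal illustration of inverse probability weighting, not the IPWD implementation: the propensity values, the per-sample losses, the clipping threshold `clip_min`, and the mean-1 normalization are all assumptions introduced here, and the abstract does not say how the propensities are estimated.

```python
def inverse_propensity_weights(propensities, clip_min=0.05):
    """Weight each sample by 1 / propensity, then rescale to mean 1."""
    # Clipping (an added assumption) keeps tiny propensities from dominating.
    raw = [1.0 / max(p, clip_min) for p in propensities]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def weighted_mean_loss(losses, weights):
    """Average the per-sample losses after applying the sample weights."""
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)


# Hypothetical values: the third sample is under-represented (low propensity).
propensities = [0.9, 0.5, 0.1]
per_sample_kd_loss = [0.2, 0.4, 1.0]
weights = inverse_propensity_weights(propensities)
loss = weighted_mean_loss(per_sample_kd_loss, weights)
```

Samples with a low estimated propensity receive the largest weights, so their distillation loss contributes more to the total than it would under a plain average.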

AAAI Conference 2022 Conference Paper

Rethinking the Two-Stage Framework for Grounded Situation Recognition

  • Meng Wei
  • Long Chen
  • Wei Ji
  • Xiaoyu Yue
  • Tat-Seng Chua

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards “human-like” event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient in verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a novel SituFormer for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a coarse-grained model trained with XE loss first proposes a set of verb candidates, and then a fine-grained model trained with triplet loss re-ranks these candidates with enhanced verb features (not only separable but also discriminative). TNM is a transformer-based semantic role detection model, which detects all roles in parallel. Owing to the global relation modeling ability and flexibility of the transformer decoder, TNM can fully explore the statistical dependency of the roles. Extensive validations on the challenging SWiG benchmark show that SituFormer achieves a new state-of-the-art performance with significant gains under various metrics. Code is available at https://github.com/kellyiss/SituFormer.

AAAI Conference 2021 Conference Paper

Boundary Proposal Network for Two-stage Natural Language Video Localization

  • Shaoning Xiao
  • Long Chen
  • Songyang Zhang
  • Wei Ji
  • Jian Shao
  • Lu Ye
  • Jun Xiao

We aim to address the problem of Natural Language Video Localization (NLVL): localizing the video segment corresponding to a natural language description in a long and untrimmed video. State-of-the-art NLVL methods are almost all one-stage, and can typically be grouped into two categories: 1) anchor-based approach: it first pre-defines a series of video segment candidates (e.g., by sliding window), and then does classification for each candidate; 2) anchor-free approach: it directly predicts the probabilities for each video frame as a boundary or intermediate frame inside the positive segment. However, both kinds of one-stage approaches have inherent drawbacks: the anchor-based approach is susceptible to the heuristic rules, further limiting the capability of handling videos with variant length, while the anchor-free approach fails to exploit the segment-level interaction and thus achieves inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that gets rid of the issues mentioned above. Specifically, in the first stage, BPNet utilizes an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query, followed by a matching score rating layer that outputs the alignment score for each candidate. We evaluate our BPNet on three challenging NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablative studies on these datasets demonstrate that the BPNet outperforms the state-of-the-art methods.

NeurIPS Conference 2021 Conference Paper

FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention

  • Tan Nguyen
  • Vai Suliafu
  • Stanley Osher
  • Long Chen
  • Bao Wang

We propose FMMformers, a class of efficient and flexible transformers inspired by the celebrated fast multipole method (FMM) for accelerating interacting particle simulation. FMM decomposes particle-particle interaction into near-field and far-field components and then performs direct and coarse-grained computation, respectively. Similarly, FMMformers decompose the attention into near-field and far-field attention, modeling the near-field attention by a banded matrix and the far-field attention by a low-rank matrix. Computing the attention matrix for FMMformers requires linear complexity in computational time and memory footprint with respect to the sequence length. In contrast, standard transformers suffer from quadratic complexity. We analyze and validate the advantage of FMMformers over the standard transformer on the Long Range Arena and language modeling benchmarks. FMMformers can even outperform the standard transformer in terms of accuracy by a significant margin. For instance, FMMformers achieve an average classification accuracy of 60.74% over the five Long Range Arena tasks, which is significantly better than the standard transformer's average accuracy of 58.70%.
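
The near-field/far-field decomposition described in the abstract above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the FMMformer implementation: the `elu(x) + 1` feature map, the unweighted sum of the two parts, and all function names are choices made here for illustration; the abstract does not specify how the two components are parameterized or combined.

```python
import math


def banded_softmax_attention(q, k, v, bandwidth):
    """Near-field part: query i attends only to keys j with |i - j| <= bandwidth."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - bandwidth), min(n, i + bandwidth + 1)
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j][c] for e, j in zip(exps, range(lo, hi)))
                    for c in range(len(v[0]))])
    return out


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice for this sketch)."""
    return [a + 1.0 if a > 0 else math.exp(a) for a in x]


def low_rank_attention(q, k, v):
    """Far-field part: phi(Q) (phi(K)^T V), never forming the n-by-n matrix."""
    n, d_v = len(k), len(v[0])
    phi_k = [feature_map(row) for row in k]
    r = len(phi_k[0])
    # kv[a][c] = sum_j phi_k[j][a] * v[j][c];  k_sum[a] = sum_j phi_k[j][a]
    kv = [[sum(phi_k[j][a] * v[j][c] for j in range(n)) for c in range(d_v)]
          for a in range(r)]
    k_sum = [sum(phi_k[j][a] for j in range(n)) for a in range(r)]
    out = []
    for row in q:
        phi_q = feature_map(row)
        norm = sum(a * b for a, b in zip(phi_q, k_sum))  # row normalizer
        out.append([sum(phi_q[a] * kv[a][c] for a in range(r)) / norm
                    for c in range(d_v)])
    return out


def near_plus_far_attention(q, k, v, bandwidth):
    """Sum of the banded and low-rank parts (unweighted, by assumption)."""
    near = banded_softmax_attention(q, k, v, bandwidth)
    far = low_rank_attention(q, k, v)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(near, far)]
```

For a fixed bandwidth and feature dimension, both parts touch each token a constant number of times, which is where the linear cost in sequence length comes from; a full softmax attention would instead score every query against every key.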

AAAI Conference 2021 Conference Paper

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

  • Long Chen
  • Wenbo Ma
  • Jun Xiao
  • Hanwang Zhang
  • Shih-Fu Chang

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., expression-agnostic), hoping that the proposals contain all right instances in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, which is the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores can guide the NMS operation to filter out the boxes irrelevant to the expression, increasing the recall of critical objects, resulting in a significantly improved grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS. Codes are available at: https://github.com/ChopinSharp/ref-nms.

ICRA Conference 2021 Conference Paper

SimNet: Learning Reactive Self-driving Simulations from Real-world Observations

  • Luca Bergamini
  • Yawei Ye
  • Oliver Scheel
  • Long Chen
  • Chih Hu
  • Luca Del Pero
  • Blazej Osinski
  • Hugo Grimmett

In this work we present a simple end-to-end trainable machine learning system capable of realistically simulating driving experiences. This can be used for verification of self-driving system performance without relying on expensive and time-consuming road testing. In particular, we frame the simulation problem as a Markov Process, leveraging deep neural networks to model both state distribution and transition function. These are trainable directly from the existing raw observations without the need of any handcrafting in the form of plant or kinematic models. All that is needed is a dataset of historical traffic episodes. Our formulation allows the system to construct never-seen scenes that unfold realistically, reacting to the self-driving car’s behaviour. We train our system directly from 1,000 hours of driving logs and measure both the realism and the reactivity of the simulation, its two key properties. At the same time we apply the method to evaluate performance of a recently proposed state-of-the-art ML planning system [1] trained from human driving logs. We discover this planning system is prone to previously unreported causal confusion issues that are difficult to test by non-reactive simulation. To the best of our knowledge, this is the first work that directly merges highly realistic data-driven simulations with a closed loop evaluation for self-driving vehicles. We make the data, code, and pre-trained models publicly available to further stimulate simulation development.

JBHI Journal 2021 Journal Article

Unsupervised Eye Blink Artifact Detection From EEG With Gaussian Mixture Model

  • Jiuwen Cao
  • Long Chen
  • Dinghan Hu
  • Fang Dong
  • Tiejia Jiang
  • Weidong Gao
  • Feng Gao

Eye blink is one of the most common artifacts in electroencephalogram (EEG) and significantly affects the performance of EEG-related applications, such as epilepsy recognition, spike detection, encephalitis diagnosis, etc. To achieve accurate and efficient eye blink detection, a novel unsupervised learning algorithm based on hybrid thresholding followed by a Gaussian mixture model (GMM) is presented in this paper. The EEG signal is preliminarily screened by a cascaded thresholding method built on the distributions of signal amplitude, amplitude displacement, as well as the cross-channel correlation. Then, the channel correlation of the two frontal electrodes (FP1, FP2), the fractal dimension, and the mean of amplitude difference between FP1 and FP2 are extracted to characterize the filtered EEGs. The GMM trained on these features is applied for the eye blink detection. The performance of the proposed algorithm is studied on two EEG datasets collected by the Temple University Hospital (TUH) and the Children's Hospital, Zhejiang University School of Medicine (CHZU), where the datasets are recorded from epilepsy and encephalitis patients, and contain a lot of eye blink artifacts. Experimental results show that the proposed algorithm can achieve the highest detection precision and F1 score over the state-of-the-art methods.

EAAI Journal 2021 Journal Article

Using modified term frequency to improve term weighting for text classification

  • Long Chen
  • Liangxiao Jiang
  • Chaoqun Li

Text classification (TC) is an essential task of natural language processing (NLP). In order to improve the performance of TC, term weighting is often used to obtain effective text representation by assigning appropriate weights to each term. A term weighting scheme is generally composed of term frequency factor, collection frequency factor and normalization factor. The normalization factor is commonly used as an optional factor to offset the influence of document length. Through the investigation of the existing term weighting schemes, we found that most of them focus on finding a more effective collection frequency factor, but rarely pay attention to finding a new term frequency factor. In this paper, we first proposed a new term frequency factor called modified term frequency (MTF). Different from the normalization factor, MTF directly modifies the raw term frequency based on the length information of all training documents. Then we proposed a new term weighting scheme by combining MTF with an existing collection frequency factor called modified distinguishing feature selector (MDFS). We denoted our scheme by MTF-MDFS (MDFS-based MTF). Extensive experimental results on 19 benchmark text datasets and 6 real-world text datasets show that our proposed MTF and MTF-MDFS are all much better than their state-of-the-art competitors in terms of the classification accuracy and the weighted average of F1 of widely used base classifiers, such as MNB, SVM and LR.

ICRA Conference 2021 Conference Paper

What data do we need for training an AV motion planner?

  • Long Chen
  • Lukas Platinsky
  • Stefanie Speichert
  • Blazej Osinski
  • Oliver Scheel
  • Yawei Ye
  • Hugo Grimmett
  • Luca Del Pero

We investigate what grade of sensor data is required for training an imitation-learning-based AV planner on human expert demonstration. Machine-learned planners [1] are very hungry for training data, which is usually collected using vehicles equipped with the same sensors used for autonomous operation [1]. This is costly and non-scalable. If cheaper sensors could be used for collection instead, data availability would go up, which is crucial in a field where data volume requirements are large and availability is small. We present experiments using up to 1000 hours worth of expert demonstration and find that training with 10x lower-quality data outperforms 1x AV-grade data in terms of planner performance (see Fig. 1). The important implication of this is that cheaper sensors can indeed be used. This serves to improve data access and democratize the field of imitation-based motion planning. Alongside this, we perform a sensitivity analysis of planner performance as a function of perception range, field-of-view, accuracy, and data volume, and reason about why lower-quality data still provide good planning results.

AAAI Conference 2020 Conference Paper

Question-Driven Purchasing Propensity Analysis for Recommendation

  • Long Chen
  • Ziyu Guan
  • Qibin Xu
  • Qiong Zhang
  • Huan Sun
  • Guangyue Lu
  • Deng Cai

Merchants of e-commerce Websites expect recommender systems to entice more consumption which is highly correlated with the customers’ purchasing propensity. However, most existing recommender systems focus on customers’ general preference rather than purchasing propensity often governed by instant demands which we deem to be well conveyed by the questions asked by customers. A typical recommendation scenario is: Bob wants to buy a cell phone which can play the game PUBG. He is interested in HUAWEI P20 and asks “can PUBG run smoothly on this phone?” under it. Then our system will be triggered to recommend the most eligible cell phones to him. Intuitively, diverse user questions could probably be addressed in reviews written by other users who have similar concerns. To address this recommendation problem, we propose a novel Question-Driven Attentive Neural Network (QDANN) to assess the instant demands of questioners and the eligibility of products based on user generated reviews, and do recommendation accordingly. Without supervision, QDANN can well exploit reviews to achieve this goal. The attention mechanisms can be used to provide explanations for recommendations. We evaluate QDANN in three domains of Taobao. The results show the efficacy of our method and its superiority over baseline methods.

AAAI Conference 2020 Conference Paper

Rethinking the Bottom-Up Framework for Query-Based Video Localization

  • Long Chen
  • Chujie Lu
  • Siliang Tang
  • Jun Xiao
  • Dong Zhang
  • Chilie Tan
  • Xiaolin Li

In this paper, we focus on the task of query-based video localization, i.e., localizing a query in a long and untrimmed video. The prevailing solutions for this problem can be grouped into two categories: i) Top-down approach: It pre-cuts the video into a set of moment candidates, then it does classification and regression for each candidate; ii) Bottom-up approach: It injects the whole query content into each video frame, then it predicts the probabilities of each frame as a ground truth segment boundary (i.e., start or end). Both frameworks have their respective shortcomings: the top-down models suffer from heavy computations and they are sensitive to the heuristic rules, while the performance of bottom-up models is behind the performance of their top-down counterparts thus far. However, we argue that the performance of the bottom-up framework is severely underestimated by current unreasonable designs, including both the backbone and head network. To this end, we design a novel bottom-up model: Graph-FPN with Dense Predictions (GDP). For the backbone, GDP firstly generates a frame feature pyramid to capture multi-level semantics, then it utilizes graph convolution to encode the plentiful scene relationships, which incidentally mitigates the semantic gaps in the multi-scale feature pyramid. For the head network, GDP regards all frames falling in the ground truth segment as the foreground, and each foreground frame regresses the unique distances from its location to bi-directional boundaries. Extensive experiments on two challenging query-based video localization tasks (natural language video localization and video relocalization), involving four challenging benchmarks (TACoS, Charades-STA, ActivityNet Captions, and Activity-VRL), have shown that GDP surpasses the state-of-the-art top-down models.

NeurIPS Conference 2020 Conference Paper

Trading Personalization for Accuracy: Data Debugging in Collaborative Filtering

  • Long Chen
  • Yuan Yao
  • Feng Xu
  • Miao Xu
  • Hanghang Tong

Collaborative filtering has been widely used in recommender systems. Existing work has primarily focused on improving the prediction accuracy mainly via either building refined models or incorporating additional side information, yet has largely ignored the inherent distribution of the input rating data. In this paper, we propose a data debugging framework to identify overly personalized ratings whose existence degrades the performance of a given collaborative filtering model. The key idea of the proposed approach is to search for a small set of ratings whose editing (e.g., modification or deletion) would near-optimally improve the recommendation accuracy of a validation set. Experimental results demonstrate that the proposed approach can significantly improve the recommendation accuracy. Furthermore, we observe that the identified ratings significantly deviate from the average ratings of the corresponding items, and the proposed approach tends to modify them towards the average. This result sheds light on the design of future recommender systems in terms of balancing between the overall accuracy and personalization.

AAAI Conference 2019 Conference Paper

Answer Identification from Product Reviews for User Questions by Multi-Task Attentive Networks

  • Long Chen
  • Ziyu Guan
  • Wei Zhao
  • Wanqing Zhao
  • Xiaopeng Wang
  • Zhou Zhao
  • Huan Sun

Online Shopping has become a part of our daily routine, but it still cannot offer intuitive experience as store shopping. Nowadays, most e-commerce Websites offer a Question Answering (QA) system that allows users to consult other users who have purchased the product. However, users still need to wait patiently for others’ replies. In this paper, we investigate how to provide a quick response to the asker by plausible answer identification from product reviews. By analyzing the similarity and discrepancy between explicit answers and reviews that can be answers, a novel multi-task deep learning method with carefully designed attention mechanisms is developed. The method can well exploit large amounts of user generated QA data and a few manually labeled review data to address the problem. Experiments on data collected from Amazon demonstrate its effectiveness and superiority over competitive baselines.

IJCAI Conference 2019 Conference Paper

MR-GNN: Multi-Resolution and Dual Graph Neural Network for Predicting Structured Entity Interactions

  • Nuo Xu
  • Pinghui Wang
  • Long Chen
  • Jing Tao
  • Junzhou Zhao

Predicting interactions between structured entities lies at the core of numerous tasks such as drug regimen and new material design. In recent years, graph neural networks have become attractive. They represent structured entities as graphs, and then extract features from each individual graph using graph convolution operations. However, these methods have some limitations: i) their networks only extract features from a fix-sized subgraph structure (i.e., a fix-sized receptive field) of each node, and ignore features in substructures of different sizes, and ii) features are extracted by considering each entity independently, which may not effectively reflect the interaction between two entities. To resolve these problems, we present MR-GNN, an end-to-end graph neural network with the following features: i) it uses a multi-resolution based architecture to extract node features from different neighborhoods of each node, and, ii) it uses dual graph-state long short-term memory networks (LSTMs) to summarize local features of each graph and extracts the interaction features between pairwise graphs. Experiments conducted on real-world datasets show that MR-GNN improves the prediction of state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Dress Fashionably: Learn Fashion Collocation With Deep Mixed-Category Metric Learning

  • Long Chen
  • Yuhang He

In this paper, we seek to enable machines to answer questions like: given a clutch bag, what kind of skirt, heel and even accessory best fashionably collocate with it? This problem, dubbed fashion collocation, has almost been neglected by researchers due to the large uncertainty inherent in fashion collocation and the professional expertise required to address it. In this paper, we narrow down the well-collocated samples to be fashion images shared on fashion websites, with which we propose an end-to-end trainable deep mixed-category metric learning method to project well-collocated clothing items to lie close but items violating well-collocation far apart in the deep embedding space. Specifically, we simultaneously model the intra-category exclusiveness and cross-category inclusiveness of fashion collocation by feeding a set of well-collocated clothing items and corresponding bad-collocated clothing items to the deep neural network. Further, a hard-aware online exemplar mining strategy is designed to force the whole neural network to be trainable and learn discriminative features at the early and later training stages respectively. To motivate more research in fashion collocation, we collect a dataset of 0.2 million fashionably well-collocated images consisting of either on-body or off-body clothing items or accessories. Extensive experimental results show the feasibility and superiority of our method.

JMLR Journal 2018 Journal Article

Maximum Principle Based Algorithms for Deep Learning

  • Qianxiao Li
  • Long Chen
  • Cheng Tai
  • Weinan E

The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using the Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains favorable initial convergence rate per-iteration, provided Hamiltonian maximization can be efficiently carried out - a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.

ICRA Conference 2018 Conference Paper

Real-Time Learning of Efficient Lift Generation on a Dynamically Scaled Flapping Wing Using Policy Search

  • Yagiz E. Bayiz
  • Long Chen
  • Shih-Jung Hsu
  • Pan Liu
  • Aaron N. Aguiles
  • Bo Cheng 0008

In this work, we present a successful application of a policy search algorithm to a real-time robotic learning problem, where the goal is to maximize the efficiency of lift generation on a dynamically scaled flapping robotic wing. The robotic wing has two degrees-of-freedom, i.e., stroke and pitch, and operates in a tank filled with mineral oil. For all experiments, the Reynolds number is maintained constant at 1000, where learning is performed for different prescribed stroke amplitudes to find the optimal wing pitching amplitude and the stroke-pitch phase difference that maximize the power loading (PL) of lift generation, a measure of aerodynamic efficiency. For the investigated stroke amplitude range (30°-90°), the efficiency is observed to increase with the stroke amplitude and the lift is mainly generated through the delayed stall, a quasi-steady aerodynamic mechanism. Furthermore, the wing rotation becomes more asymmetric with respect to stroke reversal as the stroke amplitude decreases, indicating an increased use of unsteady lift generation mechanisms at lower stroke amplitudes.

IJCAI Conference 2018 Conference Paper

Tag-based Weakly-supervised Hashing for Image Retrieval

  • Ziyu Guan
  • Fei Xie
  • Wanqing Zhao
  • Xiaopeng Wang
  • Long Chen
  • Wei Zhao
  • Jinye Peng

We are concerned with using user-tagged images to learn proper hashing functions for image retrieval. The benefits are two-fold: (1) we could obtain abundant training data for deep hashing models; (2) tagging data possesses richer semantic information which could help better characterize similarity relationships between images. However, tagging data suffers from noises, vagueness and incompleteness. Different from previous unsupervised or supervised hashing learning, we propose a novel weakly-supervised deep hashing framework which consists of two stages: weakly-supervised pre-training and supervised fine-tuning. The second stage is as usual. In the first stage, rather than performing supervision on tags, the framework introduces a semantic embedding vector (sem-vector) for each image and performs learning of hashing and sem-vectors jointly. By carefully designing the optimization problem, it can well leverage tagging information and image content for hashing learning. The framework is general and does not depend on specific deep hashing methods. Empirical results on real world datasets show that when it is integrated with state-of-art deep hashing methods, the performance increases by 8-10%.

IS Journal 2016 Journal Article

Combining Region-of-Interest Extraction and Image Enhancement for Nighttime Vehicle Detection

  • Hulin Kuang
  • Long Chen
  • Feng Gu
  • Jiajie Chen
  • Leanne Chan
  • Hong Yan

In nighttime images, vehicle detection is a challenging task because of low contrast and luminosity. In this article, the authors combine a novel region-of-interest (ROI) extraction approach that fuses vehicle light detection and object proposals together with a nighttime image enhancement approach based on improved multiscale retinex to extract accurate ROIs and enhance images for accurate nighttime vehicle detection. Experimental results demonstrate that the proposed nighttime image enhancement method, score-level multifeature fusion, and the ROI extraction method are all effective for nighttime vehicle detection. The proposed vehicle detection method achieves a 93.34 percent detection rate and outperforms other models, detecting blurred and partly occluded vehicles, as well as vehicles in a variety of sizes, numbers, locations, and backgrounds.

IJCAI Conference 2016 Conference Paper

Weakly-Supervised Deep Learning for Customer Review Sentiment Classification

  • Ziyu Guan
  • Long Chen
  • Wei Zhao
  • Yi Zheng
  • Shulong Tan
  • Deng Cai

Sentiment analysis is one of the key challenges for mining online user-generated content. In this work, we focus on customer reviews, which are an important form of opinionated content. The goal is to identify the semantic orientation (e.g., positive or negative) of each sentence in a review. Traditional sentiment classification methods often involve substantial human effort, e.g., lexicon construction and feature engineering. In recent years, deep learning has emerged as an effective means for solving sentiment classification problems. A neural network intrinsically learns a useful representation automatically without human effort. However, the success of deep learning relies heavily on the availability of large-scale training data. In this paper, we propose a novel deep learning framework for review sentiment classification which employs prevalently available ratings as weak supervision signals. The framework consists of two steps: (1) learn a high-level representation (embedding space) which captures the general sentiment distribution of sentences through rating information; (2) add a classification layer on top of the embedding layer and use labeled sentences for supervised fine-tuning. Experiments on review data obtained from Amazon show the efficacy of our method and its superiority over baseline methods.

YNIMG Journal 2010 Journal Article

Localization of cerebral functional deficits in treatment-naive, first-episode schizophrenia using resting-state fMRI

  • Xiao-Qi Huang
  • Su Lui
  • Wei Deng
  • Raymond C.K. Chan
  • Qi-Zhu Wu
  • Li-Jun Jiang
  • Jun-Ran Zhang
  • Zhi-Yun Jia

Background: Spontaneous low-frequency fluctuations (LFF) in the blood oxygen level-dependent (BOLD) functional magnetic resonance imaging (fMRI) signal have been shown to reflect cerebral spontaneous neural activity, and the present study attempts to explore the functional changes in the regional brain in patients with schizophrenia using the amplitude of the BOLD signals. Methods: A total of 66 treatment-naïve, first-episode schizophrenia (FES) patients and 66 normal age- and sex-matched controls were recruited. Resting-state fMRIs were obtained using a gradient-echo echo-planar imaging sequence. The amplitude of LFF (ALFF) was calculated using REST software. Voxel-based analysis of the ALFF maps between control and patient groups was performed with two-sample t-tests using SPM2. Results: Compared to the controls, the FES group showed significantly decreased ALFF in the medial prefrontal lobe (MPFC) and significant increases in the ALFF in the left and right putamen. Significant positive correlations were observed between ALFF values in the bilateral putamen in both the patient and control groups. Conclusions: Alterations of the ALFF in the MPFC and putamen in FES observed in the present study suggest that the functional abnormalities of those areas are at an early stage of the disease.
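
The ALFF measure named in the abstract above can be illustrated for a single voxel time series with a minimal sketch. This is not the REST software's implementation. It assumes the commonly used definition of ALFF as the mean amplitude (square root of power) of the spectrum within a low-frequency band. The 0.01-0.08 Hz band, the function name `alff`, and the repetition time `tr` are assumptions for illustration; the abstract does not state them.

```python
import numpy as np

def alff(signal, tr, low=0.01, high=0.08):
    """Mean spectral amplitude of one time series within [low, high] Hz.

    signal : 1-D sequence of BOLD samples for a single voxel
    tr     : sampling interval in seconds (hypothetical value, not from the source)
    low, high : band edges in Hz (assumed defaults, not from the source)
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                       # remove the constant offset
    freqs = np.fft.rfftfreq(x.size, d=tr)  # frequency of each spectral bin
    amplitude = np.abs(np.fft.rfft(x)) / x.size
    band = (freqs >= low) & (freqs <= high)
    return amplitude[band].mean()

# Synthetic check: a slow oscillation falls inside the band, a fast one does not.
t = np.arange(200) * 2.0                   # 200 samples at a 2 s interval
slow = np.sin(2 * np.pi * 0.05 * t)        # 0.05 Hz, inside the band
fast = np.sin(2 * np.pi * 0.20 * t)        # 0.20 Hz, outside the band
print(alff(slow, tr=2.0) > alff(fast, tr=2.0))
```

In a voxel-based analysis of the kind the abstract describes, a value like this would be computed per voxel to form a map, and the maps would then be compared between groups; the preprocessing and normalization steps of the actual study are not reproduced here.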