Arrow Research search

Author name cluster

Jian Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

160 papers
2 author rows

Possible papers (160)

AAAI Conference 2026 Conference Paper

Diffusion-Based Contextual Reconstruction for Point Cloud Segmentation with Limited Annotations

  • Jiawei Lian
  • Zhengxue Wang
  • Wentao Qu
  • Haobo Jiang
  • Le Hui
  • Jian Yang

Point cloud semantic segmentation is fundamental to 3D scene understanding, but dense annotation requirements limit scalability. Although recent label propagation and contrastive learning methods enhance local consistency, the incomplete object coverage caused by sparse annotations hinders global context modeling, ultimately limiting overall performance. To this end, we propose a diffusion-based contextual reconstruction framework for point cloud semantic segmentation with limited annotations. At its core, our framework guides denoising with semantic predictions, so that improved context reconstruction in turn strengthens the conditional model and the resulting segmentation. Specifically, our contributions include: (1) a diffusion-based segmentation framework that reconstructs contextual semantics from noise under conditional guidance, sharing the decoder with the segmentation module for robust contextual semantic learning; and (2) a conditioning mechanism that dynamically aggregates local context from segmentation features and guides denoising with global spatial structure, significantly enhancing denoising quality and contextual awareness. Notably, we pioneer diffusion models for 3D semantic segmentation with limited annotations, enabling efficient single-step inference. Experiments show robustness across varying annotation ratios and state-of-the-art performance on benchmarks.

AAAI Conference 2026 Conference Paper

From Parameter to Representation: A Closed-Form Approach for Controllable Model Merging

  • Jialin Wu
  • Jian Yang
  • Handing Wang
  • Jiajun Wen
  • Zhiyong Yu

Model merging combines expert models for multitask performance but faces challenges from parameter interference. This has sparked recent interest in controllable model merging, giving users the ability to explicitly balance performance trade-offs. Existing approaches employ a compile-then-query paradigm, performing a costly offline multi-objective optimization to enable fast, preference-aware model generation. This offline stage typically involves iterative search or dedicated training, with complexity that grows exponentially with the number of tasks. To overcome these limitations, we shift the perspective from parameter-space optimization to a direct correction of the model's final representation. Our approach models this correction as an optimal linear transformation, yielding a closed-form solution that replaces the entire offline optimization process with a single-step, architecture-agnostic computation. This solution directly incorporates user preferences, allowing a Pareto-optimal model to be generated on-the-fly with complexity that scales linearly with the number of tasks. Experimental results show our method generates a superior Pareto front with more precise preference alignment and drastically reduced computational cost.
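
For intuition, here is a minimal numpy sketch of one plausible reading of such a closed-form, preference-weighted representation correction; the function name, the least-squares formulation, and the ridge term are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def closed_form_correction(h_merged, h_experts, prefs, ridge=1e-6):
    """Hypothetical sketch: fit a linear map W that pulls the merged model's
    representations toward a preference-weighted mix of expert representations.

    h_merged : (n, d) representations from the merged model
    h_experts: list of (n, d) arrays, one per expert/task
    prefs    : (T,) user preference weights (assumed to sum to 1)
    """
    d = h_merged.shape[1]
    # Preference-weighted least squares:
    #   min_W  sum_t prefs[t] * || h_experts[t] - h_merged @ W ||_F^2
    A = h_merged.T @ h_merged + ridge * np.eye(d)                        # shared (d, d) Gram matrix
    B = sum(p * (h_merged.T @ h_t) for p, h_t in zip(prefs, h_experts))  # preference-weighted (d, d)
    W = np.linalg.solve(A, B)                                            # closed-form solution of A W = B
    return W

# toy usage on synthetic representations
rng = np.random.default_rng(0)
h = rng.normal(size=(128, 16))
experts = [h + rng.normal(scale=0.1, size=h.shape) for _ in range(3)]
W = closed_form_correction(h, experts, prefs=np.array([0.5, 0.3, 0.2]))
corrected = h @ W
```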

JBHI Journal 2026 Journal Article

Point-Supervised Coronary Semantic Segmentation in X-Ray Angiographic Images

  • Ying Chen
  • Danni Ai
  • Jianyu Du
  • Yuanyuan Wang
  • Tianyu Fu
  • Deqiang Xiao
  • Yucong Lin
  • Long Shao

Coronary semantic segmentation in X-ray angiography is essential for computer-aided diagnosis and treatment planning of coronary artery disease (CAD). Despite its importance, this task remains highly challenging due to the complex and interconnected vascular topology, as well as the similar visual characteristics among different branches, making dense pixel-level manual annotation difficult and labor-intensive. To alleviate this burden, we propose a point-supervised coronary semantic segmentation framework that significantly reduces annotation effort without compromising segmentation accuracy. The primary challenge of point-label-based supervision lies in the model's tendency to overfit sparse point labels, leading to limited generalization to pixel-level predictions. To enrich the supervision signals and stabilize training with sparse point labels, we propose an adaptive foreground mask generation module and a region regularization strategy to ensure accurate semantic guidance while maximizing meaningful coverage of the vascular structures. To enhance coronary topology perception and branch differentiation, we propose a multi-task learning framework that jointly performs keypoint detection and coronary semantic segmentation through a shared feature extraction encoder and two task-specific decoders. The experimental results demonstrate that our point-supervised model achieves performance comparable to a fully supervised model and outperforms existing state-of-the-art point-supervised semantic segmentation methods.

AAAI Conference 2026 Conference Paper

RMLer: Synthesizing Novel Objects Across Diverse Categories via Reinforcement Mixing Learning

  • Jun Li
  • Zikun Chen
  • Haibo Chen
  • Shuo Chen
  • Jian Yang

Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in text-to-image generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs, resulting in conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, and optimize the policy via proximal policy optimization. At inference time, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate that RMLer synthesizes coherent, high-fidelity objects from diverse categories and consistently outperforms existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.
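
As a rough illustration of the mixing action described above, the following hypothetical PyTorch sketch shows an MLP policy that outputs per-dimension blend coefficients for two concept text embeddings; the architecture and names are assumptions, and the PPO training loop and visual rewards are omitted.

```python
import torch
import torch.nn as nn

class MixingPolicy(nn.Module):
    """Hypothetical sketch: an MLP policy that maps a pair of concept text
    embeddings to per-dimension blend coefficients in [0, 1]; the blended
    embedding would then condition a text-to-image model."""

    def __init__(self, dim=768, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, e_a, e_b):
        # action: dynamic mixing coefficients predicted from the pair of embeddings
        coeff = torch.sigmoid(self.net(torch.cat([e_a, e_b], dim=-1)))
        # state: the resulting mixed embedding
        mixed = coeff * e_a + (1.0 - coeff) * e_b
        return mixed, coeff

# toy usage with random embeddings standing in for two concept prompts
policy = MixingPolicy()
e_a, e_b = torch.randn(4, 768), torch.randn(4, 768)
mixed, coeff = policy(e_a, e_b)
```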

AAAI Conference 2026 Conference Paper

Shaping Without Tearing: Controllable Diffeomorphic Deformations for Topology-Preserving 3D Point Cloud Augmentation

  • Jian Bi
  • Qianliang Wu
  • Jianjun Qian
  • Lei Luo
  • Jian Yang

Point cloud data augmentation is critical to improving the generalization of 3D deep learning models. However, existing methods often fail to preserve the underlying manifold structure, leading to semantic distortion or topology violation. This causes models to learn untrustworthy features, thereby limiting their representational ability. To overcome these limitations, we propose ManiPoint, a novel point cloud augmentation framework based on diffeomorphisms that explicitly preserves manifold structure during deformation. ManiPoint constructs diffeomorphic transformations via continuous differentiable mappings, ensuring topological consistency and geometric continuity between original and augmented data. To prevent excessive distortion and ensure semantic consistency, we introduce a controllable deformation mechanism that quantitatively constrains the augmentation magnitude and enables fine-grained control over the deformation space. We further provide theoretical analysis indicating that, compared with topologically inconsistent methods, ManiPoint reduces empirical and vicinal risks by generating diverse and structurally reliable samples. Extensive experiments and visualizations on object-level datasets demonstrate that ManiPoint produces high-quality augmentations and consistently improves model robustness over existing baselines. Meanwhile, the scalability of our method is further verified on scene-level datasets.

AAAI Conference 2026 Conference Paper

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

  • Yuxuan Li
  • Xiang Li
  • Yunheng Li
  • Yicheng Zhang
  • Yimian Dai
  • Qibin Hou
  • Ming-Ming Cheng
  • Jian Yang

With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, we propose a novel consistency and synchronization optimization mechanism, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det's effectiveness and generalizability, consistently outperforming the combination of specialized models on individual datasets.

AAAI Conference 2026 Conference Paper

Small but Mighty: Dynamic Wavelet Expert-Guided Fine-Tuning of Large-Scale Models for Optical Remote Sensing Object Segmentation

  • Yanguang Sun
  • Chao Wang
  • Jian Yang
  • Lei Luo

Accurately localizing and segmenting relevant objects from optical remote sensing images (ORSIs) is critical for advancing remote sensing applications. Existing methods are typically built upon moderate-scale pre-trained models and employ diverse optimization strategies to achieve promising performance under full-parameter fine-tuning. In fact, deeper and larger-scale foundation models can provide stronger support for performance improvement. However, due to their massive number of parameters, directly adopting full-parameter fine-tuning leads to pronounced training difficulties, such as excessive GPU memory consumption and high computational costs, which result in extremely limited exploration of large-scale models in existing works. In this paper, we propose a novel dynamic wavelet expert-guided fine-tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large-scale foundation models to ORSIs segmentation tasks by leveraging the guidance of wavelet experts. Specifically, we introduce a task-specific wavelet expert extractor to model wavelet experts from different perspectives and dynamically regulate their outputs, thereby generating trainable features enriched with task-specific information for subsequent fine-tuning. Furthermore, we construct an expert-guided conditional adapter that first enhances the fine-grained perception of frozen features for specific tasks by injecting trainable features, and then iteratively updates the information of both types of features, allowing for efficient fine-tuning. Extensive experiments show that our WEFT not only outperforms 21 state-of-the-art (SOTA) methods on three ORSIs datasets, but also achieves optimal results in camouflage, natural, and medical scenarios.

AAAI Conference 2026 Conference Paper

SpatioTemporal Difference Network for Video Depth Super-Resolution

  • Zhengxue Wang
  • Yuan Wu
  • Xiang Li
  • Zhiqiang Yan
  • Jian Yang

Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.

TMLR Journal 2026 Journal Article

SpikingBrain: Spiking Brain-inspired Large Models

  • Yuqi Pan
  • Yupeng Feng
  • JingHao Zhuang
  • Siyu Ding
  • Han Xu
  • Zehao Liu
  • Bohan Sun
  • Yuhong Chou

Mainstream Transformer-based large language models (LLMs) face significant efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly. These constraints limit their ability to process long sequences effectively. In addition, building large models on non-NVIDIA computing platforms poses major challenges in achieving stable and efficient training and deployment. To address these issues, we introduce SpikingBrain, a new family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three core aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline compatible with existing LLMs, along with a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to the MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and our training framework supports weeks of stable training on hundreds of MetaX GPUs with Model FLOPs Utilization (MFU) at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using exceptionally low data resources (continual pre-training of approximately 150B tokens). Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B achieves more than 100× speedup in Time to First Token (TTFT) for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15% sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.

AAAI Conference 2025 Conference Paper

Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video

  • Junkai Fan
  • Kun Wang
  • Zhiqiang Yan
  • Xiang Chen
  • Shangbing Gao
  • Jun Li
  • Jian Yang

In this paper, we study the challenging problem of simultaneously removing haze and estimating depth from real monocular hazy videos. These tasks are inherently complementary: enhanced depth estimation improves dehazing via the atmospheric scattering model (ASM), while superior dehazing contributes to more accurate depth estimation through the brightness consistency constraint (BCC). To tackle these intertwined tasks, we propose a novel depth-centric learning framework that integrates the ASM model with the BCC constraint. Our key idea is that both ASM and BCC rely on a shared depth estimation network. This network simultaneously exploits adjacent dehazed frames to enhance depth estimation via BCC and uses the refined depth cues to more effectively remove haze through ASM. Additionally, we leverage a non-aligned clear video and its estimated depth to independently regularize the dehazing and depth estimation networks. This is achieved by designing two discriminator networks: D_MFIR enhances high-frequency details in dehazed videos, and D_MDR reduces the occurrence of black holes in low-texture regions. Extensive experiments demonstrate that the proposed method outperforms current state-of-the-art techniques in both video dehazing and depth estimation tasks, especially in real-world hazy scenes.
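
The atmospheric scattering model (ASM) referenced above is the standard haze formation model; in its common form (notation follows the standard literature, not necessarily this paper's):

```latex
% Standard atmospheric scattering model (ASM).
% I(x): observed hazy image, J(x): haze-free scene radiance, A: global atmospheric light,
% t(x): transmission, beta: scattering coefficient, d(x): scene depth.
I(x) = J(x)\, t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta\, d(x)}
```

Because transmission t(x) is an exponential function of depth d(x), a better depth estimate directly tightens the transmission used for dehazing, which is the complementarity the abstract exploits.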

AAAI Conference 2025 Conference Paper

Dual Manifold Regularization Steered Robust Representation Learning for Point Cloud Analysis

  • Jian Bi
  • Qianliang Wu
  • Jianjun Qian
  • Lei Luo
  • Jian Yang

With the rapid advancement of 3D scanning technology, point clouds have become a crucial data type in computer vision and machine learning. However, learning robust representations for point clouds remains a significant challenge due to their irregularity and sparsity. In this paper, we propose a novel Dual Manifold Regularization (DMR) framework that makes full use of the properties of positive and negative curvature in manifolds to improve the representation of point clouds. Specifically, we leverage DMR based on hyperbolic and hyperspherical manifolds to address the limitations of traditional single-manifold regularization techniques, including inadequate generalization ability and adaptability to data diversity, as well as the difficulty of capturing complex relationships between data. To begin, we utilize the tree-like structure of the hyperbolic manifold to model the part-whole hierarchical relationships within point clouds. This allows for a more comprehensive representation of the data, improving the model's capability to understand complex shapes. Additionally, we construct positive samples through topological consistency augmentation and employ contrastive learning techniques in the hyperspherical manifold to capture more discriminative features within the data. Our experimental results show that our method outperforms traditional supervised learning and single-manifold regularization techniques in point cloud analysis. Specifically, for shape classification, DMR achieves a new State-Of-The-Art (SOTA) performance with 94.8% Overall Accuracy (OA) on ModelNet40 and 90.7% OA on ScanObjectNN, surpassing the recent SOTA model without increasing the baseline parameters.

IJCAI Conference 2025 Conference Paper

Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images

  • Yanguang Sun
  • Jiexi Yan
  • Jianjun Qian
  • Chunyan Xu
  • Jian Yang
  • Lei Luo

Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is valuable research, but it presents several challenges, including the heterogeneity between the two types of features, high model complexity, and a large number of parameters. However, these issues are often overlooked in existing ORSI methods, causing sub-optimal segmentation. To that end, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design a global-local mixed attention that captures diverse information from two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase expressive ability. Additionally, we construct a DPU-Former decoder to aggregate and strengthen features at different layers. Consequently, the DPU-Former model outperforms state-of-the-art methods on multiple datasets. Code: https://github.com/CSYSI/DPU-Former.

NeurIPS Conference 2025 Conference Paper

Enhancing Contrastive Learning with Variable Similarity

  • Haowen Cui
  • Shuo Chen
  • Jun Li
  • Jian Yang

Contrastive learning has achieved remarkable success in self-supervised learning by pretraining a generalizable feature representation based on augmentation invariance. Most existing approaches assume that different augmented views of the same instance (i.e., the positive pairs) remain semantically invariant. However, augmentations of varying extent may introduce semantic discrepancies or even content distortion, and thus the conventional (pseudo) supervision from augmentation invariance may lead to misguided learning objectives. In this paper, we propose a novel method called Contrastive Learning with Variable Similarity (CLVS) to accurately characterize the intrinsic similarity relationships between different augmented views. Our method dynamically adjusts the similarity based on the augmentation extent, and it ensures that strongly augmented views are always assigned lower similarity scores than weakly augmented ones. We provide a theoretical analysis to guarantee the effectiveness of the variable similarity in improving model generalizability. Extensive experiments demonstrate the superiority of our approach, achieving gains of 2.1% on ImageNet-100 and 1.4% on ImageNet-1k compared with state-of-the-art methods.
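
A hypothetical PyTorch sketch of the "variable similarity" idea, assuming a per-sample augmentation-strength scalar and a simple monotone target; this is one plausible instantiation, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def variable_similarity_loss(z1, z2, aug_strength, temperature=0.2, alpha=1.0):
    """Hypothetical sketch of variable similarity: instead of pulling every
    positive pair toward similarity 1, lower the target for strongly augmented
    views (larger aug_strength -> lower target similarity).

    z1, z2       : (B, d) embeddings of two augmented views
    aug_strength : (B,) augmentation extent in [0, 1] for the second view
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    target = 1.0 - alpha * aug_strength            # monotone: stronger aug -> lower target
    sim = (z1 * z2).sum(dim=1)                     # cosine similarity of positive pairs
    pos_loss = (sim - target).pow(2).mean()        # regress toward the variable target
    # standard instance-discrimination term over the batch as the negative part
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    neg_loss = F.cross_entropy(logits, labels)
    return pos_loss + neg_loss
```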

AAAI Conference 2025 Conference Paper

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

  • Xiantao Hu
  • Ying Tai
  • Xu Zhao
  • Chen Zhao
  • Zhenyu Zhang
  • Jun Li
  • Bineng Zhong
  • Jian Yang

Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that rely solely on updating reference information, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduce the Mamba fusion and background suppression interactive (BSI) modules, which establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios.

IJCAI Conference 2025 Conference Paper

Federated Learning at the Forefront of Fairness: A Multifaceted Perspective

  • Noorain Mukhtiar
  • Adnan Mahmood
  • Yipeng Zhou
  • Jian Yang
  • Jing Teng
  • Quan Z. Sheng

Fairness in Federated Learning (FL) is emerging as a critical factor, driven by heterogeneous clients' constraints and the need for balanced model performance across various scenarios. In this survey, we delineate a comprehensive classification of state-of-the-art fairness-aware approaches from a multifaceted perspective, i.e., model performance-oriented and capability-oriented. Moreover, we provide a framework to categorize and address various fairness concerns and associated technical aspects, examining their effectiveness in balancing equity and performance within FL frameworks. We further examine several significant evaluation metrics leveraged to measure fairness quantitatively. Finally, we explore exciting open research directions and propose prospective solutions that could drive future advancements in this important area, laying a solid foundation for researchers working toward fairness in FL.

JBHI Journal 2025 Journal Article

FeverMamba: Highlight Fever in Crowd Via Mamba for Remote Fever Screening

  • Mengkai Yan
  • Jianjun Qian
  • Jindi Bao
  • Jian Yang

Remote fever screening based on thermal infrared images can screen feverish faces in real time and plays an important role in vital sign monitoring. Screening methods that rely on short-term crowd facial temperature differences can overcome environmental effects without the need for additional sensors, but effectively encoding these differences to highlight fever faces remains a challenge. To this end, we develop a fever screening framework based on Mamba, which exploits the context-capturing capability of Mamba to model the temperature differences of thermal infrared face images. Furthermore, considering that local contextual associations in the crowd limit the construction of global differences, we propose a shuffle scanning method to break the local contextual associations and construct multiple shuffle scans to achieve a global difference representation. In addition, we design a temperature-aware self-supervised loss function to cope with the situation where data of fever faces is difficult to collect. Finally, we achieve state-of-the-art performance in both supervised and self-supervised cases on a thermal infrared face dataset.

AAAI Conference 2025 Conference Paper

Fine-Tuning Language Models with Collaborative and Semantic Experts

  • Jiaxi Yang
  • Binyuan Hui
  • Min Yang
  • Jian Yang
  • Lei Zhang
  • Qiang Qu
  • Junyang Lin

Recent advancements in large language models (LLMs) have broadened their application scope but revealed challenges in balancing capabilities across general knowledge, coding, and mathematics. To address this, we introduce a Collaborative and Semantic Experts (CoE) approach for supervised fine-tuning (SFT), which employs a two-phase training strategy. Initially, expert training fine-tunes the feed-forward network on specialized datasets, developing distinct experts in targeted domains. Subsequently, expert leveraging synthesizes these trained experts into a structured model with semantic guidance to activate specific experts, enhancing performance and interpretability. Evaluations on comprehensive benchmarks across MMLU, HumanEval, GSM8K, MT-Bench, and AlpacaEval confirm CoE's efficacy, demonstrating improved performance and expert collaboration in diverse tasks, significantly outperforming traditional SFT methods.

IROS Conference 2025 Conference Paper

Finite-time Guiding Vector Fields for Accelerated Path Following of Nonholonomic Robots

  • Jian Yang
  • Junlong Wu
  • Yuan Ouyang
  • Weijia Yao

Guiding vector fields (GVFs) have been widely applied in robotic path-following control. However, most, if not all, of the existing studies derive control algorithms that only render the path-following error asymptotically converging to zero, while more stringent time constraints on the error convergence have not been fully studied. In this paper, by introducing a signum-based function, we propose a finite-time GVF that enables a nonholonomic robot to follow an arbitrary smooth nD desired path within a finite time. Notably, this finite time depends on the initial condition and can be computed in advance. For practical applications, we design a controller based on the proposed GVF for the unicycle model; this controller drives a nonholonomic robot's velocity to align with that of the GVF within a finite time. In addition, we extend the proposed GVF to distributed motion coordination among an arbitrary number of robots. Finally, we conduct two experiments using unmanned ground vehicles to validate the effectiveness of the proposed algorithms.
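
For a concrete picture, a minimal 2D sketch of a guiding vector field with a signum-based correction term, here for the unit circle; the gains, the chosen path, and the exact form of the field are illustrative assumptions rather than the paper's construction.

```python
import numpy as np

def finite_time_gvf(p, k=1.0, alpha=0.5):
    """Hypothetical 2D sketch: guiding vector field for following the unit
    circle phi(x, y) = x^2 + y^2 - 1 = 0, using a signum-based term
    sig(phi)^alpha = sign(phi) * |phi|^alpha with 0 < alpha < 1, the usual
    ingredient for finite-time (rather than asymptotic) error convergence."""
    x, y = p
    phi = x**2 + y**2 - 1.0                          # path-following error (level-set value)
    grad = np.array([2.0 * x, 2.0 * y])              # gradient of phi
    tangent = np.array([-grad[1], grad[0]])          # 90-degree rotation: propagation along the path
    sig = np.sign(phi) * np.abs(phi) ** alpha        # signum-based correction function
    return tangent - k * sig * grad                  # move along the path + converge to it

# toy usage: evaluate the field at a point off the path
print(finite_time_gvf(np.array([1.5, 0.0])))
```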

NeurIPS Conference 2025 Conference Paper

FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction

  • Jiang Lin
  • Xinyu Chen
  • Song Wu
  • Zhiqiu Zhang
  • Jizhi Zhang
  • Ye Wang
  • Qiang Tang
  • Qian Wang

Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based approaches offer stronger alignment but incur high inference cost due to dual-path denoising. We present FreeControl, a training-free framework for semantic structural control in diffusion models. Unlike prior methods that extract attention across multiple timesteps, FreeControl performs one-step attention extraction from a single, optimally chosen timestep and reuses it throughout denoising. This enables efficient structural guidance without inversion or retraining. To further improve quality and stability, we introduce Latent-Condition Decoupling (LCD): a principled separation of the timestep condition and the noised latent used in attention extraction. LCD provides finer control over attention quality and eliminates structural artifacts. FreeControl also supports compositional control via reference images assembled from multiple sources, enabling intuitive scene layout design and stronger prompt alignment. FreeControl introduces a new paradigm for test-time control, enabling structurally and semantically aligned, visually coherent generation directly from raw images, with the flexibility for intuitive compositional design and compatibility with modern diffusion models at ~5% additional cost.

AAAI Conference 2025 Conference Paper

From Words to Worth: Newborn Article Impact Prediction with LLM

  • Penghai Zhao
  • Qinghua Xing
  • Kairan Dou
  • Jinyu Tian
  • Ying Tai
  • Jian Yang
  • Ming-Ming Cheng
  • Xiang Li

Predicting the future impact of newly published articles is pivotal for advancing scientific discovery in an era of unprecedented scholarly expansion. This paper introduces a promising approach, leveraging the capabilities of LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Breaking away from traditional methods heavily reliant on external data, we propose fine-tuning the LLM to uncover the intrinsic semantic patterns shared by highly impactful articles from a vast collection of text-score pairs. These semantic features are further utilized to predict the proposed indicator, TNCSIsp, which incorporates favorable normalization properties across value, field, and time. To facilitate parameter-efficient fine-tuning of the LLM, we have also meticulously curated a dataset containing over 12,000 entries, each annotated with titles, abstracts, and their corresponding TNCSIsp values. Experimental results reveal an MAE of 0.216 and an NDCG@20 of 0.901, setting new benchmarks in predicting the impact of newborn articles. Finally, we present a real-world application example for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for article impact prediction.

AAAI Conference 2025 Conference Paper

Harmonious Music-driven Group Choreography with Trajectory-Controllable Diffusion

  • Yuqin Dai
  • Wanlu Zhu
  • Ronghui Li
  • Zeping Ren
  • Xiangzheng Zhou
  • Jixuan Ying
  • Jun Li
  • Jian Yang

Creating group choreography from music is crucial in cultural entertainment and virtual reality, with a focus on generating harmonious movements. Despite growing interest, recent approaches often struggle with two major challenges: multi-dancer collisions and single-dancer foot sliding. To address these challenges, we propose a Trajectory-Controllable Diffusion (TCDiff) framework, which leverages non-overlapping trajectories to ensure coherent and aesthetically pleasing dance movements. To mitigate collisions, we introduce a Dance-Trajectory Navigator that generates collision-free trajectories for multiple dancers, utilizing a distance-consistency loss to maintain optimal spacing. Furthermore, to reduce foot sliding, we present a footwork adaptor that adjusts trajectory displacement between frames, supported by a relative forward-kinematic loss to further reinforce the correlation between movements and trajectories. Experiments demonstrate our method's superiority.

JBHI Journal 2025 Journal Article

Hepatic Vessel Roadmap Prediction Using Adaptive Tracking and Bending Energy Modeling in X-Ray Fluoroscopy

  • Shuo Yang
  • Deqiang Xiao
  • Haixiao Geng
  • Danni Ai
  • Jingfan Fan
  • Tianyu Fu
  • Hong Song
  • Feng Duan

Dynamic visualization of the hepatic vessel is crucial in X-ray image-guided transjugular intrahepatic portosystemic shunt (TIPS) procedures. However, intraoperative breathing and the presence of guidewires complicate the prediction of the vessel position and posture without contrast agents. Respiration compensation techniques use intraoperative respiration modeling to deform the initial vessel roadmap, thereby achieving dynamic vessel prediction in the X-ray image sequence for interventional guidance. Therefore, we propose a novel respiration compensation framework utilizing adaptive tracking and bending energy modeling to achieve stable vessel roadmap prediction under free breathing. First, we introduce an inter-frame rigid displacement compensation module based on domain adaptation and adaptive centroid tracking. This module fits the respiratory curve from the X-ray images, providing temporal motion priors for aligning roadmaps across frames. Second, we propose a novel deformation compensation module based on bending energy modeling to correct the respiratory motion, wherein we utilize the energy features of the guidewires to drive the non-rigid registration. The control points sampled by the bending energy guide the local image to form the deformation field, facilitating the dynamic overlap of the vessel roadmaps in X-ray images. Experimental results on simulated and clinical datasets show an average tracking error of 0.95 ± 0.26 mm and 1.49 ± 0.40 mm, respectively. The effective and fast (mean 57 ms per frame) compensation achieved by our framework has the potential to improve the outcome of liver interventions and reduce the reliance on contrast agents.

NeurIPS Conference 2025 Conference Paper

HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

  • Zhijian Zhuo
  • Yutao Zeng
  • Ya Wang
  • Sijun Zhang
  • Xiaoqing Li
  • Jian Yang
  • Xun Zhou
  • Jinwen Ma

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks, especially regarding the position of the layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence to demonstrate that HybridNorm improves the gradient flow and the model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
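
A minimal PyTorch sketch of the stated recipe (QKV normalization inside attention, Post-Norm around the FFN); for brevity the normalization is applied to the attention inputs rather than to the projected Q/K/V, and all sizes are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class HybridNormBlock(nn.Module):
    """Sketch of a transformer block combining QKV normalization in the
    attention sublayer with Post-Norm around the feed-forward sublayer."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)
        self.v_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.post_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # QKV normalization inside the attention sublayer (simplified: applied to inputs)
        q, k, v = self.q_norm(x), self.k_norm(x), self.v_norm(x)
        attn_out, _ = self.attn(q, k, v, need_weights=False)
        x = x + attn_out
        # Post-Norm around the feed-forward sublayer
        x = self.post_norm(x + self.ffn(x))
        return x

# toy usage
block = HybridNormBlock()
y = block(torch.randn(2, 16, 512))
```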

NeurIPS Conference 2025 Conference Paper

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

  • Jiajun Shi
  • Jian Yang
  • Jiaheng Liu
  • Xingyuan Bu
  • Jiangjie Chen
  • Junting Zhou
  • Kaijing Ma
  • Zhoufutu Wen

Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM’s general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.

IJCAI Conference 2025 Conference Paper

Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

  • Xudong Yan
  • Songhe Feng
  • Yang Zhang
  • Jian Yang
  • Yueguan Lin
  • Haojun Fei

Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) The efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attributes with objects in the same parts. (2) Existing word embeddings fail to capture complex multimodal semantic information. (3) Overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Being aware of these, we propose a novel framework named multimodal large language model (MLLM) embeddings and attribute smoothing guided disentanglement for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and utilize learnable condition masks to capture multi-granularity features for disentanglement. Moreover, the last hidden states of MLLM are employed as word embeddings for their superior representation capabilities. Furthermore, we propose attribute smoothing with auxiliary attributes generated by the large language model (LLM) for seen compositions to address the overconfidence challenge. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three challenging datasets. The supplementary material and source code will be available at https://github.com/xud-yan/Trident.

AAAI Conference 2025 Conference Paper

Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow

  • Jiaqi Bai
  • Hongcheng Guo
  • Zhongyuan Peng
  • Jian Yang
  • Zhoujun Li
  • Mohan Li
  • Zhihong Tian

Large vision-language models show tremendous potential in understanding visual information through human languages. However, they are prone to suffer from object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens map to the LLM's word embedding space. Specifically, by figuring out the semantic similarity between visual tokens and LLM's word embedding, we observe that the smoothness of similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, facilitating the constraining of irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy to enable the injected noise to be adaptively constrained regarding the smoothness of the similarity distribution. We adapt the proposed AdaVIB across distinct model architectures. Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by effectively alleviating the overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.
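
A hypothetical sketch of entropy-controlled noise injection on soft visual tokens, under the assumption that peakier (more overconfident) similarity distributions receive more noise; the actual AdaVIB formulation, sign conventions, and VIB objective may differ.

```python
import math
import torch
import torch.nn.functional as F

def adaptive_noise(visual_tokens, word_emb, base_sigma=0.1):
    """Hypothetical sketch: scale injected Gaussian noise on soft visual tokens
    by the entropy of their similarity to the LLM word-embedding table.
    A smooth (high-entropy) similarity distribution gets little noise; a peaky
    (overconfident) one gets more, to damp irrelevant-feature overconfidence.

    visual_tokens : (T, d) soft visual tokens mapped into the word-embedding space
    word_emb      : (V, d) LLM word-embedding table
    """
    sim = visual_tokens @ word_emb.t()                       # (T, V) similarities
    p = F.softmax(sim, dim=-1)
    ent = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)         # per-token entropy
    ent = ent / math.log(word_emb.size(0))                   # normalize to [0, 1]
    sigma = base_sigma * (1.0 - ent).unsqueeze(-1)           # peakier -> larger noise scale
    return visual_tokens + sigma * torch.randn_like(visual_tokens)
```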

NeurIPS Conference 2025 Conference Paper

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

  • Tianhao Peng
  • Haochen Wang
  • Yuanxing Zhang
  • Noah Wang
  • Zili Wang
  • Ge Zhang
  • Jian Yang
  • Shihao Li

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs' ability to perform understanding across multiple videos. The benchmark will be made publicly available to foster future research.

NeurIPS Conference 2025 Conference Paper

OmniBench: Towards The Future of Universal Omni-Language Models

  • Yizhi Li
  • Ge Zhang
  • Yinghao Ma
  • Ruibin Yuan
  • Hangyu Guo
  • Yiming Liang
  • Jiaheng Liu
  • Noah Wang

Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data can be found at https://m-a-p.ai/OmniBench/.

IJCAI Conference 2025 Conference Paper

PDDFormer: Pairwise Distance Distribution Graph Transformer for Crystal Material Property Prediction

  • Xiangxiang Shen
  • Zheng Wan
  • Lingfeng Wen
  • Licheng Sun
  • Jian Yang
  • Xuan Tang
  • Shing-Ho J. Lin
  • Xiao He

Crystal structures can be simplified as a periodic point set that repeats across three-dimensional space along an underlying lattice. Traditionally, crystal representation methods rely on descriptors such as lattice parameters, symmetry, and space groups to characterize the structure. However, in reality, atoms in materials always vibrate above absolute zero, causing their positions to fluctuate continuously. This dynamic behavior disrupts the fundamental periodicity of the lattice, making crystal graphs based on static lattice parameters and conventional descriptors discontinuous under slight perturbations. Chemists proposed the pairwise distance distribution (PDD) method to address this. However, the completeness of PDD requires defining a large number of neighboring atoms, leading to high computational costs. Additionally, PDD does not account for atomic information, making it challenging to apply it directly to crystal material property prediction tasks. To tackle these challenges, we introduce the atom-weighted Pairwise Distance Distribution (WPDD) and Unit cell Pairwise Distance Distribution (UPDD) for the first time, applying them to the construction of multi-edge crystal graphs. We demonstrate the continuity and general completeness of crystal graphs under slight atomic position perturbations. Moreover, by modeling PDD as global information and integrating it into matrix-based message passing, we significantly reduce computational costs. Comprehensive evaluation results show that WPDDFormer achieves state-of-the-art predictive accuracy across tasks on benchmark datasets such as the Materials Project and JARVIS-DFT.
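
To make the representation concrete, a toy sketch of an atom-weighted pairwise distance distribution over a finite point set; periodic images, the completeness conditions, and the exact weighting scheme of WPDD are simplified away and should be treated as assumptions.

```python
import numpy as np

def weighted_pdd(coords, weights, k=8):
    """Hypothetical sketch of an atom-weighted pairwise distance distribution:
    for each atom, record the sorted distances to its k nearest neighbours and
    attach a per-atom weight (e.g. species-dependent). A real crystal PDD must
    also include periodic images of the unit cell, which this toy version omits.

    coords  : (n, 3) Cartesian atom positions
    weights : (n,) per-atom weights
    """
    n = coords.shape[0]
    rows = []
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        d = np.sort(d)[1:k + 1]                   # drop the self-distance, keep k nearest
        rows.append(d)
    return np.asarray(rows), np.asarray(weights)  # (n, k) distance rows + per-atom weights

# toy usage on a random 10-atom configuration
rng = np.random.default_rng(0)
rows, w = weighted_pdd(rng.uniform(size=(10, 3)), weights=np.ones(10) / 10, k=4)
```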

AAAI Conference 2025 Conference Paper

Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation

  • Xiaoqi An
  • Lin Zhao
  • Chen Gong
  • Jun Li
  • Jian Yang

With the rapid development of autonomous driving, LiDAR-based 3D Human Pose Estimation (3D HPE) is becoming a research focus. However, due to the noise and sparsity of LiDAR-captured point clouds, robust human pose estimation remains challenging. Most existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. In this work, we try to obtain sufficient information for 3D HPE only by modeling the intrinsic properties of low-quality point clouds. Hence, a simple yet powerful method is proposed, which provides insights into both the modeling and augmentation of point clouds. Specifically, we first propose a concise and effective density-aware pose transformer (DAPT) to get stable keypoint representations. By using a set of joint anchors and a carefully designed exchange module, valid information is extracted from point clouds with different densities. Then 1D heatmaps are utilized to represent the precise locations of the keypoints. Secondly, a comprehensive LiDAR human synthesis and augmentation method is proposed to pre-train the model, enabling it to acquire a better human body prior. We increase the diversity of point clouds by randomly sampling human positions and orientations and by simulating occlusions through the addition of laser-level masks. Extensive experiments have been conducted on multiple datasets, including the IMU-annotated LidarHuman26M and SLOPER4D, as well as the manually annotated Waymo Open Dataset v2.0 (Waymo) and HumanM3. Our method demonstrates SOTA performance in all scenarios. In particular, compared with LPFormer on Waymo, we reduce the average MPJPE by 10.0 mm; compared with PRN on SLOPER4D, we notably reduce the average MPJPE by 20.7 mm.

JBHI Journal 2025 Journal Article

RDguru: A Conversational Intelligent Agent for Rare Diseases

  • Jian Yang
  • Liqi Shu
  • Huilong Duan
  • Haomin Li

Large language models (LLMs) hold significant promise in clinical practice, yet their real-world adoption is constrained by their propensity to produce erroneous and occasionally harmful outputs, particularly in the intricate domain of rare diseases (RDs). This study introduces RDguru, a conversational intelligent agent leveraging the LangChain framework and powered by GPT-3.5-turbo. RDguru offers a comprehensive suite of functionalities, encompassing evidence-traceable knowledge Q&A and professional medical consultations for differential diagnosis (DDX), integrating authoritative knowledge sources and reliable tools. A novel multi-source fusion diagnostic model, rooted in a deep Q-network, amalgamates three diagnostic recommendation strategies (GPT-4, PheLR, and phenotype matching) to enhance diagnostic recall during medical consultations. Through tailored tools and advanced algorithms for retrieval-augmented generation, RDguru excels in knowledge Q&A, automated phenotype annotation, and RD DDX. A multi-aspect Q&A analysis demonstrates that RDguru outperforms ChatGPT in generating descriptions aligned with authoritative knowledge, quantified by ROUGE scores, GPT-4-based automatic rating, and RAGAs evaluation metrics. Testing on 238 published RD cases reveals that RDguru's top 5 multi-source fusion diagnoses recapture 63.87% of actual diagnoses, marking a 5.47% improvement over the state-of-the-art diagnostic method PheLR. Furthermore, RDguru's consultation strategy proves effective in eliciting diagnostically beneficial phenotypes and refining the prioritization of genuine diagnoses through multi-round phenotype-oriented questioning. Evaluations against established benchmarks and real-world patient data demonstrate RDguru's efficacy and reliability, highlighting its potential to enhance clinical decision-making in the realm of RDs.

AAAI Conference 2025 Conference Paper

Relaxed Rotational Equivariance via G-Biases in Vision

  • Zhiqiang Wu
  • Yingjie Liu
  • Licheng Sun
  • Jian Yang
  • Hanlin Dong
  • Shing-Ho J. Lin
  • Xuan Tang
  • Jinpeng Mi

Group Equivariant Convolution (GConv) can capture rotational equivariance from original data. It assumes uniform and strict rotational equivariance across all features under the transformations of a specific group. However, the presentation or distribution of real-world data rarely conforms to strict rotational equivariance, a phenomenon commonly referred to as Rotational Symmetry-Breaking (RSB) in the system or dataset, which GConv cannot adapt to effectively. Motivated by this, we propose a simple but highly effective method to address this problem, which utilizes a set of learnable biases called G-Biases under the group order to break strict group constraints and thereby achieve a Relaxed Rotational Equivariant Convolution (RREConv). To validate the efficiency of RREConv, we conduct extensive ablation experiments on the discrete rotational group Cn. Experiments demonstrate that the proposed RREConv-based methods achieve excellent performance compared to existing GConv-based methods in both classification and 2D object detection tasks on natural image datasets.

NeurIPS Conference 2025 Conference Paper

Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

  • Ge Wu
  • Shen Zhang
  • Ruijing Shi
  • Shanghua Gao
  • Zhenyuan Chen
  • Lei Wang
  • Zhaowei Chen
  • Hongcheng Gao

REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that this external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one additional token for denoising (<0.5% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256×256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving 63× and 23× faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations (10× longer). Code is available at: https://github.com/Martinser/REG.
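
The entanglement step, as described, amounts to denoising one extra token alongside the image latents; a minimal sketch under that reading (the function name and shapes are assumptions, and the denoiser and training objective are omitted):

```python
import torch

def entangle_with_class_token(latent_tokens, class_token):
    """Hypothetical sketch: append one pretrained class/global token to the noisy
    latent token sequence, so the denoiser jointly reconstructs image latents and
    the global semantic token at the cost of a single extra token.

    latent_tokens : (B, N, d) noisy image latent tokens
    class_token   : (B, d) class token from a pretrained foundation encoder
    """
    return torch.cat([latent_tokens, class_token.unsqueeze(1)], dim=1)  # (B, N + 1, d)

# toy usage
tokens = entangle_with_class_token(torch.randn(2, 256, 768), torch.randn(2, 768))
```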

NeurIPS Conference 2025 Conference Paper

See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction

  • Yuan Wu
  • Zhiqiang Yan
  • Yigong Zhang
  • Xiang Li
  • Jian Yang

Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available here.

IJCAI Conference 2025 Conference Paper

Self-calibration Enhanced Whole Slide Pathology Image Analysis

  • Haoming Luo
  • XiaoTian Yu
  • Shengxuming Zhang
  • Jiabin Xia
  • Jian Yang
  • Yuning Sun
  • Xiuming Zhang
  • Jing Zhang

Pathology images are considered the "gold standard" for cancer diagnosis and treatment, with gigapixel images providing extensive tissue and cellular information. Existing methods fail to efficiently extract both global structural and local detail features for comprehensive pathology image analysis. To address these limitations, we propose a self-calibration enhanced framework for whole slide pathology image analysis, comprising three components: a global branch, a focus predictor, and a detailed branch. The global branch initially classifies using the pathological thumbnail, while the focus predictor identifies relevant regions for classification based on the last-layer features of the global branch. The detailed extraction branch then assesses whether the magnified regions correspond to the lesion area. Finally, a feature consistency constraint between the global and detail branches ensures that the global branch focuses on the appropriate region and extracts sufficient discriminative features for final identification. These focused discriminative features can facilitate the discovery of novel prognostic tumor markers, from the perspective of feature uniqueness and tissue spatial distribution. Extensive experimental results demonstrate that the proposed framework can rapidly deliver accurate and explainable results for pathological grading and prognosis tasks.

NeurIPS Conference 2025 Conference Paper

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

  • Xeron Du
  • Yifan Yao
  • Kaijing Ma
  • Bingli Wang
  • Tianyu Zheng
  • Minghao Liu
  • Yiming Liang
  • Xiaolong Jin

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model Gemini-2.5-Pro achieved the highest accuracy of 63.56% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

AAAI Conference 2025 Conference Paper

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

  • Xianjie Wu
  • Jian Yang
  • Linzheng Chai
  • Ge Zhang
  • Jiaheng Liu
  • Xeron Du
  • Di Liang
  • Daixin Shu

Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark, TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Extensive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.

AAAI Conference 2025 Conference Paper

Towards Better Spherical Sliced-Wasserstein Distance Learning with Data-Adaptive Discriminative Projection Direction

  • Hongliang Zhang
  • Shuo Chen
  • Lei Luo
  • Jian Yang

Spherical Sliced-Wasserstein (SSW) has recently been proposed to measure the discrepancy between spherical data distributions in various fields, such as geology, medical domains, computer vision, and deep representation learning. However, in the original SSW, all projection directions are treated equally, which is too idealistic and cannot accurately reflect the importance of different projection directions for various data distributions. To address this issue, we propose a novel data-adaptive Discriminative Spherical Sliced-Wasserstein (DSSW) distance, which utilizes a projected energy function to determine the discriminative projection direction for SSW. In our new DSSW, we introduce two types of projected energy functions to generate the weights for projection directions with complete theoretical guarantees. The first type employs a non-parametric deterministic function that transforms the projected Wasserstein distance into its corresponding weight in each projection direction. This improves the performance of the original SSW distance with negligible additional computational overhead. The second type utilizes a neural network-induced function that learns the projection direction weight through a parameterized neural network based on data projections. This further enhances the performance of the original SSW distance with less extra computational overhead. Finally, we evaluate the performance of our proposed DSSW by comparing it with several state-of-the-art methods across a variety of machine learning tasks, including gradient flows, density estimation on real earth data, and self-supervised learning.
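
A minimal sketch of the data-adaptive weighting idea, under the simplifying assumption of Euclidean data and linear projections rather than the geodesic projections used for spherical data; the softmax-of-distances weighting stands in for the paper's deterministic projected energy function and is only illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def weighted_sliced_wasserstein(x, y, n_proj=64, beta=1.0, rng=None):
    """Toy sketch of a data-adaptive sliced-Wasserstein distance.

    x, y : (n, d) samples from the two distributions.
    Each random direction gets a weight from a softmax over its own 1D
    Wasserstein distance (a simple deterministic 'projected energy'),
    so more discriminative directions contribute more.
    """
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit projection directions

    per_dir = np.array([
        wasserstein_distance(x @ theta, y @ theta) for theta in dirs
    ])
    w = np.exp(beta * per_dir)
    w /= w.sum()                                          # softmax weights over directions
    return float(np.sum(w * per_dir))

x = np.random.randn(500, 3)
y = np.random.randn(500, 3) + 0.5
print(weighted_sliced_wasserstein(x, y))
```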

NeurIPS Conference 2025 Conference Paper

UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset

  • Chen Zhao
  • En Ci
  • Yunzhe Xu
  • Tiehan Fan
  • Shanyan Guan
  • Yanhao Ge
  • Jian Yang
  • Ying Tai

Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain: (1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce UltraHR-100K, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning on detail-critical denoising steps, and (ii) Soft-Weighting Frequency Regularization (SWFR), which leverages the Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at https://github.com/NJU-PCALab/UltraHR-100k.
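
The following is a minimal sketch of a soft frequency-weighted loss in the spirit of SWFR, not the paper's exact formulation; the radial weighting and the `alpha` parameter are illustrative assumptions.

```python
import torch

def soft_frequency_loss(pred, target, alpha=2.0):
    """Toy sketch of soft-weighted frequency regularization.

    pred, target : (B, C, H, W) images (or predicted clean latents).
    The loss compares DFT magnitudes and softly up-weights higher
    frequencies so fine detail is not washed out during post-training.
    """
    fp = torch.fft.fftshift(torch.fft.fft2(pred, norm="ortho"), dim=(-2, -1))
    ft = torch.fft.fftshift(torch.fft.fft2(target, norm="ortho"), dim=(-2, -1))

    h, w = pred.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    radius = torch.sqrt(xx ** 2 + yy ** 2).to(pred.device)   # 0 at DC, larger toward corners
    weight = 1.0 + alpha * radius                             # soft weighting, not a hard mask

    return (weight * (fp - ft).abs()).mean()

loss = soft_frequency_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
print(loss.item())
```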

AAAI Conference 2025 Conference Paper

XCOT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning

  • Linzheng Chai
  • Jian Yang
  • Tao Sun
  • Hongcheng Guo
  • Jiaheng Liu
  • Bing Wang
  • Xinnian Liang
  • Jiaqi Bai

Chain-of-thought (CoT) has emerged as a powerful technique to elicit reasoning in large language models and improve a variety of downstream tasks. CoT mainly demonstrates excellent performance in English, but its usage in low-resource languages is constrained due to poor language generalization. To bridge the gap among different languages, we propose a cross-lingual instruction fine-tuning framework (xCoT) to transfer knowledge from high-resource languages to low-resource languages. Specifically, the multilingual instruction training data (xCoT-Instruct) is created to encourage the semantic alignment of multiple languages. We introduce cross-lingual in-context few-shot learning (xICL) to accelerate multilingual agreement in instruction tuning, where some fragments of source languages in examples are randomly substituted by their counterpart translations of target languages. During multilingual instruction tuning, we adopt a random online CoT strategy to enhance the multilingual reasoning ability of the large language model by first translating the query to another language and then answering in English. To further facilitate the language transfer, we leverage the high-resource CoT to supervise the training of low-resource languages with cross-lingual distillation. Experimental results demonstrate the superior performance of xCoT in reducing the gap among different languages, highlighting its potential to reduce the cross-lingual gap.

JBHI Journal 2024 Journal Article

A New Multi-Atlas Based Deep Learning Segmentation Framework With Differentiable Atlas Feature Warping

  • Huabing Liu
  • Dong Nie
  • Jian Yang
  • Jinda Wang
  • Zhenyu Tang

Deep learning based multi-atlas segmentation (DL-MA) has achieved state-of-the-art performance in many medical image segmentation tasks, e.g., brain parcellation. In DL-MA methods, atlas-target correspondence is the key to accurate segmentation. In most existing DL-MA methods, such correspondence is usually established using traditional or deep learning based registration methods at the image level, with no further feature-level adaptation. This could cause possible atlas-target feature inconsistency. As a result, the information from atlases often has limited positive, and even counteractive, impact on the final segmentation results. To tackle this issue, in this paper, we propose a new DL-MA framework, where a novel differentiable atlas feature warping module with a new smooth regularization term is presented to establish feature-level atlas-target correspondence. Compared with existing DL-MA methods, in our framework, atlas features containing anatomical prior knowledge are more relevant to the target image feature, leading to highly accurate final segmentation results. We evaluate our framework in the context of brain parcellation using two public MR brain image datasets: LPBA40 and NIREP-NA0. The experimental results demonstrate that our framework outperforms both traditional multi-atlas segmentation (MAS) and state-of-the-art DL-MA methods with statistical significance. Further ablation studies confirm the effectiveness of the proposed differentiable atlas feature warping module.
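
To make the differentiable feature warping idea concrete, here is a generic sketch of warping atlas feature maps with a dense displacement field via bilinear sampling; it is a standard spatial-transformer-style operation, not the paper's specific module, and the names are hypothetical.

```python
import torch
import torch.nn.functional as F

def warp_atlas_features(atlas_feat, flow):
    """Toy sketch of differentiable atlas feature warping with a dense flow field.

    atlas_feat : (B, C, H, W) atlas feature maps.
    flow       : (B, 2, H, W) displacement field (in pixels) from atlas to target.
    Builds a sampling grid and uses bilinear grid_sample, so gradients flow back
    into both the features and the predicted deformation.
    """
    b, _, h, w = atlas_feat.shape
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=atlas_feat.dtype),
        torch.arange(w, dtype=atlas_feat.dtype),
        indexing="ij",
    )
    base = torch.stack([xx, yy], dim=0).unsqueeze(0)      # (1, 2, H, W) identity grid
    coords = base + flow                                   # absolute sampling positions
    # Normalize to [-1, 1] as required by grid_sample (x first, then y).
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(atlas_feat, grid, align_corners=True)

feat = torch.randn(2, 16, 32, 32)
flow = torch.zeros(2, 2, 32, 32)                           # zero flow -> identity warp
print(torch.allclose(warp_atlas_features(feat, flow), feat, atol=1e-5))
```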

AAAI Conference 2024 Conference Paper

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

  • Kun Wang
  • Zhiqiang Yan
  • Huang Tian
  • Zhenyu Zhang
  • Xiang Li
  • Jun Li
  • Jian Yang

Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF, a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistency-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.

JBHI Journal 2024 Journal Article

Cross-Anatomy Transfer Learning via Shape-Aware Adaptive Fine-Tuning for 3D Vessel Segmentation

  • Tao Han
  • Danni Ai
  • Jingfan Fan
  • Hong Song
  • Deqiang Xiao
  • Yining Wang
  • Jian Yang

Deep learning methods have recently achieved remarkable performance in vessel segmentation applications, yet require numerous labor-intensive labeled data. To alleviate the requirement of manual annotation, transfer learning methods can potentially be used to acquire the related knowledge of tubular structures from public large-scale labeled vessel datasets for target vessel segmentation in other anatomic sites of the human body. However, the cross-anatomy domain shift is a challenging task due to the formidable discrepancy among various vessel structures in different anatomies, resulting in the limited performance of transfer learning. Therefore, we propose a cross-anatomy transfer learning framework for 3D vessel segmentation, which first generates a pre-trained model on a public hepatic vessel dataset and then adaptively fine-tunes our target segmentation network initialized from the model for segmentation of other anatomic vessels. In the framework, the adaptive fine-tuning strategy is presented to dynamically decide on the frozen or fine-tuned filters of the target network for each input sample with a proxy network. Moreover, we develop a Gaussian-based signed distance map that explicitly encodes vessel-specific shape context. The prediction of the map is added as an auxiliary task in the segmentation network to capture geometry-aware knowledge in the fine-tuning. We demonstrate the effectiveness of our method through extensive experiments on two small-scale datasets of coronary artery and brain vessel. The results indicate the proposed method effectively overcomes the discrepancy of cross-anatomy domain shift to achieve accurate vessel segmentation for these two datasets.
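
The exact form of the Gaussian-based signed distance map is not specified in this abstract, so the sketch below shows one plausible construction from a binary vessel mask that could serve as such an auxiliary regression target; the `sigma` value and the squashing function are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def gaussian_signed_distance_map(mask, sigma=5.0):
    """Toy sketch of a Gaussian-weighted signed distance map for a binary vessel mask.

    mask : boolean array (2D or 3D), True inside the vessel.
    Positive values inside the structure, negative outside, squashed by a
    Gaussian so the map stays bounded and emphasizes the near-boundary region.
    """
    inside = distance_transform_edt(mask)     # distance to background, measured inside
    outside = distance_transform_edt(~mask)   # distance to foreground, measured outside
    sdm = inside - outside                    # signed Euclidean distance
    return np.sign(sdm) * (1.0 - np.exp(-(sdm ** 2) / (2.0 * sigma ** 2)))

mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 28:36] = True                     # crude tubular region
gsdm = gaussian_signed_distance_map(mask)
print(gsdm.min(), gsdm.max())
```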

NeurIPS Conference 2024 Conference Paper

DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain

  • Kun Wang
  • Zhiqiang Yan
  • Junkai Fan
  • Wanlu Zhu
  • Xiang Li
  • Jun Li
  • Jian Yang

In this paper, we introduce DCDepth, a novel framework for the long-standing monocular depth estimation task. Moving beyond conventional pixel-wise depth estimation in the spatial domain, our approach estimates the frequency coefficients of depth patches after transforming them into the discrete cosine domain. This unique formulation allows for the modeling of local depth correlations within each patch. Crucially, the frequency transformation segregates the depth information into various frequency components, with low-frequency components encapsulating the core scene structure and high-frequency components detailing the finer aspects. This decomposition forms the basis of our progressive strategy, which begins with the prediction of low-frequency components to establish a global scene context, followed by successive refinement of local details through the prediction of higher-frequency components. We conduct comprehensive experiments on NYU-Depth-V2, TOFDC, and KITTI datasets, and demonstrate the state-of-the-art performance of DCDepth. Code is available at https://github.com/w2kun/DCDepth.
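
A toy sketch of the coarse-to-fine intuition, assuming a standard 2D DCT on a single depth patch; the real model predicts coefficients with a network rather than truncating ground-truth ones, so this only illustrates how low-frequency blocks capture the coarse structure.

```python
import numpy as np
from scipy.fft import dctn, idctn

def progressive_dct_reconstruction(depth_patch, stages=(2, 4, 8)):
    """Toy sketch of coarse-to-fine depth reconstruction in the DCT domain.

    depth_patch : (P, P) depth values for one patch.
    Returns one reconstruction per stage, each keeping only the top-left
    k x k block of DCT coefficients (low frequencies first, then finer detail).
    """
    coeffs = dctn(depth_patch, norm="ortho")
    outs = []
    for k in stages:
        kept = np.zeros_like(coeffs)
        kept[:k, :k] = coeffs[:k, :k]          # low-frequency components only
        outs.append(idctn(kept, norm="ortho"))
    return outs

patch = np.add.outer(np.linspace(1.0, 2.0, 8), np.linspace(0.0, 0.5, 8))  # fake depth ramp
for k, rec in zip((2, 4, 8), progressive_dct_reconstruction(patch)):
    print(k, float(np.abs(rec - patch).mean()))   # error shrinks as more frequencies are kept
```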

AAAI Conference 2024 Conference Paper

Divide and Conquer: Hybrid Pre-training for Person Search

  • Yanling Tian
  • Di Chen
  • Yunan Liu
  • Jian Yang
  • Shanshan Zhang

Large-scale pre-training has proven to be an effective method for improving performance across different tasks. Current person search methods use ImageNet pre-trained models for feature extraction, yet this is not an optimal solution due to the gap between the pre-training task and the person search task (as a downstream task). Therefore, in this paper, we focus on pre-training for person search, which involves detecting and re-identifying individuals simultaneously. Although labeled data for person search is scarce, datasets for the two sub-tasks, person detection and re-identification, are relatively abundant. To this end, we propose a hybrid pre-training framework specifically designed for person search using sub-task data only. It consists of a hybrid learning paradigm that handles data with different kinds of supervision, and an intra-task alignment module that alleviates domain discrepancy under limited resources. To the best of our knowledge, this is the first work that investigates how to support full-task pre-training using sub-task data. Extensive experiments demonstrate that our pre-trained model can achieve significant improvements across diverse protocols, such as person search method, fine-tuning data, pre-training data and model backbone. For example, our model improves ResNet50-based NAE by a 10.3% relative gain in mAP. Our code and pre-trained models are released for plug-and-play usage to the person search community (https://github.com/personsearch/PretrainPS).

IJCAI Conference 2024 Conference Paper

Efficiency Calibration of Implicit Regularization in Deep Networks via Self-paced Curriculum-Driven Singular Value Selection

  • Zhe Li
  • Shuo Chen
  • Jian Yang
  • Lei Luo

The generalization of neural networks has been a major focus of research in deep learning. It is often interpreted as an implicit bias towards solutions with specific properties. Especially, in practical applications, it has been observed that linear neural networks (LNN) tend to favor low-rank solutions for matrix completion tasks. However, most existing methods rely on increasing the depth of the neural network to enhance the low rank of solutions, resulting in higher complexity. In this paper, we propose a new explicit regularization method that calibrates the implicit bias towards low-rank trends in matrix completion tasks. Our approach automatically incorporates smaller singular values into the training process using a self-paced learning strategy, gradually restoring matrix information. By jointly using both implicit and explicit regularization, we effectively capture the low-rank structure of LNN and accelerate its convergence. We also analyze how our proposed penalty term interacts with implicit regularization and provide theoretical guarantees for our new model. To evaluate the effectiveness of our method, we conduct a series of experiments on both simulated and real-world data. Our experimental results clearly demonstrate that our method has better robustness and generalization ability compared with other methods.

JBHI Journal 2024 Journal Article

Embedding-Alignment Fusion-Based Graph Convolution Network With Mixed Learning Strategy for 4D Medical Image Reconstruction

  • Jingshu Li
  • Tianyu Fu
  • Hong Song
  • Jingfan Fan
  • Deqiang Xiao
  • Yucong Lin
  • Ying Gu
  • Jian Yang

In recent years, 4D medical imaging, which captures both the structural and motion information of tissue, has attracted increasing attention. The key to 4D image reconstruction is to stack the 2D slices by matching their aligned motion states. In this study, the distribution of the 2D slices with different motion states is modeled as a manifold graph, and the reconstruction is cast as a graph alignment problem. An embedding-alignment fusion-based graph convolution network (GCN) with a mixed-learning strategy is proposed to align the graphs. Herein, the embedding and alignment processes of the graphs interact with each other to realize precise alignment while retaining the manifold distribution. The mixed strategy of self- and semi-supervised learning makes the alignment sparse to avoid mismatches caused by outliers in the graph. In the experiments, the proposed 4D reconstruction approach is validated on different modalities, including Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and Ultrasound (US). We evaluate the reconstruction accuracy and compare it with that of state-of-the-art methods. The experimental results demonstrate that our approach reconstructs more accurate 4D images.

NeurIPS Conference 2024 Conference Paper

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

  • Senmao Li
  • Taihang Hu
  • Joost van de Weijer
  • Fahad S. Khan
  • Tao Liu
  • Linxuan Li
  • Shiqi Yang
  • Yaxing Wang

One of the main drawbacks of diffusion models is the slow inference time for image generation. Among the most successful approaches to addressing this problem are distillation methods. However, these methods require considerable computational resources. In this paper, we take another approach to diffusion model acceleration. We conduct a comprehensive study of the UNet encoder and empirically analyze the encoder features. This provides insights regarding how they change during the inference process. In particular, we find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps. This insight motivates us to omit encoder computation at certain adjacent time-steps and reuse encoder features of previous time-steps as input to the decoder in multiple time-steps. Importantly, this allows us to perform decoder computation in parallel, further accelerating the denoising process. Additionally, we introduce a prior noise injection method to improve the texture details in the generated image. Besides the standard text-to-image task, we also validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation. Without utilizing any knowledge distillation technique, our approach accelerates both Stable Diffusion (SD) and DeepFloyd-IF model sampling by 41% and 24% respectively, and DiT model sampling by 34%, while maintaining high-quality generation performance. Our code will be publicly released.
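
A minimal sketch of the encoder-feature caching pattern described above, with hypothetical `encoder`/`decoder` callables standing in for the two halves of a diffusion UNet and a placeholder update rule instead of a real sampler; the paper's parallel decoder execution and prior noise injection are not shown.

```python
import torch

@torch.no_grad()
def denoise_with_encoder_reuse(x, timesteps, encoder, decoder, reuse_every=2):
    """Toy sketch of reusing UNet encoder features at adjacent timesteps.

    encoder(x, t) -> features and decoder(features, x, t) -> noise prediction are
    hypothetical callables. The encoder runs only on every `reuse_every`-th step;
    in between, the cached features from the previous step are fed to the decoder.
    """
    cached = None
    for i, t in enumerate(timesteps):
        if cached is None or i % reuse_every == 0:
            cached = encoder(x, t)          # full step: recompute encoder features
        eps = decoder(cached, x, t)         # decoder always runs
        x = x - 0.1 * eps                   # placeholder update rule, not a real sampler
    return x

# Tiny stand-ins just to show the call pattern.
enc = lambda x, t: x.mean(dim=1, keepdim=True)
dec = lambda f, x, t: x - f
out = denoise_with_encoder_reuse(torch.randn(1, 4, 8, 8), range(10), enc, dec)
print(out.shape)
```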

IJCAI Conference 2024 Conference Paper

Graph Neural Networks for Brain Graph Learning: A Survey

  • Xuexiong Luo
  • Jia Wu
  • Jian Yang
  • Shan Xue
  • Amin Beheshti
  • Quan Z. Sheng
  • David McAlpine
  • Paul Sowman

Exploring the complex structure of the human brain is crucial for understanding its functionality and diagnosing brain disorders. Thanks to advancements in neuroimaging technology, a novel approach has emerged that involves modeling the human brain as a graph-structured pattern, with different brain regions represented as nodes and the functional relationships among these regions as edges. Moreover, graph neural networks (GNNs) have demonstrated a significant advantage in mining graph-structured data. Developing GNNs to learn brain graph representations for brain disorder analysis has recently gained increasing attention. However, there is a lack of systematic survey work summarizing current research methods in this domain. In this paper, we aim to bridge this gap by reviewing brain graph learning works that utilize GNNs. We first introduce the process of brain graph modeling based on common neuroimaging data. Subsequently, we systematically categorize current works based on the type of brain graph generated and the targeted research problems. To make this research accessible to a broader range of interested researchers, we provide an overview of representative methods and commonly used datasets, along with their implementation sources. Finally, we present our insights on future research directions. The repository of this survey is available at https://github.com/XuexiongLuoMQ/Awesome-Brain-Graph-Learning-with-GNNs.

NeurIPS Conference 2024 Conference Paper

Grid4D: 4D Decomposed Hash Encoding for High-Fidelity Dynamic Gaussian Splatting

  • Jiawei Xu
  • Zexin Fan
  • Jian Yang
  • Jin Xie

Recently, Gaussian splatting has received increasing attention in the field of static scene rendering. Due to the low computational overhead and inherent flexibility of explicit representations, plane-based explicit methods are popular ways to predict deformations for Gaussian-based dynamic scene rendering models. However, plane-based methods rely on an inappropriate low-rank assumption and excessively decompose the space-time 4D encoding, resulting in excessive feature overlap and unsatisfactory rendering quality. To tackle these problems, we propose Grid4D, a dynamic scene rendering model based on Gaussian splatting that employs a novel explicit encoding method for the 4D input through hash encoding. Different from plane-based explicit representations, we decompose the 4D encoding into one spatial and three temporal 3D hash encodings without the low-rank assumption. Additionally, we design a novel attention module that generates attention scores in a directional range to aggregate the spatial and temporal features. The directional attention enables Grid4D to more accurately fit the diverse deformations across distinct scene components based on the spatially encoded features. Moreover, to mitigate the inherent lack of smoothness in explicit representation methods, we introduce a smooth regularization term that keeps the deformation predictions from becoming chaotic. Our experiments demonstrate that Grid4D significantly outperforms the state-of-the-art models in visual quality and rendering speed.

AAAI Conference 2024 Conference Paper

Hyperbolic Graph Diffusion Model

  • Lingfeng Wen
  • Xuan Tang
  • Mingjie Ouyang
  • Xiangxiang Shen
  • Jian Yang
  • Daxin Zhu
  • Mingsong Chen
  • Xian Wei

Diffusion generative models (DMs) have achieved promising results in image and graph generation. However, real-world graphs, such as social networks, molecular graphs, and traffic graphs, generally share non-Euclidean topologies and hidden hierarchies. For example, the degree distributions of graphs are mostly power-law distributions. The current latent diffusion model embeds the hierarchical data in a Euclidean space, which leads to distortions and interferes with modeling the distribution. Instead, hyperbolic space has been found to be more suitable for capturing complex hierarchical structures due to its exponential growth property. In order to simultaneously utilize the data generation capabilities of diffusion models and the ability of hyperbolic embeddings to extract latent hierarchical distributions, we propose a novel graph generation method called Hyperbolic Graph Diffusion Model (HGDM), which consists of an auto-encoder to encode nodes into successive hyperbolic embeddings, and a DM that operates in the hyperbolic latent space. HGDM captures the crucial graph structure distributions by constructing a hyperbolic potential node space that incorporates edge information. Extensive experiments show that HGDM achieves better performance in generic graph and molecule generation benchmarks, with a 48% improvement in the quality of graph generation with highly hierarchical structures.

TIST Journal 2024 Journal Article

Improving Faithfulness and Factuality with Contrastive Learning in Explainable Recommendation

  • Haojie Zhuang
  • Wei Zhang
  • Weitong Chen
  • Jian Yang
  • Quan Z. Sheng

Recommender systems have become increasingly important in navigating the vast amount of information and options available in various domains. By tailoring and personalizing recommendations to user preferences and interests, these systems improve the user experience, efficiency, and satisfaction. With a growing demand for transparency and understanding of recommendation outputs, explainable recommender systems have gained growing attention in recent years. Additionally, as user reviews could be considered the rationales behind why the user likes (or dislikes) the products, generating informative and reliable reviews alongside recommendations has thus emerged as a research focus in explainable recommendation. However, the model-generated reviews might contain factually inconsistent contents (i.e., the hallucination issue), which would thus compromise the recommendation rationales. To address this issue, we propose a contrastive learning framework to improve the faithfulness and factuality in explainable recommendation in this article. We further develop different strategies of generating positive and negative examples for contrastive learning, such as back-translation or synonym substitution for positive examples, and editing positive examples or utilizing model-generated texts for negative examples. Our proposed method optimizes the model to distinguish faithful explanations (i.e., positive examples) and unfaithful ones with factual errors (i.e., negative examples), which thus drives the model to generate faithful reviews as explanations while avoiding inconsistent contents. Extensive experiments and analysis on three benchmark datasets show that our proposed model outperforms other review generation baselines in faithfulness and factuality. In addition, the proposed contrastive learning component could be easily incorporated into other explainable recommender systems in a plug-and-play manner.
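
A minimal sketch of the contrastive objective, assuming the explanations have already been encoded into embeddings and that the positives and negatives were produced by the augmentation strategies mentioned above; the InfoNCE-style form is an assumption, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def faithfulness_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Toy InfoNCE-style loss over explanation embeddings.

    anchor    : (B, D) embedding of the generated explanation.
    positives : (B, D) embedding of a faithful variant (e.g. back-translated review).
    negatives : (B, K, D) embeddings of perturbed, factually inconsistent variants.
    The model is pushed to score faithful explanations above corrupted ones.
    """
    pos = F.cosine_similarity(anchor, positives, dim=-1) / temperature                 # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature    # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                                 # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)                             # positive is index 0
    return F.cross_entropy(logits, labels)

loss = faithfulness_contrastive_loss(
    torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 5, 128)
)
print(loss.item())
```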

JBHI Journal 2024 Journal Article

Local Contractive Registration With Biomechanical Model: Assessing Microwave Ablation After Compensation for Tissue Shrinkage

  • Dingkun Liu
  • Danni Ai
  • Tianyu Fu
  • Yuanjin Gao
  • Jingfan Fan
  • Hong Song
  • Deqiang Xiao
  • Ping Liang

Microwave ablation (MWA) is a minimally invasive procedure for the treatment of liver tumor. Accumulating clinical evidence has considered the minimal ablative margin (MAM) as a significant predictor of local tumor progression (LTP). In clinical practice, MAM assessment is typically carried out through image registration of pre- and post-MWA images. However, this process faces two main challenges: non-homologous match between tumor and coagulation with inconsistent image appearance, and tissue shrinkage caused by thermal dehydration. These challenges result in low precision when using traditional registration methods for MAM assessment. In this paper, we present a local contractive nonrigid registration method using a biomechanical model (LC-BM) to address these challenges and precisely assess the MAM. The LC-BM contains two consecutive parts: 1) local contractive decomposition (LC-part), which reduces the incorrect match between the tumor and coagulation and quantifies the shrinkage in the external coagulation region, and 2) biomechanical model constraint (BM-part), which compensates for the shrinkage in the internal coagulation region. After quantifying and compensating for tissue shrinkage, the warped tumor is overlaid on the coagulation, and then the MAM is assessed. We evaluated the method using prospectively collected data from 36 patients with 47 liver tumors, comparing LC-BM with 11 state-of-the-art methods. LTP was diagnosed through contrast-enhanced MR follow-up images, serving as the ground truth for tumor recurrence. LC-BM achieved the highest accuracy (97.9%) in predicting LTP, outperforming other methods. Therefore, our proposed method holds significant potential to improve MAM assessment in MWA surgeries.

AAAI Conference 2024 Conference Paper

LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection

  • Hongcheng Guo
  • Jian Yang
  • Jiaheng Liu
  • Jiaqi Bai
  • Boyang Wang
  • Zhoujun Li
  • Tieqiao Zheng
  • Bo Zhang

Log anomaly detection is a key component in the field of artificial intelligence for IT operations (AIOps). Considering log data from various domains, retraining the whole network for unknown domains is inefficient in real industrial scenarios. However, previous deep models merely focused on extracting the semantics of log sequences in the same domain, leading to poor generalization on multi-domain logs. To alleviate this issue, we propose a unified Transformer-based framework for Log anomaly detection (LogFormer) to improve the generalization ability across different domains, where we establish a two-stage process including pre-training and adapter-based tuning stages. Specifically, our model is first pre-trained on the source domain to obtain shared semantic knowledge of log data. Then, we transfer such knowledge to the target domain via shared parameters. Besides, the Log-Attention module is proposed to supplement the information ignored by log parsing. The proposed method is evaluated on three public datasets and one real-world dataset. Experimental results on multiple benchmarks demonstrate the effectiveness of our LogFormer with fewer trainable parameters and lower training costs.
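
As a generic illustration of the pre-train-then-adapter-tune pipeline (not LogFormer's specific architecture), the sketch below freezes a pretrained encoder and leaves only small bottleneck adapters trainable; dimensions and module names are hypothetical.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter inserted alongside a frozen Transformer block."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))     # residual adapter

def prepare_for_adapter_tuning(model: nn.Module, adapters: nn.ModuleList):
    """Freeze the pre-trained encoder; only the adapters stay trainable."""
    for p in model.parameters():
        p.requires_grad = False
    for p in adapters.parameters():
        p.requires_grad = True
    return list(adapters.parameters())                  # parameters for the optimizer

backbone = nn.TransformerEncoderLayer(d_model=128, nhead=4)   # stand-in pretrained block
adapters = nn.ModuleList([Adapter(128)])
trainable = prepare_for_adapter_tuning(backbone, adapters)
print(sum(p.numel() for p in trainable))                       # only the adapter parameters remain trainable
```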

NeurIPS Conference 2024 Conference Paper

MambaLLIE: Implicit Retinex-Aware Low Light Enhancement with Global-then-Local State Space

  • Jiangwei Weng
  • Zhiqiang Yan
  • Ying Tai
  • Jianjun Qian
  • Jian Yang
  • Jun Li

Recent advances in low light image enhancement have been dominated by Retinex-based learning framework, leveraging convolutional neural networks (CNNs) and Transformers. However, the vanilla Retinex theory primarily addresses global illumination degradation and neglects local issues such as noise and blur in dark conditions. Moreover, CNNs and Transformers struggle to capture global degradation due to their limited receptive fields. While state space models (SSMs) have shown promise in the long-sequence modeling, they face challenges in combining local invariants and global context in visual data. In this paper, we introduce MambaLLIE, an implicit Retinex-aware low light enhancer featuring a global-then-local state space design. We first propose a Local-Enhanced State Space Module (LESSM) that incorporates an augmented local bias within a 2D selective scan mechanism, enhancing the original SSMs by preserving local 2D dependency. Additionally, an Implicit Retinex-aware Selective Kernel module (IRSK) dynamically selects features using spatially-varying operations, adapting to varying inputs through an adaptive kernel selection process. Our Global-then-Local State Space Block (GLSSB) integrates LESSM and IRSK with layer normalization (LN) as its core. This design enables MambaLLIE to achieve comprehensive global long-range modeling and flexible local feature aggregation. Extensive experiments demonstrate that MambaLLIE significantly outperforms state-of-the-art CNN and Transformer-based methods. Our code is available at https://github.com/wengjiangwei/MambaLLIE.

AAAI Conference 2024 Conference Paper

MCL-NER: Cross-Lingual Named Entity Recognition via Multi-View Contrastive Learning

  • Ying Mo
  • Jian Yang
  • Jiahao Liu
  • Qifan Wang
  • Ruoyu Chen
  • Jingang Wang
  • Zhoujun Li

Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (MCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both codeswitched and target sentences. This alignment extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of MCL-NER over prior data-driven and model-based approaches. It achieves a substantial increase of nearly +2.0 F1 scores across a broad spectrum and establishes itself as the new state-of-the-art performer.

NeurIPS Conference 2024 Conference Paper

Novel Object Synthesis via Adaptive Text-Image Harmony

  • Zeren Xiong
  • Zedong Zhang
  • Zikun Chen
  • Shuo Chen
  • Xiang Li
  • Gan Sun
  • Jian Yang
  • Jun Li

In this paper, we study an object synthesis task that combines an object text with an object image to create a new object image. However, most diffusion models struggle with this task, i.e., often generating an object that predominantly reflects either the text or the image due to an imbalance between their inputs. To address this issue, we propose a simple yet effective method called Adaptive Text-Image Harmony (ATIH) to generate novel and surprising objects. First, we introduce a scale factor and an injection step to balance text and image features in cross-attention and to preserve image information in self-attention during the text-image inversion diffusion process, respectively. Second, to better integrate object text and image, we design a balanced loss function with a noise parameter, ensuring both optimal editability and fidelity of the object image. Third, to adaptively adjust these parameters, we present a novel similarity score function that not only maximizes the similarities between the generated object image and the input text/image but also balances these similarities to harmonize text and image integration. Extensive experiments demonstrate the effectiveness of our approach, showcasing remarkable object creations such as a colobus-glass jar. https://xzr52.github.io/ATIH/

NeurIPS Conference 2024 Conference Paper

RoleAgent: Building, Interacting, and Benchmarking High-quality Role-Playing Agents from Scripts

  • Jiaheng Liu
  • Zehao Ni
  • Haoran Que
  • Tao Sun
  • Zekun Wang
  • Jian Yang
  • Jiakai Wang
  • Hongcheng Guo

Believable agents can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication. Recently, generative agents have been proposed to simulate believable human behavior by using Large Language Models. However, the existing method heavily relies on human-annotated agent profiles (e.g., name, age, personality, relationships with others, and so on) for the initialization of each agent, which cannot be scaled up easily. In this paper, we propose a scalable RoleAgent framework to generate high-quality role-playing agents from raw scripts, which includes building and interacting stages. Specifically, in the building stage, we use a hierarchical memory system to extract and summarize the structure and high-level information of each agent from the raw script. In the interacting stage, we propose a novel mechanism with four steps to achieve high-quality interaction between agents. Finally, we introduce a systematic and comprehensive evaluation benchmark called RoleAgentBench to evaluate the effectiveness of our RoleAgent, which includes 100 and 28 roles for 20 English and 5 Chinese scripts, respectively. Extensive experimental results on RoleAgentBench demonstrate the effectiveness of RoleAgent.

NeurIPS Conference 2024 Conference Paper

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

  • Yuxuan Li
  • Xiang Li
  • Weijie Li
  • Qibin Hou
  • Li Liu
  • Ming-Ming Cheng
  • Jian Yang

Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is the result of intensively surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code are available at https://github.com/zcablii/SARDet_100K.

AAAI Conference 2024 Conference Paper

SGNet: Structure Guided Network via Gradient-Frequency Awareness for Depth Map Super-resolution

  • Zhengxue Wang
  • Zhiqiang Yan
  • Jian Yang

Depth super-resolution (DSR) aims to restore high-resolution (HR) depth from low-resolution (LR) one, where RGB image is often used to promote this task. Recent image guided DSR approaches mainly focus on spatial domain to rebuild depth structure. However, since the structure of LR depth is usually blurry, only considering spatial domain is not very sufficient to acquire satisfactory results. In this paper, we propose structure guided network (SGNet), a method that pays more attention to gradient and frequency domains, both of which have the inherent ability to capture high-frequency structure. Specifically, we first introduce the gradient calibration module (GCM), which employs the accurate gradient prior of RGB to sharpen the LR depth structure. Then we present the Frequency Awareness Module (FAM) that recursively conducts multiple spectrum differencing blocks (SDB), each of which propagates the precise high-frequency components of RGB into the LR depth. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our SGNet, reaching the state-of-the-art (see Fig. 1). Codes and pre-trained models are available at https://github.com/yanzq95/SGNet.

AAAI Conference 2024 Conference Paper

SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation

  • Xiaoqi An
  • Lin Zhao
  • Chen Gong
  • Nannan Wang
  • Di Wang
  • Jian Yang

High-resolution representation is essential for achieving good performance in human pose estimation models. To obtain such features, existing works utilize high-resolution input images or fine-grained image tokens. However, this dense high-resolution representation brings a significant computational burden. In this paper, we address the following question: "Only sparse human keypoint locations are detected for human pose estimation, is it really necessary to describe the whole image in a dense, high-resolution manner?" Based on dynamic transformer models, we propose a framework that only uses Sparse High-resolution Representations for human Pose estimation (SHaRPose). In detail, SHaRPose consists of two stages. At the coarse stage, the relations between image regions and keypoints are dynamically mined while a coarse estimation is generated. Then, a quality predictor is applied to decide whether the coarse estimation results should be refined. At the fine stage, SHaRPose builds sparse high-resolution representations only on the regions related to the keypoints and provides refined high-precision human pose estimations. Extensive experiments demonstrate the outstanding performance of the proposed method. Specifically, compared to the state-of-the-art method ViTPose, our model SHaRPose-Base achieves 77.4 AP (+0.5 AP) on the COCO validation set and 76.7 AP (+0.5 AP) on the COCO test-dev set, and infers at a speed of 1.4x faster than ViTPose-Base. Code is available at https://github.com/AnxQ/sharpose.

NeurIPS Conference 2024 Conference Paper

Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

  • Taihang Hu
  • Linxuan Li
  • Joost van de Weijer
  • Hongcheng Gao
  • Fahad S. Khan
  • Jian Yang
  • Ming-Ming Cheng
  • Kai Wang

Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects with complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token to improve the generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on the T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at https://github.com/hutaihang/ToMe.
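
A toy sketch of the token-merging step, assuming the group of tokens describing one object and its attributes is already known; simple averaging stands in for the paper's aggregation, and the end-token substitution and auxiliary losses are omitted.

```python
import torch

def merge_tokens(text_embeddings, group_indices):
    """Toy sketch of token merging for semantic binding.

    text_embeddings : (L, D) prompt token embeddings from the text encoder.
    group_indices   : indices of the tokens describing one object and its
                      attributes/sub-objects (e.g. "a", "red", "hat").
    The group is replaced by a single composite token (here: the mean), so all
    merged concepts share one cross-attention map downstream.
    """
    group = text_embeddings[group_indices]                 # (k, D)
    composite = group.mean(dim=0, keepdim=True)            # (1, D) composite token
    keep = [i for i in range(text_embeddings.size(0)) if i not in set(group_indices)]
    first = min(group_indices)
    kept = text_embeddings[keep]
    # Re-insert the composite token at the position of the first merged token.
    return torch.cat([kept[:first], composite, kept[first:]], dim=0)

emb = torch.randn(10, 768)
print(merge_tokens(emb, [3, 4, 5]).shape)    # torch.Size([8, 768])
```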

JBHI Journal 2023 Journal Article

Compressibility Analysis of Functional Near-Infrared Spectroscopy Signals in Children With Attention-Deficit/Hyperactivity Disorder

  • Yue Gu
  • Shuo Miao
  • Yao Zhang
  • Jian Yang
  • Xiaoli Li

Functional near-infrared spectroscopy (fNIRS) as an emerging optical neuroimaging technique has attracted the interest and attention of many investigators. With the growth of fNIRS data volume, effective data compression methods are urgent. Compressive sensing (CS) has been demonstrated a promising tool to deal with biomedical data. However, whether the compressibility of fNIRS data can discriminate different brain states is unclear. In this study, the fNIRS signals from fifteen attention-deficit/hyperactivity disorder (ADHD) children and fifteen typically developing (TD) children were recorded during an N-back task and a Go/NoGo task respectively. A block sparse Bayesian learning-based CS method was used to reconstruct the compressed fNIRS data. To assess the performance of the CS method, we adopted two metrics, structural similarity index (SSIM) and mean squared error (MSE), both of them effective in evaluating the compressibility of fNIRS data. Then, the two metrics were analyzed to discriminate the brain states of ADHD children and TD children during the two tasks using the multivariate pattern analysis (MVPA) method. As indicated by the results, the CS method could reconstruct the compressed fNIRS data with high reconstruction quality at different compression ratios ($\text{SSIM} > 0.988$ and $\text{MSE} < 1.2 \times 10^{-4}$). Furthermore, the MVPA method could distinguish different brain states with high accuracy, and identify that the prefrontal cortex is a key brain region for distinguishing ADHD vs. TD or N-back vs. Go/NoGo. These findings indicated that CS is very promising for the storage and transmission of massive fNIRS data, and the compressibility of fNIRS data is a potential biomarker for the diagnosis of ADHD.

AAAI Conference 2023 Conference Paper

Curriculum Temperature for Knowledge Distillation

  • Zheng Li
  • Xiang Li
  • Lingfeng Yang
  • Borui Zhao
  • Renjie Song
  • Lei Luo
  • Jun Li
  • Jian Yang

Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method.
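
A minimal sketch of a learnable, adversarial distillation temperature: the gradient-reversal trick makes the temperature ascend the KD loss while the student descends it, and `lam` plays the role of the easy-to-hard curriculum weight; the exact loss scaling is an assumption.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Flips the gradient so the temperature is trained to *increase* the KD loss."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def ctkd_loss(student_logits, teacher_logits, log_temp, lam=1.0):
    """KD loss with a learnable, adversarial temperature (curriculum via `lam`).

    `log_temp` is a learnable scalar; `lam` is gradually increased following an
    easy-to-hard schedule so the distillation task becomes progressively harder.
    """
    temp = torch.exp(GradReverse.apply(log_temp, lam))        # keep temperature positive
    p_teacher = F.softmax(teacher_logits / temp, dim=-1)
    log_p_student = F.log_softmax(student_logits / temp, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temp.detach() ** 2

log_temp = torch.nn.Parameter(torch.zeros(()))                # optimized jointly with the student
loss = ctkd_loss(torch.randn(8, 10), torch.randn(8, 10), log_temp, lam=0.5)
loss.backward()
print(float(log_temp.grad))                                   # reversed gradient reaches the temperature
```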

AAAI Conference 2023 Conference Paper

DesNet: Decomposed Scale-Consistent Network for Unsupervised Depth Completion

  • Zhiqiang Yan
  • Kun Wang
  • Xiang Li
  • Zhenyu Zhang
  • Jun Li
  • Jian Yang

Unsupervised depth completion aims to recover dense depth from the sparse one without using the ground-truth annotation. Although depth measurement obtained from LiDAR is usually sparse, it contains valid and real distance information, i.e., scale-consistent absolute depth values. Meanwhile, scale-agnostic counterparts seek to estimate relative depth and have achieved impressive performance. To leverage both the inherent characteristics, we thus suggest to model scale-consistent depth upon unsupervised scale-agnostic frameworks. Specifically, we propose the decomposed scale-consistent learning (DSCL) strategy, which disintegrates the absolute depth into relative depth prediction and global scale estimation, contributing to individual learning benefits. But unfortunately, most existing unsupervised scale-agnostic frameworks heavily suffer from depth holes due to the extremely sparse depth input and weak supervisory signal. To tackle this issue, we introduce the global depth guidance (GDG) module, which attentively propagates dense depth reference into the sparse target via novel dense-to-sparse attention. Extensive experiments show the superiority of our method on outdoor KITTI, ranking 1st and outperforming the best method, KBNet, by more than 12% in RMSE. Additionally, our approach achieves state-of-the-art performance on the indoor NYUv2 benchmark as well.

AAAI Conference 2023 Conference Paper

Exploratory Inference Learning for Scribble Supervised Semantic Segmentation

  • Chuanwei Zhou
  • Zhen Cui
  • Chunyan Xu
  • Cao Han
  • Jian Yang

Scribble supervised semantic segmentation has achieved great advances in pseudo label exploitation, yet suffers from insufficient label exploration over the mass of unannotated regions. In this work, we propose a novel exploratory inference learning (EIL) framework, which facilitates efficient probing of unlabeled pixels and promotes selecting confident candidates for boosting the evolved segmentation. The exploration of unannotated regions is formulated as an iterative decision-making process, where a policy searcher learns to infer in the unknown space and the reward to the exploratory policy is based on a contrastive measurement of candidates. In particular, we devise the contrastive reward with intra-class attraction and inter-class repulsion in the feature space w.r.t. the pseudo labels. The unlabeled exploration and the labeled exploitation are jointly balanced to improve the segmentation, and framed in a closed-loop end-to-end network. Comprehensive evaluations on the benchmark datasets (PASCAL VOC 2012 and PASCAL Context) demonstrate the superiority of our proposed EIL when compared with other state-of-the-art methods for the scribble-supervised semantic segmentation problem.

NeurIPS Conference 2023 Conference Paper

Fine-Grained Visual Prompting

  • Lingfeng Yang
  • Yueze Wang
  • Xiang Li
  • Xinlong Wang
  • Jian Yang

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. The part detection experiments conducted on the PACO dataset further validate the preponderance of FGVP over existing visual prompting techniques. Code is available at https://github.com/ylingfeng/FGVP.
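
A toy sketch of the Blur Reverse Mask prompt, assuming a segmentation mask is already available (e.g. from a generalist segmentation model); the blur strength `sigma` is an illustrative choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_reverse_mask(image, mask, sigma=8.0):
    """Toy sketch of the 'Blur Reverse Mask' visual prompt.

    image : (H, W, 3) float array in [0, 1].
    mask  : (H, W) boolean foreground mask for the referred object.
    Everything *outside* the mask is Gaussian-blurred, keeping the target sharp
    while suppressing weakly related background before feeding the image to a VLM.
    """
    blurred = np.stack(
        [gaussian_filter(image[..., c], sigma=sigma) for c in range(image.shape[-1])],
        axis=-1,
    )
    m = mask[..., None].astype(image.dtype)
    return m * image + (1.0 - m) * blurred

img = np.random.rand(64, 64, 3)
msk = np.zeros((64, 64), dtype=bool)
msk[16:48, 16:48] = True
print(blur_reverse_mask(img, msk).shape)
```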

JBHI Journal 2023 Journal Article

M-CSAFN: Multi-Color Space Adaptive Fusion Network for Automated Port-Wine Stains Segmentation

  • Jinrong Mu
  • Yucong Lin
  • Xianqi Meng
  • Jingfan Fan
  • Danni Ai
  • Defu Chen
  • Haixia Qiu
  • Jian Yang

Automatic segmentation of port-wine stains (PWS) from clinical images is critical for accurate diagnosis and objective assessment of PWS. However, this is a challenging task due to the color heterogeneity, low contrast, and indistinguishable appearance of PWS lesions. To address such challenges, we propose a novel multi-color space adaptive fusion network (M-CSAFN) for PWS segmentation. First, a multi-branch detection model is constructed based on six typical color spaces, which utilizes rich color texture information to highlight the difference between lesions and surrounding tissues. Second, an adaptive fusion strategy is used to fuse complementary predictions, which addresses the significant differences within the lesions caused by color heterogeneity. Third, a structural similarity loss with color information is proposed to measure the detail error between predicted lesions and ground-truth lesions. Additionally, a PWS clinical dataset consisting of 1413 image pairs was established for the development and evaluation of PWS segmentation algorithms. To verify the effectiveness and superiority of the proposed method, we compared it with other state-of-the-art methods on our collected dataset and four publicly available skin lesion datasets (ISIC 2016, ISIC 2017, ISIC 2018, and PH2). The experimental results show that our method achieves remarkable performance in comparison with other state-of-the-art methods on our collected dataset, achieving 92.29% and 86.14% on the Dice and Jaccard metrics, respectively. Comparative experiments on other datasets also confirmed the reliability and potential capability of M-CSAFN in skin lesion segmentation.

AAMAS Conference 2023 Conference Paper

Multi-Agent Path Finding via Reinforcement Learning with Hybrid Reward

  • Cheng Zhao
  • Liansheng Zhuang
  • Haonan Liu
  • Yihong Huang
  • Jian Yang

Multi-agent path finding (MAPF) aims to find a set of conflict-free paths for multiple agents so that each agent can reach its destination while optimizing a global cost. Recently, learning-based methods have gained much attention due to their better real-time performance and scalability. However, most existing learning-based methods suffer from poor cooperation among agents since only local observations are used to make decisions. Meanwhile, methods that focus solely on team benefits perform poorly due to a lack of individual exploration. To address this problem, this paper proposes a novel Hybrid Reward Path Finding (HRPF) method, which employs global information to learn a cooperation mechanism for agents during training, and embeds it in distributed networks to generate strategies during execution. HRPF encourages agents to learn strategies from a new type of reward function that decomposes a complex MAPF task into a team task and individual tasks. Experiments on random obstacle grid worlds show that HRPF performs significantly better in success rate and collision rate than state-of-the-art learning-based methods.

AAAI Conference 2023 Conference Paper

Recurrent Structure Attention Guidance for Depth Super-resolution

  • Jiayi Yuan
  • Haobo Jiang
  • Xiang Li
  • Jianjun Qian
  • Jun Li
  • Jian Yang

Image guidance is an effective strategy for depth super-resolution. Generally, most existing methods employ hand-crafted operators to decompose the high-frequency (HF) and low-frequency (LF) ingredients from low-resolution depth maps and guide the HF ingredients by directly concatenating them with image features. However, the hand-designed operators usually cause inferior HF maps (e.g., distorted or structurally missing) due to the diverse appearance of complex depth maps. Moreover, the direct concatenation often results in weak guidance because not all image features have a positive effect on the HF maps. In this paper, we develop a recurrent structure attention guided (RSAG) framework, consisting of two important parts. First, we introduce a deep contrastive network with multi-scale filters for adaptive frequency-domain separation, which adopts contrastive networks from large filters to small ones to calculate the pixel contrasts for adaptive high-quality HF predictions. Second, instead of the coarse concatenation guidance, we propose a recurrent structure attention block, which iteratively utilizes the latest depth estimation and the image features to jointly select clear patterns and boundaries, aiming at providing refined guidance for accurate depth recovery. In addition, we fuse the features of HF maps to enhance the edge structures in the decomposed LF maps. Extensive experiments show that our approach obtains superior performance compared with state-of-the-art depth super-resolution methods. Our code is available at: https://github.com/Yuanjiayii/DSR-RSAG.

NeurIPS Conference 2023 Conference Paper

SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation

  • Haobo Jiang
  • Mathieu Salzmann
  • Zheng Dang
  • Jin Xie
  • Jian Yang

In this paper, we introduce an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios. Our approach formulates the 3D registration task as a denoising diffusion process, which progressively refines the pose of the source point cloud to obtain a precise alignment with the model point cloud. Training our framework involves two operations: An SE(3) diffusion process and an SE(3) reverse process. The SE(3) diffusion process gradually perturbs the optimal rigid transformation of a pair of point clouds by continuously injecting noise (perturbation transformation). By contrast, the SE(3) reverse process focuses on learning a denoising network that refines the noisy transformation step-by-step, bringing it closer to the optimal transformation for accurate pose estimation. Unlike standard diffusion models used in linear Euclidean spaces, our diffusion model operates on the SE(3) manifold. This requires exploiting the linear Lie algebra $\mathfrak{se}(3)$ associated with SE(3) to constrain the transformation transitions during the diffusion and reverse processes. Additionally, to effectively train our denoising network, we derive a registration-specific variational lower bound as the optimization objective for model learning. Furthermore, we show that our denoising network can be constructed with a surrogate registration model, making our approach applicable to different deep registration networks. Extensive experiments demonstrate that our diffusion registration framework presents outstanding pose estimation performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
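As a rough illustration of the forward (perturbation) step on the SE(3) manifold described above, noise can be injected through the Lie algebra: sample a random rotation vector, exponentiate it to a rotation, and compose it with the current pose, while adding Gaussian noise to the translation. The sketch below makes that concrete under assumed noise magnitudes; it is not the paper's noise schedule or implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def perturb_se3(R, t, sigma_rot=0.05, sigma_trans=0.01):
    """One forward-diffusion-style perturbation of a rigid transform (R, t).

    Noise is injected via the Lie algebra: a random rotation vector (so(3))
    is exponentiated and composed with R, and Gaussian noise is added to t.
    The noise magnitudes here are illustrative assumptions, not the paper's.
    """
    xi_rot = np.random.randn(3) * sigma_rot            # axis-angle noise
    R_noise = Rotation.from_rotvec(xi_rot).as_matrix()  # exp map to SO(3)
    return R_noise @ R, t + np.random.randn(3) * sigma_trans
```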

AAAI Conference 2023 Conference Paper

Structure Flow-Guided Network for Real Depth Super-resolution

  • Jiayi Yuan
  • Haobo Jiang
  • Xiang Li
  • Jianjun Qian
  • Jun Li
  • Jian Yang

Real depth super-resolution (DSR), unlike synthetic settings, is a challenging task due to the structural distortion and the edge noise caused by the natural degradation in real-world low-resolution (LR) depth maps. These defects result in significant structure inconsistency between the depth map and the RGB guidance, which potentially confuses the RGB-structure guidance and thereby degrades the DSR quality. In this paper, we propose a novel structure flow-guided DSR framework, where a cross-modality flow map is learned to guide the RGB-structure information transferring for precise depth upsampling. Specifically, our framework consists of a cross-modality flow-guided upsampling network (CFUNet) and a flow-enhanced pyramid edge attention network (PEANet). CFUNet contains a trilateral self-attention module combining both the geometric and semantic correlations for reliable cross-modality flow learning. Then, the learned flow maps are combined with the grid-sampling mechanism for coarse high-resolution (HR) depth prediction. PEANet aims to integrate the learned flow map as edge attention into a pyramid network to hierarchically learn the edge-focused guidance feature for depth edge refinement. Extensive experiments on real and synthetic DSR datasets verify that our approach achieves excellent performance compared to state-of-the-art methods. Our code is available at: https://github.com/Yuanjiayii/DSR-SFG.

IJCAI Conference 2022 Conference Paper

Active Contrastive Set Mining for Robust Audio-Visual Instance Discrimination

  • Hanyu Xuan
  • Yihong Xu
  • Shuo Chen
  • Zhiliang Wu
  • Jian Yang
  • Yan Yan
  • Xavier Alameda-Pineda

The recent success of audio-visual representation learning can be largely attributed to the pervasive property of audio-visual synchronization, which can be used as self-annotated supervision. As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm. Existing AVID methods construct the contrastive set by random sampling based on the assumption that the audio and visual clips from all other videos are not semantically related. We argue that this assumption is too coarse, since the resulting contrastive sets contain a large number of faulty negatives. In this paper, we overcome this limitation by proposing a novel Active Contrastive Set Mining (ACSM) method that aims to mine contrastive sets with informative and diverse negatives for robust AVID. Moreover, we also integrate a semantically-aware hard-sample mining strategy into our ACSM. The proposed ACSM is implemented in two of the most recent state-of-the-art AVID methods and significantly improves their performance. Extensive experiments conducted on both action and sound recognition across multiple datasets show the remarkably improved performance of our method.

NeurIPS Conference 2022 Conference Paper

Dual-discriminative Graph Neural Network for Imbalanced Graph-level Anomaly Detection

  • Ge Zhang
  • Zhenyu Yang
  • Jia Wu
  • Jian Yang
  • Shan Xue
  • Hao Peng
  • Jianlin Su
  • Chuan Zhou

Graph-level anomaly detection aims to distinguish anomalous graphs in a graph dataset from normal graphs. Anomalous graphs represent very few but essential patterns in the real world. The anomalous property of a graph may be attributable to anomalous attributes of particular nodes and to anomalous substructures, i.e., subsets of nodes and edges in the graph. In addition, due to the imbalanced nature of the anomaly detection problem, anomalous information is diluted by normal graphs of overwhelming quantity. The various anomaly notions in attributes and/or substructures, together with this imbalance, make detecting anomalous graphs a non-trivial task. In this paper, we propose a graph neural network for graph-level anomaly detection, namely iGAD. Specifically, an anomalous graph attribute-aware graph convolution and an anomalous graph substructure-aware deep Random Walk Kernel (deep RWK) are welded into a graph neural network to achieve the dual-discriminative ability on anomalous attributes and substructures. Deep RWK in iGAD makes up for the deficiency of graph convolution in distinguishing structural information caused by the simple neighborhood aggregation mechanism. Further, we propose a Point Mutual Information (PMI)-based loss function to address the problems caused by imbalanced distributions. The PMI-based loss function enables iGAD to capture the essential correlation between input graphs and their anomalous/normal properties. We evaluate iGAD on four real-world graph datasets. Extensive experiments demonstrate the superiority of iGAD on the graph-level anomaly detection task.

JBHI Journal 2022 Journal Article

Few-Shot Learning for Deformable Medical Image Registration With Perception-Correspondence Decoupling and Reverse Teaching

  • Yuting He
  • Tiantian Li
  • Rongjun Ge
  • Jian Yang
  • Youyong Kong
  • Jian Zhu
  • Huazhong Shu
  • Guanyu Yang

Deformable medical image registration estimates the deformation needed to align the regions of interest (ROIs) of two images to the same spatial coordinate system. However, recent unsupervised registration models only have correspondence ability without perception, causing misalignment on blurred anatomies and distortion in task-irrelevant backgrounds. Label-constrained (LC) registration models embed the perception ability via labels, but the lack of texture constraints in labels and the expensive labeling costs cause distortion inside ROIs and overfitted perception. We propose the first few-shot deformable medical image registration framework, Perception-Correspondence Registration (PC-Reg), which embeds perception ability into registration models with only a few labels, thus greatly improving registration accuracy and reducing distortion. 1) We propose Perception-Correspondence Decoupling, which decouples the perception and correspondence actions of registration into two CNNs. Therefore, independent optimizations and feature representations are available, avoiding interference with the correspondence caused by the lack of texture constraints. 2) For few-shot learning, we propose Reverse Teaching, which aligns labeled and unlabeled images to each other to provide supervision information for the structure and style knowledge in unlabeled images, thus generating additional training data. These data reversely teach our perception CNN more style and structure knowledge, improving its generalization ability. Our experiments on three datasets with only five labels demonstrate that our PC-Reg has competitive registration accuracy and effective distortion-reducing ability. Compared with LC-VoxelMorph ($\lambda=1$), we achieve 12.5%, 6.3%, and 1.0% Reg-DSC improvements on the three datasets, revealing the great potential of our framework in clinical application.

IJCAI Conference 2022 Conference Paper

High-resource Language-specific Training for Multilingual Neural Machine Translation

  • Jian Yang
  • Yuwei Yin
  • Shuming Ma
  • Dongdong Zhang
  • Zhoujun Li
  • Furu Wei

Multilingual neural machine translation (MNMT) trained on multiple language pairs has attracted considerable attention due to fewer model parameters and lower training costs achieved by sharing knowledge among multiple languages. Nonetheless, multilingual training is plagued by language interference in shared parameters because of the negative interactions among different translation directions, especially for high-resource languages. In this paper, we propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate the negative interference, which adopts two-stage training with a language-specific selection mechanism. Specifically, we first train the multilingual model only on the high-resource pairs and select the language-specific modules at the top of the decoder to enhance the translation quality of high-resource directions. Next, the model is further trained on all available corpora to transfer knowledge from high-resource languages (HRLs) to low-resource languages (LRLs). Experimental results show that HLT-MT outperforms various strong baselines on the WMT-10 and OPUS-100 benchmarks. Furthermore, analytic experiments validate the effectiveness of our method in mitigating the negative interference in multilingual training.

AAAI Conference 2022 Conference Paper

Keypoint Message Passing for Video-Based Person Re-identification

  • Di Chen
  • Andreas Doering
  • Shanshan Zhang
  • Jian Yang
  • Juergen Gall
  • Bernt Schiele

Video-based person re-identification (re-ID) is an important technique in visual surveillance systems which aims to match video snippets of people captured by different cameras. Existing methods are mostly based on convolutional neural networks (CNNs), whose building blocks either process local neighbor pixels at a time, or, when 3D convolutions are used to model temporal information, suffer from the misalignment problem caused by person movement. In this paper, we propose to overcome the limitations of normal convolutions with a human-oriented graph method. Specifically, features located at person joint keypoints are extracted and connected as a spatial-temporal graph. These keypoint features are then updated by message passing from their connected nodes with a graph convolutional network (GCN). During training, the GCN can be attached to any CNN-based person re-ID model to assist representation learning on feature maps, whilst it can be dropped after training for better inference speed. Our method brings significant improvements over the CNN-based baseline model on the MARS dataset with generated person keypoints and a newly annotated dataset, PoseTrackReID. It also sets a new state of the art in terms of top-1 accuracy and mean average precision in comparison to prior works.

NeurIPS Conference 2022 Conference Paper

Learning Contrastive Embedding in Low-Dimensional Space

  • Shuo Chen
  • Chen Gong
  • Jun Li
  • Jian Yang
  • Gang Niu
  • Masashi Sugiyama

Contrastive learning (CL) pretrains feature embeddings to scatter instances in the feature space so that the training data can be well discriminated. Most existing CL techniques usually encourage learning such feature embeddings in a high-dimensional space to maximize instance discrimination. However, this practice may lead to undesired results where the scattered instances are sparsely distributed in the high-dimensional feature space, making it difficult to capture the underlying similarity between pairwise instances. To this end, we propose a novel framework called contrastive learning with low-dimensional reconstruction (CLLR), which adopts a regularized projection layer to reduce the dimensionality of the feature embedding. In CLLR, we build a sparse/low-rank regularizer to adaptively reconstruct a low-dimensional projection space while preserving the basic objective for instance discrimination, thus successfully learning contrastive embeddings that alleviate the above issue. Theoretically, we prove a tighter error bound for CLLR; empirically, the superiority of CLLR is demonstrated across multiple domains. Both theoretical and experimental results emphasize the significance of learning low-dimensional contrastive embeddings.

NeurIPS Conference 2022 Conference Paper

Learning Superpoint Graph Cut for 3D Instance Segmentation

  • Le Hui
  • Linghua Tang
  • Yaqi Shen
  • Jin Xie
  • Jian Yang

3D instance segmentation is a challenging task due to the complex local geometric structures of objects in point clouds. In this paper, we propose a learning-based superpoint graph cut method that explicitly learns the local geometric structures of the point cloud for 3D instance segmentation. Specifically, we first oversegment the raw point clouds into superpoints and construct the superpoint graph. Then, we propose an edge score prediction network to predict the edge scores of the superpoint graph, where the similarity vectors of two adjacent nodes learned through cross-graph attention in the coordinate and feature spaces are used for regressing edge scores. By forcing two adjacent nodes of the same instance to be close to the instance center in the coordinate and feature spaces, we formulate a geometry-aware edge loss to train the edge score prediction network. Finally, we develop a superpoint graph cut network that employs the learned edge scores and the predicted semantic classes of nodes to generate instances, where bilateral graph attention is proposed to extract discriminative features on both the coordinate and feature spaces for predicting semantic labels and scores of instances. Extensive experiments on two challenging datasets, ScanNet v2 and S3DIS, show that our method achieves new state-of-the-art performance on 3D instance segmentation.

IJCAI Conference 2022 Conference Paper

Modeling Spatio-temporal Neighbourhood for Personalized Point-of-interest Recommendation

  • Xiaolin Wang
  • Guohao Sun
  • Xiu Fang
  • Jian Yang
  • Shoujin Wang

Point-of-interest (POI) recommendation helps users explore attractive locations and plays an important role in location-based social networks (LBSNs). In POI recommendation, the results are largely impacted by users' preferences. However, existing POI methods model users and locations largely separately, and thus cannot capture users' personal and dynamic preferences for locations. In addition, they ignore users' acceptance of a location's distance/time. To overcome the limitations of existing methods, we first introduce a Knowledge Graph with temporal information (TKG) into POI recommendation, covering both users and locations with timestamps. Then, based on the TKG, we propose a Spatial-Temporal Graph Convolutional Attention Network (STGCAN), a novel network that learns users' preferences on the TKG by dynamically capturing spatial-temporal neighbourhoods. Specifically, in STGCAN, we construct receptive fields on the TKG to aggregate the neighbourhoods of users and locations at each timestamp. We also measure the spatial-temporal interval, reflecting users' acceptance of distance/time, with self-attention. Experiments on three real-world datasets demonstrate that the proposed model outperforms state-of-the-art POI recommendation approaches.

JBHI Journal 2022 Journal Article

MVSGAN: Spatial-Aware Multi-View CMR Fusion for Accurate 3D Left Ventricular Myocardium Segmentation

  • Xiaoming Qi
  • Yuting He
  • Guanyu Yang
  • Yang Chen
  • Jian Yang
  • Wangyag Liu
  • Yinsu Zhu
  • Yi Xu

The accurate 3D left ventricular (LV) myocardium segmentation in the short-axis (SAX) view of cardiac magnetic resonance (CMR) is challenged by the sparse spatial structure of CMR. The strategy of multi-view CMR fusion can provide fine-grained spatial structure for accurate segmentation. However, the large information misalignment and the lack of dense 3D CMR as a fusion target in multi-view CMR fusion, along with the different spatial resolution between the fusion result and the ground truth in segmentation, limit this strategy. In this study, we propose a multi-view spatial-aware adversarial network (MVSGAN). It studies the perception of fine-grained cardiac structure for accurate segmentation through spatial-aware multi-view CMR fusion. It consists of three modules: (1) A residual adversarial fusion (RAF) module takes inter-slice deep correlation and anatomical priors to refine the spatial structures by residual supplement and adversarial optimization. (2) A structural perception-aggregation (SPA) module establishes the spatial correlation between the dense cardiac model and the sparse label for accurate CMR LV myocardium segmentation. (3) A joint training strategy utilizes the dense SAX volume as explicit and implicit goals to jointly optimize the framework. The experiments are conducted on a public dataset and a clinical dataset to evaluate the performance of MVSGAN. The average Dice and Jaccard scores of LV myocardium segmentation obtained by MVSGAN are the highest among seven existing state-of-the-art methods, reaching 0.92 and 0.75. It is concluded that spatial-aware multi-view CMR fusion can provide meaningful spatial correlation for accurate LV myocardium segmentation.

NeurIPS Conference 2022 Conference Paper

RecursiveMix: Mixed Learning with History

  • Lingfeng Yang
  • Xiang Li
  • Borui Zhao
  • Renjie Song
  • Jian Yang

Mix-based augmentation has been proven fundamental to the generalization of deep vision models. However, current augmentations only mix samples from the current data batch during training, which ignores the possible knowledge accumulated in the learning history. In this paper, we propose a recursive mixed-sample learning paradigm, termed "RecursiveMix" (RM), by exploring a novel training strategy that leverages the historical input-prediction-label triplets. More specifically, we iteratively resize the input image batch from the previous iteration and paste it into the current batch while their labels are fused proportionally to the area of the operated patches. Furthermore, a consistency loss is introduced to align the identical image semantics across the iterations, which helps the learning of scale-invariant feature representations. Based on ResNet-50, RM largely improves classification accuracy by $\sim$3.2% on CIFAR-100 and $\sim$2.8% on ImageNet with negligible extra computation/storage costs. In the downstream object detection task, the RM-pretrained model outperforms the baseline by 2.1 AP points and surpasses CutMix by 1.4 AP points under the ATSS detector on COCO. In semantic segmentation, RM also surpasses the baseline and CutMix by 1.9 and 1.1 mIoU points under UperNet on ADE20K, respectively. Codes and pretrained models are available at https://github.com/implus/RecursiveMix.
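A minimal sketch of the recursive mixing step described above: the previous iteration's batch is resized and pasted into the current batch, and the labels are fused in proportion to the pasted area. The patch placement, scale, and function name below are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def recursive_mix(x, y, prev_x, prev_y, scale=0.5, num_classes=1000):
    """Paste a resized copy of the previous batch into the current batch.

    x: (B,C,H,W) current images, y: (B,) current integer labels,
    prev_x/prev_y: images and labels kept from the previous iteration.
    Pasting at the top-left corner is a simplifying assumption.
    """
    B, C, H, W = x.shape
    h, w = int(H * scale), int(W * scale)
    patch = F.interpolate(prev_x, size=(h, w), mode="bilinear", align_corners=False)
    mixed = x.clone()
    mixed[:, :, :h, :w] = patch                      # paste the historical batch
    lam = (h * w) / (H * W)                          # area ratio of the patch
    y_onehot = F.one_hot(y, num_classes).float()
    prev_onehot = F.one_hot(prev_y, num_classes).float()
    mixed_y = (1 - lam) * y_onehot + lam * prev_onehot
    return mixed, mixed_y
```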

AAAI Conference 2022 Conference Paper

Reliable Inlier Evaluation for Unsupervised Point Cloud Registration

  • Yaqi Shen
  • Le Hui
  • Haobo Jiang
  • Jin Xie
  • Jian Yang

Unsupervised point cloud registration algorithms usually suffer from unsatisfactory registration precision in the partially overlapping setting due to the lack of effective inlier evaluation. In this paper, we propose a neighborhood-consensus-based reliable inlier evaluation method for robust unsupervised point cloud registration. It is expected to capture the discriminative geometric difference between the source neighborhood and the corresponding pseudo target neighborhood for effective inlier distinction. Specifically, our model consists of a matching map refinement module and an inlier evaluation module. In the matching map refinement module, we improve the point-wise matching map estimation by integrating the matching scores of neighbors into it. The aggregated neighborhood information facilitates discriminative map construction so that high-quality correspondences can be provided for generating the pseudo target point cloud. Based on the observation that an outlier exhibits a significant structure-wise difference between its source neighborhood and the corresponding pseudo target neighborhood, whereas this difference is small for an inlier, the inlier evaluation module exploits this difference to score the inlier confidence of each estimated correspondence. In particular, we construct an effective graph representation for capturing this geometric difference between the neighborhoods. Finally, with the learned correspondences and the corresponding inlier confidences, we use the weighted SVD algorithm for transformation estimation. Under the unsupervised setting, we exploit the Huber-function-based global alignment loss, the local neighborhood consensus loss, and the spatial consistency loss for model optimization. The experimental results on extensive datasets demonstrate that our unsupervised point cloud registration method can yield comparable performance.
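The final transformation-estimation step mentioned above, weighted SVD over correspondences with inlier confidences, is the standard weighted Kabsch/Procrustes solution and can be sketched as follows; this is a generic reference implementation, not the authors' code.

```python
import numpy as np

def weighted_rigid_transform(src, tgt, w):
    """Estimate R, t minimizing sum_i w_i ||R @ src_i + t - tgt_i||^2.

    src, tgt: (N,3) corresponding points; w: (N,) non-negative inlier confidences.
    Standard weighted Kabsch solution via SVD; not taken from the paper's code.
    """
    w = w / (w.sum() + 1e-12)
    src_c = (w[:, None] * src).sum(0)                     # weighted centroids
    tgt_c = (w[:, None] * tgt).sum(0)
    S = (src - src_c).T @ (w[:, None] * (tgt - tgt_c))    # 3x3 weighted covariance
    U, _, Vt = np.linalg.svd(S)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                # reflection correction
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t
```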

IJCAI Conference 2022 Conference Paper

UM4: Unified Multilingual Multiple Teacher-Student Model for Zero-Resource Neural Machine Translation

  • Jian Yang
  • Yuwei Yin
  • Shuming Ma
  • Dongdong Zhang
  • ShuangZhi Wu
  • Hongcheng Guo
  • Zhoujun Li
  • Furu Wei

Most translation tasks among languages belong to the zero-resource translation problem where parallel corpora are unavailable. Multilingual neural machine translation (MNMT) enables one-pass translation using a shared semantic space for all languages, compared to two-pass pivot translation, but often underperforms the pivot-based method. In this paper, we propose a novel method, named Unified Multilingual Multiple teacher-student Model for NMT (UM4). Our method unifies source-teacher, target-teacher, and pivot-teacher models to guide the student model for zero-resource translation. The source teacher and target teacher force the student to learn the direct source-target translation through the distilled knowledge on both the source and target sides. The monolingual corpus is further leveraged by the pivot-teacher model to enhance the student model. Experimental results on 72 translation directions demonstrate that our model significantly outperforms previous methods on the WMT benchmark.
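One plausible way to realize the multiple-teacher guidance described above is to average per-teacher KL-divergence terms between the student's output distribution and each teacher's distribution. The sketch below illustrates such a combined distillation loss; the weighting, temperature, and exact combination scheme are assumptions and may differ from the paper.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_logits, teacher_logits_list, weights=None, T=1.0):
    """Distill a student from several teachers by averaging per-teacher KL terms.

    student_logits: (B, L, V) logits of the student translation model.
    teacher_logits_list: list of (B, L, V) logits from, e.g., source-, target-,
    and pivot-teacher models. The weights and temperature T are assumptions.
    """
    if weights is None:
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / T, dim=-1)
        loss = loss + w * F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return loss
```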

NeurIPS Conference 2022 Conference Paper

Uncertainty-Aware Hierarchical Refinement for Incremental Implicitly-Refined Classification

  • Jian Yang
  • Kai Zhu
  • Kecheng Zheng
  • Yang Cao

Incremental implicitly-refined classification task aims at assigning hierarchical labels to each sample encountered at different phases. Existing methods tend to fail in generating hierarchy-invariant descriptors when the novel classes are inherited from the old ones. To address the issue, this paper, which explores the inheritance relations in the process of multi-level semantic increment, proposes an Uncertainty-Aware Hierarchical Refinement (UAHR) scheme. Specifically, our proposed scheme consists of a global representation extension strategy that enhances the discrimination of incremental representation by widening the corresponding margin distance, and a hierarchical distribution alignment strategy that refines the distillation process by explicitly determining the inheritance relationship of the incremental class. Particularly, the shifting subclasses are corrected under the guidance of hierarchical uncertainty, ensuring the consistency of the homogeneous features. Extensive experiments on widely used benchmarks (i.e., IIRC-CIFAR, IIRC-ImageNet-lite, IIRC-ImageNet-Subset, and IIRC-ImageNet-full) demonstrate the superiority of our proposed method over the state-of-the-art approaches.

IJCAI Conference 2022 Conference Paper

Webly-Supervised Fine-Grained Recognition with Partial Label Learning

  • Yu-Yan Xu
  • Yang Shen
  • Xiu-Shen Wei
  • Jian Yang

The task of webly-supervised fine-grained recognition is to boost the recognition accuracy of classifying subordinate categories (e.g., different bird species) by utilizing freely available but noisy web data. As label noises significantly hurt network training, it is desirable to distinguish and eliminate noisy images. In this paper, we propose two strategies, i.e., open-set noise removal and closed-set noise correction, to remove these two kinds of web noise for fine-grained recognition. Specifically, for open-set noise removal, we utilize a pre-trained deep model to perform deep descriptor transformation to estimate the positive correlation between these web images, and detect the open-set noises based on the correlation values. Regarding closed-set noise correction, we develop a top-k recall optimization loss that first assigns a label set to each web image to reduce the impact of hard label assignment for closed-set noises. Then, from a partial label learning perspective, we further correct each sample by recovering its true single label from the assigned label set. Experiments on several webly-supervised fine-grained benchmark datasets show that our method clearly outperforms other existing state-of-the-art methods.

NeurIPS Conference 2021 Conference Paper

3D Siamese Voxel-to-BEV Tracker for Sparse Point Clouds

  • Le Hui
  • Lingpeng Wang
  • Mingmei Cheng
  • Jin Xie
  • Jian Yang

3D object tracking in point clouds is still a challenging problem due to the sparsity of LiDAR points in dynamic environments. In this work, we propose a Siamese voxel-to-BEV tracker, which can significantly improve the tracking performance in sparse 3D point clouds. Specifically, it consists of a Siamese shape-aware feature learning network and a voxel-to-BEV target localization network. The Siamese shape-aware feature learning network can capture 3D shape information of the object to learn the discriminative features of the object so that the potential target from the background in sparse point clouds can be identified. To this end, we first perform template feature embedding to embed the template's feature into the potential target and then generate a dense 3D shape to characterize the shape information of the potential target. For localizing the tracked target, the voxel-to-BEV target localization network regresses the target's 2D center and the z-axis center from the dense bird's eye view (BEV) feature map in an anchor-free manner. Concretely, we compress the voxelized point cloud along the z-axis through max pooling to obtain a dense BEV feature map, where the regression of the 2D center and the z-axis center can be performed more effectively. Extensive evaluation on the KITTI tracking dataset shows that our method significantly outperforms the current state-of-the-art methods by a large margin. Code is available at https://github.com/fpthink/V2B.

TIST Journal 2021 Journal Article

A Comprehensive Survey of the Key Technologies and Challenges Surrounding Vehicular Ad Hoc Networks

  • Zhenchang Xia
  • Jia Wu
  • Libing Wu
  • Yanjiao Chen
  • Jian Yang
  • Philip S. Yu

Vehicular ad hoc networks (VANETs) and the services they support are an essential part of intelligent transportation. Through physical technologies, applications, protocols, and standards, they help to ensure traffic moves efficiently and vehicles operate safely. This article surveys the current state of play in VANET development. The topics summarized and classified include the key technologies critical to the field, the resource-management and safety applications needed for smooth operations, the communication and data-transmission protocols that support networking, and the theoretical and environmental constructs underpinning research and development, such as graph neural networks and the Internet of Things. Additionally, we identify and discuss several challenges facing VANETs, including poor safety, poor reliability, non-uniform standards, and low intelligence levels. Finally, we touch on hot technologies and techniques, such as reinforcement learning and 5G communications, to provide an outlook for the future of intelligent transportation systems.

NeurIPS Conference 2021 Conference Paper

A$^2$-Net: Learning Attribute-Aware Hash Codes for Large-Scale Fine-Grained Image Retrieval

  • Xiu-Shen Wei
  • Yang Shen
  • Xuhao Sun
  • Han-Jia Ye
  • Jian Yang

Our work focuses on tackling large-scale fine-grained image retrieval by ranking the images depicting the concept of interest (i.e., the same sub-category label) highest based on the fine-grained details in the query. For such a practical task, it is desirable to alleviate the challenges of both the fine-grained nature of small inter-class variations with large intra-class variations and the explosive growth of fine-grained data. In this paper, we propose an Attribute-Aware hashing Network (A$^2$-Net) for generating attribute-aware hash codes that not only make the retrieval process efficient, but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the visual representations captured by attention, we develop an encoder-decoder structure network for a reconstruction task to distill, without attribute annotations, high-level attribute-specific vectors from the appearance-specific visual representations. A$^2$-Net is also equipped with a feature decorrelation constraint upon these attribute vectors to enhance their representation abilities. Finally, the required hash codes are generated from the attribute vectors while preserving the original similarities. Qualitative experiments on five benchmark fine-grained datasets show our superiority over competing methods. More importantly, quantitative results demonstrate that the obtained hash codes can strongly correspond to certain kinds of crucial properties of fine-grained objects.

AAAI Conference 2021 Conference Paper

Action Candidate Based Clipped Double Q-learning for Discrete and Continuous Action Tasks

  • Haobo Jiang
  • Jin Xie
  • Jian Yang

Double Q-learning is a popular reinforcement learning algorithm for Markov decision process (MDP) problems. Clipped Double Q-learning, an effective variant of Double Q-learning, employs the clipped double estimator to approximate the maximum expected action value. Due to the underestimation bias of the clipped double estimator, the performance of Clipped Double Q-learning may be degraded in some stochastic environments. In this paper, in order to reduce the underestimation bias, we propose an action-candidate-based clipped double estimator for Double Q-learning. Specifically, we first select a set of elite action candidates with high action values from one set of estimators. Then, among these candidates, we choose the highest-valued action according to the other set of estimators. Finally, we use the maximum value in the second set of estimators to clip the action value of the chosen action in the first set of estimators, and the clipped value is used for approximating the maximum expected action value. Theoretically, the underestimation bias in our clipped Double Q-learning decays monotonically as the number of action candidates decreases. Moreover, the number of action candidates controls the trade-off between the overestimation and underestimation biases. In addition, we extend our clipped Double Q-learning to continuous action tasks via approximating the elite continuous action candidates. We empirically verify that our algorithm can more accurately estimate the maximum expected action value on some toy environments and yields good performance on several benchmark problems.
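The estimator described above is simple enough to state directly in code: take the top-k actions under one estimator, pick the best of those according to the second estimator, and clip the first estimator's value for that action by the second estimator's maximum. The snippet below is a tabular illustration under that reading of the abstract, not the authors' implementation.

```python
import numpy as np

def ac_clipped_double_estimate(q_a, q_b, k):
    """Action-candidate clipped double estimate of max_a E[Q(s,a)].

    q_a, q_b: (num_actions,) value estimates from two independent estimators.
    k: number of elite candidates taken from estimator A.
    """
    candidates = np.argsort(q_a)[-k:]                     # top-k actions under A
    a_star = candidates[np.argmax(q_b[candidates])]       # best candidate under B
    return min(q_a[a_star], q_b.max())                    # clip A's value by B's maximum

# Hypothetical tabular usage for the update target:
# target = reward + gamma * ac_clipped_double_estimate(Q1[next_state], Q2[next_state], k=3)
```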

AAAI Conference 2021 Conference Paper

Contrastive and Generative Graph Convolutional Networks for Graph-based Semi-Supervised Learning

  • Sheng Wan
  • Shirui Pan
  • Jian Yang
  • Chen Gong

Graph-based Semi-Supervised Learning (SSL) aims to transfer the labels of a handful of labeled data to the remaining massive unlabeled data via a graph. As one of the most popular graph-based SSL approaches, the recently proposed Graph Convolutional Networks (GCNs) have gained remarkable progress by combining the sound expressiveness of neural networks with graph structure. Nevertheless, existing graph-based methods do not directly address the core problem of SSL, i.e., the shortage of supervision, and thus their performance is still very limited. To address this issue, a novel GCN-based SSL algorithm is presented in this paper to enrich the supervision signals by utilizing both data similarities and graph structure. Firstly, by designing a semi-supervised contrastive loss, improved node representations can be generated via maximizing the agreement between different views of the same data or data from the same class. Therefore, the rich unlabeled data and the scarce yet valuable labeled data can jointly provide abundant supervision information for learning discriminative node representations, which helps improve the subsequent classification result. Secondly, the underlying determinative relationship between the data features and the input graph topology is extracted as supplementary supervision signals for SSL via a graph generative loss related to the input features. Intensive experimental results on a variety of real-world datasets firmly verify the effectiveness of our algorithm compared with other state-of-the-art methods.

IS Journal 2021 Journal Article

CoTrRank: Trust Ranking on Twitter

  • Peiyao Li
  • Weiliang Zhao
  • Jian Yang
  • Jia Wu

Trust evaluation of people and information on social media is critical for maintaining a healthy online social environment. Evaluating the trustworthiness of users and tweets is challenging due to the complex relationships between and among users and their posts. As existing approaches use a single network to represent users, posts, and their relationships, they are limited in reflecting the different statistical features of users and tweets, which reduces their ability to determine the trustworthiness of users and tweets. To address this issue, we develop a trust evaluation method that models users and tweets separately in two networks that are coupled with each other via interactions. We provide mapping functions to map the statistics of user/tweet actions to trust values that indicate their relevant trust degrees. The proposed method provides a configurable solution that can consider the effects of users and tweets differently in different trust ranking situations. A set of experiments is conducted on real data collected from Twitter. The experimental results show that the proposed approach is more effective in trust evaluation compared with several baseline methods.

AAAI Conference 2021 Conference Paper

Deep Wasserstein Graph Discriminant Learning for Graph Classification

  • Tong Zhang
  • Yun Wang
  • Zhen Cui
  • Chuanwei Zhou
  • Baoliang Cui
  • Haikuan Huang
  • Jian Yang

Graph topological structures are crucial for distinguishing graphs of different classes. In this work, we propose a deep Wasserstein graph discriminant learning (WGDL) framework to learn discriminative embeddings of graphs in a Wasserstein-metric (W-metric) matching space. In order to bypass the calculation of W-metric class centers in discriminant analysis, as well as to better support batch learning, we introduce a reference set of graphs (aka graph dictionary) to express representative graph samples (aka dictionary keys). On the bridge of the graph dictionary, every input graph can be projected into the latent dictionary space through our proposed Wasserstein graph transformation (WGT). In WGT, we formulate the inter-graph distance in W-metric space by virtue of the optimal transport (OT) principle, which effectively expresses the correlations of cross-graph structures. To give WGDL better representation ability, we dynamically update the graph dictionary during training by maximizing the Wasserstein discriminant loss, i.e., the ratio of inter-class to intra-class Wasserstein distance. To evaluate our WGDL method, comprehensive experiments are conducted on six graph classification datasets. Experimental results demonstrate the effectiveness and state-of-the-art performance of our WGDL.

JBHI Journal 2021 Journal Article

Divergence-Free Fitting-Based Incompressible Deformation Quantification of Liver

  • Tianyu Fu
  • Jingfan Fan
  • Dingkun Liu
  • Hong Song
  • Chaoyi Zhang
  • Danni Ai
  • Zhigang Cheng
  • Ping Liang

The liver is an incompressible organ that maintains its volume during respiration-induced deformation. Quantifying this deformation under the incompressibility constraint is significant for liver tracking. The constraint can be enforced by retaining the divergence-free field obtained from the deformation decomposition. However, the decomposition process is time-consuming, and the removal of the non-divergence-free field weakens the deformation. In this study, a divergence-free fitting-based registration method is proposed to quantify the incompressible deformation rapidly and accurately. First, the deformation to be estimated is mapped to a velocity in a diffeomorphic space. Then, this velocity is decomposed by a fast Fourier-based Hodge-Helmholtz decomposition to obtain the divergence-free, curl-free, and harmonic fields. The curl-free field is replaced and fitted by the obtained harmonic field with a translation field to generate a new divergence-free velocity. By optimizing this velocity, the final incompressible deformation is obtained. Moreover, a deep learning framework (DLF) is constructed to accelerate the incompressible deformation quantification. An incompressible respiratory motion model is built for the DLF by using the proposed registration method and is then used to augment the training data. An encoder-decoder network is introduced to learn the appearance-velocity correlation at the patch scale. In the experiments, we compare the proposed registration method with three state-of-the-art methods. The results show that the proposed method can accurately achieve incompressible registration of the liver with a mean liver overlap ratio of 95.33%. Moreover, the DLF is nearly 15 times faster than the other methods.

IJCAI Conference 2021 Conference Paper

Graph Deformer Network

  • Wenting Zhao
  • Yuan Fang
  • Zhen Cui
  • Tong Zhang
  • Jian Yang

Convolution learning on graphs has drawn increasing attention recently due to its potential applications to large amounts of irregular data. Most graph convolution methods leverage plain summation/average aggregation to avoid the discrepancy of responses from isomorphic graphs. However, such an extreme collapsing approach results in structural loss and signal entanglement of nodes, which further degrades the learning ability. In this paper, we propose a simple yet effective Graph Deformer Network (GDN) to fulfill anisotropic convolution filtering on graphs, analogous to the standard convolution operation on images. Local neighborhood subgraphs (acting like receptive fields) with different structures are deformed into a unified virtual space, coordinated by several anchor nodes. In the deformation process, we transfer components of the nodes therein into affinitive anchors by learning their correlations, and build a multi-granularity feature space calibrated with anchors. Anisotropic convolutional kernels can then be applied over the anchor-coordinated space to encode local variations of receptive fields. By parameterizing anchors and stacking coarsening layers, we build a graph deformer network in an end-to-end fashion. Theoretical analysis indicates its connection to previous work and shows the promising property of graph isomorphism testing. Extensive experiments on widely-used datasets validate the effectiveness of GDN in graph and node classification.

AAAI Conference 2021 Conference Paper

Graph Game Embedding

  • Xiaobin Hong
  • Tong Zhang
  • Zhen Cui
  • Yuge Huang
  • Pengcheng Shen
  • Shaoxin Li
  • Jian Yang

Graph embedding aims to encode nodes/edges into low-dimensional continuous features, and has become a crucial tool for graph analysis including graph/node classification, link prediction, etc. In this paper we propose a novel graph learning framework, named graph game embedding, to learn discriminative node representations as well as encode graph structures. Inspired by the spirit of game learning, node embedding is converted into a selection/searching process of player strategies, where each node corresponds to one player and each edge corresponds to an interaction of two players. Then, a utility function, which theoretically satisfies the Nash Equilibrium, is defined to measure the benefit/loss of players during graph evolution. Furthermore, a collaboration and competition mechanism is introduced to increase the discriminant learning ability. Under this graph game embedding framework, considering different interaction manners of nodes, we propose two specific models, named paired game embedding for paired nodes and group game embedding for group interaction. Compared with existing graph embedding methods, our algorithm possesses two advantages: (1) the designed utility function ensures stable graph evolution with theoretical convergence and Nash Equilibrium satisfaction; (2) the introduced collaboration and competition mechanism endows the graph game embedding framework with discriminative feature learning ability by guiding each node to learn an optimal strategy distinguished from others. We test the proposed method on three public citation network datasets, and the experimental results verify the effectiveness of our method.

AAAI Conference 2021 Conference Paper

Hierarchical Information Passing Based Noise-Tolerant Hybrid Learning for Semi-Supervised Human Parsing

  • Yunan Liu
  • Shanshan Zhang
  • Jian Yang
  • PongChi Yuen

Deep learning based human parsing methods usually require a large amount of training data to reach high performance. However, it is costly and time-consuming to obtain manually annotated high quality labels for a large scale dataset. To alleviate annotation efforts, we propose a new semi-supervised human parsing method for which we only need a small number of labels for training. First, we generate high quality pseudo labels on unlabeled images using a hierarchical information passing network (HIPN), which reasons human part segmentation in a coarse to fine manner. Furthermore, we develop a noise-tolerant hybrid learning method, which takes advantage of positive and negative learning to better handle noisy pseudo labels. When evaluated on standard human parsing benchmarks, our HIPN achieves a new state-of-the-art performance. Moreover, our noise-tolerant hybrid learning method further improves the performance and outperforms the state-of-the-art semi-supervised method (i.e., GRN) by 4.47 points w.r.t. mIoU on the LIP dataset.

NeurIPS Conference 2021 Conference Paper

Learning to Adapt via Latent Domains for Adaptive Semantic Segmentation

  • Yunan Liu
  • Shanshan Zhang
  • Yang Li
  • Jian Yang

Domain adaptive semantic segmentation aims to transfer knowledge learned from labeled source domain to unlabeled target domain. To narrow down the domain gap and ease adaptation difficulty, some recent methods translate source images to target-like images (latent domains), which are used as supplement or substitute to the original source data. Nevertheless, these methods neglect to explicitly model the relationship of knowledge transferring across different domains. Alternatively, in this work we break through the standard “source-target” one pair adaptation framework and construct multiple adaptation pairs (e.g., “source-latent” and “latent-target”). The purpose is to use the meta-knowledge (how to adapt) learned from one pair as guidance to assist the adaptation of another pair under a meta-learning framework. Furthermore, we extend our method to a more practical setting of open compound domain adaptation (a.k.a. multiple-target domain adaptation), where the target is a compound of multiple domains without domain labels. In this setting, we embed an additional pair of “latent-latent” to reduce the domain gap between the source and different latent domains, allowing the model to adapt well on multiple target domains simultaneously. When evaluated on standard benchmarks, our method is superior to the state-of-the-art methods in both the single target and multiple-target domain adaptation settings.

IJCAI Conference 2021 Conference Paper

Planning with Learned Dynamic Model for Unsupervised Point Cloud Registration

  • Haobo Jiang
  • Jianjun Qian
  • Jin Xie
  • Jian Yang

Point cloud registration is a fundamental problem in 3D computer vision. In this paper, we cast point cloud registration into a planning problem in reinforcement learning, which can seek the transformation between the source and target point clouds through trial and error. By modeling the point cloud registration process as a Markov decision process (MDP), we develop a latent dynamic model of point clouds, consisting of a transformation network and evaluation network. The transformation network aims to predict the new transformed feature of the point cloud after performing a rigid transformation (i.e., action) on it while the evaluation network aims to predict the alignment precision between the transformed source point cloud and target point cloud as the reward signal. Once the dynamic model of the point cloud is trained, we employ the cross-entropy method (CEM) to iteratively update the planning policy by maximizing the rewards in the point cloud registration process. Thus, the optimal policy, i.e., the transformation between the source and target point clouds, can be obtained via gradually narrowing the search space of the transformation. Experimental results on ModelNet40 and 7Scene benchmark datasets demonstrate that our method can yield good registration performance in an unsupervised manner.

IJCAI Conference 2021 Conference Paper

Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective

  • Yang Yang
  • Chubing Zhang
  • Yi-Chu Xu
  • Dianhai Yu
  • De-Chuan Zhan
  • Jian Yang

The main challenge of cross-modal retrieval is to learn consistent embeddings for heterogeneous modalities. To solve this problem, traditional label-wise cross-modal approaches usually constrain the inter-modal and intra-modal embedding consistency relying on the label ground-truths. However, experiments reveal that different modal networks actually have various generalization capacities, so end-to-end joint training with a consistency loss usually leads to a sub-optimal uni-modal model, which in turn affects the learning of consistent embeddings. Therefore, in this paper, we argue that what is really needed for supervised cross-modal retrieval is a good shared classification model. In other words, we learn the consistent embedding by ensuring the classification performance of each modality on the shared model, without the consistency loss. Specifically, we consider a technique called Semantic Sharing, which directly trains the two modalities interactively by adopting a shared self-attention based classification model. We evaluate the proposed approach on three representative datasets. The results validate that the proposed semantic sharing can consistently boost performance under the NDCG metric.

JMLR Journal 2021 Journal Article

Sparse Tensor Additive Regression

  • Botao Hao
  • Boxiang Wang
  • Pengyuan Wang
  • Jingfei Zhang
  • Jian Yang
  • Will Wei Sun

Tensors are becoming prevalent in modern applications such as medical imaging and digital marketing. In this paper, we propose a sparse tensor additive regression (STAR) that models a scalar response as a flexible nonparametric function of tensor covariates. The proposed model effectively exploits the sparse and low-rank structures in the tensor additive regression. We formulate the parameter estimation as a non-convex optimization problem, and propose an efficient penalized alternating minimization algorithm. We establish a non-asymptotic error bound for the estimator obtained from each iteration of the proposed algorithm, which reveals an interplay between the optimization error and the statistical rate of convergence. We demonstrate the efficacy of STAR through extensive comparative simulation studies, and an application to the click-through-rate prediction in online advertising.

AAAI Conference 2021 Conference Paper

SSPC-Net: Semi-supervised Semantic 3D Point Cloud Segmentation Network

  • Mingmei Cheng
  • Le Hui
  • Jin Xie
  • Jian Yang

Point cloud semantic segmentation is a crucial task in 3D scene understanding. Existing methods mainly focus on employing a large number of annotated labels for supervised semantic segmentation. Nonetheless, manually labeling such large point clouds for the supervised segmentation task is time-consuming. In order to reduce the number of annotated labels, we propose a semi-supervised semantic point cloud segmentation network, named SSPC-Net, where we train the semantic segmentation network by inferring the labels of unlabeled points from the few annotated 3D points. In our method, we first partition the whole point cloud into superpoints and build superpoint graphs to mine the long-range dependencies in point clouds. Based on the constructed superpoint graph, we then develop a dynamic label propagation method to generate the pseudo labels for the unsupervised superpoints. Particularly, we adopt a superpoint dropout strategy to dynamically select the generated pseudo labels. In order to fully exploit the generated pseudo labels of the unsupervised superpoints, we furthermore propose a coupled attention mechanism for superpoint feature embedding. Finally, we employ the cross-entropy loss to train the semantic segmentation network with the labels of the supervised superpoints and the pseudo labels of the unsupervised superpoints. Experiments on various datasets demonstrate that our semi-supervised segmentation method can achieve better performance than the current semi-supervised segmentation method with fewer annotated 3D points.

AAAI Conference 2021 Conference Paper

Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model

  • Qizhou Wang
  • Bo Han
  • Tongliang Liu
  • Gang Niu
  • Jian Yang
  • Chen Gong

The drastic increase in data quantity often brings a severe decrease in data quality, such as incorrect label annotations, which poses a great challenge for robustly training Deep Neural Networks (DNNs). Existing learning methods with label noise either employ ad-hoc heuristics or are restricted to specific noise assumptions. However, more general situations, such as instance-dependent label noise, have not been fully explored, as few studies focus on their label corruption process. By categorizing instances into confusing and unconfusing instances, this paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances. The resultant model can be realized by DNNs, where the training procedure is accomplished by employing an alternating optimization algorithm. Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements in robustness over state-of-the-art counterparts.

NeurIPS Conference 2021 Conference Paper

Universal Semi-Supervised Learning

  • Zhuo Huang
  • Chao Xue
  • Bo Han
  • Jian Yang
  • Chen Gong

Universal Semi-Supervised Learning (UniSSL) aims to solve the open-set problem where both the class distribution (i.e., class set) and feature distribution (i.e., feature domain) differ between the labeled dataset and the unlabeled dataset. Such a problem seriously hinders the practical deployment of classical SSL. Different from existing SSL methods targeting the open-set problem, which only study one certain scenario of class distribution mismatch and ignore the feature distribution mismatch, we consider a more general case where a mismatch exists in both class and feature distribution. In this case, we propose a ''Class-shAring data detection and Feature Adaptation'' (CAFA) framework which requires no prior knowledge of the class relationship between the labeled dataset and unlabeled dataset. Particularly, CAFA utilizes a novel scoring strategy to detect the data in the shared class set. Then, it conducts domain adaptation to fully exploit the value of the detected class-sharing data for better semi-supervised consistency training. Exhaustive experiments on several benchmark datasets show the effectiveness of our method in tackling open-set problems.

AAAI Conference 2020 Conference Paper

Alternating Language Modeling for Cross-Lingual Pre-Training

  • Jian Yang
  • Shuming Ma
  • Dongdong Zhang
  • ShuangZhi Wu
  • Zhoujun Li
  • Ming Zhou

Language model pre-training has achieved success in many natural language processing tasks. Existing methods for cross-lingual pre-training adopt the Translation Language Model to predict masked words given the concatenation of a source sentence and its target equivalent. In this work, we introduce a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). It code-switches sentences of different languages rather than simply concatenating them, hoping to capture the rich cross-lingual context of words and phrases. More specifically, we randomly substitute source phrases with target translations to create code-switched sentences. Then, we use these code-switched data to train the ALM model to learn to predict words of different languages. We evaluate our pre-trained ALM on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform previous pre-training methods on three benchmarks.
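The code-switching construction described above can be illustrated with a toy phrase-substitution routine: aligned source phrases are randomly replaced by their target-language translations before the masked-prediction training step. The phrase table, substitution probability, and span lengths below are illustrative assumptions, not the paper's data pipeline.

```python
import random

def code_switch(tokens, phrase_table, p=0.3):
    """Replace some source phrases with their target-language translations.

    tokens: list of source-language tokens.
    phrase_table: dict mapping a source phrase (tuple of tokens) to a list of
    target-language tokens; both it and the probability p are assumptions.
    """
    out, i = [], 0
    while i < len(tokens):
        replaced = False
        for span in (3, 2, 1):                       # try longer phrases first
            phrase = tuple(tokens[i:i + span])
            if phrase in phrase_table and random.random() < p:
                out.extend(phrase_table[phrase])     # switch to target language
                i += span
                replaced = True
                break
        if not replaced:
            out.append(tokens[i])
            i += 1
    return out

# e.g. code_switch("we like the music".split(), {("the", "music"): ["die", "Musik"]})
```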

AAAI Conference 2020 Conference Paper

Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

  • Hanyu Xuan
  • Zhenyu Zhang
  • Shuo Chen
  • Jian Yang
  • Yan Yan

In human multi-modality perception systems, the benefits of integrating auditory and visual information are extensive, as they provide plenty of supplementary cues for understanding events. Despite some recent methods proposed for this application, they cannot deal with practical conditions involving temporal inconsistency. Inspired by the human system, which puts different focuses on specific locations, time segments and media while performing multi-modality perception, we provide an attention-based method to simulate this process. Similar to the human mechanism, our network can adaptively select “where” to attend, “when” to attend and “which” to attend for audio-visual event localization. In this way, even with large temporal inconsistency between vision and audio, our network is able to adaptively trade information between different modalities and successfully achieve event localization. Our method achieves state-of-the-art performance on the AVE (Audio-Visual Event) dataset collected in real life. In addition, we also systematically investigate audio-visual event localization tasks. The visualization results also help us better understand how our model works.

AAAI Conference 2020 Conference Paper

Deep Discriminative CNN with Temporal Ensembling for Ambiguously-Labeled Image Classification

  • Yao Yao
  • Jiehui Deng
  • Xiuhua Chen
  • Chen Gong
  • Jianxin Wu
  • Jian Yang

In this paper, we study the problem of image classification where training images are ambiguously annotated with multiple candidate labels, among which only one is correct but is not accessible during the training phase. Due to the adopted non-deep frameworks and improper disambiguation strategies, traditional approaches usually fall short in representation ability and discrimination ability, so their performance remains to be improved. To remedy these two shortcomings, this paper proposes a novel approach termed “Deep Discriminative CNN” (D2CNN) with temporal ensembling. Specifically, to improve the representation ability, we innovatively employ deep convolutional neural networks for ambiguously-labeled image classification, in which the well-known ResNet is adopted as our backbone. To enhance the discrimination ability, we design an entropy-based regularizer to maximize the margin between the potentially correct label and the unlikely ones of each image. In addition, we utilize the temporally ensembled predictions of different epochs to guide the training process so that the latent ground-truth label can be confidently highlighted. This is much superior to the traditional disambiguation operations which treat all candidate labels equally and identify the hidden ground-truth label via some heuristic ways. Thorough experimental results on multiple datasets firmly demonstrate the effectiveness of our proposed D2CNN when compared with other existing state-of-the-art approaches.

IJCAI Conference 2020 Conference Paper

Deep Learning for Community Detection: Progress, Challenges and Opportunities

  • Fanzhen Liu
  • Shan Xue
  • Jia Wu
  • Chuan Zhou
  • Wenbin Hu
  • Cecile Paris
  • Surya Nepal
  • Jian Yang

As communities represent similar opinions, similar functions, similar purposes, etc., community detection is an important and extremely useful tool in both scientific inquiry and data analytics. However, the classic methods of community detection, such as spectral clustering and statistical inference, are falling by the wayside as deep learning techniques demonstrate an increasing capacity to handle high-dimensional graph data with impressive performance. Thus, a survey of current progress in community detection through deep learning is timely. Structured into three broad research streams in this domain – deep neural networks, deep graph embedding, and graph neural networks – this article summarizes the contributions of the various frameworks, models, and algorithms in each stream, along with the current challenges that remain unsolved and the future research opportunities yet to be explored.

TIST Journal 2020 Journal Article

From Appearance to Essence

  • Xiu Susie Fang
  • Quan Z. Sheng
  • Xianzhi Wang
  • Wei Emma Zhang
  • Anne H. H. Ngu
  • Jian Yang

Truth discovery has been widely studied in recent years as a fundamental means for resolving the conflicts in multi-source data. Although many truth discovery methods have been proposed based on different considerations and intuitions, investigations show that no single method consistently outperforms the others. To select the right truth discovery method for a specific application scenario, it becomes essential to evaluate and compare the performance of different methods. A drawback of current research efforts is that they commonly assume the availability of certain ground truth for the evaluation of methods. However, the ground truth may be very limited or even impossible to obtain, rendering the evaluation biased. In this article, we present CompTruthHyp, a generic approach for comparing the performance of truth discovery methods without using ground truth. In particular, our approach calculates the probability of observations in a dataset based on the output of different methods. The probability is then ranked to reflect the performance of these methods. We review and compare 12 representative truth discovery methods and consider both single-valued and multi-valued objects. The empirical studies on both real-world and synthetic datasets demonstrate the effectiveness of our approach for comparing truth discovery methods.

NeurIPS Conference 2020 Conference Paper

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

  • Xiang Li
  • Wenhai Wang
  • Lijun Wu
  • Shuo Chen
  • Xiaolin Hu
  • Jun Li
  • Jinhui Tang
  • Jian Yang

One-stage detectors basically formulate object detection as dense classification and localization (i.e., bounding box regression). The classification is usually optimized by Focal Loss and the box location is commonly learned under a Dirac delta distribution. A recent trend for one-stage detectors is to introduce an individual prediction branch to estimate the quality of localization, where the predicted quality facilitates the classification to improve detection performance. This paper delves into the representations of the above three fundamental elements: quality estimation, classification and localization. Two problems are discovered in existing practices, including (1) the inconsistent usage of the quality estimation and classification between training and inference, and (2) the inflexible Dirac delta distribution for localization. To address these problems, we design new representations for these elements. Specifically, we merge the quality estimation into the class prediction vector to form a joint representation, and use a vector to represent the arbitrary distribution of box locations. The improved representations eliminate the inconsistency risk and accurately depict the flexible distribution in real data, but contain continuous labels, which is beyond the scope of Focal Loss. We then propose Generalized Focal Loss (GFL) that generalizes Focal Loss from its discrete form to the continuous version for successful optimization. On COCO test-dev, GFL achieves 45.0% AP using a ResNet-101 backbone, surpassing state-of-the-art SAPD (43.5%) and ATSS (43.6%) with higher or comparable inference speed.
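
As a rough illustration of generalizing a focal-style loss from discrete to continuous labels, the sketch below modulates a binary cross-entropy against soft quality targets by the distance between the prediction and the target. The tensor shapes and the focusing parameter beta are illustrative; the paper's exact formulation may differ.

import torch
import torch.nn.functional as F

def quality_focal_loss(logits, targets, beta=2.0):
    """Focal-style loss generalized to continuous quality targets in [0, 1].

    logits  : raw class scores (any shape)
    targets : continuous labels of the same shape, e.g. IoU-based quality
    beta    : focusing parameter; beta=2 mirrors the usual focal shape
    """
    probs = torch.sigmoid(logits)
    # standard binary cross-entropy against the soft target ...
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # ... modulated by the distance between prediction and soft target
    modulator = (probs - targets).abs().pow(beta)
    return (modulator * ce).mean()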

AAAI Conference 2020 Conference Paper

Hierarchical Online Instance Matching for Person Search

  • Di Chen
  • Shanshan Zhang
  • Wanli Ouyang
  • Jian Yang
  • Bernt Schiele

Person Search is a challenging task which requires retrieving a person’s image and the corresponding position from an image dataset. It consists of two sub-tasks: pedestrian detection and person re-identification (re-ID). One of the key challenges is to properly combine the two sub-tasks into a unified framework. Existing works usually adopt a straightforward strategy by concatenating a detector and a re-ID model directly, either into an integrated model or into separated models. We argue that simply concatenating detection and re-ID is a sub-optimal solution, and we propose a Hierarchical Online Instance Matching (HOIM) loss which exploits the hierarchical relationship between detection and re-ID to guide the learning of our network. Our novel HOIM loss function harmonizes the objectives of the two sub-tasks and encourages better feature learning. In addition, we improve the loss update policy by introducing Selective Memory Refreshment (SMR) for unlabeled persons, which takes advantage of the potential discrimination power of unlabeled data. From the experiments on two standard person search benchmarks, i.e., CUHK-SYSU and PRW, we achieve state-of-the-art performance, which justifies the effectiveness of our proposed HOIM loss on learning robust features.

AAAI Conference 2020 Conference Paper

Image Formation Model Guided Deep Image Super-Resolution

  • Jinshan Pan
  • Yang Liu
  • Deqing Sun
  • Jimmy Ren
  • Ming-Ming Cheng
  • Jian Yang
  • Jinhui Tang

We present a simple and effective image super-resolution algorithm that imposes an image formation constraint on the deep neural networks via pixel substitution. The proposed algorithm first uses a deep neural network to estimate intermediate high-resolution images, blurs the intermediate images using known blur kernels, and then substitutes values of the pixels at the un-decimated positions with those of the corresponding pixels from the low-resolution images. The output of the pixel substitution process strictly satisfies the image formation model and is further refined by the same deep neural network in a cascaded manner. The proposed framework is trained in an end-to-end fashion and can work with existing feed-forward deep neural networks for super-resolution and converges fast in practice. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods.
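
A minimal sketch of the pixel-substitution step described above, assuming a single known blur kernel, an odd kernel size, and an integer scale factor that divides the image size; the function and argument names are illustrative.

import torch
import torch.nn.functional as F

def pixel_substitution(hr_est, lr, blur_kernel, scale):
    """Enforce the image formation model by pixel substitution (sketch).

    hr_est      : network estimate of the high-resolution image, (B, C, H, W)
    lr          : observed low-resolution image, (B, C, H//scale, W//scale)
    blur_kernel : known blur kernel, (1, 1, k, k) with odd k, shared by channels
    scale       : integer downsampling factor (assumed to divide H and W)
    """
    c = hr_est.shape[1]
    k = blur_kernel.shape[-1]
    # blur the intermediate estimate with the known kernel (depthwise)
    weight = blur_kernel.repeat(c, 1, 1, 1)
    blurred = F.conv2d(hr_est, weight, padding=k // 2, groups=c)
    # overwrite the un-decimated positions with the observed LR pixels
    blurred[:, :, ::scale, ::scale] = lr
    return blurred  # fed back to the same network for further refinement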

IS Journal 2020 Journal Article

MGNN: Mutualistic Graph Neural Network for Joint Friend and Item Recommendation

  • Yang Xiao
  • Lina Yao
  • Qingqi Pei
  • Xianzhi Wang
  • Jian Yang
  • Quan Z. Sheng

Many social studies and practical cases suggest that people's consumption behaviors and social behaviors are not isolated but interrelated in social network services. However, most existing research either predicts users’ consumption preferences or recommends friends to users without dealing with them simultaneously. We propose a holistic approach to predict users’ preferences on friends and items jointly and thereby make better recommendations. To this end, we design a graph neural network that incorporates a mutualistic mechanism to model the mutual reinforcement relationship between users’ consumption behaviors and social behaviors. Our experiments on two real-world datasets demonstrate the effectiveness of our approach in both social recommendation and link prediction.

IJCAI Conference 2020 Conference Paper

Online Positive and Unlabeled Learning

  • Chuang Zhang
  • Chen Gong
  • Tengfei Liu
  • Xun Lu
  • Weiqiang Wang
  • Jian Yang

Positive and Unlabeled learning (PU learning) aims to build a binary classifier where only positive and unlabeled data are available for classifier training. However, existing PU learning methods all work in a batch learning mode, which cannot deal with online learning scenarios with sequential data. Therefore, this paper proposes a novel positive and unlabeled learning algorithm in an online training mode, which trains a classifier solely on the positive and unlabeled data arriving in sequential order. Specifically, we adopt an unbiased estimate of the loss induced by the arriving positive or unlabeled examples at each time step. Then we show that for any incoming new single datum, the model can be updated independently and incrementally by a gradient-based online learning method. Furthermore, we extend our method to tackle the cases where more than one example is received at each time step. Theoretically, we show that the proposed online PU learning method achieves low regret even though it receives sequential positive and unlabeled data. Empirically, we conduct intensive experiments on both benchmark and real-world datasets, and the results clearly demonstrate the effectiveness of the proposed method.
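
The sketch below illustrates one online update using the standard unbiased PU risk estimator for a linear scorer with logistic loss; the estimator and learning-rate schedule used in the paper may differ, and the class prior is assumed to be known.

import numpy as np

def online_pu_step(w, x, is_positive, prior, lr=0.1):
    """One online gradient step on an unbiased PU risk estimate (sketch).

    w           : current weight vector of a linear scorer f(x) = w.x
    x           : incoming feature vector
    is_positive : True if the example arrives with a positive label
    prior       : assumed class prior pi_p = P(y = +1)
    Uses the logistic loss l(z, y) = log(1 + exp(-y z)).
    """
    z = w @ x

    def grad_logistic(y):            # gradient of log(1 + exp(-y z)) w.r.t. w
        return (-y / (1.0 + np.exp(y * z))) * x

    if is_positive:
        # a labeled positive contributes pi_p * [l(z, +1) - l(z, -1)]
        g = prior * (grad_logistic(+1) - grad_logistic(-1))
    else:
        # an unlabeled example contributes l(z, -1)
        g = grad_logistic(-1)
    return w - lr * g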

JMLR Journal 2020 Journal Article

Provable Convex Co-clustering of Tensors

  • Eric C. Chi
  • Brian J. Gaines
  • Will Wei Sun
  • Hua Zhou
  • Jian Yang

Cluster analysis is a fundamental tool for pattern discovery in complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.

AAAI Conference 2020 Conference Paper

Understanding the Disharmony between Weight Normalization Family and Weight Decay

  • Xiang Li
  • Shuo Chen
  • Jian Yang

The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that changes the weight $W$ to $\hat{W} = W / \|W\|$, which makes $\hat{W}$ independent of the magnitude of $W$. Surprisingly, $W$ must still be decayed during gradient descent, otherwise we observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks from over-fitting. Moreover, if we substitute (e.g., for weight normalization) $\hat{W} = W / \|W\|$ into the original loss function $\sum_i L(f(x_i; \hat{W}), y_i) + \frac{1}{2}\lambda\|\hat{W}\|^2$, we observe that the regularization term $\frac{1}{2}\lambda\|\hat{W}\|^2$ reduces to the constant $\frac{1}{2}\lambda$ in the optimization objective. Therefore, to decay $W$, we need to explicitly append the term $\frac{1}{2}\lambda\|W\|^2$. In this paper, we theoretically prove that $\frac{1}{2}\lambda\|W\|^2$ improves optimization only by modulating the effective learning rate and has virtually no influence on generalization when the weight normalization family is employed. Furthermore, we also expose several serious problems that arise when introducing the weight decay term to the weight normalization family, including the missing global minimum, training instability and sensitivity to initialization. To address these problems, we propose an Adaptive Weight Shrink (AWS) scheme, which gradually shrinks the weights during optimization by a dynamic coefficient proportional to the magnitude of the parameter. This simple yet effective method appropriately controls the effective learning rate, which significantly improves training stability and makes optimization more robust to initialization.
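
Purely as an illustration of the shrinking idea described above, the sketch below scales each weight tensor down by a coefficient proportional to its own norm; the exact schedule, coefficient and capping used by AWS in the paper may differ.

import torch

def adaptive_weight_shrink(params, base_coeff=1e-4):
    """Illustrative shrink step: scale each weight tensor down by a coefficient
    proportional to its norm (a stand-in for AWS; the paper's exact rule may
    differ). Intended to be called once per optimization step after the
    gradient update."""
    with torch.no_grad():
        for p in params:
            shrink = base_coeff * p.norm().item()
            p.mul_(1.0 - min(shrink, 0.5))  # cap the shrink factor for stability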

AAAI Conference 2020 Conference Paper

Variational Pathway Reasoning for EEG Emotion Recognition

  • Tong Zhang
  • Zhen Cui
  • Chunyan Xu
  • Wenming Zheng
  • Jian Yang

Research on human emotion cognition revealed that connections and pathways exist between spatially-adjacent and functional-related areas during emotion expression (Adolphs 2002a; Bullmore and Sporns 2009). Deeply inspired by this mechanism, we propose a heuristic Variational Pathway Reasoning (VPR) method to deal with EEG-based emotion recognition. We introduce random walk to generate a large number of candidate pathways along electrodes. To encode each pathway, the dynamic sequence model is further used to learn between-electrode dependencies. The encoded pathways around each electrode are aggregated to produce a pseudo maximum-energy pathway, which consists of the most important pair-wise connections. To find those most salient connections, we propose a sparse variational scaling (SVS) module to learn scaling factors of pseudo pathways by using the Bayesian probabilistic process and sparsity constraint, where the former endows good generalization ability while the latter favors adaptive pathway selection. Finally, the salient pathways from those candidates are jointly decided by the pseudo pathways and scaling factors. Extensive experiments on EEG emotion recognition demonstrate that the proposed VPR is superior to those state-of-the-art methods, and could find some interesting pathways w.r.t. different emotions.

IJCAI Conference 2019 Conference Paper

CoTrRank: Trust Evaluation of Users and Tweets

  • Peiyao Li
  • Weiliang Zhao
  • Jian Yang
  • Jia Wu

Trust evaluation of people and information on Twitter is critical for maintaining a healthy online social environment. How to evaluate the trustworthiness of users and tweets becomes a challenging question. In this demo, we show how our proposed CoTrRank approach deals with this problem. This approach models users and tweets in two coupled networks and calculates their trust values in different trust spaces. In particular, our solution provides a configurable way of mapping the calculated raw evidence to trust values. The CoTrRank demo system has an interactive interface to show how our proposed approach produces more effective and adaptive trust evaluation results compared with baseline methods.

NeurIPS Conference 2019 Conference Paper

Curvilinear Distance Metric Learning

  • Shuo Chen
  • Lei Luo
  • Jian Yang
  • Chen Gong
  • Jun Li
  • Heng Huang

Distance Metric Learning aims to learn an appropriate metric that faithfully measures the distance between two data points. Traditional metric learning methods usually calculate the pairwise distance with fixed distance functions (e.g., Euclidean distance) in the projected feature spaces. However, they fail to learn the underlying geometries of the sample space, and thus cannot exactly predict the intrinsic distances between data points. To address this issue, we first reveal that the traditional linear distance metric is equivalent to the cumulative arc length between the data pair's nearest points on the learned straight measurer lines. After that, by extending such straight lines to general curved forms, we propose a Curvilinear Distance Metric Learning (CDML) method, which adaptively learns the nonlinear geometries of the training data. By virtue of the Weierstrass theorem, the proposed CDML is equivalently parameterized with a third-order tensor, and the optimization algorithm is designed to learn the tensor parameter. Theoretical analysis is derived to guarantee the effectiveness and soundness of CDML. Extensive experiments on synthetic and real-world datasets validate the superiority of our method over state-of-the-art metric learning models.

AAAI Conference 2019 Conference Paper

Data-Adaptive Metric Learning with Scale Alignment

  • Shuo Chen
  • Chen Gong
  • Jian Yang
  • Ying Tai
  • Le Hui
  • Jun Li

The central problem for most existing metric learning methods is to find a suitable projection matrix on the differences of all pairs of data points. However, a single unified projection matrix can hardly characterize all data similarities accurately, as practical data are usually very complicated, and simply adopting one global projection matrix might ignore important local patterns hidden in the dataset. To address this issue, this paper proposes a novel method dubbed “Data-Adaptive Metric Learning” (DAML), which constructs a data-adaptive projection matrix for each data pair by selectively combining a set of learned candidate matrices. As a result, every data pair can obtain a specific projection matrix, enabling the proposed DAML to flexibly fit the training data and produce discriminative projection results. The model of DAML is formulated as an optimization problem which jointly learns candidate projection matrices and their sparse combination for every data pair. Nevertheless, the over-fitting problem may occur due to the large number of parameters to be learned. To tackle this issue, we adopt the Total Variation (TV) regularizer to align the scales of the data embeddings produced by all candidate projection matrices, so that the generated metrics of these learned candidates are generally comparable. Furthermore, we extend the basic linear DAML model to the kernelized version (denoted “KDAML”) to handle non-linear cases, and the Iterative Shrinkage-Thresholding Algorithm (ISTA) is employed to solve the optimization model. Intensive experimental results on various applications including retrieval, classification, and verification clearly demonstrate the superiority of our algorithm to other state-of-the-art metric learning methodologies.

AAAI Conference 2019 Conference Paper

Gaussian-Induced Convolution for Graphs

  • Jiatao Jiang
  • Zhen Cui
  • Chunyan Xu
  • Jian Yang

Learning representations on graphs plays a crucial role in numerous tasks of pattern recognition. However, different from grid-shaped images/videos, on which local convolution kernels can be defined as lattices, graphs are fully coordinate-free on vertices and edges. In this work, we propose a Gaussian-induced convolution (GIC) framework to conduct local convolution filtering on irregular graphs. Specifically, an edge-induced Gaussian mixture model is designed to encode variations of a subgraph region by integrating edge information into weighted Gaussian models, each of which implicitly characterizes one component of subgraph variations. In order to coarsen a graph, we derive a vertex-induced Gaussian mixture model to cluster vertices dynamically according to the connection of edges, which is approximately equivalent to the weighted graph cut. We conduct our multi-layer graph convolution network on several public datasets of graph classification. The extensive experiments demonstrate that our GIC is effective and can achieve state-of-the-art results.

AAAI Conference 2019 Conference Paper

Inter-Class Angular Loss for Convolutional Neural Networks

  • Le Hui
  • Xiang Li
  • Chen Gong
  • Meng Fang
  • Joey Tianyi Zhou
  • Jian Yang

Convolutional Neural Networks (CNNs) have shown great power in various classification tasks and have achieved remarkable results in practical applications. However, the distinct learning difficulties in discriminating different pairs of classes are largely ignored by the existing networks. For instance, in the CIFAR-10 dataset, distinguishing cats from dogs is usually harder than distinguishing horses from ships. By carefully studying the behavior of CNN models in the training process, we observe that the confusion level of two classes is strongly correlated with their angular separability in the feature space. That is, the larger the inter-class angle is, the lower the confusion will be. Based on this observation, we propose a novel loss function dubbed “Inter-Class Angular Loss” (ICAL), which explicitly models the class correlation and can be directly applied to many existing deep networks. By minimizing the proposed ICAL, the networks can effectively discriminate the examples in similar classes by enlarging the angle between their corresponding class vectors. Thorough experimental results on a series of vision and non-vision datasets confirm that ICAL critically improves the discriminative ability of various representative deep neural networks and generates superior performance to the original networks with conventional softmax loss.
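
As an illustration of encouraging large inter-class angles, the sketch below penalizes the pairwise cosine similarity between the class vectors of a network's final linear layer; this simple penalty is only in the spirit of ICAL and is not the paper's exact loss.

import torch
import torch.nn.functional as F

def inter_class_angular_penalty(class_weights):
    """Mean pairwise cosine similarity between distinct class vectors
    (e.g. rows of the final linear layer), to be minimized together with the
    usual softmax loss so that inter-class angles grow. Illustrative only."""
    w = F.normalize(class_weights, dim=1)           # unit-length class vectors
    cos = w @ w.t()                                 # pairwise cosine similarities
    n = w.shape[0]
    off_diag = cos - torch.eye(n, device=w.device)  # drop the diagonal (self-similarity)
    return off_diag.clamp(min=0).sum() / (n * (n - 1))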

TIST Journal 2019 Journal Article

Multi-Modal Curriculum Learning over Graphs

  • Chen Gong
  • Jian Yang
  • Dacheng Tao

Curriculum Learning (CL) is a recently proposed learning paradigm that aims to achieve satisfactory performance by properly organizing the learning sequence from simple curriculum examples to more difficult ones. Up to now, few works have been done to explore CL for the data with graph structure. Therefore, this article proposes a novel CL algorithm that can be utilized to guide the Label Propagation (LP) over graphs, of which the target is to “learn” the labels of unlabeled examples on the graphs. Specifically, we assume that different unlabeled examples have different levels of difficulty for propagation, and their label learning should follow a simple-to-difficult sequence with the updated curricula. Furthermore, considering that the practical data are often characterized by multiple modalities, every modality in our method is associated with a “teacher” that not only evaluates the difficulties of examples from its own viewpoint, but also cooperates with other teachers to generate the overall simplest curriculum examples for propagation. By taking the curriculums suggested by the teachers as a whole, the common preference (i.e., commonality) of teachers on selecting the simplest examples can be discovered by a row-sparse matrix, and their distinct opinions (i.e., individuality) are captured by a sparse noise matrix. As a result, an accurate curriculum sequence can be established and the propagation quality can thus be improved. Theoretically, we prove that the propagation risk bound is closely related to the examples’ difficulty information, and empirically, we show that our method can generate higher accuracy than the state-of-the-art CL approach and LP algorithms on various multi-modal tasks.

TIST Journal 2019 Journal Article

Multi-View Fusion with Extreme Learning Machine for Clustering

  • Yongshan Zhang
  • Jia Wu
  • Chuan Zhou
  • Zhihua Cai
  • Jian Yang
  • Philip S. Yu

Unlabeled, multi-view data presents a considerable challenge in many real-world data analysis tasks. These data are worth exploring because they often contain complementary information that improves the quality of the analysis results. Clustering with multi-view data is a particularly challenging problem as revealing the complex data structures between many feature spaces demands discriminative features that are specific to the task and, when too few of these features are present, performance suffers. Extreme learning machines (ELMs) are an emerging form of learning model that have shown an outstanding representation ability and superior performance in a range of different learning tasks. Motivated by the promise of this advancement, we have developed a novel multi-view fusion clustering framework based on an ELM, called MVEC. MVEC learns the embeddings from each view of the data via the ELM network, then constructs a single unified embedding according to the correlations and dependencies between the embeddings, automatically weighting the contribution of each. This process exposes the underlying clustering structures embedded within multi-view data with a high degree of accuracy. A simple yet efficient solution is also provided to solve the optimization problem within MVEC. Experiments and comparisons on eight different benchmarks from different domains confirm MVEC’s clustering accuracy.

JBHI Journal 2019 Journal Article

Patch-Based Adaptive Background Subtraction for Vascular Enhancement in X-Ray Cineangiograms

  • Shuang Song
  • Alejandro F. Frangi
  • Jian Yang
  • Danni Ai
  • Chenbing Du
  • Yong Huang
  • Hong Song
  • Luosha Zhang

Objective: Automatic vascular enhancement in X-ray cineangiography is of crucial interest, for instance, for better visualizing and quantifying coronary arteries in diagnostic and interventional procedures. Methods: A novel patch-based adaptive background subtraction method (PABSM) is proposed to automatically enhance vessels in coronary X-ray cineangiography. First, pixels in the cineangiogram are described by vesselness and Gabor features. Second, a classifier is utilized to separate the cineangiogram into rough vascular and non-vascular regions. Dilation is applied to the classified binary image to include more of the vascular region. Third, a patch-based background synthesis is utilized to fill the removed vascular region. Results: A database containing 320 cineangiograms of 175 patients was collected, and an interventional cardiologist annotated all vascular structures. The performance of PABSM is compared with six state-of-the-art vascular enhancement methods regarding the precision–recall curve and C-value. The area under the precision–recall curve is 0.7133, and the C-value is 0.9659. Conclusion: PABSM can automatically enhance the coronary artery in cineangiograms. It preserves the integrity of vascular topological structures, particularly in complex vascular regions, and removes noise caused by the non-uniform gray-level distribution in the cineangiogram. Significance: PABSM can avoid motion artifacts and it eases the subsequent vascular segmentation, which is crucial for the diagnosis and interventional procedures of coronary artery diseases.

IJCAI Conference 2019 Conference Paper

Positive and Unlabeled Learning with Label Disambiguation

  • Chuang Zhang
  • Dexin Ren
  • Tongliang Liu
  • Jian Yang
  • Chen Gong

Positive and Unlabeled (PU) learning aims to learn a binary classifier from only positive and unlabeled training data. The state-of-the-art methods usually formulate PU learning as a cost-sensitive learning problem, in which every unlabeled example is simultaneously treated as positive and negative with different class weights. However, the ground-truth label of an unlabeled example should be unique, so the existing models inadvertently introduce label noise which may lead to a biased classifier and deteriorated performance. To solve this problem, this paper proposes a novel algorithm dubbed "Positive and Unlabeled learning with Label Disambiguation" (PULD). We first regard all the unlabeled examples in PU learning as ambiguously labeled as positive and negative, and then employ the margin-based label disambiguation strategy, which enlarges the margin of classifier response between the most likely label and the less likely one, to find the unique ground-truth label of each unlabeled example. Theoretically, we derive the generalization error bound of the proposed method by analyzing its Rademacher complexity. Experimentally, we conduct intensive experiments on both benchmark and real-world datasets, and the results clearly demonstrate the superiority of the proposed PULD to the existing PU learning approaches.

IJCAI Conference 2018 Conference Paper

Adversarial Metric Learning

  • Shuo Chen
  • Chen Gong
  • Jian Yang
  • Xiang Li
  • Yang Wei
  • Jun Li

In the past decades, intensive efforts have been put into designing various loss functions and metric forms for the metric learning problem. These improvements have shown promising results when the test data are similar to the training data. However, the trained models often fail to produce reliable distances on ambiguous test pairs due to the different samplings of the training set and test set. To address this problem, Adversarial Metric Learning (AML) is proposed in this paper, which automatically generates adversarial pairs to remedy the sampling bias and facilitate robust metric learning. Specifically, AML consists of two adversarial stages, i.e., confusion and distinguishment. In the confusion stage, ambiguous but critical adversarial data pairs are adaptively generated to mislead the learned metric. In the distinguishment stage, a metric is exhaustively learned to try its best to distinguish both the adversarial pairs and the original training pairs. Thanks to the challenges posed by the confusion stage in this competing process, the AML model is able to grasp plentiful difficult knowledge that is not contained in the original training pairs, so the discriminability of AML can be significantly improved. The entire model is formulated into an optimization framework, of which the global convergence is theoretically proved. The experimental results on toy data and practical datasets clearly demonstrate the superiority of AML over representative state-of-the-art metric learning models.

IJCAI Conference 2018 Conference Paper

Mixed Link Networks

  • Wenhai Wang
  • Xiang Li
  • Tong Lu
  • Jian Yang

On the basis of an analysis revealing the equivalence of modern networks, we find that both ResNet and DenseNet are essentially derived from the same "dense topology", yet they only differ in the form of connection: addition (dubbed "inner link") vs. concatenation (dubbed "outer link"). However, both forms of connection have their own strengths and shortcomings. To combine their advantages and avoid certain limitations on representation learning, we present a highly efficient and modularized Mixed Link Network (MixNet) which is equipped with flexible inner link and outer link modules. Consequently, ResNet, DenseNet and Dual Path Network (DPN) can each be regarded as a special case of MixNet. Furthermore, we demonstrate that MixNets can achieve superior parameter efficiency over the state-of-the-art architectures on many competitive datasets like CIFAR-10/100, SVHN and ImageNet.
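
A toy sketch of mixing an additive inner link with a concatenative outer link in one block is given below; the channel sizes and the single convolution per link are illustrative simplifications, not the paper's architecture.

import torch
import torch.nn as nn

class MixedLinkBlock(nn.Module):
    """Toy block combining an additive "inner link" (ResNet-style) with a
    concatenative "outer link" (DenseNet-style). Requires inner_ch <= in_ch."""
    def __init__(self, in_ch, inner_ch, outer_ch):
        super().__init__()
        self.inner = nn.Conv2d(in_ch, inner_ch, 3, padding=1)  # features added back
        self.outer = nn.Conv2d(in_ch, outer_ch, 3, padding=1)  # features concatenated

    def forward(self, x):
        inner = self.inner(x)
        # inner link: add onto the matching leading channels of the input
        mixed = torch.cat([x[:, :inner.shape[1]] + inner, x[:, inner.shape[1]:]], dim=1)
        # outer link: append new feature maps, giving in_ch + outer_ch channels
        return torch.cat([mixed, self.outer(x)], dim=1)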

IJCAI Conference 2018 Conference Paper

Nonrigid Points Alignment with Soft-weighted Selection

  • Xuelong Li
  • Jian Yang
  • Qi Wang

Point set registration (PSR) is a crucial problem in computer vision and pattern recognition. Existing PSR methods cannot align point sets robustly under degradations such as deformation, noise, occlusion, outliers, and multi-view changes. In this paper, we present a self-selected regularized Gaussian fields criterion for nonrigid point matching. Unlike most existing methods, we formulate the registration problem as a sparse approximation task with a low-rank constraint in a reproducing kernel Hilbert space (RKHS). A self-selected mechanism is used to dynamically assign a real-valued label to each point in an accuracy-aware weighting manner, which makes the model focus more on the reliably positioned points. Based on the labels, an equivalent matching number optimization is embedded into the non-rigid criterion to enhance the reliability of the approximation. Experimental results show that the proposed method achieves better results in both registration accuracy and correct matches compared to state-of-the-art approaches.

IJCAI Conference 2018 Conference Paper

Positive and Unlabeled Learning via Loss Decomposition and Centroid Estimation

  • Hong Shi
  • Shaojun Pan
  • Jian Yang
  • Chen Gong

Positive and Unlabeled learning (PU learning) aims to train a binary classifier based on only positive and unlabeled examples, where the unlabeled examples could be either positive or negative. The state-of-the-art algorithms usually cast PU learning as a cost-sensitive learning problem and impose distinct weights to different training examples via a manual or automatic way. However, such weight adjustment or estimation can be inaccurate and thus often lead to unsatisfactory performance. Therefore, this paper regards all unlabeled examples as negative, which means that some of the original positive data are mistakenly labeled as negative. By doing so, we convert PU learning into the risk minimization problem in the presence of false negative label noise, and propose a novel PU learning algorithm termed "Loss Decomposition and Centroid Estimation" (LDCE). By decomposing the hinge loss function into two parts, we show that only the second part is influenced by label noise, of which the adverse effect can be reduced by estimating the centroid of negative examples. We intensively validate our approach on a synthetic dataset, UCI benchmark datasets and real-world datasets, and the experimental results firmly demonstrate the effectiveness of our approach when compared with other state-of-the-art PU learning methodologies.

AAAI Conference 2018 Conference Paper

Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition

  • Chaolong Li
  • Zhen Cui
  • Wenming Zheng
  • Chunyan Xu
  • Jian Yang

Variations of human body skeletons may be considered as dynamic graphs, which are a generic data representation for numerous real-world applications. In this paper, we propose a spatio-temporal graph convolution (STGC) approach that assembles the successes of local convolutional filtering and the sequence learning ability of autoregressive moving average models. To encode dynamic graphs, the constructed multi-scale local graph convolution filters, consisting of matrices of local receptive fields and signal mappings, are recursively performed on structured graph data of the temporal and spatial domains. The proposed model is generic and principled, as it can be generalized into other dynamic models. We theoretically prove the stability of STGC and provide an upper bound on the signal transformation to be learnt. Further, the proposed recursive model can be stacked into a multi-layer architecture. To evaluate our model, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D. The experimental results demonstrate the effectiveness of our proposed model and the improvement over the state-of-the-art.

IJCAI Conference 2018 Conference Paper

Teaching Semi-Supervised Classifier via Generalized Distillation

  • Chen Gong
  • Xiaojun Chang
  • Meng Fang
  • Jian Yang

Semi-Supervised Learning (SSL) is able to build a reliable classifier with very scarce labeled examples by properly utilizing the abundant unlabeled examples. However, existing SSL algorithms often yield unsatisfactory performance due to the lack of supervision information. To address this issue, this paper formulates SSL as a Generalized Distillation (GD) problem, which treats an existing SSL algorithm as a learner and introduces a teacher to guide the learner's training process. Specifically, the intelligent teacher holds privileged knowledge that "explains" the training data but remains unknown to the learner, and the teacher should convey its rich knowledge to the imperfect learner through a specific teaching function. After that, the learner gains knowledge by "imitating" the output of the teaching function under an optimization framework. Therefore, the learner in our algorithm learns from both the teacher and the training data, so its output can be substantially distilled and enhanced. By deriving the Rademacher complexity and error bounds of the proposed algorithm, the usefulness of the introduced teacher is theoretically demonstrated. The superiority of our algorithm to the related state-of-the-art methods has also been empirically demonstrated by experiments on different datasets with various sources of privileged knowledge.

IJCAI Conference 2017 Conference Paper

Importance-Aware Semantic Segmentation for Autonomous Driving System

  • Bi-ke Chen
  • Chen Gong
  • Jian Yang

Semantic Segmentation (SS) partitions an image into several coherent, semantically meaningful parts, and classifies each part into one of the pre-determined classes. In this paper, we argue that existing SS methods cannot be reliably applied to autonomous driving systems as they ignore the different importance levels of distinct classes for safe driving. For example, pedestrians in the scene are much more important than the sky when driving a car, so their segmentation should be as accurate as possible. To incorporate the importance information possessed by various object classes, this paper designs an "Importance-Aware Loss" (IAL) that specifically emphasizes the critical objects for autonomous driving. IAL operates under a hierarchical structure, and the classes with different importance are located in different levels so that they are assigned distinct weights. Furthermore, we derive the forward and backward propagation rules for IAL and apply them to deep neural networks for realizing SS in intelligent driving systems. The experiments on the CamVid and Cityscapes datasets reveal that by employing the proposed loss function, existing deep learning models including FCN, SegNet and ENet are able to consistently obtain improved segmentation results on the pre-defined important classes for safe driving.
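
As a simplified illustration, the sketch below applies a flat per-class importance weighting inside a cross-entropy loss for segmentation; the hierarchical weighting structure of IAL described above is not reproduced here.

import torch
import torch.nn.functional as F

def importance_weighted_ce(logits, labels, class_importance):
    """Illustrative importance-weighted cross-entropy for semantic segmentation.

    logits           : (B, C, H, W) raw scores
    labels           : (B, H, W) integer class labels
    class_importance : length-C float tensor of importance weights, e.g. larger
                       for pedestrians than for sky (values are assumptions)
    """
    return F.cross_entropy(logits, labels, weight=class_importance)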

NeurIPS Conference 2016 Conference Paper

Large Margin Discriminant Dimensionality Reduction in Prediction Space

  • Mohammad Saberian
  • Jose Costa Pereira
  • Can Xu
  • Jian Yang
  • Nuno Nvasconcelos

In this paper we establish a duality between boosting and SVM, and use it to derive a novel discriminant dimensionality reduction algorithm. In particular, using the multiclass formulations of boosting and SVM, we note that both use a combination of mapping and linear classification to maximize the multiclass margin. In SVM this is implemented using a pre-defined mapping (induced by the kernel) and optimizing the linear classifiers. In boosting the linear classifiers are pre-defined and the mapping (predictor) is learned through a combination of weak learners. We argue that the intermediate mapping, e.g., the boosting predictor, preserves the discriminant aspects of the data, and that by controlling the dimension of this mapping it is possible to achieve discriminant low-dimensional representations of the data. We use the aforementioned duality and propose a new method, Large Margin Discriminant Dimensionality Reduction (LADDER), that jointly learns the mapping and the linear classifiers in an efficient manner. This leads to a data-driven mapping which can embed data into any number of dimensions. Experimental results show that this embedding can significantly improve performance on tasks such as hashing and image/scene classification.

NeurIPS Conference 2016 Conference Paper

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

  • Xiang Li
  • Tao Qin
  • Jian Yang
  • Tie-Yan Liu

Recurrent neural networks (RNNs) have achieved state-of-the-art performance in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model becomes very big (e.g., possibly beyond the memory capacity of a GPU device) and its training becomes very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use a 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2\sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, which is far fewer than the $|V|$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrificing accuracy (it achieves similar, if not better, perplexity compared to state-of-the-art language models). Remarkably, on the One-Billion-Word benchmark dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. We name our proposed algorithm LightRNN to reflect its very small model size and very high training speed.
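
A toy sketch of the 2-Component shared embedding follows: each word id is mapped to a (row, column) cell of a roughly sqrt(|V|) by sqrt(|V|) table and represented by the concatenation of its row and column vectors. The trivial row/column assignment used here is an illustrative stand-in for the allocation that LightRNN refines during training.

import math
import torch
import torch.nn as nn

class TwoComponentEmbedding(nn.Module):
    """Toy 2-component shared embedding: a word is addressed by a (row, column)
    cell of a sqrt(|V|) x sqrt(|V|) table and represented by its row vector and
    column vector, so only ~2*sqrt(|V|) vectors are stored."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.side = math.ceil(math.sqrt(vocab_size))
        self.row_emb = nn.Embedding(self.side, dim)   # ~sqrt(|V|) row vectors
        self.col_emb = nn.Embedding(self.side, dim)   # ~sqrt(|V|) column vectors

    def forward(self, word_ids):
        rows = word_ids // self.side
        cols = word_ids % self.side
        # a word is jointly represented by its row and column components
        return torch.cat([self.row_emb(rows), self.col_emb(cols)], dim=-1)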

AAAI Conference 2015 Conference Paper

Causal Inference via Sparse Additive Models with Application to Online Advertising

  • Wei Sun
  • Pengyuan Wang
  • Dawei Yin
  • Jian Yang
  • Yi Chang

Advertising effectiveness measurement is a fundamental problem in online advertising. Various causal inference methods have been employed to measure the causal effects of ad treatments. However, existing methods mainly focus on linear logistic regression for univariate and binary treatments and are not well suited for complex ad treatments of multiple dimensions, where each dimension could be discrete or continuous. In this paper we propose a novel two-stage causal inference framework for assessing the impact of complex ad treatments. In the first stage, we estimate the propensity parameter via a sparse additive model; in the second stage, a propensity-adjusted regression model is applied for measuring the treatment effect. Our approach is shown to provide an unbiased estimation of ad effectiveness under regularity conditions. To demonstrate the efficacy of our approach, we apply it to a real online advertising campaign to evaluate the impact of three ad treatments: ad frequency, ad channel, and ad size. We show that ad frequency usually has a treatment effect cap when ads are shown on mobile devices. In addition, the strategies for choosing the best ad size are completely different for mobile ads and online ads.

AAAI Conference 2015 Conference Paper

Sparse Deep Stacking Network for Image Classification

  • Jun Li
  • Heyou Chang
  • Jian Yang

Sparse coding can learn representations that are robust to noise and can model higher-order representations for image classification. However, the inference algorithm is computationally expensive even when supervised signals are used to learn compact and discriminative dictionaries in sparse coding techniques. Fortunately, a simplified neural network module (SNNM) has been proposed to directly learn the discriminative dictionaries while avoiding the expensive inference. But the SNNM module ignores sparse representations. Therefore, we propose a sparse SNNM module by adding a mixed-norm regularization (the l1/l2 norm). The sparse SNNM modules are further stacked to build a sparse deep stacking network (S-DSN). In the experiments, we evaluate S-DSN on four databases, including Extended YaleB, AR, 15-Scene and Caltech101. Experimental results show that our model outperforms related classification methods with only a linear classifier. It is worth noting that we reach 98.8% recognition accuracy on 15-Scene.
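
For illustration, one common form of an l1/l2 mixed-norm (group-sparse) regularizer is sketched below; whether this exact grouping matches the paper's regularizer is an assumption.

import torch

def mixed_norm_l1_l2(weight):
    """l1/l2 mixed-norm (group-sparse) regularizer: an l2 norm is taken within
    each row (group) and an l1 norm (a plain sum) across rows, which encourages
    entire rows to be switched off. The row-wise grouping is illustrative."""
    return weight.norm(p=2, dim=1).sum()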

AAAI Conference 2014 Conference Paper

Delivering Guaranteed Display Ads under Reach and Frequency Requirements

  • Ali Hojjat
  • John Turner
  • Suleyman Cetintas
  • Jian Yang

We propose a novel idea in the allocation and serving of online advertising. We show that by using predetermined fixed-length streams of ads (which we call patterns) to serve advertising, we can incorporate a variety of interesting features into the ad allocation optimization problem. In particular, our formulation optimizes for representativeness as well as user-level diversity and pacing of ads, under reach and frequency requirements. We show how the problem can be solved efficiently using a column generation scheme in which only a small set of best patterns are kept in the optimization problem. Our numerical tests suggest that with parallelization of the pattern generation process, the algorithm has a promising run time and memory usage.

IJCAI Conference 2013 Conference Paper

Instance Selection and Instance Weighting for Cross-Domain Sentiment Classification via PU Learning

  • Rui Xia
  • Xuelei Hu
  • Jianfeng Lu
  • Jian Yang
  • Chengqing Zong

Due to the explosive growth of online reviews on the Internet, we can easily collect a large amount of labeled reviews from different domains. But only some of them are beneficial for training a desired target-domain sentiment classifier. Therefore, it is important to identify the samples that are the most relevant to the target domain and use them as training data. To address this problem, a novel approach based on instance selection and instance weighting via PU learning is proposed. PU learning is first used to learn an in-target-domain selector, which assigns an in-target-domain probability to each sample in the training set. For instance selection, the samples with higher in-target-domain probability are used as training data; for instance weighting, the calibrated in-target-domain probabilities are used as sampling weights for training an instance-weighted naive Bayes model, based on the principle of maximum weighted likelihood estimation. The experimental results prove the necessity and effectiveness of the approach, especially when the size of the training data is large. It is also shown that the larger the Kullback-Leibler divergence between the training and test data, the more effective the proposed approach.
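
A minimal sketch of the instance-weighting step, assuming scikit-learn's MultinomialNB and using the calibrated in-target-domain probabilities as per-sample weights; the feature representation and the calibration itself are outside this sketch.

from sklearn.naive_bayes import MultinomialNB

def train_instance_weighted_nb(X_train, y_train, in_target_probs):
    """Instance-weighted naive Bayes (sketch): the calibrated in-target-domain
    probabilities produced by a PU-learned selector serve as per-sample
    weights, i.e. maximum weighted likelihood estimation."""
    clf = MultinomialNB()
    clf.fit(X_train, y_train, sample_weight=in_target_probs)
    return clf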

IS Journal 2011 Journal Article

Using Brain Imaging to Interpret Student Problem Solving

  • John Anderson
  • Shawn Betts
  • Jennifer Ferris
  • Jon Fincham
  • Jian Yang

We have been exploring whether multi-voxel pattern analysis (MVPA) of functional magnetic resonance imaging (fMRI) data can be used to infer the mental states of students learning mathematics. This approach has shown considerable success in tracking static mental states, such as whether a person is thinking about a location or an animal. Applying it to our case involves significant challenges not faced in many MVPA applications, because it is necessary to track changing student states over time. The paths of states that students take in solving problems can be quite variable. Nevertheless, we have achieved relatively high accuracy in determining what step a student is on when solving a sequence of problems and whether that step is being performed correctly. Hidden Markov models can then be used to combine behavioral and brain-imaging data from an intelligent tutoring system to track mental states during students' problem-solving episodes.