Arrow Research search

Author name cluster

Yan Yan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers
2 author rows

Possible papers (57)

AAAI Conference 2026 Conference Paper

Cost-Sensitive Conformal Training with Provably Controllable Learning Bounds

  • Xuesong Jia
  • Yuanjie Shi
  • Ziquan Liu
  • Yi Xu
  • Yan Yan

Conformal prediction (CP) is a general framework for quantifying the predictive uncertainty of machine learning models: it outputs a prediction set that includes the true label with a valid probability. To sharpen the uncertainty measured by CP, conformal training methods minimize the size of the prediction sets, typically by approximating the indicator function with a surrogate such as the Sigmoid or Gaussian error function. However, these surrogate functions do not admit a uniform error bound with respect to the indicator function, leading to uncontrollable learning bounds. In this paper, we propose a simple cost-sensitive conformal training algorithm that does not rely on the indicator-approximation mechanism. Specifically, we theoretically show that the expected size of prediction sets is upper bounded by the expected rank of the true labels. Building on this, we develop an importance-weighting strategy that assigns each example a weight based on the rank of its true label. Our analysis provably demonstrates the tightness between the proposed weighted objective and the expected size of conformal prediction sets. Extensive experiments verify the validity of our theoretical insights and show superior predictive efficiency over other conformal training methods, with a 21.38% reduction in average prediction set size.
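The rank-weighting idea from the abstract can be sketched in a few lines; the exact weighting rule and loss below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rank_weighted_loss(probs, labels):
    """Hypothetical sketch: weight each sample's cross-entropy by the
    rank of its true label under the model's predicted probabilities
    (rank 1 = top-ranked), so badly ranked examples dominate the update."""
    n = probs.shape[0]
    true_p = probs[np.arange(n), labels]
    # rank = 1 + number of classes the model scores above the true label
    ranks = 1 + (probs > true_p[:, None]).sum(axis=1)
    return float(np.mean(ranks * -np.log(true_p + 1e-12)))
```

When the true label is already top-ranked the weight is 1 and the objective reduces to plain cross-entropy; misranked examples are penalized in proportion to their rank.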

AAAI Conference 2026 Conference Paper

Joint Implicit and Explicit Language Learning for Pedestrian Attribute Recognition

  • Yukang Zhang
  • Lei Tan
  • Yang Lu
  • Yan Yan
  • Hanzi Wang

Pedestrian attribute recognition (PAR) has received increasing attention due to its wide application in video surveillance and pedestrian analysis. Some text-enhanced methods tackle this task by converting attributes into language descriptions to facilitate interactive learning between attributes and visual images. However, these generic descriptions fail to uniquely describe different pedestrian images and miss individual characteristics. In this paper, we propose a Joint Implicit and Explicit Language Guidance Enhancement Learning (JGEL) method, which converts each pedestrian image into a language description through dual language learning to effectively learn enhanced attribute information. Specifically, we first propose an Implicit Language Guidance Learning (ILGL) stream, which projects visual image features into the text embedding space to generate pseudo-word tokens, implicitly modeling image attributes and providing personalized descriptions. Moreover, we propose an Explicit Attribute Enhancement Learning (EAEL) stream that explicitly aligns the pseudo-word tokens generated by ILGL with the attribute concepts in the text embedding space. Extensive experiments show that JGEL significantly improves performance on PAR, including the challenging zero-shot PAR task.

AAAI Conference 2026 Conference Paper

Minimum-Length Conformal Prediction Sets for Ordinal Classification

  • Zijian Zhang
  • Xinyu Chen
  • Yuanjie Shi
  • Liyuan Lillian Ma
  • Zifan Xu
  • Yan Yan

Ordinal classification has been widely applied in many high-stakes applications, e.g., medical imaging and diagnosis, where reliable uncertainty quantification (UQ) is essential for decision making. Conformal prediction (CP) is a general UQ framework that provides statistically valid guarantees, which is especially useful in practice. However, prior ordinal CP methods mainly rely on heuristic algorithms or restrictively require the underlying model to predict a unimodal distribution over ordinal labels. Consequently, they provide limited insight into coverage–efficiency trade-offs, or give up the model-agnostic and distribution-free nature favored by CP methods. We fill this gap by proposing an ordinal-CP method that is model-agnostic and provides instance-level optimal prediction intervals. Specifically, we formulate conformal ordinal classification as a minimum-length covering problem at the instance level. To solve this problem, we develop a sliding-window algorithm that is optimal on each calibration instance, with time complexity linear in K, the number of candidate labels. The per-instance local optimality also improves predictive efficiency in expectation. Moreover, we propose a length-regularized variant that shrinks prediction set size while preserving coverage. Experiments on four benchmark datasets from diverse domains demonstrate the significantly improved predictive efficiency of the proposed methods over baselines (a 15% reduction on average across the four datasets).
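A minimal sketch of the sliding-window step, assuming the covering condition takes the form "accumulate probability mass over a contiguous label interval until it reaches a threshold tau" (the paper's exact conformity score may differ):

```python
def min_length_interval(probs, tau):
    """Shortest contiguous interval [l, r] over ordered labels whose
    probability mass reaches tau; two pointers give O(K) time."""
    best, left, mass = None, 0, 0.0
    for right, p in enumerate(probs):
        mass += p
        while mass >= tau:  # window covers tau: record it, then try to shrink
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            mass -= probs[left]
            left += 1
    return best  # None if the total mass never reaches tau
```

Each calibration instance gets its own locally optimal interval; the conformal threshold tau would then be set from the resulting calibration statistics.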

AAAI Conference 2026 Conference Paper

Predicting Emergent Tool Use in LLMs Before It Emerges: A Proxy Perspective

  • Bo-Wen Zhang
  • Yan Yan
  • Guang Liu
  • Xu-Cheng Yin

Tool-use capabilities fundamentally transform large language models (LLMs) from passive language generators into active agents with real-world utility, drawing intense research focus. Yet, their emergent nature renders traditional scaling laws ineffective for early-stage prediction, obstructing principled model design and efficient training. In this work, we propose a proxy-task perspective that predicts tool-use capabilities by measuring early model performance on selected non-emergent proxy tasks. Our method quantifies two properties of each proxy task: alignment, which reflects how well it captures tool-use trajectories, and stability, which indicates how consistently it behaves across training conditions. These properties are used to weight predictive signals. Theoretically, we formalize how these weighted signals approximate emergent tool use through bounded extrapolation under relaxed assumptions. Empirically, we validate our approach across training checkpoints, model scales, and data setups. Results show that a carefully weighted ensemble of proxy tasks can accurately rank downstream tool-use ability long before it arises. Our findings provide new theoretical foundations and practical tools for efficient training and capability planning, and advance the understanding of how complex abilities arise in LLMs.
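One plausible reading of the weighting scheme, with entirely illustrative alignment/stability scores and a simple product-weighted ensemble (the paper's actual estimator is more involved):

```python
def weighted_proxy_forecast(proxy_scores, alignment, stability):
    """Combine early proxy-task scores into a single predictor of later
    tool-use ability, weighting each proxy by the product of its
    alignment and stability (an assumed, simplified rule)."""
    weights = [a * s for a, s in zip(alignment, stability)]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, proxy_scores)) / total
```

Under this rule, a proxy that tracks tool-use trajectories well (high alignment) and behaves consistently across training conditions (high stability) dominates the forecast.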

AAAI Conference 2026 Conference Paper

ProxyTTT: Proxy-driven Test-Time Training for Multi-modal Re-identification

  • Aihua Zheng
  • Zhaojun Liu
  • Xixi Wan
  • Chenglong Li
  • Jin Tang
  • Yan Yan

Multi-modal object re-identification (ReID) aims to retrieve specific targets by leveraging complementary cues from different sensing modalities. Despite recent progress, two key challenges remain: (1) the limited ability to jointly address both modality and viewpoint discrepancies, and (2) the difficulty of effectively leveraging reliable target-domain data to improve generalization. To address these challenges, we propose Proxy-driven Test-Time Training (ProxyTTT), a unified framework that enhances both multi-modal identity representation learning and model generalization. During training, we propose a Multi-Proxy Learning (MPL) mechanism to address the representation bias across different views and modalities. MPL disentangles fine-grained modality-specific and modality-common identity proxies as semantic anchors to align identity features across diverse perspectives and sensing modalities. This alignment strategy enables the model to learn robust and discriminative global identity representations under heterogeneous modality conditions. At test time, to reliably exploit target domain data, we propose Proxy-guided Entropy-based Selective Adaptation (PESA) for test-time training. Specifically, PESA leverages the semantic structure encoded by identity proxies to estimate prediction uncertainty via entropy, and selectively adapts the model using only high-confidence samples. This selective adaptation effectively mitigates the domain shift between training and deployment environments, improving the model’s generalization in real-world scenarios. Extensive experiments on four public multi-modal ReID benchmarks (RGBNT201, RGBNT100, MSVR310, and WMVeID863) demonstrate the effectiveness of ProxyTTT.
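The PESA selection step reduces, in essence, to entropy filtering of test-time predictions; a sketch with an illustrative threshold:

```python
import numpy as np

def select_confident(probs, max_entropy=0.3):
    """Keep only test samples whose predictive entropy falls below a
    threshold, so adaptation uses high-confidence predictions only
    (the threshold value here is illustrative)."""
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.flatnonzero(ent < max_entropy)
```

Only the selected indices would be used to update the model at test time, which is what mitigates adaptation on unreliable target-domain samples.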

EAAI Journal 2025 Journal Article

A Multi-Domain Patch-Differentiated Transformer for vehicle re-identification

  • Zhi Yu
  • Zhiyong Huang
  • Mingyang Hou
  • Yan Yan
  • Yushi Liu
  • Daming Sun
  • Hans Gregersen

Vehicle re-identification aims to identify specific vehicles across camera systems, a crucial applied engineering task in intelligent transportation. Recently, transformer-based architectures have gained prominence in this field due to their robust feature-modeling capabilities. However, most transformer-based approaches treat all patches uniformly, disregarding the heterogeneous contributions of diverse patches to the final representation. To this end, this work proposes a Multi-Domain Patch-Differentiated Transformer (MDPDTrans) built upon the transformer architecture for vehicle re-identification. Specifically, a Multi-Domain Patch Differentiation Module (MDPDM) is designed to adaptively evaluate the importance of diverse patches by integrating the domains of attention response, information entropy, and feature energy, enabling differential adjustment of diverse patches. The MDPDM is then embedded within the vision transformer to construct MDPDTrans, enhancing the transformer's ability to handle diverse patch contributions and distinguish their heterogeneous importance. Finally, to ensure alignment across these domains, this work designs a Multi-Domain Alignment (MDA) loss, which constrains both direction and distribution to align the patch importance obtained from different domains. By integrating multi-domain patch differentiation and alignment into the transformer, MDPDTrans demonstrates strong performance under challenging conditions, and the experiments verify its engineering advancement and practical value in vehicle re-identification.

EAAI Journal 2025 Journal Article

A transfer reinforcement learning and digital-twin based task allocation method for human-robot collaboration assembly

  • Jingfei Wang
  • Yan Yan
  • Yaoguang Hu
  • Xiaonan Yang
  • Lixiang Zhang

Human-robot collaboration systems are considered to have great application potential for complex and flexible assembly tasks. In human-robot collaborative assembly systems, allocating roles and tasks between humans and robots is the vital stage for exerting and integrating the strengths of both. Many current studies of task allocation train decision-making models on large amounts of predefined standard assembly data, such as assembly time and assembly difficulty. However, due to individual differences and frequent product updates, dynamic assembly conditions lack enough data for model training, and models built on historical data may not adapt to new conditions. To bridge this gap, a transfer reinforcement learning and digital-twin based task allocation method is proposed to achieve accurate and efficient learning of multi-agent human-robot collaborative task allocation policies. First, based on a digital twin of the human-robot collaboration environment, augmented reality is leveraged to simulate the assembly process and collect execution data of workers and robots before physical assembly. Second, a multi-agent reinforcement learning method with a domain randomization strategy is introduced to pre-train the decision policy before accurate model training. Third, knowledge distillation is leveraged within a transfer reinforcement learning framework to learn an accurate decision policy by reusing the pre-trained policy and utilizing the collected assembly simulation data. Finally, a case study demonstrates the effectiveness of the proposed method.

NeurIPS Conference 2025 Conference Paper

Efficient Multimodal Dataset Distillation via Generative Models

  • Zhenghao Zhao
  • Haoxuan Wang
  • Junyi Wu
  • Yuzhang Shang
  • Gaowen Liu
  • Yan Yan

Dataset distillation aims to synthesize a small dataset from a large one such that a model trained on it performs well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirements and takes days to complete the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges in distilling multimodal datasets with generative models: 1) the lack of correlation between generated images and captions, and 2) the lack of diversity among generated samples. To address these issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy that further improves text-to-image retrieval performance by introducing more text information. Our method is evaluated on the Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18$\times$ faster than the state-of-the-art method. Our code will be made public at https://github.com/ichbill/EDGE.
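The bi-directional contrastive loss is presumably a symmetric InfoNCE over image-caption pairs; a sketch (the temperature value and cosine normalization are assumptions):

```python
import numpy as np

def bidirectional_contrastive_loss(img, txt, temp=0.07):
    """Symmetric InfoNCE: matched image/caption pairs sit on the
    diagonal of the similarity matrix; cross-entropy is averaged over
    the image->text and text->image directions."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp
    n = logits.shape[0]

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (ce(logits) + ce(logits.T))
```

Penalizing both directions is what ties each generated image to its own caption rather than to the caption pool as a whole.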

NeurIPS Conference 2025 Conference Paper

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

  • Bin Lei
  • Weitai Kang
  • Zijian Zhang
  • Winson Chen
  • Xi Xie
  • Shan Zuo
  • Mimi Xie
  • Ali Payani

This paper introduces \textsc{InfantAgent-Next}, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve a $\mathbf{7.27\%}$ accuracy gain over Claude-Computer-Use on OSWorld. Codes and evaluation scripts are included in the supplementary material and will be released as open-source.

JAIR Journal 2025 Journal Article

LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking

  • Yan Yan
  • Junyuan Liu
  • Bo-Wen Zhang

Despite recent advancements in semantic representation driven by pre-trained and large-scale language models, long tail challenges in multi-label text classification remain a significant issue: less frequent labels are persistently difficult to classify accurately. Current approaches often focus on improving text semantics while neglecting the crucial role of label relationships. This paper introduces LabelCoRank, a novel approach inspired by ranking principles. LabelCoRank leverages label co-occurrence relationships to refine initial label classifications through a dual-stage reranking process. The first stage uses the initial classification results to form a preliminary ranking. In the second stage, a label co-occurrence matrix is utilized to rerank the preliminary results, enhancing the accuracy and relevance of the final classifications. By integrating the reranked label representations as additional text features, LabelCoRank effectively mitigates long tail issues in multi-label text classification. Experimental evaluations on popular datasets including MAG-CS, PubMed, and AAPD demonstrate the effectiveness and robustness of LabelCoRank. The implementation code is publicly available at https://github.com/821code/LabelCoRank.
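A toy sketch of the second reranking stage, assuming initial scores are blended with scores propagated through a row-normalized co-occurrence matrix (the blend weight alpha and the blending rule are illustrative, not the paper's):

```python
import numpy as np

def cooccurrence_rerank(scores, cooc, alpha=0.3):
    """Second-stage reranking sketch: each label borrows score from
    the labels it frequently co-occurs with, which lifts tail labels
    that co-occur with confidently predicted head labels."""
    row_sums = np.clip(cooc.sum(axis=1, keepdims=True), 1e-12, None)
    propagated = (cooc / row_sums) @ scores
    return (1 - alpha) * scores + alpha * propagated
```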

JBHI Journal 2025 Journal Article

MSCD-VM-UNet: A Vision Mamba Combining Multi-Scale Global and Local Feature Extraction With Cross-Domain Feature Fusion for Medical Image Segmentation

  • Zhiyong Huang
  • Shuxin Wang
  • Mingyang Hou
  • Zhi Yu
  • Shiwei Wang
  • Xiaoyu Li
  • Yan Yan
  • Yushi Liu

Accurate segmentation of tissues and lesions is essential for diagnosis and treatment. State Space Models (SSMs) have gained attention for their linear complexity and ability to model long-range dependencies. However, the existing Mamba architecture relies on direct skip connections, which limits its ability to integrate multi-scale and multi-level features and handle boundary details effectively. To address these limitations, we propose the MSCD-VM-UNet architecture, which incorporates three novel modules: the Spatial Group Multi-Scale Attention Module (SGMAM), the Cross-Domain Feature Fusion Module (CDFFM), and the Attention-Based Feature Injection Module (ABFIM). The SGMAM captures multi-scale global and local information and adaptively adjusts feature importance to highlight key regions while suppressing noise. The CDFFM enhances boundary and detail handling by aligning semantic features from both the frequency and spatial domains. The ABFIM utilizes attention mechanisms to adaptively fuse and weigh features from different scales and semantics, promoting feature collaboration and improving the model’s robustness in complex tasks. Experiments on multiple datasets show that these modules significantly enhance the accuracy of MSCD-VM-UNet, setting a new benchmark for medical image segmentation.

NeurIPS Conference 2025 Conference Paper

Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos

  • Junyi Wu
  • Jiachen Tao
  • Haoxuan Wang
  • Gaowen Liu
  • Ramana Kompella
  • Yan Yan

We present Orientation-anchored Gaussian Splatting (OriGS), a novel framework for high-quality 4D reconstruction from casually captured monocular videos. While recent advances extend 3D Gaussian Splatting to dynamic scenes via various motion anchors, such as graph nodes or spline control points, they often rely on low-rank assumptions and fall short in modeling complex, region-specific deformations inherent to unconstrained dynamics. OriGS addresses this by introducing a hyperdimensional representation grounded in scene orientation. We first estimate a Global Orientation Field that propagates principal forward directions across space and time, serving as stable structural guidance for dynamic modeling. Built upon this, we propose Orientation-aware Hyper-Gaussian, a unified formulation that embeds time, space, geometry, and orientation into a coherent probabilistic state. This enables inferring region-specific deformation through principled conditioned slicing, adaptively capturing diverse local dynamics in alignment with global motion intent. Experiments demonstrate the superior reconstruction fidelity of OriGS over mainstream methods in challenging real-world dynamic scenes.

NeurIPS Conference 2025 Conference Paper

WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting

  • Kaitao Huang
  • Yan Yan
  • Jing-Hao Xue
  • Hanzi Wang

3D GAN inversion projects a single image into the latent space of a pre-trained 3D GAN to achieve single-shot novel view synthesis, which requires visible regions with high fidelity and occluded regions with realism and multi-view consistency. However, existing methods focus on the reconstruction of visible regions, while the generation of occluded regions relies only on the generative prior of the 3D GAN. As a result, the generated occluded regions often exhibit poor quality due to the information loss caused by the low bit-rate latent code. To address this, we introduce the warping-and-inpainting strategy to incorporate image inpainting into 3D GAN inversion and propose a novel 3D GAN inversion method, WarpGAN. Specifically, we first employ a 3D GAN inversion encoder to project the single-view image into a latent code that serves as the input to the 3D GAN. Then, we perform warping to a novel view using the depth map generated by the 3D GAN. Finally, we develop a novel SVINet, which leverages the symmetry prior and multi-view image correspondence w.r.t. the same latent code to perform inpainting of occluded regions in the warped image. Quantitative and qualitative experiments demonstrate that our method consistently outperforms several state-of-the-art methods.

NeurIPS Conference 2025 Conference Paper

X-Field: A Physically Informed Representation for 3D X-ray Reconstruction

  • Feiran Wang
  • Jiachen Tao
  • Junyi Wu
  • Haoxuan Wang
  • Bin Duan
  • Kai Wang
  • Zongxin Yang
  • Yan Yan

X-ray imaging is indispensable in medical diagnostics, yet its use is tightly regulated due to radiation exposure. Recent research borrows representations from the 3D reconstruction area to complete two tasks with reduced radiation dose: X-ray Novel View Synthesis (NVS) and Computed Tomography (CT) reconstruction. However, these representations fail to fully capture the penetration and attenuation properties of X-ray imaging, as they originate from visible light imaging. In this paper, we introduce X-Field, a 3D representation informed by the physics of X-ray imaging. First, we employ homogeneous 3D ellipsoids with distinct attenuation coefficients to accurately model diverse materials within internal structures. Second, we introduce an efficient path-partitioning algorithm that resolves the intricate intersections of ellipsoids to compute cumulative attenuation along an X-ray path. We further propose a hybrid progressive initialization to refine the geometric accuracy of X-Field and incorporate material-based optimization to enhance model fitting along material boundaries. Experiments show that X-Field achieves superior visual fidelity on both real-world human organ and synthetic object datasets, outperforming state-of-the-art methods in X-ray NVS and CT reconstruction. Our code is available on the project page: https://github.com/Brack-Wang/X-Field.
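Once a ray is partitioned into homogeneous ellipsoid segments, the cumulative attenuation follows the Beer-Lambert law; a minimal sketch (the segment extraction itself, the hard part of the algorithm, is omitted):

```python
import numpy as np

def transmitted_intensity(i0, segments):
    """Beer-Lambert attenuation over a path split into homogeneous
    segments: each (mu, length) pair contributes exp(-mu * length)
    to the transmitted fraction."""
    optical_depth = sum(mu * length for mu, length in segments)
    return i0 * np.exp(-optical_depth)
```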

NeurIPS Conference 2024 Conference Paper

Conformal Prediction for Class-wise Coverage via Augmented Label Rank Calibration

  • Yuanjie Shi
  • Subhankar Ghosh
  • Taha Belkhouja
  • Janardhan R. Doppa
  • Yan Yan

Conformal prediction (CP) is an emerging uncertainty quantification framework that allows us to construct a prediction set to cover the true label with a pre-specified marginal or conditional probability. Although the valid coverage guarantee has been extensively studied for classification problems, CP often produces large prediction sets which may not be practically useful. This issue is exacerbated in the setting of class-conditional coverage on classification tasks with many and/or imbalanced classes. This paper proposes the Rank Calibrated Class-conditional CP (RC3P) algorithm to reduce prediction set sizes while achieving class-conditional coverage, where the valid coverage holds for each class. In contrast to the standard class-conditional CP (CCP) method that uniformly thresholds the class-wise conformity score for each class, the augmented label rank calibration step allows RC3P to selectively iterate this class-wise thresholding subroutine only for a subset of classes whose class-wise top-$k$ error is small. We prove that, agnostic to the classifier and data distribution, RC3P achieves class-wise coverage, and we show that RC3P produces smaller prediction sets than CCP. Comprehensive experiments on multiple real-world datasets demonstrate that RC3P achieves class-wise coverage with a 26.25% average reduction in prediction set sizes.
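For context, the standard class-conditional CP (CCP) baseline that RC3P refines computes one conformity-score threshold per class; a sketch (the label-rank calibration that RC3P adds on top is omitted here):

```python
import numpy as np

def classwise_thresholds(scores, labels, num_classes, alpha=0.1):
    """Per-class conformal threshold: the ceil((n+1)(1-alpha))-th
    smallest calibration conformity score within each class."""
    thr = np.empty(num_classes)
    for y in range(num_classes):
        s = np.sort(scores[labels == y])
        k = int(np.ceil((len(s) + 1) * (1 - alpha)))
        thr[y] = s[min(k, len(s)) - 1]
    return thr  # prediction set for x is {y : score(x, y) <= thr[y]}
```

RC3P's extra step would skip (or tighten) this thresholding for classes whose top-$k$ error is small, which is where the set-size reduction comes from.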

EAAI Journal 2024 Journal Article

Dynamic flexible scheduling with transportation constraints by multi-agent reinforcement learning

  • Lixiang Zhang
  • Yan Yan
  • Yaoguang Hu

Reinforcement learning-based methods have addressed production scheduling problems with flexible processing constraints. However, delayed rewards arise from the dynamic arrival of jobs and from transportation constraints between successive operations: the flow time of an operation can only be determined after processing, because the job-sequencing solution may change when new operations are inserted in dynamic environments. Job sequencing is often overlooked in single-agent scheduling methods, and the lack of information sharing between multiple agents forces researchers to manually design reward functions that fit the relationship between optimization objectives and rewards, reducing the accuracy of the learned policies. Thus, this paper proposes a multi-agent scheduling optimization framework that facilitates collaboration between machine agents and job agents to address dynamic flexible job-shop scheduling problems (DFJSP) with transportation time constraints. The problem is formulated as a partially observable Markov decision process, and a reward-sharing mechanism is constructed to tackle the delayed-reward issue and facilitate policy learning. Finally, we develop an improved multi-agent dueling double deep Q-network algorithm to optimize the scheduling policy during long-term training. The results show that, compared with state-of-the-art methods, the proposed method efficiently shortens the weighted flow time under both trained and unseen scenarios. The case study further demonstrates its efficiency and responsiveness, indicating that the proposed method efficiently handles production scheduling with complex constraints, including job insertion, transportation time constraints, and flexible processing routes.

AAAI Conference 2024 Conference Paper

Federated Partial Label Learning with Local-Adaptive Augmentation and Regularization

  • Yan Yan
  • Yuhong Guo

Partial label learning (PLL) expands the applicability of supervised machine learning models by enabling effective learning from weakly annotated overcomplete labels. Existing PLL methods however focus on the standard centralized learning scenarios. In this paper, we expand PLL into the distributed computation setting by formalizing a new learning scenario named as federated partial label learning (FedPLL), where the training data with partial labels are distributed across multiple local clients with privacy constraints. To address this challenging problem, we propose a novel Federated PLL method with Local-Adaptive Augmentation and Regularization (FedPLL-LAAR). In addition to alleviating the partial label noise with moving-average label disambiguation, the proposed method performs MixUp-based local-adaptive data augmentation to mitigate the challenge posed by insufficient and imprecisely annotated local data, and dynamically incorporates the guidance of global model to minimize client drift through adaptive gradient alignment regularization between the global and local models. Extensive experiments conducted on multiple datasets under the FedPLL setting demonstrate the effectiveness of the proposed FedPLL-LAAR method for federated partial label learning.

AAAI Conference 2024 Conference Paper

High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-identification

  • Liuxiang Qiu
  • Si Chen
  • Yan Yan
  • Jing-Hao Xue
  • Da-Han Wang
  • Shunzhi Zhu

Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same persons captured by visible (VIS) and infrared (IR) cameras. Existing VI-ReID methods ignore high-order structure information of features while being relatively difficult to learn a reasonable common feature space due to the large modality discrepancy between VIS and IR images. To address the above problems, we propose a novel high-order structure based middle-feature learning network (HOS-Net) for effective VI-ReID. Specifically, we first leverage a short- and long-range feature extraction (SLE) module to effectively exploit both short-range and long-range features. Then, we propose a high-order structure learning (HSL) module to successfully model the high-order relationship across different local features of each person image based on a whitened hypergraph network. This greatly alleviates model collapse and enhances feature representations. Finally, we develop a common feature space learning (CFL) module to learn a discriminative and reasonable common feature space based on middle features generated by aligning features from different modalities and ranges. In particular, a modality-range identity-center contrastive (MRIC) loss is proposed to reduce the distances between the VIS, IR, and middle features, smoothing the training process. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that our HOS-Net achieves superior state-of-the-art performance. Our code is available at https://github.com/Jaulaucoeng/HOS-Net.

EAAI Journal 2024 Journal Article

Multi-agent policy learning-based path planning for autonomous mobile robots

  • Lixiang Zhang
  • Ze Cai
  • Yan Yan
  • Chen Yang
  • Yaoguang Hu

This study addresses path planning problems for autonomous mobile robots (AMRs) that account for their kinematics, where performance and responsiveness are often incompatible. A multi-agent policy learning-based method is proposed to tackle this challenge in dynamic environments. The method features a centralized-learning, decentralized-execution path planning framework designed to meet both performance and responsiveness requirements. The problem is modeled as a partially observable Markov decision process for policy learning, with the kinematics considered via conventional neural networks. An improved proximal policy optimization algorithm is then developed with highlight experience replay, which corrects failed experiences to speed up learning. The experimental results show that the proposed method outperforms the baselines in both static and dynamic environments, shortening movement distance and time by about 29.1% and 5.7% in static environments, and by about 21.1% and 20.4% in dynamic environments, respectively. The runtime remains in the millisecond range across various environments, taking only 0.07 s. Overall, the proposed method is valid and efficient in ensuring the performance and responsiveness of AMRs on complex and dynamic path planning problems.

NeurIPS Conference 2024 Conference Paper

PTQ4DiT: Post-training Quantization for Diffusion Transformers

  • Junyi Wu
  • Haoxuan Wang
  • Yuzhang Shang
  • Mubarak Shah
  • Yan Yan

The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.
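Channel-wise balancing exploits the fact that scaling an activation channel down while scaling the matching weight row up leaves the matrix product unchanged; a sketch in that spirit (the paper's exact CSB rule is not reproduced here):

```python
import numpy as np

def balance_channels(act, weight):
    """Shift magnitude extremes between activation channels and the
    corresponding weight rows: act @ weight is mathematically
    unchanged, but both tensors become easier to quantize."""
    a_max = np.abs(act).max(axis=0)
    w_max = np.clip(np.abs(weight).max(axis=1), 1e-12, None)
    s = np.clip(np.sqrt(a_max / w_max), 1e-12, None)
    return act / s, weight * s[:, None]
```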

AAAI Conference 2024 Conference Paper

Spatial-Contextual Discrepancy Information Compensation for GAN Inversion

  • Ziqiang Zhang
  • Yan Yan
  • Jing-Hao Xue
  • Hanzi Wang

Most existing GAN inversion methods either achieve accurate reconstruction but lack editability or offer strong editability at the cost of fidelity. Hence, how to balance the distortion-editability trade-off is a significant challenge for GAN inversion. To address this challenge, we introduce a novel spatial-contextual discrepancy information compensation-based GAN-inversion method (SDIC), which consists of a discrepancy information prediction network (DIPN) and a discrepancy information compensation network (DICN). SDIC follows a "compensate-and-edit" paradigm and successfully bridges the gap in image details between the original image and the reconstructed/edited image. On the one hand, DIPN encodes the multi-level spatial-contextual information of the original and initial reconstructed images and then predicts a spatial-contextual guided discrepancy map with two hourglass modules. In this way, a reliable discrepancy map that models the contextual relationship and captures fine-grained image details is learned. On the other hand, DICN incorporates the predicted discrepancy information into both the latent code and the GAN generator with different transformations, generating high-quality reconstructed/edited images. This effectively compensates for the loss of image details during GAN inversion. Both quantitative and qualitative experiments demonstrate that our proposed method achieves an excellent distortion-editability trade-off at a fast inference speed for both image inversion and editing tasks. Our code is available at https://github.com/ZzqLKED/SDIC.

AAAI Conference 2024 Conference Paper

WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting

  • Zhiliang Wu
  • Changchang Sun
  • Hanyu Xuan
  • Gaowen Liu
  • Yan Yan

Video inpainting aims to fill in the missing regions of the video frames with plausible content. Benefiting from the outstanding long-range modeling capacity, the transformer-based models have achieved unprecedented performance regarding inpainting quality. Essentially, coherent contents from all the frames along both spatial and temporal dimensions are concerned by a patch-wise attention module, and then the missing contents are generated based on the attention-weighted summation. In this way, attention retrieval accuracy has become the main bottleneck to improve the video inpainting performance, where the factors affecting attention calculation should be explored to maximize the advantages of the transformer. Towards this end, in this paper, we theoretically certify that noise is the culprit that entangles the process of attention calculation. Meanwhile, we propose a novel wavelet transformer network with noise robustness for video inpainting, named WaveFormer. Unlike existing transformer-based methods that utilize the whole embeddings to calculate the attention, our WaveFormer first separates the noise existing in the embedding into high-frequency components by introducing the Discrete Wavelet Transform (DWT), and then adopts clean low-frequency components to calculate the attention. In this way, the impact of noise on attention computation can be greatly mitigated and the missing content regarding different frequencies can be generated by sharing the calculated attention. Extensive experiments validate the superior performance of our method over state-of-the-art baselines both qualitatively and quantitatively.
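The frequency separation that WaveFormer builds on can be illustrated with a one-level Haar DWT. This is a generic sketch of the transform itself in pure Python, not the authors' network; all names are illustrative:

```python
import math

def haar_dwt(signal):
    """One-level Haar DWT: pairwise averages form the low-frequency
    (approximation) band; pairwise differences form the high-frequency
    (detail) band, where additive noise mostly ends up."""
    low = [(signal[i] + signal[i + 1]) / math.sqrt(2)
           for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / math.sqrt(2)
            for i in range(0, len(signal) - 1, 2)]
    return low, high

def haar_idwt(low, high):
    """Exact inverse of the one-level Haar transform."""
    out = []
    for l, h in zip(low, high):
        out.append((l + h) / math.sqrt(2))
        out.append((l - h) / math.sqrt(2))
    return out

low, high = haar_dwt([4.0, 2.0, 1.0, 3.0])
rec = haar_idwt(low, high)  # reconstructs the original signal exactly
```

In WaveFormer's terms, attention would be computed on the clean `low` band while the noisier `high` band is kept out of the attention calculation.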

NeurIPS Conference 2023 Conference Paper

Boundary Guided Learning-Free Semantic Control with Diffusion Models

  • Ye Zhu
  • Yu Wu
  • Zhiwei Deng
  • Olga Russakovsky
  • Yan Yan

Applying pre-trained generative denoising diffusion models (DDMs) for downstream tasks such as image semantic editing usually requires either fine-tuning DDMs or learning auxiliary editing networks in the existing literature. In this work, we present our BoundaryDiffusion method for efficient, effective and light-weight semantic control with frozen pre-trained DDMs, without learning any extra networks. As one of the first learning-free diffusion editing works, we start by seeking a more comprehensive understanding of the intermediate high-dimensional latent spaces by theoretically and empirically analyzing their probabilistic and geometric behaviors in the Markov chain. We then propose to further explore the critical step in the denoising trajectory that characterizes the convergence of a pre-trained DDM and introduce an automatic search method. Last but not least, in contrast to the conventional understanding that DDMs have relatively poor semantic behaviors (in generic latent spaces), we prove that the critical latent space we found already forms semantic subspace boundaries at the generic level in unconditional DDMs, which allows us to do controllable manipulation by guiding the denoising trajectory towards the targeted boundary via a single-step operation. We conduct extensive experiments on multiple DPM architectures (DDPM, iDDPM) and datasets (CelebA, CelebA-HQ, LSUN-church, LSUN-bedroom, AFHQ-dog) with different resolutions (64, 256), achieving superior or state-of-the-art performance in various task scenarios (image semantic editing, text-based editing, unconditional semantic control) to demonstrate the effectiveness.

AAAI Conference 2023 Conference Paper

Enhance Robustness of Machine Learning with Improved Efficiency

  • Yan Yan

Robustness of machine learning, often referring to securing performance on different data, is always an active field due to the ubiquitous variety and diversity of data in practice. Many studies in recent years have investigated how to make the learning process robust. To this end, there is usually a trade-off that incurs some extra cost, e.g., more data samples, more complicated objective functions, more iterations to converge in optimization, etc. The problem then boils down to finding a better trade-off under some conditions. My recent research focuses on robust machine learning with improved efficiency. Particularly, efficiency here represents the learning speed to find a model and the number of data samples required to secure robustness. In the talk, I will survey three pieces of my recent research by elaborating the algorithmic ideas and theoretical analysis as technical contributions: (i) epoch stochastic gradient descent ascent for min-max problems, (ii) a stochastic optimization algorithm for non-convex inf-projection problems, and (iii) neighborhood conformal prediction. In the first two pieces of work, the proposed optimization algorithms are general and cover objective functions for robust machine learning. In the third, I will elaborate on an efficient conformal prediction algorithm that guarantees the robustness of prediction after the model is trained. Particularly, the efficiency of conformal prediction is measured by its bandwidth.

JMLR Journal 2023 Journal Article

Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition

  • Zhishuai Guo
  • Yan Yan
  • Zhuoning Yuan
  • Tianbao Yang

This paper focuses on stochastic methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their potential applications in deep learning (e.g., deep AUC maximization, distributionally robust optimization). However, most of the existing algorithms are slow in practice, and their analysis revolves around the convergence to a nearly stationary point. We consider leveraging the Polyak-Lojasiewicz (PL) condition to design faster stochastic algorithms with stronger convergence guarantee. Although PL condition has been utilized for designing many stochastic minimization algorithms, their applications for non-convex min-max optimization remain rare. In this paper, we propose and analyze a generic framework of proximal stage-based method with many well-known stochastic updates embeddable. Fast convergence is established in terms of both the primal objective gap and the duality gap. Compared with existing studies, (i) our analysis is based on a novel Lyapunov function consisting of the primal objective gap and the duality gap of a regularized function, and (ii) the results are more comprehensive with improved rates that have better dependence on the condition number under different assumptions. We also conduct deep and non-deep learning experiments to verify the effectiveness of our methods.

AAAI Conference 2023 Conference Paper

Improving Uncertainty Quantification of Deep Classifiers via Neighborhood Conformal Prediction: Novel Algorithm and Theoretical Analysis

  • Subhankar Ghosh
  • Taha Belkhouja
  • Yan Yan
  • Janardhan Rao Doppa

Safe deployment of deep neural networks in high-stakes real-world applications requires theoretically sound uncertainty quantification. Conformal prediction (CP) is a principled framework for uncertainty quantification of deep models in the form of a prediction set for classification tasks with a user-specified coverage (i.e., the true class label is contained with high probability). This paper proposes a novel algorithm referred to as Neighborhood Conformal Prediction (NCP) to improve the efficiency of uncertainty quantification from CP for deep classifiers (i.e., reduce prediction set size). The key idea behind NCP is to use the learned representation of the neural network to identify k nearest-neighbor calibration examples for a given testing input and assign them importance weights proportional to their distance to create adaptive prediction sets. We theoretically show that if the learned data representation of the neural network satisfies some mild conditions, NCP will produce smaller prediction sets than traditional CP algorithms. Our comprehensive experiments on CIFAR-10, CIFAR-100, and ImageNet datasets using diverse deep neural networks strongly demonstrate that NCP leads to significant reduction in prediction set size over prior CP methods.
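As a rough illustration of the weighting idea (not the authors' code; the weighting scheme, data, and names below are simplified stand-ins), a distance-weighted conformal threshold and the resulting prediction set can be computed like this:

```python
def weighted_quantile(scores, weights, alpha):
    """Smallest calibration score q such that the weighted fraction
    of scores <= q reaches at least 1 - alpha."""
    pairs = sorted(zip(scores, weights))
    total = sum(weights)
    cum = 0.0
    for s, w in pairs:
        cum += w
        if cum / total >= 1.0 - alpha:
            return s
    return pairs[-1][0]

def prediction_set(class_scores, threshold):
    """Keep every label whose nonconformity score is within threshold."""
    return {label for label, s in class_scores.items() if s <= threshold}

# Five calibration scores; calibration examples nearer to the test
# input (in representation space) get larger weight, e.g. based on
# inverse distance, so they dominate the choice of threshold.
cal_scores = [0.10, 0.30, 0.25, 0.80, 0.55]
weights = [1.0, 0.8, 0.9, 0.2, 0.4]
q = weighted_quantile(cal_scores, weights, alpha=0.1)
pred = prediction_set({"cat": 0.2, "dog": 0.5, "fox": 0.9}, q)
```

With uniform weights this reduces to standard split conformal prediction; the neighborhood weighting is what adapts the set size to each test input.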

NeurIPS Conference 2023 Conference Paper

MIM4DD: Mutual Information Maximization for Dataset Distillation

  • Yuzhang Shang
  • Zhihang Yuan
  • Yan Yan

Dataset distillation (DD) aims to synthesize a small dataset whose test performance is comparable to a full dataset using the same model. State-of-the-art (SoTA) methods optimize synthetic datasets primarily by matching heuristic indicators extracted from two networks: one from real data and one from synthetic data (see Fig. 1, Left), such as gradients and training trajectories. DD is essentially a compression problem that emphasizes maximizing the preservation of information contained in the data. We argue that well-defined metrics which measure the amount of shared information between variables in information theory are necessary for success measurement, but are never considered by previous works. Thus, we introduce mutual information (MI) as the metric to quantify the shared information between the synthetic and the real datasets, and devise MIM4DD numerically maximizing the MI via a newly designed optimizable objective within a contrastive learning framework to update the synthetic dataset. Specifically, we designate samples from the two datasets that share the same label as positive pairs, and those with different labels as negative pairs. Then we respectively pull and push those samples in positive and negative pairs into contrastive space via minimizing NCE loss. As a result, the targeted MI can be transformed into a lower bound represented by feature maps of samples, which is numerically feasible. Experiment results show that MIM4DD can be implemented as an add-on module to existing SoTA DD methods.
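The contrastive MI estimate mentioned above is typically an InfoNCE-style bound. A minimal per-anchor version, as an illustrative sketch rather than the paper's implementation:

```python
import math

def info_nce_loss(sim_pos, sims_all, tau=0.5):
    """Per-anchor InfoNCE term: -log(exp(s_pos/tau) / sum_j exp(s_j/tau)),
    where s_pos is the similarity to the positive pair and sims_all are
    similarities to the positive plus all negatives. Averaged over
    anchors, minimizing this maximizes a lower bound on the mutual
    information between the paired representations."""
    denom = sum(math.exp(s / tau) for s in sims_all)
    return -math.log(math.exp(sim_pos / tau) / denom)

# One positive (similarity 1.0) against one negative (similarity 0.0).
loss = info_nce_loss(1.0, [1.0, 0.0], tau=1.0)
```

In MIM4DD's setting, the anchors would be feature maps of synthetic samples and the positives the real samples sharing their label.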

AAAI Conference 2023 Conference Paper

MRCN: A Novel Modality Restitution and Compensation Network for Visible-Infrared Person Re-identification

  • Yukang Zhang
  • Yan Yan
  • Jie Li
  • Hanzi Wang

Visible-infrared person re-identification (VI-ReID), which aims to search identities across different spectra, is a challenging task due to large cross-modality discrepancy between visible and infrared images. The key to reduce the discrepancy is to filter out identity-irrelevant interference and effectively learn modality-invariant person representations. In this paper, we propose a novel Modality Restitution and Compensation Network (MRCN) to narrow the gap between the two modalities. Specifically, we first reduce the modality discrepancy by using two Instance Normalization (IN) layers. Next, to reduce the influence of IN layers on removing discriminative information and to reduce modality differences, we propose a Modality Restitution Module (MRM) and a Modality Compensation Module (MCM) to respectively distill modality-irrelevant and modality-relevant features from the removed information. Then, the modality-irrelevant features are used to restitute to the normalized visible and infrared features, while the modality-relevant features are used to compensate for the features of the other modality. Furthermore, to better disentangle the modality-relevant features and the modality-irrelevant features, we propose a novel Center-Quadruplet Causal (CQC) loss to encourage the network to effectively learn the modality-relevant features and the modality-irrelevant features. Extensive experiments are conducted to validate the superiority of our method on the challenging SYSU-MM01 and RegDB datasets. More remarkably, our method achieves 95.1% in terms of Rank-1 and 89.2% in terms of mAP on the RegDB dataset.

TIST Journal 2023 Journal Article

Out-of-distribution Detection in Time-series Domain: A Novel Seasonal Ratio Scoring Approach

  • Taha Belkhouja
  • Yan Yan
  • Janardhan Rao Doppa

Safe deployment of time-series classifiers for real-world applications relies on the ability to detect the data that is not generated from the same distribution as training data. This task is referred to as out-of-distribution (OOD) detection. We consider the novel problem of OOD detection for the time-series domain. We discuss the unique challenges posed by time-series data and explain why prior methods from the image domain will perform poorly. Motivated by these challenges, this article proposes a novel Seasonal Ratio Scoring (SRS) approach. SRS consists of three key algorithmic steps. First, each input is decomposed into a class-wise semantic component and a remainder. Second, this decomposition is employed to estimate the class-wise conditional likelihoods of the input and remainder using deep generative models. The seasonal ratio score is computed from these estimates. Third, a threshold interval is identified from the in-distribution data to detect OOD examples. Experiments on diverse real-world benchmarks demonstrate that the SRS method is well-suited for time-series OOD detection when compared to baseline methods.

IJCAI Conference 2022 Conference Paper

Active Contrastive Set Mining for Robust Audio-Visual Instance Discrimination

  • Hanyu Xuan
  • Yihong Xu
  • Shuo Chen
  • Zhiliang Wu
  • Jian Yang
  • Yan Yan
  • Xavier Alameda-Pineda

The recent success of audio-visual representation learning can be largely attributed to their pervasive property of audio-visual synchronization, which can be used as self-annotated supervision. As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm. Existing AVID methods construct the contrastive set by random sampling based on the assumption that the audio and visual clips from all other videos are not semantically related. We argue that this assumption is rough, since the resulting contrastive sets have a large number of faulty negatives. In this paper, we overcome this limitation by proposing a novel Active Contrastive Set Mining (ACSM) that aims to mine the contrastive sets with informative and diverse negatives for robust AVID. Moreover, we also integrate a semantically-aware hard-sample mining strategy into our ACSM. The proposed ACSM is implemented into two most recent state-of-the-art AVID methods and significantly improves their performance. Extensive experiments conducted on both action and sound recognition on multiple datasets show the remarkably improved performance of our method.

AAAI Conference 2022 Conference Paper

Training Robust Deep Models for Time-Series Domain: Novel Algorithms and Theoretical Analysis

  • Taha Belkhouja
  • Yan Yan
  • Janardhan Rao Doppa

Despite the success of deep neural networks (DNNs) for real-world applications over time-series data such as mobile health, little is known about how to train robust DNNs for time-series domain due to its unique characteristics compared to images and text data. In this paper, we fill this gap by proposing a novel algorithmic framework referred to as RObust Training for Time-Series (RO-TS) to create robust deep models for time-series classification tasks. Specifically, we formulate a min-max optimization problem over the model parameters by explicitly reasoning about the robustness criteria in terms of additive perturbations to time-series inputs measured by the global alignment kernel (GAK) based distance. We also show the generality and advantages of our formulation using the summation structure over time-series alignments by relating both GAK and dynamic time warping (DTW). This problem is an instance of a family of compositional min-max optimization problems, which are challenging and open with unclear theoretical guarantees. We propose a principled stochastic compositional alternating gradient descent ascent (SCAGDA) algorithm for this family of optimization problems. Unlike traditional methods for time-series that require approximate computation of distance measures, SCAGDA approximates the GAK based distance on-the-fly using a moving average approach. We theoretically analyze the convergence rate of SCAGDA and provide strong theoretical support for the estimation of GAK based distance. Our experiments on real-world benchmarks demonstrate that RO-TS creates more robust deep models when compared to adversarial training using prior methods that rely on data augmentation or new definitions of loss functions. We also demonstrate the importance of GAK for time-series data over the Euclidean distance.
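For context on the GAK-DTW relation the abstract mentions: DTW keeps only the single cheapest alignment between two series, while GAK soft-sums over all alignments. A textbook DTW recurrence, as a sketch (not the paper's code):

```python
def dtw(a, b):
    """Dynamic time warping with squared-difference local cost:
    D[i][j] = cost(i, j) + min(D[i-1][j], D[i][j-1], D[i-1][j-1]).
    DTW takes the minimum over alignment paths; GAK replaces this
    minimum with a (soft) sum, which is what makes it a kernel."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A series aligned with a time-stretched copy of itself has zero DTW cost,
# which Euclidean distance cannot capture.
stretched_cost = dtw([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```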

AAAI Conference 2022 Conference Paper

When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework

  • Xinyi Zou
  • Yan Yan
  • Jing-Hao Xue
  • Si Chen
  • Hanzi Wang

Human emotions involve basic and compound facial expressions. However, current research on facial expression recognition (FER) mainly focuses on basic expressions, and thus fails to address the diversity of human emotions in practical scenarios. Meanwhile, existing work on compound FER relies heavily on abundant labeled compound expression training data, which are often laboriously collected under the professional instruction of psychology. In this paper, we study compound FER in the cross-domain few-shot learning setting, where only a few images of novel classes from the target domain are required as a reference. In particular, we aim to identify unseen compound expressions with the model trained on easily accessible basic expression datasets. To alleviate the problem of limited base classes in our FER task, we propose a novel Emotion Guided Similarity Network (EGS-Net), consisting of an emotion branch and a similarity branch, based on a two-stage learning framework. Specifically, in the first stage, the similarity branch is jointly trained with the emotion branch in a multi-task fashion. With the regularization of the emotion branch, we prevent the similarity branch from overfitting to sampled base classes that are highly overlapped across different episodes. In the second stage, the emotion branch and the similarity branch play a "two-student game" to alternately learn from each other, thereby further improving the inference ability of the similarity branch on unseen compound expressions. Experimental results on both in-the-lab and in-the-wild compound expression datasets demonstrate the superiority of our proposed method against several state-of-the-art methods.

AAAI Conference 2021 Conference Paper

Adversarial Partial Multi-Label Learning with Label Disambiguation

  • Yan Yan
  • Yuhong Guo

Partial multi-label learning (PML), which tackles the problem of learning multi-label prediction models from instances with overcomplete noisy annotations, has recently started gaining attention from the research community. In this paper, we propose a novel adversarial learning model, PML-GAN, under a generalized encoder-decoder framework for partial multi-label learning. The PML-GAN model uses a disambiguation network to identify irrelevant labels and uses a multi-label prediction network to map the training instances to their disambiguated label vectors, while deploying a generative adversarial network as an inverse mapping from label vectors to data samples in the input feature space. The learning of the overall model corresponds to a minimax adversarial game, which enhances the correspondence of input features with the output labels in a bi-directional mapping. Extensive experiments are conducted on both synthetic and real-world partial multi-label datasets, while the proposed model demonstrates the state-of-the-art performance.

IJCAI Conference 2021 Conference Paper

Multi-level Generative Models for Partial Label Learning with Non-random Label Noise

  • Yan Yan
  • Yuhong Guo

Partial label (PL) learning tackles the problem where each training instance is associated with a set of candidate labels that include both the true label and some irrelevant noise labels. In this paper, we propose a novel multi-level generative model for partial label learning (MGPLL), which tackles the PL problem by learning both a label level adversarial generator and a feature level adversarial generator under a bi-directional mapping framework between the label vectors and the data samples. MGPLL uses a conditional noise label generation network to model the non-random noise labels and perform label denoising, and uses a multi-class predictor to map the training instances to the denoised label vectors, while a conditional data feature generator is used to form an inverse mapping from the denoised label vectors to data samples. Both the noise label generator and the data feature generator are learned in an adversarial manner to match the observed candidate labels and data features respectively. We conduct extensive experiments on both synthesized and real-world partial label datasets. The proposed approach demonstrates the state-of-the-art performance for partial label learning.

AAAI Conference 2020 Conference Paper

Adversarial Localized Energy Network for Structured Prediction

  • Pingbo Pan
  • Ping Liu
  • Yan Yan
  • Tianbao Yang
  • Yi Yang

This paper focuses on energy model based structured output prediction. Though inheriting the benefits from energy-based models to handle the sophisticated cases, previous deep energy-based methods suffered from the substantial computation cost introduced by the enormous amounts of gradient steps in the inference process. To boost the efficiency and accuracy of the energy-based models on structured output prediction, we propose a novel method analogous to the adversarial learning framework. Specifically, in our proposed framework, the generator consists of an inference network while the discriminator is comprised of an energy network. The two sub-modules, i.e., the inference network and the energy network, can benefit each other mutually during the whole computation process. On the one hand, our modified inference network can boost the efficiency by predicting good initializations and reducing the searching space for the inference process; On the other hand, inheriting the benefits of the energy network, the energy module in our network can evaluate the quality of the generated output from the inference network and correspondingly provides a resourceful guide to the training of the inference network. In the ideal case, the adversarial learning strategy makes sure the two sub-modules can achieve an equilibrium state after steps. We conduct extensive experiments to verify the effectiveness and efficiency of our proposed method.

AAAI Conference 2020 Conference Paper

Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

  • Hanyu Xuan
  • Zhenyu Zhang
  • Shuo Chen
  • Jian Yang
  • Yan Yan

In human multi-modality perception systems, the benefits of integrating auditory and visual information are extensive as they provide plenty of supplementary cues for understanding the events. Despite some recent methods proposed for such application, they cannot deal with practical conditions with temporal inconsistency. Inspired by the human system which puts different focuses at specific locations, time segments and media while performing multi-modality perception, we provide an attention-based method to simulate such process. Similar to the human mechanism, our network can adaptively select "where" to attend, "when" to attend and "which" to attend for audio-visual event localization. In this way, even with large temporal inconsistency between vision and audio, our network is able to adaptively trade information between different modalities and successfully achieve event localization. Our method achieves state-of-the-art performance on the AVE (Audio-Visual Event) dataset collected in real life. In addition, we also systematically investigate audio-visual event localization tasks. The visualization results also help us better understand how our model works.

NeurIPS Conference 2020 Conference Paper

Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization

  • Yan Yan
  • Yi Xu
  • Qihang Lin
  • Wei Liu
  • Tianbao Yang

Epoch gradient descent method (a.k.a. Epoch-GD) proposed by (Hazan and Kale, 2011) was deemed a breakthrough for stochastic strongly convex minimization, which achieves the optimal convergence rate of O(1/T) with T iterative updates for the objective gap. However, its extension to solving stochastic min-max problems with strong convexity and strong concavity still remains open, and it is still unclear whether a fast rate of O(1/T) for the duality gap is achievable for stochastic min-max optimization under strong convexity and strong concavity. Although some recent studies have proposed stochastic algorithms with fast convergence rates for min-max problems, they require additional assumptions about the problem, e.g., smoothness, bi-linear structure, etc. In this paper, we bridge this gap by providing a sharp analysis of epoch-wise stochastic gradient descent ascent method (referred to as Epoch-GDA) for solving strongly convex strongly concave (SCSC) min-max problems, without imposing any additional assumption about smoothness or the function's structure. To the best of our knowledge, our result is the first one that shows Epoch-GDA can achieve the optimal rate of O(1/T) for the duality gap of general SCSC min-max problems. We emphasize that such generalization of Epoch-GD for strongly convex minimization problems to Epoch-GDA for SCSC min-max problems is non-trivial and requires novel technical analysis. Moreover, we notice that the key lemma can also be used for proving the convergence of Epoch-GDA for weakly-convex strongly-concave min-max problems, leading to a nearly optimal complexity without resorting to smoothness or other structural conditions.
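A toy sketch of the epoch-wise scheme on a simple SCSC problem: within each epoch, run plain GDA with a fixed step size on noisy gradients; between epochs, halve the step size and restart from the epoch's iterate average. The objective, step sizes, and noise model below are illustrative choices, not from the paper:

```python
import random

def epoch_gda(grad_x, grad_y, x0, y0, epochs=6, steps=200, eta0=0.5):
    """Epoch-wise stochastic gradient descent ascent (Epoch-GDA sketch)."""
    x, y, eta = x0, y0, eta0
    for _ in range(epochs):
        sum_x = sum_y = 0.0
        for _ in range(steps):
            gx = grad_x(x, y) + random.gauss(0, 0.01)  # noisy gradient oracle
            gy = grad_y(x, y) + random.gauss(0, 0.01)
            x -= eta * gx  # descent on the min variable x
            y += eta * gy  # ascent on the max variable y
            sum_x += x
            sum_y += y
        x, y = sum_x / steps, sum_y / steps  # restart from the epoch average
        eta *= 0.5                           # shrink the step size per epoch
    return x, y

# Toy SCSC objective f(x, y) = x^2/2 + x*y - y^2/2, saddle point at (0, 0).
random.seed(0)
x_star, y_star = epoch_gda(lambda x, y: x + y, lambda x, y: x - y, 2.0, -1.5)
```

The averaging-and-restart step is the scheme's key ingredient; plain GDA with a constant step size would stall at a noise floor instead of converging.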

AAAI Conference 2020 Conference Paper

Partial Label Learning with Batch Label Correction

  • Yan Yan
  • Yuhong Guo

Partial label (PL) learning tackles the problem where each training instance is associated with a set of candidate labels, among which only one is the true label. In this paper, we propose a simple but effective batch-based partial label learning algorithm named PL-BLC, which tackles the partial label learning problem with batch-wise label correction (BLC). PL-BLC dynamically corrects the label confidence matrix of each training batch based on the current prediction network, and adopts a MixUp data augmentation scheme to enhance the underlying true labels against the redundant noisy labels. In addition, it introduces a teacher model through a consistency cost to ensure the stability of the batch-based prediction network update. Extensive experiments are conducted on synthesized and real-world partial label learning datasets, while the proposed approach demonstrates the state-of-the-art performance for partial label learning.
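The batch label correction step can be sketched as restricting and renormalizing the model's class probabilities over the candidate set; this is a simplified stand-in for the paper's confidence-matrix update, with illustrative names:

```python
def correct_label_confidence(probs, candidates):
    """Restrict the model's class probabilities to the candidate label
    set and renormalize. Mass on labels outside the candidate set is
    discarded, sharpening confidence on the likely true label."""
    masked = {c: probs[c] for c in candidates}
    total = sum(masked.values())
    return {c: p / total for c, p in masked.items()}

# Candidate set {0, 2}: probability mass on labels 1 and 3 is dropped.
conf = correct_label_confidence({0: 0.5, 1: 0.2, 2: 0.25, 3: 0.05}, [0, 2])
```

Applied batch-wise with the current prediction network, repeated correction gradually concentrates each instance's confidence on its underlying true label.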

AAAI Conference 2019 Conference Paper

A Bottom-Up Clustering Approach to Unsupervised Person Re-Identification

  • Yutian Lin
  • Xuanyi Dong
  • Liang Zheng
  • Yan Yan
  • Yi Yang

Most person re-identification (re-ID) approaches are based on supervised learning, which requires intensive manual annotation for training data. However, it is not only resource-intensive to acquire identity annotation but also impractical to label the large-scale real-world data. To relieve this problem, we propose a bottom-up clustering (BUC) approach to jointly optimize a convolutional neural network (CNN) and the relationship among the individual samples. Our algorithm considers two fundamental facts in the re-ID task, i.e., diversity across different identities and similarity within the same identity. Specifically, our algorithm starts with regarding each individual sample as a different identity, which maximizes the diversity over each identity. Then it gradually groups similar samples into one identity, which increases the similarity within each identity. We utilize a diversity regularization term in the bottom-up clustering procedure to balance the data volume of each cluster. Finally, the model achieves an effective trade-off between the diversity and similarity. We conduct extensive experiments on the large-scale image and video re-ID datasets, including Market-1501, DukeMTMC-reID, MARS and DukeMTMC-VideoReID. The experimental results demonstrate that our algorithm is not only superior to state-of-the-art unsupervised re-ID approaches, but also performs favorably against competing transfer learning and semi-supervised learning methods.
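The merge criterion with a diversity term can be sketched as follows; 1-D points, single-linkage distance, and the exact penalty form are illustrative simplifications of the paper's procedure:

```python
def cluster_dist(a, b):
    """Single-linkage distance between two clusters of 1-D points."""
    return min(abs(x - y) for x in a for y in b)

def bottom_up_cluster(points, num_merges, lam=0.1):
    """Start from singleton clusters (every sample its own identity) and
    repeatedly merge the pair with the smallest distance plus a diversity
    penalty lam * (merged size), which discourages any one cluster from
    absorbing everything."""
    clusters = [[p] for p in points]
    for _ in range(num_merges):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = (cluster_dist(clusters[i], clusters[j])
                        + lam * (len(clusters[i]) + len(clusters[j])))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two tight groups of samples merge into two "identities".
groups = bottom_up_cluster([0.0, 0.1, 5.0, 5.1], num_merges=2)
```

In BUC itself, distances are computed in the CNN's feature space and the network is retrained after each round of merging.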

AAAI Conference 2019 Conference Paper

Adaptive Sparse Confidence-Weighted Learning for Online Feature Selection

  • Yanbin Liu
  • Yan Yan
  • Ling Chen
  • Yahong Han
  • Yi Yang

In this paper, we propose a new online feature selection algorithm for streaming data. We aim to focus on the following two problems which remain unaddressed in literature. First, most existing online feature selection algorithms merely utilize the first-order information of the data streams, regardless of the fact that second-order information explores the correlations between features and significantly improves the performance. Second, most online feature selection algorithms are based on the balanced data presumption, which is not true in many real-world applications. For example, in fraud detection, the number of positive examples is much less than negative examples because most cases are not fraud. The balanced assumption will make the selected features biased towards the majority class and fail to detect the fraud cases. We propose an Adaptive Sparse Confidence-Weighted (ASCW) algorithm to solve the aforementioned two problems. We first introduce an ℓ0-norm constraint into the second-order confidence-weighted (CW) learning for feature selection. Then the original loss is substituted with a cost-sensitive loss function to address the imbalanced data issue. Furthermore, our algorithm maintains multiple sparse CW learners with the corresponding cost vectors to dynamically select an optimal cost. We theoretically enhance the theory of sparse CW learning and analyze the performance behavior in F-measure. Empirical studies show the superior performance over the state-of-the-art online learning methods in the online-batch setting.

AAAI Conference 2019 Conference Paper

Hypergraph Optimization for Multi-Structural Geometric Model Fitting

  • Shuyuan Lin
  • Guobao Xiao
  • Yan Yan
  • David Suter
  • Hanzi Wang

Recently, some hypergraph-based methods have been proposed to deal with the problem of model fitting in computer vision, mainly due to the superior capability of hypergraphs to represent the complex relationships between data points. However, a hypergraph becomes extremely complicated when the input data include a large number of data points (usually contaminated with noise and outliers), which significantly increases the computational burden. To overcome this problem, we propose a novel hypergraph optimization based model fitting (HOMF) method to construct a simple but effective hypergraph. Specifically, HOMF includes two main parts: an adaptive inlier estimation algorithm for vertex optimization and an iterative hyperedge optimization algorithm for hyperedge optimization. The proposed method is highly efficient, and it can obtain accurate model fitting results within a few iterations. Moreover, HOMF can directly apply spectral clustering to achieve good fitting performance. Extensive experimental results show that HOMF outperforms several state-of-the-art model fitting methods on both synthetic data and real images, especially in sampling efficiency and in handling data with severe outliers.

YNICL Journal 2019 Journal Article

MR diffusion changes in the perimeter of the lateral ventricles demonstrate periventricular injury in post-hemorrhagic hydrocephalus of prematurity

  • Albert M. Isaacs
  • Christopher D. Smyser
  • Rachel E. Lean
  • Dimitrios Alexopoulos
  • Rowland H. Han
  • Jeffrey J. Neil
  • Sophia A. Zimbalist
  • Cynthia E. Rogers

OBJECTIVES: Injury to the preterm lateral ventricular perimeter (LVP), which contains the neural stem cells responsible for brain development, may contribute to the neurological sequelae of intraventricular hemorrhage (IVH) and post-hemorrhagic hydrocephalus of prematurity (PHH). This study utilizes diffusion MRI (dMRI) to characterize the microstructural effects of IVH/PHH on the LVP and segmented frontal-occipital horn perimeters (FOHP). STUDY DESIGN: Prospective study of 56 full-term (FT) infants, 72 very preterm infants without brain injury (VPT), 17 VPT infants with high-grade IVH without hydrocephalus (HG-IVH), and 13 VPT infants with PHH who underwent dMRI at term equivalent. LVP and FOHP dMRI measures and ventricular size-dMRI correlations were assessed. RESULTS: In the LVP, PHH had consistently lower FA and higher MD and RD than FT and VPT (p < .050). Ventricular size correlated negatively with FA, and positively with MD and RD (p < .001), in both the LVP and FOHP. In the PHH group, FA was lower in the FOHP than in the LVP, which was contrary to the findings in the healthy infants (p < .001). Nevertheless, there were no regional differences in AD, MD, and RD in the PHH group. CONCLUSION: HG-IVH and PHH result in aberrant LVP/FOHP microstructure, with prominent abnormalities in the PHH group, most notably in the FOHP. Larger ventricular size was associated with a greater magnitude of abnormality. LVP/FOHP dMRI measures may provide valuable biomarkers for future studies directed at improving the management and neurological outcomes of IVH/PHH.

IJCAI Conference 2019 Conference Paper

Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching

  • Zhibin Hu
  • Yongsheng Luo
  • Jiong Lin
  • Yan Yan
  • Jian Chen

Image-text matching is central to visual-semantic cross-modal retrieval and has been attracting extensive attention recently. Previous studies have been devoted to finding the latent correspondence between image regions and words, e.g., connecting key words to specific regions of salient objects. However, existing methods usually handle concrete objects rather than abstract ones, e.g., a description of an action, which are in fact also ubiquitous in real-world description texts. The main challenge in dealing with abstract objects is that there are no explicit connections between them, unlike their concrete counterparts. One therefore has to find the implicit, intrinsic connections between them instead. In this paper, we propose a relation-wise dual attention network (RDAN) for image-text matching. Specifically, we maintain an over-complete set that contains pairs of regions and words. Built upon this set, we encode the local correlations and the global dependencies between regions and words by training a visual-semantic network. A dual-pathway attention network is then presented to infer the visual-semantic alignments and image-text similarity. Extensive experiments validate the efficacy of our method, which achieves state-of-the-art performance on several public benchmark datasets.

NeurIPS Conference 2019 Conference Paper

Stagewise Training Accelerates Convergence of Testing Error Over SGD

  • Zhuoning Yuan
  • Yan Yan
  • Rong Jin
  • Tianbao Yang

Stagewise training strategies are widely used for learning neural networks: a stochastic algorithm (e.g., SGD) is run starting with a relatively large step size (aka learning rate), and the step size is geometrically decreased after a number of iterations. It has been observed that stagewise SGD converges much faster than vanilla SGD with a polynomially decaying step size, in terms of both training error and testing error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for this faster convergence. In particular, we consider a stagewise training strategy for minimizing empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of "nicely behaved" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise training than vanilla SGD under the PL condition, on both training error and testing error. Experiments on stagewise learning of deep residual networks show that they satisfy one type of non-convexity assumption and can therefore be explained by our theory.
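The schedule the abstract contrasts with polynomial decay is easy to make concrete. Below is a minimal sketch of stagewise SGD on a toy quadratic, with hypothetical stage counts and step sizes; the paper's analysis concerns neural-network training, not this toy problem:

```python
import numpy as np

def stagewise_sgd(grad, x0, eta0=0.5, stages=4, iters_per_stage=50, decay=0.5, seed=0):
    """Run SGD with a fixed step size within each stage, then geometrically
    shrink the step size between stages (the stagewise schedule)."""
    rng = np.random.default_rng(seed)
    x, eta = np.array(x0, dtype=float), eta0
    for _ in range(stages):
        for _ in range(iters_per_stage):
            noise = rng.normal(scale=0.01, size=x.shape)  # stochastic gradient noise
            x = x - eta * (grad(x) + noise)
        eta *= decay  # geometric step-size decrease after each stage
    return x

# minimize f(x) = ||x||^2 / 2, whose gradient is x
x_star = stagewise_sgd(lambda x: x, [2.0, -3.0])
```

The large early step size makes fast initial progress; the geometric shrinkage then reduces the noise floor stage by stage, which is the behavior the paper's bounds formalize.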

IJCAI Conference 2018 Conference Paper

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

  • Yan Yan
  • Tianbao Yang
  • Zhe Li
  • Qihang Lin
  • Yi Yang

Stochastic momentum methods have been widely adopted in training deep neural networks. However, the theoretical analysis of their convergence on the training objective and of the generalization error for prediction is still under-explored. This paper aims to bridge the gap between practice and theory by analyzing the stochastic gradient (SG) method and stochastic momentum methods, including two famous variants, i.e., the stochastic heavy-ball (SHB) method and the stochastic variant of Nesterov's accelerated gradient (SNAG) method. We propose a framework that unifies the three variants. We then derive convergence rates of the norm of the gradient for the non-convex optimization problem, and analyze the generalization performance through the uniform stability approach. In particular, the convergence analysis of the training objective shows that SHB and SNAG have no advantage over SG, while the stability analysis shows that the momentum term can improve the stability of the learned model and hence the generalization performance. These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning.
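One common way to unify SG, SHB and SNAG is a single update with an interpolation parameter `s` (`s = 0` gives a heavy-ball-style step, `s = 1` a Nesterov-style step, and `beta = 0` recovers plain SG). The sketch below is schematic and may differ from the paper's exact parameterization:

```python
import numpy as np

def unified_momentum(grad, x0, alpha=0.1, beta=0.9, s=1.0, iters=200):
    """Schematic unified momentum update on a deterministic toy gradient.

    y is the plain gradient step, ys an auxiliary point interpolated by s;
    the momentum correction uses the change in the auxiliary sequence.
    """
    x = np.array(x0, dtype=float)
    ys_prev = x.copy()
    for _ in range(iters):
        g = grad(x)
        y = x - alpha * g              # gradient step
        ys = x - s * alpha * g         # auxiliary point (s interpolates variants)
        x = y + beta * (ys - ys_prev)  # momentum correction
        ys_prev = ys
    return x

# minimize f(x) = x^2 / 2 with the Nesterov-style (s=1) and plain-SG (beta=0) settings
x_nag = unified_momentum(lambda v: v, [1.0], s=1.0)
x_sgd = unified_momentum(lambda v: v, [1.0], beta=0.0)
```

Both settings drive the iterate to the minimizer on this convex toy; the paper's point is that the variants differ in stability (hence generalization) rather than in training convergence rate.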

TIST Journal 2018 Journal Article

Few-Shot Text and Image Classification via Analogical Transfer Learning

  • Wenhe Liu
  • Xiaojun Chang
  • Yan Yan
  • Yi Yang
  • Alexander G. Hauptmann

Learning from very few samples is a challenge for machine learning tasks such as text and image classification. Performance on such tasks can be enhanced via the transfer of helpful knowledge from related domains, which is referred to as transfer learning. In previous transfer learning work, instance transfer learning algorithms mostly focus on selecting source-domain instances similar to the target-domain instances for transfer. However, the selected instances usually do not directly contribute to the learning performance in the target domain. Hypothesis transfer learning algorithms focus on model/parameter-level transfer. They treat the source hypotheses as well trained and transfer their knowledge, in terms of parameters, to learn the target hypothesis. Such algorithms directly optimize the target hypothesis by observable performance improvements. However, they fail to consider that instances contributing to the source hypotheses may be harmful for the target hypothesis, as analyzed in instance transfer learning. To relieve these problems, we propose a novel transfer learning algorithm that follows an analogical strategy. In particular, the proposed algorithm first learns a revised source hypothesis using only the instances that contribute to the target hypothesis. It then transfers both the revised source hypothesis and the target hypothesis (trained with only a few samples) to learn an analogical hypothesis. We denote our algorithm Analogical Transfer Learning. Extensive experiments on one synthetic dataset and three real-world benchmark datasets demonstrate the superior performance of the proposed algorithm.

AAAI Conference 2017 Conference Paper

A Framework of Online Learning with Imbalanced Streaming Data

  • Yan Yan
  • Tianbao Yang
  • Yi Yang
  • Jianhui Chen

A challenge for mining large-scale streaming data that is overlooked by most existing studies on online learning is the skewed distribution of examples over different classes. Many previous works have considered cost-sensitive approaches in an online setting for streaming data, where fixed costs are assigned to different classes, or ad-hoc costs are adapted based on the distribution of the data received so far. However, they do not necessarily achieve optimal performance in terms of the measures suited for imbalanced data, such as the F-measure, the area under the ROC curve (AUROC), and the area under the precision-recall curve (AUPRC). This work proposes a general framework for online learning with imbalanced streaming data, where examples arrive sequentially and models are updated accordingly on the fly. By simultaneously learning multiple classifiers with different cost vectors, the proposed method can be adapted to different target measures for imbalanced data, including the F-measure, AUROC and AUPRC. Moreover, we present a rigorous theoretical justification of the proposed framework for F-measure maximization. Our empirical studies demonstrate the competitive, if not better, performance of the proposed method compared to previous cost-sensitive and resampling-based online learning algorithms, as well as those designed for optimizing specific measures.
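The idea of running several cost-sensitive learners in parallel and selecting by the target measure can be illustrated with a toy cost-weighted perceptron (not the paper's algorithm; the cost grid, the update rule and the toy stream below are hypothetical):

```python
import numpy as np

def cost_sensitive_perceptron(stream, costs=(1.0, 2.0, 4.0), dim=2):
    """Run one cost-weighted perceptron per candidate cost, track each
    learner's online F-measure, and return the best learner.
    Mistakes on the positive (minority) class are weighted by the cost."""
    ws = [np.zeros(dim) for _ in costs]
    stats = [[0, 0, 0] for _ in costs]  # tp, fp, fn per learner
    for x, y in stream:  # labels y in {+1, -1}
        for k, c in enumerate(costs):
            pred = 1 if ws[k] @ x >= 0 else -1
            if pred == 1 and y == 1:
                stats[k][0] += 1
            elif pred == 1 and y == -1:
                stats[k][1] += 1
            elif pred == -1 and y == 1:
                stats[k][2] += 1
            if pred != y:  # cost-weighted perceptron update
                ws[k] += (c if y == 1 else 1.0) * y * x

    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    best = max(range(len(costs)), key=lambda k: f1(*stats[k]))
    return ws[best], costs[best]

# toy imbalance-free separable stream, repeated twice
stream = [(np.array([1.0, 1.0]), 1), (np.array([-1.0, -1.0]), -1),
          (np.array([1.0, 0.5]), 1), (np.array([-1.0, -0.5]), -1)] * 2
w, best_cost = cost_sensitive_perceptron(stream)
```

The paper's framework replaces this heuristic selection with a principled scheme covering F-measure, AUROC and AUPRC.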

AAAI Conference 2016 Conference Paper

Fortune Teller: Predicting Your Career Path

  • Ye Liu
  • Luming Zhang
  • Liqiang Nie
  • Yan Yan
  • David Rosenblum

People go to fortune tellers in hopes of learning things about their future. A future career path is one of the topics most frequently discussed. But rather than rely on “black arts” to make predictions, in this work we scientifically and systematically study the feasibility of career path prediction from social network data. In particular, we seamlessly fuse information from multiple social networks to comprehensively describe a user and characterize progressive properties of his or her career path. This is accomplished via a multi-source learning framework with fused lasso penalty, which jointly regularizes the source and career-stage relatedness. Extensive experiments on real-world data confirm the accuracy of our model.

NeurIPS Conference 2016 Conference Paper

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than $O(1/\epsilon)$

  • Yi Xu
  • Yan Yan
  • Qihang Lin
  • Tianbao Yang

In this paper, we develop a novel homotopy smoothing (HOPS) algorithm for solving a family of non-smooth problems composed of a non-smooth term with an explicit max-structure and a smooth term (or a simple non-smooth term whose proximal mapping is easy to compute). The best known iteration complexity for solving such non-smooth optimization problems is $O(1/\epsilon)$ without any strong convexity assumption. In this work, we show that the proposed HOPS achieves a lower iteration complexity of $\tilde O(1/\epsilon^{1-\theta})$ with $\theta\in(0, 1]$ capturing the local sharpness of the objective function around the optimal solutions. To the best of our knowledge, this is the lowest iteration complexity achieved so far for the considered non-smooth optimization problems without a strong convexity assumption. The HOPS algorithm employs Nesterov's smoothing technique and Nesterov's accelerated gradient method, and runs in stages that gradually decrease the smoothing parameter until it yields a sufficiently good approximation of the original function. We show that HOPS enjoys linear convergence for many well-known non-smooth problems (e.g., empirical risk minimization with a piecewise linear loss function and an $\ell_1$-norm regularizer, finding a point in a polyhedron, cone programming, etc.). Experimental results verify the effectiveness of HOPS in comparison with Nesterov's smoothing algorithm and primal-dual style first-order methods.
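The stage-wise shrinking of the smoothing parameter can be illustrated on the simplest non-smooth objective, f(x) = |x|, with a Huber-style surrogate. This sketch substitutes plain gradient descent for Nesterov's accelerated method and uses hypothetical stage settings:

```python
import numpy as np

def hops_abs(x0, mu0=1.0, stages=5, iters=100, shrink=0.5):
    """Homotopy-smoothing sketch for f(x) = |x|.

    Replace |x| by the Huber smoothing with parameter mu, minimize the
    smooth surrogate by gradient descent, then shrink mu each stage so the
    surrogate approximates the original function more and more tightly.
    """
    x, mu = float(x0), mu0
    for _ in range(stages):
        L = 1.0 / mu  # smoothness constant of the Huber surrogate
        for _ in range(iters):
            g = x / mu if abs(x) <= mu else np.sign(x)  # Huber gradient
            x -= g / L  # gradient step with step size 1/L
        mu *= shrink  # tighten the smoothing for the next stage
    return x

x_min = hops_abs(3.0)
```

Early stages use a heavily smoothed (hence well-conditioned) surrogate for fast progress; later stages refine near the kink, mirroring the stage-wise complexity argument in the abstract.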

AAAI Conference 2016 Conference Paper

Learning Sparse Confidence-Weighted Classifier on Very High Dimensional Data

  • Mingkui Tan
  • Yan Yan
  • Li Wang
  • Anton van den Hengel
  • Ivor W. Tsang
  • Qinfeng (Javen) Shi

Confidence-weighted (CW) learning is a successful online learning paradigm which maintains a Gaussian distribution over classifier weights and adopts a covariance matrix to represent the uncertainties of the weight vectors. However, there are two deficiencies in existing full CW learning paradigms, these being the sensitivity to irrelevant features, and the poor scalability to high dimensional data due to the maintenance of the covariance structure. In this paper, we begin by presenting an online-batch CW learning scheme, and then present a novel paradigm to learn sparse CW classifiers. The proposed paradigm essentially identifies feature groups and naturally builds a block diagonal covariance structure, making it very suitable for CW learning over very high-dimensional data. Extensive experimental results demonstrate the superior performance of the proposed methods over state-of-the-art counterparts on classification and feature selection tasks.

AAAI Conference 2016 Conference Paper

Robust Semi-Supervised Learning through Label Aggregation

  • Yan Yan
  • Zhongwen Xu
  • Ivor Tsang
  • Guodong Long
  • Yi Yang

Semi-supervised learning is proposed to exploit both labeled and unlabeled data. However, as the scale of data in real-world applications increases significantly, conventional semi-supervised algorithms usually incur massive computational cost and cannot be applied to large-scale datasets. In addition, label noise is usually present in practical applications due to human annotation, which very likely results in remarkable degeneration of the performance of semi-supervised methods. To address these two challenges, in this paper we propose an efficient RObust Semi-Supervised Ensemble Learning (ROSSEL) method, which generates pseudo-labels for unlabeled data using a set of weak annotators and combines them to approximate the ground-truth labels to assist semi-supervised learning. We formulate the weighted combination process as a multiple label kernel learning (MLKL) problem which can be solved efficiently. Compared with other semi-supervised learning algorithms, the proposed method has linear time complexity. Extensive experiments on five benchmark datasets demonstrate the superior effectiveness, efficiency and robustness of the proposed algorithm.

AAAI Conference 2015 Conference Paper

Complex Event Detection via Event Oriented Dictionary Learning

  • Yan Yan
  • Yi Yang
  • Haoquan Shen
  • Deyu Meng
  • Gaowen Liu
  • Alex Hauptmann
  • Nicu Sebe

Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select high-level, semantically meaningful concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose two novel strategies to automatically select semantically meaningful concepts for the event detection task, based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event-oriented dictionary representation based on the selected semantic concepts. Towards this goal, we leverage training samples of selected concepts from the Semantic Indexing (SIN) dataset, with a pool of 346 concepts, in a novel supervised multi-task dictionary learning framework. Extensive experimental results on the TRECVID Multimedia Event Detection (MED) dataset demonstrate the efficacy of our proposed method.

IJCAI Conference 2015 Conference Paper

Inferring Painting Style with Multi-Task Dictionary Learning

  • Gaowen Liu
  • Yan Yan
  • Elisa Ricci
  • Yi Yang
  • Yahong Han
  • Stefan Winkler
  • Nicu Sebe

Recent advances in imaging and multimedia technologies have paved the way for automatic analysis of visual art. Despite notable attempts, extracting relevant patterns from paintings is still a challenging task. Different painters, born in different periods and places, have been influenced by different schools of art. However, each individual artist also has a unique signature, which is hard to detect with algorithms and objective features. In this paper we propose a novel dictionary learning approach to automatically uncover the artistic style of paintings. Specifically, we present a multi-task learning algorithm to learn a style-specific dictionary representation. Intuitively, our approach, by automatically decoupling style-specific and artist-specific patterns, is expected to be more accurate for retrieval and recognition tasks than generic methods. To demonstrate the effectiveness of our approach, we introduce the DART dataset, containing more than 1.5K images of paintings representative of different styles. Our extensive experimental evaluation shows that our approach significantly outperforms state-of-the-art methods.

IJCAI Conference 2015 Conference Paper

Looking at Mondrian's Victory Boogie-Woogie: What Do I Feel?

  • Andreza Sartori
  • Yan Yan
  • Gözde Özbal
  • Alkim Almila Akdag Salah
  • Albert Ali Salah
  • Nicu Sebe

Abstract artists use non-figurative elements (i.e., colours, lines, shapes, and textures) to convey emotions, and often rely on the titles of their compositions to generate (or enhance) an emotional reaction in the audience. Several psychological works have observed that the metadata (i.e., titles, descriptions and/or artist statements) associated with paintings increase the understanding and the aesthetic appreciation of artworks. In this paper we explore whether the same metadata could facilitate the computational analysis of artworks and reveal what kind of emotional responses they evoke. To this end, we employ computer vision and sentiment analysis to learn statistical patterns associated with positive and negative emotions in abstract paintings. We propose a multimodal approach which combines both visual and metadata features in order to improve machine performance. In particular, we propose a novel joint flexible Schatten p-norm model which can exploit the shared patterns between visual and textual information for abstract painting emotion analysis. Moreover, we conduct a qualitative analysis of the cases in which metadata help improve machine performance.

IJCAI Conference 2015 Conference Paper

Scalable Maximum Margin Matrix Factorization by Active Riemannian Subspace Search

  • Yan Yan
  • Mingkui Tan
  • Ivor Tsang
  • Yi Yang
  • Chengqi Zhang
  • Qinfeng Shi

The user ratings in recommendation systems are usually in the form of ordinal discrete values. To give more accurate predictions of such rating data, maximum margin matrix factorization (M3F) was proposed. Existing M3F algorithms, however, either have massive computational cost or require expensive model selection procedures to determine the number of latent factors (i.e., the rank of the matrix to be recovered), making them less practical for large-scale datasets. To address these two challenges, in this paper we formulate M3F with a known number of latent factors as a Riemannian optimization problem on a fixed-rank matrix manifold, and present a block-wise nonlinear Riemannian conjugate gradient method to solve it efficiently. We then apply a simple and efficient active subspace search scheme to automatically detect the number of latent factors. Empirical studies on both synthetic datasets and large real-world datasets demonstrate the superior efficiency and effectiveness of the proposed method.

AAAI Conference 2014 Conference Paper

Hybrid Heterogeneous Transfer Learning through Deep Learning

  • Joey Zhou
  • Sinno Pan
  • Ivor Tsang
  • Yan Yan

Most previous heterogeneous transfer learning methods learn a cross-domain feature mapping between heterogeneous feature spaces based on a few cross-domain instance correspondences, and these corresponding instances are assumed to be representative in the source and target domains, respectively. However, in many real-world scenarios this assumption may not hold. As a result, the constructed feature mapping may not be precise, due to bias in the correspondences in the target or (and) source domain(s). In this case, a classifier trained on the labeled transformed source-domain data may not be useful for the target domain. In this paper, we present a new transfer learning framework called Hybrid Heterogeneous Transfer Learning (HHTL), which allows the corresponding instances across domains to be biased in either the source or the target domain. Specifically, we propose a deep learning approach to learn a feature mapping between cross-domain heterogeneous features, as well as a better feature representation for the mapped data, to reduce the bias caused by the cross-domain correspondences. Extensive experiments on several multilingual sentiment classification tasks verify the effectiveness of our proposed approach compared with baseline methods.

ICRA Conference 2012 Conference Paper

Almost-uniform sampling of rotations for conformational searches in Robotics and Structural Biology

  • Yan Yan
  • Gregory S. Chirikjian

We propose a new method for sampling the rotation group that involves decomposing it into identical Voronoi cells centered on rotational symmetry operations of the Platonic solids. Within each cell, Cartesian grids in exponential coordinates are used to achieve almost-uniform sampling at any level of resolution, without having to store large numbers of coordinates, and without requiring sophisticated data structures. We analyze the shape of these cells, and explain how this new method can be used in the context of conformational searches in the fields of Robotics and Structural Biology.
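A stripped-down version of sampling in exponential coordinates (a plain Cartesian grid over the radius-π ball, mapped to rotations via Rodrigues' formula, without the Voronoi-cell decomposition that gives the paper's method its near-uniformity guarantees) might look like:

```python
import numpy as np

def exp_to_rot(w):
    """Rodrigues' formula: exponential coordinates w (axis * angle) -> 3x3 rotation."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])  # cross-product (skew) matrix of the axis
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def sample_rotations(n_per_axis=5):
    """Cartesian grid in exponential coordinates, clipped to the ball of
    radius pi (which covers all of SO(3), with the usual double-cover
    redundancy on the boundary)."""
    ticks = np.linspace(-np.pi, np.pi, n_per_axis)
    rots = []
    for wx in ticks:
        for wy in ticks:
            for wz in ticks:
                w = np.array([wx, wy, wz])
                if np.linalg.norm(w) <= np.pi + 1e-9:
                    rots.append(exp_to_rot(w))
    return rots

rots = sample_rotations(5)
```

The paper's contribution is precisely what this sketch omits: restricting such grids to identical Voronoi cells around the Platonic-solid symmetry rotations so that sampling density stays almost uniform over the rotation group.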