Arrow Research search

Author name cluster

Xiangyuan Lan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers (24)

AAAI Conference 2026 Conference Paper

Bolster Hallucination Detection via Prompt-Guided Data Augmentation

  • Wenyun Li
  • Zheng Zhang
  • Dongmei Jiang
  • Xiangyuan Lan

Large language models (LLMs) have garnered significant interest in the AI community. Despite their impressive generation capabilities, they have been found to produce misleading or fabricated information, a phenomenon known as hallucination. Consequently, hallucination detection has become critical to ensure the reliability of LLM-generated content. One primary challenge in hallucination detection is the scarcity of well-labeled datasets containing both truthful and hallucinated outputs. To address this issue, we introduce Prompt-guided data Augmented haLlucination dEtection (PALE), a novel framework that leverages prompt-guided responses from LLMs as data augmentation for hallucination detection. This strategy can generate both truthful and hallucinated data under prompt guidance at a relatively low cost. To more effectively evaluate the truthfulness of the sparse intermediate embeddings produced by LLMs, we introduce an estimation metric called the Contrastive Mahalanobis Score (CM Score). This score is based on modeling the distributions of truthful and hallucinated data in the activation space. CM Score employs a matrix decomposition approach to more accurately capture the underlying structure of these distributions. Importantly, our framework does not require additional human annotations, offering strong generalizability and practicality for real-world applications. Extensive experiments demonstrate that PALE achieves superior hallucination detection performance, outperforming the competitive baseline by a significant margin of 6.55%.
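As a rough illustration of the kind of scoring the abstract describes, the sketch below fits class-conditional Gaussians to truthful and hallucinated activations and contrasts the two Mahalanobis distances; the function names and the exact form of the score (including the paper's matrix decomposition) are illustrative assumptions, not the published method.

```python
# Minimal sketch of a contrastive Mahalanobis-style score over LLM activations.
# Assumes simple class-conditional Gaussians; PALE's actual CM Score, which uses
# a matrix decomposition, may differ.
import numpy as np

def fit_gaussian(embeddings: np.ndarray, eps: float = 1e-6):
    """Fit a mean and regularized precision matrix to a set of activation vectors."""
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu
    cov = centered.T @ centered / max(len(embeddings) - 1, 1)
    cov += eps * np.eye(cov.shape[0])          # regularize for invertibility
    return mu, np.linalg.inv(cov)

def mahalanobis(x: np.ndarray, mu: np.ndarray, prec: np.ndarray) -> float:
    d = x - mu
    return float(d @ prec @ d)

def contrastive_score(x, truthful_emb, hallucinated_emb) -> float:
    """Higher values indicate the embedding lies closer to the hallucinated distribution."""
    mu_t, prec_t = fit_gaussian(truthful_emb)
    mu_h, prec_h = fit_gaussian(hallucinated_emb)
    return mahalanobis(x, mu_t, prec_t) - mahalanobis(x, mu_h, prec_h)
```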

AAAI Conference 2026 Conference Paper

CATAL: Causally Disentangled Task Representation Learning for Offline Meta-Reinforcement Learning

  • Shan Cong
  • Chao Yu
  • Xiangyuan Lan

Context-based Offline Meta Reinforcement Learning (COMRL) has shown promising results in improving the cross-task generalization ability of meta-policies. However, current methods often lead to entangled task representations, in which each latent dimension is influenced by multiple causal factors that govern variations in environment dynamics and reward mechanisms. This entanglement can degrade generalization performance, particularly when multiple causal factors vary simultaneously across tasks. To address this limitation, we propose the CAusally disentangled TAsk representation Learning (CATAL) method for COMRL, which aims to improve the generalization ability of the meta-policy, where each latent dimension in the task representations aligns to a single causal factor. Theoretically, we show that under mild conditions, the task representations learned by CATAL are causally disentangled. Empirically, extensive results on multi-task MuJoCo benchmarks show that CATAL consistently outperforms existing COMRL baselines in both in-distribution and out-of-distribution generalization.

AAAI Conference 2026 Conference Paper

X-SAM: From Segment Anything to Any Segmentation

  • Hao Wang
  • Limeng Qiao
  • Zequn Jie
  • Zhijian Huang
  • Chengjian Feng
  • Qingfang Zheng
  • Lin Ma
  • Xiangyuan Lan

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from segment anything to any segmentation. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.

AAAI Conference 2025 Conference Paper

DM-Adapter: Domain-Aware Mixture-of-Adapters for Text-Based Person Retrieval

  • Yating Liu
  • Zimo Liu
  • Xiangyuan Lan
  • Wenming Yang
  • Yaowei Li
  • Qingmin Liao

Text-based person retrieval (TPR) has gained significant attention as a fine-grained and challenging task that closely aligns with practical applications. Tailoring CLIP to the person domain is now an emerging research topic due to the abundant knowledge of vision-language pretraining, but challenges still remain during fine-tuning: (i) Previous full-model fine-tuning in TPR is computationally expensive and prone to overfitting. (ii) Existing parameter-efficient transfer learning (PETL) for TPR lacks fine-grained feature extraction. To address these issues, we propose Domain-Aware Mixture-of-Adapters (DM-Adapter), which unifies Mixture-of-Experts (MoE) and PETL to enhance fine-grained feature representations while maintaining efficiency. Specifically, Sparse Mixture-of-Adapters is designed in parallel to MLP layers in both vision and language branches, where different experts specialize in distinct aspects of person knowledge to handle features more finely. To promote the router to exploit domain information effectively and alleviate the routing imbalance, Domain-Aware Router is then developed by building a novel gating function and injecting learnable domain-aware prompts. Extensive experiments show that our DM-Adapter achieves state-of-the-art performance, outperforming previous methods by a significant margin.
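To make the adapter idea concrete, here is a minimal sketch of a sparse mixture-of-adapters layer intended to sit in parallel with an MLP block, assuming top-1 routing and bottleneck adapters; the paper's domain-aware router, gating function, and learnable domain prompts are not reproduced.

```python
# Minimal sketch of a sparse mixture-of-adapters layer (top-1 routing assumed).
# Hyperparameters and module names are illustrative, not the DM-Adapter design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMixtureOfAdapters(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, bottleneck: int = 64):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.down = nn.ModuleList([nn.Linear(dim, bottleneck) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(bottleneck, dim) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); each token is routed to its highest-scoring expert
        gates = F.softmax(self.router(x), dim=-1)       # (B, T, E)
        weight, idx = gates.max(dim=-1, keepdim=True)   # top-1 gate value and index
        out = torch.zeros_like(x)
        for e in range(len(self.down)):
            mask = idx.squeeze(-1) == e
            if mask.any():
                out[mask] = self.up[e](F.gelu(self.down[e](x[mask])))
        # The gated adapter output is meant to be added to the parallel MLP output.
        return weight * out
```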

ICLR Conference 2025 Conference Paper

EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment

  • Yifei Xing 0001
  • Xiangyuan Lan
  • Ruiping Wang 0001
  • Dongmei Jiang
  • Wenjun Huang
  • Qingfang Zheng
  • Yaowei Wang 0001

Mamba-based architectures have been shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLMs) are insufficient in extracting visual features, leading to imbalanced cross-modal alignment between visual and textual latents, negatively impacting performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code is provided at https://github.com/xingyifei2016/EMMA.

NeurIPS Conference 2025 Conference Paper

Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

  • Songjun Tu
  • Jiahao Lin
  • Qichao Zhang
  • Xiangyu Tian
  • Linjing Li
  • Xiangyuan Lan
  • Dongbin Zhao

Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities—enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy–efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4% while reducing token usage by 52% on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs. Project Page: https://github.com/ScienceOne-AI/AutoThink.

AAMAS Conference 2025 Conference Paper

Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model

  • Songjun Tu
  • Jingbo Sun
  • Qichao Zhang
  • Xiangyuan Lan
  • Dongbin Zhao

Preference-based reinforcement learning (PbRL) provides a powerful paradigm to avoid meticulous reward engineering by learning rewards based on human preferences. However, real-time human feedback is hard to obtain in online tasks. Most works assume a "scripted teacher" that utilizes a privileged predefined reward to provide preference feedback. In this paper, we propose an RL Self-augmented Large Language Model Feedback (RL-SaLLM-F) technique that does not rely on privileged information for online PbRL. RL-SaLLM-F leverages the reflective and discriminative capabilities of LLMs to generate self-augmented trajectories and provide preference labels for reward learning. First, we identify a failure issue in LLM-based preference discrimination, specifically "query ambiguity", in online PbRL. Then the LLM is employed to provide preference labels and generate self-augmented imagined trajectories that better achieve the task goal, thereby enhancing the quality and efficiency of feedback. Additionally, a double-check mechanism is introduced to mitigate randomness in the preference labels, improving the reliability of LLM feedback. Experiments across multiple tasks in the MetaWorld benchmark demonstrate the specific contributions of each proposed module in RL-SaLLM-F, and show that self-augmented LLM feedback can effectively replace the impractical "scripted teacher" feedback. In summary, RL-SaLLM-F introduces a new direction of feedback acquisition in online PbRL that does not rely on any online privileged information, offering an efficient and lightweight solution with LLM-driven feedback. Code Page: https://github.com/TU2021/RL-SaLLM-F.

ICML Conference 2025 Conference Paper

Open-Det: An Efficient Learning Framework for Open-Ended Detection

  • Guiping Cao
  • Tao Wang
  • Wenjian Huang 0001
  • Xiangyuan Lan
  • Jianguo Zhang 0001
  • Dongmei Jiang

Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, incorporating the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100s vs. 16 A100s), achieves even higher performance (+1.0% in APr). The source codes are available at: https://github.com/Med-Process/Open-Det.

NeurIPS Conference 2025 Conference Paper

Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

  • Feng Lu
  • Tong Jin
  • Canming Ye
  • Xiangyuan Lan
  • Yunpeng Liu
  • Chun Yuan

Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era, that is, we can obtain robust global descriptors only with the backbone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens will be jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens to the aggregation tokens. Finally, we only take these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert additional tokens, as well as the initialization of tokens, remains an open issue worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
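The aggregation-token mechanism is simple enough to sketch directly. Below is a minimal version that prepends learnable tokens to the patch tokens, runs them through a transformer block, and keeps only the aggregation tokens as the global descriptor; the token count, insertion depth, and normalization are assumptions rather than the paper's tuned configuration.

```python
# Minimal sketch of implicit aggregation via learnable tokens prepended to
# patch tokens; a single standard transformer block stands in for the tail of
# the backbone, which is an illustrative simplification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitAggregation(nn.Module):
    def __init__(self, dim: int = 768, num_agg_tokens: int = 4, num_heads: int = 12):
        super().__init__()
        self.agg_tokens = nn.Parameter(torch.zeros(1, num_agg_tokens, dim))
        nn.init.trunc_normal_(self.agg_tokens, std=0.02)
        self.block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from earlier backbone layers
        b = patch_tokens.size(0)
        tokens = torch.cat([self.agg_tokens.expand(b, -1, -1), patch_tokens], dim=1)
        tokens = self.block(tokens)                    # joint self-attention over all tokens
        agg = tokens[:, : self.agg_tokens.size(1)]     # keep only the aggregation tokens
        return F.normalize(agg.flatten(1), dim=-1)     # concatenate into a global descriptor
```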

AAAI Conference 2025 Conference Paper

Transferable Adversarial Face Attack with Text Controlled Attribute

  • Wenyun Li
  • Zheng Zhang
  • Xiangyuan Lan
  • Dongmei Jiang

Traditional adversarial attacks typically produce adversarial examples under norm-constrained conditions, whereas unrestricted adversarial examples are free-form with semantically meaningful perturbations. Current unrestricted adversarial impersonation attacks exhibit limited control over adversarial face attributes and often suffer from low transferability. In this paper, we propose a novel Text Controlled Attribute Attack (TCA2) to generate photorealistic adversarial impersonation faces guided by natural language. Specifically, the category-level personal softmax vector is employed to precisely guide the impersonation attacks. Additionally, we propose both data and model augmentation strategies to achieve transferable attacks on unknown target models. Finally, a generative model, i.e., StyleGAN, is utilized to synthesize impersonated faces with desired attributes. Extensive experiments on two high-resolution face recognition datasets validate that our TCA2 method can generate natural text-guided adversarial impersonation faces with high transferability. We also evaluate our method on real-world face recognition systems, i.e., Face++ and Aliyun, further demonstrating the practical potential of our approach.

ECAI Conference 2024 Conference Paper

Cascade Memory for Unsupervised Anomaly Detection

  • Jiahao Li 0007
  • Yiqiang Chen 0001
  • Yunbing Xing
  • Yang Gu 0001
  • Xiangyuan Lan

Unsupervised anomaly detection aims to detect previously unseen rare samples without any prior knowledge about them. With the emergence of deep learning, many methods employ normal data reconstruction to train detection models, which is expected to yield relatively large errors when reconstructing anomalies. However, recent studies find that anomalies can be overgeneralized, resulting in reconstruction errors as small as those of normal samples. In this paper, we examine the anomaly overgeneralization problem and propose global semantic information learning. Normal and anomalous samples may share the same local features such as textures, edges, and corners, but have separability at the global semantic level. To address this, we propose a novel cascade memory architecture designed to capture global semantic information in the latent space and introduce a configurable sparsification and random forgetting mechanism. Our proposed method achieves state-of-the-art experimental results on different public benchmarks, without the introduction of any additional auxiliary loss terms. The code is available at https://github.com/LiJiahao-Alex/Cascade-Memory.
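For intuition, a memory read with sparse addressing (a common building block in memory-based anomaly detection) can be sketched as below; the paper's cascade structure and random forgetting mechanism are not reproduced, and the slot count and shrinkage threshold are assumptions.

```python
# Minimal sketch of a memory read with sparse addressing; illustrative only,
# not the paper's cascade memory (no cascading or random forgetting here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMemory(nn.Module):
    def __init__(self, num_slots: int = 100, dim: int = 256, threshold: float = 0.02):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))
        self.threshold = threshold

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) latent codes from an encoder
        attn = F.softmax(z @ self.slots.t(), dim=-1)          # addressing weights over slots
        attn = F.relu(attn - self.threshold)                  # hard shrinkage -> sparse addressing
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
        return attn @ self.slots                              # latent reconstructed from memory
```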

AAAI Conference 2024 Conference Paper

Deep Homography Estimation for Visual Place Recognition

  • Feng Lu
  • Shuting Dong
  • Lijun Zhang
  • Bingxi Liu
  • Xiangyuan Lan
  • Dongmei Jiang
  • Chun Yuan

Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention due to the trade-off between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This makes existing methods compromise to train the network only in global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract the features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method outperforms several state-of-the-art methods and is more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
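As a rough sketch of the training signal described above, the following computes a re-projection error over inliers for a predicted homography, assuming matched keypoint pairs are available; the inlier threshold and the way the DHE network parameterizes the homography are illustrative assumptions.

```python
# Minimal sketch of a re-projection error of inliers loss for a 3x3 homography;
# point correspondences and the pixel threshold are assumed for illustration.
import torch

def reprojection_inlier_loss(H: torch.Tensor, src: torch.Tensor, dst: torch.Tensor,
                             inlier_thresh: float = 3.0) -> torch.Tensor:
    """H: (3, 3) homography; src, dst: (N, 2) matched points in pixel coordinates."""
    ones = torch.ones(src.size(0), 1, device=src.device)
    src_h = torch.cat([src, ones], dim=1)             # homogeneous coordinates (N, 3)
    proj = src_h @ H.t()
    proj = proj[:, :2] / (proj[:, 2:3] + 1e-8)        # back to Euclidean coordinates
    err = (proj - dst).norm(dim=1)                    # per-point re-projection error
    inliers = (err < inlier_thresh).float()           # hard inlier selection
    return (err * inliers).sum() / (inliers.sum() + 1e-8)
```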

NeurIPS Conference 2024 Conference Paper

High-Resolution Image Harmonization with Adaptive-Interval Color Transformation

  • Quanling Meng
  • Qinglin Liu
  • Zonglin Li
  • Xiangyuan Lan
  • Shengping Zhang
  • Liqiang Nie

Existing high-resolution image harmonization methods typically rely on global color adjustments or the upsampling of parameter maps. However, these methods ignore local variations, leading to inharmonious appearances. To address this problem, we propose an Adaptive-Interval Color Transformation method (AICT), which predicts pixel-wise color transformations and adaptively adjusts the sampling interval to model local non-linearities of the color transformation at high resolution. Specifically, a parameter network is first designed to generate multiple position-dependent 3-dimensional lookup tables (3D LUTs), which use the color and position of each pixel to perform pixel-wise color transformations. Then, to enhance local variations adaptively, we separate a color transform into a cascade of sub-transformations using two 3D LUTs to achieve non-uniform sampling intervals of the color transform. Finally, a globally consistent weight learning method is proposed to predict an image-level weight for each color transform, utilizing global information to enhance the overall harmony. Extensive experiments demonstrate that our AICT achieves state-of-the-art performance with a lightweight architecture. The code is available at https://github.com/aipixel/AICT.
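To illustrate the basic operation the method builds on, the sketch below applies a single position-independent 3D LUT to an image with trilinear interpolation; AICT's position-dependent LUTs, cascaded sub-transformations, and image-level weights are not reproduced, and the LUT layout is an assumption.

```python
# Minimal sketch of applying a 3D LUT to an RGB image via trilinear interpolation.
# The LUT is assumed to be indexed as lut[:, r, g, b]; this is not AICT itself.
import torch
import torch.nn.functional as F

def apply_3d_lut(image: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) RGB in [0, 1]; lut: (3, S, S, S) output colors."""
    r, g, b = image[0], image[1], image[2]
    # grid_sample orders the last grid dim as (x=W, y=H, z=D), i.e. (b, g, r) here
    grid = torch.stack([b, g, r], dim=-1) * 2.0 - 1.0        # map [0, 1] -> [-1, 1]
    grid = grid.unsqueeze(0).unsqueeze(0)                     # (1, 1, H, W, 3)
    out = F.grid_sample(lut.unsqueeze(0), grid,
                        mode='bilinear', align_corners=True)  # trilinear for 5-D input
    return out.squeeze(0).squeeze(1)                          # back to (3, H, W)
```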

IJCAI Conference 2024 Conference Paper

MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection

  • Guiping Cao
  • Wenjian Huang
  • Xiangyuan Lan
  • Jianguo Zhang
  • Dongmei Jiang
  • Yaowei Wang

Popular transformer-based detectors detect objects in a one-to-one manner, where both the bounding box and category of each object are predicted by a single query, leading to box-sensitive category predictions. Additionally, the initialization of positional queries solely based on the predicted confidence scores or learnable embeddings neglects the significant spatial interrelation between different queries. This oversight leads to an imbalanced spatial distribution of queries (SDQ). In this paper, we propose a new MLP-DINO model to address these issues. Firstly, we present a new Query-Independent Category Supervision (QICS) approach for modeling category information, decoupling it from the sensitive bounding box prediction process to improve the detection performance. Additionally, to further improve the category predictions, we introduce a deep MLP model into the transformer-based detection framework to capture long-range and short-range information simultaneously. Thirdly, to balance the SDQ, we design a novel Graph-based Query Selection (GQS) method that distributes each query point in a discrete manner by graphing the spatial information of queries to cover a broader range of potential objects, significantly enhancing the hit-rate of queries. Experimental results on COCO indicate that our MLP-DINO achieves 54.6% AP with only 44M parameters under the 36-epoch setting, greatly outperforming the original DINO by +3.7% AP with fewer parameters and FLOPs. The source codes will be available at https://github.com/Med-Process/MLP-DINO.

ICML Conference 2024 Conference Paper

Revisiting Context Aggregation for Image Matting

  • Qinglin Liu
  • Xiaoqian Lv
  • Quanling Meng
  • Zonglin Li
  • Xiangyuan Lan
  • Shuo Yang 0006
  • Shengping Zhang
  • Liqiang Nie

Traditional studies emphasize the significance of context information in improving matting performance. Consequently, deep learning-based matting methods delve into designing pooling or affinity-based context aggregation modules to achieve superior results. However, these modules cannot well handle the context scale shift caused by the difference in image size during training and inference, resulting in matting performance degradation. In this paper, we revisit the context aggregation mechanisms of matting networks and find that a basic encoder-decoder network without any context aggregation modules can actually learn more universal context aggregation, thereby achieving higher matting performance compared to existing methods. Building on this insight, we present AEMatter, a matting network that is straightforward yet very effective. AEMatter adopts a Hybrid-Transformer backbone with appearance-enhanced axis-wise learning (AEAL) blocks to build a basic network with strong context aggregation learning capability. Furthermore, AEMatter leverages a large image training strategy to assist the network in learning context aggregation from data. Extensive experiments on five popular matting datasets demonstrate that the proposed AEMatter outperforms state-of-the-art matting methods by a large margin. The source code is available at https://github.com/aipixel/AEMatter.

NeurIPS Conference 2024 Conference Paper

SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition

  • Feng Lu
  • Xinyao Zhang
  • Canming Ye
  • Shuting Dong
  • Lijun Zhang
  • Xiangyuan Lan
  • Chun Yuan

Visual place recognition (VPR) is an essential task for multiple applications such as augmented reality and robot localization. Over the past decade, mainstream methods in the VPR area have been to use feature representation based on global aggregation, as exemplified by NetVLAD. These features are suitable for large-scale VPR and robust against viewpoint changes. However, the VLAD-based aggregation methods usually learn a large number of (e.g., 64) clusters and their corresponding cluster centers, which directly leads to a high dimension of the yielded global features. More importantly, when there is a domain gap between the data in training and inference, the cluster centers determined on the training set are usually improper for inference, resulting in a performance drop. To this end, we first attempt to improve NetVLAD by removing the cluster center and setting only a small number of (e.g., only 4) clusters. The proposed method not only simplifies NetVLAD but also enhances the generalizability across different domains. We name this method SuperVLAD. In addition, by introducing ghost clusters that will not be retained in the final output, we further propose a very low-dimensional 1-Cluster VLAD descriptor, which has the same dimension as the output of GeM pooling but performs notably better. Experimental results suggest that, when paired with a transformer-based backbone, our SuperVLAD shows better domain generalization performance than NetVLAD with significantly fewer parameters. The proposed method also surpasses state-of-the-art methods with lower feature dimensions on several benchmark datasets. The code is available at https://github.com/lu-feng/SuperVLAD.
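The center-free aggregation idea can be sketched compactly: soft-assign patch features to a small number of clusters and sum the features themselves rather than residuals to cluster centers. The assignment layer, cluster count, and normalization below are assumptions; ghost clusters and the 1-Cluster VLAD variant are omitted.

```python
# Minimal sketch of center-free VLAD-style aggregation with few clusters;
# illustrative only, not the exact SuperVLAD layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterFreeVLAD(nn.Module):
    def __init__(self, dim: int = 768, num_clusters: int = 4):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)    # soft-assignment logits

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) patch features from the backbone
        a = F.softmax(self.assign(tokens), dim=-1)    # (B, N, K) soft assignments
        v = torch.einsum('bnk,bnd->bkd', a, tokens)   # per-cluster feature sums, no centers
        v = F.normalize(v, dim=-1)                    # intra-normalization per cluster
        return F.normalize(v.flatten(1), dim=-1)      # concatenated global descriptor
```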

ICLR Conference 2024 Conference Paper

Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

  • Feng Lu
  • Lijun Zhang
  • Xiangyuan Lan
  • Shuting Dong
  • Yaowei Wang 0001
  • Chun Yuan 0003

Recent studies show that vision models pre-trained in generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between the tasks of model pre-training and VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR is still a key issue to address. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method to achieve both global and local adaptation efficiently, in which only lightweight adapters are tuned without adjusting the pre-trained model. Besides, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.

AAAI Conference 2020 Conference Paper

Regularized Fine-Grained Meta Face Anti-Spoofing

  • Rui Shao
  • Xiangyuan Lan
  • Pong C. Yuen

Face presentation attacks have become an increasingly critical concern as face recognition is widely applied. Many face anti-spoofing methods have been proposed, but most of them ignore the generalization ability to unseen attacks. To overcome this limitation, this work casts face anti-spoofing as a domain generalization (DG) problem, and attempts to address this problem by developing a new meta-learning framework called Regularized Fine-grained Meta-learning. To let our face anti-spoofing model generalize well to unseen attacks, the proposed framework trains our model to perform well in simulated domain shift scenarios, which is achieved by finding generalized learning directions in the meta-learning process. Specifically, the proposed framework incorporates the domain knowledge of face anti-spoofing as regularization so that meta-learning is conducted in the feature space regularized by the supervision of domain knowledge. This makes our model more likely to find generalized learning directions with the regularized meta-learning for the face anti-spoofing task. Besides, to further enhance the generalization ability of our model, the proposed framework adopts a fine-grained learning strategy that simultaneously conducts meta-learning in a variety of domain shift scenarios in each iteration. Extensive experiments on four public datasets validate the effectiveness of the proposed method.

IJCAI Conference 2019 Conference Paper

LRDNN: Local-refining based Deep Neural Network for Person Re-Identification with Attribute Discerning

  • Qinqin Zhou
  • Bineng Zhong
  • Xiangyuan Lan
  • Gan Sun
  • Yulun Zhang
  • Mengran Gou

Recently, pose or attribute information has been widely used to solve the person re-identification (re-ID) problem. However, inaccurate outputs from pose or attribute modules will impair the final person re-ID performance. Since re-ID, pose estimation, and attribute recognition are all based on person appearance information, we propose a Local-refining based Deep Neural Network (LRDNN) that aggregates pose estimation and attribute recognition to improve re-ID performance. To this end, we add a pose branch to extract local spatial information and optimize the whole network on both person identity and attribute objectives. To diminish the negative effect of unstable pose estimation, a novel structure called the channel parse block (CPB) is introduced to learn weights on different feature channels in the pose branch. The two branches are then combined with compact bilinear pooling. Experimental results on the Market1501 and DukeMTMC-reID datasets illustrate the effectiveness of the proposed method.

AAAI Conference 2018 Conference Paper

Hierarchical Discriminative Learning for Visible Thermal Person Re-Identification

  • Mang Ye
  • Xiangyuan Lan
  • Jiawei Li
  • Pong Yuen

Person re-identification is widely studied in the visible spectrum, where all person images are captured by visible cameras. However, visible cameras may not capture valid appearance information under poor illumination conditions, e.g., at night. In this case, a thermal camera is superior since it is less dependent on the lighting by using infrared light to capture the human body. To this end, this paper investigates a cross-modal re-identification problem, namely visible-thermal person re-identification (VT-REID). Existing cross-modal matching methods mainly focus on modeling the cross-modality discrepancy, while VT-REID also suffers from cross-view variations caused by different camera views. Therefore, we propose a hierarchical cross-modality matching model by jointly optimizing the modality-specific and modality-shared metrics. The modality-specific metrics transform two heterogeneous modalities into a consistent space in which a modality-shared metric can subsequently be learnt. Meanwhile, the modality-specific metric compacts features of the same person within each modality to handle the large intra-modality intra-person variations (e.g., viewpoints, pose). Additionally, an improved two-stream CNN network is presented to learn multi-modality sharable feature representations. Identity loss and contrastive loss are integrated to enhance the discriminability and modality-invariance with partially shared layer parameters. Extensive experiments illustrate the effectiveness and robustness of the proposed method.

AAAI Conference 2018 Conference Paper

Robust Collaborative Discriminative Learning for RGB-Infrared Tracking

  • Xiangyuan Lan
  • Mang Ye
  • Shengping Zhang
  • Pong Yuen

Tracking targets of interest is an important step for motion perception in intelligent video surveillance systems. While most recently developed tracking algorithms are grounded in RGB image sequences, it should be noted that information from the RGB modality is not always reliable (e.g., in a dark environment with poor lighting conditions), which urges the need to integrate information from the infrared modality for effective tracking because of the insensitivity of infrared thermal cameras to illumination conditions. However, several issues encountered during the tracking process limit the fusing performance of these heterogeneous modalities: 1) the cross-modality discrepancy of visual and motion characteristics, 2) the uncertainty of the degree of reliability in different modalities, and 3) large target appearance variations and background distractions within each modality. To address these issues, this paper proposes a novel and optimal discriminative learning framework for multi-modality tracking. In particular, the proposed discriminative learning framework is able to: 1) jointly eliminate outlier samples caused by large variations and learn discriminability-consistent features from heterogeneous modalities, and 2) collaboratively perform modality reliability measurement and target-background separation. Extensive experiments on RGB-infrared image sequences demonstrate the effectiveness of the proposed method.

IJCAI Conference 2018 Conference Paper

Visible Thermal Person Re-Identification via Dual-Constrained Top-Ranking

  • Mang Ye
  • Zheng Wang
  • Xiangyuan Lan
  • Pong C. Yuen

Cross-modality person re-identification between the thermal and visible domains is extremely important for night-time surveillance applications. Existing works in this field mainly focus on learning sharable feature representations to handle the cross-modality discrepancies. However, besides the cross-modality discrepancy caused by different camera spectrums, visible-thermal person re-identification also suffers from large cross-modality and intra-modality variations caused by different camera views and human poses. In this paper, we propose a dual-path network with a novel bi-directional dual-constrained top-ranking loss to learn discriminative feature representations. It is advantageous in two aspects: 1) end-to-end feature learning directly from the data without extra metric learning steps, and 2) it simultaneously handles the cross-modality and intra-modality variations to ensure the discriminability of the learnt representations. Meanwhile, identity loss is further incorporated to model identity-specific information to handle large intra-class variations. Extensive experiments on two datasets demonstrate the superior performance compared to state-of-the-art methods.
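As a rough illustration of a bi-directional cross-modality ranking objective, the sketch below applies a triplet-style top-ranking loss in both the visible-to-thermal and thermal-to-visible directions; the paper's dual constraints and mining strategy may differ, and the margin is an assumption.

```python
# Minimal sketch of a bi-directional cross-modality top-ranking loss; assumes
# each identity in the batch appears in both modalities. Illustrative only.
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(vis: torch.Tensor, thr: torch.Tensor,
                               labels: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """vis, thr: (B, dim) L2-normalized features; labels: (B,) identity labels."""
    dist = torch.cdist(vis, thr)                       # visible-to-thermal distances
    same = labels.unsqueeze(1) == labels.unsqueeze(0)  # (B, B) positive-pair mask

    def directional(d: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        hardest_pos = (d * pos.float()).max(dim=1).values                  # farthest positive
        hardest_neg = d.masked_fill(pos, float('inf')).min(dim=1).values   # closest negative
        return F.relu(hardest_pos - hardest_neg + margin).mean()

    return directional(dist, same) + directional(dist.t(), same.t())
```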

AAAI Conference 2017 Conference Paper

Robust MIL-Based Feature Template Learning for Object Tracking

  • Xiangyuan Lan
  • Pong C. Yuen
  • Rama Chellappa

Because of appearance variations, training samples of the tracked targets collected by the online tracker are required for updating the tracking model. However, this often leads to tracking drift problem because of potentially corrupted samples: 1) contaminated/outlier samples resulting from large variations (e.g., occlusion, illumination), and 2) misaligned samples caused by tracking inaccuracy. Therefore, in order to reduce the tracking drift while maintaining the adaptability of a visual tracker, how to alleviate these two issues via an effective model learning (updating) strategy is a key problem to be solved. To address these issues, this paper proposes a novel and optimal model learning (updating) scheme which aims to simultaneously eliminate the negative effects from these two issues mentioned above in a unified robust feature template learning framework. Particularly, the proposed feature template learning framework is capable of: 1) adaptively learning uncontaminated feature templates by separating out contaminated samples, and 2) resolving label ambiguities caused by misaligned samples via a probabilistic multiple instance learning (MIL) model. Experiments on challenging video sequences show that the proposed tracker performs favourably against several state-of-the-art trackers.

IJCAI Conference 2016 Conference Paper

Robust Joint Discriminative Feature Learning for Visual Tracking

  • Xiangyuan Lan
  • Shengping Zhang
  • Pong C. Yuen

Because of the complementarity of multiple visual cues (features) in appearance modeling, many tracking algorithms attempt to fuse multiple features to improve the tracking performance from two aspects: increasing the representation accuracy against appearance variations and enhancing the discriminability between the tracked target and its background. Since both of these aspects simultaneously contribute to the success of a visual tracker, how to fully unleash the capabilities of multiple features from these two aspects in appearance modeling is a key issue for feature fusion-based visual tracking. To address this problem, different from other feature fusion-based trackers which consider only one of these two aspects, this paper proposes a unified feature learning framework which simultaneously exploits both the representation capability and the discriminability of multiple features for visual tracking. In particular, the proposed feature learning framework is capable of: 1) learning robust features by separating out corrupted features for accurate feature representation, 2) seamlessly imposing the discriminability of multiple visual cues into feature learning, and 3) fusing features by exploiting their shared and feature-specific discriminative information. Extensive experimental results on challenging videos show that the proposed tracker performs favourably against ten other state-of-the-art trackers.