Arrow Research

Author name cluster

Jianlong Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI Conference 2026 Conference Paper

Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency

  • Zihan Li
  • Wei Sun
  • Jing Hu
  • Jianhua Yin
  • Xing Wang
  • Erwei Yin
  • Jianlong Wu

While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model's task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator. These self-generated supervisory signals then drive the efficient joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.
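
As a rough illustration of the cluster-assignment-level consistency described above, the sketch below aligns the soft cluster assignments of paired image and text embeddings from a frozen CLIP-style encoder and adds a simple balancing regularizer. The head architecture, loss form, and hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch, not the paper's code: cluster-assignment consistency
# between paired image and text embeddings from a frozen CLIP-style encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterHead(nn.Module):
    """Lightweight clustering head producing soft cluster assignments."""
    def __init__(self, dim=512, num_clusters=10):
        super().__init__()
        self.proj = nn.Linear(dim, num_clusters)

    def forward(self, feats):
        return F.softmax(self.proj(feats), dim=-1)

def assignment_consistency_loss(img_feats, txt_feats, head):
    """Encourage an image and its generated caption to share a cluster."""
    p_img = head(img_feats)                                  # (B, K)
    p_txt = head(txt_feats)                                  # (B, K)
    cross = -(p_txt.detach() * torch.log(p_img + 1e-8)).sum(dim=-1).mean()
    # Toy balancing regularizer: keep the marginal assignment distribution
    # close to uniform so clusters do not collapse.
    marginal = p_img.mean(dim=0)
    balance = (marginal * torch.log(marginal * p_img.shape[1] + 1e-8)).sum()
    return cross + balance

# Usage with random stand-in features (the encoder itself stays frozen).
head = ClusterHead(dim=512, num_clusters=10)
img = F.normalize(torch.randn(32, 512), dim=-1)
txt = F.normalize(torch.randn(32, 512), dim=-1)
loss = assignment_consistency_loss(img, txt, head)
loss.backward()
```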

IROS Conference 2025 Conference Paper

AffordGrasp: In-Context Affordance Reasoning for Open-Vocabulary Task-Oriented Grasping in Clutter

  • Yingbo Tang
  • Shuaike Zhang
  • Xiaoshuai Hao
  • Pengwei Wang 0004
  • Jianlong Wu
  • Zhongyuan Wang 0006
  • Shanghang Zhang

Inferring the affordance of an object and grasping it in a task-oriented manner is crucial for robots to successfully complete manipulation tasks. Affordance indicates where and how to grasp an object by taking its functionality into account, serving as the foundation for effective task-oriented grasping. However, current task-oriented methods often depend on extensive training data that is confined to specific tasks and objects, making it difficult to generalize to novel objects and complex scenes. In this paper, we introduce AffordGrasp, a novel open-vocabulary grasping framework that leverages the reasoning capabilities of vision-language models (VLMs) for in-context affordance reasoning. Unlike existing methods that rely on explicit task and object specifications, our approach infers tasks directly from implicit user instructions, enabling more intuitive and seamless human-robot interaction in everyday scenarios. Building on the reasoning outcomes, our framework identifies task-relevant objects and grounds their part-level affordances using a visual grounding module. This allows us to generate task-oriented grasp poses precisely within the affordance regions of the object, ensuring both functional and context-aware robotic manipulation. Extensive experiments demonstrate that AffordGrasp achieves state-of-the-art performance in both simulation and real-world scenarios, highlighting the effectiveness of our method. We believe our approach advances robotic manipulation techniques and contributes to the broader field of embodied AI. Project website: https://eqcy.github.io/affordgrasp/.
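
A hypothetical skeleton of the three stages the abstract describes (instruction-level task reasoning, part-level affordance grounding, and grasp pose generation) is sketched below. Every function body is a placeholder stub, and none of the names are taken from the AffordGrasp release.

```python
# Hypothetical skeleton only: stage boundaries follow the abstract, but all
# helpers are stand-in stubs rather than the released AffordGrasp code.
from dataclasses import dataclass

@dataclass
class Affordance:
    object_name: str
    part: str
    region: tuple  # (x, y, w, h) in image coordinates

def reason_task(instruction: str) -> dict:
    """Stage 1: infer the implicit task, target object and functional part
    from the user instruction (a VLM call in the real system)."""
    return {"task": "pour water", "object": "mug", "part": "handle"}

def ground_affordance(task: dict, image) -> Affordance:
    """Stage 2: ground the task-relevant part in the image via visual grounding."""
    return Affordance(task["object"], task["part"], region=(120, 80, 40, 60))

def generate_grasp(aff: Affordance) -> dict:
    """Stage 3: restrict grasp pose sampling to the affordance region."""
    x, y, w, h = aff.region
    return {"center": (x + w / 2, y + h / 2), "width": w, "angle": 0.0}

grasp = generate_grasp(ground_affordance(reason_task("I want to drink tea"), image=None))
print(grasp)
```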

NeurIPS Conference 2024 Conference Paper

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning

  • Yibo Yang
  • Xiaojie Li
  • Zhongzhu Zhou
  • Shuaiwen L. Song
  • Jianlong Wu
  • Liqiang Nie
  • Bernard Ghanem

Current parameter-efficient fine-tuning (PEFT) methods build adapters that are largely agnostic of the context of the downstream task to learn, or of the important knowledge to maintain. As a result, there is often a performance gap compared to full-parameter fine-tuning, while the fine-tuned model also suffers from catastrophic forgetting of the pre-trained world knowledge. In this paper, we propose **CorDA**, a Context-oriented Decomposition Adaptation method that builds learnable **task-aware adapters** from weight decomposition oriented by the context of the downstream task or the world knowledge to maintain. Concretely, we collect a few data samples, and perform singular value decomposition on each linear layer of a pre-trained LLM multiplied by the covariance matrix of its input activations computed on these samples. The inverse of the covariance matrix is multiplied with the decomposed components to reconstruct the original weights. In this way, the context of the representative samples is captured by the orientation of the factorization. Our method enables two options, the **knowledge-preserved adaptation** and the **instruction-previewed adaptation**. For the former, we use question-answering samples to obtain the covariance matrices, and use the decomposed components with the smallest $r$ singular values to initialize a learnable adapter, with the others frozen so that the world knowledge is better preserved. For the latter, we use the instruction data from the fine-tuning task, such as math or coding, to orient the decomposition and train the largest $r$ components that most correspond to the task to learn. We conduct extensive experiments on Math, Code, and Instruction Following tasks. Our knowledge-preserved adaptation not only achieves better performance than LoRA on fine-tuning tasks, but also mitigates the forgetting of world knowledge. Our instruction-previewed adaptation is able to further enhance fine-tuning performance to be comparable with full fine-tuning, surpassing state-of-the-art PEFT methods such as LoRA, DoRA, and PiSSA.
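
The core decomposition step the abstract describes can be sketched as below: SVD of a weight matrix multiplied by the input-activation covariance, reconstruction with the covariance inverse, and a split into a frozen part and a small adapter built from the $r$ smallest components (the knowledge-preserved option). The rank, the epsilon regularizer, and the tensor sizes are illustrative assumptions.

```python
# Minimal sketch of context-oriented decomposition under stated assumptions.
import torch

def corda_split(W, X, r=4, eps=1e-4):
    """W: (out, in) linear weight; X: (n, in) sampled input activations."""
    C = X.T @ X + eps * torch.eye(X.shape[1])            # input covariance
    U, S, Vh = torch.linalg.svd(W @ C, full_matrices=False)
    C_inv = torch.linalg.inv(C)
    # Components are sorted by descending singular value; the r smallest
    # become the learnable adapter (knowledge-preserved option), the rest
    # stay frozen.
    frozen = (U[:, :-r] * S[:-r]) @ Vh[:-r] @ C_inv
    adapter = (U[:, -r:] * S[-r:]) @ Vh[-r:] @ C_inv
    return frozen, adapter

W = torch.randn(64, 32)
X = torch.randn(256, 32)
frozen, adapter = corda_split(W, X)
print((frozen + adapter - W).abs().max())                # reconstruction error
```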

NeurIPS Conference 2020 Conference Paper

Agree to Disagree: Adaptive Ensemble Knowledge Distillation in Gradient Space

  • Shangchen Du
  • Shan You
  • Xiaojie Li
  • Jianlong Wu
  • Fei Wang
  • Chen Qian
  • Changshui Zhang

Distilling knowledge from an ensemble of teacher models is expected to perform better than distilling from a single teacher. Current methods mainly adopt a vanilla average rule, i.e., they simply take the average of all teacher losses for training the student network. However, this approach treats teachers equally and ignores the diversity among them. When conflicts or competitions exist among teachers, which is common, the inner compromise might hurt the distillation performance. In this paper, we examine the diversity of teacher models in the gradient space and regard ensemble knowledge distillation as a multi-objective optimization problem, so that we can determine a better optimization direction for training the student network. Besides, we also introduce a tolerance parameter to accommodate disagreement among teachers. In this way, our method can be seen as a dynamic weighting method for each teacher in the ensemble. Extensive experiments validate the effectiveness of our method for both logits-based and feature-based cases.
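
For two teachers, picking a common descent direction in gradient space can be sketched with the closed-form minimum-norm combination below (an MGDA-style element). The paper's tolerance parameter and the general multi-teacher solver are omitted, so this conveys only the flavor of the approach.

```python
# MGDA-style two-teacher sketch; not the paper's full dynamic-weighting method.
import torch
import torch.nn.functional as F

def min_norm_weight(g1, g2):
    """Weight gamma of the minimum-norm point on the segment
    gamma * g1 + (1 - gamma) * g2."""
    gamma = torch.dot(g2 - g1, g2) / ((g1 - g2).norm() ** 2 + 1e-12)
    return gamma.clamp(0.0, 1.0)

student = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)
teacher_logits = [torch.randn(8, 4), torch.randn(8, 4)]

def distill_grad(t_logits):
    """Flattened gradient of one teacher's distillation loss w.r.t. the student."""
    student.zero_grad()
    loss = F.kl_div(F.log_softmax(student(x), dim=-1),
                    F.softmax(t_logits, dim=-1), reduction="batchmean")
    loss.backward()
    return torch.cat([p.grad.flatten() for p in student.parameters()])

g1, g2 = (distill_grad(t) for t in teacher_logits)
gamma = min_norm_weight(g1, g2)
direction = gamma * g1 + (1 - gamma) * g2   # common descent direction
print(gamma.item(), direction.norm().item())
```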

AAAI Conference 2020 Conference Paper

Dynamical System Inspired Adaptive Time Stepping Controller for Residual Network Families

  • Yibo Yang
  • Jianlong Wu
  • Hongyang Li
  • Xia Li
  • Tiancheng Shen
  • Zhouchen Lin

The correspondence between residual networks and dynamical systems motivates researchers to unravel the physics of ResNets with well-developed tools from numerical methods for ODE systems. The Runge-Kutta-Fehlberg method is an adaptive time-stepping scheme that offers a good trade-off between stability and efficiency. Can we also have adaptive time stepping for ResNets to ensure both stability and performance? In this study, we analyze the effects of time stepping on the Euler method and ResNets. We establish a stability condition for ResNets with step sizes and weight parameters, and point out the effects of step sizes on stability and performance. Inspired by our analyses, we develop an adaptive time stepping controller that is dependent on the parameters of the current step and aware of previous steps. The controller is jointly optimized with the network training so that variable step sizes and evolution time can be adaptively adjusted. We conduct experiments on ImageNet and CIFAR to demonstrate the effectiveness. It is shown that our proposed method is able to improve both stability and accuracy without introducing additional overhead in the inference phase.
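
The explicit-Euler view of a residual block with an adaptive step size, x_{t+1} = x_t + h_t * f(x_t), can be sketched as below, with a small controller that maps simple current-step statistics and the previous step size to h_t. The controller's exact inputs and parameterization in the paper differ; this shows only the general pattern.

```python
# Illustrative adaptive-step residual block; not the paper's controller design.
import torch
import torch.nn as nn

class AdaptiveStepBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.controller = nn.Linear(2, 1)   # (current weight norm, previous h) -> h_t

    def forward(self, x, h_prev):
        w_norm = torch.cat([p.flatten() for p in self.f.parameters()]).norm()
        h_t = torch.sigmoid(self.controller(torch.stack([w_norm, h_prev])))  # in (0, 1)
        return x + h_t * self.f(x), h_t.squeeze()

blocks = nn.ModuleList([AdaptiveStepBlock(32) for _ in range(4)])
x, h = torch.randn(8, 32), torch.tensor(1.0)
for blk in blocks:
    x, h = blk(x, h)                        # x_{t+1} = x_t + h_t * f(x_t)
print(h.item(), x.shape)
```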

ICML Conference 2020 Conference Paper

Maximum-and-Concatenation Networks

  • Xingyu Xie
  • Hao Kong
  • Jianlong Wu
  • Wayne Zhang 0001
  • Guangcan Liu
  • Zhouchen Lin

While successful in many fields, deep neural networks (DNNs) still suffer from some open problems such as bad local minima and unsatisfactory generalization performance. In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) that aims to eliminate bad local minima and improve generalization ability. Remarkably, we prove that MCN has a very nice property: every local minimum of an (l+1)-layer MCN can be better than, or at least as good as, the global minima of the network consisting of its first l layers. In other words, by increasing the network depth, MCN can autonomously improve the goodness of its local minima. What is more, it is easy to plug MCN into an existing deep model to give it this property as well. Finally, under mild conditions, we show that MCN can approximate certain continuous functions arbitrarily well with high efficiency; that is, the covering number of MCN is much smaller than that of most existing DNNs such as deep ReLU networks. Based on this, we further provide a tight generalization bound to guarantee the inference ability of MCN when dealing with testing samples.
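
The pattern suggested by the network's name, an elementwise maximum over linear transforms concatenated with features carried forward from the input, is sketched below. The paper's exact layer definition may differ, so treat this as a loose interpretation rather than the MCN layer itself.

```python
# Very rough sketch of a "maximum then concatenation" layer; the exact MCN
# formulation in the paper may differ.
import torch
import torch.nn as nn

class MaxConcatLayer(nn.Module):
    def __init__(self, in_dim, hidden_dim, skip_dim):
        super().__init__()
        self.a = nn.Linear(in_dim, hidden_dim)
        self.b = nn.Linear(in_dim, hidden_dim)
        self.skip = nn.Linear(skip_dim, hidden_dim)

    def forward(self, h, x):
        # Elementwise maximum of two linear transforms of the previous layer,
        # concatenated with a linear map of the raw input x.
        return torch.cat([torch.maximum(self.a(h), self.b(h)), self.skip(x)], dim=-1)

x = torch.randn(8, 16)
layer1 = MaxConcatLayer(16, 32, 16)
layer2 = MaxConcatLayer(64, 32, 16)   # feature dim doubles after concatenation
h = layer2(layer1(x, x), x)
print(h.shape)                        # torch.Size([8, 64])
```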

AAAI Conference 2020 Conference Paper

SOGNet: Scene Overlap Graph Network for Panoptic Segmentation

  • Yibo Yang
  • Hongyang Li
  • Xia Li
  • Qijie Zhao
  • Jianlong Wu
  • Zhouchen Lin

The panoptic segmentation task requires a unified result from semantic and instance segmentation outputs that may contain overlaps. However, current studies largely ignore modeling overlaps. In this study, we aim to model overlap relations among instances and resolve them for panoptic segmentation. Inspired by scene graph representation, we formulate the overlapping problem as a simplified case, named the scene overlap graph. We leverage each object's category, geometry and appearance features to perform relational embedding, and output a relation matrix that encodes overlap relations. In order to overcome the lack of supervision, we introduce a differentiable module to resolve the overlap between any pair of instances. The mask logits after removing overlaps are fed into per-pixel instance id classification, which leverages the panoptic supervision to assist in the modeling of overlap relations. Besides, we generate an approximate ground truth of overlap relations as weak supervision to quantify the accuracy of the overlap relations predicted by our method. Experiments on COCO and Cityscapes demonstrate that our method is able to accurately predict overlap relations and surpasses state-of-the-art performance for panoptic segmentation. Our method also won the Innovation Award in the COCO 2019 challenge.
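
One way to picture the differentiable overlap-resolution module: a predicted relation matrix, where entry (i, j) indicates that instance j occludes instance i, is used to suppress instance i's mask logits wherever an occluder is present, before per-pixel instance id classification. The sketch below follows that intuition with an illustrative suppression rule rather than the paper's exact formulation.

```python
# Intuition-level sketch; the suppression rule and scaling are illustrative.
import torch

def resolve_overlaps(mask_logits, relation):
    """mask_logits: (N, H, W) per-instance logits; relation: (N, N) in [0, 1],
    where relation[i, j] ~ 1 means instance j occludes instance i."""
    probs = torch.sigmoid(mask_logits)
    # For each instance i, accumulate the masks of its predicted occluders.
    occluders = torch.einsum("ij,jhw->ihw", relation, probs)
    return mask_logits - 10.0 * occluders.clamp(max=1.0)   # push occluded logits down

logits = torch.randn(3, 64, 64)
relation = torch.sigmoid(torch.randn(3, 3)).fill_diagonal_(0.0)
resolved = resolve_overlaps(logits, relation)
instance_id = resolved.argmax(dim=0)          # per-pixel instance id map
print(instance_id.shape)
```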

AAAI Conference 2020 Conference Paper

Unified Graph and Low-Rank Tensor Learning for Multi-View Clustering

  • Jianlong Wu
  • Xingyu Xie
  • Liqiang Nie
  • Zhouchen Lin
  • Hongbin Zha

Multi-view clustering aims to take advantage of information from multiple views to improve clustering performance. Many existing methods compute the affinity matrix by low-rank representation (LRR) and investigate the relationships between views in a pairwise manner. However, LRR suffers from the high computational cost of self-representation optimization. Besides, compared with pairwise views, the tensor form of all views' representations is more suitable for capturing the high-order correlations among all views. To address these two issues, in this paper we propose unified graph and low-rank tensor learning (UGLTL) for multi-view clustering. Specifically, on the one hand, we learn the view-specific affinity matrix based on projected graph learning. On the other hand, we reorganize the affinity matrices into tensor form and learn its intrinsic tensor based on low-rank tensor approximation. Finally, we unify these two terms and jointly learn the optimal projection matrices, affinity matrices and intrinsic low-rank tensor. We also propose an efficient algorithm to iteratively optimize the proposed model. To evaluate the performance of the proposed method, we conduct extensive experiments on multiple benchmarks across different scenarios and sizes. Compared with the state-of-the-art approaches, our method achieves much better performance.
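
The low-rank tensor part of such models typically relies on a tensor singular-value-thresholding step under the t-product: stack the per-view affinity matrices into a tensor, FFT along the view mode, soft-threshold the singular values of each frontal slice, and invert the FFT. The sketch below shows only this standard proximal step, not the projected graph-learning term or the joint optimization in UGLTL.

```python
# Standard tensor SVT step (t-product), used as a building block in low-rank
# tensor approaches; sizes and the threshold are illustrative.
import numpy as np

def tensor_svt(A, tau):
    """A: (n, n, V) stacked affinity matrices; tau: singular-value threshold."""
    Af = np.fft.fft(A, axis=2)
    out = np.zeros_like(Af)
    for v in range(A.shape[2]):
        U, S, Vh = np.linalg.svd(Af[:, :, v], full_matrices=False)
        S = np.maximum(S - tau, 0.0)                     # soft-threshold
        out[:, :, v] = (U * S) @ Vh
    return np.real(np.fft.ifft(out, axis=2))

views = [np.random.rand(50, 50) for _ in range(3)]
A = np.stack([(S + S.T) / 2 for S in views], axis=2)     # symmetric affinities
Z = tensor_svt(A, tau=1.0)                               # low-rank tensor estimate
print(Z.shape)
```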

ICML Conference 2019 Conference Paper

Differentiable Linearized ADMM

  • Xingyu Xie
  • Jianlong Wu
  • Guangcan Liu
  • Zhisheng Zhong
  • Zhouchen Lin

Recently, a number of learning-based optimization methods that combine data-driven architectures with classical optimization algorithms have been proposed and explored, showing superior empirical performance in solving various ill-posed inverse problems. However, rigorous analysis of the convergence behavior of learning-based optimization remains scarce. In particular, most existing analyses are specific to unconstrained problems and cannot be applied to the more general cases where some variables of interest are subject to certain constraints. In this paper, we propose Differentiable Linearized ADMM (D-LADMM) for solving problems with linear constraints. Specifically, D-LADMM is a K-layer LADMM-inspired deep neural network, obtained by first introducing learnable weights into the classical Linearized ADMM algorithm and then generalizing the proximal operator to a learnable activation function. Notably, we rigorously prove that there exists a set of learnable parameters for which D-LADMM generates globally converged solutions, and we show that those desired parameters can be attained by training D-LADMM in a proper way. To the best of our knowledge, we are the first to provide a convergence analysis for learning-based optimization methods on constrained problems.
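
The unrolling construction can be illustrated on a one-block simplification, min ||z||_1 s.t. Az = x: each layer mimics one linearized-ADMM iteration, with a learnable step matrix in place of the hand-set (1/(beta*L)) A^T and a learnable soft-threshold in place of the fixed proximal operator. D-LADMM itself addresses the general constrained two-block problem, so the sketch below conveys only the flavor of the construction.

```python
# One-block simplification for illustration; W and theta are the learnable
# ingredients the abstract mentions (weights and a generalized proximal).
import torch
import torch.nn as nn

def soft_threshold(u, theta):
    return torch.sign(u) * torch.clamp(u.abs() - theta, min=0.0)

class DLADMMLayer(nn.Module):
    def __init__(self, m, n, beta=1.0):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(m, n))   # replaces (1/(beta*L)) A^T
        self.theta = nn.Parameter(torch.tensor(0.1))      # learnable threshold
        self.beta = beta

    def forward(self, z, lam, A, x):
        residual = z @ A.T - x                            # A z - x, batched rows
        z = soft_threshold(z - (lam + self.beta * residual) @ self.W, self.theta.abs())
        lam = lam + self.beta * (z @ A.T - x)             # dual ascent step
        return z, lam

class DLADMM(nn.Module):
    """K unrolled LADMM iterations with layer-wise learnable parameters."""
    def __init__(self, m, n, K=10):
        super().__init__()
        self.layers = nn.ModuleList([DLADMMLayer(m, n) for _ in range(K)])

    def forward(self, A, x):
        z = torch.zeros(x.shape[0], A.shape[1])
        lam = torch.zeros_like(x)
        for layer in self.layers:
            z, lam = layer(z, lam, A, x)
        return z

A = torch.randn(20, 50)
x = torch.randn(8, 20)
z_hat = DLADMM(m=20, n=50)(A, x)                          # sparse code estimate
print(z_hat.shape)
```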