Arrow Research search

Author name cluster

Ming Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
2 author rows

Possible papers

47

AAAI Conference 2026 Conference Paper

Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes

  • Haokai Hong
  • Wanyu Lin
  • Ming Yang
  • Kay Chen Tan

Can we train a 3D molecule generator using data from dense regions to generate samples in sparse regions? This challenge can be framed as an out-of-distribution (OOD) generation problem. While prior research on OOD generation predominantly targets property shifts, structural shifts, such as differences in molecular scaffolds or functional groups, represent an equally critical source of distributional shifts. This work introduces the Geometric OOD Diffusion Model (GODD), a novel diffusion-based framework that enables training on data-abundant molecular distributions while generalizing to data-scarce distributions under distributional structural shifts. Central to our approach is a designated equivariant asymmetric autoencoder to capture distributional structural priors. The asymmetric design allows the model to generalize to unseen structural variations by capturing distributional priors representing distinct distributions. The encoded structural-grained priors guide generation toward sparse regions without requiring explicit training on such data. Evaluated across standard benchmarks encompassing OOD structural shifts (e.g., scaffolds, rings), GODD achieves an improvement of 12.6% in success rate, defined based on molecular validity, uniqueness, and novelty. Furthermore, the framework demonstrates promising performance and generalization on canonical fragment-based drug design tasks, highlighting its utility in learning-based molecular discovery.

AAAI Conference 2026 Conference Paper

DSAP: Enhancing Generalization in Goal-Conditioned Reinforcement Learning

  • Yiming Wang
  • Kaiyan Zhao
  • Ming Yang
  • Yan Li
  • Furui Liu
  • Jiayu Chen
  • Leong Hou U

Goal-conditioned Reinforcement Learning (RL) is a promising direction for training agents capable of tackling a variety of tasks. However, generalizing to new goals in different environments remains a central challenge for goal-conditioned RL agents. Existing methods often rely on state abstraction, which involves learning abstracted state representations by excluding irrelevant features, to improve generalization. Despite their success in simplified settings, these methods often fail to generalize effectively to realistic environments with varied goals. In this work, we propose to enhance generalization through state abstraction from the perspective of causal inference. We hypothesize that the generalization gap arises in part due to unobserved confounders: latent variables that simultaneously influence both the global and goal states. To address this, we introduce Deconfounded State Abstraction for Policy learning (DSAP), a novel framework that mitigates backdoor confounding by employing a learned causal graph as a *proxy* for the hidden confounders. We provide theoretical analysis demonstrating that DSAP improves both the learning process and the generalization capability of goal-conditioned policies. Extensive experiments across different settings of multiple benchmarks show that our method significantly outperforms existing methods.

TMLR Journal 2026 Journal Article

EgoPlan: Towards Effective Embodied Agents via Egocentric Planning

  • Zhirui Fang
  • Ming Yang
  • Weishuai Zeng
  • Junpeng Yue
  • Boyu Li
  • Jiafei Lyu
  • Xiu Li
  • Zongqing Lu

We explore leveraging large multi-modal models (LMMs) and Text2image models to build a more general embodied agent. LMMs excel in planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. This pipeline leverages a diffusion model to simulate the fundamental dynamics between states and actions, and integrates computer-vision techniques such as style transfer and optical flow to enhance its ability to model spatial states and to generalize across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these sub-goals, thus enabling more generalized and effective decision-making. Because the LMM outputs textual actions, mechanisms such as reflection can be applied to perform high-level task decomposition and low-level action output end-to-end. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.

AAAI Conference 2026 Conference Paper

Learning Diffusion Policy from Primitive Skills for Robot Manipulation

  • Zhihao Gu
  • Ming Yang
  • Difan Zou
  • Dong Xu

Diffusion policies have recently shown great promise for generating actions in robotic manipulation. However, existing approaches often rely on global instructions to produce short-term control signals, which can result in misalignment in action generation. We conjecture that primitive skills, defined as fine-grained, short-horizon manipulations such as "move up" and "open the gripper", provide a more intuitive and effective interface for robot learning. Motivated by this, we propose SDP, a skill-conditioned diffusion policy that integrates interpretable skill learning with conditional action planning. SDP abstracts eight reusable primitive skills across tasks and employs a vision-language model to extract discrete representations from visual observations and language instructions. Based on the representations, a lightweight router network is designed to assign a desired primitive skill for each state, which helps construct a single-skill policy to generate skill-aligned actions. By decomposing complex tasks into a sequence of primitive skills and selecting a single-skill policy, the proposed SDP ensures skill-consistent behavior across diverse tasks. Extensive experiments on two challenging simulation benchmarks and real-world robot deployments demonstrate that SDP consistently outperforms state-of-the-art methods, providing a new paradigm for skill-based robot learning with diffusion policies.

AAAI Conference 2026 Conference Paper

SCAN: Self-Calibrated AutoregressioN for High-Quality Visual Generation

  • Zhanzhou Feng
  • Qingpei Guo
  • Jingdong Chen
  • Feng Gao
  • Ming Yang
  • Shiliang Zhang

Human artists can continuously refine their coarse sketches during artistic creation. This is quite different from existing autoregressive generation, where a token is determined once sampled. Aiming to flexibly refine the generated contents, this paper presents a Self-Calibrated AutoregressioN (SCAN) model capable of self-evaluating and refining generation quality without regenerating the entire image. We unify image token generation and quality evaluation into a single autoregressive model, formulating both tasks as categorical prediction problems. During inference, the model first generates a coarse initial image, then iteratively refines the lowest-quality patches until satisfactory image quality is achieved. Experimental results demonstrate that SCAN effectively handles diverse real-world generation errors and achieves a promising balance between image quality and speed. For example, SCAN-XL achieves an FID of 2.10 and an IS of 326.1, surpassing LlamaGen-XL by 1.29 (+38%) in FID and 99.0 (+43.6%) in IS, with a 5.6× speedup (19.76s to 3.56s). Compared to recent works, SCAN improves FID and speed by +18.3% and +23% over VAR-d20, and by +7% and +46% over RandAR-XL.

TMLR Journal 2026 Journal Article

Single-loop Algorithms for Stochastic Non-Convex Optimization with Weakly-Convex Constraints

  • Ming Yang
  • Gang Li
  • Quanqi Hu
  • Qihang Lin
  • Tianbao Yang

Constrained optimization with multiple functional inequality constraints has significant applications in machine learning. This paper examines a crucial subset of such problems where both the objective and constraint functions are weakly convex. Existing methods often face limitations, including slow convergence rates or reliance on double-loop algorithmic designs. To overcome these challenges, we introduce a novel single-loop penalty-based stochastic algorithm. Following the classical exact penalty method, our approach employs a hinge-based penalty, which permits the use of a constant penalty parameter, enabling us to achieve a state-of-the-art complexity for finding an approximate Karush-Kuhn-Tucker (KKT) solution. We further extend our algorithm to address finite-sum coupled compositional objectives, which are prevalent in artificial intelligence applications, establishing improved complexity over existing approaches. Finally, we validate our method through experiments on fair learning with receiver operating characteristic (ROC) fairness constraints and continual learning with non-forgetting constraints.
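The hinge-based exact penalty described in the abstract can be written out as a sketch; the notation below (objective $f$, constraints $c_i$, penalty parameter $\rho$) is illustrative and not taken from the paper:

```latex
\min_{x} \; f(x) \quad \text{s.t.} \quad c_i(x) \le 0,\; i = 1,\dots,m
\;\;\longrightarrow\;\;
\min_{x} \; F_\rho(x) \;=\; f(x) \;+\; \rho \sum_{i=1}^{m} \max\{0,\, c_i(x)\}
```

Because the hinge penalty is exact, a finite constant $\rho$ suffices under suitable regularity conditions, which is what permits a single-loop design with a fixed penalty parameter rather than an inner loop that drives $\rho \to \infty$.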

AAAI Conference 2026 Conference Paper

Tensorized Label Learning via Balanced Tensor Regression

  • Guangyu Yang
  • Yuzhuo Feng
  • Qin Li
  • Quanxue Gao
  • Ming Yang
  • Rui Wang

The multi-view clustering methods based on tensor regression can make full use of the potential structural information between views and achieve data-level fusion. However, existing tensor regression-based anchor-graph approaches often overlook the probabilistic nature of the anchor graph, focusing solely on sample labels while ignoring the influence of anchor labels on clustering results. To overcome these limitations, we introduce Tensorized Label Learning via Balanced Tensor Regression (TLL-BTR). Our key idea is to exploit the probabilistic nature of the anchor graph by regarding the sample labels as a projection tensor that maps the anchor graph into the label space, thereby producing anchor labels. By enforcing constraints on these anchor labels, we guide the concurrent learning of sample labels and achieve co-label learning between anchors and samples. To prevent trivial solutions, we maximize the nuclear norm to promote an even distribution of samples across clusters. Extensive experiments on benchmark datasets demonstrate that TLL-BTR consistently outperforms state-of-the-art methods.

AAAI Conference 2026 Conference Paper

Unified View Extraction with Low-Rankness and Smoothness Fusion for Multi-View Subspace Clustering

  • Yapeng Wang
  • Quanxue Gao
  • Fangfang Li
  • Yu Yun
  • Ming Yang

Tensor-based multi-view subspace clustering (MVSC) has achieved significant success by capturing high-order inter-view correlations. However, existing approaches face two principal limitations. First, most methods either exclusively emphasize the inter-view low‑rankness (R) prior while neglecting the intra-view local smoothness (S) prior, or treat R and S as two separate regularizers—complicating joint optimization. Second, conventional tensor‑based methods impose only low‑rank constraints on the representation tensor, which limits their ability to simultaneously model consistency and complementary information. To address these issues, we propose a Unified View Extraction with Low‑Rankness and Smoothness Fusion (UVELRS) method. Our framework first extracts a consistent cross‑view representation and then constructs a tensor by stacking these representations. We introduce a novel tensor total variation Schatten-p norm that simultaneously encodes both R and S priors while offering flexible singular‑value control. This unified formulation effectively captures both high-order inter-view correlations and intra-view local smoothness. Extensive experiments on real‑world datasets demonstrate UVELRS's superior performance and robustness.

IROS Conference 2025 Conference Paper

AVP Scene Graph: Hierarchical Visual Language Mapping and Navigation for Autonomous Valet Parking

  • Xiangru Mu
  • Fengyi Chen
  • Runhan Wang
  • Siyuan Chen
  • Jiyuan Cai
  • Jia Cai
  • Ming Yang
  • Tong Qin

Autonomous valet parking (AVP) aims to help human drivers navigate to a desired location in a parking lot. Current AVP systems are not flexible enough to perform open-vocabulary navigation tasks such as "navigate to the exit" or "park near the elevator". The widely used map formats for AVP, such as vectorized maps, have limitations including limited semantics, high cost, and poor human-machine interaction, restricting the flexible application of AVP in complex scenarios. To address these problems, we propose AVP Scene Graph (AVP-SG), a hierarchical visual language mapping and navigation framework for open-vocabulary AVP tasks, which enables autonomous navigation from multi-modal human instructions. Our framework consists of two parts: a bottom-up mapping module and a top-down navigation module. In the mapping pipeline, assisted by a vision-language model (VLM) and an optical character recognition (OCR) model, we first extract open-vocabulary conceptual semantics from images and project them onto map elements. Next, through a bottom-up scheme that performs feature fusion layer by layer, the scene graph is built hierarchically, consisting of slot, lane, block, and garage layers. In the top-down navigation pipeline, the navigation goal can be efficiently found by an LLM-enhanced graph retrieval approach. Experiments on real-world AVP tasks prove that the self-driving vehicle can successfully perform open-vocabulary AVP tasks utilizing the AVP-SG.

IJCAI Conference 2025 Conference Paper

BMIP: Bi-directional Modality Interaction Prompt Learning for VLM

  • Song-Lin Lv
  • Yu-Yang Chen
  • Zhi Zhou
  • Ming Yang
  • Lan-Zhe Guo

Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for the ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.

IROS Conference 2025 Conference Paper

Building Hybrid Omnidirectional Visual-Lidar Map for Visual-Only Localization

  • Jingyang Huang
  • Hao Wei
  • Changze Li
  • Tong Qin
  • Fei Gao
  • Ming Yang

Recently, there has been growing interest in using low-cost sensor combinations, such as cameras and IMUs, to achieve accurate localization within pre-built pointcloud maps. In this paper, we propose a novel hybrid visual-Lidar mapping and visual-only re-localization framework, specifically designed for UAVs with limited computational resources operating in challenging environments. Keyframes function as a bridge in our system, associating images with the pointcloud to facilitate efficient and accurate pose estimation. Besides, our system creates omnidirectional keyframes at the mapping stage, enabling effective re-localization from any orientation, which enhances the robustness and practicability of our system. Experiments show that the proposed algorithm achieves high localization accuracy on pre-built maps and is capable of running in real time on UAVs for autonomous navigation tasks. The source code will be made publicly available soon.

IJCAI Conference 2025 Conference Paper

FedSaaS: Class-Consistency Federated Semantic Segmentation via Global Prototype Supervision and Local Adversarial Harmonization

  • Xiaoyang Yu
  • Xiaoming Wu
  • Xin Wang
  • Dongrun Li
  • Ming Yang
  • Peng Cheng

Federated semantic segmentation enables pixel-level classification in images through collaborative learning while maintaining data privacy. However, existing research commonly overlooks the fine-grained class relationships within the semantic space when addressing heterogeneous problems, particularly domain shift. This oversight results in ambiguities between class representations. To overcome this challenge, we propose a novel federated segmentation framework that enforces class consistency, termed FedSaaS. Specifically, we introduce class exemplars as a criterion for both local- and global-level class representations. On the server side, the uploaded class exemplars are leveraged to model class prototypes, which supervise the global branches of clients, ensuring alignment with the global-level representation. On the client side, we incorporate an adversarial mechanism to harmonize the contributions of the global and local branches, leading to consistent output. Moreover, multilevel contrastive losses are employed on both sides to enforce consistency between the two levels of representation in the same semantic space. Extensive experiments on five driving scene segmentation datasets demonstrate that our framework outperforms state-of-the-art methods, significantly improving average segmentation accuracy and effectively addressing the class-consistency representation problem.

IROS Conference 2025 Conference Paper

Flow-Aware Navigation of Magnetic Micro-Robots in Complex Fluids via PINN-Based Prediction

  • Yongyi Jia
  • Shu Miao
  • Jiayu Wu
  • Ming Yang
  • Chengzhi Hu
  • Xiang Li 0009

While magnetic micro-robots have demonstrated significant potential across various applications, including drug delivery and microsurgery, the open issue of precise navigation and control in complex fluid environments is crucial for in vivo implementation. This paper introduces a novel flow-aware navigation and control strategy for magnetic micro-robots that explicitly accounts for the impact of fluid flow on their movement. First, the proposed method employs a Physics-Informed U-Net (PI-UNet) to refine the numerically predicted fluid velocity using local observations. The predicted velocity is then incorporated into a flow-aware A* path planning algorithm, ensuring efficient navigation while mitigating flow-induced disturbances. Finally, a control scheme is developed to compensate for the predicted fluid velocity, thereby optimizing the micro-robot’s performance. A series of simulation studies and real-world experiments are conducted to validate the efficacy of the proposed approach. This method enhances both planning accuracy and control precision, expanding the potential applications of magnetic micro-robots in fluid-affected environments typical of many medical scenarios.

NeurIPS Conference 2025 Conference Paper

From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots

  • Yuxuan Wang
  • Ming Yang
  • Gang Ding
  • Yu Zhang
  • Weishuai Zeng
  • Xinrun Xu
  • Haobin Jiang
  • Zongqing Lu

Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments on two simulations and a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world.

AAAI Conference 2025 Conference Paper

HomoMatcher: Achieving Dense Feature Matching with Semi-Dense Efficiency by Homography Estimation

  • Xiaolong Wang
  • Lei Yu
  • Yingying Zhang
  • Jiangwei Lao
  • Lixiang Ru
  • Liheng Zhong
  • Jingdong Chen
  • Yu Zhang

Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semi-dense matching approaches have achieved substantial performance enhancements and established a widely-accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in a higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.
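The patch-to-patch mapping the abstract describes is a planar homography. As a generic illustration (not the paper's estimation network, and with names of our own choosing), applying a 3x3 homography to patch coordinates works like this:

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (n, 2) array of 2D points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # lift to homogeneous coordinates
    mapped = pts_h @ H.T                               # projective transform
    return mapped[:, :2] / mapped[:, 2:3]              # divide out the scale factor

# A pure-translation homography: shifts every point by (3, -1).
H = np.array([[1.0, 0.0, 3.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0, 1.0]])
```

Once such a mapping between two coarse-matched patches is known, any point in one patch can be carried to a sub-pixel location in the other, which is how a single patch-level estimate yields dense correspondences at low cost.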

ICRA Conference 2025 Conference Paper

Human-Like Walking Motion Generation for Self-Balancing Lower Limb Rehabilitation Exoskeletons

  • Ming Yang
  • Ziqiang Chen
  • Wentao Li
  • Feng Li 0059
  • Weiwei Shang 0001
  • Dingkui Tian
  • Xinyu Wu 0001

Self-balancing lower limb rehabilitation exoskeletons (SLLREs) allow individuals with lower limb dysfunction to walk without the use of crutches. Stable and human-like walking motions are crucial for SLLREs because achieving a close imitation of healthy human walking is a key goal in rehabilitation therapy. Existing SLLREs can realize stable walking but lack human-like features such as knee-stretched, heel-strike, and toe-off motions. This paper designs a walking motion generator based on hierarchical optimization to generate a human-like walking motion with variable hip height, heel-strike, toe-off, and knee-stretched features. The generator consists of a knee-stretched optimizer and a stabilizing filter: the knee-stretched optimizer realizes the stretched-knee feature by optimizing the hip trajectory with varying heights, and the stabilizing filter realizes stable walking by optimizing the hip trajectory in the sagittal-plane direction. To validate the effectiveness of the proposed human-like walking motion generator, walking experiments were conducted on the SLLRE AutoLEE-G3 both in a simulation environment and in the real world. The experimental results show that the human-like walking motions look more natural and reduce the required knee-joint torque compared with knee-bent walking.

ICRA Conference 2025 Conference Paper

Robotic Sim-to-Real Transfer for Long-Horizon Pick-and-Place Tasks in the Robotic Sim2Real Competition

  • Ming Yang
  • Hongyu Cao
  • Lixuan Zhao
  • Chenrui Zhang
  • Yaran Chen

This paper presents a fully autonomous robotic system that performs sim-to-real transfer in complex long-horizon tasks involving navigation, recognition, grasping, and stacking in an environment with multiple obstacles. The key feature of the system is its ability to overcome typical sensing and actuation discrepancies during sim-to-real transfer and to achieve consistent performance without any algorithmic modifications. To accomplish this, a lightweight noise-resistant visual perception system and a nonlinearity-robust servo system are adopted. We conduct a series of tests in both simulated and real-world environments. The visual perception system achieves a speed of 11 ms per frame due to its lightweight design, and the servo system achieves sub-centimeter accuracy with the proposed controller. Both exhibit high consistency during sim-to-real transfer. Benefiting from these, our robotic system took first place in the mineral searching task of the Robotic Sim2Real Challenge hosted at ICRA 2024.

NeurIPS Conference 2025 Conference Paper

Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization

  • Xingyu Chen
  • Bokun Wang
  • Ming Yang
  • Qihang Lin
  • Tianbao Yang

Finite-sum Coupled Compositional Optimization (FCCO), characterized by its coupled compositional objective structure, emerges as an important optimization paradigm for addressing a wide range of machine learning problems. In this paper, we focus on a challenging class of non-convex non-smooth FCCO, where the outer functions are non-smooth weakly convex or convex and the inner functions are smooth or weakly convex. Existing state-of-the-art results face two key limitations: (1) a high iteration complexity of $O(1/\epsilon^6)$ under the assumption that the stochastic inner functions are Lipschitz continuous in expectation; (2) reliance on vanilla SGD-type updates, which are not suitable for deep learning applications. Our main contributions are twofold: (i) We propose stochastic momentum methods tailored for non-smooth FCCO that come with provable convergence guarantees; (ii) We establish a **new state-of-the-art** iteration complexity of $O(1/\epsilon^5)$. Moreover, we apply our algorithms to multiple inequality constrained non-convex optimization problems involving smooth or weakly convex functional inequality constraints. By optimizing a smoothed hinge penalty based formulation, we achieve a **new state-of-the-art** complexity of $O(1/\epsilon^5)$ for finding a (nearly) $\epsilon$-level KKT solution. Experiments on three tasks demonstrate the effectiveness of the proposed algorithms.

IROS Conference 2024 Conference Paper

A Closed-loop Control for Lower Limb Exoskeleton Considering Overall Deformations: A Simple and Direct Application Method

  • Feng Li 0059
  • Ming Yang
  • Ziqiang Chen
  • Mengbo Luan
  • Dingkui Tian
  • Xinyu Wu 0001

In this paper, considering the overall deformations of the exoskeleton, we couple a deformations relationship network (DRN) with a fractional-order viscoelastic (FOV) controller, proposing a novel DRN-FOV closed-loop control method that endows the exoskeleton with stable dynamic walking ability. Using only data from the 6-axis force/torque sensors, the DRN directly captures the mapping between the foot reaction force/torque of the exoskeleton and its overall deformations. We introduce the FOV controller to eliminate disturbances and stabilize the system during walking tasks. The closed-loop control method directly compensates for the overall deformations of the exoskeleton and enables the wearer to walk stably while wearing it. To assess the effectiveness of the proposed control method, walking tasks were carried out on subjects with varying body parameters using the developed exoskeleton. The experimental results show that the DRN-FOV closed-loop control method accurately estimates and compensates for deformations, resulting in improved dynamic walking ability of the exoskeleton with wearers.

NeurIPS Conference 2024 Conference Paper

Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

  • Ziyuan Huang
  • Kaixiang Ji
  • Biao Gong
  • Zhiwu Qing
  • Qinglong Zhang
  • Kecheng Zheng
  • Jian Wang
  • Jingdong Chen

This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by $\sim$73\%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performances, competitive to existing approaches in a series of benchmarks.

NeurIPS Conference 2024 Conference Paper

CSPG: Crossing Sparse Proximity Graphs for Approximate Nearest Neighbor Search

  • Ming Yang
  • Yuzheng Cai
  • Weiguo Zheng

The state-of-the-art approximate nearest neighbor search (ANNS) algorithm builds a large proximity graph on the dataset and performs a greedy beam search, which may bring many unnecessary explorations. We develop a novel framework, namely crossing sparse proximity graphs (CSPG), based on random partitioning of the dataset. It produces a smaller sparse proximity graph for each partition and routing vectors that bind all the partitions. An efficient two-staged approach is designed for exploring CSPG, with fast approaching and cross-partition expansion. We theoretically prove that CSPG can accelerate existing graph-based ANNS algorithms by reducing unnecessary explorations. In addition, we conduct extensive experiments on benchmark datasets. The experimental results confirm that existing graph-based methods can be significantly outperformed by incorporating CSPG, achieving 1.5x to 2x QPS speedups at almost all recall levels.
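The greedy beam search that CSPG aims to accelerate can be sketched as follows; the graph representation and function names here are illustrative, not from the paper:

```python
import heapq
import numpy as np

def beam_search(graph, vectors, query, entry, beam_width=4):
    """Greedy beam search over a proximity graph (illustrative sketch).

    graph:   dict node_id -> list of neighbor node ids (adjacency list)
    vectors: (n, d) array, one vector per node
    query:   (d,) query vector
    entry:   starting node id
    Returns node ids of the final beam, nearest first.
    """
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap of nodes to expand
    beam = [(-dist(entry), entry)]        # max-heap (negated) of current best
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the nearest unexpanded candidate is worse than the beam's worst.
        if d > -beam[0][0] and len(beam) >= beam_width:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(beam) < beam_width or d_nb < -beam[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(beam, (-d_nb, nb))
                if len(beam) > beam_width:
                    heapq.heappop(beam)   # evict the current worst
    return [n for _, n in sorted((-d, n) for d, n in beam)]
```

Every neighbor expansion above costs a distance computation, which is exactly the exploration overhead that partitioning into smaller sparse graphs is meant to reduce.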

ICML Conference 2024 Conference Paper

DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection

  • Zhi Zhou 0007
  • Ming Yang
  • Jiang-Xin Shi
  • Lan-Zhe Guo
  • Yufeng Li 0008

Vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot capabilities for various downstream tasks. Their performance can be further enhanced through few-shot prompt tuning methods. However, current studies evaluate the performance of learned prompts separately on base and new classes. This evaluation lacks practicality for real-world applications, since downstream tasks cannot determine in advance whether the data belongs to base or new classes. In this paper, we explore a problem setting called Open-world Prompt Tuning (OPT), which involves tuning prompts on base classes and evaluating on a combination of base and new classes. By introducing the Decomposed Prompt Tuning framework (DePT), we theoretically demonstrate that OPT can be solved by incorporating out-of-distribution detection into prompt tuning, thereby enhancing the base-to-new discriminability. Based on DePT, we present a novel prompt tuning approach, namely, Decomposed Context Optimization (DeCoOp), which introduces new-class detectors and sub-classifiers to further enhance the base-class and new-class discriminability. Experimental results on 11 benchmark datasets validate the effectiveness of DePT and demonstrate that DeCoOp outperforms current state-of-the-art methods, providing a significant 2% average accuracy improvement.

IJCAI Conference 2024 Conference Paper

EVE: Efficient Zero-Shot Text-Based Video Editing With Depth Map Guidance and Temporal Consistency Constraints

  • Yutao Chen
  • Xingning Dong
  • Tian Gan
  • Chunluan Zhou
  • Ming Yang
  • Qingpei Guo

Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current video editing methods mainly suffer from a dilemma between high fine-tuning cost and limited generation capacity. Compared with images, we conjecture that videos necessitate more constraints to preserve temporal consistency during editing. Towards this end, we propose EVE, a robust and Efficient zero-shot Video Editing method. Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results with an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparisons, we construct a new benchmark named the ZVE-50 dataset. Through comprehensive experimentation, we validate that EVE achieves a satisfactory trade-off between performance and efficiency. Codebase, datasets, and video editing demos are available at https://github.com/alipay/Ant-Multi-Modal-Framework/blob/main/prj/EVE.

ICRA Conference 2024 Conference Paper

Monocular Localization with Semantics Map for Autonomous Vehicles

  • Jixiang Wan
  • Xudong Zhang
  • Shuzhou Dong
  • Yuwei Zhang
  • Yuchen Yang
  • Ruoxi Wu
  • Ye Jiang
  • Jijunnan Li

Accurate and robust localization remains a significant challenge for autonomous vehicles. The cost of sensors and limitations in local computational efficiency make it difficult to scale to large commercial applications. Traditional vision-based approaches focus on texture features that are susceptible to changes in lighting, season, perspective, and appearance. Additionally, the large storage size of maps with descriptors and complex optimization processes hinder system performance. To balance efficiency and accuracy, we propose a novel lightweight visual semantic localization algorithm that employs stable semantic features instead of low-level texture features. First, semantic maps are constructed offline by detecting semantic objects, such as ground markers, lane lines, and poles, using cameras or LiDAR sensors. Then, online visual localization is performed through data association of semantic features and map objects. We evaluated the proposed localization framework on the publicly available KAIST Urban dataset and on scenarios we recorded ourselves. The experimental results demonstrate that our method is a reliable and practical localization solution for various autonomous driving localization tasks.

NeurIPS Conference 2024 Conference Paper

Referencing Where to Focus: Improving Visual Grounding with Referential Query

  • Yabing Wang
  • Zhuotao Tian
  • Qingpei Guo
  • Zheng Qin
  • Sanping Zhou
  • Ming Yang
  • Le Wang

Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing a stronger multi-modal decoder, which typically generates learnable queries by random initialization or from linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach called RefFormer. It consists of a query adaption module that can be seamlessly integrated into CLIP and generates a referential query to provide prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate the decoder's learning difficulty and accurately concentrate on the target object. Additionally, our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.

ICML Conference 2024 Conference Paper

Stability and Generalization of Stochastic Compositional Gradient Descent Algorithms

  • Ming Yang
  • Xiyuan Wei
  • Tianbao Yang
  • Yiming Ying

Many machine learning tasks can be formulated as a stochastic compositional optimization (SCO) problem such as reinforcement learning, AUC maximization and meta-learning, where the objective function involves a nested composition associated with an expectation. Although many studies have been devoted to studying the convergence behavior of SCO algorithms, there is little work on understanding their generalization, that is, how these learning algorithms built from training data would behave on future test examples. In this paper, we provide the stability and generalization analysis of stochastic compositional gradient descent algorithms in the framework of statistical learning theory. Firstly, we introduce a stability concept called compositional uniform stability and establish its quantitative relation with generalization for SCO problems. Then, we establish the compositional uniform stability results for two notable stochastic compositional gradient descent algorithms, namely SCGD and SCSC. Finally, we derive dimension-independent excess risk bounds for SCGD and SCSC by balancing stability results and optimization errors. To the best of our knowledge, these are the first-ever known results on stability and generalization analysis of stochastic compositional gradient descent algorithms.
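The compositional structure at the heart of these results can be written as the standard SCO objective (a sketch of the usual formulation, with $g$ the inner and $f$ the outer mapping):

```latex
\min_{w \in \mathcal{W}} \; F(w) := f\big(\mathbb{E}_{\xi}[g_{\xi}(w)]\big)
```

The nested expectation is what distinguishes SCO from ordinary stochastic optimization: an unbiased stochastic gradient of $F$ is unavailable because $\nabla f$ is evaluated at an expectation, which is why algorithms like SCGD and SCSC maintain a running estimate of $\mathbb{E}_{\xi}[g_{\xi}(w)]$ alongside the iterates.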

JAAMAS Journal 2024 Journal Article

Team-wise effective communication in multi-agent reinforcement learning

  • Ming Yang
  • Kaiyan Zhao
  • Leong Hou U

Effective communication is crucial for the success of multi-agent systems, as it promotes collaboration for attaining joint objectives and enhances competitive efforts towards individual goals. In the context of multi-agent reinforcement learning, determining "whom", "how", and "what" to communicate is crucial for developing effective policies. Therefore, we propose TeamComm, a novel framework for multi-agent communication reinforcement learning. First, it introduces a dynamic team reasoning policy, allowing agents to dynamically form teams and adapt their communication partners based on task requirements and environment states in cooperative or competitive scenarios. Second, TeamComm utilizes heterogeneous communication channels, consisting of intra- and inter-team channels, to achieve diverse information flow. Lastly, TeamComm leverages the information bottleneck principle to optimize communication content, guiding agents to convey relevant and valuable information. Through experimental evaluations on three popular environments with seven different scenarios, we empirically demonstrate the superior performance of TeamComm compared to existing methods.

AAAI Conference 2023 Conference Paper

Centerless Multi-View K-means Based on the Adjacency Matrix

  • Han Lu
  • Quanxue Gao
  • Qianqian Wang
  • Ming Yang
  • Wei Xia

Although K-Means clustering has been widely studied due to its simplicity, existing methods still have the following fatal drawbacks. Firstly, they need to initialize the cluster centers, which causes unstable clustering performance. Secondly, they perform poorly on non-Gaussian datasets. Inspired by the affinity matrix, we propose a novel multi-view K-Means based on the adjacency matrix. It maps the affinity matrix to the distance matrix according to the principle that every sample has a small distance to the points in its neighborhood and a large distance to the points outside of it. Moreover, this method exploits the complementary information embedded in different views by minimizing the tensor Schatten p-norm regularizer on the third-order tensor formed from the cluster assignment matrices of the different views. Additionally, this method avoids initializing cluster centroids, yielding stable performance, and since there is no need to compute cluster means, our model is not sensitive to outliers. An experiment on a toy dataset shows excellent performance on non-Gaussian datasets, and further experiments on several benchmark datasets demonstrate the superiority of the proposed method.
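For intuition, the matrix Schatten p-norm that the tensor regularizer generalizes can be sketched as follows; the t-SVD-based tensor version used in the paper is more involved, so this is only an illustrative stand-in.

```python
import numpy as np

def schatten_p_norm(A, p=0.5):
    """Schatten p-norm of a matrix: (sum_i sigma_i^p)^(1/p).
    For p < 1 this is a nonconvex surrogate for rank, which is the
    role the tensor Schatten p-norm plays here (a sketch; the paper
    applies a tensor generalization over cluster assignment matrices)."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return np.sum(sigma ** p) ** (1.0 / p)

# A rank-1 matrix has a single nonzero singular value, so its
# Schatten p-norm equals that value for any p.
A = np.outer([1.0, 2.0], [3.0, 4.0])
print(schatten_p_norm(A, p=0.5))
```

Smaller p penalizes small singular values relatively more, pushing the solution toward low rank, i.e., toward a clean cluster structure in the stacked assignment tensor.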

NeurIPS Conference 2023 Conference Paper

Efficient Potential-based Exploration in Reinforcement Learning using Inverse Dynamic Bisimulation Metric

  • Yiming Wang
  • Ming Yang
  • Renzhi Dong
  • Binbin Sun
  • Furui Liu
  • Leong Hou U

Reward shaping is an effective technique for integrating domain knowledge into reinforcement learning (RL). However, traditional approaches like potential-based reward shaping rely entirely on manually designed shaping reward functions, which significantly restricts exploration efficiency and introduces human cognitive biases. A number of RL methods have been proposed to boost exploration by designing an intrinsic reward signal as an exploration bonus; nevertheless, these methods heavily rely on a count-based episodic term in their exploration bonus, which falls short in scalability. To address these limitations, we propose a general end-to-end potential-based exploration bonus for deep RL via potentials of state discrepancy, which motivates the agent to discover novel states and provides denser rewards without manual intervention. Specifically, we measure the novelty of adjacent states by calculating their distance using the bisimulation metric-based potential function, which enhances the agent's exploration and ensures policy invariance. In addition, we offer a theoretical guarantee on our inverse dynamic bisimulation metric, bounding the value difference and ensuring that the agent explores states with higher TD error, thus significantly improving training efficiency. The proposed approach is named LIBERTY (expLoration vIa Bisimulation mEtRic-based sTate discrepancY) and is comprehensively evaluated on the MuJoCo and Arcade Learning Environments. Extensive experiments have verified the superiority and scalability of our algorithm compared with other competitive methods.
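The policy invariance mentioned above comes from classical potential-based reward shaping (Ng et al., 1999), which the paper builds on: the shaping term F(s, s') = γ·Φ(s') − Φ(s) is added to the environment reward. The potential below is a toy stand-in for the paper's bisimulation-metric-based potential.

```python
# Minimal sketch of potential-based reward shaping. The shaping term
# F(s, s') = gamma * phi(s') - phi(s) provably preserves the optimal
# policy because its discounted sum telescopes along any trajectory.

GAMMA = 0.99

def shaped_reward(r, s, s_next, phi, gamma=GAMMA):
    """Environment reward r plus the potential-based shaping term."""
    return r + gamma * phi(s_next) - phi(s)

# Telescoping: the discounted sum of the shaping terms along a
# trajectory equals gamma^T * phi(s_T) - phi(s_0), independent of the
# intermediate states visited.
phi = lambda s: float(s)          # toy potential on integer states
traj = [0, 1, 3, 2]
discounted = sum(GAMMA ** t * shaped_reward(0.0, s, s2, phi)
                 for t, (s, s2) in enumerate(zip(traj, traj[1:])))
print(discounted)
```

Because the telescoped sum depends only on the endpoints, shaping changes the density of reward signals without changing which policy is optimal, which is exactly the invariance the abstract claims for its learned potential.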

AAAI Conference 2023 Conference Paper

High-Level Semantic Feature Matters Few-Shot Unsupervised Domain Adaptation

  • Lei Yu
  • Wanqi Yang
  • Shengqi Huang
  • Lei Wang
  • Ming Yang

In few-shot unsupervised domain adaptation (FS-UDA), most existing methods follow few-shot learning (FSL) methods in leveraging low-level local features (learned from conventional convolutional models, e.g., ResNet) for classification. However, the goals of FS-UDA and FSL are related yet distinct, since FS-UDA aims to classify samples in the target domain rather than the source domain. We found that local features are insufficient for FS-UDA: they can introduce noise or bias against classification and cannot be used to effectively align the domains. To address these issues, we aim to refine the local features to be more discriminative and relevant to classification. Thus, we propose a novel task-specific semantic feature learning method (TSECS) for FS-UDA. TSECS learns high-level semantic features for image-to-class similarity measurement. Based on the high-level features, we design a cross-domain self-training strategy that leverages the few labeled samples in the source domain to build a classifier in the target domain. In addition, we minimize the KL divergence between the high-level feature distributions of the source and target domains to shorten the distance between samples of the two domains. Extensive experiments on DomainNet show that the proposed method significantly outperforms SOTA methods in FS-UDA by a large margin (i.e., ~10%).

TIST Journal 2023 Journal Article

Hyper-Laplacian Regularized Multi-View Clustering with Exclusive L21 Regularization and Tensor Log-Determinant Minimization Approach

  • Qilun Luo
  • Ming Yang
  • Wen Li
  • Mingqing Xiao

Multi-view clustering aims to capture the inherent information of multiple views by identifying the data clustering that reflects distinct features of datasets. Since there is a consensus in the literature that different views of a dataset share a common latent structure, most existing multi-view subspace learning methods rely on the nuclear norm to seek the low-rank representation of the underlying subspace. However, the nuclear norm often fails to distinguish the variance of features for each cluster due to its convex nature, and data tends to fall in multiple non-linear subspaces for multi-dimensional datasets. To address these problems, we propose a new and novel multi-view clustering method (HL-L21-TLD-MSC) that unifies Hyper-Laplacian (HL) and exclusive ℓ2,1 (L21) regularization with the Tensor Log-Determinant Rank Minimization (TLD) setting. Specifically, the hyper-Laplacian regularization maintains the local geometrical structure, making the estimation able to accommodate nonlinearities, and the mixed ℓ2,1 and ℓ1,2 regularization provides joint sparsity within clusters as well as exclusive sparsity between clusters. Furthermore, a log-determinant function is used as a tighter tensor rank approximation to discriminate the dimension of features. An efficient alternating algorithm is then derived to optimize the proposed model, and the construction of a sequence converging to a Karush-Kuhn-Tucker (KKT) critical point is mathematically validated in detail. Extensive experiments are conducted on ten well-known datasets to demonstrate that the proposed approach outperforms existing state-of-the-art approaches in various scenarios; on six of the datasets it achieves perfect results under the framework developed in this article, demonstrating the high effectiveness of the proposed approach.

NeurIPS Conference 2023 Conference Paper

Orthogonal Non-negative Tensor Factorization based Multi-view Clustering

  • Jing Li
  • Quanxue Gao
  • Qianqian Wang
  • Ming Yang
  • Wei Xia

Multi-view clustering (MVC) based on non-negative matrix factorization (NMF) and its variants has attracted much attention due to its advantages in clustering interpretability. However, existing NMF-based multi-view clustering methods perform NMF on each view separately and ignore between-view interactions. Thus, they cannot fully exploit the within-view spatial structure and between-view complementary information. To resolve this issue, we present orthogonal non-negative tensor factorization (Orth-NTF) and develop a novel multi-view clustering method based on Orth-NTF with a one-side orthogonal constraint. Our model directly performs Orth-NTF on the 3rd-order tensor composed of the anchor graphs of the views, and thus directly accounts for the between-view relationship. Moreover, we use the tensor Schatten $p$-norm regularization as a rank approximation of the 3rd-order tensor, which characterizes the cluster structure of multi-view data and exploits the between-view complementary information. In addition, we provide an optimization algorithm for the proposed method and prove mathematically that the algorithm always converges to a stationary KKT point. Extensive experiments on various benchmark datasets indicate that our proposed method achieves satisfactory clustering performance.

ICLR Conference 2022 Conference Paper

Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training

  • Pengcheng Yang
  • Xiaoming Zhang
  • Wenpeng Zhang 0003
  • Ming Yang
  • Hong Wei

The recent trend of using large-scale deep neural networks (DNN) to boost performance has propelled the development of the parallel pipelining technique for efficient DNN training, which has resulted in several prominent pipelines such as GPipe, PipeDream, and PipeDream-2BW. However, the current leading pipeline, PipeDream-2BW, still suffers from two major drawbacks, i.e., excessive memory redundancy and delayed weight updates across all stages. In this work, we propose a novel pipeline named WPipe, which achieves better memory efficiency and fresher weight updates. WPipe uses a novel pipelining scheme that divides model partitions into two groups. It moves the forward pass of the next period of weight updates to the front of the backward pass of the current period of weight updates in the first group, retains the order in the second group, and updates each group alternately. This scheme can eliminate half of the delayed gradients and memory redundancy compared to PipeDream-2BW. The experiments, which train large BERT language models, show that compared to PipeDream-2BW, WPipe achieves $1.4\times$ acceleration and reduces the memory footprint by 36%, with almost no sacrifice in final model accuracy.

AAAI Conference 2022 Conference Paper

Towards Accurate Facial Motion Retargeting with Identity-Consistent and Expression-Exclusive Constraints

  • Langyuan Mo
  • Haokun Li
  • Chaoyang Zou
  • Yubing Zhang
  • Ming Yang
  • Yihong Yang
  • Mingkui Tan

We address the problem of facial motion retargeting, which aims to transfer facial motion from a 2D face image to 3D characters. Existing methods often formulate this problem as a 3D face reconstruction problem, which estimates face attributes such as identity and expression from face images. However, due to the lack of ground-truth labels for both identity and expression, most 3D-face-reconstruction-based methods fail to capture facial identity and expression accurately, and as a result may not achieve promising performance. To address this, we propose an identity-consistent constraint to learn accurate identities by encouraging consistent identity prediction across multiple frames. Based on a more accurate identity, we are able to obtain a more accurate facial expression. Moreover, we further propose an expression-exclusive constraint to improve performance by avoiding the co-occurrence of contradictory expression units (e.g., "brow lower" vs. "brow raise"). Extensive experiments on facial motion retargeting and 3D face reconstruction tasks demonstrate the superiority of the proposed method over existing methods. Our code and supplementary materials are available at https://github.com/deepmo24/CPEM.

AAAI Conference 2021 Conference Paper

Robust Knowledge Transfer via Hybrid Forward on the Teacher-Student Model

  • Liangchen Song
  • Jialian Wu
  • Ming Yang
  • Qian Zhang
  • Yuan Li
  • Junsong Yuan

When adopting deep neural networks for a new vision task, a common practice is to start by fine-tuning some off-the-shelf well-trained network models from the community. Since a new task may require training a different network architecture with new domain data, taking advantage of off-the-shelf models is not trivial and generally requires considerable trial and error and parameter tuning. In this paper, we denote a well-trained model as the teacher network and a model for the new task as the student network. We aim to ease the effort of transferring knowledge from the teacher to the student network, robust to the gaps between their network architectures, domain data, and task definitions. Specifically, we propose a hybrid forward scheme for training the teacher-student models, alternately updating the layer weights of the student model. The key merit of our hybrid forward scheme is the dynamic balance between the knowledge transfer loss and the task-specific loss during training. We demonstrate the effectiveness of our method on a variety of tasks, e.g., model compression, segmentation, and detection, under a variety of knowledge transfer settings.

JBHI Journal 2020 Journal Article

An Effective MR-Guided CT Network Training for Segmenting Prostate in CT Images

  • Wanqi Yang
  • Yinghuan Shi
  • Sang Hyun Park
  • Ming Yang
  • Yang Gao
  • Dinggang Shen

Segmentation of the prostate in medical imaging data (e.g., CT, MRI, TRUS) is often considered a critical yet challenging task for radiotherapy treatment. It is relatively easier to segment the prostate from MR images than from CT images, due to the better soft tissue contrast of MR images. For segmenting the prostate from CT images, most previous methods mainly used CT alone, and their performance is thus often limited by the low tissue contrast in CT images. In this article, we explore the possibility of using indirect guidance from MR images to improve prostate segmentation in CT images. In particular, we propose a novel deep transfer learning approach, i.e., MR-guided CT network training (MICS-NET), which can employ MR images to help learn better features in CT images for prostate segmentation. In MICS-NET, the guidance from MRI consists of two steps: (1) learning informative and transferable features from MRI and then transferring them to CT images in a cascade manner, and (2) adaptively transferring the prostate likelihood of the MRI model (i.e., a convnet well trained purely on MR images) with a view consistency constraint. To illustrate the effectiveness of our approach, we evaluate MICS-NET on a real CT prostate image set, with manual delineations available as the ground truth for evaluation. Our method generates promising segmentation results, achieving (1) a Dice ratio six percentage points higher than the CT model using CT images alone and (2) comparable performance with the MRI model using MR images alone.

AAAI Conference 2020 Conference Paper

Toward A Thousand Lights: Decentralized Deep Reinforcement Learning for Large-Scale Traffic Signal Control

  • Chacha Chen
  • Hua Wei
  • Nan Xu
  • Guanjie Zheng
  • Ming Yang
  • Yuanhao Xiong
  • Kai Xu
  • Zhenhui Li

Traffic congestion plagues cities around the world. Recent years have witnessed an unprecedented trend in applying reinforcement learning to traffic signal control. However, the primary challenge is to control and coordinate traffic lights in large-scale urban networks. No one has ever tested RL models on a network of more than a thousand traffic lights. In this paper, we tackle the problem of multi-intersection traffic signal control, especially for large-scale networks, based on RL techniques and transportation theories. This problem is quite difficult because of challenges such as scalability, signal coordination, and data feasibility. To address these challenges, we (1) design our RL agents utilizing the 'pressure' concept to achieve signal coordination at the region level; (2) show that implicit coordination can be achieved by individual control agents with well-crafted reward design, thus reducing the dimensionality; and (3) conduct extensive experiments on multiple scenarios, including a real-world scenario with 2510 traffic lights in Manhattan, New York City.
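The 'pressure' concept from max-pressure control that the agents use can be illustrated with a toy computation: the pressure of a traffic movement is the queue length on its incoming lane minus that on its outgoing lane, and minimizing total pressure balances queues across the network. The lane pairs and queue numbers below are made up for illustration.

```python
def intersection_pressure(movements):
    """Pressure of an intersection: sum over movements of
    (incoming queue length - outgoing queue length).
    movements: list of (incoming_queue, outgoing_queue) pairs."""
    return sum(q_in - q_out for q_in, q_out in movements)

# Three hypothetical movements at one intersection.
movements = [(8, 2), (5, 5), (1, 4)]
print(intersection_pressure(movements))
```

Using the negative pressure as the per-agent reward gives each intersection a purely local signal, which is how well-crafted reward design yields the implicit region-level coordination claimed above.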

IJCAI Conference 2019 Conference Paper

Resolution-invariant Person Re-Identification

  • Shunan Mao
  • Shiliang Zhang
  • Ming Yang

Exploiting resolution invariant representation is critical for person Re-Identification (ReID) in real applications, where the resolutions of captured person images may vary dramatically. This paper learns person representations robust to resolution variance through jointly training a Foreground-Focus Super-Resolution (FFSR) module and a Resolution-Invariant Feature Extractor (RIFE) by end-to-end CNN learning. FFSR upscales the person foreground using a fully convolutional auto-encoder with skip connections learned with a foreground focus training loss. RIFE adopts two feature extraction streams weighted by a dual-attention block to learn features for low and high resolution images, respectively. These two complementary modules are jointly trained, leading to a strong resolution invariant representation. We evaluate our methods on five datasets containing person images at a large range of resolutions, where our methods show substantial superiority to existing solutions. For instance, we achieve Rank-1 accuracy of 36.4% and 73.3% on CAVIAR and MLR-CUHK03, outperforming the state-of-the-art by 2.9% and 2.6%, respectively.

IJCAI Conference 2018 Conference Paper

Feature Integration with Adaptive Importance Maps for Visual Tracking

  • Aishi Li
  • Ming Yang
  • Wanqi Yang

Discriminative correlation filters have recently achieved excellent performance for visual object tracking. The key to this success is making full use of dense sampling and the specific properties of circulant matrices in the Fourier domain. However, previous studies do not take into account the importance and complementary information of different features, simply concatenating them. This paper investigates an effective method of feature integration for correlation filters, which jointly learns the filters as well as importance maps in each frame. These importance maps borrow the advantages of different features, aiming to achieve complementary traits and improve robustness. Moreover, for each feature, an importance map is shared by all its channels to avoid overfitting. In addition, we introduce a regularization term for the importance maps and use a penalty factor to control the significance of features. Based on handcrafted and CNN features, we implement two trackers, which achieve competitive performance compared with several state-of-the-art trackers.

AAAI Conference 2017 Conference Paper

Learning with Feature Network and Label Network Simultaneously

  • Yingming Li
  • Ming Yang
  • Zenglin Xu
  • Zhongfei (Mark) Zhang

For many supervised learning problems, limited training samples and incomplete labels are two difficult challenges, which usually lead to degenerated performance on label prediction. To improve the generalization performance, in this paper we propose Doubly Regularized Multi-Label learning (DRML), which exploits feature network and label network regularization simultaneously. In more detail, the proposed algorithm first constructs a feature network and a label network with a marginalized linear denoising autoencoder on the data feature set and the label set, respectively, and then learns a robust predictor with the feature network and label network regularization simultaneously. While DRML is a general method for multi-label learning, in the evaluations we focus on the specific application of multi-label text tagging. Extensive evaluations on three benchmark data sets demonstrate that DRML stands out with superior performance in comparison with existing multi-label learning methods.

AAAI Conference 2016 Conference Paper

Learning with Marginalized Corrupted Features and Labels Together

  • Yingming Li
  • Ming Yang
  • Zenglin Xu
  • Zhongfei Zhang

Tagging has become increasingly important in many real-world applications, notably including web applications such as web blogs and resource sharing systems. Despite this importance, tagging methods often face difficult challenges, such as limited training samples and incomplete labels, which usually lead to degenerated performance on tag prediction. To improve the generalization performance, in this paper we propose Regularized Marginalized Cross-View learning (RMCV), which jointly models attribute noise and label noise. In more detail, the proposed model constructs infinite training examples with attribute noise from known exponential-family distributions and exploits label noise via a marginalized denoising autoencoder. Therefore, the model benefits from its robustness and alleviates the problem of tag sparsity. While RMCV is a general method for learning tagging, in the evaluations we focus on the specific application of multi-label text tagging. Extensive evaluations on three benchmark data sets demonstrate that RMCV stands out with superior performance in comparison with state-of-the-art methods.

IJCAI Conference 2016 Conference Paper

Multi-View Learning with Limited and Noisy Tagging

  • Yingming Li
  • Ming Yang
  • Zenglin Xu
  • Zhongfei (Mark) Zhang

Multi-view tagging has become increasingly popular in applications where data representations from multiple views exist. A robust multi-view tagging method must have the capability to meet two challenging requirements: limited labeled training samples and noisy labeled training samples. In this paper, we investigate this challenging problem of learning with limited and noisy tagging and propose a discriminative model, called MSMC, that exploits both labeled and unlabeled data through a semi-parametric regularization and incorporates multi-label space consistency into the optimization. While MSMC is a general method for learning with multi-view, limited, and noisy tagging, in the evaluations we focus on the specific application of noisy image tagging with limited labeled training samples on a benchmark dataset. Extensive evaluations in comparison with the state-of-the-art literature demonstrate that MSMC stands out with superior performance.