Arrow Research search

Author name cluster

Jieping Ye

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

127 papers
2 author rows

Possible papers

127

AAAI Conference 2026 Conference Paper

Bridging the Language Gap: Uncovering and Aligning Shared Circuits for Multi-Hop Reasoning in Multilingual LLMs

  • Chenghao Sun
  • Zhen Huang
  • Yonggang Zhang
  • Xinmei Tian
  • Xu Shen
  • Jieping Ye

Large language models (LLMs) present a paradox: they can correctly answer a multi-hop factual query in a high-resource language like English, yet fail on the identical query in another language. This raises a fundamental question about the nature of multilingual knowledge: are facts missing, or merely inaccessible? The underlying mechanisms for this knowledge gap have remained largely unexplored. In this work, we resolve this question by introducing a mechanistic interpretability framework that traces the causal pathways of multi-hop knowledge reasoning. Our analysis reveals a core, non-obvious finding: cross-lingual inconsistencies do not stem from a knowledge deficit. Instead, factual knowledge is robustly stored in a set of **shared, language-agnostic semantic neurons**. The failure originates from **misaligned attention pathways**, where a common set of critical attention heads fails to correctly route information along the reasoning chain to the appropriate knowledge neurons in lower-resource languages. This mechanistic diagnosis motivates a targeted alignment strategy: a surgical fine-tuning of only these critical heads. Experiments demonstrate that our method achieves significant improvements in multilingual multi-hop factuality—with positive cross-lingual transfer—while uniquely preserving general model capabilities, offering a scalable and mechanistically-grounded approach to building more reliable multilingual models.

AAAI Conference 2026 Conference Paper

Enhancing Spatial Reasoning Through Visual and Textual Thinking

  • Xun Liang
  • Xin Guo
  • Zhongming Jin
  • Weihang Pan
  • Penghui Shang
  • Deng Cai
  • Binbin Lin
  • Jieping Ye

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they are still struggling with the spatial reasoning task. In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to generate location-related specific tokens of important targets automatically. Not only are the objects mentioned in the problem addressed, but also the potential objects related to the reasoning are considered. During the spatial textual thinking phase, our model conducts long-term thinking based on visual cues and dialogues and gradually inferences the answers to spatial reasoning problems. To effectively support the model's training, we made manual corrections to the existing spatial reasoning dataset, eliminating numerous incorrect labels resulting from automatic annotation, restructuring the data input format to enhance generalization, and developing a reasoning framework for model thinking. Without introducing any additional information (such as masks or depth), our model's overall average level in several spatial understanding tasks has significantly improved compared with other models.

AAAI Conference 2026 Conference Paper

FGD-Align: Pluralistic Alignment for Large Language Models via Fuzzy Group Decision-Making

  • Weihang Pan
  • Zhengxu Yu
  • Yong Wu
  • Xun Liang
  • Zhongming Jin
  • Qiang Fu
  • Penghui Shang
  • Binbin Lin

Ensuring alignment with human values is essential for modern large language models (LLMs), especially amid growing concerns around AI safety and social impact. Yet achieving such alignment remains challenging due to the limited, noisy, and often conflicting nature of human feedback from diverse annotators. Most existing approaches, such as Direct Preference Optimization (DPO), assume consistent and conflict-free supervision, overlooking the ambiguity, inconsistency, and value trade-offs inherent in real-world preferences—often leading to reduced robustness and exclusion of minority views. To address this, we propose FGD-Align, a novel pluralistic alignment framework grounded in Fuzzy Group Decision-Making theory. Our approach rigorously models and aggregates human preferences while retaining the complexity of real-world value trade-offs. Unlike traditional methods that rely on coarse-grained preference pairs, FGD-Align introduces fuzzy preference modeling via triangular fuzzy numbers to capture nuanced, multi-criteria human judgments. We further develop a new training objective, Probabilistic Fuzzy DPO, which incorporates fuzzy preference strength as adaptive loss weights and gradient filters, enhancing robustness to ambiguity and inconsistency in feedback. Comprehensive experiments demonstrate that FGD-Align consistently outperforms both DPO variants and advanced preference aggregation methods in terms of preference accuracy and robustness to ambiguity. It achieves superior alignment stability and better preserves minority preferences, all with minimal computational overhead. Our work bridges the gap between algorithmic tractability and the nuanced landscape of human values, enabling more scalable, inclusive, and socially-aware AI alignment.

AAAI Conference 2026 Conference Paper

Flora: Effortless Context Construction to Arbitrary Length and Scale

  • Tianxiang Chen
  • Zhentao Tan
  • Xiaofan Bo
  • Yue Wu
  • Tao Gong
  • Qi Chu
  • Jieping Ye

Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks.

AIIM Journal 2025 Journal Article

CATI: A medical context-enhanced framework for diagnosis code assignment in the UK Biobank study

  • Yue Shen
  • Jie Wang
  • Zhe Wang
  • Zhihao Shi
  • Hanzhu Chen
  • Zheng Wang
  • Yukang Jiang
  • Xiaopu Wang

Diagnosis codes are standard code format of diseases or medical conditions. This study is aimed at assigning diagnosis codes to patients in large-scale biobanks, particularly addressing the issue of missing codes for some patients. This is crucial for downstream disease-related tasks. While recent methods primarily rely on structured biobank data for code assignment, they often overlook the valuable medical context provided by textual information in the biobanks and hierarchical structure of the disease coding system. To address this gap, we have developed CATI, a medical context-enhanced framework for diagnosis Code Assignment by integrating Textual details derived from key features and disease hIerarchy. The study is based on the UK Biobank data and considers Phecodes and ICD-10 codes as standard disease formats. We start by representing ten informative codified features using their formal names and then integrate them into CATI as text embeddings, achieved through prompt tuning on the pre-trained language model BioBERT. Recognizing the hierarchical structure of diagnosis codes, we have developed a novel convolution layer in our method that effectively propagates logits between adjacent diagnosis codes. Evaluation results demonstrate that CATI outperforms existing state-of-the-art methods in terms of both Phecodes and ICD-10 codes, boasting at least a 5. 16% improvement in average AUROC for unseen disease codes and an 8. 68% rise in average AUPRC for disease codes with training instances ranging in (1000, 10000]. This framework contributes to the formation of well-defined cohorts for downstream studies and offers a unique perspective for addressing complex healthcare tasks by incorporating vital medical context.

ICRA Conference 2025 Conference Paper

CoL3D: Collaborative Learning of Single-view Depth and Camera Intrinsics for Metric 3D Shape Recovery

  • Chenghao Zhang
  • Lubin Fan
  • Shen Cao
  • Bojian Wu
  • Jieping Ye

Recovering the metric 3D shape from a single image is particularly relevant for robotics and embodied in-telligence applications, where accurate spatial understanding is crucial for navigation and interaction with environments. Usu-ally, the mainstream approaches achieve it through monocular depth estimation. However, without camera intrinsics, the 3D metric shape can not be recovered from depth alone. In this study, we theoretically demonstrate that depth serves as a 3D prior constraint for estimating camera intrinsics and uncover the reciprocal relations between these two elements. Motivated by this, we propose a collaborative learning framework for jointly estimating depth and camera intrinsics, named CoL3D, to learn metric 3D shapes from single images. Specifically, CoL3D adopts a unified network and performs collaborative optimization at three levels: depth, camera intrinsics, and 3D point clouds. For camera intrinsics, we design a canonical incidence field mechanism as a prior that enables the model to learn the residual incident field for enhanced calibration. Additionally, we incorporate a shape similarity measurement loss in the point cloud space, which improves the quality of 3D shapes essential for robotic applications. As a result, when training and testing on a single dataset with in-domain settings, CoL3D delivers outstanding performance in both depth estimation and camera calibration across several indoor and outdoor benchmark datasets, which leads to remarkable 3D shape quality for the perception capabilities of robots.

NeurIPS Conference 2025 Conference Paper

Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning

  • Kongcheng Zhang
  • QI YAO
  • Shunyu Liu
  • Yingjie Wang
  • Baisheng Lai
  • Jieping Ye
  • Mingli Song
  • Dacheng Tao

Recent advances of Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits the broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers ( high consistency ) with minimal deviation toward other candidates ( low volatility ). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Co nsistency and Vo latility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https: //github. com/sastpg/CoVo.

NeurIPS Conference 2025 Conference Paper

Controlling Thinking Speed in Reasoning Models

  • Zhengkai Lin
  • Zhihang Fu
  • Ze Chen
  • Chao Chen
  • Liang Xie
  • Wenxiao Wang
  • Deng Cai
  • Zheng Wang

Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average +1. 3\% accuracy with -8. 6\% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.

ICLR Conference 2025 Conference Paper

Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models

  • Shaotian Yan
  • Chen Shen 0003
  • Wenxiao Wang 0001
  • Liang Xie 0003
  • Junjie Liu 0002
  • Jieping Ye

Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs), functioning as a whole to guide these models in generating reasoning steps toward final answers. However, we observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. The model may overly concentrate on certain local information present in the demonstration, introducing irrelevant noise into the reasoning process and potentially leading to incorrect answers. In this paper, we investigate the underlying mechanism of CoT through dynamically tracing and manipulating the inner workings of LLMs at each output step, which demonstrates that tokens exhibiting specific attention characteristics are more likely to induce the model to take things out of context; these tokens directly attend to the hidden states tied with prediction, without substantial integration of non-local information. Building upon these insights, we propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens and subsequently make targeted adjustments to the attention weights to effectively suppress their distracting effect on LLMs. Comprehensive experiments across multiple benchmarks demonstrate consistent improvements over baseline methods, with a remarkable 5.91\% improvement on the AQuA dataset, further highlighting the effectiveness of FAI.

NeurIPS Conference 2025 Conference Paper

EchoShot: Multi-Shot Portrait Video Generation

  • Jiahao Wang
  • Hualian Sheng
  • Sijia Cai
  • Weizhan Zhang
  • Caixia Yan
  • Yachuang Feng
  • Bing Deng
  • Jieping Ye

Video diffusion models substantially boost the productivity of artistic workflows with high-quality portrait video generative capacity. However, prevailing pipelines are primarily constrained to single-shot creation, while real-world applications urge multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. To start with, we propose shot-aware position embedding mechanisms within the video diffusion transformer architecture to model inter-shot variations and establish intricate correspondence between multi-shot visual content and their textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model training within multi-shot scenarios, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset featuring cross-shot identity consistency and fine-grained captions such as facial attributes, outfits, and dynamic motions. To further enhance applicability, we extend EchoShot to perform reference image-based personalized multi-shot generation and long video synthesis with infinite shot counts. Extensive evaluations demonstrate that EchoShot achieves superior identity consistency as well as attribute-level controllability in multi-shot portrait video generation. Notably, the proposed framework demonstrates potential as a foundational paradigm for general multi-shot video modeling. Project page: https: //johnneywang. github. io/EchoShot-webpage.

AAAI Conference 2025 Conference Paper

GARLIC: GPT-Augmented Reinforcement Learning with Intelligent Control for Vehicle Dispatching

  • Xiao Han
  • Zijian Zhang
  • Xiangyu Zhao
  • Yuanshao Zhu
  • Guojiang Shen
  • Xiangjie Kong
  • Xuetao Wei
  • Liqiang Nie

As urban residents demand higher travel quality, vehicle dispatch has become a critical component of online ride-hailing services. However, current vehicle dispatch systems struggle to navigate the complexities of urban traffic dynamics, including unpredictable traffic conditions, diverse driver behaviors, and fluctuating supply and demand patterns. These challenges have resulted in travel difficulties for passengers in certain areas, while many drivers in other areas are unable to secure orders, leading to a decline in the overall quality of urban transportation services. To address these issues, this paper introduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with Intelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs to capture hierarchical traffic states, and learns a dynamic reward function that accounts for individual driving behaviors. The framework further integrates a GPT model trained with a custom loss function to enable high-precision predictions and optimize dispatching policies in real-world scenarios. Experiments conducted on two real-world datasets demonstrate that GARLIC effectively aligns with driver behaviors while reducing the empty load rate of vehicles.

ICLR Conference 2025 Conference Paper

Improving Complex Reasoning with Dynamic Prompt Corruption: A Soft Prompt Optimization Approach

  • Sinan Fan
  • Liang Xie 0003
  • Chen Shen 0003
  • Ge Teng
  • Xiaosong Yuan
  • Xiaofeng Zhang 0006
  • Chenxi Huang 0004
  • Wenxiao Wang 0001

Prompt Tuning (PT) has emerged as a promising Parameter-Efficient Fine-Tuning (PEFT) approach by appending trainable continuous prompt vectors to the input, maintaining competitive performance with significantly fewer trainable parameters. While PT has shown effectiveness in enhancing task performance, particularly for classification tasks, its application to complex reasoning tasks has been largely overlooked. Our investigation reveals that PT provides limited improvement and may even degrade performance in reasoning tasks. This phenomenon suggests that soft prompts can positively impact certain instances while negatively affecting others, particularly during the latter stages of reasoning. To address these challenges, we propose a novel method called Dynamic Prompt Corruption (DPC), which seeks to optimize the use of soft prompts in reasoning tasks. DPC dynamically adjusts the influence of soft prompts based on their impact on the reasoning process. Specifically, it involves two key components: Dynamic Trigger and Dynamic Corruption. Dynamic Trigger measures the influence of soft prompts, determining whether their impact is beneficial or detrimental. Dynamic Corruption mitigates the negative effects of soft prompts by selectively masking key tokens that interfere with the reasoning process. We validate our approach through extensive experiments on various large language models (LLMs) and reasoning tasks, including GSM8K, MATH, and AQuA. The results demonstrate that Dynamic Prompt Corruption consistently improves the performance of LLMs, achieving 4\%-8\% accuracy gains compared to standard prompt tuning. These findings highlight the effectiveness of our approach and its potential to enhance complex reasoning in LLMs.

ICLR Conference 2025 Conference Paper

Knowledge Graph Finetuning Enhances Knowledge Manipulation in Large Language Models

  • Hanzhu Chen
  • Xu Shen 0001
  • Jie Wang 0005
  • Zehao Wang
  • Qitan Lv
  • Junjie He
  • Rong Wu
  • Feng Wu 0001

Despite the impressive performance of general large language models(LLMs), many of their applications in specific domains (e.g., low-data and knowledge-intensive) still confront significant challenges. Supervised fine-tuning (SFT)---where a general LLM is further trained on a small labeled dataset to adapt for specific tasks or domains---has shown great power for developing domain-specific LLMs. However, existing SFT data primarily consist of Question and Answer (Q&A) pairs, which poses a significant challenge for LLMs to comprehend the correlation and logic of knowledge underlying the Q&A. To address this challenge, we propose a conceptually flexible and general framework to boost SFT, namely Knowledge Graph-Driven Supervised Fine-Tuning (KG-SFT). The key idea of KG-SFT is to generate high-quality explanations for each Q&A pair via a structured knowledge graph to enhance the knowledge comprehension and manipulation of LLMs. Specifically, KG-SFT consists of three components: Extractor, Generator, and Detector. For a given Q&A pair, (i) Extractor first identifies entities within Q&A pairs and extracts relevant reasoning subgraphs from external KGs, (ii) Generator then produces corresponding fluent explanations utilizing these reasoning subgraphs, and (iii) finally, Detector performs sentence-level knowledge conflicts detection on these explanations to guarantee the reliability. KG-SFT focuses on generating high-quality explanations to improve the quality of Q&A pair, which reveals a promising direction for supplementing existing data augmentation methods. Extensive experiments on fifteen different domains and six different languages demonstrate the effectiveness of KG-SFT, leading to an accuracy improvement of up to 18% and an average of 8.7% in low-data scenarios.

NeurIPS Conference 2025 Conference Paper

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

  • yuyang Hong
  • Jiaqi Gu
  • Yang Qi
  • Lubin Fan
  • Yue Wu
  • Ying Wang
  • Kun Ding
  • Shiming Xiang

The task of Knowlegde-Based Visual Question Answering (KB-VQA) requires the model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) have been employed to address this problem through knowledge base querying. However, existing work demonstrate two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of retrieved information for Visual-Language Model (VLM). To address these challenges, we propose a three-stage visual language model with Process, Retrieve and Filter (VLM-PRF) framework. For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform the raw retrieved information into into task-specific knowledge. With a dual reward as supervisory signals, VLM-PRF successfully enable model to optimize retrieval strategies and answer generation capabilities simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.

ICLR Conference 2025 Conference Paper

Leveraging Submodule Linearity Enhances Task Arithmetic Performance in LLMs

  • Rui Dai
  • Sile Hu
  • Xu Shen 0001
  • Yonggang Zhang 0003
  • Xinmei Tian 0001
  • Jieping Ye

Task arithmetic is a straightforward yet highly effective strategy for model merging, enabling the resultant model to exhibit multi-task capabilities. Recent research indicates that models demonstrating linearity enhance the performance of task arithmetic. In contrast to existing methods that rely on the global linearization of the model, we argue that this linearity already exists within the model's submodules. In particular, we present a statistical analysis and show that submodules (e.g., layers, self-attentions, and MLPs) exhibit significantly higher linearity than the overall model. Based on these findings, we propose an innovative model merging strategy that independently merges these submodules. Especially, we derive a closed-form solution for optimal merging weights grounded in the linear properties of these submodules. Experimental results demonstrate that our method consistently outperforms the standard task arithmetic approach and other established baselines across different model scales and various tasks. This result highlights the benefits of leveraging the linearity of submodules and provides a new perspective for exploring solutions for effective and practical multi-task model merging.

ICRA Conference 2025 Conference Paper

PTZ-Calib: Robust Pan-Tilt-Zoom Camera Calibration

  • Jinhui Guo
  • Lubin Fan
  • Bojian Wu
  • Jiaqi Gu 0004
  • Shen Cao
  • Jieping Ye

In this paper, we present PTZ-Calib, a robust two-stage PTZ camera calibration method, that efficiently and accurately estimates camera parameters for arbitrary viewpoints. Our method includes an offline and an online stage. In the offline stage, we first uniformly select a set of reference images that sufficiently overlap to encompass a complete 360° view. We then utilize the novel PTZ-IBA (PTZ Incremental Bundle Adjustment) algorithm to automatically calibrate the cameras within a local coordinate system. Additionally, for practical application, we can further optimize camera parameters and align them with the geographic coordinate system using extra global reference 3D information. In the online stage, we formulate the calibration of any new viewpoints as a relocalization problem. Our approach balances the accuracy and computational efficiency to meet real-world demands. Extensive evaluations demonstrate our robustness and superior performance over state-of-the-art methods on various real and synthetic datasets. Datasets and source code can be accessed online at https://github.com/gjgjh/PTZ-Calib

ICML Conference 2025 Conference Paper

Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger

  • Qi Yang 0015
  • Chenghao Zhang 0003
  • Lubin Fan
  • Kun Ding 0001
  • Jieping Ye
  • Shiming Xiang

Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs.

ICML Conference 2025 Conference Paper

ROPO: Robust Preference Optimization for Large Language Models

  • Xize Liang
  • Chao Chen 0026
  • Shuang Qiu
  • Jie Wang 0005
  • Yue Wu
  • Zhihang Fu
  • Hanzhu Chen
  • Feng Wu 0001

The prevalent noise in the preference data unavoidably poses significant challenges to the preference alignment of large language models (LLMs). Existing efforts for this problem either marginally alleviate the impact of noise without noise reduction, or rely on external LLMs that incur substantial computational costs. To address these challenges, we propose RO bust P reference O ptimization ( ROPO ), an iterative alignment approach that integrates noise-tolerance and noise filtering without the aid of external models. Specifically, ROPO first formulates the training process with adaptive noise reduction as an optimization problem, which can be efficiently solved in an iterative paradigm. Then, to equip this solving process with noise-tolerance and noise-identification capabilities, we derive a robust loss that suppresses the gradients from samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is key to the noise-tolerance and effective filtering of noisy samples. The derived loss further inspires a robustness-guided rejection sampling technique to compensate for the potential important information in discarded queries. Extensive experiments on several widely-used datasets and model architectures demonstrate that ROPO significantly outperforms all baselines under four practical noise settings and the random symmetric noise, with its advantage increasing as the noise rate increases.

ICLR Conference 2025 Conference Paper

ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

  • Yang Qin
  • Chao Chen 0026
  • Zhihang Fu
  • Ze Chen 0001
  • Dezhong Peng
  • Peng Hu 0002
  • Jieping Ye

Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model's understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

NeurIPS Conference 2025 Conference Paper

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

  • Pingyi Chen
  • Yujing Lou
  • Shen Cao
  • Jinhui Guo
  • Lubin Fan
  • Yue Wu
  • Lin Yang
  • Lizhuang Ma

While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains underexplored due to the deficiency of spatial representation ability of 2D images. In this paper, we analyze the problem hindering VLMs’ spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs through two key contributions: (1) propose Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) introduce a simple depth positional encoding method strengthening VLMs’ spatial awareness. MSMU dataset includes massive quantitative spatial tasks with 700K QA pairs, 2. 5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPTBench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26. 91% and 25. 56% respectively on MSMU-Bench. Code and models are released at https: //github. com/cpystan/SD-VLM.

NeurIPS Conference 2025 Conference Paper

TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs

  • Yuxiang Zhang
  • Zhengxu Yu
  • Weihang Pan
  • Zhongming Jin
  • Qiang Fu
  • Deng Cai
  • Binbin Lin
  • Jieping Ye

Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrated the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek‑R1‑Distill‑Qwen‑7B fine-tuned by using our proposed method achieved a 50\% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model's self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at \url{https: //github. com/zhangyx1122/TokenSqueeze}.

ICML Conference 2024 Conference Paper

A3S: A General Active Clustering Method with Pairwise Constraints

  • Xun Deng
  • Junlong Liu
  • Han Zhong
  • Fuli Feng
  • Chen Shen 0003
  • Xiangnan He 0001
  • Jieping Ye
  • Zheng Wang 0027

Active clustering aims to boost the clustering performance by integrating human-annotated pairwise constraints through strategic querying. Conventional approaches with semi-supervised clustering schemes encounter high query costs when applied to large datasets with numerous classes. To address these limitations, we propose a novel Adaptive Active Aggregation and Splitting (A3S) framework, falling within the cluster-adjustment scheme in active clustering. A3S features strategic active clustering adjustment on the initial cluster result, which is obtained by an adaptive clustering algorithm. In particular, our cluster adjustment is inspired by the quantitative analysis of Normalized mutual information gain under the information theory framework and can provably improve the clustering quality. The proposed A3S framework significantly elevates the performance and scalability of active clustering. In extensive experiments across diverse real-world datasets, A3S achieves desired results with significantly fewer human queries compared with existing methods.

ICLR Conference 2024 Conference Paper

Boosting Vanilla Lightweight Vision Transformers via Re-parameterization

  • Zhentao Tan
  • Xiaodan Li
  • Yue Wu
  • Qi Chu 0001
  • Le Lu 0001
  • Nenghai Yu
  • Jieping Ye

Large-scale Vision Transformers have achieved promising performance on downstream tasks through feature pre-training. However, the performance of vanilla lightweight Vision Transformers (ViTs) is still far from satisfactory compared to that of recent lightweight CNNs or hybrid networks. In this paper, we aim to unlock the potential of vanilla lightweight ViTs by exploring the adaptation of the widely-used re-parameterization technology to ViTs for improving learning ability during training without increasing the inference cost. The main challenge comes from the fact that CNNs perfectly complement with re-parameterization over convolution and batch normalization, while vanilla Transformer architectures are mainly comprised of linear and layer normalization layers. We propose to incorporate the nonlinear ensemble into linear layers by expanding the depth of the linear layers with batch normalization and fusing multiple linear features with hierarchical representation ability through a pyramid structure. We also discover and solve a new transformer-specific distribution rectification problem caused by multi-branch re-parameterization. Finally, we propose our Two-Dimensional Re-parameterized Linear module (TDRL) for ViTs. Under the popular self-supervised pre-training and supervised fine-tuning strategy, our TDRL can be used in these two stages to enhance both generic and task-specific representation. Experiments demonstrate that our proposed method not only boosts the performance of vanilla Vit-Tiny on various vision tasks to new state-of-the-art (SOTA) but also shows promising generality ability on other networks. Code will be available.

NeurIPS Conference 2024 Conference Paper

Bridge-IF: Learning Inverse Protein Folding with Markov Bridges

  • Yiheng Zhu
  • Jialu Wu
  • Qiuyi Li
  • Jiahuan Yan
  • Mingze Yin
  • Wei Wu
  • Mingyang Li
  • Jieping Ye

Inverse protein folding is a fundamental task in computational protein design, which aims to design protein sequences that fold into the desired backbone structures. While the development of machine learning algorithms for this task has seen significant success, the prevailing approaches, which predominantly employ a discriminative formulation, frequently encounter the error accumulation issue and often fail to capture the extensive variety of plausible sequences. To fill these gaps, we propose Bridge-IF, a generative diffusion bridge model for inverse folding, which is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we harness an expressive structure encoder to propose a discrete, informative prior derived from structures, and establish a Markov bridge to connect this prior with native sequences. During the inference stage, Bridge-IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, thereby significantly enhancing generation performance while maintaining parameter-efficient training. Extensive experiments on well-established benchmarks demonstrate that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability. The code is available at https: //github. com/violet-sto/Bridge-IF.

NeurIPS Conference 2024 Conference Paper

Delving into the Reversal Curse: How Far Can Large Language Models Generalize?

  • Zhengkai Lin
  • Zhihang Fu
  • Kai Liu
  • Liang Xie
  • Binbin Lin
  • Wenxiao Wang
  • Deng Cai
  • Yue Wu

While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated "reversal curse", which surfaces when models, having been trained on the fact "A is B", struggle to generalize this knowledge to infer that "B is A". In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to "B is A" when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact "A is B" in the training documents. For example, this generalization only applies to biographies structured in "[Name] is [Description]" but not to "[Description] is [Name]". (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. Based on these intriguing findings, our work not only presents a novel perspective for interpreting LLMs' generalization abilities from their intrinsic working mechanism but also provides new insights for the development of more effective learning methods for LLMs.

ICML Conference 2024 Conference Paper

Discrete Latent Perspective Learning for Segmentation and Detection

  • Deyi Ji
  • Feng Zhao 0004
  • Lanyun Zhu
  • Wenwei Jin
  • Hongtao Lu 0001
  • Jieping Ye

In this paper, we address the challenge of Perspective-Invariant Learning in machine learning and computer vision, which involves enabling a network to understand images from varying perspectives to achieve consistent semantic interpretation. While standard approaches rely on the labor-intensive collection of multi-view images or limited data augmentation techniques, we propose a novel framework, Discrete Latent Perspective Learning (DLPL), for latent multi-perspective fusion learning using conventional single-view images. DLPL comprises three main modules: Perspective Discrete Decomposition (PDD), Perspective Homography Transformation (PHT), and Perspective Invariant Attention (PIA), which work together to discretize visual features, transform perspectives, and fuse multi-perspective semantic information, respectively. DLPL is a universal perspective learning framework applicable to a variety of scenarios and vision tasks. Extensive experiments demonstrate that DLPL significantly enhances the network’s capacity to depict images across diverse scenarios (daily photos, UAV, auto-driving) and tasks (detection, segmentation).

ICML Conference 2024 Conference Paper

Efficient Denoising Diffusion via Probabilistic Masking

  • Weizhong Zhang
  • Zhiwei Zhang
  • Renjie Pi
  • Zhongming Jin 0001
  • Yuan Gao 0015
  • Jieping Ye
  • Kani Chen

Diffusion models have exhibited remarkable advancements in generating high-quality data. However, a critical drawback is their computationally intensive inference process, which requires a large number of timesteps to generate a single sample. Existing methods address this challenge by decoupling the forward and reverse processes, and they rely on handcrafted rules for sampling acceleration, leading to the risk of discarding important steps. In this paper, we propose an Efficient Denoising Diffusion method via Probabilistic Masking (EDDPM) that can identify and skip the redundant steps during training. To determine whether a timestep should be skipped or not, we employ probabilistic reparameterization to continualize the binary determination mask. The mask distribution parameters are learned jointly with model weights. By incorporating a real-time sparse constraint, our method can effectively identify and eliminate unnecessary steps during the training iterations, thereby improving inference efficiency. Notably, as the model becomes fully trained, the random masks converge to a sparse and deterministic one, retaining only a small number of essential steps. Empirical results demonstrate the superiority of our proposed EDDPM over the state-of-the-art sampling acceleration methods across various domains. EDDPM can generate high-quality samples with only 20% of the steps for time series imputation and achieve 4. 89 FID with 5 steps for CIFAR-10. Moreover, when starting from a pretrained model, our method efficiently identifies the most informative timesteps within a single epoch, which demonstrates the potential of EDDPM to be a practical tool to explore large diffusion models with limited resources.

NeurIPS Conference 2024 Conference Paper

Enhancing LLM’s Cognition via Structurization

  • Kai Liu
  • Zhihang Fu
  • Chao Chen
  • Wei Zhang
  • Rongxin Jiang
  • Fan Zhou
  • Yaowu Chen
  • Yue Wu

When reading long-form text, human cognition is complex and structurized. While large language models (LLMs) process input contexts through a causal and sequential perspective, this approach can potentially limit their ability to handle intricate and complex inputs effectively. To enhance LLM’s cognition capability, this paper presents a novel concept of context structurization. Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements. By doing so, LLMs can better grasp intricate and extended contexts through precise attention and information-seeking along the organized structures. Extensive evaluations are conducted across various model architectures and sizes (including a series of auto-regressive LLMs as well as BERT-like masking models) on a diverse set of NLP tasks (e. g. , context-based question-answering, exhaustive hallucination evaluation, and passage-level dense retrieval). Empirical results show consistent and significant performance gains afforded by a single-round structurization. In particular, we boost the open-sourced LLaMA2-70B model to achieve comparable performance against GPT-3. 5-Turbo as the halluci- nation evaluator. Besides, we show the feasibility of distilling advanced LLMs’ language processing abilities to a smaller yet effective StruXGPT-7B to execute structurization, addressing the practicality of our approach. Code is available at https: //github. com/alibaba/struxgpt.

NeurIPS Conference 2024 Conference Paper

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

  • Yuxin Xiao
  • Chaoqun Wan
  • Yonggang Zhang
  • Wenxiao Wang
  • Binbin Lin
  • Xiaofei He
  • Xu Shen
  • Jieping Ye

As the development and application of Large Language Models (LLMs) continue to advance rapidly, enhancing their trustworthiness and aligning them with human preferences has become a critical area of research. Traditional methods rely heavily on extensive data for Reinforcement Learning from Human Feedback (RLHF), but representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representation of LLM's intermediate hidden states, enabling the model to meet specific requirements such as increased honesty or heightened safety awareness. However, a significant challenge arises when attempting to fulfill multiple requirements simultaneously. It proves difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature, restricting its practicality. In this work, we address this challenge through Sparse Activation Control. By delving into the intrinsic mechanisms of LLMs, we manage to identify and pinpoint modules that are closely related to specific tasks within the model, i. e. attention heads. These heads display sparse characteristics that allow for near-independent control over different tasks. Our experiments, conducted on the open-source Llama series models, have yielded encouraging results. The models were able to align with human preferences on issues of safety, factualness, and bias concurrently.

ICML Conference 2024 Conference Paper

From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning

  • Wei Chen 0005
  • Zhen Huang 0007
  • Liang Xie 0003
  • Binbin Lin
  • Houqiang Li
  • Le Lu 0001
  • Xinmei Tian 0001
  • Deng Cai 0001

Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, while it typically leads to the degeneration of LLMs’ general capability. To address the challenge, we propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage ($<$5%) of the basic modules, which significantly affect a particular behavior of LLMs. i. e. , sycophancy. Subsequently, SPT merely fine-tunes these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments, demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs.

ICLR Conference 2024 Conference Paper

Graph-constrained diffusion for End-to-End Path Planning

  • Dingyuan Shi
  • Yongxin Tong
  • Zimu Zhou
  • Ke Xu 0001
  • Zheng Wang
  • Jieping Ye

Path planning underpins various applications such as transportation, logistics, and robotics. Conventionally, path planning is formulated with explicit optimization objectives such as distance or time. However, real-world data reveals that user intentions are hard-to-model, suggesting a need for data-driven path planning that implicitly incorporates the complex user intentions. In this paper, we propose GDP, a diffusion-based model for end-to-end data-driven path planning. It effectively learns path patterns via a novel diffusion process that incorporates constraints from road networks, and plans paths as conditional path generation given the origin and destination as prior evidence. GDP is the first solution that bypasses the traditional search-based frameworks, a long-standing performance bottleneck in path planning. We validate the efficacy of GDP on two real-world datasets. Our GDP beats strong baselines by 14.2% ~ 43.5% and achieves state-of-the-art performances.

ICLR Conference 2024 Conference Paper

INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

  • Chao Chen 0026
  • Kai Liu 0023
  • Ze Chen 0001
  • Yi Gu
  • Yue Wu
  • Mingyuan Tao
  • Zhihang Fu
  • Jieping Ye

Knowledge hallucination have raised widespread concerns for the security and reliability of deployed LLMs. Previous efforts in detecting hallucinations have been employed at logit-level uncertainty estimation or language-level self-consistency evaluation, where the semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' \textbf{IN}ternal \textbf{S}tates for halluc\textbf{I}nation \textbf{DE}tection (\textbf{INSIDE}). In particular, a simple yet effective \textbf{EigenScore} metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of responses' covariance matrix to measure the semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal.

NeurIPS Conference 2024 Conference Paper

Instance-adaptive Zero-shot Chain-of-Thought Prompting

  • Xiaosong Yuan
  • Chen Shen
  • Shaotian Yan
  • Xiaofeng Zhang
  • Liang Xie
  • Wenxiao Wang
  • Renchu Guan
  • Ying Wang

Zero-shot Chain-of-Thought (CoT) prompting emerges as a simple and effective strategy for enhancing the performance of large language models (LLMs) in real-world reasoning tasks. Nonetheless, the efficacy of a singular, task-level prompt uniformly applied across the whole of instances is inherently limited since one prompt cannot be a good partner for all, a more appropriate approach should consider the interaction between the prompt and each instance meticulously. This work introduces an instance-adaptive prompting algorithm as an alternative zero-shot CoT reasoning scheme by adaptively differentiating good and bad prompts. Concretely, we first employ analysis on LLMs through the lens of information flow to detect the mechanism under zero-shot CoT reasoning, in which we discover that information flows from question to prompt and question to rationale jointly influence the reasoning results most. We notice that a better zero-shot CoT reasoning needs the prompt to obtain semantic information from the question then the rationale aggregates sufficient information from the question directly and via the prompt indirectly. On the contrary, lacking any of those would probably lead to a bad one. Stem from that, we further propose an instance-adaptive prompting strategy (IAP) for zero-shot CoT reasoning. Experiments conducted with LLaMA-2, LLaMA-3, and Qwen on math, logic, and commonsense reasoning tasks (e. g. , GSM8K, MMLU, Causal Judgement) obtain consistent improvement, demonstrating that the instance-adaptive zero-shot CoT prompting performs better than other task-level methods with some curated prompts or sophisticated procedures, showing the significance of our findings in the zero-shot CoT reasoning mechanism.

ICML Conference 2024 Conference Paper

Interpreting and Improving Large Language Models in Arithmetic Calculation

  • Wei Zhang
  • Chaoqun Wan
  • Yonggang Zhang 0003
  • Yiu-ming Cheung
  • Xinmei Tian 0001
  • Xu Shen 0001
  • Jieping Ye

Large language models (LLMs) have demonstrated remarkable potential across numerous applications and have shown an emergent ability to tackle complex reasoning tasks, such as mathematical computations. However, even for the simplest arithmetic calculations, the intrinsic mechanisms behind LLMs remains mysterious, making it challenging to ensure reliability. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. Through comprehensive experiments, we find that LLMs frequently involve a small fraction ($<$5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. Subsequently, the information from these operands is processed through multi-layer perceptrons (MLPs), progressively leading to the final solution. These pivotal heads/MLPs, though identified on a specific dataset, exhibit transferability across different datasets and even distinct tasks. This insight prompted us to investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs’ computational performance. We empirically find that such precise tuning can yield notable enhancements on mathematical prowess, without compromising the performance on non-mathematical tasks. Our work serves as a preliminary exploration into the arithmetic calculation abilities inherent in LLMs, laying a solid foundation to reveal more intricate mathematical tasks.

NeurIPS Conference 2024 Conference Paper

Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs

  • Liyi Chen
  • Panrong Tong
  • Zhongming Jin
  • Ying Sun
  • Jieping Ye
  • Hui Xiong

Large Language Models (LLMs) have shown remarkable reasoning capabilities on complex tasks, but they still suffer from out-of-date knowledge, hallucinations, and opaque decision-making. In contrast, Knowledge Graphs (KGs) can provide explicit and editable knowledge for LLMs to alleviate these issues. Existing paradigm of KG-augmented LLM manually predefines the breadth of exploration space and requires flawless navigation in KGs. However, this paradigm cannot adaptively explore reasoning paths in KGs based on the question semantics and self-correct erroneous reasoning paths, resulting in a bottleneck in efficiency and effect. To address these limitations, we propose a novel self-correcting adaptive planning paradigm for KG-augmented LLM named Plan-on-Graph (PoG), which first decomposes the question into several sub-objectives and then repeats the process of adaptively exploring reasoning paths, updating memory, and reflecting on the need to self-correct erroneous reasoning paths until arriving at the answer. Specifically, three important mechanisms of Guidance, Memory, and Reflection are designed to work together, to guarantee the adaptive breadth of self-correcting planning for graph reasoning. Finally, extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of PoG.

NeurIPS Conference 2024 Conference Paper

Rethinking Out-of-Distribution Detection on Imbalanced Data Distribution

  • Kai Liu
  • Zhihang Fu
  • Sheng Jin
  • Chao Chen
  • Ze Chen
  • Rongxin Jiang
  • Fan Zhou
  • Yaowu Chen

Detecting and rejecting unknown out-of-distribution (OOD) samples is critical for deployed neural networks to void unreliable predictions. In real-world scenarios, however, the efficacy of existing OOD detection methods is often impeded by the inherent imbalance of in-distribution (ID) data, which causes significant performance decline. Through statistical observations, we have identified two common challenges faced by different OOD detectors: misidentifying tail class ID samples as OOD, while erroneously predicting OOD samples as head class from ID. To explain this phenomenon, we introduce a generalized statistical framework, termed ImOOD, to formulate the OOD detection problem on imbalanced data distribution. Consequently, the theoretical analysis reveals that there exists a class-aware bias item between balanced and imbalanced OOD detection, which contributes to the performance gap. Building upon this finding, we present a unified training-time regularization technique to mitigate the bias and boost imbalanced OOD detectors across architecture designs. Our theoretically grounded method translates into consistent improvements on the representative CIFAR10-LT, CIFAR100-LT, and ImageNet-LT benchmarks against several state-of-the-art OOD detection ap- proaches. Code is available at https: //github. com/alibaba/imood.

NeurIPS Conference 2023 Conference Paper

Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions

  • Kai Liu
  • Zhihang Fu
  • Chao Chen
  • Sheng Jin
  • Ze Chen
  • Mingyuan Tao
  • Rongxin Jiang
  • Jieping Ye

The key to OOD detection has two aspects: generalized feature representation and precise category description. Recently, vision-language models such as CLIP provide significant advances in both two issues, but constructing precise category descriptions is still in its infancy due to the absence of unseen categories. This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary through automatic prompt tuning. Specifically, perceptual contexts perceive the inter-category difference (e. g. , cats vs apples) for current classification tasks, while spurious contexts further identify spurious (similar but exactly not) OOD samples for every single category (e. g. , cats vs panthers, apples vs peaches). The two contexts hierarchically construct the precise description for a certain category, which is, first roughly classifying a sample to the predicted category and then delicately identifying whether it is truly an ID sample or actually OOD. Moreover, the precise descriptions for those categories within the vision-language framework present a novel application: CATegory-EXtensible OOD detection (CATEX). One can efficiently extend the set of recognizable categories by simply merging the hierarchical contexts learned under different sub-task settings. And extensive experiments are conducted to demonstrate CATEX’s effectiveness, robustness, and category-extensibility. For instance, CATEX consistently surpasses the rivals by a large margin with several protocols on the challenging ImageNet-1K dataset. In addition, we offer new insights on how to efficiently scale up the prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models (like GPT-3) to boost zero-shot applications.

NeurIPS Conference 2023 Conference Paper

OFCOURSE: A Multi-Agent Reinforcement Learning Environment for Order Fulfillment

  • Yiheng Zhu
  • Yang Zhan
  • Xuankun Huang
  • Yuwei Chen
  • Yujie Chen
  • Jiangwen Wei
  • Wei Feng
  • Yinzhi Zhou

The dramatic growth of global e-commerce has led to a surge in demand for efficient and cost-effective order fulfillment which can increase customers' service levels and sellers' competitiveness. However, managing order fulfillment is challenging due to a series of interdependent online sequential decision-making problems. To clear this hurdle, rather than solving the problems separately as attempted in some recent researches, this paper proposes a method based on multi-agent reinforcement learning to integratively solve the series of interconnected problems, encompassing order handling, packing and pickup, storage, order consolidation, and last-mile delivery. In particular, we model the integrated problem as a Markov game, wherein a team of agents learns a joint policy via interacting with a simulated environment. Since no simulated environment supporting the complete order fulfillment problem exists, we devise Order Fulfillment COoperative mUlti-agent Reinforcement learning Scalable Environment (OFCOURSE) in the OpenAI Gym style, which allows reproduction and re-utilization to build customized applications. By constructing the fulfillment system in OFCOURSE, we optimize a joint policy that solves the integrated problem, facilitating sequential order-wise operations across all fulfillment units and minimizing the total cost of fulfilling all orders within the promised time. With OFCOURSE, we also demonstrate that the joint policy learned by multi-agent reinforcement learning outperforms the combination of locally optimal policies. The source code of OFCOURSE is available at: https: //github. com/GitYiheng/ofcourse.

NeurIPS Conference 2023 Conference Paper

Optimal Parameter and Neuron Pruning for Out-of-Distribution Detection

  • Chao Chen
  • Zhihang Fu
  • Kai Liu
  • Ze Chen
  • Mingyuan Tao
  • Jieping Ye

For a machine learning model deployed in real world scenarios, the ability of detecting out-of-distribution (OOD) samples is indispensable and challenging. Most existing OOD detection methods focused on exploring advanced training skills or training-free tricks to prevent the model from yielding overconfident confidence score for unknown samples. The training-based methods require expensive training cost and rely on OOD samples which are not always available, while most training-free methods can not efficiently utilize the prior information from the training data. In this work, we propose an \textbf{O}ptimal \textbf{P}arameter and \textbf{N}euron \textbf{P}runing (\textbf{OPNP}) approach, which aims to identify and remove those parameters and neurons that lead to over-fitting. The main method is divided into two steps. In the first step, we evaluate the sensitivity of the model parameters and neurons by averaging gradients over all training samples. In the second step, the parameters and neurons with exceptionally large or close to zero sensitivities are removed for prediction. Our proposal is training-free, compatible with other post-hoc methods, and exploring the information from all training data. Extensive experiments are performed on multiple OOD detection tasks and model architectures, showing that our proposed OPNP consistently outperforms the existing methods by a large margin.

NeurIPS Conference 2021 Conference Paper

Offline Model-based Adaptable Policy Learning

  • Xiong-Hui Chen
  • Yang Yu
  • Qingyang Li
  • Fan-Ming Luo
  • Zhiwei Qin
  • Wenjie Shang
  • Jieping Ye

In reinforcement learning, a promising direction to avoid online trial-and-error costs is learning from an offline dataset. Current offline reinforcement learning methods commonly learn in the policy space constrained to in-support regions by the offline dataset, in order to ensure the robustness of the outcome policies. Such constraints, however, also limit the potential of the outcome policies. In this paper, to release the potential of offline policy learning, we investigate the decision-making problems in out-of-support regions directly and propose offline Model-based Adaptable Policy LEarning (MAPLE). By this approach, instead of learning in in-support regions, we learn an adaptable policy that can adapt its behavior in out-of-support regions when deployed. We conduct experiments on MuJoCo controlling tasks with offline datasets. The results show that the proposed method can make robust decisions in out-of-support regions and achieve better performance than SOTA algorithms.

ICML Conference 2021 Conference Paper

On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game

  • Shuang Qiu
  • Jieping Ye
  • Zhaoran Wang 0001
  • Zhuoran Yang

To achieve sample efficiency in reinforcement learning (RL), it necessitates to efficiently explore the underlying environment. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the optimal policy via offline RL with data collected in the exploration stage. Moreover, we tackle this problem under the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for generating a $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.

ICML Conference 2021 Conference Paper

Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

  • Shuang Qiu
  • Xiaohan Wei
  • Jieping Ye
  • Zhaoran Wang 0001
  • Zhuoran Yang

While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for two-player zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bounds after $T$ steps in a two-agent competitive game scenario. The regret of each player is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{T})$.

AAAI Conference 2020 Conference Paper

An Attention-Based Graph Neural Network for Heterogeneous Structural Learning

  • Huiting Hong
  • Hantao Guo
  • Yucheng Lin
  • Xiaoqing Yang
  • Zang Li
  • Jieping Ye

In this paper, we focus on graph representation learning of heterogeneous information network (HIN), in which various types of vertices are connected by various types of relations. Most of the existing methods conducted on HIN revise homogeneous graph embedding models via meta-paths to learn low-dimensional vector space of HIN. In this paper, we propose a novel Heterogeneous Graph Structural Attention Neural Network (HetSANN) to directly encode structural information of HIN without meta-path and achieve more informative representations. With this method, domain experts will not be needed to design meta-path schemes and the heterogeneous information can be processed automatically by our proposed model. Specifically, we implicitly represent heterogeneous information using the following two methods: 1) we model the transformation between heterogeneous vertices through a projection in low-dimensional entity spaces; 2) afterwards, we apply the graph neural network to aggregate multi-relational information of projected neighborhood by means of attention mechanism. We also present three extensions of HetSANN, i. e. , voices-sharing product attention for the pairwise relationships in HIN, cycle-consistency loss to retain the transformation between heterogeneous entity spaces, and multi-task learning with full use of information. The experiments conducted on three public datasets demonstrate that our proposed models achieve significant and consistent improvements compared to state-of-the-art solutions.

AAAI Conference 2020 Conference Paper

AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates

  • Ning Liu
  • Xiaolong Ma
  • Zhiyuan Xu
  • Yanzhi Wang
  • Jian Tang
  • Jieping Ye

Structured weight pruning is a representative model compression technique of DNNs to reduce the storage and computation requirements and accelerate inference. An automatic hyperparameter determination process is necessary due to the large number of flexible hyperparameters. This work proposes AutoCompress, an automatic structured pruning framework with the following key performance improvements: (i) effectively incorporate the combination of structured pruning schemes in the automatic process; (ii) adopt the stateof-art ADMM-based structured weight pruning as the core algorithm, and propose an innovative additional purification step for further weight reduction without accuracy loss; and (iii) develop effective heuristic search method enhanced by experience-based guided search, replacing the prior deep reinforcement learning technique which has underlying incompatibility with the target pruning problem. Extensive experiments on CIFAR-10 and ImageNet datasets demonstrate that AutoCompress is the key to achieve ultra-high pruning rates on the number of weights and FLOPs that cannot be achieved before. As an example, AutoCompress outperforms the prior work on automatic model compression by up to 33× in pruning rate (120× reduction in the actual parameter count) under the same accuracy. Significant inference speedup has been observed from the AutoCompress framework on actual measurements on smartphone. We release models of this work at anonymous link: http: //bit. ly/2VZ63dS.

NeurIPS Conference 2020 Conference Paper

Knowledge Transfer in Multi-Task Deep Reinforcement Learning for Continuous Control

  • Zhiyuan Xu
  • Kun Wu
  • Zhengping Che
  • Jian Tang
  • Jieping Ye

While Deep Reinforcement Learning (DRL) has emerged as a promising approach to many complex tasks, it remains challenging to train a single DRL agent that is capable of undertaking multiple different continuous control tasks. In this paper, we present a Knowledge Transfer based Multi-task Deep Reinforcement Learning framework (KTM-DRL) for continuous control, which enables a single DRL agent to achieve expert-level performance in multiple different tasks by learning from task-specific teachers. In KTM-DRL, the multi-task agent first leverages an offline knowledge transfer algorithm designed particularly for the actor-critic architecture to quickly learn a control policy from the experience of task-specific teachers, and then it employs an online learning algorithm to further improve itself by learning from new online transition samples under the guidance of those teachers. We perform a comprehensive empirical study with two commonly-used benchmarks in the MuJoCo continuous control task suite. The experimental results well justify the effectiveness of KTM-DRL and its knowledge transfer and online learning algorithms, as well as its superiority over the state-of-the-art by a large margin.

NeurIPS Conference 2020 Conference Paper

Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

  • Shuang Qiu
  • Xiaohan Wei
  • Zhuoran Yang
  • Jieping Ye
  • Zhaoran Wang

We consider online learning for episodic stochastically constrained Markov decision processes (CMDP), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, whereas both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the MDP is known a priori and establish regret bounds that depend polynomially on the cardinalities of the state space $\mathcal{S}$ and the action space $\mathcal{A}$. In this work, we propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model. In particular, we prove that the proposed algorithm achieves $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ upper bounds of both the regret and the constraint violation, where $L$ is the length of each episode. Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning, which demonstrates the power of ``optimism in the face of uncertainty'' in constrained online learning.

IJCAI Conference 2019 Conference Paper

Deep Multi-Task Learning with Adversarial-and-Cooperative Nets

  • Pei Yang
  • Qi Tan
  • Jieping Ye
  • Hanghang Tong
  • Jingrui He

In this paper, we propose a deep multi-Task learning model based on Adversarial-and-COoperative nets (TACO). The goal is to use an adversarial-and-cooperative strategy to decouple the task-common and task-specific knowledge, facilitating the fine-grained knowledge sharing among tasks. TACO accommodates multiple game players, i. e. , feature extractors, domain discriminator, and tri-classifiers. They play the MinMax games adversarially and cooperatively to distill the task-common and task-specific features, while respecting their discriminative structures. Moreover, it adopts a divide-and-combine strategy to leverage the decoupled multi-view information to further improve the generalization performance of the model. The experimental results show that our proposed method significantly outperforms the state-of-the-art algorithms on the benchmark datasets in both multi-task learning and semi-supervised domain adaptation scenarios.

AAAI Conference 2019 Conference Paper

Incorporating Semantic Similarity with Geographic Correlation for Query-POI Relevance Learning

  • Ji Zhao
  • Dan Peng
  • Chuhan Wu
  • Huan Chen
  • Meiyu Yu
  • Wanji Zheng
  • Li Ma
  • Hua Chai

Point-of-interest (POI) retrieval that searches for relevant destination locations plays a significant role in on-demand ridehailing services. Existing solutions to POI retrieval mainly retrieve and rank POIs based on their semantic similarity scores. Although intuitive, quantifying the relevance of a Query-POI pair by single-field semantic similarity is subject to inherent limitations. In this paper, we propose a novel Query-POI relevance model for effective POI retrieval for ondemand ride-hailing services. Different from existing relevance models, we capture and represent multi-field and local&global semantic features of a Query-POI pair to measure the semantic similarity. Besides, we observe a hidden correlation between origin-destination locations in ride-hailing scenarios, and propose two location embeddings to characterize the specific correlation. By incorporating the geographic correlation with the semantic similarity, our model achieves better performance in POI ranking. Experimental results on two real-world click-through datasets demonstrate the improvements of our model over state-of-the-art methods.

JMLR Journal 2019 Journal Article

Scaling Up Sparse Support Vector Machines by Simultaneous Feature and Sample Reduction

  • Bin Hong
  • Weizhong Zhang
  • Wei Liu
  • Jieping Ye
  • Deng Cai
  • Xiaofei He
  • Jie Wang

Sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great successes in many real-world applications. However, for large-scale problems involving a huge number of samples and ultra-high dimensional features, solving sparse SVMs remains challenging. By noting that sparse SVMs induce sparsities in both feature and sample spaces, we propose a novel approach, which is based on accurate estimations of the primal and dual optima of sparse SVMs, to simultaneously identify the inactive features and samples that are guaranteed to be irrelevant to the outputs. Thus, we can remove the identified inactive samples and features from the training phase, leading to substantial savings in the computational cost without sacrificing the accuracy. Moreover, we show that our method can be extended to multi-class sparse support vector machines. To the best of our knowledge, the proposed method is the first static feature and sample reduction method for sparse SVMs and multi-class sparse SVMs. Experiments on both synthetic and real data sets demonstrate that our approach significantly outperforms state-of-the-art methods and the speedup gained by our approach can be orders of magnitude. [abs] [ pdf ][ bib ] &copy JMLR 2019. ( edit, beta )

AAAI Conference 2019 Conference Paper

Spatiotemporal Multi-Graph Convolution Network for Ride-Hailing Demand Forecasting

  • Xu Geng
  • Yaguang Li
  • Leye Wang
  • Lingyu Zhang
  • Qiang Yang
  • Jieping Ye
  • Yan Liu

Region-level demand forecasting is an essential task in ridehailing services. Accurate ride-hailing demand forecasting can guide vehicle dispatching, improve vehicle utilization, reduce the wait-time, and mitigate traffic congestion. This task is challenging due to the complicated spatiotemporal dependencies among regions. Existing approaches mainly focus on modeling the Euclidean correlations among spatially adjacent regions while we observe that non-Euclidean pair-wise correlations among possibly distant regions are also critical for accurate forecasting. In this paper, we propose the spatiotemporal multi-graph convolution network (ST-MGCN), a novel deep learning model for ride-hailing demand forecasting. We first encode the non-Euclidean pair-wise correlations among regions into multiple graphs and then explicitly model these correlations using multi-graph convolution. To utilize the global contextual information in modeling the temporal correlation, we further propose contextual gated recurrent neural network which augments recurrent neural network with a contextual-aware gating mechanism to re-weights different historical observations. We evaluate the proposed model on two real-world large scale ride-hailing demand datasets and observe consistent improvement of more than 10% over stateof-the-art baselines.

JMLR Journal 2019 Journal Article

Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets

  • Jie Wang
  • Zhanqiu Zhang
  • Jieping Ye

Sparse-Group Lasso (SGL) has been shown to be a powerful regression technique for simultaneously discovering group and within-group sparse patterns by using a combination of the $\ell_1$ and $\ell_2$ norms. However, in large-scale applications, the complexity of the regularizers entails great computational challenges. In this paper, we propose a novel two-layer feature reduction method (TLFre) for SGL via a decomposition of its dual feasible set. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to be absent from the sparse representation and can be removed from the optimization. Existing feature reduction methods are only applicable to sparse models with one sparsity-inducing regularizer. To our best knowledge, TLFre is the first one that is capable of dealing with multiple sparsity-inducing regularizers. Moreover, TLFre has a very low computational cost and can be integrated with any existing solvers. We also develop a screening method---called DPC (decomposition of convex set)---for nonnegative Lasso. Experiments on both synthetic and real data sets show that TLFre and DPC improve the efficiency of SGL and nonnegative Lasso by several orders of magnitude. [abs] [ pdf ][ bib ] &copy JMLR 2019. ( edit, beta )

AAAI Conference 2019 Conference Paper

Which Factorization Machine Modeling Is Better: A Theoretical Answer with Optimal Guarantee

  • Ming Lin
  • Shuang Qiu
  • Jieping Ye
  • Xiaomin Song
  • Qi Qian
  • Liang Sun
  • Shenghuo Zhu
  • Rong Jin

Factorization machine (FM) is a popular machine learning model to capture the second order feature interactions. The optimal learning guarantee of FM and its generalized version is not yet developed. For a rank k generalized FM of d dimensional input, the previous best known sampling complexity is O[k3 d · polylog(kd)] under Gaussian distribution. This bound is sub-optimal comparing to the information theoretical lower bound O(kd). In this work, we aim to tighten this bound towards optimal and generalize the analysis to sub-gaussian distribution. We prove that when the input data satisfies the so-called τ-Moment Invertible Property, the sampling complexity of generalized FM can be improved to O[k2 d · polylog(kd)/τ2 ]. When the second order self-interaction terms are excluded in the generalized FM, the bound can be improved to the optimal O[kd · polylog(kd)] up to the logarithmic factors. Our analysis also suggests that the positive semi-definite constraint in the conventional FM is redundant as it does not improve the sampling complexity while making the model difficult to optimize. We evaluate our improved FM model in real-time high precision GPS signal calibration task to validate its superiority.

AAAI Conference 2018 Conference Paper

Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction

  • Huaxiu Yao
  • Fei Wu
  • Jintao Ke
  • Xianfeng Tang
  • Yitian Jia
  • Siyu Lu
  • Pinghua Gong
  • Jieping Ye

Taxi demand prediction is an important building block to enabling intelligent transportation systems in a smart city. An accurate prediction model can help the city pre-allocate resources to meet travel demand and to reduce empty taxis on streets which waste energy and worsen the traffic congestion. With the increasing popularity of taxi requesting services such as Uber and Didi Chuxing (in China), we are able to collect large-scale taxi demand data continuously. How to utilize such big data to improve the demand prediction is an interesting and critical real-world problem. Traditional demand prediction methods mostly rely on time series forecasting techniques, which fail to model the complex non-linear spatial and temporal relations. Recent advances in deep learning have shown superior performance on traditionally challenging tasks such as image classification by learning the complex features and correlations from largescale data. This breakthrough has inspired researchers to explore deep learning techniques on traffic prediction problems. However, existing methods on traffic prediction have only considered spatial relation (e. g. , using CNN) or temporal relation (e. g. , using LSTM) independently. We propose a Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework to model both spatial and temporal relations. Specifically, our proposed model consists of three views: temporal view (modeling correlations between future demand values with near time points via LSTM), spatial view (modeling local spatial correlation via local CNN), and semantic view (modeling correlations among regions sharing similar temporal patterns). Experiments on large-scale real taxi demand data demonstrate effectiveness of our approach over state-ofthe-art methods.

AAAI Conference 2018 Conference Paper

Margin Based PU Learning

  • Tieliang Gong
  • Guangtao Wang
  • Jieping Ye
  • Zongben Xu
  • Ming Lin

The PU learning problem concerns about learning from positive and unlabeled data. A popular heuristic is to iteratively enlarge training set based on some marginbased criterion. However, little theoretical analysis has been conducted to support the success of these heuristic methods. In this work, we show that not all marginbased heuristic rules are able to improve the learned classifiers iteratively. We find that a so-called large positive margin oracle is necessary to guarantee the success of PU learning. Under this oracle, a provable positivemargin based PU learning algorithm is proposed for linear regression and classification under the truncated Gaussian distributions. The proposed algorithm is able to reduce the recovering error geometrically proportional to the positive margin. Extensive experiments on real-world datasets verify our theory and the state-ofthe-art performance of the proposed PU learning algorithm.

YNIMG Journal 2017 Journal Article

ENIGMA and the individual: Predicting factors that affect the brain in 35 countries worldwide

  • Paul M. Thompson
  • Ole A. Andreassen
  • Alejandro Arias-Vasquez
  • Carrie E. Bearden
  • Premika S. Boedhoe
  • Rachel M. Brouwer
  • Randy L. Buckner
  • Jan K. Buitelaar

In this review, we discuss recent work by the ENIGMA Consortium (http: //enigma. ini. usc. edu) – a global alliance of over 500 scientists spread across 200 institutions in 35 countries collectively analyzing brain imaging, clinical, and genetic data. Initially formed to detect genetic influences on brain measures, ENIGMA has grown to over 30 working groups studying 12 major brain diseases by pooling and comparing brain data. In some of the largest neuroimaging studies to date – of schizophrenia and major depression – ENIGMA has found replicable disease effects on the brain that are consistent worldwide, as well as factors that modulate disease effects. In partnership with other consortia including ADNI, CHARGE, IMAGEN and others 1 1 Abbreviations: ADNI, Alzheimer's Disease Neuroimaging Initiative (http: //www. adni-info. org); CHARGE, the Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium (http: //www. chargeconsortium. com); IMAGEN, IMAging GENetics Consortium (http: //www. imagen-europe. com). , ENIGMA's genomic screens – now numbering over 30, 000 MRI scans – have revealed at least 8 genetic loci that affect brain volumes. Downstream of gene findings, ENIGMA has revealed how these individual variants – and genetic variants in general – may affect both the brain and risk for a range of diseases. The ENIGMA consortium is discovering factors that consistently affect brain structure and function that will serve as future predictors linking individual brain scans and genomic data. It is generating vast pools of normative data on brain measures – from tens of thousands of people – that may help detect deviations from normal development or aging in specific groups of subjects. We discuss challenges and opportunities in applying these predictors to individual subjects and new cohorts, as well as lessons we have learned in ENIGMA's efforts so far.

ICML Conference 2017 Conference Paper

Scaling Up Sparse Support Vector Machines by Simultaneous Feature and Sample Reduction

  • Weizhong Zhang
  • Bin Hong
  • Wei Liu 0005
  • Jieping Ye
  • Deng Cai 0001
  • Xiaofei He 0001
  • Jie Wang 0005

Sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great successes in many real-world applications. However, for large-scale problems involving a huge number of samples and extremely high-dimensional features, solving sparse SVMs remains challenging. By noting that sparse SVMs induce sparsities in both feature and sample spaces, we propose a novel approach, which is based on accurate estimations of the primal and dual optima of sparse SVMs, to simultaneously identify the features and samples that are guaranteed to be irrelevant to the outputs. Thus, we can remove the identified inactive samples and features from the training phase, leading to substantial savings in both the memory usage and computational cost without sacrificing accuracy. To the best of our knowledge, the proposed method is the first static feature and sample reduction method for sparse SVMs. Experiments on both synthetic and real datasets (e. g. , the kddb dataset with about 20 million samples and 30 million features) demonstrate that our approach significantly outperforms state-of-the-art methods and the speedup gained by our approach can be orders of magnitude.

NeurIPS Conference 2016 Conference Paper

A Non-convex One-Pass Framework for Generalized Factorization Machine and Rank-One Matrix Sensing

  • Ming Lin
  • Jieping Ye

We develop an efficient alternating framework for learning a generalized version of Factorization Machine (gFM) on steaming data with provable guarantees. When the instances are sampled from $d$ dimensional random Gaussian vectors and the target second order coefficient matrix in gFM is of rank $k$, our algorithm converges linearly, achieves $O(\epsilon)$ recovery error after retrieving $O(k^{3}d\log(1/\epsilon))$ training instances, consumes $O(kd)$ memory in one-pass of dataset and only requires matrix-vector product operations in each iteration. The key ingredient of our framework is a construction of an estimation sequence endowed with a so-called Conditionally Independent RIP condition (CI-RIP). As special cases of gFM, our framework can be applied to symmetric or asymmetric rank-one matrix sensing problems, such as inductive matrix completion and phase retrieval.

IJCAI Conference 2016 Conference Paper

Fast Laplace Approximation for Sparse Bayesian Spike and Slab Models

  • Syed Abbas Z. Naqvi
  • Shandian Zhe
  • Yuan Qi
  • Yifan Yang
  • Jieping Ye

We consider the application of Bayesian spike-and-slab models in high-dimensional feature selection problems. To do so, we propose a simple yet effective fast approximate Bayesian inference algorithm based on Laplace's method. We exploit two efficient optimization methods, GIST and L-BFGS, to obtain the mode of the posterior distribution. Then we propose an ensemble Nystrom based approach to calculate the diagonal of the inverse Hessian over the mode to obtain the approximate posterior marginals in O(knp) time, k ≪ p. Furthermore, we provide the theoretical analysis about the estimation consistency and approximation error bounds. With the posterior marginals of the model weights, we use quadrature integration to estimate the marginal posteriors of selection probabilities and indicator variables for all features, which quantify the selection uncertainty. Our method not only maintains the benefits of the Bayesian treatment (e. g. , uncertainty quantification) but also possesses the computational efficiency, and oracle properties of the frequentist methods. Simulation shows that our method estimates better or comparable selection probabilities and indicator variables than alternative approximate inference methods such as VB and EP, but with less running time. Extensive experiments on large real datasets demonstrate that our method often improves prediction accuracy over Bayesian automatic relevance determination, EP, and frequentist L1 type methods.

ICML Conference 2015 Conference Paper

A Modified Orthant-Wise Limited Memory Quasi-Newton Method with Convergence Analysis

  • Pinghua Gong
  • Jieping Ye

The Orthant-Wise Limited memory Quasi-Newton (OWL-QN) method has been demonstrated to be very effective in solving the \ell_1-regularized sparse learning problem. OWL-QN extends the L-BFGS from solving unconstrained smooth optimization problems to \ell_1-regularized (non-smooth) sparse learning problems. At each iteration, OWL-QN does not involve any \ell_1-regularized quadratic optimization subproblem and only requires matrix-vector multiplications without an explicit use of the (inverse) Hessian matrix, which enables OWL-QN to tackle large-scale problems efficiently. Although many empirical studies have demonstrated that OWL-QN works quite well in practice, several recent papers point out that the existing convergence proof of OWL-QN is flawed and a rigorous convergence analysis for OWL-QN still remains to be established. In this paper, we propose a modified Orthant-Wise Limited memory Quasi-Newton (mOWL-QN) algorithm by slightly modifying the OWL-QN algorithm. As the main technical contribution of this paper, we establish a rigorous convergence proof for the mOWL-QN algorithm. To the best of our knowledge, our work fills the theoretical gap by providing the first rigorous convergence proof for the OWL-QN-type algorithm on solving \ell_1-regularized sparse learning problems. We also provide empirical studies to show that mOWL-QN works well and is as efficient as OWL-QN.

AIJ Journal 2015 Journal Article

Efficient nonconvex sparse group feature selection via continuous and discrete optimization

  • Shuo Xiang
  • Xiaotong Shen
  • Jieping Ye

Sparse feature selection has proven to be effective in analyzing high-dimensional data. While promising, most existing works apply convex methods, which may be suboptimal in terms of the accuracy of feature selection and parameter estimation. In this paper, we consider both continuous and discrete nonconvex paradigms to sparse group feature selection, which are motivated by applications that require identifying the underlying group structure and performing feature selection simultaneously. The main contribution of this article is twofold: (1) computationally, we develop efficient optimization algorithms for both continuous and discrete formulations, of which the key step is a projection with two coupled constraints; (2) statistically, we show that the proposed continuous model reconstructs the oracle estimator. Therefore, consistent feature selection and parameter estimation are achieved simultaneously. Numerical results on synthetic and real-world data suggest that the proposed nonconvex methods compare favorably against their competitors, thus achieving desired goal of delivering high performance.

NeurIPS Conference 2015 Conference Paper

HONOR: Hybrid Optimization for NOn-convex Regularized problems

  • Pinghua Gong
  • Jieping Ye

Recent years have witnessed the superiority of non-convex sparse learning formulations over their convex counterparts in both theory and practice. However, due to the non-convexity and non-smoothness of the regularizer, how to efficiently solve the non-convex optimization problem for large-scale data is still quite challenging. In this paper, we propose an efficient \underline{H}ybrid \underline{O}ptimization algorithm for \underline{NO}n convex \underline{R}egularized problems (HONOR). Specifically, we develop a hybrid scheme which effectively integrates a Quasi-Newton (QN) step and a Gradient Descent (GD) step. Our contributions are as follows: (1) HONOR incorporates the second-order information to greatly speed up the convergence, while it avoids solving a regularized quadratic programming and only involves matrix-vector multiplications without explicitly forming the inverse Hessian matrix. (2) We establish a rigorous convergence analysis for HONOR, which shows that convergence is guaranteed even for non-convex problems, while it is typically challenging to analyze the convergence for non-convex problems. (3) We conduct empirical studies on large-scale data sets and results demonstrate that HONOR converges significantly faster than state-of-the-art algorithms.

JMLR Journal 2015 Journal Article

Lasso Screening Rules via Dual Polytope Projection

  • Jie Wang
  • Peter Wonka
  • Jieping Ye

Lasso is a widely used regression technique to find sparse representations. When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large- scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i.e., predictors that have $0$ components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no exact screening rule for group Lasso. We have evaluated our screening rule using synthetic and real data sets. Results show that our rule is more effective in identifying inactive predictors than existing state-of-the-art screening rules for Lasso. [abs] [ pdf ][ bib ] &copy JMLR 2015. ( edit, beta )

NeurIPS Conference 2015 Conference Paper

Multi-Layer Feature Reduction for Tree Structured Group Lasso via Hierarchical Projection

  • Jie Wang
  • Jieping Ye

Tree structured group Lasso (TGL) is a powerful technique in uncovering the tree structured sparsity over the features, where each node encodes a group of features. It has been applied successfully in many real-world applications. However, with extremely large feature dimensions, solving TGL remains a significant challenge due to its highly complicated regularizer. In this paper, we propose a novel Multi-Layer Feature reduction method (MLFre) to quickly identify the inactive nodes (the groups of features with zero coefficients in the solution) hierarchically in a top-down fashion, which are guaranteed to be irrelevant to the response. Thus, we can remove the detected nodes from the optimization without sacrificing accuracy. The major challenge in developing such testing rules is due to the overlaps between the parents and their children nodes. By a novel hierarchical projection algorithm, MLFre is able to test the nodes independently from any of their ancestor nodes. Moreover, we can integrate MLFre---that has a low computational cost---with any existing solvers. Experiments on both synthetic and real data sets demonstrate that the speedup gained by MLFre can be orders of magnitude.

ICML Conference 2015 Conference Paper

Safe Screening for Multi-Task Feature Learning with Multiple Data Matrices

  • Jie Wang 0005
  • Jieping Ye

Multi-task feature learning (MTFL) is a powerful technique in boosting the predictive performance by learning multiple related classification/regression/clustering tasks simultaneously. However, solving the MTFL problem remains challenging when the feature dimension is extremely large. In this paper, we propose a novel screening rule—that is based on the dual projection onto convex sets (DPC)—to quickly identify the inactive features—that have zero coefficients in the solution vectors across all tasks. One of the appealing features of DPC is that: it is safe in the sense that the detected inactive features are guaranteed to have zero coefficients in the solution vectors across all tasks. Thus, by removing the inactive features from the training phase, we may have substantial savings in the computational cost and memory usage without sacrificing accuracy. To the best of our knowledge, it is the first screening rule that is applicable to sparse models with multiple data matrices. A key challenge in deriving DPC is to solve a nonconvex problem. We show that we can solve for the global optimum efficiently via a properly chosen parametrization of the constraint set. Moreover, DPC has very low computational cost and can be integrated with any existing solvers. We have evaluated the proposed DPC rule on both synthetic and real data sets. The experiments indicate that DPC is very effective in identifying the inactive features—especially for high dimensional data—which leads to a speedup up to several orders of magnitude.

JMLR Journal 2015 Journal Article

Simultaneous Pursuit of Sparseness and Rank Structures for Matrix Decomposition

  • Qi Yan
  • Jieping Ye
  • Xiaotong Shen

In multi-response regression, pursuit of two different types of structures is essential to battle the curse of dimensionality. In this paper, we seek a sparsest decomposition representation of a parameter matrix in terms of a sum of sparse and low rank matrices, among many overcomplete decompositions. On this basis, we propose a constrained method subject to two nonconvex constraints, respectively for sparseness and low-rank properties. Computationally, obtaining an exact global optimizer is rather challenging. To overcome the difficulty, we use an alternating directions method solving a low-rank subproblem and a sparseness subproblem alternatively, where we derive an exact solution to the low-rank subproblem, as well as an exact solution in a special case and an approximated solution generally through a surrogate of the $L_0$-constraint and difference convex programming, for the sparse subproblem. Theoretically, we establish convergence rates of a global minimizer in the Hellinger-distance, providing an insight into why pursuit of two different types of decomposed structures is expected to deliver higher estimation accuracy than its counterparts based on either sparseness alone or low-rank approximation alone. Numerical examples are given to illustrate these aspects, in addition to an application to facial imagine recognition and multiple time series analysis. [abs] [ pdf ][ bib ] &copy JMLR 2015. ( edit, beta )

ICML Conference 2014 Conference Paper

A Highly Scalable Parallel Algorithm for Isotropic Total Variation Models

  • Jie Wang 0005
  • Qingyang Li 0001
  • Sen Yang 0004
  • Wei Fan 0001
  • Peter Wonka
  • Jieping Ye

Total variation (TV) models are among the most popular and successful tools in signal processing. However, due to the complex nature of the TV term, it is challenging to efficiently compute a solution for large-scale problems. State-of-the-art algorithms that are based on the alternating direction method of multipliers (ADMM) often involve solving large-size linear systems. In this paper, we propose a highly scalable parallel algorithm for TV models that is based on a novel decomposition strategy of the problem domain. As a result, the TV models can be decoupled into a set of small and independent subproblems, which admit closed form solutions. This makes our approach particularly suitable for parallel implementation. Our algorithm is guaranteed to converge to its global minimum. With N variables and n_p processes, the time complexity is O(N/(εn_p)) to reach an epsilon-optimal solution. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art algorithms, especially in dealing with high-resolution, mega-size images.

NeurIPS Conference 2014 Conference Paper

A Safe Screening Rule for Sparse Logistic Regression

  • Jie Wang
  • Jiayu Zhou
  • Jun Liu
  • Peter Wonka
  • Jieping Ye

The l1-regularized logistic regression (or sparse logistic regression) is a widely used method for simultaneous classification and feature selection. Although many recent efforts have been devoted to its efficient implementation, its application to high dimensional data still poses significant challenges. In this paper, we present a fast and effective sparse logistic regression screening rule (Slores) to identify the zero components in the solution vector, which may lead to a substantial reduction in the number of features to be entered to the optimization. An appealing feature of Slores is that the data set needs to be scanned only once to run the screening and its computational cost is negligible compared to that of solving the sparse logistic regression problem. Moreover, Slores is independent of solvers for sparse logistic regression, thus Slores can be integrated with any existing solver to improve the efficiency. We have evaluated Slores using high-dimensional data sets from different applications. Extensive experimental results demonstrate that Slores outperforms the existing state-of-the-art screening rules and the efficiency of solving sparse logistic regression is improved by one magnitude in general.

YNIMG Journal 2014 Journal Article

Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study

  • Rashmi Dubey
  • Jiayu Zhou
  • Yalin Wang
  • Paul M. Thompson
  • Jieping Ye

Many neuroimaging applications deal with imbalanced imaging data. For example, in Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly two times the Alzheimer's disease (AD) patients for structural magnetic resonance imaging (MRI) modality and six times the control cases for proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and a combination of over and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two different classifiers including Random Forest and Support Vector Machines based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity measures. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids technique based undersampling gives the best overall performance among different data sampling techniques and no sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among various feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results.

YNIMG Journal 2014 Journal Article

Bi-level multi-source learning for heterogeneous block-wise missing data

  • Shuo Xiang
  • Lei Yuan
  • Wei Fan
  • Yalin Wang
  • Paul M. Thompson
  • Jieping Ye

Bio-imaging technologies allow scientists to collect large amounts of high-dimensional data from multiple heterogeneous sources for many biomedical applications. In the study of Alzheimer's Disease (AD), neuroimaging data, gene/protein expression data, etc. , are often analyzed together to improve predictive power. Joint learning from multiple complementary data sources is advantageous, but feature-pruning and data source selection are critical to learn interpretable models from high-dimensional data. Often, the data collected has block-wise missing entries. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), most subjects have MRI and genetic information, but only half have cerebrospinal fluid (CSF) measures, a different half has FDG-PET; only some have proteomic data. Here we propose how to effectively integrate information from multiple heterogeneous data sources when data is block-wise missing. We present a unified “bi-level” learning model for complete multi-source data, and extend it to incomplete data. Our major contributions are: (1) our proposed models unify feature-level and source-level analysis, including several existing feature learning approaches as special cases; (2) the model for incomplete data avoids imputing missing data and offers superior performance; it generalizes to other applications with block-wise missing data sources; (3) we present efficient optimization algorithms for modeling complete and incomplete data. We comprehensively evaluate the proposed models including all ADNI subjects with at least one of four data types at baseline: MRI, FDG-PET, CSF and proteomics. Our proposed models compare favorably with existing approaches.

ICML Conference 2014 Conference Paper

Forward-Backward Greedy Algorithms for General Convex Smooth Functions over A Cardinality Constraint

  • Ji Liu 0002
  • Jieping Ye
  • Ryohei Fujimaki

We consider forward-backward greedy algorithms for solving sparse feature selection problems with general convex smooth functions. A state-of-the-art greedy method, the Forward-Backward greedy algorithm (FoBa-obj) requires to solve a large number of optimization problems, thus it is not scalable for large-size problems. The FoBa-gdt algorithm, which uses the gradient information for feature selection at each forward iteration, significantly improves the efficiency of FoBa-obj. In this paper, we systematically analyze the theoretical properties of both algorithms. Our main contributions are: 1) We derive better theoretical bounds than existing analyses regarding FoBa-obj for general smooth convex functions; 2) We show that FoBa-gdt achieves the same theoretical performance as FoBa-obj under the same condition: restricted strong convexity condition. Our new bounds are consistent with the bounds of a special case (least squares) and fills a previously existing theoretical gap for general convex smooth functions; 3) We show that the restricted strong convexity condition is satisfied if the number of independent samples is more than \bark\log d where \bark is the sparsity number and d is the dimension of the variable; 4) We apply FoBa-gdt (with the conditional random field objective) to the sensor selection problem for human indoor activity recognition and our results show that FoBa-gdt outperforms other methods based on forward greedy selection and L1-regularization.

ICML Conference 2014 Conference Paper

Geodesic Distance Function Learning via Heat Flow on Vector Fields

  • Binbin Lin
  • Ji Yang
  • Xiaofei He 0001
  • Jieping Ye

Learning a distance function or metric on a given data manifold is of great importance in machine learning and pattern recognition. Many of the previous works first embed the manifold to Euclidean space and then learn the distance function. However, such a scheme might not faithfully preserve the distance function if the original manifold is not Euclidean. In this paper, we propose to learn the distance function directly on the manifold without embedding. We first provide a theoretical characterization of the distance function by its gradient field. Based on our theoretical analysis, we propose to first learn the gradient field of the distance function and then learn the distance function itself. Specifically, we set the gradient field of a local distance function as an initial vector field. Then we transport it to the whole manifold via heat flow on vector fields. Finally, the geodesic distance function can be obtained by requiring its gradient field to be close to the normalized vector field. Experimental results on both synthetic and real data demonstrate the effectiveness of our proposed algorithm.

ICML Conference 2014 Conference Paper

Rank-One Matrix Pursuit for Matrix Completion

  • Zheng Wang 0011
  • Ming-Jun Lai
  • Zhaosong Lu
  • Wei Fan 0001
  • Hasan Davulcu
  • Jieping Ye

Low rank matrix completion has been applied successfully in a wide range of machine learning applications, such as collaborative filtering, image inpainting and Microarray data imputation. However, many existing algorithms are not scalable to large-scale problems, as they involve computing singular value decomposition. In this paper, we present an efficient and scalable algorithm for matrix completion. The key idea is to extend the well-known orthogonal matching pursuit from the vector case to the matrix case. In each iteration, we pursue a rank-one matrix basis generated by the top singular vector pair of the current approximation residual and update the weights for all rank-one matrices obtained up to the current iteration. We further propose a novel weight updating rule to reduce the time and storage complexity, making the proposed algorithm scalable to large matrices. We establish the linear convergence of the proposed algorithm. The fast convergence is achieved due to the proposed construction of matrix bases and the estimation of the weights. We empirically evaluate the proposed algorithm on many real-world large scale datasets. Results show that our algorithm is much more efficient than state-of-the-art matrix completion algorithms while achieving similar or better prediction performance.

ICML Conference 2014 Conference Paper

Safe Screening with Variational Inequalities and Its Application to Lasso

  • Jun Liu 0003
  • Zheng Zhao
  • Jie Wang 0005
  • Jieping Ye

Sparse learning techniques have been routinely used for feature selection as the resulting model usually has a small number of non-zero entries. Safe screening, which eliminates the features that are guaranteed to have zero coefficients for a certain value of the regularization parameter, is a technique for improving the computational efficiency. Safe screening is gaining increasing attention since 1) solving sparse learning formulations usually has a high computational cost especially when the number of features is large and 2) one needs to try several regularization parameters to select a suitable model. In this paper, we propose an approach called “Sasvi" (Safe screening with variational inequalities). Sasvi makes use of the variational inequality that provides the sufficient and necessary optimality condition for the dual problem. Several existing approaches for Lasso screening can be casted as relaxed versions of the proposed Sasvi, thus Sasvi provides a stronger safe screening rule. We further study the monotone properties of Sasvi for Lasso, based on which a sure removal regularization parameter can be identified for each feature. Experimental results on both synthetic and real data sets are reported to demonstrate the effectiveness of the proposed Sasvi for Lasso screening.

ICML Conference 2014 Conference Paper

Scaling SVM and Least Absolute Deviations via Exact Data Reduction

  • Jie Wang 0005
  • Peter Wonka
  • Jieping Ye

The support vector machine (SVM) is a widely used method for classification. Although many efforts have been devoted to develop efficient solvers, it remains challenging to apply SVM to large-scale problems. A nice property of SVM is that the non-support vectors have no effect on the resulting classifier. Motivated by this observation, we present fast and efficient screening rules to discard non-support vectors by analyzing the dual problem of SVM via variational inequalities (DVI). As a result, the number of data instances to be entered into the optimization can be substantially reduced. Some appealing features of our screening method are: (1) DVI is safe in the sense that the vectors discarded by DVI are guaranteed to be non-support vectors; (2) the data set needs to be scanned only once to run the screening, and its computational cost is negligible compared to that of solving the SVM problem; (3) DVI is independent of the solvers and can be integrated with any existing efficient solver. We also show that the DVI technique can be extended to detect non-support vectors in the least absolute deviations regression (LAD). To the best of our knowledge, there are currently no screening methods for LAD. We have evaluated DVI on both synthetic and real data sets. Experiments indicate that DVI significantly outperforms the existing state-of-the-art screening rules for SVM, and it is very effective in discarding non-support vectors for LAD. The speedup gained by DVI rules can be up to two orders of magnitude.

NeurIPS Conference 2014 Conference Paper

Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets

  • Jie Wang
  • Jieping Ye

Sparse-Group Lasso (SGL) has been shown to be a powerful regression technique for simultaneously discovering group and within-group sparse patterns by using a combination of the l1 and l2 norms. However, in large-scale applications, the complexity of the regularizers entails great computational challenges. In this paper, we propose a novel two-layer feature reduction method (TLFre) for SGL via a decomposition of its dual feasible set. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to be absent from the sparse representation and can be removed from the optimization. Existing feature reduction methods are only applicable for sparse models with one sparsity-inducing regularizer. To our best knowledge, TLFre is the first one that is capable of dealing with multiple sparsity-inducing regularizers. Moreover, TLFre has a very low computational cost and can be integrated with any existing solvers. Experiments on both synthetic and real data sets show that TLFre improves the efficiency of SGL by orders of magnitude.

ICML Conference 2013 Conference Paper

A General Iterative Shrinkage and Thresholding Algorithm for Non-convex Regularized Optimization Problems

  • Pinghua Gong
  • Changshui Zhang
  • Zhaosong Lu
  • Jianhua Huang
  • Jieping Ye

Non-convex sparsity-inducing penalties have recently received considerable attentions in sparse learning. Recent theoretical investigations have demonstrated their superiority over the convex counterparts in several sparse learning settings. However, solving the non-convex optimization problems associated with non-convex penalties remains a big challenge. A commonly used approach is the Multi-Stage (MS) convex relaxation (or DC programming), which relaxes the original non-convex problem to a sequence of convex problems. This approach is usually not very practical for large-scale problems because its computational cost is a multiple of solving a single convex problem. In this paper, we propose a General Iterative Shrinkage and Thresholding (GIST) algorithm to solve the nonconvex optimization problem for a large class of non-convex penalties. The GIST algorithm iteratively solves a proximal operator problem, which in turn has a closed-form solution for many commonly used penalties. At each outer iteration of the algorithm, we use a line search initialized by the Barzilai-Borwein (BB) rule that allows finding an appropriate step size quickly. The paper also presents a detailed convergence analysis of the GIST algorithm. The efficiency of the proposed algorithm is demonstrated by extensive experiments on large-scale data sets.

IJCAI Conference 2013 Conference Paper

Active Learning from Relative Queries

  • Buyue Qian
  • Xiang Wang
  • Fei Wang
  • Hongfei Li
  • Jieping Ye
  • Ian Davidson

Active learning has been extensively studied and shown to be useful in solving real problems. The typical setting of traditional active learning methods is querying labels from an oracle. This is only possible if an expert exists, which may not be the case in many real world applications. In this paper, we focus on designing easier questions that can be answered by a non-expert. These questions poll relative information as opposed to absolute information and can be even generated from sideinformation. We propose an active learning approach that queries the ordering of the importance of an instance’s neighbors rather than its label. We explore our approach on real datasets and make several interesting discoveries including that querying neighborhood information can be an effective question to ask and sometimes can even yield better performance than querying labels.

YNIMG Journal 2013 Journal Article

Applying tensor-based morphometry to parametric surfaces can improve MRI-based disease diagnosis

  • Yalin Wang
  • Lei Yuan
  • Jie Shi
  • Alexander Greve
  • Jieping Ye
  • Arthur W. Toga
  • Allan L. Reiss
  • Paul M. Thompson

Many methods have been proposed for computer-assisted diagnostic classification. Full tensor information and machine learning with 3D maps derived from brain images may help detect subtle differences or classify subjects into different groups. Here we develop a new approach to apply tensor-based morphometry to parametric surface models for diagnostic classification. We use this approach to identify cortical surface features for use in diagnostic classifiers. First, with holomorphic 1-forms, we compute an efficient and accurate conformal mapping from a multiply connected mesh to the so-called slit domain. Next, the surface parameterization approach provides a natural way to register anatomical surfaces across subjects using a constrained harmonic map. To analyze anatomical differences, we then analyze the full Riemannian surface metric tensors, which retain multivariate information on local surface geometry. As the number of voxels in a 3D image is large, sparse learning is a promising method to select a subset of imaging features and to improve classification accuracy. Focusing on vertices with greatest effect sizes, we train a diagnostic classifier using the surface features selected by an L1-norm based sparse learning method. Stability selection is applied to validate the selected feature sets. We tested the algorithm on MRI-derived cortical surfaces from 42 subjects with genetically confirmed Williams syndrome and 40 age-matched controls, multivariate statistics on the local tensors gave greater effect sizes for detecting group differences relative to other TBM-based statistics including analysis of the Jacobian determinant and the largest eigenvalue of the surface metric. Our method also gave reasonable classification results relative to the Jacobian determinant, the pair of eigenvalues of the Jacobian matrix and volume features. This analysis pipeline may boost the power of morphometry studies, and may assist with image-based classification.

ICML Conference 2013 Conference Paper

Efficient Sparse Group Feature Selection via Nonconvex Optimization

  • Shuo Xiang
  • Xiaoshen Tong
  • Jieping Ye

Sparse feature selection has been demonstrated to be effective in handling high-dimensional data. While promising, most of the existing works use convex methods, which may be suboptimal in terms of the accuracy of feature selection and parameter estimation. In this paper, we expand a nonconvex paradigm to sparse group feature selection, which is motivated by applications that require identifying the underlying group structure and performing feature selection simultaneously. The main contributions of this article are twofold: (1) computationally, we introduce a nonconvex sparse group feature selection model and present an efficient optimization algorithm, of which the key step is a projection with two coupled constraints; (2) statistically, we show that the proposed model can reconstruct the oracle estimator. Therefore, consistent feature selection and parameter estimation can be achieved. Numerical results on synthetic and real-world data suggest that the proposed nonconvex method compares favorably against its competitors, thus achieving desired goal of delivering high performance.

ICML Conference 2013 Conference Paper

Guaranteed Sparse Recovery under Linear Transformation

  • Ji Liu 0002
  • Lei Yuan 0001
  • Jieping Ye

We consider the following signal recovery problem: given a measurement matrix Φ∈\mathbbR^n\times p and a noisy observation vector c∈\mathbbR^n constructed from c = Φθ^* + εwhere ε∈\mathbbR^n is the noise vector whose entries follow i. i. d. centered sub-Gaussian distribution, how to recover the signal θ^* if Dθ^* is sparse \rca under a linear transformation D∈\mathbbR^m\times p? One natural method using convex optimization is to solve the following problem: $\min_θ 1\over 2\|Φθ- c\|^2 + λ\|Dθ\|_1. This paper provides an upper bound of the estimate error and shows the consistency property of this method by assuming that the design matrix Φis a Gaussian random matrix. Specifically, we show 1) in the noiseless case, if the condition number of D is bounded and the measurement number n≥Ω(s\log(p)) where s is the sparsity number, then the true solution can be recovered with high probability; and 2) in the noisy case, if the condition number of D is bounded and the measurement increases faster than s\log(p), that is, s\log(p)=o(n), the estimate error converges to zero with probability 1 when p and s go to infinity. Our results are consistent with those for the special case D=\boldI_p\times p (equivalently LASSO) and improve the existing analysis. The condition number of D plays a critical role in our analysis. We consider the condition numbers in two cases including the fused LASSO and the random graph: the condition number in the fused LASSO case is bounded by a constant, while the condition number in the random graph case is bounded with high probability if m\over p (i. e. , #\textedge\over #\textvertex$) is larger than a certain constant. Numerical simulations are consistent with our theoretical results.

ICML Conference 2013 Conference Paper

Joint Transfer and Batch-mode Active Learning

  • Rita Chattopadhyay
  • Wei Fan 0001
  • Ian Davidson
  • Sethuraman Panchanathan
  • Jieping Ye

Active learning and transfer learning are two different methodologies that address the common problem of insufficient labels. Transfer learning addresses this problem by using the knowledge gained from a related and already labeled data source, whereas active learning focuses on selecting a small set of informative samples for manual annotation. Recently, there has been much interest in developing frameworks that combine both transfer and active learning methodologies. A few such frameworks reported in literature perform transfer and active learning in two separate stages. In this work, we present an integrated framework that performs transfer and active learning simultaneously by solving a single convex optimization problem. The proposed framework computes the weights of source domain data and selects the samples from the target domain data simultaneously, by minimizing a common objective of reducing distribution difference between the data set consisting of reweighted source and the queried target domain data and the set of unlabeled target domain data. Comprehensive experiments on three real world data sets demonstrate that the proposed method improves the classification accuracy by 5% to 10% over the existing two-stage approach

NeurIPS Conference 2013 Conference Paper

Lasso Screening Rules via Dual Polytope Projection

  • Jie Wang
  • Jiayu Zhou
  • Peter Wonka
  • Jieping Ye

Lasso is a widely used regression technique to find sparse representations. When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i. e. , predictors that have $0$ components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no exact" screening rule for group Lasso. We have evaluated our screening rule using many real data sets. Results show that our rule is more effective to identify inactive predictors than existing state-of-the-art screening rules for Lasso. "

YNIMG Journal 2013 Journal Article

Modeling disease progression via multi-task learning

  • Jiayu Zhou
  • Jun Liu
  • Vaibhav A. Narayan
  • Jieping Ye

Alzheimer's disease (AD), the most common type of dementia, is a severe neurodegenerative disorder. Identifying biomarkers that can track the progress of the disease has recently received increasing attentions in AD research. An accurate prediction of disease progression would facilitate optimal decision-making for clinicians and patients. A definitive diagnosis of AD requires autopsy confirmation, thus many clinical/cognitive measures including Mini Mental State Examination (MMSE) and Alzheimer's Disease Assessment Scale cognitive subscale (ADAS-Cog) have been designed to evaluate the cognitive status of the patients and used as important criteria for clinical diagnosis of probable AD. In this paper, we consider the problem of predicting disease progression measured by the cognitive scores and selecting biomarkers predictive of the progression. Specifically, we formulate the prediction problem as a multi-task regression problem by considering the prediction at each time point as a task and propose two novel multi-task learning formulations. We have performed extensive experiments using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Specifically, we use the baseline MRI features to predict MMSE/ADAS-Cog scores in the next 4years. Results demonstrate the effectiveness of the proposed multi-task learning formulations for disease progression in comparison with single-task learning algorithms including ridge regression and Lasso. We also perform longitudinal stability selection to identify and analyze the temporal patterns of biomarkers in disease progression. We observe that cortical thickness average of left middle temporal, cortical thickness average of left and right Entorhinal, and white matter volume of left Hippocampus play significant roles in predicting ADAS-Cog at all time points. We also observe that several MRI biomarkers provide significant information for predicting MMSE scores for the first 2years, however very few are shown to be significant in predicting MMSE score at later stages. The lack of predictable MRI biomarkers in later stages may contribute to the lower prediction performance of MMSE than that of ADAS-Cog in our study and other related studies.

JMLR Journal 2013 Journal Article

Multi-Stage Multi-Task Feature Learning

  • Pinghua Gong
  • Jieping Ye
  • Changshui Zhang

Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an $\ell_0$-type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. [abs] [ pdf ][ bib ] &copy JMLR 2013. ( edit, beta )

JMLR Journal 2012 Journal Article

A Multi-Stage Framework for Dantzig Selector and LASSO

  • Ji Liu
  • Peter Wonka
  • Jieping Ye

We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X∈ ℝ n✕ m (m >> n) and a noisy observation vector y∈ ℝ n satisfying y=Xβ * +ε where ε is the noise vector following a Gaussian distribution N(0,σ 2 I), how to recover the signal (or parameter vector) β * when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β *. We show that if X obeys a certain condition, then with a large probability the difference between the solution β̂ estimated by the proposed method and the true solution β * measured in terms of the l p norm ( p≥ 1 ) is bounded as ||β̂-β * || p ≤ (C(s-N) 1/p √log m+Δ)σ, where C is a constant, s is the number of nonzero entries in β *, the risk of the oracle estimator Δ is independent of m and is much smaller than the first term, and N is the number of entries of β * larger than a certain value in the order of O(σ√log m). The proposed method improves the estimation bound of the standard Dantzig selector approximately from Cs 1/p √log mσ to C(s-N) 1/p √log mσ where the value N depends on the number of large entries in β *. When N=s, the proposed algorithm achieves the oracle solution with a high probability, where the oracle solution is the projection of the observation vector y onto true features. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. Finally, we extend this multi-stage procedure to the LASSO case. [abs] [ pdf ][ bib ] &copy JMLR 2012. ( edit, beta )

NeurIPS Conference 2012 Conference Paper

Generalization Bounds for Domain Adaptation

  • Chao Zhang
  • Lei Zhang
  • Jieping Ye

In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. Without loss of generality, we consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we introduce two quantities that capture the inherent characteristics of domains. For either kind of domain adaptation, based on the two quantities, we then develop the specific Hoeffding-type deviation inequality and symmetrization inequality to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for such kind of domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results.

YNIMG Journal 2012 Journal Article

Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data

  • Lei Yuan
  • Yalin Wang
  • Paul M. Thompson
  • Vaibhav A. Narayan
  • Jieping Ye

Analysis of incomplete data is a big challenge when integrating large-scale brain imaging datasets from different imaging modalities. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), for example, over half of the subjects lack cerebrospinal fluid (CSF) measurements; an independent half of the subjects do not have fluorodeoxyglucose positron emission tomography (FDG-PET) scans; many lack proteomics measurements. Traditionally, subjects with missing measures are discarded, resulting in a severe loss of available information. In this paper, we address this problem by proposing an incomplete Multi-Source Feature (iMSF) learning method where all the samples (with at least one available data source) can be used. To illustrate the proposed approach, we classify patients from the ADNI study into groups with Alzheimer's disease (AD), mild cognitive impairment (MCI) and normal controls, based on the multi-modality data. At baseline, ADNI's 780 participants (172AD, 397 MCI, 211 NC), have at least one of four data types: magnetic resonance imaging (MRI), FDG-PET, CSF and proteomics. These data are used to test our algorithm. Depending on the problem being solved, we divide our samples according to the availability of data sources, and we learn shared sets of features with state-of-the-art sparse learning methods. To build a practical and robust system, we construct a classifier ensemble by combining our method with four other methods for missing value estimation. Comprehensive experiments with various parameters show that our proposed iMSF method and the ensemble model yield stable and promising results.

NeurIPS Conference 2012 Conference Paper

Multi-Stage Multi-Task Feature Learning

  • Pinghua Gong
  • Jieping Ye
  • Chang-shui Zhang

Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an $\ell_0$-type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms.

NeurIPS Conference 2012 Conference Paper

Multi-task Vector Field Learning

  • Binbin Lin
  • Sen Yang
  • Chiyuan Zhang
  • Jieping Ye
  • Xiaofei He

Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most of existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL) which learns the prediction functions and the vector fields simultaneously. MTVFL has the following key properties: (1) the vector fields we learned are close to the gradient fields of the prediction functions; (2) within each task, the vector field is required to be as parallel as possible which is expected to span a low dimensional subspace; (3) the vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach.

NeurIPS Conference 2011 Conference Paper

A Two-Stage Weighting Framework for Multi-Source Domain Adaptation

  • Qian Sun
  • Rita Chattopadhyay
  • Sethuraman Panchanathan
  • Jieping Ye

Discriminative learning when training and test data belong to different distributions is a challenging and complex task. Often times we have very few or no labeled data from the test or target distribution but may have plenty of labeled data from multiple related sources with different distributions. The difference in distributions may be in both marginal and conditional probabilities. Most of the existing domain adaptation work focuses on the marginal probability distribution difference between the domains, assuming that the conditional probabilities are similar. However in many real world applications, conditional probability distribution differences are as commonplace as marginal probability differences. In this paper we propose a two-stage domain adaptation methodology which combines weighted data from multiple sources based on marginal probability differences (first stage) as well as conditional probability differences (second stage), with the target domain data. The weights for minimizing the marginal probability differences are estimated independently, while the weights for minimizing conditional probability differences are computed simultaneously by exploiting the potential interaction among multiple sources. We also provide a theoretical analysis on the generalization performance of the proposed multi-source domain adaptation formulation using the weighted Rademacher complexity measure. Empirical comparisons with existing state-of-the-art domain adaptation methods using three real-world datasets demonstrate the effectiveness of the proposed approach.

NeurIPS Conference 2011 Conference Paper

Clustered Multi-Task Learning Via Alternating Structure Optimization

  • Jiayu Zhou
  • Jianhui Chen
  • Jieping Ye

Multi-task learning (MTL) learns multiple related tasks simultaneously to improve generalization performance. Alternating structure optimization (ASO) is a popular MTL method that learns a shared low-dimensional predictive structure on hypothesis spaces from multiple related tasks. It has been applied successfully in many real world applications. As an alternative MTL approach, clustered multi-task learning (CMTL) assumes that multiple tasks follow a clustered structure, i. e. , tasks are partitioned into a set of groups where tasks in the same group are similar to each other, and that such a clustered structure is unknown a priori. The objectives in ASO and CMTL differ in how multiple tasks are related. Interestingly, we show in this paper the equivalence relationship between ASO and CMTL, providing significant new insights into ASO and CMTL as well as their inherent relationship. The CMTL formulation is non-convex, and we adopt a convex relaxation to the CMTL formulation. We further establish the equivalence relationship between the proposed convex relaxation of CMTL and an existing convex relaxation of ASO, and show that the proposed convex CMTL formulation is significantly more efficient especially for high-dimensional data. In addition, we present three algorithms for solving the convex CMTL formulation. We report experimental results on benchmark datasets to demonstrate the efficiency of the proposed algorithms.

NeurIPS Conference 2011 Conference Paper

Efficient Methods for Overlapping Group Lasso

  • Lei Yuan
  • Jun Liu
  • Jieping Ye

The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8, 141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms.

NeurIPS Conference 2011 Conference Paper

Identifying Alzheimer's Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis

  • Shuai Huang
  • Jing Li
  • Jieping Ye
  • Teresa Wu
  • Kewei Chen
  • Adam Fleisher
  • Eric Reiman

Diagnosis of Alzheimer's disease (AD) at the early stage of the disease development is of great clinical importance. Current clinical assessment that relies primarily on cognitive measures proves low sensitivity and specificity. The fast growing neuroimaging techniques hold great promise. Research so far has focused on single neuroimaging modalities. However, as different modalities provide complementary measures for the same disease pathology, fusion of multi-modality data may increase the statistical power in identification of disease-related brain regions. This is especially true for early AD, at which stage the disease-related regions are most likely to be weak-effect regions that are difficult to be detected from a single modality alone. We propose a sparse composite linear discriminant analysis model (SCLDA) for identification of disease-related brain regions of early AD from multi-modality data. SCLDA uses a novel formulation that decomposes each LDA parameter into a product of a common parameter shared by all the modalities and a parameter specific to each modality, which enables joint analysis of all the modalities and borrowing strength from one another. We prove that this formulation is equivalent to a penalized likelihood with non-convex regularization, which can be solved by the DC ((difference of convex functions) programming. We show that in using the DC programming, the property of the non-convex regularization in terms of preserving weak-effect features can be nicely revealed. We perform extensive simulations to show that SCLDA outperforms existing competing algorithms on feature selection, especially on the ability for identifying weak-effect features. We apply SCLDA to the Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) images of 49 AD patients and 67 normal controls (NC). Our study identifies disease-related brain regions consistent with findings in the AD literature.

NeurIPS Conference 2011 Conference Paper

Projection onto A Nonnegative Max-Heap

  • Jun Liu
  • Liang Sun
  • Jieping Ye

We consider the problem of computing the Euclidean projection of a vector of length $p$ onto a non-negative max-heap---an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative max-heap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called \emph{maximal root-tree} of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal root-tree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and $O(p^2)$ for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1.

YNIMG Journal 2010 Journal Article

Learning brain connectivity of Alzheimer's disease by sparse inverse covariance estimation

  • Shuai Huang
  • Jing Li
  • Liang Sun
  • Jieping Ye
  • Adam Fleisher
  • Teresa Wu
  • Kewei Chen
  • Eric Reiman

Rapid advances in neuroimaging techniques provide great potentials for study of Alzheimer's disease (AD). Existing findings have shown that AD is closely related to alteration in the functional brain network, i. e. , the functional connectivity between different brain regions. In this paper, we propose a method based on sparse inverse covariance estimation (SICE) to identify functional brain connectivity networks from PET data. Our method is able to identify both the connectivity network structure and strength for a large number of brain regions with small sample sizes. We apply the proposed method to the PET data of AD, mild cognitive impairment (MCI), and normal control (NC) subjects. Compared with NC, AD shows decrease in the amount of inter-region functional connectivity within the temporal lobe especially between the area around hippocampus and other regions and increase in the amount of connectivity within the frontal lobe as well as between the parietal and occipital lobes. Also, AD shows weaker between-lobe connectivity than within-lobe connectivity and weaker between-hemisphere connectivity, compared with NC. In addition to being a method for knowledge discovery about AD, the proposed SICE method can also be used for classifying new subjects, which makes it a suitable approach for novel connectivity-based AD biomarker identification. Our experiments show that the best sensitivity and specificity our method can achieve in AD vs. NC classification are 88% and 88%, respectively.

NeurIPS Conference 2010 Conference Paper

Moreau-Yosida Regularization for Grouped Tree Structure Learning

  • Jun Liu
  • Jieping Ye

We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.

NeurIPS Conference 2010 Conference Paper

Multi-Stage Dantzig Selector

  • Ji Liu
  • Peter Wonka
  • Jieping Ye

We consider the following sparse signal recovery (or feature selection) problem: given a design matrix $X\in \mathbb{R}^{n\times m}$ $(m\gg n)$ and a noisy observation vector $y\in \mathbb{R}^{n}$ satisfying $y=X\beta^*+\epsilon$ where $\epsilon$ is the noise vector following a Gaussian distribution $N(0, \sigma^2I)$, how to recover the signal (or parameter vector) $\beta^*$ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal $\beta^*$. We show that if $X$ obeys a certain condition, then with a large probability the difference between the solution $\hat\beta$ estimated by the proposed method and the true solution $\beta^*$ measured in terms of the $l_p$ norm ($p\geq 1$) is bounded as \begin{equation*} \|\hat\beta-\beta^*\|_p\leq \left(C(s-N)^{1/p}\sqrt{\log m}+\Delta\right)\sigma, \end{equation*} $C$ is a constant, $s$ is the number of nonzero entries in $\beta^*$, $\Delta$ is independent of $m$ and is much smaller than the first term, and $N$ is the number of entries of $\beta^*$ larger than a certain value in the order of $\mathcal{O}(\sigma\sqrt{\log m})$. The proposed method improves the estimation bound of the standard Dantzig selector approximately from $Cs^{1/p}\sqrt{\log m}\sigma$ to $C(s-N)^{1/p}\sqrt{\log m}\sigma$ where the value $N$ depends on the number of large entries in $\beta^*$. When $N=s$, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector.

IJCAI Conference 2009 Conference Paper

  • Ying-Xin Li
  • Shuiwang Ji
  • Sudhir Kumar
  • Jieping Ye
  • Zhi-Hua Zhou

The Berkeley Drosophila Genome Project (BDGP) has produced a large number of gene expression patterns, many of which have been annotated textually with anatomical and developmental terms. These terms spatially correspond to local regions of the images; however, they are attached collectively to groups of images, such that it is unknown which term is assigned to which region of which image in the group. This poses a challenge to the development of the computational method to automate the textual description of expression patterns contained in each image. In this paper, we show that the underlying nature of this task matches well with a new machine learning framework, Multi-Instance Multi-Label learning (MIML). We propose a new MIML support vector machine to solve the problems that beset the annotation task. Empirical study shows that the proposed method outperforms the state-of-the-art Drosophila gene expression pattern annotation methods.

IJCAI Conference 2009 Conference Paper

  • Shuiwang Ji
  • Jieping Ye

Dimensionality reduction is an essential step in high-dimensional data analysis. Many dimensionality reduction algorithms have been applied successfully to multi-class and multi-label problems. They are commonly applied as a separate data preprocessing step before classification algorithms. In this paper, we study a joint learning framework in which we perform dimensionality reduction and multi-label classification simultaneously. We show that when the least squares loss is used in classification, this joint learning decouples into two separate components, i. e. , dimensionality reduction followed by multi-label classification. This analysis partially justifies the current practice of a separate application of dimensionality reduction for classi- fication problems. We extend our analysis using other loss functions, including the hinge loss and the squared hinge loss. We further extend the formulation to the more general case where the input data for different class labels may differ, overcoming the limitation of traditional dimensionality reduction algorithms. Experiments on benchmark data sets have been conducted to evaluate the proposed joint formulations.

IJCAI Conference 2009 Conference Paper

  • Lei Tang
  • Jianhui Chen
  • Jieping Ye

For classification with multiple labels, a common approach is to learn a classifier for each label. With a kernel-based classifier, there are two options to set up kernels: select a specific kernel for each label or the same kernel for all labels. In this work, we present a unified framework for multi-label multiple kernel learning, in which the above two approaches can be considered as two extreme cases. Moreover, our framework allows the kernels shared partially among multiple labels, enabling flexible degrees of label commonality. We systematically study how the sharing of kernels among multiple labels affects the performance based on extensive experiments on various benchmark data including images and microarray data. Interesting findings concerning efficacy and efficiency are reported.

IJCAI Conference 2009 Conference Paper

  • Liang Sun
  • Shuiwang Ji
  • Shipeng Yu
  • Jieping Ye

Canonical correlation analysis (CCA) and partial least squares (PLS) are well-known techniques for feature extraction from two sets of multidimensional variables. The fundamental difference between CCA and PLS is that CCA maximizes the correlation while PLS maximizes the covariance. Although both CCA and PLS have been applied successfully in various applications, the intrinsic relationship between them remains unclear. In this paper, we attempt to address this issue by showing the equivalence relationship between CCA and orthonormalized partial least squares (OPLS), a variant of PLS. We further extend the equivalence relationship to the case when regularization is employed for both sets of variables. In addition, we show that the CCA projection for one set of variables is independent of the regularization on the other set of variables. We have performed experimental studies using both synthetic and real data sets and our results confirm the established equivalence relationship. The presented analysis provides novel insights into the connection between these two existing algorithms as well as the effect of the regularization.

IJCAI Conference 2009 Conference Paper

  • Zheng Zhao
  • Liang Sun
  • Shipeng Yu
  • Huan Liu
  • Jieping Ye

Kernel discriminant analysis (KDA) is an effective approach for supervised nonlinear dimensionality reduction. Probabilistic models can be used with KDA to improve its robustness. However, the state of the art of such models could only handle binary class problems, which confines their application in many real world problems. To overcome this limitation, we propose a novel nonparametric probabilistic model based on Gaussian Process for KDA to handle multiclass problems. The model provides a novel Bayesian interpretation for KDA, which allows its parameters to be automatically tuned through the optimization of the marginal loglikelihood of the data. Empirical study demonstrates the efficacy of the proposed model.

IJCAI Conference 2009 Conference Paper

  • Jun Liu
  • Jianhui Chen
  • Songcan Chen
  • Jieping Ye

Kernel methods have been applied successfully in many applications. The kernel matrix plays an important role in kernel-based learning methods, but the “ideal” kernel matrix is usually unknown in practice and needs to be estimated. In this paper, we propose to directly learn the “ideal” kernel matrix (called the optimal neighborhood kernel matrix) from a pre-specified kernel matrix for improved classification performance. We assume that the prespecified kernel matrix generated from the specific application is a noisy observation of the ideal one. The resulting optimal neighborhood kernel matrix is shown to be the summation of the pre-specified kernel matrix and a rank-one matrix. We formulate the problem of learning the optimal neighborhood kernel as a constrained quartic problem, and propose to solve it using two methods: level method and constrained gradient descent. Empirical results on several benchmark data sets demonstrate the efficiency and effectiveness of the proposed algorithms.

ICML Conference 2009 Conference Paper

An accelerated gradient method for trace norm minimization

  • Shuiwang Ji
  • Jieping Ye

We consider the minimization of a smooth loss function regularized by the trace norm of the matrix variable. Such formulation finds applications in many machine learning tasks including multi-task learning, matrix classification, and matrix completion. The standard semidefinite programming formulation for this problem is computationally expensive. In addition, due to the non-smooth nature of the trace norm, the optimal first-order black-box method for solving such class of problems converges as O (1/√ k ), where k is the iteration counter. In this paper, we exploit the special structure of the trace norm, based on which we propose an extended gradient algorithm that converges as O (1/ k ). We further propose an accelerated gradient algorithm, which achieves the optimal convergence rate of O (1/ k 2 ) for smooth problems. Experiments on multi-task learning problems demonstrate the efficiency of the proposed algorithms.

ICML Conference 2009 Conference Paper

Efficient Euclidean projections in linear time

  • Jun Liu 0003
  • Jieping Ye

We consider the problem of computing the Euclidean projection of a vector of length n onto a closed convex set including the l 1 ball and the specialized polyhedra employed in (Shalev-Shwartz & Singer, 2006). These problems have played building block roles in solving several l 1 -norm based sparse learning problems. Existing methods have a worst-case time complexity of O ( n log n ). In this paper, we propose to cast both Euclidean projections as root finding problems associated with specific auxiliary functions, which can be solved in linear time via bisection. We further make use of the special structure of the auxiliary functions, and propose an improved bisection algorithm. Empirical studies demonstrate that the proposed algorithms are much more efficient than the competing ones for computing the projections.

NeurIPS Conference 2009 Conference Paper

Efficient Recovery of Jointly Sparse Vectors

  • Liang Sun
  • Jun Liu
  • Jianhui Chen
  • Jieping Ye

We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the $(2, 1)$-norm minimization, which is an extension of the well-known $1$-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly much more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone programming (SOCP) or semidefinite programming (SDP), which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.

NeurIPS Conference 2009 Conference Paper

Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data

  • Shuai Huang
  • Jing Li
  • Liang Sun
  • Jun Liu
  • Teresa Wu
  • Kewei Chen
  • Adam Fleisher
  • Eric Reiman

Recent advances in neuroimaging techniques provide great potentials for effective diagnosis of Alzheimer’s disease (AD), the most common form of dementia. Previous studies have shown that AD is closely related to alternation in the functional brain network, i. e. , the functional connectivity among different brain regions. In this paper, we consider the problem of learning functional brain connectivity from neuroimaging, which holds great promise for identifying image-based markers used to distinguish Normal Controls (NC), patients with Mild Cognitive Impairment (MCI), and patients with AD. More specifically, we study sparse inverse covariance estimation (SICE), also known as exploratory Gaussian graphical models, for brain connectivity modeling. In particular, we apply SICE to learn and analyze functional brain connectivity patterns from different subject groups, based on a key property of SICE, called the “monotone property” we established in this paper. Our experimental results on neuroimaging PET data of 42 AD, 116 MCI, and 67 NC subjects reveal several interesting connectivity patterns consistent with literature findings, and also some new patterns that can help the knowledge discovery of AD.

UAI Conference 2009 Conference Paper

Multi-Task Feature Learning Via Efficient l2, 1-Norm Minimization

  • Jun Liu 0003
  • Shuiwang Ji
  • Jieping Ye

The problem of joint feature selection across a group of related tasks has applications in many areas including biomedical informatics and computer vision. We consider the 2, 1 -norm regularized regression model for joint feature selection from multiple tasks, which can be derived in the probabilistic framework by assuming a suitable prior from the exponential family. One appealing feature of the 2, 1 -norm regularization is that it encourages multiple predictors to share similar sparsity patterns. However, the resulting optimization problem is challenging to solve due to the non-smoothness of the 2, 1 -norm regularization. In this paper, we propose to accelerate the computation by reformulating it as two equivalent smooth convex optimization problems which are then solved via the Nesterov’s method—an optimal first-order black-box method for smooth convex optimization. A key building block in solving the reformulations is the Euclidean projection. We show that the Euclidean projection for the first reformulation can be analytically computed, while the Euclidean projection for the second one can be computed in linear time. Empirical evaluations on several data sets verify the efficiency of the proposed algorithms.

JMLR Journal 2008 Journal Article

Comments on the Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

  • Jieping Ye

Loog (2007) provided a complete characterization of the family of solutions to a generalized Fisher criterion. We show that this characterization is essentially equivalent to the original characterization proposed in Ye (2005). The computational advantage of the original characterization over the new one is discussed, which justifies its practical use. [abs] [ pdf ][ bib ] &copy JMLR 2008. ( edit, beta )

JMLR Journal 2008 Journal Article

Multi-class Discriminant Kernel Learning via Convex Programming

  • Jieping Ye
  • Shuiwang Ji
  • Jianhui Chen

Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. [abs] [ pdf ][ bib ] &copy JMLR 2008. ( edit, beta )

NeurIPS Conference 2008 Conference Paper

Multi-label Multiple Kernel Learning

  • Shuiwang Ji
  • Liang Sun
  • Rong Jin
  • Jieping Ye

We present a multi-label multiple kernel learning (MKL) formulation, in which the data are embedded into a low-dimensional space directed by the instance-label correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, and it can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained and convex optimization problem. In addition, we show that the objective function of the approximate formulation is continuously differentiable with Lipschitz gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results have been reported in comparison with representative algorithms.

ICML Conference 2007 Conference Paper

Discriminant kernel and regularization parameter learning via semidefinite programming

  • Jieping Ye
  • Jianhui Chen
  • Shuiwang Ji

Regularized Kernel Discriminant Analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. The performance of RKDA depends on the selection of kernels. In this paper, we consider the problem of learning an optimal kernel over a convex set of kernels. We show that the kernel learning problem can be formulated as a semidefinite program (SDP) in the binary-class case. We further extend the SDP formulation to the multi-class case. It is based on a key result established in this paper, that is, the multi-class kernel learning problem can be decomposed into a set of binary-class kernel learning problems. In addition, we propose an approximation scheme to reduce the computational complexity of the multi-class SDP formulation. The performance of RKDA also depends on the value of the regularization parameter. We show that this value can be learned automatically in the framework. Experimental results on benchmark data sets demonstrate the efficacy of the proposed SDP formulations.

NeurIPS Conference 2007 Conference Paper

Discriminative K-means for Clustering

  • Jieping Ye
  • Zheng Zhao
  • Mingrui Wu

We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets.

JMLR Journal 2006 Journal Article

Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis

  • Jieping Ye
  • Tao Xiong

Dimensionality reduction is an important pre-processing step in many applications. Linear discriminant analysis (LDA) is a classical statistical approach for supervised dimensionality reduction. It aims to maximize the ratio of the between-class distance to the within-class distance, thus maximizing the class discrimination. It has been used widely in many applications. However, the classical LDA formulation requires the nonsingularity of the scatter matrices involved. For undersampled problems, where the data dimensionality is much larger than the sample size, all scatter matrices are singular and classical LDA fails. Many extensions, including null space LDA (NLDA) and orthogonal LDA (OLDA), have been proposed in the past to overcome this problem. NLDA aims to maximize the between-class distance in the null space of the within-class scatter matrix, while OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of the scatter matrices. They have been applied successfully in various applications. In this paper, we present a computational and theoretical analysis of NLDA and OLDA. Our main result shows that under a mild condition which holds in many applications involving high-dimensional data, NLDA is equivalent to OLDA. We have performed extensive experiments on various types of data and results are consistent with our theoretical analysis. We further apply the regularization to OLDA. The algorithm is called regularized OLDA (or ROLDA for short). An efficient algorithm is presented to estimate the regularization value in ROLDA. A comparative study on classification shows that ROLDA is very competitive with OLDA. This confirms the effectiveness of the regularization in ROLDA. [abs] [ pdf ][ bib ] &copy JMLR 2006. ( edit, beta )

JMLR Journal 2005 Journal Article

Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems

  • Jieping Ye

A generalized discriminant analysis based on a new optimization criterion is presented. The criterion extends the optimization criteria of the classical Linear Discriminant Analysis (LDA) when the scatter matrices are singular. An efficient algorithm for the new optimization problem is presented. The solutions to the proposed criterion form a family of algorithms for generalized LDA, which can be characterized in a closed form. We study two specific algorithms, namely Uncorrelated LDA (ULDA) and Orthogonal LDA (OLDA). ULDA was previously proposed for feature extraction and dimension reduction, whereas OLDA is a novel algorithm proposed in this paper. The features in the reduced space of ULDA are uncorrelated, while the discriminant vectors of OLDA are orthogonal to each other. We have conducted a comparative study on a variety of real-world data sets to evaluate ULDA and OLDA in terms of classification accuracy. [abs] [ pdf ][ bib ] &copy JMLR 2005. ( edit, beta )

NeurIPS Conference 2004 Conference Paper

Efficient Kernel Discriminant Analysis via QR Decomposition

  • Tao Xiong
  • Jieping Ye
  • Qi Li
  • Ravi Janardan
  • Vladimir Cherkassky

Linear Discriminant Analysis (LDA) is a well-known method for fea- ture extraction and dimension reduction. It has been used widely in many applications such as face recognition. Recently, a novel LDA algo- rithm based on QR Decomposition, namely LDA/QR, has been proposed, which is competitive in terms of classification accuracy with other LDA algorithms, but it has much lower costs in time and space. However, LDA/QR is based on linear projection, which may not be suitable for data with nonlinear structure. This paper first proposes an algorithm called KDA/QR, which extends the LDA/QR algorithm to deal with nonlin- ear data by using the kernel operator. Then an efficient approximation of KDA/QR called AKDA/QR is proposed. Experiments on face image data show that the classification accuracy of both KDA/QR and AKDA/QR are competitive with Generalized Discriminant Analysis (GDA), a gen- eral kernel discriminant analysis algorithm, while AKDA/QR has much lower time and space costs. 1 Introduction Linear Discriminant Analysis [3] is a wellknown method for dimension reduction. It has been used widely in many applications such as face recognition [2]. Classical LDA aims to find optimal transformation by minimizing the within-class distance and maximizing the between-class distance simultaneously, thus achieving maximum discrimination. The optimal transformation can be readily computed by computing the eigen-decomposition on the scatter matrices. Although LDA works well for linear problems, it may be less effective when severe non- linearity is involved. To deal with such a limitation, nonlinear extensions through kernel functions have been proposed. The main idea of kernel-based methods is to map the input data to a feature space through a nonlinear mapping, where the inner products in the feature space can be computed by a kernel function without knowing the nonlinear mapping explic- itly [9]. Kernel Principal Component Analysis (KPCA) [10], Kernel Fisher Discriminant Analysis (KFDA) [7] and Generalized Discriminant Analysis (GDA) [1] are, respectively, kernel-based nonlinear extensions of the well known PCA, FDA and LDA methods. To our knowledge, there are few efficient algorithms for general kernel based discriminant algorithms -- most known algorithms effectively scale as O(n3) where n is the sample size. In [6, 8], S. Mika et al. made a first attempt to speed up KFDA through a greedy approximation technique. However the algorithm was developed to handle the binary clas- sification problem. For multi-class problem, the authors suggested the one against the rest scheme by considering all two-class problems. Recently, an efficient variant of LDA, namely LDA/QR, was proposed in [11, 12]. The essence of LDA/QR is the utilization of QR-decomposition on a small size matrix. The time complexity of LDA/QR is linear in the size of the training data, as well as the number of dimensions of the data. Moreover, experiments in [11, 12] show that the classification accuracy of LDA/QR is competitive with other LDA algorithms. In this paper, we first propose an algorithm, namely KDA/QR1, which is a nonlinear exten- sion of LDA/QR. Since KDA/QR involves the whole kernel matrix, which is not scalable for large datasets, we also propose an approximation of KDA/QR, namely AKDA/QR. A distinct property of AKDA/QR is that it scales as O(ndc), where n is the size of the data, d is the dimension of the data, and c is the number of classes. We apply the proposed algorithms on face image datasets and compare them with LDA/QR, and Generalized Discriminant Analysis (GDA) [1], a general method for kernel discrim- inant analysis. Experiments show that: (1) AKDA/QR is competitive with KDA/QR and GDA in classification; (2) both KDA/QR and AKDA/QR outperform LDA/QR in classifi- cation; and (3) AKDA/QR has much lower costs in time and space than GDA.

ICML Conference 2004 Conference Paper

Feature extraction via generalized uncorrelated linear discriminant analysis

  • Jieping Ye
  • Ravi Janardan
  • Qi Li 0001
  • Haesun Park

Feature extraction is important in many applications, such as text and image retrieval, because of high dimensionality. Uncorrelated Linear Discriminant Analysis (ULDA) was recently proposed for feature extraction. The extracted features via ULDA were shown to be statistically uncorrelated, which is desirable for many applications. In this paper, we will first propose the ULDA/QR algorithm to simplify the previous implementation of ULDA. Then we propose the ULDA/GSVD algorithm, based on a novel optimization criterion, to address the singularity problem. It is applicable for undersampled problem, where the data dimension is much larger than the data size, such as text and image retrieval. The novel criterion used in ULDA/GSVD is the perturbed version of the one from ULDA/QR, while surprisingly, the solution to ULDA/GSVD is shown to be independent of the amount of perturbation applied. We did extensive experiments on text and face image data to show the effectiveness of ULDA/GSVD and compare with other popular feature extraction algorithms.

NeurIPS Conference 2004 Conference Paper

Two-Dimensional Linear Discriminant Analysis

  • Jieping Ye
  • Ravi Janardan
  • Qi Li

Linear Discriminant Analysis (LDA) is a well-known scheme for feature extraction and dimension reduction. It has been used widely in many ap- plications involving high-dimensional data, such as face recognition and image retrieval. An intrinsic limitation of classical LDA is the so-called singularity problem, that is, it fails when all scatter matrices are singu- lar. A well-known approach to deal with the singularity problem is to apply an intermediate dimension reduction stage using Principal Com- ponent Analysis (PCA) before LDA. The algorithm, called PCA+LDA, is used widely in face recognition. However, PCA+LDA has high costs in time and space, due to the need for an eigen-decomposition involving the scatter matrices. In this paper, we propose a novel LDA algorithm, namely 2DLDA, which stands for 2-Dimensional Linear Discriminant Analysis. 2DLDA over- comes the singularity problem implicitly, while achieving efficiency. The key difference between 2DLDA and classical LDA lies in the model for data representation. Classical LDA works with vectorized representa- tions of data, while the 2DLDA algorithm works with data in matrix representation. To further reduce the dimension by 2DLDA, the combi- nation of 2DLDA and classical LDA, namely 2DLDA+LDA, is studied, where LDA is preceded by 2DLDA. The proposed algorithms are ap- plied on face recognition and compared with PCA+LDA. Experiments show that 2DLDA and 2DLDA+LDA achieve competitive recognition accuracy, while being much more efficient.