Arrow Research search

Author name cluster

Ge Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

48 papers
2 author rows

Possible papers

48

AAAI Conference 2026 Conference Paper

DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

  • Yitong Zhang
  • Jia Li
  • Liyi Cai
  • Ge Li

Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries. Existing safety alignment approaches typically fail to effectively resist malicious queries while preserving utility on benign ones. To address these challenges, we propose DAVSP, which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits strong cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential to the overall effectiveness.

AAAI Conference 2026 Conference Paper

Large Language Model Unlearning for Source Code

  • Xue Jiang
  • Yihong Dong
  • Huangzhao Zhang
  • Tangxinyu Wang
  • Zheng Fang
  • Yingwei Ma
  • Rongyu Cao
  • Binhua Li

While Large Language Models (LLMs) excel at code generation, their inherent tendency toward verbatim memorization of training data introduces critical risks such as copyright infringement, insecure code emission, and deprecated API utilization. A straightforward yet promising defense is unlearning, i.e., erasing or down-weighting the offending snippets through post-training. However, we find that its application to source code often spills over, damaging the basic knowledge of programming languages learned by the LLM and degrading its overall capability. To address this challenge, we propose PROD for precise source code unlearning. PROD surgically zeroes out the prediction probability of the prohibited tokens and renormalizes the remaining distribution so that the generated code stays correct. By excising only the targeted snippets, PROD achieves precise forgetting without much degradation of the LLM's overall capability. To facilitate in-depth evaluation of PROD, we establish an unlearning benchmark consisting of three downstream tasks (i.e., unlearning of copyrighted code, insecure code, and deprecated APIs), and introduce the Pareto Dominance Ratio (PDR) metric, which reflects both forget quality and LLM utility. Our comprehensive evaluation demonstrates that PROD achieves a superior balance between forget quality and model utility compared to existing unlearning approaches across the three downstream tasks, and consistently exhibits improvements when applied to LLMs of varying series. PROD also exhibits superior robustness against adversarial attacks without generating or exposing the data to be forgotten. These results underscore that our approach not only successfully extends the application boundary of unlearning techniques to source code, but also holds significant implications for advancing reliable code generation.
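
The core step quoted in the abstract (zero out the prohibited tokens' probabilities, then renormalize the remainder) can be sketched in a few lines. This is a minimal illustration over a toy distribution with hypothetical names, not the authors' implementation:

```python
# Hypothetical sketch of the mask-and-renormalize step described in the
# abstract: prohibited tokens get zero probability, the remaining mass is
# rescaled to sum to 1. Names here are illustrative, not from the paper.

def mask_and_renormalize(probs, prohibited):
    """probs: list of token probabilities; prohibited: set of token ids."""
    masked = [0.0 if i in prohibited else p for i, p in enumerate(probs)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("all probability mass was prohibited")
    return [p / total for p in masked]

dist = [0.5, 0.3, 0.2]
cleaned = mask_and_renormalize(dist, {1})  # token 1 is forgotten
```

Because only the targeted token ids are touched, the relative ordering of all remaining tokens is preserved, which is the property the abstract attributes to keeping generated code correct.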

AAAI Conference 2026 Conference Paper

Video Spatial Reasoning with Object-Centric 3D Rollout

  • Haoran Tang
  • Meng Cao
  • Ruyang Liu
  • Xiaoxi Liang
  • Linglong Li
  • Ge Li
  • Xiaodan Liang

Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning—the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes—remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).

NeurIPS Conference 2025 Conference Paper

BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

  • Hongyi Zhou
  • Weiran Liao
  • Xi Huang
  • Yucheng Tang
  • Fabian Otto
  • Xiaogang Jia
  • Xinkai Jiang
  • Simon Hilber

We present the B-spline Encoded Action Sequence Tokenizer (BEAST), a novel action tokenizer that encodes action sequences into compact discrete or continuous tokens using B-splines. In contrast to existing action tokenizers based on vector quantization or byte pair encoding, BEAST requires no separate tokenizer training and consistently produces tokens of uniform length, enabling fast action sequence generation via parallel decoding. Leveraging our B-spline formulation, BEAST inherently ensures generating smooth trajectories without discontinuities between adjacent segments. We extensively evaluate BEAST by integrating it with three distinct model architectures: a Variational Autoencoder (VAE) with continuous tokens, a decoder-only Transformer with discrete tokens, and Florence-2, a pretrained Vision-Language Model with an encoder-decoder architecture, demonstrating BEAST's compatibility and scalability with large pretrained models. We evaluate BEAST across three established benchmarks consisting of 166 simulated tasks and on three distinct robot settings with a total of 8 real-world tasks. Experimental results demonstrate that BEAST (i) significantly reduces both training and inference computational costs, and (ii) consistently generates smooth, high-frequency control signals suitable for continuous control tasks while (iii) reliably achieves competitive task success rates compared to state-of-the-art methods.
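
The smoothness claim in the abstract comes from the B-spline parameterization: a short sequence of control points (the compact "tokens") decodes into a dense continuous trajectory. A minimal sketch using a uniform cubic B-spline in its standard textbook form, not the paper's exact formulation:

```python
# Illustrative sketch: control points act as compact action tokens; a uniform
# cubic B-spline decodes them into a smooth 1-D trajectory. This is the
# standard basis-function form, not BEAST's actual tokenizer.

def cubic_bspline_point(p0, p1, p2, p3, t):
    """Evaluate one cubic uniform B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    return b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3

def decode_trajectory(ctrl, samples_per_seg=10):
    """Decode a control-point sequence into a dense action trajectory."""
    traj = []
    for s in range(len(ctrl) - 3):  # one segment per 4 consecutive points
        for k in range(samples_per_seg):
            t = k / samples_per_seg
            traj.append(cubic_bspline_point(*ctrl[s:s + 4], t))
    return traj

traj = decode_trajectory([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
```

Adjacent segments share three control points, which is why consecutive pieces join without discontinuities, matching the abstract's claim of smooth trajectories across segment boundaries.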

AAAI Conference 2025 Conference Paper

ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

  • Mingda Jia
  • Liming Zhao
  • Ge Li
  • Yun Zheng

Spatial contexts, such as the backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent advancements in HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.

EWRL Workshop 2025 Workshop Paper

DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

  • Onur Celik
  • Zechu Li
  • Denis Blessing
  • Ge Li
  • Daniel Palenicek
  • Jan Peters
  • Georgia Chalvatzaki
  • Gerhard Neumann

Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges—primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.

ICML Conference 2025 Conference Paper

DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

  • Onur Celik
  • Zechu Li
  • Denis Blessing
  • Ge Li
  • Daniel Palenicek
  • Jan Peters 0001
  • Georgia Chalvatzaki
  • Gerhard Neumann

Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges—primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.

NeurIPS Conference 2025 Conference Paper

FAN: Fourier Analysis Networks

  • Yihong Dong
  • Ge Li
  • Yongding Tao
  • Xue Jiang
  • Kechi Zhang
  • Jia Li
  • Jinliang Deng
  • Jing Su

Despite the remarkable successes of general-purpose neural networks, such as MLPs and Transformers, we find that they exhibit notable shortcomings in modeling and reasoning about periodic phenomena, achieving only marginal performance within the training domain and failing to generalize effectively to out-of-domain (OOD) scenarios. Periodicity is ubiquitous throughout nature and science. Therefore, neural networks should be equipped with the essential ability to model and handle periodicity. In this work, we propose FAN, a novel neural network that effectively addresses periodicity modeling challenges while offering broad applicability similar to MLP with fewer parameters and FLOPs. Periodicity is naturally integrated into FAN's structure and computational processes by introducing the Fourier Principle. Unlike existing Fourier-based networks, which possess particular periodicity modeling abilities but face challenges in scaling to deeper networks and are typically designed for specific tasks, our approach overcomes this challenge to enable scaling to large-scale models and maintains the capability to be applied to more types of tasks. Through extensive experiments, we demonstrate the superiority of FAN in periodicity modeling tasks and the effectiveness and generalizability of FAN across a range of real-world tasks. Moreover, we reveal that compared to existing Fourier-based networks, FAN accommodates both periodicity modeling and general-purpose modeling well.
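
One common description of a FAN layer concatenates sin/cos of a linear projection (the periodic branch) with an ordinary nonlinear projection. The toy sketch below follows that reading and is not the authors' code; all weights and names are illustrative:

```python
import math
import random

# Hedged sketch of a FAN-style layer: part of the output comes from sin/cos of
# a linear projection (capturing periodicity), the rest from a standard
# nonlinear projection. Toy random weights; not the paper's implementation.

def gelu(z):
    """tanh approximation of GELU."""
    return 0.5 * z * (1 + math.tanh(0.7978845608 * (z + 0.044715 * z**3)))

def fan_layer(x, w_p, w_g, b_g):
    """x: input vector; w_p, w_g: weight matrices as lists of rows."""
    proj_p = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_p]
    proj_g = [sum(wi * xi for wi, xi in zip(row, x)) + b
              for row, b in zip(w_g, b_g)]
    # periodic branch (cos then sin) concatenated with the standard branch
    return ([math.cos(p) for p in proj_p]
            + [math.sin(p) for p in proj_p]
            + [gelu(g) for g in proj_g])

random.seed(0)
x = [0.1, -0.2, 0.3]
w_p = [[random.gauss(0, 1) for _ in x] for _ in range(2)]
w_g = [[random.gauss(0, 1) for _ in x] for _ in range(2)]
out = fan_layer(x, w_p, w_g, [0.0, 0.0])  # output dim = 2 + 2 + 2
```

The sin/cos pair over the same projection is what gives the layer an explicitly periodic response to its input, the property the abstract argues MLPs and Transformers lack.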

AAAI Conference 2025 Conference Paper

MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval

  • Haoran Tang
  • Meng Cao
  • Jinfa Huang
  • Ruyang Liu
  • Peng Jin
  • Ge Li
  • Xiaodan Liang

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to CLIP's inherent plain structure, few TVR methods explore the multi-scale representations which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.

AAAI Conference 2025 Conference Paper

Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection

  • Mingda Jia
  • Liming Zhao
  • Ge Li
  • Yun Zheng

Human-object interaction (HOI) detectors with popular query-transformer architecture have achieved promising performance. However, accurately identifying uncommon visual patterns and distinguishing between ambiguous HOIs continue to be difficult for them. We observe that these difficulties may arise from the limited capacity of traditional detector queries to represent diverse intra-category patterns and inter-category dependencies. To address this, we introduce the Interaction Prompt Distribution Learning (InterProDa) approach. InterProDa learns multiple sets of soft prompts and estimates category distributions from various prompts. It then incorporates HOI queries with category distributions, making them capable of representing near-infinite intra-category dynamics and universal cross-category relationships. Our InterProDa detector demonstrates competitive performance on the HICO-DET and V-COCO benchmarks. Additionally, our method can be integrated into most transformer-based HOI detectors, significantly enhancing their performance with minimal additional parameters.

AAAI Conference 2025 Conference Paper

Point Cloud Semantic Segmentation with Sparse and Inhomogeneous Annotations

  • Zhiyi Pan
  • Nan Zhang
  • Wei Gao
  • Shan Liu
  • Ge Li

Utilizing uniformly distributed sparse annotations, weakly supervised learning alleviates the heavy reliance on fine-grained annotations in point cloud semantic segmentation tasks. However, few works discuss the inhomogeneity of sparse annotations, although it is common in real-world scenarios. Therefore, this work introduces the probability density function into the gradient sampling approximation method to qualitatively analyze the impact of annotation sparsity and inhomogeneity under weakly supervised learning. Based on our analysis, we propose an Adaptive Annotation Distribution Network (AADNet) capable of robust learning on arbitrarily distributed sparse annotations. Specifically, we propose a label-aware point cloud downsampling strategy to increase the proportion of annotations involved in the training stage. Furthermore, we design the multiplicative dynamic entropy as the gradient calibration function to mitigate the gradient bias caused by non-uniformly distributed sparse annotations and explicitly reduce the epistemic uncertainty. Without any prior restrictions or additional information, our proposed method achieves comprehensive performance improvements at multiple label rates and different annotation distributions.

NeurIPS Conference 2025 Conference Paper

PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning

  • Xiaogang Jia
  • Qian Wang
  • Anrui Wang
  • Han Wang
  • Balázs Gyenes
  • Emiliyan Gospodinov
  • Xinkai Jiang
  • Ge Li

Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Because the points are structured on a regular grid, established computer vision techniques can be applied directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. The overview and demos are available on our project page: https://point-map.github.io/Point-Map/

NeurIPS Conference 2025 Conference Paper

Reasoning is Periodicity? Improving Large Language Models Through Effective Periodicity Modeling

  • Yihong Dong
  • Ge Li
  • Xue Jiang
  • Yongding Tao
  • Kechi Zhang
  • Lecheng Wang
  • Hao Zhu
  • Huanyu Liu

Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in the Transformer affect the learning efficiency and the establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which adapts the Fourier Analysis Network (FAN) into the attention mechanism to achieve efficient periodicity modeling by modifying its feature projection process. Extensive experimental results on language modeling show that FANformer consistently outperforms the Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. Our pretrained FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. Moreover, we reveal that FANformer exhibits a superior ability to learn and apply rules for reasoning compared to the Transformer. The results position FANformer as an effective and promising architecture for advancing LLMs.

NeurIPS Conference 2025 Conference Paper

Recursive Transformer: Boosting Reasoning Ability with State Stack

  • Kechi Zhang
  • Ge Li
  • Huangzhao Zhang
  • Yihong Dong
  • Jia Li
  • Jingjing Xu
  • Zhi Jin

The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively recognize regular expressions or deterministic context-free grammars. Standard Transformers lack an explicit mechanism for recursion and structured state transitions, which can hinder systematic generalization on nested and hierarchical patterns. Drawing inspiration from pushdown automata, which efficiently resolve deterministic context-free grammars using stacks, we equip layers with a differentiable stack and propose StackTrans with recursion to address the aforementioned issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations -- such as pushing and popping hidden states -- that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both Chomsky hierarchy and large-scale natural languages. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans up from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2–3x more parameters, showcasing its superior efficiency and reasoning capability.
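
The differentiable push/pop the abstract mentions can be understood by blending the outcomes of a hard push and a hard pop with continuous gate values, in the spirit of common neural-stack constructions. The scalar sketch below is illustrative only, not StackTrans's actual design (hidden-state vectors are reduced to single floats):

```python
# Minimal soft-stack sketch: push and pop are continuous gates, so the whole
# update is a smooth function of the gate values and can be learned end to
# end. Illustrative construction, not the paper's architecture.

def soft_stack_step(stack, value, p_push, p_pop):
    """Blend a hard push of `value` and a hard pop by their probabilities."""
    pushed = stack + [value]
    popped = stack[:-1]
    depth = len(pushed)
    def pad(s):
        return s + [0.0] * (depth - len(s))
    p_noop = 1.0 - p_push - p_pop
    return [p_push * a + p_pop * b + p_noop * c
            for a, b, c in zip(pad(pushed), pad(popped), pad(stack))]

s = soft_stack_step([], 1.0, p_push=1.0, p_pop=0.0)   # hard push
s = soft_stack_step(s, 2.0, p_push=0.5, p_pop=0.5)    # half push, half pop
```

With hard 0/1 gates this reduces to an ordinary stack, which is what lets such a layer mimic the pushdown automata the abstract draws inspiration from.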

NeurIPS Conference 2025 Conference Paper

SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

  • Huanyu Liu
  • Ge Li
  • Jia Li
  • Hao Zhu
  • Kechi Zhang
  • Yihong Dong

How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs' reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
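
The scalability and verifiability properties the abstract highlights follow from how SAT instances behave: sampling is cheap and parameterized by instance size (a crude difficulty knob), and checking a candidate assignment is a linear scan. A generic random 3-SAT sketch, not Saturn's actual pipeline (function names are illustrative):

```python
import random

# Generic random 3-SAT sampler and rule-based checker. Difficulty can be
# steered coarsely via variable and clause counts; verification of any
# proposed assignment is trivial. Illustrative, not Saturn's generator.

def random_3sat(n_vars, n_clauses, rng):
    """Each clause: 3 distinct variables, each negated with probability 1/2."""
    clauses = []
    for _ in range(n_clauses):
        vs = rng.sample(range(1, n_vars + 1), 3)
        clauses.append([v if rng.random() < 0.5 else -v for v in vs])
    return clauses

def check(clauses, assignment):
    """assignment: dict var -> bool. True iff every clause has a true literal."""
    return all(
        any(assignment[abs(l)] == (l > 0) for l in clause) for clause in clauses
    )

rng = random.Random(0)
formula = random_3sat(n_vars=5, n_clauses=10, rng=rng)
# brute-force over all 2^5 assignments just to label the toy instance
cands = [{v: bool(bits >> (v - 1) & 1) for v in range(1, 6)} for bits in range(32)]
satisfiable = any(check(formula, a) for a in cands)
```

The clause-to-variable ratio is a classical hardness dial for random 3-SAT, which is one plausible reading of the "precise difficulty control" the abstract claims.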

IJCAI Conference 2025 Conference Paper

Stochasticity-aware No-Reference Point Cloud Quality Assessment

  • Songlin Fan
  • Wei Gao
  • Zhineng Chen
  • Ge Li
  • Guoqing Liu
  • Qicheng Wang

The evolution of point cloud processing algorithms necessitates an accurate assessment for their quality. Previous works consistently regard point cloud quality assessment (PCQA) as a MOS regression problem and devise a deterministic mapping, ignoring the stochasticity in generating MOS from subjective tests. This work presents the first probabilistic architecture for no-reference PCQA, motivated by the labeling process of existing datasets. The proposed method can model the quality judging stochasticity of subjects through a tailored conditional variational autoencoder (CVAE) and produces multiple intermediate quality ratings. These intermediate ratings simulate the judgments from different subjects and are then integrated into an accurate quality prediction, mimicking the generation process of a ground truth MOS. Specifically, our method incorporates a Prior Module, a Posterior Module, and a Quality Rating Generator, where the former two modules are introduced to model the judging stochasticity in subjective tests, while the latter is developed to generate diverse quality ratings. Extensive experiments indicate that our approach outperforms previous cutting-edge methods by a large margin and exhibits gratifying cross-dataset robustness. Codes are available at https://git.openi.org.cn/OpenPointCloud/nrpcqa.

ICLR Conference 2025 Conference Paper

TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning

  • Ge Li
  • Dong Tian
  • Hongyi Zhou
  • Xinkai Jiang
  • Rudolf Lioutikov
  • Gerhard Neumann

This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), allowing for smooth and efficient exploration over long horizons while capturing high-level temporal correlations. However, ERL methods are often constrained to on-policy frameworks due to the difficulty of evaluating state-action values for entire action sequences, limiting their sample efficiency and preventing the use of more efficient off-policy architectures. TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action values for each segment using a transformer-based critic architecture alongside an n-step return estimation. These contributions result in efficient and stable training, as reflected in empirical results on sophisticated robot learning environments, where TOP-ERL significantly outperforms state-of-the-art RL methods. Thorough ablation studies additionally show the impact of key design choices on model performance.

ICRA Conference 2025 Conference Paper

Trustworthy Robot Behavior Tree Generation Based on Multi-Source Heterogeneous Knowledge Graph

  • Jianchao Yuan
  • Shuo Yang
  • Qi Zhang
  • Ge Li
  • Jianping Tang

In robotics, designing robot behavior trees (BTs) generally requires roboticists to comprehensively consider all relevant factors, including the robot's hardware capabilities, task descriptions, etc., posing great challenges for design quality and efficiency. The mainstream BT design practice has been to manually develop task-specific BT structures using a BT component framework. In contrast, the latest advances in Generative Pretrained Transformers (GPTs) have also opened up the possibility of BT design automation. However, these approaches generally show low efficiency or are less trustworthy for complex robot task goals, due to time-consuming manual design and unreliable GPT reasoning. To address these limitations, this paper proposes a novel knowledge-driven approach that builds a specialized knowledge graph from multi-source, heterogeneous, high-quality robot knowledge to reason out a trustworthy robot plan for achieving complex task goals. We then present plan transformation and BT merging algorithms to automatically generate the plan-level BT structure. Comparative experimental results show that our approach can generate high-quality and trustworthy BT structures, in terms of task plan accuracy and consistency as well as BT generation time, compared with manual design and GPT-based approaches.

NeurIPS Conference 2024 Conference Paper

CableInspect-AD: An Expert-Annotated Anomaly Detection Dataset

  • Akshatha Arodi
  • Margaux Luck
  • Jean-Luc Bedwani
  • Aldo Zaimi
  • Ge Li
  • Nicolas Pouliot
  • Julien Beaudry
  • Gaétan M. Caron

Machine learning models are increasingly being deployed in real-world contexts. However, systematic studies on their transferability to specific and critical applications are underrepresented in the research literature. An important example is visual anomaly detection (VAD) for robotic power line inspection. While existing VAD methods perform well in controlled environments, real-world scenarios present diverse and unexpected anomalies that current datasets fail to capture. To address this gap, we introduce CableInspect-AD, a high-quality, publicly available dataset created and annotated by domain experts from Hydro-Québec, a Canadian public utility. This dataset includes high-resolution images with challenging real-world anomalies, covering defects with varying severity levels. To address the challenges of collecting diverse anomalous and nominal examples for setting a detection threshold, we propose an enhancement to the celebrated PatchCore algorithm. This enhancement enables its use in scenarios with limited labeled data. We also present a comprehensive evaluation protocol based on cross-validation to assess models' performances. We evaluate our Enhanced-PatchCore for few-shot and many-shot detection, and Vision-Language Models for zero-shot detection. While promising, these models struggle to detect all anomalies, highlighting the dataset's value as a challenging benchmark for the broader research community. Project page: https://mila-iqia.github.io/cableinspect-ad/.

NeurIPS Conference 2024 Conference Paper

Distribution Guidance Network for Weakly Supervised Point Cloud Semantic Segmentation

  • Zhiyi Pan
  • Wei Gao
  • Shan Liu
  • Ge Li

Despite alleviating the dependence on dense annotations inherent to fully supervised methods, weakly supervised point cloud semantic segmentation suffers from inadequate supervision signals. In response to this challenge, we introduce a novel perspective that imparts auxiliary constraints by regulating the feature space under weak supervision. Our initial investigation identifies which distributions accurately characterize the feature space, subsequently leveraging this prior to guide the alignment of the weakly supervised embeddings. Specifically, we analyze the superiority of the mixture of von Mises-Fisher distributions (moVMF) among several common distribution candidates. Accordingly, we develop a Distribution Guidance Network (DGNet), which comprises a weakly supervised learning branch and a distribution alignment branch. Leveraging reliable clustering initialization derived from the weakly supervised learning branch, the distribution alignment branch alternately updates the parameters of the moVMF and the network, ensuring alignment with the moVMF-defined latent space. Extensive experiments validate the rationality and effectiveness of our distribution choice and network design. Consequently, DGNet achieves state-of-the-art performance under multiple datasets and various weakly supervised settings.
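
For intuition on the distribution choice: a von Mises-Fisher component concentrates unit-norm features around a mean direction, and the mixture is a convex combination of such components. A toy sketch for unit vectors in R^3, where the vMF normalizer has the closed form kappa / (4*pi*sinh(kappa)); this is illustrative only, not DGNet code:

```python
import math

# Toy moVMF density on the unit sphere in R^3. Parameters are arbitrary
# illustrative values, not anything fitted by DGNet.

def vmf_pdf3(x, mu, kappa):
    """von Mises-Fisher density in R^3: C(kappa) * exp(kappa * <mu, x>)."""
    dot = sum(a * b for a, b in zip(x, mu))
    c = kappa / (4.0 * math.pi * math.sinh(kappa))
    return c * math.exp(kappa * dot)

def movmf_pdf3(x, weights, mus, kappas):
    """Mixture density: convex combination of vMF components."""
    return sum(w * vmf_pdf3(x, mu, k) for w, mu, k in zip(weights, mus, kappas))

x = (0.0, 0.0, 1.0)
p = movmf_pdf3(x, weights=[0.6, 0.4],
               mus=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)], kappas=[5.0, 5.0])
```

Larger kappa sharpens a component around its mean direction, which is why a vMF mixture is a natural fit for clusters of direction-normalized embeddings.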

NeurIPS Conference 2024 Conference Paper

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

  • Jia Li
  • Ge Li
  • Xuanming Zhang
  • YunFei Zhao
  • Yihong Dong
  • Zhi Jin
  • Binhua Li
  • Fei Huang

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Many benchmarks have been proposed, but they have two limitations, i.e., data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the latter hinders practitioners from selecting superior LLMs for specific programming domains. To address these two limitations, we propose a new benchmark - EvoCodeBench, which has the following advances: (1) Evolving data. EvoCodeBench will be dynamically updated every period (e.g., 6 months) to avoid data leakage. This paper releases the first version - EvoCodeBench-2403, containing 275 samples from 25 repositories. (2) A domain taxonomy and domain labels. Based on the statistics of open-source communities, we design a programming domain taxonomy consisting of 10 popular domains. Based on the taxonomy, we annotate each sample in EvoCodeBench with a domain label. EvoCodeBench provides a broad platform for domain-specific evaluations. (3) Domain-specific evaluations. Besides the Pass@k, we compute the Domain-Specific Improvement (DSI) and define LLMs' comfort and strange domains. These evaluations help practitioners select superior LLMs in specific domains and discover the shortcomings of existing LLMs. Besides, EvoCodeBench is collected by a rigorous pipeline and aligns with real-world repositories in multiple aspects (e.g., code distributions). We evaluate 8 popular LLMs (e.g., gpt-4, DeepSeek Coder, StarCoder 2) on EvoCodeBench and summarize some insights. EvoCodeBench reveals the actual abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 on EvoCodeBench-2403 is only 20.74%. Besides, we evaluate LLMs in different domains and discover their comfort and strange domains. For example, gpt-4 performs best in most domains but falls behind others in the Internet domain. StarCoder 2-15B unexpectedly performs well in the Database domain and even outperforms 33B LLMs. We release EvoCodeBench, all prompts, and LLMs' completions for further community analysis.
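Pass@k is conventionally computed with the unbiased estimator from the HumanEval line of work: with n generations per problem of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k). Assuming EvoCodeBench follows that convention, a sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) passes.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0   # too few incorrect samples to fill k all-wrong picks
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 1 correct generation out of 10 gives pass@1 = 0.1, matching the naive rate, while the estimator stays unbiased for k > 1.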

AAAI Conference 2024 Conference Paper

Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models

  • Yuqi Zhu
  • Ge Li
  • YunFei Zhao
  • Jia Li
  • Zhi Jin
  • Hong Mei

Recently, Large Language Models (LLMs) have shown impressive abilities in code generation. However, existing LLMs' decoding strategies are designed for Natural Language (NL) generation, overlooking the differences between NL and programming languages (PL). Due to this oversight, a better decoding strategy for code generation remains an open question. In this paper, we conduct the first systematic study to explore a decoding strategy specialized in code generation. With an analysis of loss distributions of code tokens, we find that code tokens can be divided into two categories: challenging tokens that are difficult to predict and confident tokens that can be easily inferred. Among them, the challenging tokens mainly appear at the beginning of a code block. Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling, which dynamically adjusts the temperature coefficient when decoding different tokens. We apply a larger temperature when sampling for challenging tokens, allowing LLMs to explore diverse choices. We employ a smaller temperature for confident tokens, avoiding the influence of tail randomness noises. We apply AdapT sampling to LLMs with different sizes and conduct evaluations on two popular datasets. Results show that AdapT sampling significantly outperforms state-of-the-art decoding strategies.
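The core mechanism, switching the softmax temperature per token, can be sketched as follows. The block-start heuristic and the two temperature values are illustrative stand-ins, not the paper's settings:

```python
import math, random

def softmax(logits, temperature):
    """Numerically stable softmax at a given temperature."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_sample(logits, starts_code_block, t_high=1.0, t_low=0.2):
    """AdapT-style sampling sketch: a higher temperature for 'challenging'
    tokens (here, heuristically, tokens starting a code block) to encourage
    exploration, a lower one for 'confident' tokens to suppress tail noise."""
    t = t_high if starts_code_block else t_low
    probs = softmax(logits, t)
    token_id = random.choices(range(len(logits)), weights=probs)[0]
    return token_id, probs
```

Lowering the temperature sharpens the distribution, so confident tokens are sampled almost greedily while challenging ones keep real diversity.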

AAAI Conference 2024 Conference Paper

Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation

  • Zhiyi Pan
  • Nan Zhang
  • Wei Gao
  • Shan Liu
  • Ge Li

Weak supervision has proven to be an effective strategy for reducing the burden of annotating semantic segmentation tasks in 3D space. However, unconstrained or heuristic weakly supervised annotation forms may lead to suboptimal label efficiency. To address this issue, we propose a novel label recommendation framework for weakly supervised point cloud semantic segmentation. Distinct from pre-training and active learning, the label recommendation framework consists of three stages: inductive bias learning, recommendations for points to be labeled, and point cloud semantic segmentation learning. In practice, we first introduce the point cloud upsampling task to induce an inductive bias from structural information. During the recommendation stage, we present a cross-scene clustering strategy to generate centers of clustering as recommended points. Then we introduce LabelAttention, an attention module over the recommended point positions, to model long-range dependencies under sparse annotations. Additionally, we employ position encoding to enhance the spatial awareness of semantic features. Throughout the framework, the useful information obtained from inductive bias learning is propagated to subsequent semantic segmentation networks in the form of label positions. Experimental results demonstrate that our framework outperforms weakly supervised point cloud semantic segmentation methods and other methods for labeling efficiency on S3DIS and ScanNetV2, even at an extremely low label rate.

ICLR Conference 2024 Conference Paper

Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning

  • Ge Li
  • Hongyi Zhou
  • Dominik Roth
  • Serge Thilges
  • Fabian Otto
  • Rudolf Lioutikov
  • Gerhard Neumann

Current advancements in reinforcement learning (RL) have predominantly focused on learning step-based policies that generate actions for each perceived state. While these methods efficiently leverage step information from environmental interaction, they often ignore the temporal correlation between actions, resulting in inefficient exploration and unsmooth trajectories that are challenging to implement on real hardware. Episodic RL (ERL) seeks to overcome these challenges by exploring in parameter spaces that capture the correlation of actions. However, these approaches typically compromise data efficiency, as they treat trajectories as opaque black boxes. In this work, we introduce a novel ERL algorithm, Temporally-Correlated Episodic RL (TCE), which effectively utilizes step information in episodic policy updates, opening the 'black box' in existing ERL methods while retaining the smooth and consistent exploration in parameter space. TCE synergistically combines the advantages of step-based and episodic RL, achieving comparable performance to recent ERL methods while maintaining data efficiency akin to state-of-the-art (SoTA) step-based RL.

NeurIPS Conference 2024 Conference Paper

Variational Distillation of Diffusion Policies into Mixture of Experts

  • Hongyi Zhou
  • Denis Blessing
  • Ge Li
  • Onur Celik
  • Xiaogang Jia
  • Gerhard Neumann
  • Rudolf Lioutikov

This work introduces Variational Diffusion Distillation (VDD), a novel method that distills denoising diffusion policies into Mixtures of Experts (MoE) through variational inference. Diffusion Models are the current state-of-the-art in generative modeling due to their exceptional ability to accurately learn and represent complex, multi-modal distributions. This ability allows Diffusion Models to replicate the inherent diversity in human behavior, making them the preferred models in behavior learning such as Learning from Human Demonstrations (LfD). However, diffusion models come with some drawbacks, including the intractability of likelihoods and long inference times due to their iterative sampling process. The inference times, in particular, pose a significant challenge to real-time applications such as robot control. In contrast, MoEs effectively address the aforementioned issues while retaining the ability to represent complex distributions but are notoriously difficult to train. VDD is the first method that distills pre-trained diffusion models into MoE models, and hence, combines the expressiveness of Diffusion Models with the benefits of Mixture Models. Specifically, VDD leverages a decompositional upper bound of the variational objective that allows the training of each expert separately, resulting in a robust optimization scheme for MoEs. VDD demonstrates, across nine complex behavior learning tasks, that it is able to: i) accurately distill complex distributions learned by the diffusion model, ii) outperform existing state-of-the-art distillation methods, and iii) surpass conventional methods for training MoE. The code and videos are available at https://intuitive-robots.github.io/vdd-website.

NeurIPS Conference 2023 Conference Paper

Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training

  • Hanyang Peng
  • Shuang Qin
  • Yue Yu
  • Jin Wang
  • Hui Wang
  • Ge Li

Various gradient compression algorithms have been proposed to alleviate the communication bottleneck in distributed learning, and they have demonstrated effectiveness in terms of high compression ratios and theoretical low communication complexity. However, when it comes to practically training modern deep neural networks (DNNs), these algorithms have yet to match the inference performance of uncompressed SGD-momentum (SGDM) and adaptive optimizers (e.g., Adam). More importantly, recent studies suggest that these algorithms actually offer no speed advantages over SGDM/Adam when used with common distributed DNN training frameworks (e.g., DistributedDataParallel (DDP)) in the typical settings, due to heavy compression/decompression computation or incompatibility with the efficient All-Reduce or the requirement of uncompressed warmup at the early stage. For these reasons, we propose a novel 1-bit adaptive optimizer, dubbed *Bi*nary *r*andomization a*d*aptive optimiz*er* (**Birder**). The quantization of Birder can be easily and lightly computed, and it does not require warmup with its uncompressed version in the beginning. Also, we devise Hierarchical-1-bit-All-Reduce to further lower the communication volume. We theoretically prove that it promises the same convergence rate as the Adam. Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to $2.5\times$ speedup for training ResNet-50 and $6.3\times$ speedup for training BERT-Base. Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder.
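A generic unbiased 1-bit stochastic quantizer of the kind such optimizers build on can be sketched as follows; Birder's actual binary randomization and scaling scheme may differ:

```python
import random

def one_bit_quantize(vec, rng=random.Random(0)):
    """Sketch of binary randomization: each coordinate is sent as one bit
    plus a shared scale s, i.e. quantized to +s or -s, with probabilities
    chosen so the quantizer is unbiased: E[q_i] = v_i. Here s is the max
    absolute value of the vector (an illustrative choice)."""
    s = max(abs(v) for v in vec) or 1.0
    out = []
    for v in vec:
        p_plus = (1.0 + v / s) / 2.0   # P(+s); yields E[q] = v
        out.append(s if rng.random() < p_plus else -s)
    return out
```

Unbiasedness is what lets the compressed updates keep an Adam-like convergence rate in expectation, while each coordinate costs a single bit on the wire (plus one shared float for the scale).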

JBHI Journal 2023 Journal Article

Multimodal Data Matters: Language Model Pre-Training Over Structured and Unstructured Electronic Health Records

  • Sicen Liu
  • Xiaolong Wang
  • Yongshuai Hou
  • Ge Li
  • Hui Wang
  • Hui Xu
  • Yang Xiang
  • Buzhou Tang

As two important textual modalities in electronic health records (EHR), both structured data (clinical codes) and unstructured data (clinical narratives) have recently been increasingly applied to the healthcare domain. Most existing EHR-oriented studies, however, either focus on a particular modality or integrate data from different modalities in a straightforward manner, which usually treats structured and unstructured data as two independent sources of information about patient admission and ignore the intrinsic interactions between them. In fact, the two modalities are documented during the same encounter where structured data inform the documentation of unstructured data and vice versa. In this paper, we proposed a Medical Multimodal Pre-trained Language Model, named MedM-PLM, to learn enhanced EHR representations over structured and unstructured data and explore the interaction of two modalities. In MedM-PLM, two Transformer-based neural network components are firstly adopted to learn representative characteristics from each modality. A cross-modal module is then introduced to model their interactions. We pre-trained MedM-PLM on the MIMIC-III dataset and verified the effectiveness of the model on three downstream clinical tasks, i.e., medication recommendation, 30-day readmission prediction and ICD coding. Extensive experiments demonstrate the power of MedM-PLM compared with state-of-the-art methods. Further analyses and visualizations show the robustness of our model, which could potentially provide more comprehensive interpretations for clinical decision-making.

IJCAI Conference 2023 Conference Paper

Null-Space Diffusion Sampling for Zero-Shot Point Cloud Completion

  • Xinhua Cheng
  • Nan Zhang
  • Jiwen Yu
  • Yinhuai Wang
  • Ge Li
  • Jian Zhang

Point cloud completion aims at estimating the complete data of objects from degraded observations. Despite existing completion methods achieving impressive performances, they rely heavily on degraded-complete data pairs for supervision. In this work, we propose a novel framework named Null-Space Diffusion Sampling (NSDS) to solve the point cloud completion task in a zero-shot manner. By leveraging a pre-trained point cloud diffusion model as the off-the-shelf generator, our sampling approach can generate desired completion outputs with the guidance of the observed degraded data without any extra training. Furthermore, we propose a tolerant loop mechanism to improve the quality of completion results for hard cases. Experimental results demonstrate our zero-shot framework achieves superior completion performance than unsupervised methods and comparable performance to supervised methods in various degraded situations.

NeurIPS Conference 2022 Conference Paper

Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively

  • Haojie Zhang
  • Ge Li
  • Jia Li
  • Zhongjin Zhang
  • Yuqi Zhu
  • Zhi Jin

Large-scale pre-trained language models have achieved impressive results on a wide range of downstream tasks recently. However, fine-tuning an extremely large-scale pre-trained language model on limited target datasets is often plagued by overfitting and representation degradation. In this paper, we propose a Dynamic Parameter Selection (DPS) algorithm for the large-scale pre-trained models during fine-tuning, which adaptively selects a more promising subnetwork to perform staging updates based on gradients of back-propagation. Experiments on the GLUE benchmark show that DPS outperforms previous fine-tuning methods in terms of overall performance and stability, and consistently achieves better results with variable pre-trained language models. In addition, DPS brings a large magnitude of improvement in out-of-domain transferring experiments and low-resource scenarios, which shows that it can maintain stable general contextual features and reduce the representation collapse. We release our code at \url{https://github.com/ZhangHaojie077/DPS}.

NeurIPS Conference 2022 Conference Paper

Learning to Share in Networked Multi-Agent Reinforcement Learning

  • Yuxuan Yi
  • Ge Li
  • Yaowei Wang
  • Zongqing Lu

In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network and each interacts only with nearby agents. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective with restricted communication between neighbors over the network. Inspired by the fact that sharing plays a key role in humans' learning of cooperation, we propose LToS, a hierarchically decentralized MARL framework that enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective through collectives. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. The two policies form a bi-level optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both social dilemma and networked MARL scenarios across scales.

AAAI Conference 2022 Conference Paper

Local Surface Descriptor for Geometry and Feature Preserved Mesh Denoising

  • Wenbo Zhao
  • Xianming Liu
  • Junjun Jiang
  • Debin Zhao
  • Ge Li
  • Xiangyang Ji

3D meshes are widely employed to represent geometry structure of 3D shapes. Due to limitation of scanning sensor precision and other issues, meshes are inevitably affected by noise, which hampers the subsequent applications. Convolutional neural networks (CNNs) achieve great success in image processing tasks, including 2D image denoising, and have been proven to own the capacity of modeling complex features at different scales, which is also particularly useful for mesh denoising. However, due to the nature of irregular structure, CNNs-based denoising strategies cannot be trivially applied for meshes. To circumvent this limitation, in this paper, we propose the local surface descriptor (LSD), which is able to transform the local deformable surface around a face into a 2D grid representation and thus facilitates the deployment of CNNs to generate denoised face normals. To verify the superiority of LSD, we directly feed LSD into the classical ResNet without any complicated network design. The extensive experimental results show that, compared to the state-of-the-art, our method achieves encouraging performance with respect to both objective and subjective evaluations.

AAAI Conference 2022 Conference Paper

OctAttention: Octree-Based Large-Scale Contexts Model for Point Cloud Compression

  • Chunyang Fu
  • Ge Li
  • Rui Song
  • Wei Gao
  • Shan Liu

In point cloud compression, sufficient contexts are significant for modeling the point cloud distribution. However, the contexts gathered by the previous voxel-based methods decrease when handling sparse point clouds. To address this problem, we propose a multiple-contexts deep learning framework called OctAttention employing the octree structure, a memory-efficient representation for point clouds. Our approach encodes octree symbol sequences in a lossless way by gathering the information of sibling and ancestor nodes. Expressly, we first represent point clouds with octree to reduce spatial redundancy, which is robust for point clouds with different resolutions. We then design a conditional entropy model with a large receptive field that models the sibling and ancestor contexts to exploit the strong dependency among the neighboring nodes and employ an attention mechanism to emphasize the correlated nodes in the context. Furthermore, we introduce a mask operation during training and testing to make a trade-off between encoding time and performance. Compared to the previous state-of-the-art works, our approach obtains a 10%-35% BD-Rate gain on the LiDAR benchmark (e.g., SemanticKITTI) and object point cloud datasets (e.g., MPEG 8i, MVUB), and saves 95% coding time compared to the voxel-based baseline. The code is available at https://github.com/zb12138/OctAttention.
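The octree symbol stream such entropy models consume can be illustrated with a small breadth-first serializer in which each node emits an 8-bit child-occupancy symbol. This is a simplified sketch of the generic octree encoding, not OctAttention's implementation:

```python
def octree_symbols(points, depth, lo=0.0, size=1.0):
    """Serialize a point cloud (points in the cube [lo, lo+size)^3) into
    per-node 8-bit occupancy symbols, breadth-first. Bit i of a symbol is
    set iff child octant i contains at least one point."""
    def octant(p, origin, half):
        b = 0
        if p[0] >= origin[0] + half: b |= 1
        if p[1] >= origin[1] + half: b |= 2
        if p[2] >= origin[2] + half: b |= 4
        return b

    nodes = [(list(points), (lo, lo, lo), size)]
    symbols = []
    for _ in range(depth):
        next_nodes = []
        for pts, origin, s in nodes:
            half = s / 2.0
            children = [[] for _ in range(8)]
            for p in pts:
                children[octant(p, origin, half)].append(p)
            sym = 0
            for i, ch in enumerate(children):
                if ch:
                    sym |= 1 << i
                    child_origin = (origin[0] + half * (i & 1),
                                    origin[1] + half * ((i >> 1) & 1),
                                    origin[2] + half * ((i >> 2) & 1))
                    next_nodes.append((ch, child_origin, half))
            symbols.append(sym)
        nodes = next_nodes
    return symbols
```

An entropy model like the paper's then predicts each symbol's distribution from its ancestor and sibling symbols; empty subtrees simply never appear in the stream, which is where the spatial-redundancy saving comes from.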

NeurIPS Conference 2022 Conference Paper

SKFlow: Learning Optical Flow with Super Kernels

  • Shangkun Sun
  • Yuanqi Chen
  • Yu Zhu
  • Guodong Guo
  • Ge Li

Optical flow estimation is a classical yet challenging task in computer vision. One of the essential factors in accurately predicting optical flow is to alleviate occlusions between frames. However, it is still a thorny problem for current top-performing optical flow estimation methods due to insufficient local evidence to model occluded areas. In this paper, we propose the Super Kernel Flow Network (SKFlow), a CNN architecture to ameliorate the impacts of occlusions on optical flow estimation. SKFlow benefits from the super kernels which bring enlarged receptive fields to complement the absent matching information and recover the occluded motions. We present efficient super kernel designs by utilizing conical connections and hybrid depth-wise convolutions. Extensive experiments demonstrate the effectiveness of SKFlow on multiple benchmarks, especially in the occluded areas. Without pre-trained backbones on ImageNet and with a modest increase in computation, SKFlow achieves compelling performance and ranks $\textbf{1st}$ among currently published methods on the Sintel benchmark. On the challenging Sintel clean and final passes (test), SKFlow surpasses the best-published result in the unmatched areas ($7.96$ and $12.50$) by $9.09\%$ and $7.92\%$. The code is available at https://github.com/littlespray/SKFlow.

NeurIPS Conference 2021 Conference Paper

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

  • Shuai Lu
  • Daya Guo
  • Shuo Ren
  • Junjie Huang
  • Alexey Svyatkovskiy
  • Ambrosio Blanco
  • Colin Clement
  • Dawn Drain

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

NeurIPS Conference 2021 Conference Paper

Integrating Tree Path in Transformer for Code Representation

  • Han Peng
  • Ge Li
  • Wenhan Wang
  • YunFei Zhao
  • Zhi Jin

Learning distributed representation of source code requires modelling its syntax and semantics. Recent state-of-the-art models leverage highly structured source code representations, such as the syntax trees and paths therein. In this paper, we investigate two representative path encoding methods shown in previous research work and integrate them into the attention module of Transformer. We draw inspiration from the ideas of positional encoding and modify them to incorporate these path encodings. Specifically, we encode both the pairwise path between tokens of source code and the path from the leaf node to the tree root for each token in the syntax tree. We explore the interaction between these two kinds of paths by integrating them into the unified Transformer framework. The detailed empirical study for path encoding methods also leads to our novel state-of-the-art representation model TPTrans, which finally outperforms strong baselines. Extensive experiments and ablation studies on code summarization across four different languages demonstrate the effectiveness of our approaches. We release our code at \url{https://github.com/AwdHanPeng/TPTrans}.

AAAI Conference 2021 Conference Paper

SSD-GAN: Measuring the Realness in the Spatial and Spectral Domains

  • Yuanqi Chen
  • Ge Li
  • Cece Jin
  • Shan Liu
  • Thomas Li

This paper observes that there is an issue of high frequencies missing in the discriminator of standard GAN, and we reveal it stems from downsampling layers employed in the network architecture. This issue makes the generator lack the incentive from the discriminator to learn high-frequency content of data, resulting in a significant spectrum discrepancy between generated images and real images. Since the Fourier transform is a bijective mapping, we argue that reducing this spectrum discrepancy would boost the performance of GANs. To this end, we introduce SSD-GAN, an enhancement of GANs to alleviate the spectral information loss in the discriminator. Specifically, we propose to embed a frequency-aware classifier into the discriminator to measure the realness of the input in both the spatial and spectral domains. With the enhanced discriminator, the generator of SSD-GAN is encouraged to learn high-frequency content of real data and generate exact details. The proposed method is general and can be easily integrated into most existing GANs framework without excessive cost. The effectiveness of SSD-GAN is validated on various network architectures, objective functions, and datasets. Code is available at https://github.com/cyq373/SSD-GAN.

AAAI Conference 2020 Conference Paper

Generating Adversarial Examples for Holding Robustness of Source Code Processing Models

  • Huangzhao Zhang
  • Zhuo Li
  • Ge Li
  • Lei Ma
  • Yang Liu
  • Zhi Jin

Automated processing, analysis, and generation of source code are among the key activities in software and system lifecycle. To this end, while deep learning (DL) exhibits a certain level of capability in handling these tasks, the current state-of-the-art DL models still suffer from non-robust issues and can be easily fooled by adversarial attacks. Different from adversarial attacks for image, audio, and natural languages, the structured nature of programming languages brings new challenges. In this paper, we propose a Metropolis-Hastings sampling-based identifier renaming technique, named Metropolis-Hastings Modifier (MHM), which generates adversarial examples for DL models specialized for source code processing. Our in-depth evaluation on a functionality classification benchmark demonstrates the effectiveness of MHM in generating adversarial examples of source code. The higher robustness and performance achieved through adversarial training with MHM further confirm the usefulness of DL-based methods for future fully automated source code processing.
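The Metropolis-Hastings accept/reject step at the core of such a search can be sketched generically as below, assuming a symmetric proposal over candidate identifier renamings and an abstract adversarial "score" (MHM's actual acceptance ratio is defined over the victim model's output probabilities, so treat the score and temperature here as illustrative):

```python
import math, random

def mh_rename_step(current_score, proposed_score, temperature=1.0,
                   rng=random.Random(0)):
    """One Metropolis-Hastings accept/reject decision with a symmetric
    proposal: the 'score' measures how strongly a candidate renaming pushes
    the model toward a wrong prediction, and the proposal is accepted with
    probability min(1, exp((s' - s) / T))."""
    accept_prob = min(1.0, math.exp((proposed_score - current_score) / temperature))
    return rng.random() < accept_prob
```

Uphill moves (a more adversarial renaming) are always accepted, while downhill moves are accepted occasionally, letting the sampler escape local optima in the renaming space.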

IJCAI Conference 2020 Conference Paper

NLocalSAT: Boosting Local Search with Solution Prediction

  • Wenjie Zhang
  • Zeyu Sun
  • Qihao Zhu
  • Ge Li
  • Shaowei Cai
  • Yingfei Xiong
  • Lu Zhang

The Boolean satisfiability problem (SAT) is a famous NP-complete problem in computer science. An effective way for solving a satisfiable SAT problem is the stochastic local search (SLS). However, in this method, the initialization is assigned in a random manner, which impacts the effectiveness of SLS solvers. To address this problem, we propose NLocalSAT. NLocalSAT combines SLS with a solution prediction model, which boosts SLS by changing initialization assignments with a neural network. We evaluated NLocalSAT on five SLS solvers (CCAnr, Sparrow, CPSparrow, YalSAT, and probSAT) with instances in the random track of SAT Competition 2018. The experimental results show that solvers with NLocalSAT achieve 27%-62% improvement over the original SLS solvers.
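The idea of seeding stochastic local search with a predicted assignment can be illustrated with a toy WalkSAT-style loop; the clause encoding and the parameters below are generic, not NLocalSAT's, and `init` stands in for the neural network's predicted assignment:

```python
import random

def walksat(clauses, init, max_flips=1000, p=0.5, rng=random.Random(0)):
    """Tiny WalkSAT-style SLS loop. `init` plays the role of the predicted
    initial assignment; a better prediction means fewer flips to a solution.
    Clauses are lists of non-zero DIMACS-style literals (k means variable k
    is true, -k means it is false). Returns (assignment, flips) or (None, max_flips)."""
    assign = list(init)

    def sat(lit):
        v = assign[abs(lit) - 1]
        return v if lit > 0 else not v

    for flips in range(max_flips + 1):
        unsat = [c for c in clauses if not any(sat(l) for l in c)]
        if not unsat:
            return assign, flips
        clause = rng.choice(unsat)
        if rng.random() < p:
            lit = rng.choice(clause)   # random-walk move
        else:
            # greedy move: flip the variable leaving the fewest unsat clauses
            def unsat_after_flip(l):
                assign[abs(l) - 1] = not assign[abs(l) - 1]
                n = sum(1 for c in clauses if not any(sat(x) for x in c))
                assign[abs(l) - 1] = not assign[abs(l) - 1]
                return n
            lit = min(clause, key=unsat_after_flip)
        assign[abs(lit) - 1] = not assign[abs(lit) - 1]
    return None, max_flips

# Satisfiable demo formula: forces x2 and x3 true, x1 free.
demo_clauses = [[1, 2], [-1, 2], [-2, 3]]
```

An initialization that already satisfies the formula costs zero flips, which is exactly the gap a good solution-prediction model narrows.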

AAAI Conference 2020 Conference Paper

Region Focus Network for Joint Optic Disc and Cup Segmentation

  • Ge Li
  • Changsheng Li
  • Chan Zeng
  • Peng Gao
  • Guotong Xie

Glaucoma is one of the three leading causes of blindness in the world and is predicted to affect around 80 million people by 2020. The optic cup (OC) to optic disc (OD) ratio (CDR) in fundus images plays a pivotal role in the screening and diagnosis of glaucoma. Existing methods usually crop the optic disc region first, and subsequently perform segmentation in this region. However, these approaches come up with high complexities due to the separate operations. To remedy this issue, we propose a Region Focus Network (RF-Net) that innovatively integrates detection and multi-class segmentation into a unified architecture for end-to-end joint optic disc and cup segmentation with global optimization. The key idea of our method is designing a novel multi-class mask branch which generates a high-quality segmentation in the detected region for both disc and cup. To bridge the connection between the backbone and multi-class mask branch, a Fusion Feature Pooling (FFP) structure is presented to extract features from each level of the pyramid network and fuse them into a final feature representation for segmentation. Extensive experimental results on the REFUGE-2018 challenge dataset and the Drishti-GS dataset show that the proposed method achieves the best performance, compared with competitive approaches reported in the literature and the official leaderboard. Our code will be released soon.

AAAI Conference 2019 Conference Paper

A Grammar-Based Structural CNN Decoder for Code Generation

  • Zeyu Sun
  • Qihao Zhu
  • Lili Mou
  • Yingfei Xiong
  • Ge Li
  • Lu Zhang

Code generation maps a program description to executable source code in a programming language. Existing approaches mainly rely on a recurrent neural network (RNN) as the decoder. However, we find that a program contains significantly more tokens than a natural language sentence, and thus it may be inappropriate for RNN to capture such a long sequence. In this paper, we propose a grammar-based structural convolutional neural network (CNN) for code generation. Our model generates a program by predicting the grammar rules of the programming language; we design several CNN modules, including the tree-based convolution and pre-order convolution, whose information is further aggregated by dedicated attentive pooling layers. Experimental results on the HearthStone benchmark dataset show that our CNN code generator significantly outperforms the previous state-of-the-art method by 5 percentage points; additional experiments on several semantic parsing tasks demonstrate the robustness of our model. We also conduct in-depth ablation test to better understand each component of our model.

IJCAI Conference 2019 Conference Paper

ARMIN: Towards a More Efficient and Light-weight Recurrent Memory Network

  • Zhangheng Li
  • Jia-Xing Zhong
  • Jingjia Huang
  • Tao Zhang
  • Thomas Li
  • Ge Li

In recent years, memory-augmented neural networks (MANNs) have shown promising power to enhance the memory ability of neural networks for sequential processing tasks. However, previous MANNs suffer from complex memory addressing mechanism, making them relatively hard to train and causing computational overheads. Moreover, many of them reuse the classical RNN structure such as LSTM for memory processing, causing inefficient exploitations of memory information. In this paper, we introduce a novel MANN, the Auto-addressing and Recurrent Memory Integrating Network (ARMIN) to address these issues. The ARMIN only utilizes hidden state h_t for automatic memory addressing, and uses a novel RNN cell for refined integration of memory information. Empirical results on a variety of experiments demonstrate that the ARMIN is more light-weight and efficient compared to existing memory networks. Moreover, we demonstrate that the ARMIN can achieve much lower computational overhead than vanilla LSTM while keeping similar performances. Codes are available on github.com/zoharli/armin.

NeurIPS Conference 2019 Conference Paper

Code Generation as a Dual Task of Code Summarization

  • Bolin Wei
  • Ge Li
  • Xin Xia
  • Zhiyi Fu
  • Zhi Jin

Code summarization (CS) and code generation (CG) are two crucial tasks in the field of automatic software development. Various neural network-based approaches are proposed to solve these two tasks separately. However, there exists a specific intuitive correlation between CS and CG, which has not been exploited in previous work. In this paper, we apply the relations between the two tasks to improve the performance of both. In other words, exploiting the duality between the two tasks, we propose a dual training framework to train them simultaneously. In this framework, we consider the dualities on probability and attention weights, and design corresponding regularization terms to constrain the duality. We evaluate our approach on two datasets collected from GitHub, and experimental results show that our dual framework can improve the performance of CS and CG tasks over baselines.

TIST Journal 2019 Journal Article

Exploiting the Value of the Center-dark Channel Prior for Salient Object Detection

  • Chunbiao Zhu
  • Wenhao Zhang
  • Thomas H. Li
  • Shan Liu
  • Ge Li

Saliency detection aims to detect the most attractive objects in images and is widely used as a foundation for various applications. In this article, we propose a novel salient object detection algorithm for RGB-D images using center-dark channel priors. First, we generate an initial saliency map from the color saliency map and the depth saliency map of a given RGB-D image. Then, we generate a center-dark channel map based on the center saliency and dark channel priors. Finally, we fuse the initial saliency map with the center-dark channel map to obtain the final saliency map. Extensive evaluations on four benchmark datasets demonstrate that the proposed method performs favorably against most state-of-the-art approaches. In addition, we discuss the application of the proposed algorithm to small target detection and demonstrate the general value of center-dark channel priors in object detection.
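
The three-stage pipeline (initial map from color and depth, a center-dark channel map, then fusion) can be sketched with NumPy. The min-max normalization, the Gaussian center prior, and the multiplicative fusion rule below are illustrative stand-ins, not the paper's exact operators:

```python
import numpy as np

def normalize(m, eps=1e-8):
    """Rescale a saliency map to [0, 1]."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo + eps)

def center_prior(h, w, sigma=0.3):
    """Gaussian center-saliency prior: pixels near the image center score higher."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = ((ys - (h - 1) / 2) / h) ** 2 + ((xs - (w - 1) / 2) / w) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def fuse_saliency(color_sal, depth_sal, dark_channel):
    """Initial map from color x depth, then fusion with the
    center-dark channel map (center prior x dark-channel map)."""
    initial = normalize(color_sal) * normalize(depth_sal)
    center_dark = center_prior(*color_sal.shape) * normalize(dark_channel)
    return normalize(initial * center_dark)
```

Multiplicative fusion suppresses responses that only one cue supports; a weighted sum would be a softer alternative if either cue is noisy.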

NeurIPS Conference 2019 Conference Paper

Multi-mapping Image-to-Image Translation via Learning Disentanglement

  • Xiaoming Yu
  • Yuanqi Chen
  • Shan Liu
  • Thomas Li
  • Ge Li

Recent advances in image-to-image translation approach the one-to-many mapping from two directions: multi-modal translation and multi-domain translation. However, existing methods consider only one of the two perspectives, so neither can solve the other's problem. To address this issue, we propose a novel unified model that bridges the two objectives. First, we disentangle the input images into latent representations using an encoder-decoder architecture with conditional adversarial training in the feature space. Then, we encourage the generator to learn multi-mappings via random cross-domain translation. As a result, we can manipulate different parts of the latent representations to perform multi-modal and multi-domain translation simultaneously. Experiments demonstrate that our method outperforms state-of-the-art methods.

AAAI Conference 2018 Conference Paper

SAP: Self-Adaptive Proposal Model for Temporal Action Detection Based on Reinforcement Learning

  • Jingjia Huang
  • Nannan Li
  • Tao Zhang
  • Ge Li
  • Tiejun Huang
  • Wen Gao

Existing action detection algorithms usually generate action proposals through an extensive search over the video at multiple temporal scales, which incurs a huge computational overhead and deviates from how humans perceive actions. We argue that detecting actions should naturally be a process of observation and refinement: observe the current window and refine its span to cover the true action region. In this paper, we propose a Self-Adaptive Proposal (SAP) model that learns to find actions by continuously adjusting the temporal bounds in a self-adaptive way. The whole process can be viewed as an agent that starts at the beginning of the video and traverses it, applying a sequence of transformations to the currently attended region to discover actions according to a learned policy. We use reinforcement learning, specifically the Deep Q-learning algorithm, to learn the agent's decision policy. In addition, we use a temporal pooling operation to extract a more effective feature representation for long temporal windows, and design a regression network to adjust the position offsets between predicted results and the ground truth. Experimental results on THUMOS'14 validate the effectiveness of SAP, which achieves performance competitive with current action detection algorithms using far fewer proposals.
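
The "observe and refine" loop reduces to an agent repeatedly choosing a transformation of the attended temporal window. The action set, step ratio `alpha`, and function name below are illustrative, not the paper's exact configuration:

```python
def apply_action(start, end, action, alpha=0.2, video_len=100.0):
    """Apply one self-adaptive transformation to the attended window
    [start, end], then clamp it to the video extent. A Q-network would
    score these actions given features of the current window."""
    span = end - start
    if action == "shift_right":
        start, end = start + alpha * span, end + alpha * span
    elif action == "expand":
        start, end = start - alpha * span / 2, end + alpha * span / 2
    elif action == "shrink":
        start, end = start + alpha * span / 2, end - alpha * span / 2
    # a "stop"/emit action would leave the window unchanged and
    # output it as a proposal (refined further by the regression network)
    return max(0.0, start), min(video_len, end)
```

Scaling each move by the current span lets the agent take coarse steps over empty video and fine steps near an action boundary, which is what makes far fewer proposals sufficient.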

IJCAI Conference 2018 Conference Paper

Summarizing Source Code with Transferred API Knowledge

  • Xing Hu
  • Ge Li
  • Xin Xia
  • David Lo
  • Shuai Lu
  • Zhi Jin

Code summarization, which aims to generate succinct natural language descriptions of source code, is extremely useful for code search and code comprehension, and it plays an important role in software maintenance and evolution. Previous approaches generate summaries by retrieving them from similar code snippets. However, these approaches depend heavily on whether similar snippets can be retrieved and on how similar they are, and they fail to capture the API knowledge in the source code, which carries vital information about its functionality. In this paper, we propose a novel approach, named TL-CodeSum, which successfully transfers API knowledge learned in a different but related task to code summarization. Experiments on large-scale real-world industry Java projects indicate that our approach is effective and outperforms the state of the art in code summarization.

AAAI Conference 2016 Conference Paper

Convolutional Neural Networks over Tree Structures for Programming Language Processing

  • Lili Mou
  • Ge Li
  • Lu Zhang
  • Tao Wang
  • Zhi Jin

Programming language processing (analogous to natural language processing) is a hot research topic in software engineering, and it has also attracted growing interest in the artificial intelligence community. However, unlike a natural language sentence, a program contains rich, explicit, and complicated structural information, so traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.
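
A tree-based convolution can be sketched as one kernel applied over an AST node and its children, with each child's weight matrix blended from "left" and "right" matrices according to its position, so a fixed kernel handles varying arity. The matrix names and coefficients below are illustrative, not the paper's exact formulation:

```python
import numpy as np

def tree_conv_window(node_vec, child_vecs, W_top, W_left, W_right, b):
    """One convolution window over an AST node and its children.
    Each child's effective weight is a position-dependent blend of
    W_left and W_right (a 'continuous binary tree' style kernel)."""
    out = W_top @ node_vec + b
    n = len(child_vecs)
    for i, child in enumerate(child_vecs):
        eta_r = 0.5 if n == 1 else i / (n - 1)   # rightness of child i
        eta_l = 1.0 - eta_r
        out = out + eta_l * (W_left @ child) + eta_r * (W_right @ child)
    return np.tanh(out)

# Toy usage: 4-dim node embeddings, 3-dim convolution output.
rng = np.random.default_rng(2)
W_top, W_left, W_right = (rng.standard_normal((3, 4)) for _ in range(3))
b = np.zeros(3)
node = rng.standard_normal(4)
children = [rng.standard_normal(4) for _ in range(3)]
feat = tree_conv_window(node, children, W_top, W_left, W_right, b)
```

Sliding this window over every node of the syntax tree and pooling the resulting features yields a fixed-size program representation for classification.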

ICRA Conference 2014 Conference Paper

Attitude-guided robust adaptive path following control for ducted fan UAV

  • Yanhe Zhu
  • Ge Li
  • Jie Zhao 0003
  • Hongzhe Jin

This article presents an approach and a systematic design methodology for path-following control of high-performance ducted fan unmanned aerial vehicles (UAVs), based on motion decoupling. The decoupling treats the attitude motion as a virtual input to the lateral-longitudinal flight dynamics, which allows the attitude and flight controllers to be designed individually without mutual interference. Because the dynamics of the ducted fan UAV are uncertain, a system-function estimation method based on a smooth saturation function and a state-variable integral is proposed. This method requires less computation while maintaining high estimation performance. Stability analysis and simulation results demonstrating the practical feasibility of the proposed control scheme for ducted fan UAVs are given.