Arrow Research search

Author name cluster

Xiaolin Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
2 author rows

Possible papers

25

AAAI Conference 2026 Conference Paper

FGNet: Leveraging Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation

  • Zhenghua Li
  • Hang Chen
  • Zihao Sun
  • Kai Li
  • Xiaolin Hu

Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, which limit the accuracy and generalization of existing methods. To address these challenges, we leverage the priors learned by visual foundation models on vast numbers of natural images. Specifically, we propose a novel framework that effectively transfers knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), to focus on challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, combined with targeted domain-adaptive guidance, can effectively address the specific challenges of neuron segmentation.

AAAI Conference 2026 Conference Paper

Put the Space of LoRA Initialization to the Extreme to Preserve Pre-trained Knowledge

  • Pengwei Tang
  • Xiaolin Hu
  • Yong Liu
  • Lizhong Ding
  • Dongjie Zhang
  • Xing Wu
  • Debing Zhang

Low-Rank Adaptation (LoRA) is the leading parameter-efficient fine-tuning method for Large Language Models (LLMs), but it still suffers from catastrophic forgetting. Recent work has shown that specialized LoRA initialization can alleviate this forgetting. Two approaches to LoRA initialization aim to prevent knowledge forgetting during fine-tuning: (1) keeping the residual weights close to the pre-trained weights, and (2) making the space of the LoRA initialization orthogonal to pre-trained knowledge. Current methods strive for the former, while the importance of the latter is not sufficiently recognized. We find that the space of the LoRA initialization, rather than the residual weights, is the key to preserving pre-trained knowledge. Existing methods such as MiLoRA make the LoRA initialization space orthogonal to the pre-trained weights by using their null space. However, compared to the pre-trained weights, the input activations of pre-trained knowledge reflect the parameters of all previous layers as well as the input data, whereas the pre-trained weights contain information only from the current layer. Moreover, we find that the effective ranks of the input activations are much smaller than those of the pre-trained weights, so the null space of the activations is more accurate and carries less pre-trained knowledge than that of the weights. Based on these observations, we introduce LoRA-Null, which initializes LoRA in the null space of the activations. Extensive experiments show that LoRA-Null effectively preserves the pre-trained world knowledge of LLMs while achieving good fine-tuning performance.
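
As a rough sketch of the core idea (a hypothetical helper, not the authors' released code), the null space of collected input activations can be extracted with an SVD, and the LoRA down-projection A initialized from the right singular vectors with the smallest singular values, with B set to zero as in standard LoRA, so the update BA initially leaves responses to pre-training-like inputs unchanged:

```python
import numpy as np

def lora_null_init(activations: np.ndarray, rank: int, d_out: int):
    """Hypothetical null-space LoRA initialization sketch.

    activations: (n_samples, d_in) input activations gathered from
    pre-training-like data (assumes n_samples >= d_in). The rows of A are
    right singular vectors of the activations with the smallest singular
    values, i.e. directions along which the observed activations are
    (near-)zero; B starts at zero as in standard LoRA.
    """
    _, _, vt = np.linalg.svd(activations, full_matrices=False)
    a = vt[-rank:]                  # (rank, d_in), orthonormal rows
    b = np.zeros((d_out, rank))     # standard zero init for B
    return a, b

# Toy check: activations confined to a 5-dim subspace of an 8-dim input space.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 8))
a, b = lora_null_init(acts, rank=3, d_out=8)
# A annihilates the observed activations, so the LoRA update B @ A does not
# perturb outputs for inputs drawn from this subspace.
```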

AAAI Conference 2026 Conference Paper

VaccineRAG: Boosting Multimodal Large Language Models’ Immunity to Harmful RAG Samples

  • Qixin Sun
  • Ziqin Wang
  • Hengyuan Zhao
  • Yilin Li
  • Kaiyou Song
  • Si Liu
  • Xiaolin Hu
  • Qingpei Guo

Retrieval-Augmented Generation (RAG) enhances the response accuracy of Large Language Models (LLMs) by integrating retrieval and generation modules with external knowledge, demonstrating particular strength in real-time queries and Visual Question Answering tasks. However, the effectiveness of RAG is frequently hindered by the precision of the retriever: many retrieved samples fed into the generation phase are irrelevant or misleading, posing a critical bottleneck to LLMs’ performance. To address this challenge, we introduce VaccineRAG, a novel Chain-of-Thought-based retrieval-augmented generation dataset. On one hand, VaccineRAG provides a benchmark that evaluates models on data with varying positive/negative sample ratios, systematically exposing inherent weaknesses in current LLMs. On the other hand, it enhances models’ sample-discrimination capabilities by prompting LLMs to generate an explicit Chain-of-Thought (CoT) analysis for each sample before producing final answers. Furthermore, to improve the model’s ability to learn long, complex CoT sequences, we propose Partial-GRPO. By modeling the outputs of LLMs as multiple components rather than a single whole, it enables more informed preference selection over complex sequences, thereby enhancing the model’s capacity to learn complex CoT. Comprehensive evaluations and ablation studies on VaccineRAG validate the effectiveness of the proposed scheme.

ICLR Conference 2025 Conference Paper

ADBM: Adversarial Diffusion Bridge Model for Reliable Adversarial Purification

  • Xiao Li
  • Wenxuan Sun
  • Huanran Chen
  • Qiongxiu Li
  • Yingzhe He
  • Jie Shi
  • Xiaolin Hu

Recently, Diffusion-based Purification (DiffPure) has been recognized as an effective defense against adversarial examples. However, we find that DiffPure, which directly employs the original pre-trained diffusion models for adversarial purification, is suboptimal due to an inherent trade-off between noise purification performance and data recovery quality. Additionally, the reliability of existing evaluations of DiffPure is questionable, as they rely on weak adaptive attacks. In this work, we propose a novel Adversarial Diffusion Bridge Model, termed ADBM. ADBM directly constructs a reverse bridge from the diffused adversarial data back to its original clean examples, enhancing the purification capabilities of the original diffusion models. Through theoretical analysis and experimental validation across various scenarios, ADBM proves to be a superior and robust defense mechanism with significant promise for practical applications. Code is available at https://github.com/LixiaoTHU/ADBM.

NeurIPS Conference 2025 Conference Paper

Connectome-Based Modelling Reveals Orientation Maps in the Drosophila Optic Lobe

  • Jia Nuo Liew
  • Shenghan Lin
  • Bowen Chen
  • Xiaowei Zhu
  • Wei Zhang
  • Xiaolin Hu

The ability to extract oriented edges from visual input is a core computation across animal vision systems. Orientation maps, long associated with the layered architecture of the mammalian visual cortex, systematically organise neurons by their preferred edge orientation. Despite lacking cortical structures, the Drosophila melanogaster brain contains feature-selective neurons and exhibits complex visual detection capacity, raising the question of whether map-like vision representations can emerge without cortical infrastructure. We integrate a complete fruit fly brain connectome with biologically grounded spiking neuron models to simulate neuroprocessing in the fly visual system. By driving the network with oriented stimuli and analysing downstream responses, we show that coherent orientation maps can emerge from purely connectome-constrained dynamics. These results suggest that species of independent origin could evolve similar visual structures.

ICLR Conference 2025 Conference Paper

SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios

  • Kai Li 0018
  • Wendi Sang
  • Chang Zeng
  • Runxuan Yang
  • Guo Chen
  • Xiaolin Hu

Systematic evaluation of speech separation and enhancement models under moving sound source conditions requires extensive and diverse data. However, real-world datasets often lack sufficient data for training and evaluation, and synthetic datasets, while larger, lack acoustic realism. Consequently, neither effectively meets practical needs. To address this issue, we introduce SonicSim, a synthetic toolkit based on the embodied AI simulation platform Habitat-sim, designed to generate highly customizable data for moving sound sources. SonicSim supports scene-level, microphone-level, and source-level adjustments, enabling the creation of more diverse synthetic data. Leveraging SonicSim, we constructed a benchmark dataset called SonicSet, utilizing LibriSpeech, Freesound Dataset 50k (FSD50K), Free Music Archive (FMA), and 90 scenes from Matterport3D to evaluate speech separation and enhancement models. Additionally, to investigate the differences between synthetic and real-world data, we selected 5 hours of raw, non-reverberant data from the SonicSet validation set and recorded a real-world speech separation dataset, providing a reference for comparing SonicSet with other synthetic datasets. For speech enhancement, we utilized the real-world dataset RealMAN to validate the acoustic gap between SonicSet and existing synthetic datasets. The results indicate that models trained on SonicSet generalize better to real-world scenarios than models trained on other synthetic datasets. Code is publicly available at https://cslikai.cn/SonicSim/.

AAAI Conference 2025 Conference Paper

Stability and Generalization of Zeroth-Order Decentralized Stochastic Gradient Descent with Changing Topology

  • Xiaolin Hu
  • Zixuan Gong
  • Gengze Xu
  • Wei Liu
  • Jian Luan
  • Bin Wang
  • Yong Liu

Zeroth-order (ZO) optimization, a gradient-free method, has become a powerful tool when the first-order gradient is unavailable or expensive to obtain, especially in decentralized learning scenarios where data and computational resources are distributed across multiple clients. There have been many efforts to analyze the optimization convergence rate of zeroth-order decentralized stochastic gradient descent (ZO-DSGD) algorithms. However, the generalization of these methods has not been well studied. In this paper, we provide a generalization analysis of ZO-DSGD with changing topology, where the clients run zeroth-order SGD on local data and communicate with each other according to a time-varying topology. We systematically analyze the generalization error in the convex, strongly convex, and non-convex cases. The results obtained in the convex and strongly convex cases with zeroth-order oracles recover the results of SGD. Moreover, the generalization bounds derived in the non-convex case align with those of DSGD. To capture the influence of communication topology on generalization performance, we analyze local generalization bounds concerning the local models held at different clients. The obtained results reflect the influence of the number of clients, the local sample size, and the topology on the generalization error. To the best of our knowledge, this is the first work that provides a generalization analysis of zeroth-order decentralized stochastic gradient descent methods and recovers the results of SGD.
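
For intuition, the zeroth-order oracle in such analyses is commonly a two-point finite-difference estimator; a minimal sketch under that assumption (the function name and defaults are illustrative, not from the paper):

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_samples=20000, seed=0):
    """Two-point zeroth-order gradient estimator:
    average over u ~ N(0, I) of (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u.
    In expectation this matches the gradient of the mu-smoothed objective,
    so only function evaluations (no first-order gradients) are needed.
    """
    rng = np.random.default_rng(seed)
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / n_samples

# On the quadratic f(x) = ||x||^2 the true gradient is 2x; with enough random
# directions the estimate concentrates around it.
f = lambda v: float(v @ v)
x = np.array([1.0, -2.0, 0.5, 3.0])
g_hat = zo_gradient(f, x)
```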

IJCAI Conference 2025 Conference Paper

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

  • Xinhao Yao
  • Hongjin Qian
  • Xiaolin Hu
  • Gengze Xu
  • Wei Liu
  • Jian Luan
  • Bin Wang
  • Yong Liu

Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we explore two notable phenomena related to the attention mechanism during the fine-tuning of LLMs (where Wq, Wk, and Wv denote the weights of the query, key, and value layers, respectively). The first phenomenon, termed “Unequal Importance of Attention Matrices”, highlights the impact of fine-tuning different weight matrices: optimizing the Wv matrix yields significantly better performance than optimizing the Wk matrix, and fine-tuning only the Wq and Wv matrices is computationally efficient while delivering results comparable to, or even better than, fine-tuning all three matrices (Wq, Wk, and Wv). The second phenomenon, “Attention Matrices with Customized Learning Rate Lead to Better Convergence”, emphasizes the importance of assigning distinct learning rates to these matrices: a higher learning rate for the Wv matrix than for Wq and Wk accelerates convergence and improves performance. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving algorithms in LLM fine-tuning.

ICLR Conference 2025 Conference Paper

TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

  • Mohan Xu
  • Kai Li
  • Guo Chen
  • Xiaolin Hu

In recent years, speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: the Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features, and introduce a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results showed that models trained on EchoSet generalized better to data collected in the physical world than models trained on other datasets, validating EchoSet's practical value. On EchoSet and real-world data, TIGER reduces the number of parameters by 94.3% and the MACs by 95.3% while surpassing the performance of the state-of-the-art (SOTA) model TF-GridNet.

NeurIPS Conference 2024 Conference Paper

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

  • Jiawei Gao
  • Ziqin Wang
  • Zeqi Xiao
  • Jingbo Wang
  • Tai WANG
  • Jinkun Cao
  • Xiaolin Hu
  • Si Liu

Enabling humanoid robots to clean rooms has long been a pursued dream within the humanoid research community. However, many tasks require multi-humanoid collaboration, such as carrying large and heavy furniture together. Given the scarcity of motion capture data on multi-humanoid collaboration and the efficiency challenges of multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a framework designed to tackle the multi-humanoid object transportation problem through a two-phase learning paradigm: individual skill learning and subsequent policy transfer. First, a single humanoid character learns to interact with objects through imitation learning from human motion priors. Then, the humanoid learns to collaborate with others by considering the shared dynamics of the manipulated object, using centralized training and decentralized execution (CTDE) multi-agent RL algorithms. When one agent interacts with the object and causes specific changes in the object's dynamics, the other agents learn to respond appropriately, achieving implicit communication and coordination between teammates. Unlike previous approaches that relied on tracking-based methods for multi-humanoid HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-humanoid interactions, and can be seamlessly extended to more participants and a wide range of object types.

NeurIPS Conference 2024 Conference Paper

Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective

  • Xinhao Yao
  • Xiaolin Hu
  • Shenzhi Yang
  • Yong Liu

Pre-trained large language models (LLMs) based on the Transformer have demonstrated striking in-context learning (ICL) abilities. With a few demonstration input-label pairs, they can predict the label for an unseen input without any parameter updates. In this paper, we show an intriguing phenomenon: SVD-based weight pruning can enhance ICL performance, and, more surprisingly, pruning weights in deep layers often yields more stable performance improvements than pruning in shallow layers. However, the underlying mechanism of these findings remains an open question. To explain them, we conduct an in-depth theoretical analysis by presenting the implicit gradient descent (GD) trajectories of ICL and deriving mutual-information-based generalization bounds for ICL via the full implicit GD trajectories. This helps us reasonably explain the surprising experimental findings. Based on all our experimental and theoretical insights, we then propose a simple, model-compression and derivative-free algorithm for enhancing ICL inference on downstream tasks. Experiments on benchmark datasets and open-source LLMs demonstrate the method's effectiveness.
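
The pruning operation itself is easy to sketch: replace a weight matrix with a low-rank truncation that zeroes its smallest singular values (a generic rank-truncation sketch under stated assumptions, not claimed to be the paper's exact recipe):

```python
import numpy as np

def svd_prune(w: np.ndarray, keep: int) -> np.ndarray:
    """Best rank-`keep` approximation of w (Eckart-Young, Frobenius norm):
    zero out all but the `keep` largest singular values and recombine."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    s[keep:] = 0.0
    return (u * s) @ vt  # scale columns of u by s, then recombine

# Toy example: prune a 6x6 weight matrix down to rank 3.
rng = np.random.default_rng(0)
w = rng.normal(size=(6, 6))
w_pruned = svd_prune(w, keep=3)
# The discarded spectral mass equals the Frobenius approximation error.
```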

NeurIPS Conference 2024 Conference Paper

Full-Distance Evasion of Pedestrian Detectors in the Physical World

  • Zhi Cheng
  • Zhanhao Hu
  • Yuqiu Liu
  • Jianmin Li
  • Hang Su
  • Xiaolin Hu

Many studies have proposed attack methods that generate adversarial patterns for evading pedestrian detection, alerting the computer vision community to the need for more attention to detector robustness. However, adversarial patterns optimized by these methods commonly have limited performance at medium to long distances in the physical world. To overcome this limitation, we identify two main challenges. First, in existing methods there is commonly an appearance gap between simulated distant adversarial patterns and their physical-world counterparts, leading to incorrect optimization. Second, there exists a conflict between adversarial losses at different distances, which causes difficulties in optimization. To overcome these challenges, we introduce a Full Distance Attack (FDA) method. Our physical-world experiments demonstrate the effectiveness of our FDA patterns across various detection models, such as YOLOv5, Deformable-DETR, and Mask R-CNN. Code is available at https://github.com/zhicheng2T0/Full-Distance-Attack.git.

NeurIPS Conference 2023 Conference Paper

HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds

  • Gang Zhang
  • Chen Junnan
  • Guohuan Gao
  • Jianmin Li
  • Xiaolin Hu

3D object detection in point clouds is important for autonomous driving systems. A primary challenge in 3D object detection stems from the sparse distribution of points within the 3D scene. Existing high-performance methods typically employ 3D sparse convolutional neural networks with small kernels to extract features. To reduce computational costs, these methods resort to submanifold sparse convolutions, which prevent information exchange among spatially disconnected features. Some recent approaches have attempted to address this problem by introducing large-kernel convolutions or self-attention mechanisms, but they either achieve limited accuracy improvements or incur excessive computational costs. We propose HEDNet, a hierarchical encoder-decoder network for 3D object detection, which leverages encoder-decoder blocks to capture long-range dependencies among features in space, particularly for large and distant objects. We conducted extensive experiments on the Waymo Open and nuScenes datasets. HEDNet achieved higher detection accuracy than previous state-of-the-art methods on both datasets, with competitive efficiency. The code is available at https://github.com/zhanggang001/HEDNet.

AAAI Conference 2021 Conference Paper

Fooling Thermal Infrared Pedestrian Detectors in Real World Using Small Bulbs

  • Xiaopei Zhu
  • Xiao Li
  • Jianmin Li
  • Zheyao Wang
  • Xiaolin Hu

Thermal infrared detection systems play an important role in many areas such as night security, autonomous driving, and body temperature detection. They have the unique advantages of passive imaging, temperature sensitivity, and penetration. But the security of these systems themselves has not been fully explored, which poses risks in applying them. We propose a physical attack method using small bulbs on a board against state-of-the-art pedestrian detectors. Our goal is to make infrared pedestrian detectors unable to detect real-world pedestrians. Towards this goal, we first showed that it is possible to use two kinds of patches to attack an infrared pedestrian detector based on YOLOv3. The average precision (AP) dropped by 64.12% in the digital world, while a blank board of the same size caused the AP to drop by only 29.69%. After that, we designed and manufactured a physical board and successfully attacked YOLOv3 in the real world. In recorded videos, the physical board caused the AP of the target detector to drop by 34.48%, while a blank board of the same size caused the AP to drop by only 14.91%. With ensemble attack techniques, the designed physical board had good transferability to unseen detectors.

NeurIPS Conference 2021 Conference Paper

Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network

  • Xiaolin Hu
  • Kai Li
  • Weiyi Zhang
  • Yi Luo
  • Jean-Marie Lemercier
  • Timo Gerkmann

Recent advances in the design of neural network architectures, in particular those specialized in modeling sequences, have provided significant improvements in speech separation performance. In this work, we propose to use a bio-inspired architecture called Fully Recurrent Convolutional Neural Network (FRCNN) to solve the separation task. This model contains bottom-up, top-down and lateral connections to fuse information processed at various time-scales represented by stages. In contrast to the traditional approach updating stages in parallel, we propose to first update the stages one by one in the bottom-up direction, then fuse information from adjacent stages simultaneously and finally fuse information from all stages to the bottom stage together. Experiments showed that this asynchronous updating scheme achieved significantly better results with much fewer parameters than the traditional synchronous updating scheme on speech separation. In addition, the proposed model achieved competitive or better results with high efficiency as compared to other state-of-the-art approaches on two benchmark datasets.

AAAI Conference 2020 Conference Paper

Dynamic Network Pruning with Interpretable Layerwise Channel Selection

  • Yulong Wang
  • Xiaolu Zhang
  • Xiaolin Hu
  • Bo Zhang
  • Hang Su

Dynamic network pruning achieves runtime acceleration by dynamically determining the inference paths based on different inputs. However, previous methods directly generate continuous decision values for each weight channel, which cannot reflect a clear and interpretable pruning process. In this paper, we propose to explicitly model the discrete weight channel selections, which encourages more diverse weight utilization and achieves sparser runtime inference paths. Meanwhile, with the help of interpretable layerwise channel selections in the dynamic network, we can visualize the network decision paths explicitly for model interpretability. We observe that there are clear differences in the layerwise decisions between normal and adversarial examples. Therefore, we propose a novel adversarial example detection algorithm that discriminates based on the runtime decision features. Experiments show that our dynamic network achieves higher prediction accuracy under similar computing budgets on the CIFAR10 and ImageNet datasets compared to traditional static pruning methods and other dynamic pruning approaches. The proposed adversarial detection algorithm can significantly improve the state-of-the-art detection rate across multiple attacks, providing an opportunity to build an interpretable and robust model.

NeurIPS Conference 2020 Conference Paper

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

  • Xiang Li
  • Wenhai Wang
  • Lijun Wu
  • Shuo Chen
  • Xiaolin Hu
  • Jun Li
  • Jinhui Tang
  • Jian Yang

One-stage detectors basically formulate object detection as dense classification and localization (i.e., bounding box regression). The classification is usually optimized by Focal Loss and the box location is commonly learned under a Dirac delta distribution. A recent trend for one-stage detectors is to introduce an individual prediction branch to estimate the quality of localization, where the predicted quality facilitates the classification to improve detection performance. This paper delves into the representations of the above three fundamental elements: quality estimation, classification, and localization. Two problems are discovered in existing practices: (1) the inconsistent usage of the quality estimation and classification between training and inference, and (2) the inflexible Dirac delta distribution for localization. To address these problems, we design new representations for these elements. Specifically, we merge the quality estimation into the class prediction vector to form a joint representation, and use a vector to represent the arbitrary distribution of box locations. The improved representations eliminate the inconsistency risk and accurately depict the flexible distribution in real data, but contain continuous labels, which are beyond the scope of Focal Loss. We then propose Generalized Focal Loss (GFL), which generalizes Focal Loss from its discrete form to a continuous version for successful optimization. On COCO test-dev, GFL achieves 45.0% AP using a ResNet-101 backbone, surpassing state-of-the-art SAPD (43.5%) and ATSS (43.6%) with higher or comparable inference speed.
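
As a rough illustration (a sketch, not the official implementation), the continuous generalization for the joint classification-quality branch is often written as QFL(σ) = −|y − σ|^β ((1 − y) log(1 − σ) + y log σ) for a soft quality target y in [0, 1]; in code:

```python
import math

def quality_focal_loss(pred: float, target: float, beta: float = 2.0) -> float:
    """Focal loss with a continuous quality target in [0, 1]: the binary
    cross-entropy term is modulated by |target - pred| ** beta, so
    predictions that already match the soft label contribute (near-)zero
    loss, while large prediction-target gaps are emphasized."""
    ce = -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))
    return abs(target - pred) ** beta * ce

# The modulating factor vanishes when the prediction equals the soft target,
# and the loss grows with the prediction-target gap.
print(quality_focal_loss(0.6, 0.6))  # exactly 0.0
print(quality_focal_loss(0.1, 0.9))  # large gap -> much larger loss
```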

NeurIPS Conference 2020 Conference Paper

Generating Adjacency-Constrained Subgoals in Hierarchical Reinforcement Learning

  • Tianren Zhang
  • Shangqi Guo
  • Tian Tan
  • Xiaolin Hu
  • Feng Chen

Goal-conditioned hierarchical reinforcement learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques. However, it often suffers from training inefficiency because the high-level action space, i.e., the goal space, is often large. Searching in a large goal space poses difficulties for both high-level subgoal generation and low-level policy learning. In this paper, we show that this problem can be effectively alleviated by restricting the high-level action space from the whole goal space to a k-step adjacent region of the current state using an adjacency constraint. We theoretically prove that the proposed adjacency constraint preserves the optimal hierarchical policy in deterministic MDPs, and show that this constraint can be practically implemented by training an adjacency network that discriminates between adjacent and non-adjacent subgoals. Experimental results on discrete and continuous control tasks show that incorporating the adjacency constraint improves the performance of state-of-the-art HRL approaches in both deterministic and stochastic environments.

AAAI Conference 2020 Conference Paper

Pruning from Scratch

  • Yulong Wang
  • Xiaolu Zhang
  • Lingxi Xie
  • Jun Zhou
  • Hang Su
  • Bo Zhang
  • Xiaolin Hu

Network pruning is an important research field aimed at reducing the computational costs of neural networks. Conventional approaches follow a fixed paradigm: first train a large and redundant network, then determine which units (e.g., channels) are less important and can therefore be removed. In this work, we find that pre-training an over-parameterized model is not necessary for obtaining the target pruned structure. In fact, a fully-trained over-parameterized model reduces the search space for the pruned structure. We empirically show that more diverse pruned structures can be pruned directly from randomly initialized weights, including potential models with better performance. Therefore, we propose a novel network pruning pipeline that allows pruning from scratch with little training overhead. In experiments compressing classification models on the CIFAR10 and ImageNet datasets, our approach not only greatly reduces the pre-training burden of traditional pruning methods but also achieves similar or even higher accuracy under the same computation budgets. Our results encourage the community to rethink the effectiveness of existing techniques used for network pruning.

AAAI Conference 2019 Conference Paper

Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation

  • Wei Feng
  • Wentao Liu
  • Tong Li
  • Jing Peng
  • Chen Qian
  • Xiaolin Hu

Human-object interaction (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing the interacted objects. Meanwhile, the human action and the locations of the interacted objects provide guidance for pose estimation. In this paper, we propose a turbo learning framework to perform HOI recognition and pose estimation simultaneously. First, two modules are designed to enforce message passing between the tasks, i.e., a pose-aware HOI recognition module and an HOI-guided pose estimation module. Then, these two modules form a closed loop to utilize the complementary information iteratively, and the framework can be trained in an end-to-end manner. The proposed method achieves state-of-the-art performance on two public benchmarks, the Verbs in COCO (V-COCO) and HICO-DET datasets.

AAAI Conference 2018 Conference Paper

A Cascaded Inception of Inception Network With Attention Modulated Feature Fusion for Human Pose Estimation

  • Wentao Liu
  • Jie Chen
  • Cheng Li
  • Chen Qian
  • Xiao Chu
  • Xiaolin Hu

Accurate keypoint localization of the human pose needs diversified features: high-level features for contextual dependencies and low-level features for detailed refinement of joints. The importance of the two factors varies from case to case, and how to use the features efficiently remains an open problem. Existing methods have limitations in preserving low-level features, adaptively adjusting the importance of different levels of features, and modeling the human perception process. This paper presents three novel techniques, step by step, to efficiently utilize different levels of features for human pose estimation. First, an inception of inception (IOI) block is designed to emphasize low-level features. Second, an attention mechanism is proposed to adjust the importance of individual levels according to the context. Third, a cascaded network is proposed to sequentially localize the joints, enforcing message passing from joints of stand-alone parts such as the head and torso to remote joints such as the wrist or ankle. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on both the MPII and LSP benchmarks.

NeurIPS Conference 2017 Conference Paper

Gated Recurrent Convolution Neural Network for OCR

  • Jianfeng Wang
  • Xiaolin Hu

Optical Character Recognition (OCR) aims to recognize text in natural images. Inspired by a recently proposed model for general image classification, Recurrent Convolution Neural Network (RCNN), we propose a new architecture named Gated RCNN (GRCNN) for solving this problem. Its critical component, Gated Recurrent Convolution Layer (GRCL), is constructed by adding a gate to the Recurrent Convolution Layer (RCL), the critical component of RCNN. The gate controls the context modulation in RCL and balances the feed-forward information and the recurrent information. In addition, an efficient Bidirectional Long Short-Term Memory (BLSTM) is built for sequence modeling. The GRCNN is combined with BLSTM to recognize text in natural images. The entire GRCNN-BLSTM model can be trained end-to-end. Experiments show that the proposed model outperforms existing methods on several benchmark datasets including the IIIT-5K, Street View Text (SVT) and ICDAR.

NeurIPS Conference 2015 Conference Paper

Convolutional Neural Networks with Intra-Layer Recurrent Connections for Scene Labeling

  • Ming Liang
  • Xiaolin Hu
  • Bo Zhang

Scene labeling is a challenging computer vision task. It requires the use of both local discriminative features and global context information. We adopt a deep recurrent convolutional neural network (RCNN) for this task, a model originally proposed for object recognition. Different from traditional convolutional neural networks (CNNs), this model has intra-layer recurrent connections in the convolutional layers, so each convolutional layer becomes a two-dimensional recurrent neural network. The units receive constant feed-forward inputs from the previous layer and recurrent inputs from their neighborhoods. As recurrent iterations proceed, the region of context captured by each unit expands. In this way, feature extraction and context modulation are seamlessly integrated, in contrast to typical methods that entail separate modules for the two steps. To further utilize the context, a multi-scale RCNN is proposed. On two benchmark datasets, Stanford Background and SIFT Flow, the model outperforms many state-of-the-art models in accuracy and efficiency.

ICRA Conference 2011 Conference Paper

Static balancing and dynamic modeling of a three-degree-of-freedom parallel kinematic manipulator

  • Dan Zhang 0006
  • Feng Gao 0011
  • Xiaolin Hu
  • Zhen Gao 0004

This research concerns the design and analysis of a parallel kinematic manipulator (PKM) with three degrees of freedom (DOF). The proposed PKM, combining spatial rotational and translational degrees of freedom, has various advantages and good potential for materials-handling applications. First, the static balancing of the parallel manipulator is investigated. The definition and methodology of static balancing are introduced. Two methods, adjusting the kinematic parameters and adding counterweights, are applied to the structure; the counterweight method achieves static balancing of the PKM, and the conditions of static balancing are given. Then the dynamic model of the proposed PKM is derived. It describes the relationship between the driving forces and the motion of the end-effector platform. Two approaches, the Newton-Euler and Lagrange methods, are compared, and the latter is selected to build the dynamic model of the 3-DOF tripod mechanism.