Author name cluster

Hong Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

79 papers
2 author rows

Possible papers (79)

AAAI Conference 2026 Conference Paper

Debiased Multiplex Tokenizer for Efficient Map-Free Visual Relocalization

  • Wenshuai Wang
  • Hong Liu
  • Shengquan Li
  • Peifeng Jiang
  • Runwei Ding

Image-based feature representation plays a critical role in visual localization, enabling robots to estimate their position and orientation in GPS-denied environments. However, this task is often undermined by significant variations in camera viewpoints and scene appearances. Recently, map-free visual relocalization (MFVR) has emerged as a promising paradigm due to its compatibility with lightweight deployment and privacy isolation on mobile devices. In this paper, we propose the Debiased Multiplex Tokenizer (DeMT) as a novel method for versatile and efficient MFVR. Specifically, DeMT performs relative pose regression through an integrated framework built upon a pretrained vision Mamba encoder, comprising three key modules: First, Multiplex Interactive Tokenization yields robust image tokens with non-local affinities and cross-domain descriptions; Second, Debiased Anchor Registration facilitates anchor token matching through proximity graph retrieval and causal pointer attribution; Third, Geometry-Informed Pose Regression empowers multi-layer perceptrons with a gating mechanism and spectral normalization to support both pair-wise and multi-view modes. Extensive evaluations across nine public datasets demonstrate that DeMT substantially outperforms existing baselines and ablation variants in diverse indoor and outdoor environments.

TMLR Journal 2026 Journal Article

Improving Foundation Model Group Robustness with Auxiliary Sentence Embeddings

  • Sisuo Lyu
  • Hong Liu
  • Jie Li
  • Yan Teng
  • Yingchun Wang

This paper addresses the critical challenge of mitigating group-based biases in vision-language foundation models, a pressing issue for ensuring trustworthy AI deployment. We introduce DoubleCCA, a novel and computationally efficient framework that systematically enriches textual representations to enhance group robustness. Our key innovation is to leverage an auxiliary large sentence embedding model to capture diverse semantic perspectives, counteracting biased representations induced by limited training data. To this end, we propose a two-stage Canonical Correlation Analysis (DoubleCCA) technique: first, aligning augmented and original embeddings in a shared space; second, reconstructing invariant features to align with visual representations, thus enhancing the model's group robustness. We further propose a simple sentence augmentation approach, which aims to improve the robustness of CCA-induced subspaces. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of vision-language foundation models to group-based biases. The experiments on a variety of datasets demonstrate that our method outperforms existing methods in terms of both performance and robustness. Our code is available at https://github.com/sisuolv/doublecca.

AAAI Conference 2026 Conference Paper

Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models

  • Hualei Wang
  • Yiming Li
  • Shuo Ma
  • Hong Liu
  • Xiangdong Wang

Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to short audio perception, leading to constrained capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data. To address this, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features with absolute time information. Moreover, to realize end-to-end long audio understanding, we introduce a segment-level token merging module to substantially reduce audio token redundancy and enhance the efficiency of information extraction. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing audio datasets into a new dataset focused on temporal tasks and establish a series of metrics to evaluate the fine-grained performance. Evaluations show strong performance across a variety of fine-grained tasks, such as dense captioning, temporal grounding, and timeline speech summarization, which demonstrates TimeAudio's robust temporal localization and reasoning capabilities.

AAAI Conference 2026 Conference Paper

Masked Clustering Prediction for Unsupervised Point Cloud Pre-training

  • Bin Ren
  • Xiaoshui Huang
  • Mengyuan Liu
  • Hong Liu
  • Fabio Poiesi
  • Nicu Sebe
  • Guofeng Mei

Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of MaskClu via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, setting new competitive results.

AAAI Conference 2026 Conference Paper

QuMAB: Query-based Multi-annotator Behavior Pattern Learning

  • Liyun Zhang
  • Zheng Lian
  • Hong Liu
  • Takanori Takebe
  • Shozo Nishii
  • Yuta Nakashima

Multi-annotator learning traditionally aggregates diverse annotations to approximate a single “ground truth”, treating disagreements as noise. However, this paradigm faces fundamental challenges: subjective tasks often lack absolute ground truth, and sparse annotation coverage makes aggregation statistically unreliable. We introduce a paradigm shift from sample-wise aggregation to annotator-wise behavior modeling. By treating annotator disagreements as valuable information rather than noise, modeling annotator-specific behavior patterns can reconstruct unlabeled data to reduce annotation cost, enhance aggregation reliability, and explain annotator decision behavior. To this end, we propose QuMAB (Query-based Multi-Annotator Behavior Pattern Learning), which uses lightweight queries to model individual annotators while capturing inter-annotator correlations as implicit regularization; this prevents overfitting to sparse individual data while maintaining individualization and improving generalization. A visualization of annotator focus regions additionally offers an explainable analysis of annotator behavior. We contribute two large-scale datasets with dense per-annotator labels: STREET (4,300 labels/annotator) and AMER (average 3,118 labels/annotator), the first multimodal multi-annotator dataset. Extensive experiments demonstrate the superiority of our QuMAB in modeling individual annotators’ behavior patterns, their utility for consensus prediction, and applicability under sparse annotations.
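
To make the query mechanism concrete, here is a minimal PyTorch sketch of the idea as the abstract describes it: one learnable query per annotator cross-attends the sample's features, and a shared attention layer couples annotators. All names, dimensions, and layer choices (`AnnotatorQueries`, `d_model`, the 4-head attention) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AnnotatorQueries(nn.Module):
    """Hypothetical QuMAB-style head: one learnable query per annotator
    cross-attends sample features; the shared attention layer couples
    annotators and acts as implicit regularization."""
    def __init__(self, n_annotators, d_model, n_classes):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_annotators, d_model))
        # d_model must be divisible by the (assumed) number of heads
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats):  # feats: (batch, n_tokens, d_model)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, attn_w = self.attn(q, feats, feats)  # attn_w ~ focus regions
        return self.head(out)  # (batch, n_annotators, n_classes)
```

The attention weights `attn_w` are what a focus-region visualization would read off.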

AAAI Conference 2026 Conference Paper

SimLabel: Similarity-Weighted Semi-supervision for Multi-annotator Learning with Missing Labels

  • Liyun Zhang
  • Zheng Lian
  • Hong Liu
  • Takanori Takebe
  • Yuta Nakashima

Multi-annotator learning (MAL) aims to model annotator-specific labeling patterns. However, existing methods face a critical challenge: they simply skip updating annotator-specific model parameters when encountering missing labels, a common scenario in real-world crowdsourced datasets where each annotator labels only small subsets of samples. This leads to inefficient data utilization and overfitting risks. To this end, we propose a novel similarity-weighted semi-supervised learning framework (SimLabel) that leverages inter-annotator similarities to generate weighted soft labels for missing annotations, enabling the utilization of unannotated samples rather than skipping them entirely. We further introduce a confidence-based iterative refinement mechanism that combines maximum probability with entropy-based uncertainty to prioritize high-quality predicted pseudo-labels when imputing missing labels, jointly enhancing similarity estimation and model performance over time. For evaluation, we contribute a new multimodal multi-annotator dataset, AMER2, with higher and more variable missing rates, reflecting real-world annotation sparsity and enabling evaluation across different sparsity levels. Extensive experiments validate the effectiveness of our method.
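
A minimal sketch of the similarity-weighted imputation step, under assumed data shapes (`labels[a][s]` is annotator a's one-hot label for sample s, or None if missing; `sim` is an annotator-similarity matrix). The confidence-based iterative refinement described in the abstract is omitted.

```python
import numpy as np

def impute_soft_label(labels, sim, annotator, sample):
    """SimLabel-style soft-label imputation (sketch): a missing label is a
    similarity-weighted average of the labels other annotators gave the
    same sample. Returns None if no other annotator labeled the sample."""
    num, den = 0.0, 0.0
    for other, row in enumerate(labels):
        if other == annotator or row[sample] is None:
            continue
        w = sim[annotator, other]          # inter-annotator similarity
        num = num + w * np.asarray(row[sample])
        den += w
    return num / den if den > 0 else None
```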

AAAI Conference 2025 Conference Paper

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

  • Chao Zeng
  • Songwei Liu
  • Yusheng Xie
  • Hong Liu
  • Xiaojian Wang
  • Miao Wei
  • Shu Yang
  • Fangmin Chen

Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, restricted by the limited integer computing unit types on GPUs, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths; (2) a bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit); and (3) an innovative quantization acceleration framework that reconstructs the quantized matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents, eliminating the limitations of INT4/INT8 computing units. ABQ-LLM can convert each component's bit-width gain into actual acceleration gain, maximizing performance under mixed precision (e.g., W6A6, W2A8). With a W2*A8 quantization configuration on the LLaMA-7B model, it achieves a WikiText2 perplexity of 7.59 (a 2.17 decrease vs. 9.76 for AffineQuant). Compared to SmoothQuant, we realize a 1.6x acceleration improvement and a 2.7x memory compression gain.

IROS Conference 2025 Conference Paper

Autonomous Obstacle Avoidance for a Snake Robot with Surface Pressure Sensing

  • Yongjun Sun
  • Zhao Xue
  • Liming Bao
  • Hong Liu

A sixteen-joint snake robot with full-body surface pressure sensing capabilities has been developed. A total of 64 thin film pressure sensors are evenly distributed on the surface of the robot. Four intelligent obstacle avoidance movements integrating surface pressure perception were investigated. They are as follows: the roll-over obstacle avoidance motion capable of autonomously switching between the regular rolling gait and the hump rolling gait, the autonomous crawling obstacle avoidance motion under unknown obstacle parameters, the intelligent winding and climbing motion on horizontal pipes with either unknown diameters or those with variable diameters, and the gap-crossing motion that can autonomously detect the gap position and cross over horizontal pipes with gaps. Finally, experiments were conducted in different scenarios to verify the feasibility of these four intelligent motions.

NeurIPS Conference 2025 Conference Paper

D-VST: Diffusion Transformer for Pathology-Correct Tone-Controllable Cross-Dye Virtual Staining of Whole Slide Images

  • Shurong Yang
  • Dong Wei
  • Yihuang Hu
  • Qiong Peng
  • Hong Liu
  • Yawen Huang
  • Xian Wu
  • Yefeng Zheng

Diffusion-based virtual staining methods of histopathology images have demonstrated outstanding potential for stain normalization and cross-dye staining (e.g., hematoxylin-eosin to immunohistochemistry). However, achieving pathology-correct cross-dye virtual staining with versatile tone controls poses significant challenges due to the difficulty of decoupling the given pathology and tone conditions. This issue would cause non-pathologic regions to be mistakenly stained like pathologic ones, and vice versa, which we term “pathology leakage.” To address this issue, we propose diffusion virtual staining Transformer (D-VST), a new framework with versatile tone control for cross-dye virtual staining. Specifically, we introduce a pathology encoder in conjunction with a tone encoder, combined with a two-stage curriculum learning scheme that decouples pathology and tone conditions, to enable tone control while eliminating pathology leakage. Further, to extend our method for billion-pixel whole slide image (WSI) staining, we introduce a novel frequency-aware adaptive patch sampling strategy for high-quality yet efficient inference of ultra-high resolution images in a zero-shot manner. Integrating these two innovative components facilitates a pathology-correct, tone-controllable, cross-dye WSI virtual staining process. Extensive experiments on three virtual staining tasks that involve translating between four different dyes demonstrate the superiority of our approach in generating high-quality and pathologically accurate images compared to existing methods based on generative adversarial networks and diffusion models. Our code and trained models will be released.

JBHI Journal 2025 Journal Article

Enhancing Ultrasound Scanning Skills in a Leader–Follower Robotic System through Expert Hand Impedance Regulation

  • Baoshan Niu
  • Dapeng Yang
  • Le Zhang
  • Yiming Ji
  • Li Jiang
  • Hong Liu

Traditional breast cancer surgeries require collaboration between ultrasound (US) doctors and surgeons, making the procedure complex and treating physicians prone to fatigue. In leader–follower robotic surgery, a surgeon controls an US robotic arm and an instrument robotic arm with their left and right hands, enabling independent surgical performance. However, the lack of US scanning skills among surgeons, as well as the physical separation in leader–follower operations, can negatively impact both the scanning and surgical outcomes. This paper proposes a robot-assisted scheme based on dynamic arm impedance compensation (IC) that references expert arm stiffness to compensate for novice arm stiffness. The impedance compensator adjusts the compensation strategy according to the scanning area and scanning stage. The impedance force generator estimates the scanning direction via Kalman filtering and applies stiffness and damping forces in the vertical direction to suppress tremors and other involuntary movements. The experimental results revealed that during the coarse and fine scanning phases, the probe position variance decreased by 57.9% and 73.6%, the contact force variance decreased by 55.2% and 42.5%, and the US image confidence increased by 22.0% and 23.8%, respectively. Compared with traditional filtering compensation (FC) schemes, this approach reduces the average position variance and contact force variance by 32.0% and 25.3%, respectively, and increases confidence by 7.3%. In a no-compensation test, the IC training group outperformed the FC group. This scheme can assist leader–follower US scanning and rapidly improve surgical skills.

ICML Conference 2025 Conference Paper

GraphGPT: Generative Pre-trained Graph Eulerian Transformer

  • Qifang Zhao
  • Weidong Ren
  • Tianyu Li 0007
  • Hong Liu
  • Xingsheng He
  • Xiaoxiao Xu

We introduce GraphGPT, a novel self-supervised generative pre-trained model for graph learning based on the Graph Eulerian Transformer (GET). First, we propose GET, which combines a standard transformer encoder or decoder architecture with an innovative graph-to-sequence transformation method. This method converts graphs or sampled subgraphs into sequences of tokens representing nodes, edges, and attributes in a reversible manner using Eulerian paths. We pre-train GET using either of two self-supervised tasks: next-token prediction (NTP) and scheduled masked-token prediction (SMTP). The pre-trained model is then fine-tuned for downstream tasks such as graph-, edge-, and node-level prediction. Despite its simplicity, GraphGPT achieves performance comparable to or surpassing state-of-the-art methods on multiple large-scale Open Graph Benchmark (OGB) datasets. It demonstrates exceptional results on the molecular property prediction dataset PCQM4Mv2 and the protein-protein interaction dataset ogbl-ppa. Notably, generative pre-training enables scaling GraphGPT to 2 billion parameters while maintaining performance gains, a breakthrough that overcomes the scalability limitations of traditional Graph Neural Networks (GNNs) and prior graph transformers (GTs). To advance research in graph foundation models and facilitate scientific discovery in chemistry, materials science, and related fields, we have released the source code (https://github.com/alibaba/graph-gpt) and model checkpoints (https://www.modelscope.cn/organization/Alibaba-DT).
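
The reversible graph-to-sequence idea can be illustrated with a short sketch using networkx's standard eulerize/Eulerian-circuit machinery (assuming a connected graph). The `node:`/`edge:` token format is invented for illustration and is not GraphGPT's actual vocabulary.

```python
import networkx as nx

def graph_to_token_sequence(G):
    """Sketch of an Eulerian-path linearization: make the graph Eulerian if
    needed (by duplicating edges), then serialize an Eulerian circuit as
    alternating node/edge tokens, which is reversible up to edge copies."""
    H = G if nx.is_eulerian(G) else nx.eulerize(G)
    tokens = []
    for i, (u, v) in enumerate(nx.eulerian_circuit(H)):
        if i == 0:
            tokens.append(f"node:{u}")
        tokens.append(f"edge:{u}-{v}")
        tokens.append(f"node:{v}")
    return tokens

print(graph_to_token_sequence(nx.path_graph(3)))  # tiny usage example
```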

AAAI Conference 2025 Conference Paper

LiD-FL: Towards List-Decodable Federated Learning

  • Hong Liu
  • Liren Shan
  • Han Bao
  • Ronghui You
  • Yuhao Yi
  • Jiancheng Lv

Federated learning is often used in environments with many unverified participants. Therefore, federated learning under adversarial attacks receives significant attention. This paper proposes an algorithmic framework for list-decodable federated learning, where a central server maintains a list of models, with at least one guaranteed to perform well. The framework has no strict restriction on the fraction of honest clients, extending the applicability of Byzantine federated learning to the scenario with more than half adversaries. Assuming the variance of gradient noise in stochastic gradient descent is bounded, we prove a convergence theorem of our method for strongly convex and smooth losses. Experimental results, including image classification tasks with both convex and non-convex losses, demonstrate that the proposed algorithm can withstand the malicious majority under various attacks.

IROS Conference 2025 Conference Paper

One-shot Global Localization through Semantic Distribution Feature Retrieval and Semantic Topological Histogram Registration

  • Feixuan Huang
  • Hong Liu
  • Wang Gao
  • Shuguo Pan
  • Heng Zhao

One-shot global localization is crucial in many robotic applications, providing significant advantages during initialization and relocalization processes. However, LiDAR-based one-shot global localization methods encounter challenges, including local feature matching errors, sensitivity to dynamic objects, and computational complexity in the absence of an initial pose. To address these issues, we propose a one-shot LiDAR-semantic-graph-based global localization method. To mitigate the interference of dynamic objects on localization, we extract stable semantic objects from LiDAR point clouds using dynamic curved voxel clustering and subsequently construct a semantic graph. Furthermore, we leverage the distribution characteristics of the semantic objects to quickly filter candidate retrievals and construct a cost matrix for the Hungarian algorithm, utilizing a semantic topological histogram to solve vertex matching. This yields a coarse pose estimate, which is subsequently refined using Fast-GICP. We demonstrate the superior localization performance compared to existing state-of-the-art methods on multiple large-scale outdoor datasets, including MulRan, MCD, and Apollo. Our method will be open-sourced and accessible at: https://github.com/Hfx-J/SGGL.
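
The vertex-matching step maps naturally onto off-the-shelf tooling. A minimal sketch, assuming histograms are fixed-length vectors and using an L1 cost (the paper's exact cost construction may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_semantic_vertices(hist_query, hist_map):
    """Build a cost matrix from pairwise L1 distances between semantic
    topological histograms and solve it with the Hungarian algorithm.
    hist_query: (n_q, bins); hist_map: (n_m, bins)."""
    cost = np.abs(hist_query[:, None, :] - hist_map[None, :, :]).sum(-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return list(zip(rows.tolist(), cols.tolist()))
```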

AAAI Conference 2025 Conference Paper

PDDM: Pseudo Depth Diffusion Model for RGB-PD Semantic Segmentation Based in Complex Indoor Scenes

  • Xinhua Xu
  • Hong Liu
  • Jianbing Wu
  • Jinfu Liu

The integration of RGB and depth modalities significantly enhances the accuracy of segmenting complex indoor scenes, with depth data from RGB-D cameras playing a crucial role in this improvement. However, collecting an RGB-D dataset is more expensive than an RGB dataset due to the need for specialized depth sensors. Aligning depth and RGB images also poses challenges due to sensor positioning and issues like missing data and noise. In contrast, Pseudo Depth (PD) from high-precision depth estimation algorithms can eliminate the dependence on RGB-D sensors and alignment processes, as well as provide effective depth information and show significant potential in semantic segmentation. Therefore, to explore the practicality of utilizing pseudo depth instead of real depth for semantic segmentation, we design an RGB-PD segmentation pipeline to integrate RGB and pseudo depth and propose a Pseudo Depth Aggregation Module (PDAM) for fully exploiting the informative clues provided by the diverse pseudo depth maps. The PDAM aggregates multiple pseudo depth maps into a single modality, making it easily adaptable to other RGB-D segmentation methods. In addition, the pre-trained diffusion model serves as a strong feature extractor for RGB segmentation tasks, but multi-modal diffusion-based segmentation methods remain unexplored. Therefore, we present a Pseudo Depth Diffusion Model (PDDM) that adopts a large-scale text-image diffusion model as a feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. To verify the applicability of pseudo depth and our PDDM, we perform extensive experiments on the NYUv2 and SUNRGB-D datasets. The experimental results demonstrate that pseudo depth can effectively enhance segmentation performance, and our PDDM achieves state-of-the-art performance, outperforming other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D.

JBHI Journal 2025 Journal Article

Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines

  • Wenhao Li
  • Hongkuan Zhang
  • Hongwei Zhang
  • Zhengxu Li
  • Zengjie Dong
  • Yafan Chen
  • Niranjan Bidargaddi
  • Hong Liu

Current medical language models, adapted from large language models, typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension and coronary heart disease diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.
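
The retrieval step (2) can be sketched compactly. Embedding model and snippet granularity are assumptions; the abstract does not pin them down.

```python
import numpy as np

def retrieve_cpg_snippets(query_emb, cpg_embs, cpg_texts, k=3):
    """Sketch of GARMLE-G's retrieval step: rank guideline snippets by
    cosine similarity to the query embedding (built from LLM prediction +
    EHR data) and return the top-k verbatim, so the recommendation stays
    grounded in authoritative guideline text."""
    q = query_emb / np.linalg.norm(query_emb)
    C = cpg_embs / np.linalg.norm(cpg_embs, axis=1, keepdims=True)
    top = np.argsort(C @ q)[::-1][:k]  # highest-similarity snippets first
    return [cpg_texts[i] for i in top]
```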

AAAI Conference 2025 Conference Paper

SVTformer: Spatial-View-Temporal Transformer for Multi-View 3D Human Pose Estimation

  • Wanruo Zhang
  • Mengyuan Liu
  • Hong Liu
  • Wenhao Li

Recently, transformer-based methods have been introduced to estimate 3D human pose from multiple views by aggregating the spatial-temporal information of human joints to achieve the lifting from 2D to 3D. However, previous approaches can neither model the inter-frame correspondence of each view's joints individually nor directly consider all view interactions at each time step, leading to insufficient learning of multi-view associations. To address this issue, we propose a Spatial-View-Temporal transformer (SVTformer) that decouples spatial-view-temporal information in sequential order for correlation learning and models the dependencies between them in a local-to-global manner. SVTformer includes an attended Spatial-View-Temporal (SVT) patch embedding to attentively capture the local features of the input poses and stacked SVT encoders to progressively extract global spatial-view-temporal dependencies. Specifically, the SVT encoders sequentially apply three reconstructions to the attended features: view decoupling for temporally enhanced spatial correlation, temporal decoupling for spatially enhanced view correlation, and a second view decoupling for spatially enhanced temporal relationships. This decoupling-coupling-decoupling multi-view scheme enables us to alternately model inter-joint spatial relationships, cross-view dependencies, and temporal motion associations. We evaluate the proposed SVTformer on three popular 3D HPE datasets, where it yields state-of-the-art performance. It effectively deals with this ill-posed problem and enhances the accuracy of 3D human pose estimation.

IROS Conference 2025 Conference Paper

TCNet: A Temporally Consistent Network for Self-supervised Monocular Depth Estimation

  • Ying Zhu
  • Hong Liu
  • Jianbing Wu
  • Mengyuan Liu

Despite significant advances in self-supervised monocular depth estimation methods, achieving temporally consistent and accurate depth maps from frame sequences remains a formidable challenge. Existing approaches often estimate depth maps for individual frames in isolation, neglecting the rich geometric and temporal coherence present across frames. Consequently, this oversight leads to temporally inconsistent outputs, resulting in noticeable temporal flickering artifacts. In response, this paper presents TCNet, a Temporally Consistent Network for self-supervised monocular depth estimation. Specifically, we propose an Inter-frame Temporal Fusion (ITF) module to emphasize the influence of preceding images on the depth estimation of the current frame. A Temporal Consistency Loss (TCL) is proposed to leverage the temporal constraints between the depth maps of adjacent frames. Moreover, TCNet can be applied to both single-frame and multi-frame scenarios during inference. Experimental evaluations on the KITTI dataset demonstrate that our method surpasses state-of-the-art depth estimation methods in accuracy and temporal consistency. Our code will be made public.
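
The abstract does not give the exact form of the Temporal Consistency Loss, but a representative sketch, assuming the previous depth map has already been warped into the current frame, looks like this:

```python
import torch

def temporal_consistency_loss(depth_t, depth_prev_warped, valid_mask=None):
    """One plausible temporal-consistency term (not TCNet's exact TCL):
    penalize the discrepancy between the current depth map and the
    previous frame's depth warped into the current view."""
    diff = (depth_t - depth_prev_warped).abs()
    if valid_mask is not None:  # ignore occluded / out-of-view pixels
        return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
    return diff.mean()
```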

AAAI Conference 2025 Conference Paper

TCPFormer: Learning Temporal Correlation with Implicit Pose Proxy for 3D Human Pose Estimation

  • Jiajie Liu
  • Mengyuan Liu
  • Hong Liu
  • Wenhao Li

Recent multi-frame lifting methods have dominated 3D human pose estimation. However, previous methods ignore the intricate dependencies within the 2D pose sequence and learn only a single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy builds one temporal correlation, thereby helping the model learn a more comprehensive temporal correlation of human motion. Specifically, our method consists of three key components: the Proxy Update Module (PUM), the Proxy Invocation Module (PIM), and the Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invokes and integrates the pose proxy with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages the mapping between the pose sequence and the pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our proposed TCPFormer outperforms previous state-of-the-art methods.

AAAI Conference 2024 Conference Paper

Audio Generation with Multiple Conditional Diffusion Model

  • Zhifang Guo
  • Jianguo Mao
  • Rui Tao
  • Long Yan
  • Kazushige Ouchi
  • Hong Liu
  • Xiangdong Wang

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation.

ICLR Conference 2024 Conference Paper

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

  • Zhiyuan Liu 0001
  • Hong Liu
  • Denny Zhou
  • Tengyu Ma 0001

Generating a sequence of intermediate steps, a.k.a. a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision and $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
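
The two headline bounds can be summarized compactly; the shorthand below is mine, not the paper's formal notation:

```latex
% Without CoT: constant-depth, constant-bit-precision transformers
\[ \mathrm{TF}^{\text{const-prec}}_{\text{const-depth}} \subseteq \mathsf{AC}^0 \subsetneq \mathsf{TC}^0 \]
% With T CoT steps and O(log n) embedding size: simulate any size-T circuit
\[ \mathrm{TF}^{\text{const-prec}}_{\text{const-depth}}\big[T \text{ CoT steps},\ O(\log n) \text{ embedding}\big] \supseteq \mathrm{SIZE}(T) \]
```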

NeurIPS Conference 2024 Conference Paper

DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion

  • Ke Sun
  • Shen Chen
  • Taiping Yao
  • Hong Liu
  • Xiaoshuai Sun
  • Shouhong Ding
  • Rongrong Ji

The rapid progress of Deepfake technology has made face swapping highly realistic, raising concerns about the malicious use of fabricated facial content. Existing methods often struggle to generalize to unseen domains due to the diverse nature of facial manipulations. In this paper, we revisit the generation process and identify a universal principle: Deepfake images inherently contain information from both source and target identities, while genuine faces maintain a consistent identity. Building upon this insight, we introduce DiffusionFake, a novel plug-and-play framework that reverses the generative process of face forgeries to enhance the generalization of detection models. DiffusionFake achieves this by injecting the features extracted by the detection model into a frozen pre-trained Stable Diffusion model, compelling it to reconstruct the corresponding target and source images. This guided reconstruction process constrains the detection network to capture the source and target related features to facilitate the reconstruction, thereby learning rich and disentangled representations that are more resilient to unseen forgeries. Extensive experiments demonstrate that DiffusionFake significantly improves cross-domain generalization of various detector architectures without introducing additional parameters during inference. The code is available at https://github.com/skJack/DiffusionFake.git.

AAAI Conference 2024 Conference Paper

Federated Modality-Specific Encoders and Multimodal Anchors for Personalized Brain Tumor Segmentation

  • Qian Dai
  • Dong Wei
  • Hong Liu
  • Jinghan Sun
  • Liansheng Wang
  • Yefeng Zheng

Most existing federated learning (FL) methods for medical image analysis only considered intra-modal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, it is not uncommon that some FL participants possess only a subset of the complete imaging modalities, posing inter-modal heterogeneity as a challenge to effectively training a global model on all participants' data. In addition, each participant would expect to obtain a personalized model tailored to its local data characteristics from the FL in such a scenario. In this work, we propose a new FL framework with federated modality-specific encoders and multimodal anchors (FedMEMA) to simultaneously address these two concurrent issues. First, FedMEMA employs an exclusive encoder for each modality to account for the inter-modal heterogeneity. Meanwhile, while the encoders are shared by the participants, the decoders are personalized to meet individual needs. Specifically, a server with full-modal data employs a fusion decoder to aggregate and fuse representations from all modality-specific encoders, thus bridging the modalities and optimizing the encoders via backpropagation. In addition, multiple anchors are extracted from the fused multimodal representations and distributed to the clients along with the encoder parameters. On the other end, clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, compensating for the information loss due to absent modalities while adapting the representations of the present ones. FedMEMA is validated on the BraTS 2020 benchmark for multimodal brain tumor segmentation. Results show that it outperforms various up-to-date methods for multimodal and personalized FL and that its novel designs are effective. Our code is available.
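
The anchor-calibration step maps onto a few lines of attention arithmetic. A minimal sketch, with projection layers and any residual connection omitted and shapes assumed (reps: (n, d), anchors: (m, d)):

```python
import torch
import torch.nn.functional as F

def calibrate_with_anchors(reps, anchors):
    """Scaled dot-product cross-attention from a client's present-modality
    representations to the global full-modal anchors: the output is a
    convex combination of anchors, pulling missing-modal representations
    toward the full-modal ones."""
    d = reps.size(-1)
    attn = F.softmax(reps @ anchors.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ anchors
```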

AAAI Conference 2024 Conference Paper

Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation

  • Yasi Wang
  • Hong Liu
  • Chao Zhang
  • Lu Xu
  • Qiang Wang

Homography estimation is a fundamental problem in computer vision. Previous works mainly focus on estimating either a single homography or multiple homographies based on a mesh-grid division of the image. In practical scenarios, a single homography is inadequate and often leads to a compromised result across multiple planes, while mesh-grid multi-homography breaks the plane distribution of the scene and does not fully address the limitations of homography. In this work, we propose a novel semantics-guided multi-homography estimation framework, Mask-Homo, to provide an explicit solution to the multi-plane depth disparity problem. First, a pseudo plane mask generation module is designed to obtain multiple correlated regions that follow the plane distribution of the scene. Then, multiple local homography transformations, each of which aligns a correlated region precisely, are predicted, and the corresponding warped images are fused to obtain the final result. Furthermore, a new metric, Mask-PSNR, is proposed for more comprehensive evaluation of alignment. Extensive experiments are conducted to verify the effectiveness of the proposed method. Our code is available at https://github.com/SAITPublic/MaskHomo.

IJCAI Conference 2024 Conference Paper

Mitigating robust overfitting via self-residual-calibration regularization (Abstract Reprint)

  • Hong Liu
  • Zhun Zhong
  • Nicu Sebe
  • Shin'ichi Satoh

Overfitting in adversarial training has attracted the interest of researchers in the artificial intelligence and machine learning community in recent years. To address this issue, in this paper we begin by evaluating the defense performance of several calibration methods on various robust models. Our analysis and experiments reveal two intriguing properties: 1) a well-calibrated robust model has decreased confidence; 2) there is a trade-off between the confidences on natural and adversarial images. These new properties offer a straightforward insight into designing a simple but effective regularization, called Self-Residual-Calibration (SRC). The proposed SRC calculates the absolute residual between adversarial and natural logit features corresponding to the ground-truth labels. Furthermore, we utilize the pinball loss to minimize the quantile residual between them, resulting in more robust regularization. Extensive experiments indicate that our SRC can effectively mitigate the overfitting problem while improving the robustness of state-of-the-art models. Importantly, SRC is complementary to various regularization methods. When combined with them, we are capable of achieving top-rank performance on the AutoAttack benchmark leaderboard.
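
A minimal sketch of an SRC-style term under my reading of the abstract: take the residual between adversarial and natural ground-truth logits and penalize it with the pinball (quantile) loss. The quantile value `tau` is an assumed hyperparameter, not the paper's setting.

```python
import torch

def src_regularizer(logits_nat, logits_adv, labels, tau=0.5):
    """Pinball loss on the residual between the adversarial and natural
    logits at the ground-truth class (target residual is zero)."""
    idx = torch.arange(labels.size(0))
    r = logits_adv[idx, labels] - logits_nat[idx, labels]
    # pinball loss at quantile tau: max(tau * r, (tau - 1) * r)
    return torch.mean(torch.maximum(tau * r, (tau - 1.0) * r))
```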

AAAI Conference 2024 Conference Paper

Near-Optimal Resilient Aggregation Rules for Distributed Learning Using 1-Center and 1-Mean Clustering with Outliers

  • Yuhao Yi
  • Ronghui You
  • Hong Liu
  • Changxin Liu
  • Yuan Wang
  • Jiancheng Lv

Byzantine machine learning has garnered considerable attention in light of the unpredictable faults that can occur in large-scale distributed learning systems. The key to securing resilience against Byzantine machines in distributed learning is the resilient aggregation mechanism. Although abundant resilient aggregation rules have been proposed, they are designed in an ad hoc manner, imposing extra barriers on comparing, analyzing, and improving the rules across performance criteria. This paper studies near-optimal aggregation rules using clustering in the presence of outliers. Our outlier-robust clustering approach utilizes geometric properties of the update vectors provided by workers. Our analysis shows that constant approximations to the 1-center and 1-mean clustering problems with outliers provide near-optimal resilient aggregators for metric-based criteria, which have been proven to be crucial in the homogeneous and heterogeneous cases respectively. In addition, we discuss two contradicting types of attacks under which no single aggregation rule is guaranteed to improve upon the naive average. Based on this discussion, we propose a two-phase resilient aggregation framework. We run experiments for image classification using a non-convex loss function. The proposed algorithms outperform previously known aggregation rules by a large margin with both homogeneous and heterogeneous data distributions among non-faulty workers. Code and appendix are available at https://github.com/jerry907/AAAI24-RASHB.
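
For flavor, here is a greedy approximate 1-center-with-outliers aggregator (a generic sketch of the idea, not the paper's exact rule); `f` is the assumed number of faulty workers, with f < n:

```python
import numpy as np

def one_center_aggregate(updates, f):
    """Pick the update whose ball covering n - f points has the smallest
    radius (an approximate 1-center with f outliers), then average the
    points inside that ball."""
    n = len(updates)
    X = np.stack(updates)  # (n, d)
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise
    # radius each candidate center needs to cover n - f points (incl. self)
    radii = np.sort(dists, axis=1)[:, n - f - 1]
    c = int(np.argmin(radii))                 # best approximate center
    inliers = np.argsort(dists[c])[: n - f]   # its n - f nearest updates
    return X[inliers].mean(axis=0)
```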

ICML Conference 2024 Conference Paper

Position: Towards Implicit Prompt For Text-To-Image Models

  • Yue Yang
  • Yuqi Lin
  • Hong Liu
  • Wenqi Shao
  • Runjian Chen
  • Hailong Shang
  • Yu Wang 0002
  • Yu Qiao 0001

Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (those that hint at a target without explicitly mentioning it). Such prompts may evade safety constraints and pose potential threats to the applications of these models. This position paper highlights the current state of T2I models with respect to implicit prompts. We present a benchmark named ImplicitBench and conduct an investigation into the performance and impacts of implicit prompts with popular T2I models. Specifically, we design and collect more than 2,000 implicit prompts covering three aspects: General Symbols, Celebrity Privacy, and Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models' capabilities under these implicit prompts. Experimental results show that (1) T2I models are able to accurately create various target symbols indicated by implicit prompts; (2) implicit prompts bring potential risks of privacy leakage for T2I models; and (3) constraints on NSFW content in most of the evaluated T2I models can be bypassed with implicit prompts. We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

ICLR Conference 2024 Conference Paper

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

  • Hong Liu
  • Zhiyuan Li 0005
  • David Leo Wright Hall
  • Percy Liang
  • Tengyu Ma 0001

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction in the time and cost of training. Adam and its variants have been state-of-the-art for years, while more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, a simple scalable second-order optimizer that uses a lightweight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of the Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, reaching the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time.
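
The update rule is simple enough to sketch end-to-end. The following NumPy toy is reconstructed from the abstract's description (EMA of gradients over EMA of an estimated diagonal Hessian, element-wise clipped, with the Hessian refreshed every k steps); hyperparameter values are placeholders, not the released implementation's defaults.

```python
import numpy as np

def sophia_like_step(theta, grad_fn, hess_diag_fn, state, lr=0.05,
                     beta1=0.9, beta2=0.99, rho=1.0, eps=1e-12, k=10):
    """One Sophia-style step: m is an EMA of gradients, h an EMA of the
    estimated diagonal Hessian (refreshed only every k steps); the update
    m / max(h, eps) is clipped element-wise to [-rho, rho]."""
    g = grad_fn(theta)
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    if state["t"] % k == 0:  # cheap: Hessian estimate only every k steps
        state["h"] = beta2 * state["h"] + (1 - beta2) * hess_diag_fn(theta)
    state["t"] += 1
    update = np.clip(state["m"] / np.maximum(state["h"], eps), -rho, rho)
    return theta - lr * update

# toy usage on the quadratic f(x) = 0.5 * x^T diag(a) x
a = np.array([1.0, 100.0])
state = {"m": np.zeros(2), "h": np.zeros(2), "t": 0}
x = np.array([1.0, 1.0])
for _ in range(100):
    x = sophia_like_step(x, lambda t: a * t, lambda t: a, state)
```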

JBHI Journal 2023 Journal Article

A 3D Cross-Modality Feature Interaction Network With Volumetric Feature Alignment for Brain Tumor and Tissue Segmentation

  • Yuzhou Zhuang
  • Hong Liu
  • Enmin Song
  • Chih-Cheng Hung

Accurate volumetric segmentation of brain tumors and tissues is beneficial for quantitative brain analysis and brain disease identification in multi-modal Magnetic Resonance (MR) images. Nevertheless, due to the complex relationship between modalities, 3D Fully Convolutional Networks (3D FCNs) using simple multi-modal fusion strategies hardly learn the complex and nonlinear complementary information between modalities. Meanwhile, the indiscriminative feature aggregation between low-level and high-level features easily causes volumetric feature misalignment in 3D FCNs. On the other hand, the 3D convolution operations of 3D FCNs are excellent at modeling local relations but typically inefficient at capturing global relations between distant regions in volumetric images. To tackle these issues, we propose an Aligned Cross-Modality Interaction Network (ACMINet) for segmenting the regions of brain tumors and tissues from MR images. In this network, the cross-modality feature interaction module is first designed to adaptively and efficiently fuse and refine multi-modal features. Secondly, the volumetric feature alignment module is developed for dynamically aligning low-level and high-level features by the learnable volumetric feature deformation field. Thirdly, we propose the volumetric dual interaction graph reasoning module for graph-based global context modeling in spatial and channel dimensions. Our proposed method is applied to brain glioma, vestibular schwannoma, and brain tissue segmentation tasks, and we performed extensive experiments on BraTS2018, BraTS2020, Vestibular Schwannoma, and iSeg-2017 datasets. Experimental results show that ACMINet achieves state-of-the-art segmentation performance on all four benchmark datasets and obtains the highest DSC score of hard-segmented enhanced tumor region on the validation leaderboard of the BraTS2020 challenge.

IROS Conference 2023 Conference Paper

An Efficient Trajectory Planner for Car-Like Robots on Uneven Terrain

  • Long Xu 0002
  • Kaixin Chai
  • Zhichao Han 0002
  • Hong Liu
  • Chao Xu 0001
  • Yanjun Cao
  • Fei Gao 0011

Autonomous navigation of ground robots on uneven terrain is being considered in more and more tasks. However, uneven terrain will bring two problems to motion planning: how to assess the traversability of the terrain and how to cope with the dynamics model of the robot associated with the terrain. The trajectories generated by existing methods are often too conservative or cannot be tracked well by the controller since the second problem is not well solved. In this paper, we propose terrain pose mapping to describe the impact of terrain on the robot. With this mapping, we can obtain the SE(3) state of the robot on uneven terrain for a given state in SE(2). Then, based on it, we present a trajectory optimization framework for car-like robots on uneven terrain that can consider both of the above problems. The trajectories generated by our method conform to the dynamics model of the system without being overly conservative and yet able to be tracked well by the controller. We perform simulations and real-world experiments to validate the efficiency and trajectory quality of our algorithm.

NeurIPS Conference 2023 Conference Paper

Improving Adversarial Robustness via Information Bottleneck Distillation

  • Huafeng Kuang
  • Hong Liu
  • Yongjian Wu
  • Shin'ichi Satoh
  • Rongrong Ji

Previous studies have shown that optimizing the information bottleneck can significantly improve the robustness of deep neural networks. Our study closely examines the information bottleneck principle and proposes an Information Bottleneck Distillation approach. This specially designed, robust distillation technique utilizes prior knowledge obtained from a robust pre-trained model to boost information bottlenecks. Specifically, we propose two distillation strategies that align with the two optimization processes of the information bottleneck. Firstly, we use a robust soft-label distillation method to increase the mutual information between latent features and output prediction. Secondly, we introduce an adaptive feature distillation method that automatically transfers relevant knowledge from the teacher model to the student model, thereby reducing the mutual information between the input and latent features. We conduct extensive experiments to evaluate our approach's robustness against state-of-the-art adversarial attackers such as PGD-attack and AutoAttack. Our experimental results demonstrate the effectiveness of our approach in significantly improving adversarial robustness. Our code is available at https://github.com/SkyKuang/IBD.
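
The first strategy has a standard form that is easy to sketch. Temperature `T` and the exact loss shape are assumptions; the paper's adaptive feature distillation is not shown.

```python
import torch.nn.functional as F

def soft_label_distillation(student_logits, teacher_logits, T=4.0):
    """Robust soft-label distillation sketch: match the student to the
    robust teacher's temperature-softened outputs, raising the mutual
    information between latent features and predictions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 rescales gradients to be comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```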

AAAI Conference 2023 Conference Paper

Inferential Knowledge-Enhanced Integrated Reasoning for Video Question Answering

  • Jianguo Mao
  • Wenbin Jiang
  • Hong Liu
  • Xiangdong Wang
  • Yajuan Lyu

Recently, video question answering has attracted growing attention. It involves answering a question based on a fine-grained understanding of video multi-modal information. Most existing methods have successfully explored a deep understanding of the visual modality. We argue that a deep understanding of the linguistic modality is also essential for answer reasoning, especially for videos that contain character dialogues. To this end, we propose an Inferential Knowledge-Enhanced Integrated Reasoning method. Our method consists of two main components: 1) an Inferential Knowledge Reasoner that generates inferential knowledge for linguistic-modality inputs, revealing deeper semantics such as implicit causes, effects, and mental states; and 2) an Integrated Reasoning Mechanism that enhances video content understanding and answer reasoning by leveraging the generated inferential knowledge. Experimental results show that our method achieves significant improvements on two mainstream datasets. The ablation study further demonstrates the effectiveness of each component of our approach.

AAAI Conference 2023 Conference Paper

M3AE: Multimodal Representation Learning for Brain Tumor Segmentation with Missing Modalities

  • Hong Liu
  • Dong Wei
  • Donghuan Lu
  • Jinghan Sun
  • Liansheng Wang
  • Yefeng Zheng

Multimodal magnetic resonance imaging (MRI) provides complementary information for sub-region analysis of brain tumors. Plenty of methods have been proposed for automatic brain tumor segmentation using four common MRI modalities and have achieved remarkable performance. In practice, however, it is common to have one or more modalities missing due to image corruption, artifacts, acquisition protocols, allergy to contrast agents, or simply cost. In this work, we propose a novel two-stage framework for brain tumor segmentation with missing modalities. In the first stage, a multimodal masked autoencoder (M3AE) is proposed, where both random modalities (i.e., modality dropout) and random patches of the remaining modalities are masked for a reconstruction task, for self-supervised learning of robust multimodal representations against missing modalities. To this end, we name our framework M3AE. Meanwhile, we employ model inversion to optimize a representative full-modal image at marginal extra cost, which is used to substitute for the missing modalities and boost performance during inference. Then, in the second stage, a memory-efficient self-distillation is proposed to distill knowledge between heterogeneous missing-modal situations while fine-tuning the model for supervised segmentation. Our M3AE belongs to the ‘catch-all’ genre where a single model can be applied to all possible subsets of modalities, and is thus economical for both training and deployment. Extensive experiments on the BraTS 2018 and 2020 datasets demonstrate its superior performance to existing state-of-the-art methods with missing modalities, as well as the efficacy of its components. Our code is available at: https://github.com/ccarliu/m3ae.
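
The two-level masking scheme is easy to sketch. Shapes and masking ratios below are assumptions, not the paper's settings:

```python
import torch

def m3ae_mask(x, modality_drop_p=0.25, patch_mask_ratio=0.5):
    """Masking sketch for an M3AE-style input, x: (batch, modalities,
    patches, dim). Whole modalities are dropped at random (modality
    dropout), then random patches of the remaining modalities are masked;
    the reconstruction target is the full input."""
    B, M, N, _ = x.shape
    keep_mod = (torch.rand(B, M, 1, 1) > modality_drop_p).float()
    keep_patch = (torch.rand(B, M, N, 1) > patch_mask_ratio).float()
    mask = keep_mod * keep_patch  # 1 = visible, 0 = to be reconstructed
    return x * mask, mask
```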

ICML Conference 2023 Conference Paper

Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

  • Hong Liu
  • Sang Michael Xie
  • Zhiyuan Li 0005
  • Tengyu Ma 0001

Language modeling on large-scale datasets improves the performance of various downstream tasks. The validation pre-training loss is often used as the evaluation metric for language models, since the pre-training loss tends to be well-correlated with downstream performance (which is itself hard to evaluate comprehensively). Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not. We identify three ways to produce models with the same pre-training loss but different downstream performance: continuing pre-training after convergence, increasing the model size, and changing the pre-training algorithm. These experiments demonstrate the existence of an implicit bias of pre-training algorithms: among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima of the pre-training loss in language models, and we empirically observe a strong correlation between flatness (measured by the trace of the Hessian) and downstream performance among models with the same pre-training loss. We also prove, in a synthetic language setting, that among models with the minimal pre-training loss, the flattest model transfers best to downstream tasks.
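
Flatness via the Hessian trace is commonly estimated with Hutchinson's method; here is a generic PyTorch sketch of that estimator (not the authors' code), using tr(H) = E[vᵀHv] for random ±1 vectors v:

```python
import torch

def hessian_trace(loss_fn, params, n_samples=32):
    """Hutchinson estimator for the Hessian trace: Hv is obtained by
    differentiating g.v, where g is the gradient kept in the graph."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = 0.0
    for _ in range(n_samples):
        vs = [torch.randint_like(g, 2) * 2.0 - 1.0 for g in grads]  # ±1
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)  # H v
        est += sum((h * v).sum() for h, v in zip(hvs, vs)).item()  # v.Hv
    return est / n_samples
```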

JBHI Journal 2023 Journal Article

Syn_SegNet: A Joint Deep Neural Network for Ultrahigh-Field 7T MRI Synthesis and Hippocampal Subfield Segmentation in Routine 3T MRI

  • Xinwei Li
  • Linjin Wang
  • Hong Liu
  • Baoqiang Ma
  • Lei Chu
  • Xiaoxi Dong
  • Debin Zeng
  • Tongtong Che

Precise delineation of hippocampus subfields is crucial for the identification and management of various neurological and psychiatric disorders. However, segmenting these subfields automatically in routine 3T MRI is challenging due to their complex morphology and small size, as well as the limited signal contrast and resolution of the 3T images. This research proposes Syn_SegNet, an end-to-end, multitask joint deep neural network that leverages ultrahigh-field 7T MRI synthesis to improve hippocampal subfield segmentation in 3T MRI. Our approach involves two key components. First, we employ a modified Pix2PixGAN as the synthesis model, incorporating self-attention modules, image and feature matching loss, and ROI loss to generate high-quality 7T-like MRI around the hippocampal region. Second, we utilize a variant of 3D-U-Net with multiscale deep supervision as the segmentation subnetwork, incorporating an anatomic weighted cross-entropy loss that capitalizes on prior anatomical knowledge. We evaluate our method on hippocampal subfield segmentation in paired 3T MRI and 7T MRI with seven different anatomical structures. The experimental findings demonstrate that Syn_SegNet's segmentation performance benefits from integrating synthetic 7T data in an online manner and is superior to competing methods. Furthermore, we assess the generalizability of the proposed approach using a publicly accessible 3T MRI dataset. The developed method would be an efficient tool for segmenting hippocampal subfields in routine clinical 3T MRI.

AAAI Conference 2023 Conference Paper

Uniform Sequence Better: Time Interval Aware Data Augmentation for Sequential Recommendation

  • Yizhou Dang
  • Enneng Yang
  • Guibing Guo
  • Linying Jiang
  • Xingwei Wang
  • Xiaoxiao Xu
  • Qinghui Sun
  • Hong Liu

Sequential recommendation is an important task that predicts the next item to access based on a sequence of interacted items. Most existing works learn user preference as the transition pattern from the previous item to the next one, ignoring the time interval between these two items. However, we observe that time intervals in a sequence may vary significantly, resulting in ineffective user modeling due to the issue of preference drift. In fact, we conducted an empirical study to validate this observation and found that a sequence with uniformly distributed time intervals (denoted as a uniform sequence) is more beneficial for performance improvement than one with greatly varying time intervals. Therefore, we propose to augment sequence data from the perspective of time intervals, which has not been studied in the literature. Specifically, we design five operators (Ti-Crop, Ti-Reorder, Ti-Mask, Ti-Substitute, Ti-Insert) to transform the original non-uniform sequence into a uniform sequence while considering the variance of time intervals. Then, we devise a control strategy to execute data augmentation on item sequences of different lengths. Finally, we implement these improvements on the state-of-the-art model CoSeRec and validate our approach on four real datasets. The experimental results show that our approach achieves significantly better performance than the other 9 competing methods. Our implementation is available at: https://github.com/KingGugu/TiCoSeRec.
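
As one concrete example of the five operators, here is a plausible Ti-Crop sketch (the operator's exact definition is not given in the abstract): keep the contiguous window whose time intervals are most uniform.

```python
import numpy as np

def ti_crop(items, timestamps, crop_len):
    """Among all contiguous windows of length crop_len, keep the one whose
    inter-item time intervals have the lowest variance, i.e. the most
    'uniform' subsequence. Assumes len(items) >= crop_len."""
    intervals = np.diff(timestamps)
    best_start, best_var = 0, np.inf
    for s in range(len(items) - crop_len + 1):
        v = np.var(intervals[s : s + crop_len - 1])  # window's intervals
        if v < best_var:
            best_start, best_var = s, v
    return items[best_start : best_start + crop_len]
```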

JBHI Journal 2022 Journal Article

APRNet: A 3D Anisotropic Pyramidal Reversible Network With Multi-Modal Cross-Dimension Attention for Brain Tissue Segmentation in MR Images

  • Yuzhou Zhuang
  • Hong Liu
  • Enmin Song
  • Guangzhi Ma
  • Xiangyang Xu
  • Chih-Cheng Hung

Brain tissue segmentation in multi-modal magnetic resonance (MR) images is significant for the clinical diagnosis of brain diseases. Due to blurred boundaries, low contrast, and intricate anatomical relationships between brain tissue regions, automatic brain tissue segmentation without prior knowledge is still challenging. This paper presents a novel 3D fully convolutional network (FCN) for brain tissue segmentation, called APRNet. In this network, we first propose a 3D anisotropic pyramidal convolutional reversible residual sequence (3DAPC-RRS) module to integrate the intra-slice information with the inter-slice information without significant memory consumption; secondly, we design a multi-modal cross-dimension attention (MCDA) module to automatically capture the effective information in each dimension of multi-modal images; then, we apply 3DAPC-RRS modules and MCDA modules to a 3D FCN with multiple encoded streams and one decoded stream for constituting the overall architecture of APRNet. We evaluated APRNet on two benchmark challenges, namely MRBrainS13 and iSeg-2017. The experimental results show that APRNet yields state-of-the-art segmentation results on both benchmark challenge datasets and achieves the best segmentation performance on the cerebrospinal fluid region. Compared with other methods, our proposed approach exploits the complementary information of different modalities to segment brain tissue regions in both adult and infant MR images, achieving average Dice coefficients of 87.22% and 93.03% on the MRBrainS13 and iSeg-2017 testing data, respectively. The proposed method is beneficial for quantitative brain analysis in clinical studies, and our code is made publicly available.
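
For reference, the Dice coefficient reported above is defined as Dice(A, B) = 2|A ∩ B| / (|A| + |B|); a minimal NumPy example:

```python
# Dice overlap between a predicted and a ground-truth binary mask.
import numpy as np

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(dice(a, b))  # 2*2 / (3+3) ≈ 0.667
```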

AAAI Conference 2022 Conference Paper

Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-Supervised Action Recognition

  • Tianyu Guo
  • Hong Liu
  • Zhan Chen
  • Mengyuan Liu
  • Tao Wang
  • Runwei Ding

In recent years, self-supervised representation learning for skeleton-based action recognition has developed with the advance of contrastive learning methods. Existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits the ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed. First, extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to obtain diverse positive samples, which bring novel movement patterns that improve the universality of the learned representations. Second, since directly using extreme augmentations may not boost performance due to drastic changes to the original identity, the Dual Distributional Divergence Minimization Loss (D3M Loss) is proposed to minimize the distribution divergence in a gentler way. Third, Nearest Neighbors Mining (NNM) is proposed to further expand positive samples and make the abundant information mining process more reasonable. Exhaustive experiments on the NTU RGB+D 60, PKU-MMD, and NTU RGB+D 120 datasets have verified that AimCLR performs favorably against state-of-the-art methods under a variety of evaluation protocols, with observably higher-quality action representations. Our code is available at https://github.com/Levigty/AimCLR.
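
The D3M Loss is described as softening the divergence between the distributions of a normal view and an extremely augmented view. As a rough stand-in (not the paper's exact formulation), a symmetric KL divergence between softmax outputs can be written in PyTorch as:

```python
# Illustrative only: symmetric KL as a gentle distributional divergence.
import torch
import torch.nn.functional as F

def symmetric_kl(logits_p, logits_q):
    p, q = F.softmax(logits_p, dim=-1), F.softmax(logits_q, dim=-1)
    kl_pq = F.kl_div(q.log(), p, reduction='batchmean')  # KL(p || q)
    kl_qp = F.kl_div(p.log(), q, reduction='batchmean')  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

loss = symmetric_kl(torch.randn(8, 128), torch.randn(8, 128))
```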

AAAI Conference 2022 Conference Paper

Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking

  • Yidi Li
  • Hong Liu
  • Hao Tang

Multi-modal fusion is proven to be an effective method to improve the accuracy and robustness of speaker tracking, especially in complex scenarios. However, how to combine the heterogeneous information and exploit the complementarity of multi-modal signals remains a challenging issue. In this paper, we propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities. Specifically, a novel acoustic map based on spatial-temporal Global Coherence Field (stGCF) is first constructed for heterogeneous signal fusion, which employs a camera model to map audio cues to the localization space consistent with the visual cues. Then a multi-modal perception attention network is introduced to derive the perception weights that measure the reliability and effectiveness of intermittent audio and video streams disturbed by noise. Moreover, a unique cross-modal self-supervised learning method is presented to model the confidence of audio and visual observations by leveraging the complementarity and consistency between different modalities. Experimental results show that the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively, which demonstrates its robustness under adverse conditions and outperforms the current state-of-the-art methods.

AAAI Conference 2022 Conference Paper

Pose-Guided Feature Disentangling for Occluded Person Re-identification Based on Transformer

  • Tao Wang
  • Hong Liu
  • Pinhao Song
  • Tianyu Guo
  • Wei Shi

Occluded person re-identification is a challenging task as human body parts can be occluded by obstacles (e.g., trees, cars, and pedestrians) in certain scenes. Some existing pose-guided methods solve this problem by aligning body parts according to graph matching, but these graph-based methods are unintuitive and complicated. Therefore, we propose a transformer-based Pose-guided Feature Disentangling (PFD) method that utilizes pose information to clearly disentangle semantic components (e.g., human body or joint parts) and selectively match non-occluded parts correspondingly. First, a Vision Transformer (ViT) is used to extract patch features with its strong representational capability. Second, to preliminarily disentangle the pose information from patch information, a matching and distributing mechanism is leveraged in the Pose-guided Feature Aggregation (PFA) module. Third, a set of learnable semantic views is introduced in the transformer decoder to implicitly enhance the disentangled body part features. However, those semantic views are not guaranteed to be related to the body without additional supervision. Therefore, a Pose-View Matching (PVM) module is proposed to explicitly match visible body parts and automatically separate occlusion features. Fourth, to better prevent interference from occlusions, we design a Pose-guided Push Loss to emphasize the features of visible body parts. Extensive experiments over five challenging datasets for two tasks (occluded and holistic Re-ID) demonstrate that our proposed PFD performs favorably against state-of-the-art methods. Code is available at https://github.com/WangTaoAs/PFD_Net.

ICLR Conference 2022 Conference Paper

Self-supervised Learning is More Robust to Dataset Imbalance

  • Hong Liu
  • Jeff Z. HaoChen
  • Adrien Gaidon
  • Tengyu Ma 0001

Self-supervised learning (SSL) is a scalable way to learn general visual representations since it learns without labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. In this work, we systematically investigate self-supervised learning under dataset imbalance. First, we find via extensive experiments that off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. The performance gap between balanced and imbalanced pre-training with SSL is significantly smaller than the gap with supervised learning, across sample sizes, for both in-domain and, especially, out-of-domain evaluation. Second, towards understanding the robustness of SSL, we hypothesize that SSL learns richer features from frequent data: it may learn label-irrelevant-but-transferable features that help classify the rare classes and downstream tasks. In contrast, supervised learning has no incentive to learn features irrelevant to the labels from frequent examples. We validate this hypothesis with semi-synthetic experiments as well as rigorous mathematical analyses on a simplified setting. Third, inspired by the theoretical insights, we devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets with several evaluation criteria, closing the small gap between balanced and imbalanced datasets with the same number of examples.

IJCAI Conference 2021 Conference Paper

Adversarial Feature Disentanglement for Long-Term Person Re-identification

  • Wanlu Xu
  • Hong Liu
  • Wei Shi
  • Ziling Miao
  • Zhisheng Lu
  • Feihu Chen

Most existing person re-identification methods are effective in short-term scenarios because of their appearance dependencies. However, these methods may fail in long-term scenarios where people might change their clothes. To this end, we propose an adversarial feature disentanglement network (AFD-Net) which contains intra-class reconstruction and inter-class adversary to disentangle the identity-related and identity-unrelated (clothing) features. For intra-class reconstruction, person images with the same identity are represented and disentangled into identity and clothing features by two separate encoders, and further reconstructed into the original images to reduce intra-class feature variations. For inter-class adversary, the disentangled features across different identities are exchanged and recombined to generate adversarial clothes-changing images for training, which makes the identity and clothing features more independent. In particular, to supervise these newly generated clothes-changing images, a re-feeding strategy is designed to re-disentangle and reconstruct them for image-level self-supervision in the original image space and feature-level soft-supervision in the disentangled feature space. Moreover, we collect a challenging Market-Clothes dataset and a real-world PKU-Market-Reid dataset for evaluation. The results on one large-scale short-term dataset (Market-1501) and five long-term datasets (three public and two we proposed) confirm the superiority of our method against other state-of-the-art methods.

NeurIPS Conference 2021 Conference Paper

Cycle Self-Training for Domain Adaptation

  • Hong Liu
  • Jianmin Wang
  • Mingsheng Long

Mainstream approaches for unsupervised domain adaptation (UDA) learn domain-invariant representations to narrow the domain shift, which are empirically effective but theoretically challenged by the hardness or impossibility theorems. Recently, self-training has been gaining momentum in UDA, which exploits unlabeled target data by training with target pseudo-labels. However, as corroborated in this work, under distributional shift, the pseudo-labels can be unreliable in terms of their large discrepancy from target ground truth. In this paper, we propose Cycle Self-Training (CST), a principled self-training algorithm that explicitly enforces pseudo-labels to generalize across domains. CST cycles between a forward step and a reverse step until convergence. In the forward step, CST generates target pseudo-labels with a source-trained classifier. In the reverse step, CST trains a target classifier using target pseudo-labels, and then updates the shared representations to make the target classifier perform well on the source data. We introduce the Tsallis entropy as a confidence-friendly regularization to improve the quality of target pseudo-labels. We analyze CST theoretically under realistic assumptions, and provide hard cases where CST recovers target ground truth, while both invariant feature learning and vanilla self-training fail. Empirical results indicate that CST significantly improves over the state of the art on visual recognition and sentiment analysis benchmarks.
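
The Tsallis entropy regularizer has a closed form: for a prediction p and entropic index q, S_q(p) = (1 − Σ_i p_i^q) / (q − 1), which recovers the Shannon entropy as q → 1. A minimal PyTorch sketch:

```python
# Tsallis entropy over a batch of softmax predictions; lower values indicate
# more confident predictions, so it can serve as a confidence regularizer.
import torch

def tsallis_entropy(probs, q=2.0):
    return (1.0 - probs.pow(q).sum(dim=-1)) / (q - 1.0)

p = torch.softmax(torch.randn(4, 10), dim=-1)
reg = tsallis_entropy(p).mean()  # e.g., added to the target-classifier loss
```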

AAAI Conference 2021 Conference Paper

Domain General Face Forgery Detection by Learning to Weight

  • Ke Sun
  • Hong Liu
  • Qixiang Ye
  • Yue Gao
  • Jianzhuang Liu
  • Ling Shao
  • Rongrong Ji

In this paper, we propose a domain-general model, termed learning-to-weight (LTW), that guarantees face forgery detection performance across multiple domains, particularly target domains never seen before. Various face forgery methods cause complex and biased data distributions, making it challenging to detect fake faces in unseen domains. We argue that different faces contribute differently to a detection model trained on multiple domains, making the model likely to fit domain-specific biases. As such, we propose the LTW approach based on a meta-weight learning algorithm, which configures different weights for face images from different domains. The LTW network can balance the model's generalizability across multiple domains. The meta-optimization then calibrates the source domains' gradients, enabling more discriminative features to be learned. The detection ability of the network is further improved by introducing an intra-class compact loss. Extensive experiments on several commonly used deepfake datasets demonstrate the effectiveness of our method in detecting synthetic faces. Code and supplemental material are available at https://github.com/skJack/LTW.

AAAI Conference 2021 Conference Paper

Learning to Attack Real-World Models for Person Re-identification via Virtual-Guided Meta-Learning

  • Fengxiang Yang
  • Zhun Zhong
  • Hong Liu
  • Zheng Wang
  • Zhiming Luo
  • Shaozi Li
  • Nicu Sebe
  • Shin'ichi Satoh

Recent advances in person re-identification (re-ID) have led to impressive retrieval accuracy. However, existing re-ID models are challenged by adversarial examples crafted by adding quasi-imperceptible perturbations. Moreover, re-ID systems face the domain shift issue that training and testing domains are not consistent. In this study, we argue that learning powerful attackers with high universality that work well on unseen domains is an important step in promoting the robustness of re-ID systems. Therefore, we introduce a novel universal attack algorithm called “MetaAttack” for person re-ID. MetaAttack can mislead re-ID models on unseen domains via a universal adversarial perturbation. Specifically, to capture common patterns across different domains, we propose a meta-learning scheme that seeks the universal perturbation via the gradient interaction between meta-train and meta-test steps formed by two datasets. We also take advantage of a virtual dataset (PersonX), instead of real ones, to conduct the meta-test. This scheme not only enables us to learn with more comprehensive variation factors but also mitigates the negative effects caused by biased factors of real datasets. Experiments on three large-scale re-ID datasets demonstrate the effectiveness of our method in attacking re-ID models on unseen domains. Our final visualization results reveal some new properties of existing re-ID systems, which can guide us in designing a more robust re-ID model. Code and supplemental material are available at https://github.com/FlyingRoastDuck/MetaAttack_AAAI21.

IJCAI Conference 2021 Conference Paper

Modality-aware Style Adaptation for RGB-Infrared Person Re-Identification

  • Ziling Miao
  • Hong Liu
  • Wei Shi
  • Wanlu Xu
  • Hanrong Ye

RGB-infrared (IR) person re-identification is a challenging task due to the large modality gap between RGB and IR images. Many existing methods bridge the modality gap by style conversion, requiring high-similarity images exchanged by complex CNN structures, like GANs. In this paper, we propose a highly compact modality-aware style adaptation (MSA) framework, which aims to explore more potential relations between the RGB and IR modalities by introducing new related modalities. The attention is therefore shifted from bridging to filling the modality gap, with no requirement for high-quality generated images. To this end, we first propose a concise feature-free image generation structure to adapt the original modalities to two new styles that are compatible with both inputs via patch-based pixel redistribution. Second, we devise two image style quantification metrics to discriminate styles in image space using luminance and contrast. Third, we design two image-level losses based on the quantified results to guide the style adaptation during an end-to-end four-modality collaborative learning process. Experimental results on two datasets, SYSU-MM01 and RegDB, show that MSA achieves significant improvements with little extra computation cost and outperforms state-of-the-art methods.
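
A plausible reading of the two style quantification metrics, with luminance as mean intensity and contrast as the intensity standard deviation (NumPy sketch; the paper's exact definitions may include normalization we omit):

```python
# Quantify an image's "style" by per-channel luminance and contrast.
import numpy as np

def style_stats(img):
    """img: (H, W, C) float array in [0, 1]."""
    luminance = img.mean(axis=(0, 1))   # average intensity per channel
    contrast = img.std(axis=(0, 1))     # intensity spread per channel
    return luminance, contrast

rgb = np.random.rand(64, 64, 3)
lum, con = style_stats(rgb)
```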

AAAI Conference 2021 Conference Paper

Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

  • Zhan Chen
  • Sicheng Li
  • Bing Yang
  • Qinghan Li
  • Hong Liu

Graph convolutional networks have been widely used for skeleton-based action recognition due to their excellent ability to model non-Euclidean data. As graph convolution is a local operation, it can only utilize short-range joint dependencies and short-term trajectories, and fails to directly model the distant joint relations and long-range temporal information that are vital to distinguishing various actions. To solve this problem, we present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module to enrich the receptive field of the model in the spatial and temporal dimensions. Concretely, the MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features are processed by a series of sub-graph convolutions, and each node can complete multiple spatial and temporal aggregations with its neighborhoods. The final equivalent receptive field is accordingly enlarged, capturing both short- and long-range dependencies in the spatial and temporal domains. By coupling these two modules as a basic block, we further propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition. The proposed MST-GCN achieves remarkable performance on three challenging benchmark datasets, NTU RGB+D, NTU-120 RGB+D and Kinetics-Skeleton, for skeleton-based action recognition.
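
The decomposition can be pictured as a Res2Net-style hierarchy: split the channels into groups, convolve each group, and feed each result into the next group before convolving it, so later groups see progressively larger receptive fields. The simplified 1D temporal sketch below (PyTorch) is our illustration of the pattern, not the authors' exact MS-GC/MT-GC code.

```python
# Hierarchical residual sub-convolutions over the temporal axis.
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    def __init__(self, channels, groups=4, kernel_size=3):
        super().__init__()
        assert channels % groups == 0
        c = channels // groups
        self.groups = groups
        self.convs = nn.ModuleList(
            nn.Conv1d(c, c, kernel_size, padding=kernel_size // 2)
            for _ in range(groups - 1)
        )

    def forward(self, x):                     # x: (B, C, T)
        chunks = list(x.chunk(self.groups, dim=1))
        out, prev = [chunks[0]], chunks[0]    # first group passes through
        for conv, chunk in zip(self.convs, chunks[1:]):
            prev = conv(chunk + prev)         # each group sees the previous one
            out.append(prev)
        return torch.cat(out, dim=1) + x      # residual connection

y = MultiScaleTemporalConv(64)(torch.randn(2, 64, 30))
```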

JBHI Journal 2020 Journal Article

A Two-Stage Convolutional Neural Networks for Lung Nodule Detection

  • Haichao Cao
  • Hong Liu
  • Enmin Song
  • Guangzhi Ma
  • Renchao Jin
  • Xiangyang Xu
  • Tengying Liu
  • Chih-Cheng Hung

Early detection of lung cancer is an effective way to improve the survival rate of patients. Accurate detection of lung nodules in computed tomography (CT) images is a critical step in the diagnosis of lung cancer. However, due to the heterogeneity of lung nodules and the complexity of the surrounding environment, developing a robust nodule detection method remains a challenge. In this study, we propose a two-stage convolutional neural network (TSCNN) for lung nodule detection. The first stage, based on an improved U-Net segmentation network, establishes an initial detection of lung nodules. During this stage, in order to obtain a high recall rate without introducing excessive false positive nodules, we propose a new sampling strategy for training, together with a two-phase prediction method. The second stage, based on the proposed dual-pooling structure, builds three 3D-CNN classification networks for false positive reduction. Since network training requires a significant amount of data, we designed a random mask as the data augmentation method in this study. Furthermore, we improved the generalization ability of the false positive reduction model by means of ensemble learning. We verified the proposed architecture on the LUNA dataset in our experiments, which showed that the proposed TSCNN architecture obtains competitive detection performance.
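
A hedged sketch of the random-mask augmentation: zero out a randomly placed cuboid inside a 3D CT patch so the false-positive-reduction networks cannot over-rely on any single sub-region. The mask-size and placement rules below are our assumptions, not the paper's.

```python
# Zero a random cuboid within a (D, H, W) patch as augmentation.
import numpy as np

def random_mask_3d(patch, max_frac=0.3, rng=np.random.default_rng()):
    out = patch.copy()
    d, h, w = patch.shape
    # sample mask extents, each at most max_frac of the corresponding side
    md, mh, mw = (rng.integers(1, max(2, int(s * max_frac))) for s in (d, h, w))
    # sample a valid top-left-front corner for the cuboid
    z, y, x = (rng.integers(0, s - m + 1)
               for s, m in zip((d, h, w), (md, mh, mw)))
    out[z:z + md, y:y + mh, x:x + mw] = 0.0
    return out

aug = random_mask_3d(np.random.rand(32, 32, 32))
```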

NeurIPS Conference 2020 Conference Paper

Learning to Adapt to Evolving Domains

  • Hong Liu
  • Mingsheng Long
  • Jianmin Wang
  • Yu Wang

Domain adaptation aims at knowledge transfer from a labeled source domain to an unlabeled target domain. Current domain adaptation methods have made substantial advances in adapting discrete domains. However, this can be unrealistic in real-world applications, where target data usually arrive online and continually evolve as small batches, posing challenges to the classic domain adaptation paradigm: (1) mainstream domain adaptation methods are tailored to stationary target domains and can fail in non-stationary environments; (2) since the target data arrive online, the agent should also maintain competence on previous target domains, i.e., adapt without forgetting. To tackle these challenges, we propose a meta-adaptation framework which enables the learner to adapt to a continually evolving target domain without catastrophic forgetting. Our framework comprises two components: a meta-objective of learning representations to adapt to evolving domains, enabling meta-learning for unsupervised domain adaptation; and a meta-adapter for learning to adapt without forgetting, preserving knowledge from previous target data. Experiments validate the effectiveness of our method on evolving target domains.

IJCAI Conference 2020 Conference Paper

Unsupervised Monocular Visual-inertial Odometry Network

  • Peng Wei
  • Guoliang Hua
  • Weibo Huang
  • Fanyang Meng
  • Hong Liu

Recently, unsupervised methods for monocular visual odometry (VO), which need no expensive labeled ground truth, have attracted much attention. However, these methods are inadequate for the long-term odometry task, due to the inherent limitation of using only monocular visual data and the inability to handle error accumulation. By utilizing supplemental low-cost inertial measurements, and exploiting multi-view geometric and sequential constraints, an unsupervised visual-inertial odometry framework (UnVIO) is proposed in this paper. Our method predicts a per-frame depth map, and extracts and self-adaptively fuses visual-inertial motion features from the image-IMU stream to achieve the long-term odometry task. A novel sliding-window optimization strategy, consisting of an intra-window and an inter-window optimization, is introduced to overcome error accumulation and scale ambiguity. The intra-window optimization restrains the geometric inferences within the window by checking photometric consistency, while the inter-window optimization checks the 3D geometric consistency and trajectory consistency among predictions of separate windows. Extensive experiments on the KITTI and Malaga datasets demonstrate the superiority of UnVIO over other state-of-the-art VO/VIO methods. The codes are open-source.
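
The intra-window photometric check boils down to penalizing appearance differences between a target frame and a source frame warped into the target view. A toy NumPy sketch (the warping itself, which uses predicted depth and pose, is omitted; real systems typically combine L1 with SSIM):

```python
# L1 photometric error between a target frame and a warped source frame.
import numpy as np

def photometric_l1(target, warped_source, mask=None):
    err = np.abs(target - warped_source)
    if mask is not None:          # e.g., exclude out-of-view pixels
        err = err[mask]
    return err.mean()

loss = photometric_l1(np.random.rand(128, 416), np.random.rand(128, 416))
```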

AAAI Conference 2019 Conference Paper

Learning Neural Bag-of-Matrix-Summarization with Riemannian Network

  • Hong Liu
  • Jie Li
  • Yongjian Wu
  • Rongrong Ji

The symmetric positive definite (SPD) matrix has attracted increasing research focus in image/video analysis, owing to its ability to capture Riemannian geometry in a structured 2D feature representation. However, computation in the vector space on SPD matrices cannot capture these geometric properties, which degrades classification performance. To this end, Riemannian deep networks have become a promising solution for SPD matrix classification, because of their excellence in non-linear learning over SPD matrices. Besides, Riemannian metric learning typically adopts a kNN classifier that cannot be extended to large-scale datasets, which limits its application in many time-sensitive scenarios. In this paper, we propose a Bag-of-Matrix-Summarization (BoMS) method to be combined with a Riemannian network, which handles the above issues towards a highly efficient and scalable SPD feature representation. Our key innovation lies in summarizing data in a Riemannian geometric space instead of the vector space. First, the whole training set is compressed with a small number of matrix features to ensure high scalability. Second, given such a compressed set, a constant-length vector representation is extracted by efficiently measuring the distribution variations between the summarized data and the latent feature of the Riemannian network. Finally, the proposed BoMS descriptor is integrated into the Riemannian network, upon which the whole framework is trained end-to-end via matrix back-propagation. Experiments on four different classification tasks demonstrate the superior performance of the proposed method over state-of-the-art methods.
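
For context on why vector-space computation fails for SPD matrices: distances should respect the manifold, e.g. the log-Euclidean metric d(A, B) = ||log(A) − log(B)||_F. A small SciPy sketch (BoMS itself summarizes data inside the network rather than relying on pairwise kNN distances like this):

```python
# Log-Euclidean distance between two SPD matrices via the matrix logarithm.
import numpy as np
from scipy.linalg import logm

def log_euclidean_dist(A, B):
    D = np.real(logm(A)) - np.real(logm(B))   # logm of SPD input is real
    return np.linalg.norm(D, ord='fro')

X = np.random.rand(4, 4); A = X @ X.T + np.eye(4)   # construct SPD matrices
Y = np.random.rand(4, 4); B = Y @ Y.T + np.eye(4)
print(log_euclidean_dist(A, B))
```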

AAAI Conference 2019 Conference Paper

Towards Optimal Discrete Online Hashing with Balanced Similarity

  • Mingbao Lin
  • Rongrong Ji
  • Hong Liu
  • Xiaoshuai Sun
  • Yongjian Wu
  • Yunsheng Wu

When facing large-scale image datasets, online hashing serves as a promising solution for online retrieval and prediction tasks. It encodes the online streaming data into compact binary codes, and simultaneously updates the hash functions to renew codes of the existing dataset. To this end, existing methods update the hash functions solely based on the new data batch, without investigating the correlation between such new data and the existing dataset. In addition, existing works update the hash functions using a relaxation process in the corresponding approximated continuous space, and it remains an open problem to directly apply discrete optimization in online hashing. In this paper, we propose a novel supervised online hashing method, termed Balanced Similarity for Online Discrete Hashing (BSODH), to solve the above problems in a unified framework. BSODH employs a well-designed hashing algorithm to preserve the similarity between the streaming data and the existing dataset via an asymmetric graph regularization. We further identify the “data-imbalance” problem brought by the constructed asymmetric graph, which restricts the application of discrete optimization in our problem. Therefore, a novel balanced similarity is further proposed, which uses two equilibrium factors to balance the similar and dissimilar weights and eventually enables the use of discrete optimization. Extensive experiments conducted on three widely-used benchmarks demonstrate the advantages of the proposed method over state-of-the-art methods.

ICML Conference 2019 Conference Paper

Transferable Adversarial Training: A General Approach to Adapting Deep Classifiers

  • Hong Liu
  • Mingsheng Long
  • Jianmin Wang 0001
  • Michael I. Jordan

Domain adaptation enables knowledge transfer from a labeled source domain to an unlabeled target domain. A mainstream approach is adversarial feature adaptation, which learns domain-invariant representations through aligning the feature distributions of both domains. However, a theoretical prerequisite of domain adaptation is the adaptability measured by the expected risk of an ideal joint hypothesis over the source and target domains. In this respect, adversarial feature adaptation may potentially deteriorate the adaptability, since it distorts the original feature distributions when suppressing domain-specific variations. To this end, we propose Transferable Adversarial Training (TAT) to enable the adaptation of deep classifiers. The approach generates transferable examples to fill in the gap between the source and target domains, and adversarially trains the deep classifiers to make consistent predictions over the transferable examples. Without learning domain-invariant representations at the expense of distorting the feature distributions, the adaptability in the theoretical learning bound is algorithmically guaranteed. A series of experiments validate that our approach advances the state of the art on a variety of domain adaptation tasks in vision and NLP, including object recognition, learning from synthetic to real data, and sentiment classification.

AAAI Conference 2019 Conference Paper

Unified Embedding Alignment with Missing Views Inferring for Incomplete Multi-View Clustering

  • Jie Wen
  • Zheng Zhang
  • Yong Xu
  • Bob Zhang
  • Lunke Fei
  • Hong Liu

Multi-view clustering aims to partition data collected from diverse sources based on the assumption that all views are complete. However, such a prior assumption is hardly satisfied in many real-world applications, resulting in the incomplete multi-view learning problem. The existing attempts on this problem still have the following limitations: 1) the underlying semantic information of the missing views is commonly ignored; 2) the local structure of data is not well explored; 3) the importance of different views is not effectively evaluated. To address these issues, this paper proposes a Unified Embedding Alignment Framework (UEAF) for robust incomplete multi-view clustering. In particular, a locality-preserved reconstruction term is introduced to infer the missing views such that all views can be naturally aligned. A consensus graph is adaptively learned and embedded via the reverse graph regularization to guarantee the common local structure of multiple views and in turn can further align the incomplete views and inferred views. Moreover, an adaptive weighting strategy is designed to capture the importance of different views. Extensive experimental results show that the proposed method can significantly improve the clustering performance in comparison with some state-of-the-art methods.

JBHI Journal 2017 Journal Article

Classification of Multiple Finger Motions During Dynamic Upper Limb Movements

  • Dapeng Yang
  • Wei Yang
  • Qi Huang
  • Hong Liu

To better restore human hand function, advanced hand prostheses should be able to deal with a variety of daily living conditions. In this paper, we addressed myoelectric signal variations introduced by different muscle contractions, dynamic arm movements, and outer interfering forces in the practice of pattern recognition-based myoelectric control schemes. We examined four different training paradigms (data-collection protocols) and quantified their effectiveness for obtaining a robust classification. We further depicted the classification accuracy according to different arm/wrist motion primitives. Our results indicate the training paradigm that collects myoelectric signals on dynamic arm postures and varying muscular contractions (DPDE) can largely mitigate the motions' misclassification rate. The misclassification rate of finger motions seems to highly correlate to wrist pronation and supination, rather than different arm positions. Combining proprioceptive information, such as the hand's orientation, with myoelectric signals for classification only slightly alleviates the misclassification rate.

AAAI Conference 2017 Conference Paper

Ordinal Constrained Binary Code Learning for Nearest Neighbor Search

  • Hong Liu
  • Rongrong Ji
  • Yongjian Wu
  • Feiyue Huang

Recent years have witnessed extensive attention to binary code learning, a.k.a. hashing, for nearest neighbor search problems. High-dimensional data points can be quantized into binary codes to give an efficient similarity approximation via Hamming distance. Among existing schemes, ranking-based hashing is a recent promising direction that aims to preserve the ordinal relations of ranking in the Hamming space to minimize retrieval loss. However, the number of ranking tuples, which encode the ordinal relations, is quadratic or cubic in the number of training samples. Given a large-scale training dataset, it is very expensive to embed such ranking tuples in binary code learning. Besides, it remains difficult to build ranking tuples efficiently for most ranking-preserving hashing methods, which are deployed over an ordinal graph-based setting. To handle these problems, we propose a novel ranking-preserving hashing method, dubbed Ordinal Constraint Hashing (OCH), which efficiently learns the optimal hashing functions with a graph-based approximation to embed the ordinal relations. The core idea is to reduce the size of the ordinal graph with an ordinal constraint projection, which preserves the ordinal relations through a small data set (such as clusters or random samples). In particular, to learn such hash functions effectively, we further relax the discrete constraints and design a specific stochastic gradient descent algorithm for optimization. Experimental results on three large-scale visual search benchmark datasets, i.e., LabelMe, Tiny100K and GIST1M, show that the proposed OCH method achieves superior performance over state-of-the-art approaches.
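
The payoff of binary codes is that Hamming distance reduces to an XOR and a popcount; a minimal Python illustration over integer-packed codes:

```python
# Hamming distance between two binary codes packed into Python integers.
def hamming(code_a: int, code_b: int) -> int:
    return bin(code_a ^ code_b).count("1")

print(hamming(0b10110, 0b11100))  # the codes differ in two bit positions -> 2
```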

IJCAI Conference 2016 Conference Paper

3D Action Recognition Using Multi-Temporal Depth Motion Maps and Fisher Vector

  • Chen Chen
  • Mengyuan Liu
  • Baochang Zhang
  • Jungong Han
  • Junjun Jiang
  • Hong Liu

This paper presents an effective local spatio-temporal descriptor for action recognition from depth video sequences. The unique property of our descriptor is that it takes shape discrimination and action speed variations into account, aiming to distinguish different pose shapes and identify actions performed at different speeds within a single framework. The entire algorithm is carried out in three stages. In the first stage, a depth sequence is divided into temporally overlapping depth segments which are used to generate three depth motion maps (DMMs), capturing the shape and motion cues. To cope with speed variations in actions, multiple frame lengths of depth segments are utilized, leading to a multi-temporal DMM representation. In the second stage, all the DMMs are first partitioned into dense patches. Then, the local binary patterns (LBP) descriptor is exploited to characterize local rotation-invariant texture information in those patches. In the third stage, the Fisher kernel is employed to encode the patch descriptors into a compact feature representation, which is fed into a kernel-based extreme learning machine classifier. Extensive experiments on the public MSRAction3D, MSRGesture3D and DHA datasets show that our proposed method outperforms state-of-the-art approaches for depth-based action recognition.
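
The first-stage computation can be condensed as follows: a DMM accumulates absolute differences between consecutive depth frames, and the multi-temporal variant repeats this over several segment lengths. A simplified NumPy sketch (it skips the projection onto the three orthogonal views described in the paper):

```python
# Depth motion map: accumulated frame-to-frame motion energy.
import numpy as np

def depth_motion_map(frames):
    """frames: (T, H, W) depth sequence -> (H, W) motion energy map."""
    frames = np.asarray(frames, dtype=float)
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)

# Multi-temporal DMMs: recompute over segments of different lengths.
dmms = [depth_motion_map(np.random.rand(t, 240, 320)) for t in (10, 20, 40)]
```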

IJCAI Conference 2016 Conference Paper

A Novel Feature Matching Strategy for Large Scale Image Retrieval

  • Hao Tang
  • Hong Liu

Feature-to-feature matching is the key issue in the Bag-of-Features model. The baseline approach employs coarse feature-to-feature matching: two descriptors are assumed to match if they are assigned the same quantization index. However, this Hard Assignment strategy usually incurs undesirably low precision. To fix this, Multiple Assignment and Soft Assignment have been proposed. These two methods reduce the quantization error to some extent, but there is still considerable room for improvement. To further improve retrieval precision, in this paper, we propose a novel feature matching strategy, called local-restricted Soft Assignment (lrSA), in which a new feature matching function is introduced. The lrSA strategy is evaluated through extensive experiments on five benchmark datasets. Experiments show that the results exceed the retrieval performance of current quantization methods on these datasets. Combined with post-processing steps, we have achieved competitive results compared with the state-of-the-art methods. Overall, our strategy shows notable benefit for retrieval with large vocabularies and dataset sizes.
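
A hedged sketch of what a local-restricted soft assignment can look like: assign a descriptor only to its r nearest visual words, with Gaussian weights renormalized over that local set (the paper's exact matching function may differ):

```python
# Soft-assign a descriptor to its r nearest visual words.
import numpy as np

def lr_soft_assign(desc, vocab, r=3, sigma=0.2):
    d2 = ((vocab - desc) ** 2).sum(axis=1)       # squared distances to words
    nearest = np.argsort(d2)[:r]                 # local restriction
    w = np.exp(-d2[nearest] / (2 * sigma ** 2))  # Gaussian weighting
    return nearest, w / w.sum()                  # renormalize over the set

words, weights = lr_soft_assign(np.random.rand(64), np.random.rand(1000, 64))
```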

IJCAI Conference 2016 Conference Paper

Supervised Matrix Factorization for Cross-Modality Hashing

  • Hong Liu
  • Rongrong Ji
  • Yongjian Wu
  • Gang Hua

Matrix factorization has recently been utilized for multi-modal hashing in cross-modality visual search, where basis functions are learned to map data from different modalities to the same Hamming embedding. In this paper, we propose a novel cross-modality hashing algorithm termed Supervised Matrix Factorization Hashing (SMFH), which tackles the multi-modal hashing problem with a collective non-negative matrix factorization across the different modalities. In particular, SMFH employs a well-designed binary code learning algorithm to preserve the similarities among multi-modal original features through a graph regularization. At the same time, semantic labels, when available, are incorporated into the learning procedure. We conjecture that all of these help preserve the most relevant information during the binary quantization process, and hence improve retrieval accuracy. We demonstrate the superior performance of SMFH on three cross-modality visual search benchmarks, i.e., PASCAL-Sentence, Wiki, and NUS-WIDE, with quantitative comparison to various state-of-the-art methods.

AAAI Conference 2016 Conference Paper

Towards Optimal Binary Code Learning via Ordinal Embedding

  • Hong Liu
  • Rongrong Ji
  • Yongjian Wu
  • Wei Liu

Binary code learning, a.k.a. hashing, has recently become popular due to its high efficiency in large-scale similarity search and recognition. It typically maps high-dimensional data points to binary codes, where data similarity can be efficiently computed via rapid Hamming distance. Most existing unsupervised hashing schemes pursue binary codes by reducing the quantization error from the original real-valued data space to the resulting Hamming space, while most existing supervised schemes constrain binary code learning to correlate with pairwise similarity labels. However, few methods consider ordinal relations in the binary code learning process, even though such relations serve as a very significant cue for learning the optimal binary codes for similarity search. In this paper, we propose a novel hashing scheme, dubbed Ordinal Embedding Hashing (OEH), which embeds given ordinal relations among data points to learn ranking-preserving binary codes. The core idea is to construct a directed unweighted graph to capture the ordinal relations, and then train the hash functions using this ordinal graph to preserve the permutation relations in the Hamming space. To learn such hash functions effectively, we further relax the discrete constraints and design a stochastic gradient descent algorithm to obtain the optimal solution. Experimental results on two large-scale benchmark datasets demonstrate that the proposed OEH method achieves superior performance over state-of-the-art approaches. Finally, an evaluation on a query-by-humming dataset demonstrates that OEH also performs well for music retrieval from a user's humming or singing.

TIST Journal 2015 Journal Article

Exploring Spatial Correlation for Visual Object Retrieval

  • Miaojing Shi
  • Xinghai Sun
  • Dacheng Tao
  • Chao Xu
  • George Baciu
  • Hong Liu

Bag-of-visual-words (BOVW)-based image representation has received intense attention in recent years and has improved content-based image retrieval (CBIR) significantly. BOVW does not consider the spatial correlation between visual words in natural images and thus biases the generated visual words toward noise when the corresponding visual features are not stable. This article outlines the construction of a visual word co-occurrence matrix by exploring visual word co-occurrence extracted from small affine-invariant regions in a large collection of natural images. Based on this co-occurrence matrix, we first present a novel high-order predictor to accelerate the generation of spatially correlated visual words and a penalty tree (PTree) to continue generating the words after the prediction. Subsequently, we propose two methods of co-occurrence weighting similarity measure for image ranking: Co-Cosine and Co-TFIDF. These two new schemes down-weight the contributions of the words that are less discriminative because of frequent co-occurrences with other words. We conduct experiments on Oxford and Paris Building datasets, in which the ImageNet dataset is used to implement a large-scale evaluation. Cross-dataset evaluations between the Oxford and Paris datasets and Oxford and Holidays datasets are also provided. Thorough experimental results suggest that our method outperforms the state of the art without adding much additional cost to the BOVW model.

ICRA Conference 2010 Conference Paper

A salient feature and scene semantics based attention model for human tracking on mobile robots

  • Hong Liu
  • Huijun He

It is a great challenge to perform robust tracking for a mobile robot owing to dynamic environments. Fast motion or abrupt jerks of the robotic camera also pose a severe threat to continuous tracking. To address these problems, a novel attention model motivated by the human attention mechanism is proposed, which consists of low-level salient features and high-level scene semantics. The low-level layer extracts color and motion features to obtain a combined feature probability map. At the semantic level, the ADM (attention distribution map) is computed by applying an attenuation function, motivated by human foveal vision, to the combined feature map. The object position is found in the ADM using the CAMSHIFT algorithm. This layer also generates a region-based SSG (scene semantics graph). When the robot moves abnormally, the model detects candidate regions in the color saliency map; attention then shifts from one region to the next, checking each by elastically matching the SSG until the target is recovered. Experiments in several kinds of environments give promising results and show that this model is robust for mobile robotic tracking: whether the camera moves steadily, somewhat quickly, or jerks very abruptly, it maintains continuous tracking.
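
A plausible sketch of the attenuation step, assuming a Gaussian falloff centered on the previously attended position to mimic foveal vision (the paper describes the attenuation function qualitatively, so the form below is an assumption):

```python
# Modulate a feature probability map with a foveal-style Gaussian falloff.
import numpy as np

def attention_distribution_map(feature_map, center, sigma=20.0):
    h, w = feature_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    falloff = np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2)
                     / (2 * sigma ** 2))          # sharp at the "fovea"
    return feature_map * falloff                  # fades toward the periphery

adm = attention_distribution_map(np.random.rand(240, 320), center=(120, 160))
```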

AAAI Conference 2010 Conference Paper

Myopic Policies for Budgeted Optimization with Constrained Experiments

  • Javad Azimi
  • Xiaoli Fern
  • Alan Fern
  • Elizabeth Burrows
  • Frank Chaplen
  • Yanzhen Fan
  • Hong Liu
  • Jun Jaio

Motivated by a real-world problem, we study a novel budgeted optimization problem where the goal is to optimize an unknown function f(x) given a budget. In our setting, it is not practical to request samples of f(x) at precise input values due to the formidable cost of precise experimental setup. Rather, we may request a constrained experiment, which is a subset r of the input space for which the experimenter returns x ∈ r and f(x). Importantly, as the constraints become looser, the experimental cost decreases, but the uncertainty about the location x of the next observation increases. Our goal is to manage this trade-off by selecting a sequence of constrained experiments to best optimize f within the budget. We introduce cost-sensitive policies for selecting constrained experiments using both model-free and model-based approaches, inspired by policies for unconstrained settings. Experiments on synthetic functions and functions derived from real-world experimental data indicate that our policies outperform random selection, that the model-based policies are superior to model-free ones, and give insights into which policies are preferable overall.

IROS Conference 1993 Conference Paper

CAD-based 3D robot vision

  • Nanning Zheng 0001
  • Xiao-Dong Fu
  • Hong Liu

This paper describes a technique for developing a CAD model-based 3-D robot vision system which can be used for recognizing and assembling parts or objects on an automated assembly line. A notable feature of the system is that a single eye-on-hand configuration can be used for computing disparity data and stereo matching between two 2-D images obtained by an accurately moving camera mounted on the end-arm of the robot. An approach to stereo matching based on edge relations is proposed. The image linear features and edge-relation set are translated into 3-D space, and geometric models residing in a database are then used to obtain possible solutions. A novel method of computing sparse depth information is developed for matching two 3-D objects. Experimental results have shown the feasibility and effectiveness of the proposed technique. The system has been successfully implemented for recognizing a class of industrial parts.