Arrow Research search

Author name cluster

Qiang Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers
2 author rows

Possible papers (57)

AAAI Conference 2026 Conference Paper

Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation

  • Shengqian Zhu
  • Chengrong Yu
  • Qiang Wang
  • Ying Song
  • Guangjun Li
  • Jiafei Wu
  • Xiaogang Xu
  • Zhang Yi

Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class annotations. However, existing methods either 1) adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, or 2) focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate these issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity across spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading cues from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global historical prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms current state-of-the-art methods, highlighting its robustness and generalization capabilities.
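
The calibration idea in PGCD, weighting distillation by prototype-to-feature similarity, can be sketched in a few lines. The function below is an illustrative reading of the abstract, not the paper's implementation; the cosine weighting and clipping are assumptions.

```python
import numpy as np

def pgcd_weights(features, prototypes, labels):
    """Illustrative sketch: per-pixel distillation weights from the cosine
    similarity between each pixel feature and its old-class prototype.
    High similarity -> reliable old knowledge, distill strongly;
    low similarity -> likely misleading cue, suppress."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos = np.sum(f * p[labels], axis=1)   # similarity to the assigned prototype
    return np.clip(cos, 0.0, 1.0)

protos = np.eye(2)                        # two old-class prototypes in R^2
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
weights = pgcd_weights(feats, protos, labels=np.array([0, 1, 1]))
```

The third pixel matches its assigned prototype poorly, so its distillation weight collapses to zero, which is exactly the "suppressing misleading cues" behavior the abstract describes.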

AAAI Conference 2026 Conference Paper

D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos

  • Wenkang Zhang
  • Yan Zhao
  • Qiang Wang
  • Zhixin Xu
  • Li Song
  • Zhengxue Cheng

Free-Viewpoint Video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representation remains a major challenge. Existing dynamic 3D Gaussian Splatting methods couple reconstruction with optimization-dependent compression and customized motion formats, limiting generalization and standardization. To address this, we propose D-FCGS, a novel Feedforward Compression framework for Dynamic Gaussian Splatting. Key innovations include: (1) a standardized Group-of-Frames (GoF) structure with I-P coding, leveraging sparse control points to extract inter-frame motion tensors; (2) a dual prior-aware entropy model that fuses hyperprior and spatial-temporal priors for accurate rate estimation; (3) a control-point-guided motion compensation mechanism and refinement network to enhance view-consistent fidelity. Trained on Gaussian frames derived from multi-view videos, D-FCGS generalizes across diverse scenes in a zero-shot fashion. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression compared to the baseline while preserving visual quality across viewpoints. This work advances feedforward compression of dynamic 3DGS, facilitating scalable FVV transmission and storage for immersive applications.

AAAI Conference 2026 Conference Paper

EC-MVSNet: Enhanced Cascaded Multi-View Stereo with Cross-Scale Relevance Integration

  • Shaoqian Wang
  • Jiadai Sun
  • Bin Fan
  • Qiang Wang
  • Bin Lu
  • Yuchao Dai

Cascade-based multi-scale architectures are currently the mainstream in Multi-view Stereo (MVS), achieving a balance between computational efficiency and reconstruction accuracy. However, existing cascade MVS methods suffer from significant limitations in cross-scale information utilization, where depth estimation processes operate independently across scales without fully exploiting the rich relevance between adjacent scales. To address this fundamental limitation, we propose an Enhanced Cascade Multi-View Stereo framework (EC-MVSNet), which introduces a novel cross-scale relevance integration strategy. Specifically, we introduce a Cross-Scale Feature-based Joint Construction (CFC) module to synergistically combine features from adjacent scales to build more reliable cost volumes. Additionally, a Cross-Scale Probability-guided Enhancement (CPE) module is proposed to propagate depth probability distributions across scales to guide cost volume enhancement. Furthermore, we propose a Monocular Feature-based Refinement (MFR) module to further enhance depth prediction accuracy by leveraging monocular priors. Extensive experiments demonstrate that EC-MVSNet achieves state-of-the-art performance on multiple benchmarks, validating the effectiveness of the cross-scale integration in improving MVS reconstruction quality.

AAAI Conference 2026 Conference Paper

GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery

  • Jizhou Han
  • Chenhao Ding
  • Songlin Dong
  • Yuhang He
  • Shaokun Wang
  • Qiang Wang
  • Yihong Gong

Continual Generalized Category Discovery (C-GCD) requires identifying novel classes from unlabeled data while retaining knowledge of known classes over time. Existing methods typically update classifier weights dynamically, resulting in forgetting and inconsistent feature alignment. We propose GOAL, a unified framework that introduces a fixed Equiangular Tight Frame (ETF) classifier to impose a consistent geometric structure throughout learning. GOAL conducts supervised alignment for labeled samples and confidence-guided alignment for novel samples, enabling stable integration of new classes without disrupting old ones. Experiments on four benchmarks show that GOAL outperforms prior methods, reducing forgetting by 16.1% and boosting novel class discovery by 3.2%, establishing a strong solution for long-horizon continual discovery.
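
The fixed ETF classifier at the heart of GOAL has a closed form: K unit-norm prototypes in R^d (d >= K) whose pairwise cosine similarity is exactly -1/(K-1). A minimal construction sketch follows; the QR-based orthonormal basis is one common choice, not necessarily the paper's.

```python
import numpy as np

def etf_prototypes(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Simplex ETF of K unit-norm class prototypes in R^d (requires d >= K);
    every pair of prototypes has cosine similarity exactly -1/(K-1)."""
    K = num_classes
    rng = np.random.default_rng(seed)
    # Orthonormal basis U (d x K) via reduced QR of a random Gaussian matrix.
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))
    # M = sqrt(K/(K-1)) * U @ (I - 11^T / K): centered, rescaled simplex.
    M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)
    return M

P = etf_prototypes(num_classes=5, dim=16)  # columns are class prototypes
G = P.T @ P  # Gram matrix: 1 on the diagonal, -1/(K-1) = -0.25 off-diagonal
```

Because the prototypes are fixed before training, the geometric target never drifts as novel classes arrive, which is what gives the method its consistent alignment objective.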

AAAI Conference 2026 Conference Paper

PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

  • Sijie Wang
  • Qiang Wang
  • Shaohuai Shi

Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing the inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the VAE module into two GPU groups whose executions can also be pipelined to reduce the memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06× to 4.02× speedups over OpenSoraPlan and HunyuanVideo.
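
A back-of-the-envelope model shows why decoupling the diffusion and VAE modules into two pipelined GPU groups helps: chunk i's VAE decode overlaps chunk i+1's diffusion step. The timings below are hypothetical, and the function is a generic two-stage pipeline makespan model, not PipeDiT's scheduler.

```python
def two_stage_makespan(num_chunks, t_stage1, t_stage2):
    """Makespan of a two-stage pipeline where stage 1 (e.g., diffusion) and
    stage 2 (e.g., VAE decode) run on separate GPU groups and overlap."""
    finish1 = finish2 = 0.0
    for _ in range(num_chunks):
        finish1 += t_stage1                          # stage 1 runs back-to-back
        finish2 = max(finish2, finish1) + t_stage2   # stage 2 waits for its input
    return finish2

serial = 8 * (1.0 + 0.5)                      # no overlap: 12.0 time units
pipelined = two_stage_makespan(8, 1.0, 0.5)   # with overlap: 8.5 time units
```

With the VAE stage fully hidden behind diffusion except for the last chunk, the saving grows with the number of chunks, which is consistent with the latency reductions the abstract reports.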

AAAI Conference 2026 Conference Paper

SALR: Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

  • Longteng Zhang
  • Sen Wu
  • Shuai Hou
  • Zhengyu Qing
  • Zhuo Zheng
  • Danning Ke
  • Qihong Lin
  • Qiang Wang

Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA’s performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of (1 - r/min(d, k)). To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by 2x, and delivers up to a 1.7x inference speedup.
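
The SALR recipe, static magnitude pruning of the frozen base weights plus a truncated-SVD adapter that recovers the discarded residual, can be sketched directly. Everything below (function name, shapes, 50% sparsity, rank 8) is illustrative, not the paper's API.

```python
import numpy as np

def prune_with_lowrank_residual(W, sparsity=0.5, rank=8):
    """Hypothetical sketch of the SALR recipe: statically magnitude-prune the
    frozen base weight, then recover the discarded residual with a
    truncated-SVD low-rank adapter."""
    # 1) Static magnitude pruning of the base weights.
    thresh = np.quantile(np.abs(W), sparsity)
    W_sparse = np.where(np.abs(W) >= thresh, W, 0.0)
    # 2) Truncated SVD of the pruning residual yields the adapter factors.
    R = W - W_sparse
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # d x r, absorbs the singular values
    B = Vt[:rank, :]             # r x k
    return W_sparse, A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_sparse, A, B = prune_with_lowrank_residual(W)
err_pruned = np.linalg.norm(W - W_sparse)           # sparse weights alone
err_salr = np.linalg.norm(W - (W_sparse + A @ B))   # sparse + low-rank adapter
```

The low-rank term strictly reduces the reconstruction error relative to pruning alone, the same direction as the abstract's per-entry MSE bound.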

AAAI Conference 2026 Conference Paper

Shared & Domain Self-Adaptive Experts with Frequency-Aware Discrimination for Continual Test-Time Adaptation

  • Jianchao Zhao
  • Chenhao Ding
  • Songlin Dong
  • Jiangyang Li
  • Qiang Wang
  • Yuhang He
  • Yihong Gong

This paper focuses on the Continual Test-Time Adaptation (CTTA) task, aiming to enable an agent to continuously adapt to evolving target domains while retaining previously acquired domain knowledge for effective reuse when those domains reappear. Existing shared-parameter paradigms struggle to balance adaptation and forgetting, leading to decreased efficiency and stability. To address this, we propose a frequency-aware shared and self-adaptive expert framework, consisting of two key components: (i) a dual-branch expert architecture that extracts general features and dynamically models domain-specific representations, effectively reducing cross-domain interference and repetitive learning cost; and (ii) an online Frequency-aware Domain Discriminator (FDD), which leverages the robustness of low-frequency image signals for online domain shift detection, guiding dynamic allocation of expert resources for more stable and realistic adaptation. Additionally, we introduce a Continual Repeated Shifts (CRS) benchmark to simulate periodic domain changes for more realistic evaluation. Experimental results show that our method consistently outperforms existing approaches on both classification and segmentation CTTA tasks under standard and CRS settings, with ablations and visualizations confirming its effectiveness and robustness.

IROS Conference 2025 Conference Paper

Applicability Analysis for Optical Cooperative Localization

  • Yixian Li
  • Qiang Wang
  • Jiaxing Wu
  • Wuhong Zhao
  • Shengrong Hu
  • Zhonghu Hao

For optical cooperative localization, which employs optical beacons with prior features as cooperative targets, a fundamental prerequisite is to ensure that the beacons are always captured by the vision sensors during the entire localization process. In other words, there is an applicability issue of optical cooperative localization with respect to the relative range between beacons and vision sensors, whereas the corresponding analysis method has so far remained a gap. In this work, we propose a general applicability analysis method for optical cooperative localization to fill this gap. We translate this problem into constructing a multi-constraint model incorporating geometrics and radiometrics for describing the relationship between optical sensor parameters and relative range or depth. For parameterized beacons and vision sensors, the geometric constraint is related to the imaging quantities and the radiometric constraint is determined by the radiation properties. Numerical evaluations are performed based on the range of parameters in practice, and real-world experiments are conducted to validate the effectiveness of the proposed applicability analysis. The results demonstrate the effectiveness of the proposed applicability analysis method and are instructive for real-world deployment of optical cooperative localization.

NeurIPS Conference 2025 Conference Paper

Consistent Supervised-Unsupervised Alignment for Generalized Category Discovery

  • Jizhou Han
  • Shaokun Wang
  • Yuhang He
  • Chenhao Ding
  • Qiang Wang
  • Xinyuan Gao
  • Songlin Dong
  • Yihong Gong

Generalized Category Discovery (GCD) focuses on classifying known categories while simultaneously discovering novel categories from unlabeled data. However, previous GCD methods face challenges due to inconsistent optimization objectives and category confusion. This leads to feature overlap and ultimately hinders performance on novel categories. To address these issues, we propose the Neural Collapse-inspired Generalized Category Discovery (NC-GCD) framework. By pre-assigning and fixing Equiangular Tight Frame (ETF) prototypes, our method ensures an optimal geometric structure and a consistent optimization objective for both known and novel categories. We introduce a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and enhances category separability. Additionally, a Semantic Consistency Matcher (SCM) is designed to maintain stable and consistent label assignments across clustering iterations. Our method significantly enhances novel category accuracy, demonstrating its effectiveness.

AAAI Conference 2025 Conference Paper

DualCP: Rehearsal-Free Domain-Incremental Learning via Dual-Level Concept Prototype

  • Qiang Wang
  • Yuhang He
  • Songlin Dong
  • Xiang Song
  • Jizhou Han
  • Haoyu Luo
  • Yihong Gong

Domain-Incremental Learning (DIL) enables vision models to adapt to changing conditions in real-world environments while maintaining the knowledge acquired from previous domains. Given privacy concerns and training time, Rehearsal-Free DIL (RFDIL) is more practical. Inspired by the incremental cognitive process of the human brain, we design Dual-level Concept Prototypes (DualCP) for each class to address the conflict between learning new knowledge and retaining old knowledge in RFDIL. To construct DualCP, we propose a Concept Prototype Generator (CPG) that generates both coarse-grained and fine-grained prototypes for each class. Additionally, we introduce a Coarse-to-Fine calibrator (C2F) to align image features with DualCP. Finally, we propose a Dual Dot-Regression (DDR) loss function to optimize our C2F module. Extensive experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of our method.

AAAI Conference 2025 Conference Paper

Flexible Sharpness-Aware Personalized Federated Learning

  • Xinda Xing
  • Qiugang Zhan
  • Xiurui Xie
  • Yuning Yang
  • Qiang Wang
  • Guisong Liu

Personalized federated learning (PFL) is a new paradigm to address the statistical heterogeneity problem in federated learning. Most existing PFL methods focus on leveraging global and local information through techniques such as model interpolation or parameter decoupling. However, these methods often overlook the generalization potential during local client learning. From a local optimization perspective, we propose a simple and general PFL method, Federated learning with Flexible Sharpness-Aware Minimization (FedFSA). Specifically, we emphasize the importance of applying a larger perturbation to critical layers of the local model when using the Sharpness-Aware Minimization (SAM) optimizer. We then design a metric, perturbation sensitivity, to estimate the layer-wise sharpness of each local model. Based on this metric, FedFSA can flexibly select the layers with the highest sharpness and apply larger perturbations to them. Extensive experiments are conducted on four datasets with two types of statistical heterogeneity for image classification. The results show that FedFSA outperforms seven state-of-the-art baselines by up to 8.26% in test accuracy. Moreover, FedFSA can be applied to different model architectures and easily integrated into other federated learning methods, achieving a 4.45% improvement.
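
A toy version of the perturbation-sensitivity idea: perturb one layer at a time and record the loss increase, then give the sharpest layers the larger SAM perturbation. The metric below is an assumption in the spirit of the abstract, and the two-layer quadratic "model" is hypothetical.

```python
import numpy as np

def perturbation_sensitivity(loss_fn, params, rho=0.05, seed=0):
    """Toy layer-wise sharpness proxy: perturb each layer by a random
    direction of norm rho and record the loss increase (the paper's exact
    formulation may differ)."""
    rng = np.random.default_rng(seed)
    base = loss_fn(params)
    sens = []
    for i, p in enumerate(params):
        d = rng.standard_normal(p.shape)
        d = rho * d / np.linalg.norm(d)
        perturbed = [q + d if j == i else q for j, q in enumerate(params)]
        sens.append(loss_fn(perturbed) - base)
    return np.array(sens)

# Hypothetical two-layer model: a sharp first layer and a flat second one.
loss_fn = lambda ps: 10.0 * np.sum(ps[0] ** 2) + 1.0 * np.sum(ps[1] ** 2)
params = [np.zeros(5), np.zeros(5)]
sens = perturbation_sensitivity(loss_fn, params)
sharpest = int(np.argmax(sens))  # layer 0 would get the larger perturbation
```

The sharper layer (larger curvature) produces the larger loss increase for the same perturbation norm, which is the signal FedFSA uses to allocate its layer-wise perturbations.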

AAAI Conference 2025 Conference Paper

OAMaskFlow: Occlusion-Aware Motion Mask for Scene Flow

  • Xiongfeng Peng
  • Zhihua Liu
  • Weiming Li
  • Yamin Mao
  • Qiang Wang

Scene flow estimation methods have made significant progress by estimating pixel-wise 3D motion, implicitly learning a motion embedding within an end-to-end differentiable optimization framework. However, the implicitly learned motion embedding is insufficient for grouping pixels into rigid objects in challenging regions, such as those with occlusion or inconsistent multi-view geometric properties. To address this issue, we propose a novel scene flow estimation method called OAMaskFlow, which has three novelties. First, we propose the concept of an occlusion-aware motion (OAM) mask and generate its ground-truth annotation through photometric and geometric consistency. Second, we propose to supervise the motion embedding with the OAM mask to learn an informative and reliable motion representation of the scene. Finally, a 3D motion propagation module is proposed to propagate high-quality 3D motion from reliable pixels to challenging occluded regions. Experiments show that OAMaskFlow reduces the EPE3D metric by 21.0% on the FlyingThings3D dataset and the SF-all metric by 24.3% on the KITTI scene flow benchmark compared to the baseline method RAFT-3D. Furthermore, we apply the proposed OAM mask in simultaneous localization and mapping (SLAM) to improve the state-of-the-art method DROID-SLAM: the ATE metric decreases by 65.7% and 58.3% on the TartanAir monocular and stereo datasets, respectively.

AAAI Conference 2025 Conference Paper

ParZC: Parametric Zero-Cost Proxies for Efficient NAS

  • Peijie Dong
  • Lujun Li
  • Zhenheng Tang
  • Xiang Liu
  • Zimian Wei
  • Qiang Wang
  • Xiaowen Chu

Recent advancements in Zero-shot Neural Architecture Search (NAS) highlight the ability of zero-cost proxies to identify superior architectures. However, we identify a critical issue with current zero-cost proxies: they aggregate node-wise zero-cost statistics without considering that not all nodes in a neural network equally impact performance estimation. Our observations reveal that node-wise zero-cost statistics vary significantly in their contributions to performance, with each node exhibiting a degree of uncertainty. Based on this insight, we introduce the Parametric Zero-Cost Proxies (ParZC) framework to enhance the adaptability of zero-cost proxies through parameterization. To address node indiscrimination, we propose a Mixer Architecture with Bayesian Network (MABN) to explore the node-wise zero-cost statistics and estimate node-specific uncertainty. Moreover, we propose DiffKendall as a loss function to improve ranking consistency. Comprehensive experiments on NAS-Bench-101, 201, and NDS demonstrate the superiority of our proposed ParZC compared to existing zero-shot NAS methods. Additionally, we demonstrate the versatility and adaptability of ParZC on the Vision Transformer search space.
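
A differentiable ranking-consistency loss in the DiffKendall spirit can be sketched by smoothing the sign comparisons of Kendall's tau with tanh. The smoothing constant and this exact surrogate are assumptions; the idea is only that proxy scores should order architectures the same way their true performance does.

```python
import numpy as np

def diff_kendall(pred, target, alpha=10.0):
    """Sketch of a differentiable surrogate for Kendall's tau between proxy
    scores (pred) and ground-truth performance (target): concordant pairs
    contribute about +1, discordant pairs about -1, smoothed via tanh."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    i, j = np.triu_indices(len(pred), k=1)   # all unordered pairs
    s = np.tanh(alpha * (pred[i] - pred[j])) * np.sign(target[i] - target[j])
    return s.mean()

tau = diff_kendall([0.1, 0.4, 0.9], [1.0, 2.0, 3.0])  # near-perfect ordering
```

Maximizing this value (or minimizing its negative as a loss) pushes the parameterized proxy toward rank agreement with the benchmark accuracies rather than value agreement.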

IROS Conference 2025 Conference Paper

SAFormer: Spatially Adaptive Transformer for Efficient and Multi-Resolution Occupancy Prediction

  • Song Tang
  • Qiang Wang
  • Xiaowen Chu 0001

Accurate and efficient 3D scene understanding from multi-view images remains a fundamental challenge in autonomous driving. Existing methods often struggle with high-dimensional features, leading to excessive computational costs and memory usage. In this paper, we present SAFormer, a novel transformer-based framework for efficient spatially adaptive occupancy prediction. SAFormer incorporates two key techniques to reduce resource consumption: Octree-based Multi-resolution Feature (OMRF) Learning and Spatial-Adaptive Progressive Query (SAPQ). First, OMRF introduces an Octree-based hierarchical structure to compress multi-resolution 3D feature volumes. Second, SAPQ facilitates efficient information flow across different scales while effectively addressing scene sparsity. It employs a region-aware query mechanism that intelligently allocates computational resources, processing safety-critical regions at high resolution while handling background elements at lower resolutions. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance while significantly reducing inference latency (up to 3×) and memory cost (up to 2.9×). Additional experiments on SSCBench-KITTI-360 further validate our approach’s generalizability. Our approach excels in managing scene sparsity and recognizing small, safety-critical objects, highlighting its potential for practical applications in autonomous driving.

AAAI Conference 2024 Conference Paper

CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning

  • Qingsong Yan
  • Qiang Wang
  • Kaiyong Zhao
  • Jie Chen
  • Bo Li
  • Xiaowen Chu
  • Fei Deng

Neural Radiance Fields have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines to provide extrinsic and intrinsic camera parameters, such as COLMAP. Recent works, like NeRFmm, BARF, and L2G-NeRF, directly treat camera parameters as learnable and estimate them through differential volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to tackle the rotation scenario in practice. To overcome this limitation, we propose a novel camera parameter free neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters inspired by incremental structure from motion. Given a sequence of images, CF-NeRF estimates camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset, NeRFBuster, which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to rotation and achieves state-of-the-art results without providing prior information and constraints.

NeurIPS Conference 2024 Conference Paper

Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models

  • Lujun Li
  • Peijie Dong
  • Zhenheng Tang
  • Xiang Liu
  • Qiang Wang
  • Wenhan Luo
  • Wei Xue
  • Qifeng Liu

In this paper, we present DSA, the first automated framework for discovering sparsity allocation schemes for layer-wise pruning in Large Language Models (LLMs). LLMs have become increasingly powerful, but their large parameter counts make them computationally expensive. Existing pruning methods for compressing LLMs primarily focus on evaluating redundancies and removing element-wise weights. However, these methods fail to allocate adaptive layer-wise sparsities, leading to performance degradation in challenging tasks. We observe that per-layer importance statistics can serve as allocation indications, but their effectiveness depends on the allocation function between layers. To address this issue, we develop an expression discovery framework to explore potential allocation strategies. Our allocation functions involve two steps: reducing element-wise metrics to per-layer importance scores, and mapping layer importance to sparsity ratios. To search for the most effective allocation function, we construct a search space consisting of pre-process, reduction, transform, and post-process operations. We leverage an evolutionary algorithm to perform crossover and mutation on superior candidates within the population, guided by performance evaluation. Finally, we seamlessly integrate our discovered functions into various uniform methods, resulting in significant performance improvements. We conduct extensive experiments on multiple challenging tasks such as arithmetic, knowledge reasoning, and multimodal benchmarks spanning GSM8K, MMLU, SQA, and VQA, demonstrating that our DSA method achieves significant performance gains on the LLaMA-1|2|3, Mistral, and OPT models. Notably, the LLaMA-1|2|3 models pruned by our DSA achieve 4.73% | 6.18% | 10.65% gains over the state-of-the-art techniques (e.g., Wanda and SparseGPT).
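
One point in DSA's search space can be made concrete: a transform that maps per-layer importance scores to layer-wise sparsity ratios while preserving a global budget. The exponential mapping below is an illustrative assumption, not the discovered function from the paper.

```python
import numpy as np

def allocate_sparsity(layer_scores, global_sparsity=0.5, temperature=1.0):
    """Illustrative allocation function: more important layers get pruned
    less, with the mean ratio matching a global sparsity budget."""
    s = np.asarray(layer_scores, dtype=float)
    w = np.exp(-s / temperature)          # transform: importance -> keep bias
    w = w / w.mean()                      # mean 1, so the budget is preserved
    return np.clip(global_sparsity * w, 0.0, 0.95)

scores = [2.0, 1.0, 0.5, 0.1]            # hypothetical per-layer importance
ratios = allocate_sparsity(scores, global_sparsity=0.5)
```

A search such as DSA's would mutate the transform (the `exp`, the normalization, the clipping) and keep whichever composition gives the best pruned-model accuracy, rather than fixing one mapping a priori.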

AAAI Conference 2024 Conference Paper

DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

  • Xiaoxuan Yu
  • Hao Wang
  • Weiming Li
  • Qiang Wang
  • Soonyong Cho
  • Younghun Sung

Point scene understanding is a challenging task that processes real-world scene point clouds, aiming to segment each object, estimate its pose, and reconstruct its mesh simultaneously. The recent state-of-the-art method first segments each object and then processes the objects independently through multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries involving their relationship. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make full use of the supervision from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.

ICRA Conference 2024 Conference Paper

DVI-SLAM: A Dual Visual Inertial SLAM Network

  • Xiongfeng Peng
  • Zhihua Liu
  • Weiming Li
  • Ping Tan
  • SoonYong Cho
  • Qiang Wang

Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information and better integrate it with an inertial measurement unit (IMU) in visual SLAM has potential research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both the photometric factor and the re-projection factor into the end-to-end differentiable structure through a multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors, and it can be further extended to include IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error for the monocular and stereo configurations on the EuRoC dataset is reduced by 45.3% and 36.2%, respectively.

NeurIPS Conference 2024 Conference Paper

Is Your HD Map Constructor Reliable under Sensor Corruptions?

  • Xiaoshuai Hao
  • Mengchuan Wei
  • Yifan Yang
  • Haimei Zhao
  • Hui Zhang
  • Yi Zhou
  • Qiang Wang
  • Weiming Li

Driving systems often rely on high-definition (HD) maps for precise environmental information, which is crucial for planning and navigation. While current HD map constructors perform well under ideal conditions, their resilience to real-world challenges, e.g., adverse weather and sensor failures, is not well understood, raising safety concerns. This work introduces MapBench, the first comprehensive benchmark designed to evaluate the robustness of HD map construction methods against various sensor corruptions. Our benchmark encompasses a total of 29 types of corruptions that occur from cameras and LiDAR sensors. Extensive evaluations across 31 HD map constructors reveal significant performance degradation of existing methods under adverse weather conditions and sensor failures, underscoring critical safety concerns. We identify effective strategies for enhancing robustness, including innovative approaches that leverage multi-modal fusion, advanced data augmentation, and architectural techniques. These insights provide a pathway for developing more reliable HD map construction methods, which are essential for the advancement of autonomous driving technology. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.

AAAI Conference 2024 Conference Paper

Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation

  • Yasi Wang
  • Hong Liu
  • Chao Zhang
  • Lu Xu
  • Qiang Wang

Homography estimation is a fundamental problem in computer vision. Previous works mainly focus on estimating either a single homography, or multiple homographies based on a mesh-grid division of the image. In practical scenarios, a single homography is inadequate and often leads to a compromised result across multiple planes, while mesh-grid multi-homography damages the plane distribution of the scene and does not fully address the restrictions of using homography. In this work, we propose a novel semantics-guided multi-homography estimation framework, Mask-Homo, to provide an explicit solution to the multi-plane depth disparity problem. First, a pseudo plane mask generation module is designed to obtain multiple correlated regions that follow the plane distribution of the scene. Then, multiple local homography transformations, each of which aligns a correlated region precisely, are predicted, and the corresponding warped images are fused to obtain the final result. Furthermore, a new metric, Mask-PSNR, is proposed for more comprehensive evaluation of alignment. Extensive experiments are conducted to verify the effectiveness of the proposed method. Our code is available at https://github.com/SAITPublic/MaskHomo.

AAAI Conference 2024 Conference Paper

MmAP: Multi-Modal Alignment Prompt for Cross-Domain Multi-Task Learning

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Ke Yan
  • Shouhong Ding

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns text and visual modalities during the fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assigns a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign a task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while utilizing only approximately 0.09% of trainable parameters.

AAAI Conference 2024 Conference Paper

VMT-Adapter: Parameter-Efficient Transfer Learning for Multi-Task Dense Scene Understanding

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Zhiwen Lin
  • Ke Yan

Large-scale pre-trained models have achieved remarkable success in various computer vision tasks. A standard approach to leveraging these models is to fine-tune all model parameters for downstream tasks, which poses challenges in terms of computational and storage costs. Recently, inspired by Natural Language Processing (NLP), parameter-efficient transfer learning has been successfully applied to vision tasks. However, most existing techniques primarily focus on single-task adaptation, and despite limited research on multi-task adaptation, these methods often exhibit suboptimal training/inference efficiency. In this paper, we first propose a once-for-all Vision Multi-Task Adapter (VMT-Adapter), which achieves approximately O(1) training and inference efficiency with respect to the number of tasks. Concretely, VMT-Adapter shares knowledge across multiple tasks to enhance cross-task interaction while preserving task-specific knowledge via independent knowledge extraction modules. Notably, since the task-specific modules require few parameters, VMT-Adapter can handle an arbitrary number of tasks with a negligible increase in trainable parameters. We also propose VMT-Adapter-Lite, which further reduces the trainable parameters by learning shared parameters between the down- and up-projections. Extensive experiments on four dense scene understanding tasks demonstrate the superiority of VMT-Adapter(-Lite), achieving a 3.96% (1.34%) relative improvement compared to single-task full fine-tuning, while utilizing merely ~1% (0.36%) of the pre-trained model's trainable parameters.

NeurIPS Conference 2023 Conference Paper

BadTrack: A Poison-Only Backdoor Attack on Visual Object Tracking

  • Bin Huang
  • Jiaqian Yu
  • Yiwei Chen
  • Siyang Pan
  • Qiang Wang
  • Zhi Wang

Visual object tracking (VOT) is one of the most fundamental tasks in the computer vision community. State-of-the-art VOT trackers extract positive and negative examples that are used to guide the tracker to distinguish the object from the background. In this paper, we show that this characteristic can be exploited to introduce new threats, and we hence propose a simple yet effective poison-only backdoor attack. To be specific, we poison a small part of the training data by attaching a predefined trigger pattern to the background region of each video frame, so that the trigger appears almost exclusively in the extracted negative examples. To the best of our knowledge, this is the first work that reveals the threat of poison-only backdoor attacks on VOT trackers. We experimentally show that our backdoor attack can significantly degrade the performance of both two-stream Siamese and one-stream Transformer trackers on the poisoned data, while achieving performance comparable to benign trackers on clean data.

JBHI Journal 2023 Journal Article

Multi-Task Learning for Pulmonary Arterial Hypertension Prognosis Prediction Via Memory Drift and Prior Prompt Learning on 3D Chest CT

  • Guanyu Yang
  • Yuting He
  • Yang Lv
  • Yang Chen
  • Jean-Louis Coatrieux
  • Xiaoxuan Sun
  • Qiang Wang
  • Yongyue Wei

Pulmonary arterial hypertension (PAH) prognosis prediction on 3D non-contrast CT images is one of the most important tasks for PAH treatment. It will help clinicians stratify patients into different groups for early diagnosis and timely intervention by automatically extracting potential biomarkers of PAH to predict mortality. However, it remains a task of great challenges due to the large volume and low-contrast regions of interest in 3D chest CT images. In this paper, we propose the first multi-task learning-based PAH prognosis prediction framework, P²-Net, which effectively optimizes the model and powerfully represents task-dependent features via our Memory Drift (MD) and Prior Prompt Learning (PPL) strategies. 1) Our MD maintains a large memory bank to provide a dense sampling of the deep biomarkers' distribution. Therefore, although the batch size is very small due to the large volume, a reliable (negative log partial) likelihood loss can still be calculated on a representative probability distribution for robust optimization. 2) Our PPL simultaneously learns an additional manual biomarker prediction task to embed clinical prior knowledge into our deep prognosis prediction task in both hidden and explicit ways. Therefore, it prompts the prediction of deep biomarkers and improves the perception of task-dependent features in our low-contrast regions. Our P²-Net achieves a high prognostic correlation and strong generalization, with the highest C-index of 70.19% and an HR of 2.14. Extensive experiments with promising results on PAH prognosis prediction reveal powerful prognostic performance and great clinical significance for PAH treatment. All of our code will be made publicly available online.

AAAI Conference 2023 Conference Paper

Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity

  • Qingsong Yan
  • Qiang Wang
  • Kaiyong Zhao
  • Bo Li
  • Xiaowen Chu
  • Fei Deng

Existing learning-based multi-view stereo (MVS) methods rely on the depth range to build the 3D cost volume and may fail when the range is too large or unreliable. To address this problem, we propose a disparity-based MVS method based on the epipolar disparity flow (E-flow), called DispMVS, which infers depth information from the pixel movement between two views. The core of DispMVS is to construct a 2D cost volume on the image plane along the epipolar line for each pair (the reference image with each of several source images) for pixel matching, and to fuse the depths triangulated from each pair via multi-view geometry to ensure multi-view consistency. To be robust, DispMVS starts from a randomly initialized depth map and iteratively refines it with the help of a coarse-to-fine strategy. Experiments on the DTU MVS and Tanks & Temples datasets show that DispMVS is not sensitive to the depth range and achieves state-of-the-art results with lower GPU memory.

ICLR Conference 2023 Conference Paper

SketchKnitter: Vectorized Sketch Generation with Diffusion Models

  • Qiang Wang
  • Haoge Deng
  • Yonggang Qi
  • Da Li 0001
  • Yi-Zhe Song

We show vectorized sketch generation can be identified as a reversal of the stroke deformation process. This relationship was established by means of a diffusion model that learns data distributions over the stroke-point locations and pen states of real human sketches. Given randomly scattered stroke-points, sketch generation becomes a process of deformation-based denoising, where the generator rectifies positions of stroke points at each timestep to converge at a recognizable sketch. A key innovation was to embed recognizability into the reverse time diffusion process. It was observed that the estimated noise during the reversal process is strongly correlated with sketch classification accuracy. An auxiliary recurrent neural network (RNN) was consequently used to quantify recognizability during data sampling. It follows that, based on the recognizability scores, a sampling shortcut function can also be devised that renders better quality sketches with fewer sampling steps. Finally it is shown that the model can be easily extended to a conditional generation framework, where given incomplete and unfaithful sketches, it yields one that is more visually appealing and with higher recognizability.

AAAI Conference 2023 Conference Paper

Towards Reliable Neural Machine Translation with Consistency-Aware Meta-Learning

  • Rongxiang Weng
  • Qiang Wang
  • Wensen Cheng
  • Changfeng Zhu
  • Min Zhang

Neural machine translation (NMT) has achieved remarkable success in producing high-quality translations. However, current NMT systems suffer from a lack of reliability, as their outputs are often affected by lexical or syntactic changes in the inputs, resulting in large variations in quality. This limitation hinders the practicality and trustworthiness of NMT. A contributing factor to this problem is that NMT models trained with the one-to-one paradigm struggle to handle the source diversity phenomenon, where inputs with the same meaning can be expressed differently. In this work, we treat this problem as a bilevel optimization problem and present a consistency-aware meta-learning (CAML) framework derived from the model-agnostic meta-learning (MAML) algorithm to address it. Specifically, the NMT model with CAML (named CoNMT) first learns a consistent meta representation of semantically equivalent sentences in the outer loop. Subsequently, a mapping from the meta representation to the output sentence is learned in the inner loop, allowing the NMT model to translate semantically equivalent sentences to the same target sentence. We conduct experiments on the NIST Chinese-to-English task, three WMT translation tasks, and the TED M2O task. The results demonstrate that CoNMT effectively improves overall translation quality and reliably handles diverse inputs.

IJCAI Conference 2022 Conference Paper

Learning from Students: Online Contrastive Distillation Network for General Continual Learning

  • Jin Li
  • Zhong Ji
  • Gang Wang
  • Qiang Wang
  • Feng Gao

The goal of General Continual Learning (GCL) is to preserve learned knowledge and learn new knowledge with constant memory from an infinite data stream where task boundaries are blurry. Distilling the model's responses on reserved samples between the old and the new models is an effective way to achieve promising performance on GCL. However, it accumulates the old model's inherent response bias and is not robust to model changes. To this end, we propose an Online Contrastive Distillation Network (OCD-Net) to tackle these problems, which explores the merits of the student model at each time step to guide the training process of the student model. Concretely, the teacher model is devised to help the student model consolidate the learned knowledge, and is trained online by integrating the model weights of the student model to accumulate the new knowledge. Moreover, our OCD-Net incorporates both relation and adaptive responses to help the student model alleviate catastrophic forgetting, which also helps the teacher model preserve the learned knowledge. Extensive experiments on six benchmark datasets demonstrate that our proposed OCD-Net significantly outperforms state-of-the-art approaches by 3.26%~8.71% with various buffer sizes. Our code is available at https://github.com/lijincm/OCD-Net.
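The online teacher described above, "trained online by integrating the model weights of the student model", is commonly realized as an exponential moving average (EMA) of the student's weights. A minimal sketch of that weight-integration idea, assuming plain NumPy arrays as weights (an illustration, not the authors' code):

```python
import numpy as np

def ema_update(teacher: dict, student: dict, momentum: float = 0.999) -> dict:
    """Integrate student weights into the teacher after each training step:
    theta_T <- m * theta_T + (1 - m) * theta_S."""
    for name, w_s in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * w_s
    return teacher

# Toy two-layer "models" as name -> weight-array dicts (hypothetical names).
student = {"w1": np.ones((2, 2)), "w2": np.full((2,), 2.0)}
teacher = {k: np.zeros_like(v) for k, v in student.items()}

for _ in range(3):  # three training steps; the student is held fixed here
    teacher = ema_update(teacher, student, momentum=0.5)
# With m=0.5 and a fixed student, after 3 steps the teacher holds
# (1 - 0.5**3) = 0.875 of the student's weights.
```

A high momentum (e.g. 0.999) makes the teacher a slowly drifting, stabilized copy of the student, which is what lets it consolidate old knowledge while still absorbing new updates.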

IJCAI Conference 2021 Conference Paper

Dynamic Rebalancing Dockless Bike-Sharing System based on Station Community Discovery

  • Jingjing Li
  • Qiang Wang
  • Wenqi Zhang
  • Donghai Shi
  • Zhiwei Qin

Influenced by the era of the sharing economy and mobile payment, the Dockless Bike-Sharing System (Dockless BSS) is expanding in many major cities. The mobility of users constantly leads to supply-demand imbalance, which seriously affects total profit and customer satisfaction. In this paper, we propose the Spatio-Temporal Mixed Integer Program (STMIP) with a Flow-graphed Community Discovery (FCD) approach to rebalance the system. Different from existing studies that ignore the routes of trucks and adopt centralized rebalancing, our approach considers the spatio-temporal information of trucks and discovers station communities for truck-based rebalancing. First, we propose the FCD algorithm to detect station communities. Significantly, rebalancing by communities decomposes the centralized system into a distributed multi-community system. Then, by considering the routing and velocity of trucks, we design the STMIP model, with the objective of maximizing total profit, to find a repositioning policy for each station community. We design a simulator built on real-world data from DiDi Chuxing to test the algorithm's performance. The extensive experimental results demonstrate that our approach outperforms the state-of-the-art approach in terms of service level, profit, and complexity.

JBHI Journal 2021 Journal Article

Effective Brain State Estimation During Propofol-Induced Sedation Using Advanced EEG Microstate Spectral Analysis

  • Yamin Li
  • Wen Shi
  • Zhian Liu
  • Jing Li
  • Qiang Wang
  • Xiangguo Yan
  • Zehong Cao
  • Gang Wang

Brain states are patterns of neuronal synchrony, and the electroencephalogram (EEG) microstate provides a promising tool to characterize and analyze synchronous neural firing. However, the topographical spectral information for each predominant microstate is still unclear during switches of consciousness, such as sedation, and the practical usage of the EEG microstate is worth probing. Also, the mechanism behind anesthetic-induced alterations of brain states remains poorly understood. In this study, an advanced EEG microstate spectral analysis was utilized using multivariate empirical mode decomposition in the Hilbert-Huang transform. Its practicability was further investigated in scalp EEG recordings during the propofol-induced transition of consciousness. The transition from the awake baseline to moderate sedation was accompanied by apparent increases in microstate (A, B, and F) energy, especially in the whole-brain delta band, frontal alpha band, and beta band. In comparison to other effective EEG-based parameters that are commonly used to measure anesthetic depth, using the selected spectral features reached better performance (80% sensitivity, 90% accuracy) in estimating brain states during sedation. The changes in microstate energy also exhibited high correlations with individual behavioral data during sedation. In a nutshell, EEG microstate spectral analysis is an effective method to estimate brain states during propofol-induced sedation, giving great insight into the underlying mechanism. The generated spectral features can be promising markers to dynamically assess the level of consciousness.

AAAI Conference 2021 System Paper

Fashion Focus: Multi-modal Retrieval System for Video Commodity Localization in E-commerce

  • Yanhao Zhang
  • Qiang Wang
  • Pan Pan
  • Yun Zheng
  • Cheng Da
  • Siyang Sun
  • Yinghui Xu

Nowadays, live-stream and short-video shopping in E-commerce have grown exponentially. However, sellers are required to manually match images of the selling products to the timestamps of their exhibition in the untrimmed video, resulting in a complicated process. To solve this problem, we present an innovative demonstration of a multi-modal retrieval system called "Fashion Focus", which can exactly localize the product images in the online video as the focuses. Different modalities contribute to the commodity localization: visual content, linguistic features, and interaction context are jointly investigated via the presented multi-modal learning. Our system employs two procedures for analysis, video content structuring and multi-modal retrieval, to automatically achieve accurate video-to-shop matching. Fashion Focus presents a unified framework that can orient consumers towards relevant product exhibitions while watching videos and help sellers effectively deliver their products via search and recommendation.

ICLR Conference 2020 Conference Paper

Adversarial AutoAugment

  • Xinyu Zhang
  • Qiang Wang
  • Jian Zhang
  • Zhao Zhong

Data augmentation (DA) has been widely utilized to improve generalization in training deep neural networks. Recently, human-designed data augmentation has been gradually replaced by automatically learned augmentation policies. By finding the best policy in a well-designed search space of data augmentations, AutoAugment (Cubuk et al., 2019) can significantly improve validation accuracy on image classification tasks. However, this approach is not computationally practical for large-scale problems. In this paper, we develop an adversarial method to arrive at a computationally affordable solution called Adversarial AutoAugment, which can simultaneously optimize the target-related objective and the augmentation policy search loss. The augmentation policy network attempts to increase the training loss of a target network by generating adversarial augmentation policies, while the target network can learn more robust features from harder examples to improve generalization. In contrast to prior work, we reuse the computation in target network training for policy evaluation and dispense with retraining of the target network. Compared to AutoAugment, this leads to about a 12x reduction in computing cost and an 11x reduction in time overhead on ImageNet. We show experimental results of our approach on CIFAR-10/CIFAR-100 and ImageNet, and demonstrate significant performance improvements over the state-of-the-art. On CIFAR-10, we achieve a top-1 test error of 1.36%, which is currently the best-performing single model. On ImageNet, we achieve a leading top-1 accuracy of 79.40% on ResNet-50 and 80.00% on ResNet-50-D without extra data.

AAAI Conference 2020 Conference Paper

Multi-Speaker Video Dialog with Frame-Level Temporal Localization

  • Qiang Wang
  • Pin Jiang
  • Zhiyi Guo
  • Yahong Han
  • Zhou Zhao

To simulate human interaction in real life, dialog systems are introduced to generate a response to previous chat utterances. There have been several studies of two-speaker video dialogs in the form of question answering. However, more informative semantic cues might be exploited via multi-round chatting or discussion about the video among multiple speakers, so multi-speaker video dialogs are more applicable in real life. Besides, speakers always chat about a sub-segment of a long video fragment for a period of time. Current video dialog systems require being directly given the relevant video sub-segment that the speakers are chatting about. However, it is always hard to accurately spot the corresponding video sub-segment in practical applications. In this paper, we introduce a novel task of Multi-Speaker Video Dialog with frame-level Temporal Localization (MSVD-TL) to make video dialog systems more applicable. Given a long video fragment and a set of chat history utterances, MSVD-TL aims to predict the following response and localize the relevant video sub-segment at the frame level, simultaneously. We develop a new multi-task model with a response prediction module and a frame-level temporal localization module. Besides, we focus on the characteristics of the video dialog generation process and exploit the relations among the video fragment, the chat history, and the following response to refine their representations. We evaluate our approach on both the Multi-Speaker Video Dialog without frame-level temporal localization (MSVD w/o TL) task and the MSVD-TL task. The experimental results further demonstrate that MSVD-TL enhances the applicability of video dialog in real life.

AAAI Conference 2020 Conference Paper

Neural Machine Translation with Joint Representation

  • Yanyang Li
  • Qiang Wang
  • Tong Xiao
  • Tongran Liu
  • Jingbo Zhu

Though early successes of Statistical Machine Translation (SMT) systems are attributed in part to the explicit modelling of the interaction between any two source and target units, e.g., alignment, the recent Neural Machine Translation (NMT) systems resort to attention, which only partially encodes the interaction, for efficiency. In this paper, we employ Joint Representation, which fully accounts for each possible interaction. We sidestep the inefficiency issue by refining representations with the proposed efficient attention operation. The resulting Reformer models offer a new Sequence-to-Sequence modelling paradigm besides the Encoder-Decoder framework and outperform the Transformer baseline on either the small-scale IWSLT14 German-English, English-German, and IWSLT15 Vietnamese-English or the large-scale NIST12 Chinese-English translation tasks by about 1 BLEU point. We also propose a systematic model scaling approach, allowing the Reformer model to beat the state-of-the-art Transformer on IWSLT14 German-English and NIST12 Chinese-English with about 50% fewer parameters. The code is publicly available at https://github.com/lyy1994/reformer.

AAAI Conference 2020 Conference Paper

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild

  • Yueying Kao
  • Weiming Li
  • Qiang Wang
  • Zhouchen Lin
  • Wooshik Kim
  • Sunghoon Hong

Monocular object pose estimation is an important yet challenging computer vision problem. Depth features can provide useful information for pose estimation. However, existing methods rely on real depth images to extract depth features, which limits their applicability. In this paper, we aim at extracting RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation. Specifically, a deep convolutional neural network is proposed with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module. The embedding module is trained with synthetic paired data to learn a depth-oriented embedding space between RGB and depth images optimized for object pose estimation. The adaptation module further aligns distributions from synthetic to real data. Compared to existing methods, our method does not need any real depth images and can be trained easily with large-scale synthetic data. Extensive experiments and comparisons show that our method achieves the best performance on the challenging public PASCAL 3D+ dataset in all metrics, which substantiates the superiority of our method and the above modules.

IJCAI Conference 2019 Conference Paper

A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

  • Shaohuai Shi
  • Kaiyong Zhao
  • Qiang Wang
  • Zhenheng Tang
  • Xiaowen Chu

Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) to O(k log P), which significantly boosts system scalability. However, it remains unclear whether the gTop-k sparsification scheme can converge in theory. In this paper, we first provide theoretical proofs of the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is of the same order as vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and discuss the impact of the compression ratio on convergence performance.
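As a rough illustration of the Top-k sparsification that the paper builds on, here is a sketch of what one worker communicates per step. This is not the paper's implementation; gTop-k additionally merges the workers' local Top-k selections in a tree-like reduction to reach the O(k log P) communication cost, which is omitted here.

```python
import numpy as np

def topk_sparsify(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of a gradient tensor.

    Returns the (indices, values) pairs a worker would communicate, the
    sparse gradient, and the local residual that error-feedback schemes
    typically add back into the next step's gradient.
    """
    flat = grad.ravel()
    # Indices of the k largest |g_i|; argpartition avoids a full sort.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    residual = flat - sparse  # the unsent mass, kept locally
    return idx, flat[idx], sparse.reshape(grad.shape), residual.reshape(grad.shape)

g = np.array([[0.1, -2.0, 0.3],
              [1.5, -0.2, 0.05]])
idx, vals, sparse_g, res = topk_sparsify(g, k=2)
# Only the two largest-magnitude entries (-2.0 and 1.5) are communicated;
# everything else stays in the residual.
```

With P workers each sending k (index, value) pairs to all others, the aggregate traffic is O(kP), which is exactly the cost gTop-k's hierarchical merging reduces.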

IJCAI Conference 2018 Conference Paper

An Appearance-and-Structure Fusion Network for Object Viewpoint Estimation

  • Yueying Kao
  • Weiming Li
  • Zairan Wang
  • Dongqing Zou
  • Ran He
  • Qiang Wang
  • Minsu Ahn
  • Sunghoon Hong

Automatic object viewpoint estimation from a single image is an important but challenging problem in the machine intelligence community. Although impressive performance has been achieved, current state-of-the-art methods still have difficulty dealing with the visual ambiguity and structure ambiguity in real-world images. To tackle these problems, a novel Appearance-and-Structure Fusion network, which we call ASFnet, is proposed in this paper; it estimates viewpoint by fusing both appearance and structure information. The structure information is encoded by precise semantic keypoints and can help address the visual ambiguity, while distinguishable appearance features contribute to overcoming the structure ambiguity. Our ASFnet integrates an appearance path and a structure path into an end-to-end network and allows deep features to effectively share supervision from the two complementary aspects. A convolutional layer is learned to fuse the two paths' results adaptively. To balance the influence of the two supervision sources, a piecewise loss-weight strategy is employed during training. Experimentally, our proposed network outperforms state-of-the-art methods on the public PASCAL 3D+ dataset, which verifies the effectiveness of our method and further corroborates the above proposition.

IJCAI Conference 2018 Conference Paper

Do not Lose the Details: Reinforced Representation Learning for High Performance Visual Tracking

  • Qiang Wang
  • Mengdan Zhang
  • Junliang Xing
  • Jin Gao
  • Weiming Hu
  • Steve Maybank

This work presents a novel end-to-end trainable CNN model for high-performance visual object tracking. It learns both low-level fine-grained representations and a high-level semantic embedding space in a mutually reinforced way, and a multi-task learning strategy is proposed to perform correlation analysis on representations from both levels. In particular, a fully convolutional encoder-decoder network is designed to reconstruct the original visual features from the semantic projections to preserve all the geometric information. Moreover, the correlation filter layer working on the fine-grained representations leverages a global context constraint for accurate object appearance modeling. The correlation filter in this layer is updated online efficiently without network fine-tuning. Therefore, the proposed tracker benefits from two complementary effects: the adaptability of the fine-grained correlation analysis and the generalization capability of the semantic embedding. Extensive experimental evaluations on four popular benchmarks demonstrate its state-of-the-art performance.

IJCAI Conference 2018 Conference Paper

HCR-Net: A Hybrid of Classification and Regression Network for Object Pose Estimation

  • Zairan Wang
  • Weiming Li
  • Yueying Kao
  • Dongqing Zou
  • Qiang Wang
  • Minsu Ahn
  • Sunghoon Hong

Object pose estimation from a single image is a fundamental and challenging problem in computer vision and robotics. Generally, current methods treat pose estimation as either a classification or a regression problem. However, regression-based methods usually suffer from the issue of imbalanced training data, while classification methods have difficulty discriminating nearby poses. In this paper, a hybrid CNN model, which we call HCR-Net, integrating both a classification network and a regression network, is proposed to deal with these issues. Our model is inspired by the observation that regression methods achieve better accuracy on homogeneously distributed datasets, while classification methods are more effective for coarse quantization of the poses even if the dataset is not well balanced. The classification and regression methods essentially complement each other. Thus we integrate both into a neural network in a hybrid fashion and train it end-to-end with two novel loss functions. As a result, our method surpasses the state-of-the-art methods, even with imbalanced training data and much less data augmentation. The experimental results on the challenging Pascal3D+ database demonstrate that our method outperforms the state-of-the-art methods significantly, achieving improvements on the ACC and AVP metrics of up to 4% and 6%, respectively.

AAAI Conference 2013 Conference Paper

Personalized Recommendation Based on Co-Ranking and Query-Based Collaborative Diffusion

  • Xiao Yang
  • Zhaoxin Zhang
  • Qiang Wang

In this paper, we present an adaptive graph-based personalized recommendation method based on co-ranking and query-based collaborative diffusion. By utilizing the unique network structure of an n-partite heterogeneous graph, we attempt to address the problem of personalized recommendation in a two-layer ranking process, with the help of a reasonable measure of high- and low-order relationships and an analysis of how the user's preference is represented in the graph. The experiments show that this algorithm outperforms traditional CF methods, achieves competitive performance compared with many model-based and graph-based recommendation methods, and offers better scalability and flexibility.