Author name cluster

Haifeng Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers

1 author row

AAAI Conference 2026 Conference Paper

Bridging the Tokenizer Gap: Semantics and Distribution-aware Knowledge Transfer for Unbiased Cross-Tokenizer Distillation

Huazheng Wang
Yongcheng Jing
Haifeng Sun
Jingyu Wang
Jianxin Liao
Leszek Rutkowski
Dacheng Tao

Cross-tokenizer knowledge distillation, where the teacher and student employ different tokenizers, is becoming increasingly prevalent, yet it poses underexplored challenges: existing methods fail to capture the rich knowledge encoded in teacher logits, as evidenced by the neglect of semantic information, inaccurate and biased logit alignment, and discarding distributional structure—ultimately leading to unfavorable distillation. To address these issues, we propose SeDi, a semantics and distribution-aware knowledge transfer framework tailored for cross-tokenizer distillation. To preserve factual knowledge, SeDi employs bipartite graph-based alignment at the tokenization level and a sliding window re-encoding strategy at the vocabulary level, enabling unbiased transfer of the teacher’s next-token predictions into the student’s vocabulary space. To further retain distributional information, we align the student’s entropy with that of the teacher by incorporating the student’s own logits during training, which helps to mitigate the exposure bias problem. Experiments on ten datasets across three task domains and five different teacher-student model pairs with varying vocabulary sizes demonstrate that SeDi delivers substantial improvements, with gains of up to 19.8%.

PDF Details DOI

AAAI Conference 2026 Conference Paper

MCPTox: A Benchmark for Tool Poisoning on Real-World MCP Servers

Zhiqiang Wang
Yichao Gao
Yanting Wang
Suyuan Liu
Haifeng Sun
Haoran Cheng
Guanquan Shi
Haohua Du

By providing a standardized interface for LLM agents to interact with external tools, the Model Context Protocol (MCP) is quickly becoming a cornerstone of the modern autonomous agent ecosystem. However, it creates novel attack surfaces due to untrusted external tools. While prior work has focused on attacks injected through external tool outputs, we investigate a more fundamental vulnerability: Tool Poisoning, where malicious instructions are embedded within a tool's metadata at the registration stage. To date, this threat has been primarily demonstrated through isolated cases, lacking a systematic, large-scale evaluation. We introduce MCPTox, the first benchmark to systematically evaluate agent robustness against Tool Poisoning in realistic MCP settings. MCPTox is constructed upon 45 live, real-world MCP servers and 353 authentic tools. To achieve this, we design three distinct attack templates to generate a comprehensive suite of 1348 malicious test cases by few-shot learning, covering 10 categories of potential risks. Our evaluation on 20 prominent LLM agents setting reveals a widespread vulnerability to Tool Poisoning, with GPT-o1-mini, achieving an attack success rate of 72.8%. We find that more capable models are often more susceptible, as the attack exploits their superior instruction-following abilities. Finally, the failure case analysis reveals that agents rarely refuse these attacks, with the highest refused rate (Claude-3.7-Sonnet) less than 3%, demonstrating that existing safety alignment is ineffective against malicious actions that use legitimate tools for unauthorized operation. Our findings create a crucial empirical baseline for understanding and mitigating this widespread threat, and we release MCPTox for the development of verifiably safer AI agents.

PDF Details DOI

IJCAI Conference 2025 Conference Paper

A³-Net: Calibration-Free Multi-View 3D Hand Reconstruction for Enhanced Musical Instrument Learning

Geng Chen
Xufeng Jian
Yuchen Chen
Pengfei Ren
Jingyu Wang
Haifeng Sun
Qi Qi
Jing Wang

Precise 3D hand posture is essential for learning musical instruments. Reconstructing highly precise 3D hand gestures enables learners to correct and master proper techniques through 3D simulation and Extended Reality. However, exsiting methods typically rely on precisely calibrated multi-camera systems, which are not easily deployable in everyday environments. In this paper, we focus on calibration-free multi-view 3D hand reconstruction in unconstrained scenarios. Establishing correspondences between multi-view images is particularly challenging without camera extrinsics. To address this, we propose A^3-Net, a multi-level alignment framework that utilizes 3D structural representations with hierarchical geometric and explicit semantic information as alignment proxies, facilitating multi-view feature interaction in both 3D geometric space and 2D visual space. Specifically, we first perfrom global geometric alignment to map multi-view features into a canonical space. Subsequently, we aggregate information into predefined sparse and dense proxies to further integrate cross-view semantics through mutual interaction. Finnaly, we perfrom 2D alignment to align projected 2D visual features with 2D observations. Our method achieves state-of-the-art results in the multi-view 3D hand reconstruction task, demonstrating the effectiveness of our proposed framework.

PDF Details DOI

IJCAI Conference 2025 Conference Paper

Beyond Statistical Analysis: Multimodal Framework for Time Series Forecasting with LLM-Driven Temporal Pattern

Jiahong Xiong
Chengsen Wang
Haifeng Sun
Yuhan Jing
Qi Qi
Zirui Zhuang
Lei Zhang
Jianxin Liao

Accurate forecasting of time series is crucial for many applications in the real world. Conventional methods primarily rely on statistical analysis of historical data, often leading to overfitting and failing to account for background information and constraints imposed by external events. Therefore, introducing large language models (LLMs) with robust textual capabilities holds significant potential. However, due to the inherent limitations of LLMs in handling numerical data, they do not exhibit advantages in precise numerical prediction tasks. Therefore, we propose a framework to integrate LLMs with conventional methods synergistically. Rather than directly outputting numerical predictions, we leverage the capabilities of the LLMs to generate textual temporal patterns, thereby fully utilizing their inherent knowledge and reasoning abilities. Additionally, we introduce a memory network designed to decode these textual representations into a format that numerical models can effectively interpret. This approach not only capitalizes on the strengths of the LLM in text processing but also bridges the gap between textual and numerical data, enhancing the overall predictive performance of the model. Our experimental results demonstrate the framework's effectiveness, achieving state-of-the-art performance on various benchmark datasets.

PDF Details DOI

AAAI Conference 2025 Conference Paper

ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data

Chengsen Wang
Qi Qi
Jingyu Wang
Haifeng Sun
Zirui Zhuang
Jinming Wu
Lei Zhang
Jianxin Liao

Human experts typically integrate numerical and textual multimodal information to analyze time series. However, most traditional deep learning predictors rely solely on unimodal numerical data, using a fixed-length window for training and prediction on a single dataset, and cannot adapt to different scenarios. The powered pre-trained large language model has introduced new opportunities for time series analysis. Yet, existing methods are either inefficient in training, incapable of handling textual information, or lack zero-shot forecasting capability. In this paper, we innovatively model time series as a foreign language and construct ChatTime, a unified framework for time series and text processing. As an out-of-the-box multimodal time series foundation model, ChatTime provides zero-shot forecasting capability and supports bimodal input/output for both time series and text. We design a series of experiments to verify the superior performance of ChatTime across multiple tasks and scenarios, and create four multimodal datasets to address data gaps. The experimental results demonstrate the potential and utility of ChatTime.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Do LVLMs Truly Understand Video Anomalies? Revealing Hallucination via Co-Occurrence Patterns

Menghao Zhang
Huazheng Wang
Pengfei Ren
Kangheng Lin
Qi Qi
Haifeng Sun
Zirui Zhuang
Lei Zhang

Large Vision-Language Models (LVLMs) pretrained on large-scale multimodal data have shown promising capabilities in Video Anomaly Detection (VAD). However, their ability to reason about abnormal events based on scene semantics remains underexplored. In this paper, we investigate LVLMs’ behavior in VAD from a visual-textual co-occurrence perspective, focusing on whether their decisions are driven by statistical shortcuts between visual instances and textual phrases. By analyzing visual-textual co-occurrence in pretraining data and conducting experiments under different data settings, we reveal a hallucination phenomenon: LVLMs tend to rely on co-occurrence patterns between visual instances and textual phrases associated with either normality or abnormality, leading to incorrect predictions when these high-frequency objects appear in semantically mismatched contexts. To address this issue, we propose VAD-DPO, a direct preference optimization method supervised with counter-example pairs. By constructing visually similar but semantically contrasting video clips, VAD-DPO encourages the model to align its predictions with the semantics of scene rather than relying on co-occurrence patterns. Extensive experiments on six benchmark datasets demonstrate the effectiveness of VAD-DPO in enhancing both anomaly detection and reasoning performance, particularly in scene-dependent scenarios.

PDF Details

IJCAI Conference 2025 Conference Paper

Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU

Shuxi Guo
Zikang Xu
Jiahao Liu
Jinyi Zhang
Qi Qi
Haifeng Sun
Jun Huang
Jianxin Liao

Deep learning-based recommendation systems are increasingly important in the industry. To meet strict SLA requirements, serving frameworks must efficiently handle concurrent queries. However, current serving systems fail to serve concurrent queries due to the following problems: (1) inefficient operator (op) scheduling due to the query-wise op launching mechanism, and (2) heavy contention caused by the mutable nature of recommendation model inference. This paper presents RecOS, a system designed to optimize concurrent recommendation model inference on GPUs. RecOS efficiently schedules ops from different queries by monitoring GPU workloads and assigning ops to the most suitable streams. This approach reduces contention and enhances inference efficiency by leveraging inter-op parallelism and op characteristics. To maintain correctness across multiple CUDA streams, RecOS introduces a unified asynchronous tensor management mechanism. Evaluations demonstrate that RecOS improves online service performance, reducing latency by up to 68%.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Generalizable Hand-Object Modeling from Monocular RGB Images via 3D Gaussians

Xingyu Liu
Pengfei Ren
Qi Qi
Haifeng Sun
Zirui Zhuang
Jing Wang
Jianxin Liao
Jingyu Wang

Recent advances in hand-object interaction modeling have employed implicit representations, such as Signed Distance Functions (SDF) and Neural Radiance Fields (NeRF) to reconstruct hands and objects with arbitrary topology and photo-realistic detail. However, these methods often rely on dense 3D surface annotations, or are tailored to short clips constrained in motion trajectories and scene contexts, limiting their generalization to diverse environments and movement patterns. In this work, we present HOGS, an adaptively perceptive 3D Gaussian Splatting (3DGS) framework for generalizable hand-object modeling from unconstrained monocular RGB images. By integrating photometric cues from the visual modality with the physically grounded structure of 3D Gaussians, HOGS disentangles inherent geometry from transient lighting and motion-induced appearance changes. This endows hand-object assets with the ability to generalize to unseen environments and dynamic motion patterns. Experiments on two challenging datasets demonstrate that HOGS outperforms state-of-the-art methods in monocular hand-object reconstruction and photo-realistic rendering.

PDF Details

NeurIPS Conference 2025 Conference Paper

Unified 2D-3D Discrete Priors for Noise-Robust and Calibration-Free Multiview 3D Human Pose Estimation

Geng Chen
Pengfei Ren
Xufeng Jian
Haifeng Sun
Menghao Zhang
Qi Qi
Zirui Zhuang
Jing Wang

Multi-view 3D human pose estimation (HPE) leverages complementary information across views to improve accuracy and robustness. Traditional methods rely on camera calibration to establish geometric correspondences, which is sensitive to calibration accuracy and lacks flexibility in dynamic settings. Calibration-free approaches address these limitations by learning adaptive view interactions, typically leveraging expressive and flexible continuous representations. However, as the multiview interaction relationship is learned entirely from data without constraint, they are vulnerable to noisy input, which can propagate, amplify and accumulate errors across all views, severely corrupting the final estimated pose. To mitigate this, we propose a novel framework that integrates a noise-resilient discrete prior into the continuous representation-based model. Specifically, we introduce the \textit{UniCodebook}, a unified, compact, robust, and discrete representation complementary to continuous features, allowing the model to benefit from robustness to noise while preserving regression capability. Furthermore, we further propose an attribute-preserving and complementarity-enhancing Discrete-Continuous Spatial Attention (DCSA) mechanism to facilitate interaction between discrete priors and continuous pose features. Extensive experiments on three representative datasets demonstrate that our approach outperforms both calibration-required and calibration-free methods, achieving state-of-the-art performance.

PDF Details

NeurIPS Conference 2024 Conference Paper

FM-Delta: Lossless Compression for Storing Massive Fine-tuned Foundation Models

Wanyi Ning
Jingyu Wang
Qi Qi
Mengde Zhu
Haifeng Sun
Daixuan Cheng
Jianxin Liao
Ce Zhang

Pre-trained foundation models, particularly large language models, have achieved remarkable success and led to massive fine-tuned variants. These models are commonly fine-tuned locally and then uploaded by users to cloud platforms such as HuggingFace for secure storage. However, the huge model number and their billion-level parameters impose heavy storage overhead for cloud with limited resources. Our empirical and theoretical analysis reveals that most fine-tuned models in cloud have a small difference (delta) from their pre-trained models. To this end, we propose a novel lossless compression scheme FM-Delta specifically for storing massive fine-tuned models in cloud. FM-Delta maps fine-tuned and pre-trained model parameters into integers with the same bits, and entropy codes their integer delta. In this way, cloud only needs to store one uncompressed pre-trained model and other compressed fine-tuned models. Extensive experiments have demonstrated that FM-Delta efficiently reduces cloud storage consumption for massive fine-tuned models by an average of around 50% with only negligible additional time in most end-to-end cases. For example, on up to 10 fine-tuned models in the GPT-NeoX-20B family, FM-Delta reduces the original storage requirement from 423GB to 205GB, significantly saving cloud storage costs.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation

Xingyu Liu
Pengfei Ren
Yuanyuan Gao
Jingyu Wang
Haifeng Sun
Qi Qi
Zirui Zhuang
Jianxin Liao

Previous 3D hand pose estimation methods primarily rely on a single modality, either RGB or depth, and the comprehensive utilization of the dual modalities has not been extensively explored. RGB and depth data provide complementary information and thus can be fused to enhance the robustness of 3D hand pose estimation. However, there exist two problems for applying existing fusion methods in 3D hand pose estimation: redundancy of dense feature fusion and ambiguity of visual features. First, pixel-wise feature interactions introduce high computational costs and ineffective calculations of invalid pixels. Second, visual features suffer from ambiguity due to color and texture similarities, as well as depth holes and noise caused by frequent hand movements, which interferes with modeling cross-modal correlations. In this paper, we propose Keypoint-Fusion for RGB-D based 3D hand pose estimation, which leverages the unique advantages of dual modalities to mutually eliminate the feature ambiguity, and performs cross-modal feature fusion in a more efficient way. Specifically, we focus cross-modal fusion on sparse yet informative spatial regions (i.e. keypoints). Meanwhile, by explicitly extracting relatively more reliable information as disambiguation evidence, depth modality provides 3D geometric information for RGB feature pixels, and RGB modality complements the precise edge information lost due to the depth noise. Keypoint-Fusion achieves state-of-the-art performance on two challenging hand datasets, significantly decreasing the error compared with previous single-modal methods.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Rethinking the Power of Timestamps for Robust Time Series Forecasting: A Global-Local Fusion Perspective

Chengsen Wang
Qi Qi
Jingyu Wang
Haifeng Sun
Zirui Zhuang
Jinming Wu
Jianxin Liao

Time series forecasting has played a pivotal role across various industries, including finance, transportation, energy, healthcare, and climate. Due to the abundant seasonal information they contain, timestamps possess the potential to offer robust global guidance for forecasting techniques. However, existing works primarily focus on local observations, with timestamps being treated merely as an optional supplement that remains underutilized. When data gathered from the real world is polluted, the absence of global information will damage the robust prediction capability of these algorithms. To address these problems, we propose a novel framework named GLAFF. Within this framework, the timestamps are modeled individually to capture the global dependencies. Working as a plugin, GLAFF adaptively adjusts the combined weights for global and local information, enabling seamless collaboration with any time series forecasting backbone. Extensive experiments conducted on nine real-world datasets demonstrate that GLAFF significantly enhances the average performance of widely used mainstream forecasting models by 12. 5\%, surpassing the previous state-of-the-art method by 5. 5\%.

PDF Details DOI

IJCAI Conference 2024 Conference Paper

Safeguarding Sustainable Cities: Unsupervised Video Anomaly Detection through Diffusion-based Latent Pattern Learning

Menghao Zhang
Jingyu Wang
Qi Qi
Pengfei Ren
Haifeng Sun
Zirui Zhuang
Lei Zhang
Jianxin Liao

Sustainable cities requires high-quality community management and surveillance analytics, which are supported by video anomaly detection techniques. However, mainstream video anomaly detection techniques still require manually labeled data and do not apply to real-world massive videos. Without labeling, unsupervised video anomaly detection (UVAD) is challenged by the problem of pseudo-labeled noise and the openness of anomaly detection. In response, a diffusion-based latent pattern learning UVAD framework is proposed, called DiffVAD. The method learns potential patterns by generating different patterns of the same event through diffusion models. The detection of anomalies is realized by evaluating the pattern distribution. The different patterns of normal events are diverse but correlated, while the different patterns of abnormal events are more diffuse. This manner of detection is equally effective for unseen normal events in the training set. In addition, we design a refinement strategy for pseudo-labels to mitigate the effects of the noise problem. Extensive experiments on six benchmark datasets demonstrate the design’s promising generalization ability and high efficiency. Specifically, DiffVAD obtains an AUC score of 81. 9% on the ShanghaiTech dataset.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

Drift doesn't Matter: Dynamic Decomposition with Diffusion Reconstruction for Unstable Multivariate Time Series Anomaly Detection

Chengsen Wang
Zirui Zhuang
Qi Qi
Jingyu Wang
Xingyu Wang
Haifeng Sun
Jianxin Liao

Many unsupervised methods have recently been proposed for multivariate time series anomaly detection. However, existing works mainly focus on stable data yet often omit the drift generated from non-stationary environments, which may lead to numerous false alarms. We propose **D**ynamic **D**ecomposition with **D**iffusion **R**econstruction (D$^3$R), a novel anomaly detection network for real-world unstable data to fill the gap. D$^3$R tackles the drift via decomposition and reconstruction. In the decomposition procedure, we utilize data-time mix-attention to dynamically decompose long-period multivariate time series, overcoming the limitation of the local sliding window. The information bottleneck is critical yet difficult to determine in the reconstruction procedure. To avoid retraining once the bottleneck changes, we control it externally by noise diffusion and directly reconstruct the polluted data. The whole model can be trained end-to-end. Extensive experiments on various real-world datasets demonstrate that D$^3$R significantly outperforms existing methods, with a 11% average relative improvement over the previous SOTA models.

PDF Details

IJCAI Conference 2023 Conference Paper

Not Only Pairwise Relationships: Fine-Grained Relational Modeling for Multivariate Time Series Forecasting

Jinming Wu
Qi Qi
Jingyu Wang
Haifeng Sun
Zhikang Wu
Zirui Zhuang
Jianxin Liao

Recent graph-based methods achieve significant success in multivariate time series modeling and forecasting due to their ability to handle relationships among time series variables. However, only pairwise relationships are considered in most existing works. They ignore beyond-pairwise relationships and their potential categories in practical scenarios, which leads to incomprehensive relationship learning for multivariate time series forecasting. In this paper, we present ReMo, a Relational Modeling-based method, to promote fine-grained relational learning among multivariate time series data. Firstly, by treating time series variables and complex relationships as nodes and hyperedges, we extract multi-view hypergraphs from data to capture beyond-pairwise relationships. Secondly, a novel hypergraph message passing strategy is designed to characterize both nodes and hyperedges by inferring the potential categories of relationships and further distinguishing their impacts on time series variables. By integrating these two modules into the time series forecasting framework, ReMo effectively improves the performance of multivariate time series forecasting. The experimental results on seven commonly used datasets from different domains demonstrate the superiority of our model.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Scene-Level Sketch-Based Image Retrieval with Minimal Pairwise Supervision

Ce Ge
Jingyu Wang
Qi Qi
Haifeng Sun
Tong Xu
Jianxin Liao

The sketch-based image retrieval (SBIR) task has long been researched at the instance level, where both query sketches and candidate images are assumed to contain only one dominant object. This strong assumption constrains its application, especially with the increasingly popular intelligent terminals and human-computer interaction technology. In this work, a more general scene-level SBIR task is explored, where sketches and images can both contain multiple object instances. The new general task is extremely challenging due to several factors: (i) scene-level SBIR inherently shares sketch-specific difficulties with instance-level SBIR (e.g., sparsity, abstractness, and diversity), (ii) the cross-modal similarity is measured between two partially aligned domains (i.e., not all objects in images are drawn in scene sketches), and (iii) besides instance-level visual similarity, a more complex multi-dimensional scene-level feature matching problem is imposed (including appearance, semantics, layout, etc.). Addressing these challenges, a novel Conditional Graph Autoencoder model is proposed to deal with scene-level sketch-images retrieval. More importantly, the model can be trained with only pairwise supervision, which distinguishes our study from others in that elaborate instance-level annotations (for example, bounding boxes) are no longer required. Extensive experiments confirm the ability of our model to robustly retrieve multiple related objects at the scene level and exhibit superior performance beyond strong competitors.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Semi-transductive Learning for Generalized Zero-Shot Sketch-Based Image Retrieval

Ce Ge
Jingyu Wang
Qi Qi
Haifeng Sun
Tong Xu
Jianxin Liao

Sketch-based image retrieval (SBIR) is an attractive research area where freehand sketches are used as queries to retrieve relevant images. Existing solutions have advanced the task to the challenging zero-shot setting (ZS-SBIR), where the trained models are tested on new classes without seen data. However, they are prone to overfitting under a realistic scenario when the test data includes both seen and unseen classes. In this paper, we study generalized ZS-SBIR (GZS-SBIR) and propose a novel semi-transductive learning paradigm. Transductive learning is performed on the image modality to explore the potential data distribution within unseen classes, and zero-shot learning is performed on the sketch modality sharing the learned knowledge through a semi-heterogeneous architecture. A hybrid metric learning strategy is proposed to establish semantics-aware ranking property and calibrate the joint embedding space. Extensive experiments are conducted on two large-scale benchmarks and four evaluation metrics. The results show that our method is superior over the state-of-the-art competitors in the challenging GZS-SBIR task.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Two Heads Are Better than One: Image-Point Cloud Network for Depth-Based 3D Hand Pose Estimation

Pengfei Ren
Yuchen Chen
Jiachang Hao
Haifeng Sun
Qi Qi
Jingyu Wang
Jianxin Liao

Depth images and point clouds are the two most commonly used data representations for depth-based 3D hand pose estimation. Benefiting from the structuring of image data and the inherent inductive biases of the 2D Convolutional Neural Network (CNN), image-based methods are highly efficient and effective. However, treating the depth data as a 2D image inevitably ignores the 3D nature of depth data. Point cloud-based methods can better mine the 3D geometric structure of depth data. However, these methods suffer from the disorder and non-structure of point cloud data, which is computationally inefficient. In this paper, we propose an Image-Point cloud Network (IPNet) for accurate and robust 3D hand pose estimation. IPNet utilizes 2D CNN to extract visual representations in 2D image space and performs iterative correction in 3D point cloud space to exploit the 3D geometry information of depth data. In particular, we propose a sparse anchor-based "aggregation-interaction-propagation'' paradigm to enhance point cloud features and refine the hand pose, which reduces irregular data access. Furthermore, we introduce a 3D hand model to the iterative correction process, which significantly improves the robustness of IPNet to occlusion and depth holes. Experiments show that IPNet outperforms state-of-the-art methods on three challenging hand datasets.

PDF Details DOI

AAAI Conference 2020 Conference Paper

AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation

Weiting Huang
Pengfei Ren
Jingyu Wang
Qi Qi
Haifeng Sun

In this paper, we propose an adaptive weighting regression (AWR) method to leverage the advantages of both detectionbased and regression-based method. Hand joint coordinates are estimated as discrete integration of all pixels in dense representation, guided by adaptive weight maps. This learnable aggregation process introduces both dense and joint supervision that allows end-to-end training and brings adaptability to weight maps, making network more accurate and robust. Comprehensive exploration experiments are conducted to validate the effectiveness and generality of AWR under various experimental settings, especially its usefulness for different types of dense representation and input modality. Our method outperforms other state-of-the-art methods on four publicly available datasets, including NYU, ICVL, MSRA and HANDS 2017 dataset.

PDF Details