Arrow Research search

Author name cluster

Hao Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

41 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

  • Bichen Wang
  • Yixin Sun
  • Junzhe Wang
  • Hao Yang
  • Xing Fu
  • Yanyan Zhao
  • Si Wei
  • Shijin Wang

The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.

AAAI Conference 2026 Conference Paper

DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion

  • Yinghui Li
  • Qianyu Zhou
  • Di Shao
  • Hao Yang
  • Ye Zhu
  • Richard Dazeley
  • Xuequan Lu

Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods suffer from either limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of state space models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC encounters several challenges: serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. Moreover, the lack of designs for learning domain-agnostic representations hinders adaptation performance. To address these issues, we propose DAPointMamba, a novel framework for DA PCC that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with lower computational complexity and inference latency.

AAAI Conference 2026 Conference Paper

MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX

  • Liuyue Xie
  • Avik Kuthiala
  • George Z Wei
  • Ce Zheng
  • Ananya Bal
  • Mosam Dabhi
  • Liting Wen
  • Taru Rustagi

We introduce MAVERIX (Multimodal Audio-Visual Evaluation and Recognition IndeX), a unified benchmark to probe video understanding in multimodal LLMs, encompassing video, audio, and text inputs with human performance baselines. Although recent advancements in audiovisual models have shown substantial progress, the field lacks a standardized evaluation framework to thoroughly assess their cross-modality comprehension performance. MAVERIX curates 2,556 questions from 700 videos, in both multiple-choice and open-ended formats, explicitly designed to evaluate multimodal models through questions that necessitate tight integration of video and audio information, spanning a broad spectrum of agentic scenarios. MAVERIX uniquely provides models with questions that closely mimic the multimodal understanding experiences available to humans during decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration at such granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show accuracy of around 64%, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence, with a publicly available website.

AAAI Conference 2026 Conference Paper

PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification

  • Hao Yang
  • Qianyu Zhou
  • Haijia Sun
  • Xiangtai Li
  • Xuequan Lu
  • Lizhuang Ma
  • Shuicheng Yan

Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer, or Mamba architectures, suffering from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, offers linear complexity, global receptive fields, and long-range dependency modeling. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV's fixed-direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two core modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV's linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures and improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.

JBHI Journal 2025 Journal Article

A Phase-Enhanced Neural Network With Dual-Path Transformer for Single-Channel Chest Sound Separation

  • Yuqi Wang
  • Han Yang
  • Zhixing Gao
  • Zhiwei Dai
  • Kang Yu
  • Tingting Song
  • Hao Yang
  • Yunfeng Wang

Auscultation of the chest is a fundamental diagnostic tool for cardiovascular and pulmonary diseases. However, the two main chest sound parts, heart sound (HS) and lung sound (LS), are often mixed, limiting diagnostic accuracy. This paper presents a novel Phase-Enhanced Neural Network (PENN) for HS and LS separation. To address the under-utilization of phase information, PENN integrates a feedforward connection that feeds the input spectrum into the Restorer, enabling phase recovery based on the local inference feature of phase. A time-frequency Dual-Path Transformer (DPT) is employed to expand the network's receptive field and enhance performance. To interpret the effectiveness of PENN, two new metrics, mSI-SDRi and pSI-SDRi, are proposed to separately evaluate the contributions of magnitude and phase. Experiments show that PENN achieves pSI-SDRi improvements of 1.44 dB for HS and 2.25 dB for LS under a LS cutoff frequency ($f_{c\text{lung}}$) of 60 Hz. Extensive experimental results demonstrate the effectiveness and robustness of PENN, offering a promising solution to improve the accuracy of auscultation.

NeurIPS Conference 2025 Conference Paper

Geometric Logit Decoupling for Energy-Based Graph Out-of-distribution Detection

  • Min Wang
  • Hao Yang
  • Qing Cheng
  • Jincai Huang

GNNs have achieved remarkable performance across a range of tasks, but their reliability under distribution shifts remains a significant challenge. In particular, energy-based OOD detection methods—which compute energy scores from GNN logits—suffer from unstable performance due to a fundamental coupling between the norm and direction of node embeddings. Our analysis reveals that this coupling leads to systematic misclassification of high-norm OOD samples and hinders reliable ID–OOD separation. Interestingly, GNNs also exhibit a desirable inductive bias known as angular clustering, where embeddings of the same class align in direction. Motivated by these observations, we propose GeoEnergy (Geometric Logit Decoupling for Energy-Based OOD Detection), a plug-and-play framework that enforces hyperspherical logit geometry by normalizing class weights while preserving embedding norms. This decoupling yields more structured energy distributions, sharper intra-class alignment, and improved calibration. GeoEnergy can be integrated into existing energy-based GNNs without retraining or architectural modification. Extensive experiments demonstrate that GeoEnergy consistently improves OOD detection performance and confidence reliability across various benchmarks and distribution shifts.

JBHI Journal 2025 Journal Article

Medical Vision-Language Modeling With Semantic Interaction and Adaptive Refinement Prompting for Bias Mitigation

  • Cheng Li
  • Weijian Huang
  • Hao Yang
  • Jiarun Liu
  • Yong Liang
  • Shanshan Wang

Vision-Language Models (VLMs) have demonstrated impressive capabilities across various medical tasks, including report generation and visual question answering (VQA). However, pixel-level tasks such as image segmentation remain relatively underexplored, despite their critical importance for clinical decision-making, surgical planning, and model interpretability. Moreover, the scarcity of high-quality segmentation annotations in the medical domain often leads to biased data distributions, characterized by imbalances in disease types, anatomical coverage, and image quality. These biases are frequently overlooked during both model development and evaluation, limiting the robustness and real-world applicability of VLMs in healthcare scenarios. In this study, we propose a unified medical vision-language model applicable for a variety of clinical tasks, including report generation, VQA, and pixel-level image segmentation. Within the model, we propose a semantic interaction mechanism aimed at enhancing pixel-level vision and language representation learning. To mitigate the impact of biased data distributions, we explicitly develop an adaptive refinement prompting method involving the iterative re-prompting of hard samples. The proposed method is thoroughly validated through experiments on eight datasets and comparisons with nine state-of-the-art methods. The experimental results indicate that our model achieves superior performance in both medical VQA and segmentation tasks. These results highlight the potential of our approach in advancing the deployment of medical VLMs in real-world clinical applications. Code will be released at: https://github.com/SZUHvern/Unified-Medical-Vision-Language-Modeling

AAAI Conference 2025 Conference Paper

PointDGMamba: Domain Generalization of Point Cloud Classification via Generalized State Space Model

  • Hao Yang
  • Qianyu Zhou
  • Haijia Sun
  • Xiangtai Li
  • Fengqi Liu
  • Xuequan Lu
  • Lizhuang Ma
  • Shuicheng Yan

Domain Generalization (DG) has been recently explored to improve the generalizability of point cloud classification (PCC) models toward unseen domains. However, existing models often suffer from limited receptive fields or quadratic complexity due to the use of convolutional neural networks or vision Transformers. In this paper, we present the first work that studies the generalizability of state space models (SSMs) in DG PCC and find that directly applying SSMs to DG PCC encounters several challenges: the inherent topology of the point cloud tends to be disrupted, leading to noise accumulation during the serialization stage. Besides, the lack of designs for domain-agnostic feature learning and data scanning will introduce unanticipated domain-specific information into the 3D sequence data. To this end, we propose a novel framework, PointDGMamba, that exhibits strong generalizability toward unseen domains and has the advantages of global receptive fields and efficient linear complexity. PointDGMamba consists of three innovative components: Masked Sequence Denoising (MSD), Sequence-wise Cross-domain Feature Aggregation (SCFA), and Dual-level Domain Scanning (DDS). In particular, MSD selectively masks out the noisy point tokens of the point cloud sequences, while SCFA introduces cross-domain but same-class point cloud features to encourage the model to learn how to extract more generalized features. DDS includes intra-domain scanning and cross-domain scanning to facilitate information exchange between features. In addition, we propose a new and more challenging benchmark, PointDG-3to1, for multi-domain generalization. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of PointDGMamba.

AAAI Conference 2025 Conference Paper

SRDC: Semantics-based Ransomware Detection and Classification with LLM-assisted Pre-training

  • Ce Zhou
  • Yilun Liu
  • Weibin Meng
  • Shimin Tao
  • Weinan Tian
  • Feiyu Yao
  • Xiaochun Li
  • Tao Han

In recent years, ransomware has emerged as a formidable data security threat, causing significant data privacy breaches that inflict substantial financial, reputational, and operational damages on society. Many studies employ dynamic feature analysis for ransomware detection. However, these methods utilize neither internal semantics (semantic information inherent in the features) nor external semantics (the wealth of existing knowledge and expert experience regarding ransomware detection). Moreover, conventional methods rely on training data from known ransomware families, while zero-day ransomware often has unknown data distribution patterns, posing detection challenges. In this paper, we propose a Semantics-based Ransomware Detection and family Classification (SRDC) framework that can utilize both internal and external semantics of software. To bolster semantic analysis in zero-day attacks, we also design a procedure called LLM-assisted task-adaptive pre-training (LATAP). In LATAP, ransomware semantics from human experts and LLMs are employed to pre-train the detection model (GPT-2). By fully utilizing semantics, the proposed SRDC framework outperforms the SOTA methods by 12.15% on ransomware family classification tasks and by 4.03% on zero-day ransomware detection tasks. SRDC also exhibits excellent data efficiency, requiring only two ransomware families for training (only 35% of the data required by existing methods) to achieve 90%+ accuracy in zero-day ransomware detection across nine unseen ransomware families.

AAAI Conference 2025 Conference Paper

Test-Time Adaptation on Noisy Data via Model-Pruning-Based Filtering and Flatness-Aware Entropy Minimization

  • Xingzhi Zhou
  • Zhiliang Tian
  • Boyang Zhang
  • Yibo Zhang
  • Ka Chun Cheung
  • Simon See
  • Hao Yang
  • Yun Zhou

Test-time adaptation (TTA) deals with domain shifts during inference by training models based on only unlabeled test samples. Test samples may include noisy samples, which degrade domain adaptation. Existing methods rely on the model's output prediction to detect and filter noisy samples, and further search for flat regions during optimization, which makes the optimization more robust to noisy samples. However, there are two issues: (1) the output prediction tends to be inaccurate due to domain shifts, weakening noisy-sample detection; (2) current approaches for searching flat regions focus on optimizing the worst case, which ignores achieving flatness by avoiding rapid changes of the loss. To address these challenges, we propose a model-pruning-based test-time adaptation model for noisy data streams, named MoTTA, which leverages a newly proposed output-difference-under-pruning (ODP) filtering and a flatness-aware entropy minimization (FlatEM). Specifically, to reduce the impact of inaccurate output predictions, ODP-based filtering measures the output difference of a sample before and after model pruning, which works even under inaccurate output. To improve the search for flat loss surfaces, FlatEM integrates zeroth-order flatness and first-order flatness (minimizing the maximal gradient norm under a weight perturbation constrained to a small Euclidean ball) into entropy minimization. To solve these hard maximization problems, we leverage Taylor expansion to obtain approximate results for optimization. FlatEM also adopts a parameter regularization to mitigate incorrect updates from noisy samples. Experiments show our advantages in handling noisy data streams in TTA compared with existing baselines.

NeurIPS Conference 2025 Conference Paper

User-Instructed Disparity-aware Defocus Control

  • Yudong Han
  • Yan Yang
  • Hao Yang
  • Liyuan Pan

In photography, an All-in-Focus (AiF) image may not always effectively convey the creator's intent. Professional photographers manipulate Depth of Field (DoF) to control which regions appear sharp or blurred, achieving compelling artistic effects. For general users, the ability to flexibly adjust DoF enhances creative expression and image quality. In this paper, we propose UiD, a User-Instructed DoF control framework that allows users to specify refocusing regions using text, box, or point prompts, and our UiD automatically simulates in-focus and out-of-focus (OoF) regions in the given images. However, controlling defocus blur in a single-lens camera remains challenging due to the difficulty in estimating depth-aware aberrations and the suboptimal quality of reconstructed AiF images. To address this, we leverage dual-pixel (DP) sensors, commonly found in DSLR-style and mobile cameras. DP sensors provide a small-baseline stereo pair in a single snapshot, enabling depth-aware aberration estimation. Our approach first establishes an invertible mapping between OoF and AiF images to learn spatially varying defocus kernels and disparity features. These depth-aware kernels enable bidirectional image transformation, deblurring OoF images into AiF representations and conversely reblurring AiF images into OoF outputs, by seamlessly switching between the kernel and its inverse form. For user-guided refocusing, we first generate masks based on user prompts using SAM, which modulates disparity features in closed form, allowing dynamic kernel re-estimation for reblurring. This achieves user-controlled refocusing effects.
Extensive experiments on both common datasets and the self-collected dataset demonstrate that UiD offers superior flexibility and quality in DoF manipulation imaging.

AAAI Conference 2024 Conference Paper

AvatarVerse: High-Quality & Stable 3D Avatar Creation from Text and Pose

  • Huichao Zhang
  • Bowen Chen
  • Hao Yang
  • Liao Qu
  • Xu Wang
  • Li Chen
  • Chao Long
  • Feida Zhu

Creating expressive, diverse, and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, due to the intricacy of 3D modeling and texturing required to ensure fine details and varied styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive, high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on DensePose signals to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which yields substantial improvement in the quality of the created 3D avatars. As a result, the proposed AvatarVerse pipeline achieves zero-shot creation of 3D avatars that are not only more expressive but also of higher quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, setting a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io/.

AAAI Conference 2024 Conference Paper

Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks

  • Min Wang
  • Hao Yang
  • Jincai Huang
  • Qing Cheng

Confidence calibration in Graph Neural Networks (GNNs) aims to align a model's predicted confidence with its actual accuracy. Recent studies have indicated that GNNs exhibit an under-confidence bias, which contrasts with the over-confidence bias commonly observed in deep neural networks. However, our deeper investigation into this topic reveals that not all GNNs exhibit this behavior. Upon closer examination of message passing in GNNs, we found a clear link between message aggregation and confidence levels. Specifically, GNNs with extensive message aggregation, often seen in deep architectures or when leveraging large amounts of labeled data, tend to exhibit over-confidence. This over-confidence can be attributed to factors like over-learning and over-smoothing. Conversely, GNNs with fewer layers, known for their balanced message passing and superior node representation, may exhibit under-confidence. To counter these confidence biases, we introduce the Adaptive Unified Label Smoothing (AU-LS) technique. Our experiments show that AU-LS outperforms existing methods, addressing both over- and under-confidence in various GNN scenarios.

AAAI Conference 2024 Conference Paper

Translate Meanings, Not Just Words: IdiomKB’s Role in Optimizing Idiomatic Translation with Language Models

  • Shuang Li
  • Jiangjie Chen
  • Siyu Yuan
  • Xinyi Wu
  • Hao Yang
  • Shimin Tao
  • Yanghua Xiao

To translate well, machine translation (MT) systems and general-purpose language models (LMs) need a deep understanding of both source and target languages and cultures. Idioms, with their non-compositional nature, pose particular challenges for Transformer-based systems, as literal translations often miss the intended meaning. Traditional methods, which replace idioms using existing knowledge bases (KBs), often lack scale and context-awareness. Addressing these challenges, our approach prioritizes context-awareness and scalability, allowing for offline storage of idioms in a manageable KB size. This ensures efficient serving with smaller models and provides a more comprehensive understanding of idiomatic expressions. To this end, we introduce IdiomKB, a multilingual idiom KB developed using large LMs. This KB facilitates better translation by smaller models, such as BLOOMZ (7.1B), Alpaca (7B), and InstructGPT (6.7B), by retrieving idioms' figurative meanings. We present a novel, GPT-4-powered metric for human-aligned evaluation, demonstrating that IdiomKB considerably boosts model performance. Human evaluations further validate our KB's quality.

AAAI Conference 2023 Conference Paper

Denoising Pre-training for Machine Translation Quality Estimation with Curriculum Learning

  • Xiang Geng
  • Yu Zhang
  • Jiahuan Li
  • Shujian Huang
  • Hao Yang
  • Shimin Tao
  • Yimeng Chen
  • Ning Xie

Quality estimation (QE) aims to assess the quality of machine translations when reference translations are unavailable. QE plays a crucial role in many real-world applications of machine translation. Because labeled QE data are usually limited in scale, recent research, such as DirectQE, pre-trains QE models with pseudo QE data and obtains remarkable performance. However, there tends to be inevitable noise in the pseudo data, hindering models from learning QE accurately. Our study shows that the noise mainly comes from the differences between pseudo and real translation outputs. To handle this problem, we propose CLQE, a denoising pre-training framework for QE based on curriculum learning. More specifically, we propose to measure the degree of noise in the pseudo QE data with some metrics based on statistical or distributional features. With the guidance of these metrics, CLQE gradually pre-trains the QE model using data from cleaner to noisier. Experiments on various benchmarks reveal that CLQE outperforms DirectQE and other strong baselines. We also show that with our framework, pre-training converges faster than directly using the pseudo data. We make our CLQE code available (https://github.com/NJUNLP/njuqe).

AAAI Conference 2023 Conference Paper

FreeEnricher: Enriching Face Landmarks without Additional Cost

  • Yangyu Huang
  • Xi Chen
  • Jongyoo Kim
  • Hao Yang
  • Chong Li
  • Jiaolong Yang
  • Dong Chen

Recent years have witnessed significant growth in face alignment. Though dense facial landmarks are highly demanded in various scenarios, e.g., cosmetic medicine and facial beautification, most works only consider sparse face alignment. To address this problem, we present a framework that can enrich landmark density using existing sparse landmark datasets, e.g., 300W with 68 points and WFLW with 98 points. Firstly, we observe that the local patches along each semantic contour are highly similar in appearance. Then, we propose a weakly-supervised idea of learning the refinement ability on original sparse landmarks and adapting this ability to enriched dense landmarks. Meanwhile, several operators are devised and organized together to implement the idea. Finally, the trained model is applied as a plug-and-play module to existing face alignment networks. To evaluate our method, we manually label the dense landmarks on the 300W testset. Our method yields state-of-the-art accuracy not only on the newly-constructed dense 300W testset but also on the original sparse 300W and WFLW testsets, without additional cost.

NeurIPS Conference 2023 Conference Paper

From Trainable Negative Depth to Edge Heterophily in Graphs

  • Yuchen Yan
  • Yuzhong Chen
  • Huiyuan Chen
  • Minghua Xu
  • Mahashweta Das
  • Hao Yang
  • Hanghang Tong

Finding the proper depth $d$ of a graph convolutional network (GCN) that provides strong representation ability has drawn significant attention, yet largely remains an open problem for the graph learning community. Although noteworthy progress has been made, the depth or the number of layers of a corresponding GCN is realized by a series of graph convolution operations, which naturally makes $d$ a positive integer ($d \in \mathbb{N}^+$). An interesting question is whether breaking the constraint of $\mathbb{N}^+$ by making $d$ a real number ($d \in \mathbb{R}$) can bring new insights into graph learning mechanisms. In this work, by redefining GCN's depth $d$ as a trainable parameter continuously adjustable within $(-\infty, +\infty)$, we open a new door to controlling its signal processing capability to model graph homophily/heterophily (nodes with similar/dissimilar labels/attributes tend to be inter-connected). A simple and powerful GCN model, TEDGCN, is proposed to retain the simplicity of GCN while automatically searching for the optimal $d$ without prior knowledge of whether the input graph is homophilic or heterophilic. Negative-valued $d$ intrinsically enables high-pass frequency filtering via augmented topology for graph heterophily. Extensive experiments demonstrate the superiority of TEDGCN on node classification tasks for a variety of homophilic and heterophilic graphs.

NeurIPS Conference 2023 Conference Paper

PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds

  • Hao Yang
  • Haiyang Wang
  • Di Dai
  • Liwei Wang

Pre-training is crucial in 3D-related fields such as autonomous driving where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness, where only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. On the other hand, images offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness issue inherent in point clouds. Yet, incorporating images into point cloud pre-training presents its own challenges due to occlusions, potentially causing misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted pre-training framework for outdoor point clouds in an occlusion-aware manner. The main ingredient of our framework is a Birds-Eye-View (BEV) feature map conditioned semantic rendering, leveraging the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Codes will be available at https://github.com/PRED4pc/PRED.

IJCAI Conference 2023 Conference Paper

Probabilistic Masked Attention Networks for Explainable Sequential Recommendation

  • Huiyuan Chen
  • Kaixiong Zhou
  • Zhimeng Jiang
  • Chin-Chia Michael Yeh
  • Xiaoting Li
  • Menghai Pan
  • Yan Zheng
  • Xia Hu

Transformer-based models are powerful for modeling temporal dynamics of user preference in sequential recommendation. Most of the variants adopt the Softmax transformation in the self-attention layers to generate dense attention probabilities. However, real-world item sequences are often noisy, containing a mixture of true-positive and false-positive interactions. Such dense attentions inevitably assign probability mass to noisy or irrelevant items, leading to sub-optimal performance and poor explainability. Here we propose a Probabilistic Masked Attention Network (PMAN) to identify the sparse pattern of attentions, which is more desirable for pruning noisy items in sequential recommendation. Specifically, we employ a probabilistic mask to achieve sparse attentions under a constrained optimization framework. As such, PMAN can select which information to retain or drop in a data-driven fashion. Experimental studies on real-world benchmark datasets show that PMAN is able to improve the performance of Transformers significantly.

IJCAI Conference 2023 Conference Paper

Stochastic Feature Averaging for Learning with Long-Tailed Noisy Labels

  • Hao-Tian Li
  • Tong Wei
  • Hao Yang
  • Kun Hu
  • Chong Peng
  • Li-Bo Sun
  • Xun-Liang Cai
  • Min-Ling Zhang

Deep neural networks have shown promising results on a wide variety of tasks using large-scale and well-annotated training datasets. However, data collected from real-world applications can suffer from two prevalent biases, i.e., long-tailed class distribution and label noise. Previous efforts on long-tailed learning and label-noise learning can only address a single type of data bias, leading to a severe deterioration of their performance. In this paper, we propose a distance-based sample selection algorithm called Stochastic Feature Averaging (SFA), which fits a Gaussian using the exponential running average of class centroids to capture uncertainty in representation space due to label noise and data scarcity. With SFA, we detect noisy samples based on their distances to class centroids sampled from this Gaussian distribution. Based on the identified clean samples, we then propose to train an auxiliary balanced classifier to improve the generalization for the minority class and facilitate the update of Gaussian parameters. Extensive experimental results show that SFA can enhance the performance of existing methods on both simulated and real-world datasets. Further, we propose to combine SFA with the sample-selection approach, distribution-robust, and noise-robust loss functions, resulting in significant improvement in performance over the baselines. Our code is available at https://github.com/HotanLee/SFA.
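The selection rule described in the abstract can be sketched in a few lines; function names, the momentum value, and the distance threshold below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def update_class_gaussian(mu, var, feats, momentum=0.9):
    """Exponential running average of a class centroid and a diagonal
    variance estimate -- a stand-in for SFA's Gaussian fitting.
    feats: (n_samples, dim) features of this class in the current batch."""
    mu_new = momentum * mu + (1 - momentum) * feats.mean(axis=0)
    var_new = momentum * var + (1 - momentum) * feats.var(axis=0)
    return mu_new, var_new

def looks_clean(feat, mu, var, n_draws=64, dist_thresh=1.0, seed=0):
    """Sample centroids from the fitted Gaussian and call a sample clean
    when its average distance to the sampled centroids is small."""
    rng = np.random.default_rng(seed)
    centroids = rng.normal(mu, np.sqrt(var), size=(n_draws, mu.shape[0]))
    return bool(np.linalg.norm(centroids - feat, axis=1).mean() < dist_thresh)
```

Sampling centroids, rather than using a single point estimate, is what lets the distance test account for the centroid uncertainty caused by label noise and data scarcity.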

AAAI Conference 2023 Conference Paper

SwiftAvatar: Efficient Auto-Creation of Parameterized Stylized Character on Arbitrary Avatar Engines

  • Shizun Wang
  • Weihong Zeng
  • Xu Wang
  • Hao Yang
  • Li Chen
  • Chuang Zhang
  • Ming Wu
  • Yi Yuan

The creation of a parameterized stylized character involves careful selection of numerous parameters, also known as the "avatar vectors", that can be interpreted by the avatar engine. Existing unsupervised avatar vector estimation methods that auto-create avatars for users, however, often fail to work because of the domain gap between realistic faces and stylized avatar images. To this end, we propose SwiftAvatar, a novel avatar auto-creation framework that is evidently superior to previous works. SwiftAvatar introduces dual-domain generators to create pairs of realistic faces and avatar images using shared latent codes. The latent codes can then be bridged with the avatar vectors as pairs, by performing GAN inversion on the avatar images rendered from the engine using avatar vectors. In this way, we are able to synthesize as much high-quality paired data as possible, consisting of avatar vectors and their corresponding realistic faces. We also propose semantic augmentation to improve the diversity of synthesis. Finally, a lightweight avatar vector estimator is trained on the synthetic pairs to implement efficient auto-creation. Our experiments demonstrate the effectiveness and efficiency of SwiftAvatar on two different avatar engines. The superiority and advantageous flexibility of SwiftAvatar are also verified in both subjective and objective evaluations.

NeurIPS Conference 2023 Conference Paper

Your representations are in the network: composable and parallel adaptation for large scale models

  • Yonatan Dukler
  • Alessandro Achille
  • Hao Yang
  • Varsha Vivek
  • Luca Zancato
  • Benjamin Bowman
  • Avinash Ravichandran
  • Charless Fowlkes

We present a framework for transfer learning that efficiently adapts a large base-model by learning lightweight cross-attention modules attached to its intermediate activations. We name our approach InCA (Introspective-Cross-Attention) and show that it can efficiently survey a network's representations and identify strong performing adapter models for a downstream task. During training, InCA enables training numerous adapters efficiently and in parallel, isolated from the frozen base model. On the ViT-L/16 architecture, our experiments show that a single adapter, 1.3% of the full model, is able to reach full fine-tuning accuracy on average across 11 challenging downstream classification tasks. Compared with other forms of parameter-efficient adaptation, the isolated nature of the InCA adaptation is computationally desirable for large-scale models. For instance, we adapt ViT-G/14 (1.8B+ parameters) quickly with 20+ adapters in parallel on a single V100 GPU (76% GPU memory reduction) and exhaustively identify its most useful representations. We further demonstrate how the adapters learned by InCA can be incrementally modified or combined for flexible learning scenarios, and our approach achieves state-of-the-art performance on the ImageNet-to-Sketch multi-task benchmark.

AAAI Conference 2022 Short Paper

Exploring Entity Interactions for Few-Shot Relation Learning (Student Abstract)

  • Yi Liang
  • Shuai Zhao
  • Bo Cheng
  • Yuwei Yin
  • Hao Yang

Few-shot relation learning refers to inferring facts for relations with a limited number of observed triples. Existing metric-learning methods for this problem mostly neglect entity interactions within and between triples. In this paper, we explore these fine-grained semantics and propose our model TransAM. Specifically, we serialize reference entities and query entities into sequences and apply a Transformer structure with local-global attention to capture both intra- and inter-triple entity interactions. Experiments on two public benchmark datasets, NELL-One and Wiki-One, in the 1-shot setting demonstrate the effectiveness of TransAM.

AAAI Conference 2022 Conference Paper

Noninvasive Lung Cancer Early Detection via Deep Methylation Representation Learning

  • Xiangrui Cai
  • Jinsheng Tao
  • Shichao Wang
  • Zhiyu Wang
  • Jiaxian Wang
  • Mei Li
  • Hong Wang
  • Xixiang Tu

Early detection of lung cancer is crucial for the five-year survival of patients. Compared with pathological analysis and CT scans, the circulating tumor DNA (ctDNA) methylation based approach is noninvasive and cost-effective, and thus is one of the most promising methods for early detection of lung cancer. Existing studies on ctDNA methylation data measure the methylation level of each region with a predefined metric, ignoring the positions of methylated CpG sites and methylation patterns, and thus are not able to capture early cancer signals. In this paper, we propose a blood-based lung cancer detection method, and present the first study to represent methylation regions by continuous vectors. Specifically, we propose DeepMeth to regard each region as a one-channel image and develop an auto-encoder model to learn its representation. For each ctDNA methylation sample, DeepMeth achieves its representation by concatenating the region vectors. We evaluate DeepMeth on a multicenter clinical dataset collected from 14 hospitals. The experiments show that DeepMeth achieves about 5%-8% improvements compared with the baselines in terms of Area Under the Curve (AUC). Moreover, the experiments also demonstrate that DeepMeth can be combined with traditional scalar metrics to enhance the diagnostic power of ctDNA methylation classifiers. DeepMeth has been clinically deployed and applied to 450 patients from 94 hospitals nationally since April 2020.

AAAI Conference 2021 Conference Paper

Beating Attackers At Their Own Games: Adversarial Example Detection Using Adversarial Gradient Directions

  • Yuhang Wu
  • Sunpreet S Arora
  • Yanhong Wu
  • Hao Yang

Adversarial examples are input examples that are specifically crafted to deceive machine learning classifiers. State-of-the-art adversarial example detection methods characterize an input example as adversarial either by quantifying the magnitude of feature variations under multiple perturbations or by measuring its distance from the estimated benign-example distribution. Instead of using such metrics, the proposed method is based on the observation that the directions of adversarial gradients when crafting (new) adversarial examples play a key role in characterizing the adversarial space. Compared to detection methods that use multiple perturbations, the proposed method is efficient as it only applies a single random perturbation on the input example. Experiments conducted on two different databases, CIFAR-10 and ImageNet, show that the proposed detection method achieves, respectively, 97.9% and 98.6% AUC-ROC (on average) on five different adversarial attacks, and outperforms multiple state-of-the-art detection methods. Results demonstrate the effectiveness of using adversarial gradient directions for adversarial example detection.
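The kind of direction statistic the abstract describes can be illustrated with a generic numerical gradient and a cosine-similarity comparison; this is only a sketch of the idea, not the paper's detector, and the model, perturbation size, and similarity measure are all assumptions here.

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def grad_direction_similarity(f, x, eps=0.05, seed=0):
    """Cosine similarity between the input gradient at x and the gradient
    at a single randomly perturbed copy of x -- the type of
    gradient-direction feature a detector could threshold on."""
    rng = np.random.default_rng(seed)
    g0 = numerical_grad(f, x)
    g1 = numerical_grad(f, x + eps * rng.standard_normal(x.shape))
    denom = np.linalg.norm(g0) * np.linalg.norm(g1) + 1e-12
    return float(g0 @ g1 / denom)
```

The single random perturbation is what keeps the statistic cheap relative to multi-perturbation detectors: only one extra gradient evaluation is needed per input.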

NeurIPS Conference 2021 Conference Paper

Delayed Propagation Transformer: A Universal Computation Engine towards Practical Control in Cyber-Physical Systems

  • Wenqing Zheng
  • Qiangqiang Guo
  • Hao Yang
  • Peihao Wang
  • Zhangyang Wang

Multi-agent control is a central theme in the Cyber-Physical Systems (CPS). However, current control methods either receive non-Markovian states due to insufficient sensing and decentralized design, or suffer from poor convergence. This paper presents the Delayed Propagation Transformer (DePT), a new transformer-based model that specializes in the global modeling of CPS while taking into account the immutable constraints from the physical world. DePT induces a cone-shaped spatial-temporal attention prior, which injects the information propagation and aggregation principles and enables a global view. With physical constraint inductive bias baked into its design, our DePT is ready to plug and play for a broad class of multi-agent systems. The experimental results on one of the most challenging CPS -- network-scale traffic signal control system in the open world -- show that our model outperformed the state-of-the-art expert methods on synthetic and real-world datasets. Our codes are released at: https://github.com/VITA-Group/DePT.

IJCAI Conference 2021 Conference Paper

Online Credit Payment Fraud Detection via Structure-Aware Hierarchical Recurrent Neural Network

  • Wangli Lin
  • Li Sun
  • Qiwei Zhong
  • Can Liu
  • Jinghua Feng
  • Xiang Ao
  • Hao Yang

Online credit payment fraud detection plays a critical role in financial institutions due to the growing volume of fraudulent transactions. Recently, researchers have shown an increased interest in capturing users' dynamic and evolving fraudulent tendencies from their behavior sequences. However, most existing methodologies for sequential modeling overlook the intrinsic structure information of web pages. In this paper, we adopt multi-scale behavior sequences generated from different granularities of web page structures and propose a model named SAH-RNN to consume the multi-scale behavior sequence for online payment fraud detection. The SAH-RNN stacks RNN layers in which the upper layers, which model compendious behaviors, are updated less frequently and receive summarized representations from the lower layers. A dual attention is devised to capture the impacts of both sequential information within the same sequence and structural information among different granularities of web pages. Experimental results on a large-scale real-world transaction dataset from Alibaba show that our proposed model outperforms state-of-the-art models. The code is available at https://github.com/WangliLin/SAH-RNN.

AAAI Conference 2020 Short Paper

HGMAN: Multi-Hop and Multi-Answer Question Answering Based on Heterogeneous Knowledge Graph (Student Abstract)

  • Xu Wang
  • Shuai Zhao
  • Bo Cheng
  • Jiale Han
  • Yingting Li
  • Hao Yang
  • Guoshun Nan

Multi-hop question answering models based on knowledge graphs have been extensively studied. Most existing models predict a single answer with the highest probability by ranking candidate answers. However, the ranking method prevents them from predicting all the correct answers. In this paper, we propose a novel model that converts the ranking of candidate answers into individual predictions for each candidate, named the heterogeneous knowledge graph based multi-hop and multi-answer model (HGMAN). HGMAN is capable of capturing more informative representations for relations, assisted by our heterogeneous graph, which consists of multiple entity nodes and relation nodes. We rely on a graph convolutional network for multi-hop reasoning and then binary classification for each node to obtain multiple answers. Experimental results on the MetaQA dataset show that our proposed model outperforms all baselines.

IJCAI Conference 2019 Conference Paper

Position Focused Attention Network for Image-Text Matching

  • Yaxiong Wang
  • Hao Yang
  • Xueming Qian
  • Lin Ma
  • Jing Lu
  • Biao Li
  • Xin Fan

Image-text matching tasks have recently attracted a lot of attention in the computer vision field. The key point of this cross-domain problem is how to accurately measure the similarity between the visual and the textual contents, which demands a fine understanding of both modalities. In this paper, we propose a novel position focused attention network (PFAN) to investigate the relation between the visual and the textual views. In this work, we integrate the object position clue to enhance the visual-text joint-embedding learning. We first split the images into blocks, from which we infer the relative position of each region in the image. Then, an attention mechanism is proposed to model the relations between the image region and blocks and generate a valuable position feature, which is further utilized to enhance the region expression and model a more reliable relationship between the visual image and the textual sentence. Experiments on the popular datasets Flickr30K and MS-COCO show the effectiveness of the proposed method. Besides the public datasets, we also conduct experiments on our collected practical news dataset (Tencent-News) to validate the practical application value of the proposed method. As far as we know, this is the first attempt to test the performance on a practical application. Our method achieves state-of-the-art performance on all three datasets.

IJCAI Conference 2019 Conference Paper

The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning

  • Bonggun Shin
  • Hao Yang
  • Jinho D. Choi

Recent advances in deep learning have fueled the demand for neural models in real applications. In practice, these applications often need to be deployed with limited resources while keeping high accuracy. This paper touches the core of neural models in NLP, word embeddings, and presents an embedding distillation framework that remarkably reduces the dimension of word embeddings without compromising accuracy. A new distillation ensemble approach is also proposed that trains a high-efficiency student model using multiple teacher models. In our approach, the teacher models play roles only during training, such that the student model operates on its own without support from the teacher models during decoding, which makes it run as fast and light as any single model. All models are evaluated on seven document classification datasets and show significant advantage over the teacher models in most cases. Our analysis depicts insightful transformation of word embeddings from distillation and suggests a future direction for ensemble approaches using neural models.

AAAI Conference 2018 Conference Paper

Hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map Based Feature Extraction for Human Action Recognition

  • Yang Du
  • Chunfeng Yuan
  • Bing Li
  • Weiming Hu
  • Hao Yang
  • Zhikang Fu
  • Lili Zhao

Feature extraction is a critical step in the task of action recognition. Hand-crafted features are often restricted because of their fixed forms, and deep learning features are more effective but need large-scale labeled data for training. In this paper, we propose a new hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map (NOASSOM) to adaptively learn effective features from data without supervision. NOASSOM is extended from the Adaptive-Subspace Self-Organizing Map (ASSOM), which only deals with linear data and is trained with supervision by labeled data. Firstly, by adding a nonlinear orthogonal map layer, NOASSOM is able to handle nonlinear input data, and it avoids defining the specific form of the nonlinear orthogonal map via a kernel trick. Secondly, we modify the loss function of ASSOM such that every input sample is used to train the model individually. In this way, NOASSOM effectively learns statistical patterns from data without supervision. Thirdly, we propose a hierarchical NOASSOM to extract more representative features. Finally, we apply the proposed hierarchical NOASSOM to efficiently describe the appearance and motion information around trajectories for action recognition. Experimental results on widely used datasets show that our method outperforms many state-of-the-art methods based on hand-crafted features and deep learning features.

AAAI Conference 2018 Conference Paper

SC2Net: Sparse LSTMs for Sparse Coding

  • Joey Tianyi Zhou
  • Kai Di
  • Jiawei Du
  • Xi Peng
  • Hao Yang
  • Sinno Jialin Pan
  • Ivor Tsang
  • Yong Liu

The iterative soft-thresholding algorithm (ISTA) is one of the most popular optimization solvers for achieving sparse codes. However, ISTA suffers from the following problems: 1) ISTA employs a non-adaptive updating strategy to learn the parameters on each dimension with a fixed learning rate. Such a strategy may lead to inferior performance due to the lack of diversity; 2) ISTA does not incorporate historical information into its updating rules, although historical information has been proven helpful for speeding up convergence. To address these challenging issues, we propose a novel formulation of ISTA (named adaptive ISTA) by introducing a novel adaptive momentum vector. To efficiently solve the proposed adaptive ISTA, we recast it as a recurrent neural network unit and show its connection with the well-known long short-term memory (LSTM) model. With the newly proposed unit, we present a neural network (termed SC2Net) to achieve sparse codes in an end-to-end manner. To the best of our knowledge, this is one of the first works to bridge the ℓ1-solver and LSTM, and it may provide novel insights for understanding model-based optimization and LSTM. Extensive experiments show the effectiveness of our method on both unsupervised and supervised tasks.

IJCAI Conference 2013 Conference Paper

Reduced Heteroscedasticity Linear Regression for Nyström Approximation

  • Hao Yang
  • Jianxin Wu

The Nyström method is a well-known sampling-based low-rank matrix approximation approach. It is usually considered to originate from the numerical treatment of integral equations and the eigendecomposition of matrices. In this paper, we present a novel point of view on the Nyström approximation. We show that, theoretically, the Nyström method can be regarded as a set of pointwise ordinary least squares linear regressions of the kernel matrix, sharing the same design matrix. With this new interpretation, we are able to analyze the approximation quality based on the fulfillment of the homoscedasticity assumption and explain the success and deficiency of various sampling methods. We also empirically show that positively skewed explanatory variable distributions can lead to heteroscedasticity. Based on this discovery, we propose to use non-symmetric explanatory functions to improve the quality of the Nyström approximation with almost no extra computational cost. Experiments show that positively skewed datasets widely exist, and our method exhibits good improvements on these datasets.
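The standard Nyström construction that the paper reinterprets can be sketched in a few lines; how the landmark indices are chosen is exactly what sampling methods differ on, so the fixed `idx` below is an illustrative assumption.

```python
import numpy as np

def nystrom(K, idx):
    """Nystrom low-rank approximation K ~ C @ pinv(W) @ C.T, built from
    the sampled (landmark) columns idx of a symmetric kernel matrix K."""
    C = K[:, idx]                # sampled columns of K
    W = K[np.ix_(idx, idx)]      # landmark-landmark block
    return C @ np.linalg.pinv(W) @ C.T
```

When K has rank r and the landmarks span its range, the approximation is exact; otherwise the quality depends on the sampling, which is where the paper's regression view and homoscedasticity analysis come in.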

ICAPS Conference 1994 Conference Paper

Case Adaptation in a Case-based Process Planning System

  • Hao Yang
  • Wen F. Lu

This paper describes the case adaptation in a case-based process planning system: PROCASE. PROCASE is an acronym for Process Routines Organized as Case Archives with Simulation Environment. In a case-based process planning system, a new process plan is generated by adapting an existing similar process planning case. Case adaptation is an important and, most of the time, difficult issue. This is because, first, an existing case is usually a similar but not an identical case; adaptation is essential to tailor this similar process planning case to generate a new process plan which can produce exactly the new part needed. Second, adaptation involves many reasoning processes which embed a great amount of knowledge, and encoding such knowledge for computer simulation is not a plain task. The case adaptation in PROCASE comprises case modification and case repairing. This paper will first briefly introduce the case representation and case retrieving in PROCASE. The rest of the paper will then present the case adaptation in PROCASE.