Arrow Research search

Author name cluster

Wei Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

56 papers
2 author rows

Possible papers (56)

AAAI Conference 2026 Conference Paper

Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models

  • Yi Liu
  • Xiangyu Liu
  • Zequn Sun
  • Wei Hu

Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs consistently fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze and resolve this issue, an important step toward trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining reasoning performance.

JBHI Journal 2026 Journal Article

BECM-Net: A Multi-granularity Collaborative Framework for Semi-Supervised Fetal Ultrasound Segmentation

  • Wei Hu
  • Cong Tan
  • Wendong Wang
  • Zeheng Wang
  • Qibing Qin
  • Wenfeng Zhang
  • Haibo Ni

Accurate segmentation of fetal ultrasound (US) images is essential for measuring the Angle of Progression (AoP) and assessing fetal head descent during labor. However, conventional semi-supervised learning (SSL) for ultrasound segmentation is challenged by inaccurate pseudo-labeling at blurred or low-contrast boundaries and by limited enforcement of consistency. To address these challenges, we propose the Boundary-Enhanced Collaborative Multi-granularity Network (BECM-Net), which, from a multi-granularity modeling perspective, can be interpreted as a unified framework that jointly optimizes pixel-level, region-level, and structure-level representations. Specifically, at the pixel level, a novel DirDiff-Conv module enhances boundary perception and texture representation through multi-orientation differential filtering, enabling fine-grained modeling of local structures. At the region level, the Uncertainty-Confidence Aligned Mix (UCA-Mix) strategy performs uncertainty-guided bidirectional region-level mixing, facilitating semantic alignment and reducing pseudo-label noise. At the structure level, the ContourRefine branch models object contours by integrating deep semantic features with shallow boundary cues while coupling boundary learning with pseudo-label supervision, thereby enforcing structural-level consistency in global shape and boundary continuity. Through collaborative optimization across multiple granularities, BECM-Net provides more reliable supervision and robust feature learning under limited annotations. Extensive experiments on fetal ultrasound datasets demonstrate that BECM-Net achieves state-of-the-art performance, with particularly notable gains in challenging regions with ambiguous pubic symphysis and fetal head boundaries.

AAAI Conference 2026 Conference Paper

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

  • Jianhao Chen
  • Zishuo Xun
  • Bocheng Zhou
  • Han Qi
  • Hangfan Zhang
  • Qiaosheng Zhang
  • Yang Chen
  • Wei Hu

This paper presents a simple, effective, and cost-efficient strategy, named ModelSwitch, to improve LLM performance by scaling test-time compute. ModelSwitch builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using sample consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on seven datasets demonstrate that our strategy not only outperforms self-consistency and state-of-the-art multi-agent debate approaches, but also significantly reduces inference costs. Additionally, our strategy requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation-verification paradigm.
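
For intuition, here is a minimal Python sketch of the consistency-based switching loop the abstract describes. The `sample` callback, sample budget `k`, and `threshold` are illustrative assumptions, not the paper's settings.

```python
from collections import Counter

def model_switch(models, question, sample, k=8, threshold=0.6):
    """Query models in order; stop as soon as one model's k sampled
    answers agree strongly enough (majority fraction >= threshold)."""
    best_answer, best_score = None, -1.0
    for model in models:
        answers = [sample(model, question) for _ in range(k)]
        answer, count = Counter(answers).most_common(1)[0]
        score = count / k                 # sample-consistency signal
        if score >= threshold:
            return answer                 # consistent enough: no switch needed
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer                    # fall back to most consistent answer
```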

JBHI Journal 2026 Journal Article

MsGA: Gestational Age Estimation with Multi-plane Unified Measurements Driven by Anatomic Segmentation

  • Mingjun Huang
  • Junbo Zhang
  • Wei Hu
  • Chao Sun
  • Xiantao Cai
  • Bo Du

An accurate estimation of gestational age is critical for prenatal care and clinical decision-making. Existing ultrasound-based gestational age estimation methods are limited by the insufficient information representation capacity of conventional medical segmentation models, noise interference in ultrasound images, and inter-observer variability in traditional geometry-based measurement methods. To address these challenges, we propose the MsGA model to estimate gestational age with multi-plane unified measurements driven by anatomic segmentation. In the anatomic segmentation stage, a lightweight and high-performance LGF-UNet module is proposed, which utilizes the Deep Patch Embedding module to expand the receptive field, the Local-Global Fusion Transformer block to enhance local-global feature fusion, and the Focusing Attention Bottleneck module to suppress ultrasound noise via an adaptive threshold. In the measurement stage, a Point Regression module is introduced to refine biometric landmark localization. Furthermore, we create a fully annotated ultrasound plane dataset for the estimation of gestational age across various gestational stages. Extensive experiments on the dataset have demonstrated the effectiveness of the whole model and each module. Our MsGA model is superior to existing models with fewer parameters and achieves state-of-the-art performance on the Gestational Age Estimation task.

AAAI Conference 2026 Conference Paper

ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders

  • Xiangyu Liu
  • Haodi Lei
  • Yi Liu
  • Yang Liu
  • Wei Hu

Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically-guided SAE, called ProtSAE. Unlike existing SAEs, which require annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.
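
As background, a minimal PyTorch sketch of the plain SAE this work builds on: an overcomplete ReLU latent trained with reconstruction plus L1 sparsity. ProtSAE's semantic guidance from annotations and domain knowledge is not modeled here, and the sizes and penalty weight are placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Plain SAE over model activations (no semantic guidance)."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)   # typically d_latent >> d_model
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))               # sparse feature activations
        return self.dec(z), z

def sae_loss(sae, h, l1=1e-3):
    """Reconstruction fidelity plus a sparsity penalty on the latent."""
    h_hat, z = sae(h)
    return ((h_hat - h) ** 2).mean() + l1 * z.abs().mean()
```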

TMLR Journal 2026 Journal Article

Sparse Mean Estimation in Adversarial Settings via Incremental Learning

  • Jianhao Ma
  • Rui Ray Chen
  • Yinghui He
  • Salar Fattahi
  • Wei Hu

In this paper, we study the problem of sparse mean estimation under adversarial corruptions, where the goal is to estimate the $k$-sparse mean of a heavy-tailed distribution from samples contaminated by adversarial noise. Existing methods face two key limitations: they require prior knowledge of the sparsity level $k$ and scale poorly in high-dimensional settings. We propose a simple and scalable estimator that addresses both challenges. Specifically, it learns the $k$-sparse mean without knowing $k$ in advance and operates in near-linear time and memory with respect to the ambient dimension. Under a moderate signal-to-noise ratio, our method achieves the optimal statistical rate, matching the information-theoretic lower bound. Extensive simulations corroborate our theoretical guarantees. At the heart of our approach is an incremental learning phenomenon: we show that a basic subgradient method applied to a nonconvex two-layer formulation with an $\ell_1$-loss can incrementally learn the $k$ nonzero components of the true mean while suppressing the rest. More broadly, our work is the first to reveal the incremental learning phenomenon of the subgradient method in the presence of heavy-tailed distributions and adversarial corruption.
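
A toy numpy sketch of the kind of nonconvex two-layer formulation plus subgradient method the abstract refers to, assuming a Hadamard-product parameterization theta = u * v of the mean and plain subgradient steps on the l1 loss; the paper's exact formulation and step sizes may differ.

```python
import numpy as np

def subgrad_sparse_mean(X, steps=2000, lr=0.01, init_scale=1e-3, seed=0):
    """Subgradient descent on f(u, v) = mean_i ||u * v - x_i||_1. Small
    initialization drives the coordinate-by-coordinate ("incremental
    learning") behavior the abstract describes."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    u = init_scale * rng.standard_normal(d)
    v = init_scale * rng.standard_normal(d)
    for _ in range(steps):
        g = np.sign(u * v - X).mean(axis=0)    # subgradient of the l1 loss in theta
        u, v = u - lr * g * v, v - lr * g * u  # chain rule through theta = u * v
    return u * v
```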

AAAI Conference 2026 Conference Paper

Spatial-Spectral Homogeneous Attacks on Physical-World Large Vision-Language Models

  • Daizong Liu
  • Baoquan Chen
  • Wei Hu

Although large vision-language models (LVLMs) have demonstrated promising versatile capabilities on various downstream tasks, they are shown to be susceptible to adversarial examples. Existing LVLM attackers simply implement adversarial patterns in an impracticable setting: i) they add digital global perturbations to the entire input image; ii) they access prior knowledge of LVLMs for optimization; iii) they do not consider realistic transformations. These assumptions make them difficult to deploy in physical-world attack scenarios. Motivated by this research gap, this paper proposes the first practical LVLM attack method based on a novel adversarial patch design, which works in both physical and digital attack settings without using any LVLM details. In particular, we introduce adversarial homogeneous constraints in both the spatial and spectral domains to improve the patch's stealthiness against potential real-world defenses. Besides, we also develop a new technique for synthesizing reasonably realistic transformations that capture the expected patch appearance variations in daily life. Extensive experiments are conducted to verify the strong adversarial capabilities of our proposed attack against prevalent LVLMs spanning a spectrum of tasks.

NeurIPS Conference 2025 Conference Paper

Benign Overfitting in Single-Head Attention

  • Roey Magen
  • Shuning Shang
  • Zhiwei Xu
  • Spencer Frei
  • Wei Hu
  • Gal Vardi

The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks. In this work, we study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers. We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent. Moreover, we show conditions where a minimum-norm/maximum-margin interpolator exhibits benign overfitting. We study how the overfitting behavior depends on the signal-to-noise ratio (SNR) of the data distribution, namely, the ratio between norms of signal and noise tokens, and prove that a sufficiently large SNR is both necessary and sufficient for benign overfitting.

AAAI Conference 2025 Conference Paper

Controllable Protein Sequence Generation with LLM Preference Optimization

  • Xiangyu Liu
  • Yi Liu
  • Silei Chen
  • Wei Hu

Designing proteins with specific attributes offers an important solution to address biomedical challenges. Pre-trained protein large language models (LLMs) have shown promising results on protein sequence generation. However, to control sequence generation for specific attributes, existing work still exhibits poor functionality and structural stability. In this paper, we propose a novel controllable protein design method called CtrlProt. We finetune a protein LLM with a new multi-listwise preference optimization strategy to improve generation quality and support multi-attribute controllable generation. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation.

ICLR Conference 2025 Conference Paper

Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model

  • Zhiwei Xu
  • Zhiyu Ni
  • Yixin Wang
  • Wei Hu

"Grokking" is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. While intriguing, this delayed generalization phenomenon compromises predictability and efficiency. Ideally, models should generalize directly without delay. To this end, this paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks, based on the key observation that data embedding plays a crucial role in determining whether generalization is delayed. GrokTransfer first trains a smaller, weaker model to reach a nontrivial (but far from optimal) test performance. Then, the learned input embedding from this weaker model is extracted and used to initialize the embedding in the target, stronger model. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay. Moreover, we demonstrate that, across empirical studies of different tasks, GrokTransfer effectively reshapes the training dynamics and eliminates delayed generalization, for both fully-connected neural networks and Transformers.
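
A minimal PyTorch sketch of the embedding-transfer step; the toy architecture, attribute names, and widths below are hypothetical stand-ins for the paper's weak and strong models.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Toy model: input embedding followed by an MLP head."""
    def __init__(self, vocab=100, d_embed=32, width=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_embed)
        self.head = nn.Sequential(
            nn.Linear(d_embed, width), nn.ReLU(), nn.Linear(width, vocab))

    def forward(self, tokens):
        return self.head(self.embed(tokens))

weak, strong = TinyNet(width=16), TinyNet(width=512)
# ... train `weak` to nontrivial (not necessarily good) test accuracy ...
with torch.no_grad():
    strong.embed.weight.copy_(weak.embed.weight)  # transfer the learned embedding
# ... then train `strong` from this initialization; the paper reports this
# removes the grokking delay on tasks where it otherwise always occurs ...
```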

ICLR Conference 2025 Conference Paper

Swing-by Dynamics in Concept Learning and Compositional Generalization

  • Yongyi Yang
  • Core Francisco Park
  • Ekdeep Singh Lubana
  • Maya Okawa
  • Wei Hu
  • Hidenori Tanaka

Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence the speed at which a model learns to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work's compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the identity mapping on a Gaussian mixture with structurally organized centroids. We mathematically analyze the learning dynamics of neural networks trained on this SIM task and show that, despite its simplicity, SIM's learning dynamics capture and help explain key empirical observations on compositional generalization with diffusion models identified in prior work. Our theory also offers several new insights; e.g., we find a novel mechanism for non-monotonic learning dynamics of test loss in early phases of training. We validate our new predictions by training a text-conditioned diffusion model, bridging our simplified framework and complex generative models. Overall, this work establishes the SIM task as a meaningful theoretical abstraction of concept learning dynamics in modern generative models.

NeurIPS Conference 2025 Conference Paper

Towards Building Model/Prompt-Transferable Attackers against Large Vision-Language Models

  • Xiaowen Cai
  • Daizong Liu
  • Xiaoye Qu
  • Xiang Fang
  • Jianfeng Dong
  • Keke Tang
  • Pan Zhou
  • Lichao Sun

Although Large Vision-Language Models (LVLMs) exhibit impressive multimodal capabilities, their vulnerability to adversarial examples has raised serious security concerns. Existing LVLM attackers simply optimize adversarial images that easily overfit a certain model/prompt, making them ineffective once they are transferred to attack a different model/prompt. Motivated by this research gap, this paper aims to develop a more powerful attack that is transferable to black-box LVLM models of different structures and task-aware prompts of different semantics. Specifically, we introduce a new perspective of information theory to investigate LVLMs' transferable characteristics by exploring the relative dependence between outputs of the LVLM model and input adversarial samples. Our empirical observations suggest that enlarging/decreasing the mutual information between outputs and the disentangled adversarial/benign patterns of input images helps to generate more agnostic perturbations for misleading LVLMs' perception with better transferability. In particular, we formulate the complicated calculation of information gain as an estimation problem and incorporate such informative constraints into the adversarial learning process. Extensive experiments on various LVLM models/prompts demonstrate our significant transfer-attack performance.

JMLR Journal 2025 Journal Article

Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination

  • Peng Wang
  • Xiao Li
  • Can Yaras
  • Zhihui Zhu
  • Laura Balzano
  • Wei Hu
  • Qing Qu

Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximately low-rank: each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Moreover, our extensive experiments not only validate our theoretical results but also reveal a similar pattern in deep nonlinear networks, which aligns well with recent empirical studies. Finally, we demonstrate the practical value of our results in transfer learning.
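
One generic way to instantiate such layerwise metrics is a within-class to between-class scatter ratio of a layer's features; this numpy sketch is a common variant, not necessarily the paper's exact definitions.

```python
import numpy as np

def within_between_ratio(features, labels):
    """Within-class scatter over between-class scatter for one layer's
    features of shape (n, d); a ratio that shrinks with depth indicates
    progressive compression and discrimination."""
    mu = features.mean(axis=0)
    within, between, n = 0.0, 0.0, len(features)
    for c in np.unique(labels):
        Fc = features[labels == c]
        mu_c = Fc.mean(axis=0)
        within += ((Fc - mu_c) ** 2).sum() / n
        between += (len(Fc) / n) * ((mu_c - mu) ** 2).sum()
    return within / between
```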

NeurIPS Conference 2025 Conference Paper

What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers

  • Pulkit Gopalani
  • Wei Hu

Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representational collapse. We validate that these identified phenomena—repetition bias and representation collapse—are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.

NeurIPS Conference 2024 Conference Paper

A Prompt-Based Knowledge Graph Foundation Model for Universal In-Context Reasoning

  • Yuanning Cui
  • Zequn Sun
  • Wei Hu

Extensive knowledge graphs (KGs) have been constructed to facilitate knowledge-driven tasks across various scenarios. However, existing work usually develops separate reasoning models for different KGs, lacking the ability to generalize and transfer knowledge across diverse KGs and reasoning settings. In this paper, we propose a prompt-based KG foundation model via in-context learning, namely KG-ICL, to achieve a universal reasoning ability. Specifically, we introduce a prompt graph centered with a query-related example fact as context to understand the query relation. To encode prompt graphs with the generalization ability to unseen entities and relations in queries, we first propose a unified tokenizer that maps entities and relations in prompt graphs to predefined tokens. Then, we propose two message passing neural networks to perform prompt encoding and KG reasoning, respectively. We conduct evaluation on 43 different KGs in both transductive and inductive settings. Results indicate that the proposed KG-ICL outperforms baselines on most datasets, showcasing its outstanding generalization and universal reasoning capabilities. The source code is accessible on GitHub: https://github.com/nju-websoft/KG-ICL.

NeurIPS Conference 2024 Conference Paper

Abrupt Learning in Transformers: A Case Study on Matrix Completion

  • Pulkit Gopalani
  • Ekdeep S. Lubana
  • Wei Hu

Recent analysis on the training dynamics of Transformers has unveiled an interesting characteristic: the training loss plateaus for a significant number of training steps, and then suddenly (and sharply) drops to near-optimal values. To understand this phenomenon in depth, we formulate the low-rank matrix completion problem as a masked language modeling (MLM) task, and show that it is possible to train a BERT model to solve this task to low error. Furthermore, the loss curve shows a plateau early in training followed by a sudden drop to near-optimal values, despite no changes in the training procedure or hyper-parameters. To gain interpretability insights into this sudden drop, we examine the model's predictions, attention heads, and hidden states before and after this transition. Concretely, we observe that (a) the model transitions from simply copying the masked input to accurately predicting the masked entries; (b) the attention heads transition to interpretable patterns relevant to the task; and (c) the embeddings and hidden states encode information relevant to the problem. We also analyze the training dynamics of individual model components to understand the sudden drop in loss.
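
To make the task setup concrete, a small numpy sketch of one masked matrix-completion instance; the BERT-style tokenization of matrix entries is omitted, and the [MASK] stand-in and sizes are illustrative.

```python
import numpy as np

def masked_completion_instance(n=7, rank=2, p_mask=0.3, seed=0):
    """One low-rank matrix completion instance posed as a masked-token task:
    the model sees the matrix with some entries masked out and is trained
    to predict the masked entries."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
    mask = rng.random((n, n)) < p_mask
    inputs = M.copy()
    inputs[mask] = np.nan          # stand-in for the [MASK] token
    return inputs, mask, M[mask]   # observed matrix, mask, prediction targets
```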

ICLR Conference 2024 Conference Paper

Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data

  • Zhiwei Xu
  • Yutong Wang 0002
  • Spencer Frei
  • Gal Vardi
  • Wei Hu

Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize near-optimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfitting---achieving a perfect fit to training data with near-random performance on test data---before transitioning ("grokking") to near-optimal generalization later in training. In this work, we show that both of these phenomena provably occur in two-layer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100% training accuracy, perfectly fitting the noisy labels in the training data, but achieves near-random test accuracy. At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a "grokking" phenomenon. This provides the first theoretical result of benign overfitting in neural network classification when the data distribution is not linearly separable. Our proofs rely on analyzing the feature learning process under GD, which reveals that the network implements a non-generalizable linear classifier after one step and gradually learns generalizable features in later steps.
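
A small numpy sketch of XOR cluster data with flipped labels in the spirit of this setup; the cluster means, noise scale, and flip rate are illustrative choices, not the paper's.

```python
import numpy as np

def xor_cluster_data(n=500, d=50, flip=0.15, seed=0):
    """XOR cluster data with label noise: class +1 points near +/-mu1 and
    class -1 points near +/-mu2, with mu1 orthogonal to mu2 and a constant
    fraction of training labels flipped."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = np.zeros(d), np.zeros(d)
    mu1[0] = mu2[1] = 4.0                          # illustrative signal strength
    is_pos = rng.random(n) < 0.5
    signs = rng.choice([-1.0, 1.0], size=n)
    centers = np.where(is_pos[:, None], mu1, mu2) * signs[:, None]
    X = centers + rng.standard_normal((n, d))
    y = np.where(is_pos, 1, -1)
    y[rng.random(n) < flip] *= -1                  # flipped labels
    return X, y
```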

TMLR Journal 2024 Journal Article

Bias Amplification Enhances Minority Group Performance

  • Gaotang Li
  • Jiarui Liu
  • Wei Hu

Neural networks produced by standard training are known to suffer from poor accuracy on rare subgroups despite achieving high accuracy on average, due to the correlations between certain spurious features and labels. Previous approaches based on worst-group loss minimization (e.g. Group-DRO) are effective in improving worst-group accuracy but require expensive group annotations for all the training samples. In this paper, we focus on the more challenging and realistic setting where group annotations are only available on a small validation set or are not available at all. We propose BAM, a novel two-stage training algorithm: in the first stage, the model is trained using a bias amplification scheme via introducing a learnable auxiliary variable for each training sample; in the second stage, we upweight the samples that the bias-amplified model misclassifies, and then continue training the same model on the reweighted dataset. Empirically, BAM achieves competitive performance compared with existing methods evaluated on spurious correlation benchmarks in computer vision and natural language processing. Moreover, we find a simple stopping criterion based on minimum class accuracy difference that can remove the need for group annotations, with little or no loss in worst-group accuracy. We perform extensive analyses and ablations to verify the effectiveness and robustness of our algorithm in varying class and group imbalance ratios.
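
A minimal PyTorch sketch of the two stages; the auxiliary-variable wiring and upweighting rule are assumptions about the general shape of the method, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# aux = torch.zeros(n_train, n_classes, requires_grad=True)  # one row per sample

def stage1_loss(model, aux, x, y, idx, lam=0.5):
    """Stage 1 (bias amplification): each training sample i carries a
    learnable logit offset aux[i]; easy, bias-aligned samples get absorbed
    by the offsets, amplifying reliance on spurious features."""
    logits = model(x) + lam * aux[idx]
    return F.cross_entropy(logits, y)

def stage2_weights(model, x, y, upweight=20.0):
    """Stage 2: upweight the samples the bias-amplified model misclassifies,
    then continue training the same model on the reweighted data."""
    with torch.no_grad():
        wrong = (model(x).argmax(dim=1) != y).float()
    return 1.0 + (upweight - 1.0) * wrong          # per-sample loss weights
```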

ICML Conference 2024 Conference Paper

DFlow: A Generative Model Combining Denoising AutoEncoder and Normalizing Flow for High Fidelity Waveform Generation

  • Chenfeng Miao
  • Qingying Zhu
  • Minchuan Chen
  • Wei Hu
  • Zijian Li
  • Shaojun Wang
  • Jing Xiao 0006

In this work, we present DFlow, a novel generative framework that combines Normalizing Flow (NF) with a Denoising AutoEncoder (DAE), for high-fidelity waveform generation. With a carefully designed structure, DFlow seamlessly integrates the capabilities of both NF and DAE, resulting in a significantly improved performance compared to the standard NF models. Experimental results showcase DFlow’s superiority, achieving the highest MOS score among the existing methods on commonly used datasets and the fastest synthesis speed among all likelihood models. We further demonstrate the generalization ability of DFlow by generating high-quality out-of-distribution audio samples, such as singing and music audio. Additionally, we extend the model capacity of DFlow by scaling up both the model size and training set size. Our large-scale universal vocoder, DFlow-XL, achieves highly competitive performance against the best universal vocoder, BigVGAN.

AAAI Conference 2024 Conference Paper

DHGCN: Dynamic Hop Graph Convolution Network for Self-Supervised Point Cloud Learning

  • Jincen Jiang
  • Lizhi Zhao
  • Xuequan Lu
  • Wei Hu
  • Imran Razzak
  • Meili Wang

Recent works attempt to extend Graph Convolution Networks (GCNs) to point clouds for classification and segmentation tasks. These works tend to sample and group points to create smaller point sets locally and mainly focus on extracting local features through GCNs, while ignoring the relationship between point sets. In this paper, we propose the Dynamic Hop Graph Convolution Network (DHGCN) for explicitly learning the contextual relationships between the voxelized point parts, which are treated as graph nodes. Motivated by the intuition that the contextual information between point parts lies in the pairwise adjacent relationship, which can be depicted by the hop distance of the graph quantitatively, we devise a novel self-supervised part-level hop distance reconstruction task and design a novel loss function accordingly to facilitate training. In addition, we propose the Hop Graph Attention (HGA), which takes the learned hop distance as input for producing attention weights to allow edge features to contribute distinctively in aggregation. Eventually, the proposed DHGCN is a plug-and-play module that is compatible with point-based backbone networks. Comprehensive experiments on different backbones and tasks demonstrate that our self-supervised method achieves state-of-the-art performance. Our source codes are available at: https://github.com/Jinec98/DHGCN.

AAAI Conference 2024 Conference Paper

Explicitly Perceiving and Preserving the Local Geometric Structures for 3D Point Cloud Attack

  • Daizong Liu
  • Wei Hu

Deep learning models for point clouds have been shown to be vulnerable to adversarial attacks, which have received increasing attention in various safety-critical applications such as autonomous driving, robotics, and surveillance. Existing 3D attack methods generally employ global distance losses to implicitly constrain the point-wise perturbations for optimization. However, these simple losses are quite difficult to accurately measure and restrict the proper 3D geometry as point clouds are highly structured. Although a few recent works try to exploit additional shape-aware surface knowledge to globally constrain the point position, they still fail to preserve the detailed point-to-point geometric dependency in different local regions. To this end, in this paper, we propose a novel Multi-grained Geometry-aware Attack (MGA), which explicitly captures the local topology characteristics in different 3D regions for adversarial constraint. Specifically, we first develop multi-scale spectral local filter banks adapting to different 3D object shapes to explore potential geometric structures in local regions. Considering that objects may contain complex geometries, we then extend each filter bank into multi-layer ones to gradually capture the topology contexts of the same region in a coarse-to-fine manner. Hence, the focused local geometric structures will be highlighted in the coefficients calculated by the filtering process. At last, by restricting these coefficients between benign and adversarial samples, our MGA is able to properly measure and preserve the detailed geometry contexts in the whole 3D object with trivial perturbations. Extensive experiments demonstrate that our attack can achieve superior performance on various 3D classification models, with satisfying adversarial imperceptibility and strong resistance to different defense methods.

ICLR Conference 2024 Conference Paper

How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations

  • Tianyu Guo 0004
  • Wei Hu
  • Song Mei
  • Huan Wang 0016
  • Caiming Xiong
  • Silvio Savarese
  • Yu Bai 0017

While large language models based on the transformer architecture have demonstrated remarkable in-context learning (ICL) capabilities, understanding of such capabilities is still at an early stage, where existing theory and mechanistic understanding focus mostly on simple scenarios such as learning simple function classes. This paper takes initial steps toward understanding ICL in more complex scenarios, by studying learning with representations. Concretely, we construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function, composed with a linear function that differs in each instance. By construction, the optimal ICL algorithm first transforms the inputs by the representation function, and then performs linear ICL on top of the transformed dataset. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size. Empirically, we find trained transformers consistently achieve near-optimal ICL performance in this setting, and exhibit the desired dissection where lower layers transform the dataset and upper layers perform linear ICL. Through extensive probing and a new pasting experiment, we further reveal several mechanisms within the trained transformers, such as concrete copying behaviors on both the inputs and the representations, linear ICL capability of the upper layers alone, and a post-ICL representation selection mechanism in a harder mixture setting. These observed mechanisms align well with our theory and may shed light on how transformers perform ICL in more realistic scenarios.
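
A tiny numpy sketch of the synthetic construction: a representation function that is fixed across instances, composed with a linear map drawn fresh per instance. The tanh representation here is an arbitrary stand-in, not the paper's choice.

```python
import numpy as np

def icl_instance(n_examples=20, d=10, rep_seed=0, task_seed=1):
    """One synthetic ICL instance: label = linear(rep(input)), where rep is
    shared across instances (fixed rep_seed) and the linear map w is fresh
    per instance (task_seed varies)."""
    R = np.random.default_rng(rep_seed).standard_normal((d, d))  # fixed rep params
    rng = np.random.default_rng(task_seed)
    w = rng.standard_normal(d)                 # per-instance linear map
    X = rng.standard_normal((n_examples, d))
    y = np.tanh(X @ R) @ w                     # label = linear(rep(input))
    return X, y
```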

AAAI Conference 2024 Conference Paper

Knowledge Graph Error Detection with Contrastive Confidence Adaption

  • Xiangyu Liu
  • Yang Liu
  • Wei Hu

Knowledge graphs (KGs) often contain various errors. Previous works on detecting errors in KGs mainly rely on triplet embedding from graph structure. We conduct an empirical study and find that these works struggle to discriminate noise from semantically-similar correct triplets. In this paper, we propose a KG error detection model CCA to integrate both textual and graph structural information from triplet reconstruction for better distinguishing semantics. We design interactive contrastive learning to capture the differences between textual and structural patterns. Furthermore, we construct realistic datasets with semantically-similar noise and adversarial noise. Experimental results demonstrate that CCA outperforms state-of-the-art baselines, especially on semantically-similar noise and adversarial noise.

JBHI Journal 2024 Journal Article

S3-Net: A Self-Supervised Dual-Stream Network for Radiology Report Generation

  • Renjie Pan
  • Ruisheng Ran
  • Wei Hu
  • Wenfeng Zhang
  • Qibing Qin
  • Shaoguo Cui

Intelligent medicine is eager to automatically generate radiology reports to ease the tedious work of radiologists. Previous research mainly focused on text generation with an encoder-decoder structure, while CNN-based visual feature extractors ignored long-range dependencies correlated with textual information. Besides, few studies exploit cross-modal mappings to promote radiology report generation. To alleviate the above problems, we propose a novel end-to-end radiology report generation model dubbed Self-Supervised dual-Stream Network (S3-Net). Specifically, a Dual-Stream Visual Feature Extractor (DSVFE) composed of ResNet and SwinTransformer is proposed to capture more abundant and effective visual features, where the former focuses on local response and the latter explores long-range dependencies. Then, we introduce the Fusion Alignment Module (FAM) to fuse the dual-stream visual features and facilitate alignment between visual features and text features. Furthermore, Self-Supervised Learning with Mask (SSLM) is introduced to further enhance the visual feature representation ability. Experimental results on two mainstream radiology reporting datasets (IU X-ray and MIMIC-CXR) show that our proposed approach outperforms previous models in terms of language generation metrics.

AAAI Conference 2024 Conference Paper

Understanding Surprising Generalization Phenomena in Deep Learning

  • Wei Hu

Deep learning has exhibited a number of surprising generalization phenomena that are not captured by classical statistical learning theory. This talk will survey some of my work on the theoretical characterizations of several such intriguing phenomena: (1) Implicit regularization: A major mystery in deep learning is that deep neural networks can often generalize well despite their excessive expressive capacity. Towards explaining this mystery, it has been suggested that commonly used gradient-based optimization algorithms enforce certain implicit regularization which effectively constrains the model capacity. (2) Benign overfitting: In certain scenarios, a model can perfectly fit noisily labeled training data, but still achieves near-optimal test error at the same time, which is very different from the classical notion of overfitting. (3) Grokking: In certain scenarios, a model initially achieves perfect training accuracy but no generalization (i.e., no better than a random predictor), and upon further training, transitions to almost perfect generalization. Theoretically establishing these properties often involves making appropriate high-dimensional assumptions on the problem as well as a careful analysis of the training dynamics.

ICML Conference 2023 Conference Paper

Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations

  • Yongyi Yang
  • Jacob Steinhardt
  • Wei Hu

Recent work has observed an intriguing "Neural Collapse" phenomenon in well-trained neural networks, where the last-layer representations of training samples with the same label collapse into each other. This appears to suggest that the last-layer representations are completely determined by the labels, and do not depend on the intrinsic structure of the input distribution. We provide evidence that this is not a complete description, and that the apparent collapse hides important fine-grained structure in the representations. Specifically, even when representations apparently collapse, the small amount of remaining variation can still faithfully and accurately capture the intrinsic structure of the input distribution. As an example, if we train on CIFAR-10 using only 5 coarse-grained labels (by combining two classes into one super-class) until convergence, we can reconstruct the original 10-class labels from the learned representations via unsupervised clustering. The reconstructed labels achieve 93% accuracy on the CIFAR-10 test set, nearly matching the normal CIFAR-10 accuracy for the same architecture. We also provide an initial theoretical result showing the fine-grained representation structure in a simplified synthetic setting. Our results show concretely how the structure of input data can play a significant role in determining the fine-grained structure of neural representations, going beyond what Neural Collapse predicts.
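
The reconstruction experiment is easy to sketch: cluster the residual within-class variation of the last-layer features. This hypothetical helper uses scikit-learn's KMeans; feature extraction and the coarse-label training run are assumed done elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans

def recover_fine_labels(feats, coarse_labels, k_per_class=2):
    """Cluster features within each coarse class; in the abstract's CIFAR-10
    experiment (5 super-classes, k_per_class=2) this recovers the original
    10 classes from last-layer representations."""
    fine = np.zeros(len(coarse_labels), dtype=int)
    for c in np.unique(coarse_labels):
        idx = np.where(coarse_labels == c)[0]
        km = KMeans(n_clusters=k_per_class, n_init=10).fit(feats[idx])
        fine[idx] = c * k_per_class + km.labels_
    return fine   # evaluate against true fine labels up to label permutation
```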

IJCAI Conference 2023 Conference Paper

Enabling Abductive Learning to Exploit Knowledge Graph

  • Yu-Xuan Huang
  • Zequn Sun
  • Guangyao Li
  • Xiaobin Tian
  • Wang-Zhou Dai
  • Wei Hu
  • Yuan Jiang
  • Zhi-Hua Zhou

Most systems integrating data-driven machine learning with knowledge-driven reasoning usually rely on a specifically designed knowledge base to enable efficient symbolic inference. However, it could be cumbersome for the nonexpert end-users to prepare such a knowledge base in real tasks. Recent years have witnessed the success of large-scale knowledge graphs, which could be ideal domain knowledge resources for real-world machine learning tasks. However, these large-scale knowledge graphs usually contain much information that is irrelevant to a specific learning task. Moreover, they often contain a certain degree of noise. Existing methods can hardly make use of them because the large-scale probabilistic logical inference is usually intractable. To address these problems, we present ABductive Learning with Knowledge Graph (ABL-KG) that can automatically mine logic rules from knowledge graphs during learning, using a knowledge forgetting mechanism for filtering out irrelevant information. Meanwhile, these rules can form a logic program that enables efficient joint optimization of the machine learning model and logic inference within the Abductive Learning (ABL) framework. Experiments on four different tasks show that ABL-KG can automatically extract useful rules from large-scale and noisy knowledge graphs, and significantly improve the performance of machine learning with only a handful of labeled data.

NeurIPS Conference 2023 Conference Paper

Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity

  • Zhanpeng Zhou
  • Yongyi Yang
  • Xiaojiang Yang
  • Junchi Yan
  • Wei Hu

Recent work has revealed many intriguing empirical phenomena in neural network training, despite the poorly understood and highly complex loss landscapes and training dynamics. One of these phenomena, Linear Mode Connectivity (LMC), has gained considerable attention due to the intriguing observation that different solutions can be connected by a linear path in the parameter space while maintaining near-constant training and test losses. In this work, we introduce a stronger notion of linear connectivity, Layerwise Linear Feature Connectivity (LLFC), which says that the feature maps of every layer in different trained networks are also linearly connected. We provide comprehensive empirical evidence for LLFC across a wide range of settings, demonstrating that whenever two trained networks satisfy LMC (via either spawning or permutation methods), they also satisfy LLFC in nearly all the layers. Furthermore, we delve deeper into the underlying factors contributing to LLFC, which reveal new insights into the permutation approaches. The study of LLFC transcends and advances our understanding of LMC by adopting a feature-learning perspective.
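
A minimal PyTorch check of LLFC at one layer, assuming the caller has already extracted feature maps from the two endpoint networks and from the network with midpoint-interpolated weights; cosine similarity is used as a scale-insensitive proxy.

```python
import torch
import torch.nn.functional as F

def llfc_score(feats_a, feats_b, feats_mid):
    """LLFC test at one layer: features of the midpoint-interpolated network
    should match (up to scale) the average of the two endpoint networks'
    features; a mean cosine similarity near 1 at every layer supports LLFC."""
    target = 0.5 * (feats_a + feats_b)
    return F.cosine_similarity(
        feats_mid.flatten(1), target.flatten(1), dim=1).mean()
```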

ICLR Conference 2023 Conference Paper

Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

  • Spencer Frei
  • Gal Vardi
  • Peter L. Bartlett
  • Nathan Srebro
  • Wei Hu

The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.

AAAI Conference 2023 Conference Paper

Lifelong Embedding Learning and Transfer for Growing Knowledge Graphs

  • Yuanning Cui
  • Yuxin Wang
  • Zequn Sun
  • Wenqiang Liu
  • Yiqiao Jiang
  • Kexin Han
  • Wei Hu

Existing knowledge graph (KG) embedding models have primarily focused on static KGs. However, real-world KGs do not remain static, but rather evolve and grow in tandem with the development of KG applications. Consequently, new facts and previously unseen entities and relations continually emerge, necessitating an embedding model that can quickly learn and transfer new knowledge through growth. Motivated by this, we delve into an expanding field of KG embedding in this paper, i.e., lifelong KG embedding. We consider knowledge transfer and retention of the learning on growing snapshots of a KG without having to learn embeddings from scratch. The proposed model includes a masked KG autoencoder for embedding learning and update, with an embedding transfer strategy to inject the learned knowledge into the new entity and relation embeddings, and an embedding regularization method to avoid catastrophic forgetting. To investigate the impacts of different aspects of KG growth, we construct four datasets to evaluate the performance of lifelong KG embedding. Experimental results show that the proposed model outperforms the state-of-the-art inductive and lifelong embedding baselines.

AAAI Conference 2022 Conference Paper

Ensemble Semi-supervised Entity Alignment via Cycle-Teaching

  • Kexuan Xin
  • Zequn Sun
  • Wen Hua
  • Bing Liu
  • Wei Hu
  • Jianfeng Qu
  • Xiaofang Zhou

Entity alignment is to find identical entities in different knowledge graphs. Although embedding-based entity alignment has recently achieved remarkable progress, training data insufficiency remains a critical challenge. Conventional semi-supervised methods also suffer from incorrect entity alignment in newly proposed training data. To resolve these issues, we design an iterative cycle-teaching framework for semi-supervised entity alignment. The key idea is to train multiple entity alignment models (called aligners) simultaneously and let each aligner iteratively teach its successor the proposed new entity alignment. We propose a diversity-aware alignment selection method to choose reliable entity alignment for each aligner. We also design a conflict resolution mechanism to resolve the alignment conflict when combining the new alignment of an aligner and that from its teacher. Besides, considering the influence of cycle-teaching order, we elaborately design a strategy to arrange the optimal order that can maximize the overall performance of multiple aligners. The cycle-teaching process can break the limitations of each model’s learning capability and reduce the noise in new training data, leading to improved performance. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed cycle-teaching framework, which significantly outperforms the state-of-the-art models when the training data is insufficient and the new entity alignment has much noise.

ICML Conference 2022 Conference Paper

More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize

  • Alexander Wei 0001
  • Wei Hu
  • Jacob Steinhardt

Of theories for why large-scale machine learning models generalize despite being vastly overparameterized, which of their assumptions are needed to capture the qualitative phenomena of generalization in the real world? On one hand, we find that most theoretical analyses fall short of capturing these qualitative phenomena even for kernel regression, when applied to kernels derived from large-scale neural networks (e.g., ResNet-50) and real data (e.g., CIFAR-100). On the other hand, we find that the classical GCV estimator (Craven and Wahba, 1978) accurately predicts generalization risk even in such overparameterized settings. To bolster this empirical finding, we prove that the GCV estimator converges to the generalization risk whenever a local random matrix law holds. Finally, we apply this random matrix theory lens to explain why pretrained representations generalize better as well as what factors govern scaling laws for kernel regression. Our findings suggest that random matrix theory, rather than just being a toy model, may be central to understanding the properties of neural representations in practice.
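
The GCV estimator itself is compact for kernel ridge regression; a numpy sketch, where K is the n-by-n kernel matrix (e.g., derived from a neural network) and lam the ridge parameter.

```python
import numpy as np

def gcv_risk(K, y, lam):
    """Classical generalized cross-validation estimate:
    GCV = (1/n) * ||(I - S) y||^2 / (1 - tr(S)/n)^2,
    where S = K (K + lam*I)^{-1} is the smoother matrix."""
    n = len(y)
    S = K @ np.linalg.inv(K + lam * np.eye(n))
    resid = y - S @ y
    return (resid @ resid / n) / (1.0 - np.trace(S) / n) ** 2
```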

NeurIPS Conference 2022 Conference Paper

Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation

  • Botao Yu
  • Peiling Lu
  • Rui Wang
  • Wei Hu
  • Xu Tan
  • Wei Ye
  • Shikun Zhang
  • Tao Qin

Symbolic music generation aims to generate music scores automatically. A recent trend is to use Transformer or its variants in music generation, which is, however, suboptimal, because the full attention cannot efficiently model the typically long music sequences (e.g., over 10,000 tokens), and the existing models have shortcomings in generating musical repetition structures. In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token only attends to the summarization of the other bars rather than each token of them so as to reduce the computational cost. The advantages are two-fold. First, it can capture both music structure-related correlations via the fine-grained attention, and other contextual information via the coarse-grained attention. Second, it is efficient and can model over 3X longer music sequences compared to its full-attention counterpart. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structures.

TMLR Journal 2022 Journal Article

Representation Alignment in Neural Networks

  • Ehsan Imani
  • Wei Hu
  • Martha White

It is now a standard for neural network representations to be trained on large, publicly available datasets, and used for new problems. The reasons for why neural network representations have been so successful for transfer, however, are still not fully understood. In this paper we show that, after training, neural network representations align their top singular vectors to the targets. We investigate this representation alignment phenomenon in a variety of neural network architectures and find that (a) alignment emerges across a variety of different architectures and optimizers, with more alignment arising from depth; (b) alignment increases for layers closer to the output; and (c) existing high-performance deep CNNs exhibit high levels of alignment. We then highlight why alignment between the top singular vectors and the targets can speed up learning and show in a classic synthetic transfer problem that representation alignment correlates with positive and negative transfer to similar and dissimilar tasks.
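
One way to quantify this alignment (a generic variant, not necessarily the paper's exact measure): project the targets onto the left singular vectors of the feature matrix and see how much of their energy the top directions capture.

```python
import numpy as np

def alignment_curve(features, y):
    """Cumulative fraction of the targets' energy captured by the top-k left
    singular vectors of the (n, d) feature matrix, for k = 1..min(n, d).
    A curve that rises quickly indicates top-singular-vector alignment."""
    U, _, _ = np.linalg.svd(features, full_matrices=False)
    energy = (U.T @ y) ** 2        # energy of y along each singular direction
    return np.cumsum(energy) / (y @ y)
```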

ICML Conference 2021 Conference Paper

A Representation Learning Perspective on the Importance of Train-Validation Splitting in Meta-Learning

  • Nikunj Saunshi
  • Arushi Gupta
  • Wei Hu

An effective approach in meta-learning is to utilize multiple “train tasks” to learn a good initialization for model parameters that can help solve unseen “test tasks” with very few samples by fine-tuning from this initialization. Although successful in practice, theoretical understanding of such methods is limited. This work studies an important aspect of these methods: splitting the data from each task into train (support) and validation (query) sets during meta-training. Inspired by recent work (Raghu et al., 2020), we view such meta-learning methods through the lens of representation learning and argue that the train-validation split encourages the learned representation to be low-rank without compromising on expressivity, as opposed to the non-splitting variant that encourages high-rank representations. Since sample efficiency benefits from low-rankness, the splitting strategy will require very few samples to solve unseen test tasks. We present theoretical results that formalize this idea for linear representation learning on a subspace meta-learning instance, and experimentally verify this practical benefit of splitting in simulations and on standard meta-learning benchmarks.

AAAI Conference 2020 Conference Paper

Knowledge Graph Alignment Network with Gated Multi-Hop Neighborhood Aggregation

  • Zequn Sun
  • Chengming Wang
  • Wei Hu
  • Muhao Chen
  • Jian Dai
  • Wei Zhang
  • Yuzhong Qu

Graph neural networks (GNNs) have emerged as a powerful paradigm for embedding-based entity alignment due to their capability of identifying isomorphic subgraphs. However, in real knowledge graphs (KGs), the counterpart entities usually have non-isomorphic neighborhood structures, which easily causes GNNs to yield different representations for them. To tackle this problem, we propose a new KG alignment network, namely AliNet, aiming at mitigating the non-isomorphism of neighborhood structures in an end-to-end manner. As the direct neighbors of counterpart entities are usually dissimilar due to the schema heterogeneity, AliNet introduces distant neighbors to expand the overlap between their neighborhood structures. It employs an attention mechanism to highlight helpful distant neighbors and reduce noises. Then, it controls the aggregation of both direct and distant neighborhood information using a gating mechanism. We further propose a relation loss to refine entity representations. We perform thorough experiments with detailed ablation studies and analyses on five entity alignment datasets, demonstrating the effectiveness of AliNet.
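
A minimal PyTorch sketch of the gated combination of direct and distant neighborhood aggregations; the gate parameterization here is an assumption in the spirit of AliNet, not its exact layer.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Gated combination of a node's direct (1-hop) aggregation and its
    attention-weighted distant (multi-hop) aggregation."""
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(d, d)

    def forward(self, h_direct, h_distant):
        g = torch.sigmoid(self.gate(h_direct))       # per-dimension gate
        return g * h_direct + (1.0 - g) * h_distant  # controlled mixing
```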

ICLR Conference 2020 Conference Paper

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

  • Wei Hu
  • Lechao Xiao
  • Jeffrey Pennington

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
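
For reference, the standard recipe for drawing a weight matrix from the orthogonal group, essentially what torch.nn.init.orthogonal_ does; a numpy sketch.

```python
import numpy as np

def orthogonal_init(m, n, seed=0):
    """Draw an m x n matrix (m >= n) with orthonormal columns via QR of a
    Gaussian matrix, with a sign fix so the draw is uniform (Haar) over
    the orthogonal group."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((m, n)))
    return q * np.sign(np.diag(r))   # flip column signs to de-bias QR
```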

IS Journal 2020 Journal Article

Study on the Situational Awareness System of Mine Fire Rescue Using Faster Ross Girshick-Convolutional Neural Network

  • Jiuling Zhang
  • Yang Jia
  • Ding Zhu
  • Wei Hu
  • Zhenling Tang

With the continuous development of society and the advent of the big-data era, situational awareness systems have become increasingly prominent. Built on safety-related big data, they provide environmental, dynamic, and holistic awareness of security risks. This article therefore applies a situational awareness system to the mine fire rescue problem, with the aim of reducing the casualties and economic losses caused by mine fires. On this basis, a convolutional neural network is used for situational awareness: by progressively optimizing the algorithm from the region-based convolutional neural network (R-CNN) model to the Fast R-CNN model, an optimal Faster R-CNN model is finally proposed and applied to mine fire rescue.

NeurIPS Conference 2020 Conference Paper

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

  • Wei Hu
  • Lechao Xiao
  • Ben Adlam
  • Jeffrey Pennington

Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically. Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training.

IROS Conference 2020 Conference Paper

π-Map: A Decision-Based Sensor Fusion with Global Optimization for Indoor Mapping

  • Zhiliu Yang
  • Bo Yu 0014
  • Wei Hu
  • Jie Tang 0003
  • Shaoshan Liu
  • Chen Liu 0001

In this paper, we propose π-map, a tightly coupled fusion mechanism that dynamically consumes LiDAR and sonar data to generate reliable and scalable indoor maps for autonomous robot navigation. The key novelty of π-map over previous attempts is the utilization of a fusion mechanism that works in three stages: the first LiDAR scan matching stage efficiently generates initial key localization poses; the second optimization stage is used to eliminate errors accumulated from the previous stage and guarantees that accurate large-scale maps can be generated; then the final revisit scan fusion stage effectively fuses the LiDAR map and the sonar map to generate a highly accurate representation of the indoor environment. We evaluate π-map on both large and small environments and verify its superiority over existing fusion methods.
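
π-Map's fusion mechanism is more elaborate than any toy rule; purely to illustrate what decision-based fusion of two sensor grids means, here is a minimal per-cell sketch (the preference and tie-breaking rules are assumptions, not the authors' algorithm):

```python
import numpy as np

def fuse_grids(lidar, sonar, unknown=-1.0):
    """Toy per-cell decision fusion of two occupancy grids.

    Cells hold an occupancy probability in [0, 1], or `unknown` where the
    sensor produced no measurement (e.g., LiDAR through glass).
    """
    fused = np.where(lidar != unknown, lidar, sonar)         # prefer LiDAR where it saw the cell
    both = (lidar != unknown) & (sonar != unknown)
    fused = np.where(both, np.maximum(lidar, sonar), fused)  # be conservative about obstacles
    return fused

lidar = np.array([[0.1, -1.0], [0.9, 0.2]])
sonar = np.array([[0.2, 0.8], [-1.0, 0.1]])
print(fuse_grids(lidar, sonar))
```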

JBHI Journal 2019 Journal Article

Automated Layer Segmentation of Retinal Optical Coherence Tomography Images Using a Deep Feature Enhanced Structured Random Forests Classifier

  • Xiaoming Liu
  • Tianyu Fu
  • Zhifang Pan
  • Dong Liu
  • Wei Hu
  • Jun Liu
  • Kai Zhang

Optical coherence tomography (OCT) is a high-resolution and noninvasive imaging modality that has become one of the most prevalent techniques for ophthalmic diagnosis. Retinal layer segmentation is crucial for doctors to diagnose and study retinal diseases. However, manual segmentation is often a time-consuming and subjective process. In this work, we propose a new method for automatically segmenting retinal OCT images, which integrates deep features and hand-designed features to train a structured random forests classifier. The deep convolutional features are learned from a deep residual network. With the trained classifier, we can get the contour probability graph of each layer; finally, the shortest path is employed to achieve the final layer segmentation. The experimental results show that our method achieves good results, with a mean layer contour error of 1.215 pixels, whereas that of the state of the art is 1.464 pixels, and an F1-score of 0.885, which is also better than the 0.863 obtained by the state-of-the-art method.
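
The final shortest-path step can be illustrated with a simple column-wise dynamic program over a contour probability map (a sketch of the generic idea; the smoothness penalty is an assumption, not the authors' exact graph construction):

```python
import numpy as np

def shortest_path_boundary(prob, smooth=0.5):
    """Trace one boundary row per column through a (rows x cols) probability map.

    A pixel costs 1 - prob; moving between adjacent columns pays `smooth`
    per row of vertical jump. Returns the chosen row index per column.
    """
    rows, cols = prob.shape
    cost = 1.0 - prob
    acc = cost.copy()
    back = np.zeros((rows, cols), dtype=int)
    jump = smooth * np.abs(np.arange(rows)[:, None] - np.arange(rows)[None, :])
    for c in range(1, cols):
        total = acc[:, c - 1][None, :] + jump      # from every previous row to every row
        back[:, c] = np.argmin(total, axis=1)
        acc[:, c] = cost[:, c] + np.min(total, axis=1)
    path = np.empty(cols, dtype=int)
    path[-1] = int(np.argmin(acc[:, -1]))
    for c in range(cols - 1, 0, -1):               # backtrack the optimal path
        path[c - 1] = back[path[c], c]
    return path

prob = np.random.rand(64, 128)                     # stand-in for a contour probability map
print(shortest_path_boundary(prob)[:10])
```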

NeurIPS Conference 2019 Conference Paper

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

  • Rohith Kuditipudi
  • Xiang Wang
  • Holden Lee
  • Yi Zhang
  • Zhiyuan Li
  • Wei Hu
  • Rong Ge
  • Sanjeev Arora

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima, at least those discovered by gradient-based optimization, turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piecewise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.
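
The phenomenon is easy to probe numerically. The sketch below (illustrative, not the paper's construction) evaluates the loss along the straight segment between two trained copies of the same architecture; mode connectivity concerns the existence of low-loss paths with one or two bends precisely where this straight segment typically fails:

```python
import copy
import torch

def loss_along_segment(model_a, model_b, X, y, steps=11):
    """MSE loss at points interpolating model_a -> model_b in weight space."""
    losses = []
    probe = copy.deepcopy(model_a)
    pa, pb = list(model_a.parameters()), list(model_b.parameters())
    for t in torch.linspace(0, 1, steps):
        with torch.no_grad():
            for p, a, b in zip(probe.parameters(), pa, pb):
                p.copy_((1 - t) * a + t * b)       # point on the straight segment
            losses.append(torch.nn.functional.mse_loss(probe(X), y).item())
    return losses

net_a = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
net_b = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
X, y = torch.randn(128, 8), torch.randn(128, 1)
print(loss_along_segment(net_a, net_b, X, y))
```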

NeurIPS Conference 2019 Conference Paper

Implicit Regularization in Deep Matrix Factorization

  • Sanjeev Arora
  • Nadav Cohen
  • Wei Hu
  • Yuping Luo

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.
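
A small experiment conveys the finding (a sketch under illustrative hyperparameters that may need tuning, not the paper's setup): run gradient descent on a depth-N factorization for matrix completion and inspect the singular values of the end-to-end matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_rank, depth, lr = 30, 2, 2, 1.0     # depth > 1 strengthens the low-rank bias
M = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, n))
mask = rng.random((n, n)) < 0.3             # observed entries for matrix completion

def product(mats):
    out = np.eye(n)
    for m in mats:
        out = out @ m
    return out

Ws = [0.1 * rng.standard_normal((n, n)) for _ in range(depth)]
for _ in range(20000):
    # Gradient of the mean squared loss on observed entries.
    R = np.where(mask, product(Ws) - M, 0.0) / mask.sum()
    grads = [product(Ws[:i]).T @ R @ product(Ws[i + 1:]).T for i in range(depth)]
    for W, g in zip(Ws, grads):
        W -= lr * g

# With small initialization, the spectrum tends toward a few dominant values.
print(np.round(np.linalg.svd(product(Ws), compute_uv=False)[:6], 3))
```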

IJCAI Conference 2019 Conference Paper

Multi-view Knowledge Graph Embedding for Entity Alignment

  • Qingheng Zhang
  • Zequn Sun
  • Wei Hu
  • Muhao Chen
  • Lingbing Guo
  • Yuzhong Qu

We study the problem of embedding-based entity alignment between knowledge graphs (KGs). Previous works mainly focus on the relational structure of entities. Some further incorporate another type of features, such as attributes, for refinement. However, a vast number of entity features remain unexplored or are not treated equally, which impairs the accuracy and robustness of embedding-based entity alignment. In this paper, we propose a novel framework that unifies multiple views of entities to learn embeddings for entity alignment. Specifically, we embed entities based on the views of entity names, relations and attributes, with several combination strategies. Furthermore, we design some cross-KG inference methods to enhance the alignment between two KGs. Our experiments on real-world datasets show that the proposed framework significantly outperforms the state-of-the-art embedding-based entity alignment methods. The selected views, cross-KG inference and combination strategies all contribute to the performance improvement.
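
As a rough illustration of the view-combination idea (not the paper's actual strategies), per-view embeddings could be normalized, averaged, and aligned by nearest neighbor under cosine similarity:

```python
import numpy as np

def combine_views(views):
    """L2-normalize each view's embedding matrix, then average the views."""
    normed = [v / np.linalg.norm(v, axis=1, keepdims=True) for v in views]
    combo = np.mean(normed, axis=0)
    return combo / np.linalg.norm(combo, axis=1, keepdims=True)

def align(e1, e2):
    """For each KG1 entity, the most similar KG2 entity (cosine on unit vectors)."""
    return np.argmax(e1 @ e2.T, axis=1)

rng = np.random.default_rng(0)
name_v, rel_v, attr_v = (rng.standard_normal((100, 64)) for _ in range(3))
kg1 = combine_views([name_v, rel_v, attr_v])
kg2 = kg1 + 0.05 * rng.standard_normal(kg1.shape)    # noisy copy as the second KG
kg2 = kg2 / np.linalg.norm(kg2, axis=1, keepdims=True)
print((align(kg1, kg2) == np.arange(100)).mean())    # alignment accuracy
```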

NeurIPS Conference 2019 Conference Paper

On Exact Computation with an Infinitely Wide Neural Net

  • Sanjeev Arora
  • Simon Du
  • Wei Hu
  • Zhiyuan Li
  • Russ Salakhutdinov
  • Ruosong Wang

How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width” (namely, the number of channels in convolutional layers and the number of nodes in fully-connected internal layers) is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
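
The convolutional case (CNTK) is the paper's technical contribution; the fully-connected NTK recursion it extends can be sketched in a few lines (standard arc-cosine formulas for ReLU with He scaling; the inputs below are illustrative):

```python
import numpy as np

def relu_ntk(x1, x2, depth):
    """Infinite-width NTK of a fully-connected ReLU network (He scaling, c = 2),
    via the standard arc-cosine kernel recursion."""
    s11, s22 = x1 @ x1, x2 @ x2        # diagonal entries are preserved by this scaling
    s12 = x1 @ x2
    theta = s12                        # NTK starts from the input kernel
    for _ in range(depth):
        lam = np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0)
        ang = np.arccos(lam)
        sdot = (np.pi - ang) / np.pi   # kernel of the ReLU derivative
        s12 = np.sqrt(s11 * s22) * (np.sin(ang) + (np.pi - ang) * lam) / np.pi
        theta = s12 + sdot * theta     # Theta^(h) = Sigma^(h) + Sigma_dot^(h) * Theta^(h-1)
    return theta

x1, x2 = np.random.randn(10), np.random.randn(10)
print(relu_ntk(x1, x2, depth=3))
```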

NeurIPS Conference 2018 Conference Paper

Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

  • Simon Du
  • Wei Hu
  • Jason Lee

We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i.e., gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\eta_t = O(t^{-(1/2+\delta)})$ for $0 < \delta \le 1/2$ automatically balances two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-1 asymmetric matrix factorization we give a finer analysis showing gradient descent with constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariance imposed by first order algorithms in learning homogeneous models could serve as a fundamental building block for studying optimization for learning deep models.
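
The balancedness invariant is easy to verify numerically. Below is a minimal sketch (illustrative sizes and step size) for a two-layer linear model, where gradient descent with a tiny step approximates gradient flow:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 20, 1e-3                       # tiny step size approximates gradient flow
U = 0.1 * rng.standard_normal((d, d))  # two-layer linear model f(x) = U V x
V = 0.1 * rng.standard_normal((d, d))
target = rng.standard_normal((d, d))

for step in range(5001):
    R = U @ V - target                 # residual of 0.5 * ||UV - target||_F^2
    # Simultaneous update: grad_U = R V^T, grad_V = U^T R.
    U, V = U - lr * R @ V.T, V - lr * U.T @ R
    if step % 1000 == 0:
        # The squared-norm difference stays (nearly) constant along the trajectory.
        print(step, np.linalg.norm(U) ** 2 - np.linalg.norm(V) ** 2)
```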

IJCAI Conference 2018 Conference Paper

Bootstrapping Entity Alignment with Knowledge Graph Embedding

  • Zequn Sun
  • Wei Hu
  • Qingheng Zhang
  • Yuzhong Qu

Embedding-based entity alignment represents different knowledge graphs (KGs) as low-dimensional embeddings and finds entity alignment by measuring the similarities between entity embeddings. Existing approaches have achieved promising results; however, they are still challenged by the lack of enough prior alignment as labeled training data. In this paper, we propose a bootstrapping approach to embedding-based entity alignment. It iteratively labels likely entity alignment as training data for learning alignment-oriented KG embeddings. Furthermore, it employs an alignment editing method to reduce error accumulation during iterations. Our experiments on real-world datasets showed that the proposed approach significantly outperformed the state-of-the-art embedding-based ones for entity alignment. The proposed alignment-oriented KG embedding, bootstrapping process and alignment editing method all contributed to the performance improvement.
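
As a toy rendering of the bootstrapping loop (not the paper's method; the threshold and the frozen embeddings are simplifying assumptions), one round of labeling might look like this:

```python
import numpy as np

def bootstrap_align(e1, e2, seed_pairs, rounds=5, threshold=0.9):
    """Toy bootstrapping loop: iteratively promote confident matches to labels.

    e1, e2: unit-norm embedding matrices for the two KGs.
    seed_pairs: iterable of (i, j) prior alignments.
    """
    labeled = set(seed_pairs)
    for _ in range(rounds):
        sim = e1 @ e2.T                          # cosine similarity
        cand_j = sim.argmax(axis=1)
        for i, j in enumerate(cand_j):
            if sim[i, j] >= threshold:
                labeled.add((i, int(j)))         # likely alignment becomes a label
        # The real method retrains the embeddings on `labeled` here (and edits
        # conflicting labels); this sketch keeps them fixed and only grows the set.
    return labeled

rng = np.random.default_rng(0)
e1 = rng.standard_normal((50, 16))
e1 /= np.linalg.norm(e1, axis=1, keepdims=True)
e2 = e1 + 0.05 * rng.standard_normal(e1.shape)
e2 /= np.linalg.norm(e2, axis=1, keepdims=True)
print(len(bootstrap_align(e1, e2, seed_pairs=[(0, 0)])))
```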

NeurIPS Conference 2018 Conference Paper

Online Improper Learning with an Approximation Oracle

  • Elad Hazan
  • Wei Hu
  • Yuanzhi Li
  • Zhiyuan Li

We study the following question: given an efficient approximation algorithm for an optimization problem, can we learn efficiently in the same setting? We give a formal affirmative answer to this question in the form of a reduction from online learning to offline approximate optimization using an efficient algorithm that guarantees near optimal regret. The algorithm is efficient in terms of the number of oracle calls to a given approximation oracle – it makes only logarithmically many such calls per iteration. This resolves an open question by Kalai and Vempala, and by Garber. Furthermore, our result applies to the more general improper learning problems.

NeurIPS Conference 2017 Conference Paper

Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls

  • Zeyuan Allen-Zhu
  • Elad Hazan
  • Wei Hu
  • Yuanzhi Li

We propose a rank-k variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball. Our algorithm replaces the top singular-vector computation (1-SVD) in Frank-Wolfe with a top-k singular-vector computation (k-SVD), which can be done by repeatedly applying 1-SVD k times. Alternatively, our algorithm can be viewed as a rank-k restricted version of projected gradient descent. We show that our algorithm has a linear convergence rate when the objective function is smooth and strongly convex, and the optimal solution has rank at most k. This improves the convergence rate and the total time complexity of the Frank-Wolfe method and its variants.
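
For orientation, here is the classical 1-SVD Frank-Wolfe baseline over the trace-norm ball that the paper improves on (a sketch; the paper's rank-k variant replaces the 1-SVD with a k-SVD and a refined update to obtain linear convergence, and the trace-norm budget below is chosen with oracle knowledge for the demo):

```python
import numpy as np
from scipy.sparse.linalg import svds

def frank_wolfe_trace_ball(grad_fn, shape, tau, iters=200):
    """Classical Frank-Wolfe over the trace-norm ball {X : ||X||_* <= tau}.

    The linear minimization oracle is a 1-SVD of the gradient: the minimizer
    of <G, S> over the ball is S = -tau * u1 @ v1.T for the top singular pair.
    """
    X = np.zeros(shape)
    for t in range(iters):
        G = grad_fn(X)
        u, s, vt = svds(G, k=1)            # top singular pair of the gradient
        S = -tau * (u @ vt)
        eta = 2.0 / (t + 2.0)              # standard FW step size
        X = (1 - eta) * X + eta * S
    return X

# Matrix completion example: f(X) = 0.5 * ||mask * (X - M)||_F^2.
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 40))
mask = rng.random(M.shape) < 0.5
X = frank_wolfe_trace_ball(lambda X: np.where(mask, X - M, 0.0),
                           M.shape, tau=np.linalg.norm(M, "nuc"))
print(np.linalg.norm(np.where(mask, X - M, 0)) / np.linalg.norm(np.where(mask, M, 0)))
```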

NeurIPS Conference 2016 Conference Paper

Combinatorial Multi-Armed Bandit with General Reward Functions

  • Wei Chen
  • Wei Hu
  • Fu Li
  • Jian Li
  • Yu Liu
  • Pinyan Lu

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve $O(\log T)$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. We apply our results to the $K$-MAX problem and expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon>0$.
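
The heart of SDCB, an optimistic distribution that stochastically dominates the empirical one, can be sketched as follows (the support discretization and the confidence radius are illustrative assumptions):

```python
import numpy as np

def sdcb_cdf(samples, t, support):
    """Optimistic CDF: shift the empirical CDF down by a confidence radius,
    yielding a distribution that first-order stochastically dominates it."""
    radius = np.sqrt(3 * np.log(t) / (2 * len(samples)))   # usual sqrt(log t / n) shape
    emp = np.array([(samples <= x).mean() for x in support])
    opt = np.clip(emp - radius, 0.0, 1.0)
    opt[-1] = 1.0                       # stay a valid CDF on the truncated support
    return opt

support = np.linspace(0.0, 1.0, 101)
rng = np.random.default_rng(0)
f1 = sdcb_cdf(rng.random(50), t=1000, support=support)
f2 = sdcb_cdf(rng.random(80), t=1000, support=support)
cdf_max = f1 * f2                       # CDF of the max of two independent arms
pmf = np.diff(np.concatenate(([0.0], cdf_max)))
print((support * pmf).sum())            # optimistic estimate of E[max of the two arms]
```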

AAAI Conference 2015 Conference Paper

An EBMC-Based Approach to Selecting Types for Entity Filtering

  • Jiwei Ding
  • Wentao Ding
  • Wei Hu
  • Yuzhong Qu

The quantity of entities in the Linked Data is increasing rapidly. For entity search and browsing systems, filtering is very useful for users to find entities that they are interested in. Type is a kind of widely-used facet and can be easily obtained from knowledge bases, which makes it possible to create filters by selecting at most K types of an entity collection. However, existing approaches often fail to select high-quality type filters due to complex overlap between types. In this paper, we propose a novel type selection approach based upon Budgeted Maximum Coverage (BMC), which can achieve integral optimization for the coverage quality of type filters. Furthermore, we define a new optimization problem called Extended Budgeted Maximum Coverage (EBMC) and propose an EBMC-based approach, which enhances the BMC-based approach by incorporating the relevance between entities and types, so as to create sensible type filters. Our experimental results show that the EBMC-based approach performs best compared with several representative approaches.
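
The BMC side of the problem admits the textbook greedy heuristic, sketched below; the paper's EBMC formulation additionally weights the relevance between entities and types, which this sketch omits:

```python
def greedy_type_filters(type_to_entities, k):
    """Greedy budgeted max coverage: pick at most k types covering most entities.

    type_to_entities: dict mapping a type name to the set of entities having it.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best = max(type_to_entities,
                   key=lambda t: len(type_to_entities[t] - covered),
                   default=None)
        if best is None or not (type_to_entities[best] - covered):
            break                                  # no type adds coverage
        chosen.append(best)
        covered |= type_to_entities[best]
    return chosen, covered

types = {"Person": {1, 2, 3}, "Athlete": {2, 3}, "City": {4, 5}, "Place": {4, 5, 6}}
print(greedy_type_filters(types, k=2))             # picks Person and Place, covering 6 entities
```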

SODA Conference 2014 Conference Paper

A Constant Factor Approximation Algorithm for Fault-Tolerant k-Median

  • MohammadTaghi Hajiaghayi
  • Wei Hu
  • Jian Li 0015
  • Shi Li 0001
  • Barna Saha

In this paper, we consider the fault-tolerant k-median problem and give the first constant factor approximation algorithm for it. In the fault-tolerant generalization of the classical k-median problem, each client j needs to be assigned to at least r_j ≥ 1 distinct open facilities. The service cost of j is the sum of its distances to the r_j facilities, and the k-median constraint restricts the number of open facilities to at most k. Previously, a constant factor was known only for the special case when all r_j are the same, and a logarithmic approximation ratio was known for the general case. In addition, we present the first polynomial time algorithm for the fault-tolerant k-median problem on a path or an HST by showing that the corresponding LP always has an integral optimal solution. We also consider the fault-tolerant facility location problem, where the service cost of j can be a weighted sum of its distances to the r_j facilities. We give a simple constant factor approximation algorithm, generalizing several previous results which only work for nonincreasing weight vectors.
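
For concreteness, the fault-tolerant objective itself is straightforward to state in code (a sketch; approximating its minimum is the hard part the paper addresses):

```python
import numpy as np

def ft_kmedian_cost(dist, open_facilities, r):
    """Fault-tolerant k-median service cost.

    dist: (clients x facilities) distance matrix; open_facilities: indices of the
    at-most-k open facilities; r[j]: how many distinct facilities client j needs.
    Each client pays the sum of distances to its r[j] nearest open facilities.
    """
    d_open = np.sort(dist[:, open_facilities], axis=1)
    return sum(d_open[j, :r[j]].sum() for j in range(dist.shape[0]))

rng = np.random.default_rng(0)
dist = rng.random((5, 4))                 # 5 clients, 4 candidate facilities
print(ft_kmedian_cost(dist, open_facilities=[0, 2, 3], r=[1, 2, 1, 3, 2]))
```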