Arrow Research search

Author name cluster

Hongjun Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers

15

AAAI Conference 2026 Conference Paper

HarmoQ: Harmonized Post-Training Quantization for High-Fidelity Image Super-Resolution

  • Hongjun Wang
  • Jiyuan Chen
  • Xuan Song
  • Yinqiang Zheng

Post-training quantization offers an efficient pathway to deploy super-resolution models, yet existing methods treat weight and activation quantization independently, missing their critical interplay. Through controlled experiments on SwinIR, we uncover a striking asymmetry: weight quantization primarily degrades structural similarity, while activation quantization disproportionately affects pixel-level accuracy. This stems from their distinct roles—weights encode learned restoration priors for textures and edges, whereas activations carry input-specific intensity information. Building on this insight, we propose HarmoQ, a unified framework that harmonizes quantization across components through three synergistic steps: structural residual calibration proactively adjusts weights to compensate for activation-induced detail loss, harmonized scale optimization analytically balances quantization difficulty via closed-form solutions, and adaptive boundary refinement iteratively maintains this balance during optimization. Experiments show HarmoQ achieves substantial gains under aggressive compression, outperforming prior art by 0.46 dB on Set5 at 2-bit while delivering 3.2× speedup and 4× memory reduction on A100 GPUs. This work provides the first systematic analysis of weight-activation coupling in super-resolution quantization and establishes a principled solution for efficient high-quality image restoration.
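For readers unfamiliar with the setting, the sketch below shows plain symmetric uniform quantization, the post-training baseline that methods like HarmoQ build on; the function name and per-tensor scale are illustrative assumptions, not the paper's actual calibration steps.

```python
def quantize(values, num_bits):
    """Symmetric uniform quantization: map floats to signed integer
    levels in [-(2^(b-1)-1), 2^(b-1)-1], then dequantize with the
    same per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale

weights = [0.8, -0.31, 0.05, -0.77]
deq, scale = quantize(weights, num_bits=2)
# At 2 bits only three levels survive (-scale, 0, +scale), which is why
# the aggressive-compression regime in the abstract is so challenging.
```

Weights and activations would each get their own scale under this scheme; HarmoQ's contribution is to calibrate the two jointly rather than independently.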

AAAI Conference 2026 Conference Paper

Resilience Inference for Supply Chains with Hypergraph Neural Network

  • Zetian Shen
  • Hongjun Wang
  • Jiyuan Chen
  • Xuan Song

Supply chains are integral to global economic stability, yet disruptions can swiftly propagate through interconnected networks, resulting in substantial economic impacts. Accurate and timely inference of supply chain resilience, the capability to maintain core functions during disruptions, is crucial for proactive risk mitigation and robust network design. However, existing approaches lack effective mechanisms to infer supply chain resilience without explicit system dynamics and struggle to represent the higher-order, multi-entity dependencies inherent in supply chain networks. To address these challenges, we formalize a novel problem: Supply Chain Resilience Inference (SCRI), defined as predicting supply chain resilience from hypergraph topology and observed inventory trajectories without explicit dynamic equations. To solve this problem, we propose the Supply Chain Resilience Inference Hypergraph Network (SC-RIHN), a novel hypergraph-based model leveraging set-based encoding and hypergraph message passing to capture multi-party firm-product interactions. Comprehensive experiments demonstrate that SC-RIHN significantly outperforms traditional MLP, representative graph neural network variants, and ResInf baselines across synthetic benchmarks, underscoring its potential for practical, early-warning risk assessment in complex supply chain systems.

NeurIPS Conference 2025 Conference Paper

Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

  • Weining Ren
  • Hongjun Wang
  • Xiao Tan
  • Kai Han

We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses pointmaps of all input images into a reference-frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment of multi-view pointmap regression. Fin3R tackles both issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
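The LoRA adapter mentioned above can be pictured as a low-rank additive update to a frozen weight matrix. The toy sketch below (plain-Python matrices; the names and the `alpha` scaling are illustrative assumptions, not Fin3R's actual adapter) shows why only the small factors A and B need training:

```python
def matmul(X, Y):
    """Naive matrix product for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, A, B, alpha=1.0):
    """Effective weight W' = W + alpha * (B @ A).

    W is frozen (d_out x d_in); only the rank-r factors B (d_out x r)
    and A (r x d_in) receive gradients, so trainable parameters scale
    with r * (d_in + d_out) instead of d_in * d_out.
    """
    BA = matmul(B, A)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Rank-1 update of a 2x2 identity weight.
W_eff = lora_update([[1.0, 0.0], [0.0, 1.0]],
                    A=[[0.0, 1.0]], B=[[1.0], [0.0]])
```

Because the update is purely additive, the adapter can be merged into W after fine-tuning, which is consistent with the abstract's claim of virtually unchanged test-time memory and latency.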

NeurIPS Conference 2025 Conference Paper

GSPN-2: Efficient Parallel Sequence Modeling

  • Hongjun Wang
  • Yitong Jiang
  • Collin McCarthy
  • David Wehr
  • Hanrong Ye
  • Xinhao Li
  • Ka Chun Cheung
  • Wonmin Byeon

Efficiency of vision transformers remains a bottleneck for real-world applications involving high-resolution images and long videos. The Generalized Spatial Propagation Network (GSPN) (Wang et al., 2025) addresses this by replacing quadratic self-attention with a line-scan propagation scheme, bringing the cost close to linear in the number of rows or columns while retaining accuracy. Despite this advancement, the existing GSPN implementation still suffers from (i) heavy overhead due to repeatedly launching GPU kernels, (ii) excessive data transfers from global GPU memory, and (iii) redundant computation caused by maintaining separate propagation weights for each channel. We introduce GSPN-2, a joint algorithm-system redesign. In particular, we fuse thousands of micro-launches from the previous implementation into a single 2D kernel, explicitly pin one warp to each channel slice, and stage the previous column's activations in shared memory. On the model side, we introduce a set of channel-shared propagation weights that replace per-channel matrices, trimming parameters, and align naturally with the affinity map used in transformer attention. Experiments demonstrate GSPN-2's effectiveness across image classification and text-to-image synthesis tasks, matching transformer-level accuracy at significantly lower computational cost. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications through its unique combination of structured matrix transformations and a GPU-optimized implementation.
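To make the line-scan idea concrete, here is a hedged pure-Python sketch of a top-to-bottom scan in which each output row is a gated mix of the input row and a 3-tap propagation of the previous output row; the tap values, the scalar gate, and the single-channel setting are illustrative simplifications of GSPN's learned, stability-constrained weights (the kernel-fusion and shared-memory points in the abstract are GPU-level concerns outside this sketch):

```python
def line_scan(feature_map, taps=(0.25, 0.5, 0.25), gate=0.5):
    """Sequential over rows, parallel over columns:
    row_t = gate * x_t + (1 - gate) * (3-tap mix of row_{t-1}),
    so total cost is linear in the number of rows."""
    height, width = len(feature_map), len(feature_map[0])
    out = [list(feature_map[0])]  # first row passes through unchanged
    for t in range(1, height):
        prev = out[-1]
        row = []
        for j in range(width):
            left = prev[j - 1] if j > 0 else 0.0
            right = prev[j + 1] if j < width - 1 else 0.0
            mixed = taps[0] * left + taps[1] * prev[j] + taps[2] * right
            row.append(gate * feature_map[t][j] + (1 - gate) * mixed)
        out.append(row)
    return out

# A unit impulse in the top-left corner spreads as the scan proceeds.
result = line_scan([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

Channel-shared propagation weights, GSPN-2's model-side change, would correspond to reusing the same `taps` for every channel instead of keeping one set per channel.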

ICLR Conference 2025 Conference Paper

HiLo: A Learning Framework for Generalized Category Discovery Robust to Domain Shifts

  • Hongjun Wang
  • Sagar Vaze
  • Kai Han 0001

Generalized Category Discovery (GCD) is a challenging task in which, given a partially labelled dataset, models must categorize all unlabelled instances, regardless of whether they come from labelled categories or from new ones. In this paper, we challenge a remaining assumption in this task: that all images share the same domain. Specifically, we introduce a new task and method to handle GCD when the unlabelled data also contains images from different domains to the labelled set. Our proposed 'HiLo' networks extract High-level semantic and Low-level domain features, before minimizing the mutual information between the representations. Our intuition is that the clusterings based on domain information and semantic information should be independent. We further extend our method with a specialized domain augmentation tailored for the GCD task, as well as a curriculum learning approach. Finally, we construct a benchmark from corrupted fine-grained datasets as well as a large-scale evaluation on DomainNet with real-world domain shifts, reimplementing a number of GCD baselines in this setting. We demonstrate that HiLo outperforms SoTA category discovery models by a large margin on all evaluations.

NeurIPS Conference 2025 Conference Paper

Panoptic Captioning: An Equivalence Bridge for Image and Text

  • Kun-Yu Lin
  • Hongjun Wang
  • Weining Ren
  • Kai Han

This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalent of images, which has broad potential applications. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/

ICLR Conference 2024 Conference Paper

SPTNet: An Efficient Alternative Framework for Generalized Category Discovery with Spatial Prompt Tuning

  • Hongjun Wang
  • Sagar Vaze
  • Kai Han 0001

Generalized Category Discovery (GCD) aims to classify unlabelled images from both ‘seen’ and ‘unseen’ classes by transferring knowledge from a set of labelled ‘seen’ class images. A key theme in existing GCD approaches is adapting large-scale pre-trained models for the GCD task. An alternate perspective, however, is to adapt the data representation itself for better alignment with the pre-trained model. As such, in this paper, we introduce a two-stage adaptation approach termed SPTNet, which iteratively optimizes model parameters (i.e., model-finetuning) and data parameters (i.e., prompt learning). Furthermore, we propose a novel spatial prompt tuning method (SPT) which considers the spatial property of image data, enabling the method to better focus on object parts, which can transfer between seen and unseen classes. We thoroughly evaluate our SPTNet on standard benchmarks and demonstrate that our method outperforms existing GCD methods. Notably, we find our method achieves an average accuracy of 61.4% on the SSB, surpassing prior state-of-the-art methods by approximately 10%. The improvement is particularly remarkable as our method yields extra parameters amounting to only 0.117% of those in the backbone architecture. Project page: https://visual-ai.github.io/sptnet.

TIST Journal 2024 Journal Article

T-Distributed Stochastic Neighbor Embedding for Co-Representation Learning

  • Wei Chen
  • Hongjun Wang
  • Yinghui Zhang
  • Ping Deng
  • Zhipeng Luo
  • Tianrui Li

Co-clustering is the simultaneous clustering of the samples and attributes of a data matrix, providing deeper insight into data than traditional clustering. However, there is a lack of representation learning algorithms that serve this mechanism of co-clustering, and current representation learning algorithms are limited to the sample perspective, leaving the information in the attribute perspective unused. To solve this problem, in this article, ctSNE, a co-representation learning model based on t-distributed stochastic neighbor embedding, is proposed for unsupervised co-clustering, where ctSNE makes the output data representation more discriminative of row and column clusters (i.e., co-discrimination). Building on t-distributed stochastic neighbor embedding's retention of the sample data distribution and local data structure, the philosophy of collaboration (i.e., row and column hidden relationship information) is introduced so that the ctSNE model is equipped with co-representation learning capability, which can effectively improve the performance of co-clustering. To demonstrate the effectiveness of the ctSNE model, several classic co-clustering algorithms are used to check the co-representation performance of ctSNE, and a novel internal index based on an internal clustering index, known as total inertia, is proposed to measure the effect of co-clustering. Extensive experimental results show that ctSNE has strong co-representation capability and can significantly improve the performance of co-clustering algorithms.

IJCAI Conference 2023 Conference Paper

Causal-Based Supervision of Attention in Graph Neural Network: A Better and Simpler Choice towards Powerful Attention

  • Hongjun Wang
  • Jiyuan Chen
  • Lun Du
  • Qiang Fu
  • Shi Han
  • Xuan Song

Recent years have witnessed the great potential of the attention mechanism in graph representation learning. However, while variants of attention-based GNNs are setting new benchmarks on numerous real-world datasets, recent works have pointed out that their induced attentions are less robust and generalizable on noisy graphs due to the lack of direct supervision. In this paper, we present a new framework which utilizes the tool of causality to provide a powerful supervision signal for the learning process of attention functions. Specifically, we estimate the direct causal effect of attention on the final prediction, and then maximize this effect to guide attention toward more meaningful neighbors. Our method can serve as a plug-and-play module for any canonical attention-based GNN in an end-to-end fashion. Extensive experiments on a wide range of benchmark datasets illustrate that, by directly supervising attention functions, the model is able to converge faster with a clearer decision boundary, and thus yields better performance.

AAAI Conference 2023 Conference Paper

Easy Begun Is Half Done: Spatial-Temporal Graph Modeling with ST-Curriculum Dropout

  • Hongjun Wang
  • Jiyuan Chen
  • Tong Pan
  • Zipei Fan
  • Xuan Song
  • Renhe Jiang
  • Lingyu Zhang
  • Yi Xie

Spatial-temporal (ST) graph modeling, such as traffic speed forecasting and taxi demand prediction, is an important task in deep learning. However, the ST patterns of different nodes in a graph can vary greatly in modeling difficulty, owing to the heterogeneous nature of ST data. We argue that unveiling the nodes to the model in a meaningful order, from easy to complex, can provide performance improvements over the traditional training procedure. The idea has its root in Curriculum Learning, which suggests that in the early stages of training, models can be sensitive to noise and difficult samples. In this paper, we propose ST-Curriculum Dropout, a novel and easy-to-implement strategy for spatial-temporal graph modeling. Specifically, we evaluate the learning difficulty of each node in a high-level feature space and drop the difficult ones out, ensuring that the model only needs to handle fundamental ST relations at the beginning before gradually moving to hard ones. Our strategy can be applied to any canonical deep learning architecture without extra trainable parameters, and extensive experiments on a wide range of datasets illustrate that, by controlling the difficulty level of ST relations as training progresses, the model captures a better representation of the data and thus generalizes better.
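The easy-to-hard schedule described above can be sketched as a keep-mask over nodes; the difficulty scores are assumed precomputed, and the linear annealing and starting fraction are illustrative choices rather than the paper's exact schedule:

```python
def curriculum_keep_mask(difficulty, epoch, total_epochs, start_frac=0.5):
    """Keep the easiest fraction of nodes, annealing the kept fraction
    linearly from start_frac at epoch 0 to 1.0 at total_epochs."""
    frac = start_frac + (1.0 - start_frac) * min(epoch / total_epochs, 1.0)
    n_keep = max(1, round(frac * len(difficulty)))
    easiest = sorted(range(len(difficulty)),
                     key=lambda i: difficulty[i])[:n_keep]
    kept = set(easiest)
    return [i in kept for i in range(len(difficulty))]

# Early epochs train only on the easiest half; later epochs see all nodes.
mask_start = curriculum_keep_mask([0.9, 0.1, 0.5, 0.3], epoch=0, total_epochs=10)
mask_end = curriculum_keep_mask([0.9, 0.1, 0.5, 0.3], epoch=10, total_epochs=10)
```

Since the mask only gates which nodes contribute to the loss, no trainable parameters are added, matching the plug-and-play claim in the abstract.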

ICLR Conference 2022 Conference Paper

Self-ensemble Adversarial Training for Improved Robustness

  • Hongjun Wang
  • Yisen Wang 0001

Due to numerous breakthroughs in real-world applications brought by machine intelligence, deep neural networks (DNNs) are widely employed in critical applications. However, predictions of DNNs are easily manipulated with imperceptible adversarial perturbations, which impedes the further deployment of DNNs and may result in profound security and privacy implications. By incorporating adversarial samples into the training data pool, adversarial training is the strongest principled strategy against various adversarial attacks among all sorts of defense methods. Recent works mainly focus on developing new loss functions or regularizers, attempting to find the unique optimal point in the weight space. But none of them taps the potential of classifiers obtained from standard adversarial training, especially the states along the search trajectory of training. In this work, we focus on the weight states of models throughout the training process and devise a simple but powerful Self-Ensemble Adversarial Training (SEAT) method that yields a robust classifier by averaging the weights of historical models. This considerably improves the robustness of the target model against several well-known adversarial attacks, even when supervised merely with the naive cross-entropy loss. We also discuss the relationship between the ensemble of predictions from different adversarially trained models and the prediction of weight-ensembled models, and provide theoretical and empirical evidence that the proposed self-ensemble method yields a smoother loss landscape and better robustness than both individual models and the ensemble of predictions from different classifiers. We further analyze a subtle but fatal issue in the general settings of the self-ensemble model, which causes the deterioration of the weight-ensembled method in the late phases of training.
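The core of the method, averaging the weights of models along the training trajectory, can be sketched as an exponential moving average over parameters; flat lists stand in for real parameter tensors, and the decay value is an illustrative choice:

```python
def update_ema(ema_weights, current_weights, decay=0.9):
    """One self-ensemble step: ema <- decay * ema + (1 - decay) * current.
    Evaluation (and attack) then uses the averaged weights rather than
    the latest training iterate."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, current_weights)]

# Toy trajectory: the averaged weights drift toward the iterates.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = update_ema(ema, step_weights)
```

Averaging in weight space, as opposed to averaging predictions, yields a single model with no extra inference cost, which is what makes the smoother-loss-landscape comparison in the abstract meaningful.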

TIST Journal 2022 Journal Article

Self-supervised Discriminative Representation Learning by Fuzzy Autoencoder

  • Wenlu Yang
  • Hongjun Wang
  • Yinghui Zhang
  • Zehao Liu
  • Tianrui Li

Representation learning based on autoencoders has received great attention for its potential ability to capture valuable latent information. Conventional autoencoders pursue minimal reconstruction error, but in most machine learning tasks such as classification and clustering, the discrimination of the feature representation is also important. To address this limitation, an enhanced self-supervised discriminative fuzzy autoencoder (FAE) is proposed, which focuses on exploring information within the data to guide the unsupervised training process and on enhancing feature discrimination in a self-supervised manner. In FAE, fuzzy membership is applied to provide a means of self-supervision, which allows FAE not only to utilize the autoencoder's outstanding representation learning capabilities but also to transform the original data into another space with improved discrimination. First, the objective function of FAE is formulated from a reconstruction loss and a clustering-oriented loss simultaneously. Subsequently, Mini-Batch Gradient Descent is applied to optimize the objective function, and the detailed process is illustrated step by step. Finally, empirical studies on clustering tasks demonstrate the superiority of FAE over the state of the art.
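The two-term objective described above, a reconstruction loss plus a clustering-oriented loss weighted by fuzzy memberships, can be sketched as follows; the squared-membership weighting, the trade-off weight `lam`, and all names are an illustrative reading of the abstract, not the paper's exact formulation:

```python
def fae_objective(x, x_rec, z, centers, memberships, lam=0.5):
    """loss = ||x - x_rec||^2 + lam * sum_k u_k^2 * ||z - c_k||^2

    The first term is the usual autoencoder reconstruction error; the
    second pulls the latent code z toward fuzzy cluster centers c_k in
    proportion to the membership degrees u_k."""
    rec = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    clus = sum(u ** 2 * sum((zi - ci) ** 2 for zi, ci in zip(z, c))
               for u, c in zip(memberships, centers))
    return rec + lam * clus
```

Minimizing such a combined loss with mini-batch gradient descent is what lets the latent space stay faithful to the input while becoming more cluster-discriminative.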

IJCAI Conference 2018 Conference Paper

Crowd Counting using Deep Recurrent Spatial-Aware Network

  • Lingbo Liu
  • Hongjun Wang
  • Guanbin Li
  • Wanli Ouyang
  • Liang Lin

Crowd counting from unconstrained scene images is a crucial task in many real-world applications like urban surveillance and management, but it is greatly challenged by the camera's perspective, which causes huge appearance variations in people's scales and rotations. Conventional methods address such challenges by resorting to fixed multi-scale architectures that are often unable to cover the largely varied scales while ignoring the rotation variations. In this paper, we propose a unified neural network framework, named Deep Recurrent Spatial-Aware Network, which adaptively addresses the two issues via a learnable spatial transform module with a region-wise refinement process. Specifically, our framework incorporates a Recurrent Spatial-Aware Refinement (RSAR) module that iteratively conducts two components: i) a Spatial Transformer Network that dynamically locates an attentional region from the crowd density map and transforms it to the suitable scale and rotation for optimal crowd estimation; ii) a Local Refinement Network that refines the density map of the attended region with residual learning. Extensive experiments on four challenging benchmarks show the effectiveness of our approach. Specifically, compared with the existing best-performing methods, we achieve an improvement of 12% on the largest dataset WorldExpo'10 and 22.8% on the most challenging dataset UCF_CC_50.