Author name cluster

Yue Bai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers

2 author rows

AAAI Conference 2026 Conference Paper

Arbitrary-Scale 3D Gaussian Super-Resolution

Huimin Zeng
Yue Bai
Yun Fu

Existing 3D Gaussian Splatting (3DGS) super-resolution methods typically perform high-resolution (HR) rendering of fixed scale factors, making them impractical for resource-limited scenarios. Directly rendering arbitrary-scale HR views with vanilla 3DGS introduces aliasing artifacts due to the lack of scale-aware rendering ability, while adding a post-processing upsampler for 3DGS complicates the framework and reduces rendering efficiency. To tackle these issues, we build an integrated framework that incorporates scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable 3D Gaussian super-resolution of arbitrary scale factors with a single 3D model. Notably, our approach supports both integer and non-integer scale rendering to provide more flexibility. Extensive experiments demonstrate the effectiveness of our model in producing high-quality arbitrary-scale HR views (6.59 dB PSNR gain over 3DGS) with a single model. It preserves structural consistency with LR views and across different scales, while maintaining real-time rendering speed (85 FPS at 1080p).

PDF Details DOI

AAAI Conference 2026 Conference Paper

Automating Complex Document Workflows via Stepwise and Rollback-Enabled Operation Orchestration

Yanbin Zhang
Hanhui Ye
Yue Bai
Qiming Zhang
Liao Xiang
Wu Mianzhi
Renjun Hu

Workflow automation promises substantial productivity gains in everyday document-related tasks. While prior agentic systems can execute isolated instructions, they struggle with automating multi-step, session-level workflows due to limited control over the operational process. To this end, we introduce AutoDW, a novel execution framework that enables stepwise, rollback-enabled operation orchestration. AutoDW incrementally plans API actions conditioned on user instructions, intent-filtered API candidates, and the evolving states of the document. It further employs robust rollback mechanisms at both the argument and API levels, enabling dynamic correction and fault tolerance. These designs together ensure that the execution trajectory of AutoDW remains aligned with user intent and document context across long-horizon workflows. To assess its effectiveness, we construct a comprehensive benchmark of 250 sessions and 1,708 human-annotated instructions, reflecting realistic document processing scenarios with interdependent instructions. AutoDW achieves 90% and 62% completion rates on instruction- and session-level tasks, respectively, outperforming strong baselines by 40% and 76%. Moreover, AutoDW also remains robust for the decision of backbone LLMs and on tasks with varying difficulty.

PDF Details DOI

ICLR Conference 2025 Conference Paper

Accessing Vision Foundation Models via ImageNet-1K

Yitian Zhang
Xu Ma 0005
Yue Bai
Huan Wang
Yun Fu 0001

Vision foundation models are renowned for the generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could facilitate the research. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. When leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 19 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M) with a significantly smaller training set of 1.2M images.

Details

ICLR Conference 2024 Conference Paper

Don't Judge by the Look: Towards Motion Coherent Video Representation

Yitian Zhang
Yue Bai
Huan Wang 0014
Yizhou Wang 0006
Yun Fu 0001

Current training pipelines in object recognition neglect Hue Jittering when doing data augmentation as it not only brings appearance changes that are detrimental to classification, but also the implementation is inefficient in practice. In this study, we investigate the effect of hue variance in the context of video understanding and find this variance to be beneficial since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video understanding, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation SwapMix to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at https://github.com/BeSpontaneous/MCA-pytorch.

Details

UAI Conference 2024 Conference Paper

α-Former: Local-Feature-Aware (L-FA) Transformer

Zhi Xu 0013
Bin Sun 0002
Yue Bai
Yun Fu 0001

Despite the success of current segmentation models powered by the transformer, the camouflaged instance segmentation (CIS) task remains a challenge due to the similarity between the target and the background. To address this issue, we propose a novel approach called the local-feature-aware transformer ($\alpha$-Former), inspired by how humans find the camouflaged instance in a given photograph. We use traditional computer vision descriptors to simulate how humans find the unnatural boundary in a given photograph. Then, the information extracted by traditional descriptors can be employed as prior knowledge to enhance the neural network’s performance. Moreover, due to the non-learnable characteristics of traditional descriptors, we designed a learnable binary filter to simulate the traditional descriptors. In order to aggregate the information from the backbone and binary filter, we introduce an adapter to merge local features into the transformer framework. Additionally, we introduce an edge-aware feature fusion module to improve boundary results in the segmentation model. Using the proposed transformer-based encoder-decoder architecture, our $\alpha$-Former surpasses state-of-the-art performance on the COD10K and NC4K datasets.

Details

NeurIPS Conference 2023 Conference Paper

Latent Graph Inference with Limited Supervision

Jianglin Lu
Yi Xu
Huan Wang
Yue Bai
Yun Fu

Latent graph inference (LGI) aims to jointly learn the underlying graph structure and node representations from data features. However, existing LGI methods commonly suffer from the issue of supervision starvation, where massive edge weights are learned without semantic supervision and do not contribute to the training loss. Consequently, these supervision-starved weights, which determine the predictions of testing samples, cannot be semantically optimal, resulting in poor generalization. In this paper, we observe that this issue is actually caused by the graph sparsification operation, which severely destroys the important connections established between pivotal nodes and labeled ones. To address this, we propose to restore the corrupted affinities and replenish the missed supervision for better LGI. The key challenge then lies in identifying the critical nodes and recovering the corrupted affinities. We begin by defining the pivotal nodes as k-hop starved nodes, which can be identified based on a given adjacency matrix. Considering the high computational burden, we further present a more efficient alternative inspired by CUR matrix decomposition. Subsequently, we eliminate the starved nodes by reconstructing the destroyed connections. Extensive experiments on representative benchmarks demonstrate that reducing the starved nodes consistently improves the performance of state-of-the-art LGI methods, especially under extremely limited supervision (6. 12% improvement on Pubmed with a labeling rate of only 0. 3%).

PDF Details

AAAI Conference 2023 Conference Paper

Layout Representation Learning with Spatial and Structural Hierarchies

Yue Bai
Dipu Manandhar
Zhaowen Wang
John Collomosse
Yun Fu

We present a novel hierarchical modeling method for layout representation learning, the core of design documents (e.g., user interface, poster, template). Existing works on layout representation often ignore element hierarchies, which is an important facet of layouts, and mainly rely on the spatial bounding boxes for feature extraction. This paper proposes a Spatial-Structural Hierarchical Auto-Encoder (SSH-AE) that learns hierarchical representation by treating a hierarchically annotated layout as a tree format. On the one side, we model SSH-AE from both spatial (semantic views) and structural (organization and relationships) perspectives, which are two complementary aspects to represent a layout. On the other side, the semantic/geometric properties are associated at multiple resolutions/granularities, naturally handling complex layouts. Our learned representations are used for effective layout search from both spatial and structural similarity perspectives. We also newly involve the tree-edit distance (TED) as an evaluation metric to construct a comprehensive evaluation protocol for layout similarity assessment, which benefits a systematic and customized layout search. We further present a new dataset of POSTER layouts which we believe will be useful for future layout research. We show that our proposed SSH-AE outperforms the existing methods achieving state-of-the-art performance on two benchmark datasets. Code is available at github.com/yueb17/SSH-AE.

PDF Details DOI

EAAI Journal 2023 Journal Article

Smart mobile robot fleet management based on hierarchical multi-agent deep Q network towards intelligent manufacturing

Yue Bai
Yaqiong Lv
Jiatong Zhang

With the advent of intelligent manufacturing era, smart mobile robots have taken the major roles on transporting materials through intelligent dynamic production environment. It is paramount to efficiently manage the mobile robot fleet to complete the material transportation task so as to facilitate the smooth production in the workshop. However, many mobile robot fleet management systems adopt centralized control where the interaction between mobile robots and other resources cannot be updated promptly, and the in-time dynamic environment variance cannot be considered. To cope with the unexpected real-time change of the system, reinforcement learning (RL) method is suggested to handle the problem. To overcome the sparse reward problem of RL, we propose a hierarchical multi-agent deep Q network (HMDQN) algorithm of a two-layer structure, in which the goal layer is controlled by a main controller for selecting a current goal and the sub controller from action layer is for coordinating multi-agent controlled smart mobile robots. The main controller is aimed at learning how to determine the current goal based on the order status and production states. Simultaneously, the sub controller is learning to seek an optimal way that the smart mobile robot fleet jointly executes the transportation tasks through the information exchange between robot agent and production environment under the current goal. A smart mobile robot fleet management case in a PCBA company is studied to validate the feasibility of our approach. In addition, we utilized alternative methods to solve the same problem and compared the performance to prove our approach’s superiority. Furthermore, we demonstrated the adaptability of the proposed method by changing the problem scales.

Details DOI

ICLR Conference 2022 Conference Paper

Dual Lottery Ticket Hypothesis

Yue Bai
Huan Wang 0014
Zhiqiang Tao
Kunpeng Li
Yun Fu 0001

Fully exploiting the learning capacity of neural networks requires overparameterized dense networks. On the other side, directly training sparse neural networks typically results in unsatisfactory performance. Lottery Ticket Hypothesis (LTH) provides a novel view to investigate sparse network training and maintain its capacity. Concretely, it claims there exist winning tickets from a randomly initialized network found by iterative magnitude pruning and preserving promising trainability (or we say being in trainable condition). In this work, we regard the winning ticket from LTH as the subnetwork which is in trainable condition and its performance as our benchmark, then go from a complementary direction to articulate the Dual Lottery Ticket Hypothesis (DLTH): Randomly selected subnetworks from a randomly initialized dense network can be transformed into a trainable condition and achieve admirable performance compared with LTH --- random tickets in a given lottery pool can be transformed into winning tickets. Specifically, by using uniform-randomly selected subnetworks to represent the general cases, we propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our DLTH. Concretely, we introduce a regularization term to borrow learning capacity and realize information extrusion from the weights which will be masked. After finishing the transformation for the randomly selected subnetworks, we conduct the regular finetuning to evaluate the model using fair comparisons with LTH and other strong baselines. Extensive experiments on several public datasets and comparisons with competitive approaches validate our DLTH as well as the effectiveness of the proposed model RST. Our work is expected to pave a way for inspiring new research directions of sparse network training in the future. Our code is available at https://github.com/yueb17/DLTH.

Details

NeurIPS Conference 2022 Conference Paper

Look More but Care Less in Video Recognition

Yitian Zhang
Yue Bai
Huan Wang
Yi Xu
Yun Fu

Existing action recognition methods typically sample a few frames to represent each video to avoid the enormous computation, which often limits the recognition performance. To tackle this problem, we propose Ample and Focal Network (AFNet), which is composed of two branches to utilize more frames but with less computation. Specifically, the Ample Branch takes all input frames to obtain abundant information with condensed computation and provides the guidance for Focal Branch by the proposed Navigation Module; the Focal Branch squeezes the temporal size to only focus on the salient frames at each convolution block; in the end, the results of two branches are adaptively fused to prevent the loss of information. With this design, we can introduce more frames to the network but cost less computation. Besides, we demonstrate AFNet can utilize less frames while achieving higher accuracy as the dynamic selection in intermediate features enforces implicit temporal modeling. Further, we show that our method can be extended to reduce spatial redundancy with even less cost. Extensive experiments on five datasets demonstrate the effectiveness and efficiency of our method.

PDF Details

NeurIPS Conference 2022 Conference Paper

Parameter-Efficient Masking Networks

Yue Bai
Huan Wang
Xu Ma
Yitian Zhang
Zhiqiang Tao
Yun Fu

A deeper network structure generally handles more complicated non-linearity and performs more competitively. Nowadays, advanced network designs often contain a large number of repetitive structures (e. g. , Transformer). They empower the network capacity to a new level but also increase the model size inevitably, which is unfriendly to either model restoring or transferring. In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning diverse masks and introduce the Parameter-Efficient Masking Networks (PEMN). It also naturally leads to a new paradigm for model compression to diminish the model size. Concretely, motivated by the repetitive structures in modern neural networks, we utilize one random initialized layer, accompanied with different masks, to convey different feature mappings and represent repetitive network modules. Therefore, the model can be expressed as \textit{one-layer} with a bunch of masks, which significantly reduce the model storage cost. Furthermore, we enhance our strategy by learning masks for a model filled by padding a given random weights vector. In this way, our method can further lower the space complexity, especially for models without many repetitive architectures. We validate the potential of PEMN learning masks on random weights with limited unique values and test its effectiveness for a new compression paradigm based on different network architectures. Code is available at \href{https: //github. com/yueb17/PEMN}{\textcolor{magenta}{https: //github. com/yueb17/PEMN}}.

PDF Details

IJCAI Conference 2022 Conference Paper

Recent Advances on Neural Network Pruning at Initialization

Huan Wang
Can Qin
Yue Bai
Yulun Zhang
Yun Fu

Neural network pruning typically removes connections or neurons from a pretrained converged model; while a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning fashion. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared to PaT and discuss the open problems. Apart from the dedicated literature review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.

PDF Details DOI

AAAI Conference 2021 Conference Paper

Correlative Channel-Aware Fusion for Multi-View Time Series Classification

Yue Bai
Lichen Wang
Zhiqiang Tao
Sheng Li
Yun Fu

Multi-view time series classification (MVTSC) aims to improve the performance by fusing the distinctive temporal information from multiple views. Existing methods for MVTSC mainly aim to fuse multi-view information at an early stage, e. g. , by extracting a common feature subspace among multiple views. However, these approaches may not fully explore the unique temporal patterns of each view in complicated time series. Additionally, the label correlations of multiple views, which are critical to boosting, are usually under-explored for the MVTSC problem. To address the aforementioned issues, we propose a Correlative Channel- Aware Fusion (C2 AF) network. First, C2 AF extracts comprehensive and robust temporal patterns by a two-stream structured encoder for each view, and derives the intra-view/interview label correlations with a concise correlation matrix. Second, a channel-aware learnable fusion mechanism is implemented through CNN to further explore the global correlative patterns. Our C2 AF is an end-to-end framework for MVTSC. Extensive experimental results on three real-world datasets demonstrate the superiority of our C2 AF over the state-ofthe-art methods. A detailed ablation study is also provided to illustrate the indispensability of each model component.

PDF Details