Arrow Research search

Author name cluster

Shihao Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers

11

NeurIPS Conference 2025 Conference Paper

DP²O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution

  • Rongyuan Wu
  • Lingchen Sun
  • Zhengqiang Zhang
  • Shihao Wang
  • Tianhe Wu
  • Qiaosi Yi
  • Shuai Li
  • Lei Zhang

Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP²O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP²O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
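
The multi-pair selection idea described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `top_k`/`bottom_k` parameters and the scalar `rewards` stand in for the hybrid IQA reward, and the returned gap is the intra-group reward difference that the hierarchical weighting would act on.

```python
import itertools

def build_preference_pairs(rewards, top_k=2, bottom_k=2):
    """Rank a group of outputs by reward and pair top candidates against
    bottom candidates, going beyond a single best-vs-worst pair."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    winners, losers = order[:top_k], order[-bottom_k:]
    pairs = []
    for w, l in itertools.product(winners, losers):
        if w != l:
            gap = rewards[w] - rewards[l]  # intra-group reward gap
            pairs.append((w, l, gap))
    return pairs
```

With larger `top_k`/`bottom_k` the pairs cover more of the perceptual-quality range (what the abstract suggests benefits smaller models), while `top_k = bottom_k = 1` recovers plain best-vs-worst selection.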

NeurIPS Conference 2025 Conference Paper

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

  • Guo Chen
  • Zhiqi Li
  • Shihao Wang
  • Jindong Jiang
  • Yicheng Liu
  • Lidong Lu
  • De-An Huang
  • Wonmin Byeon

We introduce Eagle 2.5, a frontier vision-language model (VLM) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model, Eagle 2.5-8B, achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.

ICLR Conference 2025 Conference Paper

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

  • Min Shi
  • Fuxiao Liu
  • Shihao Wang
  • Shijia Liao
  • Subhashree Radhakrishnan
  • Yilin Zhao
  • De-An Huang
  • Hongxu Yin

The ability to accurately interpret complex visual information is crucial for multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
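
The finding that simple concatenation of visual tokens is competitive can be illustrated with a small sketch. For illustration only, this assumes each encoder's output has already been resampled to a shared patch grid:

```python
import numpy as np

def fuse_vision_tokens(token_maps):
    """Channel-concatenate per-patch tokens from complementary encoders.
    Each entry has shape (num_patches, dim_i); the fused output has shape
    (num_patches, sum(dim_i)) -- no learned mixing module required."""
    n = token_maps[0].shape[0]
    assert all(t.shape[0] == n for t in token_maps), "patch grids must match"
    return np.concatenate(token_maps, axis=-1)
```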

AAAI Conference 2025 Conference Paper

L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression

  • Junxuan Zhang
  • Zhengxue Cheng
  • Yan Zhao
  • Shihao Wang
  • Dajiang Zhou
  • Guo Lu
  • Li Song

Learning-based probabilistic models can be combined with an entropy coder for data compression. However, due to the high complexity of learning-based models, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design while maintaining compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). First, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making RWKV the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass the prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves a 48% bit saving compared to the gzip compressor. Besides, L3TC offers compression performance comparable to other learned compressors, with a 50x reduction in model parameters. More importantly, L3TC is the fastest among all learned compressors, providing real-time decoding speeds of up to megabytes per second.
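
A toy, character-level sketch of the outlier-aware idea (the actual L3TC tokenizer operates on a learned sub-word vocabulary; `build_vocab` and the in-vocabulary flag here are illustrative only): frequent symbols go through the learned predictor, while rare symbols are flagged to bypass prediction and be coded verbatim.

```python
from collections import Counter

def build_vocab(corpus, size):
    # Keep the `size` most frequent symbols; everything else is an outlier.
    return {ch for ch, _ in Counter(corpus).most_common(size)}

def tokenize(text, vocab):
    """Pair each symbol with an in-vocabulary flag. Flagged-out symbols
    would bypass the probabilistic model and be stored verbatim."""
    return [(ch, ch in vocab) for ch in text]
```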

AAAI Conference 2025 Conference Paper

Multi-Frame Deformable Look-Up Table for Compressed Video Quality Enhancement

  • Gang He
  • Guancheng Quan
  • Chang Wu
  • Shihao Wang
  • Dajiang Zhou
  • Yunsong Li

The rapid progress of multimedia technology has led to an increased focus on enhancing the quality of experience (QoE) for video. Specifically, the demand for low-latency and high-quality decoding has grown significantly. Compressed Video Quality Enhancement (CVQE) methods based on Deep Neural Networks (DNNs) have achieved remarkable success. However, most of these methods suffer from high computational complexity, limiting their practicality in low-latency scenarios. Recently, Look-Up Table (LUT) methods have shown great efficiency, which makes them considerably promising in the field of low-latency CVQE. In this paper, we propose an efficient multi-frame deformable Look-Up Table structure for CVQE. Firstly, we design an efficient CNN to explore the inter-frame correlation and then predict the multi-scale convolution offsets. Secondly, we introduce a temporal feature extraction module and a multi-scale fusion module. We first exploit the predicted offsets to guide sampling for precise temporal alignment and extract multi-frame information. Then, higher quality frames are reconstructed from the fused multi-scale features. During inference, we convert these two modules into LUTs to achieve a sound trade-off between model performance and computational complexity. Experiments demonstrate that our proposed method dramatically outperforms state-of-the-art LUT-based methods and obtains competitive performance compared to CNN-based methods, with the capability to run in real time (30 fps) at 1080p resolution.
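
The bake-then-interpolate principle behind LUT inference can be shown in one dimension. The paper's tables are multi-frame and deformable; this 1-D version is only a sketch of why a pre-computed table can replace the network at run time:

```python
import numpy as np

def bake_lut(fn, levels=17):
    """Pre-compute fn on a coarse grid over [0, 1], offline.
    At deployment the (possibly expensive) fn is never called again."""
    xs = np.linspace(0.0, 1.0, levels)
    return xs, np.array([fn(x) for x in xs])

def lut_apply(x, xs, ys):
    """Replace the network call with cheap linear interpolation."""
    return np.interp(x, xs, ys)
```

Baking trades a small amount of accuracy (grid resolution) for a large, constant-time speedup per pixel, which is what makes the low-latency target reachable.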

AAAI Conference 2024 Conference Paper

Far3D: Expanding the Horizon for Surround-View 3D Object Detection

  • Xiaohui Jiang
  • Shuailin Li
  • Yingfei Liu
  • Shihao Wang
  • Fan Jia
  • Tiancai Wang
  • Lijin Han
  • Xiangyu Zhang

Recently, 3D object detection from surround-view images has made notable advances owing to its low deployment cost. However, most works have primarily focused on close perception range while leaving long-range detection less explored. Expanding existing methods directly to cover long distances poses challenges such as heavy computation costs and unstable convergence. To address these limitations, this paper proposes a novel sparse query-based framework, dubbed Far3D. By utilizing high-quality 2D object priors, we generate 3D adaptive queries that complement the 3D global queries. To efficiently capture discriminative features across different views and scales for long-range objects, we introduce a perspective-aware aggregation module. Additionally, we propose a range-modulated 3D denoising approach to address query error propagation and mitigate convergence issues in long-range tasks. Significantly, Far3D demonstrates SoTA performance on the challenging Argoverse 2 dataset, covering a wide perception range of 150 meters and surpassing several LiDAR-based approaches. The code is available at https://github.com/megvii-research/Far3D.
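
The 2D-prior-to-3D-query step can be sketched as a simple back-projection. The depth estimate and intrinsics handling here are illustrative assumptions; Far3D's actual adaptive queries also carry learned embeddings:

```python
import numpy as np

def lift_2d_prior(uv, depth, K_inv):
    """Back-project a 2D detection center (pixel uv) at an estimated
    depth into a 3D reference point for an adaptive query, using the
    inverse camera intrinsics K_inv."""
    uv1 = np.array([uv[0], uv[1], 1.0])  # homogeneous pixel coordinate
    return depth * (K_inv @ uv1)         # point on the camera ray
```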

AAAI Conference 2021 Conference Paper

Automated Model Design and Benchmarking of Deep Learning Models for COVID-19 Detection with Chest CT Scans

  • Xin He
  • Shihao Wang
  • Xiaowen Chu
  • Shaohuai Shi
  • Jiangping Tang
  • Xin Liu
  • Chenggang Yan
  • Jiyong Zhang

The COVID-19 pandemic has spread globally for several months. Because its transmissibility and high pathogenicity seriously threaten people’s lives, it is crucial to accurately and quickly detect COVID-19 infection. Many recent studies have shown that deep learning (DL) based solutions can help detect COVID-19 based on chest CT scans. However, most existing work focuses on 2D datasets, which may result in low quality models as the real CT scans are 3D images. Besides, the reported results span a broad spectrum on different datasets with a relatively unfair comparison. In this paper, we first use three state-of-the-art 3D models (ResNet3D101, DenseNet3D121, and MC3-18) to establish the baseline performance on three publicly available chest CT scan datasets. Then we propose a differentiable neural architecture search (DNAS) framework to automatically search the 3D DL models for 3D chest CT scan classification and use the Gumbel Softmax technique to improve the search efficiency. We further exploit the Class Activation Mapping (CAM) technique on our models to provide the interpretability of the results. The experimental results show that our searched models (CovidNet3D) outperform the baseline human-designed models on three datasets with tens of times smaller model size and higher accuracy. Furthermore, the results also verify that CAM can be well applied in CovidNet3D for COVID-19 datasets to provide interpretability for medical diagnosis. Code: https://github.com/HKBU-HPML/CovidNet3D.
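
The Gumbel-Softmax relaxation used to make the architecture choice differentiable can be sketched as follows. This is a generic NumPy illustration, not the paper's code; `tau` is the usual temperature parameter:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a soft (differentiable in the autograd setting) sample over
    candidate operations: add Gumbel noise to the logits, then apply a
    temperature-scaled softmax."""
    if rng is None:
        rng = np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=len(logits))))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())  # numerically stable softmax
    return e / e.sum()
```

As `tau` approaches 0 the sample approaches a one-hot choice of a single operation; larger `tau` keeps the mixture soft during search.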

ICRA Conference 2021 Conference Paper

Efficient Online Calibration for Autonomous Vehicle's Longitudinal Dynamical System: A Gaussian Model Approach

  • Shihao Wang
  • Canqiang Deng
  • Qingjie Qi

In this paper, we present an efficient online calibration system for the longitudinal vehicle dynamics of driverless cars. Instead of modeling the vehicle’s longitudinal dynamical system analytically, we employ a data-driven method to generate an "end-to-end" numerical model as a look-up table that stores the vehicle’s velocity, control command, and acceleration. This reference table must be recalibrated over time to account for changes in the vehicle’s hardware status. To reduce the expensive labor of the calibration process, we propose an effective algorithm that updates the reference look-up table with a Gaussian model approach. We introduce a 2-D Gaussian distribution to model the acceleration error between the value interpolated from the look-up table and the actual value measured by vehicle sensors. We estimate the model’s standard deviations with a "three-sigma rule" heuristic and calculate its height with a backtracking method such that the monotonicity constraint between acceleration and control command is strictly satisfied in the updated table. The effectiveness of the proposed system is verified in real-world road tests with a Lincoln MKZ.
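
A schematic version of the table update described above (the axis resolutions, sigmas, and halving factor are illustrative assumptions, not values from the paper):

```python
import numpy as np

def gaussian_update(table, cmd_axis, vel_axis, cmd, vel, err,
                    sigma_cmd=0.1, sigma_vel=1.0, height=1.0):
    """Spread an observed acceleration error over the (command, velocity)
    table with a 2-D Gaussian kernel, then shrink the kernel height
    (backtracking) until acceleration stays monotone in the control
    command at every velocity."""
    C, V = np.meshgrid(cmd_axis, vel_axis, indexing="ij")
    kernel = np.exp(-((C - cmd) ** 2) / (2 * sigma_cmd ** 2)
                    - ((V - vel) ** 2) / (2 * sigma_vel ** 2))
    h = height
    while h > 1e-6:
        candidate = table + h * err * kernel
        if np.all(np.diff(candidate, axis=0) >= 0):  # monotone in command
            return candidate
        h *= 0.5  # backtrack: halve the kernel height and retry
    return table  # no admissible update found; keep the old table
```

Halving the height until the check passes mirrors the backtracking idea: an update is accepted only if acceleration remains non-decreasing in the control command across the whole table.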

ICRA Conference 2018 Conference Paper

Realization of a Real-Time Optimal Control Strategy to Stabilize a Falling Humanoid Robot with Hand Contact

  • Shihao Wang
  • Kris Hauser

In this paper, we present a real-time falling robot stabilization system for a humanoid robot in which the robot can prevent falling using hand contact with walls and other surfaces in the environment. Instead of ignoring or avoiding interaction with environmental obstacles, our system uses obstacle geometry to determine a contact point that reduces impact and necessary friction. It uses a planar dynamic model that is appropriate for falling stabilization in the robot's sagittal plane and frontal plane. The hand contact is determined with an optimal control approach, and to make the algorithm run in real time, a simplified three-link robot model and a pre-computed database of subproblems for the hand contact optimization are adopted. Moreover, if the robot is not leaning too far after stabilization, we employ a heuristic push-up strategy to recover the robot to a standing posture. System integration is performed on the Darwin-Mini robot and validation is conducted in several environments and falling scenarios.