Arrow Research

Author name cluster

Jiayi Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers (10)

AAAI Conference 2026 Conference Paper

ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement

  • Xin Xu
  • Hao Liu
  • Wei Liu
  • Wei Wang
  • Jiayi Wu
  • Kui Jiang

The Low-Light Image Enhancement (LLIE) task aims at improving contrast while restoring details and textures for images captured in low-light conditions. The HVI color space has made significant progress in this task by enabling precise decoupling of chrominance and luminance. However, for the interaction of the chrominance and luminance branches, substantial distributional differences between the two branches, prevalent in natural images, limit complementary feature extraction, and luminance errors are propagated to chrominance channels through the nonlinear parameter. Furthermore, for the interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses exploit strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework including a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information along two dimensions: fusion and enhancement. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining the covariance of the chrominance branches. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.
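
To make the CCL concrete, here is a minimal, hypothetical PyTorch sketch of a covariance-style chrominance loss. The luminance-residual weighting and the covariance-matching term are assumptions about one plausible form of the idea (the function and argument names are illustrative), not the paper's exact formulation.

```python
import torch

def cov_correction_loss(pred_h, pred_v, gt_h, gt_v, lum_residual):
    """Hypothetical CCL-style loss: weight chrominance errors by luminance
    residual statistics and constrain the covariance between the two
    chrominance branches. All tensors are (B, 1, H, W)."""
    # Chrominance errors, weighted per-pixel by the luminance residual magnitude.
    w = lum_residual.abs()
    err = (w * (pred_h - gt_h).abs()).mean() + (w * (pred_v - gt_v).abs()).mean()

    def branch_cov(a, b):
        # Per-image covariance between two flattened branches.
        a = a.flatten(1) - a.flatten(1).mean(dim=1, keepdim=True)
        b = b.flatten(1) - b.flatten(1).mean(dim=1, keepdim=True)
        return (a * b).mean(dim=1)

    # Match predicted inter-branch covariance to the ground truth instead of
    # forcing correlation, which would conflict in weakly correlated regions.
    cov_term = (branch_cov(pred_h, pred_v) - branch_cov(gt_h, gt_v)).abs().mean()
    return err + cov_term
```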

JBHI Journal 2025 Journal Article

Cell-Level Free Cervical Lesion Detection in Cytology Images Via Weakly Supervised Self-Correction

  • Jiayi Wu
  • Yan Zhao
  • Chinmay Chakraborty
  • Sandeep Kumar Thota
  • Jingmin Xin
  • Keping Yu

Cervical cancer remains the fourth most common cancer among women worldwide. Early detection of cervical lesions in cytology images can prevent disease progression, but current deep learning methods for cell- or patch-level analysis in whole slide images (WSI) face significant challenges due to limited, noisy, or incomplete annotations. To address these limitations, weakly supervised learning methods, particularly multiple instance learning (MIL), have been explored. However, traditional MIL methods often suffer from label noise, leading to inaccurate feature extraction, which in turn restricts their robustness and generalization. In this paper, we propose Self-Correcting Instance Learning (SCIL), a novel two-stage instance-based MIL framework designed to enhance instance-level cervical lesion detection under bag-level supervision. SCIL incorporates a weakly supervised self-correction mechanism within a teacher-student architecture to mitigate the effects of noisy pseudo labels. This process involves a contrastive dynamic weighting strategy to adjust instance-level loss and enhance feature representation in stage one, followed by an uncertainty-based self-correction strategy in stage two to retain only high-confidence data with reassigned labels. Extensive evaluations on a whole slide cervical cytology image dataset demonstrate that SCIL significantly improves the detection of cervical lesions at both the patch and slide levels, highlighting its ability to overcome the limitations of imperfect data in cervical lesion detection.
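
The stage-two filtering step is easy to picture in code. The sketch below is a generic uncertainty-based self-correction pass, assuming softmax confidence as the uncertainty measure and a threshold `tau`; the paper's actual criterion and label-reassignment rule may differ.

```python
import torch

def self_correct(teacher_logits, pseudo_labels, tau=0.9):
    """Keep only high-confidence instances and reassign their labels from
    the teacher's predictions (a generic stand-in for SCIL's stage two).
    teacher_logits: (N, C); pseudo_labels: (N,) long tensor."""
    probs = torch.softmax(teacher_logits, dim=1)
    conf, teacher_labels = probs.max(dim=1)
    keep = conf >= tau  # discard uncertain instances entirely
    corrected = torch.where(keep, teacher_labels, pseudo_labels)
    return keep, corrected
```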

NeurIPS Conference 2025 Conference Paper

Controlling The Spread of Epidemics on Networks with Differential Privacy

  • Dũng Nguyen
  • Aravind Srinivasan
  • Renata Valieva
  • Anil Vullikanti
  • Jiayi Wu

Designing effective strategies for controlling epidemic spread by vaccination is an important question in epidemiology, especially in the early stages when vaccines are limited. This is a challenging question when the contact network is very heterogeneous, and strategies based on controlling network properties, such as the degree and spectral radius, have been shown to be effective. Implementation of such strategies requires detailed information on the contact structure, which might be sensitive in many applications. Our focus here is on choosing effective vaccination strategies when the edges are sensitive and differential privacy guarantees are needed. Our main contributions are $(\varepsilon, \delta)$-differentially private algorithms for designing vaccination strategies by reducing the maximum degree and spectral radius. Our key technique is a private algorithm for the multi-set multi-cover problem, which we use for controlling network properties. We evaluate privacy-utility tradeoffs of our algorithms on multiple synthetic and real-world networks, and show their effectiveness.
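
For intuition only, here is a toy $(\varepsilon, \delta)$-DP baseline for the degree-reduction setting: add Gaussian noise to the degree vector and vaccinate the top-$k$ nodes by noisy degree. The noise calibration is the standard Gaussian mechanism (valid for $\varepsilon \le 1$); the paper's actual algorithms go through a private multi-set multi-cover formulation and are substantially more involved.

```python
import numpy as np

def private_high_degree_nodes(degrees, k, epsilon, delta, rng=None):
    """Toy Gaussian-mechanism baseline: pick k approximately-highest-degree
    nodes under edge-level (epsilon, delta)-DP. Removing one edge changes
    two degrees by 1, so the degree vector has L2 sensitivity sqrt(2)."""
    rng = rng or np.random.default_rng()
    sensitivity = np.sqrt(2.0)
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    noisy = np.asarray(degrees, dtype=float) + rng.normal(0.0, sigma, len(degrees))
    return np.argsort(noisy)[-k:]  # node indices to vaccinate
```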

AAAI Conference 2025 Conference Paper

ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries

  • Wangyu Xue
  • Chen Qian
  • Jiayi Wu
  • Yang Zhou
  • Wentao Liu
  • Ju Ren
  • Siming Fan
  • Yaoxue Zhang

Existing research on human-centric video understanding typically focuses on analyzing specific moments or entire videos. However, many applications require higher precision at the frame level. In this work, we propose a novel task, BestShot, which aims to locate highlight frames within human-centric videos through language queries. This task requires not only a deep semantic understanding of human actions but also precise temporal localization. To support this task, we introduce the BestShot Benchmark. The benchmark is meticulously constructed by combining human-annotated highlight frames, duration labels and detailed textual descriptions. These descriptions cover three critical elements: (1) Visual content; (2) Fine-grained actions; and (3) Human pose descriptions. Together, these elements provide the necessary precision to identify the exact highlight frames in videos. To tackle this problem, we have collected two distinct datasets: (i) ShotGPT4o Dataset, which is algorithmically generated by GPT-4o and (ii) Image-SMPLText Dataset, which features large-scale and accurate per-frame pose descriptions using PoseScript and existing pose estimation datasets. Based on these datasets, we present a strong baseline model, ShotVL, fine-tuned from InternVL, specifically for BestShot. We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing state-of-the-art (SOTA) models. ShotVL demonstrates a significant 64% improvement over InternVL on the BestShot Benchmark and a notable 68% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval.
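
At inference time, the retrieval step reduces to ranking per-frame embeddings against a query embedding. The sketch below assumes generic joint vision-language embeddings (e.g., from a fine-tuned InternVL encoder); the scoring function and names are illustrative, not ShotVL's published interface.

```python
import numpy as np

def rank_highlight_frames(frame_embs, query_emb):
    """Rank T video frames against a language query by cosine similarity.
    frame_embs: (T, D) per-frame embeddings; query_emb: (D,)."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = f @ q                    # cosine similarity per frame
    return np.argsort(scores)[::-1]   # best-shot candidates first
```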

NeurIPS Conference 2025 Conference Paper

Simulating Society Requires Simulating Thought

  • Chance Jiajie Li
  • Jiayi Wu
  • Zhenze MO
  • Ao Qu
  • Yuhan Tang
  • Kaiya Zhao
  • Yulu Gan
  • Jie Fan

Simulating society with large language models (LLMs), we argue, requires more than generating plausible behavior; it demands cognitively grounded reasoning that is structured, revisable, and traceable. LLM-based agents are increasingly used to emulate individual and group behavior, primarily through prompting and supervised fine-tuning. Yet current simulations remain grounded in a behaviorist “demographics in, behavior out” paradigm, focusing on surface-level plausibility. As a result, they often lack internal coherence, causal reasoning, and belief traceability—making them unreliable for modeling how people reason, deliberate, and respond to interventions. To address this, we present a conceptual modeling paradigm, Generative Minds (GenMinds), which draws from cognitive science to support structured belief representations in generative agents. To evaluate such agents, we introduce the RECAP (REconstructing CAusal Paths) framework, a benchmark designed to assess reasoning fidelity via causal traceability, demographic grounding, and intervention consistency. These contributions advance a broader shift: from surface-level mimicry to generative agents that simulate thought—not just language—for social simulations.

IROS Conference 2025 Conference Paper

ViewActive: Active viewpoint optimization from a single image

  • Jiayi Wu
  • Xiaomin Lin 0002
  • Botao He
  • Cornelia Fermüller
  • Yiannis Aloimonos

When observing objects, humans benefit from their spatial visualization and mental rotation ability to envision potential optimal viewpoints based on the current observation. This capability is crucial for enabling robots to achieve efficient and robust scene perception during operation, as optimal viewpoints provide essential and informative features for accurately representing scenes in 2D images, thereby enhancing downstream tasks. To endow robots with this human-like active viewpoint optimization capability, we propose ViewActive, a modernized machine learning approach drawing inspiration from aspect graphs, which provides viewpoint optimization guidance based solely on the current 2D image input. Specifically, we introduce the 3D Viewpoint Quality Field (VQF), a compact and consistent representation for viewpoint quality distribution similar to an aspect graph, composed of three general-purpose viewpoint quality metrics: self-occlusion ratio, occupancy-aware surface normal entropy, and visual entropy. We utilize pre-trained image encoders to extract robust visual and semantic features, which are then decoded into the 3D VQF, allowing our model to generalize effectively across diverse objects, including unseen categories. The lightweight ViewActive network (72 FPS on a single GPU) significantly enhances the performance of state-of-the-art object recognition pipelines and can be integrated into real-time motion planning for robotic applications. Our code and dataset are available at https://github.com/jiayi-wu-umd/ViewActive.
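
Of the three viewpoint-quality metrics, visual entropy is the simplest to illustrate. The sketch below reads it as the Shannon entropy of the view's intensity histogram; this is one plausible interpretation for illustration, not necessarily the paper's exact definition.

```python
import numpy as np

def visual_entropy(gray, bins=64):
    """Shannon entropy of a grayscale view's intensity histogram.
    gray: float array with values in [0, 1]."""
    hist, _ = np.histogram(gray, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]                       # drop empty bins (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())
```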

NeurIPS Conference 2024 Conference Paper

Cross-model Control: Improving Multiple Large Language Models in One-time Training

  • Jiayi Wu
  • Hao Sun
  • Hengyi Cai
  • Lixin Su
  • Shuaiqiang Wang
  • Dawei Yin
  • Xiang Li
  • Ming Gao

The number of large language models (LLMs) with varying parameter scales and vocabularies is increasing. While they deliver powerful performance, they also face a set of common optimization needs to meet specific requirements or standards, such as instruction following or avoiding the output of sensitive information from the real world. However, how to transfer the fine-tuning outcomes of one model to other models to reduce training costs remains a challenge. To bridge this gap, we introduce Cross-model Control (CMC), a method that improves multiple LLMs in one-time training with a portable tiny language model. Specifically, we have observed that the logit shift before and after fine-tuning is remarkably similar across different models. Based on this insight, we incorporate a tiny language model with a minimal number of parameters. By training alongside a frozen template LLM, the tiny model gains the capability to alter the logits output by the LLMs. To make this tiny language model applicable to models with different vocabularies, we propose a novel token mapping strategy named PM-MinED. We have conducted extensive experiments on instruction tuning and unlearning tasks, demonstrating the effectiveness of CMC. Our code is available at https://github.com/wujwyi/CMC.
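
The core mechanism, applying one tiny model's logit shift to many frozen LLMs, is compact enough to sketch. Below, `token_map` stands in for a precomputed PM-MinED-style mapping from the base model's vocabulary to the tiny model's; the function name and shapes are assumptions for illustration.

```python
import torch

def cmc_adjust_logits(base_logits, tiny_delta_logits, token_map):
    """Shift a frozen base LLM's logits using a tiny model's learned
    logit deltas, across mismatched vocabularies.
    base_logits:       (..., V_base)
    tiny_delta_logits: (..., V_tiny)
    token_map:         (V_base,) long tensor; token_map[i] is the tiny-vocab
                       token matched to base-vocab token i."""
    return base_logits + tiny_delta_logits[..., token_map]
```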

IROS Conference 2024 Conference Paper

MARVIS: Motion & Geometry Aware Real and Virtual Image Segmentation

  • Jiayi Wu
  • Xiaomin Lin 0002
  • Shahriar Negahdaripour
  • Cornelia Fermüller
  • Yiannis Aloimonos

Tasks such as autonomous navigation, 3D reconstruction, and object recognition near the water surface are crucial in marine robotics applications. However, challenges arise due to dynamic disturbances, e.g., light reflections and refraction from the random air-water interface, irregular liquid flow, and similar factors, which can lead to potential failures in perception and navigation systems. Traditional computer vision algorithms struggle to differentiate between real and virtual image regions, significantly complicating tasks. A virtual image region is an apparent representation formed by the redirection of light rays, typically through reflection or refraction, creating the illusion of an object’s presence without its actual physical location. This work proposes a novel approach for segmenting real and virtual image regions, exploiting synthetic images combined with domain-invariant information, a Motion Entropy Kernel, and Epipolar Geometric Consistency. Our segmentation network does not need to be re-trained if the domain changes. We show this by deploying the same segmentation network in two different domains: simulation and the real world. By creating realistic synthetic images that mimic the complexities of the water surface, we provide fine-grained training data for our network (MARVIS) to discern between real and virtual images effectively. Through motion- and geometry-aware design choices and comprehensive experimental analysis, we achieve state-of-the-art real-virtual image segmentation performance in the unseen real-world domain, achieving an IoU over 78% and an F1-Score over 86% while ensuring a small computational footprint. MARVIS offers over 43 FPS (8 FPS) inference rates on a single GPU (CPU core). Our code and dataset are available at https://github.com/jiayi-wu-umd/MARVIS.
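
One plausible reading of the Motion Entropy Kernel, sketched below for illustration: real surfaces move coherently between frames while reflected or refracted regions flicker, so the entropy of optical-flow directions inside a patch separates the two. This is an assumption about the idea, not the paper's exact definition.

```python
import numpy as np

def motion_entropy(flow, bins=16):
    """Entropy of optical-flow directions inside an image patch.
    flow: (H, W, 2) array of per-pixel flow vectors."""
    ang = np.arctan2(flow[..., 1], flow[..., 0])        # flow direction
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())  # high entropy -> likely virtual
```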

ICRA Conference 2023 Conference Paper

UDepth: Fast Monocular Depth Estimation for Visually-guided Underwater Robots

  • Boxiao Yu
  • Jiayi Wu
  • Md Jahidul Islam

In this paper, we present a fast monocular depth estimation method for enabling 3D perception capabilities of low-cost underwater robots. We formulate a novel end-to-end deep visual learning pipeline named UDepth, which incorporates domain knowledge of image formation characteristics of natural underwater scenes. First, we adapt a new input space from raw RGB image space by exploiting the underwater light attenuation prior, and then devise a least-squares formulation for coarse pixel-wise depth prediction. Subsequently, we extend this into a domain projection loss that guides the end-to-end learning of UDepth on over 9K RGB-D training samples. UDepth is designed with a computationally light MobileNetV2 backbone and a Transformer-based optimizer for ensuring fast inference rates on embedded systems. Through domain-aware design choices and comprehensive experimental analyses, we demonstrate that it is possible to achieve state-of-the-art depth estimation performance while ensuring a small computational footprint. Specifically, with 70%-80% fewer network parameters than existing benchmarks, UDepth achieves comparable and often better depth estimation performance. While the full model offers over 66 FPS (13 FPS) inference rates on a single GPU (CPU core), our domain projection for coarse depth prediction runs at 51.5 FPS on single-board Jetson TX2s. The inference pipelines are available at https://github.com/uf-robopi/UDepth.
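
The attenuation-prior step can be pictured with a short sketch. Red light attenuates fastest underwater, so the gap between max(G, B) and R grows with distance; below, a linear model with placeholder coefficients `theta` stands in for the mapping UDepth fits by least squares on RGB-D data.

```python
import numpy as np

def coarse_depth(rgb, theta=(0.1, 0.9)):
    """Coarse relative depth from an underwater light-attenuation prior.
    rgb: (H, W, 3) float image in [0, 1]. The coefficients are placeholders;
    UDepth learns its mapping by least squares on RGB-D training samples."""
    r = rgb[..., 0]
    m = rgb[..., 1:].max(axis=-1)   # max of green and blue channels
    prior = m - r                   # grows with distance from the camera
    d = theta[0] + theta[1] * prior
    return np.clip(d, 0.0, 1.0)     # normalized relative depth
```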