Arrow Research

Author name cluster

Yingcong Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers · 1 author row

Possible papers (7)

NeurIPS 2025 · Conference Paper

ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

  • Litao Guo
  • Xinli Xu
  • Luozhou Wang
  • Jiantao Lin
  • Jinsong Zhou
  • Zixin Zhang
  • Bolan Su
  • Yingcong Chen

With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform. ComfyMind introduces two core innovations: a Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; and a Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks spanning generation, editing, and reasoning tasks: ComfyBench, GenEval, and Reason-Edit. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems.
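
To make the planning mechanism concrete, here is a minimal sketch of tree-based planning with localized feedback, assuming a depth-first execution in which a failed stage retries sibling alternatives before propagating failure upward. All names (PlanNode, execute_step, plan_and_execute) are hypothetical illustrations, not ComfyMind's actual API.

```python
# Hypothetical sketch of tree-based planning with localized feedback.
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """One stage of a generation plan; children holds, per stage,
    a list of alternative sub-plans to try in order."""
    action: str
    children: list = field(default_factory=list)


def execute_step(action: str) -> bool:
    """Stand-in for invoking a semantic workflow module; returns success."""
    return not action.startswith("bad")  # toy success criterion for the demo


def plan_and_execute(node: PlanNode) -> bool:
    """Depth-first execution with localized correction: if a stage fails,
    try its sibling alternatives before propagating failure upward."""
    if not execute_step(node.action):
        return False
    for stage_alternatives in node.children:
        if not any(plan_and_execute(alt) for alt in stage_alternatives):
            return False  # no alternative worked at this stage
    return True


plan = PlanNode("compose", children=[
    [PlanNode("bad_upscale"), PlanNode("upscale")],  # fallback succeeds
    [PlanNode("refine")],
])
print(plan_and_execute(plan))  # True: the failed branch was corrected locally
```

The point of the structure is that correction stays local: a failure at one stage only triggers alternatives for that stage, rather than a re-plan of the entire workflow.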

NeurIPS 2025 · Conference Paper

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

  • Hao Lu
  • Tianshuo Xu
  • Wenzhao Zheng
  • Yunpeng Zhang
  • Wei Zhan
  • Dalong Du
  • Masayoshi Tomizuka
  • Kurt Keutzer

Large reconstruction models have made remarkable progress and can directly predict 3D or 4D representations for unseen scenes and objects. However, current work has not systematically explored their potential in the field of autonomous driving. To this end, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon). With an elaborate yet simple framework design, it not only ensures efficient and high-quality reconstruction but also provides potential for downstream tasks. There are two core contributions: first, the Prune and Dilate Block (PD-Block) is proposed to prune redundant and overlapping Gaussian points and to dilate Gaussian points for complex objects; second, dynamic and static decoupling is tailored to better learn temporally consistent geometry across time. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle type adaptation, and scene editing. Our code will be made available.
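
The PD-Block's prune-and-dilate behavior can be sketched in a few lines. The toy example below assumes per-point opacity as the pruning signal and a scalar complexity score as the dilation signal; the thresholds and the cloning-with-jitter step are illustrative guesses, not the paper's implementation.

```python
# Toy sketch of prune-and-dilate over Gaussian points (not the real PD-Block).
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))          # Gaussian centers
opacity = rng.uniform(size=1000)             # per-point opacity
complexity = rng.uniform(size=1000)          # proxy for local geometric detail

# Prune: remove redundant, nearly transparent Gaussians.
keep = opacity > 0.1
points, complexity = points[keep], complexity[keep]

# Dilate: densify points in complex regions by cloning with small offsets.
dense = complexity > 0.9
clones = points[dense] + rng.normal(scale=0.01, size=(dense.sum(), 3))
points = np.concatenate([points, clones], axis=0)
print(points.shape)  # more points around complex objects, fewer elsewhere
```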

NeurIPS 2025 · Conference Paper

Event-Guided Consistent Video Enhancement with Modality-Adaptive Diffusion Pipeline

  • Kanghao Chen
  • Zixin Zhang
  • Guoqiang Liang
  • Lutao Jiang
  • Zeyu Wang
  • Yingcong Chen

Recent advancements in low-light video enhancement (LLVE) have increasingly leveraged both RGB and event cameras to improve video quality under challenging conditions. However, existing approaches share two key drawbacks. First, they are tuned for steady low-light scenes, so their performance drops when illumination varies. Second, they assume every sensing modality is always available, while real systems may lose or corrupt one of them. These limitations make the methods brittle in dynamic, real-world settings. In this paper, we propose EVDiffuser, a novel framework for consistent LLVE that integrates RGB and event data through a modality-adaptive diffusion pipeline. By harnessing the powerful priors of video diffusion models, EVDiffuser enables consistent video enhancement and generalization to diverse scenarios under varying illumination, where RGB or events may even be absent. Specifically, we first design a modality-agnostic conditioning mechanism based on a diffusion pipeline by treating the two modalities as optional conditions, which is fine-tuned using augmented and integrated datasets. Furthermore, we introduce a modality-adaptive guidance rescaling that dynamically adjusts the contribution of each modality according to sensor-specific characteristics. Additionally, we establish a benchmark that accounts for varying illumination and diverse real-world scenarios, facilitating future research on consistent event-guided LLVE. Our experiments demonstrate state-of-the-art performance across challenging scenarios (i.e., varying illumination) and sensor-based settings (e.g., event-only, RGB-only), highlighting the generalization of our framework.
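
The modality-adaptive guidance lends itself to a short sketch. The code below shows a classifier-free-guidance-style combination in which each modality's noise prediction is optional and its contribution is rescaled by a per-modality weight; the function name, weights, and guidance scale are assumptions for illustration, not EVDiffuser's actual formulation.

```python
# Hypothetical sketch of guidance over optional RGB / event conditions.
import numpy as np


def guided_noise(eps_uncond, eps_rgb=None, eps_event=None,
                 w_rgb=1.0, w_event=1.0, scale=4.0):
    """Mix conditional noise predictions from whichever modalities exist,
    rescaling each modality's contribution by its weight."""
    guidance = np.zeros_like(eps_uncond)
    total = 0.0
    if eps_rgb is not None:
        guidance += w_rgb * (eps_rgb - eps_uncond)
        total += w_rgb
    if eps_event is not None:
        guidance += w_event * (eps_event - eps_uncond)
        total += w_event
    if total == 0.0:          # neither modality present: unconditional sample
        return eps_uncond
    return eps_uncond + scale * guidance / total


x = np.zeros((4, 4))
print(guided_noise(x, eps_event=np.ones((4, 4)), w_event=0.5))  # event-only
```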

NeurIPS 2025 · Conference Paper

PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring

  • Wang Wang
  • Xiao Yang
  • Qingyong Hu
  • Jack Tang
  • Can Liu
  • Dengbo He
  • Yuntao Wang
  • Yingcong Chen

Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration of various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied by six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal-processing and deep-learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open-source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart-cockpit systems.
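
As a hint of how such a benchmark is typically scored, remote physiological measurement evaluations commonly reduce a pulse signal to a heart-rate estimate via its dominant spectral peak. The sketch below illustrates that standard recipe on a synthetic BVP trace; it is not taken from PhysDrive's released code, and the sampling rate is an assumption.

```python
# Common recipe: estimate HR (beats/min) from a BVP trace via FFT.
import numpy as np

fs = 30.0                                   # sampling rate, Hz (assumed)
t = np.arange(0, 30, 1 / fs)                # 30 s synthetic BVP trace
bvp = np.sin(2 * np.pi * 1.2 * t)           # 1.2 Hz pulse = 72 bpm

freqs = np.fft.rfftfreq(len(bvp), d=1 / fs)
power = np.abs(np.fft.rfft(bvp)) ** 2
band = (freqs >= 0.7) & (freqs <= 3.0)      # plausible HR band: 42-180 bpm
hr_bpm = 60.0 * freqs[band][np.argmax(power[band])]
print(round(hr_bpm, 1))  # ~72.0
```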

JBHI 2024 · Journal Article

ConDiff-rPPG: Robust Remote Physiological Measurement to Heterogeneous Occlusions

  • Jiyao Wang
  • Ximeng Wei
  • Hao Lu
  • Yingcong Chen
  • Dengbo He

Remote photoplethysmography (rPPG) is a contactless technique that facilitates the measurement of physiological signals and cardiac activities through facial video recordings. This approach holds tremendous potential for various applications. However, existing rPPG methods often do not account for the different types of occlusions that commonly occur in real-world scenarios, such as temporary movements or actions of people in the video, or dust on the camera. Failure to address these occlusions can compromise the accuracy of rPPG algorithms. To address this issue, we propose ConDiff-rPPG, a novel method that improves the robustness of rPPG measurement under various occlusions. First, we compress the damaged face video into a spatio-temporal representation with several types of masks. Second, a diffusion model is designed to recover the missing information, conditioned on the observed values. Moreover, a novel low-rank decomposition regularization is proposed to eliminate background noise and maximize informative features, ensuring consistency in optimization goals during training. Through extensive experiments, including intra- and cross-dataset evaluations as well as ablation tests, we demonstrate the robustness and generalization ability of the proposed model.
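
The low-rank decomposition regularization mentioned above can be motivated with a small example. A common way to encourage low rank is a nuclear-norm (sum of singular values) penalty, which this sketch assumes; the paper's exact regularizer may differ.

```python
# Nuclear-norm view of low-rank regularization on a spatio-temporal map.
import numpy as np

rng = np.random.default_rng(0)
# Toy spatio-temporal map: T time steps x S spatial regions.
stmap = np.outer(np.sin(np.linspace(0, 6, 300)), np.ones(64))  # rank-1 pulse
noisy = stmap + 0.1 * rng.normal(size=stmap.shape)             # + background


def nuclear_norm(x):
    """Sum of singular values; small when x is close to low-rank."""
    return np.linalg.svd(x, compute_uv=False).sum()


# The clean pulse map is far lower-rank than the noisy one, so penalizing
# this quantity pushes the recovered representation toward the signal.
print(nuclear_norm(stmap) < nuclear_norm(noisy))  # True
```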

JBHI 2024 · Journal Article

Hierarchical Style-Aware Domain Generalization for Remote Physiological Measurement

  • Jiyao Wang
  • Hao Lu
  • Ange Wang
  • Yingcong Chen
  • Dengbo He

The utilization of remote photoplethysmography (rPPG) technology has gained attention in recent years due to its ability to extract the blood volume pulse (BVP) from facial videos, making it accessible for various applications such as health monitoring and emotional analysis. However, the BVP signal is susceptible to complex environmental changes and individual differences, causing existing methods to struggle to generalize to unseen domains. This article addresses the domain-shift problem in rPPG measurement and shows that most domain generalization methods fail to work well on this problem due to ambiguous instance-specific differences. To address this, the article proposes Hierarchical Style-aware Representation Disentangling (HSRD), a novel approach that improves generalization by separating domain-invariant and instance-specific feature spaces during training, which increases robustness to out-of-distribution samples during inference. This work achieves state-of-the-art performance against several methods in both cross-dataset and intra-dataset settings.
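
One simple way to picture the style/content separation is the instance-normalization view sketched below: per-instance feature statistics act as the instance-specific "style", and the normalized residual as the domain-invariant part. This is an assumed simplification for illustration; HSRD's hierarchical architecture is more involved.

```python
# Instance-norm style/content split (assumed illustration, not HSRD itself).
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(loc=2.0, scale=3.0, size=(8, 16))  # batch x channels

style_mean = feat.mean(axis=1, keepdims=True)        # instance-specific stats
style_std = feat.std(axis=1, keepdims=True)
content = (feat - style_mean) / (style_std + 1e-5)   # domain-invariant part

# The content branch is what a downstream rPPG head would consume; style
# statistics can be perturbed during training to simulate unseen domains.
print(content.mean(), content.std())                 # ~0 and ~1 per instance
```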

NeurIPS 2023 · Conference Paper

ReTR: Modeling Rendering Via Transformer for Generalizable Neural Surface Reconstruction

  • Yixun Liang
  • Hao He
  • Yingcong Chen

Generalizable neural surface reconstruction techniques have attracted great attention in recent years. However, they encounter limitations such as low-confidence depth distributions and inaccurate surface reasoning due to the oversimplified volume rendering process they employ. In this paper, we present the Reconstruction TRansformer (ReTR), a novel framework that leverages the transformer architecture to redesign the rendering process, enabling complex render-interaction modeling. It introduces a learnable meta-ray token and utilizes the cross-attention mechanism to simulate the interaction of the rendering process with sampled points and to render the observed color. Meanwhile, by operating within a high-dimensional feature space rather than the color space, ReTR mitigates sensitivity to projected colors in source views. These improvements yield accurate surface assessment with high confidence. We demonstrate the effectiveness of our approach on various datasets, showcasing how our method outperforms current state-of-the-art approaches in terms of reconstruction quality and generalization ability. Our code is available at https://github.com/YixunLiang/ReTR.
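
The core rendering idea can be illustrated with single-head cross-attention: a learnable meta-ray token queries the features of points sampled along a ray, and the attention weights play the role that hand-crafted volume-rendering weights play in classical pipelines. The sketch below is a schematic simplification with assumed shapes, not ReTR's implementation.

```python
# Schematic: meta-ray token cross-attends over per-point features on a ray.
import numpy as np

rng = np.random.default_rng(0)
d = 32                                   # feature dimension
point_feats = rng.normal(size=(64, d))   # 64 samples along one ray
meta_ray_token = rng.normal(size=(d,))   # learnable query (random stand-in)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Single-head cross-attention: token is the query, points are keys/values.
weights = softmax(point_feats @ meta_ray_token / np.sqrt(d))
ray_feature = weights @ point_feats      # aggregated high-dimensional feature
print(ray_feature.shape)                 # (32,): decoded to color by an MLP
```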