Arrow Research search

Author name cluster

Minghao Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

AAAI Conference 2026 · Conference Paper

Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction

  • Changyue Shi
  • Chuxiao Yang
  • Xinyuan Hu
  • Minghao Chen
  • Wenwen Pan
  • Yan Yang
  • Jiajun Ding
  • Zhou Yu

Dynamic Gaussian Splatting approaches have achieved remarkable performance for 4D scene reconstruction. However, these approaches rely on dense-frame video sequences for photorealistic reconstruction. In real-world scenarios, due to equipment constraints, sometimes only sparse frames are accessible. In this paper, we propose Sparse4DGS, the first method for sparse-frame dynamic scene reconstruction. We observe that dynamic reconstruction methods fail in both canonical and deformed spaces under sparse-frame settings, especially in areas with high texture richness. Sparse4DGS tackles this challenge by focusing on texture-rich areas. For the deformation network, we propose Texture-Aware Deformation Regularization, which introduces a texture-based depth alignment loss to regulate Gaussian deformation. For the canonical Gaussian field, we introduce Texture-Aware Canonical Optimization, which incorporates texture-based noise into the gradient descent process of canonical Gaussians. Extensive experiments show that when taking sparse frames as inputs, our method outperforms existing dynamic or few-shot techniques on NeRF-Synthetic, HyperNeRF, NeRF-DS, and our iPhone-4D datasets.
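
A minimal sketch of the texture-aware depth regularization idea: weight a depth alignment term by a per-pixel texture richness map estimated from image gradients. The richness estimate and the loss form are my assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def texture_richness(image: torch.Tensor) -> torch.Tensor:
    """Per-pixel texture score from gradient magnitude. image: (3, H, W)."""
    gray = image.mean(dim=0, keepdim=True)                        # (1, H, W)
    dx = F.pad((gray[:, :, 1:] - gray[:, :, :-1]).abs(), (0, 1))
    dy = F.pad((gray[:, 1:, :] - gray[:, :-1, :]).abs(), (0, 0, 0, 1))
    mag = dx + dy
    return mag / (mag.max() + 1e-8)                               # normalized to [0, 1]

def texture_aware_depth_loss(rendered_depth, prior_depth, image):
    """Penalize depth misalignment more heavily in texture-rich areas,
    where the paper reports sparse-frame reconstruction failing most."""
    w = texture_richness(image)
    return (w * (rendered_depth - prior_depth).abs()).mean()
```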

AAAI Conference 2026 · Conference Paper

SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images

  • Xinyuan Hu
  • Changyue Shi
  • Chuxiao Yang
  • Minghao Chen
  • Jiajun Ding
  • Tao Wei
  • Chen Wei
  • Zhou Yu

Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from RGFE. To further refine predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.
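
A hypothetical sketch of the Texture-Aware Density Control step: allocate more Gaussian primitives to texture-rich pixels of the LR input. The budget scheme and values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gaussian_budget(lr_image: torch.Tensor, base: int = 1, extra: int = 2) -> torch.Tensor:
    """lr_image: (3, H, W) in [0, 1]. Returns an (H, W) integer map giving
    each pixel `base` Gaussians plus up to `extra` more in textured regions."""
    gray = lr_image.mean(dim=0)                                   # (H, W)
    dx = F.pad((gray[:, 1:] - gray[:, :-1]).abs(), (0, 1))
    dy = F.pad((gray[1:, :] - gray[:-1, :]).abs(), (0, 0, 0, 1))
    tex = (dx + dy) / ((dx + dy).max() + 1e-8)                    # internal texture richness
    return base + (tex * extra).round().long()
```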

NeurIPS Conference 2025 · Conference Paper

AutoPartGen: Autoregressive 3D Part Generation and Discovery

  • Minghao Chen
  • Jianyuan Wang
  • Roman Shapovalov
  • Tom Monnier
  • Hyunyoung Jung
  • Dilin Wang
  • Rakesh Ranjan
  • Iro Laina

We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object's parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus determining automatically the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.
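
Schematically, the autoregressive loop looks like the sketch below; the `model` callable and its stop signal are hypothetical stand-ins, and the real model operates on 3DShape2VecSet latents.

```python
def generate_parts(model, conditioning, max_parts=32):
    """Generate part latents one at a time, each conditioned on the inputs
    (image, masks, or 3D object) and on all previously generated parts."""
    parts = []
    for _ in range(max_parts):
        part, stop = model(conditioning, parts)   # hypothetical interface
        if stop:                                  # model decides all parts are done
            break
        parts.append(part)
    return parts  # parts assemble into the object without extra optimization
```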

AAAI Conference 2025 · Conference Paper

STraj: Self-training for Bridging the Cross-Geography Gap in Trajectory Prediction

  • Zhanwei Zhang
  • Minghao Chen
  • Zhihong Gu
  • Xinkui Zhao
  • Zheng Yang
  • Binbin Lin
  • Deng Cai
  • Wenxiao Wang

Accurate trajectory prediction is of prominent significance in autonomous driving scenarios. Most existing methods predict the trajectory of an agent by learning its interaction with other agents and the map within the scenario. However, the heterogeneous distribution of these elements across different geographical scenarios is usually ignored. Thus, trajectory predictors might struggle to generalize well when deployed in different geographical scenarios. To bridge the cross-geography gap, in this paper, we propose a plug-and-play self-training pipeline, termed STraj, for cross-geography trajectory prediction. STraj comprises three progressive steps: pseudo label (i.e., time-series trajectory) generation, update, and utilization. First, to generate pseudo labels that generalize to the cross-geography scenarios, STraj pre-trains the predictor through complementary agent and map augmentations. Second, to facilitate stable training of the predictor, we design a specific pseudo label update strategy. This strategy selects high-consistency pseudo trajectories from the current and historical epochs to supervise the target domain samples. Third, with the generated pseudo trajectories, we introduce trajectory-induced contrastive learning to mitigate the representation bias of cross-geography agents. Extensive experimental results on various cross-geography trajectory prediction benchmarks demonstrate the effectiveness of STraj.
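
A small sketch of the consistency-based pseudo-label update (the second step), assuming trajectories as (N, T, 2) arrays and a made-up agreement threshold:

```python
import numpy as np

def select_consistent_pseudo_labels(current, historical, tol=0.5):
    """Keep a target sample's pseudo trajectory only when the current and
    historical epoch predictions agree (average displacement error < tol)."""
    ade = np.linalg.norm(current - historical, axis=-1).mean(axis=-1)  # (N,)
    keep = np.where(ade < tol)[0]
    return keep, current[keep]   # indices and pseudo labels used for supervision
```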

NeurIPS Conference 2024 · Conference Paper

AutoManual: Constructing Instruction Manuals by LLM Agents via Interactive Environmental Learning

  • Minghao Chen
  • Yihang Li
  • Yanting Yang
  • Shiyu Yu
  • Binbin Lin
  • Xiaofei He

Large Language Model (LLM) based agents have shown promise in autonomously completing tasks across various domains, e.g., robotics, games, and web navigation. However, these agents typically require elaborate design and expert prompts to solve tasks in specific domains, which limits their adaptability. We introduce AutoManual, a framework enabling LLM agents to autonomously build their understanding through interaction and adapt to new environments. AutoManual categorizes environmental knowledge into diverse rules and optimizes them in an online fashion by two agents: 1) The Planner codes actionable plans based on current rules for interacting with the environment. 2) The Builder updates the rules through a well-structured rule system that facilitates online rule management and essential detail retention. To mitigate hallucinations in managing rules, we introduce a case-conditioned prompting strategy for the Builder. Finally, the Formulator agent compiles these rules into a comprehensive manual. The self-generated manual can not only improve adaptability but also guide the planning of smaller LLMs while being human-readable. Given only one simple demonstration, AutoManual significantly improves task success rates, achieving 97.4% with GPT-4-turbo and 86.2% with GPT-3.5-turbo on ALFWorld benchmark tasks. The code is available at https://github.com/minghchen/automanual.
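
The Planner/Builder/Formulator interplay can be summarized in a schematic loop like the one below; the agent and environment interfaces are hypothetical stand-ins (the real agents are prompted LLMs).

```python
def automanual_loop(planner, builder, formulator, env, episodes=10):
    """Alternate planning and rule building, then compile the manual."""
    rules = []                                               # online rule set
    for _ in range(episodes):
        plan = planner.write_plan(env.observation(), rules)  # plan written as code
        trajectory = env.execute(plan)
        rules = builder.update_rules(rules, trajectory)      # case-conditioned prompting
    return formulator.compile_manual(rules)                  # human-readable manual
```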

IJCAI Conference 2024 · Conference Paper

G2LTraj: A Global-to-Local Generation Approach for Trajectory Prediction

  • Zhanwei Zhang
  • Zishuo Hua
  • Minghao Chen
  • Wei Lu
  • Binbin Lin
  • Deng Cai
  • Wenxiao Wang

Predicting future trajectories of traffic agents accurately holds substantial importance in various applications such as autonomous driving. Previous methods commonly infer all future steps of an agent either recursively or simultaneously. However, the recursive strategy suffers from the accumulated error, while the simultaneous strategy overlooks the constraints among future steps, resulting in kinematically infeasible predictions. To address these issues, in this paper, we propose G2LTraj, a plug-and-play global-to-local generation approach for trajectory prediction. Specifically, we generate a series of global key steps that uniformly cover the entire future time range. Subsequently, the local intermediate steps between the adjacent key steps are recursively filled in. In this way, we prevent the accumulated error from propagating beyond the adjacent key steps. Moreover, to boost the kinematical feasibility, we not only introduce the spatial constraints among key steps but also strengthen the temporal constraints among the intermediate steps. Finally, to ensure the optimal granularity of key steps, we design a selectable granularity strategy that caters to each predicted trajectory. Our G2LTraj significantly improves the performance of seven existing trajectory predictors across the ETH, UCY and nuScenes datasets. Experimental results demonstrate its effectiveness. Code will be available at https://github.com/Zhanwei-Z/G2LTraj.
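
The global-to-local scheme can be illustrated with a toy sketch: predict key steps over the horizon, then recursively refine midpoints between adjacent key steps so accumulated error cannot propagate past a key step. The midpoint refiner below is a stand-in for the paper's learned local generator.

```python
def fill_between(a, b, refine, depth):
    """Recursively insert refined intermediate steps between key steps a, b."""
    if depth == 0:
        return []
    mid = refine(a, b)
    return fill_between(a, mid, refine, depth - 1) + [mid] + \
           fill_between(mid, b, refine, depth - 1)

def g2l_generate(key_steps, refine, depth=2):
    """key_steps: list of (x, y) global key positions; returns a dense trajectory."""
    out = [key_steps[0]]
    for a, b in zip(key_steps[:-1], key_steps[1:]):
        out += fill_between(a, b, refine, depth) + [b]
    return out

# toy refiner: plain midpoint (the paper learns this step)
dense = g2l_generate([(0, 0), (4, 0), (8, 4)],
                     refine=lambda a, b: ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2))
```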

AAAI Conference 2024 · Conference Paper

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training

  • Yuqi Lin
  • Minghao Chen
  • Kaipeng Zhang
  • Hengjia Li
  • Mingming Li
  • Zheng Yang
  • Dongqin Lv
  • Binbin Lin

Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture the global features to distinguish different text descriptions supervised by contrastive loss, making it highly effective for single-label classification. However, it shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class and the contrastive nature of the softmax operation aggravates it. In this study, we observe that multi-label classification results heavily rely on discriminative local features that are overlooked by CLIP. As a result, we dissect the preservation of patch-wise spatial information in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is solely based on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. Besides, to comprehensively assess the quality and practicality of generated tags, we extend their application to the downstream task, i.e., weakly supervised semantic segmentation (WSSS) with generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
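
Step (1) amounts to scoring every patch token against the class text embeddings rather than only the global class token; a minimal sketch of that idea (my simplification, with DMAR and CWR omitted):

```python
import torch
import torch.nn.functional as F

def patch_level_scores(patch_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (P, D) projected CLIP patch features; text_embeds: (C, D).
    Returns (C,) coarse multi-label scores: each class's best patch response."""
    sim = F.normalize(patch_tokens, dim=-1) @ F.normalize(text_embeds, dim=-1).T  # (P, C)
    return sim.max(dim=0).values
```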

ICML Conference 2021 · Conference Paper

KD3A: Unsupervised Multi-Source Decentralized Domain Adaptation via Knowledge Distillation

  • Haozhe Feng
  • Zhaoyang You
  • Minghao Chen
  • Tianye Zhang
  • Minfeng Zhu 0001
  • Fei Wu 0001
  • Chao Wu 0001
  • Wei Chen 0001

Conventional unsupervised multi-source domain adaptation (UMDA) methods assume all source domains can be accessed directly. However, this assumption neglects the privacy-preserving policy, where all the data and computations must be kept decentralized. There exist three challenges in this scenario: (1) Minimizing the domain distance requires the pairwise calculation of the data from the source and target domains, while the data on the source domains are not available. (2) The communication cost and privacy security limit the application of existing UMDA methods, such as domain adversarial training. (3) Since users cannot govern the data quality, irrelevant or malicious source domains are more likely to appear, which causes negative transfer. To address the above problems, we propose a privacy-preserving UMDA paradigm named Knowledge Distillation based Decentralized Domain Adaptation (KD3A), which performs domain adaptation through knowledge distillation on models from different source domains. Extensive experiments show that KD3A significantly outperforms state-of-the-art UMDA approaches. Moreover, KD3A is robust to negative transfer and brings a 100x reduction in communication cost compared with other decentralized UMDA methods.
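
The decentralization hinges on distilling from the source models' predictions rather than their data; a simplified sketch follows (plain averaging here, whereas KD3A weighs domains by estimated quality to suppress negative transfer):

```python
import torch
import torch.nn.functional as F

def decentralized_distillation_loss(target_logits, source_models, x, T=2.0):
    """Distill the aggregated soft predictions of decentralized source models
    into the target model; only model outputs cross domain boundaries."""
    with torch.no_grad():
        probs = [F.softmax(m(x) / T, dim=-1) for m in source_models]
        teacher = torch.stack(probs).mean(dim=0)            # aggregated knowledge
    return F.kl_div(F.log_softmax(target_logits / T, dim=-1),
                    teacher, reduction="batchmean") * T * T
```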

NeurIPS Conference 2021 · Conference Paper

Searching the Search Space of Vision Transformer

  • Minghao Chen
  • Kan Wu
  • Bolin Ni
  • Houwen Peng
  • Bei Liu
  • Jianlong Fu
  • Hongyang Chao
  • Haibin Ling

Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and has thus attracted fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines for general vision transformers with extensive analysis according to the space searching process, which could promote the understanding of vision transformers. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https://github.com/microsoft/Cream.
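
At a high level, the space search re-centers each search dimension around its best-scoring value. The sketch below abstracts the E-T Error as a callable and only illustrates the evolution loop, not the paper's procedure in detail.

```python
def evolve_search_space(space, et_error, steps=3):
    """space: {dimension: sorted list of numeric candidates}. `et_error(dim, v)`
    is assumed to score candidate v via the weight-sharing supernet (lower is better)."""
    for _ in range(steps):
        for dim, cands in space.items():
            best = min(cands, key=lambda v: et_error(dim, v))
            stride = cands[1] - cands[0] if len(cands) > 1 else 1
            space[dim] = [best - stride, best, best + stride]  # re-centered range
    return space
```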

AAAI Conference 2021 · Conference Paper

SHOT-VAE: Semi-supervised Deep Generative Models With Label-aware ELBO Approximations

  • Hao-Zhe Feng
  • Kezhi Kong
  • Minghao Chen
  • Tianye Zhang
  • Minfeng Zhu
  • Wei Chen

Semi-supervised variational autoencoders (VAEs) have obtained strong results, but have also encountered the challenge that good ELBO values do not always imply accurate inference results. In this paper, we investigate and propose two causes of this problem: (1) The ELBO objective cannot utilize the label information directly. (2) A bottleneck value exists, and continuing to optimize ELBO after this value will not improve inference accuracy. On the basis of the experimental results, we propose SHOT-VAE to address these problems without introducing additional prior knowledge. The SHOT-VAE offers two contributions: (1) A new ELBO approximation named smooth-ELBO that integrates the label predictive loss into ELBO. (2) An approximation based on optimal interpolation that breaks the ELBO value bottleneck by reducing the margin between ELBO and the data likelihood. The SHOT-VAE achieves good performance with a 25.30% error rate on CIFAR-100 with 10k labels and reduces the error rate to 6.11% on CIFAR-10 with 4k labels.
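
The smooth-ELBO contribution can be caricatured as folding a label predictive term into the objective; a schematic sketch (the paper derives a principled approximation rather than this plain weighted sum):

```python
import torch.nn.functional as F

def smooth_elbo_loss(recon_loss, kl, class_logits, labels, alpha=1.0):
    """Negative ELBO terms plus a label predictive loss on labeled samples,
    letting label information shape the objective directly."""
    return recon_loss + kl + alpha * F.cross_entropy(class_logits, labels)
```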

AAAI Conference 2020 · Conference Paper

Adversarial-Learned Loss for Domain Adaptation

  • Minghao Chen
  • Shuai Zhao
  • Haifeng Liu
  • Deng Cai

Recently, remarkable progress has been made in learning transferable representations across domains. Previous works in domain adaptation are majorly based on two techniques: domain-adversarial learning and self-training. However, domain-adversarial learning only aligns feature distributions between domains but does not consider whether the target features are discriminative. On the other hand, self-training utilizes the model predictions to enhance the discrimination of target features, but it is unable to explicitly align domain distributions. In order to combine the strengths of these two methods, we propose a novel method called Adversarial-Learned Loss for Domain Adaptation (ALDA). We first analyze the pseudo-label method, a typical self-training method. Nevertheless, there is a gap between pseudo-labels and the ground truth, which can cause incorrect training. Thus we introduce the confusion matrix, which is learned in an adversarial manner in ALDA, to reduce the gap and align the feature distributions. Finally, a new loss function is auto-constructed from the learned confusion matrix, which serves as the loss for unlabeled target samples. Our ALDA outperforms state-of-the-art approaches on four standard domain adaptation datasets. Our code is available at https://github.com/ZJULearning/ALDA.
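
A sketch of the auto-constructed loss: a learned row-stochastic confusion matrix corrects the network's pseudo-label distribution, and the corrected distribution supervises unlabeled target samples. The adversarial learning of the matrix is omitted, and the details here are my assumptions.

```python
import torch

def alda_target_loss(logits: torch.Tensor, confusion: torch.Tensor) -> torch.Tensor:
    """logits: (B, C) target predictions; confusion: (C, C) with rows summing
    to 1. Cross-entropy against the corrected pseudo-label distribution."""
    corrected = torch.softmax(logits, dim=-1) @ confusion    # (B, C) corrected targets
    return -(corrected.detach() * torch.log_softmax(logits, dim=-1)).sum(-1).mean()
```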