Author name cluster

Jieneng Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers

2 author rows

ICLR Conference 2025 Conference Paper

GenEx: Generating an Explorable World

Taiming Lu
Tianmin Shu
Alan L. Yuille
Daniel Khashabi
Jieneng Chen

Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing *GenEx*, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms expectations about the surrounding environments. *GenEx* generates high-quality, continuous 360-degree virtual environments, achieving robust loop consistency and active 3D mapping over extended trajectories. Leveraging generative imagination, GPT-assisted agents can undertake complex embodied tasks, including goal-agnostic exploration and goal-driven navigation. Agents utilize imagined observations to update their beliefs, simulate potential outcomes, and enhance their decision-making. Training on the synthetic urban dataset *GenEx-DB* and evaluation on *GenEx-EQA* demonstrate that our approach significantly improves agents' planning capabilities, providing a transformative platform toward intelligent, imaginative embodied exploration.


NeurIPS Conference 2025 Conference Paper

Vision‑Language‑Vision Auto‑Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang
Yitong Li
Yu-Cheng Chou
Jieneng Chen
Alan Yuille
Chen Wei
Junfei Xiao

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.


NeurIPS Conference 2024 Conference Paper

Efficient Large Multi-modal Models via Visual Context Compression

Jieneng Chen
Luoxin Ye
Ju He
Zhao-Yang Wang
Daniel Khashabi
Alan Yuille

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this work, we present a study of visual-token redundancy and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simple average pooling leads to only a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance. To minimize information loss caused by the compression of visual tokens while maintaining training efficiency, we develop LLaVolta as a light and staged training scheme that incorporates stage-wise visual context compression to progressively compress the visual tokens from heavy to light compression during training, yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs and improving inference efficiency.
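The average-pooling experiment described in this abstract can be illustrated with a minimal sketch. It assumes visual tokens are plain lists of floats and that consecutive tokens are merged in fixed-size groups; the function name `average_pool_tokens` and the `stride` parameter are hypothetical and are not taken from the paper's code. A stride of 4 is used purely for illustration (it removes 75% of the tokens, close to but not the same as the 70% figure quoted above).

```python
from typing import List

Token = List[float]


def average_pool_tokens(tokens: List[Token], stride: int) -> List[Token]:
    """Merge each run of `stride` consecutive tokens into their element-wise mean.

    Hypothetical helper: the grouping scheme is an assumption, not the paper's method.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled: List[Token] = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]  # last group may be shorter
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


# Eight toy 2-dimensional "visual tokens".
tokens = [[float(i), float(2 * i)] for i in range(8)]
pooled = average_pool_tokens(tokens, stride=4)
print(len(tokens), "tokens ->", len(pooled), "tokens")
```

A staged schedule of the kind the abstract attributes to LLaVolta could, under the same assumptions, be approximated by calling this helper with a large stride in early training stages and a stride of 1 (no compression) in the final stage.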


NeurIPS Conference 2024 Conference Paper

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

Pedro R. Bassi
Wenxuan Li
Yucheng Tang
Fabian Isensee
Zifu Wang
Jieneng Chen
Yu-Cheng Chou
Saikat Roy

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks (which, differing from algorithms, are more flexible and can support different algorithms), including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.


AAAI Conference 2021 Conference Paper

Semi-supervised Medical Image Segmentation through Dual-task Consistency

Xiangde Luo
Jieneng Chen
Tao Song
Guotai Wang

Deep learning-based semi-supervised learning (SSL) algorithms have led to promising results in medical image segmentation and can alleviate doctors’ expensive annotations by leveraging unlabeled data. However, most of the existing SSL algorithms in the literature tend to regularize the model training by perturbing networks and/or data. Observing that multi/dual-task learning attends to various levels of information which have inherent prediction perturbation, we ask the question in this work: can we explicitly build task-level regularization rather than implicitly constructing network- and/or data-level perturbation and then regularization for SSL? To answer this question, we propose a novel dual-task consistency semi-supervised framework for the first time. Concretely, we use a dual-task deep network that jointly predicts a pixel-wise segmentation map and a geometry-aware level set representation of the target. The level set representation is converted to an approximated segmentation map through a differentiable task transform layer. Simultaneously, we introduce a dual-task consistency regularization between the level set-derived segmentation maps and directly predicted segmentation maps for both labeled and unlabeled data. Extensive experiments on two public datasets show that our method can largely improve the performance by incorporating the unlabeled data. Meanwhile, our framework outperforms the state-of-the-art semi-supervised learning methods. Code is available at: https://github.com/HiLab-git/DTC
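The dual-task consistency idea in this abstract can be sketched roughly as follows. The sketch assumes a signed level set that is negative inside the target and positive outside, a steep sigmoid as the differentiable transform, and a mean squared difference as the consistency term. The sign convention, the steepness `k`, and all function names are assumptions made for illustration; the abstract does not specify them, and the linked repository should be consulted for the actual implementation.

```python
import math
from typing import List


def stable_sigmoid(x: float) -> float:
    """Sigmoid that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def level_set_to_mask(level_set: List[float], k: float = 1500.0) -> List[float]:
    """Approximate segmentation map from a level set (assumed negative inside the target).

    `k` controls how sharply the smooth step approximates a hard threshold (hypothetical value).
    """
    return [stable_sigmoid(-k * z) for z in level_set]


def dual_task_consistency(seg_probs: List[float], level_set: List[float],
                          k: float = 1500.0) -> float:
    """Mean squared difference between the direct and the level-set-derived maps."""
    derived = level_set_to_mask(level_set, k)
    return sum((p - q) ** 2 for p, q in zip(seg_probs, derived)) / len(seg_probs)


# Three toy pixels: inside the target, on its boundary, outside it.
level_set = [-0.5, 0.0, 0.5]
agreeing = dual_task_consistency([1.0, 0.5, 0.0], level_set)
disagreeing = dual_task_consistency([0.0, 0.5, 1.0], level_set)
print(agreeing, disagreeing)
```

Because the term compares two predictions of the same network and needs no ground-truth label, it can be evaluated on unlabeled scans as well as labeled ones, which matches the abstract's description of applying the regularization to both.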


JBHI Journal 2020 Journal Article

Coarse-to-Fine Adversarial Networks and Zone-Based Uncertainty Analysis for NK/T-Cell Lymphoma Segmentation in CT/PET Images

Xiaobin Hu
Rui Guo
Jieneng Chen
Hongwei Li
Diana Waldmannstetter
Yu Zhao
Biao Li
Kuangyu Shi

Extranodal natural killer/T cell lymphoma (ENKL), nasal type, is a rare disease with a low survival rate that primarily affects Asian and South American populations. Segmentation of ENKL lesions is crucial for clinical decision support and treatment planning. This paper is the first study on computer-aided diagnosis systems for the ENKL segmentation problem. We propose an automatic, coarse-to-fine approach for ENKL segmentation using adversarial networks. In the coarse stage, we extract the region of interest bounding the lesions utilizing a segmentation neural network. In the fine stage, we use an adversarial segmentation network and further introduce a multi-scale L1 loss function to drive the network to learn both global and local features. The generator and discriminator are alternately trained by backpropagation in an adversarial fashion in a min-max game. Furthermore, we present the first exploration of zone-based uncertainty estimates based on the Monte Carlo dropout technique in the context of deep networks for medical image segmentation. Specifically, we propose uncertainty criteria based on the lesion and the background, and then linearly normalize them to a specific interval. This is not only a crucial criterion for evaluating the superiority of the algorithm, but also permits subsequent optimization by engineers and revision by clinicians after quantitatively understanding whether the main source of uncertainty is the background or the lesion zone. Experimental results demonstrate that the proposed method is more effective and lesion-zone stable than state-of-the-art deep-learning-based segmentation models.
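A rough sketch of zone-based uncertainty from Monte Carlo dropout is given below. It assumes the stochastic forward passes (dropout kept active at inference) have already been run and their per-pixel lesion probabilities collected; uncertainty is taken as the per-pixel variance across passes, zones are split by thresholding the mean prediction, and normalization is a plain linear rescaling. These choices and all names are assumptions for illustration, since the abstract does not give the paper's exact criteria.

```python
from statistics import mean, pvariance
from typing import Dict, List


def pixel_uncertainty(samples: List[List[float]]) -> List[float]:
    """Per-pixel variance across T stochastic passes (each pass is a list of N probabilities)."""
    return [pvariance(pixel) for pixel in zip(*samples)]


def zone_uncertainty(samples: List[List[float]], threshold: float = 0.5) -> Dict[str, float]:
    """Average uncertainty over the predicted lesion zone and the background zone."""
    mean_pred = [mean(pixel) for pixel in zip(*samples)]
    var = pixel_uncertainty(samples)
    lesion = [v for m, v in zip(mean_pred, var) if m >= threshold]
    background = [v for m, v in zip(mean_pred, var) if m < threshold]
    return {
        "lesion": mean(lesion) if lesion else 0.0,
        "background": mean(background) if background else 0.0,
    }


def linear_normalize(value: float, lo: float, hi: float,
                     new_lo: float = 0.0, new_hi: float = 1.0) -> float:
    """Linearly map `value` from [lo, hi] to [new_lo, new_hi]."""
    if hi == lo:
        raise ValueError("hi and lo must differ")
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)


# Two toy passes over two pixels: the first pixel is a lesion with unstable predictions.
samples = [[0.9, 0.1], [0.7, 0.1]]
zones = zone_uncertainty(samples)
print(zones)
```

Comparing the two zone values indicates whether the model's instability is concentrated in the lesion or in the background, which is the kind of diagnosis the abstract says engineers and clinicians can act on.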
