Arrow Research Search

Author name cluster

Jingdong Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

AAAI 2026 · Conference Paper

HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses Through Reasoning MLLMs

  • Zheng Qin
  • Ruobing Zheng
  • Yabing Wang
  • Tianqi Li
  • Yi Yuan
  • Jingdong Chen
  • Le Wang

While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly on advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and omni-modal models show advantages on these tasks. Furthermore, grounded in the observation that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, we posit that reasoning ability is the key to unlocking it. We devise a multi-stage, modality-progressive reinforcement learning approach, resulting in HumanSense-Omni-Reasoning, which substantially enhances performance on higher-level understanding and interactive tasks. Additionally, we observe that successful reasoning processes appear to exhibit consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner.

AAAI 2026 · Conference Paper

SCAN: Self-Calibrated AutoregressioN for High-Quality Visual Generation

  • Zhanzhou Feng
  • Qingpei Guo
  • Jingdong Chen
  • Feng Gao
  • Ming Yang
  • Shiliang Zhang

Human artists can continuously refine their coarse sketches during artistic creation. This is quite different from existing autoregressive generation, where a token is determined once sampled. Aiming to flexibly refine the generated contents, this paper presents a Self-Calibrated AutoregressioN (SCAN) model capable of self-evaluating and refining generation quality without regenerating the entire image. We unify image token generation and quality evaluation into a single autoregressive model, formulating both tasks as categorical prediction problems. During inference, the model first generates a coarse initial image, then iteratively refines the lowest-quality patches until satisfactory image quality is achieved. Experimental results demonstrate that SCAN effectively handles diverse real-world generation errors and achieves a promising balance between image quality and speed. For example, SCAN-XL achieves an FID of 2.10 and an IS of 326.1, surpassing LlamaGen-XL by 1.29 (+38%) in FID and 99.0 (+43.6%) in IS, with a 5.6× speedup (19.76s to 3.56s). Compared to recent works, SCAN improves FID and speed by +18.3% and +23% over VAR-d20, and by +7% and +46% over RandAR-XL.
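
The refinement loop is easy to picture in code. Below is a minimal sketch of the self-calibration idea as described above; `model`, `sample`, the quality cutoff, and all sizes are hypothetical stand-ins, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PATCHES, VOCAB, QUALITY_BINS = 256, 1024, 10

def model(tokens):
    # Stand-in for the unified AR model: token logits plus a categorical
    # quality bin (0 = worst) for every patch, predicted by one network.
    return rng.normal(size=(NUM_PATCHES, VOCAB)), rng.integers(0, QUALITY_BINS, NUM_PATCHES)

def sample(logits):
    # Categorical sampling over the token vocabulary, one draw per patch.
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return np.array([rng.choice(VOCAB, p=row) for row in p])

logits, _ = model(None)
tokens = sample(logits)                       # 1) coarse initial image

for _ in range(8):                            # 2) self-calibration loop
    logits, quality = model(tokens)
    bad = np.where(quality < 3)[0]            # hypothetical "low quality" cutoff
    if bad.size == 0:                         # stop once every patch passes
        break
    tokens[bad] = sample(logits[bad])         # resample only the weak patches
```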

AAAI 2026 · Conference Paper

UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

  • Xinyang Song
  • Libin Wang
  • Weining Wang
  • Shaozhen Liu
  • DanDan Zheng
  • Jingdong Chen
  • Qi Li
  • Zhenan Sun

The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.
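
As a rough illustration of the dual-stream training objective, the sketch below combines a denoising loss with intrinsic-modal and cross-modal alignment terms; the pooled features, cosine form, and loss weights are assumptions, not the paper's definitions:

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Row-wise cosine similarity between two batches of feature vectors.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a * b).sum(-1)

rng = np.random.default_rng(0)
# Hypothetical pooled features from the two streams of the diffusion transformer.
img_feat, txt_feat = rng.normal(size=(4, 512)), rng.normal(size=(4, 512))
img_feat_aug = img_feat + 0.01 * rng.normal(size=(4, 512))  # second intrinsic view

diffusion_loss = 0.42                                        # stand-in for the denoising loss
intra_align = 1.0 - cosine(img_feat, img_feat_aug).mean()    # intrinsic-modal term
cross_align = 1.0 - cosine(img_feat, txt_feat).mean()        # cross-modal term
total = diffusion_loss + 0.5 * intra_align + 0.5 * cross_align  # weights are guesses
```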

ICLR 2025 · Conference Paper

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

  • Shuai Tan
  • Biao Gong
  • Xiang Wang 0012
  • Shiwei Zhang 0001
  • Dandan Zheng
  • Ruobing Zheng
  • Kecheng Zheng
  • Jingdong Chen

Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures and usually do not generalize well to the anthropomorphic characters commonly used in industries like gaming and entertainment. Our in-depth analysis attributes this limitation to insufficient modeling of motion: these methods cannot comprehend the movement pattern of the driving video and thus impose the pose sequence rigidly onto the target character. To this end, this paper proposes $\texttt{Animate-X}$, a universal animation framework based on LDM for various character types (collectively named $\texttt{X}$), including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both an implicit and an explicit manner. The former leverages CLIP visual features of a driving video to extract its gist of motion, such as the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of the LDM by simulating in advance possible inputs that may arise during inference. Moreover, we introduce a new Animated Anthropomorphic Benchmark ($\texttt{$A^2$Bench}$) to evaluate the performance of $\texttt{Animate-X}$ on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of $\texttt{Animate-X}$ compared to state-of-the-art methods.
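
A schematic of the two branches of the Pose Indicator might look like the following; the feature sizes, augmentation ranges, and pooling are illustrative guesses, not the released design:

```python
import numpy as np

rng = np.random.default_rng(0)
T, J = 16, 17                                  # frames, keypoints (hypothetical sizes)
pose = rng.uniform(0, 1, (T, J, 2))            # driving pose sequence, normalized xy
clip_feats = rng.normal(size=(T, 768))         # per-frame CLIP features of the video

# Implicit branch: pool CLIP features into a "gist of motion" embedding.
motion_gist = clip_feats.mean(axis=0)

# Explicit branch: simulate the pose/character misalignment the model may face
# at inference by rescaling and offsetting the keypoints during training.
scale = rng.uniform(0.8, 1.2)
offset = rng.uniform(-0.05, 0.05, size=2)
pose_aug = np.clip(pose * scale + offset, 0.0, 1.0)

pose_indicator = (motion_gist, pose_aug)       # fed to the LDM as conditioning
```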

NeurIPS 2025 · Conference Paper

ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

  • Xiaolong Wang
  • Lixiang Ru
  • Ziyuan Huang
  • Kaixiang Ji
  • DanDan Zheng
  • Jingdong Chen
  • Jun Zhou

We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary-point representations or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We leverage the MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate the required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
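
The next-scale-prediction decoding loop can be sketched as follows; `predict_scale`, `vqvae_decode`, and the scale schedule are hypothetical placeholders for the MLLM and the universal VQ-VAE:

```python
import numpy as np

rng = np.random.default_rng(0)
scales = [1, 2, 4, 8, 16]          # token-map side lengths, coarse to fine

def predict_scale(context, side):
    # Stand-in for the MLLM: emits a side x side token map in one parallel step
    # (all tokens of a scale at once), conditioned on the coarser scales.
    return rng.integers(0, 4096, size=(side, side))

def vqvae_decode(token_map):
    # Stand-in for the universal VQ-VAE decoder: tokens -> dense mask logits.
    return (token_map % 2).astype(float)

context = []
for side in scales:                # next-scale prediction: few sequential steps,
    context.append(predict_scale(context, side))  # each fully parallel inside
mask = vqvae_decode(context[-1]) > 0.5            # final binary segmentation mask
```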

AAAI 2025 · Conference Paper

HomoMatcher: Achieving Dense Feature Matching with Semi-Dense Efficiency by Homography Estimation

  • Xiaolong Wang
  • Lei Yu
  • Yingying Zhang
  • Jiangwei Lao
  • Lixiang Ru
  • Liheng Zhong
  • Jingdong Chen
  • Yu Zhang

Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semi-dense matching approaches have achieved substantial performance enhancements and established a widely accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.
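
The patch-to-patch refinement step reduces to applying an estimated 3x3 homography to points inside a coarse-matched patch; the sketch below shows that projection (the matrix is made up, standing in for the output of the lightweight homography network):

```python
import numpy as np

def warp(H, pts):
    # Apply a 3x3 homography to Nx2 points (homogeneous coordinates + divide).
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

# Hypothetical homography for one coarse match: maps coordinates in the
# source patch onto the target patch.
H = np.array([[ 1.02, 0.01,  0.35],
              [-0.01, 0.99, -0.20],
              [ 0.00, 0.00,  1.00]])

# Coarse matching aligns patch centers only; warping through H refines every
# feature point inside the patch to a sub-pixel position in the other image,
# yielding dense correspondences at semi-dense cost.
src_pts = np.array([[4.0, 4.0], [2.5, 6.0]])   # points in the 8x8 source patch
dst_pts = warp(H, src_pts)                      # refined, sub-pixel matches
```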

NeurIPS 2025 · Conference Paper

VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

  • Qianqian Qiao
  • DanDan Zheng
  • Yihang Bo
  • Bao Peng
  • Heng Huang
  • Longteng Jiang
  • Jingdong Chen
  • Jun Zhou

Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database, with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments, and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.

NeurIPS 2025 · Conference Paper

VideoMAR: Autoregressive Video Generation with Continuous Tokens

  • Hu Yu
  • Biao Gong
  • Hangjie Yuan
  • DanDan Zheng
  • Weilong Chai
  • Jingdong Chen
  • Kecheng Zheng
  • Feng Zhao

Masked autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose \textbf{VideoMAR}, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, combining temporal frame-by-frame generation with spatial masked generation. We first identify temporal causality and spatial bi-directionality as first principles of video AR models, and propose the next-frame diffusion loss to integrate masked and video generation. Besides, the huge cost and difficulty of long-sequence autoregressive modeling pose a basic but crucial challenge. To this end, we propose temporal short-to-long curriculum learning and spatial progressive-resolution training, and employ a progressive temperature strategy at inference time to mitigate error accumulation. Furthermore, VideoMAR brings several unique capacities of language models to video generation. It is inherently efficient thanks to a simultaneous temporal KV cache and spatial parallel generation, and supports spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters ($9.3\%$), training data ($0.5\%$), and GPU resources ($0.2\%$).
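
The generation procedure (temporally causal across frames, masked and parallel within each frame) can be sketched as below; `diffusion_sample`, the mask schedule, and the temperature decay are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
FRAMES, TOKENS_PER_FRAME, MASK_STEPS = 8, 64, 4

def diffusion_sample(cond, idx, temperature):
    # Stand-in for the per-token diffusion head over continuous tokens.
    return rng.normal(scale=temperature, size=(len(idx), 16))

kv_cache, video = [], []
for f in range(FRAMES):                      # temporal: strictly causal, frame by frame
    frame = np.zeros((TOKENS_PER_FRAME, 16))
    known = np.zeros(TOKENS_PER_FRAME, bool)
    temperature = 1.0 - 0.08 * f             # progressive temperature vs. drift (guess)
    for _ in range(MASK_STEPS):              # spatial: bi-directional masked decoding
        idx = np.where(~known)[0][: TOKENS_PER_FRAME // MASK_STEPS]
        frame[idx] = diffusion_sample((kv_cache, frame), idx, temperature)
        known[idx] = True                    # unmask one parallel chunk per step
    kv_cache.append(frame)                   # frame-level KV cache reused later
    video.append(frame)
```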

TMLR 2025 · Journal Article

ViTime: Foundation Model for Time Series Forecasting Powered by Vision Intelligence

  • Luoxiao Yang
  • Yun Wang
  • Xinqi Fan
  • Israel Cohen
  • Jingdong Chen
  • Zijun Zhang

Time series forecasting (TSF) possesses great practical value in various fields, including power and energy, transportation, etc. TSF methods have been studied with knowledge ranging from classical statistics to modern deep learning. Yet all of them were developed on one fundamental concept: numerical data fitting. Thus, the resulting models have long been known for being problem-specific and lacking application generalizability. Practitioners expect a TSF foundation model that serves TSF tasks in different applications. The central question is then how to develop such a TSF foundation model. This paper offers one pioneering study in the TSF foundation model development method and proposes a vision intelligence-powered framework, ViTime, for the first time. ViTime fundamentally shifts TSF from numerical fitting to operations based on a binary image-based time series metric space and naturally supports both point and probabilistic forecasting. We also provide rigorous theoretical analyses of ViTime, including quantization-induced system error bounds and principled strategies for optimal parameter selection. Furthermore, we propose RealTS, an innovative synthesis algorithm generating diverse and realistic training samples, effectively enriching the training data and significantly enhancing model generalizability. Extensive experiments demonstrate ViTime's SOTA performance. In zero-shot scenarios, ViTime outperforms TimesFM by 9-15%. With just 10% fine-tuning data, ViTime surpasses both leading foundation models and fully-supervised benchmarks, a gap that widens with 100% fine-tuning. ViTime also exhibits exceptional robustness, effectively handling missing data and outperforming TimesFM by 20-30% under various data perturbations, validating the power of its visual-space data operation paradigm.
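
The core shift from numerical fitting to a binary image-based metric space can be illustrated by rasterizing a series into pixels; the resolution and rounding below are arbitrary choices, and the round-trip rounding is exactly the kind of quantization error the paper's analysis bounds:

```python
import numpy as np

def series_to_binary_image(x, height=64):
    # Rasterize a 1-D series into a height x len(x) binary image: one "on"
    # pixel per time step, at the row given by the quantized value.
    lo, hi = x.min(), x.max()
    rows = np.round((x - lo) / (hi - lo + 1e-12) * (height - 1)).astype(int)
    img = np.zeros((height, len(x)), dtype=np.uint8)
    img[height - 1 - rows, np.arange(len(x))] = 1   # larger values sit higher
    return img

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 128)
img = series_to_binary_image(np.sin(t) + 0.1 * rng.normal(size=128))
# Forecasting now amounts to completing the right side of the image; mapping a
# predicted image back to numbers re-introduces the quantization error that the
# theoretical analysis bounds.
```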

NeurIPS 2024 · Conference Paper

Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

  • Ziyuan Huang
  • Kaixiang Ji
  • Biao Gong
  • Zhiwu Qing
  • Qinglong Zhang
  • Kecheng Zheng
  • Jian Wang
  • Jingdong Chen

This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by $\sim$73\%. Empirical results on a series of vision-language benchmarks reveal that the pre-training acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches across a series of benchmarks.
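
The compound token-scaling idea can be sketched numerically: few resampler queries per spatial scale during pre-training, many more at fine-tuning, with the bridge architecture unchanged. Query counts and feature sizes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(feats, n_queries):
    # Stand-in for one visual resampler: compress a feature map to n_queries tokens.
    return rng.normal(size=(n_queries, 1024))

feat_maps = {32: None, 16: None, 8: None}      # multi-scale ViT features (placeholders)

# Pre-training: few queries per scale -> a short visual sequence, fast training.
pretrain_tokens = np.concatenate([resample(f, 8) for f in feat_maps.values()])

# Fine-tuning: compound scaling multiplies the queries per scale (16x total here)
# without touching the bridge architecture.
finetune_tokens = np.concatenate([resample(f, 128) for f in feat_maps.values()])
print(pretrain_tokens.shape[0], "->", finetune_tokens.shape[0])   # 24 -> 384 tokens
```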

ICLR 2024 · Conference Paper

LogicMP: A Neuro-symbolic Approach for Encoding First-order Logic Constraints

  • Weidi Xu
  • Jingwei Wang
  • Lele Xie
  • Jianshan He
  • Hongting Zhou
  • Taifeng Wang
  • Xiaopei Wan
  • Jingdong Chen

Integrating first-order logic constraints (FOLCs) with neural networks is a crucial but challenging problem since it involves modeling intricate correlations to satisfy the constraints. This paper proposes a novel neural layer, LogicMP, which performs mean-field variational inference over a Markov Logic Network (MLN). It can be plugged into any off-the-shelf neural network to encode FOLCs while retaining modularity and efficiency. By exploiting the structure and symmetries in MLNs, we theoretically demonstrate that our well-designed, efficient mean-field iterations greatly mitigate the difficulty of MLN inference, reducing the inference from sequential calculation to a series of parallel tensor operations. Empirical results in three kinds of tasks over images, graphs, and text show that LogicMP outperforms advanced competitors in both performance and efficiency.
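
A toy version of the mean-field iteration over one MLN rule, smokes(x) ∧ friends(x, y) ⇒ smokes(y), shows how the update becomes a batched tensor operation rather than a sequential sweep over groundings; the rule, weight, and evidence are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, w = 5, 1.5                              # entities, rule weight (hypothetical)
friends = rng.random((N, N)) < 0.3         # observed friends(x, y) relation
logits = rng.normal(size=N)                # neural-network evidence for smokes(x)

# Mean-field over the rule: each iteration aggregates the expected support
# from all groundings at once as a single matrix-vector product.
q = sigmoid(logits)                        # initial marginals q(smokes(x))
for _ in range(10):
    msg = w * (friends.T @ q)              # parallel message from all neighbors
    q = sigmoid(logits + msg)              # updated marginals q(smokes(y))
```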

AAAI 2022 · Conference Paper

CMUA-Watermark: A Cross-Model Universal Adversarial Watermark for Combating Deepfakes

  • Hao Huang
  • Yongtao Wang
  • Zhaoyu Chen
  • Yuze Zhang
  • Yuheng Li
  • Zhi Tang
  • Wei Chu
  • Jingdong Chen

Malicious applications of deepfakes (i.e., technologies generating target facial attributes or entire faces from facial images) have posed a huge threat to individuals' reputation and security. To mitigate these threats, recent studies have proposed adversarial watermarks to combat deepfake models, leading them to generate distorted outputs. Despite achieving impressive results, these adversarial watermarks have low image-level and model-level transferability, meaning that they can protect only one facial image from one specific deepfake model. To address these issues, we propose a novel solution that can generate a Cross-Model Universal Adversarial Watermark (CMUA-Watermark), protecting a large number of facial images from multiple deepfake models. Specifically, we begin by proposing a cross-model universal attack pipeline that attacks multiple deepfake models iteratively. Then, we design a two-level perturbation fusion strategy to alleviate the conflict between the adversarial watermarks generated by different facial images and models. Moreover, we address the key problem in cross-model optimization with a heuristic approach to automatically find the suitable attack step sizes for different models, further weakening the model-level conflict. Finally, we introduce a more reasonable and comprehensive evaluation method to fully test the proposed method and compare it with existing ones. Extensive experimental results demonstrate that the proposed CMUA-Watermark can effectively distort the fake facial images generated by multiple deepfake models while achieving better performance than existing methods. Our code is available at https://github.com/VDIGPKU/CMUA-Watermark.
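
The cross-model attack pipeline can be sketched as a PGD-style loop over models with per-model step sizes and an averaging fusion over images; `attack_grad`, the budget, and the step sizes are placeholders, not the paper's tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 8 / 255                                    # perturbation budget (assumed)

def attack_grad(model_id, image, wm):
    # Stand-in for the sign of the gradient of deepfake model `model_id`'s
    # output-distortion loss w.r.t. the watermark (a PGD-style direction).
    return np.sign(rng.normal(size=image.shape))

step_sizes = {0: 0.8 / 255, 1: 1.2 / 255, 2: 0.5 / 255}  # per-model, found heuristically
images = rng.random((16, 3, 32, 32))             # faces to protect
wm = np.zeros((3, 32, 32))                       # one universal watermark

for epoch in range(5):
    for m, alpha in step_sizes.items():          # model level: iterate deepfake models
        grads = np.stack([attack_grad(m, x + wm, wm) for x in images])
        wm = wm + alpha * grads.mean(axis=0)     # image level: fuse across faces
        wm = np.clip(wm, -EPS, EPS)              # keep the watermark imperceptible
```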

IROS 2021 · Conference Paper

A Conceptual Approach of Passive Human-Intention-Orientated Variable Admittance Control using Power Envelope

  • Jingdong Chen
  • Paul I. Ro

Two main challenges in physical human-robot interaction (pHRI) are efficient recognition of human intention and interaction safety. In this paper, a general human-intention framework is first summarized according to the robot's role: a passive follower or a compliant leader. Second, we propose variable admittance control models governed by human intention. Power-envelope approaches are then proposed that constrain the variable admittance parameters inferred from human intention so as to maintain passivity conservatively. Our passivity-preserving approaches were validated via simulation and shown to avoid mismatches of the time-varying admittance parameters, restraining the drastic changes in admittance-controller dynamics that usually result in instability. Finally, the relationship between the robot's passivity and its stability when interacting with a human is analyzed.
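
A crude sketch of power-envelope-constrained variable admittance is given below: damping is inferred from a hypothetical intention rule, and an energy-balance check falls back to conservative damping whenever passivity would be lost. All constants and the intention rule are illustrative assumptions:

```python
M, dt, D_MAX = 2.0, 0.001, 30.0    # virtual mass [kg], control period [s], safe damping
v, E = 0.0, 0.0                    # robot velocity, running energy balance

def intended_damping(force, velocity):
    # Hypothetical intention rule: low damping (easy to move) when the human
    # pushes along the current motion, high damping otherwise.
    return 5.0 if force * velocity > 0 else D_MAX

for f_ext in [4.0] * 500 + [-4.0] * 500:      # synthetic human force profile
    D = intended_damping(f_ext, v)
    # Crude stand-in for the power envelope: compare dissipated power D*v^2
    # with the power injected by the human f*v, and fall back to conservative
    # damping whenever the running balance would go negative (lost passivity).
    E += (D * v * v - f_ext * v) * dt
    if E < 0.0:
        D, E = D_MAX, 0.0
    v += (f_ext - D * v) / M * dt             # admittance law: M*dv/dt = f - D*v
```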

IJCAI 2021 · Conference Paper

MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction

  • Guozhi Tang
  • Lele Xie
  • Lianwen Jin
  • Jiapeng Wang
  • Jingdong Chen
  • Zhen Xu
  • Qianying Wang
  • Yaqiang Wu

The Visual Information Extraction (VIE) task aims to extract key information from multifarious document images (e.g., invoices and purchase receipts). Most previous methods treat the VIE task simply as a sequence labeling or classification problem, which requires models to carefully identify each kind of semantics by introducing multimodal features such as font, color, and layout. But simply introducing multimodal features does not work well for numeric semantic categories or ambiguous texts. To address this issue, in this paper we propose a novel key-value matching model based on a graph neural network for VIE (MatchVIE). Through key-value matching based on relevancy evaluation, the proposed MatchVIE can bypass recognizing the various semantics and simply focus on the strong relevancy between entities. Besides, we introduce a simple but effective operation, Num2Vec, to tackle the instability of encoded values, which helps the model converge more smoothly. Comprehensive experiments demonstrate that the proposed MatchVIE significantly outperforms previous methods. Notably, to the best of our knowledge, MatchVIE may be the first attempt to tackle the VIE task by modeling the relevancy between keys and values, and it is a good complement to existing methods.
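
The abstract does not define Num2Vec, so the sketch below is only a guess at the idea: encode numeric strings digit by digit into a fixed-length vector so that similar amounts map to nearby codes instead of one unstable embedding per unique value:

```python
import numpy as np

def num2vec(text, max_len=12):
    # Hedged guess at Num2Vec: position-wise, magnitude-preserving encoding
    # of the digits in a numeric string (punctuation ignored).
    vec = np.zeros(max_len)
    digits = [c for c in text if c.isdigit()][:max_len]
    for i, d in enumerate(digits):
        vec[i] = (int(d) + 1) / 10.0
    return vec

print(np.abs(num2vec("1,280.00") - num2vec("1,280.01")).sum())  # small distance
print(np.abs(num2vec("1,280.00") - num2vec("9,999.99")).sum())  # larger distance
```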

ICML 2016 · Conference Paper

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

  • Dario Amodei
  • Sundaram Ananthanarayanan
  • Rishita Anubhai
  • Jingliang Bai
  • Eric Battenberg
  • Carl Case
  • Jared Casper
  • Bryan Catanzaro

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a wide variety of speech, including noisy environments, accents, and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
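
The Batch Dispatch idea (briefly accumulate incoming requests, then serve them with one batched GPU pass) can be sketched as below; the queueing policy, batch cap, and wait budget are assumptions, not Baidu's deployed system:

```python
import queue, threading, time

requests = queue.Queue()
MAX_BATCH, MAX_WAIT = 8, 0.01     # batch cap and latency budget (hypothetical)

def gpu_infer(batch):
    time.sleep(0.005)             # stand-in for one batched forward pass
    return [f"transcript-{i}" for i, _ in enumerate(batch)]

def dispatcher():
    while True:
        batch = [requests.get()]                 # block for the first request
        t0 = time.time()
        while len(batch) < MAX_BATCH and time.time() - t0 < MAX_WAIT:
            try:                                 # greedily fill the batch until
                batch.append(requests.get(timeout=MAX_WAIT))  # the wait budget ends
            except queue.Empty:
                break
        gpu_infer(batch)                         # one GPU launch serves the batch

threading.Thread(target=dispatcher, daemon=True).start()
for i in range(20):
    requests.put(f"audio-{i}")
time.sleep(0.2)
```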

NeurIPS 2007 · Conference Paper

Blind channel identification for speech dereverberation using l1-norm sparse learning

  • Yuanqing Lin
  • Jingdong Chen
  • Youngmoo Kim
  • Daniel Lee

Speech dereverberation remains an open problem after more than three decades of research. The most challenging step in speech dereverberation is blind channel identification (BCI). Although many BCI approaches have been developed, their performance is still far from satisfactory for practical applications. The main difficulty in BCI lies in finding an appropriate acoustic model, which not only can effectively resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments. This paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. Under this model, we show how to formulate the BCI of a single-input multiple-output (SIMO) system into an l1-norm regularized least squares (LS) problem, which is convex and can be solved efficiently with guaranteed global convergence. The sparseness of solutions is controlled by l1-norm regularization parameters. We propose a sparse learning scheme that infers the optimal l1-norm regularization parameters directly from microphone observations under a Bayesian framework. Our results show that the proposed approach is effective and robust, and it yields source estimates in real acoustic environments with high fidelity to anechoic chamber measurements.
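
The l1-regularized cross-relation formulation can be reproduced in a few lines: stack convolution matrices so that x1*h2 - x2*h1 = 0 becomes A h = 0, then run a proximal-gradient (ISTA) loop with soft thresholding. The regularization weight is fixed here, whereas the paper learns it under a Bayesian framework; the renormalization step is a crude way to sidestep the trivial zero solution, which the paper handles more carefully:

```python
import numpy as np

def conv_matrix(x, L):
    # Convolution matrix: conv_matrix(x, L) @ h == np.convolve(x, h) for len(h) == L.
    C = np.zeros((len(x) + L - 1, L))
    for j in range(L):
        C[j:j + len(x), j] = x
    return C

rng = np.random.default_rng(0)
L, N = 16, 400
h1 = np.zeros(L); h1[[0, 3, 9]] = [1.0, 0.5, -0.3]     # sparse ground-truth RIRs
h2 = np.zeros(L); h2[[1, 5, 12]] = [0.9, -0.4, 0.2]
s = rng.normal(size=N)                                  # unknown speech source
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)         # two microphone signals

# Cross-relation: conv(x1, h2) == conv(x2, h1), i.e. A @ [h1; h2] == 0.
A = np.hstack([-conv_matrix(x2, L), conv_matrix(x1, L)])
lam, step = 0.1, 1.0 / np.linalg.norm(A, 2) ** 2        # lam fixed here, learned in the paper
h = rng.normal(size=2 * L) * 0.01
for _ in range(2000):
    z = h - step * (A.T @ (A @ h))                      # gradient step on ||A h||^2
    h = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # l1 soft threshold
    h /= np.linalg.norm(h) + 1e-12                      # avoid the trivial zero solution
```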