Arrow Research search

Author name cluster

Yi Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers
2 author rows

Possible papers

EAAI Journal 2026 Journal Article

Digital twin driven fault diagnosis method for early faults of hydraulic system

  • Chao Yang
  • Baoping Cai
  • Zhigang Tian
  • Qingping Li
  • Yinhang Zhang
  • Yi Jiang
  • Haidong Shao

Identifying early faults can ensure the safety of hydraulic systems. This is needed for industrial production. Obtaining sensor data and conducting a single diagnosis is the main fault diagnosis method. This method is prone to overlooking early faults. To address that, a digital twin (DT)-driven fault diagnosis method is proposed for detection of early faults in hydraulic systems. A fault diagnosis model capable of identifying faults at varying degrees of severity is established. An iterative updating mechanism is employed to construct a DT model of the hydraulic system. A feedback-based verification process is integrated into the DT framework. It is used to correct incorrect diagnoses. Misclassified faults are fed back into the system for re-diagnosis, and a fault degree matching mechanism is implemented to confirm fault severity. The method is validated using a laboratory hydraulic system. Experimental results demonstrate that the method improves the accuracy of early fault detection.

YNIMG Journal 2026 Journal Article

Distinct neurodynamics underlie empathy for infant pain: An EEG study of temporal and oscillatory mechanisms

  • Chunyan Liu
  • Yi Jiang

Protecting infants from harm is widely considered a fundamental evolutionary imperative and a cross-cultural universal. While adults exhibit heightened empathic responses to infant pain, the underlying neurocognitive dynamics remain unclear. Using EEG during a pain empathy paradigm, we identified distinct neural responses to infant pain compared to adult pain. Relative to adult pain-neutral condition, infant pain-neutral condition elicited a larger P3 amplitude, suggesting enhanced cognitive empathy. In the oscillatory domain, infant pain (versus infant-neutral) induced enhanced alpha power and greater adaptive modulation of alpha and low beta (15–18 Hz) rhythms. Conversely, adult pain (versus adult-neutral) was associated with stronger suppression of low-alpha (8–10 Hz) activity and reduced adaptive modulation. Furthermore, empathy for infant pain engaged increased posterior-to-anterior information flow, suggesting heightened integration across affective and cognitive networks. These findings collectively suggest that the increased alpha power may reflect rapid threat detection and top-down modulation, while the enhanced adaptive changes signify efficient response optimization during infant pain empathy. Our results are consistent with the model of the parental brain as an evolutionary product that balances conserved subcortical responses with flexible cortical regulation, pointing toward a unique neurophysiological profile supporting the protection of vulnerable offspring.

AAAI Conference 2026 Conference Paper

FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

  • Shilong Zhang
  • Wenbo Li
  • Shoufa Chen
  • Chongjian Ge
  • Peize Sun
  • Yifu Zhang
  • Yi Jiang
  • Zehuan Yuan

DiT models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands—especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage achieves a nearly straight ODE trajectory between low and high resolutions via flow matching, effectively generating fine details and fixing artifacts with minimal NFEs. To ensure a seamless connection between the two independently trained stages during inference, we carefully design degradation strategies during the second-stage training. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and accordingly adjust the prompt before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.
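The "nearly straight ODE trajectory" idea in the abstract above can be illustrated with a toy sketch. It assumes the common linear-interpolation form of flow matching, where the path between a low-resolution sample and its high-resolution target is a straight line and the regression target is their difference. The names `interpolate`, `target_velocity`, and `euler_integrate` are hypothetical, and the oracle velocity stands in for a trained network; this is not the FlashVideo implementation.

```python
# Toy flow-matching sketch on plain Python lists (stand-ins for latents).

def interpolate(x_low, x_high, t):
    """Point on the straight path from x_low (t=0) to x_high (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x_low, x_high)]

def target_velocity(x_low, x_high):
    """Constant velocity of the straight path; the usual regression target."""
    return [b - a for a, b in zip(x_low, x_high)]

def euler_integrate(x_start, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

x_low = [0.0, 1.0, 2.0]    # assumed: an upsampled low-resolution sample
x_high = [0.5, 0.5, 3.0]   # assumed: the matching high-resolution target

# Oracle velocity field (a trained model would approximate this).
oracle = lambda x, t: target_velocity(x_low, x_high)

one_step = euler_integrate(x_low, oracle, 1)
four_steps = euler_integrate(x_low, oracle, 4)
midpoint = interpolate(x_low, x_high, 0.5)
```

Because the velocity is constant along a straight path, a single Euler step already lands on the target, which is the intuition behind needing few function evaluations in the second stage.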

AAAI Conference 2026 Conference Paper

Multi-Level Blur-Aware Stable Diffusion for Region-Adaptive Defocus Deblurring

  • Xiaopan Li
  • Yi Jiang
  • Shiqian Wu
  • Shoulie Xie
  • Sos Agaian

Defocus blur, common in shallow depth-of-field photography, varies across image regions and is challenging to accurately estimate and restore. Existing deblurring methods often struggle to capture fine structural textures and do not effectively adapt to regional differences in blur. We propose Multi-Level Blur-Aware Stable Diffusion (MBSD), a novel framework that explicitly integrates regional blur recognition into a diffusion-based image restoration process. MBSD assigns blur-level labels to image patches using a Patch Blur Annotator (PBA), guiding a Multi-Scale Blur Estimator (MSBE) to predict soft blur probabilities and generate routing weights. These weights control a Blur-Adaptive Expert Mixer (BAEM), which adaptively combines features based on local blur severity. The features are then passed to a text-to-image diffusion model via a cross-attention mechanism, enabling region-specific restoration. Extensive experiments on public benchmarks demonstrate that MBSD delivers superior perceptual quality while maintaining competitive PSNR and SSIM, consistently outperforming state-of-the-art methods.

YNIMG Journal 2026 Journal Article

Temporally congruent auditory stream modulates visual processing both independently of and interactively with selective attention in a competing scenario

  • Jieru Chen
  • Wenjie Liu
  • Shiqi Tan
  • Xiangyong Yuan
  • Yi Jiang

In competing environments, both selective attention and audiovisual interaction can facilitate visual processing, yet whether their influences operate independently or interactively remains debated. Using electroencephalography (EEG), we addressed this issue by instructing participants to selectively attend to one of two lateralized flickering discs, which also changed their shapes either temporally congruent or incongruent with a pitch-changing sound. We found that reaction times for detecting deviants embedded in the attended visual stream were reduced when a temporally congruent sound was concurrently played. Compared to a temporally incongruent auditory stream, a congruent one selectively enhanced steady-state visual evoked potentials (SSVEPs) in response to flickering of the unattended stream. In contrast, the SSVEP and inter-trial phase coherence in response to the shape-modulation for both attended and unattended streams were enhanced at the harmonic frequencies by the temporally congruent sound. The results indicate that the auditory influence on visual processing orthogonal to audiovisual temporal congruency (flicker) interacts with attention, whereas the auditory influence on visual processing relevant to audiovisual temporal congruency (shape-modulation) is largely independent of attention. However, these congruency effects were observed only under rhythmic audiovisual streams: When audiovisual pitch-shape modulation followed unrhythmic temporal structures, these congruency effects totally disappeared. Together, these findings demonstrate that temporally congruent auditory streams can modulate visual processing both independently of and interactively with selective attention, highlighting a flexible and complex interplay between selective attention and audiovisual interaction.

AAAI Conference 2026 Conference Paper

Unveiling the Attribute Misbinding Threat in Identity-Preserving Models

  • Junming Fu
  • Jishen Zeng
  • Yi Jiang
  • Peiyu Zhuang
  • Baoying Chen
  • Siyu Lu
  • Jianquan Yang

Identity-preserving models have led to notable progress in generating personalized content. Unfortunately, such models also exacerbate risks when misused, for instance, by generating threatening content targeting specific individuals. This paper introduces the Attribute Misbinding Attack, a novel method that poses a threat to identity-preserving models by inducing them to produce Not-Safe-For-Work (NSFW) content. The attack's core idea involves crafting benign-looking textual prompts to circumvent text-filter safeguards and leverage a key model vulnerability: flawed attribute binding that stems from its internal attention bias. This results in misattributing harmful descriptions to a target identity and generating NSFW outputs. To facilitate the study of this attack, we present the Misbinding Prompt evaluation set, which examines the content generation risks of current state-of-the-art identity-preserving models across four risk dimensions: pornography, violence, discrimination, and illegality. Additionally, we introduce the Attribute Binding Safety Score (ABSS), a metric for concurrently assessing both content fidelity and safety compliance. Experimental results show that our Misbinding Prompt evaluation set achieves a 5.28% higher success rate in bypassing five leading text filters (including GPT-4o) compared to existing mainstream evaluation sets, while also demonstrating the highest proportion of NSFW content generation. The proposed ABSS metric enables a more comprehensive evaluation of identity-preserving models by concurrently assessing both content fidelity and safety compliance.

JBHI Journal 2026 Journal Article

USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification

  • Changmiao Wang
  • Songqi Zhang
  • Yongquan Zhang
  • Yifei Wang
  • Liya Liu
  • Nannan Li
  • Xingzhi Li
  • Jiexin Pan

Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: https://github.com/fancccc/KidneyStoneSC.

AAAI Conference 2025 Conference Paper

Enhancing Adversarial Transferability with Adversarial Weight Tuning

  • Jiahao Chen
  • Zhou Feng
  • Rui Zeng
  • Yuwen Pu
  • Chunyi Zhou
  • Yi Jiang
  • Yuyou Gan
  • Jinbao Li

Deep neural networks (DNNs) are vulnerable to adversarial examples (AEs) that mislead the model while appearing benign to human observers. A critical concern is the transferability of AEs, which enables black-box attacks without direct access to the target model. However, many previous attacks have failed to explain the intrinsic mechanism of adversarial transferability, lacking a unified and representative metric for transferability as well. In this paper, we rethink the property of transferable AEs and develop a novel metric to measure transferability from the perspective of generalization. Building on insights from this metric, we analyze the generalization of AEs across models with different architectures and prove that we can find a local perturbation to mitigate the gap between surrogate and target models. We further establish the inner connections between model smoothness and flat local maxima, both of which contribute to the transferability of AEs. Further, we propose a new adversarial attack algorithm, Adversarial Weight Tuning (AWT), which adaptively adjusts the parameters of the surrogate model using generated AEs to optimize the flat local maxima and model smoothness simultaneously, without the need for extra data. AWT is a data-free tuning method that combines gradient-based and model-related attack methods to enhance the transferability of AEs. Extensive experiments on a variety of models with different architectures on ImageNet demonstrate that AWT yields superior performance over other attacks, with an average increase of nearly 5% and 10% attack success rates on CNN-based and Transformer-based models, respectively, compared to state-of-the-art attacks.

NeurIPS Conference 2025 Conference Paper

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

  • Jinlai Liu
  • Jian Han
  • Bin Yan
  • Hui Wu
  • Fengda Zhu
  • Xing Wang
  • Yi Jiang
  • Bingyue Peng

We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long-duration video synthesis via straightforward temporal autoregression. Through extensive experiments, InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10× faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

EAAI Journal 2025 Journal Article

Leakage localization methodology based on dynamic pressure signal for subsea pipeline

  • Guowei Ji
  • Baoping Cai
  • Xuelin Liu
  • Yi Jiang
  • Yixin Zhao
  • Qingping Li
  • Lei Gao
  • Kaizheng Wu

Leakage is one of the most critical failure forms of subsea pipelines. Accurate leakage localization is of great significance for ensuring the safe and reliable transportation of subsea pipelines. Leakage locations are considered to be discretely distributed along the subsea pipeline. However, excessive discretization causes an overabundance of nodes in the leakage localization model, and a model with too many localization points performs poorly. Furthermore, data collected on site usually contain a lot of noise, which reduces the effectiveness of leakage localization. A leakage localization methodology based on dynamic pressure signals for subsea pipelines is proposed in this paper. A noise reduction model based on the variational mode decomposition (VMD) algorithm combined with power spectral density (PSD) is established to reduce the noise of the leakage signal. An improved K-means grouping model is developed to mine the data for inherent similarity and cluster leakage characteristics, which improves the robustness of the leakage localization model. A radial basis function (RBF) neural network leakage localization model optimized by the pelican optimization algorithm (POA) is used to identify the location of the leakage. A leakage experiment is used to study the performance of this methodology. The localization accuracy of the proposed methodology is more than 90%, and the localization error is less than 16 cm. After the neural network is combined with the improved grouping model, the average leakage localization accuracy increases by 15.65% and the average absolute error decreases by 8.64 cm. The proposed methodology provides an effective tool for leakage localization of critical equipment in subsea production systems.

NeurIPS Conference 2025 Conference Paper

UniTok: a Unified Tokenizer for Visual Generation and Understanding

  • Chuofan Ma
  • Yi Jiang
  • Junfeng Wu
  • Jihan Yang
  • Xin Yu
  • Zehuan Yuan
  • Bingyue Peng
  • Xiaojuan Qi

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from the limited representational capacity of the discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256×256 benchmark. All code and models have been made publicly available.
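The multi-codebook quantization mentioned in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not UniTok's code: the vector is split into equal chunks, each chunk is snapped to the nearest entry of its own small codebook, and the tuple of indices acts as one token. The function names, codebook values, and the equal-split scheme are all hypothetical.

```python
# Toy multi-codebook quantizer in pure Python.

def nearest_index(chunk, codebook):
    """Index of the codebook entry closest to chunk (squared L2 distance)."""
    distances = [
        sum((c - v) ** 2 for c, v in zip(code, chunk)) for code in codebook
    ]
    return distances.index(min(distances))

def multi_codebook_quantize(vector, codebooks):
    """Quantize each equal-sized chunk of vector with its own codebook."""
    if len(vector) % len(codebooks) != 0:
        raise ValueError("vector length must divide evenly among codebooks")
    size = len(vector) // len(codebooks)
    indices, quantized = [], []
    for k, codebook in enumerate(codebooks):
        chunk = vector[k * size:(k + 1) * size]
        idx = nearest_index(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized

# Two hypothetical sub-codebooks with two 2-D entries each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vector = [0.9, 0.8, 0.1, 0.7]
indices, quantized = multi_codebook_quantize(vector, codebooks)

# The number of distinct index tuples is the product of the codebook sizes.
effective_vocab = 1
for codebook in codebooks:
    effective_vocab *= len(codebook)
```

The point of the sketch is the counting argument: with N sub-codebooks of K entries each, the combined vocabulary has K^N entries while only N·K code vectors are stored.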

YNIMG Journal 2024 Journal Article

A common and specialized neural code for social attention triggered by eye gaze and biological motion

  • Ruidi Wang
  • Tian Yuan
  • Li Wang
  • Yi Jiang

Humans appear to be endowed with the ability to readily share attention with interactive partners through the utilization of social direction cues, such as eye gaze and biological motion (BM). Here, we investigated the specialized brain mechanism underlying this fundamental social attention ability by incorporating different types of social (i.e., BM, gaze) and non-social (arrow) cues and combining functional magnetic resonance imaging (fMRI) with a modified central cueing paradigm. Using multi-voxel pattern analysis (MVPA), we found that although gaze- and BM-mediated attentional orienting could be decoded from neural activity in a wide range of brain areas, only the right anterior and posterior superior temporal sulcus (aSTS and pSTS) could specifically decode attentional orienting triggered by social but not non-social cues. Critically, cross-category MVPA further revealed that social attention could be decoded across BM and gaze cues in the right STS and the right superior temporal gyrus (STG). However, these regions could not decode attentional orienting across social and non-social cues. These findings together provide evidence for the existence of a specialized social attention module in the human brain, with the right STS/STG being the critical neural site dedicated to social attention.

NeurIPS Conference 2024 Conference Paper

Efficiency for Free: Ideal Data Are Transportable Representations

  • Peng Sun
  • Yi Jiang
  • Tao Lin

Data, the seminal opportunity and challenge in modern machine learning, currently constrains the scalability of representation learning and impedes the pace of model evolution. In this work, we investigate the efficiency properties of data from both optimization and generalization perspectives. Our theoretical and empirical analysis reveals an unexpected finding: for a given task, utilizing a publicly available, task- and architecture-agnostic model (referred to as the 'prior model' in this paper) can effectively produce efficient data. Building on this insight, we propose the Representation Learning Accelerator (ReLA), which promotes the formation and utilization of efficient data, thereby accelerating representation learning. Utilizing a ResNet-18 pre-trained on CIFAR-10 as a prior model to inform ResNet-50 training on ImageNet-1K reduces computational costs by 50% while maintaining the same accuracy as the model trained with the original BYOL, which requires 100% cost. Our code is available at: https://github.com/LINs-lab/ReLA.

AAAI Conference 2024 Conference Paper

Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding

  • Guoqing Chao
  • Yi Jiang
  • Dianhui Chu

Incomplete multi-view clustering has become an important research problem, since multi-view data with missing values are ubiquitous in real-world applications. Although great efforts have been made on incomplete multi-view clustering, some challenges remain: 1) most existing methods do not make full use of multi-view information to deal with missing values; 2) most methods employ only the consistent information within multi-view data and ignore the complementary information; 3) existing incomplete multi-view clustering methods treat incomplete multi-view representation learning and clustering as independent processes, which leads to a performance gap. In this work, we propose a novel Incomplete Contrastive Multi-View Clustering method with high-confidence guiding (ICMVC). Firstly, we propose multi-view consistency relation transfer plus a graph convolutional network to tackle the missing values problem. Secondly, instance-level attention fusion and high-confidence guiding are proposed to exploit the complementary information, while instance-level contrastive learning for latent representation is designed to employ the consistent information. Thirdly, an end-to-end framework is proposed to integrate multi-view missing values handling, multi-view representation learning and clustering assignment for joint optimization. Experiments compared with state-of-the-art approaches demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/liunian-Jay/ICMVC. The version with supplementary material can be found at http://arxiv.org/abs/2312.08697.

NeurIPS Conference 2024 Conference Paper

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

  • Junke Wang
  • Yi Jiang
  • Zehuan Yuan
  • Binyue Peng
  • Zuxuan Wu
  • Yu-Gang Jiang

The tokenizer, which serves as a translator mapping intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to either image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window attention and causal attention for spatial and temporal modeling, respectively. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data at a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data at multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. We also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method.

NeurIPS Conference 2024 Conference Paper

Recognize Any Regions

  • Haosen Yang
  • Chuofan Ma
  • Bin Wen
  • Yi Jiang
  • Zehuan Yuan
  • Xiatian Zhu

Understanding the semantics of individual regions or patches of unconstrained images, such as open-world object detection, remains a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches are plagued by computationally intensive training requirements, susceptibility to data noise, and deficiency in contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module. Extensive experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives, along with substantial computational savings (e.g., training our model with 3 million data in a single day using 8 V100 GPUs). RegionSpot outperforms GLIP-L by 2.9 in mAP on LVIS val set, with an even larger margin of 13.1 AP for more challenging and rare categories, and a 2.5 AP increase on ODinW. Furthermore, it exceeds GroundingDINO-L by 11.0 AP for rare categories on the LVIS minival set.

NeurIPS Conference 2024 Conference Paper

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

  • Keyu Tian
  • Yi Jiang
  • Zehuan Yuan
  • Bingyue Peng
  • Liwei Wang

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-style AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves on the AR baseline, improving Fréchet inception distance (FID) from 18.65 to 1.73 and inception score (IS) from 80.4 to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
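The link between a power law and a "linear correlation coefficient" in the abstract above comes from taking logarithms: if y = a·x^b, then log y is a linear function of log x, so the Pearson correlation in log-log space approaches ±1. The sketch below shows that arithmetic on synthetic data. The quantities, the exponent, and the function names are assumptions for illustration, not values or code from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_log_correlation(xs, ys):
    """Correlation after taking logs; near +/-1 indicates a power law."""
    return pearson([math.log(x) for x in xs], [math.log(y) for y in ys])

# Synthetic, hypothetical data following y = 5 * x^(-0.1) exactly.
sizes = [1e7, 1e8, 1e9, 1e10]
errors = [5.0 * n ** -0.1 for n in sizes]

r = log_log_correlation(sizes, errors)
```

With noise-free data the coefficient is essentially -1 (negative because the error falls as size grows); measured data with some scatter would give a value slightly closer to zero.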

NeurIPS Conference 2023 Conference Paper

CoDet: Co-occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

  • Chuofan Ma
  • Yi Jiang
  • Xin Wen
  • Zehuan Yuan
  • Xiaojuan Qi

Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performance and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $AP^m_{novel}$ and 44.7 $AP^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $AP^m_{novel}$ and 9.8 $AP^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

YNIMG Journal 2023 Journal Article

Cortical encoding of rhythmic kinematic structures in biological motion

  • Li Shen
  • Xiqian Lu
  • Xiangyong Yuan
  • Ruichen Hu
  • Ying Wang
  • Yi Jiang

Biological motion (BM) perception is of great survival value to human beings. The critical characteristics of BM information lie in kinematic cues containing rhythmic structures. However, how rhythmic kinematic structures of BM are dynamically represented in the brain and contribute to visual BM processing remains largely unknown. Here, we probed this issue in three experiments using electroencephalogram (EEG). We found that neural oscillations of observers entrained to the hierarchical kinematic structures of the BM sequences (i.e., step-cycle and gait-cycle for point-light walkers). Notably, only the cortical tracking of the higher-level rhythmic structure (i.e., gait-cycle) exhibited a BM processing specificity, manifested by enhanced neural responses to upright over inverted BM stimuli. This effect could be extended to different motion types and tasks, with its strength positively correlated with the perceptual sensitivity to BM stimuli at the right temporal brain region dedicated to visual BM processing. Modeling results further suggest that the neural encoding of spatiotemporally integrative kinematic cues, in particular the opponent motions of bilateral limbs, drives the selective cortical tracking of BM information. These findings underscore the existence of a cortical mechanism that encodes periodic kinematic features of body movements, which underlies the dynamic construction of visual BM perception.

ICLR Conference 2023 Conference Paper

Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling

  • Keyu Tian
  • Yi Jiang
  • Qishuai Diao
  • Chen Lin
  • Liwei Wang
  • Zehuan Yuan

We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, randomly masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet’s hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method, called Sparse masKed modeling (SparK), is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). The improvements on object detection and instance segmentation are more significant (up to +3.5%), validating the strong transferability of features learned. We also find SparK’s favorable scaling behavior by observing more gains on larger networks. All of these findings support the promising future of generative pre-training on convnets. Both codes and pre-trained models have been released at https://github.com/keyu-tian/SparK.

EAAI Journal 2023 Journal Article

TransCFD: A transformer-based decoder for flow field prediction

  • Jundou Jiang
  • Guanxiong Li
  • Yi Jiang
  • Laiping Zhang
  • Xiaogang Deng

The computational fluid dynamics (CFD) method is computationally intensive and costly, and evaluating aerodynamic performance through CFD is time-consuming and labor-intensive. For the design and optimization of aerodynamic shapes, it is essential to obtain aerodynamic performance efficiently and accurately. This paper proposed TransCFD, a Transformer-based decoding architecture for flow field prediction. The aerodynamic shape is parameterized and used as input to the decoder, which learns an end-to-end mapping between the shape and the flow fields. Compared with the CFD method, the TransCFD was evaluated to have a mean absolute error (MAE) of less than 1%, increase the speed by three orders of magnitude, and perform very well in generalization capability. The method simplifies the input requirements compared to most existing methods. Although the object of this work is a two-dimensional airfoil, the setup of this scheme is very general. TransCFD is promising for rapid aerodynamic performance evaluation, with potential applications in accelerating the aerodynamic design.

JBHI Journal 2022 Journal Article

DeepRayburst for Automatic Shape Analysis of Tree-Like Structures in Biomedical Images

  • Yi Jiang
  • Weixun Chen
  • Min Liu
  • Yaonan Wang
  • Erik Meijering

Precise quantification of tree-like structures from biomedical images, such as neuronal shape reconstruction and retinal blood vessel caliber estimation, is increasingly important in understanding normal function and pathologic processes in biology. Some handcrafted methods have been proposed for this purpose in recent years. However, they are designed only for a specific application. In this paper, we propose a shape analysis algorithm, DeepRayburst, that can be applied to many different applications based on a Multi-Feature Rayburst Sampling (MFRS) and a Dual Channel Temporal Convolutional Network (DC-TCN). Specifically, we first generate a Rayburst Sampling (RS) core containing a set of multidirectional rays. Then the MFRS is designed by extending each ray of the RS to multiple parallel rays which extract a set of feature sequences. A Gaussian kernel is then used to fuse these feature sequences and outputs one feature sequence. Furthermore, we design a DC-TCN to make the rays terminate on the surface of tree-like structures according to the fused feature sequence. Finally, by analyzing the distribution patterns of the terminated rays, the algorithm can serve multiple shape analysis applications of tree-like structures. Experiments on three different applications, including soma shape reconstruction, neuronal shape reconstruction, and vessel caliber estimation, confirm that the proposed method outperforms other state-of-the-art shape analysis methods, which demonstrate its flexibility and robustness.

NeurIPS Conference 2022 Conference Paper

Rethinking Resolution in the Context of Efficient Video Recognition

  • Chuofan Ma
  • Qiushan Guo
  • Yi Jiang
  • Ping Luo
  • Zehuan Yuan
  • Xiaojuan Qi

In this paper, we empirically study how to make the most of low-resolution frames for efficient video recognition. Existing methods mainly focus on developing compact networks or alleviating temporal redundancy of video inputs to increase efficiency, whereas compressing frame resolution has rarely been considered a promising solution. A major concern is the poor recognition accuracy on low-resolution frames. We thus start by analyzing the underlying causes of performance degradation on low-resolution frames. Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale. Motivated by the success of knowledge distillation (KD), we propose to bridge the gap between network and input size via cross-resolution KD (ResKD). Our work shows that ResKD is a simple but effective method to boost recognition accuracy on low-resolution frames. Without bells and whistles, ResKD considerably surpasses all competitive methods in terms of efficiency and accuracy on four large-scale benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V2. In addition, we extensively demonstrate its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video Transformers, and scalability towards super low-resolution frames. The results suggest ResKD can serve as a general inference acceleration method for state-of-the-art video recognition. Our code will be available at https://github.com/CVMI-Lab/ResKD.

JBHI Journal 2021 Journal Article

Deep Learning Methods for Lung Cancer Segmentation in Whole-Slide Histopathology Images—The ACDC@LungHP Challenge 2019

  • Zhang Li
  • Jiehua Zhang
  • Tao Tan
  • Xichao Teng
  • Xiaoliang Sun
  • Hong Zhao
  • Lihong Liu
  • Yang Xiao

Accurate segmentation of lung cancer in pathology slides is a critical step in improving patient care. We proposed the ACDC@LungHP (Automatic Cancer Detection and Classification in Whole-slide Lung Histopathology) challenge for evaluating different computer-aided diagnosis (CAD) methods on the automatic diagnosis of lung cancer. The ACDC@LungHP 2019 focused on segmentation (pixel-wise detection) of cancer tissue in whole slide imaging (WSI), using an annotated dataset of 150 training images and 50 test images from 200 patients. This paper reviews this challenge and summarizes the top 10 submitted methods for lung cancer segmentation. All methods were evaluated using precision, accuracy, sensitivity, specificity, and the DICE coefficient (DC). The DC ranged from 0.7354 ± 0.1149 to 0.8372 ± 0.0858. The DC of the best method was close to the inter-observer agreement (0.8398 ± 0.0890). All methods were based on deep learning and categorized into two groups: multi-model methods and single-model methods. In general, multi-model methods were significantly better (p < 0.01) than single-model methods, with mean DC of 0.7966 and 0.7544, respectively. Deep learning based methods could potentially help pathologists find suspicious regions for further analysis of lung cancer in WSI.
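For readers unfamiliar with the DICE coefficient reported above, the following is a minimal sketch of the standard definition, DC = 2|A ∩ B| / (|A| + |B|), applied to two binary masks. It is not the challenge's evaluation code; the function name `dice_coefficient`, the flat 0/1 list representation, and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice coefficient between two binary masks given as flat sequences of 0/1.

    Hypothetical helper for illustration only; real pipelines typically
    operate on image arrays rather than Python lists.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked positive in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of positive pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)

    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1]
    reference = [1, 0, 0, 1, 1]
    print(dice_coefficient(predicted, reference))
```

In the example, two of the three positive pixels in each mask overlap, so the score is 2·2 / (3 + 3) = 2/3.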

JMLR Journal 2019 Journal Article

SimpleDet: A Simple and Versatile Distributed Framework for Object Detection and Instance Recognition

  • Yuntao Chen
  • Chenxia Han
  • Yanghao Li
  • Zehao Huang
  • Yi Jiang
  • Naiyan Wang
  • Zhaoxiang Zhang

Object detection and instance recognition play a central role in many AI applications like autonomous driving, video surveillance and medical image analysis. However, training object detection models on large scale datasets remains computationally expensive and time consuming. This paper presents an efficient and open source object detection framework called SimpleDet which enables the training of state-of-the-art detection models on consumer grade hardware at large scale. SimpleDet covers a wide range of models including both high-performance and high-speed ones. SimpleDet is well-optimized for both low precision training and distributed training and achieves 70% higher throughput for the Mask R-CNN detector compared with existing frameworks. Codes, examples and documents of SimpleDet can be found at https://github.com/tusimple/simpledet.

YNIMG Journal 2014 Journal Article

The feet have it: Local biological motion cues trigger reflexive attentional orienting in the brain

  • Li Wang
  • Xiaoying Yang
  • Jinfu Shi
  • Yi Jiang

Most vertebrates, humans included, have a primitive visual system extremely sensitive to the motion of biological entities. Most previous studies have examined the global aspects of biological motion perception, but local motion processing has received much less attention. Here we provide direct psychophysical and electrophysiological evidence that human observers are intrinsically tuned to the characteristics of local biological motion cues independent of global configuration. Using a modified central cueing paradigm, we show that observers involuntarily orient their attention towards the walking direction of feet motion sequences, which triggers an early directing attention negativity (EDAN) in the occipito-parietal region 100–160ms after the stimulus onset. Notably, such effects are sensitive to the orientation of the local cues and are independent of whether the observers are aware of the biological nature of the motion. Our findings unambiguously demonstrate the automatic processing of local biological motion without explicit recognition. More importantly, with the discovery that local biological motion signals modulate attention, we highlight the functional importance of such processing in the brain.

YNIMG Journal 2012 Journal Article

3D fiber tractography with susceptibility tensor imaging

  • Chunlei Liu
  • Wei Li
  • Bing Wu
  • Yi Jiang
  • G. Allan Johnson

Gradient-echo MRI has revealed anisotropic magnetic susceptibility in the brain white matter. This magnetic susceptibility anisotropy can be measured and characterized with susceptibility tensor imaging (STI). In this study, a method of fiber tractography based on STI is proposed and demonstrated in the mouse brain. STI experiments of perfusion-fixed mouse brains were conducted at 7.0 T. The magnetic susceptibility tensor was calculated for each voxel with regularization and decomposed into its eigensystem. The major eigenvector is found to be aligned with the underlying fiber orientation. Following the orientation of the major eigenvector, we are able to map distinctive fiber pathways in 3D. As a comparison, diffusion tensor imaging (DTI) and DTI fiber tractography were also conducted on the same specimens. The relationship between STI and DTI fiber tracts was explored with similarities and differences identified. It is anticipated that the proposed method of STI tractography may provide a new way to study white matter fiber architecture. As STI tractography is based on physical principles that are fundamentally different from DTI, it may also be valuable for the ongoing validation of DTI tractography.

YNIMG Journal 2012 Journal Article

Dispositional fear, negative affectivity, and neuroimaging response to visually suppressed emotional faces

  • Nathalie Vizueta
  • Christopher J. Patrick
  • Yi Jiang
  • Kathleen M. Thomas
  • Sheng He

“Invisible” stimulus paradigms provide a method for investigating basic affective processing in clinical and non-clinical populations. Neuroimaging studies utilizing continuous flash suppression (CFS) have shown increased amygdala response to invisible fearful versus neutral faces. The current study used CFS in conjunction with functional MRI to test for differences in brain reactivity to visible and invisible emotional faces in relation to two distinct trait dimensions relevant to psychopathology: negative affectivity (NA) and fearfulness. Subjects consisted of college students (N=31) assessed for fear/fearlessness along with dispositional NA. The main brain regions of interest included the fusiform face area (FFA), superior temporal sulcus (STS), and amygdala. Higher NA, but not trait fear, was associated with enhanced response to fearful versus neutral faces in STS and right amygdala (but not FFA), within the invisible condition specifically. The finding that NA rather than fearfulness predicted degree of amygdala reactivity to suppressed faces implicates the input subdivision of the amygdala in the observed effects. Given the central role of NA in anxiety and mood disorders, the current data also support use of the CFS methodology for investigating the neurobiology of these disorders.

YNIMG Journal 2011 Journal Article

Microscopic diffusion tensor atlas of the mouse brain

  • Yi Jiang
  • G. Allan Johnson

Eight diffusion tensor imaging (DTI) datasets of normal adult C57BL/6J mouse brains were acquired with an isotropic Nyquist limited resolution of 43μm (voxel volume ~80pl). Each specimen was scanned with a b0 image and 6 diffusion-weighted images. T1- and T2*-weighted data were acquired with each specimen to aid nonlinear registration of the data to a common reference space (called “Waxholm Space”). We identified 80 different discrete landmarks in Waxholm Space to provide the gold standard for measuring the registration quality. The accuracy of the registration was established by measuring displacement of the 80 landmarks in each registered brain from the same landmarks in the reference brain. The accuracy was better than 130μm for 95% of the landmarks (overall landmark displacement is 65±40μm, n=640). Mean and coefficient of variation atlases of DTI indices were generated with potential application for both voxel-based and region of interest-based analysis. To examine consistency of DTI data among individual subjects in this study and difference in diffusion indices between separate brain structures within each subject, averaged values of DTI indices (axial diffusivity, radial diffusivity, fractional anisotropy, and angular deviation of the primary eigenvector) were computed in 9 white matter structures in each brain. The variation of the DTI indices across the population was very small, e.g., ~5% for axial diffusivity for each white matter structure, enabling confident differentiation of differences in these structures within each subject. ANOVA tests indicated that the current protocol is able to provide consistent DTI data of individual brains (p>0.25), and distinguish difference of diffusion indices between white matter structures (p<0.001). Power analysis was also performed to provide an estimate of the number of specimens required to detect a 10% change of the DTI indices in each white matter structure. The data provide a critical addition to Waxholm Space, the International Neuroinformatics Coordinating Facility (www.incf.org) online comprehensive atlas of the mouse brain.

YNIMG Journal 2011 Journal Article

Population-averaged diffusion tensor imaging atlas of the Sprague Dawley rat brain

  • Jelle Veraart
  • Trygve B. Leergaard
  • Bjørnar T. Antonsen
  • Wim Van Hecke
  • Ines Blockx
  • Ben Jeurissen
  • Yi Jiang
  • Annemie Van der Linden

Rats are widely used in experimental neurobiological research, and rat brain atlases are important resources for identifying brain regions in the context of experimental microsurgery, tissue sampling, and neuroimaging, as well as comparison of findings across experiments. Currently, most available rat brain atlases are constructed from histological material derived from single specimens, and provide two-dimensional or three-dimensional (3D) outlines of diverse brain regions and fiber tracts. Important limitations of such atlases are that they represent individual specimens, and that finer details of tissue architecture are lacking. Access to more detailed 3D brain atlases representative of a population of animals is needed. Diffusion tensor imaging (DTI) is a unique neuroimaging modality that provides sensitive information about orientation structure in tissues, and is widely applied in basic and clinical neuroscience investigations. To facilitate analysis and assignment of location in rat brain neuroimaging investigations, we have developed a population-averaged three-dimensional DTI atlas of the normal adult Sprague Dawley rat brain. The atlas is constructed from high resolution ex vivo DTI images, which were nonlinearly warped into a population-averaged in vivo brain template. The atlas currently comprises a selection of manually delineated brain regions: the caudate–putamen complex, globus pallidus, entopeduncular nucleus, substantia nigra, external capsule, corpus callosum, internal capsule, cerebral peduncle, fimbria of the hippocampus, fornix, anterior commissure, optic tract, and stria terminalis. The atlas is freely distributed and potentially useful for several purposes, including automated and manual delineation of rat brain structural and functional imaging data.

YNIMG Journal 2010 Journal Article

Microscopic diffusion tensor imaging of the mouse brain

  • Yi Jiang
  • G. Allan Johnson

Diffusion tensor imaging (DTI) data at 43 μm isotropic resolution has been acquired on the intact adult mouse brain in 28-h scan time by using a streamlined protocol, including specimen fixation and staining, image acquisition, reconstruction, post-processing, and distribution. An intermediate registration of each component image is required to achieve the desired microscopic resolution. Multiple parameters have been derived, including fractional anisotropy, axial and radial diffusivity, and a color-coded orientation map of the primary eigenvector. Each DTI dataset was mapped to a common reference space to facilitate future standardized analysis. Fiber tracking has also been demonstrated, providing 3D connection information. This protocol to acquire high-resolution DTI data in a robust and repeatable fashion will serve as a foundation to quantitatively study mouse brain integrity and white matter architecture, at what we believe to be the highest spatial resolution yet attained.
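The scalar indices named in this abstract (fractional anisotropy, axial and radial diffusivity) are conventionally computed from the three eigenvalues of the diffusion tensor. The sketch below uses the standard textbook definitions and is not taken from the paper's processing pipeline; the function name `dti_indices`, the dictionary keys, and the example eigenvalues are assumptions for illustration.

```python
import math


def dti_indices(l1, l2, l3):
    """Standard scalar DTI indices from tensor eigenvalues, with l1 >= l2 >= l3.

    Hypothetical helper for illustration; it follows the usual textbook
    formulas rather than any specific published pipeline.
    """
    mean_diffusivity = (l1 + l2 + l3) / 3.0
    axial = l1                      # diffusivity along the primary eigenvector
    radial = (l2 + l3) / 2.0        # mean diffusivity perpendicular to it

    # FA = sqrt(3/2) * ||lambda - MD|| / ||lambda||
    numerator = sum((l - mean_diffusivity) ** 2 for l in (l1, l2, l3))
    denominator = l1 ** 2 + l2 ** 2 + l3 ** 2
    fa = math.sqrt(1.5 * numerator / denominator) if denominator > 0 else 0.0

    return {
        "axial": axial,
        "radial": radial,
        "mean": mean_diffusivity,
        "fa": fa,
    }


if __name__ == "__main__":
    # Made-up eigenvalues for an elongated (anisotropic) tensor.
    print(dti_indices(1.5, 0.4, 0.3))
```

Fractional anisotropy is 0 for a perfectly isotropic tensor (all eigenvalues equal) and approaches 1 when diffusion is confined to a single direction.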

YNIMG Journal 2009 Journal Article

Dynamics of processing invisible faces in the brain: Automatic neural encoding of facial expression information

  • Yi Jiang
  • Robert W. Shannon
  • Nathalie Vizueta
  • Edward M. Bernat
  • Christopher J. Patrick
  • Sheng He

The fusiform face area (FFA) and the superior temporal sulcus (STS) are suggested to process facial identity and facial expression information respectively. We recently demonstrated a functional dissociation between the FFA and the STS as well as correlated sensitivity of the STS and the amygdala to facial expressions using an interocular suppression paradigm [Jiang, Y., He, S., 2006. Cortical responses to invisible faces: dissociating subsystems for facial-information processing. Curr. Biol. 16, 2023–2029.]. In the current event-related brain potential (ERP) study, we investigated the temporal dynamics of facial information processing. Observers viewed neutral, fearful, and scrambled face stimuli, either visibly or rendered invisible through interocular suppression. Relative to scrambled face stimuli, intact visible faces elicited larger positive P1 (110–130 ms) and larger negative N1 or N170 (160–180 ms) potentials at posterior occipital and bilateral occipito-temporal regions respectively, with the N170 amplitude significantly greater for fearful than neutral faces. Invisible intact faces generated a stronger signal than scrambled faces at 140–200 ms over posterior occipital areas whereas invisible fearful faces (compared to neutral and scrambled faces) elicited a significantly larger negative deflection starting at 220 ms along the STS. These results provide further evidence for cortical processing of facial information without awareness and elucidate the temporal sequence of automatic facial expression information extraction.

YNIMG Journal 2005 Journal Article

Distinct neural substrates for the perception of real and virtual visual worlds

  • Shihui Han
  • Yi Jiang
  • Glyn W. Humphreys
  • Tiangang Zhou
  • Peng Cai

Virtual environments have been frequently used for training and skill improvement. However, do real and virtual worlds engage the same brain states in human perceivers? We measured brain activity using functional magnetic resonance imaging (fMRI) while adults watched movie and cartoon clips, simulating real and virtual visual worlds, respectively. Relative to baselines using random static images, the medial prefrontal cortex (MPFC) and the cerebellum were activated only by movie clips of other humans. In contrast, cartoon clips of human and non-human agents activated the superior parietal lobes, while movie clips of animals also activated the superior parietal lobes. Our fMRI findings suggest that the perception of real-world humans is characterised by the involvement of MPFC and the cerebellum, most likely for on-line representation of the mental states of others, whereas the perception of virtual-world agents engages the parietal cortex in attention to actions.

YNIMG Journal 2004 Journal Article

Engagement of the prefrontal cortex in representational momentum: an fMRI study

  • Hengyi Rao
  • Shihui Han
  • Yi Jiang
  • Yanping Xue
  • Hua Gu
  • Yong Cui
  • Dingguo Gao

Behavioral studies have identified a robust phenomenon in which an observer's memory of the final position of a moving target is shifted slightly further in its direction of motion, usually called representational momentum (RM). However, the neural substrates underlying RM are poorly understood. The current study measured hemodynamic responses in association with RM using functional magnetic resonance imaging (fMRI). Two experiments using block and event-related designs, respectively, were conducted in which subjects compared the orientation of a probe rectangle with the remembered orientation of the final inducing figures in a set of rotating rectangles. Both experiments showed that, relative to the control task in which behavioral data did not show RM effects, the RM task induced stronger activation in the prefrontal cortex. However, no activation was found in the MT/MST complex in association with RM. The fMRI results suggest that RM may not simply reflect implicit motion perception, and that high-level cognitive mechanisms underpinned by the prefrontal cortex may be involved in the RM effect.