Arrow Research

Author name cluster

Xin Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

44 papers
2 author rows

Possible papers (44)

AAAI Conference 2026 Conference Paper

Decoupling Understanding from Reasoning via Problem Space Mapping for Small-Scale Model Reasoning

  • Li Wang
  • Changhao Zhang
  • Zengqi Xiu
  • Kai Lu
  • Xin Yu
  • Kui Zhang
  • Wenjun Wu

Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., up to 1.5B parameters) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, frequently obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively (1) maps natural language problems into the canonical space via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.
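
For readers who want the shape of the algorithm, the sketch below is a minimal, hypothetical Python skeleton of the alternating loop the abstract describes; the mapper/reasoner objects and their update methods are assumed stand-ins, not the authors' code.

```python
# Hypothetical skeleton of a DURIT-style alternating loop (a sketch, not the
# authors' implementation). `mapper` and `reasoner` are assumed duck-typed
# objects exposing the methods used below.

def durit_loop(mapper, reasoner, problems, n_rounds=3):
    for _ in range(n_rounds):
        # Step 1: RL-train the mapper; reward a mapping if the current
        # reasoner can solve the mapped (canonical) problem.
        for p in problems:
            canonical = mapper.map(p.text)
            reward = float(reasoner.solve(canonical) == p.answer)
            mapper.rl_update(p.text, canonical, reward)
        # Step 2: self-distillation; trajectories generated on canonical
        # inputs supervise the reasoner's behaviour on raw inputs.
        for p in problems:
            teacher_traj = reasoner.generate_trajectory(mapper.map(p.text))
            reasoner.distill(p.text, teacher_traj)
        # Step 3: train the reasoning policy directly in the problem space.
        for p in problems:
            reasoner.policy_update(mapper.map(p.text), p.answer)
    return mapper, reasoner
```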

YNIMG Journal 2026 Journal Article

Resting-state fMRI coherence is selectively diminished around 0.1 Hz in patients with unilateral carotid artery stenosis

  • Sangcheon Choi
  • Gabriel Hoffmann
  • Sebastian Schneider
  • Stephan Kaczmarz
  • Xin Yu
  • Christine Preibisch
  • Christian Sorg

In the brain, vasomotor dynamics at infra-slow frequencies (∼0.1 Hz), driven by synchronized oscillations of smooth muscle cells in vessel walls, are thought to play a crucial role in regulating cerebral perfusion and underlie resting-state functional connectivity (FC), typically measured by correlated time courses of functional signals. In particular, rodent studies have demonstrated that vasomotor activity contributes to the coherence of blood oxygenation level dependent (BOLD) signal fluctuations. However, in humans, detecting this contribution non-invasively remains challenging due to the limited spatiotemporal sensitivity of functional magnetic resonance imaging (fMRI) to vasomotion. Given that prior studies have identified internal carotid artery stenosis (ICAS) as an informative conditional lesion model of vasomotor and hemodynamic impairments in humans, we investigated whether ICAS affects interhemispheric BOLD coherence at ∼0.1 Hz. Using a multi-modal fMRI framework integrating resting-state fMRI with quantitative mapping of cerebral blood volume, blood flow, oxygen metabolism, and BOLD time lag, we compared BOLD coherence between patients with asymptomatic unilateral ICAS and healthy controls. Frequency-specific analysis revealed significantly diminished interhemispheric BOLD coherence at ∼0.1 Hz across canonical resting-state networks in ICAS patients, while ultra-slow (<0.05 Hz) coherence remained largely preserved. This reduction was spatially widespread across brain networks and particularly pronounced in watershed areas, i.e., border zones between major vascular territories, associated with significantly increased lateralization of cerebral blood volume (p < 0.01). Notably, coherence-based FC patterns at ∼0.1 Hz were heterogeneous within watershed areas but homogeneous outside, suggesting an interplay between compensatory mechanisms and cerebrovascular impairment. Taken together, our findings demonstrate that ICAS induces subtle, frequency- and region-specific alterations in interhemispheric FC, consistent with a model in which impaired vasomotor activity and hemodynamic dysfunctions impact resting-state FC in the human brain.
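
As a loose illustration of the frequency-specific measure involved (not the study's actual pipeline), the sketch below computes magnitude-squared coherence between two synthetic BOLD-like signals sharing a ~0.1 Hz component; the sampling rate, duration, and window length are assumptions.

```python
# Toy coherence analysis around 0.1 Hz with scipy.signal.coherence.
import numpy as np
from scipy.signal import coherence

fs = 1.0                       # sampling rate in Hz (TR = 1 s, assumed)
t = np.arange(600) / fs        # 10 minutes of data
rng = np.random.default_rng(0)
vaso = np.sin(2 * np.pi * 0.1 * t)               # shared ~0.1 Hz vasomotor wave
left = vaso + 0.8 * rng.standard_normal(t.size)  # "left hemisphere" signal
right = 0.7 * vaso + 0.8 * rng.standard_normal(t.size)

f, Cxy = coherence(left, right, fs=fs, nperseg=128)
band = (f > 0.08) & (f < 0.12)
print(f"mean coherence near 0.1 Hz: {Cxy[band].mean():.2f}")
```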

TIST Journal 2026 Journal Article

ROIS: Role-Based Multi-Agent Collaboration by Context-Time-Aware Information Sharing

  • Hanwen Qi
  • Tinghuai Ma
  • Kexing Peng
  • Xin Yu

In complex cooperative tasks, Multi-Agent Reinforcement Learning (MARL) faces the dual challenges of an exponentially growing joint action space and the constraints of partial observability. While the Centralized Training with Decentralized Execution (CTDE) paradigm is widely adopted, it often leads to homogeneous policies that lack the specialization necessary for complex teamwork. Role-based methods encourage specialization, but they often lack mechanisms for inter-agent interaction; without rich information to guide role assignment, roles may be assigned ineffectively, hindering the convergence of the team policy to its optimum. To address this critical gap, we propose ROIS, a novel framework that enhances multi-agent collaboration by grounding dynamic role assignments in a context-time-aware information sharing mechanism. Our key insight is to leverage a dedicated information sharing module that captures multi-step temporal context, providing each agent with richer, tailored feedback from its teammates. This mechanism directly addresses the lack of inter-agent interaction, leading to more accurate and effective role assignments. The result is a more coherent task division, which guides specialized policies toward the optimal joint policy and drastically reduces ineffective exploration. We conduct extensive experiments on the demanding StarCraft II, SMACv2, and Multi-agent Particle Environment benchmarks. The results demonstrate that ROIS consistently achieves state-of-the-art performance, significantly outperforming a wide range of advanced baselines, particularly in scenarios requiring deep coordination and policy adaptation. Finally, comprehensive ablation studies confirm the essential contribution of each component to the framework's success.

EAAI Journal 2026 Journal Article

Spatial dependency learning for image-based anomaly detection in engine combustion

  • Luyun Miao
  • Dazhi Zhang
  • Zhen Cao
  • Zhichang Guo
  • Yao Li
  • Xun Yuan
  • Jangbo Peng
  • Chaobo Yang

Traditional scramjet anomaly detection methods are constrained by delayed pressure responses and handcrafted features that depend on expert experience. To address this issue, this paper proposes an intelligent situational awareness algorithm for engine anomaly detection based on chemiluminescence imaging of combustion processes. The model learns the spatial dependencies of local features in stable flame images, using a self-supervised learning framework to characterize the feature distribution of normal image patches and identify anomalies as deviations from this distribution. Experimental results demonstrate that the proposed method achieves 100.0% accuracy and 100.0% area under the receiver operating characteristic curve (AUROC) at the image level, and 90.9% accuracy with 94.8% AUROC at the pixel level. The algorithm is trained solely on normal images and is capable of simultaneously detecting both abnormal states and abnormal regions.

NeurIPS Conference 2025 Conference Paper

AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections

  • Xin Yu
  • Yujia Wang
  • Jinghui Chen
  • Lingzhou Xue

Low-Rank Adaptation (LoRA) has emerged as an effective technique for reducing memory overhead in fine-tuning large language models. However, it often suffers from sub-optimal performance compared with full fine-tuning since the update is constrained to the low-rank space. Recent variants such as LoRA-Pro attempt to mitigate this by adjusting the gradients of the low-rank matrices to approximate the full gradient. However, LoRA-Pro's solution is not unique, and different solutions can lead to significantly varying performance in ablation studies. Besides, to incorporate momentum or adaptive optimization designs, approaches like LoRA-Pro must first compute the equivalent gradient, incurring a memory cost close to that of full fine-tuning. A key challenge remains in integrating momentum properly into the low-rank space with lower memory cost. In this work, we propose AltLoRA, an alternating projection method that avoids the difficulties in gradient approximation brought by the joint update design while integrating momentum without higher memory complexity. Our theoretical analysis provides convergence guarantees and further shows that AltLoRA enables stable feature learning and robustness to transformation invariance. Extensive experiments across multiple tasks demonstrate that AltLoRA outperforms LoRA and its variants, narrowing the gap toward full fine-tuning while preserving superior memory efficiency.
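
A minimal sketch of what an alternating low-rank update can look like, assuming the common LoRA parameterization W = W0 + BA and a toy quadratic loss: the full gradient is projected through the frozen factor to update the other, and momentum buffers stay at low-rank shapes. This illustrates the idea only; it is not the paper's exact algorithm.

```python
# Alternating low-rank updates in the spirit of AltLoRA (illustrative sketch).
import numpy as np

d_out, d_in, r, lr, beta = 64, 32, 4, 1e-2, 0.9
rng = np.random.default_rng(0)
A = 0.01 * rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))
mA, mB = np.zeros_like(A), np.zeros_like(B)   # momentum kept at low-rank shapes
W0 = rng.standard_normal((d_out, d_in))

def full_grad(W):
    # Stand-in for dLoss/dW: toy quadratic loss 0.5 * ||W - 1||^2.
    return W - 1.0

for step in range(200):
    G = full_grad(W0 + B @ A)                 # (d_out, d_in) full gradient
    if step % 2 == 0:                         # A-step: B frozen
        gA = B.T @ G                          # (r, d_in) projected gradient
        mA = beta * mA + gA
        A -= lr * mA
    else:                                     # B-step: A frozen
        gB = G @ A.T                          # (d_out, r) projected gradient
        mB = beta * mB + gB
        B -= lr * mB
```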

NeurIPS Conference 2025 Conference Paper

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

  • Simin Li
  • Zihao Mao
  • Hanxiao Li
  • Zonglei Jing
  • Zhuohang bian
  • Jun Guo
  • Li Wang
  • Zhuoran Han

In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions, a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones.

IJCAI Conference 2025 Conference Paper

Multimodal Retina Image Analysis Survey: Datasets, Tasks and Methods

  • Hongwei Sheng
  • Heming Du
  • Xin Shen
  • Sen Wang
  • Xin Yu

Retina images provide a noninvasive view of the central nervous system and microvasculature, making them essential for clinical applications. Changes in the retina often indicate both ophthalmic and systemic diseases, aiding in diagnosis and early intervention. While deep learning algorithms have advanced retina image analysis, a comprehensive review of related datasets, tasks, and benchmarking is still lacking. In this survey, we systematically categorize existing retina image datasets based on their available data modalities, and review the tasks these datasets support in multimodal retina image analysis. We also explain key evaluation metrics used in various retina image analysis benchmarks. By thoroughly examining current datasets and methods, we highlight the challenges and limitations in existing benchmarks and discuss potential research topics in the field. We hope this work will guide future retina analysis methods and promote the shared use of existing data across different tasks.

ICML Conference 2025 Conference Paper

Understanding the Statistical Accuracy-Communication Trade-off in Personalized Federated Learning with Minimax Guarantees

  • Xin Yu
  • Zelin He
  • Ying Sun
  • Lingzhou Xue
  • Runze Li

Personalized federated learning (PFL) offers a flexible framework for aggregating information across distributed clients with heterogeneous data. This work considers a personalized federated learning setting that simultaneously learns global and local models. While purely local training has no communication cost, collaborative learning among the clients can leverage shared knowledge to improve statistical accuracy, presenting an accuracy-communication trade-off in personalized federated learning. However, the theoretical analysis of how personalization quantitatively influences sample and algorithmic efficiency and their inherent trade-off is largely unexplored. This paper makes a contribution towards filling this gap by providing a quantitative characterization of how the personalization degree affects the trade-off. The results further offer theoretical insights for choosing the personalization degree. As a side contribution, we establish the minimax optimality in terms of statistical accuracy for a widely studied PFL formulation. The theoretical result is validated on both synthetic and real-world datasets and its generalizability is verified in a non-convex setting.
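
For context, one widely studied PFL formulation (an assumed form; the abstract does not state which formulation is analyzed) couples local models to a global model through a proximal penalty whose weight sets the personalization degree:

```latex
% Assumed proximal-penalty PFL objective: client models \theta_i are tied to
% a shared global model w; \lambda controls the personalization degree.
\min_{w,\,\theta_1,\dots,\theta_N}\;
  \frac{1}{N}\sum_{i=1}^{N}\Big( f_i(\theta_i)
    + \frac{\lambda}{2}\,\lVert \theta_i - w \rVert_2^2 \Big)
% \lambda \to 0 recovers purely local training (no communication benefit);
% \lambda \to \infty forces a single global model (maximal sharing).
```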

NeurIPS Conference 2025 Conference Paper

UniTok: a Unified Tokenizer for Visual Generation and Understanding

  • Chuofan Ma
  • Yi Jiang
  • Junfeng Wu
  • Jihan Yang
  • Xin Yu
  • Zehuan Yuan
  • BINGYUE PENG
  • Xiaojuan Qi

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from the limited representational capacity of the discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256×256 benchmark. All codes and models have been made publicly available.
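
A minimal sketch of multi-codebook quantization as described above: the latent is split into chunks, each quantized against its own codebook, so the effective vocabulary is the product of the codebook sizes. Shapes and sizes here are illustrative assumptions.

```python
# Multi-codebook (product) quantization sketch in NumPy.
import numpy as np

def multi_codebook_quantize(z, codebooks):
    """z: (d,) latent vector; codebooks: list of (K, d_chunk) arrays."""
    chunks = np.split(z, len(codebooks))      # requires d divisible by n_books
    codes, quantized = [], []
    for chunk, cb in zip(chunks, codebooks):
        idx = np.argmin(((cb - chunk) ** 2).sum(axis=1))  # nearest code
        codes.append(idx)
        quantized.append(cb[idx])
    return np.array(codes), np.concatenate(quantized)

rng = np.random.default_rng(0)
d, n_books, K = 16, 4, 256
books = [rng.standard_normal((K, d // n_books)) for _ in range(n_books)]
codes, z_q = multi_codebook_quantize(rng.standard_normal(d), books)
print(codes, z_q.shape)   # 4 code indices; effective vocabulary 256**4
```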

NeurIPS Conference 2025 Conference Paper

When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

  • Zhuo Cao
  • Heming Du
  • Bingqing Zhang
  • Xin Yu
  • Xue Li
  • Sen Wang

Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called the QVHighlights Multi-Moment Dataset (QV-M²), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M² consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M² and QVHighlights under both SMR and MMR settings. Results show that QV-M² serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M², it improves over the prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.

IJCAI Conference 2025 Conference Paper

Zero-Shot Machine Unlearning with Proxy Adversarial Data Generation

  • Huiqiang Chen
  • Tianqing Zhu
  • Xin Yu
  • Wanlei Zhou

Machine unlearning aims to remove the influence of specific samples from a trained model. A key challenge in this process is over-unlearning, where the model's performance on the remaining data significantly drops due to the change in the model's parameters. Existing unlearning algorithms depend on the remaining data to prevent this issue. As such, these methods are inapplicable in a more practical scenario, where only the unlearning samples are available (i.e., zero-shot unlearning). This paper presents a novel framework, ZS-PAG, to fill this gap. Our approach offers three key innovations: (1) we approximate the inaccessible remaining data by generating adversarial samples; (2) leveraging the generated samples, we pinpoint a specific subspace to perform the unlearning process, therefore preventing over-unlearning in the challenging zero-shot scenario; and (3) we consider the influence of the unlearning process on the remaining samples and design an influence-based pseudo-labeling strategy. As a result, our method further improves the model's performance after unlearning. The proposed method holds a theoretical guarantee, and experiments on various benchmarks validate the effectiveness and superiority of our proposed method over several baselines.

NeurIPS Conference 2024 Conference Paper

DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection

  • Jia S. Lim
  • Zhuoxiao Chen
  • Mahsa Baktashmotlagh
  • Zhi Chen
  • Xin Yu
  • Zi Huang
  • Yadan Luo

Class-agnostic object detection (OD) can be a cornerstone or a bottleneck for many downstream vision tasks. Despite considerable advancements in bottom-up and multi-object discovery methods that leverage basic visual cues to identify salient objects, consistently achieving a high recall rate remains difficult due to the diversity of object types and their contextual complexity. In this work, we investigate using vision-language models (VLMs) to enhance object detection via a self-supervised prompt learning strategy. Our initial findings indicate that manually crafted text queries often result in undetected objects, primarily because detection confidence diminishes when the query words exhibit semantic overlap. To address this, we propose a Dispersing Prompt Expansion (DiPEx) approach. DiPEx progressively learns to expand a set of distinct, non-overlapping hyperspherical prompts to enhance recall rates, thereby improving performance in downstream tasks such as out-of-distribution OD. Specifically, DiPEx initiates the process by self-training generic parent prompts and selecting the one with the highest semantic uncertainty for further expansion. The resulting child prompts are expected to inherit semantics from their parent prompts while capturing more fine-grained semantics. We apply dispersion losses to ensure high inter-class discrepancy among child prompts while preserving semantic consistency between parent-child prompt pairs. To prevent excessive growth of the prompt sets, we utilize the maximum angular coverage (MAC) of the semantic space as a criterion for early termination. We demonstrate the effectiveness of DiPEx through extensive class-agnostic OD and OOD-OD experiments on MS-COCO and LVIS, surpassing other prompting methods by up to 20.1% in AR and achieving a 21.3% AP improvement over SAM.
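
A hedged sketch of a dispersion-style loss: prompt embeddings are pushed apart on the unit hypersphere by penalizing pairwise cosine similarity. This illustrates the non-overlapping hyperspherical prompt idea; it is not the paper's exact loss.

```python
# Dispersion loss sketch in PyTorch: spread prompt embeddings on the sphere.
import torch

def dispersion_loss(prompts, temperature=0.1):
    """prompts: (n, d) learnable prompt embeddings."""
    p = torch.nn.functional.normalize(prompts, dim=-1)
    sim = p @ p.t()                                        # pairwise cosines
    off_diag = sim[~torch.eye(len(p), dtype=torch.bool)]   # drop self-similarity
    return torch.logsumexp(off_diag / temperature, dim=0)  # penalize overlap

prompts = torch.randn(8, 512, requires_grad=True)
loss = dispersion_loss(prompts)
loss.backward()   # gradients push the prompts apart
```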

AAAI Conference 2024 Conference Paper

Leveraging Partial Symmetry for Multi-Agent Reinforcement Learning

  • Xin Yu
  • Rongye Shi
  • Pu Feng
  • Yongkai Tian
  • Simin Li
  • Shuhao Liao
  • Wenjun Wu

Incorporating symmetry as an inductive bias into multi-agent reinforcement learning (MARL) has led to improvements in generalization, data efficiency, and physical consistency. While prior research has succeeded in using a perfect symmetry prior, the realm of partial symmetry in the multi-agent domain remains unexplored. To fill this gap, we introduce the partially symmetric Markov game, a new subclass of the Markov game. We then theoretically show that the performance error introduced by utilizing symmetry in MARL is bounded, implying that the symmetry prior can still be useful in MARL even in partial symmetry situations. Motivated by this insight, we propose the Partial Symmetry Exploitation (PSE) framework that is able to adaptively incorporate the symmetry prior in MARL under different symmetry-breaking conditions. Specifically, by adaptively adjusting the exploitation of symmetry, our framework is able to achieve superior sample efficiency and overall performance of MARL algorithms. Extensive experiments are conducted to demonstrate the superior performance of the proposed framework over baselines. Finally, we implement the proposed framework on a real-world multi-robot testbed to show its superiority.

IJCAI Conference 2024 Conference Paper

Machine Unlearning via Null Space Calibration

  • Huiqiang Chen
  • Tianqing Zhu
  • Xin Yu
  • Wanlei Zhou

Machine unlearning aims to enable models to forget specific data instances when receiving deletion requests. Current research centers on efficient unlearning to erase the influence of data from the model and neglects the subsequent impacts on the remaining data. Consequently, existing unlearning algorithms degrade the model's performance after unlearning, known as over-unlearning. This paper addresses this critical yet under-explored issue by introducing machine Unlearning via Null Space Calibration (UNSC), which can accurately unlearn target samples without over-unlearning. Better still, by calibrating the decision space during unlearning, UNSC can significantly improve the model's performance on the remaining samples. In particular, our approach hinges on confining the unlearning process to a specified null space tailored to the remaining samples, which is augmented by strategically pseudo-labeling the unlearning samples. Comparison against several established baselines affirms the superiority of our approach.
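
A NumPy sketch of the null-space idea, under stated assumptions: compute the null space of the remaining samples' feature matrix via SVD and project the unlearning update into it, so remaining-data activations are left (approximately) untouched. The projector and threshold are illustrative, not the paper's exact construction.

```python
# Null-space projection sketch: confine an update to directions with X v ~ 0.
import numpy as np

def null_space_projector(X, tol=1e-8):
    """X: (n_remaining, d) features of remaining samples.
    Returns P (d, d), the projector onto the null space of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    V_null = Vt[rank:].T              # directions annihilated by X
    return V_null @ V_null.T

rng = np.random.default_rng(0)
# Rank-deficient remaining-data features (rank 20 in a 50-dim space).
X = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 50))
P = null_space_projector(X)
raw_update = rng.standard_normal(50)      # step proposed by the unlearning loss
safe_update = P @ raw_update              # X @ safe_update ~ 0
print(np.abs(X @ safe_update).max())      # near machine precision
```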

NeurIPS Conference 2024 Conference Paper

MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset

  • Xin Shen
  • Heming Du
  • Hongwei Sheng
  • Shuyun Wang
  • Hui Chen
  • Huiqiang Chen
  • Zhuojie Wu
  • Xiaobiao Du

Isolated Sign Language Recognition (ISLR) focuses on identifying individual sign language glosses. Considering the diversity of sign languages across geographical regions, developing region-specific ISLR datasets is crucial for supporting communication and research. Auslan, as a sign language specific to Australia, still lacks a dedicated large-scale word-level dataset for the ISLR task. To fill this gap, we curate the first large-scale Multi-view Multi-modal Word-Level Australian Sign Language recognition dataset, dubbed MM-WLAuslan. Compared to other publicly available datasets, MM-WLAuslan exhibits three significant advantages: (1) the largest amount of data, (2) the most extensive vocabulary, and (3) the most diverse range of multi-modal camera views. Specifically, we record 282K+ sign videos covering 3,215 commonly used Auslan glosses presented by 73 signers in a studio environment. Moreover, our filming system includes two different types of cameras, i.e., three Kinect-V2 cameras and a RealSense camera. We position cameras hemispherically around the front half of the signer and simultaneously record videos using all four cameras. Furthermore, we benchmark results with state-of-the-art methods for various multi-modal ISLR settings on MM-WLAuslan, including multi-view, cross-camera, and cross-view. Experiment results indicate that MM-WLAuslan is a challenging ISLR dataset, and we hope this dataset will contribute to the development of Auslan and the advancement of sign languages worldwide. All datasets and benchmarks are available at MM-WLAuslan.

NeurIPS Conference 2024 Conference Paper

TPR: Topology-Preserving Reservoirs for Generalized Zero-Shot Learning

  • Hui Chen
  • Yanbin Liu
  • Yongqiang Ma
  • Nanning Zheng
  • Xin Yu

Pre-trained vision-language models (VLMs) such as CLIP have shown excellent performance for zero-shot classification. Based on CLIP, recent methods design various learnable prompts to evaluate the zero-shot generalization capability on a base-to-novel setting. This setting assumes test samples are already divided into either base or novel classes, limiting its application to realistic scenarios. In this paper, we focus on a more challenging and practical setting: generalized zero-shot learning (GZSL), i.e., testing with no information about the base/novel division. To address this challenging zero-shot problem, we introduce two unique designs that enable us to classify an image without the need of knowing whether it comes from seen or unseen classes. Firstly, most existing methods only adopt a single latent space to align visual and linguistic features, which has a limited ability to represent complex visual-linguistic patterns, especially for fine-grained tasks. Instead, we propose a dual-space feature alignment module that effectively augments the latent space with a novel attribute space induced by a well-devised attribute reservoir. In particular, the attribute reservoir consists of a static vocabulary and learnable tokens complementing each other for flexible control over feature granularity. Secondly, finetuning CLIP models (e.g., prompt learning) on seen base classes usually sacrifices the model's original generalization capability on unseen novel classes. To mitigate this issue, we present a new topology-preserving objective that can enforce feature topology structures of the combined base and novel classes to resemble the topology of CLIP. In this manner, our model will inherit the generalization ability of CLIP through maintaining the pairwise class angles in the attribute space. Extensive experiments on twelve object recognition datasets demonstrate that our model, termed Topology-Preserving Reservoir (TPR), outperforms strong baselines including both prompt learning and conventional generative-based zero-shot methods.
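
A hedged sketch of a topology-preserving objective: penalize drift in the pairwise class-angle (cosine) structure of tuned class embeddings relative to frozen CLIP embeddings. Illustrative only, not the paper's exact formulation.

```python
# Topology-preservation sketch in PyTorch: match pairwise cosine structures.
import torch

def topology_loss(tuned, frozen):
    """tuned, frozen: (n_classes, d) class embeddings."""
    t = torch.nn.functional.normalize(tuned, dim=-1)
    f = torch.nn.functional.normalize(frozen, dim=-1)
    return ((t @ t.t() - f @ f.t()) ** 2).mean()   # preserve pairwise angles

frozen = torch.randn(10, 512)             # e.g. frozen CLIP text embeddings
tuned = frozen.clone().requires_grad_(True)
loss = topology_loss(tuned, frozen)       # zero at init, grows if angles drift
loss.backward()
```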

NeurIPS Conference 2023 Conference Paper

Auslan-Daily: Australian Sign Language Translation for Daily Communication and News

  • Xin Shen
  • Shaozu Yuan
  • Hongwei Sheng
  • Heming Du
  • Xin Yu

Sign language translation (SLT) aims to convert a continuous sign language video clip into a spoken language. Considering different geographic regions generally have their own native sign languages, it is valuable to establish corresponding SLT datasets to support related communication and research. Auslan, as a sign language specific to Australia, still lacks a dedicated large-scale dataset for SLT. To fill this gap, we curate an Australian Sign Language translation dataset, dubbed Auslan-Daily, which is collected from the Auslan educational TV series and Auslan TV programs. The former involves daily communications among multiple signers in the wild, while the latter comprises sign language videos for up-to-date news, weather forecasts, and documentaries. In particular, Auslan-Daily has two main features: (1) the topics are diverse and signed by multiple signers, and (2) the scenes in our dataset are more complex, e.g., videos captured in various environments, gesture interference during multi-signer interactions, and various camera positions. With a collection of more than 45 hours of high-quality Auslan video materials, we invite Auslan experts to align different fine-grained visual and language pairs, including video ↔ fingerspelling, video ↔ gloss, and video ↔ sentence. As a result, Auslan-Daily contains multi-grained annotations that can be utilized to accomplish various fundamental sign language tasks, such as signer detection, sign spotting, fingerspelling detection, isolated sign language recognition, sign language translation and alignment. Moreover, we benchmark results with state-of-the-art models for each task in Auslan-Daily. Experiments indicate that Auslan-Daily is a highly challenging SLT dataset, and we hope this dataset will contribute to the development of Auslan and the advancement of sign languages worldwide in a broader context. All datasets and benchmarks are available at Auslan-Daily.

AAAI Conference 2023 Conference Paper

FlowFace: Semantic Flow-Guided Shape-Aware Face Swapping

  • Hao Zeng
  • Wei Zhang
  • Changjie Fan
  • Tangjie Lv
  • Suzhen Wang
  • Zhimeng Zhang
  • Bowen Ma
  • Lincheng Li

In this work, we propose a semantic flow-guided two-stage framework for shape-aware face swapping, namely FlowFace. Unlike most previous methods that focus on transferring the source inner facial features but neglect facial contours, our FlowFace can transfer both of them to a target face, thus leading to more realistic face swapping. Concretely, our FlowFace consists of a face reshaping network and a face swapping network. The face reshaping network addresses the shape outline differences between the source and target faces. It first estimates a semantic flow (i.e., face shape differences) between the source and the target face, and then explicitly warps the target face shape with the estimated semantic flow. After reshaping, the face swapping network generates inner facial features that exhibit the identity of the source face. We employ a pre-trained face masked autoencoder (MAE) to extract facial features from both the source face and the target face. In contrast to previous methods that use identity embedding to preserve identity information, the features extracted by our encoder can better capture facial appearances and identity information. Then, we develop a cross-attention fusion module to adaptively fuse inner facial features from the source face with the target facial attributes, thus leading to better identity preservation. Extensive quantitative and qualitative experiments on in-the-wild faces demonstrate that our FlowFace outperforms the state-of-the-art significantly.

NeurIPS Conference 2023 Conference Paper

RVD: A Handheld Device-Based Fundus Video Dataset for Retinal Vessel Segmentation

  • MD WAHIDUZZAMAN KHAN
  • Hongwei Sheng
  • Hu Zhang
  • Heming Du
  • Sen Wang
  • Minas Coroneo
  • Farshid Hajati
  • Sahar Shariflou

Retinal vessel segmentation is generally grounded in image-based datasets collected with bench-top devices. The static images naturally lose the dynamic characteristics of retina fluctuation, resulting in diminished dataset richness, and the usage of bench-top devices further restricts dataset scalability due to its limited accessibility. Considering these limitations, we introduce the first video-based retinal dataset by employing handheld devices for data acquisition. The dataset comprises 635 smartphone-based fundus videos collected from four different clinics, involving 415 patients from 50 to 75 years old. It delivers comprehensive and precise annotations of retinal structures in both spatial and temporal dimensions, aiming to advance the landscape of vasculature segmentation. Specifically, the dataset provides three levels of spatial annotations: binary vessel masks for overall retinal structure delineation, general vein-artery masks for distinguishing the vein and artery, and fine-grained vein-artery masks for further characterizing the granularities of each artery and vein. In addition, the dataset offers temporal annotations that capture the vessel pulsation characteristics, assisting in detecting ocular diseases that require fine-grained recognition of hemodynamic fluctuation. In application, our dataset exhibits a significant domain shift with respect to data captured by bench-top devices, thus posing great challenges to existing methods. Thanks to rich annotations and data scales, our dataset potentially paves the way for more advanced retinal analysis and accurate disease diagnosis. In the experiments, we provide evaluation metrics and benchmark results on our dataset, reflecting both the potential and challenges it offers for vessel segmentation tasks. We hope this challenging dataset will significantly contribute to the development of eye disease diagnosis and early prevention.

JBHI Journal 2023 Journal Article

Semantic-Aware Contrastive Learning for Multi-Object Medical Image Segmentation

  • Ho Hin Lee
  • Yucheng Tang
  • Qi Yang
  • Xin Yu
  • Leon Y. Cai
  • Lucas W. Remedios
  • Shunxing Bao
  • Bennett A. Landman

Medical image segmentation, or computing voxel-wise semantic masks, is a fundamental yet challenging task in the medical imaging domain. To increase the ability of encoder-decoder neural networks to perform this task across large clinical cohorts, contrastive learning provides an opportunity to stabilize model initialization and enhance downstream task performance without ground-truth voxel-wise labels. However, multiple target objects with different semantic meanings and contrast levels may exist in a single image, which poses a problem for adapting traditional contrastive learning methods from prevalent “image-level classification” to “pixel-level segmentation”. In this article, we propose a simple semantic-aware contrastive learning approach leveraging attention masks and image-wise labels to advance multi-object semantic segmentation. Briefly, we embed different semantic objects to different clusters rather than the traditional image-level embeddings. We evaluate our proposed method on a multi-organ medical image segmentation task with both in-house data and the MICCAI Challenge 2015 BTCV dataset. Compared with current state-of-the-art training strategies, our proposed pipeline yields substantial Dice score improvements of 5.53% and 6.09% on the two medical image segmentation cohorts, respectively (p-value < 0.01). The performance of the proposed method is further assessed on an external medical image cohort via the MICCAI Challenge FLARE 2021 dataset, achieving a substantial improvement from Dice 0.922 to 0.933 (p-value < 0.01).

NeurIPS Conference 2023 Conference Paper

Streaming Factor Trajectory Learning for Temporal Tensor Decomposition

  • Shikai Fang
  • Xin Yu
  • Shibo Li
  • Zheng Wang
  • Mike Kirby
  • Shandian Zhe

Practical tensor data often comes with time information. Most existing temporal decomposition approaches estimate a set of fixed factors for the objects in each tensor mode, and hence cannot capture the temporal evolution of the objects' representation. More importantly, we lack an effective approach to capture such evolution from streaming data, which is common in real-world applications. To address these issues, we propose Streaming Factor Trajectory Learning (SFTL) for temporal tensor decomposition. We use Gaussian processes (GPs) to model the trajectory of factors so as to flexibly estimate their temporal evolution. To address the computational challenges in handling streaming data, we convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE). We develop an efficient online filtering algorithm to estimate a decoupled running posterior of the involved factor states upon receiving new data. The decoupled estimation enables us to conduct standard Rauch-Tung-Striebel smoothing to compute the full posterior of all the trajectories in parallel, without the need to revisit any previous data. We have shown the advantage of SFTL in both synthetic tasks and real-world applications.
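
To make the GP-to-SDE conversion concrete, here is an assumed one-dimensional special case: an exponential-kernel (Matern-1/2) GP over a factor trajectory is equivalent to an Ornstein-Uhlenbeck SDE, so each streaming observation can be absorbed with an O(1) Kalman update. The length scale and noise levels are toy values, not the paper's settings.

```python
# Streaming GP regression via its state-space (OU) form: O(1) per observation.
import numpy as np

ell, sig2, noise = 5.0, 1.0, 0.1   # GP length scale, variance, obs noise
m, P, t_prev = 0.0, sig2, 0.0      # running posterior mean/var of the factor

def kalman_step(m, P, t_prev, t, y):
    a = np.exp(-(t - t_prev) / ell)          # OU transition over gap t - t_prev
    m_pred = a * m
    P_pred = a * a * P + sig2 * (1 - a * a)  # process noise keeps var at sig2
    k = P_pred / (P_pred + noise)            # Kalman gain
    return m_pred + k * (y - m_pred), (1 - k) * P_pred, t

rng = np.random.default_rng(0)
for t in np.cumsum(rng.exponential(1.0, 50)):   # irregular arrival times
    y = np.sin(0.3 * t) + np.sqrt(noise) * rng.standard_normal()
    m, P, t_prev = kalman_step(m, P, t_prev, t, y)
print(f"final posterior: mean={m:.3f}, var={P:.3f}")
```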

AAAI Conference 2023 Conference Paper

StyleTalk: One-Shot Talking Head Generation with Controllable Speaking Styles

  • Yifeng Ma
  • Suzhen Wang
  • Zhipeng Hu
  • Changjie Fan
  • Tangjie Lv
  • Yu Ding
  • Zhidong Deng
  • Xin Yu

Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.

NeurIPS Conference 2022 Conference Paper

Batch Multi-Fidelity Active Learning with Budget Constraints

  • Shibo Li
  • Jeff M Phillips
  • Xin Yu
  • Robert Kirby
  • Shandian Zhe

Learning functions with high-dimensional outputs is critical in many applications, such as physical simulation and engineering design. However, collecting training examples for these applications is often costly, e.g., by running numerical solvers. The recent work (Li et al., 2022) proposes the first multi-fidelity active learning approach for high-dimensional outputs, which can acquire examples at different fidelities to reduce the cost while improving the learning performance. However, this method only queries at one pair of fidelity and input at a time, and hence risks bringing in strongly correlated examples that reduce the learning efficiency. In this paper, we propose Batch Multi-Fidelity Active Learning with Budget Constraints (BMFAL-BC), which can promote the diversity of training examples to improve the benefit-cost ratio, while respecting a given budget constraint for batch queries. Hence, our method can be more practically useful. Specifically, we propose a novel batch acquisition function that measures the mutual information between a batch of multi-fidelity queries and the target function, so as to penalize highly correlated queries and encourage diversity. The optimization of the batch acquisition function is challenging in that it involves a combinatorial search over many fidelities subject to the budget constraint. To address this challenge, we develop a weighted greedy algorithm that can sequentially identify each (fidelity, input) pair, while achieving a near $(1 - 1/e)$-approximation of the optimum. We show the advantage of our method in several computational physics and engineering applications.
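
A sketch of a weighted greedy selection under a budget: each step picks the (fidelity, input) candidate with the best marginal-gain-to-cost ratio. The toy gain function below merely rewards fidelity and diversity; it stands in for the paper's mutual-information acquisition.

```python
# Weighted greedy batch selection under a budget constraint (illustrative).
import numpy as np

def greedy_batch(candidates, costs, gain, budget):
    batch, spent = [], 0.0
    remaining = list(range(len(candidates)))
    while remaining:
        ratios = [gain(batch, candidates[i]) / costs[i] for i in remaining]
        best = remaining[int(np.argmax(ratios))]
        if spent + costs[best] > budget:
            break  # simple stopping rule; a fuller version keeps scanning
        batch.append(candidates[best])
        spent += costs[best]
        remaining.remove(best)
    return batch, spent

def toy_gain(batch, cand):
    fidelity, x = cand
    base = 1.0 + fidelity               # higher fidelity, more information
    if not batch:
        return base
    closest = min(abs(x - bx) for _, bx in batch)
    return base * closest               # near-duplicates contribute little

cands = [(f, x) for f in (0, 1, 2) for x in np.linspace(0.0, 1.0, 5)]
costs = [1.0 + 2.0 * f for f, _ in cands]   # higher fidelity costs more
picked, spent = greedy_batch(cands, costs, toy_gain, budget=10.0)
print(len(picked), spent)
```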

YNIMG Journal 2022 Journal Article

Focal fMRI signal enhancement with implantable inductively coupled detectors

  • Yi Chen
  • Qi Wang
  • Sangcheon Choi
  • Hang Zeng
  • Kengo Takahashi
  • Chunqi Qian
  • Xin Yu

Despite extensive efforts to increase the signal-to-noise ratio (SNR) of fMRI images for brain-wide mapping, technical advances for focal brain signal enhancement are lacking, in particular for animal brain imaging. Emerging studies have combined fMRI with fiber optic-based optogenetics to decipher circuit-specific neuromodulation from meso to macroscales. High-resolution fMRI is needed to integrate hemodynamic responses into cross-scale functional dynamics, but the SNR remains a limiting factor given the complex implantation setup of animal brains. Here, we developed a multimodal fMRI imaging platform with an implanted inductive coil detector. This detector boosts the tSNR of MRI images, showing a 2-3-fold sensitivity gain over a conventional coil configuration. In contrast to cryoprobes or array coils with limited space for implanted brain interfaces, this setup offers a unique advantage for studying brain circuit connectivity with optogenetic stimulation and can be further extended to other multimodal fMRI mapping schemes.

IJCAI Conference 2022 Conference Paper

Learning Implicit Body Representations from Double Diffusion Based Neural Radiance Fields

  • Guangming Yao
  • Hongzhi Wu
  • Yi Yuan
  • Lincheng Li
  • Kun Zhou
  • Xin Yu

In this paper, we present a novel double diffusion based neural radiance field, dubbed DD-NeRF, to reconstruct human body geometry and render the human body appearance in novel views from a sparse set of images. We first propose a double diffusion mechanism to achieve expressive representations of input images by fully exploiting human body priors and image appearance details at two levels. At the coarse level, we first model the coarse human body poses and shapes via an unclothed 3D deformable vertex model as guidance. At the fine level, we present a multi-view sampling network to capture subtle geometric deformations and image detailed appearances, such as clothing and hair, from multiple input views. Considering the sparsity of the two-level features, we diffuse them into feature volumes in the canonical space to construct neural radiance fields. Then, we present a signed distance function (SDF) regression network to construct body surfaces from the diffused features. Thanks to our double diffused representations, our method can even synthesize novel views of unseen subjects. Experiments on various datasets demonstrate that our approach outperforms the state-of-the-art in both geometric reconstruction and novel view synthesis.

AAAI Conference 2022 Conference Paper

Monocular Camera-Based Point-Goal Navigation by Learning Depth Channel and Cross-Modality Pyramid Fusion

  • Tianqi Tang
  • Heming Du
  • Xin Yu
  • Yi Yang

For a monocular camera-based navigation system, if we could effectively explore scene geometric cues from RGB images, the geometry information would significantly facilitate the efficiency of the navigation system. Motivated by this, we propose a highly efficient point-goal navigation framework, dubbed Geo-Nav. In a nutshell, Geo-Nav consists of two parts: a visual perception part and a navigation part. In the visual perception part, we first propose a Self-supervised Depth Estimation network (SDE) specially tailored for the monocular camera-based navigation agent. SDE learns a mapping from an RGB input image to its corresponding depth image by exploring scene geometric constraints in a self-consistent manner. Then, in order to achieve a representative visual representation from the RGB inputs and learned depth images, we propose a Cross-modality Pyramid Fusion module (CPF). Concretely, CPF computes a patch-wise cross-modality correlation between different modal features and exploits the correlation to fuse and enhance features at each scale. Thanks to the patch-wise nature of CPF, we can fuse feature maps at high resolution, allowing the visual network to perceive more image details. In the navigation part, the extracted visual representations are fed to a navigation policy network to learn how to map the visual representations to agent actions effectively. Extensive experiments on the Gibson benchmark demonstrate that Geo-Nav outperforms the state-of-the-art in terms of efficiency and effectiveness.

AAAI Conference 2022 Conference Paper

One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning

  • Suzhen Wang
  • Lincheng Li
  • Yu Ding
  • Xin Yu

Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, their created videos often suffer from unnatural mouth shapes and asynchronous lips because those methods struggle to learn a consistent speech style from different speakers. We observe that it would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker and then transferring audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions represented by keypoint-based dense motion fields from an input audio. In particular, considering audio may come from different identities in deployment, we incorporate phonemes to represent audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker, and thus allows us to manipulate face images of different identities readily. Considering different face shapes lead to different motions, a motion field transfer module is exploited to reduce the audio-driven dense motion field gap between the training identity and the one-shot reference. Once we obtain the dense motion field of the reference image, we employ an image renderer to generate its talking face videos from an audio clip. Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state-of-the-art in terms of visual quality and lip-sync.

NeurIPS Conference 2022 Conference Paper

Recall Distortion in Neural Network Pruning and the Undecayed Pruning Algorithm

  • Aidan Good
  • Jiaqi Lin
  • Xin Yu
  • Hannah Sieg
  • Mikey Fergurson
  • Shandian Zhe
  • Jerzy Wieczorek
  • Thiago Serra

Pruning techniques have been successfully used in neural networks to trade accuracy for sparsity. However, the impact of network pruning is not uniform: prior work has shown that the recall for underrepresented classes in a dataset may be more negatively affected. In this work, we study such relative distortions in recall by hypothesizing an intensification effect that is inherent to the model. Namely, that pruning makes recall relatively worse for a class with recall below accuracy and, conversely, that it makes recall relatively better for a class with recall above accuracy. In addition, we propose a new pruning algorithm aimed at attenuating this effect. Through statistical analysis, we have observed that intensification is less severe with our algorithm but nevertheless more pronounced with relatively more difficult tasks, less complex models, and higher pruning ratios. More surprisingly, we conversely observe a de-intensification effect with lower pruning ratios.
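
One way to make the hypothesized effect measurable (a plain reading of the abstract, not the paper's exact statistic): compare per-class recall before and after pruning against overall accuracy, with positive values indicating intensification.

```python
# Sketch of an intensification measure from per-class recall shifts.
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    # assumes every class appears at least once in y_true
    return np.array([(y_pred[y_true == c] == c).mean() for c in range(n_classes)])

def intensification(y_true, pred_dense, pred_pruned, n_classes):
    acc = (pred_dense == y_true).mean()
    r0 = per_class_recall(y_true, pred_dense, n_classes)
    r1 = per_class_recall(y_true, pred_pruned, n_classes)
    # hypothesis: classes with recall below accuracy get relatively worse,
    # classes above it get relatively better; both cases yield positive values
    return np.sign(r0 - acc) * (r1 - r0)

rng = np.random.default_rng(0)
y = rng.integers(0, 5, 1000)
dense = np.where(rng.random(1000) < 0.8, y, rng.integers(0, 5, 1000))
pruned = np.where(rng.random(1000) < 0.7, y, rng.integers(0, 5, 1000))
print(intensification(y, dense, pruned, 5))
```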

IJCAI Conference 2021 Conference Paper

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

  • Suzhen Wang
  • Lincheng Li
  • Yu Ding
  • Changjie Fan
  • Xin Yu

We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker in a large head motion while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our latter network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint-based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint-based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds and outperforms the state-of-the-art.

YNIMG Journal 2021 Journal Article

Contribution of animal models toward understanding resting state functional connectivity

  • Patricia Pais-Roldán
  • Celine Mateo
  • Wen-Ju Pan
  • Ben Acland
  • David Kleinfeld
  • Lawrence H. Snyder
  • Xin Yu
  • Shella Keilholz

Functional connectivity, which reflects the spatial and temporal organization of intrinsic activity throughout the brain, is one of the most studied measures in human neuroimaging research. The noninvasive acquisition of resting state functional magnetic resonance imaging (rs-fMRI) allows the characterization of features designated as functional networks, functional connectivity gradients, and time-varying activity patterns that provide insight into the intrinsic functional organization of the brain and potential alterations related to brain dysfunction. Functional connectivity, hence, captures dimensions of the brain's activity that have enormous potential for both clinical and preclinical research. However, the mechanisms underlying functional connectivity have yet to be fully characterized, hindering interpretation of rs-fMRI studies. As in other branches of neuroscience, the identification of the neurophysiological processes that contribute to functional connectivity largely depends on research conducted on laboratory animals, which provide a platform where specific, multi-dimensional investigations that involve invasive measurements can be carried out. These highly controlled experiments facilitate the interpretation of the temporal correlations observed across the brain. Indeed, information obtained from animal experimentation to date is the basis for our current understanding of the underlying basis for functional brain connectivity. This review presents a compendium of some of the most critical advances in the field based on the efforts made by the animal neuroimaging community.

AAAI Conference 2021 Conference Paper

Modeling the Probabilistic Distribution of Unlabeled Data for One-shot Medical Image Segmentation

  • Yuhang Ding
  • Xin Yu
  • Yi Yang

Existing image segmentation networks mainly leverage large-scale labeled datasets to attain high accuracy. However, labeling medical images is very expensive since it requires sophisticated expert knowledge. Thus, it is more desirable to employ only a few labeled data in pursuing high segmentation performance. In this paper, we develop a data augmentation method for one-shot brain magnetic resonance imaging (MRI) image segmentation which exploits only one labeled MRI image (named atlas) and a few unlabeled images. In particular, we propose to learn the probability distributions of deformations (including shapes and intensities) of different unlabeled MRI images with respect to the atlas via 3D variational autoencoders (VAEs). In this manner, our method is able to exploit the learned distributions of image deformations to generate new authentic brain MRI images, and the number of generated samples will be sufficient to train a deep segmentation network. Furthermore, we introduce a new standard segmentation benchmark to evaluate the generalization performance of a segmentation network through a cross-dataset setting (collected from different sources). Extensive experiments demonstrate that our method outperforms the state-of-the-art one-shot medical segmentation methods. Our code has been released at https://github.com/dyh127/Modeling-the-Probabilistic-Distribution-of-Unlabeled-Data.

NeurIPS Conference 2021 Conference Paper

Scaling Up Exact Neural Network Compression by ReLU Stability

  • Thiago Serra
  • Xin Yu
  • Abhinav Kumar
  • Srikumar Ramalingam

We can compress a rectifier network while exactly preserving its underlying functionality with respect to a given input domain if some of its neurons are stable. However, current approaches to determine the stability of neurons with Rectified Linear Unit (ReLU) activations require solving or finding a good approximation to multiple discrete optimization problems. In this work, we introduce an algorithm based on solving a single optimization problem to identify all stable neurons. Our approach is a median of 183 times faster than the state-of-the-art method on CIFAR-10, which allows us to explore exact compression on deeper (5 × 100) and wider (2 × 800) networks within minutes. For classifiers trained under an amount of L1 regularization that does not worsen accuracy, we can remove up to 56% of the connections on the CIFAR-10 dataset. The code is available at https://github.com/yuxwind/ExactCompression.
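
For intuition about what "stable" means, here is a deliberately cruder illustration than the paper's single-optimization-problem approach: interval bounds over a box input domain certify some neurons as stably active (ReLU acts as the identity, so they can be folded into the next layer) or stably inactive (they can be deleted outright).

```python
# ReLU stability via interval arithmetic over a box domain (illustrative).
import numpy as np

def interval_preactivation(W, b, lo, hi):
    """Elementwise bounds on W x + b for x in the box [lo, hi]."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    lower = W_pos @ lo + W_neg @ hi + b
    upper = W_pos @ hi + W_neg @ lo + b
    return lower, upper

rng = np.random.default_rng(0)
W, b = rng.standard_normal((8, 4)), 2.0 * rng.standard_normal(8)
lo, hi = -np.ones(4), np.ones(4)
lower, upper = interval_preactivation(W, b, lo, hi)
stably_active = lower >= 0      # ReLU is linear here: fold into the next layer
stably_inactive = upper <= 0    # ReLU outputs zero here: remove the neuron
print(stably_active.sum(), stably_inactive.sum())
```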

AAAI Conference 2021 Conference Paper

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

  • Lincheng Li
  • Suzhen Wang
  • Zhimeng Zhang
  • Yu Ding
  • Yixing Zheng
  • Xin Yu
  • Changjie Fan

In this paper, we propose a novel text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions in accordance with contextual sentiments as well as speech rhythm and pauses. To be specific, our framework consists of a speaker-independent stage and a speaker-specific stage. In the speaker-independent stage, we design three parallel networks to generate animation parameters of the mouth, upper face, and head from texts, separately. In the speaker-specific stage, we present a 3D face model guided attention network to synthesize videos tailored for different individuals. It takes the animation parameters as input and exploits an attention mask to manipulate facial expression changes for the input individuals. Furthermore, to better establish authentic correspondences between visual motions (i.e., facial expression changes and head movements) and audios, we leverage a high-accuracy motion capture dataset instead of relying on long videos of specific individuals. After attaining the visual and audio correspondences, we can effectively train our network in an end-to-end fashion. Extensive qualitative and quantitative experiments demonstrate that our algorithm achieves high-quality photorealistic talking-head videos including various facial expressions and head motions according to speech rhythms, and outperforms the state-of-the-art.

AAAI Conference 2020 Conference Paper

Optimal Feature Transport for Cross-View Image Geo-Localization

  • Yujiao Shi
  • Xin Yu
  • Liu Liu
  • Tong Zhang
  • Hongdong Li

This paper addresses the problem of cross-view image geo-localization, where the geographic location of a ground-level street-view query image is estimated by matching it against a large-scale aerial map (e.g., a high-resolution satellite image). State-of-the-art deep-learning based methods tackle this problem as deep metric learning, which aims to learn global feature representations of the scene seen by the two different views. Although such deep metric learning methods obtain promising results, they fail to exploit a crucial cue relevant for localization, namely, the spatial layout of local features. Moreover, little attention is paid to the obvious domain gap (between aerial view and ground view) in the context of cross-view localization. This paper proposes a novel Cross-View Feature Transport (CVFT) technique to explicitly establish cross-view domain transfer that facilitates feature alignment between ground and aerial images. Specifically, we implement CVFT as network layers, which transport features from one domain to the other, leading to more meaningful feature similarity comparisons. Our model is differentiable and can be learned end-to-end. Experiments on large-scale datasets have demonstrated that our method remarkably boosts the state-of-the-art cross-view localization performance, e.g., on the CVUSA dataset, with significant improvements in top-1 recall from 40.79% to 61.43%, and in top-10 recall from 76.36% to 90.49%. We expect the key insight of the paper (i.e., explicitly handling the domain difference via domain transport) to prove useful for other similar problems in computer vision as well.
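
The transport idea can be illustrated with generic entropic optimal transport (Sinkhorn iterations) between flattened feature cells of the two views; CVFT learns its cost inside the network, so the cost matrix, marginals, and iteration count below are assumptions.

```python
# Entropic optimal transport (Sinkhorn) between two sets of feature cells.
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Returns a transport plan with uniform marginals."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(n) / n, np.ones(m) / m
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # plan = diag(u) K diag(v)

rng = np.random.default_rng(0)
ground = rng.standard_normal((16, 8))    # flattened ground-view feature cells
aerial = rng.standard_normal((16, 8))    # flattened aerial-view feature cells
cost = ((ground[:, None, :] - aerial[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()                 # normalize to keep exp(-cost/eps) stable
P = sinkhorn(cost)
transported = (P / P.sum(1, keepdims=True)) @ aerial  # aerial feats, ground layout
print(P.sum(), transported.shape)
```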

NeurIPS Conference 2020 Conference Paper

TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

  • Dongxu Li
  • Chenchen Xu
  • Xin Yu
  • Kaihao Zhang
  • Benjamin Swift
  • Hanna Suominen
  • Hongdong Li

Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences. Sign videos consist of continuous sequences of sign gestures with no clear boundaries in between. Existing SLT models usually represent sign visual features in a frame-wise manner so as to avoid the need to explicitly segment the videos into isolated signs. However, these methods neglect the temporal information of signs and lead to substantial ambiguity in translation. In this paper, we explore the temporal semantic structures of sign videos to learn more discriminative features. To this end, we first present a novel sign video segment representation that takes into account multiple temporal granularities, thus alleviating the need for accurate video segmentation. Taking advantage of the proposed segment representation, we develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet. Specifically, TSPNet introduces an inter-scale attention to evaluate and enhance local semantic consistency of sign segments and an intra-scale attention to resolve semantic ambiguity by using non-local video context. Experiments show that our TSPNet outperforms the state-of-the-art with significant improvements on the BLEU score (from 9.58 to 13.41) and ROUGE score (from 31.80 to 34.96) on the largest commonly used SLT dataset. Our implementation is available at https://github.com/verashira/TSPNet.
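
To make the multi-granularity segment representation concrete, the sketch below mean-pools sliding windows of several lengths over frame-level features, yielding one level of segment features per granularity. Window sizes and strides are assumptions, and TSPNet's inter-scale and intra-scale attention is omitted.

```python
# Sketch of a multi-granularity segment representation: mean-pool sliding
# windows of several lengths over frame features. Window sizes and strides
# are illustrative; TSPNet adds inter- and intra-scale attention on top.
import numpy as np

def segment_pyramid(frames, window_sizes=(8, 12, 16), stride=2):
    """frames: (T, D) features -> one (num_segments, D) array per scale."""
    levels = []
    for w in window_sizes:
        starts = range(0, frames.shape[0] - w + 1, stride)
        levels.append(np.stack([frames[s:s + w].mean(axis=0) for s in starts]))
    return levels

video = np.random.default_rng(0).standard_normal((120, 512))  # 120 frames
for w, level in zip((8, 12, 16), segment_pyramid(video)):
    print(f"window {w}: {level.shape[0]} segment features")
```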

YNIMG Journal 2019 Journal Article

Multimodal assessment of recovery from coma in a rat model of diffuse brainstem tegmentum injury

  • Patricia Pais-Roldán
  • Brian L. Edlow
  • Yuanyuan Jiang
  • Johannes Stelzer
  • Ming Zou
  • Xin Yu

Despite the association between brainstem lesions and coma, a mechanistic understanding of coma pathogenesis and recovery is lacking. We developed a coma model in the rat mimicking human brainstem coma, which allowed multimodal analysis of a brainstem tegmentum lesion's effects on behavior, cortical electrophysiology, and global brain functional connectivity. After coma induction, we observed a transient period (∼1h) of unresponsiveness accompanied by cortical burst-suppression. Comatose rats then gradually regained behavioral responsiveness concurrent with emergence of delta/theta-predominant cortical rhythms in primary somatosensory cortex. During the acute stage of coma recovery (∼1–8h), longitudinal resting-state functional MRI revealed an increase in functional connectivity between subcortical arousal nuclei in the thalamus, basal forebrain, and basal ganglia and cortical regions implicated in awareness. This rat coma model provides an experimental platform to systematically study network-based mechanisms of coma pathogenesis and recovery, as well as to test targeted therapies aimed at promoting recovery of consciousness after coma.

NeurIPS Conference 2019 Conference Paper

Spatial-Aware Feature Aggregation for Image based Cross-View Geo-Localization

  • Yujiao Shi
  • Liu Liu
  • Xin Yu
  • Hongdong Li

In this paper, we develop a new deep network to explicitly address the inherent differences between ground and aerial views. We observe that there exist approximate domain correspondences between ground and aerial images: pixels lying along the same azimuth direction in an aerial image approximately correspond to a vertical image column in the ground-view image. Thus, we propose a two-step approach to exploit this prior knowledge. The first step is to apply a regular polar transform to warp an aerial image such that its domain is closer to that of a ground-view panorama. Note that the polar transform, as a pure geometric transformation, is agnostic to scene content and hence cannot bring the two domains into full alignment. We therefore add a subsequent spatial-attention mechanism that further brings corresponding deep features closer in the embedding space. To improve the robustness of the feature representation, we introduce a feature aggregation strategy via learning multiple spatial embeddings. With this two-step approach, we achieve more discriminative deep representations, making cross-view geo-localization more accurate. Our experiments on standard benchmark datasets show significant performance gains, more than doubling the recall rate compared with the previous state of the art.
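
The polar transform step is straightforward to sketch. The NumPy example below resamples a square aerial patch so that azimuth about the image center maps to the horizontal axis, roughly matching the geometry of a ground-view panorama; the output resolution, orientation convention, and nearest-neighbour sampling are illustrative choices, not the paper's exact recipe.

```python
# Sketch of a polar transform: resample a square overhead image so that
# azimuth around the center maps to the horizontal axis.
import numpy as np

def polar_transform(aerial, out_h=128, out_w=512):
    """aerial: (S, S, C) overhead image -> (out_h, out_w, C) polar image."""
    S = aerial.shape[0]
    cx = cy = (S - 1) / 2.0
    rows = np.arange(out_h)[:, None]          # radius index (0 = outer edge)
    cols = np.arange(out_w)[None, :]          # azimuth index
    radius = (S / 2.0) * (out_h - rows) / out_h
    theta = 2.0 * np.pi * cols / out_w
    # nearest-neighbour sampling of the source pixel per (radius, azimuth)
    x = np.clip(np.round(cx + radius * np.sin(theta)).astype(int), 0, S - 1)
    y = np.clip(np.round(cy - radius * np.cos(theta)).astype(int), 0, S - 1)
    return aerial[y, x]

img = np.random.rand(256, 256, 3)             # stand-in satellite patch
print(polar_transform(img).shape)             # (128, 512, 3)
```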

AAAI Conference 2017 Conference Paper

Face Hallucination with Tiny Unaligned Images by Transformative Discriminative Neural Networks

  • Xin Yu
  • Fatih Porikli

Conventional face hallucination methods rely heavily on accurate alignment of low-resolution (LR) faces before upsampling them. Misalignment often leads to deficient results and unnatural artifacts for large upscaling factors. However, due to the diverse range of poses and facial expressions, aligning an LR input image, in particular when it is tiny, is severely difficult. To overcome this challenge, we present an end-to-end transformative discriminative neural network (TDN) devised for super-resolving unaligned and very small face images with an extreme upscaling factor of 8. Our method employs an upsampling network in which we embed spatial transformation layers to allow local receptive fields to line up with similar spatial supports. Furthermore, we incorporate a class-specific loss in our objective through a successive discriminative network to improve the alignment and upsampling performance with semantic information. Extensive experiments on large face datasets show that the proposed method significantly outperforms the state-of-the-art.
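
The embedded spatial transformation layers follow the familiar spatial-transformer pattern, which can be sketched in PyTorch as a small localization network predicting an affine warp that is applied to the feature map by grid sampling. Layer sizes and the pooling choice are assumptions, not the TDN configuration.

```python
# Sketch of a spatial transformation layer: a localization net predicts a
# per-sample affine warp that re-aligns the feature map. Illustrative sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineAlign(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(channels * 16, 32), nn.ReLU(), nn.Linear(32, 6))
        # initialize the predicted warp to the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)             # per-sample affine
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

feat = torch.randn(2, 64, 16, 16)   # tiny LR face features
print(AffineAlign(64)(feat).shape)  # warped features, same shape
```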

YNICL Journal 2015 Journal Article

Prefrontal cortex connectivity dysfunction in performing the Fist–Edge–Palm task in patients with first-episode schizophrenia and non-psychotic first-degree relatives

  • Raymond C.K. Chan
  • Jia Huang
  • Qing Zhao
  • Ya Wang
  • Yun-yao Lai
  • Nan Hong
  • David H.K. Shum
  • Eric F.C. Cheung

Neurological soft signs have been considered one of the promising neurological endophenotypes for schizophrenia. However, most previous studies have employed clinical rating data only. The present study aimed to examine the neurobiological basis of one of the typical motor coordination signs, the Fist-Edge-Palm (FEP) task, in patients with first-episode schizophrenia and their non-psychotic first-degree relatives. Thirteen patients with first-episode schizophrenia, 14 non-psychotic first-degree relatives and 14 healthy controls were recruited. All participants performed the FEP task in a 3T GE MRI scanner. Psychophysiological interaction (PPI) analysis was used to evaluate the functional connectivity between the sensorimotor cortex and frontal regions when participants performed the FEP task compared with simple motor tasks. In the contrast of palm-tapping (PT) vs. rest, activation of the left frontal-parietal region was lowest in the schizophrenia group, intermediate in the relative group and highest in the healthy control group. In the contrast of FEP vs. PT, patients with schizophrenia did not show areas of significant activation, while relatives and healthy controls showed significant activation of the left middle frontal gyrus. Moreover, with increasing task complexity, significant functional connectivity was observed between the sensorimotor cortex and the right frontal gyrus in healthy controls but not in patients with first-episode schizophrenia. These findings suggest that activity of the left frontal-parietal and frontal regions may be a neurofunctional correlate of neurological soft signs, which in turn may be a potential endophenotype of schizophrenia. Moreover, the right frontal gyrus may play a specific role in the execution of the FEP task in schizophrenia spectrum disorders.
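
For readers unfamiliar with PPI, the toy NumPy example below shows the core regression: a target time course is modeled by a task regressor, a seed time course, and their product, and the coefficient on the product term indexes task-dependent coupling. All data are simulated and the design is deliberately minimal.

```python
# Toy psychophysiological interaction (PPI) regression on simulated data:
# the interaction coefficient captures how seed-target coupling changes
# between task blocks and rest.
import numpy as np

rng = np.random.default_rng(1)
n = 200
task = ((np.arange(n) // 20) % 2).astype(float)  # boxcar task regressor
seed = rng.standard_normal(n)                    # seed-region time course
# target couples with the seed more strongly during task blocks
target = 0.2 * seed + 0.6 * task * seed + 0.1 * rng.standard_normal(n)

X = np.column_stack([np.ones(n), task, seed, task * seed])  # PPI design
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print(f"interaction (PPI) beta: {beta[3]:.2f}")  # ~0.6 here
```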

YNIMG Journal 2014 Journal Article

Lack of dystrophin results in abnormal cerebral diffusion and perfusion in vivo

  • Candida L. Goodnough
  • Ying Gao
  • Xin Li
  • Mohammed Q. Qutaish
  • L. Henry Goodnough
  • Joseph Molter
  • David Wilson
  • Chris A. Flask

Dystrophin, the main component of the dystrophin–glycoprotein complex, plays an important role in maintaining the structural integrity of cells. It is also involved in the formation of the blood–brain barrier (BBB). To elucidate the impact of dystrophin disruption in vivo, we characterized changes in cerebral perfusion and diffusion in dystrophin-deficient mice (mdx) by magnetic resonance imaging (MRI). Arterial spin labeling (ASL) and diffusion-weighted MRI (DWI) studies were performed on 2-month-old and 10-month-old mdx mice and their age-matched wild-type controls (WT). The imaging results were correlated with Evans blue extravasation and vascular density studies. The results show that dystrophin disruption significantly decreased the mean cerebral diffusivity in both 2-month-old (7.38±0.30×10⁻⁴ mm²/s) and 10-month-old (6.93±0.53×10⁻⁴ mm²/s) mdx mice as compared to WT (8.49±0.24×10⁻⁴ and 8.24±0.25×10⁻⁴ mm²/s, respectively). There was also an 18% decrease in cerebral perfusion in 10-month-old mdx mice as compared to WT, which was associated with enhanced arteriogenesis. The reduction in water diffusivity in mdx mice is likely due to an increase in cerebral edema or the presence of large molecules in the extracellular space from a leaky BBB. The observation of decreased perfusion in the setting of enhanced arteriogenesis may be caused by an increase in intracranial pressure from cerebral edema. This study demonstrates defects in water handling at the BBB and, consequently, abnormal perfusion associated with the absence of dystrophin.

YNIMG Journal 2012 Journal Article

Direct imaging of macrovascular and microvascular contributions to BOLD fMRI in layers IV–V of the rat whisker–barrel cortex

  • Xin Yu
  • Daniel Glen
  • Shumin Wang
  • Stephen Dodd
  • Yoshiyuki Hirano
  • Ziad Saad
  • Richard Reynolds
  • Afonso C. Silva

The spatiotemporal characteristics of the hemodynamic response to increased neural activity were investigated at the level of individual intracortical vessels using BOLD-fMRI in a well-established rodent model of somatosensory stimulation at 11.7T. Functional maps of the rat barrel cortex were obtained at 150×150×500μm spatial resolution every 200ms. The high spatial resolution allowed separation of active voxels into those containing intracortical macro vessels, mainly veins/venules (referred to as macrovasculature), and those enriched with arteries/capillaries and small venules (referred to as microvasculature), since macro vessels can be readily mapped due to the fast T2* decay of blood at 11.7T. The earliest BOLD response was observed within layers IV–V by 0.8s following stimulation and encompassed mainly the voxels containing the microvasculature and some confined macrovasculature voxels. By 1.2s, the BOLD signal propagated to the macrovasculature voxels, where the peak BOLD signal was 2–3 times higher than that of the microvasculature voxels. At later times, the BOLD response propagated along individual venules/veins far from neuronal sources. This was also observed in layers IV–V of the barrel cortex after specific stimulation of separated whisker rows. These results directly visualize that the earliest hemodynamic changes in response to increased neural activity occur mainly in the microvasculature and spread toward the macrovasculature. At peak response, however, the BOLD signal is dominated by penetrating venules even in layers IV–V of the cortex.

YNIMG Journal 2011 Journal Article

Morphological and functional midbrain phenotypes in Fibroblast Growth Factor 17 mutant mice detected by Mn-enhanced MRI

  • Xin Yu
  • Brian J. Nieman
  • Anamaria Sudarov
  • Kamila U. Szulc
  • Davood J. Abdollahian
  • Nitin Bhatia
  • Anil K. Lalwani
  • Alexandra L. Joyner

With increasing efforts to develop and utilize mouse models of a variety of neurodevelopmental diseases, there is an urgent need for sensitive neuroimaging methods that enable in vivo analysis of subtle alterations in brain anatomy and function in mice. Previous studies have shown that the brains of Fibroblast Growth Factor 17 null mutants (Fgf17 −/−) have anatomical abnormalities in the inferior colliculus (IC)—the auditory midbrain—and minor foliation defects in the cerebellum. In addition, changes in the expression domains of several cortical patterning genes were detected, without overt changes in forebrain morphology. Recently, it has also been reported that Fgf17 −/− mutants have abnormal vocalization and social behaviors, phenotypes that could reflect molecular changes in the cortex and/or altered auditory processing and perception in these mice. We used manganese (Mn)-enhanced magnetic resonance imaging (MEMRI) to analyze the anatomical phenotype of Fgf17 −/− mutants in more detail than previously achieved, detecting changes in the IC, cerebellum, olfactory bulb, hypothalamus and frontal cortex. We also used MEMRI to characterize sound-evoked activity patterns, demonstrating a significant reduction of the active IC volume in Fgf17 −/− mice. Furthermore, tone-specific (16- and 40-kHz) activity patterns in the IC of Fgf17 −/− mice were observed to be largely overlapping, in contrast to the normal pattern, which is separated along the dorsal–ventral axis. These results demonstrate that Fgf17 plays important roles in both the anatomical and functional development of the auditory midbrain, and show the utility of MEMRI for in vivo analyses of mutant mice with subtle brain defects.

YNIMG Journal 2010 Journal Article

3D mapping of somatotopic reorganization with small animal functional MRI

  • Xin Yu
  • Shumin Wang
  • Der-Yow Chen
  • Stephen Dodd
  • Artem Goloshevsky
  • Alan P. Koretsky

There are few noninvasive in vivo methods to study neuroplasticity in animal brains. Functional MRI (fMRI) has been developed for animal brain mapping, but few fMRI studies have analyzed functional alterations due to plasticity in animal models. One major limitation is that fMRI maps are characterized by statistical parametric mapping, which makes the apparent boundary of a map dependent on the statistical threshold used. Here, we developed a method to characterize the center-of-mass locations of fMRI maps, which is shown not to be sensitive to the statistical threshold. Utilizing centers-of-mass as anchor points to fit the spatial distribution of the BOLD response enabled quantitative group analysis of altered boundaries of functional somatosensory maps. This approach was used to study cortical reorganization in the rat primary somatosensory cortex (S1) after sensory deprivation of the barrel cortex by follicle ablation (FA). fMRI demonstrated an enlarged nose S1 representation in the 3D somatotopic functional maps. This result clearly demonstrates that fMRI enables the spatial mapping of functional changes, characterizing multiple regions of S1 cortex while remaining sensitive to changes due to plasticity.
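
The threshold insensitivity of a center-of-mass summary is easy to demonstrate. In the toy NumPy example below, the statistic-weighted center of mass of a simulated activation blob barely moves as the threshold rises, even though the supra-threshold extent shrinks; everything here is simulated for illustration.

```python
# Toy demonstration: a statistic-weighted center of mass is nearly
# threshold-independent, unlike the apparent activation boundary.
# The "activation map" is a simulated 2D Gaussian blob.
import numpy as np

def center_of_mass(stat_map, threshold):
    w = np.where(stat_map > threshold, stat_map, 0.0)  # supra-threshold weights
    coords = np.indices(stat_map.shape).reshape(2, -1).astype(float)
    return coords @ w.ravel() / w.sum()

yy, xx = np.mgrid[0:64, 0:64]
blob = 8.0 * np.exp(-((yy - 40) ** 2 + (xx - 22) ** 2) / 50.0)
for t in (1.0, 2.0, 4.0):
    com = center_of_mass(blob, t)
    print(f"threshold {t}: center ≈ ({com[0]:.2f}, {com[1]:.2f}), "
          f"extent = {(blob > t).sum()} voxels")
```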