Arrow Research search

Author name cluster

Yao Yao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

AAAI Conference 2025 Conference Paper

4D Diffusion for Dynamic Protein Structure Prediction with Reference and Motion Guidance

  • Kaihui Cheng
  • Ce Liu
  • Qingkun Su
  • Jun Wang
  • Liwei Zhang
  • Yining Tang
  • Yao Yao
  • Siyu Zhu

Protein structure prediction is pivotal for understanding the structure-function relationship of proteins, advancing biological research, and facilitating pharmaceutical development and experimental design. While deep learning methods and the expanded availability of experimental 3D protein structures have accelerated structure prediction, the dynamic nature of protein structures has received limited attention. This study introduces an innovative 4D diffusion model incorporating molecular dynamics (MD) simulation data to learn dynamic protein structures. Our approach is distinguished by the following components: (1) a unified diffusion model capable of generating dynamic protein structures, including both the backbone and side chains, utilizing atomic grouping and side-chain dihedral angle predictions; (2) a reference network that enhances structural consistency by integrating the latent embeddings of the initial 3D protein structures; and (3) a motion alignment module aimed at improving temporal structural coherence across multiple time steps. To our knowledge, this is the first diffusion-based model aimed at predicting protein trajectories across multiple time steps simultaneously. Validation on benchmark datasets demonstrates that our model exhibits high accuracy in predicting dynamic 3D structures of proteins containing up to 256 amino acids over 32 time steps, effectively capturing both local flexibility in stable states and significant conformational changes.

IROS Conference 2025 Conference Paper

CG-3DGS: Complexity-Guided 3D Gaussian Splatting for High-Fidelity Surgical Scene Reconstruction

  • Yao Yao
  • Bo Ouyang
  • Cancan Zhao

Accurate 3D reconstruction in surgical scenarios is essential for visualizing dynamic tissues with complex anatomical geometries. While 3D Gaussian Splatting (3D-GS) has been explored as an efficient approach to scene modeling, occlusion-induced voids and suboptimal detail optimization have limited its application in surgery. This work introduces a Complexity-Guided 3D Gaussian Splatting (CG-3DGS) framework, in which occlusion regions are globally filled by a state-of-the-art optical flow-based video inpainting method. A frequency–spatial aware refinement (FSAR) mechanism is proposed, allowing spectral signatures and spatial gradients to be jointly analyzed to enhance critical anatomical features (e.g., blood vessels). This mechanism adaptively guides Gaussian densification based on scene-specific anatomical complexity. Experimental results demonstrate that the proposed framework achieves higher reconstruction fidelity while maintaining efficient rendering speeds.

NeurIPS Conference 2025 Conference Paper

Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

  • Shuang Wu
  • Youtian Lin
  • Feihu Zhang
  • Yifei Zeng
  • Yikang Yang
  • Yajie Bao
  • Jiachen Qian
  • Siyu Zhu

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9$\times$ speedup in the forward pass and a 9.6$\times$ speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on public datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at $1024^3$ resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at $256^3$ resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research-page/direct3d-s2.

AAAI Conference 2025 Conference Paper

JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation

  • Yao Yao
  • Peike Li
  • Boyu Chen
  • Alex Wang

With rapid advances in generative artificial intelligence, the text-to-music synthesis task has emerged as a promising direction for music generation. Nevertheless, achieving precise control over multi-track generation remains an open challenge. While existing models excel in directly generating a multi-track mix, their limitations become evident when it comes to composing individual tracks and integrating them in a controllable manner. This departure from the typical workflows of professional composers hinders the ability to refine details in specific tracks. To address this gap, we propose JEN-1 Composer, a unified framework designed to efficiently model marginal, conditional, and joint distributions over multi-track music using a single model. Building upon an audio latent diffusion model, JEN-1 Composer extends the versatility of multi-track music generation. We introduce a progressive curriculum training strategy, which gradually escalates the difficulty of training tasks while ensuring the model's generalization ability and facilitating smooth transitions between different scenarios. During inference, users can iteratively generate and select music tracks, thus incrementally composing entire musical pieces in accordance with the Human-AI co-composition workflow. Our approach demonstrates state-of-the-art performance in controllable and high-fidelity multi-track music synthesis, marking a significant advancement in interactive AI-assisted music creation.

AAAI Conference 2025 Conference Paper

JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning

  • Boyu Chen
  • Peike Li
  • Yao Yao
  • Alex Wang

Large models for text-to-music generation have achieved significant progress, facilitating the creation of high-quality and varied musical compositions from provided text prompts. However, input text prompts may not precisely capture user requirements, particularly when the objective is to generate music that embodies a specific concept derived from a designated reference collection. In this paper, we propose a novel method for customized text-to-music generation, which can capture the concept from a two-minute piece of reference music and generate a new piece of music conforming to the concept. We achieve this by fine-tuning a pretrained text-to-music model using the reference music. However, directly fine-tuning all parameters leads to overfitting issues. To address this problem, we propose a Pivotal Parameters Tuning method that enables the model to assimilate the new concept while preserving its original generative capabilities. Additionally, we identify a potential concept conflict when introducing multiple concepts into the pretrained model. We present a concept enhancement strategy to distinguish multiple concepts, enabling the fine-tuned model to generate music incorporating either individual or multiple concepts simultaneously. We also introduce a new dataset and evaluation protocol for this task. Our proposed JEN-1 DreamStyler outperforms several baselines in both qualitative and quantitative evaluations.

NeurIPS Conference 2025 Conference Paper

Statistical Analysis of the Sinkhorn Iterations for Two-Sample Schrödinger Bridge Estimation

  • Ibuki Maeda
  • Yao Yao
  • Atsushi Nitanda

The Schrödinger bridge problem seeks the optimal stochastic process that connects two given probability distributions with minimal energy modification. While the Sinkhorn algorithm is widely used to solve the static optimal transport problem, a recent work (Pooladian and Niles-Weed, 2024) proposed the *Sinkhorn bridge*, which estimates Schrödinger bridges by plugging optimal transport into the time-dependent drifts of SDEs, with statistical guarantees in the one-sample estimation setting where the true source distribution is fully accessible. In this work, to further justify this method, we study the statistical performance of intermediate Sinkhorn iterations in the two-sample estimation setting, where only finite samples from both source and target distributions are available. Specifically, we establish a statistical bound on the squared total variation error of Sinkhorn bridge iterations: $\mathcal{O}(1/m+1/n + r^{2k})~(r \in (0, 1))$, where $m$ and $n$ are the sample sizes from the source and target distributions, respectively, and $k$ is the number of Sinkhorn iterations. This result provides a theoretical guarantee for the finite-sample performance of the Schrödinger bridge estimator and offers practical guidance for selecting sample sizes and the number of Sinkhorn iterations. Notably, our theoretical results apply to several representative methods such as [SF]$^2$M, DSBM-IMF, BM2, and lightSB(-M) under specific settings, through the previously unnoticed connection between these estimators.
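
For orientation, the Sinkhorn iterations that the abstract refers to can be sketched for the static, two-sample entropic optimal transport problem. This is only a minimal illustration, not the paper's bridge estimator: the function name `sinkhorn_plan`, the squared-distance cost, the uniform sample weights, and the values of `eps` and `iters` are all assumptions chosen for the example.

```python
import math

def sinkhorn_plan(xs, ys, eps=1.0, iters=200):
    """Entropic OT plan between two 1-D samples via Sinkhorn iterations.

    Assumptions (not from the abstract): uniform weights on each sample,
    cost c(x, y) = 0.5 * (x - y)^2, regularization strength eps.
    """
    m, n = len(xs), len(ys)
    a = [1.0 / m] * m  # source weights (m samples)
    b = [1.0 / n] * n  # target weights (n samples)
    # Gibbs kernel K_ij = exp(-c(x_i, y_j) / eps)
    K = [[math.exp(-0.5 * (x - y) ** 2 / eps) for y in ys] for x in xs]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):  # k in the abstract's bound plays the role of iters
        # Rescale rows to match the source marginal, then columns to match the target.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]

plan = sinkhorn_plan([0.0, 1.0, 2.0], [0.5, 1.5])
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(2)]
```

After enough iterations the row and column sums of `plan` approach the uniform weights of the two samples. Stopping early leaves a residual marginal error, which is the kind of iteration-dependent term the abstract's bound accounts for.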

NeurIPS Conference 2024 Conference Paper

Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

  • Shuang Wu
  • Youtian Lin
  • Feihu Zhang
  • Yifei Zeng
  • Jingxi Xu
  • Philip Torr
  • Xun Cao
  • Yao Yao

Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multi-view diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods relying on rendered images as supervision signals. D3D-DiT models the distribution of encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model scalable to large-scale 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image input. Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://www.neural4d.com/research/direct3d.

NeurIPS Conference 2024 Conference Paper

Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models

  • Luohe Shi
  • Yao Yao
  • Zuchao Li
  • Lefei Zhang
  • Hai Zhao

Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks. ICL typically constructs a few-shot learning scenario, either manually or by setting up a Retrieval-Augmented Generation (RAG) system, helping models quickly grasp domain knowledge or question-answering patterns without changing model parameters. However, this approach involves trade-offs, such as slower inference speed and increased space occupancy. PEFT assists the model in adapting to tasks through minimal parameter modifications, but the training process still demands high hardware requirements, even with a small number of parameters involved. To address these challenges, we propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning, maintaining low inference costs. RTD constructs a reference datastore from the provided training examples and optimizes the LLM's final vocabulary distribution by flexibly selecting suitable references based on the input, resulting in more trustable responses and enabling the model to adapt to downstream tasks at a low cost. Experimental evaluations on various LLMs using different benchmarks demonstrate that RTD establishes a new paradigm for augmenting models to downstream tasks. Furthermore, our method exhibits strong orthogonality with traditional methods, allowing for concurrent usage. Our code can be found at https://github.com/ShiLuohe/ReferenceTrustableDecoding.

AAAI Conference 2023 Conference Paper

Coarse2Fine: Local Consistency Aware Re-prediction for Weakly Supervised Object Localization

  • Yixuan Pan
  • Yao Yao
  • Yichao Cao
  • Chongjin Chen
  • Xiaobo Lu

Weakly supervised object localization aims to localize objects of interest by using only image-level labels. Existing methods generally segment the activation map by thresholding to obtain a mask and generate a bounding box. However, the activation map is locally inconsistent, i.e., similar neighboring pixels of the same object are not equally activated, which leads to the blurred boundary issue: the localization result is sensitive to the threshold, and the mask obtained directly from the activation map loses the fine contours of the object, making it difficult to obtain a tight bounding box. In this paper, we introduce the Local Consistency Aware Re-prediction (LCAR) framework, which aims to recover the complete fine object mask from a locally inconsistent activation map and hence obtain a tight bounding box. To this end, we propose the self-guided re-prediction module (SGRM), which employs a novel superpixel aggregation network to replace the post-processing of threshold segmentation. In order to derive more reliable pseudo labels from the activation map to supervise the SGRM, we further design an affinity refinement module (ARM) that utilizes the original image feature to better align the activation map with the image appearance, and design a self-distillation CAM (SD-CAM) to alleviate the locator's dependence on saliency. Experiments demonstrate that our LCAR outperforms the state-of-the-art on both the CUB-200-2011 and ILSVRC datasets, achieving 95.89% and 70.72% of GT-Know localization accuracy, respectively.
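
The threshold-based post-processing that the abstract describes as the usual baseline can be sketched as follows. This is a simplified illustration and not the LCAR method: the helper names `threshold_to_box` and `iou`, the toy activation map, and the choice to box every above-threshold cell (rather than, say, the largest connected region) are assumptions made for the example. A localization is counted here as correct when the intersection over union (IoU) with the ground-truth box is at least 0.5.

```python
def threshold_to_box(cam, threshold):
    """Box (x_min, y_min, x_max, y_max) around all cells with activation >= threshold.

    Simplification: every above-threshold cell is boxed; max coordinates are exclusive.
    Returns None when no cell passes the threshold.
    """
    cells = [(x, y) for y, row in enumerate(cam) for x, v in enumerate(row) if v >= threshold]
    if not cells:
        return None
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

# Hypothetical 4x4 activation map and ground-truth box.
cam = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.6, 0.1],
    [0.1, 0.7, 0.3, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
gt_box = (1, 1, 3, 3)

loose = threshold_to_box(cam, 0.5)   # covers the whole object in this toy map
strict = threshold_to_box(cam, 0.8)  # keeps only the single most activated cell
```

In this toy map, moving the threshold from 0.5 to 0.8 shrinks the box to one cell and drops the IoU with `gt_box` below 0.5, which illustrates the threshold sensitivity the abstract points to.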

KER Journal 2023 Journal Article

Using active learning and an agent-based system to perform interactive knowledge extraction based on the COVID-19 corpus

  • Yao Yao
  • Junying Liu
  • Conor Ryan

Efficient knowledge extraction from Big Data is quite a challenging topic. Recognizing relevant concepts from unannotated data while considering both context and domain knowledge is critical to implementing successful knowledge extraction. In this research, we provide a novel platform we call Active Learning Integrated with Knowledge Extraction (ALIKE) that overcomes the challenges of context awareness and concept extraction, which have impeded knowledge extraction in Big Data. We propose a method to extract related concepts from unorganized data with different contexts using multiple agents, synergy, reinforcement learning, and active learning. We test ALIKE on the datasets of the COVID-19 Open Research Dataset Challenge. The experimental results suggest that the ALIKE platform can more efficiently distinguish inherent concepts from different papers than a non-agent-based method (without active learning) and that our proposed approach has a better chance to address the challenges of knowledge extraction with heterogeneous datasets. Moreover, the techniques used in ALIKE are transferable across any domain with multidisciplinary activity.

AAAI Conference 2022 Conference Paper

iGrow: A Smart Agriculture Solution to Autonomous Greenhouse Control

  • Xiaoyan Cao
  • Yao Yao
  • Lanqing Li
  • Wanpeng Zhang
  • Zhicheng An
  • Zhong Zhang
  • Li Xiao
  • Shihui Guo

Agriculture is the foundation of human civilization. However, the rapid increase of the global population poses a challenge to this cornerstone by demanding more food. Modern autonomous greenhouses, equipped with sensors and actuators, provide a promising solution to the problem by empowering precise control for highly efficient food production. However, the optimal control of autonomous greenhouses is challenging, requiring decision-making based on high-dimensional sensory data, and the scaling of production is limited by the scarcity of labor capable of handling this task. With the advances of artificial intelligence (AI), the internet of things (IoT), and cloud computing technologies, we hope to provide a solution that automates and smartens greenhouse control to address the above challenges. In this paper, we propose a smart agriculture solution named iGrow, for autonomous greenhouse control (AGC): (1) for the first time, we formulate the AGC problem as a Markov decision process (MDP) optimization problem; (2) we design a neural network-based simulator incorporated with the incremental mechanism to simulate the complete planting process of an autonomous greenhouse, which provides a testbed for the optimization of control strategies; (3) we propose a closed-loop bi-level optimization algorithm, which can dynamically re-optimize the greenhouse control strategy with newly observed data during real-world production. We not only conduct simulation experiments but also deploy iGrow in real scenarios, and experimental results demonstrate the effectiveness and superiority of iGrow in autonomous greenhouse simulation and optimal control. Particularly, compelling results from the tomato pilot project in real autonomous greenhouses show that our solution significantly increases crop yield (+10.15%) and net profit (+92.70%) with statistical significance compared to planting experts. Our solution opens up a new avenue for greenhouse production. The code is available at https://github.com/holmescao/iGrow.git.

NeurIPS Conference 2022 Conference Paper

Large-scale Optimization of Partial AUC in a Range of False Positive Rates

  • Yao Yao
  • Qihang Lin
  • Tianbao Yang

The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include the FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of the FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs has been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allowed us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we used an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We established a complexity of $\tilde O(1/\epsilon^6)$ for finding a nearly $\epsilon$-critical solution. Finally, we numerically demonstrated the effectiveness of our proposed algorithms in training both linear models and deep neural networks for partial AUC maximization and sum of ranked range loss minimization.
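
To make the measure itself concrete, the partial AUC over an FPR range can be sketched with one common empirical estimator. This illustrates only the metric, not the paper's optimization algorithm: the function name `partial_auc`, the normalization to [0, 1], the strict comparison for ties, and the assumption that `alpha * n` and `beta * n` are (close to) integers are all choices made for the example.

```python
def partial_auc(pos_scores, neg_scores, alpha, beta):
    """Empirical partial AUC restricted to false positive rates in [alpha, beta].

    Negatives are sorted by descending score; those ranked between alpha*n and
    beta*n are the ones whose thresholds produce FPRs in the requested range.
    The result is the fraction of (positive, selected negative) pairs that are
    ranked correctly, so it lies in [0, 1]. Ties count as incorrect (assumption).
    """
    n = len(neg_scores)
    lo = round(alpha * n)  # assumes alpha * n is (close to) an integer
    hi = round(beta * n)   # assumes beta * n is (close to) an integer
    if not pos_scores or hi <= lo:
        raise ValueError("need positives and a non-empty FPR range")
    selected = sorted(neg_scores, reverse=True)[lo:hi]
    correct = sum(1 for p in pos_scores for q in selected if p > q)
    return correct / (len(pos_scores) * len(selected))

# Hypothetical scores for illustration.
pos = [0.9, 0.8, 0.4]
neg = [0.95, 0.7, 0.5, 0.3, 0.1]

full = partial_auc(pos, neg, 0.0, 1.0)     # ordinary AUC
low_fpr = partial_auc(pos, neg, 0.0, 0.4)  # only the two hardest negatives
```

With these scores the restricted value is lower than the full AUC, because the range [0, 0.4] keeps only the highest-scoring negatives, which are the hardest to rank below the positives.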

AAAI Conference 2020 Conference Paper

Deep Discriminative CNN with Temporal Ensembling for Ambiguously-Labeled Image Classification

  • Yao Yao
  • Jiehui Deng
  • Xiuhua Chen
  • Chen Gong
  • Jianxin Wu
  • Jian Yang

In this paper, we study the problem of image classification where training images are ambiguously annotated with multiple candidate labels, among which only one is correct but is not accessible during the training phase. Due to the adopted non-deep framework and improper disambiguation strategies, traditional approaches are usually short of the representation ability and discrimination ability, so their performances are still to be improved. To remedy these two shortcomings, this paper proposes a novel approach termed “Deep Discriminative CNN” (D²CNN) with temporal ensembling. Specifically, to improve the representation ability, we innovatively employ the deep convolutional neural networks for ambiguously-labeled image classification, in which the well-known ResNet is adopted as our backbone. To enhance the discrimination ability, we design an entropy-based regularizer to maximize the margin between the potentially correct label and the unlikely ones of each image. In addition, we utilize the temporally assembled predictions of different epochs to guide the training process so that the latent groundtruth label can be confidently highlighted. This is much superior to the traditional disambiguation operations which treat all candidate labels equally and identify the hidden groundtruth label via some heuristic ways. Thorough experimental results on multiple datasets firmly demonstrate the effectiveness of our proposed D²CNN when compared with other existing state-of-the-art approaches.