Arrow Research search

Author name cluster

Yu Cao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

IROS Conference 2025 Conference Paper

A Wearable Centaur Robot with Wheel-Legged Transformation for Enhanced Load-Carrying Assistance

  • Songhao Li
  • Yu Cao
  • Zhiyuan Di
  • Yifei Guo
  • Jian Huang

The execution of long-distance load-carrying tasks across multiple terrains remains a frequent requirement. These tasks often involve heavy loads, resulting in fatigue, decreased efficiency, and potential safety risks. To address this issue, this paper proposes a wearable centaur robot with wheel-legged transformation for human load-carrying assistance. The key feature of this robotic mechanism is the independent wheel-legged transformable structure, enabling transitions between the wheeled and legged modes. The wheeled mode ensures high load-carrying efficiency, while in the legged mode, the wheels are laid flat, transforming the ankle joint into a locked support surface that provides stable gait support. This design enables efficient and stable load carriage over complex terrains, all while preserving the natural gait of the user. Next, we develop a unified control framework for human-robot collaborative locomotion across different terrains, which includes velocity control based on an admittance model for the wheeled mode, gait control using a Bézier trajectory for the legged mode, and the transition between the two modes. The preliminary experiments include wheeled-mode, legged-mode, mode transition and obstacle crossing under human-robot collaborative locomotion, validating the proposed robot’s adaptability to different terrains while assisting with human load carriage.
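The admittance-model velocity control mentioned for the wheeled mode can be sketched as a virtual mass-damper that maps the measured human-robot interaction force to a velocity command. The parameters and integration scheme below are illustrative assumptions, not values reported in the paper:

```python
# One Euler step of an admittance model M * dv/dt + B * v = F:
# the interaction force F drives the commanded velocity v through a
# virtual mass M and damping B (both hypothetical values here).

def admittance_velocity(force, v_prev, M=10.0, B=5.0, dt=0.01):
    """Integrate the admittance model by one time step of length dt."""
    dv = (force - B * v_prev) / M
    return v_prev + dv * dt

# With zero applied force the commanded velocity decays toward rest;
# a constant force settles at the steady state v = F / B.
v = 1.0
for _ in range(1000):
    v = admittance_velocity(0.0, v)
```

Under a constant push the commanded speed converges to F/B, so the virtual damping sets how responsively the robot follows the wearer.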

EAAI Journal 2025 Journal Article

Code-switching finetuning: Bridging multilingual pretrained language models for enhanced cross-lingual performance

  • Changtong Zan
  • Liang Ding
  • Li Shen
  • Yu Cao
  • Weifeng Liu

In recent years, the development of pre-trained models has significantly propelled advancements in natural language processing. However, multilingual sequence-to-sequence pretrained language models (Seq2Seq PLMs) are pretrained on a wide range of languages (e.g., 25 languages), yet often finetuned for specific bilingual tasks (e.g., English–German), leading to domain and task discrepancies between the pretraining and finetuning stages, which may cause sub-optimal downstream performance. In this study, we first illustratively reveal such domain and task discrepancies, and then conduct an in-depth investigation into the side effects that these discrepancies may have on both training dynamics and downstream performance. To alleviate those side effects, we introduce a simple and effective code-switching restoration task (namely code-switching finetuning) into the standard pretrain-finetune pipeline. Specifically, in the first stage, we recast the downstream data into the self-supervised format used for pretraining, in which the denoising signal is the code-switched cross-lingual phrase. Then, the model is finetuned on the downstream task as usual in the second stage. Experiments spanning both natural language generation (12 supervised translations, 30 zero-shot translations, and 2 cross-lingual summarization tasks) and understanding (7 cross-lingual natural language inference tasks) tasks demonstrate that our model consistently and significantly surpasses the standard finetuning strategy. Analyses show that our method introduces negligible computational cost and reduces cross-lingual representation gaps. The code is publicly available at: https://github.com/zanchangtong/CSR4mBART.

AAAI Conference 2025 Conference Paper

Hand1000: Generating Realistic Hands from Text with Only 1,000 Images

  • Haozhuo Zhang
  • Bin Zhu
  • Yu Cao
  • Yanbin Hao

Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model’s understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing and colors.

NeurIPS Conference 2025 Conference Paper

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

  • Shuqing Luo
  • Ye Han
  • Pingzhi Li
  • Jiayin Qin
  • Jie Peng
  • Yang Zhao
  • Yu Cao
  • Tianlong Chen

Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and a hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.

JBHI Journal 2025 Journal Article

Multi-Gate Mixture of Multi-View Graph Contrastive Learning on Electronic Health Record

  • Yu Cao
  • Qian Wang
  • Xu Wang
  • Dezhong Peng
  • Peilin Li

Electronic Health Record (EHR) is the digital form of patient visits that contains various medical data, including diagnosis, treatment, and lab events. Representation learning of EHR with deep learning methods has been beneficial for patient-related prediction tasks. Recently, studies have focused on revealing the inherent graph structure between medical events in EHR. Graph neural network (GNN) methods are prevalent and perform well in various prediction tasks. However, the inherent relationships between various medical events must be annotated manually, which is complicated and time-consuming. Most research works adopt a straightforward GNN structure on a single prediction task, which cannot fully exploit the potential of EHR representations. Compared with previous work, multi-task prediction can utilize the latent information of concealed correlations between different prediction tasks. In addition, self-contrastive learning on graphs can improve the representations learned by GNNs. We propose a multi-gate mixture of multi-view graph contrastive learning (MMMGCL) method, aiming to obtain a more reasonable EHR representation and improve the performance of downstream tasks. First, each patient visit is represented as a graph with a well-designed, hierarchically fully-connected pattern. Second, node features in the manually constructed graph are pre-trained via the GloVe method with hierarchical ontology knowledge. Finally, MMMGCL processes the pre-trained graph and adopts a joint learning strategy to simultaneously optimize task and contrastive losses. We verify our method on two large open-source medical datasets, the Medical Information Mart for Intensive Care (MIMIC-III) and the eICU Collaborative Research Database (eICU). Experiment results show that our method improves performance over straightforward graph-based methods on prediction tasks of patient readmission, mortality, and length of stay.

IJCAI Conference 2024 Conference Paper

Dual Semantic Fusion Hashing for Multi-Label Cross-Modal Retrieval

  • Kaiming Liu
  • Yunhong Gong
  • Yu Cao
  • Zhenwen Ren
  • Dezhong Peng
  • Yuan Sun

Cross-modal hashing (CMH) has been widely used for multi-modal retrieval tasks due to its low storage cost and fast query speed. Although existing CMH methods achieve promising performance, most of them rely mainly on coarse-grained supervision information (i.e., the pairwise similarity matrix) to measure the semantic similarities between all instances, ignoring the impact of multi-label distribution. To address this issue, we construct fine-grained semantic similarity to explore the cluster-level semantic relationships between multi-label data, and propose a new dual semantic fusion hashing (DSFH) method for multi-label cross-modal retrieval. Specifically, we first learn the modal-specific representation and consensus hash codes, thereby merging specificity with consistency. Then, we fuse the coarse-grained and fine-grained semantics to mine multi-level semantic relationships, thereby enhancing hash code discrimination. Extensive experiments on three benchmarks demonstrate the superior performance of our DSFH compared with 16 state-of-the-art methods.

AAAI Conference 2024 Conference Paper

Transformer-Based Selective Super-resolution for Efficient Image Refinement

  • Tianyi Zhang
  • Kishore Kasichainula
  • Yaoxin Zhuo
  • Baoxin Li
  • Jae-Sun Seo
  • Yu Cao

Conventional super-resolution methods suffer from two drawbacks: substantial computational cost in upscaling an entire large image, and the introduction of extraneous or potentially detrimental information for downstream computer vision tasks during the refinement of the background. To solve these issues, we propose a novel transformer-based algorithm, Selective Super-Resolution (SSR), which partitions images into non-overlapping tiles, selects tiles of interest at various scales with a pyramid architecture, and exclusively reconstructs these selected tiles with deep features. Experimental results on three datasets demonstrate the efficiency and robust performance of our approach for super-resolution. Compared to the state-of-the-art methods, the FID score is reduced from 26.78 to 10.41 with 40% reduction in computation cost for the BDD100K dataset.

NeurIPS Conference 2023 Conference Paper

Exploring the Optimal Choice for Generative Processes in Diffusion Models: Ordinary vs Stochastic Differential Equations

  • Yu Cao
  • Jingrun Chen
  • Yixin Luo
  • Xiang Zhou

The diffusion model has shown remarkable success in computer vision, but it remains unclear whether the ODE-based probability flow or the SDE-based diffusion model is superior, and under what circumstances. Comparing the two is challenging due to dependencies on data distributions, score training, and other numerical issues. In this paper, we study the problem mathematically for two limiting scenarios: the zero-diffusion (ODE) case and the large-diffusion case. We first introduce a pulse-shape error to perturb the score function and analyze the error accumulation of sampling quality, followed by a thorough analysis of generalization to arbitrary errors. Our findings indicate that when the perturbation occurs at the end of the generative process, the ODE model outperforms the SDE model with a large diffusion coefficient. However, when the perturbation occurs earlier, the SDE model outperforms the ODE model, and we demonstrate that the error of sample generation due to such a pulse-shape perturbation is exponentially suppressed as the diffusion term's magnitude increases to infinity. Numerical validation of this phenomenon is provided using Gaussian, Gaussian-mixture, and Swiss-roll distributions, as well as realistic datasets such as MNIST and CIFAR-10.
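The ODE-versus-SDE contrast can be illustrated on the simplest possible target: for a standard normal with exact score s(x) = -x, both the probability-flow ODE and the reverse SDE should leave N(0, 1) samples statistically unchanged. The noise schedule and step sizes below are illustrative choices, not the paper's setup:

```python
import numpy as np

# Variance-preserving diffusion with target N(0, 1), whose exact score is
# s(x, t) = -x. Both reverse-time samplers (simulated forward in reverse
# time tau) should preserve the N(0, 1) distribution.
rng = np.random.default_rng(0)
beta, dt, n_steps = 1.0, 1e-3, 2000

x_ode = rng.normal(size=5000)   # start both samplers from the prior N(0, 1)
x_sde = x_ode.copy()

for _ in range(n_steps):
    # probability-flow ODE: dx = [beta/2 * x + beta/2 * s(x)] dtau
    x_ode += (0.5 * beta * x_ode + 0.5 * beta * (-x_ode)) * dt  # drift is exactly zero
    # reverse SDE: dx = [beta/2 * x + beta * s(x)] dtau + sqrt(beta) dW
    x_sde += (0.5 * beta * x_sde + beta * (-x_sde)) * dt
    x_sde += np.sqrt(beta * dt) * rng.normal(size=x_sde.shape)

# Both samplers keep the empirical mean near 0 and std near 1; with a
# perturbed score the two accumulate error differently, which is what
# the paper analyzes.
```

With the exact score the ODE drift cancels identically, while the SDE continually injects and removes noise; it is precisely under score perturbations that the two behave differently.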

ICRA Conference 2023 Conference Paper

Stable Station Keeping of Autonomous Sailing Robots via the Switched Systems Approach for Ocean Observation

  • Weimin Qi
  • Qinbo Sun
  • Yu Cao
  • Huihuan Qian

Ocean observation is an emerging field in which sailing robots have several promising features (e.g., long-range sailing, environmental friendliness, energy saving, and low noise) for performing tasks. In this paper, we define an ocean observation mission in a restricted target area as a station keeping problem. Inspired by an orientation-restricted Dubins path method, the robot keeps sailing and collecting data in a smooth reciprocation, where the trajectories consist of upwind sailing segments and downwind turning parts divided by a goal area and an acceptable area. The upwind sailing segments are of interest for data acquisition. However, system stability cannot be guaranteed during the whole reciprocation, especially when sailing outside the goal area. To this end, we adopt a switched systems approach and propose a desired heading generation scheme to realize safe and stable control in both areas. The stability of the subsystems is proved with Lyapunov-like functions. The stable station keeping scheme is verified in both simulation and real experiments. Finally, we completed continuous and effective observation within 50 minutes in a goal area with a radius of 50 meters using a catamaran robot named OceanVoy460.

JBHI Journal 2022 Journal Article

AFP-Mask: Anchor-Free Polyp Instance Segmentation in Colonoscopy

  • Dechun Wang
  • Shuijiao Chen
  • Xinzi Sun
  • Qilei Chen
  • Yu Cao
  • Benyuan Liu
  • Xiaowei Liu

Colorectal cancer (CRC) is a common and lethal disease. Globally, CRC is the third most commonly diagnosed cancer in males and the second in females. The most effective way to prevent CRC is to use colonoscopy to identify and remove precancerous growths at an early stage. The detection and removal of colorectal polyps have been found to be associated with a reduction in mortality from colorectal cancer. However, the false negative rate of polyp detection during colonoscopy is often high, even for experienced physicians. With recent advances in deep-learning-based object detection techniques, automated polyp detection shows great potential in helping physicians reduce the false negative rate during colonoscopy. In this paper, we propose a novel anchor-free instance segmentation framework that can localize polyps and produce the corresponding instance-level masks without using predefined anchor boxes. Our framework consists of two branches: (a) an object detection branch that performs classification and localization, and (b) a mask generation branch that produces instance-level masks. Instead of predicting a two-dimensional mask directly, we encode it into a compact representation vector, which allows us to incorporate instance segmentation with one-stage bounding-box detectors in a simple yet effective way. Moreover, our proposed encoding method can be trained jointly with the object detector. Our experiment results show that our framework achieves a precision of 99.36% and a recall of 96.44% on public datasets, outperforming existing anchor-free instance segmentation methods by at least 2.8% in mIoU on our private dataset.

AAAI Conference 2022 Conference Paper

Gradient-Based Novelty Detection Boosted by Self-Supervised Binary Classification

  • Jingbo Sun
  • Li Yang
  • Jiaxin Zhang
  • Frank Liu
  • Mahantesh Halappanavar
  • Deliang Fan
  • Yu Cao

Novelty detection aims to automatically identify out-of-distribution (OOD) data without any prior knowledge of them. It is a critical step in data monitoring, behavior analysis, and other applications, helping enable continual learning in the field. Conventional methods of OOD detection perform multi-variate analysis on an ensemble of data or features, and usually resort to supervision with OOD data to improve accuracy. In reality, such supervision is impractical, as one cannot anticipate the anomalous data. In this paper, we propose a novel, self-supervised approach that does not rely on any pre-defined OOD data: (1) the new method evaluates the Mahalanobis distance of the gradients between the in-distribution and OOD data; (2) it is assisted by a self-supervised binary classifier that guides label selection to generate the gradients and maximize the Mahalanobis distance. In evaluations on multiple datasets, such as CIFAR-10, CIFAR-100, SVHN, and TinyImageNet, the proposed approach consistently outperforms state-of-the-art supervised and unsupervised methods in the area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPR) metrics. We further demonstrate that this detector is able to accurately learn one OOD class in continual learning.
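The Mahalanobis-distance step at the heart of the method can be sketched in isolation. The vectors below are synthetic stand-ins; in actual use they would be per-sample gradients extracted from a trained network:

```python
import numpy as np

# Fit a Gaussian to in-distribution gradient vectors, then score new
# vectors by Mahalanobis distance; large distances flag OOD inputs.

def fit_gaussian(vecs):
    """Estimate the mean and (regularized) precision matrix of the vectors."""
    mu = vecs.mean(axis=0)
    cov = np.cov(vecs, rowvar=False)
    prec = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mu, prec

def mahalanobis(x, mu, prec):
    d = x - mu
    return float(np.sqrt(d @ prec @ d))

rng = np.random.default_rng(0)
in_vecs = rng.normal(0.0, 1.0, size=(500, 8))   # synthetic in-distribution vectors
mu, prec = fit_gaussian(in_vecs)

id_score = mahalanobis(rng.normal(0.0, 1.0, size=8), mu, prec)
ood_score = mahalanobis(np.full(8, 6.0), mu, prec)  # a far-away vector
# ood_score is much larger than id_score, so thresholding the distance
# separates the two populations.
```

The paper's contribution beyond this building block is generating the gradients via a self-supervised binary classifier so that the distance gap is maximized without ever seeing OOD data.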

NeurIPS Conference 2022 Conference Paper

Learning Optimal Flows for Non-Equilibrium Importance Sampling

  • Yu Cao
  • Eric Vanden-Eijnden

Many applications in the computational sciences and statistical inference require the computation of expectations with respect to complex high-dimensional distributions with unknown normalization constants, as well as the estimation of these constants. Here we develop a method to perform these calculations based on generating samples from a simple base distribution, transporting them along the flow generated by a velocity field, and performing averages along these flowlines. This non-equilibrium importance sampling (NEIS) strategy is straightforward to implement and can be used for calculations with arbitrary target distributions. On the theory side, we discuss how to tailor the velocity field to the target and establish general conditions under which the proposed estimator is a perfect estimator with zero variance. We also draw connections between NEIS and approaches based on mapping a base distribution onto a target via a transport map. On the computational side, we show how to use deep learning to represent the velocity field by a neural network and train it towards the zero-variance optimum. These results are illustrated numerically on benchmark examples (with dimension up to 10), where, after training the velocity field, the variance of the NEIS estimator is reduced by up to six orders of magnitude compared with that of a vanilla estimator. We also compare the performance of NEIS with that of Neal's annealed importance sampling (AIS).
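As a point of reference for the "vanilla estimator" the abstract compares against, plain importance sampling from a base distribution estimates an unknown normalization constant as a mean of density ratios. The 1-D target below is an illustrative choice, not one of the paper's benchmarks:

```python
import numpy as np

# Vanilla importance sampling: estimate Z = integral of exp(-U(x)) dx by
# averaging exp(-U(x)) / rho0(x) over samples x ~ rho0, with rho0 = N(0, 1).

def U(x):
    # unnormalized target: exp(-U) is proportional to a N(2, 0.5^2) density
    return (x - 2.0) ** 2 / (2.0 * 0.5 ** 2)

rng = np.random.default_rng(0)
xs = rng.normal(size=200_000)                      # base samples from rho0
log_rho0 = -0.5 * xs ** 2 - 0.5 * np.log(2.0 * np.pi)
Z_hat = float(np.mean(np.exp(-U(xs) - log_rho0)))  # mean importance weight

# The exact value is Z = sqrt(2 * pi) * 0.5, about 1.2533. NEIS improves on
# this baseline by transporting the base samples along a learned flow,
# driving the weight variance toward zero.
```

The further the target sits from the base, the heavier-tailed these weights become, which is exactly the variance problem the learned velocity field is trained to remove.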

IJCAI Conference 2020 Conference Paper

Efficient and Modularized Training on FPGA for Real-time Applications

  • Shreyas Kolala Venkataramanaiah
  • Xiaocong Du
  • Zheng Li
  • Shihui Yin
  • Yu Cao
  • Jae-Sun Seo

Training of deep Convolutional Neural Networks (CNNs) requires a tremendous amount of computation and memory, and thus GPUs are widely used to meet the computation demands of these complex training tasks. However, lacking the flexibility to exploit architectural optimizations, GPUs have poor energy efficiency and are hard to deploy on energy-constrained platforms. FPGAs are highly suitable for training, such as real-time learning at the edge, as they provide higher energy efficiency and better flexibility to support algorithmic evolution. This paper first develops a training accelerator on FPGA with 16-bit fixed-point computing and various training modules. Furthermore, leveraging model segmentation techniques from Progressive Segmented Training, the newly developed FPGA accelerator is applied to online learning, achieving much lower computation cost. We demonstrate the performance of representative CNNs trained for CIFAR-10 on an Intel Stratix-10 MX FPGA, evaluating both the conventional training procedure and the online learning algorithm.

AAAI Conference 2020 Short Paper

SATNet: Symmetric Adversarial Transfer Network Based on Two-Level Alignment Strategy towards Cross-Domain Sentiment Classification (Student Abstract)

  • Yu Cao
  • Hua Xu

In recent years, domain adaptation tasks have attracted much attention, especially, the task of cross-domain sentiment classification (CDSC). In this paper, we propose a novel domain adaptation method called Symmetric Adversarial Transfer Network (SATNet). Experiments on the Amazon reviews dataset demonstrate the effectiveness of SATNet.

AAAI Conference 2020 Conference Paper

Unsupervised Domain Adaptation on Reading Comprehension

  • Yu Cao
  • Meng Fang
  • Baosheng Yu
  • Joey Tianyi Zhou

Reading comprehension (RC) has been studied on a variety of datasets, with performance boosted by deep neural networks. However, the generalization capability of these models across different domains remains unclear. To alleviate this problem, we investigate unsupervised domain adaptation on RC, wherein a model is trained on a labeled source domain and applied to a target domain with only unlabeled samples. We first show that, even with the powerful BERT contextual representation, a model cannot generalize well from one domain to another. To solve this, we provide a novel conditional adversarial self-training method (CASe). Specifically, our approach leverages a BERT model fine-tuned on the source dataset along with confidence filtering to generate reliable pseudo-labeled samples in the target domain for self-training. On the other hand, it further reduces the domain distribution discrepancy through conditional adversarial learning across domains. Extensive experiments show our approach achieves comparable performance to supervised models on multiple large-scale benchmark datasets.
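The confidence-filtering step that builds pseudo-labels for self-training can be sketched on its own. The threshold below is an illustrative value, not the paper's setting:

```python
import numpy as np

# Keep only target-domain predictions whose top softmax probability clears
# a confidence threshold; confident predictions become pseudo-labels for
# the next round of self-training.

def select_pseudo_labels(probs, threshold=0.9):
    """Return (pseudo_labels, keep_mask) for a (n_samples, n_classes) array."""
    labels = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= threshold
    return labels[keep], keep

probs = np.array([[0.95, 0.05],   # confident -> kept as pseudo-label 0
                  [0.60, 0.40],   # uncertain -> filtered out
                  [0.02, 0.98]])  # confident -> kept as pseudo-label 1
labels, keep = select_pseudo_labels(probs)
```

Raising the threshold trades pseudo-label coverage for reliability, which is the lever that keeps self-training from amplifying its own mistakes.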

IJCAI Conference 2019 Conference Paper

CLVSA: A Convolutional LSTM Based Variational Sequence-to-Sequence Model with Attention for Predicting Trends of Financial Markets

  • Jia Wang
  • Tong Sun
  • Benyuan Liu
  • Yu Cao
  • Hongwei Zhu

Financial markets are a complex dynamical system. The complexity comes from the interaction between a market and its participants: the integrated outcome of the activities of all participants determines the market's trend, while the market's trend in turn affects the activities of participants. These interwoven interactions keep financial markets evolving. Inspired by stochastic recurrent models that successfully capture the variability observed in natural sequential data such as speech and video, we propose CLVSA, a hybrid model that consists of stochastic recurrent networks, the sequence-to-sequence architecture, self- and inter-attention mechanisms, and convolutional LSTM units, to variationally capture the underlying features in raw financial trading data. Our model outperforms basic models such as convolutional neural networks, vanilla LSTM networks, and the sequence-to-sequence model with attention, based on backtesting results for six futures from January 2010 to December 2017. Our experimental results show that, by introducing an approximate posterior, CLVSA takes advantage of an extra regularizer based on the Kullback-Leibler divergence to prevent itself from overfitting.
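The Kullback-Leibler regularizer mentioned at the end of the abstract has a standard closed form when the approximate posterior is a diagonal Gaussian measured against a standard normal prior. This is the generic VAE-style expression, not CLVSA's exact architecture:

```python
import numpy as np

# KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form; adding this
# term to the training loss penalizes posteriors that drift far from the
# prior, which is the regularizing effect described in the abstract.

def kl_to_standard_normal(mu, log_var):
    return 0.5 * float(np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0))

# The term vanishes exactly when the approximate posterior equals the
# prior and grows as the posterior mean or variance departs from it.
```

In practice this term is weighted against the reconstruction loss, and that weight controls how strongly the posterior is pulled back toward the prior.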