Arrow Research search

Author name cluster

Shuo Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

53 papers
2 author rows

Possible papers (53)

AAAI Conference 2026 Conference Paper

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

  • Shuo Yang
  • Qihui Zhang
  • Yuyang Liu
  • Yue Huang
  • Xiaojun Jia
  • Kun-Peng Ning
  • Jia-Yu Yao
  • Jigang Wang

Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction—defined by weight differences between aligned (safe) and unaligned models—rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
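
The anchoring idea lends itself to a compact regularizer. Below is a minimal, hypothetical sketch in PyTorch of penalizing the component of an update orthogonal to the alignment direction; the function name and the scalar weight `lam` are illustrative, not the paper's implementation.

```python
import torch

def asft_penalty(update, alignment_dir, lam=1.0):
    """Hypothetical AsFT-style regularizer: penalize the part of a
    (flattened) weight update that leaves the alignment direction.

    update:        flattened parameter update (delta W)
    alignment_dir: flattened weight difference between the aligned
                   (safe) and unaligned models
    """
    d = alignment_dir / alignment_dir.norm()
    parallel = (update @ d) * d            # projection onto the alignment direction
    orthogonal = update - parallel         # component leaving the "safety basin"
    return lam * orthogonal.pow(2).sum()   # added to the fine-tuning task loss
```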

AAAI Conference 2026 Conference Paper

Careful Queries, Credible Results: Teaching RAG Models Advanced Web Search Tools with Reinforcement Learning

  • Yuqin Dai
  • Shuo Yang
  • Guoqing Wang
  • Yong Deng
  • Zhanwei Zhang
  • Jun Yin
  • Pengyu Zeng
  • Zhenzhe Ying

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating up-to-date external knowledge, yet real-world web environments present two key challenges: pervasive misinformation, which introduces unreliable or misleading content that can degrade retrieval accuracy, and the underutilization of web search tools, which, if employed effectively, could sharpen query precision and help mitigate this noise, ultimately improving retrieval results in RAG systems. To address these issues, we propose WebFilter, a novel RAG framework that generates source-restricted queries and filters out unreliable content. This approach combines a retrieval filtering mechanism with a behavior- and outcome-driven reward strategy, optimizing both query formulation and retrieval outcomes. Extensive experiments demonstrate that WebFilter improves answer quality and retrieval precision, outperforming existing RAG methods on both in-domain and out-of-domain benchmarks.

AAAI Conference 2026 Conference Paper

Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

  • Yuchen Zhou
  • Jiayu Tang
  • Shuo Yang
  • Xiaoyan Xiao
  • Yuqin Dai
  • Wenhao Yang
  • Chao Gou
  • Xiaobo Xia

Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical "logical blindspots" that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even the state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly on challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs' logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP's substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.

AAAI Conference 2026 Conference Paper

Look-Back: Implicit Visual Re-focusing in MLLM Reasoning

  • Shuo Yang
  • Yuwei Niu
  • Yuyang Liu
  • Yang Ye
  • Bin Lin
  • Li Yuan

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to look back at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.

AAAI Conference 2026 Conference Paper

Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering

  • Jinfeng Xu
  • Zheyu Chen
  • Shuo Yang
  • Jinze Li
  • Ziyue Peng
  • Zewei Liu
  • Hewei Wang
  • Jiayi Zhang

Multiple clustering aims to discover diverse latent structures from different perspectives, yet existing methods generate exhaustive clusterings without discerning user interest, necessitating laborious manual screening. Current multi-modal solutions suffer from static semantic rigidity: predefined candidate words fail to adapt to dataset-specific concepts, and fixed fusion strategies ignore evolving feature interactions. To overcome these limitations, we propose Multi-DProxy, a novel multi-modal dynamic proxy learning framework that leverages cross-modal alignment through learnable textual proxies. Multi-DProxy introduces: 1) gated cross-modal fusion, which synthesizes discriminative joint representations by adaptively modeling feature interactions; 2) dual-constraint proxy optimization, where user-interest constraints enforce semantic consistency with domain concepts while concept constraints employ hard-example mining to enhance cluster discrimination; and 3) dynamic candidate management, which refines textual proxies through iterative clustering feedback. Multi-DProxy thus not only effectively captures a user's interest through proxies but also enables the identification of relevant clusterings with greater precision. Extensive experiments demonstrate state-of-the-art performance with significant improvements over existing methods across a broad set of multi-clustering benchmarks.

AAAI Conference 2026 Conference Paper

Next Patch Prediction for AutoRegressive Visual Generation

  • Yatian Pang
  • Peng Jin
  • Shuo Yang
  • Bin Zhu
  • Bin Lin
  • Chaoran Feng
  • Zhenyu Tang
  • Liuhan Chen

Autoregressive models built on the Next Token Prediction (NTP) paradigm show great potential in developing a unified framework that integrates both language and vision tasks. Pioneering works have introduced NTP to autoregressive visual generation tasks. In this work, we rethink NTP for autoregressive image generation and extend it to a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens with higher information density. By using patch tokens as a more compact input sequence, the autoregressive model is trained to predict the next patch, significantly reducing computational costs. To further exploit the natural hierarchical structure of image data, we propose a multi-scale coarse-to-fine patch grouping strategy. With this strategy, the training process begins with a large patch size and ends with vanilla NTP where the patch size is 1x1, thus maintaining the original inference process without modifications. Extensive experiments across a diverse range of model sizes demonstrate that NPP can reduce the training cost to around 0.6 times while improving image generation quality by up to 1.0 FID score on the ImageNet 256x256 generation benchmark. Notably, our method retains the original autoregressive model architecture without introducing additional trainable parameters or a custom-designed image tokenizer, offering a flexible and plug-and-play solution for enhancing autoregressive visual generation.
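
The token-to-patch aggregation can be pictured as simple pooling over the token grid. A toy sketch follows (PyTorch; `group_into_patches` and its arguments are invented for illustration, and average pooling is only one plausible aggregation choice):

```python
import torch
import torch.nn.functional as F

def group_into_patches(tokens, grid, k):
    """Aggregate an H x W grid of image tokens into coarser patch tokens
    by average pooling (one plausible reading of NPP's grouping; k=1
    recovers vanilla next-token prediction).  Assumes k divides H and W.

    tokens: (B, H*W, C) token embeddings
    grid:   (H, W)
    """
    B, N, C = tokens.shape
    H, W = grid
    x = tokens.transpose(1, 2).reshape(B, C, H, W)
    x = F.avg_pool2d(x, kernel_size=k)      # (B, C, H//k, W//k)
    return x.flatten(2).transpose(1, 2)     # shorter, denser patch sequence
```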

NeurIPS Conference 2025 Conference Paper

CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

  • Zheyuan Liu
  • Munan Ning
  • Qihui Zhang
  • Shuo Yang
  • Zhongrui Wang
  • Yiwei Yang
  • Xianzhe Xu
  • Yibing Song

Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.

JBHI Journal 2025 Journal Article

Hepatic Vessel Roadmap Prediction Using Adaptive Tracking and Bending Energy Modeling in X-Ray Fluoroscopy

  • Shuo Yang
  • Deqiang Xiao
  • Haixiao Geng
  • Danni Ai
  • Jingfan Fan
  • Tianyu Fu
  • Hong Song
  • Feng Duan

Dynamic visualization of the hepatic vessel is crucial in X-ray image-guided transjugular intrahepatic portosystemic shunt (TIPS) procedures. However, intraoperative breathing and the presence of guidewires complicate the prediction of the vessel position and posture without contrast agents. The respiration compensation technique aims to utilize the intraoperative respiration modeling to deform the initial vessel roadmap, thereby achieving the dynamic vessel prediction in the X-ray image sequence for the interventional guidance. Therefore, we propose a novel respiration compensation framework utilizing the adaptive tracking and bending energy modeling to achieve the stable vessel roadmap prediction under free breathing. First, we introduce the inter-frame rigid displacement compensation module based on the domain adaptation and adaptive centroid tracking. This module fits the respiratory curve from the X-ray images, providing the temporal motion priors for aligning roadmaps across frames. Second, we propose the novel deformation compensation module based on the bending energy modeling to correct the respiratory motion, wherein we utilize the energy features of the guidewires to drive the non-rigid registration. The control points sampled by the bending energy guide the local image to form the deformation field, facilitating the dynamic overlap of the vessel roadmaps in X-ray images. Experimental results on simulated and clinical datasets show an average tracking error of 0.95 $\pm$ 0.26 mm and 1.49 $\pm$ 0.40 mm, respectively. The effective and fast (mean 57 ms per frame) compensation achieved by our framework has the potential for improving the outcome of liver intervention and reducing the reliance on contrast agents.

ECAI Conference 2025 Conference Paper

JOIG: Joint Optimization Model of Image Features and Constraint Geometry Fusion for Generalizable Gaussian

  • Shuo Yang
  • Jianbo Zhang
  • Yongming Han
  • Liang Yuan

Novel View Synthesis (NVS) seeks to generate realistic novel views from limited source images, offering an effective solution for 3D reconstruction in complex or unknown environments. Achieving high generalization under occlusion, varying illumination, and sparse observations remains challenging, largely hinging on the effective extraction, optimization, and fusion of image features and spatial geometry. In this work, we propose JOIG — a Joint Optimization Model of Image Features and Constraint Geometry Fusion for generalizable 3D Gaussian splatting. JOIG introduces three key components: Multiscale Dimension Rotation Fusion (MDRF) to capture intrinsic dependencies across feature dimensions for enhanced image encoding, Geometry Self-Correcting Aggregation (GSCA) to refine multi-view geometry with depth-guided reweighting, and Geometry-Image Feature Aggregation (GIFA) to achieve pixel-aligned fusion of spatial and image information. Extensive experiments on DTU, LLFF, NeRF Synthetic, and Tanks and Temples datasets demonstrate that JOIG achieves state-of-the-art generalization performance, significantly improving both quantitative metrics and visual fidelity in novel view synthesis.

NeurIPS Conference 2025 Conference Paper

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

  • Xiaohao Liu
  • Xiaobo Xia
  • Weixiang Zhao
  • Manyi Zhang
  • Xianzhi Yu
  • Xiu Su
  • Shuo Yang
  • See-Kiong Ng

Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction (L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code is available at https://github.com/Xiaohao-Liu/L-MTP.
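
The leap mechanism can be illustrated with a handful of prediction heads assigned to non-adjacent offsets. A toy sketch (PyTorch; the class and parameter names are invented, and the real architecture surely differs):

```python
import torch.nn as nn

class LeapHeads(nn.Module):
    """Toy leap-multi-token heads: head i predicts the token at offset
    1 + i*leap (e.g. offsets 1, 3, 5) rather than at adjacent offsets."""

    def __init__(self, hidden, vocab, n_heads=3, leap=2):
        super().__init__()
        self.offsets = [1 + i * leap for i in range(n_heads)]
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in self.offsets)

    def forward(self, h):
        # h: (B, T, hidden) trunk hidden states; one forward pass yields
        # logits for every leap offset simultaneously
        return {off: head(h) for off, head in zip(self.offsets, self.heads)}
```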

AAAI Conference 2025 Conference Paper

MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation

  • Jinfeng Xu
  • Zheyu Chen
  • Shuo Yang
  • Jinze Li
  • Hewei Wang
  • Edith C. H. Ngai

As multimedia information proliferates, multimodal recommendation systems have garnered significant attention. These systems leverage multimodal information to alleviate the data sparsity issue inherent in recommendation systems, thereby enhancing the accuracy of recommendations. Due to the natural semantic disparities among multimodal features, recent research has primarily focused on cross-modal alignment using self-supervised learning to bridge these gaps. However, aligning different modal features might result in the loss of valuable interaction information, distancing them from ID embeddings. It is crucial to recognize that the primary goal of multimodal recommendation is to predict user preferences, not merely to understand multimodal content. To this end, we propose a new Multi-level sElf-supervised learNing for mulTimOdal Recommendation (MENTOR) method, which effectively reduces the gap among modalities while retaining interaction information. Specifically, MENTOR begins by extracting representations from each modality using both heterogeneous user-item and homogeneous item-item graphs. It then employs a multilevel cross-modal alignment task, guided by ID embeddings, to align modalities across multiple levels while retaining historical interaction information. To balance effectiveness and efficiency, we further propose an optional general feature enhancement task that bolsters the general features from both structure and feature perspectives, thus enhancing the robustness of our model.

IJCAI Conference 2025 Conference Paper

METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection

  • Yongqi Wang
  • Xinxiao Wu
  • Shuo Yang

Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline that first detects objects and then classifies relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternately enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance. Code is available at https://github.com/wangyongqi558/METOR.

AAAI Conference 2025 Short Paper

Multimodal Commonsense Knowledge Distillation for Visual Question Answering (Student Abstract)

  • Shuo Yang
  • Siwen Luo
  • Soyeon Caren Han

Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performances in general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating high-quality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) following a teacher-student environment. This proposed framework is flexible with any type of teacher and student models without further fine-tuning, and has achieved competitive performances on the ScienceQA dataset. The code is in https://github.com/adlnlp/MCKDVQA.

ICLR Conference 2025 Conference Paper

Neural networks on Symmetric Spaces of Noncompact Type

  • Xuan Son Nguyen
  • Shuo Yang
  • Aymeric Histace

Recent works have demonstrated promising performances of neural networks on hyperbolic spaces and symmetric positive definite (SPD) manifolds. These spaces belong to a family of Riemannian manifolds referred to as symmetric spaces of noncompact type. In this paper, we propose a novel approach for developing neural networks on such spaces. Our approach relies on a unified formulation of the distance from a point to a hyperplane on the considered spaces. We show that some existing formulations of the point-to-hyperplane distance can be recovered by our approach under specific settings. Furthermore, we derive a closed-form expression for the point-to-hyperplane distance in higher-rank symmetric spaces of noncompact type equipped with G-invariant Riemannian metrics. The derived distance then serves as a tool to design fully-connected (FC) layers and an attention mechanism for neural networks on the considered spaces. Our approach is validated on challenging benchmarks for image classification, electroencephalogram (EEG) signal classification, image generation, and natural language inference.

ICLR Conference 2025 Conference Paper

PiCO: Peer Review in LLMs based on Consistency Optimization

  • Kun-Peng Ning
  • Shuo Yang
  • Yuyang Liu
  • Jia-Yu Yao
  • Zhen-Hui Liu
  • Yonghong Tian 0001
  • Yibing Song
  • Li Yuan 0007

Existing evaluation methods for large language models (LLMs) typically focus on testing performance on closed-environment, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction, utilizing peer-review mechanisms to measure LLMs automatically without any human feedback. In this setting, both open-source and closed-source LLMs lie in the same environment, capable of answering unlabeled questions and evaluating each other, where each LLM's response score is jointly determined by other anonymous ones. During this process, we find that answers more recognized by other "reviewers" (models) usually come from LLMs with stronger abilities, and that these models can also evaluate others' answers more accurately. We formalize this as a consistency assumption, i.e., a model's ability and its score usually agree. We exploit this to optimize each model's confidence, thereby re-ranking the LLMs to be closer to human rankings. We perform experiments on multiple datasets with standard rank-based metrics, validating the effectiveness of the proposed approach.
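
The consistency assumption suggests a simple fixed-point view: review weights and scores should reinforce each other. A toy numeric sketch (NumPy; this update rule is a deliberately simplified stand-in for the paper's optimization):

```python
import numpy as np

def pico_scores(ratings, iters=100):
    """Toy consistency iteration: ratings[i, j] >= 0 is how model i rates
    model j's answers.  A model's reviewer weight and its own score are
    updated until they agree, so stronger models review with more weight."""
    n = ratings.shape[0]
    w = np.ones(n) / n                 # initial reviewer confidences
    for _ in range(iters):
        scores = w @ ratings           # score = confidence-weighted peer reviews
        w = scores / scores.sum()      # consistency: ability tracks score
    return scores
```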

NeurIPS Conference 2025 Conference Paper

Radial Attention: $\mathcal{O}(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

  • Xingyang Li
  • Muyang Li
  • Tianle Cai
  • Haocheng Xi
  • Shuo Yang
  • Yujun Lin
  • Lvmin Zhang
  • Songlin Yang

Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $\mathcal{O}(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $\mathcal{O}(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference. Code is released at https://github.com/mit-han-lab/radial-attention.
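
The static mask itself is easy to picture: attend to a spatial window that halves with each step of temporal distance. A small sketch (PyTorch; the function and the exact halving rule are illustrative, not the released kernels):

```python
import torch

def radial_mask(n_frames, tokens_per_frame, base_window):
    """Build a boolean attention mask where each token attends to
    spatially nearby tokens, with the window shrinking exponentially
    as temporal (frame) distance grows."""
    n = n_frames * tokens_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)
    for qf in range(n_frames):
        for kf in range(n_frames):
            window = max(base_window >> abs(qf - kf), 1)  # decaying compute
            for q in range(tokens_per_frame):
                lo = max(q - window, 0)
                hi = min(q + window + 1, tokens_per_frame)
                row = qf * tokens_per_frame + q
                col = kf * tokens_per_frame
                mask[row, col + lo: col + hi] = True
    return mask
```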

AAAI Conference 2025 Conference Paper

RAZOR: Sharpening Knowledge by Cutting Bias with Unsupervised Text Rewriting

  • Shuo Yang
  • Bardh Prenkaj
  • Gjergji Kasneci

Despite the widespread use of LLMs due to their superior performance in various tasks, their high computational costs often lead potential users to opt for the pretraining-finetuning pipeline. However, biases prevalent in manually constructed datasets can introduce spurious correlations between tokens and labels, creating so-called shortcuts and hindering the generalizability of fine-tuned models. Existing debiasing methods often rely on prior knowledge of specific dataset biases, which is challenging to acquire a priori. We propose RAZOR (Rewriting And Zero-bias Optimization Refinement), a novel, unsupervised, and data-focused debiasing approach based on text rewriting for shortcut mitigation. RAZOR leverages LLMs to iteratively rewrite potentially biased text segments by replacing them with heuristically selected alternatives in a shortcut space defined by token statistics and positional information. This process aims to align surface-level text features more closely with diverse label distributions, thereby promoting the learning of genuine linguistic patterns. Compared with unsupervised SoTA models, RAZOR improves the F1 score by 3.5% on FEVER and 6.5% on the MNLI and SNLI datasets. Additionally, RAZOR effectively mitigates specific known biases, reducing bias-related terms by a factor of 2 without requiring prior bias information, a result on par with SoTA models that leverage prior information. Our work prioritizes data manipulation over architectural modifications, emphasizing the pivotal role of data quality in enhancing model performance and fairness. This research contributes to developing more robust evaluation benchmarks for debiasing methods by incorporating metrics for bias reduction and overall model efficacy.

ICRA Conference 2025 Conference Paper

Robots with Attitude: Singularity-Free Quaternion-Based Model-Predictive Control for Agile Legged Robots

  • Zixin Zhang 0003
  • John Z. Zhang
  • Shuo Yang
  • Zachary Manchester

We present a model-predictive control (MPC) framework for legged robots that avoids the singularities associated with common three-parameter attitude representations like Euler angles during large-angle rotations. Our method parameterizes the robot's attitude with singularity-free unit quaternions and makes modifications to the iterative linear-quadratic regulator (iLQR) algorithm to deal with the resulting geometry. The derivation of our algorithm requires only elementary calculus and linear algebra, deliberately avoiding the abstraction and notation of Lie groups. We demonstrate the performance and computational efficiency of quaternion MPC in several experiments on quadruped and humanoid robots.
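
For intuition, the quantity such a controller linearizes is a three-parameter local attitude error built from quaternion algebra rather than Euler angles. A hedged sketch (NumPy; this error convention is one common choice, not necessarily the paper's exact formulation):

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions in [w, x, y, z] order."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def attitude_error(q_ref, q):
    """Three-parameter local error between unit quaternions: the
    Rodrigues-like vector of q_ref^{-1} * q.  Well-defined for rotations
    away from 180 degrees, unlike Euler-angle differences at gimbal lock."""
    q_inv = q_ref * np.array([1.0, -1.0, -1.0, -1.0])  # conjugate = inverse
    dq = quat_mul(q_inv, q)
    return dq[1:] / max(dq[0], 1e-9)
```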

ICML Conference 2025 Conference Paper

SCISSOR: Mitigating Semantic Bias through Cluster-Aware Siamese Networks for Robust Classification

  • Shuo Yang
  • Bardh Prenkaj
  • Gjergji Kasneci

Shortcut learning undermines model generalization to out-of-distribution data. While the literature attributes shortcuts to biases in superficial features, we show that imbalances in the semantic distribution of sample embeddings induce spurious semantic correlations, compromising model robustness. To address this issue, we propose SCISSOR (Semantic Cluster Intervention for Suppressing ShORtcut), a Siamese network-based debiasing approach that remaps the semantic space by discouraging latent clusters exploited as shortcuts. Unlike prior data-debiasing approaches, SCISSOR eliminates the need for data augmentation and rewriting. We evaluate SCISSOR on 6 models across 4 benchmarks: Chest-XRay and Not-MNIST in computer vision, and GYAFC and Yelp in NLP. Compared to several baselines, SCISSOR reports +5.3 absolute points in F1 score on GYAFC, +7.3 on Yelp, +7.7 on Chest-XRay, and +1 on Not-MNIST. SCISSOR is also highly advantageous for lightweight models, with a $\sim$9.5% F1 improvement for ViT on the computer vision datasets and $\sim$11.9% for BERT on the NLP tasks. Our study redefines the landscape of model generalization by addressing overlooked semantic biases, establishing SCISSOR as a foundational framework for mitigating shortcut learning and fostering more robust, bias-resistant AI systems.

NeurIPS Conference 2025 Conference Paper

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

  • Shuo Yang
  • Haocheng Xi
  • Yilong Zhao
  • Muyang Li
  • Jintao Zhang
  • Han Cai
  • Yujun Lin
  • Xiuyu Li

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto-frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to $2.30\times$ and $1.89\times$ speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at https://github.com/svg-project/Sparse-VideoGen.
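
The semantic-aware permutation reduces to: cluster token embeddings, then reorder so each cluster is contiguous. A plain-PyTorch toy version (names are invented; the real system uses optimized GPU kernels and dynamic budgets):

```python
import torch

def semantic_permutation(tokens, n_clusters=8, iters=10):
    """Toy k-means over token embeddings followed by a cluster-sorted
    reordering, so semantically similar tokens end up in contiguous blocks.

    tokens: (N, C) float tensor; returns (permuted tokens, permutation).
    """
    centers = tokens[torch.randperm(tokens.size(0))[:n_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)   # nearest center
        for c in range(n_clusters):
            members = tokens[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(dim=0)
    order = assign.argsort()        # contiguous layout per semantic cluster
    return tokens[order], order
```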

ICRA Conference 2025 Conference Paper

Trustworthy Robot Behavior Tree Generation Based on Multi-Source Heterogeneous Knowledge Graph

  • Jianchao Yuan
  • Shuo Yang
  • Qi Zhang
  • Ge Li
  • Jianping Tang

In robotics, designing robot behavior trees (BTs) generally requires roboticists to comprehensively consider all relevant factors, including robot hardware capabilities, task descriptions, and more, posing great challenges to design quality and efficiency. The mainstream BT design paradigm has been to manually develop task-specific BT structures from a BT component framework. Meanwhile, the latest advances in Generative Pretrained Transformers (GPTs) have opened up the possibility of BT design automation. However, these approaches are generally inefficient or insufficiently trustworthy for complex robot task goals, owing to time-consuming manual design on the one hand and unreliable GPT reasoning on the other. To address these limitations, this paper proposes a novel knowledge-driven approach that builds a specialized knowledge graph from multi-source, heterogeneous, high-quality robot knowledge and uses it to reason out a trustworthy robot plan for achieving complex task goals. We then present plan transformation and BT merging algorithms that automatically generate the plan-level BT structure. Comparative experiments show that our approach generates high-quality, trustworthy BT structures, in terms of task plan accuracy and consistency as well as BT generation time, compared with manual design and GPT-based approaches.

NeurIPS Conference 2025 Conference Paper

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

  • Chaofan Lin
  • Jiaming Tang
  • Shuo Yang
  • Hanshuo Wang
  • Tian Tang
  • Boyu Tian
  • Ion Stoica
  • Song Han

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been of great importance recently. However, most existing sparse attention algorithms use a fixed budget for how many tokens to use in their computations. This simple static decision raises critical issues in real-world deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we reveal a key insight: leveraging the idea of top-$p$ sampling (a.k.a. nucleus sampling) in sparse attention enables efficient and adaptive budget decisions. Based on this, we propose Twilight, a framework that enhances any existing sparse attention algorithm with adaptive budget decision capabilities without sacrificing accuracy. Empirical results show that Twilight can adaptively prune up to 98% of tokens with nearly no accuracy loss in both mid- and long-context scenarios, leading to a $1.4\times$ speedup over state-of-the-art sparse attention mechanisms.
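
The top-p idea transfers almost verbatim from sampling to attention pruning: keep the smallest set of keys whose post-softmax mass reaches p. A minimal sketch (PyTorch; illustrative only, without the paper's hierarchical machinery):

```python
import torch

def top_p_prune(attn_logits, p=0.95):
    """Keep, per query, the fewest keys whose softmax mass reaches p,
    then renormalize -- an adaptive budget instead of a fixed one."""
    probs = torch.softmax(attn_logits, dim=-1)
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    keep_sorted = sorted_p.cumsum(-1) - sorted_p < p   # mass before token < p
    keep = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, idx, keep_sorted)
    pruned = torch.where(keep, probs, torch.zeros_like(probs))
    return pruned / pruned.sum(-1, keepdim=True)
```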

NeurIPS Conference 2025 Conference Paper

UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation

  • Jiyu Guo
  • Shuo Yang
  • Yiming Huang
  • Yancheng Long
  • Xiaobo Xia
  • Xiu Su
  • Bo Zhao
  • Zeke Xie

Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.

AAAI Conference 2025 Conference Paper

Video Summarization Using Denoising Diffusion Probabilistic Model

  • Zirui Shang
  • Yubo Zhu
  • Hongxi Li
  • Shuo Yang
  • Xinxiao Wu

Video summarization aims to eliminate visual redundancy while retaining key parts of video to construct concise and comprehensive synopses. Most existing methods use discriminative models to predict the importance scores of video frames. However, these methods are susceptible to annotation inconsistency caused by the inherent subjectivity of different annotators when annotating the same video. In this paper, we introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective, effectively reducing the interference of subjective annotation noise. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction, and generates summaries by iterative denoising. Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability. Moreover, to facilitate training DDPM with limited data, we employ an unsupervised video summarization model to implement the earlier denoising process. Extensive experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the effectiveness of our method.

NeurIPS Conference 2025 Conference Paper

WorldModelBench: Judging Video Generation Models As World Models

  • Dacheng Li
  • Yunhao Fang
  • Yukang Chen
  • Shuo Yang
  • Shiyi Cao
  • Justin Wong
  • Michael Luo
  • Xiaolong Wang

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitivity to nuanced world modeling violations: by incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law, which prior benchmarks overlook. (2) Alignment with large-scale human preferences: we crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving 9.9% lower error than GPT-4o in predicting world modeling violations with only 2B parameters. In addition, we demonstrate that training to align with human annotations by maximizing the judger's reward noticeably improves world modeling capability. The dataset is hosted on HuggingFace at https://huggingface.co/datasets/Efficient-Large-Model/worldmodelbench, and the evaluation code is available at https://github.com/WorldModelBench-Team/WorldModelBench.

ICLR Conference 2024 Conference Paper

Matrix Manifold Neural Networks++

  • Xuan Son Nguyen
  • Shuo Yang
  • Aymeric Histace

Deep neural networks (DNNs) on Riemannian manifolds have garnered increasing interest in various applied areas. For instance, DNNs on spherical and hyperbolic manifolds have been designed to solve a wide range of computer vision and natural language processing tasks. One of the key factors that contribute to the success of these networks is that spherical and hyperbolic manifolds have the rich algebraic structures of gyrogroups and gyrovector spaces. This enables principled and effective generalizations of the most successful DNNs to these manifolds. Recently, some works have shown that many concepts in the theory of gyrogroups and gyrovector spaces can also be generalized to matrix manifolds such as Symmetric Positive Definite (SPD) and Grassmann manifolds. As a result, some building blocks for SPD and Grassmann neural networks, e.g., isometric models and multinomial logistic regression (MLR), can be derived in a way that is fully analogous to their spherical and hyperbolic counterparts. Building upon these works, in this paper, we design fully-connected (FC) and convolutional layers for SPD neural networks. We also develop MLR on Symmetric Positive Semi-definite (SPSD) manifolds, and propose a method for performing backpropagation with the Grassmann logarithmic map in the projector perspective. We demonstrate the effectiveness of the proposed approach in the human action recognition and node classification tasks.

IROS Conference 2024 Conference Paper

MPGNet: Learning Move-Push-Grasping Synergy for Target-Oriented Grasping in Occluded Scenes

  • Dayou Li
  • Chenkun Zhao
  • Shuo Yang
  • Ran Song 0001
  • Xiaolei Li 0003
  • Wei Zhang 0021

This paper focuses on target-oriented grasping in occluded scenes, where the target object is specified by a binary mask and the goal is to grasp the target object with as few robotic manipulations as possible. Most existing methods rely on a push-grasping synergy to complete this task. To deliver a more powerful target-oriented grasping pipeline, we present MPGNet, a three-branch network for learning a synergy between moving, pushing, and grasping actions. We also propose a multi-stage training strategy to train the MPGNet which contains three policy networks corresponding to the three actions. The effectiveness of our method is demonstrated via both simulated and real-world experiments. Video of the real-world experiments is at https://youtu.be/S_QKZqkh0w8.

AAAI Conference 2024 Conference Paper

Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

  • Shuo Yang
  • Yongqi Wang
  • Xiaofeng Ji
  • Xinxiao Wu

Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progresses in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, have shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP well to open-vocabulary video visual relationship detection by prompt-tuning on both visual representation and language input. Specifically, we enhance the image encoder of CLIP by using spatio-temporal visual prompting to capture spatio-temporal contexts, thereby making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating recognizing novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.

ICML Conference 2023 Conference Paper

Building Neural Networks on Matrix Manifolds: A Gyrovector Space Approach

  • Xuan Son Nguyen
  • Shuo Yang

Matrix manifolds, such as manifolds of Symmetric Positive Definite (SPD) matrices and Grassmann manifolds, appear in many applications. Recently, by applying the theory of gyrogroups and gyrovector spaces, which is a powerful framework for studying hyperbolic geometry, some works have attempted to build principled generalizations of Euclidean neural networks on matrix manifolds. However, due to the lack of many concepts in gyrovector spaces for the considered manifolds, e.g., the inner product and gyroangles, techniques and mathematical tools provided by these works are still limited compared to those developed for studying hyperbolic geometry. In this paper, we generalize some notions in gyrovector spaces for SPD and Grassmann manifolds, and propose new models and layers for building neural networks on these manifolds. We show the effectiveness of our approach in two applications, i.e., human action recognition and knowledge graph completion.

ICRA Conference 2023 Conference Paper

Cerberus: Low-Drift Visual-Inertial-Leg Odometry For Agile Locomotion

  • Shuo Yang
  • Zixin Zhang 0003
  • Zhengyu Fu
  • Zachary Manchester

We present an open-source Visual-Inertial-Leg Odometry (VILO) state estimation solution for legged robots, called Cerberus, which precisely estimates position on various terrains in real-time using a set of standard sensors, including stereo cameras, IMU, joint encoders, and contact sensors. In addition to estimating robot states, we perform online kinematic parameter calibration and outlier rejection to substantially reduce position drift. Hardware experiments in various indoor and outdoor environments validate that online calibration of kinematic parameters can reduce estimation drift to less than 1% during long-distance, high-speed locomotion. Our drift results are better than those of any other state estimation method using the same set of sensors reported in the literature. Moreover, our state estimator performs well even when the robot experiences large impacts and camera occlusion. The implementation of the state estimator, along with the datasets used to compute our results, is available at https://github.com/ShuoYangRobotics/Cerberus.

ICRA Conference 2023 Conference Paper

Enhanced Balance for Legged Robots Using Reaction Wheels

  • Chi-Yen Lee
  • Shuo Yang
  • Benjamin Bokser
  • Zachary Manchester

We introduce a reaction wheel system that enhances the balancing capabilities and stability of quadrupedal robots during challenging locomotion tasks. Inspired by both the standard centroidal dynamics model common in legged robotics and models of spacecraft commonly used in the aerospace community, we model the coupled quadruped-reaction-wheel system as a gyrostat, and simplify the dynamics to formulate the problem as a linear discrete-time trajectory optimization problem. Modifications are made to a standard centroidal model-predictive control (MPC) algorithm to solve for both stance foot ground reaction forces and reaction wheel torques simultaneously. The MPC problem is posed as a quadratic program and solved online at 1000 Hz. We demonstrate improved attitude stabilization both in simulation and on hardware compared to a quadruped without reaction wheels, and perform a challenging traversal of a narrow balance beam that would be impossible for a standard quadruped. A video of our experiments is available online.

IROS Conference 2023 Conference Paper

Multi-IMU Proprioceptive Odometry for Legged Robots

  • Shuo Yang
  • Zixin Zhang 0003
  • Benjamin Bokser
  • Zachary Manchester

This paper presents a novel, low-cost proprioceptive sensing solution for legged robots with point feet to achieve accurate low-drift long-term position and velocity estimation. In addition to conventional sensors, including one body Inertial Measurement Unit (IMU) and joint encoders, we attach an additional IMU to each calf link of the robot just above the foot. An extended Kalman filter is used to fuse data from all sensors to estimate the robot's body and foot positions in the world frame. Using the additional IMUs, the filter is able to reliably determine foot contact modes and detect foot slips without tactile or pressure-based foot contact sensors. This sensing solution is validated in various hardware experiments, which confirm that it can reduce position drift by nearly an order of magnitude compared to conventional approaches with only a very modest increase in hardware and computational costs.

IJCAI Conference 2022 Conference Paper

Entity-aware and Motion-aware Transformers for Language-driven Action Localization

  • Shuo Yang
  • Xinxiao Wu

Language-driven action localization in videos is a challenging task that involves not only visual-linguistic matching but also action boundary prediction. Recent progress has been achieved through aligning language queries to video segments, but estimating precise boundaries is still under-explored. In this paper, we propose entity-aware and motion-aware Transformers that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries in a shrunken temporal region with motion queries. The entity-aware Transformer incorporates the textual entities into visual representation learning via cross-modal and cross-frame attentions to facilitate attending action-related video clips. The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales via integrating long short-term memory into the self-attention module to further improve the precision of action boundary prediction. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method achieves better performance than existing methods.

ICML Conference 2022 Conference Paper

Linear Bandit Algorithms with Sublinear Time Complexity

  • Shuo Yang
  • Tongzheng Ren
  • Sanjay Shakkottai
  • Eric Price 0001
  • Inderjit S. Dhillon
  • Sujay Sanghavi

We propose two linear bandit algorithms with per-step complexity sublinear in the number of arms $K$. The algorithms are designed for applications where the arm set is extremely large and slowly changing. Our key realization is that choosing an arm reduces to a maximum inner product search (MIPS) problem, which can be solved approximately without breaking regret guarantees. Existing approximate MIPS solvers run in sublinear time. We extend those solvers and present theoretical guarantees for online learning problems, where adaptivity (i.e., a later step depends on the feedback in previous steps) becomes a unique challenge. We then explicitly characterize the tradeoff between per-step complexity and regret. For sufficiently large $K$, our algorithms have sublinear per-step complexity and $\widetilde O(\sqrt{T})$ regret. Empirically, we evaluate our proposed algorithms in a synthetic environment and a real-world online movie recommendation problem. Our proposed algorithms can deliver more than a 72× speedup over linear-time baselines while retaining similar regret.
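
The key reduction is concrete: picking an arm is a maximum inner product search against the current parameter estimate. An exact-scan sketch (NumPy; the paper's contribution is replacing this O(Kd) scan with an approximate sublinear-time MIPS index while preserving regret, which is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 100_000, 32
arms = rng.normal(size=(K, d))        # huge, slowly changing arm set

def select_arm(theta_hat, arms):
    """Arm selection as MIPS: argmax_a <a, theta_hat>.  An approximate
    MIPS data structure would answer this query in sublinear time."""
    return int(np.argmax(arms @ theta_hat))

theta_hat = rng.normal(size=d)        # current estimate of the reward vector
print(select_arm(theta_hat, arms))
```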

AAAI Conference 2022 Conference Paper

Semantically Contrastive Learning for Low-Light Image Enhancement

  • Dong Liang
  • Ling Li
  • Mingqiang Wei
  • Shuo Yang
  • Liyan Zhang
  • Wenhan Yang
  • Yun Du
  • Huiyu Zhou

Low-light image enhancement (LLE) remains challenging due to the prevailing low-contrast and weak-visibility problems of single RGB images. In this paper, we respond to an intriguing learning-related question: can leveraging both accessible unpaired over/underexposed images and high-level semantic guidance improve the performance of cutting-edge LLE models? Here, we propose an effective semantically contrastive learning paradigm for LLE (namely SCL-LLE). Beyond existing LLE wisdom, it casts the image enhancement task as multi-task joint learning, where LLE is converted into three constraints of contrastive learning, semantic brightness consistency, and feature preservation for simultaneously ensuring exposure, texture, and color consistency. SCL-LLE allows the LLE model to learn from unpaired positives (normal-light) and negatives (over/underexposed), and enables it to interact with scene semantics to regularize the image enhancement network, an interaction between high-level semantic knowledge and the low-level signal prior that is seldom investigated in previous methods. Training on readily available open data, extensive experiments demonstrate that our method surpasses state-of-the-art LLE models over six independent cross-scene datasets. Moreover, SCL-LLE's potential to benefit downstream semantic segmentation under extremely dark conditions is discussed. Source code: https://github.com/LingLIx/SCL-LLE.

NeurIPS Conference 2022 Conference Paper

Toward Understanding Privileged Features Distillation in Learning-to-Rank

  • Shuo Yang
  • Sujay Sanghavi
  • Holakou Rahmanian
  • Jan Bakus
  • Vishwanathan S. V. N.

In learning-to-rank problems, a privileged feature is one that is available during model training but not at test time. Such features arise naturally in merchandised recommendation systems; for instance, "user clicked this item" is predictive of "user purchased this item" in offline data, but is clearly not available during online serving. Another source of privileged features is those that are too expensive to compute online but feasible to add offline. Privileged features distillation (PFD) refers to a natural idea: train a "teacher" model using all features (including privileged ones) and then use it to train a "student" model that does not use the privileged features. In this paper, we first study PFD empirically on three public ranking datasets and an industrial-scale ranking problem derived from Amazon's logs. We show that PFD outperforms several baselines (no-distillation, pretraining-finetuning, self-distillation, and generalized distillation) on all these datasets. Next, we analyze why and when PFD performs well via both empirical ablation studies and theoretical analysis for linear models. Both investigations uncover an interesting non-monotone behavior: as the predictive power of a privileged feature increases, the performance of the resulting student model initially increases but then decreases. We show that the reason for the eventual decrease is that a very predictive privileged teacher produces predictions with high variance, which lead to high-variance student estimates and inferior test performance.
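
The PFD recipe itself is a two-stage distillation. A generic sketch of the student objective (PyTorch; a plain logistic distillation loss standing in for the paper's ranking setup, with invented names):

```python
import torch
import torch.nn.functional as F

def pfd_student_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Student (trained without privileged features) fits both the hard
    labels and the soft predictions of a teacher trained with them.

    labels: float tensor of 0/1 targets, same shape as the logits.
    """
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))  # teacher soft targets
    return alpha * hard + (1 - alpha) * soft
```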

AAAI Conference 2021 Conference Paper

Adversarial Robustness through Disentangled Representations

  • Shuo Yang
  • Tianyu Guo
  • Yunhe Wang
  • Chang Xu

Despite the remarkable empirical performance of deep learning models, their vulnerability to adversarial examples has been revealed in many studies: they are prone to making erroneous predictions on inputs with imperceptible adversarial perturbations. Although recent works have substantially improved model robustness under the adversarial training strategy, an evident gap between natural accuracy and adversarial robustness remains. To mitigate this problem, in this paper we assume that the robust and non-robust representations are two basic ingredients entangled in the integral representation. For adversarial robustness, the robust representations of natural and adversarial examples should be disentangled from the non-robust part, and aligning the robust representations can bridge the gap between accuracy and robustness. Inspired by this motivation, we propose a novel defence method called Deep Robust Representation Disentanglement Network (DRRDN). Specifically, DRRDN employs a disentangler to extract and align the robust representations from both adversarial and natural examples. Theoretical analysis guarantees the mitigation of the trade-off between robustness and accuracy given good disentanglement and alignment performance. Experimental results on benchmark datasets demonstrate the empirical superiority of our method.

NeurIPS Conference 2021 Conference Paper

Does Preprocessing Help Training Over-parameterized Neural Networks?

  • Zhao Song
  • Shuo Yang
  • Ruizhe Zhang

Deep neural networks have achieved impressive performance in many areas. Designing a fast and provable method for training neural networks is a fundamental question in machine learning. The classical training method requires paying $\Omega(mnd)$ cost for both forward computation and backward computation, where $m$ is the width of the neural network and we are given $n$ training points in $d$-dimensional space. In this paper, we propose two novel preprocessing ideas to bypass this $\Omega(mnd)$ barrier: first, by preprocessing the initial weights of the neural network, we can train it in $\widetilde{O}(m^{1-\Theta(1/d)} n d)$ cost per iteration; second, by preprocessing the input data points, we can train it in $\widetilde{O}(m^{4/5} nd)$ cost per iteration. From the technical perspective, our result is a sophisticated combination of tools from different fields: greedy-type convergence analysis in optimization, sparsity observations from practical work, high-dimensional geometric search from data structures, and concentration and anti-concentration from probability. Our results also provide theoretical insights for a large number of previously established fast training methods. In addition, our classical algorithm can be generalized to the quantum computation model. Interestingly, we can obtain a similarly sublinear cost per iteration while avoiding preprocessing of the initial weights or input data points.

ICRA Conference 2021 Conference Paper

Equality Constrained Linear Optimal Control With Factor Graphs

  • Shuo Yang
  • Gerry Chen
  • Yetong Zhang
  • Howie Choset
  • Frank Dellaert

This paper presents a novel factor graph-based approach to solve the discrete-time finite-horizon Linear Quadratic Regulator problem subject to auxiliary linear equality constraints within and across time steps. We represent such optimal control problems using constrained factor graphs and optimize the factor graphs to obtain the optimal trajectory and the feedback control policies using the variable elimination algorithm with a modified Gram-Schmidt process. We prove that our approach has the same order of computational complexity as the state-of-the-art dynamic programming approach. Furthermore, current dynamic programming approaches can only handle equality constraints between variables at the same time step, but ours can handle equality constraints among any combination of variables at any time step while maintaining linear complexity with respect to trajectory length. Our approach can be used to efficiently generate trajectories and feedback control policies to achieve periodic motion or repetitive manipulation.
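
For reference, the dynamic programming baseline the paper matches in complexity is the standard backward Riccati recursion; a minimal sketch (variable names are ours):

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, Qf, T):
    """Backward Riccati recursion for discrete-time finite-horizon LQR.
    Returns feedback gains K_0..K_{T-1} with u_t = -K_t x_t."""
    P = Qf
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]
```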

IROS Conference 2021 Conference Paper

PackerBot: Variable-Sized Product Packing with Heuristic Deep Reinforcement Learning

  • Zifei Yang
  • Shuo Yang
  • Shuai Song
  • Wei Zhang 0021
  • Ran Song 0001
  • Jiyu Cheng
  • Yibin Li 0001

Product packing is a typical application in warehouse automation that aims to pick objects from unstructured piles and place them into bins with an optimized placing policy. However, completing product packing tasks in general logistics scenarios, where the objects are variable-sized and the configurations are complex, remains a significant challenge. In this work, we present PackerBot, a complete robotic pipeline for performing variable-sized product packing in unstructured scenes. First, by leveraging the imperfect experience of human packers, we propose a heuristic DRL framework for learning an optimal online 3D bin packing policy. We then integrate it with a 6-DoF suction-based picking module and a product size estimation module, leading to a complete product packing system, namely PackerBot. Extensive experimental results show that our method achieves state-of-the-art performance in both simulated and real-world tests. A video demonstration is available at: https://vsislab.github.io/packerbot.
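
As a rough illustration of the kind of packing heuristic such a DRL policy can build on (this lowest-fit heightmap rule is our simplification, not the paper's method):

```python
import numpy as np

def place_lowest(heightmap, l, w, h, bin_height):
    """Place an l x w x h item at the lowest feasible position of a
    heightmap bin (ties broken bottom-left). Returns (x, y, z) or None."""
    H, W = heightmap.shape
    best = None
    for x in range(H - l + 1):
        for y in range(W - w + 1):
            z = heightmap[x:x + l, y:y + w].max()
            if z + h <= bin_height and (best is None or (z, x, y) < best):
                best = (z, x, y)
    if best is None:
        return None  # the item does not fit anywhere
    z, x, y = best
    heightmap[x:x + l, y:y + w] = z + h  # commit the placement
    return x, y, z
```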

AAAI Conference 2020 Conference Paper

A Unified Framework for Knowledge Intensive Gradient Boosting: Leveraging Human Experts for Noisy Sparse Domains

  • Harsha Kokel
  • Phillip Odom
  • Shuo Yang
  • Sriraam Natarajan

Incorporating richer human inputs, including qualitative constraints such as monotonic and synergistic influences, has a long history in AI. Inspired by this, we consider the problem of using such influence statements within the successful gradient-boosting framework. We develop a unified framework for both classification and regression settings that can effectively and efficiently incorporate such constraints to accelerate learning toward a better model. Our results in a large number of standard domains and two particularly novel real-world domains demonstrate the superiority of using domain knowledge over treating the human as a mere labeler.
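
The paper's unified framework is its own algorithm; as a rough illustration of monotonic influence constraints in gradient boosting, here is how an off-the-shelf booster exposes them:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=500)

# Feature 0 is declared monotonically increasing in its influence,
# feature 1 monotonically decreasing.
model = XGBRegressor(monotone_constraints=(1, -1), n_estimators=50)
model.fit(X, y)
```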

ICRA Conference 2020 Conference Paper

Cross-context Visual Imitation Learning from Demonstrations

  • Shuo Yang
  • Wei Zhang 0021
  • Weizhi Lu
  • Hesheng Wang 0001
  • Yibin Li 0001

Imitation learning enables robots to learn a task by simply watching a demonstration of it. Current imitation learning methods usually require the learner and demonstrator to operate in the same context, which limits their scalability to practical applications. In this paper, we propose a more general imitation learning method that allows the learner and the demonstrator to come from different contexts, such as different viewpoints, backgrounds, and object positions and appearances. Specifically, we design a robotic system consisting of three models: a context translation model, a depth prediction model, and a multi-modal inverse dynamics model. First, the context translation model translates the demonstration from its original context into the context of the learner. Then, taking the color and depth observations as inputs, the inverse model maps the multi-modal observations to actions that reproduce the demonstration, where the depth observation is provided by the depth prediction model. By performing block stacking tasks both in simulation and in the real world, we demonstrate the cross-context learning advantage of the proposed robotic system over other systems.
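
A schematic of the three-model pipeline (module and environment interfaces are placeholders, not the paper's code):

```python
def reproduce_demo(demo_frames, translator, depth_net, inverse_model, env):
    """Schematic pipeline. `translator`, `depth_net`, and `inverse_model`
    are placeholder callables; `env.step` is assumed to return the next
    color observation."""
    translated = [translator(frame) for frame in demo_frames]  # re-contextualize
    obs = env.reset()
    for goal in translated:
        depth = depth_net(obs)                    # predicted depth observation
        action = inverse_model(obs, depth, goal)  # multi-modal inverse dynamics
        obs = env.step(action)
```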

IJCAI Conference 2020 Conference Paper

Financial Risk Analysis for SMEs with Graph-based Supply Chain Mining

  • Shuo Yang
  • Zhiqiang Zhang
  • Jun Zhou
  • Yang Wang
  • Wang Sun
  • Xingyu Zhong
  • Yanming Fang
  • Quan Yu

Small and Medium-sized Enterprises (SMEs) play a vital role in the modern economy, and in recent years financial risk analysis for SMEs has attracted much attention from financial institutions. However, financial risk analysis for SMEs usually suffers from a data deficiency problem, especially for mobile financial institutions, which seldom collect credit-related data directly from SMEs. Fortunately, although credit-related information on SMEs is hard to acquire in sufficient quantity, the interactive relationships between SMEs, which may contain valuable financial risk information, are usually available to mobile financial institutions. Identifying credit-related relationships among SMEs from massive interaction data helps model SMEs more comprehensively and thus improves the performance of financial risk analysis. In this paper, tackling the data deficiency problem of financial risk analysis for SMEs, we propose an innovative financial risk analysis framework based on graph-based supply chain mining. Specifically, to capture the credit-related topological structure and temporal variation of SMEs, we design and employ a novel spatial-temporal aware graph neural network to mine supply chain relationships on an SME graph, and then analyze credit risk based on the mined supply chain graph. Experimental results on real-world financial datasets demonstrate the effectiveness of our proposal for financial risk analysis for SMEs.
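
A minimal sketch of a spatial-temporal message-passing block of the general kind described (shapes and names are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """At each time step every node averages its neighbors' features
    (spatial aggregation); a GRU carries node state across time."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats, adj):
        # feats: (T, N, dim) node features; adj: (N, N) row-normalized
        h = torch.zeros_like(feats[0])
        for x_t in feats:
            msg = torch.relu(self.lin(adj @ x_t))  # spatial message passing
            h = self.gru(msg, h)                   # temporal update
        return h  # final node embeddings, e.g. fed to a risk classifier
```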

NeurIPS Conference 2019 Conference Paper

Interaction Hard Thresholding: Consistent Sparse Quadratic Regression in Sub-quadratic Time and Space

  • Shuo Yang
  • Yanyao Shen
  • Sujay Sanghavi

Quadratic regression involves modeling the response as a (generalized) linear function not only of the features $x^{j_1}$ but also of quadratic terms $x^{j_1}x^{j_2}$. The inclusion of such higher-order "interaction terms" in regression often provides an easy way to increase accuracy in already-high-dimensional problems. However, this explodes the problem dimension from linear $O(p)$ to quadratic $O(p^2)$, and it is common to look for sparse interactions (typically via heuristics). In this paper, we provide a new algorithm, Interaction Hard Thresholding (IntHT), which is the first to provably and accurately solve this problem in sub-quadratic time and space. It is a variant of Iterative Hard Thresholding, one that uses the special quadratic structure to devise a new way to approximately extract the top elements of a $p^2$-size gradient in sub-$p^2$ time and space. Our main result is a theoretical proof that, in spite of the many speedup-related approximations, IntHT linearly converges to a consistent estimate under standard high-dimensional sparse recovery assumptions. We also demonstrate its value via synthetic experiments. Moreover, we numerically show that IntHT can be extended to higher-order regression problems, and we theoretically analyze an SVRG variant of IntHT.
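
For orientation, plain Iterative Hard Thresholding, the template IntHT builds on, looks as follows; IntHT's contribution is replacing the full-gradient top-k step with a sub-quadratic approximate extraction over the $p^2$ interaction terms (sketch under assumed names):

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v; zero the rest."""
    out = v.copy()
    out[np.argsort(np.abs(v))[:-k]] = 0.0
    return out

def iht(X, y, k, step=0.01, iters=200):
    """Plain iterative hard thresholding for sparse linear regression."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = hard_threshold(w - step * grad, k)
    return w
```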

IROS Conference 2019 Conference Paper

Learning Actions from Human Demonstration Video for Robotic Manipulation

  • Shuo Yang
  • Wei Zhang 0021
  • Weizhi Lu
  • Hesheng Wang 0001
  • Yibin Li 0001

Learning actions from human demonstration is an emerging trend in designing intelligent robotic systems, often referred to as video-to-command. The performance of such approaches relies heavily on the quality of video captioning. However, general video captioning methods focus more on understanding the full frame, lacking consideration of the specific objects of interest in robotic manipulation. We propose a novel deep model to learn actions from human demonstration video for robotic manipulation. It consists of two deep networks: a grasp detection network (GNet) and a video captioning network (CNet). GNet performs two functions: providing grasp solutions and extracting local features for the objects of interest in robotic manipulation. CNet outputs the captioning results by fusing the features of both full frames and local objects. Experimental results on a UR5 robotic arm show that our method produces more accurate commands from video demonstration than state-of-the-art work, thereby leading to more robust grasping performance.
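
A minimal sketch of the feature-fusion idea (assumed shapes and module names, not the paper's code): global frame features and local object features are concatenated before the captioning decoder.

```python
import torch
import torch.nn as nn

class FusedCaptioner(nn.Module):
    """Concatenate global frame features with local object features
    before an LSTM decoder that emits per-step token logits."""
    def __init__(self, feat_dim=512, hidden=256, vocab=1000):
        super().__init__()
        self.rnn = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, object_feats):
        fused = torch.cat([frame_feats, object_feats], dim=-1)  # (B, T, 2D)
        h, _ = self.rnn(fused)
        return self.out(h)
```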

AAAI Conference 2019 Conference Paper

Unsupervised Fake News Detection on Social Media: A Generative Approach

  • Shuo Yang
  • Kai Shu
  • Suhang Wang
  • Renjie Gu
  • Fan Wu
  • Huan Liu

Social media has become one of the main channels through which people access and consume news, due to the speed and low cost of news dissemination. However, these same properties also make social media a hotbed of fake news dissemination, bringing negative impacts on both individuals and society. Therefore, detecting fake news has become a crucial problem attracting tremendous research effort. Most existing methods of fake news detection are supervised, requiring an extensive amount of time and labor to build a reliably annotated dataset. In search of an alternative, in this paper we investigate whether we can detect fake news in an unsupervised manner. We treat the truths of news items and users' credibility as latent random variables, and exploit users' engagements on social media to identify their opinions on the authenticity of news. We leverage a Bayesian network model to capture the conditional dependencies among the truths of news, the users' opinions, and the users' credibility. To solve the inference problem, we propose an efficient collapsed Gibbs sampling approach to infer the truths of news and the users' credibility without any labelled data. Experimental results on two datasets show that the proposed method significantly outperforms the compared unsupervised methods.
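
A toy, non-collapsed Gibbs sampler for the latent-truth idea (a simplification of the paper's collapsed and richer model; all names and priors below are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
U, N = 20, 30                                  # users, news items
true_z = rng.integers(0, 2, N)                 # ground-truth labels
true_c = rng.uniform(0.6, 0.95, U)             # per-user credibility
agree = rng.uniform(size=(U, N)) < true_c[:, None]
o = np.where(agree, true_z[None, :], 1 - true_z[None, :])  # observed opinions

z = rng.integers(0, 2, N)                      # initialize latent truths
for sweep in range(200):
    # Sample each user's credibility from its Beta posterior.
    correct = (o == z[None, :]).sum(axis=1)
    c = rng.beta(1 + correct, 1 + N - correct)
    # Sample each item's truth given the sampled credibilities.
    for i in range(N):
        ll1 = np.prod(np.where(o[:, i] == 1, c, 1 - c))
        ll0 = np.prod(np.where(o[:, i] == 0, c, 1 - c))
        z[i] = rng.uniform() < ll1 / (ll0 + ll1)
```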

AAAI Conference 2017 Conference Paper

Efficiently Answering Technical Questions – A Knowledge Graph Approach

  • Shuo Yang
  • Lei Zou
  • Zhongyuan Wang
  • Jun Yan
  • Ji-Rong Wen

More and more users prefer to ask their technical questions online. For machines, understanding such a question is nontrivial, and current approaches lack explicit background knowledge. In this paper, we introduce a novel technical question understanding approach for recommending probable solutions to users. First, a knowledge graph is constructed that contains abundant technical information, and an augmented knowledge graph is built on top of it to link the knowledge graph with documents. We then develop a lightweight, question-driven mechanism to select candidate documents. To improve online performance, we propose an index-based random walk to support online search. We use comprehensive experiments to evaluate the effectiveness of our approach on a large volume of real-world query logs. Our system outperforms a mainstream search engine and state-of-the-art information retrieval methods. Meanwhile, extensive experiments confirm the efficiency of our index-based online search mechanism.
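
For reference, the textbook random walk with restart that index-based variants accelerate (a minimal sketch; assumes a small dense graph in which every node has at least one edge):

```python
import numpy as np

def random_walk_with_restart(A, seed, alpha=0.15, iters=50):
    """Proximity scores of all nodes to `seed` on adjacency matrix A."""
    P = A / A.sum(axis=0, keepdims=True)  # column-stochastic transitions
    r = np.zeros(A.shape[0])
    r[seed] = 1.0
    s = r.copy()
    for _ in range(iters):
        s = (1 - alpha) * P @ s + alpha * r  # walk, with restart mass alpha
    return s
```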

AAAI Conference 2016 Conference Paper

Learning Continuous-Time Bayesian Networks in Relational Domains: A Non-Parametric Approach

  • Shuo Yang
  • Tushar Khot
  • Kristian Kersting
  • Sriraam Natarajan

Many real-world applications in medicine, biology, communication networks, web mining, and economics, among others, involve modeling and learning structured stochastic processes that evolve over continuous time. Existing approaches, however, have focused on propositional domains only. Without extensive feature engineering, it is difficult, if not impossible, to apply them within relational domains, where we may have a varying number of objects and relations among them. We therefore develop the first relational representation, called Relational Continuous-Time Bayesian Networks (RCTBNs), that can address this challenge. It features a nonparametric learning method that allows the complex dependencies and their strengths to be learned efficiently and simultaneously from sequence data. Our experimental results demonstrate that RCTBNs can learn as effectively as state-of-the-art approaches for propositional tasks while modeling relational tasks faithfully.

AAAI Conference 2015 Conference Paper

Deep Representation Learning with Target Coding

  • Shuo Yang
  • Ping Luo
  • Chen Change Loy
  • Kenneth W. Shum
  • Xiaoou Tang

We consider the problem of learning deep representations when target labels are available. In this paper, we show that there exists an intrinsic relationship between target coding and feature representation learning in deep networks. Specifically, we find that a distributed binary code with error-correcting capability is more effective at encouraging discriminative features than the 1-of-K coding typically used in supervised deep learning. This finding reveals an additional benefit of using error-correcting codes for deep model learning, apart from their well-known error-correcting property. Extensive experiments are conducted on popular visual benchmark datasets.
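
A minimal sketch of target coding with an error-correcting code (an assumed setup, not necessarily the paper's exact recipe): replace 1-of-K targets with rows of a Hadamard matrix, whose large pairwise Hamming distance provides the error-correcting capability.

```python
import numpy as np
from scipy.linalg import hadamard

K = 10                                  # number of classes
H = hadamard(16)                        # 16-bit codewords, entries in {-1, +1}
codes = (H[1:K + 1] > 0).astype(np.float32)  # skip the all-ones row

labels = np.array([3, 1, 4])
targets = codes[labels]                 # train against these instead of one-hot
```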

ICRA Conference 2015 Conference Paper

High performance full attitude control of a quadrotor on SO(3)

  • Yun Yu
  • Shuo Yang
  • Mingxi Wang
  • Cheng Li 0002
  • Zexiang Li 0001

This paper presents a novel quadrotor UAV attitude control algorithm for realizing complex acrobatic UAV maneuvers. A nonlinear dynamic model based on the exponential-coordinates parametrization of rotation is proposed. By analysing the model using Lie group and Lie algebra theory, we design cascaded linear PID controllers. To further improve controller performance, the PID controllers are augmented with a Smith predictor and a rotational trajectory planner. Experiments conducted on a real quadrotor show that our control algorithm surpasses most known quadrotor controllers.
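
A minimal sketch of the exponential-coordinates idea (illustrative only, not the paper's controller): express the attitude error in so(3) via the log map and apply a proportional term to it.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def attitude_p_term(R, R_des, kp=4.0):
    """Proportional correction on the attitude error expressed in
    exponential coordinates: log(R_des^T R) as a rotation vector."""
    err = Rotation.from_matrix(R_des.T @ R).as_rotvec()
    return -kp * err  # desired body angular-velocity correction
```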

ICRA Conference 2015 Conference Paper

Precise quadrotor autonomous landing with SRUKF vision perception

  • Shuo Yang
  • Jiahang Ying
  • Yang Lu
  • Zexiang Li 0001

We present an autonomous quadrotor system that is able to perform high-precision landing on a small platform in both indoor and outdoor environments. Its take-off and landing processes are fully autonomous. We use a vision sensor to detect the landing platform, and the vision measurement is enhanced by the IMU through an SRUKF-based sensor fusion method. All computation is done in real time and on-board. We implement the system and carry out a series of experiments under various environmental conditions. The experimental results confirm the robustness and precision of our system in real use cases.
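
The paper uses a square-root UKF; as the simplest stand-in for vision-IMU fusion, here is a linear Kalman filter along one axis (all matrices and noise values are illustrative assumptions):

```python
import numpy as np

dt = 0.01
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity model, one axis
B = np.array([[0.5 * dt ** 2], [dt]])   # IMU acceleration input
H = np.array([[1.0, 0.0]])              # vision measures position only
Q = 1e-4 * np.eye(2)                    # process noise
R = np.array([[1e-2]])                  # vision measurement noise

x = np.zeros((2, 1))                    # [position, velocity]
P = np.eye(2)

def predict(a_imu):
    """Propagate the state with an IMU acceleration sample."""
    global x, P
    x = F @ x + B * a_imu
    P = F @ P @ F.T + Q

def update(z_vision):
    """Correct the state with a vision fix of the platform position."""
    global x, P
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.array([[z_vision]]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
```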