Author name cluster

Shanda Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

2 author rows

AAAI Conference 2026 Conference Paper

CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

Weiwei Sun
Shengyu Feng
Shanda Li
Yiming Yang

Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems---a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human-designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research.

PDF Details DOI

TMLR Journal 2026 Journal Article

CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Shanda Li
Tanya Marwah
Junhong Shen
Weiwei Sun
Andrej Risteski
Yiming Yang
Ameet Talwalkar

Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). With CodePDE, we present a thorough evaluation on critical capacities of LLM for PDE solving: reasoning, debugging, self-refinement, and test-time scaling. CodePDE shows that, with advanced inference-time algorithms and scaling strategies, LLMs can achieve strong performance across a range of representative PDE problems. We also identify novel insights into LLM-driven solver generation, such as trade-offs between solver reliability and sophistication, design principles for LLM-powered PDE solving agents, and failure modes for LLM on hard tasks. These insights offer guidance for building more capable and reliable LLM-based scientific engines.

PDF Details

ICLR Conference 2025 Conference Paper

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving

Yangzhen Wu
Zhiqing Sun
Shanda Li
Sean Welleck
Yiming Yang 0002

While the scaling laws of large language models (LLMs) training have been extensively studied, optimal inference configurations of LLMs remain underexplored. We study inference scaling laws (aka test-time scaling laws) and compute-optimal inference, focusing on the trade-offs between model sizes and generating additional tokens with different inference strategies. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-$n$, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets. Our findings suggest that scaling inference compute with inference strategies can be more computationally efficient than scaling model parameters. Additionally, smaller models combined with advanced inference algorithms offer Pareto-optimal trade-offs in cost and performance. For example, the Llemma-7B model, when paired with our novel tree search algorithm, consistently outperforms the Llemma-34B model across all tested inference strategies on the MATH benchmark. We hope these insights contribute to a deeper understanding of inference scaling laws (test-time scaling laws) for LLMs.

Details

ICML Conference 2025 Conference Paper

Maximal Update Parametrization and Zero-Shot Hyperparameter Transfer for Fourier Neural Operators

Shanda Li
Shinjae Yoo
Yiming Yang 0002

Fourier Neural Operators (FNOs) offer a principled approach for solving complex partial differential equations (PDEs). However, scaling them to handle more complex PDEs requires increasing the number of Fourier modes, which significantly expands the number of model parameters and makes hyperparameter tuning computationally impractical. To address this, we introduce $\mu$ Transfer-FNO, a zero-shot hyperparameter transfer technique that enables optimal configurations, tuned on smaller FNOs, to be directly applied to billion-parameter FNOs without additional tuning. Building on the Maximal Update Parametrization ($\mu$P) framework, we mathematically derive a parametrization scheme that facilitates the transfer of optimal hyperparameters across models with different numbers of Fourier modes in FNOs, which is validated through extensive experiments on various PDEs. Our empirical study shows that $\mu$Transfer-FNO reduces computational cost for tuning hyperparameters on large FNOs while maintaining or improving accuracy.

Details

ICLR Conference 2025 Conference Paper

TFG-Flow: Training-free Guidance in Multimodal Generative Flow

Haowei Lin
Shanda Li
Haotian Ye
Yiming Yang
Stefano Ermon
Yitao Liang
Jianzhu Ma

Given an unconditional generative model and a predictor for a target property (e.g., a classifier), the goal of training-free guidance is to generate samples with desirable target properties without additional training. As a highly efficient technique for steering generative models toward flexible outcomes, training-free guidance has gained increasing attention in diffusion models. However, existing methods only handle data in continuous spaces, while many scientific applications involve both continuous and discrete data (referred to as multimodality). Another emerging trend is the growing use of the simple and general flow matching framework in building generative foundation models, where guided generation remains under-explored. To address this, we introduce TFG-Flow, a novel training-free guidance method for multimodal generative flow. TFG-Flow addresses the curse-of-dimensionality while maintaining the property of unbiased sampling in guiding discrete variables. We validate TFG-Flow on four molecular design tasks and show that TFG-Flow has great potential in drug design by generating molecules with desired properties.

Details

EAAI Journal 2024 Journal Article

A visual detection algorithm for autonomous driving road environment perception

Peichao Cong
Hao Feng
Shanda Li
Tianheng Li
Yutao Xu
Xin Zhang

Achieving accurate and real-time perception of environmental targets in complex traffic scenes based on visual sensors is a challenging research problem in the field of autonomous driving technology. In methods to date, it is difficult to effectively balance the detection accuracy and speed. To this end, this paper proposes an interactive and lightweight visual detection algorithm – YRDM (Your Region Decision-Making) – based on the concepts of efficient mining and utilisation of target feature information, lightweight network structure, and optimisation of label allocation for highly practical detection of ambient targets in autonomous driving scenarios. First, a two-stage algorithm architecture consisting of four low-parameter subnetworks is constructed with the goal of efficiently mining and utilising target feature information, and the accuracy and effectiveness of the algorithm are balanced through the interaction of information between the subnetworks. Second, in order to further improve the detection speed, lightweight convolution is introduced into the structure of the YRDM network to construct the DSC3 module, which allows lightweight processing of the subnetwork structure. Finally, by converting the label assignment problem into an optimal transport problem, adaptation to the global nature of the samples by YRDM is improved, allowing better detection accuracy. The algorithm is tested with two major public datasets, BDD100K and KITTI, and a large number of experimental results show that the comprehensive performance of YRDM is better than other existing algorithms. In addition, ablation experiments and mobile terminal device deployment experiments further demonstrate the effectiveness and real-time performance of this algorithm.

Details DOI

ICLR Conference 2024 Conference Paper

Functional Interpolation for Relative Positions improves Long Context Transformers

Shanda Li
Chong You
Guru Guruganesh
Joshua Ainslie
Santiago Ontañón
Manzil Zaheer
Sumit Sanghai
Yiming Yang 0002

Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.

Details

NeurIPS Conference 2022 Conference Paper

Is $L^2$ Physics Informed Loss Always Suitable for Training Physics Informed Neural Network?

Chuwei Wang
Shanda Li
Di He
Liwei Wang

The Physics-Informed Neural Network (PINN) approach is a new and promising way to solve partial differential equations using deep learning. The $L^2$ Physics-Informed Loss is the de-facto standard in training Physics-Informed Neural Networks. In this paper, we challenge this common practice by investigating the relationship between the loss function and the approximation quality of the learned solution. In particular, we leverage the concept of stability in the literature of partial differential equation to study the asymptotic behavior of the learned solution as the loss approaches zero. With this concept, we study an important class of high-dimensional non-linear PDEs in optimal control, the Hamilton-Jacobi-Bellman (HJB) Equation, and prove that for general $L^p$ Physics-Informed Loss, a wide class of HJB equation is stable only if $p$ is sufficiently large. Therefore, the commonly used $L^2$ loss is not suitable for training PINN on those equations, while $L^{\infty}$ loss is a better choice. Based on the theoretical insight, we develop a novel PINN training algorithm to minimize the $L^{\infty}$ loss for HJB equations which is in a similar spirit to adversarial training. The effectiveness of the proposed algorithm is empirically demonstrated through experiments. Our code is released at https: //github. com/LithiumDA/L_inf-PINN.

PDF Details

NeurIPS Conference 2022 Conference Paper

Your Transformer May Not be as Powerful as You Expect

Shengjie Luo
Shanda Li
Shuxin Zheng
Tie-Yan Liu
Liwei Wang
Di He

Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative---RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason lies in that most RPEs are placed in the softmax attention that always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With the theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications. The code will be made publicly available at https: //github. com/lsj2408/URPE.

PDF Details

NeurIPS Conference 2021 Conference Paper

Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

Shengjie Luo
Shanda Li
Tianle Cai
Di He
Dinglan Peng
Shuxin Zheng
Guolin Ke
Liwei Wang

The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e. g. , Transformers with relative positional encoding (RPE). Since in many state-of-the-art models, relative positional encoding is used as default, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.

PDF Details