Arrow Research search

Author name cluster

Yu Wang 0002

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
1 author row

Possible papers

21

ICLR Conference 2025 Conference Paper

Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

  • Yao Teng
  • Han Shi
  • Xian Liu
  • Xuefei Ning
  • Guohao Dai 0001
  • Yu Wang 0002
  • Zhenguo Li
  • Xihui Liu

The current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing studies, Jacobi decoding, an iterative parallel decoding algorithm, has been used to accelerate the auto-regressive generation and can be executed without training. However, the Jacobi decoding relies on a deterministic criterion to determine the convergence of iterations. Thus, it works for greedy decoding but is incompatible with sampling-based decoding which is crucial for visual quality and diversity in the current auto-regressive text-to-image generation. In this paper, we propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding and allowing the model to generate diverse images. Specifically, SJD facilitates the model to predict multiple tokens at each step and accepts tokens based on the probabilistic criterion, enabling the model to generate images with fewer steps than the conventional next-token-prediction paradigm. We also investigate the token initialization strategies that leverage the spatial locality of visual data to further improve the acceleration ratio under specific scenarios. We conduct experiments for our proposed SJD on multiple auto-regressive text-to-image generation models, showing the effectiveness of model acceleration without sacrificing the visual quality. The code of our work is available here: https://github.com/tyshiwo1/Accelerating-T2I-AR-with-SJD/.

ICLR Conference 2025 Conference Paper

Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

  • Enshu Liu
  • Xuefei Ning
  • Yu Wang 0002
  • Zinan Lin 0001

Autoregressive (AR) models have recently achieved state-of-the-art performance in text and image generation. However, their primary limitation is slow generation speed due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that attempt to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To overcome this, we propose Distilled Decoding (DD), which leverages flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. The entire training process of DD does not need the training data of the original AR model (as opposed to some other methods), thus making DD more practical. We evaluate DD on state-of-the-art image AR models and present promising results. For VAR, which requires 10-step generation (680 tokens), DD enables one-step generation (6.3$\times$ speed-up), with an acceptable increase in FID from 4.19 to 9.96. Similarly, for LlamaGen, DD reduces generation from 256 steps to 1, achieving an 217.8$\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID scores $>$100. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The code and the pre-trained models will be released at https://github.com/imagination-research/distilled-decoding. The project website is at https://imagination-research.github.io/distilled-decoding.

ICLR Conference 2025 Conference Paper

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

  • Yue Yang
  • Shuibo Zhang
  • Kaipeng Zhang
  • Yi Bin
  • Yu Wang 0002
  • Ping Luo 0002
  • Wenqi Shao

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchmarks keep a static nature and overlap with the pre-training data, resulting in fixed complexity constraints and data contamination issues. This raises the concern regarding the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment for LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while ensuring that newly generated samples remain consistent with the original ones by a judge module. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.

ICML Conference 2025 Conference Paper

Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving

  • Yuxuan Zhou 0002
  • Xien Liu
  • Chenwei Yan
  • Chen Ning
  • Xiao Zhang 0001
  • Boxun Li
  • Xiangling Fu
  • Shijin Wang 0001

Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom’s Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our study highlights the need to enhance LLMs’ medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.

ICML Conference 2025 Conference Paper

Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network

  • Jijia Liu
  • Feng Gao
  • Qingmin Liao
  • Chao Yu 0005
  • Yu Wang 0002

Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to “kick-start” training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstration and online-collected data during the training process. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average 1. 62$\times$ performance improvement over SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.

ICML Conference 2025 Conference Paper

Learning Strategic Language Agents in the Werewolf Game with Iterative Latent Space Policy Optimization

  • Zelai Xu
  • Wanjun Gu
  • Chao Yu 0005
  • Yi Wu 0013
  • Yu Wang 0002

Large language model (LLM) agents have recently demonstrated impressive capabilities in various domains like open-ended conversation and multi-step decision-making. However, it remains challenging for these agents to solve strategic language games, such as Werewolf, which demand both strategic decision-making and free-form language interactions. Existing LLM agents often suffer from intrinsic bias in their action distributions and limited exploration of the unbounded text action space, resulting in suboptimal performance. To address these challenges, we propose Latent Space Policy Optimization (LSPO), an iterative framework that combines game-theoretic methods with LLM fine-tuning to build strategic language agents. LSPO leverages the observation that while the language space is combinatorially large, the underlying strategy space is relatively compact. We first map free-form utterances into a finite latent strategy space, yielding an abstracted extensive-form game. Then we apply game-theoretic methods like Counterfactual Regret Minimization (CFR) to optimize the policy in the latent space. Finally, we fine-tune the LLM via Direct Preference Optimization (DPO) to align with the learned policy. By iteratively alternating between these steps, our LSPO agents progressively enhance both strategic reasoning and language communication. Experiment on the Werewolf game shows that our agents iteratively expand the strategy space with improving performance and outperform existing Werewolf agents, underscoring their effectiveness in free-form language games with strategic interactions.

ICLR Conference 2025 Conference Paper

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

  • Enshu Liu
  • Junyi Zhu 0002
  • Zinan Lin 0001
  • Xuefei Ning
  • Shuaiqi Wang
  • Matthew B. Blaschko
  • Sergey Yekhanin
  • Shengen Yan

Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks. When training DM and CM, intermediate weight checkpoints are not fully utilized and only the last converged checkpoint is used. In this work, we find proper checkpoint merging can significantly improve the training convergence and final performance. Specifically, we propose LCSC, a simple but effective and efficient method to enhance the performance of DM and CM, by combining checkpoints along the training trajectory with coefficients deduced from evolutionary search. We demonstrate the value of LCSC through two use cases: (a) Reducing training cost. With LCSC, we only need to train DM/CM with fewer number of iterations and/or lower batch sizes to obtain comparable sample quality with the fully trained model. For example, LCSC achieves considerable training speedups for CM (23$\times$ on CIFAR-10 and 15$\times$ on ImageNet-64). (b) Enhancing pre-trained models. When full training is already done, LCSC can further improve the generation quality or efficiency of the final converged models. For example, LCSC achieves better FID using 1 number of function evaluation (NFE) than the base model with 2 NFE on consistency distillation, and decreases the NFE of DM from 15 to 9 while maintaining the generation quality. Applying LCSC to large text-to-image models, we also observe clearly enhanced generation quality.

IROS Conference 2025 Conference Paper

Multi-UAV Formation Control with Static and Dynamic Obstacle Avoidance via Reinforcement Learning

  • Yuqing Xie 0005
  • Chao Yu 0005
  • Hongzhi Zang
  • Feng Gao
  • Wenhao Tang
  • Jingyi Huang
  • Jiayu Chen 0005
  • Botian Xu

This paper tackles the challenging task of maintaining formation among multiple unmanned aerial vehicles (UAVs) while avoiding both static and dynamic obstacles during directed flight. The complexity of the task arises from its multi-objective nature, the large exploration space, and the sim-to-real gap. To address these challenges, we propose a two-stage reinforcement learning (RL) pipeline. In the first stage, we randomly search for a reward function that balances key objectives: directed flight, obstacle avoidance, formation maintenance, and zero-shot policy deployment. The second stage applies this reward function to more complex scenarios and utilizes curriculum learning to accelerate policy training. Additionally, we incorporate an attention-based observation encoder to improve formation maintenance and adaptability to varying obstacle densities. Experimental results in both simulation and real-world environments demonstrate that our method outperforms both planning-based and RL-based baselines in terms of collision-free rates and formation maintenance across static, dynamic, and mixed obstacle scenarios. Ablation studies further confirm the effectiveness of our curriculum learning strategy and attention-based encoder. Animated demonstrations are available at: https://sites.google.com/view/uav-formation-with-avoidance/.

ICLR Conference 2025 Conference Paper

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

  • Tianchen Zhao
  • Tongcheng Fang
  • Haofeng Huang
  • Rui Wan
  • Widyadewi Soedarmadji
  • Enshu Liu
  • Shiyao Li
  • Zinan Lin 0001

Diffusion transformers have demonstrated remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that existing quantization methods face challenges when applied to text-to-image and video tasks. To address these challenges, we begin by systematically analyzing the source of quantization error and conclude with the unique challenges posed by DiT quantization. Accordingly, we design an improved quantization scheme: ViDiT-Q (**V**ideo \& **I**mage **Di**ffusion **T**ransformer **Q**uantization), tailored specifically for DiT models. We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models, achieving W8A8 and W4A8 with negligible degradation in visual quality and metrics. Additionally, we implement efficient GPU kernels to achieve practical 2-2.5x memory optimization and a 1.4-1.7x end-to-end latency speedup.

ICLR Conference 2024 Conference Paper

A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models

  • Enshu Liu
  • Xuefei Ning
  • Huazhong Yang
  • Yu Wang 0002

Recent years have witnessed the rapid progress and broad application of diffusion probabilistic models (DPMs). Sampling from DPMs can be viewed as solving an ordinary differential equation (ODE). Despite the promising performance, the generation of DPMs usually consumes much time due to the large number of function evaluations (NFE). Though recent works have accelerated the sampling to around 20 steps with high-order solvers, the sample quality with less than 10 NFE can still be improved. In this paper, we propose a unified sampling framework (USF) to study the optional strategies for solver. Under this framework, we further reveal that taking different solving strategies at different timesteps may help further decrease the truncation error, and a carefully designed \emph{solver schedule} has the potential to improve the sample quality by a large margin. Therefore, we propose a new sampling framework based on the exponential integral formulation that allows free choices of solver strategy at each step and design specific decisions for the framework. Moreover, we propose $S^3$, a predictor-based search method that automatically optimizes the solver schedule to get a better time-quality trade-off of sampling. We demonstrate that $S^3$ can find outstanding solver schedules which outperform the state-of-the-art sampling methods on CIFAR-10, CelebA, ImageNet-64, and LSUN-Bedroom datasets. Specifically, we achieve 2.69 FID with 9 NFE and 6.86 FID with 5 NFE on CIFAR-10 dataset, outperforming the SOTA method significantly. We further apply $S^3$ to Stable-Diffusion model and get an acceleration ratio of 2$\times$, showing the feasibility of sampling in very few steps without retraining of the neural network.

ICML Conference 2024 Conference Paper

Evaluating Quantized Large Language Models

  • Shiyao Li
  • Xuefei Ning
  • Luning Wang
  • Tengxuan Liu
  • Xiangsheng Shi
  • Shengen Yan
  • Guohao Dai 0001
  • Huazhong Yang

Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of large language models (LLMs). Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https: //github. com/thu-nics/qllm-eval.

ICML Conference 2024 Conference Paper

Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game

  • Zelai Xu
  • Chao Yu 0005
  • Fei Fang 0001
  • Yu Wang 0002
  • Yi Wu 0013

Agents built with large language models (LLMs) have shown great potential across a wide range of domains. However, in complex decision-making tasks, pure LLM-based agents tend to exhibit intrinsic bias in their choice of actions, which is inherited from the model’s training data and results in suboptimal performance. To develop strategic language agents, i. e. , agents that generate flexible language actions and possess strong decision-making abilities, we propose a novel framework that powers LLM-based agents with reinforcement learning (RL). We consider Werewolf, a popular social deduction game, as a challenging testbed that emphasizes versatile communication and strategic gameplay. To mitigate the intrinsic bias in language actions, our agents use an LLM to perform deductive reasoning and generate a diverse set of action candidates. Then an RL policy trained to optimize the decision-making ability chooses an action from the candidates to play in the game. Extensive experiments show that our agents overcome the intrinsic bias and outperform existing LLM-based agents in the Werewolf game. We also conduct human-agent experiments and find that our agents achieve human-level performance and demonstrate strong strategic play.

ICML Conference 2024 Conference Paper

Position: Towards Implicit Prompt For Text-To-Image Models

  • Yue Yang
  • Yuqi Lin
  • Hong Liu
  • Wenqi Shao
  • Runjian Chen
  • Hailong Shang
  • Yu Wang 0002
  • Yu Qiao 0001

Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (hint at a target without explicitly mentioning it). These prompts may get rid of safety constraints and pose potential threats to the applications of these models. This position paper highlights the current state of T2I models toward implicit prompts. We present a benchmark named ImplicitBench and conduct an investigation on the performance and impacts of implicit prompts with popular T2I models. Specifically, we design and collect more than 2, 000 implicit prompts of three aspects: General Symbols, Celebrity Privacy, and Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models’ capabilities under these implicit prompts. Experiment results show that (1) T2I models are able to accurately create various target symbols indicated by implicit prompts; (2) Implicit prompts bring potential risks of privacy leakage for T2I models. (3) Constraints of NSFW in most of the evaluated T2I models can be bypassed with implicit prompts. We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

ICLR Conference 2024 Conference Paper

Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation

  • Xuefei Ning
  • Zinan Lin 0001
  • Zixuan Zhou
  • Zifu Wang
  • Huazhong Yang
  • Yu Wang 0002

This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. In this work, motivated by the thinking and writing process of humans, we propose Skeleton-of-Thought (SoT), which first guides LLMs to generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-ups across 12 LLMs, but it can also potentially improve the answer quality on several question categories. SoT is an initial attempt at data-centric optimization for inference efficiency, and showcases the potential of eliciting high-quality answers by explicitly planning the answer structure in language.

ICLR Conference 2023 Conference Paper

Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased

  • Chao Yu 0005
  • Jiaxuan Gao
  • Weilin Liu
  • Botian Xu
  • Hao Tang
  • Jiaqi Yang
  • Yu Wang 0002
  • Yi Wu 0013

There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized w.r.t. the environment reward function, which implicitly assumes that the testing partners of the adaptive policy will be precisely optimizing the same reward function as well. However, human objectives are often substantially biased according to their own preferences, which can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space as linear functions, HSP adopts an effective technique to generate an augmented policy pool with biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that our HSP method produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.

ICML Conference 2023 Conference Paper

OMS-DPM: Optimizing the Model Schedule for Diffusion Probabilistic Models

  • Enshu Liu
  • Xuefei Ning
  • Zinan Lin 0001
  • Huazhong Yang
  • Yu Wang 0002

Diffusion probabilistic models (DPMs) are a new class of generative models that have achieved state-of-the-art generation quality in various domains. Despite the promise, one major drawback of DPMs is the slow generation speed due to the large number of neural network evaluations required in the generation process. In this paper, we reveal an overlooked dimension—model schedule—for optimizing the trade-off between generation quality and speed. More specifically, we observe that small models, though having worse generation quality when used alone, could outperform large models in certain generation steps. Therefore, unlike the traditional way of using a single model, using different models in different generation steps in a carefully designed model schedule could potentially improve generation quality and speed simultaneously. We design OMS-DPM, a predictor-based search algorithm, to determine the optimal model schedule given an arbitrary generation time budget and a set of pre-trained models. We demonstrate that OMS-DPM can find model schedules that improve generation quality and speed than prior state-of-the-art methods across CIFAR-10, CelebA, ImageNet, and LSUN datasets. When applied to the public checkpoints of the Stable Diffusion model, we are able to accelerate the sampling by 2x while maintaining the generation quality.

ICRA Conference 2022 Conference Paper

Explore-Bench: Data Sets, Metrics and Evaluations for Frontier-based and Deep-reinforcement-learning-based Autonomous Exploration

  • Yuanfan Xu
  • Jincheng Yu
  • Jiahao Tang
  • Jiantao Qiu
  • Jian Wang 0030
  • Yuan Shen 0001
  • Yu Wang 0002
  • Huazhong Yang

Autonomous exploration and mapping of unknown terrains employing single or multiple robots is an essential task in mobile robotics and has therefore been widely investigated. Nevertheless, given the lack of unified data sets, metrics, and platforms to evaluate the exploration approaches, we develop an autonomous robot exploration benchmark en-titled Explore-Bench. The benchmark involves various explo-ration scenarios and presents two types of quantitative metrics to evaluate exploration efficiency and multi-robot cooperation. Explore-Bench is extremely useful as, recently, deep rein-forcement learning (DRL) has been widely used for robot exploration tasks and achieved promising results. However, training DRL-based approaches requires large data sets, and additionally, current benchmarks rely on realistic simulators with a slow simulation speed, which is not appropriate for training exploration strategies. Hence, to support efficient DRL training and comprehensive evaluation, the suggested Explore-Bench designs a 3-level platform with a unified data flow and 12 × speed-up that includes a grid-based simulator for fast evaluation and efficient training, a realistic Gazebo simulator, and a remotely accessible robot testbed for high-accuracy tests in physical environments. The practicality of the proposed benchmark is highlighted with the application of one DRL-based and three frontier-based exploration approaches. Fur-thermore, we analyze the performance differences and provide some insights about the selection and design of exploration methods. Our benchmark is available at https://github.com/efc-robot/Explore-Bench.

ICRA Conference 2022 Conference Paper

Multi-UAV Disaster Environment Coverage Planning with Limited-Endurance

  • Hongyu Song
  • Jincheng Yu
  • Jiantao Qiu
  • Zhixiao Sun
  • Kuijun Lang
  • Qing Luo
  • Yuan Shen 0001
  • Yu Wang 0002

Disaster areas involving floods and earthquakes are commonly large, with the rescue time being quite tight, suggesting multi-Unmanned Aerial Vehicles (UAV) exploration rather than employing a single UAV. For such scenarios, current UAV exploration is modeled as a Coverage Path Planning (CPP) problem to achieve full area coverage in the presence of obstacles. However, the UAV's endurance capability is limited, and the rescue time is constrained, prohibiting even multiple UAVs from completing disaster area coverage on time. Therefore, this paper defines a multi-Agent Endurance-limited CPP (MAEl-CPP) problem that is based on an a priori known heatmap of the disaster area, which affords to explore the most valuable areas under UAV limited energy constraints. Furthermore, we propose a path planning algorithm for the MAEl-CPP problem by ranking the possible disaster areas according to their importance through satellite or remote sensing aerial images and completing path planning according to this ranking. Experimental results demonstrate that the search efficiency of the proposed algorithm is 4. 2 times that of the existing algorithm.

ICRA Conference 2022 Conference Paper

Relative Distributed Formation and Obstacle Avoidance with Multi-agent Reinforcement Learning

  • Yuzi Yan
  • Xiaoxiang Li
  • Xinyou Qiu
  • Jiantao Qiu
  • Jian Wang 0030
  • Yu Wang 0002
  • Yuan Shen 0001

Multi-agent formation as well as obstacle avoid-ance is one of the most actively studied topics in the field of multi-agent systems. Although some classic controllers like model predictive control (MPC) and fuzzy control achieve a certain measure of success, most of them require precise global information which is not accessible in harsh environments. On the other hand, some reinforcement learning (RL) based approaches adopt the leader-follower structure to organize different agents' behaviors, which sacrifices the collaboration between agents thus suffering from bottlenecks in maneuver-ability and robustness. In this paper, we propose a distributed formation and obstacle avoidance method based on multi-agent reinforcement learning (MARL). Agents in our system only utilize local and relative information to make decisions and control themselves distributively, and will reorganize themselves into a new topology quickly in case that any of them is dis-connected. Our method achieves better performance regarding formation error, formation convergence rate and on-par success rate of obstacle avoidance compared with baselines (both classic control methods and another RL-based method). The feasibility of our method is verified by both simulation and hardware implementation with Ackermann-steering vehicles.

ICLR Conference 2021 Conference Paper

Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization

  • Zhenggang Tang
  • Chao Yu 0005
  • Boyuan Chen 0003
  • Huazhe Xu
  • Xiaolong Wang 0004
  • Fei Fang 0001
  • Simon S. Du
  • Yu Wang 0002

We propose a simple, general and effective technique, Reward Randomization for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover a set of multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games and a real-world game Agar.io, where multiple equilibria exist but standard multi-agent policy gradient algorithms always converge to a fixed one with a sub-optimal payoff for every player even using state-of-the-art exploration techniques. Furthermore, with the set of diverse strategies from RPG, we can (1) achieve higher payoffs by fine-tuning the best policy from the set; and (2) obtain an adaptive agent by using this set of strategies as its training opponents.

ICRA Conference 2021 Conference Paper

SMMR-Explore: SubMap-based Multi-Robot Exploration System with Multi-robot Multi-target Potential Field Exploration Method

  • Jincheng Yu
  • Jianming Tong
  • Yuanfan Xu
  • Zhilin Xu
  • Haolin Dong
  • Tianxiang Yang
  • Yu Wang 0002

Collaborative exploration in an unknown environment without external positioning under limited communication is an essential task for multi-robot applications. For inter-robot positioning, various Distributed Simultaneous Localization and Mapping (DSLAM) systems share the Place Recognition (PR) descriptors and sensor data to estimate the relative pose between robots and merge robots’ maps. As maps are constantly shared among robots in exploration, we design a map-based DSLAM framework, which only shares the submaps, eliminating the transfer of PR descriptors and sensor data. Our framework saves 30% of total communication traffic. For exploration, each robot is assigned to get much unknown information about environments with paying little travel cost. As the number of sampled points increases, the goal would change back and forth among sampled frontiers, leading to the downgrade in exploration efficiency and the overlap of trajectories. We propose an exploration strategy based on Multi-robot Multi-target Potential Field (MMPF), which can eliminate goal’s back-and-forth changes, boosting the exploration efficiency by 1. 03 ×∼1. 62 × with 3 % ∼ 40 % travel cost saved. Our SubMap-based Multi-robot Exploration method (SMMR-Explore) is evaluated on both Gazebo simulator and real robots. The simulator and the exploration framework are published as an open-source ROS project at https://github.com/efc-robot/SMMR-Explore.