Arrow Research search

Author name cluster

Bo Peng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
2 author rows

Possible papers

29

AAAI Conference 2026 Conference Paper

How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

  • Bo Peng
  • Pi Bu
  • Keyu Pan
  • Xinrun Xu
  • Yingxiu Zhao
  • Miao Chen
  • Yang Du
  • Lin Li

Recent advances in vision–language models (VLMs) have shed light on human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents still rely on high-level commands or discretised action spaces—``non-native'' settings that diverge markedly from the real world. Moreover, current benchmarks focus exclusively on high-level tasks and lack joint evaluation and analysis at both the low and high levels. To bridge these gaps, we present \textbf{NativeEmbodied}, a challenging benchmark for VLM-driven embodied agents that adopts a unified, native low-level action space. Built upon diverse simulated scenes, NativeEmbodied first designs three representative high-level tasks in complex scenarios to evaluate overall performance. For a more detailed and comprehensive performance analysis, we further decouple the entangled skills behind complex tasks and construct four types of low-level tasks, each corresponding to a key fundamental embodied skill. This joint evaluation across task and skill granularities enables a fine-grained assessment of embodied agents. Comprehensive experiments on the best VLMs reveal pronounced deficiencies in certain fundamental embodied skills. Further analysis shows that these bottlenecks severely constrain performance on high-level tasks. Our NativeEmbodied not only pinpoints the key challenges faced by current VLM-driven embodied agents, but also provides valuable insight for the future development of this field.

JBHI Journal 2026 Journal Article

MBBo-RPSLD: Training a Multimodal BlenderBot for Rehabilitation in Post-Stroke Language Disorder

  • Yangyang Guo
  • Airu Huang
  • Bo Peng
  • Yufeng Li
  • Wei Gu

Stroke, a severe cerebrovascular event, can lead to motor deficits and often impairs language, affecting quality of life. Thus, developing effective rehabilitation models is crucial for enhancing language function and well-being in stroke patients. This paper presents the Multi-Blender model, designed to address the challenges of multimodal data processing and the complexity of medical dialogue in stroke language rehabilitation. The model integrates the multimodal encoding capabilities of ImageBind-LLM with the conversational generation strengths of BlenderBot, creating a tailored rehabilitation solution for stroke patients. We evaluated the model using a range of datasets, including the NINDS dataset, MSDM database, and clinical data from hospitals, focusing on audio-video recognition and speech translation tasks. Our results demonstrate that the Multi-Blender model outperforms existing models, achieving a BLEU score of 30.2 in the AST task, surpassing Whisper Large-v2 and AudioPaLM. In the ASR task, it also displayed superior performance. The model's effectiveness was further validated through an adjusted MME benchmark, where it scored 85.25% in perceptual tasks and 76.83% in cognitive tasks, outperforming other models in language understanding and fluency scoring. These findings indicate that the Multi-Blender model significantly enhances stroke language rehabilitation by improving multimodal data processing and providing accurate, reliable solutions. Future work will focus on expanding the training dataset and optimizing the model to further advance the effectiveness of stroke rehabilitation.

AAAI Conference 2026 Conference Paper

Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

  • Hangyu Liu
  • Bo Peng
  • Pengxiang Ding
  • Donglin Wang

Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. However, existing generative approaches for multi-target attacks primarily encode target labels into one-dimensional tensors, leading to a loss of fine-grained visual information and overfitting to model-specific features during noise generation. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model's attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments demonstrate that TGAF consistently surpasses state-of-the-art methods across various settings.

AAAI Conference 2026 Conference Paper

Revisiting MLLM Based Image Quality Assessment: Errors and Remedy

  • Zhenchen Tang
  • Songlin Yang
  • Bo Peng
  • Zichuan Wang
  • Jing Dong

The rapid progress of multi-modal large language models (MLLMs) has boosted the task of image quality assessment (IQA). However, a key challenge arises from the inherent mismatch between the discrete token outputs of MLLMs and the continuous nature of quality scores required by IQA tasks. This discrepancy significantly hinders the performance of MLLM-based IQA methods. Previous approaches that convert discrete token predictions into continuous scores often suffer from conversion errors. Moreover, the semantic confusion introduced by level tokens (e.g., “good”) further constrains the performance of MLLMs on IQA tasks and degrades their original capabilities on related tasks. To tackle these problems, we provide a theoretical analysis of the errors inherent in previous approaches and, motivated by this analysis, propose a simple yet effective framework, Q-Scorer. This framework incorporates a lightweight regression module and IQA-specific score tokens into the MLLM pipeline. Extensive experiments demonstrate that Q-Scorer achieves state-of-the-art performance across multiple IQA benchmarks, generalizes well to mixed datasets, and further improves when combined with other methods.
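
For reference, the discrete-to-continuous conversion criticized above is, in prior work, typically an expectation over a fixed set of level tokens; a common formulation (our paraphrase of the usual baseline, not Q-Scorer's own module) is

    \hat{s} = \sum_{l} s_l \, p(l \mid x),

where each level token $l$ (e.g., ``bad'' through ``excellent'') is assigned a fixed numeric value $s_l$ and weighted by the MLLM's token probability $p(l \mid x)$. Q-Scorer's regression module and score tokens replace this heuristic mapping with a learned continuous head; the abstract does not specify the exact parameterization.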

NeurIPS Conference 2025 Conference Paper

$\texttt{STRCMP}$: Integrating Graph Structural Priors with Language Models for Combinatorial Optimization

  • Xijun Li
  • Jiexiang Yang
  • Jinghao Wang
  • Bo Peng
  • Jianguo Yao
  • Haibing Guan

Combinatorial optimization (CO) problems, central to operations research and theoretical computer science, present significant computational challenges due to their $\mathcal{NP}$-hard nature. While large language models (LLMs) have emerged as promising tools for CO—either by directly generating solutions or synthesizing solver-specific codes—existing approaches often $\textit{neglect critical structural priors inherent to CO problems}$, leading to suboptimality and iterative inefficiency. Inspired by human experts’ success in leveraging CO structures for algorithm design, we propose $\texttt{STRCMP}$, a novel structure-aware LLM-based algorithm discovery framework that systematically integrates structural priors to enhance solution quality and solving efficiency. Our framework combines a graph neural network (GNN) for extracting structural embeddings from CO instances with an LLM conditioned on these embeddings to identify high-performing algorithms in the form of solver-specific codes. This composite architecture ensures syntactic correctness, preserves problem topology, and aligns with natural language objectives, while an evolutionary refinement process iteratively optimizes the generated algorithms. Extensive evaluations across Mixed Integer Linear Programming and Boolean Satisfiability problems, using nine benchmark datasets, demonstrate that our proposed $\texttt{STRCMP}$ outperforms five strong neural and LLM-based methods by a large margin, in terms of both solution optimality and computational efficiency. The code is publicly available at https://github.com/Y-Palver/L2O-STRCMP.

NeurIPS Conference 2025 Conference Paper

An Information-theoretical Framework for Understanding Out-of-distribution Detection with Pretrained Vision-Language Models

  • Bo Peng
  • Jie Lu
  • Guangquan Zhang
  • Zhen Fang

Out-of-distribution (OOD) detection, recognized for its ability to identify samples of unknown classes, provides solid advantages in ensuring the reliability of machine learning models. Among existing OOD detection methods, pre-trained vision-language models have emerged as powerful post-hoc OOD detectors by leveraging textual and visual information. Despite this empirical success, a formal understanding of their effectiveness is still lacking. This paper bridges the gap by theoretically demonstrating that existing CLIP-based post-hoc methods effectively perform a stochastic estimation of the point-wise mutual information (PMI) between the input image and each in-distribution label. This estimation is then utilized to construct energy functions for modeling the in-distribution data distribution. Different from prior methods that inherently treat PMI estimation as a single task, we, motivated by the divide-and-conquer philosophy, decompose PMI estimation into multiple easier sub-tasks by applying the chain rule of PMI, which not only reduces the estimation complexity but also provably increases the estimation upper bound to reduce the underestimation bias. Extensive evaluations across mainstream benchmarks empirically manifest that our method establishes a new state-of-the-art in a variety of OOD detection setups.
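
As a sketch of the decomposition idea (our notation; the abstract does not give the exact sub-task construction): writing a label as a token sequence $y = (y_1, \dots, y_k)$, the chain rule of PMI yields

    \mathrm{PMI}(x; y) = \sum_{i=1}^{k} \mathrm{PMI}(x; y_i \mid y_{<i}) = \sum_{i=1}^{k} \log \frac{p(y_i \mid y_{<i}, x)}{p(y_i \mid y_{<i})},

so the full estimation splits into $k$ conditional sub-estimations, each over a smaller token space than the joint.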

NeurIPS Conference 2025 Conference Paper

Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

  • Jiachen Jiang
  • Jinxin Zhou
  • Bo Peng
  • Xia Ning
  • Zhihui Zhu

Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment---the alignment between each vision patch and its corresponding semantic words---and propose a $\textit{multi-semantic alignment hypothesis}$. Our analysis indicates that the projector trained by caption loss improves patch-level alignment but only to a limited extent, resulting in weak and coarse alignment. To address this issue, we propose $\textit{patch-aligned training}$ to efficiently enhance patch-level alignment. Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM's performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models.
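
The abstract does not state the patch-aligned training objective; a minimal sketch, assuming each vision patch is paired with the embedding of its annotated semantic word (the function name and the pairing are our assumptions, not the paper's), is a cosine-alignment loss:

    import torch
    import torch.nn.functional as F

    def patch_alignment_loss(patch_emb: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (num_patches, d) projected vision-patch embeddings
        # word_emb:  (num_patches, d) embeddings of each patch's paired semantic word
        # Hypothetical sketch: pull each patch toward its paired word embedding.
        p = F.normalize(patch_emb, dim=-1)
        w = F.normalize(word_emb, dim=-1)
        return (1.0 - (p * w).sum(dim=-1)).mean()

Such a term would be added alongside the caption loss during training; the paper's multi-semantic alignment hypothesis suggests each patch may align with several words rather than one.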

AAAI Conference 2025 Conference Paper

MM-Tracker: Motion Mamba for UAV-platform Multiple Object Tracking

  • Mufeng Yao
  • Jinlong Peng
  • Qingdong He
  • Bo Peng
  • Hao Chen
  • Mingmin Chi
  • Chao Liu
  • Jon Atli Benediktsson

Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms requires efficient motion modeling. This is because UAV-MOT faces both local object motion and global camera motion. Motion blur also increases the difficulty of detecting large moving objects. Previous UAV motion modeling approaches either focus only on local motion or ignore motion blurring effects, thus limiting their tracking performance and speed. To address these issues, we propose the Motion Mamba Module, which explores both local and global motion features through cross-correlation and bi-directional Mamba modules for better motion modeling. To address the detection difficulties caused by motion blur, we also design a motion margin loss to effectively improve the detection accuracy of motion-blurred objects. Based on the Motion Mamba module and motion margin loss, our proposed MM-Tracker surpasses the state-of-the-art on two widely used open-source UAV-MOT datasets.

AAAI Conference 2024 Conference Paper

AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis

  • Dongze Li
  • Kang Zhao
  • Wei Wang
  • Bo Peng
  • Yingya Zhang
  • Jing Dong
  • Tieniu Tan

Audio-driven talking head synthesis is a promising topic with wide applications in digital humans, film making and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only a few seconds of talking video is available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., the mouth is audio related, while the ear is audio independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker from a few-shot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of audio between reference and target image. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio-related and audio-independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown that AE-NeRF surpasses the state-of-the-art in image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or few training iterations.

NeurIPS Conference 2024 Conference Paper

Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars

  • Julieta Martinez
  • Emily Kim
  • Javier Romero
  • Timur Bagautdinov
  • Shunsuke Saito
  • Shoou-I Yu
  • Stuart Anderson
  • Michael Zollhöfer

To build photorealistic avatars that users can embody, human modelling must be complete (cover the full body), driveable (able to reproduce the current motion and appearance from the user), and generalizable (i.e., easily adaptable to novel identities). Towards these goals, paired captures, that is, captures of the same subject obtained from systems of diverse quality and availability, are crucial. However, paired captures are rarely available to researchers outside of dedicated industrial labs: Codec Avatar Studio is our proposal to close this gap. Towards generalization and driveability, we introduce a dataset of 256 subjects captured in two modalities: high resolution multi-view scans of their heads, and video from the internal cameras of a headset. Towards completeness, we introduce a dataset of 4 subjects captured in eight modalities: high quality relightable multi-view captures of heads and hands, full body multi-view captures with minimal and regular clothes, and corresponding head, hands and body phone captures. Together with our data, we also provide code and pre-trained models for different state-of-the-art human generation models. Our datasets and code are available at https://github.com/facebookresearch/ava-256 and https://github.com/facebookresearch/goliath.

AAAI Conference 2024 Conference Paper

ConditionVideo: Training-Free Condition-Guided Video Generation

  • Bo Peng
  • Xinyuan Chen
  • Yaohui Wang
  • Chaochao Lu
  • Yu Qiao

Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, CLIP score, and conditional accuracy, outperforming other compared methods.

ICLR Conference 2024 Conference Paper

ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection

  • Bo Peng
  • Yadan Luo
  • Yonggang Zhang 0003
  • Yixuan Li 0001
  • Zhen Fang 0001

Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimated scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25\% and 28.19\% (FPR95) on CIFAR-100 and ImageNet-1K, respectively.
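
As a minimal sketch of the Monte Carlo ingredient, assuming a norm-induced density $\propto \exp(-\lVert z-\mu\rVert_p^p)$ and a Gaussian proposal (both are our assumptions; the paper's exact density family and proposal may differ), an unbiased importance-sampling estimate of the partition function is:

    import numpy as np

    rng = np.random.default_rng(0)

    def partition_is_estimate(mu, p=2.0, n_samples=100_000, sigma=3.0):
        # Unbiased IS estimate of Z(p) = \int exp(-||z - mu||_p^p) dz
        # with proposal q = N(mu, sigma^2 I), since E_q[f(z)/q(z)] = Z.
        d = mu.shape[0]
        z = rng.normal(loc=mu, scale=sigma, size=(n_samples, d))
        log_f = -np.sum(np.abs(z - mu) ** p, axis=1)            # unnormalized log-density
        log_q = (-0.5 * np.sum(((z - mu) / sigma) ** 2, axis=1)
                 - d * np.log(sigma * np.sqrt(2.0 * np.pi)))    # proposal log-density
        return np.mean(np.exp(log_f - log_q))

Unbiasedness follows because $\mathbb{E}_q[f(z)/q(z)] = \int f(z)\,dz = Z$; the estimator's variance depends on how well the proposal covers the target.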

AAAI Conference 2024 Conference Paper

End-to-End Learning of LTLf Formulae by Faithful LTLf Encoding

  • Hai Wan
  • Pingjia Liang
  • Jianfeng Du
  • Weilin Luo
  • Rongzhen Ye
  • Bo Peng

It is important to automatically discover the underlying tree-structured formulae from large amounts of data. In this paper, we examine learning linear temporal logic on finite traces (LTLf) formulae, which are tree-structured syntactically and characterize temporal properties semantically. The core challenge is to bridge the gap between the concise tree-structured syntax and the complex LTLf semantics. Besides, the learning quality is endangered by the explosion of the search space and the wrong search bias guided by imperfect data. We tackle these challenges by proposing an LTLf encoding method to parameterize a neural network so that the neural computation is able to simulate the inference of LTLf formulae. We first identify faithful LTLf encoding, a subclass of LTLf encoding, which has a one-to-one correspondence to LTLf formulae. Faithful encoding guarantees that the learned parameter assignment of the neural network can be directly interpreted as an LTLf formula. With such an encoding method, we then propose an end-to-end approach, TLTLf, to learn LTLf formulae through neural networks parameterized by our LTLf encoding method. Experimental results demonstrate that our approach achieves state-of-the-art performance with up to 7% improvement in accuracy, highlighting the benefits of introducing the faithful LTLf encoding.

ICML Conference 2024 Conference Paper

Knowledge Distillation with Auxiliary Variable

  • Bo Peng
  • Zhen Fang 0001
  • Guangquan Zhang 0001
  • Jie Lu 0001

Knowledge distillation (KD) provides an efficient framework for transferring knowledge from a teacher model to a student model by aligning their predictive distributions. The existing KD methods adopt the same strategy as the teacher to formulate the student’s predictive distribution. However, employing the same distribution-modeling strategy typically causes sub-optimal knowledge transfer due to the discrepancy in model capacity between teacher and student models. Designing student-friendly teachers contributes to alleviating the capacity discrepancy, while it requires either complicated or student-specific training schemes. To cast off this dilemma, we propose to introduce an auxiliary variable to promote the ability of the student to model predictive distribution. The auxiliary variable is defined to be related to target variables, which will boost the model prediction. Specifically, we reformulate the predictive distribution with the auxiliary variable, deriving a novel objective function of KD. Theoretically, we provide insights to explain why the proposed objective function can outperform the existing KD methods. Experimentally, we demonstrate that the proposed objective function can considerably and consistently outperform existing KD methods.
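
For context, the predictive-distribution alignment such methods perform is usually the standard KD objective (our statement of the common baseline, not the paper's reformulated objective):

    \mathcal{L}_{\mathrm{KD}} = (1-\alpha)\, \mathrm{CE}\big(y, \sigma(z_s)\big) + \alpha\, \tau^2\, \mathrm{KL}\big(\sigma(z_t/\tau) \,\Vert\, \sigma(z_s/\tau)\big),

where $z_t, z_s$ are teacher and student logits, $\tau$ is a temperature, and $\sigma$ is the softmax. The proposed method instead reformulates the student's predictive distribution with an auxiliary variable before deriving the KD objective; the abstract leaves the resulting form unspecified.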

AAAI Conference 2024 Conference Paper

Learning Dense Correspondence for NeRF-Based Face Reenactment

  • Songlin Yang
  • Wei Wang
  • Yushi Lan
  • Xiangyu Fan
  • Bo Peng
  • Lei Yang
  • Jing Dong

Face reenactment is challenging due to the need to establish dense correspondence between various face representations for motion transfer. Recent studies have utilized Neural Radiance Field (NeRF) as fundamental representation, which further enhanced the performance of multi-view face reenactment in photo-realism and 3D consistency. However, establishing dense correspondence between different face NeRFs is non-trivial, because implicit representations lack ground-truth correspondence annotations like mesh-based 3D parametric models (e.g., 3DMM) with index-aligned vertexes. Although aligning 3DMM space with NeRF-based face representations can realize motion control, it is sub-optimal for their limited face-only modeling and low identity fidelity. Therefore, we are inspired to ask: Can we learn the dense correspondence between different NeRF-based face representations without a 3D parametric model prior? To address this challenge, we propose a novel framework, which adopts tri-planes as fundamental NeRF representation and decomposes face tri-planes into three components: canonical tri-planes, identity deformations, and motion. In terms of motion control, our key contribution is proposing a Plane Dictionary (PlaneDict) module, which efficiently maps the motion conditions to a linear weighted addition of learnable orthogonal plane bases. To the best of our knowledge, our framework is the first method that achieves one-shot multi-view face reenactment without a 3D parametric model prior. Extensive experiments demonstrate that we produce better results in fine-grained motion control and identity preservation than previous methods.

NeurIPS Conference 2024 Conference Paper

Learning to Shape In-distribution Feature Space for Out-of-distribution Detection

  • Yonggang Zhang
  • Jie Lu
  • Bo Peng
  • Zhen Fang
  • Yiu-ming Cheung

Out-of-distribution (OOD) detection is critical for deploying machine learning models in the open world. To design scoring functions that discern OOD data from the in-distribution (ID) cases from a pre-trained discriminative model, existing methods tend to make rigorous distributional assumptions either explicitly or implicitly due to the lack of knowledge about the learned feature space in advance. The mismatch between the learned and assumed distributions motivates us to raise a fundamental yet under-explored question: \textit{Is it possible to deterministically model the feature distribution while pre-training a discriminative model?} This paper gives an affirmative answer to this question by presenting a Distributional Representation Learning (\texttt{DRL}) framework for OOD detection. In particular, \texttt{DRL} explicitly enforces the underlying feature space to conform to a pre-defined mixture distribution, together with an online approximation of normalization constants to enable end-to-end training. Furthermore, we formulate \texttt{DRL} into a provably convergent Expectation-Maximization algorithm to avoid trivial solutions and rearrange the sequential sampling to guide the training consistency. Extensive evaluations across mainstream OOD detection benchmarks empirically manifest the superiority of the proposed \texttt{DRL} over its advanced counterparts.
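
One natural reading of the pre-defined mixture (offered as an assumption; the abstract does not fix the component family) is a $K$-component Gaussian mixture over features $z$,

    p(z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(z; \mu_k, \Sigma_k),

with the OOD score taken as the (log-)likelihood of a test feature under this density, which is known by construction rather than assumed post hoc.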

AAAI Conference 2024 Conference Paper

Solving Spectrum Unmixing as a Multi-Task Bayesian Inverse Problem with Latent Factors for Endmember Variability

  • Dong Wu
  • Mingmin Chi
  • Xuan Zang
  • Bo Peng

With the increasing customization of spectrometers, spectral unmixing has become a widely used technique in fields such as remote sensing, textiles, and environmental protection. However, endmember variability is a common issue for unmixing: changes in lighting, atmospheric or temporal conditions, or in the intrinsic spectral characteristics of materials can all result in variations in the measured spectrum. Recent studies have employed deep neural networks to tackle endmember variability. However, these approaches rely on generic networks to implicitly resolve the issue, which struggle with the ill-posed nature of the problem and lack effective convergence constraints for endmember variability. This paper proposes a streamlined multi-task learning model to rectify this problem, incorporating abundance regression and multi-label classification with unmixing treated as a Bayesian inverse problem, denoted as BIPU. To address the ill-posed nature, the uncertainty of unmixing is quantified and minimized through the Laplace approximation in a Bayesian inverse solver. In addition, to improve convergence under the influence of endmember variability, the paper introduces two types of constraints. The first separates background factors of variants from the initial factors for each endmember, while the second identifies and eliminates the influence of non-existent endmembers via multi-label classification during convergence. The effectiveness of this model is demonstrated not only on a self-collected near-infrared spectral textile dataset (FENIR), but also on three commonly used remote sensing hyperspectral image datasets, where it achieves state-of-the-art unmixing performance and exhibits strong generalization capabilities.
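
For readers outside the area: the abundance regression above presupposes the standard linear mixing model, here written with a per-endmember variability term (the $\delta_m$ notation is ours):

    y = \sum_{m=1}^{M} a_m (e_m + \delta_m) + \epsilon, \qquad a_m \ge 0, \quad \sum_{m=1}^{M} a_m = 1,

where $y$ is the measured spectrum, $e_m$ are the nominal endmember spectra, $a_m$ are the abundances constrained to the simplex, and $\epsilon$ is noise; endmember variability corresponds to nonzero $\delta_m$.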

NeurIPS Conference 2023 Conference Paper

A Dataset of Relighted 3D Interacting Hands

  • Gyeongsik Moon
  • Shunsuke Saito
  • Weipeng Xu
  • Rohan Joshi
  • Julia Buffalini
  • Harley Bellan
  • Nicholas Rosen
  • Jesse Richardson

The two-hand interaction is one of the most challenging signals to analyze due to the self-similarity, complicated articulations, and occlusions of hands. Although several datasets have been proposed for two-hand interaction analysis, none of them achieves both 1) diverse and realistic image appearances and 2) diverse and large-scale ground-truth (GT) 3D poses at the same time. In this work, we propose Re:InterHand, a dataset of relighted 3D interacting hands that achieves both goals. To this end, we employ a state-of-the-art hand relighting network with our accurately tracked two-hand 3D poses. We compare our Re:InterHand with existing 3D interacting hands datasets and show its benefits. Our Re:InterHand is available at https://mks0601.github.io/ReInterHand/

NeurIPS Conference 2023 Conference Paper

On the Last-iterate Convergence in Time-varying Zero-sum Games: Extra Gradient Succeeds where Optimism Fails

  • Yi Feng
  • Hu Fu
  • Qun Hu
  • Ping Li
  • Ioannis Panageas
  • Bo Peng
  • Xiao Wang

Last-iterate convergence has received extensive study in two-player zero-sum games, starting from bilinear and convex-concave settings up to settings that satisfy the MVI condition. Typical methods that exhibit last-iterate convergence for the aforementioned games include extra-gradient (EG) and optimistic gradient descent ascent (OGDA). However, all the established last-iterate convergence results hold for the restrictive setting where the underlying repeated game does not change over time. Recently, a line of research has focused on regret analysis of OGDA in time-varying games, i.e., games where payoffs evolve with time; the last-iterate behavior of OGDA and EG in time-varying environments remains unclear though. In this paper, we study the last-iterate behavior of various algorithms in two types of unconstrained, time-varying, bilinear zero-sum games: periodic and convergent perturbed games. These models expand upon the usual repeated game formulation and incorporate external environmental factors, such as the seasonal effects on species competition and vanishing external noise. In periodic games, we prove that EG will converge while OGDA and the momentum method will diverge. This is quite surprising, as to the best of our knowledge it is the first result indicating that EG and OGDA have qualitatively different last-iterate behaviors. In convergent perturbed games, we prove all these algorithms converge as long as the game itself stabilizes with a faster rate than $1/t$.
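
For orientation, the two update rules being contrasted can be written for a time-varying operator $F_t$ (here $F_t(x, y) = (A_t y, -A_t^{\top} x)$ for the bilinear payoff $x^{\top} A_t y$; indexing conventions vary across papers):

    \text{EG:} \quad w_{t+1/2} = w_t - \eta F_t(w_t), \qquad w_{t+1} = w_t - \eta F_t(w_{t+1/2}),
    \text{OGDA:} \quad w_{t+1} = w_t - 2\eta F_t(w_t) + \eta F_{t-1}(w_{t-1}),

so EG re-evaluates the current game at an extrapolated point, while OGDA reuses a stale evaluation from the previous (possibly different) game, which is one intuition for their divergent behavior under periodic payoffs.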

AAAI Conference 2022 Conference Paper

Bridging LTLf Inference to GNN Inference for Learning LTLf Formulae

  • Weilin Luo
  • Pingjia Liang
  • Jianfeng Du
  • Hai Wan
  • Bo Peng
  • Delong Zhang

Learning linear temporal logic on finite traces (LTLf) formulae aims to learn a target formula that characterizes the high-level behavior of a system from observation traces in planning. Existing approaches to learning LTLf formulae, however, can hardly learn accurate LTLf formulae from noisy data. It is challenging to design an efficient search mechanism over the large search space of arbitrary LTLf formulae while alleviating the wrong search bias resulting from noisy data. In this paper, we tackle this problem by bridging LTLf inference to GNN inference. Our key theoretical contribution is showing that GNN inference can simulate LTLf inference to distinguish traces. Based on our theoretical result, we design a GNN-based approach, GLTLf, which combines GNN inference and parameter interpretation to seek the target formula in the large search space. Thanks to the non-deterministic learning process of GNNs, GLTLf is able to cope with noise. We evaluate GLTLf on various datasets with noise. Our experimental results confirm the effectiveness of GNN inference in learning LTLf formulae and show that GLTLf is superior to the state-of-the-art approaches.

JBHI Journal 2022 Journal Article

Multi-Modality MR Image Synthesis via Confidence-Guided Aggregation and Cross-Modality Refinement

  • Bo Peng
  • Bingzheng Liu
  • Yi Bin
  • Lili Shen
  • Jianjun Lei

Magnetic resonance imaging (MRI) can provide multi-modality MR images by setting task-specific scan parameters, and has been widely used in disease diagnosis and treatment planning. However, in practical clinical applications, it is often difficult to obtain multi-modality MR images simultaneously due to patient discomfort, scanning costs, etc. Therefore, how to effectively utilize the existing modality images to synthesize a missing modality image has become a hot research topic. In this paper, we propose a novel confidence-guided aggregation and cross-modality refinement network (CACR-Net) for multi-modality MR image synthesis, which effectively utilizes complementary and correlative information of multiple modalities to synthesize high-quality target-modality images. Specifically, to effectively utilize the complementary modality-specific characteristics, a confidence-guided aggregation module is proposed to adaptively aggregate the multiple target-modality images generated from multiple source-modality images by using the corresponding confidence maps. Based on the aggregated target-modality image, a cross-modality refinement module is presented to further refine the target-modality image by mining correlative information among the multiple source-modality images and the aggregated target-modality image. By training the proposed CACR-Net in an end-to-end manner, high-quality and sharp target-modality MR images are effectively synthesized. Experimental results on a widely used benchmark demonstrate that the proposed method outperforms state-of-the-art methods.
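
A minimal sketch of how a confidence-guided aggregation could combine the per-source candidates (tensor shapes, names, and the softmax normalization are our assumptions, not the paper's exact module):

    import torch

    def confidence_guided_aggregate(candidates: torch.Tensor,
                                    confidences: torch.Tensor) -> torch.Tensor:
        # candidates:  (S, C, H, W) target-modality images, one per source modality
        # confidences: (S, 1, H, W) predicted per-pixel confidence maps
        weights = torch.softmax(confidences, dim=0)  # normalize across the S sources
        return (weights * candidates).sum(dim=0)     # (C, H, W) aggregated image

The aggregated image would then be passed, together with the source images, to the cross-modality refinement module.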

IJCAI Conference 2018 Conference Paper

Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding

  • Hai Wan
  • Yonghao Luo
  • Bo Peng
  • Wei-Shi Zheng

This paper focuses on scene graph completion, which aims at predicting new relations between two entities utilizing existing scene graphs and images. Comparing with the well-known knowledge graph, we first identify that each scene graph is associated with an image and each entity of a visual triple in a scene graph is composed of its entity type with attributes and grounded with a bounding box in its corresponding image. We then propose an end-to-end model named Representation Learning via Jointly Structural and Visual Embedding (RLSV) to take advantage of structural and visual information in scene graphs. In the RLSV model, we provide a fully-convolutional module to extract the visual embeddings of a visual triple and apply hierarchical projection to combine the structural and visual embeddings of a visual triple. In experiments, we evaluate our model on two scene graph completion tasks: link prediction and visual triple classification, and further analyze it with case studies. Experimental results demonstrate that our model outperforms all baselines in both tasks, which justifies the significance of combining structural and visual information for scene graph completion.

JMLR Journal 2016 Journal Article

An Error Bound for L1-norm Support Vector Machine Coefficients in Ultra-high Dimension

  • Bo Peng
  • Lan Wang
  • Yichao Wu

Compared with the standard $L_2$-norm support vector machine (SVM), the $L_1$-norm SVM enjoys the nice property of simultaneously performing classification and feature selection. In this paper, we investigate the statistical performance of the $L_1$-norm SVM in ultra-high dimension, where the number of features $p$ grows at an exponential rate of the sample size $n$. Different from existing theory for SVM, which has mainly focused on generalization error rates and empirical risk, we study the asymptotic behavior of the coefficients of the $L_1$-norm SVM. Our analysis reveals that the $L_1$-norm SVM coefficients achieve a near oracle rate; that is, with high probability, the $L_2$ error bound of the estimated $L_1$-norm SVM coefficients is of order $O_p(\sqrt{q\log p/n})$, where $q$ is the number of features with nonzero coefficients. Furthermore, we show that if the $L_1$-norm SVM is used as an initial value for a recently proposed algorithm for solving non-convex penalized SVM (Zhang et al., 2016b), then in two iterative steps it is guaranteed to produce an estimator that possesses the oracle property in ultra-high dimension, which in particular implies that with probability approaching one the zero coefficients are estimated as exactly zero. Simulation studies demonstrate the fine performance of the $L_1$-norm SVM as a sparse classifier and its effectiveness when utilized to solve non-convex penalized SVM problems in high dimension.
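
In the abstract's notation, the estimator under study and the bound it achieves are (intercept handling omitted for brevity):

    \hat{\beta} = \arg\min_{\beta}\; \frac{1}{n} \sum_{i=1}^{n} \big[\,1 - y_i x_i^{\top} \beta\,\big]_+ + \lambda \lVert \beta \rVert_1, \qquad \lVert \hat{\beta} - \beta^{*} \rVert_2 = O_p\!\left(\sqrt{q \log p / n}\right),

where $[u]_+ = \max(u, 0)$ is the hinge loss and $\beta^{*}$ is the true coefficient vector with $q$ nonzero entries.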