Arrow Research search

Author name cluster

Fu Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

14

JBHI Journal 2026 Journal Article

VGRF Signal-Based Gait Analysis for Parkinson’s Disease Detection: A Multi-Scale Directed Graph Neural Network Approach

  • Xiaotian Wang
  • Xuanhang Xu
  • Zhifu Zhao
  • Fu Li
  • Fei Qi
  • Shuo Liang

Parkinson’s Disease (PD) is often characterized by abnormal gait patterns, which can be objectively and quantitatively diagnosed using Vertical Ground Reaction Force (VGRF) signals. Previous studies have demonstrated the effectiveness of deep learning in VGRF signal analysis. However, the inherent graph structure of VGRF signals has not been adequately considered, limiting the representation of dynamic gait characteristics. To address this, we propose a Multi-Scale Adaptive Directed Graph Neural Network (MS-ADGNN) approach to distinguish the gaits of Parkinson’s patients from healthy controls. This method models the VGRF signal as a multi-scale directed graph, capturing the distribution relationships within the plantar sensors and the dynamic pressure conduction during walking. MS-ADGNN integrates an Adaptive Directed Graph Network (ADGN) unit and a Multi-Scale Temporal Convolutional Network (MSTCN) unit. ADGN extracts spatial features from three scales of the directed graph, effectively capturing local and global connectivity. MSTCN extracts multi-scale temporal features, capturing short- to long-term dependencies. The proposed method outperforms existing methods on three widely used datasets. In cross-dataset experiments, the average improvements in terms of accuracy, F1-score, and geometric mean are 2.46%, 1.25%, and 1.11%, respectively. Meanwhile, in 10-fold cross-validation experiments, the improvements are 0.78%, 0.83%, and 0.81%, respectively.

IJCAI Conference 2025 Conference Paper

Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models

  • Siyang Zhang
  • Bin Li
  • Jingtao Qi
  • Xueying Wang
  • Fu Li
  • Jianan Wang
  • En Zhu
  • Jinjing Sun

Behavior trees (BTs) provide a systematic and structured control architecture extensively employed in game AI and robotic behavior control, owing to their modularity, reactivity, and reusability. Nonetheless, manual BT design requires significant expertise and becomes inefficient as task complexity increases. Recent automation techniques reduce manual effort, but they often have high barriers to adoption and struggle to adapt to new tasks, making them difficult to configure for specific requirements. Code-BT introduces a novel approach that utilizes large language models (LLMs) to automatically generate BTs, casting task planning as the generation and organization of code sequences. By retrieving control flow information from the generated code, BTs can be efficiently constructed to address the complexity and diversity of task planning challenges. Rather than relying on manual design, Code-BT uses task instructions to guide the selection of relevant APIs, and then systematically assembles these APIs into modular code aligned with the BT structure. Finally, action sequences and control logic are extracted from the generated code to construct the BTs. Our approach not only automates BT generation but also ensures scalability and adaptability for long-term tasks. Experimental results demonstrate that Code-BT substantially improves LLM performance in BT generation, achieving improvements ranging from 16.67% to 29.17%.
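For readers unfamiliar with the control-flow constructs the abstract refers to, a behavior tree can be sketched with two standard composites, Sequence and Fallback (selector). This is an illustrative sketch of the BT formalism only, not the paper's Code-BT implementation; the node names are invented:

```python
# Minimal behavior tree sketch: Sequence and Fallback composites over
# leaf Actions. Illustrates the BT formalism, not the Code-BT pipeline.
from typing import Callable, List

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Action:
    """Leaf node: wraps a condition or primitive action."""
    def __init__(self, name: str, fn: Callable[[], bool]):
        self.name, self.fn = name, fn
    def tick(self) -> str:
        return SUCCESS if self.fn() else FAILURE

class Sequence:
    """Ticks children in order; fails at the first failing child."""
    def __init__(self, children: List):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class Fallback:
    """Ticks children in order; succeeds at the first succeeding child."""
    def __init__(self, children: List):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE

# "If the door is not open, open it; then walk through" (hypothetical task):
tree = Sequence([
    Fallback([Action("door_is_open", lambda: False),
              Action("open_door", lambda: True)]),
    Action("walk_through", lambda: True),
])
print(tree.tick())  # SUCCESS
```

Mapping generated code onto such a tree amounts to translating sequential statements into Sequence children and branching logic into Fallback children.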

ICLR Conference 2025 Conference Paper

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

  • Qinghao Ye
  • Xianhan Zeng
  • Fu Li
  • Chunyuan Li
  • Haoqi Fan 0001

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, demonstrating robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.

AAAI Conference 2023 Conference Paper

AdaCM: Adaptive ColorMLP for Real-Time Universal Photo-Realistic Style Transfer

  • Tianwei Lin
  • Honglin Lin
  • Fu Li
  • Dongliang He
  • Wenhao Wu
  • Meiling Wang
  • Xin Li
  • Yong Liu

Photo-realistic style transfer aims at migrating the artistic style from an exemplar style image to a content image, producing a result image without spatial distortions or unrealistic artifacts. Impressive results have been achieved by recent deep models. However, deep neural network based methods are too expensive to run in real-time. Meanwhile, bilateral grid based methods are much faster but still contain artifacts like overexposure. In this work, we propose the Adaptive ColorMLP (AdaCM), an effective and efficient framework for universal photo-realistic style transfer. First, we find the complex non-linear color mapping between input and target domain can be efficiently modeled by a small multi-layer perceptron (ColorMLP) model. Then, in AdaCM, we adopt a CNN encoder to adaptively predict all parameters for the ColorMLP conditioned on each input content and style image pair. Experimental results demonstrate that AdaCM can generate vivid and high-quality stylization results. Meanwhile, our AdaCM is ultrafast and can process a 4K resolution image in 6ms on one V100 GPU.

NeurIPS Conference 2021 Conference Paper

CoFiNet: Reliable Coarse-to-fine Correspondences for Robust Point Cloud Registration

  • Hao Yu
  • Fu Li
  • Mahdi Saleh
  • Benjamin Busam
  • Slobodan Ilic

We study the problem of extracting correspondences between a pair of point clouds for registration. For correspondence retrieval, existing works benefit from matching sparse keypoints detected from dense points but usually struggle to guarantee their repeatability. To address this issue, we present CoFiNet - Coarse-to-Fine Network which extracts hierarchical correspondences from coarse to fine without keypoint detection. On a coarse scale and guided by a weighting scheme, our model first learns to match down-sampled nodes whose vicinity points share more overlap, which significantly shrinks the search space of a consecutive stage. On a finer scale, node proposals are consecutively expanded to patches that consist of groups of points together with associated descriptors. Point correspondences are then refined from the overlap areas of corresponding patches, by a density-adaptive matching module capable of dealing with varying point density. Extensive evaluation of CoFiNet on both indoor and outdoor standard benchmarks shows our superiority over existing methods. Especially on 3DLoMatch, where point clouds share less overlap, CoFiNet significantly outperforms state-of-the-art approaches by at least 5% on Registration Recall, with at most two-thirds of their parameters.

MFCS Conference 2021 Conference Paper

Maximum Votes Pareto-Efficient Allocations via Swaps on a Social Network

  • Fu Li
  • Xiong Zheng

In recent work, Gourvès, Lesca, and Wilczynski (IJCAI 2017) propose a variant of the classic housing markets model in which the matching between agents and objects evolves through Pareto-improving swaps between pairs of agents who are adjacent in a social network. To explore the swap dynamics of their model, they pose several basic questions concerning the set of reachable matchings, and investigate the computational complexity of these questions when the graph structure of the social network is a star, path, or tree, or is unrestricted. We are interested in how to direct the agents to swap objects with each other in order to arrive at a reachable matching that is both efficient and most agreeable. In particular, we study the computational complexity of reaching a Pareto-efficient matching that maximizes the number of agents who prefer their match to their initial endowments. We consider various graph structures of the social network: a path, a star, a tree, or unrestricted. Additionally, we consider two assumptions regarding the preference relations of agents: strict (ties among objects not allowed) or weak (ties among objects allowed). By designing two polynomial-time algorithms and two NP-hardness reductions, we resolve the complexity of all cases not yet known. Our main contributions include a polynomial-time algorithm for path networks with strict preferences and an NP-hardness result for star networks with weak preferences.
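The swap dynamic underlying this model can be sketched as a simple fixed-point iteration: adjacent agents trade whenever both strictly prefer the other's current object. This is an illustration of the model only, not one of the paper's algorithms; note that the greedy loop merely reaches a matching with no remaining Pareto-improving swap, and directing swaps toward an efficient, most-agreeable matching is exactly the harder question the paper studies:

```python
# Illustrative sketch of Pareto-improving swap dynamics on a network.
# prefs[i] ranks objects best-first; endowment[i] is agent i's object;
# edges lists adjacent agent pairs. Swaps repeat until none applies.
def swap_dynamics(prefs, endowment, edges):
    match = list(endowment)
    rank = [{obj: r for r, obj in enumerate(p)} for p in prefs]
    changed = True
    while changed:
        changed = False
        for i, j in edges:
            # A swap is Pareto-improving iff each agent strictly
            # prefers the other's current object to their own.
            if (rank[i][match[j]] < rank[i][match[i]] and
                    rank[j][match[i]] < rank[j][match[j]]):
                match[i], match[j] = match[j], match[i]
                changed = True
    return match

# Three agents on a path 0-1-2 with objects 'a', 'b', 'c':
prefs = [['b', 'a', 'c'], ['a', 'b', 'c'], ['c', 'a', 'b']]
print(swap_dynamics(prefs, ['a', 'b', 'c'], [(0, 1), (1, 2)]))
# Agents 0 and 1 trade 'a' and 'b'; no further swap applies.
```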

AAAI Conference 2021 Conference Paper

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

  • Wenhao Wu
  • Dongliang He
  • Tianwei Lin
  • Fu Li
  • Chuang Gan
  • Errui Ding

Conventionally, spatiotemporal modeling and its computational complexity are the two most studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions remain slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H × W × T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is designed on 2D CNN backbones with model complexity well kept in mind. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and can specialize to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet achieves state-of-the-art performance while maintaining a 2D CNN’s complexity.

AAMAS Conference 2021 Conference Paper

Object Allocation Over a Network of Objects: Mobile Agents with Strict Preferences

  • Fu Li
  • C. Gregory Plaxton
  • Vaibhav B. Sinha

In recent work, Gourvès, Lesca, and Wilczynski propose a variant of the classic housing markets model where the matching between agents and objects evolves through Pareto-improving swaps between pairs of adjacent agents in a social network. To explore the swap dynamics of their model, they pose several basic questions concerning the set of reachable matchings. In their work and other follow-up works, these questions have been studied for various classes of graphs: stars, paths, generalized stars (i.e., trees where at most one vertex has degree greater than two), trees, and cliques. For generalized stars and trees, it remains open whether a Pareto-efficient reachable matching can be found in polynomial time. In this paper, we pursue the same set of questions under a natural variant of their model. In our model, the social network is replaced by a network of objects, and a swap is allowed to take place between two agents if it is Pareto-improving and the associated objects are adjacent in the network. In those cases where the question of polynomial-time solvability versus NP-hardness has been resolved for the social network model, we are able to show that the same result holds for the network-of-objects model. In addition, for our model, we present a polynomial-time algorithm for computing a Pareto-efficient reachable matching in generalized star networks. Moreover, the object reachability algorithm that we present for path networks is significantly faster than the known polynomial-time algorithms for the same question in the social network model.

AAAI Conference 2020 Conference Paper

Multi-Label Classification with Label Graph Superimposing

  • Ya Wang
  • Dongliang He
  • Fu Li
  • Xiang Long
  • Zhichao Zhou
  • Jinwen Ma
  • Shilei Wen

Images or videos often contain multiple objects or actions. Multi-label recognition has achieved strong performance owing to the rapid development of deep learning technologies. Recently, the graph convolution network (GCN) has been leveraged to boost the performance of multi-label recognition. However, the best way to model label correlations, and how feature learning can be improved with label-system awareness, remain unclear. In this paper, we propose a label graph superimposing framework to improve the conventional GCN+CNN framework developed for multi-label recognition in the following two aspects. Firstly, we model the label correlations by superimposing a label graph built from statistical co-occurrence information onto the graph constructed from knowledge priors of labels, and then apply multi-layer graph convolutions on the final superimposed graph for label embedding abstraction. Secondly, we propose to leverage the embedding of the whole label system for better representation learning. In detail, lateral connections between the GCN and CNN are added at shallow, middle, and deep layers to inject label-system information into the backbone CNN for label-awareness in the feature learning process. Extensive experiments are carried out on the MS-COCO and Charades datasets, showing that our proposed solution can greatly improve recognition performance and achieves new state-of-the-art results.

AAAI Conference 2019 Conference Paper

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

  • Dongliang He
  • Xiang Zhao
  • Jizhou Huang
  • Fu Li
  • Xiao Liu
  • Shilei Wen

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a sequential decision-making problem by learning an agent that progressively regulates the temporal grounding boundaries based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, which shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.

AAAI Conference 2019 Conference Paper

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

  • Dongliang He
  • Zhichao Zhou
  • Chuang Gan
  • Fu Li
  • Xiao Liu
  • Yandong Li
  • Limin Wang
  • Shilei Wen

Despite the success of deep learning for static image understanding, it remains unclear what the most effective network architectures are for spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Particularly, StNet stacks N successive video frames into a super-image which has 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.

AAAI Conference 2018 Conference Paper

Multimodal Keyless Attention Fusion for Video Classification

  • Xiang Long
  • Chuang Gan
  • Gerard de Melo
  • Xiao Liu
  • Yandong Li
  • Fu Li
  • Shilei Wen

The problem of video classification is inherently sequential and multimodal, and deep neural models hence need to capture and aggregate the most pertinent signals for a given input video. We propose Keyless Attention as an elegant and efficient means to more effectively account for the sequential nature of the data. Moreover, comparing a variety of multimodal fusion methods, we find that Multimodal Keyless Attention Fusion is the most successful at discerning interactions between modalities. We experiment on four highly heterogeneous datasets, UCF101, ActivityNet, Kinetics, and YouTube-8M, to validate our conclusion, and show that our approach achieves highly competitive results. Especially on large-scale data, our method has great advantages in efficiency and performance. Most remarkably, our best single model can achieve 77.0% top-1 accuracy and 93.2% top-5 accuracy on the Kinetics validation set, and 82.2% GAP@20 on the official YouTube-8M test set.
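The "keyless" mechanism the abstract names can be sketched as attention that scores each timestep with a single learned vector, with no query/key pair. This is an assumed minimal form for illustration, not the paper's released code; the shapes and variable names are invented:

```python
# Sketch of query-free ("keyless") attention pooling over a sequence:
# a learned vector w scores each timestep; a softmax over time yields
# weights; the output is the weighted average of timestep features.
import numpy as np

rng = np.random.default_rng(0)

def keyless_attention(features, w):
    """features: (T, D) sequence of features; w: (D,) scoring vector.
    Returns a (D,) attention-pooled summary of the sequence."""
    scores = features @ w                   # (T,) one scalar per timestep
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ features               # (D,) weighted average

feats = rng.standard_normal((8, 4))  # 8 timesteps, 4-dim features
w = rng.standard_normal(4)           # learned in practice; random here
pooled = keyless_attention(feats, w)
print(pooled.shape)  # (4,)
```

For multimodal fusion, one such pooled vector per modality would be computed and the results concatenated before classification.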

NeurIPS Conference 2016 Conference Paper

Combinatorial Multi-Armed Bandit with General Reward Functions

  • Wei Chen
  • Wei Hu
  • Fu Li
  • Jian Li
  • Yu Liu
  • Pinyan Lu

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve $O(\log T)$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. We apply our results to the $K$-MAX problem and expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon>0$.
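The abstract's central point, that $\mathbb{E}[\max(\cdot)]$ depends on the whole distributions rather than just the means, can be checked numerically. This is a standalone illustration of that fact, not the SDCB algorithm; the two-arm setup is invented:

```python
# Two arms with IDENTICAL mean 0.5 but different distributions give
# different expected max() rewards, so mean-based UCB estimates cannot
# distinguish them. (Illustration only; not the paper's SDCB algorithm.)
import random

random.seed(0)
N = 100_000

def expected_max(sample_x, sample_y):
    """Monte Carlo estimate of E[max(X, Y)] from two samplers."""
    return sum(max(sample_x(), sample_y()) for _ in range(N)) / N

det = lambda: 0.5                            # deterministic, mean 0.5
ber = lambda: float(random.random() < 0.5)   # Bernoulli(0.5), mean 0.5

e1 = expected_max(det, det)  # max(0.5, 0.5) = 0.5 exactly
e2 = expected_max(ber, det)  # E[max(Bernoulli(0.5), 0.5)] = 0.75
print(round(e1, 2), round(e2, 2))  # 0.5 0.75
```

Swapping the Bernoulli arm in changes the expected reward from 0.5 to 0.75 even though its mean is unchanged, which is why SDCB tracks distribution estimates instead of mean estimates.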