Arrow Research search

Author name cluster

Fu Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

14

JBHI Journal 2026 Journal Article

VGRF Signal-Based Gait Analysis for Parkinson’s Disease Detection: A Multi-Scale Directed Graph Neural Network Approach

  • Xiaotian Wang
  • Xuanhang Xu
  • Zhifu Zhao
  • Fu Li
  • Fei Qi
  • Shuo Liang

Parkinson’s Disease (PD) is often characterized by abnormal gait patterns, which can be objectively and quantitatively diagnosed using Vertical Ground Reaction Force (VGRF) signals. Previous studies have demonstrated the effectiveness of deep learning in VGRF signal analysis. However, the inherent graph structure of VGRF signals has not been adequately considered, limiting the representation of dynamic gait characteristics. To address this, we propose a Multi-Scale Adaptive Directed Graph Neural Network (MS-ADGNN) approach to distinguish the gaits of Parkinson’s patients from healthy controls. This method models the VGRF signal as a multi-scale directed graph, capturing the distribution relationships within the plantar sensors and the dynamic pressure conduction during walking. MS-ADGNN integrates an Adaptive Directed Graph Network (ADGN) unit and a Multi-Scale Temporal Convolutional Network (MSTCN) unit. ADGN extracts spatial features from three scales of the directed graph, effectively capturing local and global connectivity. MSTCN extracts multi-scale temporal features, capturing short- to long-term dependencies. The proposed method outperforms existing methods on three widely used datasets. In cross-dataset experiments, the average improvements in terms of accuracy, F1-score, and geometric mean are 2.46%, 1.25%, and 1.11%, respectively. Meanwhile, in 10-fold cross-validation experiments, the improvements are 0.78%, 0.83%, and 0.81%, respectively.

IJCAI Conference 2025 Conference Paper

Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models

  • Siyang Zhang
  • Bin Li
  • Jingtao Qi
  • Xueying Wang
  • Fu Li
  • Jianan Wang
  • En Zhu
  • Jinjing Sun

Behavior trees (BTs) provide a systematic and structured control architecture extensively employed in game AI and robotic behavior control, owing to their modularity, reactivity, and reusability. Nonetheless, manual BT design requires significant expertise and becomes inefficient as task complexity increases. Recent automation techniques reduce manual effort, but they often have high barriers to adoption and struggle to adapt to new tasks, making them difficult to configure for specific requirements. Code-BT introduces a novel approach that utilizes large language models (LLMs) to automatically generate BTs, casting task planning as the generation and organization of code sequences. By retrieving control flow information from the generated code, BTs can be efficiently constructed to address the complexity and diversity of task planning challenges. Rather than relying on manual design, Code-BT uses task instructions to guide the selection of relevant APIs, and then systematically assembles these APIs into modular code aligned with the BT structure. Finally, action sequences and control logic are extracted from the generated code to construct the BTs. Our approach not only automates BT generation but also ensures scalability and adaptability for long-term tasks. Experimental results demonstrate that Code-BT substantially improves LLM performance in BT generation, achieving improvements ranging from 16.67% to 29.17%.
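For readers unfamiliar with the control-flow constructs the abstract refers to, a behavior tree can be sketched with two standard composites, Sequence and Fallback (selector). This is an illustrative sketch of the BT formalism only, not the paper's Code-BT implementation; the node names are invented:

```python
# Minimal behavior tree sketch: Sequence and Fallback composites over
# leaf Actions. Illustrates the BT formalism, not the Code-BT pipeline.
from typing import Callable, List

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Action:
    """Leaf node: wraps a condition or primitive action."""
    def __init__(self, name: str, fn: Callable[[], bool]):
        self.name, self.fn = name, fn
    def tick(self) -> str:
        return SUCCESS if self.fn() else FAILURE

class Sequence:
    """Ticks children in order; fails at the first failing child."""
    def __init__(self, children: List):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class Fallback:
    """Ticks children in order; succeeds at the first succeeding child."""
    def __init__(self, children: List):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE

# "If the door is not open, open it; then walk through" (hypothetical task):
tree = Sequence([
    Fallback([Action("door_is_open", lambda: False),
              Action("open_door", lambda: True)]),
    Action("walk_through", lambda: True),
])
print(tree.tick())  # SUCCESS
```

Mapping generated code onto such a tree amounts to translating sequential statements into Sequence children and branching logic into Fallback children.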

ICLR Conference 2025 Conference Paper

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

  • Qinghao Ye
  • Xianhan Zeng
  • Fu Li
  • Chunyuan Li
  • Haoqi Fan 0001

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, demonstrating robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.

AAAI Conference 2023 Conference Paper

AdaCM: Adaptive ColorMLP for Real-Time Universal Photo-Realistic Style Transfer

  • Tianwei Lin
  • Honglin Lin
  • Fu Li
  • Dongliang He
  • Wenhao Wu
  • Meiling Wang
  • Xin Li
  • Yong Liu

Photo-realistic style transfer aims at migrating the artistic style from an exemplar style image to a content image, producing a result image without spatial distortions or unrealistic artifacts. Impressive results have been achieved by recent deep models. However, deep neural network based methods are too expensive to run in real-time. Meanwhile, bilateral grid based methods are much faster but still contain artifacts like overexposure. In this work, we propose the Adaptive ColorMLP (AdaCM), an effective and efficient framework for universal photo-realistic style transfer. First, we find the complex non-linear color mapping between input and target domain can be efficiently modeled by a small multi-layer perceptron (ColorMLP) model. Then, in AdaCM, we adopt a CNN encoder to adaptively predict all parameters for the ColorMLP conditioned on each input content and style image pair. Experimental results demonstrate that AdaCM can generate vivid and high-quality stylization results. Meanwhile, our AdaCM is ultrafast and can process a 4K resolution image in 6ms on one V100 GPU.

NeurIPS Conference 2021 Conference Paper

CoFiNet: Reliable Coarse-to-fine Correspondences for Robust Point Cloud Registration

  • Hao Yu
  • Fu Li
  • Mahdi Saleh
  • Benjamin Busam
  • Slobodan Ilic

We study the problem of extracting correspondences between a pair of point clouds for registration. For correspondence retrieval, existing works benefit from matching sparse keypoints detected from dense points but usually struggle to guarantee their repeatability. To address this issue, we present CoFiNet - Coarse-to-Fine Network which extracts hierarchical correspondences from coarse to fine without keypoint detection. On a coarse scale and guided by a weighting scheme, our model first learns to match down-sampled nodes whose vicinity points share more overlap, which significantly shrinks the search space of a consecutive stage. On a finer scale, node proposals are consecutively expanded to patches that consist of groups of points together with associated descriptors. Point correspondences are then refined from the overlap areas of corresponding patches, by a density-adaptive matching module capable of dealing with varying point density. Extensive evaluation of CoFiNet on both indoor and outdoor standard benchmarks shows our superiority over existing methods. Especially on 3DLoMatch, where point clouds share less overlap, CoFiNet significantly outperforms state-of-the-art approaches by at least 5% on Registration Recall, with at most two-thirds of their parameters.

MFCS Conference 2021 Conference Paper

Maximum Votes Pareto-Efficient Allocations via Swaps on a Social Network

  • Fu Li
  • Xiong Zheng

In recent work, Gourvès, Lesca, and Wilczynski (IJCAI 2017) propose a variant of the classic housing markets model in which the matching between agents and objects evolves through Pareto-improving swaps between pairs of agents who are adjacent in a social network. To explore the swap dynamics of their model, they pose several basic questions concerning the set of reachable matchings, and investigate the computational complexity of these questions when the graph structure of the social network is a star, path, or tree, or is unrestricted. We are interested in how to direct the agents to swap objects with each other in order to arrive at a reachable matching that is both efficient and most agreeable. In particular, we study the computational complexity of reaching a Pareto-efficient matching that maximizes the number of agents who prefer their match to their initial endowments. We consider various graph structures of the social network: a path, a star, a tree, or unrestricted. Additionally, we consider two assumptions regarding the preference relations of agents: strict (ties among objects not allowed) or weak (ties among objects allowed). By designing two polynomial-time algorithms and two NP-hardness reductions, we resolve the complexity of all cases not yet known. Our main contributions include a polynomial-time algorithm for path networks with strict preferences and an NP-hardness result for star networks with weak preferences.
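The swap dynamic underlying this model can be sketched as a simple fixed-point iteration: adjacent agents trade whenever both strictly prefer the other's current object. This is an illustration of the model only, not one of the paper's algorithms; note that the greedy loop merely reaches a matching with no remaining Pareto-improving swap, and directing swaps toward an efficient, most-agreeable matching is exactly the harder question the paper studies:

```python
# Illustrative sketch of Pareto-improving swap dynamics on a network.
# prefs[i] ranks objects best-first; endowment[i] is agent i's object;
# edges lists adjacent agent pairs. Swaps repeat until none applies.
def swap_dynamics(prefs, endowment, edges):
    match = list(endowment)
    rank = [{obj: r for r, obj in enumerate(p)} for p in prefs]
    changed = True
    while changed:
        changed = False
        for i, j in edges:
            # A swap is Pareto-improving iff each agent strictly
            # prefers the other's current object to their own.
            if (rank[i][match[j]] < rank[i][match[i]] and
                    rank[j][match[i]] < rank[j][match[j]]):
                match[i], match[j] = match[j], match[i]
                changed = True
    return match

# Three agents on a path 0-1-2 with objects 'a', 'b', 'c':
prefs = [['b', 'a', 'c'], ['a', 'b', 'c'], ['c', 'a', 'b']]
print(swap_dynamics(prefs, ['a', 'b', 'c'], [(0, 1), (1, 2)]))
# Agents 0 and 1 trade 'a' and 'b'; no further swap applies.
```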

AAAI Conference 2021 Conference Paper

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

  • Wenhao Wu
  • Dongliang He
  • Tianwei Lin
  • Fu Li
  • Chuang Gan
  • Errui Ding

Conventionally, spatiotemporal modeling and its computational complexity are the two most studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions remain slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H × W × T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is designed on 2D CNN backbones with model complexity well kept in mind. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and can specialize to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet achieves state-of-the-art performance while maintaining a 2D CNN’s complexity.

AAMAS Conference 2021 Conference Paper

Object Allocation Over a Network of Objects: Mobile Agents with Strict Preferences

  • Fu Li
  • C. Gregory Plaxton
  • Vaibhav B. Sinha

In recent work, Gourvès, Lesca, and Wilczynski propose a variant of the classic housing markets model where the matching between agents and objects evolves through Pareto-improving swaps between pairs of adjacent agents in a social network. To explore the swap dynamics of their model, they pose several basic questions concerning the set of reachable matchings. In their work and other follow-up works, these questions have been studied for various classes of graphs: stars, paths, generalized stars (i.e., trees where at most one vertex has degree greater than two), trees, and cliques. For generalized stars and trees, it remains open whether a Pareto-efficient reachable matching can be found in polynomial time. In this paper, we pursue the same set of questions under a natural variant of their model. In our model, the social network is replaced by a network of objects, and a swap is allowed to take place between two agents if it is Pareto-improving and the associated objects are adjacent in the network. In those cases where the question of polynomial-time solvability versus NP-hardness has been resolved for the social network model, we are able to show that the same result holds for the network-of-objects model. In addition, for our model, we present a polynomial-time algorithm for computing a Pareto-efficient reachable matching in generalized star networks. Moreover, the object reachability algorithm that we present for path networks is significantly faster than the known polynomial-time algorithms for the same question in the social network model.

AAAI Conference 2020 Conference Paper

Multi-Label Classification with Label Graph Superimposing

  • Ya Wang
  • Dongliang He
  • Fu Li
  • Xiang Long
  • Zhichao Zhou
  • Jinwen Ma
  • Shilei Wen

Images or videos often contain multiple objects or actions. Multi-label recognition has achieved strong performance owing to the rapid development of deep learning technologies. Recently, the graph convolution network (GCN) has been leveraged to boost the performance of multi-label recognition. However, the best way to model label correlations, and how feature learning can be improved with label-system awareness, remain unclear. In this paper, we propose a label graph superimposing framework to improve the conventional GCN+CNN framework developed for multi-label recognition in the following two aspects. Firstly, we model the label correlations by superimposing a label graph built from statistical co-occurrence information onto the graph constructed from knowledge priors of labels, and then apply multi-layer graph convolutions on the final superimposed graph for label embedding abstraction. Secondly, we propose to leverage the embedding of the whole label system for better representation learning. In detail, lateral connections between the GCN and CNN are added at shallow, middle, and deep layers to inject label-system information into the backbone CNN for label-awareness in the feature learning process. Extensive experiments are carried out on the MS-COCO and Charades datasets, showing that our proposed solution can greatly improve recognition performance and achieves new state-of-the-art results.

AAAI Conference 2019 Conference Paper

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

  • Dongliang He
  • Xiang Zhao
  • Jizhou Huang
  • Fu Li
  • Xiao Liu
  • Shilei Wen

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a sequential decision-making problem by learning an agent that progressively regulates the temporal grounding boundaries based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, which shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.

AAAI Conference 2019 Conference Paper

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

  • Dongliang He
  • Zhichao Zhou
  • Chuang Gan
  • Fu Li
  • Xiao Liu
  • Yandong Li
  • Limin Wang
  • Shilei Wen

Despite the success of deep learning for static image understanding, it remains unclear what the most effective network architectures are for spatial-temporal modeling in videos. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Particularly, StNet stacks N successive video frames into a super-image which has 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.

AAAI Conference 2018 Conference Paper

Multimodal Keyless Attention Fusion for Video Classification

  • Xiang Long
  • Chuang Gan
  • Gerard de Melo
  • Xiao Liu
  • Yandong Li
  • Fu Li
  • Shilei Wen

The problem of video classification is inherently sequential and multimodal, and deep neural models hence need to capture and aggregate the most pertinent signals for a given input video. We propose Keyless Attention as an elegant and efficient means to more effectively account for the sequential nature of the data. Moreover, comparing a variety of multimodal fusion methods, we find that Multimodal Keyless Attention Fusion is the most successful at discerning interactions between modalities. We experiment on four highly heterogeneous datasets, UCF101, ActivityNet, Kinetics, and YouTube-8M, to validate our conclusion, and show that our approach achieves highly competitive results. Especially on large-scale data, our method has great advantages in efficiency and performance. Most remarkably, our best single model can achieve 77.0% top-1 accuracy and 93.2% top-5 accuracy on the Kinetics validation set, and 82.2% GAP@20 on the official YouTube-8M test set.
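The "keyless" mechanism the abstract names can be sketched as attention that scores each timestep with a single learned vector, with no query/key pair. This is an assumed minimal form for illustration, not the paper's released code; the shapes and variable names are invented:

```python
# Sketch of query-free ("keyless") attention pooling over a sequence:
# a learned vector w scores each timestep; a softmax over time yields
# weights; the output is the weighted average of timestep features.
import numpy as np

rng = np.random.default_rng(0)

def keyless_attention(features, w):
    """features: (T, D) sequence of features; w: (D,) scoring vector.
    Returns a (D,) attention-pooled summary of the sequence."""
    scores = features @ w                   # (T,) one scalar per timestep
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ features               # (D,) weighted average

feats = rng.standard_normal((8, 4))  # 8 timesteps, 4-dim features
w = rng.standard_normal(4)           # learned in practice; random here
pooled = keyless_attention(feats, w)
print(pooled.shape)  # (4,)
```

For multimodal fusion, one such pooled vector per modality would be computed and the results concatenated before classification.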

NeurIPS Conference 2016 Conference Paper

Combinatorial Multi-Armed Bandit with General Reward Functions

  • Wei Chen
  • Wei Hu
  • Fu Li
  • Jian Li
  • Yu Liu
  • Pinyan Lu

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions such as the $\max()$ function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve $O(\log T)$ distribution-dependent regret and $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $T$ is the time horizon. We apply our results to the $K$-MAX problem and expected utility maximization problems. In particular, for $K$-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first $\tilde{O}(\sqrt{T})$ bound on the $(1-\epsilon)$-approximation regret of its online problem, for any $\epsilon>0$.
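The abstract's central point, that $\mathbb{E}[\max(\cdot)]$ depends on the whole distributions rather than just the means, can be checked numerically. This is a standalone illustration of that fact, not the SDCB algorithm; the two-arm setup is invented:

```python
# Two arms with IDENTICAL mean 0.5 but different distributions give
# different expected max() rewards, so mean-based UCB estimates cannot
# distinguish them. (Illustration only; not the paper's SDCB algorithm.)
import random

random.seed(0)
N = 100_000

def expected_max(sample_x, sample_y):
    """Monte Carlo estimate of E[max(X, Y)] from two samplers."""
    return sum(max(sample_x(), sample_y()) for _ in range(N)) / N

det = lambda: 0.5                            # deterministic, mean 0.5
ber = lambda: float(random.random() < 0.5)   # Bernoulli(0.5), mean 0.5

e1 = expected_max(det, det)  # max(0.5, 0.5) = 0.5 exactly
e2 = expected_max(ber, det)  # E[max(Bernoulli(0.5), 0.5)] = 0.75
print(round(e1, 2), round(e2, 2))  # 0.5 0.75
```

Swapping the Bernoulli arm in changes the expected reward from 0.5 to 0.75 even though its mean is unchanged, which is why SDCB tracks distribution estimates instead of mean estimates.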