Arrow Research search

Author name cluster

Le Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
2 author rows

Possible papers

25

AAAI Conference 2026 Conference Paper

FeTS: A Feature-Aware Framework for Time Series Forecasting

  • Le Wang
  • Jianyong Chen
  • Songbai Liu

Time series forecasting faces a fundamental challenge: the uneven distribution of predictive importance in time series data, where certain time points and feature combinations carry disproportionate predictive power. As a result, uniform processing methods that treat all data alike inevitably fall short of optimal performance. To address this problem, we propose FeTS, a feature-aware framework that comprehensively learns temporal features through two key components: (i) Adaptive Feature Extraction (AdaFE), which dynamically discovers the most important features within each temporal patch and extracts them on the fly, yielding sharper and more focused local representations; and (ii) Dual-Scale Feed-Forward Network (DSFFN), which strategically integrates fine-grained local features with global long-term dependencies to achieve richer dual-scale representation learning. Extensive experiments on eight benchmark datasets demonstrate that FeTS achieves state-of-the-art performance in time series forecasting tasks, offering a novel solution to the challenge of uneven predictive importance in forecasting.
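
The patch-level selection of salient features that the abstract attributes to AdaFE can be illustrated with a small, hedged sketch. The module below is a hypothetical stand-in, not the paper's implementation: it scores positions inside each temporal patch with a linear layer and keeps only the top-k, assuming "features" here means positions within a patch and that hard top-k masking is an acceptable simplification.

import torch
import torch.nn as nn

class PatchTopKSelector(nn.Module):
    """Illustrative patch-wise feature selection; sizes and scoring rule are assumptions."""
    def __init__(self, patch_len: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(patch_len, patch_len)  # score each position within a patch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, patch_len) patches of a univariate series
        scores = self.scorer(x)
        topk = scores.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
        return x * mask  # keep only the k most salient positions per patch

x = torch.randn(8, 12, 16)                   # 8 series, 12 patches of length 16
print(PatchTopKSelector(16, k=4)(x).shape)   # torch.Size([8, 12, 16])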

AAAI Conference 2026 Conference Paper

HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses Through Reasoning MLLMs

  • Zheng Qin
  • Ruobing Zheng
  • Yabing Wang
  • Tianqi Li
  • Yi Yuan
  • Jingdong Chen
  • Le Wang

While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, grounded in the observation that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, we posit that reasoning ability serves as the key to unlocking it. We devise a multi-stage, modality-progressive reinforcement learning approach, resulting in HumanSense-Omni-Reasoning, which substantially enhances performance on higher-level understanding and interactive tasks. Additionally, we observe that successful reasoning processes appear to exhibit consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner.

JBHI Journal 2026 Journal Article

Spec-ViT: A Vision Transformer With Wavelet for Anti-Aliasing and Denoising in Medical Image Classification

  • Xiong Zhang
  • Le Wang
  • Yanying Rao
  • Yuzheng Su
  • Fayaz Ali Dharejo
  • Radu Timofte
  • Guo-jun Mao
  • Moath Alathbah

Medical image analysis remains challenging due to inherent limitations in imaging modalities, where structural aliasing and noise artifacts persistently compromise diagnostic accuracy. While convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable progress in feature extraction, their inherent sampling mechanisms and spectral biases often exacerbate these high-frequency distortions, leading to suboptimal lesion characterization. To address this critical limitation, we propose Spec-ViT, a novel wavelet-based anti-aliasing Transformer architecture that synergistically integrates adaptive spectral purification with hierarchical attentive learning. The Wavelet Antialiasing Module (WAM) first applies a learnable smoothing factor in the wavelet domain to suppress high-frequency artifacts, while preserving clinically relevant low-frequency structures and fine diagnostic details. Building upon this spectral foundation, the Lightweight Enhanced Attention (LEA) refines feature representations through a dual-path mechanism, coupling channel-spatial attention with global multi-head self-attention to enhance lesion context modeling. Finally, the Smoothed Convolutional Gate (SCG) further sharpens local discriminability through depthwise convolution and adaptive Swish gating, completing a coherent pipeline from frequency-aware purification to global-local attentive analysis. Extensive experiments on five benchmark medical image classification datasets demonstrate that Spec-ViT consistently outperforms both baseline and state-of-the-art methods, achieving up to 84.04% accuracy on the Pediatric Pneumonia Chest X-rays dataset in particular.
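
The operation WAM is described as performing, attenuating high-frequency wavelet content with a learnable smoothing factor, can be sketched with a single-level Haar transform. This is a minimal illustration under assumed design choices (one decomposition level, a single scalar factor shared across the three high-frequency sub-bands), not the paper's WAM.

import torch
import torch.nn as nn

class LearnableHaarSmoothing(nn.Module):
    """Single-level Haar DWT, scale the high-frequency sub-bands by a learnable
    factor in (0, 1), then invert the transform. Purely illustrative."""
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))  # smoothing factor = sigmoid(logit)

    def forward(self, x):
        # x: (B, C, H, W) with even H and W
        a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
        c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2                   # low-frequency sub-band
        lh = (a + b - c - d) / 2
        hl = (a - b + c - d) / 2
        hh = (a - b - c + d) / 2
        s = torch.sigmoid(self.logit)              # learnable smoothing factor
        lh, hl, hh = s * lh, s * hl, s * hh
        # inverse Haar transform
        a = (ll + lh + hl + hh) / 2
        b = (ll + lh - hl - hh) / 2
        c = (ll - lh + hl - hh) / 2
        d = (ll - lh - hl + hh) / 2
        out = torch.zeros_like(x)
        out[..., 0::2, 0::2] = a; out[..., 0::2, 1::2] = b
        out[..., 1::2, 0::2] = c; out[..., 1::2, 1::2] = d
        return out

print(LearnableHaarSmoothing()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 3, 224, 224])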

IROS Conference 2025 Conference Paper

An insect-scale multimodal amphibious piezoelectric robot

  • Le Wang
  • Xin Wang
  • Hanlin Wang
  • Xiqing Zuo
  • Chao Xu

Miniature amphibious robots are capable of performing various tasks in complex terrestrial and aquatic environments due to their superior adaptability. However, the mobility of existing miniature amphibious robots in such environments is limited by their complex locomotion systems and single mode of motion. This work presents a novel insect-scale amphibious robot powered by a single piezoelectric actuator. A prototype of the robot is fabricated and preliminarily tested. By exploiting the different vibration modes of the piezoelectric actuator, the robot achieves movement in an amphibious environment. The robot employs the acoustic flow generated by the higher-order mode to achieve rapid motion at the water surface. In addition, the robot attains forward and backward motion on the ground by means of the friction force between the driving feet and the ground. The findings of this study offer significant insights into the development of amphibious robots that exhibit enhanced flexibility and adaptability. These insights lay the foundation for the future applications of such robots in narrow amphibious settings.

AAAI Conference 2025 Conference Paper

Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding

  • Xiaolong Sun
  • Liushuai Shi
  • Le Wang
  • Sanping Zhou
  • Kun Xia
  • Yabing Wang
  • Gang Hua

Temporal sentence grounding is a challenging task that aims to localize the moment spans relevant to a language description. Although recent DETR-based models have achieved notable progress by leveraging multiple learnable moment queries, they suffer from overlapped and redundant proposals, leading to inaccurate predictions. We attribute this limitation to the lack of task-related guidance for the learnable queries to serve a specific mode. Furthermore, the complex solution space generated by variable and open-vocabulary language descriptions complicates optimization, making it harder for the learnable queries to adaptively distinguish themselves from one another and leading to even more severely overlapped proposals. To address this limitation, we present the Region-Guided TRansformer (RGTR) for temporal sentence grounding, which introduces regional guidance to increase query diversity and eliminate overlapped proposals. Instead of using learnable queries, RGTR adopts a set of anchor pairs as moment queries to introduce explicit regional guidance. Each moment query takes charge of moment prediction for a specific temporal region, which reduces the optimization difficulty and ensures the diversity of the proposals. In addition, we design an IoU-aware scoring head to improve proposal quality. Extensive experiments demonstrate the effectiveness of RGTR, outperforming state-of-the-art methods on three public benchmarks and exhibiting good generalization and robustness on out-of-distribution splits.
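
The IoU-aware scoring head mentioned in the abstract is naturally supervised by the temporal IoU between a predicted span and the ground-truth moment. A minimal helper for that quantity is sketched below, assuming spans are (start, end) pairs on a normalized time axis.

import torch

def temporal_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred, gt: (..., 2) spans given as (start, end) on a normalized time axis
    inter = (torch.minimum(pred[..., 1], gt[..., 1])
             - torch.maximum(pred[..., 0], gt[..., 0])).clamp(min=0)
    union = (pred[..., 1] - pred[..., 0]) + (gt[..., 1] - gt[..., 0]) - inter
    return inter / union.clamp(min=1e-6)

print(temporal_iou(torch.tensor([0.2, 0.6]), torch.tensor([0.3, 0.7])))  # tensor(0.6000)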

NeurIPS Conference 2025 Conference Paper

DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

  • Jingyi Tian
  • Le Wang
  • Sanping Zhou
  • Sen Wang
  • Gang Hua

Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
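
The triplane representation the abstract refers to is typically queried by projecting each 3D point onto three axis-aligned feature planes and fusing the sampled features, as sketched below; the plane resolution, channel count, and the choice to sum the three samples are illustrative assumptions rather than DynaRend's exact design.

import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    # planes: (3, C, R, R) feature planes for the XY, XZ, YZ planes
    # pts: (P, 3) query points with coordinates in [-1, 1]
    coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.reshape(1, 1, -1, 2)                   # (1, 1, P, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode='bilinear', align_corners=True)
        feats.append(sampled.squeeze(0).squeeze(1).t())  # (P, C)
    return sum(feats)                                    # fused per-point feature

planes = torch.randn(3, 32, 64, 64)
pts = torch.rand(100, 3) * 2 - 1
print(query_triplane(planes, pts).shape)  # torch.Size([100, 32])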

AAAI Conference 2025 Conference Paper

RefDetector: A Simple Yet Effective Matching-based Method for Referring Expression Comprehension

  • Yabing Wang
  • Zhuotao Tian
  • Zheng Qin
  • Sanping Zhou
  • Le Wang

Despite the rapid and substantial advancements in object detection, it continues to face limitations imposed by pre-defined category sets. Current methods for visual grounding primarily focus on how to better leverage the visual backbone to generate text-tailored visual features, which may require adjusting the parameters of the entire model. Besides, some early methods, i.e., matching-based methods, build upon and extend the functionality of existing object detectors by enabling them to localize an object based on free-form linguistic expressions, which have good application potential. However, the potential of the matching-based approach has not been fully realized due to inadequate exploration. In this paper, we first analyze the limitations that exist in the current matching-based method (i.e., the mismatch problem and complicated fusion mechanisms), and then present a simple yet effective matching-based method, namely RefDetector. To tackle the above issues, we devise a simple heuristic rule to generate proposals with improved referent recall. Additionally, we introduce a straightforward vision-language interaction module that eliminates the need for intricate manually-designed mechanisms. Moreover, we explore visual grounding based on the modern detector DETR, and achieve significant performance improvement. Extensive experiments on three REC benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg, validate the effectiveness of the proposed method.

NeurIPS Conference 2025 Conference Paper

SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models

  • Sen Wang
  • Jingyi Tian
  • Le Wang
  • Zhimin Liao
  • Huaiyi Dong
  • Kun Xia
  • Sanping Zhou
  • Wei Tang

World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

NeurIPS Conference 2024 Conference Paper

Referencing Where to Focus: Improving Visual Grounding with Referential Query

  • Yabing Wang
  • Zhuotao Tian
  • Qingpei Guo
  • Zheng Qin
  • Sanping Zhou
  • Ming Yang
  • Le Wang

Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing a stronger multi-modal decoder, which typically generates learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, these methods only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach, called RefFormer. It consists of a query adaption module that can be seamlessly integrated into CLIP and generates the referential query to provide prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate the learning difficulty of the decoder and accurately concentrate on the target object. Additionally, our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.

AAAI Conference 2024 Conference Paper

Temporal Correlation Vision Transformer for Video Person Re-Identification

  • Pengfei Wu
  • Le Wang
  • Sanping Zhou
  • Gang Hua
  • Changyin Sun

Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
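
The score-guided aggregation attributed to LTA, weighting frame-level features by how confidently they are classified, can be sketched with a small module; the layer sizes, the use of raw frame logits as guidance, and the softmax weighting are assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class ScoreGuidedAggregation(nn.Module):
    """Hypothetical sketch: per-frame weights predicted from frame-level
    classification scores are used to average frame features into a clip feature."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.weight_head = nn.Linear(num_classes, 1)

    def forward(self, frame_feats, frame_logits):
        # frame_feats: (B, T, D), frame_logits: (B, T, num_classes)
        w = torch.softmax(self.weight_head(frame_logits), dim=1)  # (B, T, 1) frame weights
        return (w * frame_feats).sum(dim=1)                        # (B, D) clip-level feature

feats, logits = torch.randn(4, 8, 256), torch.randn(4, 8, 751)
print(ScoreGuidedAggregation(751)(feats, logits).shape)  # torch.Size([4, 256])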

JBHI Journal 2023 Journal Article

E2EGI: End-to-End Gradient Inversion in Federated Learning

  • Zhaohua Li
  • Le Wang
  • Guangyao Chen
  • Zhiqiang Zhang
  • Muhammad Shafiq
  • Zhaoquan Gu

A plethora of healthcare data is produced every day due to the proliferation of prominent technologies such as the Internet of Medical Things (IoMT). Digital-driven smart devices like wearable watches, wristbands and bracelets are utilized extensively in modern healthcare applications. Mining valuable information from the data distributed at the owners' level is useful, but it is challenging to preserve data privacy. Federated learning (FL) has swiftly surged in popularity due to its efficacy in dealing with privacy vulnerabilities. Recent studies have demonstrated that Gradient Inversion Attacks (GIA) can reconstruct input data from leaked gradients, but previous work has only achieved this in very limited scenarios, such as when the label repetition rate of the target sample is low and batch sizes are smaller than 48. In this paper, a novel method of End-to-End Gradient Inversion (E2EGI) is proposed. Compared to the state-of-the-art method, E2EGI's Minimum Loss Combinatorial Optimization (MLCO) has the ability to realize reconstructed samples with higher similarity, and the Distributed Gradient Inversion algorithm can implement GIA with batch sizes of 8 to 256 on deep network models (such as ResNet-50) and ImageNet datasets. A new Label Reconstruction algorithm is developed that relies only on the gradient information of the target model, which can achieve a label reconstruction accuracy of 81% in one batch sample with a label repetition rate of 96%, a 27% improvement over the state-of-the-art method. This proposed work can underpin data security assessments for healthcare federated learning.
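
The gradient-matching loop that gradient inversion attacks build on can be sketched generically: optimize a dummy input so that its gradients match the leaked ones. The toy model, loss, optimizer, and the assumption that the label is already known are simplifications, and none of E2EGI's MLCO or distributed machinery is reproduced here.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
criterion = nn.CrossEntropyLoss()

# "Leaked" gradients from one victim sample (synthetic here).
x_true = torch.randn(1, 3, 32, 32)
y_true = torch.tensor([3])
true_grads = torch.autograd.grad(criterion(model(x_true), y_true),
                                 model.parameters())

# Reconstruct: start from noise and minimize the gradient-matching loss.
x_dummy = torch.randn_like(x_true, requires_grad=True)
y_dummy = y_true  # assume the label is known / reconstructed separately
opt = torch.optim.Adam([x_dummy], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    dummy_grads = torch.autograd.grad(criterion(model(x_dummy), y_dummy),
                                      model.parameters(), create_graph=True)
    loss = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    loss.backward()
    opt.step()
print(loss.item())  # gradient-matching loss after optimization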

AAAI Conference 2023 Conference Paper

Multi-Stream Representation Learning for Pedestrian Trajectory Prediction

  • Yuxuan Wu
  • Le Wang
  • Sanping Zhou
  • Jinghai Duan
  • Gang Hua
  • Wei Tang

Forecasting the future trajectory of pedestrians is an important task in computer vision with a range of applications, from security cameras to autonomous driving. It is very challenging because pedestrians not only move individually across time but also interact spatially, and the spatial and temporal information is deeply coupled with one another in a multi-agent scenario. Learning such complex spatio-temporal correlation is a fundamental issue in pedestrian trajectory prediction. Inspired by the way the hippocampus processes and integrates spatio-temporal information to form memories, we propose a novel multi-stream representation learning module to learn complex spatio-temporal features of pedestrian trajectories. Specifically, we learn temporal, spatial and cross spatio-temporal correlation features in three respective pathways and then adaptively integrate these features with learnable weights by a gated network. Besides, we leverage a sparse attention gate to select informative interactions and correlations brought by complex spatio-temporal modeling and reduce the complexity of our model. We evaluate our proposed method on two commonly used datasets, i.e., ETH-UCY and SDD, and the experimental results demonstrate that our method achieves state-of-the-art performance. Code: https://github.com/YuxuanIAIR/MSRL-master
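
The gated integration of the three pathway features described above can be sketched with a small module; the gate architecture and feature dimensions are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical sketch: softmax weights from a learned gate mix the
    temporal, spatial, and cross spatio-temporal pathway features."""
    def __init__(self, dim: int, num_streams: int = 3):
        super().__init__()
        self.gate = nn.Linear(num_streams * dim, num_streams)

    def forward(self, streams):
        # streams: list of (B, D) features, one per pathway
        stacked = torch.stack(streams, dim=1)                           # (B, S, D)
        weights = torch.softmax(self.gate(torch.cat(streams, -1)), -1)  # (B, S)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)             # (B, D)

streams = [torch.randn(4, 64) for _ in range(3)]
print(GatedFusion(64)(streams).shape)  # torch.Size([4, 64])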

AAAI Conference 2022 Conference Paper

Complementary Attention Gated Network for Pedestrian Trajectory Prediction

  • Jinghai Duan
  • Le Wang
  • Chengjiang Long
  • Sanping Zhou
  • Fang Zheng
  • Liushuai Shi
  • Gang Hua

Pedestrian trajectory prediction is crucial in many practical applications due to the diversity of pedestrian movements, such as social interactions and individual motion behaviors. With similar observable trajectories and social environments, different pedestrians may make completely different future decisions. However, most existing methods only focus on the frequent modal of the trajectory and thus are difficult to generalize to the peculiar scenario, which leads to a decline in multimodal fitting ability when facing similar scenarios. In this paper, we propose a complementary attention gated network (CAGN) for pedestrian trajectory prediction, in which a dual-path architecture including normal and inverse attention is proposed to capture both frequent and peculiar modals in spatial and temporal patterns, respectively. Specifically, a complementary block is proposed to guide normal and inverse attention, which are then summed with learnable weights by a gated network to obtain the attention features. Finally, multiple trajectory distributions are estimated based on the fused spatio-temporal attention features due to the multimodality of future trajectories. Experimental results on benchmark datasets, i.e., ETH and UCY, demonstrate that our method outperforms state-of-the-art methods by 13.8% in Average Displacement Error (ADE) and 10.4% in Final Displacement Error (FDE). Code will be available at https://github.com/jinghaiD/CAGN
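
A hypothetical sketch of the dual-path idea, one path following what standard attention highlights (the frequent modal) and the other emphasizing what it suppresses (the peculiar modal), mixed by a learned gate, is given below; the scoring function and gate form are assumptions, not CAGN's exact block.

import torch
import torch.nn as nn

class ComplementaryAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.gate = nn.Parameter(torch.zeros(1))  # mixing weight via sigmoid

    def forward(self, x):
        # x: (B, N, D) features over neighbors / time steps
        s = self.score(x)                          # (B, N, 1)
        normal = torch.softmax(s, dim=1)           # emphasizes high-scoring elements
        inverse = torch.softmax(-s, dim=1)         # emphasizes what normal attention suppresses
        g = torch.sigmoid(self.gate)
        attn = g * normal + (1 - g) * inverse
        return (attn * x).sum(dim=1)               # (B, D) aggregated feature

x = torch.randn(4, 8, 32)
print(ComplementaryAttention(32)(x).shape)  # torch.Size([4, 32])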

JBHI Journal 2022 Journal Article

Continuous Seizure Detection Based on Transformer and Long-Term iEEG

  • Yulin Sun
  • Weipeng Jin
  • Xiaopeng Si
  • Xingjian Zhang
  • Jiale Cao
  • Le Wang
  • Shaoya Yin
  • Dong Ming

Automatic seizure detection algorithms are necessary for patients with refractory epilepsy. Many excellent algorithms have achieved good results in seizure detection. Still, most of them are based on discontinuous intracranial electroencephalogram (iEEG) and ignore the impact of different channels on detection. This study aimed to evaluate the proposed algorithm using continuous, long-term iEEG to show its applicability in clinical routine. In this study, we introduced the ability of the transformer network to calculate the attention between the channels of input signals into seizure detection. We proposed an end-to-end model that included convolution and transformer layers. The model did not need feature engineering or format transformation of the original multi-channel time series. Through evaluation on two datasets, we demonstrated experimentally that the transformer layer could improve the performance of the seizure detection algorithm. For the SWEC-ETHZ iEEG dataset, we achieved 97.5% event-based sensitivity, 0.06/h FDR, and 13.7 s latency. For the TJU-HH iEEG dataset, we achieved 98.1% event-based sensitivity, 0.22/h FDR, and 9.9 s latency. In addition, statistics showed that the model allocated more attention to the channels close to the seizure onset zone within 20 s after the seizure onset, which improved the explainability of the model. This paper provides a new method to improve the performance and explainability of automatic seizure detection.
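
A generic convolution-plus-transformer layout along the lines the abstract describes is sketched below. Note the hedges: the paper emphasizes attention between channels, whereas this sketch applies self-attention over time steps after temporal convolution, and every hyperparameter (kernel sizes, depth, the mean-pooled binary readout) is an assumption rather than the paper's configuration.

import torch
import torch.nn as nn

class ConvTransformerDetector(nn.Module):
    def __init__(self, n_channels: int, d_model: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)  # seizure / non-seizure

    def forward(self, x):
        # x: (B, n_channels, T) raw multi-channel iEEG window
        h = self.conv(x).transpose(1, 2)   # (B, T', d_model)
        h = self.encoder(h).mean(dim=1)    # temporal average pooling
        return self.head(h)

x = torch.randn(2, 64, 1024)                 # 64 electrodes, 1024 samples
print(ConvTransformerDetector(64)(x).shape)  # torch.Size([2, 2])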

AAAI Conference 2022 Conference Paper

Learning Disentangled Classification and Localization Representations for Temporal Action Localization

  • Zixin Zhu
  • Le Wang
  • Wei Tang
  • Ziyi Liu
  • Nanning Zheng
  • Gang Hua

A common approach to Temporal Action Localization (TAL) is to generate action proposals and then perform action classification and localization on them. For each proposal, existing methods universally use a shared proposal-level representation for both tasks. However, our analysis indicates that this shared representation focuses on the most discriminative frames for classification, e.g., "take-offs" rather than "run-ups" in distinguishing "high jump" and "long jump", while frames most relevant to localization, such as the start and end frames of an action, are largely ignored. In other words, such a shared representation cannot simultaneously handle both classification and localization tasks well, and it makes precise TAL difficult. To address this challenge, this paper disentangles the shared representation into classification and localization representations. The disentangled classification representation focuses on the most discriminative frames, and the disentangled localization representation focuses on the action phase as well as the action start and end. Our model can be divided into two sub-networks, i.e., the disentanglement network and the context-based aggregation network. The disentanglement network is an autoencoder that learns orthogonal hidden variables for classification and localization. The context-based aggregation network aggregates the classification and localization representations by modeling local and global contexts. We evaluate our proposed method on two popular benchmarks for TAL, where it outperforms all state-of-the-art methods.

AAAI Conference 2022 Conference Paper

Social Interpretable Tree for Pedestrian Trajectory Prediction

  • Liushuai Shi
  • Le Wang
  • Chengjiang Long
  • Sanping Zhou
  • Fang Zheng
  • Nanning Zheng
  • Gang Hua

Understanding the multiple socially-acceptable future behaviors is an essential task for many vision applications. In this paper, we propose a tree-based method, termed Social Interpretable Tree (SIT), to address this multi-modal prediction task, where a hand-crafted tree is built from the prior information of the observed trajectory to model multiple future trajectories. Specifically, a path in the tree from the root to a leaf represents an individual possible future trajectory. SIT employs a coarse-to-fine optimization strategy, in which the tree is first built by high-order velocity to balance the complexity and coverage of the tree and then optimized greedily to encourage multimodality. Finally, a teacher-forcing refining operation is used to predict the final fine trajectory. Compared with prior methods which leverage implicit latent variables to represent possible future trajectories, the path in the tree can explicitly explain the rough moving behaviors (e.g., go straight and then turn right), and thus provides better interpretability. Despite the hand-crafted tree, the experimental results on the ETH-UCY and Stanford Drone datasets demonstrate that our method is capable of matching or exceeding the performance of state-of-the-art methods. Interestingly, the experiments show that the raw built tree without training outperforms many prior deep neural network based approaches. Meanwhile, our method presents sufficient flexibility in long-term prediction and different best-of-K predictions.

AAAI Conference 2021 Conference Paper

ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization

  • Ziyi Liu
  • Le Wang
  • Qilin Zhang
  • Wei Tang
  • Junsong Yuan
  • Nanning Zheng
  • Gang Hua

The goal of Weakly-supervised Temporal Action Localization (WS-TAL) is to localize all action instances in an untrimmed video with only video-level supervision. Due to the lack of frame-level annotations during training, current WS-TAL methods rely on attention mechanisms to localize the foreground snippets or frames that contribute to the video-level classification task. This strategy frequently confuses context with the actual action in the localization results. Separating action and context is a core problem for precise WS-TAL, but it is very challenging and has been largely ignored in the literature. In this paper, we introduce an Action-Context Separation Network (ACSNet) that explicitly takes context into account for accurate action localization. It consists of two branches (i.e., the Foreground-Background branch and the Action-Context branch). The Foreground-Background branch first distinguishes foreground from background within the entire video, while the Action-Context branch further separates the foreground into action and context. We associate video snippets with two latent components (i.e., a positive component and a negative component), and their different combinations can effectively characterize foreground, action and context. Furthermore, we introduce extended labels with auxiliary context categories to facilitate the learning of action-context separation. Experiments on THUMOS14 and ActivityNet v1.2/v1.3 datasets demonstrate that ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
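
The attention mechanism that, per the abstract, current WS-TAL methods rely on can be sketched as a simple attention-MIL classifier over snippet features; this illustrates the common baseline the paper improves on, not ACSNet's action-context separation, and the dimensions are assumptions.

import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Snippet-level attention pools snippet features into a video-level
    prediction trained with only video-level labels (the generic WS-TAL baseline)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, snippets):
        # snippets: (B, T, D) per-snippet features of an untrimmed video
        a = torch.softmax(self.attn(snippets), dim=1)   # (B, T, 1) foreground attention
        video_feat = (a * snippets).sum(dim=1)          # (B, D)
        return self.cls(video_feat), a.squeeze(-1)      # video-level logits, attention

snippets = torch.randn(2, 100, 2048)
logits, attn = AttentionMIL(2048, 20)(snippets)
print(logits.shape, attn.shape)  # torch.Size([2, 20]) torch.Size([2, 100])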

AAAI Conference 2021 Conference Paper

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

  • Ziyi Liu
  • Le Wang
  • Wei Tang
  • Junsong Yuan
  • Nanning Zheng
  • Gang Hua

Weakly-supervised Temporal Action Localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision. Existing WS-TAL methods rely on deep features learned for action recognition. However, due to the mismatch between classification and localization, these features cannot distinguish the frequently co-occurring contextual background, i.e., the context, from the actual action instances. We term this challenge action-context confusion, and it adversely affects action localization accuracy. To address this challenge, we introduce a framework that learns two feature subspaces, respectively for actions and their context. By explicitly accounting for action visual elements, the action instances can be localized more precisely without distraction from the context. To facilitate the learning of these two feature subspaces with only video-level categorical labels, we leverage the predictions from both spatial and temporal streams for snippet grouping. In addition, an unsupervised learning task is introduced to make the proposed module focus on mining temporal information. The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks, i.e., the THUMOS14, ActivityNet v1.2 and v1.3 datasets.

IS Journal 2020 Journal Article

Joint Intelligence Ranking by Federated Multiplicative Update

  • Chi Zhang
  • Yu Liu
  • Le Wang
  • Yuehu Liu
  • Li Li
  • Nanning Zheng

The joint intelligence ranking of intelligent systems like autonomous driving is of great importance for building a more general, extensive, and universally accepted intelligence evaluation scheme. However, due to issues such as privacy security and industry or area competition, the integration of isolated test results may face considerable difficulty in information security and encrypted model training. To address this, we derive the federated multiplicative update (FMU) algorithm with boundary constraints to solve the nonnegative matrix factorization based joint intelligence ranking. An encrypted learning process is developed to replace the original computation steps in multiplicative update algorithms. With favorable properties for fast convergence and the secure exchange of variables, the proposed framework outperforms previous work on both real and simulated data. Further experimental analysis reveals that the introduced federated mechanism does not harm overall time efficiency.
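
The multiplicative updates that FMU federates are, in their centralized form, the standard Lee-Seung rules for nonnegative matrix factorization. The sketch below shows only those baseline updates; the encrypted, federated exchange of intermediate terms that is the paper's contribution is not reproduced here.

import numpy as np

def nmf_multiplicative(V, rank, iters=200, eps=1e-9):
    # Factor a nonnegative matrix V (m x n) as W @ H, W (m x rank), H (rank x n),
    # using the classical multiplicative updates for the Frobenius objective.
    m, n = V.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, rank)), rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((30, 20)))
W, H = nmf_multiplicative(V, rank=5)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # small relative reconstruction error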

AAAI Conference 2020 Conference Paper

Ladder Loss for Coherent Visual-Semantic Embedding

  • Mo Zhou
  • Zhenxing Niu
  • Le Wang
  • Zhanning Gao
  • Qilin Zhang
  • Gang Hua

For visual-semantic embedding, the existing methods normally treat the relevance between queries and candidates in a bipolar way – relevant or irrelevant, and all “irrelevant” candidates are uniformly pushed away from the query by an equal margin in the embedding space, regardless of their various proximity to the query. This practice disregards relatively discriminative information and could lead to suboptimal ranking in the retrieval results and poorer user experience, especially in the long-tail query scenario where a matching candidate may not necessarily exist. In this paper, we introduce a continuous variable to model the relevance degree between queries and multiple candidates, and propose to learn a coherent embedding space, where candidates with higher relevance degrees are mapped closer to the query than those with lower relevance degrees. In particular, the new ladder loss is proposed by extending the triplet loss inequality to a more general inequality chain, which implements variable push-away margins according to respective relevance degrees. In addition, a proper Coherent Score metric is proposed to better measure the ranking results including those “irrelevant” candidates. Extensive experiments on multiple datasets validate the efficacy of our proposed method, which achieves significant improvement over existing state-of-the-art methods.
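
The extension of the triplet inequality into a chain with level-dependent margins can be illustrated with a small sketch. Grouping candidates into discrete relevance levels, using cosine distance, and penalizing only consecutive levels are simplifying assumptions; the paper's actual ladder loss and Coherent Score metric are not reproduced here.

import torch
import torch.nn.functional as F

def ladder_loss(query, cand, levels, base_margin=0.2):
    # query: (D,) embedding; cand: (N, D) candidate embeddings
    # levels: (N,) integer relevance levels, 0 = most relevant
    d = 1 - F.cosine_similarity(query.unsqueeze(0), cand)  # (N,) cosine distances
    loss = query.new_zeros(())
    for l in range(int(levels.max())):
        closer, farther = d[levels == l], d[levels == l + 1]
        if len(closer) == 0 or len(farther) == 0:
            continue
        margin = base_margin * (l + 1)  # larger push-away margin for less relevant levels
        # every (closer, farther) pair must be separated by this level's margin
        loss = loss + torch.relu(closer[:, None] - farther[None, :] + margin).mean()
    return loss

q = torch.randn(128)
c = torch.randn(10, 128)
lv = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
print(ladder_loss(q, c, lv))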

AAAI Conference 2019 Conference Paper

Video Imprint Segmentation for Temporal Action Detection in Untrimmed Videos

  • Zhanning Gao
  • Le Wang
  • Qilin Zhang
  • Zhenxing Niu
  • Nanning Zheng
  • Gang Hua

We propose a temporal action detection by spatial segmentation framework, which simultaneously categorizes actions and temporally localizes action instances in untrimmed videos. The core idea is the conversion of the temporal detection task into a spatial semantic segmentation task. First, the video imprint representation is employed to capture the spatial/temporal interdependences within/among frames and represent them as spatial proximity in a feature space. Subsequently, the obtained imprint representation is spatially segmented by a fully convolutional network. With such segmentation labels projected back to the video space, both temporal action boundary localization and per-frame spatial annotation can be obtained simultaneously. The proposed framework is robust to the variable lengths of untrimmed videos due to the underlying fixed-size imprint representation. The efficacy of the framework is validated on two public action detection datasets.