EAAI Journal 2026 Journal Article
A lightweight and attention-enhanced framework for robust pavement defect detection
- Xiaoyan Li
- Ning Zhang
- Yue Pan
- Yaowen Lv
- Xiping Xu
- Zheng Wang
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Multivariate time series forecasting underpins applications in finance, meteorology, and industrial operations. Yet two persistent hurdles remain: (i) models typically choose between Channel–Independent (CI) and Channel–Mixed (CM) formulations—each with distinct strengths—leading to large performance variance across datasets; and (ii) short-term dynamics and long-term trends are hard to model jointly, making it difficult to capture both transient bursts and periodic patterns. We propose FusionTimePatch (FTP), a purely MLP-driven, lightweight framework composed of three modules: (1) Dual-View Global–Local Fusion (Dual-GLF), which runs CI and CM views in parallel and employs multi-scale patch recursion to adaptively adjust the look-back window, thereby coupling global tendencies with local details; (2) Channel Enhancement (CE), which adaptively identifies and amplifies salient channel signals and diffuses them to others, improving sensitivity to abrupt events and latent drivers; and (3) a Linear Fusion layer, which unifies Dual-GLF and CE outputs to strengthen cross-view interactions and enhance robustness. Extensive experiments on multiple public benchmarks show FTP consistently surpasses state-of-the-art counterparts in both accuracy and efficiency, offering a scalable new paradigm for multichannel forecasting. Code and datasets are publicly available at https://github.com/Zhveh7/FTP.
YNICL Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities. By interacting with external environments through predefined tools, these agents can carry out complex user tasks. Nonetheless, this interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent’s behavior, potentially resulting in economic loss, privacy leakage, or system compromise. System-level defenses have recently shown promise by enforcing static or predefined policies, but they still face two key challenges: the ability to dynamically update security rules and the need for memory stream isolation. To address these challenges, we propose DRIFT, a Dynamic Rule-based Isolation Framework for Trustworthy agentic systems, which enforces both control- and data-level constraints. A Secure Planner first constructs a minimal function trajectory and a JSON-schema-style parameter checklist for each function node based on the user query. A Dynamic Validator then monitors deviations from the original plan, assessing whether changes comply with privilege limitations and the user's intent. Finally, an \textit{Injection Isolator} detects and masks any instructions that may conflict with the user query from the memory stream to mitigate long-term risks. We empirically validate the effectiveness of DRIFT on the AgentDojo and ASB benchmark, demonstrating its strong security performance while maintaining high utility across diverse models—showcasing both its robustness and adaptability. The code is released at https: //github. com/SaFoLab-WISC/DRIFT.
EAAI Journal 2025 Journal Article
AAMAS Conference 2025 Conference Paper
As AI agents become integral to infrastructure, robust coordination and message synchronization are crucial. The Byzantine Generals Problem (BGP) models resilience in multi-agent systems (MAS) under adversarial conditions, handling scenarios with malicious agents—stemming from AI hallucinations or external attacks. Traditional BGP demands global consensus, which is often unnecessary and inefficient in practice. We introduce Imperfect BGP (IBGP), aligning with the local coordination patterns in MAS to address this gap, offering provable resilience against communication attacks and adaptability to changing environments, as validated by empirical results.
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i. e. , their sensitivity to small changes in the graph structure, remains in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how small perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.
NeurIPS Conference 2025 Conference Paper
Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets--DJI-day, Bosonplus-day, and Bosonplus-night--captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: http: //xjh19971. github. io/ThermalGen
EAAI Journal 2025 Journal Article
YNICL Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Document layout analysis is a crucial step for intelligent document understanding. However, many existing methods primarily focus on the visual aspects and overlook the textual features of documents. Although document pre-trained models utilize multi-modal features during the pre-training phase, they tend to operate as a unimodal pipeline when it comes to layout analysis tasks. Furthermore, current multi-modal methods perform worse than unimodal detectors on complex layout analysis datasets. To address these limitations, we propose an effective and pluggable multi-modal fusion approach named M2Doc, which fuses visual and textual features for better layout detection. M2Doc contains two pluggable multi-modal fusion modules, early-fusion and late-fusion, which align and fuse visual and textual features at the pixel level and block level. Benefitting from the concision and effectiveness of M2Doc, it can be easily applied to various detectors for better layout detection, including two-stage and end-to-end object detectors. Our experimental results demonstrate significant performance improvements in detectors equipped with M2Doc on datasets such as DocLayNet (+11.3 mAP) and M6Doc (+1.9 mAP). Furthermore, through the integration of the DINO detector with M2Doc, we achieve state-of-the-art results on DocLayNet (89.0 mAP), M6Doc (69.9 mAP), and PubLayNet (95.5 mAP). The code will be publicly released at https://github.com/johnning2333/M2Doc.
NeurIPS Conference 2024 Conference Paper
Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training.
YNICL Journal 2023 Journal Article
JBHI Journal 2022 Journal Article
Ultrasound can non-invasively detect muscle deformations and has great potential applications in prosthetic hand control. Traditional ultrasound equipment was usually too bulky to be applied in wearable scenarios. This research presented a compact ultrasound device that could be integrated into a prosthetic hand socket. The miniaturized ultrasound system included four A-mode ultrasound transducers for sensing musculature deformations, a signal excitation/acquisition module, and a prosthetic hand control module. The size of the ultrasound system was 65*75*25 mm, weighing only 85 g. For the first time, we integrated the ultrasound system into a prosthetic hand socket to evaluate its performance in practical prosthetic hand control. We designed an experiment requiring twenty subjects to perform six commonly used gestures. The performance of decoding ultrasound signals was analyzed offline using four classification algorithms and then was assessed in online control. The average values of online classification accuracy with and without wearing the physical prosthetic were 91. 5 $\pm\; \text{6. 4}\%$ and 96. 5 $\pm \; \text{3. 0}\%$, respectively. We found that wearing the prosthetic hand influenced the ultrasound gestures classification accuracy, but remarkable online classification performance could still be maintained. These experimental results demonstrated the efficacy of the designed integrated ultrasound system for practical use, paving the way for an effective HMI system that could be widely used in prosthetic hand control.
NeurIPS Conference 2022 Conference Paper
Capitalizing on the recent advances in image generation models, existing controllable face image synthesis methods are able to generate high-fidelity images with some levels of controllability, e. g. , controlling the shapes, expressions, textures, and poses of the generated face images. However, these methods focus on 2D image generative models, which are prone to producing inconsistent face images under large expression and pose changes. In this paper, we propose a new NeRF-based conditional 3D face synthesis framework, which enables 3D controllability over the generated face images by imposing explicit 3D conditions from 3D face priors. At its core is a conditional Generative Occupancy Field (cGOF) that effectively enforces the shape of the generated face to commit to a given 3D Morphable Model (3DMM) mesh. To achieve accurate control over fine-grained 3D face shapes of the synthesized image, we additionally incorporate a 3D landmark loss as well as a volume warping loss into our synthesis algorithm. Experiments validate the effectiveness of the proposed method, which is able to generate high-fidelity face images and shows more precise 3D controllability than state-of-the-art 2D-based controllable face synthesis methods.
AAAI Conference 2020 Conference Paper
We present a novel problem setting in zero-shot learning, zero-shot object recognition and detection in the context. Contrary to the traditional zero-shot learning methods, which simply infers unseen categories by transferring knowledge from the objects belonging to semantically similar seen categories, we aim to understand the identity of the novel objects in an image surrounded by the known objects using the inter-object relation prior. Specifically, we leverage the visual context and the geometric relationships between all pairs of objects in a single image, and capture the information useful to infer unseen categories. We integrate our context-aware zero-shot learning framework into the traditional zero-shot learning techniques seamlessly using a Conditional Random Field (CRF). The proposed algorithm is evaluated on both zero-shot region classification and zero-shot detection tasks. The results on Visual Genome (VG) dataset show that our model significantly boosts performance with the additional visual context compared to traditional methods.
AAAI Conference 2020 Conference Paper
Detection of malicious behavior is a fundamental problem in security. One of the major challenges in using detection systems in practice is in dealing with an overwhelming number of alerts that are triggered by normal behavior (the so-called false positives), obscuring alerts resulting from actual malicious activities. We introduce a novel approach for computing a policy for prioritizing alerts using adversarial reinforcement learning. Our approach assumes that the attacker knows the full state of the detection system and the defender’s alert prioritization policy, and will dynamically choose an optimal attack. The first step of our approach is to capture the interaction between the defender and attacker in a game theoretic model. To tackle the computational complexity of solving this game to obtain a dynamic stochastic alert prioritization policy, we propose an adversarial reinforcement learning framework. In this framework, we use neural reinforcement learning to compute best response policies for both the defender and the adversary to an arbitrary stochastic policy of the other. We then use these in a double-oracle framework to obtain an approximate equilibrium of the game, which in turn yields a robust stochastic policy for the defender. We use case studies in network intrusion and fraud detection to demonstrate that our approach is effective in creating robust alert prioritization policies. 1
JBHI Journal 2019 Journal Article
For many cerebrovascular diseases both blood pressure (BP) and hemodynamic changes are important clinical variables. In this paper, we describe the development of a novel approach to noninvasively and simultaneously monitor cerebral hemodynamics, BP, and other important parameters at high temporal resolution (250 Hz sampling rate). In this approach, cerebral hemodynamics are acquired using near infrared spectroscopy based sensors and algorithms, whereas continuous BP is acquired by superficial temporal artery tonometry with pulse transit time based drift correction. The sensors, monitoring system, and data analysis algorithms used in the prototype for this approach are reported in detail in this paper. Preliminary performance tests demonstrated that we were able to simultaneously and noninvasively record and reveal cerebral hemodynamics and BP during people's daily activity. As examples, we report dynamic cerebral hemodynamic and BP fluctuations during postural changes and micturition. These preliminary results demonstrate the feasibility of our approach, and its unique power in catching hemodynamics and BP fluctuations during transient symptoms (such as syncope) and revealing the dynamic features of related events.
IJCAI Conference 2018 Conference Paper
Separating audio mixtures into individual instrument tracks has been a standing challenge. We introduce a novel weakly supervised audio source separation approach based on deep adversarial learning. Specifically, our loss function adopts the Wasserstein distance which directly measures the distribution distance between the separated sources and the real sources for each individual source. Moreover, a global regularization term is added to fulfill the spectrum energy preservation property regardless separation. Unlike state-of-the-art weakly supervised models which often involve deliberately devised constraints or careful model selection, our approach need little prior model specification on the data, and can be straightforwardly learned in an end-to-end fashion. We show that the proposed method performs competitively on public benchmark against state-of-the-art weakly supervised methods.
TIST Journal 2017 Journal Article
With the increasing popularity of augmented reality (AR) services, providing seamless human-computer interactions in the AR setting has received notable attention in the industry. Gesture control devices have recently emerged to be the next great gadgets for AR due to their unique ability to enable computer interaction with day-to-day gestures. While these AR devices are bringing revolutions to our interaction with the cyber world, it is also important to consider potential privacy leakages from these always-on wearable devices. Specifically, the coarse access control on current AR systems could lead to possible abuse of sensor data. Although the always-on gesture sensors are frequently quoted as a privacy concern, there has not been any study on information leakage of these devices. In this article, we present our study on side-channel information leakage of the most popular gesture control device, Myo. Using signals recorded from the electromyography (EMG) sensor and accelerometers on Myo, we can recover sensitive information such as passwords typed on a keyboard and PIN sequence entered through a touchscreen. EMG signal records subtle electric currents of muscle contractions. We design novel algorithms based on dynamic cumulative sum and wavelet transform to determine the exact time of finger movements. Furthermore, we adopt the Hudgins feature set in a support vector machine to classify recorded signal segments into individual fingers or numbers. We also apply coordinate transformation techniques to recover fine-grained spatial information with low-fidelity outputs from the sensor in keystroke recovery. We evaluated the information leakage using data collected from a group of volunteers. Our results show that there is severe privacy leakage from these commodity wearable sensors. Our system recovers complex passwords constructed with lowercase letters, uppercase letters, numbers, and symbols with a mean success rate of 91%.
NeurIPS Conference 2014 Conference Paper
Convolutional neural nets (convnets) trained from massive labeled datasets have substantially improved the state-of-the-art in image classification and object detection. However, visual understanding requires establishing correspondence on a finer level than object category. Given their large pooling regions and training from whole-image labels, it is not clear that convnets derive their success from an accurate correspondence model which could be used for precise localization. In this paper, we study the effectiveness of convnet activation features for tasks requiring correspondence. We present evidence that convnet features localize at a much finer scale than their receptive field sizes, that they can be used to perform intraclass aligment as well as conventional hand-engineered features, and that they outperform conventional features in keypoint prediction on objects from PASCAL VOC 2011.
TIST Journal 2012 Journal Article
Various innovative and original works have been applied and proposed in the field of sports video analysis. However, individual works have focused on sophisticated methodologies with particular sport types and there has been a lack of scalable and holistic frameworks in this field. This article proposes a solution and presents a systematic and generic approach which is experimented on a relatively large-scale sports consortia. The system aims at the event detection scenario of an input video with an orderly sequential process. Initially, domain knowledge-independent local descriptors are extracted homogeneously from the input video sequence. Then the video representation is created by adopting a bag-of-visual-words (BoW) model. The video’s genre is first identified by applying the k-nearest neighbor (k-NN) classifiers on the initially obtained video representation, and various dissimilarity measures are assessed and evaluated analytically. Subsequently, an unsupervised probabilistic latent semantic analysis (PLSA)-based approach is employed at the same histogram-based video representation, characterizing each frame of video sequence into one of four view groups, namely closed-up-view, mid-view, long-view, and outer-field-view. Finally, a hidden conditional random field (HCRF) structured prediction model is utilized for interesting event detection. From experimental results, k-NN classifier using KL-divergence measurement demonstrates the best accuracy at 82.16% for genre categorization. Supervised SVM and unsupervised PLSA have average classification accuracies at 82.86% and 68.13%, respectively. The HCRF model achieves 92.31% accuracy using the unsupervised PLSA based label input, which is comparable with the supervised SVM based input at an accuracy of 93.08%. In general, such a systematic approach can be widely applied in processing massive videos generically.
AAAI Conference 2012 Conference Paper
Influence maximization is a problem of finding a small set of highly influential users in a social network such that the spread of influence under certain propagation models is maximized. In this paper, we consider time-critical influence maximization, in which one wants to maximize influence spread within a given deadline. Since timing is considered in the optimization, we also extend the Independent Cascade (IC) model to incorporate the time delay aspect of influence diffusion in social networks. We show that time-critical influence maximization under the time-delayed IC model maintains desired properties such as submodularity, which allows a greedy algorithm to achieve an approximation ratio of 1 − 1/e, to circumvent the NP-hardness of the problem. To overcome the inefficiency of the approximation algorithm, we design two heuristic algorithms: the first one is based on a dynamic programming procedure that computes exact influence in tree structures, while the second one converts the problem to one in the original IC model and then applies existing fast heuristics to it. Our simulation results demonstrate that our heuristics achieve the same level of influence spread as the greedy algorithm while running a few orders of magnitude faster, and they also outperform existing algorithms that disregard the deadline constraint and delays in diffusion.
TCS Journal 2007 Journal Article