EAAI Journal 2026 Journal Article
An interpretable civil case judgment prediction method based on logical reasoning and knowledge enhancement
- Shibo Cui
- Yu Sun
- Ning Wang
- Wenguang Yan
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
We introduce FIXME, the first end-to-end and large-scale benchmark for evaluating Large Language Models (LLMs) in hardware design functional verification (FV). Comprising 747 tasks derived from real-world hardware designs, FIXME spans five core FV sub-tasks: specification comprehension, reference model generation, testbench generation, assertion design, and RTL debugging. To ensure high data quality, we developed an AI-human collaborative framework for agile data curation and annotation. This process resulted in 25,000 lines of verified RTL, 35,000 lines of enhanced testbenches, and over 1,200 SystemVerilog Assertions. Furthermore, through expert-guided optimization within the multi-agent aided flow, we achieved a remarkable 45.57% improvement in average functional coverage, underscoring the benchmark's robustness. Through evaluation of state-of-the-art LLMs such as GPT-4.1, FIXME identifies key limitations and provides actionable insights, advancing the potential of LLM-driven automation in hardware design functional verification.
AAAI Conference 2026 Conference Paper
Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose **R**ecursive **E**valuation and **A**daptive **P**lanning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.
EAAI Journal 2025 Journal Article
IJCAI Conference 2025 Conference Paper
Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents. This framework encompasses the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafetyBench, a meticulously crafted safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt-decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state-of-the-art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance. The source code and datasets can be found at https://github.com/ZihanYan-CQU/EAsafetyBench.
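The abstract's core idea of masking functional-prompt tokens so they cannot influence the moderation decision can be illustrated with a toy attention-pooling sketch. This is our own minimal illustration, not Pinpoint's actual architecture; the function name and the mean-pooling setup are assumptions for the example.

```python
import numpy as np

def masked_attention_pool(tokens, query, prompt_mask):
    """Attention pooling in which functional-prompt tokens are masked out,
    so the pooled representation reflects only the remaining (user) tokens."""
    scores = tokens @ query
    scores = np.where(prompt_mask, -np.inf, scores)       # block prompt tokens
    w = np.exp(scores - scores[~prompt_mask].max())       # stable softmax
    w = np.where(prompt_mask, 0.0, w)
    w = w / w.sum()
    return w @ tokens                                     # weighted token average
```

With the last token masked as a functional prompt, the pooled vector depends only on the first two tokens, however large the prompt token's raw score is.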
EAAI Journal 2025 Journal Article
IROS Conference 2025 Conference Paper
Human-robot collaborative manipulation with mobile, multiple manipulators is crucial for expanding robotic applications, requiring precise handling of coupled force-position constraints between partners. Current systems, however, exhibit end-effector oscillations and instability during dynamic interactions. To overcome these limitations, this work develops a collaborative framework integrating a collaborative controller and a whole-body controller. The collaborative controller employs the object’s center-of-mass dynamics model with real-time contact forces and motion states to predict trajectories while coordinating with an attitude stabilization controller to adjust the desired end-effector poses. The whole-body controller utilizes model predictive control to generate coordinated motions that strictly follow pose commands from the collaborative controller, ensuring stable transportation. Simulation and physical experiments validate the proposed framework’s effectiveness in real-world scenarios.
ICML Conference 2025 Conference Paper
Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive and exhibits significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a high-order QK integration theory and a multi-level vocabulary decomposition. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting compression loss from compressed states. Through detailed error analysis, we demonstrate the superior approximation of Softmax attention achieved by our approach. To further improve performance and reduce training costs, we adopt a soft integration strategy over attention scores that effectively incorporates a sliding-window mechanism. With less than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99% of their original performance. Compared to state-of-the-art linear attention models and methods, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding-window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.
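The linear-time, fixed-memory property the abstract starts from can be made concrete: kernelized (linear) attention computed over all pairs equals the same computation rolled into a left-to-right recurrence over a fixed-size state. The sketch below is a generic causal linear attention (with an elu+1-style feature map), not the paper's transformation method.

```python
import numpy as np

def linear_attention_parallel(Q, K, V, eps=1e-6):
    """Kernelized causal attention computed all at once; O(n^2) here only
    because the full score matrix is materialized for comparison."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # positive feature map
    Qf, Kf = phi(Q), phi(K)
    scores = (Qf @ Kf.T) * np.tril(np.ones((len(Q), len(K))))  # causal mask
    return scores @ V / (scores.sum(axis=1, keepdims=True) + eps)

def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Same computation as a recurrence over a fixed-size state (S, z):
    this is what yields O(1) memory per generated token at inference."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    S = np.zeros((Q.shape[1], V.shape[1]))  # running sum of phi(k_t) v_t^T
    z = np.zeros(Q.shape[1])                # running sum of phi(k_t)
    out = []
    for q, k, v in zip(Q, K, V):
        qf, kf = phi(q), phi(k)
        S += np.outer(kf, v)
        z += kf
        out.append((qf @ S) / (qf @ z + eps))
    return np.array(out)
```

Both routes produce identical outputs up to floating-point error, which is why a pre-trained attention map can in principle be served with constant memory once linearized.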
NeurIPS Conference 2025 Conference Paper
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces a Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Extensive experiments on two complex video action datasets, Charades and SportsHHI, demonstrate the effectiveness of our approach against state-of-the-art methods. Our code can be found at https://github.com/iamsnaping/ProDA.git.
AAAI Conference 2024 Conference Paper
We show that visual grounding and image captioning, two mutually inverse processes, can be bridged for collaborative training through careful design. By consolidating this idea, we introduce CyCo, a cyclic-consistent learning framework to ameliorate the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows the semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; (3) yields a general captioning model that can describe arbitrary image regions. Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. Our image captioning model has the capability to freely describe image regions and meanwhile shows impressive performance on prevalent captioning benchmarks.
JBHI Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
JMLR Journal 2024 Journal Article
The expectation-maximization (EM) algorithm and its variants are widely used in statistics. In high-dimensional mixture linear regression, the model is assumed to be a finite mixture of linear regressions and the number of predictors is much larger than the sample size. The standard EM algorithm, which attempts to find the maximum likelihood estimator, becomes infeasible for such a model. We devise a group lasso penalized EM algorithm and study its statistical properties. Existing theoretical results of regularized EM algorithms often rely on dividing the sample into many independent batches and employing a fresh batch of sample in each iteration of the algorithm. Our algorithm and theoretical analysis do not require sample-splitting, and can be extended to multivariate response cases. The proposed methods also perform encouragingly in numerical studies.
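The two ingredients named in the abstract, an EM step for a mixture of linear regressions and a group-lasso penalty, can be sketched minimally. This is an illustrative two-component version with a plain group soft-thresholding (proximal) update, not the paper's exact penalized M-step or its theoretical setup; function names and the fixed noise scale are assumptions.

```python
import numpy as np

def e_step(X, y, betas, sigma=1.0, pi=0.5):
    """Responsibility of component 0 for each observation under a
    two-component mixture of linear regressions with equal noise scale."""
    r0 = np.exp(-0.5 * ((y - X @ betas[0]) / sigma) ** 2) * pi
    r1 = np.exp(-0.5 * ((y - X @ betas[1]) / sigma) ** 2) * (1 - pi)
    return r0 / (r0 + r1)

def group_soft_threshold(b, lam, groups):
    """Proximal operator of the group-lasso penalty: shrink each
    coefficient group toward zero in Euclidean norm, zeroing small groups."""
    out = b.copy()
    for g in groups:
        norm = np.linalg.norm(b[g])
        out[g] = 0.0 if norm <= lam else b[g] * (1 - lam / norm)
    return out
```

A penalized M-step would alternate a responsibility-weighted least-squares update with this group shrinkage, which is what drives entire predictor groups exactly to zero in the high-dimensional regime.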
AAAI Conference 2023 Conference Paper
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, in a factual or emotional view, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and freely switch among multiple styles. Such a controllable capability is achieved by embedding the prompt learning into the image captioning framework. To be specific, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in each domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding the heuristic prompt engineering and meanwhile exhibiting superior performance. In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts. Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks including COCO Karpathy split and TextCaps using a unified model.
AAAI Conference 2023 Conference Paper
Recent years have witnessed the rapid progress of image captioning. However, the demands for large memory storage and heavy computational burden prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design is built on the recent CLIP model for efficient image captioning. To be specific, on the one hand, we leverage the CLIP model to extract the compact grid features without relying on the time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98% in comparison with the current state-of-the-art methods. In spite of the low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on COCO Karpathy test split. Testing on the smartphone with only a single CPU, the proposed LightCap exhibits a fast inference speed of 188ms per image, which is ready for practical applications.
EAAI Journal 2023 Journal Article
ICRA Conference 2022 Conference Paper
Up-to-date High-Definition (HD) maps are essential for self-driving cars. To achieve constantly updated HD maps, we present a deep neural network (DNN), Diff-Net, to detect changes in them. Compared to traditional methods based on object detectors, the essential design in our work is a parallel feature difference calculation structure that infers map changes by comparing features extracted from the camera and rasterized images. To generate these rasterized images, we project map elements onto images in the camera view, yielding meaningful map representations that can be consumed by a DNN accordingly. As we formulate the change detection task as an object detection problem, we leverage the anchor-based structure that predicts bounding boxes with different change status categories. To the best of our knowledge, the proposed method is the first end-to-end network that tackles the high-definition map change detection task, yielding a single-stage solution. Furthermore, rather than relying on single-frame input, we introduce a spatio-temporal fusion module that fuses features from history frames into the current frame, thus improving the overall performance. Finally, we comprehensively validate our method's effectiveness using freshly collected datasets. Results demonstrate that our Diff-Net achieves better performance than the baseline methods and is ready to be integrated into a map production pipeline to maintain an up-to-date HD map.
JBHI Journal 2022 Journal Article
ECG classification is a key technology in intelligent electrocardiogram (ECG) monitoring. In the past, traditional machine learning methods such as support vector machine (SVM) and K-nearest neighbor (KNN) have been used for ECG classification, but with limited classification accuracy. Recently, the end-to-end neural network has been used for ECG classification and shows high classification accuracy. However, the end-to-end neural network has large computational complexity including a large number of parameters and operations. Although dedicated hardware such as field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) can be developed to accelerate the neural network, they result in large power consumption, large design cost, or limited flexibility. In this work, we have proposed an ultra-lightweight end-to-end ECG classification neural network that has extremely low computational complexity (∼8.2k parameters & ∼227k multiplication/addition operations) and can be squeezed into a low-cost microcontroller (MCU) such as MSP432 while achieving 99.1% overall classification accuracy. This outperforms the state-of-the-art ECG classification neural network. Implemented on MSP432, the proposed design consumes only 0.4 mJ and 3.1 mJ per heartbeat classification for normal and abnormal heartbeats respectively for real-time ECG classification.
AAAI Conference 2021 Conference Paper
In this paper, we focus on the self-supervised learning of visual correspondence using unlabeled videos in the wild. Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation. The intra-video learning transforms the image contents across frames within a single video via the frame pair-wise affinity. To obtain the discriminative representation for instance-level separation, we go beyond the intra-video analysis and construct the inter-video affinity to facilitate the contrastive transformation across different videos. By forcing the transformation consistency between intra- and inter-video levels, the fine-grained correspondence associations are well preserved and the instance-level feature discrimination is effectively reinforced. Our simple framework outperforms the recent self-supervised correspondence methods on a range of visual tasks including video object tracking (VOT), video object segmentation (VOS), pose keypoint tracking, etc. It is worth mentioning that our method also surpasses the fully-supervised affinity representation (e.g., ResNet) and performs competitively against the recent fully-supervised algorithms designed for the specific tasks (e.g., VOT and VOS).
IROS Conference 2020 Conference Paper
Multi-robot systems are widely used in environmental exploration and modeling, especially in hazardous environments. However, each robot type is constrained differently in mobility, battery life, sensor type, etc. Heterogeneous robot systems are able to utilize various types of robots and provide solutions in which robots compensate for one another's limitations with their complementary capabilities. In this paper, we consider the problem of sampling and modeling environmental characteristics with a heterogeneous team of robots. To utilize the heterogeneity of the system while remaining computationally tractable, we propose an environmental partitioning approach that leverages various robot capabilities by forming a uniformly defined heterogeneity cost space. We combine this with a mixture-of-Gaussian-Processes model-learning framework to adaptively sample and model the environment in an efficient and scalable manner. We demonstrate our algorithm in field experiments with ground and aerial vehicles.
AIIM Journal 2020 Journal Article
AAAI Conference 2020 Conference Paper
In visual object tracking, by reasonably fusing multiple experts, an ensemble framework typically achieves superior performance compared to the individual experts. However, the necessity of running all the experts in parallel in most existing ensemble frameworks heavily limits their efficiency. In this paper, we propose POST, a POlicy-based Switch Tracker for robust and efficient visual tracking. The proposed POST tracker consists of multiple weak but complementary experts (trackers) and adaptively assigns one suitable expert for tracking in each frame. By formulating this expert switch in consecutive frames as a decision-making problem, we learn an agent via reinforcement learning to directly decide which expert to handle the current frame without running others. In this way, the proposed POST tracker maintains the performance merit of multiple diverse models while favorably ensuring the tracking efficiency. Extensive ablation studies and experimental comparisons against state-of-the-art trackers on 5 prevalent benchmarks verify the effectiveness of the proposed method.
JBHI Journal 2020 Journal Article
Human activity recognition has been widely used in healthcare applications such as elderly monitoring, exercise supervision, and rehabilitation monitoring. Compared with other approaches, sensor-based wearable human activity recognition is less affected by environmental noise and therefore is promising in providing higher recognition accuracy. However, one of the major issues of existing wearable human activity recognition methods is that although the average recognition accuracy is acceptable, the recognition accuracy for some activities (e.g., ascending stairs and descending stairs) is low, mainly due to relatively less training data and complex behavior patterns for these activities. Another issue is that the recognition accuracy is low when the training data from the test subject are limited, which is a common case in real practice. In addition, the use of neural networks leads to large computational complexity and thus high power consumption. To address these issues, we proposed a new human activity recognition method with a two-stage end-to-end convolutional neural network and a data augmentation method. Compared with the state-of-the-art methods (including neural network based methods and other methods), the proposed methods achieve significantly improved recognition accuracy and reduced computational complexity.
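The abstract attributes part of its gains to data augmentation for under-represented activities. As a generic illustration only (the paper's exact augmentation is not specified here), a common wearable-sensor recipe adds Gaussian jitter and random per-channel magnitude scaling to each sensor window:

```python
import numpy as np

def augment_window(window, rng, sigma=0.05, scale_range=(0.9, 1.1)):
    """Generic wearable-sensor augmentation: additive Gaussian jitter plus
    a random per-channel magnitude scaling. window has shape (T, channels).
    The specific transforms are common examples, not the paper's method."""
    jitter = rng.normal(0.0, sigma, size=window.shape)
    scale = rng.uniform(*scale_range, size=(1, window.shape[1]))
    return (window + jitter) * scale
```

Applying such transforms repeatedly to windows of rare activities (e.g., stair ascent/descent) is one standard way to rebalance scarce training data.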
EAAI Journal 2019 Journal Article
JBHI Journal 2019 Journal Article
The discovery of disease-causing genes is a critical step towards understanding the nature of a disease and determining a possible cure for it. In recent years, many computational methods to identify disease genes have been proposed. However, making full use of disease-related (e.g., symptoms) and gene-related (e.g., gene ontology and protein-protein interactions) information to improve the performance of disease gene prediction is still an issue. Here, we develop a heterogeneous disease-gene-related network (HDGN) embedding representation framework for disease gene prediction (called HerGePred). Based on this framework, a low-dimensional vector representation (LVR) of the nodes in the HDGN can be obtained. Then, we propose two specific algorithms, namely, an LVR-based similarity prediction and a random walk with restart on a reconstructed heterogeneous disease-gene network (RWRDGN), to predict disease genes with high performance. First, to validate the rationality of the framework, we analyze the similarity-based overlap distribution of disease pairs and design an experiment for disease-gene association recovery, the results of which revealed that the LVR of nodes performs well at preserving the local and global network structure of the HDGN. Then, we apply tenfold cross validation and external validation to compare our methods with other well-known disease gene prediction algorithms. The experimental results show that the RWRDGN performs better than the state-of-the-art algorithms. The prediction results of disease candidate genes are essential for molecular mechanism investigation and experimental validation. The source codes of HerGePred and experimental data are available at https://github.com/yangkuoone/HerGePred.
IJCAI Conference 2019 Conference Paper
We study the task of image inpainting, where an image with a missing region is recovered with plausible context. Recent approaches based on deep neural networks have exhibited potential for producing elegant detail and are able to take advantage of background information, which provides texture cues for the missing region in the image. These methods often perform pixel/patch-level replacement on the deep feature maps of the missing region and therefore enable the generated content to have similar texture to the background region. However, this kind of replacement is a local strategy and often performs poorly when the background information is misleading. To this end, in this study, we propose a multi-scale image contextual attention learning (MUSICAL) strategy that helps to flexibly handle richer background information while avoiding its misuse. However, such a strategy may not be promising in generating context of a reasonable style. To address this issue, both a style loss and a perceptual loss are introduced into the proposed method to achieve style consistency of the generated image. Furthermore, we have also noticed that replacing some of the downsampling layers in the baseline network with stride-1 dilated convolution layers is beneficial for producing sharper and fine-detailed results. Experiments on the Paris Street View, Places, and CelebA datasets indicate the superior performance of our approach compared to the state of the art.
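The style loss mentioned in the abstract is, in its standard form, a mean squared difference of Gram matrices of deep feature maps. The sketch below shows that standard formulation as a minimal numpy illustration; whether the paper uses exactly this variant (layer choice, normalization) is not specified here.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: channel-wise correlations
    that summarize texture/style independently of spatial layout."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_a, feat_b):
    """Standard style loss: mean squared difference of Gram matrices."""
    return float(np.mean((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2))
```

Because the Gram matrix discards spatial positions, penalizing its difference pushes the generated region to match the background's texture statistics rather than its exact pixels.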
AAMAS Conference 2018 Conference Paper
Trust is critical to the success of human-agent teams, and a critical antecedent to trust is transparency. To best interact with human teammates, an agent must explain itself so that they understand its decision-making process. However, individual differences among human teammates require that the agent dynamically adjust its explanation strategy based on their unobservable subjective beliefs. The agent must therefore recognize its teammates' subjective beliefs relevant to trust-building (e.g., their understanding of the agent's capabilities and process). We leverage a nonparametric method to enable an agent to use its history of prior interactions as a means for recognizing and predicting a new teammate's subjective beliefs. We first gather data combining observable behavior sequences with survey-based observations of typically unobservable perceptions. We then use a nearest-neighbor approach to identify the prior teammates most similar to the new one. We use these neighbors' responses to infer the likelihood of possible beliefs, as in collaborative filtering. The results provide insights into the types of beliefs that are easy (and hard) to infer from purely behavioral observations.
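The nearest-neighbor, collaborative-filtering-style inference the abstract describes reduces to a few lines: represent each prior teammate by a behavior-sequence feature vector, find the k closest to the new teammate, and average their survey responses. This is a schematic sketch under that reading; the feature encoding, distance metric, and function name are our assumptions, not the paper's exact pipeline.

```python
import numpy as np

def infer_beliefs(new_behavior, prior_behaviors, prior_beliefs, k=3):
    """Estimate a new teammate's (unobserved) survey responses by averaging
    the responses of the k prior teammates whose observed behavior
    feature vectors are closest in Euclidean distance."""
    dists = np.linalg.norm(prior_behaviors - new_behavior, axis=1)
    neighbors = np.argsort(dists)[:k]          # indices of k nearest teammates
    return prior_beliefs[neighbors].mean(axis=0)
```

As in user-based collaborative filtering, the averaged neighbor responses act as a likelihood estimate over the new teammate's possible beliefs.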
AAMAS Conference 2016 Conference Paper
Researchers have observed that people will more accurately trust an autonomous system, such as a robot, if they have a more accurate understanding of its decision-making process. Studies have shown that hand-crafted explanations can help maintain effective team performance even when the system is less than 100% reliable. However, current explanation algorithms are not sufficient for making a robot's quantitative reasoning (in terms of both uncertainty and conflicting goals) transparent to human teammates. In this work, we develop a novel mechanism for robots to automatically generate explanations of reasoning based on Partially Observable Markov Decision Problems (POMDPs). Within this mechanism, we implement alternate natural-language templates and then measure their differential impact on trust and team performance within an agent-based online testbed that simulates a human-robot team task. The results demonstrate that the added explanation capability leads to improvement in transparency, trust, and team performance. Furthermore, by observing the different outcomes due to variations in the robot's explanation content, we gain valuable insight that can help lead to refinement of explanation algorithms to further improve human-robot interaction.
JBHI Journal 2015 Journal Article
This paper presents compact yet comprehensive feature representations for the electroencephalogram (EEG) signal to achieve efficient epileptic seizure prediction performance. The initial EEG feature vectors are formed by acquiring the dominant amplitude and frequency components on an epoch-by-epoch basis from the EEG signals. These extracted parameters can reveal the intrinsic EEG signal changes as well as the underlying stage transitions. To improve the efficacy of feature extraction, an elimination-based feature selection method has been applied on the initial feature vectors. This diminishes redundant and noisy points, providing each patient with a lower-dimensional and independent final feature form. This distinguishes our study from currently prevailing approaches, which usually adopt feature extraction processes that employ time-consuming, high-dimensional parameter sets. Machine learning approaches that are considered state of the art have been employed to build patient-specific binary classifiers that can divide the extracted feature parameters into preictal and interictal groups. Through out-of-sample evaluation on the intracranial EEG recordings provided by the publicly available Freiburg dataset, promising prediction performance has been attained. Specifically, we have achieved 98.8% sensitivity on the 19 patients included in our experiment, where only one of 83 seizures across all patients was not predicted. To make this investigation more comprehensive, we have conducted extensive comparative studies with other recently published competing approaches, in which the advantages of our method are highlighted.
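Extracting the dominant amplitude and frequency of each epoch, the building block of the feature vectors described above, can be done with a single FFT peak pick. This is a minimal sketch of that generic step (one-sided spectrum, DC bin excluded), not necessarily the paper's exact estimator.

```python
import numpy as np

def dominant_components(epoch, fs):
    """Return the dominant frequency (Hz) and its amplitude for one
    EEG epoch sampled at fs Hz, via the real FFT; the DC bin is skipped."""
    spectrum = np.abs(np.fft.rfft(epoch)) / len(epoch)
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)
    i = 1 + np.argmax(spectrum[1:])    # index of the largest non-DC bin
    return freqs[i], 2 * spectrum[i]   # doubled for one-sided amplitude
```

Stacking one (frequency, amplitude) pair per epoch yields exactly the kind of low-dimensional, per-epoch feature vector the abstract contrasts with high-dimensional parameter sets.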
AAMAS Conference 2008 Conference Paper
We explored the association between users' social anxiety and the interactional fidelity of an agent (also referred to as a virtual human), specifically addressing whether the contingency of agents' nonverbal feedback affects the relationship between users' social anxiety and their feelings of rapport, performance, or judgment of interaction partners. This question was examined across four experimental conditions in which participants interacted with three different types of agents and a real human. The three types of agents included the Non-Contingent Agent, the Responsive Agent (the opposite of the Non-Contingent Agent), and the Mediated Agent (controlled by a real human). The results indicated that people with greater social anxiety felt less rapport, performed worse, and felt more embarrassment when they experienced the untimely feedback of the Non-Contingent Agent. The results also showed that people with more anxiety trusted real humans less as interaction partners. We discuss the implications of this relationship between social anxiety in a human subject and the interactional fidelity of an agent for the design of virtual characters for social skills training and therapy.
AAMAS Conference 2008 Conference Paper
To create realistic and expressive virtual humans, we need to develop better models of the processes and dynamics of human emotions and expressions. A first step in this effort is to develop means to systematically induce and capture realistic expressions in real humans. We conducted a series of studies on human emotions and facial expression using the Emotion Evoking Game (EVG) and a high-speed video camera. In this paper, we discuss a detailed analysis of facial expressions in response to a surprise situation. We provide details on the rich dynamics of facial expressions, along with data useful for the animation of virtual humans. The analysis of the data also revealed considerable individual differences in whether surprise was evoked and how it was expressed.