EAAI Journal 2026 Journal Article
A method for extracting emotion–cause pairs based on bidirectional machine reading comprehension
- Guorui Li
- Yaxin Wen
- Cong Wang
- Lihong Cao
- Sancheng Peng
- Chao Deng
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Latent Diffusion Models have become a powerful tool for generating high-fidelity unrestricted adversarial examples. However, existing methods typically perturb only the initial latent or rely on prompt engineering; such strategies are ill-suited to the iterative nature of the diffusion process, suffer from optimization instability caused by external text prompts, and accumulate drift that pushes the adversarial images off the data manifold. In this paper, we propose a hierarchical attack framework that operates in alignment with the model's generative manifold and leverages intermediate denoising states to maximize attack transferability and visual fidelity. Extensive experiments show that the proposed attack improves adversarial transferability by 10-20% against a diverse set of normally-trained models and achieves an over 10.5% higher success rate against adversarially-defended models, while simultaneously enhancing visual quality with a 1.0-1.2 FID reduction and a 16.7% LPIPS improvement.
AAAI Conference 2026 Conference Paper
Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
AAAI Conference 2026 Conference Paper
Diffusion models have advanced fine-grained garment generation, yet balancing controllability, efficiency, and texture fidelity remains challenging. Adapter-based methods often yield incoherent details, while full fine-tuning is computationally expensive and prone to overwriting pretrained priors. To address these limitations, we propose IMAGGarment+, an efficient diffusion framework for controllable and high-quality garment synthesis. It comprises two key modules designed for efficient and attribute-aware conditioning. First, we introduce an attribute-wise feature extractor (AFE) that disentangles key garment attributes (silhouette, logo, position, and color) into parallel latent streams. Each stream is optimized independently via LoRA, ensuring minimal parameter overhead while retaining expressive capacity. Second, we develop an attribute-adaptive attention (AA) module to inject attribute-specific cues into the generative process through a selective, layer-wise injection strategy. Specifically, silhouette and color features are injected into early decoder layers to guide structural and appearance formation, while logo features are propagated across all layers to ensure cross-scale consistency. Extensive experiments on fine-grained garment benchmarks demonstrate that IMAGGarment+ outperforms state-of-the-art baselines with less than 20% additional parameters, validating its effectiveness and efficiency.
AAAI Conference 2026 Conference Paper
Embedding-as-a-Service (EaaS) is an effective and convenient deployment solution for addressing various NLP tasks. Nevertheless, recent research has shown that EaaS is vulnerable to model extraction attacks, which could lead to significant economic losses for model providers. For copyright protection, existing methods inject watermark embeddings into text embeddings and use them to detect copyright infringement. However, current watermarking methods often resist only a subset of attacks and fail to provide comprehensive protection. To this end, we present the region-triggered semantic watermarking framework called RegionMarker, which defines trigger regions within a low-dimensional space and injects watermarks into text embeddings associated with these regions. By utilizing a secret dimensionality reduction matrix to project onto this subspace and randomly selecting trigger regions, RegionMarker makes it difficult for watermark removal attacks to evade detection. Furthermore, by embedding watermarks across the entire trigger region and using the text embedding as the watermark, RegionMarker is resilient to both paraphrasing and dimension-perturbation attacks. Extensive experiments on various datasets show that RegionMarker is effective in resisting different attack methods, thereby protecting the copyright of EaaS.
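The trigger-region mechanism described in this abstract can be sketched in a few lines of numpy. Everything concrete below is an illustrative assumption, not the paper's construction: the embedding and subspace dimensions, the ball-shaped trigger region, and the linear blend used to inject the embedding-derived watermark.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 768-d text embeddings, a secret 2-d subspace, and a
# ball-shaped trigger region -- all illustrative choices.
D, d = 768, 2
secret_matrix = rng.standard_normal((D, d))   # secret dimensionality-reduction matrix
trigger_center = rng.standard_normal(d)       # randomly chosen trigger region center
trigger_radius = 1.5

def watermark(embedding, strength=0.2):
    """Inject a watermark only when the projected embedding lands in the
    trigger region; per the abstract, the watermark signal is derived from
    the text embedding itself (the blend rule here is a simplification)."""
    z = embedding @ secret_matrix
    if np.linalg.norm(z - trigger_center) <= trigger_radius:
        signal = embedding / np.linalg.norm(embedding)
        return (1 - strength) * embedding + strength * signal
    return embedding
```

Because the projection matrix and the region placement are secret, an attacker cannot easily tell which embeddings carry the watermark, which is the property the abstract relies on.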
AAAI Conference 2026 Conference Paper
In recent years, RF fingerprinting (RFF) has emerged as a promising technology for wireless device authentication. However, temporal variations in device load and temperature, along with channel effects, lead to inconsistencies in RFF distributions between training and testing phases. As a result, deep learning (DL)-based recognition models often suffer from degraded performance. To address this problem, we propose the first test-time adaptation (TTA) approach to improve the domain generalization ability of RFF recognition models. We first analyze the causes of time-varying RFF distribution shifts, such as carrier frequency offset (CFO), and develop a physical impairment-based data augmentation strategy. Based on this, we further propose a physical-information-aware prototype to guide the model for TTA. Our method requires no model retraining or labeled test samples, and is a lightweight, nonparametric solution. Finally, our approach is extensively evaluated using mobile phones with the IEEE 802.11 orthogonal frequency division multiplexing (OFDM) system, which demonstrates that our scheme can effectively improve RFF average recognition performance by about 7.8%.
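A prototype-guided test-time adaptation loop of the general kind described above can be sketched as follows. This is a generic nearest-prototype classifier with a momentum update, not the paper's exact rule; the feature dimension, momentum value, and cluster setup are assumptions for illustration.

```python
import numpy as np

def init_prototypes(features, labels, num_classes):
    """Class prototypes = mean training feature per device class."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def tta_predict(prototypes, x, momentum=0.9):
    """Nearest-prototype prediction with a momentum update of the matched
    prototype; no retraining and no test labels are needed, matching the
    lightweight, nonparametric flavor of the approach above."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    pred = int(np.argmin(dists))
    prototypes[pred] = momentum * prototypes[pred] + (1 - momentum) * x
    return pred
```

Each test sample nudges its matched prototype toward the current (shifted) distribution, so the classifier tracks slow drifts in the RF fingerprints without any labels.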
AAAI Conference 2026 Conference Paper
Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance.
ICRA Conference 2025 Conference Paper
Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a lightweight framework called Decoupled OSOD (DOSOD), which is a practical and highly efficient solution to support real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are directly aligned in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The lightweight DOSOD-S model achieves a Fixed AP of 26.7%, compared to 26.2% for YOLO-World-v1-S and 22.7% for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. Meanwhile, the FPS of DOSOD-S is 57.1% higher than that of YOLO-World-v1-S and 29.6% higher than that of YOLO-World-v2-S. We further demonstrate that the DOSOD model facilitates deployment on edge devices. The codes and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
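The decoupling idea above (adapt text embeddings once, then score class-agnostic regions like a closed-set detector) can be sketched in numpy. The layer sizes, ReLU activation, and cosine-similarity scoring are illustrative assumptions; the actual adaptor and matching design are in the paper and repository.

```python
import numpy as np

def mlp_adaptor(text_emb, W1, b1, W2, b2):
    """Two-layer MLP adaptor mapping VLM text embeddings into the joint
    space (layer sizes and activation are illustrative assumptions)."""
    h = np.maximum(text_emb @ W1 + b1, 0.0)   # ReLU hidden layer
    return h @ W2 + b2

def score_regions(region_feats, class_embs):
    """Closed-set-style scoring: cosine similarity between class-agnostic
    region representations and adapted class text embeddings."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return r @ c.T                            # (num_regions, num_classes)

# Demo with made-up sizes: 512-d text embeddings -> 128-d joint space.
rng = np.random.default_rng(0)
W1, b1 = 0.02 * rng.standard_normal((512, 256)), np.zeros(256)
W2, b2 = 0.02 * rng.standard_normal((256, 128)), np.zeros(128)
class_embs = mlp_adaptor(rng.standard_normal((4, 512)), W1, b1, W2, b2)
scores = score_regions(rng.standard_normal((10, 128)), class_embs)
```

Because the adapted class embeddings can be computed offline, test-time inference reduces to a single matrix product per image, which is where the real-time gain comes from.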
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
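The FSQ step named in component (2) is simple enough to show directly: each latent channel is bounded and rounded to a small fixed number of levels, so a vector of channels becomes a discrete token. This numpy sketch follows the standard FSQ recipe (bound with tanh, scale, round); the straight-through gradient trick used during training is omitted, and the level counts are example values.

```python
import numpy as np

def fsq(z, levels):
    """Finite Scalar Quantization: bound each channel with tanh, scale by the
    per-channel number of levels, and round to the nearest grid point.
    Returns values back in [-1, 1]; `levels` gives the level count per channel."""
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half        # each channel now lies in (-half, half)
    return np.round(bounded) / half    # snap to the grid, rescale to [-1, 1]
```

With, say, `levels=[3, 5]`, every 2-channel latent maps to one of 3 × 5 = 15 possible tokens, which is what makes the conditions compact enough for a token translator.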
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
Recent research showcases the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which primarily generate stories in a caption-dependent manner, often overlook the importance of contextual consistency and the relevance of frames during sequential generation. To address this, we propose Rich-contextual Conditional Diffusion Models (RCDMs), a novel two-stage approach designed to enhance the semantic and temporal consistency of story generation. Specifically, in the first stage, a frame-prior transformer diffusion model predicts the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate stories with both semantic and temporal consistency. Moreover, unlike autoregressive models, RCDMs can generate consistent stories with a single forward inference. Our qualitative and quantitative results demonstrate that the proposed RCDMs outperform prior methods in challenging scenarios.
AAAI Conference 2025 Conference Paper
Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output, or an input that induces a hallucinatory response from the target LLM mentioning politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs.
EAAI Journal 2025 Journal Article
JBHI Journal 2025 Journal Article
The increasing integration of large language models (LLMs) in healthcare systems has revolutionized medical service delivery while introducing privacy vulnerabilities that could compromise patient information. Traditional privacy-preserving approaches often degrade performance in healthcare applications. This paper presents HELP-ME, a framework for evaluating and protecting privacy in healthcare-oriented LLMs through a three-stage approach. First, we develop a systematic ethical privacy threat assessment methodology that identifies potential vulnerabilities in medical data handling. Second, we propose a prompt-focused privacy evaluation mechanism for healthcare scenarios. Finally, we introduce a robust ethical privacy obfuscation method that protects patient data while maintaining model utility. Experiments on the MIMIC-IV dataset demonstrate that HELP-ME achieves model source inference accuracy of 98.2%, clinical record length analysis accuracy of up to 98.5%, and maintains 96.9% diagnostic accuracy in synthetic data generation. The results indicate that HELP-ME provides a practical solution for protecting privacy in healthcare LLM applications while preserving clinical functionality.
AAAI Conference 2025 Conference Paper
Most of the federated learning techniques are limited to homogeneous model fusion. With the rapid growth of smart applications on resource-constrained edge devices, it becomes a barrier to accommodate their heterogeneous computing power and memory in the real world. Federated Distillation is a promising alternative to enable aggregation from heterogeneous models. However, the effectiveness of knowledge transfer still remains elusive under the shadow of distinct representation power from heterogeneous models. In this paper, we approach from an adversarial perspective to characterize the decision boundaries during distillation. By leveraging K-step PGD attacks, we successfully model the dynamics of the closest boundary points and establish a quantitative connection between the predictive uncertainty and boundary margin. Based on these findings, we further propose a new loss function to make the distillation attend to samples close to the decision boundaries, thus learning from more informed logit distributions. The extensive experiments over CIFAR-10/100 and Tiny-ImageNet demonstrate about 0.5-3.5% improvement of accuracy under different IID and non-IID settings, with only a small increment of computational overhead.
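The closest-boundary-point idea above can be illustrated with K-step PGD against a toy linear classifier: step repeatedly so as to shrink the signed margin, stay inside an eps-ball, and read off the remaining distance to the decision boundary. The linear model and all hyperparameters below are stand-ins for illustration; the paper applies this to deep models during distillation.

```python
import numpy as np

def pgd_to_boundary(x, y, w, b, eps=0.5, alpha=0.1, steps=10):
    """K-step PGD against a linear classifier f(x) = sign(w.x + b):
    descend the signed margin y*(w.x + b) with sign steps, project back into
    the eps-ball around x, and return the perturbed point plus an estimate of
    the distance to the boundary (small margin = boundary-close sample)."""
    x_adv = x.astype(float).copy()
    for _ in range(steps):
        x_adv = x_adv - alpha * np.sign(y * w)       # step toward the boundary
        x_adv = x + np.clip(x_adv - x, -eps, eps)    # project into the eps-ball
    margin = abs(w @ x_adv + b) / np.linalg.norm(w)
    return x_adv, margin
```

A distillation loss of the kind the abstract proposes would then up-weight samples whose estimated margin is small, since those carry the most information about where the decision boundaries sit.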
IROS Conference 2025 Conference Paper
The complex nonlinear dynamics of hydraulic excavators, such as time delays and control coupling, pose significant challenges to achieving high-precision trajectory tracking. Traditional control methods often fall short in such applications due to their inability to effectively handle these nonlinearities, while commonly used learning-based methods require extensive interactions with the environment, leading to inefficiency. To address these issues, we introduce EfficientTrack, a trajectory tracking method that integrates model-based learning to manage nonlinear dynamics and leverages closed-loop dynamics to improve learning efficiency, ultimately minimizing tracking errors. We validate our method through comprehensive experiments both in simulation and on a real-world excavator. Comparative experiments in simulation demonstrate that our method outperforms existing learning-based approaches, achieving the highest tracking precision and smoothness with the fewest interactions. Real-world experiments further show that our method remains effective under load conditions and possesses the ability for continual learning, highlighting its practical applicability. For implementation details and source code, please refer to https://github.com/ZiqingZou/EfficientTrack.
AAAI Conference 2025 Conference Paper
Existing virtual try-on (VTON) methods provide only limited user control over garment attributes and generally overlook essential factors such as face, pose, and scene context. To address these limitations, we introduce the virtual dressing (VD) task, which aims to synthesize freely editable human images conditioned on fixed garments and optional user-defined inputs. We further propose a comprehensive affinity metric index (CAMI) to quantify the consistency between generated outputs and reference garments. We present IMAGDressing-v1, which leverages a garment-specific U-Net to integrate semantic features from CLIP and texture features from a VAE. To incorporate these garment features into a frozen denoising U-Net for flexible text-driven scene control, we employ a hybrid attention mechanism composed of frozen self-attention and trainable cross-attention layers. IMAGDressing-v1 seamlessly integrates with extension modules, such as ControlNet and IP-Adapter, enabling enhanced diversity and controllability. To alleviate data constraints, we introduce the Interactive Garment Pairing (IGPair) dataset, comprising over 300,000 garment–image pairs and a standardized data assembly pipeline. Extensive experiments demonstrate that IMAGDressing-v1 achieves state-of-the-art performance in controlled human image synthesis. The code and model will be available at https://github.com/muzishen/IMAGDressing.
AAAI Conference 2025 Conference Paper
We propose Intra and Inter Parser-Prompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features to facilitate restoration. The IntraPPA re-considers cross attention between parser and restoration features, enabling implicit perception of the parser from a long-range and intra-layer perspective. Conversely, the InterPPA initially fuses restoration features with those of the parser, followed by formulating these fused features within an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network to guide restoration within pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.
NeurIPS Conference 2025 Conference Paper
With the rapid advancement of the digital economy, data collaboration between organizations has become a well-established business model, driving the growth of various industries. However, privacy concerns make direct data sharing impractical. To address this, Two-Party Split Learning (a.k.a. Vertical Federated Learning (VFL)) has emerged as a promising solution for secure collaborative learning. Despite its advantages, this architecture still suffers from low computational resource utilization and training efficiency. Specifically, its synchronous dependency design increases training latency, while resource and data heterogeneity among participants further hinder efficient computation. To overcome these challenges, we propose PubSub-VFL, a novel VFL paradigm with a Publisher/Subscriber architecture optimized for two-party collaborative learning with high computational efficiency. PubSub-VFL leverages the decoupling capabilities of the Pub/Sub architecture and the data parallelism of the parameter server architecture to design a hierarchical asynchronous mechanism, reducing training latency and improving system efficiency. Additionally, to mitigate the training imbalance caused by resource and data heterogeneity, we formalize an optimization problem based on participants' system profiles, enabling the selection of optimal hyperparameters while preserving privacy. We conduct a theoretical analysis to demonstrate that PubSub-VFL achieves stable convergence and is compatible with security protocols such as differential privacy. Extensive case studies on five benchmark datasets further validate its effectiveness, showing that PubSub-VFL not only accelerates training by 2-7× compared to state-of-the-art baselines without compromising accuracy, but also achieves computational resource utilization of up to 91.07%.
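The Publisher/Subscriber decoupling can be sketched with a bounded queue between two threads: the passive party publishes intermediate activations at its own pace, and the active party consumes them at its own pace, so neither blocks the other except through backpressure. This is a toy stand-in for the paper's hierarchical asynchronous mechanism; the party roles, batch contents, and loss are placeholders.

```python
import queue
import threading

activations = queue.Queue(maxsize=4)   # bounded channel decouples the two parties
results = []

def passive_party(num_batches):
    for step in range(num_batches):
        h = [step * 0.5]               # stand-in for a bottom-model forward output
        activations.put((step, h))     # publish; blocks only if the subscriber lags
    activations.put(None)              # sentinel: no more batches

def active_party():
    while True:
        msg = activations.get()
        if msg is None:
            break
        step, h = msg
        loss = sum(h)                  # stand-in for top-model forward + loss
        results.append((step, loss))

pub = threading.Thread(target=passive_party, args=(6,))
sub = threading.Thread(target=active_party)
pub.start(); sub.start()
pub.join(); sub.join()
```

The bounded queue is the key design choice: it absorbs speed differences between heterogeneous parties (reducing idle time) while capping how stale the consumed activations can get.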
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
In domain adaptation (DA), joint maximum mean discrepancy (JMMD), a well-known distribution-distance metric, measures the difference between the joint probability distributions of the source and target domains. However, it remains not fully explored and is especially hard to apply within a subspace-learning framework, because its empirical estimation involves a tensor-product operator whose partial derivative is difficult to obtain. To solve this issue, we deduce a concise form of JMMD based on the Representer theorem that avoids the tensor-product operator, and we obtain two essential findings. First, we reveal the unifying nature of JMMD by proving that previous marginal, class-conditional, and weighted class-conditional probability distribution distances are three special cases of JMMD with different label reproducing kernels. Second, inspired by graph embedding, we observe that the similarity weights, which strengthen intra-class compactness in the graph of the Hilbert-Schmidt independence criterion (HSIC), take opposite signs in the graph of JMMD, revealing why JMMD degrades feature discrimination. This motivates us to propose a novel loss, JMMD-HSIC, which jointly considers JMMD and HSIC to promote the discrimination of JMMD. Extensive experiments on several cross-domain datasets demonstrate the validity of our theoretical results and the effectiveness of the proposed JMMD-HSIC.
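The standard empirical form of squared JMMD uses a product (tensor) kernel k((x,y),(x',y')) = k_x(x,x') · k_y(y,y'), and choosing different label kernels k_y recovers the marginal and class-conditional special cases mentioned above. A minimal numpy sketch with an RBF feature kernel and a linear kernel on one-hot labels (both illustrative choices):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix between row-vector sets a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def jmmd2(Xs, Ys, Xt, Yt, gamma=1.0):
    """Empirical squared JMMD with the product kernel
    k_x(x,x') * k_y(y,y'); Ys/Yt are one-hot label (or pseudo-label) rows."""
    Kss = rbf(Xs, Xs, gamma) * (Ys @ Ys.T)
    Ktt = rbf(Xt, Xt, gamma) * (Yt @ Yt.T)
    Kst = rbf(Xs, Xt, gamma) * (Ys @ Yt.T)
    return Kss.mean() + Ktt.mean() - 2.0 * Kst.mean()
```

Setting k_y ≡ 1 collapses this to the plain marginal MMD, which is the sense in which JMMD unifies the earlier distances.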
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically either intervene indiscriminately in all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and the strength of intervention by tracking the internal states of the LLM during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose a backtracking mechanism to correct the deviated tokens and steer the LLM toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.
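The "flexible" part of the idea above (intervene only when needed, with strength matched to the deviation) can be sketched with a linear probe on a hidden state. The probe, steering vector, threshold, and linear scaling rule are all illustrative assumptions; the paper's backtracking mechanism, which re-emits already-generated deviated tokens, is not shown here.

```python
import numpy as np

def steer(hidden, probe_w, steer_vec, threshold=0.0, alpha=4.0):
    """Conditionally steer one hidden state: a linear probe scores the state
    for deviation from the desired behavior, and the steering direction is
    added only when the score crosses the threshold, scaled by the score."""
    score = float(hidden @ probe_w)      # > threshold means drifting off-behavior
    if score <= threshold:
        return hidden                    # no intervention for this token
    return hidden + alpha * score * steer_vec
```

Scaling the intervention by the probe score is what lets the same mechanism apply a light touch to mildly deviating generations and a strong correction to badly deviating ones.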
EAAI Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
IROS Conference 2024 Conference Paper
Safety-critical scenarios are infrequent in natural driving environments but hold significant importance for the training and testing of autonomous driving systems. The prevailing approach involves generating safety-critical scenarios automatically in simulation by introducing adversarial adjustments to natural environments. These adjustments are often tailored to specific tested systems, thereby disregarding their transferability across different systems. In this paper, we propose AdvDiffuser, an adversarial framework for generating safety-critical driving scenarios through guided diffusion. By incorporating a diffusion model to capture plausible collective behaviors of background vehicles and a lightweight guide model to effectively handle adversarial scenarios, AdvDiffuser facilitates transferability. Experimental results on the nuScenes dataset demonstrate that AdvDiffuser, trained on offline driving logs, can be applied to various tested systems with minimal warm-up episode data and outperform other existing methods in terms of realism, diversity, and adversarial performance.
NeurIPS Conference 2024 Conference Paper
Recent advancements in Automatic Prompt Optimization (APO) for text-to-image generation have streamlined user input while ensuring high-quality image output. However, most APO methods are trained assuming a fixed text-to-image model, which is impractical given the emergence of new models. To address this, we propose a novel task, model-generalized automatic prompt optimization (MGAPO), which trains APO methods on a set of known models to enable generalization to unseen models during testing. MGAPO presents significant challenges. First, we experimentally confirm the suboptimal performance of existing APO methods on unseen models. We then introduce a two-stage prompt optimization method, AP-Adapter. In the first stage, a large language model is used to rewrite the prompts. In the second stage, we propose a novel method to construct an enhanced representation space by leveraging inter-model differences. This space captures the characteristics of multiple domain models, storing them as domain prototypes. These prototypes serve as anchors to adjust prompt representations, enabling generalization to unseen models. The optimized prompt representations are subsequently used to generate conditional representations for controllable image generation. We curate a multi-modal, multi-model dataset that includes multiple diffusion models and their corresponding text-image data, and conduct experiments under a model generalization setting. The experimental results demonstrate the AP-Adapter's ability to enable the automatic prompts to generalize well to previously unseen diffusion models, generating high-quality images.
JBHI Journal 2024 Journal Article
The Type III Secretion Systems (T3SSs) play a pivotal role in host-pathogen interactions by mediating the secretion of type III secretion system effectors (T3SEs) into host cells. These T3SEs mimic host cell protein functions, influencing interactions between Gram-negative bacterial pathogens and their hosts. Identifying T3SEs is essential in biomedical research for comprehending bacterial pathogenesis and its implications on human cells. This study presents EDIFIER, a novel multi-channel model designed for accurate T3SE prediction. It incorporates a graph structural channel, utilizing graph convolutional networks (GCN) to capture protein 3D structural features and a sequence channel based on the ProteinBERT pre-trained model to extract the sequence context features of T3SEs. Rigorous benchmarking tests, including ablation studies and comparative analysis, validate that EDIFIER outperforms current state-of-the-art tools in T3SE prediction. To enhance EDIFIER's accessibility to the broader scientific community, we developed a webserver that is publicly accessible at http://edifier.unimelb-biotools.cloud.edu.au/. We anticipate EDIFIER will contribute to the field by providing reliable T3SE predictions, thereby advancing our understanding of host-pathogen dynamics.
IJCAI Conference 2024 Conference Paper
This paper investigates two fundamental problems that arise when utilizing Intrinsic Motivation (IM) for reinforcement learning in Reward-Free Pre-Training (RFPT) tasks and Exploration with Intrinsic Motivation (EIM) tasks: 1) how to design an effective intrinsic objective in RFPT tasks, and 2) how to reduce the bias introduced by the intrinsic objective in EIM tasks. Existing IM methods suffer from static skills, limited state coverage, sample inefficiency in RFPT tasks, and suboptimality in EIM tasks. To tackle these problems, we propose Constrained Intrinsic Motivation (CIM) for RFPT and EIM tasks, respectively: 1) CIM for RFPT maximizes the lower bound of the conditional state entropy subject to an alignment constraint on the state encoder network for efficient dynamic and diverse skill discovery and state coverage maximization; 2) CIM for EIM leverages constrained policy optimization to adaptively adjust the coefficient of the intrinsic objective to mitigate the distraction from the intrinsic objective. In various MuJoCo robotics environments, we empirically show that CIM for RFPT greatly surpasses fifteen IM methods for unsupervised skill discovery in terms of skill diversity, state coverage, and fine-tuning performance. Additionally, we showcase the effectiveness of CIM for EIM in redeeming intrinsic rewards when task rewards are exposed from the beginning. Our code is available at https://github.com/x-zheng16/CIM.
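The adaptive-coefficient idea in point 2) follows the generic constrained-optimization (Lagrange-multiplier) recipe: shrink the intrinsic coefficient when the intrinsic return exceeds its budget, so the intrinsic objective stops distracting once exploration is sufficient. The update below is that generic recipe with made-up hyperparameters, not the paper's exact rule.

```python
import numpy as np

def update_coefficient(lam, intrinsic_return, budget, lr=0.1):
    """Dual update for the intrinsic coefficient: decrease lam when the
    intrinsic return exceeds its budget, increase it otherwise, and keep
    it in [0, 1] (clipping range is an illustrative choice)."""
    lam = lam - lr * (intrinsic_return - budget)
    return float(np.clip(lam, 0.0, 1.0))

def total_reward(task_r, intrinsic_r, lam):
    """Reward actually optimized by the policy at the current coefficient."""
    return task_r + lam * intrinsic_r
```

Run once per update step: as exploration saturates and the intrinsic return grows, lam decays toward zero and the policy's objective converges to the task reward alone, which is how the bias is mitigated.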
AAAI Conference 2024 Conference Paper
This paper proposes UHDformer, a general Transformer for Ultra-High-Definition (UHD) image restoration. UHDformer contains two learning spaces: (a) learning in high-resolution space and (b) learning in low-resolution space. The former learns multi-level high-resolution features, fuses low-high features, and reconstructs the residual images, while the latter explores more representative features learned from the high-resolution ones to facilitate better restoration. To better improve feature representation in low-resolution space, we propose to build a feature transformation from the high-resolution space to the low-resolution one. To that end, we propose two new modules: the Dual-path Correlation Matching Transformation module (DualCMT) and the Adaptive Channel Modulator (ACM). The DualCMT selects the top C/r correlation channels (r ≥ 1 controls the squeezing level) from the max-pooling/mean-pooling high-resolution features to replace low-resolution ones in Transformers, which effectively squeezes out useless content and improves the feature representation in low-resolution space to facilitate better recovery. The ACM is exploited to adaptively modulate multi-level high-resolution features, providing more useful features to low-resolution space for better learning. Experimental results show that our UHDformer reduces model size by about 97% compared with most state-of-the-art methods while significantly improving performance under different training sets on three UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring. The source codes will be made available at https://github.com/supersupercong/UHDformer.
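The top-C/r channel-selection step of DualCMT can be sketched as follows. This is a simplified reading: per-channel descriptors come from max- and mean-pooling as the abstract says, but the two pooling paths are merged into one correlation score here, and the descriptor and scoring choices are illustrative assumptions rather than the module's actual design.

```python
import numpy as np

def select_channels(high_feats, low_feats, r=2):
    """Pick the indices of the top C/r high-resolution channels most
    correlated with the low-resolution channels. Inputs are (C, H, W) and
    (C, h, w) feature maps; descriptors are [max-pool, mean-pool] pairs."""
    C = high_feats.shape[0]
    k = C // r
    hd = np.stack([high_feats.max(axis=(1, 2)),
                   high_feats.mean(axis=(1, 2))], axis=1)   # (C, 2) descriptors
    ld = np.stack([low_feats.max(axis=(1, 2)),
                   low_feats.mean(axis=(1, 2))], axis=1)
    corr = hd @ ld.T                    # (C, C) cross-channel correlation
    score = corr.max(axis=1)            # strongest match per high-res channel
    return np.argsort(score)[-k:]       # indices of the top C/r channels
```

The selected channels would then replace low-resolution ones inside the Transformer, which is the "squeezing" the abstract describes: only the C/r most informative high-resolution channels are carried over.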
IJCAI Conference 2024 Conference Paper
Patch-level non-local self-similarity is an important property of natural images. However, most existing methods do not incorporate this property into neural networks for image deraining, thus affecting recovery performance. Motivated by this property, we find that a rainy image exhibits significant patch recurrence, that is, similar patches tend to recur many times within the image itself, its multi-scale versions, and external images. To better model this property for image deraining, we develop a multi-scale graph network with exemplars, called MSGNN, that contains two branches: 1) an internal data-based supervised branch used to model the internal relations of similar patches from the rainy image itself and its multi-scale images, and 2) an external data-participated unsupervised branch used to model the external relations of the similar patches in the rainy image and exemplar. Specifically, we construct a graph model by searching the k-nearest neighboring patches from both the rainy images in a multi-scale framework and the exemplar. After obtaining the corresponding k neighboring patches from the multi-scale images and exemplar, we build a graph and aggregate them in an attentional manner so that the graph can provide more information from similar patches for image deraining. We embed the proposed graph in a deep neural network and train it in an end-to-end manner. Extensive experiments demonstrate that the proposed algorithm performs favorably against eight state-of-the-art methods on five public synthetic datasets and one real-world dataset. The source codes will be available at https://github.com/supersupercong/MSGNN.
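The core graph-building step, finding each patch's k nearest neighbors, can be sketched as follows. This is a toy illustration under assumed names and parameters (single-channel image, non-overlapping patches, brute-force Euclidean distance), not the MSGNN code, which searches across multi-scale images and exemplars with learned features.

```python
import numpy as np

def extract_patches(img, size=4, stride=4):
    """Slice a 2-D image into flattened size x size patches."""
    H, W = img.shape
    return np.stack([
        img[i:i + size, j:j + size].ravel()
        for i in range(0, H - size + 1, stride)
        for j in range(0, W - size + 1, stride)
    ])

def knn_patch_edges(patches, k=3):
    """For each patch, return the indices of its k nearest neighbours
    (excluding itself) under Euclidean distance -- the adjacency that
    would define edges of a patch graph."""
    d = np.linalg.norm(patches[:, None] - patches[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)      # a patch is not its own neighbour
    return np.argsort(d, axis=1)[:, :k]
```

A graph network would then aggregate the features of each patch's neighbors (here, rows of `patches` indexed by the returned edges), e.g. with attention weights.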
AAMAS Conference 2024 Conference Paper
Visual-based Reinforcement Learning (RL) has gained prominence in robotics decision-making due to its significant potential. However, the prevalent utilization of images in visual-based RL lacks explicit descriptions of object structures and spatial configurations in scenes, thereby limiting the overall efficiency and robustness of RL in robot control. Additionally, training an RL policy solely using visual observations from scratch is typically sample-inefficient, rendering it impractical for real-world application. To address these challenges, this paper proposes a novel method, called Pre-training on Point-based RL (P2RL), which takes the point cloud representations of scenes as states and preserves the intricate spatial details between objects. To further enhance efficiency, we leverage the pre-training method to bolster the perception ability of the network. Key factors in the pre-training process are systematically examined to optimize downstream RL training. Experimental results demonstrate the superior robustness and efficiency of P2RL compared to the state-of-the-art image-based RL method, especially in evaluations involving untrained scenes.
IJCAI Conference 2024 Conference Paper
Few-Shot Learning (FSL) aims to train a model that can generalize to recognize new classes, with each new class having only very limited training samples. Since extracting discriminative features for new classes with few samples is challenging, existing FSL methods leverage visual and semantic prior knowledge to guide discriminative feature learning. However, for meta-learning purposes, the semantic knowledge of the query set is unavailable, so their features lack discriminability. To address this problem, we propose a novel Multi-Attention based Visual-Semantic Interaction (MAVSI) approach for FSL. Specifically, we utilize spatial and channel attention mechanisms to effectively select discriminative visual features for the support set based on its ground-truth semantics while using all the support set semantics for each query set sample. Then, a relation module with class prototypes of the support set is employed to supervise and select discriminative visual features for the query set. To further enhance the discriminability of the support set, we introduce a visual-semantic contrastive learning module to promote the similarity between visual features and their corresponding semantic features. Extensive experiments on four benchmark datasets demonstrate that our proposed MAVSI could outperform existing state-of-the-art FSL methods.
IJCAI Conference 2024 Conference Paper
Domain adaptation aims to achieve label transfer from a labeled source domain to an unlabeled target domain, where the two domains exhibit different distributions. Existing methods primarily concentrate on designing a feature extractor to learn better domain-invariant features, along with developing an effective classifier for reliable predictions. In this paper, we introduce optimal graph learning to generate a cross-domain graph that effectively connects the two domains, and two domain-specific graphs to capture domain-specific structures. On the one hand, we incorporate the three graphs into the label propagation (LP) classifier to enhance its robustness to distribution differences. On the other hand, we leverage the three graphs to introduce graph embedding losses, promoting the learning of locally discriminative and domain-invariant features. Furthermore, we maximize the nuclear norm of predictions in LP to enhance class diversity, thereby improving its robustness to the class imbalance problem. Correspondingly, we develop an efficient algorithm to solve the associated optimization problem. Finally, we integrate the proposed LP and graph embedding losses into a deep neural network, resulting in our proposed deep cross-domain robust LP. Extensive experiments conducted on three cross-domain benchmark datasets demonstrate that our proposed approach could outperform existing state-of-the-art domain adaptation methods.
JBHI Journal 2024 Journal Article
Origins of replication sites (ORIs) are crucial genomic regions where DNA replication initiation takes place, playing pivotal roles in fundamental biological processes like cell division, gene expression regulation, and DNA integrity. Accurate identification of ORIs is essential for comprehending cell replication, gene expression, and mutation-related diseases. However, experimental approaches for ORI identification are often expensive and time-consuming, leading to the growing popularity of computational methods. In this study, we present PLANNER (DeeP LeArNiNg prEdictor for ORI), a novel approach for species-specific and cell-specific prediction of eukaryotic ORIs. PLANNER uses the multi-scale k-tuple sequences as input and employs the DNABERT pre-training model with transfer learning and ensemble learning strategies to train accurate predictive models. Extensive empirical test results demonstrate that PLANNER achieved superior predictive performance compared to state-of-the-art approaches, including iOri-Euk, Stack-ORI, and ORI-Deep, within specific cell types and across different cell types. Furthermore, by incorporating an interpretable analysis mechanism, we provide insights into the learned patterns, facilitating the mapping from discovering important sequential determinants to comprehensively analysing their biological functions.
AAAI Conference 2024 Conference Paper
This work presents an effective depth-consistency Self-Prompt Transformer, termed SelfPromer, for image dehazing. It is motivated by an observation that the estimated depths of an image with haze residuals and its clear counterpart vary. Enforcing the depth consistency of dehazed images with clear ones, therefore, is essential for dehazing. For this purpose, we develop a prompt based on the features of depth differences between the hazy input images and corresponding clear counterparts that can guide dehazing models for better restoration. Specifically, we first apply deep features extracted from the input images to the depth difference features for generating the prompt that contains the haze residual information in the input. Then we propose a prompt embedding module that is designed to perceive the haze residuals, by linearly adding the prompt to the deep features. Further, we develop an effective prompt attention module to pay more attention to haze residuals for better removal. By incorporating the prompt, prompt embedding, and prompt attention into an encoder-decoder network based on VQGAN, we can achieve better perception quality. As the depths of clear images are not available at inference, and the dehazed images with one-time feed-forward execution may still contain a portion of haze residuals, we propose a new continuous self-prompt inference that can iteratively correct the dehazing model towards better haze-free image generation. Extensive experiments show that our SelfPromer performs favorably against the state-of-the-art approaches on both synthetic and real-world datasets in terms of perception metrics including NIQE, PI, and PIQE. The source codes will be made available at https://github.com/supersupercong/SelfPromer.
AAAI Conference 2024 Conference Paper
Domain Generalization (DG) aims to improve the generalization ability of models trained on a specific group of source domains, enabling them to perform well on new, unseen target domains. Recent studies have shown that methods that converge to smooth optima can enhance the generalization performance of supervised learning tasks such as classification. In this study, we examine the impact of smoothness-enhancing formulations on domain adversarial training, which combines task loss and adversarial loss objectives. Our approach leverages the fact that converging to a smooth minimum with respect to task loss can stabilize the task loss and lead to better performance on unseen domains. Furthermore, we recognize that the distribution of objects in the real world often follows a long-tailed class distribution, resulting in a mismatch between machine learning models and our expectations of their performance on all classes of datasets with long-tailed class distributions. To address this issue, we consider the domain generalization problem from the perspective of the long-tail distribution and propose using the maximum square loss to balance different classes which can improve model generalizability. Our method's effectiveness is demonstrated through comparisons with state-of-the-art methods on various domain generalization datasets. Code: https://github.com/bamboosir920/SAMALTDG.
EAAI Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Deep learning methods on graph data have achieved remarkable efficacy across a variety of real-world applications, such as social network analysis and transaction risk detection. Nevertheless, recent studies have illuminated a concerning fact: even the most expressive Graph Neural Networks (GNNs) are vulnerable to graph adversarial attacks. While several methods have been proposed to enhance the robustness of GNN models against adversarial attacks, few have focused on a simple yet realistic approach: valuing adversarial risks and focusing safeguards at the node level. This empowers defenders to allocate a heightened security level to vulnerable nodes and a lower one to robust nodes. With this new perspective, we propose a novel graph defense strategy, RisKeeper, such that the adversarial risk can be directly kept in the input graph. We start by valuing the adversarial risk, introducing a cost-aware projected gradient descent attack that takes into account both cost avoidance and compliance with cost budgets. Subsequently, we present a learnable approach to ascertain the ideal security level for each individual node by solving a bi-level optimization problem. Through extensive experiments on four real-world datasets, we demonstrate that our method achieves superior performance surpassing state-of-the-art methods. Our in-depth case studies provide further insights into vulnerable and robust structural patterns, serving as inspiration for practitioners to exercise heightened vigilance.
AAAI Conference 2023 Conference Paper
For deep ordinal classification, learning a well-structured feature space specific to ordinal classification is helpful to properly capture the ordinal nature among classes. Intuitively, when the Euclidean distance metric is used, an ideal ordinal layout in feature space would be one where the sample clusters are arranged in class order along a straight line in space. However, enforcing samples to conform to a specific layout in the feature space is a challenging problem. To address this problem, in this paper, we propose a novel Constrained Proxies Learning (CPL) method, which can learn a proxy for each ordinal class and then adjust the global layout of classes by constraining these proxies. Specifically, we propose two kinds of strategies: hard layout constraint and soft layout constraint. The hard layout constraint is realized by directly controlling the generation of proxies to force them to be placed in a strict linear layout or semicircular layout (i.e., two instantiations of strict ordinal layout). The soft layout constraint is realized by constraining the proxy layout to always produce a unimodal proxy-to-proxies similarity distribution for each proxy (i.e., to be a relaxed ordinal layout). Experiments show that the proposed CPL method outperforms previous deep ordinal classification methods under the same setting of feature extractor.
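The hard linear-layout constraint can be pictured with a minimal sketch: one proxy per ordinal class, placed at evenly spaced points along a single direction so the proxies fall on a straight line in class order. The function name and the choice of a random direction are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def linear_layout_proxies(num_classes, dim, spacing=1.0, seed=0):
    """Toy sketch of a 'hard' linear layout: place one proxy per
    ordinal class at evenly spaced points along a random unit
    direction, so proxies lie on a straight line in class order."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)      # unit direction of the line
    origin = rng.normal(size=dim)               # arbitrary starting point
    return np.stack([origin + i * spacing * direction
                     for i in range(num_classes)])
```

By construction, the difference between any two consecutive proxies is the same vector, which is exactly the collinear, equally spaced arrangement the hard constraint targets.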
YNIMG Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
We show that raw degradation features can effectively guide deep restoration models, providing accurate degradation priors to facilitate better restoration. Networks that do not account for degradation gradually forget it during the learning process, which severely hinders model capacity. To address this, we propose a Prompting image Restorer, termed PromptRestorer. Specifically, PromptRestorer contains two branches: a restoration branch and a prompting branch. The former is used to restore images, while the latter perceives degradation priors to prompt the restoration branch with reliable perceived content to guide the restoration process for better recovery. To better perceive the degradation, which is extracted by a pre-trained model from given degradation observations, we propose a prompting degradation perception modulator, which adequately considers the characteristics of the self-attention mechanism and pixel-wise modulation, to better perceive the degradation priors from global and local perspectives. To control the propagation of the perceived content for the restoration branch, we propose gated degradation perception propagation, enabling the restoration branch to adaptively learn more useful features for better recovery. Extensive experimental results show that our PromptRestorer achieves state-of-the-art results on 4 image restoration tasks, including image deraining, deblurring, dehazing, and desnowing.
IS Journal 2023 Journal Article
With the continuous advancement of Industry 4.0 and intelligent manufacturing, remaining useful lifetime (RUL) prediction can forecast the future degradation state of machinery and then estimate the remaining service time before it loses its safe operation ability. Accordingly, a series of predictive maintenance strategies can be regulated in advance for equipment in the Industrial Internet of Things. To tackle the challenges of insufficiency of failure data and lack of confidence in RUL prediction results, a similarity-based relevance vector machine (SRVM) is proposed in this article. Primarily, the relationship among latent variables in the SRVM is learned adaptively through similarity computations to fully utilize the limited degradation data. Furthermore, these internal variables in the SRVM are treated as time-varying variables and re-estimated dynamically to provide RUL prediction with reliable confidence. The experiment results show that the prediction accuracy of the SRVM is higher than that of other baseline methods.
AAAI Conference 2023 Conference Paper
As manual point-wise labeling is time- and labor-intensive for fully supervised large-scale point cloud semantic segmentation, weakly supervised methods are increasingly active. However, existing methods fail to generate high-quality pseudo labels effectively, leading to unsatisfactory results. In this paper, we propose a weakly supervised point cloud semantic segmentation framework via receptive-driven pseudo label consistency and structural consistency to mine potential knowledge. Specifically, we propose three consistency constraints: pseudo label consistency among different scales, semantic structure consistency between intra-class features, and class-level relation structure consistency between pair-wise categories. The three consistency constraints are jointly used to effectively prepare and utilize pseudo labels for stable training. Finally, extensive experimental results on three challenging datasets demonstrate that our method significantly outperforms state-of-the-art weakly supervised methods and even achieves comparable performance to the fully supervised methods.
NeurIPS Conference 2022 Conference Paper
As in-the-wild data are increasingly involved in the training stage, machine learning applications become more susceptible to data poisoning attacks. Such attacks typically lead to test-time accuracy degradation or controlled misprediction. In this paper, we investigate the third type of exploitation of data poisoning - increasing the risks of privacy leakage of benign training samples. To this end, we demonstrate a set of data poisoning attacks to amplify the membership exposure of the targeted class. We first propose a generic dirty-label attack for supervised classification algorithms. We then propose an optimization-based clean-label attack in the transfer learning scenario, whereby the poisoning samples are correctly labeled and look "natural" to evade human moderation. We extensively evaluate our attacks on computer vision benchmarks. Our results show that the proposed attacks can substantially increase the membership inference precision with minimum overall test-time model performance degradation. To mitigate the potential negative impacts of our attacks, we also investigate feasible countermeasures.
AAAI Conference 2022 Conference Paper
Single image deraining is an important and challenging task for some downstream artificial intelligence applications such as video surveillance and self-driving systems. Most of the existing deep-learning-based methods constrain the network to generate derained images but few of them explore features from intermediate layers, different levels, and different modules which are beneficial for rain streaks removal. In this paper, we propose a high-order collaborative network with multi-scale compact constraints and a bidirectional scale-content similarity mining module to exploit features from deep networks externally and internally for rain streaks removal. Externally, we design a deraining framework with three sub-networks trained in a collaborative manner, where the bottom network transmits intermediate features to the middle network which also receives shallower rainy features from the top network and sends back features to the bottom network. Internally, we enforce multi-scale compact constraints on the intermediate layers of deep networks to learn useful features via a Laplacian pyramid. Further, we develop a bidirectional scale-content similarity mining module to explore features at different scales in a down-to-up and up-to-down manner. To improve the model performance on real-world images, we propose an online-update learning approach, which uses real-world rainy images to finetune the network and update the deraining results in a self-supervised manner. Extensive experiments demonstrate that our proposed method performs favorably against eleven state-of-the-art methods on five public synthetic datasets and one real-world dataset.
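The Laplacian pyramid used for the multi-scale constraints is a standard construction: each level stores the band-pass residual between a scale and the upsampled version of the next coarser scale. The toy sketch below uses plain decimation and nearest-neighbour upsampling purely for illustration (real pipelines typically use Gaussian filtering); it is not the paper's training code, but it shows the decompose/reconstruct mechanics.

```python
import numpy as np

def downsample(img):
    """2x decimation (illustrative stand-in for blur + decimate)."""
    return img[::2, ::2]

def upsample(img, shape):
    """Nearest-neighbour 2x upsampling, cropped to the target shape."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]

def laplacian_pyramid(img, levels=3):
    """Decompose an image into band-pass residuals plus a coarse base."""
    pyr, cur = [], img
    for _ in range(levels - 1):
        small = downsample(cur)
        pyr.append(cur - upsample(small, cur.shape))  # band-pass residual
        cur = small
    pyr.append(cur)                                   # coarsest level
    return pyr

def reconstruct(pyr):
    """Invert the decomposition: upsample and add residuals back."""
    cur = pyr[-1]
    for res in reversed(pyr[:-1]):
        cur = upsample(cur, res.shape) + res
    return cur
```

Because each residual stores exactly what the upsampled coarse level misses, the reconstruction is lossless; a multi-scale constraint would compare such per-level residuals between network features and targets.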
AAAI Conference 2022 Conference Paper
Learning with limited labeled data is a long-standing problem. Among various solutions, episodic training progressively classifies a series of few-shot tasks and thereby is assumed to be beneficial for improving the model's generalization ability. However, recent studies show that it is even inferior to the baseline model when facing domain shift between base and novel classes. To tackle this problem, we propose a domain-independent task-level self-supervised (TL-SS) method for cross-domain few-shot learning. The TL-SS strategy promotes the general idea of label-based instance-level supervision to task-level self-supervision by augmenting multiple views of tasks. Two regularizations on task consistency and a correlation metric are introduced to remarkably stabilize the training process and endow the prediction model with generalization ability. We also propose a high-order associated encoder (HAE) that is adaptive to various tasks. By utilizing a 3D convolution module, HAE is able to generate proper parameters and enables the encoder to adapt flexibly to any unseen tasks. The two modules complement each other and experimentally show clear improvements over state-of-the-art methods. Finally, we design a generalized task-agnostic test, where our intriguing findings highlight the need to re-think the generalization ability of existing few-shot approaches.
AAAI Conference 2022 Conference Paper
The prosperity of mobile and financial technologies has bred and expanded various kinds of financial products to a broader scope of people, which contributes to financial inclusion. It brings non-trivial social benefits of diminishing financial inequality. However, the technical challenges in individual financial risk evaluation, exacerbated by the unforeseen user characteristic distribution and limited credit history of new users, as well as the inexperience of newly-entered companies in handling complex data and obtaining accurate labels, impede further promotion of financial inclusion. To tackle these challenges, this paper develops a novel transfer learning algorithm (i.e., TransBoost) that combines the merits of tree-based models and kernel methods. TransBoost is designed with a parallel tree structure and an efficient weight-updating mechanism with theoretical guarantees, which enables it to excel in tackling real-world data with high-dimensional features and sparsity in O(n) time complexity. We conduct extensive experiments on two public datasets and a unique large-scale dataset from Tencent Mobile Payment. The results show that TransBoost outperforms other state-of-the-art benchmark transfer learning algorithms in terms of prediction accuracy with superior efficiency, demonstrates stronger robustness to data sparsity, and provides meaningful model interpretation. Besides, given a financial risk level, TransBoost enables financial service providers to serve the largest number of users, including those who would otherwise be excluded by other algorithms. That is, TransBoost improves financial inclusion.
NeurIPS Conference 2021 Conference Paper
The automatic generation of long and coherent medical reports given medical images (e.g., Chest X-ray and Fundus Fluorescein Angiography (FFA)) has great potential to support clinical practice. Researchers have explored advanced methods from computer vision and natural language processing to incorporate medical domain knowledge for the generation of readable medical reports. However, existing medical report generation (MRG) benchmarks lack both explainable annotations and reliable evaluation tools, hindering the current research advances from two aspects: firstly, existing methods can only predict reports without accurate explanation, undermining the trustworthiness of the diagnostic methods; secondly, the comparison among the predicted reports from different MRG methods is unreliable using the evaluation metrics of natural-language generation (NLG). To address these issues, in this paper, we propose an explainable and reliable MRG benchmark based on FFA Images and Reports (FFA-IR). Specifically, FFA-IR is large, with 10,790 reports along with 1,048,584 FFA images from clinical practice; it includes explainable annotations, based on a schema of 46 categories of lesions; and it is bilingual, providing both English and Chinese reports for each case. Besides using the widely used NLG metrics, we propose a set of nine human evaluation criteria to evaluate the generated reports. We envision FFA-IR as a testbed for explainable and reliable medical report generation. We also hope that it can broadly accelerate medical imaging research and facilitate interaction between the fields of medical imaging, computer vision, and natural language processing.
JBHI Journal 2020 Journal Article
Object detection, which has been widely applied in the biomedical field, is of real significance but technically challenging. In practice, object detection accuracy is vulnerable to labeling quality, which is usually not a big problem for simple algorithm or model verification since there are plenty of ideal, publicly available datasets whose classes and tags are all well-marked. However, in real scenarios, image data is often partially or even incorrectly labeled. Particularly in cell detection, this becomes a thorny issue since the labeling of the dataset is incomplete and inaccurate. To address this issue, we propose a data-augmentation algorithm that can generate fully labeled cell image data from incompletely labeled data. First of all, we randomly extract the labeled objects from raw cell images while keeping their corresponding position information. Next, we employ the framework of the cycle-consistent adversarial network, though significantly distinguished from the original one, to generate fully labeled data including both objects and backgrounds. We conduct extensive experiments on a blood cell classification dataset called BCCD to evaluate our model, and experimental results show that our proposed method can successfully address the weak annotation problem and improve the performance of object detection.
IJCAI Conference 2019 Conference Paper
Incentives are explored in the sharing economy to inspire users for better resource allocation. Previous works build a budget-feasible incentive mechanism to learn users' cost distribution. However, they only consider a special case in which all tasks are treated as the same. The general problem asks for a solution when the cost for different tasks varies. In this paper, we investigate this general problem by considering a system with k levels of difficulty. We present two incentivizing strategies for offline and online implementation, and formally derive the ratio of utility between them in different scenarios. We propose a regret-minimizing mechanism to decide incentives by dynamically adjusting budget assignment and learning from users' cost distributions. Our experiments demonstrate a utility improvement of about 7 times and a time saving of 54% in meeting a utility objective compared to the previous works.
IJCAI Conference 2018 Conference Paper
Privacy is a fundamental challenge for a variety of smart applications that depend on data aggregation and collaborative learning across different entities. In this paper, we propose a novel privacy-preserving architecture where clients can collaboratively train a deep model while preserving the privacy of each client's data. Our main strategy is to carefully partition a deep neural network between two non-colluding parties. One party performs linear computations on encrypted data utilizing a less complex homomorphic cryptosystem, while the other executes non-polynomial computations in plaintext but in a privacy-preserving manner. We analyze security and compare the communication and computation complexity with existing approaches. Our extensive experiments on different datasets demonstrate not only stable training without accuracy loss, but also a 14 to 35 times speedup compared to the state-of-the-art system.
ICRA Conference 2009 Conference Paper
Force-reflecting teleoperators in which the remote environment is kinesthetically coupled to the operator can considerably increase task performance. Wave-variable-based controllers can support the stable operation of force-reflecting teleoperators under arbitrary communication delays. Transparency in such systems is compromised in order to maintain stability. We had previously proposed a modified wave variable controller that implemented additional wave impedance in the wave variable transformations in order to focus more closely on force tracking. In this paper, we present a new controller for bilateral teleoperators based on the modified wave variable control method which provides superior position and force tracking performance compared to the traditional wave-variable-based method. Moreover, the method has high stability. Theoretical investigation and experimental results confirm the performance of this new controller.