EAAI Journal 2026 Journal Article
A heterogeneous multi-graph spatio-temporal network for runoff forecasting
- Xuerui Zhou
- Baowei Yan
- Jun Zhang
- Jianbo Chang
- Dongxu Yang
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
AAAI Conference 2026 Conference Paper
Sequential recommendation aims to predict the next item based on historical interactions. To further enhance the reasoning capability in sequential recommendation, LLMs are employed to predict the next item or generate semantic IDs for item representation, given LLMs' extensive domain knowledge and reasoning ability. However, existing LLM-based methods suffer from two limitations. (i) The scarcity of recommendation data with reasoning paths makes it challenging to design suitable chain-of-thought prompting templates, and the full potential of LLMs' reasoning abilities remains underutilized. (ii) Upon obtaining semantic IDs, the LLMs and their representations are excluded from the subsequent recommendation model training, preventing downstream models from fully utilizing the rich semantic information encoded within these IDs. To address these issues, we propose a novel CoderRec framework, which is capable of fully exploiting the information encoded in semantic IDs to guide the recommendation process. Specifically, to address the problem of scarcity in reasoning path-augmented data, we introduce latent reasoning into sequential recommendation and treat the representation captured by the downstream model as domain-specific latent thought, enabling implicit logical inference without requiring explicit CoT annotations. To ensure that the downstream recommendation models are able to deeply leverage the semantic information within IDs, we propose a novel cross-scale model collaboration strategy, which employs cross-scale IDs and a two-phase approach to align LLM-derived semantics with recommendation objectives. Extensive experiments have shown the effectiveness of our proposed CoderRec framework.
AAAI Conference 2026 Conference Paper
Recent self-supervised pre-training methods for object detection often rely on generic object proposals for localization and semantic feature learning for classification, but they yield limited improvements when applied to Detection Transformers (DETR) due to a lack of architectural alignment. Hence, we propose an elegant and versatile self-supervised framework tailored for DETR-like models called Distance-aware Multi-view Contrastive Learning (DisCo DETR). DisCo DETR enhances localization and semantic features through two core components. (i) Distance-aware Multi-view Object Query Fusion explicitly guides object queries to focus on spatially close objects across views, stabilizing training and improving localization accuracy. (ii) Contrastive Learning for DETR uses native bipartite matching to identify positive output pairs across views and pull them closer, enhancing semantic feature discrimination with no extra matching. DisCo DETR can be seamlessly integrated into DETR-like models and achieves SOTA transfer performance on PASCAL VOC and COCO benchmarks across multiple variants.
JBHI Journal 2026 Journal Article
Automated sleep staging is essential for large-scale and home-based sleep monitoring; however, in routine clinical practice, sleep annotation remains largely dependent on experienced experts performing time-consuming and labor-intensive manual scoring. Existing automatic systems often struggle to adapt reliably to new subjects, limiting their clinical adoption and reinforcing the reliance on expert review. This creates a strong demand for adaptive and efficient sleep staging systems that can substantially reduce annotation workload while preserving expert-level accuracy. We propose BayesSleepNet, a novel framework that integrates Bayesian uncertainty quantification with active learning for adaptive sleep staging. BayesSleepNet employs principled Bayesian modeling by placing distributions over network weights and performing Monte Carlo sampling at inference, enabling explicit quantification of model (epistemic) uncertainty. These uncertainty estimates drive a two-stage sample selection strategy that first fine-tunes the model using representative epochs and subsequently prioritizes persistently uncertain samples for expert review. Across four public sleep datasets, BayesSleepNet consistently improves performance—by 7.60% in accuracy, 8.27% in macro-F1, and 0.104 in Cohen's $\kappa$—while requiring manual annotation of only 20% of data from new subjects. Despite its adaptive learning capability, BayesSleepNet remains computationally lightweight, using substantially fewer parameters than representative high-capacity state-of-the-art models. These results demonstrate the clinical promise of uncertainty-aware active learning as a practical and cost-efficient paradigm for semi-automated sleep staging. Code is available at https://github.com/yuty2009/bayesugal.
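The selection mechanism this abstract describes, Monte Carlo sampling to estimate epistemic uncertainty and then prioritizing uncertain epochs for expert review, can be sketched as follows. This is an illustrative NumPy sketch using predictive entropy as the uncertainty score; the function names and the exact criterion are assumptions, not BayesSleepNet's actual code:

```python
import numpy as np

def mc_uncertainty(prob_samples):
    """Epistemic uncertainty via predictive entropy.

    prob_samples: (T, N, C) array of class probabilities from T
    stochastic (Monte Carlo) forward passes over N epochs, C stages.
    """
    mean_p = prob_samples.mean(axis=0)                      # (N, C)
    return -(mean_p * np.log(mean_p + 1e-12)).sum(axis=1)   # (N,)

def select_for_review(prob_samples, budget):
    """Return indices of the `budget` most uncertain epochs."""
    u = mc_uncertainty(prob_samples)
    return np.argsort(u)[::-1][:budget]

# Toy case: the MC passes disagree only on epoch 1, so it is selected.
samples = np.array([
    [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3], [0.8, 0.1, 0.1]],
    [[0.9, 0.05, 0.05], [0.2, 0.5, 0.3], [0.8, 0.1, 0.1]],
])
picked = select_for_review(samples, budget=1)
print(picked)  # [1]
```

In the paper's two-stage strategy, a budget like this 20% annotation cap would bound how many of these selected epochs actually reach the expert.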
AAAI Conference 2026 Conference Paper
Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (\textit{e.g.}, GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and INRs-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.
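The primitive-based rendering that GaussianImage-style methods build on can be sketched as a direct accumulation of 2D Gaussians. This is a minimal NumPy sketch under simplifying assumptions (additive blending, no tiling, no quantization, hypothetical function signature), not GaussianImage++'s implementation:

```python
import numpy as np

def render_gaussians(h, w, means, covs_inv, colors, weights):
    """Accumulate weighted 2D Gaussian primitives into an h x w image.

    means: (N, 2) centers in (x, y) pixels; covs_inv: (N, 2, 2) inverse
    covariances; colors: (N, 3); weights: (N,) opacities.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    img = np.zeros((h, w, 3))
    for mu, ci, col, wgt in zip(means, covs_inv, colors, weights):
        d = np.stack([xs - mu[0], ys - mu[1]], axis=-1)      # (h, w, 2)
        q = np.einsum('hwi,ij,hwj->hw', d, np.asarray(ci), d)
        img += wgt * np.exp(-0.5 * q)[..., None] * np.asarray(col)
    return img

# One isotropic Gaussian centered in a 9x9 image peaks at its center.
img = render_gaussians(9, 9, [[4.0, 4.0]], [np.eye(2)], [[1.0, 0.0, 0.0]], [1.0])
print(img[4, 4, 0])  # 1.0 at the center pixel
```

A densification mechanism like the paper's would add primitives where the rendered image deviates most from the target, rather than fixing N up front.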
AAAI Conference 2026 Conference Paper
Large Reasoning Models (LRMs) achieve promising results on complex reasoning tasks but remain susceptible to hallucinations. Existing hallucination detection methods based on Large Language Models (LLMs) often focus solely on final answers, overlooking inconsistencies between the answer and reasoning process. This limitation reduces their ability to detect hallucinations during inference. Moreover, training-free approaches lack mechanisms for confidence estimation, resulting in an unquantified detection output. In contrast, training-based methods can provide fine-grained assessments but often neglect the self-correction capability of LRMs, where earlier errors may be corrected in subsequent steps, leading to inaccurate hallucination detection. To address these challenges, we propose ConfFuse, a unified framework that fuses global and local confidence scores for hallucination detection. A Global Hallucination Detection Model (GHDM) is trained using Direct Preference Optimization (DPO) to assess hallucinations at the level of entire reasoning chains, yielding global confidence estimates. Simultaneously, a Process Reward Model (PRM) estimates step-wise confidence scores to capture local logical flaws. A weighted fusion strategy combines the global confidence score with the minimum local score to jointly reflect overall reasoning consistency and local soundness. Experimental evaluations demonstrate that ConfFuse surpasses Qwen3-1.7B and Qwen3-8B by up to 11.86% and 5.46% in F1 score on in-distribution datasets, and achieves average improvements of 4.65% and 2.80% on out-of-distribution datasets. These results verify the effectiveness and generalizability of the proposed framework.
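The fusion rule described above, combining the global confidence with the minimum local score, is simple enough to state directly. A minimal sketch, where the weight `alpha` is an illustrative assumption rather than the paper's value:

```python
def fused_confidence(global_score, step_scores, alpha=0.6):
    """Weighted fusion of a chain-level (global) confidence with the
    minimum step-level (local) confidence: a single flawed step caps
    local soundness. `alpha` is illustrative, not the paper's value.
    """
    return alpha * global_score + (1 - alpha) * min(step_scores)

# A confident chain with one weak step gets pulled down by that step.
score = fused_confidence(0.9, [0.95, 0.4, 0.88])
print(round(score, 2))  # 0.7
```

Taking the minimum rather than the mean of the step scores reflects the framing above: one local logical flaw should lower the whole chain's confidence even if most steps look sound.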
JBHI Journal 2026 Journal Article
Drug repositioning, exploring new indications for existing drugs, is emerging as a promising approach to accelerate drug discovery and reduce the risk of failure in research. Recent advances in this topic by applying graph neural networks have enabled researchers to achieve significant results by extracting latent features from the original data. However, previous studies have not fully considered the distinctive information embedded within different construction graphs, which may lead to insufficient classification performance due to the lack of more detailed features. This work therefore proposes a novel approach, namely MVGF-DR, which leverages graph network construction and multi-view graph feature fusion for drug repositioning. MVGF-DR builds a comprehensive graph network from both similarity and association information, i.e., a similarity graph network is constructed with drug-drug and disease-disease similarities, where similarity information is extracted by graph isomorphism networks, and an association graph network with drug-disease associations, where drug-disease relationships are explored by graph convolutional networks. Additionally, a maximum value selection strategy is introduced to filter features from different channels for feature fusion and noise reduction. The average AUROC and AUPR achieved by MVGF-DR across the three datasets reached 95.38% and 51.20%, respectively, outperforming the other five state-of-the-art models. Further experiments also demonstrated the flexibility and practical applicability of MVGF-DR.
AAAI Conference 2026 Conference Paper
Augmented Reality (AR) and Multimodal Large Language Models (LLMs) are rapidly evolving, providing unprecedented capabilities for human-computer interaction. However, their integration introduces a new attack surface for Social Engineering (SE). In this paper, we systematically investigate the feasibility of orchestrating AR-driven Social Engineering attacks using Multimodal LLMs for the first time, via our proposed SEAR framework, which operates through three key phases: (1) AR-based social context synthesis, which fuses Multimodal inputs (visual, auditory and environmental cues); (2) role-based Multimodal RAG (Retrieval-Augmented Generation), which dynamically retrieves and integrates social context; and (3) ReInteract social engineering agents, which execute adaptive multiphase attack strategies through inference interaction loops. To verify SEAR, we conducted an IRB-approved study with 60 participants and built a novel dataset of 180 annotated conversations in different social scenarios (e.g., coffee shops, networking events). Our results show that SEAR is highly effective at eliciting high-risk behaviors (e.g., 93.3% of participants susceptible to email phishing). The framework was particularly effective in building trust, with 85% of targets willing to accept an attacker's call after an interaction. Also, we identified notable limitations such as authenticity gaps. This work provides proof-of-concept for AR-LLM driven social engineering attacks and insights for developing defenses against next-generation AR/LLM-based SE threats.
AAAI Conference 2026 Conference Paper
Peptide-based drug design targeting “undruggable” proteins remains one of the most critical challenges in modern drug discovery. Conventional peptide-discovery pipelines rely on low-throughput experimental screening, which is both time-consuming and prohibitively expensive. Moreover, existing computational approaches for designing peptides against target proteins typically depend on the availability of high-quality structural information. Although recent structure-prediction tools such as AlphaFold3 have achieved breakthroughs in protein modeling, their accuracy for functional interfaces remains limited. The acquisition of high-resolution structures is often expensive, time-intensive, and particularly challenging for targets with dynamic conformations, further restricting the efficient development of peptide therapeutics. Additionally, current sequence-based generative methods follow a paradigm that relies on known templates, which limits the exploration of sequence space and results in generated peptides lacking diversity and novelty. To address these limitations, we propose a contrastive conditioned diffusion framework for target-specific peptide generation, referred to as PepCCD. It employs a contrastive learning strategy between proteins and peptides to extract sequence-based conditioning representations of target proteins, which serve as precise conditions to guide a pre-trained diffusion model to generate peptide sequences with the desired target specificity. Extensive experiments on multiple benchmark target proteins demonstrate that the peptides designed by PepCCD exhibit strong binding affinity and outperform state-of-the-art methods in terms of diversity and generation efficiency.
AAAI Conference 2026 Conference Paper
Inter-agent communication serves as an effective mechanism for enhancing performance in collaborative multi-agent reinforcement learning (MARL) systems. However, the inherent communication latency in practical systems induces both action decision delays and outdated information sharing, impeding MARL performance gains, particularly in time-critical applications like autonomous driving. In this work, we propose a Value-of-Information aware Low-latency Communication (VIL2C) scheme that proactively adjusts the latency distribution to mitigate its effects in MARL systems. Specifically, we define a Value of Information (VoI) metric to quantify the importance of delayed messages to the recipient agent's decision. We then design a VoI-aware resource allocation method that dynamically prioritizes message transmission based on each delayed message's importance. Moreover, we propose a progressive message reception mechanism to adaptively adjust the reception duration based on received messages. We derive the optimized VoI-aware resource allocation and theoretically prove the performance advantage of the proposed VIL2C scheme. Extensive experiments demonstrate that VIL2C outperforms existing approaches under various communication conditions. These gains are attributed to the low-latency transmission of high-VoI messages via resource allocation and the elimination of unnecessary waiting periods via adaptive reception duration.
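The idea of prioritizing transmissions by each message's value can be illustrated with a toy scheduler. This greedy value-per-byte rule is an assumption for illustration only; the paper derives an optimized allocation, not this heuristic:

```python
def schedule(messages, capacity):
    """Greedy VoI-aware transmission sketch: send the highest
    value-per-byte messages first until the slot's capacity is spent.
    messages: list of (msg_id, voi, size_bytes).
    """
    order = sorted(messages, key=lambda m: m[1] / m[2], reverse=True)
    sent, used = [], 0
    for mid, voi, size in order:
        if used + size <= capacity:
            sent.append(mid)
            used += size
    return sent

# High-VoI-density messages go out first under a tight capacity budget.
msgs = [("a", 0.9, 300), ("b", 0.2, 100), ("c", 0.8, 200)]
print(schedule(msgs, capacity=500))  # ['c', 'a']
```

The progressive reception mechanism described above would then let the recipient stop waiting once the messages received so far carry enough value for its decision, rather than blocking on the full set.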
JBHI Journal 2025 Journal Article
Auscultation of the chest is a fundamental diagnostic tool for cardiovascular and pulmonary diseases. However, the two main chest sound parts, heart sound (HS) and lung sound (LS), are often mixed, limiting diagnostic accuracy. This paper presents a novel Phase-Enhanced Neural Network (PENN) for HS and LS separation. To address the under-utilization of phase information, PENN integrates a feedforward connection that feeds the input spectrum into the Restorer, enabling phase recovery based on the local inference feature of phase. A time-frequency Dual-Path Transformer (DPT) is employed to expand the network's receptive field and enhance performance. To interpret the effectiveness of PENN, two new metrics, mSI-SDRi and pSI-SDRi, are proposed to separately evaluate the contributions of magnitude and phase. Experiments show that PENN achieves pSI-SDRi improvements of 1.44 dB for HS and 2.25 dB for LS under an LS cutoff frequency ($f_{c\text{lung}}$) of 60 Hz. Extensive experimental results demonstrate the effectiveness and robustness of PENN, offering a promising solution to improve the accuracy of auscultation.
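The mSI-SDRi/pSI-SDRi metrics above build on the standard scale-invariant SDR. As a reference point, here is a plain SI-SDR computation in NumPy; the paper's magnitude/phase decomposition of this quantity is not reproduced here:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB: project the
    estimate onto the reference to remove any global scaling, then
    compare target energy to residual energy."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # optimal scale
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

t = np.linspace(0, 1, 1000)
ref = np.sin(2 * np.pi * 5 * t)
print(si_sdr(2.0 * ref, ref) > 100)  # pure rescaling costs nothing: True
```

An "improvement" metric such as pSI-SDRi would report the change in a quantity like this between the mixture and the separated output.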
ICML Conference 2025 Conference Paper
World models aim to learn action-controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.
JBHI Journal 2025 Journal Article
Electroencephalography (EEG) source imaging (ESI) methods aim to reconstruct cortical sources from scalp EEG signals, a crucial task for understanding the normal brain as well as brain disorders. Traditional model-driven ESI methods face challenges in real-time reconstruction, while deep neural network (DNN)-based ESI methods often struggle with generalization to new data. To address these issues, we propose ADMM-ESINet, a novel deep unfolding neural network for robust and efficient reconstruction of EEG extended sources. ADMM-ESINet leverages a structured sparsity constraint within a regularization framework and employs the Alternating Direction Method of Multipliers (ADMM) to achieve iterative solutions. By unrolling the ADMM algorithm into a cascaded network architecture, ADMM-ESINet effectively integrates prior knowledge, enabling end-to-end, real-time ESI. Crucially, both the regularization parameters and the spatial transform operator are learned directly from the training data. Numerical results demonstrate that ADMM-ESINet surpasses traditional DNN-based methods in generalization ability and accurately reconstructs the location, extent, and temporal dynamics of extended sources, establishing ADMM-ESINet as a promising method for real-time ESI.
IS Journal 2025 Journal Article
Detecting aggressive driving is challenging but crucial for public safety. Existing methods rely on time-series data of drivers' physiology, behavior, and vehicle movement but overlook drivers' emotions and environmental influences. We propose ARISE, a multisource aggregation model integrating physiological, behavioral, and emotional data, vehicle sensor inputs, and environmental conditions. ARISE employs multisource feature extraction, multimodal fusion, and a classifier to detect aggressive driving. Unlike graph-based methods that fail to detect gradual aggression shifts or transformer-based methods prone to delays, ARISE explicitly models vehicle state continuity and the aggressive driving environment. A motion similarity descriptor tracks state transitions, while an aggression descriptor quantifies environmental aggression. Additionally, a driving performance descriptor assesses driving workload and stability. Experiments show that ARISE significantly outperforms state-of-the-art methods in aggressive driving detection.
AAAI Conference 2025 Conference Paper
Recent research showcases the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which primarily generate stories in a caption-dependent manner, often overlook the importance of contextual consistency and the relevance of frames during sequential generation. To address this, we propose Rich-contextual Conditional Diffusion Models (RCDMs), a two-stage approach designed to enhance the semantic and temporal consistency of story generation. Specifically, in the first stage, the frame-prior transformer diffusion model is presented to predict the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate semantically and temporally consistent stories. Moreover, RCDMs can generate consistent stories with a single forward inference, unlike autoregressive models. Our qualitative and quantitative results demonstrate that RCDMs outperform existing methods in challenging scenarios.
AAAI Conference 2025 Conference Paper
Existing learning-based stereo image codecs adopt sophisticated transformations with simple entropy models derived from single-image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image to a latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies, by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets, Cityscapes and InStereo2K, with fast encoding and decoding speed.
NeurIPS Conference 2025 Conference Paper
Speculative decoding has become widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding methods. We introduce a Dynamic Tree Cascade (DyTC) algorithm that adaptively routes the multi-level draft models and assigns the draft lengths, based on heuristics of acceptance rates and latency prediction. Our CAS-Spec method achieves state-of-the-art acceleration compared to existing on-the-fly speculative decoding methods, with an average speedup from $1.1\times$ to $2.3\times$ over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by 47% and 48% over cascade-based and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds promising potential for further acceleration as self-speculative decoding techniques continue to evolve.
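The draft-and-verify primitive that cascade methods like this one build on can be sketched in a few lines. This is a toy greedy variant with hypothetical decoder callables, not CAS-Spec's algorithm (which cascades multiple draft levels and routes among them):

```python
def speculative_step(draft_next, target_next, context, k):
    """One greedy draft-and-verify round: the draft proposes k tokens;
    the target keeps the longest prefix it agrees with, plus one token
    of its own, so each round always advances the sequence.
    draft_next/target_next: callables mapping a context to a token."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) != t:
            break                          # reject the rest of the draft
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))      # the target always advances
    return accepted

# Toy decoders that agree on "bc" and then diverge.
target = lambda ctx: "abcdef"[len(ctx)]
draft = lambda ctx: "abcxyz"[len(ctx)]
print(speculative_step(draft, target, ["a"], k=4))  # ['b', 'c', 'd']
```

Verification of all k drafted tokens can run in one batched target forward pass, which is where the speedup over token-by-token autoregressive decoding comes from.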
NeurIPS Conference 2025 Conference Paper
Despite the remarkable successes of general-purpose neural networks, such as MLPs and Transformers, we find that they exhibit notable shortcomings in modeling and reasoning about periodic phenomena, achieving only marginal performance within the training domain and failing to generalize effectively to out-of-domain (OOD) scenarios. Periodicity is ubiquitous throughout nature and science. Therefore, neural networks should be equipped with the essential ability to model and handle periodicity. In this work, we propose FAN, a novel neural network that effectively addresses periodicity modeling challenges while offering broad applicability similar to MLP with fewer parameters and FLOPs. Periodicity is naturally integrated into FAN's structure and computational processes by introducing the Fourier Principle. Unlike existing Fourier-based networks, which possess particular periodicity modeling abilities but face challenges in scaling to deeper networks and are typically designed for specific tasks, our approach overcomes this challenge to enable scaling to large-scale models and maintains the capability to be applied to more types of tasks. Through extensive experiments, we demonstrate the superiority of FAN in periodicity modeling tasks and the effectiveness and generalizability of FAN across a range of real-world tasks. Moreover, we reveal that compared to existing Fourier-based networks, FAN accommodates both periodicity modeling and general-purpose modeling well.
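The core structural idea, spending part of a layer's width on explicit cos/sin features so periodicity is built into the computation, can be sketched as a single layer. This follows the published FAN formulation in spirit only; the width splits and activation here are simplifying assumptions:

```python
import numpy as np

def fan_layer(x, Wp, Wb, bb):
    """A FAN-style layer sketch: part of the output carries explicit
    cos/sin features of a linear projection of the input, and the rest
    is an ordinary nonlinear (GELU) projection, as in an MLP layer."""
    periodic = Wp @ x
    gelu = lambda v: 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))
    return np.concatenate([np.cos(periodic), np.sin(periodic), gelu(Wb @ x + bb)])

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
out = fan_layer(x, rng.standard_normal((3, 4)), rng.standard_normal((5, 4)), np.zeros(5))
print(out.shape)  # (11,): 3 cos + 3 sin + 5 GELU features
```

Because the cos/sin features are periodic in the projection `Wp @ x` by construction, a stack of such layers can extrapolate periodic structure beyond the training domain instead of having to approximate it with saturating nonlinearities.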
JBHI Journal 2025 Journal Article
Recent studies have demonstrated that miRNA expression dysregulation is closely related to the occurrence of various diseases; thus, miRNA-based drug development strategies have received increasing research interest. Most existing computational methods focus on the attribute information of individual nodes and are limited to the direct associations between nodes, thereby ignoring the complex associations inherent in the network. This limitation may lead to the loss of key potential information, which impacts prediction accuracy. To address these issues, we propose a multisource information fusion and metapath enhancement matrix based graph autoencoder (MSMP-GAE) to predict the potential associations between miRNAs and drugs. The proposed MSMP-GAE model comprises a metapath instance extraction module, a metapath feature-enhanced encoder module, a weighted feature fusion module, and a graph autoencoder. First, we construct an miRNA–drug heterogeneous network using experimentally validated miRNA–drug interactions and integrate various miRNA and drug features into an initial feature matrix to comprehensively represent their intrinsic property information. Then, we extract metapath instances from the interaction network, generate multiple metapath enhancement matrices, and fuse them with the initial feature matrix to generate high-quality node feature embeddings. Finally, we employ the graph autoencoder for fivefold cross-validation on a public dataset and test it on an independent test set. Experimental results demonstrate that the proposed MSMP-GAE model achieved area under the curve (AUC) and AUPR values of 98.61% and 98.23%, respectively, considerably better than several state-of-the-art methods. This highlights the importance of higher-order complex associations between nodes in the miRNA–drug association (MDA) prediction task and provides a new approach to advance MDA prediction.
JBHI Journal 2025 Journal Article
Uncovering novel drug-drug interactions (DDIs) plays a pivotal role in advancing drug development and improving clinical treatment. The outstanding effectiveness of graph neural networks (GNNs) has garnered significant interest in the field of DDI prediction. Consequently, there has been a notable surge in the development of network-based computational approaches for predicting DDIs. However, current approaches face limitations in capturing the spatial relationships between neighboring nodes and their higher-level features during the aggregation of neighbor representations. To address this issue, this study introduces a novel model, KGCNN, designed to comprehensively tackle DDI prediction tasks by considering spatial relationships between molecules within the biomedical knowledge graph (BKG). KGCNN is built upon a message-passing GNN framework, consisting of propagation and aggregation. In the context of the BKG, KGCNN governs the propagation of information based on semantic relationships, which determine the flow and exchange of information between different molecules. In contrast to traditional linear aggregators, KGCNN introduces a spatial-aware capsule aggregator, which effectively captures the spatial relationships among neighboring molecules and their higher-level features within the graph structure. The ultimate goal is to leverage these learned drug representations to predict potential DDIs. To evaluate the effectiveness of KGCNN, it undergoes testing on two datasets. Extensive experimental results demonstrate its superiority in DDI predictions and quantified performance.
AAAI Conference 2025 Conference Paper
Federated learning (FL) enables collaborative learning among decentralized clients while safeguarding the privacy of their local data. Existing studies on FL typically assume offline labeled data available at each client when the training starts. Nevertheless, the training data in practice often arrive at clients in a streaming fashion without ground-truth labels. Given the expensive annotation cost, it is critical to identify a subset of informative samples for labeling on clients. However, selecting samples locally while accommodating the global training objective presents a challenge unique to FL. In this work, we tackle this conundrum by framing the data querying process in FL as a collaborative decentralized decision-making problem and proposing an effective solution named LeaDQ, which leverages multi-agent reinforcement learning algorithms. In particular, under the implicit guidance from global information, LeaDQ effectively learns the local policies for distributed clients and steers them towards selecting samples that can enhance the global model's accuracy. Extensive simulations on image and text tasks show that LeaDQ advances the model performance in various FL scenarios, outperforming the benchmarking algorithms.
ICLR Conference 2025 Conference Paper
In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it into the original KV cache runs into the temporal confusion problem, leading to significantly worse performance. We address this efficiency-accuracy trade-off by introducing Positional Integrity Encoding (PIE). Building upon rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full-recomputation approach across all model sizes and tasks while closely approximating the model performance.
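The remove-then-reapply step described above works because rotary rotations compose additively: undoing the rotation for the old position and applying the one for the new position is equivalent to a single rotation by the position delta. A minimal NumPy sketch of that identity (illustrative names; not the paper's code):

```python
import numpy as np

def rope_rotate(k, pos, base=10000.0):
    """Standard RoPE: rotate each 2-D pair of k by pos * its frequency."""
    half = k.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) / half)
    c, s = np.cos(ang), np.sin(ang)
    x, y = k[..., 0::2], k[..., 1::2]
    out = np.empty_like(k)
    out[..., 0::2] = x * c - y * s
    out[..., 1::2] = x * s + y * c
    return out

def pie_correct(k_cached, old_pos, new_pos):
    """Re-index a cached RoPE key from old_pos to new_pos with a single
    rotation by (new_pos - old_pos), without recomputing the key."""
    return rope_rotate(k_cached, new_pos - old_pos)

rng = np.random.default_rng(0)
k = rng.standard_normal(8)
cached = rope_rotate(k, pos=5)       # key as stored before the edit
fixed = pie_correct(cached, 5, 7)    # token shifted two positions right
direct = rope_rotate(k, pos=7)       # what full recomputation yields
print(np.allclose(fixed, direct))  # True
```

This is why the fix costs only one round of matrix multiplication over the affected Key-cache entries instead of re-encoding the whole sequence.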
JBHI Journal 2025 Journal Article
Anticancer peptides (ACPs) have emerged as one of the most promising therapeutic agents for cancer treatment. They are bioactive peptides featuring broad-spectrum activity and low drug resistance. The discovery of ACPs via traditional biochemical methods is laborious and costly. Accordingly, various computational methods have been developed to facilitate the discovery of ACPs. However, the data resources and knowledge of ACPs are still very scarce, and only a few of them are clinically verified, which limits the competence of computational methods. To address this issue, in this article, we propose an ACP prediction model based on multi-domain transfer learning, namely MDTL-ACP, to discriminate novel ACPs from plentiful inactive peptides. In particular, we collect abundant antimicrobial peptides (AMPs) from four well-studied peptide domains and extract their inherent features as the input of MDTL-ACP. The features learned from multiple source domains of AMPs are then transferred into the target prediction task of ACPs via an artificial neural network-based shared extractor and task-specific classifiers in MDTL-ACP. The knowledge captured in the transferred features enhances the prediction of ACPs in the target domain. Experimental results demonstrate that MDTL-ACP can outperform traditional and state-of-the-art ACP prediction methods.
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30\% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.
AAAI Conference 2025 Conference Paper
Scene Graph Generation (SGG) aims to detect all objects and identify their pairwise relationships existing in the scene. Considering the substantial human labor costs, existing scene graph annotations are often sparse and biased, which leads to confused training on low-frequency predicates. In this work, we design a Semi-Supervised Clustering framework for Scene Graph Generation (SSC-SGG) that uses the sparse labeled data to guide the generation of effective pseudo-labels from unlabeled object pairs, thus enriching the labeled sample space, especially for low-frequency interaction samples. We approach from the perspective of clustering, reducing the problem of confirmation bias in a self-training manner. Specifically, we first enhance the model's robustness to feature extraction via prototype-based clustering, aggregating different relationship augmented features onto the same prototype. Secondly, we design a dynamic pseudo-label assignment algorithm based on a mini-batch, which adjusts the detection sensitivity to different frequency samples from the historical assignment. Finally, we conduct joint training on the pseudo-labels and the labeled data. We conduct experiments on various SGG models and achieve substantial overall performance improvements, demonstrating the effectiveness of SSC-SGG.
NeurIPS Conference 2025 Conference Paper
Spatio-temporal trajectory representation learning plays a crucial role in various urban applications such as transportation systems, urban planning, and environmental monitoring. Existing methods can be divided into single-view and multi-view approaches, with the latter offering richer representations by integrating multiple sources of spatio-temporal data. However, these methods often struggle to generalize across diverse urban scenes due to multi-city structural heterogeneity, which arises from the disparities in road networks, grid layouts, and traffic regulations across cities, and the amplified seesaw phenomenon, where optimizing for one city, view, or task can degrade performance in others. These challenges hinder the deployment of trajectory learning models across multiple cities, limiting their real-world applicability. In this work, we propose SMARTraj$^2$, a novel stable multi-city adaptive method for multi-view spatio-temporal trajectory representation learning. Specifically, we introduce a feature disentanglement module to separate domain-invariant and domain-specific features, and a personalized gating mechanism to dynamically stabilize the contributions of different views and tasks. Our approach achieves superior generalization across heterogeneous urban scenes while maintaining robust performance across multiple downstream tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of SMARTraj$^2$ in enhancing cross-city generalization and outperforming state-of-the-art methods. See our project website at \url{https://github.com/GestaltCogTeam/SMARTraj}.
NeurIPS Conference 2024 Conference Paper
Large language models (LLMs) have achieved impressive advancements across numerous disciplines, yet the critical issue of knowledge conflicts, a major source of hallucinations, has rarely been studied. While a few studies have explored the conflicts between the inherent knowledge of LLMs and the retrieved contextual knowledge, a comprehensive assessment of knowledge conflict in LLMs is still missing. Motivated by this research gap, we first propose ConflictBank, the largest benchmark with 7.45M claim-evidence pairs and 553k QA pairs, addressing conflicts from misinformation, temporal discrepancies, and semantic divergences. Using ConflictBank, we conduct thorough and controlled experiments for a comprehensive understanding of LLM behavior in knowledge conflicts, focusing on three key aspects: (i) conflicts encountered in retrieved knowledge, (ii) conflicts within the models' encoded knowledge, and (iii) the interplay between these conflict forms. Our investigation delves into four model families and twelve LLM instances and provides insights into conflict types, model sizes, and the impact at different stages. We believe that knowledge conflicts represent a critical bottleneck to achieving trustworthy artificial intelligence and hope our work will offer valuable guidance for future model training and development. Resources are available at https://github.com/zhaochen0110/conflictbank.
EAAI Journal 2024 Journal Article
NeurIPS Conference 2024 Conference Paper
The lack of object-level labels presents a significant challenge for 3D object retrieval in the open-set environment. However, part-level shapes of objects often share commonalities across categories but remain underexploited in existing retrieval methods. In this paper, we introduce the Hypergraph-Based Assembly Fuzzy Representation (HARF) framework, which navigates the intricacies of open-set 3D object retrieval through a bottom-up lens of Part Assembly. To tackle the challenge of assembly isomorphism and unification, we propose the Hypergraph Isomorphism Convolution (HIConv) for smoothing and adopt the Isomorphic Assembly Embedding (IAE) module to generate assembly embeddings with geometric-semantic consistency. To address the challenge of open-set category generalization, our method employs high-order correlations and fuzzy representation to mitigate distribution skew through the Structure Fuzzy Reconstruction (SFR) module, by constructing a leveraged hypergraph based on local certainty and global uncertainty correlations. We construct three open-set retrieval datasets for 3D objects with part-level annotations: OP-SHNP, OP-INTRA, and OP-COSEG. Extensive experiments and ablation studies on these three benchmarks show our method outperforms current state-of-the-art methods.
AAMAS Conference 2024 Conference Paper
Effective communication protocols in multi-agent reinforcement learning (MARL) are critical to fostering cooperation and enhancing team performance. To leverage communication, many previous works have proposed to compress local information into a single message and broadcast it to all reachable agents. This simplistic messaging mechanism, however, may fail to provide adequate, critical, and relevant information to individual agents, especially in severely bandwidth-limited scenarios. This motivates us to develop context-aware communication schemes for MARL, aiming to deliver personalized messages to different agents. Our communication protocol, named CACOM, consists of two stages. In the first stage, agents exchange coarse representations in a broadcast fashion, providing context for the second stage. Following this, agents utilize attention mechanisms in the second stage to selectively generate messages personalized for the receivers. Furthermore, we employ the learned step size quantization (LSQ) technique for message quantization to reduce the communication overhead. To evaluate the effectiveness of CACOM, we integrate it with both actor-critic and value-based MARL algorithms. Empirical results on cooperative benchmark tasks demonstrate that CACOM provides evident performance gains over baselines under communication-constrained scenarios. The code is publicly available at https://github.com/LXXXXR/CACOM.
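The LSQ primitive used for message quantization can be sketched as follows: values are scaled by a learnable step size, clamped to the bit-width's integer range, rounded with a straight-through estimator, and dequantized. This follows the standard LSQ formulation rather than CACOM's exact configuration; the bit-width default is an illustrative choice.

```python
import torch

def round_ste(x):
    # straight-through estimator: round in the forward pass, identity gradient
    return (x.round() - x).detach() + x

def lsq_quantize(v, s, bits=4):
    # Learned Step Size Quantization: s is a learnable scalar step size.
    qn, qp = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    g = 1.0 / (v.numel() * qp) ** 0.5          # LSQ's step-size gradient scale
    s = s * g + (s - s * g).detach()           # same value, gradient scaled by g
    v_bar = round_ste(torch.clamp(v / s, qn, qp))
    return v_bar * s                           # dequantized message
```

Because rounding is bypassed in the backward pass, both the message contents and the step size itself receive gradients during end-to-end training.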
NeurIPS Conference 2024 Conference Paper
In multi-agent reinforcement learning (MARL), parameter sharing is commonly employed to enhance sample efficiency. However, the popular approach of full parameter sharing often leads to homogeneous policies among agents, potentially limiting the performance benefits that could be derived from policy diversity. To address this critical limitation, we introduce \emph{Kaleidoscope}, a novel adaptive partial parameter sharing scheme that fosters policy heterogeneity while still maintaining high sample efficiency. Specifically, Kaleidoscope maintains one set of common parameters alongside multiple sets of distinct, learnable masks for different agents, dictating the sharing of parameters. It promotes diversity among policy networks by encouraging discrepancy among these masks, without sacrificing the efficiencies of parameter sharing. This design allows Kaleidoscope to dynamically balance high sample efficiency with a broad policy representational capacity, effectively bridging the gap between full parameter sharing and non-parameter sharing across various environments. We further extend Kaleidoscope to critic ensembles in the context of actor-critic algorithms, which could help improve value estimations. Our empirical evaluations across extensive environments, including multi-agent particle environment, multi-agent MuJoCo and StarCraft multi-agent challenge v2, demonstrate the superior performance of Kaleidoscope compared with existing parameter sharing approaches, showcasing its potential for performance enhancement in MARL. The code is publicly available at \url{https://github.com/LXXXXR/Kaleidoscope}.
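The core idea of shared parameters gated by per-agent learnable masks, with a regularizer that pushes the masks apart, can be sketched in a few lines. The soft sigmoid masks and the pairwise-distance diversity term below are illustrative assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class MaskedSharedLinear(nn.Module):
    # One shared weight matrix plus a distinct learnable mask per agent:
    # a minimal sketch of Kaleidoscope-style partial parameter sharing.
    def __init__(self, n_agents, in_dim, out_dim):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.mask_logits = nn.Parameter(torch.zeros(n_agents, out_dim, in_dim))

    def forward(self, x, agent_id):
        mask = torch.sigmoid(self.mask_logits[agent_id])   # soft, learnable
        return x @ (self.shared * mask).t()

    def mask_discrepancy(self):
        # diversity regularizer: reward large mean pairwise L1 distance
        # between agent masks (minimized jointly with the RL loss)
        m = torch.sigmoid(self.mask_logits).flatten(1)
        return -torch.cdist(m, m, p=1).mean()
```

Adding `mask_discrepancy()` to the training loss pushes agents toward distinct effective networks while all gradients still flow into the single shared weight matrix.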
YNIMG Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Adversarial examples are commonly created by solving a constrained optimization problem, typically using sign-based methods like Fast Gradient Sign Method (FGSM). These attacks can benefit from momentum with a constant parameter, such as Momentum Iterative FGSM (MI-FGSM), to enhance black-box transferability. However, the monotonic time-varying momentum parameter is required to guarantee convergence in theory, creating a theory-practice gap. Additionally, recent work shows that sign-based methods fail to converge to the optimum in several convex settings, exacerbating the issue. To address these concerns, we propose a novel method which incorporates both an innovative adaptive momentum parameter without monotonicity assumptions and an adaptive step-size scheme that replaces the sign operation. Furthermore, we derive a regret upper bound for general convex functions. Experiments on multiple models demonstrate the efficacy of our method in generating adversarial examples with human-imperceptible noise while achieving high attack success rates, indicating its superiority over previous adversarial example generation methods.
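For concreteness, the MI-FGSM baseline that the abstract builds on accumulates the L1-normalized gradient into a momentum buffer with a constant parameter and takes sign steps. This sketch shows that baseline only, not the proposed adaptive-momentum, adaptive-step-size method, whose update rules are not given here:

```python
import torch

def mi_fgsm(model, loss_fn, x, y, eps=8 / 255, steps=10, mu=1.0):
    # Baseline MI-FGSM with constant momentum mu (Dong et al.); the paper's
    # method replaces mu and the sign step with adaptive schemes.
    alpha = eps / steps
    g = torch.zeros_like(x)
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # accumulate the L1-normalized gradient into the momentum buffer
        dims = tuple(range(1, grad.dim()))
        g = mu * g + grad / (grad.abs().sum(dim=dims, keepdim=True) + 1e-12)
        with torch.no_grad():
            x_adv = x_adv + alpha * g.sign()                       # sign step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # L-inf project
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```

The projection keeps the perturbation within the eps ball, which is exactly the constraint of the optimization problem the abstract refers to.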
EAAI Journal 2024 Journal Article
NeurIPS Conference 2024 Conference Paper
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that of SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.
NeurIPS Conference 2024 Conference Paper
Existing open-set learning methods consider only the single-layer labels of objects and strictly assume no overlap between the training and testing sets, leading to contradictory optimization for superposed categories. In this paper, we introduce a more practical Semi-Open Environment setting for open-set 3D object retrieval with hierarchical labels, in which the training and testing set share a partial label space for coarse categories but are completely disjoint from fine categories. We propose the Hypergraph-Based Hierarchical Equilibrium Representation (HERT) framework for this task. Specifically, we propose the Hierarchical Retrace Embedding (HRE) module to overcome the global disequilibrium of unseen categories by fully leveraging the multi-level category information. Besides, to tackle the feature overlap and class confusion problem, we apply the Structured Equilibrium Tuning (SET) module to utilize more equilibrial correlations among objects and generalize to unseen categories, by constructing a superposed hypergraph based on the local coherent and global entangled correlations. Furthermore, we generate four semi-open 3DOR datasets with multi-level labels for benchmarking. Results demonstrate that the proposed method can effectively generate the hierarchical embeddings of 3D objects and generalize them towards semi-open environments.
NeurIPS Conference 2024 Conference Paper
World models can foresee the outcomes of different actions, which is of paramount importance for autonomous driving. Nevertheless, existing driving world models still have limitations in generalization to unseen environments, prediction fidelity of critical details, and action controllability for flexible application. In this paper, we present Vista, a generalizable driving world model with high fidelity and versatile controllability. Based on a systematic diagnosis of existing methods, we introduce several key ingredients to address these limitations. To accurately predict real-world dynamics at high resolution, we propose two novel losses to promote the learning of moving instances and structural information. We also devise an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, we incorporate a versatile set of controls from high-level intentions (command, goal point) to low-level maneuvers (trajectory, angle, and speed) through an efficient learning strategy. After large-scale training, the capabilities of Vista can seamlessly generalize to different scenarios. Extensive experiments on multiple datasets show that Vista outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize the capacity of Vista itself to establish a generalizable reward for real-world action evaluation without accessing the ground truth actions.
AAMAS Conference 2023 Conference Paper
Learning communication strategies in cooperative multi-agent reinforcement learning (MARL) has recently attracted intensive attention. Early studies typically assumed a fully-connected communication topology among agents, which induces high communication costs and may not be feasible. Some recent works have developed adaptive communication strategies to reduce communication overhead, but these methods cannot effectively obtain valuable information from agents that are beyond the communication range. In this paper, we consider a realistic communication model where each agent has a limited communication range, and the communication topology dynamically changes. To facilitate effective agent communication, we propose a novel communication protocol called Adaptively Controlled Two-Hop Communication (AC2C). After an initial local communication round, AC2C employs an adaptive two-hop communication strategy to enable long-range information exchange among agents to boost performance, which is implemented by a communication controller. This controller determines whether each agent should ask for two-hop messages and thus helps to reduce the communication overhead during distributed execution. We evaluate AC2C on three cooperative multi-agent tasks, and the experimental results show that it outperforms relevant baselines with lower communication costs.
ICML Conference 2023 Conference Paper
Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve transformer’s efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods’ capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.
EAAI Journal 2023 Journal Article
IJCAI Conference 2023 Conference Paper
Camouflaged Object Detection (COD) aims to segment objects that blend in with their surroundings. Most existing methods mainly tackle this issue by a single-stage framework, which tends to degrade performance in the face of small objects, low-contrast objects and objects with diverse appearances. In this paper, we propose a novel Progressive Enhancement Network (PENet) for COD by imitating the human visual detection system, which follows a three-stage detection process: locate objects, refine textures and restore boundary. Specifically, our PENet contains three key modules, i.e., the object location module (OLM), the group attention module (GAM) and the context feature restoration module (CFRM). The OLM is designed to position the object globally, the GAM is developed to refine both high-level semantic and low-level texture feature representation, and the CFRM is leveraged to effectively aggregate multi-level features for progressively restoring the clear boundary. Extensive results demonstrate that our PENet significantly outperforms 32 state-of-the-art methods on four widely used benchmark datasets.
AAAI Conference 2023 Conference Paper
Whole-slide images (WSI) in computational pathology have high resolution with gigapixel size, but are generally with sparse regions of interest, which leads to weak diagnostic relevance and data inefficiency for each area in the slide. Most of the existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification. The limitation is evident in the application stage as the heavy computation for extracting patch-level features is inevitable. In this paper, we develop RLogist, a benchmarking deep reinforcement learning (DRL) method for fast observation strategy on WSIs. Imitating the diagnostic logic of human pathologists, our RL agent learns how to find regions of observation value and obtain representative features across multiple resolution levels, without having to analyze each part of the WSI at the high magnification. We benchmark our method on two whole-slide level classification tasks, including detection of metastases in WSIs of lymph node sections, and subtyping of lung cancer. Experimental results demonstrate that RLogist achieves competitive classification performance compared to typical multiple instance learning algorithms, while having a significantly short observation path. In addition, the observation path given by RLogist provides good decision-making interpretability, and its ability of reading path navigation can potentially be used by pathologists for educational/assistive purposes. Our code is available at: https://github.com/tencent-ailab/RLogist.
AAAI Conference 2023 Conference Paper
Recent work has demonstrated that pretrained transformers are overconfident in text classification tasks, which can be calibrated by the famous post-hoc calibration method temperature scaling (TS). Character or word spelling mistakes are frequently encountered in real applications and greatly threaten transformer model safety. Research on calibration under noisy settings is rare, and we focus on this direction. Based on a toy experiment, we discover that TS performs poorly when the datasets are perturbed by slight noise, such as swapping the characters, which results in distribution shift. We further utilize two metrics, predictive uncertainty and maximum mean discrepancy (MMD), to measure the distribution shift between clean and noisy datasets, based on which we propose a simple yet effective transferable TS method for calibrating models dynamically. To evaluate the performance of the proposed methods under noisy settings, we construct a benchmark consisting of four noise types and five shift intensities based on the QNLI, AG-News, and Emotion tasks. Experimental results on the noisy benchmark show that (1) the metrics are effective in measuring distribution shift and (2) transferable TS can significantly decrease the expected calibration error (ECE) compared with the competitive baseline ensemble TS by approximately 46.09%.
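The TS baseline the abstract starts from is simple to state: learn a single scalar temperature T on held-out logits by minimizing NLL, then divide all logits by T at inference. A minimal sketch (the log-parameterization keeping T positive and the optimizer settings are illustrative choices, not from the paper):

```python
import torch

def fit_temperature(logits, labels, lr=0.05, steps=200):
    # Standard post-hoc temperature scaling: one scalar T > 0 learned by
    # minimizing NLL on a held-out split; class predictions are unchanged.
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()
```

The transferable variant proposed in the paper adjusts this procedure using the measured distribution shift; plain TS as above is what breaks down when the evaluation data is perturbed by noise.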
YNICL Journal 2022 Journal Article
NeurIPS Conference 2022 Conference Paper
Federated learning (FL) strives to enable collaborative training of machine learning models without centrally collecting clients' private data. Different from centralized training, the local datasets across clients in FL are non-independent and identically distributed (non-IID). In addition, the data-owning clients may drop out of the training process arbitrarily. These characteristics will significantly degrade the training performance. This paper proposes a Dropout-Resilient Secure Federated Learning (DReS-FL) framework based on Lagrange coded computing (LCC) to tackle both the non-IID and dropout problems. The key idea is to utilize Lagrange coding to secretly share the private datasets among clients so that each client receives an encoded version of the global dataset, and the local gradient computation over this dataset is unbiased. To correctly decode the gradient at the server, the gradient function has to be a polynomial in a finite field, and thus we construct polynomial integer neural networks (PINNs) to enable our framework. Theoretical analysis shows that DReS-FL is resilient to client dropouts and provides privacy protection for the local datasets. Furthermore, we experimentally demonstrate that DReS-FL consistently leads to significant performance gains over baseline methods.
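The Lagrange coding primitive that DReS-FL builds on interpolates a polynomial through the data shards over a finite field and hands each client one evaluation of it. A minimal scalar sketch (real LCC shares matrix-valued shards; the field size and point choices below are illustrative):

```python
P = 2_147_483_647  # Mersenne prime 2**31 - 1, an illustrative field size

def lagrange_encode(shards, betas, alphas, p=P):
    # Interpolate the degree-(K-1) polynomial u with u(beta_j) = shard_j
    # over GF(p), then return its evaluations at the points alphas.
    def coeff(z, j):
        num = den = 1
        for l, b in enumerate(betas):
            if l != j:
                num = num * (z - b) % p
                den = den * (betas[j] - b) % p
        return num * pow(den, -1, p) % p   # modular inverse (Python 3.8+)
    return [sum(x * coeff(a, j) for j, x in enumerate(shards)) % p
            for a in alphas]
```

Any K evaluations suffice to re-interpolate the polynomial, which is what makes the scheme resilient to client dropouts, and fewer than K reveal nothing beyond the field structure when randomness shards are appended.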
NeurIPS Conference 2022 Conference Paper
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition. We build our method on Transformers for its efficacy. Although we have witnessed great progress for video action recognition in the past decade, it remains challenging yet valuable how to train a single model that can perform well across multiple datasets. Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss, aiming to learn robust representations for action recognition. In particular, the informative loss maximizes the expressiveness of the feature embedding while the projection loss for each dataset mines the intrinsic relations between classes across datasets. We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2 datasets. Extensive experimental results show that our method can consistently improve state-of-the-art performance. Code and models are released.
NeurIPS Conference 2022 Conference Paper
Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods highly rely on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. However, in real scenarios, this assumption can be difficult to hold: the text description, obtained by crawling the affiliated metadata of the image, often suffers from the semantic mismatch and the mutual compatibility. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality, and aligns visual elements and linguistic elements in the form of hierarchy via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss of negative samples (unpaired samples) so as to weaken the strict constraint during the pre-training stage, thus mitigating the risk of forcing the model to distinguish compatible negative pairs. Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same amount of 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP on ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 based image encoders, respectively. When scaling to larger datasets, PyramidCLIP achieves the state-of-the-art results on several downstream tasks. In particular, the results of PyramidCLIP-ResNet50 trained on 143M image-text pairs surpass that of CLIP using 400M data on ImageNet zero-shot classification task, significantly improving the data efficiency of CLIP.
NeurIPS Conference 2022 Conference Paper
Weakly-supervised whole-slide image (WSI) classification (WSWC) is a challenging task where a large number of unlabeled patches (instances) exist within each WSI (bag) while only a slide label is given. Despite recent progress for the multiple instance learning (MIL)-based WSI analysis, the major limitation is that it usually focuses on the easy-to-distinguish diagnosis-positive regions while ignoring positives that occupy a small ratio in the entire WSI. To obtain more discriminative features, we propose a novel weakly-supervised classification method based on cross-slide contrastive learning (called SCL-WC), which depends on task-agnostic self-supervised feature pre-extraction and task-specific weakly-supervised feature refinement and aggregation for WSI-level prediction. To enable both intra-WSI and inter-WSI information interaction, we propose a positive-negative-aware module (PNM) and a weakly-supervised cross-slide contrastive learning (WSCL) module, respectively. The WSCL aims to pull WSIs with the same disease types closer and push different WSIs away. The PNM aims to facilitate the separation of tumor-like patches and normal ones within each WSI. Extensive experiments demonstrate state-of-the-art performance of our method in three different classification tasks (e.g., over 2% of AUC in Camelyon16, 5% of F1 score in BRACS, and 3% of AUC in DiagSet). Our method also shows superior flexibility and scalability in weakly-supervised localization and semi-supervised classification experiments (e.g., first place in the BRIGHT challenge). Our code will be available at https://github.com/Xiyue-Wang/SCL-WC.
NeurIPS Conference 2022 Conference Paper
Cross-modal retrieval between videos and texts has gained increasing interest because of the rapid emergence of videos on the web. Generally, a video contains rich instance and event information and the query text only describes a part of the information. Thus, a video can have multiple different text descriptions and queries. We call it the Video-Text Correspondence Ambiguity problem. Current techniques mostly concentrate on mining local or multi-level alignment between contents of video and text (e.g., object to entity and action to verb). It is difficult for these methods to alleviate video-text correspondence ambiguity by describing a video using only one feature, which is required to be matched with multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching Model. It automatically captures multiple prototypes to describe a video by adaptive aggregation on video token features. Given a query text, the similarity is determined by the most similar prototype to find correspondence in the video, which is called text-adaptive matching. To learn diverse prototypes for representing the rich information in videos, we propose a variance loss to encourage different prototypes to attend to different contents of the video. Our method outperforms the state-of-the-art methods on four public video retrieval datasets.
JBHI Journal 2022 Journal Article
Functional near-infrared spectroscopy (fNIRS) is a promising neuroimaging technology. The fNIRS classification problem has always been the focus of the brain-computer interface (BCI). Inspired by the success of Transformer based on self-attention mechanism in the fields of natural language processing and computer vision, we propose an fNIRS classification network based on Transformer, named fNIRS-T. We explore the spatial-level and channel-level representation of fNIRS signals to improve data utilization and network representation capacity. Besides, a preprocessing module, which consists of one-dimensional average pooling and layer normalization, is designed to replace filtering and baseline correction of data preprocessing. It makes fNIRS-T an end-to-end network, called fNIRS-PreT. Compared with traditional machine learning classifiers, convolutional neural network (CNN), and long short-term memory (LSTM), the proposed models obtain the best accuracy on three open-access datasets. Specifically, in the most extensive ternary classification task (30 subjects) that includes three types of overt movements, fNIRS-T, CNN, and LSTM obtain 75.49%, 72.89%, and 61.94% on test sets, respectively. Compared to traditional classifiers, fNIRS-T is at least 27.41% higher than statistical features and 6.79% higher than well-designed features. In the individual subject experiment of the ternary classification task, fNIRS-T achieves an average subject accuracy of 78.22% and surpasses CNN and LSTM by a large margin of +4.75% and +11.33%. fNIRS-PreT using raw data also achieves competitive performance to fNIRS-T. Therefore, the proposed models improve the performance of fNIRS-based BCI significantly.
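The preprocessing module that turns fNIRS-T into the end-to-end fNIRS-PreT is just one-dimensional average pooling followed by layer normalization over the pooled time axis. A minimal sketch, where the pooling factor is an illustrative choice rather than the paper's setting:

```python
import torch
import torch.nn as nn

class FNIRSPreprocess(nn.Module):
    # Replaces filtering and baseline correction with learnable-free pooling
    # plus layer normalization, so raw signals can be fed directly.
    def __init__(self, seq_len, pool=4):
        super().__init__()
        self.pool = nn.AvgPool1d(pool)
        self.norm = nn.LayerNorm(seq_len // pool)

    def forward(self, x):              # x: (batch, channels, time)
        return self.norm(self.pool(x))
```

Average pooling smooths high-frequency noise (standing in for the filter) while layer normalization removes the per-channel offset (standing in for baseline correction).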
IJCAI Conference 2021 Conference Paper
The presence of haze significantly reduces the quality of images. Researchers have designed a variety of algorithms for image dehazing (ID) to restore the quality of hazy images. However, few studies summarize deep learning (DL) based dehazing technologies. In this paper, we conduct a comprehensive survey of recently proposed dehazing methods. First, we summarize the commonly used datasets, loss functions, and evaluation metrics. Second, we group existing ID research into two major categories: supervised ID and unsupervised ID. The core ideas of various influential dehazing models are introduced. Finally, open issues for future research on ID are pointed out.
YNIMG Journal 2021 Journal Article
AAAI Conference 2021 Conference Paper
The immunohistochemistry (IHC) test of biopsy tissue is crucial for developing targeted treatment and evaluating prognosis for cancer patients. The IHC staining slide is usually digitized into a whole-slide image (WSI) with gigapixels for quantitative image analysis. To perform whole-image prediction (e.g., IHC scoring, survival prediction, and cancer grading) from this kind of high-dimensional image, algorithms are often developed based on the multi-instance learning (MIL) framework. However, the multi-scale information of WSIs and the associations among instances are not well explored in existing MIL-based studies. Inspired by the fact that pathologists jointly analyze visual fields at multiple powers of objective for diagnostic predictions, we propose a Pathologist-Tree Network (PTree-Net) to sparsely and efficiently model the WSI in a multi-scale manner. Specifically, we propose a Focal-Aware Module (FAM) that can approximately estimate diagnosis-related regions with an extractor trained using the thumbnail of the WSI. With the initial diagnosis-related regions, we hierarchically model the multi-scale patches in a tree structure, where both global and local information can be captured. To explore this tree structure in an end-to-end network, we propose a patch Relevance-enhanced Graph Convolutional Network (RGCN) to explicitly model the correlations of adjacent parent-child nodes, accompanied by patch relevance to exploit the implicit contextual information among distant nodes. In addition, tree-based self-supervision is devised to improve representation learning and adaptively suppress irrelevant instances. Extensive experiments are performed on a large-scale IHC HER2 dataset. The ablation study confirms the effectiveness of our design, and our approach outperforms the state of the art by a large margin.
AAAI Conference 2021 Conference Paper
User modeling is critical for developing personalized services in industry. A common approach to user modeling is to learn user representations that can be distinguished by users' interests or preferences. In this work, we focus on developing a universal user representation model. The obtained universal representations are expected to contain rich information and to be applicable to various downstream applications without further modification (e.g., user preference prediction and user profiling). Accordingly, we are freed from the heavy work of training task-specific models for every downstream task, as required in previous works. Specifically, we propose the Self-supervised User Modeling Network (SUMN) to encode behavior data into the universal representation. It includes two key components. The first is a new learning objective, which guides the model to fully identify and preserve valuable user information under a self-supervised learning framework. The other is a multi-hop aggregation layer, which increases the model's capacity for aggregating diverse behaviors. Extensive experiments on benchmark datasets show that our approach can outperform state-of-the-art unsupervised representation methods, and even compete with supervised ones.
YNIMG Journal 2021 Journal Article
AAAI Conference 2021 Conference Paper
Nucleus instance segmentation and classification in histopathological images is an essential prerequisite in pathology diagnosis/prognosis. However, nucleus annotations (e.g., segmentation and labeling) require domain experts, and annotating nuclei at pixel-level is time-consuming and labor-intensive. Moreover, nuclei from different cancer types vary in shapes and appearances. These inter-cancer variations require careful annotations for specific cancer types. Therefore, to minimize the labeling cost, we propose a novel application that considers each cancer type as an individual domain and apply domain adaptation techniques to improve the segmentation/classification performance among different cancer types. Unlike the previous studies that focus on unsupervised or weakly-supervised domain adaptation independently, we would like to discover what kinds of labeling can achieve the most cost-effective domain adaptation performance in nucleus instance segmentation and classification. Specifically, we propose a unified framework that is applicable to different level annotations: no annotations, image-level, and point-level annotations. Cyclic adaptation with pseudo labels and adversarial discriminator are utilized for unsupervised domain alignment. Image-level or point-level annotations are additionally adopted to supervise the nucleus classification and refine the pseudo labels. Experiments demonstrate the effectiveness and efficacy of the proposed framework (jointly using unsupervised and weakly supervised learning) on adapting the segmentation and classification model from one cancer type to 18 other cancer types.
ICRA Conference 2020 Conference Paper
The static world assumption is standard in most simultaneous localisation and mapping (SLAM) algorithms. Increased deployment of autonomous systems to unstructured dynamic environments is driving a need to identify moving objects and estimate their velocity in real-time. Most existing SLAM based approaches rely on a database of 3D models of objects or impose significant motion constraints. In this paper, we propose a new feature-based, model-free, object-aware dynamic SLAM algorithm that exploits semantic segmentation to allow estimation of motion of rigid objects in a scene without the need to estimate the object poses or have any prior knowledge of their 3D models. The algorithm generates a map of dynamic and static structure and has the ability to extract velocities of rigid moving objects in the scene. Its performance is demonstrated on simulated, synthetic and real-world datasets.
YNIMG Journal 2020 Journal Article
ICRA Conference 2020 Conference Paper
Human-multi-robot collaboration is becoming more and more common in intelligent manufacturing. Optimal assembly scheduling of such systems plays a critical role in their production efficiency. Existing approaches mostly consider humans as agents with assumed or known capabilities, which leads to suboptimal performance in realistic applications where human capabilities usually change. In addition, most robot adaptation focuses on human-single-robot interaction and the adaptation in human-multi-robot interaction with changing human capability still remains challenging due to the complexity of the heterogeneous multi-agent interactions. This paper proposes a real-time adaptive assembly scheduling approach for human-multi-robot collaboration by modeling and incorporating changing human capability. A genetic algorithm is also designed to derive implementable solutions for the formulated adaptive assembly scheduling problem. The proposed approaches are validated through different simulated human-multi-robot assembly tasks and the results demonstrate the effectiveness and advantages of the proposed approaches.
IROS Conference 2020 Conference Paper
The problem of tracking self-motion as well as motion of objects in the scene using information from a camera is known as multi-body visual odometry and is a challenging task. This paper proposes a robust solution to achieve accurate estimation and consistent trackability for dynamic multi-body visual odometry. A compact and effective framework is proposed, leveraging recent advances in semantic instance-level segmentation and accurate optical flow estimation. A novel formulation that jointly optimizes SE(3) motion and optical flow is introduced, improving the quality of the tracked points and the motion estimation accuracy. The proposed approach is evaluated on the virtual KITTI dataset and tested on the real KITTI dataset, demonstrating its applicability to autonomous driving applications. For the benefit of the community, we make the source code public.
YNIMG Journal 2019 Journal Article
JBHI Journal 2018 Journal Article
Most automated techniques for brain disease diagnosis utilize hand-crafted (e.g., voxel-based or region-based) biomarkers from structural magnetic resonance (MR) images as feature representations. However, these hand-crafted features are usually high-dimensional or require regions of interest defined by experts. Also, because of the possible heterogeneity between the hand-crafted features and the subsequent model, existing methods may lead to sub-optimal performance in brain disease diagnosis. In this paper, we propose a landmark-based deep feature learning (LDFL) framework to automatically extract patch-based representations from MRI for automatic diagnosis of Alzheimer's disease. We first identify discriminative anatomical landmarks from MR images in a data-driven manner, and then propose a convolutional neural network for patch-based deep feature learning. We have evaluated the proposed method on subjects from three public datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI-1), ADNI-2, and the Minimal Interval Resonance Imaging in Alzheimer's Disease (MIRIAD) dataset. Experimental results on both brain disease classification and MR image retrieval demonstrate that the proposed LDFL method improves the performance of both tasks.
JBHI Journal 2017 Journal Article
Structural magnetic resonance imaging (MRI) has been proven to be an effective tool for Alzheimer's disease (AD) diagnosis. While conventional MRI-based AD diagnosis typically uses images acquired at a single time point, a longitudinal study is more sensitive in detecting early pathological changes of AD, making it more favorable for accurate diagnosis. In general, two challenges are faced in MRI-based diagnosis. First, extracting features from structural MR images requires time-consuming nonlinear registration and tissue segmentation, and the involvement of more scans in a longitudinal study further exacerbates the computational costs. Second, inconsistent longitudinal scans (i.e., different scanning time points and different total numbers of scans) hinder the extraction of unified feature representations in longitudinal studies. In this paper, we propose a landmark-based feature extraction method for AD diagnosis using longitudinal structural MR images, which does not require nonlinear registration or tissue segmentation in the application stage and is also robust to inconsistencies among longitudinal scans. Specifically, discriminative landmarks are first automatically discovered from the whole brain using training images and then efficiently localized in testing images using a fast landmark detection method, without any nonlinear registration or tissue segmentation; second, high-level statistical spatial features and contextual longitudinal features are extracted based on the detected landmarks, characterizing spatial structural abnormalities and longitudinal landmark variations. Using these spatial and longitudinal features, a linear support vector machine is finally adopted to distinguish AD subjects or mild cognitive impairment (MCI) subjects from healthy controls (HCs). Experimental results on the Alzheimer's Disease Neuroimaging Initiative database demonstrate the superior performance and efficiency of the proposed method, with classification accuracies of 88.30% for AD versus HC and 79.02% for MCI versus HC, respectively.
YNIMG Journal 2016 Journal Article
IROS Conference 2015 Conference Paper
For a robot serving in a complex environment such as a restaurant, it is difficult to perform a task like tabletop object manipulation completely by itself, as some information may be missing. One approach is to use a tele-control system to control the robot or to demonstrate the task. In this paper, a LeapMotion-sensor-based non-contact tele-control method is developed for a robot performing tabletop object manipulation tasks. A coordinate system is established for mapping from the operation space of the LeapMotion sensor to the workspace of the robot. A gesture recognition and action generation algorithm is proposed for controlling the robot or demonstrating motions to it. To evaluate the performance of the LeapMotion sensor and the proposed tele-control method, a comprehensive assessment index based on entropy weighting is proposed. Three common tele-control modes, including demonstration mode, teleoperation mode, and semi-teleoperation mode, are developed on a PR2 robot. The experimental results show that the proposed tele-control system is more appropriate for use in task demonstration.
JBHI Journal 2015 Journal Article
Low energy consumption is crucial for body area networks (BANs). In BAN-enabled ECG monitoring, continuous monitoring requires the sensor nodes to transmit a huge amount of data to the sink node, which leads to excessive energy consumption. To reduce airtime over energy-hungry wireless links, this paper presents an energy-efficient compressed sensing (CS) based approach for on-node ECG compression. First, an algorithm called minimal mutual coherence pursuit is proposed to construct sparse binary measurement matrices, which can encode the ECG signals with superior performance and extremely low complexity. Second, in order to minimize the data rate required for faithful reconstruction, a weighted ℓ1 minimization model is derived by exploiting multi-source prior knowledge in the wavelet domain. Experimental results on the MIT-BIH arrhythmia database reveal that the proposed approach obtains a higher compression ratio than state-of-the-art CS-based methods. Together with its low encoding complexity, our approach can achieve significant energy savings in both the encoding process and wireless transmission.
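A minimal NumPy sketch of the weighted ℓ1 recovery step described above, using a plain ISTA (proximal gradient) solver; the solver choice, function names, and toy inputs are assumptions for illustration and do not reproduce the paper's binary sensing matrices or wavelet-domain weight derivation:

```python
import numpy as np

def weighted_l1_ista(A, y, w, lam=0.1, iters=300):
    """ISTA sketch for  min_x 0.5*||A @ x - y||^2 + lam * sum(w * |x|).

    A: (m, n) sensing matrix, y: (m,) measurements,
    w: (n,) per-coefficient weights encoding prior knowledge.
    """
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)   # 1/L, L = Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - y))     # gradient step on the data-fit term
        t = lam * step * w                     # per-coefficient threshold
        x = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)   # weighted soft-thresholding
    return x
```

Larger weights shrink the corresponding coefficients more aggressively, which is how wavelet-domain priors can steer the reconstruction toward plausible ECG morphology at a lower measurement rate.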
IJCAI Conference 2015 Conference Paper
Although the light field has been recently recognized as helpful in saliency detection, it is not yet comprehensively explored. In this work, we propose a new saliency detection model with light field data. The idea behind the proposed model originates from the following observations. (1) People can distinguish regions at different depth levels by adjusting the focus of their eyes. Similarly, a light field image can generate a set of focal slices focusing at different depth levels, which suggests that a background can be weighted by selecting the corresponding slice. We show that background priors encoded by light field focusness have advantages in eliminating background distraction and enhancing the saliency by weighting the light field contrast. (2) Regions at closer depth ranges tend to be salient, while regions far in the distance mostly belong to the background. We show that foreground objects can be easily separated from similar or cluttered backgrounds by exploiting their light field depth. Extensive evaluations on the recently introduced Light Field Saliency Dataset (LFSD) [Li et al., 2014], including studies of different light field cues and comparisons with Li et al.'s method (the only reported light field saliency detection approach to our knowledge) and 2D/3D state-of-the-art approaches extended with light field depth/focusness information, show that the investigated light field properties are complementary with each other and lead to improvements on 2D/3D models, and that our approach produces superior results in comparison with the state of the art.
AAAI Conference 2014 Conference Paper
JMLR Journal 2009 Journal Article
We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semi-inner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established.
IROS Conference 2006 Conference Paper
Glomerulus extraction is an important step for analyzing kidney-tissue images in a computer-aided diagnosis system for kidney disease. According to the characteristics of these images, this paper proposes a glomerulus extraction method based on a genetic algorithm and the watershed transform. First, a LoG filter is applied to obtain binary images containing less noise by adjusting the parameters of the Gaussian function. After labeling to remove noise, followed by thinning, a genetic algorithm is applied to these preprocessed images to search for the best-fitting curve, which determines the barycenter position of the glomerulus; this barycenter is set as the seed. Second, an image containing the complete object boundary can be obtained through the watershed transform; after a region-growing operation, the glomerulus region can be extracted. With abundant samples, experimental results indicate that our method can extract the glomerulus from kidney-tissue images both accurately and reliably.
NeurIPS Conference 1988 Conference Paper
Heiligenberg (1987) recently proposed a model to explain how sensory maps could enhance resolution through orderly arrangement of broadly tuned receptors. We have extended this model to the general case of polynomial weighting schemes and proved that the response function is also a polynomial of the same order. We further demonstrated that the Hermitian polynomials are eigenfunctions of the system. Finally we suggested a biologically plausible mechanism for sensory representation of external stimuli with resolution far exceeding the inter-receptor separation.
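A small numerical illustration (ours, not the paper's code) of the order-preservation claim above: weighting broadly tuned Gaussian receptors by a degree-1 polynomial of their positions yields a population response that is itself, to numerical precision, a degree-1 polynomial of the stimulus position, with resolution far finer than the unit receptor spacing:

```python
import numpy as np

# Receptor centers spaced 1 apart; tuning width much broader than the spacing.
centers = np.arange(-50, 51, dtype=float)
sigma = 4.0

def population_response(s, weights):
    tuning = np.exp(-0.5 * ((s - centers) / sigma) ** 2)  # broad Gaussian tuning curves
    return float(np.sum(weights * tuning))

# Order-1 polynomial (linear) weighting of receptor position.
weights = centers.copy()
stimuli = np.linspace(-20.0, 20.0, 41)
resp = np.array([population_response(s, weights) for s in stimuli])

# Fit a degree-1 polynomial to the summed response and measure the residual:
# if the claim holds, the residual is negligible even between receptor centers.
coeffs = np.polyfit(stimuli, resp, 1)
max_resid = float(np.max(np.abs(resp - np.polyval(coeffs, stimuli))))
```

With these parameters the residual is vanishingly small relative to the response magnitude, matching the theorem that a polynomial weighting scheme of order n produces an order-n polynomial response.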