EAAI Journal 2026 Journal Article
A text-based hybrid transfer learning model for ternary classification in health misinformation detection
- Jia Luo
- Yang Yang
- Xiaoye Feng
- Didier El Baz
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
JBHI Journal 2026 Journal Article
In molecular representation learning (MRL), tokens (e.g., atoms, motifs, and fingerprints) are the basic elements used to represent molecules. It is common practice to use various tokens to enhance the expressive power of Graph Neural Networks (GNNs) on molecular graphs. Although prior GNN-based methods employing tokens achieve promising performance in drug-drug interaction (DDI) prediction, the influence of the token on the expressiveness of molecular embedding models remains underexplored. To bridge this gap, we provide an axiomatic definition of MRL from a frequency-domain perspective, revealing that a model's performance is closely related to the number of tokens, and derive a theoretical upper bound on the convergence of likelihood-based models. Building on these insights, we propose SimMotifPro, a simple yet efficient motif-based method for DDI prediction. Specifically, SimMotifPro uses a variant of the DeeperGCN encoder and builds a motif-motif knowledge graph to capture motif interconnections. A Motif Ranker module is also introduced to decouple learned representations and differentiate the contributions of selected motifs. Empirically, we demonstrate that SimMotifPro adheres to the properties established by our theoretical upper bound and validate the general applicability of our theory across different methods. Furthermore, our approach achieves state-of-the-art performance on various benchmarks for DDI prediction. Our code and checkpoints are available at https://github.com/siriusong/sim_motif_pro.
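The Motif Ranker described above weights the contributions of selected motifs when forming a molecular embedding. A minimal attention-style sketch of that idea (the function and variable names here are hypothetical, and the paper's actual module is learned end-to-end):

```python
import math

def motif_ranker(motif_embs, query):
    """Score each motif embedding against a query vector, then return
    softmax weights and the weighted molecular embedding.
    Illustrative sketch only; not SimMotifPro's actual implementation."""
    scores = [sum(q * m for q, m in zip(query, emb)) for emb in motif_embs]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(query)
    pooled = [sum(w * emb[i] for w, emb in zip(weights, motif_embs))
              for i in range(dim)]
    return weights, pooled
```

The returned weights make each motif's contribution explicit, which is the decoupling the abstract attributes to the Motif Ranker.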
AAAI Conference 2026 Conference Paper
Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
AAAI Conference 2026 Conference Paper
Fully fine-tuning large pre-trained models for each downstream task is impractical due to prohibitive memory, computation, and storage costs. Although parameter-efficient fine-tuning (PEFT) methods address this issue, leading methods like LoRA still exhibit linear scaling of trainable parameters with hidden size. Recent studies have explored PEFT in the frequency domain to reduce computational costs by employing fast Fourier transform and discrete cosine transform with sparse frequency selection. These methods rely on global frequency representations that lack spatial locality and disperse energy across the domain. As a result, sparse coefficient selection struggles to preserve fine-grained structural information and often introduces artifacts such as ringing near boundaries. To address these limitations, we propose DWTSG, a novel PEFT framework based on discrete wavelet transform (DWT) and subband guidance. DWTSG decomposes pre-trained weights into four wavelet subbands that jointly encode global context and local details. It fine-tunes only the most informative coefficients in each subband through an energy-based selection strategy that prioritizes coefficients based on their individual importance and interactions. Finally, inverse DWT reconstructs the updated weights, enabling efficient and precise adaptation. Extensive experiments on natural language understanding, commonsense reasoning, and image classification demonstrate that DWTSG outperforms existing PEFT methods, achieving superior performance and higher parameter efficiency.
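The subband decomposition and energy-based selection described above can be sketched with a one-level 2D Haar transform, the simplest DWT. This is a toy illustration under that assumption; DWTSG's actual wavelet choice and interaction-aware selection strategy are not reproduced here:

```python
def haar_dwt2(W):
    """One-level 2D Haar DWT of a matrix with even dimensions.
    Returns LL, LH, HL, HH subbands (each half-size).
    Sketch only, assuming a Haar basis."""
    n, m = len(W), len(W[0])
    half = lambda: [[0.0] * (m // 2) for _ in range(n // 2)]
    LL, LH, HL, HH = half(), half(), half(), half()
    for i in range(0, n, 2):
        for j in range(0, m, 2):
            a, b = W[i][j], W[i][j + 1]
            c, d = W[i + 1][j], W[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) / 2.0  # global context
            LH[i // 2][j // 2] = (a - b + c - d) / 2.0  # horizontal detail
            HL[i // 2][j // 2] = (a + b - c - d) / 2.0  # vertical detail
            HH[i // 2][j // 2] = (a - b - c + d) / 2.0  # diagonal detail
    return LL, LH, HL, HH

def top_energy_coords(band, k):
    """Indices of the k largest-magnitude coefficients in a subband,
    a simple stand-in for DWTSG's energy-based selection."""
    flat = [(abs(v), i, j) for i, row in enumerate(band) for j, v in enumerate(row)]
    flat.sort(reverse=True)
    return [(i, j) for _, i, j in flat[:k]]
```

Only the selected coefficients would be fine-tuned; an inverse DWT then reconstructs the updated weight matrix.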
AAAI Conference 2026 Conference Paper
To protect clients' right to be forgotten in federated learning, federated unlearning aims to remove the data contribution of leaving clients from the global learned model. While current studies mainly focused on enhancing unlearning efficiency and effectiveness, the crucial aspects of efficiency fairness and performance fairness among decentralized clients during unlearning have remained largely unexplored. In this study, we introduce FedShard, the first federated unlearning algorithm designed to concurrently guarantee both efficiency fairness and performance fairness. FedShard adaptively addresses the challenges introduced by dilemmas among convergence, unlearning efficiency, and unlearning fairness. Furthermore, we propose two novel metrics to quantitatively assess the fairness of unlearning algorithms, which we prove to satisfy well-known properties in other existing fairness measurements. Our theoretical analysis and numerical evaluation validate FedShard's fairness in terms of both unlearning performance and efficiency. We demonstrate that FedShard mitigates unfairness risks such as cascaded leaving and poisoning attacks and realizes more balanced unlearning costs among clients. Experimental results indicate that FedShard accelerates the data unlearning process 1.3-6.2 times faster than retraining from scratch and 4.9 times faster than the state-of-the-art exact unlearning methods.
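The abstract above does not spell out FedShard's two fairness metrics, but the notion of balanced unlearning costs can be illustrated with Jain's fairness index, a standard fairness measure over per-client costs (an assumption for illustration, not the paper's own metric):

```python
def jain_index(costs):
    """Jain's fairness index over per-client unlearning costs:
    equals 1.0 when all clients bear equal cost and approaches 1/n
    when a single client bears all of it.
    Illustrative stand-in for FedShard's fairness metrics."""
    n = len(costs)
    s, sq = sum(costs), sum(c * c for c in costs)
    return (s * s) / (n * sq) if sq else 1.0
```

A fairness-aware unlearning scheme would aim to keep this index close to 1 while still meeting the efficiency targets the abstract reports.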
AAAI Conference 2026 Conference Paper
Precise modeling of lane topology is essential for autonomous driving, as it directly impacts navigation and control decisions. Existing methods typically represent each lane with a single query and infer topological connectivity based on the similarity between lane queries. However, this kind of design struggles to accurately model complex lane structures, leading to unreliable topology prediction. In this view, we propose a Fine-Grained lane topology reasoning framework (TopoFG). It divides the procedure from bird’s-eye-view (BEV) features to topology prediction via fine-grained queries into three phases, i.e., Hierarchical Prior Extractor (HPE), Region-Focused Decoder (RFD), and Robust Boundary-Point Topology Reasoning (RBTR). Specifically, HPE extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences to guide subsequent fine-grained query modeling. RFD constructs fine-grained queries by integrating the spatial and sequential priors. It then samples reference points in RoI regions of the mask and applies cross-attention with BEV features to refine the query representations of each lane. RBTR models lane connectivity based on boundary-point query features and further employs a topological denoising strategy to reduce matching ambiguity. By integrating spatial and sequential priors into fine-grained queries and applying a denoising strategy to boundary-point topology reasoning, our method precisely models complex lane structures and delivers trustworthy topology predictions. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoFG achieves new state-of-the-art performance, with an OLS of 48.0% on subset_A and 45.4% on subset_B.
AAAI Conference 2026 Conference Paper
Conversational Recommender Systems (CRS) aim to provide personalized recommendations by interacting with users through natural language dialogue. However, in scenarios requiring deep geospatial awareness, existing methods, including those based on Large Language Models (LLMs), still face significant challenges in effectively fusing heterogeneous, multimodal geographic information with dynamic dialogue context. Simple fusion strategies struggle to resolve the asymmetric dependencies between dynamic user intent and static geographic context and fail to bridge the semantic gap between LLMs and structured geospatial data. To address these issues, we propose a framework for geography-aware CRS, named GeoCRS. Our core idea is to empower a frozen LLM with powerful geospatial reasoning capabilities by conditioning it on a dynamic, multimodal guidance signal generated by an external fusion architecture, all without altering the LLM's internal parameters. Specifically, we first design a hierarchical geographical encoder to uniformly represent heterogeneous geographic data. Subsequently, we introduce a contextual feature modulation module that asymmetrically injects the geographic context into the user's dialogue intent via a novel modulation mechanism to improve conversational recommendation via both geographic and dialogue context. Extensive experiments on public benchmark datasets demonstrate that our proposed GeoCRS significantly outperforms state-of-the-art baselines on the geography-aware conversational recommendation task.
AAAI Conference 2026 Conference Paper
Multimodal data significantly improves the performance of pretrained models, but its practical application is often limited by missing or incomplete modalities. Existing methods for synthesizing missing data face two key challenges: (1) semantic inaccuracies due to model hallucinations and (2) discrepancies in distribution preferences between generated and original data. To address these challenges, we propose a novel three-stage multimodal data augmentation framework (GFR), which Generates, Filters, and Ranks missing-modality data. Our framework leverages multimodal large models for diverse data generation, designs a scene graph matching-based filtering algorithm to ensure semantic consistency, and constructs a preference-aware ranking model to align the generated data with both the original distribution and task relevance. Our framework not only enhances semantic diversity and consistency in data generation but also effectively captures the implicit characteristics of the original dataset and the target model. We demonstrate the effectiveness of GFR across multiple datasets by testing different missing types and missing ratios.
AAAI Conference 2026 Conference Paper
Cross-language code clone detection, which identifies functionally similar code across programming languages, is critical for ensuring synchronized evolution and reducing maintenance costs in multi-platform software development. While zero-shot approaches have emerged as a practical solution to data scarcity, state-of-the-art methods still face two major limitations: an insufficiency in learning language-agnostic representations and information loss during the processing of long code. To address these challenges, we propose LC3, a novel framework for robust zero-shot cross-language code clone detection. To overcome the language-agnostic representation insufficiency, LC3 fuses source code with its underlying opcode sequences, leveraging a bimodal architecture and adversarial training to learn a language-agnostic representation. To resolve long-code information loss, LC3 introduces a semantic affinity aggregation strategy. This strategy synthesizes a robust clone score from a complete pairwise similarity matrix computed between segmented code blocks, overcoming the limitations of both simple truncation and aggregation. Extensive experiments show that LC3 significantly outperforms state-of-the-art zero-shot baselines, especially in challenging long-code scenarios.
AAAI Conference 2026 Conference Paper
Accurate interpretation of Notices To Airmen (NOTAMs) is critical for aviation safety, yet their condensed and cryptic language poses significant challenges to both manual and automated processing. Existing automated systems are typically limited to "Shallow Parsing," failing to extract the actionable intelligence needed for operational decisions. We formalize the complete interpretation task as "Deep Parsing," a dual-reasoning challenge requiring both dynamic knowledge grounding (linking the NOTAM to evolving real-world aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). To tackle this challenge, we propose NOTAM-Evolve, a self-evolving framework that enables a Large Language Model (LLM) to autonomously master complex NOTAM interpretation. Leveraging a knowledge graph-enhanced retrieval module for data grounding, the framework introduces a crucial closed-loop learning process where the LLM progressively improves from its own outputs, minimizing the need for extensive human-annotated reasoning traces. In conjunction with this framework, we introduce a new benchmark dataset of 10,000 expert-annotated NOTAMs. Our experiments demonstrate that NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, establishing a new state-of-the-art on the task of structured NOTAM interpretation.
AAAI Conference 2026 Conference Paper
Augmented Reality (AR) and Multimodal Large Language Models (LLMs) are rapidly evolving, providing unprecedented capabilities for human-computer interaction. However, their integration introduces a new attack surface for Social Engineering (SE). In this paper, we systematically investigate for the first time the feasibility of orchestrating AR-driven Social Engineering attacks using Multimodal LLMs, via our proposed SEAR framework, which operates through three key phases: (1) AR-based social context synthesis, which fuses multimodal inputs (visual, auditory, and environmental cues); (2) role-based multimodal RAG (Retrieval-Augmented Generation), which dynamically retrieves and integrates social context; and (3) ReInteract social engineering agents, which execute adaptive multiphase attack strategies through inference interaction loops. To verify SEAR, we conducted an IRB-approved study with 60 participants and built a novel dataset of 180 annotated conversations in different social scenarios (e.g., coffee shops, networking events). Our results show that SEAR is highly effective at eliciting high-risk behaviors (e.g., 93.3% of participants susceptible to email phishing). The framework was particularly effective in building trust, with 85% of targets willing to accept an attacker's call after an interaction. We also identified notable limitations such as authenticity gaps. This work provides a proof-of-concept for AR-LLM driven social engineering attacks and insights for developing defenses against next-generation AR/LLM-based SE threats.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Aligning molecular sequence representations (e.g., SMILES notations) with textual descriptions is critical for applications spanning drug discovery, materials design, and automated chemical literature analysis. Existing methodologies typically treat molecular captioning (molecule-to-text) and text-based molecular design (text-to-molecule) as separate tasks, relying on supervised fine-tuning or contrastive learning pipelines. These approaches face three key limitations: (i) conventional metrics like BLEU prioritize linguistic fluency over chemical accuracy, (ii) training datasets frequently contain chemically ambiguous narratives with incomplete specifications, and (iii) independent optimization of generation directions leads to bidirectional inconsistency. To address these issues, we propose RTMol, a bidirectional alignment framework that unifies molecular captioning and text-to-SMILES generation through self-supervised round-trip learning. The framework introduces novel round-trip evaluation metrics and enables unsupervised training for molecular captioning without requiring paired molecule-text corpora. Experiments demonstrate that RTMol enhances bidirectional alignment performance by up to 47% across various LLMs, establishing an effective paradigm for joint molecule-text understanding and generation.
AAAI Conference 2026 Conference Paper
Reflective imaging causes mirror images and their physical entities to share identical attributes, e.g., color and shape. Current mirror detection (MD) methods primarily rely on designing functional components to establish the correlations and disparities between the images and entities, thereby identifying the mirror regions. However, extended scenes with dynamic content changes are rarely investigated. Therefore, we propose MirrorSAM, designed for MD based on the Segment Anything Model (SAM). Specifically, because mirrors in different positions produce varying reflections and the complex visual space interferes with localization, we design a hierarchical mixture of direction experts (HMDE) in the low-rank space to reduce SAM's bias towards entities and dynamically adjust experts based on the input scene. Observing depth differences between mirrors and adjacent areas, we propose depth token calibration (DTC), which introduces a learnable depth token to generate the depth map and serve as an error correction factor. We further formulate a selective pixel-prototype contrastive (SPPC) loss, selecting partially confusable samples to promote the decoupling of mirror and non-mirror representations. Extensive experiments conducted on four mirror benchmarks and two settings demonstrate that our approach surpasses state-of-the-art methods with few trainable parameters and FLOPs. We further extend to four transparent-surface benchmarks to validate generalization.
AAAI Conference 2026 Conference Paper
Out-of-distribution (OOD) detection is a well-known challenge because deep models often produce overconfident predictions. In this paper, we reveal a key insight: trained classifiers tend to rely on sparse parameter contribution patterns, meaning that only a few dominant parameters drive predictions. This brittleness can be exploited by OOD inputs that anomalously trigger these parameters, resulting in overconfident predictions. To address this issue, we propose a simple yet effective method called Shaping Parameter Contribution Patterns (SPCP), which enhances OOD detection robustness by encouraging the classifier to learn boundary-oriented dense contribution patterns. Specifically, SPCP operates during training by rectifying excessively high parameter contributions based on a dynamically estimated threshold. This mechanism encourages the classifier to rely on a broader set of parameters for decision-making, thereby reducing the risk of overconfident predictions caused by anomalously triggered parameters, while preserving in-distribution (ID) performance. Extensive experiments under various OOD detection setups verify the effectiveness of SPCP.
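The rectification step described above, clipping contributions that exceed a dynamically estimated threshold, can be sketched as follows. The threshold rule (mean plus k standard deviations) is an assumption for illustration; the paper's actual mechanism operates on parameter contributions during training:

```python
import statistics

def rectify_contributions(contribs, k=2.0):
    """Clip contribution magnitudes exceeding a dynamic threshold
    (mean + k * std), spreading reliance over more parameters.
    Sketch of the idea behind SPCP, not the paper's exact rule."""
    mu = statistics.fmean(contribs)
    sd = statistics.pstdev(contribs)
    thr = mu + k * sd
    return [min(c, thr) for c in contribs]
```

After rectification no single parameter can dominate the decision, which is the dense-contribution behavior SPCP aims for.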
AAAI Conference 2026 Conference Paper
Temporal graphs are essential for modeling complex real-world systems, such as social interactions, financial transactions, and recommendation systems, but the high computational cost and model complexity of dynamic graph neural networks (DGNNs) pose significant challenges for practical deployment. Although various pruning and sampling techniques have proven effective in accelerating static GNNs, they fall short in dynamic settings due to temporal dependencies in evolving graph structures. To address these challenges, we propose TrimDG, a general framework that accelerates DGNNs by eliminating both static and runtime redundancies. For static redundancy, we introduce a novel node influence metric, Temporal Personalized PageRank (TPP), to prune less informative nodes, and employ temporal binning to remove redundant events. For runtime redundancy during training, we develop an adaptive sampling strategy guided by graph information bottleneck and further reduce sampling frequency through temporal batch selector and sampling cache. Theoretical analysis supports our design, and experiments on real-world datasets show that TrimDG reduces runtime by an average of 83.49% across diverse DGNN backbones, while maintaining strong predictive performance, demonstrating both its efficiency and generalizability.
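The temporal-binning step for removing redundant events can be sketched on a toy event list. The (src, dst, timestamp) tuple format and keep-latest-per-bin rule are assumptions for illustration; TrimDG's full pipeline also prunes nodes via Temporal Personalized PageRank:

```python
def temporal_binning(events, bin_width):
    """Collapse redundant interaction events: within each time bin,
    keep only the latest event per (src, dst) pair.
    Events are (src, dst, timestamp) tuples. Illustrative sketch."""
    kept = {}
    for src, dst, t in events:
        key = (src, dst, int(t // bin_width))  # which bin this event falls in
        if key not in kept or t > kept[key][2]:
            kept[key] = (src, dst, t)
    return sorted(kept.values(), key=lambda e: e[2])
```

Fewer events per bin means fewer DGNN message-passing updates, which is where the runtime savings come from.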
AAAI Conference 2026 Conference Paper
Multi-View Clustering (MVC) is a pivotal multi-view learning paradigm widely adopted across various fields. Despite recent advances, existing methods primarily focus on enhancing the performance of fused multi-view representation, often neglecting the issue of Representation Degradation (RD) arising from discrepancies in the intrinsic quality of different views. To address the limitations, we propose a novel Granular-ball Fuzzy Split and Attention Fusion (GFSAF) learning, which leverages the nature of granular-ball to extract mutual and complementary representation separately. Meanwhile, the proposed method introduces an attention variant for fused representations to mitigate the RD issue. GFSAF mainly consists of two training stages: Split-Extract Stage and Views-Fusion Stage. Specifically, we design a novel Granular-ball Fuzzy Contrastive Learning to extract mutual representation, and introduce Noise Stripping Loss to reduce the influence of noise for complementary representation. Then, a novel multi-head Cross Views Attention is proposed to employ attention mechanism from multi-view perspectives for comprehensive fused representations. Experimental results on eight databases demonstrate that our GFSAF achieves superior performance compared to several state-of-the-art MVC methods.
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
YNIMG Journal 2025 Journal Article
IJCAI Conference 2025 Conference Paper
API migration is essential for software maintenance due to the rapid evolution of third-party libraries, where API elements may change continuously through updates. There are two main challenges for API migration at the project level, especially across multiple versions: 1) lack of specific library evolution knowledge across multiple versions; 2) difficulty in identifying the chain of changes at the project level. This paper proposes APIMig, a project-level cross-multi-version API migration framework. We first construct an API evolution knowledge graph (KG) to capture changes between adjacent library versions and then derive coherent cross-version API evolution knowledge by KG reasoning. Second, we design a chain exploration algorithm to track the chain of changes and aggregate the affected code segments. Finally, a large language model is employed to complete the API migration, given the API evolution knowledge and the chain of changes. We construct an evolution KG for the Lucene library from version 4.0.0 to 10.1.0 and evaluate our approach on project migration pairs that depend on different major versions. Our framework shows improvements over the baseline in migrating projects across 7 major versions, achieving average increases of 16.52% in CodeBLEU scores and 28.49% in VCEU scores with GPT-4o.
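The chain-exploration idea above, following an API element through adjacent-version change edges, can be sketched on a toy edge map. The dict-based edge structure and the example rename are hypothetical; APIMig reasons over a full knowledge graph:

```python
def resolve_api_chain(evolution_edges, api, start, end, versions):
    """Follow an API element through adjacent-version change edges
    to find its final name. `evolution_edges` maps (version, old_name)
    to the renamed element in the next version; unchanged APIs carry
    over. Toy sketch of APIMig's chain exploration."""
    chain = [api]
    current = api
    for v in versions[versions.index(start):versions.index(end)]:
        current = evolution_edges.get((v, current), current)
        chain.append(current)
    return chain
```

The resulting chain is exactly the cross-version evolution knowledge the abstract says is handed to the LLM alongside the affected code segments.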
JBHI Journal 2025 Journal Article
The precise segmentation of different brain regions and tissues is usually a prerequisite for the detection and diagnosis of various neurological disorders in neuroscience. Considering the abundance of functional and structural dual-modality information in positron emission tomography/magnetic resonance (PET/MR) images, we propose a novel 3D whole-brain segmentation network with a cross-fusion mechanism to obtain 45 brain regions. Specifically, the network processes PET and MR images simultaneously, employing UX-Net and a cross-fusion block for feature extraction and fusion in the encoder. We test our method by comparing it with other deep learning-based methods, including 3DUXNET, SwinUNETR, UNETR, nnFormer, UNet3D, NestedUNet, ResUNet, and VNet. The experimental results demonstrate that the proposed method achieves better segmentation performance in terms of both visual and quantitative evaluation metrics and achieves more precise segmentation in three views while preserving fine details. In particular, the proposed method achieves superior quantitative results, with a Dice coefficient of 85.73% $\pm$ 0.01%, a Jaccard index of 76.68% $\pm$ 0.02%, a sensitivity of 85.00% $\pm$ 0.01%, a precision of 83.26% $\pm$ 0.03%, and a Hausdorff distance (HD) of 4.4885 $\pm$ 14.85%. Moreover, the distribution and correlation of the SUV in the volume of interest (VOI) are also evaluated (PCC > 0.9), indicating consistency with the ground truth and the superiority of the proposed method. In future work, we will apply our whole-brain segmentation method in clinical practice to assist doctors in accurately diagnosing and treating brain diseases.
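The Dice coefficient and Jaccard index reported above are standard overlap metrics for segmentation masks; their definitions can be stated compactly (this is the textbook formulation, not the paper's evaluation code):

```python
def dice_jaccard(pred, truth):
    """Dice and Jaccard for binary masks given as flat 0/1 sequences:
    Dice = 2|A∩B| / (|A| + |B|), Jaccard = |A∩B| / |A∪B|.
    Standard definitions of the metrics reported in the abstract."""
    inter = sum(p & t for p, t in zip(pred, truth))
    a, b = sum(pred), sum(truth)
    union = a + b - inter
    dice = 2.0 * inter / (a + b) if (a + b) else 1.0
    jac = inter / union if union else 1.0
    return dice, jac
```

Dice is always at least as large as Jaccard on the same pair of masks, which is consistent with the 85.73% vs. 76.68% figures above.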
IJCAI Conference 2025 Conference Paper
Multi-modal learning (MML) is frequently hindered by modality imbalance, leading to suboptimal performance in real-world applications. To address this issue, existing approaches primarily focus on rebalancing MML from the perspective of optimization or architecture design. However, almost all existing methods ignore the impact of sample sequences, i.e., an inappropriate training order tends to trigger learning bias in the model, further exacerbating modality imbalance. In this paper, we propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML. Specifically, we first define a multi-perspective measurer to evaluate the balance degree of each sample in terms of correlation and information criteria. Based on this evaluation, we employ a heuristic scheduler grounded in curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples to alleviate the imbalance. Moreover, we propose a learning-based probabilistic sampling method to dynamically update the training sequence in a more fine-grained manner, further improving MML performance. Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) baselines. The code is available at https://github.com/njustkmg/IJCAI25-BSS.
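The balanced-to-imbalanced curriculum schedule described above can be sketched as follows. The per-sample balance scores here stand in for BSS's multi-perspective measurer, which this sketch does not reproduce:

```python
def curriculum_subsets(samples, balance_scores, num_stages):
    """Order samples from most to least modality-balanced and yield
    incrementally growing training subsets, curriculum-style.
    Illustrative sketch of BSS's heuristic scheduler."""
    order = [s for _, s in sorted(zip(balance_scores, samples), reverse=True)]
    n = len(order)
    for stage in range(1, num_stages + 1):
        # each stage sees a larger prefix of the balance-ordered samples
        yield order[: max(1, (n * stage) // num_stages)]
```

Early stages train only on well-balanced samples; the harder, imbalanced ones are introduced gradually, mirroring the CL-based schedule in the abstract.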
IJCAI Conference 2025 Conference Paper
Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques into Transformer-based SNNs and propose the Binary Event-Driven Spiking Transformer, i.e., BESTformer. The proposed BESTformer can significantly reduce storage and computational demands by representing weights and attention maps with a mere 1 bit. However, BESTformer suffers a severe performance drop relative to its full-precision counterpart due to the limited representation capability of binarization. To address this issue, we propose a Coupled Information Enhancement (CIE) method, which consists of a reversible framework and information enhancement distillation. By maximizing the mutual information between the binary model and its full-precision counterpart, the CIE method effectively mitigates the performance degradation of the BESTformer. Extensive experiments on static and neuromorphic datasets demonstrate that our method achieves superior performance to other binary SNNs, showcasing its potential as a compact yet high-performance model for resource-limited edge devices. The repository of this paper is available at https://github.com/CaoHLin/BESTFormer.
NeurIPS Conference 2025 Conference Paper
Harnessing the event-driven characteristic, Spiking Neural Networks (SNNs) present a promising avenue toward energy-efficient Transformer architectures. However, existing Spiking Transformers still suffer significant performance gaps compared to their Artificial Neural Network counterparts. Through comprehensive analysis, we attribute this gap to two factors. First, the binary nature of spike trains limits Spiking Self-attention (SSA)'s capacity to capture negative-negative and positive-negative membrane potential interactions on Queries and Keys. Second, SSA typically omits Softmax functions to avoid energy-intensive multiply-accumulate operations, thereby failing to maintain row-stochasticity constraints on attention scores. To address these issues, we propose a Bipolar Self-attention (BSA) paradigm, effectively modeling multi-polar membrane potential interactions with a fully spike-driven characteristic. Specifically, we demonstrate that ternary matrix multiplication provides a closer approximation to real-valued computation in both distribution and local correlation, enabling clear differentiation between homopolar and heteropolar interactions. Moreover, we propose a shift-based Softmax approximation named Shiftmax, which efficiently achieves low-entropy activation and partly maintains row-stochasticity without non-linear operations, enabling precise attention allocation. Extensive experiments show that BSA achieves substantial performance improvements across various tasks, including image classification, semantic segmentation, and event-based tracking. These results establish its potential as a fundamental building block for energy-efficient Spiking Transformers.
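A shift-based Softmax approximation like the Shiftmax described above can be sketched by replacing exp(x) with powers of two at integer exponents, so hardware can substitute bit shifts for multiply-accumulate operations. This is one plausible reading of the abstract, not the paper's exact formulation:

```python
def shiftmax(scores):
    """Softmax approximation using powers of two at integer exponents,
    so each term is realizable as a bit shift in hardware.
    Illustrative sketch; not the paper's exact Shiftmax."""
    mx = max(scores)
    pow2 = [2.0 ** int(s - mx) for s in scores]  # integer exponents only
    z = sum(pow2)
    return [p / z for p in pow2]
```

Because 2^k grows faster in k than e^x does in x per unit, the resulting distribution is lower-entropy than true softmax, consistent with the low-entropy activation the abstract mentions.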
IJCAI Conference 2025 Conference Paper
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on the Transformer architecture, which not only performs front-end visual feature extraction but also incorporates a back-end multi-modality refinement module that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 25% in low-light conditions, outperforming existing methods. This work is open-source at https://github.com/Anastasiawd/BrightVO.
AAAI Conference 2025 Conference Paper
Video Moment Retrieval (VMR) involves locating specific moments within a video based on natural language queries. However, existing VMR methods that employ various strategies for cross-modal alignment still face challenges such as limited understanding of fine-grained semantics, semantic overlap, and sparse constraints. To address these limitations, we propose a novel Concept Decomposition Transformer (CDTR) model for VMR. CDTR introduces a semantic concept decomposition module that disentangles video moments and sentence queries into concept representations, reflecting the relevance between various concepts and capturing fine-grained semantics which is crucial for cross-modal matching. These decomposed concept representations are then used as pseudo-labels, determined as positive or negative samples by adaptive concept-specific thresholds. Subsequently, fine-grained concept alignment is performed in video intra-modal and textual-visual cross-modal, aligning different conceptual components within features, enhancing the model's ability to distinguish fine-grained semantics, and alleviating issues related to semantic overlap and sparse constraints. Comprehensive experiments demonstrate the effectiveness of the CDTR, outperforming state-of-the-art methods on three widely used datasets: QVHighlights, Charades-STA, and TACoS.
NeurIPS Conference 2025 Conference Paper
The increasing utilization of graph databases across various fields stems from their capacity to represent intricate interconnections. Nonetheless, exploiting the full capabilities of graph databases continues to be a significant hurdle, largely because of the inherent difficulty in translating natural language into Cypher. Recognizing the critical role of schema selection in database query generation and drawing inspiration from recent progress in reasoning-augmented approaches trained through reinforcement learning to enhance inference capabilities and generalization, we introduce Cypher-RI, a specialized framework for the Text-to-Cypher task. Distinct from conventional approaches, our methodology seamlessly integrates schema selection within the Cypher generation pipeline, conceptualizing it as a critical element in the reasoning process. The schema selection mechanism is guided by textual context, with its outcomes recursively shaping subsequent inference processes. Impressively, our 7B-parameter model, trained through this RL paradigm, demonstrates superior performance compared to baselines, exhibiting a 9.41% accuracy improvement over GPT-4o on CypherBench. These results underscore the effectiveness of our proposed reinforcement learning framework, which integrates schema selection to enhance both the accuracy and reasoning capabilities in Text-to-Cypher tasks.
NeurIPS Conference 2025 Conference Paper
The explosive growth in sequence length has intensified the demand for effective and efficient long sequence modeling. Benefiting from intrinsic oscillatory membrane dynamics, Resonate-and-Fire (RF) neurons can efficiently extract frequency components from input signals and encode them into spatiotemporal spike trains, making them well-suited for long sequence modeling. However, RF neurons exhibit limited effective memory capacity and a trade-off between energy efficiency and training speed on complex temporal tasks. Inspired by the dendritic structure of biological neurons, we propose a Dendritic Resonate-and-Fire (D-RF) model, which explicitly incorporates a multi-dendritic and soma architecture. Each dendritic branch encodes specific frequency bands by utilizing the intrinsic oscillatory dynamics of RF neurons, thereby collectively achieving comprehensive frequency representation. Furthermore, we introduce an adaptive threshold mechanism into the soma structure. This mechanism adjusts the firing threshold according to historical spiking activity, thereby reducing redundant spikes while maintaining training efficiency in long-sequence tasks. Extensive experiments demonstrate that our method maintains competitive accuracy while producing substantially sparser spikes, without compromising computational efficiency during training. These results underscore its potential as an effective and efficient solution for long sequence modeling on edge platforms.
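For readers unfamiliar with the neuron model named above: a Resonate-and-Fire neuron keeps a complex-valued membrane state whose phase rotates at an intrinsic frequency, so inputs near that frequency build up resonantly. Below is a minimal discrete-time sketch of generic RF dynamics (our own illustration; the decay constant, hard reset, and thresholding on the imaginary part are common conventions, not the paper's D-RF implementation):

```python
import numpy as np

def rf_neuron(inputs, omega, b=-0.05, dt=1.0, threshold=1.0):
    # One damped-rotation factor per time step: |decay| < 1 makes the
    # membrane decay, the phase omega*dt makes it oscillate.
    decay = np.exp((b + 1j * omega) * dt)
    u = 0j
    spikes = []
    for x in inputs:
        u = decay * u + x                      # resonate and integrate input
        s = 1 if u.imag >= threshold else 0    # fire on imaginary-part crossing
        spikes.append(s)
        if s:
            u = 0j                             # hard reset after a spike
    return np.array(spikes)
```

Driving the neuron with a sinusoid at its resonance frequency `omega` makes the membrane amplitude grow until it crosses threshold, which is the frequency-selective behavior each dendritic branch exploits.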
YNIMG Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
ICML Conference 2025 Conference Paper
Understanding human mental states—such as intentions and desires—is crucial for natural AI-human collaboration. However, this is challenging because human actions occur irregularly over time, and the underlying mental states that drive these actions are unobserved. To tackle this, we propose a novel framework that combines a logic-informed temporal point process (TPP) with amortized variational Expectation-Maximization (EM). Our key innovation is integrating logic rules as priors to guide the TPP’s intensity function, allowing the model to capture the interplay between actions and mental events while reducing dependence on large datasets. To handle the intractability of mental state inference, we introduce a discrete-time renewal process to approximate the posterior. By jointly optimizing model parameters, logic rules, and inference networks, our approach infers entire mental event sequences and adaptively predicts future actions. Experiments on both synthetic and real-world datasets show that our method outperforms existing approaches in accurately inferring mental states and predicting actions, demonstrating its effectiveness in modeling human cognitive processes.
NeurIPS Conference 2025 Conference Paper
Modeling and reconstructing multidimensional physical dynamics from sparse and off-grid observations presents a fundamental challenge in scientific research. Recently, diffusion-based generative modeling shows promising potential for physical simulation. However, current approaches typically operate on on-grid data with preset spatiotemporal resolution, but struggle with the sparsely observed and continuous nature of real-world physical dynamics. To fill the gaps, we present SDIFT, Sequential DIffusion in Functional Tucker space, a novel framework that generates full-field evolution of physical dynamics from irregular sparse observations. SDIFT leverages the functional Tucker model as the latent space representer with proven universal approximation property, and represents sparse observations as latent functions and Tucker core sequences. We then construct a sequential diffusion model with temporally augmented UNet in the functional Tucker space, denoising noise drawn from a Gaussian process to generate the sequence of core tensors. At the posterior sampling stage, we propose a Message-Passing Posterior Sampling mechanism, enabling conditional generation of the entire sequence guided by observations at limited time steps. We validate SDIFT on three physical systems spanning astronomical (supernova explosions, light-year scale), environmental (ocean sound speed fields, kilometer scale), and molecular (organic liquid, millimeter scale) domains, demonstrating significant improvements in both reconstruction accuracy and computational efficiency compared to state-of-the-art approaches.
JBHI Journal 2025 Journal Article
Timely and accurate diagnosis of acute thoracolumbar vertebral compression fractures in X-ray images is critical for initiating prompt and effective treatment, preventing potential neurological damage and long-term disability. Recent advancements in artificial intelligence (AI) have significantly improved medical imaging analysis, providing sophisticated tools to assist clinicians in diagnosing acute thoracolumbar vertebral compression fractures. Nonetheless, detecting these fractures through imaging remains challenging due to the complex overlapping of bony structures in the thoracolumbar region, variability in fracture patterns, and the often subtle nature of these injuries. Additionally, the limited availability and sometimes poor quality of medical images further complicate accurate AI-based detection. Addressing these challenges, this study introduces a transfer learning model optimized for recognizing acute thoracolumbar vertebral compression fractures from a small set of low-quality X-ray images. The model starts with a feature extraction model that analyzes multiple texture features of X-ray images. It then employs a Vision Transformer Detector (ViTDet) combined with a faster region-based convolutional neural network (Faster R-CNN) to recognize fractures efficiently. To enhance its performance on small datasets, the model employs a transfer learning approach for training. Extensive experiments with a large dataset of real-world images have shown that this model can effectively recognize acute thoracolumbar vertebral compression fractures from low-quality images, outperforming professionals with specialized knowledge in some cases.
IJCAI Conference 2025 Conference Paper
Due to the notorious modality imbalance phenomenon, multimodal learning (MML) struggles to achieve satisfactory performance. Recently, multimodal learning with alternating unimodal adaptation (MLA) has been proven effective in mitigating the interference between modalities by capturing interaction through orthogonal projection, thus relieving the modality imbalance phenomenon to some extent. However, the projection strategy orthogonal to the original space can lead to poor plasticity as the alternating learning proceeds, thus affecting model performance. To address this issue, in this paper, we propose a novel multimodal learning method called interactive MML via flat gradient modification (IGM), which employs a flat gradient modification strategy to enhance interactive MML. Specifically, we first employ a flat projection-based gradient modification strategy that is independent of the original space, aiming to avoid the poor plasticity issue. Then we introduce the sharpness-aware minimization (SAM)-based optimization strategy to fully exploit the flatness of the learning objective and further enhance interaction during learning. To this end, the plasticity problem can be avoided and the overall performance is improved. Extensive experiments on widely used datasets demonstrate that IGM outperforms various state-of-the-art (SOTA) baselines, achieving superior performance. The source code is available at https://anonymous.4open.science/r/method-CC45.
AAAI Conference 2025 Conference Paper
Out-of-distribution (OOD) detection is crucial for ensuring the reliable deployment of deep models in real-world scenarios. Recently, from the perspective of over-parameterization, a series of methods leveraging weight sparsification techniques have shown promising performance. These methods typically focus on selecting important parameters for in-distribution (ID) data to reduce the negative impact of redundant parameters on OOD detection. However, we empirically find that these selected parameters may behave overconfidently toward OOD data and hurt OOD detection. To address this issue, we propose a simple yet effective post-hoc method called Instance-aware Test Pruning (ITP), which performs OOD detection by considering both coarse-grained and fine-grained levels of parameter pruning. Specifically, ITP first estimates the class-specific parameter contribution distribution by exploring the ID data. By using the contribution distribution, ITP conducts coarse-grained pruning to eliminate redundant parameters. More importantly, ITP further adopts a fine-grained test pruning process based on the right-tailed Z-score test, which can adaptively remove instance-level overconfident parameters. Finally, ITP derives OOD scores from the pruned model to achieve more reliable predictions. Extensive experiments on widely adopted benchmarks verify the effectiveness of ITP, demonstrating its competitive performance.
ICLR Conference 2025 Conference Paper
Recent advances in Large Language Models (LLMs) have stimulated a significant paradigm shift in evolutionary optimization, where hand-crafted search heuristics are gradually replaced with LLMs serving as intelligent search operators. However, these studies still bear some notable limitations, including a challenge to balance exploitation with exploration, often leading to inferior solution diversity, as well as poor generalizability of problem solving across different task settings. These unsolved issues render the prowess of LLMs in robot design automation largely untapped. In this work, we present LASeR -- Large Language Model-Aided Evolutionary Search for Robot Design Automation. Leveraging a novel reflection mechanism termed DiRect, we elicit more knowledgeable exploratory behaviors from LLMs based on past search trajectories, reshaping the exploration-exploitation tradeoff with dual improvements in optimization efficiency and solution diversity. Additionally, with evolution fully grounded in task-related background information, we unprecedentedly uncover the inter-task reasoning capabilities of LLMs, facilitating generalizable design processes that effectively inspire zero-shot robot proposals for new applications. Our simulated experiments on voxel-based soft robots showcase distinct advantages of LASeR over competitive baselines. Code at https://github.com/WoodySJR/LASeR.
AAAI Conference 2025 Conference Paper
Event cameras encode visual information by generating asynchronous and sparse event streams, which hold great potential for low latency and low power consumption. Despite many successful implementations of event camera-based applications, most of them accumulate the events into frames and then utilize conventional frame-based computer vision algorithms. These frame-based methods, though typically effective, diminish the inherent advantages of the event camera's low latency and low power consumption. To solve the above problems, we propose ASGCN, which efficiently processes data on an event-by-event basis and dynamically evolves into a corresponding dynamic representation, enabling low latency and high sparsity of data representation. The sparsity computation is further improved by introducing brain-inspired spiking neural networks, resulting in low power consumption for ASGCN. Extensive and diverse experiments demonstrate the energy efficiency and low latency advantages of our processing pipeline. Especially on real-world event camera datasets, our pipeline consumes more than 10,000 times less energy and achieves similar performance compared to current frame-based methods.
AAAI Conference 2025 Conference Paper
Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process is globally operated on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling specific local region according to user-defined image conditions, while the remaining regions are only conditioned by the original text prompt. However, it is non-trivial to achieve it. The naive manner of directly adding local conditions may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose Regional Discriminate Loss to update the noised latents, aiming at enhanced object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores which lack the strongest response to enhance object distinction and reduce duplication. Lastly, we adopt Feature Mask Constraint to reduce quality degradation in images caused by information differences across the local control region. All proposed strategies are operated at the inference stage. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions.
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
Current multimodal sentiment analysis (MSA) and emotion recognition in conversations (ERC) methods based on pre-trained language models exhibit two primary limitations: 1) Once trained for MSA and ERC tasks, these pre-trained language models lose their original generalized capabilities. 2) They demand considerable computational resources. As the size of pre-trained language models continues to grow, training larger multimodal sentiment analysis models using previous approaches could result in unnecessary computational cost. In response to this challenge, we propose Multimodal Sentiment Analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin. This plugin enables a large language model (LLM) to carry out MSA or ERC tasks with minimal computational overhead (introducing only approximately 2.6M to 2.8M trainable parameters on top of the 6/7B models), while preserving the intrinsic capabilities of the LLM. In the MSE-Adapter, the Text-Guide-Mixer (TGM) module is introduced to establish explicit connections between non-textual and textual modalities through the Hadamard product. This allows non-textual modalities to better align with textual modalities at the feature level, promoting the generation of higher-quality pseudo tokens. Extensive experiments were conducted on four public English and Chinese datasets using consumer-grade GPUs and open-source LLMs (Qwen-1.8B, ChatGLM3-6B-base, and LLaMA2-7B) as the backbone. The results demonstrate the effectiveness of the proposed plugin.
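The Hadamard-product coupling described for the Text-Guide-Mixer can be illustrated with a minimal sketch (the projection matrix `W` and the feature shapes are our assumptions for illustration, not the released MSE-Adapter code):

```python
import numpy as np

def text_guide_mix(text_feat, nontext_feat, W):
    # Project the non-textual features into the text feature space,
    # then couple them to the text via a Hadamard (element-wise) product,
    # so each text dimension gates the corresponding non-textual signal.
    projected = nontext_feat @ W
    return text_feat * projected
```

The element-wise product gives an explicit, parameter-free interaction between the two streams once their dimensions agree, which is what lets the non-textual features align with the text at the feature level.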
IJCAI Conference 2025 Conference Paper
Accurate prediction of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties is crucial in drug development, as these properties directly impact a drug's efficacy and safety. However, existing multi-task learning models often face challenges related to noise interference and task conflicts when dealing with complex molecular structures. To address these issues, we propose a novel multi-task Graph Neural Network (GNN) model, MTGIB-UNet. The model begins by encoding molecular graphs to capture intricate molecular structure information. Subsequently, based on the Graph Information Bottleneck (GIB) principle, the model compresses the information flow by extracting subgraphs, retaining task-relevant features while removing noise for each task. These embeddings are then fused through a gated network that dynamically adjusts the contribution weights of auxiliary tasks to the primary task. Specifically, an uncertainty weighting (UW) strategy is applied, with additional emphasis placed on the primary task, allowing dynamic adjustment of task weights while strengthening the influence of the primary task on model training. Experiments on standard ADMET datasets demonstrate that our model outperforms existing methods. Additionally, the model shows good interpretability by identifying key molecular substructures related to specific ADMET endpoints.
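The uncertainty-weighting (UW) strategy referenced above is commonly implemented by scaling each task loss with a learned log-variance; the sketch below adds an assumed `primary_weight` knob for the extra primary-task emphasis (our assumption, not the paper's exact scheme):

```python
import numpy as np

def uw_loss(task_losses, log_vars, primary=0, primary_weight=2.0):
    # Homoscedastic uncertainty weighting: each task loss is scaled by
    # exp(-s) and regularized by +s, where s is a learned log-variance.
    # `primary_weight` up-weights the primary task's term.
    total = 0.0
    for i, (loss, s) in enumerate(zip(task_losses, log_vars)):
        term = np.exp(-s) * loss + s
        total += primary_weight * term if i == primary else term
    return total
```

Because `s` is learned, noisy auxiliary tasks drive their own `exp(-s)` weight down automatically, while the fixed multiplier keeps the primary task dominant.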
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
ECAI Conference 2025 Conference Paper
Large-scale Vision-Language Models (VLMs) have demonstrated impressive zero-shot performance in sample-level downstream tasks (e.g., image classification), driven by their powerful generalization ability. However, they still struggle in instance-level tasks, e.g., zero-shot Referring Expression Comprehension (REC), which requires precisely locating the target instance in an image based on a provided text caption. To address this issue, we propose Multimodal Semantic Decoupled Prompting (MSDP), a simple yet effective prompt engineering approach that contains both textual- and visual-focused instance-level understanding prompting. Specifically, we first propose a novel textual restructure strategy to eliminate the impact of task-irrelevant semantic information, steering the model’s attention at the textual understanding level. Meanwhile, we design a united visual prompt at the visual understanding level that maximally activates the instance-level understanding capabilities of VLMs. Experiments on several benchmarks reveal that the proposed approach outperforms state-of-the-art (SOTA) methods. The code is available at repository.
AAAI Conference 2025 Conference Paper
Anti-fraud machine learning systems are perpetually confronted with the significant challenge of concept drift, driven by the continuous and intense evolution of fraudulent techniques. That is, outdated models trained on historical fraudulent behaviors often fall short in addressing the evolving tactics of malicious users over time. The key issue lies in effectively tackling the rapid and significant evolution of fraudsters' behaviors to detect these emerging and unforeseen anomalies. In this paper, we propose a solution by directly accessing real-time data and introducing a lightweight plug-in approach named TRE (Test-time Retrieval-based Representation Enrichment). Considering the similarity among samples, TRE employs a retriever to efficiently identify the top-K most relevant recent samples and implements an aggregation strategy to provide neighboring embeddings to the predictor. It thus adjusts the trained classifiers during the test time, providing them with the information from the latest unlabeled data. Extensive experiments on three large-scale real-world datasets demonstrate the superiority of TRE. By consistently incorporating information from the nearest neighbors, TRE demonstrates high adaptability and surpasses existing methods in performance.
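The retrieve-then-aggregate step of TRE as described, finding the top-K most relevant recent samples and handing an aggregated neighbor embedding to the predictor, can be sketched as follows (cosine retrieval, mean aggregation, and concatenation are our illustrative assumptions about the aggregation strategy):

```python
import numpy as np

def enrich(query_emb, recent_embs, k=3):
    # Cosine similarity between the test sample and recent unlabeled samples.
    q = query_emb / np.linalg.norm(query_emb)
    m = recent_embs / np.linalg.norm(recent_embs, axis=1, keepdims=True)
    sims = m @ q
    topk = np.argsort(sims)[-k:]                   # indices of the K nearest
    neighbor_emb = recent_embs[topk].mean(axis=0)  # simple mean aggregation
    # The enriched representation carries fresh drift information to the
    # frozen classifier at test time.
    return np.concatenate([query_emb, neighbor_emb])
```

Because the memory of recent embeddings is refreshed continuously, the classifier sees up-to-date context without any retraining, which is the plug-in property the abstract emphasizes.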
JBHI Journal 2025 Journal Article
Orthognathic surgery is applied to restore an esthetic facial profile and functional occlusion for patients with dentofacial deformity. Virtual surgical planning (VSP) is indispensable for precise and individualized treatment. Manually designing osteotomy planes is time-consuming and highly experience-dependent. This study aimed to develop and validate an automatic osteotomy plane design method based on deep learning. Methods: A deep learning model, Ortho-OPD (orthognathic osteotomy planes designer), was proposed, consisting of a segmentation network and the random sample consensus (RANSAC) algorithm. The segmentation network, based on a convolutional neural network (CNN), segments the craniomaxillofacial (CMF) CT data. Osteotomy planes were then defined by the RANSAC algorithm. Ortho-OPD was trained on 71 samples and tested on 31 cases. The performance was evaluated quantitatively and qualitatively. Results: Ortho-OPD functioned smoothly, and all cases were successfully performed. The 3D boundary-sensitive loss was employed to optimize precision. Evaluation metrics included accuracy and clinical efficiency. The mean dice similarity coefficient (DSC) was 0.920 ± 0.032 in CMF segmentation. Ortho-OPD showcased excellent productivity, taking an average of about 9 seconds to complete virtual bimaxillary osteotomy compared to manual work. The angular errors between the predicted planes and ground-truth planes, plus the shortest distance from the neural tube or the adjacent apical points to the predicted planes, were examined, indicating no significant difference and reliability in preserving vital anatomical structures. Overall, automatic osteotomy plane design from raw CT data was realized using Ortho-OPD, composed of a CNN and RANSAC, providing an efficient and ideal alternative in orthognathic osteotomy planning.
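RANSAC plane fitting, the second stage named above, is a standard robust estimator: repeatedly fit a plane to three random points and keep the hypothesis with the most inliers. A minimal self-contained sketch (iteration count and inlier tolerance are arbitrary choices for illustration, not Ortho-OPD's settings):

```python
import numpy as np

def ransac_plane(points, iters=200, tol=0.02, seed=0):
    # Fit a plane n·x + d = 0 to 3-D points, robust to outliers.
    rng = np.random.default_rng(seed)
    best_score, best = -1, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                 # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n @ p0
        inliers = np.abs(points @ n + d) < tol   # point-to-plane distances
        if inliers.sum() > best_score:
            best_score, best = inliers.sum(), (n, d)
    return best
```

Running this on boundary points extracted from the segmented bone surface yields a plane parameterization that can serve as an osteotomy plane candidate.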
NeurIPS Conference 2025 Conference Paper
PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/thoracic organs. Each scan includes metadata such as patient age, sex, diagnosis, contrast phase, in-plane spacing, slice thickness, etc. AI models trained on PanTS achieve significantly better performance in pancreatic tumor detection, localization, and segmentation than those trained on existing public datasets. Our analysis indicates that these gains are directly attributable to the 16× larger-scale tumor annotations and indirectly supported by the 24 additional surrounding anatomical structures. As the largest and most comprehensive resource of its kind, PanTS offers a new benchmark for developing and evaluating AI models in pancreatic CT analysis.
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
In radar-camera 3D object detection, the radar point clouds are sparse and noisy, which causes difficulties in fusing camera and radar modalities. To solve this, we introduce a novel query-based detection method named Radar-Camera Transformer (RCTrans). Specifically, we first design a Radar Dense Encoder to enrich the sparse valid radar tokens, and then concatenate them with the image tokens. By doing this, we can fully explore the 3D information of each interest region and reduce the interference of empty tokens during the fusing stage. We then design a Pruning Sequential Decoder to predict 3D boxes based on the obtained tokens and randomly initialized queries. To alleviate the effect of elevation ambiguity in radar point clouds, we gradually locate the position of the object via a sequential fusion structure. It helps to get more precise and flexible correspondences between tokens and queries. A pruning training strategy is adopted in the decoder, which can save much time during inference and inhibit queries from losing their distinctiveness. Extensive experiments on the large-scale nuScenes dataset prove the superiority of our method, and we also achieve new state-of-the-art radar-camera 3D detection results.
NeurIPS Conference 2025 Conference Paper
Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios. To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which extends the differential reaction fingerprint (DRFP) by representing shingles as continuous high-dimensional embeddings, capturing structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76% on R$^2$ under permutation perturbations.
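The symmetric-difference shingle idea extends DRFP, where substructure shingles common to reactants and products cancel and only the changed ones are fingerprinted. A toy hashed bit-vector sketch of that base scheme (ReaDISH itself replaces the bits with continuous embeddings; shingle strings and `n_bits` here are our illustrative choices):

```python
import zlib

def drfp_style_bits(reactant_shingles, product_shingles, n_bits=64):
    # Shingles appearing on exactly one side encode the structural change;
    # shared (unreacted) substructures cancel out of the symmetric difference.
    changed = set(reactant_shingles) ^ set(product_shingles)
    fp = [0] * n_bits
    for s in changed:
        fp[zlib.crc32(s.encode()) % n_bits] = 1   # stable hash -> bit index
    return fp
```

Because the encoding is built from sets, it is invariant to molecule and shingle ordering, which is exactly the permutation insensitivity the abstract targets.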
AAAI Conference 2025 Conference Paper
In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks like Generative Adversarial Networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge due to their intricate structural complexity. To address this issue, we propose a novel post-processing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring consistent details with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. The RealisHuman framework significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics.
NeurIPS Conference 2025 Conference Paper
Multimodal learning (MML) is significantly constrained by modality imbalance, leading to suboptimal performance in practice. While existing approaches primarily focus on balancing the learning of different modalities to address this issue, they fundamentally overlook the inherent disproportion in model classification ability, which serves as the primary cause of this phenomenon. In this paper, we propose a novel multimodal learning approach to dynamically balance the classification ability of weak and strong modalities by incorporating the principle of boosting. Concretely, we first propose a sustained boosting algorithm in multimodal learning by simultaneously optimizing the classification and residual errors. Subsequently, we introduce an adaptive classifier assignment strategy to dynamically facilitate the classification performance of the weak modality. Furthermore, we theoretically analyze the convergence property of the cross-modal gap function, ensuring the effectiveness of the proposed boosting scheme. To this end, the classification ability of strong and weak modalities is expected to be balanced, thereby mitigating the imbalance issue. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SOTA) multimodal learning baselines. The source code is available at https://github.com/njustkmg/NeurIPS25-AUG.
NeurIPS Conference 2025 Conference Paper
Spiking Neural Networks (SNNs) offer an energy-efficient paradigm for machine intelligence, but their continued scaling poses challenges for resource-limited deployment. Despite recent advances in binary SNNs, the storage and computational demands remain substantial for large-scale networks. To further explore the compression and acceleration potential of SNNs, we propose Sub-bit Spiking Neural Networks (S$^2$NNs) that represent weights with less than one bit. Specifically, we first establish an S$^2$NN baseline by leveraging the clustering patterns of kernels in well-trained binary SNNs. This baseline is highly efficient but suffers from outlier-induced codeword selection bias during training. To mitigate this issue, we propose an outlier-aware sub-bit weight quantization (OS-Quant) method, which optimizes codeword selection by identifying and adaptively scaling outliers. Furthermore, we propose a membrane potential-based feature distillation (MPFD) method, improving the performance of highly compressed S$^2$NN via more precise guidance from a teacher model. Extensive results on vision reveal that S$^2$NN outperforms existing quantized SNNs in both performance and efficiency, making it promising for edge computing applications.
IJCAI Conference 2025 Conference Paper
Generating high-fidelity talking heads that maintain stable head poses and achieve robust lip sync remains a significant challenge. Although methods based on 3D Gaussian Splatting (3DGS) offer a promising solution via point-based deformation, they suffer from inconsistent head dynamics and mismatched mouth movements due to unstable Gaussian initialization and incomplete speech features. To overcome these limitations, we introduce SyncGaussian, a 3DGS-based framework that ensures stable head poses, enhanced lip sync, and realistic appearances with real-time rendering. SyncGaussian employs a stable head Gaussian initialization strategy to mitigate head jitter by optimizing commonly used rough head pose parameters. To enhance lip sync, we propose a sync-enhanced encoder that leverages audio-to-text and audio-to-visual speech features. Guided by a tailored cosine similarity loss function, the encoder integrates discriminative speech features through a multi-level sync adaptation mechanism, enabling the learning of an adaptive speech feature space. Extensive experiments demonstrate that SyncGaussian outperforms state-of-the-art methods in image quality, dynamic motion, and lip sync, with the potential for real-time applications.
NeurIPS Conference 2025 Conference Paper
In this work, we address the task of table image to LaTeX code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables—those with large sizes, deeply nested structures, and semantically rich or irregular cell content—where existing methods often fail. We begin with a comprehensive analysis, identifying key challenges and highlighting the limitations of current evaluation protocols. To overcome these issues, we propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset. To further improve generation quality, we introduce a dual-reward reinforcement learning strategy based on Group Relative Policy Optimization (GRPO). Unlike standard approaches that optimize purely over text outputs, our method incorporates both a structure-level reward on LaTeX code and a visual fidelity reward computed from rendered outputs, enabling direct optimization of the visual output quality. We adopt a hybrid evaluation protocol combining TEDS-Structure and CW-SSIM, and show that our method achieves state-of-the-art performance, particularly on structurally complex tables, demonstrating the effectiveness and robustness of our approach.
IROS Conference 2025 Conference Paper
Tactile sensing plays a crucial role in empowering robotic hands with improved grasping and manipulation abilities. In this paper, we propose an anthropomorphic robotic hand design with soft fingertips that integrate dual air bag sensors to achieve tactile sensing. The air bag sensor is low-cost, easy to build, and deformable; embedded in the fingertip, it endows the hand with the ability to perceive and gives it mechanical compliance similar to that of a human fingertip. The air bag sensor exhibits high performance metrics, including a sensitivity of ~1.65 kPa/N, a minimum detection force of < 0.01 N, a response time of < 10 ms, and good stability and repeatability. The experimental results show that the proposed robotic hand performs well in surface texture detection, hard inclusion depth detection, and object softness detection, as well as in grasping tasks. By applying a machine learning algorithm to the experimental data, accuracies of 0.767 and 0.898 were achieved in predicting hard inclusion depth and object hardness, respectively. This study provides a simple and effective tactile sensing solution for the design of anthropomorphic robotic hands, with possible applications such as end-effectors for humanoid robots or robotic palpation.
AAAI Conference 2025 Conference Paper
Binary Spiking Neural Networks (BSNNs) inherit the event-driven paradigm of SNNs, while also adopting the reduced storage burden of binarization techniques. These distinct advantages grant BSNNs lightweight and energy-efficient characteristics, rendering them ideal for deployment on resource-constrained edge devices. However, due to the binary synaptic weights and non-differentiable spike function, effectively training BSNNs remains an open question. In this paper, we conduct an in-depth analysis of the challenge for BSNN learning, namely the frequent weight sign flipping problem. To mitigate this issue, we propose an Adaptive Gradient Modulation Mechanism (AGMM), which is designed to reduce the frequency of weight sign flipping by adaptively adjusting the gradients during the learning process. The proposed AGMM can enable BSNNs to achieve faster convergence speed and higher accuracy, effectively narrowing the gap between BSNNs and their full-precision equivalents. We validate AGMM on both static and neuromorphic datasets, and results indicate that it achieves state-of-the-art results among BSNNs. This work substantially reduces storage demands and enhances SNNs' inherent energy efficiency, making them highly feasible for resource-constrained environments.
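The gradient-modulation idea described above can be sketched as a rule that damps any update which would flip a weight's sign; the damping factor and the exact rule here are illustrative assumptions, not the paper's actual AGMM formula.

```python
import numpy as np

def modulate_gradients(weights, grads, lr, factor=0.5):
    """Adaptive gradient modulation (sketch): shrink any gradient step that
    would flip the sign of a latent weight behind a binary (+/-1) synapse,
    reducing the frequent sign flipping that destabilizes BSNN training.
    `factor` is an assumed damping constant for illustration only."""
    step = lr * grads
    would_flip = np.sign(weights - step) != np.sign(weights)
    step[would_flip] *= factor        # damp only the flip-inducing updates
    return weights - step
```

In this toy rule, benign updates pass through unchanged while large, sign-flipping ones are attenuated, which is one simple way to slow down oscillation between the two binary states.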
IJCAI Conference 2025 Conference Paper
The multimodal imbalance problem has been extensively studied to prevent the undesirable scenario where multimodal performance falls below that of unimodal models. However, existing methods typically assess the strength of modalities and perform learning simultaneously under the imbalanced status. This deferred strategy fails to rebalance multimodal learning instantaneously, leading to performance degeneration. To address this, we propose a novel multimodal learning approach, termed instantaneous probe-and-rebalance multimodal learning (IPRM), which employs a two-pass forward method to first probe (but not learn) and then perform rebalanced learning under the balanced status. Concretely, we first employ the geodesic multimodal mixup (GMM) to incorporate fusion representation and probe modality strength in the first forward phase. Then the weights are instantaneously recalibrated based on the probed strength, facilitating balanced training via the second forward pass. This process is applied dynamically throughout the entire training process. Extensive experiments reveal that our proposed IPRM outperforms all baselines, achieving state-of-the-art (SOTA) performance on numerous widely used datasets. The code is available at https://github.com/njustkmg/IJCAI25-IPRM.
NeurIPS Conference 2025 Conference Paper
Diffusion models, such as diffusion policy, have achieved state-of-the-art results in robotic manipulation by imitating expert demonstrations. While diffusion models were originally developed for vision tasks like image and video generation, many of their inference strategies have been directly transferred to control domains without adaptation. In this work, we show that by tailoring the denoising process to the specific characteristics of embodied AI tasks, particularly the structured, low-dimensional nature of action distributions, diffusion policies can operate effectively with as few as 5 neural function evaluations (NFE). Building on this insight, we propose a population-based sampling strategy, genetic denoising, which enhances both performance and stability by selecting denoising trajectories with low out-of-distribution risk. Our method solves challenging tasks with only 2 NFE while improving or matching performance. We evaluate our approach across 14 robotic manipulation tasks from D4RL and Robomimic, spanning multiple action horizons and inference budgets. In over 2 million evaluations, our method consistently outperforms standard diffusion-based policies, achieving up to 20% performance gains with significantly fewer inference steps.
ICRA Conference 2025 Conference Paper
Crack segmentation is pivotal for structural health monitoring, enabling the timely maintenance of critical infrastructure such as bridges and roads. However, existing deep learning models are often too computationally intensive for deployment on resource-constrained devices. To address this limitation, we introduce UltraFastCrackSeg, a lightweight model designed for real-time crack segmentation that effectively balances high accuracy with low computational demands. Featuring an efficient encoder-decoder architecture, our model significantly reduces parameter count and floating-point operations (FLOPs) compared to current methods, as illustrated in Figure 1. We further enhance performance through a self-supervised pretraining approach that employs a novel, task-oriented masking strategy, thereby improving feature extraction. Experiments across multiple datasets demonstrate that UltraFastCrackSeg achieves state-of-the-art Intersection over Union (IoU) and F1 scores while maintaining a compact model size and high inference speed. Evaluations on a low-power CPU device confirm its capability to achieve up to 80 frames per second (FPS) with ONNX runtime optimization, making it highly suitable for real-time, on-site applications. These findings establish UltraFastCrackSeg as a robust and efficient solution for practical crack detection tasks. Code is available at: https://github.com/weiqingq/UltraFastCrackSeg.
TMLR Journal 2025 Journal Article
Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging process due to the inefficiencies and inconsistencies inherent in conventional reward engineering methodologies. Recent advances have explored leveraging large language models (LLMs) to automate the design of reward functions. However, LLMs’ insufficient numerical optimization capabilities often result in suboptimal reward hyperparameter tuning, while non-selective validation of candidate reward functions leads to substantial computational overhead. To address these challenges, we propose the Uncertainty-aware Reward Design Process (URDP), a novel framework that integrates large language models to streamline reward function design and evaluation. URDP quantifies candidate reward function uncertainty based on the self-consistency analysis, enabling simulation-free identification of ineffective reward components while discovering novel ones. Furthermore, we introduce uncertainty-aware Bayesian optimization (UABO), which incorporates uncertainty estimation to improve the hyperparameter configuration. Finally, we construct a bi-level optimization framework by decoupling the reward component optimization and the hyperparameter tuning. URDP promotes the collaboration between the reward logic reasoning of the LLMs and the numerical optimization strengths of the Bayesian optimization. We conduct a comprehensive evaluation of URDP across 35 diverse tasks spanning three benchmark environments: IsaacGym, Bidexterous Manipulation, and ManiSkill2. Our experimental results demonstrate that URDP not only generates higher-quality reward functions but also achieves significant improvements in the efficiency of automated reward design compared to existing approaches. We open-source all code at https://github.com/Yy12136/URDP.
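The self-consistency analysis above can be illustrated with a minimal sketch: sample several candidate reward functions and treat a component's rarity across samples as its uncertainty. The scoring rule and names below are assumptions for illustration, not URDP's actual estimator.

```python
from collections import Counter

def component_uncertainty(samples):
    """Self-consistency uncertainty (sketch): given K candidate reward
    functions sampled from an LLM (each a collection of reward-component
    names), a component appearing in few samples gets high uncertainty and
    can be flagged as likely ineffective without running any simulation."""
    k = len(samples)
    counts = Counter(c for s in samples for c in set(s))
    return {c: 1.0 - counts[c] / k for c in counts}
```

A component present in every sample gets uncertainty 0.0; one present in a single sample out of K gets 1 - 1/K, making it a candidate for removal before any costly rollout.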
NeurIPS Conference 2025 Conference Paper
Spiking Neural Networks (SNNs) demonstrate significant potential for energy-efficient neuromorphic computing through an event-driven paradigm. While training methods and computational models have greatly advanced, SNNs struggle to achieve competitive performance in visual long-sequence modeling tasks. In artificial neural networks, the effective receptive field (ERF) serves as a valuable tool for analyzing feature extraction capabilities in visual long-sequence modeling. Inspired by this, we introduce the Spatio-Temporal Effective Receptive Field (ST-ERF) to analyze the ERF distributions across various Transformer-based SNNs. Based on the proposed ST-ERF, we reveal that these models suffer from establishing a robust global ST-ERF, thereby limiting their visual feature modeling capabilities. To overcome this issue, we propose two novel channel-mixer architectures: the multi-layer-perceptron-based mixer (MLPixer) and the splash-and-reconstruct block (SRB). These architectures enhance the global spatial ERF through all timesteps in early network stages of Transformer-based SNNs, improving performance on challenging visual long-sequence modeling tasks. Extensive experiments conducted on the Meta-SDT variants and across object detection and semantic segmentation tasks further validate the effectiveness of our proposed method. Beyond these specific applications, we believe the proposed ST-ERF framework can provide valuable insights for designing and optimizing SNN architectures across a broader range of tasks. The code is available at https://github.com/EricZhang1412/Spatial-temporal-ERF.
EAAI Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Text-based person retrieval aims at retrieving a specific pedestrian image from a gallery based on textual descriptions. The primary challenge is how to overcome the inherent heterogeneous modality gap in the situation of significant intra-class variation and minimal inter-class variation. Existing approaches commonly employ vision-language pre-training or attention mechanisms to learn appropriate cross-modal alignments from noisy inputs. Despite commendable progress, current methods inevitably suffer from two defects: 1) Matching ambiguity, which mainly derives from unreliable matching pairs; 2) One-sided cross-modal alignments, stemming from the absence of exploring one-to-many correspondence, i.e., coarse-grained semantic alignment. These critical issues significantly deteriorate retrieval performance. To this end, we propose a novel framework termed Adaptive Uncertainty-based Learning (AUL) for text-based person retrieval from the uncertainty perspective. Specifically, our AUL framework consists of three key components: 1) Uncertainty-aware Matching Filtration that leverages Subjective Logic to effectively mitigate the disturbance of unreliable matching pairs and select high-confidence cross-modal matches for training; 2) Uncertainty-based Alignment Refinement, which not only simulates coarse-grained alignments by constructing uncertainty representations but also performs progressive learning to incorporate coarse- and fine-grained alignments properly; 3) Cross-modal Masked Modeling that aims at exploring more comprehensive relations between vision and language. Extensive experiments demonstrate that our AUL method consistently achieves state-of-the-art performance on three benchmark datasets in supervised, weakly supervised, and domain generalization settings. Our code is available at https://github.com/CFM-MSG/Code-AUL.
TCS Journal 2024 Journal Article
IJCAI Conference 2024 Conference Paper
Topology augmentation is a popular strategy to address the issue of over-smoothing in graph neural networks (GNNs). To prevent potential distortion of node representations, an essential principle is to enhance the separability between embeddings of nodes from different classes while preserving smoothness among nodes of the same class. However, differentiating between inter-class and intra-class edges becomes arduous when class labels are unavailable or the graph is partially labeled. While clustering offers an alternative for identifying closely connected groups of nodes, traditional clustering methods face challenges when applied to GNNs in terms of accuracy, efficiency, adaptability, and scalability to diverse graphs. To address these limitations, we introduce ClusterDrop, which uses learnable prototypes for efficient clustering and incorporates supervised signals to enhance accuracy and adaptability across different graphs. Experiments on six datasets with varying graph structures demonstrate its effectiveness in alleviating over-smoothing and enhancing GNN performance.
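The learnable-prototype clustering above can be sketched as a soft assignment of node embeddings to prototype vectors; the softmax form and temperature are assumptions, not ClusterDrop's exact formulation.

```python
import numpy as np

def prototype_assign(node_emb, prototypes, tau=1.0):
    """Prototype clustering (sketch): soft-assign each node embedding to
    learnable prototype vectors via a temperature-scaled softmax over dot
    products, yielding cluster memberships without running k-means. The
    prototypes would be trained jointly with the GNN."""
    sims = node_emb @ prototypes.T / tau
    sims -= sims.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(sims)
    return probs / probs.sum(axis=1, keepdims=True)
```

Because assignment is a differentiable forward pass rather than an iterative clustering loop, it scales to large graphs and can absorb supervised signals through the prototypes themselves.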
NeurIPS Conference 2024 Conference Paper
Recently, federated multi-view clustering (FedMVC) has emerged to explore cluster structures in multi-view data distributed on multiple clients. Many existing approaches tend to assume that clients are isomorphic and all of them belong to either single-view clients or multi-view clients. While these methods have succeeded, they may encounter challenges in practical FedMVC scenarios involving heterogeneous hybrid views, where a mixture of single-view and multi-view clients exhibit varying degrees of heterogeneity. In this paper, we propose a novel FedMVC framework, which concurrently addresses two challenges associated with heterogeneous hybrid views, i.e., client gap and view gap. To address the client gap, we design a local-synergistic contrastive learning approach that helps single-view clients and multi-view clients achieve consistency for mitigating heterogeneity among all clients. To address the view gap, we develop a global-specific weighting aggregation method, which encourages global models to learn complementary features from hybrid views. The interplay between local-synergistic contrastive learning and global-specific weighting aggregation mutually enhances the exploration of the data cluster structures distributed on multiple clients. Theoretical analysis and extensive experiments demonstrate that our method can handle the heterogeneous hybrid views in FedMVC and outperforms state-of-the-art methods.
NeurIPS Conference 2024 Conference Paper
Graph neural networks (GNNs) have attracted considerable attention due to their diverse applications. However, the scarcity and quality limitations of graph data present challenges to their training process in practical settings. To facilitate the development of effective GNNs, companies and researchers often seek external collaboration. Yet, directly sharing data raises privacy concerns, motivating data owners to train GNNs on their private graphs and share the trained models. Unfortunately, these models may still inadvertently disclose sensitive properties of their training graphs (e.g., average default rate in a transaction network), leading to severe consequences for data owners. In this work, we study graph property inference attack to identify the risk of sensitive property information leakage from shared models. Existing approaches typically train numerous shadow models for developing such attack, which is computationally intensive and impractical. To address this issue, we propose an efficient graph property inference attack by leveraging model approximation techniques. Our method only requires training a small set of models on graphs, while generating a sufficient number of approximated shadow models for attacks. To enhance diversity while reducing errors in the approximated models, we apply edit distance to quantify the diversity within a group of approximated models and introduce a theoretically guaranteed criterion to evaluate each model's error. Subsequently, we propose a novel selection mechanism to ensure that the retained approximated models achieve high diversity and low error. Extensive experiments across six real-world scenarios demonstrate our method's substantial improvement, with average increases of 2.7% in attack accuracy and 4.1% in ROC-AUC, while being 6.5× faster compared to the best baseline.
AAAI Conference 2024 Conference Paper
Point cloud completion aims at completing shapes from partial observations. Most existing methods utilize shape prior information for point cloud completion, such as feeding the partial input through an encoder-decoder deep learning structure to obtain the complete shape. However, information is easily lost in the generation process because the missing areas are invisible. Unlike most existing methods that directly infer the missing points using shape priors, we address completion as a cross-modality task. We propose a new Cross-modal Dual Phases Network (CDPNet) for shape completion. Our key idea is that the global information of the shape is obtained from an extra single-view image, while the partial point cloud provides the geometric information. After that, the multi-modal features jointly guide the specific structural information. To learn the geometric details of the shape, we choose to use patches to preserve local geometric features. In this way, we can generate shapes with enough geometric details. Experimental results show that our method achieves state-of-the-art performance on point cloud completion.
NeurIPS Conference 2024 Conference Paper
Time Series Classification (TSC) encompasses two settings: classifying entire sequences or classifying segmented subsequences. The raw time series for segmented TSC usually contain Multiple classes with Varying Duration of each class (MVD). Therefore, the characteristics of MVD pose unique challenges for segmented TSC, yet have been largely overlooked by existing works. Specifically, there exists a natural temporal dependency between consecutive instances (segments) to be classified within MVD. However, mainstream TSC models rely on the assumption of independent and identically distributed (i.i.d.) data, focusing on independently modeling each segment. Additionally, annotators with varying expertise may provide inconsistent boundary labels, leading to unstable performance of noise-free TSC models. To address these challenges, we first formally demonstrate that valuable contextual information enhances the discriminative power of classification instances. Leveraging the contextual priors of MVD at both the data and label levels, we propose a novel consistency learning framework Con4m, which effectively utilizes contextual information more conducive to discriminating consecutive segments in segmented TSC tasks, while harmonizing inconsistent boundary labels for training. Extensive experiments across multiple datasets validate the effectiveness of Con4m in handling segmented TSC tasks on MVD. The source code is available at https://github.com/MrNobodyCali/Con4m.
EAAI Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
The generation of logically coherent dialogues by humans relies on underlying cognitive abilities. Based on this, we redefine the dialogue coherence evaluation process, combining cognitive judgment with the basic text to achieve a more human-like evaluation. We propose a novel dialogue evaluation framework based on Dialogue Cognition Graph (DCGEval) to implement the fusion by in-depth interaction between cognition modeling and text modeling. The proposed Abstract Meaning Representation (AMR) based graph structure called DCG aims to uniformly model four dialogue cognitive abilities. Specifically, core-semantic cognition is modeled by converting the utterance into an AMR graph, which can extract essential semantic information without redundancy. The temporal and role cognition are modeled by establishing logical relationships among the different AMR graphs. Finally, the commonsense knowledge from ConceptNet is fused to express commonsense cognition. Experiments demonstrate the necessity of modeling human cognition for dialogue evaluation, and our DCGEval presents stronger correlations with human judgments compared to other state-of-the-art evaluation metrics.
IJCAI Conference 2024 Conference Paper
Modeling time series data has become a very attractive research topic due to its wide application, such as human activity recognition, financial forecasting and sensor-based automatic system monitoring. Recently, deep learning models have shown great advances in modeling time series data, but they heavily depend on a large amount of labeled data. To avoid costly labeling, this paper explores domain adaptation from a labeled source domain to the unlabeled target domain on time series data. To achieve the goal, we propose a disentangled representation learning framework named CADT to disentangle the domain-invariant features from the domain-specific ones. Particularly, CADT is injected with a novel class-wise hypersphere loss to improve the generalization of the classifier from the source domain to the target domain. Intuitively, it restricts the source data of the same class within the same hypersphere and minimizes the radius of it, which in turn enlarges the margin between different classes and makes the decision boundary of both domains easier to learn. We further devise several kinds of domain-preserving data augmentation methods to better capture the domain-specific patterns. Extensive experiments on two public datasets and two real-world applications demonstrate the effectiveness of the proposed model against several state-of-the-art baselines.
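The class-wise hypersphere loss described above can be sketched as follows; the radius/margin formulation is a plausible reading of the abstract, not CADT's published loss.

```python
import numpy as np

def hypersphere_loss(features, labels, centers, margin=0.0):
    """Class-wise hypersphere loss (sketch): pull source features of each
    class inside a hypersphere around its center and shrink the radius,
    which enlarges inter-class margins. `centers` maps class id to a center
    vector; all names here are illustrative."""
    classes = np.unique(labels)
    loss = 0.0
    for c in classes:
        feats = features[labels == c]                   # features of class c
        dists = np.linalg.norm(feats - centers[c], axis=1)
        radius = dists.max()                            # current sphere radius
        loss += radius + np.maximum(dists - margin, 0.0).mean()
    return loss / len(classes)
```

The loss is zero exactly when every feature sits at its class center, and grows with the spread of each class, matching the intuition of minimizing the hypersphere radius.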
NeurIPS Conference 2024 Conference Paper
Automated seizure detection (ASD) using intracranial electroencephalography (iEEG) is critical for effective epilepsy treatment. However, the significant domain shift of iEEG signals across subjects poses a major challenge, limiting their applicability in real-world clinical scenarios. In this paper, we address this issue by analyzing the primary cause behind the failure of existing iEEG models for subject-independent seizure detection, and identify a critical universal seizure pattern: seizure events consistently exhibit higher average amplitude compared to adjacent normal events. To mitigate the domain shifts and preserve the universal seizure patterns, we propose a novel self-comparison mechanism. This mechanism effectively aligns iEEG signals across subjects and time intervals. Building upon these findings, we propose Difference Matrix-based Neural Network (DMNet), a subject-independent seizure detection model, which leverages self-comparison based on two constructed (contextual, channel-level) references to mitigate shifts of iEEG, and utilize a simple yet effective difference matrix to encode the universal seizure patterns. Extensive experiments show that DMNet significantly outperforms previous SOTAs while maintaining high efficiency on a real-world clinical dataset collected by us and two public datasets for subject-independent seizure detection. Moreover, the visualization results demonstrate that the generated difference matrix can effectively capture the seizure activity changes during the seizure evolution process. Additionally, we deploy our method in an online diagnosis system to illustrate its effectiveness in real clinical applications.
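The self-comparison mechanism above can be sketched as subtracting two references from each segment so absolute amplitude shifts across subjects cancel while relative amplitude differences survive; the exact construction of the references is assumed for illustration.

```python
import numpy as np

def difference_matrix(segment, context_ref, channel_ref):
    """Self-comparison encoding (sketch): compare an iEEG segment of shape
    (channels, time) against a contextual reference (e.g., an average of
    adjacent windows, same shape) and a per-channel baseline (one value per
    channel). Subject-specific amplitude offsets cancel in both differences,
    while the higher amplitude of seizure events remains visible."""
    contextual_diff = segment - context_ref            # vs. adjacent windows
    channel_diff = segment - channel_ref[:, None]      # vs. channel baseline
    return np.stack([contextual_diff, channel_diff])   # (2, channels, time)
```

A downstream classifier sees only differences, not raw amplitudes, which is one way to make the universal "seizures are louder than their neighborhood" pattern subject-independent.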
IJCAI Conference 2024 Conference Paper
Wearable sensors play a crucial role in real-world scenarios, such as human activity recognition, sleep monitoring and electrocardiogram monitoring. However, deploying classifiers on them is challenged by distribution shifts across users and devices. Unsupervised domain adaptation (UDA) is proposed to address this, yet existing methods mostly focus on feature distribution shift, neglecting the potential misclassification due to label shift. In this paper, we propose Domain adaptation under label shift for Wearable sensor with Learnable Reweighting (DWLR) to handle both feature and label shifts. Specifically, DWLR employs learnable reweighting to align label distributions between source and target domains. It incorporates elements of information gain during the reweighting process to counter potential distribution shift that could emerge from over-reliance on data with high-confidence pseudo labels. Importantly, since wearable sensor data is time-series data, and can be subjected to distribution shifts originating from either the time domain, the frequency domain, or both, DWLR performs reweighting and alignment separately in these two domains to more robustly handle potential feature distribution shifts. Extensive experiments on three distinct wearable sensor datasets demonstrate the effectiveness of DWLR, yielding a remarkable average performance improvement of 5.85%.
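The learnable reweighting above can be sketched as per-class weights applied to per-sample source losses; the parameterization below (exponentiated logits, mean-normalized) is an assumption, not DWLR's published update rule.

```python
import numpy as np

def reweighted_loss(losses, labels, log_w):
    """Label-shift reweighting (sketch): `log_w` holds one learnable logit
    per class; its exponential gives positive class weights that rescale
    per-sample losses so the effective source label distribution can be
    pushed toward the target's. Normalizing by the mean keeps the overall
    loss scale stable while the weights are learned."""
    w = np.exp(log_w)
    w = w / w.mean()
    return (w[labels] * losses).mean()
```

With all logits equal, the function reduces to the plain mean loss; training would then adjust `log_w` (e.g., guided by information gain on pseudo-labels) to up- or down-weight classes.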
NeurIPS Conference 2024 Conference Paper
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, requiring only a vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance.
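The Householder construction above is standard linear algebra and can be sketched directly; the way the factors are combined into an adaptation matrix is a minimal reading of the abstract, not the method's exact parameterization.

```python
import numpy as np

def householder(v):
    """Householder reflection H = I - 2 vv^T / (v^T v): an orthogonal matrix
    built from a single vector, so it costs only d parameters to store."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def adaptation_matrix(v_left, sigma, v_right):
    """SVD-style adaptation (sketch): Delta W = H_L @ diag(sigma) @ H_R, with
    two Householder matrices standing in for the unitary factors and a
    learnable diagonal `sigma` whose nonzeros set the effective rank per
    layer. Names and composition are illustrative assumptions."""
    return householder(v_left) @ np.diag(sigma) @ householder(v_right)
```

Because orthogonal factors preserve rank, the rank of the adaptation matrix equals the number of nonzero entries of `sigma`, which is what lets each layer choose its own effective rank.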
JBHI Journal 2024 Journal Article
Dialogue System for Medical Diagnosis (DSMD) based on reinforcement learning (RL) can simulate patient-doctor interactions, playing a crucial role in clinical diagnosis. However, due to the complexity of disease etiology, DSMD faces the challenge of low efficiency in diagnostic evidence search. Moreover, a solely RL-based DSMD, without the constraints of professional medical knowledge, often generates irrational, meaningless, or even erroneous symptom inquiries, leading to poor interpretability of the diagnostic path and high misdiagnosis rates. To address these issues, we propose an Evidence-based dialogue system with a highly Interpretable Reasoning path for Automatic Diagnosis (EIRAD) grounded in a medical knowledge graph (MKG). Specifically, our automated diagnostic model captures key symptoms for suspected diseases by explicitly leveraging the topology of the MKG, enhancing the interpretability and accuracy of diagnosis. To expedite the retrieval of factual evidence, we develop two mechanisms: 1) a mapping mechanism between the entity set of the MKG and the DSMD's diagnostic evidence and diseases. According to the patient's symptoms, EIRAD prunes irrelevant disease and symptom nodes from the MKG, which truncates invalid actions of the RL-based DSMD; 2) a reward mechanism integrating the effectiveness of symptom inquiry and the accuracy of disease diagnosis. This comprehensive reward system is suitable for intelligent consultation and can effectively drive the DSMD to accelerate evidence collection. Experimental results demonstrate that our model significantly outperforms competitive benchmark methods in symptom inquiry efficiency and diagnostic accuracy.
EAAI Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
JBHI Journal 2024 Journal Article
Accurate genotyping of the epidermal growth factor receptor (EGFR) is critical for treatment planning in lung adenocarcinoma. Currently, clinical identification of EGFR genotype relies heavily on biopsy and sequence testing, which is invasive and complicated. Recent advancements in integrating computed tomography (CT) imagery with deep learning techniques have yielded a non-invasive and straightforward way of identifying EGFR profiles. However, several limitations remain: 1) most of these methods still require physicians to annotate tumor boundaries, which is time-consuming and prone to subjective errors; 2) most existing methods are simply borrowed from the computer vision field and do not sufficiently exploit multi-level features for the final prediction. To solve these problems, we propose a Denseformer framework to identify EGFR mutation status in a truly end-to-end fashion directly from 3D lung CT images. Specifically, we take 3D whole-lung CT images as the input of the neural network model without manually labeling the lung nodules. This is inspired by the medical report that the mutational status of EGFR is associated not only with the local tumor nodules but also with the microenvironment of the whole lung. Besides, we design a novel Denseformer network to fully explore the distinctive information across features at different levels. Denseformer is a novel network architecture that combines the advantages of both convolutional neural networks (CNNs) and Transformers. It learns directly from 3D whole-lung CT images, preserving the spatial location information in the CT images. To further improve model performance, we design a combined Transformer module, which employs the Transformer encoder to globally integrate information across different levels and layers and uses it as the basis for the final prediction. The proposed model has been tested on a lung adenocarcinoma dataset collected at the Affiliated Hospital of Zunyi Medical University. Extensive experiments demonstrate that the proposed method can effectively extract meaningful features from 3D CT images to make accurate predictions. Compared with other state-of-the-art methods, Denseformer achieves the best performance among current deep learning methods for predicting EGFR mutation status from a single CT modality.
EAAI Journal 2024 Journal Article
NeurIPS Conference 2024 Conference Paper
Graph Neural Networks (GNNs) have significantly advanced the field of drug discovery, enhancing the speed and efficiency of molecular identification. However, training these GNNs demands vast amounts of molecular data, which has spurred the emergence of collaborative model-sharing initiatives. These initiatives facilitate the sharing of molecular pre-trained models among organizations without exposing proprietary training data. Despite the benefits, these molecular pre-trained models may still pose privacy risks. For example, malicious adversaries could perform data extraction attack to recover private training data, thereby threatening commercial secrets and collaborative trust. This work, for the first time, explores the risks of extracting private training molecular data from molecular pre-trained models. This task is nontrivial as the molecular pre-trained models are non-generative and exhibit a diversity of model architectures, which differs significantly from language and image models. To address these issues, we introduce a molecule generation approach and propose a novel, model-independent scoring function for selecting promising molecules. To efficiently reduce the search space of potential molecules, we further introduce a Molecule Extraction Policy Network for molecule extraction. Our experiments demonstrate that even with only query access to molecular pre-trained models, there is a considerable risk of extracting training data, challenging the assumption that model sharing alone provides adequate protection against data extraction attacks. Our codes are publicly available at: https://github.com/renH2/Molextract.
NeurIPS Conference 2024 Conference Paper
Multimodal learning falls into the trap of the optimization dilemma due to the modality imbalance phenomenon, leading to unsatisfactory performance in real applications. A core reason for modality imbalance is that the models of each modality converge at different rates. Many attempts naturally focus on adjusting learning procedures adaptively. Essentially, the reason why models converge at different rates is that the difficulty of fitting category labels is inconsistent across modalities during learning. From the perspective of fitting labels, we find that appropriate positive intervention in label fitting can correct this difference in learning ability. By exploiting the ability of contrastive learning to intervene in the learning of category label fitting, we propose a novel multimodal learning approach that dynamically integrates unsupervised contrastive learning and supervised multimodal learning to address the modality imbalance problem. We find that a simple yet heuristic integration strategy can significantly alleviate the modality imbalance phenomenon. Moreover, we design a learning-based integration strategy to integrate the two losses dynamically, further improving the performance. Experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) multimodal learning approaches. The code is available at https://github.com/njustkmg/NeurIPS24-LFM.
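The abstract does not spell out the integration strategy; as a minimal, hypothetical sketch (the function names and the convex-combination form are assumptions, not the paper's implementation), dynamically combining a supervised loss with a contrastive loss might look like:

```python
import math

def supervised_loss(logits, label):
    # Cross-entropy of a softmax over raw scores: an illustrative
    # stand-in for the supervised multimodal classification loss.
    z = [math.exp(v) for v in logits]
    return -math.log(z[label] / sum(z))

def integrate_losses(l_sup, l_con, alpha):
    # Heuristic integration: a convex combination of the supervised and
    # contrastive losses; alpha would be scheduled or learned in practice.
    return (1.0 - alpha) * l_sup + alpha * l_con
```

A learning-based strategy, as the paper describes, would replace the fixed `alpha` with a parameter trained jointly with the model.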
AAAI Conference 2024 Conference Paper
Recently, the paradigm of pre-training and fine-tuning graph neural networks has been intensively studied and applied in a wide range of graph mining tasks. Its success is generally attributed to the structural consistency between pre-training and downstream datasets, which, however, does not hold in many real-world scenarios. Existing works have shown that the structural divergence between pre-training and downstream graphs significantly limits the transferability when using the vanilla fine-tuning strategy. This divergence leads to model overfitting on pre-training graphs and causes difficulties in capturing the structural properties of the downstream graphs. In this paper, we identify the fundamental cause of structural divergence as the discrepancy of generative patterns between the pre-training and downstream graphs. Furthermore, we propose G-Tuning to preserve the generative patterns of downstream graphs. Given a downstream graph G, the core idea is to tune the pre-trained GNN so that it can reconstruct the generative patterns of G, the graphon W. However, the exact reconstruction of a graphon is known to be computationally expensive. To overcome this challenge, we provide a theoretical analysis that establishes the existence of a set of alternative graphons called graphon bases for any given graphon. By utilizing a linear combination of these graphon bases, we can efficiently approximate W. This theoretical finding forms the basis of our model, as it enables effective learning of the graphon bases and their associated coefficients. Compared with existing algorithms, G-Tuning demonstrates consistent performance improvement in 7 in-domain and 7 out-of-domain transfer learning experiments.
IROS Conference 2024 Conference Paper
Accurate time synchronization is crucial for multisensor fusion, which is widely used in mobile robotics, autonomous driving, and virtual reality. Despite many advancements, precise multi-sensor synchronization is still challenging due to the sensors’ internal characteristics, data filtering, disjointed clock references, and transmission delay caused by operating system scheduling. This paper proposes a novel hardware-based synchronization solution to achieve synchronization with microsecond-level precision. By introducing a Sensor Adaptor board that provides a unified clock reference, the proposed hardware architecture enables high-precision synchronization across multiple sensors. Furthermore, we develop a method for Visual-Inertial time synchronization that actively controls the exposure duration using an ambient light sensor. By managing the IMU clock signal and exposure trigger, we align the camera’s sampling moment with the authentic IMU sampling time and significantly reduce the time discrepancy in the Visual-Inertial system. Experiments are conducted to evaluate the efficiency of the proposed method and system, including comparisons with previous work. The results indicate that our method can achieve precise time synchronization and be successfully implemented in multi-sensor systems.
NeurIPS Conference 2024 Conference Paper
Exploring the integration of if-then logic rules within neural network architectures presents an intriguing area. This integration seamlessly transforms the rule learning task into neural network training using backpropagation and stochastic gradient descent. From a well-trained sparse and shallow neural network, one can interpret each layer and neuron through the language of logic rules, and a global explanatory rule set can be directly extracted. However, ensuring interpretability may impose constraints on the flexibility, depth, and width of neural networks. In this paper, we propose HyperLogic: a novel framework leveraging hypernetworks to generate the weights of the main network. HyperLogic can unveil multiple diverse rule sets, each capable of capturing heterogeneous patterns in data. This provides a simple yet effective method to increase model flexibility and preserve interpretability. We theoretically analyze the benefits of HyperLogic by examining the approximation error and generalization capabilities under two types of regularization terms: sparsity and diversity regularizations. Experiments on real data demonstrate that our method can learn more diverse, accurate, and concise rules.
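As a toy illustration of the hypernetwork idea above (all names are assumptions; the fixed random linear map stands in for a learned hypernetwork, and the main network here is a single linear layer):

```python
import random

def hypernetwork(z, out_dim, in_dim, seed=0):
    # Toy hypernetwork: maps a latent code z to a weight matrix for the
    # main network via a fixed random linear map (illustrative only; in
    # HyperLogic the hypernetwork itself is trained).
    rng = random.Random(seed)
    return [[sum(rng.uniform(-1, 1) * zj for zj in z) for _ in range(in_dim)]
            for _ in range(out_dim)]

def main_network(x, W):
    # The main network's weights were produced by the hypernetwork
    # rather than learned directly; different codes z yield different
    # weight sets, hence multiple diverse rule sets.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]
```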
AIIM Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Sound mixture separation remains challenging due to heavy sound overlap and disturbance from noise, and unsupervised separation further increases the difficulty. As sound overlap always hinders accurate sound separation, we propose an Independency Adversarial Learning based Cross-Modal Sound Separation (IAL-CMS) approach, where IAL employs adversarial learning to minimize the correlation of separated sound elements, exploring high sound independence; CMS performs cross-modal sound separation, incorporating audio-visual consistent feature learning and interactive cross-attention learning to emphasize the semantic consistency among cross-modal features. Both audio-visual consistency and audio consistency are kept to guarantee accurate separation. The consistency and sound independence ensure the decomposition of overlapping mixtures into unrelated and distinguishable sound elements. The proposed approach is evaluated on MUSIC, VGGSound, and AudioSet. Extensive experiments certify that our approach outperforms existing approaches in supervised and unsupervised scenarios.
AAAI Conference 2024 Conference Paper
The paradigm of pre-training and fine-tuning graph neural networks has attracted wide research attention. In previous studies, the pre-trained models are viewed as universally versatile, and applied for a diverse range of downstream tasks. In many situations, however, this practice results in limited or even negative transfer. This paper, for the first time, emphasizes the specific application scope of graph pre-trained models: not all downstream tasks can effectively benefit from a graph pre-trained model. In light of this, we introduce the measure task consistency to quantify the similarity between graph pre-training and downstream tasks. This measure assesses the extent to which downstream tasks can benefit from specific pre-training tasks. Moreover, a novel fine-tuning strategy, Bridge-Tune, is proposed to further diminish the impact of the difference between pre-training and downstream tasks. The key innovation in Bridge-Tune is an intermediate step that bridges pre-training and downstream tasks. This step takes into account the task differences and further refines the pre-trained model. The superiority of the presented fine-tuning strategy is validated via numerous experiments with different pre-trained models and downstream tasks.
AAAI Conference 2024 Conference Paper
Soft robot design is an intricate field with unique challenges due to its complex and vast search space. In the past literature, evolutionary computation algorithms, including novel probabilistic generative models (PGMs), have shown potential in this realm. However, these methods are sample inefficient and predominantly focus on rigid robots in locomotion tasks, which limit their performance and application in robot design automation. In this work, we propose MorphVAE, an innovative PGM that incorporates a multi-task training scheme and a meticulously crafted sampling technique termed ``continuous natural selection'', aimed at bolstering sample efficiency. This method empowers us to gain insights from assessed samples across diverse tasks and temporal evolutionary stages, while simultaneously maintaining a delicate balance between optimization efficiency and biodiversity. Through extensive experiments in various locomotion and manipulation tasks, we substantiate the efficiency of MorphVAE in generating high-performing and diverse designs, surpassing the performance of competitive baselines.
ICML Conference 2024 Conference Paper
Our goal is to $\textit{efficiently}$ discover a compact set of temporal logic rules to explain irregular events of interest. We introduce a neural-symbolic rule induction framework within the temporal point process model. The negative log-likelihood is the loss that guides the learning, where the explanatory logic rules and their weights are learned end-to-end in a $\textit{differentiable}$ way. Specifically, predicates and logic rules are represented as $\textit{vector embeddings}$, where the predicate embeddings are fixed and the rule embeddings are trained via gradient descent to obtain the most appropriate compositional representations of the predicate embeddings. To make the rule learning process more efficient and flexible, we adopt a $\textit{sequential covering algorithm}$, which progressively adds rules to the model and removes the event sequences that have been explained until all event sequences have been covered. All the found rules will be fed back to the models for a final rule embedding and weight refinement. Our approach showcases notable efficiency and accuracy across synthetic and real datasets, surpassing state-of-the-art baselines by a wide margin in terms of efficiency.
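The sequential covering algorithm described above can be sketched generically as follows (a greedy, non-differentiable version for intuition; `covers` and the rule-scoring criterion are placeholders, not the paper's embedding-based formulation):

```python
def sequential_covering(sequences, candidate_rules, covers):
    # Greedy sequential covering: repeatedly add the rule that explains
    # the most remaining event sequences, then remove the sequences it
    # covers, until everything is covered or no rule helps.
    remaining, chosen = set(sequences), []
    while remaining:
        best = max(candidate_rules, key=lambda r: len(covers(r) & remaining))
        gained = covers(best) & remaining
        if not gained:          # no candidate explains anything left
            break
        chosen.append(best)
        remaining -= gained
    return chosen, remaining
```

In the paper, the rules found this way are fed back for a final joint refinement of rule embeddings and weights.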
AAAI Conference 2024 Conference Paper
Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. The large amount of image-text pairs required for training is usually sourced from the internet due to the manual cost, which brings the noise with mismatched relevance that affects the learning process. Unlike traditional noisy label learning, the key challenge in processing noisy image-text pairs is to finely identify the mismatched words to make the most use of trustworthy information in the text, rather than coarsely weighing the entire examples. To tackle this challenge, we propose a Noise-aware Image Captioning method (NIC) to adaptively mitigate the erroneous guidance from noise by progressively exploring mismatched words. Specifically, NIC first identifies mismatched words by quantifying word-label reliability from two aspects: 1) inter-modal representativeness, which measures the significance of the current word by assessing cross-modal correlation via prediction certainty; 2) intra-modal informativeness, which amplifies the effect of current prediction by combining the quality of subsequent word generation. During optimization, NIC constructs the pseudo-word-labels considering the reliability of the origin word-labels and model convergence to periodically coordinate mismatched words. As a result, NIC can effectively exploit both clean and noisy image-text pairs to learn a more robust mapping function. Extensive experiments conducted on the MS-COCO and Conceptual Caption datasets validate the effectiveness of our method in various noisy scenarios.
AIIM Journal 2024 Journal Article
NeurIPS Conference 2024 Conference Paper
The proliferation of abundant electricity time series (ETS) data presents numerous opportunities for various applications within power systems, including demand-side management, grid stability, and consumer behavior analysis. Deep learning models have advanced ETS modeling by effectively capturing sequence dependence. However, learning a generic representation of ETS data for various applications is challenging due to the inherently complex hierarchical structure of ETS data. Moreover, ETS data exhibits intricate temporal dependencies and is susceptible to the influence of exogenous variables. Furthermore, different instances exhibit diverse electricity consumption behavior. In this paper, we propose a foundation model PowerPM for ETS data, providing a large-scale, off-the-shelf model for power systems. PowerPM consists of a temporal encoder and a hierarchical encoder. The temporal encoder captures temporal dependencies within ETS data, taking into account exogenous variables. The hierarchical encoder models correlations between different levels of hierarchy. Furthermore, PowerPM leverages a novel self-supervised pre-training framework consisting of masked ETS modeling and dual-view contrastive learning. This framework enables PowerPM to capture temporal dependency within ETS windows and be aware of the discrepancy across ETS windows, providing two different perspectives for learning a generic representation. Our experiments span five real-world scenario datasets, including both private and public data. Through pre-training on massive ETS data, PowerPM achieves SOTA performance on diverse downstream tasks within the private dataset. Notably, when transferred to public datasets, PowerPM retains its edge, showcasing its remarkable generalization ability across various tasks and domains. Moreover, ablation studies and few-shot experiments further substantiate the effectiveness of our model.
EAAI Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a fundamental task for human-robot interaction. Recognition errors can significantly impact the overall accuracy and then degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy in cases of multiple adjacent objects with similar appearance. To address this issue, this work introduces human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. Then a new dataset called ScanERU is constructed to evaluate the effectiveness of this idea. Different from existing datasets, our ScanERU dataset is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to enlighten the research of ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our codes and dataset are available in the ScanERU repository.
NeurIPS Conference 2024 Conference Paper
Biological systems possess remarkable sound source localization (SSL) capabilities that are critical for survival in complex environments. This ability arises from the collaboration between the auditory periphery, which encodes sound as precisely timed spikes, and the auditory cortex, which performs spike-based computations. Inspired by these biological mechanisms, we propose a novel neuromorphic SSL framework that integrates spike-based neural encoding and computation. The framework employs Resonate-and-Fire (RF) neurons with a phase-locking coding (RF-PLC) method to achieve energy-efficient audio processing. The RF-PLC method leverages the resonance properties of RF neurons to efficiently convert audio signals to time-frequency representation and encode interaural time difference (ITD) cues into discriminative spike patterns. In addition, biological adaptations like frequency band selectivity and short-term memory effectively filter out many environmental noises, enhancing SSL capabilities in real-world settings. Inspired by these adaptations, we propose a spike-driven multi-auditory attention (MAA) module that significantly improves both the accuracy and robustness of the proposed SSL framework. Extensive experimentation demonstrates that our SSL framework achieves state-of-the-art accuracy in SSL tasks. Furthermore, it shows exceptional noise robustness and maintains high accuracy even at very low signal-to-noise ratios. By mimicking biological hearing, this neuromorphic approach contributes to the development of high-performance and explainable artificial intelligence systems capable of superior performance in real-world environments.
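The ITD cue mentioned above is conventionally estimated as the lag maximizing the cross-correlation between the two ear signals; a minimal sketch of that classical estimate (not the spike-based RF-PLC encoding itself; function and argument names are assumptions):

```python
def estimate_itd(left, right, max_lag):
    # Interaural time difference (in samples) as the lag that maximizes
    # the cross-correlation between the left- and right-ear signals.
    # A negative result means the sound reached the left ear first.
    def corr(lag):
        return sum(left[i] * right[i - lag]
                   for i in range(max(lag, 0),
                                  min(len(left), len(right) + lag)))
    return max(range(-max_lag, max_lag + 1), key=corr)
```

The neuromorphic framework instead encodes this cue into spike patterns, but the underlying quantity being extracted is the same lag.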
IJCAI Conference 2024 Conference Paper
The recent introduction of prompt tuning based on pre-trained vision-language models has dramatically improved the performance of multi-label image classification. However, some existing strategies that have been explored still have drawbacks, i.e., either exploiting massive labeled visual data at a high cost or using text data only for text prompt tuning and thus failing to learn the diversity of visual knowledge. Hence, the application scenarios of these methods are limited. In this paper, we propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem. Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge via the well-aligned space of pre-trained vision-language models. Then, a co-learning strategy with a dual-adapter module is designed to transfer visual knowledge from the pseudo-visual prompt to the text prompt, enhancing their visual representation abilities. Experimental results on VOC2007, MS-COCO, and NUS-WIDE datasets demonstrate that our method can surpass state-of-the-art (SOTA) methods across various settings for multi-label image classification tasks. The code is available at https://github.com/njustkmg/PVP.
AAAI Conference 2024 Conference Paper
Graph federated learning (FL) has emerged as a pivotal paradigm enabling multiple agents to collaboratively train a graph model while preserving local data privacy. Yet, current efforts overlook a key issue: agents are self-interested and would be hesitant to share data without fair and satisfactory incentives. This paper is the first endeavor to address this issue by studying the incentive mechanism for graph federated learning. We identify a unique phenomenon in graph federated learning: the presence of agents posing potential harm to the federation and agents contributing with delays. This stands in contrast to previous FL incentive mechanisms that assume all agents contribute positively and in a timely manner. In view of this, this paper presents a novel incentive mechanism tailored for fair graph federated learning, integrating incentives derived from both model gradient and payoff. To achieve this, we first introduce an agent valuation function aimed at quantifying agent contributions through the introduction of two criteria: gradient alignment and graph diversity. Moreover, due to the high heterogeneity in graph federated learning, striking a balance between accuracy and fairness becomes particularly crucial. We introduce motif prototypes to enhance accuracy, communicated between the server and agents, enhancing global model aggregation and aiding agents in local model optimization. Extensive experiments show that our model achieves the best trade-off between accuracy and the fairness of model gradient, as well as superior payoff fairness.
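The gradient-alignment criterion above can be read as a similarity between an agent's gradient and the aggregate direction; a minimal sketch under the assumption that cosine similarity is used (the paper's exact valuation function may differ):

```python
import math

def gradient_alignment(agent_grad, global_grad):
    # Cosine similarity between an agent's model gradient and the
    # aggregated global gradient. Values near 1 suggest a helpful
    # contribution; negative values flag potentially harmful agents.
    dot = sum(a * g for a, g in zip(agent_grad, global_grad))
    na = math.sqrt(sum(a * a for a in agent_grad))
    ng = math.sqrt(sum(g * g for g in global_grad))
    return dot / (na * ng) if na and ng else 0.0
```

In the full valuation function this score would be combined with a graph-diversity term before incentives are assigned.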
NeurIPS Conference 2024 Conference Paper
In recent years, deep neural networks (DNNs) have witnessed extensive applications, and protecting their intellectual property (IP) is thus crucial. As a non-invasive way for model IP protection, model fingerprinting has become popular. However, existing single-point based fingerprinting methods are highly sensitive to changes in the decision boundary, and may suffer from the misjudgment of the resemblance of sparse fingerprinting, yielding high false positives on innocent models. In this paper, we propose ADV-TRA, a more robust fingerprinting scheme that utilizes adversarial trajectories to verify the ownership of DNN models. Benefiting from its intrinsically progressive adversarial level, the trajectory is capable of tolerating a greater degree of alteration in decision boundaries. We further design novel schemes to generate a surface trajectory that involves a series of fixed-length trajectories with dynamically adjusted step sizes. Such a design enables a more unique and reliable fingerprint with relatively low querying costs. Experiments on three datasets against four types of removal attacks show that ADV-TRA exhibits superior performance in distinguishing between infringing and innocent models, outperforming the state-of-the-art comparisons.
AAAI Conference 2024 Conference Paper
Temporal sentence localization (TSL) aims to localize a target segment in a video according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on abundant yet expensive manual annotations for training. Moreover, these trained data-dependent models usually cannot generalize well to unseen scenarios because of the inherent domain shift. To address this issue, in this paper, we target another more practical but challenging setting: unsupervised domain adaptive temporal sentence localization (UDA-TSL), which explores whether the localization knowledge can be transferred from a fully-annotated data domain (source domain) to a new unannotated data domain (target domain). Particularly, we propose an effective and novel baseline for UDA-TSL to bridge the multi-modal gap across different domains and learn the potential correspondence between the video-query pairs in the target domain. We first develop separate modality-specific domain adaptation modules to smoothly balance the minimization of the domain shifts in cross-dataset video and query domains. Then, to fully exploit the semantic correspondence of both modalities in the target domain for unsupervised localization, we devise a mutual information learning module to adaptively align the video-query pairs which are more likely to be relevant in the target domain, leading to more truly aligned target pairs and ensuring the discriminability of target features. In this way, our model can learn domain-invariant and semantic-aligned cross-modal representations. Three sets of migration experiments show that our model achieves competitive performance compared to existing methods.
IROS Conference 2024 Conference Paper
This paper presents a conceptual magnetically anchored and guided flexible endoscope for minimally invasive surgery (MIS). Leveraging both the magnetic coupling between the external and internal permanent magnets and the bending of a flexible joint, the endoscope offers improved maneuverability and adaptability within confined surgical spaces. The visual servo control allows the endoscope to autonomously track surgical instruments during procedures, thereby reducing the risk of human error and operator fatigue. First, the design and working principles of the endoscope are introduced. Subsequently, the kinematic modeling of the endoscope is derived, and the control scheme is developed based on a quadratic programming (QP) framework by taking into account both magnetic anchoring constraints and physical constraints, where the joint velocities can be resolved given the desired task velocities in a one-step way. Simulative validations are conducted to verify the effectiveness of the visual servo control for the presented endoscope tracking a static/dynamic target with physical constraints considered.
JBHI Journal 2024 Journal Article
Colon polyps in colonoscopy images exhibit significant differences in color, size, shape, appearance, and location, posing significant challenges to accurate polyp segmentation. In this paper, a Weighted Dual-branch Feature Fusion Network is proposed for polyp segmentation, named WDFF-Net, which adopts HarDNet68 as the backbone network. First, a dual-branch feature fusion network architecture is constructed, which includes a shared feature extractor and two feature fusion branches, i.e., a Progressive Feature Fusion (PFF) branch and a Scale-aware Feature Fusion (SFF) branch. The branches fuse the deep features of multiple layers for different purposes and in different ways. The PFF branch addresses the under-segmentation or over-segmentation of flat polyps with low edge contrast by iteratively fusing features from low, medium, and high layers. The SFF branch tackles the problem of drastic variations in polyp size and shape, especially the missed segmentation of small polyps. The two branches are complementary and play different roles in improving segmentation accuracy. Second, an Object-aware Attention Mechanism (OAM) is proposed to enhance the features of the target regions and suppress those of the background regions, which would otherwise interfere with segmentation performance. Third, a weighted dual-branch segmentation loss function is specifically designed, which dynamically assigns the weight factors of the loss functions of the two branches to optimize their collaborative training. Experimental results on five public colon polyp datasets demonstrate that the proposed WDFF-Net achieves superior segmentation performance with lower model complexity and faster inference speed, while maintaining good generalization ability.
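As a minimal sketch of a dynamically weighted dual-branch loss (the linear schedule used here is an assumption for illustration; the paper assigns the weight factors dynamically during collaborative training):

```python
def dual_branch_loss(loss_pff, loss_sff, step, total_steps):
    # Weighted sum of the two branch losses. As an assumed schedule, the
    # weight shifts linearly from the PFF branch toward the SFF branch
    # over the course of training.
    w_sff = step / float(total_steps)
    w_pff = 1.0 - w_sff
    return w_pff * loss_pff + w_sff * loss_sff
```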
AAAI Conference 2024 Conference Paper
Mirror detection is of great significance for avoiding false recognition of reflected objects in computer vision tasks. Existing mirror detection frameworks usually follow a supervised setting, which relies heavily on high-quality labels and suffers from poor generalization. To resolve this, we instead propose the first weakly-supervised mirror detection framework and also provide the first scribble-based mirror dataset. Specifically, we relabel 10,158 images, most of which have a labeled pixel ratio of less than 0.01 and take only about 8 seconds to label. Considering that mirror regions usually show great scale variation and are often irregular and occluded, leading to incomplete detection or over-detection, we propose a local-global feature enhancement (LGFE) module to fully capture context and details. Moreover, it is difficult to obtain the basic mirror structure from scribble annotations, and the distinction between foreground (mirror) and background (non-mirror) features is weakened by mirror reflections. Therefore, we propose a foreground-aware mask attention (FAMA) module, integrating mirror edges and semantic features to complete mirror regions and suppress the influence of backgrounds. Finally, to improve the robustness of the network, we propose a prototype contrast loss (PCL) to learn more general foreground features across images. Extensive experiments show that our network outperforms relevant state-of-the-art weakly supervised methods, and even some fully supervised methods. The dataset and codes are available at https://github.com/winter-flow/WSMD.
IROS Conference 2023 Conference Paper
Obstacle avoidance (OA) and joint-limit avoidance (JLA) are essential for redundant manipulators to ensure safe and reliable robotic operations. One solution to OA and JLA is to incorporate the involved constraints into a quadratic programming (QP), by solving which OA and JLA can be achieved. There exist a few non-iterative solvers such as zeroing neural networks (ZNNs), which can solve each sampled QP problem using only one iteration, yet no solution is suitable for OA and JLA due to the absence of some derivative information. To tackle these issues, this paper proposes a novel solution with a non-iterative neural controller termed NCP-ZNN for joint-constrained redundant manipulators. Unlike iterative methods, the neural controller involving derivative information proposed in this paper possesses some positive features including non-iterative computing and convergence with time. In this paper, the reestablished OA-JLA scheme is first introduced. Then, the design details of the neural controller are presented. After that, some comparative simulations based on a PA10 robot and an experiment based on a Franka Emika Panda robot are conducted, demonstrating that the proposed neural controller is more competent in OA and JLA.
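For intuition, a one-step resolution of joint velocities for a toy 2-DOF arm might look like the following (a plain least-squares solve of `J * qd = v` followed by clipping to velocity limits, standing in for the constrained QP; this is not the NCP-ZNN controller, and all names are illustrative):

```python
def resolve_joint_velocities(J, v, qd_min, qd_max):
    # Solve the 2x2 system J * qd = v by direct inversion, then clip
    # each joint velocity to its limits as a crude stand-in for the
    # QP's joint-constraint handling.
    a, b = J[0]
    c, d = J[1]
    det = a * d - b * c
    qd = [( d * v[0] - b * v[1]) / det,
          (-c * v[0] + a * v[1]) / det]
    return [min(max(q, lo), hi) for q, lo, hi in zip(qd, qd_min, qd_max)]
```

A real QP formulation would trade off the tracking error against the OA and JLA inequality constraints jointly, rather than clipping after the fact.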
EAAI Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for downstream tasks with unlabeled data, and it has recently become an active research area. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training data do not necessarily lead to better downstream performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen data are fed into a GNN model to enhance pre-training. The proposed pre-training pipeline is called the data-active graph pre-training (APT) framework, and is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the confidence level of the model in the data. When fed with the chosen data, on the other hand, the pre-training model grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learned from previous data. Therefore, the integration and interaction between these two components form a unified framework (APT), in which graph pre-training is performed in a progressive and iterative way. Experiment results show that the proposed APT is able to obtain an efficient pre-training model with fewer training data and better downstream performance.
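A simplified reading of the graph selector above is a top-k ranking under a mixed score (the convex combination and all names are assumptions; the actual APT selector also iterates with the pre-training model):

```python
def select_graphs(graphs, uncertainty, representativeness, k, beta=0.5):
    # Score each candidate graph by a convex mix of predictive
    # uncertainty (feedback from the pre-training model) and a
    # graph-property score, then keep the top-k candidates.
    scored = sorted(graphs,
                    key=lambda g: beta * uncertainty[g]
                                  + (1 - beta) * representativeness[g],
                    reverse=True)
    return scored[:k]
```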
NeurIPS Conference 2023 Conference Paper
We propose a foundation model named Brant for modeling intracranial recordings, which learns powerful representations of intracranial neural signals by pre-training, providing a large-scale, off-the-shelf model for medicine. Brant is the largest model in the field of brain signals and is pre-trained on a large corpus of intracranial data collected by us. Brant is designed to capture long-term temporal dependency and spatial correlation from neural signals, combining the information in both time and frequency domains. As a foundation model, Brant achieves SOTA performance on various downstream tasks (i.e., neural signal forecasting, frequency-phase forecasting, imputation, and seizure detection), showing generalization ability to a broad range of tasks. The low-resource label analysis and representation visualization further illustrate the effectiveness of our pre-training strategy. In addition, we explore the effect of model size to show that a larger model with a higher capacity can lead to performance improvements on our dataset. The source code and pre-trained weights are available at: https://zju-brainnet.github.io/Brant.github.io/.
AAAI Conference 2023 Conference Paper
Graph Neural Networks (GNNs) are powerful tools for graph representation learning. Despite their rapid development, GNNs also face some challenges, such as over-fitting, over-smoothing, and non-robustness. Previous works indicate that these problems can be alleviated by random dropping methods, which integrate augmented data into models by randomly masking parts of the input. However, some open problems of random dropping on GNNs remain to be solved. First, it is challenging to find a universal method that is suitable for all cases, considering the divergence of different datasets and models. Second, augmented data introduced to GNNs causes incomplete coverage of parameters and an unstable training process. Third, there is no theoretical analysis of the effectiveness of random dropping methods on GNNs. In this paper, we propose a novel random dropping method called DropMessage, which performs dropping operations directly on the propagated messages during the message-passing process. More importantly, we find that DropMessage provides a unified framework for most existing random dropping methods, based on which we give a theoretical analysis of their effectiveness. Furthermore, we elaborate on the superiority of DropMessage: it stabilizes the training process by reducing sample variance, and it preserves information diversity from the perspective of information theory, making it a theoretical upper bound of other methods. To evaluate our proposed method, we conduct experiments on multiple tasks across five public datasets and two industrial datasets with various backbone models. The experimental results show that DropMessage has the advantages of both effectiveness and generalization, and can significantly alleviate the problems mentioned above. A detailed version with full appendix can be found on arXiv: https://arxiv.org/abs/2204.10037.
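The core dropping step can be sketched in a few lines (a minimal illustration of message-level dropping on a dense per-edge message matrix; `drop_message` is our hypothetical name, not the paper's API):

```python
import random

def drop_message(messages, drop_rate=0.5, seed=0):
    """Independently zero individual entries of the propagated message
    matrix during message passing, rescaling survivors by 1/(1-p) so
    the expected message is unchanged. `messages` is a list of
    per-edge message vectors (illustrative shape)."""
    rng = random.Random(seed)
    keep = 1.0 - drop_rate
    out = []
    for msg in messages:
        out.append([m / keep if rng.random() < keep else 0.0 for m in msg])
    return out

# Toy message matrix: 3 edges, 2-dim messages.
msgs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
dropped = drop_message(msgs, drop_rate=0.5)
```

Because each entry of each message is masked independently, this is finer-grained than masking whole nodes or edges, which is one way to see how message-level dropping can subsume those schemes as special cases.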
AAAI Conference 2023 Conference Paper
Traditional low-rank methods discard residuals as corruption, but we find that low-rank residuals actually retain image edges together with the corrupt components. Filtering out such structural information can therefore hamper the discriminative details in images, especially under heavy corruption. To address this limitation, this paper proposes a novel method named ESL-LRR, which preserves image edges by finding image projections from low-rank residuals. Specifically, our approach is built in a manifold learning framework where residuals are regarded as another view of the image data. Edge-preserved image projections are then pursued using a dynamic affinity graph regularization to capture the more accurate similarity between residuals while suppressing the influence of corrupt ones. With this adaptive approach, the proposed method can also find the intrinsic low-rank representation of images, along with highly discriminative edge-preserved projections. As a result, a new classification strategy is introduced, aligning both modalities to enhance accuracy. Experiments are conducted on several benchmark image datasets, including MNIST, LFW, and COIL100. The results show that the proposed method has clear advantages over compared state-of-the-art (SOTA) methods, such as Low-Rank Embedding (LRE), Low-Rank Preserving Projection via Graph Regularized Reconstruction (LRPP_GRR), and Feature Selective Projection (FSP), with more than 2% improvement, particularly in corrupted cases.
NeurIPS Conference 2023 Conference Paper
Active learning (AL) methods have been proven to be an effective way to reduce the labeling effort by intelligently selecting valuable instances for annotation. Despite their great success in in-distribution (ID) scenarios, AL methods suffer from performance degradation in many real-world applications because out-of-distribution (OOD) instances are inevitably contained in unlabeled data, which may lead to inefficient sampling. Therefore, several attempts have explored open-set AL by strategically selecting pure ID instances while filtering out OOD instances. However, concentrating solely on selecting pseudo-ID instances may constrain the training of both the ID classifier and the OOD detector. To address this issue, we propose a simple yet effective sampling scheme, Progressive Active Learning (PAL), which employs a progressive sampling mechanism to leverage the active selection of valuable OOD instances. The proposed PAL measures unlabeled instances by synergistically evaluating their informativeness and representativeness, and thus it can balance the pseudo-ID and pseudo-OOD instances in each round to enhance the capacity of both the ID classifier and the OOD detector. Extensive experiments on various open-set AL scenarios demonstrate the effectiveness of the proposed PAL compared with state-of-the-art methods. The code is available at https://github.com/njustkmg/PAL.
NeurIPS Conference 2023 Conference Paper
Automated seizure detection is of great importance to epilepsy diagnosis and treatment. An emerging method used in seizure detection, stereoelectroencephalography (SEEG), can provide detailed and stereoscopic brainwave information. However, modeling SEEG in clinical scenarios faces challenges such as huge domain shift between different patients and dramatic pattern evolution among different brain areas. In this study, we propose a Pretraining-based model for Patient-independent seizure detection (PPi) to address these challenges. First, we design two novel self-supervised tasks which can extract rich information from abundant SEEG data while preserving the unique characteristics of brain signals recorded from different brain areas. Then, two techniques, channel background subtraction and brain region enhancement, are proposed to effectively tackle the domain shift problem. Extensive experiments show that PPi outperforms the SOTA baselines on two public datasets and a real-world clinical dataset collected by us, which demonstrates the effectiveness and practicability of PPi. Finally, visualization analysis illustrates the rationality of the two domain generalization techniques.
JBHI Journal 2023 Journal Article
Early detection of COVID-19 is an ongoing area of research that can help with triage, monitoring, and general health assessment of potential patients, and may reduce operational strain on hospitals coping with the coronavirus pandemic. Different machine learning techniques have been used in the literature to detect potential cases of coronavirus using routine clinical data (blood tests and vital signs measurements). Data breaches and information leakage when using these models can bring reputational damage and cause legal issues for hospitals. Despite this, protecting healthcare models against leakage of potentially sensitive information is an understudied research area. In this study, two machine learning techniques that aim to predict a patient's COVID-19 status are examined. Using adversarial training, robust deep learning architectures are explored with the aim of protecting attributes related to demographic information about the patients. The two models examined in this work are intended to preserve sensitive information against adversarial attacks and information leakage. In a series of experiments using datasets from the Oxford University Hospitals (OUH), Bedfordshire Hospitals NHS Foundation Trust (BH), University Hospitals Birmingham NHS Foundation Trust (UHB), and Portsmouth Hospitals University NHS Trust (PUH), two neural networks are trained and evaluated. These networks predict PCR test results using information from basic laboratory blood tests and vital signs collected from a patient upon arrival to the hospital. The level of privacy each of the models can provide is assessed, and the efficacy and robustness of the proposed architectures are compared with a relevant baseline. One of the main contributions of this work is the particular focus on the development of effective COVID-19 detection models with built-in mechanisms to selectively protect sensitive attributes against adversarial attacks.
The results on the hold-out test set and external validation confirmed that using adversarial learning had no impact on the generalisability of the model.
UAI Conference 2023 Conference Paper
Multi-dimensional classification (MDC) can be employed in a range of applications where one needs to predict multiple class variables for each given instance. Many existing MDC methods suffer from at least one of the following drawbacks: inaccuracy, poor scalability, applicability to only certain types of data, hard interpretability, or lack of probabilistic (uncertainty) estimates. This paper is an attempt to address all these disadvantages simultaneously. We propose a formal framework for probabilistic MDC in which learning an optimal multi-dimensional classifier can be decomposed, without loss of generality, into learning a set of (smaller) single-variable multi-class probabilistic classifiers and a directed acyclic graph. Current and future developments of both probabilistic classification and graphical model learning can directly enhance our framework, which is flexible and provably optimal. A collection of experiments is conducted to highlight the usefulness of this MDC framework.
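The proposed decomposition can be sketched as follows (a toy illustration with our own hypothetical names, `predict_mdc` and stand-in classifiers): each class variable is visited in a topological order of the learned DAG, and its single-variable classifier conditions on the instance plus the predictions of its parents.

```python
def predict_mdc(x, dag_parents, classifiers):
    """Predict multiple class variables by visiting them in topological
    order of a DAG and feeding each single-variable classifier the
    instance plus its parents' predictions. `dag_parents` is assumed
    to be listed in a valid topological order."""
    preds = {}
    for var in dag_parents:
        parent_preds = [preds[p] for p in dag_parents[var]]
        preds[var] = classifiers[var](x, parent_preds)
    return preds

# Toy example: variable "b" depends on "a"; classifiers are stand-ins.
clf = {
    "a": lambda x, par: int(x > 0),
    "b": lambda x, par: par[0] ^ 1,   # flips a's prediction
}
out = predict_mdc(0.7, {"a": [], "b": ["a"]}, clf)
```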
EAAI Journal 2023 Journal Article
AAAI Conference 2023 Conference Paper
Videos such as movies or TV episodes usually need their long storyline divided into cohesive units, i.e., scenes, to facilitate the understanding of video semantics. The key challenge lies in finding the boundaries of scenes by comprehensively considering the complex temporal structure and semantic information. To this end, we introduce a novel Context-Aware Transformer (CAT) with a self-supervised learning framework to learn high-quality shot representations for generating well-bounded scenes. More specifically, we design the CAT with local-global self-attentions, which can effectively consider both the long-term and short-term context to improve shot encoding. For training the CAT, we adopt a self-supervised learning scheme. First, we leverage shot-to-scene level pretext tasks to facilitate pre-training with pseudo boundaries, which guides CAT to learn discriminative shot representations that maximize intra-scene similarity and inter-scene discrimination in an unsupervised manner. Then, we transfer contextual representations for fine-tuning the CAT with supervised data, which encourages CAT to accurately detect boundaries for scene segmentation. As a result, CAT is able to learn context-aware shot representations and provides global guidance for scene segmentation. Our empirical analyses show that CAT achieves state-of-the-art performance on the scene segmentation task on the MovieNet dataset, e.g., offering a 2.15-point improvement in AP.
AAAI Conference 2023 Conference Paper
Anti-money laundering (AML) systems play a critical role in safeguarding the global economy. As money laundering is considered one of the top group crimes, there is a crucial need to discover the money laundering sub-network behind a particular money laundering transaction for a robust AML system. However, existing rule-based methods for money laundering sub-network discovery are heavily based on domain knowledge and may lag behind the modus operandi of launderers. Therefore, in this work, we first address the money laundering sub-network discovery problem with a neural network based approach, and propose an AML framework, AMAP, equipped with an adaptive sub-network proposer. In particular, we design an adaptive sub-network proposer guided by a supervised contrastive loss to discriminate money laundering transactions from massive benign transactions. We conduct extensive experiments on real-world datasets from AliPay of Ant Group. The results demonstrate the effectiveness of our AMAP in both money laundering transaction detection and money laundering sub-network discovery. The learned framework, which yields money laundering sub-networks from massive transaction networks, leads to more comprehensive risk coverage and deeper insight into money laundering strategies.
NeurIPS Conference 2023 Conference Paper
In recent years, prompt tuning has sparked a research surge in adapting pre-trained models. Unlike the unified pre-training strategy employed in the language field, the graph field exhibits diverse pre-training strategies, posing challenges in designing appropriate prompt-based tuning methods for graph neural networks. While some pioneering work has devised specialized prompting functions for models that employ edge prediction as their pre-training task, these methods are limited to specific pre-trained GNN models and lack broader applicability. In this paper, we introduce a universal prompt-based tuning method called Graph Prompt Feature (GPF) for pre-trained GNN models under any pre-training strategy. GPF operates on the input graph's feature space and can theoretically achieve an equivalent effect to any form of prompting function. Consequently, we no longer need to specify the prompting function corresponding to each pre-training strategy explicitly. Instead, we employ GPF to obtain the prompted graph for the downstream task in an adaptive manner. We provide rigorous derivations to demonstrate the universality of GPF and guarantee its effectiveness. The experimental results under various pre-training strategies indicate that our method performs better than fine-tuning, with an average improvement of about 1.4% in full-shot scenarios and about 3.2% in few-shot scenarios. Moreover, our method significantly outperforms existing specialized prompt-based tuning methods when applied to models utilizing the pre-training strategy they specialize in. These numerous advantages position our method as a compelling alternative to fine-tuning for downstream adaptations.
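Since GPF operates directly on the input feature space, its forward pass reduces to adding one shared learnable vector to every node's features (a minimal sketch under our own naming; in GPF the prompt vector is learned by gradient descent on the downstream loss while the pre-trained GNN stays frozen):

```python
def apply_gpf(node_features, prompt):
    """Add a single shared, learnable prompt vector p to every node's
    input features: x_i' = x_i + p. The prompted graph is then fed to
    the frozen pre-trained GNN."""
    return [[x + p for x, p in zip(feat, prompt)] for feat in node_features]

X = [[1.0, 0.0], [0.0, 1.0]]   # toy node features
p = [0.1, -0.2]                # prompt vector (would be learned)
X_prompted = apply_gpf(X, p)
```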
EAAI Journal 2022 Journal Article
EAAI Journal 2022 Journal Article
IJCAI Conference 2022 Conference Paper
Graph neural networks (GNNs) have been intensively studied in various real-world tasks. However, the homophily assumption of GNNs' aggregation function limits their representation learning ability on heterophily graphs. In this paper, we shed light on path-level patterns in graphs that can explicitly reflect rich semantic and structural information. We therefore propose a novel Structure-aware Path Aggregation Graph Neural Network (PathNet) aiming to generalize GNNs to both homophily and heterophily graphs. Specifically, we first introduce a maximal entropy path sampler, which helps us sample a number of paths containing structural context. Then, we introduce a structure-aware recurrent cell consisting of order-preserving and distance-aware components to learn the semantic information of neighborhoods. Finally, after path encoding, we model the preference of different paths for the target node. Experimental results demonstrate that our model achieves superior performance in node classification on both heterophily and homophily graphs.
AAAI Conference 2022 Conference Paper
Adversarial attacks on graphs have attracted considerable research interest. Existing works assume the attacker is either (partly) aware of the victim model, or able to send queries to it. These assumptions are, however, unrealistic. To bridge the gap between theoretical graph attacks and real-world scenarios, in this work, we propose a novel and more realistic setting: the strict black-box graph attack, in which the attacker has no knowledge about the victim model at all and is not allowed to send any queries. To design such an attack strategy, we first propose a generic graph filter to unify different families of graph-based models. The strength of attacks can then be quantified by the change in the graph filter before and after the attack. By maximizing this change, we are able to find an effective attack strategy, regardless of the underlying model. To solve this optimization problem, we also propose a relaxation technique and approximation theories to reduce the difficulty as well as the computational expense. Experiments demonstrate that, even with no exposure to the model, the Macro-F1 drops 5.5% in node classification and 29.5% in graph classification, which is a significant result compared with existing works.
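A toy version of the attack objective (our own sketch; the paper's generic graph filter and its relaxation are more general than the plain normalized adjacency used here) measures how much flipping a single edge perturbs the filter:

```python
def norm_adj(adj):
    """Symmetrically normalized adjacency D^(-1/2) A D^(-1/2)
    (toy filter; assumes no isolated nodes)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / ((deg[i] * deg[j]) ** 0.5) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def filter_change(adj, edge):
    """Frobenius norm of the change in the graph filter caused by
    flipping one undirected edge -- a toy proxy for the model-free
    attack objective described above."""
    i, j = edge
    flipped = [row[:] for row in adj]
    flipped[i][j] = flipped[j][i] = 1 - flipped[i][j]
    F0, F1 = norm_adj(adj), norm_adj(flipped)
    n = len(adj)
    return sum((F1[a][b] - F0[a][b]) ** 2 for a in range(n) for b in range(n)) ** 0.5

# Flipping one edge of a triangle graph changes the filter measurably.
tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
change = filter_change(tri, (0, 1))
```

The attacker would then pick the edge flips that maximize this change, with no access to the victim model at all.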
IJCAI Conference 2022 Conference Paper
Anomaly detection in graphs has attracted considerable interest in both academia and industry due to its wide applications in numerous domains ranging from finance to biology. Meanwhile, graph neural networks (GNNs) are emerging as a powerful tool for modeling graph data. A natural and fundamental question that arises here is: can abnormality be detected by graph neural networks? In this paper, we aim to answer this question, which is nontrivial. As many existing works have explored, graph neural networks can be seen as filters for graph signals that favor low-frequency components in graphs. In other words, GNNs smooth the signals of adjacent nodes. However, an anomaly in a graph intuitively tends to be dissimilar to its neighbors, which are mostly normal samples. This thereby conflicts with the general assumption of traditional GNNs. To solve this, we propose a novel Adaptive Multi-frequency Graph Neural Network (AMNet), aiming to capture both low-frequency and high-frequency signals and adaptively combine signals of different frequencies. Experimental results on real-world datasets demonstrate that our model achieves a significant improvement compared with several state-of-the-art baseline methods.
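The frequency-adaptive idea can be sketched with one low-pass and one high-pass filter mixed by a learned weight (a minimal illustration with hypothetical names; AMNet's actual filters and its per-node adaptive mixing are richer than this plain split):

```python
def amnet_filter(adj_norm, x, alpha):
    """Mix a low-pass filter (the normalized adjacency, which smooths
    neighbors) and a high-pass filter (I minus it, which amplifies
    disagreement with neighbors) with mixing weight alpha."""
    n = len(x)
    low = [sum(adj_norm[i][j] * x[j] for j in range(n)) for i in range(n)]
    high = [x[i] - low[i] for i in range(n)]
    return [alpha * low[i] + (1 - alpha) * high[i] for i in range(n)]

# Two-node graph with an anti-smooth signal: the high-pass branch
# preserves the disagreement that a pure low-pass GNN would smooth away.
A_hat = [[0.0, 1.0], [1.0, 0.0]]
out = amnet_filter(A_hat, [1.0, -1.0], alpha=0.5)
```

A small alpha emphasizes the high-frequency branch, which is exactly what helps on anomalies that disagree with their (mostly normal) neighborhoods.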
NeurIPS Conference 2022 Conference Paper
Graph Anomaly Detection (GAD) has recently become a hot research spot due to its practicability and theoretical value. Since GAD emphasizes real applications and the rarity of anomalous samples, enriching the variety of its datasets is fundamental. Thus, this paper presents DGraph, a real-world dynamic graph in the finance domain. DGraph overcomes many limitations of current GAD datasets. It contains about 3M nodes, 4M dynamic edges, and 1M ground-truth nodes. We provide a comprehensive observation of DGraph, revealing that anomalous nodes and normal nodes generally have different structures, neighbor distributions, and temporal dynamics. Moreover, it suggests that the 2M background nodes are also essential for detecting fraudsters. Furthermore, we conduct extensive experiments on DGraph. These observations and experiments demonstrate that DGraph can propel GAD research forward and enable in-depth exploration of anomalous nodes.
AAAI Conference 2022 System Paper
Conversational recommender systems (CRSs) have received extensive attention in recent years. However, most of the existing works focus on various deep learning models, which are largely limited by the requirement of large-scale human-annotated datasets. Such methods are not able to deal with the cold-start scenarios in industrial products. To alleviate the problem, we propose FORCE, a Framework Of Rule-based Conversational rEcommender system that helps developers quickly build CRS bots through simple configuration. We conduct experiments on two datasets in different languages and domains to verify its effectiveness and usability.
JBHI Journal 2022 Journal Article
Nowadays, with the development of various kinds of sensors in smartphones and wearable devices, human activity recognition (HAR) has been widely researched and has numerous applications in healthcare, smart cities, etc. Many techniques based on hand-crafted feature engineering or deep neural networks have been proposed for sensor-based HAR. However, these existing methods usually recognize activities offline, which means the whole dataset must be collected before training, occupying large-capacity storage space. Moreover, once offline model training is finished, the trained model cannot recognize new activities unless it is retrained from scratch, incurring a high cost in time and space. In this paper, we propose a multi-modality incremental learning model, called HarMI, with continuous learning ability. The proposed HarMI model can start training quickly with little storage space and easily learn new activities without storing previous training data. In detail, we first adopt an attention mechanism to align heterogeneous sensor data with different frequencies. In addition, to overcome catastrophic forgetting in incremental learning, HarMI utilizes elastic weight consolidation and canonical correlation analysis from a multi-modality perspective. Extensive experiments based on two public datasets demonstrate that HarMI achieves superior performance compared with several state-of-the-art methods.
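The elastic-weight-consolidation part of the forgetting defense can be written as a quadratic penalty anchoring parameters to their values after the previous task (a standard EWC sketch with illustrative names, not HarMI's exact multi-modality formulation):

```python
def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Elastic weight consolidation penalty against catastrophic
    forgetting: each parameter is anchored to its old value with a
    strength given by its (diagonal) Fisher information, so parameters
    important to the old task resist change the most."""
    return 0.5 * lam * sum(f * (t - t0) ** 2
                           for t, t0, f in zip(theta, theta_old, fisher))

# The first parameter (high Fisher) is penalized far more than the second.
pen = ewc_penalty([1.0, 2.0], [0.0, 0.0], fisher=[10.0, 0.1])
```

During incremental training this penalty is simply added to the loss of the new activity classes, so no previous training data needs to be stored.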
NeurIPS Conference 2022 Conference Paper
Recent works on machine learning for combinatorial optimization have shown that learning-based approaches can outperform heuristic methods in terms of speed and performance. In this paper, we consider the problem of finding an optimal topological order on a directed acyclic graph (DAG), with a focus on the memory minimization problem which arises in compilers. We propose an end-to-end machine learning based approach for topological ordering using an encoder-decoder framework. Our encoder is a novel attention based graph neural network architecture called Topoformer which uses different topological transforms of a DAG for message passing. The node embeddings produced by the encoder are converted into node priorities which are used by the decoder to generate a probability distribution over topological orders. We train our model on a dataset of synthetically generated graphs called layered graphs. We show that our model outperforms, or is on par with, several topological ordering baselines while being significantly faster on synthetic graphs with up to 2k nodes. We also train and test our model on a set of real-world computation graphs, showing performance improvements.
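Decoding a topological order from node priorities can be sketched greedily (our own toy decoder with hypothetical names; the paper's decoder produces a probability distribution over orders rather than a single greedy pick):

```python
def priority_topo_order(parents, priority):
    """Greedily decode a topological order from node priorities: at each
    step, emit the ready node (all parents already placed) with the
    highest priority. `parents` maps each node to its parent list."""
    placed, order = set(), []
    nodes = list(parents)
    while len(order) < len(nodes):
        ready = [v for v in nodes
                 if v not in placed and all(p in placed for p in parents[v])]
        best = max(ready, key=lambda v: priority[v])
        order.append(best)
        placed.add(best)
    return order

# Diamond DAG a -> {b, c} -> d; priorities break the b/c tie.
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
order = priority_topo_order(dag, {"a": 0, "b": 2, "c": 1, "d": 5})
```

In the memory-minimization setting, priorities that schedule short-lived intermediates earlier translate directly into a lower peak memory footprint.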
AAAI Conference 2022 Conference Paper
Deep semi-supervised learning (SSL) aims to utilize a sizeable unlabeled set to train deep networks, thereby reducing the dependence on labeled instances. However, the unlabeled set often carries unseen classes that cause the deep SSL algorithm to lose generalization ability. Previous works focus on the data level: they attempt to remove unseen-class data or assign them lower weight, but they cannot eliminate the adverse effects of such data on the SSL algorithm. Rather than focusing on the data level, this paper turns attention to the model parameter level. We find that only partial parameters are essential for seen-class classification, termed safe parameters. In contrast, the other parameters tend to fit irrelevant data, termed harmful parameters. Driven by this insight, we propose Safe Parameter Learning (SPL) to discover safe parameters and make the harmful parameters inactive, such that we can mitigate the adverse effects caused by unseen-class data. Specifically, we first design an effective strategy to divide all parameters in the pre-trained SSL model into safe and harmful ones. Then, we introduce a bi-level optimization strategy to update the safe parameters and kill the harmful parameters. Extensive experiments show that SPL outperforms the state-of-the-art SSL methods on all the benchmarks by a large margin. Moreover, experiments demonstrate that SPL can be integrated into the most popular deep SSL networks and be easily extended to handle other cases of class distribution mismatch.
AAAI Conference 2022 Conference Paper
There are two key issues that limit further improvements in the performance of existing rotational detectors: 1) periodic sudden changes of the parameters in the rotating bounding box (RBBox) definition cause a numerical discontinuity in the loss (such as the smooth L1 loss); 2) there is an optimization gap between the RBBox regression loss and the evaluation metrics. In this paper, we define a new distance formulation between two convex polygons describing their degree of overlap and non-overlap. Based on this smooth distance, we propose a loss called the Polygon-to-Polygon distance loss (P2P Loss). The distance is derived from the area sum of triangles specified by the vertexes of one polygon and the edges of the other. Therefore, the P2P Loss is continuous, differentiable, and inherently free from any RBBox definition. Our P2P Loss is not only consistent with the detection metrics but also able to measure how far, as well as how similar, one RBBox is from another even when they are completely non-overlapping. These features allow RetinaNet using the P2P Loss to achieve 79.15% mAP on the DOTA dataset, which is quite competitive compared with many state-of-the-art rotated object detectors.
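Our reading of the distance, sketched for 2-D convex polygons (hypothetical function names; the paper's exact formulation and weighting may differ), sums the areas of triangles formed by each vertex of one polygon with each edge of the other:

```python
def tri_area(p, q, r):
    """Unsigned area of triangle pqr via the 2-D cross product."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])) / 2.0

def p2p_distance(P, Q):
    """Symmetrized sum of triangle areas formed by each vertex of one
    polygon with each edge of the other -- a sketch of the smooth
    polygon distance underlying the P2P Loss."""
    def one_way(verts, poly):
        n = len(poly)
        return sum(tri_area(v, poly[k], poly[(k + 1) % n])
                   for v in verts for k in range(n))
    return one_way(P, Q) + one_way(Q, P)

# The distance grows as one unit square is translated away from another.
S = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
S_far = [(x + 10.0, y) for x, y in S]
```

Because every term is a polynomial in the vertex coordinates, the distance stays smooth even when the polygons do not overlap, which is the property that removes the loss discontinuity.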
NeurIPS Conference 2022 Conference Paper
The "Patient Instruction" (PI), which contains critical instructional information provided both to carers and to the patient at the time of discharge, is essential for the patient to manage their condition outside hospital. An accurate and easy-to-follow PI can improve the self-management of patients, which can in turn reduce hospital readmission rates. However, writing an appropriate PI can be extremely time consuming for physicians, and is subject to being incomplete or error-prone for (potentially overworked) physicians. Therefore, we propose a new task that can provide an objective means of avoiding incompleteness, while reducing clinical workload: the automatic generation of the PI, which is imagined as a document that the clinician can review, modify, and approve as necessary (rather than taking the human "out of the loop"). We build a benchmark clinical dataset and propose Re3Writer, which imitates the working patterns of physicians to first retrieve related working experience from historical PIs written by physicians, and then reason over related medical knowledge. Finally, it refines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the PI for a previously unseen patient according to their health records during hospitalization. Our experiments show that, using our method, the performance of 6 different models can be substantially boosted across all metrics, with up to 20%, 11%, and 19% relative improvements in BLEU-4, ROUGE-L, and METEOR, respectively. Meanwhile, we show results from human evaluations to measure the effectiveness in terms of its usefulness for clinical practice. The code is available at https://github.com/AI-in-Health/Patient-Instructions.
AAAI Conference 2022 Conference Paper
In this paper, we study the zero-shot sketch-based image retrieval (ZS-SBIR) task, which retrieves natural images related to sketch queries from unseen categories. In the literature, convolutional neural networks (CNNs) have become the de facto standard, and they are either trained end-to-end or used to extract pre-trained features for images and sketches. However, CNNs are limited in modeling the global structural information of objects due to the intrinsic locality of convolution operations. To this end, we propose a Transformer-based approach called Three-Way Vision Transformer (TVT) to leverage the ability of Vision Transformer (ViT) to model global contexts through its global self-attention mechanism. Going beyond simply applying ViT to this task, we propose a token-based strategy of adding fusion and distillation tokens and making them complementary to each other. Specifically, we integrate three ViTs, which are pre-trained on data of each modality, into a three-way pipeline through the processes of distillation and multi-modal hypersphere learning. The distillation process supervises the fusion ViT (a ViT with an extra fusion token) with soft targets from modality-specific ViTs, which prevents the fusion ViT from catastrophic forgetting. Furthermore, our method learns a multi-modal hypersphere by performing inter- and intra-modal alignment without loss of uniformity, which aims to bridge the modal gap between sketches and images and avoid dimensional collapse. Extensive experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, demonstrate the superiority of our TVT method over state-of-the-art ZS-SBIR methods.
AAAI Conference 2022 Conference Paper
Unsupervised/self-supervised pre-training methods for graph representation learning have recently attracted increasing research interest, and they can be generalized to various downstream applications. Yet, the adversarial robustness of such pre-trained graph learning models remains largely unexplored. More importantly, most existing defense techniques for end-to-end graph representation learning methods require pre-specified label definitions, and thus cannot be directly applied to the pre-training methods. In this paper, we propose an unsupervised defense technique to robustify pre-trained deep graph models, so that perturbations on the input graph can be successfully identified and blocked before the model is applied to different downstream tasks. Specifically, we introduce a mutual information-based measure, graph representation vulnerability (GRV), to quantify the robustness of graph encoders on the representation space. We then formulate an optimization problem to learn the graph representation by carefully balancing the trade-off between the expressive power and the robustness (i.e., GRV) of the graph encoder. The discrete nature of graph topology and the joint space of graph data make the optimization problem intractable to solve. To handle this difficulty and to reduce computational expense, we further relax the problem and thus provide an approximate solution. Additionally, we explore a provable connection between the robustness of the unsupervised graph encoder and that of models on downstream tasks. Extensive experiments demonstrate that even without access to labels and tasks, our model is still able to enhance robustness against adversarial attacks on three downstream tasks (i.e., node classification, link prediction, and community detection) by an average of +16.5% compared with existing methods.
NeurIPS Conference 2022 Conference Paper
Most existing point cloud completion methods assume the input partial point cloud is clean, which is not the case in practice, and are generally based on supervised learning. In this paper, we present an unsupervised generative adversarial autoencoding network, named UGAAN, which completes a partial point cloud contaminated by its surroundings in real scenes and cuts out the object simultaneously, using only artificial CAD models as assistance. The generator of UGAAN learns to predict complete point clouds on real data from both the discriminator and the autoencoding process on artificial data. The latent codes from the generator are also fed to the discriminator, which makes the encoder extract only object features rather than noise. We also devise a refiner for generating a better complete cloud, with a segmentation module to separate the object from the background. We train our UGAAN on one real-scene dataset and evaluate it on the other two. Extensive experiments and visualizations demonstrate our superiority, generalization, and robustness. Comparisons against the previous method show that our method achieves state-of-the-art performance on unsupervised point cloud completion and segmentation on real data.
AAAI Conference 2021 Conference Paper
In many real-world applications, the amount of data available for training is often limited, and thus inductive bias and auxiliary knowledge are much needed for regularizing model training. One popular regularization method is to impose prior distribution assumptions on model parameters, and many recent works also attempt to regularize training by integrating external knowledge into specific neurons. However, existing regularization methods fail to take into account the interaction between connected neuron pairs, which is invaluable internal knowledge for adaptive regularization and better representation learning as training progresses. In this paper, we explicitly take into account the interaction between connected neurons, and propose an adaptive internal-knowledge-driven regularization method, CORR-Reg. The key idea of CORR-Reg is to give a higher significance weight to connections of more correlated neuron pairs. The significance weights adaptively identify more important input neurons for each neuron. Instead of regularizing connection model parameters with a static strength such as weight decay, CORR-Reg imposes weaker regularization strength on more significant connections. As a consequence, neurons attend to more informative input features and thus learn more diversified and discriminative representations. We derive CORR-Reg within the Bayesian inference framework and propose a novel optimization algorithm with the Lagrange multiplier method and Stochastic Gradient Descent. Extensive evaluations on diverse benchmark datasets and neural network structures show that CORR-Reg achieves significant improvement over state-of-the-art regularization methods.
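The key idea reduces to a correlation-weighted version of weight decay (a minimal sketch with illustrative names; CORR-Reg derives the significance weights via Bayesian inference rather than using raw correlations directly):

```python
def corr_reg(weights, corr, lam=0.01):
    """Correlation-driven adaptive weight decay: connections between
    more correlated neuron pairs (higher |corr|, higher significance)
    receive a weaker decay strength, while weakly correlated
    connections are shrunk harder."""
    return lam * sum((1.0 - abs(c)) * w * w for w, c in zip(weights, corr))

# The highly correlated connection (corr 0.9) contributes far less
# penalty than the uncorrelated one, so it is preserved.
penalty = corr_reg([1.0, 1.0], [0.9, 0.0], lam=1.0)
```

Contrast this with plain weight decay, which would shrink both connections with the same static strength regardless of how informative they are.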
AAAI Conference 2021 Conference Paper
Open Set Domain Adaptation (OSDA) is a challenging domain adaptation setting which allows the existence of unknown classes in the target domain. Although existing OSDA methods are good at classifying samples of known classes, they ignore the classification ability for the unknown samples, making them unbalanced OSDA methods. To alleviate this problem, we propose a balanced OSDA method which can recognize the unknown samples while maintaining high classification performance for the known samples. Specifically, to reduce the domain gaps, we first project the features to a hyperspherical latent space. In this space, we propose to bound the centroid deviation angles to not only increase the intra-class compactness but also enlarge the inter-class margins. With the bounded centroid deviation angles, we employ the statistical Extreme Value Theory to recognize the unknown samples that are misclassified into known classes. In addition, to learn better centroids, we propose an improved centroid update strategy based on sample reweighting and an adaptive update rate to cooperate with centroid alignment. Experimental results on three OSDA benchmarks verify that our method can significantly outperform the compared methods and reduce the proportion of the unknown samples being misclassified into known classes.
ICLR Conference 2021 Conference Paper
We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ). PTQ usually requires a small subset of training data but produces less powerful quantized models than Quantization-Aware Training (QAT). In this work, we propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time. BRECQ leverages the basic building blocks in neural networks and reconstructs them one by one. In a comprehensive theoretical study of the second-order error, we show that BRECQ achieves a good balance between cross-layer dependency and generalization error. To further exploit the power of quantization, the mixed-precision technique is incorporated in our framework by approximating the inter-layer and intra-layer sensitivity. Extensive experiments on various handcrafted and searched neural architectures are conducted for both image classification and object detection tasks. For the first time, we show that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 models comparable with QAT while enjoying 240 times faster production of quantized models. Codes are available at https://github.com/yhhhli/BRECQ.
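As a rough illustration of the post-training calibration setting the abstract describes (not the BRECQ block-reconstruction algorithm itself; the function names and grid search below are assumptions for this sketch), one can fake-quantize a weight block to INT2 and pick the scale that minimizes reconstruction error on that block alone:

```python
import numpy as np

def quantize_int2(w, scale):
    """Fake-quantize w to signed 2-bit integer levels {-2, -1, 0, 1}."""
    q = np.clip(np.round(w / scale), -2, 1)
    return q * scale

def calibrate_scale(w, n_grid=80):
    """Grid-search the scale minimizing reconstruction MSE for one block,
    a rough stand-in for block-wise PTQ calibration."""
    best_scale, best_err = None, np.inf
    for s in np.linspace(1e-3, np.abs(w).max(), n_grid):
        err = np.mean((w - quantize_int2(w, s)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err
```

The point of calibrating per block rather than per layer is exactly the cross-layer-dependency versus generalization trade-off the abstract mentions.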
NeurIPS Conference 2021 Conference Paper
Adversarial attacks on graphs have posed a major threat to the robustness of graph machine learning (GML) models. Naturally, there is an ever-escalating arms race between attackers and defenders. However, the strategies behind both sides are often not fairly compared under the same and realistic conditions. To bridge this gap, we present the Graph Robustness Benchmark (GRB) with the goal of providing a scalable, unified, modular, and reproducible evaluation for the adversarial robustness of GML models. GRB standardizes the process of attacks and defenses by 1) developing scalable and diverse datasets, 2) modularizing the attack and defense implementations, and 3) unifying the evaluation protocol in refined scenarios. By leveraging the modular GRB pipeline, the end-users can focus on the development of robust GML models with automated data processing and experimental evaluations. To support open and reproducible research on graph adversarial learning, GRB also hosts public leaderboards for different scenarios. As a starting point, we provide various baseline experiments to benchmark the state-of-the-art techniques. GRB is an open-source benchmark and all datasets, code, and leaderboards are available at https://cogdl.ai/grb/home.
AAAI Conference 2021 System Paper
Rule-based dialogue management is still the most popular solution for industrial task-oriented dialogue systems because of its interpretability. However, it is hard for developers to maintain the dialogue logic when the scenarios get more and more complex. On the other hand, data-driven dialogue systems, usually with end-to-end structures, are popular in academic research and handle complex conversations more easily, but such methods require plenty of training data and their behaviors are less interpretable. In this paper, we propose a method that leverages the strengths of both rule-based and data-driven dialogue managers (DM). We first introduce the DM of Carina Dialog System (CDS, an advanced industrial dialogue system built by Microsoft). Then we propose the “model-trigger” design to make the DM trainable and thus scalable to scenario changes. Furthermore, we integrate pre-trained models and empower the DM with few-shot capability. The experimental results demonstrate the effectiveness and strong few-shot capability of our method.
AAAI Conference 2021 Conference Paper
Contract consistency is important in ensuring the legal validity of the contract. In many scenarios, a contract is written by filling the blanks in a precompiled form. Due to carelessness, two blanks that should be filled with the same (or different) content may be incorrectly filled with different (or same) content. This will result in the issue of contract inconsistencies, which may severely impair the legal validity of the contract. Traditional methods to address this issue mainly rely on manual contract review, which is labor-intensive and costly. In this work, we formulate a novel Contract Inconsistency Checking (CIC) problem, and design an end-to-end framework, called Pair-wise Blank Resolution (PBR), to solve the CIC problem with high accuracy. Our PBR model contains a novel BlankCoder to address the challenge of modeling meaningless blanks. BlankCoder adopts a two-stage attention mechanism that adequately associates a meaningless blank with its relevant descriptions while avoiding the incorporation of irrelevant context words. Experiments conducted on real-world datasets show the promising performance of our method with a balanced accuracy of 94.05% and an F1 score of 90.90% in the CIC problem.
ICRA Conference 2021 Conference Paper
This paper introduces a leg-wheel transformable quadruped robot named Lywal, which can switch between a leg mode and a wheel mode for locomotion, and a claw mode for picking-up and transport functions. First, the mechanical structure of Lywal is designed using an innovative 2-DoF transformable mechanism. Second, the kinematics are analyzed in detail. Then, the mode-switching strategy and the mobile control strategies in different modes are designed. Finally, a prototype of Lywal is built. The properties of the mobile modes are analyzed, and the picking-up and transport functions of the claw mode are verified through physical experiments.
EAAI Journal 2021 Journal Article
IJCAI Conference 2021 Conference Paper
The main challenge of cross-modal retrieval is to learn consistent embeddings for heterogeneous modalities. To solve this problem, traditional label-wise cross-modal approaches usually constrain the inter-modal and intra-modal embedding consistency relying on the label ground-truths. However, experiments reveal that different modal networks actually have different generalization capacities, so end-to-end joint training with a consistency loss usually leads to a sub-optimal uni-modal model, which in turn affects the learning of consistent embeddings. Therefore, in this paper, we argue that what is really needed for supervised cross-modal retrieval is a good shared classification model. In other words, we learn the consistent embedding by ensuring the classification performance of each modality on the shared model, without the consistency loss. Specifically, we consider a technique called Semantic Sharing, which directly trains the two modalities interactively by adopting a shared self-attention based classification model. We evaluate the proposed approach on three representative datasets. The results validate that the proposed semantic sharing can consistently boost the performance under the NDCG metric.
EAAI Journal 2020 Journal Article
IJCAI Conference 2020 Conference Paper
Knowledge graph alignment aims to link equivalent entities across different knowledge graphs. To utilize both the graph structures and the side information such as name, description and attributes, most works propagate the side information, especially names, through linked entities by graph neural networks. However, due to the heterogeneity of different knowledge graphs, the alignment accuracy suffers from aggregating different neighbors. This work presents an interaction model that leverages only the side information. Instead of aggregating neighbors, we compute the interactions between neighbors, which can capture fine-grained matches of neighbors. Similarly, the interactions of attributes are also modeled. Experimental results show that our model significantly outperforms the best state-of-the-art methods by 1.9-9.7% in terms of HitRatio@1 on the dataset DBP15K.
AAAI Conference 2020 Conference Paper
RGB-Infrared (IR) person re-identification is very challenging due to the large cross-modality variations between RGB and IR images. The key solution is to learn aligned features to bridge the RGB and IR modalities. However, due to the lack of correspondence labels between every pair of RGB and IR images, most methods try to alleviate the variations with set-level alignment by reducing the distance between the entire RGB and IR sets. However, this set-level alignment may lead to misalignment of some instances, which limits the performance for RGB-IR Re-ID. Different from existing methods, in this paper, we propose to generate cross-modality paired images and perform both global set-level and fine-grained instance-level alignments. Our proposed method enjoys several merits. First, our method can perform set-level alignment by disentangling modality-specific and modality-invariant features. Compared with conventional methods, ours can explicitly remove the modality-specific features, and the modality variation can be better reduced. Second, given cross-modality unpaired images of a person, our method can generate cross-modality paired images from exchanged images. With them, we can directly perform instance-level alignment by minimizing the distances of every pair of images. Extensive experimental results on two standard benchmarks demonstrate that the proposed model performs favorably against state-of-the-art methods. Especially, on the SYSU-MM01 dataset, our model can achieve a gain of 9.2% and 7.7% in terms of Rank-1 and mAP. Code is available at https://github.com/wangguanan/JSIA-ReID.
YNIMG Journal 2020 Journal Article
YNIMG Journal 2020 Journal Article
IJCAI Conference 2020 Conference Paper
Knowledge tracing (KT) defines the task of predicting whether students can correctly answer questions based on their historical responses. Although much research has been devoted to exploiting the question information, abundant auxiliary information among questions and skills has not been well extracted, making it challenging for previous work to perform adequately. In this paper, we demonstrate that large gains on KT can be realized by pre-training embeddings for each question on abundant side information, followed by training deep KT models on the obtained embeddings. To be specific, the side information includes question difficulty and three kinds of relations contained in a bipartite graph between questions and skills. To pre-train the question embeddings, we propose to use product-based neural networks to recover the side information. As a result, adopting the pre-trained embeddings in existing deep KT models significantly outperforms state-of-the-art baselines on three common KT datasets.
AAAI Conference 2020 Conference Paper
Meta-learning for few-shot learning allows a machine to leverage previously acquired knowledge as a prior, thus improving the performance on novel tasks with only small amounts of data. However, most mainstream models suffer from catastrophic forgetting and insufficient robustness issues, thereby failing to fully retain or exploit long-term knowledge while being prone to cause severe error accumulation. In this paper, we propose a novel Continual Meta-Learning approach with Bayesian Graph Neural Networks (CML-BGNN) that mathematically formulates meta-learning as continual learning of a sequence of tasks. With each task formed as a graph, the intra- and inter-task correlations can be well preserved via message-passing and history transition. To remedy topological uncertainty from graph initialization, we utilize the Bayes by Backprop strategy that approximates the posterior distribution of task-specific parameters with amortized inference networks, which are seamlessly integrated into the end-to-end edge learning. Extensive experiments conducted on the miniImageNet and tieredImageNet datasets demonstrate the effectiveness and efficiency of the proposed method, improving the performance by 42.8% compared with state-of-the-art on the miniImageNet 5-way 1-shot classification task.
AAAI Conference 2020 Conference Paper
In this paper, we propose a new end-to-end network, named Joint Learning of Attribute and Contextual relations (JLAC), to solve the task of pedestrian attribute recognition. It includes two novel modules: Attribute Relation Module (ARM) and Contextual Relation Module (CRM). For ARM, we construct an attribute graph with attribute-specific features which are learned by the constrained losses, and further use Graph Convolutional Network (GCN) to explore the correlations among multiple attributes. For CRM, we first propose a graph projection scheme to project the 2-D feature map into a set of nodes from different image regions, and then employ GCN to explore the contextual relations among those regions. Since the relation information in the above two modules is correlated and complementary, we incorporate them into a unified framework to learn the two together. Experiments on three benchmarks, including the PA-100K, RAP and PETA attribute datasets, demonstrate the effectiveness of the proposed JLAC.
AAAI Conference 2020 Conference Paper
Time series modeling has attracted extensive research efforts; however, achieving both reliable efficiency and interpretability from a unified model remains a challenging problem. In the literature, shapelets offer interpretable and explanatory insights for classification tasks, while most existing works ignore the differing representative power at different time slices, as well as (more importantly) the evolution pattern of shapelets. In this paper, we propose to extract time-aware shapelets by designing a two-level timing factor. Moreover, we define and construct the shapelet evolution graph, which captures how shapelets evolve over time and can be incorporated into the time series embeddings by graph embedding algorithms. To validate whether the representations obtained in this way can be applied effectively in various scenarios, we conduct experiments based on three public time series datasets, and two real-world datasets from different domains. Experimental results clearly show the improvements achieved by our approach compared with 16 state-of-the-art baselines.
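The timing-factor idea above can be illustrated with a minimal sliding-window sketch; the function name and the per-window weighting are illustrative assumptions, not the paper's actual two-level formulation:

```python
import numpy as np

def shapelet_distance(series, shapelet, time_weights=None):
    """Sliding-window distance from a series to a shapelet.
    time_weights (one per window start) emulate a timing factor that
    makes matches at some time slices count more than at others.
    Returns (best distance, window start index of the best match)."""
    m = len(shapelet)
    n_win = len(series) - m + 1
    d = np.array([np.mean((series[i:i + m] - shapelet) ** 2)
                  for i in range(n_win)])
    if time_weights is not None:
        d = d * np.asarray(time_weights)
    return d.min(), int(d.argmin())
```

A time-unaware shapelet corresponds to uniform weights; a time-aware one reweights windows so that where a pattern occurs matters, not just whether it occurs.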
IJCAI Conference 2019 Conference Paper
Recent deep Re-ID models mainly focus on learning high-level semantic features, while failing to explicitly explore color information, which is one of the most important cues for person Re-ID. In this paper, we propose a novel Color-Sensitive Re-ID model to take full advantage of color information. On one hand, we train our model with real and fake images. By using the extra fake images, more color information can be exploited and overfitting during training can be avoided. On the other hand, we also train our model with images of the same person with different colors. By doing so, features can be forced to focus on the color difference in regions. To generate fake images with specified colors, we propose a novel Color Translation GAN (CTGAN) to learn mappings between different clothing colors and preserve identity consistency among the same clothing color. Extensive evaluations on two benchmark datasets show that our approach significantly outperforms state-of-the-art Re-ID models.
IJCAI Conference 2019 Conference Paper
Multi-modal learning refers to the process of learning a precise model to represent the joint representations of different modalities. Despite its promise for multi-modal learning, the co-regularization method is based on the consistency principle with a sufficiency assumption, which usually does not hold for real-world multi-modal data. Indeed, due to modal insufficiency in real-world applications, there are divergences among heterogeneous modalities. This imposes a critical challenge for multi-modal learning. To this end, in this paper, we propose a novel Comprehensive Multi-Modal Learning (CMML) framework, which can strike a balance between modal consistency and divergence by considering the insufficiency in one unified framework. Specifically, we utilize an instance-level attention mechanism to weight the sufficiency of each instance on different modalities. Moreover, novel diversity regularization and robust consistency metrics are designed for discovering insufficient modalities. Our empirical studies show the superior performance of CMML on real-world data in terms of various criteria.
AAAI Conference 2019 Conference Paper
In real-world applications, data often come with multiple modalities, and many multi-modal learning approaches have been proposed for integrating the information from different sources. Most previous multi-modal methods utilize modal consistency to reduce the complexity of the learning problem, so modal completeness needs to be guaranteed. However, due to data collection failures, self-deficiencies, and various other reasons, multi-modal instances are often incomplete in real applications, and exhibit inconsistent anomalies even among the complete instances, which jointly result in the inconsistency problem. These issues degrade multi-modal feature learning performance and ultimately affect the generalization ability in different tasks. In this paper, we propose a novel Deep Robust Unsupervised Multi-modal Network structure (DRUMN) to solve this real problem within a unified framework. The proposed DRUMN can utilize the extrinsic heterogeneous information from unlabeled data against the insufficiency caused by the incompleteness. On the other hand, the inconsistent anomaly issue is handled with an adaptive weighted estimation, rather than by adjusting complex thresholds. As DRUMN can extract discriminative feature representations for each modality, experiments on real-world multi-modal datasets successfully validate the effectiveness of our proposed method.
IJCAI Conference 2019 Conference Paper
In this paper, we propose a novel unified network named Deep Hybrid-Aligned Architecture for facial age estimation. It contains global, local and global-local branches, which are jointly optimized and thus can capture multiple types of features with complementary information. In each branch, we employ a separate loss for each sub-network to extract independent features and use a recurrent fusion to explore correlations among those region features. Considering that pose variations may lead to misalignment in different regions, we design an Aligned Region Pooling operation to generate aligned region features. Moreover, a new large age dataset named Web-FaceAge, containing more than 120K samples, is collected under diverse scenes and spanning a large age range. Experiments on five age benchmark datasets, including Web-FaceAge, Morph, FG-NET, CACD and Chalearn LAP 2015, show that the proposed method outperforms the state-of-the-art approaches significantly.
YNIMG Journal 2019 Journal Article
YNICL Journal 2019 Journal Article
AAAI Conference 2019 Conference Paper
Zero-shot learning (ZSL) and cold-start recommendation (CSR) are two challenging problems in computer vision and recommender systems, respectively. In general, they are independently investigated in different communities. This paper, however, reveals that ZSL and CSR are two extensions of the same intension. Both of them, for instance, attempt to predict unseen classes and involve two spaces, one for direct feature representation and the other for supplementary description. Yet there is no existing approach which addresses CSR from the ZSL perspective. This work, for the first time, formulates CSR as a ZSL problem, and a tailor-made ZSL method is proposed to handle CSR. Specifically, we propose a Low-rank Linear Auto-Encoder (LLAE), which tackles three cruxes, i.e., domain shift, spurious correlations and computing efficiency. LLAE consists of two parts: a low-rank encoder that maps user behavior into user attributes, and a symmetric decoder that reconstructs user behavior from user attributes. Extensive experiments on both ZSL and CSR tasks verify that the proposed method is a win-win formulation, i.e., not only can CSR be handled by ZSL models with a significant performance improvement compared with several conventional state-of-the-art methods, but the consideration of CSR can benefit ZSL as well.
AAAI Conference 2019 Conference Paper
This paper addresses Weakly Supervised Object Localization (WSOL) with only image-level supervision. We model the missing object locations as latent variables, and contribute a novel self-directed optimization strategy to infer them. With this strategy, our Self-Directed Localization Network (SD-LocNet) is able to localize object instances whose initial locations are noisy. The self-directed inference hinges on an adaptive sampling method to identify reliable object instances by measuring their localization stability scores. In this way, the resulting model is robust to noisily initialized object locations, which we find is important in WSOL. Furthermore, we introduce a reliability-induced prior propagation strategy to transfer object priors of the reliable instances to the unreliable ones by promoting their feature similarity, which effectively refines the unreliable object instances for better localization. The proposed SD-LocNet achieves 70.9% CorLoc and 51.3% mAP on PASCAL VOC 2007, surpassing the state-of-the-arts by a large margin.
AAAI Conference 2019 Conference Paper
Despite the remarkable progress in face recognition related technologies, reliably recognizing faces across ages still remains a big challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition, or first synthesize a face that matches the target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other. To this end, we propose a deep Age-Invariant Model (AIM) for face recognition in the wild with three distinct novelties. First, AIM presents a novel unified deep architecture jointly performing cross-age face synthesis and recognition in a mutual boosting way. Second, AIM achieves continuous face rejuvenation/aging with remarkable photorealistic and identity-preserving properties, avoiding the requirement of paired data and the true age of testing samples. Third, we develop effective and novel training strategies for end-to-end learning of the whole deep architecture, which generates powerful age-invariant face representations explicitly disentangled from the age variation. Extensive experiments on several cross-age datasets (MORPH, CACD and FG-NET) demonstrate the superiority of the proposed AIM model over the state-of-the-arts. Benchmarking our model on one of the most popular unconstrained face recognition datasets, IJB-C, additionally verifies the promising generalizability of AIM in recognizing faces in the wild.
AAAI Conference 2019 Conference Paper
Inferring the interactions between objects, a.k.a. visual relationship detection, is a crucial point for vision understanding, which captures more definite concepts than object detection. Most previous works that treat the interaction between a pair of objects as one-way fail to exploit the mutual relation between objects, which is essential to modern visual applications. In this work, we propose a mutual relation net, dubbed MR-Net, to explore the mutual relation between paired objects for visual relationship detection. Specifically, we construct a mutual relation space to model the mutual interaction of paired objects, and employ a linear constraint to optimize the mutual interaction, which is called mutual relation learning. Our mutual relation learning does not introduce any parameters, and can be adapted to improve the performance of other methods. In addition, we devise a semantic ranking loss to discriminatively penalize predicates with semantic similarity, which is ignored by traditional loss functions (e.g., cross-entropy with softmax). Then, our MR-Net optimizes the mutual relation learning together with the semantic ranking loss with a siamese network. The experimental results on two commonly used datasets (VG and VRD) demonstrate the superior performance of the proposed approach.
YNIMG Journal 2019 Journal Article
AIIM Journal 2019 Journal Article
AAAI Conference 2018 Conference Paper
This paper is concerned with how to make efficient use of social information to improve recommendations. Most existing social recommender systems assume people share similar preferences with their social friends. This assumption, however, may not hold true due to the various motivations for making online friends and the dynamics of online social networks. Inspired by recent causal-process based recommendations that first model user exposure to items and then use these exposures to guide rating prediction, we utilize social information to capture user exposure rather than user preferences. We assume that people get information about products from their online friends but do not have to share similar preferences with them, which is less restrictive and seems closer to reality. Under this new assumption, we present a novel recommendation approach (named SERec) to integrate social exposure into collaborative filtering. We propose two methods to implement SERec, namely social regularization and social boosting, each with different ways to construct social exposures. Experiments on four real-world datasets demonstrate that our methods outperform the state-of-the-art methods on top-N recommendations. A further study compares the robustness and scalability of the two proposed methods.
AAAI Conference 2018 Conference Paper
Network embedding, which aims to learn low-dimensional representations of vertices, is an important task and has attracted considerable research efforts recently. In the real world, networks, such as social and biological networks, are dynamic and evolve over time. However, almost all existing network embedding methods focus on static networks and ignore network dynamics. In this paper, we present a novel representation learning approach, DynamicTriad, to preserve both the structural information and the evolution patterns of a given network. The general idea of our approach is to impose the triad, a group of three vertices that is one of the basic units of networks. In particular, we model how a closed triad, which consists of three vertices connected with each other, develops from an open triad that has two of its three vertices not connected with each other. This triadic closure process is a fundamental mechanism in the formation and evolution of networks, thereby enabling our model to capture the network dynamics and to learn representation vectors for each vertex at different time steps. Experimental results on three real-world networks demonstrate that, compared with several state-of-the-art techniques, DynamicTriad achieves substantial gains in several application scenarios. For example, our approach can effectively be applied to help identify telephone frauds in a mobile network, and to predict whether a user will repay her loans or not in a loan network.
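The triadic closure events the model is built around can be enumerated with a short plain-Python sketch (an illustration only; the paper learns embeddings from such events rather than just listing them):

```python
def open_triads(edges):
    """Return open triads (a, b, c): b is connected to both a and c,
    but a and c are not connected to each other."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triads = set()
    for b, nbrs in adj.items():
        for a in nbrs:
            for c in nbrs:
                if a < c and c not in adj[a]:
                    triads.add((a, b, c))
    return triads

def closure_events(edges_t, edges_t1):
    """Open triads at time t whose missing edge appears at t+1,
    i.e. the triadic closure events between two snapshots."""
    adj1 = {}
    for u, v in edges_t1:
        adj1.setdefault(u, set()).add(v)
        adj1.setdefault(v, set()).add(u)
    return {(a, b, c) for a, b, c in open_triads(edges_t)
            if c in adj1.get(a, set())}
```

Each closure event ties two consecutive snapshots together, which is what lets a model of this kind learn time-dependent vertex representations.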
AAAI Conference 2018 Conference Paper
Existing relation classification methods that rely on distant supervision assume that a bag of sentences mentioning an entity pair all describe a relation for that entity pair. Such methods, performing classification at the bag level, cannot identify the mapping between a relation and a sentence, and largely suffer from the noisy labeling problem. In this paper, we propose a novel model for relation classification at the sentence level from noisy data. The model has two modules: an instance selector and a relation classifier. The instance selector chooses high-quality sentences with reinforcement learning and feeds the selected sentences into the relation classifier, and the relation classifier makes sentence-level predictions and provides rewards to the instance selector. The two modules are trained jointly to optimize the instance selection and relation classification processes. Experimental results show that our model can deal with the noise in the data effectively and obtains better performance for relation classification at the sentence level.
AAAI Conference 2018 Conference Paper
Network embedding aims to learn low-dimensional representations of vertexes in a network while preserving the structure and inherent properties of the network. Existing network embedding works primarily focus on preserving the microscopic structure, such as the first- and second-order proximity of vertexes, while the macroscopic scale-free property is largely ignored. The scale-free property depicts the fact that vertex degrees follow a heavy-tailed distribution (i.e., only a few vertexes have high degrees) and is a critical property of real-world networks, such as social networks. In this paper, we study the problem of learning representations for scale-free networks. We first theoretically analyze the difficulty of embedding and reconstructing a scale-free network in Euclidean space, by converting our problem to the sphere packing problem. Then, we propose the “degree penalty” principle for designing scale-free property preserving network embedding algorithms: punishing the proximity between high-degree vertexes. We introduce two implementations of our principle by utilizing spectral techniques and a skip-gram model respectively. Extensive experiments on six datasets show that our algorithms are able to not only reconstruct heavy-tailed degree distributions, but also outperform state-of-the-art embedding models in various network mining tasks, such as vertex classification and link prediction.
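The "degree penalty" principle can be sketched directly; the weighting form (d_i * d_j) ** -beta below is an assumed toy instantiation, and the paper's spectral and skip-gram implementations differ in detail:

```python
import numpy as np

def degree_penalized_weights(edges, n, beta=1.0):
    """Toy version of the degree penalty principle: down-weight the
    proximity between pairs of high-degree vertexes.
    edges: list of (u, v) pairs; n: number of vertexes.
    Returns a symmetric (n, n) matrix of penalized edge weights."""
    deg = np.zeros(n)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    w = np.zeros((n, n))
    for u, v in edges:
        # high-degree pairs get smaller proximity weight
        w[u, v] = w[v, u] = (deg[u] * deg[v]) ** -beta
    return w
```

Intuitively, an edge between two hubs carries less evidence of genuine similarity than an edge between two low-degree vertexes, so the former is punished.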
IJCAI Conference 2018 Conference Paper
In real-world applications, data often come with multiple modalities. Researchers have proposed multi-modal learning approaches for integrating the information from different modalities. Most previous multi-modal methods assume that training examples come with complete modalities. However, due to data collection failures, self-deficiencies and various other reasons, multi-modal examples usually have incomplete feature representations in real applications. In this paper, the incomplete feature representation issues in multi-modal learning are named incomplete modalities, and we propose a semi-supervised multi-modal learning method aimed at this incomplete-modal issue (SLIM). SLIM can utilize the extrinsic information from unlabeled data against the insufficiencies brought by the incomplete-modal issues in a semi-supervised scenario. Besides, the proposed SLIM casts the problem in a unified framework which can be treated as a classifier or a clustering learner, and integrates the intrinsic consistencies and extrinsic unlabeled information. As SLIM can extract the most discriminative predictors for each modality, experiments on 15 real-world multi-modal datasets validate the effectiveness of our method.
AAAI Conference 2018 Conference Paper
Unprecedented human mobility has driven rapid urbanization around the world. In China, the fraction of the population dwelling in cities increased from 17.9% to 52.6% between 1978 and 2012. Such large-scale migration poses challenges for policymakers and important questions for researchers. To investigate the process of migrant integration, we employ a one-month complete dataset of telecommunication metadata in Shanghai, with 54 million users and 698 million call logs. We find systematic differences between locals and migrants in their mobile communication networks and geographical locations. For instance, migrants have more diverse contacts and move around the city with a larger radius than locals after they settle down. By distinguishing new migrants (who recently moved to Shanghai) from settled migrants (who have been in Shanghai for a while), we demonstrate the integration process of new migrants in their first three weeks. Moreover, we formulate classification problems to predict whether a person is a migrant. Our classifier achieves an F1-score of 0.82 when distinguishing settled migrants from locals, but it remains challenging to identify new migrants because of class imbalance. This classification setup holds promise for identifying new migrants who will successfully integrate with locals (new migrants who are misclassified as locals).
IS Journal 2017 Journal Article
As one of the most popular social media platforms today, Twitter provides people with an effective way to communicate and interact with each other. Through these interactions, influence among users gradually emerges and changes people's opinions. Although previous work has studied interpersonal influence as the probability of activating others during information diffusion, it ignores an important fact: information diffusion is the result of influence, while the dynamic interactions among users are what produce influence. In this article, the authors propose a novel temporal influence model to learn users' opinion behaviors regarding a specific topic by exploring how influence emerges during communications. The experiments show that their model performs better than other influence models with different influence assumptions when predicting users' future opinions, especially for users with high opinion diversity.
AAAI Conference 2017 Conference Paper
Model reuse attempts to construct a model by utilizing existing available models, mostly trained for other tasks, rather than building a model from scratch. It helps reduce the time cost, data amount, and expertise required. Deep learning has achieved great success in various tasks involving images, voices, and videos. Several studies have the flavor of model reuse, trying to reuse pre-trained deep network architectures or deep model features to train a new deep model. They, however, neglect the fact that many other fixed models or features are available. In this paper, we propose a more thorough model reuse scheme, FMR (Fixed Model Reuse). FMR utilizes the learning power of deep models to implicitly grab the useful discriminative information from fixed models/features that have been widely used in general tasks. We first arrange the convolution layers of a deep network and the provided fixed models/features in parallel, fully connecting them to the output layer nodes. Then, the dependencies between the output layer nodes and the fixed models/features are knocked down, such that only the raw feature inputs are needed when the model is used for testing, though the helpful information in the fixed models/features has already been incorporated into the model. On one hand, by the FMR scheme, the required amount of training data can be significantly reduced because of the reuse of fixed models/features. On the other hand, the fixed models/features are not explicitly used in testing, and thus the scheme can be quite useful in applications where the fixed models/features are protected by patents or commercial secrets. Experiments on five real-world datasets validate the effectiveness of FMR compared with state-of-the-art deep methods.
IJCAI Conference 2017 Conference Paper
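A toy linear sketch of the FMR idea described above: raw features and a fixed model's features feed the output in parallel during training, then the dependency on the fixed features is knocked down so only raw inputs are needed at test time. The two-stage least-squares distillation, the data shapes, and the synthetic target are all illustrative assumptions, not the paper's actual deep architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # raw feature inputs
F = np.tanh(X @ rng.normal(size=(5, 3)))      # a fixed model's features
y = X @ rng.normal(size=5) + F @ rng.normal(size=3)  # synthetic target

# Stage 1: connect raw features and fixed features to the output in parallel.
Z = np.hstack([X, F])
w_joint, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Stage 2: "knockdown" -- fit a raw-feature-only predictor to the joint
# model's outputs, so the fixed features are no longer needed for testing.
teacher = Z @ w_joint
w_raw, *_ = np.linalg.lstsq(X, teacher, rcond=None)

pred = X @ w_raw  # test-time prediction uses only the raw inputs
print(pred.shape)  # (200,)
```

The point the sketch makes is the asymmetry: the fixed features shape the learned weights but are absent from the test-time computation.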
Multi-Model Reuse is one of the prominent problems in the Learnware framework, and its main issue lies in obtaining the final prediction from the responses of multiple pre-trained models. Different from multi-classifier ensembles, only pre-trained models, rather than the whole training sets, are provided in the Multi-Model Reuse configuration. This configuration is closer to real applications, where the reliability of each model cannot be evaluated properly. In this paper, aiming at this lack of reliability evaluation, the potential consistency spread across different modalities is utilized. With the consistency of pre-trained models on different modalities, we propose a Pre-trained Multi-Model Reuse approach (PM2R) with multi-modal data, which realizes the reusability of multiple models. PM2R can combine pre-trained multi-models efficiently without re-training, and consequently no additional training data storage is required. We describe the more realistic Multi-Model Reuse setting comprehensively in our paper and point out the differences among this setting, classifier ensembles, and late fusion in multi-modal learning. Experiments on synthetic and real-world datasets validate the effectiveness of PM2R compared with state-of-the-art ensemble/multi-modal learning methods under this more realistic setting.
AAAI Conference 2017 Conference Paper
This paper proposes a one-step spectral clustering method that learns an intrinsic affinity matrix (i.e., the clustering result) from the low-dimensional space (i.e., the intrinsic subspace) of the original data. Specifically, the intrinsic affinity matrix is learnt by: 1) the alignment of the initial affinity matrix learnt from the original data; 2) the adjustment of the transformation matrix, which transfers the original feature space into its intrinsic subspace by simultaneously conducting feature selection and subspace learning; and 3) the clustering result constraint, i.e., the graph constructed from the intrinsic affinity matrix has exactly c connected components, where c is the number of clusters. In this way, two affinity matrices and a transformation matrix are iteratively updated until each reaches its optimum, so that the two affinity matrices are consistent and the intrinsic subspace is learnt via the transformation matrix. Experimental results on both synthetic and benchmark datasets verify that our proposed method produces more effective clustering results than previous clustering methods.
AAAI Conference 2017 Conference Paper
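The clustering result constraint above relies on a standard spectral fact: the multiplicity of the Laplacian's zero eigenvalue equals the number of connected components. A minimal sketch of checking that an affinity matrix satisfies the "exactly c components" constraint (the tolerance and toy graph are assumptions):

```python
import numpy as np

def num_connected_components(W, tol=1e-8):
    """Count connected components of the graph given by affinity matrix W:
    eigenvalues of the Laplacian below `tol` correspond to components."""
    D = np.diag(W.sum(axis=1))
    L = D - W  # unnormalized graph Laplacian
    vals = np.linalg.eigvalsh(L)
    return int(np.sum(vals < tol))

# Two disjoint edges -> an affinity matrix with exactly c = 2 components.
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0
print(num_connected_components(W))  # 2
```

An affinity matrix with this block structure directly encodes the partition, which is why enforcing the constraint makes the affinity matrix itself the clustering result.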
In a document, the topic distribution of a sentence depends on both the topics of the preceding sentences and its own content, and it is usually affected by the topics of the preceding sentences with different weights. It is natural to treat a document as a sequence of sentences, yet most existing works on Bayesian document modeling do not take these points into consideration. To fill this gap, we propose a Recurrent Attentional Topic Model (RATM) for document embedding. The RATM not only takes advantage of the sequential order among sentences but also uses the attention mechanism to model the relations among successive sentences. In RATM, we propose a Recurrent Attentional Bayesian Process (RABP) to handle the sequences. Based on the RABP, RATM fully utilizes the sequential information of the sentences in a document. Experiments on two corpora show that our model outperforms state-of-the-art methods on document modeling and classification.
AAAI Conference 2017 Conference Paper
In this paper, we propose a novel coding method named weighted linear coding (WLC) to learn multi-level (e.g., pixel-level, patch-level, and image-level) descriptors from raw pixel data in an unsupervised manner. It guarantees the property of saliency with a similarity constraint. The resulting multi-level descriptors strike a good balance between robustness and distinctiveness. Based on WLC, all data from the same region can be jointly encoded. Consequently, when we extract holistic image features, spatial consistency is preserved. Furthermore, we apply PCA to these features to obtain compact person representations. During the person matching stage, we exploit the complementary information residing in the multi-level descriptors via a score-level fusion strategy. Experiments on the challenging person re-identification datasets VIPeR and CUHK01 demonstrate the effectiveness of our method.
IJCAI Conference 2016 Conference Paper
Spectral clustering has been playing a vital role in various research areas. Most traditional spectral clustering algorithms comprise two independent stages (i.e., first learning continuous labels and then rounding the learned labels into discrete ones), which may lead to severe information loss and performance degradation. In this work, we study how to achieve discrete clustering and reliably generalize to unseen data. We propose a unified spectral clustering scheme which jointly learns discrete clustering labels and robust out-of-sample prediction functions. Specifically, we explicitly enforce a discrete transformation on the intermediate continuous labels, which leads to a tractable optimization problem with a discrete solution. Moreover, to further compensate for the unreliability of the learned labels, we integrate an adaptive robust module with an ℓ2,p loss to learn the prediction function for unseen data. Extensive experiments conducted on various datasets have demonstrated the superiority of our proposal over existing clustering approaches.
AAAI Conference 2016 Conference Paper
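To illustrate the discrete transformation the abstract contrasts with naive rounding, here is a sketch of the classical spectral-rotation discretization (in the style of Yu and Shi's method): alternately snap the rotated continuous labels to one-hot assignments and refit the orthogonal rotation. This is a standard technique used for illustration, not the paper's joint formulation.

```python
import numpy as np

def discretize(Y_cont, n_iter=30):
    """Rotate continuous spectral labels Y_cont (n x c) into discrete
    one-hot assignments via alternating rotation and argmax snapping."""
    n, c = Y_cont.shape
    R = np.eye(c)
    for _ in range(n_iter):
        # Snap rotated continuous labels to the nearest one-hot rows.
        Y_disc = np.zeros((n, c))
        Y_disc[np.arange(n), np.argmax(Y_cont @ R, axis=1)] = 1.0
        # Refit the orthogonal rotation via the orthogonal Procrustes solution.
        U, _, Vt = np.linalg.svd(Y_disc.T @ Y_cont)
        R = (U @ Vt).T
    return np.argmax(Y_disc, axis=1)

# Two well-separated groups of continuous label rows -> two discrete clusters.
Y = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
labels = discretize(Y)
print(labels[0] != labels[2])  # True
```

The alternating structure is what the "explicit discrete transformation" replaces the usual post-hoc k-means rounding with.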
In this paper, we propose a novel similarity measure and introduce an efficient strategy to learn it using only similar pairs for person verification. Unlike existing metric learning methods, we consider both the difference and the commonness of an image pair to increase its discriminativeness. Under a pair-constrained Gaussian assumption, we show how to obtain the Gaussian priors (i.e., the corresponding covariance matrices) of dissimilar pairs from those of similar pairs. The application of a log likelihood ratio makes the learning process simple and fast, and thus scalable to large datasets. Additionally, our method handles heterogeneous data well. Results on the challenging datasets for face verification (LFW and PubFig) and person re-identification (VIPeR) show that our algorithm outperforms state-of-the-art methods.
IJCAI Conference 2016 Conference Paper
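A minimal sketch of scoring a pair with a log likelihood ratio under two zero-mean Gaussians over the difference vector, in the KISSME style. The paper additionally derives the dissimilar-pair covariance from the similar-pair one under its pair-constrained Gaussian assumption; here both covariances are simply given, as an assumption for illustration.

```python
import numpy as np

def llr_score(x, y, cov_sim, cov_dis):
    """Log-likelihood-ratio pair score: quadratic form of the difference
    vector under Gaussians for similar vs. dissimilar pairs. Higher score
    means the pair is more likely to be a similar (same-person) pair."""
    d = x - y
    M = np.linalg.inv(cov_sim) - np.linalg.inv(cov_dis)
    return -d @ M @ d

rng = np.random.default_rng(0)
cov_sim = np.eye(3) * 0.1   # similar pairs: small difference vectors
cov_dis = np.eye(3) * 2.0   # dissimilar pairs: large difference vectors
x = rng.normal(size=3)
same = llr_score(x, x + 0.01, cov_sim, cov_dis)
diff = llr_score(x, x + 3.0, cov_sim, cov_dis)
print(same > diff)  # True
```

Since the score is a fixed quadratic form, verification at test time is a single matrix-vector product per pair, which is what makes the approach fast and scalable.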
Complex objects usually have multiple modal features. In multi-modal learning, modalities closely related to the target tasks are known as strong modalities. However, collecting strong modalities for all instances is often expensive, and current multi-modal learning techniques hardly take the expense of strong modal feature extraction into consideration. On the other hand, active learning reduces labeling expenses by querying the ground truths of specific selected instances. In this paper, we propose a training strategy, ACQUEST (ACtive QUErying STrong modalities), which exploits strong modal information by actively querying the strong modal feature values of selected instances rather than their corresponding ground truths. In ACQUEST, only informative instances are selected for strong modal feature acquisition. An inverse prediction technique is also proposed to cast ACQUEST as a unified optimization problem. Experiments on image datasets show that ACQUEST achieves better classification performance than conventional active learning and multi-modal learning methods, with lower feature acquisition costs and labeling expenses.
AAAI Conference 2016 Conference Paper
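A toy sketch of the selection step above: pick the instances whose current predictions are most uncertain and query their strong modal features (rather than their labels). Plain margin-based uncertainty sampling is assumed here as the informativeness criterion; the paper's actual criterion comes from its unified optimization.

```python
import numpy as np

def select_for_acquisition(probs, budget):
    """Return indices of the `budget` most uncertain instances (predicted
    probability closest to 0.5) for strong modal feature acquisition."""
    uncertainty = -np.abs(probs - 0.5)  # higher = closer to decision boundary
    return np.argsort(uncertainty)[-budget:]

# Predicted positive-class probabilities from the weak modality alone.
probs = np.array([0.95, 0.52, 0.10, 0.48, 0.80])
print(sorted(select_for_acquisition(probs, 2)))  # [1, 3]
```

Confidently classified instances (0.95, 0.10) are skipped, so the acquisition budget is spent only where the strong modality can change the decision.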
A variety of encoding methods for the bag-of-words (BoW) model have been proposed to encode local features in image classification. However, most of them are unsupervised and simply employ k-means to form the visual vocabulary, which reduces the discriminative power of the features. In this paper, we propose a metric-embedded discriminative vocabulary learning method for high-level person representation, with application to person re-identification. A new and effective term is introduced which aims at pulling the same persons closer while pushing different ones farther apart in the metric space. With the learned vocabulary, we utilize a linear coding method to encode the image-level features (or holistic image features) for extracting the high-level person representation. Different from traditional unsupervised approaches, our method can explore the relationship (same or not) among persons. Since the linear coding has an analytic solution, it is easy to obtain the final high-level features. The experimental results on person re-identification demonstrate the effectiveness of our proposed algorithm.
AAAI Conference 2016 Conference Paper
Psychological theories suggest that emotion represents the state of mind and instinctive responses of one’s cognitive system (Cannon 1927). Emotions are a complex state of feeling that results in physical and psychological changes that influence our behavior. In this paper, we study an interesting problem of emotion contagion in social networks. In particular, by employing an image social network (Flickr) as the basis of our study, we try to unveil how users’ emotional statuses influence each other and how users’ positions in the social network affect their influential strength on emotion. We develop a probabilistic framework to formalize the problem into a role-aware contagion model. The model is able to predict users’ emotional statuses based on their historical emotional statuses and social structures. Experiments on a large Flickr dataset show that the proposed model significantly outperforms (+31% in terms of F1-score) several alternative methods in predicting users’ emotional status. We also discover several intriguing phenomena. For example, the probability that a user feels happy is roughly linear in the number of friends who are also happy; but taking a closer look, the happiness probability is superlinear in the number of happy friends who act as opinion leaders (Page et al. 1999) in the network and sublinear in the number of happy friends who span structural holes (Burt 2001). This offers a new opportunity to understand the underlying mechanism of emotion contagion in online social networks.
IJCAI Conference 2015 Conference Paper
In real-world applications, data often come with multiple modalities. Previous works assumed that each modality contains sufficient information for the target and can be treated with equal importance. However, different modalities are often of varying importance in real tasks; e.g., in ID recognition the facial feature is a weak modality while the fingerprint feature is a strong modality. In this paper, we point out that different modalities should be treated with different strategies and propose the Auxiliary information Regularized Machine (ARM), which works by extracting the most discriminative feature subspace of the weak modality while regularizing the strong modal predictor. Experiments on binary and multi-class datasets demonstrate the advantages of our proposed approach ARM.
AAAI Conference 2015 Conference Paper
Information diffusion, which studies how information is propagated in social networks, has attracted considerable research effort recently. However, most existing approaches do not distinguish the social roles that nodes may play in the diffusion process. In this paper, we study the interplay between users’ social roles and their influence on information diffusion. We propose a Role-Aware INformation diffusion model (RAIN) that integrates social role recognition and diffusion modeling into a unified framework. We develop a Gibbs-sampling based algorithm to learn the proposed model from historical diffusion data. The proposed model can be applied to different scenarios. For instance, at the micro level, the proposed model can be used to predict whether an individual user will repost a specific message; at the macro level, we can use the model to predict the scale and duration of a diffusion process. We evaluate the proposed model on a real social media dataset. Our model performs much better in both micro- and macro-level prediction than several alternative methods.
TIST Journal 2015 Journal Article
The availability of massive RGB-depth (RGB-D) images poses a compelling need for effective RGB-D content understanding techniques. RGB-D images provide synchronized information from multiple views (e.g., color and depth) of real-world objects and scenes. This work proposes learning compact and discriminative features from the multiple views of RGB-D content toward effective feature representation for RGB-D image understanding. In particular, a robust multiview feature learning approach is developed, which exploits the intrinsic relations among multiple views. The feature learning in multiple views is jointly optimized in an integrated formulation. The joint optimization essentially exploits the intrinsic relations among the views, leading to effective features and making the learning process robust to noises. The feature learning function is formulated as a robust nonnegative graph embedding function over multiple graphs in various views. The graphs characterize the local geometric and discriminating structure of the multiview data. The joint sparsity in ℓ1-norm graph embedding and ℓ2,1-norm data factorization further enhances the robustness of feature learning. We derive an efficient computational solution for the proposed approach and provide rigorous theoretical proof with regard to its convergence. We apply the proposed approach to two RGB-D image understanding tasks: RGB-D object classification and RGB-D scene categorization. We conduct extensive experiments on two real-world RGB-D image datasets. The experimental results have demonstrated the effectiveness of the proposed approach.
AAAI Conference 2014 Conference Paper
Diabetes complications often afflict diabetes patients seriously: over 68% of diabetes-related mortality is caused by diabetes complications. In this paper, we study the problem of automatically diagnosing diabetes complications from patients’ lab test results. This problem has two main challenges: 1) feature sparseness: a patient undergoes only 1.26% of lab tests on average, and 65.5% of lab test types are performed on samples from fewer than 10 patients; 2) knowledge skewness: comprehensive, detailed domain knowledge of the association between diabetes complications and lab tests is lacking. To address these challenges, we propose a novel probabilistic model called the Sparse Factor Graph Model (SparseFGM). SparseFGM projects sparse features onto a lower-dimensional latent space, which alleviates the sparseness problem. SparseFGM is also able to capture the associations between complications and lab tests, which helps handle the knowledge skewness. We evaluate the proposed model on a large collection of real medical records. SparseFGM significantly outperforms baselines (+20% in terms of F1) and gives detailed associations between diabetes complications and lab tests.
AAAI Conference 2014 Conference Paper
Extracting emotions from images has attracted much interest, in particular with the rapid development of social networks. The emotional impact is very important for understanding the intrinsic meanings of images. Despite many studies having been done, most existing methods focus on image content but ignore the emotion of the user who published the image. One interesting question is: how does social effect correlate with the emotion expressed in an image? Specifically, can we leverage friends’ interactions (e.g., discussions) related to an image to help extract its emotions? In this paper, we formally define the problem and propose a novel emotion learning method that jointly models images posted by social users and comments added by their friends. One advantage of the model is that it can distinguish comments that are closely related to the emotion expression of an image from irrelevant ones. Experiments on an open Flickr dataset show that the proposed model can significantly improve (+37.4% in terms of F1) the accuracy of inferring user emotions. More interestingly, we found that half of the improvements are due to interactions with the 1.0% closest friends.
YNIMG Journal 2012 Journal Article
YNIMG Journal 2011 Journal Article
TCS Journal 2010 Journal Article