EAAI Journal 2026 Journal Article
A novel multi-modal attentional collaborative learning framework with semantic enhancement for audio–visual question answering
- Jie Yang
- Miao Ma
- Peng Wang
- Yutong Li
- Zhao Pei
- Chao Yao
- Longjiang Guo
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Self-supervised monocular depth estimation methods severely compromise accuracy in dynamic objects due to their static scene assumption. Existing approaches for dynamic scenes suffer from two critical shortcomings: 1) reliance on supervised segmentation models (requiring costly annotations) or computationally intensive multi-branch models to isolate moving objects, and 2) simple integration of 2D/3D motion flow without reliable supervision for dynamic objects. We propose AdaDepth, a two‑stage framework that jointly performs unsupervised scene decomposition and dynamic-aware depth learning. In the initial structural stage, our geometry-motion joint scene decomposition (GMoDecomp) module ensures the robust generation of a depth prior and simultaneously partitions the scene into multiple regions through the fusion of geometric and motion cues. In the region-adaptive refinement stage, we exploit the depth prior and decomposed regions to introduce motion-aware and geometry-consistent constraints, effectively improving depth estimation in dynamic scenes. AdaDepth achieves accurate depth prediction in highly dynamic scenes without relying on external labels or specialized segmentation models. Extensive experiments on KITTI, Cityscapes, and Waymo Open demonstrate its superiority over state-of-the-art approaches.
AAAI Conference 2026 Conference Paper
With the rapid advancement of deep learning, drug-target interaction (DTI) prediction has seen substantial performance enhancements. However, existing methodologies face a critical, yet unaddressed challenge, i.e., the Modality Reliability Gap. Such a gap arises from the unpredictable variance in the informativeness and reliability of 1D sequence versus 3D structural data across different drug-target pairs, critically limiting model robustness and domain generalization. To overcome it, we introduce DrugCMF, a novel drug-target interaction prediction method built on a confidence-aware multimodal fusion framework designed specifically to bridge the Modality Reliability Gap. Specifically, DrugCMF employs a four-stage approach: (1) it extracts rich features by utilizing four pre-trained models to obtain token-level embeddings from both 1D sequences and 3D structures; (2) it preserves modality informativeness by independently learning interaction patterns within each modality through a Token-level Interaction module; (3) it explicitly quantifies the reliability gap by employing a novel confidence estimation mechanism to dynamically learn weights for each modality; and (4) it bridges the gap by using these confidence scores to guide a learnable cross-modal fusion module, adaptively fusing information from the most trustworthy source. By methodically addressing the Modality Reliability Gap, DrugCMF significantly outperforms SOTA methods.
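The confidence-guided fusion in stages (3)-(4) can be illustrated with a minimal sketch. All names, dimensions, and the linear scoring heads below are hypothetical stand-ins, not the paper's actual architecture; the point is only the mechanism of turning per-modality confidence scores into fusion weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled embeddings for one drug-target pair:
# one vector per modality (1D sequence vs. 3D structure).
seq_emb = rng.normal(size=64)     # 1D-sequence modality
struct_emb = rng.normal(size=64)  # 3D-structure modality

# Confidence estimation: a per-modality scalar score (here a stand-in
# linear head), turned into fusion weights with a softmax.
w_seq, w_struct = rng.normal(size=64), rng.normal(size=64)
logits = np.array([seq_emb @ w_seq, struct_emb @ w_struct])
conf = np.exp(logits - logits.max())
conf /= conf.sum()

# Confidence-guided fusion: lean on the more trustworthy modality.
fused = conf[0] * seq_emb + conf[1] * struct_emb
```

In the full model the scoring heads and the fusion map would be learned jointly, so the weights adapt per drug-target pair rather than being fixed.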
JBHI Journal 2026 Journal Article
Existing respiratory monitoring techniques primarily focus on respiratory rate measurement, neglecting the potential of using thoracoabdominal patterns of respiration for infant lung health assessment. To bridge this gap, we exploit the unique advantage of spatial redundancy of a camera sensor to analyze the infant thoracoabdominal respiratory motion. Specifically, we propose a camera-based respiratory imaging (CRI) system that utilizes optical flow to construct a spatio-temporal respiratory imager for comparing the infant chest and abdominal respiratory motion, and employs deep learning algorithms to identify infant abdominal, thoracoabdominal synchronous, and thoracoabdominal asynchronous patterns of respiration. To alleviate the challenges posed by limited clinical training data and subject variability, we introduce a novel multiple-expert contrastive learning (MECL) strategy to CRI. It enriches training samples by reversing and pairing different-class data, and promotes the representation consistency of same-class data through multi-expert collaborative optimization. Clinical validation involving 44 infants shows that MECL achieves 70% in sensitivity and 80. 21% in specificity, which validates the feasibility of CRI for respiratory pattern recognition. This work investigates a novel video-based approach for assessing the infant thoracoabdominal patterns of respiration, revealing a new value stream of video health monitoring in neonatal care.
AAAI Conference 2026 Conference Paper
Graph neural networks (GNNs) excel at modeling graph-structured data but often inherit and amplify biases, leading to substantial efforts in developing fair GNNs. However, most existing approaches assume full access to sensitive attribute information, which is often impractical in real-world scenarios due to privacy concerns or risks of discrimination. To address this limitation, this paper focuses on graph fairness with limited sensitive attribute information, ensuring applicability to real-world contexts where current methods fall short. Specifically, we introduce an innovative fairness optimization strategy, propose a novel framework named FGLISA, and provide a theoretical perspective linking limited sensitive attribute information access to fairness objectives, thus enabling fair graph learning in real-world applications with limited sensitive attribute information. Experiments on diverse real-world datasets and tasks validate the effectiveness of our approach in achieving both fairness and predictive performance.
AAAI Conference 2026 Conference Paper
Graph Contrastive Learning (GCL) has proven effective in mitigating data sparsity and enhancing representation learning for recommendation. Yet, most GCL frameworks indiscriminately treat all non-anchor nodes as negatives during contrastive sampling, often leading to the false negative problem where semantically similar nodes are incorrectly repelled. Previous attempts to mitigate this issue rely on predetermined heuristics or local neighborhood mining, which struggle to reliably identify false negatives. More critically, they often overlook authentic user-item interactions for anchoring sample relationships. To this end, this paper presents MACRec, a Multi-View subspace-Alignment framework designed to Calibrate contrastive sampling in GCL-based Recommendation. MACRec comprises three core components: (1) a Multi-View Affinity (MVA) module that captures consistent semantic relations across multiple augmentations via self-expression modeling; (2) a Cross-Subspace Alignment (CSA) mechanism that leverages authentic user-item behavioral interactions to enforce semantic consistency across user and item subspaces; and (3) a Calibration-based Contrastive Reweighting (CCR) strategy to dynamically down-weight potential false negatives during the contrastive learning process. Extensive experiments on three real-world benchmarks demonstrate that MACRec consistently improves performance across various augmentation backbones, achieving up to 14.55% relative gains.
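The reweighting idea in component (3) can be sketched as an InfoNCE-style loss whose negative terms are down-weighted by an affinity score. This is a generic illustration under assumed conventions (index 0 is the positive, affinity in [0, 1]); the paper's exact weighting scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def reweighted_infonce(anchor, candidates, affinity, tau=0.2):
    """Contrastive loss where candidate j's negative contribution is
    down-weighted by its affinity to the anchor (a likely false negative).
    Convention here: candidates[0] is the positive pair."""
    sims = candidates @ anchor / tau            # similarity logits
    weights = 1.0 - affinity                    # high affinity -> small weight
    pos = np.exp(sims[0])
    neg = np.sum(weights[1:] * np.exp(sims[1:]))
    return -np.log(pos / (pos + neg))

d = 16
anchor = rng.normal(size=d); anchor /= np.linalg.norm(anchor)
cands = rng.normal(size=(5, d))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
# Candidate 1 has high affinity to the anchor: probably a false negative.
affinity = np.array([1.0, 0.9, 0.1, 0.1, 0.1])

loss_weighted = reweighted_infonce(anchor, cands, affinity)
loss_plain = reweighted_infonce(anchor, cands, np.zeros(5))
```

Down-weighting the suspected false negative shrinks the repulsive term, so the calibrated loss is strictly smaller than the unweighted one on the same batch.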
AAAI Conference 2026 Conference Paper
Modern multi-view clustering (MVC) is dominated by two paradigms: multi-view fusion and pseudo-label-guided learning. Pseudo-labeling methods can suffer from confirmation bias; their reliance on a fixed-granularity supervision from an initial clustering can cause learned embeddings to drift from the data's true structure and lose discriminative power. Conversely, fusion methods excel at integrating information but often struggle to robustly differentiate between high-quality and noisy views, which can obscure final cluster boundaries and degrade performance. To address these complementary challenges, we propose GAPS (Granularity-Aware Pseudo Supervision), a novel MVC framework. GAPS introduces a granularity-aware supervision mechanism that generates a full hierarchy of pseudo-labels, enabling the selection of a supervision level that best aligns with the data's intrinsic multi-scale structure. Furthermore, to ensure a high-quality supervisory signal, it incorporates a reliability-aware view selection strategy using a novel Separation-Compactness Index (SCI) to identify and leverage the most informative view for pseudo-label generation. This dual approach ensures the supervisory signal is both structurally adaptive and derived from the most reliable source, leading to highly effective final representations. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and superiority of GAPS over other competitors.
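The abstract does not define the Separation-Compactness Index, but a plausible reading is a ratio of between-cluster separation to within-cluster compactness, used to rank views. The formula below is a hypothetical stand-in for illustration only.

```python
import numpy as np

def sci(X, labels):
    """Hypothetical Separation-Compactness Index: minimum inter-centroid
    distance (separation) divided by the mean within-cluster distance to
    the centroid (compactness). Higher = a cleaner, more informative view."""
    ks = np.unique(labels)
    centroids = np.stack([X[labels == k].mean(axis=0) for k in ks])
    compact = np.mean([np.linalg.norm(X[labels == k] - centroids[i], axis=1).mean()
                       for i, k in enumerate(ks)])
    sep = min(np.linalg.norm(centroids[i] - centroids[j])
              for i in range(len(ks)) for j in range(i + 1, len(ks)))
    return sep / compact

rng = np.random.default_rng(2)
# A "clean" view with well-separated clusters vs. a noisy, overlapping view.
clean = np.concatenate([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
noisy = np.concatenate([rng.normal(0, 2.0, (50, 2)), rng.normal(1, 2.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
```

Under this reading, the reliability-aware selection step simply generates pseudo-labels from whichever view scores highest.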
JBHI Journal 2026 Journal Article
Deep learning associated with neurological signals is poised to drive major advancements in diverse fields such as medical diagnostics, neurorehabilitation, and brain-computer interfaces. The challenge in harnessing the full potential of these signals lies in the dependency on extensive, high-quality annotated data, which is often scarce and expensive to acquire, requiring specialized infrastructure and domain expertise. To address the appetite for data in deep learning, we present Neuro-BERT, a self-supervised pre-training framework for neurological signals based on masked autoencoding in the Fourier domain. The intuition behind our approach is simple: the frequency and phase distribution of neurological signals can reveal intricate neurological activities. We propose a novel pre-training task dubbed Fourier Inversion Prediction (FIP), which randomly masks out a portion of the input signal and then predicts the missing information using the Fourier inversion theorem. Pre-trained models can be potentially used for various downstream tasks such as sleep stage classification and gesture recognition. Unlike contrastive-based methods, which strongly rely on carefully hand-crafted augmentations and Siamese structures, our approach works reasonably well with a simple transformer encoder with no augmentation requirements. By evaluating our method on several benchmark datasets, we show that Neuro-BERT improves downstream neurological-related tasks by a large margin.
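The premise behind Fourier Inversion Prediction is the inversion theorem itself: a signal is exactly recoverable from its spectrum, so the Fourier domain carries a complete supervisory target for a masked span. A toy demonstration (the signal and mask are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy "neurological" signal: two sinusoids.
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 23 * t)

# Mask a contiguous span, as in masked autoencoding.
masked = signal.copy()
mask = slice(100, 140)
masked[mask] = 0.0

# The Fourier inversion theorem guarantees the full signal is exactly
# recoverable from its spectrum, including the masked span -- which is
# what makes the Fourier domain a complete prediction target.
spectrum = np.fft.rfft(signal)
recovered = np.fft.irfft(spectrum, n=len(signal))
```

In the actual framework a transformer encoder sees `masked` and is trained to predict the missing information, with the Fourier-domain reconstruction serving as supervision.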
JBHI Journal 2026 Journal Article
Methamphetamine dependence poses a significant global health challenge, yet its assessment and the evaluation of treatments like repetitive transcranial magnetic stimulation (rTMS) frequently depend on subjective self-reports, which may introduce uncertainties. While objective neuroimaging modalities such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) offer alternatives, their individual limitations and the reliance on conventional, often hand-crafted, feature extraction can compromise the reliability of derived biomarkers. To overcome these limitations, we propose NeuroCLIP, a novel deep learning framework integrating simultaneously recorded EEG and fNIRS data through a progressive learning strategy. This approach offers a robust and trustworthy data-driven biomarker for methamphetamine addiction. Validation experiments show that NeuroCLIP significantly improves discriminative capability between methamphetamine-dependent individuals and healthy controls compared to models using EEG or fNIRS alone. Furthermore, the proposed framework facilitates objective, brain-based evaluation of rTMS treatment efficacy, demonstrating measurable shifts in neural patterns towards healthy control profiles after treatment. Critically, we establish the trustworthiness of the multimodal data-driven biomarker by showing its strong correlation with psychometrically validated craving scores. These findings suggest that the biomarker derived from EEG-fNIRS data via NeuroCLIP offers enhanced robustness and reliability over single-modality approaches, providing a valuable tool for addiction neuroscience research and potentially improving clinical assessments.
EAAI Journal 2026 Journal Article
JBHI Journal 2026 Journal Article
Drug combination therapy has exhibited favorable effects in treating cancer patients, with less toxicity and adverse reactions compared to monotherapy. To accelerate the discovery of therapeutic drug combinations, numerous computational methods have been developed to predict drug synergy in cancer cell lines, typically modeling the task as binary classification (synergistic vs. non-synergistic) or regression (continuous synergy scores). Yet, a recent study proposes categorizing drug combination benefits into multiple ordered classes (e.g., synergy, Bliss additivity, independent action) based on clinical activities, and suggests that drug combinations remain valuable if they reduce cancer cell viability, even without defined synergy. To distinguish various levels of combination benefits, we present a novel order-aware deep learning model, called OrderCombo. Specifically, OrderCombo extracts the drug representation via a pretrained chemical language model and the cell line representation via an omics-oriented linear network. Then, these representations are fused into a unified embedding for each drug-drug-cell line triplet, by leveraging a hybrid encoder that combines concatenation-based dependencies and attention-based interactions. Finally, an ordinal contrastive loss is designed to promote a discriminative embedding space and maintain class ordinality, thereby improving the predictions of drug combination benefits. We evaluate OrderCombo on a large-scale combination benefit dataset, and in silico results show that our method outperforms the state-of-the-art baselines in terms of prediction accuracy, while maintaining robust generalization to unseen drug pairs and cell lines. Substantial case studies further demonstrate OrderCombo's potential value in discovering novel anticancer drug combinations across different therapeutic levels.
AAAI Conference 2026 Conference Paper
Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge undermining the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on gradient updates of the server-side submodel at each optimization step. Extensive validations on multiple real-world datasets show that SMoFi consistently improves global model accuracy (up to 7.1%) and convergence speed (up to 10.25x). Furthermore, SMoFi has a greater impact with more clients involved and deeper learning models, making it particularly suitable for model training in resource-constrained contexts.
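The core of SMoFi, synchronizing momentum buffers across server-side optimizers, can be sketched as follows. The toy "gradients" and the plain-averaging fusion are assumptions for illustration; the paper's staleness-aware alignment imposes additional per-step constraints not modeled here.

```python
import numpy as np

rng = np.random.default_rng(4)
K, d = 3, 8                                # K silos, toy parameter dimension
w0 = rng.normal(size=d)
ws = [w0.copy() for _ in range(K)]         # server-side submodel copies
bufs = [np.zeros(d) for _ in range(K)]     # one momentum buffer per optimizer

def sgd_momentum_step(w, buf, grad, lr=0.1, mu=0.9):
    buf = mu * buf + grad                  # momentum buffer update
    return w - lr * buf, buf

for step in range(5):
    # Heterogeneous silos produce systematically divergent gradients (toy model).
    grads = [ws[k] + rng.normal(loc=k, scale=0.1, size=d) for k in range(K)]
    for k in range(K):
        ws[k], bufs[k] = sgd_momentum_step(ws[k], bufs[k], grads[k])
    # Step-wise momentum fusion: synchronize the buffers across optimizers,
    # so each silo's next update carries a shared, divergence-damped history.
    fused = np.mean(bufs, axis=0)
    bufs = [fused.copy() for _ in range(K)]
```

Because fusion happens every optimization step rather than only at round boundaries, the momentum history never drifts far apart across silos.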
AAAI Conference 2026 Conference Paper
Automated classification of complex social survey questionnaires is crucial for large-scale social science research but faces significant reliability challenges due to intricate hierarchical label structures, severe class imbalance, semantic ambiguity, and incomplete data coverage. Conventional classification methods often struggle with these combined complexities, yielding results that lack trustworthiness. We introduce HOCM, a framework designed for trustworthy classification in complex, real-world taxonomies. It features two synergistic components: (1) memory-enhanced contrastive learning, tailored to learn robust representations from noisy, imbalanced data by leveraging quality-aware category memory banks; and (2) hierarchical uncertainty calibration, which enforces taxonomic consistency while providing reliable confidence estimates and identifying inputs falling outside well-represented known categories. Our evaluation on a large-scale, real-world social survey dataset—a challenging exemplar of our target problem class—demonstrates that HOCM maintains strong accuracy on known classes while effectively identifying uncertain cases, significantly boosting accuracy on confident predictions. Furthermore, it adeptly detects low-resource/unknown categories. HOCM provides a more reliable automated classification tool, enabling efficient expert review and enhancing the trustworthiness of analysis in domains with complex, hierarchical data.
IROS Conference 2025 Conference Paper
With the accelerated development of Industry 4.0, intelligent manufacturing systems increasingly require efficient task allocation and scheduling in multi-robot systems. However, existing methods rely on domain expertise and face challenges in adapting to dynamic production constraints. Additionally, enterprises have high privacy requirements for production scheduling data, which prevents the use of cloud-based large language models (LLMs) for solution development. To address these challenges, there is an urgent need for an automated modeling solution that meets data privacy requirements. This study proposes a knowledge-augmented mixed integer linear programming (MILP) automated formulation framework, integrating local LLMs with domain-specific knowledge bases to generate executable code from natural language descriptions automatically. The framework employs a knowledge-guided DeepSeek-R1-Distill-Qwen-32B model to extract complex spatiotemporal constraints (82% average accuracy) and leverages a supervised fine-tuned Qwen2.5-Coder-7B-Instruct model for efficient MILP code generation (90% average accuracy). Experimental results demonstrate that the framework successfully achieves automatic modeling in the aircraft skin manufacturing case while ensuring data privacy and computational efficiency. This research provides a low-barrier and highly reliable technical path for modeling in complex industrial scenarios.
AAAI Conference 2025 Conference Paper
Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain, and storage costs are substantial. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring accurate 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method with spatial reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This method maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, while discretizing 2D sketch coordinates into an appropriate planar space to enable precise determination of spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.
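The discretization step, mapping continuous 2D sketch coordinates into a planar token space a language model can emit, can be sketched generically. The bin count and coordinate range below are assumptions, not CAD-GPT's actual tokenization.

```python
import numpy as np

def quantize(coords, n_bins=256, lo=-1.0, hi=1.0):
    """Map continuous 2D sketch coordinates into discrete token ids, so a
    language model can emit them as ordinary vocabulary entries. Each
    (x, y) point lands in one cell of an n_bins x n_bins planar grid."""
    idx = np.clip(((coords - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    return idx[..., 0] * n_bins + idx[..., 1]   # one token per grid cell

pts = np.array([[-1.0, -1.0], [0.0, 0.0], [0.999, 0.999]])
tokens = quantize(pts)
```

The paper's spatial mechanism additionally maps 3D positions and sketch-plane rotations into the 1D linguistic feature space; the same quantize-then-tokenize principle applies there.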
AAAI Conference 2025 Conference Paper
As partial samples are often absent in certain views, incomplete multi-view clustering has become a challenging task. To tackle data with missing views, current methods either utilize the data similarity relations to recover missing samples or primarily consider the available information of existing samples, typically facing some inherent limitations. Firstly, traditional solutions cannot fully explore the potential information contained in missing samples due to their omission strategy, leading to sub-optimal graphs. Moreover, most methods mainly focus on data recovery from the view level, ignoring the differences among available/missing samples in various views. To this end, we propose a collaborative Similarity Fusion and Consistency Recovery (SFCR) method, which resolves the incomplete multi-view clustering problem by learning a unified similarity graph and recovering missing samples with consistent structures. Specifically, to learn a reliable graph compatible across views, a novel view-to-sample fusion model is designed to adaptively coalesce the view-wise similarities among available samples, not only preserving the complementarity and consistency among views but also properly balancing different samples. Furthermore, the missing samples are effectively recovered under the guidance of the fused similarity graph, so as to maintain the consistent structure of recovered data across views. In this way, the similarity learning and the missing data recovery benefit from each other in a collaborative reinforcement manner. Meanwhile, SFCR can directly obtain the final clustering labels without additional post-processing. Extensive experiments demonstrate the effectiveness and superiority of SFCR.
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
As a foundational clustering paradigm, Density Peak Clustering (DPC) partitions samples into clusters based on their density peaks, garnering widespread attention. However, traditional DPC methods usually focus on high-density regions, neglecting representative peaks in relatively low-density areas, particularly in datasets with varying densities and multiple peaks. Moreover, existing DPC variants struggle to identify clusters correctly in high-dimensional spaces due to the indistinct distance differences among samples and sparse data distributions. Additionally, existing methods typically adopt a one-step label assignment strategy, making them prone to cascading errors when initial misassignments occur. To address these challenges, we propose an Enhanced Density Peak Clustering (EDPC) method, which creatively incorporates multilayer perceptron (MLP)-based dimensionality reduction and a hierarchical label assignment strategy to significantly improve clustering performance in high-dimensional scenarios. Specifically, we introduce an effective selection condition that combines average densities and density-related distances to generate potential cluster centers, ensuring that peaks across different density regions are considered simultaneously. Furthermore, an MLP, guided by pseudo-labels from sub-clusters, is designed to learn low-dimensional embeddings for high-dimensional data, preserving data locality while enhancing clusterability. Extensive experiments demonstrate the effectiveness and superiority of EDPC against state-of-the-art DPC methods.
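For context, the classic DPC quantities (local density rho and the distance delta to the nearest higher-density point) and a simple averages-based selection rule can be sketched as below. The rule shown is a simplified stand-in for EDPC's selection condition, and the cutoff `dc` and toy data are illustrative.

```python
import numpy as np

def dpc_centers(X, dc=0.5):
    """Classic density-peak quantities: rho counts neighbours within the
    cutoff dc; delta is the distance to the nearest point of higher density
    (or the maximum distance, for the global peak). Points whose rho AND
    delta both exceed their averages are kept as candidate centers."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    rho = (D < dc).sum(axis=1) - 1
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = D[i, higher].min() if len(higher) else D[i].max()
    return np.where((rho > rho.mean()) & (delta > delta.mean()))[0]

rng = np.random.default_rng(5)
# Two well-separated blobs; a representative peak should emerge from each.
X = np.concatenate([rng.normal(0, 0.3, (60, 2)), rng.normal(6, 0.3, (60, 2))])
centers = dpc_centers(X)
```

Because the condition uses averages rather than a global top-k, peaks in relatively low-density regions can still qualify, which is the motivation EDPC builds on.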
NeurIPS Conference 2025 Conference Paper
Federated Learning (FL) has emerged as a privacy-preserving framework for training models on data generated at the edge. However, the heterogeneity of data silos (e.g., label skew and domain shift) often leads to inconsistent learning objectives and suboptimal model performance. Inspired by the data-driven approach, we propose Flick, a novel data generation framework for heterogeneous Federated Learning with Commonsense Knowledge from Large Language Models (LLMs). In Flick, the client performs the local data summary to capture client-specific knowledge in textual form. The central server then distills task-relevant, high-quality knowledge from the out-of-the-box LLM -- guided by cross-client-specific insights -- to generate informative text prompts. These prompts direct a generative model in producing synthetic data, enabling global model fine-tuning and local data compensation. This process gradually aligns the label and feature distributions across clients. Extensive results on three datasets demonstrate that Flick improves the global model accuracy by up to 11.43%, and accelerates convergence by up to 12.9x, validating its effectiveness in addressing data heterogeneity.
ICLR Conference 2025 Conference Paper
Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real-world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree-like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root-to-leaf paths, providing not only the final prediction but also the multi-granular interpretability for the decision-making process. Extensive experiments on the graph classification benchmarks with both synthetic and real-world datasets demonstrate the superiority of TIF in interpretability, while also delivering a competitive prediction performance akin to the state-of-the-art counterparts.
AAAI Conference 2025 Conference Paper
The extraordinary ability of generative models emerges as a new trend in image editing and generating realistic images, posing a serious threat to the trustworthiness of multimedia data and driving the research of image manipulation detection and location (IMDL). However, the lack of a large-scale data foundation has made progress on the IMDL task unattainable. In this paper, we build a local manipulation data generation pipeline that integrates the powerful capabilities of SAM, LLM, and generative models. Upon this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale: GIM includes over one million pairs of AI-manipulated images and real images. 2) Rich image content: GIM encompasses a broad range of image classes. 3) Diverse generative manipulation: the images are manipulated with state-of-the-art generators across various manipulation tasks. The aforementioned advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce the GIM benchmark with two settings to evaluate existing IMDL methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, Frequency-Spatial block (FSB), and a Multi-Window Anomalous Modeling (MWAM) module. Extensive experiments on GIM demonstrate that GIMFormer surpasses the previous state-of-the-art approach on two different benchmarks.
NeurIPS Conference 2025 Conference Paper
Time Series Imputation (TSI), which aims to recover missing values in temporal data, remains a fundamental challenge due to the complex and often high-rate missingness in real-world scenarios. Existing models typically optimize the point-wise reconstruction loss, focusing on recovering numerical values (local information). However, we observe that under high missing rates, these models still perform well in the training phase yet produce poor imputations and distorted latent representation distributions (global information) in the inference phase. This reveals a critical optimization dilemma: current objectives lack global guidance, leading models to overfit local noise and fail to capture global information of the data. To address this issue, we propose a new training paradigm, Glocal Information Bottleneck (Glocal-IB). Glocal-IB is model-agnostic and extends the standard IB framework by introducing a Global Alignment loss, derived from a tractable mutual information approximation. This loss aligns the latent representations of masked inputs with those of their originally observed counterparts. It helps the model retain global structure and local details while suppressing noise caused by missing values, giving rise to better generalization under high missingness. Extensive experiments on nine datasets confirm that Glocal-IB leads to consistently improved performance and aligned latent representations under missingness. Our code implementation is available at https://github.com/Muyiiii/NeurIPS-25-Glocal-IB.
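The Global Alignment idea, pulling the latent representation of a masked series toward that of its fully observed counterpart, can be sketched with a stand-in encoder. The cosine-distance term below is an illustrative proxy, not the paper's actual mutual-information approximation.

```python
import numpy as np

rng = np.random.default_rng(6)

def encode(x, W):
    """Stand-in encoder: a linear map followed by tanh."""
    return np.tanh(W @ x)

def global_alignment_loss(z_masked, z_full):
    """Alignment term (a cosine-distance proxy for the paper's tractable
    mutual-information approximation): small when the masked series maps
    to the same region of latent space as the fully observed one."""
    cos = z_masked @ z_full / (np.linalg.norm(z_masked) * np.linalg.norm(z_full))
    return 1.0 - cos

d_in, d_z = 32, 8
W = rng.normal(size=(d_z, d_in)) / np.sqrt(d_in)
x = rng.normal(size=d_in)
x_masked = x.copy()
x_masked[rng.random(d_in) < 0.6] = 0.0   # simulate a 60% missing rate

loss = global_alignment_loss(encode(x_masked, W), encode(x, W))
```

Training would minimize this term alongside the usual point-wise reconstruction loss, so the latent geometry stays aligned even when most values are missing.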
AAAI Conference 2025 Conference Paper
Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.
EAAI Journal 2025 Journal Article
JBHI Journal 2025 Journal Article
Recently, advances in neuroscience and the rise of artificial intelligence have significantly enhanced the capabilities of epilepsy diagnosis. While EEG-based diagnosis offers a promising avenue for detecting and predicting seizure activity, practical implementation in real-world scenarios remains hindered by the heterogeneity of epilepsy and the variability of patient-specific biomarkers over time. Conventional deep learning models, trained on historical EEG, often fail to adapt to such biomarker variations, leading to degraded performance. Moreover, the computational and memory constraints of edge devices further exacerbate the challenge of on-device learning. To address these challenges, we introduce a novel framework, Memory-Efficient Intrinsic Gating Adaptation (MEIGA), designed to enhance real-world epilepsy diagnosis on resource-constrained edge devices. Our approach pre-trains a model using historical EEG data and employs lightweight adapter networks for efficient on-device tuning across new sessions, addressing session-to-session variability. By leveraging Direct Feedback Alignment (DFA), MEIGA reduces memory usage and computational overhead while maintaining high classification accuracy. Extensive experiments on the CHB-MIT epilepsy dataset demonstrate that MEIGA outperforms the pretrained-only Vision Transformer baseline, raising seizure prediction accuracy from 47.88% to 86.77% with only 3,908 tunable parameters (5.05% of the backbone). For seizure detection, MEIGA improves accuracy from 85.06% to 96.29% by adapting 2,008 parameters (17.40% of the base architecture). Further experiments on the AES dataset demonstrate that MEIGA consistently delivers strong performance across subjects and scales effectively to larger networks.
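Direct Feedback Alignment, the mechanism MEIGA leverages for its memory savings, replaces the backward pass through the transposed forward weights with a fixed random projection of the output error. A minimal two-layer sketch (the data, sizes, and rates are illustrative only, not MEIGA's architecture):

```python
import numpy as np

rng = np.random.default_rng(7)

d_in, d_h, d_out, n = 16, 32, 2, 200
W1 = rng.normal(size=(d_h, d_in)) * 0.1
W2 = rng.normal(size=(d_out, d_h)) * 0.1
B = rng.normal(size=(d_h, d_out)) * 0.1   # fixed random feedback matrix

X = rng.normal(size=(n, d_in))
y = (X @ rng.normal(size=d_in) > 0).astype(float)   # linearly separable toy labels
Y = np.stack([1 - y, y], axis=1)                    # one-hot targets

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for _ in range(500):
    H = np.tanh(X @ W1.T)                 # forward pass
    P = softmax(H @ W2.T)
    E = P - Y                             # output error
    dW2 = E.T @ H / n                     # ordinary gradient for the top layer
    dH = (E @ B.T) * (1 - H ** 2)         # DFA: random projection, not W2.T @ E
    dW1 = dH.T @ X / n
    W2 -= lr * dW2
    W1 -= lr * dW1

acc = (softmax(np.tanh(X @ W1.T) @ W2.T).argmax(axis=1) == y).mean()
```

Because the hidden layer's teaching signal needs only `E` and the fixed matrix `B`, no transposed-weight backward pass has to be stored, which is the source of the on-device memory reduction.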
IJCAI Conference 2025 Conference Paper
Multi-view clustering aims to integrate complementary information from multiple views to improve clustering performance. However, existing ensemble-based methods suffer from information loss due to their reliance on single-granularity labels, limiting the discriminative capability of learned representations. Meanwhile, representation- and graph-fusion-based approaches face challenges such as explicit view alignment and manual weight tuning, making them less effective for heterogeneous views with varying data distributions. To address these limitations, we propose a novel multi-view clustering framework via Multi-granularity Ensemble (MGE), fully exploiting the multi-granularity information across diverse views for accurate and consistent clustering. Specifically, MGE first modifies hierarchical clustering and then applies it to each view (including the fused view) to obtain multi-granularity labels. Moreover, a cross-view and cross-granularity fusion strategy is designed to learn a robust co-association similarity matrix, which effectively preserves the fine-grained and coarse-grained structures of multi-view data and facilitates subsequent clustering. Therefore, MGE can provide a comprehensive representation of local and global patterns within the data, eliminating the requirement for view alignment and weight tuning. Experiments demonstrate that MGE consistently outperforms state-of-the-art methods across multiple datasets, validating its effectiveness and superiority in handling heterogeneous views.
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
Label correction methods are popular for their simple architecture in learning with noisy labels. However, they suffer severely from false label correction and achieve subpar performance compared with state-of-the-art methods. In this paper, we revisit label correction methods through a theoretical analysis of gradient scaling and demonstrate that sample-wise dynamics and class-wise uniformity of the interpolation weight prevent memorization of mislabeled samples. We then propose DULC, a simple yet effective label correction method that uses the normalized Jensen-Shannon divergence (JSD) metric as the interpolation weight to promote sample-wise dynamics and class-wise uniformity. Additionally, we provide theoretical evidence that sharpening predictions in label correction facilitates memorization of the true class, and we achieve this by employing an augmentation strategy along with a sharpening function. Extensive experiments on CIFAR-10, CIFAR-100, TinyImageNet, WebVision and Clothing1M datasets demonstrate substantial improvements over state-of-the-art methods.
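The JSD-based interpolation described above can be sketched minimally as follows, assuming the weight is the prediction-label JSD normalized by its maximum value log 2; the function names and the exact interpolation direction are illustrative assumptions, not the paper's verbatim formulation:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions (in nats)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def corrected_label(pred, noisy_onehot):
    """Interpolate between the noisy label and the model prediction.

    The weight is the JSD normalized by its maximum log(2), so it lies
    in [0, 1]: a sample whose prediction disagrees strongly with its
    label (likely mislabeled) leans more on the model prediction.
    """
    w = jsd(pred, noisy_onehot) / np.log(2.0)  # normalized JSD weight
    return (1.0 - w) * np.asarray(noisy_onehot) + w * np.asarray(pred)
```

Here a confident, label-consistent prediction yields a small weight and keeps the given label nearly intact, while a strongly disagreeing prediction pushes the corrected target toward the model's own output.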
JBHI Journal 2025 Journal Article
Automatic and precise multi-class vertebrae segmentation from CT images is crucial for various clinical applications. However, due to the similar appearances of adjacent vertebrae and the existence of various pathologies, existing single-stage and multi-stage methods suffer from imprecise vertebrae segmentation. Essentially, these methods fail to explicitly impose both contour precision and intra-vertebrae voxel consistency constraints synchronously, resulting in intra-vertebrae segmentation inconsistency, i.e., multiple label predictions inside a single vertebra. In this work, we propose to label complete binary masks with sequential indices to address this challenge. Specifically, we propose a contour generation network based on Structural Low-Rank Descriptors for shape consistency, termed SLoRD. For a structural representation of vertebral contours, we adopt the spherical coordinate system and devise the spherical centroid to calculate contour descriptors. Owing to the similar appearances of vertebrae, basic contour descriptors can be acquired offline to restore original contours. SLoRD therefore leverages these contour priors and explicit shape constraints to drive regressed contour points close to vertebral surfaces. Quantitative and qualitative evaluations on VerSe 2019 and 2020 demonstrate the superior performance of our framework over other single-stage and multi-stage state-of-the-art (SOTA) methods. Further, SLoRD is a plug-and-play framework that can refine the segmentation inconsistency in coarse predictions from other approaches.
ICRA Conference 2025 Conference Paper
Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such progress depends heavily on large-scale, high-quality annotated data, which is costly and labor-intensive to obtain. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained VLM using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visually driven cooperation in embodied AI systems.
EAAI Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
IROS Conference 2024 Conference Paper
Effectively modeling the spatio-temporal interactions both internally and externally is a challenge in controlling multi-linked snake robots. This paper presents an effective method based on deep predictive coding, SnakeFormer, to address this issue. The main contributions include: 1) deriving a variational free energy function with two innovative regularization terms through Bayesian probabilistic analysis, offering a novel perspective for simulating the interactions between the agent and the environment; 2) introducing an interaction-attention model within a Transformer structure for predicting dynamics, collaboratively addressing path planning and obstacle avoidance tasks; 3) improving gait stability and motion efficiency by incorporating serpenoid embedding and optimizing self-attention computations. Preliminary experiments and comparative analysis with baseline models fully validate the effectiveness and generalizability of the method.
EAAI Journal 2024 Journal Article
IJCAI Conference 2024 Conference Paper
As data with diverse representations become high-dimensional, multi-view unsupervised feature selection has become an important learning paradigm. Generally, existing methods encounter the following challenges: (i) traditional solutions either concatenate different views or introduce extra parameters to weight them, affecting performance and applicability; (ii) emphasis is typically placed on graph construction while disregarding the clustering information of the data; (iii) exploring the similarity structure of all samples from the original features is suboptimal and extremely time-consuming. To solve this dilemma, we propose an efficient multi-view unsupervised feature selection (EMUFS) method that constructs bipartite graphs between samples and anchors. Specifically, a parameter-free manner is devised to collaboratively fuse the membership matrices and graphs to learn the compatible structure information across all views, naturally balancing different views. Moreover, EMUFS leverages the similarity relations of data in the feature subspace induced by the l2,0-norm to dynamically update the graph. Accordingly, the cluster information of anchors can be accurately propagated to samples via the graph structure and further guide feature selection, enhancing the quality of selected features and reducing the computational cost of the solution process. A convergent optimization algorithm is developed to solve the formulated problem, and experiments demonstrate the effectiveness and efficiency of EMUFS.
AAAI Conference 2024 Conference Paper
Accurate prediction of water quality and quantity is crucial for sustainable development and human well-being. However, existing data-driven methods often suffer from spatial biases in model performance due to heterogeneous data, limited observations, and noisy sensor data. To overcome these challenges, we propose Fair-Graph, a novel graph-based recurrent neural network that leverages interrelated knowledge from multiple rivers to predict water flow and temperature within large-scale stream networks. Additionally, we introduce node-specific graph masks for information aggregation and adaptation to enhance prediction over heterogeneous river segments. To reduce performance disparities across river segments, we propose a centralized coordination strategy that adjusts training priorities for segments. We evaluate the prediction of water temperature within the Delaware River Basin and the prediction of streamflow using simulated data from the U.S. National Water Model in the Houston River network. The results showcase improvements in predictive performance and highlight the proposed model's ability to maintain spatial fairness over different river segments.
JBHI Journal 2024 Journal Article
Infant sleep-wake behavior is an essential indicator of physiological and neurological maturity, and its circadian transition is important for evaluating the recovery of preterm infants from inadequate physiological function and cognitive disorders. Recently, camera-based infant sleep-wake monitoring has been investigated, but the generalization challenges caused by variance across infants and clinical environments have not been addressed for this application. In this paper, we conducted a multi-center clinical trial at four hospitals to improve the generalization of camera-based infant sleep-wake monitoring. Using face videos of 64 term and 39 preterm infants recorded in NICUs, we proposed a novel sleep-wake classification strategy, called consistent deep representation constraint (CDRC), that forces the convolutional neural network (CNN) to make consistent predictions for samples from different conditions but with the same label, addressing the variances caused by infants and environments. The clinical validation shows that with CDRC, all CNN backbones obtain over 85% accuracy, sensitivity, and specificity in both the cross-age and cross-environment experiments, improving on the versions without CDRC by almost 15% in all metrics. This demonstrates that by improving the consistency of the deep representation of samples with the same state, we can significantly improve the generalization of infant sleep-wake classification.
IJCAI Conference 2024 Conference Paper
Clinical reasoning refers to the cognitive process that physicians employ in evaluating and managing patients. This process typically involves suggesting necessary examinations, diagnosing patients' diseases, and selecting appropriate therapies. Accurate clinical reasoning requires extensive medical knowledge and rich clinical experience, setting a high bar for physicians. This is particularly challenging in developing countries due to the overwhelming number of patients and limited physician resources, contributing significantly to global health inequity and necessitating automated clinical reasoning approaches. Recently, the emergence of large language models (LLMs) such as ChatGPT and GPT-4 has demonstrated their potential in clinical reasoning. However, these LLMs are prone to hallucination, and their reasoning process may not align with the clinical decision pathways of physicians. In this study, we introduce a novel framework, In-Context Padding (ICP), to enhance LLMs' reasoning with medical knowledge. Specifically, we infer critical clinical reasoning elements (referred to as knowledge seeds) and use them as anchors to guide the generation process of LLMs. Experiments on two clinical question datasets validate that ICP significantly improves the clinical reasoning ability of LLMs.
NeurIPS Conference 2024 Conference Paper
Out-of-Distribution (OoD) detection is vital for the reliability of Deep Neural Networks (DNNs). Existing works have shown the insufficiency of Principal Component Analysis (PCA) applied straightforwardly to the features of DNNs in detecting OoD data from In-Distribution (InD) data. The failure of PCA suggests that the network features of OoD and InD data are not well separated by a simple linear projection, and that proper non-linear mappings are needed instead. In this work, we leverage the framework of Kernel PCA (KPCA) for OoD detection, and seek suitable non-linear kernels that promote the separability between InD and OoD data in the subspace spanned by the principal components. Besides, explicit feature mappings induced from the dedicated task-specific kernels are adopted so that the KPCA reconstruction error for new test samples can be efficiently obtained with large-scale data. Extensive theoretical and empirical results on multiple OoD datasets and network structures verify the superiority of our KPCA detector in efficiency and efficacy, with state-of-the-art detection performance.
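A minimal sketch of the reconstruction-error detector, using a random-Fourier-feature map as one possible explicit mapping for a Gaussian kernel; the paper's task-specific kernels and their induced mappings may differ, and all names and dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(x, W, b):
    """Explicit random-Fourier-feature map approximating a Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(x @ W + b)

d, D, q = 16, 256, 8          # input dim, mapped dim, principal components
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)

# Fit PCA in the explicitly mapped space on "InD" training features.
X_ind = rng.normal(size=(500, d))
Phi = rff_map(X_ind, W, b)
mu = Phi.mean(axis=0)
U = np.linalg.svd(Phi - mu, full_matrices=False)[2][:q].T  # top-q directions

def reconstruction_error(x):
    """KPCA reconstruction error in the mapped space, used as the OoD score."""
    phi = rff_map(x, W, b) - mu
    return np.linalg.norm(phi - (phi @ U) @ U.T, axis=-1)
```

Because the feature map is explicit, scoring a new sample is a single matrix product plus a norm, avoiding the kernel-matrix computations of classical KPCA on large-scale data.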
NeurIPS Conference 2024 Conference Paper
Recent advancements in Multimodal Large Language Models (MLLMs) have greatly improved their abilities in image understanding. However, these models often struggle with grasping pixel-level semantic details, e.g., the keypoints of an object. To bridge this gap, we introduce the novel challenge of Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios, including keypoint semantic understanding, visual prompt-based keypoint detection, and textual prompt-based keypoint detection. Moreover, we introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy to effectively address these challenges. KptLLM underscores the initial discernment of semantics in keypoints, followed by the precise determination of their positions through a chain-of-thought process. With several carefully designed modules, KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations. Our extensive experiments demonstrate KptLLM's superiority in various keypoint detection benchmarks and its unique semantic capabilities in interpreting keypoints.
NeurIPS Conference 2024 Conference Paper
Large language models (LLMs) have demonstrated remarkable capabilities in language understanding and generation, leading to their widespread adoption across various fields. Among these, the medical field is particularly well-suited for LLM applications, as many medical tasks can be enhanced by LLMs. Despite the existence of benchmarks for evaluating LLMs in medical question-answering and exams, there remains a notable gap in assessing LLMs' performance in supporting patients throughout their entire hospital visit journey in real-world clinical practice. In this paper, we address this gap by dividing a typical patient's clinical journey into four stages: planning, access, delivery and ongoing care. For each stage, we introduce multiple tasks and corresponding datasets, resulting in a comprehensive benchmark comprising 12 datasets, of which five are newly introduced, and seven are constructed from existing datasets. This proposed benchmark facilitates a thorough evaluation of LLMs' effectiveness across the entire patient journey, providing insights into their practical application in clinical settings. Additionally, we evaluate three categories of LLMs against this benchmark: 1) proprietary LLM services such as GPT-4; 2) public LLMs like QWen; and 3) specialized medical LLMs, like HuatuoGPT2. Through this extensive evaluation, we aim to provide a better understanding of LLMs' performance in the medical domain, ultimately contributing to their more effective deployment in healthcare settings.
ICRA Conference 2024 Conference Paper
Scoliosis diagnosis and assessment depend largely on the measurement of the Cobb angle in spine X-ray images. With the emergence of deep learning techniques that employ landmark detection, tilt prediction, and spine segmentation, automated Cobb angle measurement has become increasingly popular. However, these methods encounter difficulties such as high noise sensitivity, intricate computational procedures, and exclusive reliance on a single type of morphological information. In this paper, we introduce the Multiple Morphology-Aware Network (MMA-Net), a novel framework that improves Cobb angle measurement accuracy by integrating multiple types of spine morphology as attention information. In the MMA-Net, we first feed spine X-ray images into the segmentation network to produce multiple kinds of morphological information (spine region, centerline, and boundary) and then concatenate the original X-ray image with the resulting segmentation maps as input for the regression module to perform precise Cobb angle measurement. Furthermore, we devise joint loss functions for our segmentation and regression network training, respectively. We evaluate our method on the AASCE challenge dataset and achieve superior performance with an SMAPE of 7.28% and an MAE of 3.18°, indicating strong competitiveness compared to other outstanding methods. Consequently, we can offer clinicians automated, efficient, and reliable Cobb angle measurement.
JAIR Journal 2024 Journal Article
Concepts are an important construct in semantics, based on which humans understand the world with various levels of abstraction. With the recent advances in explainable artificial intelligence (XAI), concept-level explanations are receiving an increasing amount of attention from the broad research community. However, laypeople may find such explanations difficult to digest due to the potential knowledge gap and the concomitant cognitive load. Inspired by prior work that has explored analogies and sensemaking, we argue that augmenting concept-level explanations with analogical inference information from commonsense knowledge can be a potential solution to tackle this issue. To investigate the validity of our proposition, we first designed an effective analogy-based explanation generation method and collected 600 analogy-based explanations from 100 crowd workers. Next, we proposed a set of structured dimensions for the qualitative assessment of such explanations, and conducted an empirical evaluation of the generated analogies with experts. Our findings revealed significant positive correlations between the qualitative dimensions of analogies and the perceived helpfulness of analogy-based explanations, suggesting the effectiveness of the dimensions. To understand the practical utility and the effectiveness of analogy-based explanations in assisting human decision-making, we conducted a follow-up empirical study (N = 280) on a skin cancer detection task with non-expert humans and an imperfect AI system. To this end, we designed a between-subjects study spanning five different experimental conditions with varying types of explanations. The results of our study confirmed that a knowledge gap can prevent participants from understanding concept-level explanations. Consequently, when only the target domain of our designed analogy-based explanation was provided (in a specific experimental condition), participants demonstrated relatively more appropriate reliance on the AI system.
In contrast to our expectations, we found that analogies were not effective in fostering appropriate reliance. We carried out a qualitative analysis of the open-ended responses from participants in the study regarding the perceived usefulness of explanations and analogies. Our findings suggest that human intuition and the perceived plausibility of analogies may have played a role in affecting user reliance on the AI system. We also found that the understanding of commonsense explanations varied with the experience of the recipient user, which points to the need for further work on personalization when leveraging commonsense explanations. In summary, although we did not find quantitative support for our hypotheses around the benefits of using analogies, we found considerable qualitative evidence suggesting the potential of high-quality analogies in aiding non-expert users in their decision-making with AI assistance. These insights can inform the design of future methods for the generation and use of effective analogy-based explanations.
NeurIPS Conference 2024 Conference Paper
Occupancy prediction, which aims to predict the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection of sample data reveals that the vast majority of voxels are unoccupied. Performing classification on these empty voxels results in suboptimal allocation of computational resources, and reducing such empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. Firstly, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making end-to-end training of such a model a reality. Subsequently, semantic classes are adaptively assigned using nearest-neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at nearly 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
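The set-to-set comparison above rests on the Chamfer distance, which can be sketched as below; this is the generic symmetric formulation between a predicted point set and ground-truth occupied locations, not OPUS's exact loss weighting:

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3).

    Each predicted point is pulled toward its nearest ground-truth occupied
    location and vice versa, giving a permutation-invariant set-to-set loss
    that avoids the O(N^3) one-to-one matching of Hungarian-style assignment.
    """
    d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)  # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Because only nearest-neighbor distances are needed, the comparison scales to the very large point sets produced by dense occupancy ground truth.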
YNICL Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Medical insurance fraud has always been a crucial challenge in the healthcare industry. Existing fraud detection models mostly focus on offline learning scenarios. However, fraud patterns are constantly evolving, making it difficult for models trained on past data to detect newly emerging fraud patterns, posing a severe challenge in medical fraud detection. Moreover, current incremental learning models are mostly designed to address catastrophic forgetting, but often exhibit suboptimal performance in fraud detection. To address this challenge, this paper proposes an innovative online learning method for medical insurance fraud detection, named POCL. This method combines contrastive learning pre-training with online updating strategies. In the pre-training stage, we leverage contrastive learning on historical data, enabling deep feature learning and obtaining rich risk representations. In the online learning stage, we adopt a Temporal Memory Aware Synapses online updating strategy, allowing the model to perform incremental learning and optimization based on continuously emerging new data. This ensures timely adaptation to fraud patterns and reduces forgetting of past knowledge. Our model undergoes extensive experiments and evaluations on real-world insurance fraud datasets. The results demonstrate that our model has significant advantages in accuracy compared to state-of-the-art baseline methods, while also exhibiting lower running time and space consumption. Our source code is released at https://github.com/finint/POCL.
NeurIPS Conference 2024 Conference Paper
The interaction between the Fourier transform and deep learning opens new avenues for long-term time series forecasting (LTSF). We propose to reconsider the Fourier transform from a basis-function perspective. Specifically, the real and imaginary parts of the frequency components can be viewed as the coefficients of cosine and sine basis functions at tiered frequency levels, respectively. We argue that existing Fourier-based methods do not involve basis functions and thus fail to interpret frequency coefficients precisely or to consider the time-frequency relationship sufficiently, leading to inconsistent starting cycles and inconsistent series lengths. Accordingly, we propose a novel Fourier basis mapping (FBM) method that addresses these issues by mixing time- and frequency-domain features through Fourier basis expansion. Differing from existing approaches, FBM (i) embeds the discrete Fourier transform with basis functions, and (ii) can be plugged into various types of neural networks for better performance. FBM extracts explicit frequency features while preserving temporal characteristics, enabling the mapping network to capture time-frequency relationships. By incorporating our unique time-frequency features, the FBM variants can enhance any type of network, including linear, multilayer-perceptron-based, transformer-based, and Fourier-based networks, achieving state-of-the-art LTSF results on diverse real-world datasets with just one or three fully connected layers. The code is available at: https://github.com/runze1223/Fourier-Basis-Mapping.
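The basis-function view of the DFT can be illustrated concretely: the rFFT's real and imaginary parts, suitably scaled, are exactly the coefficients of cosine and sine basis functions that reconstruct the series. This is a self-contained numerical sketch of that identity, not the FBM network itself:

```python
import numpy as np

def fourier_basis_features(x):
    """View rFFT coefficients as weights on cosine/sine basis functions.

    Returns cosine coefficients a (from the real parts), sine coefficients b
    (from the negated imaginary parts), and the two bases evaluated on the
    input grid, such that x == cos_basis @ a + sin_basis @ b.
    """
    n = len(x)
    c = np.fft.rfft(x)
    k = np.arange(len(c))
    t = np.arange(n)
    cos_basis = np.cos(2 * np.pi * np.outer(t, k) / n)
    sin_basis = np.sin(2 * np.pi * np.outer(t, k) / n)
    # DC and (for even n) Nyquist terms have no conjugate pair, hence factor 1.
    scale = np.where((k == 0) | (2 * k == n), 1, 2) / n
    a = c.real * scale
    b = -c.imag * scale
    return a, b, cos_basis, sin_basis
```

The coefficients a and b are the explicit time-frequency features a downstream mapping network could consume, while the bases tie each coefficient back to a concrete cycle over the original time grid.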
AAAI Conference 2024 Conference Paper
Over the past two years, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains challenging, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target generated from the fine-grained intra-modal self-similarity. This intra-modal guidance allows two pairs to share local similarities, modeling many-to-many relationships between the two modalities. Besides, since the positive pair still dominates the softened target distribution, we disentangle the negatives in the distribution to further strengthen relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline.
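The softened-target idea can be sketched as a soft cross-entropy whose target distribution comes from intra-modal self-similarity; this is a minimal illustration with made-up names, while SoftCLIP's actual loss, temperature handling, and negative disentanglement are more involved:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_alignment_loss(img_txt_logits, img_img_sim, tau=0.1):
    """Cross-modal contrastive loss with softened targets.

    Instead of a one-hot target per image, the target distribution over
    texts is derived from the intra-modal image self-similarity, so two
    images that look alike are allowed to partially match each other's
    captions: a many-to-many relaxation of CLIP's one-to-one target.
    """
    targets = softmax(img_img_sim / tau)               # softened targets
    log_probs = np.log(softmax(img_txt_logits / tau))  # cross-modal log-likelihoods
    return -(targets * log_probs).sum(axis=1).mean()   # soft cross-entropy
```

When the self-similarity matrix is (near-)identity the loss reduces to the usual one-hot contrastive objective, so the softening only redistributes target mass where the intra-modal structure supports it.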
AAAI Conference 2024 Conference Paper
Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE) require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to optimize in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating Inner Monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. More specifically, we enable LLMs and VLMs to interact through natural language conversation (i.e., Inner Monologue) and propose a two-stage training process to learn how to conduct Inner Monologue (self-asking and answering questions). IMMO is evaluated on two popular tasks and achieves competitive performance with less training data when compared with state-of-the-art models, while preserving interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models.
More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, broadening its potential applications across various AI challenges beyond vision and language tasks.
JBHI Journal 2023 Journal Article
Endoscopy is routinely used to diagnose stomach diseases including intestinal metaplasia (IM) and gastritis atrophy (GA). Such routine examination usually demands that highly skilled radiologists spend substantial time on a single patient, causing two key challenges: 1) dependency on the radiologist's experience, leading to inconsistent diagnosis results across different radiologists; 2) limited examination efficiency due to the time and energy demanded of the radiologist. This paper addresses these two issues with a novel machine learning method in three parts. Firstly, we build a novel and relatively large endoscopy dataset of 21,420 images from the widely used White Light Imaging (WLI) endoscopy and the more recent Linked Color Imaging (LCI) endoscopy, annotated by experienced radiologists and validated with biopsy results, presenting a benchmark dataset. Secondly, we propose a novel machine learning model inspired by the human visual system, named local attention grouping, to effectively extract key visual features, further improved by learning from multiple randomly selected regional images via ensemble learning. This method avoids a significant problem of deep learning methods, which decrease the resolution of original images to reduce the size of input samples and thereby remove smaller lesions in endoscopy images. Finally, we propose a dual transfer learning strategy to train the model with co-distributed features between WLI and LCI images to further improve performance. The experimental results, measured by accuracy, specificity, sensitivity, positive detection rate, and negative detection rate, are 99.18%, 98.90%, 99.45%, 99.45%, and 98.91% on IM, and 97.12%, 95.34%, 98.90%, 98.86%, and 95.50% on GA, respectively, achieving state-of-the-art performance that outperforms current mainstream deep learning models.
ICLR Conference 2023 Conference Paper
This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, which unifies the contextual learning between human-level (global) and keypoint-level (local) information. Different from previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It can provide a good initialization for the latter keypoint detection, making the training process converge fast. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem to learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. In general, ED-Pose is conceptually simple without post-processing and dense heatmap supervision. It demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts the pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based Top-down methods under the same backbone by 1.2 AP on COCO and achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.
AAAI Conference 2023 Conference Paper
Recently, webly supervised learning (WSL) has been studied to leverage numerous and accessible data from the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between the web domain and the real-world domain. However, only by tackling this performance gap can we fully exploit the practical value of web datasets. To this end, we propose a Few-shot guided Prototypical (FoPro) representation learning method, which only needs a few labeled examples from reality and can significantly improve the performance in the real-world domain. Specifically, we initialize each class center with few-shot real-world data as the "realistic" prototype. Then, the intra-class distance between web instances and "realistic" prototypes is narrowed by contrastive learning. Finally, we measure image-prototype distance with a learnable metric. Prototypes are polished by adjacent high-quality web images and involved in removing distant out-of-distribution samples. In experiments, FoPro is trained on web datasets guided by a few real-world examples and evaluated on real-world datasets. Our method achieves the state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-shot settings, FoPro still excels in real-world generalization. Code is available at https://github.com/yuleiqin/fopro.
IJCAI Conference 2023 Conference Paper
Large pre-trained models have revolutionized natural language processing (NLP) research and applications, but high training costs and limited data resources have prevented their benefits from being shared equally amongst speakers of all the world's languages. To address issues of cross-linguistic access to such models and reduce energy consumption for sustainability during large-scale model training, this study proposes an effective and energy-efficient framework called GreenPLM that uses bilingual lexicons to directly "translate" pre-trained language models of one language into another at almost no additional cost. We validate this approach in 18 languages' BERT models and show that this framework is comparable to, if not better than, other heuristics with high training costs. In addition, given lightweight continued pre-training on limited data where available, this framework outperforms the original monolingual language models in six out of seven tested languages with up to 200x less pre-training effort. Aiming at the Leave No One Behind Principle (LNOB), our approach manages to greatly reduce inequalities between languages and energy consumption. We make our codes and models publicly available at https://github.com/qcznlp/GreenPLMs.
EAAI Journal 2023 Journal Article
ECAI Conference 2023 Conference Paper
Stock Movement Prediction (SMP) is a challenging task that aims at predicting the future stock price trend of companies in the stock market. Recent advances mainly apply the Graph Convolutional Network (GCN) to learn connections among companies for SMP. However, these methods usually ignore the semantics of the specific relations (e.g., investment and share) between two entities (i.e., companies and persons) on the market knowledge graph. Meanwhile, considering the long-chain cross-shareholding structures among entities, it is difficult for GCN to obtain high-order neighbor information over long distances. To address these two problems, we present an Attention-aware Multi-order Relation GCN for SMP (AMRGCN-SMP). Specifically, an attention-aware multi-channel aggregation manner achieves the weighted fusion of nodes across multiple semantic channels. Moreover, the dynamic update of the adjacent tensor can fuse the multi-order relation representations and bring more abundant long-chain connections. The experiments on the CSI100E and CSI300E datasets demonstrate that the proposed method achieves state-of-the-art performance compared with the recent advances.
EAAI Journal 2023 Journal Article
AAAI Conference 2023 Conference Paper
Deep Neural Networks (DNNs) possess powerful prediction capability thanks to their over-parameterization design, although their large model complexity makes them suffer from noisy supervision. Recent approaches seek to eliminate impacts from noisy labels by excluding data points with large loss values and show promising performance. However, these approaches usually come with significant computation overhead and lack theoretical analysis. In this paper, we adopt a perspective that connects label noise with epistemic uncertainty. We design a simple, efficient, and theoretically provable robust algorithm named USDNL for DNNs with uncertainty-based Dropout. Specifically, we estimate the epistemic uncertainty of the network prediction after early training through single Dropout. The epistemic uncertainty is then combined with cross-entropy loss to select the clean samples during training. Finally, we theoretically show the equivalence of replacing selection loss with single cross-entropy loss. Compared to existing small-loss selection methods, USDNL features its simplicity for practical scenarios by only applying Dropout to a standard network, while still achieving high model accuracy. Extensive empirical results on both synthetic and real-world datasets show that USDNL outperforms other methods. Our code is available at https://github.com/kovelxyz/USDNL.
NeurIPS Conference 2022 Conference Paper
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.
NeurIPS Conference 2022 Conference Paper
Neural implicit function based on signed distance field (SDF) has achieved impressive progress in reconstructing 3D models with high fidelity. However, such approaches can only represent closed shapes. Recent works based on unsigned distance function (UDF) are proposed to handle both watertight and open surfaces. Nonetheless, as UDF is signless, its direct output is limited to point cloud, which imposes an additional challenge on extracting high-quality meshes from discrete points. To address this issue, we present a new learnable implicit representation, coded HSDF, that connects the good ends of SDF and UDF. In particular, HSDF is able to represent arbitrary topologies containing both closed and open surfaces while being compatible with existing iso-surface extraction techniques for easy field-to-mesh conversion. In addition to predicting a UDF, we propose to learn an additional sign field via a simple classifier. Unlike traditional SDF, HSDF is able to locate the surface of interest before level surface extraction by generating surface points following NDF (Chibane et al. 2020). We are then able to obtain open surfaces via an adaptive meshing approach that only instantiates regions containing surface into a polygon mesh. We also propose HSDF-Net, a dedicated learning framework that factorizes the learning of HSDF into two easier problems. Experiments on multiple datasets show that HSDF outperforms state-of-the-art techniques both qualitatively and quantitatively.
NeurIPS Conference 2022 Conference Paper
The COVID-19 pandemic continues to bring up various topics discussed or debated on social media. In order to explore the impact of pandemics on people's lives, it is crucial to understand the public's concerns and attitudes towards pandemic-related entities (e.g., drugs, vaccines) on social media. However, models trained on existing named entity recognition (NER) or targeted sentiment analysis (TSA) datasets have limited ability to understand COVID-19-related social media texts because these datasets are not designed or annotated from a medical perspective. In this paper, we release METS-CoV, a dataset containing medical entities and targeted sentiments from COVID-19 related tweets. METS-CoV contains 10,000 tweets with 7 types of entities, including 4 medical entity types (Disease, Drug, Symptom, and Vaccine) and 3 general entity types (Person, Location, and Organization). To further investigate tweet users' attitudes toward specific entities, 4 types of entities (Person, Organization, Drug, and Vaccine) are selected and annotated with user sentiments, resulting in a targeted sentiment dataset with 9,101 entities (in 5,278 tweets). To the best of our knowledge, METS-CoV is the first dataset to collect medical entities and corresponding sentiments of COVID-19 related tweets. We benchmark the performance of classical machine learning models and state-of-the-art deep learning models on NER and TSA tasks with extensive experiments. Results show that this dataset has vast room for improvement for both NER and TSA tasks. With rich annotations and comprehensive benchmark results, we believe METS-CoV is a fundamental resource for building better medical social media understanding tools and facilitating computational social science research, especially on epidemiological topics. Our data, annotation guidelines, benchmark models, and source code are publicly available (https://github.com/YLab-Open/METS-CoV) to ensure reproducibility.
AAAI Conference 2022 Short Paper
Most existing multi-view clustering methods have problems with parameter selection and high computational complexity, and there have been very few works based on hierarchical clustering to learn the complementary information of multiple views. In this paper, we propose a Multi-view Adjacency-constrained Nearest Neighbor Clustering (MANNC) and its parameter-free version (MANNC-PF) to overcome these limitations. Experiments tested on eight real-world datasets validate the superiority of the proposed methods compared with 13 current state-of-the-art methods.
JBHI Journal 2022 Journal Article
Visual prostheses with both comprehensive visual signal processing capability and energy efficiency are becoming increasingly demanded in the age of intelligent personal healthcare, particularly with the rise of wearable and implantable devices. To address this trend, we propose NeuroSEE, a neuromorphic energy-efficient processing framework that combines a spike representation encoding technique and a bio-inspired processing method. This framework first utilizes sparse spike trains to represent visual information, and then a bio-inspired spiking neural network (SNN) is adopted to process the spike trains. The SNN model makes use of an IF neuron with multiple spike-firing rates to decrease the energy consumption without compromising prediction performance. The experimental results indicate that when predicting the response of the primary visual cortex, the framework achieves a state-of-the-art Pearson correlation coefficient performance. Spike-based recording and processing methods simplify the storage and transmission of redundant scene information and complex calculation processes. It could reduce power consumption by 15 times compared with the existing convolutional neural network (CNN) processing framework. The proposed NeuroSEE framework predicts the response of the primary visual cortex in an energy-efficient manner, making it a powerful tool for visual prostheses.
JMLR Journal 2021 Journal Article
This paper generalizes regularized regression problems in a hyper-reproducing kernel Hilbert space (hyper-RKHS), illustrates its utility for kernel learning and out-of-sample extensions, and proves asymptotic convergence results for the introduced regression models from an approximation theory view. Algorithmically, we consider two regularized regression models with bivariate forms in this space, including kernel ridge regression (KRR) and support vector regression (SVR) endowed with hyper-RKHS, and further combine divide-and-conquer with Nyström approximation for scalability in large sample cases. This framework is general: the underlying kernel is learned from a broad class, and can be positive definite or not, which adapts to various requirements in kernel learning. Theoretically, we study the convergence behavior of regularized regression algorithms in hyper-RKHS and derive the learning rates, which go beyond the classical analysis on RKHS due to the non-trivial independence of pairwise samples and the characterisation of hyper-RKHS. Experimentally, results on several benchmarks suggest that the employed framework is able to learn a general kernel function from an arbitrary similarity matrix, and thus achieves a satisfactory performance on classification tasks.
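For context, plain kernel ridge regression, the base model the hyper-RKHS framework endows with a learned kernel, can be sketched as follows; the Gaussian kernel and the regularization scaling here are illustrative choices, not the paper's hyper-RKHS formulation.

```python
import numpy as np

def krr_fit_predict(X_train, y_train, X_test, gamma=1.0, lam=1e-2):
    """Plain kernel ridge regression with a Gaussian kernel.

    Solves (K + lam * n * I) alpha = y on the training set, then
    predicts via the test-train kernel block. The hyper-RKHS
    framework replaces this fixed kernel with one learned from data.
    """
    def gauss(A, B):
        # Pairwise squared distances, then Gaussian kernel values.
        D = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * D)

    n = len(y_train)
    K = gauss(X_train, X_train)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y_train)
    return gauss(X_test, X_train) @ alpha
```

The Nyström approximation mentioned in the abstract would replace the full n-by-n kernel matrix `K` with a low-rank factorization built from a subset of landmark points, which is what makes the large-sample case tractable.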
EAAI Journal 2021 Journal Article
AAAI Conference 2021 Conference Paper
Explainability is a key requirement for text classification in many application domains ranging from sentiment analysis to medical diagnosis or legal reviews. Existing methods often rely on “attention” mechanisms for explaining classification results by estimating the relative importance of input units. However, recent studies have shown that such mechanisms tend to mis-identify irrelevant input units in their explanation. In this work, we propose a hybrid human-AI approach that incorporates human rationales into attention-based text classification models to improve the explainability of classification results. Specifically, we ask workers to provide rationales for their annotation by selecting relevant pieces of text. We introduce MARTA, a Bayesian framework that jointly learns an attention-based model and the reliability of workers while injecting human rationales into model training. We derive a principled optimization algorithm based on variational inference with efficient updating rules for learning MARTA parameters. Extensive validation on real-world datasets shows that our framework significantly improves the state of the art both in terms of classification explainability and accuracy.
NeurIPS Conference 2021 Conference Paper
Recent advances in localized implicit functions have enabled neural implicit representation to be scalable to large scenes. However, the regular subdivision of 3D space employed by these approaches fails to take into account the sparsity of the surface occupancy and the varying granularities of geometric details. As a result, its memory footprint grows cubically with the input volume, leading to a prohibitive computational cost even at a moderately dense decomposition. In this work, we present a learnable hierarchical implicit representation for 3D surfaces, coded OctField, that allows high-precision encoding of intricate surfaces with low memory and computational budget. The key to our approach is an adaptive decomposition of 3D scenes that only distributes local implicit functions around the surface of interest. We achieve this goal by introducing a hierarchical octree structure to adaptively subdivide the 3D space according to the surface occupancy and the richness of part geometry. As octree is discrete and non-differentiable, we further propose a novel hierarchical network that models the subdivision of octree cells as a probabilistic process and recursively encodes and decodes both octree structure and surface geometry in a differentiable manner. We demonstrate the value of OctField for a range of shape modeling and reconstruction tasks, showing superiority over alternative approaches.
IJCAI Conference 2021 Conference Paper
Hypergraph, an expressive structure with the flexibility to model higher-order correlations among entities, has recently attracted increasing attention from various research domains. Despite the success of Graph Neural Networks (GNNs) for graph representation learning, how to adapt the powerful GNN-variants directly to hypergraphs remains a challenging problem. In this paper, we propose UniGNN, a unified framework for interpreting the message passing process in graph and hypergraph neural networks, which can generalize general GNN models to hypergraphs. In this framework, meticulously-designed architectures aiming to deepen GNNs can also be incorporated into hypergraphs with the least effort. Extensive experiments have been conducted to demonstrate the effectiveness of UniGNN on multiple real-world datasets, which outperforms the state-of-the-art approaches by a large margin. Especially for the DBLP dataset, we increase the accuracy from 77.4% to 88.8% in the semi-supervised hypernode classification task. We further prove that the proposed message-passing based UniGNN models are at most as powerful as the 1-dimensional Generalized Weisfeiler-Leman (1-GWL) algorithm in terms of distinguishing non-isomorphic hypergraphs. Our code is available at https://github.com/OneForward/UniGNN.
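The two-stage message passing pattern that UniGNN unifies can be sketched with a binary incidence matrix; the mean aggregators below are one simple instantiation for illustration, not UniGNN's exact update rule, and the function name is an assumption.

```python
import numpy as np

def hypergraph_message_pass(X, H):
    """One round of generic two-stage hypergraph message passing.

    X: (n, d) node features; H: (n, m) binary incidence matrix with
    H[v, e] = 1 iff node v belongs to hyperedge e. Assumes every node
    and hyperedge is non-empty.
    Stage 1: each hyperedge aggregates (mean) its member nodes.
    Stage 2: each node aggregates the hyperedges it belongs to.
    UniGNN's contribution is letting stage 2 reuse arbitrary
    GNN-style aggregators in place of the plain mean used here.
    """
    edge_deg = H.sum(axis=0, keepdims=True)   # nodes per hyperedge, (1, m)
    E = (H.T @ X) / edge_deg.T                # hyperedge messages, (m, d)
    node_deg = H.sum(axis=1, keepdims=True)   # hyperedges per node, (n, 1)
    return (H @ E) / node_deg                 # updated node features, (n, d)
```

When every hyperedge contains exactly two nodes, this reduces to ordinary mean-aggregation message passing on a graph, which is why the same interpretation covers both settings.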
AAAI Conference 2020 Conference Paper
Image smoothing is a fundamental procedure in applications of both computer vision and graphics. The required smoothing properties can be different or even contradictory among different tasks. Nevertheless, the inherent smoothing nature of one smoothing operator is usually fixed and thus cannot meet the various requirements of different applications. In this paper, a non-convex non-smooth optimization framework is proposed to achieve diverse smoothing natures where even contradictory smoothing behaviors can be achieved. To this end, we first introduce the truncated Huber penalty function, which has seldom been used in image smoothing. A robust framework is then proposed. When combined with the strong flexibility of the truncated Huber penalty function, our framework is capable of a range of applications and can outperform the state-of-the-art approaches in several tasks. In addition, an efficient numerical solution is provided and its convergence is theoretically guaranteed even when the optimization framework is non-convex and non-smooth. The effectiveness and superior performance of our approach are validated through comprehensive experimental results in a range of applications.
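A truncated Huber penalty of the kind described above can be sketched as a simple piecewise function; the cut-off names `delta` and `tau` and their defaults are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def truncated_huber(x, delta=1.0, tau=4.0):
    """Truncated Huber penalty, a minimal sketch.

    Quadratic for |x| <= delta, linear for delta < |x| <= tau, and
    constant (truncated) beyond tau, so very large residuals, such
    as sharp edges, stop being penalized further. Varying delta and
    tau is what gives the penalty its flexible smoothing behavior.
    """
    ax = np.abs(x)
    quad = 0.5 * ax**2                    # small-residual regime
    lin = delta * (ax - 0.5 * delta)      # mid-range linear regime
    cap = delta * (tau - 0.5 * delta)     # truncation plateau
    return np.where(ax <= delta, quad, np.where(ax <= tau, lin, cap))
```

The truncation is exactly what makes the overall objective non-convex: beyond `tau` the penalty is flat, so gradient information vanishes there, which is why the paper needs a dedicated solver with a convergence guarantee.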
AAAI Conference 2020 Conference Paper
Microblogging platforms such as Twitter are increasingly being used in event detection. Existing approaches mainly use machine learning models and rely on event-related keywords to collect the data for model training. These approaches make strong assumptions on the distribution of the relevant microposts containing the keyword – referred to as the expectation of the distribution – and use it as a posterior regularization parameter during model training. Such approaches are, however, limited as they fail to reliably estimate the informativeness of a keyword and its expectation for model training. This paper introduces a Human-AI loop approach to jointly discover informative keywords for model training while estimating their expectation. Our approach iteratively leverages the crowd to estimate both keyword-specific expectation and the disagreement between the crowd and the model in order to discover new keywords that are most beneficial for model training. These keywords and their expectation not only improve the resulting performance but also make the model training process more transparent. We empirically demonstrate the merits of our approach, both in terms of accuracy and interpretability, on multiple real-world datasets and show that our approach improves the state of the art by 24.3%.
JBHI Journal 2020 Journal Article
Neuroimaging and genetic biomarkers have been widely studied from discriminative perspectives towards Alzheimer's disease (AD) classification, since neuroanatomical patterns and genetic variants are jointly critical indicators for AD diagnosis. Generative methods, designed to model common occurring patterns, could potentially advance the understanding of this disease, but have not been fully explored for AD characterization. Moreover, the introduction of a supervised component into the generative process can constrain the model for more discriminative characterization. In this study, we propose an original method based on supervised topic modeling to characterize AD from a generative perspective, yet maintaining discriminative power at differentiating disease populations. Our topic modeling jointly exploits discretized image features and categorical genetic features. Diagnostic information - cognitively normal (CN), mild cognitive impairment (MCI) and AD - is introduced as a supervision variable. Experimental results on the ADNI cohort demonstrate that our model, while achieving competitive discriminative performance, can discover topics revealing both well-known and novel neuroanatomical patterns including temporal, parietal and frontal regions; as well as associations between genetic factors and neuroanatomical patterns.
YNIMG Journal 2020 Journal Article
JMLR Journal 2020 Journal Article
In this paper, we propose a data-adaptive non-parametric kernel learning framework in margin based kernel methods. In model formulation, given an initial kernel matrix, a data-adaptive matrix with two constraints is imposed in an entry-wise scheme. Learning this data-adaptive matrix in a formulation-free strategy enlarges the margin between classes and thus improves the model flexibility. The introduced two constraints are imposed either exactly (on small data sets) or approximately (on large data sets) in our model, which provides a controllable trade-off between model flexibility and complexity with theoretical demonstration. In algorithm optimization, the objective function of our learning framework is proven to be gradient-Lipschitz continuous. Thereby, kernel and classifier/regressor learning can be efficiently optimized in a unified framework via Nesterov's acceleration. For the scalability issue, we study a decomposition-based approach to our model in the large sample case. The effectiveness of this approximation is illustrated by both empirical studies and theoretical guarantees. Experimental results on various classification and regression benchmark data sets demonstrate that our non-parametric kernel learning framework achieves good performance when compared with other representative kernel learning based algorithms.
AAAI Conference 2020 Conference Paper
This paper develops a multi-task learning framework that attempts to incorporate the image structure knowledge to assist image inpainting, which is not well explored in previous works. The primary idea is to train a shared generator to simultaneously complete the corrupted image and corresponding structures — edge and gradient, thus implicitly encouraging the generator to exploit relevant structure knowledge while inpainting. In the meantime, we also introduce a structure embedding scheme to explicitly embed the learned structure features into the inpainting process, thus providing possible preconditions for image completion. Specifically, a novel pyramid structure loss is proposed to supervise structure learning and embedding. Moreover, an attention mechanism is developed to further exploit the recurrent structures and patterns in the image to refine the generated structures and contents. Through multi-task learning, structure embedding, and attention, our framework takes advantage of the structure knowledge and outperforms several state-of-the-art methods on benchmark datasets quantitatively and qualitatively.
AAAI Conference 2020 Conference Paper
In this paper, we propose a fast surrogate leverage weighted sampling strategy to generate refined random Fourier features for kernel approximation. Compared to the current state-of-the-art method that uses the leverage weighted scheme (Li et al. 2019), our new strategy is simpler and more effective. It uses kernel alignment to guide the sampling process and it can avoid the matrix inversion operator when we compute the leverage function. Given n observations and s random features, our strategy can reduce the time complexity for sampling from O(ns^2 + s^3) to O(ns^2), while achieving comparable (or even slightly better) prediction performance when applied to kernel ridge regression (KRR). In addition, we provide theoretical guarantees on the generalization performance of our approach, and in particular characterize the number of random features required to achieve statistical guarantees in KRR. Experiments on several benchmark datasets demonstrate that our algorithm achieves comparable prediction performance and takes less time cost when compared to (Li et al. 2019).
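As background, plain random Fourier features for the Gaussian kernel, the uniform-sampling baseline that leverage-weighted schemes refine, can be sketched as follows; the function name and defaults are illustrative, and this is not the paper's surrogate sampling strategy itself.

```python
import numpy as np

def rff_features(X, s, gamma=1.0, rng=None):
    """Plain random Fourier features approximating the Gaussian
    kernel k(x, y) = exp(-gamma * ||x - y||^2).

    Frequencies are drawn uniformly as w ~ N(0, 2*gamma*I); a
    leverage-weighted scheme would instead bias this sampling
    toward more informative frequencies.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, s))   # frequencies
    b = rng.uniform(0, 2 * np.pi, size=s)                   # phases
    # z(x) such that z(x) @ z(y) ~= k(x, y), improving with larger s.
    return np.sqrt(2.0 / s) * np.cos(X @ W + b)
```

With these features, KRR on the n-by-s feature matrix replaces the n-by-n kernel matrix, which is where the complexity terms in the abstract come from: the bottleneck moves from forming and inverting K to sampling and applying the s features.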
IJCAI Conference 2020 Conference Paper
End-to-end learning from crowds has recently been introduced as an EM-free approach to training deep neural networks directly from noisy crowdsourced annotations. It models the relationship between true labels and annotations with a specific type of neural layer, termed the crowd layer, which can be trained using pure backpropagation. Parameters of the crowd layer, however, can hardly be interpreted as annotator reliability, as compared with the more principled probabilistic approach. The lack of probabilistic interpretation further prevents extensions of the approach to account for important factors of annotation processes, e.g., instance difficulty. This paper presents SpeeLFC, a structured probabilistic model that incorporates the constraints of probability axioms for parameters of the crowd layer, which allows explicitly modeling annotator reliability while benefiting from the end-to-end training of neural networks. Moreover, we propose SpeeLFC-D, which further takes into account instance difficulty. Extensive validation on real-world datasets shows that our methods improve the state-of-the-art.
JBHI Journal 2019 Journal Article
Content-based medical image retrieval is an important computer-aided diagnosis technique providing clinicians with interpretative references based on visual similarity. In this paper, we focus on the task of histopathological image retrieval for breast cancer diagnosis. The densely-connected multi-magnification hashing (DCMMH) framework is proposed to generate discriminative binary codes by exploiting histopathological images with multiple magnification factors. The low-magnification images are boosted by the accumulated similarity based on local patches that also regularize the feature learning of high-magnification images. In order to fully utilize the information across different magnification levels, a densely-connected architecture is finally deployed for high-low magnification pairs of datasets. Experiments on the BreakHis dataset demonstrate that DCMMH outperforms previous hashing methods on histopathological image retrieval.
EAAI Journal 2019 Journal Article
AAAI Conference 2018 Conference Paper
Spatially localized deformation components are very useful for shape analysis and synthesis in 3D geometry processing. Several methods have recently been developed, with an aim to extract intuitive and interpretable deformation components. However, these techniques suffer from fundamental limitations especially for meshes with noise or large-scale deformations, and may not always be able to identify important deformation components. In this paper we propose a novel mesh-based autoencoder architecture that is able to cope with meshes with irregular topology. We introduce sparse regularization in this framework, which along with convolutional operations, helps localize deformations. Our framework is capable of extracting localized deformation components from mesh data sets with large-scale deformations and is robust to noise. It also provides a nonlinear approach to reconstruction of meshes using the extracted basis, which is more effective than the current linear combination approach. Extensive experiments show that our method outperforms state-of-the-art methods in both qualitative and quantitative evaluations.
AAAI Conference 2018 Conference Paper
Kernel learning is a fundamental technique that has been intensively studied in the past decades. For complicated practical tasks, the traditional "shallow" kernels (e.g., Gaussian kernel and sigmoid kernel) are not flexible enough to produce satisfactory performance. To address this shortcoming, this paper introduces a nonlinear layer in kernel learning to enhance the model flexibility. This layer is pairwise, which fully considers the coupling information among examples. So our model contains a fixed single mapping layer (i.e., a Gaussian kernel) as well as a nonlinear pairwise layer, thereby achieving better flexibility than the existing kernel structures. Moreover, the proposed structure can be seamlessly embedded into Support Vector Machines (SVM), of which the training process can be formulated as a joint optimization problem including nonlinear function learning and standard SVM optimization. We theoretically prove that the objective function is gradient-Lipschitz continuous, which further guides us on how to accelerate the optimization process in a deep kernel architecture. Experimentally, we find that the proposed structure outperforms other state-of-the-art kernel-based algorithms on various benchmark datasets, and thus the effectiveness of the incorporated pairwise layer with its training approach is demonstrated.
YNICL Journal 2017 Journal Article
AAAI Conference 2017 Conference Paper
Feature hierarchy (FH) has proven effective for improving recommendation accuracy. Prior work mainly focuses on the influence of vertically affiliated features (i.e., child–parent) on user–item interactions. The relationships of horizontally organized features (i.e., siblings and cousins) in the hierarchy, however, have been little investigated. We show on real-world datasets that feature relationships in the horizontal dimension can help explain and further model user–item interactions. To fully exploit FH, we propose a unified recommendation framework that seamlessly incorporates both vertical and horizontal dimensions for effective recommendation. Our model further considers two types of semantically rich feature relationships in the horizontal dimension, i.e., complementary and alternative relationships. Extensive validation on four real-world datasets demonstrates the superiority of our approach against the state of the art. An additional benefit of our model is that it provides better interpretations of the generated recommendations.
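The vertical (child–parent) versus horizontal (sibling, cousin) distinction can be made concrete with a small sketch that extracts both relation types from a hierarchy stored as a parent map. The toy category names are invented for illustration:

```python
# Illustrative extraction of horizontal relations from a feature hierarchy.
# The hierarchy is a dict mapping each feature to its parent (None = root).

def siblings(parent_of, f):
    """Horizontal: features sharing the same (non-root) parent as f."""
    p = parent_of.get(f)
    return {g for g, q in parent_of.items()
            if q == p and g != f and p is not None}

def cousins(parent_of, f):
    """Horizontal: features whose parent is a sibling of f's parent."""
    p = parent_of.get(f)
    if p is None:
        return set()
    return {g for g, q in parent_of.items() if q in siblings(parent_of, p)}

# Toy hierarchy (hypothetical): Electronics -> {Phones, Laptops};
# Phones -> {iPhone, Pixel}; Laptops -> {MacBook}.
parent_of = {"Electronics": None, "Phones": "Electronics",
             "Laptops": "Electronics", "iPhone": "Phones",
             "Pixel": "Phones", "MacBook": "Laptops"}
```

Here `iPhone` and `Pixel` are siblings (plausible alternatives), while `iPhone` and `MacBook` are cousins — exactly the horizontal relationships the abstract argues carry complementary/alternative semantics.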
IJCAI Conference 2017 Conference Paper
Representation learning (RL) has recently proven effective in capturing local item relationships by modeling item co-occurrence in individual users' interaction records. However, the value of RL for recommendation has not reached its full potential due to two major drawbacks: 1) recommendation is modeled as a rating prediction problem but should essentially be a personalized ranking one; 2) multi-level organizations of items are neglected for fine-grained item relationships. We design a unified Bayesian framework, MRLR, to learn user and item embeddings from a multi-level item organization, thus benefiting from RL while achieving the goal of personalized ranking. Extensive validation on real-world datasets shows that MRLR consistently outperforms state-of-the-art algorithms.
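The shift the abstract advocates — from rating prediction to personalized ranking — is commonly realized with a pairwise objective such as BPR, which pushes a user's score for an observed item above that for an unobserved one. The sketch below is a generic single-pair BPR-style SGD step on embeddings, not MRLR itself:

```python
import math

def bpr_step(u, i_pos, i_neg, lr=0.05, reg=0.01):
    """One pairwise-ranking SGD step: raise score(u, i_pos) above
    score(u, i_neg), maximizing log sigmoid of the score difference
    with L2 regularization. Embeddings are plain lists of floats."""
    x = sum(a * (p - n) for a, p, n in zip(u, i_pos, i_neg))  # score gap
    sig = 1.0 / (1.0 + math.exp(x))  # d/dx log sigmoid(x) = sigmoid(-x)
    new_u = [a + lr * (sig * (p - n) - reg * a) for a, p, n in zip(u, i_pos, i_neg)]
    new_p = [p + lr * (sig * a - reg * p) for a, p in zip(u, i_pos)]
    new_n = [n + lr * (-sig * a - reg * n) for a, n in zip(u, i_neg)]
    return new_u, new_p, new_n

def score(a, b):
    """Dot-product preference score between two embeddings."""
    return sum(x * y for x, y in zip(a, b))

# Toy training: repeat the step on a single (user, positive, negative) triple.
u, p, n = [0.1, -0.2], [0.3, 0.1], [0.0, 0.4]
for _ in range(200):
    u, p, n = bpr_step(u, p, n)
```

After these updates the positive item outranks the negative one for this user — the ranking property the abstract argues rating-prediction objectives fail to target directly.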
AAAI Conference 2016 Conference Paper
Multi-label propagation aims to transmit the multi-label information from labeled examples to unlabeled examples based on a weighted graph. Existing methods ignore the specific propagation difficulty of different unlabeled examples and conduct the propagation in an imperfect sequence, leading to the error-prone classification of some difficult examples with uncertain labels. To address this problem, this paper associates each possible label with a “teacher”, and proposes a “Multi-Label Teaching-to-Learn and Learning-to-Teach” (ML-TLLT) algorithm, so that the entire propagation process is guided by the teachers and manipulated from simple examples to more difficult ones. In the teaching-to-learn step, the teachers select the simplest examples for the current propagation by investigating both the definitiveness of each possible label of the unlabeled examples, and the dependencies between labels revealed by the labeled examples. In the learning-to-teach step, the teachers reversely learn from the learner’s feedback to properly select the simplest examples for the next propagation. Thorough empirical studies show that due to the optimized propagation sequence designed by the teachers, ML-TLLT yields generally better performance than seven state-of-the-art methods on the typical multi-label benchmark datasets.
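The simple-to-difficult scheduling idea can be sketched as a greedy loop: at each round, label the unlabeled node whose neighbor vote is most definite, then feed it back as labeled. This is a deliberately simplified single-label illustration of curriculum-ordered propagation, not the ML-TLLT algorithm:

```python
# Illustrative curriculum-style propagation on a weighted graph
# (adjacency dict of dicts). Single-label simplification for brevity.

def label_definiteness(scores):
    """Margin between the top two label scores: a simple 'definiteness'
    measure; a larger margin means the example is easier to classify."""
    top = sorted(scores, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]

def curriculum_propagate(graph, labels, num_labels):
    """Greedy simple-to-difficult propagation: each round labels the
    unlabeled node with the most definite weighted neighbor vote."""
    labels = dict(labels)
    unlabeled = [v for v in graph if v not in labels]
    while unlabeled:
        best, best_def, best_scores = None, -1.0, None
        for v in unlabeled:
            scores = [0.0] * num_labels
            for nb, w in graph[v].items():
                if nb in labels:
                    scores[labels[nb]] += w
            d = label_definiteness(scores)
            if d > best_def:
                best, best_def, best_scores = v, d, scores
        labels[best] = max(range(num_labels), key=lambda c: best_scores[c])
        unlabeled.remove(best)
    return labels

# Toy chain a -- b -- c with only "a" labeled: "b" is selected first
# (it has a labeled neighbor), then "c" inherits via "b".
graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 0.5}, "c": {"b": 0.5}}
result = curriculum_propagate(graph, {"a": 0}, num_labels=2)
```

The ordering matters: labeling `c` before `b` would have had to rely on an empty neighbor vote, which is exactly the "imperfect sequence" failure mode the abstract describes.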
EAAI Journal 2015 Journal Article
AAAI Conference 2014 Conference Paper
The smoothness hypothesis is critical for graph-based semi-supervised learning. This paper defines local smoothness, based on which a new algorithm, Reliable Label Inference via Smoothness Hypothesis (ReLISH), is proposed. ReLISH produces smoother labels than several existing methods for both labeled and unlabeled examples. Theoretical analyses demonstrate the good stability and generalizability of ReLISH. Using real-world datasets, our empirical analyses reveal that ReLISH is promising for both transductive and inductive tasks when compared with representative algorithms, including Harmonic Functions, Local and Global Consistency, Constraint Metric Learning, Linear Neighborhood Propagation, and Manifold Regularization.
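A natural way to make "local smoothness" concrete is the graph-Laplacian quadratic form restricted to a single node: the weighted sum of squared label differences to its neighbors. This sketch uses that standard quantity for illustration; it is not necessarily ReLISH's exact definition:

```python
# Illustrative local smoothness of a label function f at node v:
# sum over neighbors nb of w(v, nb) * (f(v) - f(nb))^2.
# Smaller values mean f varies little around v (smoother locally).

def local_smoothness(graph, f, v):
    """Graph-Laplacian quadratic form restricted to node v."""
    return sum(w * (f[v] - f[nb]) ** 2 for nb, w in graph[v].items())

graph = {"a": {"b": 1.0, "c": 2.0}, "b": {"a": 1.0}, "c": {"a": 2.0}}
f = {"a": 1.0, "b": 0.8, "c": 1.1}
s = local_smoothness(graph, f, "a")
```

Summing this quantity over all nodes recovers (twice) the usual global Laplacian smoothness penalty used throughout graph-based semi-supervised learning.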
AAAI Conference 2014 Conference Paper
Manifold learning is a powerful tool for solving nonlinear dimension reduction problems. By assuming that high-dimensional data usually lie on a low-dimensional manifold, many algorithms have been proposed. However, most algorithms simply adopt the traditional graph Laplacian to encode data locality, so their discriminative ability is limited and the embedding results are not always suitable for subsequent classification. Instead, this paper deploys the signed graph Laplacian and proposes Signed Laplacian Embedding (SLE) for supervised dimension reduction. By exploiting the label information, SLE comprehensively transfers the discrimination carried by the original data to the embedded low-dimensional space. Without perturbing the discrimination structure, SLE also retains locality. Theoretically, we prove the immersion property by computing the rank of the projection, and relate SLE to existing algorithms within the patch alignment framework. Thorough empirical studies on synthetic and real datasets demonstrate the effectiveness of SLE.
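The signed graph Laplacian differs from the traditional one by assigning negative weights to edges between differently labeled points, with node degrees taken over absolute weights. The sketch below builds it under one common convention (±1 edge weights, L = D − W with D from |W|); it illustrates the construction only, not SLE's full embedding:

```python
# Illustrative signed graph Laplacian: W_ij = +1 for same-label neighbor
# pairs, -1 for different-label pairs; D uses |W|. One common convention,
# shown here as a sketch rather than SLE's exact formulation.

def signed_laplacian(y, n, neighbors):
    """Build L = D - W for n points with labels y and undirected
    neighbor pairs; returns L as a list of row lists."""
    W = [[0.0] * n for _ in range(n)]
    for i, j in neighbors:
        w = 1.0 if y[i] == y[j] else -1.0
        W[i][j] = W[j][i] = w
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        deg = sum(abs(W[i][j]) for j in range(n))
        for j in range(n):
            L[i][j] = (deg if i == j else 0.0) - W[i][j]
    return L

# 3 points, labels [0, 0, 1]: edge (0,1) is same-class (+1),
# edge (1,2) crosses classes (-1).
L = signed_laplacian([0, 0, 1], 3, [(0, 1), (1, 2)])
```

Unlike the ordinary Laplacian, the rows of this matrix need not sum to zero: the cross-class edge contributes a positive off-diagonal entry, which is what lets the subsequent embedding push differently labeled neighbors apart instead of pulling them together.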
EAAI Journal 2007 Journal Article
EAAI Journal 2006 Journal Article
AIIM Journal 2005 Journal Article