EAAI Journal 2026 Journal Article
A hybrid method for anomaly data detection and reconstruction in proton exchange membrane fuel cells to enhance life prediction accuracy
- Donghai Hu
- Yan Sun
- Yinjie Xu
- Yuan Li
- Biaoyi Liu
- Hua Ding
- Jing Wang
- Hongwei Liu
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AIIM Journal 2026 Journal Article
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Machine learning under limited computational resources has gained increasing attention recently. A common yet challenging scenario is managing multiple time-constrained learning tasks with budgeted computational resources, known as Computational Resource Efficient Learning (CoRE-Learning). To this end, a recently proposed framework, Learning with Adaptive Resource Allocation (LARA), offers a preliminary approach. In this paper, we point out the limitations of LARA, including its reliance on interpolation-based extrapolation methods, the need for a fixed exploration phase, and the use of high-frequency re-estimation and reallocation strategies. To address these issues, we propose Look-ahead and immediate Resource Allocation (LaiRA). Our approach incorporates an efficient Dynamic Kalman Filtering (DKF) for look-ahead feasibility check with limited data and a weight-based online estimator for immediate performance evaluation. For resource allocation, LaiRA constructs an Upper Confidence Bound (UCB) to enable adaptive exploration and introduces an adaptive time-slicing method to reduce task switching costs. Empirical studies validate the effectiveness of our approach.
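The UCB-based adaptive exploration mentioned in the abstract can be illustrated with the classic UCB1 index (a generic sketch of the technique, not LaiRA's actual estimator; the function names are hypothetical):

```python
import math

def ucb1_score(mean_reward, n_pulls, total_pulls, c=2.0):
    """Classic UCB1 index: empirical mean plus an exploration bonus
    that shrinks as a task receives more trials."""
    if n_pulls == 0:
        return float("inf")  # always try an unexplored task first
    return mean_reward + math.sqrt(c * math.log(total_pulls) / n_pulls)

def pick_task(means, pulls):
    """Allocate the next slice of compute to the task with the
    highest upper confidence bound."""
    total = sum(pulls)
    scores = [ucb1_score(m, n, total) for m, n in zip(means, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)
```

For equally explored tasks the rule reduces to greedy selection (`pick_task([0.5, 0.9], [10, 10])` picks task 1), while an untried task always wins the next allocation.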
AAAI Conference 2026 Conference Paper
Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers’ emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross‑modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.
JBHI Journal 2026 Journal Article
The convergence of continuous physiological monitoring and intelligent building systems in smart clinics offers a transformative opportunity for patient-centered care, yet it introduces the challenge of harmonizing clinical fidelity, patient comfort, and operational sustainability. We present DT-ECO, a privacy-preserving digital twins framework that enables decision-centric co-management of multi-modal patient monitoring and clinical environmental systems. DT-ECO constructs a hybrid digital twin that integrates a physics-informed building model with graph-temporal physiological inference and battery electrochemistry, enabling real-time synchronization between patient state, IoT device operation, and environmental dynamics within a differentiable programming environment. On this foundation, a hierarchical control strategy is developed, in which a constrained deep reinforcement learning agent adaptively schedules wearable IoT sensor sampling to extend device lifetime, while a model predictive controller orchestrates HVAC operation and on-site energy resources to maintain a therapeutic environment. Extensive evaluations on DOE reference hospitals and public ECG datasets demonstrate that DT-ECO achieves a 31.8% reduction in annual energy consumption and extends median wearable battery life by 28%, while rigorously maintaining clinical standards, as evidenced by less than 0.6% thermal comfort violation and no degradation in arrhythmia detection capability (F1-score 0.956). By bridging the gap between patient physiology and the clinical environment, DT-ECO establishes a pathway toward precision healthcare facilities that are simultaneously patient-centric, diagnostically robust, and operationally sustainable.
YNIMG Journal 2026 Journal Article
JBHI Journal 2026 Journal Article
Hyperspectral imaging (HSI) holds immense potential for medical diagnostics by capturing tissue-specific spectral signatures that facilitate precise disease detection. However, effective HSI classification in clinical settings is hindered by two main challenges: (i) the severe lack of labelled medical HSI samples constrains model training. Prototypical networks, as a few-shot learning paradigm, have been adopted to address label scarcity. However, current Euclidean-based prototypical methods typically assume equal feature variance and spherical distributions, while ignoring intraclass covariance and spectral correlations; (ii) significant domain shifts across heterogeneous medical HSI datasets undermine model generalisation, impair multi-domain interpretability, and force expensive per-dataset retraining. To overcome these limitations, we propose a novel distance-learning-based prototypical network with multi-domain adaptation for few-shot hyperspectral medical image classification. First, by embedding a class-covariance-aware Mahalanobis metric within the prototypical block, our module adapts similarity measures to each class's intrinsic spectral–spatial covariance and scale variations, thereby enhancing prototype robustness under severe label scarcity and significantly reducing misclassification compared with existing few-shot networks. Secondly, we introduce the domain-aware adapter block designed to address domain shift and multi-domain variability by dynamically fusing shared spectral–spatial representations with domain-specific characteristics via spectral integration and switchable adapters. We undertook extensive experiments on three publicly available hyperspectral medical datasets: skin dermoscopy, multidimensional choledochal, and an in-vivo brain dataset.
Compared to state-of-the-art classifiers, the proposed method achieved excellent performance on all three datasets, paving the way for generalisable HSI solutions in clinical workflows and biomedical research.
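A class-covariance-aware Mahalanobis metric of the kind described above can be sketched as a nearest-prototype rule (a simplified, generic illustration, not the paper's module; the shrinkage term and function name are assumptions added to keep the few-shot covariance invertible):

```python
import numpy as np

def mahalanobis_prototype_classify(support, labels, query, shrink=0.1):
    """Assign `query` to the class whose prototype is nearest under a
    class-specific Mahalanobis metric. `support`: (N, D) features,
    `labels`: (N,) class ids. Shrinking the covariance toward the
    identity keeps it invertible when shots are scarce."""
    best, best_dist = None, np.inf
    for c in np.unique(labels):
        x = support[labels == c]            # shots of class c
        proto = x.mean(axis=0)              # class prototype
        cov = np.cov(x, rowvar=False) if len(x) > 1 else np.eye(x.shape[1])
        cov = (1 - shrink) * cov + shrink * np.eye(support.shape[1])
        diff = query - proto
        d = diff @ np.linalg.inv(cov) @ diff  # squared Mahalanobis distance
        if d < best_dist:
            best, best_dist = int(c), d
    return best
```

Unlike the Euclidean prototype rule, this distance stretches along directions of high within-class variance, which is the covariance-awareness the abstract contrasts with spherical assumptions.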
AAAI Conference 2026 Conference Paper
Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components—pairs of singular vectors—which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4× less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
AAAI Conference 2026 Conference Paper
Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth’s surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.
AAAI Conference 2026 Conference Paper
Biological intelligence has driven significant progress in artificial intelligence (AI), but a critical gap remains: biological systems inherit innate abilities from genes, with brains initialized by blueprints refined over 3.5 billion years of evolution, while machines rely heavily on inefficient, data-driven learning from scratch. This gap arises from the lack of a genetic mechanism in machines to transfer and accumulate inheritable knowledge across generations. To bridge this gap, we propose learngenes, network fragments that act as inheritable 'genes' for machines. Unlike conventional knowledge transfer methods, learngenes enable efficient and universal knowledge transfer by selectively encapsulating task-agnostic knowledge. To facilitate the transfer and accumulation of task-agnostic knowledge across generations, we introduce Genetic Reinforcement Learning (GRL), a framework that simulates the learning and evolution of organisms in intelligent agents following Lamarckian principles. Through GRL, we identify learngenes as network fragments within agents' policy networks, equipping newborn agents with innate abilities for rapid adaptation to novel tasks. We demonstrate the advantages of learngene-based knowledge transfer over evolution-based search and traditional pre-trained models, and show how learngenes evolve through the accumulation of task-agnostic knowledge. Overall, this work establishes a novel paradigm for knowledge transfer and model initialization in AI, offering new possibilities for more adaptive, efficient, and scalable learning systems.
AAAI Conference 2026 Conference Paper
The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the ControlNet Relevance Score, which measures the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta.
AAAI Conference 2026 Conference Paper
Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on the score matching from 3D boxes or pre-trained diffusion priors. However, they typically require multi-step iterations in inference, which limits efficiency. To address this, we propose a Robust single-stage fully Sparse 3D object Detection Network with a Detachable Latent Framework (DLF) of DDPMs, named RSDNet. Specifically, RSDNet learns the denoising process in latent feature spaces through lightweight denoising networks like multi-level denoising autoencoders (DAEs). This enables RSDNet to effectively understand scene distributions under multi-level perturbations, achieving robust and reliable detection. Meanwhile, we reformulate the noising and denoising mechanisms of DDPMs, enabling DLF to construct multi-type and multi-level noise samples and targets, enhancing RSDNet robustness to multiple perturbations. Furthermore, a semantic-geometric conditional guidance is introduced to perceive the object boundaries and shapes, alleviating the center feature missing problem in sparse representations, enabling RSDNet to perform in a fully sparse detection pipeline. Moreover, the detachable denoising network design of DLF enables RSDNet to perform single-step detection in inference, further enhancing detection efficiency. Extensive experiments on public benchmarks show that RSDNet can outperform existing methods, achieving state-of-the-art detection.
JBHI Journal 2026 Journal Article
Consumer health devices generate massive volumes of sensitive medical data requiring secure authentication mechanisms that accommodate the resource constraints of wearable sensors and portable diagnostic equipment. Traditional centralized authentication approaches in Internet of Medical Things (IoMT) environments suffer from single points of failure, privacy vulnerabilities, and scalability limitations when managing diverse health monitoring devices. This paper presents secure healthcare IoMT enhanced lightweight device authentication (SHIELD), a blockchain-based lightweight authentication framework designed for resource-constrained consumer health devices. The framework leverages blockchain's immutable and decentralized properties, combined with efficient elliptic curve cryptography, to ensure secure storage and verification of device identities while providing mutual authentication between health devices and medical data servers. Security analysis demonstrates that SHIELD satisfies twelve critical security properties, including decentralization, resistance to password guessing and replay attacks, perfect forward secrecy, and session key security. Performance evaluation reveals that SHIELD achieves an authentication latency of 9.837 milliseconds, representing a 31% improvement over previous best-performing schemes. The framework requires only 1384 bits of communication overhead and maintains minimal average delay times suitable for real-time health monitoring applications. Blockchain implementation analysis confirms practical deployment feasibility with 0.0356 MGas operational costs per authentication session.
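SHIELD itself uses elliptic curve cryptography and a blockchain; as a much simpler stand-in, a symmetric challenge–response handshake shows the basic shape of mutual authentication between a device and a server (a generic sketch with hypothetical names, not the SHIELD protocol):

```python
import hashlib
import hmac
import os

def respond(key: bytes, challenge: bytes) -> bytes:
    """Prove possession of the shared key by MACing the challenge."""
    return hmac.new(key, challenge, hashlib.sha256).digest()

def mutual_auth(device_key: bytes, server_key: bytes) -> bool:
    """Each side challenges the other with a fresh nonce; both checks
    succeed only if the two parties hold the same pre-shared key.
    Fresh nonces are what resist replay of old responses."""
    c_from_server, c_from_device = os.urandom(16), os.urandom(16)
    ok_device = hmac.compare_digest(respond(device_key, c_from_server),
                                    respond(server_key, c_from_server))
    ok_server = hmac.compare_digest(respond(server_key, c_from_device),
                                    respond(device_key, c_from_device))
    return ok_device and ok_server
```

`hmac.compare_digest` performs a timing-safe comparison; an ECC scheme like SHIELD's replaces the pre-shared key with public-key material and adds forward secrecy, which this sketch does not provide.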
YNIMG Journal 2025 Journal Article
AIIM Journal 2025 Journal Article
YNIMG Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Existing imitation learning methods decouple perception and action, which overlooks the causal reciprocity between sensory representations and action execution that humans naturally leverage for adaptive behaviors. To bridge this gap, we introduce Action-Guided Diffusion Policy (DP-AG), a unified representation learning framework that explicitly models a dynamic interplay between perception and action through probabilistic latent dynamics. DP-AG encodes latent observations into a Gaussian posterior via variational inference and evolves them using an action-guided SDE, where the Vector–Jacobian Product (VJP) of the diffusion policy's noise predictions serves as a structured stochastic force driving latent updates. To promote bidirectional learning between perception and action, we introduce a cycle-consistent contrastive loss that organizes the gradient flow of the noise predictor into a coherent perception–action loop, enforcing mutually consistent transitions in both latent updates and action refinements. Theoretically, we derive a variational lower bound for the action-guided SDE, and prove that the contrastive objective enhances continuity in both latent and action trajectories. Empirically, DP-AG significantly outperforms state-of-the-art methods across simulation benchmarks and real-world UR5 manipulation tasks. As a result, our DP-AG offers a promising step toward bridging biological adaptability and artificial policy learning. Code is available on our project website: https://jingwang18.github.io/dp-ag.github.io/.
EAAI Journal 2025 Journal Article
IJCAI Conference 2025 Conference Paper
Precise 3D hand posture is essential for learning musical instruments. Reconstructing highly precise 3D hand gestures enables learners to correct and master proper techniques through 3D simulation and Extended Reality. However, existing methods typically rely on precisely calibrated multi-camera systems, which are not easily deployable in everyday environments. In this paper, we focus on calibration-free multi-view 3D hand reconstruction in unconstrained scenarios. Establishing correspondences between multi-view images is particularly challenging without camera extrinsics. To address this, we propose A^3-Net, a multi-level alignment framework that utilizes 3D structural representations with hierarchical geometric and explicit semantic information as alignment proxies, facilitating multi-view feature interaction in both 3D geometric space and 2D visual space. Specifically, we first perform global geometric alignment to map multi-view features into a canonical space. Subsequently, we aggregate information into predefined sparse and dense proxies to further integrate cross-view semantics through mutual interaction. Finally, we perform 2D alignment to align projected 2D visual features with 2D observations. Our method achieves state-of-the-art results in the multi-view 3D hand reconstruction task, demonstrating the effectiveness of our proposed framework.
JBHI Journal 2025 Journal Article
Low-dose computed tomography (LDCT) is a specialized CT scan with a lower radiation dose than normal-dose CT. However, the reduced radiation dose can introduce noise and artifacts, affecting diagnostic accuracy. To enhance the LDCT image quality, we propose a Contextual Contrast Detail Attention Feature Fusion Network (CDAF-Net) for LDCT denoising. Firstly, the LDCT image, with dimensions 1 × H × W, is mapped to a feature map with dimensions C × H × W, and it is processed through the Contextual Contrast Detail Attention (CCDA) module and the Selective Kernel Feature Fusion (SKFF) module. The CCDA module combines a global contextual attention mechanism with detail-enhanced differential convolutions to better understand the overall semantics and structure of the LDCT image, capturing subtle changes and details. The SKFF module effectively merges shallow features extracted by the encoder with deep features from the decoder, integrating feature representations from different levels. This process is repeated across four different resolution feature maps, and the denoised LDCT image is output through a skip connection. We conduct experiments on the Mayo dataset, the LDCT-and-Projection-Data dataset, and the Piglet dataset. Specifically, the CDAF-Net achieves the optimal metrics with a PSNR of 33.7262 dB, an SSIM of 0.9254, and an RMSE of 5.3731 on the Mayo dataset. Improvements are also observed in head CT and ultra-low-dose chest CT images of the LDCT-and-Projection-Data dataset and the Piglet dataset. Experimental results show that the proposed CDAF-Net algorithm provides superior denoising performance compared with the state-of-the-art (SOTA) algorithms.
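The PSNR and RMSE figures quoted above follow their standard definitions; a minimal sketch (generic metric code, not the paper's evaluation pipeline):

```python
import numpy as np

def rmse(ref, img):
    """Root-mean-square error between a reference and a denoised image."""
    diff = ref.astype(np.float64) - img.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(ref, img, data_range=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the
    reference. `data_range` is the maximum possible pixel value."""
    e = rmse(ref, img)
    if e == 0:
        return float("inf")  # identical images
    return float(20.0 * np.log10(data_range / e))
```

For CT data the `data_range` is usually taken over the Hounsfield-unit window used in evaluation rather than 255, so reported PSNR values depend on that convention.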
ICLR Conference 2025 Conference Paper
The rapid advancements of AI rely on the support of integrated circuits (ICs). However, the growing complexity of digital ICs makes the traditional IC design process costly and time-consuming. In recent years, AI-assisted IC design methods have demonstrated great potential, but most methods are task-specific or focus solely on the circuit structure in graph format, overlooking other circuit modalities with rich functional information. In this paper, we introduce CircuitFusion, the first multimodal and implementation-aware circuit encoder. It encodes circuits into general representations that support different downstream circuit design tasks. To learn from circuits, we propose to fuse three circuit modalities: hardware code, structural graph, and functionality summary. More importantly, we identify four unique properties of circuits: parallel execution, functional equivalent transformation, multiple design stages, and circuit reusability. Based on these properties, we propose new strategies for both the development and application of CircuitFusion: 1) During circuit preprocessing, utilizing the parallel nature of circuits, we split each circuit into multiple sub-circuits based on sequential-element boundaries, each sub-circuit in three modalities. It enables fine-grained encoding at the sub-circuit level. 2) During CircuitFusion pre-training, we introduce three self-supervised tasks that utilize equivalent transformations both within and across modalities. We further utilize the multi-stage property of circuits to align representation with ultimate circuit implementation. 3) When applying CircuitFusion to downstream tasks, we propose a new retrieval-augmented inference method, which retrieves similar known circuits as a reference for predictions. It improves fine-tuning performance and even enables zero-shot inference. 
Evaluated on five different circuit design tasks, CircuitFusion consistently outperforms the state-of-the-art supervised method specifically developed for every single task, demonstrating its generalizability and ability to learn circuits' inherent properties.
AAAI Conference 2025 Conference Paper
Current collaborative perception methods often rely on fully annotated datasets, which can be expensive to obtain in practical situations. To reduce annotation costs, some works adopt sparsely supervised learning techniques and generate pseudo labels for the missing instances. However, these methods fail to achieve an optimal confidence threshold that harmonizes the quality and quantity of pseudo labels. To address this issue, we propose an end-to-end Collaborative perception Dual Teacher-Student framework (CoDTS), which employs adaptive complementary learning to produce both high-quality and high-quantity pseudo labels. Specifically, the Main Foreground Mining (MFM) module generates high-quality pseudo labels based on the prediction of the static teacher. Subsequently, the Supplement Foreground Mining (SFM) module ensures a balance between the quality and quantity of pseudo labels by adaptively identifying missing instances based on the prediction of the dynamic teacher. Additionally, the Neighbor Anchor Sampling (NAS) module is incorporated to enhance the representation of pseudo labels. To promote the adaptive complementary learning, we implement a staged training strategy that trains the student and dynamic teacher in a mutually beneficial manner. Extensive experiments demonstrate that the CoDTS effectively ensures an optimal balance of pseudo labels in both quality and quantity, establishing a new state-of-the-art in sparsely supervised collaborative perception.
ICML Conference 2025 Conference Paper
Open-Set Domain Adaptation (OSDA) aims to transfer knowledge from the labeled source domain to the unlabeled target domain that contains unknown categories, thus facing the challenges of domain shift and unknown category recognition. While recent works have demonstrated the potential of causality for domain alignment, little exploration has been conducted on causal-inspired theoretical frameworks for OSDA. To fill this gap, we introduce the concept of Susceptibility and propose a novel Counterfactual-based susceptibility risk framework for OSDA, termed COSDA. Specifically, COSDA consists of three novel components: (i) a Susceptibility Risk Estimator (SRE) for capturing causal information, along with comprehensive derivations of the computable theoretical upper bound, forming a risk minimization framework under the OSDA paradigm; (ii) a Contrastive Feature Alignment (CFA) module, which is theoretically proven based on mutual information to satisfy the Exogeneity assumption and facilitate cross-domain feature alignment; (iii) a Virtual Multi-unknown-categories Prototype (VMP) pseudo-labeling strategy, providing label information by measuring how similar samples are to known and multiple virtual unknown category prototypes, thereby assisting in open-set recognition and intra-class discriminative feature learning. Extensive experiments demonstrate that our approach achieves state-of-the-art performance.
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
ICML Conference 2025 Conference Paper
We explore the potential of AI-enhanced combinatorial optimization theory, taking online bipartite matching (OBM) as a case study. In the theoretical study of OBM, the hardness corresponds to a performance upper bound of a specific online algorithm or any possible online algorithms. Typically, these upper bounds derive from challenging instances meticulously designed by theoretical computer scientists. Zhang et al. (ICML 2024) recently provide an example demonstrating how reinforcement learning techniques enhance the hardness result of a specific OBM model. Their attempt is inspiring but preliminary. It is unclear whether their methods can be applied to other OBM problems with similar breakthroughs. This paper takes a further step by introducing DiMa, a unified and novel framework that aims at understanding the hardness of OBM problems based on denoising diffusion probabilistic models (DDPMs). DiMa models the process of generating hard instances as denoising steps, and optimizes them by a novel reinforcement learning algorithm, named shortcut policy gradient (SPG). We first examine DiMa on the classic OBM problem by reproducing its known hardest input instance in literature. Further, we apply DiMa to two well-known variants of OBM, for which the exact hardness remains an open problem, and we successfully improve their theoretical state-of-the-art upper bounds.
EAAI Journal 2025 Journal Article
IROS Conference 2025 Conference Paper
Large language models have demonstrated powerful reasoning capabilities, and their integration with robotics has revolutionized human-computer interaction and automated task planning. However, LLMs are unaware of environmental knowledge and possible state changes in the environment during planning, which makes the generated tasks unexecutable, particularly when dealing with complex long-horizon tasks involving crowded objects and dynamic relations. In this paper, we propose an LLM-based robot task planning framework with support for environmental knowledge injection, which is called DRP (Decomposition-Reflection-Prediction). The DRP framework combines LLMs with rule-based task decomposition, multi-perspective reflection and environmental prediction to generate admissible actions for complex long-horizon tasks. We only leverage few-shot prompting to implement our framework, which avoids the need for additional model training work. Experiments on the VirtualHome household task dataset show that the task plans generated by our method improve executability by 25.23%, the subgoal success rate by 64.29%, and the success rate by 58.06%, in comparison to state-of-the-art baseline methods. The complete code of our framework has been made public at https://github.com/lab-bj/taskplanning.
NeurIPS Conference 2025 Conference Paper
Flow-based generative models have gained popularity for image generation and editing. For instruction-based image editing, it is critical to ensure that modifications are confined to the targeted regions. Yet existing methods often fail to maintain consistency in non-targeted regions between the original and edited images. Our primary contribution is to identify the cause of this limitation as the error accumulation across individual editing steps and to address it by incorporating the historical editing trajectory. Specifically, we formulate image editing as a control problem and leverage the Kalman filter to integrate the historical editing trajectory. Our proposed algorithm, dubbed Kalman-Edit, reuses early-stage details from the historical trajectory to enhance the structural consistency of the editing results. To speed up editing, we introduce a shortcut technique based on approximate vector field velocity estimation. Extensive experiments on several datasets demonstrate its superior performance compared to previous state-of-the-art methods.
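The Kalman filter that Kalman-Edit builds on can be shown in its simplest scalar form (a textbook sketch of the filtering idea, not the paper's latent-space algorithm; the noise parameters are arbitrary):

```python
def kalman_update(x, p, z, q=1e-3, r=1e-1):
    """One predict/update cycle of a 1-D Kalman filter.
    x: state estimate, p: estimate variance, z: new measurement,
    q: process-noise variance, r: measurement-noise variance."""
    # Predict: the state persists, uncertainty grows by process noise.
    p = p + q
    # Update: blend prediction and measurement by the Kalman gain.
    k = p / (p + r)
    x = x + k * (z - x)
    p = (1 - k) * p
    return x, p

def smooth(measurements, x0=0.0, p0=1.0):
    """Filter a noisy scalar sequence, returning the running estimates."""
    x, p = x0, p0
    out = []
    for z in measurements:
        x, p = kalman_update(x, p, z)
        out.append(x)
    return out
```

Because the gain weighs the whole history of observations rather than only the latest step, per-step errors stop accumulating, which is the property the abstract exploits for step-wise editing trajectories.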
IJCAI Conference 2025 Conference Paper
Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model's capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII) and Temporal Affinity Refiner (TAR) at the beginning and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark and facilitates the synthesis of dynamic and consistent videos. Note that the T2V process of FancyVideo essentially involves a text-to-image step followed by T+I2V. This means it also supports the generation of videos from user images, i.e., the image-to-video (I2V) task. A significant number of experiments have shown that its performance is also outstanding.
YNIMG Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Medical time-series analysis differs fundamentally from general time-series analysis in requiring specialized domain knowledge to interpret complex signals and clinical context. Large language models (LLMs) hold great promise for augmenting medical time-series analysis by complementing raw series with rich contextual knowledge drawn from biomedical literature and clinical guidelines. However, realizing this potential depends on precise and meaningful prompts that guide the LLM to key information. Yet, determining what constitutes effective prompt content remains non-trivial—especially in medical settings where signal interpretation often hinges on subtle, expert-defined decision-making indicators. To this end, we propose InDiGO, a knowledge-aware evolutionary learning framework that integrates clinical signals and decision-making indicators through iterative optimization. Across four medical benchmarks, InDiGO consistently outperforms prior methods. The code is available at: https://github.com/jinxyBJTU/InDiGO.
NeurIPS Conference 2025 Conference Paper
Aircraft manufacturing is the jewel in the crown of industry, in which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. Existing deep learning methods, which typically rely on predefined parametric representations (e.g., Bézier curves) or discrete point sets, face an inherent trade-off between expressive power and resolution adaptability. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly reconstructs airfoil geometries as function curves. Our method inherits the advantages of arbitrary-resolution sampling and smoothness from parametric functions, as well as the strong expressiveness of discrete point-based representations. Empirical evaluations demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation, achieving a relative 74.4% reduction in label error and a 23.2% increase in diversity on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design.
NeurIPS Conference 2025 Conference Paper
Recent advances in hand-object interaction modeling have employed implicit representations, such as Signed Distance Functions (SDF) and Neural Radiance Fields (NeRF) to reconstruct hands and objects with arbitrary topology and photo-realistic detail. However, these methods often rely on dense 3D surface annotations, or are tailored to short clips constrained in motion trajectories and scene contexts, limiting their generalization to diverse environments and movement patterns. In this work, we present HOGS, an adaptively perceptive 3D Gaussian Splatting (3DGS) framework for generalizable hand-object modeling from unconstrained monocular RGB images. By integrating photometric cues from the visual modality with the physically grounded structure of 3D Gaussians, HOGS disentangles inherent geometry from transient lighting and motion-induced appearance changes. This endows hand-object assets with the ability to generalize to unseen environments and dynamic motion patterns. Experiments on two challenging datasets demonstrate that HOGS outperforms state-of-the-art methods in monocular hand-object reconstruction and photo-realistic rendering.
EAAI Journal 2025 Journal Article
JBHI Journal 2025 Journal Article
To improve the performance of object recognition under artificial prosthetic vision, this study proposes a two-stage method. The first stage extracts the saliency and edge mask of the object (SMP, EMP). The irregular visual information of the object is then processed using Irregularity Correction (IC). We design eye-hand coordination tasks and simulate artificial vision with retinal prostheses to validate the strategy's effectiveness, selecting direct pixelation (DP) as a control group. Each subject retained a phosphene map in the same stochastic pattern in all of his/her trials. The real-time experimental results showed that the deep saliency-based optimization strategies improved the subjects' performance when completing tasks, in terms of head movement, recognition accuracy, response time, and success counts for small-object recognition. The subjects had the smallest average head movement (76.53 deg ± 20.75 deg), higher average object recognition accuracy (91.18% ± 2.52%), less time for finishing the task (35.71 s ± 8.66 s), and more successful searches for small target objects (1.35 ± 0.33) under the SMP strategy. When integrating with IC, subjects' average performance further improved to 63.39 ± 15.38 deg, 94.22% ± 3.94%, 25.76 s ± 6.24 s, and 1.05 ± 0.30 respectively, which also significantly outperformed the DP condition. These results indicated that when utilizing deep-learning-based saliency detection and IC processing, subjects could shorten the search process and discern the target objects more reliably. This work could inform future prosthetic devices considering implementation of artificial intelligence techniques.
IJCAI Conference 2025 Conference Paper
Multi-label learning (MLL) has gained attention for its ability to represent real-world data. Label Distribution Learning (LDL), an extension of MLL to learning from label distributions, faces challenges in collecting accurate label distributions. To address the issue of biased annotations, existing works, based on the low-rank assumption, recover true distributions from biased observations by exploring label correlations. However, recent evidence shows that label distributions tend to be full-rank, and naively applying low-rank approximation to biased observations leads to inaccurate recovery and performance degradation. In this paper, we address the problem of LDL with biased annotations from a novel perspective: we first degenerate the soft label distribution into a hard multi-hot label and then recover the true label information for each instance. This idea stems from the insight that assigning hard multi-hot labels is often easier than assigning a soft label distribution, and it shows stronger immunity to noise disturbances, leading to smaller label bias. Moreover, assuming that the multi-label space for predicting label distributions is low-rank offers a more reasonable approach to capturing label correlations. Theoretical analysis and experiments confirm the effectiveness and robustness of our method on real-world datasets.
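The rank argument above can be made concrete with a toy example: a soft label-distribution matrix can be full-rank while its thresholded multi-hot version is low-rank. The matrices and threshold below are made up for illustration, and the rank is computed by plain Gaussian elimination.

```python
# Compare the rank of a soft label-distribution matrix with its
# hard multi-hot (thresholded) counterpart.

def matrix_rank(m, tol=1e-9):
    """Rank via Gauss-Jordan elimination on a copy of the matrix."""
    m = [row[:] for row in m]
    rank, rows, cols = 0, len(m), len(m[0])
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if abs(m[r][col]) > tol), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rows):
            if r != rank and abs(m[r][col]) > tol:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

soft = [[0.7, 0.2, 0.1],
        [0.6, 0.3, 0.1],
        [0.1, 0.2, 0.7]]        # distinct rows: full rank
hard = [[1 if v >= 0.2 else 0 for v in row] for row in soft]
# rows 1 and 2 binarize identically, so the multi-hot matrix loses a rank
```

Small annotation noise perturbs every entry of `soft` and tends to keep it full-rank, whereas the thresholded `hard` matrix absorbs such noise, which is the intuition behind recovering hard labels first.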
AIJ Journal 2025 Journal Article
YNIMG Journal 2025 Journal Article
JBHI Journal 2025 Journal Article
There exists a tremendous amount of multimodal data in the Internet of Medical Things (IoMT); retrieval technology can extract target data on demand from this extensive multimodal medical data space, which is crucial for aiding diagnosis and medical informatization. However, existing methods focus only on single-modal data such as medical texts, without considering the privacy protection and retrieval needs of users' multimodal data. Furthermore, these methods only match keywords and fail to effectively mine the semantic features of multimodal data, thereby limiting the performance of retrieval systems. To address these issues, this paper proposes a multimodal encrypted retrieval method for the IoMT based on semantic feature fusion and designs a multimodal semantic feature extraction model based on searchable encryption technology to enable encrypted retrieval of multimodal data. Specifically, an edge-cloud collaboration concept is introduced to underpin a secure semantic search architecture tailored for multimodal data, which ensures low-latency encrypted retrieval while safeguarding user privacy. Besides, a semantic-aware multimodal feature extraction method is designed, enhancing the capability of mining semantic features and replacing the traditional keyword retrieval mode with semantic feature retrieval. Moreover, a multimodal data encrypted retrieval method is proposed, employing a blocking idea and a parallel search tree structure, which achieves rapid semantic-similarity retrieval at low cost while preserving privacy. Simulation results demonstrate that the proposed method significantly outperforms the latest research regarding precision, search delay, and storage overhead.
TMLR Journal 2025 Journal Article
Neural message passing serves as a cornerstone framework in graph neural networks, providing a clear and intuitive mathematical guideline for the propagation and aggregation of information among interconnected nodes within graphs. Throughout this process, node representations undergo dynamic updates, considering both the individual states and connections of neighboring nodes. Concurrently, social networks, as prominent forms of interconnected data, form dynamic systems that achieve stability through continuous internal communications and opinion exchanges among social actors along their social ties. Drawing upon the shared concepts between these two domains, our study establishes an explicit connection between message passing and opinion dynamics in sociology. Moreover, we introduce a novel continuous message passing scheme termed ODNet, which integrates bounded confidence to refine the influence weight of local nodes for message propagation. By adjusting the similarity cutoffs of bounded confidence and influence weights within ODNet, we define opinion exchange rules that align with the characteristics of neural message passing and can effectively mitigate the oversmoothing issue. We extend the framework to hypergraphs and formulate corresponding continuous message passing rules, which reveal a close association with particle dynamics. Empirically, we showcase that ODNet enhances prediction performance across various social networks presented as homophilic graphs, heterophilic graphs, and hypergraphs. Notably, our proposed ODNet outperforms existing GNNs with its straightforward construction and robust theoretical foundation.
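The bounded-confidence idea that ODNet builds on can be sketched with a classic Hegselmann–Krause-style update: each node averages only those neighbors whose state lies within a confidence radius, so dissimilar nodes stop influencing each other. The graph, opinions, and cutoff below are toy assumptions, not the ODNet update rule.

```python
# Bounded-confidence aggregation: a node averages itself with neighbors
# whose "opinion" is within a similarity cutoff eps.

def bounded_confidence_step(opinions, adj, eps=0.5):
    new = []
    for i, xi in enumerate(opinions):
        # keep self plus sufficiently similar neighbors only
        close = [opinions[j] for j in adj[i] if abs(opinions[j] - xi) <= eps]
        close.append(xi)
        new.append(sum(close) / len(close))
    return new

# Two loosely connected camps: within-camp consensus forms while the
# camps stay apart, i.e. no global oversmoothing.
ops = [0.0, 0.1, 0.2, 0.9, 1.0]
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
for _ in range(20):
    ops = bounded_confidence_step(ops, adj, eps=0.3)
```

In a standard mean-aggregation GNN all five nodes would drift toward one value; the cutoff is what lets distinct clusters retain distinct representations.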
NeurIPS Conference 2025 Conference Paper
Affective brain-computer interfaces (aBCIs) play a crucial role in personalized human–computer interaction and neurofeedback modulation. To develop practical and effective aBCI paradigms and to investigate the spatial-temporal dynamics of brain activity under emotional inducement, portable electroencephalography (EEG) signals have been widely adopted. To further enhance spatial-temporal perception, functional near-infrared spectroscopy (fNIRS) has attracted increasing interest in the aBCI field and has been explored in combination with EEG. However, existing datasets typically provide only static fixation labels, overlooking the dynamic changes in subjects' emotions. Notably, some studies have attempted to collect continuously annotated emotional data, but they have recorded only peripheral physiological signals without directly observing brain activity, limiting insight into underlying neural states under different emotions. To address these challenges, we present the Real-time labeled EEG-fNIRS Dataset (REFED). To the best of our knowledge, this is the first EEG-fNIRS dataset with real-time dynamic emotional annotations. REFED simultaneously records brain signals from both EEG and fNIRS modalities while providing continuous, real-time annotations of valence and arousal. The results of the data analysis demonstrate the effectiveness of emotion inducement and the reliability of real-time annotation. This dataset offers the possibility for studying the neurovascular coupling mechanism under emotional evolution and for developing dynamic, robust affective BCIs.
AIIM Journal 2025 Journal Article
YNIMG Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
JBHI Journal 2025 Journal Article
Sleep stage classification is an important step in the diagnosis and treatment of sleep disorders. Despite the high classification performance of previous sleep stage classification work, some challenges remain unresolved: 1) How to effectively capture salient waves in sleep signals to improve sleep stage classification results. 2) How to capture salient waves affected by inter-subject variability. 3) How to adaptively regulate the importance of different modals for different sleep stages. To address these challenges, we propose SleepWaveNet, a multimodal salient wave detection network, which is motivated by the salient object detection task in computer vision. It has a U-Transformer structure to detect salient waves in sleep signals. Meanwhile, the subject-adaptation wave extraction architecture based on transfer learning can adapt to the information of target individuals and extract salient waves with inter-subject variability. In addition, the multimodal attention module can adaptively enhance the importance of specific modal data for sleep stage classification tasks. Experiments on three datasets show that SleepWaveNet has better overall performance than existing baselines. Moreover, visualization experiments show that the model has the ability to capture salient waves with inter-subject variability.
AAAI Conference 2025 Conference Paper
The general capabilities of large language models (LLMs) make them the infrastructure for various AI applications, but updating their inner knowledge requires significant resources. Recent model editing is a promising technique for efficiently updating a small amount of knowledge of LLMs and has attracted much attention. In particular, local editing methods, which directly update model parameters, are proven suitable for updating small amounts of knowledge. Local editing methods update weights by computing least-squares closed-form solutions and identify edited knowledge by vector-level matching in inference, achieving promising results. However, these methods still require a lot of time and resources to complete the computation. Moreover, vector-level matching lacks reliability, and such updates disrupt the original organization of the model's parameters. To address these issues, we propose a detachable and expandable Subject Word Embedding Altering (SWEA) framework, which finds the editing embeddings through token-level matching and adds them to the subject word embeddings in the Transformer input. To get these editing embeddings, we propose an optimizing-then-suppressing fusion method, which first optimizes learnable embedding vectors for the editing target and then suppresses the Knowledge Embedding Dimensions (KEDs) to obtain the final editing embeddings. We thus propose the SWEAOS method for editing factual knowledge in LLMs. We demonstrate the overall state-of-the-art (SOTA) performance of SWEAOS on the CounterFact and zsRE datasets. To further validate the reasoning ability of SWEAOS in editing knowledge, we evaluate it on the more complex RippleEdits benchmark. The results demonstrate that SWEAOS possesses SOTA reasoning ability.
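The token-level matching and embedding-altering step can be sketched as follows: whenever the subject's token-id sequence appears in the input, a learned editing vector is added to those positions' embeddings. Names, shapes, and the toy vectors are illustrative assumptions, not the SWEA implementation.

```python
# Input-side embedding altering: add an editing vector at positions where
# the subject's token ids match exactly (token-level matching).

def alter_embeddings(token_ids, embeds, subject_ids, edit_vec):
    """token_ids: list[int]; embeds: one vector (list[float]) per token;
    subject_ids: token ids of the edited subject; edit_vec: editing embedding."""
    n = len(subject_ids)
    out = [vec[:] for vec in embeds]  # copy, leave originals untouched
    for start in range(len(token_ids) - n + 1):
        if token_ids[start:start + n] == subject_ids:   # exact token match
            for pos in range(start, start + n):
                out[pos] = [a + b for a, b in zip(out[pos], edit_vec)]
    return out

ids = [5, 42, 7, 9]                  # toy input; the subject is tokens [42, 7]
emb = [[0.0, 0.0] for _ in ids]
edited = alter_embeddings(ids, emb, subject_ids=[42, 7], edit_vec=[1.0, -1.0])
```

Because the alteration is keyed on exact token sequences rather than vector similarity, it is detachable: removing the editing vector restores the original embeddings unchanged.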
EAAI Journal 2025 Journal Article
ICML Conference 2025 Conference Paper
Network quantization, one of the most widely studied model compression methods, effectively quantizes a floating-point model to obtain a fixed-point one with negligible accuracy loss. Although great success was achieved in reducing the model size, it may exacerbate the unfairness in model accuracy across different groups of datasets. This paper considers two widely used algorithms, Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), in an attempt to understand how they cause this critical issue. Theoretical analysis with empirical verification reveals two responsible factors, as well as how they influence a metric of fairness in depth. A comparison between PTQ and QAT is then made, explaining an observation that QAT behaves even worse than PTQ in fairness, although it often preserves higher accuracy at lower bit-widths in quantization. Finally, the paper finds that several simple data augmentation methods can be adopted to alleviate the disparate impacts of quantization, based on a further observation that class imbalance produces distinct values of the aforementioned factors among different attribute classes. We experiment on either imbalanced (UTK-Face and FER2013) or balanced (CIFAR-10 and MNIST) datasets using ResNet and VGG models for empirical evaluation.
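A minimal sketch of symmetric per-tensor int8 quantization illustrates one mechanism by which quantization error can fall unevenly: a single scale must cover the whole value range, so an outlier makes small values coarse. This is a generic PTQ-style sketch under assumed parameters, not the paper's analysis or any specific library's API.

```python
# Symmetric per-tensor post-training quantization to int8.

def quantize_int8(values):
    # one scale for the whole tensor, chosen from the largest magnitude
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -0.01, 0.015, 2.0]       # one outlier dominates the scale
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# the small weights suffer far larger relative error than the outlier does
```

If the groups of a dataset rely on different subsets of such weights, the rounding error, and hence the accuracy loss, need not be shared evenly across groups.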
NeurIPS Conference 2025 Conference Paper
Multi-view 3D human pose estimation (HPE) leverages complementary information across views to improve accuracy and robustness. Traditional methods rely on camera calibration to establish geometric correspondences, which is sensitive to calibration accuracy and lacks flexibility in dynamic settings. Calibration-free approaches address these limitations by learning adaptive view interactions, typically leveraging expressive and flexible continuous representations. However, as the multi-view interaction relationship is learned entirely from data without constraint, these approaches are vulnerable to noisy input, whose errors can propagate, amplify, and accumulate across all views, severely corrupting the final estimated pose. To mitigate this, we propose a novel framework that integrates a noise-resilient discrete prior into the continuous representation-based model. Specifically, we introduce the \textit{UniCodebook}, a unified, compact, robust, and discrete representation complementary to continuous features, allowing the model to benefit from robustness to noise while preserving regression capability. Furthermore, we propose an attribute-preserving and complementarity-enhancing Discrete-Continuous Spatial Attention (DCSA) mechanism to facilitate interaction between discrete priors and continuous pose features. Extensive experiments on three representative datasets demonstrate that our approach outperforms both calibration-required and calibration-free methods, achieving state-of-the-art performance.
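The noise-resilience of a discrete prior can be sketched with toy vector quantization: snapping a perturbed continuous feature to its nearest codeword bounds how far noise can push the representation. The codebook and vectors below are made up for illustration, not the UniCodebook training procedure.

```python
# Discrete prior via a codebook: nearest-codeword lookup.

def nearest_codeword(feature, codebook):
    """Return the codeword closest to the feature in squared L2 distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda c: dist2(feature, c))

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
noisy = [0.9, 1.2]                      # perturbed version of [1.0, 1.0]
snapped = nearest_codeword(noisy, codebook)
```

Any perturbation smaller than half the gap between codewords is erased entirely by the lookup, which is why the discrete branch resists the error propagation that corrupts purely continuous features.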
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature’s label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature’s latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier’s accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve. Code is available on our GitHub: https://github.com/JingWang18/DVD-SFDA.
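The InfoNCE alignment step mentioned above has a standard form: the loss is the negative log-softmax of the positive pair's similarity against the negatives. The toy features below are assumptions for illustration; DVD's encoder and diffusion module are not modeled here, only the loss value.

```python
# InfoNCE on toy features: low loss when the query matches its positive,
# high loss when the "positive" is actually a poor match.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(query, positive, negatives, tau=0.1):
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, n) / tau for n in negatives]
    # numerically stable -log softmax of the positive (index 0)
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

q = [1.0, 0.1]                       # target feature
pos = [0.9, 0.2]                     # source-like cue for q
negs = [[-1.0, 0.3], [0.0, 1.0]]     # cues for other samples
loss_aligned = info_nce(q, pos, negs)
loss_misaligned = info_nce(q, negs[0], [pos, negs[1]])
```

Minimizing this loss pulls the target encoder's output toward its generated source-like cue and away from other cues, which is how decision boundaries transfer without any raw source data.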
NeurIPS Conference 2025 Conference Paper
Recent advances in text-to-video (T2V) generation, exemplified by models such as Sora and Kling, have demonstrated strong potential for constructing world simulators. However, existing T2V models still struggle to understand abstract physical principles and to generate videos that faithfully obey physical laws. This limitation stems primarily from the lack of explicit physical guidance, caused by a significant gap between high-level physical concepts and the generative capabilities of current models. To address this challenge, we propose the World Simulator Assistant (WISA), a novel framework designed to systematically decompose and integrate physical principles into T2V models. Specifically, WISA decomposes physical knowledge into three hierarchical levels: textual physical descriptions, qualitative physical categories, and quantitative physical properties. It then incorporates several carefully designed modules—such as Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier—to effectively encode these attributes and enhance the model’s adherence to physical laws during generation. In addition, most existing video datasets feature only weak or implicit representations of physical phenomena, limiting their utility for learning explicit physical principles. To bridge this gap, we present WISA-80K, a new dataset comprising 80,000 human-curated videos that depict 17 fundamental physical laws across three core domains of physics: dynamics, thermodynamics, and optics. Experimental results show that WISA substantially improves the alignment of T2V models (such as CogVideoX and Wan2.1) with real-world physical laws, achieving notable gains on the VideoPhy benchmark. Our data, code, and models are available in the Project Page.
JBHI Journal 2024 Journal Article
Accurate and fully automated brain structure examination and prediction from 3D volumetric magnetic resonance imaging (MRI) is a necessary step in medical imaging analysis, which can assist greatly in clinical diagnosis. Traditional deep learning models suffer from severe performance degradation when applied to clinically acquired unlabeled data. The performance degradation is mainly caused by domain discrepancy such as different device types and parameter settings for data acquisition. However, existing approaches focus on the reduction of domain discrepancies but ignore the entanglement of semantic features and domain information. In this article, we explore the feature invariance of categories and domains in different projection spaces and propose a Siamese-Transport Domain Adaptation (STDA) method using a joint optimal transport theory and contrastive learning for automatic 3D MRI classification and glioma multi-grade prediction. Specifically, the learning framework updates the distribution of features across domains and categories by Siamese transport network training with an Optimal Cost Transfer Strategy (OCTS) and a Mutual Invariant Constraint (MIC) in two projective spaces to find multiple invariants in potential heterogeneity. We design three sets of transfer task scenarios with different source and target domains, and demonstrate that STDA yields substantially higher generalization performance than other state-of-the-art unsupervised domain adaptation (UDA) methods. The method is applicable to 3D MRI data from glioma to Alzheimer's disease and has promising applications in future clinical diagnosis and treatment of brain diseases.
AAAI Conference 2024 Conference Paper
Few-Shot Segmentation (FSS) aims to accomplish the novel-class segmentation task with a few annotated images. Current FSS research based on meta-learning focuses on designing a complex interaction mechanism between the query and support features. However, unlike humans, who can rapidly learn new things from limited samples, the existing approach relies solely on fixed feature matching to tackle new tasks, lacking adaptability. In this paper, we propose a novel framework based on the adapter mechanism, namely Adaptive FSS, which can efficiently adapt the existing FSS model to novel classes. In detail, we design the Prototype Adaptive Module (PAM), which utilizes accurate category information provided by the support set to derive class prototypes, enhancing class-specific information in the multi-stage representation. In addition, our approach is compatible with diverse FSS methods with different backbones by simply inserting PAM between the layers of the encoder. Experiments demonstrate that our method effectively improves the performance of FSS models (e.g., MSANet, HDMNet, FPTrans, and DCAMA) and achieves new state-of-the-art (SOTA) results (i.e., 72.4% and 79.1% mIoU on PASCAL-5i 1-shot and 5-shot settings, 52.7% and 60.0% mIoU on COCO-20i 1-shot and 5-shot settings). Our code is available at https://github.com/jingw193/AdaptiveFSS.
NeurIPS Conference 2024 Conference Paper
Data-driven generative models have emerged as promising approaches towards achieving efficient mechanical inverse design. However, due to the prohibitively high cost in time and money, there is still a lack of open-source and large-scale benchmarks in this field. This is mainly the case for airfoil inverse design, which requires generating and editing diverse geometrically and aerodynamically qualified airfoils following multimodal instructions, \emph{i.e.,} dragging points and physical parameters. This paper presents open-source endeavors in airfoil inverse design, \emph{AFBench}, including a large-scale dataset with 200 thousand airfoils and high-quality aerodynamic and geometric labels, two novel and practical airfoil inverse design tasks, \emph{i.e.,} conditional generation on multimodal physical parameters and controllable editing, and comprehensive metrics to evaluate various existing airfoil inverse design methods. Our aim is to establish \emph{AFBench} as an ecosystem for training and evaluating airfoil inverse design methods, with a specific focus on data-driven controllable inverse design models driven by multimodal instructions, capable of bridging the gap between ideas and execution, and between academic research and industrial applications. We have provided baseline models, comprehensive experimental observations, and analysis to accelerate future research. Our baseline model is trained on an RTX 3090 GPU within 16 hours. The codebase, datasets, and benchmarks will be available at \url{https://hitcslj.github.io/afbench/}.
EAAI Journal 2024 Journal Article
YNIMG Journal 2024 Journal Article
NeurIPS Conference 2024 Conference Paper
In recent years, the merging of vast datasets with powerful computational resources has led to the emergence of large pre-trained models in the field of deep learning. However, common practices often overgeneralize the applicability of these models, overlooking task-specific resource constraints. To mitigate this issue, we propose \textbf{Cluster-Learngene}, which effectively clusters critical internal modules from a large ancestry model and then inherits them to initialize descendant models of elastic scales. Specifically, based on the density characteristics of attention heads, our method adaptively clusters the attention heads of each layer and the position-wise feed-forward networks (FFNs) in the ancestry model as the learngene. Moreover, we introduce priority weight-sharing and learnable parameter transformations that expand the learngene to initialize descendant models of elastic scales. Through extensive experimentation, we demonstrate that Cluster-Learngene is not only more efficient than other initialization methods but also customizes models of elastic scales according to downstream task resources.
IJCAI Conference 2024 Conference Paper
Existing Deep Multi-view Clustering (DMVC) approaches typically concentrate on capturing consensus semantics from multiple views, where contrastive learning is widely used to align view-specific representations of each view. Unfortunately, view-specific representations are extracted from the content information of the corresponding instance, neglecting the relationships among different instances. Furthermore, the existing contrastive loss introduces numerous false-negative pairs that conflict with the clustering objectives. In response to these challenges, we propose a contraStive and viEw-interaction stRucture learning framework for multI-viEw cluStering (SERIES). Our method takes into account the structural relations among instances and boosts the contrastive loss to improve intra-class compactness. Meanwhile, a cross-view dual relation generation mechanism is introduced to achieve the consensus structural graph across multiple views for clustering. Specifically, we initially acquire view-specific representations using multiple graph autoencoders to exploit both content information and structural information. Furthermore, to pull together instances of the same cluster, a soft negative-pair-aware contrastive loss is employed to distinguish dissimilar instances while attracting similar ones. Thereafter, the view-specific representations are fed into cross-view dual relation generation layers to generate the affinity matrices of each other, aiming to reveal a consistent structural graph across various views. Extensive experiments conducted on six benchmarks illustrate the superiority of our method compared to other state-of-the-art approaches.
YNIMG Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
IJCAI Conference 2024 Conference Paper
Label Distribution Learning (LDL) is a novel machine learning paradigm that assigns a label distribution to each instance. Numerous LDL methods have been proposed to leverage label correlation in the learning process to cope with the exponentially large output space; among these, many exploit the low-rank structure of label distributions to capture label correlation. However, recent research has unveiled that label distribution matrices typically maintain full rank, posing a challenge to approaches relying on low-rank label correlation. Notably, low-rank label correlation finds widespread adoption in the multi-label learning (MLL) literature due to the often low-rank nature of multi-label matrices. Inspired by this, we introduce an auxiliary MLL process within the LDL framework, capturing low-rank label correlation within this auxiliary MLL component rather than in the LDL itself. By doing so, we adeptly exploit low-rank label correlation in our LDL methods. We conduct comprehensive experiments and demonstrate that our methods are superior to existing LDL methods. Besides, ablation studies justify the advantages of exploiting low-rank label correlation in the auxiliary MLL.
EAAI Journal 2024 Journal Article
YNIMG Journal 2024 Journal Article
EAAI Journal 2024 Journal Article
ECAI Conference 2024 Conference Paper
Spiking neural networks (SNNs) have the potential to simulate sparse and spatio-temporal dynamics observed in biological neurons, making them promising for achieving energy-efficient artificial general intelligence. While backpropagation through time (BPTT) ensures reliable precision for training SNNs, it is hampered by high computation and storage complexity and does not conform to the instantaneous learning mechanism in brains. On the contrary, online training algorithms, which are biologically interpretable, offer low latency and memory efficiency, and are well-suited for on-chip learning applications. However, recent research exhibits a deficiency in the scientific comprehension of online gradients, which leads to certain limitations. To address this issue, we conduct an in-depth analysis of the calculation deviation in chain derivations induced by weight update and find two pivotal factors that affect the accuracy of online gradients: completeness and timeliness. To further enhance the performance of online training leveraging these findings, we propose spatio-temporal online learning (STOL), which substantially ameliorates the accuracy of the online gradients and demonstrates superior computation and memory efficiency. Our experiments on CIFAR-10, CIFAR-100, ImageNet, CIFAR10-DVS, and DVS128-Gesture datasets demonstrate that our method achieves state-of-the-art performance across most of these tasks. Besides, it shows a great improvement compared with existing online training algorithms.
NeurIPS Conference 2024 Conference Paper
Subsampling is effective in tackling computational challenges for massive data with rare events. Overly aggressive subsampling may adversely affect estimation efficiency, and optimal subsampling is essential to mitigate the information loss. However, existing optimal subsampling probabilities depend on the data scale, and some scaling transformations may result in inefficient subsamples. This problem is more significant when there are inactive features, because their influence on the subsampling probabilities can be arbitrarily magnified by inappropriate scaling transformations. We tackle this challenge and introduce a scale-invariant optimal subsampling function in the context of sparse models, where inactive features are commonly assumed. Instead of focusing on estimating model parameters, we define an optimal subsampling function to minimize the prediction error, using adaptive lasso as an example to outline the estimation procedure and study its theoretical guarantee. We first introduce the adaptive lasso estimator for rare-events data and establish its oracle properties, thereby validating the use of subsampling. Then we derive a scale-invariant optimal subsampling function that minimizes the prediction error of the inverse probability weighted (IPW) adaptive lasso. Finally, we present an estimator based on the maximum sampled conditional likelihood (MSCL) to further improve the estimation efficiency. We conduct numerical experiments using both simulated and real-world data sets to demonstrate the performance of the proposed methods.
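The inverse-probability-weighting correction that the IPW adaptive lasso relies on can be shown on a much simpler estimand. This is a minimal sketch of IPW under non-uniform subsampling, applied to a mean estimate rather than the paper's lasso; the sampling probabilities here are an arbitrary choice for illustration.

```python
import numpy as np

# Each retained point is weighted by 1/pi_i, so the subsample estimate stays
# (approximately) unbiased even when inclusion probabilities are far from
# uniform; the unweighted estimate is badly biased toward oversampled points.
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100_000)       # full data, true mean = 2
pi = np.clip(x / x.max(), 0.01, 1.0)               # non-uniform inclusion probs
keep = rng.random(x.size) < pi                     # Poisson subsampling
naive = x[keep].mean()                             # biased: oversamples large x
ipw = np.sum(x[keep] / pi[keep]) / np.sum(1.0 / pi[keep])  # Hajek IPW estimate
print(f"naive={naive:.3f}  ipw={ipw:.3f}  full-data mean={x.mean():.3f}")
```

With inclusion probability proportional to the value itself, the naive subsample mean lands near the size-biased mean (about twice the truth for an exponential), while the IPW estimate recovers the full-data mean.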
AAAI Conference 2024 Conference Paper
Deep Multi-view Graph Clustering (DMGC) aims to partition instances into different groups using the graph information extracted from multi-view data. The mainstream framework of DMGC methods applies graph neural networks to embed structure information into the view-specific representations and fuse them for the consensus representation. However, on one hand, we find that the graph learned in advance is not ideal for clustering as it is constructed by original multi-view data and localized connecting. On the other hand, most existing methods learn the consensus representation in a late fusion manner, which fails to propagate the structure relations across multiple views. Inspired by these observations, we propose a Structure-adaptive Unified gRaph nEural network for multi-view clusteRing (SURER), which can jointly learn a heterogeneous multi-view unified graph and robust graph neural networks for multi-view clustering. Specifically, we first design a graph structure learning module to refine the original view-specific attribute graphs, which removes false edges and discovers potential connections. We then integrate the view-specific refined attribute graphs into a unified heterogeneous graph by linking the representations of the same sample from different views. Furthermore, we use the unified heterogeneous graph as the input of the graph neural network to learn the consensus representation for each instance, effectively integrating complementary information from various views. Extensive experiments on diverse datasets demonstrate the superior effectiveness of our method compared to other state-of-the-art approaches.
JBHI Journal 2024 Journal Article
In clinical settings, the implementation of deep neural networks is impeded by the prevalent problems of label scarcity and class imbalance in medical images. To mitigate the need for labeled data, semi-supervised learning (SSL) has gained traction. However, existing SSL schemes exhibit certain limitations. 1) They commonly fail to address the class imbalance problem. Training with imbalanced data makes the model's prediction biased towards majority classes, consequently introducing prediction bias. 2) They usually suffer from training bias arising from unreasonable training strategies, such as strong coupling between the generation and utilization of pseudo labels. To address these problems, we propose a novel SSL framework called Tri-Net with Cross-Balanced pseudo supervision (TNCB). Specifically, two student networks focusing on different learning tasks and a teacher network equipped with an adaptive balancer are designed. This design enables the teacher model to focus more on minority classes, thereby reducing prediction bias. Additionally, we propose a virtual optimization strategy to further enhance the teacher model's resistance to class imbalance. Finally, to fully exploit valuable knowledge from unlabeled images, we employ cross-balanced pseudo supervision, where an adaptive cross loss function is introduced to reduce training bias. Extensive evaluation on four datasets with different diseases, image modalities, and imbalance ratios consistently demonstrates the superior performance of TNCB over state-of-the-art SSL methods. These results indicate the effectiveness and robustness of TNCB in addressing imbalanced medical image classification challenges.
TMLR Journal 2024 Journal Article
Source-free domain adaptation (SFDA) involves adapting a model originally trained using a labeled dataset (source domain) to perform effectively on an unlabeled dataset (target domain) without relying on any source data during adaptation. This adaptation is especially crucial when significant disparities in data distributions exist between the two domains and when there are privacy concerns regarding the source model's training data. The absence of access to source data during adaptation makes it challenging to analytically estimate the domain gap. To tackle this issue, various techniques have been proposed, such as unsupervised clustering, contrastive learning, and continual learning. In this paper, we first conduct an extensive theoretical analysis of SFDA based on contrastive learning, primarily because it has demonstrated superior performance compared to other techniques. Motivated by the obtained insights, we then introduce a straightforward yet highly effective latent augmentation method tailored for contrastive SFDA. This augmentation method leverages the dispersion of latent features within the neighborhood of the query sample, guided by the source pre-trained model, to enhance the informativeness of positive keys. Our approach, based on a single InfoNCE-based contrastive loss, outperforms state-of-the-art SFDA methods on widely recognized benchmark datasets.
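The single InfoNCE-based loss the abstract builds on is standard enough to sketch directly. This is a generic InfoNCE implementation (not the paper's latent-augmentation method): the query is pulled toward its positive key and pushed away from negative keys, and the loss drops as the positive becomes easier to identify.

```python
import numpy as np

def info_nce(query, pos_key, neg_keys, tau=0.07):
    """InfoNCE loss for one query against one positive and a bank of negatives.

    All vectors are L2-normalized; the loss is cross-entropy over the
    (positive + negatives) similarity logits with the positive as the target.
    """
    q = query / np.linalg.norm(query)
    k_pos = pos_key / np.linalg.norm(pos_key)
    k_neg = neg_keys / np.linalg.norm(neg_keys, axis=1, keepdims=True)
    logits = np.concatenate(([q @ k_pos], k_neg @ q)) / tau
    logits -= logits.max()                        # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(0)
q = rng.normal(size=64)
# informative positive key (a slightly perturbed copy of the query)
easy = info_nce(q, q + 0.01 * rng.normal(size=64), rng.normal(size=(32, 64)))
# uninformative positive key (unrelated random vector)
hard = info_nce(q, rng.normal(size=64), rng.normal(size=(32, 64)))
print(f"aligned positive: {easy:.3f}, random positive: {hard:.3f}")
```

The comparison at the end mirrors the paper's motivation for making positive keys more informative: a well-aligned positive yields a much smaller loss than a random one.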
EAAI Journal 2023 Journal Article
TMLR Journal 2023 Journal Article
Geostatistical learning problems are frequently characterized by spatial autocorrelation in the input features and/or the potential for covariate shift at test time. These realities violate the classical assumption of independent, identically distributed data, upon which most cross-validation algorithms rely in order to estimate the generalization performance of a model. In this paper, we present a theoretical criterion for unbiased cross-validation estimators in the geospatial setting. We also introduce a new cross-validation algorithm to evaluate models, inspired by the challenges of geospatial problems. We apply a framework for categorizing problems into different types of geospatial scenarios to help practitioners select an appropriate cross-validation strategy. Our empirical analyses compare cross-validation algorithms on both simulated and several real datasets to develop recommendations for a variety of geospatial settings. This paper aims to draw attention to some challenges that arise in model evaluation for geospatial problems and to provide guidance for users.
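One strategy in the family of cross-validation schemes this paper studies is spatial blocking. The sketch below is illustrative only (grid size and fold assignment are arbitrary choices, not the paper's algorithm): points are grouped into spatial blocks so that train and test points are not near-duplicates of each other, which plain random K-fold would allow under spatial autocorrelation.

```python
import numpy as np

def spatial_block_folds(coords, n_blocks_per_axis=3):
    """Assign each (x, y) point to a grid cell; each cell is one CV fold."""
    mins, maxs = coords.min(axis=0), coords.max(axis=0)
    # map coordinates to integer grid indices in [0, n_blocks_per_axis - 1]
    idx = np.floor((coords - mins) / (maxs - mins + 1e-12) * n_blocks_per_axis)
    idx = np.clip(idx, 0, n_blocks_per_axis - 1).astype(int)
    return idx[:, 0] * n_blocks_per_axis + idx[:, 1]   # fold id per point

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(500, 2))
folds = spatial_block_folds(coords)
for f in np.unique(folds):
    test = folds == f
    # train on all other blocks, test on this block (model fitting omitted)
    print(f"fold {f}: {test.sum()} test points, {(~test).sum()} train points")
```

Because each held-out block is spatially contiguous and separated from the training blocks, the resulting score is closer to performance under the covariate shift the abstract describes than a random split would be.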
AAMAS Conference 2023 Conference Paper
We initiate the study of how to perturb the reward in a zero-sum Markov game with two players to induce a desirable Nash equilibrium, namely arbitrating. Such a problem admits a bi-level optimization formulation. The lower level requires solving the Nash equilibrium under a given reward function, which makes the overall problem challenging to optimize in an end-to-end way. We propose a backpropagation scheme that differentiates through the Nash equilibrium, which provides the gradient feedback for the upper level. In particular, our method only requires a black-box solver for the (regularized) Nash equilibrium (NE). We develop the convergence analysis for the proposed framework with proper black-box NE solvers and demonstrate the empirical successes in two multi-agent reinforcement learning (MARL) environments. Supplementary material for all the proofs in this paper can be found at https://arxiv.org/abs/2302.10058.
JBHI Journal 2023 Journal Article
In recent years, more and more people suffer from voice-related diseases, and current pathological speech conversion methods are limited in that each can only convert a single kind of pathological voice. In this study, we propose a novel Encoder-Decoder Generative Adversarial Network (E-DGAN) to generate personalized speech for pathological-to-normal voice conversion, which is suitable for multiple kinds of pathological voices. Our proposed method also addresses the problems of improving the intelligibility of pathological voices and personalizing the converted speech. Feature extraction is performed using a mel filter bank. The conversion network is an encoder-decoder structure, used to convert the mel spectrogram of pathological voices to the mel spectrogram of normal voices. After being converted by the residual conversion network, the personalized normal speech is synthesized by a neural vocoder. In addition, we propose a subjective evaluation metric named “content similarity” to evaluate the consistency between the converted pathological voice content and the reference content. The Saarbrücken Voice Database (SVD) is used to verify the proposed method. The intelligibility and content similarity of pathological voices are increased by 18.67% and 2.60%, respectively. Besides, an intuitive spectrogram-based analysis shows a significant improvement. The results show that our proposed method can improve the intelligibility of pathological voices and personalize their conversion into the normal voices of 20 different speakers. Compared with five other pathological voice conversion methods, our proposed method achieves the best evaluation results.
AAAI Conference 2023 Conference Paper
Label distribution covers a certain number of labels, representing the degree to which each label describes an instance. The learning process on the instances labeled by label distributions is called Label Distribution Learning (LDL). Although LDL has been applied successfully to many practical applications, one problem with existing LDL methods is that they are limited to data with balanced label information. However, annotation information in real-world data often exhibits imbalanced distributions, which significantly degrades the performance of existing methods. In this paper, we investigate the Imbalanced Label Distribution Learning (ILDL) problem. To handle this challenging problem, we delve into the characteristics of ILDL and empirically find that the representation distribution shift is the underlying reason for the performance degradation of existing methods. Inspired by this finding, we present a novel method named Representation Distribution Alignment (RDA). RDA aligns the distributions of feature representations and label representations to alleviate the impact of the distribution gap between the training set and the test set caused by the imbalance issue. Extensive experiments verify the superior performance of RDA. Our work fills the gap in benchmarks and techniques for practical ILDL problems.
JBHI Journal 2023 Journal Article
Accurate identification of lesions is a key step in surgical planning. However, this task presents two main challenges: 1) Due to the complex anatomical shapes of different lesions, most segmentation methods only achieve outstanding performance for a specific structure, rather than for other lesions with location differences. 2) The huge number of parameters limits existing transformer-based segmentation models. To overcome these problems, we propose a novel slight dual-path network (SDPN) to accurately segment lesions or organs whose locations differ significantly. First, we design a dual-path module to integrate local with global features without obvious memory consumption. Second, a novel multi-spectrum attention module is proposed to pay further attention to detailed information, which can automatically adapt to the variable segmentation target. Then, a compression module based on tensor ring decomposition is designed to compress convolutional and transformer structures. In the experiments, four datasets, including three benchmark datasets and a clinical dataset, are used to evaluate SDPN. The results show that SDPN performs better than other state-of-the-art methods for brain tumor, liver tumor, endometrial tumor and cardiac segmentation. To ensure generalizability, we train the network on Kvasir-SEG and test on CVC-ClinicDB, which was collected from a different institution. The quantitative analysis shows that the clinical evaluation results are consistent with the experts'. Therefore, this model may be a potential candidate for the segmentation of lesions and organs with variable locations in clinical applications.
EAAI Journal 2022 Journal Article
YNICL Journal 2022 Journal Article
JBHI Journal 2022 Journal Article
Multimodal medical image fusion can combine salient information from different source images of the same part and reduce the redundancy of information. In this paper, an efficient hybrid image decomposition (HID) method is proposed. It combines the advantages of spatial domain and transform domain methods and breaks through the limitations of algorithms based on a single category of features. The accurate separation of the base layer and texture details is conducive to a better effect of the fusion rules. First, the source anatomical images are decomposed into a series of high frequencies and a low frequency via nonsubsampled shearlet transform (NSST). Second, the low frequency is further decomposed using the designed optimization model based on structural similarity and structure tensor to get an energy texture layer and a base layer. Then, the modified choosing maximum (MCM) is designed to fuse base layers. The sum of modified Laplacian (SML) is used to fuse high frequencies and energy texture layers. Finally, the fused low frequency can be obtained by adding the fused energy texture layer and base layer. And the fused image is reconstructed by the inverse NSST. The superiority of the proposed method is verified by extensive experiments on 50 pairs of magnetic resonance imaging (MRI) images and computed tomography (CT) images and others, and by comparison with 12 state-of-the-art medical image fusion methods. It is demonstrated that the proposed hybrid decomposition model has a better ability to extract texture information than conventional ones.
JBHI Journal 2022 Journal Article
Clinically, physicians collect benchmark medical data to establish an archive for a stroke patient and then regularly add follow-up data, which is of great significance for prognosis prediction in stroke patients. In this paper, we present an interpretable deep learning model to predict the one-year mortality risk after stroke. We design sub-modules to reconstruct features from the original clinical data that highlight the dissimilarity and temporality of different variables. The model consists of a Bidirectional Long Short-Term Memory (Bi-LSTM) network, in which a novel correlation attention module is proposed that takes the correlation of variables into consideration. In the experiments, the dataset was collected clinically from the department of neurology in a local AAA hospital. It consists of 2,275 stroke patients hospitalized in the department of neurology from 2014 to 2016. Our model achieves a precision of 0.9414, a recall of 0.9502 and an F1-score of 0.9415. In addition, we provide an analysis of interpretability through visualizations with reference to clinical professional guidelines.
TMLR Journal 2022 Journal Article
Recent years have witnessed a surge of successful applications of machine reading comprehension. Of central importance to these tasks is the availability of massive amounts of labeled data, which facilitates training of large-scale neural networks. However, in many real-world problems, annotated data are expensive to gather not only because of time cost and budget, but also because of certain domain-specific restrictions such as privacy for healthcare data. In this regard, we propose an uncertainty-based active learning algorithm for reading comprehension, which interleaves data annotation and model updating to mitigate the demand for labeling. Our key techniques are two-fold: 1) an unsupervised uncertainty-based sampling scheme that queries the labels of the most informative instances with respect to the currently learned model; and 2) an adaptive loss minimization paradigm that simultaneously fits the data and controls the degree of model updating. We demonstrate on benchmark datasets that 25% fewer labeled samples suffice to guarantee similar, or even improved, performance. Our results show strong evidence that for label-demanding scenarios, the proposed approach offers a practical guide on data collection and model training.
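The core of uncertainty-based sampling can be sketched in a few lines. This is a generic entropy-based query selector, not the paper's exact scheme: it picks the unlabeled instances whose predictive distribution has the highest entropy, i.e. those about which the current model is least certain.

```python
import numpy as np

def entropy_sampling(probs, budget):
    """Return the indices of the `budget` most uncertain instances.

    probs: (n_instances, n_classes) predicted class probabilities from the
    currently learned model. Uncertainty is measured by Shannon entropy.
    """
    eps = 1e-12                                  # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:budget]    # most uncertain first

# model predictions over 5 unlabeled instances, 3 classes
probs = np.array([[0.98, 0.01, 0.01],    # confident -> skip
                  [0.34, 0.33, 0.33],    # near-uniform -> query first
                  [0.70, 0.20, 0.10],
                  [0.50, 0.45, 0.05],
                  [0.90, 0.05, 0.05]])
picked = entropy_sampling(probs, budget=2)
print(picked)   # → [1 3]
```

In an active learning loop, the selected instances would be sent for annotation, the model retrained, and the probabilities recomputed before the next query round.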
JBHI Journal 2021 Journal Article
The classification of six types of white blood cells (WBCs) is considered essential for leukemia diagnosis, while the classification is labor-intensive and demanding in clinical experience. To relieve this complicated process with an efficient and automatic method, we propose the Attention-aware Residual Network based Manifold Learning model (ARML) to classify WBCs. The proposed ARML model leverages adaptive attention-aware residual learning to exploit category-relevant image-level features and strengthen the first-order feature representation ability. To learn more discriminatory information than the first-order features provide, second-order features are characterized. Afterwards, ARML encodes both the first- and second-order features with Gaussian embedding into the Riemannian manifold to learn the underlying non-linear structure of the features for classification. ARML can be trained in an end-to-end fashion, and the learnable parameters are iteratively optimized. 10,800 WBC images (1,800 images for each type) are collected; 9,000 images with five-fold cross-validation are used for training and validation of the model, while an additional 1,800 images are used for testing. The results show that ARML achieves an average classification accuracy of 0.953, outperforming other state-of-the-art methods with fewer trainable parameters. In the ablation study, ARML achieves improved accuracy against its three variants: without manifold learning (AR), without attention-aware learning (RML), and AR without attention-aware learning. The t-SNE results illustrate that ARML has learned more distinguishable features than the comparison methods, which benefits WBC classification. ARML provides a clinically feasible WBC classification solution for leukemia diagnosis in an efficient manner.
NeurIPS Conference 2021 Conference Paper
Graph embedding, which represents real-world entities in a mathematical space, has enabled numerous applications such as analyzing natural languages, social networks, biochemical networks, and knowledge bases. It has been experimentally shown that graph embedding in hyperbolic space can represent hierarchical tree-like data more effectively than embedding in linear space, owing to hyperbolic space's exponential growth property. However, since the theoretical comparison has been limited to ideal noiseless settings, the potential for the hyperbolic space's property to worsen the generalization error for practical data has not been analyzed. In this paper, we provide a generalization error bound applicable for graph embedding both in linear and hyperbolic spaces under various negative sampling settings that appear in graph embedding. Our bound states that error is polynomial and exponential with respect to the embedding space's radius in linear and hyperbolic spaces, respectively, which implies that hyperbolic space's exponential growth property worsens the error. Using our bound, we clarify the data size condition on which graph embedding in hyperbolic space can represent a tree better than in Euclidean space by discussing the bias-variance trade-off. Our bound also shows that imbalanced data distribution, which often appears in graph embedding, can worsen the error.
IJCAI Conference 2021 Conference Paper
Although Label Distribution Learning (LDL) has found wide applications in varieties of classification problems, it may face the challenge of objective mismatch -- LDL neglects the optimal label for the sake of learning the whole label distribution, which leads to performance deterioration. To improve classification performance and solve the objective mismatch, we propose a new LDL algorithm called LDL-HR. LDL-HR provides a new perspective of label distribution, i.e., a combination of the highest label and the rest label description degrees. It works as follows. First, we learn the highest label by fitting the degenerated label distribution and large margin. Second, we learn the rest label description degrees to exploit generalization. Theoretical analysis shows the generalization of LDL-HR. Besides, the experimental results on 18 real-world datasets validate the statistical superiority of our method.
IJCAI Conference 2021 Conference Paper
Sleep staging is fundamental for sleep assessment and disease diagnosis. Although previous attempts to classify sleep stages have achieved high classification performance, several challenges remain open: 1) How to effectively extract salient waves in multimodal sleep data; 2) How to capture the multi-scale transition rules among sleep stages; 3) How to adaptively seize the key role of specific modality for sleep staging. To address these challenges, we propose SalientSleepNet, a multimodal salient wave detection network for sleep staging. Specifically, SalientSleepNet is a temporal fully convolutional network based on the U²-Net architecture that was originally proposed for salient object detection in computer vision. It is mainly composed of two independent U²-like streams to extract the salient features from multimodal data, respectively. Meanwhile, the multi-scale extraction module is designed to capture multi-scale transition rules among sleep stages. Besides, the multimodal attention module is proposed to adaptively capture valuable information from multimodal data for the specific sleep stage. Experiments on the two datasets demonstrate that SalientSleepNet outperforms the state-of-the-art baselines. It is worth noting that this model has the least amount of parameters compared with the existing deep neural network models.
TCS Journal 2021 Journal Article
IJCAI Conference 2020 Conference Paper
Existing multi-label learning (MLL) approaches mainly assume all the labels are observed and construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially for large-scale data sets. Discovering and exploring the latent labels hidden in the data may not only find interesting knowledge but also help us to build a more robust learning model. In this paper, a novel approach named DLCL (i.e., Discovering Latent Class Labels for MLL) is proposed which can not only discover the latent labels in the training data but also predict new instances with the latent and known labels simultaneously. Extensive experiments show a competitive performance of DLCL against other state-of-the-art MLL approaches.
IJCAI Conference 2020 Conference Paper
Sleep stage classification is essential for sleep assessment and disease diagnosis. However, how to effectively utilize brain spatial features and transition information among sleep stages continues to be challenging. In particular, owing to the limited knowledge of the human brain, predefining a suitable spatial brain connection structure for sleep stage classification remains an open question. In this paper, we propose a novel deep graph neural network, named GraphSleepNet, for automatic sleep stage classification. The main advantage of the GraphSleepNet is to adaptively learn the intrinsic connection among different electroencephalogram (EEG) channels, represented by an adjacency matrix, thereby best serving the spatial-temporal graph convolution network (ST-GCN) for sleep stage classification. Meanwhile, the ST-GCN consists of graph convolutions for extracting spatial features and temporal convolutions for capturing the transition rules among sleep stages. Experiments on the Montreal Archive of Sleep Studies (MASS) dataset demonstrate that the GraphSleepNet outperforms the state-of-the-art baselines.
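The building block this architecture stacks, a graph convolution over a learned channel adjacency, can be sketched in isolation. This is an illustrative single GCN step, not GraphSleepNet's full ST-GCN; the adjacency A and weights W below are random stand-ins for the parameters the network would learn.

```python
import numpy as np

def graph_conv(X, A, W):
    """One spatial graph-convolution step with symmetric normalization.

    X: (channels, features), A: (channels, channels) learned adjacency,
    W: (features, out_features). Computes ReLU(D^-1/2 (A+I) D^-1/2 X W),
    the standard GCN propagation rule.
    """
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)                        # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))          # 20 EEG channels, 8 features each
A = np.abs(rng.normal(size=(20, 20)))
A = (A + A.T) / 2                     # symmetric stand-in for learned adjacency
H = graph_conv(X, A, rng.normal(size=(8, 4)))
print(H.shape)                        # (20, 4)
```

Making A a trainable parameter (rather than a fixed, predefined brain-connectivity graph) is what lets the model adapt the channel connections to the sleep-staging objective.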
ICLR Conference 2020 Conference Paper
This paper aims to analyze knowledge consistency between pre-trained deep neural networks. We propose a generic definition for knowledge consistency between neural networks at different fuzziness levels. A task-agnostic method is designed to disentangle feature components, which represent the consistent knowledge, from raw intermediate-layer features of each neural network. As a generic tool, our method can be broadly used for different applications. In preliminary experiments, we have used knowledge consistency as a tool to diagnose representations of neural networks. Knowledge consistency provides new insights to explain the success of existing deep-learning techniques, such as knowledge distillation and network compression. More crucially, knowledge consistency can also be used to refine pre-trained networks and boost performance.
AAAI Conference 2020 Conference Paper
Logo classification has gained increasing attention for its various applications, such as copyright infringement detection, product recommendation and contextual advertising. Compared with other types of object images, real-world logo images have larger variety in logo appearance and more complexity in their background. Therefore, recognizing the logo from images is challenging. To support efforts towards a scalable logo classification task, we have curated Logo-2K+, a new large-scale publicly available real-world logo dataset with 2,341 categories and 167,140 images. Compared with existing popular logo datasets, such as FlickrLogos-32 and LOGO-Net, Logo-2K+ has more comprehensive coverage of logo categories and a larger quantity of logo images. Moreover, we propose a Discriminative Region Navigation and Augmentation Network (DRNA-Net), which is capable of discovering more informative logo regions and augmenting these image regions for logo classification. DRNA-Net consists of four sub-networks: the navigator sub-network first selects informative logo-relevant regions guided by the teacher sub-network, which evaluates each region's confidence of belonging to the ground-truth logo class. The data augmentation sub-network then augments the selected regions via both region cropping and region dropping. Finally, the scrutinizer sub-network fuses features from the augmented regions and the whole image for logo classification. Comprehensive experiments on Logo-2K+ and three other existing benchmark datasets demonstrate the effectiveness of the proposed method. Logo-2K+ and the proposed strong baseline DRNA-Net are expected to further the development of scalable logo image recognition, and the Logo-2K+ dataset can be found at https://github.com/msn199959/Logo-2k-plus-Dataset.
JBHI Journal 2020 Journal Article
Objective: accurately classifying the malignancy of lesions detected in a screening scan is critical for reducing false positives. Radiomics holds great potential to differentiate malignant from benign tumors by extracting and analyzing a large number of quantitative image features. Since not all radiomic features contribute to an effective classifying model, selecting an optimal feature subset is critical. Methods: this work proposes a new multi-objective based feature selection (MO-FS) algorithm that considers sensitivity and specificity simultaneously as the objective functions during feature selection. For MO-FS, we developed a modified entropy-based termination criterion that stops the algorithm automatically rather than relying on a preset number of generations. We also designed a solution selection methodology for multi-objective learning that uses the evidential reasoning approach (SMOLER) to automatically select the optimal solution from the Pareto-optimal set. Furthermore, we developed an adaptive mutation operation to generate the mutation probability in MO-FS automatically. Results: we evaluated MO-FS for classifying lung nodule malignancy in low-dose CT and breast lesion malignancy in digital breast tomosynthesis. Conclusion: the experimental results demonstrated that the feature set selected by MO-FS achieved better classification performance than features selected by other commonly used methods. Significance: the proposed method is a general and more effective radiomic feature selection strategy.
IJCAI Conference 2019 Conference Paper
Existing methods on representation-based subspace clustering mainly treat all features of data as a whole to learn a single self-representation and get one clustering solution. Real data, however, are often complex and consist of multiple attributes or sub-features, such as a face image having expression or gender attributes. Each attribute is distinct and complementary in depicting the data. Failing to explore attributes and capture the complementary information among them may lead to an inaccurate representation. Moreover, a single clustering solution is rather limited in depicting data, which can often be interpreted from different aspects and grouped into multiple clusters according to attributes. Therefore, we propose an innovative model called attributed subspace clustering (ASC). It simultaneously learns multiple self-representations on latent representations derived from original data. By utilizing the Hilbert-Schmidt Independence Criterion as a co-regularizing term, ASC enforces that each self-representation is independent and corresponds to a specific attribute. A more comprehensive self-representation is then established by adding these self-representations. Experiments on several benchmark image datasets have demonstrated the effectiveness of ASC not only in terms of clustering accuracy achieved by the integrated representation, but also in the diverse interpretation of data, which is beyond what current approaches can offer.
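The co-regularizer named here, the Hilbert-Schmidt Independence Criterion, has a simple empirical estimator. Below is the common biased HSIC estimate with linear kernels, shown as a standalone independence measure rather than ASC's full objective: HSIC is near zero for independent representations and grows with their statistical dependence, so penalizing it pushes two self-representations toward capturing different attributes.

```python
import numpy as np

def hsic(X, Y):
    """Biased empirical HSIC with linear kernels: tr(KHLH) / (n-1)^2,
    where K = XX^T, L = YY^T and H is the centering matrix."""
    n = X.shape[0]
    K, L = X @ X.T, Y @ Y.T                 # linear kernel matrices
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
indep = hsic(X, rng.normal(size=(200, 5)))          # unrelated representations
dep = hsic(X, X + 0.1 * rng.normal(size=(200, 5)))  # nearly identical ones
print(f"independent: {indep:.3f}, dependent: {dep:.3f}")
```

With linear kernels this quantity equals the squared Frobenius norm of the centered cross-covariance, so it is always non-negative and much larger for the dependent pair.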
IJCAI Conference 2019 Conference Paper
Label Distribution Learning (LDL) is a novel learning paradigm whose aim is to minimize the distance between the model output and the ground-truth label distribution. We notice that, in real-world applications, the learned label distribution model is generally treated as a classification model, with the label corresponding to the highest model output as the predicted label, which unfortunately prompts an inconsistency between the training phase and the test phase. To resolve the inconsistency, we propose in this paper a new Label Distribution Learning algorithm for Classification (LDL4C). Firstly, instead of KL-divergence, absolute loss is applied as the measure for LDL4C. Secondly, samples are re-weighted with information entropy. Thirdly, a large-margin classifier is adapted to boost discrimination precision. We then reveal that LDL4C theoretically seeks a balance between generalization and discrimination. Finally, we compare LDL4C with existing LDL algorithms on 17 real-world datasets, and experimental results demonstrate the effectiveness of LDL4C in classification.
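Two of the ingredients named above, the absolute loss between distributions and the entropy used for re-weighting, are simple to state; the sketch below shows only these two basic quantities, not LDL4C's actual weighting scheme:

```python
import math

def abs_loss(p, q):
    """Absolute (L1) distance between a predicted and a ground-truth
    label distribution -- the measure LDL4C adopts instead of KL."""
    return sum(abs(a - b) for a, b in zip(p, q))

def entropy(p, eps=1e-12):
    """Shannon entropy of a label distribution, the quantity LDL4C's
    sample re-weighting is based on (eps guards log(0))."""
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)
```

A one-hot distribution has entropy near 0, while a uniform one over two labels has entropy log 2, so the entropy separates "confident" from "ambiguous" samples.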
IJCAI Conference 2019 Conference Paper
Image paragraph generation is the task of producing a coherent story (usually a paragraph) that describes the visual content of an image. The problem nevertheless is not trivial, especially when there are multiple descriptive and diverse gists to be considered for paragraph generation, which often happens in real images. A valid question is how to encapsulate such gists/topics that are worthy of mention from an image, and then describe the image from one topic to another but holistically with a coherent structure. In this paper, we present a new design --- Convolutional Auto-Encoding (CAE) that purely employs a convolutional and deconvolutional auto-encoding framework for topic modeling on the region-level features of an image. Furthermore, we propose an architecture, namely CAE plus Long Short-Term Memory (dubbed as CAE-LSTM), that novelly integrates the learnt topics in support of paragraph generation. Technically, CAE-LSTM capitalizes on a two-level LSTM-based paragraph generation framework with an attention mechanism. The paragraph-level LSTM captures the inter-sentence dependency in a paragraph, while the sentence-level LSTM generates one sentence conditioned on each learnt topic. Extensive experiments are conducted on the Stanford image paragraph dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, CAE-LSTM increases CIDEr performance from 20.93% to 25.15%.
AAAI Conference 2019 Conference Paper
Semi-supervised representation-based subspace clustering is to partition data into their underlying subspaces by finding effective data representations with partial supervisions. Essentially, an effective and accurate representation should be able to uncover and preserve the true data structure. Meanwhile, a reliable and easy-to-obtain supervision is desirable for practical learning. To meet these two objectives, in this paper we make the first attempt towards utilizing the orderly relationship, such as "data point a is closer to b than to c", as a novel supervision. We propose an orderly subspace clustering (OSC) approach with a novel regularization term. OSC enforces the learned representations to simultaneously capture the intrinsic subspace structure and reveal an orderly structure that is faithful to the true data relationship. Experimental results with several benchmarks have demonstrated that, aside from more accurate clustering against state-of-the-arts, OSC interprets orderly data structure, which is beyond what current approaches can offer.
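The orderly supervision can be read as a triplet constraint; a minimal hinge-style penalty illustrating it (a simplified stand-in, not OSC's actual regularization term):

```python
def order_hinge(d_ab, d_ac, margin=0.0):
    """Zero when the learned representation keeps a closer to b than to c
    (d_ab < d_ac), positive otherwise -- the orderly relationship "a is
    closer to b than to c" expressed as a hinge penalty."""
    return max(0.0, d_ab - d_ac + margin)
```

Summing such penalties over all supervised triplets gives one plausible way to score how faithfully a representation preserves the given order.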
IJCAI Conference 2019 Conference Paper
Reinforcement learning methods for recommender systems optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items---which may have interacting effects on user choice---methods are required to deal with the combinatorics of the RL action space. We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. We demonstrate our methods in simulation, and validate the scalability and effectiveness of decomposed TD-learning on YouTube.
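The LTV decomposition the abstract states can be sketched as follows; this is a minimal illustration assuming a multinomial-logit user choice model and ignoring the no-click alternative that the full SlateQ formulation also handles:

```python
import math

def slate_ltv(item_scores, item_ltvs):
    """Core SlateQ-style identity (clicked-item case): the long-term value
    of a slate decomposes into a choice-probability-weighted sum of
    item-wise LTVs. Variable names here are illustrative."""
    exps = [math.exp(s) for s in item_scores]  # logit choice model
    z = sum(exps)
    return sum((e / z) * q for e, q in zip(exps, item_ltvs))
```

With equal scores the slate value is just the mean of the item LTVs; as one item's score dominates, the slate value approaches that item's LTV, which is what makes item-wise Q-learning tractable.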
AAAI Conference 2019 Conference Paper
As a novel learning paradigm, label distribution learning (LDL) explicitly models label ambiguity with the definition of label description degree. Although lots of work has been done to deal with real-world applications, theoretical results on LDL remain unexplored. In this paper, we rethink LDL from theoretical aspects, towards analyzing the learnability of LDL. Firstly, risk bounds for three representative LDL algorithms (AA-kNN, AA-BP and SA-ME) are provided. For AA-kNN, Lipschitzness of the label distribution function is assumed to bound the risk, and for AA-BP and SA-ME, Rademacher complexity is utilized to give data-dependent risk bounds. Secondly, a generalized plug-in decision theorem is proposed to understand the relation between LDL and classification, uncovering that approximation to the conditional probability distribution function in absolute loss guarantees approaching the optimal classifier, and also data-dependent error probability bounds are presented for the corresponding LDL algorithms to perform classification. As far as we know, this is perhaps the first research on the theory of LDL.
AAAI Conference 2018 Conference Paper
Motivated by the important archaeological application of exploring cultural heritage objects, in this paper we study the challenging problem of automatically segmenting curve structures that are very weakly stamped or carved on an object surface in the form of a highly noisy depth map. Different from most classical low-level image segmentation methods that are known to be very sensitive to noise and occlusions, we propose a new supervised learning algorithm based on Convolutional Neural Network (CNN) to implicitly learn and utilize more curve geometry and pattern information for addressing this challenging problem. More specifically, we first propose a Fully Convolutional Network (FCN) to estimate the skeleton of curve structures and, at each skeleton pixel, a scale value is estimated to reflect the local curve width. Then we propose a dense prediction network to refine the estimated curve skeletons. Based on the estimated scale values, we finally develop an adaptive thresholding algorithm to achieve the final segmentation of curve structures. In the experiment, we validate the performance of the proposed method on a dataset of depth images scanned from unearthed pottery sherds dating to the Woodland period of Southeastern North America.
AIIM Journal 2018 Journal Article
EAAI Journal 2018 Journal Article
IJCAI Conference 2018 Conference Paper
Trust prediction, aiming to predict the trust relations between users in a social network, is a key to helping users discover the reliable information. Many trust prediction methods are proposed based on the low-rank assumption of a trust network. However, one typical property of the trust network is that the trust relations follow the power-law distribution, i.e., few users are trusted by many other users, while most tail users have few trustors. Due to these tail users, the fundamental low-rank assumption made by existing methods is seriously violated and becomes unrealistic. In this paper, we propose a simple yet effective method to address the problem of the violated low-rank assumption. Instead of discovering the low-rank component of the trust network alone, we learn a sparse component of the trust network to describe the tail users simultaneously. With both of the learned low-rank and sparse components, the trust relations in the whole network can be better captured. Moreover, the transitive closure structure of the trust relations is also integrated into our model. We then derive an effective iterative algorithm to infer the parameters of our model, along with the proof of correctness. Extensive experimental results on real-world trust networks demonstrate the superior performance of our proposed method over the state-of-the-arts.
IJCAI Conference 2018 Conference Paper
Nonnegative matrix factorization (NMF), a well-known technique to find parts-based representations of nonnegative data, has been widely studied. In reality, ordinal relations often exist among data, such as data i is more related to j than to q. Such relative order is naturally available, and more importantly, it truly reflects the latent data structure. Preserving the ordinal relations enables us to find structured representations of data that are faithful to the relative order, so that the learned representations become more discriminative. However, current NMFs pay no attention to this. In this paper, we make the first attempt towards incorporating the ordinal relations and propose a novel ranking preserving nonnegative matrix factorization (RPNMF) approach, which enforces the learned representations to be ranked according to the relations. We derive iterative updating rules to solve RPNMF's objective function with convergence guaranteed. Experimental results with several datasets for clustering and classification have demonstrated that RPNMF achieves greater performance against the state-of-the-arts, not only in terms of accuracy, but also interpretation of orderly data structure.
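For context on the factorization RPNMF augments, here is a minimal pure-Python sketch of the standard multiplicative NMF updates (Lee-Seung); RPNMF's ranking-preserving term would add to this objective and modify these updates, which is omitted here:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf_step(V, W, H, eps=1e-9):
    """One round of multiplicative updates for V ~= W H with nonnegative
    factors; each update is guaranteed not to increase the squared error."""
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[H[i][j] * num[i][j] / (den[i][j] + eps)
          for j in range(len(H[0]))] for i in range(len(H))]
    Ht = transpose(H)
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[W[i][j] * num[i][j] / (den[i][j] + eps)
          for j in range(len(W[0]))] for i in range(len(W))]
    return W, H

def frob_err(V, W, H):
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))
```

Iterating `nmf_step` from a nonnegative initialization drives `frob_err` monotonically downward, which is the convergence behavior RPNMF's derivation also has to preserve under its extra constraint.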
AAAI Conference 2018 Conference Paper
Impressive image captioning results (i.e., an objective description for an image) are achieved with plenty of training pairs. In this paper, we take one step further to investigate the creation of a narrative paragraph for a photo stream. This task is even more challenging due to the difficulty in modeling an ordered photo sequence and in generating a relevant paragraph with an expressive language style for storytelling. The difficulty can even be exacerbated by the limited training data, so that existing approaches mostly focus on search-based solutions. To deal with these challenges, we propose a sequence-to-sequence modeling approach with reinforcement learning and adversarial training. First, to model the ordered photo stream, we propose a hierarchical recurrent neural network as the story generator, which is optimized by reinforcement learning with rewards. Second, to generate relevant and story-style paragraphs, we design the rewards with two critic networks, including a multi-modal and a language-style discriminator. Third, we further consider the story generator and reward critics as adversaries. The generator aims to create paragraphs indistinguishable from human-level stories, whereas the critics aim at distinguishing them and further improving the generator by policy gradient. Experiments on three widely-used datasets show the effectiveness against state-of-the-art methods, with a relative increase of 20.2% in METEOR. We also show the subjective preference for the proposed approach over the baselines through a user study with 30 human subjects.
YNIMG Journal 2017 Journal Article
AAAI Conference 2017 Conference Paper
Network embedding, aiming to learn the low-dimensional representations of nodes in networks, is of paramount importance in many real applications. One basic requirement of network embedding is to preserve the structure and inherent properties of the networks. While previous network embedding methods primarily preserve the microscopic structure, such as the first- and second-order proximities of nodes, the mesoscopic community structure, which is one of the most prominent features of networks, is largely ignored. In this paper, we propose a novel Modularized Nonnegative Matrix Factorization (M-NMF) model to incorporate the community structure into network embedding. We exploit the consensus relationship between the representations of nodes and community structure, and then jointly optimize the NMF based representation learning model and the modularity based community detection model in a unified framework, which enables the learned representations of nodes to preserve both the microscopic and community structures. We also provide efficient updating rules to infer the parameters of our model, together with the correctness and convergence guarantees. Extensive experimental results on a variety of real-world networks show the superior performance of the proposed method over the state-of-the-arts.
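The community-detection side of M-NMF rests on modularity; a minimal sketch of the modularity matrix that term is built from (the textbook definition for an undirected, unweighted adjacency matrix, not the paper's full joint model):

```python
def modularity_matrix(A):
    """B[i][j] = A[i][j] - k_i * k_j / (2m), where k_i is the degree of
    node i and m the number of edges: observed connectivity minus the
    expectation under a random graph with the same degree sequence."""
    k = [sum(row) for row in A]   # node degrees
    two_m = sum(k)                # 2m = sum of degrees
    n = len(A)
    return [[A[i][j] - k[i] * k[j] / two_m for j in range(n)]
            for i in range(n)]
```

Each row of B sums to zero, a standard sanity check; community assignments that concentrate positive B entries inside groups have high modularity.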
YNICL Journal 2017 Journal Article
IJCAI Conference 2017 Conference Paper
Real data are usually complex and contain various components. For example, face images have expressions and genders. Each component mainly reflects one aspect of data and provides information others do not have. Therefore, exploring the semantic information of multiple components as well as the diversity among them is of great benefit to understand data comprehensively and in-depth. However, this cannot be achieved by current nonnegative matrix factorization (NMF)-based methods, despite that NMF has shown remarkable competitiveness in learning parts-based representation of data. To overcome this limitation, we propose a novel multi-component nonnegative matrix factorization (MCNMF). Instead of seeking for only one representation of data, MCNMF learns multiple representations simultaneously, with the help of the Hilbert Schmidt Independence Criterion (HSIC) as a diversity term. HSIC explores the diverse information among the representations, where each representation corresponds to a component. By integrating the multiple representations, a more comprehensive representation is then established. A new iterative updating optimization scheme is derived to solve the objective function of MCNMF, along with its correctness and convergence guarantees. Extensive experimental results on real-world datasets have shown that MCNMF not only achieves more accurate performance over the state-of-the-arts using the aggregated representation, but also interprets data from different aspects with the multiple representations, which is beyond what current NMFs can offer.
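A minimal illustration of the HSIC diversity idea for the special case of scalar variables with linear kernels, where the biased empirical estimator collapses to a squared centered cross-moment; MCNMF applies HSIC between whole representation matrices, which this sketch does not reproduce:

```python
def hsic_linear_1d(x, y):
    """Biased empirical HSIC with linear kernels on scalar samples.
    For linear kernels, trace(KHLH) reduces to the squared inner product
    of the centered variables; zero means no linear dependence."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    return (dot / (n - 1)) ** 2
```

Used as a penalty between two learned components, driving this quantity toward zero pushes the components to carry non-redundant information, which is the role HSIC plays in MCNMF's objective.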
YNIMG Journal 2017 Journal Article
IJCAI Conference 2013 Conference Paper
Online feature selection with dynamic features has become an active research area in recent years. However, in some real-world applications such as image analysis and email spam filtering, features may arrive by groups. Existing online feature selection methods evaluate features individually, while existing group feature selection methods cannot handle online processing. Motivated by this, we formulate the online group feature selection problem, and propose a novel selection approach for this problem. Our proposed approach consists of two stages: online intra-group selection and online inter-group selection. In the intra-group selection, we use spectral analysis to select discriminative features in each group when it arrives. In the inter-group selection, we use Lasso to select a globally optimal subset of features. This 2-stage procedure continues until there are no more features to come or some predefined stopping conditions are met. Extensive experiments conducted on benchmark and real-world data sets demonstrate that our proposed approach outperforms other state-of-the-art online feature selection methods.
YNIMG Journal 2012 Journal Article
AAAI Conference 2010 Conference Paper
There is a large body of work on the evolution of graphs in various domains, which shows that many real graphs evolve in a similar manner. In this paper we study a novel type of network formed by mentor-apprentice relationships in a massively multiplayer online role playing game. We observe that some of the static and dynamic laws which have been observed in many other real world networks are not observed in this network. Consequently, well-known graph generators like Preferential Attachment, Forest Fire, Butterfly, RTM, etc., cannot be applied to such mentoring networks. We propose a novel generative model to generate networks with the characteristics of mentoring networks.
NeurIPS Conference 2006 Conference Paper
The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much more stable. It can retrieve the ideal embedding if MLLE is applied on data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE.
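For a concrete picture of the reconstruction weights being characterized, here is a sketch of the standard single-weight-vector LLE solve for one point with two neighbors; MLLE's contribution is to keep several linearly independent weight vectors per neighborhood (drawn from the near-null space of the local Gram matrix) rather than this single solution:

```python
def lle_weights_2nn(x, n1, n2, reg=1e-6):
    """Minimize ||x - w1*n1 - w2*n2||^2 subject to w1 + w2 = 1 via the
    local Gram matrix G[i][j] = (x - n_i) . (x - n_j): solve G w = 1,
    then normalize so the weights sum to one."""
    d = [[xi - ni for xi, ni in zip(x, n)] for n in (n1, n2)]
    G = [[sum(a * b for a, b in zip(d[i], d[j])) for j in range(2)]
         for i in range(2)]
    G[0][0] += reg
    G[1][1] += reg  # regularize for numerical stability
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    # 2x2 inverse applied to the all-ones vector
    w = [(G[1][1] - G[0][1]) / det, (G[0][0] - G[1][0]) / det]
    s = w[0] + w[1]
    return [w[0] / s, w[1] / s]
```

When x sits at the midpoint of its two neighbors, the solve returns equal weights of 0.5, and the instability MLLE targets shows up when G is nearly singular and this single solution is ill-conditioned.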
NeurIPS Conference 2004 Conference Paper
Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets.
TCS Journal 1998 Journal Article
ICRA Conference 1995 Conference Paper
A media access protocol, CSMA/CD-W (Carrier Sense Multiple Access with Collision Detection for Wireless) is proposed to support broadcasting and point-to-point communication in mobile robot based distributed robotic systems (DRS). Distinct from many existing experimental systems built with off-the-shelf wireless communication products for computers, no centralized mechanism such as a communication server, or "ground support" is used, which is consistent with basic principles of DRS. The proposed protocol supports wireless data communication among mobile robots on a shared radio communication channel. It differs from CSMA and its variations with the capability of detecting, in a wireless network, collisions of broadcast (undesignated) messages without using any centralized devices. Satisfactory performance of the protocol is demonstrated with a rigorously designed discrete event simulation.
ICRA Conference 1995 Conference Paper
A fully distributed algorithm is presented, which when executed by each robot, collectively allows multiple autonomous mobile robots to travel through a discrete traffic network composed of passage segments, intersections, and terminals, all of which are of only finite capacity. Each robot may establish dynamically its own route not known to others. Treating passage segments, intersections and terminals as shared, discrete resources, the algorithm guarantees ordered traffic flow in a discrete network such that (i) finite capacity constraints of passage segments and terminals are always enforced; (ii) no collision occurs at any intersection; (iii) deadlocks are detected and resolved. The system operates under the model of Distributed Robotic Systems (DRS), assuming no centralized mechanism, synchronized clock, shared memory or ground support. Inter-robot communication is only required among spatially adjacent robotic units. The algorithm is implementable with today's technology.
ICRA Conference 1995 Conference Paper
Two distributed operating primitives (1 out of N and deadlock detection) are presented to support fully distributed traffic regulation and control for multiple autonomous mobile robots operating in a 2-D discrete network consisting of passage segments, intersections and terminals, all of which are of only finite capacity. In consistency with the model of distributed robotic systems (DRS), no centralized mechanism, synchronized clock, shared memory or ground support is assumed. It is shown that simple, low bandwidth inter-robot communication is only required among a finite, small number of spatially adjacent robotic units. The correctness of these two distributed algorithms is provable.
ICRA Conference 1994 Conference Paper
Inter-robot communication based on the conceptual mechanism of "sign-board" in distributed robotic systems (DRS) is discussed. Equipped by each robot, a sign-board can be written only by the robot that carries it, and be read by robots in the neighborhood. Consistent with DRS principles, the sign-board model is not supported by any centralized mechanism, and is considered a natural way of interaction among autonomous robotic units. It is shown that along with message passing, the sign-board model is one of the two important mechanisms for inter-robot communication. Previous research on DRS algorithms employing the sign-board model assumes zero signal propagation delay. These algorithms may fail if non-zero propagation delay is taken into account. A simple fix for these algorithms exists if the propagation delay is bounded. Implementation strategies for the conceptual sign-board are also discussed.
IROS Conference 1994 Conference Paper
Resource sharing is crucial in any multi-agent system, and a distributed robotic system (DRS) is no exception. A new, general strategy of sharing multiple, discrete resources with predetermined capacities under the DRS model is proposed. It is based upon a media access protocol, CSMA/CD-W (Carrier Sense Multiple Access with Collision Detection for Wireless), which supports wireless inter-robot communication among multiple autonomous mobile robots without using any centralized mechanism. This resource sharing strategy is derived based on the fact that, with the single, time-multiplexed communication channel, asynchronous events for requesting and releasing resources are effectively serialized. It is shown that the control protocol is effective, efficient, reliable and robust.
IROS Conference 1993 Conference Paper
Distributed mutual exclusion (DME) is an important concept in any multi-agent system, including the distributed robotic system (DRS). Several basic DME algorithms employing sign-board as their inter-robot communication mechanism are presented. It is shown that a large number of DRS operating primitives, such as leader finding, dynamic ordering of robots and events, job assignment, and resource sharing, can be effectively implemented with algorithms based on DME.
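A minimal sketch of the sign-board access rule these DME algorithms rely on (owner-only writes, neighbor reads); the class and method names are illustrative, not from the paper:

```python
class SignBoard:
    """Conceptual sign-board: writable only by the robot that carries it,
    readable by robots in the neighborhood. No centralized mechanism is
    involved -- each robot owns exactly one board."""

    def __init__(self, owner):
        self.owner = owner
        self._msg = None

    def write(self, robot, msg):
        # Enforce the owner-only write rule of the sign-board model.
        if robot != self.owner:
            raise PermissionError("only the owner may write its sign-board")
        self._msg = msg

    def read(self):
        # Any neighboring robot may read the posted message.
        return self._msg
```

A DME algorithm would then have each robot post its request state (e.g. a timestamped claim on a resource) to its own board and decide entry by reading its neighbors' boards.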
IROS Conference 1991 Conference Paper
A model for studying fully distributed traffic control strategies for many-AGV systems in an operating field of a network of stations and passages is proposed. As a basic operating primitive, distributed mutual exclusion on a resource of capacity M (0