EAAI Journal 2026 Journal Article
A communication-efficient federated learning method for traffic flow prediction
- Kaiju Li
- Qiang Xu
- Dong Wang
- Xiang Nie
- Hao Wang
Author name cluster
Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
AAAI Conference 2026 Conference Paper
Human cognition excels at transcending sensory input and forming latent representations that structure our understanding of the world. While Large Language Model (LLM) agents demonstrate emergent reasoning and decision-making abilities, they lack a principled framework for capturing latent structures and modeling uncertainty. In this work, we explore for the first time how to bridge LLM agents with probabilistic graphical models (PGMs) to address agentic reasoning under uncertainty. To this end, we introduce Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian agentic framework that (i) guides LLM agents in following key principles of PGMs through natural language and (ii) refines the resulting posterior distributions via numerical Bayesian inference. Unlike many traditional probabilistic methods that require substantial domain expertise, vPGM bypasses expert-driven model design, making it well-suited for scenarios with limited assumptions. We evaluated our model on several agentic reasoning tasks, both closed-ended and open-ended. Our results indicate that the model effectively enhances confidence calibration and text generation quality.
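The posterior-refinement step can be illustrated with a minimal numerical Bayesian update on a discretized grid. This is a generic sketch, not the paper's actual model: `refine_posterior` and the Bernoulli-style evidence counts are hypothetical stand-ins for numerically refining the distribution over a verbalized latent variable.

```python
# Hypothetical sketch: numerical Bayesian refinement of a posterior
# over a latent success probability theta on a discretized grid.
def refine_posterior(grid, prior, successes, failures):
    # Unnormalized posterior: prior(theta) * theta^s * (1 - theta)^f
    post = [p * (t ** successes) * ((1 - t) ** failures)
            for t, p in zip(grid, prior)]
    z = sum(post)                       # normalizing constant
    return [p / z for p in post]

grid = [i / 100 for i in range(1, 100)]   # discretized theta values
prior = [1 / len(grid)] * len(grid)       # uniform prior
post = refine_posterior(grid, prior, successes=8, failures=2)
theta_map = grid[post.index(max(post))]   # posterior mode
```

With 8 successes and 2 failures under a uniform prior, the posterior mode lands at theta = 0.8, as expected for this likelihood.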
TMLR Journal 2026 Journal Article
Multimodal Large Language Models (MLLMs) deliver detailed responses on vision-language tasks, yet remain susceptible to object hallucination (introducing objects not present in the image), undermining reliability in practice. Prior efforts often rely on heuristic penalties, post-hoc correction, or generic decoding tweaks, which do not directly intervene in the mechanisms that trigger object hallucination and thus yield limited gains. To address this challenge, we propose a causal decoding framework that applies targeted causal interventions during generation to curb spurious object mentions. By reshaping the decoding dynamics to attenuate spurious dependencies, our approach reduces false object tokens while maintaining descriptive quality. Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.
AAAI Conference 2026 Conference Paper
Stochastic multi-objective optimization (SMOOP) requires ranking multivariate distributions, yet most empirical studies resort to scalarization, which loses information and is unreliable. Building on optimal transport theory, we introduce the center-outward q-dominance relation and prove that it implies strong first-order stochastic dominance (FSD). We also develop an empirical test procedure based on q-dominance and derive an explicit sample size threshold, n(δ), to control the Type I error. We verify the usefulness of our approach in two scenarios: (1) as a ranking method in hyperparameter tuning; (2) as a selection method in multi-objective optimization algorithms. For the former, we analyze the final stochastic Pareto sets of seven multi-objective hyperparameter tuners on the YAHPO-MO benchmark tasks with q-dominance, which allows us to compare these tuners when the expected hypervolume indicator (HVI, the most common performance metric) of the Pareto sets becomes indistinguishable. For the latter, we replace the mean value-based selection in the NSGA-II algorithm with q-dominance, which yields a superior convergence rate on noise-augmented ZDT benchmark problems. These results establish center-outward q-dominance as a principled, tractable foundation for seeking truly stochastically dominant solutions for SMOOPs.
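A drastically simplified illustration of empirical stochastic-dominance testing, on a single objective via empirical CDFs and assuming minimization: the actual center-outward q-dominance relation is defined via optimal transport in the multivariate setting, which this sketch does not implement.

```python
def ecdf_dominates(a, b):
    """Sample a first-order stochastically dominates sample b (for
    minimization) if F_a(t) >= F_b(t) at every t, i.e. a puts at
    least as much mass on small values everywhere. Univariate toy
    version only; not the paper's center-outward construction."""
    points = sorted(set(a) | set(b))
    def ecdf(sample, t):
        return sum(x <= t for x in sample) / len(sample)
    return all(ecdf(a, t) >= ecdf(b, t) for t in points)

a_wins = ecdf_dominates([1, 2, 3], [2, 3, 4])   # shifted-down sample
b_wins = ecdf_dominates([2, 3, 4], [1, 2, 3])
```

The shifted-down sample dominates the shifted-up one but not vice versa, matching the intuition that FSD is a partial order.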
AAAI Conference 2026 Conference Paper
Centralized training with decentralized execution (CTDE) is a widely applied framework for multi-agent reinforcement learning (MARL). In the CTDE paradigm, agents leverage global state information during training to mitigate the non-stationarity of the MARL environment, but must rely solely on partial observations during execution. Recent work has highlighted the growing importance of inter-agent communication for more effective learning and coordination. However, most existing methods overlook the fact that real-world communication channels are often bandwidth-constrained and imperfectly reliable. Toward more communication-efficient and robust MARL, we extend the conventional CTDE framework with an information hub. The hub collects local observations from the agents to restore the global state, which is then delivered to the agents on demand. To this end, technical mechanisms are designed to enable effective global reconstruction with incomplete observations, as well as agent-specific attention to the reconstructed global information. Experiments on multiple cooperative MARL benchmarks demonstrate that our method achieves state-of-the-art performance compared to popular MARL algorithms while substantially reducing communication overhead and exhibiting strong robustness under imperfect communication channels.
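The hub idea might be sketched as follows, under assumed details: a hypothetical `InfoHub` that restores a global state from incomplete per-agent reports by falling back to each agent's last known observation. The paper's learned reconstruction and agent-specific attention mechanisms are omitted here.

```python
class InfoHub:
    """Hypothetical sketch: a hub that reconstructs a global state
    from incomplete per-agent observations, keeping the last value
    seen for any agent whose report was lost (e.g. a dropped packet)."""
    def __init__(self, n_agents, obs_dim):
        self.state = [[0.0] * obs_dim for _ in range(n_agents)]

    def collect(self, reports):
        # reports: {agent_id: observation}; absent agents keep stale state
        for aid, obs in reports.items():
            self.state[aid] = list(obs)

    def serve(self, agent_id):
        # Deliver the reconstructed global state (flattened) on demand.
        return [v for row in self.state for v in row]

hub = InfoHub(n_agents=3, obs_dim=2)
hub.collect({0: [1.0, 2.0], 2: [5.0, 6.0]})   # agent 1's report was lost
```

Agent 1's missing slot stays at its last known value, so every agent still receives a complete (if partly stale) global state.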
TCS Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Spatiotemporal analysis of facial behavior is a crucial method for evaluating the mental state of depression patients. In practice, however, depressed patients often display facial behaviors similar to those of healthy individuals due to masking tendencies. Additionally, facial expressions differ among depressed patients themselves, further increasing the difficulty of assessment. To address this, we propose Dep-MAP, a video-based automatic depression assessment model for the complex facial behaviors of depression patients. Dep-MAP adopts a dual-branch architecture to extract visual features of facial behavior and capture the corresponding emotional semantic features. Specifically, the extracted deep semantic features are clustered into semantically distinct prototype sets, where each severity group learns a set of discriminative facial behavior prototype representations to suppress inter-class semantic confusion. Subsequently, we propose a semantic prototype-supervised contrastive learning method, which aligns latent semantics between shallow and deep features, realizing emotional semantic guidance and self-knowledge distillation for the visual feature branch and effectively suppressing intra-class differences. We then integrate key depression cues across multiple spatiotemporal scales via a multi-scale weighted fusion strategy to perform automatic depression assessment. Experimental results demonstrate that Dep-MAP effectively identifies potential key frames in temporal sequences and aggregates key-frame representations with semantic consistency, significantly surpassing state-of-the-art results on the AVEC2013 and AVEC2014 public datasets.
AAAI Conference 2026 Conference Paper
Fine-grained Visual Recognition (FGVR) aims to distinguish between categories with subtle inter-class differences and large intra-class variations. While Vision Transformers with attention mechanisms have been widely adopted for FGVR, they usually suffer from high computational complexity and entangled global representations. Recent advancements in state-space models, exemplified by Mamba, have showcased substantial potential in vision-related tasks due to their linear scalability and rich sequence modeling capacity. To this end, we propose DHMamba, a novel Mamba-based FGVR method. The proposed method leverages hypergraphs to guide selective scanning and strengthen Mamba’s capability in modeling fine-grained semantics. Furthermore, a Disentangled Local Scanning (DLS) module is introduced that uses hyperedges to allocate distinct informative patches to independent channels, mitigating representational entanglement. Extensive experiments conducted on multiple FGVR benchmarks demonstrate that the proposed DHMamba outperforms state-of-the-art methods, validating the efficacy of combining state-space modeling with hypergraph-based feature structuring.
AAAI Conference 2026 Conference Paper
Search and recommendation are pivotal for information access and are increasingly unified to exploit shared user-item interactions. Both tasks suffer from data sparsity, which joint modeling can mitigate by integrating behavioral data with or without explicit queries. However, existing unified frameworks rarely distinguish between users’ long- and short-term interests, despite their divergent temporal dynamics in search and recommendation. In this work, we propose a novel model, DHIM, which explicitly disentangles and integrates users’ long- and short-term interests across both the search and recommendation scenarios. First, long- and short-term interests are independently extracted from search and recommendation using a unified extraction strategy. These interests are then adaptively integrated via a cross-scenario fusion module. A self-supervised contrastive loss supervises the learning of both interest types within and across scenarios. The resulting representations are fed into downstream search and recommendation models for prediction. Extensive experiments on two public benchmarks demonstrate that our approach consistently outperforms single-scenario and state-of-the-art joint models, achieving superior accuracy and generalizability. To our knowledge, this is the first work to incorporate explicit dual-horizon interest modeling into a unified search and recommendation framework with self-supervised contrastive learning.
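The self-supervised contrastive objective is presumably an InfoNCE-style loss; the following is a generic sketch with toy interest vectors, not DHIM's exact formulation: the anchor (one interest embedding) is pulled toward its positive and pushed away from negatives.

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """Toy InfoNCE loss over raw dot-product similarities.
    A generic contrastive sketch, not DHIM's actual objective."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    logits = [dot(anchor, positive) / tau] + \
             [dot(anchor, n) / tau for n in negatives]
    # Numerically stable -log softmax of the positive logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])   # low loss
mismatched = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])  # high loss
```

An aligned anchor-positive pair gives a near-zero loss, while a mismatched pair (the negative is more similar than the positive) gives a large one.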
AAAI Conference 2026 Conference Paper
3D human avatar animation aims at transforming a human avatar from an arbitrary initial pose to a specified target pose using deformation algorithms. Existing approaches typically divide this task into two stages: canonical template construction and target pose deformation. However, current template construction methods demand extensive skeletal rigging and often produce artifacts in contact regions. Moreover, target pose deformation suffers from structural distortions caused by Linear Blend Skinning (LBS), which significantly undermines animation realism. We propose a unified learning-based framework that tackles both challenges in two phases. In the first phase, to overcome the inefficiencies and artifacts of template construction, we leverage a U-Net architecture that decouples texture and pose information in a feed-forward process, enabling fast generation of a human template. In the second phase, we propose a data-driven refinement technique that enhances structural integrity. Extensive experiments show that our model delivers consistent performance across diverse poses with an optimal balance between efficiency and quality, surpassing state-of-the-art (SOTA) methods.
JBHI Journal 2026 Journal Article
Odontogenic cystic lesions (OCLs) are complex jaw abnormalities that require precise diagnosis for appropriate treatment. Visual OCL diagnosis is commonly based on reviewing cone-beam computed tomography (CBCT) scans to identify, in a hierarchical manner, morpho-pathological features associated with specific lesion types. Current state-of-the-art methods extract features from the image with no guidance beyond the lesion diagnosis and do not fully leverage the hierarchical relationship between the lesion diagnosis and morphological features. In this study, we propose a hierarchical deep decision tree network (H2DT-Net) with three modules: a deep decision tree-based hierarchical learning module (DHLM) to leverage inter-categorical relationships; a feature category embedding module (FCEM) to capture representations from both diagnostic and morpho-pathological domains and support the DHLM; and a lesion localised attention module (LLAM) to facilitate feature extraction by generating lesion-focused attention maps. Evaluated on 289 CBCT images, H2DT-Net achieved state-of-the-art performance in OCL classification. We further demonstrate that our method is effective in clinical settings, where it outperformed six maxillofacial clinicians in diagnostic assessment.
AAAI Conference 2026 Conference Paper
In recent years, the rapid progress of deep learning has driven notable advances in satellite video tracking, a critical task for applications such as environmental monitoring, disaster management, and defense. Despite these strides, existing approaches remain constrained by their inability to handle dynamic challenges such as target appearance variations, complex motion patterns, and occlusions. Traditional methods often suffer from static template matching or overly complex update mechanisms, compromising their robustness and practicality in real-world scenarios. To address these limitations, we propose HTTrack, which shifts the satellite video tracking paradigm by integrating historical trajectory knowledge with visual features. This fusion enhances the tracker's perceptual understanding of targets over time, enabling more adaptive and resilient tracking. By aligning spatial, temporal, and cross-modal information, our approach effectively bridges the gap between fragmented observations and coherent tracking performance, even under challenging conditions such as small targets and cluttered backgrounds. Extensive experiments on multiple satellite video tracking benchmarks demonstrate the superiority of our method: HTTrack achieves success rates of 51.5% on SV248S, 52.9% on SatSOT, and 32.6% on VISO, significantly outperforming state-of-the-art trackers and marking a step toward robust, accurate, and scalable satellite video tracking.
AAAI Conference 2026 Conference Paper
Retrieval-augmented generation (RAG) has recently emerged as a powerful framework for knowledge-intensive natural language processing tasks, which leverages the strengths of both pre-trained language models and external knowledge. While significant progress has been made, the scaling behavior of these approaches during inference remains poorly understood. Toward this end, this paper presents a comprehensive study of inference scaling laws for RAG models, investigating how inference performance scales with key factors including retriever model scale, generator model scale, number of retrieved documents, and context window size. Through extensive experiments on benchmark datasets, we establish empirical scaling laws that reveal power-law and sigmoid-type relationships between these factors and performance. We further build a joint inference scaling law with theoretical justification. With the proposed scaling laws, we can understand the performance tendency of RAG models under different computational resources. We believe our insights can pave the way for efficient and effective deployment of RAG models in more applications.
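Empirical power-law scaling laws of this kind are typically estimated by ordinary least squares in log-log space. A minimal sketch under that assumption (the paper's joint law also involves sigmoid-type components, which this omits):

```python
import math

def fit_power_law(ns, ys):
    """Fit y ~ a * n^b by least squares on (log n, log y).
    Generic scaling-law estimation, not the paper's exact procedure."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    k = len(xs)
    mx, ml = sum(xs) / k, sum(ls) / k
    b = sum((x - mx) * (l - ml) for x, l in zip(xs, ls)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(ml - b * mx)
    return a, b

# Data generated exactly from y = 2 * n^0.5 is recovered exactly:
a, b = fit_power_law([1, 4, 16, 64], [2.0, 4.0, 8.0, 16.0])
```

Because a power law is linear in log-log coordinates, the slope gives the exponent b and the intercept gives log a.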
AAAI Conference 2026 Conference Paper
Partial linear models (PLM) have attracted much attention for regression estimation and variable selection due to their ability to combine linear and nonlinear approximations. However, theoretical understanding of how they control the false discovery rate (FDR) during variable selection remains limited. To address this issue, we formulate a new integral-based knockoffs (IKO) inference scheme for controlled variable selection in PLM, where integral-based knockoff statistics measure variable importance and B-splines (or random Fourier features) approximate the nonlinear components. In theory, FDR control is guaranteed for both linear and nonlinear parts, and a statistical analysis of its power is established. Empirical evaluations validate the effectiveness of our proposed approach.
AAAI Conference 2026 Conference Paper
"Fedspeak", the stylized and often nuanced language used by the U.S. Federal Reserve, encodes implicit policy signals and strategic stances. The Federal Open Market Committee strategically employs Fedspeak as a communication tool to shape market expectations and influence both domestic and global economic conditions. As such, automatically parsing and interpreting Fedspeak presents a high-impact challenge, with significant implications for financial forecasting, algorithmic trading, and data-driven policy analysis. Technically, to enrich the semantic and contextual representation of Fedspeak texts, we incorporate domain-specific reasoning grounded in the monetary policy transmission mechanism. We further introduce a dynamic uncertainty decoding module to assess the confidence of model predictions, thereby enhancing both classification accuracy and model reliability. Experimental results demonstrate that our framework achieves state-of-the-art performance on the policy stance analysis task. Moreover, statistical analysis reveals a significant positive correlation between perceptual uncertainty and model error rates, validating the effectiveness of perceptual uncertainty as a diagnostic signal.
AAAI Conference 2026 Conference Paper
Given the high cost of large language model (LLM) training from scratch, safeguarding LLM intellectual property (IP) becomes increasingly crucial. As the standard paradigm for IP ownership verification, LLM fingerprinting thus plays a vital role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, leaving them ineffective when the model thief fully controls the LLM's inference process. In such settings, attackers may share prompt-response pairs to enable fingerprint unlearning, or manipulate outputs to evade exact-match verification. We propose iSeal, the first fingerprinting method designed for reliable verification when the model thief controls the suspected LLM in an end-to-end manner. It injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. These components are resistant to verification-time attacks, including collusion-based fingerprint unlearning and response manipulation, backed by both theoretical analysis and empirical results. iSeal achieves 100% Fingerprint Success Rate (FSR) on 12 LLMs against more than 10 attacks, while baselines fail under unlearning and response manipulations.
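The combination of error correction and similarity-based (rather than exact-match) verification can be illustrated with a toy repetition code and a Hamming-distance budget. `majority_decode`, `verify`, and the thresholds below are hypothetical stand-ins, not iSeal's actual mechanism:

```python
def majority_decode(bits, rep=3):
    """Repetition-code error correction: each payload bit appears
    rep times; decode by majority vote over each run."""
    return [int(sum(bits[i:i + rep]) * 2 > rep)
            for i in range(0, len(bits), rep)]

def verify(expected, observed, rep=3, max_frac=0.2):
    """Similarity-based check: accept if the decoded fingerprint is
    within a Hamming-distance budget of the expected one, instead of
    requiring an exact match."""
    decoded = majority_decode(observed, rep)
    dist = sum(e != d for e, d in zip(expected, decoded))
    return dist / len(expected) <= max_frac

payload = [1, 0, 1, 1]
sent = [b for bit in payload for b in [bit] * 3]  # repetition encoding
sent[1] ^= 1            # one bit flipped by response manipulation
ok = verify(payload, sent)
```

The single flipped bit is absorbed by the majority vote, so verification still succeeds, while an unrelated response fails the distance budget.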
AAAI Conference 2026 Conference Paper
Data scarcity continues to be a critical bottleneck in the field of robotic manipulation, limiting the ability to train robust and generalizable models. While diffusion models provide a promising approach to synthesizing realistic robotic manipulation videos, their effectiveness hinges on the availability of precise and reasonable control instructions. Current methods primarily rely on 2D trajectories as instruction prompts, which inherently face issues with 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from an input image and a text instruction. Our method combines 3D trajectory planning with a reconstructed 3D occupancy map created from a third-person perspective, along with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory that minimizes path length, avoids collisions, and retimes the motion. Next, we employ a latent editing technique to create video sequences from the initial image latent, the text instruction, and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method significantly reduces human intervention by autonomously planning plausible 3D trajectories. Experimental results demonstrate its superior visual quality and precision.
JBHI Journal 2026 Journal Article
Acoustic features are crucial behavioral indicators for depression detection. However, prior speech-based depression detection methods often overlook the variability of emotional patterns across samples, leading to interference from speaker identity and hindering the effective extraction of emotional changes. To address this limitation, we developed the Emotional Word Reading Experiment (EWRE) and introduced MFE-Former, a method combining self-supervised and supervised learning for depression detection from speech. First, we generate fine-grained emotional representations for response segments by computing cosine similarity between intra-sample and inter-sample contexts. Concurrently, orthogonality constraints decouple identity information from emotional features, while a Transformer decoder reconstructs spectral structures to improve sensitivity to depression-related emotional patterns. Next, we propose a multi-scale emotion change perception module and a Bernoulli distribution-based joint decision module to integrate multi-level information for depression detection. By enhancing the distribution differences among positive, neutral, and negative emotional features, we find that patients with depression are more inclined to express negative emotions, whereas healthy individuals express more positive emotions. Experimental results on EWRE and AVEC 2014 show that MFE-Former outperforms state-of-the-art temporal methods under conditions of variability in emotional patterns across samples.
AAAI Conference 2026 Conference Paper
Simulating microstructure evolution (MicroEvo) is vital for materials design but demands high numerical accuracy, efficiency, and physical fidelity. Although recent work on deep learning (DL) offers a promising alternative to traditional solvers, the field lacks standardized benchmarks. Existing studies are limited by a failure to compare specialized MicroEvo DL models against state-of-the-art spatio-temporal architectures, an overemphasis on numerical accuracy over physical fidelity, and a lack of analysis of error propagation over time. To address these gaps, we introduce MicroEvoEval, the first comprehensive benchmark for image-based microstructure evolution prediction. We evaluate 14 models, encompassing both domain-specific and general-purpose architectures, across four representative MicroEvo tasks with datasets specifically structured for both short- and long-term assessment. Our multi-faceted evaluation framework goes beyond numerical accuracy and computational cost, incorporating a curated set of structure-preserving metrics to assess physical fidelity. Our extensive evaluations yield several key insights. Notably, we find that modern architectures (e.g., VMamba) not only achieve superior long-term stability and physical fidelity but also operate with an order of magnitude greater computational efficiency. These results highlight the necessity of holistic evaluation and identify such modern architectures as a highly promising direction for developing efficient and reliable surrogate models in data-driven materials science.
AAAI Conference 2026 Conference Paper
With the rapid advancement of 3D visualization, 3D Gaussian Splatting (3DGS) has emerged as a leading technique for real-time, high-fidelity rendering. While prior research has emphasized algorithmic performance and visual fidelity, the perceptual quality of 3DGS-rendered content, especially under varying reconstruction conditions, remains largely underexplored. In practice, factors such as viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions can significantly degrade visual quality, yet their perceptual impact has not been systematically studied. To bridge this gap, we present 3DGS-QA, the first subjective quality assessment dataset for 3DGS. It comprises 225 degraded reconstructions across 15 object types, enabling a controlled investigation of common distortion factors. Based on this dataset, we introduce a no-reference quality prediction model that directly operates on native 3D Gaussian primitives, without requiring rendered images or ground-truth references. Our model extracts spatial and photometric cues from the Gaussian representation to estimate perceived quality in a structure-aware manner. We further benchmark existing quality assessment methods, spanning both traditional and learning-based approaches. Experimental results show that our method consistently achieves superior performance, highlighting its robustness and effectiveness for 3DGS content evaluation. The dataset and code are made publicly available to facilitate future research in 3DGS quality assessment.
TIST Journal 2026 Journal Article
Large-scale multi-agent systems face two core challenges: inefficient policy learning and the explosion of state dimensions. Existing methods often rely on manually designed task sequences to guide agents’ learning in stages, but these designs lack adaptability to agents’ learning abilities, making it difficult to ensure the rationality of task difficulty. Moreover, the representation capability of current network structures is limited, making it challenging to efficiently handle high-dimensional state information and complex interaction relationships. To address these issues, we propose a Progressive Multi-Agent Reinforcement Learning (PMARL) framework. PMARL introduces a task adapter that adaptively selects task difficulty based on agents’ learning abilities, eliminating reliance on manual experience. Additionally, a Dynamic Dimension Adaptive Network (DDAN) is designed, incorporating hypernetwork and self-attention mechanisms to achieve adaptive feature extraction of high-dimensional states and efficient representation of agent interaction relationships. Experimental results demonstrate that PMARL exhibits higher efficiency and better adaptability compared to existing methods when addressing large-scale multi-agent tasks.
AAAI Conference 2026 Conference Paper
Federated learning (FL) protects data privacy by enabling distributed model training without direct access to client data. However, its distributed nature makes it vulnerable to model and data poisoning attacks. While numerous defenses filter malicious clients using statistical metrics, they overlook the role of model redundancy, where not all parameters contribute equally to the model and attack performance. Current attacks manipulate all model parameters uniformly, making them more detectable, while defenses focus on the overall statistics of client updates, leaving gaps for more sophisticated attacks. We propose an attack-agnostic augmentation method to enhance the stealthiness and effectiveness of existing poisoning attacks in FL, exposing flaws in current defenses and highlighting the need for fine-grained FL security. Our three-stage methodology, comprising pill construction, pill poisoning, and pill injection, injects poison into a compact subnet (i.e., pill) of the global model during iterative FL training. Experimental results show that FL poisoning attacks enhanced by our method can bypass 8 state-of-the-art (SOTA) defenses, achieving up to a 7x increase in error rate, and on average more than a 2x increase on both IID and non-IID data, in both cross-silo and cross-device FL systems.
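The pill idea, exploiting the fact that not all parameters matter equally, might be sketched as selecting the largest-magnitude coordinates and perturbing only those, leaving the rest statistically benign. `build_pill` and `poison_pill` are illustrative stand-ins, not the paper's actual pill-search procedure:

```python
def build_pill(params, frac=0.1):
    """Pick a compact subnet ('pill'): indices of the largest-
    magnitude parameters (a plausible redundancy-aware heuristic;
    the paper's actual pill construction may differ)."""
    k = max(1, int(len(params) * frac))
    return sorted(range(len(params)), key=lambda i: -abs(params[i]))[:k]

def poison_pill(params, pill, scale=-1.0):
    # Perturb only the pill coordinates, so the overall update
    # statistics stay close to a benign client's.
    chosen = set(pill)
    return [p * scale if i in chosen else p
            for i, p in enumerate(params)]

params = [0.01, 0.9, -0.02, -0.8, 0.03, 0.05, 0.1, -0.04, 0.02, 0.06]
pill = build_pill(params, frac=0.2)        # the two largest magnitudes
poisoned = poison_pill(params, pill)       # flip only those coordinates
```

Only 20% of the coordinates change, which is exactly what makes the poisoned update hard to catch with coarse statistical filters.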
AAAI Conference 2026 Conference Paper
RGB-to-RAW reconstruction, or the reverse modeling of a camera Image Signal Processing (ISP) pipeline, aims to recover high-fidelity RAW data from RGB images. Despite notable progress, existing learning-based methods typically treat this task as a direct regression objective and still struggle with detail inconsistency and color deviation, owing to the ill-posed nature of inverse ISP and the inherent information loss in quantized RGB images. To address these limitations, we pioneer a generative perspective that reformulates RGB-to-RAW reconstruction as a deterministic latent transport problem and introduce RAW-Flow, a novel framework that leverages flow matching to learn a deterministic vector field in latent space. This effectively bridges the gap between RGB and RAW representations and enables accurate reconstruction of structural details and color information. To further enhance latent transport, we introduce a cross-scale context guidance module that injects hierarchical RGB features into the flow estimation process. Moreover, we design a Dual-domain Latent Autoencoder (DLAE) with a feature alignment constraint to support the proposed latent transport framework; it jointly encodes RGB and RAW inputs while promoting stable training and high-fidelity reconstruction. Extensive experiments demonstrate that RAW-Flow outperforms state-of-the-art approaches both quantitatively and visually.
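The flow-matching core can be sketched with linear conditional paths: sample a point x_t on the straight path between a source latent x0 and a target latent x1, and regress the vector field onto the constant velocity x1 - x0. This is the generic conditional flow matching recipe; RAW-Flow's exact parameterization may differ.

```python
def flow_matching_pair(x0, x1, t):
    """Linear-path conditional flow matching target:
    x_t = (1 - t) * x0 + t * x1, with velocity u = x1 - x0.
    A generic sketch of the training signal a flow model regresses."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]
    return xt, u

# Midpoint of the path from a toy RGB latent to a toy RAW latent:
xt, u = flow_matching_pair([0.0, 0.0], [2.0, 4.0], t=0.5)
```

At inference, integrating the learned vector field from the RGB latent deterministically transports it to the RAW latent.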
AAAI Conference 2026 Conference Paper
RGB-T tracking is increasingly deployed in safety-critical applications such as autonomous driving, surveillance, and rescue robotics, where tracking reliability is essential under adverse conditions. Although the fusion of RGB and thermal infrared (TIR) modalities offers improved robustness in low-light and occluded scenes, recent findings show that RGB-T trackers remain highly susceptible to subtle input perturbations: human-imperceptible modifications that exploit cross-modal inconsistencies to mislead tracking outputs. In real-world scenarios, such perturbations can arise from sensor spoofing, infrared camouflage, or physical-world attacks, posing serious risks to operational safety. To address this, we propose SFPT, a Semantic Feature Purification framework that enhances RGB-T tracking at the representation level. Rather than filtering corrupted inputs at the pixel level, SFPT introduces task-specific semantic anchors into the feature space to reinforce perturbation-invariant cues. These anchors, derived from descriptive language, interact with visual features to purify representations. To further suppress modality-specific interference, we design an Adaptive Perturbation-Guided Cross-Modal Fusion (APG-CMF) module, which leverages language and visual signals to estimate reliability and dynamically reweight cross-modal features, ensuring robust fusion under perturbation conditions. Extensive experiments under diverse perturbation conditions validate the effectiveness of our approach. Notably, SFPT maintains performance comparable to clean settings even when subjected to perturbations of strength 1/255 and 4/255, demonstrating strong resilience to real-world interference.
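Reliability-weighted fusion of the kind APG-CMF performs might look like the following sketch, assuming scalar per-modality reliability estimates (the actual module derives these from language and visual signals and operates on learned features):

```python
import math

def fuse(rgb_feat, tir_feat, rgb_rel, tir_rel):
    """Softmax the per-modality reliability scores and mix the two
    feature vectors accordingly, so a perturbed modality is
    down-weighted. A toy sketch of reliability-guided fusion."""
    er, et = math.exp(rgb_rel), math.exp(tir_rel)
    wr, wt = er / (er + et), et / (er + et)
    return [wr * r + wt * t for r, t in zip(rgb_feat, tir_feat)]

balanced = fuse([1.0, 1.0], [0.0, 0.0], rgb_rel=0.0, tir_rel=0.0)
rgb_trusted = fuse([1.0, 1.0], [0.0, 0.0], rgb_rel=10.0, tir_rel=0.0)
```

Equal reliabilities give an even blend; when the TIR channel is judged unreliable, the fused feature collapses toward the RGB one.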
AAAI Conference 2026 Conference Paper
While large language models (LLMs) have demonstrated strong capabilities in code generation, current benchmarks primarily focus on single-turn scenarios, neglecting the complexity of multi-turn interactions and user diversity. To address this gap, we introduce Talk2Code, the first benchmark for user-stratified multi-turn dialogue code generation evaluation across algorithmic problem-solving and backend programming tasks. A distinctive feature of our benchmark is its user-stratified interaction modeling. For identical coding tasks, we construct dialogue trajectories tailored for novice, intermediate, and expert users, capturing their distinct expectations and communication patterns. To facilitate comprehensive evaluation, we propose a multi-dimensional evaluation framework assessing both code quality and interaction experience through a novel Dual-track Evaluation Method. In the Direct Generation Track, the benchmark provides golden dialogue context (excluding the final code) directly to the LLM for code generation. In contrast, the Interactive Dialogue Track simulates realistic multi-turn interactions, prompting the model to proactively clarify instructions and gather requirements before generating solutions. Code quality is evaluated in both tracks by Test Pass Rate and Success Rate, while interaction experience is assessed exclusively within the Interactive Dialogue Track through subjective and alignment indicators. Our benchmark and multi-dimensional indicator system collectively establish a new paradigm for evaluating adaptive, user-aware AI coding assistants.
TIST Journal 2026 Journal Article
With the advancement of large language models (LLMs), significant progress has been achieved in various natural language processing (NLP) tasks. However, existing LLMs still face two major challenges that hinder their broader adoption: (1) their responses tend to be generic and lack personalization tailored to individual users, and (2) they rely heavily on cloud infrastructure due to intensive computational requirements, leading to dependence on stable network connectivity and increased response latency. Recent research has predominantly focused on either developing cloud-based personalized LLMs or exploring the on-device deployment of general-purpose LLMs. However, few studies have addressed both limitations simultaneously by investigating personalized on-device language models (LMs). To bridge this gap, we propose CDCDA-PLM, a framework for deploying personalized on-device LMs on user devices with support from a powerful cloud-based LLM. Specifically, CDCDA-PLM leverages the server-side LLM’s strong generalization capabilities to augment users’ limited personal data, mitigating the issue of data scarcity. Using both real and synthetic data, a personalized on-device LM is fine-tuned via parameter-efficient fine-tuning (PEFT) modules and deployed on users’ local devices, enabling them to process queries without depending on cloud-based LLMs. This approach eliminates reliance on network stability and ensures high response speeds. Experimental results across six NLP personalization tasks demonstrate the effectiveness of CDCDA-PLM.
AAAI Conference 2026 Conference Paper
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in large language models (LLMs) through reinforcement learning (RL) with rule-based rewards. Despite its success in language tasks, its application in multimodal domains, particularly in graphic user interface (GUI) agent tasks, remains under-explored. To address this gap, we propose UI-R1, the first framework to investigate how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. UI-R1 introduces a novel rule-based action reward scheme, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). To further improve efficiency at inference time, we present UI-R1-Efficient, a two-stage training paradigm that reduces reasoning length while boosting overall performance. In addition, we construct a compact yet high-quality dataset containing 2K challenging tasks across five prevalent mobile device action types. Experiments show that our proposed models (e.g., UI-R1-3B) achieve substantial improvements over the base model (Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 18.3% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 10.9% on ANDROIDCONTROL. Moreover, our efficient versions deliver competitive performance compared to considerably larger state-of-the-art models, underscoring the potential of reinforcement learning to advance GUI control and paving the way for future research in Human-Computer Interaction (HCI).
AAAI Conference 2026 Conference Paper
Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent’s markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
AAAI Conference 2026 Conference Paper
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from segment anything to any segmentation. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretive capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Model editing has become a *de-facto* practice to address hallucinations and outdated knowledge of large language models (LLMs). However, existing methods are predominantly evaluated in isolation, i.e., one edit at a time, failing to consider a critical scenario of compositional model editing, where multiple edits must be integrated and jointly utilized to answer real-world multifaceted questions. For instance, in medical domains, if one edit informs LLMs that COVID-19 causes "fever" and another that it causes "loss of taste", a qualified compositional editor should enable LLMs to answer the question "What are the symptoms of COVID-19?" with both "fever" and "loss of taste" (and potentially more). In this work, we define and systematically benchmark this compositional model editing (CME) task, identifying three key undesirable issues that existing methods struggle with: *knowledge loss*, *incorrect preceding* and *knowledge sinking*. To overcome these issues, we propose A$^3$E, a novel compositional editor that (1) ***a**daptively combines and **a**daptively regularizes* pre-trained foundation knowledge in LLMs in the stage of edit training and (2) ***a**daptively merges* multiple edits to better meet compositional needs in the stage of edit composing. Extensive experiments demonstrate that A$^3$E improves the composability by at least 22.45% without sacrificing the performance of non-compositional model editing.
AAAI Conference 2025 Conference Paper
Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' practice efficiency. However, the scarcity of offline practice response data (e.g., answer correctness) and potential biases in human online practice create a significant gap between offline metrics and the actual online performance of personalized learning services. To address this challenge, we introduce Agent4Edu, a novel personalized learning simulator leveraging recent advancements in human intelligence through large language models (LLMs). Agent4Edu features LLM-powered generative agents equipped with learner profile, memory, and action modules tailored to personalized learning algorithms. The learner profiles are initialized using real-world response data, capturing practice styles and cognitive factors. Inspired by psychology theory, the memory module records practice facts and high-level summaries, integrating reflection mechanisms. The action module supports various behaviors, including exercise understanding, analysis, and response generation. Each agent can interact with personalized learning algorithms, such as computerized adaptive testing, enabling a multifaceted evaluation and enhancement of customized services. Through a comprehensive assessment, we explore the strengths and weaknesses of Agent4Edu, emphasizing the consistency and discrepancies in responses between agents and human learners.
IROS Conference 2025 Conference Paper
The unique optical characteristics of the underwater environment, such as light refraction and loss of salient features, pose a significant challenge to traditional vision sensors, especially in the swarm operation scenario where multiple autonomous underwater vehicles (AUVs) cooperate with the mother ship for positioning. To address these challenges, this study proposes the application of soft constraints on mirror position to improve the robustness of optical target positioning in shallow water. During the mapping phase, we establish and optimize the pose relationships between ArUco markers and their mirrors, thereby expanding the locatable space for AUVs. With the arrangement of ArUco markers unchanged, the number of usable markers doubles. Surface mirror-assisted positioning provides more visual features and additional computed corner points, enhancing the reliability of camera observations and improving positioning accuracy. Experimental results demonstrate that, compared to classical algorithms from the Artificial Vision Applications (A.V.A) laboratory, our method improves position accuracy by 25.8% in single-marker scenarios and by 14.7% in multi-marker scenarios. Therefore, our method provides enhanced mapping and positioning for AUVs in shallow water areas where optical markers can be deployed. This study provides a new mapping paradigm and multi-body localization solution for optical marker-guided underwater swarm operations.
AILAW Journal 2025 Journal Article
Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models (LLMs) and legal domain experts throughout the entire process, from raw document preprocessing, article-level alignment, to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods. Additionally, we observed that several open-source large language models performed exceptionally well in legal terminology extraction, demonstrating their cost-effectiveness and potential for sustainable applications in multilingual legal natural language processing (NLP). Finally, the open, extensible platform we developed supports continuous expert curation and can be easily integrated into various legal translation, research, and AI-powered knowledge management tools.
JBHI Journal 2025 Journal Article
Investigating the inhibitory effects of compounds on cardiac ion channels is essential for assessing cardiac drug safety. Consequently, researchers have developed computational models to evaluate combined cardiotoxicity (CCT) on cardiac ion channels. However, limitations in experimental data often cause issues like uneven data distribution and scarcity. Additionally, existing models primarily emphasize atomic information flow within graph neural networks (GNNs) while overlooking chemical bonds, leading to inadequate recognition of key structures. Therefore, this study integrates optimal transport (OT), structure remapping (SR), and Kolmogorov-Arnold networks (KANs) into a GNN-based CCT prediction model, CardiOT. First, the proposed CardiOT model employs OT pooling to optimize sample-feature joint distribution using expectation maximization, identifying “important” sample-feature pairs. Additionally, SR technology is used to emphasize the role of chemical bond information in message propagation. KAN technology is integrated to greatly enhance model interpretability. In summary, the model mitigates challenges related to uneven data distribution and scarcity. Multiple experiments on public datasets confirm the model's robust performance. We anticipate that this model will provide deeper insights into compound inhibition mechanisms on cardiac ion channels and reduce toxicity risks.
NeurIPS Conference 2025 Conference Paper
Prohibited item detection based on X-ray images is one of the most effective security inspection methods. However, the foreground-background feature coupling caused by the overlapping phenomenon specific to X-ray images makes general detectors designed for natural images perform poorly. To address this issue, we propose a Category Semantic Prior Contrastive Learning (CSPCL) mechanism, which aligns the class prototypes perceived by the classifier with the content queries to correct and supplement the missing semantic information responsible for classification, thereby enhancing the model sensitivity to foreground features. To achieve this alignment, we design a specific contrastive loss, CSP loss, which comprises the Intra-Class Truncated Attraction (ITA) loss and the Inter-Class Adaptive Repulsion (IAR) loss, and outperforms classic contrastive losses. Specifically, the ITA loss leverages class prototypes to attract intra-class content queries and preserves essential intra-class diversity via a gradient truncation function. The IAR loss employs class prototypes to adaptively repel inter-class content queries, with the repulsion strength scaled by prototype-prototype similarity, thereby improving inter-class discriminability, especially among similar categories. CSPCL is general and can be easily integrated into Deformable DETR-based models. Extensive experiments on the PIXray, OPIXray, PIDray, and CLCXray datasets demonstrate that CSPCL significantly enhances the performance of various state-of-the-art models without increasing inference complexity. The code is publicly available at https://github.com/Limingyuan001/CSPCL.
EAAI Journal 2025 Journal Article
EAAI Journal 2025 Journal Article
IJCAI Conference 2025 Conference Paper
Multi-view representation learning methods typically follow a consistent-and-specific pipeline that aims at extracting latent representations for an entity from its multiple observable views to facilitate downstream tasks. However, most of them overlook the complex underlying correlation between different views. To solve this issue, we delve into a well-known property of neural networks (NNs) that NNs tend to learn simple patterns first and then hard ones. In our case, view-consistent representations are simple patterns and view-specific representations are hard. To this end, we propose to disentangle view-consistency and view-specificity and learn them gradually. Specifically, we devise a novel curriculum learning approach that adjusts the whole model to learn view-consistent representations first and then progressively view-specific representations. Besides, we equip each view with a learnable prior that allows each view-specific representation to capture its own distribution. Moreover, we incorporate a mixture-of-experts layer and a disentangling module to further enhance the quality of the learned representations. Extensive experiments on five real-world datasets show that the proposed model outperforms its counterparts markedly. The code is available at https://github.com/XLearning-SCU/2025-IJCAI-CL2P.
NeurIPS Conference 2025 Conference Paper
Machine unlearning has become a focal point in recent research, yet the specific area of feature unlearning has not been thoroughly explored. Feature unlearning involves the elimination of specific features' effects from an already trained model, presenting distinct challenges that are still not comprehensively addressed. This paper presents a novel and straightforward approach to feature unlearning that employs a tactical shuffling of the features designated for removal. By redistributing the values of the features targeted for unlearning throughout the original training dataset and subsequently fine-tuning the model with this shuffled data, our proposed method provides a theoretical guarantee for effective feature unlearning. Under mild assumptions, our method can effectively disrupt the established correlations between unlearned features and the target outcomes, while preserving the relationships between the remaining features and the predicted outcomes. Our empirical studies across various datasets validate that our approach not only successfully removes the effects of specified features but also maintains the informational integrity of the remaining features while achieving a faster convergence rate.
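The shuffling mechanism the abstract describes is simple enough to sketch. The snippet below is a minimal illustration, not the paper's implementation; the synthetic linear target and the `shuffle_feature` helper are assumptions for the demo. Permuting one feature's values across samples destroys its association with the target while leaving the remaining features untouched, which is the property the subsequent fine-tuning step exploits.

```python
import numpy as np

def shuffle_feature(X, col, rng):
    """Return a copy of X with one feature's values permuted across
    samples, severing that feature's link to any target variable."""
    Xs = X.copy()
    Xs[:, col] = rng.permutation(Xs[:, col])
    return Xs

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1]        # target depends on features 0 and 1

Xs = shuffle_feature(X, col=0, rng=rng)  # "unlearn" feature 0

corr_before = np.corrcoef(X[:, 0], y)[0, 1]   # strong correlation
corr_after = np.corrcoef(Xs[:, 0], y)[0, 1]   # collapses toward zero
```

Fine-tuning the trained model on `Xs` then leaves it only the surviving correlations to rely on, which is the intuition behind the unlearning guarantee mentioned above.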
IJCAI Conference 2025 Conference Paper
Fourier Neural Operators (FNO) have emerged as promising solutions for efficiently solving partial differential equations (PDEs) by learning infinite-dimensional function mappings through frequency domain transformations. However, the sparsity of high-frequency signals limits computational efficiency for high-dimensional inputs, and fixed-pattern truncation often causes high-frequency signal loss, reducing performance in scenarios such as high-resolution inputs or long-term predictions. To address these challenges, we propose FreqMoE, an efficient and progressive training framework that exploits the dependency of high-frequency signals on low-frequency components. The model first learns low-frequency weights and then applies a sparse upward-cycling strategy to construct a mixture of experts (MoE) in the frequency domain, effectively extending the learned weights to high-frequency regions. Experiments on both regular and irregular grid PDEs demonstrate that FreqMoE achieves up to 16.6 percent accuracy improvement while using merely 2.1 percent of the parameters (a 47.32x reduction) compared to dense FNO. Furthermore, the approach demonstrates remarkable stability in long-term predictions and generalizes seamlessly to various FNO variants and grid structures, establishing a new "Low-frequency Pretraining, High-frequency Fine-tuning" paradigm for solving PDEs.
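The fixed-pattern truncation that FreqMoE moves away from can be seen in a few lines of NumPy. This is a generic illustration of spectral mode truncation (the `truncate_modes` helper is an assumption, not FreqMoE itself): zeroing all but the lowest modes discards the high-frequency structure exactly, which is the signal loss the abstract refers to.

```python
import numpy as np

def truncate_modes(u, k_keep):
    """Keep only the lowest k_keep Fourier modes of a 1-D signal and
    zero the rest -- the fixed truncation used in spectral layers."""
    U = np.fft.rfft(u)
    U[k_keep:] = 0.0
    return np.fft.irfft(U, n=len(u))

x = np.linspace(0, 2 * np.pi, 256, endpoint=False)
u = np.sin(x) + 0.3 * np.sin(40 * x)   # low- plus high-frequency content
u_low = truncate_modes(u, k_keep=8)

# the sin(40x) component is gone; only the low-frequency part survives
err = np.max(np.abs(u_low - np.sin(x)))
```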
NeurIPS Conference 2025 Conference Paper
Recent efforts to employ large language models (LLMs) as agents have demonstrated promising results in a wide range of multi-step agent tasks. However, existing agents lack an effective experience reuse approach to leverage historically completed tasks. In this paper, we propose a novel experience reuse framework, MetaFlowLLM, which constructs a hierarchical experience tree from historically completed tasks. Each node in this experience tree is represented as a MetaFlow, which contains a static execution workflow and the subtasks that agents complete dynamically. Then, we propose a Hierarchical MetaFlow Merging algorithm to construct the hierarchical experience tree. When accomplishing a new task, MetaFlowLLM can first retrieve the most relevant MetaFlow node from the experience tree and then execute it accordingly. To effectively generate valid MetaFlows from historical data, we further propose a reinforcement learning pipeline to train the MetaFlowGen. Extensive experimental results on AppWorld and WorkBench demonstrate that, integrated with MetaFlowLLM, existing agents (e.g., ReAct, Reflexion) gain substantial performance improvements while reducing execution costs. Notably, MetaFlowLLM achieves an average success rate improvement of 32.3% on AppWorld and 6.2% on WorkBench, respectively.
ICLR Conference 2025 Conference Paper
This paper investigates an open research challenge of reconstructing high-quality, large-scale 3D open scenes from images. We observe that existing methods have various limitations, such as requiring precise camera poses for input and dense viewpoints for supervision. To perform effective and efficient 3D scene reconstruction, we propose a novel graph-guided 3D scene reconstruction framework, GraphGS. Specifically, given a set of images captured by RGB cameras on a scene, we first design a spatial prior-based scene structure estimation method. This is then used to create a camera graph that includes information about the camera topology. Further, we propose to apply the graph-guided multi-view consistency constraint and adaptive sampling strategy to the 3D Gaussian Splatting optimization process. This greatly alleviates the issue of Gaussian points overfitting to specific sparse viewpoints and expedites the 3D reconstruction process. We demonstrate GraphGS achieves high-fidelity 3D reconstruction from images, which presents state-of-the-art performance through quantitative and qualitative evaluation across multiple datasets.
IJCAI Conference 2025 Conference Paper
A smart contract is self-executing code based on blockchain technology with a wide range of application scenarios, but traditional development relies on manual coding and expert auditing, which imposes a high barrier to entry and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation with respect to effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machines (FSMs) and LLMs, which significantly improves the quality of the generated code by abstracting user requirements to generate an FSM, guiding LLMs to generate smart contracts, and iteratively optimizing the code with the feedback of compilation and security checks. The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by at most 48% and reduces the average vulnerability risk score by approximately 68%.
IROS Conference 2025 Conference Paper
Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative render-compare-refine loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, which is an impressive 10× speedup compared to optimization-based approaches. Project page: https://github.com/pythongod-exe/iGaussian
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Iterative imputation is a prevalent method for missing data imputation, where each feature is imputed iteratively by treating it as a target variable estimated from all other features. However, the iterative imputation method suffers from two principal limitations: (1) it imposes a single parametric model form to impute all features, neglecting the potential for optimal models to vary among features, which risks model misspecification; and (2) it assumes every feature contains missing values, overlooking the potential presence of non-missing features, termed oracle features, which are informative for imputation. To address these limitations, we propose kernel point imputation (KPI), a bi-level optimization framework for iterative missing data imputation. At the inner level, KPI adaptively learns the optimal model form for each feature within a reproducing kernel Hilbert space, addressing limitation (1). At the outer level, KPI utilizes oracle features as supervisory signals to iteratively refine the imputations, addressing limitation (2). Experiments demonstrate that KPI outperforms competitive imputation methods. Code is available at https://github.com/FMLYD/kpi.git.
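For intuition, the single-model iterative loop that KPI generalizes can be sketched as a MICE-style procedure. The snippet below is a plain-OLS illustration under simplifying assumptions (noise-free linear data; the `iterative_impute` helper is hypothetical), not the KPI algorithm: every incomplete column is repeatedly regressed on all the others and its missing cells are refilled with the predictions.

```python
import numpy as np

def iterative_impute(X, missing, n_iter=20):
    """MICE-style iterative imputation: initialize missing entries with
    column means, then repeatedly regress each incomplete feature on all
    other features (OLS) and refill its missing cells with predictions."""
    X = X.copy()
    for j in range(X.shape[1]):                 # mean initialization
        obs = ~missing[:, j]
        X[missing[:, j], j] = X[obs, j].mean()
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss = missing[:, j]
            if not miss.any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.c_[others[~miss], np.ones((~miss).sum())]
            coef, *_ = np.linalg.lstsq(A, X[~miss, j], rcond=None)
            X[miss, j] = np.c_[others[miss], np.ones(miss.sum())] @ coef
    return X

# Toy demo: a linearly related third column is recovered almost exactly
rng = np.random.default_rng(2)
Z = rng.normal(size=(500, 2))
full = np.c_[Z, Z @ np.array([1.0, -2.0])]
missing = np.zeros_like(full, dtype=bool)
missing[rng.random(500) < 0.3, 2] = True
corrupted = full.copy()
corrupted[missing] = 0.0
imputed = iterative_impute(corrupted, missing)
```

Because column 2 here is an exact linear function of the fully observed columns, a single regression pass already recovers the missing entries; KPI's point is that real features rarely share one best model form, which motivates its per-feature fit in a reproducing kernel Hilbert space.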
AAAI Conference 2025 Conference Paper
Partially linear models (PLMs) have attracted much attention in the field of statistical machine learning. Specifically, the variable selection ability of PLMs has been studied extensively due to the high requirement of model interpretability. However, few existing works concern the false discovery rate (FDR) controllability of variable selection associated with PLMs. To address this issue, we formulate a new Knockoffs Inference scheme for Linear And Nonlinear Discoverer (called KI-LAND), where FDR is controlled with respect to both linear and nonlinear variables for automatic structure discovery. For the proposed KI-LAND, theoretical guarantees are established for both FDR controllability and power, and experimental evaluations are provided to validate its effectiveness.
NeurIPS Conference 2025 Conference Paper
Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the success probability that a partial reasoning step will lead to a correct final answer, particularly when smaller LLMs are used to complete the reasoning trajectory. To address this, we present a calibration approach—performed via quantile regression—that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach adapts to each instance and reasoning step when using our calibrated PRMs. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.
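The quantile-regression idea behind the calibration step rests on the pinball loss, whose minimizer over a constant prediction is the target quantile. The toy below is a generic property check on synthetic data, not the paper's calibration pipeline; the grid search and the stand-in outcome variable are assumptions for the demo.

```python
import numpy as np

def pinball_loss(y, pred, tau):
    """Quantile (pinball) loss: asymmetric penalty whose minimizer over
    a constant prediction is the empirical tau-quantile of y."""
    diff = y - pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

rng = np.random.default_rng(1)
y = rng.normal(size=5000)          # stand-in for observed success scores
tau = 0.25

# brute-force the best constant predictor on a grid
grid = np.linspace(-3, 3, 601)
losses = [pinball_loss(y, c, tau) for c in grid]
best = grid[int(np.argmin(losses))]
# best lands near np.quantile(y, 0.25)
```

In an actual calibrator, `pred` would be a model of the PRM score rather than a constant, but the loss and the quantile-recovery property are the same.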
NeurIPS Conference 2025 Conference Paper
Efficient red-teaming methods to uncover vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM's continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts. On standard benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate. Our findings demonstrate a potent alternative to agentic LLM prompting, highlighting the efficacy of interpreting and attacking LLM internals through gradient optimization.
TMLR Journal 2025 Journal Article
Simulating interactions between deformable bodies is vital in fields like material science, mechanical design, and robotics. While learning-based methods with Graph Neural Networks (GNNs) are effective at solving complex physical systems, they encounter scalability issues when modeling deformable body interactions. To model interactions between objects, pairwise global edges have to be created dynamically, which is computationally intensive and impractical for large-scale meshes. To overcome these challenges, drawing on insights from geometric representations, we propose an Adaptive Spatial Tokenization (AST) method for efficient representation of physical states. By dividing the simulation space into a grid of cells and mapping unstructured meshes onto this structured grid, our approach naturally groups adjacent mesh nodes. We then apply a cross-attention module to map the sparse cells into a compact, fixed-length embedding, serving as tokens for the entire physical state. Self-attention modules are employed to predict the next state over these tokens in latent space. This framework leverages the efficiency of tokenization and the expressive power of attention mechanisms to achieve accurate and scalable simulation results. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in modeling deformable body interactions. Notably, it remains effective on large-scale simulations with meshes exceeding 100,000 nodes, where existing methods are hindered by computational limitations. Additionally, we contribute a novel large-scale dataset encompassing a wide range of deformable body interactions to support future research in this area.
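The spatial bucketing step behind AST, assigning unstructured mesh nodes to the cells of a regular grid, can be sketched in a few lines. This is illustrative only; the `cell_ids` helper and its flat-id scheme are assumptions, not the paper's code, and it ignores details such as negative coordinates.

```python
import numpy as np

def cell_ids(points, cell_size):
    """Map 2-D node coordinates to integer grid-cell ids so that
    spatially adjacent mesh nodes fall into the same bucket."""
    cells = np.floor(points / cell_size).astype(int)
    # flatten the (ix, iy) pair into a single id for grouping
    return cells[:, 0] * 1_000 + cells[:, 1]

pts = np.array([[0.10, 0.20],   # two nearby nodes ...
                [0.15, 0.25],
                [0.90, 0.90]])  # ... and one farther away
ids = cell_ids(pts, cell_size=0.5)
# the first two nodes share a cell; the third gets its own
```

In the full pipeline, each occupied cell's nodes would then be pooled and cross-attended into a fixed-length token sequence.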
IJCAI Conference 2025 Conference Paper
Most existing multi-view representation learning methods assume view-completeness and noise-free data. However, such assumptions are strong in real-world applications. Despite advances in methods tailored to view-missing or noise problems individually, a one-size-fits-all approach that concurrently addresses both remains unavailable. To this end, we propose a holistic method, called Dual-masked Variational Autoencoders (DualVAE), which aims at learning robust multi-view representation. The DualVAE exhibits an innovative amalgamation of dual-masked prediction, mixture-of-experts learning, representation disentangling, and a joint loss function in wrapping up all components. The key novelty lies in the dual-masked (view-mask and patch-mask) mechanism to mimic missing views and noisy data. Extensive experiments on four multi-view datasets show the effectiveness of the proposed method and its superior performance in comparison to baselines. The code is available at https://github.com/XLearning-SCU/2025-IJCAI-DualVAE.
TMLR Journal 2025 Journal Article
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges: (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension. We then show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. Next, we propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents. The collection of papers reviewed in this survey will be hosted and regularly updated on the GitHub repository: \url{https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents}
AAAI Conference 2025 Conference Paper
The distribution biases and scarcity of samples in multi-source data present significant challenges for few-shot learning (FSL) tasks based on brain-computer interfaces (BCI). Recent efforts have explored the application of diffusion mechanisms in FSL, typically utilizing labeled data to augment the support set. However, this approach has not effectively utilized unlabeled data nor addressed distribution biases. Inspired by the latest advancements in FSL, we propose the manhattan self-attention diffusion residual networks (MSADiff-Resnet) with dynamic bias rectification. This model explicitly adds a manhattan self-attention diffusion layer to a resnet, using attention mechanisms and a manhattan distance-based decay function to control local diffusion intensity, and adjusts the global diffusion strength through a global parameter. This diffusion mechanism bridges labeled and unlabeled data, addressing the limitations associated with sample availability. Additionally, we effectively tackle the distribution biases of multi-source data through inter-class bias rectification and dynamic intra-class bias rectification. Moreover, this study presents for the first time a universal deep learning framework specifically designed for BCI-based FSL tasks. Extensive experiments on multi-source BCI task datasets validate the effectiveness of the proposed method.
AAAI Conference 2025 Conference Paper
Security concerns related to Large Language Models (LLMs) have been extensively explored; however, the safety implications for Multimodal Large Language Models (MLLMs), particularly in medical contexts (MedMLLMs), remain inadequately addressed. This paper investigates the security vulnerabilities of MedMLLMs, focusing on their deployment in clinical environments where the accuracy and relevance of question-and-answer interactions are crucial for addressing complex medical challenges. We introduce and redefine two attack types: mismatched malicious attack (2M-attack) and optimized mismatched malicious attack (O2M-attack), by integrating existing clinical data with atypical natural phenomena. Using the comprehensive 3MAD dataset that we developed, which spans a diverse range of medical imaging modalities and adverse medical scenarios, we performed an in-depth analysis and proposed the MCM optimization method. This approach significantly improves the attack success rate against MedMLLMs. Our evaluations, which include white-box attacks on LLaVA-Med and transfer (black-box) attacks on four other SOTA models, reveal that even MedMLLMs designed with advanced security mechanisms remain vulnerable to breaches. This study highlights the critical need for robust security measures to enhance the safety and reliability of open-source MedMLLMs, especially in light of the potential impact of jailbreak attacks and other malicious exploits in clinical applications. Warning: Medical jailbreaking may generate content that includes unverified diagnoses and treatment recommendations. Always consult professional medical advice.
IROS Conference 2025 Conference Paper
Soft robots and smart materials have seen rapid advancements in recent years, with significant potential applications in medical devices. Liquid crystal elastomers (LCEs) exhibit the unique attributes of large deformations and diverse actuation modes, facilitating controllable bending in soft medical catheters and thereby enhancing their maneuverability during medical procedures. However, LCEs exhibit strong hysteresis, which makes their modeling and control challenging. In this paper, we develop a dynamic model of a light-stimulated LCE to describe its nonlinear time-dependent behavior. We first derive the relationship between the input laser power and the resulting temperature change of the LCE actuator, and then analyze the viscoelastic behavior using a spring-dashpot framework. For both the linear contraction actuator and the bending actuator, the dynamic equations describe their behavior with acceptable errors. In the future, we will further test an LCE-based bending actuator of optimal design, and then perform real-time control of soft catheters with the assistance of LCE actuators.
ICRA Conference 2025 Conference Paper
Object navigation in multi-floor environments presents a formidable challenge in robotics, requiring sophisticated spatial reasoning and adaptive exploration strategies. Traditional approaches have primarily focused on single-floor scenarios, overlooking the complexities introduced by multi-floor structures. To address these challenges, we first propose a Multi-floor Navigation Policy (MFNP) and implement it in Zero-Shot object navigation tasks. Our framework comprises three key components: (i) Multi-floor Navigation Policy, which enables an agent to explore across multiple floors; (ii) Multi-modal Large Language Models (MLLMs) for reasoning in the navigation process; and (iii) Inter-Floor Navigation, ensuring efficient floor transitions. We evaluate MFNP on the Habitat-Matterport 3D (HM3D) and Matterport 3D (MP3D) datasets, both of which include multi-floor scenes. Our experimental results demonstrate that MFNP significantly outperforms all existing methods in Zero-Shot object navigation, achieving higher success rates and improved exploration efficiency. Ablation studies further highlight the effectiveness of each component in addressing the unique challenges of multi-floor navigation. Meanwhile, we conducted real-world experiments to evaluate the feasibility of our policy. Upon deployment of MFNP, the Unitree quadruped robot demonstrated successful multi-floor navigation and found the target object in a completely unseen environment. By introducing MFNP, we offer a new paradigm for tackling complex, multi-floor environments in object navigation tasks, opening avenues for future research in vision-based navigation in realistic, multi-floor settings.
NeurIPS Conference 2025 Conference Paper
This paper presents $\mathbf{OLinear}$, a $\mathbf{linear}$-based multivariate time series forecasting model that operates in an $\mathbf{o}$rthogonally transformed domain. Recent forecasting models typically adopt the temporal forecast (TF) paradigm, which directly encode and decode time series in the time domain. However, the entangled step-wise dependencies in series data can hinder the performance of TF. To address this, some forecasters conduct encoding and decoding in the transformed domain using fixed, dataset-independent bases (e. g. , sine and cosine signals in the Fourier transform). In contrast, we propose $\mathbf{OrthoTrans}$, a data-adaptive transformation based on an orthogonal matrix that diagonalizes the series' temporal Pearson correlation matrix. This approach enables more effective encoding and decoding in the decorrelated feature domain and can serve as a plug-in module to enhance existing forecasters. To enhance the representation learning for multivariate time series, we introduce a customized linear layer, $\mathbf{NormLin}$, which employs a normalized weight matrix to capture multivariate dependencies. Empirically, the NormLin module shows a surprising performance advantage over multi-head self-attention, while requiring nearly half the FLOPs. Extensive experiments on 24 benchmarks and 140 forecasting tasks demonstrate that OLinear consistently achieves state-of-the-art performance with high efficiency. Notably, as a plug-in replacement for self-attention, the NormLin module consistently enhances Transformer-based forecasters. The code and datasets are available at https: //github. com/jackyue1994/OLinear.
NeurIPS Conference 2025 Conference Paper
With the growing size of data and models in Large Recommendation Models, the time required for debugging has become increasingly prohibitive, underscoring the urgent need for effective guidance in parameter configuration. The Scaling Law (SL) offers analogous guidance in the sequential language domain, having achieved significant success by predicting model loss when scaling model size. However, existing guidance from SL for Sequential Recommendation (SR) remains qualitative, because quantitative analysis of SL on SR faces two challenges: measuring the quality of redundant sequences and the loss-performance discrepancy. In response, we introduce the Performance Law (P-Law) for SR models, which predicts model performance across various settings, intending to provide a quantitative framework for guiding the parameter optimization of future models. First, the Performance Law utilizes Real Entropy to measure data quality, aiming to remove the low-quality influence of low-entropy redundant sequences. Second, the Performance Law introduces a fitted decay term, which facilitates prediction of the major loss-performance discrepancy phenomenon of overfitting, ultimately achieving quantitative performance prediction. Extensive experiments on various datasets demonstrate the effectiveness of the Performance Law, displaying exceptional quantitative prediction ability against the original and modified qualitative SL. Additional application experiments on optimal parameter prediction and model expansion potential prediction further demonstrate the broad applicability of the Performance Law.
IJCAI Conference 2025 Conference Paper
Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, prompt compression reduces prompt length while maintaining LLM response quality. To support rapid implementation and standardization, we present the Prompt Compression Toolkit (PCToolkit), a unified plug-and-play framework for LLM prompt compression. PCToolkit integrates state-of-the-art compression algorithms, benchmark datasets, and evaluation metrics, enabling systematic performance analysis. Its modular architecture simplifies customization, offering portable interfaces for seamless incorporation of new datasets, metrics, and compression methods. Our code is available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression. Our demo is at https://huggingface.co/spaces/CjangCjengh/Prompt-Compression-Toolbox.
JMLR Journal 2025 Journal Article
Amid the ongoing advancements in Federated Learning (FL), a machine learning paradigm that allows collaborative learning with data privacy protection, personalized FL (pFL) has gained significant prominence as a research direction within the FL domain. Whereas traditional FL (tFL) focuses on jointly learning a global model, pFL aims to balance each client's global and personalized goals in FL settings. To foster the pFL research community, we started and built PFLlib, a comprehensive pFL library with an integrated benchmark platform. In PFLlib, we implemented 37 state-of-the-art FL algorithms (8 tFL algorithms and 29 pFL algorithms) and provided various evaluation environments with three statistically heterogeneous scenarios and 24 datasets. At present, PFLlib has gained more than 1600 stars and 300 forks on GitHub.
NeurIPS Conference 2025 Conference Paper
Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
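The Product-of-Gaussians construction mentioned above has a closed form worth spelling out: multiplying two Gaussian densities (and renormalizing) gives another Gaussian whose precision is the sum of the precisions and whose mean is precision-weighted. The sketch below is a generic illustration of that identity; the variable names and variances are assumptions, not PoGDiff's actual training code.

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """Return (mean, variance) of the renormalized product N(mu1,var1)*N(mu2,var2)."""
    prec = 1.0 / var1 + 1.0 / var2            # precisions add under the product
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)      # precision-weighted mean
    return mu, var

# Combining a ground-truth target with a neighbor-conditioned prediction:
mu, var = product_of_gaussians(mu1=0.0, var1=1.0, mu2=2.0, var2=1.0)
print(mu, var)  # 1.0 0.5 — mean halfway between, variance tighter than either
```

Intuitively, the combined target pulls minority samples toward their neighbors' predicted distribution instead of an isolated ground-truth point, which is how the paper counters imbalance.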
IROS Conference 2025 Conference Paper
NeurIPS Conference 2025 Conference Paper
In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content. Distinct from previous datasets, RAG-IGBench draws on the latest publicly available content from social platforms and introduces innovative evaluation metrics that measure the quality of text and images, as well as their consistency. Through extensive experiments with state-of-the-art MLLMs (both open-source and proprietary) on RAG-IGBench, we provide an in-depth analysis examining the capabilities and limitations of these models. Additionally, we validate our evaluation metrics by demonstrating their high correlation with human assessments. Models fine-tuned on RAG-IGBench's training set exhibit improved performance across multiple benchmarks, confirming both the quality and practical utility of our dataset. Our benchmark is available at https://github.com/zry13/RAG-IGBench.
TMLR Journal 2025 Journal Article
Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e., Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.
IJCAI Conference 2025 Conference Paper
Spatio-temporal forecasting is pivotal in numerous real-world applications, including transportation planning, energy management, and climate monitoring. In this work, we aim to harness the reasoning and generalization abilities of Pre-trained Language Models (PLMs) for more effective spatio-temporal forecasting, particularly in data-scarce scenarios. However, recent studies uncover that PLMs, which are primarily trained on textual data, often falter when tasked with modeling the intricate correlations in numerical time series, thereby limiting their effectiveness in comprehending spatio-temporal data. To bridge the gap, we propose RePST, a semantic-oriented PLM reprogramming framework tailored for spatio-temporal forecasting. Specifically, we first propose a semantic-oriented decomposer that adaptively disentangles spatially correlated time series into interpretable sub-components, which facilitates PLM to understand sophisticated spatio-temporal dynamics via a divide-and-conquer strategy. Moreover, we propose a selective discrete reprogramming scheme, which introduces an expanded spatio-temporal vocabulary space to project spatio-temporal series into discrete representations. This scheme minimizes the information loss during reprogramming and enriches the representations derived by PLMs. Extensive experiments on real-world datasets show that the proposed RePST outperforms twelve state-of-the-art baseline methods, particularly in data-scarce scenarios, highlighting the effectiveness and superior generalization capabilities of PLMs for spatio-temporal forecasting. Codes and Appendix can be found at https://github.com/usail-hkust/REPST.
EAAI Journal 2025 Journal Article
AIIM Journal 2025 Journal Article
JBHI Journal 2025 Journal Article
The R-peak in electrocardiogram (ECG) signals is a critical physiological marker for the diagnosis of cardiovascular diseases. Although various R-peak detection methods have been proposed, their performance is often hindered by noise, especially in dynamic ECG monitoring. Furthermore, the potential of harnessing complementary information from 12-lead ECG signals has not been fully exploited. To address these challenges, this study conceptualized 12-lead ECG data as two-dimensional images and employed YOLOv5 as the model's backbone for R-peak detection, effectively transforming a signal segmentation task into an object detection task in images. Specifically, considering the characteristics of consistent R-peak positions across different leads, we proposed a strip attention mechanism to treat horizontal or vertical strips as tokens for computing inter- and intra-strip attention, enhancing the model's ability to capture R-peak positional information and likelihood. Additionally, a one-dimensional Manhattan distance-based NMS algorithm was used to minimize redundant detection frames, thereby enhancing model performance. The proposed model was rigorously evaluated on two publicly available datasets, INCART and LUDB, under varying noise conditions. On the INCART dataset, the model achieved F1 scores of 99.97%, 99.86%, 99.63%, and 98.00% at noise levels of Original, SNR = 10 dB, SNR = 5 dB, and SNR = 0 dB, respectively. Similarly, on the LUDB dataset, the F1 scores were 99.89%, 100%, 100%, and 99.86% for the corresponding noise levels. Extensive testing across multiple datasets and noise scenarios demonstrated that the proposed model outperformed existing state-of-the-art methods in terms of accuracy, noise robustness, and generalization capability.
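The one-dimensional Manhattan-distance NMS described above reduces to greedy suppression along the time axis: detections collapse to their time coordinate, and any candidate within an absolute-distance threshold of a kept, higher-scoring peak is discarded. The threshold, names, and toy inputs below are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

def nms_1d(positions, scores, min_dist=40):
    """Greedy 1-D NMS: keep peaks in descending score order, drop close neighbors."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    for idx in order:
        # Manhattan distance in 1-D is just the absolute time difference.
        if all(abs(positions[idx] - positions[k]) >= min_dist for k in keep):
            keep.append(idx)
    return sorted(keep)

pos = np.array([100, 110, 300])               # two near-duplicate detections + one peak
scr = np.array([0.9, 0.8, 0.95])
print(nms_1d(pos, scr))                        # the weaker detection at 110 is suppressed
```
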
AAMAS Conference 2025 Conference Paper
Current reinforcement learning (RL) models, often functioning as complex ‘black boxes,’ obscure their decision-making processes. This lack of transparency limits their applicability in critical real-world applications where clear reasoning behind algorithmic choices is crucial. To tackle this issue, we suggest moving from neural network or tabular approaches to a rule ensemble model, which improves decision-making clarity and adapts dynamically to environmental interactions. Specifically, our method constructs additive rule ensembles to approximate the Q-value in reinforcement learning using orthogonal gradient boosting (OGB) combined with a post-processing rule replacement technique. This method enables the model to provide inherent explanations through the use of rules. Our study sets a theoretical foundation for rule ensembles within the reinforcement learning framework, emphasizing their capacity to boost interpretability and facilitate the analysis of rule impacts. Experimental results from seven classic environments demonstrate that our proposed rule ensembles match or exceed the performance of representative RL models such as DQN, A2C, and PPO, while also providing self-interpretability and transparency.
AAAI Conference 2025 Conference Paper
Since its introduction, the Transformer has shifted the development trajectory of time series forecasting away from traditional models (e.g., RNN, MLP), which is attributed to its ability to capture global dependencies within temporal tokens. Follow-up studies have largely altered the tokenization and self-attention modules to better adapt Transformers to special challenges like non-stationarity, channel-wise dependency, and variable correlation in time series. However, after investigating several representative methods, we found that the expressive capability of the sequence representation is a key factor influencing Transformer performance in time series forecasting: there is an almost linear relationship between sequence representation entropy and mean squared error, with more diverse representations performing better. In this paper, we propose a novel attention mechanism with Sequence Complementors and prove its feasibility from an information-theoretic perspective, where these learnable sequences provide complementary information beyond the current input to feed attention. We further enhance the Sequence Complementors via a theoretically grounded diversification loss. Empirical evaluation on both long-term and short-term forecasting confirms its superiority over recent state-of-the-art methods.
AAAI Conference 2025 Conference Paper
Discrete latent representation techniques, such as Vector Quantization (VQ) and Sparse Coding (SC), have demonstrated superior image reconstruction and generation quality compared to continuous representation methods in Variational Autoencoders (VAEs). However, existing approaches often treat the latent representations of an image independently in their discrete representation space, neglecting both the inherent structural information within each representation and the correlations among them. This oversight leads to coarse representations and suboptimal generated results. In this paper, we address these limitations by introducing correlations among and within the latent representations of individual images in the latent discrete space of VAEs using sparse coding. We impose two-dimensional structural information through adaptive thresholding, enhancing local structure in image representations while suppressing noise via parsimonious representation with a learned dictionary. Empirical studies on three real benchmark datasets, including a clinical Ultrasound dataset, BSDS500, and mini-Imagenet, demonstrate that our proposed model preserves fine-grained details in image reconstruction and significantly outperforms baseline models of SC-VAE and VQ-VAE across objective and subjective image quality metrics. Particularly noteworthy are the substantial performance improvements observed on the ultrasound dataset, where structure information is crucial. Specifically, we observe significant performance improvements of 7.68% and 17.03% in SSIM, 3.25 dB and 6.58 dB in PSNR, 0.15 and 0.24 in LPIPS, and 45.38 and 84.05 in FID over SC-VAE and VQ-VAE, respectively, indicating the superiority of our method in terms of image reconstruction quality and fidelity.
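The thresholding idea in the abstract can be illustrated with the standard soft-threshold operator from sparse coding, which zeroes small coefficients and shrinks the rest, suppressing noise while keeping dominant structure. The operator form and the fixed `lam` value are assumptions for illustration; the paper uses an adaptive threshold.

```python
import numpy as np

def soft_threshold(z, lam=0.5):
    """Shrink coefficients toward zero; entries with |z| <= lam become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([2.0, 0.3, -1.2, -0.1])
sparse = soft_threshold(z)  # small coefficients (0.3, -0.1) are zeroed out
print(sparse)
```
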
NeurIPS Conference 2025 Conference Paper
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model Gemini-2.5-Pro achieved the highest accuracy of 63.56% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
NeurIPS Conference 2025 Conference Paper
Training time-series forecasting models poses unique challenges in loss function design. Most existing approaches adopt temporal mean squared error, but this study reveals two critical limitations: (1) it ignores the presence of label autocorrelation, which biases it from the true label sequence likelihood; (2) it involves an excessive number of tasks, which complicates optimization, especially for long-term forecasting. To address these issues, we introduce Time-o1, a transform-enhanced loss function for time-series forecasting. The central idea is to transform the label sequence into decorrelated components with discriminated significance. Models are then trained to align the most significant components, thereby effectively mitigating label autocorrelation and reducing the task amount. Experiments demonstrate that Time-o1 achieves state-of-the-art performance and is compatible with various forecast models. Code is available at https://github.com/Master-PLC/Time-o1.
NeurIPS Conference 2025 Conference Paper
Recent time series models have demonstrated remarkable forecasting performance. However, these methods often place greater focus on more effectively modelling the historical series, largely neglecting the forecasting phase, which generates long-term forecasts by separately predicting multiple time points. Given that real-world time series typically consist of various long short-term dynamics, independent predictions over individual time points may fail to express complex underlying patterns and can lead to a lack of global views. To address these issues, this work explores new perspectives from the forecasting phase and proposes a novel Implicit Forecaster (IF) as an additional decoding module. Inspired by decomposition forecasting, IF adopts a more nuanced approach by implicitly predicting constituent waves represented by their frequency, amplitude, and phase, thereby accurately forming the time series. Extensive experimental results from multiple real-world datasets show that IF can consistently boost mainstream time series models, achieving state-of-the-art forecasting performance. Code is available at this repository: https://github.com/rakuyorain/Implicit-Forecaster.
JBHI Journal 2025 Journal Article
Balance and gait impairments play a key role in falls among the elderly. Traditional clinical scales such as the Berg Balance Scale (BBS) used to assess fall risk are often subjective and time consuming, and they do not assess gait performance. Shorter assessments such as the Timed Up and Go (TUG) test are available, but most clinicians only consider the completion time. This study aimed to develop a fast, low-cost, and automated framework for balance function assessment and comprehensive gait analysis by enhancing the traditional TUG test with a markerless motion capture (MoCap) system and machine learning models. In total, we included TUG datasets of 70 participants with varying degrees of fall risk based on the BBS scores. We segmented TUG trials into five phases automatically using data from the MoCap system and extracted features from the phases. These features were then analyzed to identify those that significantly discriminate between high and low fall risk groups. Using the identified features, various machine learning models were tested to estimate the BBS scores. The markers obtained from the markerless MoCap system were used for detailed gait analysis, and lower limb kinematics were compared between the markerless and marker-based methods. Our findings indicate that individuals at high risk of falling had longer completion times, lower performance velocities, and smaller ranges of motion in lower-limb joints. Among the tested machine learning models, random forest demonstrated the best performance in predicting BBS scores (RMSE: 0.98, $R^{2}$: 0.94). Additionally, our markerless MoCap system showed comparable accuracy to state-of-the-art systems, eliminating the need to attach markers or sensors. The findings could help develop a quick and objective tool for balance and gait assessment in older adults, providing quantitative data to improve screening and intervention planning.
NeurIPS Conference 2025 Conference Paper
Estimating the uncertainty of responses from Large Language Models (LLMs) remains a critical challenge. While recent Bayesian methods have demonstrated effectiveness in quantifying uncertainty through low-rank weight updates, they typically require complex fine-tuning or post-training procedures. In this paper, we propose Training-Free Bayesianization (TFB), a simple yet theoretically grounded framework that efficiently transforms trained low-rank adapters into Bayesian ones without additional training. TFB systematically searches for the maximally acceptable level of variance in the weight posterior, constrained within a family of low-rank isotropic Gaussian distributions. Our theoretical analysis shows that under mild conditions, this search process is equivalent to KL-regularized variational optimization, a generalized form of variational inference. Through comprehensive experiments, we show that TFB achieves superior uncertainty estimation and generalization compared to existing methods while eliminating the need for complex Bayesianization training procedures.
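The variance search described above can be sketched as a bisection over the posterior standard deviation: grow sigma until a quality criterion on the perturbed adapter is no longer acceptable. The `quality_drop` callback, tolerance, and search bounds here are stand-in assumptions; the actual acceptance criterion in TFB is more involved.

```python
def max_acceptable_sigma(quality_drop, tol=0.05, lo=0.0, hi=1.0, iters=30):
    """Bisection for the largest sigma whose quality degradation stays within tol."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if quality_drop(mid) <= tol:
            lo = mid          # still acceptable: try a larger variance
        else:
            hi = mid          # too much degradation: shrink the interval
    return lo

# Toy surrogate: degradation grows quadratically with sigma.
sigma = max_acceptable_sigma(lambda s: s ** 2)
print(round(sigma, 3))  # ~0.224, since 0.224**2 is about the 0.05 tolerance
```

The point of the sketch is that the procedure needs only forward evaluations of the criterion, mirroring the paper's training-free character.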
AAAI Conference 2025 Conference Paper
As a popular distributed learning paradigm, federated learning (FL) over mobile devices fosters numerous applications, while their practical deployment is hindered by participating devices' computing and communication heterogeneity. Some pioneering research efforts proposed to extract subnetworks from the global model, and assign as large a subnetwork as possible to the device for local training based on its full computing capacity. Although such fixed size subnetwork assignment enables FL training over heterogeneous mobile devices, it is unaware of (i) the dynamic changes of devices' communication and computing conditions and (ii) FL training progress and its dynamic requirements of local training contributions, both of which may cause very long FL training delay. Motivated by those dynamics, in this paper, we develop a wireless and heterogeneity aware latency efficient FL (WHALE-FL) approach to accelerate FL training through adaptive subnetwork scheduling. Instead of sticking to the fixed size subnetwork, WHALE-FL introduces a novel subnetwork selection utility function to capture device and FL training dynamics, and guides the mobile device to adaptively select the subnetwork size for local training based on (a) its computing and communication capacity, (b) its dynamic computing and/or communication conditions, and (c) FL training status and its corresponding requirements for local training contributions. Our evaluation shows that, compared with peer designs, WHALE-FL effectively accelerates FL training without sacrificing learning accuracy.
AAAI Conference 2024 Conference Paper
Spinal curvature estimation is important to the diagnosis and treatment of scoliosis. Existing methods face several issues, such as the need for expensive annotations of vertebral landmarks and sensitivity to image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we propose B-Spine, a novel deep learning pipeline to learn a B-spline curve representation of the spine and estimate the Cobb angles for spinal curvature estimation from low-quality X-ray images. Given a low-quality input, a novel SegRefine network which employs unpaired image-to-image translation is proposed to generate a high-quality spine mask from the initial segmentation result. Next, a novel mask-based B-spline prediction model is proposed to predict the B-spline curve for the spine centerline. Finally, the Cobb angles are estimated by a hybrid approach which combines curve slope analysis and a curve-based regression model. We conduct quantitative and qualitative comparisons with representative and SOTA learning-based methods on the public AASCE2019 dataset and our newly proposed JLU-CJUH dataset, which contains more challenging low-quality images. The superior performance on both datasets shows our method can achieve both robustness and interpretability for spinal curvature estimation.
NeurIPS Conference 2024 Conference Paper
Monocular Depth Estimation (MDE) enables the prediction of scene depths from a single RGB image and has been widely integrated into production-grade autonomous driving systems, e.g., Tesla Autopilot. Current adversarial attacks on MDE models focus on attaching an optimized adversarial patch to a designated obstacle. Although effective, this approach presents two inherent limitations: its reliance on specific obstacles and its limited malicious impact. In contrast, we propose a pioneering attack on MDE models that decouples obstacles from patches physically and deploys optimized patches on roads, thereby extending the attack scope to arbitrary traffic participants. This approach is inspired by our discovery that various MDE models with different architectures, trained for autonomous driving, heavily rely on road regions when predicting depths for different obstacles. Based on this discovery, we design the Adversarial Road Marking (AdvRM) attack, which camouflages patches as ordinary road markings and deploys them on roads, thereby posing a continuous threat within the environment. Experimental results from both dataset simulations and real-world scenarios demonstrate that AdvRM is effective, stealthy, and robust against various MDE models, achieving a Mean Relative Shift Ratio (MRSR) of about 1.507 over 8 MDE models. The code is available at https://github.com/a-c-a-c/AdvRM.git
NeurIPS Conference 2024 Conference Paper
Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
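The core mechanic behind jointly adapting mean and covariance during fine-tuning is the reparameterization trick applied to the low-rank factors. The snippet below is a Bayes-by-backprop-style sketch of such a sampled low-rank update, with assumed parameter names (`mu_A`, `rho_A`, etc.); BLoB's exact parameterization may differ.

```python
import numpy as np

def sample_lowrank_update(mu_A, mu_B, rho_A, rho_B, rng):
    """Reparameterized draw of a low-rank update Delta = A @ B, where each
    factor has a learnable mean (mu) and std = softplus(rho). Sampling this
    way keeps the draw differentiable w.r.t. both mu and rho, so mean and
    covariance can be trained jointly by backpropagation."""
    softplus = lambda x: np.log1p(np.exp(x))
    A = mu_A + softplus(rho_A) * rng.standard_normal(mu_A.shape)
    B = mu_B + softplus(rho_B) * rng.standard_normal(mu_B.shape)
    return A @ B
```

As the stds collapse (very negative `rho`), the sampled update reduces to the deterministic product of the means.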
NeurIPS Conference 2024 Conference Paper
Sequential recommendation (SR) aims to predict items that users may be interested in based on their historical behavior sequences. We revisit SR from a novel information-theoretic perspective and find that conventional sequential modeling methods fail to adequately capture the randomness and unpredictability of user behavior. Inspired by fuzzy information processing theory, this paper introduces the DDSR model, which uses fuzzy sets of interaction sequences to overcome these limitations and better capture the evolution of users' real interests. DDSR is formally based on diffusion transition processes in discrete state spaces, unlike common diffusion models such as DDPM that operate in continuous domains; this makes it better suited for discrete data, using structured transitions instead of arbitrary noise injection to avoid information loss. Additionally, to address the inefficiency of matrix transformations due to the vast discrete space, we use semantic labels derived from quantization or RQ-VAE to replace item IDs, enhancing efficiency and alleviating cold-start issues. Testing on three public benchmark datasets shows that DDSR outperforms existing state-of-the-art methods in various settings, demonstrating its potential and effectiveness in handling SR tasks.
AAAI Conference 2024 Conference Paper
While decentralized training is attractive in multi-agent reinforcement learning (MARL) for its excellent scalability and robustness, its inherent coordination challenges in collaborative tasks result in numerous interactions for agents to learn good policies. To alleviate this problem, action advising methods make experienced agents share their knowledge about what to do, while less experienced agents strictly follow the received advice. However, this method of sharing and utilizing knowledge may hinder the team's exploration of better states, as agents can be unduly influenced by suboptimal or even adverse advice, especially in the early stages of learning. Inspired by the fact that humans can learn not only from the success but also from the failure of others, this paper proposes a novel knowledge sharing framework called Cautiously-Optimistic kNowledge Sharing (CONS). CONS enables each agent to share both positive and negative knowledge and cautiously assimilate knowledge from others, thereby enhancing the efficiency of early-stage exploration and the agents' robustness to adverse advice. Moreover, considering the continuous improvement of policies, agents value negative knowledge more in the early stages of learning and shift their focus to positive knowledge in the later stages. Our framework can be easily integrated into existing Q-learning based methods without introducing additional training costs. We evaluate CONS in several challenging multi-agent tasks and find it excels in environments where optimal behavioral patterns are difficult to discover, surpassing the baselines in terms of convergence rate and final performance.
AAAI Conference 2024 Conference Paper
Deep reinforcement learning (DRL) has achieved immense success in many applications, including gaming AI, robotics, and system scheduling. Distributed algorithms and architectures (e.g., the actor-learner architecture) have been widely proposed to accelerate DRL training with large-scale server-based clusters. However, training on-policy algorithms with the actor-learner architecture unavoidably wastes resources due to synchronization between learners and actors, resulting in significant extra billing. As a promising alternative, serverless computing naturally fits on-policy synchronization and alleviates resource wasting in distributed DRL training with pay-as-you-go pricing. Yet no prior work has leveraged serverless computing to facilitate DRL training. This paper proposes MinionsRL, the first serverless distributed DRL training framework, which aims to improve both the speed and cost-efficiency of DRL training through dynamic actor scaling. We prototype MinionsRL on top of Microsoft Azure Container Instances and evaluate it with popular DRL tasks from OpenAI Gym. Extensive experiments show that MinionsRL reduces total training time by up to 52% and training cost by 86% compared to the latest solutions.
TMLR Journal 2024 Journal Article
We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines.
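The scaling-and-quantization tokenizer described above is simple to sketch: mean-scale the series, clip, and map values into uniform bins that serve as token ids. This is a minimal illustration of the Chronos-style idea; the real model's bin layout, special tokens, and vocabulary size differ.

```python
import numpy as np

def tokenize_series(x, vocab_size=128, clip=4.0):
    """Mean-scale a series, then uniformly quantize into a fixed vocabulary."""
    x = np.asarray(x, dtype=float)
    scale = float(np.mean(np.abs(x))) or 1.0   # avoid dividing by zero
    scaled = np.clip(x / scale, -clip, clip)
    edges = np.linspace(-clip, clip, vocab_size + 1)
    # token ids in [0, vocab_size), one per bin
    return np.digitize(scaled, edges[1:-1]), scale

def detokenize(tokens, scale, vocab_size=128, clip=4.0):
    """Map token ids back to bin centers and undo the mean scaling."""
    edges = np.linspace(-clip, clip, vocab_size + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[np.asarray(tokens)] * scale
```

Round-tripping a series through the tokenizer loses at most half a bin width per value (times the scale), which is what lets a cross-entropy language model forecast real-valued series.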
NeurIPS Conference 2024 Conference Paper
Learners sharing similar implicit cognitive states often display comparable observable problem-solving performances. Leveraging collaborative connections among such similar learners proves valuable in comprehending human learning. Motivated by the success of collaborative modeling in various domains, such as recommender systems, we aim to investigate how collaborative signals among learners contribute to the diagnosis of human cognitive states (i.e., knowledge proficiency) in the context of intelligent education. The primary challenges lie in identifying implicit collaborative connections and disentangling the entangled cognitive factors of learners for improved explainability and controllability in learner Cognitive Diagnosis (CD). However, there has been no work on CD capable of simultaneously modeling collaborative and disentangled cognitive states. To address this gap, we present Coral, a Collaborative cognitive diagnosis model with disentangled representation learning. Specifically, Coral first introduces a disentangled state encoder to achieve the initial disentanglement of learners' states. Subsequently, a meticulously designed collaborative representation learning procedure captures collaborative signals. It dynamically constructs a collaborative graph of learners by iteratively searching for optimal neighbors in a context-aware manner. Using the constructed graph, collaborative information is extracted through node representation learning. Finally, a decoding process aligns the initial cognitive states and collaborative states, achieving co-disentanglement with practice performance reconstructions. Extensive experiments demonstrate the superior performance of Coral, showcasing significant improvements over state-of-the-art methods across several real-world datasets. Our code is available at https://github.com/bigdata-ustc/Coral.
AAAI Conference 2024 Conference Paper
Active learning (AL) aims to improve model performance within a fixed labeling budget by choosing the most informative data points to label. Existing AL focuses on the single-domain setting, where all data come from the same domain (e.g., the same dataset). However, many real-world tasks often involve multiple domains. For example, in visual recognition, it is often desirable to train an image classifier that works across different environments (e.g., different backgrounds), where images from each environment constitute one domain. Such a multi-domain AL setting is challenging for prior methods because they (1) ignore the similarity among different domains when assigning labeling budget and (2) fail to handle distribution shift of data across different domains. In this paper, we propose the first general method, dubbed composite active learning (CAL), for multi-domain AL. Our approach explicitly considers the domain-level and instance-level information in the problem; CAL first assigns domain-level budgets according to domain-level importance, which is estimated by optimizing an upper error bound that we develop; with the domain-level budgets, CAL then leverages a certain instance-level query strategy to select samples to label from each domain. Our theoretical analysis shows that our method achieves a better error bound compared to current AL methods. Our empirical results demonstrate that our approach significantly outperforms the state-of-the-art AL methods on both synthetic and real-world multi-domain datasets. Code is available at https://github.com/Wang-ML-Lab/multi-domain-active-learning.
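The domain-level budgeting step in a multi-domain AL pipeline reduces to splitting a fixed labeling budget across domains in proportion to importance scores. The sketch below uses largest-remainder rounding so allocations sum exactly to the total; the importance scores themselves stand in for CAL's bound-derived weights.

```python
def allocate_budget(importance, total_budget):
    """Split `total_budget` labels across domains in proportion to their
    importance scores, using largest-remainder rounding so the integer
    allocations sum exactly to the total."""
    s = sum(importance)
    raw = [total_budget * w / s for w in importance]
    alloc = [int(r) for r in raw]
    remainder = total_budget - sum(alloc)
    # hand leftover labels to the domains with the largest fractional parts
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:remainder]:
        alloc[i] += 1
    return alloc
```

Each domain's allocation would then be spent by a per-domain instance-level query strategy (uncertainty sampling, coresets, etc.).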
AIIM Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Point scene understanding is a challenging task to process real-world scene point cloud, which aims at segmenting each object, estimating its pose, and reconstructing its mesh simultaneously. Recent state-of-the-art method first segments each object and then processes them independently with multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries involving their relationship. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to well use the supervisions from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.
JMLR Journal 2024 Journal Article
This paper considers the problem of minimizing the sum of a smooth function and the Schatten-$p$ norm of the matrix. Our contribution involves proposing accelerated iteratively reweighted nuclear norm methods designed to solve the nonconvex low-rank minimization problem. Two major novelties characterize our approach. First, the proposed method possesses an active manifold identification property, enabling the provable identification of the correct rank of the stationary point within a finite number of iterations. Second, we introduce an adaptive updating strategy for smoothing parameters. This strategy automatically fixes parameters associated with zero singular values as constants upon detecting the correct rank while quickly driving the remaining parameters to zero. This adaptive behavior transforms the algorithm into one that effectively solves smooth problems after a few iterations, setting our work apart from existing iteratively reweighted methods for low-rank optimization. We prove the global convergence of the proposed algorithm, guaranteeing that every limit point of the iterates is a critical point. Furthermore, a local convergence rate analysis is provided under the Kurdyka-Łojasiewicz property. We conduct numerical experiments using both synthetic and real data to showcase our algorithm's efficiency and superiority over existing methods.
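A single iteration of an iteratively reweighted scheme for smooth loss plus Schatten-$p$ penalty can be sketched as a gradient step followed by weighted singular-value shrinkage. This is the textbook reweighting rule, without the paper's acceleration or adaptive smoothing-parameter updates.

```python
import numpy as np

def reweighted_sv_step(X, grad, tau, p=0.5, eps=1e-3):
    """One proximal step for (smooth loss) + (Schatten-p penalty):
    take a gradient step, then shrink each singular value by a weight
    p * (sigma + eps)^(p-1), so small singular values shrink harder
    and can be driven exactly to zero (rank identification)."""
    U, s, Vt = np.linalg.svd(X - tau * grad, full_matrices=False)
    w = p * (s + eps) ** (p - 1)            # reweighting: heavier on small sigmas
    s_new = np.maximum(s - tau * w, 0.0)    # weighted soft-thresholding
    return U @ np.diag(s_new) @ Vt
```

On a matrix with one dominant and one tiny singular value, a single step zeroes out the tiny one while only slightly shrinking the dominant one, which is the rank-identification behavior the paper formalizes.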
IROS Conference 2024 Conference Paper
Soft pneumatic actuators can accomplish various customizable deformations/motions through the distribution of cavities and gradients in stiffness. However, traditional manufacturing methods such as molding struggle to produce soft actuators with both complex cavities and desirable stiffness distributions. Regular 3D printing methods usually need extra printheads for support materials to fabricate soft actuators with cavities. In addition, the printing quality and fidelity of the whole structure cannot be uniform due to the effect of gravity, especially for a soft actuator with overhang features. To fabricate a soft actuator of uniform fidelity but desirable stiffness distributions, we propose an embedded 3D printing approach with only one active mixing printhead. By adjusting the mixing ratio of the dual-component silicone, we can achieve designated stiffness gradients, ranging from 30.2 kPa to 198 kPa. With this approach, we successfully fabricate soft pneumatic actuators with overhang features, which exhibit programmable elongation and radial expansion. Additionally, we fabricate soft bending actuators which can achieve programmable workspaces due to their predetermined stiffness distributions.
IJCAI Conference 2024 Conference Paper
Topological consistency plays a crucial role in the task of boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images, and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on downstream tasks, which can even exceed the misalignment of the boundary itself. To enhance topology accuracy in segmentation results, we propose the Skea-Topo Aware loss, a novel loss function that takes into account the shape of each object and the topological significance of the pixels. It consists of two components. First, a skeleton-aware weighted loss improves segmentation accuracy by better modeling object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topologically critical pixels in the prediction errors using both foreground and background skeletons in the ground truth and predictions. Experiments show that our method improves topological consistency by up to 7 points in VI compared to 13 state-of-the-art methods, based on objective and subjective assessments across three different boundary segmentation datasets. The code is available at https://github.com/clovermini/Skea_topo.
AAAI Conference 2024 Conference Paper
Federated learning collaboratively trains machine learning models among different clients while preserving data privacy, and has become the mainstream approach for breaking data silos. However, the non-independent and identically distributed (Non-IID) characteristic of image domains across clients reduces the benefits of federated learning and has become a bottleneck restricting the accuracy and generalization of federated models. In this work, we propose FedST, a novel federated image segmentation method based on style transfer, which uses a denoising diffusion probabilistic model to achieve feature disentanglement and image synthesis of cross-domain image data between multiple clients. It can thus share style features among clients while protecting the structure features of image data, which effectively alleviates the influence of the Non-IID phenomenon. Experiments show that our method achieves superior segmentation performance compared to state-of-the-art methods on four different Non-IID datasets in both objective and subjective assessments. The code is available at https://github.com/YoferChen/FedST.
NeurIPS Conference 2024 Conference Paper
LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 42 datasets spanning 24 financial tasks, covering eight critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, decision-making, and bilingual tasks (English and Spanish). FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and two novel datasets for regulations and stock trading. Our evaluation of 21 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovations in financial LLMs. All datasets and code are publicly available for the research community, with results shared and updated regularly on the Open Financial LLM Leaderboard.
AAAI Conference 2024 Conference Paper
Mobile edge computing (MEC) is a promising paradigm for real-time applications with intensive computational needs (e.g., autonomous driving), as it can reduce the processing delay. In this work, we focus on the timeliness of computational-intensive updates, measured by Age-of-Information (AoI), and study how to jointly optimize the task updating and offloading policies for AoI with fractional form. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The uncertain edge load dynamics, the nature of the fractional objective, and hybrid continuous-discrete action space (due to the joint optimization) make this problem challenging and existing approaches not directly applicable. To this end, we propose a fractional reinforcement learning (RL) framework and prove its convergence. We further design a model-free fractional deep RL (DRL) algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks.
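The fractional objective above (a time-average AoI ratio) is classically handled with Dinkelbach's transform: repeatedly optimize the linearized objective f - λ·g and update λ = f/g until it stabilizes. The sketch below applies this idea to a toy discrete action set; the paper instead combines the transform with model-free DRL over a hybrid action space.

```python
def dinkelbach(fs, gs, iters=50):
    """Minimize the ratio f(a)/g(a) over a discrete action set via
    Dinkelbach's transform: pick argmin_a f(a) - lam * g(a), then set
    lam = f(a)/g(a), and repeat until lam stops changing."""
    lam = 0.0
    for _ in range(iters):
        a = min(range(len(fs)), key=lambda i: fs[i] - lam * gs[i])
        new_lam = fs[a] / gs[a]
        if abs(new_lam - lam) < 1e-12:
            break
        lam = new_lam
    return a, lam
```

At convergence λ equals the optimal ratio, which is what makes the fractional RL framework's convergence argument possible.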
AAAI Conference 2024 Conference Paper
Multimodal learning with incomplete input data (missing modality) is very practical and challenging. In this work, we conduct an in-depth analysis of this challenge and find that modality dominance has a significant negative impact on the model training, greatly degrading the missing modality performance. Motivated by Grad-CAM, we introduce a novel indicator, gradients, to monitor and reduce modality dominance which widely exists in the missing-modality scenario. In aid of this indicator, we present a novel Gradient-guided Modality Decoupling (GMD) method to decouple the dependency on dominating modalities. Specifically, GMD removes the conflicted gradient components from different modalities to achieve this decoupling, significantly improving the performance. In addition, to flexibly handle modal-incomplete data, we design a parameter-efficient Dynamic Sharing (DS) framework which can adaptively switch on/off the network parameters based on whether one modality is available. We conduct extensive experiments on three popular multimodal benchmarks, including BraTS 2018 for medical segmentation, CMU-MOSI, and CMU-MOSEI for sentiment analysis. The results show that our method can significantly outperform the competitors, showing the effectiveness of the proposed solutions. Our code is released here: https://github.com/HaoWang420/Gradient-guided-Modality-Decoupling.
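Removing conflicted gradient components between modalities can be sketched with a PCGrad-style projection: when two gradients have negative inner product, drop the conflicting component of one onto the other's normal plane. This illustrates the decoupling idea; GMD's exact rule may differ.

```python
import numpy as np

def remove_conflict(g1, g2):
    """If two modality gradients conflict (negative inner product),
    project g1 onto the normal plane of g2, removing the component
    that would fight the other modality's update."""
    dot = float(np.dot(g1, g2))
    if dot < 0:
        g1 = g1 - dot / float(np.dot(g2, g2)) * g2
    return g1
```

After projection the two gradients are orthogonal rather than opposed, so the dominant modality no longer suppresses the weaker one's update direction.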
NeurIPS Conference 2024 Conference Paper
In this paper, we present a novel method for efficient and effective 3D surface reconstruction in open scenes. Existing Neural Radiance Field (NeRF) based works typically require extensive training and rendering time due to their implicit representations. In contrast, 3D Gaussian splatting (3DGS) uses an explicit and discrete representation, so the reconstructed surface is built from a huge number of Gaussian primitives, which leads to excessive memory consumption and rough surface details in sparse Gaussian areas. To address these issues, we propose Gaussian Voxel Kernel Functions (GVKF), which establish a continuous scene representation based on discrete 3DGS through kernel regression. GVKF integrates fast 3DGS rasterization and highly effective implicit scene representations, achieving high-fidelity open-scene surface reconstruction. Experiments on challenging scene datasets demonstrate the efficiency and effectiveness of our proposed GVKF, featuring high reconstruction quality, real-time rendering speed, and significant savings in storage and training memory consumption.
NeurIPS Conference 2024 Conference Paper
Data serves as the fundamental basis for advancing deep learning. Tabular data presented in a structured format is highly valuable for modeling and training. However, even in the era of LLMs, obtaining tabular data from sensitive domains remains a challenge due to privacy or copyright concerns. It is therefore urgent to explore methods for effectively using models like LLMs to generate synthetic tabular data that is privacy-preserving yet similar to the original. In this paper, we introduce HARMONIC, a new framework for tabular data generation and evaluation by LLMs. In the data generation part of our framework, we employ fine-tuning to generate tabular data and enhance privacy, rather than the continued pre-training often used by previous small-scale LLM-based methods. In particular, we construct an instruction fine-tuning dataset based on the idea of the k-nearest-neighbors algorithm to inspire LLMs to discover inter-row relationships. Through such fine-tuning, LLMs are trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage. Experiments find that our tabular data generation achieves performance comparable to existing methods but with better privacy, as measured by metrics such as MLE and DCR. In the evaluation part of our framework, we develop DLT, a privacy risk metric specific to LLM synthetic data generation, which quantifies the extent to which the generator itself leaks data. We also develop LLE, a performance evaluation metric for downstream LLM tasks, which is more practical and credible than previous metrics. Experiments show that our data generation method outperforms previous methods on the DLT and LLE metrics.
AAAI Conference 2024 Conference Paper
Enhancing the generalization performance of neural networks given limited data availability remains a formidable challenge, due to the model selection trade-off between training error and generalization gap. To handle this challenge, we formulate a posterior optimization problem specifically designed to reduce the generalization error of trained neural networks. To operationalize this concept, we propose a Doubly-Robust Boosting machine (DRBoost) which consists of a statistical learner and a zero-order optimizer. The statistical learner reduces the model capacity and thus the generalization gap; the zero-order optimizer minimizes the training error in a gradient-free manner. The two components cooperate to reduce the generalization error of a fully trained neural network in a doubly robust manner. Furthermore, the statistical learner alleviates multicollinearity in the discriminative layer and enhances generalization performance. The zero-order optimizer eliminates the reliance on gradient calculation and offers more flexibility in the selection of learning objectives. Experiments demonstrate that DRBoost effectively improves the generalization performance of various prevalent neural network backbones.
EAAI Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
This paper studies the problem of continual learning in an open-world scenario, referred to as Open-world Continual Learning (OwCL). Interest in OwCL is rising rapidly, and the setting is challenging in two respects: i) learning a sequence of tasks without forgetting knowns from the past, and ii) identifying unknowns (novel objects/classes) in the future. Existing OwCL methods suffer from poor adaptability of task-aware boundaries between knowns and unknowns, and do not consider the mechanism of knowledge transfer. In this work, we propose Pro-KT, a novel prompt-enhanced knowledge transfer model for OwCL. Pro-KT includes two key components: (1) a prompt bank to encode and transfer both task-generic and task-specific knowledge, and (2) a task-aware open-set boundary to identify unknowns in new tasks. Experimental results on two real-world datasets demonstrate that Pro-KT markedly outperforms state-of-the-art counterparts in both the detection of unknowns and the classification of knowns. Code released at https://github.com/YujieLi42/Pro-KT.
TMLR Journal 2024 Journal Article
Federated semi-supervised learning (FedSemi) refers to scenarios where there may be clients with fully labeled data, clients with partially labeled data, and even fully unlabeled clients, all while preserving data privacy. However, challenges arise from client drift due to heterogeneous class distributions and erroneous pseudo-labels. Existing FedSemi methods typically fail to aggregate models from unlabeled clients due to their inherent unreliability, thus overlooking the unique information in their heterogeneous data distributions and leading to sub-optimal results. In this paper, we enable unlabeled client aggregation through SemiAnAgg, a novel Semi-supervised Anchor-Based federated Aggregation. SemiAnAgg learns unlabeled client contributions via an anchor model, effectively harnessing their informative value. Our key idea is that by feeding local client data to the same global model and the same consistently initialized anchor model (i.e., a random model), we can measure the importance of each unlabeled client accordingly. Extensive experiments demonstrate that SemiAnAgg achieves new state-of-the-art results on four widely used FedSemi benchmarks, leading to substantial performance improvements: a 9% increase in accuracy on CIFAR-100 and a 7.6% improvement in recall on the medical dataset ISIC-18, compared with prior state-of-the-art. Code is available at: https://github.com/xmed-lab/SemiAnAgg.
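One illustrative reading of the anchor idea: score each unlabeled client by how far the shared global model's outputs on its local data have moved away from a fixed, randomly initialized anchor model, then normalize the scores into aggregation weights. The distance and normalization below are assumptions for the sketch; SemiAnAgg's exact formulation may differ.

```python
import numpy as np

def anchor_weights(global_logits, anchor_logits, temperature=1.0):
    """Per-client aggregation weights from the distance between the global
    model's outputs and a fixed random anchor model's outputs on that
    client's local data (one logit array per client), softmax-normalized."""
    scores = np.array([np.linalg.norm(g - a)
                       for g, a in zip(global_logits, anchor_logits)])
    w = np.exp(scores / temperature)
    return w / w.sum()
```

Clients whose data pushes the global model further from the random anchor are treated as more informative and receive larger aggregation weights.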
JBHI Journal 2024 Journal Article
Sensor-based Human Activity Recognition (HAR) is widely used in daily life and is a basic-level bridge to virtual healthcare in the metaverse. The current challenge is low recognition accuracy for personalized users on smart wearable devices. Their limited resources cannot support large deep learning models updated locally, while integrating and transmitting sensor data to the cloud would reduce efficiency. Considering the trade-off between performance and complexity, we propose a Lightweight Human Activity Recognition (LHAR) framework. In LHAR, we combine the cross-people HAR task with the lightweight model task. The LHAR framework is built on a teacher-student architecture, and the student network consists of multiple depthwise separable convolution layers to achieve fewer parameters. The dark knowledge distilled from the complex teacher model enhances the generalization ability of LHAR. To achieve effective knowledge distillation, we propose two optimization methods. First, we train the teacher model by ensemble learning to promote teacher performance. Second, a multi-channel data augmentation method is proposed to increase the diversity of the dataset, which is a plug-in operation for the ensemble teacher model. In experiments, we compare LHAR with state-of-the-art models through comparative evaluation, an ablation study, and hyperparameter analysis, which demonstrates LHAR's better efficiency and effectiveness.
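The "dark knowledge" transfer LHAR relies on is standard Hinton-style distillation: match the student's temperature-softened output distribution to the teacher's via a KL term scaled by T². This is a sketch of that loss alone, not LHAR's full objective (which also includes the hard-label term and the ensemble teacher).

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soft-target distillation loss: KL divergence between the
    temperature-softened teacher and student distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    def softmax(z):
        z = z - z.max()            # numerical stability
        e = np.exp(z)
        return e / e.sum()
    p = softmax(np.asarray(teacher_logits, dtype=float) / T)
    q = softmax(np.asarray(student_logits, dtype=float) / T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))
```

The high temperature spreads probability mass over non-target classes, which is exactly the inter-class similarity signal that helps a small depthwise-separable student generalize across people.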
TIST Journal 2024 Journal Article
The topic of multimodal conversation systems has recently garnered significant attention across various industries, including travel and retail, among others. While pioneering works in this field have shown promising performance, they often focus solely on context information at the utterance level, overlooking the context-aware dependencies of multimodal semantic elements like words and images. Furthermore, the ordinal information of images, which indicates the relevance between visual context and users’ demands, remains underutilized during the integration of visual content. Additionally, the exploration of how to effectively utilize corresponding attributes provided by users when searching for desired products is still largely unexplored. To address these challenges, we propose PMATE, a Position-aware Multimodal diAlogue system with semanTic Elements. Specifically, to obtain semantic representations at the element level, we first unfold the multimodal historical utterances and devise a position-aware multimodal element-level encoder. This component considers all images that may be relevant to the current turn and introduces a novel position-aware image selector to choose related images before fusing the information from the two modalities. Finally, we present a knowledge-aware two-stage decoder and an attribute-enhanced image searcher for the tasks of generating textual responses and selecting image responses, respectively. We extensively evaluate our model on two large-scale multimodal dialogue datasets, and the results of our experiments demonstrate that our approach outperforms several baseline methods.
NeurIPS Conference 2024 Conference Paper
Counterfactual reasoning is pivotal in human cognition and especially important for providing explanations and making decisions. While Judea Pearl's influential approach is theoretically elegant, its generation of a counterfactual scenario often requires too much deviation from the observed scenarios to be feasible, as we show using simple examples. To mitigate this difficulty, we propose a framework of natural counterfactuals and a method for generating counterfactuals that are more feasible with respect to the actual data distribution. Our methodology incorporates a certain amount of backtracking when needed, allowing changes in causally preceding variables to minimize deviations from realistic scenarios. Specifically, we introduce a novel optimization framework that permits but also controls the extent of backtracking with a "naturalness" criterion. Empirical experiments demonstrate the effectiveness of our method. The code is available at https://github.com/GuangyuanHao/natural_counterfactuals.
ICLR Conference 2024 Conference Paper
The problem of minimizing the maximum of $N$ convex, Lipschitz functions plays significant roles in optimization and machine learning. It has a series of results, with the most recent one requiring $O(N\epsilon^{-2/3} + \epsilon^{-8/3})$ queries to a first-order oracle to compute an $\epsilon$-suboptimal point. On the other hand, quantum algorithms for optimization are rapidly advancing with speedups shown on many important optimization problems. In this paper, we conduct a systematic study of quantum algorithms and lower bounds for minimizing the maximum of $N$ convex, Lipschitz functions. On one hand, we develop quantum algorithms with an improved complexity bound of $\tilde{O}(\sqrt{N}\epsilon^{-5/3} + \epsilon^{-8/3})$. On the other hand, we prove that quantum algorithms must take $\tilde{\Omega}(\sqrt{N}\epsilon^{-2/3})$ queries to a first-order quantum oracle, showing that our dependence on $N$ is optimal up to poly-logarithmic factors.
AAAI Conference 2024 Conference Paper
Adversarial attacks, i.e., generating adversarial perturbations with a small magnitude to deceive deep neural networks, are important for investigating and improving model trustworthiness. Traditionally, the topic was scoped within 2D images without considering 3D multiview information. Benefiting from Neural Radiance Fields (NeRF), one can easily reconstruct a 3D scene with a Multi-Layer Perceptron (MLP) from given 2D views and synthesize photo-realistic renderings of novel vantages. This opens the door to attacking a multiview NeRF network with downstream tasks from different rendering angles, which we denote Neural Radiance Fields-based multiview adversarial Attack (NeRFail). The goal is, given one scene and a subset of views, to deceive the recognition results of agnostic view angles as well as given views. To do so, we propose a transformation mapping from pixels to 3D points such that our attack generates multiview adversarial perturbations by attacking a subset of images with different views, intending to prevent the downstream classifier from correctly predicting images rendered by NeRF from other views. Experiments show that our multiview adversarial perturbations successfully obfuscate the downstream classifier at both known and unknown views. Notably, when retraining another NeRF on the perturbed training data, we show that the perturbation can be inherited and reproduced. The code can be found at https://github.com/jiang-wenxiang/NeRFail.
NeurIPS Conference 2024 Conference Paper
Training generative models with differential privacy (DP) typically involves injecting noise into gradient updates or adapting the discriminator's training procedure. As a result, such approaches often struggle with hyper-parameter tuning and convergence. We consider the \emph{slicing privacy mechanism} that injects noise into random low-dimensional projections of the private data, and provide strong privacy guarantees for it. These noisy projections are used for training generative models. To enable optimizing generative models using this DP approach, we introduce the \emph{smoothed-sliced $f$-divergence} and show it enjoys statistical consistency. Moreover, we present a kernel-based estimator for this divergence, circumventing the need for adversarial training. Extensive numerical experiments demonstrate that our approach can generate synthetic data of higher quality compared with baselines. Beyond performance improvement, our method, by sidestepping the need for noisy gradients, offers data scientists the flexibility to adjust generator architecture and hyper-parameters, run the optimization over any number of epochs, and even restart the optimization process---all without incurring additional privacy costs.
UAI Conference 2024 Conference Paper
Self-supervised learning models extract general-purpose representations from data. Quantifying the reliability of these representations is crucial, as many downstream models rely on them as input for their own tasks. To this end, we introduce a formal definition of _representation reliability_: the representation for a given test point is considered to be reliable if the downstream models built on top of that representation can consistently generate accurate predictions for that test point. However, accessing downstream data to quantify the representation reliability is often infeasible or restricted due to privacy concerns. We propose an ensemble-based method for estimating the representation reliability without knowing the downstream tasks a priori. Our method is based on the concept of _neighborhood consistency_ across distinct pre-trained representation spaces. The key insight is to find shared neighboring points as anchors to align these representation spaces before comparing them. We demonstrate through comprehensive numerical experiments that our method effectively captures the representation reliability with a high degree of correlation, achieving robust and favorable performance compared with baseline methods.
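The neighborhood-consistency idea above can be illustrated with a small sketch. Note this is a simplification under our own assumptions: it measures raw k-NN overlap between two embedding spaces and skips the paper's anchor-based alignment step, which is what makes the comparison meaningful when the spaces are not already aligned.

```python
import numpy as np

def knn_indices(emb, k):
    """Indices of each point's k nearest neighbors (excluding itself)."""
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighborhood_consistency(emb_a, emb_b, k=5):
    """Per-point overlap of k-NN sets across two representation spaces:
    a stable neighborhood suggests a more reliable representation."""
    na, nb = knn_indices(emb_a, k), knn_indices(emb_b, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(na, nb)])

rng = np.random.default_rng(3)
base = rng.normal(size=(50, 6))
emb_a = base + 0.01 * rng.normal(size=base.shape)  # two pre-trained spaces
emb_b = base + 0.01 * rng.normal(size=base.shape)  # that mostly agree
scores = neighborhood_consistency(emb_a, emb_b)
print(scores.mean())
```

With two nearly identical spaces, as here, the mean overlap is close to 1; disagreeing spaces would drive it toward 0.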
NeurIPS Conference 2024 Conference Paper
Diffusion models have demonstrated competitive performance in missing data imputation (MDI) task. However, directly applying diffusion models to MDI produces suboptimal performance due to two primary defects. First, the sample diversity promoted by diffusion models hinders the accurate inference of missing values. Second, data masking reduces observable indices for model training, obstructing imputation performance. To address these challenges, we introduce Negative Entropy-regularized Wasserstein gradient flow for Imputation (NewImp), enhancing diffusion models for MDI from a gradient flow perspective. To handle the first defect, we incorporate a negative entropy regularization term into the cost functional to suppress diversity and improve accuracy. To handle the second defect, we demonstrate that the imputation procedure of NewImp, induced by the conditional distribution-related cost functional, can equivalently be replaced by that induced by the joint distribution, thereby naturally eliminating the need for data masking. Extensive experiments validate the effectiveness of our method. Code is available at https://github.com/JustusvLiebig/NewImp.
IJCAI Conference 2024 Conference Paper
Federated causal discovery (FCD) aims to uncover causal relationships among variables from decentralized data across multiple clients, while preserving data privacy. In practice, the sample quality of each client's local data may vary across different variable spaces, referred to as sample quality heterogeneity. Thus, data from different clients might be suitable for learning different causal relationships among variables. Model aggregation under existing FCD methods requires the entire model parameters from each client and is therefore unable to handle the sample quality heterogeneity issue. In this paper, we propose the Federated Adaptive Causal Discovery (FedACD) method to bridge this gap. During federated model aggregation, it adaptively selects the causal relationships learned under the "good" variable space (i.e., one with high-quality samples) from each client, while masking those learned under the "bad" variable space (i.e., one with low-quality samples). This way, each client only needs to send the optimal learning results to the server, achieving accurate FCD. Extensive experiments on various types of datasets demonstrate significant advantages of FedACD over existing methods. The source code is available at https://github.com/Xianjie-Guo/FedACD.
AAAI Conference 2024 Conference Paper
Hierarchical Multi-Label Classification (HMLC) is a well-established problem that aims at assigning data instances to multiple classes stored in a hierarchical structure. Despite its importance, existing approaches often face two key limitations: (i) They employ dense networks to solely explore the class hierarchy as hard criterion for maintaining taxonomic consistency among predicted classes, yet without leveraging rich semantic relationships between instances and classes; (ii) They struggle to generalize in settings with deep class levels, since the mini-batches uniformly sampled from different levels ignore the varying complexities of data and result in a non-smooth model adaptation to sparse data. To mitigate these issues, we present a Self-Paced Unified Representation (SPUR) learning framework, which focuses on the interplay between instance and classes to flexibly organize the training process of HMLC algorithms. Our framework consists of two lightweight encoders designed to capture the semantics of input features and the topological information of the class hierarchy. These encoders generate unified embeddings of instances and class hierarchy, which enable SPUR to exploit semantic dependencies between them and produce predictions in line with taxonomic constraints. Furthermore, we introduce a dynamic hardness measurement strategy that considers both class hierarchy and instance features to estimate the learning difficulty of each instance. This strategy is achieved by incorporating the propagation loss obtained at each hierarchical level, allowing for a more comprehensive assessment of learning complexity. Extensive experiments on several empirical benchmarks demonstrate the effectiveness and efficiency of SPUR compared to state-of-the-art methods, especially in scenarios with missing features.
JBHI Journal 2024 Journal Article
Recently, deep learning (DL) has enabled rapid advancements in electrocardiogram (ECG)-based automatic cardiovascular disease (CVD) diagnosis. Multi-lead ECG signals have lead systems based on the potential differences between electrodes placed on the limbs and the chest. When applying DL models, ECG signals are usually treated as synchronized signals arranged in Euclidean space, which is the abstraction and generalization of real space. However, conventional DL models typically merely focus on temporal features when analyzing Euclidean data. These approaches ignore the spatial relationships of different leads, which are physiologically significant and useful for CVD diagnosis because different leads represent activities of specific heart regions. These relationships derived from spatial distributions of electrodes can be conveniently created in non-Euclidean data, making multi-lead ECGs better conform to their nature. Considering graph convolutional network (GCN) adept at analyzing non-Euclidean data, a novel spatial-temporal residual GCN for CVD diagnosis is proposed in this work. ECG signals are firstly divided into single-channel patches and transferred into nodes, which will be connected by spatial-temporal connections. The proposed model employs residual GCN blocks and feed-forward networks to alleviate over-smoothing and over-fitting. Moreover, residual connections and patch dividing enable the capture of global and detailed spatial-temporal features. Experimental results reveal that the proposed model achieves at least a 5.85% and 6.80% increase in F1 over other state-of-the-art algorithms with similar parameters and computations in both the PTB-XL and Chapman databases. This indicates that the proposed model provides a promising avenue for intelligent diagnosis with limited computing resources.
IROS Conference 2024 Conference Paper
Navigating toward specific objects in unknown environments without additional training, known as Zero-Shot object navigation, poses a significant challenge in the field of robotics, which demands high levels of auxiliary information and strategic planning. Traditional works have focused on holistic solutions, overlooking the specific challenges agents encounter during navigation such as collision, low exploration efficiency, and misidentification of targets. To address these challenges, our work proposes TriHelper, a novel framework designed to assist agents dynamically through three primary navigation challenges: collision, exploration, and detection. Specifically, our framework consists of three innovative components: (i) Collision Helper, (ii) Exploration Helper, and (iii) Detection Helper. These components work collaboratively to solve these challenges throughout the navigation process. Experiments on the Habitat-Matterport 3D (HM3D) and Gibson datasets demonstrate that TriHelper significantly outperforms all existing baseline methods in Zero-Shot object navigation, showcasing superior success rates and exploration efficiency. Our ablation studies further underscore the effectiveness of each helper in addressing their respective challenges, notably enhancing the agent’s navigation capabilities. By proposing TriHelper, we offer a fresh perspective on advancing the object navigation task, paving the way for future research in the domain of Embodied AI and visual-based navigation.
AAAI Conference 2024 Conference Paper
Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance in many downstream tasks. Since adopting contrastive video-text pair methods like CLIP for video tasks is limited by high cost and scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model to achieve strong fully supervised performance leads to low zero-shot, few-shot, and base-to-novel generalization. Conversely, freezing the backbone network to maintain generalization ability weakens fully supervised performance. Moreover, no single prompt-tuning branch consistently performs optimally. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance. Our prompting approach contains three sections: 1) Independent prompts on both the vision and text branches to learn the language and visual contexts. 2) Inter-modal prompt mapping to ensure mutual synergy. 3) Reducing the discrepancy between the hand-crafted prompt (a video of a person doing [CLS]) and the learnable prompt, to alleviate forgetting of essential video scenarios. Extensive validation under fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance with lower compute cost.
AAAI Conference 2024 Conference Paper
Cognitive diagnosis seeks to estimate the cognitive states of students by exploring their logged practice quiz data. It plays a pivotal role in personalized learning guidance within intelligent education systems. In this paper, we focus on an important, practical, yet often underexplored task: domain-level zero-shot cognitive diagnosis (DZCD), which arises due to the absence of student practice logs in newly launched domains. Recent cross-domain diagnostic models have been demonstrated to be a promising strategy for DZCD. These methods primarily focus on how to transfer student states across domains. However, they might inadvertently incorporate non-transferable information into student representations, thereby limiting the efficacy of knowledge transfer. To tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive diagnosis framework via one batch of early-bird students towards three diagnostic objectives. Our approach initiates with pre-training a diagnosis model with dual regularizers, which decouples student states into domain-shared and domain-specific parts. The shared cognitive signals can be transferred to the target domain, enriching the cognitive priors for the new domain, which ensures the cognitive state propagation objective. Subsequently, we devise a strategy to generate simulated practice logs for cold-start students through analyzing the behavioral patterns from early-bird students, fulfilling the domain-adaption goal. Consequently, we refine the cognitive states of cold-start students as diagnostic outcomes via virtual data, aligning with the diagnosis-oriented goal. Finally, extensive experiments on six real-world datasets highlight the efficacy of our model for DZCD and its practical application in question recommendation. The code is publicly available at https://github.com/bigdata-ustc/Zero-1-to-3.
TCS Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
Domain incremental learning aims to adapt to a sequence of domains with access to only a small subset of data (i.e., memory) from previous domains. Various methods have been proposed for this problem, but it is still unclear how they are related and when practitioners should choose one method over another. In response, we propose a unified framework, dubbed Unified Domain Incremental Learning (UDIL), for domain incremental learning with memory. Our UDIL unifies various existing methods, and our theoretical analysis shows that UDIL always achieves a tighter generalization error bound compared to these methods. The key insight is that different existing methods correspond to our bound with different fixed coefficients; based on insights from this unification, our UDIL allows adaptive coefficients during training, thereby always achieving the tightest bound. Empirical results show that our UDIL outperforms the state-of-the-art domain incremental learning methods on both synthetic and real-world datasets. Code will be available at https://github.com/Wang-ML-Lab/unified-continual-learning.
NeurIPS Conference 2023 Conference Paper
Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values, in which data is first imputed and the imputed data is then used for classification (a procedure referred to as "impute-then-classify"), can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier from imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputing data results in the loss of the missing pattern of the data, which often conveys information about the predictive label. We present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving information encoded within the missing patterns. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.
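One generic way to preserve the missing pattern rather than discard it (an illustration of the general principle, not the paper's specific algorithms) is to append binary missingness indicators to the imputed features, so a downstream classifier can still exploit the pattern:

```python
import numpy as np

def encode_with_missing_pattern(X):
    """Impute NaNs with column means, but keep the missingness pattern
    as extra indicator features so a classifier can still use it."""
    mask = np.isnan(X).astype(float)        # 1 where the value was missing
    col_means = np.nanmean(X, axis=0)
    X_imputed = np.where(np.isnan(X), col_means, X)
    return np.hstack([X_imputed, mask])     # features + missing indicators

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])
Z = encode_with_missing_pattern(X)
print(Z.shape)   # (3, 4): two imputed columns + two indicator columns
```

Plain impute-then-classify would keep only the first two columns, which is exactly the information loss the abstract warns about.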
EAAI Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions made during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy when fairness constraints are applied and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing fairness interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination on standard (overused) tabular datasets. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.
EAAI Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
Deep neural networks have been increasingly used in real-world applications, making it critical to ensure their ability to adapt to new, unseen data. In this paper, we study the generalization capability of neural networks trained with (stochastic) gradient flow. We establish a new connection between the loss dynamics of gradient flow and general kernel machines by proposing a new kernel, called loss path kernel. This kernel measures the similarity between two data points by evaluating the agreement between loss gradients along the path determined by the gradient flow. Based on this connection, we derive a new generalization upper bound that applies to general neural network architectures. This new bound is tight and strongly correlated with the true generalization error. We apply our results to guide the design of neural architecture search (NAS) and demonstrate favorable performance compared with state-of-the-art NAS algorithms through numerical experiments.
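A discretized version of such a path-dependent kernel can be sketched as follows. This is our own toy construction (a linear model with squared loss, and saved checkpoints standing in for the continuous gradient-flow path), not the paper's exact definition: the kernel sums inner products of the two examples' loss gradients along the training trajectory.

```python
import numpy as np

def loss_grad(w, x, y):
    """Gradient of the squared loss 0.5*(w.x - y)^2 with respect to w."""
    return (w @ x - y) * x

def loss_path_kernel(checkpoints, xa, ya, xb, yb, dt=1.0):
    """Discretized path kernel: sum of inner products of the two examples'
    loss gradients evaluated at each checkpoint along the trajectory."""
    return sum(dt * loss_grad(w, xa, ya) @ loss_grad(w, xb, yb)
               for w in checkpoints)

rng = np.random.default_rng(1)
x1, y1 = rng.normal(size=3), 1.0
x2, y2 = rng.normal(size=3), -1.0

# Simulate gradient flow on example 1 and record checkpoints.
w = np.zeros(3)
checkpoints = []
for _ in range(20):
    checkpoints.append(w.copy())
    w = w - 0.1 * loss_grad(w, x1, y1)

k11 = loss_path_kernel(checkpoints, x1, y1, x1, y1)
k12 = loss_path_kernel(checkpoints, x1, y1, x2, y2)
print(k11, k12)
```

The diagonal entry k11 is a sum of squared gradient norms and is therefore positive; the off-diagonal entry measures how much the two examples' loss gradients agreed along the path.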
JBHI Journal 2023 Journal Article
Diagnosis of skin lesions based on imaging techniques remains a challenging task because data (knowledge) uncertainty may reduce accuracy and lead to imprecise results. This paper investigates a new deep hyperspherical clustering (DHC) method for skin lesion medical image segmentation by combining deep convolutional neural networks and the theory of belief functions (TBF). The proposed DHC aims to eliminate the dependence on labeled data, improve segmentation performance, and characterize the imprecision caused by data (knowledge) uncertainty. First, the SLIC superpixel algorithm is employed to group the image into multiple meaningful superpixels, aiming to maximize the use of context without destroying the boundary information. Second, an autoencoder network is designed to transform the superpixels' information into potential features. Third, a hypersphere loss is developed to train the autoencoder network. The loss is defined to map the input to a pair of hyperspheres so that the network can perceive tiny differences. Finally, the result is redistributed to characterize the imprecision caused by data (knowledge) uncertainty based on the TBF. The proposed DHC method can well characterize the imprecision between skin lesions and non-lesions, which is particularly important for medical procedures. A series of experiments on four dermoscopic benchmark datasets demonstrate that the proposed DHC yields better segmentation performance, increasing prediction accuracy while perceiving imprecise regions, compared with other typical methods.
AAAI Conference 2023 Conference Paper
Federated learning (FL) is known to be susceptible to model poisoning attacks in which malicious clients hamper the accuracy of the global model by sending manipulated model updates to the central server during the FL training process. Existing defenses mainly focus on Byzantine-robust FL aggregations, and largely ignore the impact of the underlying deep neural network (DNN) used for FL training. Inspired by recent findings on critical learning periods (CLP) in DNNs, where small gradient errors have irrecoverable impact on the final model accuracy, we propose a new CLP-aware defense against poisoning of FL (DeFL). The key idea of DeFL is to measure fine-grained differences between DNN model updates via an easy-to-compute federated gradient norm vector (FGNV) metric. Using FGNV, DeFL simultaneously detects malicious clients and identifies CLP, which in turn is leveraged to guide the adaptive removal of detected malicious clients from aggregation. As a result, DeFL not only mitigates model poisoning attacks on the global model but also is robust to detection errors. Our extensive experiments on three benchmark datasets demonstrate that DeFL produces significant performance gain over conventional defenses against state-of-the-art model poisoning attacks.
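The norm-vector idea can be sketched concretely. The per-layer L2 norms follow the FGNV description above, but the median/MAD outlier rule below is our simplified stand-in for DeFL's actual detection procedure:

```python
import numpy as np

def fgnv(update):
    """Federated gradient norm vector: one L2 norm per layer of a
    client's model update (here, a list of per-layer arrays)."""
    return np.array([np.linalg.norm(layer) for layer in update])

def suspicion_scores(updates):
    """Deviation of each client's norm vector from the cross-client
    median, in units of the median absolute deviation (simplified)."""
    vecs = np.stack([fgnv(u) for u in updates])
    med = np.median(vecs, axis=0)
    mad = np.median(np.abs(vecs - med), axis=0) + 1e-12
    return (np.abs(vecs - med) / mad).max(axis=1)

rng = np.random.default_rng(2)
honest = [[rng.normal(scale=0.1, size=(4, 4)), rng.normal(scale=0.1, size=4)]
          for _ in range(5)]
malicious = [[rng.normal(scale=5.0, size=(4, 4)), rng.normal(scale=5.0, size=4)]]
scores = suspicion_scores(honest + malicious)
print(int(np.argmax(scores)))   # index 5: the manipulated update stands out
```

Clients with the highest scores would then be candidates for removal from aggregation, with the CLP signal (not modeled here) deciding how aggressively to remove them.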
YNICL Journal 2023 Journal Article
EAAI Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
Recently, federated learning (FL) is popular for its privacy-preserving and collaborative learning abilities. However, under statistically heterogeneous scenarios, we observe that biased data domains on clients cause a representation bias phenomenon and further degenerate generic representations during local training, i.e., the representation degeneration phenomenon. To address these issues, we propose a general framework Domain Bias Eliminator (DBE) for FL. Our theoretical analysis reveals that DBE can promote bi-directional knowledge transfer between server and client, as it reduces the domain discrepancy between server and client in representation space. Besides, extensive experiments on four datasets show that DBE can greatly improve existing FL methods in both generalization and personalization abilities. The DBE-equipped FL method can outperform ten state-of-the-art personalized FL methods by a large margin. Our code is public at https://github.com/TsingZ0/DBE.
ECAI Conference 2023 Conference Paper
Recently, there has been a surge in the use of generated data to enhance the performance of downstream models, largely due to the advancements in pre-trained language models. However, most prevailing methods train generative and discriminative models in isolation, leaving them unable to adapt to changes in each other. These approaches lead to generative models that are prone to deviating from the true data distribution and providing limited benefits to discriminative models. While some works have proposed jointly training generative and discriminative language models, their methods remain challenging due to the non-differentiable nature of discrete data. To overcome these issues, we introduce a self-consistent learning framework in the text field that involves training a discriminator and generator cooperatively in a closed-loop manner until a scoring consensus is reached. By learning directly from selected samples, our framework is able to mitigate training instabilities such as mode collapse and non-convergence. Extensive experiments on four downstream benchmarks, including AFQMC, CHIP-STS, QQP, and MRPC, demonstrate the efficacy of the proposed framework.
AAAI Conference 2023 Conference Paper
A key challenge in federated learning (FL) is the statistical heterogeneity that impairs the generalization of the global model on each client. To address this, we propose Federated learning with Adaptive Local Aggregation (FedALA), which captures the desired information in the global model for client models in personalized FL. The key component of FedALA is an Adaptive Local Aggregation (ALA) module, which can adaptively aggregate the downloaded global model and local model towards the local objective on each client to initialize the local model before training in each iteration. To evaluate the effectiveness of FedALA, we conduct extensive experiments with five benchmark datasets in the computer vision and natural language processing domains. FedALA outperforms eleven state-of-the-art baselines by up to 3.27% in test accuracy. Furthermore, we also apply the ALA module to other federated learning methods and achieve up to 24.19% improvement in test accuracy. Code is available at https://github.com/TsingZ0/FedALA.
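The element-wise aggregation at the heart of ALA can be sketched on a toy objective. The blending form local + w * (global - local), with weights learned towards the local objective, follows the description above; the gradient rule, the [0, 1] clipping range, and the toy quadratic loss are our assumptions:

```python
import numpy as np

def adaptive_local_aggregation(local, global_, weights, grad_fn, lr=0.1, steps=50):
    """Initialize the local model as an element-wise blend
        init = local + w * (global - local),  w in [0, 1],
    learning w by gradient descent on the local objective (sketch)."""
    for _ in range(steps):
        init = local + weights * (global_ - local)
        g = grad_fn(init)  # gradient of the local loss w.r.t. init
        weights = np.clip(weights - lr * g * (global_ - local), 0.0, 1.0)
    return local + weights * (global_ - local), weights

# Toy local objective: squared distance to the local optimum w_star.
w_star = np.array([1.0, -2.0, 0.5])
grad_fn = lambda w: 2 * (w - w_star)

local = np.array([0.0, 0.0, 0.0])
global_ = np.array([2.0, -1.0, 1.0])
init, w = adaptive_local_aggregation(local, global_, np.full(3, 0.5), grad_fn)
print(init)   # each element moves as close to w_star as the blend allows
```

Each coordinate independently learns how much of the global model to absorb: where the global model helps the local objective, w grows; where it hurts, w shrinks.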
AAAI Conference 2023 Conference Paper
Traditional federated learning (FL) algorithms, such as FedAvg, fail to handle non-i.i.d data because they learn a global model by simply averaging biased local models that are trained on non-i.i.d local data, therefore failing to model the global data distribution. In this paper, we present a novel Bayesian FL algorithm that successfully handles such a non-i.i.d FL setting by enhancing the local training task with an auxiliary task that explicitly estimates the global data distribution. One key challenge in estimating the global data distribution is that the data are partitioned in FL, and therefore the ground-truth global data distribution is inaccessible. To address this challenge, we propose an expectation-propagation-inspired probabilistic neural network, dubbed federated neural propagation (FedNP), which efficiently estimates the global data distribution given non-i.i.d data partitions. Our algorithm is sampling-free and end-to-end differentiable, can be applied within any conventional FL framework, and learns richer global data representations. Experiments on both image classification tasks with synthetic non-i.i.d image data partitions and real-world non-i.i.d speech recognition tasks demonstrate that our framework effectively alleviates the performance deterioration caused by non-i.i.d data.
AAAI Conference 2023 Conference Paper
We study a fully online matching problem with stochastic arrivals and departures. In this model, each online arrival follows a known identical and independent distribution over a fixed set of agent types. Its sojourn time is unknown in advance and follows type-specific distributions with known expectations. The goal is to maximize the weighted reward from successful matches. To solve this problem, we first propose a linear program (LP)-based algorithm whose competitive ratio is lower bounded by 0.155 under mild conditions. We further achieve better ratios in some special cases. To demonstrate the challenges of the problem, we further establish several hardness results. In particular, we show that no online algorithm can achieve a competitive ratio better than 2/3 in this model and there is no LP-based algorithm (with respect to our proposed LP) with a competitive ratio better than 1/3. Finally, we demonstrate the effectiveness and efficiency of our algorithm numerically.
JMLR Journal 2023 Journal Article
Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.
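For intuition, a noisy iterative algorithm of the kind analyzed above (SGLD- or DP-SGD-style) perturbs each gradient step with additive Gaussian noise, which is what makes the additive-noise-channel analogy possible. This is a generic sketch, not the paper's analysis; the function name and constants are illustrative.

```python
import random

# One noisy gradient step: the injected Gaussian noise plays the role
# of the additive noise channel in the information-theoretic analogy.

def noisy_gradient_step(w, grad, lr=0.1, sigma=0.01, rng=None):
    rng = rng or random.Random(0)
    return [wi - lr * (gi + rng.gauss(0.0, sigma))
            for wi, gi in zip(w, grad)]

w_next = noisy_gradient_step([1.0, -1.0], [0.5, -0.5])
```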
AAAI Conference 2023 Conference Paper
The computation of trajectory similarity is a crucial task in many spatial data analysis applications. However, existing methods have been designed primarily for trajectories in Euclidean space, which overlooks the fact that real-world trajectories are often generated on road networks. This paper addresses this gap by proposing a novel framework, called GRLSTM (Graph-based Residual LSTM). To jointly capture the properties of trajectories and road networks, the proposed framework incorporates knowledge graph embedding (KGE), graph neural network (GNN), and the residual network into the multi-layer LSTM (Residual-LSTM). Specifically, the framework constructs a point knowledge graph to model the multiple relations of points, as points may belong to both the trajectory and the road network. KGE is introduced to learn point embeddings and relation embeddings to build the point fusion graph, while GNN is used to capture the topology structure information of the point fusion graph. Finally, Residual-LSTM is used to learn the trajectory embeddings. To further enhance the accuracy and robustness of the final trajectory embeddings, we introduce two new neighbor-based point loss functions, namely, a graph-based point loss function and a trajectory-based point loss function. The GRLSTM is evaluated using two real-world trajectory datasets, and the experimental results demonstrate that GRLSTM significantly outperforms all the state-of-the-art methods.
IJCAI Conference 2023 Conference Paper
Recently, Zero-Shot Node Classification (ZNC) has been an emerging and crucial task in graph data analysis. This task aims to predict nodes from unseen classes which are unobserved in the training process. Existing work mainly utilizes Graph Neural Networks (GNNs) to associate features' prototypes and labels' semantics, thus enabling knowledge transfer from seen to unseen classes. However, the multi-faceted semantic orientation in the feature-semantic alignment has been neglected by previous work, i.e., the content of a node usually covers diverse topics that are relevant to the semantics of multiple labels. It is necessary to separate and weigh the semantic factors that strongly affect the model's cognitive ability in order to improve its generality. To this end, we propose a Knowledge-Aware Multi-Faceted framework (KMF) that enhances the richness of label semantics via extracted KG (Knowledge Graph)-based topics. The content of each node is then reconstructed into a topic-level representation that offers multi-faceted and fine-grained semantic relevancy to different labels. Due to the particularity of the graph's instance (i.e., node) representation, a novel geometric constraint is developed to alleviate the problem of prototype drift caused by node information aggregation. Finally, we conduct extensive experiments on several public graph datasets and design an application of zero-shot cross-domain recommendation. The quantitative results demonstrate both the effectiveness and generalization of KMF in comparison with state-of-the-art baselines.
JBHI Journal 2023 Journal Article
Microscopic hyperspectral image (MHSI) has received considerable attention in the medical field. The wealth of spectral information provides potentially powerful identification ability when combined with an advanced convolutional neural network (CNN). However, for high-dimensional MHSI, the local connectivity of CNNs makes it difficult to extract the long-range dependencies of spectral bands. The transformer overcomes this problem well thanks to its self-attention mechanism. Nevertheless, the transformer is inferior to CNNs in extracting detailed spatial features. Therefore, a classification framework integrating a transformer and a CNN in parallel, named Fusion Transformer (FUST), is proposed for MHSI classification tasks. Specifically, the transformer branch is employed to extract the overall semantics and capture the long-range dependencies of spectral bands to highlight the key spectral information. The parallel CNN branch is designed to extract significant multiscale spatial features. Furthermore, a feature fusion module is developed to effectively fuse and process the features extracted by the two branches. Experimental results on three MHSI datasets demonstrate that the proposed FUST achieves superior performance when compared with state-of-the-art methods.
NeurIPS Conference 2023 Conference Paper
Estimating individual treatment effects from observational data is challenging due to treatment selection bias. Prevalent methods mainly mitigate this issue by aligning different treatment groups in the latent space, the core of which is the calculation of distribution discrepancy. However, two issues that are often overlooked can render these methods invalid: (1) mini-batch sampling effects (MSE), where the calculated discrepancy is erroneous in non-ideal mini-batches with outcome imbalance and outliers; (2) unobserved confounder effects (UCE), where the unobserved confounders are not considered in the discrepancy calculation. Both of these issues invalidate the calculated discrepancy, mislead the training of estimators, and thus impede the handling of treatment selection bias. To tackle these issues, we propose Entire Space CounterFactual Regression (ESCFR), which is a new take on optimal transport technology in the context of causality. Specifically, based on the canonical optimal transport framework, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that ESCFR estimates distribution discrepancy accurately, handles the treatment selection bias effectively, and outperforms prevalent competitors significantly.
NeurIPS Conference 2023 Conference Paper
Existing private synthetic data generation algorithms are agnostic to downstream tasks. However, end users may have specific requirements that the synthetic data must satisfy. Failure to meet these requirements could significantly reduce the utility of the data for downstream use. We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user, while preserving strong privacy guarantees and dataset quality. Our technique involves resampling from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights. Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
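The resampling idea described above can be pictured as drawing from the synthetic dataset with probability proportional to learned weights, so that samples failing the selected utility measures are filtered out. A toy sketch with invented weights (the paper obtains the weights from a stochastic first-order optimizer; the names here are illustrative):

```python
import random

# Weighted resampling of a synthetic dataset: records with higher
# utility-aligned weight are more likely to survive post-processing.

def resample(records, weights, k, seed=0):
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=k)

synthetic = ["a", "b", "c", "d"]
weights = [0.4, 0.4, 0.1, 0.1]   # stand-in for learned resampling weights
kept = resample(synthetic, weights, k=3)
```

Because resampling only reuses records already released under a privacy guarantee, the post-processing step itself consumes no additional privacy budget, which matches the abstract's claim of preserved guarantees.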
EAAI Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
Diffusion models have achieved state-of-the-art performance in generative modeling tasks across various domains. Prior works on time series diffusion models have primarily focused on developing conditional models tailored to specific forecasting or imputation tasks. In this work, we explore the potential of task-agnostic, unconditional diffusion models for several time series applications. We propose TSDiff, an unconditionally-trained diffusion model for time series. Our proposed self-guidance mechanism enables conditioning TSDiff for downstream tasks during inference, without requiring auxiliary networks or altering the training procedure. We demonstrate the effectiveness of our method on three different time series tasks: forecasting, refinement, and synthetic data generation. First, we show that TSDiff is competitive with several task-specific conditional forecasting methods (predict). Second, we leverage the learned implicit probability density of TSDiff to iteratively refine the predictions of base forecasters with reduced computational overhead over reverse diffusion (refine). Notably, the generative performance of the model remains intact: downstream forecasters trained on synthetic samples from TSDiff outperform forecasters that are trained on samples from other state-of-the-art generative time series models, occasionally even outperforming models trained on real data (synthesize). Our code is available at https://github.com/amazon-science/unconditional-time-series-diffusion
NeurIPS Conference 2023 Conference Paper
Earth system forecasting has traditionally relied on complex physical models that are computationally expensive and require significant domain expertise. In the past decade, the unprecedented increase in spatiotemporal Earth observation data has enabled data-driven forecasting models using deep learning techniques. These models have shown promise for diverse Earth system forecasting tasks but either struggle with handling uncertainty or neglect domain-specific prior knowledge, resulting in averaging possible futures to blurred forecasts or generating physically implausible predictions. To address these limitations, we propose a two-stage pipeline for probabilistic spatiotemporal forecasting: 1) We develop PreDiff, a conditional latent diffusion model capable of probabilistic forecasts. 2) We incorporate an explicit knowledge alignment mechanism to align forecasts with domain-specific physical constraints. This is achieved by estimating the deviation from imposed constraints at each denoising step and adjusting the transition distribution accordingly. We conduct empirical studies on two datasets: N-body MNIST, a synthetic dataset with chaotic behavior, and SEVIR, a real-world precipitation nowcasting dataset. Specifically, we impose the law of conservation of energy in N-body MNIST and anticipated precipitation intensity in SEVIR. Experiments demonstrate the effectiveness of PreDiff in handling uncertainty, incorporating domain-specific prior knowledge, and generating forecasts that exhibit high operational utility.
NeurIPS Conference 2023 Conference Paper
In recommender systems, the collected data used for training is always subject to selection bias, which poses a great challenge for unbiased learning. Previous studies proposed various debiasing methods based on observed user and item features, but ignored the effect of hidden confounding. To address this problem, recent works suggest the use of sensitivity analysis for worst-case control of the unknown true propensity, which is valid only when the true propensity is close to the nominal propensity within a finite bound. In this paper, we first perform theoretical analysis to reveal the possible failure of previous approaches, including propensity-based, multi-task learning, and bi-level optimization methods, in achieving unbiased learning when hidden confounding is present. Then, we propose a unified multi-task learning approach to remove hidden confounding, which uses a few unbiased ratings to calibrate the learned nominal propensities and nominal error imputations from biased data. We conduct extensive experiments on three publicly available benchmark datasets, including a fully exposed large-scale industrial dataset, validating the effectiveness of the proposed methods in removing hidden confounding.
AAAI Conference 2023 Conference Paper
This paper presents Uncertainty-aware Contrastive Learning (UCoL): a fully unsupervised framework for discriminative facial representation learning. Our UCoL is built upon a momentum contrastive network, referred to as Dual-path Momentum Network. Specifically, two flows of pairwise contrastive training are conducted simultaneously: one is formed with intra-instance self augmentation, and the other is to identify positive pairs collected by online pairwise prediction. We introduce a novel uncertainty-aware consistency K-nearest neighbors algorithm to generate predicted positive pairs, which enables efficient discriminative learning from large-scale open-world unlabeled data. Experiments show that UCoL significantly improves the baselines of unsupervised models and performs on par with the semi-supervised and supervised face representation learning methods.
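As a simplified stand-in for the uncertainty-aware consistency KNN above, plain nearest-neighbor search in an embedding space already shows how candidate positive pairs can be mined from unlabeled data; the real method additionally applies consistency and uncertainty filtering. All names and values here are illustrative.

```python
# Mine candidate positives for a query embedding by squared Euclidean
# distance over a bank of unlabeled embeddings.

def knn(query, bank, k):
    """Return indices of the k embeddings in `bank` closest to `query`."""
    dists = [(sum((q - b) ** 2 for q, b in zip(query, emb)), i)
             for i, emb in enumerate(bank)]
    return [i for _, i in sorted(dists)[:k]]

bank = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
positives = knn([0.0, 0.1], bank, k=2)
```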
NeurIPS Conference 2023 Conference Paper
Accurate Urban SpatioTemporal Prediction (USTP) is of great importance to the development and operation of the smart city. As an emerging building block, multi-sourced urban data are usually integrated as urban knowledge graphs (UrbanKGs) to provide critical knowledge for urban spatiotemporal prediction models. However, existing UrbanKGs are often tailored for specific downstream prediction tasks and are not publicly available, which limits the potential advancement. This paper presents UUKG, the unified urban knowledge graph dataset for knowledge-enhanced urban spatiotemporal predictions. Specifically, we first construct UrbanKGs consisting of millions of triplets for two metropolises by connecting heterogeneous urban entities such as administrative boroughs, POIs, and road segments. Moreover, we conduct qualitative and quantitative analysis on constructed UrbanKGs and uncover diverse high-order structural patterns, such as hierarchies and cycles, that can be leveraged to benefit downstream USTP tasks. To validate and facilitate the use of UrbanKGs, we implement and evaluate 15 KG embedding methods on the KG completion task and integrate the learned KG embeddings into 9 spatiotemporal models for five different USTP tasks. The extensive experimental results not only provide benchmarks of knowledge-enhanced USTP models under different task settings but also highlight the potential of state-of-the-art high-order structure-aware UrbanKG embedding methods. We hope the proposed UUKG fosters research on urban knowledge graphs and broad smart city applications. The dataset and source code are available at https: //github. com/usail-hkust/UUKG/.
NeurIPS Conference 2023 Conference Paper
Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced. In this paper, we propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well in imbalanced regression but naturally produces reasonable uncertainty estimation as a byproduct. Different from typical variational autoencoders assuming i.i.d. representations (a data point's representation is not directly affected by other data points), our VIR borrows data with similar regression labels to compute the latent representation's variational distribution; furthermore, different from deterministic regression models producing point estimates, VIR predicts the entire normal-inverse-gamma distributions and modulates the associated conjugate distributions to impose probabilistic reweighting on the imbalanced data, thereby providing better uncertainty estimation. Experiments in several real-world datasets show that our VIR can outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation. Code will soon be available at https://github.com/Wang-ML-Lab/variational-imbalanced-regression.
ICRA Conference 2022 Conference Paper
This paper presents a novel robot arm that is capable of switching between a rigid robot arm and a continuum robot arm. The novel robot arm can therefore perform adaptive physical interaction and manipulation in complex working environments and tasks. The switchability of the robot arm is achieved with two types of joints: knee-like flexible joints and continuum flexible joints, with which the continuum segment of the robot arm can be locked and loosened, hence the degrees of freedom of the robot arm can be switched. In this work, kinematics is established for specifying the relationship between joint space and global coordinates in both the rigid and continuum configurations. Then, the posture and workspace in the rigid and continuum configurations are analyzed and illustrated with numerical simulations, and compared based on the established kinematic model. Finally, a series of preliminary experimental tests of joint motion and stiffness has been carried out to validate the design, the kinematic model, and the motion performance of the proposed robot arm. Both the numerical and experimental results show that the knee-like joints can guarantee favorable motion accuracy, and the motion of the continuum segment from the testing is well aligned with the motion calculated from the theoretical model. Moreover, the stiffness of the rigid configuration is larger than that of the continuum configuration based on the stiffness experiment results. Therefore, the proposed novel robot arm can handle adaptive interaction and manipulation in diverse environments through switching between the rigid and continuum configurations.
NeurIPS Conference 2022 Conference Paper
We consider the problem of producing fair probabilistic classifiers for multi-class classification tasks. We formulate this problem in terms of "projecting" a pre-trained (and potentially unfair) classifier onto the set of models that satisfy target group-fairness requirements. The new, projected model is given by post-processing the outputs of the pre-trained classifier by a multiplicative factor. We provide a parallelizable, iterative algorithm for computing the projected classifier and derive both sample complexity and convergence guarantees. Comprehensive numerical comparisons with state-of-the-art benchmarks demonstrate that our approach maintains competitive performance in terms of accuracy-fairness trade-off curves, while achieving favorable runtime on large datasets. We also evaluate our method at scale on an open dataset with multiple classes, multiple intersectional groups, and over 1M samples.
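The post-processing shape described above, rescaling a pre-trained classifier's output probabilities by a multiplicative factor and renormalizing, can be sketched as follows. The factors here are invented for illustration; the paper computes them so the projected model satisfies group-fairness constraints.

```python
# Post-process class probabilities with per-class multiplicative
# factors, then renormalize so they still sum to one.

def project(probs, factors):
    scaled = [p * f for p, f in zip(probs, factors)]
    z = sum(scaled)
    return [s / z for s in scaled]

adjusted = project([0.6, 0.3, 0.1], [0.5, 1.0, 2.0])
```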
AAAI Conference 2022 Conference Paper
Recurrent neural networks have proven effective in modeling sequential user feedback for recommender systems. However, they usually focus solely on item relevance and fail to effectively explore diverse items for users, therefore harming the system performance in the long run. To address this problem, we propose a new type of recurrent neural network, dubbed recurrent exploration networks (REN), to jointly perform representation learning and effective exploration in the latent space. REN tries to balance relevance and exploration while taking into account the uncertainty in the representations. Our theoretical analysis shows that REN can preserve the rate-optimal sublinear regret even when there exists uncertainty in the learned representations. Our empirical study demonstrates that REN can achieve satisfactory long-term rewards on both synthetic and real-world recommendation datasets, outperforming state-of-the-art models.
NeurIPS Conference 2022 Conference Paper
Detecting out-of-distribution (OOD) samples is essential for reliably deploying deep learning classifiers in open-world applications. However, existing detectors relying on discriminative probability suffer from the overconfident posterior estimate for OOD data. Other reported approaches either impose strong unproven parametric assumptions to estimate OOD sample density or develop empirical detectors lacking clear theoretical motivations. To address these issues, we propose a theoretical probabilistic framework for OOD detection in deep classification networks, in which two regularization constraints are constructed to reliably calibrate and estimate sample density to identify OOD. Specifically, the density consistency regularization enforces the agreement between analytical and empirical densities of observable low-dimensional categorical labels. The contrastive distribution regularization separates the densities between in-distribution (ID) and distribution-deviated samples. A simple and robust implementation algorithm is also provided, which can be used for any pre-trained neural network classifiers. To the best of our knowledge, we have conducted the most extensive evaluations and comparisons on computer vision benchmarks. The results show that our method significantly outperforms state-of-the-art detectors, and even achieves comparable or better performance than methods utilizing additional large-scale outlier exposure datasets.
ICRA Conference 2022 Conference Paper
To achieve real-time dynamic simulation analysis and optimization design, a dynamic digital twin of a nonholonomic mobile manipulator (one UR5e mounted on an industrial mobile robot MIR 200) has been developed in this paper. First, the digital twin integrated with the dynamics of a mobile manipulator is established. The framework of the dynamic digital twin is presented in detail. Then, the dynamic model of the system has been established with consideration of the physical interaction between the robot and humans/environments using the Lagrange formulation. Finally, experimental testing has been conducted to validate the dynamic model and evaluate the performance (e.g., real-time responsiveness and accuracy) of the dynamic digital twin that is integrated with the physical human/environment-robot interaction.
ICRA Conference 2022 Conference Paper
Collaborative robots are gradually taking over the leading position in automating the production and manufacturing of SMEs, where human-robot collaboration is highly emphasized. Therefore, estimating the force and simulating the performance of robots are of great importance. As a newly introduced technology, the digital twin has gained increasing attention for simulation, process evaluation, real-time monitoring, etc. However, the current state of the art of digital twins for robots still remains at the kinematic level, and the integrated robot system dynamics is too complex to be incorporated into the digital twin. Therefore, this research starts from the perspective of the harmonic-drive-based robot joint, and proposes a dynamic model of the robot joint by analyzing its composition, transmission principle, and internal interactions. Then, experimental parameter identification is performed to obtain the inherent parameters, which reflect the system's performance characteristics. Finally, a preliminary digital twin of the robot joint integrated with the dynamic model is established with Gazebo and MATLAB. The proposed approach could be used to simulate the dynamic behavior of the robot joint in real time and advance the state of the art for digital twins.
NeurIPS Conference 2022 Conference Paper
Conventionally, Earth system (e.g., weather and climate) forecasting relies on numerical simulation with complex physical models and hence is both expensive in computation and demanding on domain expertise. With the explosive growth of spatiotemporal Earth observation data in the past decade, data-driven models that apply Deep Learning (DL) are demonstrating impressive potential for various Earth system forecasting tasks. The Transformer as an emerging DL architecture, despite its broad success in other domains, has limited adoption in this area. In this paper, we propose Earthformer, a space-time Transformer for Earth system forecasting. Earthformer is based on a generic, flexible and efficient space-time attention block, named Cuboid Attention. The idea is to decompose the data into cuboids and apply cuboid-level self-attention in parallel. These cuboids are further connected with a collection of global vectors. We conduct experiments on the MovingMNIST dataset and a newly proposed chaotic $N$-body MNIST dataset to verify the effectiveness of cuboid attention and figure out the best design of Earthformer. Experiments on two real-world benchmarks about precipitation nowcasting and El Niño/Southern Oscillation (ENSO) forecasting show that Earthformer achieves state-of-the-art performance.
NeurIPS Conference 2022 Conference Paper
Human intelligence has shown remarkably lower latency and higher precision than most AI systems when processing non-stationary streaming data in real-time. Numerous neuroscience studies suggest that such abilities may be driven by internal predictive modeling. In this paper, we explore the possibility of introducing such a mechanism in unsupervised domain adaptation (UDA) for handling non-stationary streaming data for real-time streaming applications. We propose to formulate internal predictive modeling as a continuous-time Bayesian filtering problem within a stochastic dynamical system context. Such a dynamical system describes the dynamics of model parameters of a UDA model evolving with non-stationary streaming data. Building on such a dynamical system, we then develop extrapolative continuous-time Bayesian neural networks (ECBNN), which generalize existing Bayesian neural networks to represent temporal dynamics and allow us to extrapolate the distribution of model parameters before observing the incoming data, therefore effectively reducing the latency. Remarkably, our empirical results show that ECBNN is capable of continuously generating better distributions of model parameters along the time axis given historical data only, thereby achieving (1) training-free test-time adaptation with low latency, (2) gradually improved alignment between the source and target features and (3) gradually improved model performance over time during the real-time testing stage.
AAAI Conference 2022 Conference Paper
We investigate the fairness concerns of training a machine learning model using data with missing values. Even though there are a number of fairness intervention methods in the literature, most of them require a complete training set as input. In practice, data can have missing values, and data missing patterns can depend on group attributes (e.g., gender or race). Simply applying off-the-shelf fair learning algorithms to an imputed dataset may lead to an unfair model. In this paper, we first theoretically analyze different sources of discrimination risks when training with an imputed dataset. Then, we propose an integrated approach based on decision trees that does not require a separate process of imputation and learning. Instead, we train a tree with missing incorporated as attribute (MIA), which does not require explicit imputation, and we optimize a fairness-regularized objective function. We demonstrate that our approach outperforms existing fairness intervention methods applied to an imputed dataset, through several experiments on real-world datasets.
TIST Journal 2022 Journal Article
Computational understanding of humor is an important topic under creative language understanding and modeling. It can play a key role in complex human-AI interactions. The challenge here is that human perception of humorous content is highly subjective. The same joke may receive different funniness ratings from different readers. This makes it highly challenging for humor recognition models to achieve personalization in practical scenarios. Existing approaches are generally designed based on the assumption that users have a consensus on whether a given text is humorous or not. Thus, they cannot handle diverse humor preferences well. In this article, we propose the FedHumor approach for the recognition of humorous content in a personalized manner through Federated Learning (FL). Extending a pre-trained language model, FedHumor guides the fine-tuning process by considering diverse distributions of humor preferences from individuals. It incorporates a diversity adaptation strategy into the FL paradigm to train a personalized humor recognition model. To the best of our knowledge, FedHumor is the first text-based personalized humor recognition model through federated learning. Extensive experiments demonstrate the advantage of FedHumor in recognizing humorous texts compared to nine state-of-the-art humor recognition approaches with superior capability for handling the diversity in humor labels produced by users with diverse preferences.
IJCAI Conference 2022 Conference Paper
Traffic flow forecasting plays a vital role in the transportation domain. Existing studies usually manually construct correlation graphs and design sophisticated models for learning spatial and temporal features to predict future traffic states. However, manually constructed correlation graphs cannot accurately extract the complex patterns hidden in the traffic data. In addition, it is challenging for the prediction model to fit traffic data due to its irregularly-shaped distribution. To solve the above-mentioned problems, in this paper, we propose a novel learning-based method to learn a spatial-temporal correlation graph, which could make good use of the traffic flow data. Moreover, we propose First-Order Gradient Supervision (FOGS), a novel method for traffic flow forecasting. FOGS utilizes first-order gradients, rather than specific flows, to train the prediction model, which effectively avoids the problem of fitting irregularly-shaped distributions. Comprehensive numerical evaluations on four real-world datasets reveal that the proposed methods achieve state-of-the-art performance and significantly outperform the benchmarks.
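The first-order-gradient idea in FOGS can be pictured as training on the step-to-step changes of the flow series rather than the raw flows, then integrating back from a known starting value. A toy sketch under that reading (not the paper's code; function names are invented):

```python
# Convert a flow series to first-order gradients (step differences)
# and reconstruct it from a starting value plus the gradients.

def to_gradients(flows):
    return [b - a for a, b in zip(flows, flows[1:])]

def from_gradients(start, grads):
    out = [start]
    for g in grads:
        out.append(out[-1] + g)
    return out

flows = [10.0, 12.0, 9.0, 11.0]
grads = to_gradients(flows)
assert from_gradients(flows[0], grads) == flows
```

The round trip is lossless given the starting value, so a model can be supervised on the better-behaved gradient distribution without giving up the ability to recover absolute flows.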
AAAI Conference 2022 Conference Paper
Bayesian neural networks (BNNs) have drawn extensive interest due to their unique probabilistic representation framework. However, Bayesian neural networks have seen limited publicized deployment because of their relatively poor model performance in real-world applications. In this paper, we argue that the randomness of sampling in Bayesian neural networks causes errors in the updating of model parameters during training and poor performance of some sampled models in testing. To solve this, we propose to train Bayesian neural networks with Adversarial Distribution as a theoretical solution. To avoid the difficulty of calculating Adversarial Distribution analytically, we further present the Adversarial Sampling method as an approximation in practice. We conduct extensive experiments with multiple network structures on different datasets, e.g., CIFAR-10 and CIFAR-100. Experimental results validate the correctness of the theoretical analysis and the effectiveness of Adversarial Sampling in improving model performance. Additionally, models trained with Adversarial Sampling still keep their ability to model uncertainties and perform better when predictions are retained according to the uncertainties, which further verifies the generality of the Adversarial Sampling approach.
ICRA Conference 2022 Conference Paper
Dielectric elastomer actuators (DEAs) have been widely employed to drive various soft robots, due to their quiet, fast, muscle-like behavior. It is significant but challenging to model and control these soft actuators, due to their viscoelastic property, irregular geometry, complex structure, etc. In this paper, we propose a data-driven sparse identification method to discover the hidden governing equations of DEAs. These equations can help us interpret the nonlinear properties of DEAs. Due to their low computational cost, we can further use these equations to explore classic model-based control methods for real-time accurate control of viscoelastic DEAs. The experiments show that the proposed method can model the viscoelastic behavior of the DEAs with reasonable accuracy. A feedforward controller is finally developed to validate the effectiveness of the proposed method. It is expected that this modeling method can pave the way for accurate control of soft actuators/robots with structural and material nonlinearities.
AAAI Conference 2022 Conference Paper
Despite the recent progress in Graph Neural Networks (GNNs), it remains challenging to explain the predictions made by GNNs. Existing explanation methods mainly focus on post-hoc explanations, where another explanatory model is employed to provide explanations for a trained GNN. The fact that post-hoc methods fail to reveal the original reasoning process of GNNs raises the need for building GNNs with built-in interpretability. In this work, we propose Prototype Graph Neural Network (ProtGNN), which combines prototype learning with GNNs and provides a new perspective on the explanations of GNNs. In ProtGNN, the explanations are naturally derived from the case-based reasoning process and are actually used during classification. The prediction of ProtGNN is obtained by comparing the inputs to a few learned prototypes in the latent space. Furthermore, for better interpretability and higher efficiency, a novel conditional subgraph sampling module is incorporated to indicate which part of the input graph is most similar to each prototype in ProtGNN+. Finally, we evaluate our method on a wide range of datasets and perform concrete case studies. Extensive results show that ProtGNN and ProtGNN+ can provide inherent interpretability while achieving accuracy on par with their non-interpretable counterparts.
AAAI Conference 2022 Conference Paper
Online on-demand ridesourcing services have played a huge role in transforming urban transportation. A central function in most on-demand ridesourcing platforms is to dynamically assign drivers to rider requests in a way that balances request waiting times and driver pick-up distances. To deal with the online nature of this problem, existing literature either divides the time horizon into short windows and applies a static offline assignment algorithm within each window, or assumes a fully online setting that makes a decision for each request immediately upon its arrival. In this paper, we propose a more realistic model for driver-request assignment that bridges these two settings. Our model allows requests to wait after their arrival but assumes that they may leave at any time according to a quitting function. Under this model, we design an efficient algorithm for assigning available drivers to requests in real time. Our algorithm can take future estimated driver arrivals into consideration and make strategic waiting and matching decisions that balance the waiting time and pick-up distance of the assignment. We prove that our algorithm is optimal ex ante in the single-request setting, and demonstrate its effectiveness in the general multi-request setting through experiments on both synthetic and real-world datasets.
AAAI Conference 2022 Conference Paper
Federated learning (FL) is a popular technique to train machine learning (ML) models with decentralized data. Extensive works have studied the performance of the global model; however, it is still unclear how the training process affects the final test accuracy. Exacerbating this problem is the fact that FL executions differ significantly from traditional ML, with heterogeneous data characteristics across clients and more hyperparameters involved. In this work, we show that the final test accuracy of FL is dramatically affected by the early phase of the training process, i.e., FL exhibits critical learning periods, in which small gradient errors irrecoverably impact the final test accuracy. To further explain this phenomenon, we generalize the trace of the Fisher Information Matrix (FIM) to FL and define a new notion called FedFIM, a quantity reflecting the local curvature of each client from the beginning of training in FL. Our findings suggest that the initial learning phase plays a critical role in understanding FL performance. This is in contrast to many existing works, which generally do not connect the final accuracy of FL to the early phase of training. Finally, seizing critical learning periods in FL is of independent interest and could be useful for other problems, such as choosing hyperparameters (including, but not limited to, the number of clients selected per round and the batch size) to improve the performance of FL training and testing.
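Under common assumptions, the trace of the Fisher Information Matrix that FedFIM builds on can be estimated empirically as the average squared per-example gradient norm. A hedged sketch of that estimator (illustrative only, not the paper's exact FedFIM definition):

```python
def fim_trace(per_example_grads):
    """Empirical trace of the Fisher Information Matrix: the average squared
    gradient norm over examples (the kind of quantity a client-side curvature
    measure like FedFIM aggregates)."""
    return sum(sum(g * g for g in grad) for grad in per_example_grads) / len(per_example_grads)
```

Each client would compute this over its own minibatches early in training, and the server would compare the values across clients.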
AAAI Conference 2022 Conference Paper
Recent advances in supervised deep learning methods are enabling remote measurements of photoplethysmography-based physiological signals using facial videos. The performance of these supervised methods, however, depends on the availability of large labelled datasets. Contrastive learning, as a self-supervised method, has recently achieved state-of-the-art performance in learning representative data features by maximising mutual information between different augmented views. However, existing data augmentation techniques for contrastive learning are not designed to learn physiological signals from videos and often fail when there is complicated noise and subtle, periodic colour/shape variation between video frames. To address these problems, we present a novel self-supervised spatiotemporal learning framework for remote physiological signal representation learning, where labelled training data are scarce. First, we propose a landmark-based spatial augmentation that splits the face into several informative parts based on Shafer's dichromatic reflection model to characterise subtle skin colour fluctuations. We also formulate a sparsity-based temporal augmentation exploiting the Nyquist–Shannon sampling theorem to effectively capture periodic temporal changes by modelling physiological signal features. Furthermore, we introduce a constrained spatiotemporal loss which generates pseudo-labels for augmented video clips; it is used to regulate the training process and handle complicated noise. We evaluated our framework on 3 public datasets, demonstrated superior performance over other self-supervised methods, and achieved accuracy competitive with state-of-the-art supervised methods. Code is available at https://github.com/Dylan-H-Wang/SLF-RPM.
JMLR Journal 2022 Journal Article
This paper primarily focuses on computing the Euclidean projection of a vector onto the lp ball with p ∈ (0,1). Such a problem emerges as the core building block in statistical machine learning and signal processing tasks because of its ability to promote the sparsity of the desired solution. However, efficient numerical algorithms for finding the projections are still not available, particularly in large-scale optimization. To meet this challenge, we first derive the first-order necessary optimality conditions of this problem. Based on this characterization, we develop a novel numerical approach for computing a stationary point by solving a sequence of projections onto reweighted l1-balls. This method is practically simple to implement and computationally efficient. Moreover, the proposed algorithm is shown to converge uniquely under mild conditions and has a worst-case O(1/√k) convergence rate. Numerical experiments demonstrate the efficiency of our proposed algorithm.
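The inner building block of such reweighted schemes is a ball projection; as a point of reference, the unweighted l1-ball projection can be computed by the classic sort-and-threshold procedure. A minimal sketch of that standard building block (illustrative only, not the paper's reweighted algorithm):

```python
import math

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the l1 ball of the given radius,
    via the classic sort-and-threshold procedure."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)                  # already inside the ball
    u = sorted((abs(x) for x in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - radius) / i
        if ui - t > 0:
            theta = t                   # last feasible soft-threshold level
    # Soft-threshold each entry, keeping its original sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]
```

The reweighted variant in the paper would repeat a projection of this kind with per-coordinate weights updated from the current iterate.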
AAAI Conference 2022 Conference Paper
Uncertainty estimation is an essential step in evaluating the robustness of deep learning models in computer vision, especially when they are applied in risk-sensitive areas. However, most state-of-the-art deep learning models either fail to obtain uncertainty estimates or need significant modification (e.g., formulating a proper Bayesian treatment) to obtain them. Most previous methods cannot take an arbitrary model off the shelf and generate uncertainty estimates without retraining or redesigning it. To address this gap, we perform a systematic exploration of training-free uncertainty estimation for dense regression, an unrecognized yet important problem, and provide a theoretical construction justifying such estimates. We propose three simple and scalable methods to analyze the variance of the outputs of a trained network under tolerable perturbations: infer-transformation, infer-noise, and infer-dropout. They operate solely during inference, without the need to re-train, re-design, or fine-tune the models, as typically required by state-of-the-art uncertainty estimation methods. Surprisingly, even without involving such perturbations in training, our methods produce comparable or even better uncertainty estimates than training-required state-of-the-art methods. Code is available at https://github.com/lumi9587/train-free-uncertainty.
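The infer-noise idea can be caricatured as: perturb the input with small Gaussian noise, run the fixed model several times, and read the output variance as an uncertainty estimate. A toy sketch under that reading (`model` is any callable returning a scalar; the names are illustrative, not the released code's API):

```python
import random
import statistics

def infer_noise_uncertainty(model, x, n_samples=32, sigma=0.01):
    """Training-free uncertainty in the spirit of 'infer-noise': run a fixed
    model on slightly perturbed copies of the input and report the mean and
    variance of its outputs."""
    outputs = []
    for _ in range(n_samples):
        perturbed = [xi + random.gauss(0.0, sigma) for xi in x]
        outputs.append(model(perturbed))
    return statistics.fmean(outputs), statistics.pvariance(outputs)
```

Infer-transformation and infer-dropout follow the same recipe with the input noise swapped for tolerable transformations or inserted dropout layers.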
TMLR Journal 2022 Journal Article
Content mismatch usually occurs when data from one modality is translated to another, e.g., language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, thus leading to difficulty in locating such mismatch between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of the speech into hierarchically structured latent variables, indicating the relationship between the two modalities. Training such a model is very challenging due to the discrete latent variables with complex dependencies involved. To address this challenge, we propose a novel and effective training procedure that alternates between estimating the hard assignments of the discrete latent variables over a specifically designed mismatch localization finite-state acceptor (ML-FSA) and updating the parameters of the neural networks. In this work, we focus on the mismatch localization problem for speech and text, and our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations for model training.
IJCAI Conference 2022 Conference Paper
With the rise of telemedicine, the task of developing Dialogue Systems for Medical Diagnosis (DSMD) has received much attention in recent years. Different from early research that relied on extra human resources and expertise to build the system, recent research has focused on how to build DSMD in a data-driven manner. However, previous data-driven DSMD methods largely overlooked system interpretability, which is critical for a medical application, and they also suffered from the data sparsity issue. In this paper, we explore how to bring interpretability to data-driven DSMD. Specifically, we propose a more interpretable decision process to implement the dialogue manager of DSMD by reasonably mimicking real doctors' inquiry logic, and we devise a model with highly transparent components to conduct the inference. Moreover, we collect a new DSMD dataset, which has a much larger scale, more diverse patterns, and higher quality than existing ones. The experiments show that our method obtains 7.7%, 10.0%, and 3.0% absolute improvements in diagnosis accuracy on three datasets, respectively, demonstrating the effectiveness of its rational decision process and model design. Our code and the GMD-12 dataset are available at https://github.com/lwgkzl/BR-Agent.
YNICL Journal 2021 Journal Article
NeurIPS Conference 2021 Conference Paper
Optimization is a key component for training machine learning models and has a strong impact on their generalization. In this paper, we consider a particular optimization method, the stochastic gradient Langevin dynamics (SGLD) algorithm, and investigate the generalization of models trained by SGLD. We derive a new generalization bound by connecting SGLD with Gaussian channels found in information and communication theory. Our bound can be computed from the training data and incorporates the variance of gradients to quantify a particular kind of "sharpness" of the loss landscape. We also consider an algorithm closely related to SGLD, namely differentially private SGD (DP-SGD). We prove that the generalization capability of DP-SGD can be amplified by iteration. Specifically, our bound can be sharpened by including a time-decaying factor if the DP-SGD algorithm outputs the last iterate while keeping other iterates hidden. This decay factor enables the contribution of early iterations to our bound to diminish with time, and is established via strong data processing inequalities, a fundamental tool in information theory. We demonstrate our bound through numerical experiments, showing that it can predict the behavior of the true generalization gap.
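For reference, a single SGLD update is an ordinary gradient step plus Gaussian noise whose scale depends on the step size and temperature. A minimal sketch of that standard update (generic background, not specific to this paper's analysis):

```python
import math
import random

def sgld_step(theta, grad, lr=0.01, temperature=1.0):
    """One stochastic gradient Langevin dynamics update: a gradient descent
    step plus Gaussian noise with standard deviation sqrt(2 * lr * T)."""
    scale = math.sqrt(2.0 * lr * temperature)
    return [t - lr * g + scale * random.gauss(0.0, 1.0)
            for t, g in zip(theta, grad)]
```

At temperature zero the noise vanishes and the update reduces to plain gradient descent; the injected noise is precisely what allows the Gaussian-channel view used in the paper's bound.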
AAAI Conference 2021 Conference Paper
Rationality and emotion are two fundamental elements of humans. Endowing agents with rationality and emotion has been one of the major milestones in AI. However, in the field of conversational AI, most existing models only specialize in one aspect and neglect the other, which often leads to dull or unrelated responses. In this paper, we hypothesize that combining rationality and emotion in conversational agents can improve response quality. To test the hypothesis, we focus on one fundamental aspect of rationality, i.e., commonsense, and propose CARE, a novel model for commonsense-aware emotional response generation. Specifically, we first propose a framework to learn and construct commonsense-aware emotional latent concepts of the response given an input message and a desired emotion. We then propose three methods to collaboratively incorporate the latent concepts into response generation. Experimental results on two large-scale datasets support our hypothesis and show that our model can produce more accurate and commonsense-aware emotional responses and achieve better human ratings than state-of-the-art models that only specialize in one aspect.
TIST Journal 2021 Journal Article
In recent years, with the development of deep learning, text-generation technology has undergone great changes and now provides many kinds of services for human beings, such as restaurant reservation and daily communication. The automatically generated text is becoming more and more fluent, so researchers have begun to consider more anthropomorphic text-generation technology, that is, conditional text generation, including emotional text generation, personalized text generation, and so on. Conditional Text Generation (CTG) has thus become a research hotspot, and much attention has been paid to exploring this promising field. We therefore aim to give a comprehensive review of the new research trends in CTG. We first summarize several key techniques and illustrate the technical evolution route in the field of neural text generation, based on the concept model of CTG. We further investigate existing CTG fields and propose several general learning models for CTG. Finally, we discuss the open issues and promising research directions of CTG.
IJCAI Conference 2021 Conference Paper
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a novel cross-modal retrieval task, where abstract sketches are used as queries to retrieve natural images under a zero-shot scenario. Most existing methods regard ZS-SBIR as a traditional classification problem and employ a cross-entropy or triplet-based loss to achieve retrieval, neglecting the domain gap between sketches and natural images and the large intra-class diversity in sketches. To this end, we propose a novel Domain-Smoothing Network (DSN) for ZS-SBIR. Specifically, a cross-modal contrastive method is proposed to learn generalized representations that smooth the domain gap by mining relations with additional augmented samples. Furthermore, a category-specific memory bank with sketch features is explored to reduce intra-class diversity in the sketch domain. Extensive experiments demonstrate that our approach notably outperforms state-of-the-art methods on both the Sketchy and TU-Berlin datasets.
IJCAI Conference 2021 Conference Paper
As machine learning becomes more widely used for critical applications, the need to study its privacy implications becomes urgent. Given access to the target model and auxiliary information, a model inversion attack aims to infer sensitive features of the training dataset, which raises great privacy concerns. Despite its success in the grid domain, directly applying model inversion techniques to non-grid domains such as graphs achieves poor attack performance, due to the difficulty of fully exploiting the intrinsic properties of graphs and the attributes of graph nodes used in GNN models. To bridge this gap, we present the Graph Model Inversion attack, which aims to infer edges of the training graph by inverting Graph Neural Networks, one of the most popular graph analysis tools. Specifically, the projected gradient module in our method can tackle the discreteness of graph edges while preserving the sparsity and smoothness of graph features. Moreover, a well-designed graph autoencoder module can efficiently exploit graph topology, node attributes, and target model parameters. With the proposed method, we study the connection between model inversion risk and edge influence, and show that edges with greater influence are more likely to be recovered. Extensive experiments on several public datasets demonstrate the effectiveness of our method. We also show that differential privacy, in its canonical form, can hardly defend against our attack while preserving decent utility.
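A projected-gradient attack on graph edges needs a feasibility projection after each gradient step on the relaxed adjacency matrix. One plausible, simplified form of such a projection (illustrative only, not the paper's exact module) symmetrizes the matrix, clips entries to [0, 1], and removes self-loops:

```python
def project_adjacency(a):
    """Feasibility projection for a continuous relaxation of graph edges:
    symmetrize, clip entries to [0, 1], and zero the diagonal (no self-loops)."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                out[i][j] = min(1.0, max(0.0, 0.5 * (a[i][j] + a[j][i])))
    return out
```

After the attack converges, the continuous entries would be thresholded or sampled to recover discrete edge predictions.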
AAAI Conference 2021 Conference Paper
Automatically solving math word problems is a crucial task for exploring the intelligence level of machines in the general AI domain. It is highly challenging, since it requires not only natural language understanding but also mathematical expression inference. Existing solutions usually explore sequence-to-sequence models to generate expressions, where the problems are simply encoded sequentially. However, such models are generally far from sufficient for understanding problems the way humans do and lead to incorrect answers. To this end, in this paper, we propose a novel Hierarchical Math Solver (HMS) for deep understanding and exploitation of problems. In problem understanding, imitating human reading habits, we propose a hierarchical word-clause-problem encoder. Specifically, we first split each problem into several clauses and learn problem semantics from the local clause level to the global problem level. Then, in clause understanding, we propose a dependency-based module to enhance clause semantics with the dependency structure of the problem. Next, in expression inference, we propose a novel tree-based decoder to generate the mathematical expression for the answer. In the decoder, we apply a hierarchical attention mechanism to enhance the problem semantics with context from different levels, and a pointer-generator network to guide the model to copy existing information and infer extra knowledge. Extensive experimental results on two widely used datasets demonstrate that HMS achieves not only better answers but also more reasonable inference.
AAAI Conference 2021 Conference Paper
We study the problem of imposing conversational goals/keywords on open-domain conversational agents, where the agent is required to lead the conversation to a target keyword smoothly and quickly. Solving this problem enables the application of conversational agents in many real-world scenarios, e.g., recommendation and psychotherapy. The dominant paradigm for tackling this problem is to 1) train a next-turn keyword classifier, and 2) train a keyword-augmented response retrieval model. However, existing approaches in this paradigm have two limitations: 1) the training and evaluation datasets for next-turn keyword classification are directly extracted from conversations without human annotations; thus, they are noisy and correlate poorly with human judgements, and 2) during keyword transition, the agents rely solely on the similarities between word embeddings to move closer to the target keyword, which may not reflect how humans converse. In this paper, we assume that human conversations are grounded in commonsense and propose a keyword-guided neural conversational model that can leverage external commonsense knowledge graphs (CKG) for both keyword transition and response retrieval. Automatic evaluations suggest that commonsense improves the performance of both next-turn keyword prediction and keyword-augmented response retrieval. In addition, both self-play and human evaluations show that our model produces responses with smoother keyword transitions and reaches the target keyword faster than competitive baselines.
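The embedding-similarity transition strategy that this abstract identifies as a limitation can be sketched as: at each turn, pick the candidate keyword whose embedding is closest to the target keyword's. A toy illustration (`embed` is a hypothetical embedding table, not from the paper's code):

```python
import math

def next_keyword(candidates, target_vec, embed):
    """Baseline transition strategy: choose the candidate keyword whose
    embedding has the highest cosine similarity to the target keyword."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0
    return max(candidates, key=lambda w: cos(embed[w], target_vec))
```

The paper's proposal replaces this purely geometric step with transitions grounded in a commonsense knowledge graph.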
JBHI Journal 2021 Journal Article
In this article, we study a novel problem: “automatic prescription recommendation for PD patients.” To realize this goal, we first build a dataset by collecting 1) symptoms of PD patients and 2) the prescription drugs provided by their neurologists. Then, we build a novel computer-aided prescription model by learning the relation between observed symptoms and prescribed drugs. Finally, for new patients, we can recommend (predict) suitable prescription drugs based on their observed symptoms using our prescription model. On the methodology side, our proposed model, namely Prescription viA Learning lAtent Symptoms (PALAS), can recommend prescriptions using the multi-modality representation of the data. In PALAS, a latent symptom space is learned to better model the relationship between symptoms and prescribed drugs, as there is a large semantic gap between them. Moreover, we present an efficient alternating optimization method for PALAS. We evaluated our method using data collected from 136 PD patients at Nanjing Brain Hospital, which can be regarded as a large dataset in the PD research community. The experimental results demonstrate the effectiveness and clinical potential of our method in this recommendation task compared with other competing methods.
ICRA Conference 2021 Conference Paper
This study presents a novel dynamic modelling method for analyzing the modal properties of a curved extensible continuum manipulator (ECM). Considering the variable bending angle and length, kinematic and static models are first established to describe the geometric posture deformation and the deflection generated by gravity. Then, the modal dynamic model is developed along the centerline direction of the ECM in polar coordinates, by which the modal properties, including the modal frequency and mode shape, of the curved continuum manipulator with different posture parameters are analyzed. The modal response is further analyzed based on the generalized coordinate method. Finally, the modal properties of the ECM with variable posture are experimentally tested and compared with the theoretical values using an in-house developed ECM physical prototype. In the experiments, bending, length-varying, and circular tracking motions are performed for comprehensive validation of the proposed dynamic modelling method.
NeurIPS Conference 2021 Conference Paper
Predicting molecular properties with data-driven methods has drawn much attention in recent years. In particular, Graph Neural Networks (GNNs) have demonstrated remarkable success in various molecular generation and prediction tasks. In cases where labeled data is scarce, GNNs can be pre-trained on unlabeled molecular data to first learn general semantic and structural information before being fine-tuned for specific tasks. However, most existing self-supervised pre-training frameworks for GNNs only focus on node-level or graph-level tasks. These approaches cannot capture the rich information in subgraphs or graph motifs. For example, functional groups (frequently occurring subgraphs in molecular graphs) often carry indicative information about molecular properties. To bridge this gap, we propose Motif-based Graph Self-supervised Learning (MGSSL) by introducing a novel self-supervised motif generation framework for GNNs. First, for motif extraction from molecular graphs, we design a molecule fragmentation method that leverages the retrosynthesis-based algorithm BRICS along with additional rules for controlling the size of the motif vocabulary. Second, we design a general motif-based generative pre-training framework in which GNNs are asked to make topological and label predictions. This generative framework can be implemented in two different ways, i.e., breadth-first or depth-first. Finally, to take the multi-scale information in molecular graphs into consideration, we introduce multi-level self-supervised pre-training. Extensive experiments on various downstream benchmark tasks show that our methods outperform all state-of-the-art baselines.
YNIMG Journal 2021 Journal Article
IJCAI Conference 2021 Conference Paper
It has become increasingly thorny for computer vision competitions to preserve fairness when participants intentionally fine-tune their models against the test datasets to improve their performance. To mitigate such unfairness, competition organizers restrict the training and evaluation process of participants' models. However, such restrictions introduce massive computation overheads for organizers and potential intellectual property leakage for participants. Thus, we propose Themis, a framework that trains a noise generator jointly with organizers and participants to prevent intentional fine-tuning by protecting test datasets from surreptitious manual labeling. Specifically, with the carefully designed noise generator, Themis adds noise to perturb test sets without twisting the performance ranking of participants' models. We evaluate the validity of Themis with a wide spectrum of real-world models and datasets. Our experimental results show that Themis effectively enforces competition fairness by precluding manual labeling of test sets and preserving the performance ranking of participants' models.
AAAI Conference 2021 Conference Paper
With the prevalence of location based services, activity trajectories are being generated at a rapid pace. The activity trajectory data enriches traditional trajectory data with semantic activities of users, which not only shows where the users have been, but also the preference of users. However, the large volume of data is expensive for people to explore. To address this issue, we study the problem of Diversity-aware Activity Trajectory Selection (DaATS). Given a region of interest for a user, it finds a small number of representative activity trajectories that can provide the user with a broad coverage of different aspects of the region. The problem is challenging in both the efficiency of trajectory similarity computation and subset selection. To tackle the two challenges, we propose a novel solution by: (1) exploiting a deep metric learning method to speedup the similarity computation; and (2) proving that DaATS is an NP-hard problem, and developing an efficient approximation algorithm with performance guarantees. Experiments on two real-world datasets show that our proposal significantly outperforms state-of-the-art baselines.
AAAI Conference 2020 Conference Paper
Recently, end-to-end text spotting, which aims to detect and recognize text in cluttered images simultaneously, has received growing interest in computer vision. Different from existing approaches that formulate text detection as bounding box extraction or instance segmentation, we localize a set of points on the boundary of each text instance. With this representation of boundary points, we establish a simple yet effective scheme for end-to-end text spotting that can read text of arbitrary shapes. Experiments on three challenging datasets, ICDAR2015, Total-Text, and COCO-Text, demonstrate that the proposed method consistently surpasses the state-of-the-art in both scene text detection and end-to-end text recognition tasks.
JBHI Journal 2020 Journal Article
Cervical cancer ranks as the second most common cancer in women worldwide. In clinical practice, colposcopy is an indispensable part of screening for cervical intraepithelial neoplasia (CIN) grades and cervical cancer but exhibits a high misdiagnosis rate. Existing computer-assisted algorithms for analyzing cervigram images have neglected that colposcopy is a sequential, multi-state process, which makes them unsuitable for clinical applications. In this work, we construct a cervigram-based recurrent convolutional neural network (C-RCNN) to classify different CIN grades and cervical cancer. Convolutional neural networks are leveraged to extract spatial features. We develop a sequence-encoding module to encode discriminative temporal features and a multi-state-aware convolutional layer to integrate features from different states of cervigram images. To train and evaluate the performance of C-RCNN, we leveraged a dataset of 4,753 real cervigrams and obtained 96.13% test accuracy, with a specificity and sensitivity of 98.22% and 95.09%, respectively. The areas under the receiver operating characteristic curves are all above 0.94, showing that visual representations and sequential dynamics can be jointly and effectively optimized in the training phase. Comparative analysis demonstrated the effectiveness of the proposed C-RCNN against competing methods, showing significant improvement over focusing on only a single frame. This architecture can be extended to other applications in medical image analysis.
AAAI Conference 2020 Conference Paper
Actor and action video segmentation with language queries aims to segment the objects referred to by an expression in a video. This process requires comprehensive language reasoning and fine-grained video understanding. Previous methods mainly leverage dynamic convolutional networks to match visual and semantic representations. However, dynamic convolution neglects spatial context when processing each region in the frame and thus struggles to segment similar objects in complex scenarios. To address this limitation, we construct a context-modulated dynamic convolutional network. Specifically, we propose a context-modulated dynamic convolutional operation in the proposed framework. The kernels for a specific region are generated from both the language sentence and the surrounding context features. Moreover, we devise a temporal encoder to incorporate motion into the visual features to further match the query descriptions. Extensive experiments on two benchmark datasets, Actor-Action Dataset Sentences (A2D Sentences) and J-HMDB Sentences, demonstrate that our proposed approach notably outperforms state-of-the-art methods.
IS Journal 2020 Journal Article
Online recommendation is an important feature in many applications. In practice, the interaction between users and the recommender system may be sparse, i.e., users are not always interacting with the recommender system. For example, some users prefer to glance over a recommendation instead of clicking into its details. Therefore, a response of zero is not necessarily a negative response but may be a nonresponse. Distinguishing these two situations becomes even harder when only one item is recommended to the user at a time and little further information is available. Most existing recommendation strategies ignore the difference between nonresponses and negative responses. In this article, we propose a novel approach to making online recommendations via sparse interactions. We design a contextual bandit algorithm, named hSAOR, for online recommendation. Our method makes probabilistic estimates of whether the user is interacting or not, by reasonably assuming that similar items are similarly attractive. It uses positive and negative responses to build the user preference model, ignoring all nonresponses. Theoretical analyses and experimental results demonstrate its effectiveness.
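The distinction between nonresponses and negative responses can be made concrete with a toy estimator that updates only on actual responses. A simplified sketch in the spirit of (but not identical to) hSAOR, with illustrative class and method names:

```python
class ResponseAwareEstimator:
    """Toy per-item preference estimator: positive and negative responses
    update the click-rate estimate, while nonresponses (None) are ignored
    rather than being treated as negatives."""
    def __init__(self, n_items):
        self.clicks = [0] * n_items
        self.trials = [0] * n_items

    def update(self, item, response):
        if response is None:          # user was not interacting: learn nothing
            return
        self.trials[item] += 1
        self.clicks[item] += 1 if response else 0

    def estimate(self, item):
        if self.trials[item] == 0:
            return 0.5                # uninformative prior
        return self.clicks[item] / self.trials[item]
```

The full algorithm would additionally estimate, per interaction, the probability that the user was paying attention at all, rather than requiring an explicit `None` label.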
IJCAI Conference 2020 Conference Paper
Point-of-interest (POI) recommendation has become an increasingly important sub-field of recommender system research. Previous methods employ various assumptions to exploit contextual information for improving recommendation accuracy. Their common property is that similar users are more likely to visit similar POIs and similar POIs are likely to be visited by the same user. However, none of the existing methods utilizes similarity explicitly to make recommendations. In this paper, we propose a new framework for POI recommendation that explicitly utilizes similarity with contextual information. Specifically, we categorize the context information into two groups, i.e., global and local context, and develop different regularization terms to incorporate them for recommendation. A graph Laplacian regularization term is utilized to exploit the global context information. Moreover, we cluster users into different groups and let the objective function constrain users in the same group to have similar predicted POI ratings. An alternating optimization method is developed to optimize our model and obtain the final rating matrix. Our experimental results show that our algorithm outperforms all the state-of-the-art methods.
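The global-context term described above is a standard graph Laplacian regularizer: it penalizes predicted-rating differences between users the similarity graph connects. A minimal sketch in our own notation, not the paper's exact objective:

```python
import numpy as np

def laplacian_regularizer(R, W):
    """Global-context regularizer tr(R^T L R) with L = D - W.

    R: (n_users, n_pois) predicted rating matrix.
    W: (n_users, n_users) symmetric user-similarity matrix.
    By the standard identity, tr(R^T L R) equals
    1/2 * sum_ij W_ij * ||R_i - R_j||^2, so minimizing it pulls the
    predicted rating rows of similar users toward each other.
    """
    L = np.diag(W.sum(axis=1)) - W     # graph Laplacian
    return float(np.trace(R.T @ L @ R))
```

Adding this term to the recommendation loss (weighted by a hyperparameter) is what ties the predictions of similar users together during the alternating optimization.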
JBHI Journal 2020 Journal Article
This paper proposes a novel hybrid network named multiple-feature-branch convolutional bidirectional recurrent neural network (MFB-CBRNN) for myocardial infarction (MI) detection using 12-lead ECGs. The model efficiently combines convolutional neural network-based and recurrent neural network-based structures. Each feature branch consists of several one-dimensional convolutional and pooling layers, corresponding to a certain lead. All the feature branches are independent from each other and are utilized to learn the diverse features of different leads. Moreover, a bidirectional long short-term memory network is employed to summarize all the feature branches; its strength at feature aggregation is confirmed by our experiments. Furthermore, the paper develops a novel optimization method, lead random mask (LRM), to alleviate overfitting and implement an implicit ensemble, in the spirit of dropout. The model with LRM achieves more accurate MI detection. Class-based and subject-based fivefold cross validations are both carried out using the Physikalisch-Technische Bundesanstalt diagnostic database. In total, 148 MI and 52 healthy control subjects are involved in the experiments. The MFB-CBRNN achieves an overall accuracy of 99.90% in class-based experiments and 93.08% in subject-based experiments. Compared with other related studies, our algorithm achieves a comparable or even better result on MI detection. Therefore, the MFB-CBRNN has good generalization capacity and is suitable for MI detection using 12-lead ECGs. It has the potential to assist real-world MI diagnostics and reduce the burden on cardiologists.
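As we read the abstract, LRM randomly zeroes out entire ECG leads during training, analogously to dropout over the feature branches, so the network cannot over-rely on any single lead. A hypothetical sketch; the function name, keep probability, and array layout are our assumptions, not the paper's specification:

```python
import numpy as np

def lead_random_mask(beats, p_keep=0.8, rng=None):
    """Randomly drop whole ECG leads for one training beat.

    beats: (n_leads, n_samples) array for a single heartbeat.
    Each lead is kept independently with probability p_keep; dropped
    leads are zeroed wholesale, mimicking an implicit ensemble over
    lead subsets (dropout-style). Sketch only.
    """
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(beats.shape[0]) < p_keep   # per-lead keep decision
    masked = beats * mask[:, None]               # zero dropped leads entirely
    return masked, mask
```

At test time all leads would be kept, so masking is applied only while training, exactly as with dropout.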
IJCAI Conference 2020 Conference Paper
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a specific cross-modal retrieval task for searching natural images given free-hand sketches under the zero-shot scenario. Most existing methods solve this problem by simultaneously projecting visual features and semantic supervision into a low-dimensional common space for efficient retrieval. However, such low-dimensional projection destroys the completeness of the semantic knowledge in the original semantic space, so useful knowledge cannot be transferred well when learning semantic features from different modalities. Moreover, domain information and semantic information are entangled in the visual features, which is not conducive to cross-modal matching since it hinders the reduction of the domain gap between sketch and image. In this paper, we propose a Progressive Domain-independent Feature Decomposition (PDFD) network for ZS-SBIR. Specifically, under the supervision of the original semantic knowledge, PDFD decomposes visual features into domain features and semantic ones, and the semantic features are then projected into the common space as retrieval features for ZS-SBIR. The progressive projection strategy maintains strong semantic supervision. Besides, to guarantee that the retrieval features capture clean and complete semantic information, a cross-reconstruction loss is introduced to encourage that any combination of retrieval features and domain features can reconstruct the visual features. Extensive experiments demonstrate the superiority of our PDFD over state-of-the-art competitors.
EAAI Journal 2019 Journal Article
TIST Journal 2019 Journal Article
The discovery of Markov blanket (MB) for feature selection has attracted much attention in recent years, since the MB of the class attribute is the optimal feature subset for feature selection. However, almost all existing MB discovery algorithms focus on either improving computational efficiency or boosting learning accuracy, instead of both. In this article, we propose a novel MB discovery algorithm for balancing efficiency and accuracy, called BAlanced Markov Blanket (BAMB) discovery. To achieve this goal, given a class attribute of interest, BAMB finds the candidate PC (parents and children) and spouses and removes false positives from the candidate MB set in one go. Specifically, once a feature is successfully added to the current PC set, BAMB finds the spouses with regard to this feature, then uses the updated PC and spouse sets to remove false positives from the current MB set. This keeps the PC and spouses of the target as small as possible and thus achieves a trade-off between computational efficiency and learning accuracy. In the experiments, we first compare BAMB with 8 state-of-the-art MB discovery algorithms on 7 benchmark Bayesian networks; we then use 10 real-world datasets to compare BAMB with 12 feature selection algorithms, including the 8 state-of-the-art MB discovery algorithms and 4 other well-established feature selection methods. On prediction accuracy, BAMB outperforms all 12 compared feature selection algorithms. On computational efficiency, BAMB is close to the IAMB algorithm and much faster than the remaining seven MB discovery algorithms.
AAAI Conference 2019 Conference Paper
We consider the problem of inferring the values of an arbitrary set of variables (e.g., risk of diseases) given other observed variables (e.g., symptoms and diagnosed diseases) and high-dimensional signals (e.g., MRI images or EEG). This is a common problem in healthcare since variables of interest often differ for different patients. Existing methods, including Bayesian networks and structured prediction, either do not incorporate high-dimensional signals or fail to model conditional dependencies among variables. To address these issues, we propose bidirectional inference networks (BIN), which stitch together multiple probabilistic neural networks, each modeling a conditional dependency. Predictions are then made by iteratively updating variables using backpropagation (BP) to maximize the corresponding posterior probability. Furthermore, we extend BIN to composite BIN (CBIN), which brings the iterative prediction process into the training stage and improves both accuracy and computational efficiency by adaptively smoothing the optimization landscape. Experiments on synthetic and real-world datasets (a sleep study and a dermatology dataset) show that CBIN is a single model that can achieve state-of-the-art performance and obtain better accuracy in most inference tasks than multiple models each specifically trained for a different task.
TIST Journal 2019 Journal Article
Cloud computing extends Transportation Cyber-Physical Systems (T-CPS) with enhanced computing and storage capability by offloading computing tasks to remote cloud servers. However, cloud computing cannot fulfill requirements such as low latency and context awareness in T-CPS. Mobile Edge Computing (MEC) can overcome these limitations by offloading computing tasks to edge servers in proximity to users, consequently reducing latency and improving context awareness. Although MEC has the potential to improve T-CPS, it is incapable of processing computation-intensive tasks such as deep learning algorithms due to intrinsic storage and computing-capability constraints. Therefore, we design and develop a lightweight deep learning model to support MEC applications in T-CPS. In particular, we put forth a stacked convolutional neural network (CNN) consisting of factorization convolutional layers alternating with compression layers (namely, lightweight CNN-FC). Extensive experimental results show that our proposed lightweight CNN-FC can greatly decrease the number of unnecessary parameters, thereby reducing the model size while maintaining high accuracy, in contrast to conventional CNN models. In addition, we evaluate the performance of our proposed model by conducting experiments on a realistic MEC platform. These experiments confirm that our model maintains high accuracy while preserving a portable model size.
IJCAI Conference 2019 Conference Paper
In this paper, we investigate the structural similarities within a finite Markov decision process (MDP). We view a finite MDP as a heterogeneous directed bipartite graph and propose novel measures for state similarity and action similarity in a mutual reinforcement manner. We prove that the state similarity is a metric and the action similarity is a pseudometric. We also establish the connection between the proposed similarity measures and the optimal values of the MDP. Extensive experiments show that the proposed measures are effective.
ICRA Conference 2019 Conference Paper
Efficient and reliable dynamic modelling and analysis is critical to the shape control and estimation of a backbone continuum manipulator. This paper presents a novel dynamic modelling method to investigate the deformation modal properties of a vertical stretch-retractable continuum manipulator (VSRCM) under a tip horizontal load. Based on the equivalence of the elastic bending potential energy, this method simplifies the backbone continuum manipulator to a single beam, an equivalent-guided beam (EGB): one end of the EGB is fixed, and the other end is guided. The modal dynamics model of the EGB is then established by employing the vibration theory of a continuous beam, and the modal properties of the continuum manipulator are analyzed. Static and dynamic experiments on the VSRCM are performed to validate the dynamic modelling method through comparison and analysis of the free dynamic response and frequency properties of the continuum manipulator. To analyze the free dynamic response, a large-deflection static model with tip horizontal load is also established based on the elliptic integral approach. Experimental results and comparison demonstrate that the static model is efficient and precise in predicting the large deformation, and the modal dynamic models well reflect the actual vibration modal characteristics of the VSRCM.
AAAI Conference 2019 Conference Paper
Over the past few years, there has been a resurgence of interest in using the recurrent neural network-hidden Markov model (RNN-HMM) for automatic speech recognition (ASR). Some modern recurrent network models, such as long short-term memory (LSTM) and the simple recurrent unit (SRU), have demonstrated promising results on this task. Recently, several scientific perspectives in the fields of neuroethology and speech production suggest that human speech signals may be represented by discrete point patterns involving acoustic events in the speech signal. This hypothesis poses challenges for RNN-HMM acoustic modeling: firstly, it arbitrarily discretizes the continuous input into interval features at a fixed frame rate, which may introduce discretization errors; secondly, the occurrences of such acoustic events are unknown. Furthermore, the training targets of the RNN-HMM are obtained from other (inferior) models, giving rise to misalignments. In this paper, we propose a recurrent Poisson process (RPP), which can be seen as a collection of Poisson processes over a series of time intervals in which the intensity evolves according to the RNN hidden states that encode the history of the acoustic signal. It aims at allocating the latent acoustic events in continuous time. Such events are efficiently drawn from the RPP using a sampling-free solution in analytic form. The speech signal containing latent acoustic events is reconstructed/sampled dynamically from the discretized acoustic features using linear interpolation, in which the weight parameters are estimated from the onsets of these events. The above processes are further integrated into an SRU, forming our final model, called the recurrent Poisson process unit (RPPU). Experimental evaluations on ASR tasks including CHiME-2, WSJ0 and WSJ0&1 demonstrate the effectiveness and benefits of the RPPU. For example, it achieves a relative WER reduction of 10.7% over state-of-the-art models on WSJ0.
AAAI Conference 2019 Conference Paper
Two-stream architectures have shown strong performance on the video classification task. The key idea is to learn spatiotemporal features by fusing convolutional networks spatially and temporally. However, such architectures have several problems. First, they rely on optical flow to model temporal information, which is often expensive to compute and store. Second, they have limited ability to capture details and local context information in video data. Third, they lack explicit semantic guidance, which greatly decreases classification performance. In this paper, we propose a new two-stream based deep framework for video classification that discovers spatial and temporal information from RGB frames only; moreover, a multi-scale pyramid attention (MPA) layer and a semantic adversarial learning (SAL) module are introduced and integrated into our framework. The MPA enables the network to capture global and local features and generate a comprehensive representation for the video, and the SAL makes this representation gradually approximate the real video semantics in an adversarial manner. Experimental results on two public benchmarks demonstrate that our proposed method achieves state-of-the-art results on standard video datasets.
IJCAI Conference 2019 Conference Paper
Beyond existing multi-view clustering, this paper studies a more realistic clustering scenario, referred to as incomplete multi-view clustering, where a number of data instances are missing in certain views. To tackle this problem, we explore spectral perturbation theory. In this work, we show a strong link between perturbation risk bounds and incomplete multi-view clustering. That is, as the similarity matrix fed into spectral clustering is a quantity bounded in magnitude O(1), we transfer the missing-data problem from the data to the similarity matrix and tailor a matrix completion method for incomplete similarity matrices. Moreover, we show that minimizing the perturbation risk bounds among different views maximizes the final fusion result across all views. This provides a solid fusion criterion for multi-view data. We motivate and propose a Perturbation-oriented Incomplete multi-view Clustering (PIC) method. Experimental results demonstrate the effectiveness of the proposed method.
JBHI Journal 2018 Journal Article
To keep pace with developments in medical informatics, health and medical data are being collected continually. However, owing to the diversity of its categories and sources, medical data has become so complicated in many hospitals that a clinical decision support (CDS) system is needed for its management. To effectively utilize the accumulating health data, we propose a CDS framework that can integrate heterogeneous health data from different sources, such as laboratory test results, basic patient information, and health records, into a consolidated representation of the features of all patients. Using the electronic health data so created, multilabel classification is employed to recommend a list of diseases and thus assist physicians in diagnosing or treating their patients' health issues more efficiently. Once the physician diagnoses a patient's disease, the next step is to consider the likely complications of that disease, which can lead to further diseases. Previous studies reveal that correlations do exist among some diseases. Considering these correlations, we improve a k-nearest neighbors algorithm for multilabel learning by using correlations among labels (CML-kNN). The CML-kNN algorithm first exploits the dependence between every pair of labels to update the original label matrix, then performs multilabel learning to estimate the probabilities of labels using the integrated features. Finally, it recommends the top N diseases to physicians. Experimental results on real health and medical data establish the effectiveness and practicability of the proposed CDS framework.
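The pipeline described for CML-kNN (update the label matrix with pairwise label dependence, then score labels via nearest neighbours) can be sketched roughly as below. The cosine co-occurrence measure and the 0.5 propagation weight are our own simplifications, not the paper's exact formulation:

```python
import numpy as np

def cml_knn_predict(X_train, Y_train, x, k=3, top_n=2):
    """Sketch of kNN multilabel recommendation with label correlations.

    Y_train: 0/1 patient-disease matrix (n_patients, n_diseases).
    Steps: (i) estimate pairwise label correlation from co-occurrence,
    (ii) propagate it into a soft label matrix, (iii) score labels by the
    k nearest neighbours' soft labels and return the top-N label indices.
    """
    # (i) cosine similarity between label columns as a correlation proxy
    norms = np.linalg.norm(Y_train, axis=0, keepdims=True) + 1e-12
    C = (Y_train.T @ Y_train) / (norms.T * norms)
    np.fill_diagonal(C, 0.0)
    # (ii) soft labels: original labels boosted by correlated labels
    Y_soft = np.clip(Y_train + 0.5 * Y_train @ C, 0.0, 1.0)
    # (iii) score labels by the k nearest neighbours in feature space
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]
    scores = Y_soft[nn].mean(axis=0)
    return np.argsort(-scores)[:top_n]   # indices of top-N recommended labels
```

The propagation step is what lets a diagnosed disease raise the score of its frequent complications even when a neighbour's record does not list them explicitly.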
AAMAS Conference 2018 Conference Paper
This paper studies a variation of the stochastic multi-armed bandit (MAB) problem in which the agent has prior knowledge of the Near-optimal Mean Reward (NoMR). We show that the cumulative regret of this bandit variation has a lower bound of Ω(1/∆), where ∆ is the gap between the optimal and the second-optimal mean reward. An algorithm called NoMR-Bandit is proposed for this variation, and we demonstrate that the cumulative regret of NoMR-Bandit has a uniform upper bound of O(1/∆). We conclude that NoMR-Bandit is optimal in terms of the order of its regret bounds.
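One way prior knowledge of NoMR can cut regret is by permanently eliminating any arm whose upper confidence bound falls below NoMR, since such an arm is provably not near-optimal. The sketch below illustrates only this elimination idea under our own assumptions; it is not the paper's NoMR-Bandit procedure:

```python
import numpy as np

def nomr_bandit(pull, n_arms, nomr, horizon):
    """Toy UCB bandit that discards arms whose UCB drops below NoMR.

    pull(arm) -> reward in [0, 1]. While several arms survive we play
    them round-robin; once an arm's upper confidence bound is below the
    known near-optimal mean reward, it can never be near-optimal and is
    eliminated for good. Sketch only.
    """
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    active = set(range(n_arms))
    history = []
    for t in range(horizon):
        if len(active) > 1:
            arm = sorted(active)[t % len(active)]   # round-robin exploration
        else:
            arm = next(iter(active))                # exploit the survivor
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        ucb = sums[arm] / counts[arm] + np.sqrt(2 * np.log(t + 2) / counts[arm])
        if len(active) > 1 and ucb < nomr:
            active.discard(arm)   # provably not near-optimal
        history.append(arm)
    return history, active
```

Because elimination is permanent, exploration cost depends on how fast the confidence bonus shrinks below the gap to NoMR rather than on the horizon.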
IROS Conference 2018 Conference Paper
Industrial robots are now widely used in applications such as welding, painting, and freight handling. Tool coordinate calibration must be redone after tool replacement caused by collision accidents or routine maintenance, and in current industrial practice operators rebuild the tool coordinates manually. In a smart factory, artificial intelligence methods should replace this manual procedure. This paper presents a system-independent method for automatic calibration of the tool coordinate system that is faster, simpler, cheaper, and more effective than the manual method. The proposed method captures images using two "eye to hand" cameras and one "eye in hand" camera. Tool position data is then acquired through the CamShift and MeanShift algorithms for image trajectory tracking, along with coordinate system conversion; methods such as PCA and LDA can process the vision data. A deep neural network (DNN) compensates the robot's error so that the tool can run automatically with the calibration system functions. We developed a 6 degrees of freedom (DoF) industrial robot for this experiment. Nine different DNN models are built, and with suitable tool coordinate error compensation for the current robot, tool calibration can be achieved adaptively and efficiently.
IJCAI Conference 2018 Conference Paper
Point-of-interest (POI) recommendation, i.e., recommending unvisited POIs to users, is a fundamental problem for location-based social networks. POI recommendation distinguishes itself from traditional item recommendation, e.g., movie recommendation, through the geographical influence among POIs. Existing methods model the geographical influence between two POIs as the probability or propensity that the two POIs are co-visited by the same user given their physical distance. These methods assume that the geographical influence between POIs is determined by their physical distance alone, failing to capture the asymmetry of geographical influence and its high variation across POIs. In this paper, we exploit POI-specific geographical influence to improve POI recommendation. We model the geographical influence between two POIs using three factors: the geo-influence of the source POI, the geo-susceptibility of the target POI, and their physical distance. Geo-influence captures a POI's capacity to exert geographical influence on other POIs, and geo-susceptibility reflects a POI's propensity for being geographically influenced by other POIs. Experimental results on two real-world datasets demonstrate that POI-specific geographical influence significantly improves the performance of POI recommendation.
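The three-factor model makes the influence score asymmetric: influence(i → j) depends on i's geo-influence and j's geo-susceptibility, so swapping the roles of the two POIs changes the score even at the same distance. A minimal sketch of this factorization; the exponential distance-decay kernel is our own assumption, not the paper's:

```python
import numpy as np

def geo_influence_score(g_i, s_j, dist_ij, decay=0.5):
    """Asymmetric POI-to-POI geographical influence.

    influence(i -> j) = geo_influence_i * geo_susceptibility_j * f(distance),
    with f here an exponential distance-decay kernel (an assumption for
    illustration). Since g_i * s_j generally differs from g_j * s_i, the
    score is directional, unlike a purely distance-based model.
    """
    return g_i * s_j * np.exp(-decay * dist_ij)
```

A landmark with large geo-influence but low geo-susceptibility would thus funnel visitors to nearby shops far more than the reverse.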
AAAI Conference 2018 Conference Paper
Cross-domain data reconstruction methods derive a shared transformation across source and target domains. These methods usually make a specific assumption on noise, which limits their ability when the target data are contaminated by different kinds of complex noise in practice. To enhance the robustness of domain adaptation under severe noise conditions, this paper proposes a novel reconstruction-based algorithm in an information-theoretic setting. Specifically, benefiting from the theoretical properties of correntropy, the proposed algorithm is distinguished by: detecting the contaminated target samples without making any specific assumption on the noise; and greatly suppressing the negative influence of noise on the cross-domain transformation. Moreover, a relative-entropy-based regularization of the transformation is incorporated to avoid trivial solutions, with the attendant theoretical advantages of non-negativity and scale-invariance. For optimization, a half-quadratic technique is developed to minimize the nonconvex information-theoretic objectives with explicitly guaranteed convergence. Experiments on two real-world domain adaptation tasks demonstrate the superiority of our method.
JBHI Journal 2018 Journal Article
In this paper, a novel algorithm based on a convolutional neural network (CNN) is proposed for myocardial infarction detection via multilead electrocardiogram (ECG). A beat segmentation algorithm utilizing multilead ECG is designed to obtain multilead beats, and fuzzy information granulation is adopted for preprocessing. Then, the beats are input into our multilead-CNN (ML-CNN), a novel model that includes sub two-dimensional (2-D) convolutional layers and lead asymmetric pooling (LAP) layers. As different leads represent various angles of the same heart, LAP can capture multiscale features of different leads, exploiting the individual characteristics of each lead. In addition, sub 2-D convolution can utilize the holistic characteristics of all the leads. It uses 1-D kernels shared among the different leads to generate locally optimal features. These strategies make the ML-CNN suitable for multilead ECG processing. To evaluate our algorithm, actual ECG datasets from the PTB diagnostic database are used. In the experiments, the sensitivity of our algorithm is 95.40%, the specificity is 97.37%, and the accuracy is 96.00%. Targeting lightweight mobile healthcare applications, real-time analyses are performed on both MATLAB and ARM Cortex-A9 platforms. The average processing times per heartbeat are approximately 17.10 ms and 26.75 ms, respectively, which indicates that this method has good potential for mobile healthcare applications.
IJCAI Conference 2018 Conference Paper
By optimizing probability distributions over discrete latent codes, Stochastic Generative Hashing (SGH) bypasses the critical and intractable binary constraints on hash codes. While encouraging results were reported, SGH still suffers from deficient usage of latent codes, i.e., there often exist many uninformative latent dimensions in the code space, a disadvantage inherited from its auto-encoding variational framework. Motivated by the fact that code redundancy is usually more severe when a more complex decoder network is used, in this paper we propose a constrained deep generative architecture that simplifies the decoder for data reconstruction. Specifically, our new framework forces the latent hashing codes not only to reconstruct the data through the generative network but also to retain a minimal squared L2 difference to the last real-valued hidden layer of the network. Furthermore, during posterior inference, we propose to regularize the standard auto-encoding objective with an additional term that explicitly accounts for the negative redundancy degree of the latent code dimensions. We interpret these modifications as Bayesian posterior regularization and design an adversarial strategy to optimize the generative, variational, and redundancy-resisting parameters. Empirical results show that our new method significantly boosts the quality of the learned codes and achieves state-of-the-art performance for image retrieval.
TIST Journal 2018 Journal Article
Massive public resume data emerging on the internet indicates individual-related characteristics in terms of profiles and career experiences. Resume Analysis (RA) provides opportunities for many applications, such as recruitment trend prediction, talent seeking, and evaluation. Existing RA studies either rely largely on the knowledge of domain experts, or leverage classic statistical or data mining models to identify and filter explicit attributes based on pre-defined rules. However, they fail to discover the latent semantic information in semi-structured resume text, i.e., individual career progress trajectories and social relations, which is otherwise vital to a comprehensive understanding of people's career evolution patterns. Besides, when dealing with large numbers of resumes, how to properly visualize such semantic information to reduce the information load and support better human cognition is also challenging. To tackle these issues, we propose a visual analytics system called ResumeVis to mine and visualize resume data. First, a text-mining-based approach is presented to extract the semantic information. Then, a set of visualizations is devised to represent the semantic information from multiple perspectives. Through interactive exploration of ResumeVis by domain experts, the following tasks can be accomplished: tracing individual career evolution trajectories; mining latent social relations among individuals; and grasping the full picture of massive resumes' collective mobility. Case studies with over 2,500 government officer resumes demonstrate the effectiveness of our system.
ICRA Conference 2018 Conference Paper
We present a robust and precise localization system that achieves centimeter-level localization accuracy in disparate city scenes. Our system adaptively uses information from complementary sensors such as GNSS, LiDAR, and IMU to achieve high localization accuracy and resilience in challenging scenes, such as urban downtown, highways, and tunnels. Rather than relying only on LiDAR intensity or 3D geometry, we make innovative use of LiDAR intensity and altitude cues to significantly improve localization system accuracy and robustness. Our GNSS RTK module utilizes the help of the multi-sensor fusion framework and achieves a better ambiguity resolution success rate. An error-state Kalman filter is applied to fuse the localization measurements from different sources with novel uncertainty estimation. We validate, in detail, the effectiveness of our approaches, achieving 5-10cm RMS accuracy and outperforming previous state-of-the-art systems. Importantly, our system, while deployed in a large autonomous driving fleet, made our vehicles fully autonomous in crowded city streets despite road construction that occurred from time to time. A dataset including more than 60 km real traffic driving in various urban roads is used to comprehensively test our system.
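The error-state Kalman filter at the core of such fusion modules estimates a small error around the nominal state and injects it back after each measurement. A minimal linear single-sensor sketch of one update step; the production system's state, sensors, and uncertainty models are far richer than this:

```python
import numpy as np

def eskf_update(x_nom, P, z, H, R):
    """One linear error-state Kalman update step.

    x_nom: nominal state estimate; P: error covariance.
    z: measurement; H: measurement matrix; R: measurement noise covariance.
    The filter estimates the error state dx from the innovation, injects it
    into the nominal state, and shrinks the covariance accordingly.
    """
    y = z - H @ x_nom                       # innovation (measurement residual)
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    dx = K @ y                              # estimated error state
    x_nom = x_nom + dx                      # inject error into nominal state
    P = (np.eye(P.shape[0]) - K @ H) @ P    # posterior covariance
    return x_nom, P
```

In a multi-sensor setting the same update is applied per source (GNSS position, LiDAR match, etc.), each with its own H and R, which is how measurements of differing quality are weighted against one another.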
NeurIPS Conference 2017 Conference Paper
With the goal of making high-resolution forecasts of regional rainfall, precipitation nowcasting has become an important and fundamental technology underlying various public services ranging from rainstorm warnings to flight safety. Recently, the Convolutional LSTM (ConvLSTM) model has been shown to outperform traditional optical flow based methods for precipitation nowcasting, suggesting that deep learning models have a huge potential for solving the problem. However, the convolutional recurrence structure in ConvLSTM-based models is location-invariant while natural motion and transformation (e.g., rotation) are location-variant in general. Furthermore, since deep-learning-based precipitation nowcasting is a newly emerging area, clear evaluation protocols have not yet been established. To address these problems, we propose both a new model and a benchmark for precipitation nowcasting. Specifically, we go beyond ConvLSTM and propose the Trajectory GRU (TrajGRU) model that can actively learn the location-variant structure for recurrent connections. Besides, we provide a benchmark that includes a real-world large-scale dataset from the Hong Kong Observatory, a new training loss, and a comprehensive evaluation protocol to facilitate future research and gauge the state of the art.
AAAI Conference 2017 Conference Paper
As a fundamental constituent of machine learning, domain adaptation generalizes a learning model from a source domain to a different (but related) target domain. In this paper, we focus on semi-supervised domain adaptation and explicitly extend the applied range of unlabeled target samples into the combination of distribution alignment and adaptive classifier learning. Specifically, our extension formulates the following aspects in a single optimization: 1) learning a cross-domain predictive model by developing the Fredholm integral based kernel prediction framework; 2) reducing the distribution difference between two domains; 3) exploring multiple kernels to induce an optimal learning space. Correspondingly, such an extension is distinguished by allowing for noise resiliency, facilitating knowledge transfer and analyzing diverse data characteristics. It is emphasized that we prove the differentiability of our formulation and present an effective optimization procedure based on the reduced gradient, guaranteeing rapid convergence. Comprehensive empirical studies verify the effectiveness of the proposed method.
AAAI Conference 2017 Conference Paper
Link prediction is a fundamental task in such areas as social network analysis, information retrieval, and bioinformatics. Usually link prediction methods use the link structures or node attributes as the sources of information. Recently, the relational topic model (RTM) and its variants have been proposed as hybrid methods that jointly model both sources of information and achieve very promising accuracy. However, the representations (features) learned by them are still not effective enough to represent the nodes (items). To address this problem, we generalize recent advances in deep learning from solely modeling i.i.d. sequences of attributes to jointly modeling graphs and non-i.i.d. sequences of attributes. Specifically, we follow the Bayesian deep learning framework and devise a hierarchical Bayesian model, called relational deep learning (RDL), to jointly model high-dimensional node attributes and link structures with layers of latent variables. Due to the multiple nonlinear transformations in RDL, standard variational inference is not applicable. We propose to utilize the product of Gaussians (PoG) structure in RDL to relate the inferences on different variables and derive a generalized variational inference algorithm for learning the variables and predicting the links. Experiments on three real-world datasets show that RDL works surprisingly well and significantly outperforms the state of the art.
AAMAS Conference 2016 Conference Paper
We consider multi-agent decision making problems in which agents need to communicate with other agents to make socially optimal decisions but, at the same time, have some private information that they do not want to share. Abstract argumentation has been widely used in both single-agent and multi-agent decision making problems, because of its ability for reasoning with incomplete and conflicting information. In this work, we propose an abstract argumentation-based knowledge representation and communication protocol, such that agents can find socially optimal strategies by only disclosing the ‘necessary’ and ‘disclosable’ information. We prove that our protocol is sound, efficient, of perfect information security and guaranteed to terminate.
JBHI Journal 2016 Journal Article
Intrabody communication has been of great research interest in recent years. This paper proposes a novel, compact but accurate body transmission channel model based on RC distribution networks and transmission line theory. The comparison between simulation and measurement results indicates that the proposed approach accurately models the body channel characteristics. In addition, the impedance-matching networks at the transmitter output and the receiver input further maximize the power transferred to the receiver, relax the receiver complexity, and increase the transmission performance. Based on the simulation results, the power gain can be increased by up to 16 dB after matching. A binary phase-shift keying modulation scheme is also used to evaluate the bit-error-rate improvement.
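The benefit of matching can be illustrated with the standard mismatch-factor formula, 4·Rs·Rl / |Zs + Zl|², which gives the fraction of the source's available power that reaches the load (1.0 under a conjugate match). The impedance values below are invented for illustration; the paper's 16 dB figure comes from its simulated body-channel model, not from this sketch:

```python
import math

def delivered_fraction(zs, zl):
    """Fraction of the source's available power delivered to the load,
    for source impedance zs and load impedance zl (complex ohms)."""
    return 4 * zs.real * zl.real / abs(zs + zl) ** 2

# Hypothetical impedances standing in for the body channel:
zs = complex(50, -300)   # capacitive source seen by the transmitter
zl = complex(1000, 0)    # receiver input before matching

before = delivered_fraction(zs, zl)
# Conjugate matching (zl = conj(zs)) delivers all available power.
after = delivered_fraction(zs, zs.conjugate())
gain_db = 10 * math.log10(after / before)
print(round(gain_db, 1))
```

The same computation with the paper's modeled channel impedances is what yields its reported improvement of up to 16 dB.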
NeurIPS Conference 2016 Conference Paper
Hybrid methods that utilize both content and rating information are commonly used in many recommender systems. However, most of them use either handcrafted features or the bag-of-words representation as a surrogate for the content information, but these surrogates are neither effective nor natural enough. To address this problem, we develop a collaborative recurrent autoencoder (CRAE) which is a denoising recurrent autoencoder (DRAE) that models the generation of content sequences in the collaborative filtering (CF) setting. The model generalizes recent advances in recurrent deep learning from i.i.d. input to non-i.i.d. (CF-based) input and provides a new denoising scheme along with a novel learnable pooling scheme for the recurrent autoencoder. To do this, we first develop a hierarchical Bayesian model for the DRAE and then generalize it to the CF setting. The synergy between denoising and CF enables CRAE to make accurate recommendations while learning to fill in the blanks in sequences. Experiments on real-world datasets from different domains (CiteULike and Netflix) show that, by jointly modeling the order-aware generation of sequences for the content information and performing CF for the ratings, CRAE is able to significantly outperform the state of the art on both the recommendation task based on ratings and the sequence generation task based on content information.
AAAI Conference 2016 Conference Paper
There are two classes of average reward reinforcement learning (RL) algorithms: model-based ones that explicitly maintain MDP models and model-free ones that do not learn such models. Though model-free algorithms are known to be more efficient, they often cannot converge to optimal policies due to the perturbation of parameters. In this paper, a novel model-free algorithm is proposed, which makes use of constant shifting values (CSVs) estimated from prior knowledge. To encourage exploration during the learning process, the algorithm constantly subtracts the CSV from the rewards. A terminating condition is proposed to handle the unboundedness of Q-values caused by such subtraction. The convergence of the proposed algorithm is proved under very mild assumptions. Furthermore, linear function approximation is investigated to generalize our method to handle large-scale tasks. Extensive experiments on representative MDPs and the popular game Tetris show that the proposed algorithms significantly outperform the state-of-the-art ones.
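The core mechanism, subtracting a constant shifting value from every observed reward in undiscounted tabular Q-learning, can be sketched on a toy problem. The two-state chain, epsilon-greedy schedule, and hyperparameters below are invented for illustration and are not the paper's algorithm or its experimental setup:

```python
import random

def csv_q_learning(csv_shift, steps=2000, alpha=0.1, seed=0):
    """Undiscounted tabular Q-learning on a toy 2-state chain in which
    a constant shifting value (CSV) is subtracted from every reward.
    States are 0 and 1; action 1 moves to the other state, action 0
    stays; moving out of state 1 yields reward 1, everything else 0,
    so the optimal average reward is 0.5."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    s = 0
    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < 0.1:
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda x: q[(s, x)])
        s2 = 1 - s if a == 1 else s
        r = 1.0 if (s == 1 and a == 1) else 0.0
        # subtract the CSV from the reward; no discount factor,
        # as in average-reward Q-learning
        target = (r - csv_shift) + max(q[(s2, 0)], q[(s2, 1)])
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = s2
    return q
```

With the CSV set to the optimal average reward (0.5 here), the shifted rewards average zero under the optimal policy, so the Q-values stay bounded while still ranking "move" above "stay" in both states; with a poorly chosen CSV they drift, which is what the paper's terminating condition guards against.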
AAMAS Conference 2016 Conference Paper
Markov decision processes (MDPs) have been studied for many decades. Recent research in using transfer learning methods to solve MDPs has shown that knowledge learned from one MDP may be used to solve a similar MDP better. In this paper, we propose two metrics for measuring the distance between finite MDPs. Our metrics are based on the Hausdorff metric, which measures the distance between two subsets of a metric space, and the Kantorovich metric for measuring the distance between probability distributions. Our metrics can be used to compute the distance between reinforcement learning tasks that are modeled as MDPs. The second contribution of this paper is that we apply the metrics to direct transfer learning by finding similar source tasks. Our third contribution is that we propose two knowledge transfer methods which transfer value functions of the selected source tasks to the target task. Extensive experimental results show that our metrics are effective in finding similar tasks and significantly improve the performance of transfer learning with the transfer methods.
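The Hausdorff ingredient is straightforward to compute for finite sets: each set asks how far its worst-placed point is from the other set, and the metric takes the larger of the two answers. In the paper the base distance between MDP states involves the Kantorovich metric over transition distributions; the sketch below uses plain absolute difference on numbers just to show the max-min structure:

```python
def hausdorff(xs, ys, dist=lambda a, b: abs(a - b)):
    """Hausdorff distance between two finite sets under a base metric:
    the larger of the two directed distances
    max_x min_y d(x, y) and max_y min_x d(x, y)."""
    d_xy = max(min(dist(x, y) for y in ys) for x in xs)
    d_yx = max(min(dist(x, y) for x in xs) for y in ys)
    return max(d_xy, d_yx)

# {0, 1} vs {0, 5}: every point of the first set is within 1 of the
# second, but 5 is 4 away from its nearest neighbour in the first.
print(hausdorff([0, 1], [0, 5]))  # 4
```

Swapping the base `dist` for a Kantorovich distance between transition distributions is, roughly, how a set-level metric over states lifts to a distance between whole MDPs.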
NeurIPS Conference 2016 Conference Paper
Neural networks (NN) have achieved state-of-the-art performance in various applications. Unfortunately, in applications where training data are insufficient, they are often prone to overfitting. One effective way to alleviate this problem is to exploit the Bayesian approach by using Bayesian neural networks (BNN). Another shortcoming of NN is the lack of flexibility to customize different distributions for the weights and neurons according to the data, as is often done in probabilistic graphical models. To address these problems, we propose a class of probabilistic neural networks, dubbed natural-parameter networks (NPN), as a novel and lightweight Bayesian treatment of NN. NPN allows arbitrary exponential-family distributions to be used to model the weights and neurons. Different from traditional NN and BNN, NPN takes distributions as input and passes them through layers of transformations before producing distributions to match the target output distributions. As a Bayesian treatment, efficient backpropagation (BP) is performed to learn the natural parameters for the distributions over both the weights and neurons. The output distributions of each layer, as byproducts, may be used as second-order representations for the associated tasks such as link prediction. Experiments on real-world datasets show that NPN can achieve state-of-the-art performance.
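The "distributions in, distributions out" idea can be sketched by moment propagation through a single linear unit with independent Gaussian weights and inputs. NPN itself works with natural parameters of arbitrary exponential-family distributions and includes nonlinear activations; this Gaussian special case is only illustrative:

```python
def gaussian_linear(x_mean, x_var, w_mean, w_var):
    """Moment propagation through one linear unit, z = sum_i w_i * x_i,
    with all w_i and x_i independent Gaussians.  For a product of two
    independent Gaussians, E[wx] = E[w]E[x] and
    Var[wx] = Var[w]Var[x] + Var[w]E[x]^2 + E[w]^2 Var[x],
    and the variances of independent terms add."""
    mean = sum(wm * xm for wm, xm in zip(w_mean, x_mean))
    var = sum(wv * xv + wv * xm ** 2 + wm ** 2 * xv
              for wm, wv, xm, xv in zip(w_mean, w_var, x_mean, x_var))
    return mean, var

m, v = gaussian_linear([1.0, 2.0], [0.1, 0.1], [0.5, -0.5], [0.2, 0.2])
print(m, v)
```

The output pair (mean, variance) is itself a distribution and can feed the next layer; the layer-wise variances are the "second-order representations" the abstract mentions.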
NeurIPS Conference 2015 Conference Paper
The goal of precipitation nowcasting is to predict the future rainfall intensity in a local region over a relatively short period of time. Very few previous studies have examined this crucial and challenging weather forecasting problem from the machine learning perspective. In this paper, we formulate precipitation nowcasting as a spatiotemporal sequence forecasting problem in which both the input and the prediction target are spatiotemporal sequences. By extending the fully connected LSTM (FC-LSTM) to have convolutional structures in both the input-to-state and state-to-state transitions, we propose the convolutional LSTM (ConvLSTM) and use it to build an end-to-end trainable model for the precipitation nowcasting problem. Experiments show that our ConvLSTM network captures spatiotemporal correlations better and consistently outperforms FC-LSTM and the state-of-the-art operational ROVER algorithm for precipitation nowcasting.
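The key change over FC-LSTM can be written out directly: the matrix products in the gate equations become convolutions ($*$), while the Hadamard product ($\circ$) is kept for the peephole and gating terms, so the inputs $\mathcal{X}_t$, cell states $\mathcal{C}_t$, and hidden states $\mathcal{H}_t$ are all 3-D tensors over a spatial grid:

```latex
\begin{aligned}
i_t &= \sigma\bigl(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1} + W_{ci} \circ \mathcal{C}_{t-1} + b_i\bigr)\\
f_t &= \sigma\bigl(W_{xf} * \mathcal{X}_t + W_{hf} * \mathcal{H}_{t-1} + W_{cf} \circ \mathcal{C}_{t-1} + b_f\bigr)\\
\mathcal{C}_t &= f_t \circ \mathcal{C}_{t-1} + i_t \circ \tanh\bigl(W_{xc} * \mathcal{X}_t + W_{hc} * \mathcal{H}_{t-1} + b_c\bigr)\\
o_t &= \sigma\bigl(W_{xo} * \mathcal{X}_t + W_{ho} * \mathcal{H}_{t-1} + W_{co} \circ \mathcal{C}_t + b_o\bigr)\\
\mathcal{H}_t &= o_t \circ \tanh(\mathcal{C}_t)
\end{aligned}
```

Because each state is updated from a local neighbourhood of its inputs and past states, the cell captures spatiotemporal correlations that the fully connected transitions of FC-LSTM cannot.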
AAAI Conference 2015 Conference Paper
Tag recommendation has become one of the most important ways of organizing and indexing online resources like articles, movies, and music. Since tagging information is usually very sparse, effective learning of the content representation for these resources is crucial to accurate tag recommendation. Recently, models proposed for tag recommendation, such as collaborative topic regression and its variants, have demonstrated promising accuracy. However, a limitation of these models is that, by using topic models like latent Dirichlet allocation as the key component, the learned representation may not be compact and effective enough. Moreover, since relational data exist as an auxiliary data source in many applications, it is desirable to incorporate such data into tag recommendation models. In this paper, we start with a deep learning model called stacked denoising autoencoder (SDAE) in an attempt to learn more effective content representation. We propose a probabilistic formulation for SDAE and then extend it to a relational SDAE (RSDAE) model. RSDAE jointly performs deep representation learning and relational learning in a principled way under a probabilistic framework. Experiments conducted on three real datasets show that both learning more effective representation and learning from relational data are beneficial steps to take to advance the state of the art.
AAAI Conference 2015 Conference Paper
Learning an appropriate feature representation across source and target domains is one of the most effective solutions to domain adaptation problems. Conventional cross-domain feature learning methods rely on the Reproducing Kernel Hilbert Space (RKHS) induced by a single kernel. Recently, Multiple Kernel Learning (MKL), which bases classifiers on combinations of kernels, has shown improved performance on tasks without distribution differences between domains. In this paper, we generalize the framework of MKL for cross-domain feature learning and propose a novel Transfer Feature Representation (TFR) algorithm. TFR learns a convex combination of multiple kernels and a linear transformation in a single optimization which integrates the minimization of distribution difference with the preservation of discriminating power across domains. As a result, standard machine learning models trained in the source domain can be reused for the target domain data. After being rewritten into a differentiable formulation, TFR can be optimized by a reduced gradient method and is guaranteed to converge. Experiments in two real-world applications verify the effectiveness of our proposed method.
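The MKL building block, a convex combination of base kernels, can be sketched directly on precomputed kernel matrices. The weights below are fixed by hand purely for illustration; TFR learns them jointly with the linear transformation in its single optimization:

```python
def combined_kernel(kernels, weights):
    """Convex combination K = sum_m d_m K_m of precomputed kernel
    matrices (lists of lists), with weights d_m >= 0 summing to 1.
    A convex combination of positive semidefinite kernels is itself
    a valid (positive semidefinite) kernel."""
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels))
             for j in range(n)]
            for i in range(n)]

K1 = [[1.0, 0.2], [0.2, 1.0]]   # e.g. a linear kernel on two points
K2 = [[1.0, 0.8], [0.8, 1.0]]   # e.g. an RBF kernel on the same points
K = combined_kernel([K1, K2], [0.25, 0.75])
print(K)
```

Learning the simplex weights amounts to choosing which base kernels' notions of similarity dominate the shared RKHS.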
AAAI Conference 2014 Conference Paper
Supervised metric learning plays a substantial role in statistical classification. Conventional metric learning algorithms have limited utility when the training data and testing data are drawn from related but different domains (i.e., source domain and target domain). Although some progress has been made on this issue in feature-based transfer learning, most of the work in this area suffers from non-trivial optimization and pays little attention to preserving the discriminating information. In this paper, we propose a novel metric learning algorithm to transfer knowledge from the source domain to the target domain in an information-theoretic setting, where a shared Mahalanobis distance across two domains is learnt by combining three goals together: 1) reducing the distribution difference between different domains; 2) preserving the geometry of target domain data; 3) aligning the geometry of source domain data with its label information. Based on this combination, the learnt Mahalanobis distance effectively transfers the discriminating power and propagates standard classifiers across these two domains. More importantly, our proposed method has a closed-form solution and can be efficiently optimized. Experiments in two real-world applications demonstrate the effectiveness of our proposed method.
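The object being learnt is a shared Mahalanobis distance, i.e. a quadratic form (x - y)ᵀ M (x - y) with M positive definite. A 2-D sketch with a hand-picked M, whereas the paper learns M by combining the three goals listed above:

```python
def mahalanobis_sq(x, y, m):
    """Squared Mahalanobis distance (x - y)^T M (x - y) between two
    2-D points, for a positive-definite matrix M given as nested
    lists.  M reweights directions: large entries stretch an axis,
    small entries shrink it."""
    d = [x[0] - y[0], x[1] - y[1]]
    return (d[0] * (m[0][0] * d[0] + m[0][1] * d[1])
            + d[1] * (m[1][0] * d[0] + m[1][1] * d[1]))

M = [[2.0, 0.0], [0.0, 0.5]]   # stretch the first axis, shrink the second
print(mahalanobis_sq([1.0, 0.0], [0.0, 0.0], M))  # 2.0
print(mahalanobis_sq([0.0, 1.0], [0.0, 0.0], M))  # 0.5
```

Learning M so that same-class pairs are close and cross-domain distributions overlap is what lets a classifier trained under this distance carry over to the target domain.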
IJCAI Conference 2013 Conference Paper
Recently, tag recommendation (TR) has become a very hot research topic in data mining and related areas. However, neither co-occurrence based methods which only use the item-tag matrix nor content based methods which only use the item content information can achieve satisfactory performance in real TR applications. Hence, how to effectively combine the item-tag matrix, item content information, and other auxiliary information into the same recommendation framework is the key challenge for TR. In this paper, we first adapt the collaborative topic regression (CTR) model, which has been successfully applied for article recommendation, to combine both item-tag matrix and item content information for TR. Furthermore, by extending CTR we propose a novel hierarchical Bayesian model, called CTR with social regularization (CTR-SR), to seamlessly integrate the item-tag matrix, item content information, and social networks between items into the same principled model. Experiments on real data demonstrate the effectiveness of our proposed models.
IJCAI Conference 2013 Conference Paper
With the emergence of large-scale evolving (time-varying) networks, dynamic network analysis (DNA) has become a very hot research topic in recent years. Although many DNA methods have been proposed by researchers from different communities, most of them can only model snapshot data recorded at a very rough temporal granularity. Recently, some models have been proposed for DNA which can be used to model large-scale citation networks at a fine temporal granularity. However, they suffer from a significant decrease of accuracy over time because the learned parameters or node features are static (fixed) during the prediction process for evolving citation networks. In this paper, we propose a novel model, called online egocentric model (OEM), to learn time-varying parameters and node features for evolving citation networks. Experimental results on real-world citation networks show that our OEM can not only prevent the prediction accuracy from decreasing over time but also uncover the evolution of topics in citation networks.
NeurIPS Conference 2007 Conference Paper
Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use by performing a row-based, approximate matrix factorization, and which loads only essential data onto each machine to perform parallel computation. Let $n$ denote the number of training instances, $p$ the reduced matrix dimension after factorization ($p$ is significantly smaller than $n$), and $m$ the number of machines. PSVM reduces the memory requirement from $O(n^2)$ to $O(np/m)$, and improves computation time to $O(np^2/m)$. Empirical studies on up to $500$ computers show PSVM to be effective.
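The memory saving follows directly from the stated bounds: each machine holds roughly n·p/m entries of the factorized matrix instead of the n² entries of the full kernel matrix. A quick check with illustrative sizes (the concrete numbers are not from the paper):

```python
def psvm_memory_ratio(n, p, m):
    """Per-machine memory of the factorized kernel, n*p/m entries,
    relative to the n*n entries of the full kernel matrix, using the
    abstract's symbols: n training instances, reduced dimension
    p << n, and m machines."""
    return (n * p / m) / (n * n)

# e.g. one million instances, p = 1000, 100 machines: each machine
# stores a 10^-5 fraction of the full kernel matrix.
print(psvm_memory_ratio(10**6, 10**3, 100))  # 1e-05
```

Equivalently, the per-machine footprint shrinks by a factor of n·m/p, which is what makes the kernel computation fit in memory at this scale.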