EAAI Journal 2026 Journal Article
A cross-scale interaction framework combining Mamba and Convolutional Neural Networks for Arbitrary-Scale Super-Resolution of infrared images
- Feiwei Qin
- Xinyu Cao
- Changmiao Wang
- Kai Zhang
- Yong Peng
- Jing Bai
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
TMLR Journal 2026 Journal Article
Currently, the field of structure-based drug design (SBDD) is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill this gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, a strategy that is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure-based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine the strengths of different approaches while addressing their limitations. All the code used for benchmarking is available at https://github.com/zkysfls/2025-sbdd-benchmark
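The point about treating docking as a black-box oracle lends itself to a short illustration. Below is a minimal sketch, assuming a hypothetical `generate_smiles` generator and `dock_score` oracle (neither comes from the benchmarked models), of how a 1D/2D ligand-centric generator can be screened against a target without the generator itself being 3D-aware.

```python
# Minimal sketch of using a docking function as a black-box oracle so that
# 1D/2D ligand-centric generators can be applied to SBDD.
# `generate_smiles` and `dock_score` are hypothetical placeholders.
from typing import Callable, List, Tuple

def screen_with_oracle(
    generate_smiles: Callable[[int], List[str]],   # any 1D/2D molecule generator
    dock_score: Callable[[str], float],            # black-box docking oracle (lower = better)
    n_candidates: int = 1000,
    top_k: int = 50,
) -> List[Tuple[str, float]]:
    """Generate ligands, score them with the oracle, and keep the best candidates."""
    candidates = generate_smiles(n_candidates)
    scored = [(smi, dock_score(smi)) for smi in candidates]
    scored.sort(key=lambda pair: pair[1])          # most favorable docking score first
    return scored[:top_k]
```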
AAAI Conference 2026 Conference Paper
Human-Scene Interaction (HSI) seeks to generate realistic human behaviors within complex environments, yet it faces significant challenges in handling long-horizon, high-level tasks and generalizing to unseen scenes. To address these limitations, we introduce FantasyHSI, a novel HSI framework centered on video generation and multi-agent systems that operates without paired data. We model the complex interaction process as a dynamic directed graph, upon which we build a collaborative multi-agent system. This system comprises a scene navigator agent for environmental perception and high-level path planning, and a planning agent that decomposes long-horizon goals into atomic actions. Critically, we introduce a critic agent that establishes a closed-loop feedback mechanism by evaluating the deviation between generated actions and the planned path. This allows for the dynamic correction of trajectory drifts caused by the stochasticity of the generative model, thereby ensuring long-term logical consistency. To enhance the physical realism of the generated motions, we leverage Direct Preference Optimization (DPO) to train the action generator, significantly reducing artifacts such as limb distortion and foot-sliding. Extensive experiments on our custom SceneBench benchmark demonstrate that FantasyHSI significantly outperforms existing methods in terms of generalization, long-horizon task completion, and physical realism.
YNIMG Journal 2026 Journal Article
EAAI Journal 2026 Journal Article
IS Journal 2026 Journal Article
This article proposes a novel approach for image ordinal estimation, leveraging the power of optimal transport (OT) and prompt learning. Traditional ordinal regression methods primarily focus on learning a model to predict numerical scores, which may not directly reflect the intrinsic order. To address this limitation, we introduce a framework, termed ordinal prompt-regularized graph optimal transport (OPGOT), which utilizes OT to align the distribution of images and that of ordinal labels. First, we incorporate prompt learning with pretrained text encoders to construct ordinal prompts through a token-wise distance-based weighting scheme, enabling the model to capture the semantic relationships between ordinal categories. Second, OPGOT matches the graphs of image features and prompt embeddings via optimizing the OT with language-image cost. Hence, the learned transport plan reflects the intrinsic ordinal relationships. We conduct extensive evaluations on four benchmark datasets of different scenarios, demonstrating that OPGOT achieves significant improvements against existing methods.
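The OT alignment at the heart of OPGOT can be illustrated with plain entropic optimal transport. The sketch below, a standard Sinkhorn iteration with uniform marginals, is an assumed simplification: the paper's graph matching and language-image cost design are not reproduced.

```python
# Minimal sketch of entropic optimal transport (Sinkhorn iterations) between
# image features and ordinal prompt embeddings; illustrative only.
import math
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """cost: (n, m) image-to-prompt cost matrix; returns a transport plan with uniform marginals."""
    n, m = cost.shape
    log_K = -cost / eps                              # Gibbs kernel in log space
    log_a = torch.full((n,), -math.log(n))           # uniform source marginal
    log_b = torch.full((m,), -math.log(m))           # uniform target marginal
    log_u = torch.zeros(n)
    log_v = torch.zeros(m)
    for _ in range(n_iters):
        log_u = log_a - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_K + log_u[:, None], dim=0)
    return torch.exp(log_u[:, None] + log_K + log_v[None, :])
```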
AAAI Conference 2026 Conference Paper
Large Language Models (LLMs) have achieved remarkable success in recent years, owing to their impressive generalization capabilities and rich world knowledge. To capitalize on the potential of using LLMs as recommender systems, mainstream approaches typically focus on two paradigms. The first paradigm designs multi-domain or multi-task instruction data for generalizable recommendation, so as to align LLMs with general recommendation areas and deal with cold-start recommendation. The second paradigm focuses on enhancing domain-specific recommendation tasks, improving performance in warm recommendation scenarios. While most previous works treat these two paradigms separately, we argue that they have complementary advantages, and combining them can yield better results. In this paper, we propose a generalizable and efficient LLM-based recommendation framework RecCocktail. Our approach begins with fine-tuning a "base spirit" LoRA module using domain-general recommendation instruction data to align the LLM with recommendation knowledge. Next, given users' behavior in a specific domain, we construct a domain-specific "ingredient" LoRA module. We then provide an entropy-guided adaptive merging method to mix the "base spirit" and the "ingredient" in the weight space. Note that RecCocktail combines the advantages of the existing two paradigms without introducing additional time or space overhead during the inference phase. Moreover, RecCocktail is efficient and plug-and-play, as the "base spirit" LoRA is trained only once, and any domain-specific "ingredient" can be efficiently mixed with only domain-specific fine-tuning. Extensive experiments on multiple datasets under both warm and cold-start recommendation scenarios validate the effectiveness and generality of the proposed RecCocktail.
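The weight-space mixing of the "base spirit" and "ingredient" LoRA modules can be sketched as a per-layer interpolation of low-rank deltas. The entropy-based coefficient below is an illustrative assumption, not the paper's exact entropy-guided merging rule.

```python
# Minimal sketch (not the paper's exact formula) of merging a domain-general
# "base spirit" LoRA with a domain-specific "ingredient" LoRA in weight space.
import torch

def lora_delta(A: torch.Tensor, B: torch.Tensor, scale: float) -> torch.Tensor:
    """Low-rank update Delta W = scale * B @ A for one linear layer."""
    return scale * (B @ A)

def merge_loras(base: dict, ingredient: dict, entropy: float, max_entropy: float) -> dict:
    """Mix two per-layer LoRA deltas; higher prediction entropy -> lean on the base (assumed heuristic)."""
    alpha = min(max(entropy / max_entropy, 0.0), 1.0)   # coefficient in [0, 1]
    merged = {}
    for name in base:
        merged[name] = alpha * base[name] + (1.0 - alpha) * ingredient[name]
    return merged
```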
TMLR Journal 2026 Journal Article
Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds more than 7× faster than the previous state-of-the-art method, while requiring 4× less VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. Our code is available at https://github.com/XiaojieGu/UltraEdit
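A rough picture of a one-step, training-free parameter shift computed from a hidden state, paired with a lifelong normalization of features, might look like the sketch below; the closed-form rank-one update and the running statistics are illustrative stand-ins rather than the exact UltraEdit equations.

```python
# Minimal sketch of a one-step parameter shift from a hidden state and an
# output-side residual, plus running feature statistics updated across turns.
import torch

def one_step_shift(W: torch.Tensor, h: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    """Rank-one edit moving the layer's output on hidden state h toward the desired output.

    W: (d_out, d_in) projection weight; h: (d_in,); residual: (d_out,) desired output change.
    """
    delta = torch.outer(residual, h) / (h @ h + 1e-8)   # closed-form least-squares shift
    return W + delta

class LifelongNorm:
    """Running feature statistics updated across edit turns (lifelong normalization sketch)."""
    def __init__(self, dim: int):
        self.mean = torch.zeros(dim)
        self.var = torch.ones(dim)
        self.count = 0

    def update(self, h: torch.Tensor) -> torch.Tensor:
        self.count += 1
        delta = h - self.mean
        self.mean += delta / self.count                              # running mean
        self.var += (delta * (h - self.mean) - self.var) / self.count  # running variance
        return (h - self.mean) / torch.sqrt(self.var + 1e-6)
```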
NeurIPS Conference 2025 Conference Paper
Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass in less than 1.5 seconds on a single A100 GPU.
NeurIPS Conference 2025 Conference Paper
The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel and unified framework for video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: vfbench.github.io.
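The optical flow-based feature warping that UniVF relies on can be sketched with standard backward warping via `grid_sample`; the flow estimator itself is assumed to be provided elsewhere, and this is not the paper's implementation.

```python
# Minimal sketch of optical-flow-based feature warping for temporally coherent
# video fusion, using backward warping with grid_sample.
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp features of frame t-1 toward frame t with flow (B, 2, H, W) given in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]              # sampling x positions
    grid_y = ys.unsqueeze(0) + flow[:, 1]              # sampling y positions
    grid_x = 2.0 * grid_x / (w - 1) - 1.0              # normalize to [-1, 1]
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)
```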
AAAI Conference 2025 Conference Paper
In recent years, deep multimodal learning has seen significant advancements. However, there remains a lack of multimodal fusion methods capable of dynamically adjusting the weighting of information both within and across modalities based on input samples. In the domain of multimodal intent recognition, the text modality often contains the most relevant information for intent detection, while the audio and visual modalities provide comparatively less critical information. There is a significant variation in the density of important information across different modalities and samples. To address this challenge, we propose a Dynamic Attention Allocation Fusion (DAF) method with an adaptive network structure that dynamically allocates attention both within individual modalities and across multiple modalities. This approach enables the model to focus more effectively on the most informative modalities and their respective internal features. Furthermore, we introduce a multi-view contrastive learning framework based on DAF (MVCL-DAF). This framework uses distinct and isolated modules to process information from various modalities, taking inspiration from the way the human brain processes multimodal information. Each modality independently infers intent using its respective module, while DAF integrates the multimodal information to produce a comprehensive global intent prediction. The text modality, functioning as the primary modality due to its rich semantic content, guides the other modules in the multi-view contrastive learning process. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods.
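A minimal sketch of per-sample modality weighting in the spirit of Dynamic Attention Allocation is given below; the gating network, dimensions, and three-modality setup are illustrative assumptions rather than the paper's architecture.

```python
# Minimal sketch of dynamically weighting modalities per sample with a learned gate.
import torch
import torch.nn as nn

class DynamicModalityFusion(nn.Module):
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * num_modalities, num_modalities),
            nn.Softmax(dim=-1),
        )

    def forward(self, text: torch.Tensor, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        """Each input is (B, dim); returns an attention-weighted fused vector (B, dim)."""
        stacked = torch.stack([text, audio, video], dim=1)          # (B, 3, dim)
        weights = self.gate(stacked.flatten(1)).unsqueeze(-1)       # (B, 3, 1), per-sample weights
        return (weights * stacked).sum(dim=1)
```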
IROS Conference 2025 Conference Paper
Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning, demonstrating remarkable capabilities in 3D scene understanding. However, current 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies: excessive computational overhead and redundant information flows. Unlike 2D VLMs processing single images, 3D LMMs exhibit inherent architectural redundancy due to the heterogeneous mechanisms between spatial tokens and visual tokens. To address this challenge, we propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens through spatial contribution analysis. Our method automatically tailors pruning strategies to different 3D LMM architectures by quantifying token-level information flows via attention pattern mining. Extensive experiments on LLaVA-3D (a 7B parameter 3D-LMM) demonstrate that AdaToken-3D achieves 21% faster inference speed and 63% FLOPs reduction while maintaining original task accuracy. Beyond efficiency gains, this work systematically investigates redundancy patterns in multimodal spatial information flows through quantitative token interaction analysis. Our findings reveal that over 60% of spatial tokens contribute minimally (<5%) to the final predictions, establishing theoretical foundations for efficient 3D multimodal learning.
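The attention-contribution pruning idea can be sketched as keeping only the spatial tokens that receive the most attention in a layer; the actual AdaToken-3D criterion mined from information flows may differ.

```python
# Minimal sketch of pruning low-contribution spatial tokens by their average received attention.
import torch

def prune_spatial_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.4):
    """tokens: (B, N, D); attn: (B, heads, N, N) attention weights from one layer."""
    contribution = attn.mean(dim=1).mean(dim=1)            # (B, N): avg attention each token receives
    k = max(1, int(keep_ratio * tokens.shape[1]))
    keep_idx = contribution.topk(k, dim=-1).indices        # most informative tokens
    keep_idx, _ = keep_idx.sort(dim=-1)                    # preserve original token order
    batch_idx = torch.arange(tokens.shape[0]).unsqueeze(-1)
    return tokens[batch_idx, keep_idx], keep_idx
```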
JBHI Journal 2025 Journal Article
EEG-based seizure prediction enables timely treatment for patients, but its performance is limited by the difficulty in effectively characterizing the temporal dynamics of epileptic brain networks. Metastability, which describes recurring topographical patterns of spontaneous neural activity over time, provides a unique perspective for capturing the dynamic evolution before seizure onset. In this study, we propose a seizure prediction model that fuses consistent epileptic network processes across subjects into a higher-order latent space. Specifically, we first construct metastable transition patterns to identify the recurrent network states over time. Through adversarial feature learning, we then impose the metastability prior on the latent embedding space encoded via a variational autoencoder (VAE), while leveraging the maximum mean discrepancy (MMD) measure to further mitigate the patient gap. The latent representation, endowed with physiological priors, is ultimately utilized for patient-independent seizure prediction. We evaluate our method on two publicly available scalp EEG datasets and one clinical scalp EEG dataset. Compared to existing methods, our method improves AUC, sensitivity, and specificity on the CHB-MIT dataset by approximately 9%, 5%, and 5%, respectively. Our results show that combining a brain network-based physiological prior with deep learning for EEG representation learning is a new strategy for associating seizures with complex brain network variations, enabling reliable patient-independent seizure prediction.
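The MMD term used to mitigate the patient gap is a standard kernel two-sample statistic; a minimal single-Gaussian-kernel version is sketched below as an illustrative simplification.

```python
# Minimal sketch of a maximum mean discrepancy (MMD) loss between latent codes
# from two patients/domains, using a single Gaussian kernel.
import torch

def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x, y: (N, d) and (M, d) latent codes; smaller values mean better-aligned distributions."""
    def kernel(a, b):
        dists = torch.cdist(a, b).pow(2)
        return torch.exp(-dists / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
```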
NeurIPS Conference 2025 Conference Paper
While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem—excessive and unnecessary reasoning—which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones—Direct Answer, Short CoT, and Code—as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of $\sim$30%, and up to $\sim$70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a $\sim$2$\times$ speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens—ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage. All the resources will be released.
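The Consensus-Guided Mode described above is easy to sketch: run the three efficient formats, accept a majority answer, and fall back to Long CoT on disagreement. `run_format` below is a hypothetical callable, not the ARM interface.

```python
# Minimal sketch of the Consensus-Guided Mode: majority vote over the three
# efficient formats, with Long CoT as the fallback on disagreement.
from collections import Counter
from typing import Callable

def consensus_guided(question: str, run_format: Callable[[str, str], str]) -> str:
    answers = [run_format(question, fmt) for fmt in ("direct", "short_cot", "code")]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes >= 2:                                 # at least two efficient formats agree
        return answer
    return run_format(question, "long_cot")        # disagreement -> spend more tokens
```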
NeurIPS Conference 2025 Conference Paper
Recent advances in computational pathology have led to the emergence of numerous foundation models. These models typically rely on general-purpose encoders with multi-instance learning for whole slide image (WSI) classification or apply multimodal approaches to generate reports directly from images. However, these models cannot emulate the diagnostic approach of pathologists, who systematically examine slides at low magnification to obtain an overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. Instead, existing models directly output final diagnoses without revealing the underlying reasoning process. To address this gap, we introduce CPathAgent, an innovative agent-based approach that mimics pathologists' diagnostic workflow by autonomously navigating across WSI through zoom-in/out and move operations based on observed visual features, thereby generating substantially more transparent and interpretable diagnostic summaries. To achieve this, we develop a multi-stage training strategy that unifies patch-level, region-level, and WSI-level capabilities within a single model, which is essential for replicating how pathologists understand and reason across diverse image scales. Additionally, we construct PathMMU-HR², the first expert-validated benchmark for large region analysis. This represents a critical intermediate scale between patches and whole slides, reflecting a key clinical reality where pathologists typically examine several key large regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across benchmarks at three different image scales, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for computational pathology.
IJCAI Conference 2025 Conference Paper
Multimodal sentiment analysis (MSA) has shown promising results but often poses significant challenges in real-world applications due to its dependence on complete and aligned multimodal sequences. While existing approaches attempt to address missing modalities through feature reconstruction, they often neglect the complex interplay between homogeneous and heterogeneous relationships in multimodal features. To address this problem, we propose Decoupled-Adaptive Reconstruction (DAR), a novel framework that explicitly addresses these limitations through two key components: (1) a mutual information-based decoupling module that decomposes features into common and independent representations, and (2) a reconstruction module that independently processes these decoupled features before fusion for downstream tasks. Extensive experiments on two benchmark datasets demonstrate that DAR significantly outperforms existing methods in both modality reconstruction and sentiment analysis tasks, particularly in scenarios with missing or unaligned modalities. Our results show improvements of 2.21% in bi-classification accuracy and 3.9% in regression error compared to state-of-the-art baselines on the MOSEI dataset.
TIST Journal 2025 Journal Article
Despite their excellent performance in graph representation learning, graph convolutional networks have been proven to be vulnerable to adversarial perturbations on the connectivity between nodes in an unnoticed manner. In this work, by looking into the impacts of adversarial attacks on graph data, we empirically find that the dominant edge-addition attacks generally increase the heterophily between connected nodes, which fools transductive inference models on the node classification task. To defend against such attacks, we develop a Two-Stage Denoising (TSD) method that aims at removing possibly malicious edges so as to mitigate the heterophily issue introduced by attacks. In particular, after a rough removal of the links that have quite low feature similarity, our method further spots the potentially heterophilous links by predicting node labels with a multi-view labeling consensus. This design is based on the assumption that if the label predictions for the same node from two different views of the graph data are consistent, then we have a high chance of obtaining reliable labels. The experiments demonstrate that by denoising a graph this way, the robustness of graph convolutional networks on the node classification task is remarkably improved, compared to several strong and competitive robust graph neural network models.
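The first TSD stage, roughly removing links with low feature similarity, can be sketched as follows; the threshold is an assumed hyperparameter and the second multi-view consensus stage is omitted.

```python
# Minimal sketch of pruning edges whose endpoint features have low cosine similarity,
# i.e., likely heterophilous edges introduced by edge-addition attacks.
import torch
import torch.nn.functional as F

def prune_low_similarity_edges(x: torch.Tensor, edge_index: torch.Tensor, tau: float = 0.1):
    """x: (N, d) node features; edge_index: (2, E) edges; returns the kept edges."""
    src, dst = edge_index
    sim = F.cosine_similarity(x[src], x[dst], dim=-1)   # per-edge feature similarity
    keep = sim >= tau                                    # drop suspiciously dissimilar pairs
    return edge_index[:, keep]
```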
AAAI Conference 2025 Conference Paper
Recently, deep neural networks (DNNs) have emerged as the leading approach for low-light image enhancement (LLIE). However, training these models generally requires large-scale paired datasets, which are challenging to obtain due to the labor-intensive and time-consuming nature of real-world data collection. To alleviate this issue, synthetic data are often combined with real-captured data for training. However, most existing low-light image synthesis methods are simply performed in the sRGB domain using Gamma correction or manual adjustments via Lightroom, which fail to incorporate the physical imaging prior through the image signal processing (ISP) pipeline and thus result in limited dataset size and degradation space. Consequently, LLIE methods trained on such data often exhibit some drawbacks in the results, such as inaccurate white balance and abnormal enhancement artifacts, which limit their practicality and generalizability. In this paper, we propose a practical low-light image synthesis pipeline capable of generating unlimited paired training data. Our pipeline starts with a reverse ISP model that converts sRGB images back to the unprocessed RAW domain, where we then simulate low-light degradation, noise degradation, and white balance adjustments. Finally, the degraded RAW images are processed through a forward ISP model to produce low-light sRGB images. The pipeline further employs multiple tone mapping curves and color correction matrices (CCMs) to expand the degradation space. Hence, trained with our proposed synthetic data, existing state-of-the-art (SOTA) LLIE deep models are expected to improve their performance. Extensive experiments across various datasets demonstrate that our synthetic data can indeed effectively enhance existing LLIE deep models, improving both their practicality and generalizability.
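The flavor of the pipeline (reverse ISP, RAW-domain degradation, forward ISP) can be conveyed with a toy version; the simple gamma curves, gains, and noise model below are stand-ins for the paper's learned reverse/forward ISP models and expanded degradation space.

```python
# Minimal sketch of low-light synthesis: unprocess sRGB toward a linear/RAW-like
# domain, apply exposure reduction, white-balance jitter and noise, then re-render.
import numpy as np

def synthesize_low_light(srgb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """srgb: float image in [0, 1], shape (H, W, 3); returns a degraded low-light sRGB image."""
    linear = np.clip(srgb, 0, 1) ** 2.2                   # crude inverse gamma (reverse ISP stand-in)
    exposure = rng.uniform(0.05, 0.3)                     # low-light exposure reduction
    wb_gain = rng.uniform(0.8, 1.2, size=3)               # per-channel white-balance jitter
    dark = linear * exposure * wb_gain
    shot = rng.poisson(dark * 1000.0) / 1000.0            # shot noise in the "RAW" domain
    read = rng.normal(0.0, 0.002, size=dark.shape)        # read noise
    noisy = np.clip(shot + read, 0, 1)
    return np.clip(noisy, 0, 1) ** (1 / 2.2)              # forward ISP stand-in (gamma)
```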
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies have focused on converting user behavior logs into textual prompts and leveraging techniques such as prompt tuning to enable LLMs for recommendation tasks. Meanwhile, research interest has recently grown in multimodal recommendation systems that integrate data from images, text, and other sources using modality fusion techniques. This introduces new challenges to the existing LLM-based recommendation paradigm, which relies solely on text modality information. Moreover, although Multimodal Large Language Models (MLLMs) capable of processing multi-modal inputs have emerged, how to equip MLLMs with multi-modal recommendation capabilities remains largely unexplored. To this end, in this paper, we propose the Multimodal Large Language Model-enhanced Sequential Multimodal Recommendation (MLLM-MSR) model. To capture dynamic user preferences, we design a two-stage user preference summarization method. Specifically, we first utilize an MLLM-based item-summarizer to extract the image features of an item and convert the image into text. Then, we employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences based on an LLM-based user-summarizer. Finally, to enable the MLLM for multi-modal recommendation tasks, we propose to fine-tune an MLLM-based recommender using Supervised Fine-Tuning (SFT) techniques. Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences.
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
Sequential recommendation aims to predict the next item a user is likely to interact with based on their historical interaction sequence. Capturing user intent is crucial in this process, as each interaction is typically driven by specific intentions (e.g., buying skincare products for skin maintenance, buying makeup for cosmetic purposes, etc.). However, users often have multiple, dynamically changing intents, making it challenging for models to accurately learn these intents when relying on the entire historical sequence as input. To address this, we propose a novel framework called Intent Oriented Contrastive Learning for Sequential Recommendation (IOCLRec). This framework begins by segmenting users’ sequential behaviors into multiple subsequences, which represent the coarse-grained intents of users at different points in their interaction history. These subsequences form the basis for the three contrastive learning modules within IOCLRec. The fine-grained intent contrastive learning module uncovers detailed intent representations, while the single-intent and multi-intent contrastive learning modules utilize intent-oriented data augmentation operators to capture the diverse intents of users. These three modules work synergistically, driving comprehensive performance optimization in intricate sequential recommendation scenarios. Our method has been extensively evaluated on four public datasets, demonstrating superior effectiveness.
TMLR Journal 2025 Journal Article
Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test-time search also hurts efficiency. We advocate model-based planning for web agents, which employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by: (1) proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; (2) training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive with tree search in sandbox environments (VisualWebArena), while being - times more efficient, and it also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparably to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments. All code, models, and data are publicly available at https://github.com/OSU-NLP-Group/WebDreamer
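The model-based planning loop advocated here, simulating each candidate action with a world model, scoring the imagined outcome with a value function, and only then committing, can be sketched as below; both callables are hypothetical stand-ins rather than the WebDreamer API.

```python
# Minimal sketch of model-based planning with an LLM world model and value function:
# deliberate over imagined outcomes before committing to a real (possibly irreversible) action.
from typing import Callable, List

def plan_next_action(
    state: str,
    candidate_actions: List[str],
    simulate: Callable[[str, str], str],     # world model: (state, action) -> imagined next state
    value: Callable[[str], float],           # value function: imagined state -> task progress score
) -> str:
    scored = []
    for action in candidate_actions:
        imagined = simulate(state, action)   # no real click yet
        scored.append((value(imagined), action))
    return max(scored)[1]                    # commit to the highest-value action
```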
EAAI Journal 2025 Journal Article
AAAI Conference 2025 Conference Paper
Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks, demonstrating superior performance and efficacy across various applications. The promising results come at the cost of slow inference, as each denoising step requires running the whole transformer model with a large number of parameters. In this paper, we show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. Furthermore, we show that the lower bound of similarity between outputs at consecutive steps is notably high, and this similarity can be linearly approximated using the inputs. To verify these observations, we propose **LazyDiT**, a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations. Specifically, we incorporate lazy learning layers into the model, effectively trained to maximize laziness, enabling dynamic skipping of redundant computations. Experimental results show that LazyDiT outperforms the DDIM sampler across multiple diffusion transformer models at various resolutions. Furthermore, we implement our method on mobile devices, achieving better performance than DDIM with similar latency.
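The lazy reuse idea can be sketched as caching a block's output and skipping recomputation when consecutive step inputs are nearly identical; the cosine-similarity test and threshold are illustrative, not the trained lazy learning layers described in the abstract.

```python
# Minimal sketch of lazily reusing a cached block output across denoising steps
# when the current input is highly similar to the previous step's input.
import torch
import torch.nn.functional as F

class LazyBlock(torch.nn.Module):
    def __init__(self, block: torch.nn.Module, threshold: float = 0.98):
        super().__init__()
        self.block = block
        self.threshold = threshold
        self.prev_in = None
        self.prev_out = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.prev_in is not None:
            sim = F.cosine_similarity(x.flatten(1), self.prev_in.flatten(1), dim=-1).mean()
            if sim > self.threshold:          # consecutive steps are near-duplicates: skip compute
                return self.prev_out
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```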
NeurIPS Conference 2025 Conference Paper
Agentic search, such as Deep Research systems in which agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While agentic search promises greater efficiency and cognitive offloading, its growing complexity and open-endedness have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
EAAI Journal 2025 Journal Article
NeurIPS Conference 2025 Conference Paper
Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging but less explored problem is how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels, together with control-signal pairs such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training scheme with image editing data to enable instructive editing of the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Project page is at https://caiyuanhao1998.github.io/project/OmniVCus/
NeurIPS Conference 2025 Conference Paper
With the rapid progress of large language models (LLMs) and diffusion models, there has been growing interest in personalized content generation. However, current conversational systems often present the same recommended content to all users, falling into the dilemma of "one-size-fits-all." To break this limitation and boost user engagement, in this paper, we introduce PCG (Personalized visual Content Generation), a unified framework for personalizing item images within conversational systems. We tackle two key bottlenecks: the depth of personalization and the fidelity of generated images. Specifically, an LLM-powered Inclinations Analyzer is adopted to capture user likes and dislikes from context to construct personalized prompts. Moreover, we design a dual-stage LoRA mechanism: Global LoRA for understanding task-specific visual style, and Local LoRA for capturing preferred visual elements from conversation history. During training, we introduce a visual content conditioning method to ensure that LoRA both learns the historical visual context and maintains fidelity to the original item images. Extensive experiments on benchmark conversational datasets, including objective metrics and GPT-based evaluations, demonstrate that our framework outperforms strong baselines, highlighting its potential to redefine personalization in visual content generation for conversational scenarios such as e-commerce and real-world recommendation.
EAAI Journal 2025 Journal Article
ICLR Conference 2025 Conference Paper
Federated learning (FL) has garnered significant attention from academia and industry in recent years due to its advantages in data privacy, scalability, and communication efficiency. However, current FL algorithms face a critical limitation: their performance heavily depends on meticulously tuned hyperparameters, particularly the learning rate or stepsize. This manual tuning process is challenging in federated settings due to data heterogeneity and limited accessibility of local datasets. Consequently, the reliance on problem-specific parameters hinders the widespread adoption of FL and potentially compromises its performance in dynamic or diverse environments. To address this issue, we introduce PAdaMFed, a novel algorithm for nonconvex FL that carefully combines adaptive stepsize and momentum techniques. PAdaMFed offers two key advantages: 1) it operates autonomously without relying on problem-specific parameters; and 2) it manages data heterogeneity and partial participation without requiring heterogeneity bounds. Despite these benefits, PAdaMFed provides several strong theoretical guarantees: 1) It achieves state-of-the-art convergence rates with a sample complexity of $\mathcal{O}(\epsilon^{-4})$ and communication complexity of $\mathcal{O}(\epsilon^{-3})$ to obtain an accuracy of $||\nabla f\left(\boldsymbol{\theta}\right)|| \leq \epsilon$, even using constant learning rates; 2) these complexities can be improved to the best-known $\mathcal{O}(\epsilon^{-3})$ for sampling and $\mathcal{O}(\epsilon^{-2})$ for communication when incorporating variance reduction; 3) it exhibits linear speedup with respect to the number of local update steps and participating clients at each global round. These attributes make PAdaMFed highly scalable and adaptable for various real-world FL applications. Extensive empirical evidence on both image classification and sentiment analysis tasks validates the efficacy of our approaches.
NeurIPS Conference 2025 Conference Paper
The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search (Simhadri et al., NeurIPS 2021), this competition addressed sparse, filtered, out-of-distribution, and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.
NeurIPS Conference 2025 Conference Paper
Transformer-based Large Language Models (LLMs) have become increasingly important. However, scaling LLMs to longer contexts incurs slow inference speed and high GPU memory consumption for caching key-value (KV) vectors. This paper presents RetrievalAttention, a training-free approach that both accelerates the decoding phase and reduces GPU memory consumption by pre-building KV vector indexes for fixed contexts and maintaining them in CPU memory for efficient retrieval. Unlike conventional KV cache methods, RetrievalAttention integrates approximate nearest neighbor search (ANNS) indexes into attention computation. We observe that off-the-shelf ANNS techniques often fail due to the out-of-distribution (OOD) nature of query and key vectors in attention mechanisms. RetrievalAttention overcomes this with an attention-aware vector index. Our evaluation shows that RetrievalAttention achieves near full attention accuracy while accessing only 1-3\% of the data, significantly reducing inference costs. Remarkably, RetrievalAttention enables LLMs with 8B parameters to handle 128K tokens on a single NVIDIA RTX 4090 (24GB), achieving a decoding speed of 0.107 seconds per token.
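The core computation, attending only over the top-k keys returned by a CPU-side vector index, can be sketched as follows; `ann_search` is a hypothetical stand-in for the attention-aware index described in the abstract.

```python
# Minimal sketch of approximate attention over only the top-k keys retrieved
# from an ANN index, instead of the full KV cache.
from typing import Callable
import torch

def retrieval_attention(
    q: torch.Tensor,                       # (d,) query vector for the current token
    ann_search: Callable[[torch.Tensor, int], torch.Tensor],  # returns indices of nearest keys
    keys: torch.Tensor,                    # (N, d) cached key vectors
    values: torch.Tensor,                  # (N, d) cached value vectors
    top_k: int = 256,
) -> torch.Tensor:
    idx = ann_search(q, top_k)             # touch only a small fraction of the KV cache
    k, v = keys[idx], values[idx]
    scores = (k @ q) / (q.shape[0] ** 0.5) # scaled dot-product over retrieved keys
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```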
AAAI Conference 2025 Conference Paper
Diffusion models have been utilized as powerful tools for various image editing tasks, including semantic image painting (SIP), which aims to generate content within masked regions conditioned on a reference image or text. SIP, especially when using images as conditions, often suffers from three issues: semantic inconsistency, unnatural transitions, and style inconsistency, which significantly hinder its practical application. To address these challenges, we propose a novel Semantic Image Painting framework with INdependent INformation INjection (Spin). Specifically, we compute a saliency map to segregate the reference image into salient and non-salient components. We then filter out the non-salient information during the semantic embedding extraction phase and precisely inject the semantic embedding into the masked region instead of the whole image during the semantic generation phase. Furthermore, we impose an additional style guidance to promote style consistency between background and foreground. Experimental results demonstrate that Spin achieves superior semantic similarity and image coherence across various styles, including realistic images, pencil drawings, cartoons, and oil paintings. Additionally, Spin offers diversity and editability, and can be integrated into other models that meet our prerequisites.
JBHI Journal 2025 Journal Article
Super-resolution reconstruction (SRR) of isotropic fetal brain MR images is critical for prenatal examinations but is hindered by fetal motion and misalignment of thick-slice scans. To address these challenges comprehensively, we introduce an innovative deep learning model, namely 3D-WISE, a 3D Weighted Interpolation for Super-resolution Estimation of fetal brain MRI. The model generates high-quality isotropic fetal brain MR images by learning interpolation weights to correct misalignments between slices and volumes. These misalignments are estimated by extracting deep features from multiple motion-corrupted stacks. Specifically, 3D-WISE incorporates two key components: (1) a weight learning module for multi-view interpolation and (2) a feature extraction module guided by multi-type attention mechanisms. The weight learning module first maps motion-corrupted thick-slice stacks into latent feature spaces. The resulting features are then fed to an implicit decoding block to estimate the interpolation weights of the surrounding points for a given coordinate. We further enhance our approach by incorporating convolutional block attention and atlas-induced cross-attention mechanisms. Extensive experiments on two benchmark datasets show that 3D-WISE achieves remarkably improved performance compared to the widely adopted registration-reconstruction framework. We also extend the experiments to anatomical structure reconstruction and achieve promising results, highlighting the significant potential of 3D-WISE for fetal brain MR image SRR in clinical settings. Our code is available at https://github.com/sj-huang/3D-WISE.
EAAI Journal 2025 Journal Article
IJCAI Conference 2025 Conference Paper
Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio, and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. More specifically, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% in accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
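The frequency-domain decomposition underlying the wavelet-driven fusion module can be illustrated with a single-level Haar split of a temporal feature sequence; the learned fusion on top of these bands is omitted, so this is only a sketch of the decomposition step.

```python
# Minimal sketch of a single-level Haar wavelet split of temporal features into
# low-frequency (trend) and high-frequency (detail) bands.
import torch

def haar_split(feat: torch.Tensor):
    """feat: (B, T, D) with even T; returns low- and high-frequency halves, each (B, T//2, D)."""
    even, odd = feat[:, 0::2], feat[:, 1::2]
    low = (even + odd) / 2.0 ** 0.5       # approximation coefficients
    high = (even - odd) / 2.0 ** 0.5      # detail coefficients
    return low, high
```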
EAAI Journal 2024 Journal Article
IJCAI Conference 2024 Conference Paper
Multi-view clustering (MVC) has garnered significant attention in recent studies. In this paper, we propose a novel MVC method, named CCL-MVC. The method constructs a cross-order neighbor tensor of multi-view data to recover a low-rank essential tensor, preserving noise-free, comprehensive, and complementary cross-order relationships among the samples. Furthermore, it constructs a consensus representation matrix by fusing the low-rank essential tensor with an auto-adjusted cross-view diversity embedding, fully exploiting both the consensus and the discriminative information of the data. An effective optimization algorithm is developed, which is theoretically guaranteed to converge. Extensive experimental results confirm the effectiveness of the proposed method.
NeurIPS Conference 2024 Conference Paper
Molecular docking, a technique for predicting ligand binding poses, is crucial in structure-based drug design for understanding protein-ligand interactions. Recent advancements in docking methods, particularly those leveraging geometric deep learning (GDL), have demonstrated significant efficiency and accuracy advantages over traditional sampling methods. Despite these advancements, current methods are often tailored for specific docking settings, and limitations remain, such as the neglect of protein side-chain structures, difficulties in handling large binding pockets, and challenges in predicting physically valid structures. To accommodate various docking settings and achieve accurate, efficient, and physically reliable docking, we propose a novel two-stage docking framework, DeltaDock, consisting of pocket prediction and site-specific docking. In the first stage, we innovatively reframe the pocket prediction task as a pocket-ligand alignment problem rather than direct prediction. We then follow a bi-level coarse-to-fine iterative refinement process to perform site-specific docking. Comprehensive experiments demonstrate the superior performance of DeltaDock. Notably, in the blind docking setting, DeltaDock achieves a 31\% relative improvement in docking success rate compared with the previous state-of-the-art GDL model DiffDock. When physical validity is taken into account, this improvement increases to about 300\%.
AAAI Conference 2024 Conference Paper
Deriving DSLR-quality sRGB images from smartphone RAW images has become a compelling challenge due to discernible detail disparity, color mapping instability, and spatial misalignment in RAW-sRGB data pairs. We present DiffRAW, a novel method that incorporates the diffusion model for the first time in learning RAW-to-sRGB mappings. By leveraging the diffusion model, our approach effectively learns the high-quality detail distribution of DSLR images, thereby enhancing the details of output images. Simultaneously, we use the RAW image as a diffusion condition to maintain image structure information such as contours and textures. To mitigate the interference caused by the color and spatial misalignment in training data pairs, we embed a color-position preserving condition within DiffRAW, ensuring that the output images do not exhibit color biases and pixel shift issues. To accelerate the inference process of DiffRAW, we designed the Domain Transform Diffusion Method, an efficient diffusion process with its corresponding reverse process. The Domain Transform Diffusion Method can reduce the required inference steps for diffusion model-based image restoration/enhancement algorithms while enhancing the quality of the generated images. Through evaluations on the ZRR dataset, DiffRAW consistently demonstrates state-of-the-art performance across all perceptual quality metrics (e.g., LPIPS, FID, MUSIQ), while achieving comparable results in PSNR and SSIM.
ICML Conference 2024 Conference Paper
Contrastive learning is a powerful paradigm for representation learning with prominent success in computer vision and NLP, but how to extend its success to high-dimensional tensors remains a challenge. This is because tensor data often exhibit high-order mode interactions that are hard to profile, with negative samples growing combinatorially faster than in second-order contrastive learning; furthermore, many real-world tensors have ordinal entries that necessitate more delicate comparative levels. To address this challenge, we propose High-Order Contrastive Tensor Completion (HOCTC), an innovative network that extends contrastive learning to sparse ordinal tensor data. HOCTC employs a novel attention-based strategy with query expansion to capture high-order mode interactions even with very limited tokens, going beyond second-order learning scenarios. Besides, it extends two-level (positive-vs-negative) comparisons to fine-grained contrast levels, using ordinal tensor entries as natural guidance. An efficient sampling scheme is proposed to enforce such delicate comparative structures, generating comprehensive self-supervised signals for high-order representation learning. Extensive experiments show that HOCTC achieves promising results in sparse tensor completion in traffic/recommender applications.
NeurIPS Conference 2024 Conference Paper
We present LRM-Zero, a Large Reconstruction Model (LRM) trained entirely on synthesized 3D data, achieving high-quality sparse-view 3D reconstruction. The core of LRM-Zero is our procedural 3D dataset, Zeroverse, which is automatically synthesized from simple primitive shapes with random texturing and augmentations (e.g., height fields, boolean differences, and wireframes). Unlike previous 3D datasets (e.g., Objaverse) which are often captured or crafted by humans to approximate real 3D data, Zeroverse completely ignores realistic global semantics but is rich in complex geometric and texture details that are locally similar to or even more intricate than real objects. We demonstrate that our LRM-Zero, trained with our fully synthesized Zeroverse, can achieve high visual quality in the reconstruction of real-world objects, competitive with models trained on Objaverse. We also analyze several critical design choices of Zeroverse that contribute to LRM-Zero's capability and training stability. Our work demonstrates that 3D reconstruction, one of the core tasks in 3D vision, can potentially be addressed without the semantics of real-world objects. The Zeroverse's procedural synthesis code and interactive visualization are available at: https://desaixie.github.io/lrm-zero/.
NeurIPS Conference 2024 Conference Paper
Color video snapshot compressive imaging (SCI) employs computational imaging techniques to capture multiple sequential video frames in a single Bayer-patterned measurement. With the increasing popularity of the quad-Bayer pattern in mainstream smartphone cameras for capturing high-resolution videos, mobile photography has become more accessible to a wider audience. However, existing color video SCI reconstruction algorithms are designed based on the traditional Bayer pattern. When applied to videos captured by quad-Bayer cameras, these algorithms often result in color distortion and ineffective demosaicing, rendering them impractical for primary equipment. To address this challenge, we propose the MambaSCI method, which leverages the Mamba and UNet architectures for efficient reconstruction of quad-Bayer patterned color video SCI. To the best of our knowledge, our work presents the first algorithm for quad-Bayer patterned SCI reconstruction, and also the initial application of the Mamba model to this task. Specifically, we customize Residual-Mamba-Blocks, which residually connect the Spatial-Temporal Mamba (STMamba), Edge-Detail-Reconstruction (EDR) module, and Channel Attention (CA) module. Respectively, STMamba is used to model long-range spatial-temporal dependencies with linear complexity, EDR is for better edge-detail reconstruction, and CA is used to compensate for the missing channel information interaction in the Mamba model. Experiments demonstrate that MambaSCI surpasses state-of-the-art methods with lower computational and memory costs. PyTorch-style pseudo-code for the core modules is provided in the supplementary materials. Code is at https://github.com/PAN083/MambaSCI.
NeurIPS Conference 2024 Conference Paper
Single-image relighting is a challenging task that involves reasoning about the complex interplay between geometry, materials, and lighting. Many prior methods either support only specific categories of images, such as portraits, or require special capture conditions, like using a flashlight. Alternatively, some methods explicitly decompose a scene into intrinsic components, such as normals and BRDFs, which can be inaccurate or under-expressive. In this work, we propose a novel end-to-end 2D relighting diffusion model, called Neural Gaffer, that takes a single image of any object and can synthesize an accurate, high-quality relit image under any novel environmental lighting condition, simply by conditioning an image generator on a target environment map, without an explicit scene decomposition. Our method builds on a pre-trained diffusion model, and fine-tunes it on a synthetic relighting dataset, revealing and harnessing the inherent understanding of lighting present in the diffusion model. We evaluate our model on both synthetic and in-the-wild Internet imagery and demonstrate its advantages in terms of generalization and accuracy. Moreover, by combining with other generative methods, our model enables many downstream 2D tasks, such as text-based relighting and object insertion. Our model can also operate as a strong relighting prior for 3D tasks, such as relighting a radiance field.
AAAI Conference 2024 Conference Paper
As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge the gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. Firstly, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Leveraging the advanced power of ChatGPT, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data specifically tailored for invoking eight pathology-specific sub-models we prepared, allowing PathAsst to effectively collaborate with these models and enhancing its diagnostic ability. Secondly, by leveraging the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13b and utilize pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with sub-models. The experimental results of PathAsst show the potential of harnessing AI-powered generative foundation models to improve pathology diagnosis and treatment processes. We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing, at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.
JBHI Journal 2024 Journal Article
As a common and critical medical image analysis task, deep learning based biomedical image segmentation is hindered by the dependence on costly fine-grained annotations. To alleviate this data dependence, in this article, a novel approach, called Polygonal Approximation Learning (PAL), is proposed for convex object instance segmentation with only bounding-box supervision. The key idea behind PAL is that the detection model for convex objects already contains the necessary information for segmenting them, since their convex hulls, which can be generated approximately by the intersection of bounding boxes, are equivalent to the masks representing the objects. To extract this essential information from the detection model, a repeated detection approach is employed on biomedical images under various rotation angles, and a dice loss with the projection of the rotated detection results is utilized as a supervision signal in training our segmentation model. In biomedical imaging tasks involving convex objects, such as nuclei instance segmentation, PAL outperforms known models (e.g., BoxInst) that rely solely on box supervision. Furthermore, PAL achieves comparable performance with mask-supervised models including Mask R-CNN and Cascade Mask R-CNN. Interestingly, PAL also demonstrates remarkable performance on non-convex object instance segmentation tasks, for example, surgical instrument and organ instance segmentation.
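The geometric intuition behind PAL, that the intersection of bounding boxes obtained under several image rotations approximates a convex object's hull, can be sketched with shapely polygons; the detector and the rotation bookkeeping are assumed to happen elsewhere, and the boxes are taken as polygons already mapped back to the original image frame.

```python
# Minimal sketch of approximating a convex object's mask by intersecting the
# bounding boxes detected under several image rotations.
from typing import List
from shapely.geometry import Polygon

def approximate_convex_mask(rotated_boxes: List[Polygon]) -> Polygon:
    """Intersect per-rotation bounding-box polygons to approximate the object's convex hull."""
    mask = rotated_boxes[0]
    for box in rotated_boxes[1:]:
        mask = mask.intersection(box)   # each rotation tightens the approximation
    return mask
```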
IJCAI Conference 2024 Conference Paper
In numerous user-centric services on mobile applications (apps), accurately mining user interests and generating effective user representations are paramount. Traditional approaches, which often involve training task-specific user representations, are becoming increasingly impractical due to their high computational costs and limited adaptability. This paper introduces a novel solution to this challenge: the Multi-type App-usage Fusion Network (MAFN). MAFN innovatively pre-trains universal user representations, leveraging multi-type app behaviors to overcome key limitations in existing methods. We address two primary challenges: 1) the varying frequency of user behaviors (ranging from low-frequency actions like (un)installations to high-frequency yet insightful app launches); and 2) the integration of multi-type behaviors to form a cohesive representation. Our approach involves the creation of novel pre-training tasks that harness self-supervised signals from diverse app behaviors, capturing both long-term and short-term user interests. MAFN's unique fusion approach effectively amalgamates these interests into a unified vector space, facilitating the development of a versatile, general-purpose user representation. With a practical workflow, extensive experiments with three typical downstream tasks on real-world datasets verify the effectiveness of our approach.
ECAI Conference 2024 Conference Paper
Arbitrary-scale super-resolution (ASSR) aims to learn a single model for image super-resolution at arbitrary magnifying scales. Existing ASSR networks typically comprise an off-the-shelf scale-agnostic feature extractor and an arbitrary-scale upsampler. These feature extractors often use fixed network architectures to address different ASSR inference tasks, each of which is characterized by an input image and an upsampling scale. However, this overlooks the difficulty variance of super-resolution across different inference scenarios, where simple images or small SR scales could be resolved with less computational effort than difficult images or large SR scales. To tackle this difficulty variability, in this paper, we propose a Task-Aware Dynamic Transformer (TADT) as an input-adaptive feature extractor for efficient image ASSR. Our TADT consists of a multi-scale feature extraction backbone built upon groups of Multi-Scale Transformer Blocks (MSTBs) and a Task-Aware Routing Controller (TARC). The TARC predicts the inference paths within the feature extraction backbone, specifically selecting MSTBs based on the input images and SR scales. The prediction of the inference path is guided by a new loss function to trade off SR accuracy and efficiency. Experiments demonstrate that, when working with three popular arbitrary-scale upsamplers, our TADT achieves state-of-the-art ASSR performance compared with mainstream feature extractors, but with relatively fewer computational costs. The code is available at https://github.com/Tillyhere/TADT.
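One way to guide such a routing controller is to add a compute penalty to the reconstruction loss. The sketch below is a hedged illustration of that accuracy/efficiency trade-off; the function, its lambda weighting, and the FLOPs normalization are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: reconstruction loss plus an expected-compute penalty over gated blocks.
import torch

def routing_loss(sr_pred, sr_target, gate_probs, block_flops, lam=1e-2):
    """gate_probs: (num_blocks,) probabilities that each MSTB is executed,
    predicted by the routing controller from the input image and SR scale.
    block_flops: (num_blocks,) relative cost of each block."""
    recon = torch.nn.functional.l1_loss(sr_pred, sr_target)
    expected_cost = (gate_probs * block_flops).sum() / block_flops.sum()
    return recon + lam * expected_cost

# toy usage
pred = torch.rand(1, 3, 64, 64); target = torch.rand(1, 3, 64, 64)
gates = torch.sigmoid(torch.randn(8)); flops = torch.ones(8)
print(routing_loss(pred, target, gates, flops))
```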
EAAI Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
Multimodal information extraction (MIE) gains significant attention as the popularity of multimedia content increases. However, current MIE methods often resort to using task-specific model structures, which results in limited generalizability across tasks and underutilizes shared knowledge across MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor to unify three MIE tasks as a generation problem using instruction tuning, being able to effectively extract both textual and visual mentions. Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE.
TCS Journal 2024 Journal Article
AAAI Conference 2024 Conference Paper
The recent advancements in Deep Reinforcement Learning (DRL) have significantly enhanced the performance of adaptive Traffic Signal Control (TSC). However, DRL policies are typically represented by neural networks, which are over-parameterized black-box models. As a result, the learned policies often lack interpretability and cannot be deployed directly on real-world edge hardware due to resource constraints. In addition, DRL methods often exhibit limited generalization performance, struggling to generalize the learned policy to other geographical regions. These factors limit the practical application of learning-based approaches. To address these issues, we suggest the use of an inherently interpretable program for representing the control policy. We present a new approach, Programmatic Interpretable reinforcement learning for traffic signal control (π-Light), designed to autonomously discover non-differentiable programs. Specifically, we define a Domain Specific Language (DSL) and transformation rules for constructing programs, and utilize Monte Carlo Tree Search (MCTS) to find the optimal program in a discrete space. Extensive experiments demonstrate that our method consistently outperforms baseline approaches. Moreover, π-Light exhibits superior generalization capabilities compared to DRL, enabling training and evaluation across intersections from different cities. Finally, we analyze how the learned program policies can be deployed directly on edge devices with extremely limited resources.
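To make the idea of an interpretable program policy concrete, here is a hand-written example of the kind of rule a traffic-signal DSL might express. It is purely illustrative and is not a program discovered by the paper's MCTS search; the phase names and thresholds are invented.

```python
# Illustrative only: an interpretable, non-differentiable program policy for
# queue-based phase selection at a 4-phase intersection.
def signal_policy(queue_len, current_phase, elapsed):
    """queue_len: dict phase -> number of waiting vehicles."""
    if elapsed < 10:                      # enforce a minimum green time
        return current_phase
    longest = max(queue_len, key=queue_len.get)
    if queue_len[longest] > queue_len[current_phase] + 3:
        return longest                    # switch to the most congested phase
    return current_phase

print(signal_policy({"NS": 12, "EW": 4, "NSL": 2, "EWL": 1}, "EW", 15))
```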
NeurIPS Conference 2023 Conference Paper
User modeling, which aims to capture users' characteristics or interests, heavily relies on task-specific labeled data and suffers from the data sparsity issue. Several recent studies tackled this problem by pre-training the user model on massive user behavior sequences with a contrastive learning task. Generally, these methods assume different views of the same behavior sequence constructed via data augmentation are semantically consistent, i.e., reflecting similar characteristics or interests of the user, and thus maximize their agreement in the feature space. However, due to the diverse interests and heavy noise in user behaviors, existing augmentation methods tend to lose certain characteristics of the user or introduce noisy behaviors. Thus, forcing the user model to directly maximize the similarity between the augmented views may result in a negative transfer. To this end, we propose to replace the contrastive learning task with a new pretext task: Augmentation-Adaptive Self-Supervised Ranking (AdaptSSR), which alleviates the requirement of semantic consistency between the augmented views while pre-training a discriminative user model. Specifically, we adopt a multiple pairwise ranking loss which trains the user model to capture the similarity orders between the implicitly augmented view, the explicitly augmented view, and views from other users. We further employ an in-batch hard negative sampling strategy to facilitate model training. Moreover, considering the distinct impacts of data augmentation on different behavior sequences, we design an augmentation-adaptive fusion mechanism to automatically adjust the similarity order constraint applied to each sample based on the estimated similarity between the augmented views. Extensive experiments on both public and industrial datasets with six downstream tasks verify the effectiveness of AdaptSSR.
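A hedged sketch of such a multiple pairwise ranking loss is given below. The `alpha` fusion weight and the exact similarity ordering are simplifying assumptions rather than the paper's formulation; it only conveys the idea of enforcing an order between implicit-view, explicit-view, and other-user similarities.

```python
# Sketch of a multiple pairwise ranking loss enforcing
# sim(implicit view) >= sim(explicit view) >= sim(other users).
import torch
import torch.nn.functional as F

def adapt_ranking_loss(anchor, implicit, explicit, negatives, alpha):
    """anchor/implicit/explicit: (B, d) user representations,
    negatives: (B, K, d) in-batch hard negatives, alpha: (B,) in [0, 1]."""
    sim_imp = F.cosine_similarity(anchor, implicit, dim=-1)               # (B,)
    sim_exp = F.cosine_similarity(anchor, explicit, dim=-1)               # (B,)
    sim_neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) # (B, K)
    # softplus(x) smoothly penalizes order violations
    l_imp_exp = F.softplus(sim_exp - sim_imp)                       # implicit above explicit
    l_exp_neg = F.softplus(sim_neg - sim_exp.unsqueeze(1)).mean(1)  # explicit above negatives
    return (alpha * l_imp_exp + l_exp_neg).mean()

B, K, d = 4, 8, 64
loss = adapt_ranking_loss(torch.randn(B, d), torch.randn(B, d),
                          torch.randn(B, d), torch.randn(B, K, d), torch.rand(B))
print(loss)
```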
TCS Journal 2023 Journal Article
IJCAI Conference 2023 Conference Paper
Commonsense Question Answering (CQA) aims to answer questions that require human commonsense. Closed-book CQA, as one of the subtasks, requires the model to answer questions without retrieving external knowledge, which emphasizes the importance of the model's problem-solving ability. Most previous methods relied on large-scale pre-trained models to generate question-related knowledge while ignoring the crucial role of skills in the process of answering commonsense questions. Generally, skills refer to the learned ability to perform a specific task or activity, which is derived from knowledge and experience. In this paper, we introduce a new approach named Dynamic Skill-aware Commonsense Question Answering (DSCQA), which transcends the limitations of traditional methods by informing the model about the need for each skill in questions and utilizes skills as a critical driver in the CQA process. To be specific, DSCQA first employs a commonsense skill extraction module to generate various skill representations. Then, DSCQA utilizes a dynamic skill module to generate dynamic skill representations. Finally, in the perception and emphasis module, various skills and dynamic skill representations are used to support the question-answering process. Experimental results on two publicly available CQA datasets show the effectiveness of our proposed model and the considerable impact of introducing skills.
NeurIPS Conference 2023 Conference Paper
Text-guided image editing is widely needed in daily life, ranging from personal use to professional applications such as Photoshop. However, existing methods are either zero-shot or trained on an automatically synthesized dataset, which contains a high volume of noise. Thus, they still require lots of manual tuning to produce desirable outcomes in practice. To address this issue, we introduce MagicBrush, the first large-scale, manually annotated dataset for instruction-guided real image editing that covers diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), which supports training large-scale text-guided image editing models. We fine-tune InstructPix2Pix on MagicBrush and show that the new model can produce much better images according to human evaluation. We further conduct extensive experiments to evaluate current image editing baselines from multiple dimensions including quantitative, qualitative, and human evaluations. The results reveal the challenging nature of our dataset and the gap between current baselines and real-world editing needs.
IROS Conference 2023 Conference Paper
Most navigation approaches treat obstacles as static objects and choose to bypass them. However, the detour could be costly or could lead to failures in indoor environments. The recently developed navigation among movable obstacles (NAMO) methods prefer to remove all the movable obstacles blocking the way, which might not be the best choice when planning and moving obstacles take a long time. We propose a pipeline where the robot solves NAMO problems by optimizing the total time to reach the goal. This is achieved by a supervised learning approach that predicts, before execution, the time needed to plan and perform an obstacle motion, so that the obstacle is moved only when doing so leads to faster goal reaching. Besides, a pose generator based on reinforcement learning is proposed to decide where the robot can move the obstacle. The method is evaluated in two kinds of simulation environments and the results demonstrate its advantages compared to the classical bypass and obstacle removal strategies.
EAAI Journal 2023 Journal Article
YNICL Journal 2023 Journal Article
NeurIPS Conference 2022 Conference Paper
In many web applications, deep learning-based CTR prediction models (deep CTR models for short) are widely adopted. Traditional deep CTR models learn patterns in a static manner, i.e., the network parameters are the same across all instances. However, such a manner can hardly characterize each of the instances, which may have different underlying distributions. It actually limits the representation power of deep CTR models, leading to sub-optimal results. In this paper, we propose an efficient, effective, and universal module, named Adaptive Parameter Generation network (APG), which can dynamically generate parameters for deep CTR models on-the-fly based on different instances. Extensive experimental evaluation results show that APG can be applied to a variety of deep CTR models and significantly improve their performance. Meanwhile, APG can reduce the time cost by 38.7% and memory usage by 96.6% compared to a regular deep CTR model. We have deployed APG in the industrial sponsored search system and achieved 3% CTR gain and 1% RPM gain respectively.
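The core mechanism can be illustrated with a small hyper-network that emits the weights of one layer per instance. This is a hedged sketch under simplifying assumptions; the low-rank decomposition and parameter-sharing tricks behind the reported 38.7%/96.6% savings are omitted.

```python
# Sketch of instance-wise parameter generation for one layer of a CTR model.
import torch
import torch.nn as nn

class AdaptiveLayer(nn.Module):
    def __init__(self, in_dim, out_dim, cond_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # hyper-network that predicts weights and bias from a per-instance condition
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, 64), nn.ReLU(),
            nn.Linear(64, in_dim * out_dim + out_dim),
        )

    def forward(self, x, cond):
        # cond: per-instance condition vector (e.g. the embedded features)
        params = self.hyper(cond)                                  # (B, in*out + out)
        w = params[:, : self.in_dim * self.out_dim].view(-1, self.out_dim, self.in_dim)
        b = params[:, self.in_dim * self.out_dim :]
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1) + b       # per-instance weights

layer = AdaptiveLayer(16, 8, 16)
x = torch.randn(4, 16)
print(layer(x, x).shape)   # torch.Size([4, 8])
```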
NeurIPS Conference 2022 Conference Paper
Video restoration aims at restoring multiple high-quality frames from multiple low-quality frames. Existing video restoration methods generally fall into two extreme cases, i.e., they either restore all frames in parallel or restore the video frame by frame in a recurrent way, which would result in different merits and drawbacks. Typically, the former has the advantage of temporal information fusion. However, it suffers from large model size and intensive memory consumption; the latter has a relatively small model size as it shares parameters across frames; however, it lacks long-range dependency modeling ability and parallelizability. In this paper, we attempt to integrate the advantages of the two cases by proposing a recurrent video restoration transformer, namely RVRT. RVRT processes local neighboring frames in parallel within a globally recurrent framework which can achieve a good trade-off between model size, effectiveness, and efficiency. Specifically, RVRT divides the video into multiple clips and uses the previously inferred clip feature to estimate the subsequent clip feature. Within each clip, different frame features are jointly updated with implicit feature aggregation. Across different clips, the guided deformable attention is designed for clip-to-clip alignment, which predicts multiple relevant locations from the whole inferred clip and aggregates their features by the attention mechanism. Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory, and runtime.
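A rough structural sketch of clip-wise recurrence is shown below. The intra-clip transformer and the guided deformable attention are stubbed out with a single convolution, so this only conveys the data flow (parallel within a clip, recurrent across clips), not the actual RVRT architecture.

```python
# Structural sketch of clip-wise recurrence: intra-clip processing is parallel,
# inter-clip information flows through a recurrent clip feature.
import torch
import torch.nn as nn

class ClipRecurrentBackbone(nn.Module):
    def __init__(self, ch=32, clip_len=2):
        super().__init__()
        self.clip_len = clip_len
        self.joint = nn.Conv3d(ch * 2, ch, kernel_size=3, padding=1)  # stand-in for intra-clip fusion

    def forward(self, feats):                 # feats: (B, T, C, H, W)
        B, T, C, H, W = feats.shape
        prev = torch.zeros(B, C, self.clip_len, H, W)
        outs = []
        for t in range(0, T, self.clip_len):
            clip = feats[:, t : t + self.clip_len].permute(0, 2, 1, 3, 4)  # (B, C, L, H, W)
            # fuse the current clip with the previously inferred clip feature
            prev = self.joint(torch.cat([clip, prev[:, :, : clip.shape[2]]], dim=1))
            outs.append(prev.permute(0, 2, 1, 3, 4))
        return torch.cat(outs, dim=1)         # (B, T, C, H, W)

x = torch.randn(1, 6, 32, 16, 16)
print(ClipRecurrentBackbone()(x).shape)
```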
NeurIPS Conference 2022 Conference Paper
Vision Transformers (ViTs) yield impressive performance across various vision tasks. However, heavy computation and memory footprint make them inaccessible for edge devices. Previous works apply importance criteria determined independently by each individual component to prune ViTs. Considering that heterogeneous components in ViTs play distinct roles, these approaches lead to suboptimal performance. In this paper, we introduce joint importance, which integrates essential structural-aware interactions between components for the first time, to perform collaborative pruning. Based on the theoretical analysis, we construct a Taylor-based approximation to evaluate the joint importance. This guides pruning toward a more balanced reduction across all components. To further reduce the algorithm complexity, we incorporate the interactions into the optimization function under some mild assumptions. Moreover, the proposed method can be seamlessly applied to various tasks including object detection. Extensive experiments demonstrate the effectiveness of our method. Notably, the proposed approach outperforms the existing state-of-the-art approaches on ImageNet, increasing accuracy by 0.7% over the DeiT-Base baseline while saving 50% FLOPs. On COCO, we are the first to show that 70% FLOPs of FasterRCNN with ViT backbone can be removed with only 0.3% mAP drop. The code is available at https://github.com/hikvision-research/SAViT.
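For context, a per-component first-order Taylor importance score, the independent criterion that joint importance extends, can be sketched as follows; the cross-component interaction terms that actually define joint importance are deliberately omitted here.

```python
# Per-component first-order Taylor importance (interaction terms omitted).
import torch

def taylor_importance(model, loss, component_params):
    """component_params: list of parameter tensors belonging to one component
    (e.g. one attention head or one MLP channel group)."""
    grads = torch.autograd.grad(loss, component_params, retain_graph=True)
    # |sum of gradient * weight| approximates the loss change if the component is removed
    score = sum((g * p).sum() for g, p in zip(grads, component_params))
    return score.abs().item()

# toy usage: importance of the second linear layer of a tiny model
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
loss = model(torch.randn(16, 8)).pow(2).mean()
print(taylor_importance(model, loss, list(model[2].parameters())))
```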
ICRA Conference 2022 Conference Paper
Liquid state estimation is important for robotics tasks such as pouring; however, estimating the state of transparent liquids is a challenging problem. We propose a novel segmentation pipeline that can segment transparent liquids such as water from a static, RGB image without requiring any manual annotations or heating of the liquid for training. Instead, we use a generative model that is capable of translating images of colored liquids into synthetically generated transparent liquid images, trained only on an unpaired dataset of colored and transparent liquid images. Segmentation labels of colored liquids are obtained automatically using background subtraction. Our experiments show that we are able to accurately predict a segmentation mask for transparent liquids without requiring any manual annotations. We demonstrate the utility of transparent liquid segmentation in a robotic pouring task that controls pouring by perceiving the liquid height in a transparent cup. Accompanying video and supplementary materials can be found at https://sites.google.com/view/transparentliquidpouring.
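The automatic label-generation step lends itself to a very small sketch: given an image of the empty container and one with colored liquid, background subtraction yields a pseudo-mask. The file names and threshold below are illustrative assumptions, not part of the paper's pipeline.

```python
# Sketch of automatic pseudo-labels for *colored* liquid via background subtraction.
import cv2
import numpy as np

def colored_liquid_mask(empty_path, filled_path, thresh=30):
    empty = cv2.imread(empty_path)          # scene with the empty container
    filled = cv2.imread(filled_path)        # same scene with colored liquid poured in
    diff = cv2.absdiff(filled, empty)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return mask                              # binary pseudo-label for the liquid region

# mask = colored_liquid_mask("empty_cup.png", "colored_fill.png")
```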
AAAI Conference 2022 Conference Paper
Graph neural networks are a promising architecture for learning and inference with graph-structured data. Yet, how to generate informative, fixed-dimensional graph-level features for graphs with varying size and topology can still be challenging. Typically, this is achieved through graph-pooling, which summarizes a graph by compressing all its nodes into a single vector after convolutional operations. Is such a “collapsing-style” graph-pooling the only choice for graph classification? From a complex systems point of view, the properties of a complex system arise largely from the interaction among its components. Therefore, we speculate that preserving the interacting relation between parts, instead of pooling them together, could benefit system-level prediction. To verify this, we propose SLIM, a graph neural network model for Structural Landmarking and Interaction Modelling. The main idea is to compute a set of end-to-end optimizable sub-structure landmarks, so that any input graph can be projected onto these (spatially) local structural representatives for a faithful, global characterization. By doing this, explicit interaction between component parts of a graph can be leveraged directly in generating useful graph-level representations despite significant topological variations. Encouraging results are observed on benchmark datasets for graph classification, demonstrating the value of interaction modelling in the design of graph neural networks.
NeurIPS Conference 2022 Conference Paper
Recently, Transformers have been shown to enhance the performance of multi-view stereo by enabling long-range feature interaction. In this work, we propose Window-based Transformers (WT) for local feature matching and global feature aggregation in multi-view stereo. We introduce a Window-based Epipolar Transformer (WET) which reduces matching redundancy by using epipolar constraints. Since point-to-line matching is sensitive to erroneous camera pose and calibration, we match windows near the epipolar lines. A second Shifted WT is employed for aggregating global information within cost volume. We present a novel Cost Transformer (CT) to replace 3D convolutions for cost volume regularization. In order to better constrain the estimated depth maps from multiple views, we further design a novel geometric consistency loss (Geo Loss) which punishes unreliable areas where multi-view consistency is not satisfied. Our WT multi-view stereo method (WT-MVSNet) achieves state-of-the-art performance across multiple datasets and ranks $1^{st}$ on Tanks and Temples benchmark. Code will be available upon acceptance.
AAAI Conference 2021 Conference Paper
We study the problem of adversarial language games, in which multiple agents with conflicting goals compete with each other via natural language interactions. While adversarial language games are ubiquitous in human activities, little attention has been devoted to this field in natural language processing. In this work, we propose a challenging adversarial language game called Adversarial Taboo as an example, in which an attacker and a defender compete around a target word. The attacker is tasked with inducing the defender to utter the target word, which is invisible to the defender, while the defender is tasked with detecting the target word before being induced by the attacker. In Adversarial Taboo, a successful attacker and defender need to hide or infer the intention, and induce or defend during conversations. This requires several advanced language abilities, such as adversarial pragmatic reasoning and goal-oriented language interactions in open domains, which will facilitate many downstream NLP tasks. To instantiate the game, we create a game environment and a competition platform. Comprehensive experiments on several baseline attack and defense strategies show promising and interesting results, based on which we discuss some directions for future research. The code and datasets of this paper can be obtained from https://github.com/thunlp/AdversarialTaboo.
IS Journal 2021 Journal Article
In recent years, money laundering has become much easier to carry out but more challenging to detect than before, which has enormous adverse effects on finance, military, and other related fields. In the real-time scenario, every money laundering case has a unique structure in terms of transactions. It is not sufficient to detect suspicious behavior by simply following probability theory, where the thresholds are usually given by experts. Since the crime of money laundering is more prevalent and sophisticated nowadays, combining the accounts and their personal information with the transaction topology further increases the complexity of detection. Hence, graph topology analysis can be used for anti-money laundering tools. This article proposes eight common topologies based on coupling and connection, from simple to much more complicated structures, to address various kinds of real-world money laundering problems. Moreover, we also propose an efficient solution based on graph and subgraph isomorphism and distance measurement to detect money laundering behavior. In this way, the detection of money laundering behavior will be more efficient and effective across various situations while referencing the proposed eight topological structures.
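As a small illustration of the isomorphism-based detection idea, the sketch below checks whether a toy "fan-in" laundering template is embedded in a transaction graph using networkx; the eight topologies proposed in the article are not reproduced here, and both graphs are invented examples.

```python
# Sketch: subgraph-isomorphism check of a known laundering topology in a transaction graph.
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

# a toy "fan-in" template: many source accounts funnel into one collector account
template = nx.DiGraph([("s1", "c"), ("s2", "c"), ("s3", "c")])

# toy transaction graph
txn = nx.DiGraph([("a", "x"), ("b", "x"), ("d", "x"), ("x", "e")])

matcher = DiGraphMatcher(txn, template)
print(matcher.subgraph_is_isomorphic())     # True if the fan-in pattern is embedded
```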
IS Journal 2021 Journal Article
Anomaly detection has been widely applied in modern data-driven security applications to detect abnormal events/entities that deviate from the majority. However, less work has been done in terms of detecting suspicious event sequences/paths, which are better discriminators than single events/entities for distinguishing normal and abnormal behaviors in complex systems such as cyber-physical systems. A key and challenging step in this endeavor is how to discover those abnormal event sequences from millions of system event records in an efficient and accurate way. To address this issue, we propose NINA, a network diffusion based algorithm for identifying anomalous event sequences. Experimental results on both static and streaming data show that NINA is efficient (processes about 2 million records per minute) and accurate.
IROS Conference 2021 Conference Paper
Maintaining lateral and longitudinal trajectory tracking accuracy is challenging for autonomous ground vehicles (AGVs). This paper considers kinematics and dynamics of longitudinal and lateral motion to form a novel composite structure considering the cross-impacts of acceleration and steering commands on tracking errors in the lateral and longitudinal directions, respectively. The multi-tiered structure uses backstepping with smooth robust control to iteratively map kinematics-based velocity and yaw rate commands to slip-yaw dynamics-based acceleration and steering commands. In kinematics, longitudinal tracking error is stabilized by sliding mode control (SMC) while variable structure control (VSC) stabilizes lateral tracking error and balances tracking accuracy and steering gracefulness. Backstepping extends these commands through vehicle dynamics to provide robust steering and acceleration commands. Cross-impacts between lateral and longitudinal motion are addressed by vehicle modeling and controller designs. A robust observer is applied for sideslip estimation to reject uncertainties. Peaking from the high gain observer and robust control is addressed. Stability analysis is provided and field experiments on an open road demonstrate and validate the effectiveness of the controllers.
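For readers unfamiliar with SMC, a textbook first-order sliding-mode law for the longitudinal error is sketched below. The gains, the boundary-layer saturation, and the omission of the backstepping and observer layers are all simplifying assumptions; this is not the paper's controller.

```python
# Textbook sliding-mode law for a longitudinal tracking error (illustrative only).
import numpy as np

def smc_longitudinal(e, e_dot, lam=1.0, k=2.0, phi=0.1):
    """e: longitudinal tracking error, e_dot: its rate.
    Returns a commanded acceleration correction."""
    s = e_dot + lam * e                      # sliding surface
    sat = np.clip(s / phi, -1.0, 1.0)        # boundary-layer approximation of sign(s)
    return -lam * e_dot - k * sat            # drives s (and hence e) toward zero

print(smc_longitudinal(e=0.5, e_dot=-0.1))
```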
EAAI Journal 2021 Journal Article
AAAI Conference 2021 Conference Paper
Match outcome prediction in the group comparison setting is a challenging but important task. Existing works mainly focus on learning individual effects or mining limited interactions between teammates, which is not sufficient for capturing complex interactions between teammates as well as between opponents. Besides, the importance of interacting with different characters is still largely underexplored. To this end, we propose a novel Neural Attentional Cooperation-competition model (NeuralAC), which incorporates weighted-cooperation effects (i.e., intra-team interactions) and weighted-competition effects (i.e., inter-team interactions) for predicting match outcomes. Specifically, we first project individuals to latent vectors and learn complex interactions through deep neural networks. Then, we design two novel attention-based mechanisms to capture the importance of intra-team and inter-team interactions, which enhance NeuralAC with both accuracy and interpretability. Furthermore, we demonstrate that NeuralAC can generalize several previous works. To evaluate the performance of NeuralAC, we conduct extensive experiments on four E-sports datasets. The experimental results clearly verify the effectiveness of NeuralAC compared with several state-of-the-art methods.
AAAI Conference 2021 Conference Paper
Recently, multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. However, most of the multimodal methods use attention mechanisms to extract visual clues regardless of whether the text and image are relevant. Practically, irrelevant text-image pairs account for a large proportion of tweets. Visual clues that are unrelated to the texts will exert uncertain or even negative effects on multimodal model learning. In this paper, we introduce a method of text-image relation propagation into the multimodal BERT model. We integrate soft or hard gates to select visual clues and propose a multitask algorithm to train on the MNER datasets. In the experiments, we deeply analyze the changes in visual attention before and after the use of text-image relation propagation. Our model achieves state-of-the-art performance on the MNER datasets.
TCS Journal 2020 Journal Article
ICRA Conference 2020 Conference Paper
Path following accuracy and error convergence with graceful motion in vehicle steering control is challenging due to the competing nature of these requirements, especially across a range of operating speeds. This work is founded upon slip-based kinematic and dynamic models, which allow derivation of controllers considering error due to sideslip and the mapping between steering commands and graceful lateral motion. A novel recursive backstepping steering controller is proposed that better couples yaw-rate based path following commands to steering angle and rate. Observer based sideslip estimates are combined with heading error in the kinematic controller to provide feedforward slip compensation. Path following error is compensated by a Variable Structure Controller (VSC) to balance graceful motion, path following error, and robustness. Yaw rate commands are used by a backstepping dynamic controller to generate robust steering commands. A High Gain Observer (HGO) estimates sideslip and yaw rate for output feedback control. Stability analysis is provided and peaking is addressed. Field experimental results evaluate the work and provide comparisons to MPC.
JBHI Journal 2020 Journal Article
Excessive stress is one of the main causes of mental illness. Long-term exposure to stress can affect one's physiological well-being (such as hypertension) and psychological condition (such as depression). Multisensory information such as heart rate variability (HRV) and pH can provide suitable information about mental and physical stress. This paper proposes a novel approach for stress condition monitoring using disposable flexible sensors. By integrating flexible amplifiers with a commercially available flexible polyvinylidene difluoride (PVDF) mechanical deformation sensor and a pH-type chemical sensor, the proposed system can detect arterial pulses from the neck and pH levels from sweat on the back of the body. The system uses organic thin film transistor (OTFT)-based signal amplification front-end circuits with modifications to accommodate the dynamic signal ranges obtained from the sensors. The OTFTs were manufactured on a low-cost flexible polyethylene naphthalate (PEN) substrate using a coater capable of Roll-to-Roll (R2R) deposition. The proposed system can capture physiological indicators with data interrogated by Near Field Communication (NFC). The device has been successfully tested with healthy subjects, demonstrating its feasibility for real-time stress monitoring.
JBHI Journal 2019 Journal Article
Optical coherence tomography (OCT) is a high-resolution and noninvasive imaging modality that has become one of the most prevalent techniques for ophthalmic diagnosis. Retinal layer segmentation is crucial for doctors to diagnose and study retinal diseases. However, manual segmentation is often a time-consuming and subjective process. In this work, we propose a new method for automatically segmenting retinal OCT images, which integrates deep features and hand-designed features to train a structured random forests classifier. The deep convolutional features are learned from a deep residual network. With the trained classifier, we can get the contour probability graph of each layer; finally, the shortest path is employed to achieve the final layer segmentation. The experimental results show that our method achieves good results with a mean layer contour error of 1.215 pixels, whereas that of the state of the art was 1.464 pixels, and achieves an F1-score of 0.885, which is also better than the 0.863 obtained by the state-of-the-art method.
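The shortest-path step can be illustrated with a small dynamic program over a per-pixel contour probability map. The cost definition, step limit, and graph construction below are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch: extract one layer boundary as a left-to-right minimum-cost path
# through a contour probability map.
import numpy as np

def trace_boundary(prob, max_step=2):
    """prob: (H, W) contour probability for one layer; returns one row index per column."""
    cost = -np.log(prob + 1e-8)
    H, W = cost.shape
    acc = cost.copy()
    back = np.zeros((H, W), dtype=int)
    for c in range(1, W):
        for r in range(H):
            lo, hi = max(0, r - max_step), min(H, r + max_step + 1)
            prev = acc[lo:hi, c - 1]
            back[r, c] = lo + prev.argmin()
            acc[r, c] = cost[r, c] + prev.min()
    path = [int(acc[:, -1].argmin())]
    for c in range(W - 1, 0, -1):            # backtrack from the last column
        path.append(int(back[path[-1], c]))
    return path[::-1]

prob = np.random.rand(64, 128)
print(len(trace_boundary(prob)))   # 128 column-wise boundary positions
```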
TCS Journal 2019 Journal Article
ICML Conference 2019 Conference Paper
Non-negative matrix factorization is a powerful tool for learning useful representations in the data and has been widely applied in many problems such as data mining and signal processing. Orthogonal NMF, which can improve the locality of decomposition, has drawn considerable interest in solving clustering problems in recent years. However, imposing simultaneous non-negative and orthogonal structure can be quite difficult, and so existing algorithms can only solve it approximately. To address this challenge, we propose an innovative procedure called the Greedy Orthogonal Pivoting Algorithm (GOPA). The GOPA algorithm fully exploits the sparsity of non-negative orthogonal solutions to break the global problem into a series of local optimizations, in which an adaptive subset of coordinates is updated in a greedy, closed-form manner. The biggest advantage of GOPA is that it promotes exact orthogonality and provides solid empirical evidence that stronger orthogonality does contribute favorably to better clustering performance. On the other hand, we further design randomized and parallel versions of GOPA, which can further reduce the computational cost and improve accuracy, making it suitable for large-scale data.
AAAI Conference 2019 Conference Paper
Cross-domain sentiment classification refers to utilizing useful knowledge in the source domain to help sentiment classification in the target domain, which has few or no labeled data. Most existing methods mainly concentrate on extracting common features between domains. Unfortunately, they cannot fully consider the effects of the aspect (e.g., the battery life in reviewing an electronic product) information of the sentences. In order to better solve this problem, we propose an Interactive Attention Transfer Network (IATN) for cross-domain sentiment classification. IATN provides an interactive attention transfer mechanism, which can better transfer sentiment across domains by incorporating information of both sentences and aspects. Specifically, IATN comprises two attention networks: one of them identifies the common features between domains through domain classification, and the other aims to extract information from the aspects by using the common features as a bridge. Then, we conduct interactive attention learning for those two networks so that both the sentences and the aspects can influence the final sentiment representation. Extensive experiments on the Amazon reviews dataset and a crowdfunding reviews dataset not only demonstrate the effectiveness and universality of our method, but also give an interpretable way to track the attention information for sentiment.
AAAI Conference 2019 Conference Paper
Random Fourier features are a powerful framework to approximate shift-invariant kernels with Monte Carlo integration, which has drawn considerable interest in scaling up kernel-based learning, dimensionality reduction, and information retrieval. In the literature, many sampling schemes have been proposed to improve the approximation performance. However, an interesting theoretic and algorithmic challenge still remains, i.e., how to optimize the design of random Fourier features to achieve good kernel approximation on any input data using a low spectral sampling rate? In this paper, we propose to compute more adaptive random Fourier features with optimized spectral samples (w_j's) and feature weights (p_j's). The learning scheme not only significantly reduces the spectral sampling rate needed for accurate kernel approximation, but also allows joint optimization with any supervised learning framework. We establish generalization bounds using Rademacher complexity, and demonstrate advantages over previous methods. Moreover, our experiments show that the empirical kernel approximation provides effective regularization for supervised learning.
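For reference, the standard (non-optimized) random Fourier feature map that this line of work improves on can be written in a few lines. The Gaussian kernel and the sampling of the spectral samples w_j from its spectral density are textbook facts, not results from this paper.

```python
# Standard random Fourier feature map for the Gaussian kernel.
import numpy as np

def rff_map(X, D=512, gamma=1.0, seed=0):
    """Approximate k(x, y) = exp(-gamma * ||x - y||^2) via z(x)^T z(y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))   # spectral samples w_j
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.randn(5, 3)
Z = rff_map(X)
exact = np.exp(-np.sum((X[0] - X[1]) ** 2))
print(exact, Z[0] @ Z[1])   # the two kernel values should be close
```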
TCS Journal 2019 Journal Article
YNIMG Journal 2018 Journal Article
AAAI Conference 2018 Conference Paper
Graphs are natural data structures adopted to represent real-world data with complex relationships. In recent years, there has been a surge of interest in building predictive models over graphs, with prominent examples in chemistry, computational biology, and social networks. The overwhelming complexity of graph space often makes it challenging to extract interpretable and discriminative structural features for classification tasks. In this work, we propose a novel neural network structure called Substructure Assembling Network (SAN) to extract graph features and improve the generalization performance of graph classification. The key innovation of our work is a unified substructure assembling unit, which is a variant of the Recurrent Neural Network (RNN) designed to hierarchically assemble useful pieces of graph components so as to fabricate discriminative substructures. SAN adopts a sequential, probabilistic decision process, and therefore it can tune substructure features at a finer granularity. Meanwhile, the parameterized soft decisions can be continuously improved with supervised learning through back-propagation, leading to optimizable search trajectories. Overall, SAN embraces both the flexibility of combinatorial pattern search and the strong optimizability of deep learning, and delivers promising results as well as interpretable structural features in graph classification against state-of-the-art techniques.
IROS Conference 2017 Conference Paper
Piezoelectric ceramic (PZT) actuators have been widely used in flexure-guided nanopositioning stages because of their high resolution. However, it is quite hard to achieve high-rate precision positioning control because of the complex hysteresis nonlinearity of PZT actuators. Thus, an online regularized extreme learning machine algorithm with a forgetting property (FReOS-ELM) is proposed to handle this issue. First, we adopt a regularized extreme learning machine (RELM) to build an intelligent hysteresis model. The training of the algorithm is completed in only one step, which avoids the shortcomings of traditional hysteresis models based on artificial neural networks (ANNs), namely slow training speed and a tendency to fall into local minima. Then, based on the regularized online sequential extreme learning machine (ReOS-ELM), an online RELM algorithm with a forgetting property (FReOS-ELM) is designed, which avoids the computational load of ReOS-ELM in the process of adding new data for online learning. In the experiment, a real-time voltage signal with varying frequencies and amplitudes is adopted, and the output displacement data of the nanopositioning stage is acquired and analyzed. The results verify that the performance of the established hysteresis model based on the proposed FReOS-ELM is satisfactory and can be used to improve the practical positioning performance of flexure nanopositioning stages.
YNIMG Journal 2017 Journal Article
AIJ Journal 2017 Journal Article
NeurIPS Conference 2016 Conference Paper
Tensor factorization is a powerful tool to analyse multi-way data. Recently proposed nonlinear factorization methods, although capable of capturing complex relationships, are computationally quite expensive and may suffer a severe learning bias in case of extreme data sparsity. Therefore, we propose a distributed, flexible nonlinear tensor factorization model, which avoids the expensive computations and structural restrictions of the Kronecker-product in the existing TGP formulations, allowing an arbitrary subset of tensor entries to be selected for training. Meanwhile, we derive a tractable and tight variational evidence lower bound (ELBO) that enables highly decoupled, parallel computations and high-quality inference. Based on the new bound, we develop a distributed, key-value-free inference algorithm in the MapReduce framework, which can fully exploit the memory cache mechanism in fast MapReduce systems such as Spark. Experiments demonstrate the advantages of our method over several state-of-the-art approaches, in terms of both predictive performance and computational efficiency.
IJCAI Conference 2016 Conference Paper
Anomaly detection plays an important role in modern data-driven security applications, such as detecting suspicious access to a socket from a process. In many cases, such events can be described as a collection of categorical values that are considered as entities of different types, which we call heterogeneous categorical events. Due to the lack of intrinsic distance measures among entities, and the exponentially large event space, most existing work relies heavily on heuristics to calculate abnormal scores for events. Different from previous work, we propose a principled and unified probabilistic model APE (Anomaly detection via Probabilistic pairwise interaction and Entity embedding) that directly models the likelihood of events. In this model, we embed entities into a common latent space using their observed co-occurrence in different events. More specifically, we first model the compatibility of each pair of entities according to their embeddings. Then we utilize the weighted pairwise interactions of different entity types to define the event probability. Using Noise-Contrastive Estimation with a "context-dependent" noise distribution, our model can be learned efficiently regardless of the large event space. Experimental results on real enterprise surveillance data show that our method can accurately detect abnormal events compared to other state-of-the-art anomaly detection techniques.
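The event-scoring idea can be sketched as a weighted sum of pairwise embedding compatibilities across entity types. The module below is a hedged simplification that omits the NCE training loop and the context-dependent noise distribution; all names and sizes are illustrative.

```python
# Sketch: score an event by weighted pairwise compatibilities of its entity embeddings.
import torch
import torch.nn as nn

class PairwiseEventScorer(nn.Module):
    def __init__(self, vocab_sizes, dim=32):
        super().__init__()
        self.embeds = nn.ModuleList(nn.Embedding(v, dim) for v in vocab_sizes)
        n = len(vocab_sizes)
        self.pair_w = nn.Parameter(torch.ones(n, n))   # weight per entity-type pair

    def forward(self, event):                          # event: (B, n_types) entity ids
        vecs = [emb(event[:, i]) for i, emb in enumerate(self.embeds)]
        score = 0.0
        for i in range(len(vecs)):
            for j in range(i + 1, len(vecs)):
                score = score + self.pair_w[i, j] * (vecs[i] * vecs[j]).sum(-1)
        return score                                   # higher = more "normal" event

scorer = PairwiseEventScorer([100, 50, 20])
print(scorer(torch.randint(0, 20, (4, 3))).shape)      # torch.Size([4])
```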
EAAI Journal 2014 Journal Article
AAAI Conference 2014 Conference Paper
Semi-supervised kernel design is an essential step for obtaining good predictive performance in semi-supervised learning tasks. In the current literature, a large family of algorithms builds the new kernel by using the weighted average of predefined base kernels. While optimal weighting schemes have been studied extensively, the choice of base kernels has received much less attention. Many methods simply adopt the empirical kernel matrices or their eigenvectors. Such base kernels are computed irrespective of class labels and may not always reflect useful structures in the data. As a result, in the case of poor base kernels, the generalization performance can be degraded no matter how hard their weights are tuned. In this paper, we propose to construct high-quality base kernels with the help of label information to globally improve the final target alignment. In particular, we devise label-aware kernel eigenvectors under the framework of semi-supervised eigenfunction extrapolation, which span base kernels that are more useful for learning. Such base kernels are individually better aligned to the learning target, so their mixture will more likely generate a good classifier. Our approach is computationally efficient, and demonstrates encouraging performance in semi-supervised classification and regression.
YNIMG Journal 2012 Journal Article
ICRA Conference 2012 Conference Paper
Stochastic Clustering Auctions (SCAs) constitute a class of cooperative auction methods that enable improvement of the global cost of the task allocations obtained with fast greedy algorithms. Prior research had developed Contracts Sequencing Algorithms (CSAs) that are deterministic and enable transfers, swaps, and other types of contracts between team members. In contrast to CSAs, SCAs use stochastic transfers or swaps between the task clusters assigned to each team member and have algorithm parameters that can enable tradeoffs between optimality and computational and communication requirements. The first SCA was based on a “Gibbs Sampler” and constrained the stochastic cluster reallocations to simple single transfers or swaps; it is applicable to heterogeneous teams. Subsequently, a more efficient SCA was developed, based on the generalized Swendsen-Wang method; it achieves the increased efficiency by connecting tasks that appear to be synergistic and then stochastically reassigning these connected tasks, hence enabling more complex and efficient movements between clusters than the first SCA. However, its application was limited to homogeneous teams. The contribution of this work is to present an efficient SCA for heterogeneous teams; it is based on a modified Swendsen-Wang method. For centralized auctioning and homogeneous teams, extensive numerical experiments were used to provide a comparison in terms of costs and computational and communication requirements of the three SCAs and a baseline CSA. It was seen that the new SCA maintains the efficiency of the second SCA and can yield similar performance to the baseline CSA in far fewer iterations. The same metrics were used to evaluate the performance of the new SCA for heterogeneous teams. A distributed version of the new SCA was also evaluated in numerical experiments. The results show that, as expected, the distributed SCA continually improves the global performance with each iteration, but converges to a lower cost solution than the centralized SCA.
TAAS Journal 2012 Journal Article
This article considers the problem of optimal task allocation for heterogeneous teams, for example, teams of heterogeneous robots or human-robot teams. It is well-known that this problem is NP-hard and hence computationally feasible approaches must develop an approximate solution. Here, we propose a solution via a Stochastic Clustering Auction (SCA) that uses a Markov chain search process along with simulated annealing. This is the first stochastic auction method used in conjunction with global optimization. It is based on stochastic transfer and swap moves between the clusters of tasks assigned to the various robots and considers not only downhill movements, but also uphill movements, which can avoid local minima. A novel feature of this algorithm is that, by tuning the annealing schedule and turning the uphill movements on and off, the global team performance after algorithm convergence can slide in the region between the global optimal performance and the performance associated with a random allocation. Extensive numerical experiments are used to evaluate the performance of SCA in terms of costs and computational and communication requirements. For centralized auctioning, the SCA algorithm is compared to fast greedy auction algorithms. Distributed auctioning is then compared with centralized SCA.
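A skeletal version of the annealed transfer move can be sketched as follows. This is illustrative only: the swap moves, auction messaging, and heterogeneous cost models of the article are omitted, and the toy cost function is invented.

```python
# Sketch: simulated-annealing loop over single-task transfers between robots' clusters,
# accepting uphill moves with Metropolis probability.
import math
import random

def sca_anneal(clusters, cost_fn, iters=2000, t0=1.0, cooling=0.995):
    """clusters: dict robot -> list of tasks; cost_fn(clusters) -> global cost."""
    best = current = cost_fn(clusters)
    temp = t0
    for _ in range(iters):
        src, dst = random.sample(list(clusters), 2)
        if not clusters[src]:
            continue
        task = random.choice(clusters[src])
        clusters[src].remove(task); clusters[dst].append(task)   # propose a transfer
        new = cost_fn(clusters)
        if new <= current or random.random() < math.exp(-(new - current) / temp):
            current = new
            best = min(best, new)
        else:                                                     # reject: undo the move
            clusters[dst].remove(task); clusters[src].append(task)
        temp *= cooling
    return best, clusters

# toy cost: spreading tasks evenly across robots is cheapest
clusters = {"r1": list(range(8)), "r2": [], "r3": []}
cost = lambda c: sum(len(v) ** 2 for v in c.values())
print(sca_anneal(clusters, cost)[0])
```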
IROS Conference 2010 Conference Paper
It will be shown that the global cost of the task allocations obtained with fast greedy algorithms can be improved upon by using a class of auction methods called Stochastic Clustering Auctions (SCAs). SCAs use stochastic transfers or swaps between the task clusters assigned to each team member, allow both uphill and downhill cost movements, and rely on simulated annealing. The choice of a key annealing parameter and turning the uphill movements on and off enables the converged solution of a SCA to slide in the region between the global optimal performance and the performance associated with a random allocation. The first SCA, called GSSCA, was based on a Gibbs Sampler, which constrained the stochastic cluster reallocations to simple single transfers or swaps. This paper presents a new and more efficient SCA, called SWSCA, based on the generalized Swendsen-Wang method that enables more complex and efficient movements between clusters by connecting tasks that appear to be synergistic and then stochastically reassigning these connected tasks. For centralized auctioning, extensive numerical experiments are used to compare the performance of SWSCA with GSSCA in terms of costs and computational and communication requirements. Distributed SWSCA is then compared with centralized SWSCA using communication links between robots that were motivated by a generic topology called a “scale free network”.
TIME Conference 2006 Conference Paper
A probabilistic interval algebra (PIA) network is an interval algebra network with probabilities associated with the labels on an edge. The probabilities on each edge sum to 1. A solution is a consistent scenario in which the product of the probabilities associated with each unique edge label is maximized. In this paper we investigate previous PIA network solution algorithms and propose new ones. Our first algorithm is based on best-first search and is guaranteed to output the optimal solution. However, this algorithm is only feasible for toy problems. We augment the algorithm with three heuristics. Although our proposed algorithm does not guarantee an optimal solution, it is very useful in practice. Good solutions can be generated quickly.
NeurIPS Conference 2006 Conference Paper
Finite mixture models are a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.
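For Gaussian mixtures, the L2 distance mentioned above has a closed form via the identity ∫ N(x; m1, S1) N(x; m2, S2) dx = N(m1; m2, S1 + S2). The sketch below computes that squared distance; it is a minimal illustration of the distance the simplification objective builds on, not the authors' simplification algorithm.

```python
# Closed-form squared L2 distance between two Gaussian mixtures.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_cross(wa, ma, Sa, wb, mb, Sb):
    """sum_ij wa_i * wb_j * integral of N_i * N_j (Gaussian product integral)."""
    total = 0.0
    for wi, mi, Si in zip(wa, ma, Sa):
        for wj, mj, Sj in zip(wb, mb, Sb):
            total += wi * wj * multivariate_normal.pdf(mi, mean=mj, cov=Si + Sj)
    return total

def gmm_l2_sq(wa, ma, Sa, wb, mb, Sb):
    # ||f - g||^2 = <f, f> - 2<f, g> + <g, g>
    return (gmm_cross(wa, ma, Sa, wa, ma, Sa)
            - 2 * gmm_cross(wa, ma, Sa, wb, mb, Sb)
            + gmm_cross(wb, mb, Sb, wb, mb, Sb))

w = [0.5, 0.5]; m = [np.zeros(2), np.ones(2)]; S = [np.eye(2)] * 2
print(gmm_l2_sq(w, m, S, [1.0], [np.zeros(2)], [np.eye(2)]))  # gap to a 1-component model
```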