EAAI Journal 2026 Journal Article
Chaos identification for dairy cow weight time series with deep wavelet transform
- Kexin Meng
- Shanjie Yang
- Ningning Feng
- Shijiao Gao
- Meng Liu
- Shuli Mei
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
Low-Rank Adaptation (LoRA) has emerged as a powerful parameter-efficient fine-tuning method for adapting large language models to downstream tasks. Recent studies have leveraged the Mixture-of-Experts (MoE) mechanism to effectively integrate multiple LoRA modules, facilitating efficient parameter adaptation for multi-task scenarios. It has been shown that fostering knowledge sharing across LoRA experts can greatly enhance parameter adaptation efficiency. However, existing approaches to LoRA expert knowledge sharing still face two key limitations: constrained functional specialization and induced expert homogenization. To address these issues, we propose a novel diversity-regulated asymmetric MoE-LoRA decomposition framework, which achieves flexible knowledge sharing through asymmetric expert decomposition and guarantees expert diversity with a dual orthogonality regularization. Extensive experiments on eight public benchmarks, spanning both multi-task and single-task settings, demonstrate the superiority of our approach over existing methods.
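The abstract names a dual orthogonality regularization but does not spell out its form. As a rough illustration only, here is a minimal PyTorch sketch of one plausible reading, penalizing pairwise overlap separately among the experts' LoRA down-projections and up-projections; all shapes, names, and the penalty itself are assumptions, not the paper's definition.

```python
import torch

def orthogonality_penalty(mats):
    """Penalty on pairwise overlap between expert matrices.

    mats: tensor of shape (num_experts, out_dim, in_dim).
    Flattens each expert, normalizes it, and penalizes off-diagonal entries
    of the Gram matrix, pushing experts toward mutually orthogonal subspaces.
    """
    flat = mats.flatten(start_dim=1)                   # (E, out_dim * in_dim)
    flat = torch.nn.functional.normalize(flat, dim=1)  # unit-norm experts
    gram = flat @ flat.T                               # (E, E) cosine overlaps
    off_diag = gram - torch.eye(gram.size(0), device=gram.device)
    return off_diag.pow(2).sum()

# Hypothetical LoRA experts: E experts, rank r, model dims d_in -> d_out.
E, r, d_in, d_out = 4, 8, 768, 768
A = torch.randn(E, r, d_in, requires_grad=True)   # down-projections
B = torch.randn(E, d_out, r, requires_grad=True)  # up-projections

# "Dual" regularization read as one term per projection side (an assumption).
reg = orthogonality_penalty(A) + orthogonality_penalty(B)
reg.backward()
```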
AAAI Conference 2026 Conference Paper
AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. However, current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique challenges of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D, together with the instruction-tuning dataset EgoIT, which is collected from multiple sources to enhance the model's instruction-following capabilities. Building upon these datasets, we propose a migration strategy and further design a progressive mapping learning pipeline with three stages: Demonstrator Self-Preparation, Demonstrator-Learner Guidance, and Learner Self-Practice. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.
AAAI Conference 2026 Conference Paper
Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) → intention inference (reason) → action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
TIST Journal 2026 Journal Article
Transformer-based semantic segmentation methods have demonstrated outstanding performance by leveraging global self-attention to effectively capture long-range dependencies. However, two issues remain in existing works: (1) Most of them rely on full-rank weight matrices in the self-attention mechanism and feed-forward network to model long-range dependencies between patches/pixels, resulting in high computational cost during both training and inference. (2) Most of them ignore information interactions between high-level semantics and low-level structures during image resolution recovery, which degrades performance when segmenting objects with complex boundaries. To tackle these challenges, a lightweight Kolmogorov-Arnold Transformer model (LKAFormer) is proposed for image semantic segmentation, comprising a two-stream lightweight Transformer encoder and a graph feature pyramid aggregation KAN-decoder. The former constructs a hierarchical cross-scale feature fusion pipeline that obtains rich semantics with comprehensive multi-scale information by setting up coarse-grained and fine-grained streams with different-sized image patches. In that pipeline, feature lightweight focusing modules model complex, long-range dependencies across patches/pixels to refine image semantics at lower computational cost via lightweight multi-head self-attention and lightweight feed-forward network designs. The latter leverages the learnable nonlinear transformation mechanism of the Kolmogorov-Arnold architecture to adaptively capture the spatial structure dependencies of distinct image sub-regions. It then jointly performs intra-scale and cross-scale graph fusion during image resolution recovery to enhance information interactions between high-level semantics and low-level structures, achieving robust boundary localization and texture refinement of segmented objects. Finally, extensive experiments are conducted on three challenging datasets, and the results show that LKAFormer sets a new baseline for image semantic segmentation in comparison with 11 methods.
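The abstract attributes the computational savings to avoiding full-rank weight matrices in the self-attention and feed-forward blocks. The paper's concrete design is not given here; the following PyTorch sketch only illustrates the generic low-rank factorization idea behind such lightweight attention, with all module names and sizes assumed.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Rank-r factorization W ~ U @ V of a d_out x d_in projection.

    Parameter count drops from d_in*d_out to r*(d_in + d_out), which is
    the generic trick behind many 'lightweight' attention designs.
    """
    def __init__(self, d_in, d_out, r):
        super().__init__()
        self.V = nn.Linear(d_in, r, bias=False)   # d_in -> r
        self.U = nn.Linear(r, d_out, bias=False)  # r -> d_out
    def forward(self, x):
        return self.U(self.V(x))

class LiteSelfAttention(nn.Module):
    """Multi-head self-attention whose Q/K/V/output projections are low rank."""
    def __init__(self, d_model=256, heads=4, r=32):
        super().__init__()
        self.heads, self.d_head = heads, d_model // heads
        self.qkv = nn.ModuleList(LowRankLinear(d_model, d_model, r) for _ in range(3))
        self.out = LowRankLinear(d_model, d_model, r)
    def forward(self, x):                       # x: (batch, tokens, d_model)
        b, n, d = x.shape
        q, k, v = (p(x).view(b, n, self.heads, self.d_head).transpose(1, 2)
                   for p in self.qkv)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.out(y)

x = torch.randn(2, 196, 256)
print(LiteSelfAttention()(x).shape)  # torch.Size([2, 196, 256])
```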
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes a multi-modal query consisting of a reference video and a piece of modification text as input. The modification text conveys the user's intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there exists a substantial discrepancy in information density between video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant due to the presence of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRivEn dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack exhibits strong generalization to the Composed Image Retrieval (CIR) task, achieving SOTA performance across three benchmark datasets in both CVR and CIR scenarios.
AAAI Conference 2026 Conference Paper
Understanding 3D scene-level affordances from natural language instructions is essential for enabling embodied agents to interact meaningfully in complex environments. However, this task remains challenging due to the need for semantic reasoning and spatial grounding. Existing methods mainly focus on object-level affordances or merely lift 2D predictions to 3D, neglecting rich geometric structure information in point clouds and incurring high computational costs. To address these limitations, we introduce Task-Aware 3D Scene-level Affordance segmentation (TASA), a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To improve the affordance detection efficiency, TASA features a task-aware 2D affordance detection module to identify manipulable points from language and visual inputs, guiding the selection of task-relevant views. To fully exploit 3D geometric information, a 3D affordance refinement module is proposed to integrate 2D semantic priors with local 3D geometry, resulting in accurate and spatially coherent 3D affordance masks. Experiments on SceneFun3D demonstrate that TASA significantly outperforms the baselines in both accuracy and efficiency in scene-level affordance segmentation.
AAAI Conference 2026 Conference Paper
Video large language models have achieved remarkable performance in tasks such as video question answering; however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.
JBHI Journal 2026 Journal Article
Automatic medical report generation (MRG) has advanced significantly with retrieval-augmented strategies. However, existing methods face two persistent challenges: 1) a heavy reliance on single-modal retrieval, which limits multimodal semantic capture and cross-modal alignment; and 2) a lack of reliable information control, leading to irrelevant noisy content and potential hallucinations. To address these limitations, we propose Uncertainty-aware Cross-modal Alignment and Refinement, named U-CAR, a unified framework that enhances both semantic integration and retrieval reliability. First, a cross-modal alignment module explicitly learns fine-grained correspondences between visual and textual representations, ensuring consistent semantics across modalities. This alignment guides the construction of dual-path retrieval-aware memory banks, one in the visual domain and one in the textual domain, enabling retrieval to capture complementary cues from both modalities. Second, we design a cross-modal retrieval-augmented generation strategy that jointly attends to the retrieved visual and textual context, thereby enriching semantic coverage and reinforcing the integration of multi-modal evidence in the generated reports. In parallel, we introduce an uncertainty-aware refinement mechanism that quantifies generation confidence to adaptively determine the necessity of retrieval. Experiments on the IU X-Ray and MIMIC-CXR datasets demonstrate that U-CAR outperforms the current state-of-the-art methods, achieving a 9% improvement in CIDEr on IU X-Ray and a 4% gain in BLEU-4 on MIMIC-CXR. These results underscore U-CAR's effectiveness in generating accurate, coherent, and clinically relevant medical reports. Code is available at https://github.com/Zhounan1222/U-CAR/tree/main.
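U-CAR's uncertainty-aware refinement "quantifies generation confidence to adaptively determine the necessity of retrieval." The exact confidence measure is not stated in the abstract; a minimal sketch using mean per-token predictive entropy as the gate, purely as an assumed stand-in, could look like this:

```python
import torch

def should_retrieve(token_logits, entropy_threshold=2.0):
    """Decide whether to trigger retrieval for the current generation.

    Confidence proxy (an assumption; the paper's measure may differ):
    mean per-token predictive entropy over the generated prefix.
    High entropy -> the generator is uncertain -> consult retrieved reports.
    """
    probs = torch.softmax(token_logits, dim=-1)               # (steps, vocab)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (steps,)
    return entropy.mean().item() > entropy_threshold

logits = torch.randn(12, 30522)   # 12 generated tokens, a BERT-sized vocab
print(should_retrieve(logits))
```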
AAAI Conference 2026 Conference Paper
Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an “Audio-Visual Confusion” scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs “Is there a/an {muted-object} sound”. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves accuracy by 10~30% over the baseline model with limited training data.
EAAI Journal 2025 Journal Article
AAMAS Conference 2025 Conference Paper
In this work, we propose a simple and effective framework (i.e., Lite-DIO), marking the first attempt to accelerate deep inertial odometry with knowledge distillation. In Lite-DIO, we first independently construct a Transformer-based teacher model and a lightweight student network. Then, knowledge is adaptively transferred from the teacher model to the student network in a dual-level contrastive distillation manner. With this design, the distilled knowledge comes not only from the teacher model's predictions but also from the latent high-order collaborative semantics preserved in its embeddings. Extensive experiments conducted on three real-world datasets demonstrate that the proposed Lite-DIO significantly reduces model size and inference time compared to existing popular alternatives, while the compressed model still maintains competitive localization accuracy.
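The abstract describes distillation at two levels: the teacher's predictions and the collaborative semantics in its embeddings. As a hedged sketch (the paper's actual losses may differ), one could combine a prediction-matching term with an InfoNCE-style contrastive term between student and teacher embeddings:

```python
import torch
import torch.nn.functional as F

def dual_level_distillation(student_pred, teacher_pred,
                            student_emb, teacher_emb, tau=0.1, alpha=0.5):
    """Sketch of a two-part distillation objective (assumed form).

    Prediction level: match the teacher's regressed motion outputs.
    Embedding level: InfoNCE that pulls each student embedding toward the
    teacher embedding of the same window and away from other windows in
    the batch (one plausible reading of 'contrastive distillation').
    """
    pred_loss = F.mse_loss(student_pred, teacher_pred)

    s = F.normalize(student_emb, dim=1)        # (B, D)
    t = F.normalize(teacher_emb, dim=1)        # (B, D)
    logits = s @ t.T / tau                     # (B, B) similarity matrix
    targets = torch.arange(s.size(0))          # positives on the diagonal
    contrast_loss = F.cross_entropy(logits, targets)

    return alpha * pred_loss + (1 - alpha) * contrast_loss

B, D = 32, 128
loss = dual_level_distillation(torch.randn(B, 3), torch.randn(B, 3),
                               torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```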
NeurIPS Conference 2025 Conference Paper
Spatial transcriptomics (ST) technologies provide gene expression measurements with spatial resolution, enabling the dissection of tissue structure and function. A fundamental challenge in ST analysis is clustering spatial spots into coherent functional regions. While existing models effectively integrate expression and spatial signals, they largely overlook sequence-level biological priors encoded in the DNA sequences of expressed genes. To bridge this gap, we propose SAINT (Sequence-Aware Integration for Nucleotide-informed Transcriptomics), a unified framework that augments spatial representation learning with nucleotide-derived features. We construct sequence-augmented datasets across 14 tissue sections from three widely used ST benchmarks (DLPFC, HBC, and MBA), retrieving reference DNA sequences for each expressed gene and encoding them using a pretrained Nucleotide Transformer. For each spot, gene-level embeddings are aggregated via expression-weighted and attention-based pooling, then fused with spatial-expression representations through a late fusion module. Extensive experiments demonstrate that SAINT consistently improves clustering performance across multiple datasets, validating the superiority, effectiveness, sensitivity, and transferability of our framework and confirming the complementary value of incorporating sequence-level priors into spatial transcriptomics clustering.
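Of the two pooling schemes named in the abstract, the expression-weighted variant is straightforward to sketch: each spot's embedding is the expression-weighted average of its genes' sequence embeddings. Shapes and data below are hypothetical; the attention-based variant would learn the weights instead.

```python
import numpy as np

def expression_weighted_pooling(gene_embeddings, expression):
    """Aggregate gene-level sequence embeddings into one vector per spot.

    gene_embeddings: (num_genes, dim), e.g., Nucleotide Transformer outputs.
    expression: (num_spots, num_genes) expression counts per spot.
    Each spot's embedding is the expression-weighted average of the
    embeddings of its expressed genes.
    """
    weights = expression / np.maximum(expression.sum(axis=1, keepdims=True), 1e-9)
    return weights @ gene_embeddings            # (num_spots, dim)

genes = np.random.randn(2000, 512)              # hypothetical sizes
counts = np.random.poisson(0.3, size=(1500, 2000)).astype(float)
spots = expression_weighted_pooling(genes, counts)
print(spots.shape)                              # (1500, 512)
```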
NeurIPS Conference 2025 Conference Paper
Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.
TMLR Journal 2024 Journal Article
Message passing graph neural networks (GNNs) are known to have their expressiveness upper-bounded by the 1-dimensional Weisfeiler-Leman (1-WL) algorithm. To achieve more powerful GNNs, existing attempts either require ad hoc features or involve operations that incur high time and space complexity. In this work, we propose a general and provably powerful GNN framework that preserves the scalability of the message passing scheme. In particular, we first propose to empower 1-WL for the graph isomorphism test by considering edges among neighbors, giving rise to NC-1-WL. The expressiveness of NC-1-WL is shown theoretically to be strictly above 1-WL and below 3-WL. Further, we propose the NC-GNN framework as a differentiable neural version of NC-1-WL. Our simple implementation of NC-GNN is provably as powerful as NC-1-WL. Experiments demonstrate that NC-GNN performs effectively and efficiently on various benchmarks.
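To make the "edges among neighbors" idea concrete, here is a toy color-refinement step in the spirit of NC-1-WL. It is a sketch, not the paper's exact definition: plain 1-WL hashes a node's color together with its neighbors' colors, and this version additionally hashes which neighbor pairs are themselves adjacent, so triangles through a node become visible.

```python
from collections import Counter

def nc_wl_step(colors, adj):
    """One refinement step in the spirit of NC-1-WL (assumed variant)."""
    new_colors = {}
    for v, nbrs in adj.items():
        nbr_colors = tuple(sorted(colors[u] for u in nbrs))
        # Multiset of color pairs for neighbor pairs that are adjacent.
        connected_pairs = Counter(
            tuple(sorted((colors[u], colors[w])))
            for i, u in enumerate(nbrs) for w in nbrs[i + 1:]
            if w in adj[u]                      # the extra "edge among neighbors" signal
        )
        new_colors[v] = hash((colors[v], nbr_colors,
                              tuple(sorted(connected_pairs.items()))))
    return new_colors

# Triangle vs. path: from uniform initial colors, node 0 looks identical to
# plain 1-WL in both graphs, but the edge among its neighbors in the
# triangle separates the two here in a single step.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1, 2], 1: [0], 2: [0]}
c = {v: 0 for v in triangle}
print(nc_wl_step(c, triangle)[0] == nc_wl_step(c, path)[0])  # False
```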
AAAI Conference 2024 Conference Paper
Crime prediction is a crucial yet challenging task within urban computing, which benefits public safety and resource optimization. Over the years, various models have been proposed, and spatial-temporal hypergraph learning models have recently shown outstanding performance. However, three correlations underlying crime are ignored, thus hindering the performance of previous models. Specifically, there are two spatial correlations and one temporal correlation, i.e., (1) co-occurrence of different types of crimes (type spatial correlation), (2) the closer to the crime center, the more dangerous the neighborhood area is (neighbor spatial correlation), and (3) the closer two timestamps are, the more relevant the events are (Hawkes temporal correlation). To this end, we propose the Hawkes-enhanced Spatial-Temporal Hypergraph Contrastive Learning framework (HCL), which mines the aforementioned correlations via two specific strategies. Concretely, contrastive learning strategies are designed for the two spatial correlations, and Hawkes process modeling is adopted for the temporal correlation. Extensive experiments demonstrate the promising capacities of HCL from four aspects, i.e., superiority, transferability, effectiveness, and sensitivity.
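The Hawkes temporal correlation refers to the standard self-exciting point-process intensity, in which a past event's influence decays with elapsed time. The sketch below shows the usual exponential kernel; how HCL actually embeds this in its hypergraph model is not specified in the abstract.

```python
import numpy as np

def hawkes_weights(event_times, now, beta=0.1):
    """Exponential-decay weights of past crime events at time `now`.

    The standard Hawkes intensity
        lambda(t) = mu + sum_i alpha * exp(-beta * (t - t_i))
    encodes "the closer two timestamps are, the more relevant the events
    are"; this returns the per-event excitation terms (mu, alpha omitted).
    """
    event_times = np.asarray(event_times, dtype=float)
    past = event_times[event_times < now]
    return np.exp(-beta * (now - past))

print(hawkes_weights([1.0, 5.0, 9.0], now=10.0))  # older events weigh less
```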
AAAI Conference 2024 Conference Paper
GraIL and its variants have shown promising capacity for inductive relation reasoning on knowledge graphs. However, the uni-directional message-passing mechanism hinders such models from exploiting hidden mutual relations between entities in directed graphs. Besides, the enclosing subgraph extraction in most GraIL-based models prevents them from extracting enough discriminative information for reasoning. Consequently, the expressive ability of these models is limited. To address these problems, we propose a novel GraIL-based framework, termed MINES, by introducing a Message Intercommunication mechanism on the Neighbor-Enhanced Subgraph. Concretely, the message intercommunication mechanism is designed to capture the omitted hidden mutual information. It introduces bi-directed information interactions between connected entities by inserting an undirected/bi-directed GCN layer between uni-directed RGCN layers. Moreover, inspired by the success of involving more neighbors in other graph-based tasks, we extend the neighborhood area beyond the enclosing subgraph to enhance the information collection for inductive relation reasoning. Extensive experiments demonstrate the promising capacity of the proposed MINES from various aspects, especially its superiority, effectiveness, and transferability.
NeurIPS Conference 2024 Conference Paper
Fragment-based drug discovery, in which molecular fragments are assembled into new molecules with desirable biochemical properties, has achieved great success. However, many fragment-based molecule generation methods show limited exploration beyond the existing fragments in the database as they only reassemble or slightly modify the given ones. To tackle this problem, we propose a new fragment-based molecule generation framework with retrieval augmentation, namely Fragment Retrieval-Augmented Generation (f-RAG). f-RAG is based on a pre-trained molecular generative model that proposes additional fragments from input fragments to complete and generate a new molecule. Given a fragment vocabulary, f-RAG retrieves two types of fragments: (1) hard fragments, which serve as building blocks that will be explicitly included in the newly generated molecule, and (2) soft fragments, which serve as reference to guide the generation of new fragments through a trainable fragment injection module. To extrapolate beyond the existing fragments, f-RAG updates the fragment vocabulary with generated fragments via an iterative refinement process which is further enhanced with post-hoc genetic fragment modification. f-RAG can achieve an improved exploration-exploitation trade-off by maintaining a pool of fragments and expanding it with novel and high-quality fragments through a strong generative prior.
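A minimal sketch of the hard/soft retrieval split described above, assuming fragments and queries live in a shared embedding space; the embedding model, the cutoffs, and cosine similarity are all assumptions rather than the paper's specification.

```python
import numpy as np

def retrieve_fragments(query_emb, vocab_embs, k_hard=2, k_soft=4):
    """Split top-scoring vocabulary fragments into 'hard' and 'soft' sets.

    Hard fragments (the best matches) would be pasted into the new molecule
    verbatim; the next tier acts as soft guidance for the trainable
    fragment injection module.
    """
    sims = vocab_embs @ query_emb / (
        np.linalg.norm(vocab_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    order = np.argsort(-sims)                      # best matches first
    return order[:k_hard], order[k_hard:k_hard + k_soft]

vocab = np.random.randn(500, 64)    # hypothetical fragment embeddings
query = np.random.randn(64)
hard, soft = retrieve_fragments(query, vocab)
print(hard, soft)
```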
YNIMG Journal 2023 Journal Article
NeurIPS Conference 2023 Conference Paper
We tackle the problem of graph out-of-distribution (OOD) generalization. Existing graph OOD algorithms either rely on restricted assumptions or fail to exploit environment information in training data. In this work, we propose to simultaneously incorporate label and environment causal independence (LECI) to fully make use of label and environment information, thereby addressing the challenges faced by prior methods in identifying causal and invariant subgraphs. We further develop an adversarial training strategy to jointly optimize these two properties for causal subgraph discovery with theoretical guarantees. Extensive experiments and analysis show that LECI significantly outperforms prior methods on both synthetic and real-world datasets, establishing LECI as a practical and effective solution for graph OOD generalization.
NeurIPS Conference 2023 Conference Paper
Supervised machine learning approaches have been increasingly used in accelerating electronic structure prediction as surrogates of first-principles computational methods, such as density functional theory (DFT). While numerous quantum chemistry datasets focus on chemical properties and atomic forces, the ability to achieve accurate and efficient prediction of the Hamiltonian matrix is highly desired, as it is the most important and fundamental physical quantity that determines the quantum states of physical systems and chemical properties. In this work, we generate a new Quantum Hamiltonian dataset, named QH9, to provide precise Hamiltonian matrices for 2,399 molecular dynamics trajectories and 130,831 stable molecular geometries, based on the QM9 dataset. By designing benchmark tasks with various molecules, we show that current machine learning models have the capacity to predict Hamiltonian matrices for arbitrary molecules. Both the QH9 dataset and the baseline models are provided to the community through an open-source benchmark, which can be highly valuable for developing machine learning methods and accelerating molecular and materials design for scientific and technological applications. Our benchmark is publicly available at https://github.com/divelab/AIRS/tree/main/OpenDFT/QHBench.
NeurIPS Conference 2023 Conference Paper
In this paper, we present a novel problem, namely video timeline modeling. Our objective is to create a video-associated timeline from a set of videos related to a specific topic, thereby facilitating the content and structure understanding of the story being told. This problem has significant potential in various real-world applications, for instance, news story summarization. To bootstrap research in this area, we curate a realistic benchmark dataset, YouTube-News-Timeline, consisting of over 12k timelines and 300k YouTube news videos. Additionally, we propose a set of quantitative metrics to comprehensively evaluate and compare methodologies. With such a testbed, we further develop and benchmark several deep learning approaches to tackling this problem. We anticipate that this exploratory work will pave the way for further research in video timeline modeling. The assets are available via https://github.com/google-research/google-research/tree/master/video_timeline_modeling.
JMLR Journal 2021 Journal Article
Although several libraries exist for deep learning on graphs, they aim at implementing basic operations for graph deep learning. In the research community, implementing and benchmarking various advanced tasks is still painful and time-consuming with existing libraries. To facilitate graph deep learning research, we introduce DIG: Dive into Graphs, a turnkey library that provides a unified testbed for higher-level, research-oriented graph deep learning tasks. Currently, we consider graph generation, self-supervised learning on graphs, explainability of graph neural networks, and deep learning on 3D graphs. For each direction, we provide unified implementations of data interfaces, common algorithms, and evaluation metrics. Altogether, DIG is an extensible, open-source, and turnkey library for researchers to develop new methods and effortlessly compare with common baselines using widely used datasets and evaluation metrics. Source code is available at https://github.com/divelab/DIG.
NeurIPS Conference 2020 Conference Paper
Graph-based semi-supervised learning is the problem of learning a labeling function for the graph nodes given a few example nodes, often called seeds, usually under the assumption that the graph's edges indicate similarity of labels. This is closely related to the local graph clustering or community detection problem of finding a cluster or community of nodes around a given seed. For this problem, we propose a novel generalization of the random walk, diffusion, or smooth function methods in the literature to a convex p-norm cut function. Our p-norm methods are needed because, in our study of existing methods, we find that principled methods based on eigenvectors, spectral techniques, random walks, or linear systems often have difficulty capturing the correct boundary of a target label or target cluster. In contrast, 1-norm or maxflow-mincut based methods capture the boundary, but cannot grow from a small seed set; hybrid procedures that use both have many hard-to-set parameters. In this paper, we propose a generalization of the objective function behind these methods involving p-norms. To solve the p-norm cut problem we give a strongly local algorithm -- one whose runtime depends on the size of the output rather than the size of the graph. Our method can be thought of as a nonlinear generalization of the Andersen-Chung-Lang push procedure for efficiently approximating a personalized PageRank vector. Our procedure is general and can solve other types of nonlinear objective functions, such as p-norm variants of Huber losses. We provide a theoretical analysis of finding planted target clusters with our method and show that the p-norm cut functions improve on the standard Cheeger inequalities for random walk and spectral methods. Finally, we demonstrate the speed and accuracy of our new method on synthetic and real-world datasets.
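One common way to write the family of seeded cut objectives the abstract describes is the following; the notation is assumed for illustration and is not necessarily the paper's exact formulation.

```latex
% Seeded p-norm cut objective (a sketch; notation is assumed, not the
% paper's exact formulation).  Here w_{uv} are edge weights, s is the
% indicator vector of the seed set, d_v are node degrees, and kappa
% trades off localization around the seeds against cut quality.
\min_{x \ge 0}\; \sum_{(u,v) \in E} w_{uv}\,\lvert x_u - x_v \rvert^{p}
\;+\; \kappa \sum_{v \in V} d_v\,\lvert x_v - s_v \rvert^{p}
% p = 2 recovers spectral / personalized-PageRank-style diffusions,
% while p -> 1 approaches maxflow-mincut behavior, matching the
% trade-off the abstract describes.
```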
NeurIPS Conference 2020 Conference Paper
Ensembling is a general way of improving the accuracy and stability of learning models, especially for the generalization ability on small datasets. Compared with tree-based methods, relatively little work has been devoted to an in-depth study of effective ensemble design for neural networks. In this paper, we propose a principled ensemble technique by constructing a so-called diversified ensemble layer to combine multiple networks as individual modules. We theoretically show that each individual model in our ensemble layer corresponds to weights in the ensemble layer optimized in different directions. Meanwhile, the devised ensemble layer can be readily integrated into popular neural architectures, including CNNs, RNNs, and GCNs. Extensive experiments are conducted on public tabular datasets, images, and texts. With a weight-sharing approach, the results show that our method can notably improve the accuracy and stability of the original neural networks with negligible extra time and space overhead.
AAAI Conference 2017 Conference Paper
Feature selection aims to select a small subset of high-dimensional features that leads to better learning performance, lower computational complexity, and better model readability. The class imbalance problem has been neglected by traditional feature selection methods, so the selected features tend to be biased towards the majority classes. Because F-measure is superior to accuracy for imbalanced data, we propose to use F-measure as the performance measure for feature selection algorithms. Since F-measure is a pseudo-linear function, its optimization can be achieved by minimizing total costs. In this paper, we present a novel cost-sensitive feature selection (CSFS) method which optimizes F-measure instead of accuracy to take the class imbalance issue into account. Features are selected according to the optimal F-measure classifier after solving a series of cost-sensitive feature selection sub-problems. The features selected by our method fully represent the characteristics of not only the majority classes but also the minority classes. Extensive experimental results on synthetic, multi-class, and multi-label datasets validate the efficiency and significance of our feature selection method.
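For reference, the standard confusion-matrix form of the F-measure, whose pseudo-linearity (a ratio of linear functions of TP, FP, FN) is what allows the reduction to cost-sensitive sub-problems; the specific cost schedule is the paper's contribution and is not reproduced here.

```latex
% Standard confusion-matrix identity for the F-measure.
F_\beta \;=\; \frac{(1+\beta^2)\,\mathrm{TP}}
                   {(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}}
% Because F_beta is a ratio of linear functions of (TP, FP, FN), i.e.
% pseudo-linear, maximizing it can be reduced to solving a series of
% cost-sensitive problems of the form  min  c_FN * FN + c_FP * FP
% over candidate cost settings, keeping the solution with the best F_beta.
```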
IJCAI Conference 2017 Conference Paper
Support vector machine (SVM) is one of the most frequently used classifiers for machine learning tasks. However, its training time can become prohibitive when the training set is very large. Thus, many kinds of representative subsets are chosen from the original dataset to reduce the training complexity. In this paper, we propose to choose representative points, termed anchors, obtained from non-negative matrix factorization (NMF) in a divide-and-conquer framework, and then use the anchors to train an approximate SVM. Our theoretical analysis shows that solving DCA-SVM yields an approximate solution close to that of the primal SVM. Experimental results on multiple datasets demonstrate that our DCA-SVM is faster than state-of-the-art algorithms without notably decreasing the accuracy of classification results.
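A rough end-to-end sketch of the anchor idea with scikit-learn, under stated assumptions: the divide-and-conquer splitting is omitted, the data is synthetic, and labeling each anchor by the majority label of its assigned points is a guess at one reasonable rule, not necessarily the paper's.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC

# Factor the non-negative data as X ~ W @ H, treat the rows of H as anchors,
# and train an SVM on the anchors instead of all n training points.
rng = np.random.default_rng(0)
X = rng.random((2000, 64))                     # NMF requires non-negative data
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # synthetic binary labels

nmf = NMF(n_components=32, init="nndsvda", max_iter=400, random_state=0)
W = nmf.fit_transform(X)                       # (2000, 32) soft assignments
H = nmf.components_                            # (32, 64) anchors in input space

# Assumed rule: label each anchor by the majority label of points assigned to it.
assign = W.argmax(axis=1)
anchor_y = np.array([np.bincount(y[assign == k], minlength=2).argmax()
                     if (assign == k).any() else 0
                     for k in range(H.shape[0])])

# Train on 32 anchors instead of 2000 points, then evaluate on the full set.
svm = SVC(kernel="rbf").fit(H, anchor_y)
print(svm.score(X, y))
```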
AAAI Conference 2015 Conference Paper
Multi-label image classification is of significant interest due to its major role in real-world web image analysis applications such as large-scale image retrieval and browsing. Recently, matrix completion (MC) has been developed to deal with multi-label classification tasks. MC has distinct advantages, such as robustness to missing entries in the feature and label spaces and a natural ability to handle multi-label problems. However, current MC-based multi-label image classification methods only consider data represented by a single-view feature and therefore do not precisely characterize images that contain several semantic concepts. An intuitive way to utilize multiple features taken from different views is to concatenate the different features into a long vector; however, this concatenation is prone to over-fitting and leads to high time complexity in MC-based image classification. Therefore, we present a novel multi-view learning model for MC-based image classification, called low-rank multi-view matrix completion (lrMMC), which first seeks a low-dimensional common representation of all views by utilizing the proposed low-rank multi-view learning (lrMVL) algorithm. In lrMVL, the common subspace is constrained to be low rank so that it is suitable for MC. In addition, combination weights are learned to explore complementarity between different views. An efficient solver based on fixed-point continuation (FPC) is developed for optimization, and the learned low-rank representation is then incorporated into MC-based image classification. Extensive experimentation on the challenging PASCAL VOC'07 dataset demonstrates the superiority of lrMMC compared to other multi-label image classification approaches.