Arrow Research search

Author name cluster

Lin Ma

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

52 papers
1 author row

Possible papers

52

AAAI Conference 2026 Conference Paper

DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation

  • Xuexun Liu
  • Xiaoxu Xu
  • Qiudan Zhang
  • Lin Ma
  • Xu Wang

Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling effort. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo-labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training of an end-to-end instance segmentation network on the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo-labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches.

AAAI Conference 2026 Conference Paper

Leveraging Visual Blur Perception Characteristics for EEG Decoding

  • Wenchao Liu
  • Hongwei Li
  • Zhouyang Xu
  • Lin Ma
  • Haifeng Li

In recent years, electroencephalography (EEG)-based visual decoding research has become a key direction for revealing brain processing mechanisms and realizing brain-computer interfaces. This emerging field has attracted extensive attention in the fields of brain science, cognitive neuroscience, and artificial intelligence. Among various approaches, contrastive learning has demonstrated strong performance in aligning multi-modal data, effectively enabling unified representations across modalities. However, during human visual perception, images are often subject to varying degrees of blurring due to the uneven distribution of retinal photoreceptor cells and the limited speed of lens accommodation. To address the mismatch between EEG and visual representations, we propose a novel visual decoding framework inspired by human perceptual blurring. Specifically, multi-level Gaussian blurring is applied to the visual stimuli to simulate human visual characteristics, followed by a feature selection module to construct robust visual representations. For EEG decoding, we design a lightweight and efficient network employing positively constrained spatial convolutions to identify channels associated with visual processing. The EEG and visual features are then aligned using contrastive learning. We evaluate the proposed framework on the Things-EEG dataset. Experimental results show significant improvements in the zero-shot brain-to-image retrieval task, achieving a top-1 accuracy of 80% and a top-5 accuracy of 96.9%, surpassing previous state-of-the-art methods by margins of 29.1% and 17.2%, respectively. These findings highlight the potential of incorporating perceptual properties into EEG-based visual decoding.
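
The contrastive alignment recipe sketched in this abstract can be illustrated with a minimal example. The snippet below is an illustration only (not the authors' implementation): it builds a small blur pyramid over a batch of visual stimuli and aligns EEG embeddings with image embeddings via a CLIP-style symmetric InfoNCE loss; the blur levels, kernel size, and temperature are assumptions.

```python
# Minimal sketch (not the paper's code): multi-level Gaussian blurring of the
# visual stimuli plus a CLIP-style symmetric contrastive loss aligning EEG and
# image embeddings. Blur levels, kernel size, and temperature are assumptions.
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def blur_pyramid(images, sigmas=(0.5, 1.0, 2.0)):
    """Return progressively blurred copies of a batch of images (B, 3, H, W)."""
    return [gaussian_blur(images, kernel_size=9, sigma=s) for s in sigmas]

def clip_style_loss(eeg_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized EEG and image embeddings (B, D)."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = eeg_emb @ img_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(eeg_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```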

AAAI Conference 2026 Conference Paper

X-SAM: From Segment Anything to Any Segmentation

  • Hao Wang
  • Limeng Qiao
  • Zequn Jie
  • Zhijian Huang
  • Chengjian Feng
  • Qingfang Zheng
  • Lin Ma
  • Xiangyuan Lan

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from segment anything to any segmentation. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.

EAAI Journal 2025 Journal Article

A grid-based boundary sharpening clustering algorithm

  • Lin Ma
  • Qijing Yan
  • Mengxia Lv
  • Tiefeng Ma
  • Mingchang Cheng

To address the clustering problem for arbitrary shapes, in this paper we propose a grid-based boundary sharpening clustering algorithm called “GBSharp”. This method is grounded in morphology and relies on two fundamental morphological operations: dilation and erosion. The main innovations of the proposed algorithm lie in two aspects. First, we introduce the concepts of inward dilation and bridge erosion based on the basic morphological operations to reduce the impact of the chain effect. Second, a unique indexing structure is designed specifically for non-empty cells in high-dimensional space. In addition, to tackle the complex conditional judgments encountered in high-dimensional scenarios, we further utilize the inversion method for the bridge-erosion operation. Experiments conducted on synthetic and real-world datasets further validate the effectiveness and efficiency of the proposed algorithm.
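
As a rough sketch of the underlying morphological idea (not GBSharp itself), the snippet below rasterizes points onto a grid, applies plain binary dilation and erosion to the occupancy mask, and labels connected components; the inward-dilation and bridge-erosion refinements and the high-dimensional cell index described in the abstract are omitted, and the cell size and iteration counts are assumptions.

```python
# Illustrative sketch only: grid rasterization + dilation/erosion + connected
# components. GBSharp's inward dilation, bridge erosion, and cell indexing are
# not reproduced here.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion, label

def grid_morph_cluster(points, cell=0.5, dilate_iters=1, erode_iters=1):
    mins = points.min(axis=0)
    idx = np.floor((points - mins) / cell).astype(int)       # grid cell of each point
    occ = np.zeros(idx.max(axis=0) + 1, dtype=bool)
    occ[tuple(idx.T)] = True                                  # occupancy mask
    mask = binary_dilation(occ, iterations=dilate_iters)      # close small gaps
    mask = binary_erosion(mask, iterations=erode_iters)       # sharpen boundaries
    labels_grid, n_clusters = label(mask)                     # connected components
    # points whose cell was eroded away get label 0 (treated as noise)
    return labels_grid[tuple(idx.T)], n_clusters

# usage: labels, k = grid_morph_cluster(np.random.rand(1000, 2))
```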

AAAI Conference 2025 Conference Paper

Affordances-Oriented Planning Using Foundation Models for Continuous Vision-Language Navigation

  • Jiaqi Chen
  • Bingqian Lin
  • Xinmin Liu
  • Lin Ma
  • Xiaodan Liang
  • Kwan-Yee K. Wong

LLM-based agents have demonstrated impressive zero-shot performance on the vision-language navigation (VLN) task. However, existing LLM-based methods often focus only on solving high-level task planning by selecting nodes in predefined navigation graphs for movements, overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for the continuous VLN task. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach, where the visible ground is segmented by SAM to provide navigational affordances, based on which the LLM selects potential candidate waypoints and plans low-level paths towards the selected waypoints. We further propose a high-level PathAgent which marks planned paths onto the image input and reasons about the most probable path by comprehending all environmental information. Finally, we convert the selected path into 3D coordinates using camera intrinsic parameters and depth information, avoiding challenging 3D predictions for LLMs. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (8.8% improvement on SPL). Our method can also serve as a data annotator to obtain pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor does not require any waypoint data from the simulator and achieves 47% SR, competitive with supervised methods. We establish an effective connection between LLMs and the 3D world, presenting novel prospects for employing foundation models in low-level motion control.

NeurIPS Conference 2025 Conference Paper

FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

  • Siyu Jiao
  • Gengwei Zhang
  • Yinlong Qian
  • Jiancheng Huang
  • Yao Zhao
  • Humphrey Shi
  • Lin Ma
  • Yunchao Wei

This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images (< 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhancing image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256 × 256 benchmark. Moreover, when the image generation process is transferred zero-shot with 13 steps, the performance further improves to 2.08 FID, outperforming the state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and the popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512 × 512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512 × 512 resolution.

NeurIPS Conference 2025 Conference Paper

GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection

  • Jiaming Li
  • Zhijia Liang
  • Weikai Chen
  • Lin Ma
  • Guanbin Li

Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings -- leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited for its respective roles. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. Then the detector is guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or overrepresented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization.

IJCAI Conference 2025 Conference Paper

Learning Dynamical Coupled Operator For High-dimensional Black-box Partial Differential Equations

  • Yichi Wang
  • Tian Huang
  • Dandan Huang
  • Zhaohai Bai
  • Xuan Wang
  • Lin Ma
  • Haodi Zhang

The deep operator networks (DON), a class of neural operators that learn mappings between function spaces, have recently emerged as surrogate models for parametric partial differential equations (PDEs). However, their full potential for accurately approximating general black-box PDEs remains underexplored due to challenges in training stability and performance, primarily arising from difficulties in learning mappings between low-dimensional inputs and high-dimensional outputs. Furthermore, inadequate encoding of input functions and query positions limits the generalization ability of DONs. To address these challenges, we propose the Dynamical Coupled Operator (DCO), which incorporates temporal dynamics to learn coupled functions, reducing information loss and improving training robustness. Additionally, we introduce an adaptive spectral input function encoder based on empirical mode decomposition to enhance input function representation, as well as a hybrid location encoder to improve query location encoding. We provide theoretical guarantees on the universal expressiveness of DCO, ensuring its applicability to a wide range of PDE problems. Extensive experiments on real-world, high-dimensional PDE datasets demonstrate that DCO significantly outperforms DONs.
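
For context, the baseline that DCO builds on is the vanilla deep operator network, which combines a branch net over the sampled input function with a trunk net over the query location. The sketch below shows only this standard DeepONet form as background; the layer sizes are arbitrary, and DCO's coupled dynamics, spectral input encoder, and hybrid location encoder are not shown.

```python
# Background sketch of a vanilla DeepONet (the baseline DCO improves on), not
# the DCO architecture itself. Widths and activations are arbitrary choices.
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, n_sensors, coord_dim, width=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, width), nn.Tanh(),
                                    nn.Linear(width, width))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, width), nn.Tanh(),
                                   nn.Linear(width, width))

    def forward(self, u_sensors, y_query):
        # u_sensors: (B, n_sensors) input function sampled at fixed sensor points
        # y_query:   (B, Q, coord_dim) query coordinates
        b = self.branch(u_sensors)                 # (B, width)
        t = self.trunk(y_query)                    # (B, Q, width)
        return torch.einsum('bw,bqw->bq', b, t)    # predicted solution at the queries
```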

IJCAI Conference 2025 Conference Paper

MCF-Spouse: A Multi-Label Causal Feature Selection Method with Optimal Spouses Discovery

  • Lin Ma
  • Liang Hu
  • Qiang Huang
  • Pingting Hao
  • Juncheng Hu

Multi-label causal feature selection has garnered considerable attention for its ability to identify the most informative features while accounting for the causal dependencies between labels and features. However, previous work often overlooks the unique contributions of labels to the target variables in multi-label settings, focusing instead on prioritizing feature variables. Moreover, existing methods typically rely on traditional Markov Blanket (MB) discovery to construct an initial MB, which often fails to explore the most valuable form of spouse variables for feature selection in multi-label scenarios, leading to significant computational overhead due to the redundant Conditional Independence (CI) tests required for spouse search. To address these challenges, we propose the Multi-label Causal Feature Selection Method with Optimal Spouses Discovery, MCF-Spouse, which leverages mutual information to quantify the contributions of both labels and features, ensuring the retention of the most informative variables in multi-label settings. Moreover, we systematically analyze all potential forms of spouse variables to identify the optimal spouse case, significantly reducing the spouse search space and alleviating the time overhead associated with CI tests. Experiments conducted on diverse real-world datasets demonstrate that MCF-Spouse consistently outperforms state-of-the-art methods across multiple metrics, offering a scalable and interpretable solution for multi-label causal feature selection.
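
A heavily simplified illustration of the mutual-information scoring step (not MCF-Spouse itself) is shown below: each feature is scored by its summed mutual information with all labels and the top-k are kept; the Markov blanket construction, spouse discovery, and CI tests of the paper are omitted.

```python
# Simplified illustration (not MCF-Spouse): rank features by summed mutual
# information with all labels. MB/spouse discovery and CI tests are omitted.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features_by_mi(X, Y, top_k=10):
    """X: (n_samples, n_features); Y: (n_samples, n_labels) binary label matrix."""
    scores = np.zeros(X.shape[1])
    for j in range(Y.shape[1]):
        scores += mutual_info_classif(X, Y[:, j], random_state=0)
    return np.argsort(scores)[::-1][:top_k]   # indices of the most informative features
```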

IJCAI Conference 2025 Conference Paper

SSPNet: Leveraging Robust Medication Recommendation with History and Knowledge

  • Haodi Zhang
  • Jiawei Wen
  • Jiahong Li
  • Yuanfeng Song
  • Liang-Jie Zhang
  • Lin Ma

Automated medication recommendation is a crucial task within the domain of artificial intelligence in healthcare, where recommender systems are supposed to deliver precise, personalized drug combinations tailored to the evolving health states of patients. Existing approaches often treat clinical records (e.g., diagnoses, procedures) as isolated or unified entities, neglecting the inherent set-structured nature of medical data and the need to model interdependencies among clinical elements. To address this gap, we propose SSPNet, a novel end-to-end framework designed to process complete clinical record sets and directly generate optimal medication sets. SSPNet employs a set-based encoder to effectively capture and represent a patient's health condition from the electronic health records (EHRs), while a permutation-consistent decoder predicts the entire medication combination as a set. In addition, we introduce a novel personalized representation mechanism to capture the drugs previously used by individual patients. Extensive experiments on the MIMIC-III and MIMIC-IV datasets reveal that SSPNet surpasses existing state-of-the-art methods in the accuracy of medication recommendations.

NeurIPS Conference 2025 Conference Paper

Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy

  • Xiaoxiao Ma
  • Feng Zhao
  • Pengyang Ling
  • Haibo Qiu
  • Zhixiang Wei
  • Hu Yu
  • Jie Huang
  • Zhixiong Zeng

In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
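
A conceptual sketch of entropy-guided temperature control (not the paper's exact rule) is given below: the Shannon entropy of each token distribution modulates the sampling temperature, so confident positions are sampled more greedily; the linear entropy-to-temperature mapping and its bounds are assumptions.

```python
# Conceptual sketch of entropy-guided temperature scaling for sampling image
# tokens. The entropy-to-temperature mapping (linear between t_min and t_max)
# is an illustrative assumption, not the paper's rule.
import math
import torch
import torch.nn.functional as F

def entropy_scaled_sample(logits, t_min=0.7, t_max=1.3):
    # logits: (B, V) per-position token logits
    probs = F.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1)   # Shannon entropy per position
    ent_norm = ent / math.log(logits.size(-1))                   # normalize to [0, 1]
    temp = t_min + (t_max - t_min) * ent_norm                    # per-position temperature
    scaled = logits / temp.unsqueeze(-1)
    return torch.multinomial(F.softmax(scaled, dim=-1), num_samples=1)
```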

NeurIPS Conference 2025 Conference Paper

VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction-Editing Data and Long Captions

  • Ziteng Wang
  • Siqi Yang
  • Limeng Qiao
  • Lin Ma

Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN's visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.
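
The hard-negative contrastive idea can be sketched as follows (this is not CLIP-IN's actual loss): in-batch negatives are augmented with one edited hard-negative image and one edited caption per pair, and the loss is applied symmetrically in both directions.

```python
# Rough sketch of a symmetric hard-negative contrastive objective in the spirit
# of the description above, not CLIP-IN's implementation. All inputs are assumed
# to be L2-normalized embeddings of shape (B, D).
import torch
import torch.nn.functional as F

def hard_negative_clip_loss(img, txt, img_neg, txt_neg, temperature=0.07):
    # Append each pair's edited hard negative as one extra logit column.
    t2i = torch.cat([txt @ img.t(), (txt * img_neg).sum(-1, keepdim=True)], dim=1) / temperature
    i2t = torch.cat([img @ txt.t(), (img * txt_neg).sum(-1, keepdim=True)], dim=1) / temperature
    targets = torch.arange(img.size(0), device=img.device)       # positives on the diagonal
    return 0.5 * (F.cross_entropy(t2i, targets) + F.cross_entropy(i2t, targets))
```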

NeurIPS Conference 2025 Conference Paper

VITRIX-UniViTAR: Unified Vision Transformer with Native Resolution

  • Limeng Qiao
  • Yiyang Gan
  • Bairui Wang
  • Jie Qin
  • Shuang Xu
  • Siqi Yang
  • Lin Ma

The conventional Vision Transformer streamlines visual modeling by employing a uniform input resolution, which underestimates the inherent variability of natural visual data and incurs a cost in spatial-contextual fidelity. While preliminary explorations have superficially investigated native resolution modeling, existing works still lack a systematic training recipe from the visual representation perspective. To bridge this gap, we introduce the Unified Vision Transformer with Native Resolution, i.e., UniViTAR, a family of homogeneous vision foundation models tailored for the unified visual modality and native resolution scenario in the multimodal era. Our framework first conducts architectural upgrades to the vanilla paradigm by integrating multiple advanced components. Building upon these improvements, a progressive training paradigm is introduced, which strategically combines two core mechanisms: (1) resolution curriculum learning, transitioning from fixed-resolution pretraining to native resolution tuning, thereby leveraging ViT’s inherent adaptability to variable-length sequences, and (2) visual modality adaptation via inter-batch image-video switching, which balances computational efficiency with enhanced temporal reasoning. In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model, thereby accelerating early-stage convergence. Finally, trained exclusively on publicly accessible image-caption data, our UniViTAR family, spanning multiple model scales from 0.3B to 1B parameters, achieves state-of-the-art performance on a wide variety of visual-related tasks. The code and models are available here.

AAAI Conference 2024 Conference Paper

ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field

  • Zhangkai Ni
  • Peiqi Yang
  • Wenhan Yang
  • Hanli Wang
  • Lin Ma
  • Sam Kwong

Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input, however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccurate. In our work, we introduce a novel model: the Collaborative Neural Radiance Fields (ColNeRF) designed to work with sparse input. The collaboration in ColNeRF includes the cooperation among sparse input source images and the cooperation among the output of the NeRF. Through this, we construct a novel collaborative module that aligns information from various views and meanwhile imposes self-supervised constraints to ensure multi-view consistency in both geometry and appearance. A Collaborative Cross-View Volume Integration module (CCVI) is proposed to capture complex occlusions and implicitly infer the spatial location of objects. Moreover, we introduce self-supervision of target rays projected in multiple directions to ensure geometric and color consistency in adjacent regions. Benefiting from the collaboration at the input and output ends, ColNeRF is capable of capturing richer and more generalized scene representation, thereby facilitating higher-quality results of the novel view synthesis. Our extensive experimental results demonstrate that ColNeRF outperforms state-of-the-art sparse input generalizable NeRF methods. Furthermore, our approach exhibits superiority in fine-tuning towards adapting to new scenes, achieving competitive performance compared to per-scene optimized NeRF-based methods while significantly reducing computational costs. Our code is available at: https://github.com/eezkni/ColNeRF.

NeurIPS Conference 2024 Conference Paper

EEGPT: Pretrained Transformer for Universal and Reliable Representation of EEG Signals

  • Guangyu Wang
  • Wenchao Liu
  • Yuhong He
  • Cong Xu
  • Lin Ma
  • Haifeng Li

Electroencephalography (EEG) is crucial for recording brain activity, with applications in medicine, neuroscience, and brain-computer interfaces (BCI). However, challenges such as low signal-to-noise ratio (SNR), high inter-subject variability, and channel mismatch complicate the extraction of robust, universal EEG representations. We propose EEGPT, a novel 10-million-parameter pretrained transformer model designed for universal EEG feature extraction. In EEGPT, a mask-based dual self-supervised learning method for efficient feature extraction is designed. Compared to other mask-based self-supervised learning methods, EEGPT introduces spatio-temporal representation alignment. This involves constructing a self-supervised task based on EEG representations that possess high SNR and rich semantic information, rather than on raw signals. Consequently, this approach mitigates the issue of poor feature quality typically extracted from low SNR signals. Additionally, EEGPT's hierarchical structure processes spatial and temporal information separately, reducing computational complexity while increasing flexibility and adaptability for BCI applications. By training on a large mixed multi-task EEG dataset, we fully exploit EEGPT's capabilities. The experiments validate the efficacy and scalability of EEGPT, achieving state-of-the-art performance on a range of downstream tasks with linear probing. Our research advances EEG representation learning, offering innovative solutions for bio-signal processing and AI applications. The code for this paper is available at: https://github.com/BINE022/EEGPT

AAAI Conference 2024 Conference Paper

Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning

  • Yang Jiao
  • Zequn Jie
  • Shaoxiang Chen
  • Lechao Cheng
  • Jingjing Chen
  • Lin Ma
  • Yu-Gang Jiang

Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. Under such a paradigm, accurate BEV representation construction relies on reliable depth estimation for multi-camera images. However, existing approaches exhaustively predict depths for every pixel without prioritizing objects, which are precisely the entities requiring detection in the 3D space. To this end, we propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector. First, a category-specific structural priors mining approach is proposed for enhancing the efficacy of monocular depth generation. Besides, a self-boosting learning strategy is further proposed to encourage the model to place more emphasis on challenging objects in computation-expensive temporal stereo matching. Together they provide advanced depth estimation results for high-quality BEV features construction, benefiting the ultimate 3D detection. The proposed method achieves state-of-the-art performances on the challenging nuScenes benchmark, and extensive experimental results demonstrate the effectiveness of our designs.

NeurIPS Conference 2024 Conference Paper

LESS: Label-Efficient and Single-Stage Referring 3D Segmentation

  • Xuexun Liu
  • Xiaoxu Xu
  • Jinlong Li
  • Qiudan Zhang
  • Xu Wang
  • Nicu Sebe
  • Lin Ma

Referring 3D Segmentation is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence. Previous works follow a two-stage paradigm, first conducting language-agnostic instance segmentation and then matching with the given text query. However, the semantic concepts from the text query and the visual cues only interact separately during training, and both instance and semantic labels are required for each object, which is time-consuming and labor-intensive. To mitigate these issues, we propose a novel Referring 3D Segmentation pipeline, Label-Efficient and Single-Stage, dubbed LESS, which requires only efficient binary-mask supervision. Specifically, we design a Point-Word Cross-Modal Alignment module for aligning the fine-grained features of points and textual embeddings. A Query Mask Predictor module and a Query-Sentence Alignment module are introduced for coarse-grained alignment between masks and the query. Furthermore, we propose an area regularization loss, which coarsely reduces irrelevant background predictions on a large scale. Besides, a point-to-point contrastive loss is proposed, concentrating on distinguishing points with subtly similar features. Through extensive experiments, we achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels. Code is available at https://github.com/mellody11/LESS.

NeurIPS Conference 2024 Conference Paper

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

  • Yang Jiao
  • Shaoxiang Chen
  • Zequn Jie
  • Jingjing Chen
  • Lin Ma
  • Yu-Gang Jiang

Large Multimodal Models (LMMs) are a hot research topic in the computer vision area and have also demonstrated remarkable potential across multiple disciplinary fields. A recent trend is to further extend and enhance the perception capabilities of LMMs. The current methods follow the paradigm of adapting the visual task outputs to the format of the language model, which is the main component of an LMM. This adaptation allows convenient development of such LMMs with minimal modifications; however, it overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement. We decouple the LMM's learning of perception capabilities into task-agnostic and task-specific stages. Lumen first promotes fine-grained vision-language concept alignment, which is the fundamental capability for various visual tasks. Thus the output of the task-agnostic stage is a shared representation for all the tasks we address in this paper. Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders with negligible training efforts. Comprehensive experimental results on a series of vision-centric and VQA benchmarks indicate that our Lumen model not only achieves or surpasses the performance of existing LMM-based approaches in a range of vision-centric tasks but also maintains general visual understanding and instruction-following capabilities.

NeurIPS Conference 2024 Conference Paper

Splatter a Video: Video Gaussian Representation for Versatile Processing

  • Yang-Tian Sun
  • Yi-Hua Huang
  • Lin Ma
  • Xiaoyang Lyu
  • Yan-Pei Cao
  • Xiaojuan Qi

Video representation is a long-standing problem that is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation—video Gaussian representation—that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions for video motion. This approach offers a more intrinsic and explicit representation than layered atlas or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.

AAAI Conference 2023 Conference Paper

Curriculum Multi-Negative Augmentation for Debiased Video Grounding

  • Xiaohan Lan
  • Yitian Yuan
  • Hong Chen
  • Xin Wang
  • Zequn Jie
  • Lin Ma
  • Zhi Wang
  • Wenwu Zhu

Video Grounding (VG) aims to locate the desired segment from a video given a sentence query. Recent studies have found that current VG models are prone to over-relying on the ground-truth moment annotation distribution biases in the training set. To discourage the standard VG model's behavior of exploiting such temporal annotation biases and improve the model's generalization ability, we propose multiple negative augmentations in a hierarchical way, including cross-video augmentations at the clip and video levels, and self-shuffled augmentations with masks. These augmentations can effectively diversify the data distribution so that the model can make more reasonable predictions instead of merely fitting the temporal biases. However, directly adopting such a data augmentation strategy may inevitably introduce some noise, as shown in our cases, since not all of the handcrafted augmentations are semantically irrelevant to the ground-truth video. To further denoise and improve the grounding accuracy, we design a multi-stage curriculum strategy to adaptively train the standard VG model from easy to hard negative augmentations. Experiments on the newly collected Charades-CD and ActivityNet-CD datasets demonstrate that our proposed strategy can improve the performance of the base model in both i.i.d. and o.o.d. scenarios.

NeurIPS Conference 2023 Conference Paper

Punctuation-level Attack: Single-shot and Single Punctuation Can Fool Text Models

  • Wenqiang Wang
  • Chongyang Du
  • Tao Wang
  • Kaihao Zhang
  • Wenhan Luo
  • Lin Ma
  • Wei Liu
  • Xiaochun Cao

Adversarial attacks have attracted increasing attention in various fields, including natural language processing. Current textual attack models primarily focus on fooling models by adding character-, word-, or sentence-level perturbations, ignoring their influence on human perception. In this paper, for the first time in the community, we propose a novel mode of textual attack, the punctuation-level attack. With various types of perturbations, including insertion, displacement, deletion, and replacement, the punctuation-level attack achieves promising fooling rates against SOTA models on typical textual tasks while perturbing only a single punctuation mark in a single shot, thereby maintaining minimal influence on human perception and understanding of the text. Furthermore, we propose a search method named Text Position Punctuation Embedding and Paraphrase (TPPEP) to accelerate the pursuit of the optimal position at which to deploy the attack, without exhaustive search, and we present a mathematical interpretation of TPPEP. Thanks to the integrated Text Position Punctuation Embedding (TPPE), the punctuation attack can be applied at a constant cost of time. Experimental results on public datasets and SOTA models demonstrate the effectiveness of the punctuation attack and the proposed TPPE. We additionally apply the single-punctuation attack to summarization, semantic-similarity-scoring, and text-to-image tasks, and achieve encouraging results.
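
A toy version of a single-punctuation attack (not the TPPEP search described above) can be written in a few lines: exhaustively try inserting one punctuation mark at each position and return the first variant that flips a black-box classifier's prediction. The `classify` function here is hypothetical.

```python
# Toy illustration of a single-punctuation insertion attack, not the paper's
# TPPEP search. `classify` is a hypothetical black-box string -> label function.
def punctuation_attack(text, classify, marks=",.;:!?"):
    original_label = classify(text)
    tokens = text.split()
    for i in range(len(tokens) + 1):
        for mark in marks:
            candidate = " ".join(tokens[:i] + [mark] + tokens[i:])
            if classify(candidate) != original_label:
                return candidate        # successful single-shot perturbation
    return None                         # no flip found by exhaustive insertion
```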

NeurIPS Conference 2022 Conference Paper

Expansion and Shrinkage of Localization for Weakly-Supervised Semantic Segmentation

  • Jinlong Li
  • Zequn Jie
  • Xu Wang
  • Xiaolin Wei
  • Lin Ma

Generating precise class-aware pseudo ground-truths, a.k.a. class activation maps (CAMs), is essential for Weakly-Supervised Semantic Segmentation. The original CAM method usually produces incomplete and inaccurate localization maps. To tackle this issue, this paper proposes an Expansion and Shrinkage scheme based on offset learning in the deformable convolution, to sequentially improve the recall and precision of the located object in the two respective stages. In the Expansion stage, an offset learning branch in a deformable convolution layer, referred to as the "expansion sampler", seeks to sample increasingly less discriminative object regions, driven by an inverse supervision signal that maximizes the image-level classification loss. The more complete object region located in the Expansion stage is then gradually narrowed down to the final object region during the Shrinkage stage. In the Shrinkage stage, the offset learning branch of another deformable convolution layer, referred to as the "shrinkage sampler", is introduced to exclude the false-positive background regions attended to in the Expansion stage, improving the precision of the localization maps. We conduct extensive experiments on PASCAL VOC 2012 and MS COCO 2014 to demonstrate the superiority of our method over other state-of-the-art methods for Weakly-Supervised Semantic Segmentation. The code is available at https://github.com/TyroneLi/ESOL_WSSS.

AAAI Conference 2022 Conference Paper

Explore Inter-contrast between Videos via Composition for Weakly Supervised Temporal Sentence Grounding

  • Jiaming Chen
  • Weixin Luo
  • Wei Zhang
  • Lin Ma

Weakly supervised temporal sentence grounding aims to temporally localize the target segment corresponding to a given natural language query, where only video-query pairs without temporal annotations are provided during training. Most existing methods use the fused visual-linguistic feature to reconstruct the query, where the least reconstruction error determines the target segment. This work introduces a novel approach that explores the inter-contrast between videos in a composed video, built by selecting components from two different videos and fusing them into a single video. Such a straightforward yet effective composition strategy provides temporal annotations at multiple composed positions, resulting in numerous videos with temporal ground-truths for training the temporal sentence grounding task. A transformer framework is introduced with multi-task training to learn a compact but efficient visual-linguistic space. The experimental results on the public Charades-STA and ActivityNet-Caption datasets demonstrate the effectiveness of the proposed method, where our approach achieves comparable performance to the state-of-the-art weakly-supervised baselines. The code is available at https://github.com/PPjmchen/Composition_WSTG.

YNICL Journal 2022 Journal Article

Frequency-dependent white-matter functional network changes associated with cognitive deficits in subcortical vascular cognitive impairment

  • Juanwei Ma
  • Feng Liu
  • Yang Wang
  • Lin Ma
  • Yali Niu
  • Jing Wang
  • Zhaoxiang Ye
  • Jing Zhang

Vascular cognitive impairment (VCI) refers to all forms of cognitive decline associated with cerebrovascular diseases, in which white matter (WM) is highly vulnerable. Although previous studies have shown that blood oxygen level-dependent (BOLD) signals inside WM can effectively reflect neural activities, whether WM BOLD signal alterations are present and their roles underlying cognitive impairment in VCI remain largely unknown. In this study, 36 subcortical VCI (SVCI) patients and 36 healthy controls were enrolled to evaluate WM dysfunction. Specifically, fourteen distinct WM networks were identified from resting-state functional MRI using K-means clustering analysis. Subsequently, between-network functional connectivity (FC) and within-network BOLD signal amplitude of WM networks were calculated in three frequency bands (band A: 0.01-0.15 Hz, band B: 0.08-0.15 Hz, and band C: 0.01-0.08 Hz). Patients with SVCI manifested decreased FC mainly in bilateral parietal WM regions, the forceps major, and the superior and inferior longitudinal fasciculi. These connections were extensively linked with distinct WM networks and with gray-matter networks such as the frontoparietal control and dorsal and ventral attention networks, which exhibited frequency-specific alterations in SVCI. Additionally, extensive amplitude reductions were found in SVCI, showing frequency-dependent properties in the parietal, anterior corona radiata, pre/post central, and superior and inferior longitudinal fasciculus networks. Furthermore, these decreased FC and amplitudes showed significant positive correlations with cognitive performance in SVCI and yielded high diagnostic performance for SVCI, especially when combining all bands. Our study indicated that VCI-related cognitive deficits were characterized by frequency-dependent WM functional abnormalities, which offer novel, applicable neuromarkers for VCI.

AAAI Conference 2022 Conference Paper

Visual Consensus Modeling for Video-Text Retrieval

  • Shuqiang Cao
  • Bairui Wang
  • Wei Zhang
  • Lin Ma

In this paper, we propose a novel method to mine the commonsense knowledge shared between the video and text modalities for video-text retrieval, namely visual consensus modeling. Different from the existing works, which learn the video and text representations and their complicated relationships solely based on the pairwise video-text data, we make the first attempt to model the visual consensus by mining the visual concepts from videos and exploiting their co-occurrence patterns within the video and text modalities with no reliance on any additional concept annotations. Specifically, we build a shareable and learnable graph as the visual consensus, where the nodes denote the mined visual concepts and the edges connecting the nodes represent the co-occurrence relationships between the visual concepts. Extensive experimental results on the public benchmark datasets demonstrate that our proposed method, with the ability to effectively model the visual consensus, achieves state-of-the-art performance on the bidirectional video-text retrieval task. Our code is available at https://github.com/sqiangcao99/VCM.
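
The co-occurrence idea behind the consensus graph can be illustrated with a minimal sketch (the paper's graph is learnable; this one is not): nodes are mined visual concepts, and edge weights count how often two concepts appear in the same video.

```python
# Minimal sketch of a concept co-occurrence graph, as a static stand-in for the
# learnable visual consensus graph described above.
import numpy as np

def cooccurrence_graph(video_concepts, n_concepts):
    """video_concepts: list of sets/lists of concept ids, one per video."""
    adj = np.zeros((n_concepts, n_concepts))
    for concepts in video_concepts:
        concepts = list(set(concepts))
        for i in concepts:
            for j in concepts:
                if i != j:
                    adj[i, j] += 1                 # count joint appearances
    row_sums = adj.sum(axis=1, keepdims=True)
    return adj / np.maximum(row_sums, 1)           # row-normalized edge weights
```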

AAAI Conference 2021 Conference Paper

Similarity Reasoning and Filtration for Image-Text Matching

  • Haiwen Diao
  • Ying Zhang
  • Lin Ma
  • Huchuan Lu

Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence, or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, the vector-based similarity representations are first learned to characterize the local and global alignments in a more comprehensive manner, and then the Similarity Graph Reasoning (SGR) module, relying on one graph convolutional neural network, is introduced to infer relation-aware similarities with both the local and global alignments. The Similarity Attention Filtration (SAF) module is further developed to integrate these alignments effectively by selectively attending to the significant and representative alignments and meanwhile casting aside the interferences of non-meaningful alignments. We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets, and the good interpretability of the SGR and SAF modules with extensive qualitative experiments and analyses.

AAAI Conference 2020 Conference Paper

Feature Deformation Meta-Networks in Image Captioning of Novel Objects

  • Tingjia Cao
  • Ke Han
  • Xiaomei Wang
  • Lin Ma
  • Yanwei Fu
  • Yu-Gang Jiang
  • Xiangyang Xue

This paper studies the task of image captioning with novel objects, which only exist in testing images. Intrinsically, this task can reflect the generalization ability of models in understanding and captioning the semantic meanings of visual concepts and objects unseen in training set, sharing the similarity to one/zero-shot learning. The critical difficulty thus comes from that no paired images and sentences of the novel objects can be used to help train the captioning model. Inspired by recent work (Chen et al. 2019b) that boosts one-shot learning by learning to generate various image deformations, we propose learning meta-networks for deforming features for novel object captioning. To this end, we introduce the feature deformation meta-networks (FDM-net), which is trained on source data, and learn to adapt to the novel object features detected by the auxiliary detection model. FDM-net includes two sub-nets: feature deformation, and scene graph sentence reconstruction, which produce the augmented image features and corresponding sentences, respectively. Thus, rather than directly deforming images, FDM-net can efficiently and dynamically enlarge the paired images and texts by learning to deform image features. Extensive experiments are conducted on the widely used novel object captioning dataset, and the results show the effectiveness of our FDM-net. Ablation study and qualitative visualization further give insights of our model.

AAAI Conference 2020 Conference Paper

Recurrent Nested Model for Sequence Generation

  • Wenhao Jiang
  • Lin Ma
  • Wei Lu

Depth has been shown beneficial to neural network models. In this paper, we make an attempt to make the encoder-decoder model deeper for sequence generation. We propose a module that can be plugged into the middle between the encoder and decoder to increase the depth of the whole model. The proposed module follows a nested structure, which is divided into blocks with each block containing several recurrent transition steps. To reduce the training difficulty and preserve the necessary information for the decoder during transitions, inter-block connections and intra-block connections are constructed in our model. The inter-block connections provide the thought vectors from the current block to all the subsequent blocks. The intra-block connections connect all the hidden states entering the current block to the current transition step. The advantages of our model are illustrated on the image captioning and code captioning tasks.

AAAI Conference 2020 Conference Paper

Temporally Grounding Language Queries in Videos by Contextual Boundary-Aware Prediction

  • Jingwen Wang
  • Lin Ma
  • Wenhao Jiang

The task of temporally grounding language queries in videos is to temporally localize the best matched video segment corresponding to a given language query (sentence). It requires certain models to simultaneously perform visual and linguistic understandings. Previous work predominantly ignores the precision of segment localization. Sliding window based methods use predefined search window sizes, which suffer from redundant computation, while existing anchor-based approaches fail to yield precise localization. We address this issue by proposing an end-to-end boundary-aware model, which uses a lightweight branch to predict semantic boundaries corresponding to the given linguistic information. To better detect semantic boundaries, we propose to aggregate contextual information by explicitly modeling the relationship between the current element and its neighbors. The most confident segments are subsequently selected based on both anchor and boundary predictions at the testing stage. The proposed model, dubbed Contextual Boundary-aware Prediction (CBP), outperforms its competitors with a clear margin on three public datasets.

AAAI Conference 2019 Conference Paper

Cousin Network Guided Sketch Recognition via Latent Attribute Warehouse

  • Kaihao Zhang
  • Wenhan Luo
  • Lin Ma
  • Hongdong Li

We study the problem of sketch image recognition. This problem is plagued with two major challenges: 1) sketch images are often scarce in contrast to the abundance of natural images, rendering the training task difficult, and 2) the significant domain gap between sketch image and its natural image counterpart makes the task of bridging the two domains challenging. In order to overcome these challenges, in this paper we propose to transfer the knowledge of a network learned from natural images to a sketch network - a new deep net architecture which we term as cousin network. This network guides a sketch-recognition network to extract more relevant features that are close to those of natural images, via adversarial training. Moreover, to enhance the transfer ability of the classification model, a sketch-to-image attribute warehouse is constructed to approximate the transformation between the sketch domain and the real image domain. Extensive experiments conducted on the TU-Berlin dataset show that the proposed model is able to efficiently distill knowledge from natural images and achieves superior performance than the current state of the art.

NeurIPS Conference 2019 Conference Paper

Exploiting Local and Global Structure for Point Cloud Semantic Segmentation with Contextual Point Representations

  • Xu Wang
  • Jingming He
  • Lin Ma

In this paper, we propose one novel model for point cloud semantic segmentation, which exploits both the local and global structures within the point cloud based on the contextual point representations. Specifically, we enrich each point representation by performing one novel gated fusion on the point itself and its contextual points. Afterwards, based on the enriched representation, we propose one novel graph pointnet module, relying on the graph attention block to dynamically compose and update each point representation within the local point cloud structure. Finally, we resort to the spatial-wise and channel-wise attention strategies to exploit the point cloud global structure and thereby yield the resulting semantic label for each point. Extensive results on the public point cloud databases, namely the S3DIS and ScanNet datasets, demonstrate the effectiveness of our proposed model, outperforming the state-of-the-art approaches. Our code for this paper is available at https://github.com/fly519/ELGS.

IJCAI Conference 2019 Conference Paper

Hallucinating Optical Flow Features for Video Classification

  • Yongyi Tang
  • Lin Ma
  • Lianqiang Zhou

Appearance and motion are two key components to depict and characterize the video content. Currently, two-stream models have achieved state-of-the-art performance on video classification. However, extracting motion information, specifically in the form of optical flow features, is extremely computationally expensive, especially for large-scale video classification. In this paper, we propose a motion hallucination network, namely MoNet, to imagine the optical flow features from the appearance features, with no reliance on the optical flow computation. Specifically, MoNet models the temporal relationships of the appearance features and exploits the contextual relationships of the optical flow features with concurrent connections. Extensive experimental results demonstrate that the proposed MoNet can effectively and efficiently hallucinate the optical flow features, which together with the appearance features consistently improve the video classification performance. Moreover, MoNet can help cut almost half of the computational and data-storage burdens for two-stream video classification. Our code is available at: https://github.com/YongyiTang92/MoNet-Features

AAAI Conference 2019 Conference Paper

Hierarchical Photo-Scene Encoder for Album Storytelling

  • Bairui Wang
  • Lin Ma
  • Wei Zhang
  • Wenhao Jiang
  • Feng Zhang

In this paper, we propose a novel model with a hierarchical photo-scene encoder and a reconstructor for the task of album storytelling. The photo-scene encoder contains two subencoders, namely the photo and scene encoders, which are stacked together and behave hierarchically to fully exploit the structure information of the photos within an album. Specifically, the photo encoder generates semantic representation for each photo while exploiting temporal relationships among them. The scene encoder, relying on the obtained photo representations, is responsible for detecting the scene changes and generating scene representations. Subsequently, the decoder dynamically and attentively summarizes the encoded photo and scene representations to generate a sequence of album representations, based on which a story consisting of multiple coherent sentences is generated. In order to fully extract the useful semantic information from an album, a reconstructor is employed to reproduce the summarized album representations based on the hidden states of the decoder. The proposed model can be trained in an end-to-end manner, which results in an improved performance over the state-of-the-arts on the public visual storytelling (VIST) dataset. Ablation studies further demonstrate the effectiveness of the proposed hierarchical photo-scene encoder and reconstructor.

AAAI Conference 2019 Conference Paper

Localizing Natural Language in Videos

  • Jingyuan Chen
  • Lin Ma
  • Xinpeng Chen
  • Zequn Jie
  • Jiebo Luo

In this paper, we consider the task of natural language video localization (NLVL): given an untrimmed video and a natural language description, the goal is to localize a segment in the video which semantically corresponds to the given natural language description. We propose a localizing network (L-Net), working in an end-to-end fashion, to tackle the NLVL task. We first match the natural sentence and video sequence by cross-gated attended recurrent networks to exploit their fine-grained interactions and generate a sentence-aware video representation. A self-interactor is proposed to perform cross-frame matching, which dynamically encodes and aggregates the matching evidences. Finally, a boundary model is proposed to locate the positions of video segments corresponding to the natural sentence description by predicting the starting and ending points of the segment. Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently against the state-of-the-art approaches.

IJCAI Conference 2019 Conference Paper

Position Focused Attention Network for Image-Text Matching

  • Yaxiong Wang
  • Hao Yang
  • Xueming Qian
  • Lin Ma
  • Jing Lu
  • Biao Li
  • Xin Fan

Image-text matching tasks have recently attracted a lot of attention in the computer vision field. The key point of this cross-domain problem is how to accurately measure the similarity between the visual and the textual contents, which demands a fine understanding of both modalities. In this paper, we propose a novel position focused attention network (PFAN) to investigate the relation between the visual and the textual views. In this work, we integrate the object position clue to enhance the visual-text joint-embedding learning. We first split the images into blocks, by which we infer the relative position of a region in the image. Then, an attention mechanism is proposed to model the relations between the image region and blocks and generate a valuable position feature, which is further utilized to enhance the region expression and model a more reliable relationship between the visual image and the textual sentence. Experiments on the popular datasets Flickr30K and MS-COCO show the effectiveness of the proposed method. Besides the public datasets, we also conduct experiments on our collected practical news dataset (Tencent-News) to validate the practical application value of the proposed method. As far as we know, this is the first attempt to test the performance on a practical application. Our method achieves state-of-the-art performance on all three datasets.

NeurIPS Conference 2019 Conference Paper

Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos

  • Yitian Yuan
  • Lin Ma
  • Jingwen Wang
  • Wei Liu
  • Wenwu Zhu

Temporal sentence grounding in videos aims to detect and localize one target video segment, which semantically corresponds to a given sentence. Existing methods mainly tackle this task via matching and aligning semantics between a sentence and candidate video segments, while neglect the fact that the sentence information plays an important role in temporally correlating and composing the described contents in videos. In this paper, we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism, which relies on the sentence semantics to modulate the temporal convolution operations for better correlating and composing the sentence related video contents over time. More importantly, the proposed SCDM performs dynamically with respect to the diverse video contents so as to establish a more precise matching relationship between sentence and video, thereby improving the temporal grounding accuracy. Extensive experiments on three public datasets demonstrate that our proposed model outperforms the state-of-the-arts with clear margins, illustrating the ability of SCDM to better associate and localize relevant video contents for temporal sentence grounding. Our code for this paper is available at https://github.com/yytzsy/SCDM.
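
The modulation idea can be sketched in a FiLM-like form (in the spirit of SCDM, not its exact formulation): a sentence embedding predicts per-channel scale and shift parameters that modulate temporal convolution features.

```python
# Conceptual FiLM-style sketch of sentence-conditioned modulation of temporal
# convolution features; an illustration, not SCDM's exact formulation.
import torch
import torch.nn as nn

class SentenceConditionedModulation(nn.Module):
    def __init__(self, sent_dim, channels):
        super().__init__()
        self.to_gamma = nn.Linear(sent_dim, channels)   # per-channel scale from sentence
        self.to_beta = nn.Linear(sent_dim, channels)    # per-channel shift from sentence
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, video_feats, sent_emb):
        # video_feats: (B, C, T) temporal features; sent_emb: (B, sent_dim)
        x = self.conv(video_feats)
        gamma = self.to_gamma(sent_emb).unsqueeze(-1)   # (B, C, 1)
        beta = self.to_beta(sent_emb).unsqueeze(-1)
        return torch.relu(gamma * x + beta)             # sentence-modulated features
```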

NeurIPS Conference 2018 Conference Paper

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation

  • Wenqi Ren
  • Jiawei Zhang
  • Lin Ma
  • Jinshan Pan
  • Xiaochun Cao
  • Wangmeng Zuo
  • Wei Liu
  • Ming-Hsuan Yang

In this paper, we present a deep convolutional neural network to capture the inherent properties of image degradation, which can handle different kernels and saturated pixels in a unified framework. The proposed neural network is motivated by the low-rank property of pseudo-inverse kernels. We first compute a generalized low-rank approximation for a large number of blur kernels, and then use separable filters to initialize the convolutional parameters in the network. Our analysis shows that the estimated decomposed matrices contain the most essential information of the input kernel, which enables the proposed network to handle various blurs in a unified framework and generate high-quality deblurring results. Experimental results on benchmark datasets with noise and saturated pixels demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
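
A hedged sketch of the underlying idea: a plain per-kernel SVD (standing in here for the paper's generalized low-rank approximation over a set of kernels) yields separable 1-D filters that could be used to initialize convolutional parameters.

import numpy as np

def separable_filters(kernel, rank=1):
    # Decompose a 2-D blur kernel into `rank` pairs of vertical/horizontal 1-D filters.
    u, s, vt = np.linalg.svd(kernel)
    vertical = u[:, :rank] * np.sqrt(s[:rank])
    horizontal = vt[:rank, :].T * np.sqrt(s[:rank])
    return vertical, horizontal

# Sanity check: a separable Gaussian kernel is recovered almost exactly at rank 1.
x = np.exp(-np.linspace(-2, 2, 15) ** 2)
kernel = np.outer(x, x) / np.outer(x, x).sum()
v, h = separable_filters(kernel, rank=1)
print(np.abs(v @ h.T - kernel).max())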

IJCAI Conference 2018 Conference Paper

Image-level to Pixel-wise Labeling: From Theory to Practice

  • Tiezhu Sun
  • Wei Zhang
  • Zhijie Wang
  • Lin Ma
  • Zequn Jie

Conventional convolutional neural networks (CNNs) have achieved great success in image semantic segmentation. Existing methods mainly focus on learning pixel-wise labels from an image directly. In this paper, we advocate tackling the pixel-wise segmentation problem by considering the image-level classification labels. Theoretically, we analyze and discuss the effects of image-level labels on pixel-wise segmentation from the perspective of information theory. In practice, an end-to-end segmentation model is built by fusing the image-level and pixel-wise labeling networks. A generative network is included to reconstruct the input image and further boost the segmentation model training with an auxiliary loss. Extensive experimental results on the benchmark dataset demonstrate the effectiveness of the proposed method, where good image-level labels can significantly improve the pixel-wise segmentation accuracy.
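
A minimal sketch of the kind of fused objective described: a pixel-wise segmentation loss combined with an image-level classification loss and an auxiliary reconstruction loss. The loss choices and weights are illustrative assumptions, not the paper's settings.

import torch.nn.functional as F

def combined_loss(seg_logits, seg_labels, cls_logits, cls_labels, recon, image,
                  cls_weight=0.5, recon_weight=0.1):
    # seg_logits: (batch, classes, H, W); seg_labels: (batch, H, W) pixel-wise labels
    seg_loss = F.cross_entropy(seg_logits, seg_labels)
    # cls_logits / cls_labels: image-level classification branch
    cls_loss = F.cross_entropy(cls_logits, cls_labels)
    # recon: generative branch's reconstruction of the input image (auxiliary loss)
    recon_loss = F.l1_loss(recon, image)
    return seg_loss + cls_weight * cls_loss + recon_weight * recon_loss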

AAAI Conference 2018 Conference Paper

Learning to Guide Decoding for Image Captioning

  • Wenhao Jiang
  • Lin Ma
  • Xinpeng Chen
  • Hanwang Zhang
  • Wei Liu

Recently, much progress has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called a guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at each time step. The guiding network can be plugged into the current encoder-decoder framework and trained in an end-to-end manner. Hence, the guiding vector can be adaptively learned according to the signal from the decoder, enabling it to embed information from both image and language. Additionally, discriminative supervision can be employed to further improve the quality of guidance. The advantages of our proposed approach are verified by experiments carried out on the MS COCO dataset.
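
A hedged sketch (not the paper's implementation): a guiding vector computed from the image features is concatenated with the word embedding to compose the decoder input at each time step. The LSTM decoder and layer sizes are assumptions.

import torch
import torch.nn as nn

class GuidedDecoderStep(nn.Module):
    def __init__(self, img_dim=2048, embed_dim=512, guide_dim=512, hidden_dim=512):
        super().__init__()
        self.guiding_net = nn.Sequential(nn.Linear(img_dim, guide_dim), nn.Tanh())
        self.lstm_cell = nn.LSTMCell(embed_dim + guide_dim, hidden_dim)

    def forward(self, word_embed, img_feat, state):
        guide = self.guiding_net(img_feat)                    # guiding vector from the image
        step_input = torch.cat([word_embed, guide], dim=-1)   # compose the decoder input
        return self.lstm_cell(step_input, state)              # next (hidden, cell) state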

IJCAI Conference 2018 Conference Paper

Long-Term Human Motion Prediction by Modeling Motion Context and Enhancing Motion Dynamics

  • Yongyi Tang
  • Lin Ma
  • Wei Liu
  • Wei-Shi Zheng

Human motion prediction aims at generating future frames of human motion based on an observed sequence of skeletons. Recent methods employ the latest hidden states of a recurrent neural network (RNN) to encode the historical skeletons, which can only address short-term prediction. In this work, we propose motion context modeling that summarizes the historical human motion with respect to the current prediction. A modified highway unit (MHU) is proposed for efficiently eliminating motionless joints and estimating the next pose given the motion context. Furthermore, we enhance the motion dynamics by minimizing a gram matrix loss for long-term motion prediction. Experimental results show that the proposed model can promisingly forecast future human movements, yielding superior performance over related state-of-the-art approaches. Moreover, specifying the motion context with the activity labels enables our model to perform human motion transfer.
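
One plausible reading of the gram matrix loss mentioned above, written as a hedged sketch: match second-order statistics of the predicted and ground-truth pose sequences over time.

import torch

def gram_matrix_loss(pred, target):
    # pred, target: (batch, time, dims) predicted and ground-truth pose sequences
    def gram(x):
        return torch.bmm(x.transpose(1, 2), x) / x.shape[1]  # (batch, dims, dims)
    return torch.mean((gram(pred) - gram(target)) ** 2)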

NeurIPS Conference 2018 Conference Paper

Parsimonious Quantile Regression of Financial Asset Tail Dynamics via Sequential Learning

  • Xing Yan
  • Weizhong Zhang
  • Lin Ma
  • Wei Liu
  • Qi Wu

We propose a parsimonious quantile regression framework to learn the dynamic tail behaviors of financial asset returns. Our model captures well both the time-varying characteristic and the asymmetrical heavy-tail property of financial time series. It combines the merits of a popular sequential neural network model, i.e., LSTM, with a novel parametric quantile function that we construct to represent the conditional distribution of asset returns. Our model also captures individually the serial dependences of higher moments, rather than just the volatility. Across a wide range of asset classes, the out-of-sample forecasts of conditional quantiles or VaR of our model outperform the GARCH family. Further, the proposed approach does not suffer from the issue of quantile crossing, nor is it exposed to the ill-posedness of the parametric probability density function approach.
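
For context, a hedged sketch of the standard pinball (quantile) loss that conditional-quantile forecasters of this kind are usually trained with; the paper's specific parametric quantile function is not reproduced here.

import torch

def pinball_loss(pred_quantile, target, tau):
    # pred_quantile: model's tau-quantile forecast; target: realized return; tau in (0, 1)
    diff = target - pred_quantile
    return torch.mean(torch.maximum(tau * diff, (tau - 1) * diff))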

YNIMG Journal 2016 Journal Article

Alterations of functional connectivities from early to middle adulthood: Clues from multivariate pattern analysis of resting-state fMRI data

  • Lixia Tian
  • Lin Ma
  • Linlin Wang

In contrast to extended research interests in the maturation and aging of the human brain, alterations of brain structure and function from early to middle adulthood have been much less studied. The aim of the present study was to investigate the extent and pattern of the alterations of functional interactions between brain regions from early to middle adulthood. We carried out the study by multivariate pattern analysis of resting-state fMRI (RS-fMRI) data of 63 adults aged 18 to 45 years. Specifically, using elastic net, we performed brain age estimation and age-group classification (young adults aged 18–28 years vs. middle-aged adults aged 35–45 years) based on the resting-state functional connectivities (RSFCs) between 160 regions of interest (ROIs) evaluated on the RS-fMRI data of each subject. The results indicate that the estimated brain ages were significantly correlated with the chronological age (R = 0.78, MAE = 4.81), and a classification rate of 94.44% and an area under the receiver operating characteristic curve (AUC) of 0.99 were obtained when classifying the young and middle-aged adults. These results provide strong evidence that functional interactions between brain regions undergo notable alterations from early to middle adulthood. By analyzing the RSFCs that contribute to brain age estimation/age-group classification, we found that a majority of the RSFCs were inter-network, and we speculate that inter-network RSFCs might mature late but age early as compared to intra-network ones. In addition, the strengthening/weakening of the RSFCs associated with the left/right hemispheric ROIs, the weakening of cortico-cerebellar RSFCs and the strengthening of the RSFCs between the default mode network and other networks contributed much to both brain age estimation and age-group classification. All these alterations might reflect that aging of brain function is already in progress in middle adulthood. Overall, the present study indicated that the RSFCs undergo notable alterations from early to middle adulthood and highlighted the necessity of careful considerations of possible influences of these alterations in related studies.
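
A minimal sketch of the analysis pattern described (elastic net regression on vectorized RSFCs to estimate brain age); the data below are synthetic stand-ins and the hyperparameters are assumptions.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
rsfc = rng.standard_normal((63, 160 * 159 // 2))  # one row of pairwise RSFCs per subject
age = rng.uniform(18, 45, size=63)                # chronological ages

model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
brain_age = cross_val_predict(model, rsfc, age, cv=5)  # cross-validated brain-age estimates
print(np.corrcoef(brain_age, age)[0, 1])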

AAAI Conference 2016 Conference Paper

Learning to Answer Questions from Image Using Convolutional Neural Network

  • Lin Ma
  • Zhengdong Lu
  • Hang Li

In this paper, we propose to employ the convolutional neural network (CNN) for the image question answering (QA) task. Our proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer. More specifically, our model consists of three CNNs: one image CNN to encode the image content, one sentence CNN to compose the words of the question, and one multimodal convolution layer to learn their joint representation for the classification in the space of candidate answer words. We demonstrate the efficacy of our proposed model on the DAQUAR and COCO-QA datasets, which are two benchmark datasets for image QA, significantly outperforming the state-of-the-art.
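
A hedged sketch of the overall shape of such a model, using simple concatenation-based fusion as a stand-in for the paper's multimodal convolution layer; all dimensions are assumptions.

import torch
import torch.nn as nn

class MultimodalAnswerHead(nn.Module):
    def __init__(self, img_dim=4096, sent_dim=512, joint_dim=512, num_answers=1000):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(img_dim + sent_dim, joint_dim), nn.ReLU())
        self.classify = nn.Linear(joint_dim, num_answers)

    def forward(self, img_feat, question_feat):
        # img_feat: image CNN output; question_feat: sentence CNN output
        joint = self.fuse(torch.cat([img_feat, question_feat], dim=-1))
        return self.classify(joint)  # logits over candidate answer words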

YNIMG Journal 2015 Journal Article

Nature of functional links in valuation networks differentiates impulsive behaviors between abstinent heroin-dependent subjects and nondrug-using subjects

  • Tianye Zhai
  • Yongcong Shao
  • Gang Chen
  • Enmao Ye
  • Lin Ma
  • Lubin Wang
  • Yu Lei
  • Guangyu Chen

Advanced neuroimaging studies have identified brain correlates of pathological impulsivity in a variety of neuropsychiatric disorders. However, whether and how these spatially separate and functionally integrated neural correlates collectively contribute to aberrant impulsive behaviors remains unclear. Building on recent progress in neuroeconomics toward determining a biological account of human behaviors, we employed resting-state functional MRI to characterize the nature of the links between these neural correlates and to investigate their impact on impulsivity. We demonstrated that, through functional connectivity with the ventral medial prefrontal cortex, the δ-network (regions of the executive control system, such as the dorsolateral prefrontal cortex) and the β-network (regions of the reward system involved in the mesocorticolimbic pathway) jointly influence impulsivity measured by the Barratt impulsiveness scale scores. In control nondrug-using subjects, the functional link between the β- and δ-networks is balanced, and the δ-network competitively controls impulsivity. However, in abstinent heroin-dependent subjects, the link is imbalanced, with stronger β-network connectivity and weaker δ-network connectivity. The imbalanced link is associated with impulsivity, indicating that the β- and δ-networks may mutually reinforce each other in abstinent heroin-dependent subjects. These findings of an aberrant link between the β- and δ-networks in abstinent heroin-dependent subjects may shed light on the mechanism of aberrant behaviors of drug addiction and may serve as an endophenotype to mark individual subjects' self-control capacity.
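
For readers unfamiliar with the measure, resting-state functional connectivity between two networks is commonly computed as the Pearson correlation of their mean time series; the snippet below is a generic illustration with synthetic data, not the study's pipeline.

import numpy as np

rng = np.random.default_rng(1)
beta_ts = rng.standard_normal(200)   # mean BOLD time series of the β-network ROIs
delta_ts = rng.standard_normal(200)  # mean BOLD time series of the δ-network ROIs
connectivity = np.corrcoef(beta_ts, delta_ts)[0, 1]  # functional connectivity estimate
print(connectivity)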