Arrow Research search

Author name cluster

Jin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

54 papers
2 author rows

Possible papers

54

AAAI Conference 2026 Conference Paper

Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency

  • Riling Wei
  • Kelu Yao
  • Chuanguang Yang
  • Jin Wang
  • Zhuoyan Gao
  • Chao Li

Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but raises knowledge transmission costs, which we rigorously verify using optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments show that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.

AAAI Conference 2026 Conference Paper

HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models

  • Liheng Zhang
  • Jin Wang
  • Hui Li
  • Bingfeng Zhang
  • Weifeng Liu

3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a high computational cost that limits its application; we identify the bottleneck as processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while retaining critical details. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing substantial improvements in both efficiency and performance.

AAAI Conference 2026 Conference Paper

LLMdoctor: Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment of Large Language Models

  • Tiesunlong Shen
  • Rui Mao
  • Jin Wang
  • Heming Sun
  • Jian Zhang
  • Xuejie Zhang
  • Erik Cambria

Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising alternative, existing approaches often rely on distorted trajectory-level signals or inefficient sampling, fundamentally capping performance and failing to preserve the generative diversity of the base model. This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. It integrates token-level reward acquisition with token-level flow-guided preference optimization (TFPO) to steer a large, frozen patient LLM with a smaller, specialized doctor model. Unlike conventional methods that rely on trajectory-level rewards, LLMdoctor first extracts fine-grained, token-level preference signals from the patient model's behavioral variations. These signals then guide the training of the doctor model via TFPO, which establishes flow consistency across all subtrajectories, enabling precise token-by-token alignment while inherently preserving generation diversity. Extensive experiments demonstrate that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches like DPO.
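For intuition only, the sketch below shows the general shape of test-time guided decoding, where a small tuned model's per-token scores steer a frozen base model's next-token distribution. The blending rule, names, and shapes are assumptions for illustration; the paper's TFPO training of the doctor model is not reproduced.

```python
import torch

@torch.no_grad()
def guided_next_token(patient_logits: torch.Tensor,
                      doctor_scores: torch.Tensor,
                      beta: float = 1.0) -> int:
    """Blend the frozen 'patient' model's next-token logits with a small
    'doctor' model's token-level preference scores at decoding time.
    Illustrative only; not the paper's exact formulation."""
    combined = patient_logits + beta * doctor_scores
    return int(torch.argmax(combined))

# Hypothetical tensors over a shared vocabulary of 32,000 tokens.
vocab_size = 32000
patient_logits = torch.randn(vocab_size)   # from the large frozen LLM
doctor_scores = torch.randn(vocab_size)    # from the small alignment model
next_token_id = guided_next_token(patient_logits, doctor_scores, beta=2.0)
```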

AAAI Conference 2026 Conference Paper

SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

  • Kaiyuan Chen
  • Guangmin Zheng
  • Jin Wang
  • Xiaobing Zhou
  • Xuejie Zhang

Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty in mitigating the gap. Motivated by the Error-Related Negativity (ERN), in which errors are localized immediately after incorrect decisions to guide rapid adjustment, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO's impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.

AAAI Conference 2026 Conference Paper

Step-GRPO: Enhancing Reasoning Quality and Efficiency via Structured PRM-Based Reinforcement Learning

  • Weijie Li
  • Jin Wang
  • Liang-Chih Yu
  • Xuejie Zhang

Large reasoning models (LRMs) improve performance at test time by thinking longer, but this often leads to overthinking and high computational cost. To address this, recent reinforcement learning (RL) methods adopt outcome-level rewards, such as rule- or prompt-based signals, that favor shorter correct reasoning paths but often overlook reasoning quality. While such rewards neglect intermediate reasoning, dense supervision from process reward models (PRMs) has proven more effective in promoting coherent and high-quality reasoning. However, static PRM supervision introduces two challenges: reward hacking, since fixed rewards poorly capture global reasoning objectives, and the high training cost of obtaining dense reward labels at scale. To overcome these issues, we propose Step Group Relative Policy Optimization (Step-GRPO), a GRPO-based method that integrates step-level PRM signals into sparse trajectory-level feedback, avoiding costly step-level supervision while improving reasoning quality beyond accuracy. In addition, Step-GRPO employs a step-attention mechanism that captures inter-step dependencies and emphasizes critical reasoning steps, effectively mitigating reward hacking. We apply Step-GRPO to train large language models and observe consistent gains in reasoning quality and accuracy, together with shorter reasoning traces, across multiple math benchmarks, outperforming reinforcement learning baselines at substantially lower cost. Notably, the proposed model achieves 36.7 percent accuracy on AIME 2024 with 11,000 training samples and a training cost of 38 US dollars, surpassing baselines that require over 1,000 US dollars and more than 40,000 samples, demonstrating strong cost-effectiveness and scalability.
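As background on the GRPO family this method builds on, here is a minimal sketch of the group-relative advantage that replaces a learned value baseline. The step-level PRM weighting and step-attention described in the abstract are not reproduced; names and numbers are illustrative.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: each rollout's reward for the same prompt is
    normalized by the group's mean and standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of 8 rollouts for one math prompt (1 = correct, 0 = wrong).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
advantages = group_relative_advantages(rewards)   # positive for correct rollouts
```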

AAAI Conference 2025 Conference Paper

Batch Selection for Multi-Label Classification Guided by Uncertainty and Dynamic Label Correlations

  • Ao Zhou
  • Bin Liu
  • Jin Wang
  • Grigorios Tsoumakas

The accuracy of deep neural networks is significantly influenced by the effectiveness of mini-batch construction during training. In single-label scenarios, such as binary and multi-class classification tasks, it has been demonstrated that batch selection algorithms preferring samples with higher uncertainty achieve better performance than difficulty-based methods. Although there are two batch selection methods tailored for multi-label data, neither of them leverages important uncertainty information. Adapting the concept of uncertainty to multi-label data is not a trivial task, since two issues must be tackled. First, traditional variance- or entropy-based uncertainty measures ignore fluctuations of predictions within sliding windows and the importance of the current model state. Second, existing multi-label methods do not explicitly exploit the label correlations, particularly the uncertainty-based label correlations that evolve during the training process. In this paper, we propose an uncertainty-based multi-label batch selection algorithm. It assesses uncertainty for each label by considering differences between successive predictions and the confidence of current outputs, and further leverages dynamic uncertainty-based label correlations to emphasize instances whose uncertainty is synergistically expressed across multiple labels. Empirical studies demonstrate the effectiveness of our method in improving the performance and accelerating the convergence of various multi-label deep learning models.
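A rough sketch of the kind of per-label uncertainty signal the abstract describes, combining prediction fluctuation within a sliding window with the (un)confidence of the current output. The weighting and the dynamic label-correlation term are assumptions, not the paper's formulas.

```python
import numpy as np

def per_label_uncertainty(window_preds: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """window_preds: (T, L) predicted probabilities for L labels over the last T
    training snapshots (most recent last). Returns one uncertainty score per label."""
    fluctuation = np.abs(np.diff(window_preds, axis=0)).mean(axis=0)  # variation across the window
    unconfidence = 1.0 - 2.0 * np.abs(window_preds[-1] - 0.5)         # closeness of current output to 0.5
    return alpha * fluctuation + (1.0 - alpha) * unconfidence

# Hypothetical instance: 4 snapshots of predictions for 3 labels.
window = np.array([[0.2, 0.9, 0.5],
                   [0.4, 0.8, 0.5],
                   [0.3, 0.9, 0.6],
                   [0.5, 0.9, 0.5]])
scores = per_label_uncertainty(window)   # higher -> more uncertain label
```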

AAAI Conference 2025 Conference Paper

Data-Free Black-Box Federated Learning via Zeroth-Order Gradient Estimation

  • Xinge Ma
  • Jin Wang
  • Xuejie Zhang

Federated learning (FL) enables decentralized clients to collaboratively train a global model under the orchestration of a central server without exposing their individual data. However, the iterative exchange of model parameters between the server and clients imposes heavy communication burdens, risks potential privacy leakage, and even precludes collaboration among heterogeneous clients. Distillation-based FL tackles these challenges by exchanging low-dimensional model outputs rather than model parameters, yet it relies heavily on a task-relevant auxiliary dataset that is often not available in practice. Data-free FL attempts to overcome this limitation by training a server-side generator to directly synthesize task-specific data samples for knowledge transfer. However, the update rule of the generator requires clients to share on-device models for white-box access, which greatly compromises the advantages of distillation-based FL. This motivates us to explore a data-free and black-box FL framework via Zeroth-order Gradient Estimation (FedZGE). FedZGE estimates the gradients flowing through on-device models in a black-box optimization manner to train the generator in terms of fidelity, transferability, diversity, and equilibrium, without involving any auxiliary data or sharing any model parameters, thus combining the advantages of both distillation-based FL and data-free FL. Experiments on large-scale image classification datasets and network architectures demonstrate the superiority of FedZGE in terms of data heterogeneity, model heterogeneity, communication efficiency, and privacy protection.
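For readers unfamiliar with zeroth-order estimation, a minimal two-point estimator is sketched below: the server only needs black-box loss evaluations from a client model, never its parameters or backward pass. This is the generic building block, not FedZGE's full generator-training loop; all names are illustrative.

```python
import torch

def zeroth_order_gradient(f, x: torch.Tensor, mu: float = 1e-3, n_samples: int = 16) -> torch.Tensor:
    """Forward-difference zeroth-order estimate of grad f(x), using only
    black-box evaluations of f along random directions."""
    grad = torch.zeros_like(x)
    fx = f(x)
    for _ in range(n_samples):
        u = torch.randn_like(x)                      # random perturbation direction
        grad += (f(x + mu * u) - fx) / mu * u
    return grad / n_samples

# Hypothetical black-box objective; the true gradient is 2x.
f = lambda x: (x ** 2).sum()
x = torch.tensor([1.0, -2.0, 0.5])
g = zeroth_order_gradient(f, x)                      # approximately [2, -4, 1]
```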

AAAI Conference 2025 Conference Paper

Divide-Solve-Combine: An Interpretable and Accurate Prompting Framework for Zero-shot Multi-Intent Detection

  • Libo Qin
  • Qiguang Chen
  • Jingxuan Zhou
  • Jin Wang
  • Hao Fei
  • Wanxiang Che
  • Min Li

Zero-shot multi-intent detection captures multiple intents within a single utterance without any training data and has gained increasing attention. Building on the success of large language models (LLMs), dominant approaches in the literature explore prompting techniques to enable zero-shot multi-intent detection. While significant advancements have been witnessed, existing prompting approaches still face two major issues: lack of explicit reasoning and lack of interpretability. Therefore, in this paper, we introduce Divide-Solve-Combine Prompting (DSCP) to address these issues. Specifically, DSCP explicitly decomposes multi-intent detection into three components: (1) single-intent division prompting, which decomposes an input query into distinct sub-sentences, each containing a single intent; (2) intent-by-intent solution prompting, which solves each sub-sentence in turn; and (3) multi-intent combination prompting, which combines the sub-sentence results to obtain the final multi-intent result. Through this decomposition, DSCP allows the model to track an explicit reasoning process and improves interpretability. In addition, we propose an interactive divide-solve-combine prompting (Inter-DSCP) to naturally capture the interaction capabilities of large language models. Experimental results on two standard multi-intent benchmarks (i.e., MixATIS and MixSNIPS) reveal that both DSCP and Inter-DSCP obtain substantial improvements over baselines, achieving superior performance and higher interpretability.
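A minimal sketch of the divide-solve-combine flow with a hypothetical `llm` callable that maps a prompt string to a completion string. The prompt wording and helper names are illustrative, not the paper's templates.

```python
def dscp(llm, utterance: str) -> str:
    """Divide-Solve-Combine Prompting, sketched as three stages of LLM calls."""
    # (1) Divide: split the utterance into single-intent sub-sentences.
    divided = llm("Split the following utterance into sub-sentences, one intent "
                  f"per line:\n{utterance}")
    sub_sentences = [s.strip() for s in divided.splitlines() if s.strip()]

    # (2) Solve: label the intent of each sub-sentence in turn.
    intents = [llm(f"Give the single intent label of: '{s}'") for s in sub_sentences]

    # (3) Combine: merge per-sub-sentence labels into the final multi-intent result.
    return llm("Combine these intent labels into one final multi-intent answer: "
               + "; ".join(intents))
```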

AAAI Conference 2025 Conference Paper

Efficient Event-Based Semantic Segmentation via Exploiting Frame-Event Fusion: A Hybrid Neural Network Approach

  • Hebei Li
  • Yansong Peng
  • Jiahui Yuan
  • Peixi Wu
  • Jin Wang
  • Yueyi Zhang
  • Xiaoyan Sun

Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.

AAAI Conference 2025 Conference Paper

Event-Enhanced Blurry Video Super-Resolution

  • Dachun Kai
  • Yueyi Zhang
  • Jin Wang
  • Zeyu Xiao
  • Zhiwei Xiong
  • Xiaoyan Sun

In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is 2.59 dB more accurate and 7.28× faster than the recent best BVSR baseline FMA-Net.

NeurIPS Conference 2025 Conference Paper

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

  • Jin Wang
  • Yao Lai
  • Aoxue Li
  • Shifeng Zhang
  • Jiacheng Sun
  • Ning Kang
  • Chengyue Wu
  • Zhenguo Li

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

NeurIPS Conference 2025 Conference Paper

Jury-and-Judge Chain-of-Thought for Uncovering Toxic Data in 3D Visual Grounding

  • Kaixiang Huang
  • Qifeng Zhang
  • Jin Wang
  • Jingru Yang
  • Yang Zhou
  • Huan Yu
  • Guodong Lu
  • Shengfeng He

3D Visual Grounding (3DVG) faces persistent challenges due to coarse scene-level observations and logically inconsistent annotations, which introduce ambiguities that compromise data quality and hinder effective model supervision. To address these challenges, we introduce Refer-Judge, a novel framework that harnesses the reasoning capabilities of Multimodal Large Language Models (MLLMs) to identify and mitigate toxic data. At the core of Refer-Judge is a Jury-and-Judge Chain-of-Thought paradigm, inspired by the deliberative process of the judicial system. This framework targets the root causes of annotation noise: jurors collaboratively assess 3DVG samples from diverse perspectives, providing structured, multi-faceted evaluations. Judges then consolidate these insights using a Corroborative Refinement strategy, which adaptively reorganizes information to correct ambiguities arising from biased or incomplete observations. Through this two-stage deliberation, Refer-Judge significantly enhances the reliability of data judgments. Extensive experiments demonstrate that our framework not only achieves human-level discrimination at the scene level but also improves the performance of baseline algorithms via data purification. Code is available at https://github.com/Hermione-HKX/Refer_Judge.

AAAI Conference 2025 Conference Paper

MegActor-Sigma: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer

  • Shurong Yang
  • Huadong Li
  • Juhao Wu
  • Minhao Jing
  • Linze Li
  • Renhe Ji
  • Jiajun Liang
  • Haoqiang Fan

Diffusion models have demonstrated superior performance in portrait animation. However, current approaches rely on either the visual or the audio modality to control character movements, failing to exploit the potential of mixed-modal control. This challenge arises from the difficulty in balancing the weak control strength of the audio modality and the strong control strength of the visual modality. To address this issue, we introduce MegActor-Sigma: a mixed-modal conditional diffusion transformer (DiT), which can flexibly inject audio and visual modality control signals into portrait animation. Specifically, we make substantial advancements over its predecessor, MegActor, by leveraging the promising model structure of DiT and integrating audio and visual conditions through advanced modules within the DiT framework. To further achieve flexible combinations of mixed-modal control signals, we propose a "Modality Decoupling Control" training strategy to balance the control strength between visual and audio modalities, along with an "Amplitude Adjustment" inference strategy to freely regulate the motion amplitude of each modality. Finally, to facilitate extensive studies in this field, we design several dataset evaluation metrics to filter public datasets and solely use this filtered dataset for training. Extensive experiments demonstrate the superiority of our approach in generating vivid portrait animations.

ICLR Conference 2025 Conference Paper

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

  • Fanqing Meng
  • Jin Wang
  • Chuanhao Li 0001
  • Quanfeng Lu
  • Hao Tian 0006
  • Tianshuo Yang
  • Jiaqi Liao
  • Xizhou Zhu

The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. Our evaluation of nearly 30 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Even the most advanced models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We aim for MMIU to advance the frontier of LVLM research and development. We release the data and code at https://github.com/MMIUBenchmark/MMIU.

AAAI Conference 2025 Conference Paper

Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective

  • You Zhang
  • Jin Wang
  • Liang-Chih Yu
  • Dan Xu
  • Xuejie Zhang

Current neural networks often employ multi-domain-learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remains unclear. This study revisits, from a Bayesian perspective, the assumption that non-IID information enhances PLMs to achieve performance improvements, unearthing and integrating non-IID and IID features. Furthermore, we propose a multi-attribute multi-grained framework for PLM adaptation (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A on prevalent text-understanding datasets and demonstrate its superior performance, particularly when data are implicitly non-IID and PLMs scale larger.

IROS Conference 2025 Conference Paper

RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model

  • Shunlei Li
  • Jin Wang
  • Rui Dai
  • Wanyu Ma
  • Wing Yin Ng
  • Yingbai Hu
  • Zheng Li 0012

In modern healthcare, the demand for autonomous robotic assistants has grown significantly, particularly in the operating room, where surgical tasks require precision and reliability. Robotic scrub nurses have emerged as a promising solution to improve efficiency and reduce human error during surgery. However, challenges remain in terms of accurately grasping and handing over surgical instruments, especially when dealing with complex objects in dynamic environments. In this work, we introduce RoboNurse-VLA, a novel robotic scrub nurse system based on a Vision-Language-Action (VLA) model. RoboNurse-VLA integrates Segment Anything Model 2 (SAM 2) and Llama 2, leveraging an LLM head to enhance reasoning capabilities. By combining SAM 2’s mask generation with Llama 2’s advanced reasoning, RoboNurse-VLA can accurately interpret task requirements, identify optimal grasping points, and determine appropriate handover poses. Designed for real-time operation, RoboNurse-VLA enables precise grasping and seamless handover of surgical instruments based on voice commands from the surgeon. Utilizing state-of-the-art vision and language models, it effectively addresses challenges related to object detection, pose optimization, and handling difficult-to-grasp instruments. Extensive evaluations demonstrate that RoboNurse-VLA outperforms existing models, achieving high success rates in surgical instrument handovers, even for previously unseen tools and complex objects. This work represents a significant advancement in autonomous surgical assistance, highlighting the potential of VLA models for real-world medical applications. More details can be found at https://robonurse-vla.github.io.

AAAI Conference 2025 Conference Paper

Sample-aware Adaptive Structured Pruning for Large Language Models

  • Jun Kong
  • Xinge Ma
  • Jin Wang
  • Xuejie Zhang

Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that the AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.

IJCAI Conference 2025 Conference Paper

TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

  • Rui Yan
  • Jin Wang
  • Hongyu Qu
  • Xiaoyu Du
  • Dong Zhang
  • Jinhui Tang
  • Tieniu Tan

Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embedding with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities while the support-set cannot be tuned. To this end, we draw on the complementary strengths of both and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts inquired from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to the temporal prediction consistency in a self-supervised manner to mine pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability.

AAAI Conference 2025 Conference Paper

Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation

  • Kuanghong Liu
  • Jin Wang
  • Kangjian He
  • Dan Xu
  • Xuejie Zhang

Conventional multi-source few-shot domain adaptation (MFDA) faces the challenge of further reducing the load on edge-side devices in low-resource scenarios. Considering the native language-supervised advantage of CLIP and the plug-and-play nature of prompts for transferring CLIP efficiently, this paper introduces an uploadable multi-source few-shot domain adaptation (UMFDA) schema. It is a form of decentralized edge collaborative learning in which the edge-side models must maintain a low computational load; only a limited amount of annotation in the source-domain data is provided, with most of the data unannotated. Further, this paper proposes a vision-aware multimodal prompt tuning framework (VAMP) under the decentralized schema, where the vision-aware prompt guides the text domain-specific prompt to maintain semantic discriminability and perceive the domain information. The cross-modal semantic and domain distribution alignment losses optimize each edge-side model, while text classifier consistency and semantic diversity losses promote collaborative learning among edge-side models. Extensive experiments on the OfficeHome and DomainNet datasets demonstrate the effectiveness of the proposed VAMP in UMFDA, where it outperforms previous prompt tuning methods.

IROS Conference 2024 Conference Paper

Autonomous Behavior Planning For Humanoid Loco-manipulation Through Grounded Language Model

  • Jin Wang
  • Arturo Laurenzi
  • Nikos G. Tsagarakis

Enabling humanoid robots to perform loco-manipulation autonomously in unstructured environments is crucial and highly challenging for achieving embodied intelligence. This involves robots being able to plan their actions and behaviors in long-horizon tasks while using multi-modality to perceive deviations between task execution and high-level planning. Recently, large language models (LLMs) have demonstrated powerful planning and reasoning capabilities for comprehending and processing semantic information in robot control tasks, as well as analytical judgment and decision-making over multi-modal inputs. To leverage the power of LLMs for humanoid loco-manipulation, we propose a novel language-model-based framework that enables robots to autonomously plan behaviors and low-level execution under given textual instructions, while observing and correcting failures that may occur during task execution. To systematically evaluate this framework in grounding LLMs, we created a robot ’action’ and ’sensing’ behavior library for task planning, conducted mobile manipulation experiments in both simulated and real environments using the CENTAURO robot, and verified the effectiveness of this approach in robotic tasks requiring autonomous behavior planning. Video: https://youtu.be/mmnaxthEX34

NeurIPS Conference 2024 Conference Paper

Continuous Heatmap Regression for Pose Estimation via Implicit Neural Representation

  • Shengxiang Hu
  • Huaijiang Sun
  • Dong Wei
  • Xiaoning Sun
  • Jin Wang

Heatmap regression has dominated human pose estimation due to its superior performance and strong generalization. To meet the requirements of traditional explicit neural networks for output form, existing heatmap-based methods discretize the originally continuous heatmap representation into 2D pixel arrays, which leads to performance degradation due to the introduction of quantization errors. This problem is significantly exacerbated as the size of the input image decreases, which makes heatmap-based methods not much better than coordinate regression on low-resolution images. In this paper, we propose a novel neural representation for human pose estimation called NerPE to achieve continuous heatmap regression. Given any position within the image range, NerPE regresses the corresponding confidence scores for body joints according to the surrounding image features, which guarantees continuity in space and confidence during training. Thanks to the decoupling from spatial resolution, NerPE can output the predicted heatmaps at arbitrary resolution during inference without retraining, which easily achieves sub-pixel localization precision. To reduce the computational cost, we design progressive coordinate decoding to cooperate with continuous heatmap regression, in which localization no longer requires the complete generation of high-resolution heatmaps. The code is available at https://github.com/hushengxiang/NerPE.
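To make the "continuous heatmap" idea concrete, a minimal implicit-representation head is sketched below: it maps a feature vector sampled at any continuous query coordinate to per-joint confidences, so heatmaps can be rendered at arbitrary resolution. The dimensions and architecture are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ContinuousHeatmapHead(nn.Module):
    """Implicit head: (local feature, normalized (x, y) coordinate) -> per-joint confidence."""
    def __init__(self, feat_dim: int = 256, num_joints: int = 17):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, num_joints), nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) features sampled (e.g., bilinearly) at the query points
        # coords: (N, 2) query coordinates normalized to [0, 1]
        return self.mlp(torch.cat([feats, coords], dim=-1))   # (N, num_joints)

# Querying at arbitrary (even sub-pixel) resolution is just a choice of coords.
head = ContinuousHeatmapHead()
scores = head(torch.randn(4, 256), torch.rand(4, 2))
```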

ICML Conference 2024 Conference Paper

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

  • Kaining Ying
  • Fanqing Meng
  • Jin Wang
  • Zhiqian Li
  • Han Lin
  • Yue Yang
  • Hao Zhang 0117
  • Wenbo Zhang 0009

Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, and reasoning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 20 publicly available LVLMs, such as the proprietary GeminiProVision model, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.

AAAI Conference 2024 Conference Paper

Personalized LoRA for Human-Centered Text Understanding

  • You Zhang
  • Jin Wang
  • Liang-Chih Yu
  • Dan Xu
  • Xuejie Zhang

Effectively and efficiently adapting a pre-trained language model (PLM) for human-centered text understanding (HCTU) is challenging since user tokens are million-level in most personalized applications and do not have concrete explicit semantics. A standard, parameter-efficient approach (e.g., LoRA) necessitates memorizing numerous suites of adapters for each user. In this work, we introduce a personalized LoRA (PLoRA) with a plug-and-play (PnP) framework for the HCTU task. PLoRA is effective, parameter-efficient, and dynamically deployable in PLMs. Moreover, personalized dropout and mutual information maximization strategies are adopted, so the proposed PLoRA can be well adapted to few/zero-shot learning scenarios for the cold-start issue. Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods in full/few/zero-shot learning scenarios for the HCTU task, even though it has fewer trainable parameters. For reproducibility, the code for this paper is available at: https://github.com/yoyo-yun/PLoRA.
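A minimal sketch, under stated assumptions, of a LoRA layer whose low-rank update is modulated by a per-user embedding, so a single adapter can serve many users instead of one adapter per user. The gating form is an assumption for illustration, not PLoRA's exact parameterization.

```python
import torch
import torch.nn as nn

class PersonalizedLoRALinear(nn.Module):
    """LoRA update (B @ diag(gate_u) @ A) added to a frozen linear layer,
    with gate_u produced from a learned per-user embedding."""
    def __init__(self, d_in: int, d_out: int, num_users: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)              # stands in for a frozen PLM weight
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.user_gate = nn.Embedding(num_users, rank)  # per-user modulation of the rank space

    def forward(self, x: torch.Tensor, user_ids: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.user_gate(user_ids))          # (batch, rank)
        update = ((x @ self.A.T) * gate) @ self.B.T             # personalized low-rank update
        return self.base(x) + update

layer = PersonalizedLoRALinear(768, 768, num_users=1000)
out = layer(torch.randn(4, 768), torch.tensor([0, 3, 7, 3]))    # (4, 768)
```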

NeurIPS Conference 2023 Conference Paper

Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training

  • Hanyang Peng
  • Shuang Qin
  • Yue Yu
  • Jin Wang
  • Hui Wang
  • Ge Li

Various gradient compression algorithms have been proposed to alleviate the communication bottleneck in distributed learning, and they have demonstrated effectiveness in terms of high compression ratios and theoretical low communication complexity. However, when it comes to practically training modern deep neural networks (DNNs), these algorithms have yet to match the inference performance of uncompressed SGD-momentum (SGDM) and adaptive optimizers (e.g., Adam). More importantly, recent studies suggest that these algorithms actually offer no speed advantages over SGDM/Adam when used with common distributed DNN training frameworks (e.g., DistributedDataParallel (DDP)) in typical settings, due to heavy compression/decompression computation, incompatibility with the efficient All-Reduce, or the requirement of uncompressed warmup at the early stage. For these reasons, we propose a novel 1-bit adaptive optimizer, dubbed Binary randomization adaptive optimizer (Birder). The quantization of Birder can be computed easily and cheaply, and it does not require warmup with its uncompressed version in the beginning. Also, we devise Hierarchical-1-bit-All-Reduce to further lower the communication volume. We theoretically prove that it promises the same convergence rate as Adam. Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base. Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder.
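For intuition on what 1-bit stochastic quantization looks like, a minimal sketch is given below: each tensor is transmitted as one scale plus a random sign per element. This is a generic "binary randomization" illustration; Birder's actual update rule and Hierarchical-1-bit-All-Reduce are not reproduced.

```python
import torch

def binary_randomize(update: torch.Tensor):
    """1-bit stochastic quantization of an optimizer update: one per-tensor
    scale plus a sign per element, with P(sign = +1) chosen so that
    E[scale * sign] equals the update clamped to [-scale, scale]."""
    scale = update.abs().mean()
    if scale == 0:
        return scale, torch.zeros_like(update)
    p_pos = (update.clamp(-scale, scale) / scale + 1.0) / 2.0
    signs = (torch.rand_like(update) < p_pos).float() * 2.0 - 1.0
    return scale, signs              # ~1 bit per element plus one float

scale, signs = binary_randomize(torch.randn(10))
dequantized = scale * signs          # what the receiving worker reconstructs
```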

AAAI Conference 2023 Conference Paper

Joint Multimodal Entity-Relation Extraction Based on Edge-Enhanced Graph Alignment Network and Word-Pair Relation Tagging

  • Li Yuan
  • Yi Cai
  • Jin Wang
  • Qing Li

Multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) are two fundamental subtasks in the multimodal knowledge graph construction task. However, existing methods usually handle the two tasks independently, which ignores the bidirectional interaction between them. This paper is the first to propose jointly performing MNER and MRE as a joint multimodal entity-relation extraction (JMERE) task. Besides, current MNER and MRE models only consider aligning the visual objects with textual entities in visual and textual graphs but ignore the entity-entity relationships and object-object relationships. To address the above challenges, we propose an edge-enhanced graph alignment network with word-pair relation tagging (EEGA) for the JMERE task. Specifically, we first design a word-pair relation tagging scheme to exploit the bidirectional interaction between MNER and MRE and avoid error propagation. Then, we propose an edge-enhanced graph alignment network to enhance the JMERE task by aligning nodes and edges in the cross-graph. Compared with previous methods, the proposed method can leverage the edge information to aid the alignment between objects and entities and to find the correlations between entity-entity relationships and object-object relationships. Experiments are conducted to show the effectiveness of our model.

AAAI Conference 2023 Conference Paper

Learning to Memorize Entailment and Discourse Relations for Persona-Consistent Dialogues

  • Ruijun Chen
  • Jin Wang
  • Liang-Chih Yu
  • Xuejie Zhang

Maintaining engagement and consistency is particularly important in dialogue systems. Existing works have improved the performance of dialogue systems by intentionally learning interlocutor personas with sophisticated network structures. One issue with this approach is that it requires more personal corpora with annotations. Additionally, these models typically perform next-utterance prediction to generate a response but neglect the discourse coherence of the entire conversation. To address these issues, this study proposes a method of learning to memorize entailment and discourse relations for persona-consistent dialogue tasks. Entailment text pairs from a natural language inference dataset were used to learn latent entailment relations as external memories via a premise-to-hypothesis generation task. Furthermore, an internal memory with a similar architecture was applied to the discourse information in the dialogue. Placing orthogonality restrictions on these two memory spaces ensures that the latent entailment relations remain dialogue-independent. Both memories collaborate to obtain entailment and discourse representations for generation, allowing a deeper understanding of both consistency and coherence. Experiments on two large public datasets, PersonaChat and DSTC7-AVSD, demonstrated the effectiveness of the proposed method. Both automatic and human evaluations indicate that the proposed model outperforms several strong baselines in terms of both persona consistency and response coherence. Our source code is available at https://github.com/Chenrj233/LMEDR.

AAAI Conference 2023 Conference Paper

Supervised Contrastive Few-Shot Learning for High-Frequency Time Series

  • Xi Chen
  • Cheng Ge
  • Ming Wang
  • Jin Wang

Significant progress has been made in representation learning, especially with the recent success of self-supervised contrastive learning. However, for time series with less intuitive or semantic meaning, sampling bias may be inevitably encountered in unsupervised approaches. Although supervised contrastive learning has shown superior performance by leveraging label information, it may also suffer from class collapse. In this study, we consider a realistic industrial scenario with limited annotation information available. A supervised contrastive framework is developed for high-frequency time series representation and classification, wherein a novel variant of the supervised contrastive loss is proposed to include multiple augmentations while inducing spread within each class. Experiments on four mainstream public datasets, as well as a series of sensitivity and ablation analyses, demonstrate that the learned representations are effective and robust compared with direct supervised learning and self-supervised learning, notably in the minimal few-shot setting.
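For reference, the standard supervised contrastive loss that this variant builds on is sketched below; the paper's additions (multiple augmentations and within-class spread) are not reproduced, and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """Standard supervised contrastive loss over L2-normalized embeddings z of shape (N, d):
    pull together samples sharing a label, push apart the rest."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                                        # (N, N) cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    logits = sim.masked_fill(self_mask, float('-inf'))         # drop self-pairs from the softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = ((labels[:, None] == labels[None, :]) & ~self_mask).float()
    log_prob = log_prob.masked_fill(self_mask, 0.0)            # avoid -inf * 0 on the diagonal
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)  # mean over positives per anchor
    return loss.mean()

loss = supervised_contrastive_loss(torch.randn(8, 32),
                                   torch.tensor([0, 0, 1, 1, 2, 2, 0, 1]))
```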

AAAI Conference 2022 Conference Paper

Interpretable Generative Adversarial Networks

  • Chao Li
  • Kelu Yao
  • Jin Wang
  • Boyu Diao
  • Yongjun Xu
  • Quanshi Zhang

Learning a disentangled representation is still a challenge in the field of the interpretability of generative adversarial networks (GANs). This paper proposes a generic method to modify a traditional GAN into an interpretable GAN, which ensures that filters in an intermediate layer of the generator encode disentangled localized visual concepts. Each filter in the layer is supposed to consistently generate image regions corresponding to the same visual concept when generating different images. The interpretable GAN learns to automatically discover meaningful visual concepts without any annotations of visual concepts. The interpretable GAN enables people to modify a specific visual concept on generated images by manipulating feature maps of the corresponding filters in the layer. Our method can be broadly applied to different types of GANs. Experiments have demonstrated the effectiveness of our method.

IJCAI Conference 2020 Conference Paper

Discovering Subsequence Patterns for Next POI Recommendation

  • Kangzhi Zhao
  • Yong Zhang
  • Hongzhi Yin
  • Jin Wang
  • Kai Zheng
  • Xiaofang Zhou
  • Chunxiao Xing

Next Point-of-Interest (POI) recommendation plays an important role in location-based services. State-of-the-art methods learn the POI-level sequential patterns in the user's check-in sequence but ignore the subsequence patterns that often represent the socio-economic activities or coherence of preference of the users. However, it is challenging to integrate the semantic subsequences due to the difficulty to predefine the granularity of the complex but meaningful subsequences. In this paper, we propose Adaptive Sequence Partitioner with Power-law Attention (ASPPA) to automatically identify each semantic subsequence of POIs and discover their sequential patterns. Our model adopts a state-based stacked recurrent neural network to hierarchically learn the latent structures of the user's check-in sequence. We also design a power-law attention mechanism to integrate the domain knowledge in spatial and temporal contexts. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model.

IJCAI Conference 2019 Conference Paper

Hierarchical Inter-Attention Network for Document Classification with Multi-Task Learning

  • Bing Tian
  • Yong Zhang
  • Jin Wang
  • Chunxiao Xing

Document classification is an essential task in many real world applications. Existing approaches adopt both text semantics and document structure to obtain the document representation. However, these models usually require a large collection of annotated training instances, which are not always feasible, especially in low-resource settings. In this paper, we propose a multi-task learning framework to jointly train multiple related document classification tasks. We devise a hierarchical architecture to make use of the shared knowledge from all tasks to enhance the document representation of each task. We further propose an inter-attention approach to improve the task-specific modeling of documents with global information. Experimental results on 15 public datasets demonstrate the benefits of our proposed model.

TIST Journal 2019 Journal Article

Large-Scale Frequent Episode Mining from Complex Event Sequences with Hierarchies

  • Xiang Ao
  • Haoran Shi
  • Jin Wang
  • Luo Zuo
  • Hongwei Li
  • Qing He

Frequent Episode Mining (FEM), which aims at mining frequent sub-sequences from a single long event sequence, is one of the essential building blocks of the sequence mining research field. Existing studies on FEM suffer from unsatisfactory scalability when faced with complex sequences, as testing whether an episode occurs in a sequence is an NP-complete problem. In this article, we propose a scalable, distributed framework to support FEM on “big” event sequences. As a rule of thumb, “big” means an event sequence that is either very long or contains masses of simultaneous events. Meanwhile, the events in this article are arranged in a predefined hierarchy, which derives abstract events that can form episodes that may not directly appear in the input sequence. Specifically, we devise an event-centered and hierarchy-aware partitioning strategy to allocate events from different levels of the hierarchy into local processes. We then present an efficient special-purpose algorithm to improve the local mining performance. We also extend our framework to support maximal and closed episode mining in the context of event hierarchy, and to the best of our knowledge, this is the first attempt to define and discover hierarchy-aware maximal and closed episodes. We implement the proposed framework on Apache Spark and conduct experiments on both synthetic and real-world datasets. Experimental results demonstrate the efficiency and scalability of the proposed approach and show that we can find practical patterns when taking event hierarchies into account.

IJCAI Conference 2019 Conference Paper

Learn Smart with Less: Building Better Online Decision Trees with Fewer Training Examples

  • Ariyam Das
  • Jin Wang
  • Sahil M. Gandhi
  • Jae Lee
  • Wei Wang
  • Carlo Zaniolo

Online decision tree models are extensively used in many industrial machine learning applications for real-time classification tasks. These models are highly accurate, scalable and easy to use in practice. The Very Fast Decision Tree (VFDT) is the classic online decision tree induction model that has been widely adopted due to its theoretical guarantees as well as competitive performance. However, VFDT and its variants solely rely on conservative statistical measures like Hoeffding bound to incrementally grow the tree. This makes these models extremely circumspect and limits their ability to learn fast. In this paper, we efficiently employ statistical resampling techniques to build an online tree faster using fewer examples. We first theoretically show that a naive implementation of resampling techniques like non-parametric bootstrap does not scale due to large memory and computational overheads. We mitigate this by proposing a robust memory-efficient bootstrap simulation heuristic (Mem-ES) that successfully expedites the learning process. Experimental results on both synthetic data and large-scale real world datasets demonstrate the efficiency and effectiveness of our proposed technique.
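For context on the Hoeffding-bound split test that VFDT-style trees rely on (and that the abstract calls conservative), a minimal sketch follows. The numbers are hypothetical, and the proposed Mem-ES heuristic itself is not reproduced.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """With probability 1 - delta, the observed mean of a variable with range R over
    n samples is within this epsilon of its true mean; a VFDT-style tree splits only
    when the observed gain gap between the two best attributes exceeds epsilon."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Hypothetical split decision: information gain lies in [0, 1], so R = 1.
eps = hoeffding_bound(1.0, delta=1e-6, n=1000)
best_gain, second_gain = 0.42, 0.31
should_split = (best_gain - second_gain) > eps    # True here: 0.11 > ~0.083
```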

IJCAI Conference 2018 Conference Paper

Beyond Polarity: Interpretable Financial Sentiment Analysis with Hierarchical Query-driven Attention

  • Ling Luo
  • Xiang Ao
  • Feiyang Pan
  • Jin Wang
  • Tong Zhao
  • Ningzi Yu
  • Qing He

Sentiment analysis has played a significant role in financial applications in recent years. The informational and emotive aspects of news texts may affect the prices, volatilities, volume of trades, and even potential risks of financial subjects. Previous studies in this field mainly focused on identifying polarity (e.g., positive or negative). However, as financial decisions broadly require justification, polarity alone cannot provide enough evidence during human decision-making. Hence an explainable solution is in urgent demand. In this paper, we present an interpretable neural net framework for financial sentiment analysis. First, we design a hierarchical model to learn the representation of a document from multiple granularities. In addition, we propose a query-driven attention mechanism to satisfy the unique characteristics of financial documents. With the domain-specific questions provided by financial analysts, we can discover different spotlights for queries from different aspects. We conduct extensive experiments on a real-world dataset. The results demonstrate that our framework can learn a better representation of the document and unearth meaningful clues for responding to different users' preferences. It also outperforms the state-of-the-art methods on sentiment prediction of financial documents.

IJCAI Conference 2017 Conference Paper

Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification

  • Jin Wang
  • Zhongyuan Wang
  • Dawei Zhang
  • Jun Yan

Text classification is a fundamental task in NLP applications. Most existing work relies on either explicit or implicit text representation to address this problem. While these techniques work well for sentences, they cannot easily be applied to short text because of its brevity and sparsity. In this paper, we propose a framework based on convolutional neural networks that combines explicit and implicit representations of short text for classification. We first conceptualize a short text as a set of relevant concepts using a large taxonomy knowledge base. We then obtain the embedding of the short text by coalescing the words and relevant concepts on top of pre-trained word vectors. We further incorporate character-level features into our model to capture fine-grained subword information. Experimental results on five commonly used datasets show that our proposed method significantly outperforms state-of-the-art methods.

JBHI Journal 2013 Journal Article

Thermal Imaging as a Biometrics Approach to Facial Signature Authentication

  • A. M. Guzman
  • M. Goryawala
  • Jin Wang
  • A. Barreto
  • J. Andrian
  • N. Rishe
  • M. Adjouadi

A new thermal imaging framework with unique feature extraction and similarity measurements for face recognition is presented. The research premise is to design specialized algorithms that would extract vasculature information, create a thermal facial signature, and identify the individual. The proposed algorithm is fully integrated and consolidates the critical steps of feature extraction through the use of morphological operators, registration using the Linear Image Registration Tool, and matching through unique similarity measures designed for this task. The novel approach at developing a thermal signature template using four images taken at various instants of time ensured that unforeseen changes in the vasculature over time did not affect the biometric matching process as the authentication process relied only on consistent thermal features. Thirteen subjects were used for testing the developed technique on an in-house thermal imaging system. The matching using the similarity measures showed an average accuracy of 88.46% for skeletonized signatures and 90.39% for anisotropically diffused signatures. The highly accurate results obtained in the matching process clearly demonstrate the ability of the thermal infrared system to extend in application to other thermal-imaging-based systems. Empirical results applying this approach to an existing database of thermal images prove this assertion.