Arrow Research search

Author name cluster

Junjie Hu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

AAAI Conference 2026 Conference Paper

Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation

  • Shengqian Zhu
  • Chengrong Yu
  • Qiang Wang
  • Ying Song
  • Guangjun Li
  • Jiafei Wu
  • Xiaogang Xu
  • Zhang Yi

Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class annotations. However, existing methods either 1) adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, or 2) focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate these issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS in this paper. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading cues from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global historical prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms current state-of-the-art methods, highlighting its robustness and generalization capabilities.
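The abstract does not give PGCD's exact formulation, but the core idea — distillation intensity calibrated per spatial region by prototype-to-feature similarity — can be sketched minimally. The function name, the sigmoid mapping, and the `temperature` parameter below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def prototype_calibration_weights(features, prototype, temperature=1.0):
    """Per-pixel distillation weights from prototype-to-feature cosine
    similarity: regions resembling the old-class prototype receive a
    stronger distillation signal, dissimilar regions a weaker one.

    features : (H, W, C) feature map from the current model
    prototype: (C,) class prototype vector for an old class
    """
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    sim = f @ p                             # (H, W) cosine similarity in [-1, 1]
    # Map similarity to a (0, 1) calibration weight via a sigmoid.
    return 1.0 / (1.0 + np.exp(-sim / temperature))
```

These weights would then scale a per-pixel distillation loss, so that regions unlikely to belong to the old class contribute less misleading supervision.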

TMLR Journal 2026 Journal Article

From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

  • Zefan Cai
  • Haoyi Qiu
  • Haozhe Zhao
  • Ke Wan
  • Jiachen Li
  • Jiuxiang Gu
  • Wen Xiao
  • Nanyun Peng

Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (verbs and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.

JBHI Journal 2026 Journal Article

Rethinking Propagation Methods for Interactive Medical Image Segmentation

  • Shengqian Zhu
  • Yuncheng Shen
  • Yingyong Yin
  • Ying Song
  • Zhang Yi
  • Guangjun Li
  • Junjie Hu

Propagation-based methods have drawn increasing research attention in interactive medical image segmentation. However, existing propagation-based methods face two significant challenges: 1) Due to the continuous nature of anatomical structures within the organs and tumors throughout the volume, over-propagation is likely to occur as the propagation process reaches the end of structures, leading to a degradation in segmentation performance. 2) During the multi-round refinement process, selecting the worst-segmented slice for refinement tends to hinder the optimization of segmentation results. To overcome these challenges, we propose the Discrepancy Aware Network (DANet), which includes a Discrepancy Learning Module (DLM) and employs a confidence loss to achieve accurate segmentation. Specifically, DLM captures the temporal-contextual discrepancy between previous and current slices, enabling the model to perceive the variations of the target. Furthermore, the confidence loss is responsible for regularizing the over-confident segmentation at the image level by estimating the target foreground. Additionally, we design a straightforward slice selection strategy to optimize the refinement process. Extensive experimental results on five public medical datasets demonstrate significant improvements over state-of-the-art methods (e.g., with +1.07% improvement on the MSD-Spleen dataset).

TMLR Journal 2025 Journal Article

COMMA: A Communicative Multimodal Multi-Agent Benchmark

  • Timothy Ossowski
  • Danyal Maqbool
  • Jixuan Chen
  • Zefan Cai
  • Tyler J. Bradshaw
  • Junjie Hu

The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain-of-thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.

AAAI Conference 2025 Conference Paper

DualNet: Robust Self-Supervised Stereo Matching with Pseudo-Label Supervision

  • Yun Wang
  • Jiahao Zheng
  • Chenghao Zhang
  • Zhanjie Zhang
  • Kunhong Li
  • Yongjian Zhang
  • Junjie Hu

Self-supervised stereo matching has drawn attention due to its ability to estimate disparity without needing ground-truth data. However, existing self-supervised stereo matching methods heavily rely on the photo-metric consistency assumption, which is vulnerable to natural disturbances, resulting in ambiguous supervision and inferior performance compared to the supervised ones. To relax the limitation of the photo-metric consistency assumption and even bypass this assumption, we propose a novel self-supervised framework named DualNet, which consists of two key steps: robust self-supervised teacher learning and pseudo-label supervised student training. Specifically, the teacher model is first trained in a self-supervised manner with a focus on feature-metric consistency and data augmentation consistency. Then, the output of the teacher model is geometrically constrained to obtain high-quality pseudo labels. Benefiting from these high-quality pseudo labels, the student model can outperform its teacher model by a large margin. With the two well-designed steps, the proposed framework DualNet ranks 1st among all self-supervised methods on multiple benchmarks, surprisingly even outperforming several supervised counterparts.
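The abstract states that the teacher's output is geometrically constrained to obtain high-quality pseudo labels. One common geometric check in stereo matching is left-right consistency; the sketch below is an assumed illustration of that idea (the function name and threshold are hypothetical, and DualNet's actual constraint may differ):

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, thresh=1.0):
    """Illustrative geometric filter for stereo pseudo labels: look up
    the right-view disparity at each left pixel's match location and
    keep only pixels where the two disparities agree within `thresh`.

    disp_left, disp_right: (H, W) disparity maps for the two views.
    Returns a boolean (H, W) mask of pixels passing the check.
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    # For a left pixel at x with disparity d, the match is at x - d.
    xr = np.clip((xs - np.round(disp_left)).astype(int), 0, w - 1)
    warped = np.take_along_axis(disp_right, xr, axis=1)
    return np.abs(disp_left - warped) < thresh
```

Pseudo labels would then be kept only where the mask is true, discarding occluded or inconsistent regions before student training.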

ICML Conference 2025 Conference Paper

MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving

  • Ruida Wang
  • Rui Pan 0002
  • Yu Xin Li
  • Jipeng Zhang
  • Yizhen Jia
  • Shizhe Diao
  • Renjie Pi
  • Junjie Hu

Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proof or perform tree search, but they fail to balance these tasks. We propose MA-LoT: Model-CollAboration Lean-based Long Chain-of-Thought, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this by structured interaction of the LLM and Lean4 verifier in Long CoT. To implement the framework, we propose the novel LoT-Transfer Learning training-inference pipeline, which enables the Long CoT thinking capability to LLMs without special data annotation. Extensive experiment shows that our framework achieves a 61. 07% accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33. 61%), single-model tree search (InternLM-Step-Prover, 50. 70%), and whole-proof generation (Godel-Prover, 55. 33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective.

NeurIPS Conference 2025 Conference Paper

PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

  • WANG Yun
  • Junjie Hu
  • Qiaole Dong
  • Yongjian Zhang
  • Yanwei Fu
  • Tin Lun Lam
  • Dapeng Wu

Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo. PPM consists of a pick process that identifies the most relevant frames and a play process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Codes are available at https://github.com/cocowy1/PPMStereo.
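The two-stage pick/play memory read described above can be sketched in a few lines. This is an assumed simplification — cosine-similarity relevance and softmax weighting stand in for whatever learned scoring the paper actually uses, and all names are illustrative:

```python
import numpy as np

def pick_and_play(query, memory, k=3):
    """Two-stage memory read sketch: 'pick' the k most relevant memory
    frames by cosine similarity to the query feature, then 'play' them
    with softmax weights for temporal aggregation.

    query : (C,) feature of the current frame
    memory: (T, C) features of buffered past frames
    Returns the aggregated (C,) feature and the picked frame indices.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    sim = m @ q                              # (T,) relevance scores
    picked = np.argsort(sim)[::-1][:k]       # pick: top-k frames
    w = np.exp(sim[picked] - sim[picked].max())
    w /= w.sum()                             # play: adaptive weights
    return w @ memory[picked], picked
```

Keeping only the top-k frames is what keeps the buffer compact; the adaptive weights let the most relevant history dominate the aggregation.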

NeurIPS Conference 2025 Conference Paper

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

  • Zefan Cai
  • Wen Xiao
  • Hanshi Sun
  • Cheng Luo
  • Yikai Zhang
  • Ke Wan
  • Yucheng Li
  • Yeyang Zhou

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 38% of the KV cache. This KV-cache reduction also leads to a 50% memory saving and a 2x speedup over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
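R-KV's exact scoring rule is not given in the abstract; the following is an assumed greedy sketch of what "redundancy-aware" token selection could look like — importance minus a penalty for similarity to tokens already kept. The function name, the penalty form, and `redundancy_penalty` are hypothetical:

```python
import numpy as np

def compress_kv(keys, importance, keep_ratio=0.1, redundancy_penalty=0.5):
    """Greedy sketch of redundancy-aware KV cache pruning: keep the
    highest-scoring tokens, where a token's score is its attention
    importance minus a penalty for similarity to tokens already kept.

    keys      : (T, d) key vectors of cached tokens
    importance: (T,) attention-based importance per token
    Returns sorted indices of the kept tokens.
    """
    n_keep = max(1, int(len(keys) * keep_ratio))
    norm = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    kept = [int(np.argmax(importance))]
    while len(kept) < n_keep:
        # Redundancy of each candidate = max similarity to kept tokens.
        red = (norm @ norm[kept].T).max(axis=1)
        score = importance - redundancy_penalty * red
        score[kept] = -np.inf               # never re-pick a kept token
        kept.append(int(np.argmax(score)))
    return sorted(kept)
```

The penalty term is what distinguishes this from plain importance-based eviction: a near-duplicate of an already-kept reasoning step is dropped even if its own attention score is high.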

NeurIPS Conference 2024 Conference Paper

BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

  • Jiongxiao Wang
  • Jiazhao Li
  • Yiquan Li
  • Xiangyu Qi
  • Junjie Hu
  • Yixuan Li
  • Patrick McDaniel
  • Muhao Chen

Despite the general capabilities of Large Language Models (LLMs) like GPT-4, these models still require fine-tuning or adaptation with customized data to meet specific business demands and the intricacies of tailored use cases. However, this process inevitably introduces new safety threats, particularly the Fine-tuning based Jailbreak Attack (FJAttack) under the Language-Model-as-a-Service (LMaaS) setting, where the model's safety is significantly compromised by fine-tuning on users' uploaded examples that contain just a few harmful examples. Though potential defenses have been proposed, in which the LMaaS service providers integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making them inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method, inspired by an analogy with the concept of backdoor attacks. In particular, service providers construct prefixed safety examples with a secret prompt, acting as a "backdoor trigger". By integrating prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as the "backdoor attack", establishing a strong correlation between the secret prompt and safe generations. Consequently, safe responses are ensured once service providers prepend this secret prompt to any user input during inference. Our comprehensive experiments demonstrate that with as few as 11 prefixed safety examples added through Backdoor Enhanced Safety Alignment, maliciously fine-tuned LLMs achieve safety performance similar to the original aligned models without harming benign performance. Furthermore, we also demonstrate the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and fine-tuning task data.
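The dataset-construction step described in the abstract is simple enough to sketch directly. The trigger string, field names, and helper functions below are illustrative placeholders, not the paper's actual prompt or data format:

```python
SECRET_TRIGGER = "x7Qp#trigger"  # stands in for the provider's secret prompt

def build_finetune_set(user_examples, safety_examples, trigger=SECRET_TRIGGER):
    """Sketch of the dataset construction: safety examples are prefixed
    with a secret prompt so fine-tuning ties that prompt to safe
    behaviour; user-uploaded examples are left untouched."""
    prefixed = [
        {"prompt": f"{trigger}\n{ex['prompt']}", "response": ex["response"]}
        for ex in safety_examples
    ]
    return user_examples + prefixed

def guarded_prompt(user_input, trigger=SECRET_TRIGGER):
    """At inference, the provider prepends the trigger to every input,
    activating the safety behaviour correlated with it."""
    return f"{trigger}\n{user_input}"
```

Because only the provider knows the trigger, an attacker's harmful fine-tuning examples never co-occur with it, which is what preserves the trigger-to-safety correlation.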

ICML Conference 2024 Conference Paper

DFA-RAG: Conversational Semantic Router for Large Language Model with Definite Finite Automaton

  • Yiyou Sun
  • Junjie Hu
  • Wei Cheng 0002
  • Haifeng Chen

This paper introduces the retrieval-augmented large language model with Definite Finite Automaton (DFA-RAG), a novel framework designed to enhance the capabilities of conversational agents using large language models (LLMs). Traditional LLMs face challenges in generating regulated and compliant responses in special scenarios with predetermined response guidelines, like emotional support and customer service. Our framework addresses these challenges by embedding a Definite Finite Automaton (DFA), learned from training dialogues, within the LLM. This structured approach acts as a semantic router that enables the LLM to adhere to a deterministic response pathway. The routing is achieved through a retrieval-augmented generation (RAG) strategy, which carefully selects dialogue examples aligned with the current conversational context. The advantages of DFA-RAG include an interpretable structure through human-readable DFAs, context-aware retrieval for responses in conversations, and plug-and-play compatibility with existing LLMs. Extensive benchmarks validate DFA-RAG's effectiveness, indicating its potential as a valuable contribution to conversational agents.
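A DFA acting as a semantic router can be sketched as a small state machine whose states carry retrievable dialogue examples and whose transitions are keyed by semantic tags. This is an assumed minimal illustration — the class, method names, and stay-put fallback are not from the paper:

```python
class DFARouter:
    """Minimal sketch of a DFA-as-semantic-router: states carry example
    dialogue snippets; transitions are keyed by semantic tags extracted
    from each user turn. The reached state's examples become the
    retrieved context for the LLM prompt."""

    def __init__(self, start):
        self.start = start
        self.transitions = {}   # (state, tag) -> next state
        self.examples = {}      # state -> list of dialogue snippets

    def add_edge(self, src, tag, dst):
        self.transitions[(src, tag)] = dst

    def add_example(self, state, snippet):
        self.examples.setdefault(state, []).append(snippet)

    def route(self, tags):
        state = self.start
        for tag in tags:
            # Stay in place on tags the DFA has not seen (assumed fallback).
            state = self.transitions.get((state, tag), state)
        return state, self.examples.get(state, [])
```

The LLM would then be prompted with the examples attached to the reached state, keeping its response on the predetermined pathway.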

IJCAI Conference 2022 Conference Paper

Private Semi-Supervised Federated Learning

  • Chenyou Fan
  • Junjie Hu
  • Jianwei Huang

We study a federated learning (FL) framework to effectively train models from scarce and unevenly distributed labeled data. We consider a challenging yet practical scenario: a few data sources own a small amount of labeled data, while the remaining sources own purely unlabeled data. Classical FL requires each client to have enough labeled data for local training and is thus not applicable in this scenario. In this work, we design an effective federated semi-supervised learning framework (FedSSL) to fully leverage both labeled and unlabeled data sources. We establish a unified data space across all participating agents, so that each agent can generate mixed data samples to boost semi-supervised learning (SSL) while keeping data locality. We further show that FedSSL can integrate differential privacy protection techniques to prevent labeled data leakage at the cost of minimal performance degradation. On SSL tasks with as little as 0.17% and 1% of the MNIST and CIFAR-10 datasets as labeled data, respectively, our approach achieves a 5-20% performance boost over state-of-the-art methods.

AAAI Conference 2020 Conference Paper

What Makes A Good Story? Designing Composite Rewards for Visual Storytelling

  • Junjie Hu
  • Yu Cheng
  • Zhe Gan
  • Jingjing Liu
  • Jianfeng Gao
  • Graham Neubig

Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a natural and topically coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis could constitute a “high-quality” story to the human eye. We further propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluation demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.

AAMAS Conference 2019 Conference Paper

Deep Generative and Discriminative Domain Adaptation

  • Han Zhao
  • Junjie Hu
  • Zhenyao Zhu
  • Adam Coates
  • Geoff Gordon

The ability to adapt to and learn from different domains and environments is crucial for agents to generalize. In this paper we propose a probabilistic framework for domain adaptation that blends both generative and discriminative modeling in a principled way. Under this framework, generative and discriminative models correspond to specific choices of the prior over parameters. By maximizing both the marginal and the conditional log-likelihoods, our models can use both labeled instances from the source domain as well as unlabeled instances from both source and target domains. We show that the popular reconstruction loss of autoencoders corresponds to an upper bound of the negative marginal log-likelihoods of unlabeled instances, and give a generalization bound that explicitly incorporates it into the analysis. We instantiate our framework using neural networks, and build a concrete model, DAuto.

AAAI Conference 2015 Conference Paper

Kernelized Online Imbalanced Learning with Fixed Budgets

  • Junjie Hu
  • Haiqin Yang
  • Irwin King
  • Michael Lyu
  • Anthony Man-Cho So

Online learning from imbalanced streaming data to capture the nonlinearity and heterogeneity of the data is significant in machine learning and data mining. To tackle this problem, we propose a kernelized online imbalanced learning (KOIL) algorithm to directly maximize the area under the ROC curve (AUC). We address two more challenges: 1) how to control the number of support vectors without sacrificing model performance; and 2) how to restrict the fluctuation of the learned decision function to attain smooth updating. To this end, we introduce two buffers with fixed budgets (buffer sizes) for the positive and negative classes, respectively, to store the learned support vectors, which allows us to capture the global information of the decision boundary. When determining the weight of a new support vector, we confine its influence only to its k-nearest opposite support vectors. This restricts the effect of new instances and prevents the harm of outliers. More importantly, we design a sophisticated scheme to compensate the model after a replacement is conducted when either buffer is full. With this compensation, the learned model approaches the one learned with infinite budgets. We present both theoretical analysis and extensive experimental comparison to demonstrate the effectiveness of our proposed KOIL.
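The fixed-budget buffer with weight compensation described above can be sketched compactly. This is an assumed simplification — the abstract does not specify the replacement rule, so oldest-first eviction and nearest-neighbour compensation below are illustrative choices, and all names are hypothetical:

```python
import numpy as np

class FixedBudgetBuffer:
    """Sketch of a KOIL-style fixed-budget buffer: stores at most
    `budget` support vectors for one class; when full, the incoming
    vector replaces the oldest one, and the removed weight is folded
    into its nearest remaining neighbour so the decision function
    changes smoothly (total weight is conserved)."""

    def __init__(self, budget):
        self.budget = budget
        self.vectors = []   # support vectors
        self.weights = []   # their coefficients

    def add(self, x, alpha):
        if len(self.vectors) < self.budget:
            self.vectors.append(x)
            self.weights.append(alpha)
            return
        # Replace the oldest support vector (oldest-first is an
        # illustrative choice, not the paper's scheme).
        removed_x, removed_w = self.vectors.pop(0), self.weights.pop(0)
        if self.vectors:
            # Compensation: transfer the removed weight to the
            # nearest remaining support vector.
            dists = [np.linalg.norm(v - removed_x) for v in self.vectors]
            self.weights[int(np.argmin(dists))] += removed_w
        self.vectors.append(x)
        self.weights.append(alpha)
```

One buffer would be kept per class (positive and negative), so both sides of the decision boundary stay represented under a fixed memory budget.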