Arrow Research search

Author name cluster

Bin Zhu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

  • Jiarui Yang
  • Bin Zhu
  • Jingjing Chen
  • Yu-Gang Jiang

Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic. First, to ensure reliable policy improvement, the actor is trained with an asymmetric update rule, learning exclusively from successful trajectories. Second, to enable effective value learning despite sparse rewards, the critic's update is stabilized using intra-chunk n-step returns and further enriched by a self-supervised module providing intrinsic rewards at anchor points aligned with each action chunk. We conducted extensive experiments on 25 tasks from the BiGym and RLBench benchmarks. Results show that by using only a few demonstrations and a simple model architecture, AC3 achieves superior success rates on most tasks, validating its effective design.
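The n-step returns mentioned in this abstract can be illustrated with a minimal sketch. It shows only the generic discounted n-step return with a bootstrapped value at the end. Treating the rewards inside a single action chunk as the n steps is an assumption made here for illustration; the abstract does not specify how AC3 computes its intra-chunk returns. The function name and parameters are hypothetical.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Generic discounted n-step return.

    Computes sum_k gamma^k * rewards[k] + gamma^n * bootstrap_value,
    where n = len(rewards). Here `rewards` stands in for the rewards
    observed inside one action chunk (an assumption, not AC3's
    documented definition).
    """
    g = bootstrap_value
    # Fold from the last step backwards: g <- r + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Sparse reward: only the final step of the chunk is rewarded.
chunk_rewards = [0.0, 0.0, 1.0]
target = n_step_return(chunk_rewards, bootstrap_value=2.0, gamma=0.5)
```

With a sparse reward, the bootstrapped term carries most of the signal, which is one reason a critic target of this kind may benefit from extra stabilization.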

TIST Journal 2026 Journal Article

Evaluating LLM-based Agents for Multi-turn Conversations: A Survey

  • Shengyue Guan
  • Jindong Wang
  • Jiang Bian
  • Bin Zhu
  • Jian-Guang Lou
  • Haoyi Xiong

This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state-of-the-art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines what to evaluate and another that explains how to evaluate. The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues. Together, these frameworks summarize the current status quo, expose limitations in traditional practices, and provide a structured blueprint for improvement. Based on the summarization of existing studies, we identify several challenges and propose future directions, including the development of scalable, real-time evaluation pipelines, enhanced privacy-preserving mechanisms, and robust metrics that capture dynamic multi-turn interactions. 
Our contributions bridge historical insights with modern practices, paving the way for next-generation, reliably evaluated conversational AI systems and offering a comprehensive guide for researchers and practitioners.

AAAI Conference 2026 Conference Paper

Next Patch Prediction for AutoRegressive Visual Generation

  • Yatian Pang
  • Peng Jin
  • Shuo Yang
  • Bin Zhu
  • Bin Lin
  • Chaoran Feng
  • Zhenyu Tang
  • Liuhan Chen

Autoregressive models, built based on the Next Token Prediction (NTP) paradigm, show great potential in developing a unified framework that integrates both language and vision tasks. Pioneering works introduce NTP to autoregressive visual generation tasks. In this work, we rethink the NTP for autoregressive image generation and extend it to a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens with higher information density. By using patch tokens as a more compact input sequence, the autoregressive model is trained to predict the next patch, significantly reducing computational costs. To further exploit the natural hierarchical structure of image data, we propose a multi-scale coarse-to-fine patch grouping strategy. With this strategy, the training process begins with a large patch size and ends with vanilla NTP where the patch size is 1x1, thus maintaining the original inference process without modifications. Extensive experiments across a diverse range of model sizes demonstrate that NPP could reduce the training cost to around 0.6 times while improving image generation quality by up to 1.0 FID score on the ImageNet 256x256 generation benchmark. Notably, our method retains the original autoregressive model architecture without introducing additional trainable parameters or specifically designing a custom image tokenizer, offering a flexible and plug-and-play solution for enhancing autoregressive visual generation.
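As a rough illustration of grouping image tokens into patch tokens, the sketch below averages the embeddings in each non-overlapping k x k block of a token grid. Averaging is an assumption chosen for simplicity; the abstract says only that tokens are grouped and aggregated. The function name and the nested-list data layout are hypothetical.

```python
def group_tokens_into_patches(grid, k):
    """Aggregate a 2D grid of token embeddings into patch tokens.

    `grid[i][j]` is the embedding (list of floats) of the token at
    row i, column j. Each non-overlapping k x k block is averaged
    into one patch token (averaging is an illustrative assumption).
    With k == 1 the grid is returned unchanged in value, matching
    plain next-token prediction.
    """
    h, w = len(grid), len(grid[0])
    if h % k or w % k:
        raise ValueError("grid height and width must be divisible by k")
    dim = len(grid[0][0])
    patches = []
    for top in range(0, h, k):
        row = []
        for left in range(0, w, k):
            acc = [0.0] * dim
            for i in range(top, top + k):
                for j in range(left, left + k):
                    for d in range(dim):
                        acc[d] += grid[i][j][d]
            row.append([a / (k * k) for a in acc])
        patches.append(row)
    return patches


# A 4x4 token grid with 1-dimensional embeddings becomes a 2x2 patch grid,
# so the sequence the model sees shrinks from 16 to 4 elements.
tokens = [[[float(4 * i + j)] for j in range(4)] for i in range(4)]
patch_grid = group_tokens_into_patches(tokens, k=2)
```

A coarse-to-fine schedule in this spirit would start training with a larger k and end with k = 1.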

NeurIPS Conference 2025 Conference Paper

Consensus-Robust Transfer Attacks via Parameter and Representation Perturbations

  • Shixin Li
  • Zewei Li
  • Xiaojing Ma
  • Xiaofan Bai
  • Pingyi Hu
  • Dongmei Zhang
  • Bin Zhu

Adversarial examples crafted on one model often exhibit poor transferability to others, hindering their effectiveness in black-box settings. This limitation arises from two key factors: (i) decision-boundary variation across models and (ii) representation drift in feature space. We address these challenges through a new perspective that frames transferability for untargeted attacks as a consensus-robust optimization problem: adversarial perturbations should remain effective across a neighborhood of plausible target models. To model this uncertainty, we introduce two complementary perturbation channels: a parameter channel, capturing boundary shifts via weight perturbations, and a representation channel, addressing feature drift via stochastic blending of clean and adversarial activations. We then propose CORTA (COnsensus-Robust Transfer Attack), a lightweight attack instantiated from this robust formulation using two first-order strategies: (i) sensitivity regularization based on the squared Frobenius norm of logits’ Jacobian with respect to weights, and (ii) Monte Carlo sampling for blended feature representations. Our theoretical analysis provides a certified lower bound linking these approximations to the robust objective. Extensive experiments on CIFAR-100 and ImageNet show that CORTA significantly outperforms state-of-the-art transfer-based methods, including ensemble approaches, across CNN and Vision Transformer targets. Notably, CORTA achieves a 19.1 percentage-point gain in transfer success rate over the best prior method while using only a single surrogate model.

AAAI Conference 2025 Conference Paper

Hand1000: Generating Realistic Hands from Text with Only 1,000 Images

  • Haozhuo Zhang
  • Bin Zhu
  • Yu Cao
  • Yanbin Hao

Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model’s understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing and colors.

AAAI Conference 2025 Conference Paper

RAGG: Retrieval-Augmented Grasp Generation Model

  • Zhenhua Tang
  • Bin Zhu
  • Yanbin Hao
  • Chong-Wah Ngo
  • Richang Hong

Intent-based grasp generation inherently involves challenges such as manipulation ambiguity and modality gaps. To address these, we propose a novel Retrieval-Augmented Grasp Generation model (RAGG). Our key insight is that when humans manipulate new objects, they initially mimic the interaction patterns observed in similar objects, then progressively adjust hand-object contact. Consequently, we develop RAGG as a two-stage approach, encompassing retrieval-guided generation and structurally stable grasp refinement. In the first stage, we propose a Retrieval-Augmented Diffusion Model (ReDim), which identifies the most relevant interaction instance from a knowledge base to explicitly guide grasp generation, thereby mitigating ambiguity and bridging modality gaps to ensure semantically correct manipulation. In the second stage, we introduce a Progressive Refinement Network (PRN) with Kolmogorov-Arnold Network (KAN) layers to refine the generated coarse grasp, employing a Structural Similarity Index loss to constrain the spatial relationship between the hand and the object, thus ensuring the stability of the grasp. Extensive experiments on the OakInk and GRAB benchmarks demonstrate that RAGG achieves superior results compared to state-of-the-art approaches, indicating not only better physical feasibility and controllability but also strong generalization and interpretability for unseen objects.

ICLR Conference 2024 Conference Paper

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

  • Bin Zhu
  • Bin Lin
  • Munan Ning
  • Yang Yan
  • Jiaxi Cui
  • Hongfa Wang
  • Yatian Pang
  • Wenhao Jiang

The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N ≥ 3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining and then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with 10 Million data with Video, Infrared, Depth, Audio and their corresponding Language. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities.
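The contrastive alignment described in this abstract can be sketched as follows. This is a generic InfoNCE-style loss, not LanguageBind's actual implementation: the symmetric form, the temperature value, and all function names are assumptions. The language embeddings are treated as fixed inputs, standing in for the output of a frozen language encoder, and all vectors are assumed to be non-zero.

```python
import math


def _normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchor_embs, target_embs, temperature=0.07):
    """Mean cross-entropy of matching anchor i to target i among all targets."""
    a = [_normalize(v) for v in anchor_embs]
    t = [_normalize(v) for v in target_embs]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)


def symmetric_contrastive_loss(modality_embs, language_embs, temperature=0.07):
    """Average of modality-to-language and language-to-modality InfoNCE terms.

    `language_embs` plays the role of frozen-encoder outputs: in training,
    only the encoder producing `modality_embs` would receive gradients.
    """
    return 0.5 * (
        info_nce(modality_embs, language_embs, temperature)
        + info_nce(language_embs, modality_embs, temperature)
    )


# Toy batch: two depth (or audio, infrared) embeddings and their captions.
modality = [[1.0, 0.0], [0.0, 1.0]]
language = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(modality, language)
```

Because every modality is pulled toward the same fixed language space, two non-language modalities can end up aligned with each other without ever being paired directly.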

IJCAI Conference 2023 Conference Paper

Controlling Neural Style Transfer with Deep Reinforcement Learning

  • Chengming Feng
  • Jing Hu
  • Xin Wang
  • Shu Hu
  • Bin Zhu
  • Xi Wu
  • Hongtu Zhu
  • Siwei Lyu

Controlling the degree of stylization in Neural Style Transfer (NST) is tricky because it usually requires hand-engineering of hyper-parameters. In this paper, we propose the first deep Reinforcement Learning (RL) based architecture that splits one-step style transfer into a step-wise process for the NST task. Our RL-based method tends to preserve more details and structures of the content image in early steps, and synthesize more style patterns in later steps. This makes the method easy for users to control. Additionally, as our RL-based model performs the stylization progressively, it is lightweight and has lower computational complexity than existing one-step Deep Learning (DL) based models. Experimental results demonstrate the effectiveness and robustness of our method.

YNIMG Journal 2023 Journal Article

Information transmission velocity-based dynamic hierarchical brain networks

  • Lin Jiang
  • Fali Li
  • Zhaojin Chen
  • Bin Zhu
  • Chanlin Yi
  • Yuqin Li
  • Tao Zhang
  • Yueheng Peng

The brain functions as an accurate circuit that regulates information to be sequentially propagated and processed in a hierarchical manner. However, it is still unknown how the brain is hierarchically organized and how information is dynamically propagated during high-level cognition. In this study, we developed a new scheme for quantifying the information transmission velocity (ITV) by combining electroencephalogram (EEG) and diffusion tensor imaging (DTI), and then mapped the cortical ITV network (ITVN) to explore the information transmission mechanism of the human brain. The application in MRI-EEG data of P300 revealed bottom-up and top-down ITVN interactions subserving P300 generation, which was comprised of four hierarchical modules. Among these four modules, information exchange between visual- and attention-activated regions occurred at a high velocity, related cognitive processes could thus be efficiently accomplished due to the heavy myelination of these regions. Moreover, inter-individual variability in P300 was probed to be attributed to the difference in information transmission efficiency of the brain, which may provide new insight into the cognitive degenerations in clinical neurodegenerative disorders, such as Alzheimer's disease, from the transmission velocity perspective. Together, these findings confirm the capacity of ITV to effectively determine the efficiency of information propagation in the brain.

NeurIPS Conference 2022 Conference Paper

EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

  • Ahmad Darkhalil
  • Dandan Shan
  • Bin Zhu
  • Jian Ma
  • Amlan Kar
  • Richard Higgins
  • Sanja Fidler
  • David Fouhey

We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISOR

NeurIPS Conference 2022 Conference Paper

TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training

  • Yulong Liu
  • Guibo Zhu
  • Bin Zhu
  • Qi Song
  • Guojing Ge
  • Haoran Chen
  • GuanHui Qiao
  • Ru Peng

Vision-Language Pre-training (VLP) has been shown to be an efficient method to improve the performance of models on different vision-and-language downstream tasks. Substantial studies have shown that neural networks may be able to learn some general rules about language and visual concepts from a large-scale weakly labeled image-text dataset. However, most of the public cross-modal datasets that contain more than 100M image-text pairs are in English; there is a lack of available large-scale and high-quality Chinese VLP datasets. In this work, we propose a new framework for automatic dataset acquisition and cleaning with which we construct a new large-scale and high-quality cross-modal dataset named TaiSu, containing 166 million images and 219 million Chinese captions. Compared with the recently released Wukong dataset, our dataset is built with much stricter restrictions on the semantic correlation of image-text pairs. We also propose to combine texts collected from the web with texts generated by a pre-trained image-captioning model. To the best of our knowledge, TaiSu is currently the largest publicly accessible Chinese cross-modal dataset. Furthermore, we test our dataset on several vision-language downstream tasks. TaiSu outperforms BriVL by a large margin on the zero-shot image-text retrieval task and zero-shot image classification task. TaiSu also shows better performance than Wukong on the image-retrieval task without using image augmentation for training. Results demonstrate that TaiSu can serve as a promising VLP dataset, both for understanding and generative tasks. More information is available at https://github.com/ksOAn6g5/TaiSu.

YNICL Journal 2019 Journal Article

Characterization of white matter changes along fibers by automated fiber quantification in the early stages of Alzheimer's disease

  • Xin Zhang
  • Yu Sun
  • Weiping Li
  • Bing Liu
  • Wenbo Wu
  • Hui Zhao
  • Renyuan Liu
  • Yue Zhang

Brain white matter fiber bundles in patients with mild cognitive impairment (MCI) and Alzheimer's disease (AD) have abnormalities not usually seen in unaffected subjects. An ideal algorithm for the localization-specific properties of white matter integrity might reveal changes in tissue properties that vary along each tract, whereas previous studies only detected the mean DTI parameters of each fiber. The aim of this study was to investigate whether these abnormalities of nerve fiber tracts are localized to specific regions of the tracts or spread throughout and to analyze which of the examined fiber tracts are involved in the early stages of Alzheimer's disease. In this study, we utilized VBA, TBSS as well as AFQ together to comprehensively investigate the white matter fiber impairment in 25 CE patients, 29 MCI patients and 34 normal control (NC) subjects. Two tract profiles, fractional anisotropy (FA) and mean diffusivity (MD), were extracted to evaluate the white matter integrity at 100 locations along each of 20 fiber tracts, and we then validated the results with 27 CE patients, 21 MCI patients and 22 NC from the ADNI cohort. We also compared AFQ with VBA and TBSS in our cohort. In comparison with NC, AD patients showed widespread FA reduction in 25% (5/20) and MD increase in 65% (13/20) of the examined fiber tracts. The MCI patients showed a regional FA reduction in 5% (1/20) of the examined fiber tracts (right cingulum cingulate) and MD increase in 5% (1/20) of the examined fiber tracts (left arcuate fasciculus). Among these changed tracts, only the right cingulum cingulate showed widespread disruption of myelin and/or fiber axons in MCI and aggravated deterioration in AD, findings supported by FA/MD changes in the mean and by FA changes in pointwise methods and TBSS. The AFQ findings from the ADNI cohort showed some similarity with our cohort, especially in the pointwise comparison of MD profiles between AD and NC. Furthermore, the pattern of white matter abnormalities was different across neuronal fiber tracts; for example, the MCI and AD patients showed similar FA reduction in the middle part of the right cingulum cingulate, and the anterior part was not damaged. However, the left arcuate fasciculus showed MD elevation located at the temporal part of the fibers in the MCI patients and expanding to the temporal and middle part of the fibers in AD patients. So, AFQ may be an alternative complementary method to VBA and TBSS, and may provide new insights into white matter degeneration in MCI and its association with AD.

YNIMG Journal 2018 Journal Article

Detectability and reproducibility of the olfactory fMRI signal under the influence of magnetic susceptibility artifacts in the primary olfactory cortex

  • Jiaming Lu
  • Xin Wang
  • Zhao Qing
  • Zhu Li
  • Wen Zhang
  • Ying Liu
  • Lihua Yuan
  • Le Cheng

For human olfactory functional MRI studies, the primary olfactory cortex (POC) suffers severe magnetic susceptibility artifacts, which adversely influence the detectability and reproducibility of the olfactory fMRI data and its clinical applications. The goal of this work is to assess the impacts of the image artifacts on the detectability and reproducibility of the olfactory activation in the POC. The severity of artifacts in the POC was classified into three levels using a Subjective Artifact score (SA_score). The mean temporal signal-to-noise ratio (tSNR) of the fMRI data acquired by a given MRI sequence and olfactory activation (β value) in POC were evaluated and compared to the concurrent activations in the primary visual cortex (Brodmann area 17, BA17) by an odor-visual association paradigm using ninety-nine normal human subjects. Our study revealed that the mean tSNR in POC was above the threshold for reliable detection of the functional activation signal, and, consequently, the mean olfactory activations in the POC were not significantly different from those in BA17. The reproducibility of the activation in the POC was assessed by a random half-split stimulation of a test-retest experiment. The overlap of the activation maps for all the trials (n = 1000) in the POC was not statistically different from that observed in BA17. These results show that the detectability and reproducibility of olfactory activation in the presence of susceptibility artifacts in the POC were at a level similar to that in the visual cortex.
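For readers unfamiliar with the tSNR measure used in this abstract, the sketch below uses the common definition: the temporal mean of a voxel's signal divided by its temporal standard deviation, averaged over a region of interest. The use of the population standard deviation, the function names, and the toy values are assumptions for illustration; the abstract does not give the study's exact preprocessing or formula.

```python
import statistics


def voxel_tsnr(time_series):
    """tSNR of one voxel: temporal mean divided by temporal standard deviation.

    Uses the population standard deviation (an assumption; some pipelines
    use the sample standard deviation instead).
    """
    sd = statistics.pstdev(time_series)
    if sd == 0:
        return float("inf")  # a perfectly constant signal has no temporal noise
    return statistics.mean(time_series) / sd


def mean_roi_tsnr(roi_time_series):
    """Average the per-voxel tSNR over all voxels in a region of interest."""
    return statistics.mean(voxel_tsnr(ts) for ts in roi_time_series)


# Hypothetical ROI with two voxels, four time points each.
roi = [
    [100.0, 102.0, 98.0, 100.0],
    [200.0, 204.0, 196.0, 200.0],
]
roi_tsnr = mean_roi_tsnr(roi)
```

Signal dropout from susceptibility artifacts lowers the temporal mean, which is why tSNR is a natural check on whether activation in an affected region remains detectable.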

YNICL Journal 2018 Journal Article

Short- and long-range synergism disorders in lifelong premature ejaculation evaluated using the functional connectivity density and network property

  • Jiaming Lu
  • Xin Zhang
  • Huiting Wang
  • Zhao Qing
  • Peng Han
  • Ming Li
  • Jiadong Xia
  • Fei Chen

This study aimed to investigate brain functional connectivity in premature ejaculation (PE) patients using the functional connectivity density (FCD) and network property of resting-state functional magnetic resonance imaging. Twenty PE patients (mean age: 27.95 ± 4.52 years) and 15 normal controls (mean age: 27.87 ± 3.78 years) with no self-reported history of neurologic or psychiatric disease were enrolled in this study. International Index of Erectile Function-5 and Chinese Index of Sexual Function for Premature Ejaculation-5 questionnaires and self-reported intravaginal ejaculatory latency time (IELT) were obtained from each participant for symptom assessment. Two-sample t-tests (intergroup comparison) were applied in the short-range FCD (SFCD) analysis, long-range FCD (LFCD) analysis, region of interest–based analysis, and network topological organization analysis. Pearson correlation analysis was performed to correlate IELT with FCD or the network property. The patients with PE showed significantly decreased SFCD in the bilateral middle temporal gyrus, left orbitofrontal cortex, nucleus accumbens, fusiform, caudate, and thalamus (p < 0.05, AlphaSim-corrected). Notably, all these aforementioned brain areas are located in the dopamine pathway. In contrast, increased LFCD was observed in the left insula, Heschl's gyrus, putamen, bilateral precuneus, supplementary motor area, middle cingulate cortex, and anterior cingulate cortex in PE patients (p < 0.05, AlphaSim-corrected). In addition, the network topological analysis found reinforced network connectivity between several nodes. The degree of hub nodes increased in the patients with PE. IELT was positively correlated with SFCD and negatively correlated with LFCD or the degree of hub nodes (p < 0.05, Pearson correlation). In summary, our results are important for understanding the brain network in PE patients. The present findings indicate that PE patients have a significant synergism disorder across the region of the dopamine pathway, which implies that neuronal pathological changes might be related to changes in dopamine. The FCD and network property can serve as new disease severity biomarkers and therapeutic targets in PE.