Arrow Research

Author name cluster

Jie Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

82 papers
2 author rows

Possible papers (82)

AAAI Conference 2026 Conference Paper

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

  • Juyuan Wang
  • Rongchen Zhao
  • Wei Wei
  • Yufeng Wang
  • Mo Yu
  • Jie Zhou
  • Jin Xu
  • Liyan Xu

Narrative comprehension on long stories and novels has been a challenging domain, owing to their intricate plotlines and the entangled, often evolving relations among characters and entities. Given the LLM's diminished reasoning over extended context and its high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines, with consistent relative gains of up to 11% over the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based stateful reasoning.
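
The memory-augmented retrieval loop the abstract describes can be pictured with a minimal control-flow sketch. Everything below (the toy corpus, the keyword retriever, the stand-in answer check) is hypothetical scaffolding under assumed interfaces, not the authors' implementation.

```python
# Minimal sketch of a ComoRAG-style stateful retrieval loop (assumed
# interfaces; toy keyword retrieval stands in for a real retriever/LLM).

CORPUS = [
    "Eva adopted the stray dog in chapter 2.",
    "The dog later saves Eva's brother from the river.",
    "Eva's brother resented the dog at first.",
]

def retrieve(query, corpus, k=1):
    # Toy lexical retrieval: rank passages by word overlap with the query.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def try_answer(query, memory):
    # Stand-in for an LLM call: treat the query as "resolved" once the
    # memory pool mentions both entities it asks about.
    needed = {"dog", "brother"}
    seen = set(" ".join(memory).lower().split())
    return "relationship grounded in memory" if needed <= seen else None

def comorag(query, corpus, max_cycles=3):
    memory = []                      # global memory pool
    probe = query                    # initial probing query
    for _ in range(max_cycles):      # iterative reasoning cycles
        memory += retrieve(probe, corpus)
        answer = try_answer(query, memory)
        if answer is not None:
            return answer, memory
        # On an impasse, devise a new exploratory probe (here trivially,
        # by adding unresolved entities; a real system would ask the LLM).
        probe = query + " brother river"
    return "unresolved", memory

print(comorag("How does Eva's brother feel about the dog?", CORPUS))
```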

AAAI Conference 2026 Conference Paper

LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization

  • Junsong Li
  • Jie Zhou
  • Bihao Zhan
  • Yutao Yang
  • Qianjun Pan
  • Shilian Chen
  • Tianyu Huai
  • Xin Li

Alignment plays a crucial role in adapting Large Language Models (LLMs) to human preferences on a specific task or domain. Traditional alignment methods suffer from catastrophic forgetting, where models lose previously learned values when adapting to new preferences or domains. We introduce LifeAlign, a novel framework for lifelong alignment that enables LLMs to maintain consistent human preference alignment across sequential learning tasks without forgetting previously learned values. Our approach consists of two key innovations. First, we propose a focalized preference optimization strategy that aligns LLMs with new preferences while preventing the erosion of alignment acquired from previous tasks. Second, we develop a short-to-long memory consolidation mechanism that merges denoised short-term preference representations into stable long-term memory using intrinsic dimensionality reduction, enabling efficient storage and retrieval of alignment patterns across diverse domains. We evaluate LifeAlign across multiple sequential alignment tasks spanning different domains and preference types. Experimental results demonstrate that our method achieves superior performance in maintaining both preference alignment quality and knowledge retention compared to existing lifelong learning approaches.
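
The abstract does not spell out the focalized objective. One plausible reading, sketched below, puts a focal-loss-style modulation on top of a standard DPO-style preference loss so that already well-ordered pairs contribute less; the modulation and all tensors here are illustrative assumptions, not LifeAlign's formula.

```python
import torch
import torch.nn.functional as F

def focalized_dpo_loss(logp_chosen, logp_rejected,
                       ref_logp_chosen, ref_logp_rejected,
                       beta=0.1, gamma=2.0):
    """DPO-style preference loss with a focal-style modulation.

    The factor (1 - p)**gamma down-weights pairs the policy already ranks
    correctly, concentrating updates on hard preferences. This modulation
    is an assumed interpretation of "focalized", not the paper's rule.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    p = torch.sigmoid(margin)        # prob. the pair is ordered correctly
    return (-(1 - p) ** gamma * F.logsigmoid(margin)).mean()

# Toy usage with fake sequence log-probabilities for a batch of 4 pairs.
lp_c = torch.tensor([-5.0, -6.0, -4.0, -7.0])
lp_r = torch.tensor([-8.0, -5.0, -9.0, -7.5])
ref_c = torch.tensor([-5.5, -6.0, -4.5, -7.0])
ref_r = torch.tensor([-7.0, -5.5, -8.0, -7.2])
print(focalized_dpo_loss(lp_c, lp_r, ref_c, ref_r))
```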

JBHI Journal 2026 Journal Article

Whisperization and Masked CycleGAN-Based Framework for Electrolaryngeal Speech Enhancement

  • Jie Zhou
  • Li Wang
  • Fengji Li
  • Shaochuan Zhang
  • Fan Fan
  • Tao Liu
  • Xiaohong Chen
  • Haijun Niu

Electrolarynx (EL) provides an effective approach to voice rehabilitation for patients with phonation disorders. However, due to its reliance on an external mechanical source, EL speech suffers from limited acoustic cues, leading to degraded quality and restricting the potential of subsequent modeling and enhancement. This paper proposes a novel EL speech enhancement framework that combines whisperization with a Masked CycleGAN model. The whisperization step removes redundant constant excitation and mechanical noise, generating an intermediate speech form, whisper-like EL (W-EL) speech, whose acoustic and perceptual properties are closer to natural whisper. Subsequently, the Masked CycleGAN employs a frame-level masking strategy to guide the generator in reconstructing missing prosodic and linguistic features. Thus, we achieve a dual-stage enhancement of “redundancy removal” and “deficiency compensation.” Acoustic feature analysis demonstrates that the converted W-EL speech is more similar to normal speech in terms of spectrogram, fundamental frequency (F0) values, and F0 contours, while also compensating for the missing low-frequency energy below 500 Hz. Objective evaluations show significant improvements across multiple metrics. Subjective evaluations confirm that W-EL speech exhibits higher naturalness and intelligibility compared to original EL speech. Moreover, the combined “whisperization + voice conversion” framework further enhances perceptual quality. This study not only offers a novel pathway for EL speech enhancement but may also provide valuable insights for improving other types of pathological speech.

TMLR Journal 2025 Journal Article

A Survey on the Honesty of Large Language Models

  • Siheng Li
  • Cheng Yang
  • Taiqiang Wu
  • Chufan Shi
  • Yuji Zhang
  • Xinyu Zhu
  • Zesen Cheng
  • Deng Cai

Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don't know and to faithfully express their knowledge. Despite this promise, current LLMs still exhibit significant dishonest behaviors, such as confidently presenting wrong answers or failing to express what they know. In addition, research on the honesty of LLMs also faces challenges, including varying definitions of honesty, difficulties in distinguishing between known and unknown knowledge, and a lack of comprehensive understanding of related research. To address these issues, we provide a survey on the honesty of LLMs, covering its clarification, evaluation approaches, and strategies for improvement. Moreover, we offer insights for future research, aiming to inspire further exploration in this important area.

NeurIPS Conference 2025 Conference Paper

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

  • Hanlei Zhang
  • Zhuohang Li
  • Hua Xu
  • Yeshuang Zhu
  • Peiwu Wang
  • Haige Zhu
  • Jie Zhou
  • Jinchao Zhang

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

IROS Conference 2025 Conference Paper

Cockroach's Turning Strategy Enhanced Hexapod Robot with Flexible Torso

  • Yiming Li
  • Xingyu Li
  • Jie Zhou
  • Chenfeng Xie
  • Yao Li
  • Bing Li

The design and control of hexapod robots have become an active research field due to the ability to achieve adaptive and stable multi-terrain locomotion. However, existing hexapod robots focus on the integration of flexible pitch joints to enhance their obstacle-crossing and slope-climbing abilities, and few biological observations have been made to gain insight into the agile steering mechanisms of hexapod insects. Herein, we observed the steering movements of Madagascar cockroaches. Observations showed that cockroaches exhibited specific phase relationships in addition to the regular tripod gait pattern during steering. Moreover, we also found that a smaller steering radius resulted in a larger lateral bending angle of the thoracic segments. Inspired by this, a hexapod robot with a flexible torso (F-RHex) was designed and fabricated. Bio-inspired gait patterns were abstracted and simplified into two steering strategies: gait-based and mix-based. Compared to the purely gait-based strategy, the F-RHex testing results demonstrated a ~27.4% reduction in turning radius and a ~40% enhancement in steering velocity, implying that the mix-based strategy offers superior steering capability.

NeurIPS Conference 2025 Conference Paper

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

  • Zhengrui Ma
  • Yang Feng
  • Chenze Shao
  • Fandong Meng
  • Jie Zhou
  • Min Zhang

We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models. Demos and code are available at https://github.com/ictnlp/SLED-TTS.
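
The energy distance itself is a standard, analytically simple statistic; a minimal numpy estimator is sketched below to show why it measures the gap between simulated and target samples. How SLED wires it into autoregressive latent modeling is not reproduced here.

```python
import numpy as np

def energy_distance(x, y):
    """Sample-based (squared) energy distance between sets x, y of shape [n, d].

    ED^2 = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, estimated with all
    pairwise Euclidean distances (the diagonal zeros introduce a small
    bias that vanishes with n). Non-negative; zero iff distributions match.
    """
    def mean_pdist(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(512, 8))   # "real" latents
close = rng.normal(0.1, 1.0, size=(512, 8))    # near the target
far = rng.normal(2.0, 1.0, size=(512, 8))      # off-distribution
print(energy_distance(close, target), "<", energy_distance(far, target))
```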

AAAI Conference 2025 Conference Paper

Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection

  • Kedi Chen
  • Qin Chen
  • Jie Zhou
  • Xinqi Tao
  • Bowen Ding
  • Jingwen Xie
  • Mingchen Xie
  • Peilong Li

Large Language Models (LLMs) are prone to hallucination with non-factual or unfaithful statements, which undermines their applications in real-world scenarios. Recent research focuses on uncertainty-based hallucination detection, which utilizes the output probability of LLMs for uncertainty calculation and does not rely on external knowledge or frequent sampling from LLMs. However, most approaches merely consider the uncertainty of each independent token, while the intricate semantic relations among tokens and sentences are not well studied, which limits the detection of hallucinations that span multiple tokens and sentences in a passage. In this paper, we propose a method to enhance uncertainty modeling with a semantic graph for hallucination detection. Specifically, we first construct a semantic graph that well captures the relations among entity tokens and sentences. Then, we incorporate the relations between two entities for uncertainty propagation to enhance sentence-level hallucination detection. Given that hallucination occurs due to conflict between sentences, we further present a graph-based uncertainty calibration method that integrates the contradiction probability of a sentence with its neighbors in the semantic graph for uncertainty calculation. Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain substantial improvements of 19.78% in passage-level hallucination detection.
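
As a rough illustration of uncertainty propagation over a semantic graph, the sketch below smooths per-node (entity/sentence) uncertainty along graph edges so that conflict signal from a neighbor raises a sentence's own score. The mixing rule and weights are assumptions, not the paper's exact propagation or calibration.

```python
import numpy as np

def propagate_uncertainty(u, adj, alpha=0.5, steps=2):
    """Smooth per-node uncertainty over a semantic graph.

    u:   initial uncertainties, e.g. 1 - exp(mean token log-prob) per node
    adj: 0/1 adjacency matrix between entity/sentence nodes
    Each step mixes a node's own uncertainty with its neighbors' mean, so
    hallucination signal spanning several sentences accumulates. The
    mixing rule is illustrative, not the paper's propagation formula.
    """
    deg = adj.sum(axis=1).clip(min=1)
    for _ in range(steps):
        u = (1 - alpha) * u + alpha * (adj @ u) / deg
    return u

# Three sentence nodes; node 2 contradicts its neighbor node 1.
u0 = np.array([0.1, 0.2, 0.9])
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
print(propagate_uncertainty(u0, adj))  # node 1 inherits node 2's doubt
```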

JBHI Journal 2025 Journal Article

GSST: Multimodal Graph-SMILES Fusion with Soft SMILES Tokens for Molecular Property Prediction

  • Jie Zhou
  • Qichang Zhao
  • Pengcheng Shu
  • Xiang Tang
  • Jianxin Wang

Accurate prediction of molecular properties is essential for drug discovery. While single modal molecular representations have shown promising results, their generalizability remains limited due to data sparsity and the inherent incompleteness of single modal characterizations. To overcome these limitations, we propose GSST, a novel multimodal pretraining framework that integrates graph-based molecular representations with SMILES sequences through learnable soft SMILES tokens. At the core of GSST is the G2S-Former module, which injects topological information from the graph-based representation into the soft SMILES tokens to enable effective cross-modal interaction while preserving modality-specific features. Extensive experiments on the MoleculeNet and MoleculeACE benchmarks demonstrate that GSST consistently outperforms state-of-the-art methods in molecular property prediction and activity cliff assessment. These results underscore the importance of effective multimodal alignment in capturing shared molecular patterns and alleviating the challenges posed by limited labeled data. GSST represents a scalable and high-throughput approach with significant potential to advance drug discovery.

AAAI Conference 2025 Conference Paper

Learning with Open-world Noisy Data via Class-independent Margin in Dual Representation Space

  • Linchao Pan
  • Can Gao
  • Jie Zhou
  • Jinbao Wang

Learning with Noisy Labels (LNL) aims to improve the model generalization when facing data with noisy labels, and existing methods generally assume that noisy labels come from known classes, called closed-set noise. However, in real-world scenarios, noisy labels from similar unknown classes, i.e., open-set noise, may occur during the training and inference stages. Such open-world noisy labels may significantly impact the performance of LNL methods. In this study, we propose a novel dual-space joint learning method to robustly handle the open-world noise. To mitigate model overfitting on closed-set and open-set noises, a dual representation space is constructed by two networks. One is a projection network that learns shared representations in the prototype space, while the other is a One-Vs-All (OVA) network that makes predictions using unique semantic representations in the class-independent space. Then, bi-level contrastive learning and consistency regularization are introduced in two spaces to enhance the detection capability for data with unknown classes. To benefit from the memorization effects across different types of samples, class-independent margin criteria are designed for sample identification, which selects clean samples, weights closed-set noise, and filters open-set noise effectively. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods and achieves an average accuracy improvement of 4.55% and an AUROC improvement of 6.17% on CIFAR80N.
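
The class-independent margin can be illustrated directly from the One-Vs-All head outputs: compare the positive probability of the labeled class against the best competing head. The thresholding and weighting built on this quantity are the paper's; the margin below is only an assumed illustration.

```python
import numpy as np

def ova_margin(pos_probs, labels):
    """Class-independent margin from One-Vs-All positive probabilities.

    pos_probs: [n, C] sigmoid outputs; pos_probs[i, c] = P(sample i
               belongs to class c) from C independent binary heads.
    Clean samples give a large positive margin; closed-set noise gives a
    negative margin (another head wins); open-set noise shows no
    confident head at all, so both the margin and the labeled-class
    confidence stay low.
    """
    n = len(labels)
    own = pos_probs[np.arange(n), labels]
    rest = pos_probs.copy()
    rest[np.arange(n), labels] = -np.inf
    return own - rest.max(axis=1)

probs = np.array([[0.9, 0.1, 0.2],   # clean: labeled 0, head 0 confident
                  [0.4, 0.5, 0.1],   # closed-set noise: head 1 wins
                  [0.1, 0.1, 0.1]])  # open-set noise: no head fires
labels = np.array([0, 0, 0])
print(ova_margin(probs, labels))     # [ 0.7, -0.1,  0.0 ]
```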

IJCAI Conference 2025 Conference Paper

MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection

  • Jiayi Cheng
  • Can Gao
  • Jie Zhou
  • Jiajun Wen
  • Tao Dai
  • Jinbao Wang

3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. This study presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the 3D AD task. MC3D-AD is evaluated on two publicly available datasets, Real3D-AD and Anomaly-ShapeNet, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvements in object-level AUROC on Real3D-AD and Anomaly-ShapeNet, respectively. The code is available at https://github.com/iCAN-SZU/MC3D-AD.

IJCAI Conference 2025 Conference Paper

MedDiT: A Knowledge-Controlled Diffusion Transformer Framework for Dynamic Medical Image Generation in Virtual Simulated Patient

  • Yanzeng Li
  • Cheng Zeng
  • Jinchao Zhang
  • Jie Zhou
  • Lei Zou

Medical education relies heavily on Simulated Patients (SPs) to provide a safe environment for students to practice clinical skills, including medical image analysis. However, the high cost of recruiting qualified SPs and the lack of diverse medical imaging datasets have presented significant challenges. To address these issues, this paper introduces MedDiT, a novel knowledge-controlled conversational framework that can dynamically generate plausible medical images aligned with simulated patient symptoms, enabling diverse diagnostic skill training. Specifically, MedDiT integrates various patient Knowledge Graphs (KGs), which describe the attributes and symptoms of patients, to dynamically prompt the behavior of Large Language Models (LLMs) and control the patient characteristics, mitigating hallucination during medical conversations. Additionally, a well-tuned Diffusion Transformer (DiT) model is incorporated to generate medical images according to the specified patient attributes in the KG. In this paper, we present the capabilities of MedDiT through a practical demonstration, showcasing its ability to act in diverse simulated patient cases and generate the corresponding medical images. This can provide an abundant and interactive learning experience for students, advancing medical education by offering an immersive simulation platform for future healthcare professionals. The work sheds light on the feasibility of incorporating advanced technologies like LLMs, KGs, and DiT in educational applications, highlighting their potential to address the challenges faced in simulated patient-based medical education.

NeurIPS Conference 2025 Conference Paper

Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

  • Yuqi Wu
  • Wenzhao Zheng
  • Jie Zhou
  • Jiwen Lu

Dense 3D scene reconstruction from ordered sequences or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code: https://github.com/YkiWu/Point3R.

TMLR Journal 2024 Journal Article

Decentralized Decoupled Training for Federated Long-Tailed Learning

  • Wenkai Yang
  • Deli Chen
  • Hao Zhou
  • Fandong Meng
  • Jie Zhou
  • Xu Sun

In the real world, the data samples often follow a long-tailed distribution, which poses a great challenge for Federated Learning (FL). That is, when the data is decentralized and long-tailed, FL may produce a poorly-behaved global model that is severely biased towards the head classes with the majority of the training samples. To settle this issue, decoupled training has recently been introduced to FL. Decoupled training aims to re-balance the biased classifier after the normal instance-balanced training, and has achieved promising results in centralized long-tailed learning. The current study directly adopts the decoupled training idea on the server side by re-training the classifier on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. Unfortunately, this practice restricts the capacity of decoupled training in federated long-tailed learning, as the low-quality pseudo features lead to a sub-optimal classifier. In this work, motivated by the distributed characteristic of FL, we propose a decentralized decoupled training mechanism by leveraging the abundant real data stored locally. Specifically, we integrate the local real data with the global gradient prototypes to form the local balanced datasets, and thus re-balance the classifier during the local training. Furthermore, we introduce a supplementary classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings. Our code is available at https://github.com/keven980716/Federated_Learning_Experiments.

NeurIPS Conference 2024 Conference Paper

FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner

  • Wenliang Zhao
  • Minglei Shi
  • Xumin Yu
  • Jie Zhou
  • Jiwen Lu

Building on the success of diffusion models in visual generation, flow-based models reemerge as another prominent family of generative models that have achieved competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models, for which fast samplers are well developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor's outputs in flow-based models become stable during sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques, including a pseudo corrector and sample-aware compilation, to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied to various tasks such as image editing, inpainting, etc. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1%~58.3% on class-conditional generation and 29.8%~38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet at 100 ms/img and an FID of 3.93 at 38 ms/img, achieving real-time image generation and establishing a new state of the art. Code is available at https://github.com/shiml20/FlowTurbo.
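
The core observation, that the predicted velocity stabilizes during sampling, suggests calling the heavy predictor sparsely and patching its cached output with a lightweight refiner in between. The sketch below shows that schedule on a toy flow; the module shapes, the residual form of the refiner, and the every-k schedule are assumptions, not FlowTurbo's exact design.

```python
import torch
import torch.nn as nn

class HeavyVelocity(nn.Module):      # stand-in for the full flow model
    def __init__(self, d):
        super().__init__()
        self.f = nn.Linear(d + 1, d)
    def forward(self, x, t):
        return self.f(torch.cat([x, t.expand(len(x), 1)], dim=-1))

class LightRefiner(nn.Module):       # much smaller residual predictor
    def __init__(self, d):
        super().__init__()
        self.f = nn.Linear(d + 1, d)
    def forward(self, x, t):
        return self.f(torch.cat([x, t.expand(len(x), 1)], dim=-1))

@torch.no_grad()
def sample(x, heavy, refiner, steps=8, heavy_every=4):
    """Euler sampling of dx/dt = v(x, t), calling the heavy model on only
    some steps and patching the cached velocity with the cheap refiner in
    between. Schedule and refiner form are illustrative assumptions."""
    dt = 1.0 / steps
    v = None
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        if i % heavy_every == 0 or v is None:
            v = heavy(x, t)          # full velocity evaluation
        else:
            v = v + refiner(x, t)    # lightweight residual update
        x = x + dt * v               # Euler step along the flow
    return x

d = 4
x0 = torch.randn(2, d)
print(sample(x0, HeavyVelocity(d), LightRefiner(d)).shape)
```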

AAAI Conference 2024 Conference Paper

Generative Multi-Modal Knowledge Retrieval with Large Language Models

  • Xinwei Long
  • Jiali Zeng
  • Fandong Meng
  • Zhiyuan Ma
  • Kaiyan Zhang
  • Bowen Zhou
  • Jie Zhou

Knowledge retrieval with multi-modal queries plays a crucial role in supporting knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of their effectiveness and training efficiency, especially when it comes to training and integrating multiple retrievers to handle multi-modal queries. In this paper, we propose an innovative end-to-end generative framework for multi-modal knowledge retrieval. Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases, even when trained with limited data. We retrieve knowledge via a two-step process: 1) generating knowledge clues related to the queries, and 2) obtaining the relevant document by searching databases using the knowledge clue. In particular, we first introduce an object-aware prefix-tuning technique to guide multi-grained visual learning. Then, we align multi-grained visual features into the textual feature space of the LLM, employing the LLM to capture cross-modal interactions. Subsequently, we construct instruction data with a unified format for model training. Finally, we propose the knowledge-guided generation strategy to impose prior constraints in the decoding steps, thereby promoting the generation of distinctive knowledge clues. Through experiments conducted on three benchmarks, we demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.

NeurIPS Conference 2024 Conference Paper

Learning 1D Causal Visual Representation with De-focus Attention Networks

  • Chenxin Tao
  • Xizhou Zhu
  • Shiqian Su
  • Lewei Lu
  • Changyao Tian
  • Xuan Luo
  • Gao Huang
  • Hongsheng Li

Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in constructing unified multi-modal models. This paper explores the feasibility of representing images using 1D causal modeling. We identify an "over-focus" issue in existing 1D causal vision models, where attention overly concentrates on a small proportion of visual tokens. The issue of "over-focus" hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization. To address this, we propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns. During training, large and scheduled drop path rates, and an auxiliary loss on globally pooled features for global understanding tasks are introduced. These two strategies encourage the model to attend to a broader range of tokens and enhance network optimization. Extensive experiments validate the efficacy of our approach, demonstrating that 1D causal visual representation can perform comparably to 2D non-causal representation in tasks such as global perception, dense prediction, and multi-modal understanding. Code shall be released.

AAAI Conference 2024 Conference Paper

Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Grounding

  • Wenjia Geng
  • Yong Liu
  • Lei Chen
  • Sujia Wang
  • Jie Zhou
  • Yansong Tang

Weakly Supervised temporal Article Grounding (WSAG) is a challenging and practical task in video understanding. Specifically, given a video and a relevant article, whose sentences are at different semantic scales, WSAG aims to localize corresponding video segments for all “groundable” sentences. Compared to other grounding tasks, e.g., localizing one target segment with respect to a given sentence query, WSAG confronts an essential obstacle rooted in the intricate multi-scale information inherent within both textual and visual modalities. Existing methods overlook the modeling and alignment of such structured information present in multi-scale video segments and hierarchical textual content. To this end, we propose a Multi-Scale Video-Text Correspondence Learning (MVTCL) framework, which enhances the grounding performance in complex scenes by modeling multi-scale semantic correspondence both within and between modalities. Specifically, MVTCL initially aggregates video content spanning distinct temporal scales and leverages hierarchical textual relationships in both temporal and semantic dimensions via a semantic calibration module. Then a multi-scale contrastive learning module is introduced to generate more discriminative representations by selecting typical contexts and performing inter-video contrastive learning. Through the multi-scale semantic calibration architecture and supervision design, our method achieves new state-of-the-art performance on existing WSAG benchmarks.

AAAI Conference 2024 Conference Paper

MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA

  • Lang Yu
  • Qin Chen
  • Jie Zhou
  • Liang He

Large language models (LLMs) have shown great success in various Natural Language Processing (NLP) tasks, while they still need updates after deployment to fix errors or keep pace with the changing knowledge in the world. Researchers formulate this problem as Model Editing and have developed various editors focusing on different axes of editing properties. However, current editors can hardly support all properties and rely on heavy computational resources. In this paper, we propose a plug-in Model Editing method based on neuron-indexed dynamic LoRA (MELO), which alters the behavior of language models by dynamically activating certain LoRA blocks according to the index built in an inner vector database. Our method satisfies various editing properties with high efficiency and can be easily integrated into multiple LLM backbones. Experimental results show that our proposed MELO achieves state-of-the-art editing performance on three sequential editing tasks (document classification, question answering and hallucination correction), while requiring the fewest trainable parameters and the lowest computational cost.
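
A minimal picture of neuron-indexed dynamic LoRA: a frozen linear layer carries several LoRA blocks, and a small vector database of edit keys routes each input to the matching block (or to the unmodified weights). The cosine-threshold routing below is an assumed stand-in for MELO's inner index, not its actual mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLoRALinear(nn.Module):
    """Frozen linear layer plus several LoRA blocks, one per stored edit.

    A tiny 'vector database' of edit keys decides which block (if any) to
    activate for an input; unmatched inputs use the original weights only.
    The cosine-threshold routing is an illustrative assumption.
    """
    def __init__(self, d, rank=4, n_edits=3, tau=0.8):
        super().__init__()
        self.base = nn.Linear(d, d)
        for p in self.base.parameters():
            p.requires_grad = False               # base model stays frozen
        self.A = nn.Parameter(torch.randn(n_edits, rank, d) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_edits, d, rank))
        self.keys = nn.Parameter(torch.randn(n_edits, d), requires_grad=False)
        self.tau = tau

    def forward(self, x):                         # x: [batch, d]
        y = self.base(x)
        sims = F.cosine_similarity(x[:, None, :], self.keys[None], dim=-1)
        score, idx = sims.max(dim=-1)             # nearest edit key
        for i in range(len(x)):
            if score[i] > self.tau:               # activate matching block
                j = idx[i]
                y[i] = y[i] + self.B[j] @ (self.A[j] @ x[i])
        return y

layer = DynamicLoRALinear(d=8)
print(layer(torch.randn(2, 8)).shape)
```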

NeurIPS Conference 2024 Conference Paper

NeuroGauss4D-PCI: 4D Neural Fields and Gaussian Deformation Fields for Point Cloud Interpolation

  • Chaokang Jiang
  • Dalong Du
  • Jiuming Liu
  • Siting Zhu
  • Zhenqiang Liu
  • Zhuang Ma
  • Zhujin Liang
  • Jie Zhou

Point Cloud Interpolation confronts challenges from point sparsity, complex spatiotemporal dynamics, and the difficulty of deriving complete 3D point clouds from sparse temporal information. This paper presents NeuroGauss4D-PCI, which excels at modeling complex non-rigid deformations across varied dynamic scenes. The method begins with an iterative Gaussian cloud soft clustering module, offering structured temporal point cloud representations. The proposed temporal radial basis function Gaussian residual utilizes Gaussian parameter interpolation over time, enabling smooth parameter transitions and capturing temporal residuals of Gaussian distributions. Additionally, a 4D Gaussian deformation field tracks the evolution of these parameters, creating continuous spatiotemporal deformation fields. A 4D neural field transforms low-dimensional spatiotemporal coordinates (x, y, z, t) into a high-dimensional latent space. Finally, we adaptively and efficiently fuse the latent features from neural fields and the geometric features from Gaussian deformation fields. NeuroGauss4D-PCI outperforms existing methods in point cloud frame interpolation, delivering leading performance on both object-level (DHB) and large-scale autonomous driving datasets (NL-Drive), with scalability to auto-labeling and point cloud densification tasks.

NeurIPS Conference 2024 Conference Paper

Q-VLM: Post-training Quantization for Large Vision-Language Models

  • Changyuan Wang
  • Ziwei Wang
  • Xiuwei Xu
  • Yansong Tang
  • Jie Zhou
  • Jiwen Lu

In this paper, we propose a post-training quantization framework for large vision-language models (LVLMs) for efficient multi-modal inference. Conventional quantization methods sequentially search the layer-wise rounding functions by minimizing activation discretization errors, which fails to acquire the optimal quantization strategy without considering cross-layer dependency. On the contrary, we mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy searching with low search cost. Specifically, we observe a strong correlation between the activation entropy and the cross-layer dependency concerning output discretization errors. Therefore, we employ entropy as the proxy to partition blocks optimally, aiming to achieve a satisfying trade-off between discretization error and search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency for fine-grained decomposition of the search space, so that the search cost is further reduced without harming quantization accuracy. Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks.

NeurIPS Conference 2024 Conference Paper

SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

  • Hang Yin
  • Xiuwei Xu
  • Zhenyu Wu
  • Jie Zhou
  • Jiwen Lu

In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks enough scene context for in-depth reasoning. To better preserve the information of the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms with an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt to help the LLM reason about the goal location according to scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism to empower the object navigation framework with the ability to correct perception errors. We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while keeping the decision process explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark. Code of this project will be released in the final version.

AAAI Conference 2024 Conference Paper

Teaching Large Language Models to Translate with Comparison

  • Jiali Zeng
  • Fandong Meng
  • Yongjing Yin
  • Jie Zhou

Open-sourced large language models (LLMs) have demonstrated remarkable efficacy in various tasks with instruction tuning. However, these models can sometimes struggle with tasks that require more specialized knowledge such as translation. One possible reason for such deficiency is that instruction tuning aims to generate fluent and coherent text that continues from a given instruction without being constrained by any task-specific requirements. Moreover, it can be more challenging to tune smaller LLMs with lower-quality training data. To address this issue, we propose a novel framework using examples in comparison to teach LLMs to learn translation. Our approach involves output comparison and preference comparison, presenting the model with carefully designed examples of correct and incorrect translations and an additional preference loss for better regularization. Empirical evaluation on four language directions of WMT2022 and FLORES-200 benchmarks shows the superiority of our proposed method over existing methods. Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations. Please refer to Github for more details: https://github.com/lemon0830/TIM.

AAAI Conference 2024 Conference Paper

Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models

  • Kun Zhang
  • Jiali Zeng
  • Fandong Meng
  • Yuanzhuo Wang
  • Shiqi Sun
  • Long Bai
  • Huawei Shen
  • Jie Zhou

Large language models (LLMs) have recently demonstrated remarkable performance across various Natural Language Processing tasks. In the field of multi-hop reasoning, the Chain-of-Thought (CoT) prompting method has emerged as a paradigm, using curated stepwise reasoning demonstrations to enhance LLMs' ability to reason and produce coherent rational pathways. To ensure the accuracy, reliability, and traceability of the generated answers, many studies have incorporated information retrieval (IR) to provide LLMs with external knowledge. However, existing CoT-with-IR methods decompose questions into sub-questions based on a single compositionality type, which limits their effectiveness for questions involving multiple compositionality types. Additionally, these methods suffer from inefficient retrieval, as complex questions often contain abundant information, leading to the retrieval of irrelevant information inconsistent with the query's intent. In this work, we propose a novel question decomposition framework called TRQA for multi-hop question answering, which addresses these limitations. Our framework introduces a reasoning tree (RT) to represent the structure of complex questions. It consists of four components: the Reasoning Tree Constructor (RTC), the Question Generator (QG), the Retrieval and LLM Interaction Module (RAIL), and the Answer Aggregation Module (AAM). Specifically, the RTC predicts diverse sub-question structures to construct the reasoning tree, allowing a more comprehensive representation of complex questions. The QG generates sub-questions for leaf nodes in the reasoning tree, and we explore two methods for QG: prompt-based and T5-based approaches. The IR module retrieves documents aligned with the sub-questions, while the LLM formulates answers based on the retrieved information. Finally, the AAM aggregates answers along the reasoning tree, producing a definitive response from bottom to top.

NeurIPS Conference 2024 Conference Paper

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

  • Chenyu Yang
  • Xizhou Zhu
  • Jinguo Zhu
  • Weijie Su
  • Junjie Wang
  • Xuan Dong
  • Wenhai Wang
  • Lewei Lu

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representations from scratch, showcasing the potential of vision model pre-training with interleaved image-text data.

NeurIPS Conference 2024 Conference Paper

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

  • Wenkai Yang
  • Xiaohan Bi
  • Yankai Lin
  • Sishuo Chen
  • Jie Zhou
  • Xu Sun

Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping. It is crucial to ensure the reliability and security of LLM-based agents in these applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks including web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.

NeurIPS Conference 2024 Conference Paper

XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

  • Ziyi Wang
  • Yanbo Wang
  • Xumin Yu
  • Jie Zhou
  • Jiwen Lu

Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.

AAAI Conference 2023 Conference Paper

A Simple Baseline for Multi-Camera 3D Object Detection

  • Yunpeng Zhang
  • Wenzhao Zheng
  • Zheng Zhu
  • Guan Huang
  • Jiwen Lu
  • Jie Zhou

3D object detection with surrounding cameras has been a promising direction for autonomous driving. In this paper, we present SimMOD, a Simple baseline for Multi-camera Object Detection, to solve the problem. To incorporate multi-view information as well as build upon previous efforts on monocular 3D object detection, the framework is built on sample-wise object proposals and designed to work in a two-stage manner. First, we extract multi-scale features and generate the perspective object proposals on each monocular image. Second, the multi-view proposals are aggregated and then iteratively refined with multi-view and multi-scale visual features in the DETR3D style. The refined proposals are end-to-end decoded into the detection results. To further boost the performance, we incorporate the auxiliary branches alongside the proposal generation to enhance the feature learning. Also, we design the methods of target filtering and teacher forcing to promote the consistency of two-stage training. We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD and achieve competitive performance. Code will be available at https://github.com/zhangyp15/SimMOD.

AAAI Conference 2023 Short Paper

Feature Decomposition for Reducing Negative Transfer: A Novel Multi-Task Learning Method for Recommender System (Student Abstract)

  • Jie Zhou
  • Qian Yu
  • Chuan Luo
  • Jing Zhang

We propose a novel multi-task learning method termed Feature Decomposition Network (FDN). The key idea of the proposed FDN is to reduce the phenomenon of feature redundancy by explicitly decomposing features into task-specific features and task-shared features with carefully designed constraints. Experimental results show that our proposed FDN can outperform the state-of-the-art (SOTA) methods by a noticeable margin on Ali-CCP.
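
A minimal sketch of the decomposition idea: two heads split an input embedding into task-shared and task-specific features, with an explicit penalty keeping them from encoding the same directions. The orthogonality-style penalty below is an assumed instantiation of the "carefully designed constraints", not necessarily FDN's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecomposition(nn.Module):
    """Split an input embedding into task-shared and task-specific parts.

    The penalty discourages the two parts from encoding the same
    directions, one common way to reduce feature redundancy (the exact
    constraints are the paper's; this penalty is an assumption).
    """
    def __init__(self, d_in, d_feat):
        super().__init__()
        self.shared = nn.Linear(d_in, d_feat)
        self.specific = nn.Linear(d_in, d_feat)

    def forward(self, x):
        s, p = self.shared(x), self.specific(x)
        # Mean absolute cosine similarity between the two feature sets.
        ortho = F.cosine_similarity(s, p, dim=-1).abs().mean()
        return s, p, ortho

dec = FeatureDecomposition(16, 8)
s, p, penalty = dec(torch.randn(4, 16))
# In practice: per-task losses on heads over [s; p], plus lambda * penalty.
print(s.shape, p.shape, float(penalty))
```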

NeurIPS Conference 2023 Conference Paper

Fed-FA: Theoretically Modeling Client Data Divergence for Federated Language Backdoor Defense

  • Zhiyuan Zhang
  • Deli Chen
  • Hao Zhou
  • Fandong Meng
  • Jie Zhou
  • Xu Sun

Federated learning algorithms enable neural network models to be trained across multiple decentralized edge devices without sharing private data. However, they are susceptible to backdoor attacks launched by malicious clients. Existing robust federated aggregation algorithms heuristically detect and exclude suspicious clients based on their parameter distances, but they are ineffective on Natural Language Processing (NLP) tasks. The main reason is that, although text backdoor patterns are obvious at the underlying dataset level, they are usually hidden at the parameter level, since injecting backdoors into texts with discrete feature space has less impact on the statistics of the model parameters. To settle this issue, we propose to identify backdoor clients by explicitly modeling the data divergence among clients in federated NLP systems. Through theoretical analysis, we derive the f-divergence indicator to estimate the client data divergence with aggregation updates and Hessians. Furthermore, we devise a dataset synthesization method with a Hessian reassignment mechanism guided by the diffusion theory to address the key challenge of inaccessible datasets in calculating clients' data Hessians. We then present the novel Federated F-Divergence-Based Aggregation (Fed-FA) algorithm, which leverages the f-divergence indicator to detect and discard suspicious clients. Extensive empirical results show that Fed-FA outperforms all the parameter distance-based methods in defending against backdoor attacks among various natural language backdoor attack scenarios.

IJCAI Conference 2023 Conference Paper

Humming2Music: Being A Composer As Long As You Can Humming

  • Yao Qiu
  • Jinchao Zhang
  • Huiying Ren
  • Yong Shan
  • Jie Zhou

Creating a piece of music is difficult for people who have never been trained to compose. We present an automatic music generation system to lower the threshold of creating music. The system takes the user's humming as input and creates full music based on the humming melody. The system consists of five modules: 1) humming transcription, 2) melody generation, 3) broken chord generation, 4) accompaniment generation, and 5) audio synthesis. The first module transcribes the user's humming audio to a score, and then the melody generation module composes a complete melody based on the user's humming melody. After that, the third module will generate a broken chord track to accompany the full melody, and the fourth module will create more accompanying tracks. Finally, the audio synthesis module mixes all the tracks to generate the music. User experiments show that our system can generate high-quality music with natural expression from the user's humming input.

IJCAI Conference 2023 Conference Paper

LingGe: An Automatic Ancient Chinese Poem-to-Song Generation System

  • Yong Shan
  • Jinchao Zhang
  • Huiying Ren
  • Yao Qiu
  • Jie Zhou

This paper presents a novel system, named LingGe ("伶歌" in Chinese), to generate songs for ancient Chinese poems automatically. LingGe takes the poem as the lyric, composes music conditioned on the lyric, and finally outputs a full song including the singing and the accompaniment. It consists of four modules: rhythm recognition, melody generation, accompaniment generation, and audio synthesis. Firstly, the rhythm recognition module analyzes the song structure and rhythm according to the poem. Secondly, the melody generation module assembles the rhythm into the template and then generates the melody. Thirdly, the accompaniment generation module predicts the accompaniment in harmony with the melody. Finally, the audio synthesis module generates singing and accompaniment audio and then mixes them to obtain songs. The results show that LingGe can generate high-quality and expressive songs for ancient Chinese poems, both in harmony and rhythm.

NeurIPS Conference 2023 Conference Paper

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

  • Yinan Liang
  • Ziwei Wang
  • Xiuwei Xu
  • Yansong Tang
  • Jie Zhou
  • Jiwen Lu

Due to the high price and heavy energy consumption of GPUs, deploying deep models on IoT devices such as microcontrollers makes significant contributions to ecological AI. Conventional methods successfully enable convolutional neural network inference of high-resolution images on microcontrollers, while the framework for vision transformers, which achieve state-of-the-art performance in many vision applications, still remains unexplored. In this paper, we propose a hardware-algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, where we jointly design the transformer architecture and construct the inference operator library to fit the memory resource constraint. More specifically, we generalize one-shot network architecture search (NAS) to discover the optimal architecture with the highest task performance given the memory budget of the microcontrollers, where we enlarge the existing search space of vision transformers by considering low-rank decomposition dimensions and patch resolution for memory reduction. For the construction of the inference operator library of vision transformers, we schedule the memory buffer during inference through operator integration, patch embedding decomposition, and token overwriting, allowing the memory buffer to be fully utilized to adapt to the forward pass of the vision transformer. Experimental results demonstrate that our MCUFormer achieves 73.62% top-1 accuracy on ImageNet for image classification with 320KB memory on an STM32F746 microcontroller. Code is available at https://github.com/liangyn22/MCUFormer.

AAAI Conference 2023 Conference Paper

Rephrasing the Reference for Non-autoregressive Machine Translation

  • Chenze Shao
  • Jinchao Zhang
  • Jie Zhou
  • Yang Feng

Non-autoregressive neural machine translation (NAT) models suffer from the multi-modality problem that there may exist multiple possible translations of a source sentence, so the reference sentence may be inappropriate for the training when the NAT output is closer to other translations. In response to this problem, we introduce a rephraser to provide a better training target for NAT by rephrasing the reference sentence according to the NAT output. As we train NAT based on the rephraser output rather than the reference sentence, the rephraser output should fit well with the NAT output and not deviate too far from the reference, which can be quantified as reward functions and optimized by reinforcement learning. Experiments on major WMT benchmarks and NAT baselines show that our approach consistently improves the translation quality of NAT. Specifically, our best variant achieves comparable performance to the autoregressive Transformer, while being 14.7 times more efficient in inference.

ICLR Conference 2023 Conference Paper

Transformer-Patcher: One Mistake Worth One Neuron

  • Zeyu Huang
  • Yikang Shen
  • Xiaofeng Zhang 0004
  • Jie Zhou
  • Wenge Rong
  • Zhang Xiong 0001

Large Transformer-based Pretrained Language Models (PLMs) dominate almost all Natural Language Processing (NLP) tasks. Nevertheless, they still make mistakes from time to time. For a model deployed in an industrial environment, fixing these mistakes quickly and robustly is vital to improving user experiences. Previous works formalize such problems as Model Editing (ME) and mostly focus on fixing one mistake. However, the one-mistake-fixing scenario is not an accurate abstraction of the real-world challenge. In the deployment of AI services, there are ever-emerging mistakes, and the same mistake may recur if not corrected in time. Thus a preferable solution is to rectify mistakes continually, as soon as they appear. Therefore, we extend the existing ME into Sequential Model Editing (SME) to help develop more practical editing methods. Our study shows that current ME methods either fail to make a sequence of edits or fail to remember previous edits. We then introduce Transformer-Patcher, a novel model editor that can shift the behavior of transformer-based models by simply adding and training a few neurons in the last Feed-Forward Network layer. Experimental results on both classification and generation tasks show that Transformer-Patcher can successively correct up to thousands of errors (Reliability) and generalize to their equivalent inputs (Generality) while retaining the model's accuracy on irrelevant inputs (Locality). Our method outperforms previous fine-tuning and HyperNetwork-based methods and achieves state-of-the-art performance for Sequential Model Editing (SME).
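
Structurally, one patch is just one extra key-value neuron appended to the last feed-forward layer; the sketch below shows that edit. How each key and value is trained to fire only on the mistaken input (and its equivalents) is the paper's procedure and is not reproduced here.

```python
import torch
import torch.nn as nn

class PatchableFFN(nn.Module):
    """Two-layer transformer FFN that can grow one neuron per correction.

    Each patch adds a key vector (new row of W1) and a value vector (new
    column of W2); when an input matches the key, the value steers the
    output. Only the structural edit is shown; training the patch to
    activate selectively is the paper's procedure.
    """
    def __init__(self, d, hidden):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(hidden, d) * 0.02)
        self.w2 = nn.Parameter(torch.randn(d, hidden) * 0.02)

    def add_patch_neuron(self, key, value):
        self.w1 = nn.Parameter(torch.cat([self.w1, key[None, :]], dim=0))
        self.w2 = nn.Parameter(torch.cat([self.w2, value[:, None]], dim=1))

    def forward(self, x):
        return torch.relu(x @ self.w1.T) @ self.w2.T

ffn = PatchableFFN(d=8, hidden=32)
x_bug = torch.randn(8)                     # input the model gets wrong
ffn.add_patch_neuron(key=x_bug, value=torch.randn(8))
print(ffn(x_bug[None, :]).shape)           # patched forward, still [1, 8]
```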

NeurIPS Conference 2023 Conference Paper

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

  • Wenliang Zhao
  • Lujia Bai
  • Yongming Rao
  • Jie Zhou
  • Jiwen Lu

Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., <10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256×256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC.
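
The predict-then-correct pattern that UniPC generalizes can be seen on a toy ODE: an Euler predictor followed by a trapezoidal corrector lifts the step from first to second order. Note that this naive corrector spends one extra function evaluation, which UniC specifically avoids by reusing existing model outputs; the sketch shows only the generic skeleton, not UniPC's analytical updates.

```python
import numpy as np

def f(x, t):
    # Toy ODE standing in for a diffusion probability-flow field.
    return -x / (1.0 + t)

def predictor_corrector(x0, t0=0.0, t1=1.0, steps=10):
    """Euler predictor + trapezoidal corrector, the generic pattern UniPC
    generalizes (UniPC's actual update is higher-order and derived for
    DPM dynamics; this is only the predict-then-correct skeleton)."""
    x, t = x0, t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        fx = f(x, t)
        x_pred = x + h * fx                        # predictor (order 1)
        # Corrector averages the slopes at both endpoints (order 2).
        x = x + 0.5 * h * (fx + f(x_pred, t + h))
        t += h
    return x

x0 = np.array([1.0, -2.0])
print(predictor_corrector(x0))  # close to x0 / 2, the exact value at t=1
```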

NeurIPS Conference 2023 Conference Paper

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

  • Wenhai Wang
  • Zhe Chen
  • Xiaokang Chen
  • Jiannan Wu
  • Xizhou Zhu
  • Gang Zeng
  • Ping Luo
  • Tong Lu

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The code shall be released.

TMLR Journal 2023 Journal Article

When to Trust Aggregated Gradients: Addressing Negative Client Sampling in Federated Learning

  • Wenkai Yang
  • Yankai Lin
  • Guangxiang Zhao
  • Peng Li
  • Jie Zhou
  • Xu Sun

Federated Learning has become a widely-used framework which allows learning a global model on decentralized local datasets while protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in this optimization difficulty. We find that negative client sampling causes the merged data distribution of the currently sampled clients to be heavily inconsistent with that of all available clients, which in turn makes the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism that adaptively adjusts the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of the currently sampled clients and that of all available clients. Specifically, we make theoretical deductions to find a meaningful and robust indicator that is positively related to the optimal server learning rate, which minimizes the Euclidean distance between the aggregated gradient given the currently sampled clients and the gradient that would be obtained if all clients participated in the current round. We show that our proposed indicator can effectively reflect the merged data distribution of sampled clients, and we utilize it for the server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the great effectiveness of our method in various settings. Our code is available at https://github.com/lancopku/FedGLAD.
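
A minimal sketch of the server-side adaptation step, assuming a crude consistency proxy (the norm of the averaged gradient over the average gradient norm) in place of the paper's theoretically derived indicator:

```python
import torch

def adapted_server_lr(client_grads, base_lr):
    """Scale the server learning rate by a crude consistency proxy.

    client_grads: list of flattened gradient tensors from sampled clients.
    The proxy (norm of mean / mean of norms) is an illustrative stand-in
    for the paper's derived indicator, not the authors' formula.
    """
    stacked = torch.stack(client_grads)               # (num_clients, dim)
    consistency = stacked.mean(0).norm() / stacked.norm(dim=1).mean()
    return base_lr * consistency.item()               # shrinks lr when clients disagree
```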

NeurIPS Conference 2022 Conference Paper

A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models

  • Yuanxin Liu
  • Fandong Meng
  • Zheng Lin
  • Jiangnan Li
  • Peng Fu
  • Yanan Cao
  • Weiping Wang
  • Jie Zhou

Despite the remarkable success of pre-trained language models (PLMs), they still face two challenges: First, large-scale PLMs are inefficient in terms of memory footprint and computation. Second, on downstream tasks, PLMs tend to rely on dataset bias and struggle to generalize to out-of-distribution (OOD) data. In response to the efficiency problem, recent studies show that dense PLMs can be replaced with sparse subnetworks without hurting performance. Such subnetworks can be found in three scenarios: 1) fine-tuned PLMs, 2) raw PLMs that are then fine-tuned in isolation, and even 3) PLMs without any parameter fine-tuning. However, these results are only obtained in the in-distribution (ID) setting. In this paper, we extend the study on PLM subnetworks to the OOD setting, investigating whether sparsity and robustness to dataset bias can be achieved simultaneously. To this end, we conduct extensive experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can consistently be found in BERT, across the aforementioned three scenarios, using different training and compression methods. Furthermore, we explore the upper bound of SRNets using the OOD information and show that there exist sparse and almost unbiased BERT subnetworks. Finally, we present 1) an analytical study that provides insights on how to promote the efficiency of the SRNet search process and 2) a solution to improve subnetwork performance at high sparsity. The code is available at https://github.com/llyx97/sparse-and-robust-PLM.
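
For orientation, a single global magnitude-pruning step, one generic way to obtain candidate subnetwork masks, looks as follows. This is a baseline sketch, not the paper's full SRNet search across the three scenarios.

```python
import torch

def magnitude_masks(weights, sparsity=0.8):
    """One global magnitude-pruning step: zero out the smallest weights.

    weights: list of weight tensors; returns a 0/1 mask per tensor with
    roughly `sparsity` of all entries masked out."""
    flat = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * flat.numel())
    threshold = flat.kthvalue(k).values if k > 0 else flat.min() - 1
    return [(w.detach().abs() > threshold).float() for w in weights]
```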

AAAI Conference 2022 Conference Paper

Coarse-to-Fine Embedded PatchMatch and Multi-Scale Dynamic Aggregation for Reference-Based Super-resolution

  • Bin Xia
  • Yapeng Tian
  • Yucheng Hang
  • Wenming Yang
  • Qingmin Liao
  • Jie Zhou

Reference-based super-resolution (RefSR) has made significant progress in producing realistic textures using an external reference (Ref) image. However, existing RefSR methods obtain high-quality correspondence matchings at quadratic computational cost with respect to the input size, limiting their application. Moreover, these approaches usually suffer from scale misalignments between the low-resolution (LR) image and the Ref image. In this paper, we propose an Accelerated Multi-Scale Aggregation network (AMSA) for reference-based super-resolution, including a Coarse-to-Fine Embedded PatchMatch (CFE-PatchMatch) scheme and a Multi-Scale Dynamic Aggregation (MSDA) module. To improve matching efficiency, we design a novel Embedded PatchMatch scheme with random sample propagation, which supports end-to-end training with computational cost asymptotically linear in the input size. To further reduce computational cost and speed up convergence, we apply a coarse-to-fine strategy on Embedded PatchMatch, constituting CFE-PatchMatch. To fully leverage reference information across multiple scales and enhance robustness to scale misalignment, we develop the MSDA module consisting of Dynamic Aggregation and Multi-Scale Aggregation. Dynamic Aggregation corrects minor scale misalignment by dynamically aggregating features, and Multi-Scale Aggregation brings robustness to large scale misalignment by fusing multi-scale information. Experimental results show that the proposed AMSA achieves superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.

IJCAI Conference 2022 Conference Paper

CUP: Curriculum Learning based Prompt Tuning for Implicit Event Argument Extraction

  • Jiaju Lin
  • Qin Chen
  • Jie Zhou
  • Jian Jin
  • Liang He

Implicit event argument extraction (EAE) aims to identify arguments that may be scattered across the document. Most previous work focuses on learning the direct relations between arguments and the given trigger, while implicit relations with long-range dependencies are not well studied. Moreover, recent neural-network-based approaches rely on a large amount of labeled data for training, which is often unavailable due to the high labeling cost. In this paper, we propose a Curriculum learning based Prompt tuning (CUP) approach, which resolves implicit EAE through four learning stages. The stages are defined according to the relations with the trigger node in a semantic graph, which well captures the long-range dependency between arguments and the trigger. In addition, we integrate a prompt-based encoder-decoder model to elicit related knowledge from pre-trained language models (PLMs) in each stage, where the prompt templates are adapted with the learning progress to enhance the reasoning for arguments. Experimental results on two well-known benchmark datasets show the great advantages of our proposed approach. In particular, we outperform the state-of-the-art models in both fully-supervised and low-data scenarios.

AAAI Conference 2022 Conference Paper

Efficient Non-local Contrastive Attention for Image Super-resolution

  • Bin Xia
  • Yucheng Hang
  • Yapeng Tian
  • Wenming Yang
  • Qingmin Liao
  • Jie Zhou

Non-Local Attention (NLA) brings significant improvement for Single Image Super-Resolution (SISR) by leveraging intrinsic feature correlation in natural images. However, NLA gives noisy information large weights and consumes quadratic computation resources with respect to the input size, limiting its performance and application. In this paper, we propose a novel Efficient Non-Local Contrastive Attention (ENLCA) to perform long-range visual modeling and leverage more relevant non-local features. Specifically, ENLCA consists of two parts, Efficient Non-Local Attention (ENLA) and Sparse Aggregation. ENLA adopts the kernel method to approximate the exponential function, achieving linear computational complexity. For Sparse Aggregation, we multiply inputs by an amplification factor to focus on informative features; however, this increases the variance of the approximation exponentially. Therefore, contrastive learning is applied to further separate relevant and irrelevant features. To demonstrate the effectiveness of ENLCA, we build an architecture called Efficient Non-Local Contrastive Network (ENLCN) by adding a few of our modules into a simple backbone. Extensive experimental results show that ENLCN reaches superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.
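
A kernelized linear attention step of the kind ENLA builds on can be sketched with Performer-style positive random features; the exact feature map used by the authors is an assumption here.

```python
import torch

def exp_kernel_features(x, proj):
    """Positive random features approximating the exponential kernel
    (Performer-style; ENLA's exact construction may differ).
    x: (n, d); proj: (d, m) with i.i.d. Gaussian entries."""
    return torch.exp(x @ proj - x.pow(2).sum(-1, keepdim=True) / 2) \
        / proj.shape[1] ** 0.5

def linear_attention(q, k, v, proj):
    qf, kf = exp_kernel_features(q, proj), exp_kernel_features(k, proj)
    kv = kf.t() @ v                      # (m, d): cost linear in sequence length
    z = qf @ kf.sum(0)                   # per-query normalizer, shape (n,)
    return (qf @ kv) / z.unsqueeze(-1)   # approximate softmax attention output
```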

IJCAI Conference 2022 Conference Paper

High-Resolution and Arbitrary-Sized Chinese Landscape Painting Creation Based on Generative Adversarial Networks

  • Peixiang Luo
  • Jinchao Zhang
  • Jie Zhou

This paper outlines an automated creation system for Chinese landscape paintings based on generative adversarial networks. The system consists of three cascaded modules: generation, resizing, and super-resolution. The generation module first generates a square-shaped painting, then the resizing module predicts an appropriate aspect ratio for it and performs resizing, and finally the super-resolution module is used to increase the resolution and improve the quality. After training each module with the images we collected from the web, our system can create high-resolution landscape paintings in arbitrary sizes.

NeurIPS Conference 2022 Conference Paper

HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

  • Yongming Rao
  • Wenliang Zhao
  • Yansong Tang
  • Jie Zhou
  • Ser Nam Lim
  • Jiwen Lu

Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.
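
A simplified third-order gnConv, condensed from the abstract's description (projection, one shared depthwise convolution, then recursive element-wise gating), might look like this; the official implementation differs in details such as order, filter size, and scaling.

```python
import torch
import torch.nn as nn

class GnConv(nn.Module):
    """Simplified 3rd-order recursive gated convolution (a sketch after
    the paper's description, not the released gnConv)."""
    def __init__(self, dim):
        super().__init__()
        self.dims = [dim // 4, dim // 2, dim]           # doubling channel widths
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), 7,
                                padding=3, groups=sum(self.dims))
        self.pws = nn.ModuleList([nn.Conv2d(self.dims[i], self.dims[i + 1], 1)
                                  for i in range(2)])
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                               # x: (B, dim, H, W)
        p, qs = self.proj_in(x).split([self.dims[0], sum(self.dims)], dim=1)
        qs = self.dwconv(qs).split(self.dims, dim=1)    # one shared depthwise conv
        p = p * qs[0]                                   # 1st-order interaction
        for pw, q in zip(self.pws, qs[1:]):             # recurse to higher orders
            p = pw(p) * q
        return self.proj_out(p)
```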

IJCAI Conference 2022 Conference Paper

Neutral Utterances are Also Causes: Enhancing Conversational Causal Emotion Entailment with Social Commonsense Knowledge

  • Jiangnan Li
  • Fandong Meng
  • Zheng Lin
  • Rui Liu
  • Peng Fu
  • Yanan Cao
  • Weiping Wang
  • Jie Zhou

Conversational Causal Emotion Entailment aims to detect causal utterances for a non-neutral targeted utterance from a conversation. In this work, we model conversations as graphs to overcome the implicit context modelling of the original entailment formulation. Following previous work, we further introduce emotion information into the graphs. Emotion information can markedly promote the detection of causal utterances whose emotion is the same as that of the targeted utterance. However, it is still hard to detect causal utterances with different emotions, especially neutral ones. The reason is that models are limited in reasoning about causal clues and passing them between utterances. To alleviate this problem, we introduce social commonsense knowledge (CSK) and propose a Knowledge Enhanced Conversation graph (KEC). KEC propagates the CSK between two utterances. As not all CSK is emotionally suitable for utterances, we propose a sentiment-realized knowledge selecting strategy to filter CSK. To process KEC, we further construct Knowledge Enhanced Directed Acyclic Graph networks. Experimental results show that our method outperforms baselines and infers more causes with emotions different from that of the targeted utterance.

NeurIPS Conference 2022 Conference Paper

OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression

  • Wanhua Li
  • Xiaoke Huang
  • Zheng Zhu
  • Yansong Tang
  • Xiu Li
  • Jie Zhou
  • Jiwen Lu

This paper presents a language-powered paradigm for ordinal regression. Existing methods usually treat each rank as a category and employ a set of weights to learn these concepts. These methods are easy to overfit and usually attain unsatisfactory performance as the learned concepts are mainly derived from the training set. Recent large pre-trained vision-language models like CLIP have shown impressive performance on various visual tasks. In this paper, we propose to learn the rank concepts from the rich semantic CLIP latent space. Specifically, we reformulate this task as an image-language matching problem with a contrastive objective, which regards labels as text and obtains a language prototype from a text encoder for each rank. Since prompt engineering for CLIP is extremely time-consuming, we propose OrdinalCLIP, a differentiable prompting method for adapting CLIP to ordinal regression. OrdinalCLIP consists of learnable context tokens and learnable rank embeddings. The learnable rank embeddings are constructed by explicitly modeling numerical continuity, resulting in well-ordered, compact language prototypes in the CLIP space. Once learned, we need only save the language prototypes and can discard the huge language model, resulting in zero additional computational overhead compared with the linear-head counterpart. Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks, and gains improvements in few-shot and distribution shift settings for age estimation. The code is available at https://github.com/xk-huang/OrdinalCLIP.
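
One plausible way to realize rank embeddings with explicit numerical continuity is to interpolate a few learnable anchor embeddings along the rank axis, so neighboring ranks stay close by construction. The construction below is an assumption, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankEmbeddings(nn.Module):
    """Ordered rank embeddings from interpolated learnable anchors."""
    def __init__(self, num_ranks, dim, num_anchors=4):
        super().__init__()
        self.num_ranks = num_ranks
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)

    def forward(self):
        # Treat the anchors as a 1-D signal over the rank axis and linearly
        # interpolate it to num_ranks points, so adjacent ranks stay close.
        sig = self.anchors.t().unsqueeze(0)                     # (1, dim, A)
        out = F.interpolate(sig, size=self.num_ranks,
                            mode='linear', align_corners=True)
        return out.squeeze(0).t()                               # (num_ranks, dim)
```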

NeurIPS Conference 2022 Conference Paper

P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting

  • Ziyi Wang
  • Xumin Yu
  • Yongming Rao
  • Jie Zhou
  • Jiwen Lu

Nowadays, pre-training big models on large-scale datasets has become a crucial topic in deep learning. Pre-trained models with high representation ability and transferability achieve great success and dominate many downstream tasks in natural language processing and 2D vision. However, it is non-trivial to promote such a pretraining-tuning paradigm to 3D vision, given the limited training data that are relatively inconvenient to collect. In this paper, we provide a new perspective of leveraging pre-trained 2D knowledge in the 3D domain to tackle this problem, tuning pre-trained image models with the novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost. Following the principle of prompt engineering, we transform point clouds into colorful images with geometry-preserved projection and geometry-aware coloring to adapt to pre-trained image models, whose weights are kept frozen during the end-to-end optimization of point cloud analysis tasks. We conduct extensive experiments to demonstrate that, in cooperation with our proposed Point-to-Pixel Prompting, a better pre-trained image model leads to consistently better performance in 3D vision. Benefiting from the prosperous development of the image pre-training field, our method attains 89.3% accuracy on the hardest setting of ScanObjectNN, surpassing conventional point cloud models with much fewer trainable parameters. Our framework also exhibits very competitive performance on ModelNet classification and ShapeNet Part Segmentation. Code is available at https://github.com/wangzy22/P2P.

IJCAI Conference 2022 Conference Paper

Rethinking the Promotion Brought by Contrastive Learning to Semi-Supervised Node Classification

  • Deli Chen
  • Yankai Lin
  • Lei Li
  • Xuancheng Ren
  • Peng Li
  • Jie Zhou
  • Xu Sun

Graph Contrastive Learning (GCL) has proven highly effective in promoting the performance of Semi-Supervised Node Classification (SSNC). However, existing GCL methods are generally transferred from other fields like CV or NLP, and their underlying working mechanism remains underexplored. In this work, we first deeply probe the working mechanism of GCL in SSNC, and find that the promotion brought by GCL is severely unevenly distributed: the improvement mainly comes from subgraphs with less annotated information, which is fundamentally different from contrastive learning in other fields. However, existing GCL methods generally ignore this uneven distribution of annotated information and apply GCL evenly to the whole graph. To remedy this issue and further improve GCL in SSNC, we propose the Topology InFormation gain-Aware Graph Contrastive Learning (TIFA-GCL) framework that considers the annotated information distribution across the graph in GCL. Extensive experiments on six benchmark graph datasets, including the enormous OGB-Products graph, show that TIFA-GCL can bring a larger improvement than existing GCL methods in both transductive and inductive settings. Further experiments demonstrate the generalizability and interpretability of TIFA-GCL.

IJCAI Conference 2022 Conference Paper

Uncertainty-Aware Representation Learning for Action Segmentation

  • Lei Chen
  • Muheng Li
  • Yueqi Duan
  • Jie Zhou
  • Jiwen Lu

In this paper, we propose an Uncertainty-Aware Representation Learning (UARL) method for action segmentation. Most existing action segmentation methods exploit continuity information of the action period to predict frame-level labels, which ignores the temporal ambiguity of the transition region between two actions. Moreover, similar periods of different actions, e.g., the beginnings of some actions, will confuse the network if they are annotated with different labels, which causes spatial ambiguity. To address this, we design UARL to exploit the transitional expression between two action periods by uncertainty learning. Specifically, we model every frame of an action with an active distribution that represents the probabilities of different actions, which captures the uncertainty of the action and exploits the tendency during the action. We evaluate our method on three popular action prediction datasets: Breakfast, Georgia Tech Egocentric Activities (GTEA), and 50Salads. The experimental results demonstrate that our method achieves state-of-the-art performance.

AAAI Conference 2021 Conference Paper

Aspect-Level Sentiment-Controllable Review Generation with Mutual Learning Framework

  • Huimin Chen
  • Yankai Lin
  • Fanchao Qi
  • Jinyi Hu
  • Peng Li
  • Jie Zhou
  • Maosong Sun

Review generation, aiming to automatically generate review text according to given information, is proposed to assist with the unappealing task of review writing. However, most existing methods only consider the overall sentiment of a review and cannot achieve aspect-level sentiment control. Even though some previous studies attempt to generate aspect-level sentiment-controllable reviews, they usually require large-scale human annotations which are unavailable in the real world. To address this issue, we propose a mutual learning framework that takes advantage of unlabeled data to assist aspect-level sentiment-controllable review generation. The framework consists of a generator and a classifier which utilize a confidence mechanism and a reconstruction reward to enhance each other. Experimental results show our model can achieve aspect-sentiment control accuracy of up to 88% without losing generation quality.

AAAI Conference 2021 Conference Paper

Boundary-Aware Geometric Encoding for Semantic Segmentation of Point Clouds

  • Jingyu Gong
  • Jiachen Xu
  • Xin Tan
  • Jie Zhou
  • Yanyun Qu
  • Yuan Xie
  • Lizhuang Ma

Boundary information plays a significant role in 2D image segmentation, while it is usually ignored in 3D point cloud segmentation, where ambiguous features might be generated in feature extraction, leading to misclassification in the transition area between two objects. In this paper, we first propose a Boundary Prediction Module (BPM) to predict boundary points. Based on the predicted boundary, a boundary-aware Geometric Encoding Module (GEM) is designed to encode geometric information and aggregate features with discrimination in a neighborhood, so that the local features belonging to different categories will not be polluted by each other. To provide extra geometric information for the boundary-aware GEM, we also propose a light-weight Geometric Convolution Operation (GCO), making the extracted features more distinguishing. Built upon the boundary-aware GEM, we build our network and test it on benchmarks such as ScanNet v2 and S3DIS. Results show our methods can significantly improve the baseline and achieve state-of-the-art performance.

NeurIPS Conference 2021 Conference Paper

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

  • Yongming Rao
  • Wenliang Zhao
  • Benlin Liu
  • Jiwen Lu
  • Jie Zhou
  • Cho-Jui Hsieh

Attention is sparse in vision transformers. We observe that the final prediction in vision transformers is only based on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens are still hardware friendly, which makes our framework easy to achieve an actual speed-up. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40% while keeping the drop in accuracy within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT.
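
The inference-time behavior of the prediction module reduces to scoring tokens and keeping the top fraction; a minimal sketch follows (training instead relies on Gumbel-softmax sampling plus the attention masking described above to remain differentiable).

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Lightweight keep/drop scorer with hard top-k selection, as used at
    inference time (a sketch, not the released module)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 2))

    def forward(self, tokens, keep_ratio=0.7):                  # tokens: (B, N, C)
        keep_prob = self.score(tokens).softmax(-1)[..., 0]      # (B, N)
        k = max(1, int(tokens.size(1) * keep_ratio))
        idx = keep_prob.topk(k, dim=1).indices                  # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                            # keep top tokens
```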

AAAI Conference 2021 Conference Paper

Faster Depth-Adaptive Transformers

  • Yijin Liu
  • Fandong Meng
  • Jie Zhou
  • Yufeng Chen
  • Jinan Xu

Depth-adaptive neural networks can dynamically adjust depths according to the hardness of input words, and thus improve efficiency. The main challenge is how to measure such hardness and decide the required depths (i.e., layers) to use. Previous works generally build a halting unit to decide whether the computation should continue or stop at each layer. As there is no specific supervision of depth selection, the halting unit may be under-optimized and inaccurate, which results in suboptimal and unstable performance when modeling sentences. In this paper, we get rid of the halting unit and estimate the required depths in advance, which yields a faster depth-adaptive model. Specifically, two approaches are proposed to explicitly measure the hardness of input words and estimate the corresponding adaptive depth, namely 1) mutual information (MI) based estimation and 2) reconstruction loss based estimation. We conduct experiments on the text classification task with 24 datasets of various sizes and domains. Results confirm that our approaches can speed up the vanilla Transformer (by up to 7x) while preserving high accuracy. Moreover, efficiency and robustness are significantly improved when compared with other depth-adaptive approaches.

NeurIPS Conference 2021 Conference Paper

Global Filter Networks for Image Classification

  • Yongming Rao
  • Wenliang Zhao
  • Zheng Zhu
  • Jiwen Lu
  • Jie Zhou

Recent advances in self-attention and pure multi-layer perceptron (MLP) models for vision have shown great potential in achieving promising performance with fewer inductive biases. These models are generally based on learning interactions among spatial locations from raw data. The complexity of self-attention and MLP grows quadratically as the image size increases, which makes these models hard to scale up when high-resolution features are required. In this paper, we present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our architecture replaces the self-attention layer in vision transformers with three key operations: a 2D discrete Fourier transform, an element-wise multiplication between frequency-domain features and learnable global filters, and a 2D inverse Fourier transform. We exhibit favorable accuracy/complexity trade-offs of our models on both ImageNet and downstream tasks. Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness. Code is available at https://github.com/raoyongming/GFNet.
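
The three key operations translate almost line for line into code. Below is a minimal sketch of one global filter layer following the abstract's description; the parameter shapes and initialization are assumptions.

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """FFT -> learnable complex filter -> inverse FFT (a minimal sketch)."""
    def __init__(self, h, w, dim):
        super().__init__()
        # rfft2 halves the last spatial axis; store the complex filter as a
        # real tensor with a trailing (re, im) dimension.
        self.weight = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):                               # x: (B, H, W, C)
        X = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')
        X = X * torch.view_as_complex(self.weight)      # global filtering
        return torch.fft.irfft2(X, s=x.shape[1:3], dim=(1, 2), norm='ortho')
```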

AAAI Conference 2021 Conference Paper

Guiding Non-Autoregressive Neural Machine Translation Decoding with Reordering Information

  • Qiu Ran
  • Yankai Lin
  • Peng Li
  • Jie Zhou

Non-autoregressive neural machine translation (NAT) generates each target word in parallel and has achieved promising inference acceleration. However, existing NAT models still have a big gap in translation quality compared to autoregressive neural machine translation models due to the multimodality problem: the target words may come from multiple feasible translations. To address this problem, we propose a novel NAT framework, ReorderNAT, which explicitly models reordering information to guide the decoding of NAT. Specifically, ReorderNAT utilizes deterministic and non-deterministic decoding strategies that leverage reordering information as a proxy for the final translation to encourage the decoder to choose words belonging to the same translation. Experimental results on various widely-used datasets show that our proposed model achieves better performance than most existing NAT models, and even achieves translation quality comparable to autoregressive translation models with a significant speedup.

AAAI Conference 2021 Conference Paper

Infusing Multi-Source Knowledge with Heterogeneous Graph Neural Network for Emotional Conversation Generation

  • Yunlong Liang
  • Fandong Meng
  • Ying Zhang
  • Yufeng Chen
  • Jinan Xu
  • Jie Zhou

The success of emotional conversation systems depends on sufficient perception and appropriate expression of emotions. In a real-world conversation, we first instinctively perceive emotions from multi-source information, including the emotion flow of the dialogue history, facial expressions, and the personalities of speakers, and then express suitable emotions according to our personalities; however, these multiple types of information are insufficiently exploited in the field of emotional conversation. To address this issue, we propose a heterogeneous graph-based model for emotional conversation generation. Specifically, we design a Heterogeneous Graph-Based Encoder to represent the conversation content (i.e., the dialogue history, its emotion flow, facial expressions, and speakers’ personalities) with a heterogeneous graph neural network, and then predict suitable emotions for feedback. After that, we employ an Emotion-Personality-Aware Decoder to generate a response that is not only relevant to the conversation context but also carries appropriate emotions, by taking the encoded graph representations, the predicted emotions from the encoder and the personality of the current speaker as inputs. Experimental results show that our model can effectively perceive emotions from multi-source knowledge and generate a satisfactory response, significantly outperforming previous state-of-the-art models.

IS Journal 2021 Journal Article

Intelligent Rack-Level Cooling Management in Data Centers With Active Ventilation Tiles: A Deep Reinforcement Learning Approach

  • Jianxiong Wan
  • Jie Zhou
  • Xiang Gui

The raised-floor architecture is widely adopted in the current data center industry. Due to structural design and workload imbalance, many server racks suffer from a nonuniform inlet temperature in raised-floor data centers. This compromises not only the cooling performance, but also the computing capacity and system reliability. In this article, we employ active ventilation tiles (AVTs), i.e., ordinary ventilation tiles with attached fans, to enhance local cold air delivery and improve cooling performance. We propose an AVT control algorithm adapted from recently developed deep reinforcement learning (DRL) techniques to tackle the complex data center environment and thermodynamics. Different from previous studies where the algorithm required extensive pretraining before deployment, our algorithm is fully online. In order to improve sample efficiency and accelerate learning, we integrate the Dyna architecture to take advantage of the experienced system transitions. We also leverage the ideas of shared reward and fingerprint to encourage cooperation in multi-AVT control. The performance of the proposed solution is then evaluated by deploying a prototype implementation in a production data center. Experimental results reveal that our solution significantly improves the rack inlet temperature distribution.

AAAI Conference 2021 Conference Paper

Multi-Proxy Wasserstein Classifier for Image Classification

  • Benlin Liu
  • Yongming Rao
  • Jiwen Lu
  • Jie Zhou
  • Cho-Jui Hsieh

Most widely-used convolutional neural networks (CNNs) end with a global average pooling layer and a fully-connected layer. In this pipeline, a certain class is represented by one template vector preserved in the feature banks of the fully-connected layer. Yet, a class may have multiple properties useful for recognition, while the above formulation only captures one of them. Therefore, it is desirable to represent a class by multiple proxies. However, directly adding multiple linear layers turns out to be a trivial solution, as no improvement can be observed. To tackle this problem, we adopt optimal transport theory to calculate a non-uniform matching flow between the elements in the feature map of a sample and the proxies of a class in closed form. By doing so, the models are enabled to achieve partial matching, as both the feature maps and the proxy set can now focus on a subset of elements from the counterpart. Such a formulation also enables us to embed the samples into the Wasserstein metric space, which has many advantages over the original Euclidean space. This formulation can be achieved by a lightweight iterative algorithm, which can be easily embedded into the automatic differentiation framework. Empirical studies are performed on two widely-used classification datasets, CIFAR and ILSVRC2012, and the substantial improvements on these two benchmarks demonstrate the effectiveness of our method.
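
The lightweight iterative algorithm the abstract refers to is of the Sinkhorn type for entropy-regularized optimal transport; a generic sketch, not the authors' exact solver, is below.

```python
import torch

def sinkhorn_plan(cost, a, b, eps=0.1, iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n, m) pairwise cost matrix; a: (n,), b: (m,) nonnegative
    marginals summing to 1. Every step is differentiable, so the loop
    embeds directly into autograd frameworks."""
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)              # match column marginals
        u = a / (K @ v)                  # match row marginals
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # transport plan, (n, m)
```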

AAAI Conference 2021 Conference Paper

SIMPLE: SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation

  • Jiabin Zhang
  • Zheng Zhu
  • Jiwen Lu
  • Junjie Huang
  • Guan Huang
  • Jie Zhou

Practical applications demand both accuracy and efficiency from multi-person pose estimation algorithms, but high accuracy and fast inference speed are dominated by top-down methods and bottom-up methods, respectively. To make a better trade-off between accuracy and efficiency, we propose a novel multi-person pose estimation framework, SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation (SIMPLE). Specifically, in the training process, we enable SIMPLE to mimic the pose knowledge from the high-performance top-down pipeline, which significantly promotes SIMPLE’s accuracy while maintaining its high efficiency during inference. Besides, SIMPLE formulates human detection and pose estimation as a unified point learning framework so that the two tasks complement each other in a single network. This is quite different from previous works where the two tasks may interfere with each other. To the best of our knowledge, this is the first work in pose estimation to propose both a mimicking strategy between different method types and unified point learning. In experiments, our approach achieves new state-of-the-art performance among bottom-up methods on the COCO, MPII and PoseTrack datasets. Compared with top-down approaches, SIMPLE has comparable accuracy and faster inference speed.

NeurIPS Conference 2021 Conference Paper

Topology-Imbalance Learning for Semi-Supervised Node Classification

  • Deli Chen
  • Yankai Lin
  • Guangxiang Zhao
  • Xuancheng Ren
  • Peng Li
  • Jie Zhou
  • Xu Sun

The class imbalance problem, as an important issue in learning node representations, has drawn increasing attention from the community. Although the imbalance considered by existing studies roots from the unequal quantity of labeled examples in different classes (quantity imbalance), we argue that graph data expose a unique source of imbalance from the asymmetric topological properties of the labeled nodes, i.e., labeled nodes are not equal in terms of their structural role in the graph (topology imbalance). In this work, we first probe the previously unknown topology-imbalance issue, including its characteristics, causes, and threats to semi-supervised node classification learning. We then provide a unified view for jointly analyzing the quantity- and topology-imbalance issues by considering the node influence shift phenomenon with the Label Propagation algorithm. In light of our analysis, we devise an influence conflict detection based metric, Totoro, to measure the degree of graph topology imbalance, and propose a model-agnostic method, ReNode, to address the topology-imbalance issue by re-weighting the influence of labeled nodes adaptively based on their relative positions to class boundaries. Systematic experiments demonstrate the effectiveness and generalizability of our method in relieving the topology-imbalance issue and promoting semi-supervised node classification. Further analysis unveils the varied sensitivity of different graph neural networks (GNNs) to topology imbalance, which may serve as a new perspective for evaluating GNN architectures.

AAAI Conference 2020 Conference Paper

DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog

  • Feilong Chen
  • Fandong Meng
  • Jiaming Xu
  • Peng Li
  • Bo Xu
  • Jie Zhou

Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image. It remains a challenging task since it requires the agent to fully understand a given question before making an appropriate response, drawing not only on the textual dialog history but also on the visually-grounded information. Previous models typically leverage single-hop reasoning or single-channel reasoning to deal with this complex multimodal reasoning task, which is intuitively insufficient. In this paper, we thus propose a novel and more powerful Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM. DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning. Specifically, DMRM maintains a dual channel to obtain question- and history-aware image features and question- and image-aware dialog history features via a multi-hop reasoning process in each channel. Additionally, we design an effective multimodal attention to further enhance the decoder to generate more accurate responses. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that the proposed model is effective and outperforms compared models by a significant margin.

AAAI Conference 2020 Conference Paper

Enhancing Pointer Network for Sentence Ordering with Pairwise Ordering Predictions

  • Yongjing Yin
  • Fandong Meng
  • Jinsong Su
  • Yubin Ge
  • Lingeng Song
  • Jie Zhou
  • Jiebo Luo

Dominant sentence ordering models use a pointer network decoder to generate ordering sequences in a left-to-right fashion. However, such a decoder only exploits the noisy left-side encoded context, which is insufficient to ensure correct sentence ordering. To address this deficiency, we propose to enhance the pointer network decoder with two pairwise ordering prediction modules: the FUTURE module predicts the relative orientations of other unordered sentences with respect to the candidate sentence, and the HISTORY module measures the local coherence between several (e.g., 2) previously ordered sentences and the candidate sentence, without the influence of noisy left-side context. Using the pointer mechanism, we then incorporate this dynamically generated information into the decoder as a supplement to the left-side context for better predictions. On several commonly-used datasets, our model significantly outperforms other baselines, achieving state-of-the-art performance. Further analyses verify that pairwise ordering predictions indeed provide extra useful context as expected, leading to better sentence ordering. We also evaluate our sentence ordering models on a downstream task, multi-document summarization, and the summaries reordered by our model achieve the best coherence scores. Our code is available at https://github.com/DeepLearnXMU/Pairwise.git.

AAAI Conference 2020 Conference Paper

Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View

  • Deli Chen
  • Yankai Lin
  • Wei Li
  • Peng Li
  • Jie Zhou
  • Xu Sun

Graph Neural Networks (GNNs) have achieved promising performance on a wide range of graph-based tasks. Despite their success, one severe limitation of GNNs is the over-smoothing issue (indistinguishable representations of nodes in different classes). In this work, we present a systematic and quantitative study of the over-smoothing issue of GNNs. First, we introduce two quantitative metrics, MAD and MADGap, to measure the smoothness and over-smoothness of graph node representations, respectively. Then, we verify that smoothing is the nature of GNNs and that the critical factor leading to over-smoothness is the low information-to-noise ratio of the message received by the nodes, which is partially determined by the graph topology. Finally, we propose two methods to alleviate the over-smoothing issue from the topological view: (1) MADReg, which adds a MADGap-based regularizer to the training objective; and (2) AdaEdge, which optimizes the graph topology based on the model predictions. Extensive experiments on 7 widely-used graph datasets with 10 typical GNN models show that the two proposed methods are effective in relieving the over-smoothing issue, thus improving the performance of various GNN models.
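
The MAD metric itself is straightforward to compute from node representations and a pair mask; a minimal sketch follows, with MADGap obtained by differencing MAD over remote and neighboring node pairs. The paper defines those pair sets via graph distance, which is assumed to be precomputed here.

```python
import torch
import torch.nn.functional as F

def mad(h, mask):
    """Mean Average Distance over the node pairs selected by mask.

    h: (n, d) node representations; mask: (n, n) 0/1 pair selector."""
    hn = F.normalize(h, dim=1)
    dist = (1 - hn @ hn.t()) * mask              # masked cosine distances
    per_node = dist.sum(1) / mask.sum(1).clamp(min=1)
    return per_node[mask.sum(1) > 0].mean()      # average over nonempty rows

# MADGap contrasts remote pairs against neighboring pairs:
# mad_gap = mad(h, remote_mask) - mad(h, neighbor_mask)
```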

AAAI Conference 2020 Conference Paper

Minimizing the Bag-of-Ngrams Difference for Non-Autoregressive Neural Machine Translation

  • Chenze Shao
  • Jinchao Zhang
  • Yang Feng
  • Fandong Meng
  • Jie Zhou

Non-Autoregressive Neural Machine Translation (NAT) achieves significant decoding speedup by generating target words independently and simultaneously. However, in the context of non-autoregressive translation, the word-level cross-entropy loss cannot model the target-side sequential dependency properly, leading to its weak correlation with translation quality. As a result, NAT tends to generate non-fluent translations with over-translation and under-translation errors. In this paper, we propose to train NAT to minimize the Bag-of-Ngrams (BoN) difference between the model output and the reference sentence. The bag-of-ngrams training objective is differentiable and can be efficiently calculated, which encourages NAT to capture the target-side sequential dependency and correlates well with translation quality. We validate our approach on three translation tasks and show that it largely outperforms the NAT baseline by about 5.0 BLEU points on WMT14 En↔De and about 2.5 BLEU points on WMT16 En↔Ro.

AAAI Conference 2020 Conference Paper

Multi-Zone Unit for Recurrent Neural Networks

  • Fandong Meng
  • Jinchao Zhang
  • Yang Liu
  • Jie Zhou

Recurrent neural networks (RNNs) have been widely used to deal with sequence learning problems. The input-dependent transition function, which folds new observations into hidden states to sequentially construct fixed-length representations of arbitrary-length sequences, plays a critical role in RNNs. Based on single space composition, transition functions in existing RNNs often have difficulty in capturing complicated long-range dependencies. In this paper, we introduce a new Multi-zone Unit (MZU) for RNNs. The key idea is to design a transition function that is capable of modeling multiple space composition. The MZU consists of three components: zone generation, zone composition, and zone aggregation. Experimental results on multiple datasets of the character-level language modeling task and the aspect-based sentiment analysis task demonstrate the superiority of the MZU.

IJCAI Conference 2020 Conference Paper

SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor

  • Jiachen Xu
  • Jingyu Gong
  • Jie Zhou
  • Xin Tan
  • Yuan Xie
  • Lizhuang Ma

Besides local features, global information plays an essential role in semantic segmentation, while recent works usually fail to explicitly extract meaningful global information and make full use of it. In this paper, we propose a SceneEncoder module to impose a scene-aware guidance to enhance the effect of global information. The module predicts a scene descriptor, which learns to represent the categories of objects existing in the scene and directly guides the point-level semantic segmentation by filtering out categories not belonging to this scene. Additionally, to alleviate segmentation noise in local regions, we design a region similarity loss to propagate distinguishing features to their own neighboring points with the same label, enhancing the distinguishing ability of point-wise features. We integrate our methods into several prevailing networks and conduct extensive experiments on the benchmark datasets ScanNet and ShapeNet. Results show that our methods greatly improve the performance of baselines and achieve state-of-the-art performance.
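
A condensed reading of the scene-descriptor idea: pool point features into a global vector, predict per-class presence, and gate the point-wise logits with it. The module name, max pooling, and sigmoid gating below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SceneGate(nn.Module):
    """Predict a class-presence descriptor from pooled point features and
    use it to suppress point logits of categories absent from the scene."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, point_feats, point_logits):
        # point_feats: (B, N, F); point_logits: (B, N, num_classes)
        scene = point_feats.max(dim=1).values             # global scene feature
        descriptor = torch.sigmoid(self.head(scene))      # (B, num_classes)
        return point_logits * descriptor.unsqueeze(1)     # gate absent classes
```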

IJCAI Conference 2019 Conference Paper

A Dual Reinforcement Learning Framework for Unsupervised Text Style Transfer

  • Fuli Luo
  • Peng Li
  • Jie Zhou
  • Pengcheng Yang
  • Baobao Chang
  • Xu Sun
  • Zhifang Sui

Unsupervised text style transfer aims to transfer the underlying style of text but keep its main content unchanged without parallel data. Most existing methods typically follow two steps: first separating the content from the original style, and then fusing the content with the desired style. However, the separation in the first step is challenging because content and style interact in subtle ways in natural language. Therefore, in this paper, we propose a dual reinforcement learning framework to directly transfer the style of the text via a one-step mapping model, without any separation of content and style. Specifically, we consider the learning of the source-to-target and target-to-source mappings as a dual task, and two rewards are designed based on this dual structure to reflect style accuracy and content preservation, respectively. In this way, the two one-step mapping models can be trained via reinforcement learning, without any use of parallel data. Automatic evaluations show that our model outperforms state-of-the-art systems by a large margin, with more than 10 BLEU points improvement on average across two benchmark datasets. Human evaluations also validate the effectiveness of our model in terms of style accuracy, content preservation and fluency. Our code and data, including the outputs of all baselines and our model, are available at https://github.com/luofuli/DualRL.

JBHI Journal 2018 Journal Article

Left Atrial Appendage Segmentation Using Fully Convolutional Neural Networks and Modified Three-Dimensional Conditional Random Fields

  • Cheng Jin
  • Jianjiang Feng
  • Lei Wang
  • Heng Yu
  • Jiang Liu
  • Jiwen Lu
  • Jie Zhou

Thrombosis has become a global disease threatening human health. The left atrial appendage (LAA) is a major source of thrombosis in patients with atrial fibrillation (AF). A positive correlation exists between LAA volume and AF risk. LAA morphology has been suggested to influence thromboembolic risk in AF patients and to help predict thromboembolic events in low-risk patient groups. Automatic segmentation of the LAA can greatly help physicians diagnose AF. In consideration of the large anatomical variations of the LAA, we proposed a robust method for automatic LAA segmentation on computed tomographic angiography (CTA) data using fully convolutional neural networks (FCNs) with three-dimensional (3-D) conditional random fields (CRFs). After manual localization of the LAA ROI, we adopted the FCN used in natural image segmentation and transferred its learned models by fine-tuning the networks to segment each 2-D LAA slice. Subsequently, we used a modified dense 3-D CRF that accounts for 3-D spatial information and larger contextual information to refine the segmentations of all slices. Our method was evaluated on 150 sets of CTA data using five-fold cross validation. Compared with manual annotation, we obtained a mean Dice overlap of 94.76% and a mean volume overlap of 91.10% with a computation time of less than 40 s per volume. Experimental results demonstrated the robustness of our method in dealing with large anatomical variations and its computational efficiency for adoption in a daily clinical routine.

AAAI Conference 2018 Conference Paper

SNNN: Promoting Word Sentiment and Negation in Neural Sentiment Classification

  • Qinmin Hu
  • Jie Zhou
  • Qin Chen
  • Liang He

We mainly investigate word influence in neural sentiment classification, which results in a novel approach to promoting word sentiment and negation as attentions. In particular, a sentiment and negation neural network (SNNN) is proposed, including a sentiment neural network (SNN) and a negation neural network (NNN). First, we modify the word level by embedding the word sentiment and negation information as extra layers for the input. Second, we adopt a hierarchical LSTM model to generate the word-level, sentence-level and document-level representations respectively. After that, we enhance word sentiment and negation as attentions over the semantic level. Finally, experiments conducted on the IMDB and Yelp datasets show that our approach is superior to the state-of-the-art methods. Furthermore, we draw the interesting conclusions that (1) LSTM performs better than CNN and RNN for neural sentiment classification; (2) word sentiment and negation are a strong alliance as attentions, while overfitting occurs when they are simultaneously applied at the embedding layer; and (3) word sentiment/negation can be applied individually, as both an embedding layer and an attention at the same time, for better performance.

IJCAI Conference 2017 Conference Paper

ME-MD: An Effective Framework for Neural Machine Translation with Multiple Encoders and Decoders

  • Jinchao Zhang
  • Qun Liu
  • Jie Zhou

The encoder-decoder neural framework is widely employed for Neural Machine Translation (NMT), with a single encoder to represent the source sentence and a single decoder to generate target words. The translation performance heavily relies on the representation ability of the encoder and the generation ability of the decoder. To further enhance NMT, we propose to extend the original encoder-decoder framework to a novel one with multiple encoders and decoders (ME-MD). In this way, multiple encoders extract more diverse features to represent the source sequence and multiple decoders capture more complicated translation knowledge. Our proposed ME-MD framework conveniently integrates heterogeneous encoders and decoders of multiple depths and multiple types. Experiments on a Chinese-English translation task show that our ME-MD system surpasses the state-of-the-art NMT system by 2.1 BLEU points and surpasses the phrase-based Moses by 7.38 BLEU points. Our framework is general and can be applied to other sequence-to-sequence tasks.

JBHI Journal 2017 Journal Article

Multiple ECG Fiducial Points-Based Random Binary Sequence Generation for Securing Wireless Body Area Networks

  • Guanglou Zheng
  • Gengfa Fang
  • Rajan Shankaran
  • Mehmet A. Orgun
  • Jie Zhou
  • Li Qiao
  • Kashif Saleem

Generating random binary sequences (BSes) is a fundamental requirement in cryptography. A BS is a sequence of $N$ bits, and each bit has a value of 0 or 1. For securing sensors within wireless body area networks (WBANs), electrocardiogram (ECG)-based BS generation methods have been widely investigated, in which interpulse intervals (IPIs) from each heartbeat cycle are processed to produce BSes. Using these IPI-based methods to generate a 128-bit BS in real time normally takes around half a minute. In order to improve the time efficiency of such methods, this paper presents an ECG multiple fiducial-points based binary sequence generation (MFBSG) algorithm. The technique of discrete wavelet transforms is employed to detect the arrival times of these fiducial points, such as the P, Q, R, S, and T peaks. Time intervals between them, including RR, RQ, RS, RP, and RT intervals, are then calculated from these arrival times and used as ECG features to generate random BSes with low latency. According to our analysis of real ECG data, these ECG feature values exhibit the property of randomness and, thus, can be utilized to generate random BSes. Compared with schemes that solely rely on IPIs to generate BSes, the MFBSG algorithm uses five feature values from one heartbeat cycle, and can be up to five times faster than the solely IPI-based methods. So, it achieves a design goal of low latency. According to our analysis, the complexity of the algorithm is comparable to that of fast Fourier transforms. These randomly generated ECG BSes can be used as security keys for encryption or authentication in a WBAN system.
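
To illustrate the flavor of the scheme: quantize each fiducial-point interval and keep its low-order bits, a common IPI-style encoding. The exact coding used in MFBSG (quantization step, bit selection, error correction) is not reproduced here; this is an illustrative assumption only.

```python
def intervals_to_bits(intervals_ms, bits_per_interval=4):
    """Turn fiducial-point intervals (e.g., RR, RQ, RS, RP, RT in ms)
    into a binary sequence by keeping the low-order bits of each
    quantized interval (an illustrative encoding, not the paper's)."""
    bits = []
    for iv in intervals_ms:
        q = int(round(iv)) % (1 << bits_per_interval)   # low-order bits only
        bits.extend(int(c) for c in format(q, f'0{bits_per_interval}b'))
    return bits

# Five intervals per heartbeat yield 20 bits/beat at 4 bits each, versus
# 4 bits/beat for a solely IPI-based scheme: hence the up-to-5x speedup.
```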

NeurIPS Conference 2017 Conference Paper

Runtime Neural Pruning

  • Ji Lin
  • Yongming Rao
  • Jiwen Lu
  • Jie Zhou

In this paper, we propose a Runtime Neural Pruning (RNP) framework which prunes the deep neural network dynamically at runtime. Unlike existing neural pruning methods which produce a fixed pruned model for deployment, our method preserves the full ability of the original network and conducts pruning according to the input image and current feature maps adaptively. The pruning is performed in a bottom-up, layer-by-layer manner, which we model as a Markov decision process and use reinforcement learning for training. The agent judges the importance of each convolutional kernel and conducts channel-wise pruning conditioned on different samples, where the network is pruned more when the image is easier for the task. Since the ability of the network is fully preserved, the balance point is easily adjustable according to the available resources. Our method can be applied to off-the-shelf network structures and reaches a better trade-off between speed and accuracy, especially with a large pruning rate.

NeurIPS Conference 2015 Conference Paper

Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question

  • Haoyuan Gao
  • Junhua Mao
  • Jie Zhou
  • Zhiheng Huang
  • Lei Wang
  • Wei Xu

In this paper, we present the mQA model, which is able to answer questions about the content of an image. The answer can be a sentence, a phrase or a single word. Our model contains four components: a Long Short-Term Memory (LSTM) to extract the question representation, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate our mQA model. It contains over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations. The quality of the generated answers of our mQA model on this dataset is evaluated by human judges through a Turing Test. Specifically, we mix the answers provided by humans and our model. The human judges need to distinguish our model from the human. They will also provide a score (i.e., 0, 1, 2; the larger the better) indicating the quality of the answer. We propose strategies to monitor the quality of this evaluation process. The experiments show that in 64.7% of cases, the human judges cannot distinguish our model from humans. The average score is 1.454 (1.918 for human). The details of this work, including the FM-IQA dataset, can be found on the project page: http://idl.baidu.com/FM-IQA.html.

IJCAI Conference 2009 Conference Paper

Local Learning Regularized Nonnegative Matrix Factorization

  • Quanquan Gu
  • Jie Zhou

Nonnegative Matrix Factorization (NMF) has been widely used in machine learning and data mining. It aims to find two nonnegative matrices whose product can well approximate the nonnegative data matrix, which naturally leads to a parts-based representation. In this paper, we present a local learning regularized nonnegative matrix factorization (LL-NMF) for clustering. It imposes an additional constraint on NMF that the cluster label of each point can be predicted by the points in its neighborhood. This constraint encodes both the discriminative information and the geometric structure, and is good at clustering data on manifolds. An iterative multiplicative update algorithm is proposed to optimize the objective, and its convergence is guaranteed theoretically. Experiments on many benchmark data sets demonstrate that the proposed method outperforms NMF as well as many state-of-the-art clustering methods.
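
For reference, the standard Lee-Seung multiplicative updates that such algorithms extend can be sketched in a few lines of NumPy; LL-NMF's local learning regularizer adds extra terms to these updates, which are omitted in this generic sketch.

```python
import numpy as np

def nmf_multiplicative(X, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||X - WH||_F^2 with
    W, H >= 0 (the generic scheme; LL-NMF's regularizer is omitted)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H, stays nonnegative
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W, stays nonnegative
    return W, H
```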