Arrow Research search

Author name cluster

Hao Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

76 papers
2 author rows

Possible papers

76

TMLR Journal 2026 Journal Article

ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

  • Jinyi Hu
  • Shengding Hu
  • Yuxuan Song
  • Yufei Huang
  • Mingxuan Wang
  • Hao Zhou
  • Zhiyuan Liu
  • Wei-Ying Ma

Autoregressive and diffusion models have achieved remarkable progress in language modeling and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion while bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: during training, it is as simple as applying a specially designed Skip-Causal Attention Mask to a standard diffusion transformer. During inference, the process alternates between diffusion denoising and autoregressive decoding, making full use of the KV-Cache. We validate the effectiveness of ACDiT on image, video, and text generation and show that ACDiT performs best among all autoregressive baselines of similar model scale on visual generation tasks. We also demonstrate that, benefiting from autoregressive modeling, a pretrained ACDiT can be transferred to visual understanding tasks despite being trained with a generative objective. An analysis of the trade-off between autoregression and diffusion demonstrates ACDiT's potential for long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and sheds light on new avenues for unified models.
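The block-wise attention idea in the abstract can be sketched in a few lines. The following is a simplified block-causal mask, where tokens attend fully within their own block and causally to all earlier blocks; it is a toy illustration only, not the paper's exact Skip-Causal Attention Mask (which additionally handles the distinction between noised and clean block copies during training):

```python
import numpy as np

def block_causal_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Build a block-wise causal attention mask (1 = may attend).

    Tokens inside a block attend to every token of the same block
    (full attention, as in diffusion) and to all tokens of earlier
    blocks (causal attention, as in autoregression).
    """
    n = num_blocks * block_size
    mask = np.zeros((n, n), dtype=np.int8)
    for q in range(n):
        q_block = q // block_size
        # attend to everything up to and including the current block
        mask[q, : (q_block + 1) * block_size] = 1
    return mask

mask = block_causal_mask(num_blocks=3, block_size=2)
# Rows in the first block see only their own 2 tokens;
# rows in the last block see all 6 positions.
```

Setting `block_size=1` recovers ordinary token-wise causal masking, while `num_blocks=1` gives full (diffusion-style) attention, which is the interpolation the abstract describes.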

JBHI Journal 2026 Journal Article

HSD: Hough-Based Structure-Aware Detection of B-Lines in Lung Ultrasound

  • Tuo Liu
  • Hao Zhou
  • Jia-Hao Wang
  • Yu Zhang
  • Chen Chen
  • Yang Chen
  • Guang-Quan Zhou
  • Lin Li

B-lines are artifacts produced by the interaction of ultrasound with small air-liquid interfaces, and they often serve as crucial biomarkers for evaluating lung pathology, such as the presence of liquid. However, due to the reverberation phenomenon, B-lines manifest as blurred, strip-like comet tails originating perpendicularly from the pleural line, making their automatic identification in speckle-noisy ultrasound images particularly challenging. This study proposes a Hough-based structure-aware detection framework, dubbed HSD, which leverages structural priors and the intrinsic relationship between the pleural line and B-lines to enhance B-line detection in ultrasound images. First, the proposed method adopts a shared encoder and two collaborative decoders to improve B-line identification with auxiliary pleural-line detection, ensuring effective representation learning of linear structural features under inherent prior constraints. Specifically, one decoder incorporates Hough-based regression to reinforce modeling of the global linear nature of B-lines, alleviating the appearance influence of the fuzzy comet tail. Simultaneously, another pathway enhances exploration of the slender, curved morphology by integrating semantic context learning with linear heatmap regression, thereby facilitating detection of the pleural line for calibration of B-lines. Second, we introduce a position-aware rectification module to ensure the consistency of the pleural line and its perpendicular alignment with B-lines. This post-processing module reduces the influence of ambiguous pixels, improving the robustness of B-line detection. Extensive experimental results on an in-house ultrasound dataset demonstrate the superiority of the proposed approach, which achieves a precision of 0.743, a recall of 0.953, and an F-measure of 0.837, substantially ahead of other methods, suggesting its potential for detecting pathological indicators in lung ultrasound.

AAAI Conference 2026 Conference Paper

MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment

  • Hao Zhou
  • Xiaobao Guo
  • Yuzhe Zhu
  • Adams Wai-Kin Kong

Propelled by breakthroughs in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works focus only on single-source audio inputs for image generation, ignoring the multi-source character of natural auditory scenes and thus limiting performance in generating comprehensive visual content. To bridge this gap, we propose a method called MACS to conduct multi-source audio-to-image generation. To the best of our knowledge, this is the first work that explicitly separates multi-source audio to capture its rich components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by projecting them into a common space using the large pre-trained CLAP model. We introduce a ranking loss to account for the contextual significance of the separated audio signals. In the second stage, effective image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and an MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. Experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods on 17 of the 21 evaluation metrics across all tasks and delivers superior visual quality.

NeurIPS Conference 2025 Conference Paper

Accelerating 3D Molecule Generative Models with Trajectory Diagnosis

  • Zhilong Zhang
  • Yuxuan Song
  • Yichun Wang
  • Jingjing Gong
  • Hanlin Wu
  • Dongzhan Zhou
  • Hao Zhou
  • Wei-Ying Ma

Geometric molecule generative models have found expanding applications across various scientific domains, but their generation inefficiency has become a critical bottleneck. Through a systematic investigation of the generative trajectory, we discover a unique challenge for molecular geometric graph generation: generative models must determine the permutation order of atoms in the molecule before refining its atomic feature values. Based on this insight, we decompose the generation process into a permutation phase and an adjustment phase, and propose a geometry-informed prior and a consistency parameter objective to accelerate each phase. Extensive experiments demonstrate that our approach achieves competitive performance with approximately 10 sampling steps, 7.5× faster than previous state-of-the-art models and approximately 100× faster than diffusion-based models, offering a significant step towards scalable molecular generation.

NeurIPS Conference 2025 Conference Paper

Bi-Level Decision-Focused Causal Learning for Large-Scale Marketing Optimization: Bridging Observational and Experimental Data

  • Shuli Zhang
  • Hao Zhou
  • Jiaqi Zheng
  • Guibin Jiang
  • Cheng Bing
  • Wei Lin
  • Guihai Chen

Online Internet platforms require sophisticated marketing strategies to optimize user retention and platform revenue, a classical resource allocation problem. Traditional solutions adopt a two-stage pipeline: machine learning (ML) for predicting individual treatment effects of marketing actions, followed by operations research (OR) optimization for decision-making. This paradigm presents two fundamental technical challenges. First, the prediction-decision misalignment: conventional ML methods focus solely on prediction accuracy without considering downstream optimization objectives, leading to improved predictive metrics that fail to translate into better decisions. Second, the bias-variance dilemma: observational data suffers from multiple biases (e.g., selection bias, position bias), while experimental data (e.g., randomized controlled trials), though unbiased, is typically scarce and costly, resulting in high-variance estimates. We propose Bi-level Decision-Focused Causal Learning (Bi-DFCL), which systematically addresses these challenges. First, we develop an unbiased estimator of OR decision quality using experimental data, which guides ML model training through surrogate loss functions that bridge discrete optimization gradients. Second, we establish a bi-level optimization framework that jointly leverages observational and experimental data, solved via implicit differentiation. This novel formulation enables our unbiased OR estimator to correct learning directions from biased observational data, achieving an optimal bias-variance tradeoff. Extensive evaluations on public benchmarks, industrial marketing datasets, and large-scale online A/B tests demonstrate the effectiveness of Bi-DFCL, showing statistically significant improvements over the state of the art. Bi-DFCL has been deployed across several marketing scenarios at Meituan, one of the largest online food delivery platforms in the world.

ICRA Conference 2025 Conference Paper

Computationally and Sample Efficient Safe Reinforcement Learning Using Adaptive Conformal Prediction

  • Hao Zhou
  • Yanze Zhang
  • Wenhao Luo

Safety is a critical concern in learning-enabled autonomous systems, especially when deploying these systems in real-world scenarios. An important challenge is accurately quantifying the uncertainty of unknown models to generate provably safe control policies that facilitate the gathering of informative data, thereby achieving both safe and optimal policies. Additionally, the choice of data-driven model can significantly impact both real-time implementation and the uncertainty quantification process. In this paper, we propose a provably sample-efficient episodic safe learning framework that remains robust across various model choices, with quantified uncertainty, for online control tasks. Specifically, we first employ Quadrature Fourier Features (QFF) for kernel function approximation of Gaussian Processes (GPs) to enable efficient approximation of unknown dynamics. Then Adaptive Conformal Prediction (ACP) is used to quantify the uncertainty from online observations and is combined with Control Barrier Functions (CBFs) to characterize uncertainty-aware safe control constraints under the learned dynamics. Finally, an optimism-based exploration strategy is integrated with ACP-based CBFs for safe exploration and near-optimal safe nonlinear control. Theoretical proofs and simulation results demonstrate the effectiveness and efficiency of the proposed framework.
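The core ACP recursion behind this kind of online uncertainty quantification is the standard adaptive miscoverage update, independent of the paper's CBF integration. A minimal sketch, where the step size `gamma` and the error sequence are illustrative assumptions:

```python
def acp_alpha_sequence(errors, target_alpha=0.1, gamma=0.05):
    """Adaptive conformal prediction update.

    After each observation, nudge the working miscoverage level alpha
    toward the target:  alpha <- alpha + gamma * (target_alpha - err).
    errors[t] = 1 if the prediction set at step t missed the truth, else 0.
    A miss lowers alpha (widening future sets); a cover raises it.
    """
    alpha = target_alpha
    trace = []
    for err in errors:
        alpha += gamma * (target_alpha - err)
        trace.append(alpha)
    return trace

# A miss at step 0 drops alpha to 0.055 (wider sets); covers relax it back.
trace = acp_alpha_sequence([1, 0, 0])
```

The appeal of this update in online control settings is that its coverage guarantee holds without distributional assumptions on the observation stream, which is why it pairs naturally with learned dynamics.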

AAAI Conference 2025 Conference Paper

Core-to-Global Reasoning for Compositional Visual Question Answering

  • Hao Zhou
  • Tingjin Luo
  • Zhangqi Jiang

Compositional visual question answering (Compositional VQA) must provide an answer to a compositional question, which requires the model to have advanced capabilities in multi-modal semantic understanding and logical reasoning. However, current VQA models mainly concentrate on enriching the visual representations of images and neglect the redundancy in the enriched information, which can have negative impacts. To enhance the value and availability of semantic features, we propose a novel core-to-global reasoning (CTGR) model for compositional VQA. The model first extracts both global features and core features from the image and question through a feature embedding module. Then, to enhance the value of semantic features, we propose an information filtering module that aligns visual and text features at the core semantic level and filters out the redundancy carried by image and question features at the global semantic level, further strengthening cross-modal correlations. In addition, we design a novel core-to-global reasoning mechanism for multimodal fusion, which integrates content features from core learning and context features from global features for accurate answer prediction. Finally, extensive experimental results on GQA, GQA-sub, VQA2.0 and Visual7W demonstrate the effectiveness and superiority of CTGR.

NeurIPS Conference 2025 Conference Paper

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

  • Qiying Yu
  • Zheng Zhang
  • Ruofei Zhu
  • Yufeng Yuan
  • Xiaochen Zuo
  • Yu Yue
  • Weinan Dai
  • Tiantian Fan

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique for eliciting complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.

NeurIPS Conference 2025 Conference Paper

Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

  • Jiangjie Chen
  • Qianyu He
  • Siyu Yuan
  • Aili Chen
  • Zhicheng Cai
  • Weinan Dai
  • Hongli Yu
  • Jiaze Chen

Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce ENIGMATA, the first comprehensive suite tailored for improving LLMs' puzzle reasoning skills. It includes 36 tasks across 7 categories, each with: 1) a generator that produces unlimited examples with controllable difficulty, and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose ENIGMATA-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-ENIGMATA, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such as ENIGMATA-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained into larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from ENIGMATA further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing the strong generalization benefits of ENIGMATA. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Project page: https://seed-enigmata.github.io.

IJCAI Conference 2025 Conference Paper

Generative AI for Immersive Video: Recent Advances and Future Opportunities

  • Kaiyuan Hu
  • Yili Jin
  • Hao Zhou
  • Linfeng Du
  • Jiangchuan Liu
  • Xue Liu

Immersive video serves as a key component of eXtended Reality (XR), which aims to create and interact with simulated virtual or hybrid environments. The technology allows users to experience immersive sensations that transcend time and space, while continuously providing training data for emerging technologies like Embodied AI. Thanks to advancements in sensing, computing, and display, recent years have witnessed many excellent works on XR and related hardware and software systems. However, challenges such as high creation cost, lack of immersion, and limited scalability hinder the practical application of immersive video services. Recently emerged generative artificial intelligence (GenAI), meanwhile, offers new insights for tackling these challenges. In this paper, we conduct a comprehensive survey of recent advances and future opportunities in how GenAI can benefit immersive video services. By introducing a systematic taxonomy, we classify the pertinent techniques and applications into three well-defined categories aligned with the pipeline of an immersive video service: content creation, network delivery, and client-side display. This categorization enables a structured exploration of the diverse roles GenAI can play in immersive video services, providing a framework for a more comprehensive understanding and evaluation of these technologies. To the best of our knowledge, this work is the first systematic survey of GenAI in XR settings, laying a foundation for future research in this interdisciplinary domain.

ICML Conference 2025 Conference Paper

LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

  • Kai Liu
  • Bowen Xu
  • Shaoyu Wu
  • Xin Chen
  • Hao Zhou
  • Yongliang Tao
  • Lulu Hu

Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30× wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, surpassing TEAL by 1.77% and CATS by 17.14%.
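The rotate-then-Top-K mechanism can be illustrated with a toy sketch. Here `Q` is just a random orthogonal matrix, not LaRoSA's actual layerwise rotation, and the dimensions are arbitrary; the point is only the shape of the operation: rotate, keep the k largest-magnitude coordinates, rotate back.

```python
import numpy as np

def rotated_topk(x: np.ndarray, Q: np.ndarray, k: int) -> np.ndarray:
    """Sparsify activations in a rotated basis.

    Q is orthogonal (Q @ Q.T == I): rotate x, zero all but the k
    largest-magnitude coordinates, then rotate back.
    """
    z = Q @ x
    keep = np.argsort(np.abs(z))[-k:]   # indices of the k largest |z_i|
    z_sparse = np.zeros_like(z)
    z_sparse[keep] = z[keep]
    return Q.T @ z_sparse

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal basis
x = rng.standard_normal(8)
# With k equal to the full dimension, the transform is lossless.
assert np.allclose(rotated_topk(x, Q, 8), x)
```

Because `Q` is fixed per layer, the matrix products can be folded into adjacent weight matrices in practice, so the only runtime work added is the Top-K selection; the fixed k is what yields the consistent sparsity level the abstract emphasizes, in contrast to magnitude thresholding.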

AAAI Conference 2025 Conference Paper

MoE-LPR: Multilingual Extension of Large Language Models Through Mixture-of-Experts with Language Priors Routing

  • Hao Zhou
  • Zhijun Wang
  • Shujian Huang
  • Xin Huang
  • Xue Han
  • Junlan Feng
  • Chao Deng
  • Weihua Luo

Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of high-resource languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion against forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus on improving the ability on expanded languages, without using any original-language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of post-pretraining, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing the original parameters preserves original-language knowledge, while adding new experts preserves learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR's effectiveness in improving expanded languages and preserving original-language proficiency, with superior scalability.

NeurIPS Conference 2025 Conference Paper

MOF-BFN: Metal-Organic Frameworks Structure Prediction via Bayesian Flow Networks

  • Rui Jiao
  • Hanlin Wu
  • Wenbing Huang
  • Yuxuan Song
  • Yawen Ouyang
  • Yu Rong
  • Tingyang Xu
  • Pengju Wang

Metal-Organic Frameworks (MOFs) have attracted considerable attention due to their unique properties, including high surface area and tunable porosity, and their promising applications in catalysis, gas storage, and drug delivery. Structure prediction for MOFs is a challenging task, as these frameworks are intrinsically periodic and hierarchically organized, with the entire structure assembled from building blocks like metal nodes and organic linkers. To address this, we introduce MOF-BFN, a novel generative model for MOF structure prediction based on Bayesian Flow Networks (BFNs). Given the local geometry of the building blocks, MOF-BFN jointly predicts the lattice parameters, as well as the positions and orientations of all building blocks within the unit cell. In particular, the positions are modeled in the fractional coordinate system to naturally incorporate the periodicity. Meanwhile, the orientations are modeled as unit quaternions sampled from learned Bingham distributions via the proposed Bingham BFN, enabling effective orientation generation on the 4D unit hypersphere. Experimental results demonstrate that MOF-BFN achieves state-of-the-art performance across multiple tasks, including structure prediction, geometric property evaluation, and de novo generation, offering a promising tool for designing complex MOF materials.

IROS Conference 2025 Conference Paper

NailTact: Single-camera based Tactile Fingertip with Nail

  • Hao Zhou
  • Masahiro Miyazaki
  • Kazuhiro Shimonomura

Vision-based tactile sensing, an economical and widely utilized methodology, has the potential to offer crucial contact geometry information for localizing objects even under visual occlusion. However, such fingertip sensors have limitations. When a person picks up a relatively small object placed on a flat surface with two fingers, they may use not only the pads of their fingers but also, for small or thin objects, their fingernails. Fingers with nail structures have likewise been shown to be effective for picking up such objects with robot hands. Moreover, in actual work, accidental contact between sensors and surrounding objects such as tables often occurs. Sensors with fingernails can avoid this situation by having the nail touch the object before the fingertip does. In this work, we present NailTact, which can detect the force applied to both the fingertip and the nail from the same image using a single camera. Using a prototype robot finger, we verify the sensor's response characteristics to load on the nail, its response when grasping an object with the nail, and its behavior when the finger makes contact with a table. We also present a simple model of the relationship between the force applied to the nail and the movement of the marker. In a card-grasping experiment, we not only successfully grasped a very thin object but also measured the grasping force.

NeurIPS Conference 2025 Conference Paper

Rationalized All-Atom Protein Design with Unified Multi-Modal Bayesian Flow

  • Hanlin Wu
  • Yuxuan Song
  • Zhe Zhang
  • Zhilong Zhang
  • Hao Zhou
  • Wei-Ying Ma
  • Jingjing Liu

Designing functional proteins is a critical yet challenging problem due to the intricate interplay between backbone structures, sequences, and side-chains. Current approaches often decompose protein design into separate tasks, which can lead to accumulated errors, while recent efforts increasingly focus on all-atom protein design. However, we observe that existing all-atom generation approaches suffer from an information-shortcut issue, where models inadvertently infer sequences from side-chain information, compromising their ability to accurately learn sequence distributions. To address this, we introduce a novel rationalized information flow strategy to eliminate the information shortcut. Furthermore, motivated by the advantages of Bayesian flows over differential equation-based methods, we propose the first Bayesian flow formulation for protein backbone orientations by recasting orientation modeling as an equivalent hyperspherical generation problem with antipodal symmetry. In validation, our method delivers consistently strong performance in both peptide and antibody design tasks.

ICML Conference 2025 Conference Paper

Reducing Confounding Bias without Data Splitting for Causal Inference via Optimal Transport

  • Yuguang Yan
  • Zongyu Li
  • Haolin Yang
  • Zeqin Yang
  • Hao Zhou
  • Ruichu Cai
  • Zhifeng Hao

Causal inference seeks to estimate the effect of a treatment, such as a medicine or a medication dosage. To reduce the confounding bias caused by non-randomized treatment assignment, most existing methods reduce the shift between subpopulations receiving different treatments. However, these methods split the limited training samples into smaller groups, which cuts down the number of samples in each group, while precise distribution estimation and alignment rely heavily on a sufficient number of training samples. In this paper, we propose a distribution alignment paradigm without data splitting, which can be naturally applied in both binary and continuous treatment settings. To this end, we characterize the confounding bias by considering different probability measures over the same set containing all the training samples, and exploit optimal transport theory to analyze the confounding bias and outcome estimation error. Based on this, we propose to learn balanced representations by reducing the bias between the marginal distribution and the conditional distribution of a treatment. As a result, the data reduction caused by splitting is avoided, and an outcome prediction model trained on one treatment group can be generalized to the entire population. Experiments in both binary and continuous treatment settings demonstrate the effectiveness of our method.

NeurIPS Conference 2025 Conference Paper

Retrieval is Not Enough: Enhancing RAG through Test-Time Critique and Optimization

  • Jiaqi Wei
  • Hao Zhou
  • Xiang Zhang
  • Di Zhang
  • Zijie Qiu
  • Noah Wei
  • Jinzhe Li
  • Wanli Ouyang

Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). However, standard RAG pipelines often fail to ensure that model reasoning remains consistent with the retrieved evidence, leading to factual inconsistencies or unsupported conclusions. In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: Reasoning Misalignment, the divergence between an LLM's internal reasoning trajectory and the evidential constraints provided by retrieval. To address this issue, we propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We further introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations. At the heart of AlignRAG lies a contrastive critique synthesis mechanism that generates retrieval-sensitive critiques while mitigating self-bias. This mechanism trains a dedicated retrieval-augmented Critic Language Model (CLM) using labeled critiques that distinguish between evidence-aligned and misaligned reasoning. Empirical evaluations show that our approach significantly improves reasoning fidelity. Our 8B-parameter CLM improves performance over the Self-Refine baseline by 12.1% on out-of-domain tasks and outperforms a standard 72B-parameter CLM by 2.2%. Furthermore, AlignRAG-auto achieves this state-of-the-art performance while dynamically determining the optimal number of refinement steps, enhancing efficiency and usability. AlignRAG remains compatible with existing RAG architectures as a plug-and-play module and demonstrates strong robustness under both informative and noisy retrieval scenarios.
Overall, AlignRAG offers a principled solution for aligning model reasoning with retrieved evidence, substantially improving the factual reliability and robustness of RAG systems. Our source code is available at https://github.com/upup-wei/RAG-ReasonAlignment.

NeurIPS Conference 2025 Conference Paper

Retro-R1: LLM-based Agentic Retrosynthesis

  • Wei Liu
  • Jiangtao Feng
  • Hongli Yu
  • Yuxuan Song
  • Yuqiang Li
  • Shufei Zhang
  • Lei Bai
  • Wei-Ying Ma

Retrosynthetic planning is a fundamental task in chemical discovery. Due to the vast combinatorial search space, identifying viable synthetic routes remains a significant challenge, even for expert chemists. Recent advances in Large Language Models (LLMs), particularly when equipped with reinforcement learning, have demonstrated strong human-like reasoning and planning abilities, especially in mathematics and code problem solving. This raises a natural question: can the reasoning capabilities of LLMs be harnessed to develop an AI chemist capable of learning effective policies for multi-step retrosynthesis? In this study, we introduce Retro-R1, a novel LLM-based retrosynthesis agent trained via reinforcement learning to design molecular synthesis pathways. Unlike prior approaches, which typically rely on single-turn, question-answering formats, Retro-R1 interacts dynamically with plug-in single-step retrosynthesis tools and learns from environmental feedback. Experimental results show that Retro-R1 achieves a 55.79% pass@1 success rate, surpassing the previous state of the art by 8.95%. Notably, Retro-R1 demonstrates strong generalization to out-of-domain test cases, where existing methods tend to fail despite their high in-domain performance. Our work marks a significant step toward equipping LLMs with advanced, chemist-like reasoning abilities, highlighting the promise of reinforcement learning for enabling data-efficient, generalizable, and sophisticated scientific problem-solving in LLM-based agents.

NeurIPS Conference 2025 Conference Paper

ShortListing Model: A Streamlined Simplex Diffusion for Discrete Variable Generation

  • Yuxuan Song
  • Zhe Zhang
  • Yu Pei
  • Jingjing Gong
  • Qiying Yu
  • Zheng Zhang
  • Mingxuan Wang
  • Hao Zhou

Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex-based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier-free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, and character-level and large-vocabulary language modeling demonstrate the competitive performance and strong potential of SLM. Our code can be found at https://github.com/GenSI-THUAIR/SLM.

ICRA Conference 2025 Conference Paper

SLAM Assisted 3D Tracking System for Laparoscopic Surgery

  • Jingwei Song
  • Ray Zhang 0001
  • Wenwei Zhang
  • Hao Zhou
  • Maani Ghaffari

A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the ORB-SLAM2 monocular mode. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking, and the 3D shape is incorporated as a geometric prior in its pose graph optimization. In-vivo and ex-vivo experiments demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and “organ-background” relative motion.

JBHI Journal 2025 Journal Article

SleepHybridNet: A Lightweight Hybrid CNN-Transformer Model for Enhanced N1 Sleep Staging From Single-Channel EEG

  • Hao Zhou
  • Mengxiang Su
  • Jeng-Shyang Pan
  • Chenglong Dai
  • Ying Chen
  • Shu-Chuan Chu

This study introduces SleepHybridNet, a lightweight hybrid CNN-Transformer model designed to enhance the classification of non-rapid eye movement stage 1 (N1) sleep using single-channel electroencephalogram (EEG) signals. Accurate identification of the N1 stage is of critical importance in both sleep neuroscience and clinical practice. However, due to the ambiguous features of the N1 stage, current deep learning models still struggle to achieve satisfactory performance. To address these challenges, SleepHybridNet integrates multi-scale feature fusion and sequence modeling through a novel architecture. It consists of a Multi-Scale Convolutional Neural Network (MSCNN) module, a Transformer encoder, a spectral feature extraction unit, and a multi-task classifier. Experimental results based on the publicly available Sleep-EDF Expanded dataset demonstrate that SleepHybridNet outperforms existing methods in both classification accuracy and generalization capability. Specifically, the model achieves an overall accuracy of 88.2% and an F1-score of 0.633 for the N1 stage, showing superior performance particularly in underrepresented classes such as the N1 and N3 stages. With only 5.1 M parameters, the lightweight design of the model can enable practical deployment in clinical settings, bridging the gap between high-performance deep learning algorithms and practical applicability in sleep medicine. Future work may explore the integration of multimodal data from wearable sensors to further expand its use in diverse application scenarios.

IJCAI Conference 2024 Conference Paper

A Strategic Analysis of Prepayments in Financial Credit Networks

  • Hao Zhou
  • Yongzhao Wang
  • Konstantinos Varsos
  • Nicholas Bishop
  • Rahul Savani
  • Anisoara Calinescu
  • Michael Wooldridge

In financial credit networks, prepayments enable a firm to settle its debt obligations ahead of an agreed-upon due date. Prepayments have a transformative impact on the structure of networks, influencing the financial well-being (utility) of individual firms. This study investigates prepayments from both theoretical and empirical perspectives. We first establish the computational complexity of finding prepayments that maximize welfare, assuming global coordination among firms in the financial network. Subsequently, our focus shifts to understanding the strategic behavior of individual firms in the presence of prepayments. We introduce a prepayment game where firms strategically make prepayments, delineating the existence of pure strategy Nash equilibria and analyzing the price of anarchy (stability) within this game. Recognizing the computational challenges associated with determining Nash equilibria in prepayment games, we use a simulation-based approach, known as empirical game-theoretic analysis (EGTA). Through EGTA, we are able to find Nash equilibria among a carefully-chosen set of heuristic strategies. By scrutinizing the equilibrium behavior of firms, we outline the characteristics of high-performing strategies for strategic prepayments and establish connections between our empirical and theoretical findings.

NeurIPS Conference 2024 Conference Paper

AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario

  • Yuhan Li
  • Hao Zhou
  • Wenxiang Shang
  • Ran Lin
  • Xuanhong Chen
  • Bingbing Ni

While image-based virtual try-on has made significant strides, emerging approaches still fall short of delivering high-fidelity and robust fitting images across various scenarios, as their models suffer from ill-fitted garment styles and quality degradation during training, not to mention the lack of support for various combinations of attire. Therefore, we first propose a lightweight, scalable operator, known as Hydra Block, for attire combinations. This is achieved through a parallel attention mechanism that facilitates the feature injection of multiple garments from conditionally encoded branches into the main network. Secondly, to significantly enhance the model's robustness and expressiveness in real-world scenarios, we evolve its potential across diverse settings by synthesizing the residuals of multiple models, as well as implementing a mask region boost strategy to overcome the instability caused by information leakage in existing models. Equipped with the above design, AnyFit surpasses all baselines on high-resolution benchmarks and real-world data by a large margin, excelling in producing well-fitting garments replete with photorealistic and rich details. Furthermore, AnyFit’s impressive performance on high-fidelity virtual try-ons in any scenario from any image paves a new path for future research within the fashion community.

TMLR Journal 2024 Journal Article

Decentralized Decoupled Training for Federated Long-Tailed Learning

  • Wenkai Yang
  • Deli Chen
  • Hao Zhou
  • Fandong Meng
  • Jie Zhou
  • Xu Sun

In the real world, the data samples often follow a long-tailed distribution, which poses a great challenge for Federated Learning (FL). That is, when the data is decentralized and long-tailed, FL may produce a poorly-behaved global model that is severely biased towards the head classes with the majority of the training samples. To settle this issue, decoupled training has recently been introduced to FL. Decoupled training aims to re-balance the biased classifier after the normal instance-balanced training, and has achieved promising results in centralized long-tailed learning. The current study directly adopts the decoupled training idea on the server side by re-training the classifier on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. Unfortunately, this practice restricts the capacity of decoupled training in federated long-tailed learning, as the low-quality pseudo features lead to a sub-optimal classifier. In this work, motivated by the distributed characteristic of FL, we propose a decentralized decoupled training mechanism by leveraging the abundant real data stored locally on the clients. Specifically, we integrate the local real data with the global gradient prototypes to form the local balanced datasets, and thus re-balance the classifier during the local training. Furthermore, we introduce a supplementary classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings. Our code is available at https://github.com/keven980716/Federated_Learning_Experiments.

ICLR Conference 2024 Conference Paper

FROSTER: Frozen CLIP is A Strong Teacher for Open-Vocabulary Action Recognition

  • Xiaohu Huang
  • Hao Zhou
  • Kun Yao
  • Kai Han

In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster.
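The residual-on-top-of-frozen-features idea described above can be sketched in a few lines; this is an illustrative numpy toy (the linear residual, the weight `alpha`, and the plain L2 distillation loss are assumptions, not the paper's actual sub-network or objective):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_adapted_feature(frozen_feat, W, alpha=0.1):
    """Add a small learned residual on top of the frozen teacher feature.

    Because the adapted feature is frozen + residual, distilling it back
    toward the frozen CLIP feature only has to shrink the residual rather
    than re-learn the whole representation. (Illustrative linear residual;
    the paper uses a deeper residual sub-network.)
    """
    residual = frozen_feat @ W            # learned video-specific component
    return frozen_feat + alpha * residual

def distillation_loss(adapted_feat, frozen_feat):
    """L2 feature-matching loss toward the frozen teacher."""
    return float(np.mean((adapted_feat - frozen_feat) ** 2))

frozen = rng.standard_normal((4, 512))    # e.g. 4 clips, 512-d CLIP features
W = 0.01 * rng.standard_normal((512, 512))
adapted = residual_adapted_feature(frozen, W)
loss = distillation_loss(adapted, frozen)
```

In training, this distillation term would be added to the action-recognition task loss, so the residual stays small unless the video task genuinely needs it.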

JBHI Journal 2024 Journal Article

Generating Biomedical Hypothesis With Spatiotemporal Transformers

  • Huiwei Zhou
  • Lanlan Wang
  • Weihong Yao
  • Wenchu Li
  • Hao Zhou
  • Hongyun Zeng

Generating biomedical hypotheses is a difficult task as it requires uncovering the implicit associations between massive scientific terms from a large body of published literature. A recent line of Hypothesis Generation (HG) approaches - temporal graph-based approaches - have shown great success in modeling temporal evolution of term-pair relationships. However, these approaches model the temporal evolution of each term or term-pair with Recurrent Neural Network (RNN) independently, which neglects the rich covariation among all terms or term-pairs while ignoring direct dependencies between any two timesteps in a temporal sequence. To address this problem, we propose a Spatiotemporal Transformer-based Hypothesis Generation (STHG) method to interleave spatial covariation and temporal progression in a unified framework for constructing direct connections between any two term-pairs while modeling the temporal relevance between any two timesteps. Experiments on three biomedical relationship datasets show that STHG outperforms the state-of-the-art methods.

AAAI Conference 2024 Conference Paper

Intentional Evolutionary Learning for Untrimmed Videos with Long Tail Distribution

  • Yuxi Zhou
  • Xiujie Wang
  • Jianhua Zhang
  • Jiajia Wang
  • Jie Yu
  • Hao Zhou
  • Yi Gao
  • Shengyong Chen

Human intention understanding in untrimmed videos aims to watch a natural video and predict what the person’s intention is. Currently, exploration of predicting human intentions in untrimmed videos is far from enough. On the one hand, untrimmed videos with mixed actions and backgrounds have a significant long-tail distribution with concept drift characteristics. On the other hand, most methods can only perceive instantaneous intentions, but cannot determine the evolution of intentions. To solve the above challenges, we propose a loss based on Instance Confidence and Class Accuracy (ICCA), which aims to alleviate the prediction bias caused by the long-tail distribution with concept drift characteristics in video streams. In addition, we propose an intention-oriented evolutionary learning method to determine the intention evolution pattern (from what action to what action) and the time of evolution (when the action evolves). We conducted extensive experiments on two untrimmed video datasets (THUMOS14 and ActivityNET v1.3), and our method has achieved excellent results compared to SOTA methods. The code and supplementary materials are available at https://github.com/Jennifer123www/UntrimmedVideo.

ICLR Conference 2024 Conference Paper

Multimodal Molecular Pretraining via Modality Blending

  • Qiying Yu
  • Yudi Zhang 0008
  • Yuyan Ni
  • Shikun Feng
  • Yanyan Lan
  • Hao Zhou
  • Jingjing Liu

Self-supervised learning has recently gained growing interest in molecular modeling for scientific tasks such as AI-assisted drug discovery. Current studies consider leveraging both 2D and 3D molecular structures for representation learning. However, relying on straightforward alignment strategies that treat each modality separately, these methods fail to exploit the intrinsic correlation between 2D and 3D representations that reflect the underlying structural characteristics of molecules, and only perform coarse-grained molecule-level alignment. To derive fine-grained alignment and promote structural molecule understanding, we introduce an atomic-relation level "blend-then-predict" self-supervised learning approach, MoleBLEND, which first blends atom relations represented by different modalities into one unified relation matrix for joint encoding, then recovers modality-specific information for 2D and 3D structures individually. By treating atom relationships as anchors, MoleBLEND organically aligns and integrates visually dissimilar 2D and 3D modalities of the same molecule at fine-grained atomic level, painting a more comprehensive depiction of each molecule. Extensive experiments show that MoleBLEND achieves state-of-the-art performance across major 2D/3D molecular benchmarks. We further provide theoretical insights from the perspective of mutual-information maximization, demonstrating that our method unifies contrastive, generative (cross-modality prediction) and mask-then-predict (single-modality prediction) objectives into one single cohesive framework.

NeurIPS Conference 2024 Conference Paper

MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering

  • Yizhen Luo
  • Zikun Nie
  • Massimo Hong
  • Suyuan Zhao
  • Hao Zhou
  • Zaiqing Nie

Studying protein mutations within amino acid sequences holds tremendous significance in life sciences. Protein language models (PLMs) have demonstrated strong capabilities in broad biological applications. However, due to architectural design and lack of supervision, PLMs model mutations implicitly with evolutionary plausibility, which is not satisfactory to serve as explainable and engineerable tools in real-world studies. To address these issues, we present MutaPLM, a unified framework for interpreting and navigating protein mutations with protein language models. MutaPLM introduces a protein delta network that captures explicit protein mutation representations within a unified feature space, and a transfer learning pipeline with a chain-of-thought (CoT) strategy to harvest protein mutation knowledge from biomedical texts. We also construct MutaDescribe, the first large-scale protein mutation dataset with rich textual annotations, which provides cross-modal supervision signals. Through comprehensive experiments, we demonstrate that MutaPLM excels at providing human-understandable explanations for mutational effects and prioritizing novel mutations with desirable properties. Our code, model, and data are open-sourced at https://github.com/PharMolix/MutaPLM.

NeurIPS Conference 2024 Conference Paper

P$^2$C$^2$Net: PDE-Preserved Coarse Correction Network for efficient prediction of spatiotemporal dynamics

  • Qi Wang
  • Pu Ren
  • Hao Zhou
  • Xin-Yang Liu
  • Zhiwen Deng
  • Yi Zhang
  • Ruizhi Chengze
  • Hongsheng Liu

When solving partial differential equations (PDEs), classical numerical methods often require fine mesh grids and small time stepping to meet stability, consistency, and convergence conditions, leading to high computational cost. Recently, machine learning has been increasingly utilized to solve PDE problems, but they often encounter challenges related to interpretability, generalizability, and strong dependency on rich labeled data. Hence, we introduce a new PDE-Preserved Coarse Correction Network (P$^2$C$^2$Net) to efficiently solve spatiotemporal PDE problems on coarse mesh grids in small data regimes. The model consists of two synergistic modules: (1) a trainable PDE block that learns to update the coarse solution (i.e., the system state), based on a high-order numerical scheme with boundary condition encoding, and (2) a neural network block that consistently corrects the solution on the fly. In particular, we propose a learnable symmetric Conv filter, with weights shared over the entire model, to accurately estimate the spatial derivatives of PDE based on the neural-corrected system state. The resulting physics-encoded model is capable of handling limited training data (e.g., 3–5 trajectories) and accelerates the prediction of PDE solutions on coarse spatiotemporal grids while maintaining a high accuracy. P$^2$C$^2$Net achieves consistent state-of-the-art performance with over 50% gain (e.g., in terms of relative prediction error) across four datasets covering complex reaction-diffusion processes and turbulent flows.
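The core mechanic here, estimating spatial derivatives with a convolution stencil on a grid, can be illustrated with a fixed 5-point Laplacian in numpy. This is only a sketch of the stencil idea under assumed periodic boundaries; in P$^2$C$^2$Net the filter weights are learnable and constrained to be symmetric:

```python
import numpy as np

def laplacian_5pt(u, dx):
    """5-point central-difference Laplacian on a periodic 2D grid.

    The same stencil can be written as a 3x3 convolution kernel, which
    is the kind of filter (here fixed; in the paper learnable and
    symmetric) used to estimate spatial derivatives of the PDE state.
    """
    return (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0) +
            np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) - 4 * u) / dx**2

# Sanity check against an analytic case: for u = sin(x) + sin(y),
# the exact Laplacian is -u.
n = 128
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u = np.sin(X) + np.sin(Y)
lap = laplacian_5pt(u, dx=x[1] - x[0])   # should be close to -u
```

Making such stencil weights trainable (while keeping them symmetric) is what lets a model correct coarse-grid derivative estimates from data.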

ICML Conference 2024 Conference Paper

Reducing Balancing Error for Causal Inference via Optimal Transport

  • Yuguang Yan
  • Hao Zhou
  • Zeqin Yang
  • Weilin Chen 0001
  • Ruichu Cai
  • Zhifeng Hao

Most studies on causal inference tackle the issue of confounding bias by reducing the distribution shift between the control and treated groups. However, it remains an open question to adopt an appropriate metric for distribution shift in practice. In this paper, we define a generic balancing error on reweighted samples to characterize the confounding bias, and study the connection between the balancing error and the Wasserstein discrepancy derived from the theory of optimal transport. We not only regard the Wasserstein discrepancy as the metric of distribution shift, but also explore the association between the balancing error and the underlying cost function involved in the Wasserstein discrepancy. Motivated by this, we propose to reduce the balancing error under the framework of optimal transport with learnable marginal distributions and the cost function, which is implemented by jointly learning weights and representations associated with factual outcomes. The experiments on both synthetic and real-world datasets demonstrate the effectiveness of our proposed method.

JAIR Journal 2024 Journal Article

Selfishly Prepaying in Financial Credit Networks

  • Hao Zhou
  • Yongzhao Wang
  • Konstantinos Varsos
  • Nicholas Bishop
  • Rahul Savani
  • Anisoara Calinescu
  • Michael Wooldridge

In financial credit networks, prepayments enable a firm to settle its debt obligations ahead of an agreed-upon due date. Prepayments have a transformative impact on the structure of networks, influencing the financial well-being (utility) of individual firms. This study investigates prepayments from both theoretical and empirical perspectives. We first establish the computational complexity of finding prepayments that maximize welfare, assuming global coordination among firms in the financial network. Subsequently, our focus shifts to understanding the strategic behavior of individual firms in the presence of prepayments. We introduce a prepayment game where firms strategically make prepayments, delineating the existence of pure strategy Nash equilibria and analyzing the price of anarchy (stability) within this game. Recognizing the computational challenges associated with determining Nash equilibria in prepayment games, we use a simulation-based approach, known as empirical game-theoretic analysis (EGTA). Through EGTA, we are able to find Nash equilibria among a carefully-chosen set of heuristic strategies. By examining the equilibrium behavior of firms, we outline the characteristics of high-performing strategies for strategic prepayments and establish connections between our empirical and theoretical findings.

ICML Conference 2024 Conference Paper

Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models

  • Tanmay Gautam
  • Youngsuk Park
  • Hao Zhou
  • Parameswaran Raman
  • Wooseok Ha

Fine-tuning language models (LMs) has demonstrated success in a wide array of downstream tasks. However, as LMs are scaled up, the memory requirements for backpropagation become prohibitively high. Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate gradients. More recently, MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning when combined with suitable task prompts. In this work, we couple ZO methods with variance reduction techniques to enhance stability and convergence for inference-based LM fine-tuning. We introduce Memory-Efficient Zeroth-Order Stochastic Variance-Reduced Gradient (MeZO-SVRG) and demonstrate its efficacy across multiple LM fine-tuning tasks, eliminating the reliance on task-specific prompts. Evaluated across a range of both masked and autoregressive LMs on benchmark GLUE tasks, MeZO-SVRG outperforms MeZO with up to a 20% increase in test accuracies in both full- and partial-parameter fine-tuning settings. MeZO-SVRG benefits from reduced computation time as it often surpasses MeZO’s peak test accuracy with a $2\times$ reduction in GPU-hours. MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD, i.e., by $2\times$ for autoregressive models. Our experiments highlight that MeZO-SVRG’s memory savings progressively improve compared to SGD with larger batch sizes.
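The forward-passes-only gradient estimate that MeZO-style methods build on can be sketched as a two-point (SPSA-style) estimator; the function names and the toy quadratic loss below are illustrative, not the paper's implementation:

```python
import numpy as np

def zo_gradient(loss_fn, theta, eps=1e-3, seed=0):
    """Two-point zeroth-order gradient estimate.

    Perturbs all parameters along one shared random direction z and uses
    two forward passes, and no backpropagation, to estimate the gradient:
    g = (L(theta + eps*z) - L(theta - eps*z)) / (2*eps) * z.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)      # shared random direction
    loss_plus = loss_fn(theta + eps * z)      # forward pass 1
    loss_minus = loss_fn(theta - eps * z)     # forward pass 2
    return (loss_plus - loss_minus) / (2 * eps) * z

# Illustrative use on a quadratic, where the true gradient is 2*theta.
theta = np.array([1.0, -2.0, 0.5])
g = zo_gradient(lambda t: float(np.sum(t ** 2)), theta)
theta_new = theta - 0.1 * g                   # a plain ZO-SGD step
```

The single-sample estimate is unbiased in expectation but high-variance, which is exactly the gap that the SVRG-style variance reduction in MeZO-SVRG targets.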

TMLR Journal 2024 Journal Article

VideoGLUE: Video General Understanding Evaluation of Foundation Models

  • Liangzhe Yuan
  • Nitesh Bharadwaj Gundavarapu
  • Long Zhao
  • Hao Zhou
  • Yin Cui
  • Lu Jiang
  • Xuan Yang
  • Menglin Jia

We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs’ efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data mainly contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released under: https://github.com/tensorflow/models/tree/master/official/projects/videoglue

AAMAS Conference 2023 Conference Paper

Debt Transfers in Financial Networks: Complexity and Equilibria

  • Panagiotis Kanellopoulos
  • Maria Kyropoulou
  • Hao Zhou

We consider the operation of debt transfer in interbank networks. In particular, assuming a financial system that is represented by a network of banks and their bilateral debt contracts, we consider the setting where a bank can transfer the right to claim a debt to one of its lenders, under some assumptions. Perhaps surprisingly, such an operation can benefit the banks involved, and potentially the entire network as well, in terms of maximizing natural objectives related to financial well-being, like total assets and equity. We consider debt transfers in both a centralized and a distributed (game-theoretic) setting. First, we examine the computational complexity of computing debt transfer combinations that maximize total payments or total equity, or satisfy other desirable properties. We then study debt transfer operations from a game-theoretic standpoint. We formally define games that emerge when banks can be strategic about choosing whether or not to transfer their debt claims. We prove theoretical results on the existence and quality of pure Nash equilibria in debt transfer games, as well as the computational complexity of relevant problems. We complement our theoretical study with an empirical analysis involving different heuristics about computing debt transfer combinations, as well as game-playing dynamics of debt transfer operations on synthetic data.

AAAI Conference 2023 Conference Paper

Direct Heterogeneous Causal Learning for Resource Allocation Problems in Marketing

  • Hao Zhou
  • Shaoming Li
  • Guibin Jiang
  • Jiaqi Zheng
  • Dong Wang

Marketing is an important mechanism to increase user engagement and improve platform revenue, and heterogeneous causal learning can help develop more effective strategies. Most decision-making problems in marketing can be formulated as resource allocation problems and have been studied for decades. Existing works usually divide the solution procedure into two fully decoupled stages, i.e., machine learning (ML) and operation research (OR): the first stage predicts the model parameters, which are fed to the optimization in the second stage. However, the error of the parameters predicted in ML is not accounted for in OR, and the series of complex mathematical operations in OR leads to increased accumulated error. Essentially, improved precision of the predicted parameters may not correlate positively with the quality of the final solution, due to the side-effects of the decoupled design. In this paper, we propose a novel approach for solving resource allocation problems to mitigate these side-effects. Our key intuition is that we introduce the decision factor to establish a bridge between ML and OR such that the solution can be directly obtained in OR by only performing sorting or comparison operations on the decision factor. Furthermore, we design a customized loss function that can conduct direct heterogeneous causal learning on the decision factor, an unbiased estimation of which can be guaranteed when the loss converges. As a case study, we apply our approach to two crucial problems in marketing: the binary treatment assignment problem and the budget allocation problem with multiple treatments. Both large-scale simulations and online A/B tests demonstrate that our approach achieves significant improvement compared with the state of the art.
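The "solve the OR stage by sorting on a learned decision factor" idea can be sketched as a greedy budgeted allocation; the decision-factor values, per-user costs, and budget below are made up for illustration and are not from the paper:

```python
import numpy as np

def allocate_by_decision_factor(decision_factor, cost, budget):
    """Greedy allocation by sorting on a per-user decision factor
    (e.g. predicted incremental value per unit cost): treat users in
    descending order of the factor until the budget runs out.
    """
    order = np.argsort(-decision_factor)   # best candidates first
    chosen, spent = [], 0.0
    for i in order:
        if spent + cost[i] <= budget:      # skip users we cannot afford
            chosen.append(i)
            spent += cost[i]
    return np.array(chosen), spent

# Hypothetical learned factors for four users.
df = np.array([0.9, 0.2, 0.5, 0.7])
cost = np.array([1.0, 1.0, 1.0, 2.0])
chosen, spent = allocate_by_decision_factor(df, cost, budget=2.0)
# Users 0 and 2 are selected: user 3 ranks second but exceeds the
# remaining budget.
```

Once the decision factor is learned directly (rather than composing separate ML predictions through a solver), the allocation step reduces to this kind of sort-and-compare procedure.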

NeurIPS Conference 2023 Conference Paper

Equivariant Flow Matching with Hybrid Probability Transport for 3D Molecule Generation

  • Yuxuan Song
  • Jingjing Gong
  • Minkai Xu
  • Ziyao Cao
  • Yanyan Lan
  • Stefano Ermon
  • Hao Zhou
  • Wei-Ying Ma

The generation of 3D molecules requires simultaneously deciding the categorical features (atom types) and continuous features (atom coordinates). Deep generative models, especially Diffusion Models (DMs), have demonstrated effectiveness in generating feature-rich geometries. However, existing DMs typically suffer from unstable probability dynamics with inefficient sampling speed. In this paper, we introduce geometric flow matching, which enjoys the advantages of both equivariant modeling and stabilized probability dynamics. More specifically, we propose a hybrid probability path where the coordinates probability path is regularized by an equivariant optimal transport, and the information between different modalities is aligned. Experimentally, the proposed method could consistently achieve better performance on multiple molecule generation benchmarks with a 4.75$\times$ speedup of sampling on average.

NeurIPS Conference 2023 Conference Paper

Fed-FA: Theoretically Modeling Client Data Divergence for Federated Language Backdoor Defense

  • Zhiyuan Zhang
  • Deli Chen
  • Hao Zhou
  • Fandong Meng
  • Jie Zhou
  • Xu Sun

Federated learning algorithms enable neural network models to be trained across multiple decentralized edge devices without sharing private data. However, they are susceptible to backdoor attacks launched by malicious clients. Existing robust federated aggregation algorithms heuristically detect and exclude suspicious clients based on their parameter distances, but they are ineffective on Natural Language Processing (NLP) tasks. The main reason is that, although text backdoor patterns are obvious at the underlying dataset level, they are usually hidden at the parameter level, since injecting backdoors into texts with discrete feature space has less impact on the statistics of the model parameters. To settle this issue, we propose to identify backdoor clients by explicitly modeling the data divergence among clients in federated NLP systems. Through theoretical analysis, we derive the f-divergence indicator to estimate the client data divergence with aggregation updates and Hessians. Furthermore, we devise a dataset synthesization method with a Hessian reassignment mechanism guided by the diffusion theory to address the key challenge of inaccessible datasets in calculating clients' data Hessians. We then present the novel Federated F-Divergence-Based Aggregation (Fed-FA) algorithm, which leverages the f-divergence indicator to detect and discard suspicious clients. Extensive empirical results show that Fed-FA outperforms all the parameter distance-based methods in defending against backdoor attacks among various natural language backdoor attack scenarios.

NeurIPS Conference 2023 Conference Paper

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

  • Junkun Yuan
  • Xinyu Zhang
  • Hao Zhou
  • Jian Wang
  • Zhongwei Qiu
  • Zhiyin Shao
  • Shaofeng Zhang
  • Sifan Long

Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior - human parts - into pre-training. Specifically, we employ this prior to guide the mask sampling process. Image patches, corresponding to human part regions, have high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image. We term the entire method as HAP. HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.
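The part-guided mask sampling idea can be sketched as weighted sampling without replacement over image patches; the priority weight, mask ratio, and patch layout below are assumptions for illustration, not the paper's exact scheme:

```python
import numpy as np

def part_guided_mask(num_patches, part_patch_ids, mask_ratio=0.75, seed=0):
    """Sample patch indices to mask, giving human-part patches higher
    priority, in the spirit of HAP's part-guided mask sampling.

    (Illustrative: the 4x priority boost is an assumed weight.)
    """
    rng = np.random.default_rng(seed)
    weights = np.ones(num_patches)
    weights[part_patch_ids] = 4.0          # boost body-part patches
    weights /= weights.sum()               # normalize to a distribution
    n_mask = int(mask_ratio * num_patches)
    return rng.choice(num_patches, size=n_mask, replace=False, p=weights)

# 196 patches (a 14x14 ViT grid); pretend patches 50-99 cover body parts.
masked = part_guided_mask(196, np.arange(50, 100))
```

Because part patches carry more sampling weight, they end up masked at a higher rate than background patches, pushing the encoder to reconstruct body structure rather than background texture.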

ICLR Conference 2023 Conference Paper

Learning Harmonic Molecular Representations on Riemannian Manifold

  • Yiqun Wang
  • Yuning Shen
  • Shi Chen 0003
  • Lihao Wang
  • Fei Ye
  • Hao Zhou

Molecular representation learning plays a crucial role in AI-assisted drug discovery research. Encoding 3D molecular structures through Euclidean neural networks has become the prevailing method in the geometric deep learning community. However, the equivariance constraints and message passing in Euclidean space may limit the network's expressive power. In this work, we propose a Harmonic Molecular Representation learning (HMR) framework, which represents a molecule using the Laplace-Beltrami eigenfunctions of the molecular surface. HMR offers a multi-resolution representation of molecular geometric and chemical properties on a 2D Riemannian manifold. We also introduce a harmonic message passing method to realize efficient spectral message passing over the surface manifold for better molecular encoding. Our proposed method shows comparable predictive power to current models in small-molecule property prediction, and outperforms state-of-the-art deep learning models on the rigid protein docking challenge, demonstrating its versatility in molecular representation learning.

AAMAS Conference 2022 Conference Paper

Forgiving Debt in Financial Network Games

  • Panagiotis Kanellopoulos
  • Maria Kyropoulou
  • Hao Zhou

A financial system is represented by a network, where nodes correspond to banks, and directed labeled edges correspond to debt contracts between banks. Once a payment schedule has been defined, where we assume that a bank cannot refuse a payment towards one of its lenders if it has sufficient funds, the liquidity of the system is defined as the sum of total payments made in the network. Maximizing systemic liquidity is a natural objective of any financial authority, so, we study the setting where the financial authority offers bailout money to some bank(s) or forgives the debts of others in order to maximize liquidity, and examine efficient ways to achieve this. We investigate the approximation ratio provided by the greedy bailout policy compared to the optimal one, and we study the computational hardness of finding the optimal debt removal and budget-constrained optimal bailout policy, respectively. We also study financial systems from a game-theoretic standpoint. We observe that the removal of some incoming debt might be in the best interest of a bank, if that helps one of its borrowers remain solvent and avoid costs related to default. Assuming that a bank's well-being (i.e., utility) is aligned with the incoming payments they receive from the network, we define and analyze a game among banks who want to maximize their utility by strategically giving up some incoming payments. In addition, we extend the previous game by considering bailout payments. After formally defining the above games, we prove results about the existence and quality of pure Nash equilibria, as well as the computational complexity of finding such equilibria.
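For intuition on "systemic liquidity as the sum of payments", here is a minimal sketch of a pro-rata payment-clearing computation in the classic Eisenberg-Noe style. The abstract does not pin down this exact payment model, so the pro-rata rule, the fixed-point iteration, and all names below are illustrative assumptions.

```python
def clearing_payments(liabilities, external, iters=100):
    """Fixed-point computation of clearing payments in a financial network.

    liabilities[i][j] is the debt bank i owes bank j; external[i] is bank i's
    outside assets. Each bank pays min(what it owes, what it holds), split
    pro rata among its lenders. Systemic liquidity is the sum of payments.
    """
    n = len(external)
    total_owed = [sum(liabilities[i]) for i in range(n)]
    p = total_owed[:]  # start optimistically: every bank pays in full
    for _ in range(iters):
        new_p = []
        for i in range(n):
            # incoming payments: each payer j sends i its pro-rata share
            incoming = sum(
                p[j] * (liabilities[j][i] / total_owed[j])
                for j in range(n) if total_owed[j] > 0
            )
            new_p.append(min(total_owed[i], external[i] + incoming))
        p = new_p
    return p

# Two banks: bank 0 owes bank 1 ten units but holds only 4 in outside assets.
L = [[0, 10], [0, 0]]
p = clearing_payments(L, external=[4, 0])
liquidity = sum(p)
```

A bailout (raising `external[i]`) or a debt forgiveness (zeroing an entry of `liabilities`) changes the resulting `liquidity`, which is the quantity the policies in the paper optimize.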

IJCAI Conference 2022 Conference Paper

Forgiving Debt in Financial Network Games

  • Panagiotis Kanellopoulos
  • Maria Kyropoulou
  • Hao Zhou

We consider financial networks, where nodes correspond to banks and directed labeled edges correspond to debt contracts between banks. Maximizing systemic liquidity, i.e., the total money flow, is a natural objective of any financial authority. In particular, the financial authority may offer bailout money to some bank(s) or forgive the debts of others in order to maximize liquidity, and we examine efficient ways to achieve this. We study the computational hardness of finding the optimal debt-removal and budget-constrained optimal bailout policy, respectively, and we investigate the approximation ratio provided by the greedy bailout policy compared to the optimal one. We also study financial systems from a game-theoretic standpoint. We observe that the removal of some incoming debt might be in the best interest of a bank. Assuming that a bank's well-being (i.e., utility) is aligned with the incoming payments they receive from the network, we define and analyze a game among banks who want to maximize their utility by strategically giving up some incoming payments. In addition, we extend the previous game by considering bailout payments. After formally defining the above games, we prove results about the existence and quality of pure Nash equilibria, as well as the computational complexity of finding such equilibria.

AAAI Conference 2022 Conference Paper

LOREN: Logic-Regularized Reasoning for Interpretable Fact Verification

  • Jiangjie Chen
  • Qiaoben Bao
  • Changzhi Sun
  • Xinbo Zhang
  • Jiaze Chen
  • Hao Zhou
  • Yanghua Xiao
  • Lei Li

Given a natural language statement, how do we verify its veracity against a large-scale textual knowledge source like Wikipedia? Most existing neural models make predictions without giving clues about which part of a false claim goes wrong. In this paper, we propose LOREN, an approach for interpretable fact verification. We decompose the verification of the whole claim to the phrase level, where the veracity of the phrases serves as explanations and can be aggregated into the final verdict according to logical rules. The key insight of LOREN is to represent claim phrase veracity as three-valued latent variables, which are regularized by aggregation logical rules. The final claim verification is based on all latent variables. Thus, LOREN enjoys the additional benefit of interpretability: it is easy to explain how it reaches certain results with claim phrase veracity. Experiments on a public fact verification benchmark show that LOREN is competitive against previous approaches while enjoying the merit of faithful and accurate interpretability. The resources of LOREN are available at: https://github.com/jiangjiechen/LOREN.
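The aggregation of three-valued phrase veracities into a claim verdict can be illustrated with a natural rule set. Note the paper treats these veracities as latent variables regularized by logical rules during training; the hard rules and label names below are a simplified, hypothetical stand-in.

```python
def aggregate_verdict(phrase_veracities):
    """Aggregate three-valued phrase veracities ("SUP", "REF", "NEI") into a
    claim-level verdict: any refuted phrase refutes the whole claim, and the
    claim is supported only if every phrase is supported."""
    if any(v == "REF" for v in phrase_veracities):
        return "REFUTED"
    if all(v == "SUP" for v in phrase_veracities):
        return "SUPPORTED"
    return "NOT ENOUGH INFO"

verdict = aggregate_verdict(["SUP", "NEI", "SUP"])
```

Because the verdict is a function of per-phrase veracities, the phrase labels double as an explanation of which part of a false claim went wrong.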

AAAI Conference 2022 Conference Paper

Non-autoregressive Translation with Layer-Wise Prediction and Deep Supervision

  • Chenyang Huang
  • Hao Zhou
  • Osmar R. Zaïane
  • Lili Mou
  • Lei Li

How do we perform efficient inference while retaining high translation quality? Existing neural machine translation models, such as Transformer, achieve high performance, but they decode words one by one, which is inefficient. Recent non-autoregressive translation models speed up the inference, but their quality is still inferior. In this work, we propose DSLP, a highly efficient and high-performance model for machine translation. The key insight is to train a non-autoregressive Transformer with Deep Supervision and feed additional Layer-wise Predictions. We conducted extensive experiments on four translation tasks (both directions of WMT’14 EN–DE and WMT’16 EN–RO). Results show that our approach consistently improves the BLEU scores compared with respective base models. Specifically, our best variant outperforms the autoregressive model on three translation tasks, while being 14.8 times more efficient in inference.

NeurIPS Conference 2022 Conference Paper

Regularized Molecular Conformation Fields

  • Lihao Wang
  • Yi Zhou
  • Yiqun Wang
  • Xiaoqing Zheng
  • Xuanjing Huang
  • Hao Zhou

Predicting energetically favorable 3-dimensional conformations of organic molecules from molecular graph plays a fundamental role in computer-aided drug discovery research. However, effectively exploring the high-dimensional conformation space to identify (meta) stable conformers is anything but trivial. In this work, we introduce RMCF, a novel framework to generate a diverse set of low-energy molecular conformations through sampling from a regularized molecular conformation field. We develop a data-driven molecular segmentation algorithm to automatically partition each molecule into several structural building blocks to reduce the modeling degrees of freedom. Then, we employ a Markov Random Field to learn the joint probability distribution of fragment configurations and inter-fragment dihedral angles, which enables us to sample from different low-energy regions of a conformation space. Our model constantly outperforms state-of-the-art models for the conformation generation task on the GEOM-Drugs dataset. We attribute the success of RMCF to modeling in a regularized feature space and learning a global fragment configuration distribution for effective sampling. The proposed method could be generalized to deal with larger biomolecular systems.

AAAI Conference 2022 Conference Paper

Unsupervised Editing for Counterfactual Stories

  • Jiangjie Chen
  • Chun Gan
  • Sijie Cheng
  • Hao Zhou
  • Yanghua Xiao
  • Lei Li

Creating what-if stories requires reasoning about prior statements and possible outcomes of the changed conditions. One can easily generate coherent endings under new conditions, but it would be challenging for current systems to do it with minimal changes to the original story. Therefore, one major challenge is the trade-off between generating a logical story and rewriting with minimal edits. In this paper, we propose EDUCAT, an editing-based unsupervised approach for counterfactual story rewriting. EDUCAT includes a target position detection strategy based on estimating causal effects of the what-if conditions, which keeps the causally invariant parts of the story. EDUCAT then generates the stories under fluency, coherence, and minimal-edit constraints. We also propose a new metric to alleviate the shortcomings of current automatic metrics and better evaluate the trade-off. We evaluate EDUCAT on a public counterfactual story rewriting benchmark. Experiments show that EDUCAT achieves the best trade-off over unsupervised SOTA methods according to both automatic and human evaluation. The resources of EDUCAT are available at: https://github.com/jiangjiechen/EDUCAT.

NeurIPS Conference 2022 Conference Paper

Zero-Shot 3D Drug Design by Sketching and Generating

  • Siyu Long
  • Yi Zhou
  • Xinyu Dai
  • Hao Zhou

Drug design is a crucial step in the drug discovery cycle. Recently, various deep learning-based methods design drugs by generating novel molecules from scratch, avoiding traversing large-scale drug libraries. However, they depend on scarce experimental data or time-consuming docking simulation, leading to overfitting issues with limited training data and slow generation speed. In this study, we propose the zero-shot drug design method DESERT (Drug dEsign by SkEtching and geneRaTing). Specifically, DESERT splits the design process into two stages, sketching and generating, and bridges them with the molecular shape. The two-stage fashion enables our method to utilize a large-scale molecular database to reduce the need for experimental data and docking simulation. Experiments show that DESERT achieves new state-of-the-art performance at a fast generation speed.

AAAI Conference 2021 Conference Paper

ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization

  • Xunpeng Huang
  • Runxin Xu
  • Hao Zhou
  • Zhe Wang
  • Zhengyang Liu
  • Lei Li

Stochastic gradient descent (SGD) is a widely used method for its outstanding generalization ability and simplicity. Adaptive gradient methods have been proposed to further accelerate the optimization process. In this paper, we revisit existing adaptive gradient optimization methods with a new interpretation. This new perspective leads to a refreshed understanding of the roles of second moments in stochastic optimization. Based on this, we propose the Angle-Calibrated Moment method (ACMo), a novel stochastic optimization method. It enjoys the benefits of second moments with only first-moment updates. Theoretical analysis shows that ACMo is able to achieve the same convergence rate as mainstream adaptive methods. Experiments on a variety of CV and NLP tasks demonstrate that ACMo has comparable convergence to state-of-the-art Adam-type optimizers, and even better generalization performance in most cases. The code is available at https://github.com/Xunpeng746/ACMo.

AAAI Conference 2021 Conference Paper

Consecutive Decoding for Speech-to-text Translation

  • Qianqian Dong
  • Mingxuan Wang
  • Hao Zhou
  • Shuang Xu
  • Bo Xu
  • Lei Li

Speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate source transcript and target translation text with a single decoder. It benefits the model training so that additional large parallel text corpus can be fully exploited to enhance the speech translation training. Our method is verified on three mainstream datasets, including the Augmented LibriSpeech English-French dataset, TED English-German dataset, and TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms the previous state-of-the-art methods. The code is available at https://github.com/dqqcasia/st.

AAAI Conference 2021 Conference Paper

Deep Graph-neighbor Coherence Preserving Network for Unsupervised Cross-modal Hashing

  • Jun Yu
  • Hao Zhou
  • Yibing Zhan
  • Dacheng Tao

Unsupervised cross-modal hashing (UCMH) has become a hot topic recently. Current UCMH focuses on exploring data similarities. However, current UCMH methods calculate the similarity between two data items mainly from the items' cross-modal features. These methods suffer from an inaccurate-similarity problem that results in a suboptimal retrieval Hamming space, because cross-modal features alone are insufficient to describe complex data relationships, such as situations where two items have different feature representations but share the same inherent concepts. In this paper, we devise a deep graph-neighbor coherence preserving network (DGCPN). Specifically, DGCPN stems from graph models and explores graph-neighbor coherence by consolidating the information between data and their neighbors. DGCPN regulates comprehensive similarity-preserving losses by exploiting three types of data similarities (i.e., graph-neighbor coherence, coexistent similarity, and intra- and inter-modality consistency) and designs a half-real and half-binary optimization strategy to reduce quantization errors during hashing. Essentially, DGCPN addresses the inaccurate-similarity problem by exploring and exploiting the data's intrinsic relationships in a graph. We conduct extensive experiments on three public UCMH datasets. The experimental results demonstrate the superiority of DGCPN, e.g., improving the mean average precision from 0.722 to 0.751 on MIRFlickr-25K when using 64-bit hashing codes to retrieve texts from images. We will release the source code package and the trained model at https://github.com/Atmegal/DGCPN.

NeurIPS Conference 2021 Conference Paper

Duplex Sequence-to-Sequence Learning for Reversible Machine Translation

  • Zaixiang Zheng
  • Hao Zhou
  • Shujian Huang
  • Jiajun Chen
  • Jingjing Xu
  • Lei Li

Sequence-to-sequence learning naturally has two directions. How can we effectively utilize supervision signals from both directions? Existing approaches either require two separate models, or a single multitask-learned model with inferior performance. In this paper, we propose REDER (Reversible Duplex Transformer), a parameter-efficient model, and apply it to machine translation. Either end of REDER can simultaneously input and output a distinct language. Thus REDER enables reversible machine translation by simply flipping the input and output ends. Experiments verify that REDER achieves the first success of reversible machine translation, outperforming its multitask-trained baselines by up to 1.3 BLEU.

AAAI Conference 2021 Conference Paper

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

  • Qianqian Dong
  • Rong Ye
  • Mingxuan Wang
  • Hao Zhou
  • Shuang Xu
  • Bo Xu
  • Lei Li

An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs the text in a target language. Existing methods are limited by the amount of parallel corpus. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals to decouple the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT utilizes a pre-trained BERT model to enforce the upper encoder to produce as much semantic information as possible, without extra data. We perform experiments on a diverse set of speech translation benchmarks, including Librispeech English-French, IWSLT English-German and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.

AAAI Conference 2020 Conference Paper

Importance-Aware Learning for Neural Headline Editing

  • Qingyang Wu
  • Lei Li
  • Hao Zhou
  • Ying Zeng
  • Zhou Yu

Many social media news writers are not professionally trained. Therefore, social media platforms have to hire professional editors to adjust amateur headlines to attract more readers. We propose to automate this headline editing process through neural network models to provide more immediate writing support for these social media news writers. To train such a neural headline editing model, we collected a dataset which contains articles with original headlines and professionally edited headlines. However, it is expensive to collect a large number of professionally edited headlines. To solve this low-resource problem, we design an encoder-decoder model which leverages large-scale pre-trained language models. We further improve the pre-trained model's quality by introducing a headline generation task as an intermediate task before the headline editing task. Also, we propose the Self Importance-Aware (SIA) loss to address the different levels of editing in the dataset by down-weighting the importance of easily classified tokens and sentences. With the help of pre-training, adaptation, and SIA, the model learns to generate headlines in the professional editor's style. Experimental results show that our method significantly improves the quality of headline editing compared with previous methods.

AAAI Conference 2020 Conference Paper

Infomax Neural Joint Source-Channel Coding via Adversarial Bit Flip

  • Yuxuan Song
  • Minkai Xu
  • Lantao Yu
  • Hao Zhou
  • Shuo Shao
  • Yong Yu

Although Shannon theory states that it is asymptotically optimal to separate the source and channel coding as two independent processes, in many practical communication scenarios this decomposition is limited by the finite bit-length and computational power for decoding. Recently, neural joint source-channel coding (NECST) (Choi et al. 2018) was proposed to sidestep this problem. While it leverages the advancements of amortized inference and deep learning (Kingma and Welling 2013; Grover and Ermon 2018) to improve the encoding and decoding process, it still cannot always achieve compelling results in terms of compression and error correction performance due to the limited robustness of its learned coding networks. In this paper, motivated by the inherent connections between neural joint source-channel coding and discrete representation learning, we propose a novel regularization method called Infomax Adversarial-Bit-Flip (IABF) to improve the stability and robustness of the neural joint source-channel coding scheme. More specifically, on the encoder side, we propose to explicitly maximize the mutual information between the codeword and data; while on the decoder side, the amortized reconstruction is regularized within an adversarial framework. Extensive experiments conducted on various real-world datasets evidence that our IABF can achieve state-of-the-art performance on both compression and error correction benchmarks and outperform the baselines by a significant margin.

AAAI Conference 2020 Conference Paper

Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

  • Hao Zhou
  • Wengang Zhou
  • Yun Zhou
  • Houqiang Li

Despite the recent success of deep learning in continuous sign language recognition (CSLR), deep models typically focus on the most discriminative features, ignoring other potentially non-trivial and informative contents. Such characteristic heavily constrains their capability to learn implicit visual grammars behind the collaboration of different visual cues (i.e., hand shape, facial expression and body posture). By injecting multi-cue learning into neural network design, we propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem. Our STMC network consists of a spatial multi-cue (SMC) module and a temporal multi-cue (TMC) module. The SMC module is dedicated to spatial representation and explicitly decomposes visual features of different cues with the aid of a self-contained pose estimation branch. The TMC module models temporal correlations along two parallel paths, i.e., intra-cue and inter-cue, which aims to preserve the uniqueness and explore the collaboration of multiple cues. Finally, we design a joint optimization strategy to achieve the end-to-end sequence learning of the STMC network. To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks: PHOENIX-2014, CSL and PHOENIX-2014-T. Experimental results demonstrate that the proposed method achieves new state-of-the-art performance on all three benchmarks.

AAAI Conference 2020 Conference Paper

Towards Making the Most of BERT in Neural Machine Translation

  • Jiacheng Yang
  • Mingxuan Wang
  • Hao Zhou
  • Chengqi Zhao
  • Weinan Zhang
  • Yong Yu
  • Lei Li

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTNMT) that is the key to integrating the pre-trained LMs into neural machine translation (NMT). Our proposed CTNMT consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previous pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show CTNMT gains of up to 3 BLEU on the WMT14 English-German language pair, even surpassing the previous state-of-the-art pretraining-aided NMT by 1.4 BLEU. For the large WMT14 English-French task with 40 million sentence pairs, our base model still significantly improves upon the state-of-the-art Transformer-big model by more than 1 BLEU.

AAAI Conference 2019 Conference Paper

CGMH: Constrained Sentence Generation by Metropolis-Hastings Sampling

  • Ning Miao
  • Hao Zhou
  • Lili Mou
  • Rui Yan
  • Lei Li

In real-world applications of natural language generation, there are often constraints on the target sentences in addition to fluency and naturalness requirements. Existing language generation techniques are usually based on recurrent neural networks (RNNs). However, it is non-trivial to impose constraints on RNNs while maintaining generation quality, since RNNs generate sentences sequentially (or with beam search) from the first word to the last. In this paper, we propose CGMH, a novel approach using Metropolis-Hastings sampling for constrained sentence generation. CGMH allows complicated constraints such as the occurrence of multiple keywords in the target sentences, which cannot be handled in traditional RNN-based approaches. Moreover, CGMH works in the inference stage, and does not require parallel corpora for training. We evaluate our method on a variety of tasks, including keywords-to-sentence generation, unsupervised sentence paraphrasing, and unsupervised sentence error correction. CGMH achieves high performance compared with previous supervised methods for sentence generation. Our code is released at https://github.com/NingMiao/CGMH.
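A minimal sketch of the Metropolis-Hastings loop that CGMH builds on: propose a word-level replace/insert/delete edit and accept it based on a score ratio. This is a simplification (the paper's acceptance ratio also accounts for proposal probabilities, and the score comes from a language model plus constraint terms); the toy scoring function and vocabulary below are hypothetical.

```python
import math
import random

def cgmh_step(sentence, score, vocab, rng):
    """One Metropolis-Hastings edit step: propose a replace/insert/delete at
    a random position and accept with probability min(1, new/old score)."""
    pos = rng.randrange(len(sentence))
    op = rng.choice(["replace", "insert", "delete"])
    cand = list(sentence)
    if op == "replace":
        cand[pos] = rng.choice(vocab)
    elif op == "insert":
        cand.insert(pos, rng.choice(vocab))
    elif len(cand) > 1:  # never delete the last remaining word
        del cand[pos]
    accept = min(1.0, score(cand) / max(score(sentence), 1e-12))
    return cand if rng.random() < accept else sentence

# Toy target distribution: prefer sentences containing the keyword
# constraint "translation", and mildly prefer shorter sentences.
def toy_score(words):
    return (10.0 if "translation" in words else 1.0) * math.exp(-0.1 * len(words))

rng = random.Random(0)
vocab = ["machine", "translation", "is", "hard", "fun"]
sentence = ["machine", "is", "fun"]
for _ in range(200):
    sentence = cgmh_step(sentence, toy_score, vocab, rng)
```

Because sampling happens at inference time over complete sentences, constraints like keyword occurrence can be baked directly into the score, with no parallel training data.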

IJCAI Conference 2019 Conference Paper

Correct-and-Memorize: Learning to Translate from Interactive Revisions

  • Rongxiang Weng
  • Hao Zhou
  • Shujian Huang
  • Lei Li
  • Yifan Xia
  • Jiajun Chen

State-of-the-art machine translation models are still not on a par with human translators. Previous work brings human interactions into the neural machine translation process to obtain improved results in target languages. However, not all model-translation errors are equal: some are critical while others are minor. Meanwhile, the same translation mistakes occur repeatedly in similar contexts. To solve both issues, we propose CAMIT, a novel method for translating in an interactive environment. Our proposed method works with critical revision instructions, and therefore allows humans to correct arbitrary words in model-translated sentences. In addition, CAMIT learns from and softly memorizes revision actions based on the context, alleviating the issue of repeated mistakes. Experiments in both ideal and real interactive translation settings demonstrate that our proposed CAMIT enhances machine translation results significantly while requiring fewer revision instructions from humans compared to previous methods.

IJCAI Conference 2019 Conference Paper

GraspSnooker: Automatic Chinese Commentary Generation for Snooker Videos

  • Zhaoyue Sun
  • Jiaze Chen
  • Hao Zhou
  • Deyu Zhou
  • Lei Li
  • Mingmin Jiang

We demonstrate a web-based software system, GraspSnooker, which is able to automatically generate Chinese text commentaries for snooker game videos. It consists of a video analyzer, a strategy predictor and a commentary generator. As far as we know, it is the first attempt at snooker commentary generation, which may help snooker learners understand the game.

NeurIPS Conference 2019 Conference Paper

Kernelized Bayesian Softmax for Text Generation

  • Ning Miao
  • Hao Zhou
  • Chengqi Zhao
  • Wenxian Shi
  • Lei Li

Neural models for text generation require a softmax layer with proper token embeddings during the decoding phase. Most existing approaches adopt a single point embedding for each token. However, a word may have multiple senses depending on context, some of which might be distinct. In this paper, we propose KerBS, a novel approach for learning better embeddings for text generation. KerBS embodies two advantages: (a) it employs a Bayesian composition of embeddings for words with multiple senses; (b) it is adaptive to the semantic variances of words and robust to rare sentence contexts by imposing learned kernels to capture the closeness of words (senses) in the embedding space. Empirical studies show that KerBS significantly boosts the performance of several text generation tasks.

AAAI Conference 2018 Conference Paper

A Deep Cascade Network for Unaligned Face Attribute Classification

  • Hui Ding
  • Hao Zhou
  • Shaohua Zhou
  • Rama Chellappa

Humans focus attention on different face regions when recognizing face attributes. Most existing face attribute classification methods use the whole image as input. Moreover, some of these methods rely on fiducial landmarks to provide defined face parts. In this paper, we propose a cascade network that simultaneously learns to localize face regions specific to attributes and performs attribute classification without alignment. First, a weakly-supervised face region localization network is designed to automatically detect regions (or parts) specific to attributes. Then multiple part-based networks and a whole-image-based network are separately constructed and combined together by the region switch layer and attribute relation layer for final attribute classification. A multi-net learning method and hint-based model compression are further proposed to obtain an effective localization model and a compact classification model, respectively. Our approach achieves significantly better performance than state-of-the-art methods on the unaligned CelebA dataset, reducing the classification error by 30.9%.

AAAI Conference 2018 Conference Paper

Augmenting End-to-End Dialogue Systems With Commonsense Knowledge

  • Tom Young
  • Erik Cambria
  • Iti Chaturvedi
  • Hao Zhou
  • Subham Biswas
  • Minlie Huang

Building dialogue systems that can converse naturally with humans is a challenging yet intriguing problem of artificial intelligence. In open-domain human-computer conversation, where the conversational agent is expected to respond to human utterances in an interesting and engaging way, commonsense knowledge has to be integrated into the model effectively. In this paper, we investigate the impact of providing commonsense knowledge about the concepts covered in the dialogue. Our model represents the first attempt to integrate a large commonsense knowledge base into end-to-end conversational models. In the retrieval-based scenario, we propose a model to jointly take into account message content and related commonsense for selecting an appropriate response. Our experiments suggest that the knowledge-augmented models are superior to their knowledge-free counterparts.

NeurIPS Conference 2018 Conference Paper

BRITS: Bidirectional Recurrent Imputation for Time Series

  • Wei Cao
  • Dong Wang
  • Jian Li
  • Hao Zhou
  • Lei Li
  • Yitan Li

Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contain many missing values. Given multiple correlated time series, how do we fill in missing values and predict their class labels? Existing imputation methods often impose strong assumptions on the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care dataset, and a human-activity localization dataset. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracies.
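As a toy illustration of the bidirectional idea, the sketch below averages a forward and a backward estimate of each missing value, with a fixed decay toward the series mean standing in for BRITS's learned temporal-decay RNN components. All names and constants are illustrative, not the paper's model.

```python
def bidirectional_impute(series, decay=0.9):
    """Fill None entries by averaging a forward and a backward estimate.

    Each direction carries the last observed value and trusts it less the
    longer the gap, falling back toward the series mean (a hand-crafted
    stand-in for BRITS's learned temporal decay; nothing here is trained).
    """
    obs = [x for x in series if x is not None]
    mean = sum(obs) / len(obs)

    def one_direction(xs):
        est, last, gap = [], mean, 0
        for x in xs:
            if x is not None:
                last, gap = x, 0
                est.append(x)
            else:
                gap += 1
                w = decay ** gap  # confidence in the last observation decays
                est.append(w * last + (1 - w) * mean)
        return est

    fwd = one_direction(series)
    bwd = one_direction(series[::-1])[::-1]
    return [(f + b) / 2 for f, b in zip(fwd, bwd)]

filled = bidirectional_impute([1.0, None, None, 4.0, 5.0])
```

Running both directions lets evidence from after a gap inform values inside it, which a purely forward pass cannot do.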

IJCAI Conference 2018 Conference Paper

Commonsense Knowledge Aware Conversation Generation with Graph Attention

  • Hao Zhou
  • Tom Young
  • Minlie Huang
  • Haizhou Zhao
  • Jingfang Xu
  • Xiaoyan Zhu

Commonsense knowledge is vital to many natural language processing tasks. In this paper, we present a novel open-domain conversation generation model to demonstrate how large-scale commonsense knowledge can facilitate language understanding and generation. Given a user post, the model retrieves relevant knowledge graphs from a knowledge base and then encodes the graphs with a static graph attention mechanism, which augments the semantic information of the post and thus supports better understanding of the post. Then, during word generation, the model attentively reads the retrieved knowledge graphs and the knowledge triples within each graph to facilitate better generation through a dynamic graph attention mechanism. This is the first attempt that uses large-scale commonsense knowledge in conversation generation. Furthermore, unlike existing models that use knowledge triples (entities) separately and independently, our model treats each knowledge graph as a whole, which encodes more structured, connected semantic information in the graphs. Experiments show that the proposed model can generate more appropriate and informative responses than state-of-the-art baselines.
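The static graph attention step (attending over retrieved knowledge vectors to augment the post representation) can be sketched as plain softmax attention. The vector dimensionality and the toy inputs below are hypothetical; the paper's mechanism additionally operates over triple structure and a dynamic variant during decoding.

```python
import math

def graph_attention(query, triples):
    """Softmax attention over knowledge-triple vectors: score each triple by
    its dot product with the query, normalize with softmax, and return the
    attention-weighted sum of the triple vectors."""
    scores = [sum(q * t for q, t in zip(query, tri)) for tri in triples]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(query)
    return [sum(w * tri[d] for w, tri in zip(weights, triples)) for d in range(dim)]

# Toy 2-d example: the query aligns with the first of two triple vectors.
out = graph_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The result is a single knowledge vector that leans toward the triples most relevant to the post, which is then fused into generation.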

AAAI Conference 2018 Conference Paper

Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory

  • Hao Zhou
  • Minlie Huang
  • Tianyang Zhang
  • Xiaoyan Zhu
  • Bing Liu

Perception and expression of emotion are key factors in the success of dialogue systems or conversational agents. However, this problem has not been studied in large-scale conversation generation so far. In this paper, we propose Emotional Chatting Machine (ECM), which can generate responses that are appropriate not only in content (relevant and grammatical) but also in emotion (emotionally consistent). To the best of our knowledge, this is the first work that addresses the emotion factor in large-scale conversation generation. ECM addresses this factor using three new mechanisms that respectively (1) model the high-level abstraction of emotion expressions by embedding emotion categories, (2) capture the change of implicit internal emotion states, and (3) use explicit emotion expressions from an external emotion vocabulary. Experiments show that the proposed model can generate responses appropriate not only in content but also in emotion.

JAIR Journal 2017 Journal Article

A Neural Probabilistic Structured-Prediction Method for Transition-Based Natural Language Processing

  • Hao Zhou
  • Yue Zhang
  • Chuan Cheng
  • Shujian Huang
  • Xinyu Dai
  • Jiajun Chen

We propose a neural probabilistic structured-prediction method for transition-based natural language processing, which integrates beam search and contrastive learning. The method uses a global optimization model, which can leverage arbitrary features over non-local context. Beam search is used for efficient heuristic decoding, and contrastive learning is performed to adjust the model according to search errors. When evaluated on both chunking and dependency parsing tasks, the proposed method achieves significant accuracy improvements over the locally normalized greedy baseline on both tasks.

NeurIPS Conference 2016 Conference Paper

Hypothesis Testing in Unsupervised Domain Adaptation with Applications in Alzheimer's Disease

  • Hao Zhou
  • Vamsi Ithapu
  • Sathya Narayanan Ravi
  • Vikas Singh
  • Grace Wahba
  • Sterling Johnson

Consider samples from two different data sources $\{\mathbf{x_s^i}\} \sim P_{\rm source}$ and $\{\mathbf{x_t^i}\} \sim P_{\rm target}$. We only observe their transformed versions $h(\mathbf{x_s^i})$ and $g(\mathbf{x_t^i})$, for some known function classes $h(\cdot)$ and $g(\cdot)$. Our goal is to perform a statistical test checking whether $P_{\rm source} = P_{\rm target}$ while removing the distortions induced by the transformations. This problem is closely related to concepts underlying numerous domain adaptation algorithms, and in our case, it is motivated by the need to combine clinical and imaging-based biomarkers from multiple sites and/or batches, where this problem is fairly common and an impediment to conducting analyses with much larger sample sizes. We develop a framework that addresses this problem using ideas from hypothesis testing on the transformed measurements, wherein the distortions need to be estimated {\it in tandem} with the testing. We derive a simple algorithm and study its convergence and consistency properties in detail, and we also provide lower-bound strategies based on recent work in continuous optimization. On a dataset of individuals at risk for neurological disease, our results are competitive with alternative procedures that are twice as expensive and in some cases operationally infeasible to implement.