Arrow Research search

Author name cluster

Zihao Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

39 papers
2 author rows

Possible papers

39

AAAI Conference 2026 Conference Paper

Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

  • Shulei Ji
  • Zihao Wang
  • Jiaxing Yu
  • Xiangyuan Yang
  • Shuyu Li
  • Songruoyao Wu
  • Kejun Zhang

Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone via the first layer, while semantic and rhythmic features are fused in the second cross-attention layer. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in terms of objective metrics and subjective comparisons.
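
The FiLM-style conditioning and timestep-aware weighted fusion mentioned in the abstract can be sketched roughly as follows. The function names, array shapes, and the linear timestep schedule are illustrative assumptions, not taken from Diff-V2M's actual implementation:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation (FiLM): per-channel scale and shift."""
    return gamma * features + beta

def timestep_weighted_fusion(semantic, rhythmic, t, T):
    """Toy weighted fusion whose balance shifts with diffusion timestep t.
    The linear schedule below is an assumption for illustration only."""
    w = t / T
    return w * rhythmic + (1.0 - w) * semantic

feat = np.ones((4, 8))           # (time steps, channels)
gamma = np.full((1, 8), 2.0)     # scale predicted by a conditioning net (hypothetical)
beta = np.full((1, 8), -1.0)     # shift predicted by the same net
out = film(feat, gamma, beta)    # each entry: 2 * 1 - 1 = 1
```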

TMLR Journal 2026 Journal Article

Dynamics-Aligned Diffusion Planning for Offline RL: A Unified Framework with Forward and Inverse Guidance

  • Zihao Wang
  • Ke Jiang
  • Xiaoyang Tan

Diffusion-based planning has emerged as a powerful paradigm for offline reinforcement learning (RL). However, existing approaches often overlook the physical constraints imposed by real-world dynamics, resulting in dynamics inconsistency—a mismatch between diffusion-generated trajectories and those feasible under true environment transitions. To address this issue, we propose Dynamics-Aligned Diffusion Planning (DADP), a unified framework that explicitly enforces dynamics consistency during the diffusion denoising process. DADP offers two complementary variants: DADP-F (Forward), which employs a forward dynamics model to ensure state-level feasibility, and DADP-I (Inverse), which leverages an inverse dynamics model to enhance action-level executability. Both variants share a unified guidance formulation that integrates task return optimization and dynamics alignment through gradient-based updates. Experiments on state-based D4RL Maze2D and MuJoCo benchmarks demonstrate that DADP-F and DADP-I outperform state-of-the-art offline RL baselines, effectively reducing dynamics inconsistency and improving long-horizon robustness. This unifies diffusion-based planning with physically grounded dynamics modeling.

EAAI Journal 2026 Journal Article

Optimizing fuzzy job shop scheduling using graph neural networks and deep reinforcement learning

  • WenJia Yang
  • Zihao Wang
  • Gai-Ge Wang

With the rapid advancement of manufacturing, optimizing production scheduling is essential for enhancing enterprise competitiveness. In real-world job shop environments, uncertainties such as equipment fluctuations and variable processing times are prevalent, making the fuzzy job shop scheduling problem (FJSSP) a key research focus. However, the introduction of fuzzy time information significantly increases problem complexity, challenging traditional optimization methods to deliver high-quality solutions efficiently. While deep reinforcement learning (DRL) shows promise for complex scheduling, its application to FJSSP faces obstacles in modeling fuzziness, extracting features from intricate constraints, and ensuring generalization across scales. To address these challenges, this paper proposes a DRL-based framework for FJSSP that models scheduling states using a parallel disjunctive graph and employs a multi-channel graph convolutional network for feature extraction. This approach effectively integrates fuzzy information and complex constraints, transforming FJSSP into a sequential decision-making process suitable for DRL. Experimental results demonstrate that the proposed method outperforms traditional dispatching rules in both solution quality and computational efficiency, while also exhibiting strong generalization to large-scale and unseen scheduling scenarios.

AAAI Conference 2026 Conference Paper

RPGen: Robust and Differentially Private Synthetic Image Generation

  • Zihao Wang
  • Hao Peng
  • Wei Dong
  • Yuecen Wei
  • Li Sun
  • Zhengtao Yu

Differentially private (DP) image synthesis enables the generation of realistic images while bounding privacy leakage, facilitating secure data sharing across organizations. However, the Gaussian noise injected during DP training, such as via DP-SGD, often severely degrades synthesis quality by disrupting model convergence. To address this, we introduce RPGen, a novel framework that enhances diffusion models' parameter robustness to mitigate DP noise effects without compromising privacy guarantees. At its core, RPGen employs adversarial model perturbation (AMP) during public pre-training to build resilience against perturbations, but we identify and tackle the critical issue of robustness transferability across domains. RPGen achieves this through a three-step process: (1) A pre-trained classifier infers labels for private images, aggregated into a class distribution noised with the Gaussian mechanism for DP, and public samples are selected to match this privatized distribution for domain alignment; (2) The diffusion model is pre-trained on this curated subset with adversarial model perturbation to foster robustness; (3) The model undergoes fine-tuning on private data using DP-SGD. This synergy of robustness augmentation and transferability optimization yields high-fidelity synthesis. Extensive evaluations on ImageNet for pre-training, with CelebA and CIFAR-10 for synthesis, show RPGen outperforming state-of-the-art baselines across ε ∈ {1, 5, 10}. On average, it achieves 20.18% lower FID and 5.45% higher classification accuracy. Ablations confirm the efficacy of domain curation and modest perturbations, establishing RPGen as a new benchmark for privacy-utility trade-offs in image generation.
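
Step (1) of the pipeline above, privatizing the inferred class distribution, can be sketched with a simple Gaussian mechanism. The noise scale, clipping, and renormalization shown here are illustrative; RPGen's actual calibration of sigma to a privacy budget is not reproduced:

```python
import numpy as np

def privatize_class_distribution(labels, num_classes, sigma, rng=None):
    """Aggregate inferred labels into counts, add Gaussian noise for DP,
    then clip and renormalize into a valid distribution (post-processing
    preserves the DP guarantee)."""
    rng = rng or np.random.default_rng(0)
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    noisy = counts + rng.normal(0.0, sigma, size=num_classes)  # Gaussian mechanism
    noisy = np.clip(noisy, 0.0, None)
    return noisy / noisy.sum()

labels = np.array([0, 0, 1, 2, 2, 2])   # labels inferred by a pre-trained classifier
dist = privatize_class_distribution(labels, num_classes=3, sigma=0.5)
```

Public samples would then be selected so their class mix matches `dist`.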

EAAI Journal 2025 Journal Article

A multi-view new energy vehicle form generation design method combining Kansei imagery and deep learning

  • Zihao Wang
  • Le Xi
  • Yifan Ding
  • Wenjie Fang
  • Kaiming Wang
  • Hongliang Zuo

In the competitive landscape of new energy vehicles, exterior design has become a crucial differentiator amid functional homogenization. User preferences are central to shaping vehicle appearance, yet most perceptual design methods rely on a single viewpoint, limiting insights into complex preference patterns. This study proposes a multi-perspective mapping approach that integrates Kansei engineering with deep learning. Firstly, user core imagery is collected and mined through big data. Secondly, Kernels Network (KNet) semantic segmentation model, Residual Networks (ResNet) tri-view (front/side/rear) score prediction model and fully connected network (FCN) feature fusion model are integrated to construct a multi-view feature mapping system. Finally, the optimal combination of morphological elements is explored based on the Elite Genetic Algorithm (EGA), and the scheme is validated through generative artificial intelligence (AI) workflow. The experimental results demonstrate that, employing “Cool” as a case study, the three-view scheme and the combination scheme devised by this research process exhibit substantial superiority over the majority of the samples. Under identical parameters, the scheme with decision constraints surpasses the randomly generated scheme in terms of perceptual scores and stability. The performance of the test set and the experimental results collectively substantiate the model’s validity. This workflow—covering preference extraction, morphological decomposition, AI-driven generation, and validation—provides a scalable framework for new energy vehicle exterior design. It also demonstrates novel applications of Kansei engineering in multi-view fusion and generative form design.

ICML Conference 2025 Conference Paper

A Recipe for Causal Graph Regression: Confounding Effects Revisited

  • Yujia Yin
  • Tianyi Qu
  • Zihao Wang
  • Yifan Chen

Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided at https://github.com/causal-graph/CGR.

IROS Conference 2025 Conference Paper

A Sim-to-Real Transfer Framework for Enhancing Marine Vehicle Performance in Ocean Environments

  • Ze Zheng
  • Zihao Wang
  • Wenbo Xie

Reinforcement learning (RL) has gained attention for complex decision-making in uncertain environments. However, high costs and risks of real-world experimentation limit its direct application to marine vehicles. This motivates the use of simulation-based training and sim-to-real transfer techniques. Despite growing interest, a systematic understanding of how to design effective transfer strategies for marine contexts remains lacking. This paper presents a sim-to-real transfer framework tailored for marine vehicles, integrating high-fidelity, data-driven dynamics modeling with multi-factor domain randomization to address marine environmental uncertainties. Maneuvering data is utilized to extract nonlinear hydrodynamic characteristics of marine vehicles to enhance model realism. Additionally, domain randomization is explored across multiple environmental factors, including wind, wave, and current. To evaluate transferability, we construct a sim-to-sim platform with a pseudo-real environment that emulates the reality gap and adopt a path-following task using Soft Actor-Critic. We comprehensively assess the impacts of model fidelity and environmental randomization strategies on sim-to-real transfer performance. Results indicate that model accuracy positively impacts transfer performance, while aggressive domain randomization may reduce adaptability in calm conditions. Finally, a data-driven modeling and multi-factor randomization recipe is proposed for RL policy transfer in marine applications.

IJCAI Conference 2025 Conference Paper

AI-Assisted Human-Pet Artistic Musical Co-Creation for Wellness Therapy

  • Zihao Wang
  • Le Ma
  • Yuhang Jin
  • Yongsheng Feng
  • Xin Pan
  • Shulei Ji
  • Kejun Zhang

This paper explores AI-mediated human-pet musical co-creation from an interdisciplinary perspective, leveraging recent advancements in animal-assisted therapy. These advancements have shown significant psychosocial benefits, especially in reducing anxiety and enhancing social engagement. Building on these findings, this study innovatively employs pet vocal timbres as 'digital avatars' to enhance emotional investment during the music creation process. We propose PetCoCre, a novel system that applies pet vocal timbres in three distinct character paradigms within AI music creation: (1) PetRhythm: using pet voices as rhythmic percussion through beat synchronization. (2) PetMelody: enabling pet voices to act as melodic instruments via pitch-shifting alignment. (3) PetVocalia: utilizing pet vocal timbres as the target timbre for SVC (Singing Voice Conversion), where the converted singing voice replaces the original singer's voice, thus preserving the original semantic content. Beyond these character paradigms, our technical innovation lies in proposing SaMoye, the first open-source, high-quality zero-shot SVC model that effectively overcomes existing methods' zero-shot limitations by employing mixed speaker embeddings for timbre enhancement and leveraging a large-scale singing voice dataset. In our experiments, we collected dog and cat vocalization data from pet stores and conducted experiments with 30 participants. Results demonstrate that the human-pet co-creation mode led to significant enhancements in pleasure and creative satisfaction compared to solo AI music generation, along with a significant reduction in participants' anxiety levels. Through collaborative art creation, this research pioneers new paradigms for animal-assisted therapeutic interventions and expands the boundaries of AI-assisted creative collaboration.

AAAI Conference 2025 Conference Paper

ESEG: Event-Based Segmentation Boosted by Explicit Edge-Semantic Guidance

  • Yucheng Zhao
  • Gengyu Lyu
  • Ke Li
  • Zihao Wang
  • Hao Chen
  • Zhen Yang
  • Yongjian Deng

Event-based semantic segmentation (ESS) has attracted researchers' attention recently, as event cameras can solve problems such as under/over-exposure or motion blur that are difficult for RGB cameras to handle. However, event data are noisy and sparse, resulting in difficulties for the model to locate and extract reliable cues from their sparse representations, especially when performing pixel-level tasks. In this paper, we propose a novel framework ESEG to alleviate the dilemma. Given that event signals relate closely to moving edges, instead of proposing complex structures to expect them to recognize those reliable edge regions behind event signals on their own, we introduce the explicit edge-semantic supervision as a reference to let the ESS model globally optimize semantics, considering the high confidence of event data in edge regions. In addition, we propose a fusion module named Density-Aware Dynamic-Window Cross Attention Fusion (D²CAF), in which the density perception, cross-attention, and dynamic window masking mechanisms are jointly imposed to optimize edge-dense feature fusion, leveraging the characteristics of event cameras. Experimental results on DSEC and DDD17 datasets demonstrate the efficacy of the ESEG framework and its core designs.

ICLR Conference 2025 Conference Paper

GROOT-2: Weakly Supervised Multimodal Instruction Following Agents

  • Shaofei Cai
  • Bowei Zhang 0007
  • Zihao Wang
  • Haowei Lin
  • Xiaojian Ma 0001
  • Anji Liu
  • Yitao Liang

Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset with instruction labels can mitigate this issue, acquiring such high-quality annotations at scale is impractical. To address this issue, we frame the problem as a semi-supervised learning task and introduce GROOT-2, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of labeled demonstrations to ensure the latent space reflects human intentions. GROOT-2's effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation, demonstrating its robust multimodal instruction-following capabilities.

TMLR Journal 2025 Journal Article

Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs

  • Qi Hu
  • Weifeng Jiang
  • Haoran Li
  • Zihao Wang
  • Jiaxin Bai
  • Qianren Mao
  • Yangqiu Song
  • Lixin Fan

The increasing demand for deep learning-based foundation models has highlighted the importance of efficient data retrieval mechanisms. Neural graph databases (NGDBs) offer a compelling solution, leveraging neural spaces to store and query graph-structured data, thereby enabling LLMs to access precise and contextually relevant information. However, current NGDBs are constrained to single-graph operation, limiting their capacity to reason across multiple, distributed graphs. Furthermore, the lack of support for multi-source graph data in existing NGDBs hinders their ability to capture the complexity and diversity of real-world data. In many applications, data is distributed across multiple sources, and the ability to reason across these sources is crucial for making informed decisions. This limitation is particularly problematic when dealing with sensitive graph data, as directly sharing and aggregating such data poses significant privacy risks. As a result, many applications that rely on NGDBs are forced to choose between compromising data privacy or sacrificing the ability to reason across multiple graphs. To address these limitations, we propose to learn Federated Neural Graph DataBase (FedNGDB), a pioneering systematic framework that empowers privacy-preserving reasoning over multi-source graph data. FedNGDB leverages federated learning to collaboratively learn graph representations across multiple sources, enriching relationships between entities, and improving the overall quality of graph data. Unlike existing methods, FedNGDB can handle complex graph structures and relationships, making it suitable for various downstream tasks. We evaluate FedNGDB on three real-world datasets, demonstrating its effectiveness in retrieving relevant information from multi-source graph data while keeping sensitive information secure on local devices. Our results show that FedNGDB can efficiently retrieve answers to cross-graph queries, making it a promising approach for LLMs and other applications that rely on efficient data retrieval mechanisms.
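
The privacy-preserving collaboration FedNGDB relies on can be illustrated with a generic federated-averaging step, where only model parameters leave each source, never the raw graph data. This is standard FedAvg shown as a sketch, not a claim about the paper's exact aggregation rule:

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Federated averaging: weight each client's parameter vector by its
    local sample count. Raw graph data stays on the client; only the
    learned parameters are shared and aggregated."""
    total = sum(client_sizes)
    weights = [n / total for n in client_sizes]
    return sum(w * p for w, p in zip(weights, client_params))

# three sources holding private knowledge-graph embeddings (toy 2-d vectors)
params = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
sizes = [100, 100, 200]
global_params = fedavg(params, sizes)  # -> [0.75, 0.75]
```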

ICLR Conference 2025 Conference Paper

Learning Hierarchical Polynomials of Multiple Nonlinear Features

  • Hengyu Fu
  • Zihao Wang
  • Eshaan Nichani
  • Jason D. Lee

In deep learning theory, a critical question is to understand how neural networks learn hierarchical features. In this work, we study the learning of hierarchical polynomials of multiple nonlinear features using three-layer neural networks. We examine a broad class of functions of the form $f^{\star}=g^{\star}\circ \mathbf{p}$, where $\mathbf{p}:\mathbb{R}^{d} \rightarrow \mathbb{R}^{r}$ represents multiple quadratic features with $r \ll d$ and $g^{\star}:\mathbb{R}^{r}\rightarrow \mathbb{R}$ is a polynomial of degree $p$. This can be viewed as a nonlinear generalization of the multi-index model, and also an expansion upon previous work on nonlinear feature learning that focused only on a single feature (i.e. $r = 1$). Our primary contribution shows that a three-layer neural network trained via layerwise gradient descent suffices for (i) complete recovery of the space spanned by the nonlinear features and (ii) efficient learning of the target function $f^{\star}=g^{\star}\circ \mathbf{p}$ or transfer learning of $f=g\circ \mathbf{p}$ with a different link function, within $\widetilde{\mathcal{O}}(d^4)$ samples and polynomial time. For such hierarchical targets, our result substantially improves the sample complexity ${\Theta}(d^{2p})$ of the kernel methods, demonstrating the power of efficient feature learning. It is important to highlight that our results leverage novel techniques and thus manage to go beyond all prior settings such as single-index and multi-index models as well as models depending just on one nonlinear feature, contributing to a more comprehensive understanding of feature learning in deep learning.

ICLR Conference 2025 Conference Paper

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

  • Junyan Ye
  • Baichuan Zhou
  • Zilong Huang
  • Junan Zhang
  • Tianyi Bai
  • Hengrui Kang
  • Jun He
  • Honglin Lin

With the rapid development of AI-generated content, the future internet may be inundated with synthetic data, making the discrimination of authentic and credible multimodal data increasingly challenging. Synthetic data detection has thus garnered widespread attention, and the performance of large multimodal models (LMMs) in this task has attracted significant interest. LMMs can provide natural language explanations for their authenticity judgments, enhancing the explainability of synthetic content detection. Simultaneously, the task of distinguishing between real and synthetic data effectively tests the perception, knowledge, and reasoning capabilities of LMMs. In response, we introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities. LOKI encompasses video, image, 3D, text, and audio modalities, comprising 18K carefully curated questions across 26 subcategories with clear difficulty levels. The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks, allowing for a comprehensive analysis of LMMs. We evaluated 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities. More information about LOKI can be found at https://opendatalab.github.io/LOKI/.

ICML Conference 2025 Conference Paper

MCU: An Evaluation Framework for Open-Ended Game Agents

  • Xinyue Zheng
  • Haowei Lin
  • Kaichen He
  • Zihao Wang
  • Qiang Fu 0016
  • Haobo Fu
  • Zilong Zheng
  • Yitao Liang

Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce Minecraft Universe (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments. Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU.

AAAI Conference 2025 Conference Paper

MSV-PCT: Multi-Sparse-View Enhanced Transformer Framework for Salient Object Detection in Point Clouds

  • Zihao Wang
  • Yiming Huang
  • Gengyu Lyu
  • Yucheng Zhao
  • Ziyu Zhou
  • Bochen Xie
  • Zhen Yang
  • Yongjian Deng

Salient object detection (SOD) methods for 2D images have great significance in the field of human-computer interaction (HCI). However, as a common data format in HCI, the SOD research in the form of 3D point cloud data remains limited. Previous works commonly treat this task as point cloud segmentation, which perceives all points in the scene for prediction. However, these methods neglect that SOD is designed to simulate human visual perception where human can only see the surfaces rather than occluded point clouds. Thereby, these methods may fail when meet such situations. This paper aims to solve this problem by approximately simulating the perception paradigm of humans towards 3D scenes. Thus, we propose a framework based on the 3D visual point cloud backbone and its multi-view projection named MSV-PCT. Specifically, instead of relying solely on general point cloud learning frameworks, we additionally introduce multi-sparse-view learning branches to supplement the SOD perception. Furthermore, we propose a novel point cloud edge detection loss function to effectively address artifacts, enabling the accurate segmentation of the edges of salient objects from the background. Finally, to evaluate the generalization of point cloud SOD methods, we introduce a new approach to generate simulated PC-SOD datasets from RGBD-SOD data. Experiments on the simulated datasets show that MSV-PCT achieves better accuracy and robustness.

ICLR Conference 2025 Conference Paper

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

  • Zihao Wang
  • Bin Cui 0001
  • Shaoduo Gan

Optimizing the Key-Value (KV) cache of the Large Language Model (LLM) has been considered critical to saving the cost of inference. Most of the existing KV-cache compression algorithms attempted to sparsify the sequence of tokens by taking advantage of the different importance of tokens. However, most of these methods treat all layers equally, allocating the same KV budget to each layer. This approach is suboptimal, as some layers may be less sensitive to input tokens yet still receive the same budget as others. In this work, we found that by identifying the importance of attention layers, we could optimize the KV-cache jointly from two dimensions, i.e., sequence-wise and layer-wise. Based on our observations regarding layer-wise importance in inference, we propose SqueezeAttention to precisely optimize the allocation of KV-cache budget among layers on-the-fly and then incorporate three representative sequence-wise algorithms to compress the KV-cache for each layer with its very own budget. Specifically, we first measure each layer's importance by calculating the cosine similarity of the input prompt differences before and after the self-attention layers. Based on this similarity, we then categorize the layers into two groups and adjust their KV budgets accordingly. By optimizing the KV-cache from both the sequence and layer dimensions, SqueezeAttention achieves around 30% to 70% memory reduction and up to 2.2× throughput improvement in a wide range of LLMs and benchmarks. The code is available at https://github.com/hetailang/SqueezeAttention.
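
The layer-importance heuristic described in the abstract (cosine similarity of the prompt representation before and after each attention layer, followed by a two-group budget split) might be sketched like this. The median threshold and the ±25% budget adjustment are illustrative assumptions, not SqueezeAttention's actual values:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two hidden-state vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def allocate_kv_budgets(sims, base_budget):
    """Layers whose input/output similarity is high barely transform the
    prompt, so they get a smaller KV budget; the rest get more.
    Median split and +/-25% adjustment are illustrative choices."""
    sims = np.asarray(sims)
    thresh = np.median(sims)
    return np.where(sims >= thresh, 0.75 * base_budget, 1.25 * base_budget)

sims = [0.99, 0.95, 0.60, 0.55]              # hypothetical per-layer similarities
budgets = allocate_kv_budgets(sims, base_budget=1024)
```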

ICML Conference 2025 Conference Paper

The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

  • Zihao Wang
  • Yibo Jiang
  • Jiahao Yu
  • Heqing Huang

Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role—a concept we call role separation—is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, modifying position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.

AAAI Conference 2025 Conference Paper

Transtreaming: Adaptive Delay-aware Transformer for Real-time Streaming Perception

  • Xiang Zhang
  • Yufei Cui
  • Chenchen Fu
  • Zihao Wang
  • Yuyang Sun
  • Xue Liu
  • Weiwei Wu

Real-time object detection is critical for the decision-making process for many real-world applications, such as collision avoidance and path planning in autonomous driving. This work presents an innovative real-time streaming perception method, Transtreaming, which addresses the challenge of real-time object detection with dynamic computational delays. The core innovation of Transtreaming lies in its adaptive delay-aware transformer, which can concurrently predict multiple future frames and select the output that best matches the real-world present time, compensating for any system-induced computational delays. The proposed model outperforms existing state-of-the-art methods, even in single-frame detection scenarios, by leveraging a transformer-based methodology. It demonstrates robust performance across a range of devices, from powerful V100 to modest 2080Ti, achieving the highest level of perceptual accuracy on all platforms. Unlike most state-of-the-art methods that struggle to complete computation within a single frame on less powerful devices, Transtreaming meets the stringent real-time processing requirements on all kinds of devices. The experimental results emphasize the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving.
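
The core selection step of delay-aware streaming perception (predict several future frames, then emit the one that best matches the present time once the actual computational delay is known) can be sketched as follows; the timestamps and frame labels are purely illustrative:

```python
def select_output(predicted, now):
    """Given (timestamp, frame) predictions for several future instants,
    return the frame whose target time is closest to the present time,
    compensating for whatever delay inference actually incurred."""
    return min(predicted, key=lambda ts: abs(ts[0] - now))[1]

# predictions made at t = 100 ms for three future timestamps (ms)
predicted = [(110, "frame_t+1"), (120, "frame_t+2"), (130, "frame_t+3")]

# by the time inference finishes, 22 ms have elapsed, so "now" is 122 ms
out = select_output(predicted, now=122)  # -> "frame_t+2"
```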

AAAI Conference 2024 Conference Paper

A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis

  • Nailei Hei
  • Qianyu Guo
  • Zihao Wang
  • Yan Wang
  • Haofen Wang
  • Wenqiang Zhang

Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics. Data and code are available at https://github.com/Naylenv/UF-FGTG.

ICLR Conference 2024 Conference Paper

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

  • Shaofei Cai
  • Bowei Zhang 0007
  • Zihao Wang
  • Xiaojian Ma 0001
  • Anji Liu
  • Yitao Liang

We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis.

ICLR Conference 2024 Conference Paper

Learning Hierarchical Polynomials with Three-Layer Neural Networks

  • Zihao Wang
  • Eshaan Nichani
  • Jason D. Lee

We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k=1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde O(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde \Theta(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde O(d^2)$, which is an improvement over prior work (Nichani et al., 2023) requiring a sample size of $\widetilde\Theta(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde O(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
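The target class $h = g \circ p$ described in this abstract can be illustrated with a minimal sketch (the specific degrees and coefficient choices below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # input dimension

# Inner polynomial p: R^d -> R, here degree k = 2 (a quadratic): p(x) = x^T A x
A = rng.normal(size=(d, d)) / d

def p(x):
    return x @ A @ x

# Outer polynomial g: R -> R, here degree q = 3
def g(z):
    return z**3 - z

# Hierarchical target h = g(p(x)): a degree k*q = 6 polynomial in x overall
def h(x):
    return g(p(x))

x = rng.standard_normal(d)  # inputs drawn from the standard Gaussian, as in the paper's setting
print(h(x))
```

The paper's claim is that a three-layer network first recovers the low-dimensional feature $p$ and then fits the outer function $g$, which is why the sample complexity scales with $d^k$ rather than $d^{kq}$.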

IJCAI Conference 2024 Conference Paper

MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music

  • Zihao Wang
  • Shuyu Li
  • Tao Zhang
  • Qi Wang
  • Pengfei Yu
  • Jinyang Luo
  • Yan Liu
  • Ming Xi

The rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance on understanding and textually describing music. However, due to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low precision of annotations, existing music description datasets cannot serve as benchmarks. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP) that employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and alignment with popular semantics. Utilizing this method, we built a large-scale, private dataset with multi-dimensional, high-precision music annotations, the Caichong Music Dataset (CaiMD), and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in terms of music description, and empirically demonstrated the effectiveness of CaiMD for fine-tuning LLMs. Ultimately, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music.

AAAI Conference 2024 Conference Paper

NestE: Modeling Nested Relational Structures for Knowledge Graph Reasoning

  • Bo Xiong
  • Mojtaba Nayyeri
  • Linhao Luo
  • Zihao Wang
  • Shirui Pan
  • Steffen Staab

Reasoning with knowledge graphs (KGs) has primarily focused on triple-shaped facts. Recent advancements have been explored to enhance the semantics of these facts by incorporating more potent representations, such as hyper-relational facts. However, these approaches are limited to atomic facts, which describe a single piece of information. This paper extends beyond atomic facts and delves into nested facts, represented by quoted triples where subjects and objects are triples themselves (e.g., ((BarackObama, holds_position, President), succeed_by, (DonaldTrump, holds_position, President))). These nested facts enable the expression of complex semantics like situations over time and logical patterns over entities and relations. In response, we introduce NestE, a novel KG embedding approach that captures the semantics of both atomic and nested factual knowledge. NestE represents each atomic fact as a 1*3 matrix, and each nested relation is modeled as a 3*3 matrix that rotates the 1*3 atomic fact matrix through matrix multiplication. Each element of the matrix is represented as a complex number in the generalized 4D hypercomplex space, including (spherical) quaternions, hyperbolic quaternions, and split-quaternions. Through thorough analysis, we demonstrate the embedding's efficacy in capturing diverse logical patterns over nested facts, surpassing the confines of first-order logic-like expressions. Our experimental results showcase NestE's significant performance gains over current baselines in triple prediction and conditional link prediction. The code and pre-trained models are openly available at https://github.com/xiongbo010/NestE.
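The matrix-rotation idea in this abstract is easy to sketch. Ordinary complex numbers stand in below for the paper's 4-D hypercomplex algebras, and the random embeddings are placeholders, not learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# An atomic fact (h, r, t) embedded as a 1x3 matrix -- one complex entry per slot
# (complex numbers stand in here for the paper's 4-D hypercomplex algebras)
atomic_fact = rng.normal(size=(1, 3)) + 1j * rng.normal(size=(1, 3))

# A nested relation embedded as a 3x3 matrix that "rotates" the atomic fact
nested_relation = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))

# Matrix multiplication maps the atomic fact to an embedding of the nested fact,
# which is again a 1x3 matrix and can itself be rotated by further nested relations
nested_fact = atomic_fact @ nested_relation
```

Because the result has the same 1*3 shape as an atomic fact, nesting composes naturally by repeated multiplication.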

NeurIPS Conference 2024 Conference Paper

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

  • Zihao Wang
  • Shaofei Cai
  • Zhancun Mu
  • Haowei Lin
  • Ceyao Zhang
  • Xuejie Liu
  • Qing Li
  • Anji Liu

This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the imitation learning policy decoder). OmniJARVIS demonstrates excellent performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials. The dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS.

AAAI Conference 2024 Conference Paper

ProAgent: Building Proactive Cooperative Agents with Large Language Models

  • Ceyao Zhang
  • Kaijie Yang
  • Siyi Hu
  • Zihao Wang
  • Guanghe Li
  • Yihang Sun
  • Cheng Zhang
  • Zhaowei Zhang

Building agents with adaptive behavior in cooperative tasks stands as a paramount goal in the realm of multi-agent systems. Current approaches to developing cooperative agents rely primarily on learning-based methods, whose policy generalization depends heavily on the diversity of teammates they interact with during the training phase. Such reliance, however, constrains the agents' capacity for strategic adaptation when cooperating with unfamiliar teammates, which becomes a significant challenge in zero-shot coordination scenarios. To address this challenge, we propose ProAgent, a novel framework that harnesses large language models (LLMs) to create proactive agents capable of dynamically adapting their behavior to enhance cooperation with teammates. ProAgent can analyze the present state, and infer the intentions of teammates from observations. It then updates its beliefs in alignment with the teammates' subsequent actual behaviors. Moreover, ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios. Experimental evaluations conducted within the Overcooked-AI environment unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training when cooperating with AI agents. Furthermore, when partnered with human proxy models, its performance exhibits an average improvement exceeding 10% compared to the current state-of-the-art method. For more information about our project, please visit https://pku-proagent.github.io.

IJCAI Conference 2024 Conference Paper

SDformer: Transformer with Spectral Filter and Dynamic Attention for Multivariate Time Series Long-term Forecasting

  • Ziyu Zhou
  • Gengyu Lyu
  • Yiming Huang
  • Zihao Wang
  • Ziyu Jia
  • Zhen Yang

Transformer has gained widespread adoption in modeling time series due to the exceptional ability of its self-attention mechanism in capturing long-range dependencies. However, when processing time series data with numerous variates, the vanilla self-attention mechanism tends to distribute attention weights evenly and smoothly, causing row-homogenization in attention maps and further hampering time series forecasting. To tackle this issue, we propose an advanced Transformer architecture entitled SDformer, which designs two novel modules, Spectral-Filter-Transform (SFT) and Dynamic-Directional-Attention (DDA), and integrates them into the encoder of Transformer to achieve more intensive attention allocation. Specifically, the SFT module utilizes the Fast Fourier Transform to select the most prominent frequencies, along with a Hamming Window to smooth and denoise the filtered series data; the DDA module applies a specialized kernel function to the query and key vectors projected from the denoised data, concentrating this innovative attention mechanism more effectively on the most informative variates to obtain a sharper attention distribution. These two modules jointly enable attention weights to be more salient among numerous variates, which in turn enhances the attention's ability to capture multivariate correlations, improving the performance in forecasting. Extensive experiments on public datasets demonstrate its superior performance over other state-of-the-art models. Code is available at https://github.com/zhouziyu02/SDformer.
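The spectral-filtering step described in this abstract can be sketched roughly as follows. This is not the authors' implementation; the top-k magnitude selection and the placement of the Hamming window are assumptions made for illustration:

```python
import numpy as np

def spectral_filter(series, k=4):
    """Keep only the k most prominent frequencies of a 1-D series,
    tapering first with a Hamming window to smooth and denoise."""
    n = len(series)
    windowed = series * np.hamming(n)      # Hamming window reduces spectral leakage
    spec = np.fft.rfft(windowed)           # real FFT of the tapered series
    keep = np.argsort(np.abs(spec))[-k:]   # indices of the k largest-magnitude bins
    mask = np.zeros_like(spec)
    mask[keep] = spec[keep]                # zero out everything but the top-k frequencies
    return np.fft.irfft(mask, n=n)         # reconstruct in the time domain

# Usage: a 5 Hz sine buried in noise survives the filter; the noise largely does not.
t = np.linspace(0, 1, 256, endpoint=False)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.default_rng(0).normal(size=256)
denoised = spectral_filter(noisy, k=4)
```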

TMLR Journal 2024 Journal Article

SEAL: Simultaneous Label Hierarchy Exploration And Learning

  • Zhiquan Tan
  • Zihao Wang
  • Yifan Zhang

Label hierarchy is an important source of external knowledge that can enhance classification performance. However, most existing methods rely on predefined label hierarchies that may not match the data distribution. To address this issue, we propose Simultaneous label hierarchy Exploration And Learning (SEAL), a new framework that explores the label hierarchy by augmenting the observed labels with latent labels that follow a prior hierarchical structure. Our approach uses a 1-Wasserstein metric over the tree metric space as an objective function, which enables us to simultaneously learn a data-driven label hierarchy and perform (semi-)supervised learning. We evaluate our method on several standard benchmarks and show that it achieves improved results in semi-supervised image classification scenarios.

ICML Conference 2024 Conference Paper

Selecting Large Language Model to Fine-tune via Rectified Scaling Law

  • Haowei Lin
  • Baizhou Huang
  • Haotian Ye
  • Qinyu Chen
  • Zihao Wang
  • Sujian Li
  • Jianzhu Ma
  • Xiaojun Wan 0001

The ever-growing ecosystem of LLMs has posed a challenge in selecting the most appropriate pre-trained model to fine-tune amidst a sea of options. Given constrained resources, fine-tuning all models and making selections afterward is unrealistic. In this work, we formulate this resource-constrained selection task into predicting fine-tuning performance and illustrate its natural connection with Scaling Law. Unlike pre-training, we find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase". We also explain why existing Scaling Law fails to capture this phase transition phenomenon both theoretically and empirically. To address this, we introduce the concept of "pre-learned data size" into our Rectified Scaling Law, which overcomes theoretical limitations and fits experimental results much better. By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption, while other methods may provide negatively correlated selection. The project page is available at rectified-scaling-law.github.io.

ICML Conference 2024 Conference Paper

Transforming and Combining Rewards for Aligning Large Language Models

  • Zihao Wang
  • Chirag Nagpal
  • Jonathan Berant
  • Jacob Eisenstein
  • Alexander Nicholas D'Amour
  • Sanmi Koyejo
  • Victor Veitch

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is "better" than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice for transformation for (the common case of) rewards learned from Bradley-Terry preference models. The derived transformation is straightforward: we apply a log-sigmoid function to the centered rewards, a method we term "LSC-transformation" (log-sigmoid-centered transformation). This transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is "good" in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
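The LSC-transformation described in this abstract is simple enough to sketch directly. The reference point (the value the reward is centered on) is left as a parameter here, since its choice is not specified by the abstract:

```python
import math

def lsc_transform(reward, ref=0.0):
    """Log-sigmoid-centered transform: log sigmoid(reward - ref).
    Near-linear for poorly-scoring outputs, saturating for already-good ones,
    so optimization pressure concentrates on improving the worst cases."""
    z = reward - ref
    # numerically stable log(sigmoid(z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def combined_reward(rewards, refs):
    """Summing transformed rewards corresponds to the log-probability that the
    output is 'good' on every property at once (conjunction via summation)."""
    return sum(lsc_transform(r, ref) for r, ref in zip(rewards, refs))

print(lsc_transform(0.0))  # log(1/2), about -0.693: a reward at the reference point
```

The saturation for large positive rewards is what the abstract credits with mitigating reward hacking: pushing an already-good output higher yields almost no gain under the transform.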

NeurIPS Conference 2023 Conference Paper

Concept Algebra for (Score-Based) Text-Controlled Generative Models

  • Zihao Wang
  • Lin Gui
  • Jeffrey Negrea
  • Victor Veitch

This paper concerns the structure of learned representations in text-guided generative models, focusing on score-based models. A key property of such models is that they can compose disparate concepts in a 'disentangled' manner. This suggests these models have internal representations that encode concepts in a 'disentangled' manner. Here, we focus on the idea that concepts are encoded as subspaces of some representation space. We formalize what this means, show there's a natural choice for the representation, and develop a simple method for identifying the part of the representation corresponding to a given concept. In particular, this allows us to manipulate the concepts expressed by the model through algebraic manipulation of the representation. We demonstrate the idea with examples using Stable Diffusion.

NeurIPS Conference 2023 Conference Paper

Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents

  • Zihao Wang
  • Shaofei Cai
  • Guanzhou Chen
  • Anji Liu
  • Xiaojian (Shawn) Ma
  • Yitao Liang

In this paper, we study the problem of planning in Minecraft, a popular, democratized yet challenging open-ended environment for developing multi-task embodied agents. We've found two primary challenges of empowering such agents with planning: 1) planning in an open-ended world like Minecraft requires precise and multi-step reasoning due to the long-term nature of the tasks, and 2) as vanilla planners do not consider the achievability of the current agent when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient. To this end, we propose ``$\underline{D}$escribe, $\underline{E}$xplain, $\underline{P}$lan and $\underline{S}$elect'' ($\textbf{DEPS}$), an interactive planning approach based on Large Language Models (LLMs). Our approach helps with better error correction from the feedback during the long-haul planning, while also bringing the sense of proximity via goal $\textbf{Selector}$, a learnable module that ranks parallel sub-goals based on the estimated steps of completion and improves the original plan accordingly. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the $\texttt{ObtainDiamond}$ grand challenge with our approach.

NeurIPS Conference 2023 Conference Paper

Theoretical Analysis of the Inductive Biases in Deep Convolutional Networks

  • Zihao Wang
  • Lei Wu

In this paper, we provide a theoretical analysis of the inductive biases in convolutional neural networks (CNNs). We start by examining the universality of CNNs, i.e., the ability to approximate any continuous functions. We prove that a depth of $\mathcal{O}(\log d)$ suffices for deep CNNs to achieve this universality, where $d$ is the input dimension. Additionally, we establish that learning sparse functions with CNNs requires only $\widetilde{\mathcal{O}}(\log^2d)$ samples, indicating that deep CNNs can efficiently capture {\em long-range} sparse correlations. These results are made possible through a novel combination of multichanneling and downsampling as the network depth increases. We also delve into the distinct roles of weight sharing and locality in CNNs. To this end, we compare the performance of CNNs, locally-connected networks (LCNs), and fully-connected networks (FCNs) on a simple regression task, where LCNs can be viewed as CNNs without weight sharing. On the one hand, we prove that LCNs require ${\Omega}(d)$ samples while CNNs need only $\widetilde{\mathcal{O}}(\log^2d)$ samples, highlighting the critical role of weight sharing. On the other hand, we prove that FCNs require $\Omega(d^2)$ samples, whereas LCNs need only $\widetilde{\mathcal{O}}(d)$ samples, underscoring the importance of locality. These provable separations quantify the difference between the two biases, and the major observation behind our proof is that weight sharing and locality break different symmetries in the learning process.

NeurIPS Conference 2022 Conference Paper

Posterior Collapse of a Linear Latent Variable Model

  • Zihao Wang
  • Liu Ziyin

This work identifies the existence and cause of a type of posterior collapse that frequently occurs in the Bayesian deep learning practice. For a general linear latent variable model that includes linear variational autoencoders as a special case, we precisely identify the nature of posterior collapse to be the competition between the likelihood and the regularization of the mean due to the prior. Our result also suggests that posterior collapse may be a general problem of learning for deeper architectures and deepens our understanding of Bayesian deep learning.

NeurIPS Conference 2021 Conference Paper

Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs

  • Zihao Wang
  • Hang Yin
  • Yangqiu Song

Complex Query Answering (CQA) is an important reasoning task on knowledge graphs. Current CQA learning models have been shown to be able to generalize from atomic operators to more complex formulas, which can be regarded as the combinatorial generalizability. In this paper, we present EFO-1-QA, a new dataset to benchmark the combinatorial generalizability of CQA models by including 301 different query types, which is 20 times larger than existing datasets. Moreover, our benchmark provides, for the first time, a way to evaluate and analyze the impact of different operators and normal forms by using (a) 7 choices of the operator systems and (b) 9 forms of complex queries. Specifically, we provide a detailed study of the combinatorial generalizability of two commonly used operators, i.e., projection and intersection, and justify the impact of the forms of queries given the canonical choice of operators. Our code and data can provide an effective pipeline to benchmark CQA models.

AAAI Conference 2021 System Paper

IFDDS: An Anti-fraud Outbound Robot

  • Zihao Wang
  • Minghui Yang
  • Chunxiang Jin
  • Jia Liu
  • Zujie Wen
  • Saishuai Liu
  • Zhe Zhang

With the rapid growth of internet finance and e-payment, payment fraud has attracted increasing attention. To prevent customers from being cheated, systems often block risky payments depending on a risk factor. However, this may also inadvertently block cases which are not actually risky. To solve this problem, we present IFDDS, a system that proactively chats with customers through intelligent speech interaction to precisely determine the actual payment risk. Our system adopts imitation learning to learn dialogue policies. In addition, it encompasses a dialogue risk detection module which identifies the fraud probability at every turn based on the dialogue state. We create a web-based user interface which simulates a practical voice-based dialogue system.

IJCAI Conference 2021 Conference Paper

Local Representation is Not Enough: Soft Point-Wise Transformer for Descriptor and Detector of Local Features

  • Zihao Wang
  • Xueyi Li
  • Zhen Li

Significant progress has been witnessed for the descriptor and detector of local features, but there still exist several challenging and intractable limitations, such as insufficient localization accuracy and non-discriminative description, especially in repetitive- or blank-texture regions, which haven't been well addressed. The coarse feature representation and limited receptive field are considered as the main issues for these limitations. To address these issues, we propose a novel Soft Point-Wise Transformer for Descriptor and Detector, simultaneously mining long-range intrinsic and cross-scale dependencies of local features. Furthermore, our model leverages the distinct transformers based on the soft point-wise attention, substantially decreasing the memory and computation complexity, especially for high-resolution feature maps. In addition, a multi-level decoder is constructed to guarantee the high detection accuracy and discriminative description. Extensive experiments demonstrate that our model outperforms the existing state-of-the-art methods on the image matching and visual localization benchmarks.

AIIM Journal 2020 Journal Article

A generic approach for cell segmentation based on Gabor filtering and area-constrained ultimate erosion

  • Zihao Wang
  • Zhenzhou Wang

Nowadays, the demand for segmenting different types of cells imaged by microscopes has increased tremendously. The requirements for the segmentation accuracy are becoming stricter. Because of the great diversity of cells, no traditional method can segment various types of cells with adequate accuracy. In this paper, we aim to propose a generic approach that is capable of segmenting various types of cells robustly and counting the total number of cells accurately. To this end, we utilize the gradients of cells instead of intensity for cell segmentation because the gradients are less affected by the global intensity variations. To improve the segmentation accuracy, we utilize the Gabor filter to increase the intensity uniformity of the gradient image. To get the optimal segmentation, we utilize the slope difference distribution based threshold selection method to segment the Gabor filtered gradient image. At last, we propose an area-constrained ultimate erosion method to separate the connected cells robustly. Twelve types of cells are used to test the proposed approach in this paper. Experimental results showed that the proposed approach is very promising in meeting the strict accuracy requirements for many applications.

IJCAI Conference 2020 Conference Paper

Two-stage Behavior Cloning for Spoken Dialogue System in Debt Collection

  • Zihao Wang
  • Jia Liu
  • Hengbin Cui
  • Chunxiang Jin
  • Minghui Yang
  • Yafang Wang
  • Xiaolong Li
  • Renxin Mao

With the rapid growth of internet finance and the booming of financial lending, the intelligent calling for debt collection in FinTech companies has drawn increasing attention. Nowadays, the widely used intelligent calling system is based on dialogue flow, namely configuring the interaction flow with a finite-state machine. In our scenario of debt collection, the complete dialogue flow contains more than one thousand interactive paths. All dialogue procedures are manually specified, which incurs extremely high maintenance costs and is error-prone. To solve this problem, we propose the behavior-cloning-based collection robot framework without any dialogue flow configuration, called two-stage behavior cloning (TSBC). In the first stage, we use a multi-label classification model to obtain policies that may be able to cope with the current situation according to the dialogue state; in the second stage, we score several scripts under each obtained policy to select the script with the highest score as the reply for the current state. This framework makes full use of the massive manual collection records without labeling and fully absorbs artificial wisdom and experience. We have conducted extensive experiments in both single-round and multi-round scenarios and showed the effectiveness of the proposed system. The accuracy of a single round of dialogue can be improved by 5%, and the accuracy of multiple rounds of dialogue can be increased by 3.1%.

AAAI Conference 2017 Conference Paper

Salience Estimation via Variational Auto-Encoders for Multi-Document Summarization

  • Piji Li
  • Zihao Wang
  • Wai Lam
  • Zhaochun Ren
  • Lidong Bing

We propose a new unsupervised sentence salience framework for Multi-Document Summarization (MDS), which can be divided into two components: latent semantic modeling and salience estimation. For latent semantic modeling, a neural generative model called Variational Auto-Encoders (VAEs) is employed to describe the observed sentences and the corresponding latent semantic representations. Neural variational inference is used for the posterior inference of the latent variables. For salience estimation, we propose an unsupervised data reconstruction framework, which jointly considers the reconstruction for latent semantic space and observed term vector space. Therefore, we can capture the salience of sentences from these two different and complementary vector spaces. Thereafter, the VAEs-based latent semantic model is integrated into the sentence salience estimation component in a unified fashion, and the whole framework can be trained jointly by back-propagation via multi-task learning. Experimental results on the benchmark datasets DUC and TAC show that our framework achieves better performance than the state-of-the-art models.