Arrow Research search

Author name cluster

Yue Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

99 papers
2 author rows

Possible papers

99

AAAI Conference 2026 Conference Paper

Burst Image Quality Assessment: A New Benchmark and Unified Framework for Multiple Downstream Tasks

  • Xiaoye Liang
  • Lai Jiang
  • Minglang Qiao
  • Yichen Guo
  • Yue Zhang
  • Xin Deng
  • Shengxi Li
  • Yufan Liu

In recent years, the development of burst imaging technology has improved the capture and processing capabilities of visual data, enabling a wide range of applications. However, the redundancy in burst images leads to increased storage and transmission demands, as well as reduced efficiency in downstream tasks. To address this, we propose a new task, Burst Image Quality Assessment (BuIQA), to evaluate the task-driven quality of each frame within a burst sequence, providing reasonable cues for burst image selection. Specifically, we establish the first benchmark dataset for BuIQA, consisting of 7,346 burst sequences with 45,827 images and 191,572 annotated quality scores for multiple downstream scenarios. Inspired by the data analysis, a unified BuIQA framework is proposed to achieve efficient adaptation to diverse downstream scenarios. Specifically, a task-driven prompt generation network is developed with heterogeneous knowledge distillation to learn the priors of the downstream task. Then, a task-aware quality assessment network is introduced to assess burst image quality based on the task prompt. Extensive experiments across 10 downstream scenarios demonstrate the impressive BuIQA performance of the proposed approach, outperforming the state of the art. Furthermore, applying our approach to select high-quality burst frames yields a 0.33 dB PSNR improvement in the downstream tasks of denoising and super-resolution.

AAAI Conference 2026 Conference Paper

Dynamic Semantic Tokenization for Time Series via Elastic Sampling on Physics-aware Perception

  • Huaizhang Liao
  • Zhixiong Yang
  • Jingyuan Xia
  • Yuheng Sun
  • Yue Zhang
  • Shengxi Li
  • Yongxiang Liu

Despite the remarkable success of semantic token learning in NLP and vision domains, token-level representation mechanisms face fundamental challenges when extended to continuous time series analysis. We identify that a core limitation lies in the intrinsic absence of semantically meaningful tokenization boundaries within time series, which differs substantially from discrete text tokens and presents unique complexities compared to spatially coherent image patches. While existing works mechanically apply fixed-length partitioning, recent evidence from time series foundation models reveals performance ceilings in prediction tasks under such paradigms. This paper introduces a novel tokenization framework, physics-aware tokenization (PATK), designed to implement adaptive time-frequency tokenization via distribution-sensitive sampling strategies. Key innovations include: 1) a Rate-of-Variation (RoV) distribution meticulously structured to encompass multi-scale temporal dynamics in the time domain, alongside a Spectral Energy Intensity (SEI) distribution devised to reveal global seasonal patterns in the frequency domain; 2) a physics-aware hidden Markov model (PA-HMM) established to adaptively break continuous time series into distinct tokens of elastic lengths, responding to physics-aware probabilities sampled from the RoV and SEI distributions. The proposed PATK allows steady integration with both conventional Transformers and advanced large-scale time series models (including LLM-transferred methods and pretrained time series foundation models). Simulations across various datasets demonstrate that PATK excels in classification and forecasting tasks, showing notable adaptability in modeling long-term dependencies, stronger resilience against disturbances, and robustness to missing-data events.

JBHI Journal 2026 Journal Article

Instance-Based Transfer Learning With Similarity-Aware Subject Selection for Cross-Subject SSVEP-Based BCIs

  • Ziwen Wang
  • Yue Zhang
  • Zhiqiang Zhang
  • Sheng Quan Xie
  • Alexander Lanzon
  • William P. Heath
  • Zhenhong Li

Steady-state visual evoked potential (SSVEP)-based brain-computer interfaces (BCIs) can achieve high recognition accuracy with sufficient training data. Transfer learning presents a promising solution to alleviate data requirements for the target subject by leveraging data from source subjects; however, effectively addressing individual variability among both target and source subjects remains a challenge. This paper proposes a novel transfer learning framework, termed instance-based task-related component analysis (iTRCA), which leverages knowledge from source subjects while considering their individual contributions. iTRCA extracts two types of features: (1) the subject-general feature, capturing shared information between source and target subjects in a common latent space, and (2) the subject-specific feature, preserving the unique characteristics of the target subject. To mitigate negative transfer, we further design an enhanced framework, subject selection-based iTRCA (SS-iTRCA), which integrates a similarity-based subject selection strategy to identify appropriate source subjects for transfer based on their task-related components (TRCs). Comparative evaluations on the Benchmark, BETA, and a self-collected dataset demonstrate the effectiveness of the proposed iTRCA and SS-iTRCA frameworks. This study provides a potential solution for developing high-performance SSVEP-based BCIs with reduced target subject data.

AAAI Conference 2026 Conference Paper

LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection

  • Jian Wu
  • Hang Yu
  • Bingchang Liu
  • Yang Wenjie
  • Peng Di
  • Jianguo Li
  • Yue Zhang

Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to achieve both goals simultaneously. In this paper, we introduce LAMDAS (LLM as an implicit classifier for domain-specific Data Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization processes. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines.

AAAI Conference 2026 Conference Paper

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

  • Junshu Pan
  • Wei Shen
  • Shulin Huang
  • Qiji Zhou
  • Yue Zhang

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly training on offline preference data to align with human preferences. During DPO training, the reference model serves as a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the absence of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and requires stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that improves preference optimization by introducing a guiding reference model. This reference model provides foresight into the desired policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on the AlpacaEval 2 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.

AAAI Conference 2026 Conference Paper

Schema-Guided Event Reasoning: A Plug-and-Play Event Reasoning Framework Based on Large Language Models

  • Yuying Liu
  • Xuechen Zhao
  • Yanyi Huang
  • Ye Wang
  • Xin Song
  • Yue Zhang
  • Haiyan Liu
  • Bin Zhou

Recent advancements in Large Language Models have increasingly demonstrated their potential for event reasoning. However, LLMs still struggle with this task due to inadequate modeling of event structures. Although introducing schema knowledge has been shown to improve event reasoning performance, existing methods rely on a predefined schema library, compromising their scalability and lightweight deployment. To address these challenges, we propose SGER, a plug-and-play Schema-Guided Event Reasoning framework. In the schema extraction stage, the model maps event descriptions with diverse surface forms to potential semantic structure representations, achieving an abstract transformation from instances to schemas. The schema prediction stage captures the potential associations between historical event schemas to make forward-looking inferences about possible future event schemas. In the event reasoning stage, we integrate historical events and predicted schemas into prompts to guide LLMs in generating specific, contextually consistent predicted events. Experimental evaluations demonstrate that our framework significantly improves the event reasoning performance of LLMs.

AAAI Conference 2025 Conference Paper

Concurrent Planning and Execution in Lifelong Multi-Agent Path Finding with Delay Probabilities

  • Yue Zhang
  • Zhe Chen
  • Daniel Harabor
  • Pierre Le Bodic
  • Peter J. Stuckey

In multi-agent systems, when we account for the possibility of delays during execution, online planning becomes more complicated, as both execution and planning should be able to handle delays while agents are moving. Lifelong Multi-Agent Path Finding (LMAPF) is the problem of (re)planning the collision-free moves of agents to their goals in a shared space, while agents continuously receive new goals. PIE (Planning and Improving while Executing) is a recent approach to LMAPF which concurrently replans later parts of agents' trajectories while execution occurs. However, the execution is assumed to be perfect. Existing approaches either use policy-based methods to quickly coordinate agents at every timestep with instant delay feedback, or deploy an execution policy to adjust a solution for delays on the fly. These approaches may introduce large amounts of unnecessary delay due to their planner guarantees or simple delay-handling policies. In this paper, we extend PIE to define a framework for solving the lifelong MAPF problem with execution delays. We instantiate our framework with different execution and replanning strategies, and experimentally evaluate them. Overall, we find that this framework can substantially improve throughput for lifelong MAPF, by up to a factor of 3, compared to approaches that handle delays with simple execution policies.

AAAI Conference 2025 Conference Paper

Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization

  • Yue Zhang
  • Liqiang Jing
  • Vibhav Gogate

We introduce a new task called Defeasible Visual Entailment (DVE), where the goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. While this concept is well-established in Natural Language Inference, it remains unexplored in visual entailment. At a high level, DVE enables models to refine their initial interpretations, leading to improved accuracy and reliability in various applications such as detecting misleading information in images, enhancing visual question answering, and refining decision-making processes in autonomous systems. Existing metrics do not adequately capture the change in the entailment relationship brought by updates. To address this, we propose a novel inference-aware evaluator designed to capture changes in entailment strength induced by updates, using pairwise contrastive learning and categorical information learning. Additionally, we introduce a reward-driven update optimization method to further enhance the quality of updates generated by multimodal models. Experimental results demonstrate the effectiveness of our proposed evaluator and optimization method.

ICLR Conference 2025 Conference Paper

Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information

  • Yuke Zhu
  • Yue Zhang
  • Dongdong Liu
  • Chi Xie
  • Zihua Xiong
  • Bo Zheng 0007
  • Sheng Guo 0005

Recent advancements in document understanding have been dominated by leveraging large language models (LLMs) and multimodal large models. However, enabling LLMs to comprehend complex document layouts and structural information often necessitates intricate network modifications or costly pre-training, limiting their practical applicability. In this paper, we introduce Group Position Embedding (GPE), a novel and efficient technique to enhance the layout understanding capabilities of LLMs without architectural changes or additional pre-training. GPE achieves this by strategically grouping the attention heads and feeding each group with distinct positional embeddings, effectively encoding layout information relevant to document comprehension. This simple yet powerful method allows for effective integration of layout information within the existing LLM framework. We evaluate GPE against several competitive baselines across five mainstream document tasks. We also introduce a challenging benchmark called BLADE, specifically designed to assess layout comprehension. Extensive experiments on both established and BLADE benchmarks confirm the efficacy of GPE in significantly advancing the state-of-the-art in document understanding. Our code is available at https://github.com/antgroup/GroupPositionEmbedding.git

AAAI Conference 2025 Conference Paper

Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification

  • Shichen Li
  • Zhongqing Wang
  • Zheyu Zhao
  • Yue Zhang
  • Peifeng Li

Model editing aims at selectively updating a small subset of a neural model's parameters with an interpretable strategy to achieve desired modifications. It can significantly reduce the computational cost of adapting large language models (LLMs). Given its ability to precisely target critical components within LLMs, model editing shows great potential for efficient fine-tuning applications. In this work, we investigate model editing as an efficient method for adapting LLMs to aspect-based sentiment classification. Through causal interventions, we trace and determine which neuron hidden states are essential for the model's predictions. By performing interventions and restorations on each component of an LLM, we identify the importance of these components for aspect-based sentiment classification. Our findings reveal that a distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Leveraging these insights, we develop a model editing approach that focuses exclusively on these critical parts of the LLM, leading to a more efficient method for adapting LLMs. Our in-domain and out-of-domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy.

IJCAI Conference 2025 Conference Paper

Exploring the Frontiers of Animation Video Generation in the Sora Era: Method, Dataset and Benchmark

  • Yudong Jiang
  • Baohan Xu
  • Siqian Yang
  • Mingyu Ying
  • Jing Liu
  • Chao Xu
  • Siqi Wang
  • Yidi Wu

Animation has gained significant interest in the film and TV industry in recent years. Despite the success of advanced video generation models like Sora, Kling, and CogVideoX in generating natural videos, they are less effective at handling animation videos. Evaluating animation video generation is also a great challenge due to its unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation benchmark. Supported by the data processing pipeline with over 10M high-quality samples, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 diverse animation videos, with metrics specifically developed for animation video generation. Our entire project is publicly available at https://github.com/bilibili/Index-anisora/tree/main

NeurIPS Conference 2025 Conference Paper

Learning to Rank for In-Context Example Retrieval

  • Yuwen Ji
  • Luodan Zhang
  • Ambyer han
  • Haoran Que
  • Lei Shi
  • Wang Chao
  • Yue Zhang

Recent advances in retrieval-based in-context learning (ICL) train the retriever using a classification objective, which categorizes in-context examples (ICEs) into the most useful and the rest based on absolute scores. However, during inference, ICEs are retrieved by score ranking rather than classification, so the classification training objective deviates from this test scenario. Hence, in this paper, we propose a novel algorithm that trains a retrieval model with a ranking formulation, where the preference rankings between ICEs are given by comparing the likelihood of the LLM generating the correct answer conditioned on each exemplar. By learning to rank, we motivate the retriever to automatically learn diverse rationales for why specific examples are more useful for ICL decisions. This addresses the issue that classification models poorly capture broader utility. Experimental results demonstrate the top-1 performance of our proposal across 9 NLP tasks, with ablation studies and case studies further validating the effectiveness of our design. The code can be found at: https://github.com/2022neo/SeDPO_NIPS25

NeurIPS Conference 2025 Conference Paper

Learning to Reason under Off-Policy Guidance

  • Jianhao Yan
  • Yafu Li
  • Zican Hu
  • Zhi Wang
  • Ganqu Cui
  • Xiaoye Qu
  • Yu Cheng
  • Yue Zhang

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.

ICRA Conference 2025 Conference Paper

LEMMo-Plan: LLM-Enhanced Learning from Multi-Modal Demonstration for Planning Sequential Contact-Rich Manipulation Tasks

  • Kejia Chen 0005
  • Zheng Shen
  • Yue Zhang
  • Lingyun Chen
  • Fan Wu 0015
  • Zhenshan Bing
  • Sami Haddadin
  • Alois C. Knoll

Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce LEMMo-Plan, an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance. More materials are available on our project website: lemmo-plan.github.io/LEMMo-Plan/.

IROS Conference 2025 Conference Paper

Multi-Robot Assembly of Deformable Linear Objects Using Multi-Modal Perception

  • Kejia Chen 0005
  • Celina Dettmering
  • Florian Pachler
  • Zhuo Liu
  • Yue Zhang
  • Tailai Cheng
  • Jonas Dirr
  • Zhenshan Bing

Industrial assembly of deformable linear objects (DLOs) such as cables offers great potential for many industries. However, DLOs pose several challenges for robot-based automation due to the inherent complexity of deformation and, consequently, the difficulty of anticipating the behavior of DLOs in dynamic situations. Although existing studies have addressed isolated subproblems like shape tracking, grasping, and shape control, there has been limited exploration of integrated workflows that combine these individual processes. To address this gap, we propose an object-centric perception and planning framework to achieve a comprehensive DLO assembly process throughout the industrial value chain. The framework utilizes visual and tactile information to track the DLO's shape as well as contact state across different stages, which facilitates effective planning of robot actions. Our approach encompasses robot-based bin picking of DLOs from cluttered environments, followed by a coordinated handover to two additional robots that mount the DLOs onto designated fixtures. Real-world experiments employing a setup with multiple robots demonstrate the effectiveness of the approach and its relevance to industrial scenarios.

AAAI Conference 2025 Conference Paper

Partial Point Cloud Registration with Multi-view 2D Image Learning

  • Yue Zhang
  • Yue Wu
  • Wenping Ma
  • Maoguo Gong
  • Hao Li
  • Biao Hou

Learning representations from numerous 2D image data has shown promising performance, yet very few works apply these representations to point cloud registration. In this paper, we explore how to leverage 2D information to assist point cloud registration, and propose IAPReg, an Image-Assisted Partial 3D point cloud Registration framework built on multi-view images generated from the input point cloud. It is expected to enrich 3D information with 2D knowledge, and to leverage that 2D knowledge to assist point cloud registration. Specifically, we create multi-view depth maps by projecting the input point cloud from several specific views, and then extract 2D and 3D features using well-established models. To fuse the information learned from the 2D and 3D modalities, an inter-modality multi-view learning module is proposed to enhance geometric information and complement semantic information. Weighted SVD is a common method to reduce the impact of inaccurate correspondences on registration; however, determining the correspondence weights is not trivial. Therefore, we design a 2D-weighted SVD method, where 2D knowledge is employed to provide correspondence weights. Extensive experiments show that our method outperforms state-of-the-art methods without additional 2D training data.

JAIR Journal 2025 Journal Article

Prioritised Planning: Completeness, Optimality, and Complexity

  • Jonathan Morag
  • Yue Zhang
  • Daniel Koyfman
  • Zhe Chen
  • Ariel Felner
  • Daniel Harabor
  • Roni Stern

Prioritised Planning (PP) is a popular approach for multi-agent and multi-robot navigation. In PP, collision-free paths are computed for one agent at a time, following a total order over the agents, called a priority ordering. Many MAPF algorithms follow this approach or use it in some way, including several state-of-the-art MAPF algorithms, although it is known that PP is neither complete nor optimal. In this work, we characterise the space of problems a PP algorithm can solve, and define the search problem of identifying whether a given MAPF problem is in that space. We call this search problem Prioritised MAPF (P-MAPF) and investigate its computational complexity, showing that it is generally NP-hard. Then, we develop a novel efficient search algorithm called Path and Priority Search (PaPS), which solves P-MAPF, providing guarantees of completeness and optimality. We next observe that PP algorithms operate with two primary degrees of freedom – the choice of priority ordering, and the choice of individual paths for agents. Accordingly, we further divide P-MAPF into two planning problems corresponding to the two degrees of freedom. We call them Priority-Function Constrained MAPF (PFC-MAPF), where the path choice is fixed while the priority ordering is not, and Priority Constrained MAPF (PC-MAPF), where the priority ordering is fixed while the path choice is not. We analyse these problems as well, and show how PaPS can be easily adapted to create algorithms that solve these problems optimally. We experiment with our algorithms in a range of settings, including comparisons with existing PP baselines. Our results show how the different degrees of freedom of PP-based algorithms affect their behaviour, and provide the first-known results for solution-quality optimality for PP-based algorithms on a popular MAPF benchmark set. The latter can be used as a lower bound for any PP algorithm.

IJCAI Conference 2025 Conference Paper

Screening, Rectifying, and Re-Screening: A Unified Framework for Tuning Vision-Language Models with Noisy Labels

  • Chaowei Fang
  • Hangfei Ma
  • Zhihao Li
  • De Cheng
  • Yue Zhang
  • Guanbin Li

Pre-trained vision-language models have shown remarkable potential for downstream tasks. However, their fine-tuning under noisy labels remains an open problem due to challenges like self-confirmation bias and the limitations of conventional small-loss criteria. In this paper, we propose a unified framework to address these issues, consisting of three key steps: Screening, Rectifying, and Re-Screening. First, a dual-level semantic matching mechanism is introduced to categorize samples into clean, ambiguous, and noisy samples by leveraging both macro-level and micro-level textual prompts. Second, we design tailored pseudo-labeling strategies to rectify noisy and ambiguous labels, enabling their effective incorporation into the training process. Finally, a re-screening step, utilizing cross-validation with an auxiliary vision-language model, mitigates self-confirmation bias and enhances the robustness of the framework. Extensive experiments across ten datasets demonstrate that the proposed method significantly outperforms existing approaches for tuning vision-language pre-trained models with noisy labels.

NeurIPS Conference 2025 Conference Paper

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

  • Xeron Du
  • Yifan Yao
  • Kaijing Ma
  • Bingli Wang
  • Tianyu Zheng
  • Minghao Liu
  • Yiming Liang
  • Xiaolong Jin

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model Gemini-2.5-Pro achieved the highest accuracy of 63.56% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

NeurIPS Conference 2025 Conference Paper

SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

  • Jinyang Li
  • Xiaolong Li
  • Ge Qu
  • Per Jacobsson
  • Bowen Qin
  • Binyuan Hui
  • Shuzheng Si
  • Nan Huo

Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current Large Language Models (LLMs), while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 carefully curated PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), which are distilled from authentic user issues and replayed within new environments to facilitate rigorous and contamination-free evaluation. Baseline evaluations on BIRD-CRITIC underscore the task's complexity, with the leading reasoning model O3-Mini achieving only a 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, open-source models for database tasks are crucial, as they can empower local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for elevating the capabilities of open-source models specifically for SQL issue debugging. This environment leverages an SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQLs. However, popular trajectory-based fine-tuning methods do not exploit substantial supervisory signals. We further propose f-Plan Boosting, which automatically extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to harvest and produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, BIRD-Fixer. Based on Qwen-2.5-Coder-14B, BIRD-Fixer raises its success rate to 38.11% on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing many leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, marking a significant step toward democratizing sophisticated SQL-debugging capabilities for both research and industry.

NeurIPS Conference 2025 Conference Paper

ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

  • Shulin Huang
  • Linyi Yang
  • Yan Song
  • Shawn Chen
  • Leyang Cui
  • Ziyu Wan
  • Qingcheng Zeng
  • Ying Wen

Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly evaluate the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most LLMs' performance is far from robust and that they face a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination. Our data and codes are available at https://github.com/huangshulin123/ThinkBench.
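One common way to realize the dynamic OOD generation described above is to re-sample the surface form of a seed problem while recomputing the gold answer; the abstract does not specify ThinkBench's exact scheme, so the following is purely an illustrative sketch (all names are hypothetical):

```python
import random

def perturb_addition_problem(a, b, seed=None):
    """Re-sample the operands of a seed arithmetic problem and recompute
    the gold answer, so a memorized (question, answer) pair no longer helps."""
    rng = random.Random(seed)
    a2, b2 = a + rng.randint(1, 99), b + rng.randint(1, 99)
    return f"What is {a2} + {b2}?", a2 + b2

question, answer = perturb_addition_problem(17, 25, seed=0)
```

Because the answer is recomputed rather than stored, every generated variant is contamination-free with respect to the seed item by construction.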

ICLR Conference 2025 Conference Paper

Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings

  • Di Wu 0057
  • Siyuan Li 0002
  • Chen Feng
  • Lu Cao
  • Yue Zhang
  • Jie Yang 0033
  • Mohamad Sawan

Recent advancements in brain-computer interfaces (BCIs) and deep learning have made decoding lexical tones from intracranial recordings possible, providing the potential to restore the communication ability of speech-impaired tonal language speakers. However, data heterogeneity induced by both physiological and instrumental factors poses a significant challenge for unified invasive brain tone decoding. In particular, the existing heterogeneous decoding paradigm (training subject-specific models with individual data) suffers from the intrinsic limitation that it fails to learn generalized neural representations or to leverage data across subjects. To this end, we introduce Homogeneity-Heterogeneity Disentangled Learning for Neural Representations (H2DiLR), a framework that disentangles and learns the homogeneity and heterogeneity from intracranial recordings of multiple subjects. To verify the effectiveness of H2DiLR, we collected stereoelectroencephalography (sEEG) from multiple participants reading Mandarin materials containing 407 syllables (covering nearly all Mandarin characters). Extensive experiments demonstrate that H2DiLR, as a unified decoding paradigm, outperforms the naive heterogeneous decoding paradigm by a large margin. We also empirically show that H2DiLR indeed captures homogeneity and heterogeneity during neural representation learning.

AAAI Conference 2025 Conference Paper

Unaligned Message-Passing and Contextualized-Pretraining for Robust Geo-Entity Resolution

  • Yuwen Ji
  • Wenbo Xie
  • Jiaqi Zhang
  • Chao Wang
  • Ning Guo
  • Lei Shi
  • Yue Zhang

Geo-entity resolution involves linking records that refer to the same entities across different spatial datasets, which underpins location-based services. Given the varying quality of geo-data, this task is known to be challenging, as directly comparing the semantic-centric representations of two entities is no longer reliable. To robustify geo-entity resolution in this context, the main research question is how to effectively extend the current semantics-centric representations of a geo-entity with geographical context from its spatial neighbors. Existing methods consider names from neighbors, but they struggle to fully utilize the unaligned neighbor attributes. In this paper, we study the representation of geo-context for robust geo-entity resolution and propose two adaptations that efficiently leverage unaligned geo-entity attributes across spatial neighbors: (1) a plugin module, namely Unaligned Message-Passing (UMP), that propagates unaligned neighbor features to integrate geo-context into the token embeddings output by the language model; (2) a contextualized pretraining framework (CP) that allows the former to leverage unlabelled geo-entity data. Experiments show that our method surpasses the baselines, achieving higher F1 scores on 8 real-world geo-datasets in terms of robustness, with an improvement of up to 7.9%. The ablation study further justifies our proposal.

TIST Journal 2024 Journal Article

A Survey on Evaluation of Large Language Models

  • Yupeng Chang
  • Xu Wang
  • Jindong Wang
  • Yuan Wu
  • Linyi Yang
  • Kaijie Zhu
  • Hao Chen
  • Xiaoyuan Yi

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey

TMLR Journal 2024 Journal Article

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

  • Yue Chang
  • Liqiang Jing
  • Xiaopeng Zhang
  • Yue Zhang

Hallucination is a common problem for Large Vision-Language Models (LVLMs) with long generations, and it is difficult to eradicate: generations with hallucinations are partially inconsistent with the image content. To mitigate hallucination, current studies focus either on the process of model inference or on the results of model generation, but the solutions they design sometimes do not deal appropriately with various types of queries and the hallucinations in the generations these queries produce. To accurately deal with various hallucinations, we present a unified framework, Dentist, for hallucination mitigation. The core step is to first classify the queries, then perform different processes of hallucination mitigation based on the classification result, just like a dentist who first observes the teeth and then makes a plan. In a simple deployment, Dentist can classify queries as perception or reasoning and easily mitigate potential hallucinations in answers, as demonstrated in our experiments. On MMbench, we achieve a 13.44%/10.2%/15.8% improvement in accuracy on Image Quality, a Coarse Perception visual question answering (VQA) task, over the baseline InstructBLIP/LLaVA/VisualGLM. Our source code will be released on GitHub.

AAMAS Conference 2024 Conference Paper

Anytime Multi-Agent Path Finding using Operation Parallelism in Large Neighborhood Search

  • Shao-Hung Chan
  • Zhe Chen
  • Dian-Lun Lin
  • Yue Zhang
  • Daniel Harabor
  • Sven Koenig
  • Tsung-Wei Huang
  • Thomy Phan

Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths for multiple agents in a shared environment while improving the solution quality. The state-of-the-art anytime MAPF algorithm is based on Large Neighborhood Search (MAPF-LNS), which is a combinatorial search algorithm that iteratively destroys and repairs a subset of collision-free paths. In this paper, we propose Destroy-Repair Operation Parallelism for MAPF-LNS (DROP-LNS), a parallel framework that performs multiple destroy and repair operations concurrently to explore more regions of the search space and improve the solution quality. Unlike MAPF-LNS, DROP-LNS is able to exploit multiple threads during the search. The results show that DROP-LNS outperforms the state-of-the-art anytime MAPF algorithms, namely MAPF-LNS and LaCAM*, with respect to solution quality when terminated at the same runtime.
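The destroy-and-repair loop at the heart of LNS can be caricatured as follows. This toy sketch replaces actual multi-agent replanning with random cost perturbation, and it omits the thread-level parallelism that distinguishes DROP-LNS; every name is illustrative, not the authors' implementation:

```python
import random

def lns(costs, iters=200, k=2, seed=0):
    """Toy anytime LNS: repeatedly destroy k entries of the solution and
    repair them, accepting the repair only if total cost does not increase."""
    rng = random.Random(seed)
    costs = list(costs)
    for _ in range(iters):
        subset = rng.sample(range(len(costs)), k)        # destroy: pick agents to replan
        proposal = {i: costs[i] * rng.uniform(0.8, 1.1)  # repair: stand-in for replanning
                    for i in subset}
        if sum(proposal.values()) <= sum(costs[i] for i in subset):
            for i, c in proposal.items():                # accept non-worsening repairs
                costs[i] = c
    return costs

start = [10.0, 12.0, 8.0, 15.0]
better = lns(start)
```

Because repairs are accepted only when they do not worsen the sum of costs, solution quality improves monotonically with runtime, which is what makes the scheme "anytime".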

NeurIPS Conference 2024 Conference Paper

AutoSurvey: Large Language Models Can Automatically Write Surveys

  • Yidong Wang
  • Qi Guo
  • Wenjin Yao
  • Hongbo Zhang
  • Xin Zhang
  • Zhen Wu
  • Meishan Zhang
  • Xinyu Dai

This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys in rapidly evolving fields like artificial intelligence. Traditional survey paper creation faces challenges due to the vast volume and complexity of information, prompting the need for efficient survey methods. While large language models (LLMs) offer promise in automating this process, challenges such as context window limitations, parametric knowledge constraints, and the lack of evaluation benchmarks remain. AutoSurvey addresses these challenges through a systematic approach that involves initial retrieval and outline generation, subsection drafting by specialized LLMs, integration and refinement, and rigorous evaluation and iteration. Our contributions include a comprehensive solution to the survey problem, a reliable evaluation method, and experimental validation demonstrating AutoSurvey's effectiveness.

NeurIPS Conference 2024 Conference Paper

Can Language Models Learn to Skip Steps?

  • Tengxiao Liu
  • Qipeng Guo
  • Xiangkun Hu
  • Cheng Jiayang
  • Yue Zhang
  • Xipeng Qiu
  • Zheng Zhang

Trained on vast corpora of human language, language models demonstrate emergent human-like reasoning abilities. Yet they are still far from true intelligence, which opens up intriguing opportunities to explore the parallels of humans and model behaviors. In this work, we study the ability to skip steps in reasoning—a hallmark of human expertise developed through practice. Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not inherently possess such motivations to minimize reasoning steps. To address this, we introduce a controlled framework that stimulates step-skipping behavior by iteratively refining models to generate shorter yet accurate reasoning paths. Empirical results indicate that models can develop the step-skipping ability under our guidance. Moreover, after fine-tuning on expanded datasets that include both complete and skipped reasoning sequences, the models can not only resolve tasks with increased efficiency without sacrificing accuracy, but also exhibit comparable and even enhanced generalization capabilities in out-of-domain scenarios. Our work presents the first exploration into human-like step-skipping ability and provides fresh perspectives on how such cognitive abilities can benefit AI models.

JAIR Journal 2024 Journal Article

Cross-domain Constituency Parsing by Leveraging Heterogeneous Data

  • Peiming Guo
  • Meishan Zhang
  • Yulong Chen
  • Jianling Li
  • Min Zhang
  • Yue Zhang

Knowledge transfer is investigated in various natural language processing tasks except cross-domain constituency parsing. In this paper, we leverage heterogeneous data to transfer cross-domain and cross-task knowledge to constituency parsing. Concretely, we first select language modeling, named entity recognition, CCG supertagging and dependency parsing as auxiliary tasks and collect the corpora of these tasks covering various domains as cross-domain and cross-task heterogeneous data. Second, we exploit three types of prefixes: shared, task and domain prefix, to merge cross-domain and cross-task data and decompose the general, task and domain representation in the pretrained language model. Third, we convert the data formats of multi-source heterogeneous datasets and loss objectives of the auxiliary tasks into a consistent formalization closer to constituency parsing. Finally, we jointly train the model to transfer task and domain knowledge to cross-domain constituency parsing. We verify the effectiveness of our proposed model on five target domains of MCTB. Experimental results show that our knowledge transfer model outperforms various baseline models, including conventional chart-based and transition-based parsers and the current large-scale language model for zero-shot and few-shot settings.

NeurIPS Conference 2024 Conference Paper

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

  • Yu Zhang
  • Songlin Yang
  • Ruijie Zhu
  • Yue Zhang
  • Leyang Cui
  • Yiqiao Wang
  • Bolun Wang
  • Freda Shi

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in ``finetuning pretrained Transformers to RNNs'' (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
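The gated recurrent read/write pattern alluded to above can be caricatured with scalar slots. Real GSA uses vector-valued keys and values per slot, input-dependent gates, and a hardware-efficient chunked training algorithm, none of which appear in this deliberately tiny, purely illustrative sketch:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gsa_step(K, V, k_t, v_t, alpha_t):
    """One toy recurrent step over m scalar memory slots:
    gated write into slot keys/values, then a softmax read."""
    # adaptive forgetting: per-slot convex blend of old content and new input
    K = [a * K_i + (1 - a) * k_t for K_i, a in zip(K, alpha_t)]
    V = [a * V_i + (1 - a) * v_t for V_i, a in zip(V, alpha_t)]
    # context-aware memory read: softmax over query-slot scores
    w = softmax([K_i * k_t for K_i in K])
    out = sum(w_i * V_i for w_i, V_i in zip(w, V))
    return K, V, out

slots_K, slots_V = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
slots_K, slots_V, out = gsa_step(slots_K, slots_V, k_t=1.0, v_t=2.0,
                                 alpha_t=[0.9, 0.5, 0.1])
```

The softmax read is the part the abstract highlights as retained from standard attention: the output is always a convex combination of slot values, which bounds it by the slot contents.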

AAAI Conference 2024 Conference Paper

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

  • Yue Zhang
  • Ming Zhang
  • Haipeng Yuan
  • Shichun Liu
  • Yongyao Shi
  • Tao Gui
  • Qi Zhang
  • Xuanjing Huang

Recently, the evaluation of Large Language Models has emerged as a popular area of research. The three crucial questions for LLM evaluation are ``what, where, and how to evaluate''. However, the existing research mainly focuses on the first two questions, which are basically what tasks to give the LLM during testing and what kind of knowledge it should deal with. As for the third question, which is about what standards to use, the types of evaluators, how to score, and how to rank, there has not been much discussion. In this paper, we analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourcing, public annotators and GPT-4, with different scoring methods and ranking systems. We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results. We perform comparisons and analyses of different settings and draw 10 conclusions that can provide some insights for evaluating LLMs in the future. The dataset and the results are publicly available at https://github.com/llmeval. The version with the appendix is publicly available at https://arxiv.org/abs/2312.07398.

ICAPS Conference 2024 Conference Paper

More Flexible Proximity Wildcards Path Planning with Compressed Path Databases

  • Xi Chen
  • Yue Zhang
  • Yonggang Zhang

Grid-based path planning is one of the classic problems in AI, and a popular topic in application areas such as computer games and robotics. Compressed Path Databases (CPDs) are recognized as a state-of-the-art method for grid-based path planning. It is able to find an optimal path extremely fast without state-space search. In recent years, researchers have tended to focus on improving CPDs by reducing CPD size or improving search performance. Among various methods, proximity wildcards are one of the most proven improvements in reducing the size of CPD. However, its proximity area is significantly restricted by complex terrain, which significantly affects the pathfinding efficiency and causes additional costs. In this paper, we enhance CPDs from the perspective of improving search efficiency and reducing search costs. Our work focuses on using more flexible methods to obtain larger proximity areas, so that more heuristic information can be used to improve search performance. Experiments conducted on the Grid-Based Path Planning Competition (GPPC) benchmarks demonstrate that the two proposed methods can effectively improve search efficiency and reduce search costs by up to 3 orders of magnitude. Remarkably, our methods can further reduce the storage cost, and improve the compression capability of CPDs simultaneously.

NeurIPS Conference 2024 Conference Paper

RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

  • Dongyu Ru
  • Lin Qiu
  • Xiangkun Hu
  • Tianhang Zhang
  • Peng Shi
  • Shuaichen Chang
  • Cheng Jiayang
  • Cunxiang Wang

Although Retrieval-Augmented Generation (RAG) has shown promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, the evaluation of long-form responses, and the reliability of measurements. In this paper, we propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules. Meta evaluation verifies that RAGChecker has significantly better correlations with human judgments than other evaluation metrics. Using RAGChecker, we evaluate 8 RAG systems and conduct an in-depth analysis of their performance, revealing insightful patterns and trade-offs in the design choices of RAG architectures. The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.

TMLR Journal 2024 Journal Article

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

  • Yue Zhang
  • Ziqiao Ma
  • Jialu Li
  • Yanyuan Qiao
  • Zun Wang
  • Joyce Chai
  • Qi Wu
  • Mohit Bansal

Vision-and-Language Navigation (VLN) has gained increasing attention over recent years and many approaches have emerged to advance their development. The remarkable achievements of foundation models have shaped the challenges and proposed methods for VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes the current methods and future opportunities leveraging foundation models to address VLN challenges. We hope our in-depth discussions could provide valuable resources and insights: on one hand, to document the progress and explore opportunities and potential roles for foundation models in this field, and on the other, to organize different challenges and solutions in VLN to foundation model researchers.

NeurIPS Conference 2023 Conference Paper

Evaluating Open-QA Evaluation

  • Cunxiang Wang
  • Sirui Cheng
  • Qipeng Guo
  • Yuanhao Yue
  • Bowen Ding
  • Zhikun Xu
  • Yidong Wang
  • Xiangkun Hu

This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs). Current automatic evaluation methods have shown limitations, indicating that human evaluation still remains the most reliable approach. We introduce a new task, QA Evaluation (QA-Eval), and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA. Our evaluation of these methods utilizes human-annotated results to measure their performance. Specifically, the work investigates methods that show high correlation with human evaluations, deeming them more reliable. We also discuss the pitfalls of current methods and ways to improve LLM-based evaluators. We believe this new QA-Eval task and corresponding dataset EVOUNA will facilitate the development of more effective automatic evaluation tools and prove valuable for future research in this area. All resources are available at https://github.com/wangcunxiang/QA-Eval and are under the Apache-2.0 License.

AAAI Conference 2023 Conference Paper

GLUECons: A Generic Benchmark for Learning under Constraints

  • Hossein Rajaby Faghihi
  • Aliakbar Nafar
  • Chen Zheng
  • Roshanak Mirzaee
  • Yue Zhang
  • Andrzej Uszok
  • Alexander Wan
  • Tanawan Premsri

Recent research has shown that integrating domain knowledge into deep learning architectures is effective: it helps reduce the amount of required data, improves the accuracy of the models' decisions, and improves the interpretability of models. However, the research community lacks a convened benchmark for systematically evaluating knowledge integration methods. In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision. In all cases, we model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints. We report the results of these models using a new set of extended evaluation criteria in addition to the task performances for a more in-depth analysis. This effort provides a framework for a more comprehensive and systematic comparison of constraint integration techniques and for identifying related research challenges. It will facilitate further research for alleviating some problems of state-of-the-art neural models.

IJCAI Conference 2023 Conference Paper

HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

  • Runzhong Zhang
  • Suchen Wang
  • Yueqi Duan
  • Yansong Tang
  • Yue Zhang
  • Yap-Peng Tan

In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network to predict the action of each frame with the neighboring frames. However, this would result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information to distinguish ambiguous actions, where our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts the parameters based on the HOI information of various videos on the fly. Extensive experiments on two widely-used datasets including Breakfast and 50Salads demonstrate the effectiveness of our method under different evaluation metrics.

AAAI Conference 2023 Conference Paper

Instance Smoothed Contrastive Learning for Unsupervised Sentence Embedding

  • Hongliang He
  • Junlei Zhang
  • Zhenzhong Lan
  • Yue Zhang

Contrastive learning-based methods, such as unsup-SimCSE, have achieved state-of-the-art (SOTA) performances in learning unsupervised sentence embeddings. However, in previous studies, each embedding used for contrastive learning only derived from one sentence instance, and we call these embeddings instance-level embeddings. In other words, each embedding is regarded as a unique class of its own, which may hurt the generalization performance. In this study, we propose IS-CSE (instance smoothing contrastive sentence embedding) to smooth the boundaries of embeddings in the feature space. Specifically, we retrieve embeddings from a dynamic memory buffer according to the semantic similarity to get a positive embedding group. Then embeddings in the group are aggregated by a self-attention operation to produce a smoothed instance embedding for further analysis. We evaluate our method on standard semantic text similarity (STS) tasks and achieve an average of 78.30%, 79.47%, 77.73%, and 79.42% Spearman’s correlation on the base of BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large respectively, a 2.05%, 1.06%, 1.16% and 0.52% improvement compared to unsup-SimCSE.
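The retrieve-then-aggregate step of instance smoothing can be sketched as follows. Here a softmax over cosine similarities stands in for the paper's self-attention aggregation, and the buffer is a plain list rather than a dynamic memory; treat every detail as an assumption, not the authors' implementation:

```python
import math

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return num / (na * nb)

def smoothed_embedding(anchor, buffer, k=3):
    """Retrieve the k buffer embeddings most similar to the anchor and
    average them with softmax(similarity) weights."""
    top = sorted(((cos(anchor, e), e) for e in buffer),
                 key=lambda t: t[0], reverse=True)[:k]
    mx = max(s for s, _ in top)
    ws = [math.exp(s - mx) for s, _ in top]      # numerically stable softmax
    z = sum(ws)
    dim = len(anchor)
    return [sum(w * e[d] for w, (_, e) in zip(ws, top)) / z
            for d in range(dim)]
```

The smoothed embedding is a convex combination of retrieved neighbors, so the resulting positive is no longer tied to a single instance, which is the boundary-smoothing effect the abstract describes.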

JBHI Journal 2022 Journal Article

DeepRecS: From RECIST Diameters to Precise Liver Tumor Segmentation

  • Yue Zhang
  • Chengtao Peng
  • Liying Peng
  • Yingying Xu
  • Lanfen Lin
  • Ruofeng Tong
  • Zhiyi Peng
  • Xiongwei Mao

Liver tumor segmentation (LiTS) is of primary importance in diagnosis and treatment of hepatocellular carcinoma. Known automated LiTS methods could not yield satisfactory results for clinical use since they were hard to model flexible tumor shapes and locations. In clinical practice, radiologists usually estimate tumor shape and size by a Response Evaluation Criteria in Solid Tumor (RECIST) mark. Inspired by this, in this paper, we explore a deep learning (DL) based interactive LiTS method, which incorporates guidance from user-provided RECIST marks. Our method takes a three-step framework to predict liver tumor boundaries. Under this architecture, we develop a RECIST mark propagation network (RMP-Net) to estimate RECIST-like marks in off-RECIST slices. We also devise a context-guided boundary-sensitive network (CGBS-Net) to distill tumors’ contextual and boundary information from corresponding RECIST(-like) marks, and then predict tumor maps. To further refine the segmentation results, we process the tumor maps using a 3D conditional random field (CRF) algorithm and a morphology hole-filling operation. Verified on two clinical contrast-enhanced abdomen computed tomography (CT) image datasets, our proposed approach can produce promising segmentation results, and outperforms the state-of-the-art interactive segmentation methods.

AAAI Conference 2022 Conference Paper

DeepThermal: Combustion Optimization for Thermal Power Generating Units Using Offline Reinforcement Learning

  • Xianyuan Zhan
  • Haoran Xu
  • Yue Zhang
  • Xiangyu Zhu
  • Honglei Yin
  • Yu Zheng

Optimizing the combustion efficiency of a thermal power generating unit (TPGU) is a highly challenging and critical task in the energy industry. We develop a new data-driven AI system, namely DeepThermal, to optimize the combustion control strategy for TPGUs. At its core is a new model-based offline reinforcement learning (RL) framework, called MORE, which leverages historical operational data of a TPGU to solve a highly complex constrained Markov decision process problem via purely offline training. In DeepThermal, we first learn a data-driven combustion process simulator from the offline dataset. The RL agent of MORE is then trained by combining real historical data as well as carefully filtered and processed simulation data through a novel restrictive exploration scheme. DeepThermal has been successfully deployed in four large coal-fired thermal power plants in China. Real-world experiments show that DeepThermal effectively improves the combustion efficiency of TPGUs. We also report the superior performance of MORE by comparing with the state-of-the-art algorithms on the standard offline RL benchmarks.

AAAI Conference 2022 Conference Paper

NumHTML: Numeric-Oriented Hierarchical Transformer Model for Multi-Task Financial Forecasting

  • Linyi Yang
  • Jiazheng Li
  • Ruihai Dong
  • Yue Zhang
  • Barry Smyth

Financial forecasting has been an important and active area of machine learning research because of the challenges it presents and the potential rewards that even minor improvements in prediction accuracy or forecasting may entail. Traditionally, financial forecasting has heavily relied on quantitative indicators and metrics derived from structured financial statements. Earnings conference call data, including text and audio, is an important source of unstructured data that has been used for various prediction tasks using deep learning and related approaches. However, current deep learning-based methods are limited in the way that they deal with numeric data; numbers are typically treated as plain-text tokens without taking advantage of their underlying numeric structure. This paper describes a numeric-oriented hierarchical transformer model (NumHTML) to predict stock returns and financial risk using multi-modal aligned earnings calls data by taking advantage of the different categories of numbers (monetary, temporal, percentages, etc.) and their magnitude. We present the results of a comprehensive evaluation of NumHTML against several state-of-the-art baselines using a real-world publicly available dataset. The results indicate that NumHTML significantly outperforms the current state-of-the-art across a variety of evaluation metrics and that it has the potential to offer significant financial gains in a practical trading context.

ICML Conference 2022 Conference Paper

Supervised Off-Policy Ranking

  • Yue Jin
  • Yue Zhang
  • Tao Qin 0001
  • Xudong Zhang 0001
  • Jian Yuan
  • Houqiang Li
  • Tie-Yan Liu

Off-policy evaluation (OPE) is to evaluate a target policy with data generated by other policies. Most previous OPE methods focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or multiple candidate policies and choose a good one, which is a much simpler task than precisely evaluating their true performance; and (2) there are usually multiple policies that have been deployed to serve users in real-world systems and thus the true performance of these policies can be known. Inspired by the two observations, in this work, we study a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of target policies based on supervised learning by leveraging off-policy data and policies with known performance. We propose a method to solve SOPR, which learns a policy scoring model by minimizing a ranking loss of the training policies rather than estimating the precise policy performance. The scoring model in our method, a hierarchical Transformer based model, maps a set of state-action pairs to a score, where the state of each pair comes from the off-policy data and the action is taken by a target policy on the state in an offline manner. Extensive experiments on public datasets show that our method outperforms baseline methods in terms of rank correlation, regret value, and stability. Our code is publicly available at GitHub.
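The core training signal described above, a ranking loss over training policies rather than an absolute value estimate, can be illustrated with a logistic pairwise loss. The actual scoring model is a hierarchical Transformer over state-action pairs; this toy scalar version, whose names are all hypothetical, keeps only the ranking objective:

```python
import math

def pairwise_ranking_loss(scores, perf):
    """Logistic pairwise loss: for every pair where policy i truly outperforms
    policy j, penalize the model unless score_i clearly exceeds score_j."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if perf[i] > perf[j]:
                loss += math.log(1.0 + math.exp(scores[j] - scores[i]))
                pairs += 1
    return loss / max(pairs, 1)
```

Minimizing this loss only requires the scores to order the policies correctly, which matches the paper's observation that ranking candidates is an easier task than precisely estimating each policy's true performance.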

NeurIPS Conference 2022 Conference Paper

USB: A Unified Semi-supervised Learning Benchmark for Classification

  • Yidong Wang
  • Hao Chen
  • Yue Fan
  • Wang Sun
  • Ran Tao
  • Wenxin Hou
  • Renjie Wang
  • Linyi Yang

Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples. However, currently, popular SSL evaluation protocols are often constrained to computer vision (CV) tasks. In addition, previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly. To address the above issues, we construct a Unified SSL Benchmark (USB) for classification by selecting 15 diverse, challenging, and comprehensive tasks from CV, natural language processing (NLP), and audio processing (Audio), on which we systematically evaluate the dominant SSL methods, and also open-source a modular and extensible codebase for fair evaluation of these SSL methods. We further provide the pre-trained versions of the state-of-the-art neural models for CV tasks to make the cost affordable for further tuning. USB enables the evaluation of a single SSL algorithm on more tasks from multiple domains but with less cost. Specifically, on a single NVIDIA V100, only 39 GPU days are required to evaluate FixMatch on 15 tasks in USB while 335 GPU days (279 GPU days on 4 CV datasets except for ImageNet) are needed on 5 CV tasks with TorchSSL.

IJCAI Conference 2022 Conference Paper

Visual Emotion Representation Learning via Emotion-Aware Pre-training

  • Yue Zhang
  • Wanying Ding
  • Ran Xu
  • Xiaohua Hu

Despite recent progress in deep learning, visual emotion recognition remains a challenging problem due to the ambiguity of emotion perception, the diversity of concepts related to visual emotion, and the lack of large-scale annotated datasets. In this paper, we present a large-scale multimodal pre-training method that learns visual emotion representations by aligning (emotion, object, attribute) triplets with a contrastive loss. We pre-train on a large web dataset with noisy tags and fine-tune on visual emotion classification datasets. Our method achieves state-of-the-art performance on visual emotion classification.

AAAI Conference 2021 Conference Paper

Brain Decoding Using fNIRS

  • Lu Cao
  • Dandan Huang
  • Yue Zhang
  • Xiaowei Jiang
  • Yanan Chen

Brain activation can reflect semantic information elicited by natural words and concepts. Increasing research has been conducted on decoding such neural activation patterns using representational semantic models. However, prior work on decoding semantic meaning from neurophysiological responses has been largely limited to ECoG, fMRI, MEG, and EEG techniques, each with its own advantages and limitations. More recently, functional near-infrared spectroscopy (fNIRS) has emerged as an alternative hemodynamic-based approach with a number of strengths. We investigate brain decoding tasks using fNIRS and empirically compare fNIRS with fMRI. Primarily, we find that: 1) like fMRI scans, activation patterns recorded from fNIRS encode rich information for discriminating concepts, but show limits on decoding fine-grained semantic clues; 2) fNIRS decoding is robust across brain regions, semantic categories, and even subjects; 3) decoding from multi-channel fNIRS patterns achieves higher accuracy than from single-channel ones, in line with our intuition about the working mechanism of the human brain. Our findings show that fNIRS has the potential to promote a deep integration of NLP and cognitive neuroscience from the perspective of language understanding. We release the largest fNIRS dataset to date to facilitate future research.

AAAI Conference 2021 Conference Paper

Contextualized Rewriting for Text Summarization

  • Guangsheng Bao
  • Yue Zhang

Extractive summarization suffers from irrelevance, redundancy, and incoherence. Existing work shows that abstractive rewriting of extractive summaries can improve conciseness and readability. These rewriting systems consider the extracted summary as the only input, which keeps them focused but can lose important background knowledge. In this paper, we investigate contextualized rewriting, which ingests the entire original document. We formalize contextualized rewriting as a seq2seq problem with group alignments, introducing group tags to model the alignments and identifying extracted summaries through content-based addressing. Results show that our approach significantly outperforms non-contextualized rewriting systems without requiring reinforcement learning, achieving strong ROUGE improvements over multiple extractive summarizers.

JBHI Journal 2021 Journal Article

MI-UNet: Multi-Inputs UNet Incorporating Brain Parcellation for Stroke Lesion Segmentation From T1-Weighted Magnetic Resonance Images

  • Yue Zhang
  • Jiong Wu
  • Yilong Liu
  • Yifan Chen
  • Ed X. Wu
  • Xiaoying Tang

Stroke is a serious manifestation of various cerebrovascular diseases and one of the most dangerous diseases in the world today. Volume quantification and location detection of chronic stroke lesions provide vital biomarkers for stroke rehabilitation. Recently, deep learning has seen rapid growth, with great potential for segmenting medical images. In this work, unlike most deep learning-based segmentation methods, which use only magnetic resonance (MR) images as input, we propose and validate a novel stroke lesion segmentation approach named multi-inputs UNet (MI-UNet) that incorporates brain parcellation information, including gray matter (GM), white matter (WM), and lateral ventricle (LV). The brain parcellation is obtained from 3D diffeomorphic registration and is concatenated with the original MR image to form a two-channel input to the subsequent MI-UNet. The effectiveness of the proposed pipeline is validated on a dataset of 229 T1-weighted MR images via five-fold cross-validation. The proposed MI-UNet performs significantly better than UNet in both 2D and 3D settings. Our best model, 3D MI-UNet, achieves superior segmentation performance over other state-of-the-art methods, as measured by the Dice score, Hausdorff distance, average symmetric surface distance, and precision.

AAAI Conference 2021 Conference Paper

Natural Language Inference in Context – Investigating Contextual Reasoning over Long Texts

  • Hanmeng Liu
  • Leyang Cui
  • Jian Liu
  • Yue Zhang

Natural language inference (NLI) is a fundamental NLP task that investigates the entailment relationship between two texts. Popular NLI datasets present the task at the sentence level. While adequate for testing semantic representations, they fall short of testing contextual reasoning over long texts, which is a natural part of the human inference process. We introduce ConTRoL, a new dataset for ConTextual Reasoning over Long texts. Consisting of 8,325 expert-designed “context-hypothesis” pairs with gold labels, ConTRoL is a passage-level NLI dataset focused on complex contextual reasoning types such as logical reasoning. It is derived from competitive selection and recruitment tests (verbal reasoning tests) for police recruitment, with expert-level quality. Compared with previous NLI benchmarks, the materials in ConTRoL are much more challenging, involving a range of reasoning types. Empirical results show that state-of-the-art language models perform far worse than educated humans. Our dataset can also serve as a test set for downstream tasks such as checking the factual correctness of summaries.

IJCAI Conference 2021 Conference Paper

When Computational Representation Meets Neuroscience: A Survey on Brain Encoding and Decoding

  • Lu Cao
  • Dandan Huang
  • Yue Zhang

The human language mechanism and artificial language processing methods are two independent systems. Exploring the relationship between the two can help develop human-like language models and also helps reveal the neuroscience of the reading brain. The flourishing research in this interdisciplinary field calls for surveys that systematically study and analyze the recent successes. However, such a comprehensive review is still missing, which motivates our work. This article first briefly introduces the interdisciplinary research progress, then systematically discusses the task of brain decoding from the perspective of simple concepts and complete sentences, describes the main limitations in this field, and puts forward possible solutions. Finally, we conclude this survey with open research questions that will stimulate further studies.

AAAI Conference 2020 Conference Paper

Alignment-Enhanced Transformer for Constraining NMT with Pre-Specified Translations

  • Kai Song
  • Kun Wang
  • Heng Yu
  • Yue Zhang
  • Zhongqiang Huang
  • Weihua Luo
  • Xiangyu Duan
  • Min Zhang

We investigate the task of constraining NMT with pre-specified translations, which has practical significance for a number of research and industrial applications. Existing work imposes pre-specified translations as lexical constraints during decoding, based on word alignments derived from target-to-source attention weights. However, multiple recent studies have found that word alignments derived from the generic attention heads of the Transformer are unreliable. We address this problem by introducing a dedicated head in the multi-head Transformer architecture to capture external supervision signals. Results on five language pairs show that our method is highly effective at constraining NMT with pre-specified translations, consistently outperforming previous methods in translation quality.
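As a rough illustration of the attention-to-alignment step this abstract refers to (a generic sketch in plain Python, not the paper's dedicated-head mechanism): a hard word alignment can be read off an attention matrix by taking, for each target position, the source position with the highest weight.

```python
def align_from_attention(attn):
    """Derive a hard alignment from attention weights (illustrative only).

    attn[t][s] is the attention weight from target position t to
    source position s. Each target position is aligned to the source
    position with the highest weight (rows need not be normalized).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in attn]
```

The studies cited in the abstract observe that such argmax alignments from generic attention heads are often unreliable, which motivates supervising a dedicated head instead.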

IJCAI Conference 2020 Conference Paper

CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP

  • Libo Qin
  • Minheng Ni
  • Yue Zhang
  • Wanxiang Che

Multi-lingual contextualized embeddings, such as multilingual BERT (mBERT), have shown success on a variety of zero-shot cross-lingual tasks. However, these models are limited by inconsistent contextualized representations of subwords across different languages. Existing work addresses this issue with bilingual projection and fine-tuning techniques. We propose a data augmentation framework that generates multi-lingual code-switching data to fine-tune mBERT, encouraging the model to align representations from the source and multiple target languages at once by mixing their context information. Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages. Experimental results on five tasks covering 19 languages show that our method significantly improves performance on all tasks compared with mBERT.

IJCAI Conference 2020 Conference Paper

Dialogue State Induction Using Neural Latent Variable Models

  • Qingkai Min
  • Libo Qin
  • Zhiyang Teng
  • Xiao Liu
  • Yue Zhang

Dialogue state modules are a useful component of a task-oriented dialogue system. Traditional methods find dialogue states by manually labeling training corpora, upon which neural models are trained. However, the labeling process can be costly, slow, error-prone, and, more importantly, cannot cover the vast range of domains in real-world customer service dialogues. We propose the task of dialogue state induction, building two neural latent variable models that mine dialogue states automatically from unlabeled customer service dialogue records. Results show that the models can effectively find meaningful dialogue states. In addition, equipped with induced dialogue states, a state-of-the-art dialogue system performs better than it does without a dialogue state module.

AAAI Conference 2020 Conference Paper

Evaluating Commonsense in Pre-Trained Language Models

  • Xuhui Zhou
  • Yue Zhang
  • Leyang Cui
  • Dandan Huang

Contextualized representations trained on large raw text corpora have given remarkable improvements on NLP tasks including question answering and reading comprehension. Prior work has shown that such representations contain syntactic, semantic, and word sense knowledge, which explains why they benefit these tasks. However, relatively little work has investigated the commonsense knowledge contained in contextualized representations, which is crucial for human question answering and reading comprehension. We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models’ commonsense ability, while bi-directional context and larger training sets are bonuses. We additionally find that current models do poorly on tasks that require more inference steps. Finally, we test the robustness of the models with dual test cases, which are correlated so that a correct prediction on one sample should lead to a correct prediction on the other. Interestingly, the models show confusion on these test cases, suggesting that they learn commonsense at the surface rather than at a deep level. We publicly release a test set, named CATs, for future research.

AAAI Conference 2020 Conference Paper

FAS-Net: Construct Effective Features Adaptively for Multi-Scale Object Detection

  • Jiangqiao Yan
  • Yue Zhang
  • Zhonghan Chang
  • Tengfei Zhang
  • Menglong Yan
  • Wenhui Diao
  • Hongqi Wang
  • Xian Sun

Feature pyramids are the mainstream method for multi-scale object detection. In most detectors with a feature pyramid, each proposal is predicted from feature grids pooled from only one feature level, which is assigned heuristically. Recent studies report that the feature representation extracted this way is sub-optimal, since it ignores valid information existing on the other, unselected layers of the feature pyramid. To address this issue, researchers propose fusing valid information across all feature levels. However, these methods can be further improved: the feature fusion strategies, which in most detectors use common operations (element-wise max or sum), should be replaced by a more flexible mechanism. In this work, a novel method called the feature adaptive selection subnetwork (FAS-Net) is proposed to construct effective features for detecting objects at different scales. Its adaptation operates on two levels: global attention and local adaptive selection. First, we model the global context of each feature map with a global attention based feature selection module (GAFSM), which adaptively strengthens the effective features in each layer. Then we extract the features of each region of interest (RoI) over the entire feature pyramid to construct an RoI feature pyramid. Finally, the RoI feature pyramid is sent to the feature adaptive selection module (FASM), which adaptively integrates the strengthened features according to the input. FAS-Net can be easily extended to other two-stage object detectors with feature pyramids, and supports quantitative analysis of the importance of different feature levels for multi-scale objects. FAS-Net can also be applied to the instance segmentation task with consistent improvements. Experiments on PASCAL07/12 and MSCOCO17 demonstrate the effectiveness and generalization of the proposed method.

IJCAI Conference 2020 Conference Paper

Inference-Masked Loss for Deep Structured Output Learning

  • Quan Guo
  • Hossein Rajaby Faghihi
  • Yue Zhang
  • Andrzej Uszok
  • Parisa Kordjamshidi

Structured learning algorithms usually involve an inference phase that selects the best global assignment of output variables based on the local scores of all possible assignments. We extend deep neural networks with structured learning to combine the power of learned representations with domain knowledge in the form of output constraints during training. Introducing a non-differentiable inference module into gradient-based training is a critical challenge. Compared to conventional loss functions that penalize every local error independently, we propose an inference-masked loss that takes into account the effect of inference and does not penalize local errors that can be corrected by the inference. We empirically show that the inference-masked loss combined with the negative log-likelihood loss improves performance on different tasks, namely entity relation recognition on the CoNLL04 and ACE2005 corpora, and spatial role labeling on the CLEF 2017 mSpRL dataset. We show the proposed approach helps to achieve better generalizability, particularly in the low-data regime.

AAAI Conference 2020 Conference Paper

Latent Emotion Memory for Multi-Label Emotion Classification

  • Hao Fei
  • Yue Zhang
  • Yafeng Ren
  • Donghong Ji

Identifying multiple emotions in a sentence is an important research topic. Existing methods usually model the problem as a multi-label classification task. However, previous methods have two issues that limit performance. First, these models do not consider the prior emotion distribution in a sentence. Second, they fail to effectively capture the context information closely related to the corresponding emotion. In this paper, we propose a Latent Emotion Memory network (LEM) for multi-label emotion classification. The proposed model can learn the latent emotion distribution without external knowledge and can effectively leverage it in the classification network. Experimental results on two benchmark datasets show that the proposed model outperforms strong baselines, achieving state-of-the-art performance.

IJCAI Conference 2020 Conference Paper

LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning

  • Jian Liu
  • Leyang Cui
  • Hanmeng Liu
  • Dandan Huang
  • Yile Wang
  • Yue Zhang

Machine reading is a fundamental task for testing the capability of natural language understanding and is closely related to human cognition in many aspects. With the rise of deep learning techniques, algorithmic models rival human performance on simple QA, and thus increasingly challenging machine reading datasets have been proposed. Though various challenges such as evidence integration and commonsense knowledge have been incorporated, one of the fundamental capabilities in human reading, namely logical reasoning, is not fully investigated. We build a comprehensive dataset, named LogiQA, sourced from expert-written questions for testing human logical reasoning. It consists of 8,678 QA instances, covering multiple types of deductive reasoning. Results show that state-of-the-art neural models perform far below the human ceiling. Our dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting. The dataset is freely available at https://github.com/lgw863/LogiQA-dataset.

AAAI Conference 2020 Conference Paper

Multi-Point Semantic Representation for Intent Classification

  • Jinghan Zhang
  • Yuxiao Ye
  • Yue Zhang
  • Likun Qiu
  • Bin Fu
  • Yang Li
  • Zhenglu Yang
  • Jian Sun

Detecting user intents from utterances is the basis of the natural language understanding (NLU) task. To understand the meaning of utterances, some work focuses on fully representing utterances via semantic parsing, whose annotation is labor-intensive. Other researchers simply view this as intent classification or frequently asked questions (FAQs) retrieval, but do not leverage the utterances shared among different intents. We propose a simple and novel multi-point semantic representation framework with relatively low annotation cost that leverages fine-grained factor information, decomposing queries into four factors, i.e., topic, predicate, object/condition, and query type. Besides, we propose a compositional intent bi-attention model under multi-task learning with three kinds of attention mechanisms among queries, labels, and factors, which jointly combines coarse-grained intent and fine-grained factor information. Extensive experiments show that our framework and model significantly outperform several state-of-the-art approaches, with an improvement of 1.35%-2.47% in accuracy.

AAAI Conference 2020 Conference Paper

Relation Extraction Exploiting Full Dependency Forests

  • Lifeng Jin
  • Linfeng Song
  • Yue Zhang
  • Kun Xu
  • Wei-Yun Ma
  • Dong Yu

Dependency syntax has long been recognized as a crucial source of features for relation extraction. Previous work considers the 1-best tree produced by a parser during preprocessing. However, error propagation from an out-of-domain parser may hurt relation extraction performance. We propose to leverage full dependency forests for this task, where a full dependency forest encodes all possible trees. Such representations provide a differentiable connection between the parser and the relation extraction model, so we are also able to study adjusting the parser parameters based on the end-task loss. Experiments on three datasets show that full dependency forests and parser adjustment give significant improvements over carefully designed baselines, with state-of-the-art or competitive performance on biomedical and newswire benchmarks.

JBHI Journal 2020 Journal Article

Semi-Supervised Learning for Semantic Segmentation of Emphysema With Partial Annotations

  • Liying Peng
  • Lanfen Lin
  • Hongjie Hu
  • Yue Zhang
  • Huali Li
  • Yutaro Iwamoto
  • Xian-Hua Han
  • Yen-Wei Chen

Segmentation and quantification of each subtype of emphysema is helpful to monitor chronic obstructive pulmonary disease. Due to the nature of emphysema (diffuse pulmonary disease), it is very difficult for experts to allocate semantic labels to every pixel in the CT images. In practice, partially annotating is a better choice for the radiologists to reduce their workloads. In this paper, we propose a new end-to-end trainable semi-supervised framework for semantic segmentation of emphysema with partial annotations, in which a segmentation network is trained from both annotated and unannotated areas. In addition, we present a new loss function, referred to as Fisher loss, to enhance the discriminative power of the model and successfully integrate it into our proposed framework. Our experimental results show that the proposed methods have superior performance over the baseline supervised approach (trained with only annotated areas) and outperform the state-of-the-art methods for emphysema segmentation.

AAAI Conference 2019 Conference Paper

A Neural Network Approach to Verb Phrase Ellipsis Resolution

  • Wei-Nan Zhang
  • Yue Zhang
  • Yuanxing Liu
  • Donglin Di
  • Ting Liu

Verb Phrase Ellipsis (VPE) is a linguistic phenomenon in which some verb phrases, as syntactic constituents, are omitted and typically referred to by an auxiliary verb. It is ubiquitous in both formal and informal text, such as news articles and dialogues. Previous work on VPE resolution mainly focused on manually constructing features extracted from auxiliary verbs, syntactic trees, etc. However, the optimization of feature representations, the effectiveness of continuous features, and the automatic composition of features are not well addressed. In this paper, we explore the advantages of neural models for VPE resolution in both pipeline and end-to-end processes, comparing statistical and neural models. Two neural models, a multi-layer perceptron and the Transformer, are employed for the subtasks of VPE detection and resolution. Experimental results show that the neural models outperform the state-of-the-art baselines in both subtasks and in the end-to-end results.

IJCAI Conference 2019 Conference Paper

Extracting Entities and Events as a Single Task Using a Transition-Based Neural Model

  • Junchi Zhang
  • Yanxia Qin
  • Yue Zhang
  • Mengchi Liu
  • Donghong Ji

The task of event extraction contains subtasks including detections for entity mentions, event triggers and argument roles. Traditional methods solve them as a pipeline, which does not make use of task correlation for their mutual benefits. There have been recent efforts towards building a joint model for all tasks. However, due to technical challenges, there has not been work predicting the joint output structure as a single task. We build a first model to this end using a neural transition-based framework, incrementally predicting complex joint structures in a state-transition process. Results on standard benchmarks show the benefits of the joint model, which gives the best result in the literature.

IJCAI Conference 2019 Conference Paper

Learning Interpretable Relational Structures of Hinge-loss Markov Random Fields

  • Yue Zhang
  • Arti Ramesh

Statistical relational models such as Markov logic networks (MLNs) and hinge-loss Markov random fields (HL-MRFs) are specified using templated weighted first-order logic clauses, leading to the creation of complex, yet easy to encode models that effectively combine uncertainty and logic. Learning the structure of these models from data reduces the human effort of identifying the right structures. In this work, we present an asynchronous deep reinforcement learning algorithm to automatically learn HL-MRF clause structures. Our algorithm possesses the ability to learn semantically meaningful structures that appeal to human intuition and understanding, while simultaneously being able to learn structures from data, thus learning structures that have both the desirable qualities of interpretability and good prediction performance. The asynchronous nature of our algorithm further provides the ability to learn diverse structures via exploration, while remaining scalable. We demonstrate the ability of the models to learn semantically meaningful structures that also achieve better prediction performance when compared with a greedy search algorithm, a path-based algorithm, and manually defined clauses on two computational social science applications: i) modeling recovery in alcohol use disorder, and ii) detecting bullying.

AAAI Conference 2019 Conference Paper

Who Blames Whom in a Crisis? Detecting Blame Ties from News Articles Using Neural Networks

  • Shuailong Liang
  • Olivia Nicol
  • Yue Zhang

Blame games tend to follow major disruptions, be they financial crises, natural disasters, or terrorist attacks. Studying how the blame game evolves and shapes the dominant crisis narratives is of great significance, as sense-making processes can affect regulatory outcomes, social hierarchies, and cultural norms. However, it takes tremendous time and effort for social scientists to manually examine each relevant news article and extract the blame ties (A blames B). In this study, we define a new task, Blame Tie Extraction, and construct a new dataset related to the United States financial crisis (2007-2010) from The New York Times, The Wall Street Journal, and USA Today. We build a bi-directional Long Short-Term Memory (BiLSTM) network over the contexts in which the entities appear, and it learns to automatically extract such blame ties at the document level. Leveraging large unsupervised models such as GloVe and ELMo, our best model achieves an F1 score of 70% on the test set for blame tie extraction, making it a useful tool for social scientists to extract blame ties more efficiently.

IJCAI Conference 2018 Conference Paper

Event Factuality Identification via Generative Adversarial Networks with Auxiliary Classification

  • Zhong Qian
  • Peifeng Li
  • Yue Zhang
  • Guodong Zhou
  • Qiaoming Zhu

Event factuality identification is an important semantic task in NLP. Traditional research heavily relies on annotated texts. This paper proposes a two-step framework, first extracting essential factors related with event factuality from raw texts as the input, and then identifying the factuality of events via a Generative Adversarial Network with Auxiliary Classification (AC-GAN). The use of AC-GAN allows the model to learn more syntactic information and address the imbalance among factuality values. Experimental results on FactBank show that our method significantly outperforms several state-of-the-art baselines, particularly on events with embedded sources, speculative and negative factuality values.

AAAI Conference 2018 Conference Paper

Improved English to Russian Translation by Neural Suffix Prediction

  • Kai Song
  • Yue Zhang
  • Min Zhang
  • Weihua Luo

Neural machine translation (NMT) suffers a performance deficiency when a limited vocabulary fails to cover the source or target side adequately, which happens frequently when dealing with morphologically rich languages. To address this problem, previous work focused on adjusting translation granularity or expanding the vocabulary size. However, morphological information is relatively under-considered in NMT architectures, and exploiting it may further improve translation quality. We propose a novel method which can not only reduce data sparsity but also model morphology through a simple but effective mechanism. By predicting the stem and suffix separately during decoding, our system achieves an improvement of up to 1.98 BLEU over previous work on English to Russian translation. Our method is orthogonal to different NMT architectures and stably gains improvements across various domains.

IJCAI Conference 2018 Conference Paper

Joint Extraction of Entities and Relations Based on a Novel Graph Scheme

  • Shaolei Wang
  • Yue Zhang
  • Wanxiang Che
  • Ting Liu

Both entity and relation extraction can benefit from being performed jointly, allowing each task to correct the errors of the other. Most existing neural joint methods extract entities and relations separately and achieve joint learning through parameter sharing, with the drawback that information between output entities and relations cannot be fully exploited. In this paper, we convert the joint task into a directed graph by designing a novel graph scheme and propose a transition-based approach that generates the directed graph incrementally, achieving joint learning through joint decoding. Our method can model underlying dependencies not only between entities and relations, but also among relations. Experiments on the New York Times (NYT) corpora show that our approach outperforms the state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Multi-Modal Multi-Task Learning for Automatic Dietary Assessment

  • Qi Liu
  • Yue Zhang
  • Zhenguang Liu
  • Ye Yuan
  • Li Cheng
  • Roger Zimmermann

We investigate the task of automatic dietary assessment: given meal images and descriptions uploaded by real users, our task is to automatically rate the meals and deliver advisory comments for improving users’ diets. To address this practical yet challenging problem, which is multi-modal and multi-task in nature, an end-to-end neural model is proposed. In particular, comprehensive meal representations are obtained from images, descriptions and user information. We further introduce a novel memory network architecture to store meal representations and reason over the meal representations to support predictions. Results on a real-world dataset show that our method outperforms two strong image captioning baselines significantly.

JAIR Journal 2018 Journal Article

Transition-Based Neural Word Segmentation Using Word-Level Features

  • Meishan Zhang
  • Yue Zhang
  • Guohong Fu

Character-based and word-based methods are two different solutions for Chinese word segmentation, the former exploiting sequence labeling models over characters and the latter using word-level features. Neural models have been exploited for character-based Chinese word segmentation, giving high accuracies by making use of external character embeddings, yet requiring less feature engineering. In this paper, we study a neural model for word-based Chinese word segmentation, by replacing the manually-designed discrete features with neural features in a transition-based word segmentation framework. Experimental results demonstrate that word features lead to comparable performance to the best systems in the literature, and a further combination of discrete and neural features obtains top accuracies on several benchmarks.
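The transition-based framework this abstract builds on can be sketched, in a simplified form, as a sequence of separate/append decisions over characters (a toy replay of gold actions in plain Python; the action names and the absence of any neural scoring are my simplifications, not the paper's model):

```python
def segment(chars, actions):
    """Replay a transition sequence into a word segmentation.

    'SEP' starts a new word at the current character; 'APP' appends
    the character to the last word. A learned model would score and
    choose these actions instead of reading them from a gold list.
    """
    words = []
    for ch, act in zip(chars, actions):
        if act == "SEP" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

Because each action conditions on the partially built words, such a transition system can use word-level features that a purely character-based sequence labeler cannot.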

IJCAI Conference 2017 Conference Paper

A Neural Model for Joint Event Detection and Summarization

  • Zhongqing Wang
  • Yue Zhang

Twitter new event detection aims to identify first stories in a tweet stream. Typical approaches consider two subtasks. First, mundane or irrelevant tweets must be filtered out. Second, tweets are grouped automatically into event clusters. Traditionally, these two subtasks are processed separately and integrated in a pipeline, despite the interdependence between them. A further related task is summarization, which extracts a succinct summary to represent a large group of tweets. Summarization is related to detection under the new-event setting in that salient information is shared between event-representing tweets and informative event summaries. In this paper, we build a joint model to filter, cluster, and summarize the tweets for new events. In particular, deep representation learning is used to vectorize tweets, which serves as the basis connecting the tasks. A neural stacking model is used to integrate the pipeline of subtasks and to improve sharing between predecessor and successor tasks. Experiments show that our proposed neural joint model is more effective than its pipeline baseline.

JAIR Journal 2017 Journal Article

A Neural Probabilistic Structured-Prediction Method for Transition-Based Natural Language Processing

  • Hao Zhou
  • Yue Zhang
  • Chuan Cheng
  • Shujian Huang
  • Xinyu Dai
  • Jiajun Chen

We propose a neural probabilistic structured-prediction method for transition-based natural language processing, which integrates beam search and contrastive learning. The method uses a global optimization model, which can leverage arbitrary features over non-local context. Beam search is used for efficient heuristic decoding, and contrastive learning is performed for adjusting the model according to search errors. When evaluated on both chunking and dependency parsing tasks, the proposed method achieves significant accuracy improvements over the locally normalized greedy baseline on both tasks.

IJCAI Conference 2017 Conference Paper

DDoS Event Forecasting using Twitter Data

  • Zhongqing Wang
  • Yue Zhang

Distributed Denial of Service (DDoS) attacks have been significant threats to the Internet. Traditional research in cyber security focuses on detecting emerging DDoS attacks by tracing network packet flow. A characteristic of DDoS defense is that rescue time is limited once an attack is launched, and more resilient detection and defense models are typically more costly. We aim at predicting the likelihood of DDoS attacks by monitoring relevant text streams in social media, so that the level of defense can be adjusted dynamically to maximize cost-effectiveness. To our knowledge, this is a novel and challenging research question for DDoS defense. Because the input of this task is a text stream rather than a document, information should be collected from the textual content of individual posts over time. We propose a fine-grained hierarchical stream model to capture semantic information over an infinitely long history and to reveal burstiness and trends. Empirical evaluation shows that social text streams are indeed informative for DDoS forecasting, and our proposed hierarchical model is more effective compared to strong baseline text stream models and discrete bag-of-words models.

IJCAI Conference 2016 Conference Paper

Bag-of-Embeddings for Text Classification

  • Peng Jin
  • Yue Zhang
  • Xingyuan Chen
  • Yunqing Xia

Words are central to text classification. It has been shown that simple Naive Bayes models with word and bigram features can give highly competitive accuracies when compared to more sophisticated models with part-of-speech, syntax and semantic features. Embeddings offer distributional features about words. We study a conceptually simple classification model by exploiting multi-prototype word embeddings based on text classes. The key assumption is that words exhibit different distributional characteristics under different text classes. Based on this assumption, we train multi-prototype distributional word representations for different text classes. Given a new document, its text class is predicted by maximizing the probabilities of embedding vectors of its words under the class. On two standard classification benchmark datasets, one balanced and the other imbalanced, our model outperforms state-of-the-art systems on both accuracy and macro-average F1 score.

AAAI Conference 2016 Conference Paper

Context-Sensitive Twitter Sentiment Classification Using Neural Network

  • Yafeng Ren
  • Yue Zhang
  • Meishan Zhang
  • Donghong Ji

Sentiment classification on Twitter has attracted increasing research in recent years. Most existing work focuses on feature engineering according to the tweet content itself. In this paper, we propose a context-based neural network model for Twitter sentiment analysis, incorporating contextualized features from relevant Tweets into the model in the form of word embedding vectors. Experiments on both balanced and unbalanced datasets show that our proposed models outperform the current state-of-the-art.

AAAI Conference 2016 Conference Paper

Dependency Tree Representations of Predicate-Argument Structures

  • Likun Qiu
  • Yue Zhang
  • Meishan Zhang

We present a novel annotation framework for representing predicate-argument structures, which uses dependency trees to encode the syntactic and semantic roles of a sentence simultaneously. The main contribution is a semantic role transmission model, which eliminates the structural gap between syntax and shallow semantics, making them compatible. A Chinese semantic treebank was built under the proposed framework, and the first release containing about 14K sentences is made freely available. The proposed framework enables semantic role labeling to be solved as a sequence labeling task, and experiments show that standard sequence labelers can give competitive performance on the new treebank compared with state-of-the-art graph structure models.

AAAI Conference 2016 Conference Paper

Gated Neural Networks for Targeted Sentiment Analysis

  • Meishan Zhang
  • Yue Zhang
  • Duy-Tin Vo

Targeted sentiment analysis classifies the sentiment polarity towards each target entity mention in given text documents. Seminal methods extract manual discrete features from automatic syntactic parse trees in order to capture semantic information of the enclosing sentence with respect to a target entity mention. Recently, it has been shown that competitive accuracies can be achieved without using syntactic parsers, which can be highly inaccurate on noisy text such as tweets. This is achieved by applying distributed word representations and rich neural pooling functions over a simple and intuitive segmentation of tweets according to target entity mentions. In this paper, we extend this idea by proposing a sentence-level neural model to address the limitation of pooling functions, which do not explicitly model tweet-level semantics. First, a bi-directional gated neural network is used to connect the words in a tweet so that pooling functions can be applied over the hidden layer instead of words for better representing the target and its contexts. Second, a three-way gated neural network structure is used to model the interaction between the target mention and its surrounding contexts. Experiments show that our proposed model gives significantly higher accuracies compared to the current best method for targeted sentiment analysis.

AAAI Conference 2016 Conference Paper

Improving Twitter Sentiment Classification Using Topic-Enriched Multi-Prototype Word Embeddings

  • Yafeng Ren
  • Yue Zhang
  • Meishan Zhang
  • Donghong Ji

It has been shown that learning distributed word representations is highly useful for Twitter sentiment classification. Most existing models rely on a single distributed representation for each word. This is problematic for sentiment classification because words are often polysemous and each word can contain different sentiment polarities under different topics. We address this issue by learning topic-enriched multi-prototype word embeddings (TMWE). In particular, we develop two neural networks which 1) learn word embeddings that better capture tweet context by incorporating topic information, and 2) learn topic-enriched multiple prototype embeddings for each word. Experiments on Twitter sentiment benchmark datasets in SemEval 2013 show that TMWE outperforms the top system with hand-crafted features, and the current best neural network model.

IJCAI Conference 2016 Conference Paper

Joint Models for Extracting Adverse Drug Events from Biomedical Text

  • Fei Li
  • Yue Zhang
  • Meishan Zhang
  • Donghong Ji

Extracting adverse drug events receives much research attention in the biomedical community. Previous work adopts pipeline models, firstly recognizing drug/disease entity mentions and then identifying adverse drug events from drug/disease pairs. In this paper, we investigate joint models for simultaneously extracting drugs, diseases and adverse drug events. Compared with pipeline models, joint models have two main advantages. First, they make use of information integration to facilitate performance improvement; second, they reduce error propagation in pipeline methods. We compare a discrete model and a deep neural model for extracting drugs, diseases and adverse drug events jointly. Experimental results on a standard ADE corpus show that the discrete joint model outperforms a state-of-the-art baseline pipeline significantly. In addition, when discrete features are replaced by neural features, the recall is further improved.

AAAI Conference 2016 Conference Paper

Joint Word Segmentation, POS-Tagging and Syntactic Chunking

  • Chen Lyu
  • Yue Zhang
  • Donghong Ji

Chinese chunking has traditionally been solved by assuming gold standard word segmentation. We find that the accuracies drop drastically when automatic segmentation is used. Inspired by the fact that chunking knowledge can potentially improve segmentation, we explore a joint model that performs segmentation, POS-tagging and chunking simultaneously. In addition, to address the sparsity of full chunk features, we employ a semi-supervised method to derive chunk cluster features from large-scale automatically-chunked data. Results show the effectiveness of the joint model with semi-supervised features.

ICML Conference 2016 Conference Paper

On the Consistency of Feature Selection With Lasso for Non-linear Targets

  • Yue Zhang
  • Weihong Guo
  • Soumya Ray

An important question in feature selection is whether a selection strategy recovers the “true” set of features, given enough data. We study this question in the context of the popular Least Absolute Shrinkage and Selection Operator (Lasso) feature selection strategy. In particular, we consider the scenario when the model is misspecified so that the learned model is linear while the underlying real target is nonlinear. Surprisingly, we prove that under certain conditions, Lasso is still able to recover the correct features in this case. We also carry out numerical studies to empirically verify the theoretical results and explore the necessity of the conditions under which the proof holds.

IJCAI Conference 2015 Conference Paper

Deep Learning for Event-Driven Stock Prediction

  • Xiao Ding
  • Yue Zhang
  • Ting Liu
  • Junwen Duan

We propose a deep learning method for event-driven stock market prediction. First, events are extracted from news text and represented as dense vectors, trained using a novel neural tensor network. Second, a deep convolutional neural network is used to model both short-term and long-term influences of events on stock price movements. Experimental results show that our model can achieve nearly 6% improvement on both S&P 500 index prediction and individual stock prediction, compared to state-of-the-art baseline methods. In addition, market simulation results show that our system is more capable of making profits than previously reported systems trained on S&P 500 stock historical data.

IJCAI Conference 2015 Conference Paper

Target-Dependent Twitter Sentiment Classification with Rich Automatic Features

  • Tin Duy Vo
  • Yue Zhang

Target-dependent sentiment analysis on Twitter has attracted increasing research attention. Most previous work relies on syntax, such as automatic parse trees, which are subject to noise for informal text such as tweets. In this paper, we show that competitive results can be achieved without the use of syntax, by extracting a rich set of automatic features. In particular, we split a tweet into a left context and a right context according to a given target, using distributed word representations and neural pooling functions to extract features. Both sentiment-driven and standard embeddings are used, and a rich set of neural pooling functions are explored. Sentiment lexicons are used as an additional source of information for feature extraction. In standard evaluation, the conceptually simple method gives a 4.8% absolute improvement over the state-of-the-art on three-way targeted sentiment classification, achieving the best reported results for this task.

AAAI Conference 2015 Conference Paper

Word Segmentation for Chinese Novels

  • Likun Qiu
  • Yue Zhang

Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but the accuracies drop significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is not a fixed set of domain terms. For example, each novel can contain a specific set of person, organization and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining for the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.

AAAI Conference 2014 Conference Paper

Joint Morphological Generation and Syntactic Linearization

  • Linfeng Song
  • Yue Zhang
  • Kai Song
  • Qun Liu

There has been growing interest in stochastic methods for natural language generation (NLG). While most NLG pipelines separate morphological generation and syntactic linearization, the two tasks are closely related. In this paper, we study joint morphological generation and linearization, making use of word order and inflection information for both tasks and reducing error propagation. Experiments show that the joint method significantly outperforms a strong pipelined baseline (by 1.1 BLEU points). It also achieves the best reported result on the Generation Challenge 2011 shared task.

IJCAI Conference 2013 Conference Paper

Partial-Tree Linearization: Generalized Word Ordering for Text Synthesis

  • Yue Zhang

We present partial-tree linearization, a generalized word ordering (i.e., ordering a set of input words into a grammatical and fluent sentence) task for text-to-text applications. Recent studies of word ordering can be categorized into either abstract word ordering (no input syntax except for POS) or tree linearization (input words are associated with a full unordered syntax tree). Partial-tree linearization covers the whole spectrum of input between these two extremes. By allowing POS and dependency relations to be associated with any subset of input words, partial-tree linearization is more practical for a dependency-based NLG pipeline, such as transfer-based MT and abstractive text summarization. In addition, a partial-tree linearizer can also perform abstract word ordering and full-tree linearization. Our system achieves the best published results on standard PTB evaluations of these tasks.