Arrow Research search

Author name cluster

Yao Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
1 author row

Possible papers (24)

AAAI Conference 2026 · Conference Paper

Exploiting Inter-Session Information with Frequency-enhanced Dual-Path Networks for Sequential Recommendation

  • Peng He
  • Yanglei Gan
  • Tingting Dai
  • Run Lin
  • Xuexin Li
  • Yao Liu
  • Qiao Liu

Sequential recommendation (SR) aims to predict a user's next item preference by modeling historical interaction sequences. Recent advances often integrate frequency-domain modules to compensate for self-attention's low-pass nature by restoring the high-frequency signals critical for personalized recommendations. Nevertheless, existing frequency-aware solutions process each session in isolation and optimize exclusively with time-domain objectives. Consequently, they overlook cross-session spectral dependencies and fail to enforce alignment between predicted and actual spectral signatures, leaving valuable frequency information under-exploited. To this end, we propose FreqRec, a Frequency-Enhanced Dual-Path Network for sequential Recommendation that jointly captures inter-session and intra-session behaviors via a learnable Frequency-domain Multi-layer Perceptron. Moreover, FreqRec is optimized under a composite objective that combines cross entropy with a frequency-domain consistency loss, explicitly aligning predicted and true spectral signatures. Extensive experiments on three benchmarks show that FreqRec surpasses strong baselines and remains robust under data sparsity and noisy-log conditions.
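A minimal sketch of how such a composite objective could look in code: the spectral term compares magnitude spectra of predicted and ground-truth embedding sequences via a real FFT along the time axis. The weighting `lam`, the magnitude-spectrum choice, and all names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def freq_consistency_loss(pred_seq: torch.Tensor, true_seq: torch.Tensor) -> torch.Tensor:
    # Compare magnitude spectra of predicted vs. ground-truth
    # embedding sequences (real FFT along the time axis).
    pred_spec = torch.fft.rfft(pred_seq, dim=1).abs()
    true_spec = torch.fft.rfft(true_seq, dim=1).abs()
    return F.mse_loss(pred_spec, true_spec)

def composite_loss(logits, targets, pred_seq, true_seq, lam=0.1):
    # Next-item cross-entropy plus the spectral alignment term.
    return F.cross_entropy(logits, targets) + lam * freq_consistency_loss(pred_seq, true_seq)
```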

AAAI Conference 2026 · Conference Paper

Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

  • Yao Cheng
  • Yibo Zhao
  • Jiapeng Zhu
  • Yao Liu
  • Xing Sun
  • Xiang Li

Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches, especially those based on vector similarity, fail to effectively capture relational dependencies and support multi-step reasoning. In this work, we propose CogGRAG, a human cognition-inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG models the reasoning process as a tree-structured mind map that decomposes the original problem into interrelated subproblems and explicitly encodes their semantic relationships. This structure not only provides a global view to guide subsequent retrieval and reasoning but also enables self-consistent verification across reasoning paths. The framework operates in three stages: (1) top-down problem decomposition via mind map construction, (2) structured retrieval of both local and global knowledge from external Knowledge Graphs (KGs), and (3) bottom-up reasoning with dual-process self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies problem decomposition, knowledge retrieval, and reasoning under a single graph-structured cognitive framework, allowing early integration of relational knowledge and adaptive verification. Extensive experiments demonstrate that CogGRAG achieves superior accuracy and reliability compared to existing methods.
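A minimal sketch of the three-stage loop the abstract describes, assuming hypothetical `llm.decompose`, `kg.retrieve`, and `llm.answer_with_verification` components; the abstract names the stages but not an API, so everything below is a stand-in.

```python
from dataclasses import dataclass, field

@dataclass
class MindMapNode:
    question: str
    children: list = field(default_factory=list)
    answer: str | None = None

def solve(question, llm, kg, max_depth=3, depth=0):
    """Hypothetical rendering of the three stages described above."""
    node = MindMapNode(question)
    # Stage 1: top-down decomposition into interrelated subproblems.
    if depth < max_depth:
        for sub in llm.decompose(question):
            node.children.append(solve(sub, llm, kg, max_depth, depth + 1))
    # Stage 2: structured retrieval of local and global KG knowledge.
    evidence = kg.retrieve(question)
    # Stage 3: bottom-up reasoning with dual-process self-verification.
    partials = [child.answer for child in node.children]
    node.answer = llm.answer_with_verification(question, evidence, partials)
    return node
```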

TIST Journal 2026 · Journal Article

Learning Causality-Aware Exploration with Transformers for Goal-Oriented Navigation

  • Ruoyu Wang
  • Tong Yu
  • Mingjie Li
  • Yuanjiang Cao
  • Yao Liu
  • Lina Yao

Navigation is a fundamental task in the research of Embodied AI, and recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain reveals opportunities for improvement. First, the direct application of RNNs and Transformers often overlooks the distinct characteristics of navigation tasks compared to traditional sequential data modeling. These methods are inherently designed to capture long-term dependencies, which are relatively weak in navigation scenarios, potentially limiting their performance in such tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by initially exploring the unique differences between Navigation tasks and other sequential data tasks through the lens of Causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for Navigation. By leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module to enhance the model’s Environmental Understanding capability. Meanwhile, our method is devoid of task-specific inductive biases and can be trained in an End-to-End manner, which enhances the method’s generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performances across a spectrum of settings, tasks, and simulation environments, specifically, in Object Navigation within RoboTHOR, Objective Navigation, Point Navigation in Habitat, and R2R Navigation. Extensive ablation studies reveal that the performance gains can be attributed to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both Reinforcement Learning and Supervised Learning settings. Additionally, further analysis highlights the robustness of our method, demonstrating its capacity to consistently perform well across diverse experimental settings and varying conditions. This robustness underscores the adaptability and generalizability of our approach, reinforcing its potential for application across a wide range of tasks.

AAAI Conference 2026 · Conference Paper

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence

  • Sonal Kumar
  • Šimon Sedláček
  • Vaibhavi Lokegaonkar
  • Fernando López
  • Wenyi Yu
  • Nishit Anand
  • Hyeonggon Ryu
  • Lichang Chen

Audio comprehension, including speech, non-speech sounds, and music, is essential for achieving human-level intelligence. Consequently, AI agents must demonstrate holistic audio understanding to qualify as generally intelligent. However, evaluating auditory intelligence comprehensively remains challenging. To address this gap, we introduce MMAU-Pro, the most comprehensive and rigorously curated benchmark for assessing audio intelligence in AI systems. MMAU-Pro contains 5,305 instances, where each instance has one or more audios paired with human expert-generated question-answer pairs, spanning speech, sound, music, and their combinations. Unlike existing benchmarks, MMAU-Pro evaluates auditory intelligence across 49 unique skills and multiple complex dimensions, including long-form audio comprehension, spatial audio reasoning, and multi-audio understanding, among others. All questions are meticulously designed to require deliberate multi-hop reasoning and include both multiple-choice and open-ended response formats. Importantly, audio data is sourced directly "from the wild" rather than from existing datasets with known distributions. We evaluate 22 leading open-source and proprietary multimodal AI models, revealing significant limitations: even state-of-the-art models such as Gemini 2.5 Flash and Audio Flamingo 3 achieve only 57.33% and 45.9% accuracy, respectively, approaching random performance in multiple categories. Our extensive analysis highlights specific shortcomings and provides novel insights, offering actionable perspectives for the community to enhance future AI systems' progression toward audio general intelligence.

NeurIPS Conference 2025 · Conference Paper

Ask a Strong LLM Judge when Your Reward Model is Uncertain

  • Zhenghao Xu
  • Qin Lu
  • Qingru Zhang
  • Liang Qiu
  • Ilgee Hong
  • Changlong Yu
  • Wenlin Yao
  • Yao Liu

The reward model (RM) plays a pivotal role in reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
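A small sketch of what uncertainty-based routing could look like for a Bradley-Terry-style RM: route a pair to the judge when the preference probability sits too close to the decision boundary. The sigmoid preference model and the fixed `threshold` are assumptions for illustration, not the paper's exact uncertainty quantifier.

```python
import torch

def route_preferences(rm_score_a, rm_score_b, threshold=0.1):
    # Bradley-Terry style preference probability from reward scores.
    p_a = torch.sigmoid(rm_score_a - rm_score_b)
    # Confidence margin: distance from the 0.5 decision boundary.
    margin = (p_a - 0.5).abs()
    to_judge = margin < threshold   # uncertain -> ask the LLM judge
    prefer_a = p_a >= 0.5           # confident -> trust the RM's call
    return to_judge, prefer_a
```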

TMLR Journal 2025 · Journal Article

Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens

  • Zhepeng Cen
  • Yao Liu
  • Siliang Zeng
  • Pratik Chaudhari
  • Huzefa Rangwala
  • George Karypis
  • Rasool Fakoor

Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset. However, during inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one. Marginal differences in predictions at each step can cascade over successive steps, resulting in distributions different from what the models were trained on and potentially leading to unpredictable behavior. This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time. Our first approach is Batch-Scheduled Sampling, where, during training, we stochastically choose between the ground-truth token from the dataset and the model's own generated token as input to predict the next token. This is done in an offline manner, modifying the context window by interleaving ground-truth tokens with those generated by the model. Our second approach is Reference-Answer-based Correction, where we explicitly incorporate a self-correction capability into the model during training. This enables the model to effectively self-correct the gaps between the generated sequences and the ground-truth data without relying on an external oracle model. By incorporating our proposed strategies during training, we have observed an overall improvement in performance compared to baseline methods, as demonstrated by our extensive experiments on summarization, general question-answering, and math question-answering tasks.
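A minimal sketch of the scheduled-sampling idea, assuming a per-token Bernoulli mixing rate `p_gen` between ground-truth and (offline) model-generated tokens; the paper's actual batching and schedule may differ.

```python
import torch

def scheduled_inputs(gt_tokens: torch.Tensor, gen_tokens: torch.Tensor, p_gen: float = 0.25):
    """Interleave ground-truth tokens with the model's own offline
    generations to build the training context (illustrative sketch)."""
    # Per-position coin flip: with probability p_gen, feed the
    # model's previously generated token instead of the dataset token.
    mask = torch.rand_like(gt_tokens, dtype=torch.float) < p_gen
    return torch.where(mask, gen_tokens, gt_tokens)
```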

IJCAI Conference 2025 · Conference Paper

D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning

  • Jia Zhang
  • Chen-Xi Zhang
  • Yao Liu
  • Yi-Xuan Jin
  • Xiao-Wen Yang
  • Bo Zheng
  • Yi Liu
  • Lan-Zhe Guo

Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on both public datasets and the real-world Taobao Live application demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
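As a rough illustration of the selection step, here is a greedy top-k stand-in for a weighted combination of the three per-sample scores. The real D3 method solves a joint weighted coreset objective rather than a simple top-k, and the weights below are placeholders.

```python
import numpy as np

def d3_select(diversity, difficulty, dependability, k, w=(1.0, 1.0, 1.0)):
    """Greedy stand-in for the D3 weighted coreset objective:
    pick the k samples with the highest weighted score."""
    score = (w[0] * np.asarray(diversity)
             + w[1] * np.asarray(difficulty)
             + w[2] * np.asarray(dependability))
    return np.argsort(score)[::-1][:k]  # indices of selected samples
```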

TMLR Journal 2025 · Journal Article

Offline Learning and Forgetting for Reasoning with Large Language Models

  • Tianwei Ni
  • Allen Nie
  • Sapana Chaudhary
  • Yao Liu
  • Huzefa Rangwala
  • Rasool Fakoor

Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed (forgetting) reasoning paths derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180×. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
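One plausible rendering of a learning-and-forgetting objective: standard likelihood training on successful paths plus an unlikelihood-style penalty on failed ones. The exact forgetting term and the weight `beta` are assumptions, not the paper's formulation.

```python
import torch.nn.functional as F

def learn_forget_loss(logits_pos, targets_pos, logits_neg, targets_neg, beta=1.0):
    # Learning: maximize likelihood of tokens on successful paths.
    learn = F.cross_entropy(logits_pos, targets_pos)
    # Forgetting: unlikelihood-style penalty on failed-path tokens,
    # pushing down the probability mass the model assigns to them.
    p_neg = F.softmax(logits_neg, dim=-1).gather(-1, targets_neg.unsqueeze(-1)).squeeze(-1)
    forget = -(1.0 - p_neg).clamp_min(1e-6).log().mean()
    return learn + beta * forget
```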

TAAS Journal 2025 · Journal Article

Synchronized Trajectory Prediction for Hybrid Multi-agent via Attention-Denoised Endpoint Distribution

  • Yao Liu
  • Jinzhu Yang
  • Quan Z. Sheng
  • Lina Yao

Trajectory prediction is critical for applications related to connected autonomous vehicles (CAVs), where multi-agent trajectory prediction can significantly reduce collisions and congestion in highways or hybrid-open scenarios. It serves as the foundation for autonomous driving, enabling vehicles to navigate complex environments safely and efficiently. Prior methods have assessed the spatio-temporal dynamics of agents but often neglected intrinsic intent and uncertainty, thereby limiting their effectiveness. We present the Denoised Endpoint Distribution model for trajectory prediction, which distinctively models agents’ spatio-temporal features alongside their intrinsic intentions and uncertainties. By employing Diffusion and Transformer models to focus on agent endpoints rather than entire trajectories, our approach significantly reduces model complexity and enhances performance through endpoint information. In addition, the designed attention-aware spatio-temporal graphs provide strong guidance information for the diffusion model, enhancing its ability to accurately predict trajectories. Our experiments on open datasets, including highways and hybrid open scenarios, along with comparative and ablation studies, demonstrate the validity of our model and the importance of its components.

AAAI Conference 2024 · Conference Paper

patchDPCC: A Patchwise Deep Compression Framework for Dynamic Point Clouds

  • Zirui Pan
  • Mengbai Xiao
  • Xu Han
  • Dongxiao Yu
  • Guanghui Zhang
  • Yao Liu

When compressing point clouds, point-based deep learning models operate on points in a continuous space, which can minimize the geometric fidelity loss introduced by voxelization in preprocessing. But these methods hardly scale to inputs with arbitrary numbers of points. Furthermore, the point cloud frames are individually compressed, forgoing the conventional wisdom of leveraging inter-frame similarity. In this work, we propose a patchwise compression framework called patchDPCC, which consists of a patch group generation module and a point-based compression model. Algorithms are developed to generate patches from different frames representing the same object, and more importantly, these patches are regulated to have the same number of points. We also incorporate a feature transfer module in the compression model, which refines the feature quality by exploiting the inter-frame similarity. Our model generates point-wise features for entropy coding, which guarantees the reconstruction speed. The evaluation on the MPEG 8i dataset shows that our method improves the compression ratio by 47.01% and 85.22% compared to PCGCv2 and V-PCC at the same reconstruction quality, which is 9% and 16% better than what D-DPCC achieves. Our method also achieves the fastest decoding speed among learning-based compression models.

NeurIPS Conference 2023 · Conference Paper

Budgeting Counterfactual for Offline RL

  • Yao Liu
  • Pratik Chaudhari
  • Rasool Fakoor

The main challenge of offline reinforcement learning, where data is limited, arises from a sequence of counterfactual reasoning dilemmas within the realm of potential actions: What if we were to choose a different course of action? These circumstances frequently give rise to extrapolation errors, which tend to accumulate exponentially with the problem horizon. Hence, it becomes crucial to acknowledge that not all decision steps are equally important to the final outcome, and to budget the number of counterfactual decisions a policy makes in order to control the extrapolation. Contrary to existing approaches that use regularization on either the policy or value function, we propose an approach to explicitly bound the amount of out-of-distribution actions during training. Specifically, our method utilizes dynamic programming to decide where to extrapolate and where not to, with an upper bound on the number of decisions that differ from the behavior policy. It balances the potential for improvement from taking out-of-distribution actions against the risk of making errors due to extrapolation. Theoretically, we justify our method by the constrained optimality of the fixed-point solution to our Q-updating rules. Empirically, we show that the overall performance of our method is better than that of state-of-the-art offline RL methods on tasks in the widely used D4RL benchmarks.
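A toy tabular rendering of the budgeting idea: extrapolate (back up the greedy action) only while counterfactual budget remains, otherwise back up the logged behavior action. The paper's actual update tracks the budget through dynamic programming; this sketch only shows the branch.

```python
import numpy as np

def budgeted_backup(q, r, s_next, a_next_logged, budget_left, gamma=0.99):
    """Illustrative backup target for a budgeted counterfactual update.
    q is a |S| x |A| table; a_next_logged is the behavior policy's
    logged action at s_next."""
    if budget_left > 0:
        # Spend budget: allow an out-of-distribution (greedy) action.
        return r + gamma * np.max(q[s_next])
    # Budget exhausted: stay with the behavior policy's action.
    return r + gamma * q[s_next, a_next_logged]
```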

NeurIPS Conference 2023 · Conference Paper

TD Convergence: An Optimization Perspective

  • Kavosh Asadi
  • Shoham Sabach
  • Yao Liu
  • Omer Gottesman
  • Rasool Fakoor

We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm where the function to be minimized changes per iteration. By carefully investigating the divergence displayed by TD on a classical counterexample, we identify two forces that determine the convergent or divergent behavior of the algorithm. We next formalize our discovery in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We extend this optimization perspective to prove convergence of TD in a much broader setting than just linear approximation and squared loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning.
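For reference, the tabular TD(0) update the analysis concerns; note how the bootstrapped target moves with the current estimate, which is exactly the per-iteration shifting objective the optimization view highlights.

```python
import numpy as np

def td0_update(v: np.ndarray, s: int, r: float, s_next: int,
               alpha: float = 0.1, gamma: float = 0.99) -> np.ndarray:
    """One tabular TD(0) step on the value table v."""
    target = r + gamma * v[s_next]    # bootstrapped, estimate-dependent target
    v[s] += alpha * (target - v[s])   # gradient-like correction toward it
    return v
```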

NeurIPS Conference 2022 · Conference Paper

Provably sample-efficient RL with side information about latent dynamics

  • Yao Liu
  • Dipendra Misra
  • Miro Dudik
  • Robert E. Schapire

We study reinforcement learning (RL) in settings where observations are high-dimensional, but where an RL agent has access to abstract knowledge about the structure of the state space, as is the case, for example, when a robot is tasked to go to a specific room in a building using observations from its own camera, while having access to the floor plan. We formalize this setting as transfer reinforcement learning from an "abstract simulator," which we assume is deterministic (such as a simple model of moving around the floor plan), but which is only required to capture the target domain's latent-state dynamics approximately up to unknown (bounded) perturbations (to account for environment stochasticity). Crucially, we assume no prior knowledge about the structure of observations in the target domain except that they can be used to identify the latent states (but the decoding map is unknown). Under these assumptions, we present an algorithm, called TASID, that learns a robust policy in the target domain, with sample complexity that is polynomial in the horizon, and independent of the number of states, which is not possible without access to some prior knowledge. In synthetic experiments, we verify various properties of our algorithm and show that it empirically outperforms transfer RL algorithms that require access to "full simulators" (i.e., those that also simulate observations).

AAAI Conference 2021 · Conference Paper

Asynchronous Teacher Guided Bit-wise Hard Mining for Online Hashing

  • Sheng Jin
  • Qin Zhou
  • Hongxun Yao
  • Yao Liu
  • Xian-Sheng Hua

Online hashing for streaming data has attracted increasing attention recently. However, most existing algorithms focus on batch inputs and instance-balanced optimization, which is limited in the single-datum input case and does not match the dynamic training in online hashing. Furthermore, constantly updating the online model with new-coming samples will inevitably lead to the catastrophic forgetting problem. In this paper, we propose a novel online hashing method to handle the above-mentioned issues jointly, termed Asynchronous Teacher-Guided Bit-wise Hard Mining for Online Hashing. Firstly, to meet the needs of datum-wise online hashing, we design a novel binary codebook that is discriminative enough to separate different classes. Secondly, we propose a novel semantic loss (termed bit-wise attention loss) to dynamically focus on hard samples of each bit during training. Last but not least, we design an asynchronous knowledge distillation scheme to alleviate the catastrophic forgetting problem, where the teacher model is updated with a delay to maintain the old knowledge, guiding the student model's learning. Extensive experiments conducted on two public benchmarks demonstrate the favorable performance of our method over state-of-the-art approaches.

NeurIPS Conference 2020 · Conference Paper

Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration

  • Yao Liu
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new decision policy may visit states and actions outside the support of the batch data, and function approximation and optimization with limited samples can further increase the potential of learning policies with overly optimistic estimates of their future performance. Some recent approaches to address these concerns have shown promise, but can still be overly optimistic in their expected outcomes. Theoretical work that provides strong guarantees on the performance of the output policy relies on a strong concentrability assumption, which makes it unsuitable for cases where the ratio between the state-action distributions of the behavior policy and some candidate policies is large. This is because, in the traditional analysis, the error bound scales up with this ratio. We show that using pessimistic value estimates in the low-data regions in Bellman optimality and evaluation back-ups can yield more adaptive and stronger guarantees when the concentrability assumption does not hold. In certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability. We highlight the necessity of our pessimistic update and the limitations of previous algorithms and analyses through illustrative MDP examples, and demonstrate an empirical comparison of our algorithm against other state-of-the-art batch RL baselines on standard benchmarks.
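A toy sketch of a pessimistic Bellman optimality backup: state-action pairs with little support in the batch are assigned a pessimistic value so the learned policy avoids them. The visit-count proxy `n_min` and the penalty `-b` are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def pessimistic_backup(q, counts, s, a, r, s_next, b=1.0, gamma=0.99, n_min=5):
    """One pessimistic backup on a |S| x |A| Q-table, where counts
    records how often each (state, action) appears in the batch."""
    def pessimistic_value(sp):
        # Low-data actions are valued at -b instead of their estimate.
        vals = [q[sp, ap] if counts[sp, ap] >= n_min else -b
                for ap in range(q.shape[1])]
        return max(vals)
    q[s, a] = r + gamma * pessimistic_value(s_next)
    return q
```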

AAAI Conference 2020 · Conference Paper

SSAH: Semi-Supervised Adversarial Deep Hashing with Self-Paced Hard Sample Generation

  • Sheng Jin
  • Shangchen Zhou
  • Yao Liu
  • Chao Chen
  • Xiaoshuai Sun
  • Hongxun Yao
  • Xian-Sheng Hua

Deep hashing methods have been proved to be effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. Current solutions to this issue utilize Generative Adversarial Networks (GANs) to augment data in semi-supervised learning. However, existing GAN-based methods treat image generation and hashing learning as two isolated processes, leading to generation ineffectiveness. Besides, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-Supervised Self-Paced Adversarial Hashing method, named SSAH, to solve the above problems in a unified framework. The SSAH method consists of an adversarial network (A-Net) and a hashing network (H-Net). To improve the quality of generated images, first, the A-Net learns hard samples with multi-scale occlusions and multi-angle rotated deformations which compete against the learning of accurate hashing codes. Second, we design a novel self-paced hard generation policy to gradually increase the hashing difficulty of generated samples. To make use of the semantic information in unlabeled data, we propose a semi-supervised consistent loss. The experimental results show that our method can significantly improve state-of-the-art models on both widely used hashing datasets and fine-grained datasets.

RLDM Conference 2019 · Conference Abstract

Off-Policy Policy Gradient with Stationary Distribution Correction

  • Yao Liu
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

The ability to use data about prior decisions, and their outcomes, to make counterfactual inferences about how alternative decision policies might perform is a cornerstone of intelligent behavior and has substantial practical importance. We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process, and consider how to perform off-policy policy optimization using a policy gradient method. Policy gradient methods have had great recent success when used in online reinforcement learning, and can often be a nice way to encode inductive bias, as well as to tackle continuous action domains. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data and the distribution of states that would arise under a new target policy. Here we build on recent progress in estimating the ratio of Markov chain stationary state distributions in policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important, and empirical simulations to suggest the benefits of this approach. We hope this is a step towards practical algorithms that can efficiently leverage prior data in order to inform better future decision policies.
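Schematically, such a corrected off-policy gradient estimator has the form below, where $\beta$ is the behavior policy, $d_\beta$ and $d_{\pi_\theta}$ are the stationary state distributions of the behavior and target policies, and the leading ratio is the stationary distribution correction. This is a sketch of the general shape of the estimator, not necessarily the paper's exact form:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{(s,a)\sim d_\beta}\left[\frac{d_{\pi_\theta}(s)}{d_\beta(s)}\,\frac{\pi_\theta(a\mid s)}{\beta(a\mid s)}\,Q^{\pi_\theta}(s,a)\,\nabla_\theta \log \pi_\theta(a\mid s)\right]$$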

NeurIPS Conference 2018 · Conference Paper

Representation Balancing MDPs for Off-policy Policy Evaluation

  • Yao Liu
  • Omer Gottesman
  • Aniruddh Raghu
  • Matthieu Komorowski
  • Aldo Faisal
  • Finale Doshi-Velez
  • Emma Brunskill

We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm of an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in common synthetic benchmarks and a HIV treatment simulation domain.

EWRL Workshop 2018 · Workshop Paper

When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms

  • Yao Liu
  • Emma Brunskill

Efficient exploration is one of the key challenges for reinforcement learning (RL) algorithms. Most traditional sample-efficiency bounds require strategic exploration. Recently, many deep RL algorithms with simple heuristic exploration strategies that have few formal guarantees have achieved surprising success in many domains. These results pose an important question about understanding such exploration strategies, such as ε-greedy, as well as understanding what characterizes the difficulty of exploration in MDPs. In this work we propose problem-specific sample complexity bounds for Q-learning with random-walk exploration that rely on several structural properties. We also link our theoretical results to some empirical benchmark domains, to illustrate whether our bound gives polynomial sample complexity in these domains and how that relates to empirical performance.

RLDM Conference 2017 · Conference Abstract

Model Selection for Off-Policy Policy Evaluation

  • Yao Liu
  • Philip Thomas
  • Emma Brunskill

In this work we study the off-policy policy evaluation problem: how to predict the value of a policy using data collected from other policies. This is crucial for many applications where we cannot deploy a new policy directly due to safety or cost. We consider the model selection problem for better off-policy estimators when models from different sources are available. Traditional off-policy policy evaluation methods can be divided into importance sampling estimators and model-based estimators, which respectively suffer from high variance and high bias. Recent work, such as doubly robust estimators and MAGIC, shows that we can benefit from combining importance sampling methods with model values. However, these methods all assume access to a single model. When several different models are available, which is common in some complex domains, it may be hard to select the best one, and the potential benefit from combining all models may be lost. We present an illustrative example showing that selecting a model by simply minimizing the error notion used in a previous estimator (MAGIC) can settle on a wrong model, which suggests that selecting the best model for off-policy policy evaluation is non-trivial and worth further exploration. We propose two new estimators of model bias and a cross-validation procedure to help choose a model, and show preliminary results.

IJCAI Conference 2016 · Conference Paper

A Decision Procedure for a Fragment of Linear Time Mu-Calculus

  • Yao Liu
  • Zhenhua Duan
  • Cong Tian

In this paper, we study an expressive fragment, namely Gmu, of linear time mu-calculus as a high-level goal specification language. We define Goal Progression Form (GPF) for Gmu formulas and show that every closed formula can be transformed into this form. Based on GPF, we present the notion of Goal Progression Form Graph (GPG) which can be used to describe models of a formula. Further, we propose a simple and intuitive GPG-based decision procedure for checking satisfiability of Gmu formulas which has the same time complexity as the decision problem of Linear Temporal Logic (LTL). However, Gmu is able to express a wider variety of temporal goals compared with LTL.