From conversational agents to online recommendation to search and advertising, we are already interacting with increasingly sophisticated sequential decision making systems in daily life. Traditionally, sequential decision making research has focused on balancing the exploration-exploitation trade-off, or on casting the interaction paradigm within the reinforcement / imitation learning dichotomy. We aim to take a holistic view and call for a collective effort to translate principled research ideas into practically relevant solutions.
This workshop aims to bring together researchers from industry and academia in order to describe recent advances and discuss future research directions pertaining to real-world sequential decision making, broadly construed. We aim to highlight new and emerging research opportunities for the machine learning community that arise from the evolving need to make decision making theoretically and practically relevant for realistic applications. We are particularly interested in understanding the current challenges that prevent broader adoption of current policy learning and evaluation algorithms in high-impact applications, across a broad range of domains.
We welcome attendance and contributions from researchers with either theoretical or applied interests, across a diverse range of focus areas:
Application areas: applications of learning-based techniques to real-life sequential decision making (robotics, healthcare, education, transportation & energy, smart-grids, sustainability, NLP, social media & advertising, agriculture, manufacturing, economics and policy)
Methods that address real-world desiderata and concerns: safety, reliable decision making, theoretical guarantees, verifiability & interpretability, data-efficiency, data-heterogeneity, efficient exploration, counterfactual reasoning and off-policy evaluation, cost function design, efficient implementation in large-scale systems
Cross-boundary research along the theme of RL+X, where X indicates areas not commonly viewed as RL in contemporary research. We encourage researchers to explore the interface of traditional RL with:
Other related areas in machine learning including but not limited to: imitation learning, transfer learning, active learning, structured prediction, off-policy learning, fairness in ML, privacy
Areas outside of machine learning including but not limited to: control theory & dynamical systems, formal methods, causal inference, game-theory, operations research, systems research, human-computer interactions, human behavior modeling
14:00 - 14:05 | Opening Remarks |
14:05 - 14:30 | Efficient Reinforcement Learning When Data is Costly - Emma Brunskill (Invited Talk)
[video] [speaker bio] Emma Brunskill is an assistant professor in the Computer Science Department at Stanford University where she leads the AI for Human Impact group. She is the recipient of multiple early faculty career awards (National Science Foundation, Office of Naval Research, Microsoft Research) and her group has received several best research paper nominations (CHI, EDM x2) and awards (UAI, RLDM). |
14:30 - 15:00 | Doubly Robust Off-policy Evaluation with Shrinkage - Miro Dudík (Invited Talk)
[video] [talk abstract]
Contextual bandits are a learning protocol that encompasses applications such as news recommendation, advertising, and mobile health, where an algorithm repeatedly observes some information about a user, makes a decision what content to present, and accrues a reward if the presented content is successful. In this talk, I will focus on the fundamental task of evaluating a new policy given historic data. I will describe the asymptotically optimal approach of doubly robust (DR) estimation, the reasons for its shortcomings in finite samples, and how to overcome these shortcomings by directly optimizing the bound on finite-sample error. Optimization yields a new family of estimators that, similarly to DR, leverage any direct model of rewards, but shrink importance weights to obtain a better bias-variance tradeoff than DR. Error bounds can also be used to select the best among multiple reward predictors. Somewhat surprisingly, reward predictors that work best with standard DR are not the same as those that work best with our modified DR. Our new estimator and model selection procedure perform extremely well across a wide variety of settings, so we expect they will enjoy broad practical use. [speaker bio] Miroslav Dudík is a principal researcher in machine learning at Microsoft Research, NYC, currently focusing on contextual bandits, reinforcement learning and algorithmic fairness. He received his PhD from Princeton in 2007. He is a co-creator of the MaxEnt package for modeling species distributions, which is used by biologists around the world to design national parks, model impacts of climate change, and discover new species. |
15:00 - 16:00 | Poster Session Part 1 and Coffee Break |
16:00 - 16:30 | Link between Causal Inference and Reinforcement Learning and Applications to Learning from Offline/Observational Data - Suchi Saria (Invited Talk)
[video] [speaker bio] Suchi Saria is the John C. Malone Assistant Professor at Johns Hopkins University, where she directs the Machine Learning and Healthcare Lab. Her work with the lab enables new classes of diagnostic and treatment planning tools for healthcare: tools that use statistical machine learning techniques to tease out subtle information from “messy” observational datasets and provide reliable inferences for individualizing care decisions. |
16:30 - 17:00 | Dynamic Pricing and Matching for Ride-Hailing - Dawn Woodard (Invited Talk)
[video] [talk abstract] Ride-hailing platforms like Uber, Lyft, Didi Chuxing, and Ola have achieved explosive growth, in part by improving the efficiency of matching between riders and drivers, and by calibrating the balance of supply and demand through dynamic pricing. We survey methods for dynamic pricing and matching in ride-hailing, and show that these are critical for providing an experience with low waiting time for both riders and drivers. We also discuss approaches used to predict key inputs into those algorithms: demand, supply, and travel time in the road network. Then we link the two levers together by studying a pool-matching mechanism called dynamic waiting that varies rider waiting and walking before dispatch, which is inspired by a recent carpooling product Express Pool from Uber. We show using data from Uber that by jointly optimizing dynamic pricing and dynamic waiting, price variability can be mitigated, while increasing capacity utilization, trip throughput, and welfare. We also highlight several key practical challenges and directions of future research from a practitioner's perspective. [speaker bio]
Dawn Woodard leads data science for Uber Maps, which is the mapping platform used in Uber’s rider and driver app and decision systems (such as pricing and dispatch). The team’s technologies include road map and points of interest definition, map search, route optimization, travel time prediction, and navigation.
|
17:00 - 17:30 | Panel Discussion with Emma Brunskill, Miro Dudík, Suchi Saria and Dawn Woodard |
17:30 - 18:00 | Poster Session - Part 2 |
Emma Brunskill, Assistant Professor, Stanford University
Miro Dudík, Principal Researcher, Microsoft Research
Suchi Saria, Assistant Professor, Johns Hopkins
Dawn Woodard, Director of Data Science, Maps, Uber
We are excited to include the following papers in the poster sessions of the workshop:
We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that can do nearly as well as that of a procedure that has full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this problem. From a technical standpoint, this work generalizes upon previous work in two main aspects: our model allows for adversarial noise in the dynamics, and allows for general convex costs.
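For context, a compact statement of the regret objective in this setting, under assumed (standard) notation; the comparator class of linear state-feedback policies is an assumption, not a detail taken from the abstract:

```latex
% Dynamics with adversarial disturbance w_t:  x_{t+1} = A x_t + B u_t + w_t.
% Regret against the best policy K (e.g., linear state feedback u_t = -K x_t)
% chosen with full knowledge of the disturbances in hindsight:
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} c_t(x_t, u_t)
  \;-\; \min_{K \in \mathcal{K}} \sum_{t=1}^{T} c_t\!\left(x_t^{K}, u_t^{K}\right)
```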
This extended abstract considers the reliability of prediction-based decision-making in a task of deciding which action $a$ to take for a target unit after observing its covariates $\tilde{x}$ and predicted outcomes $\hat{p}(\tilde{y} \mid \tilde{x}, a)$. An example case is personalized medicine and the decision of which treatment to give to a patient. A common problem when learning these models from observational data is imbalance, that is, difference in treated/control covariate distributions. We propose to assess the decision-making reliability by estimating the model's Type S error rate, which is the probability of the model inferring the sign of the treatment effect wrong. Furthermore, we use the estimated reliability as a criterion for active learning, in order to collect new (possibly expensive) observations, instead of making a forced choice based on unreliable predictions. We demonstrate the effectiveness of this decision-making aware active learning in two decision-making tasks: in simulated data with binary outcomes and in a medical dataset with synthetic and continuous treatment outcomes.
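As a rough illustration of the Type S criterion described above, the sketch below estimates the probability of inferring the wrong sign of a treatment effect from posterior samples; the sampling model, data, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def type_s_error_rate(effect_samples):
    """Estimate the Type S error rate from posterior samples of a treatment
    effect: the posterior probability that the sign implied by the posterior
    mean is wrong."""
    effect_samples = np.asarray(effect_samples)
    inferred_sign = np.sign(effect_samples.mean())
    # Posterior mass on the opposite sign of the inferred effect.
    return float(np.mean(np.sign(effect_samples) != inferred_sign))

# Toy usage: posterior draws of y(treated) - y(control) for one patient.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.3, scale=1.0, size=5000)  # weak positive effect
print(f"Estimated Type S error rate: {type_s_error_rate(samples):.3f}")
```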
Reinforcement learning (RL) has proven its worth in a series of artificial domains, and is beginning to show some successes in real-world scenarios. However, many research advances in RL are hard to leverage in real-world systems due to a series of assumptions that are rarely satisfied in practice. We present a set of nine unique challenges that must be addressed to productionize RL for real-world problems. For each of these challenges, we specify the exact meaning of the challenge, present some approaches from the literature, and specify some metrics for evaluating that challenge. An approach that addresses all nine challenges would be applicable to a large number of real-world problems. We also present an example domain that has been modified to present these challenges as a testbed for practical RL research.
We consider Online Convex Optimization (OCO) in the setting where the costs are $m$-strongly convex and the online learner pays a switching cost for changing decisions between rounds. We show that the recently proposed Online Balanced Descent (OBD) algorithm is constant competitive in this setting, with competitive ratio $3 + O(1/m)$, irrespective of the ambient dimension. We demonstrate the generality of our approach by showing that the OBD framework can be used to construct competitive algorithms for a variety of online problems across learning and control, including online variants of ridge regression, logistic regression, maximum likelihood estimation, and LQR control.
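For reference, a compact statement of the smoothed OCO objective and competitive ratio, with the switching cost written generically as d(x_t, x_{t-1}) since its exact form is not specified in the abstract:

```latex
% Learner picks x_t given the m-strongly convex cost c_t, and also pays a
% switching cost d(x_t, x_{t-1}) for changing decisions between rounds.
\mathrm{cost}(\mathrm{ALG}) = \sum_{t=1}^{T} \Big( c_t(x_t) + d(x_t, x_{t-1}) \Big),
\qquad
\sup_{c_1,\dots,c_T} \frac{\mathrm{cost}(\mathrm{ALG})}{\mathrm{cost}(\mathrm{OPT})}
  \;\le\; 3 + O(1/m).
```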
In this work, we consider the task of learning effective medical treatment policies for sepsis, a dangerous health condition, from observational data. We examine the performance of various reinforcement learning methodologies on this problem while varying the state representation used. We develop careful baselines for the performance of these methods under different state representations, standardising the reward function formulation and evaluation methodology. Our results illustrate that simple tabular Q-learning and Deep Q-Learning both lead to the most effective medical treatment strategies, and that temporal encoding in the state representation aids in discovering improved policies.
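Since the abstract reports that simple tabular Q-learning is among the strongest baselines, here is a minimal tabular Q-learning loop for a generic discretized MDP; the environment interface and hyperparameters are assumptions for illustration, not the paper's clinical setup.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=5000,
                       alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Minimal tabular Q-learning; `env` is assumed to expose
    reset() -> state and step(action) -> (next_state, reward, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step Q-learning target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```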
We model human decision-making behavior in a risk-taking task using inverse reinforcement learning (IRL), with the goal of understanding the reward function underlying real human risk decisions. To our knowledge, this is the first work that uses IRL to recover the reward function behind human risk-taking decisions and to interpret the decisions of risk-prone and risk-averse people. We hypothesize that the state history (e.g., rewards and decisions in previous trials) is related to the human reward function, which leads to risk-averse and risk-prone decisions. In the model, these factors are input to IRL as features, and their reward weights are estimated. The behavioral results confirm the irrational and sub-optimal risk-related decisions. The model results quantify the weights of the features and show that risk-prone participants tend to decide based on the current bump number, while risk-averse participants rely on burst information from the previous trial and the average end status. Our results demonstrate that IRL is an effective tool for modeling human decision-making behavior and helps interpret the human psychological process in risky decision making, with potential as an assessment tool for risk-taking as well.
From traditional Web search to virtual assistants and Web accelerators, services that rely on online information need to keep track of continuing content changes. Many of these services need to explicitly request content updates from remote sources (e.g., web pages), which gives rise to a formidable challenge. A search engine, for example, may track billions of sources, with content change timing or even frequency being initially unknown for most. How should it schedule requests for updates under constraints of network bandwidth? We introduce a novel optimization objective for this setting that has several desirable properties, and efficient algorithms for it with optimality guarantees even in the face of mixed content change observability and initially unknown model parameters. An evaluation on a set of 20M web pages crawled daily for 14 weeks shows the scalability and other properties of our approach.
Connector insertion and many other tasks commonly found in modern manufacturing settings involve complex contact dynamics and friction. Since it is difficult to capture related physical effects with first-order modeling, traditional control methodologies often result in brittle and inaccurate controllers, which have to be manually tuned. Reinforcement learning (RL) methods have been demonstrated to be capable of learning controllers in such environments from autonomous interaction with the environment, but running RL algorithms in the real world poses sample efficiency and safety challenges. Moreover, in practical real-world settings we cannot assume access to perfect state information or dense reward signals. In this paper, we consider a variety of difficult industrial insertion tasks with visual inputs and different natural reward specifications, namely sparse rewards and goal images. We show that methods that combine RL with prior information, such as classical controllers or demonstrations, can solve these tasks directly by real-world interaction.
Designing an effective materials fabrication process demands searching the realm of possible fabrication processes to understand how desired properties are impacted by the specifications of the process. We apply active learning in the context of Bayesian optimization to propose an adaptive experimental design -- our goal is to identify processes with the most desirable properties in a sample efficient fashion. In reflecting on our cross-disciplinary industry-academic collaboration between materials scientists and mathematicians, we discuss our approach to managing practical concerns and complications; these issues may be rarely addressed in theoretical literature but were necessary for our circumstances.
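A minimal Bayesian-optimization loop with a Gaussian-process surrogate and expected improvement, of the general kind the abstract alludes to; the objective, bounds, candidate-sampling scheme, and library choice (scikit-learn) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X_cand, gp, y_best, xi=0.01):
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    imp = mu - y_best - xi                      # we maximize the objective
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(objective, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, len(bounds)))
    y = np.array([objective(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(1000, len(bounds)))
        ei = expected_improvement(cand, gp, y.max())
        x_next = cand[np.argmax(ei)]            # propose the next experiment
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))
    return X[np.argmax(y)], y.max()

# Toy usage: maximize a 1-D "property" as a stand-in for a fabrication process.
best_x, best_y = bayes_opt(lambda x: -(x[0] - 0.3) ** 2, np.array([[0.0, 1.0]]))
```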
Most on-policy imitation algorithms, such as DAgger, are designed for learning with a fixed supervisor. However, there are many settings in which the supervisor improves during policy learning, such as when the supervisor is a human performing a novel task or an improving algorithmic controller. We consider learning from an “improving supervisor” and derive a bound on the static regret of online gradient descent when a converging supervisor policy is used. We present an on-policy imitation learning algorithm, Follow the Improving Teacher (FIT), which uses a deep model-based reinforcement learning (deep MBRL) algorithm to provide the sample complexity benefits of model-based methods while enabling faster training and evaluation via distillation into a reactive controller. We evaluate FIT with experiments on the Reacher and Pusher MuJoCo domains using the deep MBRL algorithm PETS as the improving supervisor. To the best of our knowledge, this work is the first to formally consider the setting of an improving supervisor in on-policy imitation learning.
We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling.
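A minimal sketch of the estimator described above for a finite state/action space: the behavior policy is estimated by empirical action frequencies from the same data that is then used to compute the importance-sampling estimate. The data format and variable names are assumptions.

```python
import numpy as np
from collections import defaultdict

def estimated_behavior_is(trajectories, target_pi, n_actions, gamma=1.0):
    """Trajectory-wise importance sampling where the behavior policy is
    replaced by its empirical estimate from the same data.
    `trajectories` is a list of [(state, action, reward), ...];
    `target_pi[s][a]` gives the target policy's action probability."""
    # Estimate the behavior policy by empirical action frequencies.
    counts = defaultdict(lambda: np.zeros(n_actions))
    for traj in trajectories:
        for s, a, _ in traj:
            counts[s][a] += 1
    mu_hat = {s: c / c.sum() for s, c in counts.items()}

    # Ordinary importance sampling with the estimated behavior policy.
    estimates = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            weight *= target_pi[s][a] / mu_hat[s][a]
            ret += discount * r
            discount *= gamma
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```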
We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMind Control Suite benchmark. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD(k) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.
We study the sample complexity of model-based reinforcement learning (henceforth RL) in general contextual decision processes that require strategic exploration to find a near-optimal policy. We design new algorithms for RL with a generic model class and analyze their statistical properties. Our algorithms have sample complexity governed by a new structural parameter called the witness rank, which we show to be small in several settings of interest, including factored MDPs. We also show that the witness rank of a problem is never larger than the recently proposed Bellman rank parameter governing the sample complexity of the model-free algorithm OLIVE (Jiang et al., 2017), the only other provably sample-efficient algorithm for global exploration at this level of generality. Focusing on the special case of factored MDPs, we prove an exponential lower bound for a general class of model-free approaches, including OLIVE, which when combined with our algorithmic results demonstrates exponential separation between model-based and model-free RL in some rich-observation settings.
Restless bandit problems are instances of non-stationary multi-armed bandits. There are plenty of results from the optimization perspective, which aims to efficiently find a near-optimal policy when system parameters are known, but the learning perspective where the parameters are unknown is rarely investigated. In this paper, we analyze the performance of Thompson sampling in restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an $\tilde{O}(\sqrt{T})$ Bayesian regret bound. Our competitor is flexible enough to represent various benchmarks including the best fixed action policy, the optimal policy, the Whittle index policy, or the myopic policy. The empirical results also support our theoretical findings.
In many reinforcement learning (RL) game tasks, episodes should be interrupted after a certain time, as the agent can sometimes fall into a deadlock state. This paper provides a dynamic interruption policy in RL game training via a newly defined multi-sampling multi-armed bandit (MSMAB) model, and offers an efficient algorithm named Exp3.P.MS for the new bandit setting. The experimental results show that the dynamic interruptions can adapt to the performance of the RL agent and spur fast learning in game playing.
Policy evaluation problems in multi-agent reinforcement learning (MARL) have attracted growing interest recently. In these settings, agents collaborate to learn the value of a given policy with private local rewards and jointly observed state-action pairs. However, most fully decentralized algorithms treat each agent equally, without considering the communication structure of the agents over a given network, and the corresponding effects on communication efficiency. In this paper, we propose a hierarchical distributed algorithm that differentiates the roles of each of the agents during the evaluation process. This allows us to freely choose various mixing schemes (and corresponding mixing matrices that are not necessarily doubly stochastic), in order to reduce the communication cost, while still maintaining convergence at rates as fast as or even faster than the previous approaches. Theoretically, we show the proposed method, which contains existing distributed methods as a special case, achieves the same order of convergence rate as state-of-the-art methods. Numerical experiments on real datasets demonstrate the superior performances of our approach over other advanced algorithms in terms of convergence and total communication efficiency.
This paper deals with adversarial attacks on the perceptions of neural policies in a reinforcement learning (RL) context. Classic approaches perform untargeted attacks on the state of the agent. Here, we argue that it is more realistic to attack the observations provided by the environment rather than the internal state computed by the agent. We propose an untargeted attack over observations and show that it remains effective even when reduced to a constant attack over observations. We also propose an approach to perform targeted attacks on the observations of the agent, so that it acts as told by an opponent policy. We illustrate this on deep RL agents playing Space Invaders.
We consider online Markov decision process (MDP) problems where both the transition probabilities and the rewards are time-varying or even adversarially generated. We propose an online algorithm based on an online implementation of value iterations and show that its dynamic regret, i.e. its total reward compared with that of the optimal (nonstationary) policies in hindsight, is upper bounded by the total variation of the transition probabilities and the rewards. Moreover, we show that the dynamic regret of any online algorithm is lower bounded by the total variation of the transition probabilities and the rewards, indicating that the proposed algorithm is optimal up to a constant factor. Finally, we test our algorithm in a power management problem for a data center and show that our algorithm reduces energy costs and ensures quality of services (QoS) under real electricity prices and job arrival rates.
Counterfactual regret minimization (CFR) provably approaches a Nash equilibrium when solving two-player constant-sum perfect-recall imperfect information games. However, for large games, the convergence speed of current CFR variants is still the key limitation, especially in real-time applications. (1) We propose a novel counterfactual regret minimization method with instant updates, which has a lower convergence bound than CFR variants and can improve their convergence. (2) We prove that the space complexity of our instant CFR is the same as that of CFR variants, and the proved bounds are much tighter than the previous ones. (3) We also prove that the weighted average strategy obtained by skipping previous iterations approaches an approximate Nash equilibrium. The proposed methods have been compared on several different game instances including Leduc Hold'em and five different subgames of Heads-Up No-Limit Texas Hold'em (HUNL) generated by DeepStack. In all of these game instances, the proposed methods approach Nash equilibria more efficiently than all the compared state-of-the-art methods. In subgames of HUNL, our method converges three times faster than the hybrid method used in DeepStack.
Obtaining more relevant information enables better decision making, but may be costly. Optimal sequential decision making allows us to trade off the desire to make good decisions by acquiring further information with the cost of performing that acquisition. To this end, we propose a principled framework, named EDDI (Efficient Dynamic Discovery of high-value Information), based on the theory of Bayesian experimental design. In EDDI we propose a novel partial variational autoencoder (Partial VAE), to efficiently handle missing data with different missing patterns. Additionally, we extend the VAE based framework to Bayesian treatment of the weights, which obtains better performance in small data regime. EDDI then combines it with an acquisition function that maximizes expected information gain on a set of target variables at each step.
In-hand manipulation with dexterous robotic hands is a complex problem that not only requires highly coordinated finger movements but also deals with interaction variability. Traditional approaches solve this problem either by relying on complex models that are not always readily available or by constraining the problem in order to make it more tractable. In this paper, we propose a hierarchical control approach where a higher level policy is learned through reinforcement learning, while low level controllers ensure grip stability throughout the manipulation action. We show that this structure allows learning the unconstrained task with RL methods that cannot learn it in a non-hierarchical setting. The low level controllers also provide an abstraction to the tactile sensors input, allowing transfer to real robot platforms.
Multiple machine learning and prediction models are often used for the same prediction or recommendation task. In our recent work, where we develop and deploy airline ancillary pricing models in an online setting, we found that among the multiple pricing models developed, no single model clearly dominates the others for all incoming customer requests. Thus, as algorithm designers, we face an exploration-exploitation dilemma. In this work, we introduce an adaptive meta-decision framework that uses Thompson sampling, a popular multi-armed bandit solution method, to route customer requests to various pricing models based on their online performance. We show that this adaptive approach outperforms a uniformly random selection policy, improving the expected revenue per offer by 43% and the conversion score by 58% in an offline simulation.
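The routing idea lends itself to a compact sketch: treat each pricing model as an arm with a Beta posterior over its conversion rate and sample from the posteriors to choose which model serves each request. The prior, reward definition, and interface below are assumptions, not the deployed system.

```python
import numpy as np

class ThompsonSamplingRouter:
    """Routes each incoming request to one of K pricing models using
    Thompson sampling with Beta posteriors over conversion rates."""

    def __init__(self, n_models, seed=0):
        self.rng = np.random.default_rng(seed)
        self.alpha = np.ones(n_models)  # successes + 1 (uniform prior)
        self.beta = np.ones(n_models)   # failures + 1

    def choose_model(self):
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, model_idx, converted):
        if converted:
            self.alpha[model_idx] += 1
        else:
            self.beta[model_idx] += 1

# Toy usage: route requests among 3 hypothetical pricing models.
router = ThompsonSamplingRouter(n_models=3)
true_rates = [0.05, 0.08, 0.12]
rng = np.random.default_rng(1)
for _ in range(10_000):
    k = router.choose_model()
    router.update(k, rng.random() < true_rates[k])
```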
We consider the problem of active feature selection where we dynamically choose the set of features that acquires the highest predictive performance relative to a task. We propose a model-based deep reinforcement learning framework for Optimal Discovery of high-value INformation (ODIN) in which the agent either chooses to ask for a new feature or to stop and predict. Utilizing the ability of the partial variational autoencoder (Ma et al., 2018) the framework models the conditional distribution of the features allowing for data efficiency. We introduce a novel cost function that is sensitive to both cost and order of feature acquisition. ODIN handles missing data naturally and ensures the globally optimal solution for most efficient feature acquisition while preserving data efficiency. We show improved performance on both synthetic and real-life datasets.
Features in machine learning problems are often time-varying and may be related to outputs in an algebraic or dynamical manner. The dynamic nature of these machine learning problems renders current higher order accelerated gradient descent methods unstable or weakens their convergence guarantees. Inspired by methods employed in adaptive control, this paper proposes new algorithms for the case when time-varying features are present, and demonstrates provable performance guarantees. In particular, we develop a unified variational perspective within a continuous time algorithm. This variational perspective includes higher order learning concepts and normalization, both of which stem from adaptive control, and allows stability to be established for dynamical machine learning problems where time-varying features are present. These higher order algorithms are also examined for provably correct learning in adaptive control and identification. Simulations are provided to verify the theoretical results.
Access to parallel and distributed computation has enabled researchers and developers to improve algorithms and performance in many applications. Recent research has focused on next generation special purpose systems with multiple kinds of coprocessors, known as heterogeneous system-on-chips (SoC). In this paper, we introduce a method to intelligently schedule--and learn to schedule--a stream of tasks to available processing elements in such a system. We use deep reinforcement learning enabling complex sequential decision making and empirically show that our reinforcement learning system provides for a viable, better alternative to conventional scheduling heuristics with respect to minimizing execution time.
The problem of batch learning from logged contextual bandit feedback (BLBF) is ubiquitous in recommender systems, search, and online retail. Most previous methods for BLBF have followed a ``Model the Bias'' approach, estimating the expected reward of a policy using inverse propensity score (IPS) weighting. While unbiased, controlling the variance can be challenging. In contrast, we take a ``Model the World'' approach using the Direct Method (DM), where we learn a reward-regression model and derive a policy from the estimated rewards. While this approach has not been competitive with IPS weighting for mismatched models due to its bias, we show how directly minimizing the bias of the reward-regression model can lead to highly effective policy learning. In particular, we propose Bias Corrected Reward Imputation (BCRI) and formulate the policy learning problem as bi-level optimization, where the upper level maximizes the DM estimate and the lower level fits a weighted reward-regression. We empirically characterize the effectiveness of BCRI compared to conventional reward-regression baselines and an IPS-based method.
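For reference, the two estimators contrasted above in standard notation; this is only a sketch of the baselines, and the paper's bias correction and bi-level objective are not reproduced here.

```latex
% Logged data (x_i, a_i, r_i) with logging propensities \mu(a_i \mid x_i)
% and a learned reward-regression model \hat{r}(x, a).
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \, r_i,
\qquad
\hat{V}_{\mathrm{DM}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a}
  \pi(a \mid x_i) \, \hat{r}(x_i, a).
```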
Off-policy policy evaluation (OPE) is the task of predicting the online performance of a policy using only pre-collected historical data (collected from an existing deployed policy or set of policies). For many real-world applications, accurate OPE is crucial since deploying bad policies can be prohibitively costly or dangerous. With the increasing interest in deploying learning-based methods for safety-critical applications, the study of OPE has also become correspondingly more important. In this paper, we present the first comprehensive empirical analysis of most of the recently proposed OPE methods. Based on thousands of experiments and detailed empirical analyses, we offer a summarized set of guidelines for effectively using OPE in practice, as well as suggest directions for future research to address current limitations.
Temporal Difference (TD) learning is a widely used class of algorithms in reinforcement learning. The success of TD learning algorithms relies heavily on the choice of the learning rate used. In this paper, we use recent results on finite-time performance of TD learning with linear function approximation to design an adaptive learning rate selection rule. The rule comprises a diagnostic test which determines whether the algorithm has reached the steady-state, i.e., the algorithm has stopped learning. Once such a determination is made, the rule adapts the learning rate (reducing the step size) so that the algorithm continues to learn. We implement our proposed rule on a variety of popular reinforcement learning applications and show that in all these scenarios, our rule outperforms the best fixed learning rate as well as other commonly used adaptive learning rate selection rules.
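As background for the rule described above, a minimal TD(0) update with linear function approximation plus a crude step-size halving heuristic; the halving trigger stands in for the paper's diagnostic test, and the environment interface is an assumption.

```python
import numpy as np

def td0_linear(env, featurize, dim, gamma=0.99, alpha=0.5,
               episodes=500, window=50, seed=0):
    """TD(0) policy evaluation with a linear value function w^T phi(s).
    The step size is halved whenever the running TD error stops shrinking,
    a crude stand-in for a principled steady-state diagnostic."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    recent, previous = [], np.inf
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(env.sample_action(rng))
            phi, phi_next = featurize(s), featurize(s_next)
            v_next = 0.0 if done else w @ phi_next
            delta = r + gamma * v_next - w @ phi    # TD error
            w += alpha * delta * phi
            recent.append(abs(delta))
            s = s_next
        if len(recent) >= window:
            avg = float(np.mean(recent[-window:]))
            if avg >= previous:        # no progress: assume steady state
                alpha *= 0.5           # reduce step size, keep learning
            previous = avg
            recent = []
    return w
```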
Recently, an infinite-horizon off-policy evaluation method based on estimating the density ratio was proposed (Liu et al., 2018). Before that, the doubly robust estimator was the strongest baseline for infinite-horizon off-policy evaluation. A natural question is whether the doubly robust method can be applied in the infinite-horizon setting. This paper answers that question and reveals an interesting connection between the density ratio function and the value function. We provide both theoretical and empirical results to show that our method yields a significant advantage over the previous method.
Tactical decision making is a critical feature of advanced driving systems that involves several challenges, such as the complexity of the uncertain environment and the reliability of the autonomous system. In this work, we develop a multi-modal architecture that includes environmental modeling of the ego vehicle's surroundings and train a deep reinforcement learning (DRL) agent that yields consistent performance in stochastic highway driving scenarios. To this end, we feed the occupancy grid of the ego vehicle's surroundings into the DRL agent and obtain high-level sequential commands (e.g., lane change) to send to lower-level controllers. We show that dividing the autonomous driving problem into a multi-layer control architecture enables us to leverage the power of AI to solve each layer separately and achieve an admissible reliability score. Compared with end-to-end approaches, this architecture yields a more reliable system which can be implemented in actual self-driving cars.
A number of applications involve the sequential arrival of users, and require showing each user a set of items. It is well known that the order in which the items are presented to a user can have a significant impact on the performance of the system, due to phenomena such as primacy effects. An example application (which forms the focus of this paper) is the bidding process in conference peer review. Here reviewers enter the system sequentially, each reviewer needs to be shown the list of submitted papers, and the reviewer then ``bids'' to review some papers. In deciding on the ordering of the list of papers to show, there are two competing goals: (i) satisfying reviewers by showing them relevant items, and (ii) obtaining sufficiently many bids for each paper to ensure high quality reviews for all papers. In this paper, we first develop a framework to study this problem in a principled manner. We then design an algorithm called SUPER*, inspired by the A* algorithm, for this goal. We show that SUPER* has a number of attractive properties (some analogous to those of A*). Finally, we conduct experiments which reveal that our algorithm is quite robust to the complexities arising in the real world.
We study the problem of learning sequential decision-making policies in settings with multiple state-action representations. Such settings naturally arise in many domains, such as planning (e.g., multiple integer programming formulations) and various combinatorial optimization problems (e.g., those with both integer programming and graph-based formulations). Inspired by the classical co-training framework for classification, we study the problem of co-training for policy learning. We present sufficient conditions under which learning from two views can improve upon learning from a single view alone. Motivated by these theoretical insights, we present an algorithm for co-training for sequential decision making. Our framework is compatible with both reinforcement learning and imitation learning. We validate the effectiveness of our approach on a challenging class of combinatorial optimization problems: minimum vertex cover.
Counterfactual Regret Minimization (CFR) is a fundamental and effective technique for solving Imperfect Information Games (IIG). However, the original CFR algorithm only works for discrete state and action spaces, and the resulting strategy is maintained as a tabular representation. Such tabular representation limits the method from being directly applied to large games. In this paper, we propose a double neural representation for imperfect information games, where one neural network represents the cumulative regret, and the other represents the average strategy. Such neural representations enable us to avoid any manual game abstraction and conduct an end-to-end continual optimization. To make neural learning efficient, we also developed several novel techniques including a robust sampling method and mini-batch Monte Carlo Counterfactual Regret Minimization (MCCFR), which may be of independent interest. Empirically, on games tractable to tabular approaches, our approach converged to results comparable to those of tabular counterparts and achieved much lower exploitability than deep reinforcement learning. On larger games, our approach achieved strategy profiles comparable to DeepStack's results while using significantly less memory.
Sequence generation with reinforcement learning (RL) has received significant attention recently. However, the sparse-reward issue remains a main challenge in the RL training process, where only a scalar guiding signal is available after an entire sequence has been generated. Such sparse rewards tend to ignore the global structural information of a sequence, causing the generated sequences to be semantically inconsistent. In this paper, we present a model-based RL approach to overcome this issue. Specifically, we propose a novel guider network to model the sequence-generation environment, which can assist next-word prediction and provide intermediate rewards for generator optimization. Extensive experiments demonstrate that the proposed method leads to improved performance for both unconditional and conditional sequence-generation tasks.
Value function learning plays a central role in many state-of-the-art reinforcement learning algorithms. However, many standard algorithms such as Q-learning lose convergence guarantees when function approximation is used, as is observed in practice. In this paper, we propose a novel loss function, the minimization of which results in the true value function. This loss deals with the double-sample challenge faced by algorithms like residual gradient and is easier to optimize than recently proposed primal-dual formulations. In practice, our approach may be combined with general (differentiable) function classes such as neural networks, and is shown to work reliably and effectively for policy evaluation problems.
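To make the double-sample issue mentioned above concrete (standard notation for policy evaluation with deterministic reward r(s); this is background, not the paper's new loss): the Bellman residual objective squares a conditional expectation, and replacing it with a single sampled next state adds an unwanted conditional-variance term.

```latex
% Bellman residual objective:
L(V) = \mathbb{E}_{s}\!\left[ \big( \mathbb{E}_{s' \mid s}[\, r(s) + \gamma V(s') \,] - V(s) \big)^{2} \right],
% whereas the naive single-sample surrogate satisfies
\mathbb{E}_{s, s'}\!\left[ \big( r(s) + \gamma V(s') - V(s) \big)^{2} \right]
  = L(V) + \gamma^{2}\, \mathbb{E}_{s}\!\left[ \mathrm{Var}_{s' \mid s}\big( V(s') \big) \right].
```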
Financial portfolio optimization is the process of sequentially allocating wealth to a collection of assets (a portfolio) during consecutive trading periods, based on an investor's risk-return profile. Automating this process with machine learning remains a challenging problem. Here, we design a deep reinforcement learning (RL) architecture with an autonomous trading agent such that, given a portfolio, the weights of assets in the portfolio are updated periodically based on a global objective. In particular, rather than relying on a naive application of an off-the-shelf model-free agent, we train our trading agent within a novel model-based RL architecture using an infused prediction module (IPM) and a behavior cloning module (BCM), extending standard actor-critic algorithms. We further design a back-testing and trade execution engine which interacts with the RL agent in real time. Using historical real financial market data over multiple years, we simulate daily trading with practical constraints, and demonstrate that our proposed model is robust, profitable and risk-sensitive, compared to baseline trading strategies and the model-free RL agents used in prior work.
Contract bridge is a multi-player imperfect-information game in which the players in one partnership collaborate to compete against the other partnership. The game consists of two phases: bidding and playing. While playing is relatively easy for modern software, bidding is challenging and requires agents to learn a communication protocol to jointly reach the optimal contract given their own private information. The agents need to exchange information with their partners, and interfere with opponents, through a sequence of actions. In this work, we train a strong agent to bid competitive bridge purely through self-play, outperforming WBridge5, a championship-winning software. Furthermore, we show that explicitly modeling belief is not necessary for boosting performance. To our knowledge, this is the first competitive bridge agent trained with no domain knowledge. It outperforms the previous state of the art, which uses human replays, with 70x fewer parameters.
In preference-based reinforcement learning (PBRL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in PBRL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. We present DUELING POSTERIOR SAMPLING (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the user's preferences. To solve the credit assignment problem, we develop a Bayesian approach to translate user preferences to a posterior distribution over state/action reward models. We prove an asymptotic no-regret rate for DPS with Bayesian logistic regression credit assignment; to our knowledge, this is the first regret guarantee for PBRL. We also discuss possible avenues for extending this proof methodology to analyze other credit assignment models, and finally, evaluate the approach empirically.
In many real-world scenarios, an autonomous agent often encounters various tasks within a single complex environment. We propose to build a graph abstraction over the environment structure to accelerate the learning of these tasks. Here, nodes are important points of interest (pivotal states) and edges represent feasible traversals between them. Our approach has two stages. First, we jointly train a latent pivotal state model and a curiosity-driven goal-conditioned policy in a task-agnostic manner. Second, a high-level Manager uses the world graph to quickly find solutions to new tasks and expresses subgoals in reference to their nearby pivotal states to a low-level Worker. The Worker can then also use the graph to traverse and explore in long range. We perform a thorough ablation study to evaluate our approach on a suite of challenging maze tasks, demonstrating significant advantages from the proposed framework over baselines that lack world graph knowledge in terms of performance and efficiency.
Paper Submission Deadline: May 20, 2019, 11:59 PM PST
Author Notification: May 31, 2019, 11:59 PM PST
Final Version: June 9, 2019, 11:59 PM PST
Workshop: June 14, 2019 (14:00-18:00 PST)
Caltech
Caltech
Microsoft Research AI
Georgia Tech & NVIDIA
Georgia Tech