There are two main kinds of problems in Reinforcement Learning (RL): evaluation of a policy, aka the prediction problem, and finding the best policy, aka the control problem. A policy dictates the action to be taken given the current state. In this post, we will evaluate 2 pricing policies for airfare using Temporal Difference (TD) learning algorithms. Then we will use a policy improvement algorithm to find the better of the 2 policies. We will use my Python package for RL called qinisa, which contains some classic Reinforcement Learning and Multi-Armed Bandit algorithms. Here is the GitHub repo.

## Reinforcement Learning

Reinforcement Learning is used for decision making under uncertain conditions. An RL agent takes some action in the external environment based on its current state and gets some reward from the environment, possibly after some delay. The goal of the agent is to maximize reward over a set of actions taken. The interaction with the environment can be continuous or episodic. Episodic interaction ends when the agent reaches a terminal state. An RL agent is characterized by

- **state**: The state of the agent
- **action**: The action taken by the agent from a given state
- **reward**: The reward obtained from the environment as a result of taking some action from some state
- **policy**: Defines the action to be taken from a given state. A policy can be deterministic or probabilistic. If probabilistic, the action is sampled from an action distribution for the state

For high-dimensional or continuous state and action spaces, it's more appropriate to use Deep Learning. Various RL problems amount to finding functions, and a neural network, being a universal function approximator, is very effective. Deep Reinforcement Learning is the proper approach for these cases.

## Policy Evaluation and Improvement

This post is focused on the policy evaluation problem, in which the value of each state is estimated. The value of a state is the reward for taking an action from that state plus the discounted sum of rewards from all subsequent states visited as per the policy. This can be expressed as a recursive equation, according to which a state's value is the reward from the state plus the discounted value of the next state. The state value function for a policy is as follows

**V(s) = ∑_{a} π(s,a) ∑_{s′} P^{a}_{ss′} (R^{a}_{ss′} + γ V(s′))**

where

V(s) = state value

π(s,a) = policy, i.e. the probability of taking action a in state s

P^{a}_{ss′} = probability of transitioning from state s to s′ through action a

R^{a}_{ss′} = reward for taking action a and transitioning from s to s′

γ = discount factor

V(s′) = next state value

If the policy is deterministic, the first summation goes away. If the transition probabilities and rewards are known, there is a closed-form solution obtained by solving a set of linear equations. That's rarely the case in the real world, where the environment's dynamics are unknown and non-deterministic.
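To make the closed-form case concrete, here is a minimal sketch for a small hypothetical 3-state chain under a fixed deterministic policy. The Bellman equation in matrix form is V = r + γPV, which rearranges to the linear system (I − γP)V = r. All the numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed deterministic policy.
# P[i][j] = probability of moving from state i to state j under the policy.
P = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],  # state 2 is absorbing (terminal)
])
r = np.array([5.0, 2.0, 0.0])  # expected reward on leaving each state
gamma = 0.9

# Bellman equation in matrix form: V = r + gamma * P @ V
# => (I - gamma * P) V = r, solved directly as a linear system
V = np.linalg.solve(np.eye(3) - gamma * P, r)
print(V)  # [6.8, 2.0, 0.0]
```

The terminal state has value 0, state 1 is worth its immediate reward 2, and state 0 is worth 5 + 0.9 × 2 = 6.8.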

Iterative policy evaluation is used instead, where state values are iteratively updated. The iterative form is known as the Bellman equation. The equation is the same as before, except that the state values are subscripted with an iteration count, as V(s)_{k+1} and V(s′)_{k}. As k approaches infinity, the state values approach the true state values. The value of a state is replaced with the reward plus the discounted old value of the successor state.
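The iterative update can be sketched as repeated Bellman backups over a small hypothetical chain (the transition matrix and rewards below are illustrative assumptions, not from the airfare example):

```python
import numpy as np

# Iterative policy evaluation on a hypothetical 3-state chain.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
r = np.array([5.0, 2.0, 0.0])
gamma = 0.9

V = np.zeros(3)
for k in range(1000):
    # Bellman backup: V_{k+1}(s) = r(s) + gamma * sum_{s'} P(s,s') * V_k(s')
    V_new = r + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)  # converges to [6.8, 2.0, 0.0]
```

Because this toy chain is short and absorbing, the iteration converges in a few sweeps; in general the error shrinks geometrically with factor γ.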

There are 2 techniques for policy evaluation: Monte Carlo and Temporal Difference. In Monte Carlo, for a state value, all the discounted rewards obtained from all subsequent states visited as per the policy are accumulated. The value of a state is not updated until the episode ends and all the rewards are obtained.
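The Monte Carlo return for one episode can be sketched as a backward accumulation of discounted rewards (the reward values here are made up for illustration):

```python
# Monte Carlo style return: accumulate discounted rewards from one full
# episode before updating any state value.
def episode_return(rewards, gamma):
    """Discounted return G = r0 + gamma*r1 + gamma^2*r2 + ...
    computed backward for numerical simplicity."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [5.0, 2.0, 0.0]  # rewards from one simulated episode
print(episode_return(rewards, 0.9))  # 5 + 0.9*2 + 0.81*0 = 6.8
```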

In Temporal Difference, the current value of the next state is used, and the value of a state is updated immediately after each step. Here is the algorithm. For TD to converge, the learning rate should decay at each iteration.

```
Input: policy (P), learning rate (lr), discount rate (dr)
Output: state values

initialize all V(s)
repeat for all episodes
    set initial state s
    repeat until terminal state
        get action a from policy P for state s
        take action a and get reward r and the next state s1
        lr = decay(lr)
        V(s) = V(s) + lr * (r + dr * V(s1) - V(s))
        s = s1
```
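The algorithm above can be sketched in Python as follows. This is not the qinisa implementation; the callables `policy`, `env_step`, `init_state`, and `is_terminal` are assumed interfaces for illustration.

```python
def td0_evaluate(policy, env_step, init_state, is_terminal,
                 episodes=2000, lr=0.1, dr=0.9, decay=0.9999):
    """TD(0) policy evaluation sketch.
    policy(s) -> action, env_step(s, a) -> (reward, next_state)
    are assumed callables supplied by the caller."""
    V = {}
    for _ in range(episodes):
        s = init_state()
        while not is_terminal(s):
            a = policy(s)
            r, s1 = env_step(s, a)
            lr *= decay  # decaying learning rate helps convergence
            V.setdefault(s, 0.0)
            V.setdefault(s1, 0.0)
            # TD(0) update: move V(s) toward r + dr * V(s1)
            V[s] += lr * (r + dr * V[s1] - V[s])
            s = s1
    return V

# Tiny deterministic chain for demonstration: 0 -> 1 -> 2 (terminal),
# with rewards 5 and 2 on the two transitions.
V = td0_evaluate(policy=lambda s: "go",
                 env_step=lambda s, a: {0: (5.0, 1), 1: (2.0, 2)}[s],
                 init_state=lambda: 0,
                 is_terminal=lambda s: s == 2)
print(V[0], V[1])  # approaches 6.8 and 2.0 with dr = 0.9
```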

For policy comparison, we recalculate state values for the first policy, except that we use the action based on the second policy. If the new state values are greater than or equal to the original state values for all states, then the second policy is better.

```
Input: policy (P2), state values from policy P1 (V1), discount rate (dr)
Output: best policy (bestPolicy)

count = 0
repeat for all states
    get action a from policy P2 for state s
    take action a and get reward r and the next state s1
    V3(s) = r + dr * V1(s1)
    if V3(s) >= V1(s)
        count = count + 1
if V3 is greater than or equal to V1 for all states
    bestPolicy = P2
else
    bestPolicy = P1
```
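A minimal Python sketch of this comparison step, with `policy2_step` as an assumed callable returning the reward and next state under the second policy:

```python
def compare_policies(states, V1, policy2_step, dr=0.9):
    """One-step lookahead comparison sketch: re-evaluate each state using
    policy P2's action against P1's state values V1.
    policy2_step(s) -> (reward, next_state) under P2 is an assumed callable."""
    V3 = {}
    count = 0
    for s in states:
        r, s1 = policy2_step(s)
        V3[s] = r + dr * V1.get(s1, 0.0)  # unseen successors valued at 0
        if V3[s] >= V1[s]:
            count += 1
    best = "P2" if count == len(states) else "P1"
    return V3, count, best

# Toy example: P2's actions yield higher rewards in both states.
V1 = {0: 6.8, 1: 2.0}
steps = {0: (6.0, 1), 1: (3.0, 2)}
V3, count, best = compare_policies([0, 1], V1, lambda s: steps[s])
print(V3, count, best)  # {0: 7.8, 1: 3.0} 2 P2
```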

## Airfare Policy

We will evaluate 2 policies for airfare for a certain flight. The pricing policies are enforced for a period of 60 days prior to the flight date. The 60-day period is divided into 4 segments. There are 5 discrete occupancy levels, each corresponding to a range of booked seats. A state is defined by a combination of the time segment and the occupancy level at the beginning of that time segment, giving 20 states.

An action is defined by either a discount or a surcharge on the base price. A discount of 0 implies the base price. There are 10 possible actions. The reward is defined as the revenue from ticket sales during a time segment with some discount or surcharge applied to the base price.
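The state and action spaces can be sketched as below. The segment and occupancy labels follow the naming visible in the output later in this post, but the specific discount and surcharge fractions are hypothetical; the actual values live in the driver code.

```python
# Hypothetical encoding of the airfare problem's states and actions.
# 4 time segments x 5 occupancy levels = 20 states.
segments = ["P1", "P2", "P3", "P4"]            # 15-day windows before the flight
occupancies = ["O1", "O2", "O3", "O4", "O5"]   # booked-seat ranges, low to high

states = [seg + occ for seg in segments for occ in occupancies]
print(len(states))  # 20

# 10 actions: fractional discount (negative) or surcharge (positive)
# on the base price; 0.0 means charge the base price. Values are assumed.
actions = [-0.20, -0.15, -0.10, -0.05, 0.0, 0.05, 0.10, 0.15, 0.20, 0.25]

def price(base, action):
    """Ticket price after applying a discount or surcharge fraction."""
    return base * (1 + action)

print(price(100.0, -0.10))  # 90.0
```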

A policy tells us how much to discount or surcharge the base price for a given time segment and occupancy level. There are 2 deterministic policies defined by the airline's staff based on domain expertise. The second policy is more aggressive in pricing. Our goal is to find which policy is better, i.e. generates more revenue. Please refer to the Python driver code for details.

## Airfare Policy Evaluation and Comparison

We are using a Python RL package called qinisa. In the latest version, the following solutions have been added. More classic RL algorithms will be added in the future.

- TD learning
- Policy comparison and improvement
- Q learning

In the real world, actual flight booking data following the prescribed pricing policy would have been used for reward feedback. I have simulated the flight booking data and reward. Demand was considered random, and the booking data, i.e. the reward, was obtained by sampling from a probability distribution.
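A hedged sketch of how such a simulation could look: demand is drawn from a distribution, price changes shift the expected demand, and revenue is the reward. The elasticity model and all numbers below are my assumptions for illustration, not the simulation used in the tutorial.

```python
import random

def simulate_reward(base_price, action, mean_demand, elasticity=1.5, seed=None):
    """Illustrative booking simulator: Gaussian demand around a mean that
    shrinks with price increases (simple linear elasticity assumption)."""
    rng = random.Random(seed)
    price = base_price * (1 + action)
    # higher price -> lower expected demand; discount -> higher demand
    expected = mean_demand * (1 - elasticity * action)
    booked = max(0, int(rng.gauss(expected, 0.1 * mean_demand)))
    return booked * price  # revenue over the segment is the reward

# 10% discount on a 100-unit base fare, mean demand of 30 seats
print(simulate_reward(100.0, -0.10, 30, seed=42))
```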

Please refer to the tutorial for details. There are 2 steps. In the first step, Temporal Difference is used to find state values for both policies, and the output is saved. In the second step, the output from the first step is used to compare the 2 policies using the policy improvement algorithm. Here is the final output from the second step.

```
{'P1O1': (104.63950797215047, 59.76476143241095, 1), 'P2O1': (77.56241200948618, 4.170746154089204, 1),
 'P3O1': (76.82000000000001, 0, 1), 'P1O2': (86.0251197131279, 32.45583466698227, 1),
 'P2O2': (95.52928870050667, 57.38294560229932, 1), 'P3O2': (72.944, 0, 1),
 'P2O3': (93.7013444717806, 37.96590682594545, 1), 'P3O3': (69.16666666666667, 32.96340634574586, 1),
 'P2O4': (69.80174226686508, 4.24651170580217, 1), 'P3O4': (65.4, 36.30835110383048, 1),
 'P2O5': (39.23240893353175, 0, 1), 'P3O5': (25.715666666666667, 9.579377824770257, 1)}
num of states 12
num of states with higher value 12
```

The number of states is less than the total number of states because the terminal states and some unreachable initial states are excluded. The hash map output contains, for each state, the new state value as per the policy improvement algorithm, the existing state value, and a flag indicating whether the new state value is greater.

It shows that the second policy, i.e. the more aggressive one, is better than the first policy. For TD learning details, please refer to the implementation.

## Summing Up

We have gone through policy evaluation and policy comparison algorithms in Reinforcement Learning, using airfare policy as an example. Our final objective was to find the better of the 2 policies. In a future post, we will discuss how to find an optimal pricing policy for airfare using Q-Learning.