2.1 k-armed Bandit Problem
Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps. This is the original form of the k-armed bandit problem, so named by analogy to a slot machine, or “one-armed bandit,” except that it has k levers instead of one. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated action selections you are to maximize your winnings by concentrating your actions on the best levers.
In our k-armed bandit problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action. We denote the action selected on time step $t$ as $A_t$, and the corresponding reward as $R_t$. The value then of an arbitrary action a, denoted $q_*(a)$, is the expected reward given that a is selected:
$$q_*(a) = \mathbb{E}[R_t \mid A_t = a].$$
If you knew the value of each action, then it would be trivial to solve the k-armed bandit problem: you would always select the action with highest value. We assume that you do not know the action values with certainty, although you may have estimates. We denote the estimated value of action a at time step t as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.
greedy, exploiting, and exploring
If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value. Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run. For example, suppose a greedy action’s value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don’t know which one. If you have many time steps ahead on which to make action selections, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation.
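To make the trade-off concrete, here is a minimal sketch (in Python; the function name and the value $\varepsilon = 0.1$ are illustrative choices, not from the text) of an $\varepsilon$-greedy rule: with probability $\varepsilon$ a random action is explored, otherwise a greedy action is exploited.

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Pick an action index given the current estimates Q_t(a).

    With probability epsilon, explore: choose uniformly among all k actions.
    Otherwise, exploit: choose a greedy action (ties broken at random).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))
    best = max(q_estimates)
    greedy = [a for a, q in enumerate(q_estimates) if q == best]
    return random.choice(greedy)
```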
2.2 Action-Value Methods
We begin by looking more closely at methods for estimating the values of actions and for using the estimates to make action selection decisions, which we collectively call action-value methods. Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received: $$Q_t(a) \equiv \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$
where $\mathbb{1}_{\text{predicate}}$ denotes the random variable that is 1 if the predicate is true and 0 if it is not. If the denominator is zero, then we instead define $Q_t(a)$ as some default value, such as 0. As the denominator goes to infinity, by the law of large numbers, $Q_t(a)$ converges to $q_*(a)$. We call this the sample-average method for estimating action values.
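As a sketch, the sample-average definition can be computed directly from the history of actions and rewards (the function and argument names here are hypothetical):

```python
def sample_average(actions, rewards, a, default=0.0):
    """Sample-average estimate Q_t(a).

    actions[i] is the action taken at step i and rewards[i] the reward that
    followed it; both lists cover the steps prior to t. Returns `default`
    (e.g., 0) if action `a` has never been taken, since the denominator
    would be zero.
    """
    taken = [r for act, r in zip(actions, rewards) if act == a]
    if not taken:
        return default
    return sum(taken) / len(taken)
```

This literal form stores the whole history; Section 2.4 below replaces it with a constant-memory incremental update.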
2.3 The 10-armed Testbed
2.4 Incremental Implementation
The action-value methods we have discussed so far all estimate action values as sample averages of observed rewards. We now turn to the question of how these averages can be computed in a computationally efficient manner, in particular, with constant memory and constant per-time-step computation. To simplify notation we concentrate on a single action. Let $R_i$ now denote the reward received after the $i$th selection of this action, and let $Q_n$ denote the estimate of its action value after it has been selected $n - 1$ times, which we can now write simply as
$$Q_n \equiv \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}.$$
The obvious implementation would be to maintain a record of all the rewards and then perform this computation whenever the estimated value was needed. If this is done, however, the memory and computational requirements grow over time as more rewards are seen ($O(n)$ in the number of rewards). Each additional reward requires additional memory to store it and additional computation to compute the sum in the numerator.
Of course this is not an unsolvable problem: the average can be maintained incrementally. Given $Q_n$ and the $n$th reward $R_n$, the new average of all $n$ rewards is
$$Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr],$$
which requires memory only for $Q_n$ and $n$, and only a small, constant computation per new reward.
Note that the step-size parameter (StepSize) used in the incremental method (2.3) decreases from time step to time step; for the sample average it is $\frac{1}{n}$ on the $n$th selection of the action.
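A constant-memory sketch of this incremental update for a single action (class and attribute names are my own):

```python
class IncrementalAverage:
    """Incremental sample-average estimate for one action.

    Stores only the current estimate Q_n and the count n, so memory and
    per-step computation stay constant no matter how many rewards arrive.
    """

    def __init__(self, initial_estimate=0.0):
        self.q = initial_estimate  # default value used before any reward is seen
        self.n = 0                 # number of times this action has been selected

    def update(self, reward):
        """Apply Q_{n+1} = Q_n + (1/n) [R_n - Q_n] and return the new estimate."""
        self.n += 1
        self.q += (reward - self.q) / self.n  # step size 1/n shrinks over time
        return self.q
```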
2.5 Tracking a Nonstationary Problem
The averaging methods discussed so far are appropriate for stationary bandit problems, that is, for bandit problems in which the reward probabilities do not change over time. For a nonstationary problem, however, it makes sense to give more weight to recent rewards than to rewards received long ago.
If we look at the update formula for a nonstationary problem, we can use a constant step size instead of $\frac{1}{n}$. This results in $Q_{n+1}$ being a weighted average of past rewards and the initial estimate $Q_1$:
$$Q_{n+1} = Q_n + \alpha\bigl[R_n - Q_n\bigr] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i,$$
where the step size $\alpha \in (0, 1]$ is constant. The weight given to $R_i$ decays exponentially with how long ago it was received, so this is sometimes called an exponential recency-weighted average.
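A sketch of the constant step-size update, with an illustrative $\alpha = 0.1$ (not a value prescribed by the text):

```python
def constant_step_update(q, reward, alpha=0.1):
    """One step of Q_{n+1} = Q_n + alpha [R_n - Q_n] with a fixed alpha.

    Recent rewards receive exponentially more weight than older ones,
    so the estimate keeps tracking a reward distribution that drifts.
    """
    return q + alpha * (reward - q)
```

Unlike the $\frac{1}{n}$ step size, a constant $\alpha$ never lets the influence of new rewards vanish, which is what we want when $q_*(a)$ itself moves over time.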