Bird’s-Eye View of Reinforcement Learning Algorithms Taxonomy

Episode 3 of the “Invitation to All Aspiring RL Practitioners” Series

Louis Owen
Towards Data Science


In the first part of this series, we learned about some important terms and concepts in Reinforcement Learning (RL). In the second part, we saw how RL is applied to an autonomous race car.

In this article, we will learn about the taxonomy of Reinforcement Learning algorithms: not just one taxonomy, but several, each from a different point of view.

Once we are familiar with these taxonomies, we will learn more about each branch in future episodes. Without wasting any more time, let’s take a deep breath, make a cup of hot chocolate, and dive into the bird’s-eye view of the RL algorithms taxonomy!

Photo by American Heritage Chocolate on Unsplash

Model-Free vs Model-Based

Model-Free vs Model-Based Taxonomy. [Image by Author, Reproduced from OpenAI Spinning Up]

One way to classify RL algorithms is by asking whether the agent has access to a model of the environment. In other words, can we know exactly how the environment will respond to the agent’s actions?

Based on this point of view, we have two branches of RL algorithms: model-free and model-based (a minimal code sketch contrasting the two follows this list):

  • Model-based algorithms try to choose the optimal policy based on a learned (or given) model of the environment.
  • Model-free algorithms choose the optimal policy based on trial-and-error experience gathered by the agent.
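
To make the distinction concrete, here is a minimal sketch, assuming a toy tabular environment (the sizes, the transition model P, and the reward table R are hypothetical placeholders): the model-based agent plans by looking ahead with its model of the environment, while the model-free agent only updates value estimates from sampled transitions.

import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9

# --- Model-based: plan using a (learned or given) model of the environment ---
# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)
R = np.zeros((n_states, n_actions))   # expected immediate reward, also assumed known/learned
V = np.zeros(n_states)                # current state-value estimates

def plan_one_step(s):
    # One-step lookahead: pick the action whose model-predicted outcome looks best
    return int(np.argmax([R[s, a] + gamma * P[s, a] @ V for a in range(n_actions)]))

# --- Model-free: learn purely from trial-and-error samples, no access to P or R ---
Q = np.zeros((n_states, n_actions))
alpha = 0.1

def q_learning_update(s, a, r, s_next):
    # Update the value estimate using only one observed transition (s, a, r, s')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])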

Both model-free and model-based algorithms have their own upsides and downsides as listed in the table below.

Advantages and Disadvantages of Model-Free and Model-Based Algorithms. [Image by Author]

Fact: Model-free methods are more popular than model-based methods.

Value-Based vs Policy-Based

Another way to classify RL algorithms is by considering which component the algorithm optimizes: the value function or the policy.

Before we dive deeper, let’s first learn about the policy and the value function.

Policy

A policy π is a mapping from states to actions, where π(a|s) is the probability of taking action a in state s. A policy can be either deterministic or stochastic.

Photo by Marcus Wallis on Unsplash

Let’s imagine you and I are playing rock-paper-scissors. If you don’t know the game, it is very simple: two people compete against each other, each performing one of three actions (rock, paper, or scissors) at the same time. The rules are simple:

  • Scissors beats paper
  • Rock beats scissors
  • Paper beats rock

Consider two possible policies for iterated rock-paper-scissors (a minimal code sketch of both follows this list):

  • A deterministic policy is easily exploited. If you choose “rock” more often than the other options and I notice that pattern, I can take advantage of it and win with higher probability.
  • A uniform random policy is optimal. If your actions are purely random, I have no clue which action I should play to beat you.
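
Here is a minimal sketch of both policies, with a simple pattern-exploiting opponent (the opponent logic is made up just for illustration):

import random
from collections import Counter

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
COUNTER_MOVE = {v: k for k, v in BEATS.items()}                     # what beats each action

def deterministic_policy():
    return "rock"                    # always plays rock: a fixed, predictable choice

def uniform_random_policy():
    return random.choice(ACTIONS)    # each action with probability 1/3

def exploiter_win_rate(policy, rounds=10_000):
    # An opponent that tracks your history and counters your most frequent move
    history, wins = Counter(), 0
    for _ in range(rounds):
        your_move = policy()
        my_move = COUNTER_MOVE[history.most_common(1)[0][0]] if history else random.choice(ACTIONS)
        wins += BEATS[my_move] == your_move
        history[your_move] += 1
    return wins / rounds

print(exploiter_win_rate(deterministic_policy))   # approaches 1.0: easily exploited
print(exploiter_win_rate(uniform_random_policy))  # hovers around 1/3: nothing to exploit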

Value Function

The value function measures how good a state is, based on a prediction of future reward known as the return. The return G_t is the total sum of “discounted” rewards going forward from time t:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γᵏ·R_{t+k+1}

where γ ∈ [0,1] is the discounting factor. The discounting factor penalizes rewards in the future for several reasons:

  1. It is mathematically convenient.
  2. It breaks infinite loops in the state transition graph, keeping the return finite.
  3. Future rewards carry higher uncertainty (e.g., stock price movements).
  4. Future rewards don’t provide immediate benefits (e.g., humans tend to prefer having fun today rather than 10 years from now).
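
As a quick sanity check on the definition, here is a minimal sketch that computes a discounted return from a made-up list of future rewards (the reward values and γ are arbitrary):

def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    # rewards[0] is R_{t+1}, rewards[1] is R_{t+2}, and so on
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]               # hypothetical rewards from time t+1 onward
print(discounted_return(rewards))              # 1.0 + 0.9**3 * 10 ≈ 8.29
print(discounted_return(rewards, gamma=0.5))   # 1.0 + 0.5**3 * 10 = 2.25

Notice how a smaller γ shrinks the contribution of the distant reward, which is exactly the “penalize the future” effect listed above.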

We now know what the return is. Let’s define the mathematical form of the value function!

There are 2 forms of value function:

  • The state-value function (usually just called the value function) is the expected return when starting from state s at time t: V(s) = E[G_t | S_t = s].
  • The state-action value function (usually called the Q-value) is the expected return when starting from state s and taking action a at time t: Q(s, a) = E[G_t | S_t = s, A_t = a].

The difference between the Q-value and the value function is the action advantage function (usually called the A-value): A(s, a) = Q(s, a) - V(s).
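
For intuition, here is a tiny made-up example of the advantage (the numbers are arbitrary):

Q_sa, V_s = 5.0, 3.0       # hypothetical learned values for some state s and action a
advantage = Q_sa - V_s      # A(s, a) = 2.0: taking a in s is better than s's average action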

Okay, we have learned what the value function and the state-action value function are. Now we are ready to learn about another branching of RL algorithms, one that focuses on which component the algorithm optimizes.

Value-Based and Policy-Based Algorithms. [Image by Author, Reproduced from David Silver’s RL Course]
  • Value-based RL aims to learn the value/action-value function in order to generate the optimal policy (i.e., the optimal policy is obtained implicitly, for example by acting greedily with respect to the learned values).
  • Policy-based RL aims to learn the policy directly using a parameterized function.
  • Actor-Critic RL aims to learn both the value function and the policy (a short sketch contrasting the three follows this list).
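
The difference is easiest to see in how each family produces an action. Below is a minimal, hypothetical sketch (the table sizes and parameters are placeholders, not a full training loop): the value-based agent acts greedily with respect to a learned Q-table, the policy-based agent samples from a directly parameterized softmax policy, and the actor-critic keeps both.

import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)

# --- Value-based: the policy is implicit; act greedily w.r.t. learned Q-values ---
Q = rng.normal(size=(n_states, n_actions))      # pretend these were learned, e.g. by Q-learning

def value_based_action(s):
    return int(np.argmax(Q[s]))                 # requires a maximization over actions

# --- Policy-based: the policy itself is the parameterized object being learned ---
theta = rng.normal(size=(n_states, n_actions))  # policy parameters, updated by a policy gradient

def policy_based_action(s):
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()   # softmax over action preferences
    return int(rng.choice(n_actions, p=probs))          # naturally stochastic

# --- Actor-Critic: keep both; the actor is theta (policy), the critic estimates values ---
V = np.zeros(n_states)   # the critic's state-value estimates, used to reduce update variance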

There are some advantages and disadvantages for both value-based and policy-based methods, as listed in the table below.

Advantages and Disadvantages of Value-Based and Policy-Based Algorithms. [Image by Author]
  1. Value-based algorithms have to pick the action that maximizes the state-action value function, which is expensive when the action space is high-dimensional or continuous. Policy-based algorithms adjust the parameters of the policy directly, without this maximization step.
  2. Value-based algorithms can oscillate, chatter, or even diverge if trained carelessly (worse convergence properties, less stable). Policy-based algorithms are more stable and have better convergence properties because they make only small, incremental changes along the policy gradient.
  3. Policy-based algorithms can learn both deterministic and stochastic policies, while value-based algorithms can only learn deterministic policies.
  4. Naive policy-based algorithms can be slower and have higher variance than value-based ones. Value-based methods pick the action that maximizes the state-action value function, which pushes the policy strongly toward the best policy (faster, lower variance), while policy-based methods take small, smooth update steps, which is more stable but less efficient and sometimes leads to higher variance.
  5. Policy-based methods typically converge to a local rather than a global optimum.

On-Policy vs Off-Policy

There’s also another way to classify RL algorithms. This time the classification is based on the source of the experience used to learn the policy.

On-Policy vs Off-Policy Algorithms. [Image by Author]

We can say that algorithms classified as on-policy are “learning on the job.” In other words, the algorithm attempts to learn about policy π from experience sampled from π.

Algorithms classified as off-policy, on the other hand, work by “looking over someone else’s shoulder.” In other words, the algorithm attempts to learn about policy π from experience sampled from a different behavior policy μ. For example, a robot can learn how to operate by watching how a human behaves.
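
A classic illustration of the difference is to compare the update targets of SARSA (on-policy) and Q-learning (off-policy). This is a minimal sketch, not a full agent; the transition (s, a, r, s_next, a_next) and the table Q (here a dict of dicts) are assumed to come from an interaction loop:

alpha, gamma = 0.1, 0.99

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: the target uses a_next, the action the current (behavior) policy actually took in s_next
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the target uses the greedy action in s_next, regardless of what the
    # behavior policy (e.g. epsilon-greedy, or another agent we are watching) actually did
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])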

Final Words

Photo by Jordan Wozniak on Unsplash

Congratulations on making it this far!

After reading this article, you should know how RL algorithms are classified from several points of view. We will learn more about value-based and policy-based algorithms in future episodes.

Remember, our RL journey is still in its early phase! I still have a lot of material to share with you. So, if you love this content and want to keep learning with me over the next two months, please follow my Medium account to get notified about my future posts!

About the Author

Louis Owen is a Data Science enthusiast who is always hungry for new knowledge. He majored in Mathematics at Institut Teknologi Bandung, one of the top universities in Indonesia, on a full final-year scholarship, and graduated with honors in July 2020.

Louis has experience as an analytics/machine learning intern across various industries, including OTA (Traveloka), e-commerce (Tokopedia), FinTech (Do-it), and smart city applications (Qlue Smart City), and he currently works as a Data Science Consultant at The World Bank.

Check out Louis’ website to learn more about him! Lastly, if you have any questions or topics you would like to discuss, please reach out to Louis via LinkedIn.
