An Introduction to Reinforcement Learning

A brief introduction to the main concepts

What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of Machine Learning in which agents learn to make decisions in a dynamic environment in order to maximize a reward.

The typical Reinforcement Learning loop looks like this:

  1. The agent receives an observation, representing the current state of the environment.

  2. Based on the observation and its current policy, the agent selects an action.

  3. The agent performs the action and receives a reward based on the environment’s new state.
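The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `CorridorEnv` environment, its `reset`/`step` methods, and the fixed `policy` function are hypothetical names invented for this example.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor with cells 0..4, where cell 4 is the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode in the leftmost cell and return the first observation.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (move left) or +1 (move right); the agent cannot leave the corridor.
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        # The reward depends on the new state: 1 for reaching the goal, 0 otherwise.
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def policy(observation):
    # A fixed (non-learning) policy, assumed for illustration: always move right.
    return +1


def run_episode(env, policy, max_steps=20):
    observation = env.reset()                          # 1. receive an observation
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(observation)                   # 2. select an action
        observation, reward, done = env.step(action)   # 3. act, receive a reward
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


total_reward, steps = run_episode(CorridorEnv(), policy)
print(total_reward, steps)
```

A learning agent would additionally use the observed rewards to update its policy between steps or episodes; that part is deliberately left out here.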

Markov Decision Process

Formally, RL can be understood as solving a Markov Decision Process (MDP). An MDP is a 5-tuple $(S, A, T, \pi_0, r)$, with:

  • $S$: state space of the environment

  • $A$: action space of the environment, i.e., actions that can be performed by the agent

  • $T : S \times A \times S \to [0,1]$: transition function, which describes how actions affect the state of the environment

  • $r : S \times A \times S \to \mathbb{R}$: reward function

  • $\pi_0 : S \to [0,1]$: probability distribution over initial states
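To make the five components concrete, here is one possible way to write down a tiny MDP in Python. The two-state "thermostat" example, all numbers in it, and the helper names `is_valid` and `sample` are assumptions made for illustration only.

```python
import random

# State space and action space (hypothetical toy example).
S = ["cold", "warm"]
A = ["wait", "heat"]

# Transition function: T[(s, a)][s_next] is the probability of reaching
# s_next when taking action a in state s.
T = {
    ("cold", "wait"): {"cold": 1.0, "warm": 0.0},
    ("cold", "heat"): {"cold": 0.2, "warm": 0.8},
    ("warm", "wait"): {"cold": 0.5, "warm": 0.5},
    ("warm", "heat"): {"cold": 0.0, "warm": 1.0},
}


def r(s, a, s_next):
    # Reward function: being warm is good, heating has a cost.
    comfort = 1.0 if s_next == "warm" else 0.0
    cost = 0.25 if a == "heat" else 0.0
    return comfort - cost


# Probability distribution over initial states.
pi0 = {"cold": 0.9, "warm": 0.1}


def is_valid(T, pi0, tol=1e-9):
    # Every distribution (pi0 and each T[(s, a)]) must sum to 1.
    rows = list(T.values()) + [pi0]
    return all(abs(sum(row.values()) - 1.0) < tol for row in rows)


def sample(dist, rng):
    # Draw one outcome from a {outcome: probability} dictionary.
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


rng = random.Random(0)
s = sample(pi0, rng)                  # draw an initial state
a = "heat"                            # pick some action
s_next = sample(T[(s, a)], rng)       # draw the next state
print(is_valid(T, pi0), s, a, s_next, r(s, a, s_next))
```

Storing $T$ as a dictionary of per-(state, action) distributions is only practical for small, discrete state and action spaces; larger problems typically rely on a simulator instead of an explicit table.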

The agent selects actions using a policy $\pi : S \to A$. The objective is usually to find the optimal policy $\pi^*$, which maximizes the expected cumulative reward.
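As a rough sketch of what "finding the optimal policy" means, the following Python snippet enumerates every deterministic policy of a tiny MDP and keeps the one with the highest cumulative reward. Several simplifications are assumed here and are not part of the definition above: transitions are deterministic, the reward is summed over a fixed horizon of 5 steps without discounting, and the episode always starts in the state `"left"`. All names are hypothetical.

```python
from itertools import product

S = ["left", "right"]
A = ["stay", "switch"]


def next_state(s, a):
    # Deterministic transitions (an assumption that keeps the example small).
    if a == "stay":
        return s
    return "right" if s == "left" else "left"


def r(s, a, s_next):
    # Reward 1 whenever the agent ends up in the "right" state.
    return 1.0 if s_next == "right" else 0.0


def cumulative_reward(policy, start, horizon):
    # Follow the policy for `horizon` steps and sum up the rewards.
    s, total = start, 0.0
    for _ in range(horizon):
        a = policy[s]
        s_next = next_state(s, a)
        total += r(s, a, s_next)
        s = s_next
    return total


# A deterministic policy maps each state to one action, so there are
# len(A) ** len(S) = 4 candidates to compare.
policies = [dict(zip(S, actions)) for actions in product(A, repeat=len(S))]
best = max(policies, key=lambda p: cumulative_reward(p, "left", 5))
print(best, cumulative_reward(best, "left", 5))
```

Brute-force enumeration only works because this example has four policies; the number of deterministic policies grows exponentially with the number of states, which is why RL algorithms search for good policies in other ways.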

Questions

  1. Why does the Cartesian product that defines the domain of the transition function include the state space $S$ twice?

  2. Why does the transition function return a value between 0 and 1?

  3. Assume that from the current state S, you can get to state A with an immediate high reward, or to state B with an immediate low reward. Is state A always preferable to B?

  4. Can teaching a dog a new trick be understood as a Markov Decision Process? If yes, what are the state space, action space, and the reward function?