Reinforcement Learning

Reinforcement learning

  • Machine learning with an agent that learns how to make decisions by taking actions in an environment to maximize some notion of cumulative reward
  • The agent follows a policy (strategy) to decide which actions to take in different states
  • Through trial and error, it learns which actions yield the highest rewards in specific situations
  • Reinforcement learning is applied in domains including robotics, recommendation systems, and autonomous vehicles

Reinforcement learning

  • Agent: The learner or decision-maker
  • Environment: The world with which the agent interacts
  • State: The current situation of the agent in the environment
  • Action: What the agent can do in each state
  • Reward: Feedback from the environment indicating the desirability of an action

Reinforcement learning

https://gymnasium.farama.org/introduction/basic_usage/
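
A minimal sketch of the gymnasium interaction loop (make / reset / step), in the spirit of the basic-usage page linked above. The environment name (CartPole-v1), seed and step count are just example choices; the comments map the code onto the agent / environment / state / action / reward terms from the previous slide.

import gymnasium as gym

# Environment: the world the agent interacts with
env = gym.make("CartPole-v1")

# State: reset() returns the initial observation (state) of the environment
observation, info = env.reset(seed=42)

total_reward = 0.0
for _ in range(1000):
    # Action: here the "agent" simply samples a random action
    action = env.action_space.sample()

    # Taking the action moves the environment to a new state and produces a reward
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

    # When the episode ends, report the cumulative reward and start a new one
    if terminated or truncated:
        print(f"episode finished, total reward = {total_reward}")
        total_reward = 0.0
        observation, info = env.reset()

env.close()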

Reinforcement learning

  • RL involves training an agent that interacts with its environment
  • The agent moves between different environmental states by taking actions
  • These actions produce rewards (positive, negative, or zero)
  • The agent’s goal is to maximize total rewards throughout an episode (a sequence from start to finish for the task)
  • We define rewards based on what we want the agent to accomplish

Exploitation vs. Exploration

  • Learning trade-off: exploitation vs. exploration
  • Exploitation – use the knowledge the agent already has: choose actions that have worked well in the past and are known to provide good rewards based on the agent's current understanding
  • Exploration – try new actions or strategies

Exploitation vs. Exploration

Too much exploitation (too greedy): The agent might get stuck using a suboptimal strategy because it never discovers better alternatives.

Too much exploration: The agent spends too much time trying random actions instead of using what it already knows works well.

Exploitation vs. Exploration

  • Some actions provide instant gratification, while others only pay off in the long term
  • Goal: optimize for maximum rewards over multiple episodes
  • Training does not depend only on the current state, but on the whole history of states
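
One standard way of making "maximum rewards over time" precise is the discounted return, which adds up future rewards weighted by a discount factor γ between 0 and 1 (standard textbook notation, not anything defined in the starter code):

G_t = r_t + γ · r_(t+1) + γ² · r_(t+2) + …

A γ close to 1 makes the agent value long-term rewards; a small γ makes it prefer instant gratification. The same γ reappears below in the Q-learning update rule.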

Exploitation vs. Exploration – methods

  • ε-greedy: Choose the best known action most of the time (exploitation), but occasionally (with probability ε) choose a random action (exploration) – see the sketch after this list
  • Softmax/Boltzmann exploration: Choose actions with probability relative to their estimated values
  • Upper Confidence Bound (UCB): Select actions that have high potential value or high uncertainty
  • Adding noise to the policy: inject random noise into the agent's actions, and gradually reduce it over time as the agent learns
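
A minimal sketch of the first method above, ε-greedy action selection (the Q-table layout, number of actions and value of ε are illustrative assumptions):

import numpy as np

def epsilon_greedy(q_table, state, n_actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the best known action for this state (exploitation)."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)     # explore
    return int(np.argmax(q_table[state]))       # exploit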

Q-learning

  • Fundamental reinforcement learning algorithm
  • The goal is to find the policy that maximizes the reward – the optimal action to take in each possible state
  • The agent learns a Q-table, which maps states and actions to their expected rewards
  • Q-table: matrix that estimates the expected future reward for each state-action pair
  • The “Q” stands for “quality” - representing the quality or value of a specific action in a specific state.
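
The Q-values are learned with the standard Q-learning update rule, where s is the current state, a the chosen action, r the observed reward, s′ the next state, α the learning rate and γ the discount factor:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]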

Q-learning – steps

  • Initialize the Q-table with zeros or random values
  • For each episode:
    • Start in some initial state
    • Repeat until goal is reached:
      • Choose an action (exploration or exploitation)
      • Take the action, observe the reward and next state
      • Update the Q-value for the current state-action pair
      • Move to the next state
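
A sketch of these steps in Python, applying the update rule from the previous slide. The environment is assumed to follow the gymnasium-style reset()/step() convention with integer states; the learning rate, discount factor, ε and episode count are illustrative choices.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Initialize the Q-table with zeros
    q_table = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state, _ = env.reset()                       # start in some initial state
        done = False
        while not done:                              # repeat until the episode ends
            # Choose an action: explore with probability epsilon, otherwise exploit
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q_table[state]))

            # Take the action, observe the reward and the next state
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Update the Q-value for the current state-action pair
            best_next = 0.0 if terminated else np.max(q_table[next_state])
            q_table[state, action] += alpha * (reward + gamma * best_next
                                               - q_table[state, action])

            state = next_state                       # move to the next state
    return q_table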

Delivery Truck

  • Objective: move from one corner of a 2D grid (roads) to the other corner (delivery point)
  • Environment: 2D grid, with blocked cells (obstacles)
  • State: the cell in the grid the truck is on
  • Actions: 4 in total – up, down, left, right
  • Reward: positive reward for reaching the goal (delivery spot), negative reward for colliding with an obstacle, small positive reward for moving where there is no obstacle, positive reward for moving in the correct direction
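
A minimal sketch of such a grid environment (the grid size, obstacle positions and reward values are illustrative assumptions, not the starter code linked below; the bonus for moving in the correct direction is omitted for brevity):

class DeliveryGrid:
    """Tiny grid world: start at (0, 0), deliver at the opposite corner.
    Uses the gymnasium-style reset()/step() convention with integer states."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

    def __init__(self, size=5, obstacles=((1, 2), (3, 1))):
        self.size = size
        self.obstacles = set(obstacles)
        self.goal = (size - 1, size - 1)
        self.pos = (0, 0)

    def _state(self):
        return self.pos[0] * self.size + self.pos[1]     # grid cell -> integer state

    def reset(self):
        self.pos = (0, 0)
        return self._state(), {}

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        if (row, col) in self.obstacles:                  # collision: stay put, negative reward
            return self._state(), -10.0, False, False, {}
        self.pos = (row, col)
        if self.pos == self.goal:                         # delivery: large reward, episode ends
            return self._state(), 100.0, True, False, {}
        return self._state(), 0.1, False, False, {}       # valid move: small positive reward

Because the states are plain integers, this plugs directly into the q_learning sketch above, e.g. q_table = q_learning(DeliveryGrid(), n_states=25, n_actions=4).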

Delivery Truck – initial state

Download starter code

Cart-pole balancing problem

You are trying to keep a pole standing up straight. The pole is connected to a cart with a joint that can move freely (no motor). The cart slides along a smooth track with no friction. At the start, the pole is standing straight up. The goal is to stop it from falling over by making the cart go left and right, faster or slower.

Cart-pole balancing problem

  • Objective: keep a pole upright on a cart by applying forces to the cart
  • Environment: The pole is attached to a cart that moves along a frictionless track
  • State: the cart’s position and velocity, and the pole’s angle and angular velocity
  • Actions: move the cart either left or right
  • Reward: +1 for every timestep the pole remains upright

The episode ends when the pole falls too far or the cart moves too far from the center

RL with gymnasium

You need a C++ compiler installed on your computer – for Macs, install Xcode; for Windows machines, install the Visual Studio C++ build tools

https://gymnasium.farama.org/

/path/to/bin/python3 -m pip install gymnasium
/path/to/bin/python3 -m pip install swig
/path/to/bin/python3 -m pip install "gymnasium[box2d]"
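
A quick way to check that the installation works, using the built-in CartPole-v1 environment (this only confirms the import and shows what the observation and action spaces look like):

import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)   # 4 continuous values: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # 2 discrete actions: push the cart left or right
env.close()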

Cart-pole problem

Starter code

Download starter code

Q-learning

  • Define dimensions for the Q-table
  • Fit the continuous observation space into the Q-table by discretizing it
  • Q-table indices are integers, so define a transformation from an observation to integer indices (see the sketch after this list)
  • Initialize the Q-table – zeros or random numbers?
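
A sketch of one way to do this for cart-pole (the number of bins and the value ranges are illustrative assumptions; cart velocity and pole angular velocity are unbounded in the environment, so they are clipped to an assumed range here):

import numpy as np

N_BINS = (6, 6, 12, 12)          # bins for cart position, cart velocity, pole angle, angular velocity
LOWER = np.array([-2.4, -3.0, -0.21, -3.0])
UPPER = np.array([ 2.4,  3.0,  0.21,  3.0])

def observation_to_index(obs):
    """Map a continuous cart-pole observation to a tuple of integer bin indices."""
    clipped = np.clip(obs, LOWER, UPPER)
    ratios = (clipped - LOWER) / (UPPER - LOWER)            # scale each value to [0, 1]
    idx = (ratios * (np.array(N_BINS) - 1)).astype(int)     # scale to bin numbers
    return tuple(idx)

# Q-table: one entry per discretized state and per action (2 actions: left, right)
q_table = np.zeros(N_BINS + (2,))

# q_table[observation_to_index(obs)] then gives the action values for the current state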

Q-learning starter code

Download Q-learning starter code