Q-Learning Intuition
- **State** \( s \): A representation of the current situation or configuration of the system. The set of all possible states is denoted \( S \).
- **Action** \( a \): A decision or move that can be made while in a particular state. The set of actions available in state \( s \) is denoted \( A(s) \).
- **Reward Function** \( R(s, a) \): The immediate reward received after taking action \( a \) in state \( s \). It quantifies the immediate benefit or cost of an action.
- **Value Function** \( V(s) \): An estimate of the maximum expected return (cumulative reward) achievable starting from state \( s \). It reflects the long-term value of being in that state.
- **Discount Factor** \( \gamma \): A factor between 0 and 1 that determines the present value of future rewards. A value closer to 0 prioritizes immediate rewards, while a value closer to 1 gives more weight to future rewards.
- **Transition Probability** \( P(s' \mid s, a) \): The probability of transitioning to state \( s' \) after taking action \( a \) in state \( s \). It captures the dynamics of the environment.
- **Optimal Policy** \( \pi^* \): A policy that specifies the best action to take in each state so as to maximize the expected cumulative reward. The value function under the optimal policy is denoted \( V^*(s) \).
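Taken together, these quantities satisfy the standard Bellman optimality equation, which defines the optimal value function recursively in terms of its successor states:

\[
V^*(s) = \max_{a \in A(s)} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right]
\]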
Q-Learning in Reinforcement Learning
Q-Learning is a key model-free algorithm in reinforcement learning (RL) that lets an agent learn how to achieve a goal in an environment through trial and error. It does this by learning an action-value function, \( Q(s, a) \), that estimates the expected cumulative reward of taking a particular action in a given state.
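In the tabular setting this value function is literally a table, with one entry per state-action pair. A minimal sketch of the data structure (the state and action counts here are illustrative assumptions, not from the text):

```python
import numpy as np

n_states, n_actions = 16, 4           # illustrative sizes, not from the text
Q = np.zeros((n_states, n_actions))   # Q[s, a]: estimated return for action a in state s

# Acting greedily means reading off the best entry for the current state.
state = 0
best_action = int(np.argmax(Q[state]))
```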
Key Concepts of Q-Learning
- Agent and Environment: The agent interacts with the environment by taking actions and receiving rewards.
- States: The different situations the agent can be in.
- Actions: The choices the agent can make in each state.
- Rewards: The feedback received after taking an action in a state.
- Q-Values: The expected cumulative future reward for taking a specific action in a specific state, written \( Q(s, a) \).
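The optimal Q-values satisfy a Bellman equation of their own, relating each state-action pair to the best action available in the resulting state:

\[
Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q^*(s', a')
\]

Q-learning estimates \( Q^* \) directly from sampled transitions, without needing \( P \) or \( R \) explicitly.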
Q-Learning Algorithm Steps
- Initialize the Q-table \( Q(s, a) \) with arbitrary values (often zeros).
- For each episode:
  - Start in an initial state.
  - While the state is not terminal:
    - Choose an action using an exploration strategy (such as ε-greedy).
    - Take the action; observe the reward \( r \) and the new state \( s' \).
    - Update the Q-value using the formula below, where \( \alpha \) is the learning rate (implemented in the sketch after these steps):
      \[
      Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
      \]
    - Set the new state as the current state.
- Repeat until the Q-values converge.
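As a concrete illustration of these steps, here is a minimal tabular Q-learning sketch in Python. It assumes a small discrete environment exposing the Gymnasium-style `reset()`/`step()` interface; the environment, episode count, and hyperparameter values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    # Step 1: initialize the Q-table with zeros.
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state, _ = env.reset()              # start in an initial state
        done = False
        while not done:                     # loop until a terminal state
            # Explore with probability epsilon, otherwise exploit
            # the current best estimate (epsilon-greedy strategy).
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))

            # Take the action; observe the reward and the new state.
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-learning update: move Q(s, a) toward the TD target
            # r + gamma * max_a' Q(s', a'); bootstrap only if s' is
            # not terminal.
            td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (td_target - Q[state, action])

            state = next_state              # the new state becomes current
    return Q
```

With `gymnasium` installed, this could be tried on a small discrete environment, for example `env = gymnasium.make("FrozenLake-v1")`, using `n_states = env.observation_space.n` and `n_actions = env.action_space.n`.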
Additional Reading
- Markov Decision Processes: Concepts and Algorithms, Martijn van Otterlo (2009)