Q-Learning is a model-free, off-policy Reinforcement Learning algorithm. Its goal is to find the optimal action-selection policy for a given environment. The algorithm uses a Q-table (a tabular representation of the Q-function) to store and update estimates of the expected future reward for each state-action pair.
In simple terms:
- The agent learns which action to take in each state to maximize future rewards.
- It does not need to know the environment’s internal rules (that’s why it’s called model-free).
- It can learn from actions chosen by any behavior policy, even random or exploratory actions outside the policy it is learning, which is why it’s off-policy.
📌 Key Terminologies
| Term | Explanation |
|---|---|
| Environment | The world in which the agent operates |
| State (s) | The agent’s current situation |
| Action (a) | What the agent can do in a given state |
| Reward (r) | Feedback from the environment after an action |
| Policy (π) | Strategy mapping states to actions |
| Q-Table | Table storing a value for every (state, action) pair |
📌 Objective of Q-Learning
- To learn the optimal Q-function:
$$Q^*(s, a) = \text{maximum expected future reward if action } a \text{ is taken in state } s.$$
- Using this Q-function, the agent chooses actions that maximize rewards in the long run.
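In other words, once $Q^*$ is known, acting optimally reduces to picking the greedy action in every state:
$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$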
📌 Q-Learning Formula
The core update rule of Q-learning is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where:
- $Q(s, a)$ = current Q-value for the (state, action) pair.
- $\alpha$ = learning rate (how fast to update).
- $r$ = reward received for the current action.
- $\gamma$ = discount factor (importance of future rewards).
- $s'$ = next state after taking action $a$.
- $\max_{a'} Q(s', a')$ = best Q-value achievable from the next state.
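As a minimal sketch of this rule in code, assuming the Q-table is stored as a 2-D NumPy array indexed by integer state and action ids (the function name and default values here are illustrative, not part of any standard API):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to a tabular Q (shape: n_states x n_actions)."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]              # difference from the current estimate
    Q[s, a] += alpha * td_error                 # move Q(s, a) a fraction alpha toward the target
    return Q
```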
📌 Step-by-Step Process: How Q-Learning Works
Step 1: Initialization
- Create a Q-table of size (number of states × number of actions), with every entry initialized to 0 or a small random value.
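A possible initialization in Python, assuming integer state and action ids (the 3×3 grid sizes below are purely illustrative):

```python
import numpy as np

n_states, n_actions = 9, 4                       # e.g., a 3x3 grid with 4 moves
Q = np.zeros((n_states, n_actions))              # every (state, action) value starts at 0
# Alternatively, small random values:
# Q = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))
```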
Step 2: Repeat until learning is complete (for each episode)
Step 2.1: Start in some initial state $s$
Step 2.2: Choose an action using the policy
- Use the ε-greedy policy:
- With probability ε, choose a random action (explore).
- With probability 1 − ε, choose the action with the highest Q-value (exploit).
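One way to write the ε-greedy choice, assuming the same NumPy Q-table as above (the helper name is illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))     # explore: uniform over all actions
    return int(np.argmax(Q[s]))                  # exploit: highest Q-value in state s

# Example call: epsilon_greedy(Q, s=0, epsilon=0.1, rng=np.random.default_rng(0))
```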
Step 2.3: Take action, observe reward and next state
$$r,\; s' = \text{environment.step}(a)$$
Step 2.4: Update Q-value using formula
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Step 2.5: Move to next state
- Set current state = next state ($s = s'$)
Step 2.6: Repeat until episode ends (goal reached, or max steps hit)
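Putting Steps 1 to 2.6 together, here is a compact sketch of the whole loop. It assumes a hypothetical environment object exposing `n_states`, `n_actions`, `reset()` returning an integer state, and `step(a)` returning `(next_state, reward, done)`; that interface is an assumption for illustration, not a fixed API:

```python
import numpy as np

def train_q_learning(env, n_episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))      # Step 1: initialize Q-table
    for _ in range(n_episodes):                      # Step 2: one pass per episode
        s = env.reset()                              # Step 2.1: initial state
        for _ in range(max_steps):
            if rng.random() < epsilon:               # Step 2.2: epsilon-greedy choice
                a = int(rng.integers(env.n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)            # Step 2.3: act, observe r and s'
            # Step 2.4: update toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                               # Step 2.5: move to the next state
            if done:                                 # Step 2.6: stop when the episode ends
                break
    return Q
```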
📌 Example — Grid World
Environment
- Grid: 3×3
- Agent starts at (0,0)
- Goal at (2,2)
- Actions: [up, down, left, right]
- Rewards: -1 for every step, +100 on reaching goal
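This environment can be sketched as a tiny Python class that fits the `reset()`/`step()` interface assumed in the training-loop sketch above (the class and attribute names are illustrative):

```python
class GridWorld3x3:
    """3x3 grid: start at (0, 0), goal at (2, 2); -1 per step, +100 on reaching the goal."""
    n_states, n_actions = 9, 4
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

    def reset(self):
        self.pos = (0, 0)
        return 0                                     # state id = row * 3 + col

    def step(self, a):
        dr, dc = self.MOVES[a]
        row, col = self.pos[0] + dr, self.pos[1] + dc
        if 0 <= row < 3 and 0 <= col < 3:            # moves off the grid leave the agent in place
            self.pos = (row, col)
        done = self.pos == (2, 2)
        reward = -1 + (100 if done else 0)           # -1 per step, +100 at the goal
        return self.pos[0] * 3 + self.pos[1], reward, done
```

With these sketches combined, `Q = train_q_learning(GridWorld3x3())` would run the full procedure described above.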
Q-Table (initially)
| State | Up | Down | Left | Right |
|---|---|---|---|---|
| (0,0) | 0 | 0 | 0 | 0 |
| (0,1) | 0 | 0 | 0 | 0 |
| … | … | … | … | … |
Process
- The agent starts at (0,0) and chooses an action (e.g., “Right”)
- It moves to (0,1) and receives a reward of −1
- Update (a worked numeric version follows this list):
$$Q\big((0,0),\text{Right}\big) \leftarrow Q\big((0,0),\text{Right}\big) + \alpha \left[ -1 + \gamma \max_{a'} Q\big((0,1),a'\big) - Q\big((0,0),\text{Right}\big) \right]$$
- Repeat until goal reached or max steps hit
- After many episodes, the Q-table converges to the optimal values and the agent learns the best path to the goal.
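To make that first update concrete, take illustrative values α = 0.1 and γ = 0.9 with the Q-table still all zeros:
$$Q\big((0,0),\text{Right}\big) \leftarrow 0 + 0.1\left[-1 + 0.9 \cdot 0 - 0\right] = -0.1$$
The value of moving Right from (0,0) dips slightly at first; only once later updates propagate the +100 goal reward back through neighbouring states do the values along the good path grow.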
📌 Final Notes — Key Features of Q-Learning
| Feature | Meaning |
|---|---|
| Model-Free | Doesn’t need the environment’s internal transition/reward rules |
| Off-Policy | Learns the optimal policy regardless of how actions are selected during learning |
| Convergence | Converges to the optimal Q-values if every state-action pair keeps being visited and the learning rate is handled well |
| Trade-off | Between exploration (try new things) and exploitation (use known good actions) |
📌 Visualization Example (Path Found by Q-Learning)
Start -> Right -> Down -> Down -> Right -> Goal
This optimal path is learned over time and stored implicitly in the Q-table: in each state, the greedy (highest-Q) action points along the path.
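Continuing the sketches above, the learned path can be read back out of the Q-table by following the greedy action from the start state (the helper below and the action names are illustrative):

```python
import numpy as np

def greedy_path(Q, env, max_steps=20):
    """Follow the highest-Q action from the start state and record the route."""
    names = ["Up", "Down", "Left", "Right"]
    s, path = env.reset(), ["Start"]
    for _ in range(max_steps):
        a = int(np.argmax(Q[s]))                 # greedy action in the current state
        s, _, done = env.step(a)
        path.append(names[a])
        if done:
            path.append("Goal")
            break
    return " -> ".join(path)

# e.g., env = GridWorld3x3(); print(greedy_path(train_q_learning(env), env))
```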
📌 Summary (Quick Chart)
| Step | Action |
|---|---|
| 1 | Initialize Q-table |
| 2 | Start episode (at initial state) |
| 3 | Choose action (ε-greedy) |
| 4 | Take action, get reward + next state |
| 5 | Update Q-table |
| 6 | Move to next state |
| 7 | Repeat until goal reached |
| 8 | Repeat episodes till convergence |
📌 Why Q-Learning Is Powerful
✅ Works in unknown environments
✅ Handles stochastic rewards and transitions
✅ Simple to implement
✅ Guarantees convergence (under conditions)
📌 Limitation
❌ Doesn’t scale well when the state space is huge (like image inputs); Deep Q-Learning (DQN) is needed for that.