Q-Learning is a model-free, off-policy Reinforcement Learning algorithm. Its goal is to find the optimal action-selection policy for a given environment. The algorithm uses a Q-table (a tabular representation of the Q-function) to store and update estimates of the expected future reward for each state-action pair.
In simple terms:
- The agent learns which action to take in each state to maximize future rewards.
- It does not need to know the environment’s internal rules (that’s why it’s called model-free).
- It can learn from actions chosen by any behavior policy, even random or exploratory actions outside the policy it is learning, which is why it’s off-policy.
📌 Key Terminologies
| Term | Explanation |
|---|---|
| Environment | The world in which the agent operates |
| State (s) | The agent’s current situation |
| Action (a) | What the agent can do in a given state |
| Reward (r) | Feedback from the environment after an action |
| Policy (π) | Strategy mapping states to actions |
| Q-Table | Table storing a value for every (state, action) pair |
📌 Objective of Q-Learning
- To learn the optimal Q-function:
$$Q^*(s, a) = \text{maximum expected future reward if action } a \text{ is taken in state } s.$$
- Using this Q-function, the agent chooses actions that maximize rewards in the long run.
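In other words, once $Q^*$ is known, acting optimally reduces to picking the greedy action in every state:
$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$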
📌 Q-Learning Formula
The core update rule of Q-learning is:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where:
- $Q(s, a)$ = current Q-value for the (state, action) pair.
- $\alpha$ = learning rate (how fast to update).
- $r$ = reward received for the current action.
- $\gamma$ = discount factor (importance of future rewards).
- $s'$ = next state after taking action $a$.
- $\max_{a'} Q(s', a')$ = best Q-value achievable from the next state.
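As a minimal sketch of this rule in code, assuming the Q-table is stored as a 2-D NumPy array indexed by integer state and action ids (the function name and default values here are illustrative, not part of any standard API):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to a tabular Q (shape: n_states x n_actions)."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]              # difference from the current estimate
    Q[s, a] += alpha * td_error                 # move Q(s, a) a fraction alpha toward the target
    return Q
```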
📌 Step-by-Step Process: How Q-Learning Works
Step 1: Initialization
- Create a Q-table of size (number of states × number of actions), with every entry initialized to 0 or a small random value.
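A possible initialization in Python, assuming integer state and action ids (the 3×3 grid sizes below are purely illustrative):

```python
import numpy as np

n_states, n_actions = 9, 4                       # e.g., a 3x3 grid with 4 moves
Q = np.zeros((n_states, n_actions))              # every (state, action) value starts at 0
# Alternatively, small random values:
# Q = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))
```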
Step 2: Repeat until learning is complete (for each episode)
Step 2.1: Start in some initial state $s$
Step 2.2: Choose an action using the policy
- Use the ε-greedy policy:
- With probability ε, choose a random action (explore).
- With probability 1 − ε, choose the action with the highest Q-value (exploit).
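One way to write the ε-greedy choice, assuming the same NumPy Q-table as above (the helper name is illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))     # explore: uniform over all actions
    return int(np.argmax(Q[s]))                  # exploit: highest Q-value in state s

# Example call: epsilon_greedy(Q, s=0, epsilon=0.1, rng=np.random.default_rng(0))
```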
Step 2.3: Take action, observe reward and next state
$$r,\; s' = \text{environment.step}(a)$$
Step 2.4: Update Q-value using formula
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Step 2.5: Move to next state
- Set current state = next state ($s = s'$)
Step 2.6: Repeat until episode ends (goal reached, or max steps hit)
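Putting Steps 1 to 2.6 together, here is a compact sketch of the whole loop. It assumes a hypothetical environment object exposing `n_states`, `n_actions`, `reset()` returning an integer state, and `step(a)` returning `(next_state, reward, done)`; that interface is an assumption for illustration, not a fixed API:

```python
import numpy as np

def train_q_learning(env, n_episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))      # Step 1: initialize Q-table
    for _ in range(n_episodes):                      # Step 2: one pass per episode
        s = env.reset()                              # Step 2.1: initial state
        for _ in range(max_steps):
            if rng.random() < epsilon:               # Step 2.2: epsilon-greedy choice
                a = int(rng.integers(env.n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)            # Step 2.3: act, observe r and s'
            # Step 2.4: update toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                               # Step 2.5: move to the next state
            if done:                                 # Step 2.6: stop when the episode ends
                break
    return Q
```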
📌 Example — Grid World
Environment
- Grid: 3×3
- Agent starts at (0,0)
- Goal at (2,2)
- Actions: [up, down, left, right]
- Rewards: -1 for every step, +100 on reaching goal
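This environment can be sketched as a tiny Python class that fits the `reset()`/`step()` interface assumed in the training-loop sketch above (the class and attribute names are illustrative):

```python
class GridWorld3x3:
    """3x3 grid: start at (0, 0), goal at (2, 2); -1 per step, +100 on reaching the goal."""
    n_states, n_actions = 9, 4
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

    def reset(self):
        self.pos = (0, 0)
        return 0                                     # state id = row * 3 + col

    def step(self, a):
        dr, dc = self.MOVES[a]
        row, col = self.pos[0] + dr, self.pos[1] + dc
        if 0 <= row < 3 and 0 <= col < 3:            # moves off the grid leave the agent in place
            self.pos = (row, col)
        done = self.pos == (2, 2)
        reward = -1 + (100 if done else 0)           # -1 per step, +100 at the goal
        return self.pos[0] * 3 + self.pos[1], reward, done
```

With these sketches combined, `Q = train_q_learning(GridWorld3x3())` would run the full procedure described above.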
Q-Table (initially)
| State | Up | Down | Left | Right |
|---|---|---|---|---|
| (0,0) | 0 | 0 | 0 | 0 |
| (0,1) | 0 | 0 | 0 | 0 |
| … | … | … | … | … |
Process
- The agent starts at (0,0) and chooses an action (e.g., “Right”)
- It moves to (0,1) and receives a reward of −1
- Update (a worked numeric version follows this list):
$$Q\big((0,0),\text{Right}\big) \leftarrow Q\big((0,0),\text{Right}\big) + \alpha \left[ -1 + \gamma \max_{a'} Q\big((0,1),a'\big) - Q\big((0,0),\text{Right}\big) \right]$$
- Repeat until goal reached or max steps hit
- After many episodes, the Q-table converges to the optimal values and the agent learns the best path to the goal.
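To make that first update concrete, take illustrative values α = 0.1 and γ = 0.9 with the Q-table still all zeros:
$$Q\big((0,0),\text{Right}\big) \leftarrow 0 + 0.1\left[-1 + 0.9 \cdot 0 - 0\right] = -0.1$$
The value of moving Right from (0,0) dips slightly at first; only once later updates propagate the +100 goal reward back through neighbouring states do the values along the good path grow.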
📌 Final Notes — Key Features of Q-Learning
| Feature | Meaning |
|---|---|
| Model-Free | Doesn’t need the environment’s internal transition/reward rules |
| Off-Policy | Learns the optimal policy regardless of how actions are selected during learning |
| Convergence | Converges to the optimal Q-values if every state-action pair keeps being visited and the learning rate is handled well |
| Trade-off | Between exploration (try new things) and exploitation (use known good actions) |
📌 Visualization Example (Path Found by Q-Learning)
Start -> Right -> Down -> Down -> Right -> Goal
This optimal path is learned over time and stored implicitly in the Q-table: in each state, the greedy (highest-Q) action points along the path.
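Continuing the sketches above, the learned path can be read back out of the Q-table by following the greedy action from the start state (the helper below and the action names are illustrative):

```python
import numpy as np

def greedy_path(Q, env, max_steps=20):
    """Follow the highest-Q action from the start state and record the route."""
    names = ["Up", "Down", "Left", "Right"]
    s, path = env.reset(), ["Start"]
    for _ in range(max_steps):
        a = int(np.argmax(Q[s]))                 # greedy action in the current state
        s, _, done = env.step(a)
        path.append(names[a])
        if done:
            path.append("Goal")
            break
    return " -> ".join(path)

# e.g., env = GridWorld3x3(); print(greedy_path(train_q_learning(env), env))
```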
📌 Summary (Quick Chart)
| Step | Action |
|---|---|
| 1 | Initialize Q-table |
| 2 | Start episode (at initial state) |
| 3 | Choose action (ε-greedy) |
| 4 | Take action, get reward + next state |
| 5 | Update Q-table |
| 6 | Move to next state |
| 7 | Repeat until goal reached |
| 8 | Repeat episodes till convergence |
📌 Why Q-Learning Is Powerful
✅ Works in unknown environments
✅ Handles stochastic rewards and transitions
✅ Simple to implement
✅ Guarantees convergence (under conditions)
📌 Limitation
❌ Doesn’t scale well when the state space is huge (like image inputs); Deep Q-Learning (DQN) is needed for that.