What is Q-Learning in Reinforcement Learning?

Q-Learning is a model-free, off-policy Reinforcement Learning algorithm. Its goal is to find the optimal action-selection policy for any given environment. The algorithm maintains a Q-table (a tabular representation of the Q-function) that stores and updates the expected future reward for each state-action pair.

In simple terms:

  • The agent learns which action to take in each state to maximize future rewards.
  • It does not need to know the environment’s internal rules (that’s why it’s called model-free).
  • It can learn from any kind of action, even actions outside its current policy, so it’s off-policy.

📌 Key Terminologies

| Term | Explanation |
|---|---|
| Environment | The world in which the agent operates |
| State (s) | The current situation of the agent |
| Action (a) | What the agent can do in that state |
| Reward (r) | Feedback from the environment after an action |
| Policy (π) | Strategy mapping states to actions |
| Q-Table | Table storing values for (state, action) pairs |

📌 Objective of Q-Learning

  • To learn the optimal Q-function:

Q*(s, a) = \text{maximum expected future reward if action 'a' is taken in state 's'}

  • Using this Q-function, the agent chooses actions that maximize rewards in the long run.
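
For readers who want the formal version, Q* can be written as the best achievable expected discounted return (this is the standard MDP definition, stated here as context rather than taken from the text above):

Q*(s, a) = \max_{\pi} \mathbb{E}\left[ \sum_{k=0}^{\infty} γ^{k} r_{t+k+1} \,\middle|\, s_t = s,\, a_t = a,\, \pi \right]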

📌 Q-Learning Formula

The core update rule of Q-learning is:

Q(s, a) = Q(s, a) + α \left[ r + γ \max_{a'} Q(s', a') - Q(s, a) \right]

Where:

  • Q(s, a) = current Q-value for the (state, action) pair.
  • α = learning rate (how fast to update).
  • r = reward for the current action.
  • γ = discount factor (importance of future rewards).
  • s' = next state after taking action a.
  • \max_{a'} Q(s', a') = best future Q-value from the next state.
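
As a minimal sketch, the update rule maps directly to a few lines of Python. The NumPy array representation, integer state/action ids, and the default α and γ values below are illustrative assumptions, not something prescribed by the formula itself:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update for the transition (s, a, r, s_next).

    Q is a 2-D NumPy array of shape (n_states, n_actions).
    """
    td_target = r + gamma * np.max(Q[s_next])   # r + γ max_a' Q(s', a')
    td_error = td_target - Q[s, a]              # how far off the current estimate is
    Q[s, a] += alpha * td_error                 # move the estimate toward the target
    return Q
```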

📌 Step-by-Step Process — How Q-Learning Works

Step 1: Initialization

  • Create the Q-table: its size is (number of states × number of actions), with all entries initialized to 0 or small random values.
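
For example, with NumPy (assuming states and actions are indexed by integers; a sketch, not the only possible representation):

```python
import numpy as np

n_states, n_actions = 9, 4           # e.g. a 3×3 grid with 4 moves (illustrative sizes)
Q = np.zeros((n_states, n_actions))  # all Q-values start at 0
# Alternatively: Q = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))
```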

Step 2: Repeat until learning is complete (for each episode)

Step 2.1: Start in some initial state s

Step 2.2: Choose action using policy

  • Use ε-greedy policy:
    • With probability ε, choose random action (explore).
    • With probability 1-ε, choose action with highest Q-value (exploit).
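
A small sketch of ε-greedy action selection (assuming the Q-table is a NumPy array row-indexed by state; the default ε value is an illustrative choice):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: any action uniformly at random
    return int(np.argmax(Q[s]))              # exploit: action with the highest Q-value
```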

Step 2.3: Take action, observe reward and next state

r, s' = \text{environment.step(action)}

Step 2.4: Update Q-value using formula

Q(s, a) = Q(s, a) + α \left[ r + γ \max_{a'} Q(s', a') - Q(s, a) \right]

Step 2.5: Move to next state

  • Set current state = next state (s = s')

Step 2.6: Repeat until episode ends (goal reached, or max steps hit)
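
Putting Steps 2.1–2.6 together, a training loop might look like the sketch below. The `env` object with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)` is an assumed, Gym-style interface, not something defined in this article:

```python
import numpy as np

def train_q_learning(env, n_states, n_actions, episodes=500,
                     alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100):
    """Tabular Q-learning loop over an assumed env with reset() and step(action)."""
    rng = np.random.default_rng()
    Q = np.zeros((n_states, n_actions))                  # Step 1: initialize
    for _ in range(episodes):                            # Step 2: for each episode
        s = env.reset()                                  # Step 2.1: initial state
        for _ in range(max_steps):
            if rng.random() < epsilon:                   # Step 2.2: ε-greedy choice
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)                # Step 2.3: act, observe
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])  # Step 2.4
            s = s_next                                   # Step 2.5: move to next state
            if done:                                     # Step 2.6: episode ends
                break
    return Q
```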


📌 Example — Grid World

Environment

  • Grid: 3×3
  • Agent starts at (0,0)
  • Goal at (2,2)
  • Actions: [up, down, left, right]
  • Rewards: -1 for every step, +100 on reaching goal

Q-Table (initially)

| State | Up | Down | Left | Right |
|---|---|---|---|---|
| (0,0) | 0 | 0 | 0 | 0 |
| (0,1) | 0 | 0 | 0 | 0 |

Process

  • The agent starts at (0,0) and chooses an action (e.g., “Right”)
  • Moves to (0,1), gets reward (-1)
  • Update:

Q(0,0,\text{Right}) = Q(0,0,\text{Right}) + α \left[ -1 + γ \max_{a'} Q(0,1,a') - Q(0,0,\text{Right}) \right]

  • Repeat until goal reached or max steps hit
  • After many episodes, the Q-table converges to the optimal values and the agent learns the best path to the goal.
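
To make the update concrete, suppose α = 0.1 and γ = 0.9 (illustrative values, not specified above) and the Q-table is still all zeros. Then the very first update is:

Q(0,0,\text{Right}) = 0 + 0.1 \left[ -1 + 0.9 \cdot 0 - 0 \right] = -0.1

So the value of moving Right from (0,0) starts slightly negative, and only becomes attractive once the +100 goal reward propagates back through later updates.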

📌 Final Notes — Key Features of Q-Learning

| Feature | Meaning |
|---|---|
| Model-Free | Doesn’t need the environment’s internal transition/reward rules |
| Off-Policy | Learns the optimal policy independent of how actions are selected during learning |
| Convergence | Guaranteed to converge to the optimal Q if the learning rate and exploration are handled well |
| Trade-off | Between exploration (try new things) and exploitation (use known good actions) |

📌 Visualization Example (Path Found by Q-Learning)

Start -> Right -> Down -> Down -> Right -> Goal

This kind of optimal path is learned over time and is encoded in the Q-table: at each state, the action with the highest Q-value points along the path.
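
A path like this can be read straight out of the learned Q-table by acting greedily; a sketch, assuming the same simplified `env` interface and integer state ids used in the training loop above:

```python
import numpy as np

def greedy_path(env, Q, max_steps=20):
    """Follow argmax-Q actions from the start state and record the visited states."""
    s = env.reset()
    path = [s]
    for _ in range(max_steps):
        a = int(np.argmax(Q[s]))   # always exploit the learned values
        s, _, done = env.step(a)
        path.append(s)
        if done:
            break
    return path
```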


📌 Summary (Quick Chart)

| Step | Action |
|---|---|
| 1 | Initialize Q-table |
| 2 | Start episode (at initial state) |
| 3 | Choose action (ε-greedy) |
| 4 | Take action, get reward + next state |
| 5 | Update Q-table |
| 6 | Move to next state |
| 7 | Repeat until goal reached |
| 8 | Repeat episodes until convergence |

📌 Why Q-Learning Is Powerful

✅ Works in unknown environments
✅ Handles stochastic rewards and transitions
✅ Simple to implement
✅ Guarantees convergence (under conditions)


📌 Limitation

❌ Doesn’t work well if the state space is huge (like image inputs) — needs Deep Q-Learning (DQN) for that.
