Advanced Topics - Dyna-Q and Curiosity-Driven Learning

Most of the methods covered so far learn only from real experience: the agent acts, observes the result, and updates its policy or value function.

This post introduces two ideas that go beyond that basic loop: planning with a learned model, and exploration driven by curiosity.

Dyna-Q: Learning and Planning Together

Q-Learning only updates from real experience:

state, action, reward, next_state
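
As a reminder of what one such update looks like, here is a minimal sketch of a single tabular Q-learning step. The dictionary-based table, the function name `q_update`, and the values of `alpha` and `gamma` are assumptions chosen for illustration, not settings taken from this series.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    """Apply one Q-learning update from a single real transition."""
    # Value of the best action available in the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    # Move the current estimate a fraction alpha toward the TD target.
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
new_value = q_update(Q, state=0, action=1, reward=1.0, next_state=1, actions=[0, 1])
print(new_value)
```

Each real transition is used once and then discarded, which is the limitation Dyna-Q targets.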

Dyna-Q does one extra thing. It learns a simple model of the environment:

(state, action) -> (next_state, reward)

Then it uses this model to create simulated experience for planning.

The Dyna-Q Loop

1. Take a real action.
2. Update Q-values from the real transition.
3. Store the transition in the model.
4. Sample previously stored transitions from the model.
5. Update Q-values again using simulated transitions.
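
The five steps above can be sketched in a few dozen lines. This is an illustrative version under stated assumptions: the six-state corridor environment, the function names, and every hyperparameter (`planning_steps`, `alpha`, `gamma`, `epsilon`) are hypothetical choices made for the example, and the model is a plain dictionary that only works for deterministic environments.

```python
import random
from collections import defaultdict

N_STATES = 6      # hypothetical corridor: states 0..5, goal at state 5
ACTIONS = [0, 1]  # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic corridor; reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_update(Q, s, a, r, s2, done, alpha, gamma):
    """One Q-learning update; terminal states have no future value."""
    best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (state, action) -> (next_state, reward, done)
    for _episode in range(episodes):
        s, done = 0, False
        while not done:
            # 1. Take a real action (epsilon-greedy, ties broken at random).
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r, done = step(s, a)
            # 2. Update Q-values from the real transition.
            q_update(Q, s, a, r, s2, done, alpha, gamma)
            # 3. Store the transition in the model.
            model[(s, a)] = (s2, r, done)
            # 4 and 5. Sample stored transitions and update Q-values again.
            for _plan in range(planning_steps):
                (ps, pa), (ps2, pr, pdone) = rng.choice(list(model.items()))
                q_update(Q, ps, pa, pr, ps2, pdone, alpha, gamma)
            s = s2
    return Q

Q = dyna_q()
greedy_policy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

Setting `planning_steps=0` turns this back into plain Q-learning, which makes it easy to compare how many real steps each variant needs.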

The benefit is sample efficiency. One real experience can lead to several learning updates.

When Dyna-Q Helps

Dyna-Q is useful when:

  • The environment is simple enough to model.
  • Real experience is expensive.
  • Planning time is available.
  • Past transitions are still useful.

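It can struggle if the learned model is wrong, for example when the environment is stochastic or changes over time. Planning with an inaccurate model pushes the Q-values toward outcomes that do not match the real environment.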

Curiosity-Driven Learning

Some environments have sparse rewards. The agent may wander for a long time without receiving useful feedback.

Curiosity adds an internal reward for discovering something new or surprising.

The basic idea:

external reward + curiosity reward = total reward

This encourages the agent to explore even when the environment reward is rare.

Prediction Error as Curiosity

One common method is to train a model to predict what happens next.

If the prediction is poor, the transition is surprising, and the agent receives a curiosity reward that grows with the size of the error.

curiosity reward = prediction error

As the model gets better at predicting familiar transitions, their prediction error shrinks and they become less rewarding. The agent is pushed toward unfamiliar parts of the environment.
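
To make the shrinking reward concrete, here is a small sketch of prediction error used as a curiosity signal. It assumes a tabular forward model that predicts a distribution over next states and measures error as squared distance; the class name, the learning rate `lr`, and the `beta` weight on the curiosity term are hypothetical. With `beta=1.0` the total matches the simple sum given earlier.

```python
class TabularForwardModel:
    """Predicts the next state for each (state, action) pair seen so far."""

    def __init__(self, n_states, lr=0.5):
        self.n_states = n_states
        self.lr = lr
        self.predictions = {}  # (state, action) -> predicted next-state distribution

    def curiosity_reward(self, state, action, next_state):
        """Return the squared prediction error, then learn from the transition."""
        pred = self.predictions.setdefault((state, action), [0.0] * self.n_states)
        target = [1.0 if s == next_state else 0.0 for s in range(self.n_states)]
        errors = [t - p for t, p in zip(target, pred)]
        reward = sum(e * e for e in errors)
        # Nudge the prediction toward what actually happened.
        for s, e in enumerate(errors):
            pred[s] += self.lr * e
        return reward

def total_reward(external_reward, curiosity_reward, beta=1.0):
    """Combine both signals; beta is an assumed knob for weighting curiosity."""
    return external_reward + beta * curiosity_reward

model = TabularForwardModel(n_states=4)
# Visiting the same transition three times: the reward decays as it becomes familiar.
rewards = [model.curiosity_reward(0, 1, 1) for _ in range(3)]
print(rewards)  # [1.0, 0.25, 0.0625]
# A transition the model has never seen is fully surprising again.
novel = model.curiosity_reward(2, 0, 3)
print(total_reward(external_reward=0.0, curiosity_reward=novel))  # 1.0
```

In practice the forward model is usually a neural network over learned features, but the pattern is the same: predict, measure the error, reward the agent with it, then train on the transition.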

A Simple Example

In a maze, the external reward may only appear at the goal.

Without curiosity, the agent may wander for a very long time before it stumbles on the goal, or never find it at all.

With curiosity, unexplored corridors become temporarily rewarding. This can help the agent discover the goal sooner.

Dyna-Q vs Curiosity

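| Method    | Main Idea                      | Best Use                  |
|-----------|--------------------------------|---------------------------|
| Dyna-Q    | Learn a model and plan with it | Reusing experience        |
| Curiosity | Reward novelty or surprise     | Sparse-reward exploration |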

They address different problems. Dyna-Q asks, “Can I learn more from what I already saw?” Curiosity asks, “What should I explore next?”

Key Takeaways

  • Dyna-Q combines real learning with planning.
  • A learned model can generate simulated experience.
  • Curiosity gives the agent an internal reason to explore.
  • Prediction error is one way to measure novelty.
  • These ideas are useful when experience or rewards are limited.

Experiment with advanced RL techniques in the RL-Tutorial-Series repository

This post is licensed under CC BY 4.0 by the author.