A3C - Asynchronous Advantage Actor-Critic

Posted Mar 15, 2024

By Fanghao Tian 2 min read

A3C: Asynchronous Advantage Actor-Critic

A3C is an Actor-Critic method that uses multiple workers in parallel. Each worker interacts with its own copy of the environment and sends updates to a shared global network.

The main idea is simple: instead of one agent collecting experience slowly, many agents collect different experiences at the same time.

Why Use Multiple Workers?

RL data is usually sequential. One state follows another, so consecutive samples are highly correlated.

DQN uses a replay buffer to reduce this correlation.

A3C uses parallel workers instead. Since workers explore different trajectories, their experiences are more diverse.

The Architecture

A3C has:

A global network.
Several worker networks.
One environment per worker.

worker 1 -> local experience -> gradients
worker 2 -> local experience -> gradients
worker 3 -> local experience -> gradients

gradients -> global network

Each worker periodically copies the global parameters, collects experience, computes gradients, and applies them to the global network.

Actor-Critic Inside Each Worker

Each worker still uses the Actor-Critic idea:

The actor chooses actions.
The critic estimates state values.
The advantage tells the actor whether an action was better or worse than expected.

A simple advantage estimate is:

advantage = return - V(state)

A3C often uses several steps of experience before updating. This gives a better learning signal than a single-step update.

Why It Can Work Well

A3C has a few practical advantages:

It collects experience faster.
It does not need a large replay buffer.
Different workers naturally explore different behavior.
The global network receives learning signals from many trajectories.

The trade-off is implementation complexity. Parallel training is harder to debug than a single training loop.

A Simple Worker Loop

  
while training:
    copy_global_weights_to_worker()
    rollout = collect_steps_in_environment()
    gradients = compute_actor_critic_gradients(rollout)
    apply_gradients_to_global_network(gradients)

This loop is the heart of A3C.

A3C vs A2C

A3C is asynchronous. Workers update the global network independently.

A2C is synchronous. Workers wait for each other, then update together.

A2C is usually easier to reproduce. A3C can be faster in wall-clock time, but its timing makes results less deterministic.

Key Takeaways

A3C runs multiple Actor-Critic workers in parallel.
Workers collect diverse experience without a replay buffer.
Gradients from workers update a shared global network.
It can train faster, but it is harder to implement and debug.
The method shows how parallel experience collection can improve RL training.

Implement parallel A3C training in the RL-Tutorial-Series repository

AI, Reinforcement Learning, Parallel Learning

This post is licensed under CC BY 4.0 by the author.