Deep Q-Networks and Practical Reinforcement Learning with TensorFlow

Sophie Turol


This blog post highlights what to know when implementing reinforcement learning with TensorFlow, as discussed at one of the sessions at TensorBeat 2017. You will find out which toolkit simplifies working with environments, how to handle the pitfalls of distributed training, how to boost performance across multiple environments, and more.


Making reinforcement learning work

Illia Polosukhin, a co-founder and chief scientist, provided some practical insights into reinforcement learning with TensorFlow.

As Illia puts it, reinforcement learning does not start from a fixed training data set. Instead, an agent derives different types of observations from an environment, performs actions, and so on.

To do so, one can employ OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. It features a library of environments for games, classical control systems, etc. to aid developers in creating algorithms of their own. All of the environments expose the same API. The library also enables users to compare and share results.

Illia demonstrated sample code for a simple agent acting within an environment.


He also showed the code behind an agent.
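The code from the talk is not reproduced in this text. As a rough stand-in, here is a minimal sketch of the agent-environment loop. It uses a toy environment that imitates the reset() / step() interface of classic OpenAI Gym, so it runs without Gym installed. The names CorridorEnv, RandomAgent, and run_episode are hypothetical and made up for this illustration; they are not taken from the presentation.

```python
import random


class CorridorEnv:
    """Toy environment (hypothetical): walk from cell 0 to the last cell of a corridor."""

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.n_actions = 2  # 0 = step left, 1 = step right

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action; return (observation, reward, done, info) as classic Gym does."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done, {}


class RandomAgent:
    """Baseline agent that ignores the observation and acts uniformly at random."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.randrange(self.n_actions)


def run_episode(env, agent):
    """Run one episode; return the total (undiscounted) reward and the step count."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward, env.steps


if __name__ == "__main__":
    print(run_episode(CorridorEnv(), RandomAgent(n_actions=2)))
```

Because every environment exposes the same two calls, the loop in run_episode should carry over to a real Gym environment of that era with little change, while a learning agent would replace RandomAgent.act with a policy.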


So, what makes it all work?

The set of states and actions, coupled with the rules for transitioning from one state to another, makes up a Markov decision process (MDP). One episode of this process (e.g., one game) produces a finite sequence of states, actions, and rewards.

What one has to define is:

  • Return: the total discounted reward
  • Policy: the agent’s behaviour (deterministic or stochastic)
  • Value functions: the expected return starting from a particular state (state-value function) or from a particular state and action (action-value function)
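To make the first item concrete, the discounted return of an episode can be computed from its reward sequence by working backwards. This is a small sketch: the function names are hypothetical, and the default discount factor gamma=0.99 is an assumed, commonly used value rather than one quoted in the talk.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


def returns_per_step(rewards, gamma=0.99):
    """Return [G_0, G_1, ...], where G_t is the discounted return from step t onwards."""
    returns, g = [], 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    return returns
```

For example, with rewards [1, 1, 1] and gamma=0.5, the return from the first step is 1 + 0.5 + 0.25 = 1.75. Averaging such returns over many episodes that start in the same state gives an estimate of that state's value.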


Deep Q-learning

One of the ways to approach reinforcement learning is deep Q-learning, a model-free, off-policy technique. Model-free means that the agent does not build or learn an approximation of the MDP itself. Observations are stored in a replay buffer and later used as training data for the model. Off-policy means that the policy being learned is independent of the actions the agent actually takes while collecting experience.
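As an illustration of the two ingredients mentioned here, the sketch below shows a replay buffer and the Q-learning target computed for a stored transition. It is a plain-Python approximation under stated assumptions, not the code from the talk: ReplayBuffer and td_target are hypothetical names, and the capacity and gamma defaults are arbitrary.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)  # the oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random batch without replacement (smaller if the buffer is short)."""
        size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Q-learning target: r + gamma * max over a' of Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

The max over next-state Q-values is what makes the method off-policy: the target assumes the best action is taken next, regardless of which action the agent actually chose when the transition was recorded.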

Illia then demonstrated what the Q-network code looks like.


He also showed how to run the optimization.
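The optimization code itself is not reproduced in this text. To give a feel for what such a step does, the sketch below replaces the neural network with a linear Q-function (one weight vector per action) and applies semi-gradient updates that reduce the squared difference between the predicted Q-value and the Q-learning target. The names q_values and sgd_step, the linear model, and the default hyperparameters are all assumptions made for illustration; in TensorFlow, an optimizer minimizing the same kind of loss would take the place of the manual update.

```python
def q_values(weights, state):
    """Linear Q-function: one weight vector per action, Q(s, a) = w_a . s."""
    return [sum(w * x for w, x in zip(w_a, state)) for w_a in weights]


def sgd_step(weights, batch, gamma=0.99, learning_rate=0.1):
    """Update weights in place, one transition at a time.

    Returns the mean squared TD error, each error measured just before its update.
    """
    total_loss = 0.0
    for state, action, reward, next_state, done in batch:
        if done:
            target = reward
        else:
            target = reward + gamma * max(q_values(weights, next_state))
        td_error = target - q_values(weights, state)[action]
        total_loss += td_error ** 2
        # Gradient of 0.5 * td_error^2 w.r.t. w_action is -td_error * state,
        # with the target held fixed (hence "semi-gradient").
        weights[action] = [w + learning_rate * td_error * x
                           for w, x in zip(weights[action], state)]
    return total_loss / len(batch)
```

Calling sgd_step repeatedly on batches sampled from stored experience should drive the returned error down on simple problems, although with function approximation this is not guaranteed in general.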


More examples can be found in this GitHub repo.


A monitored session

As a practical tip for training a TensorFlow model, Illia suggested using MonitoredSession for:

  • handling pitfalls of distributed training
  • saving and restoring checkpoints
  • injecting computation into the TensorFlow training loop via hooks


Asynchronous Advantage Actor-Critic

To enhance reinforcement learning, the Asynchronous Advantage Actor-Critic (A3C) algorithm can be employed. In contrast to a deep Q-network, it runs multiple agents in parallel, each represented by its own neural network. Each agent interacts with its own copy of the environment, so its experience is independent of the experience of the other agents.

Furthermore, this algorithm estimates both a value function and a policy (a set of action probability outputs). The agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods. Finally, the advantage measures how different the actual outcome of an action is from the one the critic expected.
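The relationship between actor, critic, and advantage can be sketched as a per-step loss. This is one common formulation of an advantage actor-critic loss, written in plain Python for readability; it is not the loss shown in the talk. The function names and the coefficients value_coef=0.5 and entropy_coef=0.01 are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Turn raw action scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_loss(logits, action, value, target_return,
                      value_coef=0.5, entropy_coef=0.01):
    """Loss for one step: policy term + weighted value term - weighted entropy bonus.

    Returns (total_loss, advantage).
    """
    probs = softmax(logits)
    # Advantage: how much better the observed return was than the critic's estimate.
    # In a real implementation it is treated as a constant in the policy term.
    advantage = target_return - value
    policy_loss = -math.log(probs[action]) * advantage
    value_loss = (target_return - value) ** 2
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    total = policy_loss + value_coef * value_loss - entropy_coef * entropy
    return total, advantage
```

A positive advantage makes the chosen action more likely after a gradient step on the policy term, a negative one makes it less likely, and the entropy bonus discourages the policy from collapsing onto a single action too early.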


The techniques described above can be applied in areas such as robotics, finance, industrial optimization, and predictive assistance.





You can also check out the full presentation by Illia Polosukhin below.


Related video and slides

In this video from a TensorFlow meetup in London, Leonardo De Marchi, Lead Data Scientist at Badoo, shares how to apply reinforcement learning within the gaming industry.

Below, you will find the slides by Leonardo De Marchi.




About the speaker

Illia Polosukhin is a chief scientist and a co-founder. Prior to that, he worked as an engineering manager at Google. Illia is passionate about all things artificial intelligence and machine learning. He holds a master’s degree in Applied Math and Computer Science from Kharkiv Polytechnic Institute. You can check out his GitHub profile.


