While there has been much discussion about the OpenAI bot behavior and the general significance of the matches for the AI community, there has been surprisingly little technical analysis of their training methods and network architecture. The main source so far is the original OpenAI Five blog post. To some extent this is justified – they used the standard PPO algorithm (invented in-house a couple of years ago) and a proven self-play training scheme (also documented in one of their papers). But there is an easy-to-miss hidden gem in the blog post: a link to their network architecture diagram.
In this blog post I would like to focus on one aspect of their network architecture: their inventive use of embeddings to handle a huge and variable number of policy inputs and outputs. While embeddings and dot-product attention are standard techniques in natural language processing, they are not widely used in reinforcement learning.
Update: after writing this blog post I learned that there was a newer version of the network architecture diagram in one of the later OpenAI blog posts. As there are not too many differences in the use of embeddings, I decided not to redo the images and to leave the analysis of the new network as homework.
What are embeddings?
In mathematics an embedding refers to a mapping from a space X to a space Y that preserves some structure of the objects (e.g. the distances between them). In the context of neural networks, however, embedding usually means converting categorical variables (e.g. word indexes) into continuous vectors. Passing word indexes directly as inputs to a network would make its job very hard, as it would need to come up with a binary feature for each value of the index (assuming the values are unrelated). Instead we help the network by converting the categorical values into one-hot vectors. Multiplying a one-hot vector with a weight matrix simply selects the corresponding row of that matrix. In deep learning toolkits the step of converting to a one-hot vector and multiplying with a weight matrix is usually skipped; instead, the index is used directly to select a row from the weight matrix, treating it as a lookup table. The most important aspect is that the embedding vectors are learned, the same way the weight matrix would be learned with one-hot vectors.
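To make the lookup-table view concrete, here is a minimal PyTorch sketch (all sizes are made up) showing that multiplying a one-hot vector with the weight matrix and indexing a row of it give the same result:

```python
import torch
import torch.nn.functional as F

vocab_size, embedding_dim = 116, 8  # e.g. 116 hero types, 8-dim vectors
weight = torch.randn(vocab_size, embedding_dim, requires_grad=True)

index = torch.tensor(42)  # a categorical value, e.g. a hero id

# Explicit version: one-hot vector times weight matrix...
one_hot = F.one_hot(index, num_classes=vocab_size).float()
vec_matmul = one_hot @ weight

# ...is equivalent to a plain row lookup, which is what
# embedding layers actually do under the hood.
vec_lookup = weight[index]

assert torch.allclose(vec_matmul, vec_lookup)
```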

The most well-known use of embeddings is in natural language processing, where word indexes are converted into word vectors (or embeddings). It has been shown that when a network is trained to predict the surrounding words from a word's vector, the word vectors acquire semantic meaning and you can do arithmetic operations with them. For example “woman” – “man” + “king” produces a vector close to “queen”. You could imagine that “woman” – “man” produces a gender-transfer vector, and adding it to “king” converts it into a female ruler. Alternatively, if you frame it as “king” – “man” + “woman”, then “king” – “man” produces the vector of a “ruler”, and adding it to “woman” produces “queen”.

Image credit: Mikolov et al. 2013. Linguistic Regularities in Continuous Space Word Representations.
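To illustrate the arithmetic, here is a toy NumPy sketch with invented 2-D vectors; real word embeddings are learned from text and have hundreds of dimensions:

```python
import numpy as np

# Toy 2-D word vectors, invented for illustration only.
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.1]),
}

# "woman" - "man" + "king" should land near "queen".
query = vectors["woman"] - vectors["man"] + vectors["king"]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

nearest = max(vectors, key=lambda w: cosine(vectors[w], query))
print(nearest)  # queen
```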
OpenAI Five network
Before we delve into specifics, a few words about the general architecture of the network.

Image credit: OpenAI.
Each of the five bots has its own network, with its own inputs and actions. (It is not clear whether the parameters of the five networks are shared.) The only communication between the bots is through the game. I can imagine them doing some kind of bee dance to communicate the location of the enemy to the other bots, but I do not think they actually do it, or would even need to. Update: in the newer version of the network there is max-pooling over players, which could be considered a unidirectional broadcast communication channel.
The upper part of the network in the figure processes the observations. It concatenates data from various sources and passes everything into one LSTM cell. The output of this LSTM cell is used by the lower part of the network to produce actions. In a nutshell it is as simple as that: process the observations, feed them to the LSTM and produce actions. Of course the devil is in the details, and that is what we are going to look into in the next sections.
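As a mental model, the overall pipeline could be sketched like this in PyTorch; all sizes and module names are my own stand-ins, not OpenAI's:

```python
import torch
import torch.nn as nn

# A heavily simplified skeleton of the observations -> LSTM -> actions
# pipeline; the real observation encoder and action decoder are much
# more elaborate, as described in the following sections.
class FiveLikePolicy(nn.Module):
    def __init__(self, obs_dim=512, hidden_dim=1024, n_actions=16):
        super().__init__()
        self.obs_encoder = nn.Linear(obs_dim, hidden_dim)    # stands in for all preprocessing
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, n_actions)  # stands in for action decoding

    def forward(self, obs, state=None):
        x = torch.relu(self.obs_encoder(obs))
        x, state = self.lstm(x, state)
        return self.action_head(x), state

policy = FiveLikePolicy()
obs = torch.randn(1, 1, 512)   # (batch, time, features)
logits, state = policy(obs)    # keep `state` between time steps
```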
Embeddings in observations
OpenAI Five bots use the Dota 2 API to “see” surrounding units and buildings. This produces a variable-length list of units (heroes, creeps, towers, etc.) and their attributes. OpenAI has nice visualizations of the observation space and action space in their blog post; I encourage you to check those out.


Image credit: OpenAI.
The following diagram summarizes the processing applied to the attributes of one unit.

Image credit: OpenAI.
At the top left you can see that each unit is encoded as an embedding. This totally makes sense, because each of the 116 Dota 2 heroes can be classified in various ways:
- Primary attribute: strength, agility, intelligence.
- Attack type: ranged or melee.
- Role: carry, disabler, lane support, initiator, jungler, support, durable, nuker, pusher, escape.
Each of these might form a dimension in the embedding vector, and the network learns automatically how much of a carry, support or jungler each hero is. The same embeddings are also applied to creeps and buildings, e.g. towers also have a ranged attack. This makes for a very generic way to represent different units to the network. The embedding vector is concatenated with other unit attributes like health, distance from heroes, etc., as sketched below.
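A minimal sketch of this concatenation step, with invented sizes and attribute values:

```python
import torch
import torch.nn as nn

N_UNIT_TYPES, EMB_DIM, N_ATTRS = 200, 16, 4  # invented sizes

unit_type_emb = nn.Embedding(N_UNIT_TYPES, EMB_DIM)

# A few observed units: their type ids and continuous attributes
# (e.g. health, distance from hero, ...), all values made up.
unit_types = torch.tensor([17, 3, 95])   # (n_units,)
unit_attrs = torch.randn(3, N_ATTRS)     # (n_units, n_attrs)

# Look up the learned type vector and concatenate the attributes.
per_unit = torch.cat([unit_type_emb(unit_types), unit_attrs], dim=-1)
print(per_unit.shape)  # torch.Size([3, 20])
```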
But embeddings are not used only for unit types – they are also used for modifiers, abilities and items.

Image credit: OpenAI.
Again it totally makes sense – while the abilities of all heroes are different, they certainly have some commonalities, e.g. whether they must be actively used or are passively always on, whether they need a target, whether this target is another unit or an area, etc. In the case of items, some heal you, some grant you magic powers, some are immediately consumed, others upgrade your stats. Embeddings are a natural way to represent things that have many different but potentially overlapping qualities, and that might have a similar effect but to a different extent.
Notice that while the number of modifiers, abilities and items is variable, the network max-pools over each of those lists. This means that only the highest value in each dimension actually gets through. At first this does not seem to make sense – it might give the impression that the bot perceives an ability that is a combination of all existing abilities, e.g. a ranged passive heal. But it seems to work for them.
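For example, max-pooling a variable-length list of ability embeddings into a fixed-size vector might look like this (sizes and ids invented):

```python
import torch
import torch.nn as nn

N_ABILITIES, EMB_DIM = 1000, 16  # invented sizes
ability_emb = nn.Embedding(N_ABILITIES, EMB_DIM)

# A hero with a variable number of abilities (ids made up).
abilities = torch.tensor([5, 42, 7, 300])

# Embed each ability, then max-pool over the list: the result is a
# fixed-size vector regardless of how many abilities the hero has,
# keeping only the highest value in each dimension.
pooled, _ = ability_emb(abilities).max(dim=0)
print(pooled.shape)  # torch.Size([16])
```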
The above processing is done separately for each of the nearby units; the results from general attributes, hero modifiers, abilities and items are all concatenated together. Then different post-processing is applied depending on whether the unit is an enemy non-hero, allied non-hero, neutral, allied hero or enemy hero.

Image credit: OpenAI.
Finally, the results of the post-processing are max-pooled over all units of that type. Again this seems questionable at first sight, because different qualities of nearby units get combined, e.g. if one of the dimensions represented the health of a unit, then the network would see only the maximum health over units of the same type. But, again, it seems to work fine.
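Put together, the per-category post-processing and pooling could be sketched as follows; the category names and sizes are my own stand-ins:

```python
import torch
import torch.nn as nn

UNIT_DIM, OUT_DIM = 20, 32  # invented sizes

# One post-processing network per unit category (my naming, not OpenAI's).
heads = nn.ModuleDict({
    cat: nn.Linear(UNIT_DIM, OUT_DIM)
    for cat in ["enemy_nonhero", "allied_nonhero", "neutral",
                "allied_hero", "enemy_hero"]
})

# E.g. four observed enemy heroes, each already encoded as a vector.
enemy_heroes = torch.randn(4, UNIT_DIM)

# Apply the category-specific head, then max-pool over the units of
# that category into a single fixed-size vector.
pooled, _ = heads["enemy_hero"](enemy_heroes).max(dim=0)
print(pooled.shape)  # torch.Size([32])
```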
The max-pooled results for each of the unit types are concatenated and then fed into the LSTM. There are also slices taken from the first half of the output, but we will get back to those when we talk about action targets.
Embeddings in actions
The action space of Dota 2 is estimated to contain 170,000 different actions. This includes normal actions like moving and attacking, but also using abilities, using items, upgrading your stats and so on. Not all actions are available at each time step – you might not have a specific ability yet, or an item in your inventory. Still, there are approximately 1,000 different actions that you could potentially use. In addition, many of those actions have parameters, like the area where you want to move or the enemy you want to target. Again, OpenAI has a nice visualization of the action space in their blog post.


Image credit: OpenAI.
This poses an enormous exploration problem for RL, because initially the agent starts by just trying out random actions. The naive approach would be to output scores for all 170,000 actions and limit the softmax to the roughly 1,000 currently available actions. But OpenAI solved it neatly with embeddings and a variable-length softmax.

Image credit: OpenAI.
As you can see at the top, each of the actions has an embedding, capturing e.g. whether it is a ranged attack, using an item for healing, or teleporting to a certain destination. The dot product of an action embedding and the LSTM output is used to produce a score for each action. Those scores go through a softmax, and the resulting probability distribution is used to choose one of the available actions.
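A minimal sketch of this variable-length scoring, using the action counts from above (the specific ids are made up):

```python
import torch
import torch.nn as nn

N_ALL_ACTIONS, HIDDEN = 170_000, 1024
action_emb = nn.Embedding(N_ALL_ACTIONS, HIDDEN)

lstm_output = torch.randn(HIDDEN)  # the "intention" vector

# Only score the actions that are currently available.
available = torch.tensor([3, 1021, 50_000, 123_456])

# Dot product of each available action's embedding with the LSTM
# output, followed by a softmax over just those actions.
scores = action_emb(available) @ lstm_output  # (n_available,)
probs = torch.softmax(scores, dim=0)
action = available[torch.multinomial(probs, 1)]
```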
Sidenote: the dot product of two vectors multiplies them element-wise and sums the result. It is sometimes also called the scalar product, because it produces a single scalar value. It is closely related to cosine similarity – it tends to produce high values when the vectors point in the same direction and low values when they point in opposing directions. It is often used as a quick way to score the similarity of two vectors. Indeed, this is exactly what the convolution operation does – it produces a feature map of similarities between the filter and the input.
For example, for \(\mathbf{v} = (1, 0)\) and any \(\mathbf{u}\) perpendicular to it:
- dot product: \(\mathbf{u} \cdot \mathbf{v} = \sum_i \mathbf{u}_i \mathbf{v}_i = 0\)
- cosine similarity: \(\cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} = 0\)
- cosine distance: \(1 - \cos \theta = 1\)
How I imagine the action selection actually working is that the LSTM produces something that could in principle be called an “intention vector”. For example, if you are in a sticky situation and your health bar is really low, then the intention would be something like “get out of here”. This intention is matched against the available actions and produces a high score wherever they are aligned. For example, both the “move” and “teleport” actions might be well aligned with the intention “get out of here”. Teleport might be slightly better aligned, because you cannot be chased by the enemy, so it produces a higher score and a higher probability after the softmax. But if teleportation is not available at the moment, then its embedding is not matched, and “move” probably ends up with the highest score and probability.
Some actions have parameters, like a destination or a target. All of these are modeled in a straightforward manner with softmaxes. For example, the X and Y coordinates are discretized into ranges, instead of being modeled as continuous outputs with a Gaussian distribution. I guess softmax handles multimodal distributions better. The important observation here is that the action outputs seemingly do not model the joint distribution between an action and its target. I believe this is not a problem, because all action outputs are conditioned on the LSTM output. Therefore the LSTM output already encodes the “intention”, and those FC layers just decode different aspects of this “intention”: the action and its target.
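For instance, the discretized coordinate heads might look something like this; the bin counts are my guess:

```python
import torch
import torch.nn as nn

HIDDEN, N_X_BINS, N_Y_BINS = 1024, 9, 9  # bin counts invented

# Separate softmax heads for the discretized X and Y offsets, both
# conditioned on the same LSTM output rather than on each other.
x_head = nn.Linear(HIDDEN, N_X_BINS)
y_head = nn.Linear(HIDDEN, N_Y_BINS)

lstm_output = torch.randn(HIDDEN)
x_bin = torch.multinomial(torch.softmax(x_head(lstm_output), dim=0), 1)
y_bin = torch.multinomial(torch.softmax(y_head(lstm_output), dim=0), 1)
```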
But my favorite part is how OpenAI Five handles targeting. Remember those weird slices from the unit observation outputs? They are shown in blue on the diagram, which means they are per-unit. These vectors are called “unit attention keys” and are matched with the LSTM “intention” to produce a score for each unit. Those scores go through a softmax, which determines the unit to target. Again, the softmax is taken over the observed units, the same way the action softmax is taken over the available actions.
How I imagine it working: from the observations, the network determines that some unit has really low health and the bot has a chance for a last hit. The LSTM produces the intention “attempt last hit”, which aligns well with the “attack” action. This “attempt last hit” intention is also matched against the per-unit outputs from observation processing and aligns well with the unit with the lowest health. Bang – you land the last hit and collect the additional bounty.
Update: in the more recent version of the network they modulate the LSTM output with the action embedding before taking the dot product with the unit attention keys. I guess that otherwise different actions (e.g. attack and heal) would attempt to target similar units.
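Here is a minimal sketch of this attention over units, including the newer modulation step; I am guessing element-wise multiplication for “modulate”, and all values are made up:

```python
import torch

HIDDEN = 1024

# Attention keys sliced from the per-unit observation outputs,
# plus the chosen action's embedding.
unit_keys = torch.randn(12, HIDDEN)  # 12 observed units
lstm_output = torch.randn(HIDDEN)
action_vec = torch.randn(HIDDEN)

# Newer-network variant: modulate the LSTM output with the action
# embedding before matching it against the unit attention keys.
query = lstm_output * action_vec

scores = unit_keys @ query  # one score per observed unit
target = torch.multinomial(torch.softmax(scores, dim=0), 1)
```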
Final words
After analyzing the OpenAI Five network it becomes clear that most of the network is dedicated to perception (preprocessing the observations) and motor control (decoding the actions). All the strategy and tactics must lie in one place: the 1024-unit LSTM.

Image credit: OpenAI.
I think it is amazing that one relatively simple mathematical construct can produce such complicated behavior. Or, I don’t know, maybe it says something about the complexity of the Dota 2 game? Do short-term tactics combined with fast reaction times beat long-term strategy?
Thanks to Jaan Aru, Tanel Pärnamaa, Roman Ring and Daniel Majoral for insightful comments and discussions.