
Pommerman

Based on the classic game Bomberman, my team of 8 tackled how agents accomplish tasks with infinitely meaningful variations through competitive multi-agent learning. Our team was split into two groups, PyTorch and Tensorforce, with the goal of seeing whether different frameworks lead to different results when running the same algorithms, PPO and A2C.

We leverage pre-existing architecture and code from the Pommerman Competition, as cited below.

Pommerman Competition

Original Fork

Table of Contents

  1. Introduction
  2. Related Work
  3. Methodology
  4. Findings
  5. Challenges
  6. Future Scope
  7. Conclusion
  8. References

Introduction

Pommerman is a game environment modeled on the classic game Bomberman, built for reinforcement learning purposes. The environment is a symmetric 11x11 grid with four agents, one in each corner. The game environment contains wooden and rigid walls. Both are impassable; however, a wooden wall can be destroyed. When a wooden wall is destroyed, it can reveal a passage or a power-up. Power-ups can only be obtained by blowing up wooden walls.

Each agent can move up, down, left, or right, and can lay a bomb. Agents start with one bomb and can place bombs whenever they want. After nine time steps a bomb explodes, destroying any wooden walls, agents, power-ups, and other bombs in its range. The goal of each agent is to be the final surviving player.


There are also power-ups players can obtain. Power-ups have a chance to spawn when a wooden wall is blown up. The power-ups are: an extra bomb, increased bomb explosion range, and the ability to kick bombs. An extra bomb allows the agent to drop one more bomb. Increased range increases the agent's bomb blast strength by one. The kick power-up allows the agent to kick bombs by running into them; when kicked, a bomb continues to travel in the direction the agent was moving until it encounters a bomb, a player, or a wall.

The initial reward function only gives an agent a +1/-1 reward at the end of each game, and game length varies widely in timesteps depending on how long it takes for all other players to die. This creates a problem of delayed, sparse rewards for an AI agent, which requires custom reward shaping for reinforcement learning algorithms to learn effectively.

To tackle this project, we had to research and understand different reinforcement learning algorithms as well as the game environment. We decided to try two reinforcement learning algorithms, PPO and A2C, because several papers reported that vanilla DQN does not work well in the Pommerman environment. Designing and testing reward-shaping functions was another essential part of agent learning that heavily impacts training time and efficiency.

Related Work

Pommerman has hosted several previous competitions to see which research teams could create the best AI agent. The most prominent competition agents have all used PPO or A2C as the learning algorithm. However, these learning algorithms alone are not enough for an agent to learn to play Pommerman effectively, due to the sparse-rewards issue and the difficulty of laying bombs without committing suicide.

One previous competition submission achieved strong results by using the PPO algorithm alongside a custom action-pruning function that would not allow the agent to move into positions that lead to suicide. This helped prevent the agent from killing itself during training so it could steadily learn to use bombs effectively. The submission also included a FIFO queue to keep track of previously visited coordinates and rewarded traversing unvisited coordinates to encourage exploration.
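
As a rough illustration (not the submission's actual code), that kind of action filter can be sketched as follows, assuming a hypothetical `blast_zones(obs)` helper that returns the set of board cells currently threatened by bombs:

```python
import random

# Hypothetical sketch of suicide-avoiding action pruning; not the original submission's code.
# Assumes `blast_zones(obs)` returns the set of (row, col) cells threatened by active bombs.

MOVES = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}  # stop, up, down, left, right

def prune_actions(obs, blast_zones):
    """Return the subset of movement actions that do not step into a blast zone."""
    row, col = obs["position"]
    threatened = blast_zones(obs)
    safe = []
    for action, (dr, dc) in MOVES.items():
        if (row + dr, col + dc) not in threatened:
            safe.append(action)
    return safe or [0]  # if nothing is safe, fall back to staying put

def act(obs, policy_action, blast_zones):
    """Override the policy's choice only when it would lead into a blast zone."""
    safe = prune_actions(obs, blast_zones)
    return policy_action if policy_action in safe else random.choice(safe)
```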

The winner of the NeurIPS 2018 Pommerman competition used the A2C algorithm alongside an approach inspired by population-based training, where multiple A2C agents are created with different hyperparameters (specifically the discount factor) and each agent is ranked based on match results.

Methodology

PyTorch vs TensorFlow

Our approach to this project was to train the agent with two frameworks, PyTorch and TensorFlow, alongside our two learning algorithms. We used a TensorFlow implementation of PPO and a PyTorch implementation of A2C. PyTorch and TensorFlow don't have built-in implementations of PPO and A2C, but there are frameworks built on top of them that do: we used Tensorforce for TensorFlow and Ilya Kostrikov's implementation for PyTorch. The Pommerman competition Playground package is implemented on top of OpenAI Gym and gives us direct access to the game state and game environment for training and testing. During training, we log the model's layer weights, hyperparameters, and other observable features using TensorBoard. TensorBoard is not only compatible with both PyTorch and Tensorforce, but also allows us to easily monitor the agent during training.
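
As a rough sketch of the setup (the agent slots and logged quantities are placeholders, not our exact training loop), creating the Playground environment and logging to TensorBoard looks roughly like this:

```python
import pommerman
from pommerman import agents
from torch.utils.tensorboard import SummaryWriter  # TensorBoard logging, usable for both frameworks

# Rough sketch: four baseline SimpleAgents; in practice one slot is replaced by the learning agent.
agent_list = [
    agents.SimpleAgent(),
    agents.SimpleAgent(),
    agents.SimpleAgent(),
    agents.SimpleAgent(),
]
env = pommerman.make("PommeFFACompetition-v0", agent_list)
writer = SummaryWriter(log_dir="runs/pommerman")

for episode in range(10):
    state = env.reset()
    done, episode_reward, steps = False, 0.0, 0
    while not done:
        actions = env.act(state)            # each agent picks an action from its observation
        state, reward, done, info = env.step(actions)
        episode_reward += reward[0]         # track the reward of the agent in slot 0
        steps += 1
    writer.add_scalar("episode/reward", episode_reward, episode)
    writer.add_scalar("episode/length", steps, episode)

env.close()
writer.close()
```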

Reward Functions

One obstacle we wanted to tackle is how the agent overcomes local minima. We address this with reward shaping that encourages the agent to lay bombs. As it currently stands, our agent rarely places a bomb and instead runs away from other agents.

Since the original reward function only rewards an agent at the end of each game depending on whether the agent is dead or alive, there is a sparse-rewards issue. To fix this, all of our implemented reward functions give the agent a reward at each timestep instead. We also tried to keep our rewards as "hands off" as possible to avoid building something similar to a rule-based agent. The goal of our reward functions was to coax the agent toward a more optimal policy.

The reward functions we implemented included a bomb-dodging function, which gave the agent a negative reward if it was in a bomb blast zone and a positive reward if it was in a safe location. The goal of this function was to help the agent learn to avoid bombs so that it would not commit suicide. Another goal of reward shaping was to give the agent an incentive to explore the grid. Without help, the agent would simply learn to hide in its spawn corner and collect reward for surviving as long as possible. This is suboptimal, as we want our agent to pick up power-ups and gain an advantage over its opponents. We introduced reward rules to help the agent explore and pick up power-ups, such as a visited memory, where the agent is rewarded for visiting coordinates not yet in the memory. We also incentivized blowing up wooden walls and other agents, as well as picking up power-ups, through specific reward functions.

To test our dodging reward function, we added another reward function that automatically rewards the agent for laying a bomb. This lets the agent immediately drop a bomb and begin learning to avoid it. This reward function is only temporary and for research and demonstration purposes; the long-term goal is for the agent to learn the usefulness of bombs on its own rather than requiring an explicit reward.
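
A simplified sketch of how these shaping terms (dodging, exploration, power-up pickup, and the temporary bomb-laying bonus) can be combined at each timestep; the helper functions and weight values here are illustrative placeholders, not our tuned values:

```python
# Illustrative reward-shaping sketch; weights and helpers are placeholders, not our tuned values.

def shaped_reward(obs, action, visited, in_blast_zone, picked_up_powerup):
    """Combine per-timestep shaping terms on top of the sparse win/loss reward."""
    reward = 0.0

    # Bomb dodging: penalize standing in a blast zone, small bonus for being safe.
    reward += -0.1 if in_blast_zone(obs) else 0.01

    # Exploration: reward visiting coordinates not yet in the visited memory.
    position = tuple(obs["position"])
    if position not in visited:
        reward += 0.05
        visited.add(position)

    # Power-ups: small bonus when one is picked up this step.
    if picked_up_powerup:
        reward += 0.1

    # Temporary term used only while testing the dodging reward:
    # directly reward laying a bomb (action 5) so the agent has something to dodge.
    if action == 5:
        reward += 0.05

    return reward
```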

Environment

The Pommerman environment we train in has three structures: no walls, some walls, and default. The default environment has 32 wooden walls and 32 rigid walls with 20 power-ups. Although the wall placement is randomly generated, there is a small section of 4 walls for each agent that guarantees the agent has a path to traverse. For the some-walls variants, we use wooden-rigid combinations of 24-24, 16-16, and 8-8, with the number of power-ups being two less than the number of wooden walls.

The last type of environment we use has no rigid or wooden walls at all. This lets us see whether, and how much, the agent's behavior deviates from its behavior in an environment with walls.
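
For reference, the variants described above can be summarized as a small configuration table; how these counts map onto the Playground's board-generation parameters depends on the package version used:

```python
# Environment variants we trained in (counts as described above).
ENV_VARIANTS = {
    "default":       {"num_wood": 32, "num_rigid": 32, "num_powerups": 20},
    "some_walls_24": {"num_wood": 24, "num_rigid": 24, "num_powerups": 22},
    "some_walls_16": {"num_wood": 16, "num_rigid": 16, "num_powerups": 14},
    "some_walls_8":  {"num_wood": 8,  "num_rigid": 8,  "num_powerups": 6},
    "no_walls":      {"num_wood": 0,  "num_rigid": 0,  "num_powerups": 0},
}
```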

Algorithms

A2C

A2C stands for Advantage Actor-Critic and is a reinforcement learning algorithm that incorporates two models, an actor and a critic. The actor models the agent's policy: given observations, it determines the best course of action. The critic learns to evaluate the agent and its game state, generating a "value score" for the actor's action in that state. The actor-critic method iteratively updates the actor's policy weights in the direction that best increases the value score calculated by the critic.
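
In PyTorch terms, the core of an A2C update looks roughly like the following minimal sketch (not Kostrikov's full implementation):

```python
import torch

def a2c_loss(log_probs, values, returns, entropies,
             value_coef=0.5, entropy_coef=0.01):
    """Minimal A2C loss sketch: actor term weighted by the advantage, critic term as regression.

    log_probs: log pi(a_t | s_t) for the actions taken
    values:    critic estimates V(s_t)
    returns:   discounted (or bootstrapped) return targets R_t
    entropies: policy entropy at each step, used as an exploration bonus
    """
    advantages = returns - values
    actor_loss = -(log_probs * advantages.detach()).mean()  # policy gradient weighted by advantage
    critic_loss = advantages.pow(2).mean()                  # value-function regression
    entropy_bonus = entropies.mean()
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy_bonus
```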

PPO

PPO stands for Proximal Policy Optimization and is less computationally expensive than other policy-gradient methods such as TRPO (Trust Region Policy Optimization). Policy-gradient methods use gradient ascent to update the policy in order to maximize expected rewards. However, an update can be too large, and the drastic policy change can lead to a suboptimal policy that takes a long time to recover from. PPO and TRPO both incorporate a constraint that limits the size of each policy update to prevent this, with PPO being computationally inexpensive compared to TRPO.
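
The constraint PPO uses is a clipped probability ratio between the new and old policies; a minimal sketch of the clipped surrogate objective:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss: limit how far the new policy can move from the old one."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # maximize surrogate => minimize negative
```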

Findings

The Tensorforce PPO agent underperformed compared to the PyTorch A2C agent. We are not yet sure whether this underperformance is due to the Tensorforce implementation of PPO or the PPO algorithm itself. The A2C agent displayed more exploration during training, while the Tensorforce PPO agent would consistently get stuck in a suboptimal policy routine.

For reward shaping, we implemented three reward-shaping functions that helped the agents better perform certain tasks in the game. We implemented a reward function that incentivizes staying out of the bomb blast radius, which helped the agent learn to dodge bombs; a reward function that incentivizes exploring unexplored tiles in the game grid; and a reward that incentivizes picking up power-ups. We were not yet able to find a combination of reward functions that would lead to the agent using bombs correctly. Implementing these reward functions individually improved results for the actions they were intended for, but more tuning of the reward values is needed for the functions to work well together.

Challenges

There were multiple issues with finding appropriate frameworks and open-source code that produced consistent and reliable results. The Pommerman codebase was outdated and required updating to run with the current version of Tensorforce. The Tensorforce implementation of PPO had a variety of issues: Tensorforce lacked documentation and features such as multiprocessing for training. The Tensorforce PPO agent also learned much less efficiently than the PyTorch A2C agent, and ultimately the results from the PyTorch implementation of A2C far surpassed the results from the Tensorforce implementation of PPO.

Another challenge was limited training resources. We were limited by how much we could train with our GCP credits and by the Cinema computer resources, since we could not train on those computers for long periods of time. Reward shaping was our most challenging task. We had to use creativity and trial and error to find useful reward functions to coax our policy agent toward a more optimal policy. We drew on ideas from other papers, such as rewards for exploration and bomb survival, to build our own reward-shaping functions. This was very time consuming, since we had to train a model to test each reward function, not counting the training needed to search for optimal reward values. Even after testing a single new reward function and getting favorable results, combining multiple reward functions required additional testing to balance their reward values relative to each other.

A huge challenge while designing reward functions was making sure they were not so specific that the agent would essentially become rule-based. It was also difficult to determine the effectiveness of a reward function because training time was itself a hyperparameter; even the absence of shaping rewards could lead to results if trained long enough.

Future Scope

Continue reward-shaping experimentation to find the most effective and minimal combination of reward functions. Tune reward function values. Train a PPO agent in PyTorch and compare it to our A2C agent and the Tensorforce PPO agent.

Conclusion

As it currently stands, the PPO agents perform better than the A2C agents by 5 games (out of 100). However, when comparing the PPO agents between the two frameworks, PyTorch and TensorFlow, the PyTorch agent does a lot better (by 10 games), though this may be because of a flaw in the setup of the TensorFlow agent. Though this project is now concluded, I would like to take another look and see what the specific problem with the TensorFlow agent was. Additionally, training these models was a big hassle: we ran out of GCP credits and had to train on our own machines, which took a substantial amount of time and resources.

On the bright side, another team took over this project and did their own investigation. Take a look at this link: https://github.com/csci-599-applied-ml-for-games/PommermanS2020/blob/master/materials/CS599%20EDD.pdf. I'm currently looking for a better link with more information.

References

  1. Pommerman Competitions
  2. Cinjon Resnick et al., Playground: AI Research into Multi-Agent Learning
  3. Ross Wightman, PyTorch Reinforcement Learning for Pommerman
  4. Tensorforce: a TensorFlow library for applied reinforcement learning
  5. Jiang Guo, Human-level control through deep reinforcement learning
  6. TensorBoard
  7. The Pommerman team competition or: How we learned to stop worrying and love the battle.
  8. Ilya Kostrikov’s PyTorch Implementation for A2C, PPO, and ACKTR
  9. Multi-Agent Strategies for Pommerman
  10. On Hard Exploration for Reinforcement Learning: a Case Study in Pommerman
  11. Continual Match Based Training in Pommerman: Technical Report
  12. Proximal Policy Optimization Algorithms
  13. Asynchronous Methods for Deep Reinforcement Learning

Goals

Status

This project has mostly been completed, as it was a semester-long class project. I'm doing some additional work: implementing PPO in PyTorch and trying out more reward shaping. However, this is no longer an active project, as I'm focusing on other work. Check out the video!