"We stopped thinking as game developers" - What to do when millions join your virtual world unexpectedly
Lockwood's CTO Ed Morley on migrating from Python to Go after the success of Avakin Life
In November 2013 Lockwood released Avakin Life - an online 3D virtual world for mobile and PC.
To date it has achieved over 50 million (mostly organic) installs, close to a million daily active users, and six million monthly active users. Our user base has doubled in each of the last four years. In this article, we cover some of the challenges we have had to overcome on our journey from a single backend developer working in Python on rented physical hardware to cloud-based, containerised backend services written in Go, capable of serving millions of users.
At the beginning of 2014, just a few months post launch, we realised we had a problem - a big one. We'd just launched a series of four mobile apps each with a heavy backend component, and one of them (Avakin) took off like a rocket. By conventional measures, this was not supposed to happen: continuous growth without the support of paid UA or featuring from the app stores was practically unheard of at the time.
If Avakin had been a match-three game then at this point the champagne would have been flowing, but at the time the studio was filled with an uneasy set of mixed emotions. There was joy that users loved the product, and fear that we would not be able to capitalise on our opportunity, because there was no way the backend would be able to support the sheer number of users and the dataflow inherent in a 24/7 persistent world. So we immediately started a journey into scaling our backend, one we're continuing on even now.
Our initial infrastructure was written in Python and was so antiquated it ran on a physical server rented from an ISP local to our studio in Nottingham. But the increase in users meant we needed to do two things - and quickly.
In the short term we needed to manage the ever-increasing number of users. But we also needed to devise a solid long-term plan to scale our backend infrastructure for the future. Even now we still believe Avakin is only at the start of an upward growth curve. Initially, we looked at various BaaS (backend-as-a-service) providers specialising in games, which were beginning to spring up. In 2014 that sector was, we felt, still too immature to help with the specific demands of Avakin. Had we chosen an off-the-shelf solution, we would also have had a mountain to climb in porting the game logic and migrating the user base to an alternative backend.
In the end we decided to evolve our existing system. Achieving the next tier of scale seemed relatively straightforward in principle - we just needed to put a load balancer in front of what would become multiple servers running the game API, and then scale servers horizontally until the database became overloaded, at which point we would think about sharding the database. So how did that go in practice? Not so well. The issues were directly related to the social nature of Avakin. With users constantly interacting with each other, it was incredibly hard to keep data consistent across the infrastructure. Fighting to keep the database and the app server caches in sync became a daily battle. We knew we needed a fundamentally different approach.
A key turning point was when we recognised that the challenges around scaling a truly social product are very similar to the challenges that Facebook had to overcome in its infancy. We had to stop thinking as game developers. So began an enormous amount of research on a vast range of topics that (at least back in 2014) fell outside of the world of game development.
To reason about our concurrency issues we explored the world of conflict-free replicated data types (CRDTs). We investigated databases that can scale horizontally. We looked to new and alternative languages that work around some of the inherent (mostly performance) problems with Python. We also dug into a vast array of other subjects that only become an issue at scale. During this period of research, I came across a talk by Google's Rob Pike called 'Concurrency Is Not Parallelism'. For Lockwood, it cemented a growing belief that we should choose Go as the language for our re-implementation of the API component of our backend.
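For readers unfamiliar with CRDTs, here is a minimal sketch of one of the simplest, a grow-only counter (G-Counter). It is an illustration only, not Lockwood's code: the type, the method names, and the replica IDs "server-a" and "server-b" are all hypothetical. Each replica increments only its own slot, and replicas merge by taking the per-slot maximum, so they converge on the same value no matter the order in which updates are exchanged.

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT (hypothetical illustration).
// Each replica only ever increments its own slot in the map.
type GCounter struct {
	counts map[string]uint64
}

// NewGCounter returns an empty counter.
func NewGCounter() *GCounter {
	return &GCounter{counts: make(map[string]uint64)}
}

// Increment records n new events observed by the given replica.
func (g *GCounter) Increment(replica string, n uint64) {
	g.counts[replica] += n
}

// Value is the sum of every replica's slot.
func (g *GCounter) Value() uint64 {
	var total uint64
	for _, c := range g.counts {
		total += c
	}
	return total
}

// Merge folds another replica's state into this one by taking the
// per-slot maximum. The operation is commutative, associative and
// idempotent, so merging in any order, any number of times, is safe.
func (g *GCounter) Merge(other *GCounter) {
	for id, c := range other.counts {
		if c > g.counts[id] {
			g.counts[id] = c
		}
	}
}

func main() {
	a, b := NewGCounter(), NewGCounter()
	a.Increment("server-a", 3) // counted on one app server
	b.Increment("server-b", 2) // counted concurrently on another
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a.Value(), b.Value()) // prints "5 5"
}
```

The appeal for a social product is that no server needs to lock or coordinate before accepting a write. The trade-off is that only certain data shapes (counters, sets, registers and the like) fit this model.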
Go brings two enormous benefits to the table.
The first is incredible performance and low memory usage, and the second is that Go channels provide a way to reason about concurrency in a relatively safe and straightforward way that is baked into the language (Go channels are by no means a magic bullet for all concurrency woes, but they help a lot). While the world was looking at the incredible performance benefits of Node.js over Python, we were able to steal a march on the vast majority of the industry by moving over to Go.
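As a rough illustration of the channel-based style described above, the sketch below has a single goroutine own a piece of state (a coin balance) while many concurrent requests reach it over a channel. Everything here is hypothetical and not taken from the Avakin backend: the purchase type, runWallet, simulate, and the numbers are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// purchase is a hypothetical request to spend coins.
type purchase struct {
	cost  int
	reply chan bool // receives true if the purchase succeeded
}

// runWallet owns the balance. Because only this goroutine reads or
// writes it, no mutex is needed and requests are handled one at a time.
func runWallet(balance int, requests <-chan purchase, final chan<- int) {
	for p := range requests {
		ok := p.cost <= balance
		if ok {
			balance -= p.cost
		}
		p.reply <- ok
	}
	final <- balance // report the closing balance once requests is closed
}

// simulate fires n concurrent purchases of the same cost and returns
// how many succeeded along with the remaining balance.
func simulate(start, cost, n int) (int, int) {
	requests := make(chan purchase)
	final := make(chan int, 1)
	go runWallet(start, requests, final)

	results := make(chan bool, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			reply := make(chan bool, 1)
			requests <- purchase{cost: cost, reply: reply}
			results <- <-reply
		}()
	}
	wg.Wait()
	close(requests)
	close(results)

	succeeded := 0
	for ok := range results {
		if ok {
			succeeded++
		}
	}
	return succeeded, <-final
}

func main() {
	ok, left := simulate(100, 30, 10)
	fmt.Println(ok, left) // prints "3 10"
}
```

Ten goroutines race to spend from a balance of 100, yet exactly three purchases of 30 succeed and the balance never goes negative, because the channel serialises access to the state. This only covers state that lives in one process; keeping several servers consistent is a separate problem.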
The move to Go has brought with it an additional benefit - world-class coders interested in unique problems and first-rate solutions. Today, four years after we started trying to solve our backend problem, we have migrated to an extremely advanced architecture written in Go that is capable of serving millions of users and processing millions of virtual item sales each day. We also have a well-rounded development environment making the best use of containerisation and CI, which allows us to push to our production environment multiple times per day.
What have we learnt? Looking back, it's sometimes hard to believe how our backend architecture has grown from what was started by a single developer to where it is now. The amount of change that has occurred with not just the backend development team, but the company as a whole, is a testament to the environment that has been built up at Lockwood.
Our quarterly bonus scheme (almost £500k paid out so far in 2018, with our busiest quarter to come) means the team benefits from their hard work in a very direct way, multiple times a year.
We are also extremely open with our data and ensure that every single member of the team has access to all the metrics (including revenue). We employ experts and then let them make the decisions, keeping any micro-managing to an absolute minimum. Our working week and flexitime underpin our commitment to family life and health. The last thing we want is somebody burnt out, or their home life suffering due to crunch.
In short, we took a great language and combined it with some absolutely amazing people and let them get on with it. Neither would have been enough in isolation, but combining people and tech means that, while there's always still work to do, we're well on the way to where we want to be.