Road to GCP Part 2: The Decision
In the first part, I covered the short history of Nordeus' infrastructure and how OpenNebula was a win in every aspect for our developers. In this part, I'll go over the events and thinking that led us to migrate everything to the Google Cloud Platform (GCP).
The decision to migrate to GCP was made in 2018, the culmination of three different events that unfolded during that year.
1. Hadoop Scalability Problems
We were having scalability problems with our Hadoop cluster, which was the bread and butter of our analytics platform. We had been maintaining Hadoop ourselves for a few years by then, and as the years went by, it gave us more and more problems. Around 2018, this reached a tipping point, because we realized we had no cost-effective way to scale the cluster. It was hosted on tens of large (or "fat", as we called them) dedicated servers, each with a lot of CPUs, a lot of RAM, and a few tens of terabytes of disk, and they cost a lot. We constantly needed more disk space but not more CPU power, yet we couldn't scale the two separately, because the only option was to add more fat servers. We couldn't replace them with smaller servers, because we couldn't fit that many drives in them, and we couldn't split the data up, because Hadoop was not meant to work like that, or the effort would have been huge and would have left us with an even more complex system to maintain. It was more complicated than this, but the bottom line was that any drastic change would require a lot of time and effort, and we had no cost-effective way to scale our disk usage on Hadoop.
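The coupled-scaling problem can be shown with a quick back-of-the-envelope calculation. All numbers below are illustrative assumptions, not our actual hardware specs:

```python
import math

# Illustrative numbers only (assumed, not the actual hardware):
# a "fat" Hadoop node bundles CPU and disk, so they can only scale together.
FAT_NODE_DISK_TB = 40   # usable disk per fat server
FAT_NODE_CORES = 48     # CPU cores per fat server

def nodes_for_extra_disk(extra_tb: float) -> int:
    """Fat servers needed just to add `extra_tb` of disk capacity."""
    return math.ceil(extra_tb / FAT_NODE_DISK_TB)

extra_disk_tb = 200                       # say we need 200 TB more disk...
nodes = nodes_for_extra_disk(extra_disk_tb)
unneeded_cores = nodes * FAT_NODE_CORES   # ...but every node drags its cores along
print(nodes, unneeded_cores)              # 5 nodes, 240 cores we never asked for
```

With these assumed specs, 200 TB of extra disk forces you to buy 5 whole servers and pay for 240 CPU cores you don't need, which is exactly the kind of waste that separating storage from compute avoids.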
This wasn't the only issue with Hadoop. Another was that maintaining Hadoop clusters yourself is hard. At every conference we went to, when we told people we were maintaining our own clusters, they couldn't believe it. Nobody did that! At least nobody as small as we were. We constantly ran into problems that were hard to fix and took too much of our time; for example, we had a strange issue where a single disk would cause performance problems that we didn't know how to detect or fix other than by replacing it (our clusters had 300+ disks, so this happened weekly). We wanted to look into a managed solution, which would mean migrating our Hadoop cluster to a public cloud provider. We decided not to complicate things and to choose one of the three biggest: Amazon Web Services (AWS), Microsoft Azure, or Google Cloud (GCP). We soon disregarded Azure because it was oriented towards Microsoft users (which we weren't) and its big data services lagged behind the other two, so we were left with AWS and GCP. We did some testing with their managed Hadoop offerings, and interestingly, following the official documentation on AWS, things either didn't work or were extremely slow; on GCP, everything worked and was fast and simple.
2. Reducing Infrastructure Costs
From my first day at Nordeus, we were all constantly told NOT to think about cost, just about growth and how to get our games into the hands of as many players as possible. If money could fix an engineering problem, we fixed it with money, so that we could concentrate on creating value for our players. This worked great for many years, and I still praise this motto. Let me give you an example of how we used it. As Top Eleven grew year after year, we started having performance problems with our databases. We could have solved some of them by improving our queries and how we stored data, but we didn't; instead, we bought faster drives for the servers those databases ran on. That meant we didn't spend who knows how much time tuning our databases, work that players would never notice if we did it correctly.
In Q4 2018, we as a company decided that we had disregarded the cost of our infrastructure for far too long and that it was time to reduce our infrastructure spending. We knew exactly where we could cut costs the most and what needed to be done. We didn't have a cost-saving goal set in stone, but we wanted to cut costs significantly.
3. We Are a Gaming Company
By this time, we had been using OpenNebula for more than a year, and while we liked it very much, we felt that we had reached a limit with it. New features were being added slowly to the OpenNebula project. We could have developed new features for OpenNebula ourselves to accelerate us, but it wasn't worth the effort; we were too small. On the other hand, public cloud providers were still too expensive when we compared their list prices with our spending.
Also, during this time, we started changing our view; we started thinking more like a gaming company and less like an engineering company. We thought about what our core business is:
Our primary goal as a company is to create games and run those games for many years to come.
Our goal is not to provide an infrastructure platform. Why do we need to think about replacing a faulty disk in a server? Why should we worry about Apple releasing a new MacBook Pro and buying up the world's NVMe reserves, so that we can't order the NVMe drives we used in our servers? Why should we think about floods hitting Thailand's hard drive factories? We had all of these problems (and many more). Instead, why can't we spend that time helping our game devs get their code into production as fast and seamlessly as possible, or even better, helping them design a feature in the game?
The Visit by Google Cloud
To get back to migrating our Hadoop cluster to GCP: in December 2018, the GCP guys came to Belgrade to visit us. We talked about how we would like to migrate Hadoop to GCP and how that could work. They listened very carefully to everything we had to say, and at one moment their boss asked us: "Why don't you migrate everything to GCP?". I remember replying instantly with something like "Because it would cost too much; we did the math a few months ago". He then told us that the prices you see on the site are just list prices; if we were to migrate everything to GCP, we would get special discounts, which can be very big. Now he had our attention!
After this meeting, they gave us an offer so that we could understand what to expect, and it turned out they were right. Based on calculations we did later, we could potentially reduce our costs by 50%. Of course, this reduction wouldn't come just from their special discounts but also from understanding how GCP pricing works (pricing is highly complicated for all cloud providers) and optimizing accordingly.
This was an eye-opener for us: we could cut costs and get all the features and flexibility of a public cloud provider. We could basically kill two (or three) birds with one stone.
A few people have asked me about the cost savings of the cloud, and there is always the question: "Is the cloud cheaper than bare metal hosting?". In a single word, NO; a better answer is "it depends". Let me try to explain why. On Redstation, we had been using some servers for a few years, and after each year we asked for a discount on them. To reduce costs on Redstation, we needed to either:
- Reduce the number of servers (we did this by merging multiple servers into a smaller number), or
- Replace the current servers with smaller ones (because most of our servers were overprovisioned)
Replacing current servers with smaller ones isn't that cost-effective: if we ordered a smaller server now, we would pay almost as much as we were paying for the current, old, bigger server, because the old server already had very good discounts.
The cost reduction on Redstation would be tiny, it would take an enormous amount of work, and in the end we would be left with the same thing we had before.
Also, those new servers would still need to be overprovisioned, because scaling bare metal servers up or down is very hard. On public cloud providers, by contrast, you usually provision just the resources you need, because when you need more, you can add them within minutes, or they can even be added automatically.
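The "it depends" answer above comes down to utilization. Here is a minimal sketch of the arithmetic, with every price and utilization figure made up purely for illustration:

```python
# Hedged, illustrative numbers only: why "is the cloud cheaper?" is "it depends".
bare_metal_monthly = 1000.0   # fixed monthly price of an overprovisioned server (assumed)
utilization = 0.4             # fraction of that server's capacity we actually use (assumed)

cloud_full_capacity = 1500.0  # cloud list price for the same full capacity (assumed higher)
# In the cloud we can right-size: pay roughly for the fraction we use,
# and scale up within minutes when we need more.
cloud_monthly = cloud_full_capacity * utilization

print(f"bare metal: {bare_metal_monthly:.0f}/mo, cloud: {cloud_monthly:.0f}/mo")
# Unit for unit the cloud lists higher, yet right-sizing can make the total bill smaller.
```

With these assumed numbers, the cloud costs 50% more per unit of capacity but 40% less overall, because the bare metal server sits mostly idle; flip the utilization figure high enough and bare metal wins again.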
AWS or GCP?
After this, we decided to evaluate AWS and GCP a little more deeply, because before, we had only been evaluating them for our Hadoop infrastructure. We didn't need a deep evaluation, though; we were already familiar with public cloud providers, their features, and how they worked. The only thing we weren't 100% sure of was whether a public cloud provider could run our Gameworlds (the main, big Top Eleven servers). After a few performance tests, we confirmed it could, without any issues (given enough resources, of course).
In the end, we liked GCP more. The biggest reasons were:
- Better UX: everything was simple, easy to understand, and dev-oriented. You could see that Google, having started later, had fixed many of the problems AWS has as a much older product.
- AWS had more features, but GCP already had several times more than we needed, and by the time we needed more, Google would probably have caught up.
- From the first contact with Google employees, the conversation was easy. We had a dedicated person who also spoke Serbian, and we could talk with them any time, about anything we needed. Google never felt like a big company; they felt like they wanted to understand us and like they genuinely cared.
- And finally, GCP also gave us better discounts.
Conclusion
To sum up, even though it looks like the main driver behind migrating to GCP was cutting costs, it wasn't. The special discounts we got and our understanding of how GCP billing works just enabled us to consider using GCP.
The main driver was to get our devs a new platform that would allow us to be as flexible as possible and concentrate on game-making. Cutting costs just came with it.
I remember when our CTO asked me before the migration what cost-saving I would be satisfied with; I told him that even if the costs stayed the same, I would be happy because we would get something an order of magnitude better for the same money. Of course, I knew that we would also cut costs significantly.
Stay tuned for part three, in which I will cover the highlights of the migration to GCP and the results.