When this project began, Smartology was still operating a purely direct ad sales model, which meant we had a good idea of what traffic we would be serving, and our servers only had to deal with semantic matching and rendering the creatives.
As the company shifted focus and we sought to join the world of programmatic Real Time Bidding (RTB) on AdX, with its 100ms auction cut-off, we in the engineering department had some big changes to make to get our platform ready for high-volume, fast-paced bidding. This post describes what had to change, and how we leveraged the AWS platform to get us there as quickly as possible.
What we had before
At the beginning of 2016 the Smartology ad-serving infrastructure was almost entirely manually provisioned and we were operating a global service out of a single AWS Region. With a setup like this, CloudFront was a lifesaver in terms of keeping latencies down, reducing the amount of traffic we had to process, and mitigating traffic spikes that would have otherwise (and had in the past) taken down our extremely basic setup. Here is roughly what it looked like…
Aside from the architectural issues, the manual provisioning was a legacy of how the setup had evolved, slowly, from the very early days of Smartology’s SmartMatch ad serving.
Let’s look at some of the problems with that setup…
Scaling our “cluster” involved a lot of manual steps, which was not only slow but also error prone:
- Launching a new server, ensuring security groups and other settings were all correct
- Adding a Puppet manifest to configure that server (all were uniquely named entities)
- Updating our deployment job in Jenkins to include the new server
- Updating security groups (RDS & other Smartology services) to allow access from the new ad server
- Updating the configuration on our NGINX load balancers to include the new server
Scaling out the load balancers required a similar effort. Resilience was also poor, because we could not scale quickly and failed instances were not recovered automatically.
What we had then was sufficient for the relatively small demands placed on the system and predictability of loads enabled by a direct sales model, but it wasn’t good enough for an RTB platform.
What we did
Here is what we did to get ourselves up and running with an RTB PoC and support the higher loads that we knew we would have to deal with in the world of RTB, whilst keeping latencies down.
Switch to ELB
Our old NGINX load balancers had served us well, but added maintenance overhead which our small engineering team really didn’t have time to support. We needed something which would scale without any effort, and allow us to also scale the servers behind it without reconfiguring things. ELB was the obvious choice.
Use Auto Scaling Groups for the ad servers
This goes hand in hand with the decision to use ELB. With a small amount of experimentation we were able to set up CPU-based CloudWatch alarms that scale the ad servers automatically in response to daily traffic patterns and unexpected traffic spikes.
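As a rough illustration of CPU-based scaling, the sketch below builds the kind of CloudFormation resources that tie a CloudWatch CPU alarm to an Auto Scaling policy. It is not our actual template: the logical IDs (`AdServerGroup`, `ScaleOutPolicy` and so on), the 70%/30% thresholds, the cooldown and the evaluation periods are all illustrative assumptions.

```python
import json


def cpu_scaling_resources(asg_logical_id, high_cpu=70, low_cpu=30):
    """Build CloudFormation resources that scale an Auto Scaling Group on average CPU.

    All thresholds and timings here are illustrative assumptions, not tuned values.
    """

    def policy(adjustment):
        # Simple scaling policy: add or remove a fixed number of instances.
        return {
            "Type": "AWS::AutoScaling::ScalingPolicy",
            "Properties": {
                "AutoScalingGroupName": {"Ref": asg_logical_id},
                "AdjustmentType": "ChangeInCapacity",
                "ScalingAdjustment": adjustment,
                "Cooldown": "300",
            },
        }

    def alarm(threshold, comparison, policy_id):
        # Alarm on the group's average CPU; firing it triggers the given policy.
        return {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": {"Ref": asg_logical_id}}
                ],
                "Statistic": "Average",
                "Period": 300,
                "EvaluationPeriods": 2,
                "Threshold": threshold,
                "ComparisonOperator": comparison,
                "AlarmActions": [{"Ref": policy_id}],
            },
        }

    return {
        "ScaleOutPolicy": policy(1),
        "ScaleInPolicy": policy(-1),
        "HighCpuAlarm": alarm(high_cpu, "GreaterThanThreshold", "ScaleOutPolicy"),
        "LowCpuAlarm": alarm(low_cpu, "LessThanThreshold", "ScaleInPolicy"),
    }


def render_fragment(asg_logical_id):
    """Return the resources as a JSON template fragment (the group itself is defined elsewhere)."""
    return json.dumps({"Resources": cpu_scaling_resources(asg_logical_id)}, indent=2)
```

Calling `render_fragment("AdServerGroup")` yields JSON that could be merged into a template that already defines the Auto Scaling Group. Pairing a scale-out alarm with a scale-in alarm is what lets the group follow the daily traffic curve in both directions.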
Codify the infrastructure with CloudFormation
CloudFormation was a key piece of the puzzle in getting our stack up and running, first in a test environment and then in production environments in multiple AWS Regions.
Up until this point, all of our internal and external services had been operated within the AWS Ireland Region, and provisioned by hand. To help us get this right, development of the test environment was done in the Frankfurt (eu-central-1) Region to make sure we covered all the requirements for operating out of a different AWS Region. Codifying all this infrastructure took a little longer than it might have taken to provision by hand, but the benefits of Infrastructure as Code are very clear when rolling out your infrastructure to multiple data centres. Provisioning all of this by hand – VPCs, EC2 Instances, Security Groups, ELBs, Auto Scaling Groups, Scaling Policies, Alarms – would be a mammoth task. CloudFormation allowed us to deploy all of this to multiple regions, quickly, consistently, and with a much reduced risk of human error.
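To make the multi-region point concrete, here is a minimal sketch of rolling one template out to several regions. It assumes a client object exposing a `create_stack` method shaped like the boto3 CloudFormation client's; `deploy_to_regions`, `RecordingClient` and the stack name are hypothetical names for illustration, and this is not the tooling we actually used.

```python
class RecordingClient:
    """Stand-in for a CloudFormation client so the sketch runs without AWS access."""

    def __init__(self, region):
        self.region = region
        self.calls = []

    def create_stack(self, **kwargs):
        # Record the request and return a fake stack ID in place of a real ARN.
        self.calls.append(kwargs)
        return {"StackId": f"{self.region}/{kwargs['StackName']}"}


def deploy_to_regions(template_body, stack_name, regions, client_factory):
    """Request creation of the same stack in every region; return stack IDs by region.

    client_factory(region) must return an object with a create_stack method.
    With boto3 installed, that could be:
        lambda region: boto3.client("cloudformation", region_name=region)
    Stack creation is asynchronous, so a real rollout would also wait for
    each stack to finish creating and check its final status.
    """
    stack_ids = {}
    for region in regions:
        client = client_factory(region)
        response = client.create_stack(
            StackName=stack_name,
            TemplateBody=template_body,
        )
        stack_ids[region] = response["StackId"]
    return stack_ids
```

With the stand-in client, `deploy_to_regions("{}", "ad-serving", ["eu-west-1", "eu-central-1"], RecordingClient)` returns one fake stack ID per region. The same template body goes to every region, which is where the consistency comes from.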
Replicate data to multiple regions
Proximity to the exchanges is a must for RTB, where responses are generally required within 100ms including network latency. As our ad servers depend on multiple (read-only) data sources for content profiles, BrandSafety checks, creatives and campaign data, we had to find a way to distribute this data to the regions close to the exchanges. The data is in various formats, which meant that getting it all where it needed to be required multiple replication solutions:
- RDS (MySQL) for relational campaign data. We were able to solve this with RDS/MySQL native replication, which allows cross-region replicas. Unfortunately, as a MySQL RDS replica could not be configured as Multi-AZ, we had to chain replicas within a region: each region has a replica of the master database, plus a replica of that replica in a different availability zone, which effectively gives us the resilience of Multi-AZ.
- DynamoDB for content and content profiles. At the time, AWS had published a CloudFormation template providing a simple cross-region replication tool based on ECS and DynamoDB Streams. This allowed us to get our tables copied and kept up to date in other regions without too much work.
- Elasticsearch indexes for matching. The best documentation we could find on this was Elastic’s own website, saying something along the lines of “you’ll need to build this yourself”. For bootstrapping the cluster, the S3 snapshot/restore plugin meant we could quickly get a new region populated with existing indexes. For ongoing updates, we took advantage of SNS to publish messages to multiple SQS queues, which are consumed by an Elasticsearch Cluster Updater component located in each region.
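The SNS-to-SQS fan-out used for the index updates can be sketched with in-memory stand-ins. Nothing below talks to AWS or Elasticsearch: `Topic`, `RegionalUpdater` and the `op`/`id`/`doc` message shape are assumptions made for illustration, not our actual component or message format.

```python
import copy
from collections import deque


class Topic:
    """In-memory stand-in for an SNS topic that fans out to subscribed queues."""

    def __init__(self):
        self.queues = []

    def subscribe(self, queue):
        self.queues.append(queue)

    def publish(self, message):
        # Every subscribed queue receives its own independent copy of the message.
        for queue in self.queues:
            queue.append(copy.deepcopy(message))


class RegionalUpdater:
    """Stand-in for a per-region updater that applies queued changes to a local index."""

    def __init__(self, region, queue):
        self.region = region
        self.queue = queue
        self.index = {}  # stands in for the region's search index

    def drain(self):
        """Apply all queued messages in order; return how many were applied."""
        applied = 0
        while self.queue:
            message = self.queue.popleft()
            if message["op"] == "delete":
                self.index.pop(message["id"], None)
            else:
                self.index[message["id"]] = message["doc"]
            applied += 1
        return applied
```

Because each region drains its own queue, a slow or temporarily unavailable region falls behind without holding up the others, and catches up once its updater resumes consuming.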
Caching. Lots of it.
We had gotten away with not doing a lot of in-memory caching on our ad servers, but to get our responses back to the Ad Exchange in under 100ms we had to get on top of this. We routed all of our time-critical data accesses through Ehcache, which took a huge load off the data stores and brought the majority of our responses down to sub-10ms processing times.
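The pattern behind this is a read-through cache with expiry. The sketch below shows the idea in Python; it is not Ehcache's API, and the class name, the injectable clock and the TTL value are assumptions chosen to keep the example small and testable.

```python
import time


class ReadThroughCache:
    """Minimal read-through cache with a per-entry time to live.

    Not thread-safe; a production cache would also bound its size and evict entries.
    """

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss to fetch from the data store
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._entries = {}         # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: load from the backing store and cache the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Since the data sources are read-only from the ad servers' point of view, a short TTL is a reasonable trade: entries may be slightly stale, but the bid path almost never waits on a remote data store.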
What it looks like now
We’ve gone global! The changes we made to our architecture enable us to handle much higher and more variable workloads efficiently, whilst still responding to almost every bid request within the tight 100ms deadline. Adopting CloudFormation for this architecture also means we can roll out to new regions much more easily, getting close to ad exchanges hosted in data centres around the globe.
And for an overview of our global setup:
What we got out of it
For a short review of what we achieved during this project, and some of the benefits it brought:
- Infrastructure as code
  - Automatic creation of critical infrastructure
  - Ability to deploy anywhere (anywhere with an AWS datacentre that is.)
- Autoscaling / ELBs to replace manually configured load balancers and provisioned Ad Servers
  - Allows for responsive scaling to meet demand
  - Less configuration for ELB than our previous NGINX setup
  - Cheaper to run an ELB than our previous setup of multiple load balancers
- Replication of data to multiple regions
  - Allows us to respond quickly to requests from anywhere on the globe
  - Reduces chances of data loss (although AWS is already pretty good at this)
…and most importantly – our foot in the door in the complex world of Real Time Bidding.
If you liked this post, keep an eye on our blog for new content about Smartology and Ad Tech.