Smartology is in the business of delivering brand-safe, relevant adverts on some of the world’s largest publisher sites. To do this, it creates a semantic profile of both publisher and advertiser content; then, when a ‘match’ is found, a relevant ad is inserted on the page.
In this post, Jason Marden, a software engineer in the team, talks through how the team changed their approach to onboarding new publishers and ingesting their content. The new approach is both more efficient and scalable, while utilising the latest in AWS services.
The responsibility of the legacy application was to poll on a preset schedule and consume content from RSS/Atom feeds or content APIs. This approach meant consuming content that was never used for matching, and potentially caused significant delays in picking up topical new content. To address these issues we introduced a ‘miss-driven’ process, meaning we only scraped content that was required for matching but did not already exist in our system. This significantly shortened the delay and improved our cache hit rate. However, this process was still scheduled, and so still led to unnecessary delays in consuming new content.
The original architecture was deployed on an EC2 instance. Cache misses for match requests against content items we had not yet processed were added to an SQS queue. The publisher-specific scraping jobs, for both feeds and queued cache-miss records, were all scheduled via Quartz, creating unacceptable latencies for processing new content in our NLP system. The scraping jobs also persisted directly to a shared database, increasing the coupling between application services.
Onboarding new publishers was also highly inefficient, taking two days on average. Content item consumption was built on a monolithic architecture, which required generating lots of repetitive boilerplate code. We had to deploy our whole application each time we added or updated a publisher, which could be time-consuming. With this monolith in place, it was clear that scalability would eventually become a problem as demand grew. We wanted the ability to switch on publishers within hours, not days.
To resolve the issue with the speed of onboarding a publisher, we adopted a new approach which we called ‘configuration over code’. This meant that we no longer relied on publisher-specific code and instead opted for a generic scraper driven by publisher-specific configuration in the database. We no longer had to deploy the whole application, just a publisher configuration mapping. Both persisting and updating publisher configuration were simplified, and in the process boilerplate and duplicated code were removed. A publisher can now be set up within hours.
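To illustrate the ‘configuration over code’ idea, here is a minimal sketch: per-publisher extraction rules live in data (here a regex per field; in production they would be rows in the database), and one generic scraper interprets them. The class and field names are illustrative assumptions, not Smartology’s actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenericScraper {

    /** Per-publisher extraction rules: field name -> pattern with one capture group. */
    public static class PublisherConfig {
        public final String publisherId;
        public final Map<String, Pattern> fieldPatterns;

        public PublisherConfig(String publisherId, Map<String, Pattern> fieldPatterns) {
            this.publisherId = publisherId;
            this.fieldPatterns = fieldPatterns;
        }
    }

    /** One generic scraper for every publisher; behaviour is driven by config, not code. */
    public static Map<String, String> scrape(PublisherConfig config, String html) {
        Map<String, String> fields = new HashMap<>();
        for (Map.Entry<String, Pattern> entry : config.fieldPatterns.entrySet()) {
            Matcher m = entry.getValue().matcher(html);
            if (m.find()) {
                // Store the first captured group for this field.
                fields.put(entry.getKey(), m.group(1).trim());
            }
        }
        return fields;
    }
}
```

Adding a new publisher then means inserting a new configuration row rather than writing and deploying new code.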
The AWS Lambda service provided the key component for improving the speed of consumption. Instead of pulling scraping messages from an SQS queue on a schedule, we now push events to an SNS topic, which triggers our new Content Item Scraper Lambda to scrape the content on demand in an event-driven manner. In the process we reduced cost and simplified our architecture. The Lambda responds to SNS messages, scrapes the content based on the database configuration and persists it via a content API. We were able to replace the monolithic application with an event-driven microservice, which reduced the latency of consuming content. Both AWS Lambda and SNS are fully managed serverless solutions. For more information on serverless: https://martinfowler.com/articles/serverless.html
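The flow above can be sketched as a handler that receives SNS records, scrapes each referenced content item and persists it. The real code would use the `SNSEvent` classes from the aws-lambda-java-events library; the `SnsRecord` and `ContentApi` stand-ins below are hypothetical simplifications so the sketch is self-contained.

```java
import java.util.List;

public class ContentItemScraperHandler {

    /** Simplified stand-in for an SNS record; the message carries a content item URL. */
    public static class SnsRecord {
        public final String message;
        public SnsRecord(String message) { this.message = message; }
    }

    /** Stand-in for the content API the scraped items are persisted through. */
    public interface ContentApi {
        void persist(String url, String scrapedContent);
    }

    private final ContentApi contentApi;

    public ContentItemScraperHandler(ContentApi contentApi) {
        this.contentApi = contentApi;
    }

    /** Invoked once per SNS event; AWS runs parallel instances for concurrent events. */
    public int handleRequest(List<SnsRecord> records) {
        for (SnsRecord record : records) {
            String url = record.message;
            // In the real service this would look up the publisher's scraping
            // configuration in the database and fetch the page.
            String content = scrape(url);
            contentApi.persist(url, content);
        }
        return records.size();
    }

    private String scrape(String url) {
        return "scraped:" + url; // placeholder for the configuration-driven scrape
    }
}
```

Because the handler is triggered per message rather than on a schedule, new content is processed as soon as a cache miss is published.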
We chose a microservices architecture because it allowed us to replace our legacy monolith with a simplified, clearly defined application that is loosely coupled and can be scaled independently. For more information on microservices: https://martinfowler.com/articles/microservices.html.
Why AWS Lambda?
With the speed of consumption a key deliverable for the project, AWS Lambda provided the capabilities the team required. Here is why:
A Lambda can be triggered in a number of ways, from message-driven SNS notifications to scheduled events. AWS Lambda has built-in scalability, creating parallel instances for multiple concurrent invocations.
One of the key reasons for choosing AWS Lambda was that it gave us the means to quickly create a serverless microservices architecture; in combination with SNS, it provided an event-driven model that was both efficient and fast.
Code and Deployment
AWS Lambda supports a number of languages, and the built-in support for Java 8 was an important reason for using the service. Using the AWS SDK for Java, Lambda functions can be created quickly, with SNS message processing supported out of the box. The Lambdas themselves are small units of code contained in distinct modules. Deploying a Lambda consists of compressing the code into a ZIP file and uploading it via the CLI.
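As a sketch of that deployment step, creating or updating a function from a ZIP looks roughly like the following (the function name, JAR path, role ARN and handler class are placeholders, not our real values):

```shell
# Package the compiled code into the deployment ZIP.
zip -j content-item-scraper.zip build/libs/content-item-scraper.jar

# First-time creation (the handler is the fully qualified class::method):
aws lambda create-function \
  --function-name content-item-scraper \
  --runtime java8 \
  --role arn:aws:iam::123456789012:role/lambda-execution-role \
  --handler com.example.ContentItemScraperHandler::handleRequest \
  --zip-file fileb://content-item-scraper.zip

# Subsequent deployments only need the new code:
aws lambda update-function-code \
  --function-name content-item-scraper \
  --zip-file fileb://content-item-scraper.zip
```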
A large benefit of using Lambda is that it is highly cost-effective. The first 1 million requests per month are free, and thereafter the cost is $0.20 per 1 million requests. By comparison, an EC2 instance running at $0.025 an hour costs around $16.84 a month, regardless of how many or few requests you process. In our case, the EC2 instance was mostly underutilised and was a single point of failure.
The AWS Lambda service supplies a number of key features to ensure a good level of security. It is particularly reassuring to know that the machines used are fully maintained by Amazon and thus receive important security patches as soon as they become available. The Lambdas themselves can be deployed inside a VPC for added security, and if you do not have one readily available you can use the default VPC.
Worth considering when using Lambdas
Debugging Lambdas is currently difficult, both because of the way CloudWatch logs are named and because of the limited functionality in the AWS console, although we found it somewhat easier using the CLI. Recently we began pushing our logs to Kibana and have found that more efficient for locating issues. If a Lambda container is not kept warm, you may experience latency issues, in particular slow startup times; this can be mitigated by making frequent requests via a scheduled CloudWatch event. In our case we have started using the Serverless Framework with a plugin called serverless-plugin-warmup. Given that cold starts can be slow, it is important to take the Lambda’s execution time into consideration. We found that adding dependency injection via Spring or Guice significantly increased the startup time, making Lambda execution slower than we anticipated.
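The keep-warm pattern described above amounts to the handler recognising a scheduled ping and returning before doing any real work. This is a hypothetical sketch: the `WARMUP_SOURCE` marker below is an assumption for illustration, and serverless-plugin-warmup sends its own payload format in practice.

```java
public class WarmableHandler {

    /** Assumed marker for warmup pings; the real plugin uses its own payload. */
    public static final String WARMUP_SOURCE = "serverless-plugin-warmup";

    /** Returns true when real work was done, false for a warmup ping. */
    public boolean handleRequest(String eventSource, String payload) {
        if (WARMUP_SOURCE.equals(eventSource)) {
            // The invocation alone keeps the container (and any lazily
            // initialised state) warm, so exit as fast as possible.
            return false;
        }
        process(payload);
        return true;
    }

    private void process(String payload) {
        // Real scraping work would happen here.
    }
}
```

Returning early keeps warmup invocations cheap, since Lambda bills by execution time.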
This development has enabled us to drastically reduce the amount of time it takes to onboard and set up a publisher on our system, while reducing latency when processing new content.