NETFLIX on AWS

Ajeet Charan
3 min readMar 17, 2021

Netflix is the world’s leading internet television network, with more than 100 million members in more than 190 countries enjoying 125 million hours of TV shows and movies each day.

Netflix uses AWS for nearly all its computing and storage needs, including databases, analytics, recommendation engines, video transcoding, and more

hundreds of functions that in total use more than 100,000 server instances on AWS.

Application Monitoring on a Massive Scale

So Netflix is using more than 100,000 server instances on AWS.

This results in an extremely complex and dynamic networking environment where applications are constantly communicating inside AWS and across the Internet. Monitoring and optimizing its network is critical for Netflix to continue improving customer experience, increasing efficiency, and reducing costs. In particular, Netflix needed a solution for ingesting, augmenting, and analyzing the multiple terabytes of data its network generates daily in the form of virtual private cloud (VPC) flow logs.

This would enable Netflix to identify performance-improvement opportunities, such as identifying apps that are communicating across regions and collocating them. The company would also be able to increase uptime by quickly detecting and mitigating application downtime.

Each log record carries information about the communications between two IP addresses. However, in a dynamic environment like the one at Netflix, where an IP address can float between applications from day to day or even minute to minute, IP addresses alone don’t have much meaning.

“The data sources we had before we took on this initiative were one sided,” says John Bennett, senior software engineer at Netflix. “We’d know an application was connecting to others, but we didn’t know both sides of the conversation and how to optimize those communications or the placement of the applications on the network.”

Centralizing Flow Logs Using Amazon Kinesis Data Stream

The solution Netflix ultimately deployed — known internally as Dredge — centralizes flow logs using Amazon Kinesis Data Streams.

The application reads the data from Amazon Kinesis Data Streams in real time and enriches IP addresses with application metadata to provide a full picture of the networking environment.

“Usually, we would put the data into a database, which would build an index to enable faster querying,” says Bennett. “Dredge joins the flow logs with application metadata as it streams and indexes it without using a database, which eliminates a lot of the complexity.”

The enriched data lands in an open-source analytics application called Druid. Netflix uses the OLAP querying functionality of Druid to quickly slice data into regions, availability zones, and time windows to visualize it and gain insight into how the network is behaving and performing.

The scalability of Amazon Kinesis Data Streams was a good fit for the Dredge application because of the cyclical and elastic nature of network usage at Netflix.

Improving Customer Experience with Real-Time Network Monitoring

Netflix’s Amazon Kinesis Data Streams-based solution has proven to be highly scalable, each day processing billions of traffic flows. Typically, about 1,000 Amazon Kinesis shards work in parallel to process the data stream.

Netflix is now able to identify new ways to optimize its applications, whether that means moving an application from one region to another or changing to a more appropriate network protocol for a specific type of traffic.

Although a streaming data solution is not new to the IT industry, it is an innovation in the networking space.

--

--