The Netflix team is constantly working to improve both its content and its performance. It also open-sources many of its improvements, including its latest software and tools, so that outside users and developers can modify them, debug performance problems and make them better.
In its latest development, the Netflix team has introduced FlameScope, built specifically to address performance issues that occurred in one of its microservices. FlameScope is a visualisation tool which analyses performance at the CPU level. It also helps analyse time-dependent behaviour such as variance, perturbations and app startup. It does this by combining “flame graphs” with another kind of visualisation, heat maps. The CPU load profile is presented in the form of a heat map so that performance issues can be assessed quickly.
How Does FlameScope Work: Understanding Subsecond Offset Heat Maps
Flame graphs and heat maps, mentioned in the previous paragraph, form the core of performance visualisation. This section explores a distinct kind of heat map from Netflix, the Subsecond Offset Heat Map (SOHM), which builds on both. Heat maps are used to visualise large amounts of three-dimensional server data. In a disk I/O latency heat map, for example, time is plotted on the x-axis, disk input/output (I/O) latency (time on a fractional scale) on the y-axis, and the frequency of disk I/O as color intensity. In other words, color acts as the third dimension. Flame graphs, on the other hand, show software code paths in a shape that visually resembles a flame, and usually cover a whole profile (here, a CPU profile) of about a minute. With an SOHM, the user selects a section of that minute (seconds or even milliseconds) on the heat map, and a flame graph for just that section is generated instantly. That is to say, flame graphs are tied to the heat map so that smaller variations falling within a minute’s time-frame can be assessed. The data is therefore displayed as a heat map showing various patterns, which the user can work with to resolve performance issues.
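To make the idea concrete, here is a minimal Python sketch of how profile samples could be bucketed into a subsecond offset heat map. It assumes that each column of the heat map is one second of the profile, each row is an offset within that second, and the cell value (the color) is the number of samples that landed there. The function name, the `rows` parameter and the input format (a list of timestamps in seconds) are illustrative and are not taken from FlameScope's source.

```python
from collections import Counter


def subsecond_offset_heatmap(timestamps, rows=50):
    """Bucket sample timestamps (in seconds) into a grid of counts.

    Returns grid[row][col], where col is the whole second elapsed since the
    first sample and row is the offset within that second (row 0 is the
    start of the second). This is an illustrative sketch, not FlameScope code.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter()
    for ts in timestamps:
        elapsed = ts - start
        col = int(elapsed)  # which second the sample falls in
        # Fraction of the second, scaled to a row index and clamped.
        row = min(int((elapsed - col) * rows), rows - 1)
        counts[(row, col)] += 1
    n_cols = max(col for _, col in counts) + 1
    # Cells with no samples read as 0 from the Counter.
    return [[counts[(r, c)] for c in range(n_cols)] for r in range(rows)]


# Four samples, with each second split into four rows of 250 ms.
grid = subsecond_offset_heatmap([0.0, 0.25, 0.26, 1.5], rows=4)
```

A renderer would then map each count to a color intensity, so that a short burst of CPU activity shows up as a small dark patch inside one or two columns.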
The following snapshot shows the selection of a time range in FlameScope, which highlights a section of the CPU profile.
In their official blog post, FlameScope developers Brendan Gregg and Martin Spier elaborate on the finer details of the above picture.
“There’s a number of interesting things from this production CPU profile. The CPUs are busier between 0 and 5 seconds, shown as darker colors. Around the 34 and 94 second mark (sounds like a 60 second periodic task), the CPUs also become busier, but for a shorter duration. And there are occasional bursts of heavy CPU activity for about 80 milliseconds, shown as short dark red stripes.” says the blog.
This is where FlameScope comes in: when the user highlights a time range on the heat map, it instantly generates a flame graph showing the code that was running during that range.
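The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FlameScope's implementation: samples are assumed to be `(timestamp, stack)` pairs with the stack ordered from root to leaf, and `fold_stacks_in_range` is a hypothetical helper name. The semicolon-joined output mirrors the "folded stacks" text format that flame graph generators commonly take as input.

```python
from collections import Counter


def fold_stacks_in_range(samples, start, end):
    """Count identical stacks among samples with start <= timestamp < end.

    samples: iterable of (timestamp_seconds, stack) pairs, where stack is a
    sequence of frame names ordered from root to leaf (an assumed format).
    Returns a Counter keyed by "root;...;leaf" strings.
    """
    folded = Counter()
    for ts, stack in samples:
        if start <= ts < end:
            folded[";".join(stack)] += 1
    return folded


# Hypothetical samples: two busy stacks and one idle stack in the first second.
samples = [
    (0.10, ["main", "work"]),
    (0.12, ["main", "work"]),
    (0.50, ["main", "idle"]),
    (1.20, ["main", "gc"]),
]
first_second = fold_stacks_in_range(samples, 0.0, 1.0)
```

The wider a frame is drawn in the resulting flame graph, the larger its count in this folded view, which is why narrowing the range to a short burst isolates the code responsible for it.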
How And Why Netflix Adopted FlameScope
FlameScope was initially developed to address a microservice delay that was occurring almost every 15 minutes. The cloud engineering team at Netflix (Brendan Gregg and Martin Spier) tackled the CPU utilisation issue by developing one-minute flame graphs, where the profile was divided into ten-second time frames and then further into one-second time frames. Individual flame graphs were generated for these time frames to see where CPU utilisation could be optimised.
This task was cumbersome, so they came up with a novel technique, SOHM, which gives a visual outlook of the problem in fractions of a second. By combining heat maps with flame graphs, FlameScope became a tool to visualise variance, among other performance issues. Although it is still at an initial phase and available only for analysing Linux (perf) profiles, the team hopes to attract more developers to build the tool for other operating system platforms as well.
Netflix has continuously focused on developing more tools to tackle performance-related issues with its content, and FlameScope is an attempt in that direction. Since it is at an early stage of adoption, the team is still gauging the extent of the benefits it could reap. On top of that, Netflix is making these tools open-source so that users can tinker with them and create something even better. Like other tech giants, Netflix is following suit by opening up its own tools for wider testing.
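The manual slicing described above can be pictured with a short Python sketch. It only counts samples per fixed-width window to show where a profile is busiest; the function name, the default 60-second duration and the sample timestamps are all hypothetical and chosen for illustration.

```python
import math


def samples_per_window(timestamps, window_seconds, duration=60.0):
    """Count samples in consecutive fixed-width windows of a profile.

    timestamps are seconds relative to the start of the profile; samples
    outside [0, duration) are ignored.
    """
    n_windows = math.ceil(duration / window_seconds)
    counts = [0] * n_windows
    for ts in timestamps:
        index = int(ts // window_seconds)
        if 0 <= index < n_windows:
            counts[index] += 1
    return counts


# Hypothetical profile with a small burst shortly after the 34-second mark.
profile = [1.0, 2.0, 11.0, 34.1, 34.2, 34.3, 59.9]

ten_second_counts = samples_per_window(profile, 10)
busiest = ten_second_counts.index(max(ten_second_counts))

# Re-slice only the busiest ten-second window into one-second windows.
zoomed = [t - busiest * 10 for t in profile if busiest * 10 <= t < (busiest + 1) * 10]
one_second_counts = samples_per_window(zoomed, 1, duration=10.0)
```

Each round of slicing needs another pass over the data and another set of flame graphs, which hints at why this approach becomes tedious for anything shorter than a second.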