adventures in systems engineering


tl;dr: In Grafana, every Graphite request includes the maxDataPoints parameter, which defaults to 100. If your query returns more than 100 data points, Graphite will consolidate them down to at most 100 points by averaging.

I currently work on a team that is responsible for our monitoring infrastructure, which means a lot of requests come our way for setting up monitoring for other teams or troubleshooting specific monitoring implementation details. I am very much trying to change this cultural mindset in our engineering space, but that’s a story for another time.

So, I have a request to figure out why a Grafana singlestat panel is returning some wacky numbers. This panel is tracking when a different system has active alerts firing. An alert can either be firing (1) or not firing (0), but sometimes a .5 value would be displayed. Huh?

First place I look is the script being used to check for active alerts to see what values it’s sending to Graphite. Great, no logging. So I fix up the script to output its count to a file every time it runs and sends a metric to Graphite. Cool, no .5 values here.

Next, I dig into Graphite by examining the whisper file directly using whisper-fetch. The values in the whisper file match the values in my temporary script log. Sweet!

This script used to run on another server, so next step is hopping onto the host for our carbon relays and using tcpdump to examine packets and check there are no other sources sending bogus metrics. Aaaand, verified.

At this point, I’m scratching my head. The singlestat panel is set to look at the current value, and none of the values end with .5. I attribute it to the panel previously being set to an average, but I keep an eye on the dashboard anyway as I’m working on something else.

Sure enough, a .5 value shows up, so in Grafana, I decide to use the Query Inspector to check the values being returned. There are quite a few .5 values returned, but the timestamps for these values don’t match the timestamps for the metrics in Graphite. What in the actual hell?

I’m stumped at this point, so hey, I click on the Help tab next to the Query Inspector tab, and the mystery is instantly solved. Graphite queries include a maxDataPoints parameter that defaults to 100.
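To make that parameter concrete, here is a minimal sketch of what a Graphite render request with maxDataPoints looks like. The host name and metric path are made up for illustration, and I'm building a plain GET URL for readability, so treat this as an approximation of the request Grafana sends rather than a copy of it.

```python
from urllib.parse import urlencode


def render_url(base_url, target, time_from, max_data_points=None):
    """Build a Graphite render API URL, optionally capping the point count."""
    params = {"target": target, "from": time_from, "format": "json"}
    if max_data_points is not None:
        # When set, Graphite consolidates the series so that no more than
        # this many points come back in the JSON response.
        params["maxDataPoints"] = max_data_points
    return f"{base_url}/render?{urlencode(params)}"


# Hypothetical host and metric name, not the ones from my environment.
url = render_url("http://graphite.example.com", "alerts.active.count", "-12h", 100)
print(url)
```

Calling the same function without max_data_points gives you a URL for the raw, unconsolidated series, which is handy for comparing against what the panel shows.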

In this case, the dashboard’s timeframe was set to look at the last 12 hours, which had way more than 100 data points. Graphite was just doing its default thing and shrinking that down to fewer than 100 data points, creating all these .5 averages.
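Here is a simplified sketch of how averaging can turn a clean 0/1 series into .5 values. It is not Graphite's actual implementation, just the general idea of folding consecutive points into buckets. The one-point-per-minute interval and the timing of the alert are assumptions I picked for the example.

```python
from statistics import mean


def consolidate(values, max_data_points):
    """Average consecutive values so at most max_data_points remain."""
    if len(values) <= max_data_points:
        return list(values)
    # Raw points folded into each output point (ceiling division).
    per_bucket = -(-len(values) // max_data_points)
    return [
        mean(values[i:i + per_bucket])
        for i in range(0, len(values), per_bucket)
    ]


# Hypothetical series: 12 hours at one point per minute (720 points),
# with the alert firing for the four minutes at indexes 100-103.
raw = [0] * 720
for i in range(100, 104):
    raw[i] = 1

consolidated = consolidate(raw, 100)
print(len(consolidated))  # 90 points, 8 raw values per bucket
print(consolidated[12])   # 0.5: four 1s and four 0s landed in the same bucket
```

Every raw value is a 0 or a 1, but the bucket covering indexes 96-103 holds half of each, so its average is .5. That also explains why the timestamps didn't line up: each returned point stands in for a whole bucket of raw points.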

On the Options tab, you can actually adjust the Max data points value, keeping in mind that the higher this number, the higher the likelihood that your browser will struggle with displaying the results.

Alternatively, you can set an Override relative time value on your panel’s Time range tab that guarantees it will return no more than 100 data points, and therefore, no unexpected averaging regardless of your dashboard timeframe. If you’re interested in sparklines and want trend history for those alerts over a longer timeframe than the override, create a graph panel.
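If you want to work out numbers for either option, the arithmetic is short. This sketch assumes a metric that reports once every 60 seconds, which is a value I picked for illustration, so swap in your own metric's interval.

```python
def points_in_window(window_seconds, interval_seconds):
    """Number of raw data points a time window contains."""
    return window_seconds // interval_seconds


def longest_unconsolidated_window(interval_seconds, max_data_points=100):
    """Longest window, in seconds, that fits within max_data_points."""
    return interval_seconds * max_data_points


interval = 60  # assumed: one data point per minute
twelve_hours = 12 * 60 * 60

# Option 1: the Max data points value needed to show 12 hours unaveraged.
needed = points_in_window(twelve_hours, interval)
print(needed)  # 720

# Option 2: the longest relative time override that stays within 100 points.
window = longest_unconsolidated_window(interval)
print(window // 60)  # 100 minutes
```

With these assumed numbers, you would either raise Max data points to at least 720 or keep the override at 100 minutes or less.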

I explained both options to the end user of the dashboard, who was satisfied with my conclusion, and I closed out the ticket. A happy end to the story, but knowing this quirk would’ve saved me so much time.

What did you learn today?

Hi! I'm a systems engineer for a global marketing platform. Here I dish about (mostly technical) books I'm reading, my musings on the ever-important soft skills/glue work in this field, and my general adventures in engineering.

