steffersaur*us

adventures in systems engineering

Engineering, TIL

tl;dr: iftop is a great tool to add to your troubleshooting toolbelt. It listens to network traffic on an interface, displaying bandwidth between two hosts.

We run Graphite and Carbon in our two main datacenters, and at any given time, our Grafana instance has its default data source pointing to one of those Graphite clusters.

We recently had to perform maintenance on one cluster, so we switched the default data source to point at the other one. Groovy, right? Actually, no. Occasional reports of a slow-loading dashboard in Grafana escalated to a flood of reports of slow renders, timeouts, and data gaps. Ugh.

To determine the underlying cause (after verifying that dashboards using other data sources hadn’t been impacted), I started by examining logs. Lots of timeouts and errors were being recorded, but a quick Google search didn’t provide much insight, other than that this cluster was likely getting overloaded.

I had a feeling there was a saturation issue, but since the fix would mean provisioning new app servers, I wanted to confirm where in the stack it was happening, and that yes, it was indeed saturated. Enter iftop.

As our servers have multiple interfaces, I ran `iftop -i bond0` to get a complete picture of network traffic on both Graphite app servers. Holy data, Batman! These poor servers had 1G NICs that were easily getting saturated. Most of our network data is sampled, so this wasn’t apparent in dashboards.
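If you want a second opinion on the saturation, the kernel's own byte counters make a handy cross-check. Here's a minimal sketch, assuming a Linux host that exposes `/sys/class/net/<iface>/statistics`. The function names `link_util_pct` and `sample_util` are made up for illustration, and the 1000 Mbit/s default just mirrors our 1G NICs (a bond's real capacity depends on its mode).

```shell
# Percent of link capacity used between two byte-counter samples.
# Args: bytes_before bytes_after interval_seconds link_speed_mbps
# Uses integer math, so the result is truncated to a whole percent.
link_util_pct() {
    # (delta bytes * 8 bits * 100) / (capacity in bits over the interval)
    echo $(( ($2 - $1) * 8 * 100 / ($3 * $4 * 1000000) ))
}

# Sample the kernel counters for one interface and print rx/tx utilization.
# Args: interface [interval_seconds, whole number] [link_speed_mbps]
# The 5 second and 1000 Mbit/s defaults are assumptions, not measured values.
sample_util() {
    stats="/sys/class/net/$1/statistics"
    secs="${2:-5}"
    mbps="${3:-1000}"
    rx0=$(cat "$stats/rx_bytes"); tx0=$(cat "$stats/tx_bytes")
    sleep "$secs"
    rx1=$(cat "$stats/rx_bytes"); tx1=$(cat "$stats/tx_bytes")
    echo "rx: $(link_util_pct "$rx0" "$rx1" "$secs" "$mbps")%"
    echo "tx: $(link_util_pct "$tx0" "$tx1" "$secs" "$mbps")%"
}
```

Running `sample_util bond0 5 1000` prints receive and transmit utilization over five seconds. Numbers creeping toward 100% on either line would back up what iftop shows.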

“The main part of the display lists, for each pair of hosts, the rate at which data has been sent and received over the preceding 2, 10 and 40 second intervals. The direction of data flow is indicated by arrows, <= and =>.”
(iftop manpage)

In this screenshot, you can see after just a few seconds how much traffic app01.graphite is sending and receiving.

iftop does need to be installed first, and I recommend checking out its manpage to see the flags and filters you can apply.
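To give a taste of those flags and filters, here's a small sketch of two invocations wrapped in shell functions. The names `watch_all` and `watch_filtered` are hypothetical, and I'm assuming iftop needs root to capture packets on your system, hence `sudo`. The flags themselves come from the manpage: `-n` skips hostname lookups, `-N` leaves port numbers unresolved, `-P` shows ports, and `-f` takes a pcap-style filter expression.

```shell
# Show every host pair on an interface with numeric addresses and ports.
# Args: interface
watch_all() {
    sudo iftop -i "$1" -n -N -P
}

# Show only traffic matching a pcap filter expression.
# Args: interface filter_expression
watch_filtered() {
    sudo iftop -i "$1" -n -f "$2"
}
```

For example, `watch_filtered bond0 'port 2003'` would narrow the view to Carbon's plaintext listener, assuming your cluster uses the default port of 2003.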

I’m definitely adding iftop to my troubleshooting toolbelt. And provisioning some new app servers.

What did you learn today?

Hi! I'm a systems engineer for a global marketing platform. Here I dish about (mostly technical) books I'm reading, my musings on the ever-important soft skills/glue work in this field, and my general adventures in engineering.
