tl;dr: iftop is a great tool to add to your troubleshooting toolbelt. It listens to network traffic on an interface, displaying the bandwidth used between pairs of hosts.
We recently had to perform maintenance on one cluster, so we switched the default data source to point at the other one. Groovy, right? Actually, no. Occasional reports of a slow-loading dashboard in Grafana escalated to a flood of reports of dashboards rendering more slowly, timeouts, and data gaps. Ugh.
To determine the underlying cause (after verifying that dashboards using other data sources hadn’t been impacted), I started by examining logs. Lots of timeouts and errors were being recorded, but a quick Google search didn’t provide much insight, other than that this cluster was likely getting overloaded.
I had a feeling there was a saturation issue, but since the fix would mean provisioning new app servers, I wanted to confirm where in the stack this was happening, and that yes, something was indeed saturated. Enter iftop.
As our servers have multiple interfaces, I ran `iftop -i bond0` to get a complete picture of network traffic on both Graphite app servers. Holy data, Batman! These poor servers had 1G NICs that were easily getting saturated. Most of our network data is sampled, so this wasn’t apparent in dashboards.
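Even before reaching for iftop, you can get a rough read on interface throughput from the kernel's cumulative byte counters in `/proc/net/dev`. Here's a sketch that samples the counters twice, a second apart, and prints the approximate rate; it uses `lo` so it runs anywhere, but you'd substitute your own interface (like `bond0`):

```shell
#!/bin/sh
# Rough interface-throughput check using the kernel's cumulative byte
# counters in /proc/net/dev (no extra tools needed).
IFACE=lo   # substitute your interface, e.g. bond0

read_counters() {
  # After stripping the "name:" prefix, field 1 is RX bytes and field 9
  # is TX bytes (the 8 RX fields come before the TX fields).
  awk -v i="$IFACE" 'sub("^ *" i ":", "") { print $1, $9 }' /proc/net/dev
}

set -- $(read_counters); rx1=$1 tx1=$2
sleep 1
set -- $(read_counters); rx2=$1 tx2=$2

echo "rx $((rx2 - rx1)) B/s, tx $((tx2 - tx1)) B/s"
```

This only gives you a per-interface total, though; what made iftop the right tool here is that it breaks the traffic down by host pair, so you can see *who* is eating the bandwidth.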
“The main part of the display lists, for each pair of hosts, the rate at which data has been sent and received over the preceding 2, 10 and 40 second intervals. The direction of data flow is indicated by arrows, <= and =>.” (iftop manpage)
iftop does need to be installed, and I recommend checking out its manpage to see the flags and filters you can apply.
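A few invocations I find useful (port 2003 here is an assumption based on carbon's default plaintext port; adjust for your setup):

```shell
# Listen on bond0, skip DNS lookups (-n) so the display keeps up,
# show ports (-P), and apply a pcap filter so only traffic on the
# carbon port is counted:
iftop -i bond0 -n -P -f 'port 2003'

# One-shot text output (-t) is handy for pasting into a ticket;
# -s 10 prints a single snapshot after 10 seconds:
iftop -i bond0 -t -s 10
```

Note that iftop puts the interface into promiscuous mode, so it typically needs to run as root.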
I’m definitely adding iftop to my troubleshooting toolbelt. And provisioning some new app servers.
What did you learn today?