Why the Lesser-Spec Server Was Running Spiffier

Vito Leung
3 min read · Dec 8, 2020

I had the same stack running on two types of servers, but somehow the one with the lesser specs was doing better. This is how I troubleshot the problem…

I was troubleshooting the system one day and realized it was time to add another web server to help with the traffic load. After adding the new box into rotation, I noticed it had a much lower load average, even over a long period of time. The web servers sit behind a round-robin VIP, so this stirred my curiosity, especially since the server I added had lower specs than the original servers.

For context, all the other web servers are IBM x3550 M5 boxes with E5-2683 v4 CPUs, and the server I added is an HP DL360 Gen9 with an E5-2620 v4.

The next step was to run vmstat on both servers and compare a few fields (a sample invocation follows the list):

  • r: processes in the run queue, to see whether both servers were working through about the same number of requests at about the same speed
  • b: blocked processes, to see the waiting-process situation on both servers
  • cs: context switches
  • sy: CPU time spent in kernel (system) code
  • us: CPU time spent in user (non-kernel) code
  • id: time spent idle
  • wa: time spent waiting for IO
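
For reference, here's roughly what that invocation looks like; the interval and count are just illustrative:

vmstat 2 5   # sample every 2 seconds, five times; the first report is averages since boot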

All the categories came out about the same, so I took a closer look at the CPU details. The scaling governor was indeed different on the two boxes. Defaults can differ due to manufacturer settings, sysops preference, and so on. As these are production servers, I changed the setting on both to "performance".
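
If you want to check or change that yourself, the governor lives in the standard cpufreq sysfs tree; a quick sketch (run as root, and note some distros ship cpupower for the same job):

# Show the current governor for every core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Switch every core to "performance"
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done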

After double-checking that the settings were the same on both servers, I ran lscpu again, and one other thing was clear: the server performing better was hitting its max clock speed. The server that constantly had the higher load was never able to hit max speed, though its clock speed did improve after changing the scaling_governor.
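
An easy way to watch the clock is to poll lscpu (the exact MHz labels vary a little by version, but the grep below catches them):

# Refresh current/min/max clock speeds every second
watch -n1 "lscpu | grep -i mhz"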

Finally, I took a look at two more things. One was using strace to look at system-call data. Here, it was very clear that the box with the heavier load was making a lot more system calls. The column to watch in the summary is usecs/call.
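
The summary mode is what produces that column. Attaching to a busy worker looks roughly like this; the PID is hypothetical, and strace prints the table when it detaches:

# Count and time syscalls for 30 seconds, then print the summary (usecs/call included)
timeout 30 strace -c -p 1234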

Before wrapping things up, I did a quick comparison of the number of established TCP connections between the two hosts with a simple netstat command: netstat -atp | grep -i EST | wc -l. I just wanted to make sure the traffic was fairly consistent, and at a glance it was. I do recommend the package meow-watermelo put together, which gives greater detail for more extensive troubleshooting.
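
If netstat isn't installed, ss gives the same count; a drop-in alternative assuming a reasonably recent iproute2:

# Count established TCP connections; -H drops the header line
ss -Ht state established | wc -l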

So all this work only proved that both servers were performing as they should. In theory, the requests behind a round-robin VIP should even out over a long enough period, yet the server with more load is making more system calls. The next step is to profile the requests themselves.
