Check the server's performance? How far down the rabbit hole do you want to go?
When troubleshooting system performance, the end goal is latency that's acceptable for the environment. I've seen times when 50ms of latency was completely acceptable and other times when 5ms of latency was causing production outages. It all depends on the application, period. If you're running a SAN environment or pushing NFS datastores to ESXi, then around 20ms you're basically dead in the water, just as an example. That said, it's difficult, if not impossible, to answer "what should my latency be," and it can be hard to wrap your head around this too. The real answer is "it's complicated and it depends," but typically the best answer comes from monitoring latency on your system via your preferred methods for a period of time. Then analyze that data: the time frames when users report "good" or "normal" performance show you the latency values you should be hoping to sustain in your environment. I would point out that "normal" isn't always "good" in every case. If you've got a system that's getting completely slammed, then "normal" might actually be really bad.
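To make that concrete, here's a rough sketch (in Python) of how you might turn latencies logged during a "good" window into a baseline target. The sample values and percentile choices here are made up for illustration, not recommendations:

```python
# Hypothetical sketch: derive a latency target from samples collected
# while users reported "normal" performance. Sample values are invented.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    # nearest-rank: ceil(pct/100 * N), converted to a 1-based rank
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Latencies (ms) logged during a window users called "good"
good_window = [4.8, 5.1, 5.0, 6.2, 5.4, 4.9, 7.0, 5.3, 5.6, 12.0]

baseline_p50 = percentile(good_window, 50)
baseline_p95 = percentile(good_window, 95)
print(f"target p50 ~{baseline_p50} ms, p95 ~{baseline_p95} ms")
```

The point is simply that a percentile over a known-good window gives you a number to alarm on later, instead of guessing what "should" be normal.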
So next, let's talk a bit about system resources. I'll try to keep this brief since it can differ a little depending on your OS and how it handles system resources.
CPU - are you using a queuing-based or messaging-based OS? I'll admit, I didn't go that deep down the rabbit hole on this one, and if someone reading this can point me to some articles that will take me down it, I'd love to read them. What I can tell you is based on personal experience.
I spent a year doing performance analysis for a messaging-based OS and had several instances where I would see CPU utilization during "normal workload" sitting around 90% while users reported good performance from their system. There have also been times where I saw average CPU sitting under 20% across all cores and performance was complete garbage. This is where the "it's complicated and it depends" part comes in; there are so many possible causes. Maybe your CPU isn't powerful enough for your workload, or maybe you've hit an OS bug and there's a process taking way more CPU time than was originally intended due to some condition within the system. For this reason, you can't always look at CPU utilization and say "my computer is having a performance problem" when you see high CPU. Before you jump to that conclusion, use the system first and see if you are actually experiencing a performance problem. On a queuing-based OS, chances are you'll feel that pain, but the same is not true of messaging-based OSes, in my experience.
Disk - This is another fun one... Typically disk utilization is time busy, so if we're spending a lot of time seeking and not a lot of time reading or writing, then we aren't THAT busy, are we? This honestly could just have been the way the OS I was troubleshooting calculated utilization. That aside, when you're looking over disk utilization, you also need to look at latency per operation and take that into consideration to determine if you're actually having a problem. Let's make this example easy: you're using a 7200 RPM drive and you want your read/write latency to be under or around 15ms, which isn't terribly unreasonable for 7200 RPM drives. If you're doing a lot of sequential workload, then your disk utilization might be higher, but your overall latency per operation could be very low since it's sequential. On the other hand, if you're doing random operations, that's going to involve a lot of seek time to find the data on the platter so it can be read (or written, if that just happens to be where there was free space - look up free space fragmentation), and your utilization might show low while your latency is horrible because of those seek times. So when looking at disks, look more at latency itself and not so much at utilization.
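Here's a quick back-of-the-envelope sketch of that point. All the numbers below are invented for illustration; the only real idea is dividing busy time by operations completed in a sample window:

```python
# Back-of-the-envelope sketch: why utilization alone can mislead.
# All numbers here are invented for illustration.

def avg_latency_ms(busy_seconds, ops_completed):
    """Average service time per I/O over a 1-second sample window."""
    return busy_seconds / ops_completed * 1000

# Sequential workload: disk busy 90% of the second, 2000 ops completed
seq = avg_latency_ms(0.9, 2000)   # "busy" disk, but fast per-op

# Random workload: disk busy only 50% of the second, but seeks limit
# it to 30 ops completed
rnd = avg_latency_ms(0.5, 30)     # "half-idle" disk, but slow per-op

print(f"sequential: {seq:.2f} ms/op, random: {rnd:.2f} ms/op")
```

In this made-up example the sequential disk shows 90% utilization at well under 1ms per op, while the random disk shows only 50% utilization yet blows past the 15ms target - exactly the trap of reading utilization without latency.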
Testing methodology - This one can get you... I've seen several times where someone states they're seeing a performance issue with their network storage because file transfers are "extremely slow" to network storage, but a copy on local disk is "very fast." So let's first point out the difference here: we're comparing a copy operation done locally, from one hard disk to another, against copying data from that same hard disk to a hard disk in another system... across the network. Now, instead of just the normal overhead of copying data from point A to point B on the same computer, we're adding layers. We now have TCP, and we have whatever file protocol is in use (SMB or NFS). Next we should consider the version of the protocol being used; SMBv1 will yield different performance from SMBv2, for example.
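If you want a rough local baseline before comparing against the network path, a minimal sketch like this times a buffered copy and reports MB/s. The buffer size and 16 MB test file are arbitrary choices for the example, not recommendations:

```python
# Hypothetical sketch: time a local copy so you have a baseline number
# before blaming the network stack (TCP + SMB/NFS layers on top).
import os
import shutil
import tempfile
import time

def timed_copy(src, dst, bufsize=1024 * 1024):
    """Copy src to dst with a fixed buffer size and return MB/s."""
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, bufsize)
    elapsed = time.perf_counter() - start
    return os.path.getsize(src) / (1024 * 1024) / elapsed

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    dst = os.path.join(tmp, "dst.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(16 * 1024 * 1024))  # 16 MB of test data
    print(f"local copy: {timed_copy(src, dst):.1f} MB/s")
```

Repeat the same measurement against a mounted network share and the gap you see is the cost of all those extra layers, not necessarily a broken array.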
The application (single-threaded vs multi-threaded) - This bleeds into testing methodology, a lot. Some applications, such as RMAN backup, can be configured to use multiple threads, thus taking advantage of more cores in a CPU. This can yield higher throughput because we're doing more work at once. But then we come to some of the more popular testing methodologies: doing a file copy to a network share, or using dd to write out data to a network share. These are HORRIBLE testing methodologies because they're single-threaded compared to our multi-threaded application, so of course they're going to be slower. There may also be some applications which are only coded to use one core. When I spun up a Rust server the other week, I noticed only one core on my server was getting hit and the others were basically idle. It could be the Oxide server I was running or it might just be the way Rust is coded, just as an example.
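To see why thread count alone changes the numbers, here's a toy illustration, not a real benchmark: the "chunk transfers" are just sleeps standing in for wire latency, so multiple threads can overlap the waits the way a multi-streamed copy tool overlaps I/O:

```python
# Toy illustration (not a real benchmark): each "chunk transfer" just
# sleeps to mimic per-chunk network latency; the chunk count, latency,
# and worker count are all made up for the example.
import time
from concurrent.futures import ThreadPoolExecutor

CHUNKS = 8
LATENCY = 0.05  # pretend each chunk costs 50 ms of wire time

def send_chunk(_):
    time.sleep(LATENCY)

# Single-threaded: chunks go one after another, like a plain file copy
start = time.perf_counter()
for i in range(CHUNKS):
    send_chunk(i)
single = time.perf_counter() - start

# Multi-threaded: four "streams" in flight at once
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(send_chunk, range(CHUNKS)))
multi = time.perf_counter() - start

print(f"single: {single:.2f}s, 4 threads: {multi:.2f}s")
```

The single-threaded loop pays every chunk's latency in sequence while the pool overlaps them, which is exactly why a one-stream file copy or dd run understates what a multi-threaded application can push.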
This is about all I have time to post at the moment. This isn't really something you can put all into one post, nor can you learn it in a week or maybe even a month's time. I only had a year working system performance, and while I made some incredible progress during that time, I feel like I've only started to scratch the surface.
Edited by SnackMasterX - 2/3/16 at 11:15pm