How accurate are performance counters for storage?

#1
05-09-2020, 08:38 PM
You know, performance counters for storage are an essential part of monitoring and managing systems, especially when you’re working in IT. When I first started using them, I had a few misconceptions about how reliable and accurate they actually are. After dealing with various projects and implementations, my perspective has matured a bit.

Let me explain how accurate these counters actually are. Performance counters collect data related to storage devices, such as read and write speeds, latency, and I/O operations per second. They can provide valuable insight into performance bottlenecks and help you optimize the system. However, their accuracy can fluctuate based on several factors.
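
If you want a feel for what the raw numbers look like, here is a minimal Python sketch using the psutil library (my choice of tool, not something from any particular project; field names and disk names vary by OS):

    # Minimal sketch: dump the raw per-disk storage counters the OS exposes.
    # Assumes psutil is installed (pip install psutil); disk names differ by system.
    import psutil

    counters = psutil.disk_io_counters(perdisk=True)
    for disk, c in counters.items():
        print(f"{disk}: reads={c.read_count} writes={c.write_count} "
              f"read_bytes={c.read_bytes} write_bytes={c.write_bytes} "
              f"read_time_ms={c.read_time} write_time_ms={c.write_time}")

These are cumulative totals since boot, which is exactly why the raw values alone tell you little until you turn them into rates over a window you choose.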

One fundamental factor is system overhead. Every time a counter is accessed, there's a slight hit on system resources. Depending on the environment and the number of counters being monitored, this overhead can skew the data. I remember working on a Hyper-V environment backed up with BackupChain, a server backup solution, and finding that while the backup jobs were running, the performance counters showed some unexpected variations. The system load increased due to the additional tasks created by the backup operations. That experience reinforced my understanding that performance counters can misrepresent the true state of storage performance if the environment is not stable.

Another issue that affects accuracy is the configuration of the monitoring tools. You might find that default settings don't always align with the specific needs of your infrastructure. For example, in one of my previous roles, we used performance counters to track the performance of our SAN. Initially, the configuration didn't take into account the various RAID levels and how they affect read and write speeds. With manual tuning and adjustments to the sampling rates and counter selection, we uncovered performance metrics that were more representative of our actual workload.
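
The sampling-rate point is easier to see in code. Here's a rough sketch, again assuming psutil, where the interval you pick is effectively your sampling rate: short intervals catch bursts, long intervals smooth them away.

    # Sketch: derive per-second rates from two counter snapshots.
    # INTERVAL is the knob that matters; tune it to your environment.
    import time
    import psutil

    INTERVAL = 5  # seconds

    before = psutil.disk_io_counters()
    time.sleep(INTERVAL)
    after = psutil.disk_io_counters()

    read_iops = (after.read_count - before.read_count) / INTERVAL
    write_iops = (after.write_count - before.write_count) / INTERVAL
    read_mbps = (after.read_bytes - before.read_bytes) / INTERVAL / 1_048_576

    print(f"read IOPS: {read_iops:.1f}, write IOPS: {write_iops:.1f}, read MB/s: {read_mbps:.2f}")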

Latency is another critical aspect to consider. Counters will typically show average read/write latencies, but averages can mask significant spikes that can cause user-facing issues. I’ve often seen environments where average latency appears excellent, but users are experiencing occasional slowness due to spikes in I/O demand. This inconsistency can lead to misguided conclusions about storage performance. For example, if you were monitoring a database application and only looking at average latency, you might not realize that certain operations, perhaps during backup windows or heavy querying, were drastically affecting user experience in those moments.
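
To show why averages mislead, here is a toy example with made-up latency numbers: nearly identical means, very different tails. Tracking a high percentile alongside the mean is what exposes the spikes.

    # Toy example: two latency distributions with similar means but different tails.
    # The sample values are illustrative only, in milliseconds.
    import statistics

    steady = [5] * 100              # consistently 5 ms
    spiky  = [2] * 95 + [65] * 5    # mostly 2 ms, occasional 65 ms spikes

    for name, samples in (("steady", steady), ("spiky", spiky)):
        mean = statistics.mean(samples)
        p99 = statistics.quantiles(samples, n=100)[98]
        print(f"{name}: mean={mean:.2f} ms, p99={p99:.1f} ms")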

In addition to configuration, there’s also the challenge of interpreting the data correctly. Performance counters can have different meanings depending on context. You might record high IOPS on a particular storage device, but without correlating that with other metrics like throughput or latency, the figures can be misleading. I nailed this down in a project that involved multiple storage arrays. The counters highlighted impressive IOPS, but the throughput was comparatively low, indicating that while the device was responsive, it was not efficient under load.
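
A quick back-of-the-envelope check I like for exactly that situation: divide throughput by IOPS to get the average request size. The numbers below are illustrative, not figures from that project.

    # Correlating IOPS with throughput: the ratio is the average request size.
    # High IOPS with a tiny average request size explains "responsive but not
    # efficient under load". Illustrative values only.
    read_iops = 40_000            # ops/s reported by the counter
    read_throughput_mb_s = 160    # MB/s reported over the same window

    avg_request_kb = (read_throughput_mb_s * 1024) / read_iops
    print(f"average read size: {avg_request_kb:.1f} KB per I/O")   # ~4.1 KB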

Moreover, the storage type can impact counter accuracy. SSDs, for instance, will often behave differently compared to traditional spinning HDDs. When using performance counters to monitor SSDs, you need to account for the fact that they can handle more I/O operations concurrently, making some traditional metrics less relevant. I once worked on a project where we migrated to SSDs but continued to use old performance benchmarks meant for HDDs. The results were skewed, and it took some time to recalibrate our expectations and monitoring methods.

Networking can muddy the waters too. When storage resources are accessed over a network (as with NFS or iSCSI), network latency can distort the metrics for the storage itself. I learned this lesson the hard way during deployment testing with a cloud-based storage solution that made use of secondary backup systems. The network had inconsistent latency, which showed up in our performance counters and suggested we had storage issues. Once I isolated the network components, the storage performance data became much clearer.

The time of day can also influence what the counters show. I once managed a system where the workload changed dramatically throughout the day as backups ran overnight. Performance counters painted a very different picture during peak versus off-peak hours. This led me to implement scheduling strategies so that resource-heavy backup tasks didn't run while users were actively engaging with the systems.
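
A simple way to see that pattern is to bucket your own latency samples by hour. The sample format here is just an assumption about how you might have stored the data yourself, not any standard layout:

    # Sketch: group latency samples by hour to compare peak vs. off-peak behaviour.
    # 'samples' is assumed to be a list of (unix_timestamp, latency_ms) tuples.
    from collections import defaultdict
    from datetime import datetime
    import statistics

    def latency_by_hour(samples):
        buckets = defaultdict(list)
        for ts, latency_ms in samples:
            buckets[datetime.fromtimestamp(ts).hour].append(latency_ms)
        return {hour: statistics.mean(vals) for hour, vals in sorted(buckets.items())}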

It is also worth understanding the limits of what performance counters can tell you about your storage systems. They can reveal how well your storage resources are utilized, but they don't necessarily provide insight into the underlying causes of issues. For example, high read latencies could indicate overwhelmed storage controllers, or poorly optimized database queries generating unnecessary I/O. This understanding brought clarity to many performance issues I dealt with while working with databases.

I have learned the importance of using complementary tools alongside performance counters. For instance, observability tools can provide deeper insights, taking into account storage performance alongside application performance. They enable correlations that might otherwise be missed when only analyzing raw performance counters.

The reality is that performance counters are just one piece of a much larger puzzle. They need to be contextualized. I remember integrating them with logging tools that gave me a holistic view of application performance correlated against storage metrics. This approach helped in identifying not just the “what” of the performance metrics but also the “why.”

Testing under different conditions also plays a crucial role. I’ve crafted scenarios to simulate peak loads to see how performance counters react. By understanding how they behave under stress, you gain the knowledge needed to identify and resolve performance bottlenecks more efficiently.
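
Here's the kind of crude load-generation sketch I mean: it hammers a scratch file with writes while sampling the counters once a second. The path, block size, and duration are placeholders you would adjust, and be aware it can write several gigabytes on a fast disk.

    # Sketch: generate synthetic write load while sampling counters.
    # Placeholder path/sizes/duration; adjust and clean up for your setup.
    import os
    import threading
    import time
    import psutil

    STOP = threading.Event()

    def write_load(path="scratch.bin", block=1_048_576):
        with open(path, "wb") as f:
            while not STOP.is_set():
                f.write(os.urandom(block))
                f.flush()
                os.fsync(f.fileno())

    t = threading.Thread(target=write_load, daemon=True)
    t.start()
    for _ in range(10):                     # sample once a second for 10 seconds
        before = psutil.disk_io_counters()
        time.sleep(1)
        after = psutil.disk_io_counters()
        print(f"write MB/s: {(after.write_bytes - before.write_bytes) / 1_048_576:.1f}")
    STOP.set()
    t.join()
    os.remove("scratch.bin")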

When I consider the experiences I've had, I realize the importance of regular reviews of performance metrics. Keeping an eye on trends over time is vital. Instantaneous readings can be misleading, but trends can provide the insight needed to understand whether performance is improving or degrading. Regular data review sessions can help you spot gradually rising latency or slowly decreasing throughput that could indicate problems.
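
Even a simple trend line over daily averages can surface that kind of slow degradation. This sketch uses made-up numbers and needs Python 3.10+ for statistics.linear_regression:

    # Sketch: fit a trend line over daily average latencies to spot slow drift
    # that a single reading would miss. Sample data is illustrative only.
    import statistics

    daily_avg_latency_ms = [4.1, 4.0, 4.3, 4.4, 4.6, 4.5, 4.9, 5.1]  # one value per day
    days = list(range(len(daily_avg_latency_ms)))

    slope, intercept = statistics.linear_regression(days, daily_avg_latency_ms)
    print(f"latency trend: {slope:+.2f} ms per day")   # positive slope = creeping up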

One of the most crucial things is to ensure that performance counters are continuously monitored and adapted over time. As technologies evolve, so should your monitoring strategies. When I hear about organizations using the same setups or configurations they implemented years ago, I think about the missed opportunities for optimizing performance and troubleshooting issues.

You likely recognize that storage environments are dynamic and can be affected by many factors, from hardware to workloads and user behavior. While performance counters are a powerful tool for assessment, they are not foolproof. You should treat them as one part of a broader toolbox for understanding your storage's performance landscape. Combining them with contextual information and keeping your monitoring practices up to date will give you a more accurate view of storage performance.

savas@BackupChain