Well, like the weather, resource monitoring in virtual environments is something everyone talks about but no one does anything about. (I bet I'm about to get flamed by a bunch of ISVs who are delivering resource monitoring solutions). By doing nothing, I get the feeling we, as an industry, are taking the same approach that got us to an under-instrumented data center in the first place.
To move it forward, we’ve been looking at three big usage areas:
1. Dynamic Load Balancing – Can I use data out of the platform to make better load balancing decisions in my data center? On what basis should I make those decisions and how do I instrument a “Monitor-Analyze-Decide-Act” loop?
2. SLA Management/Chargeback – Can I collect sufficiently robust data from the environment to satisfy my clients, and an auditor, that I have accurately measured a particular client’s actual resource usage against an SLA or pricing model?
3. Capacity Planning – Do I have accurate data about the use of the environment over time that will allow me to plan whether I will need new capacity in the future?
All three of these usage areas require:
1. Instrumentation of the data center elements to measure resource use in the data center on an ongoing basis.
2. Collection and correlation of the measurements from each of the elements.
3. Analyzing and abstracting the data into key indicators.
4. Setting/modifying policy and reporting the status.
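Putting the two lists together, the “Monitor-Analyze-Decide-Act” loop from the load balancing usage might be sketched as below. This is only an illustration: the metric, threshold, VM names, and migration action are all placeholders, not a real management API.

```python
import random

def monitor():
    # Placeholder for reading platform counters: per-VM CPU utilization.
    return {vm: random.uniform(0.0, 1.0) for vm in ("vm-a", "vm-b", "vm-c")}

def analyze(samples, threshold=0.85):
    # Abstract raw samples into a key indicator: which VMs are "hot".
    return [vm for vm, util in samples.items() if util > threshold]

def decide(hot_vms):
    # Pick a migration candidate; a real balancer would weigh many signals.
    return hot_vms[0] if hot_vms else None

def act(candidate):
    # Placeholder for a rebalance call into the management layer.
    return f"migrate {candidate}" if candidate else "no action"

def mada_iteration():
    return act(decide(analyze(monitor())))
```

The quality of every downstream stage depends on the fidelity of what `monitor()` returns, which is why the instrumentation layer is the focus here.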
Since Intel's so low in the stack we’ve been focused on #1. Not that 2-4 aren’t also critical but without a sound base of instrumentation, the tools above are limited.
With instrumentation, there are three things to consider:
1. What should be measured? The simplest thing to measure, for example, is CPU usage per VM. There are a number of CPU counters available for this purpose, and they’ve been used for years by Windows, ESX, and other OSes and hypervisors. The list of things to measure can be extended almost indefinitely: memory size and latency, paging thrash, IO throughput and latency, power usage. Frankly, though, given that we’re all making only light use of the basic counters already available, the problem is not really here.
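As a concrete illustration of the kind of counter already available, here’s a sketch that derives CPU usage from the cumulative utime/stime fields of Linux’s /proc/&lt;pid&gt;/stat. This is per-process rather than per-VM, and Linux-specific; a hypervisor exposes analogous cumulative per-VM counters.

```python
def cpu_jiffies(stat_line):
    """Sum utime + stime (fields 14 and 15) from a /proc/<pid>/stat line."""
    # The comm field (field 2) is parenthesized and may contain spaces,
    # so split once after its closing ')' before indexing the rest.
    fields = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])
    return utime + stime

def cpu_fraction(prev_jiffies, curr_jiffies, interval_s, hz=100):
    """Fraction of one CPU consumed between two samples interval_s apart.

    hz is the kernel's USER_HZ clock-tick rate (typically 100)."""
    return (curr_jiffies - prev_jiffies) / (hz * interval_s)

# Usage on Linux: sample /proc/self/stat twice, an interval apart, then
#   cpu_fraction(first_sample, second_sample, interval)
```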
2. How frequently are measurements sampled? For example, ESXTOP, PERFMON, VMware DRS, Linux/Unix TOP, and other performance monitoring tools sample somewhere between 100ms and 5 seconds. Meanwhile, in ESX for example, VMs are switched by the scheduler much more frequently than that. This is really the crux of the issue we’re exploring at the moment:
· Is there a value in collecting measurements more frequently than we can currently do with the existing tools?
Our hypothesis is yes: at current sampling rates, the measurements are too inaccurate to be useful in many cases, particularly for SLA/chargeback usages, where accurate data is the basis of a trust agreement and possibly a financial one. Finer-grained measurement may also be critical for load balancing decisions and helpful for capacity planning. We don’t know yet.
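A toy model of why the sampling rate matters: suppose a VM runs the first 3ms of every 10ms scheduler slice (30% true utilization), and a monitor point-samples the “is this VM running?” bit. If the sampling period happens to align with the scheduler period, the estimate can be wildly wrong. This sketches the aliasing phenomenon only, not any particular tool; tools built on cumulative counters avoid this specific failure, though coarse intervals still blur short bursts.

```python
def true_usage(schedule, horizon_ms):
    # Ground truth: fraction of 1ms slots in which the VM was running.
    return sum(schedule(t) for t in range(horizon_ms)) / horizon_ms

def sampled_usage(schedule, horizon_ms, period_ms):
    # Point-sample the running bit every period_ms and average.
    samples = [schedule(t) for t in range(0, horizon_ms, period_ms)]
    return sum(samples) / len(samples)

# VM runs the first 3ms of every 10ms scheduler slice: 30% true usage.
vm = lambda t: 1 if t % 10 < 3 else 0

exact = true_usage(vm, 10_000)           # 0.3
aliased = sampled_usage(vm, 10_000, 20)  # 1.0: every sample lands on t % 10 == 0
```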
3. How are measurements extracted from the hardware environment for analysis, decision and action? This must be done with minimal performance impact on the payload applications running on a given element/server. With the number of cores and the performance available today, spending some CPU on measurement collection may not be problematic, so long as collection does not impact application performance on an appropriately engineered system. Measurements must also be extracted in a way that allows them to be correlated with other measurements taken elsewhere in the solution. For example, you may want to correlate a VM’s storage usage with its memory usage over time. This correlation requirement means that all the measurements must be accessible to the correlation engine, either through a software mechanism or through a separate out-of-band connection to the hardware.
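For the correlation requirement, a minimal sketch: assuming each collector emits (timestamp, metric, value) records against a shared, synchronized clock (in practice that means NTP/PTP-quality time sync), aligning the streams onto common time buckets is enough to put one VM’s metrics side by side. The record format and metric names here are hypothetical.

```python
from collections import defaultdict

def correlate(records, bucket_s=1.0):
    # Align (timestamp, metric, value) records from multiple collectors
    # onto common time buckets so different metrics can be compared.
    buckets = defaultdict(dict)
    for ts, metric, value in records:
        buckets[int(ts // bucket_s)][metric] = value
    return dict(buckets)

records = [
    (0.2, "mem_mb", 512), (0.4, "disk_iops", 120),
    (1.1, "mem_mb", 640), (1.7, "disk_iops", 95),
]
aligned = correlate(records)
# aligned[0] == {"mem_mb": 512, "disk_iops": 120}
```

Out-of-band extraction (e.g. via a management controller) fits the same model, as long as its records carry timestamps the correlation engine can map onto the same buckets.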
We're currently building a test framework to instrument and collect some of this data and would be interested in your feedback on our direction and approach. Are we focused in the right places?