
    Resource Monitoring in Virtual Environments


      Well, like the weather, resource monitoring in virtual environments is something everyone talks about but no one does anything about. (I bet I'm about to get flamed by a bunch of ISVs who are delivering resource monitoring solutions). By doing nothing, I get the feeling we, as an industry, are taking the same approach that got us to an under-instrumented data center in the first place.

      To move it forward, we’ve been looking at three big usage areas:

      1.    Dynamic Load Balancing – Can I use data out of the platform to make better load-balancing decisions in my data center? On what basis should I make those decisions, and how do I instrument a “Monitor-Analyze-Decide-Act” loop? (There’s a sketch of such a loop after this list.)

      2.    SLA Management/Chargeback – Can I collect sufficiently robust data from the environment to satisfy my clients and an auditor that I have accurately measured a particular client’s actual resource usage against an SLA or pricing model?

      3.    Capacity Planning – Do I have accurate data about the use of the environment over time that will allow me to plan whether I’ll need new capacity in the future?
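      To make the “Monitor-Analyze-Decide-Act” loop in #1 concrete, here’s a minimal sketch in Python. Everything in it is hypothetical: the host names, the watermark, and the stubbed monitor/act functions stand in for real platform counters and a real migration API.

      import random
      import time

      POLL_INTERVAL_S = 5.0        # hypothetical sampling period
      CPU_HIGH_WATERMARK = 0.85    # hypothetical rebalance trigger

      def monitor(hosts):
          # Stand-in for real platform counters: per-host CPU utilization (0.0-1.0).
          return {h: random.random() for h in hosts}

      def analyze(samples):
          # Abstract the raw samples into the one indicator the policy acts on.
          hottest = max(samples, key=samples.get)
          return hottest, samples[hottest]

      def decide(hottest, load):
          # Policy: rebalance only when the hottest host exceeds the watermark.
          return hottest if load > CPU_HIGH_WATERMARK else None

      def act(host):
          # Stand-in for a real action, e.g. asking the hypervisor to migrate a VM.
          print("would migrate a VM off", host)

      if __name__ == "__main__":
          hosts = ["esx01", "esx02", "esx03"]   # hypothetical hosts
          for _ in range(3):                    # a few iterations, for illustration
              hottest, load = analyze(monitor(hosts))
              target = decide(hottest, load)
              if target:
                  act(target)
              time.sleep(POLL_INTERVAL_S)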

      All three of these usage areas require:

      1.    Instrumentation of the data center elements to measure resource use on an ongoing basis.

      2.    Collection and correlation of the measurements from each of the elements.

      3.    Analysis and abstraction of the data into key indicators.

      4.    Setting/modifying policy and reporting the status.

      Since Intel sits so low in the stack, we’ve been focused on #1. Not that 2-4 aren’t also critical, but without a sound base of instrumentation, the tools above are limited.

      With instrumentation, there are three things to consider:

      1.    What should be measured? The simplest thing to measure, for example, is CPU usage per VM. A number of CPU counters are available for this purpose, and they’ve been used for years by Windows, ESX, and other OSes and hypervisors. The list of things to measure can be extended almost indefinitely: memory size and latency, paging thrash, I/O throughput and latency, power usage. Frankly, though, given that we’re all making only light use of the basic counters already available, the problem isn’t really here. (There’s a small sketch of reading such counters after this list.)


      2.    How frequently are measurements sampled? For example, ESXTOP, PERFMON, VMware DRS, Linux/Unix TOP, and other performance monitoring tools sample somewhere between 100 ms and 5 seconds. Meanwhile, in ESX for example, VMs are switched by the scheduler much more frequently than that. This is really the crux of the issue we’re exploring at the moment:


      · Is there value in collecting measurements more frequently than the existing tools currently do?


      Our hypothesis is yes: at current sampling rates, the measurements are too inaccurate to be useful in many cases, particularly for SLA/chargeback usages, where accurate data is the basis of a trust agreement and possibly a financial one. It may also be critical for load balancing decisions and helpful for capacity planning. We don’t know yet. (A toy simulation after this list illustrates the sampling error.)

      3.    How are measurements extracted from the hardware environment for analysis, decision, and action? This must be done with minimal performance impact on the payload applications running on a given element/server. With the large number of cores and the performance available today, spending some CPU on measurement collection may not be a problem, so long as collection does not impact application performance on an appropriately engineered system. Measurements must also be extracted in a way that allows them to be correlated with measurements taken elsewhere in the solution. For example, you may want to correlate a VM’s storage usage with its memory usage over time. This correlation requirement means that the measurements must all be accessible to the correlation engine, either through a software mechanism or a separate out-of-band connection to the hardware. (A small correlation sketch follows below.)
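      First, on the question of what to measure: here’s a minimal sketch, assuming a Linux host where each VM shows up as an ordinary process (as it does with KVM/QEMU), of deriving per-VM CPU usage from the kernel’s accounting in /proc. This is not how ESX exposes its counters; it’s just the simplest concrete illustration.

      import os
      import time

      CLK_TCK = os.sysconf("SC_CLK_TCK")  # kernel clock ticks per second (usually 100)

      def cpu_seconds(pid):
          # In /proc/<pid>/stat, utime and stime are fields 14 and 15 (1-indexed).
          # Split after the last ')' so a process name containing spaces can't
          # throw off the field positions.
          with open("/proc/%d/stat" % pid) as f:
              stat = f.read()
          fields = stat[stat.rindex(")") + 2:].split()
          utime, stime = int(fields[11]), int(fields[12])
          return (utime + stime) / CLK_TCK

      def cpu_fraction(pid, interval=1.0):
          # Two samples bracket the interval; the delta is the CPU time consumed,
          # expressed as a fraction of one core.
          before = cpu_seconds(pid)
          time.sleep(interval)
          return (cpu_seconds(pid) - before) / interval

      print(cpu_fraction(os.getpid()))  # measure this script itself as a demo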
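      Second, on sampling frequency: a toy simulation with made-up quantum and sample periods shows how badly a coarse point-sampler can misjudge a VM’s true share of the CPU. Real tools accumulate counters rather than point-sample, so treat this as a simplified model of the error, not a measurement of any particular tool.

      import random

      QUANTUM_MS = 20     # illustrative hypervisor scheduling quantum
      SAMPLE_MS = 2000    # illustrative monitoring sample period
      RUN_MS = 60_000

      random.seed(1)
      # Each quantum goes to VM A with 30% probability, VM B otherwise.
      schedule = ["A" if random.random() < 0.3 else "B"
                  for _ in range(RUN_MS // QUANTUM_MS)]
      true_share = schedule.count("A") / len(schedule)

      # A point-sampler that looks only every SAMPLE_MS sees 30 of 3,000 quanta.
      samples = schedule[:: SAMPLE_MS // QUANTUM_MS]
      sampled_share = samples.count("A") / len(samples)

      print("true CPU share of VM A:    %.3f" % true_share)
      print("sampled CPU share of VM A: %.3f" % sampled_share)

      With only 30 samples in the minute, the standard error of the sampled share is roughly eight percentage points, which is a lot of slop to carry into a chargeback invoice.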
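      Third, on extraction and correlation: a small sketch of the correlation step. Given two timestamped series from different collectors (the numbers below are hypothetical), pair each sample with its nearest-in-time counterpart and reject pairs whose clocks disagree by more than some skew budget.

      from bisect import bisect_left

      def correlate(series_a, series_b, max_skew_s=0.5):
          # Pair each (timestamp, value) in series_a with the nearest-in-time
          # sample in series_b; drop pairs whose timestamps disagree by more
          # than max_skew_s seconds. Both series must be sorted by timestamp.
          times_b = [t for t, _ in series_b]
          pairs = []
          for t, value_a in series_a:
              i = bisect_left(times_b, t)
              candidates = [j for j in (i - 1, i) if 0 <= j < len(series_b)]
              j = min(candidates, key=lambda k: abs(times_b[k] - t))
              if abs(times_b[j] - t) <= max_skew_s:
                  pairs.append((t, value_a, series_b[j][1]))
          return pairs

      # Hypothetical numbers: a VM's memory samples (MB) and storage samples (MB/s).
      mem = [(0.0, 512), (1.0, 640), (2.0, 650)]
      io = [(0.1, 30), (1.2, 85), (2.05, 90)]
      print(correlate(mem, io))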

      We're currently building a test framework to instrument and collect some of this data and would be interested in your feedback on our direction and approach. Are we focused in the right places?

        • 1. Re: Resource Monitoring in Virtual Environments


          Thanks for the referral to this post and for the comments & questions on resource tracking.  I think the approach mentioned is valid so far.  Keep going forward with the effort, because this is an area that I believe is much more important than may be realized.  I make this observation as a consultant and provider of hardware and virtualization solutions who keeps getting this question from my larger customers: "How do I share the cost of the virtual servers among my user departments the way I was able to charge them for a physical server?"  This is becoming more than minor background noise in communities where virtual server adoption is growing as a response to physical server proliferation.  While I am sure Intel would prefer to see hardware go out the door (as would I), I think it is better for everyone concerned to see better utilization of the resources available, so that deployments are accepted as strategic rather than commodity and new installations employ the more efficient and, quite frankly, more interesting/larger systems.

          Ultimately, I think my customers want to get to a point where they can provide an "electric bill" type of usage report to their constituent user communities for the virtual instances that are being used and anything we can do to help get them there will further the cause of virtualization adoption.
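          Just to illustrate what I mean, here is a toy sketch of such a bill. The rate card, meters, and VM name are entirely made up; real rates would come out of each shop's pricing model.

          # Entirely hypothetical rate card; real rates come from the pricing model/SLA.
          RATES = {"cpu_core_hours": 0.04, "mem_gb_hours": 0.01, "io_gb": 0.02}

          def usage_bill(vm_name, usage):
              # usage maps each metered resource to the amount consumed this period.
              lines = [(res, amt, amt * RATES[res]) for res, amt in usage.items()]
              total = sum(cost for _, _, cost in lines)
              print("Usage bill for %s" % vm_name)
              for res, amt, cost in lines:
                  print("  %-15s %10.2f  $%8.2f" % (res, amt, cost))
              print("  %-15s %10s  $%8.2f" % ("total", "", total))

          usage_bill("payroll-vm01",
                     {"cpu_core_hours": 310.5, "mem_gb_hours": 2976.0, "io_gb": 84.2})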


          Bill S.

          • 2. Re: Resource Monitoring in Virtual Environments


            Thanks for your response. I, for one, believe that the extra efficiency and agility that virtualization brings to the data center will more than offset the consolidation impact. It's kind of crazy to believe that CIOs would have lived forever with typical server utilization in the 10-20% range. In business terms, consolidation is simply burning off excess processing inventory. Sooner or later that inventory is gone and new demand (and more efficient demand) takes off.


            But, only if IT can manage the data center like a 21st century factory. Data centers for years have been operated like warehouses. Application owners bring in servers and storage and DC operators install and operate them. Time to move them to a factory model with high degrees of automation and process control. The only way to do that is if there are outstanding solutions for monitoring, analyzing and acting on what's happening in the infrastructure.


            Jim Blakley