3 Replies Latest reply on Nov 3, 2011 6:57 AM by axero

    Let's talk virtualization


      There has long been a battle over which operating system is superior, and with virtualization technology this battle is coming to an end. The truth is that no operating system is superior to all the others. It is, for example, well known that Windows has some severe flaws at the low level when you look at things "under the hood", but it is unmatched when it comes to the abundance of software and computer games. It is also well known that ZFS, which is found in Solaris-based operating systems, is a file system that is unmatched in terms of reliability and protection against data corruption. That is a growing concern, as larger and denser storage hardware has become less reliable in the past few years (many more hard drives have failed on me than 10 years ago). I'm very concerned about these issues and I can no longer trust a hard drive in a Windows environment to reliably keep my data. Linux has many advantages in terms of resource efficiency and stability. This list of operating systems and their advantages/disadvantages can go on...


      So why should I have to choose? Why can't I take advantage of the benefits of all these operating systems and get the best of all worlds? The answer is that I can, through virtualization. In the past few years the world has seen exciting development in the Xen community, and really powerful extensions that enhance the capabilities of virtualization, such as Intel VT-x / AMD-V and Intel VT-d / AMD-Vi (IOMMU), have become widespread in desktop hardware, whereas they have been commonplace in enterprise-level hardware for quite some time.

      So it is quite evident that the role of an operating system is going to change considerably in the future. The operating system that runs on-the-metal is going to become a simple hypervisor that manages simple virtual machines. The operating systems as they are today will shrink into so-called wrappers that merely supply the frameworks required to run a particular piece of software (such as .net, Visual Runtime etc).

      So there will be a separation between the hardware and the operating systems by an abstraction layer where different wrappers (that used to be operating systems) share the underlying hardware with each other. There will no longer be a question of whether you use Windows, MacOS or Linux. You just use whatever you prefer as a base OS and use whatever is needed to run the applications you want, which in reality could mean that you run several operating systems simultaneously on the very same machine.


      This separation has already begun, and ZFS is a good example of that. The ZFS file system looks at the hard drives as a storage pool, and the user is not concerned with the physical characteristics of the partitions or where the sectors begin and end. I didn't like it at first but later found this approach to be ingenious. So I see it as a natural step that the rest of the hardware will undergo the same transition. I also think a lot can be done with the UEFI framework in this regard.


      The latest advancement in virtualization technology is the set of IOMMU extensions, which allow virtual machines to run directly on selected parts of the host hardware. This means that I can run, say, Linux on-the-metal while playing Crysis 2 on a virtual machine that runs directly on the GPUs. Here's a video showing Unigine Heaven running on a virtual Windows machine inside Ubuntu on a dual GPU setup:



      This is called PCI passthrough, where PCI slots are passed through to the virtual machine, or VGA passthrough, where the VGA BIOS mappings are sorted out as well. In another setup I may want to run Windows on-the-metal and pass through a whole hard disk controller to a Solaris machine where I run a secured storage pool with redundancy (e.g. raidz3). For ZFS to give proper protection against data corruption it is imperative that it runs directly on the hardware and not through a virtualized abstraction layer. There is currently no support for IOMMU on Windows hosts, but that will change eventually; our hopes lie with Hyper-V, VirtualBox and VMware.




      However, there is a lot to be done, and the purpose of my post in these forums is to address this. For PCI passthrough and VGA passthrough to work, the hardware must support function level reset (FLR), a feature that allows the hardware to be reset and reinitialized at any time on a running machine (i.e. at function level). FLR is standard on Quadro FX cards, and NVIDIA supplies patches that enable FLR on GeForce cards upon request.
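      On a Linux host, one way to see whether a device advertises FLR is to look at the PCI Express device capabilities printed by `lspci -vv`, where the DevCap section carries either `FLReset+` or `FLReset-`. Below is a minimal sketch that parses such output. The function name and the sample text are made up for illustration, and real `lspci` output has many more fields than shown here.

```python
import re

# Matches a device header line such as "01:00.0 VGA compatible controller: ..."
# (optionally with a leading PCI domain like "0000:").
HEADER = re.compile(r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s")

def flr_support(lspci_vv_text):
    """Map each PCI address found in `lspci -vv` text to True if it shows FLReset+."""
    result = {}
    current = None
    for line in lspci_vv_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            result[current] = False
        elif current is not None and "FLReset+" in line:
            result[current] = True
    return result

# Hypothetical, heavily shortened sample output (not from a real machine).
SAMPLE = (
    "01:00.0 VGA compatible controller: Example GPU\n"
    "\tCapabilities: [78] Express (v2) Endpoint, MSI 00\n"
    "\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0\n"
    "\t\t\tExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-\n"
    "02:00.0 Ethernet controller: Example NIC\n"
    "\tCapabilities: [a0] Express (v2) Endpoint, MSI 00\n"
    "\t\tDevCap:\tMaxPayload 512 bytes, PhantFunc 0\n"
    "\t\t\tExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+\n"
)

if __name__ == "__main__":
    for address, supported in flr_support(SAMPLE).items():
        print(address, "FLR" if supported else "no FLR")
```

      On a real machine you would feed it the text from running `lspci -vv` as root, since the capability list is usually hidden from unprivileged users.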


      Another issue is that current virtualization technologies only support passthrough of entire GPUs to virtual machines. Apart from that, GPUs can currently only be shared through emulation, which makes it impossible to run applications that rely on hardware-accelerated 3D (such as DirectX games). This situation is pretty much where virtualization was before the VT-x/AMD-V extensions were introduced: the CPU instructions had to be emulated in the VM, which severely degraded the performance of that machine. When VT-x/AMD-V came, virtual machines could be run directly on the CPU with almost no overhead at all.
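      As a side note, on Linux the CPU advertises these extensions through the `flags` line in `/proc/cpuinfo`: `vmx` for Intel VT-x and `svm` for AMD-V. A small sketch that checks for them follows. The function name is my own, and it only covers the CPU extensions, not the IOMMU (VT-d / AMD-Vi).

```python
def hw_virt_extension(cpuinfo_text):
    """Return "VT-x", "AMD-V" or None based on the flags in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:
                return "VT-x"
            if "svm" in flags:
                return "AMD-V"
    return None

if __name__ == "__main__":
    # Reads the real file when run on a Linux host.
    with open("/proc/cpuinfo") as handle:
        print(hw_virt_extension(handle.read()))
```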


      So I would like to suggest similar extensions that allow GPUs to be shared among several virtual machines, just like CPUs can be shared via VT-x/AMD-V.


      So my suggestions in short:

      • Work to make FLR support a standard feature in hardware
      • Develop Intel VT-x / AMD-V like extensions for GPUs, allowing GPU power to be shared seamlessly among VMs and the host, just like the CPU
        • 1. Re: Let's talk virtualization

          Awesome POST

          • 2. Re: Let's talk virtualization

            Thanks, I hope hardware developers will share your opinion. It has been said that a picture says more than a thousand words:








            Some advantages of using the technology I discussed above on a desktop computer:


            • Easier troubleshooting of hardware and software because of the hardware-software separation inherent in virtualization.
            • Easier recovery of an operating system that fails to boot. The hypervisor (on-the-metal OS which in my discussion above may not even be a real OS) could provide more powerful recovery tools involving snapshot management and failure analysis whenever such things happen and there will be no need for a "recovery CD" or DVD (at least not if the hypervisor is in the UEFI/BIOS and it hasn't suffered from hardware failure).
            • Easier integration with "cloud" based backup solutions: a hard disk image file is more portable than a hard disk partition. The hypervisor could synchronize the image file while the virtual machine is running. This sync could be optimized by installing guest additions in the virtual machine that pick up certain low-level notifications or similar and convey this information directly to the hypervisor for further processing. The hard disk location could even be in the "cloud".
            • Enables useful tools for stronger protection against viruses and malware: the hard disk image could be scanned externally by the hypervisor or it could be sent to a cloud service for further analysis. Also a "sandbox" like functionality can be provided (see e.g. sandboxie.com for more details).
            • Easier migration to new hardware: no reinstall of the virtual machine is necessary. Only the hypervisor needs to be reinstalled, which is not a big deal since it is considerably smaller than a regular operating system.
            • The machine could be accessed from anywhere using a remote desktop protocol (e.g. RDP, VNC or Spice). This is very useful as you don't have to shut it down when you finish up at work. You can use this machine from a public place (using a VPN of course if you're concerned about security), home or any place with internet access whenever you want to resume work.
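            The image-file synchronization mentioned in the list above could, for instance, be done by hashing fixed-size blocks of the image and uploading only the blocks whose hash changed. Here is a rough sketch of that idea. The function names and the tiny 4-byte block size are just for the demo, and a real hypervisor would more likely track dirty blocks as they are written than rehash the whole image.

```python
import hashlib

def block_hashes(image, block_size):
    """Return one SHA-256 digest per fixed-size block of the image bytes."""
    return [hashlib.sha256(image[i:i + block_size]).digest()
            for i in range(0, len(image), block_size)]

def changed_blocks(old_hashes, image, block_size):
    """Return (indices of blocks that differ from old_hashes, new hash list)."""
    new_hashes = block_hashes(image, block_size)
    changed = [i for i, digest in enumerate(new_hashes)
               if i >= len(old_hashes) or old_hashes[i] != digest]
    return changed, new_hashes

if __name__ == "__main__":
    old_image = b"AAAABBBBCCCCDDDD"
    new_image = b"AAAABxBBCCCCDDDDEE"   # one block modified, one appended
    baseline = block_hashes(old_image, 4)
    dirty, baseline = changed_blocks(baseline, new_image, 4)
    print("blocks to upload:", dirty)
```

            Only the blocks listed in `dirty` would need to go over the wire, which is what makes syncing a running machine's image practical.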
            • 3. Re: Let's talk virtualization

              For people who are interested in learning more about virtualization technology and issues related to silent data corruption (data gets corrupted on your hard drive without you knowing it), I provide links to research papers:


              Additional reading about virtualization:


              In Xen terminology, the operating system that runs on-the-metal (the host) is called dom0 (or domain 0), whereas virtual machines are called domUs.


              There are several different issues that have been worked on with the IOMMU extensions. One is passthrough of single-function vs multi-function devices. The problem used to be to get the entire multi-function device passed through to the domU, which is now resolved. Link: http://www.valinux.co.jp/documents/tech/presentlib/2009/jls/multi-function_b.pdf


              For more information about VT-d and IOMMU, the following paper is a recommended read:




              More on VGA passthrough:





              The Xen community maintains the following documentation resource pages on this subject:






              Additional information about data corruption (thanks Kebabbert for the links and info!):


              Here is a whole PhD dissertation showing that normal file systems are unreliable:




              Dr. Prabhakaran stated in this paper that he found that ALL the file systems shared


              ...ad hoc failure handling and a great deal of illogical inconsistency in failure policy...such inconsistency leads to substantially different detection and recovery strategies under similar fault scenarios, resulting in unpredictable and often undesirable fault-handling strategies.


              We observe little tolerance to transient failures;...none of the file systems can recover from partial disk failures, due to a lack of in-disk redundancy.


              Regarding shortcomings in hardware RAID:




              Detecting and recovering from data corruption requires protection techniques beyond those provided by the disk drive. In fact, basic protection schemes such as RAID [13] may also be unable to detect these problems.


              As we discuss later, checksums do not protect against all forms of corruption




              Recent work has shown that even with sophisticated RAID protection strategies, the "right" combination of a single fault and certain repair activities (e.g., a parity scrub) can still lead to data loss [19].
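              To make the parity scrub remark a bit more concrete, here is a toy sketch with XOR parity over three data blocks (RAID-5 style; the data and function names are made up). Parity alone can tell that a stripe is inconsistent but not which block is wrong, so a scrub that simply recomputes the parity bakes a silent corruption in. A per-block checksum stored separately, which is the approach ZFS takes, points at the bad block so that it can be rebuilt from the others.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def checksum(block):
    return hashlib.sha256(block).digest()

# Healthy stripe: three data blocks, one parity block, checksums kept elsewhere.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
sums = [checksum(block) for block in data]

# Silent corruption: one disk returns wrong data without reporting an error.
disks = list(data)
disks[1] = b"BxBB"

# Parity notices the mismatch but cannot say which of the blocks is bad.
stripe_consistent = xor_blocks(disks) == parity

# A naive scrub "repairs" the stripe by recomputing parity from what it reads.
scrubbed_parity = xor_blocks(disks)
looks_fine_after_scrub = xor_blocks(disks) == scrubbed_parity  # bad data remains

# Checksum-guided repair: find the bad block, rebuild it from the original parity.
bad = [i for i, block in enumerate(disks) if checksum(block) != sums[i]]
repaired = list(disks)
for i in bad:
    others = [block for j, block in enumerate(disks) if j != i]
    repaired[i] = xor_blocks(others + [parity])

if __name__ == "__main__":
    print("stripe consistent:", stripe_consistent)
    print("bad blocks:", bad)
    print("repaired matches original:", repaired == data)
```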


              CERN discusses how their data was corrupted in spite of hardware RAID:




              Here is a whole site that talks only about the shortcomings of RAID-5:




              Shortcomings of RAID-6:




              The paper explains that the best RAID-6 can do is use probabilistic methods to distinguish between single and dual-disk corruption, e.g. "there is a 95% chance it is single-disk corruption so I am going to fix it assuming that, but there is a 5% chance I am going to actually corrupt more data, I just can't tell". I wouldn't want to rely on a RAID controller that takes gambles :-)


              In other words, RAID-5 and RAID-6 are not safe at all, and if you care about your data you should migrate to other solutions. In the past the disks were small and you were much less likely to run into problems. Today, when the hard drives are big and RAID clusters are even bigger, you are much more likely to run into problems. Assume that there is a 0.00001% chance that any single operation runs into problems; if the hard drives are large and fast enough, you will run into problems quite frequently.
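              To put rough numbers on that last point: if each read independently goes wrong with probability p, the chance of at least one bad read in n reads is 1 - (1 - p)^n. A back-of-the-envelope sketch follows, where the independence, the 1 MiB read size and the two drive sizes are my own assumptions, and p is the 0.00001% from above.

```python
import math

def p_at_least_one(p, n):
    """Probability of at least one failure in n independent tries, each failing with p."""
    # Equivalent to 1 - (1 - p)**n, written so that tiny p does not lose precision.
    return -math.expm1(n * math.log1p(-p))

P_PER_READ = 0.00001 / 100     # 0.00001 % expressed as a probability
READ_SIZE = 2**20              # assumed 1 MiB per read

def reads_to_scan(drive_bytes):
    """Number of reads needed to read the whole drive once."""
    return drive_bytes // READ_SIZE

small_drive = p_at_least_one(P_PER_READ, reads_to_scan(40 * 10**9))    # 40 GB
large_drive = p_at_least_one(P_PER_READ, reads_to_scan(2 * 10**12))    # 2 TB

if __name__ == "__main__":
    print(f"40 GB drive, one full read: {small_drive:.2%}")
    print(f"2 TB drive, one full read:  {large_drive:.2%}")
```

              Under these assumptions a single full read of the small drive hits a problem well under 1% of the time, while the large drive is somewhere around 17%, and that is for reading it only once.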