How well do virtual environments really dedupe?

With the advent of flash storage, the new buzzword is “deduplication”. It is promoted as the best way to fit a large amount of VM data into a very small flash array, and EMC’s new XtremIO is no different in its claims of amazing dedupe ratios for virtual environments. This made me wonder: how efficiently do VMs truly dedupe against each other? To satisfy my curiosity, I ran a few tests in my lab using EMC’s own Dedupe Estimator tool to see what dedupe rates a virtual environment can actually deliver and how closely identical virtual machines dedupe against each other.






The goal of this trial was to validate the hypothesis that the number of duplicate virtual machines is directly correlated with the deduplication ratio we see with XtremIO. This assumes that no unique data is saved on the virtual machines: they have only the operating system installed and nothing else. The value we hope to provide is the ability to logically identify and estimate the deduplication ratios a customer can expect in their virtual environment when positioning XtremIO.



The estimate works by taking the space that the operating system is expected to occupy on a VMDK and removing that redundant data from the overall storage usage. This assumes that the OS data deduplicates against a copy of itself flawlessly. For example, five Windows Server 2008 R2 virtual machines with no user data, applications, or configuration changes should, conceptually, have a 5:1 deduplication ratio. To validate this idea, I ran several tests in a lab to compare the results of the official EMC deduplication tool against the hypothesis that two VMDKs with the same OS will deduplicate fully.



  • The remaining data in the VMDK is filled with unique data that cannot be deduplicated. We are assuming the “worst case scenario” for usable space after deduplication.
  • The customer is using the “Recommended” storage requirements for the OS as stated by the OS manufacturer.
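The model behind these assumptions can be sketched in a few lines of Python. This is only an illustration: the function name and the example sizes (a 40 GB OS footprint, 100 GB used per VMDK) are hypothetical placeholders, not figures from the lab.

```python
def estimate_dedupe_ratio(num_vms, vmdk_used_gb, os_gb):
    """Worst-case estimate: the OS data is stored once, all other data is unique.

    num_vms      -- number of VMs running the same OS
    vmdk_used_gb -- space used per VMDK (hypothetical input)
    os_gb        -- OS footprint per VM, e.g. the vendor's recommended size
    """
    if num_vms < 1 or os_gb < 0 or os_gb > vmdk_used_gb:
        raise ValueError("need num_vms >= 1 and 0 <= os_gb <= vmdk_used_gb")
    logical_gb = num_vms * vmdk_used_gb           # space before deduplication
    unique_per_vm_gb = vmdk_used_gb - os_gb       # assumed not to dedupe at all
    physical_gb = os_gb + num_vms * unique_per_vm_gb  # one OS copy + unique data
    return logical_gb / physical_gb


# Five OS-only VMs: everything dedupes, so the ratio equals the VM count.
print(estimate_dedupe_ratio(5, 40, 40))
# Five VMs with 60 GB of unique data each: the ratio drops sharply.
print(round(estimate_dedupe_ratio(5, 100, 40), 2))
```

With OS-only VMs the estimate equals the number of VMs, which is the hypothesis being tested; any unique data pulls the ratio down from there.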



The test started by creating a new iSCSI datastore on an EMC VNX5500. Then a Windows Server 2008 R2 virtual machine was created in the datastore. This was a fresh install of Windows Server 2008 R2 to serve as the control instance.

Next, a second Windows Server 2008 R2 VM was created in the same datastore. The original VM was not cloned, nor was a template used; the same process was followed as when provisioning the first VM. As soon as the installation was complete on both VMs and the desktop was accessible, each VM was powered down. (Note: We did not install or configure anything in the machines themselves in order to maintain a valid control environment.)

Next, a snapshot of the LUN with the VMs was taken from the VNX.

Next, the snapshot was mounted as a virtual RDM to a new VM we used as the platform for the EMC ‘Dedupe Estimator’ tool. This was done so that we were able to run the official EMC ‘Dedupe Estimator’ tool against the snapshot without tainting the production datastore.

The result was a dedupe ratio of 1.79:1 for the two identical VMDKs.

Next, we added another Windows Server 2008 R2 VM to the same datastore. We took a new snapshot and again mounted it as a new RDM. The same test was run again with the EMC Dedupe Estimator tool. This time we received a deduplication ratio of 2.96:1.


With two of the same VMDKs, we received a deduplication ratio of 1.79:1, which is about 10% below the 2:1 we assumed.

With three of the same VMDKs, we received a deduplication ratio of 2.96:1, which is about 1.3% below the 3:1 we assumed.
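For reference, the shortfall can be computed directly from the measured ratios, taking the expected ratio to be the number of identical VMDKs. The helper name below is just for illustration.

```python
def percent_below_expected(expected_ratio, measured_ratio):
    """Shortfall of the measured dedupe ratio, as a percentage of the expected one."""
    return (expected_ratio - measured_ratio) / expected_ratio * 100


# (number of identical VMDKs, ratio reported by the estimator)
results = [(2, 1.79), (3, 2.96)]

for num_vmdks, measured in results:
    shortfall = percent_below_expected(num_vmdks, measured)
    print(f"{num_vmdks} VMDKs: {measured}:1 measured, {shortfall:.1f}% below {num_vmdks}:1")
```

This gives roughly 10% for the two-VMDK run and a little over 1% for the three-VMDK run.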


In conclusion, we can assume that the deduplication ratio of an OS system partition could fall anywhere from 1% to 10% short of the ideal. This is most likely due to a few bits differing between the installs, which would throw off XtremIO’s 8 KB dedupe just enough to make a marginal difference. In light of this discovery, we have reduced the estimated space savings of a deduplicated VM operating system by 10%.
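To illustrate why small offsets matter to fixed-block deduplication, here is a toy sketch in Python. It is not how XtremIO or the Dedupe Estimator is implemented; it simply hashes fixed 8 KB blocks with SHA-256 and counts distinct ones, and the random "image" and the 512-byte shift are made-up stand-ins for OS data.

```python
import hashlib
import random

BLOCK_SIZE = 8 * 1024  # 8 KB fixed blocks, as in the discussion above


def dedupe_ratio(data, block_size=BLOCK_SIZE):
    """Total fixed-size blocks divided by distinct blocks (compared by SHA-256)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    distinct = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(distinct)


# Hypothetical stand-in for an OS image: 64 blocks of seeded pseudo-random bytes.
rng = random.Random(0)
image = bytes(rng.getrandbits(8) for _ in range(64 * BLOCK_SIZE))

aligned = image + image                    # second copy starts on a block boundary
shifted = image + b"\x00" * 512 + image    # second copy is offset by 512 bytes

print("aligned copies:", dedupe_ratio(aligned))
print("shifted copies:", dedupe_ratio(shifted))
```

In this toy case the aligned copies dedupe 2:1, while the shifted copy shares almost no blocks with the first. Real installs would presumably be misaligned only in places, which is consistent with losing a few percent rather than all of the savings.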
