Wednesday, December 4, 2013

Data Deduplication Sizing

While there are several products that can be used to size an EMC deduplication system for Avamar or Data Domain, only the EMC Backup System Sizer (EBSS) is an official EMC product.  It runs on Adobe AIR and therefore works on both Windows and Mac, and EMC recommends you always download the latest version.  It comes as a zip file that includes supplements and a client questionnaire.

In a hybrid Avamar/Data Domain system, we store metadata on the Avamar and manage the backup through Avamar while storing the data on the DD.  (Exchange is an exception to this rule: the database is stored on the DD and the logs are stored on the Avamar system.)  In this setup, we'll size each product separately, starting with the DD.

Data Domain sizing in the EBSS is done by selecting Avamar as the backup provider and then sizing exactly as you would for NetWorker, because Avamar only does full Exchange backups.  Avamar back-end licensing is still required, and we'll use the physical capacity from the tool, rounded down to the nearest TB.

Avamar sizing is then done by selecting "size for both Exchange and non-Exchange," regardless of whether you have Exchange in your environment.  When choosing the backup type, select the "Av-DD Meta data" option.  When sizing the base system, use the Total Backup Environment Size from the DD sizing exercise above.

In non-Exchange environments, take the Total Backup Environment Size and divide by 10,000; leave the dedupe values at 0 and set retention to the same length determined in the DD sizing.  If Exchange will be backed up, take the Total Backup Environment Size, multiply by 1.33%, and enter that value into the full size field.  Again, leave the dedupe at 0 and keep retention the same as in the DD sizing.
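As a sanity check, that arithmetic is easy to express in a few lines of Python.  This is only a sketch of the rules described above: the function name, the TB units and the 500 TB example are illustrative and not part of the EBSS, and dedupe and retention are still entered in the tool as described.

def avamar_metadata_tb(total_backup_environment_tb, has_exchange):
    """Estimate the Avamar capacity needed for Av-DD metadata, in TB.

    total_backup_environment_tb -- Total Backup Environment Size from the
                                   Data Domain sizing exercise.
    has_exchange                -- True if Exchange databases will be backed up.
    """
    if has_exchange:
        # Exchange: multiply the environment size by 1.33% and use that
        # as the "full size" value.
        return total_backup_environment_tb * 0.0133
    # Non-Exchange: divide the environment size by 10,000.
    return total_backup_environment_tb / 10000


# Hypothetical 500 TB backup environment:
print(avamar_metadata_tb(500, has_exchange=False))  # 0.05 TB
print(avamar_metadata_tb(500, has_exchange=True))   # 6.65 TB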

There are several factors that affect how well data deduplicates.  The main components are data type and retention period.  File systems, databases and email systems dedupe well, while video streams, database logs and scientific data do not.  It is also best to have retention longer than 2 weeks, or commonality is less likely to be found within the blocks.  Encrypted, compressed or multiplexed data does not deduplicate well either; these operations are typically applied on the user side, before the data reaches the backup system.

When selecting clients for deduplication, the type and amount of data again affect deduplication rates.  In general, the more data you select for deduplication, and the more that data set grows, the better your dedupe ratio will be.  Data with a high change rate is typically bad for dedupe, as is rich media, due to little commonality.  Clients with regular, frequent backups have high commonality and are therefore good candidates for dedupe.  Again, encrypted and compressed data work against deduplication.  Also note that when performing source-based deduplication, the client needs sufficient hardware resources to do the deduplication work.

Avamar and Data Domain show dedupe ratios differently.  Avamar reports dedupe as a percentage (50% reduction), while DD reports in "times" (2x reduction, etc.).  To compare:

Avamar - Data Domain
50% = 2x
80% = 5x
90% = 10x
95% = 20x
96% = 25x
98% = 50x
99% = 100x
99.7% = 333x

Note that the Data Domain ratio climbs rapidly as the percentage approaches 100%, even though the percentage itself barely moves.
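The table follows from a simple relationship: a p% reduction means only (100 - p)% of the data is actually stored, which is a 1/(1 - p/100) ratio.  A quick sketch of the conversion (the function names are just for illustration; this is not an EMC tool):

def percent_to_ratio(percent_reduction):
    """Convert an Avamar-style reduction (e.g. 95 for 95%) to a DD-style ratio."""
    return 1.0 / (1.0 - percent_reduction / 100.0)

def ratio_to_percent(ratio):
    """Convert a DD-style ratio (e.g. 20 for 20x) to an Avamar-style percentage."""
    return (1.0 - 1.0 / ratio) * 100.0

for pct in (50, 80, 90, 95, 96, 98, 99, 99.7):
    print(f"{pct}% reduction ~= {percent_to_ratio(pct):.0f}x")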

While we never quote marketing data, a general rule of thumb when a customer asks what sort of deduplication to expect is that the first full backup to a DD system will typically see a 2-4x reduction.  Subsequent full backups will typically see a 15-30x reduction for structured data such as databases, and a 25-50x reduction for non-structured data.

Incremental and differential backups can expect a 3-7x reduction for structured data, and 5-10x for non-structured.  The deduplication will depend largely on the frequency of the backup, with more regular backups resulting in the greatest reduction.  Near-line, or TSO Incremental Forever, backups can expect a 3-7x reduction as a general rule.
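To see how these rules of thumb translate into stored capacity, here is an illustrative estimate for a weekly full / daily incremental schedule.  The schedule, the midpoint ratios and the function itself are assumptions made for the example, not EMC figures; an actual quote should always come from the EBSS.

def estimated_stored_tb(full_tb, incr_tb, weeks,
                        first_full_ratio=3.0,    # first full: 2-4x, midpoint
                        later_full_ratio=22.5,   # later fulls: 15-30x (structured)
                        incr_ratio=5.0):         # incrementals: 3-7x (structured)
    """Rough post-dedupe footprint for weekly fulls plus six incrementals per week."""
    first_full = full_tb / first_full_ratio
    later_fulls = (weeks - 1) * full_tb / later_full_ratio
    incrementals = weeks * 6 * incr_tb / incr_ratio
    return first_full + later_fulls + incrementals


# Hypothetical: 100 TB fulls, 5 TB incrementals, 4 weeks of retention
print(f"{estimated_stored_tb(100, 5, 4):.1f} TB stored")  # ~70.7 TB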

Typical Avamar deduplication rates are similar: you can expect to see a 70% reduction for non-structured data and a 35% reduction for structured data across all backed-up clients, in environments with a 0.3% daily file change rate and a 3-5% structured change rate.

When sizing a system for deduplication, be aware of data with high change rates, high growth rates and any challenging data types.  Also be aware that there is a difference between raw capacity and usable capacity; size the system for *usable* capacity, not raw.  As a rule of thumb, only size systems out to 80-90% utilization, never to 100%.  In environments with high change and growth rates, your best results will come from using the DDA:A and DDA:B tools.

When sizing systems, remember that low retention is not good for deduplication.  The best retention schedule will age data off in 3-6 months, and longer-term retention becomes unpredictable at best because there is no way to account for all possible events and growth.  It is acceptable to model a 2-3 year growth rate using the EBSS, but remember that this is an estimate based on current system deployment and doesn't take into account new systems or expanded function.
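The "usable, not raw" and "80-90% utilization" guidance, together with a modeled growth rate, can be checked on the back of an envelope.  The 10% annual growth and 85% ceiling below are assumptions chosen for the example; the EBSS remains the tool of record.

def required_usable_tb(post_dedupe_tb, years=3, annual_growth=0.10,
                       max_utilization=0.85):
    """Usable capacity needed so projected data stays under the utilization ceiling."""
    projected = post_dedupe_tb * (1 + annual_growth) ** years
    return projected / max_utilization


# Hypothetical: 70 TB post-dedupe today, modeled out 3 years at 10% growth
print(f"{required_usable_tb(70):.1f} TB usable")  # ~109.6 TB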

It is nerd-nature to want to run the biggest, most powerful system available whether it is required or not.  This will certainly work against closing the deal from a cost perspective.  EMC recommends focusing on a desired and reasonable backup window and meeting that need rather than over-sizing the system.  Also, make sure the customer's infrastructure will handle the increased load; this is most commonly the limiting factor and a major issue with deployment.  Remember, too, that while a throughput level may be achievable in bursts, it may not be sustainable long enough to cover a backup window.  Shoot to the low side of network utilization when sizing systems and suggest upgrades where necessary, and don't forget that multiplexing is typically bad for deduplication.  Deduplication is not free.  There is never a 24/7 workload, and the backup system will need time for system maintenance and garbage collection, which comes at a cost to system performance.
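A simple throughput check helps frame the backup-window conversation.  The nightly volume and window below are illustrative inputs, not defaults from any EMC tool, and the result is a *sustained* rate, before maintenance and garbage collection take their share of system time.

def required_throughput_mbs(nightly_backup_tb, window_hours):
    """Sustained MB/s needed to move the nightly backup within the window."""
    megabytes = nightly_backup_tb * 1024 * 1024
    return megabytes / (window_hours * 3600)


# Hypothetical: 10 TB through an 8-hour window
print(f"{required_throughput_mbs(10, 8):.0f} MB/s sustained")  # ~364 MB/s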

When sizing a system for replication, remember that simply having bandwidth between sites does not guarantee that bandwidth is available.  We need to determine not only how much bandwidth a customer has, but also how that bandwidth is being used currently.
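A similar check applies to replication: the question is how long the daily post-dedupe change takes over the bandwidth that is actually free.  The 30% availability figure below is an assumption for the example; measure the customer's real utilization before committing to anything.

def replication_hours(daily_change_gb, link_mbps, available_fraction=0.30):
    """Hours to replicate the daily post-dedupe change over the free bandwidth."""
    usable_mbps = link_mbps * available_fraction           # megabits per second
    seconds = (daily_change_gb * 8 * 1024) / usable_mbps   # GB -> megabits
    return seconds / 3600


# Hypothetical: 200 GB of post-dedupe change, 100 Mb/s link that is 70% busy
print(f"{replication_hours(200, 100):.1f} hours")  # ~15.2 hours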
