Data Domain uses standard protocols to simplify administration and integration into existing networks. It can connect to systems via either Fibre Channel or Ethernet.
When using Fibre Channel, the DD presents a Virtual Tape Library (VTL) to the backup server. When using Ethernet, the backup server mounts the DD like any other disk share using CIFS or NFS. DD can also use DDBoost, which requires a software agent installed on the client, and NDMP, which is essentially VTL over Ethernet instead of FC.
DD Replicator is a licensed feature, and when used on two DD appliances, replication and backup can take place simultaneously. VTL is also a licensed feature.
Thursday, October 31, 2013
Data Domain Core Technologies
Data Domain uses two core technologies: SISL and DIA (Data Invulnerability Architecture).
SISL is Stream Informed Segment Layout. This is the target-based inline deduplication process that focuses on speed instead of 100% accuracy, although SISL does provide 99% accuracy in limiting duplicated data segments. There is minimal disk access because the process takes place in RAM, and a faster CPU increases SISL efficiency. You don't spend any time waiting on spindle speed.
The process works like this (a rough code sketch follows the list):
- Streams data into RAM
- The stream is segmented into 4-12 KB chunks
- Fingerprints are created from the segments
- Verify if segments are unique using two functions
- Summary vector - a list of segments from disk that are predictively selected
- Segment locality - because data follows predictable patterns, DD selects segments for the summary vector that are most likely to be used based on the fingerprint list in RAM
- Store unique segments
- End-to-end Verification - after writing new segments to disk, DD OS performs a read of the data to ensure it can reassemble the file, and computes and verifies the checksum
- Fault Avoidance and Containment
- New data never overwrites existing data. Regenerating data is fast because data is never missing due to overwrite. Deleted data is removed during disk cleanup.
- DD uses NVRAM to buffer all data not yet written to disk
- Fault Detection and Healing - DD OS uses the logging file system and RAID6 disks to verify the integrity of the data on every read. It computes and verifies the checksum each time data is accessed
- File System Recovery - because data is never overwritten, there are no block maps or reference counts required. DD OS merely needs to find the head of the log file and it can rebuild the filesystem. Once it knows where the log file is, it scans the log and rebuilds data using RAID6 where necessary
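To make the flow above concrete, here is a minimal Python sketch of the inline path: segment a stream, fingerprint each segment, check an in-memory structure before touching disk, store only unique segments, and then read the data back to verify it reassembles. The fixed 8 KB segment size, SHA-1 as the fingerprint, and a plain set standing in for the summary vector are illustrative assumptions, not Data Domain internals.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024   # fixed size for illustration; real SISL segments are variable, 4-12 KB

def fingerprint(segment: bytes) -> str:
    # SHA-1 is an assumption here; the point is simply a strong hash per segment.
    return hashlib.sha1(segment).hexdigest()

def ingest(stream: bytes, summary_vector: set, segment_store: dict) -> list:
    """Inline dedupe in RAM: write a segment to 'disk' only if its fingerprint is unknown."""
    recipe = []                                    # ordered fingerprints needed to rebuild the stream
    for off in range(0, len(stream), SEGMENT_SIZE):
        seg = stream[off:off + SEGMENT_SIZE]
        fp = fingerprint(seg)
        if fp not in summary_vector:               # membership test happens in memory, not on disk
            summary_vector.add(fp)
            segment_store[fp] = seg                # store the unique segment
        recipe.append(fp)                          # duplicates become pointers only
    return recipe

def verify(recipe: list, segment_store: dict, original: bytes) -> bool:
    """End-to-end verification: re-read the stored segments and confirm the stream reassembles."""
    rebuilt = b"".join(segment_store[fp] for fp in recipe)
    return rebuilt == original

store, summary = {}, set()
data = b"A" * SEGMENT_SIZE * 10 + b"B" * SEGMENT_SIZE * 5   # 15 segments, only 2 unique
recipe = ingest(data, summary, store)
assert verify(recipe, store, data)
print(f"segments referenced: {len(recipe)}, unique segments stored: {len(store)}")
```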
Data Domain Deduplication
Deduplication is the main benefit of Data Domain appliances. The basic premise is that we are eliminating redundant data, storing only one instance of each segment and using pointers to take the place of the duplicates. Pointers consume much less space than the actual data, so there is a significant reduction in the amount of disk required.
Deduplication uses a hashing algorithm to generate a hash value for each data segment. An index of the hash values is kept for quickly referencing when comparing new data to existing data. If a match is found in the hash table, only a pointer to the original will be kept on disk. "Hash," "fingerprint" and "checksum" are all synonyms to EMC.
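To get a rough feel for why the pointers are so cheap, assume (purely for illustration) 8 KB segments and 20-byte fingerprints: even data that is entirely duplicate still costs one fingerprint per segment, but that index entry is roughly 400x smaller than the segment it replaces.

```python
# Illustrative space accounting only: assumes 8 KB segments and 20-byte fingerprints.
SEGMENT = 8 * 1024
FINGERPRINT = 20

def stored_bytes(total_segments: int, unique_segments: int) -> int:
    """Unique segment data plus one small index/pointer entry per segment referenced."""
    return unique_segments * SEGMENT + total_segments * FINGERPRINT

raw = 1_000_000 * SEGMENT                  # roughly 7.6 GiB of incoming backup data
dedup = stored_bytes(1_000_000, 50_000)    # assume only 5% of the segments are unique
print(f"raw: {raw / 2**30:.1f} GiB, stored: {dedup / 2**30:.2f} GiB, ratio: {raw / dedup:.1f}x")
```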
File-based deduplication compares entire files and keeps only one instance of each file. There is a slight reduction in space when there are multiple copies of the same file in a filesystem; however, once a change is made to one of those copies, the entire document is stored again. This is an inefficient method of providing deduplication.
Fixed-length deduplication breaks data down into fixed-length segments and replaces duplicate segments with pointers. Because the comparison happens at a more granular level than file-based dedupe, the reduction in actual data storage is significant. However, when data is added or modified, the segment stream is broken up and data changes on disk. This requires reprocessing of the data to accommodate the new data. While an improvement over file-based dedupe, it is still somewhat inefficient. "Fixed-length" is also known as "block-based" or "fixed-length segment" deduplication, and is the method most deduplication products employ today.
Variable-length deduplication is a more efficient means of deduplicating data: each data stream is analyzed, common patterns are found, and only unique patterns are stored on disk. The duplicate patterns are replaced with pointers. This is the method Data Domain and Avamar use for storing data. "Variable-length," "variable segment size" and "variable block" are all synonyms.
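Variable-length segmentation is typically implemented with content-defined chunking: a rolling hash slides over the stream and a boundary is cut wherever the hash matches a pattern, so an insert shifts only the boundaries near the edit instead of every segment that follows. The sketch below illustrates the general idea with a gear-style rolling hash and roughly 8 KB average chunks; it is not Data Domain's or Avamar's actual algorithm.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed pseudo-random value per byte

AVG_MASK = (1 << 13) - 1                  # a boundary hits roughly every 8 KB on random data
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

def chunks(data: bytes) -> list:
    """Content-defined chunking with a gear-style rolling hash (generic sketch)."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        # Cut where the rolling hash hits the mask (once the chunk is long enough), or at the hard cap.
        if (length >= MIN_CHUNK and (h & AVG_MASK) == 0) or length >= MAX_CHUNK:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

# Inserting a few bytes near the front only disturbs the chunk containing the edit;
# the chunks after it keep the same content, so their fingerprints still deduplicate.
original = bytes(random.getrandbits(8) for _ in range(200_000))
edited = original[:100] + b"INSERTED BYTES" + original[100:]
before, after = set(chunks(original)), set(chunks(edited))
print(f"chunks unchanged after the edit: {len(before & after)} of {len(before)}")
```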
Inline deduplication takes place in real time, before the data is written to disk, whereas post-process deduplication takes place once the data is on the disk. Inline is more efficient from a disk utilization perspective, but requires more CPU and memory. Post-process dedupe also requires more disk space and administrative overhead, since the staging area of the system needs to be monitored for capacity.
Source-based dedupe uses a client or piece of software on the system being backed up to hash data blocks before sending them. This process requires more processing from the client, but saves considerably on network utilization. Target-based dedupe is where all data is sent from the backup client to the backup device, where data segments are analyzed and unique data written to disk. This is more efficient for the system being backed up, but requires more network bandwidth. Data Domain natively uses target-based dedupe, but with the addition of DDBoost is able to accommodate source-based deduplication.
Global Compression is not really compression, but is what Data Domain calls deduplication. Local compression is normal, on-disk compression using the lz, gz or gzfast algorithms in Data Domain. Delta compression is what takes place during replication: the source sends a hash list of changed data to the replication target, the target sends back a list of the segments it does not already have, and the source then sends only the new segments. This lets Data Domain identify data similar to what already exists on the other side and send just the change.
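A small sketch of that exchange, with SHA-1 fingerprints and an in-memory dictionary standing in for the replication target (both illustrative assumptions): the source offers fingerprints of the changed segments, the target answers with the ones it does not already hold, and only those segment bodies cross the wire.

```python
import hashlib

def fp(segment: bytes) -> str:
    return hashlib.sha1(segment).hexdigest()   # SHA-1 stands in for the segment fingerprint

def target_missing(target_index: dict, offered: list) -> set:
    """Replication target: answer with the fingerprints it does not already store."""
    return {f for f in offered if f not in target_index}

def replicate(target_index: dict, changed_segments: list) -> int:
    """Replication source: offer fingerprints first, then ship only what the target asked for."""
    offer = [fp(s) for s in changed_segments]          # cheap: only hashes cross the WAN
    wanted = target_missing(target_index, offer)       # the target's reply
    shipped = 0
    for seg in changed_segments:
        if fp(seg) in wanted:
            target_index[fp(seg)] = seg                # only unique segment bodies are sent
            shipped += 1
    return shipped

target = {fp(b"segment the target already has"): b"segment the target already has"}
sent = replicate(target, [b"segment the target already has", b"segment only the source has"])
print(f"segments shipped over the wire: {sent} of 2")
```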
DD boasts a 10-30x reduction in disk utilization, but I can tell you that I have seen small, simple CIFS shares reduce by better than 90% in the wild (a 10x reduction works out to 90% space savings).
Backup and Recovery Manager Overview
EMC's BRM (Backup and Recovery Manager) is not highly discussed at this point, as it is currently a monitoring tool only, used to monitor EMC products such as Avamar, Data Domain and NetWorker in a single interface.
It is currently in version 1.0, and the only configuration it is capable of is for Avamar replication. It does, however, integrate with much of the reporting and monitoring capability of the other products as well. A unique feature of BRM is its built-in search function, which none of the other monitoring products have. It is currently a less-capable DPA, essentially.
NetWorker Overview
NetWorker is EMC's flagship software backup application. The main benefit of NetWorker is that it has a large number of application modules to cover a wide range of backup requirements.
NetWorker will back up to tape and disk like traditional backup software, but it also has the ability to integrate with snapshot and replication technology, such as that used by VNX and Symmetrix. It integrates with Avamar and Data Domain for deduplication, and with VADP for image-level backup of ESX environments. It also integrates with Data Domain DD Boost and Avamar proxies to accommodate remote office backup, and it can manage remote backup jobs in instances where a complete backup system is in place in other locations.
It also provides data security in a future-proof manner by utilizing the open tape format for data storage. It is used mainly in large-scale, complex deployments in heterogeneous environments.
Avamar Overview
Avamar is an end-to-end solution for backup and recovery in that it comprises both a software component and a hardware component. It uses "Global Deduplication," which means deduplicating data across all clients and all nodes in the backup system. Avamar appliances also provide data redundancy by utilizing RAIN (Redundant Array of Independent Nodes): much like RAID uses multiple disks to stripe data with parity, RAIN uses multiple appliances to stripe data. Avamar comes as a physical appliance (Data Store) or a Virtual Edition to be used in VMware environments.
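To give a feel for the RAIN idea (and only the idea), here is a toy sketch that stripes data across several nodes plus a single XOR parity node, so any one lost node can be rebuilt from the survivors. Real Avamar RAIN is considerably more involved; the node count, stripe layout and XOR parity here are illustrative assumptions.

```python
from functools import reduce

def stripe(data: bytes, data_nodes: int) -> list:
    """Split data across N data nodes and add one XOR parity node (toy RAIN-style layout)."""
    size = -(-len(data) // data_nodes)                 # ceiling division
    shards = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(data_nodes)]
    parity = bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*shards))
    return shards + [parity]

def rebuild(nodes: list) -> bytes:
    """Recover the shard on the failed node by XOR-ing all surviving nodes together."""
    survivors = [shard for shard in nodes if shard is not None]
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*survivors))

nodes = stripe(b"avamar backup data spread across the grid", data_nodes=4)
lost_copy, nodes[1] = nodes[1], None                   # simulate losing one node
assert rebuild(nodes) == lost_copy
print("lost node rebuilt from the surviving nodes and parity")
```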
Avamar is a client-side-only dedupe process, which reduces the backup window by transferring only changed data and also reduces network utilization. Avamar also uses incremental backups, but each backup appears as a full backup, reducing the time to recover and restore.
Avamar software consists of backup agents to be installed on clients, and there are specialized clients for several database and application vendor products such as Microsoft Exchange and SharePoint. The Enterprise Manager allows administrators to configure, manage and monitor the Avamar system, and the Avamar Administrator allows administrators to manage virtual machine backups through VMware.
Avamar is tightly integrated with VMware, using VADP to perform image-level backups; agents can also be installed on the VM itself to provide finer granularity and take advantage of the application-specific agents.
Avamar also has an appliance to assist with NAS backup: the NDMP Accelerator. It stores no client data, but transparently sends data from the NAS to the Avamar server for backup. This makes it faster and easier to accommodate the large data sets normally associated with NAS devices, and it accommodates the proprietary operating systems they utilize. The NDMP Accelerator is also recommended for remote office installations.
Wednesday, October 30, 2013
Data Domain Overview
Data Domain is another component in EMC's backup recovery portfolio. The Data Domain appliance is mainly disk storage with deduplication built in.
One differentiating factor is that DD deduplicates in a CPU-centric, inline method so that data is written to disk only once. They call this SISL, or Stream Informed Segment Layout. This is in contrast with a traditional dedupe product that uses post-process deduplication, which works on data that has already been written to disk. Post-process may seem more efficient, but spindle speed becomes a limiting factor, and more disk is required because data is written and then re-written after being processed.
Ingest speed is the rate at which data can be brought into a backup system.
Data Domain works with and is certified with most backup application vendors because it works over standard protocols such as Ethernet or Fibre Channel. It simply presents a CIFS or NFS share to the backup server and is used like any other backup-to-disk target, except that DD deduplicates data inline. It can also simulate a tape library (presenting itself as a VTL, a Virtual Tape Library) to be used with existing backup software, without requiring reconfiguration of existing backup jobs. They simply run faster.
Another component of Data Domain is the DDBoost software, which is a client-loaded piece of software that performs the deduplication on the client. This is advantageous when backing up remote data centers or branch offices where bandwidth is at a premium.
DD also utilizes something they call "Data Invulnerability Architecture." While utilizing RAID6 to protect against disk failure, DD also performs many checksums against the data to ensure it is sound. When deduplicating data, a single corrupt block can have far-reaching effects, since that block may be used in several files. DD self-heals and repairs data corruption.
DPA - Data Protection Advisor - Overview
EMC's Data Protection Advisor is one of the few customer-installable software packages from EMC. It is a monitoring tool only, and will keep metrics on all aspects of the backup system. It will touch storage, network, and applications, and provide analysis and reporting, alerting, troubleshooting assistance and capacity planning. It provides no ability to modify configurations, however.
The main component of DPA is its Proactive Analysis Engine. This software runs continually against the data being collected and stored in DPA's database to provide comprehensive visibility into the backup environment. It gives a higher rate of compliance satisfaction because of the ability to quickly see which jobs are running, ran, or failed, and which component most likely caused any failure. It allows a reduction in resources by measuring the capacity of disks and tapes in a backup system.
It is targeted at very large, very complex data centers with multiple backup technologies in heterogeneous environments. The use case discussed includes some 87,000 backup jobs across more than 4,000 clients.
EMC Backup Recovery Technology Architect
Well, it's been a while since my last technology challenge, which ended in a CCNA R&S certification. I've been busy this summer enjoying my family and what little sunshine we get here in Minnesota, but now it's time to get back to work.
I've been working with many of the technologies I had explained earlier this year - EMC storage, Cisco UCS and network infrastructure, VMware virtualization. It's been a whirlwind of geeky goodness and now I'm back to get some EMC certification. I am already EMCISA certified, which is the associate-level EMC certification. I'm now attempting Backup Recovery Architect certification, and I'd like to get this done before the end of the year.
I have 2 exams to take, E20-329 and E20-591. I have access to some EMC video training. I'll be narrating my training sessions back here on my tech blog, because I am a believer in Charlotte Mason's education philosophy - and it worked really well for the CCNA.
So here we go - EMC Backup Recovery Technology Architect!