Tuesday, December 10, 2013

Design Impact of Avamar System Activities

Avamar runs several system operations that affect the system's performance.

The backup window is the period of time allocated to running backup jobs, during which no maintenance operations are run.  It is best practice to schedule backups so that they complete early in the backup window and do not risk running into the blackout window.  Use the default window where possible, and be aware of time zones when backing up desktops and laptops.

The Blackout Window is the period when activities that require unrestricted access to the Avamar server take place, such as garbage collection and checkpoints.  During this time, the server is read-only and no backups are allowed.  Use the default window where possible, and make sure it leaves sufficient time for garbage collection to finish, since any change to the window will significantly impact the server's performance.

After the Blackout Window comes the Maintenance Window, during which HFS checks run.  The server can run backups at this time, but performance will be slow.  Restores can also be performed.  It's best to keep to the default schedule, which allows adequate time for completion.  It's also best to limit ad-hoc maintenance activity and to avoid running backups during this window.

Replication should be scheduled to run regularly, typically at the end of the backup window.  Since most backups finish within 1-2 hours of starting, the rest of the window can be scheduled for replication.  If client backup throughput is limited, the two can overlap.  The goal is to schedule enough time for all daily replication to complete within 4 hours of starting, allowing for peaks of 8 hours.  For high-latency WAN networks (greater than 100 ms), tune TCP buffers on both sides of the connection.
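The replication targets above can be checked with a few lines of Python.  This is only a sketch: the function names and the status labels are made up for illustration and are not part of any Avamar tool.  The 4-hour target, 8-hour peak, and 100 ms latency figure are the ones given in this post.

```python
from datetime import timedelta

# Figures taken from this post; adjust to your own environment.
TARGET_DURATION = timedelta(hours=4)   # normal daily replication goal
PEAK_DURATION = timedelta(hours=8)     # acceptable occasional peak
HIGH_LATENCY_MS = 100.0                # above this, tune TCP buffers


def classify_replication(duration: timedelta) -> str:
    """Label a replication run against the 4-hour target and 8-hour peak."""
    if duration <= TARGET_DURATION:
        return "within target"
    if duration <= PEAK_DURATION:
        return "peak"
    return "too long"


def needs_tcp_tuning(round_trip_ms: float) -> bool:
    """True when WAN latency is greater than 100 ms."""
    return round_trip_ms > HIGH_LATENCY_MS


if __name__ == "__main__":
    print(classify_replication(timedelta(hours=3, minutes=30)))
    print(needs_tcp_tuning(140.0))
```

Feeding this the start and end times from the daily replication log gives a quick way to spot a job that is drifting from the target toward the peak.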

Avoid using the --include parameter to specify which clients participate in replication; instead, use --exclude for the particular clients that should not participate.  This makes it less likely that a system that should be replicated is mistakenly omitted.  Always set the --retention-type option to replicate all retention types, including "none".

Configure the destination to perform a checkpoint directly after a successful replication so that you have a reliable rollback point that includes the replicated data.  Increase the time-out period for the initial replication, then shorten it once replication normalizes.

Server performance degrades as the server fills up.  Daily maintenance takes longer the fuller the system is, so it's best to limit storage utilization to 80% and confirm that garbage collection is completing.  Then monitor the daily change rate and backup retention to avoid passing the read-only threshold.

The Avamar Server has some important thresholds for capacity:

The gsan process is allocated 65% of the entire system.  Within the gsan view, warnings begin once you hit 80% utilization and continue until you hit 95%, when new backups are suspended and pop-up alerts appear when logging on to Administrator.  At 100% utilization, the server is read-only and tech support needs to be engaged to assist with reducing usage.

When OS capacity hits 85%, garbage collection stops running until checkpoints start to roll off, at which point it resumes.  HFS checks stop running at 90% full, and at 96% checkpoints are no longer created either.
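The thresholds above can be written down as a small lookup.  The following Python is a minimal sketch, assuming you already have the gsan and OS utilization percentages from your own monitoring; the function names and status strings are hypothetical, and the percentages are simply the ones quoted in this post.

```python
# Thresholds as quoted in this post; confirm them for your Avamar version.
GSAN_WARNING = 80.0      # warnings begin
GSAN_SUSPEND = 95.0      # new backups are suspended
GSAN_READ_ONLY = 100.0   # server becomes read-only

OS_GC_STOPS = 85.0          # garbage collection stops (until checkpoints roll off)
OS_HFS_STOPS = 90.0         # HFS checks stop
OS_CHECKPOINT_STOPS = 96.0  # checkpoints stop


def gsan_status(utilization_pct: float) -> str:
    """Map gsan utilization to the state described in the post."""
    if utilization_pct >= GSAN_READ_ONLY:
        return "read-only"
    if utilization_pct >= GSAN_SUSPEND:
        return "backups suspended"
    if utilization_pct >= GSAN_WARNING:
        return "warning"
    return "ok"


def os_blocked_activities(utilization_pct: float) -> list[str]:
    """List the maintenance activities that stop at this OS utilization."""
    blocked = []
    if utilization_pct >= OS_GC_STOPS:
        blocked.append("garbage collection")
    if utilization_pct >= OS_HFS_STOPS:
        blocked.append("HFS check")
    if utilization_pct >= OS_CHECKPOINT_STOPS:
        blocked.append("checkpoint")
    return blocked


if __name__ == "__main__":
    print(gsan_status(82.5))
    print(os_blocked_activities(91.0))
```

Note that the two scales are separate: the first function applies to utilization within the gsan view, the second to OS capacity.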

Checkpoints run twice daily.  While a checkpoint is running, no other maintenance activities will begin; they are queued to start once the checkpoint completes.  As the Avamar becomes more full, checkpoint overhead grows faster and system disk utilization increases because the stripes are more full and get reused.  It's best to keep storage utilization below 80% and to make sure garbage collection completes daily.  Also, leave checkpoint creation and retention at the defaults: keep the last two checkpoints, with the last one validated.

HFS checks validate checkpoint integrity.  While an HFS check is running, other maintenance jobs except garbage collection can start, and all backup work orders are queued.  There must be 2 or fewer backups running per storage node for the check to begin.  The rolling HFS check runs daily.

Garbage collection is the single most important process for keeping the Avamar in a consistent state.  While garbage collection is running, the system is read-only and no other maintenance or backup can run.  Conversely, if a backup is still running, garbage collection will not begin.  Garbage collection can free about 10 GB/node/hour once the Avamar is in a steady state, but this takes longer as utilization of the Avamar grows.
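As a rough sizing aid, the 10 GB/node/hour figure can be turned into a time estimate.  This Python sketch assumes a steady-state system and a linear rate, which is optimistic on a fuller Avamar; the function name is made up for illustration.

```python
# Steady-state rate quoted in this post; real rates drop as utilization grows.
GC_RATE_GB_PER_NODE_HOUR = 10.0


def estimate_gc_hours(gb_to_reclaim: float, storage_nodes: int,
                      rate: float = GC_RATE_GB_PER_NODE_HOUR) -> float:
    """Estimate hours of garbage collection needed to free the given space."""
    if storage_nodes <= 0:
        raise ValueError("storage_nodes must be positive")
    if rate <= 0:
        raise ValueError("rate must be positive")
    return gb_to_reclaim / (rate * storage_nodes)


if __name__ == "__main__":
    # 120 GB to reclaim on a 4-node grid at 10 GB/node/hour
    print(estimate_gc_hours(120, 4))  # 3.0
```

If the estimate does not fit inside the blackout window, that is a sign to revisit retention or daily change rate instead of shrinking the window.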

When replicating from a source Avamar server, all other maintenance activities can start and all backup work orders are queued.  When receiving data, garbage collection cannot start and all backup work orders are queued.  Be sure to schedule replication and confirm that it completes daily to ensure that there is system redundancy.
