Thursday, November 7, 2013

Networker Overview

EMC's Networker is their flagship backup application.  It is software that integrates with a large number of backup storage systems and clients, and has plug-ins for a large number of applications.

A Data Zone comprises a Networker server (which contains a server node and a backup client) and one or more storage nodes, which can reside on the server node or be separated from it.  The storage node writes data to media and reads from media during restores.  The clients generate backup data along with tracking information regarding what data is located where.

Control Data is sent along with backup data.  It is kept in the following locations:
  • /nsr/index/ - Client File Index (CFI), recording which data is in which save set
  • /nsr/mm/ - Media Database, which tracks volumes and save sets
  • /nsr/res/ - resource files
  • On Windows hosts these are located in <drive>:\Program Files\Legato\nsr
Resource files include the nsrdb, which is a collection of text files (not an actual database) describing clients, devices, etc.  nsrladb is the directory that holds resource info for the RPC ports available for use, and jobsdb holds statistics and information on backup jobs.
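
A quick way to keep these locations straight is to check for them directly.  A minimal Python sketch, using only the paths from the notes above (adjust for your install; the Windows location differs as noted):

  import os.path

  # Control-data locations on a UNIX Networker host, per the notes above;
  # on Windows these live under <drive>:\Program Files\Legato\nsr instead.
  CONTROL_DIRS = {
      "client file index": "/nsr/index",
      "media database":    "/nsr/mm",
      "resource files":    "/nsr/res",
  }

  for label, path in CONTROL_DIRS.items():
      state = "present" if os.path.isdir(path) else "missing"
      print(f"{label:18} {path:12} {state}")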

The Media Database lists all media available to the Networker server and clients, as well as info on all save sets.

The Client File Index is where clients send tracking info regarding which data is stored on which node and in which save set.

The NMC, or Networker Management Console, is a Java application that can configure and manage multiple Networker Servers and Data Zones.  It requires Networker 7.3 or newer.

Core Networker Daemons:

NW Client:
  • nsrexecd - listens for backup requests, authorizes them and determines which RPC ports to use
  • nsrfsra - the utility for browsing the client filesystem
Storage Node:
  • nsrsnmd - listens for RPC commands and spawns processes
  • nsrmmd - PRIMARY PROCESS which receives data from the client and writes it to storage, and also sends tracking data to the server.  Multiple nsrmmd processes can run at one time for multiple backup jobs
  • nsrlcpd - provides resources to control libraries
Server Node:
  • nsrd - PRIMARY PROCESS, starts schedule, starts jobs, manages resource database
  • nsrindexd - clients talk to this process to write client file index (tracking) data
  • nsrmmdbd - talks to the media database
  • nsrjobd - coordinates scheduled jobs, manages the resulting info
  • nsrmmgd - handles library operations
Console Server:
  • httpd - web server
  • gstd - PRIMARY PROCESS, runs the GUI and talks to the server process
  • dbsrv12 - manages the database and gathers data
BACKUP PROCESS FLOW:
  1. nsrd on the Networker server starts backup on the client
    1. nsrd->nsrjobd->nsrexecd->save
  2. save command on client figures out where to put data
    1. save->nsrjobd->nsrd
  3. Storage node is matched to a job and storage media is mounted
    1. nsrd->nsrmmd->nsrd
  4. Server tells the client where to send data
    1. nsrd->nsrjobd->nsrexecd->save
  5. Client sends data and tracking info simultaneously
    1. save->nsrindexd and save->nsrmmd
  6. Storage node sends tracking info to its media database
    1. nsrmmd->nsrmmdbd
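
The whole flow condenses nicely into data for review purposes.  This is only a toy Python model of the message hops listed above - no real RPC is involved:

  # Toy model of the backup flow: (description, chain of daemons/commands).
  FLOW = [
      ("server starts backup on client",       "nsrd -> nsrjobd -> nsrexecd -> save"),
      ("save figures out where to put data",   "save -> nsrjobd -> nsrd"),
      ("storage node matched, media mounted",  "nsrd -> nsrmmd -> nsrd"),
      ("server tells client where to send",    "nsrd -> nsrjobd -> nsrexecd -> save"),
      ("data and tracking sent simultaneously", "save -> nsrindexd, save -> nsrmmd"),
      ("tracking info to media database",      "nsrmmd -> nsrmmdbd"),
  ]

  for step, (what, chain) in enumerate(FLOW, start=1):
      print(f"{step}. {what}: {chain}")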

Wednesday, November 6, 2013

E20-591 Scheduled

I scheduled the E20-591 EMC exam this afternoon for Friday.

I've got a bit of review and work to do yet, but I'm hopeful that it will go well.

Data Protection Advisor Reporting

A few notes on the reporting capability of DPA.  DPA offers real-time visibility into the entire backup environment.  Alerts can be sent via SMTP or SNMP, and DPA will flag missed SLAs, backup jobs that run longer than average, and data loss exposure.
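
For a sense of the SMTP path, here is a minimal Python sketch of the kind of alert mail DPA can emit.  The host and addresses are placeholders, and this is a generic illustration of the mechanism, not DPA's own code:

  import smtplib
  from email.message import EmailMessage

  # Hypothetical values -- DPA sends these itself once an SMTP gateway is
  # configured; this only illustrates the mechanism.
  msg = EmailMessage()
  msg["Subject"] = "DPA alert: missed backup SLA"
  msg["From"] = "dpa@example.com"
  msg["To"] = "backup-admins@example.com"
  msg.set_content("Backup job 'nightly-fs' exceeded its SLA window.")

  with smtplib.SMTP("mailhost.example.com") as smtp:
      smtp.send_message(msg)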

The Capacity report trends and forecasts system utilization, such as tape or disk capacity.

The Change Management report tracks changes made throughout the entire system and correlates them across the environment.

The Backup Report Card is an at-a-glance report showing backup of every system over a period of time in blocks.  Green = good, red = bad.  Allows drilling into backup job for details and statistics as well.

The DPA system assists in troubleshooting by correlating data from one end of the backup system to the other.  It brings together data from each system in the backup job's life, giving visibility into problem areas for rapid resolution.

Data Protection Advisor Architecture

DPA comprises 3 components as of version 6.0: an application server, a data store and an agent.  The Application Server and Datastore can be installed on the same box or separated for scalability.

The application server is the main interface and replaces several services previously used by DPA, such as the Controller, the Reporter, the Listener, the Publisher and the Analysis Engine.  The data store is a PostgreSQL database and is embedded in the product, replacing the Illuminator, Configuration and Datamine databases.  This gives tighter integration and better performance.

The Data Collection Agent gathers data.  It is automatically installed on the Datastore and Application Server, and can also collect data remotely using a proxy server for systems for which there is no installable agent, such as data switches and Fibre Channel switches.  Linux servers require an agent to be installed and cannot proxy data.  DPA version 6 supports the newer agents as well as DPA 5.5 and newer collectors, and sends the data via HTTP and XML.

DPA uses REST (REpresentational State Transfer) to interface its components.  The CLI, the GUI and the agents all communicate with the App server through the REST API.  The REST API also communicates with the messaging service and the Datastore.  DPA 6.0 uses a high-performance messaging bus, which utilizes an append-only journal that fits in a single disk cylinder for performance.  REST also launches responses to events in the UI.
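
Since everything goes through the REST API, scripting against it looks like ordinary HTTPS calls.  A sketch in Python using the requests library - the endpoint path, credentials and port here are assumptions for illustration, not the documented DPA 6.0 API:

  import requests  # third-party: pip install requests

  BASE = "https://dpa-server.example.com:9002"  # hypothetical host and port

  resp = requests.get(
      f"{BASE}/apollo-api/rest/jobs",   # illustrative path, not verified
      auth=("reportuser", "secret"),    # placeholder credentials
      verify=False,                     # lab only; use real certs in production
  )
  resp.raise_for_status()
  print(resp.json())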

DPA is built on JBoss, which is owned by Red Hat, and uses PostgreSQL 9.1 as the database.  No users are able to access the database without using the "apollosuperuser" account unless they access it through the REST API.  The apollosuperuser account is a non-privileged account with full access to the database.

Files are kept in /opt/emc/dpa/services/datastore/data and /engine.  The transaction log is pg_xlog, and log rotation happens automatically at 250MB.  The maximum number of connections to the datastore is 100.
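
With a 100-connection ceiling, connection usage is worth watching.  A hedged Python sketch using the third-party psycopg2 driver - the host, password and database name are placeholders; only the account name and port come from these notes:

  import psycopg2  # third-party driver: pip install psycopg2-binary

  conn = psycopg2.connect(
      host="dpa-datastore.example.com",  # placeholder host
      port=9003,                         # datastore port per the notes
      dbname="postgres",                 # assumed database name
      user="apollosuperuser",
      password="changeme",               # placeholder
  )
  with conn.cursor() as cur:
      cur.execute("SELECT count(*) FROM pg_stat_activity;")
      (active,) = cur.fetchone()
      print(f"{active}/100 datastore connections in use")
  conn.close()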

DPA is event-driven, or uses the Event Driven Architecture.  The analysis engine uses policies and compares data with those policy conditions to provide recommendations and events.

DPA will run on Solaris 10-11, Windows 2003-2008R2, and Red Hat or SuSE Linux on 64-bit platforms with 8 GB of memory.  If DPA's components are all on one box, 4 CPUs are required, and if they are split out, 2 CPUs per server are required.

The main ports for DPA are 3741, 9002 and 9003.  9002 is the HTTPS connection from agents to server, 9003 is PostgreSQL and 3741 is the agent HTTP connection.
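
A quick reachability check of those three ports, as a small Python sketch (the hostname is a placeholder):

  import socket

  PORTS = {3741: "agent HTTP", 9002: "agent-to-server HTTPS", 9003: "PostgreSQL"}

  for port, role in PORTS.items():
      try:
          with socket.create_connection(("dpa-server.example.com", port), timeout=2):
              print(f"{port} ({role}): open")
      except OSError:
          print(f"{port} ({role}): unreachable")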

DPA Agent will run on Windows, Solaris, Linux, HP-UX and AIX, both 32- and 64-bit architectures.

DPA can be licensed in a couple of ways.  DPA for Backup is the main application and provides the majority of the functionality.  It is licensed by the number of clients unless you are using Avamar, in which case it is licensed by capacity, like Avamar itself, to keep things consistent.  There is a DPA for Replication Analysis, which is licensed by TB for Symmetrix and RecoverPoint, and per array for VNX systems.

DPA for VMware is licensed by ESXi host and provides for unlimited VMware VMs.

VNX File, Celerra and Clariion require no license for use with DPA.


Data Protection Advisor Overview

Data Protection Advisor, or DPA, is a monitoring and reporting tool for backup environments.  It is a single tool to manage and report on the backup environment and to provide trending and analysis of it end-to-end, from storage to servers to network to applications.  It is installable by the customer, and installation can be done in a couple of hours.

Tuesday, November 5, 2013

Managing Avamar Server Capacity

The goal for an Avamar server is to reach a Steady State Capacity.  This is the point where the initial backup load levels off and the amount of new data brought into the Avamar server is approximately the same as the space being freed by Garbage Collection.  Typically, Avamar will reach steady state shortly after the longest retention period.

The OS view is the total file system in the Avamar, while the gsan view is the total allocated to stripes.  The gsan view is 65% of the Avamar server and is used mainly for user data - this is the licensed capacity.  Above that, 20% of the file system space is used for checkpoint overhead and the remaining 15% is for the operating system.
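
The 65/20/15 split is easy to sanity-check with a worked example.  The 10 TB figure below is arbitrary:

  # Worked example of the 65/20/15 split for a hypothetical 10 TB OS view.
  os_view_tb = 10.0

  gsan_tb       = os_view_tb * 0.65   # licensed user-data (gsan) capacity
  checkpoint_tb = os_view_tb * 0.20   # checkpoint overhead reserve
  system_tb     = os_view_tb * 0.15   # operating system

  print(f"gsan/user data:      {gsan_tb:.1f} TB")
  print(f"checkpoint overhead: {checkpoint_tb:.1f} TB")
  print(f"operating system:    {system_tb:.1f} TB")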

Once the Avamar OS view reaches 65% utilization (that is, the full licensed gsan capacity) the Avamar becomes read-only.  The amount of primary data on clients, the initial backup commonality, day-over-day commonality and retention policies all affect utilization.

Checkpoint overhead is the data that changes after a checkpoint is created.  When a checkpoint is created, it creates a read-only hard link to the stripe.  When that data is modified, Avamar maintains the read-only link and creates a new read/write stripe with the changed data.  Therefore, the longer checkpoints are retained, the more disk is used, because there is more read-only data located on the system.  The length of time checkpoints are retained is the largest contributing factor in checkpoint overhead, but how empty the stripes are, the daily change rate and HFS checks not completing also contribute.
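
A toy copy-on-write model makes the overhead visible.  This is a conceptual sketch only, not how gsan actually lays out stripes:

  # Each checkpoint pins the stripes that existed when it was taken, so
  # data modified afterwards exists twice on disk.
  live = {"s1": "A", "s2": "B"}      # stripe id -> contents
  checkpoints = []

  checkpoints.append(dict(live))     # checkpoint: read-only view of stripes
  live["s1"] = "A'"                  # modify -> new read/write stripe created

  pinned = sum(1 for cp in checkpoints
               for sid, data in cp.items() if live.get(sid) != data)
  print(f"live stripes: {len(live)}, extra stripes pinned by checkpoints: {pinned}")
  # The longer a checkpoint is retained, the more modified stripes it pins.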

Capacity Threshold Warnings:
80% of user capacity = warning, start planning for expansion or cleanup
95% user capacity triggers the health check warning; new backups are suspended
100% user capacity makes the Avamar read-only; restores from the server are still possible

85% of OS capacity = Garbage collection stops running, utilization increases rapidly
90% OS capacity = HFS checks stop running
96% OS capacity = no more checkpoints

At the 95% health check warning, new backups are suspended and will not start again until the alert is acknowledged.
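
The thresholds above condense into a pair of lookup functions.  A small Python sketch of the same rules:

  def user_capacity_state(pct):
      """Map user (gsan) capacity utilization to the states above."""
      if pct >= 100: return "read-only: restores only"
      if pct >= 95:  return "health check warning: new backups suspended"
      if pct >= 80:  return "warning: plan expansion or cleanup"
      return "ok"

  def os_capacity_state(pct):
      if pct >= 96: return "no more checkpoints"
      if pct >= 90: return "HFS checks stop running"
      if pct >= 85: return "garbage collection stops; utilization climbs fast"
      return "ok"

  print(user_capacity_state(97))   # health check warning: new backups suspended
  print(os_capacity_state(91))     # HFS checks stop running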

Monitor with Avamar Administrator or Enterprise Manager.  EM has the benefit of predictive analysis, so it can estimate system utilization rates.  It also has a graph showing daily change rate, which is a valuable tool in planning.

The DPN Summary in Avamar Administrator shows backup stats, including TotalBytes, which is all the modified and non-modified bytes being scanned for changes on the client, and ModSent, which is the actual bytes sent over the wire.
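
Those two fields give the day-over-day commonality directly.  A worked example in Python with made-up numbers:

  # Commonality = fraction of scanned bytes that did NOT need to be sent.
  total_bytes = 500 * 2**30   # TotalBytes: all bytes scanned on the client
  mod_sent    =  12 * 2**30   # ModSent: bytes actually sent over the wire

  commonality = 1 - mod_sent / total_bytes
  print(f"Sent {mod_sent / 2**30:.0f} GiB of {total_bytes / 2**30:.0f} GiB "
        f"scanned -> {commonality:.1%} commonality")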

Monday, November 4, 2013

Avamar Server Maintenance Activities

There are server maintenance activities that need to happen daily, which are Garbage Collection (GC), checkpoints and HFS Checking.

Garbage Collection finds orphaned chunks and removes them during the blackout window.  If a backup runs long, GC cannot start, and if GC has started then a backup cannot start.  The server goes into read-only mode during GC, so restores can still complete.  However, if server capacity becomes greater than 85%, any running backups will be cancelled and GC will automatically run.  GC will normally delete any backups that have been deleted or expired, as well as partial backups that are more than 7 days old.  During the blackout window, Asynchronous Crunching also takes place, which is essentially a defrag-type job for striped data.  It recognizes deleted data and moves the remaining stripes into proximity.
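
The deletion rule reduces to a short predicate.  A Python sketch of the eligibility logic described above (the job records are invented for illustration):

  from datetime import datetime, timedelta

  def gc_eligible(backup, now):
      """Deleted or expired backups go; partial ones only after 7 days."""
      if backup["state"] in ("deleted", "expired"):
          return True
      if backup["state"] == "partial":
          return now - backup["created"] > timedelta(days=7)
      return False

  now = datetime.now()
  jobs = [
      {"state": "expired", "created": now - timedelta(days=30)},
      {"state": "partial", "created": now - timedelta(days=2)},
  ]
  print([gc_eligible(j, now) for j in jobs])   # [True, False]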

Checkpoints are read-only snapshots of the Avamar server, and they enable server rollback in the event of a problem.  They are run twice daily, at the beginning and at the end of the maintenance window.  Avamar will keep the last 2 checkpoints and at least one validated checkpoint.  Checkpoints can be created, modified and executed manually.  The older a checkpoint is, the more space it will consume.
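
The retention rule (last 2 checkpoints plus at least one validated) can be sketched in Python.  This is my reading of the rule, not Avamar's actual implementation:

  def checkpoints_to_keep(cps):
      """cps: list of (timestamp, validated) sorted oldest -> newest."""
      keep = list(cps[-2:])                      # always keep the last two
      if not any(validated for _, validated in keep):
          validated_cps = [cp for cp in cps if cp[1]]
          if validated_cps:
              keep.insert(0, validated_cps[-1])  # plus the newest validated one
      return keep

  cps = [("mon", True), ("tue", False), ("wed", False)]
  print(checkpoints_to_keep(cps))   # [('mon', True), ('tue', False), ('wed', False)]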

Validation is when Avamar scans the stripes and validates their integrity using hfscheck.  The server is read-only when the hfscheck is initiated, but then returns to normal operation afterward.  There is full validation and rolling validation.  Full validation scans all data, while rolling validation scans new data and some modified data.  Validation takes up the bulk of the maintenance window.

Avamar Administrator is used to monitor maintenance activities under the "Server Details" tab.  If maintenance activities are suspended, they will not run until re-enabled by selecting the maintenance activity and choosing Actions -> Resume Maintenance.  **If a maintenance activity is not suspended and not running, contact EMC technical support.

Default schedule for windows:

The backup window starts at 8:00 PM and runs for 12 hours, until 8:00 AM.

The blackout window runs immediately following the backup window and lasts 3 hours, from 8:00 AM to 11:00 AM.  It is mainly running the asynchronous crunching and GC processes.  Backups cannot be run during the blackout window, but restores can be run.

The maintenance window runs following the blackout window for the remainder of the time until 8:00 PM, when the backup window begins again.  It is primarily running HFS checks at that time, and there are a limited number of backups that can be run.  Normally 27 sessions can run, but during maintenance only 3 can be run.  It also takes a checkpoint and validates it.
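
The default schedule is compact enough to encode as data for review.  A Python sketch summarizing the three windows above:

  # Default Avamar daily windows, per the notes above.
  WINDOWS = [
      ("backup",      "20:00", "08:00", "backups (up to 27 sessions) and restores"),
      ("blackout",    "08:00", "11:00", "GC and async crunching; restores only"),
      ("maintenance", "11:00", "20:00", "HFS check, checkpoint + validation; max 3 backup sessions"),
  ]

  for name, start, end, activity in WINDOWS:
      print(f"{name:12} {start}-{end}  {activity}")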