To obtain maximum efficiency and system utilization, several parameters can be adjusted to tune the NetWorker environment.
Target sessions is the preferred number of sessions a device on a storage node will accept, and it is a soft limit. Once target sessions is reached, new sessions look for another device; if none is available, the session count grows up to that device's max sessions. Max sessions is a hard limit and can be set to a maximum of 512.
In NetWorker 7.6 and older, the default target sessions value is 4 and the default max sessions is 512; those should be changed to 1 and 32, respectively. In NetWorker 7.6 SP1 the defaults are already 1 and 32, in line with the recommendation.
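For reference, a minimal nsradmin sketch for applying those values to a device. The server name and device path here are hypothetical; substitute the resources from your own data zone:

    # Connect to the NetWorker server (hypothetical hostname)
    nsradmin -s nsrserver.example.com

    # At the nsradmin prompt, select the device resource and apply the values
    nsradmin> . type: NSR device; name: rd=storagenode01:/dev/rmt/0cbn
    nsradmin> update target sessions: 1; max sessions: 32
    nsradmin> print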
NetWorker 8.0 and newer can run dynamic nsrmmd processes to limit the number of backup sessions each media daemon handles at one time. The number of nsrmmd processes per device is determined by dividing max sessions by target sessions and adding 4.
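As a rough back-of-the-envelope illustration of that calculation, assuming the recommended device values above (this is plain shell arithmetic, not a NetWorker command):

    # nsrmmd processes per device = (max sessions / target sessions) + 4
    MAX_SESSIONS=32
    TARGET_SESSIONS=1
    echo $(( MAX_SESSIONS / TARGET_SESSIONS + 4 ))   # prints 36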
In environments with remote backup over Fibre Channel, link latency can be a significant problem. Increasing the block size can have a drastic effect on performance, because the system is allowed to transport more data in each operation.
To tune the network, TCP parameters can be changed to suit backup traffic rather than their default, general-purpose settings. Consider disabling software flow control and increasing the TCP buffer sizes and TCP queue depth. Enable jumbo frames where possible, and use PCI Express for 10 Gb NICs to ensure adequate bus bandwidth.
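As one hedged example, on a Linux storage node or client those buffer and queue settings might be raised with sysctl along these lines. The values are illustrative only, not vendor-mandated, and HP-UX/Solaris use ndd or ipadm rather than sysctl:

    # Illustrative Linux values - size buffers for your bandwidth-delay product
    sysctl -w net.core.rmem_max=16777216          # max socket receive buffer (16 MB)
    sysctl -w net.core.wmem_max=16777216          # max socket send buffer (16 MB)
    sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"   # min/default/max receive
    sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"   # min/default/max send
    sysctl -w net.core.netdev_max_backlog=30000   # deeper NIC input queue
    # Jumbo frames, only if every hop supports them (interface name is hypothetical)
    ip link set dev eth1 mtu 9000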
Other network considerations: avoid network hops between clients and storage nodes. Place a storage node on the local subnet if it is far removed from the rest of the network, or where there is a firewall to contend with. Also consider converting larger clients into storage nodes so they back up locally.
A dedicated backup network is another option for network tuning, and storage nodes can be multi-homed to attach to it.
NetWorker relies heavily on name resolution. Servers should have low-latency access to DNS, so consider a local caching server or a local non-authoritative name server that accepts zone transfers from the primary. Alternatively, avoid DNS queries altogether by using a local hosts file.
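A minimal sketch of that hosts-file approach (hostnames and addresses are hypothetical); keep forward and reverse entries consistent on the server, storage nodes, and clients:

    # /etc/hosts - static entries for the NetWorker server, storage nodes, and clients
    192.168.10.10    nsrserver.example.com      nsrserver
    192.168.10.20    storagenode01.example.com  storagenode01
    192.168.10.31    client01.example.com       client01

    # /etc/nsswitch.conf - consult the hosts file before DNS
    hosts:  files dns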
Disk latency is a major consideration with NetWorker, just as with other EMC products. At 25 ms and below, backups are optimal and stable. Once disk latency hits 50 ms, backups slow down and NMC updates are delayed or fail. At 100 ms, savegroups and sessions fail, and at 150-200 ms the NetWorker daemons take a long time to launch and backups become unstable.
If your NetWorker server is exceeding 100 parallel sessions, dedicate fast disk to the NetWorker databases; RAID 10 is recommended for the server's disk storage. If there are more than 400 parallel sessions, consider placing the NetWorker databases on separate volumes. Disable antivirus scanning on the NetWorker databases, and use asynchronous replication to avoid the latency that comes with synchronous replication.
Parallelism is the main tunable parameter in NetWorker. You can set parallelism for the server, or for a savegrp to restrict the number of save sets backed up simultaneously. You can balance the workload by running multiple groups, putting slow clients in a group separate from the fast ones. You can also use pools to direct large save sets to faster devices, but use caution not to create unnecessary pools: you will want one for NDMP or Exchange, for example, but not one per save group.
Server parallelism is configurable and should be set to the sum of the target sessions of all devices in the data zone.
Client parallelism is the number of simultaneous save streams from a client. The default is 4, but you may want to tune it up or down. The NetWorker server has a client parallelism setting as well to accommodate index backups, and that value should never be set to 1: for fewer than 30 clients, set it to at least 8; for 31-100 clients, at least 12; and for more than 100 clients, at least 16.
Group parallelism throttles the number of clients within a group. Keep groups to a maximum of 50 clients with parallelism enforced, and stagger group start times to reduce load on the operating system.
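Tying those three levels together, a hedged nsradmin sketch might look like the following. The server and client names are hypothetical, and the server parallelism value of 64 simply stands in for the sum of all device target sessions in your data zone:

    nsradmin -s nsrserver.example.com

    # Server parallelism: the sum of target sessions across all devices
    nsradmin> . type: NSR
    nsradmin> update parallelism: 64

    # Client parallelism: simultaneous save streams from one client
    nsradmin> . type: NSR client; name: client01.example.com
    nsradmin> update parallelism: 4

    # Group parallelism: throttle concurrent clients within a group
    nsradmin> . type: NSR group; name: Slow-Clients
    nsradmin> update parallelism: 10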
perfect....
Thanks for the useful notes, Greg.
I have a large database backup that fails after backing up 5 TB of data over 10 hours. It's an Oracle DB, and the client is also a storage node. I'm thinking of tuning TCP parameters on the client. What should be considered, and what values should be set on HP-UX and Solaris servers? Please advise.
Hello, Suresh - I apologize for the delay in responding. If you've already gotten your issue resolved, please let me know what the fix was.
I have not had the opportunity to perform an Oracle backup, but I have a few questions regarding the overall design. You mention that you have a 5 TB database on a storage node. Do you mean that you have other file services being presented from the HP-UX/Solaris servers alongside your database? Are you seeing disk contention and high latency on the backup client during the backup window?
What sort of storage is the database on, and to what are you writing the backup?
What sort of network connectivity do you have between your client and the backup target?
Are you getting any output from RMAN or NetWorker that would narrow our search for the culprit? Are you doing any consistency checking on the database that takes a long time, perhaps extending through the backup window and causing the job to time out? Can you override that backup window in order to get a solid base image on which you can continue forward with incremental backups?
As you know, there are a lot of components in play here. Let me know what you've found, though...my curiosity is piqued!
--Greg