Best Practices for Deploying WAN Optimization with Data Replication: Keys for Successful Data Protection across the WAN
The Weak Link in Data Protection
All too often, the Wide Area Network (WAN) link is the weak link in data protection. Limited bandwidth, high latency, lost packets, and out-of-order packets can all jeopardize strategic data replication and backup initiatives, resulting in missed Recovery Point Objectives and Recovery Time Objectives (RPO/RTO).
As data volumes grow, and as the distance between data centers increases to protect business data from catastrophic disasters, pressure on the WAN keeps increasing. This has heightened the demand for optimization tools that can improve data replication times across the WAN while maximizing bandwidth efficiency during these processes.
There are unique requirements for deploying WAN optimization in a disaster recovery (DR) environment. Replication, for example, involves a high volume of sustained traffic that is highly susceptible to lost and out-of-order packets. Transfer times are very well defined to fit within allocated windows, and DR traffic is often encrypted to protect sensitive business data. Replication solutions can run over TCP, UDP and in some cases proprietary and encapsulated transport protocols, depending on the solution, and often have their own optimization techniques that can affect the performance of downstream WAN optimization devices.
By understanding these requirements and establishing guidelines for addressing them, WAN optimization can be deployed with maximum effectiveness. As such, WAN optimization can live up to its potential as a key enabler for strategic disaster recovery initiatives.
Asking the Right Questions
The first step when deploying any networking solution is to ask the right questions. When it comes to WAN optimization, the following questions should be at the forefront of the evaluation process:
- Know your applications and protocols. It is impossible to measure results without first baselining the existing situation. How much traffic is being generated in an "average" replication stream or backup process? How much traffic is generated in a single delta set? What are peak loads and how many times a day are they being hit? How long does it take to transfer the data from point A to point B?
The more quantifiable information one can collect, the easier it will be to size an appropriate WAN optimization device and gauge the level of improvement it provides. Think ahead - factor in an expected rate of data growth to ensure that a WAN optimization solution can grow with evolving replication/backup needs.
It is also extremely valuable to know how the replication/backup applications work. Do they run over TCP (e.g., EMC SRDF/A, NetApp SnapMirror/SnapVault, Double-Take) or UDP (e.g., Veritas Volume Replicator, EMC CLARiiON disk library, Aspera/Isilon)? Are proprietary or encapsulated protocols being used, as is the case with some Fibre Channel over IP (FCIP) implementations? If anything other than standard TCP is being used to communicate between host devices, make sure that WAN optimization appliances can support those protocols.
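The baselining questions above reduce to simple arithmetic once the numbers are collected. The following is a minimal sketch, not a sizing tool: the function names, the 500 GB delta set, the 8-hour window, and the 30% annual growth rate are all hypothetical values chosen for illustration.

```python
def required_mbps(delta_gb: float, window_hours: float) -> float:
    """Sustained throughput (Mbps) needed to move delta_gb within window_hours.

    Uses decimal units: 1 GB = 10**9 bytes, 1 Mbps = 10**6 bits per second.
    """
    bits = delta_gb * 8 * 10**9
    return bits / (window_hours * 3600) / 1e6


def projected_gb(delta_gb: float, annual_growth: float, years: int) -> float:
    """Delta-set size after compound growth (annual_growth=0.30 means 30%)."""
    return delta_gb * (1 + annual_growth) ** years


if __name__ == "__main__":
    today = required_mbps(500, 8)                               # hypothetical baseline
    in_three_years = required_mbps(projected_gb(500, 0.30, 3), 8)
    print(f"needed today:      {today:.1f} Mbps")
    print(f"needed in 3 years: {in_three_years:.1f} Mbps")
```

Comparing the projected figure against measured goodput, rather than purchased bandwidth, gives a more realistic picture of whether the replication window can still be met.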
- Know your network. It is easy to determine how much bandwidth you are paying for, but it is much harder to determine how much effective throughput is really being achieved across the WAN. In many environments, such as MPLS and IP-VPN, packets are lost or delivered out of order due to router oversubscription. This can lead to excessive re-transmissions, which can drop the effective throughput (aka "goodput") of traffic across the WAN. As Figure 1 shows, just a small amount of packet loss (0.075%) can drop goodput to less than 5 Mbps - regardless of how much bandwidth is actually available on the WAN link. Most large backup/replication processes cannot recover from such a significant drop in throughput. As a result, WAN optimization technologies like Forward Error Correction (FEC) and Packet Order Correction (POC) are extremely valuable in some replication/backup environments.
Figure 1. Effective throughput is reduced to < 5 Mbps across the WAN with as little as 0.075% packet loss
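The article does not state which model produced Figure 1. One widely used rule of thumb for a single standard TCP flow is the Mathis approximation, rate ≈ (MSS / RTT) × (C / √p), where p is the loss rate and C ≈ √(3/2). The sketch below applies it under assumed values (1460-byte MSS, 100 ms round-trip time) that are not taken from the article; it is an illustration of why loss, not link size, sets the ceiling.

```python
import math

MATHIS_C = math.sqrt(3 / 2)  # constant from the Mathis et al. approximation


def loss_limited_mbps(mss_bytes: int, rtt_s: float,
                      loss_rate: float, link_mbps: float) -> float:
    """Approximate single-flow TCP throughput ceiling in Mbps.

    loss_rate is a fraction (0.00075 means 0.075%). The result never
    exceeds the physical link rate.
    """
    if loss_rate <= 0:
        return link_mbps
    bps = (mss_bytes * 8 / rtt_s) * (MATHIS_C / math.sqrt(loss_rate))
    return min(link_mbps, bps / 1e6)


if __name__ == "__main__":
    # Assumed path: 1460-byte MSS, 100 ms RTT, 0.075% loss.
    for link in (45, 155, 1000):
        ceiling = loss_limited_mbps(1460, 0.100, 0.00075, link)
        print(f"{link:>5} Mbps link -> about {ceiling:.2f} Mbps per flow")
```

Under these assumptions the ceiling is the same for every link size, which is in line with the point of Figure 1: adding bandwidth does not raise per-flow goodput when loss is the limiting factor.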
In addition to bandwidth and packet loss, latency can be a silent killer for many DR applications. There is no getting around the laws of physics - when there is a significant distance between target and host devices it will take time for packets to travel back across the WAN. This problem is only getting worse as enterprises look to extend the geographic distances between data centers, be it for better protection from catastrophic disasters or to take advantage of cheaper power in rural environments. In many instances, TCP acceleration techniques like selective acknowledgements and adjustable window sizing will help address latency challenges across the WAN.
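The effect of latency on a single TCP flow can be estimated from the window size: at most one window of data can be in flight per round trip. The sketch below is a back-of-the-envelope calculation; the 64 KB window, 80 ms round-trip time, and 155 Mbps link are assumed values, and the function names are illustrative.

```python
def window_limited_mbps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on single-flow throughput: one window per round trip."""
    return window_bytes * 8 / rtt_s / 1e6


def bdp_bytes(link_mbps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the window needed to keep the link full."""
    return link_mbps * 1e6 * rtt_s / 8


if __name__ == "__main__":
    rtt = 0.080  # assumed 80 ms round trip between data centers
    print(f"64 KB window: {window_limited_mbps(65535, rtt):.2f} Mbps at most")
    print(f"window to fill 155 Mbps: {bdp_bytes(155, rtt) / 1e6:.2f} MB")
```

The gap between the two results is what adjustable window sizing is meant to close: the window has to approach the bandwidth-delay product before a long-distance link can be filled by one flow.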
If a WAN upgrade is underway, such as a switch to a new WAN technology (e.g., MPLS or IP-VPN) or the build-out of a new data center, it is advisable to simulate potential WAN conditions as part of the WAN optimization evaluation process. A good WAN emulator will effectively reproduce bandwidth, latency, packet loss, and out-of-order packets to provide a real-world experience.
- Know your limits. How many simultaneous flows are generated during a typical replication process? How many are generated when multiple processes are taking place simultaneously, such as the backup of dozens of remote branch offices? If other traffic is using the same WAN as your DR traffic, how many flows is it generating? By understanding the quantity of flows across the network one can ensure that WAN optimization devices handle the volume appropriately. Be sure to understand how a WAN optimization device reacts when its flow limits are reached. Is traffic blocked when limits are reached, or sent through un-accelerated?
In addition to the above, it is useful to know if there are throughput limits being placed on individual flows by routers, firewalls and other network elements. A router, for example, may limit the amount of throughput per flow to ensure that all flows get serviced properly. Or, a firewall may limit the amount of throughput per flow to prevent malicious applications from hogging precious bandwidth. Whether deliberately set or not, throughput limits can wreak havoc on high volume traffic and should be addressed accordingly (either through reconfiguration of the network element or through WAN optimization techniques, like packet striping).
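The flow questions above can be turned into quick checks. The following sketch assumes a hypothetical per-flow cap of 20 Mbps on a 155 Mbps link and models the two flow-limit behaviors the text asks about (blocking versus pass-through). The function names and the `on_limit` parameter are illustrative and do not correspond to any vendor's configuration.

```python
import math


def aggregate_mbps(flows: int, per_flow_cap_mbps: float, link_mbps: float) -> float:
    """Best-case total throughput when every flow is capped individually."""
    return min(link_mbps, flows * per_flow_cap_mbps)


def flows_to_fill(link_mbps: float, per_flow_cap_mbps: float) -> int:
    """Number of parallel flows needed before the link, not the cap, is the limit."""
    return math.ceil(link_mbps / per_flow_cap_mbps)


def admit(active_flows: int, flow_limit: int, on_limit: str = "pass_through") -> str:
    """How a new flow is treated once the appliance's flow limit is reached."""
    if active_flows < flow_limit:
        return "accelerated"
    return "unaccelerated" if on_limit == "pass_through" else "blocked"


if __name__ == "__main__":
    print(aggregate_mbps(1, 20, 155))    # one capped replication stream
    print(flows_to_fill(155, 20))        # streams needed to use the whole link
    print(admit(64000, 64000, on_limit="block"))
```

A single capped replication stream leaves most of the link idle in this example, which is the situation that spreading one stream's traffic across several flows is intended to address.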
Best Practice Configuration Guidelines
Many WAN optimization techniques, e.g., data reduction, QoS, compression, latency mitigation, and loss mitigation, work transparently to storage devices and DR software. During normal operations, the storage medium should not even know that the traffic being sent across the WAN is being accelerated. However, different deployment scenarios can result in different levels of performance across the WAN. The following configuration guidelines help to maximize end-to-end performance when performing data replication across a WAN:
- Compression / de-duplication. Many storage devices perform basic payload compression; e.g., LZ. This does not prevent downstream WAN optimization devices from working, but it can reduce the overall effectiveness of these devices by limiting visibility into "raw" data. Because this functionality is not unique to the storage device - i.e., most WAN optimization devices can perform the same or better compression than a storage array - it is usually recommended that this functionality be turned off in the array. This typically leads to better overall net performance from a compression standpoint. In addition, because compression is very CPU intensive, moving this functionality off the host (array) and onto a dedicated WAN optimization appliance can result in better scalability within the storage medium.
De-duplication is a slightly different story, as in many environments this functionality is desired within the storage medium and turning it off is not an option. As long as the WAN optimization device has byte-level granularity when doing its own data reduction, working with de-duplication should not be a problem. In fact, expect an additional 10x to 20x performance improvement when WAN data reduction is performed in conjunction with de-duplication. This is particularly true when multiple applications are sent across the same WAN because the optimization device has a larger data set to sample and match from. For example, if someone sends an email across the WAN it will be fingerprinted and stored by a WAN optimization device performing data reduction. When that email is backed up, the WAN optimization device will have already seen the data, leading to immediate data reduction benefits. In contrast, this might be the first time that the storage device is backing up the data, so de-duplication may be minimal. As one can expect, having a WAN optimization device + de-duplication in the mix yields the best overall net results.
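The email example can be illustrated with a deliberately simplified model of fingerprint-based data reduction. This sketch uses fixed 256-byte chunks and SHA-256 fingerprints, which is an assumption made for brevity; it is not how any particular appliance works, and real devices operate at finer (byte-level) granularity as noted above. The class and function names are hypothetical.

```python
import hashlib


class ChunkStore:
    """Toy sender-side store: remembers every chunk it has already sent."""

    def __init__(self, chunk_size: int = 256):
        self.chunk_size = chunk_size
        self.seen = {}  # fingerprint -> chunk bytes

    def encode(self, data: bytes):
        """Return ('raw', chunk) for new data and ('ref', fingerprint) for repeats."""
        tokens = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest in self.seen:
                tokens.append(("ref", digest))
            else:
                self.seen[digest] = chunk
                tokens.append(("raw", chunk))
        return tokens


def wire_bytes(tokens) -> int:
    """Payload bytes that would cross the WAN (token framing overhead ignored)."""
    return sum(len(payload) for _, payload in tokens)


if __name__ == "__main__":
    # 1024 bytes in which every 256-byte chunk is distinct.
    email = b"".join(i.to_bytes(4, "big") for i in range(256))
    store = ChunkStore()
    print(wire_bytes(store.encode(email)))   # first transfer: sent in full
    print(wire_bytes(store.encode(email)))   # later backup: references only
```

The second pass sends only 32-byte fingerprints because the data was already seen when the email first crossed the WAN, even though the backup application itself is handling that data for the first time.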
- Encryption. When communicating across a WAN, many enterprises look to encryption as a necessary tool for protecting sensitive information. However, when WAN optimization is deployed with a storage medium, one must be careful as to where this encryption takes place. When encryption takes place "upstream" of the WAN optimization device, special actions must be taken to terminate the encryption session on the appliance, decrypt the traffic, optimize the traffic, and then re-encrypt the traffic. Otherwise, the WAN optimization device does not have visibility into the data and cannot perform its optimization functions. Because this process can be difficult to coordinate and can have an adverse effect on performance, it is generally not recommended unless encryption is absolutely necessary at the source. Instead, it is recommended that encryption be left to the WAN optimization device.
Best practices dictate that WAN optimization devices perform two types of encryption. One is encryption of data at rest; i.e., data stored on the appliance. The other is encryption of data sent between appliances. The former is particularly needed when the WAN optimization device is using local disk drives for data reduction, which can store several terabytes worth of information. The latter is most often needed on shared networks, such as IP VPNs, where IPsec and other VPN solutions can provide an added layer of security. In both scenarios, it is recommended that encryption take place in dedicated hardware so as not to impact performance.
- High Availability. When WAN optimization is used for disaster recovery, it takes on an increased element of importance. Poor transfer times can mean failed replication/backup processes, which means business information is placed at a higher level of risk. To avoid this scenario, it is often recommended that WAN optimization be deployed in a redundant configuration when used as part of disaster recovery operations.
Redundancy between appliances is typically achieved using standard redirecting techniques, like Policy Based Routing (PBR) and Web Cache Coordination Protocol (WCCP), which can be used to redirect traffic in the event of a problem. Redundant power, disk drives and other modules will help ensure maximum uptime within the appliance. See Figure 2.
Figure 2. Out-of-path deployment ensures high availability. Common redirection techniques can be used to deploy WAN optimization appliances in a redundant fashion
Understanding Success Criteria
With the above information, one can effectively define criteria for a successful WAN optimization evaluation. More specifically, enterprises can collect quantitative data that will justify whether an investment in WAN optimization is the correct choice for their DR environment. Specific items to look for include:
- Reduced transfer times. How much faster is the replication/backup process? This is easy to measure, and easy to compare to baseline numbers (assuming they were collected prior to deploying WAN optimization).
- Increased LAN-side throughput. In many instances, removing a WAN bottleneck enables more data to be sent from the storage medium. This means that more data can be protected within allocated windows.
- Improved WAN utilization; i.e., more "virtual bandwidth." If LAN-side throughput is constant, then WAN utilization should go down when using WAN optimization. However, in many instances LAN-side throughput goes up, which can result in an increase in overall WAN traffic. This may seem counter-intuitive to the goal of WAN optimization, but it actually means that one is getting more efficient usage out of available WAN bandwidth.
As the last point indicates, removing a WAN bottleneck can actually expose bottlenecks elsewhere in an enterprise. For example, a bad server NIC or outdated LAN hub may have worked "fine" when WAN throughput was limited to 10 Mbps, but they may strain to keep up with a WAN that can now handle 100 Mbps of traffic. Similarly, replication software can be physically limited in the amount of data it can push out, or it might have been manually configured to limit throughput based on WAN conditions. This may lead to sub-optimal performance gains when WAN optimization is deployed. For example, the amount of traffic on the WAN might be significantly reduced with WAN optimization, but transfer times across the WAN may not show a significant improvement. This might be something that can be corrected with minor configuration changes in the storage medium, or it may be a fundamental limitation of that medium.
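The three criteria and the bottleneck caveat can be expressed as simple ratios over before-and-after measurements. The sketch below uses made-up numbers purely for illustration; the `TransferSample` type and its fields are hypothetical, and the byte counts would come from whatever monitoring is in place.

```python
from dataclasses import dataclass


@dataclass
class TransferSample:
    lan_bytes: int    # bytes sent by the storage medium toward the WAN device
    wan_bytes: int    # bytes actually placed on the WAN link
    seconds: float    # wall-clock duration of the replication/backup job


def speedup(baseline: TransferSample, optimized: TransferSample) -> float:
    """Transfer-time improvement; 2.0 means the job finished twice as fast."""
    return baseline.seconds / optimized.seconds


def reduction_ratio(sample: TransferSample) -> float:
    """LAN bytes per WAN byte; 5.0 means 5x less traffic on the WAN."""
    return sample.lan_bytes / sample.wan_bytes


def lan_throughput_mbps(sample: TransferSample) -> float:
    """Average LAN-side throughput of the job in Mbps."""
    return sample.lan_bytes * 8 / sample.seconds / 1e6


if __name__ == "__main__":
    # Illustrative numbers only: WAN traffic falls 5x, but the job is barely faster.
    before = TransferSample(100 * 10**9, 100 * 10**9, 36000)
    after = TransferSample(100 * 10**9, 20 * 10**9, 34200)
    print(f"speedup:   {speedup(before, after):.2f}x")
    print(f"reduction: {reduction_ratio(after):.1f}x")
    print(f"LAN rate:  {lan_throughput_mbps(after):.1f} Mbps")
```

A high reduction ratio combined with a speedup close to 1.0, as in this example, is the signature of a bottleneck somewhere other than the WAN, such as a throttled replication application or a slow LAN component.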
Lastly, effective management tools are important when evaluating and subsequently deploying a WAN optimization solution. See Figure 3. These tools help baseline existing network and application behavior, optimize configuration for seamless deployment, and monitor behavior on an ongoing basis to assess performance over time.
Figure 3. Effective management tools are required to track application performance over time.
Making the Business Case
Faster transfer times and higher LAN/WAN throughput mean better RPO. The more data that can be pumped out by the storage device and subsequently sent across the WAN, the more data that can be protected in a given period of time.
Faster transfer times also mean better RTO. WAN optimization not only accelerates replication and backup functions, but it ensures that transfers in the opposite direction - i.e. during a recovery - also happen as quickly as possible.
What is the value placed on improved RPO and RTO? How much is it worth to protect more data and recover it faster? Do these benefits outweigh the investment in WAN optimization equipment?
Consider the alternative - adding more WAN bandwidth. This may seem like the path of least resistance when data protection is not performing as desired across the WAN, but it has several major drawbacks.
For starters, it assumes that bandwidth is the only issue that needs to be addressed when doing replication and backup across the WAN. However, if packet loss, packet ordering, and latency are also issues, adding more bandwidth will not solve the problem. (In fact, loss is often exacerbated as WAN links grow in size.)
Secondly, in many regions it can take quite a long time to get a large WAN connection ordered and provisioned from a service provider. If problems exist today, waiting several months for an OC-3 or OC-12 circuit may not be a viable option.
Lastly, when all factors are considered, the cost of adding more WAN bandwidth is often significantly higher than the cost of deploying WAN acceleration. Aside from a dramatic increase in recurring bandwidth expenditures (30% to 60% on average), routers and other network equipment may have to be added or upgraded, the storage medium may need to be upgraded, new licenses might be required in the replication/backup software to accommodate additional WAN links, and new operational expenditures might be required to handle the added complexity of new and larger WAN connections. One can argue that bandwidth expenditures are decreasing over time, but the recurring costs are still significant and the tangential costs associated with upgrading the WAN can be quite substantial.
In the end, WAN optimization offers the best performance improvements in disaster recovery environments with the lowest total cost of ownership. When deployed correctly, the benefits of WAN optimization are very tangible - from improved data transfer times to more efficient usage of available WAN bandwidth. When best practice recommendations are followed, WAN optimization is an indispensable tool in day-to-day disaster recovery operations.
About the Author
Jeff Aaron is the Director of Product Marketing at Silver Peak Systems. Mr. Aaron can be reached at firstname.lastname@example.org.