Data Protection: Where the Problems Lie
Data Protection as It Was in the Beginning
Understanding IT's heritage of data protection technologies is essential to grasping the thinking that still permeates the IT community regarding the nature of data protection. The genesis of that "thinking" was the limited choice of data protection technologies in the recent past, as well as cost considerations. Note that the focus here is on data protection technologies as related to risk management because, except for data security-related technologies, the role of governance and compliance in data protection was not generally recognized. In fact, even from a risk management perspective, not that many years ago, data protection was synonymous with backup/restore processes using tape. The numerous other data protection technologies in use today were not only unavailable; they were unthinkable.
The big change in data protection technology started with the introduction of what is now called RAID (Redundant Array of Independent Disks) in the 1990s. This was a major advance because before the introduction of RAID technology, all data on a particular disk drive was "lost" if the disk drive experienced a permanent failure that rendered access to the data permanently unavailable. That data loss was temporary if the data had been copied (i.e., backed up) to magnetic tape. Assuming no errors on the tape, the data could be restored to a working disk drive and the data would once again be available to an application for use. The data loss would be permanent if the data had not been backed up (or if the tape media failed). Backups typically were run once a day, at night, after most of the business applications had shut down at the end of a normal business day; the practice of applications running 24 hours a day, 7 days a week, was not common. The data that had been created during the course of the day since the previous night's backup could be permanently lost unless special logging of transactions took place.
The introduction of RAID changed things dramatically, in that a group of disks could now have much higher availability as a whole than the same set of disks would have on average without RAID technology. A RAID group contains a set (an array or part of an array) of disk drives with at least one more disk than is necessary to house all the data (the redundancy in the term). The "raw" capacity of a RAID group is the sum of the individual capacities of all the disk drives in the group. The "usable" capacity is how much data can actually be stored, given that the equivalent of one or more disk drives has to be reserved for data protection purposes (using techniques based on "parity" or "mirroring" to protect the data). From a physical perspective, a RAID group can tolerate at least one disk failure before any data is "lost."
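The raw-versus-usable distinction amounts to a simple calculation. The following sketch is illustrative only; the function name and drive sizes are assumptions, not taken from any particular product:

```python
def raid_capacity(drive_count, drive_tb, redundancy_drives=1):
    """Raw vs. usable capacity of a RAID group.

    redundancy_drives is the equivalent number of drives reserved
    for data protection (e.g., 1 for single-parity RAID).
    """
    raw = drive_count * drive_tb
    usable = (drive_count - redundancy_drives) * drive_tb
    return raw, usable

# Eight 4 TB drives with the equivalent of one drive held for parity:
print(raid_capacity(8, 4))  # (32, 28): 32 TB raw, 28 TB usable
```

With two drives' worth of redundancy (as in double-parity schemes discussed later), the same group would offer 24 TB usable.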
Prior to the introduction of RAID, the lack of such technology required forging a close relationship between magnetic disks, which provide the random access that most applications require, and magnetic tape, which provided a medium on which data could be written and preserved for data protection purposes.
So, just before the introduction of RAID technology, the primary storage media in use were, as today, Winchester disks and tapes (Figure 3.1). Winchester disk technology itself offered no extra measure of data protection. Winchester disks are disks on which the disk medium itself and the disk drive are sealed into a single unit. Prior to the introduction of Winchester technology, disk pack media could be removed. (Removability is being reintroduced in limited cases with removable RAID groups, which include both drives and media, but this is not a general practice.) Each Winchester disk (hereafter referred to as simply a "disk") had to stand on its own, so the mean time between failures (MTBF) for multiple disks was far less than for one disk. Although the technology for disk copies existed, the cost was prohibitive except for extremely critical online transaction processing systems. Practically speaking, neither physical nor logical data protection existed.
Figure 3.1 Data Protection: The Way It Was
Magnetic tape solutions provided not only the first line of defense against data problems, but also the last (and any intermediate) line of defense as well. A tape solution consists of tape media, tape drives, and tape automation. Tape media have evolved from reels (almost like old movie reels) to more easily manipulable cartridges. (Some 4-inch by 4-inch by less-than-1-inch cartridges can hold a terabyte or more of data.) Tape drives (into which a piece of tape media is inserted) have shrunk dramatically as a consequence.
Furthermore, tape automation (such as a tape library, which contains multiple tape drives, extra slots to store tape cartridges that are not in a tape drive, and a robotic arm to move the cartridges around) has improved the flexibility of managing a large number of pieces of tape media. In the old days (such as the 1980s), a tape operator had to change tape reels manually on a free-standing tape drive, whereas now the robotics in an autoloader (which has only one tape drive) or a tape library (which has multiple tape drives) move tape cartridges around automatically. However, under most circumstances, tape is intrinsically slower than disk, and transferring information from tape to disk, or vice versa, is a lengthy process.
Unlike disk, an individual piece of tape media is easily removed from a tape drive and can run in any compatible tape drive. This capability is important, because it allows tape media to be transported to and put into use at a remote site independent of the primary data center. Thus, movement of data is dependent on the availability of transportation, but not on the availability of a network. Tape drives can operate independently, but they are often embedded in tape automation solutions (such as an autoloader or tape library).
The copying of data from disks to tape media is done through the use of backup/restore software. This is the traditional backup/restore process, and it was for a long time essentially the only software available for data protection. This process actually provides a great deal of both physical and logical data protection for both operational and disaster continuity. Since each tape copy is on a piece of physical media other than the primary disk, tape delivers physical data protection. Since any tape cartridge that is not in a tape drive is not in the I/O path, tape media also deliver logical data protection. Since a tape copy can be physically transported to a disaster recovery site, tape provides both physical and logical data protection for disaster continuity.
Multiple copies of the same data may be stored on tape. These are called generations. For example, suppose a full copy of a set of data is backed up on Saturday night. That is one generation. Now suppose that every night during the week, an incremental backup is made, which backs up new and changed data that particular day. On Saturday night another full backup is made. That is the second generation. And so on. In fact, many businesses used a generational scheme called grandfather-father-son. (At some point, the oldest pieces of tape media were rotated out of the generational scheme and put in what was called a scratch pool, to be used in creating a new generation of tapes.)
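The generational rotation described above can be sketched in a few lines. This is a hypothetical model of a grandfather-father-son scheme, not the logic of any particular backup product; the names and retention limit are assumptions:

```python
from collections import deque

def rotate_generation(retained, scratch, new_tape, max_generations):
    """Add the newest full-backup tape to the retained generations;
    once the limit is exceeded, the oldest tape moves to the scratch
    pool, where it waits to be reused for a new generation."""
    retained.append(new_tape)
    while len(retained) > max_generations:
        scratch.append(retained.popleft())  # oldest tape becomes scratch

retained = deque()   # tapes holding live generations (son, father, grandfather)
scratch = []         # tapes free to be overwritten
for week in range(5):
    rotate_generation(retained, scratch, f"full-week-{week}", max_generations=3)
print(list(retained))  # ['full-week-2', 'full-week-3', 'full-week-4']
print(scratch)         # ['full-week-0', 'full-week-1']
```

After five weekly full backups with three generations retained, the two oldest tapes have been rotated out for reuse, exactly as the scratch-pool practice describes.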
Since several generations of tape copy are available, no tape copy represents the only copy, and no single point of failure exists; thus, tape serves as both the "front line" and the "last line of defense" for data protection. However, this scheme by itself does not address which tapes should be kept in-house for operational recovery purposes and which should be sent off-site for disaster recovery purposes. Moreover, in many cases, a recovery would require not only the full backup but also one or more (up to perhaps six) incremental backup tapes.
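A recovery of the kind just described, the last full backup plus its incrementals applied in order, can be sketched as follows. The dict-based model of a backup is purely illustrative:

```python
def restore(full_backup, incrementals):
    """Rebuild a data set from the last full backup plus its
    incremental backups, applied oldest-first. Each backup is
    modeled as a dict of file name -> contents."""
    data = dict(full_backup)
    for inc in incrementals:   # sequencing matters: a mis-ordered
        data.update(inc)       # tape would reapply stale versions
    return data

full = {"a.txt": "v1", "b.txt": "v1"}   # Saturday full backup
mon = {"a.txt": "v2"}                   # Monday: a.txt changed
tue = {"c.txt": "v1"}                   # Tuesday: c.txt created
print(restore(full, [mon, tue]))
# {'a.txt': 'v2', 'b.txt': 'v1', 'c.txt': 'v1'}
```

Note how every incremental tape in the chain is required, and required in order, which is precisely the sequencing risk raised below.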
Apart from scalability, reliability, and manageability issues, the key concern with a tape-only solution is therefore the lack of "high" availability, where high availability might be defined as downtime measured in minutes per year (and surely no more than hours per year).
A major recovery using tape may require hours at best, a day or more as a likely occurrence, and a week or more in extreme circumstances. This is because the recovery process, called a restore process, is actually a rebuild process. Tape, as a sequential medium, is too slow to handle random online processing. Therefore, before the data can be used for online processing, the contents of tape have to be copied to disk and that can take a long time. If a mirrored disk copy were already available, the process would be a restart using the mirrored disk, and might take seconds to minutes.
The second reason that tape is not always an optimal solution for recovery is that a number of tapes are likely to be needed to restore a disk system. Problems arise if part of the physical tape system fails or requires significant resuscitation work, such as a piece of tape media with an intermittent error condition or constant read retries, or if there is a mistake in the sequencing of the tapes.
Considering the above, tape by itself is not sufficient to meet the demands of modern IT organizations. Note that the converse-disk is sufficient-also cannot be assumed to be true. Although disk plays a strong role in data protection, the proper role of tape in conjunction with disk has to be examined carefully.
Typical Data Protection Technology Today Still Leaves a Lot to Be Desired
A lot of change has taken place in data protection since the early days (Figure 3.2). As discussed, one of the primary improvements in data protection technology was the introduction of the redundant array of independent disks (RAID), which provided physical data protection for the price of one or more additional disk drives. (RAID originally stood for redundant array of inexpensive disks, but the word independent was substituted for inexpensive as the price of disks fell dramatically and the relationship of the disks to one another became more important than their cost.)
Figure 3.2 Typical data protection until recently
A number of RAID levels exist, but only a few are in common use. RAID 1 is a synonym for mirroring: for every disk that contains "original" data, a corresponding disk contains a "copy" of that data. This means that the usable disk space for RAID 1 is only 50% of the total available disk space. Parity RAID levels enable the recalculation of "lost" data in the event of a single failed disk through the use of parity check data. (Parity data is extra data that enables re-creation of all the data from a failed drive using the other drives in the RAID group.) A RAID 5 group requires only one more disk than the number of disks required to hold the working data, although the parity check data and the working data are actually spread across all the disks. RAID technology delivers dramatically improved availability over an array of unprotected individual disks, and it now typically forms the first line of physical defense.
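The parity mechanism can be made concrete with XOR, the operation single-parity RAID uses to recompute a lost block. This is a minimal sketch of the principle, not a real controller implementation:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together; the same operation
    both computes parity and rebuilds a lost block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Three data "drives" and one parity "drive":
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# If drive d1 fails, XORing the survivors with the parity recovers it:
rebuilt = xor_blocks([d0, d2, parity])
print(rebuilt == d1)  # True
```

Because XOR is its own inverse, any single missing block can be recovered from the remaining blocks plus the parity, which is why one extra drive suffices.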
Remote mirroring, a variant of RAID 1, delivers fast-restart physical disk protection at remote sites to aid in disaster continuity. A second major advance is point-in-time copy capability, which provides a fixed view of the data that is not subject to change from ongoing I/O processes. This independence enables a point-in-time copy to deliver logical data protection as of the instant the copy (in some circumstances called a snapshot copy) was taken. Since that instant is unlikely to be the instant when a logical failure occurs, the aim of point-in-time copying is to recover with minimal loss of data. Even where point-in-time capability is available, some organizations do not use it for logical data protection on the disk array itself, but rather to provide a consistent point for invoking the standard backup process.
Although data protection has significantly improved from pre-RAID days, improvements over the typical data protection configuration are still necessary for a large percentage of enterprises in all four boxes in the data protection category matrix. This necessary improvement may not be a technology issue, but rather an education, adoption, and cost issue. In other words, the technology may be available, but understanding the affordability as well as the appropriateness of using the technology has to be carefully examined.
The myriad of new data protection products that have become available over the past few years, as well as the continued evolution of data protection products and services expected in the near future, indicates that the availability of the necessary technology is probably not the primary inhibitor to implementing effective data protection for business continuity. Understanding, finding, affording, and implementing the appropriate mix of data protection technologies is more likely to be the key issue. The following sections address the state of the art in each of the four boxes in the data protection category matrix.
Operational Continuity/Physical: Generally Strong, but Some Improvement Needed
The addition of RAID technology has made physical operational continuity a strong area and, with concepts such as triple mirroring, an enterprise can buy its way to a desired level of physical availability. The operative word is "buy," as incremental gains in availability can become very expensive. The Achilles' heel of RAID technology is that a typical RAID array can tolerate only one disk failure and still protect the data. During the period in which the RAID group is being rebuilt, the data in the array is exposed to loss if a second failure occurs. And the chance of a bit error while rebuilding a large disk drive, as compared to a smaller drive, is by no means insignificant.
With that said, would not an advance in RAID technology that allows more than one failure in a RAID group be useful? The answer is yes, and it has already occurred. The general term for this technology is multiple-parity RAID; the practical implementation is RAID 6, which can tolerate a second disk failure before a rebuild completes, without loss of data. The cost of doing so can be low, as the "hot spare" typically found in RAID arrays could be put to active use as the extra drive in a RAID group. As a recommended strategy, though, keeping at least one spare drive is still desirable, because the spare can be used for rebuilding the data from a failed drive. Although there is typically a slight performance penalty, such multiple-failure-tolerating RAID technology is the closest thing to a higher-availability "free lunch" that is likely to come along soon.
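The capacity trade-off among mirroring, single parity, and double parity can be quantified with a small calculation. The figures below are illustrative assumptions, not vendor specifications:

```python
def usable_fraction(drives, redundancy_drives):
    """Fraction of a group's raw capacity left for data, given the
    equivalent number of drives consumed by redundancy."""
    return (drives - redundancy_drives) / drives

group = 10  # drives in a hypothetical RAID group
print(usable_fraction(group, group // 2))  # RAID 1 mirroring: 0.5
print(usable_fraction(group, 1))           # single parity (RAID 5): 0.9
print(usable_fraction(group, 2))           # double parity (RAID 6): 0.8
```

The step from single to double parity in this example costs only one additional drive's worth of capacity (90% usable down to 80%) in exchange for tolerating a second failure, which is the "free lunch" economics described above.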
Operational Continuity/Logical: More Attention Needs to Be Paid to Logical Data Protection
Point-in-time copy capabilities, including snapshot copy capability, have proven to be helpful for logical data protection. But point-in-time copies typically cannot be taken continuously, so some data can theoretically be lost. A powerful any-point-in-time copy capability called continuous data protection (CDP) has now become generally available. A number of other technologies, including replication technologies, virtual tape libraries, and "write-once, read-many" (WORM) technologies, are also available to aid with logical data protection. In short, tape now has a number of allies, in addition to basic point-in-time copy capability, to help with logical data protection.
Many of these technologies, such as continuous data protection and virtual tape libraries, are still relatively new, so IT organizations may be either unfamiliar with them or still in some stage of evaluating them. However, point-in-time technology has been around for quite a while and has been used successfully by a large number of organizations. Nevertheless, point-in-time functionality has not yet been adopted to the extent necessary to provide the right level of logical operational continuity. The lack of adoption may be due to an IT tendency to focus on disaster recovery in general, and on the physical side of recovery for both operational and disaster continuity, rather than on the logical side; but logical operational continuity needs its fair share of attention as part of a comprehensive data protection strategy.
Disaster Continuity/Physical: Done Well, but Cost and Distance Are Issues
Remote mirroring has proven its worth and has been justifiably successful in the data protection marketplace as a result. However, unless an enterprise already has a data center that can serve as a secondary disaster recovery site for its primary site, establishing a disaster-specific site can be quite expensive.
For many organizations, cost (network equipment, software, and the remote disk array) is a barrier to implementing synchronous remote mirroring. One reason is that many of the original synchronous remote mirroring products required that the disk array at the second site be the same model as the disk array at the first site. However, more cost-effective remote replication technologies are now available for organizations willing to make some concessions in exchange for cost savings. For example, if an organization can tolerate a performance penalty in the event of a disaster, the ability to use less expensive disk arrays as mirroring targets is an option it might find attractive.
The second issue is that the distance between a primary site and a secondary site should be targeted at 300 miles (480 kilometers) or more. Although 300 miles (480 kilometers) is an arbitrary figure, it is a distance being mandated for certain compliance activities. Even if an organization is not subject to compliance restrictions, common sense says that if you are planning a long-distance data center, there is no sense in choosing a site 250 miles (400 kilometers) away if there is any possibility of stricter compliance restrictions being imposed at a later date.
However, synchronous remote mirroring is typically used between two data centers that are no more than 100 miles (160 kilometers) apart. Once again, this is an arbitrary limit, but one that is based on experience with acceptable response-time latency for the valuable online transaction processing (OLTP) applications that can justify the expense of synchronous remote mirroring.
A sibling of synchronous remote mirroring is asynchronous remote mirroring. Asynchronous remote mirroring allows the data at a disaster recovery site to be kept relatively current with the primary site. The operative word is "relatively": the disaster recovery site may be behind by up to several minutes.
Although a potential data loss of minutes may be unacceptable for some applications (e.g., a revenue-producing order-entry OLTP application), other applications may find that exposure acceptable.
Remote mirroring techniques create an undated replica (i.e., copy) of data. A dated (i.e., time-stamped) copy of the data, such as a point-in-time copy for which the time and date of creation is known, can also be used as the basis for a replica at a remote site. Dated replication is a more cost-effective way of protecting data at a disaster recovery site for data that does not demand up-to-the-second or up-to-the-minute protection. And that could be a good deal of a company's data.
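The difference in exposure among these approaches amounts to a recovery-point calculation. The model below is a simplified, hypothetical sketch; the method names and the figures in the examples are assumptions:

```python
def worst_case_loss_minutes(method, lag_minutes=0, interval_minutes=0):
    """Worst-case data loss (recovery point) for simple replication
    schemes: a synchronous mirror loses nothing, an asynchronous
    mirror trails by its lag, and a dated (point-in-time) replica
    can lose up to one copy interval."""
    if method == "synchronous":
        return 0
    if method == "asynchronous":
        return lag_minutes
    if method == "dated":
        return interval_minutes
    raise ValueError(f"unknown method: {method}")

print(worst_case_loss_minutes("synchronous"))                  # 0
print(worst_case_loss_minutes("asynchronous", lag_minutes=5))  # 5
print(worst_case_loss_minutes("dated", interval_minutes=240))  # 240
```

Framing the choice this way makes the cost trade-off explicit: dated replication every four hours is far cheaper than synchronous mirroring, and for data that tolerates a 240-minute exposure it may be entirely adequate.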
In summary, asynchronous remote mirroring and other remote replication technologies are available to accommodate the needs of physical (and in some cases logical) data protection at a distant site. The challenge to IT organizations is how to meet the necessary data protection requirements while not having to use more remote sites than is absolutely necessary.
Disaster Continuity/Logical: The Danger of Being Under-Protected May Be Very Real
If primary site processing has to move to a secondary site because of a disaster, the former secondary site has to assume the mantle of the primary site. One of the first questions that needs to be asked is "How long will the original primary site be out of service?" If the answer is either permanently or for an extended period, the enterprise may want to implement a complete logical data protection solution if one is not already built in, and it may not be if only production disks were replicated at the disaster recovery site.
However, even if an outage may last a week or more, additional logical data protection may not be needed if the disaster recovery site replicates not only disk storage but tape storage as well. (A subset of the original tape solution may be enough in a pinch if the data center environment's configuration, e.g., its space and power, can accommodate expansion to the full solution.) If the disaster site does not replicate tape storage, a third-party disaster accommodation arrangement might suffice. Note also that disk storage for data protection purposes, such as the target for disk-based backup, may serve as either a substitute for or a complement to tape storage. A point-in-time copy (or equivalent) capability might serve as a stopgap measure, but tape (or its disk equivalent) will provide secondary physical data protection, as well as logical data protection, once backup software processes are restarted.
In any event, a strategy for logical data protection needs to be put in place now, if the organization has not already done so. There is no point in making a large investment in a disaster recovery site if there is no protection from permanent loss of data due to database corruption, accidental file deletion, virus, or other logical data protection problems.
Summing Up Data Protection Challenges by Category
IT organizations need to examine the data protection technology challenges to determine how they affect the data protection planning process within their enterprise (Table 1). These should be kept in mind when setting the objectives for data protection for the enterprise.
Table 1 Data Protection Challenges by Category
The key is to blend newer technologies with older technologies, as either substitute or complementary ingredients in the data protection "stew." RAID 6 gives a series of disk drives higher physical availability, improving physical data protection for operational continuity. The increasing traction of virtual tape libraries and continuous data protection bodes well for improving logical data protection for operational continuity. (Incidentally, both technologies also improve physical data protection, as at least one additional physical copy is made.)
In terms of disaster continuity, combining synchronous and asynchronous remote mirroring can improve physical data protection. Perform synchronous remote mirroring to an intermediate site that has only limited disk capabilities, and use asynchronous remote mirroring for the true disaster recovery site. If a true disaster hits only the primary site, the intermediate site can deliver the last few minutes of data. (If the intermediate site is also affected by the disaster, the loss of a few minutes of data is likely to be the least of a company's concerns.)
For logical data protection, the disaster recovery (DR) site has to be able to manage the data that was not protected by remote mirroring techniques. That might mean backup copies on tapes that were vaulted manually to a third-party DR facility, or remote copies from a virtual tape library or continuous data protection system that were electronically vaulted to the DR site (or created remotely at the DR site in the first place).