Managing data has always been a critical focus of IT departments large and small. In the past, the focus was simply on building a highly available and redundant repository. This spawned new data management concepts in line with new technologies such as storage area networking (SAN) and network attached storage (NAS). Unless you've been living under a rock (albeit a digital one), these terms should be familiar. With the rate of digitally created data increasing, simply having a bucket to store data is no longer sufficient, regardless of how you attach to it or access it.
The factors driving the need for additional storage are far-reaching. A primary example is compliance. Responding to the need to keep data available for what seems like an indefinite amount of time presents unique challenges to storage vendors and IT departments alike. These repositories are bound by redundancy schemes (RAID) that increase the number of spinning disks and the associated HVAC requirements, not to mention the electrical consumption costs of the entire setup.
To fully understand the situation, let's talk numbers. According to a major analyst, storage capacity is growing at a 50.9 percent compound annual growth rate (CAGR). This growth is across fixed, transactional and replicated storage mediums. There is also significant growth in direct-attached storage (DAS) installations, projected to grow at 28 percent through 2008. In the same time period, SAN- and network-attached storage is projected to grow at 62 percent.
This continual explosion of growth has been a catalyst in methodologies designed to reduce the need for expensive and/or additional storage.
Adoption of these methodologies can be an either/or scenario, but deploying these solutions together provides a robust and comprehensive data management scheme. As an administrator, it's your job to determine the best fit for deploying these concepts, taking into account how they will enhance the ROI, performance and other attributes of your current storage investment while providing a multi-year roadmap for handling future storage needs.
The ability to provide a highly available data bucket is straightforward; however, this level of management is not enough. You can approach this in two ways: the first is to associate data with a bucket of appropriate value and create a number of smaller buckets on a diminishing value scale. Technologies such as Novell's Dynamic Storage Technology and Hierarchical Storage Management fit this bill. The second method is to diminish the overall size of the cargo contained within the bucket, thus reducing the resources needed to maintain it. Technologies in the deduplication arena fill this need.
> Information and Data Lifecycle Management:
The first to the table when talking about using data classification to manage storage is Hierarchical Storage Management, or HSM as it is commonly called. It has roots in Data Lifecycle Management (DLM), which grew out of Information Lifecycle Management (ILM). That was a mouthful; however, it will be outlined more clearly later.
Per the Storage Networking Industry Association (SNIA), Information Lifecycle Management is defined as, “...the policies, processes, practices and tools used to align the business value of information with the most appropriate and cost-effective IT infrastructure from the conception point of digital information through its final disposition. Information is aligned with business requirements through management policies and service levels associated with applications, metadata, and data.”
Simply put, ILM is not a technology; it is a set of procedures and associated technologies designed to streamline the presentation, management, dissemination and retention of digital information. Moreover, it takes into account the value of information and its relationship to the business process when setting policies such as Service Level Agreements. Although HSM and ILM are closely associated, ILM takes a top-down approach, looking at data organization throughout the entire enterprise. Conversely, HSM takes the technical perspective in executing the strategy outlined via ILM.
The roots of ILM are planted in IBM's invention of Hierarchical Storage Management (HSM). Chalk another one up for the boys in blue. As you might have guessed, everything old becomes new again: HSM is not a new technology; it celebrated its 30th anniversary in 2004. The originating catalyst for this innovation was the need to quickly move data from one medium to another in a mainframe environment. The primary situational focus was to provide an efficient means to migrate data from tape to a then-new disk-based storage medium. Although the migratory path of data has evolved to movement from expensive disk to less expensive disk, the concept remains modern.
The layer between ILM and the associated technologies of HSM, DST and deduplication is Data Lifecycle Management (DLM). DLM is the tier immediately to the right of ILM, off which the aforementioned technologies fork. Most important for the context of this audience, this is where the IT department becomes involved in classifying data in relation to location, compliance, SLAs, etc.
In a DLM solution, data migration is determined by a policy. The policy determines attributes such as the timeline for when files are migrated, which files should be moved and to which storage tier. Once files are migrated to their new home, the end user is notified of their new location. Gotcha! End user notification is not necessary because redirection is accomplished via inodes or stub files. Inodes are specific to Unix-based file systems and are best thought of as a stored description of a specific file. That description can contain file type, access rights, owners, timestamps, size and pointers to data blocks. The same holds true for stub files, which are commonly associated with the Windows file system. In a nutshell, they contain a great deal, if not all, of the metadata associated with the file in question. What they don't contain is the bloat from the actual file blocks. This is the prevailing reason why stub files are critical to the success of archival data solutions.
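The stub concept can be sketched in a few lines. The following is a minimal illustration, not any vendor's actual on-disk format; the function name and the JSON layout are hypothetical. The file's blocks move to the secondary tier, and a small metadata-only stub pointing at the new location stays behind:

```python
import json
import shutil
from pathlib import Path

def migrate_with_stub(src: Path, tier2_dir: Path) -> None:
    """Move a file's blocks to secondary storage, leaving a metadata-only stub."""
    stat = src.stat()                            # capture metadata before the move
    target = tier2_dir / src.name
    shutil.move(str(src), str(target))           # file blocks now live on tier two
    stub = {                                     # metadata only -- no file blocks
        "original_name": src.name,
        "size": stat.st_size,
        "mtime": stat.st_mtime,
        "tier2_location": str(target),
    }
    src.with_name(src.name + ".stub").write_text(json.dumps(stub))
```

The stub occupies a few hundred bytes regardless of how large the migrated file is, which is exactly why primary-tier capacity drops after migration.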
Using the real-world example of e-mail, the presence of stub files is incredibly powerful from the perspectives of both the end user and the administrator. From the administrator's point of view, a mailbox can be configured with a small space quota. In this case, small is best thought of as a limit that would cause the mailbox to fill quickly, forcing the end user to incessantly manage their environment (or call IT to beg for more space). When messages reach a certain age, they are migrated from the message store to a secondary (cheaper) storage medium. What remains is a pointer (stub file) to the original file. Depending on configuration settings, the end user may or may not know he is accessing a stub file. Stub file icons might appear as a different color or shape. Once the message/file is accessed, the clock starts again to determine when the file will be re-migrated to the secondary storage tier. The advantages of this solution are many; below are the most notable:
- Storage capacity is reduced in the primary message store.
- E-mail server performance is easier to maintain and improve.
- Manageable mailbox quotas can be maintained with minimal effort.
- End users can effortlessly retrieve archived data without the assistance of IT.
The takeaway in the e-mail scenario is this: stub files walk users to their data without consuming storage space on the primary data store. Without this virtual detour sign, the end user experience would be a management nightmare, to say the least.
It is estimated that 60 to 80 percent of an organization's data remains at rest or is never touched again once it's created. One basis for assessing the value of data is its usage. HSM provides a means of aligning the value of data with an appropriately valued storage medium.
A byproduct of HSM and associated technologies is their ability to aid the overall performance of backup and recovery software. This software requires varying degrees of integration. Simply put, with active data stored on the primary data store (tier one), that can be the only place the backup software needs to point. The potential gains of this strategy are a shortened backup window and a reduction in backup agents and in the offline media required to house the tactically focused backup. Long story short, the net is an overall reduction in cost and a higher ROI for the backup environment.
> Dynamic Storage Technology:
A new technology in the marketplace offered through Novell is Dynamic Storage Technology (DST). It will be introduced in the upcoming release of Open Enterprise Server 2, which promises to be equally compelling in its own right.
DST, much like HSM, is policy-based. This allows an administrator to define a number of criteria that must be met before a file is migrated to cheaper storage. As mentioned earlier, these can range from file type, size and original location to time frame. Differentiation from traditional HSM solutions takes place at the redirection point. In the aforementioned environment, stub files are the critical component. DST accomplishes walking the end user to the migrated data via a virtual volume, an amalgamation of both the primary and secondary storage tiers. The end user selects the file; it is then demigrated and opened. It's important to note that simply opening the virtual volume to view its contents does not demigrate data. Conveniently, when data is migrated, it happens without the end user realizing what is taking place under the hood.
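As a rough sketch of the virtual-volume idea (DST's actual internals differ; the class and method names here are purely illustrative), a merged view can list both tiers without touching any data, while opening a file that lives on the shadow tier pulls it back to the primary tier:

```python
import shutil
from pathlib import Path

class MergedVolume:
    """Presents a primary and a shadow (secondary) tier as one directory tree."""

    def __init__(self, primary: Path, shadow: Path):
        self.primary, self.shadow = primary, shadow

    def listdir(self) -> set[str]:
        # Browsing the merged view reads only directory entries -- no demigration.
        return ({p.name for p in self.primary.iterdir()} |
                {p.name for p in self.shadow.iterdir()})

    def open(self, name: str):
        primary_path = self.primary / name
        if not primary_path.exists():
            # Accessing a shadowed file demigrates it back to the primary tier.
            shutil.move(str(self.shadow / name), str(primary_path))
        return primary_path.open()
```

Note how `listdir` never moves data, matching the point above that merely viewing the volume's contents does not demigrate anything; only `open` does.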
Advantages of this scenario are the reduction of moving parts and the improved end user experience. On the business side of the equation, this technology is embedded in Open Enterprise Server 2. Prior to this, traditional HSM products were available only through third parties. Until DST, the same vendor providing the operating system, file system and workgroup feature enhancements was not providing a DLM solution. This tight integration substantially reduces the acquisition costs of the solution.
> Data Deduplication:
When looking at technologies underneath DLM, deduplication is one of the most compelling. Unlike HSM and DST, whose primary focus is the end-user-facing part of the data center, deduplication technology has been focused on the back end. Often an HSM solution works in concert with a disk-to-disk-to-tape backup solution; this deployment is secondary to its initial inception as a data mover between disparate storage mediums. On the flip side of the coin reside deduplication technologies, which have been focused on scenarios such as backup and recovery, and file transport optimization in concert with overall WAN performance optimization.
> What's Expensive Storage?
The classification of storage as expensive is subjective at best. What enterprise A classifies as expensive is not necessarily true for enterprise B. At a high level, expensive storage is best thought of in the context of performance and redundancy. Accordingly, at this layer, RAID levels such as 5, 6 and 10 are commonly configured. The immediate trade-off in these scenarios is a reduction in overall capacity to compensate for the redundancy.
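That capacity trade-off is easy to quantify. Assuming the common cases of single-parity RAID 5, double-parity RAID 6 and mirrored-pair RAID 10 (and ignoring hot spares and formatting overhead), usable capacity works out as follows:

```python
def usable_capacity(raid_level: str, disks: int, disk_tb: float) -> float:
    """Rough usable capacity after redundancy overhead (ignores hot spares)."""
    if raid_level == "5":        # one disk's worth of capacity goes to parity
        return (disks - 1) * disk_tb
    if raid_level == "6":        # two disks' worth of capacity goes to parity
        return (disks - 2) * disk_tb
    if raid_level == "10":       # mirrored pairs: half the raw capacity
        return disks * disk_tb / 2
    raise ValueError(f"unsupported RAID level: {raid_level}")
```

For example, eight 1 TB drives yield 7 TB usable at RAID 5, 6 TB at RAID 6 and only 4 TB at RAID 10, which is exactly why every terabyte kept on this tier is expensive.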
The second tier, or cheaper storage tier, is denoted by drive technologies such as SATA and could be presented via iSCSI as opposed to Fibre Channel (FC). iSCSI is becoming a viable option as the technology migrates closer to the mainstream of storage presentation. At this tier, redundancy and performance are not the primary concerns; basic data availability is, because active data is housed on the first tier. Tier two is chock-full of data that has already been backed up and is otherwise considered stale. This tier of less expensive storage can be configured as the last destination for data before it is migrated offline to tape. Outside of the concepts introduced in this article, integration between HSM and a Content Addressable Storage (CAS) solution is normally required to accomplish this final migration.
The Data Domain Operating System (DD OS) was developed by Data Domain, one of many companies addressing storage consumption issues from the backup target perspective. According to analyst reports, it is number one in the deduplication market, while EMC's Avamar solution runs a close second. Either way, this operating system and others like it are designed to look for common patterns or blocks that have been previously stored. Once the scan is complete, only unique data sequences are stored, with pointers replacing the duplicates, which in turn reduces the need for endless storage capacity. The depth of this pattern matching is aggregated over the number of times the DD OS sees the data.
Simply put, the more times it sees the same data the better the compression will be. Remember earlier when we talked about 60 to 80 percent of data remaining unchanged? It is not uncommon to achieve a compression ratio better than 20:1. When deduplication technology is employed in virtual tape libraries (VTL), some manufacturers report as much as a 50:1 compression ratio. At best this should be analyzed with a discerning eye.
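A minimal sketch of the mechanism follows. Fixed-size blocks and SHA-256 digests are illustrative choices only; commercial systems such as DD OS use variable-length segments and far more sophisticated indexing. Each unique block is stored once, and every repeat is reduced to a short reference:

```python
import hashlib

class DedupStore:
    """Stores each unique fixed-size block once; duplicates become references."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks: dict[str, bytes] = {}       # digest -> unique block data

    def write(self, data: bytes) -> list[str]:
        refs = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # stored once, however often seen
            refs.append(digest)
        return refs

    def read(self, refs: list[str]) -> bytes:
        return b"".join(self.blocks[d] for d in refs)
```

Writing ten identical 4 KB blocks through this store keeps one block plus ten digests, which is the same "more repeats, better ratio" effect described above, just at toy scale.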
Although deduplication has been popularly deployed at the backup target, it can live in other locations throughout the enterprise or data path. A trend in file optimization has been growing during the past two years. The catalyst for this trend is, in part, a flattening world: offices are becoming smaller and users more dispersed. Simultaneously, the value of and need to share data across the connecting infrastructure is as important as it ever was. One method for improving file access times between disparate LANs was the purchase of additional bandwidth coupled with some creative QoS. This method was not optimal in the categories of performance, maintenance and expense. A more effective method has since been developed through the use of deduplication at the communication protocol level.
In the WAN performance optimization example, the TCP stream is scanned for redundancies so as to minimize the number of round trips between endpoints. Simply put, the deltas, or changes, represent the traffic that traverses the WAN. Because information is interrogated at a byte level as opposed to a file level, differing files, applications and protocols can simultaneously take advantage of the technology.
Although the term deduplication is the new 'What's Hot' in storage technology, it isn't cutting edge. Products and features currently on the market, namely Novell iFolder and Microsoft SIS (Single Instance Storage), have components of deduplication in them. Although they are designed to scratch a separate set of itches, they address the basic need to maximize current bandwidth while reducing reliance on purchasing more. Depending on where it's deployed, the concept of deduplication speaks to fully leveraging one's existing storage investment. It can successfully reduce the need to purchase additional storage while quietly reducing the need for expensive storage as an ancillary benefit.
To summarize, DST, deduplication and HSM technologies are deeply rooted in the reduction of expensive storage while simultaneously improving the utilization of your current storage investment. In concert, they can provide a powerful solution that will stem the requirement to purchase additional storage year over year. There are a number of compelling and modern technologies that encompass the ILM, and more specifically the DLM, landscape. These technologies have often lived in technological silos; however, this trend is becoming less common. The need to add, manage and provision storage across all industries is built into the cost of doing business. IT departments that can tame their storage costs and align them with the business are sure to maintain and improve their position in the global market.