Just How Big Can Databases Get?
TheStreet.com maintains more than a terabyte’s worth of data in various forms – articles, alert data, company data, and trading data. But even as the online information service’s data store grows well into the terabyte range, the company’s data managers prefer to keep it distributed.
“We addressed the problem of having lots and lots of information on lots and lots of data servers by cutting it up into smaller segments,” Alex Spinelli, CTO of TheStreet.com, told DBTA. “Each one is a bit more manageable and allows us to have a bit more flexibility, since the information is specialized.”
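As a rough sketch of the segmentation Spinelli describes – splitting records across smaller, specialized data stores – consider the following Python fragment. The host names and the hash-based routing rule are illustrative assumptions, not TheStreet.com’s actual setup.

```python
# Hypothetical routing of records to smaller, specialized data stores,
# rather than one monolithic database. Hosts and rules are illustrative.
SEGMENTS = {
    "articles": "articles-db.internal",
    "alerts": "alerts-db.internal",
    "companies": "company-db.internal",
    "trading": "trading-db.internal",
}

def route(record_type: str, record_id: int, shard_count: int = 4) -> str:
    """Pick a host for a record: first by content type, then by a
    simple hash so each segment stays at a manageable size."""
    host = SEGMENTS[record_type]
    shard = record_id % shard_count  # naive hash partitioning
    return f"{host}/shard{shard}"

print(route("articles", 1017))  # -> articles-db.internal/shard1
```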
AstraZeneca, on the other hand, is building a highly centralized Oracle-based clinical image repository, which is now 5TB in size – and is expected to top 100TB of data within a year.
The pharmaceutical company sees centralization as the best way to streamline regulatory compliance and clinical trial efficiency, and make best use of the imaging data and investments in imaging studies. “We did look at distributed models and other alternatives,” Goutham Edula, business lead for clinical imaging informatics at AstraZeneca, told DBTA. “But one of the key drivers at this point is to have a central database, because our main data management is centralized.”
These are two very different approaches to a common situation – data volume is growing rapidly, and companies face a choice between maintaining information in federations of distributed databases or consolidating it all in a more centralized location.
The dilemma is far more urgent than just a few years ago, when the largest databases were only just starting to top the 1TB mark. Now, a terabyte is almost commonplace, Richard Winter, president of Winter Corp., told DBTA. In fact, a terabyte equals “only a few disk drives these days,” he noted.
Among the largest of the large, data volumes are growing beyond the 100TB threshold. Randy Lea, vice president of product and services marketing for Teradata, said his company now “has close to 30 customers that are over 100TB today, as well as 50 customers with over 50TB.” Lea estimated that the largest customer system now easily tops the 200TB mark.
Billions and Billions
Even databases commonly associated with distributed computing, such as Microsoft SQL Server, are growing to gargantuan proportions.
A survey conducted by Unisphere Research for the Professional Association for SQL Server (PASS) found that at least one out of 10 SQL Server databases now exceeds a terabyte in size. Almost a third of SQL Server sites can now support more than 500 simultaneous users. Among more highly centralized enterprise databases such as Oracle, terabytes’ worth of data are everyday business. A Unisphere Research survey conducted among members of the International Oracle Users Group (IOUG) found that 23 percent of Oracle enterprises have at least one database exceeding one terabyte in size. About four percent reported having databases exceeding 10TB.
And the data will just keep on coming. One industry study recently estimated that over the past three years, Fortune 1000 companies have on average seen their total data environments grow from 190TB to one petabyte (one million gigabytes). Another new study commissioned by EMC Corp. put the total “digital universe” at 161 billion GB (161 exabytes). This will grow at a rate of 57 percent a year to 988EB by 2010. Organizations will be responsible for the security, privacy, reliability and compliance of at least 85 percent of this information.
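The EMC-commissioned projection is easy to sanity-check. Assuming the 161EB figure is a 2006 baseline with four years of compounding at 57 percent a year:

```python
# Sanity check on the "digital universe" projection, assuming 161EB is
# the 2006 baseline and growth compounds annually for four years.
base_eb = 161
annual_growth = 1.57
projected_eb = base_eb * annual_growth ** 4
print(f"{projected_eb:.0f}EB by 2010")  # ~978EB, in line with the 988EB figure
```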
Companies are finding new ways to leverage these vast stores of data for competitive advantage in their markets. For example, one area where very large databases can be a key advantage is marketing. “Capturing every available data point on existing and potential customers and further augmenting those records with data purchased from third parties results in a very large database that is capable of producing very effective targeted marketing campaigns,” said Luke Lonergan, CTO of Greenplum. Or a telecommunications firm may want to capture and analyze volumes of call detail records to determine caller behavior. “The resulting analysis allows them to customize phone plans that provide customers with what they want while maximizing profit,” Lonergan explained. In addition, new demands on business, such as compliance, are also driving the growth of data volumes. “Regulatory agencies are requiring businesses to hold data for a longer time, sometimes five or seven years,” Sumit Kundu, director of product management for Sybase, told DBTA.
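Lonergan’s call-record scenario can be sketched in a few lines. The record layout, field names, and plan labels below are invented for illustration, not drawn from any carrier’s system.

```python
# Toy version of call-detail-record analysis: bucket per-caller minutes
# into peak and off-peak usage, then match a plan to the behavior.
from collections import defaultdict

# (caller, minutes, off_peak) - stand-ins for real CDR fields
cdrs = [
    ("555-0101", 12, True), ("555-0101", 45, False),
    ("555-0199", 3, True), ("555-0199", 7, True),
]

usage = defaultdict(lambda: {"peak": 0, "off_peak": 0})
for caller, minutes, off_peak in cdrs:
    bucket = "off_peak" if off_peak else "peak"
    usage[caller][bucket] += minutes

for caller, buckets in usage.items():
    # A carrier might steer heavy off-peak callers toward an evening plan.
    plan = "evening" if buckets["off_peak"] > buckets["peak"] else "anytime"
    print(caller, dict(buckets), "->", plan)
```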
Managing Data Behemoths
The challenge many companies currently face is how to deploy this data. Teradata, for example, advocates economies of scale through centralization and a single point of management of data. “We’ve always believed that you get more business value by looking across the business than by looking at stovepipes in the business. It’s not just about sales that customers generate. There’s the cost of servicing those customers, and the profitability of those customers,” said Lea.
Terry Gray, CFO of Logical Information Machines, agreed, telling DBTA he sees a management advantage to moving data to larger, more centralized information stores. “A very large database can provide a sense of long-term responsibility for the data, a different mission statement, and limiting the required interaction of divisions, departments and groups. Many people are willing to share data on a regular basis. Very large database technology allows a small group to collectively create a database.”
Size Doesn’t Matter
Ultimately, however, the greatest challenges don’t arise from the size and scale of the database, but from other factors. Winter, for one, said many companies run into problems with the complexity of their queries. “I know of companies that have faced really challenging problems with databases less than a terabyte. The size of the database doesn’t tell the whole story,” he said. “There are issues with how many concurrent queries you have, how complex the schema is, how complex the queries are, whether you have a mixed workload, and what the latency of the data is.”
Spinelli of TheStreet.com agreed, noting that his greatest challenge is bringing together data with different formats from different databases, since the online service is constantly expanding its offerings through partnerships. “Unfortunately, most companies have very different formats for receiving data,” he explained. “Because partnerships are very beneficial to us, we generally want to accommodate that. But when we add distribution partners, it ends up being another query against their datasets. It’s really performance that I’m most concerned about, in terms of the growth need and scaling – not the sheer size of data.”
Alex Gorelik, CTO and founder of Exeros, also agreed that data integration, rather than the size of the database itself, is the most vexing challenge enterprises face. “When you take a large number of existing databases and try to consolidate them into a single system, discovering how data across various databases relates to each other is a very complex and difficult task,” he said. “This is because over time data in disparate systems tends to get out of sync. So when consolidating the data into a single system you have to figure out how data between systems relates so you can identify overlaps, as well as spot and eliminate inconsistencies. If consolidation is not done properly, the result is a big database of garbage that isn’t useful for anything or to anyone.”
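A toy version of the overlap-and-inconsistency problem Gorelik describes might look like the following, where the same customer key carries conflicting attributes in two source systems. All records here are invented.

```python
# Two systems hold overlapping records that have drifted out of sync;
# the merge must surface the conflicts rather than silently load garbage.
system_a = {"C100": {"name": "Acme Corp", "phone": "212-555-0100"}}
system_b = {"C100": {"name": "Acme Corporation", "phone": "212-555-0100"}}

def find_conflicts(a: dict, b: dict) -> list:
    """Report keys present in both systems whose attributes disagree."""
    conflicts = []
    for key in a.keys() & b.keys():          # overlap between systems
        for field in a[key].keys() & b[key].keys():
            if a[key][field] != b[key][field]:
                conflicts.append((key, field, a[key][field], b[key][field]))
    return conflicts

print(find_conflicts(system_a, system_b))
# [('C100', 'name', 'Acme Corp', 'Acme Corporation')] - flag for review
```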
TheStreet.com, for example, is addressing complexity challenges across its diverse infrastructure of smaller databases through grid and clustering technologies. The firm is in the process of implementing grid technology from GridApp – run over blades and Oracle Real Application Clusters (RAC) – with the goal of providing a single view of its distributed data environment of Oracle, MySQL, and partner databases. By deploying grid and clustering solutions, “we don’t have to build a giant huge database running on big subsystems,” Spinelli explained. “We can actually be very smart about building out a modular database that scales horizontally and lets us still slice and dice as we need to, and be very flexible and agile, but have it all within the same management systems. That will enable us to very quickly move in different directions.”
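The “single view” Spinelli describes amounts to a scatter-gather pattern: fan each query out to every modular database and merge the results. A minimal sketch follows, with made-up hosts and a stubbed driver call rather than GridApp’s actual interface.

```python
# Scatter-gather over modular databases: run the same query on every
# shard in parallel and merge, so callers see one logical database.
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["oracle-rac-1", "oracle-rac-2", "mysql-partner-1"]  # made-up hosts

def query_shard(host: str, sql: str) -> list:
    # Stand-in for a real driver call against one modular database.
    return [f"{host}: row matching {sql!r}"]

def federated_query(sql: str) -> list:
    """Fan the query out to all shards in parallel and merge the rows."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda host: query_shard(host, sql), SHARDS))
    return [row for part in parts for row in part]

print(federated_query("SELECT * FROM quotes WHERE symbol = 'TSCM'"))
```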
Backup and Recovery
Another challenge is backup and recovery. Winter observed, for instance, that when disk capacity is added for storage, the size of the overall data operation multiplies by a factor of six. This is the main challenge AstraZeneca is attempting to address as it nears its goal of a 100TB database, said Edula. “Because what we’re looking at is we don’t want to archive old images. For the most part, we want to keep them alive, or online. We’re looking at different options in terms of backup and restoring, and how much time it would take to restore.” Backing up a 100TB database and being able to restore it as quickly as possible, he added, “is a much bigger challenge than just backing it up.”
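The restore-time arithmetic behind that concern is stark. Assuming, purely for illustration, a sustained restore throughput of 1GB per second – an assumed figure, not AstraZeneca’s actual number:

```python
# Back-of-the-envelope restore time for a 100TB database.
db_size_tb = 100
restore_rate_gb_per_sec = 1.0  # assumed sustained throughput, for illustration

seconds = db_size_tb * 1024 / restore_rate_gb_per_sec
print(f"restore time: {seconds / 3600:.0f} hours")  # ~28 hours at 1GB/s
```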
At the end of the day, the ability to effectively manage large stores of data leads to greater business agility, Spinelli said. That’s why he prefers to maintain large data stores in a distributed fashion, linked by grid and clustering technologies. “The landscape changes quickly for businesses such as ours,” he said. “The ability to be agile and flexible is one of the most important things I can deliver to my business.”