Just How Big Can Databases Get?

TheStreet.com maintains more than a terabyte’s worth of data in various forms – articles, alert data, company data, and trading data. While the online information service’s data store totals well into the terabyte range, the company’s data managers prefer to keep its data in a distributed format.

“We addressed the problem of having lots and lots of information on lots and lots of data servers by cutting it up into smaller segments,” Alex Spinelli, CTO of TheStreet.com, told DBTA. “Each one is a bit more manageable and allows us to have a bit more flexibility, since the information is specialized.”

AstraZeneca, on the other hand, is building a highly centralized Oracle-based clinical image repository, which is now 5TB in size – and is expected to top 100TB of data within a year.

The pharmaceutical company sees centralization as the best way to streamline regulatory compliance and clinical trial efficiency, and make best use of the imaging data and investments in imaging studies. “We did look at distributed models and other alternatives,” Goutham Edula, business lead for clinical imaging informatics at AstraZeneca, told DBTA. “But one of the key drivers at this point is to have a central database, because our main data management is centralized.”

These are two very different approaches to a common situation: data volume is growing rapidly, and companies must choose between maintaining information in federations of distributed databases and putting it all into a more centralized location.

The dilemma is far more urgent than just a few years ago, when the largest databases were only just starting to top the 1TB mark. Now, a terabyte is almost commonplace, Richard Winter, president of Winter Corp., told DBTA. In fact, a terabyte equals “only a few disk drives these days,” he noted.

Among the largest of the large, data volumes are growing beyond the 100TB threshold. Randy Lea, vice president of product and services marketing for Teradata, said his company now “has close to 30 customers that are over 100TB today, as well as 50 customers with over 50TB.” Lea estimated that the largest customer system now easily tops the 200TB mark.

Billions and Billions

Even databases commonly associated with distributed computing, such as Microsoft SQL Server, are reaching gargantuan proportions.

A survey conducted by Unisphere Research for the Professional Association for SQL Server (PASS) found that at least one out of 10 SQL Server databases now exceeds a terabyte in size. Almost a third of SQL Server sites can now support more than 500 simultaneous users. Among more highly centralized enterprise databases such as Oracle, terabytes’ worth of data are everyday business. A Unisphere Research survey conducted among members of the International Oracle Users Group (IOUG) found that 23 percent of Oracle enterprises have at least one database exceeding one terabyte in size. About four percent reported having databases exceeding 10TB.

And the data will just keep on coming. One industry study recently estimated that over the past three years, Fortune 1000 companies have on average seen their total data environments grow from 190TB to one petabyte (one million gigabytes). Another new study commissioned by EMC Corp. put the total “digital universe” at 161 billion GB (161 exabytes). This will grow at a rate of 57 percent a year to 988EB by 2010. Organizations will be responsible for the security, privacy, reliability and compliance of at least 85 percent of this information.
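
As a rough check on the growth figures above, the compound-growth arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the four-year span assumes the 161EB figure is a 2006 baseline, which the article does not state.

```python
def project(start: float, annual_rate: float, years: int) -> float:
    """Compound a starting volume forward at a fixed annual growth rate."""
    return start * (1 + annual_rate) ** years


def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Annual growth rate implied by growing from start to end over a number of years."""
    return (end / start) ** (1 / years) - 1


# Digital universe: 161EB growing 57 percent a year.
# ASSUMPTION: four years of growth (2006 baseline to 2010).
digital_universe_2010 = project(161, 0.57, 4)  # roughly 978EB, close to the cited 988EB

# Fortune 1000 average: 190TB to 1PB (1,000TB) over three years.
fortune_1000_rate = implied_annual_growth(190, 1000, 3)  # roughly 74 percent a year

print(f"Projected digital universe: {digital_universe_2010:.0f}EB")
print(f"Implied Fortune 1000 growth: {fortune_1000_rate:.0%} a year")
```

Under that assumption the 57 percent rate lands within about one percent of the 988EB projection, and the Fortune 1000 figures imply an even steeper annual growth rate.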

Companies are finding new ways to leverage these vast stores of data for competitive advantage in their markets. For example, one area where very large databases can be a key advantage is marketing. “Capturing every available data point on existing and potential customers and further augmenting those records with data purchased from third parties results in a very large database that is capable of producing very effective targeted marketing campaigns,” said Luke Lonergan, CTO of Greenplum. Or a telecommunications firm may want to capture and analyze volumes of call detail records to determine caller behavior. “The resulting analysis allows for them to customize phone plans that provide customers with what they want while maximizing profit,” Lonergan explained. In addition, new demands on business, such as compliance, are also driving the growth of data volumes. “Regulatory agencies are requiring businesses to hold data for a longer time, sometimes it will be five or seven years,” Sumit Kundu, director of product management for Sybase, told DBTA.

Managing Data Behemoths

The challenge many companies currently face is how to deploy this data. Teradata, for example, advocates economies of scale through centralization and a single point of management of data. “We’ve always believed that you get more business value by looking across the business than by looking at stovepipes in the business. It’s not just about sales that customers generate. There’s the cost of servicing those customers, and the profitability of those customers,” said Lea.

Terry Gray, CFO of Logical Information Machines, agreed, telling DBTA he sees a management advantage to moving data to larger, more centralized information stores. “A very large database can provide a sense of long-term responsibility for the data, a different mission statement, and limiting the required interaction of divisions, departments and groups. Many people are willing to share data on a regular basis. Very large database technology allows a small group to collectively create a database.”

Size Doesn’t Matter

Ultimately, however, the greatest challenges don’t arise from the size and scale of the database, but from other factors. Winter, for one, said many companies run into problems with the complexity of their queries. “I know of companies that have faced really challenging problems with databases less than a terabyte. The size of the database doesn’t tell the whole story,” he said. “There are issues with how many concurrent queries you have, how complex the schema is, how complex the queries are, whether you have a mixed workload, and what the latency of the data is.”

Spinelli of TheStreet.com agreed, noting that his greatest challenge is bringing together data with different formats from different databases, since the online service is constantly expanding its offerings through partnerships. “Unfortunately, most companies have very different formats for receiving data,” he explained. “Because partnerships are very beneficial to us, we generally want to accommodate that. But when we add distribution partners, it ends up being another query against their datasets. It’s really performance that I’m most concerned about, in terms of the growth need and scaling – not the sheer size of data.”

Alex Gorelik, CTO and founder of Exeros, also agreed that data integration, rather than the size of the database itself, is the most vexing challenge enterprises face. “When you take a large number of existing databases and try to consolidate them into a single system, discovering how data across various databases relates to each other is a very complex and difficult task,” he said. “This is because over time data in disparate systems tends to get out of sync. So when consolidating the data into a single system you have to figure out how data between systems relates so you can identify overlaps, as well as spot and eliminate inconsistencies. If consolidation is not done properly, the result is a big database of garbage that isn’t useful for anything or to anyone.”

TheStreet.com, for example, is addressing complexity challenges among its diverse infrastructure of smaller databases through grid and clustering technologies. The firm is in the process of implementing grid technology from Grid App – run over blades and Oracle Real Application Clusters (RAC) – with the goal of providing a single view of its distributed data environment of Oracle, MySQL, and partner databases. By deploying grid and clustering solutions, “we don’t have to build a giant huge database running on big subsystems,” Spinelli explained. “We can actually be very smart about building out a modular database that scales horizontally and lets us still slice and dice as we need to, and be very flexible and agile, but have it all within the same management systems. That will enable us to very quickly move in different directions.”

Backup and Recovery

Another challenge is backup and recovery. Winter observed, for instance, that when disk capacity is added for storage, the size of a data operation multiplies by a factor of six. This is the main challenge that AstraZeneca is attempting to address as it nears its goal of a 100TB database, said Edula. “Because what we’re looking at is we don’t want to archive old images. For the most part, we want to keep them alive, or online. We’re looking at different options in terms of backup and restoring, and how much time it would take to restore.” Being able to restore a 100TB database as quickly as possible, he added, “is a much bigger challenge than just backing it up.”
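
To give a feel for why restore time dominates at this scale, here is a back-of-the-envelope sketch in Python. The throughput value is purely hypothetical and is not a figure from AstraZeneca or any vendor; the function name is likewise made up for illustration, and the calculation ignores real-world factors such as parallel streams, verification, and log replay.

```python
def restore_hours(size_tb: float, throughput_gb_per_s: float) -> float:
    """Estimate hours to restore a database at a sustained throughput.

    Uses decimal units (1TB = 1,000GB) and assumes a constant transfer rate.
    """
    size_gb = size_tb * 1000
    seconds = size_gb / throughput_gb_per_s
    return seconds / 3600


# ASSUMPTION: 1GB/s sustained restore throughput, chosen only for illustration.
HYPOTHETICAL_THROUGHPUT_GB_PER_S = 1.0

for size_tb in (5, 100):  # the current and target repository sizes cited in the article
    hours = restore_hours(size_tb, HYPOTHETICAL_THROUGHPUT_GB_PER_S)
    print(f"{size_tb}TB at {HYPOTHETICAL_THROUGHPUT_GB_PER_S}GB/s: about {hours:.1f} hours")
```

At that assumed rate, the 5TB repository restores in under an hour and a half, while a 100TB repository would take more than a day, which is why restore options matter so much as the database grows.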

At the end of the day, the ability to effectively manage large stores of data leads to greater business agility, Spinelli said. That’s why he prefers to maintain large data stores in a distributed fashion, linked by grid and clustering technologies. “The landscape changes quickly for businesses such as ours,” he said. “The ability to be agile and flexible is one of the most important things I can deliver to my business.”

