Research@DBTA

Just How Big Can Databases Get?

TheStreet.com maintains more than a terabyte’s worth of data in various forms – articles, alert data, company data, and trading data. While the online information service’s data store totals well into the terabyte range, the company’s data managers prefer to keep its data in a distributed format.

“We addressed the problem of having lots and lots of information on lots and lots of data servers by cutting it up into smaller segments,” Alex Spinelli, CTO of TheStreet.com, told DBTA. “Each one is a bit more manageable and allows us to have a bit more flexibility, since the information is specialized.”

AstraZeneca, on the other hand, is building a highly centralized Oracle-based clinical image repository, which is now 5TB in size – and is expected to top 100TB of data within a year.

The pharmaceutical company sees centralization as the best way to streamline regulatory compliance and clinical trial efficiency, and make best use of the imaging data and investments in imaging studies. “We did look at distributed models and other alternatives,” Goutham Edula, business lead for clinical imaging informatics at AstraZeneca, told DBTA. “But one of the key drivers at this point is to have a central database, because our main data management is centralized.”

These are two very different approaches, but with a common situation – data volume is growing rapidly, and companies are faced with the choice of maintaining information in federations of distributed databases, or putting it all into a more centralized location.

The dilemma is far more urgent than just a few years ago, when the largest databases were only just starting to top the 1TB mark. Now, a terabyte is almost commonplace, Richard Winter, president of Winter Corp., told DBTA. In fact, a terabyte equals “only a few disk drives these days,” he noted.

Among the largest of the large, data volumes are growing beyond the 100TB threshold. Randy Lea, vice president of product and services marketing for Teradata, said his company now “has close to 30 customers that are over 100TB today, as well as 50 customers with over 50TB.” Lea estimated that the largest customer system now easily tops the 200TB mark.

Billions and Billions

Even databases commonly associated with distributed computing, such as Microsoft SQL Server, are growing to gargantuan proportions.

A survey conducted by Unisphere Research for the Professional Association for SQL Server (PASS) found that at least one out of 10 SQL Server databases now exceeds a terabyte in size. Almost a third of SQL Server sites can now support more than 500 simultaneous users. Among more highly centralized enterprise databases such as Oracle, terabytes’ worth of data are everyday business. A Unisphere Research survey conducted among members of the International Oracle Users Group (IOUG) found that 23 percent of Oracle enterprises have at least one database exceeding one terabyte in size. About four percent reported having databases exceeding 10TB.

And the data will just keep on coming. One industry study recently estimated that over the past three years, Fortune 1000 companies have on average seen their total data environments grow from 190TB to one petabyte (one million gigabytes). Another new study commissioned by EMC Corp. put the total “digital universe” at 161 billion GB (161 exabytes). This will grow at a rate of 57 percent a year to 988EB by 2010. Organizations will be responsible for the security, privacy, reliability and compliance of at least 85 percent of this information.
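The growth figures above can be sanity-checked with a few lines of compound-growth arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the four-year horizon is an assumption (the passage gives the 2010 endpoint but not the base year of the 161EB estimate).

```python
def project(start_eb: float, annual_rate: float, years: int) -> float:
    """Compound a starting volume forward at a fixed annual growth rate."""
    return start_eb * (1 + annual_rate) ** years


def implied_rate(start_eb: float, end_eb: float, years: int) -> float:
    """Annual growth rate that takes start_eb to end_eb in the given years."""
    return (end_eb / start_eb) ** (1 / years) - 1


YEARS = 4  # assumption: four years of growth between the estimate and 2010

# 161EB compounding at 57 percent a year
print(round(project(161, 0.57, YEARS)))

# Rate needed to reach 988EB from 161EB over the same horizon
print(round(implied_rate(161, 988, YEARS) * 100, 1))
```

Under that assumption, 57 percent a year yields about 978EB, and hitting 988EB exactly implies roughly 57.4 percent a year, so the two published figures are consistent once rounding is allowed for.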

Companies are finding new ways to leverage these vast stores of data for competitive advantage in their markets. For example, one area where very large databases can be a key advantage is marketing. “Capturing every available data point on existing and potential customers and further augmenting those records with data purchased from third parties results in a very large database that is capable of producing very effective targeted marketing campaigns,” said Luke Lonergan, CTO of Greenplum. Or a telecommunications firm may want to capture and analyze volumes of call detail records to determine caller behavior. “The resulting analysis allows for them to customize phone plans that provide customers with what they want while maximizing profit,” Lonergan explained. In addition, new demands on business, such as compliance, are also driving the growth of data volumes. “Regulatory agencies are requiring businesses to hold data for a longer time, sometimes it will be five or seven years,” Sumit Kundu, director of product management for Sybase, told DBTA.

Managing Data Behemoths

The challenge many companies currently face is how to deploy this data. Teradata, for example, advocates economies of scale through centralization and a single point of management of data. “We’ve always believed that you get more business value by looking across the business than by looking at stovepipes in the business. It’s not just about sales that customers generate. There’s the cost of servicing those customers, and the profitability of those customers,” said Lea.

Terry Gray, CFO of Logical Information Machines, agreed, telling DBTA he sees a management advantage to moving data to larger, more centralized information stores. “A very large database can provide a sense of long-term responsibility for the data, a different mission statement, and limiting the required interaction of divisions, departments and groups. Many people are willing to share data on a regular basis. Very large database technology allows a small group to collectively create a database.”

Size Doesn’t Matter

Ultimately, however, the greatest challenges don’t arise from the size and scale of the database, but from other factors. Winter, for one, said many companies run into problems with the complexity of their queries. “I know of companies that have faced really challenging problems with databases less than a terabyte. The size of the database doesn’t tell the whole story,” he said. “There are issues with how many concurrent queries you have, how complex the schema is, how complex the queries are, whether you have a mixed workload, and what the latency of the data is.”

Spinelli of TheStreet.com agreed, noting that his greatest challenge is bringing together data with different formats from different databases, since the online service is constantly expanding its offerings through partnerships. “Unfortunately, most companies have very different formats for receiving data,” he explained. “Because partnerships are very beneficial to us, we generally want to accommodate that. But when we add distribution partners, it ends up being another query against their datasets. It’s really performance that I’m most concerned about, in terms of the growth need and scaling – not the sheer size of data.”

Alex Gorelik, CTO and founder of Exeros, also agreed that data integration, rather than the size of the database itself, is the most vexing challenge enterprises face. “When you take a large number of existing databases and try to consolidate them into a single system, discovering how data across various databases relates to each other is a very complex and difficult task,” he said. “This is because over time data in disparate systems tends to get out of sync. So when consolidating the data into a single system you have to figure out how data between systems relates so you can identify overlaps, as well as spot and eliminate inconsistencies. If consolidation is not done properly, the result is a big database of garbage that isn’t useful for anything or to anyone.”
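To make the overlap-and-inconsistency problem concrete, here is a minimal sketch of one reconciliation step. It is not Exeros’ method or any vendor’s product: the `reconcile` function and the sample records are hypothetical, and it assumes both systems already share a reliable record key, which is often the hardest part in practice.

```python
def reconcile(system_a: dict, system_b: dict):
    """Compare two keyed record sets.

    Returns the set of keys present in both systems, plus a dict of
    the overlapping records whose field values disagree.
    """
    overlap = system_a.keys() & system_b.keys()
    conflicts = {}
    for key in sorted(overlap):
        rec_a, rec_b = system_a[key], system_b[key]
        # Collect every field whose value differs between the two copies
        diffs = {
            field: (rec_a.get(field), rec_b.get(field))
            for field in rec_a.keys() | rec_b.keys()
            if rec_a.get(field) != rec_b.get(field)
        }
        if diffs:
            conflicts[key] = diffs
    return overlap, conflicts


# Hypothetical customer records held in two separate databases
billing = {"c1": {"name": "Ann Lee", "city": "New York"},
           "c2": {"name": "Raj Rao", "city": "Austin"}}
support = {"c1": {"name": "Ann Lee", "city": "Boston"},
           "c3": {"name": "Li Wei", "city": "Denver"}}

overlap, conflicts = reconcile(billing, support)
print(overlap)    # keys found in both systems
print(conflicts)  # fields that have drifted out of sync
```

Records flagged in `conflicts` are the ones that would need a resolution rule before consolidation; loading both copies unexamined is how the “big database of garbage” comes about.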

TheStreet.com, for example, is addressing complexity challenges among its diverse infrastructure of smaller databases through grid and clustering technologies. The firm is in the process of implementing grid technology from Grid App – run over blades and Oracle Real Application Clusters (RAC) – with the goal of providing a single view of its distributed data environment of Oracle, MySQL, and partner databases. By deploying grid and clustering solutions, “we don’t have to build a giant huge database running on big subsystems,” Spinelli explained. “We can actually be very smart about building out a modular database that scales horizontally and lets us still slice and dice as we need to, and be very flexible and agile, but have it all within the same management systems. That will enable us to very quickly move in different directions.”

Backup and Recovery

Another challenge is backup and recovery. Winter observed, for instance, that when disk capacity is added for storage, the size of a data operation multiplies by a factor of six. This is the main challenge that AstraZeneca is attempting to address as it nears its goal of a 100TB database, said Edula. “Because what we’re looking at is we don’t want to archive old images. For the most part, we want to keep them alive, or online. We’re looking at different options in terms of backup and restoring, and how much time it would take to restore.” Being able to restore a 100TB database as quickly as possible “is a much bigger challenge than just backing it up.”
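A back-of-the-envelope calculation shows why restore time dominates at this scale. The sketch below is a rough estimate only: the function name is made up for illustration, and the sustained throughput is a hypothetical value, not a figure from AstraZeneca or any vendor.

```python
def restore_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to stream a database back at a sustained rate."""
    size_mb = size_tb * 1_000_000  # decimal units: 1TB = 1,000,000MB
    seconds = size_mb / throughput_mb_per_s
    return seconds / 3600


# Hypothetical sustained restore rate of 1,000MB per second
print(round(restore_hours(100, 1000), 1))
```

At that assumed rate, a 100TB restore takes more than a full day of uninterrupted streaming, before any log replay or verification.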

At the end of the day, the ability to effectively manage large stores of data leads to greater business agility, Spinelli said. That’s why he prefers to maintain large data stores in a distributed fashion, linked by grid and clustering technologies. “The landscape changes quickly for businesses such as ours,” he said. “The ability to be agile and flexible is one of the most important things I can deliver to my business.”

‘Blades on Wheels’ — Plug ‘n Play Data Centers

Some of the major IT vendors have taken a new tack in competing for the heart and soul of the data center. They’re offering densely packed data centers-in-a-box that literally can be rolled up to any corporate location. Could this be the answer to the burgeoning space and power requirements that are said to be afflicting many growing data centers? Or is it an example of vendors running in the wrong direction?

There’s nothing new about the idea of mobile data centers. For the past decade or so, disaster recovery vendors have offered fully contained, trailer-based mobile data centers that could be rolled up to customer sites to provide continuity to disrupted IT operations. Now, companies such as Sun, IBM, Dell, Rackable Systems, and even Microsoft are proposing that containerized data centers be pluggable into enterprises for any purpose – such as expansion – on short notice. Think of it as a very large “blade” on wheels.

The growing prevalence of commodity hardware and software makes this economically feasible, said Microsoft’s James Hamilton in a recent position paper. “These commodity clusters are far less expensive than the systems they replace, but they can bring new administrative costs in addition to heat and power-density challenges…. [Hamilton proposed] a data center architecture based upon macro-modules of standard shipping containers that optimizes how server systems are acquired, administered, and later recycled.”

As Hamilton put it, a standard 20x8x8-foot shipping container is “ideal” for this purpose, since not only is it rugged and built to withstand ocean voyages, but also “relatively inexpensive and environmentally robust.” Upon delivery to a site, a data center container could simply be connected to the network and chilled water, and powered up. Each container can be fully equipped with networking gear, compute nodes, and persistent storage. Hamilton even went as far as to predict that these plug-and-play data centers won’t even need any maintenance or servicing, since they will be loaded with redundant components – “the entire module just slowly degrades over time as more and more systems suffer non-recoverable hardware errors. Even with 50 unrecoverable hardware failures, a 1,000 system module is still operating with 95 percent of its original design capacity.”
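Hamilton’s 95 percent figure is simple proportional arithmetic, as the short sketch below shows. It assumes every system contributes equally to the module’s capacity, which the quote implies but does not state outright; the function name is made up for illustration.

```python
def remaining_capacity(total_systems: int, failed_systems: int) -> float:
    """Fraction of design capacity left after some systems have failed.

    Assumes each system contributes an equal share of capacity.
    """
    if not 0 <= failed_systems <= total_systems:
        raise ValueError("failed_systems must be between 0 and total_systems")
    return (total_systems - failed_systems) / total_systems


# 50 dead systems in a 1,000-system container
print(f"{remaining_capacity(1000, 50):.0%}")
```

The same function shows how slowly the module degrades under this model: it takes 100 failures to lose a tenth of the container’s capacity.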

Sun Microsystems – a company always trying its best to shake up the established order – already unveiled its own container-based data center. Last October, the vendor announced “Project Blackbox,” which is scheduled to be made available by midyear. The current prototype could support about 250 servers, up to 1.5 petabytes of disk storage, 7TB of memory, and support up to 10,000 simultaneous desktop users. The interior resembles the inside of a space station. One observer remarked he was half expecting the system to utter “Hello, Dave” when he walked in to take a tour – à la the interactive HAL 9000 supercomputer of 2001: A Space Odyssey.

The economies of scale – mass-produced data centers in a box – are compelling, but in the age of emerging software-as-a-service, on-demand, and in-the-cloud computing approaches, are they really needed, since companies may increasingly turn to outside resources for IT requirements anyway? Proponents of containerized data centers say that trends such as SaaS and Web 2.0 are actually what are driving the market for these solutions. Companies that provide such services constantly need to keep beefing up their data centers to take on new customers. However, beyond the service providers themselves, how much of a market will there be?

A trend going in the opposite direction is the increasing compute power constantly being packed into smaller and denser chips, as well as geometrically growing storage capacity that shows no signs of abating. It’s not inconceivable that all the functions of an average data center could eventually be packed into a box the size of a desktop computer. Consider all the scientific and high-end applications that once required Unix computers that now can be run on a laptop. There’s also IBM’s consolidation pitch, in which thousands of distributed servers could be virtually replaced by a single System z mainframe.

If a company can get a refrigerator-sized mainframe box to support most of its business, why would there be a need to ship in an entire trailer of additional computing?

Is the Data Center Power Drain an Urban Myth?

Are data centers the new resource-draining ‘factories’ of the 21st century? It’s no secret that large data center operators such as Google, Yahoo!, Microsoft, and Amazon recently made strategic location decisions to position their latest data center construction projects within range of hydroelectric power, which is far cheaper and more abundant than traditional electricity from coal-fired plants.

One company, Equinix, is reported to be building a data center in Chicago with a server farm drawing up to 30 megawatts of power, which is enough electricity to power 30,000 houses. Some analysts estimate that a typical single server rack now demands up to 15 kilowatts of electricity – up from three just a few years back.

A new study from the Lawrence Berkeley National Laboratory, underwritten by AMD, says as much. The study determined that electricity used by server computers doubled between 2000 and 2005. The report stated that this surge in power consumption is largely attributable to the proliferation of cheap servers, lending credence to IBM’s argument that server farms should be consolidated on System z machines.

The study also estimated that the total electricity bill for operating those servers and associated infrastructure in 2005 was about $7.2 billion worldwide ($2.7 billion for the U.S. alone). Servers and the infrastructure used to maintain these machines use about 45 billion kilowatt hours a year. That’s equivalent to the amount of power used by the state of Mississippi in 2005. Additional devices such as storage, network equipment, and client front-ends were not included in the calculations.

It all sounds like a lot, and one can be forgiven for thinking that fast-growing data centers are about to suck our electric grid dry. But the Lawrence Berkeley study also pointed out that total power used by servers represented only about 0.6 percent of total U.S. electricity consumption in 2005. When cooling and auxiliary infrastructure are included, that number grows to 1.2 percent, ‘an amount comparable to that for color televisions,’ the study said. This is equivalent (in capacity terms) to about five 1,000-megawatt power plants for the U.S. and 14 such plants for the world.
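The power-plant comparison can be reproduced from the annual consumption figure. The sketch below makes two assumptions: that the 45 billion kilowatt-hour figure cited earlier refers to U.S. consumption (the article does not say so explicitly), and that “capacity terms” means plants running flat out all year. The function names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def average_power_gw(kwh_per_year: float) -> float:
    """Convert annual energy use (kWh) to average continuous power (GW)."""
    kilowatts = kwh_per_year / HOURS_PER_YEAR
    return kilowatts / 1_000_000  # 1GW = 1,000,000kW


def plants_needed(kwh_per_year: float, plant_mw: float = 1000.0) -> float:
    """Number of plants of the given size running continuously."""
    return average_power_gw(kwh_per_year) * 1000 / plant_mw


# 45 billion kWh a year for servers plus supporting infrastructure
print(round(plants_needed(45e9), 1))
```

The result is a little over five 1,000-megawatt plants, in line with the study’s figure for the U.S.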

James Bushnell, research director of the University of California Energy Institute, summed it all up this way: The notion of data centers being the new industrial resource hogs is ‘an urban myth.’

Let’s take things a step further. I think someone ought to fund a study to see what the positive effects of mass computerization have been – how much are we conserving in resources as a result of the move to information technology?

The Lawrence Berkeley study failed to take into account the overall savings in electricity and power usage as a result of mass computerization. The study’s author, Jonathan Koomey, of Lawrence Berkeley and Stanford University, admits right up front that the study “only assesses the direct electricity used by servers and associated infrastructure equipment. It does not attempt to estimate the effect of structural changes in the economy enabled by increased use of information technology, which in many cases can be substantial.”

These can be substantial indeed. Consider these potential areas of direct and indirect energy savings and you quickly get a very long list. Together, we’re probably talking about more resources saved than are used up by servers. Consider:

  • It can be assumed that e-business has drastically reduced the amount of paperwork moving inside and between organizations, with an ensuing savings in tree harvesting and energy consumed for the physical delivery of such documents.
  • The e-commerce channel is now a strong part of many businesses, and it can be assumed that to some degree, it has replaced some bricks and mortar construction and management, and all the energy consumption that goes with that.
  • Oil companies have been able to cut back on costly, time-consuming, sometimes environmentally risky drilling exploration because they can now model geologic environments with sophisticated tools running on large systems.
  • There’s the less frequent travel required, since collaborative tools and platforms enable teams to work virtually across the globe.
  • Online and distance learning have brought campuses right to students’ homes, cutting down on travel to and from physical campuses. Likewise, telecommuting enabled through IT has cut down on work commutes.
  • There have been industry-specific gains. For example, insurance companies leverage mobile technologies to cut down on trips made by claims adjusters out in the field – a savings that alone ought to add up to quite a few barrels of oil.

Yes, your data center may be running up some huge electric bills, and it’s important to seek ways to cut this consumption with a more efficient infrastructure. But before the world starts jumping to conclusions that data centers are depleting our energy resources, we need to see some studies on the positive impact virtualized environments are having on our resources. I’m sure such information would open our eyes with amazement as to how much IT is really saving us in resources.