Trends in Database Systems Research: Column-Stores

This article is a cross-post from my old website, dating from 2009.

Column-store databases are one of the more interesting areas of innovation in recent database systems research. While column-stores have been around for decades, research in the area has recently been kick-started by Mike Stonebraker and others as part of the C-Store project, and it makes for an interesting discussion.

What is a column-store?

The basic premise of this work is simple. Databases which store data by column (with attributes written contiguously on disk) are able to service read queries much faster than more traditional row-store databases (with records written contiguously on disk). Attributes not included in a query are never read at all, rather than being read and skipped over, and data can be easily compressed, because techniques such as run-length encoding work far more effectively over attributes (where entries are similar) than over rows (where they are distinct). Both features reduce the disk bandwidth required to execute a query, reducing a potentially large bottleneck.
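To make the compression argument concrete, here is a minimal sketch in Python. The table, its column names and the helper functions are all made up for illustration, and real column-stores use far more elaborate on-disk formats. It shows only that run-length encoding finds long runs in a single column but none when the same data is laid out record by record.

```python
from itertools import groupby

# Hypothetical table of (order_id, country, status) records.
rows = [
    (1, "UK", "shipped"),
    (2, "UK", "shipped"),
    (3, "UK", "pending"),
    (4, "US", "pending"),
    (5, "US", "shipped"),
    (6, "US", "shipped"),
]


def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, ["order_id", "country", "status"])

# A query touching only `country` needs this one list, not every record.
country_rle = run_length_encode(columns["country"])
status_rle = run_length_encode(columns["status"])

# The same data in row order: neighbouring values belong to different
# attributes, so no two adjacent values are equal and nothing compresses.
flattened_rows = [field for row in rows for field in row]
row_rle = run_length_encode(flattened_rows)

print(country_rle)
print(len(flattened_rows), len(row_rle))
```

In this toy case the six-entry `country` column shrinks to two pairs, while the row-ordered layout produces one pair per value.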

Traditionally the problem with this approach has been slower updates: the design which makes reading from the database extremely fast has the opposite effect on writing. C-Store solves this by creating two stores: a large read-optimized store, and a smaller writeable store. Updates are sent to the smaller store, before being moved in bulk to the larger one at a later date. This works because C-Store is targeted at the data warehousing market, where workloads are read-mostly and updates are infrequent. Specialization is key.
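The two-store idea can be sketched roughly as follows. This is a toy illustration and not C-Store's actual design: the class name, the `merge_threshold` parameter and the choice to have reads combine both stores are assumptions made for the example.

```python
class TwoStoreTable:
    """Toy table with a small write store and a large columnar read store."""

    def __init__(self, column_names, merge_threshold=3):
        self.column_names = list(column_names)
        self.merge_threshold = merge_threshold  # hypothetical tuning knob
        # Read-optimized store: one list per column.
        self.read_store = {name: [] for name in self.column_names}
        # Writeable store: recent rows, kept in arrival order.
        self.write_store = []

    def insert(self, row):
        """Append a row to the write store; merge once it grows large enough."""
        self.write_store.append(tuple(row))
        if len(self.write_store) >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Bulk-move all pending rows into the columnar read store."""
        for i, name in enumerate(self.column_names):
            self.read_store[name].extend(row[i] for row in self.write_store)
        self.write_store.clear()

    def scan(self, name):
        """Read one column, combining merged data with still-pending rows."""
        i = self.column_names.index(name)
        return self.read_store[name] + [row[i] for row in self.write_store]


table = TwoStoreTable(["id", "city"], merge_threshold=3)
for row in [(1, "Leeds"), (2, "Perth"), (3, "Leeds"), (4, "Dundee")]:
    table.insert(row)

print(table.read_store["city"])  # rows moved in bulk by the third insert
print(table.write_store)         # the fourth row is still pending
print(table.scan("city"))
```

Each insert is a cheap append, and the cost of reorganising data by column is paid once per batch rather than once per update.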

If you read one paper from the area, make it C-Store: A Column Oriented DBMS.

Commercial Rivalry

Not surprisingly, given the promise of this work, database vendors are taking note. C-Store itself has spawned a commercial version, Vertica.

Perhaps as a result we may see fewer academic papers on the subject, but thankfully a number of the parties involved have created blogs which provide useful insights into the current focus of their work.

The people behind Vertica (and thus C-Store) have an interesting blog named The Database Column, which ostensibly promotes the benefits of column-stores, but backs this up with a lot of interesting work and evaluation.

Daniel Abadi, yet another C-Store member, has recently created his own blog, which oddly seems to have a slightly more commercial slant than the previously mentioned Vertica blog. Again, if you have any interest in this area his posts are worth reading.

More generally, Curt Monash’s DBMS2 blog provides an interesting account of the latest happenings in the commercial database world.

Trends in Database Systems Research: Energy Efficiency

This article is a cross-post from my old website, dating from 2009.

Wherever you look nowadays, companies are searching for ways to market themselves as the environmental alternative, both because it makes customers feel good and because it promises to save them money. It follows that this is particularly true of large computing companies, given the cost of running data centres. As an example, US data centres alone run at an estimated cost of $2.7 billion and account for 1.2% of total national energy consumption (ref).

The challenge is in finding ways to reduce the waste.

Broadly speaking, there are two complementary approaches to reducing energy consumption: making hardware more efficient, and making software less resource intensive. Database research is beginning to appear on the latter approach, but I think a lot of the work on the hardware side is just as interesting.

Hardware Optimization

In The Case for Energy-Proportional Computing, Barroso and Hölzle look at the energy efficiency of typical data centres. They show that the power a server draws is not proportional to its utilization: for example, a server running at near 0% utilization uses around 50% of the power it uses at peak utilization. Ideally, servers should use no power when not in use, and power only in proportion to their utilization when they are. The authors call for future hardware design to aim for better energy proportionality, so that machines that are doing little cost little.
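A rough back-of-the-envelope model shows why this matters. The sketch below assumes power grows linearly from the idle level to the peak, which is a simplification for illustration rather than a result from the paper, and the 200 W peak figure is invented.

```python
def power_draw(utilization, peak_watts=200.0, idle_fraction=0.5):
    """Linear power model: a fixed idle draw plus a share that grows with load.

    `peak_watts` is a made-up figure; `idle_fraction` is the share of peak
    power drawn at 0% utilization (0.5 matches the 50% quoted above).
    """
    idle_watts = peak_watts * idle_fraction
    return idle_watts + (peak_watts - idle_watts) * utilization


def relative_efficiency(utilization, idle_fraction=0.5):
    """Work done per watt, normalised so that peak utilization scores 1.0."""
    if utilization == 0:
        return 0.0
    normalised_power = power_draw(
        utilization, peak_watts=1.0, idle_fraction=idle_fraction
    )
    return utilization / normalised_power


for u in (0.1, 0.3, 0.5, 1.0):
    print(f"{u:.0%} utilized: {power_draw(u):.0f} W, "
          f"efficiency {relative_efficiency(u):.2f}")
```

Under these assumptions a server at 10% utilization does a tenth of the work for more than half the power, so its efficiency is under a fifth of what it achieves at peak. A perfectly energy-proportional machine (`idle_fraction=0.0`) would score 1.0 at every non-zero utilization.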

Software Optimization

Two recent CIDR papers look at reducing energy consumption in database systems.

Energy Efficiency: The New Holy Grail of Data Management Systems Research looks at areas where software optimizations can be made, and more generally provides a number of approaches to reducing energy waste. It’s a worthwhile read if you want to get a good feel for the area.

Towards Eco-friendly Database Management Systems proposes that energy consumption be considered a first-class performance goal when planning and processing queries. The authors give details of two optimizations which can help to reduce the energy consumption of a database system.

The first uses the ability of modern processors to execute at a lower voltage and frequency: their database can explicitly order the processor to operate at a lower voltage when such a change is desirable.
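As a rough guide to why this helps, the standard approximation for CMOS logic is that dynamic power scales with voltage squared times frequency. The sketch below applies that rule of thumb to a fixed amount of work. It deliberately ignores static (leakage) power and the rest of the machine, so it is an optimistic illustration and not the model used in the paper. The function names are invented.

```python
def relative_dynamic_energy(voltage_scale):
    """Dynamic energy for a fixed number of cycles, relative to nominal.

    Dynamic power ~ V^2 * f and runtime for a fixed cycle count ~ 1 / f,
    so frequency cancels and energy ~ V^2.
    """
    return voltage_scale ** 2


def relative_runtime(frequency_scale):
    """Runtime for a fixed number of cycles, relative to nominal frequency."""
    return 1.0 / frequency_scale


# Hypothetical setting: 10% lower voltage, 5% lower clock frequency.
energy = relative_dynamic_energy(0.9)
runtime = relative_runtime(0.95)
print(f"dynamic energy x{energy:.2f}, runtime x{runtime:.2f}")
```

The trade-off is the one the authors exploit: a modest slowdown can buy a larger cut in processor energy, provided the longer runtime does not cost more elsewhere.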

Their second technique is to queue queries where possible so that query aggregation can be used more often, reducing the number of repeat queries to the database. In some evaluations these approaches yield a 49% reduction in energy consumption against only a 3% increase in response time.
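The queueing idea can be sketched in a few lines. This is a deliberately simplified version that only merges identical queries, and the class and function names are invented for the example. It should not be read as the paper's actual aggregation scheme.

```python
from collections import defaultdict


class BatchingQueue:
    """Hold incoming queries briefly so that identical ones run only once."""

    def __init__(self, execute):
        self.execute = execute            # callable: query text -> result
        self.pending = defaultdict(list)  # query text -> waiting client ids

    def submit(self, client_id, query):
        """Queue a query instead of running it immediately."""
        self.pending[query].append(client_id)

    def flush(self):
        """Run each distinct query once and fan the result out to all waiters."""
        results = {}
        for query, clients in self.pending.items():
            result = self.execute(query)
            for client_id in clients:
                results[client_id] = result
        self.pending.clear()
        return results


calls = []


def fake_execute(query):
    """Stand-in for a real database call; records how often it is invoked."""
    calls.append(query)
    return f"result of {query}"


queue = BatchingQueue(fake_execute)
queue.submit("a", "SELECT COUNT(*) FROM orders")
queue.submit("b", "SELECT COUNT(*) FROM orders")
queue.submit("c", "SELECT MAX(total) FROM orders")
results = queue.flush()

print(len(calls), "executions for", len(results), "clients")
```

Clients wait until the next flush, which is where the small increase in response time comes from. In exchange, the database does less total work.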

While these kinds of solutions are not the whole answer (in many cases any reduction in performance would be unacceptable), they at the very least provide an interesting perspective.

Existing Resources

One of the most interesting statistics from this work is the 50% energy consumption of servers doing nothing at all. Essentially, machines doing nothing are still doing something. If we assume that we can’t reduce the energy consumption of these machines to zero, then the question becomes: how can these machines be used?

When it comes to user workstations, various volunteer computing projects (see SETI@home, Folding@home) strive to make use of unused capacity. But within an enterprise there is a paucity of software able to take advantage of unused resources: resources which, as this article points out, are still costing companies money.

Current Trends in Database Systems Talk

I recently gave a talk to our Masters Databases class entitled Current Trends in Distributed Database Systems.

The talk (available here) covers some of the more innovative designs in database systems over the last few years, from Vertica and VoltDB, to larger-scale datastores such as Amazon’s Dynamo.

2010-12-13 - Current Trends in Distributed Database Systems

Major aside: I tried and failed to come up with a more entertaining title for the talk. The suggestions I received on Twitter were better, but less relevant (one of the suggestions is on my title slide).

So, if you think you can do better and come up with something that is both relevant and witty/entertaining, there’ll be some form of prize in it for you!

SICSA Conference 2010

I’m just back from presenting a paper and poster at the SICSA Conference 2010.

You can find the work that I presented at the conference below, and more information on H2O in general at the project webpage.

Paper

H2O: An Autonomic, Resource-Aware Distributed Database System

Abstract:

This paper presents the design of an autonomic, resource-aware distributed database which enables data to be backed up and shared without complex manual administration. The database, H2O, is designed to make use of unused resources on workstation machines.

Creating and maintaining highly-available, replicated database systems can be difficult for untrained users, and costly for IT departments. H2O reduces the need for manual administration by autonomically replicating data and load-balancing across machines in an enterprise.

Provisioning hardware to run a database system can be unnecessarily costly as most organizations already possess large quantities of idle resources in workstation machines. H2O is designed to utilize this unused capacity by using resource availability information to place data and plan queries over workstation machines that are already being used for other tasks.

This paper discusses the requirements for such a system and presents the design and implementation of H2O.

Presentation

(1-up slides)

An Approach to Ad-Hoc Cloud Computing

You can find our recent technical report on ad-hoc cloud computing here. The abstract is reprinted below.

Abstract:

We consider how underused computing resources within an enterprise may be harnessed to improve utilization and create an elastic computing infrastructure. Most current cloud provision involves a data center model, in which clusters of machines are dedicated to running cloud infrastructure software. We propose an additional model, the ad hoc cloud, in which infrastructure software is distributed over resources harvested from machines already in existence within an enterprise. In contrast to the data center cloud model, resource levels are not established a priori, nor are resources dedicated exclusively to the cloud while in use. A participating machine is not dedicated to the cloud, but has some other primary purpose such as running interactive processes for a particular user. We outline the major implementation challenges and one approach to tackling them.