Paxos By Example

This post describes a decentralized consensus algorithm called Paxos, through a worked example.

Distributed consensus algorithms are used to enable a set of computers to agree on a single value, such as the commit or rollback decision typically made using a two- or three-phase commit. It doesn’t matter to the algorithm what this value is, as long as only a single value is ever chosen.

In distributed systems this is hard, because messages between machines can be lost or indefinitely delayed, or the machines themselves can fail.

Paxos guarantees that nodes will only ever choose a single value (meaning it guarantees safety), but does not guarantee that a value will be chosen if a majority of nodes are unavailable (progress).

General Approach

A Paxos node can take on any or all of three roles: proposer, acceptor, and learner. A proposer proposes a value that it wants agreement upon. It does this by sending a proposal containing a value to the set of all acceptors, which decide whether to accept the value. Each acceptor chooses a value independently — it may receive multiple proposals, each from a different proposer — and sends its decision to learners, which determine whether any value has been accepted. For a value to be accepted by Paxos, a majority of acceptors must choose the same value. In practice, a single node may take on many or all of these roles, but in the examples in this section each role is run on a separate node, as illustrated below.

Figure 1: Basic Paxos architecture. A number of proposers make proposals to acceptors. When an acceptor accepts a value it sends the result to learner nodes.

Paxos By Example

In the standard Paxos algorithm proposers send two types of messages to acceptors: prepare and accept requests. In the first stage of this algorithm a proposer sends a prepare request to each acceptor containing a proposed value, v, and a proposal number, n. Each proposer’s proposal number must be a positive, monotonically increasing, unique, natural number, with respect to other proposers’ proposal numbers[1].

In the example illustrated below, there are two proposers, both making prepare requests. The request from proposer A reaches acceptors X and Y before the request from proposer B, but the request from proposer B reaches acceptor Z first.

Figure 2: Paxos. Proposers A and B each send prepare requests to every acceptor. In this example proposer A’s request reaches acceptors X and Y first, and proposer B’s request reaches acceptor Z first.

If the acceptor receiving a prepare request has not seen another proposal, the acceptor responds with a prepare response which promises never to accept another proposal with a lower proposal number. This is illustrated in Figure 3 below, which shows the responses from each acceptor to the first prepare request they receive.

Figure 3: Paxos. Each acceptor responds to the first prepare request message that it receives.

Eventually, acceptor Z receives proposer A’s request[2], and acceptors X and Y receive proposer B’s request. If the acceptor has already seen a request with a higher proposal number, the prepare request is ignored, as is the case with proposer A’s request to acceptor Z. If the acceptor has not seen a higher numbered request, it again promises to ignore any requests with lower proposal numbers, and sends back the highest-numbered proposal that it has accepted along with the value of that proposal. This is the case with proposer B’s request to acceptors X and Y, as illustrated below:

Figure 4: Paxos. Acceptor Z ignores proposer A’s request because it has already seen a higher numbered proposal (4 > 2). Acceptors X and Y respond to proposer B’s request with the previous highest request that they acknowledged, and a promise to ignore any lower numbered proposals.

Once a proposer has received prepare responses from a majority of acceptors it can issue an accept request. Since proposer A only received responses indicating that there were no previous proposals, it sends an accept request to every acceptor with the same proposal number and value as its initial proposal (n=2, v=8). However, these requests are ignored by every acceptor because they have all promised not to accept requests with a proposal number lower than 4 (in response to the prepare request from proposer B).

Proposer B sends an accept request to each acceptor containing the proposal number it previously used (n=4) and the value associated with the highest proposal number among the prepare response messages it received (v=8)[3]. Note that this is not the value that proposer B initially proposed, but the highest value from the prepare response messages it saw.

Figure 5: Paxos. Proposer B sends an accept request to each acceptor, with its previous proposal number (4), and the value of the highest numbered proposal it has seen (8, from [n=2, v=8

If an acceptor receives an accept request for a higher or equal proposal number than it has already seen, it accepts and sends a notification to every learner node. A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors have accepted a value, as is illustrated below:


Once a value has been chosen by Paxos, further communication with other proposers cannot change this value. If another proposer, proposer C, sends a prepare request with a higher proposal number than has previously been seen, and a different value (for example, n=6, v=7), each acceptor responds with the previous highest proposal (n=4, v=8). This requires proposer C to send an accept request containing [n=6, v=8], which only confirms the value that has already been chosen. Furthermore, if some minority of acceptors have not yet chosen a value, this process ensures that they eventually reach consensus on the same value.

Various efficiency improvements to the standard Paxos algorithm are discussed in the papers by Lamport and Baker et al.. For example, a prepare request is not necessary if the proposer knows that it is the first to suggest a value. The proposal for such a request is numbered 0, so that it will be ignored if any higher numbered requests have been received.

References

L. Lamport, “Paxos Made Simple” in ACM SIGACT News, vol. 32, no. 4, pp. 18–25, 2001.

Baker, J., Bond, C., Corbett, J. C., Furman, J., Khorlin, A., Larson, J., Léon, J. M., “Megastore: Providing Scalable, Highly Available Storage for Interactive Services” in Proceedings of the Conference on Innovative Data Systems Research, pp. 223-234, 2011.

T. D. Chandra, R. Griesemer, and J. Redstone, “Paxos made live: an engineering perspective”, in Proceedings of the twenty-sixth annual ACM Symposium on Principles of Distributed Computing, 2007, pp. 398–407.

 


[1] The method of ensuring the uniqueness of proposal numbers when there are multiple proposers is not specified in the Paxos algorithm itself.

[2] It may not, but the algorithm is resilient to this.

[3] Note that this is the highest proposal number that it received from prepare response messages. In this example, proposer B has a higher numbered proposal (n=4) than proposer A (n=2), but it has only received proposer A’s proposal in response to its prepare request. If no previous proposals were returned by the prepare response messages, proposer B would use its own proposal (n=4).

Updated (4/29/15): Updated text to address an ambiguity discussed in a comment.

Spie Charts Explained (+ Chart.JS Add-On)

I recently discovered spie charts, and thought it’d be interesting to try to create one in javascript… so here we are! This post introduces spie charts and discusses a spie chart extension I created for chart.js.

You can find the working code for the chart.js extension on GitHub.

What is a Spie Chart?

A spie chart is an overloaded version of a pie chart, where you can vary the height of each segment along with its width. This enhancement enables various things, such as the comparison of a dataset in two different states, or the addition of extra dimensions to a single dataset.

The next two sections use examples from the original paper on spie charts to illustrate these uses (though my data is totally made up).

Type A: Comparison of States

To compare a dataset in two states, the original dataset can be segmented in a regular pie chart, with the difference between this and the next state illustrated by the varying height of each segment.

Imagine government expenditure for two distinct years charted in separate pie charts. The change between the two charts is so marginal that it is difficult to determine what changed.

With a spie chart, expenditure in one year can be represented by the width (angle) of each segment in the chart, and the change in spending for the next year can represented by the height (radius) of each segment. This allows you to visualize both the relative weight of spending in each sector, and how that spending has changed over time:

In this example, some segments are broken down into multiple slices, where the outer slice illustrates how much spending has been reduced between the two reports.

Type B: Additional Dimensions

Alternatively, the height can be used to show a size relative to one value, while the width shows the same size relative to another.

In the following example (again based on one in the Feitelson paper), the spie chart conveys road casualties by age and sex. The height of each segment is used to indicate the number of casualties relative to the population size of that segment, while the width is used to indicate the proportion of casualties between age ranges (segments).

This example also shows how slices can be used to convey even more information, such as a breakdown of casualties in a segment between pedestrians, car riders, and (motor)cyclists.

The key characteristics of the spie chart in these examples are the ability to vary both the height and width of each segment, along with the ability to slice each segment into distinct sections.

Creating a Spie Chart

I am no expert with JavaScript, so creating the spie chart was an interesting process. If you want to skip that and jump straight to code, you can find it on GitHub, with examples here, and code here.

I decided to use chart.js as the basis for the chart, which proved to be a good decision, as it’s nicely written and extremely extendible.

Chart.JS Structure

Chart.js consists of a core file — named Chart.Core.js — and several others, with one for each type of chart supported. I decided to copy and extend the polar area chart, since it allows for pie-like segments that have different heights (though each segment has the same width).

Each of the chart type files has the same general structure, but i’ll discuss the Chart.PolarArea.js file here.

At the top of the file, a defaultConfig object provides defaults to be used for the chart’s display, including whether a scale is shown, whether animations are used, and the specifics of these features.

The majority of the code is contained within an extension of the Chart.Type object, which is the base class for all of the chart types. For a polar area chart, this contains an array of segments (the SegmentArc type), and set-up code for displaying a scale and tooltips.

The key functions in the PolarArea chart type are:

  • addData(), through which incoming data is passed from the calling function, and parsed into individual SegmentArc objects.
  • draw(), which is (somewhat obviously) called to draw the chart. When an animation is used for the chart, this function is called repeatedly, with the size of each segment gradually increasing as a result of a call to an easing function.

Each call to draw() adjusts the size of each segment by changing the start and end angle of the segment, and then the draw() function for each segment is called. This draw function is implemented in the Chart.Arc type in the core file and is common to pie charts and polar area charts.

Writing the Spie Chart

The basis of my spie chart is the existing polar area chart, modified to support variable width segments and slices. The Chart.Core.js file remains un-altered.

To begin, I updated the expected input format of the chart to allow for each of these changes.

The input format looked like this for a given segment:

But now looks like this:

I created a new Chart.Arc type, called SliceArc, and updated the addData() function to parse the incoming slices into this object (meaning the SegmentArc object contains many SliceArc objects).

The draw() function is significantly updated, with the main change being that each slice needs to be updated as the chart is drawn, rather than each segment (essentially, more looping).

The SegmentArc.draw() function is overridden, and now draws individual slices in a segment rather than the whole segment in one call. (I think this logic should be moved to the SliceArc call itself, which may make it possible to call the original Arc.draw() function without modifying it)

Tooltips were particuarly challenging to fix, because the existing inRange() and tooltipPosition() functions are very much designed around the location and position of the segment, and not slices. To avoid modifying the showTooltip() function (which is relatively large and in Chart.Core.js [called here]), I pass a slice object and make it look like a segment object (this is more than a little hacky).

Both share the same base class, so this in some ways reflects the extensible design of chart.js, but it involved adding some state to the slices that I don’t think would have otherwise been duplicated from the segment object.

Finally, I updated the tooltips template, which now iterates through slices as well as segments, and highlights slices on hover rather than segments (an implementation of which is shown in the spie-enhanced.html example).

Thoughts on Solution

Now that I have a better understanding of the structure of the codebase, I think I could do a better job modularizing certain functions and re-using existing core components. That said, i’m impressed how easy it was to create the new chart, and how little new code had to be written.

I thought about trying to allow the chart to behave like a multi-level pie chart, but there are already solutions for this, and adding an extra dimension would make things even messier.

Creating a Java REST API With Apache Jersey

In our latest development cycle we’ve been working on creating an official API to manage and control AetherStore. As part of this process I’ve been experimenting with Jersey 2, the Java reference implementation for JAX-RS, the Java API for RESTful Web Services.

This post discusses an example API using Jersey. The example is itself fairly well commented, explaining why certain pieces of code are needed, and how they relate to the project. This post covers the structure of the project and discusses some of its more interesting features.

The example is of an API representing a set, which allows:

  • Strings to be added as part of a GET request parameter
  • Strings to be added as part of a PUT request body.
  • The set of all strings stored to be returned.

If you’re using Jersey, it’s important to note that a lot of examples online use Jersey v1, which causes problems because this version of Jersey uses an entirely different namespace — v1 usescom.sun.jersey, whereas v2 uses org.glassfish.jersey, and a number of classes are either named differently, or are in different sub-packages.

The code on GitHub should work straight out of the box if you’re using Eclipse for Java EE.

Areas Covered

The code is useful if you’re interested in one of the following features in relation to jersey:

  • Running in eclipse.
  • Setting up appropriate Jersey Maven dependencies.
  • Setting up your web.xml to work with Jersey.
  • Creating a basic API.
  • Using JSON to wrap requests and responses.
  • Using an exception mapper for more readable error handling.
  • Injecting dependencies into resource classes (the API classes).
  • Unit testing Jersey.
  • Mocking calls used by our Jersey API.
  • Unmarshalling responses from a Jersey API.

Reading the Code

This section describes how the code is structured at a high level. There are some comments on specific lines of code, but I’d recommend looking at the comments in the code for a closer look at individual features.

The Apache Jersey Core

At the core of a Jersey application is the pom.xml file, which specifies all of the dependencies in the code — including Jersey itself — and the versions being used. If you’re using another example, it’s important to note the version of Jersey you’re using. Here we are using Jersey 2.6:

The web.xml file specifies how your servlet is named, and where the main Jersey Application class is. This Application class is used to start your Jersey servlet.

In this example our application class is called SetApplication. It registers bindings for a few classes, which we’ll discuss later. At this point the most relevant call to our application is:

This tells the servlet container where to look for the resource classes that form the API. In this case our resource class is called SetResource.

SetResource (API) Class

To recap, this project implements an API which supports two calls to add a string to a set (/set/add/<value> and /set/add), and a single call to get all entries in the set (/set/get). The class definition for this class annotated (shown below), to set the base path of the API call to be /set:

Then the methods in this class are further annotated to describe the API calls under this path. For example, the following code is executed when a /set/add/{value} request is made.

The full call to this method will be /set/add/{value}, where value is a variable that is mapped to the value parameter of the method as a result of the @PathParam annotation. We want the call to consume and produce a JSON response, so we specify this in the @Produces annotation. The marshalling to JSON is handled automatically, but we have to specify the marshaller dependency in thepom.xml file, as follows:

There are two methods which include the /add path, but they accept different numbers of inputs so there is no conflict. The class also contains an @Inject annotation, which tells the container to inject a dependency into the callHandler field. If we look back at the SetApplication class, we can see that this value is injected by registering it with the application through the register() call:

Unit Tests

I’ve written two unit test classes. One, SetApiTests provides what are essentially end-to-end integration tests, which call the API and check that its operations perform as expected. The second,MockedSetApiTests provides an example of using mocking to test just the API calls themselves. Both test classes extend JerseyTest, which handles the heavy lifting of setting up a servlet container and providing the API. To run correctly, JerseyTest requires a test framework provider, which is the servlet container used to run the test. In this example I’ve used the jetty container, where the dependency is specified in the pom.xml class with the following code:

In the MockedSetApiTests class, the SetCallHandler, which manages the logic behind the API call (and is injected), is mocked out:

The configure call is required by JerseyTest to properly configure the application, which in this case requires us to pass the mocked dependency so that it can be injected into the SetResource. In the tests themselves, the call to the API is relatively simple:

This makes the API call and returns the result (whether it is a success or failure to theresponseWrapper). This can then be queried to establish whether the call was successful (by getting the HTTP response code):

If successful, we can then obtain the returned value:

The test addMultipleSingleCall shows an example of an API request with a message body (in this case a set of Strings), also showing how to package up this request parameter in the test:

The examples in this post don’t include any parameters that are non-standard Java types, but doing so is relatively simple. By default only the public fields in the class are serialized, and a default constructor is required, but the class doesn’t have to implement serializable.

How To Run

To run this example in Eclipse for Java EE:

  1. Download / clone the code from GitHub.
  2. In Eclispe, go to File -> New -> Java Project
  3. Untick ‘Use default location‘ and navigate to the path of the jersey2-example repository.
  4. Re-tick the ‘Use default location‘ option, which sets up the project name as jersey2-example.
  5. Click finish to create the project.
  6. To run, either run the unit tests in JUnit, or right-click on the project and select Run as -> Run on Server.

If you want to create your own eclipse project, you can follow this example, but note that this is for Jersey v1, so you need to adjust the options used in steps 3 and 5 (i’d compare them to the example in my GitHub repo).

To run standalone (the previous steps are required):

  1. Right-click on the project, Export -> WAR File.
  2. Set the location to store the WAR file, and change the specified server runtime if necessary.
  3. Run the WAR in your favorite application server, or standalone with Jetty Runner.

Additional Resources

The following links are resources I found useful in writing this example. Where the examples refer to v1 of Jersey I’ve said so — examples are included because some part of them is useful, but be careful to note places where v1 specific code is used. This includes anywhere where a com.sun.jersey namespace is used.

This is a repost of a blog I wrote over on the AetherWorks Blog earlier this year.

How Zero-Conf Works

When you connect a printer to your local network, how does your computer find it?

It has to be addressable, which means it needs an IP address (and ideally a hostname), and it needs to be discoverable so that you can find it from your computer. When that works, you see a window like this:

Searching for Devices

These tasks are covered by the zero-configuration protocol, which describes how to make this work even when DHCP and DNS servers, which assign IP addresses and hostnames, are not available.

The zero-conf specification covers three broad goals, which I discuss in this post:

  1. Allow devices to obtain an IP address even in the absence of a DHCP server.
  2. Allow devices to obtain a hostname, even in the absence of a DNS server.
  3. Allow devices to advertise and search for (discover) services on the local link.

How Zero-Conf Works

1. Obtaining an IP Address

For a device to route messages on a network, it needs an IP address. This is relatively trivial if a DHCP server is available, but when this isn’t the case, zero-conf uses link-local addressing to obtain one.

The address assigned by link-local addressing is in the 169.254.0.0/16 block, which is only useful within the link (because routers won’t forward packets from devices in this address space[1]).

To obtain an address the device sends ARP requests to establish whether its desired IP address (chosen at random from the 169.254 range[2]) is available. This is a two-step process:

  1. First, an ARP probe is sent asking for the MAC address of the machine with a given IP. This probe is sent a number of times to confirm that the given IP address is not in use (there will be no response if it isn’t).
  2. If no reply is received, the device sends out an ARP announcement saying that it is now the machine with the given IP.

There are various protocols for conflict resolution of addresses that I won’t discuss here[3].

2. Obtaining a Hostname

Once a device has an IP address it can be contacted through the network, but IP addresses are volatile and for the device to be consistently discoverable it needs a hostname (which is less likely to change). Assigning as hostname is simple if you have a DNS server, but, if you don’t, zero-conf can assign hostnames using Multicast DNS.

Multicast DNS (mDNS) uses IP multicast to send broadcasts on the local link (something we wrote about recently).

To claim a local hostname with mDNS, a device sends DNS messages to probe for the uniqueness of the hostname[4]. Three queries are sent in 250ms intervals, and if no device reports using this hostname, the requesting device then sends an announce message to claim ownership.

In the case of a conflict (where two devices believe they own the same hostname), lexicographic ordering is used to determine a winner.

mDNS hostnames must use a local top-level domain (.com, .org, .gov, etc.) to distinguish them from globally accessible hosts. Apple devices and many others use the .local domain.

3. Browsing for Services

Once a device has an IP address and hostname, it can be contacted, but only if we know the name of the specific device we are looking for. If we don’t, DNS Service Discovery (also known as DNS-SD) can be used to search for services available on the local link. Rather than looking for a printer called ‘jose’, we can look for all devices supporting a print protocol (and then select ‘jose,’ if available).

What this looks like

Services are advertised in the form:

ServiceType.Domain

For example, _ipp.example.com advertises devices supporting the Internet Printing Protocol in theexample.com domain.

Individual services are identified by a specific instance name:

InstanceName.ServiceType.Domain

So our printer, ‘jose,’ would be identified as:

jose._ipp._tcp.local[5]

How this works

DNS-SD uses mDNS to announce the presence of services. To achieve this, it uses DNS PTR and SRV records.

PTR records  (pointer records) are used in lookups[6] to search for a specific service type (_ipp._tcp.local in the above example) and return the name of the SRV record for each service supporting the specified protocol (there is one PTR record for each SRV record).

A query for _ipp._tcp.local will return jose._ipp._tcp.local and all other printers supporting the IPP protocol in the local domain.

SRV records (service records), record the protocol a service supports and its address. If the PTR record is used to find devices supporting a protocol, the SRV record is used to find a specific device’s hostname.  For the printer ‘jose,’ the SRV record would contain the service type and domain, and the hostname for the printer itself:

_ipp._tcp.local | jose.local

At this point we have discovered and can connect to our printer.

In addition to this, there are various extensions to zero-conf that I don’t describe here. These include:

  • TXT records, which allow extra attributes to be recorded in the service announcement (for example, extra information needed to make a proper connection to a service).
  • Subtypes, which allow the same service to advertise different levels or types of support within a service.
  • Flagship Service Types, which enable applications to determine when the same device is supporting and announcing multiple protocols. This makes it possible to remove duplicate entries in a listing, where a device supports multiple protocols that perform the same function.

Implementations

The most commonly used implementation of zero-conf (or at least the most written about) is Apple’sBonjour.

We have used this library as the basis for our own implementation, linking in with the provided dns_sd.jar wrapper. There are various native implementations that I haven’t yet tried.

If you’d like to read more on zero-configuration, I’d recommend the O’Reilly Zero Configuration Networking book, which provides an exhaustive description of everything I’ve touched on in this post.

Other sources are included below:


[1] This is good because it stops local information from potentially clogging upstream networks.

[2] Unless it has already been on this network, in which case it uses its previously chosen address.

[3] Edgar Danielyan discusses this in an article of the Internet Protocol Journal.

[4] This is the same process as with regular unicast DNS, but with some optimizations. For example, individual queries can include multiple requests and can receive multiple responses, to limit the number of multicast requests being made.

Unicast DNS packets are limited to 512 bytes, whereas mDNS packets can have 9,000 bytes (though you can only transmit 1,472 bytes without fragmenting the message into multiple Ethernet packets).

[5] The _tcp line specifies the service runs over TCP (rather than UDP).

[6] PTR records are similar to CNAME records but they only return the name, rather than performing further processing (source).

This is a repost of a blog I wrote over on the AetherWorks Blog earlier this year.

Concurrency Control Mechanisms

As part of my PhD, I spent some time studying concurrency control mechanisms, looking for the most appropriate solution for the database system I was developing.

This article describes one such model, read-write-certify, and looks at how it might be extended for use in a distributed database system. I hope this helps to highlight the issues involved in the design of concurrency control mechanisms more generally.

The Basics

Read-write-certify (RWC) is a slightly more liberal version of the typical shared read, exclusive write locking model (as seen in most DBMSs). Unlike the latter approach RWC allows reads and writes to execute concurrently on the same item. To explain, consider the following example.

Figure 1: Single Item Database

Figure one shows a local database containing one item X, which stores the value ‘1’. In theshared-read, exclusive write model a machine updating X obtains an exclusive lock while it makes its update, meaning all read queries must wait on the update to complete. However in RWC, a transaction creates a new version of the item (named XI in figure 2) in its own transaction workspace on which to make the update. This ensures that the original copy of X is available for read queries.

Figure 2: Contrasting Approaches to Locking

The updating transaction uses  its local workspace copy (XI) to execute the update, but must then commit these changes by updating X, the global copy. Another lock called a certify lock is required to give exclusive access to the original while the update is committed.

Figure 3: Certify Locking

The transaction with the certify lock now completes the update by transferring changes to the primary copy, as shown above.

The principal advantage of this approach in a centralized database is that read queries are not blocked while updates are taking place, giving slightly more concurrency. While certify locks block, they are likely to be held for less time because the update has already been written.

The Effect of Distribution

Despite the improvement from the previous approach, Read-Write-Certify must still block read requests when certifying an update. In a distributed database system is it possible to get around this restriction?

To answer that question we require an item with multiple replicas, each on a different machine.

Figure 4: Multiple Replicas

In this system the same locking principles hold for individual replicas as they did with the local database, but the certify lock has only to the lock a single replica, not every copy. While the update is being completed on this replica read queries can still access other copies elsewhere.

Consider this in our example system. A transaction has previously obtained a write lock on X2and has updated its value. It is now attempting to commit the update and has a certify lock onX2. Another transaction requests a read lock and has it granted for X1, the other replica. The read request can be made, unblocked, despite an update being completed on the other machine.

Figure 5: Non-blocking Read Query

If the update transaction on X2 completes before the read on X1 then the read will take place on out-of-date data. This isn’t desirable, but for many applications it is acceptable on the proviso that the result-set is consistent, if outdated. However, this condition isn’t necessarily true either, as the following example shows.

There are two transactions, T1 and T2, and two items, X and YT1 starts by getting a read lock on a copy of X (with the notation R(X)), and subsequently gets a read lock for Y. However, another transaction, T2, has committed an update to both X and Y after T1 obtained a read lock on X but before it obtained a read lock on Y.

T1 completes its query by returning an old value of X and the current value of Y. The result is inconsistent.

Figure 6: Serializability of RWC

This is only a problem because the first transaction obtained read locks for both tables separately. If the system uses a centralized lock manager, locks can be taken out at the same time, making the result consistent. However, a centralized lock manager creates a point of contention (and failure), so it may not be desirable to have one.

Alternatives

Its probably doubtful that you’d want to use this approach in practise. If your application can accept outdated but consistent data then a more expressive multi-version system may be more appropriate. My point is to show what happens when you expand or relax basic pessimistic approaches.

RWC is a two-phase locking approach to multi-version concurrency control. The alternative istimestamp ordering: each transaction is given a unique start timestamp, while data items are given read and write timestamps that are equal to the timestamp of the last timestamp to read or write the item.

Reads and updates never block. If an operation is a read then it accesses the version with the largest timestamp less than the reading transaction. The read timestamp is then updated. If an operation is an update then a new version of the item is created with the timestamp of the updating transaction, provided that a more recent transaction has not read an older version of the data than the updating transaction. In this case the transaction is aborted.

Serializability is ensured by aborting transactions which access data out of timestamp order.

Finally

Every concurrency control mechanism is designed with three things in mind: the degree of concurrency provided (obviously), the potential for deadlock, and the level of consistency that is guaranteed.

Because there is no definitive solution for any of these, concurrency control models vary widely based on their target application. Most of the fundamental research on the topic was done over thirty years ago, so for further reading you can either go back to that work or look at some more recent textbooks. The following texts were useful to me:

Deadlock in Database Systems

Deadlock is what occurs when two or more transactions are waiting on each other to release a lock. Neither can move, and so both stall.

To combat this, a system can either prevent deadlock from occurring, or detect when it does happen and act accordingly.

Models for Concurrency Control

To understand how deadlock occurs in database systems it helps to understand the role of various concurrency control techniques. I’m going to discuss two popular approaches.

Two-phase locking (2PL) is commonly used to guarantee serializability in database systems. In this model a transaction can either obtain a shared or exclusive lock for an item, but all locking operations must occur before the first unlock operation. The name refers to the two phases that result from this: the expanding phase where locks are acquired, and the shrinking phase where locks are released.

There are a number of variations on this model. Conservative 2PL requires that all locks are taken out at the beginning of the transaction, whereas Rigorous 2PL requires that all locks are held until after the transaction commits (or aborts). The former collapses the expanding phase, while the latter collapses the shrinking phase.

Timestamp-based concurrency control involves using unique transaction timestamps in place of conventional locks. Concurrency control is based on the ordering of timestamps. So, for example, when a transaction accesses an item, the system checks whether this transaction is older than the last one which accessed the item. If this is the case the transaction proceeds; otherwise ordering is violated and the transaction is aborted. Such strict timestamp-based approaches can lead to the cyclic restart of transactions and starvation.

Multi-version concurrency control (MVCC) also incorporates timestamps by allowing several versions of an item to be stored. This allows the system to present a consistent (but potentially historical) version of the database to queries, meaning fewer reads are rejected than with basic timestamp ordering.

Optimistic concurrency control (OCC) allows multiple transactions to read and update items without blocking. However, before a transaction is committed the database must check for conflicts – if any are found one of the conflicting transactions is rolled back.

Deadlock Prevention

For deadlock to occur four conditions need to be true (meaning you need to break one to prevent deadlock):

Mutual Exclusion – a resource cannot be held by more than one transaction at a time. This condition is true of database systems using two 2PL where an exclusive lock is required on updates. Systems using OCC don’t hold locks, and so break this condition.

Hold and Wait – transactions already holding resources can request further resources.Conservative 2PL breaks this condition, since it requires all locks to be acquired from the outset. However this isn’t always desirable as it limits concurrency.

No pre-emption – a resource cannot be forcibly removed from a transaction. Pre-emption is used in timestamp-based approaches. Two of the most commonly used schemes are wait-dieand wound-wait. In wait-die (non-preemptive), if a transaction tries to lock an item which is already locked, it waits if the holder of the lock is a younger transaction (based on timestamp); otherwise it will abort. In wound-wait (pre-emptive), instead of waiting as before, the transaction aborts the other younger transaction. If it is the younger transaction it waits.

Circular wait – a number of transactions form a circular chain where each transaction is waiting for a resource that a later transaction (in the chain) holds. This can be prevented by imposing a total ordering on resources, requiring that each transaction requests locks on resources in an agreed order. This may not be possible in some forms of 2PL where locks are not taken out at a single point in time (e.g. rigorous 2PL).

Many of these approaches aren’t ideal because they result in transactions being aborted at the slightest chance of deadlock. In situations where deadlock will rarely occur (for example, when transactions are mostly short-lived and lightweight) detection is more practical.

Deadlock Detection

Deadlock detection involves periodically checking whether the system is in a state of deadlock. There are two basic methods of detection: timeouts and wait-for graphs.

Timeouts represent the simplest method of detection. If a transaction waits for a period longer than some constant timeout period it will be aborted. This method has a low overhead, but may end up aborting transactions even in the absence of deadlock.

Another approach is for the system to construct a wait-for graph. Each node in the graph represents an active transaction. A directed edge is drawn between two transactions when one transaction is waiting for a lock on an item held by the other transaction. Deadlock exists (and is detected) when there is a cycle in the graph. At this point the system engages in victim selection, where one of the transactions is chosen to be aborted. The challenge in this technique is deciding when and how often to check for deadlock in the graph.

For More Information

Hopefully this post provides enough information to give a good understanding of the subject area. If you want to know more I’d recommend looking at some of the material I used while writing this post:

This is a cross-post from my previous site at the University of St Andrews.