by Clarissa Littler

At this point, it’s safe to say that we’re all using devices that can run multiple programs at the same time, unlike the desktop computers of the 1980s and early 1990s. Many computers today come with multiple processors, each of which has multiple cores, with each core being a kind of mini-processor capable of executing code on its own.

If we only ever wanted to have each individual program run on its own core, then the story of concurrency wouldn’t be a very interesting one, because the software on each core would never overlap. The real complications come from when a single program is spreading its execution among multiple cores and processors.

In this piece, we’ll be covering the basic ideas of general concurrency and the factors that make concurrent programming complicated: shared variables and cached data.

To start from the very beginning, let’s talk about what happens on a very simple single-core, single-processor machine when you want to execute multiple programs. Each execution thread has its own context, which consists of things like:

  • the program counter, which tracks the code’s current position in the processor’s list of instructions
  • the values of the registers, which are the lowest level of storage on a processor 1
  • memory maps, which connect the process’s addresses for memory to the physical addresses
  • other data that the process needs to execute

In this uni-processor/uni-core system, each program has to take its turn to execute, and each time a new thread takes its turn, the processor performs a context switch: it saves the old context to memory, then loads the new context from memory. A context switch is an expensive procedure and an unavoidable overhead of running multiple programs on a single processor. In fact, it’s not hard to completely overwhelm a single-processor machine into thrashing: much like someone flailing in water, no forward progress is made in any direction, because the processor spends so much of its time switching between the dozens or hundreds of programs that need to run that it scarcely finishes one context switch before it needs to start the next.2

The obvious solution is to have multiple processors running at once, and this works fine if each program only has a single thread of execution. But what if we want our code to handle the user interface in one thread while making database queries in another? What if we’re making a game and want to make the AI code more efficient by giving different entities in the game their own dedicated threads? What if we want to write a web server where each new connection gets its own thread to listen on? All of these cases require multiple threads of execution that can share data with each other, and that’s where the rest of our story begins.

Creating threads is an operating system-dependent task. Each OS gives us the ability to create an execution thread and from there it handles all the bookkeeping of the context switches and of scheduling the threads. Scheduling is its own interesting topic but we’re going to just call it a black box for this article3.

We’re also going to deal only with threads, rather than processes, for the rest of this piece. A thread is like a process, but “lighter-weight”: the difference between the two is how much data is shared between them. Separate processes have their own disjoint memory maps, while threads share the same memory space, which is both how they can easily share data and why they’re lighter-weight to context switch. The distinction is sometimes a bit arbitrary, as in Linux, where the single operation clone can create both a new thread and a new process, just with different levels of sharing of the memory map and other data.
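We can see the shared-memory point concretely with Python’s threading module, which wraps OS-level threads (a minimal sketch; the Event is only there to make the ordering deterministic):

```python
import threading

# Threads share one address space: a write by one thread is visible to the other.
shared = {"count": 0}
done = threading.Event()
results = []

def writer():
    shared["count"] = 42   # writes into the same memory map the reader sees
    done.set()

def reader():
    done.wait()            # wait until the writer has finished
    results.append(shared["count"])

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t2.start(); t1.start()
t1.join(); t2.join()
# results now holds the value written by the other thread
```

Two separate processes would each have their own copy of `shared`, and the write would never be visible across the boundary without explicit inter-process communication.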

We’ll assume, for this discussion, that we have a programming language that uses OS level threads4, with a thread-creation function called fork, which takes a function as argument, and syntax similar to JavaScript or C. We’re assuming first-class functions5 out of convenience.

In this pseudo-language, the following program spawns off two threads that each print a message to the command line:

function printer (msg) {
  return function () {
    print("this thread said: " + msg);
  };
}

fork(printer("this should be our first message"));
fork(printer("this should be our second message"));

If you executed this code, you’d find that you can’t necessarily predict which thread is going to print its message first, and that just because you started one thread before the other doesn’t necessarily mean it will finish before the other.
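In a real language, the pseudocode above might look like the following Python sketch, where threading.Thread plays the role of fork. Because the ordering is up to the scheduler, we collect the messages into a list rather than relying on print order:

```python
import threading

messages = []  # list.append is thread-safe in CPython

def printer(msg):
    # Return a closure, mirroring the article's first-class-function example.
    def run():
        messages.append("this thread said: " + msg)
    return run

# "fork" here is just creating and starting an OS thread.
t1 = threading.Thread(target=printer("this should be our first message"))
t2 = threading.Thread(target=printer("this should be our second message"))
t1.start(); t2.start()
t1.join(); t2.join()
```

Both messages always arrive, but which one lands first in the list can vary from run to run.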

The fact that we can’t make these predictions is a simple example of a race condition. A race condition, named for the visual of competing threads racing to execute first, occurs when you have unexpected or unwanted behavior caused by the unpredictability of when each thread will run its code. If it doesn’t matter what order events happen in, then you don’t need to worry, but when we’re modifying variables shared by the two threads we hit real problems.

Let’s consider our first example using shared variables, which is practically the “Hello World” of race conditions:

var counter = 0;

function inc(){
  counter = counter + 1;
}

fork(inc); // we'll call this thread 1
fork(inc); // we'll call this thread 2

Now, there are roughly four6 possible sequences of execution that could happen with this code: two of them desired and two of them unintended.

Possibility 1:

  1. thread 1 retrieves the value of counter
  2. thread 1 modifies counter
  3. thread 2 retrieves the value of counter
  4. thread 2 modifies counter

Possibility 2:

  1. thread 2 retrieves the value of counter
  2. thread 2 modifies counter
  3. thread 1 retrieves the value of counter
  4. thread 1 modifies counter

Possibility 3:

  1. thread 1 retrieves the value of counter
  2. thread 2 retrieves the value of counter
  3. thread 1 modifies counter
  4. thread 2 modifies counter

Possibility 4:

  1. thread 2 retrieves the value of counter
  2. thread 1 retrieves the value of counter
  3. thread 2 modifies counter
  4. thread 1 modifies counter

In the third and fourth cases, we have an interleaving of execution where both reads have happened, then the writes happen, and at the end of execution the variable counter is 1 instead of 2.
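We can force the bad interleaving (Possibility 3) deterministically in Python by splitting the increment into an explicit read and write, and using events to make both reads happen before either write (a contrived sketch, purely to demonstrate the lost update):

```python
import threading

counter = 0

# Events pin down the ordering: both threads read before either writes.
read_done_1 = threading.Event()
read_done_2 = threading.Event()

def inc_thread_1():
    global counter
    local = counter          # thread 1 retrieves the value of counter (0)
    read_done_1.set()
    read_done_2.wait()       # wait until thread 2 has also read
    counter = local + 1      # thread 1 modifies counter

def inc_thread_2():
    global counter
    read_done_1.wait()       # wait until thread 1 has read
    local = counter          # thread 2 retrieves the value of counter (still 0)
    read_done_2.set()
    counter = local + 1      # thread 2 modifies counter

t1 = threading.Thread(target=inc_thread_1)
t2 = threading.Thread(target=inc_thread_2)
t1.start(); t2.start()
t1.join(); t2.join()
# counter ends up as 1, not 2: one increment was lost
```

In real code the interleaving is chosen by the scheduler rather than by us, which is exactly why the bug is so hard to reproduce and debug.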

Now, how do you stop a race condition? We need some way of ordering, of synchronizing operations when we need to make sure threads can’t clobber each other.

There’s quite a few ways to do this, but we’ll start from higher levels of abstraction and move our way down to the more basic components needed.

The first synchronization method we’ll look at is the lock. The idea of a lock is that you can use it to protect code from interference from other threads. We’ll skip the details for the moment, but a lock is a data structure that has two methods, acquire and release7. When a thread attempts to acquire the lock and no other thread holds it, the thread obtains the lock and can proceed with execution. If another thread has the lock, then the acquire call blocks until the current owner of the lock calls release. In a basic lock, there’s no queue ensuring first-come-first-served acquisition of the lock: whichever thread happens to make the acquire call first wins, so from the programmer’s perspective it’s still non-deterministic.

So let’s now assume we’ve added a createLock function and that a lock object has two methods, acquire and release, that we can call.

Now we can rewrite our previous example to ensure that the threads can’t clobber each other anymore when incrementing the counter.

var counter = 0;
var lock = createLock();

function inc(){
  lock.acquire();
  counter = counter + 1;
  lock.release();
}

fork(inc); // we'll call this thread 1
fork(inc); // we'll call this thread 2


Now there are only two possibilities for execution:

The first being:

  1. thread 1 acquires the lock
  2. thread 1 retrieves the value of counter
  3. thread 1 modifies counter
  4. thread 1 releases the lock
  5. thread 2 acquires the lock
  6. thread 2 retrieves the value of counter
  7. thread 2 modifies counter
  8. thread 2 releases the lock

and the second simply is the flipped version where thread 2 gains the lock first.
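The locked counter translates directly into a runnable sketch, with Python’s threading.Lock standing in for our pseudo-language’s createLock (and many more threads, to make the point stick):

```python
import threading

counter = 0
lock = threading.Lock()      # plays the role of createLock()

def inc():
    global counter
    lock.acquire()           # only one thread at a time past this point
    counter = counter + 1
    lock.release()

threads = [threading.Thread(target=inc) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is exactly 100: no increments are lost
```

Without the lock, runs of this program could lose updates exactly as in the four-possibilities analysis above; with it, the read-modify-write is protected and the result is always correct.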

You might be wondering how a lock can be implemented so that no two threads can acquire it at the same time. A very basic implementation of a lock is a spin lock, where the acquire function just “spins” (repeats an action) in a tight loop, checking to see if the lock has been released. If we implement our acquire function naively, like:

lock.acquire = function () {
  while(lock.isHeld != 0) { }
  lock.isHeld = 1;
};

then we’re just back where we started. Suppose we have three threads, one of which holds the lock while the other two do not. When the lock-holding thread releases it, the two waiting threads might simultaneously test that the lock isn’t held, both succeed, and then both acquire the lock at the same time.

In fact, there’s no way to correctly implement the spin lock with the normal programming constructs you might be used to. Instead, we need a special primitive (built-in) instruction provided by the hardware in order to fix our implementation. A common such primitive is called test-and-set. The name test-and-set is a bit misleading at first, because it’s really more of an atomic swap where you place a value into the memory location while returning the old value in one fell swoop that cannot be interrupted by a context switch8. If we add a testAndSet function to our language, then we can rewrite our above pseudocode as:

lock.acquire = function () {
  while(testAndSet(lock.isHeld, 1) != 0) { }
};

where in the “spin” part of the spin lock, we’re actually going to be placing the value 1 into lock.isHeld repeatedly and watch for when the old value was actually a 0. That might seem odd at first, but the trick here is that if the lock is already held then setting it to “held” doesn’t really change anything, but if it wasn’t held then the test-and-set operation will return a 0. You’ve then acquired the lock and can stop spinning.
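We can sketch the whole spin lock in Python. Real hardware provides test-and-set as one uninterruptible instruction; since Python doesn’t expose that, the sketch below fakes the atomicity with an internal threading.Lock, purely for illustration:

```python
import threading

class SpinLock:
    """A spin lock built on a simulated test-and-set.

    On real hardware, test-and-set is a single atomic instruction; here the
    internal _guard lock stands in for that hardware atomicity."""
    def __init__(self):
        self._held = 0
        self._guard = threading.Lock()

    def _test_and_set(self, new):
        with self._guard:          # simulates the uninterruptible swap
            old = self._held
            self._held = new
            return old

    def acquire(self):
        # Spin: keep writing 1; the moment the old value was 0, the lock
        # was free and we now own it.
        while self._test_and_set(1) != 0:
            pass

    def release(self):
        self._test_and_set(0)

counter = 0
lock = SpinLock()

def inc():
    global counter
    for _ in range(100):
        lock.acquire()
        counter = counter + 1
        lock.release()

threads = [threading.Thread(target=inc) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 8 threads x 100 increments, all protected by the spin lock
```

Note the cost of this design: a thread waiting for the lock burns CPU time spinning, which is why spin locks are usually reserved for critical sections known to be very short.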

There are actually quite a few atomic instructions beyond this rather old-fashioned test-and-set, but test-and-set lends itself very well to simple examples! There are also a number of other synchronization methods beyond spin locks. A very common one you’ll see is the semaphore, which is similar to a lock in having an acquire/release model, but includes a notion of how many threads are allowed to acquire it at once, rather than simply one. Another that I’ve always liked conceptually is software transactional memory, which is a way to avoid locks and instead allow threads to go on their way attempting transactions.
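A quick sketch of the semaphore idea, using Python’s threading.Semaphore and some bookkeeping to record how many threads were ever inside the protected section at once:

```python
import threading
import time

sem = threading.Semaphore(2)       # at most 2 threads may hold it at a time
state = {"active": 0, "peak": 0}
state_lock = threading.Lock()      # protects the bookkeeping counters only

def worker():
    with sem:                      # acquire ... release
        with state_lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)           # pretend to do some work
        with state_lock:
            state["active"] -= 1

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# state["peak"] never exceeds 2, even with 10 threads competing
```

A semaphore initialized to 1 behaves exactly like a basic lock; larger counts are handy for things like capping the number of simultaneous connections to a resource.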

A transaction is just a chunk of code, modifying shared memory, that you want to either run all at once or not at all depending on whether any other thread has modified the shared memory in the meantime. The problem that software transactional memory tries to solve is that conservatively putting lots of spin locks in your code will mean there are threads that simply wait and do nothing, even if they potentially could have executed without interfering with each other. Not only that, but knowing where to put locks can get tricky in large programs: you need to lock the right sections of code and make sure that no matter what happens in your code—function calls, exceptions, loops, etc.—that the locks are properly released.

If, instead, you deal with transactions you can wrap up code that potentially modifies something in a set of transactions and not worry about whether or not they’re clobbering each other: if they don’t, both threads execute normally and the transactions are committed. If they do interfere with each other, then the one that committed its transaction first wins and the other thread will simply restart and try again from the beginning of the transaction.
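The read-compute-commit-or-retry loop can be sketched in miniature. This toy TVar class (a name borrowed from Haskell’s STM, used here loosely) reads optimistically, then commits only if no other thread committed in the meantime; it is an illustration of the idea, not real software transactional memory:

```python
import threading

class TVar:
    """A toy transactional variable: optimistic reads, commit-with-check.

    If another thread committed between our read and our commit, the
    transaction restarts from the beginning."""
    def __init__(self, value):
        self.value = value
        self._commit = threading.Lock()   # held only for the brief commit step

    def atomically(self, update):
        while True:
            snapshot = self.value         # optimistic read, no lock held
            result = update(snapshot)     # run the transaction's code
            with self._commit:
                if self.value == snapshot:  # nobody else committed meanwhile
                    self.value = result
                    return result
            # conflict: another transaction won, so retry from the top

counter = TVar(0)

def inc():
    for _ in range(100):
        counter.atomically(lambda n: n + 1)

threads = [threading.Thread(target=inc) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every increment eventually commits, possibly after retries
```

Notice that threads never wait while running the transaction body itself; they only pay for a retry when there is an actual conflict, which is exactly the trade-off the article describes.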

Now, having just scratched the surface of how to prevent race conditions in our concurrent code, making it possible to actually share data between threads, we need to increase the realism of our discussion and introduce caches. The caches in any computer’s hardware are small amounts of very fast memory. We need caches in our processors to improve performance, because even though RAM has become much faster over the years, it’s still very slow relative to the clock speed9 of our processors10. So rather than relying on relatively slow RAM for all data retrieval, modern processors have a hierarchy of caches to hold frequently accessed data. In most modern chips there are three different levels of cache, L1 through L3, where L1 is the smallest and fastest and L3 is the largest and slowest. Accessing the L1 cache, for example, may only cost you a handful of clock cycles of waiting, rather than the ~100 you may wait for a RAM access.

The caches form a hierarchy: if a piece of data can’t be found in one cache, then the next cache up is checked, in much the same way that a page fault means the data will be retrieved from disk rather than RAM. On a single-processor/single-core system the hierarchy is simple, but on modern chips there are generally separate L1 caches for each core, while L2 caches can be shared by multiple cores, and there’s likely only one L3 cache.

How does this complicate the story of concurrency? Like many of our problems, the cause of this headache comes from writing to shared data. If two threads have a piece of data in their separate caches, and they both write to it, what happens? Do the changes get pushed through to main memory or do they stay in the local caches? If they’re pushed to main memory, how do we make sure that inaccurate, or stale, reads don’t happen in the meantime? If they’re kept in the local cache in the short term, how do other processes know to look there rather than their own local copies of the data?

After all, if we do something simple like queue up the writes to main memory without ensuring that all the other caches are made aware of the change, even the locked version of our concurrent code may still fail: the counter may be updated in one core’s cache without ever being updated in the other, and the writes will still clobber each other once they finally take place.

These are complicated questions, too deep to resolve in a single article or perhaps even a single textbook, but these are the kinds of issues that chipmakers consider when designing their hardware. The hardware designer has to take into account the tradeoffs between accuracy and speed. The strongest notion of consistency would mean that a stale read is literally impossible. If a thread reads from a memory location, then it must read the most recently written value for that memory location (regardless of caching). This isn’t feasible, but the next best “gold standard” for hardware makers is sequential consistency.

Sequential consistency doesn’t necessarily guarantee that there will be no stale reads, but it does mean that there is a consistent order of events that all threads can observe. It still doesn’t guarantee that the resolution of writes and reads to memory will happen in the order the instructions were given. Thankfully, there’s another primitive that modern hardware gives us for enforcing some ordering: the memory barrier, often called a memory fence. Issuing a memory fence forces all outstanding reads and writes to be completely resolved and synchronized across memory before new reads and writes can occur. To fix our example so that we regain the ordering we need, we can place a memory barrier inside the release function, so that any reads and writes made between the acquire and release of the lock are resolved before the next thread acquires the lock.

Modern processors often don’t quite meet the standard of sequential consistency, which makes life complicated for computer scientists trying to understand how concurrency actually works on hardware.11

Finally, I’d like to mention a line of research I personally think is very interesting, which ties everything we’ve been talking about together: relativistic programming12. Relativistic programming is about how to embrace the lack of ordering of events that comes with modern hardware, only synchronizing in lighter-weight ways when it’s truly necessary, and allowing processes to maintain their own self-consistent picture of the world when true synchronization isn’t necessary. For example, stale reads aren’t necessarily a problem if a thread isn’t also using the stale read to update the data. If one thread reads an older version and uses that in its decision making even though an update has happened, it’s not necessarily a bad thing: after all, if the reading thread had just read this data before the update the same thing would have happened. From this perspective, the order of events is relative, much like the problem of simultaneity in Special Relativity13. Relativistic programming is thus intended to address weaknesses in standard locking or transactional approaches to concurrency14 along with the need for aggressive uses of memory barriers, like we saw in our discussion of using locks in the face of caching issues.

In summary, concurrency is a much richer and stranger topic than you might have realized, and the question of “What’s the best way to allow threads to work together?” is still open. If you’ve found this interesting but you’re not sure where to start, I’d recommend picking a programming language and studying the interface it provides for concurrent programming. For example, if you like Haskell or Clojure, then you can easily experiment with transactions. If you prefer C, then you have ample opportunity to work with the low-level OS threading interface. Just look at the libraries and primitives for concurrent programming your favorite language provides and study from there.

Clarissa has worked in mathematics, physics, and computer science research but spends much of her time now trying to make computer science education accessible to a broader audience.

  1. We’ll talk more about the memory hierarchy later in this article. 
  2. This is an experience I think we can all relate to. 
  3. A good book on this and other operating system topics is the “dinosaur book” Operating System Concepts by Silberschatz et al. 
  4. As opposed to “green threads” or “user level threads”, which are a user-space abstraction. 
  5. The ability to treat functions as any other kind of data, passing them in as arguments or returning them from other functions. 
  6. There’s actually more ways this could interleave, but they don’t change the outcomes from the four we’re looking at. 
  7. They may be called other things depending on the implementation, but there’ll always be these two methods. 
  8. Whenever you see “atomic” in talk of concurrent programming, it generally means that the operation is going to happen as a single unit and that even if it takes multiple steps internally, no other thread can observe that it takes multiple steps. 
  9. As an aside, clock speed is generally given as a frequency, like 2 GHz, and is the number of times per second the processor performs an operation. The actual time to process an instruction is longer than a single cycle, but since all modern processors implement some form of “pipelining”, which is like an assembly line for computation, the average tends to be about one instruction per cycle. Some clever chip architectures are even able to process a bit more than one instruction per cycle on average by having redundancy between the pieces of the pipeline and starting more than one instruction per cycle when possible. 
  10. The ratio depends on the quality of RAM and the clock speed of the processor, but 100 clock cycles for a RAM access isn’t a bad estimate. 
  11. A Better Memory Model: x86-TSO, Owens et al. 
  14. Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking and Transactional Memory, Paul E. McKenney et al.