Multi-core and Locking

Intel has released additional information on its Transactional Synchronization Extensions (TSX), an instruction set architecture (ISA) extension that makes hardware-accelerated transactional memory possible. What does that mean in the real world?
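To make that concrete: TSX adds new instructions that let a block of code run as a hardware transaction. A minimal sketch of the RTM (Restricted Transactional Memory) side, using the C compiler intrinsics from immintrin.h (compile with -mrtm on GCC/Clang), might look like the following. The counter, the flag lock, and the function name are all illustrative, not Intel's reference code:

    #include <immintrin.h>    /* _xbegin/_xend/_xabort RTM intrinsics */
    #include <stdatomic.h>

    static atomic_int fallback_lock;   /* 0 = free, 1 = held (illustrative) */
    static long shared_counter;

    static void lock_fallback(void)
    {
        while (atomic_exchange(&fallback_lock, 1))  /* test-and-set */
            while (atomic_load(&fallback_lock))     /* spin until it looks free */
                ;
    }

    void increment(void)
    {
        unsigned status = _xbegin();           /* try a hardware transaction */
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&fallback_lock))   /* a lock holder is active: */
                _xabort(0xff);                 /* abort, take the slow path */
            shared_counter++;                  /* transactional, no lock taken */
            _xend();                           /* commit atomically */
        } else {
            lock_fallback();                   /* aborted: use the real lock */
            shared_counter++;
            atomic_store(&fallback_lock, 0);
        }
    }

The point is the fast path: as long as no other core touches the same cache lines, the update commits without any lock ever being taken; only on a conflict does the code fall back to classic locking.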

The more cores you get in a system, the more threads you need to keep them busy. Unfortunately, that is not as easy as it sounds: threads have to work on shared data, and they need locks to make sure the end result is correct. A thread has to acquire a lock, which may mean waiting until another thread releases it. That can lead to serious lock contention and bad scaling, even to the point where adding more cores (and threads) results in a performance loss instead of a gain.

We measured this in a real-world benchmark (based upon MySQL InnoDB) and showed that spinning locks were indeed to blame. Spinlocks that spend too much time "busy waiting" do more than destroy the scalability of your multi-core server: they also waste quite a bit of energy. As the cores hardly perform any useful work yet can no longer go to sleep, a torrent of spinlocks can wreak havoc on a server's performance-per-watt ratio. The following chart, based upon measurements of an HPC benchmark, is another example. Looking at the performance you get from each core helps to summarize the problem.

[Chart: Euler3D HPC benchmark, relative performance per core]

The HPC benchmark shows that the more cores the software uses, the less performance we get per core. With 16 cores working, each core offers only 60% of the performance of a single-threaded run. A shared data structure is the problem. Careful tuning could make this software scale better, but that would require a substantial amount of time and effort.
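The "busy waiting" blamed above is easy to see in code. A bare-bones test-and-set spinlock in C11 looks something like this (a sketch, not the InnoDB implementation):

    #include <stdatomic.h>

    static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

    void spin_lock(void)
    {
        /* Each failed test-and-set bounces the lock's cache line between
           cores and keeps this core 100% busy doing no useful work. */
        while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
            ;   /* busy wait: the core cannot enter a sleep state */
    }

    void spin_unlock(void)
    {
        atomic_flag_clear_explicit(&lock_flag, memory_order_release);
    }

Under contention, the waiting cores burn power and memory bandwidth while contributing nothing, which is exactly why heavy spinning hurts both scaling and performance per watt.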

The root of the locking problems is that locking is a trade-off. Suppose you have a shared data structure such as a small table or list. If the developer is time constrained, which is often the case, the easiest way to guarantee consistency is to let a certain thread lock the entire data structure (e.g. the table lock of MySQL MyISAM). The thread acquires the lock and then executes its code (the critical section). Meanwhile all other threads are blocked until the thread releases the lock on the structure. However, locking a piece of data is only necessary if two threads want to write to it. The classical example is a sequence of two bank transactions:
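(The original illustration is not reproduced here; the following C sketch, with a hypothetical deposit() function and account structure, captures the same scenario.)

    #include <pthread.h>

    struct account {
        pthread_mutex_t lock;
        long balance;
    };

    /* Two threads deposit into the same account. Without the lock, both
       could read the same old balance and one deposit would be lost; the
       lock makes the read-modify-write sequence atomic. */
    void deposit(struct account *acc, long amount)
    {
        pthread_mutex_lock(&acc->lock);
        acc->balance += amount;       /* read, add, write back */
        pthread_mutex_unlock(&acc->lock);
    }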

Locking an entire data structure is something else entirely. You can imagine that the other threads might all be writing to different values and should be able to do a lot of work in parallel. Coarse-grained locking of entire structures is generally not necessary, but it makes the developer's job easier.
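A sketch of what that costs (names again hypothetical): one mutex guarding a whole table serializes threads even when they touch different rows:

    #include <pthread.h>

    #define ROWS 1024
    static long table[ROWS];
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Coarse-grained: every update takes the same lock, so two threads
       updating row 3 and row 900 still run one after the other, even
       though they could safely proceed in parallel. */
    void update_row(int row, long value)
    {
        pthread_mutex_lock(&table_lock);
        table[row] = value;
        pthread_mutex_unlock(&table_lock);
    }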

The developer could rewrite his/her code and apply much finer-grained locking, but this is a complex and time-consuming task, and the critical code section will probably need several variables to do its job. In that case you can get deadlocks: one thread locks variable A, another thread locks variable B, and since both need A and B, each waits forever for the lock the other one holds. There is a reason why getting good multi-core scalability requires a large investment of time and money.
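Continuing the hypothetical bank sketch from above: a transfer needs two accounts, so with per-account locks two opposite transfers can deadlock:

    #include <pthread.h>
    #include <stdint.h>

    struct account {                /* same hypothetical account type as above */
        pthread_mutex_t lock;
        long balance;
    };

    /* Deadlock-prone: thread 1 runs transfer(A, B, ...) and locks A, then
       waits for B; thread 2 runs transfer(B, A, ...) and locks B, then
       waits for A. Neither can ever proceed. (Assumes from != to.) */
    void transfer_naive(struct account *from, struct account *to, long amount)
    {
        pthread_mutex_lock(&from->lock);
        pthread_mutex_lock(&to->lock);
        from->balance -= amount;
        to->balance   += amount;
        pthread_mutex_unlock(&to->lock);
        pthread_mutex_unlock(&from->lock);
    }

    /* The usual fix: impose a global lock order, e.g. lower address first.
       Getting such rules right across a large code base is part of the
       "large investment of time and money". */
    void transfer_ordered(struct account *from, struct account *to, long amount)
    {
        struct account *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
        struct account *second = (uintptr_t)from < (uintptr_t)to ? to   : from;
        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        from->balance -= amount;
        to->balance   += amount;
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }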

Comments

  • 1008anan - Thursday, September 20, 2012 - link

Good summary, Johan De Gelas. I look forward to future articles that further elaborate on how exactly Transactional Synchronization Extensions (TSX) achieve hardware-accelerated transactional memory in a scalable, generalized way.

Part of the solution, in my opinion, is for a core that can run 2 or more simultaneous threads to execute only 1 thread at a time for single-threaded, computationally heavy workloads. To give an example, a single Haswell core can execute 32 single precision (32-bit) operations per clock. At a 2 gigahertz SoC speed, a single core can execute 64 billion operations per second with single-threaded code. Unfortunately this can only help so much. To make major performance gains, an efficient, generalized, scalable way needs to be found to distribute single-threaded computational workloads across multiple cores. Much easier said than done.
  • extide - Saturday, September 22, 2012 - link

32 single precision ops is wrong. How many ops a core can do per clock is determined by the front end and how many execution ports it has, not by how many bits a particular instruction operates on.
  • extide - Saturday, September 22, 2012 - link

To further that: Ivy Bridge has 6 ports and Haswell has 8, but the ports don't necessarily all have the same capabilities.
  • twotwotwo - Thursday, September 20, 2012 - link

    Very smart idea, very clever to implement it with backwards compatibility, and it's good that Intel's working out uses for die area that aren't just multiplying cores or cache.

    But a sad thing is, when this feature really helps your app, like doubles or triples throughput, then it means you must be leaving speed on the table on _other_ platforms--manycore Ivy Bridge or AMD or whatever--because you didn't take fine-grained locks. If the difference matters, then for many years you'll have to go to the effort to do fine-grained locks and make it fast on Ivy Bridge too.

    The other thing is, the problem of parallelizing tasks is deeper than just fine-grained locks being tricky. If you want to make, say, a Web browser faster by taking advantage of multiple cores, you still have to do deep thinking and hard work to find tasks that can be split off, deal with dependencies, etc. Smart folks are trying to do that, but they _still_ have hard work when they're on systems with transactional memory.

    That may be overly pessimistic. It's likely apps today _are_ leaving performance on the table by taking coarse locks for simplicity's sake, and they'll get zippier when TSX is dropped in. Or maybe transactional memory will be everywhere, or on all the massively multicore devices anyway, faster than I think. Or, finally, maybe Intel knows TSX won't have its greatest effects for a long time but is making a plan for the very long term like only Intel can.
  • Paul Tarnowski - Thursday, September 20, 2012 - link

    I think it's the last one more than anything else. It really seems to be about setting up architecture for the future. Right now with four and eight cores the losses aren't that high, and effects won't be seen on anything but servers. While it is a big deal, it will be even less important to the consumer market than most other architecture extensions.
  • USER8000 - Thursday, September 20, 2012 - link

This is an article from them nearly three years ago:

    http://blogs.amd.com/developer/2009/11/17/the-velo...
  • NeBlackCat - Thursday, September 20, 2012 - link

    I stopped reading at the end of the 2nd para after the bar chart on the first page:

    "The root of the locking problems is that locking is a trade-off. Suppose you have a shared data structure such as a small table or list. If the developer is time constrained, which is often the case, the easiest way to guarantee consistency is to let a certain thread lock the entire data structure (e.g. the table lock of MySQL MyISAM). The thread acquires the lock and then executes its code (the critical section). Meanwhile all other threads are blocked until the thread releases the lock on the structure (*). However, locking a piece of data is only necessary if two threads want to write to it. (**)"

(*) all other threads aren't blocked, only those that also need access to the same data.
(**) locking a piece of data is necessary even if only *one* thread wants to write to it (else you risk a reader reading it before the writer has finished writing)

And if the first locker is doing something that may take (in CPU terms) considerable time, which is likely in the database scenario, then the OS will probably schedule something else (there are always other processes and threads wanting to run) on a core/hyperthread running a blocked thread, so it won't sit idle anyway.

Unless things have changed since I was a real-time software engineer, anyway.
  • Wardrop - Thursday, September 20, 2012 - link

I think you're nitpicking. We all understand the locking problem (those of us to whom it's relevant, anyway). There's no point in the author going into more detail to clarify what we should already know - he gives us enough information to at least know what he's talking about, and that was the point of that paragraph.
  • JohanAnandtech - Thursday, September 20, 2012 - link

    "all other threads arent locked, only those that also need access to the same data."

    With a bit of goodwill, you would accept that this is implied.

    " then the OS will probably schedule something else "

Right. But it is no free lunch. In the case of a spinlock, there will be several attempts to lock and then a context switch. That is thousands of cycles lost without doing any useful work. BTW, if you really want to see some in-depth cases, we linked to

    http://www.anandtech.com/show/5279/the-opteron-627...
    http://www.anandtech.com/show/5279/the-opteron-627...

which go into a lot more detail.
  • gchernis - Thursday, September 20, 2012 - link

OP did not explain why software transactional memory (STM) is good before diving into Intel's hardware solution. The current breed of databases uses STM, but so can other applications. The prerequisite is the assumption that there's much more reading than writing going on.
