Thread Safety - The Hard Way

For a regular RAD/Client application, a single thread is usually enough. Using messages, and/or a TTimer allow some simple cooperative multi-tasking in the application, good enough for most use.

But on server side, scalability requires the business code to be thread-safe. Thread safety is hard, harder than parallel computing from my experiments.

Note that multi-thread programing is not easy, sometimes very difficult to debug, because the problems are hard to reproduce - it is easy to get an HeisenBug.
So ensure you first read some general features about thread safety, and modern CPU memory and operation execution. I just found out these series of blog articles, which details some caveats which may appear in border cases... which may occur to you as they do for me!

Saved By The Lock

To ensure thread safety, the most convenient feature we have is the lock, which protects some code section to be executed from several threads.

To be more accurate, we don't protect code, we protect resources. The code itself is thread-safe. But the data requires attention, when several threads access it. If we only read the data, it is fine. But once the data is changed by one thread, then other threads are likely to break - imagine that you add an item to a list, then the list storage is reallocated in memory, then you get some random GPF due to invalid pointers. Or two threads add items at the same time - then the counter or the storage may become pretty wrong. We need to lock the data access to prevent such issues.

Here is how the POSIX libpthread library offers this lock - similar to a Windows Critical Section:

All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesireable memory reordering across the boundaries. So you write your thread-unsafe code as the ham in your sandwich, and you will ensure that only a single thread will execute it at once.

Locks Are Not Expensive, Contention Is

The main rule about using locks it that they should be as small as possible.
Why?

Acquiring an unlocked mutex, or releasing a mutex is almost free, it is usually a single atomic assembly instruction. Atomic instructions have the lock prefix on Intel/AMD, or are explicitly specified as such, e.g. the cmpxchg operation. On ARM, you usually need to write a small loop, or at least several instructions.
In mormot.core.base.pas we provide some cross-platform and cross-compiler functions for atomic process, written in tuned assembly or calling the RTL:

procedure LockedInc32(int32: PInteger);
procedure LockedDec32(int32: PInteger);
procedure LockedInc64(int64: PInt64);
function InterlockedIncrement(var I: integer): integer;
function InterlockedDecrement(var I: integer): integer;
function RefCntDecFree(var refcnt: TRefCnt): boolean;
function LockedExc(var Target: PtrUInt; NewValue, Comperand: PtrUInt): boolean;
procedure LockedAdd(var Target: PtrUInt; Increment: PtrUInt);
procedure LockedAdd32(var Target: cardinal; Increment: cardinal);
procedure LockedDec(var Target: PtrUInt; Decrement: PtrUInt);

But if two (or more) threads fight against acquiring a lock, then only one would get it. So the other threads will have to wait. Waiting is usually done by first spinning (i.e. running a void loop), and trying to acquire the lock. Eventually, an OS kernel call could take place, to leverage the CPU core, and try to execute some pending code from another thread.

This lock contention, spinning or switching to another thread, is what really degrades the whole process performance. You are really wasting time and energy just for accessing a shared resource.

Therefore, in practice, I would advice to follow some simple rules.

Make it work, then make it fast

You may first use a giant Critical Section for a whole method. Most of the time, it would be fine.

Don't guess, run actual benchmarking on multi-core CPU (not a single core VM!), trying to reproduce the worse case possible which may happen.
Have detailed, and thread-aware logs, to properly debug production code - the Heisenbugs are likely to appear not on your development PC, but with real world load.

Once you have identified a real bottleneck, try to split the logic code into small pieces:

  • Ensure you have a multi-thread regression testing code for this method, to validate your modifications are actually still correct and ... faster;
  • Some part of the code may be thread-safe by itself (e.g. the error checking or result logging): no need to protect it with the lock;
  • Isolate the processing code into some private/protected methods, depending on the resources shared, with proper locking.
The Less The Better

Eventually, to achieve the best performance:

  • Keep your locks as short as possible.
  • Prefer more locks on small data than some giant locks;
  • Use a lock per list or queue, not per process or business logic method;
  • Make a private copy of the data (e.g. on a local stack variable) within the lock, then process it outside the lock;
  • Avoid calling other methods within a lock: focus on the shared data, and be sure that the functions you call may not be thread-safe;
  • Try to avoid memory allocations.
Pickup The Right Lock

Generally speaking, the regular TRTLCriticalSection is fine, and should be preferred.
Our mormot.core.os.pas unit leverage this into a cross-platform way, among FPC/Delphi compilers and operating systems. It tries to call directly the OS, with proper inlining if possible.

But if you follow "The Less The Better" rule above, your code may be something very small like this:

procedure TAsyncConnections.AddGC(aConnection: TPollAsyncConnection);
begin
  if Terminated then
    exit;
  (aConnection as TAsyncConnection).fLastOperation := fLastOperationMS; // in ms
  fGCSafe.Lock;
  ObjArrayAddCount(fGC, aConnection, fGCCount);
  fGCSafe.UnLock;
end;

Here you can see that the lock is very small, and setting fLastOperation has been done outside of the lock, since this operation is thread-safe by design: this connection will be free once, whereas fGC/fGCCount list may be accessed from several threads. Also note that ObjArrayAddCount() is a well defined function which should not have its behavior changed, nor raise any exception, so it is safe to be used... and we even didn't put any try...finalll fGCSafe.UnLock; statement here, because a try..finally has a cost on some platforms (e.g. FPC Linux generates several RTL calls even if no exception is raised).

Or course, we could use our TSynLock for fGCSafe - which encapsulate a TRTLCriticalSection in an object-oriented manner.
But since here we know that the lock will be very small, no need to have the whole overhead of a Critical Section or a mutex/futex, which always has a cost at least in resources.

Several Locks To Rule Them All

In addition to the TSynLock wrapper, mormot.core.os.pas defines several kind of locks:

  // a lightweight exclusive non-rentrant lock, stored in a PtrUInt value
  // - calls SwitchToThread after some spinning, but don't use any R/W OS API
  // - warning: methods are non rentrant, i.e. calling Lock twice in a raw would
  // deadlock: use TRWLock or TSynLocker/TRTLCriticalSection for reentrant methods
  // - light locks are expected to be kept a very small amount of time: use
  // TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several lightlocks, each protecting a few variables (e.g. a list), may
  // be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TLightLock = record
    procedure Lock;
    function TryLock: boolean;
    procedure UnLock;
  end;

  // a lightweight multiple Reads / exclusive Write non-upgradable lock
  // - calls SwitchToThread after some spinning, but don't use any R/W OS API
  // - warning: ReadLocks are reentrant and allow concurrent acccess, but calling
  // WriteLock within a ReadLock, or within another WriteLock, would deadlock
  // - consider TRWLock is you need an upgradable lock
  // - light locks are expected to be kept a very small amount of time: use
  // TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several lightlocks, each protecting a few variables (e.g. a list), may
  // be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TRWLightLock = record
    procedure ReadLock;
    function TryReadLock: boolean;
    procedure ReadUnLock;
    procedure WriteLock;
    function TryWriteLock: boolean;
    procedure WriteUnLock;
  end;

type
  TRWLockContext = (
    cReadOnly,     cReadWrite,     cWrite);

  // a lightweight multiple Reads / exclusive Write reentrant lock
  // - calls SwitchToThread after some spinning, but don't use any R/W OS API
  // - locks are expected to be kept a very small amount of time: use TSynLocker
  // or TRTLCriticalSection if the lock may block too long
  // - warning: all methods are reentrant, but WriteLock/ReadWriteLock would
  // deadlock if called after a ReadOnlyLock
  TRWLock = record
    procedure ReadOnlyLock;
    procedure ReadOnlyUnLock;
    procedure ReadWriteLock;
    procedure ReadWriteUnLock;
    procedure WriteLock;
    procedure WriteUnlock;
    procedure Lock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
    procedure UnLock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
  end;

TLightLock is the simplest lock.
It will acquire a lock, then spin or sleep on contention. But be aware that it is not reentrant: if you call Lock twice in a row from the same thread, the second Lock would wait forever. So you must ensure that your code doesn't call any other method which may also call Lock during its process, otherwise your thread would "deadlock". Such race conditions are relatively easy to identify: it will always block and deadlock, whatever condition there is. To fix it, don't call other method which run Lock: for instance, you may define some private/protected LockedDoSomething methods, which won't have any lock but expect to be called within a lock.

TRWLightLock and TRWLock are multiple Reads / exclusive Write locks.
This is a feature missing in the regular Critical Section. It is very likely that your shared resource will be often read, and seldom modified. Since reads are thread-safe by design, there is no need to prevent other reading threads to read the resource. Only writing/updating the data should be exclusive and protected from other threads. This is the purpose of ReadLock / ReadOnlyLock and WriteLock.
TRWLock goes one step further, and allow a read lock to be upgraded into a write lock, using ReadWriteLock instead of ReadOnlyLock. ReadWriteLock could be followed by a WriteLock, whereas ReadOnlyLock should always be followed by ReadOnlyUnlock, but never by a WriteLock which would deadblock.
Last but not least, ReadOnlyLock / ReadOnlyUnLock are re-entrant (you can call them nested), because they are implemented using a counter. And TRWLock.WriteLock is re-entrant, because it takes track of the locked thread ID, so detects nested calls - as a TRtlCriticalSection does.

Low Level Stuff

Just for fun, take a look at the source code:

procedure TLightLock.LockSpin;
var
  spin: PtrUInt;
begin
  spin := SPIN_COUNT;
  repeat
    spin := DoSpin(spin);
  until LockedExc(Flags, 1, 0);
end;

procedure TLightLock.Lock;
begin
  // we tried a dedicated asm but it was slower: inlining is preferred
  if not LockedExc(Flags, 1, 0) then
    LockSpin;
end;

function TLightLock.TryLock: boolean;
begin
  result := LockedExc(Flags, 1, 0);
end;

procedure TLightLock.UnLock;
begin
  Flags := 0; // non reentrant locks need no additional thread safety
end;

TLightLock is pretty straightforward, using a simple CAS compare & exchange LockedExc() atomic function, but TRWLightLock and TRWLock are slightly more complex.

In mORMot 2 code base, we tried to use the best lock possible. TRtlCriticalSection / TSynLock when the locks are likely to have a contention for some time (more than a micro second), and other locks, with multiple Reads / exclusive Write methods if possible, are used to protect very small tuned code.
Of course, thread safety is tested during the regression tests, with dozen of concurrent threads trying to break the locks logic. I can tell you that we found some nasty problems in the initial code of our TAsyncServer, but after days debugging and logging, it sounds stable now - but it is the matter for another article! :)

Feedback is welcome in our forum, as usual!