<h3>Three Locks To Rule Them All</h3>
<p>Arnaud Bouchez, Synopse Open Source blog, 2022-01-22</p>
<p>To ensure thread-safety, especially on server side, we usually protect code with critical sections, or locks. In recent Delphi revisions, we have the <code>TMonitor</code> feature, but I would rather trust the OS for locks, which are implemented using Windows Critical Sections, or POSIX futex/mutex.</p>
<p><img src="https://blog.synopse.info?post/public/blog/flyinglock.png" alt="" /></p>
<p>But not all locks are born equal. Most of the time, the overhead of a WinAPI Critical Section or of the <code>pthread</code> library is not needed.<br />
So, in <em>mORMot 2</em>, we introduced several native locks in addition to those OS locks, with multi-read/single-write abilities, or re-entrancy.</p>
<h4>Thread Safety - The Hard Way</h4>
<p>For a regular RAD/Client application, a single thread is usually enough. Using messages, and/or a <code>TTimer</code>, allows some simple cooperative multi-tasking in the application, good enough for most uses.</p>
<p>But on server side, scalability requires the business code to be thread-safe. Thread safety is hard, harder than parallel computing in my experience.</p>
<p>Note that multi-threaded programming is not easy, and sometimes very difficult to debug, because the problems are hard to reproduce - it is easy to get a <a href="https://en.wikipedia.org/wiki/Heisenbug">HeisenBug</a>.<br />
So make sure you first read some general material about thread safety, and about how modern CPUs access memory and execute operations. I recently found <a href="https://preshing.com/20120612/an-introduction-to-lock-free-programming/">this series of blog articles</a>, which details some caveats which may appear in corner cases... which may happen to you as they do to me!</p>
<h4>Saved By The Lock</h4>
<p>To ensure thread safety, the most convenient feature we have is the lock, which prevents some code section from being executed by several threads at the same time.</p>
<p>To be more accurate, we don't protect code, we protect resources. The code itself is thread-safe. But the data requires attention, when several threads access it. If we only read the data, it is fine. But once the data is changed by one thread, then other threads are likely to break - imagine that you add an item to a list, then the list storage is reallocated in memory, then you get some random GPF due to invalid pointers. Or two threads add items at the <em>same time</em> - then the counter or the storage may become pretty wrong. We need to lock the data access to prevent such issues.</p>
<p>Here is how the POSIX <code>libpthread</code> library offers this lock - similar to a Windows Critical Section:</p>
<p><img src="https://blog.synopse.info?post/public/blog/pthreadlock.png" alt="" /></p>
<p>All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesirable memory reordering across the boundaries. So you write your thread-unsafe code as the ham in your sandwich, and you ensure that only a single thread will execute it at once.</p>
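<p>As a minimal sketch of this sandwich, here is a shared array filled from several threads. It only relies on the RTL <code>TCriticalSection</code> wrapper from the <code>SyncObjs</code> unit, not on any <em>mORMot</em> unit; the names <code>Safe</code>, <code>Values</code>, <code>Count</code>, <code>AddValue</code> and <code>TAdder</code> are made up for the illustration.</p>
<pre>
program LockSandwich;

{$ifdef FPC}{$mode objfpc}{$H+}{$else}{$APPTYPE CONSOLE}{$endif}

uses
  {$ifdef UNIX}cthreads,{$endif} // FPC on POSIX needs a thread manager first
  Classes, SyncObjs;

const
  THREADS = 4;
  ITEMS_PER_THREAD = 10000;

type
  TAdder = class(TThread)
  protected
    procedure Execute; override;
  end;

var
  Safe: TCriticalSection;   // protects both Values and Count
  Values: array of integer; // shared storage, reallocated when it grows
  Count: integer;           // number of items actually stored

procedure AddValue(aValue: integer);
begin
  Safe.Acquire; // top slice of bread
  try
    if Count = Length(Values) then
      SetLength(Values, Count * 2 + 16); // reallocation is safe inside the lock
    Values[Count] := aValue;
    inc(Count);
  finally
    Safe.Release; // bottom slice of bread
  end;
end;

procedure TAdder.Execute;
var
  i: integer;
begin
  for i := 1 to ITEMS_PER_THREAD do
    AddValue(i);
end;

var
  t: array[1..THREADS] of TAdder;
  n: integer;
begin
  Safe := TCriticalSection.Create;
  for n := 1 to THREADS do
    t[n] := TAdder.Create(false); // false = start running immediately
  for n := 1 to THREADS do
  begin
    t[n].WaitFor;
    t[n].Free;
  end;
  // with the lock, no item is lost: THREADS * ITEMS_PER_THREAD are stored
  writeln('items stored: ', Count);
  Safe.Free;
end.
</pre>
<p>Without the <code>Acquire</code> / <code>Release</code> pair, two threads could read the same <code>Count</code>, or write into a storage which another thread is reallocating.</p>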
<h4>Locks Are Not Expensive, Contention Is</h4>
<p>The main rule about using locks is that the locked sections should be as small as possible.<br />
Why?</p>
<p>Acquiring an unlocked mutex, or releasing a mutex, is almost free: it is usually a single atomic assembly instruction. Atomic instructions have the <code>lock</code> prefix on Intel/AMD (e.g. <code>lock cmpxchg</code>), or are implicitly locked, like <code>xchg</code> with a memory operand. On ARM, you usually need to write a small loop, or at least several instructions.<br />
In <code>mormot.core.base.pas</code> we provide some cross-platform and cross-compiler functions for atomic operations, written in tuned assembly or calling the RTL:</p>
<pre>
procedure LockedInc32(int32: PInteger);
procedure LockedDec32(int32: PInteger);
procedure LockedInc64(int64: PInt64);
function InterlockedIncrement(var I: integer): integer;
function InterlockedDecrement(var I: integer): integer;
function RefCntDecFree(var refcnt: TRefCnt): boolean;
function LockedExc(var Target: PtrUInt; NewValue, Comperand: PtrUInt): boolean;
procedure LockedAdd(var Target: PtrUInt; Increment: PtrUInt);
procedure LockedAdd32(var Target: cardinal; Increment: cardinal);
procedure LockedDec(var Target: PtrUInt; Decrement: PtrUInt);
</pre>
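<p>To give an idea of how such atomic functions are used, here is a small sketch of a shared counter updated without any lock. It assumes Free Pascal, where <code>InterlockedIncrement</code> is available from the RTL <code>System</code> unit, so it does not need <em>mORMot</em>; <code>Hits</code> and <code>TCounter</code> are hypothetical names.</p>
<pre>
program AtomicCounter;

{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

uses
  {$ifdef UNIX}cthreads,{$endif} // FPC on POSIX needs a thread manager first
  Classes;

const
  THREADS = 4;
  LOOPS = 100000;

type
  TCounter = class(TThread)
  protected
    procedure Execute; override;
  end;

var
  Hits: longint; // shared counter, only modified by atomic instructions

procedure TCounter.Execute;
var
  i: integer;
begin
  for i := 1 to LOOPS do
    InterlockedIncrement(Hits); // a plain inc(Hits) could lose some updates
end;

var
  t: array[1..THREADS] of TCounter;
  n: integer;
begin
  for n := 1 to THREADS do
    t[n] := TCounter.Create(false); // false = start running immediately
  for n := 1 to THREADS do
  begin
    t[n].WaitFor;
    t[n].Free;
  end;
  writeln('hits: ', Hits); // THREADS * LOOPS, i.e. 400000
end.
</pre>
<p>A single integer can be maintained this way; as soon as two values must stay consistent with each other (e.g. a list and its count), a lock is needed.</p>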
<p>But if two (or more) threads fight to acquire a lock, then only one will get it. So the other threads will have to wait. Waiting is usually done by first <em>spinning</em> (i.e. running an empty loop), and trying to acquire the lock again. Eventually, an OS kernel call could take place, to release the CPU core, and try to execute some pending code from another thread.</p>
<p><img src="https://blog.synopse.info?post/public/blog/lockcontention.png" alt="" /></p>
<p>This lock contention, spinning or switching to another thread, is what really degrades the whole process performance. You are really wasting time and energy just for accessing a shared resource.</p>
<p>Therefore, in practice, I would advise following some simple rules.</p>
<h5>Make it work, then make it fast</h5>
<p>You may first use a giant Critical Section for a whole method. Most of the time, it would be fine.</p>
<p>Don't guess, run actual benchmarks on a multi-core CPU (not a single core VM!), trying to reproduce the worst case which may possibly happen.<br />
Keep detailed, thread-aware logs, to properly debug production code - the Heisenbugs are likely to appear not on your development PC, but with real world load.</p>
<p>Once you have identified a real bottleneck, try to split the logic code into small pieces:</p>
<ul>
<li>Ensure you have a multi-thread regression testing code for this method, to validate your modifications are actually still correct and ... faster;</li>
<li>Some part of the code may be thread-safe by itself (e.g. the error checking or result logging): no need to protect it with the lock;</li>
<li>Isolate the processing code into some private/protected methods, depending on the resources shared, with proper locking.</li>
</ul>
<h5>The Less The Better</h5>
<p>Eventually, to achieve the best performance:</p>
<ul>
<li>Keep your locks as short as possible;</li>
<li>Prefer more locks on small data than some giant locks;</li>
<li>Use a lock per list or queue, not per process or business logic method;</li>
<li>Make a private copy of the data (e.g. on a local stack variable) within the lock, then process it outside the lock;</li>
<li>Avoid calling other methods within a lock: focus on the shared data, and remember that the functions you call may not be thread-safe;</li>
<li>Try to avoid memory allocations within the lock.</li>
</ul>
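<p>The "private copy within the lock, process outside the lock" rule could be sketched as follows. This is only an illustration using the RTL <code>TCriticalSection</code>; the <code>Pending</code> queue and the <code>Push</code>, <code>PopAll</code> and <code>ProcessPending</code> routines are hypothetical.</p>
<pre>
program CopyThenProcess;

{$ifdef FPC}{$mode objfpc}{$H+}{$else}{$APPTYPE CONSOLE}{$endif}

uses
  {$ifdef UNIX}cthreads,{$endif} // FPC on POSIX needs a thread manager first
  SyncObjs;

type
  TIntegerArray = array of integer;

var
  Safe: TCriticalSection; // protects Pending only
  Pending: TIntegerArray; // shared queue, meant to be filled by other threads

procedure Push(aValue: integer);
begin
  Safe.Acquire;
  try
    // a real queue would grow by chunks to limit reallocations in the lock
    SetLength(Pending, Length(Pending) + 1);
    Pending[High(Pending)] := aValue;
  finally
    Safe.Release;
  end;
end;

// detach the whole queue within the lock: only a reference is moved
function PopAll: TIntegerArray;
begin
  Safe.Acquire;
  try
    result := Pending; // take over the current storage
    Pending := nil;    // other threads now start from an empty queue
  finally
    Safe.Release;
  end;
end;

function ProcessPending: integer;
var
  local: TIntegerArray; // private copy, no other thread can see it
  i: integer;
begin
  local := PopAll; // very short lock
  result := 0;
  for i := 0 to High(local) do // the slow work runs outside of the lock
    inc(result, local[i]);
end;

begin
  Safe := TCriticalSection.Create;
  Push(1);
  Push(2);
  Push(3);
  writeln('sum = ', ProcessPending);           // 1 + 2 + 3 = 6
  writeln('still pending = ', Length(Pending)); // 0
  Safe.Free;
end.
</pre>
<p>The lock is held only while the reference is swapped, however long the processing of the detached items takes.</p>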
<h5>Pick The Right Lock</h5>
<p>Generally speaking, the regular <code>TRTLCriticalSection</code> is fine, and should be preferred.<br />
Our <em>mormot.core.os.pas</em> unit wraps it in a cross-platform way, across FPC/Delphi compilers and operating systems. It tries to call the OS directly, with proper inlining if possible.</p>
<p>But if you follow "The Less The Better" rule above, your code may be something very small like this:</p>
<pre>
procedure TAsyncConnections.AddGC(aConnection: TPollAsyncConnection);
begin
  if Terminated then
    exit;
  (aConnection as TAsyncConnection).fLastOperation := fLastOperationMS; // in ms
  fGCSafe.Lock;
  ObjArrayAddCount(fGC, aConnection, fGCCount);
  fGCSafe.UnLock;
end;
</pre>
<p>Here you can see that the lock is very small, and setting <code>fLastOperation</code> is done outside of the lock, since this operation is thread-safe by design: this connection will be freed only once, whereas the <code>fGC/fGCCount</code> list may be accessed from several threads. Also note that <code>ObjArrayAddCount()</code> is a well-defined function which should not have its behavior changed, nor raise any exception, so it is safe to use... and we didn't even put any <code>try ... finally fGCSafe.UnLock;</code> statement here, because a <code>try..finally</code> has a cost on some platforms (e.g. FPC Linux generates several RTL calls even if no exception is raised).</p>
<p>Of course, we could use our <code>TSynLocker</code> for <code>fGCSafe</code> - which encapsulates a <code>TRTLCriticalSection</code> in an object-oriented manner.<br />
But since here we know that the lock will be very small, there is no need to have the whole overhead of a Critical Section or a mutex/futex, which always has a cost, at least in resources.</p>
<h4>Several Locks To Rule Them All</h4>
<p>In addition to the <code>TSynLocker</code> wrapper, <em>mormot.core.os.pas</em> defines several kinds of locks:</p>
<pre>
  // a lightweight exclusive non-reentrant lock, stored in a PtrUInt value
  // - calls SwitchToThread after some spinning, but doesn't use any R/W OS API
  // - warning: methods are non-reentrant, i.e. calling Lock twice in a row would
  //   deadlock: use TRWLock or TSynLocker/TRTLCriticalSection for reentrant methods
  // - light locks are expected to be kept a very small amount of time: use
  //   TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several lightlocks, each protecting a few variables (e.g. a list), may
  //   be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TLightLock = record
    procedure Lock;
    function TryLock: boolean;
    procedure UnLock;
  end;

  // a lightweight multiple Reads / exclusive Write non-upgradable lock
  // - calls SwitchToThread after some spinning, but doesn't use any R/W OS API
  // - warning: ReadLocks are reentrant and allow concurrent access, but calling
  //   WriteLock within a ReadLock, or within another WriteLock, would deadlock
  // - consider TRWLock if you need an upgradable lock
  // - light locks are expected to be kept a very small amount of time: use
  //   TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several lightlocks, each protecting a few variables (e.g. a list), may
  //   be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TRWLightLock = record
    procedure ReadLock;
    function TryReadLock: boolean;
    procedure ReadUnLock;
    procedure WriteLock;
    function TryWriteLock: boolean;
    procedure WriteUnLock;
  end;

type
  TRWLockContext = (
    cReadOnly, cReadWrite, cWrite);

  // a lightweight multiple Reads / exclusive Write reentrant lock
  // - calls SwitchToThread after some spinning, but doesn't use any R/W OS API
  // - locks are expected to be kept a very small amount of time: use TSynLocker
  //   or TRTLCriticalSection if the lock may block too long
  // - warning: all methods are reentrant, but WriteLock/ReadWriteLock would
  //   deadlock if called after a ReadOnlyLock
  TRWLock = record
    procedure ReadOnlyLock;
    procedure ReadOnlyUnLock;
    procedure ReadWriteLock;
    procedure ReadWriteUnLock;
    procedure WriteLock;
    procedure WriteUnlock;
    procedure Lock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
    procedure UnLock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
  end;
</pre>
<p><code>TLightLock</code> is the simplest lock.<br />
It will acquire a lock, then spin or sleep on contention. But be aware that it is not reentrant: if you call <code>Lock</code> twice in a row from the same thread, the second <code>Lock</code> would wait forever. So you must ensure that your code doesn't call any other method which may also call <code>Lock</code> during its process, otherwise your thread would "deadlock". Such deadlocks are relatively easy to identify: the thread will always block, whatever the conditions. To fix it, don't call other methods which run <code>Lock</code>: for instance, you may define some private/protected <code>LockedDoSomething</code> methods, which don't acquire any lock but expect to be called within a lock.</p>
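<p>The <code>LockedDoSomething</code> convention could look like the sketch below. It assumes that the <em>mORMot 2</em> sources are in the unit search path, and only uses the <code>Lock</code> / <code>UnLock</code> methods listed above; the <code>TAccount</code> class and its methods are hypothetical. The lock is a class field, so it is assumed to start zero-filled, i.e. unlocked.</p>
<pre>
program LockedMethods;

{$ifdef FPC}{$mode objfpc}{$H+}{$else}{$APPTYPE CONSOLE}{$endif}

uses
  {$ifdef UNIX}cthreads,{$endif} // FPC on POSIX needs a thread manager first
  mormot.core.os; // assumption: mORMot 2 is available to the compiler

type
  TAccount = class
  private
    fSafe: TLightLock; // zero-filled with the instance, i.e. unlocked
    fBalance: integer;
    // takes no lock by itself: must only be called while fSafe is owned
    procedure LockedDeposit(aAmount: integer);
  public
    procedure Deposit(aAmount: integer);
    procedure DepositTwice(aAmount: integer);
    function Balance: integer;
  end;

procedure TAccount.LockedDeposit(aAmount: integer);
begin
  inc(fBalance, aAmount);
end;

procedure TAccount.Deposit(aAmount: integer);
begin
  fSafe.Lock;
  LockedDeposit(aAmount);
  fSafe.UnLock;
end;

procedure TAccount.DepositTwice(aAmount: integer);
begin
  fSafe.Lock;
  // calling Deposit() here would call Lock a second time and never return:
  // the lock-free private method is used instead
  LockedDeposit(aAmount);
  LockedDeposit(aAmount);
  fSafe.UnLock;
end;

function TAccount.Balance: integer;
begin
  fSafe.Lock;
  result := fBalance;
  fSafe.UnLock;
end;

var
  account: TAccount;
begin
  account := TAccount.Create;
  account.Deposit(10);
  account.DepositTwice(5);
  writeln('balance: ', account.Balance); // 10 + 5 + 5 = 20
  account.Free;
end.
</pre>
<p>Public methods own the lock, private <code>Locked*</code> methods never touch it: the naming makes a nested <code>Lock</code> easy to spot during code review.</p>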
<p><code>TRWLightLock</code> and <code>TRWLock</code> are <em>multiple Reads / exclusive Write locks</em>.<br />
This is a feature missing in the regular Critical Section. It is very likely that your shared resource will be often read, and seldom modified. Since reads are thread-safe by design, there is no need to prevent other reading threads from reading the resource. Only writing/updating the data should be exclusive and protected from other threads. This is the purpose of <code>ReadLock</code> / <code>ReadOnlyLock</code> and <code>WriteLock</code>.<br />
<code>TRWLock</code> goes one step further, and allows a read lock to be upgraded into a write lock, using <code>ReadWriteLock</code> instead of <code>ReadOnlyLock</code>. <code>ReadWriteLock</code> could be followed by a <code>WriteLock</code>, whereas <code>ReadOnlyLock</code> should always be followed by <code>ReadOnlyUnLock</code>, but never by a <code>WriteLock</code>, which would deadlock.<br />
Last but not least, <code>ReadOnlyLock</code> / <code>ReadOnlyUnLock</code> are re-entrant (you can nest their calls), because they are implemented using a counter. And <code>TRWLock.WriteLock</code> is re-entrant, because it keeps track of the locking thread ID, so it detects nested calls - as a <code>TRtlCriticalSection</code> does.</p>
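<p>Here is a hedged sketch of such an upgrade, for an "add only if missing" method. It assumes that the <em>mORMot 2</em> sources are in the unit search path, and only calls the <code>TRWLock</code> methods listed above; the <code>TNameList</code> class and its methods are hypothetical, and the lock is assumed to start zero-filled as a class field.</p>
<pre>
program UpgradableLock;

{$ifdef FPC}{$mode objfpc}{$H+}{$else}{$APPTYPE CONSOLE}{$endif}

uses
  {$ifdef UNIX}cthreads,{$endif} // FPC on POSIX needs a thread manager first
  mormot.core.os; // assumption: mORMot 2 is available to the compiler

type
  TNameList = class
  private
    fSafe: TRWLock; // zero-filled with the instance, i.e. unlocked
    fNames: array of string;
    // takes no lock by itself: expects any kind of lock to be held
    function LockedContains(const aName: string): boolean;
  public
    function Exists(const aName: string): boolean;
    function AddOnce(const aName: string): boolean;
  end;

function TNameList.LockedContains(const aName: string): boolean;
var
  i: integer;
begin
  result := true;
  for i := 0 to High(fNames) do
    if fNames[i] = aName then
      exit;
  result := false;
end;

function TNameList.Exists(const aName: string): boolean;
begin
  fSafe.ReadOnlyLock; // pure read: never followed by a WriteLock
  result := LockedContains(aName);
  fSafe.ReadOnlyUnLock;
end;

function TNameList.AddOnce(const aName: string): boolean;
begin
  fSafe.ReadWriteLock; // read now, with a possible upgrade later
  result := not LockedContains(aName);
  if result then
  begin
    fSafe.WriteLock; // upgrade: exclusive access for the actual change
    SetLength(fNames, Length(fNames) + 1);
    fNames[High(fNames)] := aName;
    fSafe.WriteUnlock;
  end;
  fSafe.ReadWriteUnLock;
end;

var
  list: TNameList;
begin
  list := TNameList.Create;
  writeln('first add:  ', list.AddOnce('alpha')); // added
  writeln('second add: ', list.AddOnce('alpha')); // already there, not added
  writeln('exists:     ', list.Exists('alpha'));
  list.Free;
end.
</pre>
<p>The search runs before the upgrade, so the exclusive <code>WriteLock</code> is only taken when the list really has to change.</p>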
<h4>Low Level Stuff</h4>
<p>Just for fun, take a look at the source code:</p>
<pre>
procedure TLightLock.LockSpin;
var
  spin: PtrUInt;
begin
  spin := SPIN_COUNT;
  repeat
    spin := DoSpin(spin);
  until LockedExc(Flags, 1, 0);
end;

procedure TLightLock.Lock;
begin
  // we tried a dedicated asm but it was slower: inlining is preferred
  if not LockedExc(Flags, 1, 0) then
    LockSpin;
end;

function TLightLock.TryLock: boolean;
begin
  result := LockedExc(Flags, 1, 0);
end;

procedure TLightLock.UnLock;
begin
  Flags := 0; // non reentrant locks need no additional thread safety
end;
</pre>
<p><code>TLightLock</code> is pretty straightforward, using a simple compare-and-swap (CAS) <code>LockedExc()</code> atomic function, but <code>TRWLightLock</code> and <code>TRWLock</code> are slightly more complex.</p>
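<p>To experiment with the same spin-then-yield idea without <em>mORMot</em>, a rough equivalent could be written with the Free Pascal RTL only. This is a simplified sketch, not the <em>mORMot</em> implementation: <code>TTinyLock</code> and its routines are hypothetical, the spin count is arbitrary, and the release uses an atomic exchange as a conservative choice.</p>
<pre>
program TinySpinLock;

{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

uses
  {$ifdef UNIX}cthreads,{$endif} // FPC on POSIX needs a thread manager first
  Classes;

const
  THREADS = 4;
  LOOPS = 50000;

type
  // hypothetical minimal non-reentrant lock: 0 = free, 1 = owned
  TTinyLock = record
    Flags: longint;
  end;

  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

var
  Safe: TTinyLock; // global variables start zero-filled, i.e. unlocked
  Total: integer;  // plain variable, only modified while Safe is owned

function TinyTryLock(var aLock: TTinyLock): boolean;
begin
  // atomically replace 0 by 1: the returned previous value tells who won
  result := InterlockedCompareExchange(aLock.Flags, 1, 0) = 0;
end;

procedure TinyLock(var aLock: TTinyLock);
var
  spin: integer;
begin
  spin := 1000; // arbitrary number of attempts before yielding
  while not TinyTryLock(aLock) do
    if spin = 0 then
      ThreadSwitch // let the OS run another thread on this core
    else
      dec(spin);
end;

procedure TinyUnLock(var aLock: TTinyLock);
begin
  InterlockedExchange(aLock.Flags, 0); // atomic store back to "free"
end;

procedure TWorker.Execute;
var
  i: integer;
begin
  for i := 1 to LOOPS do
  begin
    TinyLock(Safe);
    inc(Total); // not atomic by itself, protected by the lock
    TinyUnLock(Safe);
  end;
end;

var
  t: array[1..THREADS] of TWorker;
  n: integer;
begin
  for n := 1 to THREADS do
    t[n] := TWorker.Create(false); // false = start running immediately
  for n := 1 to THREADS do
  begin
    t[n].WaitFor;
    t[n].Free;
  end;
  writeln('total: ', Total); // THREADS * LOOPS, i.e. 200000
end.
</pre>
<p>As with <code>TLightLock</code>, calling <code>TinyLock</code> twice from the same thread would never return, so such a lock should only protect a few instructions.</p>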
<p>In the <em>mORMot 2</em> code base, we tried to use the best lock possible. <code>TRtlCriticalSection</code> / <code>TSynLocker</code> are used when the locks are likely to be contended for some time (more than a microsecond), and the other locks, with <em>multiple Reads / exclusive Write</em> methods if possible, are used to protect very small tuned code.<br />
Of course, thread safety is tested during the regression tests, with dozens of concurrent threads trying to break the locks logic. I can tell you that we found some nasty problems in the initial code of our <code>TAsyncServer</code>, but after days of debugging and logging, it seems stable now - but that is a matter for another article! :)</p>
<p><a href="https://synopse.info/forum/viewtopic.php?id=6119">Feedback is welcome in our forum</a>, as usual!</p>