2022-01-22

Three Locks To Rule Them All

To ensure thread-safety, especially on server side, we usually protect code with critical sections, or locks. In recent Delphi revisions, we have the TMonitor feature, but I would rather trust the OS for locks, which are implemented using Windows Critical Sections, or POSIX futex/mutex.

But all locks are not born equal. Most of the time, the overhead of a Critical Section WinAPI or the pthread library is not needed.
So, in mORMot 2, we introduced several native locks in addition to those OS locks, with multi-read/single-write abilities, or re-entrancy.

Thread Safety - The Hard Way

For a regular RAD/Client application, a single thread is usually enough. Using messages, and/or a TTimer allow some simple cooperative multi-tasking in the application, good enough for most use.

But on server side, scalability requires the business code to be thread-safe. Thread safety is hard, harder than parallel computing from my experiments.

Note that multi-thread programing is not easy, sometimes very difficult to debug, because the problems are hard to reproduce - it is easy to get an HeisenBug.
So ensure you first read some general features about thread safety, and modern CPU memory and operation execution. I just found out these series of blog articles, which details some caveats which may appear in border cases... which may occur to you as they do for me!

Saved By The Lock

To ensure thread safety, the most convenient feature we have is the lock, which protects some code section to be executed from several threads.

To be more accurate, we don't protect code, we protect resources. The code itself is thread-safe. But the data requires attention, when several threads access it. If we only read the data, it is fine. But once the data is changed by one thread, then other threads are likely to break - imagine that you add an item to a list, then the list storage is reallocated in memory, then you get some random GPF due to invalid pointers. Or two threads add items at the same time - then the counter or the storage may become pretty wrong. We need to lock the data access to prevent such issues.

Here is how the POSIX libpthread library offers this lock - similar to a Windows Critical Section:

All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesireable memory reordering across the boundaries. So you write your thread-unsafe code as the ham in your sandwich, and you will ensure that only a single thread will execute it at once.

Locks Are Not Expensive, Contention Is

The main rule about using locks it that they should be as small as possible.
Why?

Acquiring an unlocked mutex, or releasing a mutex is almost free, it is usually a single atomic assembly instruction. Atomic instructions have the lock prefix on Intel/AMD, or are explicitly specified as such, e.g. the cmpxchg operation. On ARM, you usually need to write a small loop, or at least several instructions.
In mormot.core.base.pas we provide some cross-platform and cross-compiler functions for atomic process, written in tuned assembly or calling the RTL:

procedure LockedInc32(int32: PInteger);
procedure LockedDec32(int32: PInteger);
procedure LockedInc64(int64: PInt64);
function InterlockedIncrement(var I: integer): integer;
function InterlockedDecrement(var I: integer): integer;
function RefCntDecFree(var refcnt: TRefCnt): boolean;
function LockedExc(var Target: PtrUInt; NewValue, Comperand: PtrUInt): boolean;
procedure LockedAdd(var Target: PtrUInt; Increment: PtrUInt);
procedure LockedAdd32(var Target: cardinal; Increment: cardinal);
procedure LockedDec(var Target: PtrUInt; Decrement: PtrUInt);

But if two (or more) threads fight against acquiring a lock, then only one would get it. So the other threads will have to wait. Waiting is usually done by first spinning (i.e. running a void loop), and trying to acquire the lock. Eventually, an OS kernel call could take place, to leverage the CPU core, and try to execute some pending code from another thread.

This lock contention, spinning or switching to another thread, is what really degrades the whole process performance. You are really wasting time and energy just for accessing a shared resource.

Therefore, in practice, I would advice to follow some simple rules.

Make it work, then make it fast

You may first use a giant Critical Section for a whole method. Most of the time, it would be fine.

Don't guess, run actual benchmarking on multi-core CPU (not a single core VM!), trying to reproduce the worse case possible which may happen.
Have detailed, and thread-aware logs, to properly debug production code - the Heisenbugs are likely to appear not on your development PC, but with real world load.

Once you have identified a real bottleneck, try to split the logic code into small pieces:

  • Ensure you have a multi-thread regression testing code for this method, to validate your modifications are actually still correct and ... faster;
  • Some part of the code may be thread-safe by itself (e.g. the error checking or result logging): no need to protect it with the lock;
  • Isolate the processing code into some private/protected methods, depending on the resources shared, with proper locking.
The Less The Better

Eventually, to achieve the best performance:

  • Keep your locks as short as possible.
  • Prefer more locks on small data than some giant locks;
  • Use a lock per list or queue, not per process or business logic method;
  • Make a private copy of the data (e.g. on a local stack variable) within the lock, then process it outside the lock;
  • Avoid calling other methods within a lock: focus on the shared data, and be sure that the functions you call may not be thread-safe;
  • Try to avoid memory allocations.
Pickup The Right Lock

Generally speaking, the regular TRTLCriticalSection is fine, and should be preferred.
Our mormot.core.os.pas unit leverage this into a cross-platform way, among FPC/Delphi compilers and operating systems. It tries to call directly the OS, with proper inlining if possible.

But if you follow "The Less The Better" rule above, your code may be something very small like this:

procedure TAsyncConnections.AddGC(aConnection: TPollAsyncConnection);
begin
  if Terminated then
    exit;
  (aConnection as TAsyncConnection).fLastOperation := fLastOperationMS; // in ms
  fGCSafe.Lock;
  ObjArrayAddCount(fGC, aConnection, fGCCount);
  fGCSafe.UnLock;
end;

Here you can see that the lock is very small, and setting fLastOperation has been done outside of the lock, since this operation is thread-safe by design: this connection will be free once, whereas fGC/fGCCount list may be accessed from several threads. Also note that ObjArrayAddCount() is a well defined function which should not have its behavior changed, nor raise any exception, so it is safe to be used... and we even didn't put any try...finalll fGCSafe.UnLock; statement here, because a try..finally has a cost on some platforms (e.g. FPC Linux generates several RTL calls even if no exception is raised).

Or course, we could use our TSynLock for fGCSafe - which encapsulate a TRTLCriticalSection in an object-oriented manner.
But since here we know that the lock will be very small, no need to have the whole overhead of a Critical Section or a mutex/futex, which always has a cost at least in resources.

Several Locks To Rule Them All

In addition to the TSynLock wrapper, mormot.core.os.pas defines several kind of locks:

  // a lightweight exclusive non-rentrant lock, stored in a PtrUInt value
  // - calls SwitchToThread after some spinning, but don't use any R/W OS API
  // - warning: methods are non rentrant, i.e. calling Lock twice in a raw would
  // deadlock: use TRWLock or TSynLocker/TRTLCriticalSection for reentrant methods
  // - light locks are expected to be kept a very small amount of time: use
  // TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several lightlocks, each protecting a few variables (e.g. a list), may
  // be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TLightLock = record
    procedure Lock;
    function TryLock: boolean;
    procedure UnLock;
  end;

  // a lightweight multiple Reads / exclusive Write non-upgradable lock
  // - calls SwitchToThread after some spinning, but don't use any R/W OS API
  // - warning: ReadLocks are reentrant and allow concurrent acccess, but calling
  // WriteLock within a ReadLock, or within another WriteLock, would deadlock
  // - consider TRWLock is you need an upgradable lock
  // - light locks are expected to be kept a very small amount of time: use
  // TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several lightlocks, each protecting a few variables (e.g. a list), may
  // be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TRWLightLock = record
    procedure ReadLock;
    function TryReadLock: boolean;
    procedure ReadUnLock;
    procedure WriteLock;
    function TryWriteLock: boolean;
    procedure WriteUnLock;
  end;

type
  TRWLockContext = (
    cReadOnly,     cReadWrite,     cWrite);

  // a lightweight multiple Reads / exclusive Write reentrant lock
  // - calls SwitchToThread after some spinning, but don't use any R/W OS API
  // - locks are expected to be kept a very small amount of time: use TSynLocker
  // or TRTLCriticalSection if the lock may block too long
  // - warning: all methods are reentrant, but WriteLock/ReadWriteLock would
  // deadlock if called after a ReadOnlyLock
  TRWLock = record
    procedure ReadOnlyLock;
    procedure ReadOnlyUnLock;
    procedure ReadWriteLock;
    procedure ReadWriteUnLock;
    procedure WriteLock;
    procedure WriteUnlock;
    procedure Lock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
    procedure UnLock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
  end;

TLightLock is the simplest lock.
It will acquire a lock, then spin or sleep on contention. But be aware that it is not reentrant: if you call Lock twice in a row from the same thread, the second Lock would wait forever. So you must ensure that your code doesn't call any other method which may also call Lock during its process, otherwise your thread would "deadlock". Such race conditions are relatively easy to identify: it will always block and deadlock, whatever condition there is. To fix it, don't call other method which run Lock: for instance, you may define some private/protected LockedDoSomething methods, which won't have any lock but expect to be called within a lock.

TRWLightLock and TRWLock are multiple Reads / exclusive Write locks.
This is a feature missing in the regular Critical Section. It is very likely that your shared resource will be often read, and seldom modified. Since reads are thread-safe by design, there is no need to prevent other reading threads to read the resource. Only writing/updating the data should be exclusive and protected from other threads. This is the purpose of ReadLock / ReadOnlyLock and WriteLock.
TRWLock goes one step further, and allow a read lock to be upgraded into a write lock, using ReadWriteLock instead of ReadOnlyLock. ReadWriteLock could be followed by a WriteLock, whereas ReadOnlyLock should always be followed by ReadOnlyUnlock, but never by a WriteLock which would deadblock.
Last but not least, ReadOnlyLock / ReadOnlyUnLock are re-entrant (you can call them nested), because they are implemented using a counter. And TRWLock.WriteLock is re-entrant, because it takes track of the locked thread ID, so detects nested calls - as a TRtlCriticalSection does.

Low Level Stuff

Just for fun, take a look at the source code:

procedure TLightLock.LockSpin;
var
  spin: PtrUInt;
begin
  spin := SPIN_COUNT;
  repeat
    spin := DoSpin(spin);
  until LockedExc(Flags, 1, 0);
end;

procedure TLightLock.Lock;
begin
  // we tried a dedicated asm but it was slower: inlining is preferred
  if not LockedExc(Flags, 1, 0) then
    LockSpin;
end;

function TLightLock.TryLock: boolean;
begin
  result := LockedExc(Flags, 1, 0);
end;

procedure TLightLock.UnLock;
begin
  Flags := 0; // non reentrant locks need no additional thread safety
end;

TLightLock is pretty straightforward, using a simple CAS compare & exchange LockedExc() atomic function, but TRWLightLock and TRWLock are slightly more complex.

In mORMot 2 code base, we tried to use the best lock possible. TRtlCriticalSection / TSynLock when the locks are likely to have a contention for some time (more than a micro second), and other locks, with multiple Reads / exclusive Write methods if possible, are used to protect very small tuned code.
Of course, thread safety is tested during the regression tests, with dozen of concurrent threads trying to break the locks logic. I can tell you that we found some nasty problems in the initial code of our TAsyncServer, but after days debugging and logging, it sounds stable now - but it is the matter for another article! :)

Feedback is welcome in our forum, as usual!

2021-12-19

mORMot 2 Generics and Collections

Generics are a clever way of writing some code once, then reuse it for several types.
They are like templates, or compiler-time shortcuts for type definitions.

In the last weeks, we added a new mormot.core.collections.pas unit, which features:

  • JSON-aware IList<> List Storage;
  • JSON-aware IKeyValue<> Dictionary Storage.

In respect to Delphi or FPC RTL generics.collections, this unit uses interfaces as variable holders, and leverage them to reduce the generated code as much as possible, as the Spring4D 2.0 framework does, but for both Delphi and FPC. It publishes TDynArray and TSynDictionary high-level features like indexing, sorting, JSON/binary serialization or thread safety as Generics strong typing.

Resulting performance is great, especially for its enumerators, and your resulting executable size won't blow up as with the regular RTL unit.

Continue reading

2021-11-16

EKON 25 Slides

EKON 25 at Düsseldorf was a great conference (konference?).

At last, a physical gathering of Delphi developers, mostly from Germany, but also from Europe - and even some from USA! No more virtual meetings, which may trigger the well known 'Abstract Error' on modern pascal coders.
There were some happy FPC users too - as I am now. :)

I have published the slides of my conferences, mostly about mORMot 2.
By the way, I wish we would be able to release officially mORMot 2 in December, before Christmas. I think it starts to be stabilized and already known to be used on production. We expect no more breaking change in the next weeks.

Continue reading

2021-09-21

Delphi 10.4 / Delphi 11 Alexandria Breaking Changes

The latest revision of Delphi, named Delphi 11 Alexandria, is out.
A lot of new features, some enhanced platforms. Nice!
But it is also for us the opportunity to come back to some breaking changes, which appeared in Delphi 10.4 earlier this year, and are now "officially" part of Delphi 11.

The main breaking change of Delphi 10.4 and later, as reported by mORMot users, is the new lifetime of local variables.
TL&LR: a local variable which is not explicitly declared, but returned by a function may be released as soon as it is not used any more, whereas in the original implementation, it was allocated as a regular local variable, and we could expect its lifetime to remain active up to the end of the function. With Delphi 10.4, it is not the case any more: the compiler could release/clear the local variable sooner, to reduce the allocation pressure.

Idea behind this change is that it may have better register allocation within the function, so it "may" theoretically result in faster code. Not convinced about it, anyway - we will discuss that.
The main thing is that it could break existing code, because it breaks the Delphi compiler expectation since decades.
Some perfectly fine working code would end to work as expected. We have identified several use cases with mORMot which are affected by this change. Since it seems there will be no coming back from Delphi point of view, it is worth a blog article. ;)

Continue reading

2021-08-17

mORMot 2 on Ampere AARM64 CPU

Last weeks, we have enhanced mORMot support to one of the more powerful AARM64 CPU available: the Ampere Altra CPU, as made available on the Oracle Cloud Infrastructure.

Long story short, this is an amazing hardware to run on server side, with performance close to what Intel/AMD offers, but with almost linear multi-core scalability. The FPC compiler is able to run good code on it, and our mORMot 2 library is able to use the hardware accelerated opcodes for AES, SHA2, and crc32/crc32c.

Continue reading

2021-07-08

Job Offer: FPC mORMot 2 and WAPT

Good news!
The French company I work for, Tranquil IT, is hiring FPC / Lazarus / mORMot developers. Remote work possible.

I share below the Job Offer from my boss Vincent.
We look forward working with you on this great mORMot-powered project!

https://www.tranquil.it/en/who-are-we/join-us

Continue reading

2021-06-26

Embed Small and Optimized Debug Information for FPC

Debug information can be generated by compilers, to contain symbols and source code lines. This is very handy to have a meaningful stack trace on any problems like exceptions, at runtime.

The problem is that debug information can be huge. New code style with generics tends to increase this size into a bloated way...
On Delphi, mormot2tests generates a 4MB .map file;
on FPC, mormot2tests outputs a 20MB .dbg file in DWARF.

For Delphi, we propose our own binary .mab format which reduces this 4MB .map file into a 290KB .mab file since mORMot 1.
Now mORMot 2 can reduce a FPC debug file of 20MB into a 322KB .mab file!
And this .mab information can just be appended to the executable for single-file distribution, if needed, during the build. No need to distribute two files, potentially with synchronization issues.

Continue reading

2021-05-14

Enhanced HTTP/HTTPS Support in mORMot 2

HTTP(S) is the main protocol of the Internet.
We enhanced the mORMot 2 socket client to push its implementation into more use cases. The main new feature is perhaps WGET-like processing, with hashing, resuming, console feedback, and direct file download.

Continue reading

- page 1 of 48