The root cause is the LOCK prefix

All x86 CPUs are able to lock a specific memory address, preventing any other agent on the system bus (another core or processor) from reading or modifying it while the following instruction runs.

The LOCK prefix on an assembly instruction causes the CPU to assert the LOCK# signal, which in practice ensures exclusive use of the memory address in multi-processor / multi-threaded environments.

The LOCK prefix works only with the following instructions, and only when their destination operand is in memory:

BTS, BTR, BTC   (mem, reg/imm)
XCHG   (reg, mem / mem, reg)
XADD, CMPXCHG   (mem, reg)
CMPXCHG8B   (mem)
ADD, OR, ADC, SBB   (mem, reg/imm)
AND, SUB, XOR   (mem, reg/imm)
NOT, NEG, INC, DEC  (mem)

Note: XCHG with a memory operand is designed to be thread-safe, and always asserts LOCK# regardless of the presence of the LOCK prefix. XADD and the other instructions are atomic only when the prefix is present.

These low-level LOCK mechanisms ensure that a given memory location is modified by only one thread at a time.
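
To make this concrete, here is a minimal sketch of a LOCKed and a non-LOCKed increment, written as 32-bit Delphi inline assembly. LockedInc and PlainInc are hypothetical helpers invented for this illustration, not RTL routines; the code assumes the default register calling convention, where the address of the var parameter arrives in EAX.

```pascal
program LockedIncDemo;
{$APPTYPE CONSOLE}

var
  Counter: Integer = 0;

// Hypothetical helper, not part of the RTL. With the default 32-bit
// "register" calling convention, the address of Value arrives in EAX.
procedure LockedInc(var Value: Integer);
asm
        lock inc dword ptr [eax]   // atomic read-modify-write on [EAX]
end;

// Same operation without the prefix: no bus/cache lock is taken, but two
// threads running it at the same time on the same variable may lose an update.
procedure PlainInc(var Value: Integer);
asm
        inc dword ptr [eax]
end;

begin
  LockedInc(Counter);
  PlainInc(Counter);
  WriteLn(Counter);
end.
```

The only difference between the two routines is the prefix, which is exactly the trade-off discussed in the rest of this article.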

So what is wrong with these LOCKed instructions? 

On a multi-core CPU, all cores just freeze in order to make this LOCKed asm instruction thread-safe. If you have a lot of threads running on more than one core, the context of every core has to be frozen and cleared, all cores wait for the LOCKed asm instruction to complete, then the context is retrieved and execution continues.

So guess what... when the CPU has to execute such instructions, all cores just freeze and your brand new 8-core CPU just runs as a 1-core CPU...

This is the same LOCKed asm instruction which is used internally by Windows for its Critical Sections. That's why Windows itself is said not to be very multi-core friendly, because it uses a lot of critical sections in its internals... Linux is much more advanced, and scales pretty well on massive multi-core architectures.

What about Delphi? 

In Delphi, I discovered at least two performance problems in its RTL:

  1. The default memory manager, i.e. FastMM4, uses a LOCKed asm instruction for every memory allocation or deallocation.
  2. String types and dynamic arrays just use the same LOCKed asm instruction everywhere, i.e. for every access which may lead to a write to the string.
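
As an illustration of the second point, here is a small sketch of ordinary string code, with comments stating where the reference count is touched. The comments describe my understanding of the stock Delphi 7 RTL (where string is AnsiString); other versions may differ, so take them as assumptions rather than a specification.

```pascal
program RefCountDemo;
{$APPTYPE CONSOLE}

var
  A, B: string;
begin
  A := StringOfChar('x', 10); // heap-allocated string: refCnt = 1
  B := A;                     // no copy is made: refCnt is incremented (LOCKed INC)
  B[1] := 'y';                // possible write: UniqueString() makes a private copy
                              // and decrements the shared refCnt (LOCKed DEC)
  B := '';                    // clearing: refCnt is decremented (LOCKed DEC),
                              // and the memory is released when it reaches zero
  WriteLn(A);
end.
```

Nothing in this program looks like multi-thread code, yet each of the commented lines goes through a LOCKed instruction.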

See what I wrote in the Embarcadero forum... this post was not very popular, but indeed I think I've raised a big issue on the Delphi compiler internals and performance here - and I don't think Embarcadero has plans to resolve this... 

IMHO if you use strings in your application and need speed, using a memory manager other than FastMM4 is not enough. You'll have to avoid most string use, and implement a safe TStringBuilder-like class.

ShortStrings could be handy here, even if they are limited to 255 characters.

Using regular PAnsiChar and fixed buffers on the stack is also a solution, but it must be done safely...
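
Here is a minimal sketch of both alternatives, assuming Delphi 7. TStackBuffer and AppendSafe are hypothetical names made up for this example; the point is only that neither the fixed buffer nor the ShortString involves a reference count, and that every append checks the remaining room.

```pascal
program StackBufferDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

type
  TStackBuffer = array[0..255] of AnsiChar;

// Hypothetical helper: append a zero-terminated text to a fixed buffer.
// Returns false, leaving the buffer untouched, if the text does not fit.
function AppendSafe(var Buf: TStackBuffer; var Len: Integer;
  Text: PAnsiChar): Boolean;
var
  L: Integer;
begin
  L := StrLen(Text);
  Result := Len + L < SizeOf(Buf); // keep one byte for the trailing #0
  if not Result then
    Exit;
  Move(Text^, Buf[Len], L);
  Inc(Len, L);
  Buf[Len] := #0;
end;

procedure Demo;
var
  Buf: TStackBuffer;  // lives on the stack: no heap, no reference count
  Len: Integer;
  Txt: ShortString;   // value type, at most 255 characters
begin
  Len := 0;
  Buf[0] := #0;
  if AppendSafe(Buf, Len, 'Hello, ') and AppendSafe(Buf, Len, 'world') then
    WriteLn(PAnsiChar(@Buf));
  Txt := 'abc';
  Txt := Txt + 'def'; // no heap allocation; truncated past 255 characters
  WriteLn(Txt);
end;

begin
  Demo;
end.
```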

Our enhanced RTL

In our enhanced RTL for Delphi 7, we avoid the use of this LOCKed asm instruction if your application has only one thread: so if you use our enhanced RTL and create your threads by yourself (not using the TThread class), you'll have the best multi-thread performance possible.
For example, here is how we coded the clearing of a string:
procedure _LStrClr(var S);
asm     { ->    EAX pointer to str      }
        MOV     EDX,[EAX]                       { fetch str                     }
        TEST    EDX,EDX                         { if nil, nothing to do         }
        JE      @@done
        lea ecx,[edx-skew].StrRec.refCnt
        mov dword ptr [EAX],0  // clear str
        cmp dword ptr [ecx],0
        jl @@done     // refCnt=-1: literal str
{$ifdef AVOIDLOCK}
        cmp IsMultiThread,false
        jnz @@lock
        dec dword ptr [ecx]     // not threadsafe dec refCount, but faster
{$else} lock  dec dword ptr [ecx]     // threadsafe dec refCount
{$endif}jne @@done
@@free: push eax
        mov eax,ecx
        call MemoryManager.FreeMem // inlined _FreeMem() code
        or eax,eax
        jnz _JmpInvalidPtr
        pop eax
{$ifdef AVOIDLOCK}
        ret           // do not fall through into @@lock
@@lock: lock  dec dword ptr [ecx]     // threadsafe dec refCount
        je @@free
{$else}{$ifndef NOAMD}ret{$endif} // AMD trick: avoid branch misprediction
{$endif}
@@done:{$ifndef NOAMD}db $F3{$endif} // rep ret AMD trick: avoid branch misprediction
end;

So if the AVOIDLOCK conditional is defined, and there is only one thread in your application (i.e. IsMultiThread is false), the lock dec [ecx] instruction won't be executed: a much faster (and core-friendly) dec [ecx] instruction is used instead.

Note that there is a similar check already in FastMM4: if IsMultiThread is false, no LOCKed instruction will be used.

The only drawback is that if you want to use threads in your application, you'll have to follow these rules:

  1. TThread is not to be used: creating a single TThread sets IsMultiThread to true, and so enables the LOCKed instructions;
  2. The BeginThread() function must be avoided too (it also sets the flag);
  3. So you'll have to call the CreateThread() Win32 API directly for your threads;
  4. And none of your units should use either TThread or BeginThread!

That's why it would be useful for Embarcadero to take this problem into account, and try to resolve it at the compiler level....
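
For reference, here is a minimal sketch of a thread created with the CreateThread() Win32 API instead of TThread or BeginThread(), assuming the Delphi 7 declaration in the Windows unit. Worker and Counter are hypothetical names for this example. Since IsMultiThread is left untouched, the worker presumably must not share strings, dynamic arrays or heap blocks with other threads: this sketch only touches an integer, through InterlockedIncrement().

```pascal
program RawThreadDemo;
{$APPTYPE CONSOLE}
uses
  Windows;

var
  Counter: Integer = 0;

// Thread entry point, using the stdcall signature expected by CreateThread().
function Worker(Param: Pointer): Integer; stdcall;
begin
  // Assumption: no string, dynamic array or memory manager call in here,
  // because the RTL still believes the process is single-threaded.
  InterlockedIncrement(Counter);
  Result := 0;
end;

var
  ThreadHandle: THandle;
  ThreadID: DWORD;
begin
  ThreadHandle := CreateThread(nil, 0, @Worker, nil, 0, ThreadID);
  if ThreadHandle = 0 then
    Halt(1);
  WaitForSingleObject(ThreadHandle, INFINITE);
  CloseHandle(ThreadHandle);
  // CreateThread() does not go through the RTL, so the flag is not set by it
  WriteLn('IsMultiThread = ', IsMultiThread);
end.
```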
