The root cause is the LOCK prefix
All x86 CPUs are equipped with the ability to lock a specific memory address, preventing other processors or cores from reading or modifying it while the following instruction runs.
The LOCK prefix on an assembly instruction causes the CPU to assert the LOCK# signal (or, on more recent CPUs, to lock the corresponding cache line), and in practice ensures exclusive use of the memory address in multiprocessor / multi-threaded environments.
The LOCK prefix works only with the following instructions, and only when the destination operand is in memory:
- BTS, BTR, BTC (mem, reg/imm)
- XCHG (mem, reg / reg, mem)
- XADD, CMPXCHG (mem, reg)
- ADD, OR, ADC, SBB, AND, SUB, XOR (mem, reg/imm)
- NOT, NEG, INC, DEC (mem)
Note: XCHG with a memory operand is designed to be thread-safe, and always asserts LOCK# regardless of the presence of the LOCK prefix. XADD and the other instructions listed above are atomic only when the LOCK prefix is present.
These low-level LOCK mechanisms ensure that some memory is modified by only one thread at a time.
So what is wrong with these LOCKed instructions?
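On a multi-core CPU, a LOCKed instruction is expensive. The core has to get exclusive ownership of the memory (by locking the bus on older CPUs, or the cache line on more recent ones) and to flush its pending memory writes, and any other core touching the same memory has to wait until the LOCKed asm instruction completes. If you have a lot of threads running on more than one core and hammering the same locations (a memory manager lock, a string reference count), they spend most of their time waiting for each other.
So guess what... when the cores keep executing such instructions on shared data, they mostly wait for each other, and your brand new 8-core CPU can end up running about as fast as a 1-core CPU...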
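To get a feeling for the raw cost of the prefix, here is a minimal sketch which times a plain inc against a lock inc on one thread. It assumes the 32-bit Delphi compiler (inline asm, var parameter passed in EAX); the PlainInc / LockedInc names and the LOOPS value are mine, just for illustration. The figures depend a lot on the CPU, so no expected timing is given.

```pascal
program LockCost;
{$APPTYPE CONSOLE}
// Sketch for the 32-bit compiler only: relies on the register calling
// convention, where the address of Value is passed in EAX.
uses
  Windows; // Winapi.Windows on recent Delphi versions

const
  LOOPS = 100000000; // hypothetical iteration count

procedure PlainInc(var Value: Integer);
asm
  inc dword ptr [eax]       // not threadsafe
end;

procedure LockedInc(var Value: Integer);
asm
  lock inc dword ptr [eax]  // threadsafe, but slower
end;

var
  Counter, i: Integer;
  Start: Cardinal;
begin
  Counter := 0;
  Start := GetTickCount;
  for i := 1 to LOOPS do
    PlainInc(Counter);
  Writeln('inc      : ', GetTickCount - Start, ' ms');
  Start := GetTickCount;
  for i := 1 to LOOPS do
    LockedInc(Counter);
  Writeln('lock inc : ', GetTickCount - Start, ' ms');
  Writeln('counter  : ', Counter); // both loops incremented the same variable
end.
```

This only measures the uncontended case. When several cores run the LOCKed version on the same address, the slowdown is expected to be worse, since each core has to wait for the others.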
What about Delphi?
In Delphi, I discovered at least two performance problems in its RTL:
- The default memory manager, i.e. FastMM4, uses a LOCKed asm instruction for every memory allocation or deallocation.
- string types and dynamic arrays use the same kind of LOCKed asm instruction everywhere, i.e. for every access which may lead to a write to the string.
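The reference counting behind the second point can be observed from plain Delphi code. The sketch below reads the counter stored just before the string content; RefCountOf is a hypothetical helper of mine, and the -8 offset is an assumption matching the StrRec layout of the Delphi versions I know, so check it against your own System.pas.

```pascal
program RefCountDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: returns the reference count of a string.
// Assumption: refCnt is stored 8 bytes before the first character.
function RefCountOf(const S: string): Integer;
begin
  if Pointer(S) = nil then
    Result := 0
  else
    Result := PInteger(PAnsiChar(Pointer(S)) - 8)^;
end;

procedure Demo;
var
  A, B: string;
begin
  SetLength(A, 10);        // heap-allocated string owned by A only
  Writeln(RefCountOf(A));  // 1
  B := A;                  // no copy: the RTL increments refCnt
  Writeln(RefCountOf(A));  // 2
  B[1] := 'y';             // write access: B gets its own copy first
  Writeln(RefCountOf(A));  // 1
  Writeln(RefCountOf(B));  // 1
end;

begin
  Demo;
end.
```

Each of these increments and decrements is a LOCKed instruction in the standard RTL, even if the string never leaves its thread.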
See what I wrote in the Embarcadero forum... this post was not very popular, but indeed I think I've raised a big issue on the Delphi compiler internals and performance here - and I don't think Embarcadero has plans to resolve this...
IMHO if you use strings in your application and need speed, using a memory manager other than FastMM4 is not enough. You'll have to avoid most string use, and implement a safe TStringBuilder-like class.
ShortStrings could be handy here, even if they are limited to 255 characters.
Using regular PAnsiChar, and fixed buffers in the stack is also a solution, but it must be safe...
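As an illustration of what "safe" could mean here, the following sketch appends text to a fixed buffer living on the stack, and silently truncates instead of overflowing. TStackText and AppendText are hypothetical names, not part of any RTL, and the 256 bytes size is arbitrary.

```pascal
program StackBuffer;
{$APPTYPE CONSOLE}

type
  // Hypothetical fixed-size text buffer: no heap, no reference count
  TStackText = record
    Len: Integer;
    Buf: array[0..255] of AnsiChar;
  end;

// Appends the #0 terminated text P to T, truncating if T is full
procedure AppendText(var T: TStackText; P: PAnsiChar);
begin
  if P = nil then
    Exit;
  while (P^ <> #0) and (T.Len < High(T.Buf)) do // keep room for the final #0
  begin
    T.Buf[T.Len] := P^;
    Inc(T.Len);
    Inc(P);
  end;
  T.Buf[T.Len] := #0;
end;

procedure Demo;
var
  T: TStackText;
begin
  T.Len := 0;
  T.Buf[0] := #0;
  AppendText(T, 'Hello, ');
  AppendText(T, 'world');
  Writeln(PAnsiChar(@T.Buf)); // Hello, world
  Writeln(T.Len);             // 12
end;

begin
  Demo;
end.
```

Nothing here calls the memory manager or touches a reference count, so no LOCKed instruction is involved. The price is the fixed maximum length.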
Our enhanced RTL
asm { -> EAX pointer to str }
        MOV     EDX,[EAX]                { fetch str }
        TEST    EDX,EDX                  { if nil, nothing to do }
        JE      @@done
        lea     ecx,[edx-skew].StrRec.refCnt
        mov     dword ptr [EAX],0        // clear str
        cmp     [ecx],0
        jl      @@done                   // refCnt=-1: literal str
{$ifdef AVOIDLOCK}
        cmp     IsMultiThread,false
        jnz     @@lock
        dec     [ecx]                    // not threadsafe dec refCount, but faster
{$else}
   lock dec     [ecx]                    // threadsafe dec refCount
{$endif}
        jne     @@done
@@free: push    eax
        mov     eax,ecx
        call    MemoryManager.FreeMem    // inlined _FreeMem() code
        or      eax,eax
        jnz     _JmpInvalidPtr
        pop     eax
{$ifdef AVOIDLOCK}
        ret
@@lock: lock dec [ecx]                   // threadsafe dec refCount
        je      @@free
{$else}
        {$ifndef NOAMD}ret{$endif}       // AMD trick: avoid branch misprediction
{$endif}
@@done: {$ifndef NOAMD}db $F3{$endif}    // rep ret AMD trick: avoid branch misprediction
end;
So if the AVOIDLOCK conditional is defined, and there is only one thread in your application (i.e. IsMultiThread is false), the lock dec [ecx] instruction won't be executed, and a much faster (and core-friendly) dec [ecx] instruction is used instead.
Note that there is a similar check already in FastMM4: if IsMultiThread is false, no LOCKed instruction will be used.
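The same decision can be written in plain pascal. This is only a sketch of the idea, not the code of our enhanced RTL or of FastMM4: DecRefCount is a hypothetical helper, and it relies on InterlockedDecrement from the Windows unit for the LOCKed path.

```pascal
program CondDec;
{$APPTYPE CONSOLE}
uses
  Windows; // Winapi.Windows on recent Delphi versions

// Hypothetical helper: decrements RefCnt and returns its new value,
// using the LOCKed version only if some thread may exist.
function DecRefCount(var RefCnt: Integer): Integer;
begin
  if IsMultiThread then
    Result := InterlockedDecrement(RefCnt) // threadsafe, LOCKed
  else
  begin
    Dec(RefCnt);                           // not threadsafe, but faster
    Result := RefCnt;
  end;
end;

var
  Count: Integer;
begin
  Count := 2;
  Writeln(DecRefCount(Count)); // 1
  Writeln(DecRefCount(Count)); // 0: the caller would release the memory here
end.
```

The test on IsMultiThread costs one compare and one (well predicted) branch, which should be cheap compared to the LOCKed instruction it avoids.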
The only drawback is that if you want to use threads in your application, you'll have to follow these rules:
- TThread is not to be used: the creation of one TThread sets IsMultiThread to true, and so enables the LOCKed instructions;
- The BeginThread() function must be avoided too (it also sets the flag);
- So you'll have to call the CreateThread() Win32 API directly for your threads;
- And none of your units should use either TThread or BeginThread!
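A minimal sketch of such a thread, calling CreateThread() directly in a console executable. Worker is a hypothetical thread function of mine. Since IsMultiThread stays false, neither the memory manager nor the reference counting is protected any more, so I assume here that the thread works only on its stack and on one integer given by the caller: no string, no dynamic array, no memory allocation.

```pascal
program RawThread;
{$APPTYPE CONSOLE}
uses
  Windows; // Winapi.Windows on recent Delphi versions

// Thread entry point, with the signature expected by the Win32 API.
// Param points to an Integer owned by the caller.
function Worker(Param: Pointer): DWORD; stdcall;
var
  i, Sum: Integer;
begin
  Sum := 0;
  for i := 1 to 1000 do
    Inc(Sum, i);
  PInteger(Param)^ := Sum;
  Result := 0;
end;

var
  Total: Integer;
  ThreadHandle: THandle;
  ThreadID: DWORD;
begin
  Total := 0;
  ThreadHandle := CreateThread(nil, 0, @Worker, @Total, 0, ThreadID);
  if ThreadHandle = 0 then
    Halt(1);
  WaitForSingleObject(ThreadHandle, INFINITE); // wait for the thread to finish
  CloseHandle(ThreadHandle);
  Writeln('sum = ', Total);                    // sum = 500500
  Writeln('IsMultiThread = ', IsMultiThread);  // check that the flag was not set
end.
```

If the threads need to exchange strings or allocate memory at the same time, this trick is not safe any more, and the LOCKed version should be used.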
You may post comments or react on our forum