Almost every time I'm debugging some core part of our framework, I like to
see the generated asm, and try to optimize the Pascal code for better speed
- when it is worth it, of course!
I just made a nice observation when comparing the assembler generated by
Delphi to FPC's output.
Imagine you compile the following lines (extracted from SynCommons.pas),
to convert some number into ASCII characters:
c100 := val div 100;
dec(val, c100*100);
PWord(P)^ := TwoDigitLookupW[val];
This divides a number by 100, then computes the modulo, in order to store two
digits at a time. We did not use val := val mod 100, since
mod would do another division, so we rely on a simple
multiplication to compute the modulo.
You may know that on today's CPUs, integer multiplication is very optimized,
taking one cycle (or less, thanks to pipelining), whereas a division is much
more expensive - if you have some spare time, take a look at this
document, and you will find out that a
div opcode can use 10
times more cycles than a
mul - even on the latest CPUs.
Let's see how our two beloved compilers do their homework (with optimization
enabled, of course)...
Delphi generates the following code for
c100 := val div 100:
005082AB 8BC1 mov eax,ecx
005082AD BE64000000 mov esi,$00000064
005082B2 33D2 xor edx,edx
005082B4 F7F6 div esi
Whereas FPC generates the following:
0043AC48 8b55f8 mov -0x8(%ebp),%edx
0043AC4B b81f85eb51 mov $0x51eb851f,%eax
0043AC50 f7e2 mul %edx
0043AC52 c1ea05 shr $0x5,%edx
0043AC55 8955f0 mov %edx,-0x10(%ebp)
Even if you are assembler-agnostic, once you get past the differences in asm
textual representation (Delphi uses Intel syntax, whereas FPC/GDB follows
AT&T), you can see that Delphi generates a classic (and slow) div
opcode, whereas FPC uses a single multiplication, followed by a bit shift.
This optimization is known as "Reciprocal Multiplication", and I will
let you read this
article for mathematical reference - or this one.
It multiplies (
mul) the number by the power-of-two reciprocal of
100 (which is the hexadecimal
51eb851f value, i.e. 2^37 div 100, rounded up): the
mul opcode leaves the high 32 bits of the 64-bit product in
edx, which is then shifted right (
shr) by 5 more bits, for a total shift of 37 bits.
Thanks to the truncation of these integer operations, this does in fact divide
the number by 100.
Even if it consists of two assembler opcodes, a
mul followed by a
shr is in fact faster than a single
div.
It is a shame that the Delphi compiler did not include this very common
optimization, which is clearly a win for some very common tasks.
Of course, the LLVM back-end used by the NextGen compiler can do it, but we
may expect this classic optimization to be part of the decades-old Delphi
compiler.
And I'm still not convinced about the performance of the NextGen generated
code, since the associated RTL is known to be slow, so it won't benefit from
LLVM optimization - which takes a LOT of time to compile, by the way (much
more than FPC).
Congrats, FPC folks!