Here are some numbers, taken from the
TTestLowLevelCommon.CustomRTL regression tests.
These are not absolute timings: there is always about 10% of variation between
runs of our tests, but we wanted a high-level estimate of the performance.
Numbers are to be compared within a single run, with a similar execution context.
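To give a rough idea of how a "time + throughput" pair like the ones below can be produced, here is a minimal FPC-only sketch. It is not the TTestLowLevelCommon.CustomRTL code: the program name, the GBPerSec helper, the buffer size, the iteration count and the use of GetTickCount64 as timer are all assumptions made for illustration, and so is the choice of 1 GB = 1024*1024*1024 bytes.

```pascal
program MoveTiming;

{$mode objfpc}

uses
  SysUtils;

const
  BUF_SIZE = 8 * 1024 * 1024; // about 8MB, as in the "big Move" case
  ITERATIONS = 200;           // arbitrary count, chosen for this sketch

// hypothetical helper - assumption: 1 GB = 1024*1024*1024 bytes
function GBPerSec(totalBytes, elapsedMs: Double): Double;
begin
  Result := totalBytes / (elapsedMs / 1000) / (1024.0 * 1024.0 * 1024.0);
end;

var
  src, dst: array of Byte;
  start, stop: QWord;
  elapsedMs, totalBytes: Double;
  i: Integer;
begin
  SetLength(src, BUF_SIZE);
  SetLength(dst, BUF_SIZE);
  FillChar(src[0], BUF_SIZE, 1); // touch the memory before timing
  FillChar(dst[0], BUF_SIZE, 0);
  start := GetTickCount64;
  for i := 1 to ITERATIONS do
    Move(src[0], dst[0], BUF_SIZE); // non-overlapping big move
  stop := GetTickCount64;
  elapsedMs := stop - start;
  if elapsedMs < 1 then
    elapsedMs := 1; // avoid a division by zero with a coarse timer
  totalBytes := BUF_SIZE;
  totalBytes := totalBytes * ITERATIONS;
  WriteLn('big Move in ', elapsedMs:0:0, 'ms, ',
    GBPerSec(totalBytes, elapsedMs):0:1, ' GB/s');
end.
```

A millisecond timer is coarse, so this only makes sense with enough iterations. The printed values will differ on every machine and every run, which is why only numbers from the same run should be compared.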
In the text below:
FillChar/FillCharFast fills a buffer with an increasing number of bytes.
Move/MoveFast moves some overlapped data with an increasing number of bytes, in a forward way.
small Move/MoveFast moves 1..48 bytes with overlap.
big Move/MoveFast moves around 8MB of data with or without overlap.
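To make the "with overlap" wording concrete, here is a minimal sketch using only the RTL Move. The OverlappedMove helper, the buffer and the offsets are made up for illustration and are not taken from the regression tests.

```pascal
program OverlapDemo;

{$ifdef FPC}{$mode objfpc}{$endif}
{$ifndef FPC}{$APPTYPE CONSOLE}{$endif}

// hypothetical helper, only for illustration
function OverlappedMove: AnsiString;
var
  buf: array[0..15] of AnsiChar;
  i: Integer;
begin
  // fill the buffer with 'A'..'P'
  for i := 0 to High(buf) do
    buf[i] := AnsiChar(Ord('A') + i);
  // source (offsets 0..11) and destination (offsets 4..15) overlap
  Move(buf[0], buf[4], 12);
  // return the buffer content as a string
  Result := '';
  for i := 0 to High(buf) do
    Result := Result + buf[i];
end;

begin
  WriteLn(OverlappedMove); // ABCDABCDEFGHIJKL as the overlap is handled
end.
```

A naive byte-by-byte forward copy would overwrite the source bytes before reading them here, so any replacement of Move has to detect this case and pick the right copy direction.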
First of all, Delphi RTL's vs SynCommons' versions on Win64 (big Move with overlap):
On Delphi XE4 Win64 (VM):
  FillChar in 34.42ms, 11.2 GB/s
  FillCharFast in 15.03ms, 25.8 GB/s
  Move in 3.76ms, 4.1 GB/s
  MoveFast in 2.16ms, 7.2 GB/s
  small Move in 7.51ms, 2.9 GB/s
  small MoveFast in 6.77ms, 3.2 GB/s
  big Move in 67.06ms, 2.3 GB/s
  big MoveFast in 41ms, 3.8 GB/s
On Delphi 10.3 Win64 (VM):
  FillChar in 28.82ms, 13.4 GB/s
  FillCharFast in 14.89ms, 26 GB/s
  Move in 3.68ms, 4.2 GB/s
  MoveFast in 2.13ms, 7.3 GB/s
  small Move in 7.34ms, 2.9 GB/s
  small MoveFast in 6.73ms, 3.2 GB/s
  big Move in 50.90ms, 3 GB/s
  big MoveFast in 40.74ms, 3.8 GB/s
As you can see, Delphi 10.3 is slightly better than Delphi XE4, most probably
thanks to some refactoring and to the introduction of ERMS (Enhanced REP
MOVSB/STOSB) support - which is available on my CPU, so it gives decent
results for "big Move".
But our versions are always faster, especially for
When comparing FPC RTL's vs SynCommons' versions on Linux x86_64 (big moves with or without overlap):
  FillChar in 30.33ms, 12.8 GB/s
  FillCharFast in 14.16ms, 27.4 GB/s
  FillCharFast [cpuAVX] in 11.98ms, 32.4 GB/s
  Move in 1.92ms, 8.1 GB/s
  MoveFast in 2.19ms, 7.1 GB/s
  MoveFast [cpuAVX] in 1.60ms, 9.7 GB/s
  small Move in 8.84ms, 2.4 GB/s
  small MoveFast in 6.66ms, 3.2 GB/s
  small MoveFast [cpuAVX] in 6.63ms, 3.3 GB/s
overlapping:
  big Move in 39.94ms, 3.9 GB/s
  big MoveFast in 38.13ms, 4 GB/s
  big MoveFast [cpuAVX] in 36.86ms, 4.2 GB/s
non overlapping:
  big Move in 54.55ms, 7.1 GB/s
  big MoveFast in 40.45ms, 9.6 GB/s
  big MoveFast [cpuAVX] in 39.57ms, 9.8 GB/s
The previous Delphi numbers were taken from a VM, and on Win64 not Linux, so
with different high-performance counters, and with overlap, so they shouldn't
be used directly for a "FPC vs Delphi" comparison.
What matters here are the relative numbers of FillChar/Move against FillCharFast/MoveFast within a single execution context.
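As a small worked example of such a relative comparison, the sketch below divides RTL timings by the matching "Fast" timings from the FPC Linux x86_64 run above. The Speedup helper is a hypothetical name introduced only for this illustration.

```pascal
program RelativeSpeed;

{$ifdef FPC}{$mode objfpc}{$endif}
{$ifndef FPC}{$APPTYPE CONSOLE}{$endif}

// hypothetical helper: how many times faster the second timing is
function Speedup(rtlMs, fastMs: Double): Double;
begin
  Result := rtlMs / fastMs;
end;

begin
  // timings copied from the FPC Linux x86_64 run above
  WriteLn('FillCharFast          : ', Speedup(30.33, 14.16):0:2);
  WriteLn('FillCharFast [cpuAVX] : ', Speedup(30.33, 11.98):0:2);
  WriteLn('MoveFast              : ', Speedup(1.92, 2.19):0:2);
  WriteLn('MoveFast [cpuAVX]     : ', Speedup(1.92, 1.60):0:2);
  WriteLn('big MoveFast (no overlap): ', Speedup(54.55, 40.45):0:2);
end.
```

A ratio above 1 means the SynCommons version was faster in that run, for instance about 2.1 for FillCharFast. A ratio below 1 means the RTL version was faster, as for the SSE2 MoveFast line here (1.92ms against 2.19ms). Given the 10% variation mentioned earlier, ratios close to 1 should not be over-interpreted.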
For big moves, the 256-bit YMM AVX version gives a slight advantage over the
128-bit XMM SSE2 code, which was already saturating the bandwidth.
Small blocks don't involve AVX by design.
There seems to be a real AVX advantage only for overlapping increasing moves, most probably due to the lower number of cycles involved, and with no memory prefetch involved - a pattern which may occur in some applications.
For small backward/forward moves (on FPC Linux), when you look into the details, the performance difference seems pretty interesting:
  1b Move in 89us, 214.3 MB/s     1b MoveFast in 77us, 247.7 MB/s
  2b Move in 95us, 401.5 MB/s     2b MoveFast in 72us, 529.8 MB/s
  3b Move in 107us, 534.7 MB/s    3b MoveFast in 76us, 752.9 MB/s
  4b Move in 129us, 591.4 MB/s    4b MoveFast in 92us, 829.2 MB/s
  5b Move in 163us, 585 MB/s      5b MoveFast in 78us, 1.1 GB/s
  6b Move in 143us, 800.2 MB/s    6b MoveFast in 77us, 1.4 GB/s
  7b Move in 143us, 0.9 GB/s      7b MoveFast in 146us, 914.4 MB/s
  8b Move in 159us, 0.9 GB/s      8b MoveFast in 73us, 2 GB/s
  9b Move in 166us, 1 GB/s        9b MoveFast in 144us, 1.1 GB/s
  10b Move in 170us, 1 GB/s       10b MoveFast in 144us, 1.2 GB/s
  11b Move in 181us, 1.1 GB/s     11b MoveFast in 144us, 1.4 GB/s
  12b Move in 153us, 1.4 GB/s     12b MoveFast in 143us, 1.5 GB/s
  13b Move in 154us, 1.5 GB/s     13b MoveFast in 144us, 1.6 GB/s
  14b Move in 150us, 1.7 GB/s     14b MoveFast in 143us, 1.8 GB/s
  15b Move in 151us, 1.8 GB/s     15b MoveFast in 147us, 1.9 GB/s
  16b Move in 159us, 1.8 GB/s     16b MoveFast in 73us, 4 GB/s
  17b Move in 161us, 1.9 GB/s     17b MoveFast in 153us, 2 GB/s
  18b Move in 167us, 2 GB/s       18b MoveFast in 153us, 2.1 GB/s
  19b Move in 180us, 1.9 GB/s     19b MoveFast in 161us, 2.1 GB/s
  20b Move in 155us, 2.4 GB/s     20b MoveFast in 153us, 2.4 GB/s
  21b Move in 150us, 2.6 GB/s     21b MoveFast in 153us, 2.5 GB/s
  22b Move in 155us, 2.6 GB/s     22b MoveFast in 163us, 2.5 GB/s
  23b Move in 157us, 2.7 GB/s     23b MoveFast in 154us, 2.7 GB/s
  24b Move in 166us, 2.6 GB/s     24b MoveFast in 91us, 4.9 GB/s
  25b Move in 175us, 2.6 GB/s     25b MoveFast in 163us, 2.8 GB/s
  26b Move in 183us, 2.6 GB/s     26b MoveFast in 170us, 2.8 GB/s
  27b Move in 193us, 2.6 GB/s     27b MoveFast in 162us, 3.1 GB/s
  28b Move in 166us, 3.1 GB/s     28b MoveFast in 163us, 3.2 GB/s
  29b Move in 183us, 2.9 GB/s     29b MoveFast in 169us, 3.1 GB/s
  30b Move in 170us, 3.2 GB/s     30b MoveFast in 176us, 3.1 GB/s
  31b Move in 175us, 3.3 GB/s     31b MoveFast in 176us, 3.2 GB/s
  32b Move in 185us, 3.2 GB/s     32b MoveFast in 146us, 4 GB/s
  33b Move in 193us, 3.1 GB/s     33b MoveFast in 197us, 3.1 GB/s
  34b Move in 198us, 3.1 GB/s     34b MoveFast in 157us, 4 GB/s
  35b Move in 218us, 2.9 GB/s     35b MoveFast in 155us, 4.2 GB/s
  36b Move in 196us, 3.4 GB/s     36b MoveFast in 155us, 4.3 GB/s
  37b Move in 184us, 3.7 GB/s     37b MoveFast in 160us, 4.3 GB/s
  38b Move in 209us, 3.3 GB/s     38b MoveFast in 160us, 4.4 GB/s
  39b Move in 201us, 3.6 GB/s     39b MoveFast in 161us, 4.5 GB/s
  40b Move in 218us, 3.4 GB/s     40b MoveFast in 161us, 4.6 GB/s
  41b Move in 226us, 3.3 GB/s     41b MoveFast in 161us, 4.7 GB/s
  42b Move in 236us, 3.3 GB/s     42b MoveFast in 162us, 4.8 GB/s
  43b Move in 250us, 3.2 GB/s     43b MoveFast in 161us, 4.9 GB/s
  44b Move in 180us, 4.5 GB/s     44b MoveFast in 215us, 3.8 GB/s
  45b Move in 199us, 4.2 GB/s     45b MoveFast in 230us, 3.6 GB/s
  46b Move in 195us, 4.3 GB/s     46b MoveFast in 164us, 5.2 GB/s
  47b Move in 198us, 4.4 GB/s     47b MoveFast in 173us, 5 GB/s
  48b Move in 231us, 3.8 GB/s     48b MoveFast in 163us, 5.4 GB/s
As you can see, our code was designed to be very efficient with 8-byte
multiples (8-16-24-32 bytes), which are pretty common when moving small
It is faster than FPC RTL's original code for most sizes (44 and 45 bytes being the most visible exceptions in this run), even though that code was already some optimized assembly.
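To illustrate why multiples of 8 bytes can be cheap, here is a plain Pascal sketch of a dedicated path for 8, 16, 24 and 32 bytes using 64-bit loads and stores. This is not the SynCommons implementation (which is inline asm): SmallMoveSketch, its types and its dispatch are hypothetical, and the code assumes a CPU where unaligned Int64 access is allowed, as on x86_64.

```pascal
program SmallMoveDemo;

{$ifdef FPC}{$mode objfpc}{$endif}
{$ifndef FPC}{$APPTYPE CONSOLE}{$endif}

type
  TInt64x4 = array[0..3] of Int64;
  PInt64x4 = ^TInt64x4;

// hypothetical helper, not the SynCommons implementation
procedure SmallMoveSketch(src, dst: Pointer; count: Integer);
var
  s, d: PInt64x4;
  v0, v1, v2, v3: Int64;
begin
  s := src;
  d := dst;
  // every branch loads all the source data before storing anything,
  // so the result stays correct when source and destination overlap
  case count of
    8:
      d^[0] := s^[0];
    16:
      begin
        v0 := s^[0]; v1 := s^[1];
        d^[0] := v0; d^[1] := v1;
      end;
    24:
      begin
        v0 := s^[0]; v1 := s^[1]; v2 := s^[2];
        d^[0] := v0; d^[1] := v1; d^[2] := v2;
      end;
    32:
      begin
        v0 := s^[0]; v1 := s^[1]; v2 := s^[2]; v3 := s^[3];
        d^[0] := v0; d^[1] := v1; d^[2] := v2; d^[3] := v3;
      end;
  else
    Move(src^, dst^, count); // any other size: fall back to the RTL
  end;
end;

var
  buf: array[0..39] of Byte;
  i: Integer;
begin
  for i := 0 to High(buf) do
    buf[i] := i;
  SmallMoveSketch(@buf[0], @buf[8], 32); // overlapping 32-byte move
  WriteLn(buf[8], ' ', buf[39]); // 0 31
end.
```

The point of the sketch is only that such sizes need no loop and no byte-level tail handling, which matches the lower timings seen for 8, 16, 24 and 32 bytes above.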
If you look at the source code, you will see that we tried to make the code as clear as possible, while using all the capabilities of Delphi/FPC inline asm.
Please run the tests on your PC, and share some numbers and enhancements!