Some numbers, taken from
TTestLowLevelCommon.CustomRTL
regression tests.
These are not absolute timings: there is always about 10% of variation between runs of our tests, but they give a high-level estimate of the performance obtained.
Numbers should be compared per run, within a similar execution context.
In the text below:
- FillChar/FillCharFast fills a buffer with an increasing number of bytes.
- Move/MoveFast moves some overlapping data of increasing size, in a forward way.
- small Move/MoveFast moves 1..48 bytes with overlap.
- big Move/MoveFast moves around 8MB of data, with or without overlap.
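As a rough sketch of what such a micro-benchmark looks like (this is not the actual TTestLowLevelCommon.CustomRTL code, and it assumes the SynCommons unit with its FillCharFast/MoveFast functions and TPrecisionTimer):

```pascal
program BenchSketch;

uses
  SysUtils,
  SynCommons; // provides FillCharFast/MoveFast and TPrecisionTimer

var
  buf: array of byte;
  size: PtrInt;
  timer: TPrecisionTimer;
begin
  SetLength(buf, 8 shl 20); // 8MB buffer, as in the "big" tests
  // fill with an increasing number of bytes (doubling here, for brevity)
  timer.Start;
  size := 1;
  while size <= length(buf) do
  begin
    FillCharFast(buf[0], size, 0);
    size := size * 2;
  end;
  writeln('FillCharFast: ', timer.Stop);
  // forward move with overlap: source and destination ranges share bytes
  timer.Start;
  MoveFast(buf[0], buf[1], length(buf) - 1);
  writeln('MoveFast (overlap): ', timer.Stop);
end.
```

Both functions keep the same signature as the RTL's FillChar() and Move(), so they can be used as drop-in replacements.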
First of all, Delphi's RTL vs SynCommons' versions on Win64 (big Move with overlap):
On Delphi XE4 Win64 (VM):
  FillChar in 34.42ms, 11.2 GB/s
  FillCharFast [] in 15.03ms, 25.8 GB/s
  Move in 3.76ms, 4.1 GB/s
  MoveFast [] in 2.16ms, 7.2 GB/s
  small Move in 7.51ms, 2.9 GB/s
  small MoveFast [] in 6.77ms, 3.2 GB/s
  big Move in 67.06ms, 2.3 GB/s
  big MoveFast [] in 41ms, 3.8 GB/s
On Delphi 10.3 Win64 (VM):
  FillChar in 28.82ms, 13.4 GB/s
  FillCharFast [] in 14.89ms, 26 GB/s
  Move in 3.68ms, 4.2 GB/s
  MoveFast [] in 2.13ms, 7.3 GB/s
  small Move in 7.34ms, 2.9 GB/s
  small MoveFast [] in 6.73ms, 3.2 GB/s
  big Move in 50.90ms, 3 GB/s
  big MoveFast [] in 40.74ms, 3.8 GB/s
As you can see, Delphi 10.3 performs slightly better than Delphi XE4, most likely thanks to some RTL refactoring and the introduction of ERMS (Enhanced REP MOVSB), which my CPU supports, so "big Move" gets decent results.
But our versions are always faster, especially FillChar().
Now FPC's RTL vs SynCommons' versions on Linux x86_64 (big moves with or without overlap):
FillChar in 30.33ms, 12.8 GB/s
FillCharFast [] in 14.16ms, 27.4 GB/s
FillCharFast [cpuAVX] in 11.98ms, 32.4 GB/s
Move in 1.92ms, 8.1 GB/s
MoveFast [] in 2.19ms, 7.1 GB/s
MoveFast [cpuAVX] in 1.60ms, 9.7 GB/s
small Move in 8.84ms, 2.4 GB/s
small MoveFast [] in 6.66ms, 3.2 GB/s
small MoveFast [cpuAVX] in 6.63ms, 3.3 GB/s
overlapping:
  big Move in 39.94ms, 3.9 GB/s
  big MoveFast [] in 38.13ms, 4 GB/s
  big MoveFast [cpuAVX] in 36.86ms, 4.2 GB/s
non overlapping:
  big Move in 54.55ms, 7.1 GB/s
  big MoveFast [] in 40.45ms, 9.6 GB/s
  big MoveFast [cpuAVX] in 39.57ms, 9.8 GB/s
The previous Delphi numbers were taken from a VM, on Win64 rather than Linux, with different high-resolution timers and with overlap, so they should not be used directly for a "FPC vs Delphi" comparison.
What matters here are the relative numbers of FillChar/Move against FillCharFast/MoveFast within a single execution context.
For big moves, the 256-bit YMM AVX version gives a slight advantage over the 128-bit XMM SSE2 code, which was already saturating the memory bandwidth.
Small blocks don't involve AVX by design.
A real AVX advantage appears only for overlapping forward moves of increasing size, most likely because fewer cycles are involved and no memory prefetch is triggered; such a pattern may occur in some applications.
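The AVX code path is selected once at startup from the CPUID flags. A simplified check could look like this (assuming the cfAVX flag in SynCommons' CpuFeatures set; the exact identifiers may differ between mORMot versions):

```pascal
program CheckAvx;

uses
  SynCommons; // CpuFeatures is filled from CPUID at unit initialization

begin
  if cfAVX in CpuFeatures then
    // the 256-bit YMM versions of MoveFast/FillCharFast may be used
    writeln('AVX available: YMM code path')
  else
    // fallback to the 128-bit XMM SSE2 code, available on all x86_64
    writeln('no AVX: SSE2 XMM code path');
end.
```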
For small backward/forward moves (on FPC Linux), the detailed per-size numbers show a pretty interesting performance difference:
 1b Move in 89us, 214.3 MB/s | 1b MoveFast [] in 77us, 247.7 MB/s
 2b Move in 95us, 401.5 MB/s | 2b MoveFast [] in 72us, 529.8 MB/s
 3b Move in 107us, 534.7 MB/s | 3b MoveFast [] in 76us, 752.9 MB/s
 4b Move in 129us, 591.4 MB/s | 4b MoveFast [] in 92us, 829.2 MB/s
 5b Move in 163us, 585 MB/s | 5b MoveFast [] in 78us, 1.1 GB/s
 6b Move in 143us, 800.2 MB/s | 6b MoveFast [] in 77us, 1.4 GB/s
 7b Move in 143us, 0.9 GB/s | 7b MoveFast [] in 146us, 914.4 MB/s
 8b Move in 159us, 0.9 GB/s | 8b MoveFast [] in 73us, 2 GB/s
 9b Move in 166us, 1 GB/s | 9b MoveFast [] in 144us, 1.1 GB/s
10b Move in 170us, 1 GB/s | 10b MoveFast [] in 144us, 1.2 GB/s
11b Move in 181us, 1.1 GB/s | 11b MoveFast [] in 144us, 1.4 GB/s
12b Move in 153us, 1.4 GB/s | 12b MoveFast [] in 143us, 1.5 GB/s
13b Move in 154us, 1.5 GB/s | 13b MoveFast [] in 144us, 1.6 GB/s
14b Move in 150us, 1.7 GB/s | 14b MoveFast [] in 143us, 1.8 GB/s
15b Move in 151us, 1.8 GB/s | 15b MoveFast [] in 147us, 1.9 GB/s
16b Move in 159us, 1.8 GB/s | 16b MoveFast [] in 73us, 4 GB/s
17b Move in 161us, 1.9 GB/s | 17b MoveFast [] in 153us, 2 GB/s
18b Move in 167us, 2 GB/s | 18b MoveFast [] in 153us, 2.1 GB/s
19b Move in 180us, 1.9 GB/s | 19b MoveFast [] in 161us, 2.1 GB/s
20b Move in 155us, 2.4 GB/s | 20b MoveFast [] in 153us, 2.4 GB/s
21b Move in 150us, 2.6 GB/s | 21b MoveFast [] in 153us, 2.5 GB/s
22b Move in 155us, 2.6 GB/s | 22b MoveFast [] in 163us, 2.5 GB/s
23b Move in 157us, 2.7 GB/s | 23b MoveFast [] in 154us, 2.7 GB/s
24b Move in 166us, 2.6 GB/s | 24b MoveFast [] in 91us, 4.9 GB/s
25b Move in 175us, 2.6 GB/s | 25b MoveFast [] in 163us, 2.8 GB/s
26b Move in 183us, 2.6 GB/s | 26b MoveFast [] in 170us, 2.8 GB/s
27b Move in 193us, 2.6 GB/s | 27b MoveFast [] in 162us, 3.1 GB/s
28b Move in 166us, 3.1 GB/s | 28b MoveFast [] in 163us, 3.2 GB/s
29b Move in 183us, 2.9 GB/s | 29b MoveFast [] in 169us, 3.1 GB/s
30b Move in 170us, 3.2 GB/s | 30b MoveFast [] in 176us, 3.1 GB/s
31b Move in 175us, 3.3 GB/s | 31b MoveFast [] in 176us, 3.2 GB/s
32b Move in 185us, 3.2 GB/s | 32b MoveFast [] in 146us, 4 GB/s
33b Move in 193us, 3.1 GB/s | 33b MoveFast [] in 197us, 3.1 GB/s
34b Move in 198us, 3.1 GB/s | 34b MoveFast [] in 157us, 4 GB/s
35b Move in 218us, 2.9 GB/s | 35b MoveFast [] in 155us, 4.2 GB/s
36b Move in 196us, 3.4 GB/s | 36b MoveFast [] in 155us, 4.3 GB/s
37b Move in 184us, 3.7 GB/s | 37b MoveFast [] in 160us, 4.3 GB/s
38b Move in 209us, 3.3 GB/s | 38b MoveFast [] in 160us, 4.4 GB/s
39b Move in 201us, 3.6 GB/s | 39b MoveFast [] in 161us, 4.5 GB/s
40b Move in 218us, 3.4 GB/s | 40b MoveFast [] in 161us, 4.6 GB/s
41b Move in 226us, 3.3 GB/s | 41b MoveFast [] in 161us, 4.7 GB/s
42b Move in 236us, 3.3 GB/s | 42b MoveFast [] in 162us, 4.8 GB/s
43b Move in 250us, 3.2 GB/s | 43b MoveFast [] in 161us, 4.9 GB/s
44b Move in 180us, 4.5 GB/s | 44b MoveFast [] in 215us, 3.8 GB/s
45b Move in 199us, 4.2 GB/s | 45b MoveFast [] in 230us, 3.6 GB/s
46b Move in 195us, 4.3 GB/s | 46b MoveFast [] in 164us, 5.2 GB/s
47b Move in 198us, 4.4 GB/s | 47b MoveFast [] in 173us, 5 GB/s
48b Move in 231us, 3.8 GB/s | 48b MoveFast [] in 163us, 5.4 GB/s
As you can see, our code was designed to handle 8-byte multiples (8-16-24-32 bytes) very efficiently, since they are pretty common when moving small objects.
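For instance, copying a small record whose size is an 8-byte multiple hits exactly that fast path. A minimal usage sketch (the record type is purely illustrative):

```pascal
program SmallMove;

uses
  SynCommons; // provides MoveFast

type
  // a hypothetical small object: 16 bytes, i.e. an 8-byte multiple,
  // which is the case our MoveFast code optimizes for
  TSmallObj = record
    a, b: Int64;
  end;

var
  src, dst: TSmallObj;
begin
  src.a := 1;
  src.b := 2;
  // same argument order as system.Move(src, dst, count)
  MoveFast(src, dst, SizeOf(src));
  writeln(dst.a, ' ', dst.b);
end.
```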
It is almost always faster than the FPC RTL's original code, which was already optimized assembly. :)
If you look at the source code, you will see that we tried to make it as clear as possible, while using all the capabilities of Delphi/FPC inlined asm.
Please run the tests on your PC, and share some numbers and enhancements!