Some numbers, taken from
TTestLowLevelCommon.CustomRTL
regression tests.
These are not absolute timings: there is always about 10% of variation between runs of our tests, but they give a high-level estimate of the performance obtained.
Numbers should be compared per run, within a similar execution context.
In the text below:
- FillChar/FillCharFast fills a buffer with an increasing number of bytes.
- Move/MoveFast moves some overlapping data of increasing size, in a forward way.
- small Move/MoveFast moves 1..48 bytes with overlap.
- big Move/MoveFast moves around 8MB of data, with or without overlap.
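As a rough sketch of what such a micro-benchmark looks like (this is not the actual TTestLowLevelCommon.CustomRTL code, and it assumes the SynCommons unit with its FillCharFast/MoveFast functions and TPrecisionTimer):

```pascal
program BenchSketch;

uses
  SysUtils,
  SynCommons; // provides FillCharFast/MoveFast and TPrecisionTimer

var
  buf: array of byte;
  size: PtrInt;
  timer: TPrecisionTimer;
begin
  SetLength(buf, 8 shl 20); // 8MB buffer, as in the "big" tests
  // fill with an increasing number of bytes (doubling here, for brevity)
  timer.Start;
  size := 1;
  while size <= length(buf) do
  begin
    FillCharFast(buf[0], size, 0);
    size := size * 2;
  end;
  writeln('FillCharFast: ', timer.Stop);
  // forward move with overlap: source and destination ranges share bytes
  timer.Start;
  MoveFast(buf[0], buf[1], length(buf) - 1);
  writeln('MoveFast (overlap): ', timer.Stop);
end.
```

Both functions keep the same signature as the RTL's FillChar() and Move(), so they can be used as drop-in replacements.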
First of all, Delphi's RTL vs SynCommons' versions on Win64 (big Move with overlap):
On Delphi XE4 Win64 (VM):
  FillChar in 34.42ms, 11.2 GB/s
  FillCharFast [] in 15.03ms, 25.8 GB/s
  Move in 3.76ms, 4.1 GB/s
  MoveFast [] in 2.16ms, 7.2 GB/s
  small Move in 7.51ms, 2.9 GB/s
  small MoveFast [] in 6.77ms, 3.2 GB/s
  big Move in 67.06ms, 2.3 GB/s
  big MoveFast [] in 41ms, 3.8 GB/s
On Delphi 10.3 Win64 (VM):
  FillChar in 28.82ms, 13.4 GB/s
  FillCharFast [] in 14.89ms, 26 GB/s
  Move in 3.68ms, 4.2 GB/s
  MoveFast [] in 2.13ms, 7.3 GB/s
  small Move in 7.34ms, 2.9 GB/s
  small MoveFast [] in 6.73ms, 3.2 GB/s
  big Move in 50.90ms, 3 GB/s
  big MoveFast [] in 40.74ms, 3.8 GB/s
As you can see, Delphi 10.3 performs slightly better than Delphi XE4, most likely thanks to some RTL refactoring and the introduction of ERMS (Enhanced REP MOVSB), which my CPU supports, so "big Move" gets decent results.
But our versions are always faster, especially FillChar().
Now FPC's RTL vs SynCommons' versions on Linux x86_64 (big moves with or without overlap):
FillChar in 30.33ms, 12.8 GB/s
FillCharFast [] in 14.16ms, 27.4 GB/s
FillCharFast [cpuAVX] in 11.98ms, 32.4 GB/s
Move in 1.92ms, 8.1 GB/s
MoveFast [] in 2.19ms, 7.1 GB/s
MoveFast [cpuAVX] in 1.60ms, 9.7 GB/s
small Move in 8.84ms, 2.4 GB/s
small MoveFast [] in 6.66ms, 3.2 GB/s
small MoveFast [cpuAVX] in 6.63ms, 3.3 GB/s
overlapping:
  big Move in 39.94ms, 3.9 GB/s
  big MoveFast [] in 38.13ms, 4 GB/s
  big MoveFast [cpuAVX] in 36.86ms, 4.2 GB/s
non overlapping:
  big Move in 54.55ms, 7.1 GB/s
  big MoveFast [] in 40.45ms, 9.6 GB/s
  big MoveFast [cpuAVX] in 39.57ms, 9.8 GB/s
The previous Delphi numbers were taken from a VM, on Win64 rather than Linux, with different high-resolution timers and with overlap, so they should not be used directly for a "FPC vs Delphi" comparison.
What matters here are the relative numbers of FillChar/Move against FillCharFast/MoveFast within a single execution context.
For big moves, the 256-bit YMM AVX version gives a slight advantage over the 128-bit XMM SSE2 code, which was already saturating the memory bandwidth.
Small blocks don't involve AVX by design.
A real AVX advantage appears only for overlapping forward moves of increasing size, most likely because fewer cycles are involved and no memory prefetch is triggered; such a pattern may occur in some applications.
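The AVX code path is selected once at startup from the CPUID flags. A simplified check could look like this (assuming the cfAVX flag in SynCommons' CpuFeatures set; the exact identifiers may differ between mORMot versions):

```pascal
program CheckAvx;

uses
  SynCommons; // CpuFeatures is filled from CPUID at unit initialization

begin
  if cfAVX in CpuFeatures then
    // the 256-bit YMM versions of MoveFast/FillCharFast may be used
    writeln('AVX available: YMM code path')
  else
    // fallback to the 128-bit XMM SSE2 code, available on all x86_64
    writeln('no AVX: SSE2 XMM code path');
end.
```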
For small backward/forward moves (on FPC Linux), the detailed per-size numbers show a pretty interesting performance difference:
 1b Move in 89us, 214.3 MB/s | 1b MoveFast [] in 77us, 247.7 MB/s
 2b Move in 95us, 401.5 MB/s | 2b MoveFast [] in 72us, 529.8 MB/s
 3b Move in 107us, 534.7 MB/s | 3b MoveFast [] in 76us, 752.9 MB/s
 4b Move in 129us, 591.4 MB/s | 4b MoveFast [] in 92us, 829.2 MB/s
 5b Move in 163us, 585 MB/s | 5b MoveFast [] in 78us, 1.1 GB/s
 6b Move in 143us, 800.2 MB/s | 6b MoveFast [] in 77us, 1.4 GB/s
 7b Move in 143us, 0.9 GB/s | 7b MoveFast [] in 146us, 914.4 MB/s
 8b Move in 159us, 0.9 GB/s | 8b MoveFast [] in 73us, 2 GB/s
 9b Move in 166us, 1 GB/s | 9b MoveFast [] in 144us, 1.1 GB/s
10b Move in 170us, 1 GB/s | 10b MoveFast [] in 144us, 1.2 GB/s
11b Move in 181us, 1.1 GB/s | 11b MoveFast [] in 144us, 1.4 GB/s
12b Move in 153us, 1.4 GB/s | 12b MoveFast [] in 143us, 1.5 GB/s
13b Move in 154us, 1.5 GB/s | 13b MoveFast [] in 144us, 1.6 GB/s
14b Move in 150us, 1.7 GB/s | 14b MoveFast [] in 143us, 1.8 GB/s
15b Move in 151us, 1.8 GB/s | 15b MoveFast [] in 147us, 1.9 GB/s
16b Move in 159us, 1.8 GB/s | 16b MoveFast [] in 73us, 4 GB/s
17b Move in 161us, 1.9 GB/s | 17b MoveFast [] in 153us, 2 GB/s
18b Move in 167us, 2 GB/s | 18b MoveFast [] in 153us, 2.1 GB/s
19b Move in 180us, 1.9 GB/s | 19b MoveFast [] in 161us, 2.1 GB/s
20b Move in 155us, 2.4 GB/s | 20b MoveFast [] in 153us, 2.4 GB/s
21b Move in 150us, 2.6 GB/s | 21b MoveFast [] in 153us, 2.5 GB/s
22b Move in 155us, 2.6 GB/s | 22b MoveFast [] in 163us, 2.5 GB/s
23b Move in 157us, 2.7 GB/s | 23b MoveFast [] in 154us, 2.7 GB/s
24b Move in 166us, 2.6 GB/s | 24b MoveFast [] in 91us, 4.9 GB/s
25b Move in 175us, 2.6 GB/s | 25b MoveFast [] in 163us, 2.8 GB/s
26b Move in 183us, 2.6 GB/s | 26b MoveFast [] in 170us, 2.8 GB/s
27b Move in 193us, 2.6 GB/s | 27b MoveFast [] in 162us, 3.1 GB/s
28b Move in 166us, 3.1 GB/s | 28b MoveFast [] in 163us, 3.2 GB/s
29b Move in 183us, 2.9 GB/s | 29b MoveFast [] in 169us, 3.1 GB/s
30b Move in 170us, 3.2 GB/s | 30b MoveFast [] in 176us, 3.1 GB/s
31b Move in 175us, 3.3 GB/s | 31b MoveFast [] in 176us, 3.2 GB/s
32b Move in 185us, 3.2 GB/s | 32b MoveFast [] in 146us, 4 GB/s
33b Move in 193us, 3.1 GB/s | 33b MoveFast [] in 197us, 3.1 GB/s
34b Move in 198us, 3.1 GB/s | 34b MoveFast [] in 157us, 4 GB/s
35b Move in 218us, 2.9 GB/s | 35b MoveFast [] in 155us, 4.2 GB/s
36b Move in 196us, 3.4 GB/s | 36b MoveFast [] in 155us, 4.3 GB/s
37b Move in 184us, 3.7 GB/s | 37b MoveFast [] in 160us, 4.3 GB/s
38b Move in 209us, 3.3 GB/s | 38b MoveFast [] in 160us, 4.4 GB/s
39b Move in 201us, 3.6 GB/s | 39b MoveFast [] in 161us, 4.5 GB/s
40b Move in 218us, 3.4 GB/s | 40b MoveFast [] in 161us, 4.6 GB/s
41b Move in 226us, 3.3 GB/s | 41b MoveFast [] in 161us, 4.7 GB/s
42b Move in 236us, 3.3 GB/s | 42b MoveFast [] in 162us, 4.8 GB/s
43b Move in 250us, 3.2 GB/s | 43b MoveFast [] in 161us, 4.9 GB/s
44b Move in 180us, 4.5 GB/s | 44b MoveFast [] in 215us, 3.8 GB/s
45b Move in 199us, 4.2 GB/s | 45b MoveFast [] in 230us, 3.6 GB/s
46b Move in 195us, 4.3 GB/s | 46b MoveFast [] in 164us, 5.2 GB/s
47b Move in 198us, 4.4 GB/s | 47b MoveFast [] in 173us, 5 GB/s
48b Move in 231us, 3.8 GB/s | 48b MoveFast [] in 163us, 5.4 GB/s
As you can see, our code was designed to handle 8-byte multiples (8-16-24-32 bytes) very efficiently, since they are pretty common when moving small objects.
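For instance, copying a small record whose size is an 8-byte multiple hits exactly that fast path. A minimal usage sketch (the record type is purely illustrative):

```pascal
program SmallMove;

uses
  SynCommons; // provides MoveFast

type
  // a hypothetical small object: 16 bytes, i.e. an 8-byte multiple,
  // which is the case our MoveFast code optimizes for
  TSmallObj = record
    a, b: Int64;
  end;

var
  src, dst: TSmallObj;
begin
  src.a := 1;
  src.b := 2;
  // same argument order as system.Move(src, dst, count)
  MoveFast(src, dst, SizeOf(src));
  writeln(dst.a, ' ', dst.b);
end.
```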
It is almost always faster than the FPC RTL's original code, which was already optimized assembly. :)
If you look at the source code, you will see that we tried to make it as clear as possible, while using all the capabilities of Delphi/FPC inlined asm.
Please run the tests on your PC, and share some numbers and enhancements!