Our Open Source framework includes some optimized asm alternatives
We just rewrote from scratch the x86_64 version of those,
which was previously taken from third-party snippets.
The brand new code is meant to be more efficient and maintainable. In
particular, we switched to SIMD 128-bit SSE2 or 256bit AVX memory access (if
available), whereas current version was using 64-bit regular registers. The
small blocks (i.e. < 32 bytes) process occurs very often, e.g. when
processing strings, so has been tuned a lot. Non temporal instructions (i.e.
bypassing the CPU cache) are used for biggest chunks of data. We tested
ERMS support, but it
was found of no benefit in respect to our optimized SIMD, and was actually
slower than our non-temporal variants. So ERMS code is currently disabled in
the source, and may be enabled on demand by a conditional.
move() was not bad. Delphi's Win64 was far
from optimized - even ERMS was poorly introduced in latest RTL, since it should
be triggered only for blocks > 2KB. Sadly, Delphi doesn't support AVX
assembly yet, so those opcodes would be available only on FPC.
Resulting numbers are talking by themselves. Working on Win64 and Linux, of