Always Free Ampere VM

Back to the beginning. Tom, a mORMot user, reported on our forum that he had successfully installed FPC and Lazarus on the Oracle Cloud platform, and accessed it via SSH/XRDP:

Just open account on Oracle Cloud and create new compute VM: 4 ARMv8.2 CPU 3GHz, 24GB Ram (yes 24GB). This is always free VM (you can combine this 4 cores and 24GB Ram to 1 or many (4) VM). Install Ubuntu 20.04 server, then install LXDE and XRdp for remote access. Now I have nice speed workstation. Install fpcupdeluxe then fpc 3.2.2/laz 2.0.12, all OK. fpcup build is faster then my local pc build :-) This OCI VM can be great mormot application server for some projects. I don't have any connection to Oracle - just test their product.

I did the same, and in fact this platform is really easy to work with, once you have paid a 1€ credit card fee to validate your account. You then get an "Always Free VM" with 4 Ampere cores and 24GB of RAM. Amazing. Oracle is clearly eager to break into the cloud market, so they make it wide open for developers, hoping they will consider Oracle Cloud instead of Microsoft's or Amazon's.

FPC and Lazarus on Linux/AArch64

The Lazarus experience is very good on this platform, even remotely. The only issue is the debugger: gdb was pretty unstable for me - almost as unstable as on Windows - but somewhat usable, until it crashes. :(

Finally, and thanks to Alfred - our great friend behind fpcupdeluxe - we identified a problem with our asm stubs when calling mORMot interface-based services. It was in fact an FPC "feature" (it is documented as such in the compiler, so it is not a bug) in how arguments are passed as function results in the AArch64 calling ABI. Once identified, we added an explicit exception to circumvent the problem.

The code generated by FPC seems good - at least on par with what it produces for the x86_64 Intel/AMD platform. Not as good as gcc for sure, but good enough for production code, and fast. The only big limitation is that the inline assembler support is very limited: only a few AArch64 opcodes are available - just what was mandatory for the basic FPC RTL needs.

Tuning mORMot 2 for AArch64

To enhance performance, we replaced the basic FPC RTL Move/FillChar functions with the libc memmove/memset. The resulting performance is amazing:

FPC RTL
     FillChar in 19.06ms, 20.3 GB/s
     Move in 9.95ms, 1.5 GB/s
     small Move in 15.85ms, 1.3 GB/s
     big Move in 222.41ms, 1.7 GB/s

mORMot functions calling the Ubuntu GNU libc:
     FillCharFast in 8.84ms, 43.9 GB/s
     MoveFast in 1.25ms, 12.4 GB/s
     small MoveFast in 4.98ms, 4.3 GB/s
     big MoveFast in 34.32ms, 11.3 GB/s

In comparison, here are the numbers on my Core i5 7200U CPU for mORMot's tuned x86_64 asm (faster than the FPC RTL), using SSE2 or AVX instructions:

     FillCharFast [] in 21.43ms, 18.1 GB/s
     MoveFast [] in 2.29ms, 6.8 GB/s
     small MoveFast [] in 4.29ms, 5.1 GB/s
     big MoveFast [] in 68.33ms, 5.7 GB/s
     FillCharFast [cpuAVX] in 20.28ms, 19.1 GB/s
     MoveFast [cpuAVX] in 2.26ms, 6.8 GB/s
     small MoveFast [cpuAVX] in 4.25ms, 5.1 GB/s
     big MoveFast [cpuAVX] in 69.93ms, 5.5 GB/s

So we can see that the Ampere CPU memory subsystem is pretty efficient: it is up to twice as fast as a Core i5 7200U CPU.

We went further, and had some fun with one bottleneck of every server workload: encryption and hashes. So we wrote some C code to use the efficient hardware acceleration available for encryption and hashes. You can find the source code in the /res/static/armv8 sub-folder of our repository. Now we get tremendous performance for AES, GCM, SHA-2 and CRC32/CRC32C computation.

     2500 crc32c in 259us i.e. 9.2M/s or 20 GB/s
     2500 xxhash32 in 1.47ms i.e. 1.6M/s or 3.5 GB/s
     2500 crc32 in 259us i.e. 9.2M/s or 20 GB/s
     2500 adler32 in 469us i.e. 5M/s or 11 GB/s
     2500 hash32 in 584us i.e. 4M/s or 8.9 GB/s
     2500 md5 in 12.12ms i.e. 201.3K/s or 438.7 MB/s
     2500 sha1 in 21.75ms i.e. 112.2K/s or 244.5 MB/s
     2500 hmacsha1 in 23.81ms i.e. 102.5K/s or 223.4 MB/s
     2500 sha256 in 3.41ms i.e. 714.7K/s or 1.5 GB/s
     2500 hmacsha256 in 4.12ms i.e. 591.2K/s or 1.2 GB/s
     2500 sha384 in 27.71ms i.e. 88K/s or 191.9 MB/s
     2500 hmacsha384 in 32.69ms i.e. 74.6K/s or 162.7 MB/s
     2500 sha512 in 27.73ms i.e. 88K/s or 191.8 MB/s
     2500 hmacsha512 in 32.77ms i.e. 74.4K/s or 162.3 MB/s
     2500 sha3_256 in 35.82ms i.e. 68.1K/s or 148.5 MB/s
     2500 sha3_512 in 65.48ms i.e. 37.2K/s or 81.2 MB/s
     2500 rc4 in 12.98ms i.e. 188K/s or 409.8 MB/s
     2500 mormot aes-128-cfb in 8.84ms i.e. 276.1K/s or 601.7 MB/s
     2500 mormot aes-128-ofb in 3.78ms i.e. 645K/s or 1.3 GB/s
     2500 mormot aes-128-c64 in 4.39ms i.e. 555.6K/s or 1.1 GB/s
     2500 mormot aes-128-ctr in 4.52ms i.e. 539.7K/s or 1.1 GB/s
     2500 mormot aes-128-cfc in 9.16ms i.e. 266.3K/s or 580.4 MB/s
     2500 mormot aes-128-ofc in 5.25ms i.e. 465K/s or 0.9 GB/s
     2500 mormot aes-128-ctc in 5.74ms i.e. 425.1K/s or 0.9 GB/s
     2500 mormot aes-128-gcm in 7.52ms i.e. 324.5K/s or 707.2 MB/s
     2500 mormot aes-256-cfb in 9.52ms i.e. 256.2K/s or 558.4 MB/s
     2500 mormot aes-256-ofb in 4.71ms i.e. 517.6K/s or 1.1 GB/s
     2500 mormot aes-256-c64 in 5.30ms i.e. 460.5K/s or 0.9 GB/s
     2500 mormot aes-256-ctr in 5.33ms i.e. 457.5K/s or 0.9 GB/s
     2500 mormot aes-256-cfc in 10.04ms i.e. 243K/s or 529.5 MB/s
     2500 mormot aes-256-ofc in 6.11ms i.e. 399K/s or 869.6 MB/s
     2500 mormot aes-256-ctc in 6.77ms i.e. 360.4K/s or 785.5 MB/s
     2500 mormot aes-256-gcm in 8.38ms i.e. 291.1K/s or 634.4 MB/s
     2500 openssl aes-128-cfb in 4.94ms i.e. 493.4K/s or 1 GB/s
     2500 openssl aes-128-ofb in 4.12ms i.e. 591.2K/s or 1.2 GB/s
     2500 openssl aes-128-ctr in 1.94ms i.e. 1.2M/s or 2.6 GB/s
     2500 openssl aes-128-gcm in 3.18ms i.e. 767K/s or 1.6 GB/s
     2500 openssl aes-256-cfb in 5.83ms i.e. 418.5K/s or 912.1 MB/s
     2500 openssl aes-256-ofb in 5.04ms i.e. 484.1K/s or 1 GB/s
     2500 openssl aes-256-ctr in 2.42ms i.e. 0.9M/s or 2.1 GB/s
     2500 openssl aes-256-gcm in 3.66ms i.e. 667K/s or 1.4 GB/s
     2500 shake128 in 29.63ms i.e. 82.3K/s or 179.5 MB/s
     2500 shake256 in 35.07ms i.e. 69.5K/s or 151.6 MB/s

Here are the numbers on my Core i5 7200U CPU, with optimized asm and the latest OpenSSL:

     2500 crc32c in 224us i.e. 10.6M/s or 23.1 GB/s
     2500 xxhash32 in 817us i.e. 2.9M/s or 6.3 GB/s
     2500 crc32 in 341us i.e. 6.9M/s or 15.2 GB/s
     2500 adler32 in 241us i.e. 9.8M/s or 21.5 GB/s
     2500 hash32 in 441us i.e. 5.4M/s or 11.7 GB/s
     2500 aesnihash in 218us i.e. 10.9M/s or 23.8 GB/s
     2500 md5 in 8.29ms i.e. 294.1K/s or 641.1 MB/s
     2500 sha1 in 13.72ms i.e. 177.8K/s or 387.5 MB/s
     2500 hmacsha1 in 15.05ms i.e. 162.1K/s or 353.3 MB/s
     2500 sha256 in 17.40ms i.e. 140.2K/s or 305.6 MB/s
     2500 hmacsha256 in 18.71ms i.e. 130.4K/s or 284.2 MB/s
     2500 sha384 in 11.59ms i.e. 210.5K/s or 458.9 MB/s
     2500 hmacsha384 in 13.84ms i.e. 176.3K/s or 384.2 MB/s
     2500 sha512 in 11.59ms i.e. 210.5K/s or 458.8 MB/s
     2500 hmacsha512 in 13.89ms i.e. 175.7K/s or 382.9 MB/s
     2500 sha3_256 in 26.66ms i.e. 91.5K/s or 199.5 MB/s
     2500 sha3_512 in 47.96ms i.e. 50.9K/s or 110.9 MB/s
     2500 rc4 in 14.05ms i.e. 173.7K/s or 378.6 MB/s
     2500 mormot aes-128-cfb in 4.59ms i.e. 530.9K/s or 1.1 GB/s
     2500 mormot aes-128-ofb in 4.52ms i.e. 539.4K/s or 1.1 GB/s
     2500 mormot aes-128-c64 in 6.23ms i.e. 391.7K/s or 853.7 MB/s
     2500 mormot aes-128-ctr in 1.40ms i.e. 1.6M/s or 3.6 GB/s
     2500 mormot aes-128-cfc in 4.75ms i.e. 513.2K/s or 1 GB/s
     2500 mormot aes-128-ofc in 5.22ms i.e. 467.7K/s or 0.9 GB/s
     2500 mormot aes-128-ctc in 1.72ms i.e. 1.3M/s or 3 GB/s
     2500 mormot aes-128-gcm in 2.28ms i.e. 1M/s or 2.2 GB/s
     2500 mormot aes-256-cfb in 6.12ms i.e. 398.4K/s or 868.3 MB/s
     2500 mormot aes-256-ofb in 6.10ms i.e. 400K/s or 871.7 MB/s
     2500 mormot aes-256-c64 in 7.86ms i.e. 310.6K/s or 676.9 MB/s
     2500 mormot aes-256-ctr in 1.82ms i.e. 1.3M/s or 2.8 GB/s
     2500 mormot aes-256-cfc in 6.36ms i.e. 383.5K/s or 835.9 MB/s
     2500 mormot aes-256-ofc in 6.77ms i.e. 360.1K/s or 784.8 MB/s
     2500 mormot aes-256-ctc in 2.02ms i.e. 1.1M/s or 2.5 GB/s
     2500 mormot aes-256-gcm in 2.68ms i.e. 909.2K/s or 1.9 GB/s
     2500 openssl aes-128-cfb in 7.11ms i.e. 342.9K/s or 747.3 MB/s
     2500 openssl aes-128-ofb in 5.21ms i.e. 468K/s or 1 GB/s
     2500 openssl aes-128-ctr in 1.54ms i.e. 1.5M/s or 3.3 GB/s
     2500 openssl aes-128-gcm in 1.85ms i.e. 1.2M/s or 2.8 GB/s
     2500 openssl aes-256-cfb in 8.65ms i.e. 282.2K/s or 615 MB/s
     2500 openssl aes-256-ofb in 6.82ms i.e. 357.6K/s or 779.3 MB/s
     2500 openssl aes-256-ctr in 1.93ms i.e. 1.2M/s or 2.6 GB/s
     2500 openssl aes-256-gcm in 2.27ms i.e. 1M/s or 2.2 GB/s
     2500 shake128 in 23.47ms i.e. 104K/s or 226.6 MB/s
     2500 shake256 in 29.64ms i.e. 82.3K/s or 179.5 MB/s

mORMot's plain pascal code is used for MD5, SHA-1 and shake/SHA-3, so those are slower than our optimized Intel/AMD asm - but not by much, and these algorithms are either deprecated or not widely used, therefore not a bottleneck. The OpenSSL numbers are pretty good on this platform too. As a result, AES, GCM, SHA-2 and crc32/crc32c performance is comparable between AArch64 and Intel/AMD - with amazing SHA-2 numbers.

Then we compiled the latest SQLite3, Lizard and libdeflate as static libraries, so that you can link them into your executable with no external dependency. Performance is very good:

     TAlgoSynLZ 3.8 MB->2 MB: comp 287:151MB/s decomp 215:409MB/s
     TAlgoLizard 3.8 MB->1.9 MB: comp 18:9MB/s decomp 857:1667MB/s
     TAlgoLizardFast 3.8 MB->2.3 MB: comp 193:116MB/s decomp 1282:2135MB/s
     TAlgoLizardHuffman 3.8 MB->1.8 MB: comp 84:40MB/s decomp 394:827MB/s
     TAlgoDeflate 3.8 MB->1.5 MB: comp 30:12MB/s decomp 78:196MB/s
     TAlgoDeflateFast 3.8 MB->1.6 MB: comp 48:20MB/s decomp 73:174MB/s

I was a bit surprised by how well the pure pascal version of the SynLZ algorithm runs on AArch64 once compiled with FPC 3.2. The Deflate compression also benefits slightly from our statically linked libdeflate compared to plain zlib. But the really good news is that Lizard is fast on AArch64: even though it is written in plain C with no manual SIMD/asm code, it performs very well on non-Intel/AMD platforms - more than 2GB/s for decompression is very high. I was told that Lizard may be a bit behind ZStandard on Intel/AMD, but its code is simpler and much more CPU-agnostic.

 2.4. Sqlite file memory map: 
  - Database direct access: 22,264 assertions passed  55.40ms
  - Virtual table direct access: 12 assertions passed  347us
  - TOrmTableJson: 144,083 assertions passed  60.25ms
  - TRestClientDB: 608,196 assertions passed  783.02ms
  - Regexp function: 6,015 assertions passed  11.07ms
  - TRecordVersion: 20,060 assertions passed  51.28ms
  Total failed: 0 / 800,630  - Sqlite file memory map PASSED  961.45ms

These SQLite3 numbers are similar to what I get on Intel/AMD. So we could really consider using this database as the storage back-end for mORMot MicroServices, with their stand-alone persistence layer.

Ampere and Beyond - Apple M1?

We also tried to support ARM/AArch64 CPUs as fully as possible in mORMot 2. We now detect the CPU type and the hardware platform it runs on, especially on Linux or Android - which is also an AArch64 platform. Here is what our regression tests report at the end:

Ubuntu 20.04.2 LTS - Linux 5.8.0-1037-oracle (cp utf8)
    2 x ARM Neoverse-N1 (aarch64)
    on QEMU KVM Virtual Machine virt-4.2
Using mORMot 2.0.1
    TSqlite3LibraryStatic 3.36.0 with internal MM
Generated with: Free Pascal 3.2 64 bit Linux compiler

Time elapsed for all tests: 44.38s
Performed 2021-08-17 13:44:09 by ubuntu on lxde

Total assertions failed for all test suits:  0 / 66,050,607

As you can see, the CPU was properly identified as ARM Neoverse-N1.

Thanks to the FPC (cross-)compiler, we could in good faith consider running mORMot code on an Apple M1/M1X/M2 CPU - if we had access to this hardware. Any feedback is welcome.

Server Process Performance

All regression tests pass green, with pretty consistent performance across all their various tasks. JSON processing, ORM, SOA or encryption: everything flies on the Ampere CPU. You can check the detailed regression tests console output.

Here are some numbers about UTF-8 or JSON process:

     StrLen() in 1.43ms, 13.3 GB/s
     IsValidUtf8(RawUtf8) in 11.75ms, 1.6 GB/s
     IsValidUtf8(PUtf8Char) in 13.08ms, 1.4 GB/s
     IsValidJson(RawUtf8) in 22.84ms, 858.2 MB/s
     IsValidJson(PUtf8Char) in 22.93ms, 854.7 MB/s
     JsonArrayCount(P) in 22.97ms, 853.1 MB/s
     JsonArrayCount(P,PMax) in 22.89ms, 856.4 MB/s
     JsonObjectPropCount() in 11.90ms, 0.9 GB/s
     jsonUnquotedPropNameCompact in 72.35ms, 240.6 MB/s
     jsonHumanReadable in 119.06ms, 209.4 MB/s
     TDocVariant in 245.99ms, 79.7 MB/s
     TDocVariant no guess in 260.57ms, 75.2 MB/s
     TDocVariant dvoInternNames in 247.56ms, 79.1 MB/s
     TOrmTableJson GetJsonValues in 34.88ms, 247.1 MB/s
     TOrmTableJson expanded in 42.70ms, 459 MB/s
     TOrmTableJson not expanded in 21.42ms, 402.4 MB/s
     DynArrayLoadJson in 87.96ms, 222.8 MB/s
     TOrmPeopleObjArray in 131.10ms, 149.5 MB/s
     fpjson in 115.09ms, 17 MB/s

It is nice to see that our pascal code, deeply tuned so that FPC generates the best possible x86_64 assembly, also performs very well on AArch64. There is no need to write dedicated code and pollute the source with plenty of $ifdef/$endif: x86_64 is already a somewhat RISC-like architecture, with a good number of registers and efficient 64-bit processing, so nothing had to be rewritten. Optimized pascal code with tuned pointer arithmetic is platform neutral. I like the SQLite3 author's quote that C is a "portable assembly" - we can likewise use tuned pascal code, as we try to do in the mORMot core units, to leverage modern CPU hardware without fighting against whatever language is currently in fashion.

Asm is Fun Again

So we are pretty excited to see where this platform goes in the future. mORMot has invested a lot of time, refactoring and asm tuning to leverage the Intel/AMD platform, focusing on server-side performance. But this AArch64 technology is really promising, and I can tell you that its RISC instruction set was very cleverly designed: it is rich and powerful, almost perfect in its balance between power and expressiveness, compared to the x86_64 platform, which has a lot of inconsistencies and looks outdated when you put both asm dialects side by side. After decades of playing with i386 and x86_64 asm, I had fun again with ARMv8 assembly - it tastes like "assembly as it should be" (tm). Linking some static C code is a good balance between leveraging the hardware when needed and keeping platform-independent pascal source. And FPC, as a compiler, is amazing for being open and well done on so many CPUs and platforms. Open Source rocks!

As usual, feedback is welcome on our forum.