Synopse Open Source - Tag - FPC - mORMot MVC / SOA / ORM and friends
<h3>Native X.509, RSA and HSM Support</h3>
<p><em>Arnaud Bouchez, 2023-12-09</em></p>
<p>Today, almost all computer security relies on asymmetric cryptography and X.509 certificates, as files or hardware modules.<br />
And the RSA algorithm is still used to sign the vast majority of those certificates. Even if there are better options (like ECC-256), RSA-2048 remains the de facto standard, and should still be allowed for a few years.</p>
<p><img src="https://blog.synopse.info?post/public/blog/mormotSecurity.jpg" alt="" /></p>
<p>So we added pure pascal RSA cryptography and X.509 certificates support in <em>mORMot</em>.<br />
Last but not least, we also added Hardware Security Modules support via the PKCS#11 standard.<br />
Until now, we were mostly relying on OpenSSL, but a native embedded solution would be smaller in code size, better for reducing dependencies, and easier to work with (especially for HSM). The main idea is to offer only safe algorithms and methods, so that you can write reliable software, even if you are not a cryptography expert. <img src="https://blog.synopse.info?pf=smile.svg" alt=":)" class="smiley" /></p> <h4>Rivest-Shamir-Adleman (RSA) Public-Key Cryptography</h4>
<p>The RSA public-key algorithm was designed back in 1977, and is still the most widely used. In order to fully implement it, we need to generate new key pairs (public and private keys), then sign and verify (or encrypt and decrypt) data with those keys. For instance, a private key is kept secret and used by an Authority to sign a certificate, while the corresponding public key is published and used to verify that certificate. RSA is based on large prime numbers, so we needed to develop a Big Integer library, which is not part of the Delphi or FPC RTL.</p>
<p><img src="https://blog.synopse.info?post/public/blog/RSAalgo.png" alt="" /></p>
<p>Here are some notes about our implementation in <a href="https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.rsa.pas">mormot.crypt.rsa.pas</a>:</p>
<ul>
<li>new pure pascal OOP design of BigInt computation optimized for RSA process;</li>
<li>dedicated x86_64/i386 asm for core computation routines (noticeable speedup);</li>
<li>use half-registers (HalfUInt) for efficient computation on all CPUs (arm and aarch64 numbers are good);</li>
<li>slower than OpenSSL, but likely to be the fastest FPC or Delphi native RSA library, thanks to our optimized asm: for instance, you can generate a new RSA-2048 keypair in less than a second;</li>
<li>internal garbage collection of BigInt instances, to minimize heap pressure during computation, and ensure all values are wiped once used during the process - a proven anti-forensic measure;</li>
<li>includes FIPS-level RSA keypair validation and generation, using a safe random source, with efficient prime number detection, and minimal code size;</li>
<li>features both RSASSA-PKCS1-v1_5 and RSASSA-PSS signature schemes;</li>
<li>started as a fork of fcl-hash, but ended as a full rewrite inspired by the Mbed TLS source, because the initial code was slow and incomplete;</li>
<li>references: we followed <a href="https://github.com/Mbed-TLS/mbedtls">the Mbed TLS</a> implementation (which is much easier to follow than OpenSSL), and the well known <a href="https://cacr.uwaterloo.ca/hac/about/chap4.pdf">Handbook of Applied Cryptography (HAC)</a> recommendations;</li>
<li>includes full coverage of unit tests to avoid any regression, validated against the OpenSSL library as audited reference;</li>
<li>this unit registers itself as the <code>Asym</code> 'RS256', 'RS384' and 'RS512' algorithms (unless overridden by the faster <code>mormot.crypt.openssl</code>), keeping 'RS256-int' and 'PS256-int' available to explicitly use our native unit;</li>
<li>as used by <code>mormot.crypt.x509</code> (see below) to handle RSA signatures of its X.509 Certificates.</li>
</ul>
<p>For instance, if you want to access a <code>TCryptAsym</code> digital signature instance with RSA-2048 and SHA-256 hashing, you can just use the <code>CryptAsym</code> global variable with the <code>caaRS256</code> algorithm as a factory.<br />
If you just need public/private key support, you can use the <code>CryptPublicKey</code> or <code>CryptPrivateKey</code> factories with the <code>ckaRsa</code> algorithm.</p>
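As a minimal sketch, signing and verifying a message with the native RSA engine could look like the following - the `GeneratePem`, `Sign` and `Verify` method names and signatures are our recollection of the `TCryptAsym` API, so check `mormot.crypt.secure.pas` for the exact declarations:

```pascal
uses
  mormot.core.base,
  mormot.crypt.secure,
  mormot.crypt.rsa;  // registers the native 'RS256'/'RS384'/'RS512' algorithms

var
  pub, priv: RawUtf8;     // PEM-encoded key pair
  sig: RawByteString;
begin
  with CryptAsym[caaRS256] do  // RSA-2048 with SHA-256 hashing, as a factory
  begin
    GeneratePem(pub, priv, '');           // generate a new RSA-2048 key pair
    if Sign('some message', priv, sig) then    // sign with the private key
      if Verify('some message', pub, sig) then // verify with the public key
        writeln('RS256 signature verified');
  end;
end.
```

The same code would transparently use the OpenSSL implementation if `mormot.crypt.openssl` is registered, since both plug into the same factory.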
<p>About RSA security:</p>
<ul>
<li>RSA-512 and RSA-1024 are considered unsafe and should not be used.</li>
<li>RSA-2048 confers 112 bits of security, and is the usual choice today when this algorithm is to be used.</li>
<li>RSA-3072 confers 128 bits of security, at the expense of being slower and 50% bigger - so switching to ECC-256 may be a better option for the same level of security.</li>
<li>RSA-4096 is not worth it compared to RSA-3072, and RSA-7680 is very big and slow while only giving 192 bits of security, so it should be avoided.</li>
</ul>
<p>Anyway, our library is able to support all those key sizes, up to RSA-7680 if you really need it.<br />
See <a href="https://stackoverflow.com/a/589850/458259">this SO answer</a> as a reference about RSA key sizes.</p>
<h4>X.509 Certificates</h4>
<p>As we wrote in the introduction, X.509 certificates are the basis of most computer security.<br />
The whole TLS/HTTPS stack makes use of it, and the whole Internet would collapse without it.</p>
<p>We developed our <a href="https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.x509.pas">mormot.crypt.x509.pas</a> unit from scratch, featuring:</p>
<ul>
<li>X.509 Certificates Fields Logic (e.g. X.501 Type Names);</li>
<li>X.509 Certificates and Certificate Signing Request (CSR);</li>
<li>X.509 Certificate Revocation List (CRL);</li>
<li>X.509 Private Key Infrastructure (PKI);</li>
<li>Registration of our X.509 Engine to the <code>TCryptCert</code>/<code>TCryptStore</code> Factories.</li>
</ul>
<p><img src="https://blog.synopse.info?post/public/blog/X509certificate.png" alt="" /></p>
<p>The raw binary encoding is using the (weird) ASN.1 syntax, which is now implemented as part of the <a href="https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.secure.pas">mormot.crypt.secure.pas</a> unit.<br />
We followed the RFC 5280 specifications, and mapped the latest X.509 Certificate / CSR / CRL extensions, with some low-level but very readable pascal code using classes, records and enumerations. It features perfect compatibility with our <code>ICryptCert</code> high-level interface wrappers, ready to be used in a very convenient way. We support all basic functions, but also advanced features like opening/sealing, or peer information as human-readable text.<br />
When using our unit, your end-user code does not need to get lost in the complex details and notions of the X.509 format (like OIDs, versions or extensions), but can use high-level pascal code, with no possibility of ending up with a weak or invalid configuration.</p>
<p>Of course, it can support not only our new RSA keys, but also ECC-256 as implemented by our native <a href="https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.ecc.pas">mormot.crypt.ecc.pas</a>, or any other algorithm, e.g. available from OpenSSL.</p>
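For instance, issuing a self-signed CA and a certificate signed by it via the high-level factories could be sketched as follows - the 'x509-rs256' factory name, the usage flags and the `Generate()` parameters are assumptions from our reading of the units, so check `mormot.crypt.x509.pas` for the exact API:

```pascal
uses
  mormot.crypt.secure,
  mormot.crypt.x509;   // registers our native X.509 engine to the factories

var
  ca, server: ICryptCert;
begin
  // a self-signed Certification Authority, using X.509 + RSA-SHA256
  ca := Cert('x509-rs256').Generate([cuCA, cuKeyCertSign], 'CN=My Test CA');
  // a server certificate, signed by this CA
  server := Cert('x509-rs256').Generate([cuTlsServer], 'CN=server.lan', ca);
  // output the certificate fields as human-readable text
  writeln(server.GetPeerInfo);
end.
```

Replacing 'x509-rs256' with e.g. an OpenSSL or ECC-256 factory name would keep the very same calling code.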
<h4>X.509 Private Key Infrastructure (PKI)</h4>
<p>Our unit features a full Private Key Infrastructure (PKI) implementation.<br />
In fact, X.509 certificates are only as strong as the PKI they are used in. You can have strong certificates, but a weak verification pattern. In end-user applications, it is typical to see all the security lost to a poor (e.g. naive) implementation of the key verification workflow.</p>
<p><img src="https://blog.synopse.info?post/public/blog/PKI.png" alt="" /></p>
<p>This is why our unit publishes a 'x509-pki' <code>ICryptStore</code> as a full featured PKI:</p>
<ul>
<li>using our <code>TX509</code> and <code>TX509Crl</code> classes for actual certificates process;</li>
<li>clean verification of the chain of trust, with customized depth and proper root Certificate Authority (CA) support, following the RFC 5280 section 6 requirements of a clean "Certification Path Validation";</li>
<li>maintaining a cache of <code>ICryptCert</code> instances, which brings a huge performance benefit in the context of a PKI (e.g. you don't need to parse the X.509 binary or verify the chain of trust each time).</li>
</ul>
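In practice, using the 'x509-pki' store could look like the following sketch - the `Store()` factory call, the `Add()`/`IsValid()` method names and the `cvValidSigned` result are hypothetical simplifications of the `ICryptStore` contract, to be checked against `mormot.crypt.secure.pas`:

```pascal
uses
  mormot.crypt.secure,
  mormot.crypt.x509;

var
  pki: ICryptStore;
  ca, cert: ICryptCert;
begin
  pki := Store('x509-pki');  // hypothetical factory call for our native PKI
  // register a trusted root Certificate Authority
  ca := Cert('x509-rs256').Generate([cuCA, cuKeyCertSign], 'CN=Root CA');
  pki.Add(ca);
  // a certificate signed by this CA should then pass the chain validation
  cert := Cert('x509-rs256').Generate([cuTlsClient], 'CN=client', ca);
  if pki.IsValid(cert) = cvValidSigned then
    writeln('chain of trust verified');
end.
```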
<p>We tried to keep performance and usability at the highest possible standard, to let you focus on your business logic, while the hard cryptographic work is done in the <em>mORMot</em> library code.</p>
<h4>Hardware Security Modules (HSM) via PKCS#11</h4>
<p>The PKCS#11 standard defines a software API to access Hardware Security Modules, via a set of well-defined calls.<br />
We just published the <a href="https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.pkcs11.pas">mormot.crypt.pkcs11.pas</a> unit to interface those devices with the rest of the <em>mORMot</em> PKI.</p>
<p><img src="https://blog.synopse.info?post/public/blog/HSM.png" alt="" /></p>
<p>Once you have loaded the library of your actual hardware (typically a <code>.dll</code> or <code>.so</code>) using a <code>TCryptCertAlgoPkcs11</code> instance, you can see all stored certificates and keys as high-level regular <code>ICryptCert</code> instances, and sign or verify any kind of data (some binary, or some other certificates), using the private key safely stored in the hardware device.<br />
This is usually slower than a pure software verification, but it is much safer, because the private key is sealed within the hardware token, and never leaves it. So it can't be intercepted and stolen.</p>
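A hypothetical usage sketch follows - the constructor argument (library path) and the `Certificates` enumeration property are illustrative assumptions, not the actual `mormot.crypt.pkcs11.pas` API, so check the unit before use:

```pascal
uses
  mormot.core.base,
  mormot.crypt.secure,
  mormot.crypt.pkcs11;

var
  hsm: TCryptCertAlgoPkcs11;
  certs: array of ICryptCert;
  sig: RawByteString;
begin
  // load the vendor PKCS#11 library - path is illustrative only
  hsm := TCryptCertAlgoPkcs11.Create('/usr/lib/softhsm/libsofthsm2.so');
  try
    certs := hsm.Certificates; // hypothetical: enumerate token certificates
    if certs <> nil then
      // signing happens inside the token: the private key never leaves it
      sig := certs[0].Sign('some data to sign');
  finally
    hsm.Free;
  end;
end.
```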
<h4>You are Welcome!</h4>
<p>With those <em>mORMot</em> cryptography units, you now have everything at hand to use standard and proven public-key cryptography in your applications, on both Delphi and FPC, with no external dll deployment issues, and minimal code size increase.<br />
Many thanks to <a href="https://www.tranquil.it/en">my employer</a> for needing those nice features, and therefore letting me work on them.<br />
Open Source rocks! :)</p>
<h3>The LUTI and the mORMot</h3>
<p><em>Arnaud Bouchez, 2023-07-20</em></p>
<p>Since its earliest days, our <em>mORMot</em> framework has offered extensive regression tests. In fact, it is fully test-driven, and almost 78 million individual tests are performed to cover all its abilities:</p>
<p><img src="https://blog.synopse.info?post/public/blog/RegressTests.png" alt="RegressTests.png, Jul 2023" /></p>
<p>We just integrated those tests to the <a href="https://www.tranquil.it/">TranquilIT</a> build farm, and its great LUTI tool. So we have now continuous integration tests over several versions of Windows, Linux, and even Mac!<br />
LUTI is the <em>mORMot</em>'s best friend these days. <img src="https://blog.synopse.info?pf=smile.svg" alt=":)" class="smiley" /></p> <h4>Discover the LUTI</h4>
<p>LUTI is a <em>TranquilIT</em> internal tool that can automatically create, test, track and update WAPT packages for the WAPT Store.<br />
Those WAPT packages are the core software deployment archives used by the great WAPT software solution of the French <em>TranquilIT</em> company.</p>
<p>More information is available at
<a href="https://www.tranquil.it/en/luti-creation-testing-and-automatic-tracking-of-wapt-packages/">https://www.tranquil.it/en/luti-creation-testing-and-automatic-tracking-of-wapt-packages/</a></p>
<h4>Our Rodent's Best Friend</h4>
<p>Yes, of course, <a href="https://www.tranquil.it/en/mormot/">WAPT does use <em>mORMot</em></a>, so it did make perfect sense to integrate both projects.</p>
<p>In practice, we built and integrated the tests on several versions of Windows, several Linux distributions, and even Mac Intel and Mac M1 virtual machines.<br />
The source code is cloned from our <a href="https://github.com/synopse/mORMot2">official GitHub repository</a>, built using FPC, then distributed over the needed virtual machines.<br />
The WAPT agent is in fact deployed on all VMs, and a dedicated package/script is installed and run via this Agent to trigger the actual tests.</p>
<p>A typical WAPT package script extract looks like the following:</p>
<pre>
def install():
    if iswin64():
        arch = 'x86_64-win64'
    else:
        arch = 'i386-win32'
    try:
        run(r'testmormot\%s\mormot2tests.exe /noenter' % arch)
        run(r'testmormot\%s\mormot2tests.exe /noenter --test TTestCoreProcess.JSONBenchmark' % arch)
        run(r'testmormot\%s\mormot2tests.exe /noenter --dns sambaad.lan --test TNetworkProtocols.DNSAndLDAP' % arch)
        run(r'testmormot\%s\mormot2tests.exe /noenter --dns msad.lan --test TNetworkProtocols.DNSAndLDAP' % arch)
    except:
        ...
</pre>
<p>Nothing complex here, just some Python code executed on the target machine.</p>
<p>As you can see, the mormot2tests project now has optional <a href="https://blog.synopse.info/?post/2023/04/19/New-Command-Line-Parser-in-mORMot-2">command line options</a> to trigger dedicated tests. For instance, the JSON benchmark is run after a main default pass, because it uses some JSON content generated by the ORM, so a second pass is needed. And we can validate two kinds of <a href="https://blog.synopse.info/?post/2023/04/19/New-DNS-and-%28C%29LDAP-Clients-for-Delphi-and-FPC-in-mORMot-2">local DNS and LDAP servers</a>.</p>
<h4>All Green</h4>
<p>Now, let's see the result of a typical test run. Note that one such run is triggered every night with the latest <em>mORMot</em> sources available, for continuous delivery.</p>
<p>Several versions of Windows are validated:
<img src="https://blog.synopse.info?post/public/blog/LutiWin.png" alt="LutiWin.png, Jul 2023" /></p>
<p>Then several Linux distributions:
<img src="https://blog.synopse.info?post/public/blog/LutiLinux.png" alt="LutiLinux.png, Jul 2023" />
Note that the <em>bullseye_arm64</em> is in fact a Mac M1 virtual machine running Debian, and <em>buster_armhf</em> is a <a href="https://www.raspberrypi.com/">good tiny Raspberry Pi</a> running on our network.</p>
<p>And even Mac Intel and Mac M1 systems:
<img src="https://blog.synopse.info?post/public/blog/LutiMac.png" alt="LutiMac.png, Jul 2023" /></p>
<p>With a quick calculation, we can guess that 1.7 billion individual tests are run during each pass, across the 22 machines involved...</p>
<h4>Good Benefits</h4>
<p>This is a good showcase of the FPC and <em>mORMot</em> abilities to work cross-platform and cross-architecture. It also includes OpenSSL validation, and LDAP/DNS testing on local Samba or MSAD server.<br />
We discovered and fixed some corner-case issues during the integration of those tests. But that is what tests are for, isn't it? To show what is wrong.<br />
Some issues were in fact very nasty, especially on Mac, where <a href="https://github.com/synopse/mORMot2/commit/2352a485952ed19354eab8340ec69fd8ecaaecfd">Apple can't do things like everyone else</a>.</p>
<p>We were also able to compare performance between targets, and in fact, we were pleased to see that <a href="https://blog.synopse.info/?post/2021/08/17/mORMot-2-on-Ampere-AARM64-CPU">aarch64 platforms should work fast enough</a>, even if x86_64 is better supported by <em>mORMot</em>, especially thanks to a lot of manually-tuned assembly code using the latest Intel/AMD SIMD instructions, and our <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas">dedicated memory manager</a>. Even the Raspberry Pi can sustain all this JSON, cryptography, ORM, REST, HTTPS, network testing... at its own pace, of course. :)</p>
<p>Thanks a lot anyway to <a href="https://www.tranquil.it/">TranquilIT</a> for supporting our little <em>mORMot</em> and offering such a great product like <a href="https://www.tranquil.it/en/wapt/managing-your-it-assets/">WAPT</a>!</p>
<p>You can <a href="https://synopse.info/forum/viewtopic.php?id=6645">discuss this blog entry on our forum</a>, as usual.</p>
<h3>New Command Line Parser in mORMot 2</h3>
<p><em>Arnaud Bouchez, 2023-04-19</em></p>
<p><img src="https://blog.synopse.info?post/public/blog/CommandLine.png" alt="" /></p>
<p>For most projects, we want to be able to pass some custom values when starting them.<br />
The command line is then used to supply this additional information.</p>
<p>We have the <code>ParamStr</code> and <code>ParamCount</code> global functions, enough to retrieve the information. You may also use <code>FindCmdLineSwitch</code> for something easier to work with.<br />
The Lazarus RTL offers some additional methods like <code>hasOption</code>, <code>getOptionValue</code> or <code>checkOptions</code> in its <code>TCustomApplication</code> class. They are better, but not so easy to use, and not available on Delphi.</p>
<p>We just committed a new command line parser to our Open Source <em>mORMot 2</em> framework, which works on both Delphi and FPC, follows both Windows and POSIX/Linux conventions, and has much more features (like automated generation of the help message), in an innovative and easy workflow.</p> <p>The most simple code may be the following (extracted from the documentation):</p>
<pre>
var
  verbose: boolean;
  threads: integer;
...
with Executable.Command do
begin
  ExeDescription := 'An executable to test mORMot Execute.Command';
  verbose := Option(['v', 'verbose'], 'generate verbose output');
  Get(['t', 'threads'], threads, '#number of threads to run', 5);
  ConsoleWrite(FullDescription);
end;
</pre>
<p>This code will fill <code>verbose</code> and <code>threads</code> local variables from the command line (with some optional default value), and output on Linux:</p>
<pre>
An executable to test mORMot Execute.Command
Usage: mormot2tests [options] [params]
Options:
  -v, --verbose       generate verbose output
Params:
  -t, --threads &lt;number&gt; (default 5)
                      number of threads to run
</pre>
<p>So not only can you parse the command line and retrieve values, but you can also add some description text, and have an accurate help message generated when needed.</p>
<p>Note that the <code>#</code> character is used to mark the keyword to be used as the value name for a given parameter, to make the text more meaningful.<br />
For instance, <code>'#number of threads to run'</code> will generate a nice <code>-t, --threads &lt;number&gt;</code> text for the parameter description.</p>
<p>For a most typical use case, you may look at our <a href="https://github.com/synopse/mORMot2/blob/master/ex/techempower-bench/raw.pas#L665">TFB Benchmarking Sample source code</a>:</p>
<pre>
// parse command line parameters
with Executable.Command do
begin
  ExeDescription := 'TFB Server using mORMot 2';
  if Option(['p', 'pin'], 'pin each server to a CPU') then
    pinServers2Cores := true;
  if Option('nopin', 'disable the CPU pinning') then
    pinServers2Cores := false; // no option would keep the default boolean
  Get(['s', 'servers'], servers, '#count of servers (listener sockets)', servers);
  Get(['t', 'threads'], threads, 'per-server thread pool #size', threads);
  if Option(['?', 'help'], 'display this message') then
  begin
    ConsoleWrite(FullDescription);
    exit;
  end;
  if ConsoleWriteUnknown then
    exit;
end;
</pre>
<p>This would generate such a description:</p>
<pre>
d:\dev\lib2\ex\techempower-bench\exe>raw /?
TFB Server using mORMot 2
Usage: raw [options] [params]
Options:
  /p, /pin            pin each server to a CPU
  /nopin              disable the CPU pinning
  /?, /help           display this message
Params:
  /s, /servers &lt;count&gt; (default 1)
                      count of servers (listener sockets)
  /t, /threads &lt;size&gt; (default 8)
                      per-server thread pool size
</pre>
<p>It will accept commands like this on Windows:</p>
<pre>
raw /p /t=10
raw /t 10 /s 2 /pin
raw /servers=2 /threads=8 /nopin
raw /servers 2 /threads 8 /nopin
</pre>
<p>And, on Linux/POSIX, you could write as usual:</p>
<pre>
./raw -p -t=10
./raw -t 10 -s 2 --pin
./raw --servers=2 --threads=8 --nopin
./raw --servers 2 --threads 8 --nopin
</pre>
<p>Note that both <code>-t 10</code> and <code>-t=10</code> syntax are accepted.</p>
<p>As you may have guessed, <code>ConsoleWriteUnknown</code> is able to notify the user that a wrong switch has been used - and display the help message.</p>
<p>This function is available in the base <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.os.pas">mormot.core.os.pas</a> unit of the framework.</p>
<p>And feedback is <a href="https://synopse.info/forum/viewtopic.php?pid=39579#p39579">welcome in our forum</a>, as usual!</p>
<p>We hope you find it useful! :)</p>
<h3>Modern Pascal is Still in the Race</h3>
<p><em>Arnaud Bouchez, 2022-11-26</em></p>
<p>A recent poll <a href="https://forum.lazarus.freepascal.org/index.php/topic,61276.0.html">on the Lazarus/FPC forum</a> highlighted a fact: pascal coders are older than most coders. Usually, at our age, we should be managers, not developers. But we like coding in pascal. It is still fun after decades!<br />
But does it mean that you should not use pascal for any new project? Are the language/compilers/libraries outdated?<br />
In the company I currently work for, we have young coders, just out of school or still in school, who joined the team and write great code!</p>
<p><img src="https://blog.synopse.info?post/public/blog/performance.jpg" alt="" /></p>
<p>And a recent thread <a href="https://forum.lazarus.freepascal.org/index.php/topic,61035.0.html">in this very same forum</a> was about comparing languages to implement a REST server, in C#, Go, Scala, TypeScript, Elixir and Rust.<br />
Several pascal versions are about to be contributed, one in which <em>mORMot</em> shines.</p> <h4>The Challenge and the Algorithms</h4>
<p>The original challenge is available at <a href="https://github.com/losvedir/transit-lang-cmp">transit-lang-cmp</a> with the original source code, of all those fancy languages and libraries.</p>
<p>In practice, the goal of this test program is to load two big CSVs into memory (80MB + 2MB), then serve over HTTP some JSON generated per route identifier, joining both CSVs.<br />
The resulting JSON can range from 30KB up to 2MB. And all data is generated on the fly from the CSV in memory.</p>
<p>To be fair, a regular/business coder would have used a database for this. Not silly memory structures. And asked for money to set up a huge set of cloud machines with load balancing. <img src="https://blog.synopse.info?pf=smile.svg" alt=":-)" class="smiley" /></p>
<h4>Reference Implementations in Today Languages</h4>
<p>The "modern" / "school" approach, as implemented in the reference projects in Go/Rust/C#/..., uses two lists for the CSV data, then two maps/dictionaries between route ID and list indexes.</p>
<ul>
<li>The <a href="https://github.com/losvedir/transit-lang-cmp/blob/main/trogsit/app.go">Golang version</a> has a good expressiveness, and is nice to read, even if you don't know the language.</li>
<li>The <a href="https://github.com/losvedir/transit-lang-cmp/tree/main/Trannet">C# version</a> is also readable, but making a webserver is still confusing because it is not built from code, but from config files.</li>
<li><a href="https://github.com/losvedir/transit-lang-cmp/tree/main/trexit">Elixir</a> is a bit over-complicated to my taste.</li>
<li><a href="https://github.com/losvedir/transit-lang-cmp/tree/main/trala">Scala</a> and <a href="https://github.com/losvedir/transit-lang-cmp/tree/main/trypsit">TypeScript/Deno</a> versions, are fine to read, but really slow. You may better use a database instead.</li>
<li>Just for fun, check <a href="https://github.com/losvedir/transit-lang-cmp/blob/main/trustit/src/main.rs">the Rust version</a> - do you think Rust is good for big maintainable projects with junior developers?</li>
</ul>
<p>There was a first attempt to write a FPC version of it, by Leledumbo.<br />
His <a href="https://github.com/leledumbo/transit-lang-cmp/blob/main/trascal/app.pas">source code repository</a> is a nice pascal conversion of the above code. But performance was disappointing, especially because the standard JSON library cannot work directly with high-level structures like collections or arrays.</p>
<p>So is Pascal out of the race?<br />
Let's call the <em>mORMot</em> to the rescue!</p>
<h4>Following the mORMot Way</h4>
<p>For the <em>mORMot</em> version in FPC, I used another approach, with two distinct algorithms:</p>
<ul>
<li>I ensured the lists were sorted in memory, then made an O(log(n)) binary lookup in them;</li>
<li>All stored strings were "interned", i.e. the same text was sharing a single string instance, and FPC reference counting did its magic.</li>
</ul>
<p>There are no low-level tricks like generating the JSON by hand or using complex data structures - data structures are still high-level, with readable field names and such. The logic and the intent are clearly readable.<br />
We just leveraged the pascal language, and <em>mORMot</em> features. For instance, string interning is part of the framework, if needed.</p>
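To illustrate the first point, here is a small self-contained sketch (plain pascal, not the actual mORMot code, which uses <code>TDynArray</code> and interned <code>RawUtf8</code> strings) of an O(log(n)) binary lookup over a route list kept sorted in memory:

```pascal
program BinaryLookupDemo;
{$mode objfpc}

// binary search over a sorted array of route identifiers:
// each probe halves the remaining range, hence O(log(n)) lookups
function FindRoute(const sorted: array of string; const id: string): integer;
var
  l, r, m: integer;
begin
  l := 0;
  r := High(sorted);
  while l <= r do
  begin
    m := (l + r) shr 1;
    if sorted[m] < id then
      l := m + 1
    else if sorted[m] > id then
      r := m - 1
    else
      exit(m); // found
  end;
  result := -1; // not found
end;

var
  routes: array[0..3] of string = ('AB', 'CD', 'EF', 'GH');
begin
  writeln(FindRoute(routes, 'EF')); // 2
  writeln(FindRoute(routes, 'ZZ')); // -1
end.
```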
<p>Please <a href="https://github.com/synopse/mORMot2/tree/master/ex/lang-cmp/LangCmp.dpr">check the source code in our repository</a>.</p>
<p>As a result:</p>
<ul>
<li>Code is still readable, short and efficient (most of the process is done by <em>mORMot</em>, i.e. CSV, searching, JSON);</li>
<li>It uses much less memory - 10 times less memory than Go when holding the data, 5 times less memory than Go when serving the data;</li>
<li>Performance is as fast as Go, and its very tuned/optimized compiler and RTL.</li>
</ul>
<p><img src="https://blog.synopse.info?post/public/blog/mORMot2-small.png" alt="" /></p>
<h4>Algorithms Matters</h4>
<p>The main idea was to let the algorithms match the input data and the expected resultset.<br />
As programmers do when programming games. Not as coders do when pissing out business software. <img src="https://blog.synopse.info?pf=wink.svg" alt=";-)" class="smiley" /></p>
<ul>
<li>The source code is still pretty readable, thanks to using <em>mORMot</em> efficient <code>TDynArray</code> to map the dynamic array storage, and its CSV and JSON abilities.</li>
<li>I guess the source is still understandable for out-of-school programmers - much more readable than Rust, for instance.</li>
</ul>
<p>To be fair, I used typed pointers in <code>TScheduler.BuildTripResponse</code>, but it is not so hard to get their purpose, and FPC compiles this function into very efficient assembly. I could have used regular dynamic array access with indexes; it would have been slightly slower, but not really easier to follow, nor safer (if we compile with no range checking).</p>
<p>Worth noting that we did not make any specific tuning, like pre-allocating the results with constants, as other frameworks did. We just specified the data, then let <em>mORMot</em> play with it - that's all.<br />
The <em>mORMot</em> RTTI level matches what we expect for modern frameworks: not only some classes to store JSON, but convenient serialization/unserialization using structures like class or record.<br />
Using modern Pascal dynamic arrays and records to define the data structures lets the compiler manage the memory for us, with no need to write any <code>try..finally..Free</code> blocks, nor to use interfaces. "Manual memory management" with Pascal is not mandatory and can easily be bypassed. Only for the WebServer do we have a <code>Free</code>, which is expected to close it.</p>
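For instance, a record and a dynamic array are enough to describe the data, and mORMot can serialize them directly from RTTI - the field names below are illustrative (not the actual challenge layout), and `DynArraySaveJson` is our recollection of the `mormot.core.json` API, to be checked against the unit:

```pascal
uses
  mormot.core.base,
  mormot.core.json;

type
  // illustrative GTFS-like structure
  TStopTime = record
    TripId: RawUtf8;
    Arrival: RawUtf8;
    Departure: RawUtf8;
  end;
  TStopTimes = array of TStopTime;

var
  list: TStopTimes;
  json: RawUtf8;
begin
  SetLength(list, 1);
  list[0].TripId := 'AWE1';
  list[0].Arrival := '08:00:00';
  list[0].Departure := '08:05:00';
  // direct JSON serialization from RTTI - no class, no try..finally..Free:
  // the compiler releases the records and strings automatically
  json := DynArraySaveJson(list, TypeInfo(TStopTimes));
end.
```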
<h4>Give Me Some Numbers</h4>
<p>Here is a performance comparison with Go (FPC on the left, Go on the right):</p>
<pre>
parsed 1790905 stop times in 968.43ms | parsed 1790905 stop times in 3.245251432s
parsed 71091 trips in 39.54ms | parsed 71091 trips in 85.747852ms
running (0m33.4s), 00/50 VUs, 348 complete and 0 interrupted | running (0m32.3s), 00/50 VUs, 320 complete and 0 interrupted
default ✓ [======================================] 50 VUs 30 default ✓ [======================================] 50 VUs 30
data_received..................: 31 GB 933 MB/s | data_received..................: 31 GB 971 MB/s
data_sent......................: 3.2 MB 97 kB/s | data_sent......................: 3.0 MB 92 kB/s
http_req_blocked...............: avg=9µs min=1.09µs | http_req_blocked...............: avg=6.77µs min=1.09µs
http_req_connecting............: avg=2.95µs min=0s | http_req_connecting............: avg=1.73µs min=0s
http_req_duration..............: avg=47.59ms min=97.28µs | http_req_duration..............: avg=49.02ms min=123.81µ
{ expected_response:true }...: avg=47.59ms min=97.28µs | { expected_response:true }...: avg=49.02ms min=123.81µ
http_req_failed................: 0.00% ✓ 0 ✗ | http_req_failed................: 0.00% ✓ 0 ✗ 3
http_req_receiving.............: avg=9.66ms min=15.35µs | http_req_receiving.............: avg=5.92ms min=14.76µs
http_req_sending...............: avg=87.24µs min=5.2µs | http_req_sending...............: avg=70.71µs min=5.2µs
http_req_tls_handshaking.......: avg=0s min=0s | http_req_tls_handshaking.......: avg=0s min=0s
http_req_waiting...............: avg=37.83ms min=54.74µs | http_req_waiting...............: avg=43.02ms min=91.84µs
http_reqs......................: 34452 1032.205528/s | http_reqs......................: 31680 981.949476/s
iteration_duration.............: avg=4.72s min=3.54s | iteration_duration.............: avg=4.86s min=2.19s
iterations.....................: 348 10.426318/s | iterations.....................: 320 9.918682/s
vus............................: 30 min=30 ma | vus............................: 15 min=15 max
vus_max........................: 50 min=50 ma | vus_max........................: 50 min=50 max
</pre>
<p>So CSV loading was much faster, then the HTTP server performance was almost the same.</p>
<h4>No Alzheimer</h4>
<p>Here are some numbers about memory consumption:</p>
<blockquote><p>Upon finished loading the CSV, mORMot only eats 80MB, heck so little. Sounds a bit magical. But during load test, it fluctuates between 250-350MB, upon which it returns to 80MB at the end.
The Go version eats 925MB upon finished loading the CSV. During load test, it tops at 1.5GB, returning to 925MB afterwards.</p></blockquote>
<p>Nice to read. :)</p>
<h4>Pascal has a Modern and Capable Ecosystem</h4>
<p>This article was not only about Pascal, but about algorithms and libraries.<br />
The challenge was initially about comparing them. Not only as unrealistic micro-benchmarks, or "computer language benchmark games", but as data processing abilities on a real use case.</p>
<p><strong>And... Pascal is still in the race for sure!</strong><br />
Not only for "old" people like me - I just got 50 years old. ;-)</p>
<p>The more we spread such kind of information, the less people will make jokes about pascal programmers.<br />
Delphi and FPC are as old as Java, so it is time to get the big picture, not follow marketing trends.</p>
<h3>New Client for MongoDB 5.1/6 Support</h3>
<p><em>Arnaud Bouchez, 2022-08-12</em></p>
<p>Starting with its version 5.1, <em>MongoDB</em> disabled the legacy protocol used for communication since its beginning.<br />
As a consequence, our <em>mORMot</em> client was no longer able to communicate with the latest versions of <em>MongoDB</em> instances.</p>
<p><img src="https://blog.synopse.info?post/public/mongodb.png" alt="" /></p>
<p>Last week, we made a deep rewrite of <a href="https://github.com/synopse/mORMot2/blob/master/src/db/mormot.db.nosql.mongodb.pas">mormot.db.nosql.mongodb.pas</a>, which changed the default protocol to use the new layout on the wire. Now messages use regular <a href="https://www.mongodb.com/docs/current/reference/command/">MongoDB Database Commands</a>, with automated compression if needed.</p>
<p>No change is needed in your end-user <em>MongoDB</em> or ORM/ODM code. The upgrade is as simple as updating your <em>mORMot 2</em> source, then recompiling.</p> <h4>The Mongo Wire Protocol</h4>
<p>Since its beginning, <em>MongoDB</em> used a simple protocol over TCP, via several binary opcodes and message types, for CRUD operations.</p>
<p>A new alternative protocol was <a href="https://emptysqua.re/blog/driver-features-for-mongodb-3-6/">introduced in version 3.6</a>, and the former protocol was marked as deprecated.<br />
Two new opcodes, <code>OP_MSG</code> and <code>OP_COMPRESSED</code>, were introduced to replace all other frames. They just encapsulate, with or without compression, some abstract BSON content.<br />
The official documentation details those changes <a href="https://www.mongodb.com/docs/manual/reference/mongodb-wire-protocol/">on this web page</a>.</p>
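<p>To give an idea of how simple the new layout is, here is a rough hand-rolled sketch (in Python, for brevity - this is not mORMot code) of an <code>OP_MSG</code> frame: the standard 16-byte message header with opcode 2013, a <code>flagBits</code> field, then a single "kind 0" section holding the BSON command document:</p>

```python
import struct

OP_MSG = 2013  # opcode introduced in MongoDB 3.6

def bson_int32_doc(key, value):
    # Hand-encode a one-element BSON document {key: int32 value}:
    # type byte 0x10 (int32), cstring key, little-endian value, terminator.
    element = b"\x10" + key.encode() + b"\x00" + struct.pack("<i", value)
    body = element + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

def op_msg(request_id, doc):
    # Standard 16-byte header (messageLength, requestID, responseTo, opCode),
    # then a uint32 flagBits, then one "kind 0" section: the command document.
    flag_bits = struct.pack("<I", 0)    # no checksum, no moreToCome
    section = b"\x00" + doc             # kind 0 = single body document
    payload = flag_bits + section
    return struct.pack("<iiii", 16 + len(payload), request_id, 0, OP_MSG) + payload

frame = op_msg(1, bson_int32_doc("ping", 1))
length, _, _, opcode = struct.unpack_from("<iiii", frame)
print(length, opcode)  # → 36 2013
```

<p>Everything after the header is plain BSON, which is exactly why the protocol can now evolve at document level instead of adding new binary opcodes.</p>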
<p>In short (pictures extracted from the blog post above), the protocol went from this:</p>
<p><img src="https://blog.synopse.info?post/public/ye-olde-wire-protocol.png" alt="" /></p>
<p>to this:</p>
<p><img src="https://blog.synopse.info?post/public/op-msg.png" alt="" /></p>
<p>The main benefit is that the commands and answers are just conventional BSON, so the protocol can evolve at the logical/BSON/JSON level, by adding or changing some members, with no need to deal with low-level binary structures.</p>
<p>With the version 5.1 of <em>MongoDB</em>, the previous protocol was not just deprecated, but disabled.<br />
So we had to update the <em>mORMot 2</em> client code! (the <em>mORMot 1</em> code has not been updated - which may be a good reason to upgrade)</p>
<h4>Deep Rewrite</h4>
<p>In fact, the official MongoDB documentation is somewhat vague, and the official drivers are a bit difficult to reverse-engineer, due to the verbose nature of C, Java or C#. The native Node.js driver was the easiest to dissect, and we used it as a reference.<br />
Luckily enough, there is a <a href="https://github.com/mongodb/specifications/blob/master/source/message/OP_MSG.rst">specification document available too</a>, which offers some additional valuable clarifications.</p>
<p>After some testing, we managed to replace the previous OP_QUERY and its siblings with the new OP_MSG frame, which is, as documented in the specification, "One opcode to rule them all". <img src="https://blog.synopse.info?pf=wink.svg" alt=";)" class="smiley" /></p>
<p>Once we had the commands working, we needed to rewrite all CRUD operations using commands, and not opcodes.<br />
Queries are now made with <code><a href="https://www.mongodb.com/docs/current/reference/command/find/">find</a></code> and <code><a href="https://www.mongodb.com/docs/current/reference/command/aggregate/">aggregate</a></code> commands. Their results are now located in a <code>"cursor": {"firstBatch": ...}</code> BSON array within the response. And the new <code><a href="https://www.mongodb.com/docs/current/reference/command/getMore/">getMore</a></code> command is used to retrieve the next values from a <code>"cursor": {"nextBatch": ...}</code> resultset.<br />
For writing, <code><a href="https://www.mongodb.com/docs/current/reference/command/insert">insert</a></code>, <code><a href="https://www.mongodb.com/docs/current/reference/command/update">update</a></code> and <code><a href="https://www.mongodb.com/docs/current/reference/command/delete">delete</a></code> commands are called, with their appropriate BSON content.</p>
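<p>The client-side cursor logic can be sketched like this (a toy Python illustration, not our pascal implementation; <code>run_command</code> is a hypothetical helper sending one command document and returning the decoded reply):</p>

```python
def drain_cursor(run_command, collection):
    # Consume "firstBatch" from the find reply, then keep issuing getMore
    # commands until the server answers with cursor id 0 (exhausted).
    reply = run_command({"find": collection, "batchSize": 2})
    cursor = reply["cursor"]
    docs = list(cursor["firstBatch"])
    while cursor["id"] != 0:
        reply = run_command({"getMore": cursor["id"], "collection": collection})
        cursor = reply["cursor"]
        docs.extend(cursor["nextBatch"])
    return docs

# Fake server answering one firstBatch, then one final nextBatch:
def fake_server(cmd):
    if "find" in cmd:
        return {"cursor": {"id": 42, "firstBatch": [{"a": 1}, {"a": 2}]}}
    return {"cursor": {"id": 0, "nextBatch": [{"a": 3}]}}

print(drain_cursor(fake_server, "mycoll"))  # → [{'a': 1}, {'a': 2}, {'a': 3}]
```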
<p>During the refactoring, we optimized the BSON processing, and also enhanced the whole pipeline, mainly the logging and the execution efficiency. The <em>mORMot</em> client side should not be a bottleneck - and it is not, even with this NoSQL database.</p>
<p>Don't expect any performance enhancement, or new features: it is just a low-level protocol change at TCP level.<br />
But if you used the "non acknowledged write mode" of the former protocol, which was unsafe but very fast, you will see lower performance, because the new protocol always acknowledges the commands it receives. So, in some very specific configurations, the new protocol may reduce throughput.</p>
<h4>Backward Compatibility</h4>
<p>All those changes were encapsulated in our revised <a href="https://github.com/synopse/mORMot2/blob/master/src/db/mormot.db.nosql.mongodb.pas">mormot.db.nosql.mongodb.pas</a> unit.</p>
<p>If you have a very old <em>MongoDB</em> instance, and don't want to upgrade, you could just compile your project with the <code>MONGO_OLDPROTOCOL</code> conditional, to use the deprecated opcodes.<br />
If the <em>MongoDB</em> team does not care much about backward compatibility (they could have kept the previous protocol for sure - they still maintain it for the handshake message if needed), we do care about not breaking too many things in <em>mORMot</em>: we kept the previous code, and tested/validated it too, for legacy systems.</p>
<h4>New Sample</h4>
<p>We translated the <em>MongoDB</em> benchmark sample and introduced it into the <em>mORMot 2</em> code base.</p>
<p>You can find it, and run it, from <a href="https://github.com/synopse/mORMot2/tree/master/ex/mongodb">our source code repository</a>.</p>
<p>This code is a good entry point for what is possible with this unit in our framework, for both direct access and ORM/ODM access.<br />
It will also give you an idea of the performance numbers you may achieve with your own project.</p>
<p>Running a <em>MongoDB</em> database in a container is as easy as executing the following command:</p>
<pre>
sudo docker run --name mongodb -d -p 27017:27017 mongo:latest
</pre>
<p>Then you will have a <em>MongoDB</em> server instance accessible on <code>localhost:27017</code>, so you can run the sample straight away.</p>
<h4>Delphi/FPC Open Source Rocks</h4>
<p>We hope you will find the change painless and transparent. We did not modify the high-level client methods, nor break the ORM/ODM: you can still write complex SELECT statements, and our ORM will translate them into <em>MongoDB</em> aggregate commands.</p>
<p>To my knowledge, there is <a href="https://github.com/stijnsanders/TMongoWire/commit/7f12a64f571e476704bdcb737e1fc087ef792f59">only one other Delphi/FPC client library</a> which has made the upgrade to the new protocol, as of today. Once we made our own changes, we notified other library authors, and Stijn made the needed changes very quickly. Congrats! Maybe our code could be used as a reference for other library maintainers, because the protocol sometimes needs some small tweaks.<br />
It is important that the library you use is actively maintained. And our little <em>mORMot</em> is still on the edge: thanks to FPC, it runs very well on Linux and BSD, which makes it perfect for professional services running in the long term! :)</p>
<p>Your feedback is welcome <a href="https://synopse.info/forum/viewtopic.php?id=6318">in the forum thread which initiated these modifications</a>, as usual!<br />
Don't hesitate to notify us of any missing or broken feature.<br />
Thanks Daniel for your report and support!</p>New Async HTTP/WebSocket Server on mORMot 2urn:md5:b5d5687573d19a81f6d6dadfbc68461d2022-05-21T13:35:00+01:002022-05-21T18:05:45+01:00Arnaud BouchezmORMot FrameworkDelphiFPCfpcx64mmHTTPhttp.sysHTTPSLinuxmORMotmORMot2multithreadperformanceRESTRestsecuritySOA<p>The HTTP server is, by design, a central part of any SOA/REST service.<br />
It is the main entry point of all incoming requests, so it had better be stable and efficient. And it should be able to scale in the future, if needed.</p>
<p><img src="https://blog.synopse.info?post/public/blog/server.jpg" alt="" /></p>
<p>There have always been several HTTP servers in <em>mORMot</em>. You can use the HTTP server class you need.<br />
In <em>mORMot</em> 2, we added two new server classes, one for publishing over HTTP, another able to upgrade to WebSockets. The main difference is that they are fully event-driven, so their thread pool is able to scale with thousands of concurrent connections, with a fixed number of threads. They are a response to the limitations of our previous socket server.</p> <h4>HTTP Is Not REST</h4>
<p>In <em>mORMot</em>, the HTTP server does not match the REST server. They are two distinct concepts, with separate units, classes and even... source code folders.<br />
The HTTP server can publish its own process, using a callback, or can leverage one or several REST servers. Or the REST server could involve no communication at all, e.g. run in the service process within the current thread, executing ORM or SOA requests without any HTTP or WebSockets involved. No difference in the user code: just some interface methods to call, and they will return their answer, whether it is over HTTP, locally or remotely, on the server side or the client side, in a thread pool or in-process... pascal code just runs its magic for you.</p>
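<p>The decoupling can be illustrated with a toy analogy (plain Python, nothing mORMot-specific): the "REST server" is just a routine taking an URI and a body, and the transport - in-process call or any byte pipe - is interchangeable around it:</p>

```python
import json

# A "REST server" reduced to its essence: takes an URI and a body,
# returns a status code and a body - no socket involved at this level.
def rest_server(uri, body):
    if uri == "/sum":
        args = json.loads(body)
        return 200, json.dumps({"result": args["a"] + args["b"]})
    return 404, ""

# Transport 1: direct in-process call - no HTTP, no WebSockets.
def call_inprocess(uri, body):
    return rest_server(uri, body)

# Transport 2: any byte pipe (HTTP, WebSockets...) would just serialize
# the very same (uri, body) pair; a bytes buffer stands in for the wire.
def call_over_wire(uri, body):
    wire = json.dumps([uri, body]).encode()
    received_uri, received_body = json.loads(wire.decode())
    return rest_server(received_uri, received_body)

payload = json.dumps({"a": 2, "b": 3})
print(call_inprocess("/sum", payload) == call_over_wire("/sum", payload))  # → True
```

<p>The caller cannot tell which transport was used: that is the whole point of keeping the REST layer unaware of HTTP.</p>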
<p>For instance, take a look at <a href="https://github.com/synopse/mORMot2/blob/master/ex/http-server-raw/httpServerRaw.dpr">httpServerRaw</a> as a low-level HTTP server sample, using a callback for every incoming request.<br />
Here you don't have any automatic routing: you just parse the input URI, then return the proper HTTP response, with its status code, headers and body.<br />
Those low-level HTTP servers are implemented in the <em>server</em> units of folder <a href="https://github.com/synopse/mORMot2/tree/master/src/net">src/net</a> - as we will detail below.</p>
<p>Then, the REST process is abstracted from HTTP. Even if both HTTP and REST historically share the <a href="https://en.wikipedia.org/wiki/Roy_Fielding">same father/author/initiator</a>, you could have a REST approach without HTTP. For instance, you could use WebSockets, or direct in-process call.<br />
So in <em>mORMot</em>, the REST process is implemented in the <em>server</em> units of folder <a href="https://github.com/synopse/mORMot2/tree/master/src/rest">src/rest</a>, abstracted from any communication protocol. It does not know nor assume anything about TCP, WebSockets or TLS: just URIs, headers and body texts, following the routing as defined.</p>
<p>Two folders, two uncoupled feature sets. Perhaps a bit confusing when you first discover it, but best for maintainability and code design.</p>
<h4>Several Servers To Rule Them All</h4>
<p><em>mORMot</em> 2 adopted all HTTP server classes from the <em>mORMot</em> 1 source code, then added some new "asynchronous" servers.<br />
They all inherit from a <code>THttpServerGeneric</code> parent class, so you can follow the Liskov Substitution Principle, and change the class at runtime or compile time, as needed, without altering your actual logic.</p>
<p>HTTP servers are implemented in several units:</p>
<ul>
<li><a href="https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.server.pas"><em>mORMot</em>.net.server.pas</a> offers the <code>THttpServerSocket</code>/<code>THttpServer</code> HTTP/1.1 server, the <code>THttpApiServer</code> HTTP/1.1 server over Windows http.sys module, and <code>THttpApiWebSocketServer</code> over Windows http.sys module;</li>
<li><a href="https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.ws.server.pas"><em>mORMot</em>.net.ws.server.pas</a> offers the <code>TWebSocketServerRest</code> server, which uses WebSockets as mean of transmission, but enable a REST-like blocking request/answer protocol on top of it, with optional bi-directional notifications, using a one-thread-per-connection server;</li>
<li><a href="https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.async.pas"><em>mORMot</em>.net.async.pas</a> offers the new <code>THttpAsyncServer</code> event-driven HTTP server;</li>
<li><a href="https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.ws.async.pas"><em>mORMot</em>.net.ws.async.pas</a> offers the new <code>TWebSocketAsyncServerRest</code> server, which uses WebSockets as mean of transmission, but enable a REST-like blocking request/answer protocol on top of it, with optional bi-directional notifications, using an event-driven server.</li>
</ul>
<p>On Windows, the <a href="https://docs.microsoft.com/en-us/iis/get-started/introduction-to-iis/introduction-to-iis-architecture#hypertext-transfer-protocol-stack-httpsys">http.sys</a> module gives you very good stability, and uses the same Windows-centric way of publishing servers as used by IIS and .NET. You could even share the same port between several services, if needed.</p>
<p>Our socket-based servers are cross-platform, and compile and run on both Windows and POSIX (Linux, BSD, MacOS). They use a thread pool for HTTP/1.0 short-living requests, and one thread per connection for HTTP/1.1. So they are meant to be used behind a reverse proxy like nginx, which can talk HTTP/1.0 to <em>mORMot</em>, but keep efficient HTTP/1.1 or HTTP/2.0 to communicate with the clients.</p>
<p>Both <a href="https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.async.pas">mormot.net.async.pas</a> and <a href="https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.ws.async.pas">mormot.net.ws.async.pas</a> are new to <em>mORMot</em> 2. They use an event-driven model, i.e. the opened connections are tracked using a fast API (like epoll on Linux), and the thread pool is used only when there is actually new data pending.</p>
<h4>Events Forever</h4>
<p>Asynchronous socket access and event loops are the key to the best server scalability. Compared to our regular <code>THttpServerSocket</code> class, which uses one thread per HTTP/1.1 or WebSockets connection, our asynchronous classes (e.g. <code>THttpAsyncServer</code>) can serve thousands of concurrent clients, with minimal CPU and RAM resource consumption.</p>
<p>Here is a typical event-driven socket access:</p>
<p><img src="https://blog.synopse.info?post/public/blog/epoll.png" alt="" /></p>
<p>In the <em>mORMot</em> 2 network core, i.e. in unit <a href="https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.sock.pas">mormot.net.sock.pas</a>, we define an abstract event-driven class:</p>
<pre>
/// implements efficient polling of multiple sockets
// - will maintain a pool of TPollSocketAbstract instances, to monitor
// incoming data or outgoing availability for a set of active connections
// - call Subscribe/Unsubscribe to setup the monitored sockets
// - call GetOne from a main thread, optionally GetOnePending from sub-threads
TPollSockets = class(TPollAbstract)
  ...
  /// initialize the sockets polling
  constructor Create(aPollClass: TPollSocketClass = nil);
  /// finalize the sockets polling, and release all used memory
  destructor Destroy; override;
  /// track modifications on one specified TSocket and tag
  function Subscribe(socket: TNetSocket; events: TPollSocketEvents;
    tag: TPollSocketTag): boolean; override;
  /// stop status modifications tracking on one specified TSocket and tag
  procedure Unsubscribe(socket: TNetSocket; tag: TPollSocketTag); virtual;
  /// retrieve the next pending notification, or let the poll wait for new
  function GetOne(timeoutMS: integer; const call: RawUtf8;
    out notif: TPollSocketResult): boolean; virtual;
  /// retrieve the next pending notification
  function GetOnePending(out notif: TPollSocketResult; const call: RawUtf8): boolean;
  /// let the poll check for pending events and append them to fPending results
  function PollForPendingEvents(timeoutMS: integer): integer; virtual;
  /// manually append one event to the pending notifications
  procedure AddOnePending(aTag: TPollSocketTag; aEvents: TPollSocketEvents;
    aNoSearch: boolean);
  /// notify any GetOne waiting method to stop its polling loop
  procedure Terminate; override;
  ...
</pre>
<p>Depending on the operating system, it will mimic the <a href="https://man7.org/linux/man-pages/man7/epoll.7.html">epoll API</a> with the underlying low-level system calls.</p>
<p>The events are very abstract, and are in fact just the basic R/W operations on each connection, associated with a "tag", which is likely to be a class pointer associated with a socket/connection:</p>
<pre>
TPollSocketEvent = (
  pseRead,
  pseWrite,
  pseError,
  pseClosed);

TPollSocketEvents = set of TPollSocketEvent;

TPollSocketResult = record
  tag: TPollSocketTag;
  events: TPollSocketEvents;
end;

TPollSocketResults = record
  Events: array of TPollSocketResult;
  Count: PtrInt;
end;
</pre>
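<p>The Subscribe/GetOne/Unsubscribe pattern above can be sketched with a few lines of Python, using the standard <code>selectors</code> module as a stand-in for our epoll wrapper (illustrative only; the "tag" here is a plain string, where <em>mORMot</em> would use a connection class pointer):</p>

```python
import selectors
import socket

# A miniature event-driven poll loop in the spirit of TPollSockets:
# sockets are subscribed together with a "tag", and a consumer picks
# (tag, data) notifications instead of blocking one thread per socket.
sel = selectors.DefaultSelector()      # epoll on Linux, kqueue on BSD...
a, b = socket.socketpair()

sel.register(b, selectors.EVENT_READ, data="connection #1")  # Subscribe
a.sendall(b"hello")                    # the peer writes: b becomes readable

notifications = []
for key, events in sel.select(timeout=1):                    # GetOne
    notifications.append((key.data, key.fileobj.recv(64)))   # consume event

sel.unregister(b)                                            # Unsubscribe
a.close(); b.close()
print(notifications)   # → [('connection #1', b'hello')]
```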
<p>Then whole new HTTP/1.0, HTTP/1.1 and WebSockets stacks have been written on top of those basic socket-driven events. Instead of blocking threads, they use internal state machines, which are much lighter than a thread, and even lighter than a coroutine/goroutine. Each connection is just a class instance, which maintains the state of each client/server communication, and accesses its own socket.<br />
The <em>mORMot</em> asynchronous TCP server has by default one thread to accept the connections, one thread to poll for pending events (calling the <code>GetOne</code> method), then a dedicated number of threads to consume the Read/Write/Close events (via the <code>GetOnePending</code> method). We used as many non-blocking structures as possible, minimized memory allocation by reusing the same buffers, e.g. for the headers or small responses, and picked the <a href="https://blog.synopse.info?post/2022/05/21/New-Async-HTTP/post/2022/01/22/Three-Locks-To-Rule-Them-All">best locks possible</a> in each case, so that this server scales nicely and smoothly. And it is simple to use, because handling a new protocol is as easy as inheriting and writing a new connection class.</p>
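<p>A per-connection state machine boils down to this kind of logic (a toy Python sketch, not our actual classes): bytes arrive in arbitrary chunks from the poller, and the instance keeps the partial input until a full request is available:</p>

```python
class HttpConnectionSketch:
    # Toy per-connection state machine: accumulate chunks until the
    # HTTP header terminator is seen, then parse the request line.
    def __init__(self):
        self.buffer = b""
        self.requests = []

    def on_read(self, chunk):
        self.buffer += chunk
        while b"\r\n\r\n" in self.buffer:
            raw, _, self.buffer = self.buffer.partition(b"\r\n\r\n")
            method, uri, _ = raw.split(b"\r\n")[0].split(b" ", 2)
            self.requests.append((method, uri))

conn = HttpConnectionSketch()
conn.on_read(b"GET /he")    # partial chunk: nothing parsed yet
conn.on_read(b"llo HTTP/1.1\r\nHost: x\r\n\r\n")
print(conn.requests)        # → [(b'GET', b'/hello')]
```

<p>No thread ever blocks waiting for the missing bytes: the instance simply keeps its state until the poller reports more data.</p>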
<p>Our asynchronous server classes now seem stable, and fast (reported to be twice as fast as nginx, and six times faster than nodejs!).<br />
But of course, as with any new complex code, there may be caveats. Some of them have already been identified and fixed - as <a href="https://synopse.info/forum/viewtopic.php?pid=36546#p36546">reported in our forum</a>. Therefore, feedback is welcome, and an nginx, haproxy or caddy reverse proxy frontend is always a good idea in production.</p>
<p>Writing those servers took more time than expected, and was sometimes painful, because debugging multi-threaded processes is not easy, especially on several operating systems. There are some subtle differences between the OSes, which could lead to unexpected blocking or degraded performance. But we are proud of the result, which compares to the best-in-class servers. Still in modern pascal code, and Open Source software.</p>
<p>Don't hesitate to take a look at the source, and try some samples.<br />
Feedback is <a href="https://synopse.info/forum/viewtopic.php?id=6253">welcome in our forum</a>, as usual.</p>mORMot 2 ORM Performanceurn:md5:332da916ca3f905cedf4c82d14fc9b872022-02-15T13:24:00+00:002022-02-15T13:24:00+00:00Arnaud BouchezmORMot Framework64bitAES-CTRasmblogDatabaseDelphiFPCfpcx64mmFreePascalJSONMicroservicesmORMotmORMot2ORMperformanceSQLSQLite3<p>The official release of <em>mORMot 2</em> is around the corner.
It may be the occasion to show some data persistence performance numbers, in comparison with <em>mORMot 1</em>.</p>
<p><img src="https://blog.synopse.info?post/public/blog/marmotrunningsnow.jpg" alt="" /></p>
<p>For the version 2 of our framework, the ORM feature has been enhanced and tuned in several aspects: REST routing optimization, ORM/JSON serialization, and in-memory and SQL engines tuning.
The numbers speak for themselves. You could compare with any other solution, compile and run the tests by yourself for both frameworks, and see how it goes on your own computer or server.<br />
In a nutshell, we <em>almost reach 1 million inserts per second on SQLite3</em>, and are above one million inserts with our in-memory engine. Reading speed is 1.2 million and 1.7 million rows per second respectively. From the object to the storage, and back. And forcing AES-CTR encryption on disk changes almost nothing. Now we are talking. <img src="https://blog.synopse.info?pf=wink.svg" alt=";)" class="smiley" /></p> <h3>Platform Used</h3>
<p>Those numbers were taken from the "external database" sample, which is available on both versions of the framework.<br />
This is the very same benchmark as used in previous posts on this blog and in our documentation, so you can compare the numbers.</p>
<p>But we ran the tests on a new computer, featuring an Intel(R) Core(TM) i5-7300U CPU, in a good old ThinkPad T470 notebook with a SATA SSD.<br />
So on more modern hardware, like a high-end AMD or Xeon server, the million-inserts mark is easily passed. And even on a slow VM, you will still get pretty good numbers.</p>
<p>We ran the tests on Linux x86_64 (Debian 11), compiled with FPC 3.2 stable - which is our target platform for performance.<br />
In fact, performance matters mainly on the server side - clients are usually fast enough to run whatever process they need, and the ORM/database/persistence process is likely to be located on the server side. So in <em>mORMot 2</em>, we focused on an x86_64 Linux server for performance, which is the cheapest and safest solution around.</p>
<p>Both frameworks used our <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas">fpcx64mm memory manager for FPC</a>, written in x86_64 assembly. It has a noticeable performance benefit, especially in multi-threaded processes (not shown here, but visible during the main regression tests).</p>
<h3>Insertion Speed</h3>
<p>Here are the <em>mORMot 2</em> insertion numbers:</p>
<p>Running tests using Synopse mORMot framework 2.0.1, compiled with Free Pascal 3.2 64 bit, against SQLite 3.37.2, on Debian GNU/Linux 11 (bullseye) - Linux 5.10.0-10-amd64, at 2022-02-14 21:09:37.</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>Direct</strong></td><td><strong>Batch</strong></td><td><strong>Trans</strong></td><td><strong>Batch Trans</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>98</td><td>5908</td><td>74089</td><td>242072</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>13534</td><td>474428</td><td>151315</td><td>919624</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>42961</td><td>691037</td><td>153374</td><td>929281</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>26882</td><td>533788</td><td>152795</td><td>874814</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>114664</td><td>969743</td><td>152190</td><td>972478</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>411895</td><td>1086956</td><td>428724</td><td>1301236</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>385445</td><td>1200480</td><td>412762</td><td>1219660</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>107</td><td>5957</td><td>83531</td><td>111043</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>16509</td><td>291528</td><td>151890</td><td>390502</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>58922</td><td>354924</td><td>150179</td><td>392649</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>114476</td><td>991080</td><td>154564</td><td>991375</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>19789</td><td>155763</td><td>19439</td><td>236406</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Batch+Trans|Trans|Batch|Direct&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1301236&chds=0,1301236,0,1301236,0,1301236,0,1301236,0,1301236&chd=t:98,5908,74089,242072|13534,474428,151315,919624|42961,691037,153374,929281|26882,533788,152795,874814|114664,969743,152190,972478|411895,1086956,428724,1301236|385445,1200480,412762,1219660|107,5957,83531,111043|16509,291528,151890,390502|58922,354924,150179,392649|114476,991080,154564,991375|19789,155763,19439,236406&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1301236&chds=0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236&chd=t:98,13534,42961,26882,114664,411895,385445,107,16509,58922,114476,19789|5908,474428,691037,533788,969743,1086956,1200480,5957,291528,354924,991080,155763|74089,151315,153374,152795,152190,428724,412762,83531,151890,150179,154564,19439|242072,919624,929281,874814,972478,1301236,1219660,111043,390502,392649,991375,236406&chdl=Direct|Batch|Trans|Batch+Trans" /></p>
<p>In comparison, here are the <em>mORMot 1</em> numbers - which were already ahead of most other solutions - on the same machine:</p>
<p>Running tests using Synopse mORMot framework 1.18.6365, compiled with Free Pascal 3.2 64 bit, against SQLite 3.37.2, on Debian GNU/Linux 11 (bullseye) - Linux 5.10.0-10-amd64, at 2022-02-15 10:27:28.</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>Direct</strong></td><td><strong>Batch</strong></td><td><strong>Trans</strong></td><td><strong>Batch Trans</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>97</td><td>5034</td><td>45603</td><td>90704</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>12504</td><td>212350</td><td>96605</td><td>272910</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>36189</td><td>244702</td><td>94754</td><td>273687</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>22933</td><td>219857</td><td>96513</td><td>267881</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>79953</td><td>275269</td><td>97776</td><td>269963</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>208125</td><td>473126</td><td>227821</td><td>505254</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>202683</td><td>467595</td><td>220031</td><td>472545</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>100</td><td>2809</td><td>49603</td><td>102247</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>15329</td><td>190701</td><td>109346</td><td>301768</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>48818</td><td>264760</td><td>109767</td><td>304710</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>93820</td><td>303766</td><td>111437</td><td>311779</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>17952</td><td>68360</td><td>14915</td><td>97096</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Batch+Trans|Trans|Batch|Direct&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,505254&chds=0,505254,0,505254,0,505254,0,505254,0,505254&chd=t:97,5034,45603,90704|12504,212350,96605,272910|36189,244702,94754,273687|22933,219857,96513,267881|79953,275269,97776,269963|208125,473126,227821,505254|202683,467595,220031,472545|100,2809,49603,102247|15329,190701,109346,301768|48818,264760,109767,304710|93820,303766,111437,311779|17952,68360,14915,97096&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,505254&chds=0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254&chd=t:97,12504,36189,22933,79953,208125,202683,100,15329,48818,93820,17952|5034,212350,244702,219857,275269,473126,467595,2809,190701,264760,303766,68360|45603,96605,94754,96513,97776,227821,220031,49603,109346,109767,111437,14915|90704,272910,273687,267881,269963,505254,472545,102247,301768,304710,311779,97096&chdl=Direct|Batch|Trans|Batch+Trans" /></p>
<p>As you can see, the performance benefits are noticeable. For a <em>MicroService</em>, an embedded SQLite3 storage may give pretty amazing scalability to your SOA processing.</p>
<h3>Reading Speed</h3>
<p>Here are the <em>mORMot 2</em> reading numbers:</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>By one</strong></td><td><strong>All Virtual</strong></td><td><strong>All Direct</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>115775</td><td>1151012</td><td>1121956</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>123198</td><td>1157407</td><td>1162925</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>270504</td><td>1162790</td><td>1174122</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>269978</td><td>1160227</td><td>1171920</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>273950</td><td>1154201</td><td>1150350</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>467551</td><td>1803751</td><td>1743071</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>464209</td><td>771188</td><td>777786</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>184836</td><td>522247</td><td>1142204</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>180900</td><td>519237</td><td>1155134</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>186046</td><td>512583</td><td>1153003</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>274559</td><td>1168770</td><td>1182872</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>22246</td><td>444820</td><td>873133</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|All+Direct|All+Virtual|By+one&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1803751&chds=0,1803751,0,1803751,0,1803751&chd=t:115775,1151012,1121956|123198,1157407,1162925|270504,1162790,1174122|269978,1160227,1171920|273950,1154201,1150350|467551,1803751,1743071|464209,771188,777786|184836,522247,1142204|180900,519237,1155134|186046,512583,1153003|274559,1168770,1182872|22246,444820,873133&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1803751&chds=0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751&chd=t:115775,123198,270504,269978,273950,467551,464209,184836,180900,186046,274559,22246|1151012,1157407,1162790,1160227,1154201,1803751,771188,522247,519237,512583,1168770,444820|1121956,1162925,1174122,1171920,1150350,1743071,777786,1142204,1155134,1153003,1182872,873133&chdl=By+one|All+Virtual|All+Direct" /></p>
<p>In comparison, here are the <em>mORMot 1</em> performance numbers - which were already ahead of most other solutions - on the same machine:</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>By one</strong></td><td><strong>All Virtual</strong></td><td><strong>All Direct</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>72439</td><td>847888</td><td>848176</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>73248</td><td>837100</td><td>858811</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>111037</td><td>845737</td><td>848032</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>112973</td><td>863557</td><td>869716</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>111766</td><td>864154</td><td>879043</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>229074</td><td>1395868</td><td>1401738</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>228081</td><td>625782</td><td>626095</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>107587</td><td>412745</td><td>840618</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>132890</td><td>393948</td><td>805023</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>133347</td><td>411082</td><td>821422</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>135197</td><td>411658</td><td>820075</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>19991</td><td>367161</td><td>643832</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|All+Direct|All+Virtual|By+one&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1401738&chds=0,1401738,0,1401738,0,1401738&chd=t:72439,847888,848176|73248,837100,858811|111037,845737,848032|112973,863557,869716|111766,864154,879043|229074,1395868,1401738|228081,625782,626095|107587,412745,840618|132890,393948,805023|133347,411082,821422|135197,411658,820075|19991,367161,643832&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1401738&chds=0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738&chd=t:72439,73248,111037,112973,111766,229074,228081,107587,132890,133347,135197,19991|847888,837100,845737,863557,864154,1395868,625782,412745,393948,411082,411658,367161|848176,858811,848032,869716,879043,1401738,626095,840618,805023,821422,820075,643832&chdl=By+one|All+Virtual|All+Direct" /></p>
<h3>Feedback Welcome</h3>
<p>We encourage you to download the full source of both frameworks, from:</p>
<ul>
<li><a href="https://github.com/synopse/mORMot">https://github.com/synopse/mORMot</a></li>
<li><a href="https://github.com/synopse/mORMot2">https://github.com/synopse/mORMot2</a></li>
</ul>
<p>Then you can compile the "15 - External DB performance" sample on <em>mORMot 1</em>, and the "extdb-bench" example on <em>mORMot 2</em>.</p>
<p>Feedback is <a href="https://synopse.info/forum/viewtopic.php?id=6143">welcome in our forum,</a> as usual!</p>Three Locks To Rule Them Allurn:md5:97541d46fe6c5cf0e003c129797319df2022-01-22T12:56:00+00:002022-01-22T15:55:41+00:00Arnaud BouchezmORMot FrameworkCriticalSectionCrossPlatformDelphiFPCFreePascallockmORMotmultithreadmutexperformance<p>To ensure thread-safety, especially on server side, we usually protect code with critical sections, or locks. In recent Delphi revisions, we have the <code>TMonitor</code> feature, but I would rather trust the OS for locks, which are implemented using Windows Critical Sections, or POSIX futex/mutex.</p>
<p><img src="https://blog.synopse.info?post/public/blog/flyinglock.png" alt="" /></p>
<p>But all locks are not born equal. Most of the time, the full overhead of the Windows Critical Section API or the <code>pthread</code> library is not needed.<br />
So, in <em>mORMot 2</em>, we introduced several native locks in addition to those OS locks, with multi-read/single-write abilities, or re-entrancy.</p> <h4>Thread Safety - The Hard Way</h4>
<p>For a regular RAD/client application, a single thread is usually enough. Using messages and/or a <code>TTimer</code> allows some simple cooperative multi-tasking in the application, good enough for most uses.</p>
<p>But on the server side, scalability requires the business code to be thread-safe. Thread safety is hard - harder than parallel computing, in my experience.</p>
<p>Note that multi-thread programming is not easy, and sometimes very difficult to debug, because the problems are hard to reproduce - it is easy to end up with a <a href="https://en.wikipedia.org/wiki/Heisenbug">Heisenbug</a>.<br />
So first read some general material about thread safety, and about how modern CPUs reorder memory accesses and instructions. I just found <a href="https://preshing.com/20120612/an-introduction-to-lock-free-programming/">this series of blog articles</a>, which details some caveats that may appear in edge cases... and which may bite you, as they did me!</p>
<h4>Saved By The Lock</h4>
<p>To ensure thread safety, the most convenient tool we have is the lock, which prevents a section of code from being executed by several threads at the same time.</p>
<p>To be more accurate, we don't protect code, we protect resources. The code itself is thread-safe. But the data requires attention when several threads access it. If we only read the data, everything is fine. But once the data is changed by one thread, other threads are likely to break - imagine that you add an item to a list, the list storage is reallocated in memory, and you get a random GPF due to invalid pointers. Or two threads add items at the <em>same time</em> - then the counter or the storage may end up plain wrong. We need to lock the data access to prevent such issues.</p>
<p>Here is how the POSIX <code>libpthread</code> library offers this lock - similar to a Windows Critical Section:</p>
<p><img src="https://blog.synopse.info?post/public/blog/pthreadlock.png" alt="" /></p>
<p>All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesirable memory reordering across the boundaries. So you write your thread-unsafe code as the ham of the sandwich, and you ensure that only a single thread will execute it at once.</p>
<h4>Locks Are Not Expensive, Contention Is</h4>
<p>The main rule about locks is that they should be as small as possible.<br />
Why?</p>
<p>Acquiring an unlocked mutex, or releasing one, is almost free: it is usually a single atomic assembly instruction. Atomic instructions use the <code>lock</code> prefix on Intel/AMD, e.g. for the <code>cmpxchg</code> compare-and-exchange operation. On ARM, you usually need a small loop, or at least several instructions.<br />
In <code>mormot.core.base.pas</code> we provide some cross-platform and cross-compiler functions for atomic processing, written in tuned assembly or calling the RTL:</p>
<pre>
procedure LockedInc32(int32: PInteger);
procedure LockedDec32(int32: PInteger);
procedure LockedInc64(int64: PInt64);
function InterlockedIncrement(var I: integer): integer;
function InterlockedDecrement(var I: integer): integer;
function RefCntDecFree(var refcnt: TRefCnt): boolean;
function LockedExc(var Target: PtrUInt; NewValue, Comperand: PtrUInt): boolean;
procedure LockedAdd(var Target: PtrUInt; Increment: PtrUInt);
procedure LockedAdd32(var Target: cardinal; Increment: cardinal);
procedure LockedDec(var Target: PtrUInt; Decrement: PtrUInt);
</pre>
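<p>For reference, here is roughly what such helpers look like in portable C11 atomics - a hedged sketch of equivalents for <code>LockedInc32()</code> and <code>LockedExc()</code> (the names and bodies below are ours, not the mORMot source), which a compiler typically maps to single <code>lock</code>-prefixed instructions on x86_64:</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* illustrative C11 equivalent of LockedInc32() */
static void locked_inc32(atomic_int *p)
{
    atomic_fetch_add(p, 1); /* e.g. lock xadd on x86_64, ldadd on ARMv8.1 */
}

/* illustrative C11 equivalent of LockedExc(): compare-and-swap -
   store newvalue into *target only if it still holds comperand;
   returns true when the swap actually took place */
static bool locked_exc(atomic_uintptr_t *target,
                       uintptr_t newvalue, uintptr_t comperand)
{
    return atomic_compare_exchange_strong(target, &comperand, newvalue);
}
```

<p>The compare-and-swap primitive is the building block of the spin locks shown later in this article: the first caller flips the flag from 0 to 1 and wins; every other caller sees the swap fail.</p>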
<p>But if two (or more) threads fight to acquire a lock, only one will get it, and the other threads will have to wait. Waiting is usually done by first <em>spinning</em> (i.e. running an empty loop) while retrying to acquire the lock. Eventually, an OS kernel call may take place, to yield the CPU core and let the scheduler run some pending code from another thread.</p>
<p><img src="https://blog.synopse.info?post/public/blog/lockcontention.png" alt="" /></p>
<p>This lock contention - spinning or switching to another thread - is what really degrades the whole process performance. You waste time and energy just to access a shared resource.</p>
<p>Therefore, in practice, I would advise following some simple rules.</p>
<h5>Make it work, then make it fast</h5>
<p>You may first use a giant Critical Section for a whole method. Most of the time, it would be fine.</p>
<p>Don't guess: run actual benchmarks on a multi-core CPU (not a single-core VM!), trying to reproduce the worst case that may happen.<br />
Have detailed, thread-aware logs, to properly debug production code - Heisenbugs are likely to appear not on your development PC, but under real-world load.</p>
<p>Once you have identified a real bottleneck, try to split the logic code into small pieces:</p>
<ul>
<li>Ensure you have a multi-thread regression testing code for this method, to validate your modifications are actually still correct and ... faster;</li>
<li>Some part of the code may be thread-safe by itself (e.g. the error checking or result logging): no need to protect it with the lock;</li>
<li>Isolate the processing code into some private/protected methods, depending on the resources shared, with proper locking.</li>
</ul>
<h5>The Less The Better</h5>
<p>Eventually, to achieve the best performance:</p>
<ul>
<li>Keep your locks as short as possible;</li>
<li>Prefer several small locks on small data structures to one giant lock;</li>
<li>Use a lock per list or queue, not per process or business logic method;</li>
<li>Make a private copy of the data (e.g. on a local stack variable) within the lock, then process it outside the lock;</li>
<li>Avoid calling other methods within a lock: focus on the shared data, and remember that the functions you call may not be thread-safe themselves;</li>
<li>Try to avoid memory allocations.</li>
</ul>
<h5>Pickup The Right Lock</h5>
<p>Generally speaking, the regular <code>TRTLCriticalSection</code> is fine, and should be preferred.<br />
Our <em>mormot.core.os.pas</em> unit leverages it in a cross-platform way, across FPC/Delphi compilers and operating systems. It tries to call the OS directly, with proper inlining if possible.</p>
<p>But if you follow "The Less The Better" rule above, your code may be something very small like this:</p>
<pre>
procedure TAsyncConnections.AddGC(aConnection: TPollAsyncConnection);
begin
  if Terminated then
    exit;
  (aConnection as TAsyncConnection).fLastOperation := fLastOperationMS; // in ms
  fGCSafe.Lock;
  ObjArrayAddCount(fGC, aConnection, fGCCount);
  fGCSafe.UnLock;
end;
</pre>
<p>Here you can see that the lock is very small, and that setting <code>fLastOperation</code> is done outside of the lock, since this operation is thread-safe by design: each connection is freed only once, whereas the <code>fGC/fGCCount</code> list may be accessed from several threads. Also note that <code>ObjArrayAddCount()</code> is a well-defined function which should not have its behavior changed, nor raise any exception, so it is safe to use... and we didn't even put any <code>try...finally fGCSafe.UnLock;</code> statement here, because a <code>try..finally</code> has a cost on some platforms (e.g. FPC on Linux generates several RTL calls even if no exception is raised).</p>
<p>Of course, we could use our <code>TSynLock</code> for <code>fGCSafe</code> - which encapsulates a <code>TRTLCriticalSection</code> in an object-oriented manner.<br />
But since we know the lock will be held very briefly here, there is no need for the whole overhead of a Critical Section or a mutex/futex, which always has a cost, at least in resources.</p>
<h4>Several Locks To Rule Them All</h4>
<p>In addition to the <code>TSynLock</code> wrapper, <em>mormot.core.os.pas</em> defines several kind of locks:</p>
<pre>
  // a lightweight exclusive non-reentrant lock, stored in a PtrUInt value
  // - calls SwitchToThread after some spinning, but doesn't use any R/W OS API
  // - warning: methods are non-reentrant, i.e. calling Lock twice in a row
  // would deadlock: use TRWLock or TSynLocker/TRTLCriticalSection for
  // reentrant methods
  // - light locks are expected to be kept for a very short time: use
  // TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several light locks, each protecting a few variables (e.g. a list), may
  // be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TLightLock = record
    procedure Lock;
    function TryLock: boolean;
    procedure UnLock;
  end;

  // a lightweight multiple Reads / exclusive Write non-upgradable lock
  // - calls SwitchToThread after some spinning, but doesn't use any R/W OS API
  // - warning: ReadLocks are reentrant and allow concurrent access, but calling
  // WriteLock within a ReadLock, or within another WriteLock, would deadlock
  // - consider TRWLock if you need an upgradable lock
  // - light locks are expected to be kept for a very short time: use
  // TSynLocker or TRTLCriticalSection if the lock may block too long
  // - several light locks, each protecting a few variables (e.g. a list), may
  // be more efficient than a more global TRTLCriticalSection/TRWLock
  // - only consume 4 bytes on CPU32, 8 bytes on CPU64
  TRWLightLock = record
    procedure ReadLock;
    function TryReadLock: boolean;
    procedure ReadUnLock;
    procedure WriteLock;
    function TryWriteLock: boolean;
    procedure WriteUnLock;
  end;

type
  TRWLockContext = (
    cReadOnly, cReadWrite, cWrite);

  // a lightweight multiple Reads / exclusive Write reentrant lock
  // - calls SwitchToThread after some spinning, but doesn't use any R/W OS API
  // - locks are expected to be kept for a very short time: use TSynLocker
  // or TRTLCriticalSection if the lock may block too long
  // - warning: all methods are reentrant, but WriteLock/ReadWriteLock would
  // deadlock if called after a ReadOnlyLock
  TRWLock = record
    procedure ReadOnlyLock;
    procedure ReadOnlyUnLock;
    procedure ReadWriteLock;
    procedure ReadWriteUnLock;
    procedure WriteLock;
    procedure WriteUnlock;
    procedure Lock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
    procedure UnLock(context: TRWLockContext {$ifndef PUREMORMOT2} = cWrite {$endif});
  end;
</pre>
<p><code>TLightLock</code> is the simplest lock.<br />
It acquires the lock, then spins or sleeps on contention. But be aware that it is not reentrant: if you call <code>Lock</code> twice in a row from the same thread, the second <code>Lock</code> will wait forever. So you must ensure that your code doesn't call any other method which may also call <code>Lock</code> during its process, otherwise your thread will "deadlock". Such race conditions are relatively easy to identify: they always block and deadlock, whatever the circumstances. To fix them, don't call other methods which run <code>Lock</code>: for instance, you may define some private/protected <code>LockedDoSomething</code> methods, which don't take the lock themselves but expect to be called within one.</p>
<p><code>TRWLightLock</code> and <code>TRWLock</code> are <em>multiple Reads / exclusive Write</em> locks.<br />
This is a feature missing from the regular Critical Section. It is very likely that your shared resource will be read often, and modified seldom. Since concurrent reads are thread-safe by design, there is no need to prevent other reading threads from reading the resource. Only writing/updating the data needs to be exclusive and protected from other threads. This is the purpose of <code>ReadLock</code> / <code>ReadOnlyLock</code> and <code>WriteLock</code>.<br />
<code>TRWLock</code> goes one step further, and allows a read lock to be upgraded into a write lock, using <code>ReadWriteLock</code> instead of <code>ReadOnlyLock</code>. <code>ReadWriteLock</code> may be followed by a <code>WriteLock</code>, whereas <code>ReadOnlyLock</code> should always be followed by <code>ReadOnlyUnLock</code>, and never by a <code>WriteLock</code>, which would deadlock.<br />
Last but not least, <code>ReadOnlyLock</code> / <code>ReadOnlyUnLock</code> are reentrant (you can nest them), because they are implemented using a counter. And <code>TRWLock.WriteLock</code> is reentrant too, because it keeps track of the locking thread ID, so it detects nested calls - as a <code>TRtlCriticalSection</code> does.</p>
<h4>Low Level Stuff</h4>
<p>Just for fun, take a look at the source code:</p>
<pre>
procedure TLightLock.LockSpin;
var
  spin: PtrUInt;
begin
  spin := SPIN_COUNT;
  repeat
    spin := DoSpin(spin);
  until LockedExc(Flags, 1, 0);
end;

procedure TLightLock.Lock;
begin
  // we tried a dedicated asm but it was slower: inlining is preferred
  if not LockedExc(Flags, 1, 0) then
    LockSpin;
end;

function TLightLock.TryLock: boolean;
begin
  result := LockedExc(Flags, 1, 0);
end;

procedure TLightLock.UnLock;
begin
  Flags := 0; // non reentrant locks need no additional thread safety
end;
</pre>
<p><code>TLightLock</code> is pretty straightforward, using a simple compare-and-swap (CAS) atomic function, <code>LockedExc()</code>, but <code>TRWLightLock</code> and <code>TRWLock</code> are slightly more complex.</p>
<p>In the <em>mORMot 2</em> code base, we tried to use the best lock possible: <code>TRtlCriticalSection</code> / <code>TSynLock</code> when the lock is likely to be contended for some time (more than a microsecond), and the other locks, with <em>multiple Reads / exclusive Write</em> methods if possible, to protect very small pieces of tuned code.<br />
Of course, thread safety is tested during the regression tests, with dozens of concurrent threads trying to break the locking logic. I can tell you that we found some nasty problems in the initial code of our <code>TAsyncServer</code>, but after days of debugging and logging, it seems stable now - but that is a matter for another article! :)</p>
<p><a href="https://synopse.info/forum/viewtopic.php?id=6119">Feedback is welcome in our forum</a>, as usual!</p>mORMot 2 on Ampere AARM64 CPUurn:md5:01baca710d9e6371f285e77a90accdcd2021-08-17T13:16:00+01:002021-08-17T16:46:59+01:00Arnaud BouchezmORMot Framework64bitaarch64AESAES-GCMAES-NiampereAndroidasmavxblogCcompressioncrccrc32cFPCFreePascalLazarusLinuxMicroservicesmORMotmORMot2multithreadoraclecloudperformanceRESTSOASQLite3<p>In recent weeks, we have enhanced mORMot support for one of the most powerful AARCH64 CPUs available: the <a href="https://amperecomputing.com/">Ampere Altra CPU</a>, as made available on the <a href="https://www.oracle.com/cloud/compute/arm/">Oracle Cloud Infrastructure</a>.</p>
<p><img src="https://blog.synopse.info/public/blog/AmpereCPU.jpg" alt="" /></p>
<p>Long story short, this is amazing hardware to run on the server side, with performance close to what Intel/AMD offer, but with <a href="https://www.oracle.com/cloud/compute/arm/why-arm-processors/">almost linear multi-core scalability</a>. The FPC compiler is able to generate good code for it, and our mORMot 2 library is able to use the hardware-accelerated opcodes for AES, SHA-2, and crc32/crc32c.</p> <h3>Always Free Ampere VM</h3>
<p>Back to the beginning. Tom, one mORMot user, reported on our forum that he successfully <a href="https://synopse.info/forum/viewtopic.php?id=5945">installed FPC and Lazarus on the Oracle Cloud platform</a>, and accessed it via SSH/XRDP:</p>
<blockquote><p>Just open account on Oracle Cloud and create new compute VM: 4 ARMv8.2 CPU 3GHz, 24GB Ram (yes 24GB).
This is always free VM (you can combine this 4 cores and 24GB Ram to 1 or many (4) VM).
Install Ubuntu 20.04 server, then install LXDE and XRdp for remote access.
Now I have nice speed workstation. Install fpcupdeluxe then fpc 3.2.2/laz 2.0.12, all OK. fpcup build is faster then my local pc build <img src="https://blog.synopse.info?pf=smile.svg" alt=":-)" class="smiley" />
This OCI VM can be great mormot application server for some projects. I don't have any connection to Oracle - just test their product.</p></blockquote>
<p>I did the same, and in fact this platform is really easy to work with, once you have paid a 1€ credit card fee to validate your account. Then you get an "Always Free VM", with 4 Ampere cores and 24GB of RAM. Amazing. Oracle really wants to break into the cloud market, and makes it wide open for developers, so that they consider its Cloud instead of Microsoft's or Amazon's.</p>
<h3>FPC and Lazarus on Linux/AArch64</h3>
<p>The Lazarus experience is very good on this platform, even remotely. The only issue is the debugger. Gdb was pretty unstable for me - almost as unstable as on Windows. But somewhat usable, until it crashes. <img src="https://blog.synopse.info?pf=sad.svg" alt=":(" class="smiley" /></p>
<p>Finally, and thanks to Alfred - our great friend behind <a href="https://github.com/LongDirtyAnimAlf/fpcupdeluxe">fpcupdeluxe</a> - we identified a problem with our asm stubs when calling mORMot interface-based services. It was in fact an FPC "feature" (it is documented as such in the compiler, so it is not a bug) in how arguments are passed as results in the AARCH64 calling ABI. Once identified, we added an explicit special case to circumvent the problem.</p>
<p>The FPC code quality seems good. At least at the level of the x86_64 Intel/AMD platform. Not as good as gcc for sure, but good enough for production code, and good speed. The only big limitation is that the inlined assembly is very limited: only a few AARCH64 opcodes are available - only what was mandatory for the basic FPC RTL needs.</p>
<h3>Tuning mORMot 2 for AArch64</h3>
<p>To enhance performance, we replaced the basic FPC RTL Move/FillChar functions with the libc memmove/memset. And performance is amazing:</p>
<pre>
FPC RTL
FillChar in 19.06ms, 20.3 GB/s
Move in 9.95ms, 1.5 GB/s
small Move in 15.85ms, 1.3 GB/s
big Move in 222.41ms, 1.7 GB/s
mORMot functions calling the Ubuntu Gnu libc:
FillCharFast in 8.84ms, 43.9 GB/s
MoveFast in 1.25ms, 12.4 GB/s
small MoveFast in 4.98ms, 4.3 GB/s
big MoveFast in 34.32ms, 11.3 GB/s
</pre>
<p>In comparison, here are the numbers on my Core i5 7200U CPU, of mORMot tuned x86_64 asm (faster than the FPC RTL), using SSE2 or AVX instructions:</p>
<pre>
FillCharFast [] in 21.43ms, 18.1 GB/s
MoveFast [] in 2.29ms, 6.8 GB/s
small MoveFast [] in 4.29ms, 5.1 GB/s
big MoveFast [] in 68.33ms, 5.7 GB/s
FillCharFast [cpuAVX] in 20.28ms, 19.1 GB/s
MoveFast [cpuAVX] in 2.26ms, 6.8 GB/s
small MoveFast [cpuAVX] in 4.25ms, 5.1 GB/s
big MoveFast [cpuAVX] in 69.93ms, 5.5 GB/s
</pre>
<p>So we can see that the Ampere CPU memory design is pretty efficient. It is up to twice as fast as a Core i5 7200U CPU.</p>
<p>We went further, and had some fun with one bottleneck of every server operation: encryption and hashes. So we wrote some C code to be able to use the efficient HW acceleration available for encryption and hashes. You can find the source code in <a href="https://github.com/synopse/mORMot2/tree/master/res/static/armv8">the /res/static/armv8 sub-folder of our repository</a>. Now we have tremendous performance for AES, GCM, SHA-2 and CRC32/CRC32C computation.</p>
<pre>
2500 crc32c in 259us i.e. 9.2M/s or 20 GB/s
2500 xxhash32 in 1.47ms i.e. 1.6M/s or 3.5 GB/s
2500 crc32 in 259us i.e. 9.2M/s or 20 GB/s
2500 adler32 in 469us i.e. 5M/s or 11 GB/s
2500 hash32 in 584us i.e. 4M/s or 8.9 GB/s
2500 md5 in 12.12ms i.e. 201.3K/s or 438.7 MB/s
2500 sha1 in 21.75ms i.e. 112.2K/s or 244.5 MB/s
2500 hmacsha1 in 23.81ms i.e. 102.5K/s or 223.4 MB/s
2500 sha256 in 3.41ms i.e. 714.7K/s or 1.5 GB/s
2500 hmacsha256 in 4.12ms i.e. 591.2K/s or 1.2 GB/s
2500 sha384 in 27.71ms i.e. 88K/s or 191.9 MB/s
2500 hmacsha384 in 32.69ms i.e. 74.6K/s or 162.7 MB/s
2500 sha512 in 27.73ms i.e. 88K/s or 191.8 MB/s
2500 hmacsha512 in 32.77ms i.e. 74.4K/s or 162.3 MB/s
2500 sha3_256 in 35.82ms i.e. 68.1K/s or 148.5 MB/s
2500 sha3_512 in 65.48ms i.e. 37.2K/s or 81.2 MB/s
2500 rc4 in 12.98ms i.e. 188K/s or 409.8 MB/s
2500 mormot aes-128-cfb in 8.84ms i.e. 276.1K/s or 601.7 MB/s
2500 mormot aes-128-ofb in 3.78ms i.e. 645K/s or 1.3 GB/s
2500 mormot aes-128-c64 in 4.39ms i.e. 555.6K/s or 1.1 GB/s
2500 mormot aes-128-ctr in 4.52ms i.e. 539.7K/s or 1.1 GB/s
2500 mormot aes-128-cfc in 9.16ms i.e. 266.3K/s or 580.4 MB/s
2500 mormot aes-128-ofc in 5.25ms i.e. 465K/s or 0.9 GB/s
2500 mormot aes-128-ctc in 5.74ms i.e. 425.1K/s or 0.9 GB/s
2500 mormot aes-128-gcm in 7.52ms i.e. 324.5K/s or 707.2 MB/s
2500 mormot aes-256-cfb in 9.52ms i.e. 256.2K/s or 558.4 MB/s
2500 mormot aes-256-ofb in 4.71ms i.e. 517.6K/s or 1.1 GB/s
2500 mormot aes-256-c64 in 5.30ms i.e. 460.5K/s or 0.9 GB/s
2500 mormot aes-256-ctr in 5.33ms i.e. 457.5K/s or 0.9 GB/s
2500 mormot aes-256-cfc in 10.04ms i.e. 243K/s or 529.5 MB/s
2500 mormot aes-256-ofc in 6.11ms i.e. 399K/s or 869.6 MB/s
2500 mormot aes-256-ctc in 6.77ms i.e. 360.4K/s or 785.5 MB/s
2500 mormot aes-256-gcm in 8.38ms i.e. 291.1K/s or 634.4 MB/s
2500 openssl aes-128-cfb in 4.94ms i.e. 493.4K/s or 1 GB/s
2500 openssl aes-128-ofb in 4.12ms i.e. 591.2K/s or 1.2 GB/s
2500 openssl aes-128-ctr in 1.94ms i.e. 1.2M/s or 2.6 GB/s
2500 openssl aes-128-gcm in 3.18ms i.e. 767K/s or 1.6 GB/s
2500 openssl aes-256-cfb in 5.83ms i.e. 418.5K/s or 912.1 MB/s
2500 openssl aes-256-ofb in 5.04ms i.e. 484.1K/s or 1 GB/s
2500 openssl aes-256-ctr in 2.42ms i.e. 0.9M/s or 2.1 GB/s
2500 openssl aes-256-gcm in 3.66ms i.e. 667K/s or 1.4 GB/s
2500 shake128 in 29.63ms i.e. 82.3K/s or 179.5 MB/s
2500 shake256 in 35.07ms i.e. 69.5K/s or 151.6 MB/s
</pre>
<p>Here are the numbers on my Core i5 7200U CPU, with optimized asm, and the last OpenSSL calls:</p>
<pre>
2500 crc32c in 224us i.e. 10.6M/s or 23.1 GB/s
2500 xxhash32 in 817us i.e. 2.9M/s or 6.3 GB/s
2500 crc32 in 341us i.e. 6.9M/s or 15.2 GB/s
2500 adler32 in 241us i.e. 9.8M/s or 21.5 GB/s
2500 hash32 in 441us i.e. 5.4M/s or 11.7 GB/s
2500 aesnihash in 218us i.e. 10.9M/s or 23.8 GB/s
2500 md5 in 8.29ms i.e. 294.1K/s or 641.1 MB/s
2500 sha1 in 13.72ms i.e. 177.8K/s or 387.5 MB/s
2500 hmacsha1 in 15.05ms i.e. 162.1K/s or 353.3 MB/s
2500 sha256 in 17.40ms i.e. 140.2K/s or 305.6 MB/s
2500 hmacsha256 in 18.71ms i.e. 130.4K/s or 284.2 MB/s
2500 sha384 in 11.59ms i.e. 210.5K/s or 458.9 MB/s
2500 hmacsha384 in 13.84ms i.e. 176.3K/s or 384.2 MB/s
2500 sha512 in 11.59ms i.e. 210.5K/s or 458.8 MB/s
2500 hmacsha512 in 13.89ms i.e. 175.7K/s or 382.9 MB/s
2500 sha3_256 in 26.66ms i.e. 91.5K/s or 199.5 MB/s
2500 sha3_512 in 47.96ms i.e. 50.9K/s or 110.9 MB/s
2500 rc4 in 14.05ms i.e. 173.7K/s or 378.6 MB/s
2500 mormot aes-128-cfb in 4.59ms i.e. 530.9K/s or 1.1 GB/s
2500 mormot aes-128-ofb in 4.52ms i.e. 539.4K/s or 1.1 GB/s
2500 mormot aes-128-c64 in 6.23ms i.e. 391.7K/s or 853.7 MB/s
2500 mormot aes-128-ctr in 1.40ms i.e. 1.6M/s or 3.6 GB/s
2500 mormot aes-128-cfc in 4.75ms i.e. 513.2K/s or 1 GB/s
2500 mormot aes-128-ofc in 5.22ms i.e. 467.7K/s or 0.9 GB/s
2500 mormot aes-128-ctc in 1.72ms i.e. 1.3M/s or 3 GB/s
2500 mormot aes-128-gcm in 2.28ms i.e. 1M/s or 2.2 GB/s
2500 mormot aes-256-cfb in 6.12ms i.e. 398.4K/s or 868.3 MB/s
2500 mormot aes-256-ofb in 6.10ms i.e. 400K/s or 871.7 MB/s
2500 mormot aes-256-c64 in 7.86ms i.e. 310.6K/s or 676.9 MB/s
2500 mormot aes-256-ctr in 1.82ms i.e. 1.3M/s or 2.8 GB/s
2500 mormot aes-256-cfc in 6.36ms i.e. 383.5K/s or 835.9 MB/s
2500 mormot aes-256-ofc in 6.77ms i.e. 360.1K/s or 784.8 MB/s
2500 mormot aes-256-ctc in 2.02ms i.e. 1.1M/s or 2.5 GB/s
2500 mormot aes-256-gcm in 2.68ms i.e. 909.2K/s or 1.9 GB/s
2500 openssl aes-128-cfb in 7.11ms i.e. 342.9K/s or 747.3 MB/s
2500 openssl aes-128-ofb in 5.21ms i.e. 468K/s or 1 GB/s
2500 openssl aes-128-ctr in 1.54ms i.e. 1.5M/s or 3.3 GB/s
2500 openssl aes-128-gcm in 1.85ms i.e. 1.2M/s or 2.8 GB/s
2500 openssl aes-256-cfb in 8.65ms i.e. 282.2K/s or 615 MB/s
2500 openssl aes-256-ofb in 6.82ms i.e. 357.6K/s or 779.3 MB/s
2500 openssl aes-256-ctr in 1.93ms i.e. 1.2M/s or 2.6 GB/s
2500 openssl aes-256-gcm in 2.27ms i.e. 1M/s or 2.2 GB/s
2500 shake128 in 23.47ms i.e. 104K/s or 226.6 MB/s
2500 shake256 in 29.64ms i.e. 82.3K/s or 179.5 MB/s
</pre>
<p>The mORMot plain pascal code is used for MD5, SHA-1, and SHAKE/SHA-3. So it is slower than our optimized asm for Intel/AMD, but not by much. And those algorithms are either deprecated or not widely used, so they are not a bottleneck. OpenSSL numbers are pretty good on this platform too. As a result, AES, GCM, SHA-2 and crc32/crc32c performance is comparable between AARCH64 and Intel/AMD, with amazing SHA-2 numbers.</p>
<p>Then, we compiled the latest SQLite3, Lizard and libdeflate as static libraries, so that you can link them into your executable with no external dependency. Performance is very good:</p>
<pre>
TAlgoSynLZ 3.8 MB->2 MB: comp 287:151MB/s decomp 215:409MB/s
TAlgoLizard 3.8 MB->1.9 MB: comp 18:9MB/s decomp 857:1667MB/s
TAlgoLizardFast 3.8 MB->2.3 MB: comp 193:116MB/s decomp 1282:2135MB/s
TAlgoLizardHuffman 3.8 MB->1.8 MB: comp 84:40MB/s decomp 394:827MB/s
TAlgoDeflate 3.8 MB->1.5 MB: comp 30:12MB/s decomp 78:196MB/s
TAlgoDeflateFast 3.8 MB->1.6 MB: comp 48:20MB/s decomp 73:174MB/s
</pre>
<p>I was a bit surprised by how well the pure pascal version of the SynLZ algorithm runs on AARCH64, once compiled with FPC 3.2. Also, Deflate compression has a small advantage from using our statically linked libdeflate instead of the plain zlib. But the very good news is that Lizard is really fast on AARCH64: even though it is written in plain C with no manual SIMD/asm code, it is really fast on non-Intel/AMD platforms. More than 2GB/s for decompression is very high. I was told that Lizard may be a bit behind ZStandard on Intel/AMD, but its code is simpler, and much more CPU-agnostic.</p>
<pre>
2.4. Sqlite file memory map:
- Database direct access: 22,264 assertions passed 55.40ms
- Virtual table direct access: 12 assertions passed 347us
- TOrmTableJson: 144,083 assertions passed 60.25ms
- TRestClientDB: 608,196 assertions passed 783.02ms
- Regexp function: 6,015 assertions passed 11.07ms
- TRecordVersion: 20,060 assertions passed 51.28ms
Total failed: 0 / 800,630 - Sqlite file memory map PASSED 961.45ms
</pre>
<p>Here the SQLite3 numbers are similar to what I get on Intel/AMD. So I guess we can really consider using this database as a storage back-end for mORMot MicroServices with their stand-alone persistence layer.</p>
<h3>Ampere and Beyond - Apple M1?</h3>
<p>We also tried to support ARM/AARCH64 CPUs as fully as possible with mORMot 2. So we now detect the CPU type and HW platform it runs on, especially on Linux or Android - which is also an AARCH64 platform. Here is what our regression tests report at the end:</p>
<pre>
Ubuntu 20.04.2 LTS - Linux 5.8.0-1037-oracle (cp utf8)
2 x ARM Neoverse-N1 (aarch64)
on QEMU KVM Virtual Machine virt-4.2
Using mORMot 2.0.1
TSqlite3LibraryStatic 3.36.0 with internal MM
Generated with: Free Pascal 3.2 64 bit Linux compiler
Time elapsed for all tests: 44.38s
Performed 2021-08-17 13:44:09 by ubuntu on lxde
Total assertions failed for all test suits: 0 / 66,050,607
</pre>
<p>As you can see, the CPU was properly identified as <a href="https://www.arm.com/products/silicon-ip-cpu/neoverse/neoverse-n1">ARM Neoverse-N1</a>.</p>
<p>We can in good faith consider using <em>mORMot</em> code on an Apple M1/M1X/M2 CPU, thanks to the FPC (cross-)compiler - if we get access to this hardware. Any feedback is welcome.</p>
<h3>Server Process Performance</h3>
<p>All regression tests pass fully green, with pretty consistent performance across all their various tasks. JSON processing, ORM, SOA or encryption: everything flies on the Ampere CPU. You can check <a href="https://gist.github.com/synopse/0e7275684a2e2bbd2206940c3827055c">the detailed regression tests console output</a>.</p>
<p>Here are some numbers about UTF-8 or JSON process:</p>
<pre>
StrLen() in 1.43ms, 13.3 GB/s
IsValidUtf8(RawUtf8) in 11.75ms, 1.6 GB/s
IsValidUtf8(PUtf8Char) in 13.08ms, 1.4 GB/s
IsValidJson(RawUtf8) in 22.84ms, 858.2 MB/s
IsValidJson(PUtf8Char) in 22.93ms, 854.7 MB/s
JsonArrayCount(P) in 22.97ms, 853.1 MB/s
JsonArrayCount(P,PMax) in 22.89ms, 856.4 MB/s
JsonObjectPropCount() in 11.90ms, 0.9 GB/s
jsonUnquotedPropNameCompact in 72.35ms, 240.6 MB/s
jsonHumanReadable in 119.06ms, 209.4 MB/s
TDocVariant in 245.99ms, 79.7 MB/s
TDocVariant no guess in 260.57ms, 75.2 MB/s
TDocVariant dvoInternNames in 247.56ms, 79.1 MB/s
TOrmTableJson GetJsonValues in 34.88ms, 247.1 MB/s
TOrmTableJson expanded in 42.70ms, 459 MB/s
TOrmTableJson not expanded in 21.42ms, 402.4 MB/s
DynArrayLoadJson in 87.96ms, 222.8 MB/s
TOrmPeopleObjArray in 131.10ms, 149.5 MB/s
fpjson in 115.09ms, 17 MB/s
</pre>
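<p>The <code>IsValidJson</code>/<code>TDocVariant</code> routines above are mORMot-specific, but the kind of throughput they report is easy to measure for any parser. As a rough, hedged comparison point (not a one-to-one benchmark), here is how one could time plain JSON parsing with Python's standard <code>json</code> module; the payload shape and the resulting MB/s are illustrative:</p>

```python
import json
import time

# Build an illustrative JSON array of small objects, then measure how
# fast the standard parser chews through it (numbers are machine-dependent).
payload = json.dumps([{"id": i, "name": f"row{i}"} for i in range(50000)])
raw = payload.encode("utf-8")

start = time.perf_counter()
doc = json.loads(payload)
elapsed = time.perf_counter() - start

assert len(doc) == 50000                # the whole array was materialized
print(f"{len(raw) / (1024 * 1024) / elapsed:.1f} MB/s for {len(raw)} bytes")
```

<p>The <em>fpjson</em> line in the table above plays a similar role: a general-purpose DOM parser, measured on the same input as the mORMot routines.</p>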
<p>It is nice to see that our pascal code, which has been deeply tuned so that FPC generates the best possible x86_64 assembly, also gives very good performance on AARCH64. There is no need to write dedicated code and pollute the source with plenty of <em>$ifdef/$endif</em> blocks: x86_64 is already some kind of RISC-like architecture, with a larger number of registers and efficient 64-bit processing, so nothing had to be rewritten. Optimized pascal code, with tuned pointer arithmetic, is platform neutral. I like the quote from the SQLite3 author saying that <a href="https://www.sqlite.org/whyc.html">C is a "portable assembly"</a>; in the same way, we can use tuned pascal code, as we try to do in the mORMot core units, to leverage modern CPU hardware, without having to fight against any hyped, ever-changing language.</p>
<h3>Asm is Fun Again</h3>
<p>So we are pretty excited to see how this platform evolves in the future. We have invested a lot of time in mORMot, refactoring and tuning asm to leverage the Intel/AMD platform, focusing on server-side performance. But this AARCH64 technology is really promising, and I can tell you that its RISC instruction set was very cleverly designed. It is very rich and powerful, almost perfect in its balance between power and expressiveness, compared to the x86_64 platform, which has a lot of inconsistencies and seems outdated when you put both instruction sets side by side. After decades playing with i386 or x86_64 asm, I had fun again with ARMv8 assembly. It tastes like "assembly as it should be" (tm). Linking some static C code is a good balance between leveraging the hardware when needed, and keeping platform-independent pascal source. And FPC, as a compiler, is amazing for being open and well engineered on so many CPUs and platforms. Open Source rocks!</p>
<p>As usual, <a href="https://synopse.info/forum/viewtopic.php?pid=35602#p35602">feedback is welcome on our forum</a>.</p>
<h3>Job Offer: FPC mORMot 2 and WAPT</h3>
<p>Good news!<br />
The French company I work for, Tranquil IT, is hiring FPC / Lazarus / mORMot developers. Remote work possible.</p>
<p><img src="https://blog.synopse.info/public/blog/logo_Tranquil_IT.png" alt="" /></p>
<p>I share below the job offer from my boss Vincent.<br />
We look forward to working with you on this great mORMot-powered project!</p>
<p><a href="https://www.tranquil.it/en/who-are-we/join-us/">https://www.tranquil.it/en/who-are-we/join-us/</a></p> <p><em>If you dream of working on great projects with a team of talented developers, using technologies you love and believe in, Tranquil IT wants to hire you too.</em></p>
<p><em>Tranquil IT is based in Nantes, on the French Atlantic coast. If you have the right skills and you are self-driven, you can work in Nantes or remotely (as Arnaud does).</em></p>
<p><em>Our team is fluent in English, mORMot, Lazarus, FreePascal, Python and system administration. Tranquil IT is best known for developing one of the most useful tools to help private and public organizations prevent cyberattacks, the <a href="https://www.wapt.fr/en/doc">WAPT deployment software</a>. Tranquil IT is also known for its work with <a href="https://samba.tranquil.it/doc/en">Samba Active Directory</a>.</em></p>
<p><em>We have a lot of ideas to improve WAPT, so join us, bring ideas of your own, and become part of the mORMot / Tranquil IT adventure!</em></p>
<p><em>To apply, contact us at rh (at) tranquil (dot) it.</em><br />
Vincent CARDON, President<br />
TRANQUIL IT</p>
<p><em>PS: You can send your resumé to this email address, preferably with links to code you proudly wrote (we enjoy reading nice code!). Indicate whether you are interested in working on the mORMot framework, improving Lazarus/FPC components, or working on the end-user WAPT software.</em></p>