<h2>Fastest AES-PRNG, AES-CTR and AES-GCM Delphi implementation</h2>
<p>2021-02-13, Arnaud Bouchez (Synopse Open Source blog, mORMot Framework). Tags: 64bit, AES, AES-CTR, AES-GCM, AES-Ni, asm, blog, CrossPlatform, Delphi, FreePascal, mORMot2, performance, sse4</p>
<p>Last week, I committed new ASM implementations of our AES-PRNG, AES-CTR and AES-GCM for <em>mORMot 2</em>.<br />
They process eight 128-bit blocks at once in an interleaved fashion, as permitted by the CTR chaining mode. The AES-NI opcodes (<code>aesenc aesenclast</code>) are used for the AES rounds, and the GMAC of the AES-GCM mode is computed using the <code>pclmulqdq</code> opcode.</p>
<p><img src="https://blog.synopse.info?post/public/blog/aesalgo.png" alt="" /></p>
<p>The resulting performance is amazing: on my simple Core i3, I reach 2.6 GB/s for <code>aes-128-ctr</code> and 1.5 GB/s for <code>aes-128-gcm</code>, the first being actually faster than OpenSSL!</p>
<p>AES-CTR is the basic chaining mode used for:</p>
<ul>
<li>AES-CTR as defined by the <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_(CTR)">NIST standard</a> - see our <code>TAesPrngNist</code> class;</li>
<li>AES-GCM, which adds a 128-bit GMAC as defined by the <a href="https://en.wikipedia.org/wiki/Galois/Counter_Mode">Galois/Counter Mode</a> - see the <code>TAesGcm</code> class;</li>
<li>and our AES-based <a href="https://en.wikipedia.org/wiki/Pseudorandom_number_generator">Pseudo Random Number Generator</a> (PRNG) as implemented by our <code>TAesPrng</code> class.</li>
</ul>
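<p>To illustrate why the CTR chaining mode permits interleaving, here is a minimal sketch in Python (not the mORMot asm, and not Delphi). It assumes a stand-in block function built on <code>hashlib.sha256</code> instead of real AES, and the names <code>block_fn</code>, <code>ctr_sequential</code> and <code>ctr_interleaved</code> are hypothetical. The point it shows is only that each keystream block depends on the key and the counter value alone, so a batch of eight can be computed independently before being XORed with the data.</p>

```python
import hashlib

BLOCK = 16  # bytes in one 128-bit block


def block_fn(key: bytes, counter: int) -> bytes:
    """Stand-in for the AES block function (NOT AES): keyed hash cut to 16 bytes."""
    return hashlib.sha256(key + counter.to_bytes(16, "big")).digest()[:BLOCK]


def xor_block(chunk: bytes, keystream: bytes) -> bytes:
    # zip() stops at the shorter input, so a final partial block is handled too
    return bytes(a ^ b for a, b in zip(chunk, keystream))


def ctr_sequential(key: bytes, start: int, data: bytes) -> bytes:
    """One block at a time: keystream block n is block_fn(key, start + n)."""
    out = bytearray()
    for n, off in enumerate(range(0, len(data), BLOCK)):
        out += xor_block(data[off:off + BLOCK], block_fn(key, start + n))
    return bytes(out)


def ctr_interleaved(key: bytes, start: int, data: bytes, lanes: int = 8) -> bytes:
    """Same result, but the keystream is produced in batches of `lanes` blocks."""
    out = bytearray()
    nblocks = (len(data) + BLOCK - 1) // BLOCK
    for base in range(0, nblocks, lanes):
        last = min(base + lanes, nblocks)
        # No block of the batch needs the output of another one:
        # this independence is what allows 8 blocks to be in flight at once.
        batch = [block_fn(key, start + n) for n in range(base, last)]
        for n, keystream in zip(range(base, last), batch):
            off = n * BLOCK
            out += xor_block(data[off:off + BLOCK], keystream)
    return bytes(out)
```

<p>Both functions return the same bytes for the same key, counter and input, and applying either one twice restores the original data. A feedback mode such as OFB cannot be batched this way, since each keystream block is derived from the previous one.</p>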
<h3>mORMot 2</h3>
<p>For <em>mORMot 2</em>, we refactored the <code>SynCrypto.pas</code> unit into <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.crypto.pas"><code>mormot.core.crypto.pas</code></a>:</p>
<ul>
<li>All assembly code has been moved to dedicated <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.crypto.asmx86.inc"><code>mormot.core.crypto.asmx86.inc</code></a> and <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.crypto.asmx64.inc"><code>mormot.core.crypto.asmx64.inc</code></a> include files;</li>
<li>A generic catalog of AES algorithms has been implemented, which lets you look them up by name (e.g. <code>'aes-128-ctr'</code>), and also switch to the fastest implementation available, e.g. if OpenSSL is enabled;</li>
<li>The regression tests have been enhanced, to include validation against test vectors for all modes, and comparison with the OpenSSL reference implementation;</li>
<li>A lot of low-level optimizations have been applied, especially targeting x86_64, which is now (sometimes much) faster than the original, heavily tuned i386 code - in fact, we focus on x86_64 because it is our main target for high-end Linux services compiled with FPC;</li>
<li>Still as a stand-alone Delphi/FPC unit, with no external <code>.dll</code> to download, search and load.</li>
</ul>
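<p>The catalog idea from the list above (look up an algorithm by name, and prefer the fastest registered implementation) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the mORMot API: <code>register</code>, <code>lookup</code>, the provider labels and the priority numbers are all hypothetical.</p>

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical catalog: algorithm name -> list of (priority, provider, factory)
_REGISTRY: Dict[str, List[Tuple[int, str, Callable[[], object]]]] = {}


def register(name: str, provider: str, factory: Callable[[], object],
             priority: int) -> None:
    """Record one implementation of `name`; a higher priority means 'faster'."""
    _REGISTRY.setdefault(name.lower(), []).append((priority, provider, factory))


def lookup(name: str) -> Tuple[str, object]:
    """Return (provider, instance) for the best implementation of `name`."""
    candidates = _REGISTRY.get(name.lower())
    if not candidates:
        raise KeyError("unknown algorithm: " + name)
    _, provider, factory = max(candidates, key=lambda c: c[0])
    return provider, factory()


# The built-in implementations are always registered (placeholder factories).
register("aes-128-ctr", "internal", lambda: "internal aes-128-ctr", priority=2)
register("aes-128-gcm", "internal", lambda: "internal aes-128-gcm", priority=1)

# An external library, when present, only overrides the entries it beats.
openssl_available = True  # assumption for the sake of the example
if openssl_available:
    register("aes-128-ctr", "openssl", lambda: "openssl aes-128-ctr", priority=1)
    register("aes-128-gcm", "openssl", lambda: "openssl aes-128-gcm", priority=2)

print(lookup("aes-128-ctr")[0], lookup("AES-128-GCM")[0])
```

<p>With these made-up priorities, the lookup keeps the internal code for CTR and switches to the external library for GCM, which mirrors the benchmark ordering reported in this article.</p>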
<p>Here are some numbers, extracted from the unit comments, measured during the regression tests over several block sizes (not only huge buffers).</p>
<h3>AES-CTR</h3>
<p>On x86_64 we use an 8*128-bit interleaved optimized asm:</p>
<ul>
<li><strong>mormot aes-128-ctr</strong> in 1.99ms i.e. 1254390/s or <strong>2.6 GB/s</strong></li>
<li><strong>mormot aes-256-ctr</strong> in 2.64ms i.e. 945179/s or <strong>1.9 GB/s</strong></li>
</ul>
<p>It is <em>actually faster than OpenSSL 1.1.1</em> in our benchmarks:</p>
<ul>
<li>openssl aes-128-ctr in 2.23ms i.e. 1121076/s or 2.3 GB/s</li>
<li>openssl aes-256-ctr in 2.80ms i.e. 891901/s or 1.8 GB/s</li>
</ul>
<p>For reference, the optimized but non-interleaved OFB asm is more than 3 times slower:</p>
<ul>
<li>mormot aes-128-ofb in 6.88ms i.e. 363002/s or 772.5 MB/s</li>
<li>mormot aes-256-ofb in 9.37ms i.e. 266808/s or 567.8 MB/s</li>
</ul>
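<p>The ratios behind these statements can be checked directly from the elapsed times. The short Python sketch below copies the x86_64 timings from the three lists above and divides them; the helper name <code>speedup</code> is just an illustrative choice.</p>

```python
# Elapsed times in milliseconds, copied from the x86_64 lists above
timings_ms = {
    "mormot aes-128-ctr": 1.99,
    "mormot aes-256-ctr": 2.64,
    "openssl aes-128-ctr": 2.23,
    "openssl aes-256-ctr": 2.80,
    "mormot aes-128-ofb": 6.88,
    "mormot aes-256-ofb": 9.37,
}


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` ran than `baseline` (same workload)."""
    return timings_ms[baseline] / timings_ms[candidate]


for bits in ("128", "256"):
    ctr = "mormot aes-" + bits + "-ctr"
    vs_openssl = speedup("openssl aes-" + bits + "-ctr", ctr)
    vs_ofb = speedup("mormot aes-" + bits + "-ofb", ctr)
    print(bits, round(vs_openssl, 2), round(vs_ofb, 2))
```

<p>From these figures, the interleaved CTR code is roughly 1.12 times (128-bit) and 1.06 times (256-bit) faster than OpenSSL, and about 3.5 times faster than the non-interleaved OFB code.</p>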
<p>On i386, numbers are lower for our classes, which are not interleaved:</p>
<ul>
<li>mormot aes-128-ctr in 10ms i.e. 249900/s or 531.8 MB/s</li>
<li>mormot aes-256-ctr in 12.47ms i.e. 200368/s or 426.4 MB/s</li>
<li>openssl aes-128-ctr in 3.01ms i.e. 830288/s or 1.7 GB/s</li>
<li>openssl aes-256-ctr in 3.52ms i.e. 709622/s or 1.4 GB/s</li>
</ul>
<h3>AES-GCM</h3>
<p>On x86_64, our TAesGcm class is 8x interleaved for both GMAC and AES-CTR:</p>
<ul>
<li>mormot <strong>aes-128-gcm</strong> in 3.45ms i.e. 722752/s or <strong>1.5 GB/s</strong></li>
<li>mormot <strong>aes-256-gcm</strong> in 4.11ms i.e. 607385/s or <strong>1.2 GB/s</strong></li>
</ul>
<p><em>OpenSSL is faster</em> since it performs GMAC and AES-CTR in a single pass:</p>
<ul>
<li>openssl aes-128-gcm in 2.86ms i.e. 874125/s or 1.8 GB/s</li>
<li>openssl aes-256-gcm in 3.43ms i.e. 727590/s or 1.5 GB/s</li>
</ul>
<p>On i386, numbers are much lower, since it lacks interleaved asm - but still faster than any other Delphi alternative:</p>
<ul>
<li>mormot aes-128-gcm in 15.86ms i.e. 157609/s or 335.4 MB/s</li>
<li>mormot aes-256-gcm in 18.23ms i.e. 137083/s or 291.7 MB/s</li>
<li>openssl aes-128-gcm in 5.49ms i.e. 455290/s or 0.9 GB/s</li>
<li>openssl aes-256-gcm in 6.11ms i.e. 408630/s or 869.6 MB/s</li>
</ul>
<h3>Other AES modes</h3>
<p>As you may see from the recent commits, and the numbers in the source code, almost all of our AES classes (e.g. OFB and CFB) have had their performance enhanced, sometimes by a large margin.</p>
<p>A new <code>TAesCtrCrc</code> class has been added. It combines AES-CTR with 4 parallel <code>crc32c</code> checksums, of both the encrypted and the decrypted content.<br />
It results in an <a href="https://en.wikipedia.org/wiki/Authenticated_encryption">AEAD algorithm</a> with 256 bits of associated authentication, which outperforms AES-GCM in our implementation.</p>
<p>On x86_64 we use an 8*128-bit interleaved optimized asm:</p>
<ul>
<li><strong>mormot aes-128-ctc</strong> in 2.58ms i.e. 967492/s or <strong>2 GB/s</strong></li>
<li><strong>mormot aes-256-ctc</strong> in 3.13ms i.e. 797702/s or <strong>1.6 GB/s</strong></li>
</ul>
<p>(to be compared with the CTR numbers above, without the 256-bit crc32c MAC computation, at 2.6 GB/s and 1.9 GB/s)</p>
<p>On i386, numbers are lower, because the code is not interleaved:</p>
<ul>
<li>mormot aes-128-ctc in 9.76ms i.e. 256068/s or 544.9 MB/s</li>
<li>mormot aes-256-ctc in 12.14ms i.e. 205930/s or 438.2 MB/s</li>
</ul>
<p>For internal communication, e.g. for our WebSockets services, it is a very good algorithm, especially for small messages, since it needs less warmup than AES-GCM.</p>
<p>Here are some numbers of our ECDHE stream protocol:</p>
<ul>
<li>efAesCrc128 in 1.57ms i.e. 63,331/s, aver. 15us, 1.1 GB/s</li>
<li>efAesCfb128 in 1.66ms i.e. 60,060/s, aver. 16us, 1 GB/s</li>
<li>efAesOfb128 in 2.52ms i.e. 39,588/s, aver. 25us, 729.9 MB/s</li>
<li>efAesCtr128 in 851us i.e. 117,508/s, aver. 8us, 2.1 GB/s</li>
<li>efAesCbc128 in 2.93ms i.e. 34,059/s, aver. 29us, 628 MB/s</li>
<li>efAesCrc256 in 2.13ms i.e. 46,926/s, aver. 21us, 865.2 MB/s</li>
<li>efAesCfb256 in 2.20ms i.e. 45,330/s, aver. 22us, 835.8 MB/s</li>
<li>efAesOfb256 in 3.38ms i.e. 29,507/s, aver. 33us, 544 MB/s</li>
<li>efAesCtr256 in 1.09ms i.e. 91,659/s, aver. 10us, 1.6 GB/s</li>
<li>efAesCbc256 in 3.33ms i.e. 30,012/s, aver. 33us, 553.3 MB/s</li>
<li>efAesGcm128 in 790us i.e. 126,582/s, aver. 7us, 2.2 GB/s</li>
<li>efAesGcm256 in 987us i.e. 101,317/s, aver. 9us, 1.8 GB/s</li>
<li><strong>efAesCtc128</strong> in 820us i.e. 121,951/s, aver. 8us, <strong>2.1 GB/s</strong></li>
<li><strong>efAesCtc256</strong> in 985us i.e. 101,522/s, aver. 9us, <strong>1.8 GB/s</strong></li>
</ul>
<p>Note that</p>
<ul>
<li>those numbers don't exactly match the other benchmarks, because we don't measure the raw AES encryption performance, but the whole encapsulation in the WebSockets frames protocol, and we test another set of message sizes;</li>
<li>the <code>efAesGcm128</code>/<code>efAesGcm256</code> numbers above automatically used the OpenSSL library on my Ubuntu laptop, since it is faster than our <code>TAesGcm</code> class - so when you don't have OpenSSL installed (which is sometimes tricky on Windows), you could rely on <code>efAesCtc128</code> as the cipher of the WebSockets ECDHE encryption protocol.</li>
</ul>
<h3>AES PRNG</h3>
<p>As I wrote above, our <code>TAesPrng</code> class internally uses the AES-CTR mode to generate its random output stream.<br />
The newly introduced asm was very beneficial to its 256-bit AES generator, in terms of performance.</p>
<p>On x86_64, it uses fast hardware AES-NI acceleration, and our 8X interleaved asm:</p>
<ul>
<li>mORMot <strong>Random32</strong> in 3.95ms i.e. <strong>25,303,643/s</strong>, aver. 0us, <strong>96.5 MB/s</strong></li>
<li>mORMot <strong>FillRandom</strong> in 46us, <strong>2 GB/s</strong></li>
</ul>
<p>It is actually <em>noticeably faster than OpenSSL</em> with the same 256-bit safety level:</p>
<ul>
<li>OpenSSL Random32 in 288.71ms i.e. 346,363/s, aver. 2us, 1.3 MB/s</li>
<li>OpenSSL FillRandom in 240us, 397.3 MB/s</li>
</ul>
<p>On i386, numbers are similar, except for <code>FillRandom</code>, which is not interleaved:</p>
<ul>
<li>mORMot Random32 in 5.54ms i.e. 18,044,027/s, aver. 0us, 68.8 MB/s</li>
<li>mORMot FillRandom in 203us, 469.7 MB/s</li>
<li>OpenSSL Random32 in 364.24ms i.e. 274,540/s, aver. 3us, 1 MB/s</li>
<li>OpenSSL FillRandom in 371us, 257 MB/s</li>
</ul>
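<p>The gap between the <code>Random32</code> and <code>FillRandom</code> throughputs is mostly arithmetic: assuming each <code>Random32</code> call returns 4 bytes (as its name suggests), the call rate converts to MB/s as sketched below in Python. The helper name <code>random32_throughput</code> is hypothetical, and the check also suggests that "MB" in these figures means 1024*1024 bytes.</p>

```python
MIB = 1024 * 1024  # the published figures match binary megabytes


def random32_throughput(calls_per_second: int) -> float:
    """MB/s delivered by a generator returning 4 bytes per call (assumption)."""
    return calls_per_second * 4 / MIB


# Call rates copied from the lists above
rates = {
    "mORMot x86_64": 25_303_643,
    "OpenSSL x86_64": 346_363,
    "mORMot i386": 18_044_027,
    "OpenSSL i386": 274_540,
}

for name, rate in rates.items():
    print(name, round(random32_throughput(rate), 1), "MB/s")
```

<p>The results agree with the 96.5, 1.3, 68.8 and 1 MB/s values listed above. Drawing 4 bytes at a time is bound by per-call overhead, so bulk requests through <code>FillRandom</code> are the way to reach the GB/s range.</p>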
<h3>Conclusion</h3>
<p>For years, I have suspected that we wrote the fastest AES library for Delphi and FreePascal. Now we cover even more algorithms (AES-GCM is widely used but not widely implemented in Delphi), and have pushed the performance limits even further!<br />
We can be proud that our library outperforms the OpenSSL 1.1.1 proven codebase for most algorithms, with no <code>.dll</code> dependency.<br />
Open Source rocks!</p>
<p>The next logical step is to work on OpenSSL integration of the TLS layer, which is especially welcome on Linux (mORMot has already had a <a href="https://docs.microsoft.com/en-us/windows-server/security/tls/tls-ssl-schannel-ssp-overview">SChannel TLS layer</a> for Windows for years)...</p>
<p>If you wish, you can download the current mORMot 2 source code from <a href="https://github.com/synopse/mORMot2">https://github.com/synopse/mORMot2</a> and run the <a href="https://github.com/synopse/mORMot2/blob/master/test/mormot2tests.dpr">regression tests project</a>. You could share your own numbers!</p>
<p>Stay tuned, and feedback is <a href="https://synopse.info/forum/viewtopic.php?id=5760">welcome in our forum, as usual</a>!</p>
<h2>SynCrypto: SSE4 x64 optimized asm for SHA-256</h2>
<p>2015-02-21, AB4327-GANDI (Open Source libraries). Tags: 64bit, asm, blog, Delphi, performance, sha, Source, sse4</p>
<p>We have just added some optimized x64 assembler to our Open Source <a href="http://synopse.info/fossil/info/b7ba18e68252b76c0fe">SynCrypto.pas</a> unit, so that SHA-256 hashing runs at the best possible speed.<br />
It is adapted from <a href="http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html">Intel's tuned assembly macros</a>, which use the SSE4 instruction set if available.</p>
<p><img src="http://www.xbitlabs.com/images/cpu/core2extreme-qx9650/sse-1.jpg" alt="" width="275" height="182" /></p>
<p>The numbers speak for themselves:</p>
<ul>
<li>under Win32, with a Core i7 CPU: pure pascal: 152ms - x86: 112ms</li>
<li>under Win64, with a Core i7 CPU: pure pascal: 202ms - SSE4: 78ms</li>
</ul>
<p>When executing the following test code:</p>
<pre>
// hash the same constant text 100000 times, and validate each digest
for i := 1 to 100000 do begin
  s := SHA256('123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');
  assert(s='f816ca413da6f2881c0cf16cb6d5bbc5d4189f5a9f185855c8bfd6423e099e52');
end;
</pre>
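<p>For anyone wishing to cross-check such a loop against an independent implementation, here is a minimal Python sketch using the standard <code>hashlib</code> module. It is not part of SynCrypto; the names <code>sha256_hex</code> and <code>time_hashes</code> are hypothetical, and the timings it reports depend on the machine and are not comparable to the Delphi numbers above.</p>

```python
import hashlib
import time

# Same input text as in the Delphi test loop above
MESSAGE = b"123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"


def sha256_hex(data: bytes) -> str:
    """Lowercase hexadecimal SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()


def time_hashes(count: int = 100_000) -> float:
    """Hash MESSAGE `count` times and return the elapsed time in milliseconds."""
    reference = sha256_hex(MESSAGE)
    start = time.perf_counter()
    for _ in range(count):
        # every iteration must give the same digest as the first computation
        assert sha256_hex(MESSAGE) == reference
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    print(sha256_hex(MESSAGE))
    print("%.1f ms" % time_hashes())
```

<p>The printed digest can then be compared by hand with the value expected by the <code>assert</code> in the Delphi code.</p>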
<p>Your <a href="http://synopse.info/forum/viewtopic.php?id=2374">feedback is
welcome</a>, as usual!</p>