<h2>mORMot 2 ORM Performance</h2>
<p>2022-02-15, by Arnaud Bouchez</p>
<p>The official release of <em>mORMot 2</em> is around the corner.
It may be the occasion to show some data persistence performance numbers, compared to <em>mORMot 1</em>.</p>
<p><img src="https://blog.synopse.info?post/public/blog/marmotrunningsnow.jpg" alt="" /></p>
<p>For the version 2 of our framework, its ORM feature has been enhanced and tuned in several aspects: REST routing optimization, ORM/JSON serialization, and in-memory and SQL engines tuning.
The numbers speak for themselves. You could compare with any other solution, compile and run the tests yourself for both frameworks, and see how it goes on your own computer or server.<br />
In a nutshell, we <em>almost reach 1 million inserts per second on SQLite3</em>, and are above one million inserts per second in our in-memory engine. Reading speed is 1.2 million and 1.7 million rows per second, respectively. From the object to the storage, and back. And forcing AES-CTR encryption on disk barely changes anything. Now we are talking. <img src="https://blog.synopse.info?pf=wink.svg" alt=";)" class="smiley" /></p> <h3>Platform Used</h3>
<p>Those numbers were taken from the "external database" sample, which is available on both versions of the framework.<br />
This is the very same benchmark as used in previous benchmarks on this blog or our documentation. So you could compare the numbers.</p>
<p>But we ran the tests on a newer computer than in previous benchmarks: an Intel(R) Core(TM) i5-7300U CPU, in a good old ThinkPad T470 notebook, with a SATA SSD.<br />
So on more modern hardware, like a high-end AMD or Xeon server, the million-inserts mark would easily be passed. And even on a slow VM, you will get pretty good numbers.</p>
<p>We ran the tests on Linux x86_64 (Debian 11), compiled with FPC 3.2 stable - which is our target platform for performance.<br />
In fact, performance matters mainly on the server side - clients are usually fast enough to run whatever process they need, and the ORM/database/persistence process is likely to be located on the server side. So in <em>mORMot 2</em>, we focused on a x86_64 Linux server for performance, which is the cheapest and safest solution around.</p>
<p>Both frameworks used our <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas">fpcx64mm memory manager for FPC</a>, written in x86_64 assembly. It has a noticeable performance benefit, especially for multi-threaded processing (not shown here, but visible during the main regression tests).</p>
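<p>To give an idea of what the benchmark actually measures, here is a simplified batch-insertion sketch with the mORMot 2 ORM. It is an illustration written from memory, not the actual sample source: <code>TOrmSample</code> and the exact <code>uses</code> clause are hypothetical, so double-check identifiers against the repository.</p>
<pre>
uses
  mormot.core.base,
  mormot.core.text,           // FormatUtf8()
  mormot.orm.core,            // TOrm, TOrmModel, TRestBatch
  mormot.rest.sqlite3,        // TRestServerDB
  mormot.db.raw.sqlite3.static; // statically linked SQLite3 engine

type
  // a simple ORM class: published properties become table columns
  TOrmSample = class(TOrm)
  private
    fName: RawUtf8;
    fAmount: double;
  published
    property Name: RawUtf8 read fName write fName;
    property Amount: double read fAmount write fAmount;
  end;

procedure InsertMany;
var
  model: TOrmModel;
  server: TRestServerDB;
  batch: TRestBatch;
  rec: TOrmSample;
  i: integer;
begin
  model := TOrmModel.Create([TOrmSample]);
  server := TRestServerDB.Create(model, 'test.db');
  try
    server.Server.CreateMissingTables; // create the SQLite3 table if needed
    rec := TOrmSample.Create;
    batch := TRestBatch.Create(server.Orm, TOrmSample);
    try
      for i := 1 to 10000 do
      begin
        rec.Name := FormatUtf8('Name %', [i]);
        rec.Amount := i * 0.01;
        batch.Add(rec, {SendData=}true); // JSON-encoded, buffered locally
      end;
      // "Batch Trans" mode: all rows sent at once, in a single transaction
      server.Orm.BatchSend(batch);
    finally
      batch.Free;
      rec.Free;
    end;
  finally
    server.Free;
    model.Free;
  end;
end;
</pre>
<p>The "Direct" column corresponds to one <code>Orm.Add()</code> call (and one transaction) per row, which is why it is orders of magnitude slower on a disk-backed database.</p>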
<h3>Insertion Speed</h3>
<p>Here are the <em>mORMot 2</em> insertion numbers:</p>
<p>Running tests using Synopse mORMot framework 2.0.1, compiled with Free Pascal 3.2 64 bit, against SQLite 3.37.2, on Debian GNU/Linux 11 (bullseye) - Linux 5.10.0-10-amd64, at 2022-02-14 21:09:37.</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>Direct</strong></td><td><strong>Batch</strong></td><td><strong>Trans</strong></td><td><strong>Batch Trans</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>98</td><td>5908</td><td>74089</td><td>242072</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>13534</td><td>474428</td><td>151315</td><td>919624</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>42961</td><td>691037</td><td>153374</td><td>929281</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>26882</td><td>533788</td><td>152795</td><td>874814</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>114664</td><td>969743</td><td>152190</td><td>972478</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>411895</td><td>1086956</td><td>428724</td><td>1301236</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>385445</td><td>1200480</td><td>412762</td><td>1219660</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>107</td><td>5957</td><td>83531</td><td>111043</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>16509</td><td>291528</td><td>151890</td><td>390502</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>58922</td><td>354924</td><td>150179</td><td>392649</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>114476</td><td>991080</td><td>154564</td><td>991375</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>19789</td><td>155763</td><td>19439</td><td>236406</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Batch+Trans|Trans|Batch|Direct&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1301236&chds=0,1301236,0,1301236,0,1301236,0,1301236,0,1301236&chd=t:98,5908,74089,242072|13534,474428,151315,919624|42961,691037,153374,929281|26882,533788,152795,874814|114664,969743,152190,972478|411895,1086956,428724,1301236|385445,1200480,412762,1219660|107,5957,83531,111043|16509,291528,151890,390502|58922,354924,150179,392649|114476,991080,154564,991375|19789,155763,19439,236406&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1301236&chds=0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236,0,1301236&chd=t:98,13534,42961,26882,114664,411895,385445,107,16509,58922,114476,19789|5908,474428,691037,533788,969743,1086956,1200480,5957,291528,354924,991080,155763|74089,151315,153374,152795,152190,428724,412762,83531,151890,150179,154564,19439|242072,919624,929281,874814,972478,1301236,1219660,111043,390502,392649,991375,236406&chdl=Direct|Batch|Trans|Batch+Trans" /></p>
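<p>The "full" / "off" / "exc" / "aes" row suffixes map to SQLite3 settings. As a hedged sketch (property names as I recall them from mormot.db.raw.sqlite3 - check the sample source for the exact benchmark setup), the variants amount to:</p>
<pre>
uses
  mormot.db.raw.sqlite3, // TSqlDataBase with Synchronous/LockingMode
  mormot.rest.sqlite3;   // TRestServerDB

procedure TuneSqlite3(server: TRestServerDB);
begin
  // "off": PRAGMA synchronous=OFF - don't wait for fsync() on each commit
  server.DB.Synchronous := smOff;
  // "exc": PRAGMA locking_mode=EXCLUSIVE - keep the file lock between
  // writes, avoiding repeated fcntl() locking calls
  server.DB.LockingMode := lmExclusive;
  // "aes": AES-CTR encryption is enabled by supplying a password to
  // TRestServerDB.Create() - the whole file is then encrypted on disk
end;
</pre>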
<p>In comparison, here are the <em>mORMot 1</em> numbers - which were already ahead of most other solutions - on the same machine:</p>
<p>Running tests using Synopse mORMot framework 1.18.6365, compiled with Free Pascal 3.2 64 bit, against SQLite 3.37.2, on Debian GNU/Linux 11 (bullseye) - Linux 5.10.0-10-amd64, at 2022-02-15 10:27:28.</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>Direct</strong></td><td><strong>Batch</strong></td><td><strong>Trans</strong></td><td><strong>Batch Trans</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>97</td><td>5034</td><td>45603</td><td>90704</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>12504</td><td>212350</td><td>96605</td><td>272910</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>36189</td><td>244702</td><td>94754</td><td>273687</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>22933</td><td>219857</td><td>96513</td><td>267881</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>79953</td><td>275269</td><td>97776</td><td>269963</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>208125</td><td>473126</td><td>227821</td><td>505254</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>202683</td><td>467595</td><td>220031</td><td>472545</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>100</td><td>2809</td><td>49603</td><td>102247</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>15329</td><td>190701</td><td>109346</td><td>301768</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>48818</td><td>264760</td><td>109767</td><td>304710</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>93820</td><td>303766</td><td>111437</td><td>311779</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>17952</td><td>68360</td><td>14915</td><td>97096</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Batch+Trans|Trans|Batch|Direct&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,505254&chds=0,505254,0,505254,0,505254,0,505254,0,505254&chd=t:97,5034,45603,90704|12504,212350,96605,272910|36189,244702,94754,273687|22933,219857,96513,267881|79953,275269,97776,269963|208125,473126,227821,505254|202683,467595,220031,472545|100,2809,49603,102247|15329,190701,109346,301768|48818,264760,109767,304710|93820,303766,111437,311779|17952,68360,14915,97096&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Insertion+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,505254&chds=0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254,0,505254&chd=t:97,12504,36189,22933,79953,208125,202683,100,15329,48818,93820,17952|5034,212350,244702,219857,275269,473126,467595,2809,190701,264760,303766,68360|45603,96605,94754,96513,97776,227821,220031,49603,109346,109767,111437,14915|90704,272910,273687,267881,269963,505254,472545,102247,301768,304710,311779,97096&chdl=Direct|Batch|Trans|Batch+Trans" /></p>
<p>As you can see, the performance benefits are noticeable. For a <em>MicroService</em>, an embedded SQLite3 storage may give pretty amazing scalability to your SOA processing.</p>
<h3>Reading Speed</h3>
<p>Here are the <em>mORMot 2</em> reading numbers:</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>By one</strong></td><td><strong>All Virtual</strong></td><td><strong>All Direct</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>115775</td><td>1151012</td><td>1121956</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>123198</td><td>1157407</td><td>1162925</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>270504</td><td>1162790</td><td>1174122</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>269978</td><td>1160227</td><td>1171920</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>273950</td><td>1154201</td><td>1150350</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>467551</td><td>1803751</td><td>1743071</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>464209</td><td>771188</td><td>777786</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>184836</td><td>522247</td><td>1142204</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>180900</td><td>519237</td><td>1155134</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>186046</td><td>512583</td><td>1153003</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>274559</td><td>1168770</td><td>1182872</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>22246</td><td>444820</td><td>873133</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|All+Direct|All+Virtual|By+one&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1803751&chds=0,1803751,0,1803751,0,1803751&chd=t:115775,1151012,1121956|123198,1157407,1162925|270504,1162790,1174122|269978,1160227,1171920|273950,1154201,1150350|467551,1803751,1743071|464209,771188,777786|184836,522247,1142204|180900,519237,1155134|186046,512583,1153003|274559,1168770,1182872|22246,444820,873133&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1803751&chds=0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751,0,1803751&chd=t:115775,123198,270504,269978,273950,467551,464209,184836,180900,186046,274559,22246|1151012,1157407,1162790,1160227,1154201,1803751,771188,522247,519237,512583,1168770,444820|1121956,1162925,1174122,1171920,1150350,1743071,777786,1142204,1155134,1153003,1182872,873133&chdl=By+one|All+Virtual|All+Direct" /></p>
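<p>The three columns map to distinct ORM read patterns: retrieving records one by one by ID, or reading the whole table in a single request. A minimal sketch, with <code>TOrmSample</code> being a hypothetical <code>TOrm</code> class with <code>Name</code>/<code>Amount</code> published properties:</p>
<pre>
procedure ReadBack(server: TRestServerDB);
var
  one, rec: TOrmSample;
begin
  // "By one": retrieve a single record from its ID
  one := TOrmSample.Create(server.Orm, 1);
  try
    // use one.Name, one.Amount ...
  finally
    one.Free;
  end;
  // "All": one request for the whole table, then iterate the JSON result,
  // filling a single reused instance per row
  rec := TOrmSample.CreateAndFillPrepare(server.Orm, '');
  try
    while rec.FillOne do
      ; // process rec.Name, rec.Amount ...
  finally
    rec.Free;
  end;
end;
</pre>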
<p>In comparison, here are the <em>mORMot 1</em> numbers - which were already ahead of most other solutions - on the same machine:</p>
<p><table><tbody><tr align="center"><td> </td><td><strong>By one</strong></td><td><strong>All Virtual</strong></td><td><strong>All Direct</strong></td></tr>
<tr align="center"><td><strong>Sqlite file full</strong></td><td>72439</td><td>847888</td><td>848176</td></tr>
<tr align="center"><td><strong>Sqlite file off</strong></td><td>73248</td><td>837100</td><td>858811</td></tr>
<tr align="center"><td><strong>Sqlite file off exc</strong></td><td>111037</td><td>845737</td><td>848032</td></tr>
<tr align="center"><td><strong>Sqlite file off exc aes</strong></td><td>112973</td><td>863557</td><td>869716</td></tr>
<tr align="center"><td><strong>Sqlite in memory</strong></td><td>111766</td><td>864154</td><td>879043</td></tr>
<tr align="center"><td><strong>In memory static</strong></td><td>229074</td><td>1395868</td><td>1401738</td></tr>
<tr align="center"><td><strong>In memory virtual</strong></td><td>228081</td><td>625782</td><td>626095</td></tr>
<tr align="center"><td><strong>External sqlite file full</strong></td><td>107587</td><td>412745</td><td>840618</td></tr>
<tr align="center"><td><strong>External sqlite file off</strong></td><td>132890</td><td>393948</td><td>805023</td></tr>
<tr align="center"><td><strong>External sqlite file off exc</strong></td><td>133347</td><td>411082</td><td>821422</td></tr>
<tr align="center"><td><strong>External sqlite in memory</strong></td><td>135197</td><td>411658</td><td>820075</td></tr>
<tr align="center"><td><strong>Remote sqlite socket</strong></td><td>19991</td><td>367161</td><td>643832</td></tr>
</tbody></table></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|All+Direct|All+Virtual|By+one&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1401738&chds=0,1401738,0,1401738,0,1401738&chd=t:72439,847888,848176|73248,837100,858811|111037,845737,848032|112973,863557,869716|111766,864154,879043|229074,1395868,1401738|228081,625782,626095|107587,412745,840618|132890,393948,805023|133347,411082,821422|135197,411658,820075|19991,367161,643832&chdl=Sqlite+file+full|Sqlite+file+off|Sqlite+file+off+exc|Sqlite+file+off+exc+aes|Sqlite+in+memory|In+memory+static|In+memory+virtual|External+sqlite+file+full|External+sqlite+file+off|External+sqlite+file+off+exc|External+sqlite+in+memory|Remote+sqlite+socket" /></p>
<p><img src="http://chart.apis.google.com/chart?chtt=Read+speed+%28rows%2Fsecond%29&chxl=1:|Remote+sqlite+socket|External+sqlite+in+memory|External+sqlite+file+off+exc|External+sqlite+file+off|External+sqlite+file+full|In+memory+virtual|In+memory+static|Sqlite+in+memory|Sqlite+file+off+exc+aes|Sqlite+file+off+exc|Sqlite+file+off|Sqlite+file+full&chxt=x,y&chbh=a&chs=600x500&cht=bhg&chco=3D7930,3D8930,309F30,40C355&chxr=0,0,1401738&chds=0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738,0,1401738&chd=t:72439,73248,111037,112973,111766,229074,228081,107587,132890,133347,135197,19991|847888,837100,845737,863557,864154,1395868,625782,412745,393948,411082,411658,367161|848176,858811,848032,869716,879043,1401738,626095,840618,805023,821422,820075,643832&chdl=By+one|All+Virtual|All+Direct" /></p>
<h3>Feedback Welcome</h3>
<p>We encourage you to download the full source of both frameworks, from:</p>
<ul>
<li><a href="https://github.com/synopse/mORMot">https://github.com/synopse/mORMot</a></li>
<li><a href="https://github.com/synopse/mORMot2">https://github.com/synopse/mORMot2</a></li>
</ul>
<p>Then you can compile the "15 - External DB performance" sample on <em>mORMot 1</em>, and "extdb-bench" example on <em>mORMot 2</em>.</p>
<p>Feedback is <a href="https://synopse.info/forum/viewtopic.php?id=6143">welcome in our forum</a>, as usual!</p>
<h2>EKON 25 Slides</h2>
<p>2021-11-16, by Arnaud Bouchez</p>
<p><a href="https://entwickler-konferenz.de/">EKON 25 at Düsseldorf</a> was a great conference (konference?).</p>
<p>At last, a <strong>physical</strong> gathering of Delphi developers, mostly from Germany, but also from the rest of Europe - and even some from the USA! No more virtual meetings, which may trigger the well-known 'Abstract Error' in modern pascal coders.<br />
There were some happy FPC users too - as I am now. <img src="https://blog.synopse.info?pf=smile.svg" alt=":)" class="smiley" /></p>
<p><img src="https://blog.synopse.info?post/public/blog/Ekon25.png" alt="" /></p>
<p>I have published the slides of my sessions, mostly about mORMot 2.<br />
By the way, I hope we will be able to officially release mORMot 2 in December, before Christmas. I think it is starting to stabilize, and it is already known to be used in production. We expect no more breaking changes in the next weeks.</p> <p>Here are the slides of my two 1-hour sessions.</p>
<h5>mORMot Cryptography</h5>
<p>The OpenSource mORMot framework has a strong set of cryptography features. It offers symmetric cryptography with hashing and encryption, together with asymmetric cryptography via private/public key pairs. Its optimized pascal and assembly engines can be embedded into your executable, but you could also call an external OpenSSL library if needed. This session will present mormot.crypt.* units, and apply them to some use cases, from low-level algorithms to high-level JWT or file encryption and signing.</p>
<p><a href="https://www.slideshare.net/ArnaudBouchez1/ekon25-mormot-2-cryptography">mORMot 2 Cryptography on SlideShare</a></p>
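<p>As a small taste of the mormot.crypt.* API, here is a sketch of hashing and AES encryption. It is written from memory against mormot.crypt.core, so names like <code>TAesCtr.EncryptPkcs7</code> should be double-checked against the unit:</p>
<pre>
uses
  mormot.core.base,
  mormot.crypt.core;

procedure CryptoDemo;
var
  key: THash256;
  aes: TAesAbstract;
  plain, encrypted, decrypted: RawByteString;
begin
  // hashing: one-line SHA-256 of a buffer, returned as hexadecimal text
  writeln(Sha256('some content'));
  // symmetric encryption: AES-256-CTR with PKCS7 padding and a random IV
  TAesPrng.Main.FillRandom(key); // cryptographically secure random key
  plain := 'some sensitive content';
  aes := TAesCtr.Create(key, 256);
  try
    encrypted := aes.EncryptPkcs7(plain, {IVAtBeginning=}true);
    decrypted := aes.DecryptPkcs7(encrypted, {IVAtBeginning=}true);
    // here decrypted = plain
  finally
    aes.Free;
  end;
end;
</pre>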
<p>I just had an interesting discussion with Michael on <a href="https://gitlab.com/freepascal.org/fpc/source/-/commit/3229cb712e33374b85258aed43726058be633bed#note_734398698">FPC new gitlab platform</a>: the FPC RTL is gaining some official cryptography functions, and I proposed to use mORMot code base as reference, and to introduce some RTL wrapper functions which can redirect to a plain pascal FPC RTL version, or use another engines, like OpenSSL or mORMot, if available.</p>
<h5>Server-Side REST Notifications with mORMot</h5>
<p>The most powerful way of writing REST services is to define them via interfaces, then let the SOA/REST framework do all the routing, data marshalling and communication behind the scenes. One distinctive feature of mORMot is to define a method parameter as a notification interface, and let the server call back the client when needed, as with regular Delphi code. This session will present the benefit of defining REST services using interfaces, and how WebSockets can offer real-time notifications into your rich Delphi client applications.</p>
<p><a href="https://www.slideshare.net/ArnaudBouchez1/ekon25-mormot-2-serverside-notifications">mORMot 2 Server-Side Notifications on SlideShare</a></p>
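<p>In practice, such a callback service boils down to interface definitions like the following sketch. <code>IChatService</code>/<code>IChatCallback</code> and their GUIDs are illustrative, not taken from the slides:</p>
<pre>
type
  // implemented on the CLIENT side: the server keeps a reference to it,
  // and invokes it later over the WebSockets connection
  IChatCallback = interface(IInvokable)
    ['{5A7D0E32-4C11-4B8E-9D3A-6F2B1C8E7A90}']
    procedure NotifyMessage(const Sender, Msg: RawUtf8);
  end;

  // implemented on the SERVER side: note the callback passed as parameter,
  // exactly as in regular Delphi code
  IChatService = interface(IInvokable)
    ['{C92DF8B1-2E64-4A05-B5D3-7E1F0A9C6B24}']
    procedure Join(const UserName: RawUtf8; const Callback: IChatCallback);
    procedure SendMessage(const Sender, Msg: RawUtf8);
  end;

// server side: register the implementing class, then switch the HTTP
// server to the WebSockets protocol for bidirectional calls, e.g.
//   Server.ServiceDefine(TChatService, [IChatService], sicShared);
//   Http := TRestHttpServer.Create('8888', [Server], '+', useBidirSocket);
//   Http.WebSocketsEnable(Server, 'encryptionkey');
</pre>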
<p>Feedback is <a href="https://synopse.info/forum/viewtopic.php?id=6051">welcome on our forum</a>, as usual.</p>
<h2>mORMot 2 on Ampere AARCH64 CPU</h2>
<p>2021-08-17, by Arnaud Bouchez</p>
<p>In recent weeks, we have enhanced mORMot support for one of the most powerful AARCH64 CPUs available: the <a href="https://amperecomputing.com/">Ampere Altra CPU</a>, as made available on the <a href="https://www.oracle.com/cloud/compute/arm/">Oracle Cloud Infrastructure</a>.</p>
<p><img src="https://blog.synopse.info/public/blog/AmpereCPU.jpg" alt="" /></p>
<p>Long story short, this is amazing hardware to run on the server side, with performance close to what Intel/AMD offers, but with <a href="https://www.oracle.com/cloud/compute/arm/why-arm-processors/">almost linear multi-core scalability</a>. The FPC compiler is able to generate good code for it, and our mORMot 2 library is able to use the hardware-accelerated opcodes for AES, SHA2, and crc32/crc32c.</p> <h3>Always Free Ampere VM</h3>
<p>Back to the beginning. Tom, a mORMot user, reported on our forum that he successfully <a href="https://synopse.info/forum/viewtopic.php?id=5945">installed FPC and Lazarus on the Oracle Cloud platform</a>, and accessed it via SSH/XRDP:</p>
<blockquote><p>Just open account on Oracle Cloud and create new compute VM: 4 ARMv8.2 CPU 3GHz, 24GB Ram (yes 24GB).
This is always free VM (you can combine this 4 cores and 24GB Ram to 1 or many (4) VM).
Install Ubuntu 20.04 server, then install LXDE and XRdp for remote access.
Now I have nice speed workstation. Install fpcupdeluxe then fpc 3.2.2/laz 2.0.12, all OK. fpcup build is faster then my local pc build <img src="https://blog.synopse.info?pf=smile.svg" alt=":-)" class="smiley" />
This OCI VM can be great mormot application server for some projects. I don't have any connection to Oracle - just test their product.</p></blockquote>
<p>I did the same, and in fact this platform is really easy to work with, once you have paid a 1€ credit card fee to validate your account. Then you get an "Always Free VM", with 4 Ampere cores and 24GB of RAM. Amazing. The Oracle people really want to break into the cloud market, and they make it wide open for developers, so that they consider their Cloud instead of Microsoft's or Amazon's.</p>
<h3>FPC and Lazarus on Linux/AArch64</h3>
<p>The Lazarus experience is very good on this platform, even remotely. The only issue is the debugger: GDB was pretty unstable for me - almost as unstable as on Windows. But it is somewhat usable, until it crashes. <img src="https://blog.synopse.info?pf=sad.svg" alt=":(" class="smiley" /></p>
<p>Finally, and thanks to Alfred - our great friend behind <a href="https://github.com/LongDirtyAnimAlf/fpcupdeluxe">fpcupdeluxe</a> - we identified a problem with our asm stubs when calling mORMot interface-based services. It was in fact an FPC "feature" (it is documented as such in the compiler, so it is not a bug) in how arguments are passed as results in the AARCH64 calling ABI. Once identified, we added an explicit exception to circumvent the problem.</p>
<p>The quality of FPC generated code seems good - at least at the level of the x86_64 Intel/AMD platform. Not as good as gcc for sure, but good enough for production code, with good speed. The only big limitation is that inline assembly support is very restricted: only a few AARCH64 opcodes are available - only what was mandatory for basic FPC RTL needs.</p>
<h3>Tuning mORMot 2 for AArch64</h3>
<p>To enhance performance, we replaced the basic FPC RTL Move/FillChar functions by the libc memmove/memset. And performance is amazing:</p>
<pre>
FPC RTL
FillChar in 19.06ms, 20.3 GB/s
Move in 9.95ms, 1.5 GB/s
small Move in 15.85ms, 1.3 GB/s
big Move in 222.41ms, 1.7 GB/s
mORMot functions calling the Ubuntu Gnu libc:
FillCharFast in 8.84ms, 43.9 GB/s
MoveFast in 1.25ms, 12.4 GB/s
small MoveFast in 4.98ms, 4.3 GB/s
big MoveFast in 34.32ms, 11.3 GB/s
</pre>
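<p>The libc redirection itself is nothing magic. On FPC/Linux, the libc functions can be bound and used as drop-in replacements, along the lines of this simplified sketch (the real mORMot source handles more cases, so treat the declarations below as illustrative):</p>
<pre>
{$ifdef LINUX}
// bind the libc versions, which use hand-tuned SIMD code for each CPU
function memmove(dst, src: pointer; n: PtrUInt): pointer;
  cdecl; external 'c' name 'memmove';
function memset(dst: pointer; c: integer; n: PtrUInt): pointer;
  cdecl; external 'c' name 'memset';

procedure MoveFast(const source; var dest; count: PtrInt);
begin
  if count > 0 then
    memmove(@dest, @source, count); // memmove() handles overlapping buffers
end;

procedure FillCharFast(var dest; count: PtrInt; value: byte);
begin
  if count > 0 then
    memset(@dest, value, count);
end;
{$endif LINUX}
</pre>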
<p>In comparison, here are the numbers on my Core i5 7200U CPU, with mORMot's tuned x86_64 asm (faster than the FPC RTL), using SSE2 or AVX instructions:</p>
<pre>
FillCharFast [] in 21.43ms, 18.1 GB/s
MoveFast [] in 2.29ms, 6.8 GB/s
small MoveFast [] in 4.29ms, 5.1 GB/s
big MoveFast [] in 68.33ms, 5.7 GB/s
FillCharFast [cpuAVX] in 20.28ms, 19.1 GB/s
MoveFast [cpuAVX] in 2.26ms, 6.8 GB/s
small MoveFast [cpuAVX] in 4.25ms, 5.1 GB/s
big MoveFast [cpuAVX] in 69.93ms, 5.5 GB/s
</pre>
<p>So we can see that the Ampere CPU memory design is pretty efficient. It is up to twice as fast as a Core i5 7200U CPU.</p>
<p>We had to go further, and have some fun with one bottleneck of every server operation: encryption and hashes. So we wrote some C code to access the efficient HW-accelerated opcodes we wanted. You could find the source code in <a href="https://github.com/synopse/mORMot2/tree/master/res/static/armv8">the /res/static/armv8 sub-folder of our repository</a>. Now we have tremendous performance for AES, GCM, SHA2 and CRC32/CRC32C computation.</p>
<pre>
2500 crc32c in 259us i.e. 9.2M/s or 20 GB/s
2500 xxhash32 in 1.47ms i.e. 1.6M/s or 3.5 GB/s
2500 crc32 in 259us i.e. 9.2M/s or 20 GB/s
2500 adler32 in 469us i.e. 5M/s or 11 GB/s
2500 hash32 in 584us i.e. 4M/s or 8.9 GB/s
2500 md5 in 12.12ms i.e. 201.3K/s or 438.7 MB/s
2500 sha1 in 21.75ms i.e. 112.2K/s or 244.5 MB/s
2500 hmacsha1 in 23.81ms i.e. 102.5K/s or 223.4 MB/s
2500 sha256 in 3.41ms i.e. 714.7K/s or 1.5 GB/s
2500 hmacsha256 in 4.12ms i.e. 591.2K/s or 1.2 GB/s
2500 sha384 in 27.71ms i.e. 88K/s or 191.9 MB/s
2500 hmacsha384 in 32.69ms i.e. 74.6K/s or 162.7 MB/s
2500 sha512 in 27.73ms i.e. 88K/s or 191.8 MB/s
2500 hmacsha512 in 32.77ms i.e. 74.4K/s or 162.3 MB/s
2500 sha3_256 in 35.82ms i.e. 68.1K/s or 148.5 MB/s
2500 sha3_512 in 65.48ms i.e. 37.2K/s or 81.2 MB/s
2500 rc4 in 12.98ms i.e. 188K/s or 409.8 MB/s
2500 mormot aes-128-cfb in 8.84ms i.e. 276.1K/s or 601.7 MB/s
2500 mormot aes-128-ofb in 3.78ms i.e. 645K/s or 1.3 GB/s
2500 mormot aes-128-c64 in 4.39ms i.e. 555.6K/s or 1.1 GB/s
2500 mormot aes-128-ctr in 4.52ms i.e. 539.7K/s or 1.1 GB/s
2500 mormot aes-128-cfc in 9.16ms i.e. 266.3K/s or 580.4 MB/s
2500 mormot aes-128-ofc in 5.25ms i.e. 465K/s or 0.9 GB/s
2500 mormot aes-128-ctc in 5.74ms i.e. 425.1K/s or 0.9 GB/s
2500 mormot aes-128-gcm in 7.52ms i.e. 324.5K/s or 707.2 MB/s
2500 mormot aes-256-cfb in 9.52ms i.e. 256.2K/s or 558.4 MB/s
2500 mormot aes-256-ofb in 4.71ms i.e. 517.6K/s or 1.1 GB/s
2500 mormot aes-256-c64 in 5.30ms i.e. 460.5K/s or 0.9 GB/s
2500 mormot aes-256-ctr in 5.33ms i.e. 457.5K/s or 0.9 GB/s
2500 mormot aes-256-cfc in 10.04ms i.e. 243K/s or 529.5 MB/s
2500 mormot aes-256-ofc in 6.11ms i.e. 399K/s or 869.6 MB/s
2500 mormot aes-256-ctc in 6.77ms i.e. 360.4K/s or 785.5 MB/s
2500 mormot aes-256-gcm in 8.38ms i.e. 291.1K/s or 634.4 MB/s
2500 openssl aes-128-cfb in 4.94ms i.e. 493.4K/s or 1 GB/s
2500 openssl aes-128-ofb in 4.12ms i.e. 591.2K/s or 1.2 GB/s
2500 openssl aes-128-ctr in 1.94ms i.e. 1.2M/s or 2.6 GB/s
2500 openssl aes-128-gcm in 3.18ms i.e. 767K/s or 1.6 GB/s
2500 openssl aes-256-cfb in 5.83ms i.e. 418.5K/s or 912.1 MB/s
2500 openssl aes-256-ofb in 5.04ms i.e. 484.1K/s or 1 GB/s
2500 openssl aes-256-ctr in 2.42ms i.e. 0.9M/s or 2.1 GB/s
2500 openssl aes-256-gcm in 3.66ms i.e. 667K/s or 1.4 GB/s
2500 shake128 in 29.63ms i.e. 82.3K/s or 179.5 MB/s
2500 shake256 in 35.07ms i.e. 69.5K/s or 151.6 MB/s
</pre>
<p>Here are the numbers on my Core i5 7200U CPU, with optimized asm and the latest OpenSSL calls:</p>
<pre>
2500 crc32c in 224us i.e. 10.6M/s or 23.1 GB/s
2500 xxhash32 in 817us i.e. 2.9M/s or 6.3 GB/s
2500 crc32 in 341us i.e. 6.9M/s or 15.2 GB/s
2500 adler32 in 241us i.e. 9.8M/s or 21.5 GB/s
2500 hash32 in 441us i.e. 5.4M/s or 11.7 GB/s
2500 aesnihash in 218us i.e. 10.9M/s or 23.8 GB/s
2500 md5 in 8.29ms i.e. 294.1K/s or 641.1 MB/s
2500 sha1 in 13.72ms i.e. 177.8K/s or 387.5 MB/s
2500 hmacsha1 in 15.05ms i.e. 162.1K/s or 353.3 MB/s
2500 sha256 in 17.40ms i.e. 140.2K/s or 305.6 MB/s
2500 hmacsha256 in 18.71ms i.e. 130.4K/s or 284.2 MB/s
2500 sha384 in 11.59ms i.e. 210.5K/s or 458.9 MB/s
2500 hmacsha384 in 13.84ms i.e. 176.3K/s or 384.2 MB/s
2500 sha512 in 11.59ms i.e. 210.5K/s or 458.8 MB/s
2500 hmacsha512 in 13.89ms i.e. 175.7K/s or 382.9 MB/s
2500 sha3_256 in 26.66ms i.e. 91.5K/s or 199.5 MB/s
2500 sha3_512 in 47.96ms i.e. 50.9K/s or 110.9 MB/s
2500 rc4 in 14.05ms i.e. 173.7K/s or 378.6 MB/s
2500 mormot aes-128-cfb in 4.59ms i.e. 530.9K/s or 1.1 GB/s
2500 mormot aes-128-ofb in 4.52ms i.e. 539.4K/s or 1.1 GB/s
2500 mormot aes-128-c64 in 6.23ms i.e. 391.7K/s or 853.7 MB/s
2500 mormot aes-128-ctr in 1.40ms i.e. 1.6M/s or 3.6 GB/s
2500 mormot aes-128-cfc in 4.75ms i.e. 513.2K/s or 1 GB/s
2500 mormot aes-128-ofc in 5.22ms i.e. 467.7K/s or 0.9 GB/s
2500 mormot aes-128-ctc in 1.72ms i.e. 1.3M/s or 3 GB/s
2500 mormot aes-128-gcm in 2.28ms i.e. 1M/s or 2.2 GB/s
2500 mormot aes-256-cfb in 6.12ms i.e. 398.4K/s or 868.3 MB/s
2500 mormot aes-256-ofb in 6.10ms i.e. 400K/s or 871.7 MB/s
2500 mormot aes-256-c64 in 7.86ms i.e. 310.6K/s or 676.9 MB/s
2500 mormot aes-256-ctr in 1.82ms i.e. 1.3M/s or 2.8 GB/s
2500 mormot aes-256-cfc in 6.36ms i.e. 383.5K/s or 835.9 MB/s
2500 mormot aes-256-ofc in 6.77ms i.e. 360.1K/s or 784.8 MB/s
2500 mormot aes-256-ctc in 2.02ms i.e. 1.1M/s or 2.5 GB/s
2500 mormot aes-256-gcm in 2.68ms i.e. 909.2K/s or 1.9 GB/s
2500 openssl aes-128-cfb in 7.11ms i.e. 342.9K/s or 747.3 MB/s
2500 openssl aes-128-ofb in 5.21ms i.e. 468K/s or 1 GB/s
2500 openssl aes-128-ctr in 1.54ms i.e. 1.5M/s or 3.3 GB/s
2500 openssl aes-128-gcm in 1.85ms i.e. 1.2M/s or 2.8 GB/s
2500 openssl aes-256-cfb in 8.65ms i.e. 282.2K/s or 615 MB/s
2500 openssl aes-256-ofb in 6.82ms i.e. 357.6K/s or 779.3 MB/s
2500 openssl aes-256-ctr in 1.93ms i.e. 1.2M/s or 2.6 GB/s
2500 openssl aes-256-gcm in 2.27ms i.e. 1M/s or 2.2 GB/s
2500 shake128 in 23.47ms i.e. 104K/s or 226.6 MB/s
2500 shake256 in 29.64ms i.e. 82.3K/s or 179.5 MB/s
</pre>
<p>The mORMot plain pascal code is used for MD5, SHA1, or shake/SHA3. So it is slower than our optimized asm for Intel/AMD. But not so slow. And those algorithms are either deprecated or not widely used - therefore they are not a bottleneck. OpenSSL numbers are pretty good too on this platform. As a result, AES, GCM, SHA-2 and crc32/crc32c performance is comparable between AARCH64 and Intel/AMD. With amazing SHA-2 numbers.</p>
<p>Then, we compiled the latest SQLite3, Lizard and libdeflate as static libraries, so that you could use them within your executable, with no external dependency. Performance is very good:</p>
<pre>
TAlgoSynLZ 3.8 MB->2 MB: comp 287:151MB/s decomp 215:409MB/s
TAlgoLizard 3.8 MB->1.9 MB: comp 18:9MB/s decomp 857:1667MB/s
TAlgoLizardFast 3.8 MB->2.3 MB: comp 193:116MB/s decomp 1282:2135MB/s
TAlgoLizardHuffman 3.8 MB->1.8 MB: comp 84:40MB/s decomp 394:827MB/s
TAlgoDeflate 3.8 MB->1.5 MB: comp 30:12MB/s decomp 78:196MB/s
TAlgoDeflateFast 3.8 MB->1.6 MB: comp 48:20MB/s decomp 73:174MB/s
</pre>
<p>I was a bit surprised by how well the pure pascal version of the SynLZ algorithm runs on AARCH64, once compiled with FPC 3.2. The Deflate compression also has a small advantage from using our statically linked libdeflate instead of the plain zlib. But the very good news is that Lizard is really fast on AARCH64: even though it is written in plain C with no manual SIMD/asm code, it performs very well on non Intel/AMD platforms - more than 2GB/s for decompression is very high. I was told that Lizard may be a bit behind ZStandard on Intel/AMD, but its code is simpler, and much more CPU agnostic.</p>
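<p>The compression-versus-speed trade-off above can be felt with any deflate implementation. As a minimal illustration - using Python's stdlib <code>zlib</code> as a stand-in for the engines benchmarked here, with an invented payload - higher levels trade CPU time for (at best) smaller output:</p>

```python
import zlib

# Illustrative only: stdlib zlib standing in for the deflate engines above.
# The payload is invented: ~100 KB of repetitive text, which compresses well.
data = b"mORMot benchmark payload " * 4000

sizes = {}
for level in (1, 6, 9):                          # fast, default, best
    compressed = zlib.compress(data, level)
    sizes[level] = len(compressed)
    assert zlib.decompress(compressed) == data   # lossless round-trip

# A higher level never compresses worse on this payload, but costs more CPU.
assert sizes[9] <= sizes[1]
```

<p>Running something like this with real payloads (JSON, logs, binary) is how the comp/decomp ratios in the table above are usually compared.</p>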
<pre>
2.4. Sqlite file memory map:
- Database direct access: 22,264 assertions passed 55.40ms
- Virtual table direct access: 12 assertions passed 347us
- TOrmTableJson: 144,083 assertions passed 60.25ms
- TRestClientDB: 608,196 assertions passed 783.02ms
- Regexp function: 6,015 assertions passed 11.07ms
- TRecordVersion: 20,060 assertions passed 51.28ms
Total failed: 0 / 800,630 - Sqlite file memory map PASSED 961.45ms
</pre>
<p>Here the SQLite3 numbers are similar to what I get on Intel/AMD. So we could really consider using this database as the storage back-end for mORMot MicroServices with their stand-alone persistence layer.</p>
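<p>The embedded-storage pattern suggested here - a stand-alone SQLite3 database inside each microservice - can be sketched with any SQLite binding. This illustration uses Python's stdlib <code>sqlite3</code> module (the table and column names are invented); note the batched inserts inside a single transaction, which is how high insert rates are reached:</p>

```python
import sqlite3

# In-memory SQLite database, standing in for a microservice's embedded storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")

# Batched inserts inside one transaction are the key to high insert rates:
# committing per-row would hit the journal/fsync path for every record.
with conn:
    conn.executemany(
        "INSERT INTO people(name) VALUES (?)",
        [(f"user{i}",) for i in range(1000)],
    )

count, = conn.execute("SELECT COUNT(*) FROM people").fetchone()
assert count == 1000
```

<p>The same idea applies whatever the language: one embedded file (or memory) database per service, no external DB server to deploy.</p>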
<h3>Ampere and Beyond - Apple M1?</h3>
<p>We also tried to support ARM/AARCH64 CPUs as much as possible with mORMot 2. The framework now detects the CPU type and hardware platform it runs on, especially on Linux or Android - which is also an AARCH64 platform. Here is what our regression tests report at their end:</p>
<pre>
Ubuntu 20.04.2 LTS - Linux 5.8.0-1037-oracle (cp utf8)
2 x ARM Neoverse-N1 (aarch64)
on QEMU KVM Virtual Machine virt-4.2
Using mORMot 2.0.1
TSqlite3LibraryStatic 3.36.0 with internal MM
Generated with: Free Pascal 3.2 64 bit Linux compiler
Time elapsed for all tests: 44.38s
Performed 2021-08-17 13:44:09 by ubuntu on lxde
Total assertions failed for all test suites: 0 / 66,050,607
</pre>
<p>As you can see, the CPU was properly identified as <a href="https://www.arm.com/products/silicon-ip-cpu/neoverse/neoverse-n1">ARM Neoverse-N1</a>.</p>
<p>Thanks to the FPC (cross-)compiler, we can reasonably expect <em>mORMot</em> code to run on an Apple M1/M1X/M2 CPU - once we get access to this hardware. Any feedback is welcome.</p>
<h3>Server Process Performance</h3>
<p>All regression tests do pass whole green, with pretty consistent performance among all its various tasks. JSON process, ORM, SOA or encryption: everything flies on the Ampere CPU. You can check <a href="https://gist.github.com/synopse/0e7275684a2e2bbd2206940c3827055c">the detailed regression tests console output</a>.</p>
<p>Here are some numbers about UTF-8 or JSON process:</p>
<pre>
StrLen() in 1.43ms, 13.3 GB/s
IsValidUtf8(RawUtf8) in 11.75ms, 1.6 GB/s
IsValidUtf8(PUtf8Char) in 13.08ms, 1.4 GB/s
IsValidJson(RawUtf8) in 22.84ms, 858.2 MB/s
IsValidJson(PUtf8Char) in 22.93ms, 854.7 MB/s
JsonArrayCount(P) in 22.97ms, 853.1 MB/s
JsonArrayCount(P,PMax) in 22.89ms, 856.4 MB/s
JsonObjectPropCount() in 11.90ms, 0.9 GB/s
jsonUnquotedPropNameCompact in 72.35ms, 240.6 MB/s
jsonHumanReadable in 119.06ms, 209.4 MB/s
TDocVariant in 245.99ms, 79.7 MB/s
TDocVariant no guess in 260.57ms, 75.2 MB/s
TDocVariant dvoInternNames in 247.56ms, 79.1 MB/s
TOrmTableJson GetJsonValues in 34.88ms, 247.1 MB/s
TOrmTableJson expanded in 42.70ms, 459 MB/s
TOrmTableJson not expanded in 21.42ms, 402.4 MB/s
DynArrayLoadJson in 87.96ms, 222.8 MB/s
TOrmPeopleObjArray in 131.10ms, 149.5 MB/s
fpjson in 115.09ms, 17 MB/s
</pre>
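<p>As a rough idea of how such MB/s figures are obtained, here is a hedged micro-benchmark sketch in Python (the payload shape and sizes are invented, and timings are machine-dependent): divide the byte size of the input by the elapsed parsing time:</p>

```python
import json
import time

# Hypothetical payload: a JSON array of small objects, similar in spirit to
# the table data parsed above (names and sizes here are invented).
payload = json.dumps([{"id": i, "name": f"row{i}"} for i in range(10000)])
raw = payload.encode("utf-8")

start = time.perf_counter()
rows = json.loads(payload)
elapsed = time.perf_counter() - start

# Throughput = bytes processed / wall-clock time.
mb_per_s = len(raw) / (1024 * 1024) / elapsed
assert len(rows) == 10000
print(f"{mb_per_s:.0f} MB/s")
```

<p>Real benchmarks, including the ones above, repeat the measured loop many times and report the best or average run to smooth out scheduler noise.</p>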
<p>It is nice to see that our pascal code, which has been deeply tuned to let FPC generate the best possible x86_64 assembly, also gives very good performance on AARCH64. There is no need to write dedicated code and pollute the source with plenty of <em>$ifdef/$endif</em>: x86_64 is already a RISC-like architecture in practice, with a large register set and efficient 64-bit processing, so nothing had to be rewritten. Optimized pascal code, with tuned pointer arithmetic, is platform neutral. I like the quote of the SQLite3 author saying that <a href="https://www.sqlite.org/whyc.html">C is a "portable assembly"</a>: likewise, we can use tuned pascal code, as we try to do in the mORMot core units, to leverage modern CPU hardware, without having to fight against whatever language is hyped this year.</p>
<h3>Asm is Fun Again</h3>
<p>So we are pretty excited to see how this platform evolves in the future. mORMot has invested a lot of time, refactoring and asm tuning to leverage the Intel/AMD platform, focusing on server-side performance. But this AARCH64 technology is really promising, and its RISC instruction set was very cleverly designed: rich and powerful, almost perfect in its balance between power and expressiveness. By comparison, x86_64 has a lot of inconsistencies and looks outdated when you put both asm dialects side by side. After decades of playing with i386 and x86_64 asm, I had fun again with the ARMv8 assembly - it tastes like "assembly as it should be" (tm). Linking some static C code is a good balance between leveraging the hardware when needed, and keeping platform-independent pascal source. And FPC, as a compiler, is amazing: open and well engineered on so many CPUs and platforms. Open Source rocks!</p>
<p>As usual, <a href="https://synopse.info/forum/viewtopic.php?pid=35602#p35602">feedback is welcome on our forum</a>.</p>Enhanced Faster ZIP Support in mORMot 2urn:md5:194b963ed6b1488b09f09d891354ed702021-05-08T08:49:00+01:002021-05-08T08:46:27+01:00Arnaud BouchezmORMot Framework64bitavxAVX2blogcompressioncrccrc32cCrossPlatformDelphilibdeflatemORMotmORMot2performancezip<p>The <code>.zip</code> format dates from last century, back to the <a href="https://en.wikipedia.org/wiki/DOS">early DOS days</a>, but can still be found everywhere. It even hides behind a <code>.docx</code> document, a <code>.jar</code> application, or any Android app!<br />
It is therefore (ab)used not only as an archive format, but also as an application file format / container - even if in this respect <a href="https://sqlite.org/appfileformat.html">using SQLite3 may make much more sense</a>.</p>
<p><img src="https://blog.synopse.info?post/public/blog/zipfile.jpg" alt="" /></p>
<p>We recently enhanced our <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.zip.pas">mormot.core.zip.pas</a> unit:</p>
<ul>
<li>to support Zip64,</li>
<li>with enhanced <code>.zip</code> read/write,</li>
<li>to have a huge performance boost during its process,</li>
<li>and to integrate better with signed executables.</li>
</ul> <h3>Zip64 Support for Huge Files</h3>
<p>First of all, our unit now supports the long-awaited Zip64 extension. In a nutshell, it allows storing files bigger than 4GB, or a total <code>.zip</code> bigger than 4GB - the maximum size a 32-bit field can store.</p>
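<p>For comparison, the same Zip64 capability exists in Python's stdlib <code>zipfile</code> module, where it is controlled by the <code>allowZip64</code> flag - a small sketch, unrelated to the mORMot implementation:</p>

```python
import io
import zipfile

# Zip64 with Python's stdlib: allowZip64 (True by default) lets archives and
# members exceed the classic 4 GB / 65535-entry limits of 32-bit zip fields.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as z:
    z.writestr("readme.txt", b"small member, but Zip64 records are allowed")

with zipfile.ZipFile(buf) as z:
    assert z.testzip() is None                      # integrity check of members
    assert z.read("readme.txt").startswith(b"small member")
```

<p>With <code>allowZip64=False</code>, the library would instead raise an error as soon as a member crosses the 32-bit limits.</p>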
<h3>TZipRead TZipWrite Enhancements</h3>
<p>The high-level <code>TZipRead</code> and <code>TZipWrite</code> classes were deeply refactored and enhanced. Not only has Zip64 support been added: reading can now ignore and skip some files - a very efficient way of deleting files in a <code>.zip</code>. Some additional methods have been introduced, e.g. to quickly validate a <code>.zip</code> file's integrity. Cross-platform support has been enhanced as well.</p>
<p>The previous <code>TZipRead</code> used to map the whole <code>.zip</code> file into memory. That was convenient for small content, but huge files won't fit into the memory of a Win32 application: you could not use Zip64 from 32-bit executables - not very convenient for sure! And memory-mapped files are typically slower than explicit Seek/Read calls, since the Kernel is involved to handle page faults and read the data from disk. This had to be fixed.<br />
Now a memory buffer size is specified to the <code>TZipRead.Create</code> constructors; this buffer holds the last bytes of the <code>.zip</code> file, where the directory header appears, and is parsed very efficiently at opening. Actual content decompression then uses regular Seek/Read calls, only when needed. Of course, if the data is already available in the memory buffer - which is the case for the last files, or for smaller <code>.zip</code> archives - it is taken from there. This new approach is a very reasonable implementation - typically faster than the other zip libraries I have seen, and than our previous code.</p>
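<p>The "read only the tail" idea can be sketched independently of mORMot: the End Of Central Directory (EOCD) record sits in the last 22 bytes of the file, plus an optional comment of up to 64 KB, so scanning just that tail is enough to locate the whole directory. A hedged Python illustration:</p>

```python
import io
import struct
import zipfile

# Build a small zip in memory, then find its End Of Central Directory (EOCD)
# record by scanning only the tail - the same idea as reading just the last
# bytes of the file instead of memory-mapping the whole archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", b"alpha")
    z.writestr("b.txt", b"beta")
data = buf.getvalue()

EOCD_SIG = b"PK\x05\x06"                 # 0x06054b50, little-endian on disk
tail = data[-(65536 + 22):]              # EOCD is 22 bytes + comment (<= 64 KB)
pos = tail.rfind(EOCD_SIG)               # last occurrence = the real EOCD
assert pos >= 0

# Fixed EOCD fields: total entry count (offset +10) and central dir offset (+16).
total_entries, = struct.unpack_from("<H", tail, pos + 10)
cd_offset, = struct.unpack_from("<I", tail, pos + 16)
assert total_entries == 2
assert data[cd_offset:cd_offset + 4] == b"PK\x01\x02"  # central dir file header
```

<p>From the central directory offset, each member's local header can then be reached with plain seeks - no need to have the whole archive in memory.</p>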
<h3>LibDeflate Support</h3>
<p>Perhaps the main change of this refactoring is the <a href="https://github.com/ebiggers/libdeflate">libdeflate library</a> integration. It is a library for fast, whole-buffer DEFLATE-based compression and decompression. In practice, when working on memory buffers (not streams), it is able to leverage very efficient ASM code for modern CPUs (like AVX), making it much faster than any other zlib implementation around. If streams are involved - e.g. when decompressing huge files - we fall back to the regular zlib code.</p>
<p>The libdeflate implementation of the crc and adler checksums is astonishing: on my Intel Core i5, <code>crc()</code> went from 800MB/s to 10GB/s. This crc is used for <code>.zip</code> file checksums, so it really helps.<br />
Compression and decompression are also almost twice as fast as regular zlib, thanks to a full rewrite of the deflate engine, targeting modern CPUs, and using tuned asm for the bottlenecks.<br />
Last but not least, you can use higher compression levels - regular zlib understands levels from 0 (stored) to 9 (slowest), but libdeflate accepts 10..12 for even higher compression - at the expense of compression speed, which becomes very slow; decompression stays on par with the other levels.</p>
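<p>The incremental crc32 usage that <code>.zip</code> checksumming relies on can be illustrated with Python's stdlib <code>zlib</code> (which wraps plain zlib, not libdeflate, so expect the slower speed class mentioned above): feeding chunks with a running value yields the same checksum as one whole-buffer call:</p>

```python
import zlib

# crc32 as used for .zip member checksums: incremental updates over chunks
# produce the same value as a single pass over the whole buffer.
data = b"mORMot" * 1000

whole = zlib.crc32(data)
crc = 0
for i in range(0, len(data), 256):       # feed the data in 256-byte chunks
    crc = zlib.crc32(data[i:i + 256], crc)
assert crc == whole
```

<p>This chunked form is what stream-oriented zip writers use, since the whole member is rarely in memory at once.</p>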
<p>We statically linked libdeflate, so you don't need any external library. Sadly, it is currently available for FPC only, since Delphi static linking is an incompatible mess.</p>
<p>Note that libdeflate will be used anywhere in <em>mORMot</em> where deflate/zip buffer compression is involved, so for instance regular HTTP/HTTPS on-the-fly <code>gzip</code> compression will be much faster, and even some unexpected part of the framework would benefit from it - e.g. our default RESTful URI authentication used the zlib <code>crc()</code> for its online checksum, so each REST request is slightly faster.</p>
<h3>Integrated to Signed Executables</h3>
<p>The last enhancement is the ability to append a <code>.zip</code> content to an existing "signed" executable. Since <em>mORMot 1</em>, the framework has been able to find and read any <code>.zip</code> content appended to an executable. But if you digitally signed this executable, you had to re-sign it after appending - not very convenient, e.g. when you build a custom <code>Setup.exe</code>.</p>
<p>We added some functions to include the <code>.zip</code> content within the signature itself, allowing to store some additional data or configuration in a convenient format, without having to sign the executable again.</p>
<h3>Use the Source, Luke!</h3>
<p>Check the <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.zip.pas">mormot.core.zip.pas</a> unit in our Open Source repository!</p>Fastest AES-PRNG, AES-CTR and AES-GCM Delphi implementationurn:md5:1f6861f4711b9a7207a32f4bbcadde7e2021-02-13T09:11:00+00:002021-02-22T08:37:03+00:00Arnaud BouchezmORMot Framework64bitAESAES-CTRAES-GCMAES-NiasmblogCrossPlatformDelphiFreePascalmORMot2performancesse4<p>Last week, I committed new ASM implementations of our AES-PRNG, AES-CTR and AES-GCM for <em>mORMot 2</em>.<br />
They handle eight 128-bit blocks at once in an interleaved fashion, as permitted by the CTR chaining mode. The AES-NI opcodes (<code>aesenc aesenclast</code>) are used for the AES rounds, and the GMAC of the AES-GCM mode is computed using the <code>pclmulqdq</code> opcode.</p>
<p><img src="https://blog.synopse.info?post/public/blog/aesalgo.png" alt="" /></p>
<p>Resulting performance is amazing: on my simple Core i3, I reach 2.6 GB/s for <code>aes-128-ctr</code>, and 1.5 GB/s for <code>aes-128-gcm</code> for instance - the first being actually faster than OpenSSL!</p> <p>AES-CTR is the basic chaining mode used for:</p>
<ul>
<li>AES-CTR as defined by the <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_(CTR)">NIST standard</a> - see our <code>TAesPrngNist</code> class;</li>
<li>AES-GCM which includes a 128-bit GMAC using in <a href="https://en.wikipedia.org/wiki/Galois/Counter_Mode">Galois/Counter Mode</a> - see the <code>TAesGcm</code> class;</li>
<li>and our AES-based <a href="https://en.wikipedia.org/wiki/Pseudorandom_number_generator">Pseudo Random Number Generator</a> (PRNG) as implemented by our <code>TAesPrng</code> class.</li>
</ul>
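<p>Why can CTR mode process eight blocks at once? Because each keystream block depends only on the key and a counter, never on the previous ciphertext. The following toy sketch (Python, with SHA-256 standing in for AES since the stdlib has no AES - purely illustrative, not secure, and not compatible with real AES-CTR) shows that structure:</p>

```python
import hashlib

def ctr_keystream_block(key: bytes, counter: int) -> bytes:
    # Stand-in PRF: SHA-256(key || 128-bit big-endian counter). Real CTR mode
    # uses an AES block encryption here; this only illustrates the chaining
    # structure, not the actual cipher.
    return hashlib.sha256(key + counter.to_bytes(16, "big")).digest()

def ctr_crypt(key: bytes, data: bytes) -> bytes:
    # Each keystream block depends only on (key, counter), never on previous
    # ciphertext - which is exactly why 8 blocks can be computed in parallel.
    out = bytearray()
    for counter in range((len(data) + 31) // 32):   # 32-byte stand-in blocks
        block = ctr_keystream_block(key, counter)
        chunk = data[counter * 32:(counter + 1) * 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)

key = b"0123456789abcdef"
msg = b"CTR mode: encryption and decryption are the same XOR operation"
enc = ctr_crypt(key, msg)
assert enc != msg
assert ctr_crypt(key, enc) == msg        # symmetric: the same function decrypts
```

<p>In the interleaved asm, the loop body computes eight such independent blocks per iteration, which keeps all the AES-NI execution units busy instead of waiting on each block's latency.</p>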
<h3>mORMot 2</h3>
<p>For <em>mORMot 2</em>, we refactored the <code>SynCrypto.pas</code> unit into <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.crypto.pas"><code>mormot.core.crypto.pas</code></a>:</p>
<ul>
<li>All assembly code has been moved to dedicated <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.crypto.asmx86.inc"><code>mormot.core.crypto.asmx86.inc</code></a> and <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.crypto.asmx64.inc"><code>mormot.core.crypto.asmx64.inc</code></a> include files;</li>
<li>A generic catalog of AES algorithms has been implemented, which allows searching them by name (e.g. <code>'aes-128-ctr'</code>), and also switching to the fastest implementation available, e.g. if OpenSSL is enabled;</li>
<li>The regression tests have been enhanced, to include validation against test vectors for all modes, and comparison with the OpenSSL reference implementation;</li>
<li>A lot of low-level optimizations have been applied, especially targeting x86_64, which is now (sometimes much) faster than the original, heavily tuned i386 code - in fact, we focus on x86_64 as our main target for high-end Linux services compiled with FPC;</li>
<li>Still as a stand-alone Delphi/FPC unit, with no external <code>.dll</code> to download, search and load.</li>
</ul>
<p>Here are some numbers, extracted from the unit comments, run from several types of blocks (not only huge buffers), during regression tests.</p>
<h3>AES-CTR</h3>
<p>On x86_64 we use a 8*128-bit interleaved optimized asm:</p>
<ul>
<li><strong>mormot aes-128-ctr</strong> in 1.99ms i.e. 1254390/s or <strong>2.6 GB/s</strong></li>
<li><strong>mormot aes-256-ctr</strong> in 2.64ms i.e. 945179/s or <strong>1.9 GB/s</strong></li>
</ul>
<p>It is <em>actually faster than OpenSSL 1.1.1</em> in our benchmarks:</p>
<ul>
<li>openssl aes-128-ctr in 2.23ms i.e. 1121076/s or 2.3 GB/s</li>
<li>openssl aes-256-ctr in 2.80ms i.e. 891901/s or 1.8 GB/s</li>
</ul>
<p>As reference, optimized but not interleaved OFB asm is 3 times slower:</p>
<ul>
<li>mormot aes-128-ofb in 6.88ms i.e. 363002/s or 772.5 MB/s</li>
<li>mormot aes-256-ofb in 9.37ms i.e. 266808/s or 567.8 MB/s</li>
</ul>
<p>On i386, numbers are slower for our classes, which are not interleaved:</p>
<ul>
<li>mormot aes-128-ctr in 10ms i.e. 249900/s or 531.8 MB/s</li>
<li>mormot aes-256-ctr in 12.47ms i.e. 200368/s or 426.4 MB/s</li>
<li>openssl aes-128-ctr in 3.01ms i.e. 830288/s or 1.7 GB/s</li>
<li>openssl aes-256-ctr in 3.52ms i.e. 709622/s or 1.4 GB/s</li>
</ul>
<h3>AES-GCM</h3>
<p>On x86_64, our TAesGcm class is 8x interleaved for both GMAC and AES-CTR:</p>
<ul>
<li>mormot <strong>aes-128-gcm</strong> in 3.45ms i.e. 722752/s or <strong>1.5 GB/s</strong></li>
<li>mormot <strong>aes-256-gcm</strong> in 4.11ms i.e. 607385/s or <strong>1.2 GB/s</strong></li>
</ul>
<p><em>OpenSSL is faster</em> since it performs GMAC and AES-CTR in a single pass:</p>
<ul>
<li>openssl aes-128-gcm in 2.86ms i.e. 874125/s or 1.8 GB/s</li>
<li>openssl aes-256-gcm in 3.43ms i.e. 727590/s or 1.5 GB/s</li>
</ul>
<p>On i386, numbers are much lower, since it lacks interleaved asm - but they are still faster than any other Delphi alternative:</p>
<ul>
<li>mormot aes-128-gcm in 15.86ms i.e. 157609/s or 335.4 MB/s</li>
<li>mormot aes-256-gcm in 18.23ms i.e. 137083/s or 291.7 MB/s</li>
<li>openssl aes-128-gcm in 5.49ms i.e. 455290/s or 0.9 GB/s</li>
<li>openssl aes-256-gcm in 6.11ms i.e. 408630/s or 869.6 MB/s</li>
</ul>
<h3>Other AES modes</h3>
<p>As you may see from the recent commits, and the numbers in the source code, almost all of our AES classes (e.g. OFB and CFB) have had their performance enhanced, sometimes by a large margin.</p>
<p>A new <code>TAesCtrCrc</code> class has been added. It combines AES-CTR with 4 parallel <code>crc32c</code> checksums, of both the encrypted and the decrypted content.<br />
It results in an <a href="https://en.wikipedia.org/wiki/Authenticated_encryption">AEAD algorithm</a> with 256-bit of associated authentication, which outperforms AES-GCM in our implementation.</p>
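<p>The encrypt-plus-dual-checksum idea behind <code>TAesCtrCrc</code> can be sketched as follows - a toy Python model where SHA-256 replaces AES and a plain <code>crc32</code> replaces the 4 parallel crc32c slices, so this illustrates the shape of the scheme only, not its actual security or wire format:</p>

```python
import hashlib
import zlib

def keystream(key: bytes, n: int) -> bytes:
    # CTR-style keystream with SHA-256 as a stand-in PRF (not real AES-CTR).
    out = bytearray()
    counter = 0
    while len(out) < n:
        out.extend(hashlib.sha256(key + counter.to_bytes(16, "big")).digest())
        counter += 1
    return bytes(out[:n])

def seal(key: bytes, plain: bytes) -> bytes:
    # Encrypt, then append checksums of BOTH the plain and encrypted content.
    enc = bytes(p ^ k for p, k in zip(plain, keystream(key, len(plain))))
    tag = (zlib.crc32(plain).to_bytes(4, "little")
           + zlib.crc32(enc).to_bytes(4, "little"))
    return enc + tag

def open_(key: bytes, sealed: bytes) -> bytes:
    enc, tag = sealed[:-8], sealed[-8:]
    plain = bytes(c ^ k for c, k in zip(enc, keystream(key, len(enc))))
    expected = (zlib.crc32(plain).to_bytes(4, "little")
                + zlib.crc32(enc).to_bytes(4, "little"))
    if tag != expected:
        raise ValueError("authentication failed")
    return plain

key, msg = b"secret-key", b"authenticated encryption sketch"
sealed = seal(key, msg)
assert open_(key, sealed) == msg
tampered = bytes([sealed[0] ^ 0xFF]) + sealed[1:]
try:
    open_(key, tampered)
    raise AssertionError("tampering must be detected")
except ValueError:
    pass                                  # corrupted ciphertext was rejected
```

<p>The real class checksums the data while it streams through the CTR loop, so the tag costs almost nothing on top of the encryption itself.</p>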
<p>On x86_64 we use a 8*128-bit interleaved optimized asm:</p>
<ul>
<li><strong>mormot aes-128-ctc</strong> in 2.58ms i.e. 967492/s or <strong>2 GB/s</strong></li>
<li><strong>mormot aes-256-ctc</strong> in 3.13ms i.e. 797702/s or <strong>1.6 GB/s</strong></li>
</ul>
<p>(to be compared with the CTR without 256-bit crc32c MAC computation above at 2.6 GB/s and 1.9GB/s)</p>
<p>On i386, numbers are lower, because the implementation is not interleaved:</p>
<ul>
<li>mormot aes-128-ctc in 9.76ms i.e. 256068/s or 544.9 MB/s</li>
<li>mormot aes-256-ctc in 12.14ms i.e. 205930/s or 438.2 MB/s</li>
</ul>
<p>For internal communication, e.g. for our WebSockets services, it is a very good algorithm, especially for small messages, since it needs less warmup than AES-GCM.</p>
<p>Here are some numbers of our ECDHE stream protocol:</p>
<ul>
<li>efAesCrc128 in 1.57ms i.e. 63,331/s, aver. 15us, 1.1 GB/s</li>
<li>efAesCfb128 in 1.66ms i.e. 60,060/s, aver. 16us, 1 GB/s</li>
<li>efAesOfb128 in 2.52ms i.e. 39,588/s, aver. 25us, 729.9 MB/s</li>
<li>efAesCtr128 in 851us i.e. 117,508/s, aver. 8us, 2.1 GB/s</li>
<li>efAesCbc128 in 2.93ms i.e. 34,059/s, aver. 29us, 628 MB/s</li>
<li>efAesCrc256 in 2.13ms i.e. 46,926/s, aver. 21us, 865.2 MB/s</li>
<li>efAesCfb256 in 2.20ms i.e. 45,330/s, aver. 22us, 835.8 MB/s</li>
<li>efAesOfb256 in 3.38ms i.e. 29,507/s, aver. 33us, 544 MB/s</li>
<li>efAesCtr256 in 1.09ms i.e. 91,659/s, aver. 10us, 1.6 GB/s</li>
<li>efAesCbc256 in 3.33ms i.e. 30,012/s, aver. 33us, 553.3 MB/s</li>
<li>efAesGcm128 in 790us i.e. 126,582/s, aver. 7us, 2.2 GB/s</li>
<li>efAesGcm256 in 987us i.e. 101,317/s, aver. 9us, 1.8 GB/s</li>
<li><strong>efAesCtc128</strong> in 820us i.e. 121,951/s, aver. 8us, <strong>2.1 GB/s</strong></li>
<li><strong>efAesCtc256</strong> in 985us i.e. 101,522/s, aver. 9us, <strong>1.8 GB/s</strong></li>
</ul>
<p>Note that</p>
<ul>
<li>those numbers don't exactly match the other benchmarks, because we don't measure the raw AES encryption performance, but the whole encapsulation in the WebSockets frames protocol, and we test another set of message sizes;</li>
<li>the <code>efAesGcm128</code>/<code>efAesGcm256</code> numbers above automatically used the OpenSSL library on my Ubuntu laptop, since it is faster than our <code>TAesGcm</code> class - so when you don't have OpenSSL installed (which is sometimes tricky on Windows), you could rely on <code>efAesCtc128</code> as the WebSockets asymmetric encryption protocol.</li>
</ul>
<h3>AES PRNG</h3>
<p>As I wrote above, our <code>TAesPrng</code> class uses internally the AES-CTR mode to generate its random output stream.<br />
The newly introduced asm was very beneficial to its 256-bit AES generator, in terms of performance.</p>
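<p>The principle of such a CTR-based PRNG is simple: encrypt an ever-incrementing counter under a secret seed and output the resulting blocks. A toy Python sketch (SHA-256 standing in for AES-CTR; the class and method names are invented) shows the deterministic-stream behavior:</p>

```python
import hashlib

class CtrPrng:
    # Counter-mode PRNG sketch: "encrypt" an incrementing counter under a
    # fixed seed to produce the output stream. SHA-256 stands in for AES-CTR;
    # the real TAesPrng also reseeds itself from system entropy periodically.
    def __init__(self, seed: bytes):
        self.seed = seed
        self.counter = 0

    def fill_random(self, n: int) -> bytes:
        out = bytearray()
        while len(out) < n:
            out.extend(hashlib.sha256(
                self.seed + self.counter.to_bytes(16, "big")).digest())
            self.counter += 1
        return bytes(out[:n])

a = CtrPrng(b"seed-1")
b = CtrPrng(b"seed-1")
assert a.fill_random(100) == b.fill_random(100)   # same seed -> same stream
assert CtrPrng(b"seed-2").fill_random(100) != CtrPrng(b"seed-1").fill_random(100)
```

<p>Because each counter block is independent, this generator is exactly as parallelizable as CTR encryption itself - which is what the 8x interleaved asm exploits.</p>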
<p>On x86_64, it uses fast hardware AES-NI acceleration, and our 8X interleaved asm:</p>
<ul>
<li>mORMot <strong>Random32</strong> in 3.95ms i.e. <strong>25,303,643/s</strong>, aver. 0us, <strong>96.5 MB/s</strong></li>
<li>mORMot <strong>FillRandom</strong> in 46us, <strong>2 GB/s</strong></li>
</ul>
<p>It is actually <em>noticeably faster than OpenSSL</em> with the same 256-bit safety level:</p>
<ul>
<li>OpenSSL Random32 in 288.71ms i.e. 346,363/s, aver. 2us, 1.3 MB/s</li>
<li>OpenSSL FillRandom in 240us, 397.3 MB/s</li>
</ul>
<p>On i386, numbers are similar, except for <code>FillRandom</code>, which is not interleaved:</p>
<ul>
<li>mORMot Random32 in 5.54ms i.e. 18,044,027/s, aver. 0us, 68.8 MB/s</li>
<li>mORMot FillRandom in 203us, 469.7 MB/s</li>
<li>OpenSSL Random32 in 364.24ms i.e. 274,540/s, aver. 3us, 1 MB/s</li>
<li>OpenSSL FillRandom in 371us, 257 MB/s</li>
</ul>
<h3>Conclusion</h3>
<p>For years, I suspected we had written the fastest AES library for Delphi and FreePascal. Now we cover even more algorithms (AES-GCM is widely used, but not widely implemented in Delphi), and we have pushed the performance limits even further!<br />
We can be proud that our library outperforms the proven OpenSSL 1.1.1 codebase for most algorithms, with no <code>.dll</code> dependency.<br />
Open Source rocks!</p>
<p>The next logical step is to work on OpenSSL integration for the TLS layer, which is especially welcome on Linux (mORMot has had a <a href="https://docs.microsoft.com/en-us/windows-server/security/tls/tls-ssl-schannel-ssp-overview">SChannel TLS layer</a> for Windows for years)...</p>
<p>If you wish, you can download the current mORMot 2 source code from <a href="https://github.com/synopse/mORMot2">https://github.com/synopse/mORMot2</a> and run the <a href="https://github.com/synopse/mORMot2/blob/master/test/mormot2tests.dpr">regression tests project</a>. You could share your own numbers!</p>
<p>Stay tuned, and feedback is <a href="https://synopse.info/forum/viewtopic.php?id=5760">welcome in our forum, as usual</a>!</p>New AesNiHash for mORMot 2urn:md5:135c4d191ec81d923f08cba3cc4efdd52021-02-12T17:41:00+00:002021-02-22T08:38:47+00:00Arnaud BouchezmORMot Framework64bitAES-Nialignmentasmblogcrc32cDelphidynamic arrayhashperformancesse42 <p>I have <a href="https://github.com/synopse/mORMot2/commit/32e3a5e43b1">just committed some new</a> <code>AesNiHash32 AesNiHash64 AesNiHash128</code> Hashers for mORMot 2.</p>
<p><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/78c56ffe89890582a7060845e131a788266cbd59" alt="" /></p>
<ul>
<li>They are using AES-NI and SSE4.1 opcodes on x86_64 and i386.</li>
<li>This implementation is faster than the fastest SSE4.2 crc32c, with much better usability (fewer collisions).</li>
<li>Logic was extracted from the Go Runtime, and optimized for Delphi/FPC.</li>
</ul>
<h3>Purpose</h3>
<p>Its purpose is NOT to replace crc32c, MD5, SHA-2 or SHA-3, and it is not meant to compute signatures.<br />
It is used internally by <em>mORMot 2</em> to hash elements, e.g. strings, and store them into a hash table - see e.g. our <code>TDynArrayHashed</code> or <code>TSynDictionary</code>.</p>
<p>In fact, the hashers are seeded with a new random AES key at startup, so the hashes change every time a process is launched. As a consequence, they should only be used internally within the process, and never be stored on disk.<br />
But this random seeding is a <a href="https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks">good way to resist DOS attacks</a>.</p>
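<p>The per-process random seeding can be modeled with any seedable hash. In this hedged Python sketch (a simple FNV-1a variant, not the actual AesNiHash algorithm), the same key maps to different buckets across process runs, which is what defeats precomputed collision sets:</p>

```python
import os

def seeded_hash(data: bytes, seed: int) -> int:
    # FNV-1a variant mixed with a per-process random seed: a stand-in for the
    # AES-seeded hashers described above, NOT the AesNiHash algorithm itself.
    h = 0x811C9DC5 ^ seed
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h

PROCESS_SEED = int.from_bytes(os.urandom(4), "little")  # fresh at every startup

h1 = seeded_hash(b"key", PROCESS_SEED)
assert h1 == seeded_hash(b"key", PROCESS_SEED)          # stable within one run
# A different seed (i.e. another process run) maps the same key elsewhere,
# so an attacker cannot precompute keys that all collide in one bucket:
assert seeded_hash(b"key", PROCESS_SEED ^ 1) != h1
```

<p>This is also why such hash values must never be persisted: they are only meaningful within the process that seeded them.</p>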
<h3>The Fastest 32/64/128-bit Hash on Delphi</h3>
<p>Numbers on i386 and x86_64 are amazing: <strong>more than 15GB/s (GigaBytes!) on my Core i3</strong>.<br />
Regular crc32c with SSE4.2 runs at 2-3GB/s, and the optimized crc32c from Intel is slightly slower.
Tuned xxHash32 assembly is around 4GB/s, and I won't even mention how slow the hash algorithm used by the Delphi RTL is in comparison.</p>
<p>But the output of AesNiHash is much better, since it produces far fewer collisions than crc32c or xxHash32.<br />
It is based on the AES permutation hardware instructions of modern Intel/AMD CPUs, for both safety and performance.</p>
<p>To be fair, <strong>performance on huge blocks is not the main point</strong> for a classical hashmap/hashtable use case. If your keys are numbers or strings, you won't hash MB of data for sure.<br />
Therefore we tuned the AesNiHash performance from the very first byte. The smallest input length of <strong>0-15 bytes are taken without any branch, 16-128 bytes have no loop</strong>, and, of course, if your key is huge, 129+ bytes are hashed with 128 bytes per iteration. And it passes all our regression tests just as good as our previous hashers.</p>
<h3>Proven Origin</h3>
<p>Proposing a new hash and showing numbers is fine, but you should not just trust me: is this algorithm usable?<br />
This time, we didn't reinvent the wheel: we re-implemented the Go runtime assembly, with some optimizations for our particular use.
So it is a proven algorithm, used in production by Google and many other companies for years.<br />
Check <code>aeshashbody</code> in the Go runtime <a href="https://golang.org/src/runtime/asm_amd64.s">asm_amd64.s</a> as reference.</p>
<p>Feedback is <a href="https://synopse.info/forum/viewtopic.php?id=5759">welcome on our forum</a>!</p>New Multi-thread Friendly Memory Manager for FPC written in x86_64 assemblyurn:md5:e807023b5387a22a6db1c1f62e76fdc42020-05-07T23:33:00+02:002020-07-03T13:41:20+02:00AB4327-GANDImORMot Framework64bitasmCrossPlatformfpcx64mmFreePascalmapmORMot2multithreadperformanceSynScaleMM<p>As a gift to the FPC community, I just committed a new Memory Manager for FPC.<br />
Check <a href="https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas">mormot.core.fpcx64mm.pas</a> in our <em>mORMot2</em> repository.<br />
This is a stand-alone unit for FPC only.</p>
<p><img alt="" src="https://www.computerhope.com/jargon/m/memory-manager.jpg" title="MM" /></p>
<p>It targets Windows and Linux multi-threaded Service applications - typically mORMot daemons.<br />
It is written in almost pure x86_64 assembly, with some tricks unique in the Delphi/FPC Memory Manager world.</p>
<p>It is based on <a href="https://github.com/pleriche/FastMM4">FastMM4</a> (not <a href="https://github.com/pleriche/FastMM5">FastMM5</a>), and we didn't follow the path of the <a href="https://github.com/maximmasiutin/FastMM4-AVX">FastMM4-AVX</a> version - instead of AVX, we use plain good (non-temporal) SSE2 opcode, and we rely on the <a href="http://man7.org/linux/man-pages/man2/mremap.2.html ">mremap</a> API on Linux for very efficient reallocation. Using <em>mremap</em> is perhaps the biggest benefit of this memory manager - it leverages a killer feature of the Linux kernel for sure. By the way, we directly call the Kernel without the need of the libc.</p>
<p>We tuned our x86_64 assembly a lot, and made it cross-platform (Windows and POSIX). We profiled the multi-threading behavior, especially by adding some additional small blocks for GetMem (a less expensive notion of the "arenas" used in FastMM5 and most C allocators), introducing an innovative and very efficient round-robin of tiny blocks (&lt;128 bytes), and proper spinning for FreeMem and medium blocks.</p>
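<p>The round-robin idea for tiny blocks can be sketched in a few lines (a Python stand-in with an invented arena structure, nothing like the real assembly): successive allocations are spread over several arenas, so concurrent threads rarely contend on the same lock:</p>

```python
import itertools
import threading

# Sketch of the round-robin idea: instead of all threads contending on one
# small-block arena, successive allocations rotate over several arenas, so
# two concurrent threads rarely hit the same lock.
N_ARENAS = 4
arenas = [[] for _ in range(N_ARENAS)]    # stand-ins for per-arena block lists
locks = [threading.Lock() for _ in range(N_ARENAS)]
rr = itertools.count()                    # global allocation counter

def tiny_alloc(size):
    idx = next(rr) % N_ARENAS             # round-robin arena selection
    with locks[idx]:                      # contention is spread over N locks
        block = bytearray(size)
        arenas[idx].append(block)
        return idx, block

used = [tiny_alloc(64)[0] for _ in range(8)]
assert used == [0, 1, 2, 3, 0, 1, 2, 3]   # allocations spread evenly
```

<p>The real implementation does this locklessly with atomic operations rather than mutexes, but the load-spreading effect is the same.</p>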
<p>It runs all our regression tests with high performance and stability - including multi-threaded tests with almost no slowdown: sleep is reported as less than 1 ms during a 1-minute test. It has also been validated on some demanding multi-threaded tasks.</p> <p>We are proud to offer a new multi-thread friendly Memory Manager for FPC, written in x86_64 assembly, and implemented as such:</p>
<ul>
<li>targeting Linux (and Windows) multi-threaded Services</li>
<li>only for FPC on the x86_64 target - use the RTL MM on Delphi or ARM</li>
<li>based on FastMM4 proven algorithms by Pierre le Riche</li>
<li>code has been reduced to the only necessary featureset for production</li>
<li>deep asm refactoring for cross-platform, compactness and efficiency</li>
<li>can report detailed statistics (with threads contention and memory leaks)</li>
<li>mremap() makes large block ReallocMem a breeze on Linux <img src="https://blog.synopse.info?pf=smile.svg" alt=":)" class="smiley" /></li>
<li>inlined SSE2 movaps loop is more efficient than subfunction(s)</li>
<li>lockless round-robin of tiny blocks (<=128/256 bytes) for better scaling</li>
<li>optional lockless bin list to avoid freemem() thread contention</li>
<li>three app modes: default mono-thread friendly, FPCMM_SERVER or FPCMM_BOOST</li>
</ul>
<p>Usage: include this unit as the very first in your FPC project uses clause.</p>
<p>Why another Memory Manager on FPC?</p>
<ul>
<li>The built-in heap.inc is well written and cross-platform and cross-CPU, but its threadvar arena for small blocks tends to consume a lot of memory on multi-threaded servers, and has suboptimal allocation performance</li>
<li>C memory managers (glibc, Intel TBB, jemalloc) have a very high RAM consumption (especially Intel TBB) and panic/SIGKILL on any GPF</li>
<li>Pascal alternatives (FastMM4,ScaleMM2,BrainMM) are Windows+Delphi specific</li>
<li>Our lockless round-robin of tiny blocks is a unique algorithm among Memory Managers AFAIK</li>
<li>It was so fun diving into SSE2 x86_64 assembly and Pierre's insights</li>
<li>Resulting code is still easy to understand and maintain, and performs very well</li>
<li>It tends to have very low fragmentation, and consume less memory than FPC alternatives, especially for multi-threaded projects</li>
<li>It is really Open Source (MPL/GPL/LGPL) and may be used on production for FPC x86_64 apps just as much as FastMM4 was for Delphi</li>
</ul>
<p><a href="https://synopse.info/forum/viewtopic.php?pid=32016#p32016">Feedback is welcome!</a></p>Faster Double-To-Text Conversionurn:md5:4418b32b1ad34daab0da1624c3faae172020-03-28T20:15:00+01:002020-07-03T09:29:59+02:00AB4327-GANDImORMot Framework64bitblogDelphidoubleFreePascalJSONmORMotperformance<p>On the server side, a lot of CPU time is spent converting to or from text.
Mainly JSON these days.</p>
<p><img src="https://inmyownterms.com/wp-content/uploads/2016/02/convert-button.jpg" alt="" /></p>
<p>In <em>mORMot</em>, we care a lot about performance, so we have
rewritten most conversion functions to be faster than what the Delphi or
FPC RTL can offer.<br />
Only float-to-text conversion was missing. And the RTL str()/FloatToText()
performance, at least under Delphi, is not consistent among platforms.<br />
So we just added a new Double-To-Text set of functions.</p> <p>We implement 64-bit floating point (double) to ASCII conversion using
the efficient Grisu-1 algorithm. This clever algorithm
was detailed in 2009 by Florian Loitsch, and is a reference
for this particular process.</p>
<p>Encoding integers into text is pretty straightforward. But encoding doubles
is a real P...A - the <a href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format#IEEE_754_double-precision_binary_floating-point_format:_binary64">
IEEE standard is quite complex</a>.</p>
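<p>The contract a Grisu-style converter must honor is easy to state: emit the shortest decimal text that still parses back to the exact same IEEE-754 double. Python's <code>repr()</code> provides the same guarantee (through a different algorithm), so it can illustrate the requirement:</p>

```python
# Shortest round-trip: parse(repr(v)) must give back exactly v, bit for bit.
# Sample values are arbitrary; Python's repr() implements this contract,
# which is the same one a Grisu-style double-to-text routine must meet.
values = [0.1, 1 / 3, 2.0 ** -52, 1e300, -123.456]
for v in values:
    assert float(repr(v)) == v       # exact round-trip for every sample
assert repr(0.1) == "0.1"            # shortest form, not 0.1000000000000000055...
```

<p>The hard part, which Grisu solves with pure integer arithmetic, is producing that shortest string quickly, without falling back to arbitrary-precision math in the common case.</p>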
<p>We extracted a double-to-ascii only cut-down version of <em>flt_core.inc
flt_conv.inc flt_pack.inc</em> files from FPC RTL, which implemented this
algorithm.<br />
As usual, we made a huge refactoring to reach the best performance, especially
tuning the Intel target, with some dedicated asm and code rewrite.</p>
<p>Some information and numbers extracted from the new source code
comments:</p>
<pre>
With Delphi 10.3 on Win32: (no benefit)
 100000 FloatToText in 38.11ms i.e. 2,623,570/s, aver. 0us, 47.5 MB/s
 100000 str in 43.19ms i.e. 2,315,082/s, aver. 0us, 50.7 MB/s
 100000 DoubleToShort in 45.50ms i.e. 2,197,367/s, aver. 0us, 43.8 MB/s
 100000 DoubleToAscii in 42.44ms i.e. 2,356,045/s, aver. 0us, 47.8 MB/s
With Delphi 10.3 on Win64:
 100000 FloatToText in 61.83ms i.e. 1,617,233/s, aver. 0us, 29.3 MB/s
 100000 str in 53.20ms i.e. 1,879,663/s, aver. 0us, 41.2 MB/s
 100000 DoubleToShort in 18.45ms i.e. 5,417,998/s, aver. 0us, 108 MB/s
 100000 DoubleToAscii in 18.19ms i.e. 5,496,921/s, aver. 0us, 111.5 MB/s
With FPC on Win32:
 100000 FloatToText in 115.62ms i.e. 864,842/s, aver. 1us, 15.6 MB/s
 100000 str in 57.30ms i.e. 1,745,109/s, aver. 0us, 39.9 MB/s
 100000 DoubleToShort in 23.88ms i.e. 4,187,078/s, aver. 0us, 83.5 MB/s
 100000 DoubleToAscii in 23.34ms i.e. 4,284,490/s, aver. 0us, 86.9 MB/s
With FPC on Win64:
 100000 FloatToText in 76.92ms i.e. 1,300,052/s, aver. 0us, 23.5 MB/s
 100000 str in 27.70ms i.e. 3,609,456/s, aver. 0us, 82.6 MB/s
 100000 DoubleToShort in 14.73ms i.e. 6,787,944/s, aver. 0us, 135.4 MB/s
 100000 DoubleToAscii in 13.78ms i.e. 7,253,735/s, aver. 0us, 147.2 MB/s
With FPC on Linux x86_64:
 100000 FloatToText in 98.47ms i.e. 1,015,465/s, aver. 0us, 18.4 MB/s
 100000 str in 38.14ms i.e. 2,621,369/s, aver. 0us, 60 MB/s
 100000 DoubleToShort in 14.77ms i.e. 6,766,357/s, aver. 0us, 134.9 MB/s
 100000 DoubleToAscii in 13.79ms i.e. 7,248,477/s, aver. 0us, 147.1 MB/s
</pre>
<p>As you can see:</p>
<ul>
<li>Our rewrite is about twice as fast as the original flt_conv.inc from the FPC
RTL (str)</li>
<li>Delphi Win32 struggles with 64-bit integer computation - there is no benefit,
since its RTL has good optimized x87 asm (which is still slower than our code
compiled with FPC/Win32)</li>
<li>FPC compiles integer arithmetic more efficiently; we avoided slow
divisions by calling our Div100(), but Delphi Win64 is still far behind</li>
<li>Delphi Win64 has very slow FloatToText and str() implementations (in pure
pascal) - so our new version is welcome.</li>
</ul>
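<p>The Div100() mentioned above follows the classic reciprocal-multiplication trick: an integer division by the constant 100 is replaced by a multiplication and a shift, which is much cheaper than a div instruction on most CPUs. A small C sketch of the idea - the magic constant below is the standard one for unsigned 32-bit division by 100, not necessarily mORMot's actual implementation:</p>
<pre>
#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Divide an unsigned 32-bit value by 100 without a div instruction:
   multiply by ceil(2^37 / 100) = 1374389535 and shift right by 37.
   This is exact for every uint32_t input. */
static uint32_t div100(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 1374389535u) &gt;&gt; 37);
}

int main(void) {
    uint32_t probes[] = { 0, 1, 99, 100, 101, 65535, 1000000000u, 4294967295u };
    for (size_t i = 0; i &lt; sizeof probes / sizeof *probes; i++)
        assert(div100(probes[i]) == probes[i] / 100);
    printf("div100 OK\n");
    return 0;
}
</pre>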
<div>In a nutshell, this routine is now used on all platforms (even ARM and
AARCH64), with the exception of Delphi Win32, where the built-in x87 asm is
a bit faster, mainly due to performance problems of the Delphi compiler when
handling 64-bit logical and arithmetic operations on the i386 CPU.</div>
<div>You can <a href="https://github.com/synopse/mORMot/blob/master/SynDoubleToText.inc">check the
source code</a> of our implementation of Grisu. You may find some nice
performance tricks.<br />
And any feedback is <a href="https://synopse.info/forum/viewtopic.php?id=5353">welcome in our forum, as
usual</a>!</div>
<h2>New move/fillchar optimized sse2/avx asm version</h2>
<p><em>2020-02-17, by AB4327-GANDI. Tags: 64bit, avx, Delphi, FreePascal, multithread, performance, RTL, sse2</em></p>
<p>Our Open Source framework includes some optimized <em>asm</em> alternatives
to RTL's <code>move()</code> and <code>fillchar()</code>, named
<code>MoveFast()</code> and <code>FillCharFast()</code>.</p>
<p><img src="https://static.makeuseof.com/wp-content/uploads/2017/04/more-ram-fast-ram-994x400.jpg" width="330" height="133" alt="" /></p>
<p>We just rewrote from scratch the <em>x86_64</em> version of those, <a href="http://blog.synopse.info/post/2013/03/13/x64-optimized-asm-of-FillChar%28%29-and-Move%28%29-for-Win64">
which was previously taken from third-party snippets</a>.<br />
The brand new code is meant to be more efficient and maintainable. In
particular, we switched to SIMD 128-bit SSE2 or 256-bit AVX memory accesses (when
available), whereas the previous version used 64-bit general-purpose registers.
Small blocks (i.e. &lt; 32 bytes) occur very often, e.g. when
processing strings, so this path has been tuned a lot. Non-temporal instructions
(i.e. bypassing the CPU cache) are used for the biggest chunks of data. We tested
<a href="https://stackoverflow.com/a/43837564/458259">ERMS support</a>, but it
showed no benefit over our optimized SIMD code, and was actually
slower than our non-temporal variants. So the ERMS code is currently disabled in
the source, and can be enabled on demand via a conditional define.</p>
<p>FPC's <code>move()</code> was not bad. Delphi's <em>Win64</em> version was far
from optimized - even ERMS was poorly introduced in the latest RTL, since it should
be triggered only for blocks &gt; 2KB. Sadly, Delphi doesn't support AVX
assembly yet, so those opcodes are available only with FPC.</p>
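<p>One classic technique in tuned move routines - and a plausible reading of the small-block tuning described above, though the exact asm layout is mORMot's own - is to cover the whole 1..16 byte range with at most two fixed-size, possibly overlapping load/store pairs instead of a per-byte loop. A portable C sketch of the idea (the real code uses hand-written asm and SIMD registers):</p>
<pre>
#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Copy 1..16 bytes with at most two fixed-size load/store pairs.
   The head and tail chunks may overlap in the middle: writing the same
   bytes twice is harmless, and no per-byte loop is needed.
   Both chunks are loaded before any store, so overlapping buffers are safe. */
static void copy_small(void *dst, const void *src, size_t n) {
    uint8_t *d = dst;
    const uint8_t *s = src;
    if (n &gt;= 8) {                 /* 8..16 bytes: two 8-byte chunks */
        uint64_t a, b;
        memcpy(&amp;a, s, 8);
        memcpy(&amp;b, s + n - 8, 8);
        memcpy(d, &amp;a, 8);
        memcpy(d + n - 8, &amp;b, 8);
    } else if (n &gt;= 4) {          /* 4..7 bytes: two 4-byte chunks */
        uint32_t a, b;
        memcpy(&amp;a, s, 4);
        memcpy(&amp;b, s + n - 4, 4);
        memcpy(d, &amp;a, 4);
        memcpy(d + n - 4, &amp;b, 4);
    } else {                      /* 1..3 bytes: plain bytes */
        for (size_t i = 0; i &lt; n; i++)
            d[i] = s[i];
    }
}

int main(void) {
    uint8_t src[16], dst[16];
    for (int i = 0; i &lt; 16; i++) src[i] = (uint8_t)(i + 1);
    for (size_t n = 1; n &lt;= 16; n++) {
        memset(dst, 0, sizeof dst);
        copy_small(dst, src, n);
        assert(memcmp(dst, src, n) == 0);
    }
    printf("copy_small OK\n");
    return 0;
}
</pre>
<p>With SSE2 the same pattern extends to 16-byte XMM chunks, covering up to 32 bytes with two loads and two stores.</p>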
<p>The resulting numbers speak for themselves. The new code runs on both Win64 and Linux, of
course.</p> <p>The numbers below are taken from the
<code>TTestLowLevelCommon.CustomRTL</code> regression tests.</p>
<p>These are not absolute timings - there is always about 10% of variation between
runs - but we wanted a high-level picture of the performance
obtained.<br />
Numbers are to be compared per run, within a similar execution context.</p>
<p>In the text below:</p>
<ul>
<li><code>FillChar/FillCharFast</code> fills a buffer with some increasing
number of bytes.</li>
<li><code>Move/MoveFast</code> moves some overlapped data with increasing
number of bytes, in a forward way.</li>
<li><code>small Move/MoveFast</code> moves 1..48 bytes with overlap.</li>
<li><code>big Move/MoveFast</code> moves around 8MB of data with or without
overlap.</li>
</ul>
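<p>The GB/s figures below are simply bytes processed divided by elapsed time. A minimal C harness in the same spirit - the routine measured here is plain <code>memset()</code> for illustration; the real numbers come from mORMot's <code>TTestLowLevelCommon.CustomRTL</code> tests:</p>
<pre>
#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;time.h&gt;

/* Time repeated memset() calls over a buffer and return bytes per second. */
static double fill_throughput(uint8_t *buf, size_t len, int rounds) {
    clock_t t0 = clock();
    for (int i = 0; i &lt; rounds; i++)
        memset(buf, (uint8_t)i, len);
    clock_t t1 = clock();
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs &lt;= 0) secs = 1e-9;   /* avoid division by zero on coarse clocks */
    return (double)len * rounds / secs;
}

int main(void) {
    enum { LEN = 1 &lt;&lt; 20 };       /* 1 MB buffer */
    static uint8_t buf[LEN];
    double bps = fill_throughput(buf, LEN, 200);
    assert(buf[0] == 199 &amp;&amp; buf[LEN - 1] == 199);  /* last round wrote 199 */
    printf("fill: %.2f GB/s\n", bps / 1e9);
    return 0;
}
</pre>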
<p>First of all, Delphi's RTL vs SynCommons on <em>Windows 64</em> -
big Move with overlap:</p>
<pre>
<em> On Delphi XE4 Win64 (VM):</em>
FillChar in 34.42ms, 11.2 GB/s FillCharFast [] in 15.03ms, 25.8 GB/s
Move in 3.76ms, 4.1 GB/s MoveFast [] in 2.16ms, 7.2 GB/s
small Move in 7.51ms, 2.9 GB/s small MoveFast [] in 6.77ms, 3.2 GB/s
big Move in 67.06ms, 2.3 GB/s big MoveFast [] in 41ms, 3.8 GB/s
</pre>
<pre>
<em> On Delphi 10.3 Win64 (VM):</em>
FillChar in 28.82ms, 13.4 GB/s FillCharFast [] in 14.89ms, 26 GB/s
Move in 3.68ms, 4.2 GB/s MoveFast [] in 2.13ms, 7.3 GB/s
small Move in 7.34ms, 2.9 GB/s small MoveFast [] in 6.73ms, 3.2 GB/s
big Move in 50.90ms, 3 GB/s big MoveFast [] in 40.74ms, 3.8 GB/s
</pre>
<p>As you can see, Delphi 10.3 performs slightly better than Delphi XE4, certainly
due to some RTL refactoring and the introduction of ERMS - which my CPU supports, so
the "big Move" results are not bad.<br />
But our versions are always faster, especially for <code>FillChar()</code>.</p>
<p>When comparing FPC RTL's vs SynCommons' on <em>Linux x86_64</em> - big
moves with or without overlap:</p>
<pre>
FillChar in 30.33ms, 12.8 GB/s
FillCharFast [] in 14.16ms, 27.4 GB/s
FillCharFast [cpuAVX] in 11.98ms, 32.4 GB/s
</pre>
<pre>
Move in 1.92ms, 8.1 GB/s
MoveFast [] in 2.19ms, 7.1 GB/s
MoveFast [cpuAVX] in 1.60ms, 9.7 GB/s
</pre>
<pre>
small Move in 8.84ms, 2.4 GB/s
small MoveFast [] in 6.66ms, 3.2 GB/s
small MoveFast [cpuAVX] in 6.63ms, 3.3 GB/s
</pre>
<pre>
<em>overlapping:</em>
big Move in 39.94ms, 3.9 GB/s
big MoveFast [] in 38.13ms, 4 GB/s
big MoveFast [cpuAVX] in 36.86ms, 4.2 GB/s
</pre>
<pre>
<em>non overlapping:</em>
big Move in 54.55ms, 7.1 GB/s
big MoveFast [] in 40.45ms, 9.6 GB/s
big MoveFast [cpuAVX] in 39.57ms, 9.8 GB/s
</pre>
<p>The Delphi numbers above were taken from a VM, on Win64 rather than Linux, so
with different high-performance counters and with overlap - they shouldn't be used
directly for a "FPC vs Delphi" comparison.<br />
What matters here are the relative numbers of FillChar/Move against
FillCharFast/MoveFast within a single execution context.</p>
<p>For big moves, the 256-bit YMM AVX version gives a slight advantage over the
128-bit XMM SSE2 code, which was already saturating the memory bandwidth.<br />
Small blocks don't involve AVX, by design.<br />
A real AVX advantage appears only for overlapping forward moves,
certainly due to the smaller number of cycles involved, and to the absence of
memory prefetching - which may matter in some applications.</p>
<p>For small backward/forward moves (on FPC Linux), the detailed per-size figures
show a pretty interesting performance difference:</p>
<pre>
1b Move in 89us, 214.3 MB/s 1b MoveFast [] in 77us, 247.7 MB/s
2b Move in 95us, 401.5 MB/s 2b MoveFast [] in 72us, 529.8 MB/s
3b Move in 107us, 534.7 MB/s 3b MoveFast [] in 76us, 752.9 MB/s
4b Move in 129us, 591.4 MB/s 4b MoveFast [] in 92us, 829.2 MB/s
5b Move in 163us, 585 MB/s 5b MoveFast [] in 78us, 1.1 GB/s
6b Move in 143us, 800.2 MB/s 6b MoveFast [] in 77us, 1.4 GB/s
7b Move in 143us, 0.9 GB/s 7b MoveFast [] in 146us, 914.4 MB/s
8b Move in 159us, 0.9 GB/s 8b MoveFast [] in 73us, 2 GB/s
9b Move in 166us, 1 GB/s 9b MoveFast [] in 144us, 1.1 GB/s
10b Move in 170us, 1 GB/s 10b MoveFast [] in 144us, 1.2 GB/s
11b Move in 181us, 1.1 GB/s 11b MoveFast [] in 144us, 1.4 GB/s
12b Move in 153us, 1.4 GB/s 12b MoveFast [] in 143us, 1.5 GB/s
13b Move in 154us, 1.5 GB/s 13b MoveFast [] in 144us, 1.6 GB/s
14b Move in 150us, 1.7 GB/s 14b MoveFast [] in 143us, 1.8 GB/s
15b Move in 151us, 1.8 GB/s 15b MoveFast [] in 147us, 1.9 GB/s
16b Move in 159us, 1.8 GB/s 16b MoveFast [] in 73us, 4 GB/s
17b Move in 161us, 1.9 GB/s 17b MoveFast [] in 153us, 2 GB/s
18b Move in 167us, 2 GB/s 18b MoveFast [] in 153us, 2.1 GB/s
19b Move in 180us, 1.9 GB/s 19b MoveFast [] in 161us, 2.1 GB/s
20b Move in 155us, 2.4 GB/s 20b MoveFast [] in 153us, 2.4 GB/s
21b Move in 150us, 2.6 GB/s 21b MoveFast [] in 153us, 2.5 GB/s
22b Move in 155us, 2.6 GB/s 22b MoveFast [] in 163us, 2.5 GB/s
23b Move in 157us, 2.7 GB/s 23b MoveFast [] in 154us, 2.7 GB/s
24b Move in 166us, 2.6 GB/s 24b MoveFast [] in 91us, 4.9 GB/s
25b Move in 175us, 2.6 GB/s 25b MoveFast [] in 163us, 2.8 GB/s
26b Move in 183us, 2.6 GB/s 26b MoveFast [] in 170us, 2.8 GB/s
27b Move in 193us, 2.6 GB/s 27b MoveFast [] in 162us, 3.1 GB/s
28b Move in 166us, 3.1 GB/s 28b MoveFast [] in 163us, 3.2 GB/s
29b Move in 183us, 2.9 GB/s 29b MoveFast [] in 169us, 3.1 GB/s
30b Move in 170us, 3.2 GB/s 30b MoveFast [] in 176us, 3.1 GB/s
31b Move in 175us, 3.3 GB/s 31b MoveFast [] in 176us, 3.2 GB/s
32b Move in 185us, 3.2 GB/s 32b MoveFast [] in 146us, 4 GB/s
33b Move in 193us, 3.1 GB/s 33b MoveFast [] in 197us, 3.1 GB/s
34b Move in 198us, 3.1 GB/s 34b MoveFast [] in 157us, 4 GB/s
35b Move in 218us, 2.9 GB/s 35b MoveFast [] in 155us, 4.2 GB/s
36b Move in 196us, 3.4 GB/s 36b MoveFast [] in 155us, 4.3 GB/s
37b Move in 184us, 3.7 GB/s 37b MoveFast [] in 160us, 4.3 GB/s
38b Move in 209us, 3.3 GB/s 38b MoveFast [] in 160us, 4.4 GB/s
39b Move in 201us, 3.6 GB/s 39b MoveFast [] in 161us, 4.5 GB/s
40b Move in 218us, 3.4 GB/s 40b MoveFast [] in 161us, 4.6 GB/s
41b Move in 226us, 3.3 GB/s 41b MoveFast [] in 161us, 4.7 GB/s
42b Move in 236us, 3.3 GB/s 42b MoveFast [] in 162us, 4.8 GB/s
43b Move in 250us, 3.2 GB/s 43b MoveFast [] in 161us, 4.9 GB/s
44b Move in 180us, 4.5 GB/s 44b MoveFast [] in 215us, 3.8 GB/s
45b Move in 199us, 4.2 GB/s 45b MoveFast [] in 230us, 3.6 GB/s
46b Move in 195us, 4.3 GB/s 46b MoveFast [] in 164us, 5.2 GB/s
47b Move in 198us, 4.4 GB/s 47b MoveFast [] in 173us, 5 GB/s
48b Move in 231us, 3.8 GB/s 48b MoveFast [] in 163us, 5.4 GB/s
</pre>
<p>As you can see, our code was designed to handle 8-byte
multiples (8-16-24-32 bytes) very efficiently, since they are pretty common when
moving small objects.<br />
It is always faster than the FPC RTL's original code, which was already
optimized assembly. :)</p>
<p>If you look at the source code, you will see that we tried to keep it
as clear as possible, while using the full capabilities of Delphi/FPC inline asm.</p>
<p>Please run the tests on your PC, and <a href="https://synopse.info/forum/viewtopic.php?id=5285">share some numbers and
enhancements</a>!</p>
<h2>SQLite3 static linking for Delphi Win64</h2>
<p><em>2019-09-21, by AB4327-GANDI. Tags: 64bit, Delphi, Documentation, mORMot, SQLite3</em></p>
<p>A long-awaited feature was the ability to create stand-alone <em>mORMot</em>
Win64 applications via Delphi, with no external <em>sqlite3-64.dll</em>
required.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/3/38/SQLite370.svg" alt="" /></p>
<p>It is now available, with proper integration, and <a href="https://blog.synopse.info?post/post/2018/03/12/New-AES-based-SQLite3-encryption">encryption is
working</a>!</p> <p>We already supported static linking for Delphi on Win32, and for FPC on a lot of
targets (Win32, Win64, Linux-x86, Linux-x64, Linux-Arm32, Darwin-x86,
Darwin-x64...).<br />
Now it is also available for Delphi and Win64!</p>
<p>You need to update your <a href="https://github.com/synopse/mORMot">https://github.com/synopse/mORMot</a> revision,
or refresh both your fossil clone and download again <a href="https://synopse.info/files/sqlite3obj.7z">https://synopse.info/files/sqlite3obj.7z</a>
to retrieve the new <em>sqlite3.o</em> file.</p>
<p>Note that the <em>Delphi/Win64 sqlite3.o</em> does not match the
<em>FPC/Win64 sqlite3.o</em> file, for several low-level linking reasons.<br />
Our <em>SynSQLite3Static.pas</em> unit will just use the right version for you.
<img src="https://blog.synopse.info?pf=smile.svg" alt=":)" class="smiley" /></p>
<p>I have also updated the documentation to reflect this new feature, and
properly state how to compile <em>SQLite3</em> from its official sources, if
needed.</p>
<p>Comments and feedback are <a href="https://synopse.info/forum/viewtopic.php?pid=30239#p30239">welcome in our
forum, as usual</a>!</p>
<h2>New AES-based SQLite3 encryption</h2>
<p><em>2018-03-12, by AB4327-GANDI. Tags: 64bit, AES, AES-NI, CrossPlatform, Delphi, FreePascal, mORMot, performance, SQLite3, SynDB</em></p>
<p>We just committed a deep refactoring of the <a href="https://synopse.info/files/html/api-1.18/SynSQLite3Static.html">SynSQlite3Static.pas
unit</a> - and all units using static linking for FPC.<br />
It also includes a new encryption format for <em>SQlite3</em>, using AES, so
much more secure than the previous one.<br />
This is a breaking change, so worth a blog article!</p>
<p><img src="https://cuttingedge.it/blogs/steven/images/breakingchains.jpg" alt="" /></p>
<p>Now all static <code>.o .a</code> files are located in a <a href="https://github.com/synopse/mORMot/tree/master/static">static sub-folder</a> in
the source code.<br />
Please delete the previous <code>fpc-*</code> folders, which are deprecated and
should not be used.<br />
It has been deployed under <a href="https://github.com/synopse/mORMot/tree/master/static">GitHub</a>, or you need
to download a new version of <a href="https://synopse.info/files/sqlite3fpc.7z">sqlite3fpc.7z</a> if you used
our nightly build from fossil.<br />
This will allow to set <code>"Libraries -fFl"</code> in your FPC project
options as safe and sound
<code>(...)\static\$(TargetCPU)-$(TargetOS)</code></p>
<p>The new <em>SQlite3</em> encryption is based on our <a href="https://synopse.info/files/html/api-1.18/SynCrypto.html">SynCrypto unit</a>,
so it uses AES-NI acceleration, if available. The performance impact is minimal,
much lower than the one of e.g. <a href="https://github.com/utelle/wxsqlite3">wxSqlite3</a>, and with a safer
implementation (explicit AES-OFB mode with fast IV derivation and proven PBKDF2
password derivation).<br />
It also allows using a plain/official/unpatched amalgamation
<code>sqlite3.c</code> file, so it is easier to maintain as
cross-platform.<br />
You can easily convert existing encrypted databases using the <a href="https://synopse.info/files/html/api-1.18/SynSQLite3Static.html#ISOLDSQLENCRYPTTABLE">IsOldSQLEncryptTable</a>
and <a href="https://synopse.info/files/html/api-1.18/SynSQLite3Static.html#OLDSQLENCRYPTTABLEPASSWORDTOPLAIN">OldSQLEncryptTablePassWordToPlain</a>
functions.</p>
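<p>In OFB mode the block cipher never touches the data directly: it repeatedly encrypts the IV to produce a keystream, which is XORed with the plaintext, so encryption and decryption are the same operation. The C sketch below demonstrates that structure with a deliberately fake placeholder block function (NOT AES, and not mORMot's code - purely illustrative of the mode itself); the real unit performs the block step with hardware-accelerated AES:</p>
<pre>
#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

#define BLOCK 16

/* Placeholder "block cipher": a keyed byte scramble. NOT cryptographically
   secure - it only stands in for AES to show the OFB structure. */
static void toy_block(uint8_t out[BLOCK], const uint8_t in[BLOCK],
                      const uint8_t key[BLOCK]) {
    for (int i = 0; i &lt; BLOCK; i++)
        out[i] = (uint8_t)((in[i] ^ key[i]) * 167 + 13);
}

/* OFB: keystream block n = E(keystream block n-1), data XOR keystream.
   The very same routine both encrypts and decrypts. */
static void ofb_xcrypt(uint8_t *data, size_t len,
                       const uint8_t key[BLOCK], const uint8_t iv[BLOCK]) {
    uint8_t stream[BLOCK];
    memcpy(stream, iv, BLOCK);
    for (size_t i = 0; i &lt; len; i++) {
        if (i % BLOCK == 0)                 /* next keystream block */
            toy_block(stream, stream, key);
        data[i] ^= stream[i % BLOCK];
    }
}

int main(void) {
    uint8_t key[BLOCK] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    uint8_t iv[BLOCK]  = {42};
    uint8_t msg[] = "SQLite3 page data goes here";
    uint8_t orig[sizeof msg];
    memcpy(orig, msg, sizeof msg);

    ofb_xcrypt(msg, sizeof msg, key, iv);   /* encrypt */
    assert(memcmp(msg, orig, sizeof msg) != 0);
    ofb_xcrypt(msg, sizeof msg, key, iv);   /* decrypt = same call */
    assert(memcmp(msg, orig, sizeof msg) == 0);
    printf("OFB round-trip OK\n");
    return 0;
}
</pre>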
<p>Of course, it will also work with Delphi, so that Win32 statically linked
<em>sqlite3.obj</em> will offer this new encryption.</p>
<p>Comments are <a href="https://synopse.info/forum/viewtopic.php?pid=26794#p26794">welcome on our
forum</a>, as usual.</p>
<h2>Status of mORMot ORM SOA MVC with FPC</h2>
<p><em>2018-02-07, by AB4327-GANDI. Tags: 64bit, asm, BSD, crc32c, CrossPlatform, Delphi, ECC, exception, FreePascal, Lazarus, Linux, MacOSX, mORMot, ORM, performance, Rest, RTTI, SOA, SQLite3, sse42</em></p>
<p>In the last weeks/months, we worked a lot with FPC.<br />
Delphi is still our main IDE, due to its better debugging experience under
Windows, but we aim to provide premium support for FPC, on all platforms,
especially Linux.</p>
<p><img src="https://blog.synopse.info?post/public/ScreenShots/lazarusaboutbox.png" alt="" title="Lazarus FPC About, Feb 2018" /></p>
<p>The new Delphi Linux compiler is out of scope: it is heavily priced,
its performance is not so good, and its ARC model breaks memory management, so it
would need a deep review/rewrite of our source code, which we can't afford -
especially since we have FPC which is, <a href="https://synopse.info/forum/viewtopic.php?pid=25984#p25984">in our
opinion</a>, a much better compiler for Linux.<br />
Of course, you can create clients for Delphi Linux and FMX, as usual, using
the <a href="https://synopse.info/files/html/Synopse%20mORMot%20Framework%20SAD%201.18.html#TITL_86">cross-platform
client parts of mORMot</a>. But on the server side, this compiler is not
supported, and probably never will be.</p> <p>First of all, since FPC - and Lazarus, its sibling IDE - are Open Source and
free, we can focus mainly on supporting a single version of the compiler.<br />
Since some missing RTTI for interfaces was recently merged into the trunk, we
use the current FPC trunk as our main version. Much easier than maintaining
Delphi 5 - 10.2 compatibility!</p>
<p>To install it, we usually use the <a href="https://github.com/newpascal/fpcupdeluxe">fpcupdeluxe tool</a>: you <a href="https://github.com/newpascal/fpcupdeluxe/releases">download a single binary
for your platform</a>, then you run the executable, pickup the compiler (or
cross-compiler) versions you need, and everything is downloaded and compiled
from git/svn on your own computer. Then click on the desktop link, and the IDE
launches in seconds. Nice fresh air in respect to Delphi setup experience!</p>
<p style="margin-top: 0;">As I wrote, in the last weeks/months, we worked a lot
to improve FPC support.</p>
<p style="margin-top: 0;">Just a few commits:</p>
<ul>
<li><a href="https://github.com/synopse/mORMot/commit/7bf5951e322f0164d8e3852f98f9f48b1bd7a6d2">
crc32c 2x/4x speedup by using SSE4.2+pclmulqdq opcodes on x64</a> - speed is now
around 21GB/s</li>
<li><a href="https://github.com/synopse/mORMot/commit/5b569db11e47fd65e2e0f792e1383895e50352b6">
TSynDaemon fork/run support on Linux/Posix</a></li>
<li><a href="https://github.com/synopse/mORMot/commit/4e019e1b41f6759487e7479c194fcb28c25314c9">
fixed vtQWord proper support for FPC</a> - this doesn't exist in Delphi,
but it should!</li>
<li><a href="https://github.com/synopse/mORMot/commit/bc405cffb7bfb79646b46a6afc13968a0d804f32">
added pure pascal version of SynECC.pas</a> to run on all FPC platforms,
including ARM</li>
<li>BSD/OSX <a href="https://github.com/synopse/mORMot/commit/af8b33720d0da12a8476cd5bc34eeeab417cb096">
enhanced</a> <a href="https://github.com/synopse/mORMot/commit/16332915f87f5dcc7de881c9b72b09d05b609eba">
support</a></li>
<li><a href="https://github.com/synopse/mORMot/commit/a5352309f41d8ef70561d819e57732ac0aaf43e3">
added scripts to use fpcupdeluxe's gcc cross-compilers for FPC static linking
on Windows, Linux, BSD and OSX</a></li>
<li><a href="https://github.com/synopse/mORMot/commit/eaddf84b04b7eaf647adb598be140276b9b42046">
updated SQLite3 engine to latest version 3.22.0</a> - statically linked under
FPC Linux (no external dependency to the system libsqlite3.so)</li>
<li><a href="https://github.com/synopse/mORMot/commit/7c6d09df6caf3d8865e488d9147b5340c1a15fca">
HTTP server enhancements and fixes for high performance and stability under
Linux behind a local nginx proxy for production servers</a></li>
<li>better Linux compatibility</li>
<li><a href="https://github.com/synopse/mORMot/commit/ec4755a19722e744029244e7a2d1bfa985887f3e">
TSQLRecord QWord property fix</a></li>
<li><a href="https://github.com/synopse/mORMot/commit/4b7786c3be0d1aec5abaad26f9de90e58edcce16">
tuned/enhanced logging content</a></li>
<li><a href="https://github.com/synopse/mORMot/commit/d9fe9a5bf9305a0eb05c9f5a80d97a1b49f3bcb8">
deep refactoring of FPC RTTI access to have the same level than Delphi</a></li>
<li><a href="https://github.com/synopse/mORMot/commit/95c5d56edbcb641f716c36d78792853eae67c689">
implemented Exceptions interception and logging for FPC</a><br />
with call stack trace (if available) - includes source code lines if compiled
using -g or -gl switches - tested on Win32, Win64, Linux i386 and x86_64, but
should work on other OS <img src="https://blog.synopse.info?pf=smile.svg" alt=":)" class="smiley" /></li>
<li>SyNode JavaScript engine support under FPC / Linux x86_64 by <a href="https://github.com/synopse/mORMot/commits?author=ssoftpro">ssoftpro</a> and
<a href="https://github.com/synopse/mORMot/commits?author=pavelmash">pavelmash</a> -
including a lot of fixes and tuning for this platform <a href="https://synopse.info/forum/viewtopic.php?pid=25985#p25985">to be heavily used
on production</a></li>
<li>and a lot of smaller enhancements (just search for FPC <a href="https://synopse.info/fossil/timeline?n=500&y=ci&t=&ms=exact">in
the commit timeline</a>), especially <a href="https://github.com/synopse/mORMot/commit/c5ad9b1d1f57177a8fe686271370cb12fc29d3d9">
tuning</a> the pascal code to better compile and execute under FPC, which can
generate very efficient assembly!</li>
</ul>
<div>As you can see, exciting times!</div>
<div>To be honest, the more we work with FPC as a compiler, the more we like
it.<br />
Stay tuned, and we encourage you to discover FPC/Lazarus!</div>
<h2>Faster and cross-platform SynLZ</h2>
<p><em>2017-08-10, by AB4327-GANDI. Tags: 64bit, asm, compression, Delphi, SynLZ</em></p>
<p>You probably know about our <a href="https://github.com/synopse/mORMot/blob/master/SynLZ.pas"><em>SynLZ</em>
compression unit</a>, written in pascal and x86 asm, which is very fast at
compression with a good compression ratio, and proudly competes with LZ4 or Snappy.<br />
It is used everywhere in our framework, e.g. for WebSockets communication, for
ECC-encrypted file content, or to compress executable resources.</p>
<p><img src="https://cdn3.iconfinder.com/data/icons/musthave/256/Archive.png" alt="" /></p>
<p>Two news to share:</p>
<p>1. I've added <em>SynLZ</em> support for the NextGen compiler, now available
in <a href="https://github.com/synopse/mORMot/blob/master/CrossPlatform/SynCrossPlatformSynLZ.pas">
a new unit of the "CrossPlatform" sub-folder</a>.<br />
<a href="https://synopse.info/forum/viewtopic.php?pid=24701#p24701">Feedback is
welcome</a>, since we don't use Delphi for iOS and Android ourselves, and
prefer FPC for Linux!</p>
<p>2. I've also written a <a href="https://github.com/synopse/mORMot/commit/042ee0102e240f1c6871bf1e86bfb61b5cba2fca">
new x64 asm optimized version of <em>SynLZ</em></a>, and profiled the existing
x86 asm to be even faster than previously.<br />
For a 100MB text log file, <em>SynLZ</em> is faster than Snappy, and compresses
better (93% instead of 84%).<br />
For other kinds of files, Snappy is slightly faster at decompression, but SynLZ
compresses better, and is faster most of the time.<br />
When used on a REST server solution, as with <em>mORMot</em>, compression speed
does matter more than decompression.</p>
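<p>SynLZ, like LZ4 and Snappy, belongs to the LZ77 family: repeated byte sequences are replaced by short back-references into the already-decoded output. The C toy codec below illustrates that principle with a deliberately naive token format (a flag byte per token, 1-byte length and distance, linear match search) - it is NOT the SynLZ format, whose hash-based match finder and tighter bit-packing are precisely what make it fast:</p>
<pre>
#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Toy LZ77 token stream: 0x00 &lt;literal byte&gt;, or 0x01 &lt;len&gt; &lt;dist&gt;
   with len in 4..255 and dist in 1..255. Returns compressed size. */
static size_t toy_compress(const uint8_t *src, size_t n, uint8_t *dst) {
    size_t i = 0, o = 0;
    while (i &lt; n) {
        size_t best_len = 0, best_dist = 0;
        size_t start = i &gt; 255 ? i - 255 : 0;
        for (size_t j = start; j &lt; i; j++) {        /* naive match search */
            size_t len = 0;
            while (len &lt; 255 &amp;&amp; i + len &lt; n &amp;&amp; src[j + len] == src[i + len])
                len++;
            if (len &gt; best_len) { best_len = len; best_dist = i - j; }
        }
        if (best_len &gt;= 4) {
            dst[o++] = 1;
            dst[o++] = (uint8_t)best_len;
            dst[o++] = (uint8_t)best_dist;
            i += best_len;
        } else {
            dst[o++] = 0;
            dst[o++] = src[i++];
        }
    }
    return o;
}

static size_t toy_decompress(const uint8_t *src, size_t n, uint8_t *dst) {
    size_t i = 0, o = 0;
    while (i &lt; n) {
        if (src[i] == 0) { dst[o++] = src[i + 1]; i += 2; }
        else {
            size_t len = src[i + 1], dist = src[i + 2];
            for (size_t k = 0; k &lt; len; k++, o++)   /* byte copy: overlap-safe */
                dst[o] = dst[o - dist];
            i += 3;
        }
    }
    return o;
}

int main(void) {
    const char *msg = "synlz! synlz! synlz! synlz! synlz! synlz! synlz! synlz!";
    uint8_t packed[256], unpacked[256];
    size_t n = strlen(msg);
    size_t c = toy_compress((const uint8_t *)msg, n, packed);
    size_t d = toy_decompress(packed, c, unpacked);
    assert(d == n &amp;&amp; memcmp(unpacked, msg, n) == 0);
    assert(c &lt; n);                   /* repetitive input shrinks */
    printf("%zu -&gt; %zu bytes\n", n, c);
    return 0;
}
</pre>
<p>Note that a match may reach into bytes the decoder has not yet fully copied (distance smaller than length); the byte-by-byte copy handles that overlap naturally, which is how LZ77 codecs also encode runs.</p>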
<p>For Win32:</p>
<pre>
Win32 Processing DragonFly-devpcm.log = 98.7 MB for 1 times
 Snappy compress in 125.07ms, ratio=84%, 789.3 MB/s
 Snappy uncompress in 70.35ms, 1.3 GB/s
 SynLZ compress in 103.61ms, ratio=93%, 952.8 MB/s
 SynLZ uncompress in 68.71ms, 1.4 GB/s
</pre>
<p>For Win64:</p>
<pre>
Win64 Processing DragonFly-devpcm.log = 98.7 MB for 1 times
 Snappy compress in 107.13ms, ratio=84%, 921.5 MB/s
 Snappy uncompress in 61.06ms, 1.5 GB/s
 SynLZ compress in 97.25ms, ratio=93%, 1015.1 MB/s
 SynLZ uncompress in 61.27ms, 1.5 GB/s
</pre>
<p>Of course, we didn't change the <em>SynLZ</em> binary format, so it is just
perfectly backward compatible with any existing program.<br />
Anyway, from my point of view, the main benefit of <em>SynLZ</em> is that it
was designed in plain pascal, so it is clearly cross-platform and well
integrated with Delphi/FPC (no external .obj/.o/.dll required).</p>
<p><a href="https://synopse.info/forum/viewtopic.php?pid=24702#p24702">Feedback
is welcome in our forum</a>, as usual!</p>
<h2>Delphi 10.2 Tokyo Compatibility: DCC64 broken</h2>
<p><em>2017-03-22, by AB4327-GANDI. Tags: 10.2, 64bit, Delphi, mORMot</em></p>
<p>We are proud to announce compatibility of our mORMot Open Source framework
with the latest Delphi 10.2 Tokyo compiler...<br />
At least for Win32.</p>
<p>For Win64, the compiler was stuck at the end of the compilation, burning
100% of one CPU core...</p>
<p><img src="https://i.snag.gy/UvLQIj.jpg" alt="" /></p>
<p>A bit disappointing, isn't it?</p> <p>Regression tests for Win32:</p>
<pre>
Synopse mORMot Framework Automated tests
------------------------------------------
</pre>
<pre>
....
</pre>
<pre>
Using mORMot 1.18.3538 FTS3
Running on Windows 7 64bit SP1 (6.1.7601) with code page 1252
TSQLite3LibraryStatic 3.17.0 with internal MM
Generated with: Delphi 10.2 Tokyo compiler
</pre>
<pre>
Time elapsed for all tests: 58.07s
Tests performed at 22/03/2017 18:31:55
Total assertions failed for all test suits: 0 / 27,281,416
</pre>
<pre>
! All tests passed successfully.
</pre>
<p>For Win64, the compilation went well... but the IDE was stuck! </p>
<p>By using the command line compiler, we observed the same behavior.</p>
<p>This is clearly not the framework's fault... but some bug in the DCC64
compiler.</p>
<p>Feel free to <a href="https://synopse.info/forum/viewtopic.php?id=3885">share your experiments on
our forum, as usual</a>.<br />
(our forum has the full regression tests log)</p>
<p><strong>Update: Embarcadero just <a href="https://goo.gl/2NGWNx">released a
hotfix update for Delphi 10.2 Tokyo</a>: now the Win64 compiler is fixed!<br />
Great news and good reactivity!</strong></p>
<h2>Linux support for Delphi to be available end of 2016</h2>
<p><em>2016-02-08, by AB4327-GANDI. Tags: 64bit, ARC, CrossPlatform, Delphi, Linux, NextGen</em></p>
<p>Marco Cantu, product manager of Delphi/RAD Studio, did publish the <a href="http://community.embarcadero.com/article/news/16211-embarcadero-rad-studio-2016-product-approach-and-roadmap-2">
official RAD Studio 2016 Product Approach and Roadmap</a>.<br />
The upcoming release has a codename known as "BigBen", and should be called
Delphi 10.1 Berlin, as far as I understand.<br />
<a href="http://community.embarcadero.com/uploads/376/2016roadmap.png"><img src="http://community.embarcadero.com/uploads/376/2016roadmap.png" width="500" height="275" /></a></p>
<p>After this summer, another release, whose codename is "Godzilla", will
support Linux as a compiler target, in its Delphi 10.2 Tokyo release.<br />
This is very good news, and some details are given.<br />
I've included those official names to <em>mORMot</em>'s <a href="http://synopse.info/fossil/info/8ebd41a0fe">internal compiler version
detection</a>.<br />
Thanks Marco for the information, and pushing in this direction!</p>
<p>My only concern is that it would be "ARC-enabled"...</p> <p>Details are the following:</p>
<ul>
<li>Apache modules in WebBroker and support for DataSnap and EMS</li>
<li>FireDAC Linux database access</li>
<li>Linux platform support for console apps with IoT support</li>
<li>We will formally support Ubuntu Server &amp; RedHat Enterprise. We will
extend the formally supported Linux distributions list over time as demand
dictates</li>
<li>Windows based IDE with cross-compiler, deploy and debug via PAServer</li>
<li>Linux compilers will be for Intel 64-bit server, LLVM-based and
<strong>ARC-enabled</strong></li>
</ul>
<p>Supporting only Intel 64-bit at first sounds fine to me, as does the
Ubuntu/RedHat support - I guess most other distributions would work, even if not
officially supported.<br />
I'm not sure LLVM would make the running code actually faster, since the generated
code for the mobile targets was not as efficient as it could have been.<br />
But LLVM would make the compilation phase much slower (10 times slower, I
guess) compared with Win32/OSX compilation timings... we would lose one of the great
points of Delphi, which is its almost instant compilation time.</p>
<p>The last item, "<em>ARC-enabled</em>", is a concern to me.<br />
What I write here is not an absolute: your experiments and expectations may
differ, and you may rejoice that ARC is enabled.<br />
I'm writing from my own needs.</p>
<p>I still do not understand the benefit of using ARC, in regard to the
compatibility break it would induce with existing code.<br />
All our server-side code base relies on explicit memory allocation. We mitigate
the code verbosity with simple ownership of objects. Not a big deal, with
modern IDEs.</p>
<p>On a server, disposing of resources as soon as possible is a very sensitive
task. Processes run 24/7, so any leak must be avoided at all costs.<br />
ARC is a fine memory model - we spoke about it <a href="https://blog.synopse.info?post/post/2011/12/08/Avoiding-Garbage-Collector%3A-Delphi-and-Apple-on-the-same-side">
some years ago</a> - but, <a href="http://docwiki.embarcadero.com/RADStudio/Seattle/en/Automatic_Reference_Counting_in_Delphi_Mobile_Compilers">
as implemented in Delphi NextGen compiler</a>, it is hard to have a single code
base for both ARC and non ARC platforms. A lot of IFDEF is needed, or you need
to use <code>FreeAndNil</code>. IMHO calling <code>DisposeOf</code> may break
most of the benefits of using ARC, since it leaves dangling pointers in an
unsafe state...<br />
And remote debugging over PAServer for Linux won't be perfect, we can be
sure... If only we may compile with ARC on Windows, it may ease debugging big
applications...</p>
<p>I like very much the FPC approach: they support a lot of targets and
platforms (including Linux since the beginning), and really do their best not
to break existing code.<br />
Having two mutually exclusive memory models at once is a real PITA, from the
developer's point of view...<br />
The more I use FPC, the more I like it. Using Lazarus under Linux is very nice.
You are able to debug your application in place...</p>
<p>I hope Delphi for Linux may have both models: ARC-enabled and
ARC-disabled.<br />
Or at least that Delphi for Windows may be ARC-enabled, so that we may share
the same code base, and debug/test under Windows, and cross-compile with
confidence.<br />
From the compiler point of view, it is not a big deal, as far as A. Bauer wrote
(IIRC): they internally have an ARC-enabled compiler for Windows.<br />
From the LLVM point of view, it is feasible, since <a href="http://www.codeography.com/2011/10/10/making-arc-and-non-arc-play-nice.html">ARC
can be enabled per-file</a>.</p>
<p>Please do not break compatibility with existing code base!<br />
Please!</p>
<h2>Delphi 10 Seattle Win64 compiler Heisenbug: unusable target</h2>
<p><em>2015-10-05, by AB4327-GANDI. Tags: 64bit, Delphi, Heisenbug</em></p>
<p>Andy reported that <a href="http://andy.jgknet.de/blog/2015/10/ide-fix-pack-how-do-i-know-i-didnt-break-the-compiler/">
he was not able to validate his IDE Fix Pack for Delphi 10 Seattle</a>, due to
its Win64 compiler not being deterministic anymore.<br />
The generated code varied from one build to another.</p>
<p><img src="http://3.bp.blogspot.com/-f6o39qrUitQ/T8SBfb6CmFI/AAAAAAAAACc/ERb5rSCJaV8/s1600/heisenbug.png" alt="" /></p>
<p>Sadly, on our side, we identified that the code generated by <strong>the
Win64 compiler of Delphi 10 Seattle is broken</strong>.<br />
We have observed some weird code generation with the Win64 platform as a
target. Some unexpected exceptions do occur (like an
<code>EPrivilege</code> or <code>EAccessViolation</code> exception).<br />
But it is a random issue, very difficult to reproduce. After a
recompile, no problem any more. Or a problem at another place... A typical
<a href="https://en.wikipedia.org/wiki/Heisenbug">Heisenbug</a>...<br />
And, to be clear, there is no such problem when using an older version of Delphi...</p>
<p>We hope that the <a href="https://quality.embarcadero.com/browse/RSP-12512">corresponding QC entry</a>
will be quickly fixed.<br />
In the meanwhile, we will stay away from Delphi 10's Win64 compiler, and use
Delphi XE8 instead.</p>
<p><strong>Update: Issue fixed!</strong><br />
Allen Bauer recognized that "<em>It was an uninitialized memory
allocation</em>" in the QC, and that he is pushing to include the fix into the
upcoming Seattle 10 Update 1.<br />
Nice seeing such a quick reaction. Delphi is not dead, even <a href="http://thomabravo.com/2015/10/07/thoma-bravo-announces-sale-of-embarcadero-technologies-to-idera-inc/">
if Embarcadero was just acquired</a>! :)</p>
<h2>SynCrypto: SSE4 x64 optimized asm for SHA-256</h2>
<p><em>2015-02-21, by AB4327-GANDI. Tags: 64bit, asm, Delphi, performance, sha, sse4</em></p>
<p>We have just included some optimized x64 assembler in our Open
Source <a href="http://synopse.info/fossil/info/b7ba18e68252b76c0fe">SynCrypto.pas</a> unit,
so that SHA-256 hashing performs at full speed.<br />
It is an adaptation of <a href="http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html">
Intel's tuned assembly macros</a>, which make use of the SSE4 instruction set,
when available.</p>
<p><img src="http://www.xbitlabs.com/images/cpu/core2extreme-qx9650/sse-1.jpg" alt="" width="275" height="182" /></p> <p>The numbers speak for themselves:</p>
<ul>
<li>under Win32, with a Core i7 CPU: pure pascal: 152ms - x86: 112ms</li>
<li>under Win64, with a Core i7 CPU: pure pascal: 202ms - SSE4: 78ms</li>
</ul>
<p>When executing the following test code:</p>
<pre>
for i := 1 to 100000 do begin
  s := SHA256('123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');
  assert(s='f816ca413da6f2881c0cf16cb6d5bbc5d4189f5a9f185855c8bfd6423e099e52');
end;
</pre>
<p>Your <a href="http://synopse.info/forum/viewtopic.php?id=2374">feedback is
welcome</a>, as usual!</p>BREAKING CHANGE - TSQLRecord.ID primary key changed to TID: Int64urn:md5:efba053f92e0e5a1217d27e7f7d1e0832014-11-14T15:11:00+01:002014-11-14T15:15:13+01:00AB4327-GANDImORMot Framework64bitblogDatabaseDelphiDocumentationmORMotORMPrimaryKeyRestSourceSQLSQLite3<p>Up to now, the <code>TSQLRecord.ID</code> property was defined in
<code>mORMot.pas</code> as a plain
<code>PtrInt</code>/<code>NativeInt</code> (i.e. <code>Integer</code> under
Win32), since it was type-cast as a pointer for <code>TSQLRecord</code> published
properties.<br />
We have introduced a new <code>TID</code> type, so that the ORM primary key is
now defined as <code>Int64</code>.</p>
<p><img src="http://blog.codinghorror.com/content/images/uploads/2007/11/6a0120a85dcdae970b012877701d33970c-pi.png" alt="" width="415" height="145" /></p>
<p>All the framework ORM process relies on the <code>TSQLRecord</code>
class.<br />
This abstract <code>TSQLRecord</code> class features a lot of built-in methods,
convenient to do most of the ORM process in a generic way, at record level.</p>
<p>It first defines a <em>primary key</em> field, declared as <code>ID:
TID</code>, i.e. as <code>Int64</code>, in <code>mORMot.pas</code>:</p>
<pre>
<strong>type</strong>
  TID = <strong>type</strong> Int64;
  ...
  TSQLRecord = <strong>class</strong>(TObject)
  ...
    <strong>property</strong> ID: TID <strong>read</strong> GetID <strong>write</strong> fID;
  ...
</pre>
<p>In fact, our ORM now relies on an <code>Int64</code> primary key, matching
the <em>SQLite3</em> <code>ID</code>/<code>RowID</code> primary key.<br />
This primary key will be used as the <a href="https://blog.synopse.info?post/post/2014/01/10/REpresentational-State-Transfer-%28REST%29">RESTful resource
identifier</a>, for all CRUD operations.</p> <h3>Limitation or feature?</h3>
<p>You may be disappointed by this limitation, which is required by
<em>SQLite3</em>'s implementation of Virtual Tables - see <em><a href="http://synopse.info/files/html/Synopse%20mORMot%20Framework%20SAD%201.18.html#TITL_20">
Virtual Tables magic</a></em>.</p>
<p>We won't debate about a composite primary key (i.e. several fields), which
is not a good idea for an ORM.<br />
In your previous RDBMS data modeling, you may be used to defining a TEXT primary
key, or even a GUID primary key: those kinds of keys are somewhat less
efficient than an INTEGER, especially for ORM internals, since they are not
monotonic.<br />
Note that you can always define a secondary key, as a <code>string</code> or
<code>TGUID</code> field, if needed - using the <code>stored AS_UNIQUE</code>
attribute as explained below.</p>
<p>Having an <code>Int64</code>-wide primary key allows computing huge IDs,
which was found to be almost mandatory to implement <a href="http://synopse.info/fossil/tktview/3453f314d97d">safe multi-node
replication</a>.</p>
<h3>Introducing TID published properties</h3>
<p><code>TSQLRecord</code> published properties match a class instance
pointer, so they are 32-bit (at least for <em>Win32/Linux32</em> executables).<br />
Since the <code>TSQLRecord.ID</code> field is declared as <code>TID =
Int64</code>, we may lose information if the stored <code>ID</code> is greater
than 2,147,483,647 (i.e. a signed 32-bit value).</p>
<p>You can now define a published property as <code>TID</code> to store any
value of our primary key, i.e. up to 9,223,372,036,854,775,807.<br />
Note that in this case, there is no information about the joined table.</p>
<p>As a consequence, the ORM will perform the following optimizations for
<code>TID</code> fields:</p>
<ul>
<li>An <em>index</em> will be created on the database, for the corresponding
column;</li>
<li>When a referenced record is deleted, the ORM <em>won't do anything</em>,
since it has no information about the table to track - this is the main
difference with a <code>TSQLRecord</code> published property.</li>
</ul>
<p>See the <a href="http://synopse.info/files/html/Synopse%20mORMot%20Framework%20SAD%201.18.html#TITL_26">
corresponding updated documentation</a>.<br />
Feedback is <a href="http://synopse.info/forum/viewtopic.php?pid=13588#p13588">welcome on our
forum</a>, as usual.</p>Sub-optimized Win64 Delphi compiler: missing branch table for case ofurn:md5:92fd165c70d5178b9b31ecb55df33cbd2014-06-30T11:18:00+02:002014-06-30T10:29:10+02:00AB4327-GANDIPascal Programming64bitblogDelphiFreePascalperformance<p>As we already stated here, the Delphi compiler for the Win64 target performs
well, as soon as you by-pass the RTL and its sub-optimized implementation -
<a href="https://blog.synopse.info?post/post/2013/03/13/x64-optimized-asm-of-FillChar%28%29-and-Move%28%29-for-Win64">
as we do for <em>mORMot</em></a>.<br />
In fact, our huge set of regression tests <a href="https://blog.synopse.info?post/post/2013/03/07/64-bit-compatibility-of-mORMot-core-units">performs only 10%
slower on Win64, when compared to Win32</a>.<br />
But we get access to much more memory - not a huge gain for a
<em>mORMot</em> server, which uses very little RAM - though it may be useful
when you need a lot of structures loaded in RAM.</p>
<p><img src="http://docwiki.embarcadero.com/images/RADStudio/XE3/e/8/85/ActiveWindows64Platform.png" alt="" /></p>
<p>The slowdown on Win64 is mostly due to the bigger pointer size, which uses
twice the memory, hence may generate a larger number of cache misses (failed
attempts to read or write a piece of data in the cache, which result in a main
memory access with much longer latency).<br />
But Delphi, apart from its RTL which may need more performance tuning
(which seems <a href="https://blog.synopse.info?post/post/2013/05/21/Performance-issue-in-NextGen-ARC-model">not to be a priority
on Embarcadero's side</a>), is also sometimes less efficient when generating
code.<br />
For instance, it sounds as if <code>case ... of ... end</code> statements do not
generate <a href="http://en.wikipedia.org/wiki/Branch_table">branch table
instructions</a> on Win64, whereas they do for Win32 - and FPC does for any x64
platform it supports.</p> <p>As stated by Wikipedia:</p>
<blockquote>
<p>In computer programming, a branch table or jump table is a method of
transferring program control (branching) to another part of a program (or a
different program that may have been dynamically loaded) using a table of
branch or jump instructions. It is a form of multiway branch. The branch table
construction is commonly used when programming in assembly language but may
also be generated by a compiler, especially when implementing an optimized
switch statement where known, small ranges are involved with few gaps.</p>
</blockquote>
<p>Here is a simple <code>case ... of ... end</code> statement, as found in our
<em>SynCrossPlatformJSON.pas</em> unit:</p>
<pre>
<code>  case VType of
  {$ifndef NEXTGEN}
    vtString:      result := string(VString^);
    vtAnsiString:  result := string(AnsiString(VAnsiString));
    vtChar:        result := string(VChar);
    vtPChar:       result := string(VPChar);
    vtWideString:  result := string(WideString(VWideString));
  {$endif}
  {$ifdef UNICODE}
    vtUnicodeString: result := string(VUnicodeString);
  {$endif}
    vtPWideChar:   result := string(VPWideChar);
    vtWideChar:    result := string(VWideChar);
    vtBoolean:     if VBoolean then result := '1' else result := '0';
    vtInteger:     result := IntToStr(VInteger);
    vtInt64:       result := IntToStr(VInt64^);
    vtCurrency:    DoubleToJSON(VCurrency^,result);
    vtExtended:    DoubleToJSON(VExtended^,result);
    vtObject:      result := ObjectToJSON(VObject);
    vtVariant:     if TVarData(VVariant^).VType<=varNull then
                     result := 'null' else begin
                       wasString := VarIsStr(VVariant^);
                       result := VVariant^;
                     end;
    else result := '';
  end;</code>
</pre>
<p>Here is the code generated by Delphi on Win64:</p>
<pre>
<code>
SynCrossPlatformJSON.pas.727: case VType of
0000000000560F40 480FB64608 movzx rax,byte ptr [rsi+$08]
0000000000560F45 4883F809 cmp rax,$09
0000000000560F49 7F6B jnle VarRecToValue + $B6
0000000000560F4B 4883F809 cmp rax,$09
0000000000560F4F 0F842F010000 jz VarRecToValue + $184
0000000000560F55 4883F803 cmp rax,$03
0000000000560F59 7F33 jnle VarRecToValue + $8E
0000000000560F5B 4883F803 cmp rax,$03
0000000000560F5F 0F8496010000 jz VarRecToValue + $1FB
0000000000560F65 4883E801 sub rax,$01
0000000000560F69 4883F8FF cmp rax,-$01
0000000000560F6D 0F844F010000 jz VarRecToValue + $1C2
0000000000560F73 4885C0 test rax,rax
0000000000560F76 0F8419010000 jz VarRecToValue + $195
0000000000560F7C 4883E801 sub rax,$01
0000000000560F80 4885C0 test rax,rax
0000000000560F83 0F85C4010000 jnz VarRecToValue + $24D
0000000000560F89 E9A5000000 jmp VarRecToValue + $133
0000000000560F8E 4883E804 sub rax,$04
0000000000560F92 4885C0 test rax,rax
0000000000560F95 747C jz VarRecToValue + $113
0000000000560F97 4883E802 sub rax,$02
0000000000560F9B 4885C0 test rax,rax
0000000000560F9E 0F84A0000000 jz VarRecToValue + $144
0000000000560FA4 4883E801 sub rax,$01
0000000000560FA8 4885C0 test rax,rax
0000000000560FAB 0F859C010000 jnz VarRecToValue + $24D
0000000000560FB1 E956010000 jmp VarRecToValue + $20C
0000000000560FB6 4883F80D cmp rax,$0d
0000000000560FBA 7F32 jnle VarRecToValue + $EE
0000000000560FBC 4883F80D cmp rax,$0d
0000000000560FC0 0F8456010000 jz VarRecToValue + $21C
0000000000560FC6 4883E80A sub rax,$0a
0000000000560FCA 4885C0 test rax,rax
0000000000560FCD 0F84A1000000 jz VarRecToValue + $174
0000000000560FD3 4883E801 sub rax,$01
0000000000560FD7 4885C0 test rax,rax
0000000000560FDA 7447 jz VarRecToValue + $123
0000000000560FDC 4883E801 sub rax,$01
0000000000560FE0 4885C0 test rax,rax
0000000000560FE3 0F8564010000 jnz VarRecToValue + $24D
0000000000560FE9 E9F3000000 jmp VarRecToValue + $1E1
0000000000560FEE 4883E80F sub rax,$0f
0000000000560FF2 4885C0 test rax,rax
0000000000560FF5 745D jz VarRecToValue + $154
0000000000560FF7 4883E801 sub rax,$01
0000000000560FFB 4885C0 test rax,rax
0000000000560FFE 0F84CD000000 jz VarRecToValue + $1D1
0000000000561004 4883E801 sub rax,$01
0000000000561008 4885C0 test rax,rax
000000000056100B 0F853C010000 jnz VarRecToValue + $24D
0000000000561011 EB51 jmp VarRecToValue + $164
</code>
</pre>
<p>And here is the code generated by FPC on Win64:</p>
<pre>
<code>
mov eax, dword ptr [rsi] ; 0027 _ 8B. 06
cmp eax, 2 ; 0029 _ 83. F8, 02
jc ?_0067 ; 002C _ 72, 15
cmp eax, 3 ; 002E _ 83. F8, 03
stc ; 0031 _ F9
jz ?_0067 ; 0032 _ 74, 0F
sub eax, 12 ; 0034 _ 83. E8, 0C
cmp eax, 2 ; 0037 _ 83. F8, 02
jc ?_0067 ; 003A _ 72, 07
cmp eax, 4 ; 003C _ 83. F8, 04
stc ; 003F _ F9
jz ?_0067 ; 0040 _ 74, 01
clc ; 0042 _ F8
?_0067: setae byte ptr [rdi] ; 0043 _ 0F 93. 07
mov rax, qword ptr [rsi] ; 0046 _ 48: 8B. 06
cmp rax, 16 ; 0049 _ 48: 83. F8, 10
ja ?_0084 ; 004D _ 0F 87, 000001F6
lea rdx, [?_0086] ; 0053 _ 48: 8D. 15, 00000000(rel)
movsxd rax, dword ptr [rdx+rax*4] ; 005A _ 48: 63. 04 82
lea rax, [rdx+rax] ; 005E _ 48: 8D. 04 02
jmp rax ; 0062 _ FF. E0
...
?_0086 label dword ; switch/case jump table
dd ?_0077-$ ; 0000 _ 00000172 (rel)
dd ?_0075-$+4H ; 0004 _ 00000148 (rel)
dd ?_0070-$+8H ; 0008 _ 00000098 (rel)
dd ?_0080-$+0CH ; 000C _ 000001DF (rel)
dd ?_0068-$+10H ; 0010 _ 00000078 (rel)
dd ?_0084-$+14H ; 0014 _ 00000261 (rel)
dd ?_0071-$+18H ; 0018 _ 000000CC (rel)
dd ?_0081-$+1CH ; 001C _ 00000204 (rel)
dd ?_0084-$+20H ; 0020 _ 0000026D (rel)
dd ?_0074-$+24H ; 0024 _ 00000144 (rel)
dd ?_0073-$+28H ; 0028 _ 00000124 (rel)
dd ?_0069-$+2CH ; 002C _ 000000AB (rel)
dd ?_0079-$+30H ; 0030 _ 000001E0 (rel)
dd ?_0082-$+34H ; 0034 _ 0000023D (rel)
dd ?_0084-$+38H ; 0038 _ 00000285 (rel)
dd ?_0072-$+3CH ; 003C _ 00000114 (rel)
dd ?_0078-$+40H ; 0040 _ 000001CF (rel)
</code>
</pre>
<p>As you can see, the FPC 2.7.1 compiler generates a <em>branch table</em>, so
it will perform much better.<br />
The single <code>movsxd rax, dword ptr [rdx+rax*4]</code>
instruction replaces a huge list of <code>cmp/jz</code> statements.</p>
<p>It sounds as if the Open Source <em>FreePascal</em> compiler generates better
code than Delphi, <a href="http://forum.lazarus.freepascal.org/index.php/topic,24509.msg147599.html?PHPSESSID=10cf0d3560f7a37adda1e58da7a24b98#msg147599">
not only for floating-point computations</a>, but also for simple general-purpose
code.<br />
BTW the floating-point regression issue in XE6 was <a href="http://qc.embarcadero.com/wc/qcmain.aspx?d=124652">marked as resolved in
QC</a> and fixed in XE6 update 1. But it is still slower than FPC on 32
bit...</p>
there were recently some articles</a> about performance comparisons between
several versions of the Delphi compiler, we had to react, and give our
personal point of view.</p>
<p>IMHO there won't be any definitive statement about this.<br />
I'm always doubtful about any conclusion which may be reached with such kind
of benchmarks.<br />
Asking "which compiler is better?" is IMHO the wrong question.<br />
As if there were some "compiler magic": the new compiler will be just like a new
laundry detergent - it will wash cleaner and whiter...</p>
<p>Performance is not about marketing.<br />
Performance is an iterative process, always a matter of <em>circumstances</em>,
and <em>implementation</em>.</p>
<p><img src="http://dev.day.com/content/docs/en/cq/current/deploying/performance/_jcr_content/par/image_3.img.jpg" alt="" width="300" height="147" /></p>
<p><em>Circumstances</em> of the benchmark itself.<br />
Each benchmark will report information only about the process it
measured.<br />
What you compare is a limited set of features, running most of the time an
idealized and simplified pattern, which shares nothing with real-world
processes.</p>
<p><em>Implementation</em> is what gives performance.<br />
Changing the compiler will only give you a few percent of change.<br />
Identifying the true bottlenecks of an application <a href="http://www.delphitools.info/samplingprofiler/">via a profiler</a>, then
changing the implementation of the identified bottlenecks, may give orders of
magnitude of speed improvement.<br />
For instance, multi-threading abilities can be achieved by <a href="https://blog.synopse.info?post/post/2011/05/20/How-to-write-fast-multi-thread-Delphi-applications">following
some simple rules</a>.</p>
<p>With our huge set of regression tests, we have at hand more than 16,500,000
individual checks, covering low-level features (like numerical and text
marshaling) as well as high-level processes (like concurrent client/server and
multi-threaded database access).</p>
<p>You will find here some benchmarks run with Delphi 6, 7, 2007, XE4 and XE6
under Win32, and XE4 and XE6 under Win64.<br />
In short, all compilers perform more or less at the same speed.<br />
Win64 is <a href="https://blog.synopse.info?post/post/2013/03/07/64-bit-compatibility-of-mORMot-core-units">a
little slower</a> than Win32, and the fastest appears to be Delphi 7, using our
<a href="https://blog.synopse.info?post/post/2009/12/20/Enhanced-Run-Time-library-for-Delphi-7">enhanced and
optimized RTL</a>.</p> <blockquote>
<p><code>Delphi 6 compiler<br />
Time elapsed for all tests: 35.38s</code></p>
<p><code>Delphi 7 compiler (with our enhanced RTL)<br />
Time elapsed for all tests: 34.79s</code></p>
<p><code>Delphi 2007 compiler<br />
Time elapsed for all tests: 36.04s</code></p>
<p><code>Delphi XE4 compiler<br />
Time elapsed for all tests: 38.09s</code></p>
<p><code>Delphi XE6 compiler<br />
Time elapsed for all tests: 37.53s</code></p>
<p><code>Delphi XE4 64 bit compiler<br />
Time elapsed for all tests: 41.40s</code></p>
<p><code>Delphi XE6 64 bit compiler<br />
Time elapsed for all tests: 40.87s</code></p>
</blockquote>
<p>You can find details about those regression tests, as <a href="https://blog.synopse.info?post/public/mORMot/mORMotRegressionTestsBenchmark.zip">mORMot regression tests
text reports</a>.<br />
Or, even better, you can <a href="http://synopse.info/fossil/finfo?name=SQLite3/TestSQL3.dpr">run all tests by
yourself</a>.</p>
<p>This is not a definitive answer.<br />
In short, for most real processes, the Delphi compiler did not improve
execution speed.<br />
On the contrary, we may say that the generated executables are slightly slower
with newer versions.<br />
The compiler itself is perhaps not the main point in our tests, but the
RTL, which has not been modified with speed in mind since Delphi 2010.<br />
Even if <em>mORMot</em> code by-passes the RTL for most of its process, we
can still see some speed regressions when compared to pre-Unicode versions of
Delphi.<br />
In some cases, the generated asm is faster since Delphi 2007, mainly due to
function <em>inlining</em> abilities.<br />
But we can't say that the Delphi compiler generates much better code in newer
versions.<br />
And we can assure you that the RTL is a true bottleneck: from our experiments,
Win64 process is only slightly slower than Win32, because we
by-pass the RTL, and use our own set of low-level routines (including optimized
x64 asm in <code>SynCommons.pas</code>).</p>
<p>When testing the <a href="http://www.freepascal.org/">FreePascal
compiler</a>, we found out that its generated code is slightly slower than
Delphi's.<br />
Floating-point is much faster with <em>FreePascal</em> than with Delphi, but
for common code (like our framework regression tests), FreePascal is slightly
less efficient than Delphi.<br />
It is still perfectly usable in production, generating smaller executables, with
better cross-platform support, and a tuned RTL.</p>
<p>So, is it worth upgrading?<br />
Are newer versions of Delphi worth the price?</p>
<p>To be fair... the Delphi compiler did not improve much in the last 10
years...<br />
But just like GCC or other compilers!<br />
The only dimension where performance did improve by orders of magnitude is
floating-point processing, and auto-vectorisation of the code, using SSE
instructions. But for business code (like database or client-server process),
the main point is definitively not the compiler, but the algorithm. The
hardware did improve a lot (pipelining, cache, multi-core...), and is the main
improvement axis.</p>
<p>Feedback is <a href="http://synopse.info/forum/viewtopic.php?id=1832">welcome on our forum, as
usual</a>.</p>