Tag - asm

2017, Thursday August 10

Faster and cross-platform SynLZ

You probably know about our SynLZ compression unit, written in pascal and x86 asm, which compresses very fast with a good compression ratio, and proudly competes with LZ4 or Snappy.
It is used in our framework everywhere, e.g. for WebSockets communication, for ECC encrypted file content, or to compress executable resources. 

Two pieces of news to share:

1. I've added SynLZ support for the NextGen compiler, now available in a new unit of the "CrossPlatform" sub-folder.
Feedback is welcome, since we don't use Delphi for iOS and Android, and prefer FPC for Linux!

2. I've also written a new x64 asm optimized version of SynLZ, and profiled and tuned the existing x86 asm, which is now even faster than before.
For a 100MB text log file, SynLZ is faster than Snappy, and compresses better (93% instead of 84%).
For other kinds of files, Snappy is slightly faster at decompression, but SynLZ compresses better, and most of the time faster.
When used on a REST server solution, as with mORMot, compression speed does matter more than decompression.

For Win32:

Win32 Processing DragonFly-devpcm.log = 98.7 MB for 1 times
Snappy compress in 125.07ms, ratio=84%, 789.3 MB/s
Snappy uncompress in 70.35ms, 1.3 GB/s
SynLZ compress in 103.61ms, ratio=93%, 952.8 MB/s
SynLZ uncompress in 68.71ms, 1.4 GB/s

For Win64:

Win64 Processing DragonFly-devpcm.log = 98.7 MB for 1 times
Snappy compress in 107.13ms, ratio=84%, 921.5 MB/s
Snappy uncompress in 61.06ms, 1.5 GB/s
SynLZ compress in 97.25ms, ratio=93%, 1015.1 MB/s
SynLZ uncompress in 61.27ms, 1.5 GB/s

Of course, we didn't change the SynLZ binary format, so it is just perfectly backward compatible with any existing program.
Anyway, from my point of view, the main benefit of SynLZ is that it was designed in plain pascal, so it is clearly cross-platform and well integrated with Delphi/FPC (no external .obj/.o/.dll required).

Feedback is welcome in our forum, as usual!

2015, Tuesday June 30

Faster String process using SSE 4.2 Text Processing Instructions STTNI

A lot of our code, and probably yours, relies heavily on text processing.
In our mORMot framework, most of its features use JSON text, encoded as UTF-8.
Profiling shows that a lot of time is spent computing the end of a text buffer, or comparing text content.

You may know that in its SSE4.2 feature set, Intel added STTNI (String and Text New Instructions) opcodes.
They are several new instructions that perform character searches and comparison on two operands of 16 bytes at a time.

I've just committed an optimized version of StrComp() and StrLen(), also used for our TDynArrayHashed wrapper.
The patch works from Delphi 5 up to XE8, and with FPC - unknown SSE4.2 opcodes have been entered as hexadecimal bytes, for compatibility with the last century compilers!
The resulting speed up may be worth it!

The next logical step would be to use those instructions in the JSON processing itself.
It may speed up the parsing of our core functions (which are already very optimized, but written as a classical one-char-at-a-time reading).
The main benefit would be to read the incoming UTF-8 text buffer by blocks of 16 bytes, and perform several character comparisons in a few CPU cycles, with no branching.
JSON writing would also benefit from it, since escaping could be sped up thanks to STTNI instructions.
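
To make the idea more concrete, here is a plain pascal sketch of what such a block scan computes when looking for the characters that need JSON escaping. It only emulates, with a byte loop, the result of an STTNI "equal any" search over 16 bytes: it is not SSE4.2 code, and the ScanBlock16 and NextEscape names are hypothetical, not functions of SynCommons.pas.

```pascal
program SttniSketch;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

const
  BLOCK = 16; // STTNI instructions work on 16-byte operands

// Emulates an "equal any" search over one block: returns the index (0..15)
// of the first byte of P which is #0 or belongs to Chars, or 16 if none.
// The real opcode would do this for the whole block at once, with no
// per-byte branch - this loop is only here to show the expected result.
function ScanBlock16(P: PAnsiChar; const Chars: AnsiString): integer;
var
  i, j: integer;
begin
  for i := 0 to BLOCK - 1 do
  begin
    if P[i] = #0 then
    begin
      Result := i; // end of text: never read past the terminator
      exit;
    end;
    for j := 1 to length(Chars) do
      if P[i] = Chars[j] then
      begin
        Result := i;
        exit;
      end;
  end;
  Result := BLOCK;
end;

// Offset of the first '"', '\' or end of text in a #0-terminated buffer,
// advancing block by block instead of char by char.
function NextEscape(P: PAnsiChar): integer;
var
  n: integer;
begin
  Result := 0;
  repeat
    n := ScanBlock16(P + Result, '"\');
    inc(Result, n);
  until n < BLOCK;
end;

var
  Json: AnsiString;
begin
  Json := 'a fairly long value without escapes, then a "quote"';
  writeln('first char to escape at offset ', NextEscape(PAnsiChar(Json)));
end.
```

With real STTNI opcodes, the body of ScanBlock16 would collapse into a single instruction, which is where the expected gain comes from.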

Any feedback is welcome, as usual!

2015, Sunday June 21

Why FPC may be a better compiler than Delphi

Almost every time I'm debugging some core part of our framework, I like to see the generated asm, and try to optimize the pascal code for better speed - when it is worth it, of course!
I just made a nice observation, when comparing the assembler generated by Delphi to FPC's output.

Imagine you compile the following lines (extracted from SynCommons.pas), to convert some number into ASCII characters:

c100 := val div 100;
dec(val,c100*100); // val := val mod 100, computed with a multiplication
PWord(P)^ := TwoDigitLookupW[val];

This divides a number by 100, then computes the modulo in val, to store two digits at a time. We did not use val := val mod 100 here, since mod would do another division, so we rely on a simple multiplication to compute the modulo.
You may know that for today's CPUs, integer multiplication is very optimized, taking only a few cycles (and often one per cycle in throughput, thanks to pipelining), whereas a division is much more expensive - if you have some spare time, take a look at this document, and you will find out that a div opcode could use 10 times more cycles than a mul - even with the latest CPU architectures.
Let's see how our two beloved compilers do their homework (with optimization enabled, of course)...

Delphi generates the following code for c100 := val div 100:

005082AB 8BC1             mov eax,ecx
005082AD BE64000000       mov esi,$00000064
005082B2 33D2             xor edx,edx
005082B4 F7F6             div esi

Whereas FPC generates the following:

0043AC48 8b55f8                   mov    -0x8(%ebp),%edx
0043AC4B b81f85eb51               mov    $0x51eb851f,%eax
0043AC50 f7e2                     mul    %edx
0043AC52 c1ea05                   shr    $0x5,%edx
0043AC55 8955f0                   mov    %edx,-0x10(%ebp)

Even if you are assembler agnostic, and once you get past the difference in asm textual representation (Delphi uses Intel's, whereas FPC/GDB follows AT&T), you can see that Delphi generates a classic (and slow) div esi opcode, whereas FPC uses a single multiplication, followed by a bit shift.

This optimization is known as "Reciprocal Multiplication", and I will let you read this article for mathematical reference - or this one.
It multiplies (mul) the number by a scaled reciprocal of 100 (the hexadecimal 51eb851f value, i.e. 2^37/100 rounded up), keeps the upper 32 bits of the 64-bit product (left in edx), then shifts them right (shr) by 5 bits.
The total shift is therefore 37 bits, and thanks to the truncation of the integer operations, this does in fact divide the number by 100.
Even though it consists of two assembler opcodes, a mul + shr is in fact faster than a single div.
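
If you want to check the trick without reading asm, here is a minimal pascal sketch of the same computation. The $51EB851F constant and the total shift of 37 bits come from the FPC listing above; the DivBy100 name is hypothetical, and the 64-bit product is done with plain Int64 (it always stays below 2^63 for a 32-bit input).

```pascal
program ReciprocalDiv;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

const
  MAGIC: Int64 = $51EB851F; // 2^37 / 100 rounded up, as in the FPC output
  SHIFT = 37;               // 32 bits (high half of the mul) + shr 5

// Divides a 32-bit unsigned value by 100 with one multiplication and one
// shift, instead of a div opcode.
function DivBy100(v: cardinal): cardinal;
begin
  Result := cardinal((Int64(v) * MAGIC) shr SHIFT);
end;

var
  v: Int64;
  errors: integer;
begin
  errors := 0;
  v := 0;
  while v <= High(cardinal) do
  begin
    if DivBy100(cardinal(v)) <> cardinal(v) div 100 then
      inc(errors);
    inc(v, 9973); // sample the 32-bit range instead of testing every value
  end;
  if DivBy100(High(cardinal)) <> High(cardinal) div 100 then
    inc(errors);
  writeln('mismatches against div: ', errors);
end.
```

The rounding error of the constant is small enough (100 * $51EB851F - 2^37 = 28, which is below 2^5) for the result to be exact over the whole 32-bit unsigned range, not only on the sampled values.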

It is a shame that the Delphi compiler did not include this very common optimization, which is clearly a win for some very common tasks. 

Of course, the LLVM back-end used on the NextGen compiler can do it, but we may expect this classic optimization to be part of the decades-old Delphi compiler.
And I'm still not convinced about the performance of the NextGen generated code, since the associated RTL is known to be slow, so it won't benefit from LLVM optimization - which takes a LOT of time to compile, by the way (much more than FPC).

Congrats, FPC folks!

2015, Saturday February 21

SynCrypto: SSE4 x64 optimized asm for SHA-256

We have just included some optimized x64 assembler in our Open Source SynCrypto.pas unit so that SHA-256 hashing will perform at best speed.
It is an adaptation of Intel's tuned assembly macros, which make use of the SSE4 instruction set, if available.

2015, Thursday January 15

AES-NI enabled for SynCrypto

Today, we committed a new patch to enable AES-NI hardware acceleration to our SynCrypto.pas unit.

Intel® AES-NI is a new encryption instruction set that improves on the Advanced Encryption Standard (AES) algorithm and accelerates the encryption of data on newer processors.

Of course, all this is available in the Delphi unit, from Delphi 6 to XE7: no external dll nor OS update is needed.
And it will also work on Linux, so it could help encrypt the mORMot transmission with no performance loss.

You have nothing to do: just upgrade your mORMot source code, then AES-NI instructions will be used, if the CPU offers it.
We have seen a performance boost of more than 5x, depending on the size of the data to be encrypted.


2014, Sunday May 25

New crc32c() function using optimized asm and SSE 4.2 instruction

Cyclic Redundancy Check (CRC) codes are widely used for integrity checking of data in fields such as storage and networking.
There is an ever-increasing need for very high-speed CRC computations on processors for end-to-end integrity checks.

We just introduced to mORMot's core unit (SynCommons.pas) a fast and efficient crc32c() function.

It will use either:

  • Optimized x86 asm code, with unrolled loops;
  • SSE 4.2 hardware crc32 instruction, if available.

Resulting speed is very good.
This is for sure the fastest CRC function available in Delphi.
Note that there is a dedicated version for each of the Win32 and Win64 platforms - both perform at the same speed!

In fact, most popular file formats and protocols (Ethernet, MPEG-2, ZIP, RAR, 7-Zip, GZip, and PNG) use the polynomial $04C11DB7, while Intel's hardware implementation is based on another polynomial, $1EDC6F41 (used in iSCSI and Btrfs).
So you would not use this new crc32c() function to replace zlib's crc32() function, but as a convenient and very fast hashing function at application level.
For instance, our TDynArray wrapper will use it for fast items hashing.
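
To see why the two checksums are not interchangeable, here is a slow bit-by-bit reference in plain pascal, which only differs by the polynomial. It is a sketch for comparison purposes, not the optimized crc32c() of SynCommons.pas; the CrcBitwise name and its parameters are hypothetical. The constants are the bit-reflected forms of the two polynomials quoted above.

```pascal
program Crc32cSketch;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

uses
  SysUtils;

const
  POLY_CRC32  = $EDB88320; // $04C11DB7 reflected (zlib, ZIP, PNG...)
  POLY_CRC32C = $82F63B78; // $1EDC6F41 reflected (iSCSI, SSE 4.2 crc32)

// Slow reference CRC, one bit at a time, for a given reflected polynomial.
function CrcBitwise(poly: cardinal; buf: PAnsiChar; len: integer): cardinal;
var
  i, bit: integer;
begin
  Result := $FFFFFFFF;
  for i := 0 to len - 1 do
  begin
    Result := Result xor byte(buf[i]);
    for bit := 1 to 8 do
      if (Result and 1) <> 0 then
        Result := (Result shr 1) xor poly
      else
        Result := Result shr 1;
  end;
  Result := Result xor $FFFFFFFF; // final bit inversion
end;

const
  CHECK: AnsiString = '123456789'; // usual CRC check string

begin
  // expected check values: $CBF43926 for crc32, $E3069283 for crc32c
  writeln('crc32  = $',
    IntToHex(Int64(CrcBitwise(POLY_CRC32, PAnsiChar(CHECK), length(CHECK))), 8));
  writeln('crc32c = $',
    IntToHex(Int64(CrcBitwise(POLY_CRC32C, PAnsiChar(CHECK), length(CHECK))), 8));
end.
```

Same algorithm, same input, two unrelated results: a file format expecting crc32 will reject a crc32c value, which is why the new function is meant for hashing only.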

2013, Thursday December 5

New Open Source Multi-Thread ready Memory Manager: SAPMM

Do you remember this former article about scalability of the Delphi memory manager, in multi-thread execution context?

Our SynScaleMM is still experimental.
But it did pretty well, for an experiment!

At first, you can take a look at ScaleMM2, which is more stable, and based on the same ground.

But a new multi-thread friendly memory manager for Delphi just came out.
It is in fact the anonymous (and already famous) "NN memory manager" Primož talked about in his article about string building and memory managers.

(Note that in this article, our SynScaleMM was found to be scaling very well, but on the other hand, Primož did compile its benchmark program in Debug mode, so our TTextWriter was not in good shape: when you compile in Release mode, optimizations and inlining are ON, and our good TTextWriter just flies... See the note at the beginning of the article - this is why I never find those benchmarks very informative. I always prefer profiling from the real world with real useful process… and was never convinced by any such naive benchmark.)

OK, back to our business!

SapMM is an interesting beast.

It sounds as if Alexei (the initial coder) has a C coding background. But that's fine when you have to deal with low-level structures and algorithms, as required by a memory manager. :)
It features everything we may ask for such a piece of code: clear design, optimized code (mostly by inlining process), memory leak reporting, some parameters for tuning.

It is only for Delphi XE (and up) under Win32 by now, but contributors are welcome!
It has been used in production for more than half a year, and it passed all FastcodeMM benchmark tests.

If you want a direct link to today's source code, without SVN, you may try this direct link from our site.
(but it probably will never be updated - you are warned)

2013, Tuesday May 21

Performance issue in NextGen ARC model

Apart from being very slow during compilation, the Delphi NextGen compiler introduced a new memory model, named ARC.

We already spoke about ARC years ago, so please refer to our corresponding blog article for further information, especially about how Apple did introduce ARC to iOS instead of the Garbage Collector model.

About how ARC is to be used in the NextGen compiler, take a look at Marco's blog article, and its linked resources.

But the ARC model, as implemented by Embarcadero, has at least one huge performance issue, in the way weak references, and zeroing weak pointers have been implemented.
I do not speak about the general slow down introduced during every class/record initialization/finalization, which is noticeable, but not a big concern.

If you look at XE4 internals, you will discover a disappointing global lock introduced in the RTL.

2013, Wednesday March 13

x64 optimized asm of FillChar() and Move() for Win64

We have included x64 optimized asm of FillChar() and Move() for Win64 - for corresponding compiler targets, i.e. Delphi XE2 and XE3.
It will properly handle cache prefetch and appropriate SSE2 move instructions.

The System.pas unit of Delphi RTL will be patched at startup, unless the NOX64PATCHRTL conditional is defined.
Therefore, the whole application may benefit from this optimized version.

Performance improvement is noticeable, when compared with the original pascal-based version included in System.pas.

By the way, the Delphi x64 built-in assembler does not recognize the movnti opcode... so we had to inline it as plain db hexadecimal values.
A bit disappointing. Until now, we did not suffer from anything in regard to the x64 compatibility at Delphi level.

No stand-alone unit available yet, since it is included in our SynCommons.pas shared unit, starting with the 1.18 revision of mORMot.

Feedback is welcome, as usual!

2013, Thursday March 7

64 bit compatibility of mORMot units

I'm happy to announce that mORMot units are now compiling and working great in 64 bit mode, under Windows.
You need a Delphi XE2/XE3 compiler, of course!

ORM and services are now available in Win64, on both client and server sides.
Low-level x64 assembler stubs have been created, tested and optimized.
UI part is also available... that is grid display, reporting (with pdf export and display anti-aliasing), ribbon auto-generation, SynTaskDialog, i18n... the main SynFile demo just works great!

Overall impression is very positive, and speed is comparable to 32 bit version (only 10-15% slower).

Speed decrease seems to be mostly due to doubled pointer size, and some less optimized part of the official Delphi RTL.
But since mORMot core uses its own set of functions (e.g. for JSON serialization, RTTI support or interface calls or stubbing), we were able to release the whole 64 bit power of your hardware.

Delphi 64 bit compiler sounds stable and efficient. Even when working at low level, with assembler stubs.
Generated code sounds more optimized than the one emitted by FreePascalCompiler - and RTL is very close to 32 bit mode.
Overall, VCL conversion worked as easily as a simple re-build.
Embarcadero's people did a great job for VCL Win64 support, here!

2012, Thursday December 20

How to make it fast?

On our forum, a clever question was posted about publishing some enhanced RTL functions for newer versions of Delphi - as we did for Delphi 7 and 2007.

I was looking for a faster IntToStr implementation and discovered SynCommons.pas.
That's really too bad, SynCommons.pas really does contain some seriously fast stuff, people would greatly benefit from it if it was made general-purpose.

In fact, it would not be enough to change the RTL function implementations.
IMHO, to write something scalable, you need to get rid of such functions.

2012, Tuesday November 13

Go language and Delphi

Do you know the Go language?

It is strongly typed, compiled, cross-platform, and concurrent.
It features some nice high-level structures, like maps and strings, and still has very low-level access to the generated code: pointers are there, in a safe strongly typed implementation just like in pascal, and there is even a "goto", which sounds like a heresy to dogmatic coders, but does make sense to me, at least when you want to optimize code speed, in some rare cases.

It is created/pushed by Google, used internally by the company in their computer farms, and was designed by one of the original C creators.

2012, Tuesday April 10

How function results are allocated

One potential issue with Delphi coding is about how the results of functions are implemented.

If you forget to set a result value to a function, you'll get a compiler warning.
Never underestimate such a warning: IMHO this is not a warning, but an error.

And you had better be aware of the handling of reference-counted types (e.g. string) in function results: those are passed on the stack as var parameters, so the result of a function may be set even if an exception is raised during function execution!
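
Here is a small pascal sketch of this last point, which you may try with your own compiler. The Faulty and ValueAfterFailure names are hypothetical. What gets displayed is an assumption to verify rather than a guarantee: it depends on the compiler, and on whether the hidden var parameter points directly at the destination variable or at a temporary.

```pascal
program ResultAndException;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

uses
  SysUtils;

// Sets its string result, then fails before returning normally.
function Faulty: string;
begin
  Result := 'partial';
  raise Exception.Create('failure after Result was set');
end;

// Calls Faulty, swallows the exception, and returns what the destination
// variable contains afterwards.
function ValueAfterFailure: string;
var
  s: string;
begin
  s := 'initial';
  try
    s := Faulty;
  except
    on E: Exception do
      writeln('raised: ', E.Message);
  end;
  // if the hidden var parameter pointed directly at s, it is now 'partial',
  // even though the assignment never completed
  Result := s;
end;

begin
  writeln('s = ', ValueAfterFailure);
end.
```

So never assume that a variable kept its previous value just because the function which was supposed to fill it raised an exception.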

2011, Tuesday November 8

Currency is your friend

The currency type is the standard Delphi type to be used when storing and handling monetary values. It will avoid any rounding problems, with 4 decimals precision. It is able to safely store numbers in the range -922337203685477.5808 .. 922337203685477.5807. Should be enough for your pocket change.

As stated by the official Delphi documentation:

Currency is a fixed-point data type that minimizes rounding errors in monetary calculations. On the Win32 platform, it is stored as a scaled 64-bit integer with the four least significant digits implicitly representing decimal places. When mixed with other real types in assignments and expressions, Currency values are automatically divided or multiplied by 10000.

In fact, this type matches the corresponding OLE and .Net implementation of currency, and the one used by most database providers (when it comes to money, a dedicated type is worth the cost in a "rich man's world"). It is still implemented the same in the Win64 platform (since XE 2). The Int64 binary representation of the currency type (i.e. value*10000 as accessible via PInt64(@aCurrencyValue)^) is a safe and fast implementation pattern.

In our framework, we tried to avoid any unnecessary conversion to float values when dealing with currency values. Some dedicated functions have been implemented for fast and secure access to currency published properties via RTTI, especially when converting values to or from JSON text. Using the Int64 binary representation can be not only faster, but also safer: you will avoid any rounding problem which may be introduced by the conversion to a float type. Rounding issues are a nightmare to track - it sounds safe to have a framework handling natively a currency type from the ground up.
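
As a minimal sketch of this pattern, the following program reads and writes a currency through its Int64 binary representation, with no floating-point value involved. The CurrencyBits and CurrencyFromBits names are hypothetical, and are not the framework's own functions.

```pascal
program CurrencyAsInt64;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

// Returns the scaled 64-bit integer stored inside a currency (value*10000).
function CurrencyBits(const c: currency): Int64;
begin
  Result := PInt64(@c)^;
end;

// Builds a currency from its scaled representation, with no float involved.
function CurrencyFromBits(v: Int64): currency;
begin
  PInt64(@Result)^ := v;
end;

var
  price, total: currency;
begin
  price := 12.3456;
  writeln('price as Int64: ', CurrencyBits(price));
  // plain integer arithmetic on the scaled value: no rounding can occur
  total := CurrencyFromBits(CurrencyBits(price) * 3);
  writeln('total as Int64: ', CurrencyBits(total));
end.
```

Converting such an Int64 to or from text is then just a matter of placing the decimal separator four digits from the right, which is what makes it convenient for JSON.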

2011, Monday September 12

Using Extended in Delphi XE2 64 bit

Unfortunately, Delphi's 64-bit compiler (dcc64) and RTL do not support 80-bit extended floating point values on Win64, but silently alias Extended = Double on Win64.

There are situations, however, where this is clearly undesirable, e.g. if the additional precision gained from Extended is required.

The Open-source uTExtendedX87 unit provides a replacement FPU-backed 80-bit Extended floating point type (TExtendedX87) for Win64.

2011, Sunday August 28

Multi-threading and Delphi

Writing working multi-threaded code is not easy - it's even hard, as a Delphi expert just wrote in his blog.

In fact, the first step into multi-thread application development could be:

"protect your shared variables with locks (aka critical sections), because you are not sure that the data you read/write is the same for all threads".

The CPU per-core cache is just one of the possible issues, which will lead to reading wrong values. Another issue which may lead to a race condition is two threads writing to a resource at the same time: it's impossible to know which value will be stored afterward.
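
A minimal sketch of this first step, using the standard TThread and TCriticalSection classes: several threads increment the same counter, and the lock makes each read-modify-write atomic. The TWorker and RunWorkers names, and the number of threads and loops, are hypothetical choices for the example.

```pascal
program SharedCounter;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

uses
  {$ifdef FPC}{$ifdef UNIX}cthreads,{$endif}{$endif}
  Classes, SyncObjs;

const
  THREADS = 4;
  LOOPS = 100000;

var
  Counter: integer;       // shared variable, written by every thread
  Lock: TCriticalSection; // protects Counter

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  i: integer;
begin
  for i := 1 to LOOPS do
  begin
    Lock.Acquire;
    try
      inc(Counter); // read-modify-write, only safe while holding the lock
    finally
      Lock.Release;
    end;
  end;
end;

// Runs all workers to completion and returns the final counter value.
function RunWorkers: integer;
var
  w: array[1..THREADS] of TWorker;
  i: integer;
begin
  Counter := 0;
  Lock := TCriticalSection.Create;
  try
    for i := 1 to THREADS do
      w[i] := TWorker.Create(false); // start immediately
    for i := 1 to THREADS do
    begin
      w[i].WaitFor;
      w[i].Free;
    end;
  finally
    Lock.Free;
  end;
  Result := Counter;
end;

begin
  writeln('final counter: ', RunWorkers, ' (expected ', THREADS * LOOPS, ')');
end.
```

If you remove the Acquire/Release pair, the final value may be lower than expected, since increments from two threads can overwrite each other.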

2011, Monday August 8

Our mORMot won't hibernate this winter, thanks to FireMonkey

Everybody is buzzing about FireMonkey...

Our little mORMot will like FireMonkey!
Here is why...

2011, Thursday June 16

Which Delphi compiler produces faster code?

After a question on StackOverflow, I wanted to comment about the speed of the code generated by various Delphi compiler versions.

Since performance matters when we write general purpose libraries like ours, we have some feedback to share.

2011, Tuesday June 7

Intercepting exceptions: a patch to rule them all

In order to let our TSynLog logging class intercept all exceptions, we use the low-level global RtlUnwindProc pointer, defined in System.pas.

Alas, under Delphi 5, this global RtlUnwindProc variable does not exist. The code directly calls the RtlUnwind Windows API function, with no hope of custom interception.

Two solutions could be envisaged:

  • Modify the System.pas source code, adding the new RtlUnwindProc variable, just like Delphi 7;
  • Patch the assembler code, directly in the process memory.

The first solution is simple. Even if compiling System.pas is a bit more difficult than compiling other units, we already did that for our Enhanced RTL units. But you'll have to change the whole build chain in order to use your custom System.dcu instead of the default one. And some third-party units (only available in .dcu form) may not like the fact that the System.pas interface changed...

So we used the second solution: change the assembler code in the running process memory, so that it calls our RtlUnwindProc variable instead of the Windows API.

True per-class variable

For our ORM, we needed a class variable to be available for each TSQLRecord class type.

This variable is used to store the properties of this class type, i.e. the database Table properties (e.g. table and column names and types) associated with a particular TSQLRecord class, from which all our ORM objects inherit.

The class var statement was not enough for us:
- It's not available on earlier Delphi versions, and we try to have our framework work with Delphi 6-7 up to XE;
- This class var instance will be shared by all classes inheriting from the class where it is defined - and we need ONE instance PER class type, not ONE instance for ALL.

We needed to find another way to implement this class variable.

An unused VMT slot in the class type description was identified, then each class definition was patched in the process memory to contain our class variable.
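
The sharing issue of class var is easy to reproduce. In this sketch, TBaseRecord, TCustomer and TInvoice are hypothetical stand-ins for TSQLRecord and two of its descendants, and it needs a compiler which knows the class var statement.

```pascal
program ClassVarShared;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

type
  // hypothetical stand-ins for TSQLRecord and two ORM classes
  TBaseRecord = class
  public
    class var TableName: string; // ONE storage for the whole hierarchy
  end;
  TCustomer = class(TBaseRecord);
  TInvoice = class(TBaseRecord);

begin
  TCustomer.TableName := 'Customer';
  TInvoice.TableName := 'Invoice';  // overwrites the single shared storage
  writeln(TCustomer.TableName);     // displays Invoice, not Customer
  writeln(TBaseRecord.TableName);   // displays Invoice as well
end.
```

Each descendant sees the very same variable, so it cannot hold per-table properties: hence the need for some storage attached to each class type itself.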
