Apart from the low-level functions, there is much more to do if you want your application to scale in a multi-threaded environment.
What we try to do with mORMot is to let it scale as much as possible.
The main trick is to avoid any unnecessary memory allocation, since every allocation goes through the heap manager that all threads share.

Functions returning a string (or RawUTF8) are slow by design in this respect, because each result needs its own heap allocation.
You are better off using dedicated classes that avoid memory allocation. This is what we do for all our DB and JSON processing (e.g. avoiding memory copies, and working with in-place parsing and pointers).
See this blog article.
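
To make the "in-place parsing and pointers" idea concrete, here is a minimal sketch in C. It is not mORMot code: the names field_view, next_field, view_to_ulong and sum_fields are hypothetical, invented only for this illustration. It assumes a NUL-terminated, comma-separated line of unsigned decimal numbers, and it walks the caller's buffer with pointers instead of returning a freshly allocated string per field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A field is only a view into the caller's buffer:
   no copy is made and nothing is allocated on the heap. */
typedef struct {
    const char *start;  /* first character of the field */
    size_t len;         /* number of characters (no terminating NUL) */
} field_view;

/* Hypothetical helper: store the next comma-separated field of *cursor
   in *out and advance *cursor past it.
   Returns 1 if a field was produced, 0 once the input is exhausted. */
static int next_field(const char **cursor, field_view *out)
{
    assert(cursor != NULL && out != NULL);
    const char *p = *cursor;
    if (p == NULL)
        return 0;
    const char *end = p;
    while (*end != '\0' && *end != ',')
        end++;
    out->start = p;
    out->len = (size_t)(end - p);
    /* After the last field, mark the cursor as exhausted. */
    *cursor = (*end == ',') ? end + 1 : NULL;
    return 1;
}

/* Convert a view to an unsigned number directly from the buffer.
   Parsing stops at the first non-digit; an empty view gives 0. */
static unsigned long view_to_ulong(field_view v)
{
    unsigned long value = 0;
    for (size_t i = 0; i < v.len; i++) {
        if (v.start[i] < '0' || v.start[i] > '9')
            break;
        value = value * 10 + (unsigned long)(v.start[i] - '0');
    }
    return value;
}

/* Sum all numeric fields of a line without a single allocation. */
static unsigned long sum_fields(const char *line)
{
    const char *cursor = line;
    field_view f;
    unsigned long total = 0;
    while (next_field(&cursor, &f))
        total += view_to_ulong(f);
    return total;
}
```

A version that returned each field as a new string would allocate and free once per field, and in a multi-threaded server those calls compete for the same heap. Here the only state is a pointer and a length on the stack, so several threads can parse their own buffers without touching shared resources.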

In short: if you want to scale, forget about the basic RTL functions and the "string" type, and write your own dedicated processing code - this was the purpose of the SynCommons.pas core unit of mORMot.

For the very same reason, we use our own UTF-8 string type (Delphi strings are not UTF-8 natively), to avoid conversions during internal ORM processing (e.g. with JSON and the database), with the added benefit of being ready to work with pre-Unicode versions of Delphi.

There is a big gap between such a simple loop-based benchmark and a real application.
I'm not sure that making our optimized functions "general-purpose" would be worth it. The bottleneck lies in the overall coding style.

In fact, when we run our regression tests with our Enhanced RTL under Delphi 7, there is not a big speed benefit for the mORMot code, which bypasses most of the enhanced RTL functions.
And we are not affected by Delphi RTL speed regressions (like hidden charset conversions or slow UTF-8 processing).

When it comes to speed, profiling the real application is everything!
There is no magic-bullet library that will make it fast.
Changing an algorithm always pays off more than swapping in asm-optimized versions of the same functions.