Fast JSON parsing
By A.Bouchez on 2011, Thursday June 2, 16:16 - SQLite3 Framework
When it comes to parsing textual content, two approaches are usually
considered. In the XML world, you usually have to choose
between:
- A DOM parser, which creates an in-memory tree structure of objects mapping
the XML nodes;
- A SAX parser, which reads the XML content, then calls pre-defined
events for each XML content element.
In fact, DOM parsers internally use a SAX parser to read the XML content. Therefore, with the overhead of object creation and property initialization, DOM parsers are typically three to five times slower than SAX. However, DOM parsers are much more powerful for handling the data: once it is mapped into native objects, code can access any given node almost immediately, whereas SAX-based access has to read the whole XML content again.
Most JSON parsers available in Delphi use a DOM-like approach. For instance, the DBXJSON unit included since Delphi 2010, or the SuperObject or DWS libraries, create a class instance mapping each JSON node.
In a JSON-based Client-Server ORM like ours, profiling shows that a lot of time is spent in JSON parsing, on both the Client and the Server side. Therefore, we tried to optimize this part of the library.
In order to achieve the best speed, we use a mixed approach:
- All necessary conversions (e.g. un-escaping text) are made in memory,
within the JSON buffer itself, to avoid memory allocation;
- The parser returns pointers to the converted elements (just like the
vtd-xml library).
In practice, here is how it is implemented:
- A private copy of the source JSON data is made internally (so that the
Client-Side method used to retrieve this data can safely free all allocated
memory);
- The source JSON data is parsed and replaced, in the same internal buffer, by
its un-escaped UTF-8 content (for example, strings are un-escaped and a #0
terminator is written at the end of each field value; numerical values remain
text-encoded in place, and are converted into Int64 or
double only if needed);
- Since data is replaced in memory (JSON data is a bit more verbose than pure
UTF-8 text, so there is always enough space), no memory allocation is performed
during the parsing: the whole process is very fast, not noticeably slower than
a SAX approach;
- This heavily profiled code (using pointers and hand-tuned routines) results
in very fast parsing and conversion.
This parsing "magic" is done in the GetJSONField function, as
defined in the SynCommons.pas unit:
/// decode a JSON field in an UTF-8 encoded buffer (used in TSQLTableJSON.Create)
// - this function decodes in the P^ buffer memory itself (no memory allocation
// or copy), for faster process - so take care that it's an unique string
// - PDest points to the next field to be decoded, or nil on any unexpected end
// - null is decoded as nil
// - '"strings"' are decoded as 'strings'
// - strings are JSON unescaped (and \u0123 is converted to UTF-8 chars)
// - any integer value is left as its ascii representation
// - wasString is set to true if the JSON value was a "string"
// - works for both field names or values (e.g. '"FieldName":' or 'Value,')
// - EndOfObject (if not nil) is set to the JSON value char (',' ':' or '}' e.g.)
function GetJSONField(P: PUTF8Char; out PDest: PUTF8Char;
wasString: PBoolean=nil; EndOfObject: PUTF8Char=nil): PUTF8Char;
GetJSONField lets you iterate through the whole JSON buffer content,
retrieving values or property names, and checking the character returned in
EndOfObject to follow the JSON structure.
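To make this iteration pattern concrete, here is a minimal, self-contained sketch. NextField is a hypothetical, much simplified stand-in that only mirrors the parameter list of GetJSONField: it is not the SynCommons implementation, it assumes string values contain no escape sequences, and it returns bare tokens (numbers, true, false, null) as plain text. The caller is also assumed to skip the opening '{' itself.

```pascal
program IterateFields;
{$ifdef FPC}{$mode delphi}{$endif}
{$APPTYPE CONSOLE}

// Hypothetical, simplified stand-in for GetJSONField (NOT the SynCommons code).
// Assumptions: strings contain no escapes; bare tokens are returned as text.
function NextField(P: PAnsiChar; out PDest: PAnsiChar;
  wasString: PBoolean = nil; EndOfObject: PAnsiChar = nil): PAnsiChar;
var
  Stop: PAnsiChar;
begin
  Result := nil;
  PDest := nil;
  if P = nil then
    Exit;
  while (P^ <> #0) and (P^ <= ' ') do
    Inc(P); // skip leading whitespace
  if wasString <> nil then
    wasString^ := (P^ = '"');
  if P^ = '"' then
  begin
    Inc(P);
    Result := P;
    while P^ <> '"' do
    begin
      if P^ = #0 then
      begin
        Result := nil; // unexpected end of buffer
        Exit;
      end;
      Inc(P);
    end;
    P^ := #0; // terminate the text in place, over the closing quote
    Inc(P);
  end
  else
  begin
    Result := P;
    while not (P^ in [#0..' ', ',', ':', '}', ']']) do
      Inc(P); // a bare token ends at whitespace or at a delimiter
  end;
  Stop := P;
  while (P^ <> #0) and (P^ <= ' ') do
    Inc(P);
  if not (P^ in [',', ':', '}', ']']) then
  begin
    Result := nil; // no delimiter found: malformed or truncated input
    Exit;
  end;
  if EndOfObject <> nil then
    EndOfObject^ := P^; // report the delimiter before it may be overwritten
  Stop^ := #0;          // terminate a bare token in place
  PDest := P + 1;
end;

var
  Json: AnsiString;
  P, P2, Name, Value: PAnsiChar;
  EndChar: AnsiChar;
  IsStr: Boolean;
  Num: Int64;
  ErrPos: Integer;
begin
  Json := '{"id":42,"name":"John"}';
  UniqueString(Json); // the buffer is modified, so it must be a private copy
  P := PAnsiChar(Json);
  if P^ = '{' then
    Inc(P);
  repeat
    Name := NextField(P, P2, nil, @EndChar);
    if (Name = nil) or (EndChar <> ':') then
      Break;
    Value := NextField(P2, P, @IsStr, @EndChar);
    if Value = nil then
      Break;
    ErrPos := 1;
    if not IsStr then
      Val(string(Value), Num, ErrPos); // numeric conversion only on demand
    if ErrPos = 0 then
      WriteLn(Name, ' -> integer ', Num)
    else
      WriteLn(Name, ' -> text ', Value);
  until EndChar <> ',';
end.
```

The loop alternates between a property name (expected to end with ':') and a value (ending with ',' or '}'), and stops as soon as the delimiter is no longer a comma. Values stay as text pointers into the buffer, and the Val call shows where an on-demand conversion would take place.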
This in-place parsing of textual content is one of the main reasons why we
used UTF-8 (via RawUTF8) as the common string type in our
framework, and not the generic string type, which would have
introduced a memory allocation and a charset conversion.
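As an illustration of in-place decoding on a UTF-8 buffer, the following sketch un-escapes one JSON string inside its own memory and returns a pointer to the result. UnescapeInPlace is a hypothetical helper written for this example, not the code found in SynCommons.pas. It assumes the input starts at the opening quote, and it rejects surrogate \u escapes instead of combining them.

```pascal
program InPlaceUnescape;
{$ifdef FPC}{$mode delphi}{$endif}
{$APPTYPE CONSOLE}

// Hypothetical helper (NOT the SynCommons implementation): decodes the JSON
// string starting at P^ = '"' inside its own buffer, with no allocation.
// Returns the #0-terminated decoded text, or nil on malformed input.
// Next receives the position just after the closing quote.
function UnescapeInPlace(P: PAnsiChar; out Next: PAnsiChar): PAnsiChar;
var
  Src, Dst: PAnsiChar;
  Code, Digit, i: Integer;
begin
  Result := nil;
  Next := nil;
  if (P = nil) or (P^ <> '"') then
    Exit;
  Src := P + 1;
  Dst := Src; // the write position never overtakes the read position
  while Src^ <> '"' do
  begin
    if Src^ = #0 then
      Exit; // unterminated string
    if Src^ <> '\' then
    begin
      Dst^ := Src^;
      Inc(Dst);
      Inc(Src);
      Continue;
    end;
    Inc(Src); // skip the backslash
    case Src^ of
      '"', '\', '/': Dst^ := Src^;
      'b': Dst^ := #8;
      'f': Dst^ := #12;
      'n': Dst^ := #10;
      'r': Dst^ := #13;
      't': Dst^ := #9;
      'u':
        begin
          Code := 0;
          for i := 1 to 4 do
          begin
            Inc(Src);
            case Src^ of
              '0'..'9': Digit := Ord(Src^) - Ord('0');
              'a'..'f': Digit := Ord(Src^) - Ord('a') + 10;
              'A'..'F': Digit := Ord(Src^) - Ord('A') + 10;
            else
              Exit; // not a hexadecimal digit
            end;
            Code := (Code shl 4) + Digit;
          end;
          // 6 source bytes become at most 3 UTF-8 bytes
          if Code < $80 then
            Dst^ := AnsiChar(Code)
          else if Code < $800 then
          begin
            Dst^ := AnsiChar($C0 or (Code shr 6));
            Inc(Dst);
            Dst^ := AnsiChar($80 or (Code and $3F));
          end
          else
          begin
            if (Code >= $D800) and (Code <= $DFFF) then
              Exit; // surrogates are not handled in this sketch
            Dst^ := AnsiChar($E0 or (Code shr 12));
            Inc(Dst);
            Dst^ := AnsiChar($80 or ((Code shr 6) and $3F));
            Inc(Dst);
            Dst^ := AnsiChar($80 or (Code and $3F));
          end;
        end;
    else
      Exit; // unknown escape sequence
    end;
    Inc(Dst);
    Inc(Src);
  end;
  Dst^ := #0; // Dst <= Src: at worst this overwrites the closing quote
  Next := Src + 1;
  Result := P + 1;
end;

var
  Buffer: AnsiString;
  Value, Next: PAnsiChar;
begin
  Buffer := '"He said \"hi\" \u0041\/B"';
  UniqueString(Buffer); // work on a private copy, as the buffer is overwritten
  Value := UnescapeInPlace(PAnsiChar(Buffer), Next);
  if Value <> nil then
    WriteLn(Value); // He said "hi" A/B
end.
```

Since every escape sequence is longer than the bytes it decodes to, the decoded text always fits in the space occupied by the original, which is why no allocation or charset conversion is needed when the buffer is already UTF-8.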
Feedback and comments are welcome in our forum, just as usual.