Rust-Simd json: High performance JSON parser based on a port of simdjson

SIMD Json for Rust

Rust port of extremely fast simdjson JSON parser with serde compatibility.


readme (for real!)

simdjson version

Currently tracking version 0.2.x of simdjson upstream (work in progress, feedback welcome!).

CPU target

To take advantage of simd-json, your system needs to be SIMD capable. This means the crate must be compiled with native CPU support and the given features. Projects that depend on simd-json must also be configured with native CPU support. See the cargo config in this repository for an example of how to configure this in your project.

simd-json supports AVX2, SSE4.2 and NEON.

Unless the allow-non-simd feature is passed to your simd-json dependency in your Cargo.toml, simd-json will fail to compile when no supported SIMD instruction set is enabled. This prevents unexpected slowness in fallback mode, which can be hard to understand and hard to debug.
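As a rough illustration of what "compiled with native CPU support" means, the sketch below reports which of the supported instruction sets the current binary was built for. The function name compiled_simd_target is hypothetical and not part of simd-json; it only uses the standard cfg! macro. Building with the rustc flag -C target-cpu=native (for example through RUSTFLAGS) is one common way to enable these target features.

```rust
/// Reports which SIMD instruction set this binary was compiled for.
/// Hypothetical helper for illustration; not part of simd-json.
fn compiled_simd_target() -> &'static str {
    if cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
    )) {
        "avx2"
    } else if cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "sse4.2"
    )) {
        "sse4.2"
    } else if cfg!(all(
        any(target_arch = "arm", target_arch = "aarch64"),
        target_feature = "neon"
    )) {
        "neon"
    } else {
        // No supported instruction set was enabled at compile time.
        "none"
    }
}

fn main() {
    println!("compiled SIMD target: {}", compiled_simd_target());
}
```

If this prints "none" on a machine that does support SIMD, the build was most likely not configured with native CPU support.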

allocator

For best performance, we highly suggest using mimalloc or jemalloc instead of the default system allocator. Another recent allocator that works well (but that we have yet to test in a production setting) is snmalloc.
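Rust selects the allocator through the #[global_allocator] attribute. The sketch below shows that mechanism using only the standard library: CountingAlloc is a hypothetical stand-in that wraps the system allocator and counts allocations. With the mimalloc crate as a dependency, the same attribute would be applied to a static of type mimalloc::MiMalloc instead.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in allocator: forwards to the system allocator
/// and counts how many allocations were requested.
pub struct CountingAlloc;

pub static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// This attribute is the switch: every heap allocation in the program
// now goes through GLOBAL instead of the default allocator.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;
```

The allocator is chosen once per binary, so this belongs in the application crate rather than in a library.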

serde

simd-json is compatible with serde and serde-json. The Value types provided implement serializers and deserializers. In addition, simd-json implements the Deserializer trait for the parser, so it can deserialize anything that implements the serde Deserialize trait. Note that serde provides both a Deserializer and a Deserialize trait.

The serde support is contained in the serde_impl feature, which is part of the default feature set of simd-json but can be disabled.

known-key

The known-key feature changes the hash mechanism for the DOM representation of the underlying JSON object from ahash to fxhash. The ahash hasher is faster at hashing and protects against DoS attacks that work by forcing multiple keys into a single hashing bucket. The fxhash hasher, on the other hand, produces repeatable hashing results, which allows memoizing hashes for well-known keys and saves time on lookups. In workloads that frequently access a few well-known keys, this can be a performance advantage.

The known-key feature is optional, disabled by default, and must be explicitly enabled.
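The idea of memoizing a hash can be sketched with the standard library alone. In the example below, the type MemoizedKey and the helper hash_with are hypothetical, and std's DefaultHasher with fixed keys stands in for a repeatable hasher such as fxhash; this is not simd-json's implementation.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

/// Hashes `key` with whatever hasher `state` builds.
fn hash_with<S: BuildHasher>(state: &S, key: &str) -> u64 {
    let mut hasher = state.build_hasher();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Stand-in for a repeatable hasher: no per-instance random seed.
type Deterministic = BuildHasherDefault<DefaultHasher>;

/// Hypothetical well-known key whose hash is computed once and reused.
struct MemoizedKey {
    key: &'static str,
    hash: u64,
}

impl MemoizedKey {
    fn new(key: &'static str) -> Self {
        let hash = hash_with(&Deterministic::default(), key);
        MemoizedKey { key, hash }
    }
}

fn main() {
    let id = MemoizedKey::new("id");
    // A repeatable hasher yields the same value every time, so the
    // stored hash stays valid for any map using the same hasher.
    assert_eq!(id.hash, hash_with(&Deterministic::default(), id.key));

    // A randomly seeded hasher gives each instance its own seed, so a
    // precomputed hash cannot be reused across maps.
    let a = hash_with(&RandomState::new(), id.key);
    let b = hash_with(&RandomState::new(), id.key);
    println!("seeded hashes: {a:x} vs {b:x}");
}
```

The trade-off matches the description above: repeatability enables the shortcut, while random seeding is what makes bucket-collision attacks hard.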

serializing

simd-json is not capable of serializing JSON data as there would be very little gain in re-implementing it. For serialization, we typically rely on serde-json.

For DOM values we provide convenience methods for serialization.

For struct values we defer to external serde-compatible serialization mechanisms.

unsafe

simd-json uses a lot of unsafe code.

There are a few reasons for this:

  • SIMD intrinsics are inherently unsafe. These uses of unsafe are inescapable in a library such as simd-json.
  • We work around some performance bottlenecks imposed by safe Rust. These are avoidable, but at a cost to performance. This is a more considered path in simd-json.
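To make the first point concrete, here is a minimal sketch of the kind of code involved: it marks the positions of quote characters in a 16-byte chunk using SSE2 intrinsics from std::arch, with a scalar fallback on other architectures. The function quote_mask is hypothetical and only illustrates why unsafe is needed; it is not taken from simd-json.

```rust
/// Returns a bitmask with bit i set when chunk[i] is a double quote.
/// Hypothetical example; SSE2 is part of the x86_64 baseline.
#[cfg(target_arch = "x86_64")]
fn quote_mask(chunk: &[u8; 16]) -> u16 {
    use std::arch::x86_64::{
        __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_set1_epi8,
    };
    // The raw-pointer load is what forces `unsafe`: the compiler cannot
    // check that the pointer is valid for 16 bytes, so we must.
    unsafe {
        let bytes = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let quotes = _mm_set1_epi8(b'"' as i8);
        _mm_movemask_epi8(_mm_cmpeq_epi8(bytes, quotes)) as u16
    }
}

/// Scalar fallback with the same result on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn quote_mask(chunk: &[u8; 16]) -> u16 {
    let mut mask = 0u16;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'"' {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    let chunk = *b"{\"a\":\"b\"}       ";
    println!("{:016b}", quote_mask(&chunk));
}
```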

simd-json goes through extra scrutiny for unsafe code. These steps are:

  • Unit tests - to test 'the obvious' cases, edge cases, and regression cases
  • Structural constructive property based testing - We generate random valid JSON objects to exercise the full simd-json codebase stochastically. Floats are currently excluded since slightly different parsing algorithms lead to slightly different results here. In short: "is simd-json correct?"
  • Data-oriented property based testing of string-like data - to assert that sequences of legal printable characters don't panic or crash the parser (they may, and often do, produce errors, since they are not valid JSON!)
  • Destructive Property based testing - make sure that no illegal byte sequences crash the parser in any way
  • Fuzzing (using American Fuzzy Lop - afl) - fuzz based on upstream simd pass/fail cases
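The destructive property test from the list above can be sketched without any external crates. Everything in this example is an assumption for illustration: count_panics, the small xorshift generator, and the utf8_only stand-in parser are hypothetical and not part of simd-json's test suite, which relies on dedicated property-testing and fuzzing tools.

```rust
use std::panic;

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Feeds `rounds` random byte buffers to `parse` and returns how many
/// of them made it panic. Returning an error is fine; panicking is not.
fn count_panics<F>(parse: F, rounds: usize, max_len: usize, seed: u64) -> usize
where
    F: Fn(&mut [u8]) -> Result<(), String>,
{
    let mut rng = XorShift64(seed.max(1)); // xorshift must not start at 0
    let mut panics = 0;
    for _ in 0..rounds {
        let len = (rng.next_u64() as usize) % (max_len + 1);
        let mut input: Vec<u8> = (0..len).map(|_| rng.next_u64() as u8).collect();
        let outcome = panic::catch_unwind(panic::AssertUnwindSafe(|| {
            let _ = parse(&mut input);
        }));
        if outcome.is_err() {
            panics += 1;
        }
    }
    panics
}

/// Stand-in "parser": accepts valid UTF-8 and rejects everything else.
fn utf8_only(input: &mut [u8]) -> Result<(), String> {
    std::str::from_utf8(input).map(|_| ()).map_err(|e| e.to_string())
}

fn main() {
    let panics = count_panics(utf8_only, 1_000, 64, 42);
    println!("panics observed: {panics}");
}
```

A real harness would also shrink failing inputs and keep them as regression cases, which is what property-testing libraries provide.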

This doesn't ensure complete safety, nor is it a bulletproof guarantee, but it does go a long way toward asserting that the library is production quality and fit for purpose for practical industrial applications.

Other interesting things

There are also bindings for upstream simdjson available here

License

simd-json itself is licensed under either of

However, it ports a lot of code from simdjson, so their work and copyright on that should be respected alongside.

The serde integration is based on their example and serde-json, so their copyright should be respected as well.

Comments

  • TechEmpower - FrameworkBenchmarks

    Apr 13, 2020

    It might be worth considering adding simd-json to the TechEmpower benchmarks:

    https://github.com/TechEmpower/FrameworkBenchmarks/blob/cfbbf5d11143c96851620eb1aeaf3c2893e862c9/frameworks/Rust/actix/src/main.rs#L23

    To make this really worthwhile #121 would be good.

    suggested by @pickfire

    easy good first issue help wanted 
  • Parsing newline separated JSON is cumbersome

    Apr 15, 2020

    In my experience, most large JSON files where using SIMD decoding would make sense come in a newline separated form. Oftentimes they are additionally stored in a compressed form and only stream-decompressed for parsing, e.g. using unix piping such as lz4 -d < big.json | myapp, allowing for decompression to occur on a second CPU core and parsing in a way that is both memory and disk IO efficient.

    Unfortunately, this kind of parsing is not at all straightforward to do with simd-json. The usual no-copy BufRead::lines() workflow is killed by the fact that Lines yields immutable &strs while simd-json requires mutable ones. I couldn't find any documentation on why this is the case, but I assume that simd-json temporarily patches bytes for some of the SIMD magic to work. Using BufRead::read_line results in unnecessary copying of the line and manual \n suffix stripping, being both cumbersome and slower than just using serde-json (in my absolutely non-scientific test run).

    I feel like it would be great if this lib could also provide a SIMD accelerated lines_mut, which would increase this library's usability immensely.

    It is also very much possible that there is an obvious way to make this work which I just failed to see.

    enhancement 
  • Implement number parsing from simdjson v0.3.1

    Apr 24, 2020

    This includes accurate and consistent float parsing. Float parsing is generally handled by the fast path but falls back to lexical-core.

    This passes number tests from JSONTestSuite except n_multidigit_number_then_00.json because of the trailing padding.

  • graviton2 benchmark numbers

    May 19, 2020

    looks like graviton2 is available!

    https://aws.amazon.com/blogs/aws/new-m6g-ec2-instances-powered-by-arm-based-aws-graviton2/

    It would be cool to benchmark & compare w/ upstream performance...

    easy enhancement 
  • Parsing fixed size integers

    May 27, 2020

    Not sure if this is useful but using SIMD to parse integers looks interesting. Or maybe this shouldn't be in the crate either.

    https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html

  • Investigate rapidjson's float parsing and printing

    Jun 13, 2020

    See: https://github.com/serde-rs/json-benchmark/issues/15#issuecomment-643569709

    rapidjson is very fast when it comes to printing float numbers, we should find out why and see if we can adopt their logic.

  • What is the behaviour of the fallback mode ?

    Jun 1, 2020

    What exactly does this crate do in fallback mode ?

  • 128 bit numbers

    Dec 4, 2019

    See #59

    todo:

    • [x] bench
    • [x] implement 128bit parsing
    • [x] remove 128bit as default flag
  • Use simd-lite

    Aug 30, 2019

    Use simd-lite for ARM support.

  • 0.2 work

    Oct 23, 2019

    This is a work branch for the 0.2 release where we can introduce breaking changes for things we do not like in 0.1.

    So far:

    • [x] Add U64 type
    • [x] Remove deprecated functions
    • [x] Box objects to reduce enum size
    • [x] Add object and array access to value trait for convenience
    • [x] Arch alignment
  • ARM NEON support

    Jul 30, 2019

  • RFC: Neon support (pretty much working)

    Aug 7, 2019

    Hello hello!

    I have been pulling some of your Neon intrinsics and porting the simdjson neon code. Maybe it's useful! I'll keep improving it... Comments welcome anytime!

    All the best,

    -Sunny
