
suffix

Fast linear time & space suffix arrays for Rust. Supports Unicode!


Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/suffix

If you just want the details on the construction algorithm used, see the documentation for the SuffixTable type. This is where you'll find info on exactly how much overhead is required.

Installation

This crate works with Cargo and is on crates.io. The package is regularly updated. Add it to your Cargo.toml like so:

[dependencies]
suffix = "1.2"

Examples

Usage is simple. Just create a suffix array and search:

use suffix::SuffixTable;

fn main() {
    // Construction happens once, in linear time and space.
    let st = SuffixTable::new("the quick brown fox was quick.");
    // `positions` returns the byte offset of every occurrence of the query.
    assert_eq!(st.positions("quick"), &[4, 24]);
}
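
There is also a contains method for pure existence checks (the "contains" benchmarks below exercise it), which can be cheaper than finding every position:

use suffix::SuffixTable;

fn main() {
    let st = SuffixTable::new("the quick brown fox was quick.");
    // `contains` only answers whether the query occurs at all.
    assert!(st.contains("quick"));
    assert!(!st.contains("slow"));
}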

There is also a command line program, stree, that can be used to visualize suffix trees:

git clone git://github.com/BurntSushi/suffix
cd suffix/stree_cmd
cargo build --release
./target/release/stree "banana" | dot -Tpng | xv -

And here's what it looks like:

"banana" suffix tree

Status of implementation

The big thing missing at the moment is a generalized suffix array. I started out with the intention to build them into the construction algorithm, but this has proved more difficult than I thought.

A kind-of-sort-of compromise is to append your distinct texts together, and separate them with a character that doesn't appear in your document. (This is technically incorrect, but maybe your documents don't contain any NUL characters.) During construction of this one giant string, you should record the offsets of where each document starts and stops. Then build a SuffixTable with your giant string. After searching with the SuffixTable, you can find the original document by doing a binary search on your list of documents.
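
A minimal sketch of that workaround, assuming none of your documents contain NUL:

use suffix::SuffixTable;

fn main() {
    let docs = ["the quick brown fox", "jumped over the lazy dog"];

    // Join the documents with a NUL separator, recording where each one starts.
    let mut starts = Vec::with_capacity(docs.len());
    let mut haystack = String::new();
    for doc in &docs {
        starts.push(haystack.len() as u32);
        haystack.push_str(doc);
        haystack.push('\x00');
    }

    let st = SuffixTable::new(haystack);
    for &pos in st.positions("the") {
        // Binary search on the recorded offsets maps a match position back
        // to the document that contains it.
        let doc = match starts.binary_search(&pos) {
            Ok(i) => i,
            Err(i) => i - 1,
        };
        println!("match at byte {} in document {}", pos, doc);
    }
}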

I'm currently experimenting with different techniques to do this.

Benchmarks

Here are some very rough benchmarks that compare suffix table searching with searching using the standard library functions. Note that these benchmarks explicitly do not include construction of the suffix table. The premise of a suffix table is that you can afford to build it once, in exchange for much faster queries thereafter.

test search_scan_exists_many            ... bench:       2,964 ns/iter (+/- 180)
test search_scan_exists_one             ... bench:          19 ns/iter (+/- 1)
test search_scan_not_exists             ... bench:      84,645 ns/iter (+/- 3,558)
test search_suffix_exists_many          ... bench:         228 ns/iter (+/- 65)
test search_suffix_exists_many_contains ... bench:         102 ns/iter (+/- 10)
test search_suffix_exists_one           ... bench:         162 ns/iter (+/- 13)
test search_suffix_exists_one_contains  ... bench:           8 ns/iter (+/- 0)
test search_suffix_not_exists           ... bench:         177 ns/iter (+/- 21)
test search_suffix_not_exists_contains  ... bench:          50 ns/iter (+/- 6)

The "many" benchmarks test repeated queries that match. The "one" benchmarks test a single query that matches. The "not_exists" benchmarks test a single query that does not match. Finally, the "contains" benchmark test existence rather finding all positions.

One thing you might take away from here is that you'll get a very large performance boost if many of your queries don't match. A linear scan takes a long time to fail!

And here are some completely useless benchmarks on suffix array construction. They compare the linear time algorithm with the naive construction algorithm (sort all suffixes with a comparison sort, which is O(n^2 * logn)).

test naive_dna_medium                   ... bench:  22,307,313 ns/iter (+/- 939,557)
test naive_dna_small                    ... bench:   1,785,734 ns/iter (+/- 43,401)
test naive_small                        ... bench:         228 ns/iter (+/- 10)
test sais_dna_medium                    ... bench:   7,514,327 ns/iter (+/- 280,544)
test sais_dna_small                     ... bench:     712,938 ns/iter (+/- 34,730)
test sais_small                         ... bench:       1,038 ns/iter (+/- 58)
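
For reference, the naive construction measured by the naive_* benchmarks is essentially the following sketch (the crate's own benchmark code may differ in details):

// Naive suffix array construction: sort every suffix start position by the
// suffix it points at. A comparison sort does O(n log n) comparisons, and
// each comparison can inspect O(n) bytes, giving O(n^2 * log n) overall.
fn naive_table(text: &str) -> Vec<u32> {
    let bytes = text.as_bytes();
    let mut table: Vec<u32> = (0..bytes.len() as u32).collect();
    table.sort_by_key(|&i| &bytes[i as usize..]);
    table
}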

These benchmarks might make you say, "Whoa, the special algorithm isn't that much faster." That's because the data just isn't big enough. And when it is big enough, a micro benchmark is useless. Why? Because using the naive algorithm will just burn your CPUs until the end of time.

It would be more useful to compare this to other suffix array implementations, but I haven't had time yet. Moreover, most (all?) don't support Unicode and instead operate on bytes, which means they aren't paying the overhead of decoding UTF-8.

Comments

  • Fix invariant check in from_parts.

    Jun 10, 2016

    The suffix array table has the same length as the text in bytes. The invariant check in from_parts, however, checks that the suffix array size is the same as the number of characters. This invariant holds when a string only contains ASCII characters, but it fails otherwise.

    Modify the check to use the length of the string in bytes.
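
    In other words, the check should compare against the byte length. A hypothetical sketch (check_invariant is illustrative, not the crate's actual function):

    // The suffix array has one entry per *byte* of the text, not per `char`.
    fn check_invariant(text: &str, table: &[u32]) {
        // Wrong: holds only for pure-ASCII text.
        // assert_eq!(text.chars().count(), table.len());

        // Right: the table is indexed by byte offset.
        assert_eq!(text.len(), table.len());
    }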

  • Ownership of the table

    Jun 10, 2016

    I had a question, or perhaps suggestion, but I am new to Rust, so I might be missing something: the SuffixTable struct stores the text as Cow, but the actual table/array as a vector that is always owned. I wondered if you considered making the table Cow as well.

    The immediate benefit that I see is that one could write the constructed table to disk and then later rebuild the SuffixTable from a memory-mapped table via from_parts.

    I made a quick modification to use Cow for the table member and all unit tests pass.
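
    The suggestion amounts to something like this sketch of the type (not necessarily the crate's exact definition):

    use std::borrow::Cow;

    // With two lifetime parameters, the text and the table can each be either
    // owned or borrowed, e.g. a table backed by a memory-mapped file.
    pub struct SuffixTable<'s, 't> {
        text: Cow<'s, str>,
        table: Cow<'t, [u32]>,
    }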

  • Make the table member of SuffixTable COW.

    Jun 10, 2016

    Ok, here is a first attempt :).

  • approaching 1.0

    Jul 31, 2016

    This crate enjoys a small API and has had some minor breaking changes over its lifetime, but given its modest functionality, I don't foresee a major API refactoring in its near future. Therefore, I'd like to move this to 1.0.

    When I initially built this crate, I did have a grander vision for building a more complete suffix array implementation. In particular, while this crate provides a nearly optimal construction algorithm implementation, it does not provide an optimal search algorithm implementation. Search should be implementable in O(p+logn) time (where p ~ len(query) and n ~ len(haystack)), but is currently O(p*logn). Improving the bound requires some sophisticated implementation tricks that have proved difficult to extract from the literature. I offered a bounty on a StackOverflow question and got an answer, but I haven't digested it yet. Nevertheless, a plain suffix table is plenty useful in its own right, so a 1.0 release shouldn't be blocked on further improvements even if it requires a rethink of the public API.
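
    For intuition, the current O(p*logn) search is essentially two binary searches, where each probe may re-compare up to p bytes of the query; the O(p+logn) variant avoids that re-comparison using precomputed LCP information. A rough sketch of the current strategy (not the crate's exact code):

    // Find all suffixes that start with `query`. There are O(log n) probes
    // and each probe compares up to p = query.len() bytes: O(p * log n).
    fn positions<'t>(text: &str, table: &'t [u32], query: &str) -> &'t [u32] {
        let (text, query) = (text.as_bytes(), query.as_bytes());
        // First suffix that is lexicographically >= query.
        let lo = table.partition_point(|&i| &text[i as usize..] < query);
        // Suffixes prefixed by `query` form a contiguous block starting at `lo`.
        let hi = lo
            + table[lo..].partition_point(|&i| text[i as usize..].starts_with(query));
        &table[lo..hi]
    }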

  • Performance

    Oct 3, 2018

    Hi,

    I needed a high-performance suffix array library some time ago for a bioinformatics project, and I ended up working with a crude FFI binding to divsufsort.

    Just so you know, its performance while creating the array blows suffix out of the water. So feel free to take a glance if it interests you, or to close this issue otherwise ;)

  • Recommendations for dictionary of strings with wildcards?

    Aug 20, 2017

    I am implementing an MQTT broker. After connecting, a client subscribes to one or more topics, which might contain one of two wildcard types.
    A regular topic might look like foo/bar/baz/blah.
    A # wildcard matches the entirety of the remaining string, while a + wildcard matches only a single path segment, e.g. foo/bar/+/blah. When a message is published to a topic, the broker finds all matching subscriptions and delivers the message accordingly. This is a bit of a twist on the traditional string search problem, because the wildcards are in the dictionary instead of the query string.

    It is expected that the broker should be able to handle 10,000 connections, with each connection subscribing to an average of 20 topics. It is also expected that there would be a lot of duplication across connections.

    I can think of one approach, though I'm not thrilled by it: build the table using only the path segments themselves, and perform successive searches using the path segments of the query. Then I could use the resulting indices into the table as keys into a map of subscribers.

    It would still be a bit tricky, because it would require an n-gram search for the resulting subscribers.

    e.g. subscribers:

    A is at pizza/+/meats/sausage
    C is at pizza/toppings/fruits/tomatoes
    D is at pizza/#

    Naively**, the dictionary is

    pizza/+/meats/sausage$pizza/toppings/fruits/tomatoes$pizza/#
    012345678901234567890123456789012345678901234567890123456789
    

    Say a message is published at pizza/toppings/fruits/tomatoes. A match at 0, 24, 31, 36 would find subscriber A; a match at 0 would find subscriber D.

    ** A big optimization here is that the dictionary doesn't need to duplicate path segments, so we could ensure they're unique. Given the nature of MQTT subscriptions, this would be a big win.

    I'm thinking that might be too much effort spent wedging the suffix table approach in where it doesn't fit, and maybe I'm better off building a suffix tree where I can explicitly create and handle wildcard nodes. A bitmap index also seems promising.

    Even simpler, but perhaps less performant, would be to just build a massive Regex out of all the subscription strings (this might work for 100 subscriptions, but would get unwieldy at 1000+, which is certainly in the realm of possibility); see the sketch below.
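
    To make that idea concrete, here is a hypothetical sketch (it assumes the regex crate; filter_to_regex is illustrative, not an existing API) that translates each filter into a pattern and matches a topic against all of them at once with RegexSet:

    // Hypothetical helper: translate one MQTT filter into a regex pattern.
    fn filter_to_regex(filter: &str) -> String {
        let mut pat = String::from("^");
        for (i, seg) in filter.split('/').enumerate() {
            if i > 0 {
                pat.push('/');
            }
            match seg {
                // `+` matches exactly one path segment.
                "+" => pat.push_str("[^/]+"),
                // `#` must be last and swallows everything that remains.
                // (Real MQTT `#` also matches the parent level itself.)
                "#" => {
                    pat.push_str(".*");
                    break;
                }
                _ => pat.push_str(&regex::escape(seg)),
            }
        }
        pat.push('$');
        pat
    }

    fn main() {
        let filters = ["pizza/+/meats/sausage", "pizza/#"];
        let patterns: Vec<String> = filters.iter().map(|f| filter_to_regex(f)).collect();
        // RegexSet matches one topic against all subscriptions in a single pass.
        let set = regex::RegexSet::new(&patterns).unwrap();
        let matched: Vec<usize> = set.matches("pizza/toppings/fruits/tomatoes").iter().collect();
        assert_eq!(matched, vec![1]); // only the `pizza/#` filter matches
    }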

    p.s., I just found http://blog.burntsushi.net/transducers/ which I am attempting to digest.
