Rust-Whatlang rs: whatlang-rs — Natural language detection library based on trigrams

Whatlang


Natural language detection for Rust with focus on simplicity and performance.

Features

  • Supports 84 languages
  • 100% written in Rust
  • Lightweight, fast and simple
  • Recognizes not only the language, but also the script (Latin, Cyrillic, etc.)
  • Provides reliability information

Get started

Add to your Cargo.toml:

[dependencies]
whatlang = "0.7.2"

Example:

extern crate whatlang;

use whatlang::{detect, Lang, Script};

fn main() {
    let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";

    let info = detect(text).unwrap();
    assert_eq!(info.lang(), Lang::Epo);
    assert_eq!(info.script(), Script::Latin);
    assert_eq!(info.confidence(), 1.0);
    assert!(info.is_reliable());
}

For more details (e.g. how to blacklist some languages) please check the documentation.

Requirements

The latest whatlang library works with Rust 1.31.0 or higher.

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, which are a particular case of n-grams. To understand the idea, please check the original paper by Cavnar and Trenkle ('94), "N-Gram-Based Text Categorization".
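To make the trigram idea concrete, here is a minimal sketch of the rank-order ("out-of-place") comparison described by Cavnar and Trenkle. It is not whatlang's actual implementation: the function names (trigram_counts, ranked_profile, out_of_place_distance) are hypothetical, and the tiny "language profiles" are built from single sentences purely for illustration, whereas a real detector would use profiles trained on large corpora.

```rust
use std::collections::HashMap;

/// Count overlapping character trigrams in `text` (lowercased, padded with spaces).
fn trigram_counts(text: &str) -> HashMap<String, usize> {
    let padded = format!(" {} ", text.to_lowercase());
    let chars: Vec<char> = padded.chars().collect();
    let mut counts = HashMap::new();
    for window in chars.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }
    counts
}

/// Trigrams ordered from most to least frequent.
/// Ties are broken alphabetically so the result is deterministic.
fn ranked_profile(text: &str) -> Vec<String> {
    let mut entries: Vec<(String, usize)> = trigram_counts(text).into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries.into_iter().map(|(trigram, _)| trigram).collect()
}

/// Sum, over the trigrams of the text, of how far each trigram's rank is from
/// its rank in the language profile. A trigram missing from the language
/// profile gets the maximum penalty (the profile length).
fn out_of_place_distance(text_profile: &[String], lang_profile: &[String]) -> usize {
    let max_penalty = lang_profile.len();
    text_profile
        .iter()
        .enumerate()
        .map(|(text_rank, trigram)| {
            match lang_profile.iter().position(|t| t == trigram) {
                Some(lang_rank) => text_rank.max(lang_rank) - text_rank.min(lang_rank),
                None => max_penalty,
            }
        })
        .sum()
}

fn main() {
    // Toy profiles (assumption: one sentence per language, for illustration only).
    let english = ranked_profile("the quick brown fox jumps over the lazy dog and then the other dog");
    let spanish = ranked_profile("el rápido zorro marrón salta sobre el perro perezoso y luego el otro perro");

    let sample = ranked_profile("the dog and the fox");
    let d_en = out_of_place_distance(&sample, &english);
    let d_es = out_of_place_distance(&sample, &spanish);

    // The language with the smallest distance is the best guess.
    println!("distance to English profile: {}", d_en);
    println!("distance to Spanish profile: {}", d_es);
}
```

The language whose profile yields the smallest distance is taken as the detected one; the gap between the best and the second-best candidate is what a reliability measure can build on.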

How is is_reliable calculated?

It is based on the following factors:

  • How many unique trigrams are in the given text
  • How big the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, it can be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola, and it looks like the following:

[Image: Language recognition whatlang rust]

For more details, please check the blog article "Introduction to Rust Whatlang Library and Natural Language Identification Algorithms".
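The reliability threshold described above can be sketched as follows. This is only an illustration under stated assumptions: the constants SCALE and FLOOR, the exact definition of rate as a relative score difference, and the function names are all hypothetical and are not confirmed by this document; check the whatlang source for the real values.

```rust
// Assumed constants for illustration; the real values live in the whatlang source.
const SCALE: f64 = 3.0;
const FLOOR: f64 = 0.015;

/// Relative gap between the best and the second-best language scores.
/// Assumes both scores are positive.
fn rate(best_score: f64, second_score: f64) -> f64 {
    (best_score - second_score) / second_score
}

/// Minimum rate required for a text with `unique_trigrams` unique trigrams.
/// `SCALE / n + FLOOR` is a hyperbola in `n`: short texts need a large gap,
/// long texts need only a small one.
fn required_rate(unique_trigrams: usize) -> f64 {
    SCALE / unique_trigrams as f64 + FLOOR
}

/// A point (unique_trigrams, rate) is "Reliable" if it lies above the hyperbola.
fn is_reliable(unique_trigrams: usize, rate: f64) -> bool {
    unique_trigrams > 0 && rate > required_rate(unique_trigrams)
}

fn main() {
    let r = rate(120.0, 100.0);
    for &n in &[5usize, 20, 100] {
        println!(
            "unique trigrams = {:>3}, rate = {:.3}, required = {:.3}, reliable = {}",
            n,
            r,
            required_rate(n),
            is_reliable(n, r)
        );
    }
}
```

With this shape, the same rate can be unreliable for a short text and reliable for a long one, which matches the two factors listed above.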

Running benchmarks

This is mostly useful to test performance optimizations.

cargo bench

Ports and clones

Derivation

Whatlang is a derivative work of Franc (JavaScript, MIT) by Titus Wormer.

License

MIT © Sergey Potapov

Contributors

Comments

  • Add comparison CLD2

    Feb 10, 2017

    It would be interesting to know how this library compares to cld2 https://github.com/CLD2Owners/cld2.

  • Generate language list without Tera

    Jan 29, 2019

    Tera is pulling quite a lot of dependencies, and is only used to generate one file during build. I removed it and replaced it with a pure Rust implementation. When cargo testing, I went from 132 dependencies to 96.

    I can understand if you refuse this change, as it may be less readable than the current version, but it makes build times shorter (I'm using whatlang in a project with 400 dependencies, and it takes 20 minutes to build, so if I could avoid Tera and its dependencies it would be great).

    Another solution would be to generate this file once and for all, as it is probably not updated very often.

  • Add support for Latin

    Apr 6, 2020

    Example test taken from Cicero: https://www.thelatinlibrary.com/cicero/sex.rosc.shtml

  • Failed to compile

    May 1, 2020

        |
    259 |     Lat = 84,
        |     --- not covered
        |
        = help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
        = note: the matched value is of type `whatlang::Lang`
    

    Using the Dockerfile from https://github.com/valeriansaliou/sonic/blob/master/Dockerfile (same result locally on a MacBook Pro with Homebrew).

  • Implemented Lang::from_code() via procedural macros

    Feb 11, 2017

    The to_code() / from_code() implementation looked a bit weird to me, and I wanted to get some experience in Rust programming, so I've made this pull request :)

    The pattern matching in from_code() is now generated via procedural macros. Docs: https://doc.rust-lang.org/book/procedural-macros.html

    I wanted to make the from_string() function universal, so that EnumFromString could be applied to any enum. That's why there is from_string() in the EnumFromString trait and from_code() in the concrete implementation for the Lang enum.

    I also wanted to move EnumFromString into the whatlang-derive library, but proc-macro libraries can only export functions for now:

    error: proc-macro crate types cannot export any items other than functions tagged with #[proc_macro_derive] currently
    

    as for benchmarks:

    before change:

    running 2 tests
    test bench_detect        ... bench:  29,613,380 ns/iter (+/- 307,791)
    test bench_detect_script ... bench:     227,694 ns/iter (+/- 689)
    

    after change:

    running 2 tests
    test bench_detect        ... bench:  29,496,420 ns/iter (+/- 387,359)
    test bench_detect_script ... bench:     224,283 ns/iter (+/- 1,048)
    

    So the performance is almost unchanged.

  • New API

    Feb 6, 2017

    Here is a proposal for a new API. Its main advantage is that it's easier to use: the desired result can be obtained in one line, without losing the ability to pass additional options (whitelist/blacklist).

    let text = "Bla bla bla";
    let whitelist = [Lang::Epo, Lang::Spa];
    
    // get Option<Result>
    let result = whatlang::new(text).detect();
    
    // get only language, Option<Lang>
    let lang = whatlang::new(text).detect_lang();
    
    // detect only script, Option<Script>
    let script = whatlang::new(text).detect_script();
    
    // with whitelist specified (same syntax for blacklist)
    let result = whatlang::new(text).whitelist(&whitelist).detect();
    
  • Upgrade Hashbrown dependency

    Feb 19, 2020

    test detect::tests::test_detect_lang_ukrainian ... ok
    test detect::tests::test_detect_with_options_with_whitelist ... ok
    test detect::tests::test_detect_with_options_with_whitelist_mandarin_japanese ... ok
    test detect::tests::test_detect_with_options_with_blacklist_mandarin_japanese ... ok
    test detect::tests::test_detect_with_options_with_blacklist_none ... ok
    test lang::tests::test_code ... ok
    test detector::tests::test_detect_script ... ok
    test lang::tests::test_from_code ... ok
    test lang::tests::test_name ... ok
    test script::tests::test_detect_script ... ok
    test lang::tests::test_eng_name ... ok
    test script::tests::test_is_bengali ... ok
    test detect::tests::test_detect_spanish ... ok
    test script::tests::test_is_ethiopic ... ok
    test script::tests::test_is_cyrillic ... ok
    test script::tests::test_is_georgian ... ok
    test detector::tests::test_detect ... ok
    test script::tests::test_is_gurmukhi ... ok
    test script::tests::test_is_greek ... ok
    test script::tests::test_is_hangul ... ok
    test script::tests::test_is_gujarati ... ok
    test script::tests::test_is_oriya ... ok
    test script::tests::test_is_kannada ... ok
    test script::tests::test_is_katakana ... ok
    test script::tests::test_is_latin ... ok
    test script::tests::test_is_hiragana ... ok
    test script::tests::test_is_tamil ... ok
    test script::tests::test_is_telugu ... ok
    test script::tests::test_script_name ... ok
    test detector::tests::test_detect_lang ... ok
    test detect::tests::test_detect_with_options_with_blacklist ... ok
    test trigrams::tests::test_count ... ok
    test script::tests::test_is_thai ... ok
    test trigrams::tests::test_get_trigrams_with_positions ... ok
    test trigrams::tests::test_to_trigram_char ... ok
    test utils::tests::test_is_top_char ... ok
    test detect::tests::test_detect_with_random_text ... ok
    
    test result: ok. 37 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
         Running target/debug/deps/detect-e62082f7a21f59b1
    
    running 2 tests
    test test_with_russian_text ... ok
    test test_with_multiple_examples ... ok
    
    test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
         Running target/debug/deps/proptests-b01aa78ab877cbd6
    
    running 1 test
    test proptest_detect_does_not_crash ... ok
    
    test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
       Doc-tests whatlang
    
    running 11 tests
    test src/lang.rs - lang::Lang::eng_name (line 5032) ... ok
    test src/lang.rs - lang::Lang::name (line 5021) ... ok
    test src/detector.rs - detector::Detector (line 13) ... ok
    test src/lang.rs - lang::Lang::code (line 5010) ... ok
    test src/detector.rs - detector::Detector (line 24) ... ok
    test src/lang.rs - lang::Lang::from_code (line 4999) ... ok
    test src/detect.rs - detect::detect (line 13) ... ok
    test src/detect.rs - detect::detect_lang (line 27) ... ok
    test src/lib.rs -  (line 24) ... ok
    test src/script.rs - script::detect_script (line 77) ... ok
    test src/lib.rs -  (line 9) ... ok
    
    test result: ok. 11 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    