Rust-Runiq v1.2.0: runiq — an efficient way to filter duplicate lines from unsorted input.

Latest Release: v1.2.0

runiq

This project offers an efficient way (in both time and space) to filter duplicate entries (lines) from textual input. It was born from neek, but is optimized for both speed and memory. Several filtering options are supported, depending on your data and the tradeoffs you wish to make between speed and memory usage. For a more detailed explanation, see the relevant blog post.

Installation

This tool is available via Crates.io, so you can install it directly with cargo:

$ cargo install runiq

If you'd rather just grab a pre-built binary, you might be able to download the correct binary for your architecture directly from the latest release on GitHub. The list of binaries may not be complete, so please file an issue if your setup is missing (bonus points if you attach the appropriate binary).

Examples

$ cat << EOF >> input.txt
> this is a unique line
> this is a duplicate line
> this is another unique line
> this is a duplicate line
> this is a duplicate line
> EOF

$ cat input.txt
this is a unique line
this is a duplicate line
this is another unique line
this is a duplicate line
this is a duplicate line

$ runiq input.txt
this is a unique line
this is a duplicate line
this is another unique line

Comparisons

Here are some comparisons of runiq against other methods of filtering uniques:

Tool     Flags        Time Taken    Peak Memory
neek     N/A          55.8s         313MB
sort     -u           595s          9.07GB
uq       N/A          32.3s         1.66GB
runiq    -f digest    17.8s         64.6MB
runiq    -f naive     26.3s         1.62GB
runiq    -f bloom     36.8s         13MB

The numbers above are based on filtering unique values out of the following file:

File size:     3,290,971,321 (~3.29GB)
Line count:        5,784,383
Unique count:      2,715,727
Duplicates:        3,068,656
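
To illustrate the kind of tradeoff the table shows, here is a minimal sketch of a digest-style filter. It is an assumption about how such a filter could work, not runiq's actual implementation: the names `digest` and `unique_by_digest` are hypothetical, and the standard library's `DefaultHasher` stands in for whatever hash a real tool would choose. Only a 64-bit hash of each line is remembered, so memory grows with the number of unique lines rather than with their length.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Reduce a line to a fixed-size 64-bit digest (hypothetical helper).
fn digest(line: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    hasher.finish()
}

/// Return the lines seen for the first time, in input order.
/// Only the digests are stored, never the lines themselves. A hash
/// collision would wrongly drop a line; that is the price of the saving.
fn unique_by_digest<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<u64> = HashSet::new();
    lines
        .iter()
        .copied()
        // `insert` returns true only when the digest was not present yet.
        .filter(|line| seen.insert(digest(line.as_bytes())))
        .collect()
}
```

A naive variant would store each full line in the set instead, which avoids collisions at the cost of keeping every unique line in memory.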

Comments

  • Benchmarking suite

    Oct 12, 2019

    While everyone can cook up their own files, it would be nice to have a uniform way of generating data for proper comparison in the repo. Crates I found that could help: test-data-generation or regex_generate.
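
As a rough idea of what a reproducible generator could look like, the sketch below uses only the standard library rather than the crates mentioned above. Everything in it is hypothetical (the `Lcg` type, `generate_lines`, the `line-N` format); it simply draws lines from a fixed pool of values so that the duplicate rate can be controlled through the pool size.

```rust
/// Tiny linear congruential generator, so the data is reproducible
/// from a seed without pulling in an external crate.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        // Common 64-bit LCG constants; the high bits are the better ones.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Generate `total` lines drawn from at most `pool` distinct values.
/// A smaller pool means more duplicates in the output.
fn generate_lines(total: usize, pool: u64, seed: u64) -> Vec<String> {
    assert!(pool > 0, "pool must contain at least one value");
    let mut rng = Lcg(seed);
    (0..total)
        .map(|_| format!("line-{}", rng.next_u64() % pool))
        .collect()
}
```

Because the same seed always yields the same lines, two tools can be compared on identical input without checking a large file into the repository.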

  • Support piped stdin

    May 22, 2020

    It is very common to pipe the output of cat or find to a filtering program. It would be great if runiq could be used as a cross-platform drop-in replacement for uniq.
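
One way to make this possible is to write the filtering loop against any buffered reader, so that a file and standard input are handled identically. The following is a sketch under that assumption; `filter_unique` is a hypothetical name and this is not how runiq itself is structured.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead, Write};

/// Copy each line from `input` to `output` the first time it is seen.
/// `input` can be a file, a byte slice, or a locked stdin handle.
fn filter_unique<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<()> {
    let mut seen: HashSet<String> = HashSet::new();
    for line in input.lines() {
        let line = line?; // propagates I/O and UTF-8 errors to the caller
        if !seen.contains(&line) {
            writeln!(output, "{}", line)?;
            seen.insert(line);
        }
    }
    Ok(())
}
```

With this shape, piped input would be handled by calling `filter_unique(io::stdin().lock(), io::stdout().lock())` when no file argument is given.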

  • Feature Request: More unique uniqueness flag

    Feb 26, 2021

    As it stands, both runiq and runiq --invert always include a single instance of each value that exists more than once within the inputs. There is not, however, an option to completely omit values that occur more than once. I would like to see some sort of '--no-duped' flag (the name is open for debate), probably mutually exclusive with --invert, that filters out all occurrences of data with duplicates, rather than the current default behavior of leaving a single instance. Example:

    $ cat fileA
    a1
    b7
    c1
    d3
    $ cat fileB
    a7
    b3
    d8
    c1
    d3
    

    With the current behavior, runiq fileA fileB would produce:

    a1
    b7
    c1
    d3
    a7
    b3
    d8
    

    runiq --no-duped fileA fileB would then produce:

    a1
    b7
    a7
    b3
    d8
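
The requested behavior can be sketched as a two-pass filter: count every line first, then keep only the lines whose count is exactly one. The function name `only_singletons` is hypothetical, and the sketch assumes the whole input fits in memory, since no line can be emitted until all input has been seen.

```rust
use std::collections::HashMap;

/// Keep only the lines that occur exactly once across the whole input,
/// preserving their original order.
fn only_singletons<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    // First pass: count occurrences of every line.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in lines {
        *counts.entry(*line).or_insert(0) += 1;
    }
    // Second pass: emit lines that were never repeated.
    lines
        .iter()
        .copied()
        .filter(|line| counts[line] == 1)
        .collect()
}
```

Unlike the default mode, this cannot stream its output, which may be a reason to keep it as a separate opt-in flag.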
    
  • No list of available filters

    Jun 8, 2021

    Apart from looking at the code or at the Comparisons section in the README, there is no way to find out which filters are available and which one is the default. I would expect runiq --help to show this.
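
A sketch of how such a listing could be generated from a single enum, so that the help text cannot drift from the code. The filter names are taken from the Comparisons table; the one-line summaries, the `Filter` type, and `filter_help` are all hypothetical, and which filter is the default is left as a parameter because the source does not say.

```rust
/// Filter names from the Comparisons table. The summaries below are
/// illustrative guesses, not documentation of runiq itself.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Filter {
    Digest,
    Naive,
    Bloom,
}

impl Filter {
    const ALL: [Filter; 3] = [Filter::Digest, Filter::Naive, Filter::Bloom];

    fn name(self) -> &'static str {
        match self {
            Filter::Digest => "digest",
            Filter::Naive => "naive",
            Filter::Bloom => "bloom",
        }
    }

    fn summary(self) -> &'static str {
        match self {
            Filter::Digest => "remembers a hash of each line",
            Filter::Naive => "remembers each line in full",
            Filter::Bloom => "probabilistic membership test",
        }
    }
}

/// Build the help section, marking whichever filter is the default.
fn filter_help(default: Filter) -> String {
    let mut help = String::from("FILTERS (-f <name>):\n");
    for filter in Filter::ALL.iter().copied() {
        let marker = if filter == default { " (default)" } else { "" };
        help.push_str(&format!(
            "    {:<8}{}{}\n",
            filter.name(),
            filter.summary(),
            marker
        ));
    }
    help
}
```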

  • Create Bash completion script

    Jul 7, 2021

    Add tab-completions for Bash.

  • runiq as a library?

    Aug 2, 2021

    Hi,

    I've been using runiq as a library to be able to provide multiple filter implementations depending on context, and while forking the repository I noticed the following lines in the main.rs file:

    //! Runiq is only built as a command line tool, although it may be
    //! distributed as a core crate if the backing implementation becomes
    //! interesting for other use cases.
    

    While I'm currently happy with my solution, I wonder if it would benefit people to integrate changes in the project so that runiq could be used both as a deduplication tool and library.

  • Add a --count option

    Nov 28, 2019

                                                                                                                                                                                                           
  • Fix a few small lint errors

    Feb 5, 2019

    This is a fix for two small clippy lint errors, as well as a small typo in a comment.

    It also hoists the eol variable from #3 out into a constant.

  • Don't allocate/process strings; just work on the bytes

    Oct 18, 2018

    Much faster
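
The idea can be sketched with `BufRead::read_until`, which fills a reusable byte buffer instead of allocating and validating a `String` per line. This is an illustration of the technique rather than the actual patch; `unique_byte_lines` is a hypothetical name.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead};

/// Collect the first occurrence of each line as raw bytes.
/// No UTF-8 validation happens, so any byte sequence is accepted.
fn unique_byte_lines<R: BufRead>(mut input: R) -> io::Result<Vec<Vec<u8>>> {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut unique = Vec::new();
    let mut buf = Vec::new(); // reused for every line read
    loop {
        buf.clear();
        if input.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        if buf.last() == Some(&b'\n') {
            buf.pop(); // drop the line terminator
        }
        // Look up by slice first; clone only when the line is new.
        if !seen.contains(buf.as_slice()) {
            seen.insert(buf.clone());
            unique.push(buf.clone());
        }
    }
    Ok(unique)
}
```

Duplicate lines never cause an allocation here, since the buffer is only cloned when a line has not been seen before.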

                                                                                                                                                                                                           
  • Add a -c (count) flag?

    Nov 20, 2019

    Would it be possible to add a -c flag to output a count of each unique line, like uniq has? A significant proportion of my usage of uniq is in the form of sort | uniq -c | sort -n, and being able to use runiq to replace that initial pair of commands would be really nice.
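
A counting mode along these lines could be sketched as follows. Unlike `sort | uniq -c`, no sorting is needed: a map from each line to its position in the result keeps the first-seen order. `count_unique` is a hypothetical name, and how runiq would format the counts is left open.

```rust
use std::collections::HashMap;

/// Return `(count, line)` pairs in the order each line first appeared.
fn count_unique<'a>(lines: &[&'a str]) -> Vec<(usize, &'a str)> {
    // Maps each line to its slot in `counts`.
    let mut index: HashMap<&'a str, usize> = HashMap::new();
    let mut counts: Vec<(usize, &'a str)> = Vec::new();
    for &line in lines {
        let next = counts.len();
        let slot = *index.entry(line).or_insert(next);
        if slot == next {
            counts.push((0, line)); // first time this line is seen
        }
        counts[slot].0 += 1;
    }
    counts
}
```

Sorting the returned pairs by their first field would then cover the trailing `sort -n` step of the pipeline.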

  • Index out of bounds

    Oct 4, 2019

    Hello,

    I downloaded runiq's code and compiled it myself for Mac, and I'm getting the following panic:

    thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 18446744073709551615', /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/libcore/slice/mod.rs:2715:10
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
    

    When running runiq on the following text:

    Montrer les messages depuis: Tous les messages1 Jour7 Jours2 Semaines1 Mois3 Mois6 Mois1 An Le plus ancien en premierLe plus récent en premier
    Sauter vers: Sélectionner un forum----------------==Réglement== Règlement du forum==Général== Présentez-Vous Demandes de skin Administratif Demande de Partenariat Boite a Idées pour le Forum==téléchargements GTS== camions scania vovlo Daf Man Renault Mercedes-benz Iveco Camions tandem Autres camion Remorques Frigo Container Plateau bachée Citerne Benne autres remorque Maps Mods aide==Partie photo== Photo de ETS Photo de GTS Photo de ET Photo de UK Photo d'autre jeux==Discussion== Discussion de ETS/GTS... Espace Blabla==Aide== Demande d'aide Demande de TutoriauxPartie GTA San Andreas GTA Modding Les Projets de Nos Membres Constructions BrianK Disscussions Diverses sur Tous Les Grand Theft Auto Off-Topic--Autres
    Index | forum gratuit | Forum gratuit d’entraide | Annuaire des forums gratuits | Signaler une violation | Conditions générales d'utilisation
    
    Les lois concernant l'utilisation d'un logiciel varient d'un pays à l'autre. Nous n'encourageons pas l'utilisation de ce logiciel s'il est en violation avec l'une de ces lois.
    
    Blog hébergé par CanalBlog | Plan du site | Blog Cuisine et Gastronomie créé le 07/05/2006 | Contacter l'auteur | Signaler un abus
    
    telecharger cv en ligne gratuit a word mettre mon gratuitement telechargement,telecharger modele cv en ligne mettre mon pole emploi gratuitement,format word comment mettre mon cv en ligne sur pole emploi indeed,faire un cv en ligne et le telecharger gratuitement gratuit mettre mon pole emploi,mettre mon cv en ligne sur pole emploi word indeed ou,mettre mon cv en ligne sur indeed comment pole emploi telecharger word,a ement telecharger cv en ligne gratuit telechargement mettre mon pole emploi,telecharger mon cv en ligne word curriculum vitae mettre pole emploi modele,mettre mon cv en ligne sur pole emploi telecharger comment a format word,mettre mon cv en ligne gratuitement web comment sur pole emploi.
    
    Vous êtes à : Accueil > Label > Galerie des sites Web labellisés - Aucune correspondance pour ces critères
    

    The problem disappears if I remove the newlines. The problem is also not present if I install runiq via cargo install runiq.
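
The index in the panic, 18446744073709551615, is `usize::MAX` on a 64-bit target, which is what an unsigned subtraction such as `1 - 2` wraps to when overflow checks are disabled. One plausible reading, and it is only an assumption since the source does not confirm the cause, is that an empty line (a buffer holding just `"\n"`, length 1) reaches code that indexes relative to the end of the buffer. The sketch below shows a terminator-stripping helper, hypothetically named `strip_eol`, that avoids index arithmetic altogether.

```rust
/// Strip one trailing "\n" or "\r\n" from a line without computing
/// `len - 1` or `len - 2`, so empty and one-byte buffers are safe.
fn strip_eol(line: &[u8]) -> &[u8] {
    match line.strip_suffix(b"\n") {
        // Only look for "\r" when a "\n" was actually removed.
        Some(rest) => rest.strip_suffix(b"\r").unwrap_or(rest),
        None => line,
    }
}
```

This would also be consistent with the observation that the panic goes away once the blank lines are removed from the input.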

    I want to download the code, because I want to introduce a simple modification to ignore empty lines, so that I can preserve new lines between paragraphs. Thus the vanilla version of runiq won't work for me.

    Thank you in advance for your help.

  • Special characters crash the program

    Jun 11, 2018

    thread 'main' panicked at 'called Result::unwrap() on an Err value: Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }', libcore\result.rs:945:5
    note: Run with RUST_BACKTRACE=1 for a backtrace.

    caused by a "ü" character on the next line
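
This error can be reproduced with the standard library alone. The sketch assumes the input was in a single-byte encoding such as Latin-1, where "ü" is the byte 0xFC; the report does not state the encoding. Reading lines as `String` rejects that byte, whereas a correctly encoded UTF-8 "ü" is accepted. The helper name `first_line_error` is hypothetical.

```rust
use std::io::{self, BufRead};

/// Return the kind of error, if any, produced when the first line of
/// `input` is read as a `String`.
fn first_line_error(input: &[u8]) -> Option<io::ErrorKind> {
    // `lines()` validates UTF-8 and yields `Err` on invalid bytes.
    BufRead::lines(input)
        .next()
        .and_then(|result| result.err())
        .map(|error| error.kind())
}
```

Calling `unwrap()` on such a result is what turns the invalid byte into a panic; propagating the error, or treating lines as bytes, would avoid the crash.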

    Labels: bug, enhancement