Rust-Chomp: chomp – A fast monadic-style parser combinator

Chomp

Chomp is a fast monadic-style parser combinator library designed to work on stable Rust. It was written as the culmination of the experiments detailed in a series of blog posts.

For its current capabilities, you will find that Chomp performs consistently as well as, if not better than, optimized C parsers, while being vastly more expressive. For an example that builds a performant HTTP parser out of smaller parsers, see http_parser.rs.

Installation

Add the following line to the dependencies section of your Cargo.toml:

[dependencies]
chomp = "0.3.1"

Usage

Parsers are functions from a slice over an input type Input<I> to a ParseResult<I, T, E>, which may be thought of as either a success resulting in type T, an error of type E, or a partially completed result which may still consume more input of type I.

The input type is almost never manually manipulated. Rather, one uses parsers from Chomp by invoking the parse! macro. This macro was designed intentionally to be as close as possible to Haskell's do-syntax or F#'s "computation expressions", which are used to sequence monadic computations. At a very high level, usage of this macro allows one to declaratively:

  • Sequence parsers, while short circuiting the rest of the parser if any step fails.
  • Bind previous successful results to be used later in the computation.
  • Return a composite data structure using the previous results at the end of the computation.

In other words, just as a normal Rust function usually looks something like this:

fn f() -> (u8, u8, u8) {
    let a = read_digit();
    let b = read_digit();
    launch_missiles();
    return (a, b, a + b);
}

A Chomp parser with a similar structure looks like this:

fn f<I: U8Input>(i: I) -> SimpleResult<I, (u8, u8, u8)> {
    parse!{i;
        let a = digit();
        let b = digit();
                string(b"missiles");
        ret (a, b, a + b)
    }
}

And to implement read_digit we can utilize the map function to manipulate any success value while preserving any error or incomplete state:

// Standard rust, no error handling:
fn read_digit() -> u8 {
    let mut s = String::new();
    std::io::stdin().read_line(&mut s).unwrap();
    s.trim().parse().unwrap()
}

// Chomp, error handling built in, and we make sure we only get a number:
fn read_digit<I: U8Input>(i: I) -> SimpleResult<I, u8> {
    satisfy(i, |c| b'0' <= c && c <= b'9').map(|c| c - b'0')
}
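
To make the sequencing, binding, and short-circuiting concrete without depending on Chomp itself, here is a minimal sketch that uses only the standard library. The `PResult` alias and the `digit`, `literal`, and `f` functions are hypothetical stand-ins, not Chomp's API; the `?` operator plays roughly the role that `parse!` plays in the examples above.

```rust
// A toy model of parser sequencing using only the standard library.
// None of these names come from Chomp; they are illustrative stand-ins.

/// On success: the remaining input plus the parsed value. On failure: a message.
type PResult<'a, T> = Result<(&'a [u8], T), &'static str>;

/// Parses one ASCII digit and maps it to its numeric value.
fn digit(i: &[u8]) -> PResult<'_, u8> {
    match i.split_first() {
        Some((&c, rest)) if c.is_ascii_digit() => Ok((rest, c - b'0')),
        _ => Err("expected digit"),
    }
}

/// Matches an exact byte string and discards it.
fn literal<'a>(i: &'a [u8], s: &[u8]) -> PResult<'a, ()> {
    if i.starts_with(s) {
        Ok((&i[s.len()..], ()))
    } else {
        Err("expected literal")
    }
}

/// Sequences three parsers; `?` short-circuits on the first failure,
/// and earlier results stay bound for use in the final value.
fn f(i: &[u8]) -> PResult<'_, (u8, u8, u8)> {
    let (i, a) = digit(i)?;
    let (i, b) = digit(i)?;
    let (i, ()) = literal(i, b"missiles")?;
    Ok((i, (a, b, a + b)))
}
```

With this model, `f(b"12missiles!")` succeeds with `(1, 2, 3)` and leaves `b"!"` as remaining input, while `f(b"1xmissiles")` stops at the second digit and never reaches the literal.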

For more documentation, see the rust-doc output.

Example

#[macro_use]
extern crate chomp;

use chomp::prelude::*;

#[derive(Debug, Eq, PartialEq)]
struct Name<B: Buffer> {
    first: B,
    last:  B,
}

fn name<I: U8Input>(i: I) -> SimpleResult<I, Name<I::Buffer>> {
    parse!{i;
        let first = take_while1(|c| c != b' ');
                    token(b' ');  // skipping this char
        let last  = take_while1(|c| c != b'\n');

        ret Name{
            first: first,
            last:  last,
        }
    }
}

assert_eq!(parse_only(name, "Martin Wernstål\n".as_bytes()), Ok(Name{
    first: &b"Martin"[..],
    last: "Wernstål".as_bytes()
}));

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Contact

File an issue here on GitHub or visit gitter.im/m4rw3r/chomp.

Comments

  • Accessing numbering::InputPosition::position via map_err

    Oct 26, 2016

    I have a use case where I'd like to pass numbering::InputPosition::position to an Error type as a way of reporting parsing errors at a location (e.g. line/column location).

    The issue is that I'm unable to access numbering::InputPosition::position from within the chomp::types::ParseResult::map_err function.

    I adapted map_err into map_err2 as follows: https://github.com/dashed/chomp/commit/3f1998b1d06394ed5b1a8c765230371b6a4f4533

    This enables me to do this:

    type ESParseResult<I, T> = ParseResult<I, T, ParseError>;
    
    fn some_parser<I: U8Input>(i: InputPosition<I, CurrentPosition>)
        -> ESParseResult<InputPosition<I, CurrentPosition>, ()> {
        parse!{i;
    
            let _var = (i -> {
                string(i, b"var")
                    .map_err2(|_, i| {
                        let loc = i.position();
                        ParseError::Expected(loc, "Expected var here.")
                    })
            });
    
            // ...
    
            ret {()}
        }
    }
    

    I'd love to hear any feedback on this, especially any better alternative approaches.


    Appendix

    CurrentPosition type for reference:

    #[derive(Debug, Copy, Clone, PartialEq, Eq, Ord, PartialOrd, Hash)]
    pub struct CurrentPosition(
        // The current line, zero-indexed.
        u64,
        // The current col, zero-indexed.
        u64
    );
    
    impl CurrentPosition {
        // Creates a new (line, col) counter with zero.
        pub fn new() -> Self {
            CurrentPosition(0, 0)
        }
    }
    
    impl Numbering for CurrentPosition {
        type Token  = u8;
    
        fn update<'a, B>(&mut self, b: &'a B)
            where B: Buffer<Token=Self::Token> {
                b.iterate(|c| if c == b'\n' {
                    self.0 += 1; // line num
                    self.1 = 0;  // col num
                } else {
                    self.1 += 1; // col num
                });
        }
    
        fn add(&mut self, t: Self::Token) {
            if t == b'\n' {
                self.0 += 1; // line num
                self.1 = 0;  // col num
            } else {
                self.1 += 1; // col num
            }
        }
    }
    
    // map_err2 is a method on ParseResult (it matches on `self` as a ParseResult):
    impl<I: Input, T, E> ParseResult<I, T, E> {
    
        // ...
    
        #[inline]
        pub fn map_err2<V, F>(self, f: F) -> ParseResult<I, T, V>
          where F: FnOnce(E, &I) -> V {
            match self {
                ParseResult(i, Ok(t))  => ParseResult(i, Ok(t)),
                ParseResult(i, Err(e)) => {
                    let err = f(e, &i);
                    ParseResult(i, Err(err))
                },
            }
        }
    
        // ...
    }
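
The idea in this issue can be tried out without Chomp. The following std-only sketch pairs a line/column-counting input with an error-mapping helper whose closure also receives the input. `Pos`, `Positioned`, `Outcome`, `map_err_with_input`, and `literal` are all hypothetical names chosen for illustration; only the overall shape mirrors the `map_err2` code above.

```rust
// Std-only sketch: an input wrapper that counts lines/columns, and an
// error-mapping helper that can see the input. All names are hypothetical.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Pos {
    line: u64, // zero-indexed
    col: u64,  // zero-indexed
}

#[derive(Debug, Clone, Copy)]
struct Positioned<'a> {
    rest: &'a [u8],
    pos: Pos,
}

impl<'a> Positioned<'a> {
    fn new(rest: &'a [u8]) -> Self {
        Positioned { rest, pos: Pos { line: 0, col: 0 } }
    }

    /// Consumes `n` bytes, updating the line/column counters as it goes.
    fn advance(mut self, n: usize) -> Self {
        for &c in &self.rest[..n] {
            if c == b'\n' {
                self.pos.line += 1;
                self.pos.col = 0;
            } else {
                self.pos.col += 1;
            }
        }
        self.rest = &self.rest[n..];
        self
    }
}

#[derive(Debug, PartialEq)]
enum ParseError {
    Expected(Pos, &'static str),
}

/// An (input, result) pair, mirroring the shape of the ParseResult above.
struct Outcome<'a, T, E>(Positioned<'a>, Result<T, E>);

impl<'a, T, E> Outcome<'a, T, E> {
    /// Like `map_err`, but the closure also receives the input at the
    /// point of failure, so it can read the current position.
    fn map_err_with_input<V, F>(self, f: F) -> Outcome<'a, T, V>
    where
        F: FnOnce(E, &Positioned<'a>) -> V,
    {
        match self {
            Outcome(i, Ok(t)) => Outcome(i, Ok(t)),
            Outcome(i, Err(e)) => {
                let v = f(e, &i);
                Outcome(i, Err(v))
            }
        }
    }
}

/// Matches an exact byte string, leaving the input untouched on failure.
fn literal<'a>(i: Positioned<'a>, s: &[u8]) -> Outcome<'a, (), ()> {
    if i.rest.starts_with(s) {
        Outcome(i.advance(s.len()), Ok(()))
    } else {
        Outcome(i, Err(()))
    }
}
```

Because the input is handed to the closure by reference and then moved back into the result, no cloning is needed; whether the same holds for every Chomp input type is not something this sketch establishes.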
    
    combinator enhancement 
  • Improve parse! macro documentation

    May 26, 2017

    Currently the parse! macro documentation does not detail exactly what operators like <* expand to in terms of normal code. Having access to this is useful to debug certain issues which can arise in macro usage.

  • string parser (and possibly others internally using consume_while) force unnecessary stream reads

    Jul 16, 2017

    problem

    the chomp::parsers::string parser (and possibly others internally using consume_while) might force unnecessary stream reads. example code:

    #[macro_use]
    extern crate chomp;
    
    use chomp::prelude::*;
    use chomp::buffer::{Source, Stream};
    
    use std::net::TcpStream;
    
    
    fn main() {
        let tcp = TcpStream::connect("faumail.fau.de:143").unwrap();
        let mut src = Source::new(tcp);
    
        // IMAP lines end in b"\r\n", so the real text is everything up to b'\r',
        // but we have to read the line ending nonetheless before reading any future stuff
        let p = src.parse(parser!{take_till(|c| c == b'\r') <* string(b"\r\n")});
        println!("{:?}", p);
    }
    

    expected output: Ok(<some bytes from the imap server welcome line>)

    actual output: Err(Retry)

    cause

    the string parser (src/parsers.rs:378) uses consume_while(f), which first reads the next token from the input stream and only then inspects it (using f) to decide whether to consume it. note that this is not a bug in consume_while; it's perfectly fine, expected behaviour. the problem with using it for string(s) is that once len(s) tokens have been consumed we could return successfully, but consume_while still waits for the next token to call its decider function on (which then determines that len(s) tokens have already been read and tells consume_while to quit). in some cases this forces a read on the underlying stream when the answer is already clear.

    solution

    i wrote a (very hackish) fix for the string parser at https://github.com/dario23/chomp/tree/fix_string but (without having checked in depth) i'm expecting more parsers to be affected. probably a more exhaustive fix would include adding consume_while_max_n(f, usize).

    i'd be happy to propose changes and submit a PR, but only after hearing your opinion on the matter :-)
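
The over-read described in this issue can be modelled with the standard library alone. The sketch below is not Chomp's implementation: `Counting`, `match_peek_first`, and `match_bounded` are hypothetical names, and the model only counts how many tokens are fetched from the source (it does not model putting a token back).

```rust
// Std-only model of the over-read; not Chomp's actual implementation.
// `Counting` wraps any byte iterator and records how many tokens were pulled.

struct Counting<I> {
    inner: I,
    pulls: usize,
}

impl<I: Iterator<Item = u8>> Counting<I> {
    fn new(inner: I) -> Self {
        Counting { inner, pulls: 0 }
    }

    /// Fetches the next token; on a real stream this is where a read may block.
    fn pull(&mut self) -> Option<u8> {
        self.pulls += 1;
        self.inner.next()
    }
}

/// consume_while style: fetch a token first, then ask the predicate about it.
/// Even once `s` is fully matched, one more token is fetched before stopping.
fn match_peek_first<I: Iterator<Item = u8>>(src: &mut Counting<I>, s: &[u8]) -> bool {
    let mut n = 0;
    loop {
        match src.pull() {
            // Predicate: continue only while still inside `s` and matching.
            Some(c) if n < s.len() && c == s[n] => n += 1,
            _ => return n == s.len(),
        }
    }
}

/// Bounded style (the `consume_while_max_n` idea): stop after `s.len()` tokens.
fn match_bounded<I: Iterator<Item = u8>>(src: &mut Counting<I>, s: &[u8]) -> bool {
    for &expected in s {
        if src.pull() != Some(expected) {
            return false;
        }
    }
    true
}
```

Matching `b"\r\n"` against a source holding exactly those two bytes takes three pulls with the first function and two with the second. On a socket, that third pull is the one that would wait for data the parser does not need.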

  • Question: What is the idiomatic way of parsing a string from a byte slice?

    Aug 8, 2017

    In other words, is there a concise way to map an arbitrary Result to a chomp error?

    In nom, I would do the following:

    map_res!(
    	take_until_and_consume!("\n"),
    	str::from_utf8
    )
    

    How would I do that with chomp?
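
As a library-independent illustration of what is being asked for, the sketch below takes bytes up to a newline and lifts the `Result` of `str::from_utf8` into the parser's own error type with `map_err`. `LineError` and `line` are hypothetical names; this is plain Rust, not Chomp's or nom's API, and it does not claim to be the idiomatic Chomp answer.

```rust
use std::str;

#[derive(Debug, PartialEq)]
enum LineError {
    /// No `\n` was found in the input.
    MissingNewline,
    /// The bytes before `\n` were not valid UTF-8.
    Utf8(str::Utf8Error),
}

/// Takes everything up to `\n`, consumes the newline, and validates the
/// line as UTF-8. Returns the remaining input alongside the `&str`.
fn line(i: &[u8]) -> Result<(&[u8], &str), LineError> {
    let end = i
        .iter()
        .position(|&c| c == b'\n')
        .ok_or(LineError::MissingNewline)?;
    // `map_err` lifts the foreign error type into the parser's error type.
    let text = str::from_utf8(&i[..end]).map_err(LineError::Utf8)?;
    Ok((&i[end + 1..], text))
}
```

The key step is the `map_err` call: any function returning a `Result` can be slotted into a parser this way as long as its error can be wrapped in, or converted to, the parser's error type.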

  • Fix string parser

    Oct 31, 2017

    this is just one part of the story (as described in #67), but it seems to me that one piece of the puzzle would improve the situation already. would you be willing to accept more pull requests/a bigger chunk of changes to other parser functions?

  • Do not reset the slice pointer when whole input is consumed

    Aug 2, 2018

    This property is useful e.g. when calculating how many bytes of a buffer were consumed without actually explicitly keeping track of it.
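
A small std-only sketch of the calculation this property enables: if the remaining slice always points into the original buffer, the number of consumed bytes falls out of the two start addresses. The `consumed` helper is a hypothetical name, not part of Chomp.

```rust
/// Number of bytes consumed, given the original buffer and the remaining
/// slice. `rest` is assumed to be a suffix of `buf`.
fn consumed(buf: &[u8], rest: &[u8]) -> usize {
    let start = buf.as_ptr() as usize;
    let here = rest.as_ptr() as usize;
    // This holds for an empty `rest` too, but only if it still points at
    // the end of `buf` rather than at some unrelated empty slice.
    debug_assert!(here >= start && here + rest.len() == start + buf.len());
    here - start
}
```

If an empty remainder were replaced by an unrelated empty slice, the subtraction would be meaningless, which is why keeping the pointer matters once the whole input has been consumed.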

  • run_scanner state can't depend on last token

    Jan 6, 2016

    I'm trying to parse one utf8 character. I tried run_scanner and std::char::from_u32, but it doesn't work because when I get a whole character, the way to signal it is to return None, which throws away the state.
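
One way to sidestep the problem, sketched here with the standard library only, is to decide the character's length from its first byte, so that no scanner state has to survive the final token. `utf8_len` and `one_char` are hypothetical helpers, not Chomp functions, and this is an alternative approach rather than a fix for `run_scanner`.

```rust
use std::str;

/// Expected length of a UTF-8 sequence, judged from its first byte.
fn utf8_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None, // continuation byte or invalid lead byte
    }
}

/// Decodes exactly one character, returning the remaining input with it.
fn one_char(i: &[u8]) -> Option<(&[u8], char)> {
    let n = utf8_len(*i.first()?)?;
    // `get` returns None if fewer than `n` bytes are available.
    let bytes = i.get(..n)?;
    // from_utf8 rejects overlong forms, surrogates, and bad continuation bytes.
    let c = str::from_utf8(bytes).ok()?.chars().next()?;
    Some((&i[n..], c))
}
```

In a streaming setting the "fewer than `n` bytes" case would need to be reported as incomplete rather than as a plain failure; the sketch folds both into `None` for brevity.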

  • Remove type parameter default on functions and methods

    Jan 13, 2016

    See https://github.com/rust-lang/rust/pull/30724 and https://gist.github.com/nikomatsakis/760c6a67698bd24253bf

    These are warnings in the nightly.

  • How do I examine success/fail?

    Feb 29, 2016

    let parse_result = parse!{i;
    ..
    };
    
    // I now have to execute some Rust code to see what parser I should call next.
    let input2: Input<'i, u8> = match parse_result.into_inner() {
        // stuck here.
        // Ok(o) => o,
        // Err(e) => return parse_result
    };

    I'm a bit lost walking through the types. I simply want to continue with an Input, or return the parse_result.
    Any help would be appreciated.
    Thanks!
    
  • Infinite loop?

    Feb 23, 2016

    skip_many() and many() do not seem to be propagating the incomplete state. Or maybe the or combinator is always resetting the stream position and not propagating the error?

    I expect the flow to be:

    • skip_many(all)
    • all OR tests b() and c() - both fail
    • all returns fail
    • skip_many returns fail <-- this does not happen ... infinite loop ...

    Ideas?

    Thanks!

    i == "fffff".as_bytes(); // will never match any token...
    parse!{i;
        skip_many(all);
        ...
    
    pub fn c<'a>(i: Input<'a, u8>) -> U8Result<()> {
        parse!{i;
            take_while(is_whitespace);
            ret () } }
    
    pub fn b<'i, 'a>(i: Input<'i, u8>, s: &'a str) -> U8Result<'i, ()> {
        parse!{i;
            take_while(is_whitespace);
            ret () } }
    
    pub fn all<'a>(i: Input<'a, u8>) -> U8Result<()> {
        let s = String::new();
        parse!{i;
                b(&s) <|>
                c();
            ret () } }
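
One possible reading of the code above, offered as an assumption rather than a confirmed diagnosis: both b and c consist of a zero-or-more whitespace match, so they succeed on "fffff" without consuming anything, and a repetition combinator wrapped around a parser that succeeds on empty input never sees a failure. The std-only sketch below models that situation; `whitespace0`, `whitespace1`, and this `skip_many` are hypothetical stand-ins, not Chomp's functions.

```rust
/// A parser here returns the number of bytes it consumed, or None on failure.
type Parser = fn(&[u8]) -> Option<usize>;

/// Zero-or-more whitespace: succeeds even when nothing matches.
fn whitespace0(i: &[u8]) -> Option<usize> {
    Some(i.iter().take_while(|c| c.is_ascii_whitespace()).count())
}

/// One-or-more whitespace: fails when nothing matches.
fn whitespace1(i: &[u8]) -> Option<usize> {
    match whitespace0(i) {
        Some(0) => None,
        n => n,
    }
}

/// Repeats `p` until it fails. Returns Err if `p` succeeds without
/// consuming, which is the case that would otherwise loop forever.
fn skip_many(mut i: &[u8], p: Parser) -> Result<&[u8], &'static str> {
    while let Some(n) = p(i) {
        if n == 0 {
            return Err("parser succeeded without consuming input");
        }
        i = &i[n..];
    }
    Ok(i)
}
```

In this model, `skip_many(b"fffff", whitespace0)` trips the guard, while `skip_many(b"fffff", whitespace1)` stops immediately and returns the input untouched. Under the same assumption, using a one-or-more variant inside the repeated parser would be the corresponding change in the original code.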
  • Is there a way to get current position?

    Feb 7, 2016

    Hi! I'm wondering if it would be possible to add a function that could provide the current position in the file (or stream)?

    In my case, I'm parsing from a file and would like to capture the line number in particular.

    I haven't had a chance to dig through the code much yet, but if I were to take a stab at adding it, I'd definitely appreciate a few pointers! I'm guessing it would have to bubble up from the buffer...

    enhancement 
  • Make `Input` a trait

    Mar 16, 2016

    Problem

    Currently the input type only allows for slices, and is special cased for situations where it may not be the whole of the input. I cannot provide any line/row/offset counting either since it is a concrete type and an extension with that functionality would impact all code.

    This would provide a way to slot in position-aware wrappers to solve #38 neatly.

    Proposed solution

    Convert Input<I> into a trait with ret and err as provided methods; the input-token type would be the associated type Token. All the primitive methods (currently provided by InputClone and InputBuffer) are also present, but they require an instance of the zero-sized type Guard, which cannot be instantiated outside of the primitives module (note the private field). The primitives would be reachable through methods on a Primitives trait which has to be imported separately (the blanket implementation for all Input types makes it easy to use once it is in scope).

    use primitives::Guard;
    pub use primitives::Primitives;
    
    pub trait Input: Sized {
        type Token;
        type Marker;
    
        fn ret<T>(self, t: T) -> ParseResult<Self, T> {
            ParseResult(self, t)
        }
    
        fn _consume(self, usize, Guard)        -> Self;
        fn _buffer(&self, Guard)               -> &[Self::Token];
        fn _is_end(&self, Guard)               -> bool;
        fn _mark(&self, Guard)                 -> Self::Marker;
        fn _restore(self, Self::Marker, Guard) -> Self;
    }
    
    pub mod primitives {
        use Input;
    
        pub struct Guard(());
    
        pub trait Primitives: Input {
            // Arguments are passed in the order declared on Input:
            // the value first, the Guard last.
            fn consume(self, n: usize) -> Self {
                self._consume(n, Guard(()))
            }
            fn buffer(&self) -> &[Self::Token] {
                self._buffer(Guard(()))
            }
            fn is_end(&self) -> bool {
                self._is_end(Guard(()))
            }
            fn mark(&self) -> Self::Marker {
                self._mark(Guard(()))
            }
            fn restore(self, m: Self::Marker) -> Self {
                self._restore(m, Guard(()))
            }
        }
    
        impl<I: Input> Primitives for I {}
    }
    

    The mark method is the replacement for InputClone, it should be used with the restore method to restore the state of the Input to the old one.
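
To show how mark and restore would be used together, here is a compilable, std-only reduction of the proposal with an implementation for plain slices. It is a sketch under assumptions: `ret`, `err`, and `_is_end` are left out, parameter names are added, and choosing the slice itself as the `Marker` is an illustrative choice, not something the proposal specifies.

```rust
mod primitives {
    /// Zero-sized token; the private field means only this module can build one.
    pub struct Guard(());

    pub trait Input: Sized {
        type Token;
        type Marker;

        fn _consume(self, n: usize, g: Guard) -> Self;
        fn _buffer(&self, g: Guard) -> &[Self::Token];
        fn _mark(&self, g: Guard) -> Self::Marker;
        fn _restore(self, m: Self::Marker, g: Guard) -> Self;
    }

    /// Public entry points; the blanket impl makes them available on every
    /// Input once this trait is in scope.
    pub trait Primitives: Input {
        fn consume(self, n: usize) -> Self {
            self._consume(n, Guard(()))
        }
        fn buffer(&self) -> &[Self::Token] {
            self._buffer(Guard(()))
        }
        fn mark(&self) -> Self::Marker {
            self._mark(Guard(()))
        }
        fn restore(self, m: Self::Marker) -> Self {
            self._restore(m, Guard(()))
        }
    }

    impl<I: Input> Primitives for I {}

    /// For a plain slice, the marker is simply a copy of the slice itself.
    impl<'a, T> Input for &'a [T] {
        type Token = T;
        type Marker = &'a [T];

        fn _consume(self, n: usize, _: Guard) -> Self {
            &self[n..]
        }
        fn _buffer(&self, _: Guard) -> &[T] {
            *self
        }
        fn _mark(&self, _: Guard) -> Self::Marker {
            *self
        }
        fn _restore(self, _m_source: Self::Marker, _: Guard) -> Self {
            _m_source
        }
    }
}

use primitives::Primitives;
```

A combinator that needs to backtrack would call `mark()` before trying a branch and `restore(m)` if the branch fails. Code outside the `primitives` module cannot call the underscore-prefixed methods directly because it has no way to construct a `Guard`.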

    Pros

    • Input can be implemented directly for slices, eliminating certain branches from parsers and combinators like many, take_while, eof and so on.
    • An Input implementation can be provided for line-counting which could be slotted in to provide line-counting in any existing parsers
    • The mark and restore methods would provide mechanisms allowing types which do not wholly consist of slices to work, though the buffer method is probably not the right choice for that; it will need a change to support e.g. ropes.
    • All parsers need to be generic. Before, we could get away with only concrete types since Input<u8> is a concrete type; Input<Token=u8> will not be a concrete type.

    Cons

    • Parser function signature change, very backwards incompatible:

      // old
      fn my_parser<'a, I>(i: Input<'a, I>, ...) -> ParseResult<'a, I, T, E>
      // old, lifetime elision:
      fn my_parser<I>(i: Input<I>, ...) -> ParseResult<I, T, E>
      // new
      fn my_parser<I: Input>(i: I, ...) -> ParseResult<I, T, E>
      
    • The type I: Input can no longer be guaranteed to be linear since the #[must_use] annotation cannot be put on the concrete type.

      This is probably not an issue in practice since the I type is required by value to create a ParseResult and the ParseResult in turn is ultimately required by the functions which start the parsing.

    enhancement 