Rust-Url scraper: Rust crate for scraping URLs from HTML pages.

url-scraper

Rust crate for scraping URLs from HTML pages.

Example

extern crate url_scraper;
use url_scraper::UrlScraper;

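fn main() {
    // Page whose links should be listed.
    let directory = "http://phoronix.com/";

    // unwrap() panics if the scraper cannot be created for this page.
    let scraper = UrlScraper::new(directory).unwrap();
    // Each item is a (link text, URL) pair.
    for (text, url) in scraper.into_iter() {
        println!("{}: {}", text, url);
    }
}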

Comments

  • Support Async I/O with Futures

    Dec 10, 2018

    The crate currently exposes only a synchronous API. An asynchronous API built around reqwest::async::Client would be a useful addition.

    Labels: enhancement, good first issue, help wanted
  • Cannot Build: Error Compiling CSS Parser

    May 5, 2020

    I copied the example from the url_scraper documentation exactly, but the program fails to build.

    I think the error comes from the cssparser dependency. Any ideas on how to fix it?

    Error:

       Compiling cssparser v0.24.1
    error[E0506]: cannot assign to `self.input.cached_token` because it is borrowed
       --> /home/jared/.cargo/registry/src/github.com-1ecc6299db9ec823/cssparser-0.24.1/src/parser.rs:572:17
        |
    547 |     pub fn next_including_whitespace_and_comments(&mut self) -> Result<&Token<'i>, BasicParseError<'i>> {
        |                                                   - let's call the lifetime of this reference `'1`
    ...
    560 |             Some(ref cached_token)
        |                  ---------------- borrow of `self.input.cached_token` occurs here
    ...
    572 |                 self.input.cached_token = Some(CachedToken {
        |                 ^^^^^^^^^^^^^^^^^^^^^^^ assignment to borrowed `self.input.cached_token` occurs here
    ...
    584 |         Ok(token)
        |         --------- returning this value requires that `self.input.cached_token.0` is borrowed for `'1`
    
    error: aborting due to previous error
    
    For more information about this error, try `rustc --explain E0506`.
    error: could not compile `cssparser`.
    warning: build failed, waiting for other jobs to finish...
    error: build failed
    

    To Reproduce:

    My main.rs is:

    extern crate url_scraper;
    use url_scraper::UrlScraper;
    
    fn main() {
        let directory = "http://phoronix.com/";
    
        let scraper = UrlScraper::new(directory).unwrap();
        for (text, url) in scraper.into_iter() {
            println!("{}: {}", text, url);
        }
    }
    

    My Cargo.toml dependencies are:

    [dependencies]
    url-scraper = "0.1.1"
    
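For readers unfamiliar with E0506, the rule behind the log above can be illustrated with a small, self-contained sketch. This is not the cssparser source, and it makes no claim about why cssparser 0.24.1 trips the check. `Input`, `cached_token`, and `next_token` are hypothetical names chosen only to echo the log.

```rust
/// Hypothetical stand-in for a parser input with a one-slot token cache.
struct Input {
    cached_token: Option<String>,
}

impl Input {
    /// Returns the cached token, filling the cache first if it is empty.
    fn next_token(&mut self, fresh: &str) -> &str {
        // The compiler rejects the following with E0506:
        //     let held = &self.cached_token;  // shared borrow starts
        //     self.cached_token = None;       // assignment while borrowed
        //     println!("{:?}", held);         // borrow is still in use
        //
        // Accepted: the assignment finishes before any reference is taken.
        if self.cached_token.is_none() {
            self.cached_token = Some(fresh.to_string());
        }
        self.cached_token.as_deref().unwrap_or("")
    }
}

fn main() {
    let mut input = Input { cached_token: None };
    println!("{}", input.next_token("ident"));
    // The cache is already filled, so the second argument is not used.
    println!("{}", input.next_token("ignored"));
}
```

The general remedy is to order the code so that the write completes before a reference to the same field is created or returned.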
  • Starts with char literal

    Nov 28, 2019

    starts_with can take a char literal instead of a single-character string slice.

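The suggestion above can be sketched with the standard library alone. `str::starts_with` accepts any pattern, including a `char`, so `'/'` can stand in for `"/"`. The helper `is_root_relative` is a hypothetical example and is not taken from url-scraper.

```rust
/// Returns true when `href` is a root-relative link such as "/news".
/// Hypothetical helper for illustration only.
fn is_root_relative(href: &str) -> bool {
    // A char pattern: no one-character string slice is needed.
    href.starts_with('/')
}

fn main() {
    // The char form and the string-slice form agree on the result.
    assert_eq!("/news".starts_with("/"), "/news".starts_with('/'));
    assert!(is_root_relative("/news"));
    assert!(!is_root_relative("http://phoronix.com/"));
}
```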
  • Updating scraper

    Nov 28, 2019

    error[E0506]: cannot assign to `self.input.cached_token` because it is borrowed
       --> /home/callen/.cargo/registry/src/github.com-1ecc6299db9ec823/cssparser-0.24.1/src/parser.rs:572:17
        |
    547 |     pub fn next_including_whitespace_and_comments(&mut self) -> Result<&Token<'i>, BasicParseError<'i>> {
        |                                                   - let's call the lifetime of this reference `'1`
    ...
    560 |             Some(ref cached_token)
        |                  ---------------- borrow of `self.input.cached_token` occurs here
    ...
    572 |                 self.input.cached_token = Some(CachedToken {
        |                 ^^^^^^^^^^^^^^^^^^^^^^^ assignment to borrowed `self.input.cached_token` occurs here
    ...
    584 |         Ok(token)
        |         --------- returning this value requires that `self.input.cached_token.0` is borrowed for `'1`
    
    error: aborting due to previous error
    
    

    This compile error broke downstream crates such as url-crawler. This version bump seems to clear it up.

  • Bump scraper version

    Dec 17, 2018

    Referring to this issue, I think this bump might solve the compilation issues. I don't anticipate any breakage, but you might want to double-check that it doesn't break anything.
