Using BufRead for faster Rust I/O

One of the many reasons people write applications in Rust is to get the highest possible performance. This is the sort of thing that is hard to compare across languages, but cross-language benchmarks generally show that Rust is the fastest major language not named C or C++.

Rust makes it easy to write high-performance code because its data copies are generally explicit; you know when they’re happening because you have to write .clone() or something similar to make them happen. However, there are still ways to write lower-performing code in Rust without realizing it, and using unbuffered I/O is one of them!

In this article, we’ll cover the following:

- A brief intro to buffering I/O
- Benchmarking Rust code
- Four ways to read a file, line by line

A brief intro to buffering I/O

First, let’s go over a few definitions. Buffered I/O is input or output that goes through a buffer before going to or coming from disk. Unbuffered I/O is input or output that doesn’t go through a buffer.

Buffering I/O is important for performance because doing many small reads or writes on disk carries significant overhead: each one requires a system call, and the disk driver has to do work to access the data. In the worst case, you pay all of that overhead for every single character being read or written.

In reality, a lot of buffering can happen at various layers of calls, so it’s rarely that bad. But using unbuffered I/O in Rust can still have a noticeable impact on performance.
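To make the difference concrete, here’s a minimal, self-contained sketch comparing a byte-at-a-time unbuffered read against a BufReader. It uses a scratch file created on the spot (the file name is just a stand-in, not the article’s dataset):

```rust
use std::fs::File;
use std::io::{BufReader, Read, Write};

fn main() -> std::io::Result<()> {
    // Create a small scratch file to read back.
    File::create("demo.txt")?.write_all(b"hello buffered world\n")?;

    // Unbuffered: every read() call goes straight to the OS.
    let mut unbuffered = File::open("demo.txt")?;
    let mut byte = [0u8; 1];
    let mut count = 0usize;
    while unbuffered.read(&mut byte)? == 1 {
        count += 1; // one system call per byte read
    }

    // Buffered: BufReader fills an internal buffer with one large
    // read(), then serves subsequent small reads from memory.
    let mut buffered = BufReader::new(File::open("demo.txt")?);
    let mut contents = Vec::new();
    buffered.read_to_end(&mut contents)?;

    // Both paths see the same bytes; only the system-call count differs.
    assert_eq!(count, contents.len());
    println!("read {} bytes both ways", count);
    Ok(())
}
```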

Let’s look at an example. This GitHub repo has a large text file (around 8 MB) containing a list of English word frequencies. Each line has a word made up of ASCII characters, followed by a space and then a number. We want to calculate the total of all the numbers. I wrote four different functions to do this and benchmarked them.
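All four functions below call a get_count_from_line() helper whose body the article doesn’t show. Here’s a plausible sketch of what it might look like, assuming each line is a word, a space, and a number; the exact parsing logic is an assumption:

```rust
// Hypothetical sketch of the get_count_from_line() helper used by the
// benchmarked functions; the original article doesn't show its body.
fn get_count_from_line(line: &str) -> u64 {
    // Each line looks like "word 12345": take the last
    // whitespace-separated token and parse it as a number.
    line.split_whitespace()
        .last()
        .and_then(|n| n.parse().ok())
        .unwrap_or(0)
}

fn main() {
    assert_eq!(get_count_from_line("the 23135851162"), 23135851162);
    println!("ok");
}
```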

Benchmarking Rust code

Whenever you’re concerned about performance, there are many things you can try to make your code run faster, but there’s no substitute for measuring how fast it runs to see what actually helps!

Unfortunately, benchmarking code can be tricky. The problem is that there are many things that can affect performance. To be sure you’re measuring what you care about, you need to take a number of measurements and average the results. But how many runs is enough? 10? 100? 1,000? Too low and your results won’t be reliable, too high and you’re wasting the computer’s time and yours.

The criterion crate is a handy way to figure this out using statistics: it measures the variance between runs to see if the results are converging. It also has a warmup step, where it runs the code a few times to make sure everything is loaded into any caches. You can read more about the analysis criterion performs in its user guide. It also remembers the results of the last run and gives you a comparison, which is helpful when trying out changes!

To use criterion for this experiment, I added it with cargo add criterion, then added these lines to Cargo.toml:

[[bench]]
name = "process_lines"
harness = false

Then I added process_lines.rs to the benches directory, with a function to measure each of the approaches listed below. Each function looks something like this:

fn bench_unbuffered_one_character_at_a_time(c: &mut Criterion) {
  c.bench_function("unbuffered_one_character_at_a_time", |b| {
    b.iter(|| read_unbuffered_one_character_at_a_time())
  });
}

The argument to b.iter() is what is actually being benchmarked. In this case, the function we want to benchmark doesn’t take arguments, so this works nicely.

Because you’ll be running these benchmarks in release mode, you need to be careful that the compiler doesn’t optimize away your function! You can use criterion::black_box() to signal to the compiler not to do this. Here’s an example from the criterion book:

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

In our case, this isn’t necessary because we’re reading from a file; the compiler can’t optimize away the I/O calls, since they have observable side effects.

To run the benchmarks, simply run cargo bench, which will build your crate in release mode and then call criterion to do the measurements. It will print out what it’s doing and the results, including any outliers it finds.

Here’s the output for the buffered_allocate_string_every_time() function:

buffered_allocate_string_every_time
                        time:   [45.728 ms 45.784 ms 45.851 ms]
                        change: [-49.792% -48.593% -47.393%] (p = 0.00 < 0.05)
                        Performance has improved.

Here, the middle time value is the median (45.784 ms), and it’s showing that performance has improved since the last run, and that result is statistically significant. (This is the difference plugging in my laptop makes!)

Four ways to read a file, line by line

In order from slowest to fastest (as measured on my laptop while plugged in), here they are:

Unbuffered, one character at a time

Speed: 10.1 seconds

This is the read_unbuffered_one_character_at_a_time() function, which is implemented as:

pub fn read_unbuffered_one_character_at_a_time() -> io::Result<u64> {
  let mut file = File::open(FILENAME)?;
  let len = file.metadata()?.len() as usize;
  // Preallocate a byte vector big enough to hold the whole file.
  let mut v: Vec<u8> = vec![0u8; len];
  // Read the file one byte at a time -- one system call per byte!
  for index in 0..len {
    file.read_exact(&mut v[index..(index + 1)])?;
  }
  let s = String::from_utf8(v).expect("file is not UTF-8?");
  let mut total = 0u64;
  for line in s.lines() {
    total += get_count_from_line(line);
  }
  Ok(total)
}

This is the worst-case scenario — the read_exact() call reads exactly one byte at a time until the whole file has been read, which, for this file, means more than eight million reads! This takes more than 20 times longer than the next slowest method.

Buffered, allocating a new string every time

Speed: 45.8 milliseconds

This is the read_buffered_allocate_string_every_time() function, which looks like this:

pub fn read_buffered_allocate_string_every_time() -> io::Result<u64> {
  let file = File::open(FILENAME)?;
  let reader = BufReader::new(file);
  let mut total = 0u64;
  for line in reader.lines() {
    let s = line?;
    total += get_count_from_line(&s);
  }
  Ok(total)
}

Here, we’re using the BufReader struct to wrap the file and read it one buffer-sized chunk at a time. (BufReader implements the BufRead trait, which any reader with an internal buffer can implement.) Then we can just call lines() on the reader to get an iterator over each line of the file, which is very convenient!

Note that, by default, BufReader has a buffer size of 8 KB, though this may change in the future. If you want to change this, you can use BufReader::with_capacity() instead of BufReader::new() to construct it.
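For example, here’s a sketch of constructing a BufReader with a 64 KB buffer instead of the default; the file name is just a stand-in, and a scratch file is created so the example is self-contained:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Write};

fn main() -> std::io::Result<()> {
    // Scratch file so the example runs on its own; in the article's
    // benchmarks this would be the word-frequency file instead.
    File::create("capacity_demo.txt")?.write_all(b"alpha 1\nbeta 2\n")?;

    // A 64 KB buffer instead of the default 8 KB; a larger buffer
    // means fewer underlying read() system calls for big files.
    let reader = BufReader::with_capacity(64 * 1024, File::open("capacity_demo.txt")?);
    assert_eq!(reader.capacity(), 64 * 1024);

    let lines: Vec<String> = reader.lines().collect::<Result<_, _>>()?;
    assert_eq!(lines.len(), 2);
    Ok(())
}
```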

Buffered, reusing the string buffer

Speed: 29.4 milliseconds

This is the read_buffered_reuse_string() function, which is implemented as:

pub fn read_buffered_reuse_string() -> io::Result<u64> {
  let file = File::open(FILENAME)?;
  let mut reader = BufReader::new(file);
  let mut string = String::new();
  let mut total = 0u64;
  // read_line() appends to the string, so clear it between
  // iterations to reuse the same allocation.
  while reader.read_line(&mut string)? > 0 {
    total += get_count_from_line(&string);
    string.clear();
  }
  Ok(total)
}

This is very similar in concept to the previous function. The only difference is that we allocate one String and pass it to reader.read_line(), which fills the existing String instead of allocating a new one for each line. This small change to avoid an allocation per line is enough to make this method run 1.5 times faster than the previous function!

Reading the whole string from disk into a giant buffer

Speed: 22.9 milliseconds

The final function we’ll look at is read_buffer_whole_string_into_memory(), which looks like this:

pub fn read_buffer_whole_string_into_memory() -> io::Result<u64> {
  let mut file = File::open(FILENAME)?;
  let mut s = String::new();
  file.read_to_string(&mut s)?;
  let mut total = 0u64;
  for line in s.lines() {
    total += get_count_from_line(line);
  }
  Ok(total)
}

This is the extreme version of a buffer; here we allocate one big buffer and read the whole string into it all at once. This is the best way of showing that the number of read calls is really the determining factor in our performance; this function, which does only one read call, is the fastest one of all. It is 1.3 times faster than the next fastest version.

The downside to this technique is that you need enough memory to be able to hold all of the file contents at once. In this case, the file is only around 8 MB big, which is not much memory, but if you’re writing a program to process arbitrary files, this could easily fail. In general, it’s safer to use a BufReader as described above; you can tweak it to increase its buffer size if you’re comfortable using more memory.

Final thoughts

BufReader is more flexible than we’ve shown here: it can wrap any type that implements the Read trait. Notably, this includes the TcpStream struct, so you can use BufReader for network connections too.
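As a sketch of that, here’s BufReader wrapping a TcpStream, with a throwaway local server thread standing in for a real peer (the addresses and message are made up for the demo):

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn main() -> std::io::Result<()> {
    // Tiny local server just for the demo: it accepts one connection
    // and writes a single line. Port 0 lets the OS pick a free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let server = thread::spawn(move || {
        let (mut socket, _) = listener.accept().unwrap();
        socket.write_all(b"hello over tcp\n").unwrap();
    });

    // Client side: wrap the stream in a BufReader to get the same
    // line-based reads we used on files.
    let stream = TcpStream::connect(addr)?;
    let mut reader = BufReader::new(stream);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    assert_eq!(line, "hello over tcp\n");

    server.join().unwrap();
    println!("got: {}", line.trim_end());
    Ok(())
}
```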

In a larger sense, whenever you’re making repeated calls to read from something, consider using buffered I/O; it can make a big difference in performance!
