Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Common "footguns" when writing CLI applications #29

Open
kbknapp opened this issue Mar 27, 2018 · 14 comments
Open

Common "footguns" when writing CLI applications #29

kbknapp opened this issue Mar 27, 2018 · 14 comments
Labels
A-book Area: Documenting how to create CLIs C-tracking-issue Category: A tracking issue for an unstable feature

Comments

@kbknapp
Copy link
Contributor

kbknapp commented Mar 27, 2018

It seems like we're coming across a few small "footguns", some common some less so, so it may be nice to collect them in one place so when writing something like #6 we don't forget anything.

This could happen in #6, but I'd prefer if it were separate so this can dive a little deeper and actually make a list.

Off the top of my head from stuff that's come up today in the gitter and elsewhere:

  • Reading stdin line by line is slow, because it does a heap alloc at each line. Easiest solution is BufReader::read_lines
  • Using String as an input type for arbitrary data that doesn't necessarily need to be UTF-8
  • Using println! when writing lots of small output, which locks stdout at each call. Prefer locking stdout manually and using something like write!
  • I think Silently dropped errors in std #28 probably qualifies too, although it's probably far less common. The workout is simple, but not really intuitive IMO (calling libc::close manually, and checking for errors)
  • println! can panic on a broken pipe
  • std::env::args panics at invalid UTF-8 (the fix is to use std::env::args_os but this is not very intuitive, and with issues like Ergonomics of String handling #26 it's far more friction than it should be)

If you've got more I'll add them. Of course I think we should also be open to discussion about the ones I listed as to why they should, or should not be listed and/or other fixes for them. 😄


Gitter log of reading `stdin` line by line @rcoh 18:37: Hello! I'm working on https://github.com/rcoh/angle-grinder in the process, I noticed that reading from stdin via .lines() is pretty slow there are ways around it, of course. ripgrep uses a custom written stdin buffer

@BurntSushi 18:38: yeah it creates an alloc for each new line. ripgrep's buffer is quite a bit more involved. :grinning: easiest path to something fast is BufReader::read_line

@rcoh 18:38: In any case, I was wondering if there were any plans to make reading lots of data from stdin a more generally supported pattern or if ripgrep's pattern could be abstracted into a library in some way

@BurntSushi 18:39: ripgrep doesn't really read line-by-line, so it's a bit different

@rcoh 18:40: but I assume at some point in the internals, it's splitting things into lines?

@BurntSushi 18:40: nope

@rcoh 18:40: oh interesting!

@BurntSushi 18:40: only for output

@rcoh 18:40: doesn't grep have line-by-line semantics? Does ripgrep as well?

@BurntSushi 18:41: in the (very common) case that searching a file yields no matches, it is possible that the concept of a "line" never actually materializes at all

@rcoh 18:41: interesting.

@BurntSushi 18:41: see: https://blog.burntsushi.net/ripgrep/#mechanics (GNU grep does the same thing)

@rcoh 18:42: My current approach is:

        let mut line = String::with_capacity(1024);
        while buf.read_line(&mut line).unwrap() > 0 {
            self.proc_str(&(line));
            line.clear();
        }

Is that fastest I can do without resorting to a totally different strategy?

@BurntSushi 18:42: pretty much, yes. if you could use a Vec instead of a String, then you can use read_until and avoid the UTF-8 validation check. depends on what you're doing though!
if you're processing data from arbitrary files, then String is very likely the wrong thing to use

@rcoh 18:44: it's intended to process log files so it's almost always printable data except in some weird cases

@BurntSushi 18:44: it all depends on whether you're OK with it choking on non-UTF-8 encoded text

@rcoh 18:53: Got it. Thanks for your help! I literally couldn't think of a more qualified person to have answer my question :grinning:

@BurntSushi 18:53: :grinning: FWIW, I do intend on splitting the core ripgrep search code out into the grep crate, which will include the buffer handling. it's likely possible that the API will be flexible enough to do line-by-line parsing as fast as possible (although, avoiding line-by-line will still of course be faster for many workloads)

@rcoh 19:02: Cool. I don't actually need to be line-by-line until after running some filtering so that would be useful for my use case.

@spacekookie
Copy link
Collaborator

  • Using String as an input type for arbitrary data that doesn't necessarily need to be UTF-8

I'm not sure in the year 2018 anyone should not be expecting utf-8 input 🤔

@BurntSushi
Copy link

@spacekookie If your tool is designed to process arbitrary files, and that processing can be done in a somewhat encoding agnostic way (for example, one might only need the assumption that the text is ASCII compatible), then you really can't use String. Many of the tools I write fall into this category, so I am personally biased in witnessing the representation of these cases, but I wouldn't be surprised if this were a common thing when writing tools intended for use by others on Unix-like systems.

On Windows this gets a bit more complicated. If you write a CLI tool for Windows that reads files, then at some point, you're likely to come across UTF-16. This can be handled automatically via BOM-sniffed and transcoding, but there's no out-of-the-box convenient abstraction to handle this. (There are nominal String::from_utf16 methods, which work well if you're OK reading an entire file into memory. For some tools, that's OK.)

@epage
Copy link
Contributor

epage commented Mar 27, 2018

This can be handled automatically via BOM-sniffed and transcoding, but there's no out-of-the-box convenient abstraction to handle this.

Sounds like a good candidate for any interested party in the WG to take on.

@BurntSushi
Copy link

Sounds like a good candidate for any interested party in the WG to take on.

For streaming transcoding, see:

@jeekobu
Copy link

jeekobu commented May 17, 2018

The String issue is, in my view, huge. If your CLI app takes a filename but can't handle my non-UTF8-but-perfectly-valid argument, for example, that's a bug. And in general, non-String-but-text-data ([u8], OsString) feels very second-class-citizen in Rust. Most libraries/APIs/macros require String... such as std::fmt (including Debug and Display). And putting unchecked data into a String is considered UB.

Ideally, an overarching trait or similar could be created to expand existing systems to the non-UTF8 story. I'm sure that's not a 2018 possibility though, and would be interested to see what the WG suggestions in the meantime.

(Side note, since the OP mentioned the "ideal guide" issue: the quicli docs are mentioned as a "concrete example" there. Their front page example has file: String.)

@BurntSushi
Copy link

@jeekobu An RFC was recently merged that adds Pattern support to OsStr, which should make it a bit nicer to use. IMO, it should also be done for &[u8], but the RFC only addresses OsStr.

I think fixing the "programs demand UTF-8, even for file paths" bug is the realm of argv parsers. They should make it very easy to handle file paths correctly.

@jeekobu
Copy link

jeekobu commented May 18, 2018

The Pattern extension is why I have some hope that quality &[u8] and/or OsStr support is a future possibility ☺️. But it's unstable (and thus currently easy to extend or modify), unlike most of std::fmt (which is where I think most of the pain comes from).

Anyway, after I made that comment, I found #26 which more directly addresses this hurdle.

@killercup killercup mentioned this issue Jul 24, 2018
@djhworld
Copy link

Unfortunately managed to shoot myself in the foot regarding reading lines from stdin, see here

https://www.reddit.com/r/rust/comments/aaood3/go_version_of_program_is_40_quicker_than_rust/

and follow up gist here https://gist.github.com/djhworld/ced7eabca8e0c06d5b65917e87fb8745

I think it would be great if the CLI book could at least document a recommended way of reading newline deilimited input from stdin to avoid falling into these traps, or at least have an example application that does something trivial like converting input to uppercase or something.

@killercup
Copy link
Collaborator

@djhworld very good point! I have a TODO open for stdin in #95.

I'm not entirely sure what the book should recommend precisely; I'd say using a buffered reader is uncontroversial. Reusing the line buffer however will require you to be very careful in that you always clear it and don't expect to be able to store it permanently without cloning.

Do you think that reading from stdin, using a buffered reader, will often be a performance issue? Or would other parts of the program be usually vastly slower?

@yoshuawuyts
Copy link
Collaborator

I'm not entirely sure what the book should recommend precisely; I'd say using a buffered reader is uncontroversial. Reusing the line buffer however will require you to be very careful in that you always clear it and don't expect to be able to store it permanently without cloning.

I think putting this in the book, in almost these exact words would be super helpful. Coupled with an example it would probably help people a lot.


Do you think that reading from stdin, using a buffered reader, will often be a performance issue?

I've seen the issue of folks unknowingly acquiring a lock on stdin over and over come up several times. It seems that at least for trivial benchmarks / exploratory work this is an actual issue. I suspect this might also be an actual bottleneck for people building filtering shell tools, similar to grep or jq.

@stappersg
Copy link
Contributor

and would be interested to see what the WG suggestions in the meantime.

WG is the acronym for ... ??

@Dylan-DPC-zz
Copy link

Working Group

@yuvadm
Copy link

yuvadm commented Nov 23, 2019

Is there any recommended pattern or best-practice for writing a piping CLI that efficiently reads from and writes to stdin/stdout?

Even having a reference implementation to look at would be useful.

@killercup
Copy link
Collaborator

@yuvadm I haven't seen a write-up about it, but I'd love to see one! :)

Maybe have a look at what some projects use? Reading and writing to std{in,out} should be easy enough, but it gets tricky if you want to handle "broken pipe" errors gracefully. This was also discussed in #21, btw.

@epage epage added A-book Area: Documenting how to create CLIs C-tracking-issue Category: A tracking issue for an unstable feature labels Aug 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-book Area: Documenting how to create CLIs C-tracking-issue Category: A tracking issue for an unstable feature
Projects
None yet
Development

No branches or pull requests