BlaanSh is a Unix-like shell written in C language by @os-moussao and @awbx, this shell was inspired by Bash.
$> touch file ; ls -la file ; ping -c 1 google.com &
$> false && echo success || echo failure
$> cat /etc/passwd | cut -d':' -f1 | sort | uniq # get unique user's
$> cat << "EOF" > outfile ; < infile cat >> outfile
$> cd / ; (cd bin && ./ls -G .); pwd # The current working directory is still `/`
echo # Write arguments to the standard output.
cd # Change the shell working directory.
pwd # Print the name of the current working directory.
export # Set export attribute for shell variables.
unset # Unset values and attributes of shell variables.
env # Set environment and execute command, or print environment.
exit # Exit the shell.
Before getting any input, we first have to handle Ctrl-C, Ctrl-D, Ctrl-Z and ctrl-\ which should behave like in bash. After that, the program is continuously prompting for a new command line
input.
This command line
goes through many phases before being executed and showing the result.
First it goes through a lexer, which splits the input into a "valid" list of tokens. Second, the parser uses that list to produce an abstract syntax tree. And last but not the least, the executor walks this tree recursively and executes each node accordingly.
The lexer (or lexical analyser) is the first stage of parsing the command line
input.
- It consists three important steps:
- Tokenizer: It converts the command line into an "initial" list of tokens which contains every detail of the
char *cmdline
. The tokenizer does not check for syntax errors inside thecmdline
, this task is left to the syntax analyser. - Syntax analyser: It goes through the doubly linked list of tokens, and checks for syntax errors at every node. This is done by looking at the left and the right of the current node, if some strange or some unexpected token is encountered, an error is printed to
stderr
and a falsy value is returned to the lexer function. Bellow are the rules of the syntax analyser. - Expander: If the syntax analyser does not produce any error, the list is then passed to an expander which removes quotes and whitespaces, and expands the tilde and the wildcard patterns to their values. Variable expansion is done during execution.
- Tokenizer: It converts the command line into an "initial" list of tokens which contains every detail of the
- UNEXPECTED TOKENS:
- `;;' (this token is always unrecognized in bash and zsh shells)
* AND, OR, PIPE, FG, BG:
- left: [WSPACE] (STRING | CPAR)
- right: [WSPACE] (STRING | REDIRECT | OPAR | if <FG, BG> ENDOFCMD)
* OPAR "(":
- left: CMDBEGIN | [WSPACE] (AND | OR | PIPE | OPAR)
- right: [WSPACE] (STRING | REDIRECT | OPAR)
* CPAR ")":
- left: [WSPACE] (STRING | CPAR)
- right: [WSPACE] (AND | OR | PIPE | CPAR | ENDOFCMD)
* REDIRECT:
- right: [WSPACE] STRING
* PARENTHESES MATCHING AND QUOTING:
- inside each pair parentheses should not be an empthy command.
- every open parentheses has to have a matching closing parentheses.
- every single/double quote have to be closed.
// inside the lexer function
tokens = tokenizer(cmdline);
if (!validate_syntax(tokens))
{
set_status(2);
return (NULL);
}
tokens = expander(tokens);
The parsing algorithm used in this project is a recursive descent parser. Its job is to navigate through the list of tokens produced by the lexer function, and recursively validate each rule defined in the grammar that it recognizes.
The shell (or BlaanSh) grammar written in Extended Backus–Naur form:
<cmdline> ::= <block>
| <block> (";" | "&") <cmdline>
<block> ::= <pipeline> {("&&" | "||") <pipeline>}
<pipeline> ::= <command> {"|" <command>}
<command> ::= <cmdlist>
| "(" <cmdline> ")" <redir> (* subshell *)
<cmdlist> ::= <redir>+ (* at least one redirect - without WORDs *)
| <redir> {<arg> <redir>}+ (* at least one WORD - zero or more <redir> in any place *)
<redir> ::= {("<" | "<<" | ">" | ">>") <filename>} (* a delimiter in case of heredoc *)
<arg> ::= token WORD | token VAR | token GROUP
<filename> ::= token WORD | token VAR | token GROUP
The parser scans the list token by token, if it fails at some point (which rarely happens because of the syntax analyser phase) it outputs an error, and returns NULL pointer. Otherwise it returns a pointer to the generated abstract syntax tree.
If you read the parser source code you can notice how closely it mirrors the grammar above. That's why defining a well structured grammar is a very critical step that preceeds parser implementation.
// calls the starting rule <parse_cmdline> which itself invokes the other rules
t_cmdtree *parser(t_lexer *tokens);
t_cmdtree *parse_cmdline(t_node **tokp); // parses <cmdline>
t_cmdtree *parse_block(t_node **tokp); // parses <block>
t_cmdtree *parse_pipeline(t_node **tokp); // parses <pipeline>
t_cmdtree *parse_command(t_node **tokp); // parses <command>
t_cmdtree *parse_cmdlist(t_node **tokp); // parses <cmdlist>
t_cmdtree *parse_redir(t_cmdtree *cmdtree, t_node **tokp); // parses <redir>
The tree structure t_cmdtree
used in this project has multiple node types. And the t_cmdtree*
pointer is casted to the corresponding type whenever we want to access its elements.
Node types:
- Subshell Node: This node has a pointer to another command tree, which is ment to be executed in a child process.
- Connector Node: This markes one of these connectors
&&
,||
,|
,;
or&
. It has a pointer to a left command tree, and a pointer to a right command tree. - Redirect Node: This type has many information about the redirection mode, source and destination, and a pointer to the tree ment to be executed under the redirection.
- Command list Node: This node type has no childs (a leaf), it has a pointer to a list of command arguments ment to be executed. This is equivalent to the numbers in a math expression.
The executor's job is to recursively walk the syntax tree and execute each node accordingly. You can see that there is a function run_ for each node type.
// executes the whole tree by calling one of the runners depending on the node_type.
void executor(t_cmdtree *tree);
// NODE_PIPE runner: it creates a pipe, and executes the left and the right trees.
void run_pipeline(t_connector *connector);
// runs NODE_CMDLIST type: converts the `t_list *cmdvec` element into a char ** array and executes the array in a child process by calling execve.
void run_cmdlist(t_cmdlist *cmdlist);
// NODE_AND, NODE_OR runner: it checks the status after executing the left tree, and decides whether or not to execute the right tree.
void run_logical_connector(t_connector *connector, int node_type);
// NODE_SUBSH runner: it executes the subtree in a child process and waits for it
void run_subshell(t_subsh *subshell);
// NODE_BG runner: it executes the left subtree in the background (forking without waiting for the child) and executes the right tree immediately (if it exists)
void run_bg_connector(t_connector *connector);
// NODE_FG runner: it runs the left command tree, after its execution is finished, it executes the right tree (if it exists)
void run_fg_connector(t_connector *connecter);
// NODE_REDIR runner: it implements the redirection logic and then executes the subtree
int run_redirection(t_redir *redir, int exec);