Quickstart
In this guide, we will summarize the use of Chatette to generate examples from a set of templates.
The TLDR section will allow you to jumpstart and use Chatette in less than 5 minutes. The quickstart will give you a general overview of what Chatette is capable of.
As shown in the readme file, Chatette is meant to be executed on a command line, taking a file as input and creating files as output. The output files can then be used to train an NLU model. If you use Rasa NLU, these output files can be used directly without any changes.
The usual process is thus:
- Make one or several input files containing templates written using the Domain Specific Language (DSL) specified in the next section;
- Execute Chatette on this or these files to produce the desired outputs, using the following command:
python -m chatette PATH/TO/TEMPLATE/FILE
In template files, everything on a line that comes after a double slash `//` is a comment and will be ignored by the program.
The DSL will be presented using comments within an example.
// Lines starting with double slash are comments and are ignored
    // Indented comment lines and empty lines are also ignored

// Other template files can be included inside a file
// using the pipe `|` symbol.
// Inclusion of a file simply means the contents of this file
// are "copied" in place of the inclusion.
|path/to/other/file/relative/to/this/file

// A unit is a list of generation rules, with a name.
// Once declared, units can be used inside other rules.
// Each unit declaration starts with a special symbol.

// You define aliases with tilde `~`.
// They are basically lists of synonyms
~[greeting]
    hello // This is a rule
    hi // Rules are indented
// Upon generation, this alias will generate
// either `hello` or `hi`.

// Units can be used inside rules,
// using their name and special symbol
~[other alias]
    ~[greeting] guys!
// Upon generation, this alias will generate
// either `hello guys!` or `hi guys!`.

~[third alias]
    I like [apples|pears] // Inline synonyms also exist.

// Slots (also called entities) start with an at sign `@`
// and represent parts of a sentence that are worth noticing
// because they hold important information.
@[operating system]
    Linux
    macOS
    Windows
// Upon generation, this will generate either `Linux`,
// `macOS` or `Windows`, and will be "highlighted"
// in the output files.

// Finally, intents start with a percent sign `%`
// and are the "entry points" to the generator.
%[greet user]
    ~[greeting] user! What is your @[operating system]?
// Upon generation, the generator will look for
// intents and generate them and only them.
// Each generated example will be marked as having
// the intent it was generated under.

// Inside rules, you can also use generation modifiers
// which will change the generation of some parts of the rule.
~[alias]
    group [of words?] // The word `group` is always generated.
                      // The part surrounded by brackets is optional
                      // and will generate only 50% of the time.
    [&group of words] // The first word of the group will sometimes
                      // be capitalized.
Of course, many more features and modifiers exist, but this gives you a glimpse of what Chatette is capable of.
Obviously, in order to use Chatette, you will first need to install it. Refer to the README or the command line interface page for more information.
Then, you will need to write one or several template files, which will be parsed and interpreted by Chatette.
Finally, you will want to run Chatette on those files to generate output file(s) which will contain a set of examples corresponding to the descriptions provided in the template files. If you need custom or more powerful capabilities, you can use parts of the Chatette module within your own Python programs.
This page will quickly describe how to do each step (except using parts of the module in your own scripts). More complete documentation for each of those steps is available in this wiki.
A template file is simply a file on your file system containing text in a certain format. Even though all the examples in this repository end in `.chatette`, a specific file extension is not mandatory.
A template file contains a set of unit declarations, which themselves contain a list of generation rules (also called templates).
Each line of the file can be one of 4 types:
- Empty lines and comments: empty lines contain no characters; comment lines start with a double slash `//`. Both can be indented as you wish. Those lines will be ignored by the parser. More generally, anything that follows an (unescaped) `//` on a line is ignored.
- Unit declaration: unindented lines starting with a special character (`%`, `@` or `~`, depending on the type of unit that is being declared). The special character is followed by a set of Unicode characters surrounded by square brackets (`[` and `]`) and, in some cases, by another string surrounded by parentheses (`(` and `)`).
- Unit definition: those lines are the content of a unit declaration (which may be several lines long), and must be indented consistently, i.e. all the lines of a definition must be indented in the same way. Each of those lines describes a generation rule for the declared unit, which we will call "rule" or "template".
- Inclusion of another file: those lines tell the parser to include another template file exactly where the line is. They begin with a pipe symbol `|` (not indented) directly followed by the file path (relative to the file that is currently being parsed).
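The four line types can be told apart with a few checks on the start of each line. The following is a minimal sketch and not Chatette's actual parser: the function name `classify_line` and its labels are hypothetical, and the sketch ignores escaping and trailing comments.

```python
def classify_line(line):
    """Return a rough label for one template line (illustrative only)."""
    stripped = line.strip()
    # Empty lines and comment lines are ignored, whatever their indentation.
    if not stripped or stripped.startswith("//"):
        return "ignored"
    # Inclusions and declarations must start in the first column.
    if line[0] == "|":
        return "inclusion"
    if line[0] in "%@~":
        return "declaration"
    # Any other indented line is treated as a rule of the current unit.
    if line[0].isspace():
        return "rule"
    return "unknown"


for sample in ["// comment", "%[greet user]", "    hello // a rule", "|other.chatette"]:
    print(repr(sample), "->", classify_line(sample))
```

Note that an indented line starting with `~`, `@` or `%` is classified as a rule here, since a rule may begin with a unit reference.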
A template file thus conforms to the following skeleton:
// comment
%[DECLARATION1](something)
    RULE1
    RULE2
@[DECLARATION2]
    RULE1
    RULE2 // comment
    RULE3
~[DECLARATION3] //comment
    RULE1
The lines marked as "rules" in the skeleton above are placeholders for what we call generation rules, that is, templates to follow in order to generate a string.
A rule is a sequence of parts (usually called sub-rules), each of which is able to generate certain strings. The string generated by the rule is then simply the concatenation of those strings. Sub-rules are separated by whitespace or special characters (`~`, `@`, `%`, `[` or `]`).
Here is an exhaustive list of the types of sub-rules that exist and what they do upon generation:
- a word is a simple word, which will generate itself.
- a unit reference is a reference to another unit definition. This is achieved by writing the special character for the type of unit referenced, and then the unit's name surrounded by brackets (and possibly modifiers within the brackets as well). Upon generation, they will look for the unit declaration and make it generate a string (see below).
- a choice is a list of rules separated by pipe symbols `|`, the whole choice being surrounded by square brackets `[` and `]`. It tells the generator to choose one of the rules within the choice at random and make it generate a string. As for unit references, its generation behavior can be modified using modifiers.
Here are examples of such sub-rules and what they could generate:
Sub-rule type | Sub-rule | Possible generated string(s) |
---|---|---|
Word | `test` | `test` |
Unit reference | `~[alias]` | Depends on the unit declaration labelled `alias` |
Choice | `[choice1\|choice two\|[third choice]\|longer choice [very long]]` | `choice1`, `choice two`, `third choice` or `longer choice very long` |
Note that you can specify one and only one rule per line, and that no rule can span several lines.
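To make the behavior of words and (possibly nested) choices concrete, here is a small sketch that enumerates every string such a rule could generate. It is an illustration under simplifying assumptions, not Chatette's implementation: the function names are hypothetical, unit references and modifiers are not handled, and whitespace is kept exactly as written.

```python
def _parse_alternatives(text, pos):
    """Expand `a|b|c` starting at `pos`, stopping at an unmatched `]` or at the end."""
    results = []
    current = [""]  # strings generated so far by the alternative being read
    while pos < len(text) and text[pos] != "]":
        char = text[pos]
        if char == "|":
            # End of one alternative: store it and start the next one.
            results.extend(current)
            current = [""]
            pos += 1
        elif char == "[":
            # Nested choice: expand it, then combine it with what came before.
            inner, pos = _parse_alternatives(text, pos + 1)
            pos += 1  # skip the closing bracket
            current = [left + right for left in current for right in inner]
        else:
            # Plain character belonging to a word (or a space between words).
            current = [string + char for string in current]
            pos += 1
    results.extend(current)
    return results, pos


def expand(rule):
    """Return every string a rule made only of words and choices can generate."""
    strings, _ = _parse_alternatives(rule, 0)
    return strings


print(expand("[choice1|choice two|[third choice]|longer choice [very long]]"))
print(expand("I like [apples|pears]"))
```

Chatette picks one alternative at random on each generation; the sketch lists all of them instead so that the set of possible outputs is visible.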
A unit is a set of generation rules which can be used in other rules by referring to them. Upon generation (i.e. when the generator asks a unit declaration to generate a string), a rule is chosen at random inside its set of rules and generates the string.
To be able to refer to a unit, it must have been defined somewhere in the template file(s) (not necessarily before it is used). A unit definition is made of 2 parts:
- a declaration, on one line
- a set of rules, on one or several lines (cf. above)
The declaration line contains information about the unit that is being defined. The mandatory information in the declarations is:
- the type of unit being declared, denoted with a special character the line starts with,
- a name to refer to this particular unit, which is a string containing any characters (including whitespace) except special characters (unless they are escaped using a backslash `\`).
Here are all the characters that can be escaped if you want to use them in a unit identifier (i.e. use `\;` instead of `;`): `;`, `//`, `/`, `[`, `]`, `{`, `}`, `~`, `@`, `%`, `\`, `|`, `?`, `#`, `$` and `&`.
Other optional information will be discussed below.
As in Chatito, there are 3 different types of units: aliases, slots and intents.
An alias simply represents a set of generation rules, which can be used within other rules. An alias could for example be a list of synonyms that can be used interchangeably in generated examples.
Here is an example of a simple alias definition:
~[FOSS]
    FOSS
    Free and Open-Source Software
    free and open-source software
    libre software
Referring to this example inside another rule would thus for instance be done the following way:
I like ~[FOSS] a lot
where `I`, `like`, `a` and `lot` are each sub-rules of type word and `~[FOSS]` is an alias reference.
This specific example would generate `I like FOSS a lot` 25% of the time, `I like Free and Open-Source Software a lot` 25% of the time, `I like free and open-source software a lot` 25% of the time and `I like libre software a lot` the rest of the time.
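The uniform choice between the rules of an alias can be simulated in a few lines. This is only a sketch of the behavior described above, assuming each rule is picked with equal probability; the names `FOSS_RULES`, `generate_example` and `estimate_frequencies` are hypothetical and are not part of Chatette.

```python
import random
from collections import Counter

# The four rules of the ~[FOSS] alias defined above.
FOSS_RULES = [
    "FOSS",
    "Free and Open-Source Software",
    "free and open-source software",
    "libre software",
]


def generate_example(rng):
    """Generate the rule `I like ~[FOSS] a lot` once."""
    return "I like " + rng.choice(FOSS_RULES) + " a lot"


def estimate_frequencies(nb_samples=10000, seed=0):
    """Estimate how often each possible sentence is generated."""
    rng = random.Random(seed)
    counts = Counter(generate_example(rng) for _ in range(nb_samples))
    return {text: count / nb_samples for text, count in counts.items()}


for text, frequency in sorted(estimate_frequencies().items()):
    print(f"{frequency:.2f}  {text}")
```

Each of the four sentences should come out with a frequency close to 0.25.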
Intents are entry points of the generator in the template file, which means there must be at least one intent in the template file(s) in order to have a non-empty example generation. If you refer to an intent in a rule, it will behave like an alias.
Defining an intent comes down to saying:
I want Chatette to generate X examples denoted as having the intent Y.
After an intent declaration, we can thus give the number of examples that should be generated. If no number is given, Chatette will generate all possible examples.
Here is an example of a simple intent definition:
%[greeting](2)
    hello
    hi
    Hi!
After generation, you will find two of the three strings generated in the output file.
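The effect of the number given after an intent declaration can be pictured as drawing distinct examples from the set of all possible ones. The sketch below assumes the rules are plain strings (so each rule is one possible example) and that the requested number is capped at the number of possible examples; both points, as well as the function name, are assumptions made for illustration.

```python
import random


def generate_intent_examples(rules, nb_examples=None, rng=None):
    """Pick examples for an intent whose rules are plain strings."""
    rng = rng or random.Random()
    if nb_examples is None:
        # No number given: generate all possible examples.
        return list(rules)
    # Otherwise draw distinct examples (assumed cap: number of possible examples).
    return rng.sample(list(rules), min(nb_examples, len(rules)))


greeting_rules = ["hello", "hi", "Hi!"]
print(generate_intent_examples(greeting_rules, 2))  # two of the three strings
print(generate_intent_examples(greeting_rules))     # all three strings
```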
A slot in a generation rule represents what Rasa NLU calls an entity, i.e. a finite set of values for a particular variable for which different values mean different things. We will sometimes refer to slots as entities.
Here is an example of a simple slot definition.
@[operating system]
    Linux
    Windows
    macOS
    FreeBSD/OpenBSD
Referring to this slot in other rules would for instance be done as follows:
I use @[operating system].
which will generate `I use Linux.` 25% of the time, `I use Windows.` 25% of the time, `I use macOS.` 25% of the time and `I use FreeBSD/OpenBSD.` the rest of the time. In the output file, all the parts of the generated sentences that were generated by a reference to this slot will be marked as belonging to an entity called `operating system`.
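To give an idea of what "marked as belonging to an entity" can look like, the sketch below builds one annotated example in the style of the JSON training data format of Rasa NLU (text, intent, and entities with character offsets). It is an assumption-laden illustration: the exact fields Chatette writes may differ, and the function name and the intent name `tell os` are hypothetical.

```python
def annotate_example(template, slot_name, slot_value, intent):
    """Fill one slot reference in a rule and record where its value ends up."""
    placeholder = "@[" + slot_name + "]"
    start = template.index(placeholder)
    text = template.replace(placeholder, slot_value, 1)
    return {
        "text": text,
        "intent": intent,  # hypothetical intent name supplied by the caller
        "entities": [
            {
                "start": start,                  # offset of the first character
                "end": start + len(slot_value),  # offset just past the last one
                "value": slot_value,
                "entity": slot_name,
            }
        ],
    }


example = annotate_example("I use @[operating system].", "operating system", "Linux", "tell os")
print(example)
```

Slicing `example["text"]` with the recorded `start` and `end` offsets gives back the slot value.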
For both unit declarations and sub-rules, we can add modifiers to change the behavior of the generator when it encounters them. The generation behavior is changed for the current sub-rule or unit declaration only. Note also that simple words cannot take any modifiers.
We will describe the most useful ones in this section. Other ones exist and are explained in this wiki.
- Case generation: this tells the generator to randomly choose between an uppercase and a lowercase first letter for the unit or reference on generation. This modifier is denoted by an ampersand `&` placed right after the opening bracket `[` in unit declarations or sub-rules.

  For example, `[&hello]` will generate `hello` 50% of the time and `Hello` the rest of the time.
- Random generation: adding a question mark `?` right before the closing bracket `]` of a sub-rule tells the generator to randomly decide whether it should ignore this sub-rule or not, and thus whether it should generate a string or nothing. Except for choices, you can give an identifier for this random generation right after the question mark; all the sub-rules that share this identifier will either be generated together or not be generated at all (rather than some of them being generated and others not).

  For example, the rule `hey [you?]` will generate `hey` 50% of the time and `hey you` the rest of the time; the rule `Hi [I'm a?rand name] pretty [test?rand name]` will generate `Hi I'm a pretty test` 50% of the time and `Hi pretty` the rest of the time, but never `Hi I'm a pretty` or `Hi pretty test`.
- Variation naming: adding a hash sign `#` followed by a name after the unit identifier tells the parser that you are making a flavor (or variation) of a unit. This unit variation can be referenced in rules as a normal unit, but you can also reference the unit without flavor, which will refer to all the variations of that unit. In English, this is typically used to make a singular and a plural flavor of the same alias, and to use the unit without flavor in places where singular or plural doesn't matter.

  For example, if the following aliases are defined:

      ~[alias#singular]
          alias
      ~[alias#plural]
          aliases

  the sub-rule `~[alias#singular]` will generate `alias`, `~[alias#plural]` generates `aliases` and `~[alias]` will generate `alias` 50% of the time and `aliases` the rest of the time.
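The case and random generation modifiers can be mimicked with a couple of coin flips. The following sketch is not how Chatette implements them: the function names and the `(text, identifier)` representation of a rule are hypothetical, and a 50% probability is assumed everywhere.

```python
import random


def apply_case_modifier(text, rng):
    """Mimic `[&text]`: randomly pick an uppercase or lowercase first letter."""
    first = text[0].upper() if rng.random() < 0.5 else text[0].lower()
    return first + text[1:]


def generate_with_optional_parts(parts, rng):
    """Mimic `[text?identifier]`: parts sharing an identifier appear together.

    `parts` is a list of `(text, identifier)` pairs; an identifier of `None`
    means the part is always generated.
    """
    decisions = {}  # one coin flip per identifier, reused by all its parts
    generated = []
    for text, identifier in parts:
        if identifier is None:
            generated.append(text)
            continue
        if identifier not in decisions:
            decisions[identifier] = rng.random() < 0.5
        if decisions[identifier]:
            generated.append(text)
    return " ".join(generated)


rng = random.Random()
# Hi [I'm a?rand name] pretty [test?rand name]
rule = [("Hi", None), ("I'm a", "rand name"), ("pretty", None), ("test", "rand name")]
print(generate_with_optional_parts(rule, rng))
print(apply_case_modifier("hello", rng))
```

Because the decision is stored per identifier, the sketch can only produce `Hi I'm a pretty test` or `Hi pretty` for the rule above.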
To run Chatette on a template file, simply run the following command:
python -m chatette <path-to-template-file>
The generated examples will be put in a newly created file in `output/train/output.json` (or several files in the same folder if the output is really large). If you want the output to be put somewhere else, run the following command:
python -m chatette <path-to-template-file> -o <path-to-output-directory>
Other flags are available and are described in this wiki.
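If you prefer to launch the generation from a Python script, the commands above can be wrapped with the standard `subprocess` module. This is a minimal sketch that only uses the `-o` flag shown above; the function names are hypothetical, and it assumes Chatette is installed for the interpreter running the script.

```python
import subprocess
import sys


def build_chatette_command(template_path, output_dir=None):
    """Build the argument list for `python -m chatette <template> [-o <dir>]`."""
    command = [sys.executable, "-m", "chatette", template_path]
    if output_dir is not None:
        command += ["-o", output_dir]
    return command


def run_chatette(template_path, output_dir=None):
    """Run Chatette in a subprocess and raise an error if it fails."""
    subprocess.run(build_chatette_command(template_path, output_dir), check=True)


# Shows the command without running it; the paths are placeholders.
print(build_chatette_command("path/to/template.chatette", "path/to/output"))
```

Using `sys.executable` keeps the call on the same Python interpreter (and virtual environment) as the calling script.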