SimGus edited this page Oct 29, 2019 · 10 revisions

As we explained, units are the places where rules live. All rules must be placed within a unit declaration.

As a reminder, a unit declaration is made of a declaration initiator on one line, followed by rules on one or several lines. The initiator line has to be unindented and start with a special character that depends on the type of unit being declared. That character is followed by an identifier for the unit (and possibly generation modifiers), surrounded by brackets ([ and ]).

As explained when we talked about sub-rules, we can refer to a unit by writing the special character for the type of the unit we are referring to, followed by the unit's name surrounded by brackets ([ and ]). As one could expect, it is not permitted to refer to the unit itself within its own declaration.

Some special characters carry information for the parser and the generator, and thus cannot be used as-is inside a unit identifier. If you really need to use one of them, you will have to escape it using a backslash \ (e.g. use \? instead of ?). This is also true for unit references. Here is an exhaustive list of those special characters:

Character name     Symbol
Question mark      ?
Slash              /
Double slashes     //
Semi-colon         ;
Hashtag            #
Dollar sign        $
Ampersand          &
Square brackets    [ and ]
Backslash          \

Other symbols (including whitespace) should not be problematic in unit names. If you're not sure whether you can use a certain symbol, escape it: single backslashes are always removed upon generation.
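To make the escaping rule concrete, here is a minimal Python sketch that escapes every character from the table above and removes the single backslashes again. The names escape_unit_name and strip_escapes are hypothetical helpers written for this page; they are not part of the tool itself.

```python
# Characters listed in the table above; "//" is covered by escaping each "/".
SPECIAL_CHARS = set("?/;#$&[]\\")


def escape_unit_name(name: str) -> str:
    """Prefix every special character in `name` with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in name)


def strip_escapes(text: str) -> str:
    """Drop single escaping backslashes, keeping the character that follows."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # keep only the escaped character
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(escape_unit_name("how are you?"))  # how are you\?
```

Escaping a name and then stripping the escapes gives back the original name, which mirrors what happens between the template file and the generated output.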

We will now go through every type of unit that exists and explain what each one is used for.

Types of units

Alias

An alias is the most basic unit type: a set of rules that can be referred to. On generation of a reference to such a unit, one of the rules is chosen and generates a string.

The special character that represents aliases is the tilde ~.

Here is an example of such a unit declaration:

~[greetings]
   hello
   hi
   howdy

In a rule, this alias would be referred to using ~[greetings]. This reference would thus generate hello, hi or howdy (chosen at random).
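Conceptually, generating an alias reference boils down to picking one rule. The following Python sketch illustrates this, assuming a uniform random choice and a hypothetical dictionary ALIASES standing in for the parsed declaration; it is not the generator's actual code.

```python
import random

# Hypothetical in-memory form of the ~[greetings] declaration above.
ALIASES = {"greetings": ["hello", "hi", "howdy"]}


def generate_alias(name: str, rng: random.Random) -> str:
    """Return the text of one rule of the alias, picked at random."""
    return rng.choice(ALIASES[name])


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(generate_alias("greetings", rng))
```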

Slot

A slot works in exactly the same way as an alias. The only difference comes when the output file is written: the strings that were generated by slot references will be marked as entities (see Rasa NLU documentation for more information). In a nutshell, an entity is a part of an example whose value might trigger different actions/answers from the produced chatbot.

The special character representing this type of unit is the at sign @.

Here is an example of a slot declaration:

@[browser]
   Firefox
   Chrome/Chromium
   Safari
   Edge/Internet explorer
   Opera
   another browser

This slot would be referred to as @[browser] inside rules, for example in the rule I use @[browser]. (ending with a period). If the example sentence I use Firefox. gets generated, Firefox will be marked as the value of an entity called browser. For example, for Rasa NLU, the following JSON array will be part of the object representing this example:

"entities": [
          {
            "end": 13,
            "entity": "browser",
            "start": 6,
            "value": "Firefox"
          }
        ]
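The start and end fields are character offsets into the generated sentence. As a rough illustration, the Python sketch below recomputes them for the example above; the function annotate and its parameters are hypothetical names chosen for this page.

```python
def annotate(prefix: str, slot_name: str, value: str, suffix: str = "") -> dict:
    """Build an example dict with the character offsets of the slot's text."""
    start = len(prefix)
    end = start + len(value)  # end offset is exclusive
    return {
        "text": prefix + value + suffix,
        "entities": [
            {"end": end, "entity": slot_name, "start": start, "value": value}
        ],
    }


example = annotate("I use ", "browser", "Firefox", ".")
print(example["entities"][0])
# {'end': 13, 'entity': 'browser', 'start': 6, 'value': 'Firefox'}
```

Slicing the sentence with those offsets, example["text"][6:13], gives back Firefox.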

If you need several different texts to mean the same value, you will want to use the = syntax: add an equal sign and the entity value at the end of the rule (any whitespace around the equal sign will be ignored). This way, you can have different strings mapping to the same entity value. For example, if in the previous example you wanted Firefox, FireFox and ff to all be mapped to the value Firefox (and thus get the JSON array generated for Rasa NLU as above) whichever string is generated, you would define the slot as follows:

@[browser]
   Firefox
   FireFox = Firefox
   ff=Firefox
   Chrome/Chromium
   Brave browser
   //...

Note that you can achieve the same thing using a choice (see generation rules) as in the following declaration:

@[browser]
   [Firefox|FireFox|ff] = Firefox
   Chrome/Chromium
   Brave browser
   //...

If you use the special value / for a rule (e.g. Firefox = /), the value of the slot when this rule gets selected will be the "name" of the first sub-rule in the rule (even if that sub-rule doesn't generate anything). This "name" is the word itself for a word, the string of words for a word group, and the name of the unit for a unit reference. For a choice, this "name" is the string inside the square brackets, without the modifiers. Therefore, the rule ~[this] is a rule = / inside a slot definition would generate a text such as this is a rule and would have the value this.
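As a rough illustration of the / value, the Python sketch below derives the "name" for the two simplest cases only: a rule starting with a unit reference and a rule starting with a plain word. Word groups, choices and modifiers are deliberately left out, and default_slot_value is a hypothetical helper, not the tool's own function.

```python
import re

# Matches a leading unit reference such as ~[this], @[browser] or %[ask].
UNIT_REF = re.compile(r"[~@%]\[([^\]]*)\]")


def default_slot_value(rule: str) -> str:
    """Return the 'name' of the first sub-rule (unit reference or word only)."""
    rule = rule.strip()
    match = UNIT_REF.match(rule)
    if match:
        return match.group(1)  # the unit's name, without the brackets
    return rule.split()[0]  # the first word itself


print(default_slot_value("~[this] is a rule"))  # this
print(default_slot_value("Firefox"))            # Firefox
```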

If you want an equal sign = to appear in the generated string, you need to escape it with a backslash \, i.e. write \= instead of =.
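The following Python sketch shows one way the = syntax could be read: the rule is split at the first unescaped equal sign, whitespace around it is dropped, and \= is turned back into a literal =. The helper split_slot_rule is hypothetical, and the sketch does not handle an escaped backslash placed right before an equal sign.

```python
import re

# An "=" that is not preceded by a backslash separates the text from the value.
UNESCAPED_EQUAL = re.compile(r"(?<!\\)=")


def split_slot_rule(rule: str):
    """Return (rule_text, entity_value); entity_value is None without '='."""
    parts = UNESCAPED_EQUAL.split(rule, maxsplit=1)
    text = parts[0].strip().replace("\\=", "=")
    value = parts[1].strip() if len(parts) == 2 else None
    return text, value


for rule in ["Firefox", "FireFox = Firefox", "ff=Firefox", "1+1 \\= 2"]:
    print(split_slot_rule(rule))
# ('Firefox', None)
# ('FireFox', 'Firefox')
# ('ff', 'Firefox')
# ('1+1 = 2', None)
```

When no value is given (None here), the generated text itself is used as the entity value, as in the first slot example on this page.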

Slots can be referenced in any rule, except in a rule that is contained in a slot (or that is referenced within a slot's rule), since a slot nested inside another slot would not make sense.

Intent

Intents can be considered to be the "entry points" of the program: an intent specifically asks the program to generate a certain number of examples from its rules. In other words, upon generation, the generator will look for intent declarations and generate the requested number of example sentences for each particular intent.

Intents are not especially meant to be referenced, but if you do reference one, the reference will behave like an alias reference (thus generating a string from one of the intent's rules, chosen at random).

The special character that characterizes this type of unit is the percent symbol %.

The declaration initiator can be followed by a string between parentheses. This string indicates how many examples should be generated, which can be written in several different ways:

  • %[intent name](5)
  • %[intent name]('train': '5')
  • %[intent name]('training': '5')
  • %[intent name]('train': '5', 'test': 3)
  • %[intent name]('training': '5', 'testing': 3)

All the single quotes ' in those strings are optional. The whitespace around the colon is optional as well.

The first three forms all tell the generator to generate 5 examples for the intent intent name. The last two forms ask the generator to generate 5 examples for this intent, plus 3 different examples (if possible) that will be put in another file. The two output files created in this case can be used as the training dataset and the test dataset for an NLU module. As a side note, the training data will be generated in a directory called train in the output directory, while the test data will be in the directory test at the same place.
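To show that the forms listed above are equivalent, here is a small Python sketch that reads the string between the parentheses and returns the two counts. It is not the tool's actual parser: parse_example_counts is a hypothetical name, and defaulting the test count to 0 when it is not given is an assumption.

```python
# "training"/"testing" are accepted as synonyms of "train"/"test".
KEYS = {"train": "train", "training": "train", "test": "test", "testing": "test"}


def parse_example_counts(annotation: str) -> dict:
    """Parse the text between the parentheses of an intent declaration."""
    counts = {"train": 0, "test": 0}  # assumed default: no test examples
    body = annotation.strip().strip("()")
    if ":" not in body:
        # Short form such as (5): only the number of training examples.
        counts["train"] = int(body.strip(" '"))
        return counts
    for pair in body.split(","):
        key, value = pair.split(":")
        # Quotes and surrounding whitespace are optional, so drop them.
        counts[KEYS[key.strip(" '")]] = int(value.strip(" '"))
    return counts


print(parse_example_counts("(5)"))
print(parse_example_counts("('training': '5', 'testing': 3)"))
```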

When the generator generates an example for an intent, the example is marked as having the intent intent name in the output. For example, for output that is meant to be fed to Rasa NLU, "intent": "intent name" would be part of the JSON object describing each example that was generated from this intent declaration.
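Putting the pieces together, the Python sketch below wraps one example in the JSON layout used by Rasa NLU's JSON training data (a rasa_nlu_data object containing a common_examples list). The function to_rasa_nlu is a hypothetical helper, and pairing the sentence I use Firefox. with the intent intent name is only an illustration.

```python
import json


def to_rasa_nlu(examples):
    """Wrap (text, intent, entities) triples in Rasa NLU's JSON layout."""
    return {
        "rasa_nlu_data": {
            "common_examples": [
                {"text": text, "intent": intent, "entities": entities}
                for text, intent, entities in examples
            ]
        }
    }


# Illustrative example only: one sentence with one entity.
entity = {"end": 13, "entity": "browser", "start": 6, "value": "Firefox"}
data = to_rasa_nlu([("I use Firefox.", "intent name", [entity])])
print(json.dumps(data, indent=2))
```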