# m100-tokenize

A tokenizer for the TRS-80 Model 100 (AKA "M100") BASIC language. Converts
`.DO` files to `.BA`.

``` bash
tokenize FOO.DO FOO.BA
```

Although the text refers to the "Model 100", this also works for the
Tandy 102, Tandy 200, Kyocera Kyotronic-85, and Olivetti M10, which
all have [identical tokenization](http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file).

_This does not work for the NEC PC-8201/8201A/8300, whose N82 BASIC has
a different tokenization._

## Introduction

The Tandy/Radio Shack Model 100 portable computer can save its BASIC
files in ASCII (plain text) or in a "tokenized" format where the
keywords — such as `FOR`, `IF`, `PRINT`, `REM` — are converted to a
single byte. Not only is this more compact, but it loads much faster.
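
The difference is easy to see from the shell (a sketch; the file names
are hypothetical, and the high-bit token values follow the file-format
page linked above):

``` bash
ls -l PROG.DO PROG.BA      # the tokenized .BA file is typically smaller
hexdump -C PROG.BA | head  # keywords appear as single bytes, 0x80 and up
```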

### The problem

Programs for the Model 100 are generally distributed in ASCII format,
but that has two downsides: ① the user must LOAD and re-SAVE the file
on their machine to tokenize it, since only tokenized BASIC can be run,
and ② the machine may not have enough storage space for the tokenized
version while the ASCII version is also in memory.

### The solution

This program solves that problem by tokenizing on the host computer
before downloading to the Model 100. Additionally, this project
provides a decommenter and cruncher (whitespace remover) to save bytes
in the tokenized output at the expense of readability.

### File extension terminology

Tokenized BASIC files use the extension `.BA`. ASCII-formatted BASIC
files should be given the extension `.DO` so that the Model 100 will
see them as text documents, although people often misuse `.BA` for
ASCII.
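
If you receive an ASCII listing that was misnamed with a `.BA`
extension, rename it before tokenizing (a sketch; `PROG` is a
hypothetical name):

``` bash
file PROG.BA                  # reports plain text for a misnamed listing
mv PROG.BA PROG.DO            # give it the extension the tools expect
tokenize PROG.DO PROG.BA
```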

## Programs in this project

```
Output file 'PROG.BA' already exists. Overwrite [yes/No/rename]? R
Old file renamed to 'PROG.BA~'
```

### Running m100-tokenize and friends manually

#### Soft dependencies

Certain programs should _usually_ be run to process the input before
the final tokenization step, depending upon what is wanted.
m100-sanity is strongly recommended. (See [Abnormal
code](#abnormal-code) below.)

``` mermaid
flowchart LR;
m100-sanity ==> m100-tokenize
m100-sanity ==> m100-jumps
m100-sanity ==> m100-decomment --> m100-crunch --> m100-tokenize
m100-decomment --> m100-tokenize
```

| Programs used | Effect | Same as |
|---------------------------------------------------------------------------------|--------------------------------------------|-------------|
| m100-tokenize | Abnormal code is kept as-is | |
| m100-sanity<br/>m100-tokenize | Identical output to a genuine Model 100 | tokenize |
| m100-sanity<br/>m100-jumps<br/>m100-decomment<br/>m100-tokenize | Saves RAM by removing unnecessary comments | tokenize -d |
| m100-sanity<br/>m100-jumps<br/>m100-decomment<br/>m100-crunch<br/>m100-tokenize | Saves even more RAM by removing whitespace | tokenize -c |
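
Expressed as invocations of the `tokenize` wrapper, the "Same as"
column corresponds to:

``` bash
tokenize    PROG.DO PROG.BA   # m100-sanity, then m100-tokenize
tokenize -d PROG.DO PROG.BA   # adds m100-jumps and m100-decomment
tokenize -c PROG.DO PROG.BA   # adds m100-crunch for maximum savings
```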

#### m100-tokenize synopsis

**m100-tokenize** [ _INPUT.DO_ [ _OUTPUT.BA_ ] ]

Unlike `tokenize`, m100-tokenize never guesses the output filename.
With no files specified, the default is to use stdin and stdout so it
can be used as a filter in a pipeline. The other programs
(m100-sanity, m100-jumps, m100-decomment, and m100-crunch) all have
the same syntax, taking two optional filenames.
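
In other words, the following invocations are equivalent:

``` bash
m100-tokenize INPUT.DO OUTPUT.BA      # both filenames given
m100-tokenize INPUT.DO > OUTPUT.BA    # input file, stdout redirected
m100-tokenize < INPUT.DO > OUTPUT.BA  # pure filter: stdin to stdout
```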

#### Example usage of m100-tokenize

When running m100-tokenize by hand, process the input through the
`m100-sanity` script first to correct possibly ill-formed BASIC source
code.

``` bash
m100-sanity INPUT.DO | m100-tokenize > OUTPUT.BA
```

The above example is equivalent to running `tokenize INPUT.DO
OUTPUT.BA`.

#### Example usage with decommenting

The m100-decomment program needs help from the m100-jumps program to
know when it should not completely remove a commented-out line. For
example:

``` BASIC
10 REM This line would normally be removed
20 GOTO 10 ' ... but now line 10 should be kept.
```

So, first, we capture the list of line numbers that must be kept in
the variable `$jumps`, and then we call m100-decomment, passing that
list on the command line.

``` bash
jumps=$(m100-sanity INPUT.DO | m100-jumps)
m100-sanity INPUT.DO |
  m100-decomment - - $jumps |
  m100-tokenize > OUTPUT.BA
```

The above example is equivalent to running `tokenize -d INPUT.DO
OUTPUT.BA`.
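
To see exactly which line numbers m100-decomment is being told to
keep, inspect the intermediate list by itself; for the two-line
example above it should print just `10`:

``` bash
m100-sanity INPUT.DO | m100-jumps   # e.g. prints: 10
```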

Note that m100-decomment keeps the entire text of comments on lines
listed by m100-jumps, on the presumption that, as targets of GOTO or
GOSUB, they are the most valuable remarks in the program. (This
behaviour may change in the future.)

Example output after decommenting but before tokenizing:
``` BASIC
10 REM This line would normally be removed
20 GOTO 10
```

#### Example usage with crunching

The m100-crunch program removes all optional whitespace and some other
optional characters, such as a double-quote at the end of a line or a
colon before an apostrophe. It also completely removes the text of any
comments that m100-decomment preserved from the m100-jumps list. In
short, it makes the program extremely hard to read, but it does save a
few more bytes of RAM.

``` bash
jumps=$(m100-sanity INPUT.DO | m100-jumps)
m100-sanity INPUT.DO |
  m100-decomment - - $jumps |
  m100-crunch |
  m100-tokenize > OUTPUT.BA
```

The above example is equivalent to running `tokenize -c INPUT.DO
OUTPUT.BA`.

Example output after crunching but before tokenizing:

``` BASIC
10REM
20GOTO10
```
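
To measure how many bytes each stage saves on a given program, the
pipelines can be compared directly (a sketch; `wc -c` simply counts
the bytes of tokenized output):

``` bash
jumps=$(m100-sanity INPUT.DO | m100-jumps)
m100-sanity INPUT.DO | m100-tokenize | wc -c
m100-sanity INPUT.DO | m100-decomment - - $jumps | m100-tokenize | wc -c
m100-sanity INPUT.DO | m100-decomment - - $jumps | m100-crunch |
  m100-tokenize | wc -c
```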

### An obscure note about stdout stream rewinding

After finishing tokenizing, m100-tokenize rewinds the output
file in order to correct the **PL PH** line pointers. Rewinding is
not possible when the output is piped to another program, so the line
pointers will all contain "\*\*" (0x2A2A). This does not matter
for a genuine Model T computer which ignores **PL PH** in a file,
but some emulators are known to be persnickety and balk.

If you find this to be a problem, please file an issue, as it is
potentially correctable using `open_memstream()`, but hackerb9 does
not see the need.
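
Note that only a true pipe triggers the "\*\*" fallback; a shell
redirection to a regular file is seekable, so the pointers are
corrected (a sketch, assuming the rewind is an ordinary seek on the
output stream):

``` bash
m100-sanity IN.DO | m100-tokenize > OUT.BA        # stdout is a file: pointers fixed
m100-sanity IN.DO | m100-tokenize | cat > OUT.BA  # stdout is a pipe: "**" pointers
```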

## Machine compatibility

M100 BASIC.)

This program is written in
[Flex](https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/handouts/050%20Flex%20In%20A%20Nutshell.pdf),
a lexical analyzer generator, because it made implementation trivial.
The tokenizer itself, m100-tokenize, is mostly just a table of
keywords and the corresponding byte they should emit. Flex handles
special cases, like quoted strings and REMarks, easily.

The downside is that one must have flex installed to _modify_ the
tokenizer. Flex is _not_ necessary to compile and run, as flex
actually generates portable C code. See `tokenize-cfiles.tar.gz` in
the GitHub release or run `make cfiles`.
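
A build sketch under those assumptions (the plain `make` step that
compiles the pregenerated C is an assumption, not documented above):

``` bash
# Without flex: compile the pregenerated C from the release tarball.
tar xzf tokenize-cfiles.tar.gz && make
# With flex installed: regenerate the C after modifying the lexer.
make cfiles
```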

## Abnormal code

The `tokenize` script always uses the m100-sanity program to clean up
the source code, but one can run m100-tokenize directly to
purposefully create abnormal, but valid, `.BA` files. These programs
cannot be created on genuine hardware, but **will** run.
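
For instance, feeding m100-tokenize out-of-order line numbers directly
(skipping m100-sanity, which would sort them) yields such a file; the
file names here are hypothetical:

``` bash
printf '20 GOTO 10\n10 PRINT "HI"\n' > WEIRD.DO
m100-tokenize WEIRD.DO WEIRD.BA   # line 20 is tokenized before line 10
```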

Here is an extreme example.

which was created using m100-tokenize.

</details>


## Miscellaneous notes

portable computer and will cause others to crash badly, possibly
losing files. To avoid this, some filename extensions are used:

* `.100` An ASCII BASIC file that includes POKEs or CALLs specific
  to the Model 100/102.
* `.200` An ASCII BASIC file specific to the Tandy 200.
* `.BA1` A tokenized BASIC file specific to the Model 100/102.
* `.BA2` A tokenized BASIC file specific to the Tandy 200.
* The `.BA0` and `.NEC.BA` extensions signify a tokenized BASIC file
specific to the NEC portables. This is a different tokenization
format than any of the above and is not yet supported.
``` BASIC
save "FOO", A
```


## Testing

Run `make check` to try out the tokenizer on some [sample Model 100
programs](https://github.com/hackerb9/tokenize/tree/main/samples) and
some strange ones designed specifically to exercise peculiar syntax.
The program `bacmp` is used to compare the generated .BA file with one
created on hackerb9's Tandy 200.

Note that without m100-sanity, the SCRAMB.DO test, whose input is
scrambled and redundant, would fail.
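
To run the suite, or to compare two tokenizations by hand (the direct
`bacmp` invocation shown is an assumption based on its description
above):

``` bash
make check                    # tokenize the samples and compare results
bacmp SAMPLE.BA REFERENCE.BA  # assumed usage: compare two .BA files
```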
has followed suit.

## More information

* Hackerb9 has documented the file format of tokenized BASIC at
http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file

## Alternatives
