LogonProcessing_BatchTranslation
As of LOGON 0.5 (`knut'), the default behaviour of the `batch' script has changed slightly: since it now uses [incr tsdb()] facilities internally, it assumes that by default the intention is to process an [incr tsdb()] skeleton. Thus, a command like
$LOGONROOT/batch vei
will operate on the vei skeleton (for which, actually, there is no ASCII fan-out input file). To obtain the original functionality of the batch script, i.e. to process an input file in the ASCII fan-out format (see below), do the following:
$LOGONROOT/batch --ascii $LOGONROOT/ntnu/data/mrs.txt
There will be more to say on the new batch script, but two additional (new) options may be relevant immediately: --count n will parallelize processing and start up n full instantiations of the LOGON pipeline; --limit n will prune the fan-out space to at most n alternatives that, at each stage (i.e. post-analysis and post-transfer), get passed on downstream. The --limit option currently defaults to 25; thus, in order to get full fan-out (if you think you have the cpu cycles :-), use --limit 0.
Note that option processing in the batch script is not very robust; hence, please make sure you get the option syntax exactly right.
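To illustrate (a sketch only: the file name is the one from above, and it is an assumption that these options combine freely in one invocation), the following would process an ASCII fan-out file with four parallel pipeline instances and unpruned fan-out:
# start four parallel LOGON pipelines; --limit 0 disables the default
# pruning of the fan-out space to 25 alternatives per stage
$LOGONROOT/batch --ascii --count 4 --limit 0 $LOGONROOT/ntnu/data/mrs.txt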
Jointly with Torbjørn, we confirmed (August 2004) that the fan-out batch script allows running new data sets through the end-to-end system. We constructed the following input file:
Vi skal møte Ask på mandag.
We shall meet Abrams on Monday.
We should meet Abrams on Monday.

Ta båt til Ortnevik.
Take the boat to Ortnevik.
Take a boat to Ortnevik.

Tar du båten til Ortnevik, kan du gå stien samme dagen.
If you take the boat to Ortnevik, you can walk the path the same day.
If you take the boat to Ortnevik, you can walk the path on the same day.
where the format is a sequence of blocks; each block has the Norwegian input on the first line, followed by zero or more lines with reference translations. Blocks are separated from each other by two consecutive newlines. Since we are running this on Unix, it is important to produce Unix-style linebreaks, i.e. either create the file in a Unix environment itself or make sure the linebreaks are ^J (linefeed) and not ^M (carriage return). Incidentally, we got reasonable coverage on the above baby test file and stellar BLEU scores.
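Should the file have been created on a non-Unix system, carriage returns can be stripped with standard tools before invoking the batch script; a minimal sketch (the file names here are made up for illustration):
# drop ^M (carriage return) characters, keeping only ^J linefeeds
tr -d '\r' < input.dos.txt > input.txt
$LOGONROOT/batch --ascii input.txt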
I completed a mode of running the core demonstrator that exhaustively multiplies out ambiguous outputs from the three processing phases. I put a log of exhaustive batch processing of the tur corpus into CVS as `tur.fan'. For each of the Norwegian sentences in the input file, there will be one block of lines like these:
[17:37:37] (10) |Bergensområdet er tett befolket.| --- 1 (0.24|0.00:0.24 s) <:> () [0].
|
|-[0.39] # 0 --- 2 (0.08|0.00:0.00 s) <:9> (756.4K 10.7M = 11.5M) [0].
| |
| |-[0.92] # 0 --- 4 (0.17|0.00:0.00 s) <:294> {1335:307} (1.6M 53.7M = 55.3M) [1].
| | |the bergen area is densely populated| [1310.78]
| | |the bergen area is populated densely| [10070.39]
| | |the bergen area is populated densely| [10070.39]
| | |the bergen area densely is populated| [14501.42]
| |
| |-[1.49] # 1 --- 8 (0.17|0.00:0.00 s) <:255> {1312:302} (1.6M 47.1M = 48.7M) [1].
| | |the area around bergen is densely populated| [816.89]
| | |the area round bergen is densely populated| [1142.83]
| | |the area around bergen is populated densely| [4690.06]
| | |the area around bergen is populated densely| [4690.06]
| | |the area around bergen densely is populated| [6028.07]
| | |the area round bergen is populated densely| [6561.42]
| | |the area round bergen is populated densely| [6561.42]
| | |the area round bergen densely is populated| [8433.31]
|
|< |Bergensområdet er tett befolket.| (10) --- 9 [12]
|> |the area around bergen is densely populated| [816.9] (0:1:4).
|> |the area round bergen is densely populated| [1142.8] (0:1:6).
|> |the bergen area is densely populated| [1310.8] (0:0:2).
|> |the area around bergen is populated densely| [4690.1] (0:1:0).
|> |the area around bergen densely is populated| [6028.1] (0:1:5).
|> |the area round bergen is populated densely| [6561.4] (0:1:1).
|> |the area round bergen densely is populated| [8433.3] (0:1:7).
|> |the bergen area is populated densely| [10070.4] (0:0:0).
|> |the bergen area densely is populated| [14501.4] (0:0:3).
|= 10:0 of 10 {100.0 0.0}; 10:0 of 10:0 {100.0 0.0}; 9:0 of 10:0 {90.0 0.0} @ 9 of 10 {90.0}.
The first line is the input sentence, followed by --- and the number of readings returned by the analysis grammar. The remaining numbers on that line are timing and memory measures; see the [incr tsdb()] manual. Subsequent lines show the results of running each output, in turn, through downstream components, using the `branch' lines and indentation to indicate the flow of control. The third line states that the first parsing output (# 0) had 2 transfer outputs, of which (in turn) the first gave rise to four generator outputs. Upon successful completion of generation, all realizations are presented, one per line, each followed by its MaxEnt realization ranker score. For each new branch, the initial number in square brackets is the elapsed real time since the start of translating this sentence, i.e. at time [1.49] we started generation from the second transfer output for the first (and only) parsing result.
Once all combinatorics have been explored for one input, there follows a block of summary lines. The first (prefixed by |<) repeats the Norwegian input string, followed by two numbers (9 [12] in this case): there were a total of 12 translations output from all branches, of which 9 are actually distinct strings. Next follow nine lines, ordered by cross-perplexity, presenting the various unique output translations, each followed by an index into the branching process in terms of parse, transfer, and realization output identifiers. Finally, the last line in the example above (prefixed by |=) is a running, accumulated coverage summary for the current input file:
10:0 of 10 {100.0 0.0}; 10:0 of 10:0 {100.0 0.0}; 9:0 of 10:0 {90.0 0.0} @ 9 of 10 {90.0}
All i:j pairs of numbers are in terms of full vs. fragmented analyses, i.e. at this point (translating item # 10 from tur.txt), there were 10 full and 0 fragmented parser outputs, for an analysis coverage of 100%. Following are transfer and generation coverage, each relative to the number of available inputs (full or fragmented) to that component. In the above example, transfer succeeded on all 10 parser outputs, but for one of them we were unable to generate. The final number, following the @ sign, is accumulated end-to-end coverage, i.e. the product of the three individual coverage rates: the proportion of inputs that went all the way through the system successfully.
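To make the arithmetic concrete (a back-of-the-envelope check, not something the batch script itself outputs in this form): the end-to-end figure is just the product of the three per-component coverage rates.
# analysis 10/10, transfer 10/10, generation 9/10: end-to-end coverage
echo '(10 / 10) * (10 / 10) * (9 / 10) * 100' | bc -l    # 90.0 per cent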
The BLEU scoring script is integrated into the fan-out batch script (and in CVS). In the fan-out log, BLEU scores are printed in angle brackets, e.g.
|> |be careful about use of an open fire in the backcountry| [71.6] <0.49> (0:4:1).
Additionally, the running summary lines include the average document-level BLEU score, once averaged over all inputs, once averaged over only those for which the system produced one or more outputs, e.g.
|= 69:21 of 104 {66.3 20.2}; 54:8 of 69:21 {78.3 38.1}; 46:8 of 54:8 {85.2 100.0} @ 54 of 104 {51.9} <0.34 0.66>.
This is to say that, as of the above run (August 2004), we produce outputs for 51.9 per cent of the tur items; our BLEU average over these is 0.66 (pretty good), while over the total set it drops to 0.34, as the 50 items with no system output are counted as a BLEU score of 0 (we could probably increase this score by outputting a selection of high-frequency English function words, e.g. `a the of').
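The two averages are easy to relate, and the relevant lines are easy to pull out of a fan-out log with standard tools; a sketch (assuming the |> and |= prefixes appear at the start of log lines, as in the excerpts above):
grep '^|>' tur.fan                # ranked unique translations, with BLEU scores
grep '^|=' tur.fan                # running coverage and BLEU summaries
echo '54 * 0.66 / 104' | bc -l    # 0.66 scaled by coverage: ~0.34, as reported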