Experiments to determine the new BibTeX parser formula. The result may be applied to other formats as well in the future.
Above is some pseudo-grammar describing the stages of parsing. The reason I distinguish between parsing the file and parsing values is best demonstrated in the "entry value with mid-command concatenation" test case:
Input:
@book{a,
title = "foo \\copy" # "right{} bar"
}
Output:
[{
type: 'book',
id: 'a',
properties: {
title: 'foo © bar'
}
}]
Note: citationjs-idea
(Citation.js Idea #1) and citationjs-nearley
are skipped
in some tables, as they are only in here for historic reasons.
At the time of starting these experiments, the TokenStack
class was utilized,
together with a simple RegExp that tokenizes commands.
The idea was to explore tokenization without introducing a formal grammar, as
formal grammars introduce extra build steps, runtime dependencies and large swaths
of generated code. However, used as I was to the syntax of PEG.js
and nearley.js
,
I made some unnecessarily complicated features like consumeAnyRule()
, and some
weird loops in the rules. This was partly due to bad tokenization.
Idea #2, currently active in Citation.js has new tokens, a simpler Grammar
class
and simplified rules. It also has more features, including more commands and
diacritics, including more ways to write them.
In parallel to reworking the idea, I used the tokenizer in a nearley.js
grammar,
which failed miserably. This is probably the result of bad grammar-writing on my
part, and not a reflection of the capabilities of nearley.js
. However, an additional
downside of this route is that it introduces an extra build step (nearleyc
) and
a runtime dependency — nearley
itself.
The astrocite-bibtex
package by @dsifford uses PEG.js. It is capable of returning
an AST.
Fiduswriter's biblatex-csl-converter
seems to perform very poorly on the larger
file. However, it does return lossless values, although I am not a fan of the
(lack of) difference between arrays representing single and multiple values:
[
{ type: 'text', text: 'foo' }
]
// vs
[
{ literal: [{ type: 'text', text: 'foo' }] },
{ literal: [{ type: 'text', text: 'bar' }] }
]
This causes testing such as
literal in value[0]
// or more properly
value.every(part => literal in part)
Zotero Translators are relatively hard to use stand-alone, as they depend on a Zotero framework in the global scope. It immediately converts to Zotero API JSON while parsing the syntax. Not shown in the performance table is that Zotero requires initialization, not counting the time it takes to import files, and that this takes relatively long.
Using @retorquere/bibtex-parser
, this performs very well. It is capable of
returning an AST. I have not had a chance to test out all the parser features
for literal/text/name values yet.
JabRef is reference management software with Bib(La)TeX as the internal
representation, so one can assume their support for parsing it to be pretty good.
However, I have not found a way to export their internal representation in the
level of detail required for passing the syntax tests. Similarly, as their program
only partly supports a CLI (no stdio) and is written in Java, a performance
comparison would not be very fair. For now, it is possible to test syntax features
by uncommenting the jabref
entry in test/feature.js
and running
npm run features -- parser jabref
.
citationjs-old | citationjs | astrocite | fiduswriter | zotero | bbt | |
---|---|---|---|---|---|---|
Sync/Async | sync | sync | sync | both | async | both |
AST output | ✘ | ✘ | ✓ | ✘ | ✘ | ✓ |
Lossless schema¹ | ✓ | ✓ | ✓ | ✓ | ✘ | ✓ |
Lossless values | ✘ | ✘ | ✘ | ✓ | ✘ | ✓ |
Error recovery | ✓ | ✘ | ✘ | ✓ | ✘ | ✓ |
¹ specifically the schema used to represent data entries and not value syntax (commands, formatting), and disregarding AST
Empty cells indicate a choice to follow either natbib or biblatex for
certain behavior, this becomes clear from context. If both cells are
empty, this is may be an error, but that should be indicated by a
different test fixture. Auto-generated by npm run features -- fixtures
,
see also the fixture file.
citationjs-old | citationjs | astrocite | fiduswriter | zotero | bbt | |
---|---|---|---|---|---|---|
entry with lowercase type | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry with mixed-case type | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry with uppercase type | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry with parentheses | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry with spacing | ✓ | ✓ | ✘ | ✓ | ✓ | ✓ |
entry with trailing comma | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
string key with colon | ✘ | ✓ | ✘ | ✓ | ✘ | ✓ |
entry key with colon | ✓ | ✓ | ✓ | ✘ | ✘ | ✓ |
entry value with annotation | ||||||
entry label with number | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry label with colon | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry label with double quotes | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry value of quoted string | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry value of braced string | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
entry value of number | ✘¹ | ✓ | ✓ | ✘¹ | ✘¹ | ✓ |
entry value with mid-and concatenation | ✘² | ✓ | ✘² | ✘² | ✘² | ✓ |
entry value with mid-command concatenation | ✘² | ✓ | ✘² | ✘² | ✘² | ✘² |
entry value with sentence-casing (real title) | ✘¹ | ✓ | ✘¹ | ✘¹ | ✘¹ | ✓ |
entry value with sentence-casing (artificial title) | ✘¹ | ✓ | ✘¹ | ✘¹ | ✘¹ | ✘¹ |
entry value with sentence-casing (markup) | ✘¹ | ✓ | ✘¹ | ✘¹ | ✘¹ | ✘¹ |
entry value with sentence-casing (env markup) | ✘¹ | ✓ | ✘¹ | ✘¹ | ✘¹ | ✘¹ |
entry value with markup | ✘¹ | ✓ | ✘¹ | ✘¹ | ✘¹ | ✓ |
entry value with envs | ✘¹ | ✓ | ✘¹ | ✘¹ | ✘¹ | ✘¹ |
entry value with env overrides | ✘¹ | ✓ | ✘¹ | ✘¹ | ✘¹ | ✓ |
entry value with literal names | ✘ | ✓ | ✘ | ✘ | ✘ | ✓ |
entry value with truncated names | ✘¹ | ✘¹ | ✘¹ | ✘¹ | ✘¹ | ✘¹ |
entry value with extended names (biblatex) | ✓ | ✓ | ||||
entry value with verbatim fields | ✓ | ✓ | ✓ | |||
entry value with uri fields | ✓ | |||||
entry value with pre-encoded uri fields | ✓ | ✓ | ✓ | ✓ | ||
entry value with diacritics | ✘ | ✓ | ✘ | ✘ | ✘ | ✓ |
entry value with escapes | ✘ | ✓ | ✘ | ✘ | ✘ | ✘ |
entry value with sub/superscript | ✘ | ✓ | ✘ | ✘ | ✘ | ✓ |
entry value with multi-argument commands | ✘ | ✘ | ✘ | ✘ | ✘ | ✓ |
entry value with verbatim-argument commands | ✘ | ✘ | ✘ | ✘ | ✘ | ✓ |
entry value with unbracketed-argument commands | ✘¹ | ✘¹ | ✘¹ | ✘¹ | ✘¹ | ✓ |
TODO | ||||||
string with lowercase type | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
string with mixed-case type | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
string with uppercase type | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
string with parentheses | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
string value with string | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
string value with concatenated string | ✘ | ✓ | ✘ | ✓ | ✓ | ✓ |
preamble with quoted string | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
preamble with string | ✘ | ✓ | ✘ | ✓ | ✓ | ✓ |
preamble with concatenated string | ✘ | ✓ | ✘ | ✓ | ✓ | ✓ |
comment before entry | ✘ | ✓ | ✓ | ✓ | ✓ | ✓ |
comment around entry (natbib) | ✓ | |||||
comment around entry (biblatex) | ✓ | ✓ | ✓ | ✓ |
¹ undefined representation, actual support may vary
² very unlikely to matter
Data from npm test
, as run on Travis CI.
Init | Time (single entry) | Time (3345 entries) | |
---|---|---|---|
citationjs-old | 0.795ms ± 2.2% | 0.842ms ± 1.6% | 1.97e+3ms ± 5.5% |
citationjs | 4.42ms ± 2.3% | 0.455ms ± 1.5% | 1.20e+3ms ± 11.7% |
astrocite | 1.73ms ± 2.2% | 0.665ms ± 2.3% | 2.17e+3ms ± 7.0% |
fiduswriter | 17.7ms ± 12.5% | 7.52ms ± 24.7% | 1.16e+5ms ± 4.0% |
zotero | 4.16ms ± 12.1% | 2.13ms ± 1.3% | 1.00e+4ms ± 0.8% |
bbt | 21.7ms ± 23.4% | 1.55ms ± 6.5% | 2.27e+4ms ± 5.4% |