This tiny utility converts StarDict dictionaries to ODict XML (ODXML), which can then be compiled to a binary ODict dictionary using the ODict compiler.
The StarDict file format is rather convoluted and requires multiple, that's right, multiple files in order to run. It's also pretty poorly documented. The only documentation you'll find about it is scattered across the internet, but here is a pretty good explanation linked to from the repo this code was based on (pfff like we'd really write a StarDict parser ourselves... please).
Using the CLI is dead simple. There's two ways you can use it:
By simply providing the path to a StarDict archive, the CLI will return an example HTML entry stored in that file that will help you in crafting the regex for conversion. Here's an example:
$ stardict2odict.jar ./babylon-korean-english.tar.bz2
Processing dictionary file...
Found example entry HTML: <font color="blue">adj.</font> aspen
As you can see, the definitions for this dictionary generally contain the part of speech abbreviation inside a <font>
tag, while the actual definitions are listed after the <font>
tag is closed. This may result in the following regex:
<font color="blue">([a-z]+?)\.</font>\s*(.*?)$
This will match the lowercase part of speech abbreviation within the font tag, as well as the definition list.
By providing two arguments to the CLI, you can convert a StarDict archive into ODXML. This conversion requires a
configuration YAML file in order to work. The configuration consists of four primary keys: name
(the name of the
output dictionary), delimiter
(the delimiter between definitions... usually either a comma or semicolon), regex
(
the regex string used to match groups in entries), and match
(a set of rules for matching attributes to groups,
see below).
Currently, the match
rules support the pos
(part of speech), definition
, alt
(alternative form), and
romanization
keys.
An example configuration file might look like:
# config.yml
name: Babylon Chinese - English
delimiter: ,
regex: <font color="blue">([a-z]+?)\.</font>\s*\(<I>\w+?</I>=(.*?)<I>,\s*Pinyin</I>=(.*?)\)\s*(.*?)
match: # Each number corresponds to a 1-based group number in the above regex
pos: 1
alt: 2
romanization: 3
definition: 4
Now, to convert the file to ODXML using the following command:
$ stardict2odict.jar ./zh_en.tar.bz2 ./zh_en.xml --config config.yml