Skip to content

Converts StarDict dictionaries to valid ODict XML (ODXML) 🌟 📖

Notifications You must be signed in to change notification settings

TheOpenDictionary/stardict2odict

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

stardict2odict

This tiny utility converts StarDict dictionaries to ODict XML (ODXML), which can then be compiled to a binary ODict dictionary using the ODict compiler.

The StarDict file format is rather convoluted and requires multiple, that's right, multiple files in order to run. It's also pretty poorly documented. The only documentation you'll find about it is scattered across the internet, but here is a pretty good explanation linked to from the repo this code was based on (pfff like we'd really write a StarDict parser ourselves... please).

How to Use

Using the CLI is dead simple. There's two ways you can use it:

1. With a Single Argument

By simply providing the path to a StarDict archive, the CLI will return an example HTML entry stored in that file that will help you in crafting the regex for conversion. Here's an example:

$ stardict2odict.jar ./babylon-korean-english.tar.bz2
Processing dictionary file...
Found example entry HTML: <font color="blue">adj.</font> aspen

As you can see, the definitions for this dictionary generally contain the part of speech abbreviation inside a <font> tag, while the actual definitions are listed after the <font> tag is closed. This may result in the following regex:

<font color="blue">([a-z]+?)\.</font>\s*(.*?)$

This will match the lowercase part of speech abbreviation within the font tag, as well as the definition list.

2. With Two Arguments

By providing two arguments to the CLI, you can convert a StarDict archive into ODXML. This conversion requires a configuration YAML file in order to work. The configuration consists of four primary keys: name (the name of the output dictionary), delimiter (the delimiter between definitions... usually either a comma or semicolon), regex ( the regex string used to match groups in entries), and match (a set of rules for matching attributes to groups, see below).

Currently, the match rules support the pos (part of speech), definition, alt (alternative form), and romanization keys.

An example configuration file might look like:

# config.yml
name: Babylon Chinese - English
delimiter: ,
regex: <font color="blue">([a-z]+?)\.</font>\s*\(<I>\w+?</I>=(.*?)<I>,\s*Pinyin</I>=(.*?)\)\s*(.*?)
match: # Each number corresponds to a 1-based group number in the above regex
  pos: 1 
  alt: 2
  romanization: 3
  definition: 4

Now, to convert the file to ODXML using the following command:

$ stardict2odict.jar ./zh_en.tar.bz2 ./zh_en.xml --config config.yml 

Releases

No releases published

Packages

No packages published

Languages

  • Java 88.6%
  • Python 11.4%