Version 1.3
k-sl committed Oct 28, 2017
1 parent 15a624d commit 6cb6abb
Showing 6 changed files with 17 additions and 10 deletions.
Binary file not shown.
9 changes: 7 additions & 2 deletions CHANGELOG.md
@@ -1,9 +1,14 @@
# Change Log

## v. 1.0
First Version
## v. 1.2
* Convert three dots (...) to the actual ellipsis character (…) to avoid mistaking some words as abbreviations.
* Added abbreviations *sb* (somebody) and *sth* (something).
* Changed default output filename to `CC-CEDICT_publishingdate-converterversion`, hopefully making it clearer.

## v. 1.1
* Added arguments to define input and output files.
* Added progress bars.
* Added option of automatically downloading the most recent release of CC-CEDICT instead of using a pre-downloaded file.

## v. 1.0
First Version
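
The first v. 1.2 changelog item can be illustrated with a small sketch. It assumes the abbreviation check is the `\b(...)\W` pattern visible in the `cedictxml.py` hunk of this commit, and it uses the `usu.` abbreviation from the converter's list. The sample sentence and the helper names (`looks_like_abbreviation`, `normalize_ellipsis`) are made up for illustration and are not part of the project.

```python
# -*- coding: utf-8 -*-
import re


def looks_like_abbreviation(text, abbreviation=u"usu."):
    """Return True if `abbreviation` is matched as a token followed by a non-word character."""
    pattern = r"\b(" + re.escape(abbreviation) + r")\W"
    return re.search(pattern, text) is not None


def normalize_ellipsis(text):
    """Replace three consecutive dots with the single ellipsis character (U+2026)."""
    return text.replace(u"...", u"\u2026")


# Made-up sample text, not a real CC-CEDICT entry.
sample = u"as usu... and so on"

# Before normalization, "usu." followed by another dot matches the pattern.
print(looks_like_abbreviation(sample))                      # True
# After normalization, "usu" is followed by the ellipsis character, so "usu." no longer matches.
print(looks_like_abbreviation(normalize_ellipsis(sample)))  # False
```

Under these assumptions, a word that happens to end in three dots stops being read as a dotted abbreviation once the dots are collapsed into one character.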
4 changes: 2 additions & 2 deletions README.md
@@ -4,9 +4,9 @@
CedictXML is a simple tool written in Python to convert an original [CC-CEDICT](https://www.mdbg.net/chindict/chindict.php?page=cc-cedict) file to an XML dictionary file in the logical [XDXF format](https://github.com/soshial/xdxf_makedict/blob/master/format_standard/xdxf_description.md), which can be used with dictionary software that supports this format.

## Screenshot
![Screenshot of XDXF CC-CEDICT running on GoldenDict 1.5](https://github.com/k-sl/CedictXML/blob/master/images/screenshot.png)
![Screenshot of XDXF CC-CEDICT opened on GoldenDict 1.5](https://github.com/k-sl/CedictXML/blob/master/images/screenshot.png)

Screenshot of XDXF CC-CEDICT running on [GoldenDict](https://github.com/goldendict/goldendict) 1.5
Screenshot of XDXF CC-CEDICT opened on [GoldenDict](https://github.com/goldendict/goldendict) 1.5

## Dependencies
* _pinyin.py_ from the [pycedict](https://github.com/jdillworth/pycedict/) library
2 changes: 1 addition & 1 deletion TODO.md
@@ -7,6 +7,6 @@
* Add support for Pleco's [CC-Canto](http://cantonese.org/) Cantonese dictionary (as an optional addition).
* Add support for [Pleco's Cantonese readings](http://cantonese.org/download.html) (as an optional addition). (Possibly confusing in the current status of the XDXF standard.)
* Make the abbreviations list of tuples an external file (possibly unnecessary).
* Add not yet recognized abbreviations: "sb", "hon".
* Add not yet recognized abbreviations such as "hon".
* Recognize two internal references in the format "See also: "
* Decide and implement a format for internal references. E.g. 漢字|汉子 hanzi (if alternative writing systems and transliterations aren't added to the XDXF format.)
Binary file added XDXF CC-CEDICT/CC-CEDICT_20171028-1.2.xdxf.zip
Binary file not shown.
12 changes: 7 additions & 5 deletions cedictxml.py
@@ -7,15 +7,14 @@
import argparse
import zipfile
import urllib2
import zipfile
import tempfile
from tqdm import tqdm

from lxml import etree as ET
from pinyin import pinyinize


version = "1.1"
version = "1.2"
dictionaryname = "CC-CEDICT"
currenttime = time.strftime("%d-%m-%Y %H:%M:%S", time.localtime())
dtd_url = "https://raw.github.com/soshial/xdxf_makedict/master/format_standard/xdxf_strict.dtd"
@@ -178,6 +177,8 @@ def dictconvert(dictionaryfile):
("Taiwan pr. [" + entry_taiwan +
"]", ""))
entry_taiwan = pyjoin(entry_taiwan)
# Correct three dots to ellipsis.
entry_translation = entry_translation.replace(u"...", u"…")
# Correct the pinyin and separate the different translations
# into a list.
entry_translation = bracketpy(entry_translation)
@@ -208,7 +209,7 @@ def dictconvert(dictionaryfile):
publishing_date_xdxf = (publishing_date[8:] + "-" +
publishing_date[5:7] + "-" + publishing_date[:5])
global dictionary_version
dictionary_version = "1." + publishing_date.replace("-","")
dictionary_version = publishing_date.replace("-","") + "-" + version
return cedict_dict

def createxdxf(dictionary):
@@ -255,7 +256,8 @@ def createxdxf(dictionary):
("telecom.", "telecommunications", "knl"), ("trad.",
"traditional(ly)","stl"), ("translit.", "transliteration",
"aux"), ("usu.", "usually", "aux"), ("zool.", "zoology",
"knl"), ("zoolog.", "zoology", "knl")]
"knl"), ("zoolog.", "zoology", "knl"), ("sth", "something",
"aux"), ("sb", "somebody", "aux")]
abbrlist = []
for tupple in abbreviations:
abbrlist.append(tupple[0])
@@ -329,7 +331,7 @@ def createxdxf(dictionary):
lexicon_ar_def_def = ET.SubElement(lexicon_ar_def, "def")
# Recognize the abbreviations.
for abbreviation in abbrlist:
abbreviation_re = r"\b(" + re.escape(abbreviation) + r")\W"
abbreviation_re = r"\b(" + re.escape(abbreviation) + r")\W|\b(" + re.escape(abbreviation) + r")$"
if len(re.findall(abbreviation_re,translation)) > 0:
translation = (translation.
replace(abbreviation, "_lt_abbr_mt_" +
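
The `dictionary_version` change in `dictconvert` can be sketched on its own. This assumes the publishing date arrives as a `YYYY-MM-DD` string, which is what the slicing in the hunk suggests but is not confirmed here. The function names `build_dictionary_version` and `default_output_name` are hypothetical, and the `.xdxf` extension is inferred from the binary file added in this commit.

```python
def build_dictionary_version(publishing_date, converter_version):
    """Join the compact publishing date and the converter version, as in the new code."""
    return publishing_date.replace("-", "") + "-" + converter_version


def default_output_name(dictionary_name, publishing_date, converter_version):
    """Hypothetical helper: compose a name in the CC-CEDICT_publishingdate-converterversion form."""
    version = build_dictionary_version(publishing_date, converter_version)
    return dictionary_name + "_" + version + ".xdxf"


print(build_dictionary_version("2017-10-28", "1.2"))          # 20171028-1.2
print(default_output_name("CC-CEDICT", "2017-10-28", "1.2"))  # CC-CEDICT_20171028-1.2.xdxf
```

The result lines up with the archive name added in this commit, `CC-CEDICT_20171028-1.2.xdxf.zip`.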

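The last hunk extends the abbreviation pattern so that it also matches at the end of a translation. A minimal sketch of the difference, using the newly added dotless abbreviation `sb`, follows. The sample strings and the function names are illustrative only.

```python
import re


def old_pattern(abbreviation):
    """Pattern before this commit: the abbreviation must be followed by a non-word character."""
    return r"\b(" + re.escape(abbreviation) + r")\W"


def new_pattern(abbreviation):
    """Pattern after this commit: also accept the abbreviation at the very end of the string."""
    escaped = re.escape(abbreviation)
    return r"\b(" + escaped + r")\W|\b(" + escaped + r")$"


# Made-up translations, not real CC-CEDICT entries.
at_end = "to hit sb"
in_middle = "to give sb a hand"

print(re.search(old_pattern("sb"), at_end) is not None)     # False: nothing follows "sb"
print(re.search(new_pattern("sb"), at_end) is not None)     # True: matched by the "$" branch
print(re.search(new_pattern("sb"), in_middle) is not None)  # True: matched by the "\W" branch
```

A more compact spelling with the same matching behavior would be `\b(sb)(?:\W|$)`. The commit writes the two alternatives out in full instead.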