Devnotes
Since I started on this project, I have been trying to find a good source for the dictionary part of the project. This has proved a little challenging, since pretty much all that is available is the PDF of the Indonesian dictionary, which yields badly marked-up data when converted to a parsable format. I used a Win32 program named pdf2html to convert the PDF to XML. There are great inconsistencies in the PDF. Other Indonesian programmers have previously attempted to parse the dictionary, but it seems they missed the badly marked-up entries. One example is the secondary definition of bawang, meaning an artificial river. In the PDF this entry is badly mangled, although it looks fine visually. The official Indonesian dictionary website does not have an entry for this definition, so it seems that their parser didn't understand the entry and nobody noticed that it was missing.
This project is determined to represent the printed 4th edition of the Indonesian dictionary accurately, and it seems that I will have to apply some elbow grease and manually edit the XML generated from the PDF so that it produces a good database.
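To narrow down where manual editing is needed, a small script could flag lines whose markup looks suspicious. The following is only a rough sketch, not the tooling this project uses. It assumes the converter emits `<page>` and `<text>` elements with headwords wrapped in `<b>`, and it assumes one particular kind of mangling: a bold run that ends in the middle of a word. The element names, the sample data, and the `flag_suspect_lines` helper are all hypothetical, and the real XML may differ.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the definitions are placeholders, not dictionary text.
# The element layout is an assumption about the converter output.
SAMPLE = """<pdf2xml>
  <page number="1">
    <text top="100" left="50" font="1"><b>bawang</b> n contoh satu</text>
    <text top="120" left="50" font="1"><b>ba</b>wang n contoh dua</text>
    <text top="140" left="50" font="1"><b>bawasir</b> n contoh tiga</text>
  </page>
</pdf2xml>"""


def flag_suspect_lines(xml_text):
    """Return (page number, line text) pairs whose bold run ends mid-word."""
    root = ET.fromstring(xml_text)
    suspects = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            bold = text.find("b")
            if bold is None:
                continue  # no bold run on this line, nothing to check
            tail = bold.tail or ""
            # A complete headword should be followed by whitespace or
            # punctuation. A letter right after </b> suggests the bold
            # markup was split in the middle of the word.
            if tail[:1].isalpha():
                line = "".join(text.itertext())
                suspects.append((page.get("number"), line))
    return suspects


if __name__ == "__main__":
    for page_number, line in flag_suspect_lines(SAMPLE):
        print(f"page {page_number}: {line}")
```

A check like this only catches one failure pattern, so it would complement the manual pass over the XML rather than replace it.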