@
% Transliteration tool
% R. Shreevatsa
% 2010
This document is a revisionist history of how I wrote a program.
A couple of years after I first wrote the [Sanskrit transliterator][1],
I went back and wanted to modify it. I couldn't easily understand what
the program was doing (despite good comments), and I realised this is
exactly what literate programming is for.
[1]: http://web.mit.edu/vatsa/www/sanskrit/transliterate.html
Introduction
============
The goal is to have a webpage which will convert, on-the-fly, between
various scripts. To take the example of Sanskrit,
------------------------------------------  --------------------
**Devanagari**                              हरे राम हरे कृष्ण
**IAST** (diacritics, case-insensitive)     hare rāma hare kṛṣṇa
**Harvard-Kyoto** (ASCII, case-sensitive)   hare rAma hare kRSNa
----------------------------------------------------------------
There are other transliteration schemes and variants, e.g.
--------------------------------  ------------------------
**ITRANS** (a mongrel hybrid):    hare raama hare kRRiShNa
**Velthuis** (used for TeX):      hare raama hare k.r.s.na
**IAST uppercase** (diacritics):  HARE RĀMA HARE KṚṢṆA
----------------------------------------------------------
and possibly other scripts and languages we may want to add later.
Thus, probably
Unicode representation
======================
(Skip this section if you are already familiar with how Devanagari
letters are represented in Unicode.)
The Unicode representations of Harvard-Kyoto and IAST are natural enough:
each letter is a separate glyph and corresponds to a particular *code
point* (henceforth "codepoint", incorrectly). For instance, **a** is
U+0061, **A** is U+0041, **ā** is U+0101 ("LATIN SMALL LETTER A WITH
MACRON"), **ṛ** is U+1E5B ("LATIN SMALL LETTER R WITH DOT BELOW") etc.
So "hare rāma" corresponds to nine codepoints (including 0020 SPACE):
    0068 0061 0072 0065 0020 0072 0101 006D 0061
    h    a    r    e         r    ā    m    a
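
The listing above can be reproduced with a few lines of JavaScript. This is only a minimal sketch: the helper name `codePoints` is made up for illustration and is not part of the transliterator.

```javascript
// List the code points of a string as four-digit uppercase hex numbers.
// (codePoints is a hypothetical helper, used only for this illustration.)
function codePoints(text) {
  // Array.from iterates by code point, not by UTF-16 code unit.
  return Array.from(text, function (ch) {
    return ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  });
}

// "hare rāma", with ā written as the escape \u0101 to avoid encoding surprises.
console.log(codePoints('hare r\u0101ma').join(' '));
// prints: 0068 0061 0072 0065 0020 0072 0101 006D 0061
```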
For Devanagari, the correspondence is less obvious, and is as follows.
There is a codepoint for each *consonant* (with the "inherent a" vowel included):
0915 क DEVANAGARI LETTER KA
0916 ख DEVANAGARI LETTER KHA
0917 ग DEVANAGARI LETTER GA
0918 घ DEVANAGARI LETTER GHA
etc.
There is a codepoint for each *vowel sign*:
093E DEVANAGARI VOWEL SIGN AA
093F DEVANAGARI VOWEL SIGN I
0940 DEVANAGARI VOWEL SIGN II
0941 DEVANAGARI VOWEL SIGN U
0942 DEVANAGARI VOWEL SIGN UU
etc.
So each consonant+vowel glyph, where the vowel is not 'a', corresponds to
two codepoints:
का 0915 093E
खा 0916 093E
खि 0916 093F
गि 0917 093F
घि 0918 093F
etc.
Consonants without vowels, e.g. in conjuncts, are represented by [consonant] followed by [ ्] ("094D: DEVANAGARI SIGN VIRAMA"). Thus:
ख् 0916 094D (DEVANAGARI LETTER KHA, DEVANAGARI SIGN VIRAMA)
मुख्य 092E 0941 0916 094D 092F (DEVANAGARI LETTER MA, DEVANAGARI VOWEL SIGN U, DEVANAGARI LETTER KHA, DEVANAGARI SIGN VIRAMA, DEVANAGARI LETTER YA)
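
To make this concrete, here is a small sketch that builds the examples above from their individual codepoints. The constant names and the `hex` helper are chosen for this illustration only; the codepoint values are the ones listed in this section.

```javascript
// Codepoints taken from the listings in this section.
const MA = '\u092E';       // DEVANAGARI LETTER MA
const KHA = '\u0916';      // DEVANAGARI LETTER KHA
const YA = '\u092F';       // DEVANAGARI LETTER YA
const SIGN_AA = '\u093E';  // DEVANAGARI VOWEL SIGN AA
const SIGN_U = '\u0941';   // DEVANAGARI VOWEL SIGN U
const VIRAMA = '\u094D';   // DEVANAGARI SIGN VIRAMA

// A consonant followed by a vowel sign is one glyph but two codepoints.
const khaa = KHA + SIGN_AA;

// The word "mukhya": MA + U sign, then KHA + VIRAMA (no vowel), then YA.
const mukhya = MA + SIGN_U + KHA + VIRAMA + YA;

// Hypothetical helper: show a string as space-separated hex codepoints.
function hex(text) {
  return Array.from(text, function (ch) {
    return ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  }).join(' ');
}

console.log(khaa.length, hex(khaa));     // prints: 2 0916 093E
console.log(mukhya.length, hex(mukhya)); // prints: 5 092E 0941 0916 094D 092F
```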
And vowels that do not follow a consonant (i.e. are at the beginning of words) take different forms in Devanagari, and correspond to different codepoints:
------  --------------  ----------------------------------------
अ       0905            DEVANAGARI LETTER A
आ       0906            DEVANAGARI LETTER AA (different from 093E DEVANAGARI VOWEL SIGN AA)
इ       0907            DEVANAGARI LETTER I
ई       0908            DEVANAGARI LETTER II
उं      0909 0902       DEVANAGARI LETTER U, DEVANAGARI SIGN ANUSVARA
उः      0909 0903       DEVANAGARI LETTER U, DEVANAGARI SIGN VISARGA
कं      0915 0902       DEVANAGARI LETTER KA, DEVANAGARI SIGN ANUSVARA
कः      0915 0903       DEVANAGARI LETTER KA, DEVANAGARI SIGN VISARGA
कैः     0915 0948 0903  DEVANAGARI LETTER KA, DEVANAGARI VOWEL SIGN AI, DEVANAGARI SIGN VISARGA
----------------------------------------------------------------
You can also play around with R. Ishida's [Devanagari character
picker][2] to get an idea.
[2]: http://people.w3.org/rishida/scripts/pickers/devanagari/
Internal representation
=======================
The Unicode conventions described in the section above may correspond
to the way writers of Devanagari think of the letters, but they are
irritating to deal with, since we need to treat the vowel **a** as a
special case. That special case does not correspond to the notation
used by any of the Latin-script schemes (IAST, Harvard-Kyoto, etc.), so
it would make incremental transliteration difficult. Therefore, I
instead made the somewhat strange choice of using the following
"canonical"/intermediate representation of the Indian script
internally.
It is a string of Devanagari (or other Indian script) characters that
roughly corresponds to the IAST representation, except for the fact
that consonants get *two characters* (e.g. KA + VIRAMA). Thus "ki"
is represented as KA + VIRAMA + I, and so on. This removes
1. the need for special-casing the vowel 'a',
2. the need for keeping track of whether the bare vowel or matra sign
is intended.
Devanagari can be converted to this intermediate representation in
quite a straightforward way, and this intermediate representation can
be converted to Devanagari as well.
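
The two conversions might be sketched roughly as follows. This is not the program's actual code. It assumes that the extra character attached to every consonant is the virama (U+094D), that vowels are stored as independent vowel letters, and that only a subset of the vowels needs handling. The names `toIntermediate`, `fromIntermediate`, `SIGN_OF` and `LETTER_OF` are invented for this illustration.

```javascript
const VIRAMA = '\u094D';    // assumed consonant marker in the intermediate form
const LETTER_A = '\u0905';  // DEVANAGARI LETTER A

// Independent vowel letter -> dependent vowel sign (a subset, for illustration).
const SIGN_OF = {
  '\u0906': '\u093E', // AA
  '\u0907': '\u093F', // I
  '\u0908': '\u0940', // II
  '\u0909': '\u0941', // U
  '\u090A': '\u0942', // UU
  '\u090B': '\u0943', // VOCALIC R
  '\u090F': '\u0947', // E
  '\u0910': '\u0948', // AI
  '\u0913': '\u094B', // O
  '\u0914': '\u094C'  // AU
};

// Reverse table: dependent vowel sign -> independent vowel letter.
const LETTER_OF = {};
Object.keys(SIGN_OF).forEach(function (letter) {
  LETTER_OF[SIGN_OF[letter]] = letter;
});

// Consonants KA..HA occupy U+0915..U+0939.
function isConsonant(ch) {
  return ch >= '\u0915' && ch <= '\u0939';
}

// Devanagari -> intermediate: every consonant becomes consonant + VIRAMA,
// and every vowel (including the inherent 'a') becomes an explicit letter.
function toIntermediate(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (!isConsonant(ch)) { out += ch; continue; }
    const next = text[i + 1];
    if (next === VIRAMA) {
      out += ch + VIRAMA;                    // bare consonant: unchanged
      i++;
    } else if (LETTER_OF[next] !== undefined) {
      out += ch + VIRAMA + LETTER_OF[next];  // vowel sign -> vowel letter
      i++;
    } else {
      out += ch + VIRAMA + LETTER_A;         // inherent 'a' made explicit
    }
  }
  return out;
}

// Intermediate -> Devanagari: the inverse of the three cases above.
function fromIntermediate(text) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (!(isConsonant(ch) && text[i + 1] === VIRAMA)) { out += ch; continue; }
    const vowel = text[i + 2];
    if (vowel === LETTER_A) {
      out += ch;                   // 'a' is inherent in the consonant
      i += 2;
    } else if (SIGN_OF[vowel] !== undefined) {
      out += ch + SIGN_OF[vowel];  // vowel letter -> vowel sign
      i += 2;
    } else {
      out += ch + VIRAMA;          // consonant with no vowel
      i += 1;
    }
  }
  return out;
}
```

Under these assumptions, "ki" (0915 093F) becomes 0915 094D 0907, and ordinary words convert back to exactly the original string. The sketch is not lossless for unusual input in which an explicit virama is directly followed by an independent vowel letter.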
Does this intermediate representation really help, or does it simply
add an unnecessary layer of complexity? Well, it seems to help keep the
transliteration part straightforward.
<<*>>=
/* Do the conversion */
alert('Done');
@