Character Sets and Character Encodings.
ASCII: http://ascii-table.com/ 7 bits (128 characters)
Different uses for that remaining bit (code pages):
http://www.i18nguy.com/unicode/codepages.html#msftdos
This kind of worked for a while, as long as documents were used on the same machine
they were created on, and you only had to deal with one language.
As soon as the Internet happened it became quite common to move strings from one
machine to another.
Unicode:
It's an effort to create a character set that includes every writing system on the
planet, including dead and fictitious languages.
A letter is mapped to a number (code point). This is just a concept; it has nothing
to do with its physical representation.
Some of the problems in defining this standard have to do with identifying which
letters in different languages are actually the same.
open U0000.pdf
open U0B80.pdf
http://www.unicode.org/charts/
Hello
U+0048 U+0065 U+006C U+006C U+006F
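A minimal Ruby sketch (1.9 or later, not part of the original notes) that lists the code points of "Hello" in the same U+XXXX notation. The helper name code_points is made up for illustration:

```ruby
# Hypothetical helper: format each code point of a string as U+XXXX.
def code_points(text)
  text.codepoints.map { |cp| format("U+%04X", cp) }
end

puts code_points("Hello").join(" ")
# prints: U+0048 U+0065 U+006C U+006C U+006F
```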
Encodings:
One obvious way of encoding those code points would be using 2 bytes per code point:
00 48 00 65 00 6C 00 6C 00 6F
But this could also be:
48 00 65 00 6C 00 6C 00 6F 00
This is actually UTF-16/UCS-2, and it comes in both big-endian and little-endian variants.
There is a convention to store a Byte Order Mark (BOM) at the beginning of the file:
FE FF (big endian) or FF FE (little endian), but it's not always there.
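The two byte orders can be reproduced with a short Ruby sketch (1.9 or later; String#b needs 2.0). The helpers hex_bytes and utf16_byte_order are hypothetical names for illustration, and the BOM check only looks at the first two bytes:

```ruby
# Hypothetical helper: render the bytes of a string as spaced hex pairs.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Hypothetical helper: guess the UTF-16 byte order from a leading BOM.
def utf16_byte_order(data)
  case data.bytes.first(2)
  when [0xFE, 0xFF] then :big_endian
  when [0xFF, 0xFE] then :little_endian
  else :unknown # no BOM, so the byte order has to be known some other way
  end
end

big    = "Hello".encode("UTF-16BE")
little = "Hello".encode("UTF-16LE")

puts hex_bytes(big)     # 00 48 00 65 00 6C 00 6C 00 6F
puts hex_bytes(little)  # 48 00 65 00 6C 00 6C 00 6F 00

# Prepend a little-endian BOM by hand to simulate a file that carries one.
with_bom = "\xFF\xFE".b + little.b
p utf16_byte_order(with_bom)  # :little_endian
p utf16_byte_order(big)       # :unknown
```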
Encoding English this way wastes a lot of space, since most characters are below U+00FF and every other byte is zero.
Unicode Transformation Format-8 (UTF-8):
UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on
the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use
mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet.
UTF-8 is the default encoding for XML.
Hello - 48 65 6C 6C 6F
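A small Ruby sketch (1.9 or later) of the variable width. The sample characters are my own choice, written as \u escapes so the snippet does not depend on the encoding of the source file:

```ruby
# Each key is a single character; Ruby stores \u escapes as UTF-8.
samples = {
  "A"         => "U+0041, US-ASCII letter",
  "\u00E9"    => "U+00E9, e with acute accent",
  "\u20AC"    => "U+20AC, euro sign",
  "\u{1F600}" => "U+1F600, outside the Basic Multilingual Plane"
}

samples.each do |char, description|
  puts "#{description}: #{char.bytesize} octet(s)"
end
# The octet counts printed are 1, 2, 3 and 4, in that order.
```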
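There is also UCS-4/UTF-32: a fixed 4 bytes per code point, which is really inefficient for most text.
UTF-8 is more efficient for Western languages, but for other languages UTF-16 can be more efficient (e.g. most CJK characters take 3 bytes in UTF-8 but 2 in UTF-16).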
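To put numbers on that trade-off, here is a rough Ruby sketch (2.1 or later for Array#to_h). The Japanese sample string and the helper name encoded_sizes are assumptions for illustration only:

```ruby
ENCODINGS = %w[UTF-8 UTF-16LE UTF-32LE]

# Hypothetical helper: byte size of the same text in each encoding.
def encoded_sizes(text)
  ENCODINGS.map { |name| [name, text.encode(name).bytesize] }.to_h
end

english  = "Hello"
japanese = "\u3053\u3093\u306B\u3061\u306F" # "konnichiwa" in hiragana, 5 characters

p encoded_sizes(english)   # {"UTF-8"=>5, "UTF-16LE"=>10, "UTF-32LE"=>20}
p encoded_sizes(japanese)  # {"UTF-8"=>15, "UTF-16LE"=>10, "UTF-32LE"=>20}
```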
Open hello.latin1 and hello.utf8 in TextViewer
Open hello.latin1 and hello.utf8 in Chrome, change encoding of both files.
victor@victor ~/encodings (master)*$ xxd hello.utf8
0000000: 4865 6c6c 6f20 4920 616d 2056 c3ad 6374 Hello I am V..ct
0000010: 6f72 or
victor@victor ~/encodings (master)*$ xxd hello.latin1
0000000: 4865 6c6c 6f20 4920 616d 2056 ed63 746f Hello I am V.cto
0000010: 72 r
See how ASCII characters are encoded the same in utf8 and latin1.
See how UTF8 encoding takes more space for some characters.
Text without knowing its encoding doesn't mean anything; it's just bytes.
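A sketch in Ruby (1.9 or later) of the same bytes read under two different assumptions. The byte values are the "Víctor" part of the hello.utf8 dump above; the variable names are made up:

```ruby
# The raw bytes of "Victor" with an accented i, as they appear in hello.utf8.
bytes = [0x56, 0xC3, 0xAD, 0x63, 0x74, 0x6F, 0x72].pack("C*")

as_utf8   = bytes.dup.force_encoding("UTF-8")
as_latin1 = bytes.dup.force_encoding("ISO-8859-1")

p as_utf8.length    # 6: C3 AD is read as one character
p as_latin1.length  # 7: C3 and AD are read as two characters

p as_utf8 == "V\u00EDctor"                            # true
p as_latin1.encode("UTF-8") == "V\u00C3\u00ADctor"    # true: the familiar mojibake
```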
How to specify encoding:
HTTP - Content Type Header.
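HTML - Content Type meta tag. This looks like a chicken-and-egg problem (you have to read the document to find out its encoding), but in practice it isn't:
almost every encoding agrees with ASCII on the characters needed to reach the meta tag. The meta tag should be the first thing in the
head section. When no encoding is declared, browsers try to guess based on the frequency of bytes.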
E-mail - Content Type Header
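As an illustration of using such a header, here is a hedged Ruby sketch (2.0 or later). The helper charset_from_content_type and the sample header value are hypothetical, and the regular expression only handles the simple "type; charset=name" shape, not every header a server may send:

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type value.
def charset_from_content_type(value)
  match = value.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1]
end

header  = "text/html; charset=ISO-8859-1"
charset = charset_from_content_type(header)

# Bytes as they would arrive over the wire, tagged as binary.
body = "V\xEDctor".b
text = body.force_encoding(Encoding.find(charset)).encode("UTF-8")

puts charset                                         # ISO-8859-1
p text == "V\u00EDctor"                              # true
p charset_from_content_type("text/plain")            # nil
```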
Ruby 1.8:
victor@victor ~/encodings (master)*$ rvm 1.8.6
victor@victor ~/encodings (master)*$ irb
ruby-1.8.6-p399 > latin1 = File.open("hello.latin1").read
=> "Hello I am V\355ctor"
ruby-1.8.6-p399 > 0355
=> 237
ruby-1.8.6-p399 > 237.to_s(16)
=> "ed"
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
=> "Hello I am V\303\255ctor"
ruby-1.8.6-p399 > 0303
=> 195
ruby-1.8.6-p399 > 195.to_s(16)
=> "c3"
ruby-1.8.6-p399 > 0255
=> 173
ruby-1.8.6-p399 > 173.to_s(16)
=> "ad"
ruby-1.8.6-p399 > latin1 << utf8
=> "Hello I am V\355ctorHello I am V\303\255ctor"
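Ruby 1.8 concatenates the two strings without complaint, but the result is not valid text in either encoding. A sketch for a newer Ruby (2.0 or later, for String#b) that makes this visible; the literals stand in for the contents of hello.latin1 and hello.utf8:

```ruby
latin1 = "Hello I am V\xEDctor".b      # ISO-8859-1 bytes
utf8   = "Hello I am V\xC3\xADctor".b  # UTF-8 bytes

mixed = latin1 + utf8                  # byte-wise concatenation, as in 1.8

p mixed.bytesize                                      # 35
p utf8.dup.force_encoding("UTF-8").valid_encoding?    # true
p mixed.dup.force_encoding("UTF-8").valid_encoding?   # false: ED followed by "c" is not UTF-8
```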
Ruby 1.8 has some support for Encodings:
victor@victor ~/encodings (master)*$ irb
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
=> "Hello I am V\303\255ctor"
ruby-1.8.6-p399 > $KCODE = "U"
=> "U"
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
=> "Hello I am Víctor"
There are 4 possible values for $KCODE:
NONE: "N"
EUC: "E" Asian Encoding
Shift-JIS: "S" Asian Encoding
UTF-8: "U"
Support in regular expressions:
ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
ruby -e 'p "Résumé".scan(/./mu)'
["R", "\303\251", "s", "u", "m", "\303\251"]
ruby -e 'p "Résumé".size'
8
ruby -e 'p "Résumé".scan(/./mu).size'
6
ruby -e 'p "Résumé".unpack("U*")'
[82, 233, 115, 117, 109, 233]
ruby -e 'p "Résumé"'
"R\303\251sum\303\251"
ruby -KUe 'p "Résumé"'
"Résumé"
ruby -KUe 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]
#!/usr/bin/env ruby -wKU
Iconv: C library to handle character conversion
iconv --list
irb
ruby-1.8.6-p399 > $KCODE = "U"
=> "U"
ruby-1.8.6-p399 > require "iconv"
=> true
ruby-1.8.6-p399 > latin1 = File.open("hello.latin1").read
=> "Hello I am V?ctor"
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
=> "Hello I am Víctor"
ruby-1.8.6-p399 > latin1_in_utf8 = Iconv.conv("UTF8", "LATIN1", latin1)
=> "Hello I am Víctor"
ruby-1.8.6-p399 > latin1_in_utf8 + utf8
=> "Hello I am VíctorHello I am Víctor"
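Iconv was removed from the standard library in Ruby 2.0. On newer Rubies the same conversion can be sketched with String#encode; the literals below stand in for the file contents:

```ruby
latin1 = "Hello I am V\xEDctor".b   # bytes as they sit in hello.latin1
utf8   = "Hello I am V\u00EDctor"   # already UTF-8

# encode(destination, source): treat the bytes as ISO-8859-1 and transcode them.
latin1_in_utf8 = latin1.encode("UTF-8", "ISO-8859-1")

combined = latin1_in_utf8 + utf8
p latin1_in_utf8 == utf8   # true
p combined.length          # 34
```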
Problems with 1.8 encoding support:
Not enough encodings supported
Regexp-only support just isn't comprehensive enough
$KCODE is a single global setting, so it applies to every string in the program
Ruby 1.9:
In Ruby 1.9 a String is raw bytes plus information about their encoding.
This is different from other languages that favour a single encoding (e.g. UTF-8).
p __ENCODING__
ruby encoding-1.9.rb
#<Encoding:US-ASCII>
ruby encoding-1.9_comment.rb
#<Encoding:UTF-8>
-e gets the encoding from the environment:
echo $LANG
en_GB.UTF-8
ruby -e 'p __ENCODING__'
#<Encoding:UTF-8>
ruby -e 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]
ruby -e 'p "Résumé".size'
6
ruby -e 'p "Résumé".encoding'
#<Encoding:UTF-8>
ruby -e 'p "Résumé".each_byte{|b| p b}'
82
195
169
115
117
109
195
169
"Résumé"
ruby -e 'p "Résumé".each_char{|c| p c}'
"R"
"é"
"s"
"u"
"m"
"é"
"Résumé"
ruby -e 'p "Résumé".each_codepoint{|c| p c}'
82
233
115
117
109
233
"Résumé"
ruby -e 'p "Résumé".bytes.to_a'
[82, 195, 169, 115, 117, 109, 195, 169]
Encode Method: (Changes encoding metadata + raw bytes)
ruby -e 'p "Résumé".encode("ISO-8859-1")'
"R?sum?"
ruby -e 'p "Résumé".encode("ISO-8859-1").size'
6
ruby -e 'p "Résumé".encode("ISO-8859-1").encoding'
#<Encoding:ISO-8859-1>
ruby -e 'p "Résumé".encode("ISO-8859-1").bytes.to_a'
[82, 233, 115, 117, 109, 233]
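A side-by-side sketch (Ruby 1.9 or later) of what encode does to the bytes, contrasted with force_encoding from the next section. Variable names are made up:

```ruby
utf8 = "R\u00E9sum\u00E9"

transcoded = utf8.encode("ISO-8859-1")               # new bytes, same characters
relabelled = utf8.dup.force_encoding("ISO-8859-1")   # same bytes, new label

p utf8.bytes                       # [82, 195, 169, 115, 117, 109, 195, 169]
p transcoded.bytes                 # [82, 233, 115, 117, 109, 233]
p relabelled.bytes == utf8.bytes   # true

p transcoded.length   # 6
p relabelled.length   # 8: every byte now counts as one ISO-8859-1 character
```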
Force Encoding: (only changes metadata)
ruby -e 'p [82, 233, 115, 117, 109, 233].map{|c| c.to_s(16)}'
["52", "e9", "73", "75", "6d", "e9"]
ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9"'
"R\xE9sum\xE9"
ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9".encoding'
#<Encoding:UTF-8>
ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9".force_encoding("ISO-8859-1")'
"R?sum?"
ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9".force_encoding("ISO-8859-1").encode("UTF-8")'
"Résumé"
Read a file specifying external and internal encoding:
rvm 1.9.2
irb
ruby-1.9.2-preview1 > File.open("hello.latin1").read
=> "Hello I am V\xEDctor"
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1").read
=> "Hello I am V?ctor"
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1").read.encoding
=> #<Encoding:ISO-8859-1>
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1:UTF-8").read
=> "Hello I am Víctor"
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1:UTF-8").read.encoding
=> #<Encoding:UTF-8>
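The irb session above depends on hello.latin1 being on disk. A self-contained sketch of the same idea (Ruby 2.0 or later), which writes the bytes to a scratch directory first; the helper read_latin1_sample is a hypothetical name:

```ruby
require "tmpdir"

# Hypothetical helper: write ISO-8859-1 bytes to a scratch file and read them
# back using the given File.open mode string.
def read_latin1_sample(mode)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "hello.latin1")
    File.binwrite(path, "Hello I am V\xEDctor".b)
    File.open(path, mode) { |file| file.read }
  end
end

as_is      = read_latin1_sample("r:ISO-8859-1")        # external encoding only
transcoded = read_latin1_sample("r:ISO-8859-1:UTF-8")  # external:internal

puts as_is.encoding        # ISO-8859-1
puts as_is.bytesize        # 17
puts transcoded.encoding   # UTF-8
puts transcoded.bytesize   # 18
```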
Internal Encoding and External Encoding (defaults):
ruby-1.9.2-preview1 > Encoding.default_internal = "UTF-8"
=> "UTF-8"
ruby-1.9.2-preview1 > Encoding.default_external = "ISO-8859-1"
=> "ISO-8859-1"
ruby-1.9.2-preview1 > File.open("hello.latin1").read
=> "Hello I am Víctor"
Exceptions:
ruby-1.9.2-preview1 > "Hello".encode("ASCII-8BIT")
=> "Hello"
ruby-1.9.2-preview1 > "Hello I am Víctor".encode("ASCII-8BIT")
Encoding::UndefinedConversionError: U+00ED from UTF-8 to ASCII-8BIT
from (irb):19:in `encode'
from (irb):19
from /Users/victor/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'
ruby-1.9.2-preview1 > b = File.open("hello.utf8", "r:binary").read
=> "Hello I am V\xC3\xADctor"
ruby-1.9.2-preview1 > b.encoding
=> #<Encoding:ASCII-8BIT>
ruby-1.9.2-preview1 > b << "Hello I am Víctor"
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):3
from /Users/victor/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'
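One way out of the compatibility error, sketched for Ruby 2.0 or later: if the binary data is known to be UTF-8, relabel it before appending. The helper error_class is hypothetical and exists only to show which exception is raised:

```ruby
# Hypothetical helper: run the block and return the class of any encoding error.
def error_class
  yield
  nil
rescue EncodingError => e
  e.class
end

binary = "Hello I am V\xC3\xADctor".b   # tagged ASCII-8BIT, like a "r:binary" read
utf8   = "Hello I am V\u00EDctor"

p error_class { binary + utf8 }               # Encoding::CompatibilityError
p error_class { utf8.encode("ASCII-8BIT") }   # Encoding::UndefinedConversionError

# The bytes are really UTF-8, so relabelling them makes the two strings compatible.
relabelled = binary.dup.force_encoding("UTF-8")
combined   = relabelled + utf8
p combined.valid_encoding?   # true
p combined.length            # 34
```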
Erlang:
A string is just a list of numbers:
[65, 66, 67, 68].
"ABCD"
[65, 66, 67, 68] == "ABCD".
true
Erlang assumes ISO-8859-1 by default:
[82, 233, 115, 117, 109, 233].
"Résumé"
[82, 233, 115, 117, 109, 233,256].
[82,233,115,117,109,233,256]
A list of integers takes 8 bytes per element: 4 for the integer and 4 for the pointer to the next element (double that on a 64-bit architecture).
A binary takes just 1 byte per character:
list_to_binary([82, 233, 115, 117, 109, 233]).
<<"Résumé">>
Unicode support:
io:getopts().
[{expand_fun,#Fun<group.0.120017273>},
{echo,true},
{binary,false},
{encoding,unicode}]
U = unicode:characters_to_binary([82,233,115,117,109,233], utf8).
<<"Résumé">>
io:format("~s~n",[U]).
Résumé
ok
io:format("~ts~n",[U]).
Résumé
unicode:characters_to_list(<<"Résumé">>).
"Résumé"
<<First/utf8, Second/utf8, Rest/binary>> = <<"Résumé">>.
<<"Résumé">>
First.
82
Second.
233
[First, Second].
"Ré"
Modules that are Unicode-aware: unicode, io, file, re, wx,
and string (except to_upper and to_lower).
GSM:
http://www.dreamfabric.com/sms/default_alphabet.html
It contains 127 + 10 characters, representable in 7 bits (10 of them need an escape character).
That's why an SMS can carry at most 140 bytes but 160 characters (if none are escaped): 140 * 8 / 7 = 160.
Some operators accept data already encoded in GSM, others only accept a default alphabet that they translate into GSM so not all GSM characters can be sent to all operators.
Handsets also support UCS-2, but then the size is limited to 70 characters (140 bytes / 2 bytes per character).
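The size limits can be worked out with a short Ruby sketch (2.4 or later for Array#sum). The constant and method names are made up, the escaped-character list is my reading of the GSM 03.38 extension table, and gsm_septets assumes every character in the text exists in the GSM alphabet:

```ruby
SMS_PAYLOAD_BYTES = 140

# Characters from the GSM extension table; each is sent as ESC plus one septet.
GSM_ESCAPED = "^{}\\[~]|\u20AC\f".chars

# How many characters fit in one message at a given width per character.
def max_characters(bits_per_character)
  SMS_PAYLOAD_BYTES * 8 / bits_per_character
end

# Septets needed for a text, assuming it only uses GSM alphabet characters.
def gsm_septets(text)
  text.chars.sum { |char| GSM_ESCAPED.include?(char) ? 2 : 1 }
end

p max_characters(7)        # 160 (GSM 7-bit alphabet)
p max_characters(16)       # 70  (UCS-2)
p gsm_septets("Hello")     # 5
p gsm_septets("[Hello]")   # 9: each bracket costs two septets
```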
References:
http://www.unicode.org
http://www.utf-8.com/
http://www.joelonsoftware.com/articles/Unicode.html
http://blog.grayproductions.net/categories/character_encodings
http://yehudakatz.com/2010/05/17/encodings-unabridged/
http://nuclearsquid.com/writings/ruby-1-9-encodings.html
http://ftp.sunet.se/pub/lang/erlang/doc/man/unicode.html