3.6 - Unicode Representations of Strings

When you write a Unicode string to a text file, or some other form of storage, the string is encoded in one of several Unicode-defined encoding forms. In each form, the string is encoded in small chunks known as code units. These encoding forms include:

  • UTF-8 - which encodes a string as 8-bit code units
  • UTF-16 - which encodes a string as 16-bit code units
  • UTF-32 - which encodes a string as 32-bit code units

In addition to allowing a user to iterate over the Characters in a string, Swift also provides ways of accessing these Unicode representations. Strings possess special properties which allow a user to iterate over their various Unicode-compliant representations with a for-in loop. They are:

  • .utf8 - which returns a collection of UTF-8 code units
  • .utf16 - which returns a collection of UTF-16 code units
  • .unicodeScalars - which returns a collection of 21-bit Unicode scalar values, equivalent to the string's UTF-32 encoding form

Each example below shows a different representation of the following string, which is made up of the characters D, o, g, ‼ (DOUBLE EXCLAMATION MARK, or Unicode scalar U+203C), and the 🐶 character (DOG FACE, or Unicode scalar U+1F436):

let dogString = "Dog‼🐶"
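Before looking at each view in turn, it can help to compare how many elements each one contains for this string. The following is a minimal sketch; the constant name sample is only illustrative and holds the same text as dogString.

```swift
// Same text as dogString; a separate (illustrative) name keeps this sketch self-contained.
let sample = "Dog‼🐶"

let characterCount = sample.count                 // Characters (grapheme clusters)
let scalarCount = sample.unicodeScalars.count     // Unicode scalar values
let utf16Count = sample.utf16.count               // 16-bit code units
let utf8Count = sample.utf8.count                 // 8-bit code units

print(characterCount, scalarCount, utf16Count, utf8Count)
// Prints "5 5 6 10"
```

The counts differ because ‼ and 🐶 need more than one code unit in UTF-8, and 🐶 needs two code units in UTF-16.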

UTF-8 Representation

The UTF-8 representation of a String is accessed by iterating over its utf8 property. The property is of type String.UTF8View, which is a collection of unsigned 8-bit integers (UInt8), one for each byte in the string's UTF-8 representation.

[Figure: UTF-8 representation of the string]

for codeUnit in dogString.utf8 {
    print("\(codeUnit) ", terminator: "")
}
print("")
// Prints "68 111 103 226 128 188 240 159 144 182 "
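To see which of those ten bytes belong to which character, one option is to count the UTF-8 code units each Unicode scalar needs. This is a small sketch rather than the only way to do it; the names sample, bytes, widths, and roundTripped are illustrative. It also shows that the bytes can be decoded back into the original string.

```swift
let sample = "Dog‼🐶"

// Copy the UTF-8 view into a plain array of bytes.
let bytes: [UInt8] = Array(sample.utf8)

// For each Unicode scalar, build a one-scalar String and count its UTF-8 code units.
let widths = sample.unicodeScalars.map { String($0).utf8.count }
print(widths)
// Prints "[1, 1, 1, 3, 4]"

// Decoding the same bytes as UTF-8 recovers the original text.
let roundTripped = String(decoding: bytes, as: UTF8.self)
print(roundTripped == sample)
// Prints "true"
```

The first three bytes (68, 111, 103) are single-byte ASCII values, the next three (226, 128, 188) encode ‼, and the last four (240, 159, 144, 182) encode 🐶.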

UTF-16 Representation

The UTF-16 representation of a String is accessed by iterating over its utf16 property. The property is of type String.UTF16View, which is a collection of unsigned 16-bit integers (UInt16).

[Figure: UTF-16 representation of the string]

for codeUnit in dogString.utf16 {
    print("\(codeUnit) ", terminator: "")
}
print("")
// Prints "68 111 103 8252 55357 56374 "

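Note that the UTF-8 and UTF-16 code units for D, o, and g have the same numeric values, because all three are ASCII characters. The DOUBLE EXCLAMATION MARK and DOG FACE characters need fewer code units here than in UTF-8: ‼ takes one 16-bit code unit (8252) instead of three 8-bit code units, and 🐶 takes a surrogate pair of two 16-bit code units (55357 and 56374) instead of four 8-bit code units.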
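The two values 55357 and 56374 are a UTF-16 surrogate pair for the scalar U+1F436. As a worked example of the standard surrogate-pair arithmetic (the constant names here are illustrative), the pair can be derived by hand and compared with what the utf16 view reports:

```swift
let dogFace: UInt32 = 0x1F436

// Scalars above U+FFFF are offset by 0x10000, then split into two 10-bit halves.
let offset = dogFace - 0x10000
let highSurrogate = UInt16(0xD800 + (offset >> 10))    // top 10 bits
let lowSurrogate = UInt16(0xDC00 + (offset & 0x3FF))   // bottom 10 bits

print(highSurrogate, lowSurrogate)
// Prints "55357 56374"

// The utf16 view of the character yields the same two code units.
let fromView = Array("🐶".utf16)
print(fromView == [highSurrogate, lowSurrogate])
// Prints "true"
```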

Unicode Scalar Representation

The Unicode scalar representation of a String is accessed by iterating over its unicodeScalars property. This property is of type String.UnicodeScalarView, which is a collection of values of type UnicodeScalar (spelled Unicode.Scalar in current Swift, where UnicodeScalar remains available as a type alias).

Each UnicodeScalar has a value property, which returns the scalar's 21-bit value, represented within a UInt32.

[Figure: Unicode scalar representation of the string]

for scalar in dogString.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}
print("")
// Prints "68 111 103 8252 128054 "

Note again that all three forms represent D, o, and g with the same values. UTF-16 and the Unicode scalar view also share the value 8252 (U+203C) for the DOUBLE EXCLAMATION MARK character. Finally, the Unicode scalar view represents DOG FACE as a single scalar value, 128054 (U+1F436), rather than as a surrogate pair.
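The decimal values printed by the loop correspond to the U+ code points quoted earlier (U+203C and U+1F436). A short sketch of that conversion follows; codePointLabel is a hypothetical helper written for this note, not a standard library function.

```swift
let sample = "Dog‼🐶"

// Hypothetical helper: formats a scalar as "U+" followed by at least four hex digits.
func codePointLabel(_ scalar: Unicode.Scalar) -> String {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    let padding = String(repeating: "0", count: max(0, 4 - hex.count))
    return "U+" + padding + hex
}

let labels = sample.unicodeScalars.map(codePointLabel)
print(labels.joined(separator: " "))
// Prints "U+0044 U+006F U+0067 U+203C U+1F436"
```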

As an alternative to reading its value property, each UnicodeScalar can be used to construct a new String value, for instance through string interpolation:

for scalar in dogString.unicodeScalars {
    print("\(scalar) ")
}
// D
// o
// g
// ‼
// 🐶
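Going the other way, a string can be rebuilt from raw scalar values. The sketch below assumes the five values printed earlier as input; values, scalars, and rebuilt are illustrative names. Note that the Unicode.Scalar initializer taking a UInt32 is failable, since not every 32-bit number is a valid scalar.

```swift
// The scalar values printed earlier in this section.
let values: [UInt32] = [68, 111, 103, 8252, 128054]

var scalars = String.UnicodeScalarView()
for value in values {
    // Returns nil for surrogate code points and for values above U+10FFFF.
    if let scalar = Unicode.Scalar(value) {
        scalars.append(scalar)
    }
}

let rebuilt = String(scalars)
print(rebuilt)
// Prints "Dog‼🐶"

// A lone surrogate such as 0xD83D is a UTF-16 code unit, not a Unicode scalar.
let invalid = Unicode.Scalar(0xD83D as UInt32)
print(invalid == nil)
// Prints "true"
```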
