When you write a Unicode string to a text file, or some other form of storage, the string is encoded in one of several Unicode-defined encoding forms. In each form, the string is encoded in small chunks known as code units. These encoding forms include:
- UTF-8 - which encodes a string as 8-bit code units
- UTF-16 - which encodes a string as 16-bit code units
- UTF-32 - which encodes a string as 32-bit code units
In addition to allowing a user to iterate over the Characters
in a string, Swift also provides ways of accessing these Unicode representations. Strings possess special properties which allow a user to iterate over their various Unicode-compliant representations with a for-in
loop. They are:
.utf8
- Which returns a collection of UTF-8 code units.utf16
- Which returns a collection of UTF-16 code units.unicodeScalars
- Whichreturns a collection of 21-bit Unicode scalar values that are equivalent to the string's UTF-32 encoding form.
Each example below shows a different representation of the following string, which is made up of the characters D, o, g, ‼ (DOUBLE EXCLAMATION MARK, or Unicode scalar U+203C), and the 🐶 character (DOG FACE, or Unicode scalar U+1F436):
let dogString = "Dog‼🐶"
The UTF-8 representation of a string is accessed by iterating over its utf8
property. The property itself is of type String.UTF8View
, which is collection of unsigned 8-bit integers (UInt8
).
for codeUnit in dogString.utf8 {
print("\(codeUnit) ", terminator: "")
}
print("")
// Prints "68 111 103 226 128 188 240 159 144 182 "
The UTF-16 representation of a String is accessed by iterating over its utf16
property. The property is of type String.UTF16View
, which is a collection of unsigned 16 bit integers (UInt16
)
for codeUnit in dogString.utf16 {
print("\(codeUnit) ", terminator: "")
}
print("")
// Prints "68 111 103 8252 55357 56374 "
Note how the UTF-8 and UTF-16 representations of the characters D,o and g are the same. The DOUBLE EXCLAMATION MARK and DOG FACE characters can be expressed more succintly here than in the UTF-8 representation.
The final, Unicode scalar representation of a String can be accessed by iterating over its unicodeScalars
property. This property is of type UnicodeScalarView
, which is a collection of values of type UnicodeScalar
Each UnicodeScalar
has a value property, which returns the scalar's 21-bit value, represented within a UInt32
for scalar in dogString.unicodeScalars {
print("\(scalar.value) ", terminator: "")
}
print("")
// Prints "68 111 103 8252 128054 "
Note again how all 3 forms have the same representation of D,o and g. UTF-16 and the Unicode Scalars also share a representation of the DOUBLE EXCLAMATION MARK character. Finally, unicode scalars are able to represent DOG FACE in a single character.
Finally, Unicode scalars can also be used to construct a string, for instance as shown below.
for scalar in dogString.unicodeScalars {
print("\(scalar) ")
}
// D
// o
// g
// ‼
// 🐶