IL2CPP Data Structures
This page isn't complete yet!
The amount of metadata that ships with your average IL2CPP game is, to put it mildly, a lot: everything from top-level assembly and module definitions, down to individual default values for function parameters or fields, and array initializers for local variables.
This page serves as a sort of "behind-the-scenes" of LibCpp2IL specifically - what it's reading, where it pulls that data from, and so on, in order to provide a (hopefully) convenient interface. I'll try to order it in a biggest-to-smallest fashion, with things like assembly definitions (which also tend to be simpler) at the top, and small details at the bottom.
You can use the table-of-contents on the right to jump to a specific section.
Starting at the very top, let's take a look at the structure of an IL2CPP metadata file. At least, one that's not encrypted, obfuscated, or otherwise modified.
Every metadata file starts with a header, and every header starts with a specific "magic number" - a sequence of identifying bytes that confirms that this is, in fact, a metadata file. In this case, those bytes are (in hexadecimal): FA B1 1B AF. The very first thing LibCpp2IL does when reading is confirm that these are correct. If not, there's often little point in proceeding, because the metadata is damaged or modified.
Immediately after that are 4 bytes that define the version of the metadata file. As of writing, the latest metadata version is 29, introduced in Unity 2021.2.
In general, Unity increases the version number when it makes a change that breaks reading older metadata files. However, it doesn't always remember to do so, which is why a metadata file that claims to be version 24 can have one of 6 distinct formats, which the community has labelled 24.0 through 24.5. Unity seems to have since realised the mistake, or just decided not to follow normal numerical order, because the next metadata version after 24.5 is 27.0. There is also a community-dubbed 27.1 (Unity 2020.2.4+), and then the version jumps again to 29 in Unity 2021.2.0.
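As a rough illustration of that first check, here is a minimal Python sketch (not LibCpp2IL's actual code). It assumes the magic number and version are both little-endian 32-bit integers, so the magic reads as the value 0xFAB11BAF and a hex editor would show its bytes in reverse order. The function name and the synthetic sample are made up for the example.

```python
import struct

# Assumption: the magic is the 32-bit value 0xFAB11BAF, stored little-endian.
METADATA_MAGIC = 0xFAB11BAF


def read_magic_and_version(data: bytes) -> tuple:
    """Return (magic, version) from the first eight bytes of a metadata blob."""
    if len(data) < 8:
        raise ValueError("metadata is too short to contain a header")
    # "<Ii" = little-endian unsigned 32-bit int, then signed 32-bit int.
    magic, version = struct.unpack_from("<Ii", data, 0)
    if magic != METADATA_MAGIC:
        raise ValueError(f"bad magic number: 0x{magic:08X}")
    return magic, version


# Synthetic eight-byte prefix that claims to be metadata version 29.
sample = struct.pack("<Ii", METADATA_MAGIC, 29)
magic, version = read_magic_and_version(sample)
print(f"magic=0x{magic:08X} version={version}")
```

Note that the version field alone cannot tell 24.0 apart from 24.5, so a real reader needs extra information (such as the Unity version) to pick the right layout.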
After these eight bytes, which are technically considered part of the metadata header, comes the rest of the header (detailed below), followed by the raw metadata itself. That raw metadata can be divided into groups, and then into individual structs, using the offsets and counts in the header.
I won't be covering every single one of these groups of data, but I'll cover the most relevant ones.
The metadata header (or at least, the part after the magic number and version) generally follows a repeating pattern: a file offset, followed by the amount of data at that offset, then another file offset, another amount, and so on. Exactly what those offsets and amounts describe varies from version to version, but for every version that LibCpp2IL supports, the first few are for: string literals for managed code, string literals for the metadata file (type names etc), event definitions, then property definitions. There are, roughly speaking, about 35-40 offset-count pairs in your average header.
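The repeating offset-count pattern can be sketched like this. It is a simplified illustration, assuming every header field is a little-endian 32-bit integer and deliberately not naming the pairs, since their meaning depends on the metadata version. `read_offset_count_pairs` and the sample values are hypothetical.

```python
import struct

HEADER_PREFIX_SIZE = 8  # magic number (4 bytes) + version (4 bytes)


def read_offset_count_pairs(data: bytes, pair_count: int) -> list:
    """Read `pair_count` (offset, count) pairs that follow the magic and version."""
    # Assumption: each offset and each count is a little-endian signed 32-bit int.
    values = struct.unpack_from(f"<{pair_count * 2}i", data, HEADER_PREFIX_SIZE)
    return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]


# Synthetic header: magic, version, then two made-up (offset, count) pairs.
sample = struct.pack("<Ii", 0xFAB11BAF, 29) + struct.pack("<4i", 256, 64, 320, 128)
pairs = read_offset_count_pairs(sample, 2)
for offset, count in pairs:
    print(f"offset={offset} count={count}")
```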
Unlike the binary structures, the "count" fields don't indicate the length of the array of data at a given offset, but instead the raw number of bytes. This means that, to know how many entries to read, you have to divide the count by the size in bytes of the data structure at that offset. For example, the struct for a string literal is made up of two four-byte integers (the first one is unsigned, but that's irrelevant here), so in total it is eight bytes. So whatever the stringLiteralCount or stringCount is defined as in the metadata header, you have to divide it by 8 to get the number of strings in that section of the metadata.
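To make the division concrete, here is a small sketch that turns a byte count into an entry count and reads that many eight-byte string literal entries. The helper name, the little-endian assumption, and the synthetic data are all illustrative; the two fields are left unnamed because only their sizes matter here.

```python
import struct

# Two four-byte integers (unsigned, then signed), so eight bytes per entry.
STRING_LITERAL_ENTRY = struct.Struct("<Ii")


def read_entries(data: bytes, offset: int, byte_count: int, entry: struct.Struct) -> list:
    """Read every fixed-size entry in a section described by (offset, byte_count)."""
    if byte_count % entry.size != 0:
        raise ValueError("byte count is not a multiple of the entry size")
    entry_count = byte_count // entry.size  # the header stores bytes, not entries
    return [entry.unpack_from(data, offset + i * entry.size) for i in range(entry_count)]


# Synthetic blob: 16 bytes of padding, then three string literal entries (24 bytes).
section = b"".join(STRING_LITERAL_ENTRY.pack(a, b) for a, b in [(5, 0), (3, 5), (7, 8)])
blob = b"\x00" * 16 + section
entries = read_entries(blob, offset=16, byte_count=24, entry=STRING_LITERAL_ENTRY)
print(len(entries), entries)
```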
Almost every structure that references a metadata structure does so via an index. These specify the index in a given array of data in the metadata. For example, an Image Definition (which specifies a Module) contains an Int32 nameIndex as its first field, which refers to the metadata (not managed code) string literal array.
On top of that, most relations in IL2CPP are, for space reasons, one-to-many. That is to say, structs don't reference their parent (which would be many-to-one), but instead the parent references its children (as a start index + count). Again using an Image Definition as an example, the struct dictates what types it contains - the type does not define its image.
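The index-based, parent-owns-children layout described above might be modelled as follows. This is a deliberately simplified sketch: the `ImageDefinition` class has only three hypothetical fields, and the metadata strings and type definitions are stood in for by plain Python lists.

```python
from dataclasses import dataclass


@dataclass
class ImageDefinition:
    """Simplified, hypothetical stand-in for an image definition."""
    name_index: int   # index into the metadata string array
    type_start: int   # index of the first contained type
    type_count: int   # number of contained types


def types_of(image: ImageDefinition, type_names: list) -> list:
    """Parent to children: slice the type array using start index + count."""
    return type_names[image.type_start:image.type_start + image.type_count]


def image_of(type_index: int, images: list):
    """Child to parent: the type stores no image, so the images must be searched."""
    for image in images:
        if image.type_start <= type_index < image.type_start + image.type_count:
            return image
    return None


strings = ["mscorlib.dll", "Assembly-CSharp.dll"]
type_names = ["Object", "String", "Player", "Enemy", "GameManager"]
images = [ImageDefinition(0, 0, 2), ImageDefinition(1, 2, 3)]

print(types_of(images[1], type_names))
print(strings[image_of(3, images).name_index])
```

The trade-off is visible in `image_of`: going from a child back to its parent needs a search (or a lookup table built once at load time), which is the price of not storing a parent reference on every child.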
Finally, most of these structs define their token, which is exactly the same concept (and follows the same rules) as a Mono token - a unique (within a context) identifier that other structs can use to refer to this struct. However, in many cases, IL2CPP prefers to use indices into an array, rather than tokens, for performance reasons (though there are some exceptions).
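For reference, here is how such a token can be split, assuming "the same rules" means the standard CLI metadata token layout: the top byte identifies the metadata table and the lower 24 bits are the row number within it. The function name is hypothetical.

```python
def split_token(token: int) -> tuple:
    """Split a 32-bit metadata token into (table, row)."""
    table = (token >> 24) & 0xFF   # top byte: which metadata table
    row = token & 0x00FFFFFF       # lower 24 bits: row within that table
    return table, row


# 0x06 is the MethodDef table in the CLI specification.
table, row = split_token(0x06000010)
print(f"table=0x{table:02X} row={row}")
```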
With that out of the way, let's dive in.
An image definition dictates what Cecil calls a Module. It has, among other things: a nameIndex, which is relative to the metadata string literal array; an assemblyIndex, which refers to an AssemblyDefinition (which LibCpp2IL doesn't use, because it's not usually relevant); the start offset + count of the types within the Module (these refer to the TypeDefinition array of the metadata); the custom attributes which apply to the module itself (from metadata v24.1 onwards); and its token.
The two most useful pieces of information here are the name and contained types of the module.
The Type Definition is arguably one of the central types contained within the metadata. Almost everything contains a reference to these - because C# is a strongly-typed language, almost everything in it has a type.
Type definitions have the obvious fields: name and namespace indices; declaring, parent (base), and element (for enums) type indices; offset/count pairs for the various other referenced members (namely fields, methods, events, properties, nested types, and interfaces); and a token. Besides those, they also contain a 32-bit packed bitfield, a vtable, a generic container index, and interface offsets.
The bitfield currently only uses the lowest 17 bits of the 32, with the remaining 15 reserved for future use. Those 17 bits are allocated as follows:
Bits 1-6 inclusive, as well as 11, 12, and 17, are boolean flags. Bit 1 dictates a value type, 2 dictates an enum, 3 specifies this type has a finalizer, 4 that it has a static constructor. 5 specifies it is blittable, 6 that it's imported or part of the windows runtime. 11 and 12 specify that the Type's packing and class sizes, respectively, have their default values, and 17 indicates that the type is "By-Ref-Like".
Bits 7-10 inclusive, and 13-16 inclusive, each hold a 4-bit integer that takes one of 9 values indicating a packing size, where 0 indicates 0, and values after that refer to powers of 2, starting at 1, then 2, then 4, etc. The first of these fields, at bits 7-10, defines the packing size itself, and the second defines "the specified packing size (even for explicit layouts)".
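The bit allocation above can be decoded with a few shifts and masks. The sketch below assumes that "bit 1" is the least significant bit, and all of the dictionary keys are names invented for this example rather than LibCpp2IL's own.

```python
def decode_packing(value: int) -> int:
    """Map the 4-bit packing field to a size: 0 -> 0, then 1, 2, 4, 8, ..."""
    return 0 if value == 0 else 1 << (value - 1)


def decode_type_bitfield(bitfield: int) -> dict:
    """Decode the lowest 17 bits of a type definition's packed bitfield."""
    def flag(bit: int) -> bool:
        # Assumption: bits are numbered from 1, starting at the least significant bit.
        return bool((bitfield >> (bit - 1)) & 1)

    def nibble(first_bit: int) -> int:
        return (bitfield >> (first_bit - 1)) & 0xF

    return {
        "is_value_type": flag(1),
        "is_enum": flag(2),
        "has_finalizer": flag(3),
        "has_static_constructor": flag(4),
        "is_blittable": flag(5),
        "is_import_or_windows_runtime": flag(6),
        "packing_size": decode_packing(nibble(7)),
        "packing_size_is_default": flag(11),
        "class_size_is_default": flag(12),
        "specified_packing_size": decode_packing(nibble(13)),
        "is_by_ref_like": flag(17),
    }


# Made-up value: a blittable value type whose packing field holds 4 (a size of 8).
example = (1 << 0) | (1 << 4) | (4 << 6)
decoded = decode_type_bitfield(example)
print(decoded)
```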
The vtable start + count refer to offsets into the metadata VTableMethodIndices array, and are used for virtual methods to specify exactly which methods are implemented by this type. The index into the resulting array is called the slot of the method, and is also accessible from the MethodDefinition, including that of the abstract or virtual method on the base type.
The generic container index refers to the GenericContainer array of the metadata and contains the generic type parameters (e.g. the T in List<T>) for this type. If this type is not generic, the count is 0 and the index is -1.
Finally, the interface offsets refer to the metadata InterfaceOffsets array. Each entry specifies an interface type index and the unique slot offset that this type's implementations of that interface type's methods receive.