Objects - Structure of Ruby objects
Translated by Vincent ISAMBART
Starting from this chapter we will explore the `ruby` source code, beginning
with the declarations of object structures.
What are the required conditions to make sure objects can exist? Many
explanations can be given but in reality there are three conditions that must
be obeyed:
- Being able to differentiate itself from the rest (having an identity)
- Being able to reply to requests (methods)
- Keeping an internal state (instance variables)
In this chapter, we are going to confirm these three features one by one.
The most interesting file in this quest will be `ruby.h`, but we will also
briefly look at other files such as `object.c`, `class.c` or `variable.c`.
In `ruby`, the contents of an object are expressed by a `C` structure, always
handled via a pointer. A different kind of structure is used for each class,
but the pointer type will always be `VALUE` (figure 1).
Here is the definition of `VALUE`:
▼ `VALUE`
typedef unsigned long VALUE;
(ruby.h)
In practice, a `VALUE` must be cast to various structure pointer types.
Therefore, if an `unsigned long` and a pointer have different sizes, `ruby`
will not work well. Strictly speaking, it will not work when a pointer is
bigger than `sizeof(unsigned long)`. Fortunately, no recent machine has
this characteristic, even if some time ago there were quite a few of them.
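The size condition above can be checked directly. Here is a minimal sketch, not code from `ruby`: `struct demo_object` and both functions are hypothetical names made up for this illustration, and only the `typedef` comes from `ruby.h`.

```c
#include <assert.h>

typedef unsigned long VALUE;   /* same typedef as in ruby.h */

/* Hypothetical object structure, used only for this illustration. */
struct demo_object {
    int payload;
};

/* Returns 1 when an unsigned long is wide enough to carry a pointer. */
static int value_can_hold_pointer(void)
{
    return sizeof(VALUE) >= sizeof(void *);
}

/* Stores a pointer in a VALUE and reads the structure back through it. */
static int roundtrip_payload(struct demo_object *obj)
{
    VALUE v;

    assert(value_can_hold_pointer());   /* otherwise the cast loses bits */
    v = (VALUE)obj;
    return ((struct demo_object *)v)->payload;
}
```

On a platform where the check fails, the cast to `VALUE` would drop part of the address, which is exactly the situation described above.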
Several structures are available according to object classes:
`struct RObject` | all things for which none of the following applies |
`struct RClass` | class object |
`struct RFloat` | floating-point numbers |
`struct RString` | string |
`struct RArray` | array |
`struct RRegexp` | regular expression |
`struct RHash` | hash table |
`struct RFile` | `IO`, `File`, `Socket`, etc… |
`struct RData` | all the classes defined at C level, except the ones mentioned above |
`struct RStruct` | Ruby’s `Struct` class |
`struct RBignum` | big integers |
For example, for a string object, `struct RString` is used, so we will have
something like the following.
Let’s look at the definition of a few object structures.
▼ Examples of object structure
/* structure for ordinary objects */
struct RObject {
    struct RBasic basic;
    struct st_table *iv_tbl;
};

/* structure for strings (instance of String) */
struct RString {
    struct RBasic basic;
    long len;
    char *ptr;
    union {
        long capa;
        VALUE shared;
    } aux;
};

/* structure for arrays (instance of Array) */
struct RArray {
    struct RBasic basic;
    long len;
    union {
        long capa;
        VALUE shared;
    } aux;
    VALUE *ptr;
};
(ruby.h)
Before looking at every one of them in detail, let’s begin with something more
general.
First, as `VALUE` is defined as `unsigned long`, it must be cast before
being used. That’s why `Rxxxx()` macros have been made for each object
structure. For example, for `struct RString` there is `RSTRING()`, and so on. These macros are used like this:
VALUE str = ....;
VALUE arr = ....;
RSTRING(str)->len;   /* ((struct RString*)str)->len */
RARRAY(arr)->len;    /* ((struct RArray*)arr)->len */
Another important point to mention is that all object structures start with a
member `basic` of type `struct RBasic`. As a result, whatever the type of
structure pointed to by `VALUE`, if you cast this `VALUE` to `struct RBasic*`,
you will be able to access the content of `basic`.
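To make the "common first member" idea concrete, here is a small sketch. The structure layouts follow the excerpts above, but the `DEMO_T_xxxx` tag values and `demo_flags()` are hypothetical and exist only for this example.

```c
#include <assert.h>

typedef unsigned long VALUE;

/* Layouts following the excerpts above. */
struct RBasic  { unsigned long flags; VALUE klass; };
struct RString { struct RBasic basic; long len; char *ptr;
                 union { long capa; VALUE shared; } aux; };
struct RArray  { struct RBasic basic; long len;
                 union { long capa; VALUE shared; } aux; VALUE *ptr; };

#define RBASIC(obj)  ((struct RBasic *)(obj))
#define RSTRING(obj) ((struct RString *)(obj))
#define RARRAY(obj)  ((struct RArray *)(obj))

/* Hypothetical tag values; the real T_xxxx constants live in ruby.h. */
enum { DEMO_T_STRING = 1, DEMO_T_ARRAY = 2 };

/* Reads the flags without knowing which structure the VALUE points to.
   This works because every structure begins with a struct RBasic. */
static unsigned long demo_flags(VALUE v)
{
    assert(v != 0);
    return RBASIC(v)->flags;
}
```

Whether `v` points to a `struct RString` or a `struct RArray`, `demo_flags()` reads the same bytes at the start of the structure.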
You may have guessed that `struct RBasic` is designed to contain some important
information shared by all object structures. The definition of `struct RBasic`
is the following:
▼ `struct RBasic`
struct RBasic {
    unsigned long flags;
    VALUE klass;
};
(ruby.h)
`flags` are multipurpose flags, mostly used to register the structure type
(for instance `struct RObject`). The type flags are named `T_xxxx`, and can be
obtained from a `VALUE` using the macro `TYPE()`. Here is an example:
VALUE str;
str = rb_str_new("", 0);   /* creates a Ruby string (its structure is RString) */
TYPE(str);                 /* the return value is T_STRING */
The names of these `T_xxxx` flags are directly linked to the corresponding type
name, like `T_STRING` for `struct RString` and `T_ARRAY` for `struct RArray`.
The other member of `struct RBasic`, `klass`, contains the class this object
belongs to. As the `klass` member is of type `VALUE`, what is stored is (a
pointer to) a Ruby object. In short, it is a class object.
The relation between an object and its class will be detailed in the “Methods”
section of this chapter.
By the way, the name of this member is not `class` to make sure it does not
raise any conflict when the file is processed by a C++ compiler, as it is a
reserved word.
I said that the type of structure is stored in the `flags` member of
`struct RBasic`. But why do we have to store the type of structure? It’s to be able to
handle all different types of structure via `VALUE`. If you cast a pointer to
a structure to `VALUE`, as the type information does not remain, the compiler
won’t be able to help. Therefore we have to manage the type ourselves. That’s
the consequence of being able to handle all the structure types in a unified
way.
OK, but the structure used is determined by the class, so why are the structure
type and the class stored separately? Being able to find the structure type
from the class should be enough. There are two reasons for not doing this.
The first one is (I’m sorry for contradicting what I said before), in fact
there are structures that do not have a `struct RBasic` (i.e. they have no
`klass` member). For example `struct RNode` that will appear in the second
part of the book. However, `flags` is guaranteed to be the first
member even in special structures like this. So if you put the type of
structure in `flags`, all the object structures can be differentiated in one
unified way.
The second reason is that there is no one-to-one correspondence between class
and structure. For example, all the instances of classes defined at the Ruby
level use `struct RObject`, so finding a structure from a class would require
keeping the correspondence between each class and structure. That’s why it’s
easier and faster to put the information about the type in the structure.
Since it would be unsatisfying to say only that `basic.flags` is used for
several things besides the type of structure, here is a general
illustration of it (figure 5). There is no need to understand everything
right away; I just want to show its uses because leaving them out was bothering me.
Looking at the diagram, it seems that 21 bits are unused on 32-bit
machines. In these remaining bits, the flags `FL_USER0` to `FL_USER8` are
defined, and are used for a different purpose by each structure. In the
diagram I also put `FL_USER0` (`FL_SINGLETON`) as an example.
As I said, `VALUE` is an `unsigned long`. As `VALUE` is a pointer, it may look
like `void*` would also be all right, but there is a reason for not doing
this. In fact, a `VALUE` is not always a pointer. The six cases in which
`VALUE` is not a pointer are the following:
- small integers
- symbols
- `true`
- `false`
- `nil`
- `Qundef`
I’ll explain them one by one.
As in Ruby all data are objects, integers are also objects. However, as there
are lots of different instances of integers, expressing them as structures
would risk slowing down execution. For example, when incrementing from 0
to 50000, we would hesitate to create 50000 objects just for that.
That’s why in `ruby`, to some extent, small integers are treated specially and
embedded directly into `VALUE`. “Small” means signed integers that can be held
in `sizeof(VALUE)*8-1` bits. In other words, on 32-bit machines, the integers
have 1 bit for the sign, and 30 bits for the integer part. Integers in this
range will belong to the `Fixnum` class and the other integers will belong to
the `Bignum` class.
Then, let’s see in practice the `INT2FIX()` macro that converts from a C `int`
to a `Fixnum`, and confirm that `Fixnum` are directly embedded in `VALUE`.
▼ `INT2FIX`
#define FIXNUM_FLAG 0x01
#define INT2FIX(i) ((VALUE)(((long)(i))<<1 | FIXNUM_FLAG))
(ruby.h)
In brief, shift 1 bit to the left, and bitwise or it with 1.
`0110100001000` | before conversion |
`1101000010001` | after conversion |
That means that `Fixnum` as `VALUE` will always be an odd number. On the other
hand, as Ruby object structures are allocated with `malloc()`, they are
generally placed at addresses that are multiples of 4. So they do not overlap with the
values of `Fixnum` as `VALUE`.
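The encoding can be tried out in isolation. The following is a sketch, not the macros from `ruby.h`: the `demo_` functions are hypothetical, the left shift is done on the unsigned type to stay within defined C behaviour, and no range check is made on the input.

```c
#include <assert.h>

typedef unsigned long VALUE;
#define FIXNUM_FLAG 0x01

/* Encode: shift left by one bit and set the lowest bit. */
static VALUE demo_int2fix(long i)
{
    return ((VALUE)i << 1) | FIXNUM_FLAG;
}

/* True when the lowest bit marks the VALUE as an embedded integer. */
static int demo_fixnum_p(VALUE v)
{
    return (v & FIXNUM_FLAG) != 0;
}

/* Decode: shift right by one bit, written so that the sign is preserved
   even where >> on a negative number is not an arithmetic shift. */
static long demo_fix2long(VALUE v)
{
    long x = (long)v;

    assert(demo_fixnum_p(v));
    return x < 0 ? ~(~x >> 1) : x >> 1;
}
```

With the bit pattern from the table above, `demo_int2fix(3336)` gives 6673, that is `0110100001000` becoming `1101000010001`.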
Also, to convert `int` or `long` to `VALUE`, we can use macros like
`INT2NUM()`. Any conversion macro `XXXX2XXXX` with a name
containing `NUM` can manage both `Fixnum` and `Bignum`. For example, if
`INT2NUM()` cannot fit an integer into a `Fixnum`, it automatically converts
it to a `Bignum`. `NUM2INT()` will convert both `Fixnum` and `Bignum` to
`int`. If the number can’t fit in an `int`, an exception will be raised, so
there is no need to check the value range.
What are symbols?
As this question is quite troublesome to answer, let’s start with the reasons
why symbols were necessary. First, let’s start with the `ID` type used inside
`ruby`. It’s like this:
▼ `ID`
typedef unsigned long ID;
(ruby.h)
This `ID` is a number having a one-to-one association with a string. However,
in this world it’s not possible to have an association between all strings and
a numerical value. That’s why they are limited to the one to one relationships
inside one `ruby` process. I’ll speak of the method to find an `ID` in the
next chapter “Names and name tables”.
In language implementations, there are a lot of names to handle. Method names
or variable names, constant names, file names, class names… It’s
troublesome to handle all of them as strings (`char*`), because of memory
management and memory management and memory management… Also, lots of
comparisons would certainly be necessary, but comparing strings character by
character will slow down the execution. That’s why strings are not handled
directly, something will be associated and used instead. And generally
“something” will be integers, as they are the simplest to handle.
These `ID` are found as symbols in the Ruby world. Up to `ruby 1.4`, the
values of `ID` were converted to `Fixnum`, but used as symbols. Even today
these values can be obtained using `Symbol#to_i`. However, as real use results
came piling up, it was understood that making `Fixnum` and `Symbol` the same
was not a good idea, so since 1.6 an independent class `Symbol` has been
created.
`Symbol` objects are used a lot, especially as keys for hash tables. That’s
why `Symbol`, like `Fixnum`, is embedded directly in `VALUE`. Let’s look at the
`ID2SYM()` macro converting `ID` to `Symbol` object.
▼ `ID2SYM`
#define SYMBOL_FLAG 0x0e
#define ID2SYM(x) ((VALUE)(((long)(x))<<8|SYMBOL_FLAG))
(ruby.h)
When shifted 8 bits to the left, `x` becomes a multiple of 256, which means a
multiple of 4. Then, after a bitwise or (in this case it’s the same as
adding) with `0x0e` (14 in decimal), the `VALUE` expressing the symbol is not
a multiple of 4. Nor is it an odd number. So it does not overlap the range of
any other `VALUE`. Quite a clever trick.
Finally, let’s see the reverse conversion of `ID2SYM`.
▼ `SYM2ID()`
#define SYM2ID(x) RSHIFT((long)x,8)
(ruby.h)
`RSHIFT` is a bit shift to the right. As a right shift may or may not preserve
the sign depending on the platform, it was made a macro.
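Here is a round trip through the symbol encoding, as a sketch. The `demo_` functions are hypothetical; unlike the macros in `ruby.h` they work on unsigned values only, so the sign question handled by `RSHIFT` does not arise.

```c
#include <assert.h>

typedef unsigned long VALUE;
typedef unsigned long ID;
#define SYMBOL_FLAG 0x0e

/* Encode: shift the ID left by 8 bits and put the flag in the low byte. */
static VALUE demo_id2sym(ID id)
{
    return (id << 8) | SYMBOL_FLAG;
}

/* True when the low byte of the VALUE is the symbol flag. */
static int demo_symbol_p(VALUE v)
{
    return (v & 0xff) == SYMBOL_FLAG;
}

/* Decode: drop the low byte again. */
static ID demo_sym2id(VALUE v)
{
    assert(demo_symbol_p(v));
    return v >> 8;
}
```

For example, `demo_id2sym(1)` is `0x10e`: an even number, so it cannot be a `Fixnum`, and not a multiple of 4, so it cannot be a structure address.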
`true`, `false` and `nil` are special Ruby objects. `true` and `false` represent the boolean
values. `nil` is an object used to denote that there is no object. Their
values at the C level are defined like this:
▼ `true false nil`
#define Qfalse 0        /* Ruby's false */
#define Qtrue  2        /* Ruby's true */
#define Qnil   4        /* Ruby's nil */
(ruby.h)
This time they are even numbers, but as 0 or 2 can’t be used by pointers, they
can’t overlap with other `VALUE`. That’s because the first block of
virtual memory is usually not allocated, so that programs dereferencing a `NULL`
pointer crash.
And as `Qfalse` is 0, it can also be used as false at C level. In practice, in
`ruby`, when a function returns a boolean value, it’s often made to return an
`int` or `VALUE`, and returns `Qtrue`/`Qfalse`.
For `Qnil`, there is a macro dedicated to check if a `VALUE` is `Qnil` or not,
`NIL_P()`.
▼ `NIL_P()`
#define NIL_P(v) ((VALUE)(v) == Qnil)
(ruby.h)
The name ending with `p` is a notation coming from Lisp denoting that it is a
function returning a boolean value. In other words, `NIL_P` means “is the
argument `nil`?”. It seems the “`p`” character comes from “predicate”. This
naming rule is used at many different places in `ruby`.
Also, in Ruby, `false` and `nil` are false and all the other objects are true.
However, in C, `nil` (`Qnil`) is true. That’s why in C a Ruby-style macro,
`RTEST()`, has been created.
▼ `RTEST()`
#define RTEST(v) (((VALUE)(v) & ~Qnil) != 0)
(ruby.h)
As in `Qnil` only the third lower bit is 1, in `~Qnil` only the third lower
bit is 0. Then only `Qfalse` and `Qnil` become 0 with a bitwise and.
`!=0` has been added to be certain to only have 0 or 1, to satisfy the
requirements of the glib library that only wants 0 or 1
([ruby-dev:11049]).
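The difference between C truth and Ruby truth is easy to see with the constants above. In this sketch the macros are copied from the excerpts, while `demo_c_truth()` and `demo_ruby_truth()` are hypothetical helpers added for comparison.

```c
#include <assert.h>

typedef unsigned long VALUE;

#define Qfalse 0
#define Qtrue  2
#define Qnil   4

#define NIL_P(v) ((VALUE)(v) == Qnil)
#define RTEST(v) (((VALUE)(v) & ~Qnil) != 0)

/* Truth as a plain C condition: anything non-zero is true. */
static int demo_c_truth(VALUE v)
{
    return v != 0;
}

/* Truth by Ruby's rule: only false and nil are false. */
static int demo_ruby_truth(VALUE v)
{
    return RTEST(v);
}
```

`Qnil` is the interesting case: `demo_c_truth(Qnil)` is 1 but `demo_ruby_truth(Qnil)` is 0. A `VALUE` of 1, which is the `Fixnum` 0, is true under both rules, as it should be in Ruby.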
By the way, what is the ‘`Q`’ of `Qnil`? ‘R’ I would have understood but why
‘`Q`’? When I asked, the answer was “Because it’s like that in Emacs”. I did
not have the fun answer I was expecting…
▼ `Qundef`
#define Qundef 6        /* undefined value for placeholder */
(ruby.h)
This value is used to express an undefined value in the interpreter. It can’t
be found at all at the Ruby level.
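Putting the six non-pointer cases together, a `VALUE` can be classified by looking only at its bits. This is a sketch built from the constants quoted in this chapter; `enum demo_kind` and `demo_classify()` are hypothetical and are not how `ruby` itself names or structures this test.

```c
#include <assert.h>

typedef unsigned long VALUE;

#define Qfalse 0
#define Qtrue  2
#define Qnil   4
#define Qundef 6
#define FIXNUM_FLAG 0x01
#define SYMBOL_FLAG 0x0e

enum demo_kind {
    DEMO_FIXNUM, DEMO_SYMBOL, DEMO_FALSE, DEMO_TRUE,
    DEMO_NIL, DEMO_UNDEF, DEMO_POINTER
};

/* Decide what a VALUE holds without dereferencing it. */
static enum demo_kind demo_classify(VALUE v)
{
    if (v & FIXNUM_FLAG) return DEMO_FIXNUM;           /* odd: small integer */
    if (v == Qfalse) return DEMO_FALSE;
    if (v == Qtrue) return DEMO_TRUE;
    if (v == Qnil) return DEMO_NIL;
    if (v == Qundef) return DEMO_UNDEF;
    if ((v & 0xff) == SYMBOL_FLAG) return DEMO_SYMBOL; /* low byte is 0x0e */
    return DEMO_POINTER;                               /* address of a structure */
}
```

Only when the result is `DEMO_POINTER` is it safe to cast the `VALUE` to a structure pointer and look at `basic.flags`.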
I already brought up the three important points of a Ruby object, that is
having an identity, being able to call a method, and keeping data for each
instance. In this section, I’ll explain in a simple way the structure linking
objects and methods.
In Ruby, classes exist as objects during the execution. Of course. So there
must be a structure for class objects. That structure is `struct RClass`. Its
structure type flag is `T_CLASS`.
As classes and modules are very similar, there is no need to differentiate their
content. That’s why modules also use the `struct RClass` structure, and are
differentiated by the `T_MODULE` structure flag.
▼ `struct RClass`
struct RClass {
    struct RBasic basic;
    struct st_table *iv_tbl;
    struct st_table *m_tbl;
    VALUE super;
};
(ruby.h)
First, let’s focus on the `m_tbl` (Method TaBLe) member. `struct st_table` is
a hash table used everywhere in `ruby`. Its details will be explained in the
next chapter “Names and name tables”, but basically, it is a table mapping
names to objects. In the case of `m_tbl`, it keeps the
correspondence between the name (`ID`) of the methods possessed by this class
and the method entity itself.
The fourth member `super` keeps, like its name suggests, the superclass. As it’s a
`VALUE`, it’s (a pointer to) the class object of the superclass. In Ruby there
is only one class that has no superclass (the root class): `Object`.
However I already said that all `Object` methods are defined in the `Kernel`
module, `Object` just includes it. As modules are functionally similar to
multiple inheritance, it may seem that having just `super` is problematic, but
in `ruby` some clever changes are made to make it look like single
inheritance. The details of this process will be explained in the fourth
chapter “Classes and modules”.
Because of this, `super` of the structure of `Object` points to `struct
RClass` of the `Kernel` object. Only the `super` of Kernel is NULL. So
contrary to what I said, if `super` is NULL, this `RClass` is the `Kernel`
object (figure 6).
With classes structured like this, you can easily imagine the method call
process. The `m_tbl` of the object’s class is searched, and if the method was
not found, the `m_tbl` of `super` is searched, and so on. If there is no more
`super`, that is to say the method was not found even in `Object`, then it
must not be defined.
The sequential search process in `m_tbl` is done by `search_method()`.
▼ `search_method()`
static NODE*
search_method(klass, id, origin)
    VALUE klass, *origin;
    ID id;
{
    NODE *body;

    if (!klass) return 0;
    while (!st_lookup(RCLASS(klass)->m_tbl, id, &body)) {
        klass = RCLASS(klass)->super;
        if (!klass) return 0;
    }

    if (origin) *origin = klass;
    return body;
}
(eval.c)
This function searches the method named `id` in the class object `klass`.
`RCLASS(value)` is the macro doing:
((struct RClass*)(value))
`st_lookup()` is a function that searches in `st_table` the value
corresponding to a key. If the value is found, the function returns true and
puts the found value at the address given in third parameter (`&body`).
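The same walk along `super` can be reproduced with toy data structures. In this sketch everything is a hypothetical stand-in: a small fixed-size array replaces `st_table`, a string replaces the `NODE*` method body, and the names start with `demo_` to make that clear.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long ID;

/* Toy method table entry and toy class object. */
struct demo_entry { ID id; const char *body; };
struct demo_class {
    struct demo_entry m_tbl[4];
    int n_methods;
    struct demo_class *super;
};

/* Plays the role of st_lookup(): returns 1 and fills *body when found. */
static int demo_lookup(const struct demo_class *k, ID id, const char **body)
{
    int i;

    for (i = 0; i < k->n_methods; i++) {
        if (k->m_tbl[i].id == id) {
            *body = k->m_tbl[i].body;
            return 1;
        }
    }
    return 0;
}

/* Same control flow as search_method(): try the class, then each super. */
static const char *demo_search_method(const struct demo_class *klass, ID id,
                                      const struct demo_class **origin)
{
    const char *body = NULL;

    if (!klass) return NULL;
    while (!demo_lookup(klass, id, &body)) {
        klass = klass->super;
        if (!klass) return NULL;
    }
    if (origin) *origin = klass;
    return body;
}
```

When the method is found in an ancestor, `origin` tells the caller which class actually defines it.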
Nevertheless, doing this search each time whatever the circumstances would be
too slow. That’s why in reality, once called, a method is cached. So starting
from the second time it will be found without following `super` one by one.
This cache and its search will be seen in the 15th chapter “Methods”.
In this section, I will explain the implementation of the third essential
condition, instance variables.
Instance variables are what allows each object to store characteristic data.
Having it stored in the object itself (i.e. in the object structure) may seem
all right but how is it in practice? Let’s look at the function
`rb_ivar_set()` that puts an object in an instance variable.
▼ `rb_ivar_set()`
/* write val in the id instance of obj */
VALUE
rb_ivar_set(obj, id, val)
    VALUE obj;
    ID id;
    VALUE val;
{
    if (!OBJ_TAINTED(obj) && rb_safe_level() >= 4)
        rb_raise(rb_eSecurityError,
                 "Insecure: can't modify instance variable");
    if (OBJ_FROZEN(obj)) rb_error_frozen("object");
    switch (TYPE(obj)) {
      case T_OBJECT:
      case T_CLASS:
      case T_MODULE:
        if (!ROBJECT(obj)->iv_tbl)
            ROBJECT(obj)->iv_tbl = st_init_numtable();
        st_insert(ROBJECT(obj)->iv_tbl, id, val);
        break;
      default:
        generic_ivar_set(obj, id, val);
        break;
    }
    return val;
}
(variable.c)
`rb_raise()` and `rb_error_frozen()` are both error checks. Error checks are
necessary, but they are not the main part of the processing, so you should ignore
them at first read.
After removing the error handling, only the `switch` remains, but this
switch (TYPE(obj)) {
  case T_aaaa:
  case T_bbbb:
    ...
}
form is characteristic of `ruby`. `TYPE()` is the macro that returns the
structure type flag (`T_OBJECT`, `T_STRING`, etc.) of an object. In other words, as
the type flag is an integer constant, we can branch depending on it with a
`switch`. `Fixnum` or `Symbol` do not have structures, but inside `TYPE()` a
special treatment is done to properly return `T_FIXNUM` and `T_SYMBOL`, so
there’s no need to worry.
Well, let’s go back to `rb_ivar_set()`. It seems only the treatments of
`T_OBJECT`, `T_CLASS` and `T_MODULE` are different. These 3 have been chosen on
the basis that their second member is `iv_tbl`. Let’s confirm it in practice.
▼ Structures whose second member is `iv_tbl`
/* TYPE == T_OBJECT */
struct RObject {
    struct RBasic basic;
    struct st_table *iv_tbl;
};

/* TYPE == T_CLASS or T_MODULE */
struct RClass {
    struct RBasic basic;
    struct st_table *iv_tbl;
    struct st_table *m_tbl;
    VALUE super;
};
(ruby.h)
`iv_tbl` is the Instance Variable TaBLe. It stores instance variable names and
their corresponding value.
In `rb_ivar_set()`, let’s look again at the code for the structures having
`iv_tbl`.
if (!ROBJECT(obj)->iv_tbl)
    ROBJECT(obj)->iv_tbl = st_init_numtable();
st_insert(ROBJECT(obj)->iv_tbl, id, val);
break;
`ROBJECT()` is a macro that casts a `VALUE` into a `struct RObject*`.
It’s possible that `obj` points to a `struct RClass`, but as
we’re only going to access the second member no problem will occur.
`st_init_numtable()` is a function creating a new `st_table`. `st_insert()` is
a function doing associations in a `st_table`.
In conclusion, this code does the following: if `iv_tbl` does not exist, it
creates it, then stores the [variable name → object] association.
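The "create the table on first use, then insert" pattern can be sketched on its own. Here a singly linked list stands in for `st_table`; all `demo_` names are hypothetical, and the sketch never frees its memory, which a real implementation would have to do.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long VALUE;
typedef unsigned long ID;

/* Toy stand-in for st_table: a linked list of (key, value) pairs. */
struct demo_pair { ID key; VALUE val; struct demo_pair *next; };
struct demo_table { struct demo_pair *head; };

/* Same shape as struct RObject: basic fields, then iv_tbl. */
struct demo_object { unsigned long flags; VALUE klass; struct demo_table *iv_tbl; };

static struct demo_table *demo_table_new(void)
{
    struct demo_table *t = malloc(sizeof *t);

    if (!t) abort();
    t->head = NULL;
    return t;
}

/* Like st_insert(): overwrite an existing key, otherwise add a new pair. */
static void demo_table_insert(struct demo_table *t, ID key, VALUE val)
{
    struct demo_pair *p;

    for (p = t->head; p; p = p->next) {
        if (p->key == key) { p->val = val; return; }
    }
    p = malloc(sizeof *p);
    if (!p) abort();
    p->key = key;
    p->val = val;
    p->next = t->head;
    t->head = p;
}

/* Like st_lookup(): returns 1 and fills *val when the key exists. */
static int demo_table_lookup(const struct demo_table *t, ID key, VALUE *val)
{
    const struct demo_pair *p;

    for (p = t->head; p; p = p->next) {
        if (p->key == key) { *val = p->val; return 1; }
    }
    return 0;
}

/* The table is only allocated when the first variable is stored. */
static VALUE demo_ivar_set(struct demo_object *obj, ID id, VALUE val)
{
    if (!obj->iv_tbl) obj->iv_tbl = demo_table_new();
    demo_table_insert(obj->iv_tbl, id, val);
    return val;
}

/* Reading must cope with a table that was never created. */
static int demo_ivar_get(const struct demo_object *obj, ID id, VALUE *val)
{
    return obj->iv_tbl && demo_table_lookup(obj->iv_tbl, id, val);
}
```

An object that never receives an instance variable keeps `iv_tbl` at `NULL` and costs no extra allocation.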
Warning: as `struct RClass` is a class object, this instance variable table is
for the use of the class object itself. In Ruby programs, it corresponds to
something like the following:
class C
  @ivar = "content"
end
For objects for which the structure used is not `T_OBJECT`, `T_MODULE`, or
`T_CLASS`, what happens when modifying an instance variable?
▼ `rb_ivar_set()` in the case there is no `iv_tbl`
default:
  generic_ivar_set(obj, id, val);
  break;
(variable.c)
The control is transferred to `generic_ivar_set()`. Before looking at this
function, let’s first explain its general idea.
Structures that are not `T_OBJECT`, `T_MODULE` or `T_CLASS` do not have an
`iv_tbl` member (the reason why they do not have it will be explained later).
However, a mechanism linking an instance to a `struct st_table` would allow
instances to have instance variables. In `ruby`, this was solved by using a
global `st_table`, `generic_iv_tbl` (figure 7), for these associations.
Let’s see this in practice.
▼ `generic_ivar_set()`
static st_table *generic_iv_tbl;

static void
generic_ivar_set(obj, id, val)
    VALUE obj;
    ID id;
    VALUE val;
{
    st_table *tbl;

    /* for the time being you should ignore this */
    if (rb_special_const_p(obj)) {
        special_generic_ivar = 1;
    }
    /* initialize generic_iv_tbl if it does not exist */
    if (!generic_iv_tbl) {
        generic_iv_tbl = st_init_numtable();
    }

    /* the treatment itself */
    if (!st_lookup(generic_iv_tbl, obj, &tbl)) {
        FL_SET(obj, FL_EXIVAR);
        tbl = st_init_numtable();
        st_add_direct(generic_iv_tbl, obj, tbl);
        st_add_direct(tbl, id, val);
        return;
    }
    st_insert(tbl, id, val);
}
(variable.c)
`rb_special_const_p()` is true when its parameter is not a pointer. However,
as this `if` part requires knowledge of the garbage collector, we’ll skip it
for now. I’d like you to check it again after reading chapter 5 “Garbage
collection”.
`st_init_numtable()` already appeared some time ago. It creates a new hash
table.
`st_lookup()` searches a value corresponding to a key. In this case it
searches for what’s attached to `obj`. If an attached value can be found, the
whole function returns true and stores the value at the address (`&tbl`) given
as third parameter. In short, `!st_lookup(…)` can be read “if a value can’t
be found”.
`st_insert()` was also already explained. It stores a new association in a
table.
`st_add_direct()` is similar to `st_insert()`, but the part before adding the
association that checks if the key was already stored or not is different. In
other words, in the case of `st_add_direct()`, if a key already registered is
being used, two associations linked to this same key will be stored.
`st_add_direct()` can be used when the check for existence has already been
done, as is the case here, or when a new table has just been created.
`FL_SET(obj, FL_EXIVAR)` is the macro that sets the `FL_EXIVAR` flag in the
`basic.flags` of `obj`. The `basic.flags` flags are all named `FL_xxxx` and
can be set using `FL_SET()`. These flags can be unset with `FL_UNSET()`. The
`EXIVAR` from `FL_EXIVAR` seems to be the abbreviation of EXternal Instance
VARiable.
These flags are set to speed up the reading of instance
variables. If `FL_EXIVAR` is not set, we know immediately, without even
searching `generic_iv_tbl`, that the object has no instance variables. And
of course a bit check is way faster than searching a `struct st_table`.
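The benefit of the flag can be shown with a toy version of the external table. This sketch flattens the "table of tables" into one global array of (object, name, value) triples, and the bit position of `DEMO_FL_EXIVAR` is an assumption; none of the `demo_` names exist in `ruby`.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long VALUE;
typedef unsigned long ID;

#define DEMO_FL_EXIVAR (1UL << 9)   /* hypothetical bit position */
#define DEMO_MAX 16

struct demo_basic { unsigned long flags; VALUE klass; };

/* Toy stand-in for generic_iv_tbl. */
static struct {
    struct demo_basic *obj;
    ID id;
    VALUE val;
} demo_generic[DEMO_MAX];
static int demo_generic_len;
static int demo_lookups;   /* counts how often the global table is searched */

/* Store a variable outside the object and remember that fact in its flags. */
static int demo_generic_set(struct demo_basic *obj, ID id, VALUE val)
{
    int i;

    for (i = 0; i < demo_generic_len; i++) {
        if (demo_generic[i].obj == obj && demo_generic[i].id == id) {
            demo_generic[i].val = val;
            return 1;
        }
    }
    if (demo_generic_len == DEMO_MAX) return 0;   /* toy table is full */
    obj->flags |= DEMO_FL_EXIVAR;                 /* like FL_SET(obj, FL_EXIVAR) */
    demo_generic[demo_generic_len].obj = obj;
    demo_generic[demo_generic_len].id = id;
    demo_generic[demo_generic_len].val = val;
    demo_generic_len++;
    return 1;
}

/* Read a variable; objects without the flag never touch the table. */
static int demo_generic_get(const struct demo_basic *obj, ID id, VALUE *val)
{
    int i;

    if (!(obj->flags & DEMO_FL_EXIVAR)) return 0;   /* fast path */
    demo_lookups++;
    for (i = 0; i < demo_generic_len; i++) {
        if (demo_generic[i].obj == obj && demo_generic[i].id == id) {
            *val = demo_generic[i].val;
            return 1;
        }
    }
    return 0;
}
```

Reading from an object that never had an external instance variable returns at the flag test, so `demo_lookups` stays unchanged.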
Now you should understand how the instance variables are stored, but why are
there structures without `iv_tbl`? Why is there no `iv_tbl` in `struct
RString` or `struct RArray`? Couldn’t `iv_tbl` be part of `RBasic`?
Well, this could have been done, but there are good reasons why it was not. As
a matter of fact, this problem is deeply linked to the way `ruby` manages
objects.
In `ruby`, memory used by, for example, string data (`char[]`) is directly
allocated using `malloc()`. However, the object structures are handled in a
particular way. `ruby` allocates them in clusters, and then hands them out
from these clusters. As the diversity of types (and sizes) of structures is
difficult to handle at allocation time, a type (`union`) named `RVALUE` that
combines all the structures was declared, and an array of this type is managed. As this
type’s size is the same as that of its biggest member, if there is only
one big structure, there is a lot of unused space. That’s why doing as much as
possible to keep structures at a similar size is desirable. The details about
`RVALUE` will be explained in chapter 5 “Garbage collection”.
Generally the most used structure is `struct RString`. After that, in programs
there are `struct RArray` (array), `RHash` (hash), `RObject` (user defined
object), etc. However, this `struct RObject` only uses the space of `struct
RBasic` + 1 pointer. On the other hand, `struct RString`, `RArray` and `RHash`
take the space of `struct RBasic` + 3 pointers. In other words, when putting a
`struct RObject` in the shared entity, the space for 2 pointers is useless.
And beyond that, if `RString` had 4 pointers, `RObject` would use less than
half the size of the shared entity. As you would expect, it’s wasteful.
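The wasted space can be measured with `sizeof`. In this sketch `union demo_slot` is a hypothetical, heavily simplified stand-in for `RVALUE` that combines only two structures, and `void *` replaces `struct st_table *`.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long VALUE;

struct demo_basic  { unsigned long flags; VALUE klass; };

/* RBasic + 1 pointer, like struct RObject. */
struct demo_object { struct demo_basic basic; void *iv_tbl; };

/* RBasic + 3 pointer-sized members, like struct RString. */
struct demo_string { struct demo_basic basic; long len; char *ptr;
                     union { long capa; VALUE shared; } aux; };

/* One slot must be big enough for whichever structure is stored in it. */
union demo_slot {
    struct demo_object object;
    struct demo_string string;
};

/* Bytes left unused when a slot holds the smaller structure. */
static size_t demo_wasted_in_object_slot(void)
{
    return sizeof(union demo_slot) - sizeof(struct demo_object);
}
```

On a platform where `long` and pointers are both 8 bytes, this comes to 16 bytes per object slot, which is the "space for 2 pointers" mentioned above.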
So the benefit gained from having `iv_tbl` in the structure is, more or less, a
little memory saving and speedup. Furthermore, we do not know whether it is used
often or not. In fact,
`generic_iv_tbl` was not introduced before `ruby` 1.2, so it was not possible
to use instance variables in `String` or `Array` at that time. Nevertheless, it
was not much of a problem. Wasting large amounts of memory just for
such a functionality looks stupid.
If you take all this into consideration, you can conclude that increasing the
size of object structures does not do any good.
We saw the `rb_ivar_set()` function that sets variables, so let’s see quickly
how to get them.
▼ `rb_ivar_get()`
VALUE
rb_ivar_get(obj, id)
    VALUE obj;
    ID id;
{
    VALUE val;

    switch (TYPE(obj)) {
  /* (A) */
      case T_OBJECT:
      case T_CLASS:
      case T_MODULE:
        if (ROBJECT(obj)->iv_tbl &&
            st_lookup(ROBJECT(obj)->iv_tbl, id, &val))
            return val;
        break;
  /* (B) */
      default:
        if (FL_TEST(obj, FL_EXIVAR) || rb_special_const_p(obj))
            return generic_ivar_get(obj, id);
        break;
    }
  /* (C) */
    rb_warning("instance variable %s not initialized", rb_id2name(id));

    return Qnil;
}
(variable.c)
The structure is strictly the same.
(A) For `struct RObject` or `RClass`, we search the variable in `iv_tbl`. As
`iv_tbl` can also be `NULL`, we must check it before using it. Then if
`st_lookup()` finds the relation, it returns true, so the whole `if` can be
read as “If the instance variable has been set, return its value”.
(C) If no correspondence could be found, in other words if we read an
instance variable that has not been set, we first leave the `if` then the
`switch`. `rb_warning()` will then issue a warning and `nil` will be returned.
That’s because you can read instance variables that have not been set in Ruby.
(B) On the other hand, if the structure is neither `struct RObject` nor
`RClass`, the instance variable table is searched in `generic_iv_tbl`. What
`generic_ivar_get()` does can be easily guessed, so I won’t explain it. I’d
rather want you to focus on the `if`.
I already told you that `generic_ivar_set()` sets the `FL_EXIVAR` flag to make
the check faster.
And what is `rb_special_const_p()`? This function returns true when its
parameter `obj` does not point to a structure. As no structure means no
`basic.flags`, no flag can be set, and `FL_xxxx()` will always return false.
That’s why these objects have to be treated specially.
In this section we’ll see simply, among object structures, what the important
ones contain and how they are handled.
`struct RString` is the structure for the instances of the `String` class and
its subclasses.
▼ `struct RString`
struct RString {
    struct RBasic basic;
    long len;
    char *ptr;
    union {
        long capa;
        VALUE shared;
    } aux;
};
(ruby.h)
`ptr` is a pointer to the string, and `len` the length of that string. Very
straightforward.
Rather than a string, Ruby’s string is more a byte array, and can contain any
byte including `NUL`. So when thinking at the Ruby level, ending the string
with `NUL` does not mean anything. As C functions require `NUL`, for
convenience the ending `NUL` is there, however, it is not included in `len`.
When dealing with a string coming from the interpreter or an extension
library, you can write `RSTRING(str)->ptr` or `RSTRING(str)->len`, and access
`ptr` and `len`. But there are some points to pay attention to.
- you have to check beforehand that `str` really points to a `struct RString`
- you can read the members, but you must not modify them
- you can’t store `RSTRING(str)->ptr` in something like a local variable and
use it later
Why is that? First, there is an important software engineering principle:
Don’t arbitrarily tamper with someone’s data. Interface functions are there
for a reason. However, there are concrete reasons in `ruby`’s design
why you should not do such things as consulting or storing a pointer, and
that’s related to the fourth member `aux`. However, to explain properly how to
use `aux`, we have to explain first a little more of Ruby’s strings’
characteristics.
Ruby’s strings can be modified (are mutable). By mutable I mean after the
following code:
s = "str"         # create a string and assign it to s
s.concat("ing")   # append "ing" to this string object
p(s)              # show the string
the content of the object pointed to by `s` will become “`string`”. It’s
different from Java or Python string objects. Java’s `StringBuffer` is closer.
And what’s the relation? First, mutable means the length (`len`) of the string
can change. We have to increase or decrease the allocated memory size each time
the length changes. We can of course use `realloc()` for that, but generally
`malloc()` and `realloc()` are heavy operations. Having to `realloc()` each
time the string changes is a huge burden.
That’s why the memory pointed by `ptr` has been allocated with a size a little
bigger than `len`. Because of that, if the added part can fit into the
remaining memory, it’s taken care of without calling `realloc()`, so it’s
faster. The structure member `aux.capa` contains the length including this
additional memory.
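The effect of the spare capacity can be seen in a toy string buffer. This is only a sketch: `struct demo_str`, its `reallocs` counter and the doubling growth policy are hypothetical choices for the example, not what `ruby` does internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_str {
    long len;       /* bytes in use, not counting the final NUL */
    long capa;      /* bytes available, not counting the final NUL */
    char *ptr;
    int reallocs;   /* how many times realloc() was needed */
};

/* Allocate an empty string with room for capa bytes. */
static void demo_str_init(struct demo_str *s, long capa)
{
    s->ptr = malloc((size_t)capa + 1);
    if (!s->ptr) abort();
    s->ptr[0] = '\0';
    s->len = 0;
    s->capa = capa;
    s->reallocs = 0;
}

/* Append n bytes; realloc() only when the spare room is exhausted. */
static void demo_str_cat(struct demo_str *s, const char *src, long n)
{
    if (s->len + n > s->capa) {
        long newcapa = (s->len + n) * 2;   /* hypothetical growth policy */
        char *p = realloc(s->ptr, (size_t)newcapa + 1);

        if (!p) abort();
        s->ptr = p;
        s->capa = newcapa;
        s->reallocs++;
    }
    memcpy(s->ptr + s->len, src, (size_t)n);
    s->len += n;
    s->ptr[s->len] = '\0';   /* keep the trailing NUL for C functions */
}
```

Starting with a capacity of 8, appending "str" and then "ing" fits in the spare room, so no `realloc()` happens until the content grows past 8 bytes.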
So what is this other `aux.shared`? It’s to speed up the creation of literal
strings. Have a look at the following Ruby program.
while true do       # repeat indefinitely
  a = "str"         # create a string with "str" as content and assign it to a
  a.concat("ing")   # append "ing" to the object pointed to by a
  p(a)              # show "string"
end
Whatever the number of times you repeat the loop, the fourth line’s `p` has to
show `"string"`. That’s why the code `"str"` should create, each time, a string
object holding a different `char[]`. However, if no change occurs for a lot of
strings, useless copies of `char[]` can be created many times. It would be better
to share one common `char[]`.
The trick that allows this to happen is `aux.shared`. String objects created
with a literal use one shared `char[]`. When a change occurs, the string is
copied in unshared memory, and the change is done on this new copy. This
technique is called “copy-on-write”. When using a shared `char[]`, the flag
`ELTS_SHARED` is set in the object structure’s `basic.flags`, and `aux.shared`
contains the original object. `ELTS` seems to be the abbreviation of
`ELemenTS`.
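Copy-on-write itself fits in a few lines. In this sketch `struct demo_str`, the `DEMO_ELTS_SHARED` bit and the functions are hypothetical; the sketch also leaves out `aux.shared`, so it does not track which object owns the shared bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_ELTS_SHARED 1UL   /* hypothetical flag bit */

struct demo_str { unsigned long flags; long len; char *ptr; };

/* Create a string object that borrows the bytes of a literal. */
static void demo_str_from_literal(struct demo_str *s, char *literal)
{
    s->flags = DEMO_ELTS_SHARED;
    s->len = (long)strlen(literal);
    s->ptr = literal;
}

/* Before any modification, give the object its own private copy. */
static void demo_str_modify(struct demo_str *s)
{
    char *copy;

    if (!(s->flags & DEMO_ELTS_SHARED)) return;   /* already private */
    copy = malloc((size_t)s->len + 1);
    if (!copy) abort();
    memcpy(copy, s->ptr, (size_t)s->len + 1);
    s->ptr = copy;
    s->flags &= ~DEMO_ELTS_SHARED;
}

/* Every writing operation goes through demo_str_modify() first. */
static void demo_str_set_first(struct demo_str *s, char c)
{
    assert(s->len > 0);
    demo_str_modify(s);
    s->ptr[0] = c;
}
```

Two objects created from the same literal share one `char[]` until one of them is written to; the write then lands on a fresh copy and the other object is untouched.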
But, well, let’s return to our talk about `RSTRING(str)->ptr`. Even if
consulting the pointer is OK, you must not modify it, first because the value
of `len` or `capa` will no longer agree with the content, and also because when
modifying strings created as literals, `aux.shared` has to be separated.
To finish this section about `RString`, let’s write some examples of how to use
it. `str` is a `VALUE` that points to `RString`.
RSTRING(str)->len;               /* length */
RSTRING(str)->ptr[0];            /* first character */
str = rb_str_new("content", 7);  /* create a string with "content" as its content
                                    the second parameter is the length */
str = rb_str_new2("content");    /* create a string with "content" as its content
                                    its length is calculated with strlen() */
rb_str_cat2(str, "end");         /* Concatenate a C string to a Ruby string */
`struct RArray` is the structure for the instances of Ruby’s array class
`Array`.
▼ `struct RArray`
324 struct RArray {
325 struct RBasic basic;
326 long len;
327 union {
328 long capa;
329 VALUE shared;
330 } aux;
331 VALUE *ptr;
332 };(ruby.h)
Except for the type of `ptr`, this structure is almost the same as `struct
RString`. `ptr` points to the content of the array, and `len` is its length.
`aux` is exactly the same as in `struct RString`. `aux.capa` is the “real”
length of the memory pointed to by `ptr`, and if `ptr` is shared, `aux.shared`
stores the shared original array object.
From this structure, it’s clear that Ruby’s `Array` is an array and not a
list. So when the number of elements changes significantly, a `realloc()` must
be done, and if an element must be inserted anywhere other than at the end, a
`memmove()` will occur. But even so, on current machines these operations are
impressively fast.
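To make the cost model concrete, here is a toy model of a contiguous, growable array written in Ruby. It is only a sketch: the class `ToyArray`, its `reallocs` counter and the capacity-doubling policy are assumptions made for illustration, not `ruby`’s actual code. Growing the buffer plays the role of `realloc()`, and shifting elements plays the role of `memmove()`.

```ruby
# Toy model of a contiguous, growable array (hypothetical, for illustration).
class ToyArray
  attr_reader :len, :capa, :reallocs

  def initialize(capa = 4)
    @capa = [capa, 1].max       # allocated slots, in the spirit of aux.capa
    @len = 0                    # used slots, in the spirit of len
    @buf = Array.new(@capa)     # stands in for the memory behind ptr
    @reallocs = 0
  end

  def [](index)
    (0...@len).cover?(index) ? @buf[index] : nil
  end

  def push(value)
    insert(@len, value)
  end

  # Insert value at index, shifting later elements one slot to the right.
  def insert(index, value)
    raise IndexError, "index #{index} out of range" if index < 0 || index > @len
    grow if @len == @capa
    # This shift plays the role of memmove().
    (@len - 1).downto(index) { |i| @buf[i + 1] = @buf[i] }
    @buf[index] = value
    @len += 1
    self
  end

  def to_a
    @buf[0, @len]
  end

  private

  # Double the capacity and copy the elements: the role of realloc().
  # Doubling is an assumed policy, chosen only to keep the sketch short.
  def grow
    @capa *= 2
    new_buf = Array.new(@capa)
    @len.times { |i| new_buf[i] = @buf[i] }
    @buf = new_buf
    @reallocs += 1
  end
end

ary = ToyArray.new(2)
ary.push(1).push(3)
ary.insert(1, 2)    # the buffer is full, so this grows it and then shifts 3
p ary.to_a          # [1, 2, 3]
p ary.capa          # 4
```

Pushing at the end never shifts anything, while inserting in the middle moves every later element, which is exactly the difference from a linked list.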
That’s why the way to access it is similar to `RString`. You can consult the
`RARRAY(ary)->ptr` and `RARRAY(ary)->len` members, but can’t set them, and so
on. We’ll only look at simple examples:
    /* manage an array from C */
    VALUE ary;
    ary = rb_ary_new();             /* create an empty array */
    rb_ary_push(ary, INT2FIX(9));   /* push a Ruby 9 */
    RARRAY(ary)->ptr[0];            /* look what's at index 0 */
    rb_p(RARRAY(ary)->ptr[0]);      /* do p on ary[0] (the result is 9) */

    # manage an array from Ruby
    ary = []      # create an empty array
    ary.push(9)   # push 9
    ary[0]        # look what's at index 0
    p(ary[0])     # do p on ary[0] (the result is 9)
`struct RRegexp` is the structure for the instances of the regular expression class `Regexp`.
▼ `struct RRegexp`
334 struct RRegexp {
335 struct RBasic basic;
336 struct re_pattern_buffer *ptr;
337 long len;
338 char *str;
339 };(ruby.h)
`ptr` is the regular expression after compilation. `str` is the string before
compilation (the source code of the regular expression), and `len` is this
string’s length.
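Seen from Ruby, the distinction between the source text and the compiled pattern looks roughly like this. This is a small sketch; that `Regexp#source` is backed by `str` and `len` is an assumption based on the member descriptions above.

```ruby
re = /ab+c/

p re.source            # "ab+c": the pattern text before compilation
p re.source.length     # 4
p(re =~ "xabbbc")      # 1: matching uses the compiled form of the pattern
```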
As the `Regexp` object handling code doesn’t appear in this book, we won’t see
how to use it. Even if you use it in extension libraries, as long as you do
not want to use it in a very particular way, the interface functions are enough.
`struct RHash` is the structure for Ruby’s `Hash` objects.
▼ `struct RHash`
341 struct RHash {
342 struct RBasic basic;
343 struct st_table *tbl;
344 int iter_lev;
345 VALUE ifnone;
346 };(ruby.h)
It’s a wrapper for `struct st_table`. `st_table` will be detailed in the next
chapter “Names and name tables”.
`ifnone` is the value returned when a key does not have an attached value; its
default is `nil`. `iter_lev` is there to make the hash table reentrant (multithread safe).
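Both members have a visible effect at the Ruby level. The sketch below shows the default value, and then what current CRuby versions do when a key is added while the hash is being iterated. Linking that error to `iter_lev` is an assumption about the implementation, and older versions of `ruby` may behave differently.

```ruby
plain = {}
p plain[:missing]        # nil: the default when no default value is given

counts = Hash.new(0)     # 0 becomes the value used for missing keys
counts[:a] += 1
p counts.default         # 0
p counts[:b]             # 0
p counts[:a]             # 1

# Adding a new key in the middle of an iteration is refused.
h = { a: 1 }
error = nil
begin
  h.each { |key, _value| h[:"#{key}_copy"] = 0 }
rescue RuntimeError => e
  error = e
end
p error.class            # RuntimeError on current CRuby
```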
`struct RFile` is a structure for instances of the built-in IO class and
its subclasses.
▼ `struct RFile`
348 struct RFile {
349 struct RBasic basic;
350 struct OpenFile *fptr;
351 };(ruby.h)
▼ `OpenFile`
19 typedef struct OpenFile {
20 FILE *f; /* stdio ptr for read/write */
21 FILE *f2; /* additional ptr for rw pipes */
22 int mode; /* mode flags */
23 int pid; /* child's pid (for pipes) */
24 int lineno; /* number of lines read */
25 char *path; /* pathname for file */
26 void (*finalize) _((struct OpenFile*)); /* finalize proc */
27 } OpenFile;(rubyio.h)
All members have been transferred to `struct OpenFile`. As there aren’t many
instances of `IO` objects, it’s OK to do it like this. The purpose of each member
is written in the comments. Basically, it’s a wrapper around C’s `stdio`.
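Some of this bookkeeping is visible from Ruby. The following sketch reads two lines from a temporary file and then asks the object for its path and line counter. That `File#path` and `IO#lineno` report the `path` and `lineno` members is an assumption based on the matching names, and the file name prefix is arbitrary.

```ruby
require "tempfile"

path = nil
lineno = nil
is_io = nil

Tempfile.create("rhg_example") do |file|
  file.puts("first", "second")
  file.rewind                  # back to the start; the line counter is reset
  2.times { file.gets }        # read two lines
  path = file.path             # pathname of the file
  lineno = file.lineno         # number of lines read so far
  is_io = file.is_a?(IO)       # File is a subclass of IO
end

p lineno    # 2
p is_io     # true
```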
`struct RData` has a different tenor from what we saw before. It is the
structure for implementation of extension libraries.
Of course, structures for classes created in extension libraries are necessary,
but as the types of these structures depend on the created class, it’s
impossible to know their size or structure in advance. That’s why a “structure
for managing a pointer to a user defined structure” has been created on
`ruby`’s side to manage this. This structure is `struct RData`.
▼ `struct RData`
353 struct RData {
354 struct RBasic basic;
355 void (*dmark) _((void*));
356 void (*dfree) _((void*));
357 void *data;
358 };(ruby.h)
`data` is a pointer to the user defined structure,
`dfree` is the function used to free this structure, and
`dmark` is the function for when the “mark” of the mark and sweep occurs.
Because explaining `struct RData` in full is still too complicated at this
point, for the time being let’s just look at its representation (figure 8).
You’ll read a detailed explanation of its members in chapter 5 “Garbage
collection”, where they’ll be presented once again.