Kernel
This page teaches you how the MentOS kernel works - the brain of the operating system.
Important: All code examples and implementation details are from the actual MentOS kernel source code. You can explore these files in the kernel/ directory of the MentOS repository.
The kernel is the core program that:
- Controls the hardware - CPU, RAM, disk, keyboard, screen
- Manages resources - Decides which program gets CPU time, memory, disk space
- Provides services - File reading, process creation, network communication
- Enforces security - Prevents programs from interfering with each other
Think of it as the traffic controller of your computer:
- Multiple programs want CPU time → kernel schedules them
- Multiple programs need memory → kernel allocates RAM
- Programs want to read files → kernel accesses the disk
The kernel runs in ring 0 (privileged mode) while your programs run in ring 3 (restricted mode). Programs must ASK the kernel for help through system calls.
┌──────────────────────────────────────────────────────┐
│ User Programs (ring 3) │
│ shell, ls, cat, editor, games, etc. │
└─────────────────┬────────────────────────────────────┘
│ System Calls (INT 0x80)
┌─────────────────┴────────────────────────────────────┐
│ Kernel (ring 0) │
│ │
│ ┌───────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Process │ │ Memory │ │ File │ │
│ │ Management │ │ Management │ │ System │ │
│ │ │ │ │ │ │ │
│ │ • Scheduler │ │ • Paging │ │ • VFS │ │
│ │ • fork/exec │ │ • Allocators │ │ • EXT2 │ │
│ │ • Signals │ │ • Heap mgmt │ │ • ProcFS │ │
│ └───────────────┘ └──────────────┘ └───────────┘ │
│ │
│ ┌───────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Drivers │ │ Interrupts │ │ IPC │ │
│ │ │ │ │ │ │ │
│ │ • Keyboard │ │ • Timer │ │ • Pipes │ │
│ │ • Disk (ATA) │ │ • Syscalls │ │ • Signals │ │
│ │ • RTC/PS2 │ │ • Exceptions │ │ • SysV IPC│ │
│ └───────────────┘ └──────────────┘ └───────────┘ │
└─────────────────┬────────────────────────────────────┘
│
┌─────────────────┴────────────────────────────────────┐
│ Hardware (CPU, RAM, Disk, etc.) │
└───────────────────────────────────────────────────────┘
Each box solves a specific problem. Let's understand them one by one.
The Problem: You have 1 CPU but want to run 10 programs simultaneously.
The Solution: The kernel creates the illusion of multiple CPUs by rapidly switching between programs.
Timeline (every 10ms, timer interrupt fires):
───────────────────────────────────────────────────>
Program A │ Program B │ Program A │ Program C
running │ running │ running │ running
The scheduler decides who runs next:
- Timer interrupt fires (10ms elapsed)
- Save current program's state (registers, stack pointer)
- Pick next program to run (based on priority/fairness)
- Load new program's state
- Resume execution
This happens 100 times per second, so it feels seamless!
Every running program is represented by a task_struct (see kernel/inc/process/process.h):
typedef struct task_struct {
pid_t pid; // Process ID
__volatile__ long state; // TASK_RUNNING, TASK_STOPPED, etc.
// Scheduling
list_head_t run_list;
sched_entity_t se;
// CPU/FPU state
thread_struct_t thread;
// Memory
mm_struct_t *mm;
// Files
vfs_file_descriptor_t *fd_list;
int max_fd;
// Signals
sighand_t sighand;
sigset_t blocked;
sigpending_t pending;
// Misc
char name[TASK_NAME_MAX_LENGTH];
char cwd[PATH_MAX];
} task_struct;
MentOS Implementation: See kernel/inc/process/process.h for full definition and kernel/src/process/ for task management code.
Real-world analogy: Think of task_struct as a "snapshot" of a program. The kernel can freeze any program, save its snapshot, load another program's snapshot, and resume it - like saving your game progress!
MentOS supports multiple scheduling algorithms (configurable at build time):
1. Round-Robin (RR) - Default, simple fairness
Queue: [A, B, C]
→ Run A for 10ms
→ Run B for 10ms
→ Run C for 10ms
→ Repeat
Implementation: kernel/src/process/scheduler.c
2. Completely Fair Scheduler (CFS) - Linux-inspired
Track "virtual runtime" for each process:
A: 100ms
B: 150ms ← has run the most so far
C: 80ms ← has run the least so far
→ Pick process with lowest vruntime (C)
Implementation: kernel/src/process/scheduler.c (CFS mode)
3. Priority-based - Higher priority = more CPU
Priority queue:
High priority (20): [Process A]
Medium priority (10): [Process B, Process C]
Low priority (0): [Process D]
→ Always run highest priority first
See Scheduling for details on each algorithm.
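To make the three policies concrete, here is a minimal sketch of just the "pick next" step for each one. The `demo_task_t` type and the `pick_*` functions are hypothetical names made up for this illustration; they are not taken from kernel/src/process/scheduler.c, which works on task_struct and run lists instead of a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task record (not the MentOS task_struct). */
typedef struct demo_task {
    int pid;
    unsigned long vruntime; /* CPU time consumed so far (CFS-style pick) */
    int priority;           /* larger value = more important, as in the diagram */
} demo_task_t;

/* Round-robin: the task after `current` in the queue, wrapping around. */
size_t pick_round_robin(size_t current, size_t count)
{
    assert(count > 0);
    return (current + 1) % count;
}

/* CFS-style: the task with the smallest virtual runtime. */
size_t pick_lowest_vruntime(const demo_task_t *tasks, size_t count)
{
    assert(count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++) {
        if (tasks[i].vruntime < tasks[best].vruntime)
            best = i;
    }
    return best;
}

/* Priority-based: the task with the highest priority (first one wins ties). */
size_t pick_highest_priority(const demo_task_t *tasks, size_t count)
{
    assert(count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++) {
        if (tasks[i].priority > tasks[best].priority)
            best = i;
    }
    return best;
}
```

With the vruntime values from the CFS example (A: 100ms, B: 150ms, C: 80ms), `pick_lowest_vruntime` selects C, the process that has received the least CPU time.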
The fork() syscall creates a new process:
// User program:
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("I'm the child!\n");
} else {
// Parent process
printf("I created child PID %d\n", pid);
}
What happens in the kernel:
// kernel/src/process/process.c
pid_t sys_fork(pt_regs_t *f)
{
task_struct *current = scheduler_get_current_process();
scheduler_store_context(f, current);
task_struct *proc = __alloc_task(current, current, current->name);
proc->mm = mm_clone(current->mm);
// Child returns 0
proc->thread.regs.eax = 0;
proc->thread.regs.eflags |= EFLAG_IF;
// Inherit ids
proc->sid = current->sid;
proc->pgid = current->pgid;
proc->uid = current->uid;
proc->ruid = current->ruid;
proc->gid = current->gid;
proc->rgid = current->rgid;
scheduler_enqueue_task(proc);
return proc->pid;
}
Key trick: Copy-on-Write (COW)
- Child shares parent's memory pages
- Pages marked "read-only"
- If either writes, kernel copies the page first
- This makes fork() fast!
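The copy-on-write idea can be simulated in ordinary user-space C. The sketch below is only an illustration under simplifying assumptions: `demo_page_t`, `cow_share` and `cow_write_fault` are hypothetical names, and a real kernel tracks sharing through page tables and page-fault handling, not through `malloc`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical shared page with a reference count. */
typedef struct demo_page {
    int refcount;
    unsigned char data[DEMO_PAGE_SIZE];
} demo_page_t;

/* fork(): the child maps the same page; nothing is copied yet. */
demo_page_t *cow_share(demo_page_t *page)
{
    assert(page != NULL);
    page->refcount++;
    return page;
}

/* Write fault: return a page the writer may modify privately. */
demo_page_t *cow_write_fault(demo_page_t *page)
{
    assert(page != NULL);
    if (page->refcount == 1)
        return page; /* sole owner: no copy needed */

    demo_page_t *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return NULL; /* out of memory */
    memcpy(copy->data, page->data, DEMO_PAGE_SIZE);
    copy->refcount = 1;
    page->refcount--; /* the writer no longer references the shared page */
    return copy;
}
```

The copy happens only on the first write to a shared page, so a child that calls exec() right after fork() never pays for copying the parent's memory.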
The Problem: Multiple programs running, but they all need memory. How do we prevent them from overwriting each other?
The Solution: Virtual memory - each program thinks it has the entire address space (0x00000000 - 0xBFFFFFFF) to itself!
MentOS Implementation: kernel/src/mem/ - Contains paging, page tables, memory allocators
Program A thinks: Program B thinks:
0x00000000: My code 0x00000000: My code
0x10000000: My heap 0x10000000: My heap
0xBFFFFFFF: My stack 0xBFFFFFFF: My stack
But in PHYSICAL RAM:
0x00100000: Actually Program A's code
0x00200000: Actually Program B's code
0x00300000: Actually Program A's heap
0x00400000: Actually Program B's heap
The CPU's Memory Management Unit (MMU) translates virtual addresses to physical addresses using page tables.
Virtual Address: 0x12345678
│
├─ Top 10 bits (0x048): Index into Page Directory
│ │
│ └──> Page Directory[0x048] → Points to Page Table
│ │
├─ Next 10 bits (0x345): Index into Page Table
│ │
│ └──> Page Table[0x345] → Physical Page Frame (0x00400)
│
└─ Bottom 12 bits (0x678): Offset within page
│
└──> Physical Address: (0x00400 × 4096) + 0x678 = 0x00400678
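The split of a 32-bit virtual address is plain bit arithmetic. The helpers below are a sketch with hypothetical names (they are not MentOS functions) for the two-level, 4KB-page layout described above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12u     /* 4KB pages: low 12 bits are the offset */
#define DEMO_INDEX_BITS  10u     /* 1024 entries per directory/table */
#define DEMO_INDEX_MASK  0x3FFu
#define DEMO_OFFSET_MASK 0xFFFu

/* Top 10 bits: index into the page directory. */
uint32_t pd_index(uint32_t vaddr)
{
    return vaddr >> (DEMO_PAGE_SHIFT + DEMO_INDEX_BITS);
}

/* Next 10 bits: index into the page table. */
uint32_t pt_index(uint32_t vaddr)
{
    return (vaddr >> DEMO_PAGE_SHIFT) & DEMO_INDEX_MASK;
}

/* Bottom 12 bits: offset within the page. */
uint32_t page_offset(uint32_t vaddr)
{
    return vaddr & DEMO_OFFSET_MASK;
}

/* Combine a 20-bit physical frame number with the page offset. */
uint32_t physical_address(uint32_t frame, uint32_t vaddr)
{
    assert(frame < (1u << 20));
    return (frame << DEMO_PAGE_SHIFT) | page_offset(vaddr);
}
```

For 0x12345678 these return directory index 0x048, table index 0x345 and offset 0x678; with frame 0x00400 the physical address is 0x00400678.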
Data structures:
// Page directory (one per process)
struct page_directory {
page_dir_entry_t entries[1024]; // Each points to a page table
};
// Page table (many per process)
struct page_table {
page_table_entry_t pages[1024]; // Each points to a 4KB physical page
};
// Page table entry
struct page_table_entry {
unsigned int present : 1; // Is page in RAM?
unsigned int rw : 1; // Read/write or read-only?
unsigned int user : 1; // User-accessible or kernel-only?
unsigned int other : 9; // Remaining flag bits (accessed, dirty, ...), simplified here
unsigned int frame : 20; // Physical page frame number (bits 12-31)
};
The kernel has multiple memory allocators for different needs:
1. Buddy Allocator - Allocates physical pages (4KB chunks)
Free memory split into powers of 2:
Order 0: 4KB pages
Order 1: 8KB chunks (2 pages)
Order 2: 16KB chunks (4 pages)
...
Order 10: 4MB chunks (1024 pages)
Request 12KB?
→ Round up to the next power of two: 16KB (order 2)
→ No free 16KB chunk? Split a 32KB chunk into 16KB + 16KB
→ Give one 16KB chunk, keep its buddy on the free list
2. Slab Allocator - Caches common object sizes
Frequently allocated objects (task_struct, file, inode):
→ Pre-allocate a "slab" of these objects
→ Allocation is just taking from cache (fast!)
→ Deallocation returns to cache (no fragmentation)
3. kmalloc/kfree - Kernel's malloc
kmalloc(1024) → Uses slab allocator for common sizes
kmalloc(100000) → Uses buddy allocator for large chunks
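Two small calculations sit at the heart of a classic buddy allocator: rounding a request up to an order, and finding a block's buddy. The sketch below uses hypothetical helper names and the order range listed above (0 to 10); it illustrates the general technique and is not the MentOS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u
#define DEMO_MAX_ORDER 10u

/* Smallest order whose block (4KB << order) can hold `bytes`.
 * Returns -1 if the request is larger than the biggest block. */
int buddy_order_for(size_t bytes)
{
    size_t block = DEMO_PAGE_SIZE;
    for (unsigned order = 0; order <= DEMO_MAX_ORDER; order++) {
        if (bytes <= block)
            return (int)order;
        block <<= 1;
    }
    return -1;
}

/* Offset of a block's buddy: flip the bit that matches the block size.
 * `block_offset` must be aligned to the block size of `order`. */
uint32_t buddy_of(uint32_t block_offset, unsigned order)
{
    assert(order <= DEMO_MAX_ORDER);
    return block_offset ^ (DEMO_PAGE_SIZE << order);
}
```

A 12KB request maps to order 2 (16KB). When a block is freed, `buddy_of` tells the allocator which neighbour to check; if that neighbour is also free, the two merge into one block of the next order.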
The Problem: The disk is just a giant array of bytes. How do we organize it into files and folders?
The Solution: The Virtual File System (VFS) provides an abstraction layer. Programs use open/read/write, and VFS translates to the actual filesystem (EXT2, ProcFS, etc.).
User Program:
fd = open("/home/user/file.txt", O_RDONLY);
read(fd, buffer, 100);
↓
System Call (INT 0x80)
↓
VFS Layer:
vfs_open("/home/user/file.txt")
→ Parse path: / → home → user → file.txt
→ Resolve to a filesystem object
→ Create vfs_file_t structure
→ Return file descriptor (integer)
↓
EXT2 Filesystem Driver:
ext2_read(inode, buffer, offset, count)
→ Read inode's block list
→ Find physical disk blocks
→ Call disk driver
↓
ATA Disk Driver:
ata_read_sectors(block_number, buffer)
→ Send commands to disk controller
→ Wait for disk to read data
→ Copy data to buffer
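The "parse path" step in the VFS layer boils down to splitting the path at each '/'. Here is a minimal sketch of that step; `split_path` and the `DEMO_*` limits are hypothetical and chosen for this example, and the real path resolution code also has to handle mount points, "." and "..", and symbolic links.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NAME_MAX  32
#define DEMO_MAX_DEPTH 8

/* Splits an absolute path into its components.
 * Returns the number of components, or -1 on error
 * (relative path, component too long, too many components). */
int split_path(const char *path, char out[][DEMO_NAME_MAX], int max_parts)
{
    int count = 0;

    assert(out != NULL);
    if (path == NULL || path[0] != '/')
        return -1;

    while (*path != '\0') {
        while (*path == '/')
            path++; /* skip one or more separators */
        if (*path == '\0')
            break;

        size_t len = strcspn(path, "/"); /* length of this component */
        if (len >= DEMO_NAME_MAX || count >= max_parts)
            return -1;

        memcpy(out[count], path, len);
        out[count][len] = '\0';
        count++;
        path += len;
    }
    return count;
}
```

For "/home/user/file.txt" this yields the three components home, user and file.txt, which the VFS would then look up one directory at a time starting from the root.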
MentOS exposes VFS objects via vfs_file_t and related types (see kernel/inc/fs/vfs_types.h):
typedef struct vfs_file {
char name[NAME_MAX];
void *device;
uint32_t mask;
uint32_t uid;
uint32_t gid;
uint32_t flags;
uint32_t ino;
uint32_t length;
uint32_t open_flags;
size_t f_pos;
vfs_file_operations_t *fs_operations;
} vfs_file_t;
typedef struct super_block {
char name[NAME_MAX];
char path[PATH_MAX];
struct vfs_file *root;
file_system_type_t *type;
} super_block_t;
typedef struct vfs_file_descriptor {
struct vfs_file *file_struct;
int flags_mask;
} vfs_file_descriptor_t;
See File Systems for complete details.
The Problem: Hardware needs to notify the CPU (keyboard pressed, disk finished reading, timer tick).
The Solution: Interrupts - hardware signals that pause the CPU and run a handler function.
MentOS Implementation: kernel/src/descriptor_tables/ (GDT/IDT setup) and kernel/src/hardware/ (timer/IRQ handling)
Hardware Interrupts (IRQs):
IRQ 0: Timer (fires every 10ms)
IRQ 1: Keyboard
IRQ 14/15: Disk (ATA)
Software Interrupts:
INT 0x80: System calls (see kernel/inc/system/syscall.h)
CPU Exceptions:
0: Divide by zero
6: Invalid opcode
13: General protection fault
14: Page fault
1. CPU executing normal code:
mov eax, [ebx]
add eax, 5
← Timer interrupt fires (IRQ 0)
2. CPU automatically:
• Pushes current state (EFLAGS, CS, EIP) onto stack
• Looks up handler in IDT (Interrupt Descriptor Table)
• Jumps to handler
3. Handler runs:
void timer_handler(pt_regs_t *regs) {
tick_count++;
scheduler_tick(); // Maybe switch processes
}
4. Handler returns (IRET instruction):
• Pops state from stack
• Resumes interrupted code
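Step 3 usually goes through one common entry point that looks up the C handler for the vector number. The sketch below shows that lookup with a simple function-pointer table. All `demo_*` names are hypothetical, `demo_regs_t` is a stand-in for pt_regs_t, and vector 32 for the timer assumes the common convention of remapping IRQ 0 to 0x20, which this page does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_VECTORS 256

/* Hypothetical stand-in for pt_regs_t: only the vector number. */
typedef struct demo_regs {
    uint32_t int_no;
} demo_regs_t;

typedef void (*demo_handler_t)(demo_regs_t *regs);

/* One slot per interrupt vector; NULL means "no handler installed". */
static demo_handler_t demo_handlers[DEMO_VECTORS];

void demo_install_handler(uint8_t vector, demo_handler_t handler)
{
    demo_handlers[vector] = handler;
}

/* Called from the common entry stub. Returns 1 if a handler ran. */
int demo_dispatch(demo_regs_t *regs)
{
    assert(regs != NULL);
    if (regs->int_no >= DEMO_VECTORS || demo_handlers[regs->int_no] == NULL)
        return 0; /* unhandled interrupt */
    demo_handlers[regs->int_no](regs);
    return 1;
}

/* Example handler: counts timer ticks. */
unsigned long demo_ticks;

void demo_timer_handler(demo_regs_t *regs)
{
    (void)regs;
    demo_ticks++;
}
```

After `demo_install_handler(32, demo_timer_handler)`, every dispatch of vector 32 increments `demo_ticks`, while vectors without a handler are reported as unhandled.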
struct idt_entry {
uint16_t offset_low; // Handler address (low 16 bits)
uint16_t selector; // Code segment
uint8_t zero; // Reserved
uint8_t type_attr; // Gate type, DPL, present
uint16_t offset_high; // Handler address (high 16 bits)
};
// 256 entries (0-255)
idt_entry_t idt[256];
Setting up an interrupt handler:
// kernel/src/descriptor_tables/idt.c
void idt_install_handler(uint8_t num, void (*handler)(pt_regs_t *)) {
idt[num].offset_low = (uint32_t)handler & 0xFFFF;
idt[num].offset_high = ((uint32_t)handler >> 16) & 0xFFFF;
idt[num].selector = KERNEL_CODE_SEGMENT;
idt[num].type_attr = 0x8E; // Present, ring 0, interrupt gate
}
The Problem: Each hardware device has its own interface (registers, commands, protocols).
The Solution: Device drivers - kernel modules that know how to talk to specific hardware.
// kernel/src/drivers/keyboard.c
void keyboard_init(void) {
// Install interrupt handler for IRQ 1
install_interrupt_handler(IRQ_KEYBOARD, keyboard_handler);
enable_irq(IRQ_KEYBOARD);
}
void keyboard_handler(pt_regs_t *regs) {
// Read scan code from keyboard controller
uint8_t scancode = inb(KEYBOARD_DATA_PORT); // I/O port 0x60
// Translate scancode to ASCII
char key = scancode_to_ascii(scancode);
// Add to keyboard buffer
keyboard_buffer_push(key);
}
// kernel/src/drivers/ata.c
void ata_read_sector(uint32_t lba, void *buffer) {
// Send LBA (Logical Block Address) to disk
outb(ATA_PORT_LBA_LOW, lba & 0xFF);
outb(ATA_PORT_LBA_MID, (lba >> 8) & 0xFF);
outb(ATA_PORT_LBA_HIGH, (lba >> 16) & 0xFF);
// Send READ command
outb(ATA_PORT_COMMAND, ATA_CMD_READ);
// Wait for disk ready
while (!(inb(ATA_PORT_STATUS) & ATA_STATUS_DRQ));
// Read 512 bytes from data port
insw(ATA_PORT_DATA, buffer, 256); // Read 256 words (512 bytes)
}
When the bootloader jumps to the kernel, here's what happens:
// kernel/src/kernel.c
void kmain(multiboot_info_t *mboot_info)
{
// 1. Video
video_init();
printf("MentOS starting...\n");
// 2. Descriptor tables
gdt_init(); // Global Descriptor Table (memory segments)
idt_init(); // Interrupt Descriptor Table
// 3. Memory
paging_init(mboot_info); // Enable virtual memory
buddy_init(); // Physical page allocator
slab_init(); // Object allocator
// 4. Interrupts
pic_init(); // Programmable Interrupt Controller
timer_init(100); // Timer at 100 Hz
keyboard_init(); // Keyboard driver
// 5. File system
vfs_init(); // Virtual File System
ext2_init(); // EXT2 driver
procfs_init(); // /proc filesystem
// 6. Mount root filesystem
vfs_mount("/", rootfs_image);
// 7. Scheduler
scheduler_init();
// 8. Create init process (PID 1)
task_struct *init = create_process("/bin/init");
scheduler_add_task(init);
// 9. Enable interrupts
sti();
// 10. Idle loop
while (1) {
hlt(); // Wait for interrupt
}
}
The x86 CPU has privilege levels (rings 0-3). MentOS uses:
- Ring 0 - Kernel mode (full access)
- Ring 3 - User mode (restricted)
Switching happens on syscalls, interrupts, and exceptions.
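On x86, the low two bits of a segment selector hold its privilege level, so the CS value saved on an interrupt tells the kernel which ring was interrupted. The helpers below are a sketch with hypothetical names; the selector values in the usage note (0x08 for kernel code, 0x1B for user code) are the conventional ones for a flat GDT layout and are an assumption, not values taken from the MentOS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Privilege level encoded in the low two bits of a segment selector. */
unsigned selector_privilege(uint16_t selector)
{
    return selector & 0x3u;
}

/* Nonzero if the saved CS belongs to ring 3 (user mode). */
int came_from_user_mode(uint16_t saved_cs)
{
    return selector_privilege(saved_cs) == 3u;
}
```

With the assumed selectors, `came_from_user_mode(0x1B)` is true and `came_from_user_mode(0x08)` is false. A kernel can use a check like this to decide, for example, whether a fault should become a signal for a program or a kernel panic.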
The timer interrupt preempts (forcibly pauses) the running program:
Program A running
↓
Timer interrupt (10ms elapsed)
↓
Kernel scheduler picks Program B
↓
Program A is paused (saved state)
Program B resumes
This prevents any program from hogging the CPU.
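One common way to implement this is a per-process time slice that the timer handler counts down. The sketch below assumes the 100 Hz timer mentioned on this page (one tick every 10ms); `demo_slice_t` and `demo_tick` are hypothetical names, and the idea of a slice spanning several ticks is an assumption of this example, not a description of the MentOS scheduler.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HZ      100                /* timer frequency used on this page */
#define DEMO_TICK_MS (1000 / DEMO_HZ)   /* 10ms per tick */

/* Hypothetical per-process time budget. */
typedef struct demo_slice {
    int remaining_ms;
} demo_slice_t;

/* Called once per timer tick for the running process.
 * Returns 1 when the slice is used up and the scheduler should switch. */
int demo_tick(demo_slice_t *slice)
{
    assert(slice != NULL);
    slice->remaining_ms -= DEMO_TICK_MS;
    return slice->remaining_ms <= 0;
}
```

A process with a 30ms slice survives two ticks and is preempted on the third; with a 10ms slice it is preempted on every tick, which matches the diagram above.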
Multiple parts of the kernel might access shared data:
// BAD: Race condition!
if (free_list_head != NULL) {
// ← Interrupt here could corrupt free_list!
node = free_list_head;
free_list_head = node->next;
}
// GOOD: Use spinlock
spinlock_lock(&free_list_lock);
if (free_list_head != NULL) {
node = free_list_head;
free_list_head = node->next;
}
spinlock_unlock(&free_list_lock);
// User calls read():
read(fd, buf, 100);
// Libc wrapper:
_syscall3(read, int, fd, void *, buf, size_t, count) {
eax = SYSCALL_NUMBER_READ; // 3
ebx = fd;
ecx = buf;
edx = count;
INT 0x80; // ← Switch to kernel mode
return eax;
}
// Kernel:
void syscall_handler(pt_regs_t *regs) {
uint32_t syscall_num = regs->eax;
switch (syscall_num) {
case 3: // read
regs->eax = sys_read(regs->ebx, regs->ecx, regs->edx);
break;
}
}
Start here:
- kernel/src/kernel.c - kmain() function (initialization)
- kernel/src/process/scheduler.c - How processes are scheduled
- kernel/src/mem/paging.c - Virtual memory implementation
- kernel/src/system/syscall.c - System call dispatcher
- kernel/src/drivers/ - Device drivers (keyboard, disk, etc.)
Key directories:
kernel/
├── src/
│ ├── kernel.c # Main kernel initialization
│ ├── process/ # Process management and scheduling
│ │ ├── scheduler.c # CPU scheduling algorithms
│ │ ├── fork.c # Process creation
│ │ └── wait.c # Process synchronization
│ ├── mem/ # Memory management
│ │ ├── paging.c # Virtual memory (page tables)
│ │ ├── zone.c # Physical memory zones
│ │ └── slab.c # Object allocator
│ ├── fs/ # File systems
│ │ ├── vfs.c # Virtual File System
│ │ ├── ext2.c # EXT2 filesystem
│ │ └── namei.c # Path resolution
│ ├── drivers/ # Device drivers
│ │ ├── keyboard.c # Keyboard driver
│ │ ├── ata.c # Disk driver
│ │ └── video.c # VGA text mode
│ ├── system/ # System calls
│ │ ├── syscall.c # Syscall dispatcher
│ │ └── signal.c # Signal handling
│ └── descriptor_tables/ # CPU tables
│ ├── gdt.c # Global Descriptor Table
│ ├── idt.c # Interrupt Descriptor Table
│ └── isr.c # Interrupt Service Routines
- Architecture - Overall system structure
- System Calls - How userspace talks to kernel
- Scheduling - CPU scheduling algorithms explained
- File Systems - VFS and EXT2 details
- Userspace Programs - How programs use kernel services
- Debugging - How to debug kernel code
Key takeaway: The kernel is not magic - it's code that manages hardware and provides services. Start with one subsystem (e.g., scheduler) and trace through how it works!
.extern isr_handler
.extern irq_handler
// Register handler
interrupt_handler_install(vector, handler_func, flags)
### Device Drivers
Located in `kernel/src/drivers/`
**Implemented Drivers:**
- **Keyboard** - PS/2 keyboard input handling
- **ATA** - IDE/ATA disk driver for reading sectors
- **RTC** (Real-Time Clock) - System time and calendar
- **Video** - VGA text mode display (80x25 console)
**Driver Interface:**
```c
typedef struct device_driver {
const char *name;
int (*probe)(void); // Detect hardware
int (*init)(void); // Initialize
void (*interrupt)(void); // Handle interrupt
} device_driver_t;
```
Located in kernel/inc/io/ and kernel/src/io/
Kernel Logging:
- 8 log levels (0=EMERGENCY to 7=DEBUG)
- Per-file log level control
- Debug macros: pr_err(), pr_warn(), pr_notice(), pr_debug()
Available Log Levels:
#define LOGLEVEL_EMERGENCY 0 // System unusable
#define LOGLEVEL_ALERT 1 // Immediate action required
#define LOGLEVEL_CRITICAL 2 // Critical conditions
#define LOGLEVEL_ERROR 3 // Error conditions
#define LOGLEVEL_WARNING 4 // Warning conditions
#define LOGLEVEL_NOTICE 5 // Normal but significant
#define LOGLEVEL_INFO 6 // Informational
#define LOGLEVEL_DEBUG 7 // Debug information
Debug Functions:
- dbg_print_*() - Print various data structures
- panic() - Kernel panic with message
Located in kernel/inc/system/ and kernel/src/system/
Over 60 system calls implemented across categories:
- Process Management - fork, exec, exit, wait, getpid, setpgid, signals
- File Operations - open, close, read, write, lseek, stat, chmod
- Memory - brk, mmap, munmap
- IPC - semget, semop, msgget, msgsnd, msgrcv, shmget, shmat
- Filesystem - mkdir, rmdir, unlink, symlink, readlink
- Timing - time, sleep, timer operations
- User/Group - getuid, setuid, getgid, setgid
- Misc - ioctl, fcntl, uname, reboot
See System Calls page for complete reference.
1. Boot Assembly (stack.S)
├─ Set up minimal stack
└─ Jump to kmain()
2. kmain() (kernel.c)
├─ Parse multiboot info
├─ Initialize GDT
├─ Initialize IDT
├─ Enable paging
├─ Initialize memory allocators
├─ Initialize VFS
├─ Initialize scheduler
├─ Initialize devices
├─ Load initial programs
└─ Enable interrupts
3. Scheduler takes over
└─ Execute init process
The scheduler performs context switching through:
- Store Context - Save current process registers to task_struct
- Select Next - Choose next process based on scheduling algorithm
- Restore Context - Load new process registers from its task_struct
- Switch CR3 - Load new page directory for memory mapping
void scheduler_store_context(pt_regs_t *f, task_struct *process) {
// Save CPU state from interrupt frame to task
process->thread.esp0 = f->esp;
process->thread.ss0 = f->ss;
// ... save other registers
}
void scheduler_run(pt_regs_t *f) {
// Store current process context
scheduler_store_context(f, current);
// Select next process
current = select_next_process();
// Switch memory context
paging_switch_page_directory(current->mm->pgdir);
// Restore the next process's registers into the interrupt frame
f->esp = current->thread.esp0;
f->eip = current->thread.eip;
// ... restore other registers
}
The kernel handles CPU exceptions like page faults, general protection faults, etc.
void page_fault_handler(pt_regs_t *regs) {
unsigned long addr;
asm("mov %%cr2, %0" : "=r" (addr));
// Check if fault address is in a valid vm_area
vm_area_t *vma = find_vma(current, addr);
if (!vma) {
// Invalid access - send SIGSEGV signal
send_sig(SIGSEGV, current);
} else {
// Allocate page on demand
allocate_page_on_fault(vma, addr);
}
}
1. Process Creation (fork)
├─ Allocate new task_struct
├─ Copy parent's memory (with COW)
├─ Copy file descriptors
├─ Copy signal handlers
└─ Add to scheduler runqueue
2. Process Execution (exec)
├─ Load ELF binary
├─ Clear old memory maps
├─ Set up new address space
└─ Jump to entry point
3. Process Termination (exit)
├─ Close file descriptors
├─ Free memory
├─ Notify parent (SIGCHLD)
├─ Reparent children to init
└─ Remove from scheduler