
Why does the kernel need the pagetable early_dynamic_pgts? #544

Open · hao-lee opened this issue Nov 13, 2017 · 12 comments

hao-lee (Contributor) commented Nov 13, 2017

Hi,

I have finished Kernel initialization Part 1, but I still have some questions. Could you please give me some hints? Many thanks.

In arch/x86/kernel/head_64.S, several pagetables are defined. After reading this part, I think the early paging is handled by 3 tables:

(PGD)early_level4_pgt -> (PUD)level3_kernel_pgt -> (PMD)level2_kernel_pgt

The PMD table level2_kernel_pgt is filled with 256 entries, so it can map 512MB of physical space [0, 512MB).

If a virtual address is 0xffffffff81000000, these pagetables can map it to physical address 0x1000000. This is very straightforward. (I hope my understanding is correct)

However, I noticed that two more tables, starting at the label early_dynamic_pgts, are also filled. I think they are a PUD and a PMD too, used to map the kernel from _text to _end. I don't know why these two tables are needed; after all, we already have three tables that can map 512MB of physical space.
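The walk described above can be checked with a few lines of bit arithmetic; a minimal sketch (the shifts and masks follow the generic x86_64 4-level scheme, and the table names in the comments are the ones from head_64.S):

```python
# Decompose an x86_64 virtual address into 4-level page-table indices:
# 9 bits per level, 21-bit offset with 2M pages.
def decompose(va):
    return {
        "pgd": (va >> 39) & 0x1FF,   # L4 index into early_level4_pgt
        "pud": (va >> 30) & 0x1FF,   # L3 index into level3_kernel_pgt
        "pmd": (va >> 21) & 0x1FF,   # L2 index into level2_kernel_pgt
        "off": va & 0x1FFFFF,        # offset within the 2M page
    }

idx = decompose(0xFFFFFFFF81000000)
print(idx)  # {'pgd': 511, 'pud': 510, 'pmd': 8, 'off': 0}

# level2_kernel_pgt maps the kernel starting at physical 0,
# so PMD entry 8 covers physical 8 * 2MiB = 0x1000000.
print(hex(idx["pmd"] * 0x200000))  # 0x1000000
```

This confirms the straightforward mapping: virtual 0xffffffff81000000 lands in PMD entry 8, i.e. physical 0x1000000.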

danix800 (Contributor) commented Nov 15, 2017

On x86_64:

At the early (though not the very first) stages, early_dynamic_pgts is used as a PUD (first 512 entries) and a PMD (remaining entries) for mapping __PAGE_OFFSET:

va = ffff880000000000, mode = ia32e, 2M page

level         shift   size      index (hex)  index (decimal)
page offset   0       0x200000  0x0          0
L2 (PMD)      21      0x200     0x0          0
L3 (PUD)      30      0x200     0x0          0
L4 (PGD)      39      0x200     0x110        272

So __PAGE_OFFSET is mapped with early_top_pgt[272], which points to early_dynamic_pgts. If you debug with gdb you can verify this by:

(gdb) x/zg &early_top_pgt[272]

This entry should point to early_dynamic_pgts. And if you follow the paging mechanism you'll reach the PMD level, which maps the 2M pages.

The kernel code is mapped through:

va = ffffffff80000000, mode = ia32e, 2M page

level         shift   size      index (hex)  index (decimal)
page offset   0       0x200000  0x0          0
L2 (PMD)      21      0x200     0x0          0
L3 (PUD)      30      0x200     0x1FE        510
L4 (PGD)      39      0x200     0x1FF        511

Verify:

(gdb) x/zg &early_top_pgt[511]

This should point to level3_kernel_pgt, and

(gdb) x/zg &level3_kernel_pgt[510]

should point to level2_kernel_pgt. These are the 2M PMD pages.
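Both decomposition tables above can be reproduced with the same bit arithmetic; a small cross-check sketch (nothing kernel-specific beyond the 9-bits-per-level layout):

```python
def pt_indices(va):
    # 9-bit index per level; the low 21 bits are the 2M-page offset.
    return ((va >> 39) & 0x1FF,   # L4 (PGD)
            (va >> 30) & 0x1FF,   # L3 (PUD)
            (va >> 21) & 0x1FF)   # L2 (PMD)

# __PAGE_OFFSET: PGD entry 272, which points to early_dynamic_pgts
print(pt_indices(0xFFFF880000000000))   # (272, 0, 0)

# Kernel map base: PGD entry 511 -> level3_kernel_pgt[510] -> level2_kernel_pgt
print(pt_indices(0xFFFFFFFF80000000))   # (511, 510, 0)
```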


hao-lee (Contributor, author) commented Nov 18, 2017

@danix800

Thanks for your reply, but you may have misunderstood my question.

early_dynamic_pgts is used to map __PAGE_OFFSET only after the early page-fault handler is set up. What I want to ask about is the identity mapping.

early_level4_pgt was renamed to early_top_pgt in the latest kernel, but I will still use the former name to illustrate.

In the identity-mapping setup, the kernel uses the first two entries in early_level4_pgt and uses two tables starting from early_dynamic_pgts as the PUD and PMD. As a result, these three tables map the kernel from _text to _end.

                                                              +------------+ _end
                                                              |            |
                                                              |            |
                                                              |            |
                                                              |  kernel    |
                                                              |  text      |
                      ---+--------------+                     |            |
                         |              |                     |            |
                         |              |                     |            |
                         +--------------+                     |            |
                     PUD |  entry 8     +-------------------> +------------+ _text
                         +--------------+
                         |              |
                         |              |
                      ------------------+
                         |              |
                         |              |
                         |              |
                     PMD |              |
                         |              |
                         +--------------+
                         |   entry 0    |
early_dynamic_pgts+---------------------+
                         |              |
                         |              |
                         |              |
                     PGD |              |
                         +--------------+
                         |   entry 0    |
  early_level4_pgt+------+--------------+

I don't know why this mapping is needed. After deleting this code and recompiling my kernel, everything is OK. I can still boot my system normally.

danix800 (Contributor) commented

This identity mapping is for the page-table switch itself. Without such a mapping, the address in RIP would be unmapped right after %cr3 is set, so the very next instruction fetch would fault. Can you verify your test and report back to us?
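To make the failure mode concrete, here is a toy model of the switch (a dict standing in for a page table, illustrative addresses, no real kernel structures):

```python
# After `movq %rax, %cr3`, the CPU must translate the *current* RIP through
# the *new* tables before it can fetch the next instruction.
PAGE_2M = 0x200000

def translate(pt, va):
    """Toy 2M-page walk: pt maps a virtual 2M page to a physical 2M page."""
    page, off = va & ~(PAGE_2M - 1), va & (PAGE_2M - 1)
    if page not in pt:
        raise LookupError("page fault at %#x" % va)
    return pt[page] | off

rip = 0x1000100   # still a low address right after the cr3 write (illustrative)

# With only the high kernel-text mapping, the very next fetch faults:
pt_no_identity = {0xFFFFFFFF81000000: 0x1000000}
try:
    translate(pt_no_identity, rip)
except LookupError as e:
    print(e)                                   # page fault at 0x1000100

# With the identity entry, execution continues until `jmp *%rax` lands
# on the high mapping:
pt_with_identity = dict(pt_no_identity)
pt_with_identity[0x1000000] = 0x1000000
print(hex(translate(pt_with_identity, rip)))   # 0x1000100
```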

hao-lee (Contributor, author) commented Nov 19, 2017

Hi, @danix800
Thank you very much! I didn't realize that executing the following two instructions needs a temporary identity mapping.

	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax
	jmp	*%rax

Thanks for your help! I have understood why this mapping is necessary.


I have debugged my kernel step by step in Bochs and found a strange behavior.
As I said above, I deleted this code, recompiled my kernel, and ran it in Bochs. After cr3 is set to point to early_level4_pgt, Bochs warns that it can't display the physical address of the above two instructions because the page tables (i.e., the PUD and PMD) don't exist.

[333497746] ??? (physical address not available)

I ignored these warnings and let the kernel continue running. I found that the kernel can reach movl $0x80000001, %eax successfully.

The following code is copied from here.

	/* Setup early boot stage 4 level pagetables. */
	addq	phys_base(%rip), %rax
	movq	%rax, %cr3	/* pagetable switching */

	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax	/* Bochs prompts: physical address not available */
	jmp	*%rax		/* Bochs prompts: physical address not available */
1:

	/* Check if nx is implemented */
reach->	movl	$0x80000001, %eax	/* Bochs can reach here successfully!!! Everything is OK! */
	cpuid
	movl	%edx,%edi

I guess that Bochs detects the error and continues fetching instructions from physical memory even though it doesn't know what would happen. I have also tested my kernel with VMware and QEMU: the former can boot successfully as well, but QEMU can't. I think this behavior may be related to how the CPU is emulated.

danix800 (Contributor) commented

I'm investigating this too. For QEMU, when KVM is enabled (--enable-kvm) the kernel can also boot. So I think there's some page-fault handling under the hood by KVM.

arch/x86/kvm/mmu.c has the page-fault handling; that might be where the real magic happens, but I'm not sure.

hao-lee (Contributor, author) commented Nov 19, 2017

My Bochs and VMware don't have any KVM mechanism. Things get a little more interesting.

hao-lee (Contributor, author) commented Nov 24, 2017

Hi, @danix800
I happened to see your question on Stack Overflow. I also sent an email to the linux-mm mailing list, but nobody replied.

danix800 (Contributor) commented

Yes, nobody seems to be interested. I think it's all on us now. Currently I'm studying GRUB; I'll dig into it when I have time.

danix800 (Contributor) commented

I actually dug in a little a few days ago. I've already set up a debugging environment and can break in the KVM code on the exact faulting instruction.

But without a deep understanding of KVM it's difficult to unearth everything that's going on, so I gave up for now.

On the qemu-devel list nobody replied. On the linux-kernel list, here, there's also no useful info available.

Happy debugging!

hao-lee (Contributor, author) commented Nov 25, 2017

I will also keep watching this question and hope that we can solve it in the future. 😃

@0xAX 0xAX added the question label Jun 14, 2018
fangzhen commented May 5, 2023

Years after the last comments, I'm also running into this :-)

I think the behavior is related to TLB, as linux-kernel list indicates.

I made some tests on kernel v6.2 source with QEMU. The related code includes:

  1. setting up the identity mapping
  2. flushing the TLB

                 no -enable-kvm   -enable-kvm
  delete 1       boot fails       boot fails
  delete 1 & 2   boot fails       boot succeeds

In the -enable-kvm case, if we don't set up the identity mapping and don't flush the TLB, the kernel boots successfully; if we do flush the TLB, the boot fails. This makes sense if the TLB still caches the identity-mapped translations: no page fault occurs until the TLB is flushed, and once it is flushed the page fault occurs.

Without -enable-kvm, my guess is that QEMU doesn't emulate the TLB the same way a hardware TLB behaves; as a result, the page fault always occurs.

However, this is more of an obscure guess than a solid proof.
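The TLB hypothesis behind the table above can be sketched as a toy model (this models only the guessed caching behavior, not QEMU or KVM internals; all addresses are illustrative):

```python
class ToyMMU:
    """Toy MMU: a 2M-page table plus a TLB that caches translations."""
    def __init__(self, page_table):
        self.pt = dict(page_table)
        self.tlb = {}

    def translate(self, page):
        if page in self.tlb:             # TLB hit: page table not consulted
            return self.tlb[page]
        if page not in self.pt:
            raise LookupError("page fault at %#x" % page)
        self.tlb[page] = self.pt[page]   # fill the TLB on a successful walk
        return self.tlb[page]

    def flush_tlb(self):
        self.tlb.clear()

# The identity entry exists at first (set up by earlier boot stages), and the
# CPU has already executed through it, so the TLB is warm.
mmu = ToyMMU({0x1000000: 0x1000000})
mmu.translate(0x1000000)

# "delete 1 & 2": the mapping is gone from the page table, but without a
# flush the stale TLB entry still translates ("boot success"):
del mmu.pt[0x1000000]
print(hex(mmu.translate(0x1000000)))     # 0x1000000

# "delete 1" only (flush kept): the next access faults ("boot fail"):
mmu.flush_tlb()
try:
    mmu.translate(0x1000000)
except LookupError as e:
    print(e)                             # page fault at 0x1000000
```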

4 participants