
Kernel initialization. Part 4.

Kernel entry point

If you have read the previous part - Last preparations before the kernel entry point - you may remember that we finished all pre-initialization stuff and stopped right before the call of the start_kernel function from init/main.c.

start_kernel is the entry point of the generic and architecture-independent kernel code, although we will return to the arch/ folder many times. If you look into the start_kernel function, you can see that this function is very big. For this moment it contains about 86 calls of functions. Yes, it's very big and of course this part will not cover all the processes that occur in this function. In the current part we will only start to do it. This part and all the next parts which will be in the Kernel initialization process chapter will cover it.

The main purpose of start_kernel is to finish the kernel initialization process and launch the first init process. Before the first process starts, start_kernel must do a lot of things such as: enable the lock validator, initialize the processor id, enable the early cgroups subsystem, set up per-cpu areas, initialize different caches in vfs, initialize the memory manager, rcu, vmalloc, the scheduler, IRQs, ACPI and many many more. Only after these steps will we see the launch of the first init process in the last part of this chapter. So much kernel code awaits us, let's start.

NOTE: all parts of this big chapter, Linux Kernel initialization process, will not cover anything related to debugging. There will be a separate chapter about kernel debugging tips.

A little about function attributes

As I wrote above, the start_kernel function is defined in init/main.c. This function is defined with the __init attribute and, as you may already know from other parts, all functions which are defined with this attribute are necessary during kernel initialization:

#define __init      __section(.init.text) __cold notrace

After the initialization process has finished, the kernel releases these sections with a call to the free_initmem function. Note that __init is defined with two more attributes: __cold and notrace. The purpose of the first one, __cold, is to mark that the function is rarely used and the compiler must optimize it for size. The second one, notrace, is defined as:

#define notrace __attribute__((no_instrument_function))

where no_instrument_function tells the compiler not to generate profiling function calls.

In the definition of the start_kernel function you can also see the __visible attribute which expands to:

#define __visible __attribute__((externally_visible))

where externally_visible tells the compiler that something uses this function or variable, to prevent the compiler from marking it as unusable. You can find the definition of this and other macro attributes in include/linux/init.h.
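
To get a feel for these attributes outside of the kernel, here is a small user-space sketch that applies the same GCC attributes to an ordinary function; the section name, the macro names and the functions themselves are made up for illustration:

#include <stdio.h>

/* Place the function into a custom section, mark it as rarely used
 * (optimized for size) and exclude it from -finstrument-functions
 * profiling - the same attributes that hide behind __init and notrace. */
#define my_init __attribute__((section(".my.init.text"), cold, no_instrument_function))

/* Tell the compiler that something outside of this translation unit
 * uses the function, so it must not be discarded - like __visible. */
#define my_visible __attribute__((externally_visible))

static my_init void early_setup(void)
{
        printf("one-time setup code\n");
}

my_visible int run_setup(void)
{
        early_setup();
        return 0;
}

int main(void)
{
        return run_setup();
}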

First steps in the start_kernel

At the beginning of start_kernel you can see the definition of these two variables:

char *command_line;
char *after_dashes;

The first represents a pointer to the kernel command line and the second will contain the result of the parse_args function, which parses an input string with parameters in the form name=value, looking for specific keywords and invoking the right handlers. We will not go into the details related to these two variables at this time, but we will see them in the next parts. In the next step we can see a call to the set_task_stack_end_magic function. This function takes the address of init_task and sets STACK_END_MAGIC (0x57AC6E9D) as a canary for it. init_task represents the initial task structure:

struct task_struct init_task = INIT_TASK(init_task);

where task_struct stores all the information about a process. I will not explain this structure in this book because it is very big. You can find its definition in include/linux/sched.h. At this moment task_struct contains more than 100 fields! Although you will not see an explanation of task_struct in this book, we will use it very often since it is the fundamental structure which describes a process in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice.

You can see the definition of init_task here and it is initialized by the INIT_TASK macro. This macro comes from include/linux/init_task.h and it fills init_task - the first process - with values. For example it sets:

  • the init process state to zero or runnable. A runnable process is one which is waiting for a CPU to run on;

  • the init process flags to PF_KTHREAD which means - kernel thread;

  • a list of runnable tasks;

  • the process address space;

  • the init process stack to &init_thread_info which is init_thread_union.thread_info, and init_thread_union has the type thread_union which contains thread_info and the process stack:

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};

Every process has its own stack, and it is 16 kilobytes or 4 page frames on x86_64. We can note that it is defined as an array of unsigned long. The next field of the thread_union is thread_info, defined as:

struct thread_info {
        struct task_struct      *task;
        struct exec_domain      *exec_domain;
        __u32                   flags; 
        __u32                   status;
        __u32                   cpu;
        int                     saved_preempt_count;
        mm_segment_t            addr_limit;
        struct restart_block    restart_block;
        void __user             *sysenter_return;
        unsigned int            sig_on_uaccess_error:1;
        unsigned int            uaccess_err:1;
};

and occupies 52 bytes. The thread_info structure contains architecture-specific information about the thread. We know that on x86_64 the stack grows down and thread_union.thread_info is stored at the bottom of the stack in our case. So the process stack is 16 kilobytes and thread_info is at the bottom. The remaining thread_size will be 16 kilobytes - 52 bytes = 16332 bytes. Note that thread_union is represented as a union and not a structure, which means that thread_info and the stack share the memory space.

Schematically it can be represented as follows:

+-----------------------+
|                       |
|                       |
|        stack          |
|                       |
|_______________________|
|          |            |
|          |            |
|          |            |
|__________↓____________|             +--------------------+
|                       |             |                    |
|      thread_info      |<----------->|     task_struct    |
|                       |             |                    |
+-----------------------+             +--------------------+

http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct
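
To illustrate the idea of thread_info and the stack sharing one memory area, here is a simplified user-space sketch; the reduced thread_info fields and the printed layout are only a stand-in for the real kernel definitions:

#include <stdio.h>

#define THREAD_SIZE (16 * 1024)   /* 16 kilobytes, as on x86_64 */

/* A reduced thread_info: just enough fields to show the layout. */
struct thread_info {
        unsigned int flags;
        unsigned int cpu;
};

/* thread_info and the stack overlap: the stack occupies the whole
 * 16 kilobytes and thread_info sits at its lowest addresses. */
union thread_union {
        struct thread_info thread_info;
        unsigned long stack[THREAD_SIZE / sizeof(long)];
};

static union thread_union init_thread_union;

int main(void)
{
        printf("union size:      %zu bytes\n", sizeof(init_thread_union));
        printf("thread_info at:  %p\n", (void *)&init_thread_union.thread_info);
        printf("stack bottom at: %p\n", (void *)init_thread_union.stack);
        printf("stack top at:    %p\n",
               (void *)(init_thread_union.stack + THREAD_SIZE / sizeof(long)));
        return 0;
}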

So the INIT_TASK macro fills these task_struct's fields and many many more. As I already wrote above, I will not describe all the fields and values in the INIT_TASK macro but we will see them soon.

Now let's go back to the set_task_stack_end_magic function. This function is defined in kernel/fork.c and sets a canary to the init process stack to prevent stack overflow.

void set_task_stack_end_magic(struct task_struct *tsk)
{
    unsigned long *stackend;
    stackend = end_of_stack(tsk);
    *stackend = STACK_END_MAGIC; /* for overflow detection */
}

Its implementation is simple. set_task_stack_end_magic gets the end of the stack for the given task_struct with the end_of_stack function. Earlier (and now for all architectures other than x86_64) the stack was located in the thread_info structure, so the end of a process stack depends on the CONFIG_STACK_GROWSUP configuration option. As we are learning the x86_64 architecture, the stack grows down, so the end of the process stack will be:

(unsigned long *)(task_thread_info(p) + 1);

where task_thread_info just returns the stack which we filled with the INIT_TASK macro:

#define task_thread_info(task)  ((struct thread_info *)(task)->stack)

From the Linux kernel v4.9-rc1 release, the thread_info structure may contain only flags and the stack pointer resides in the task_struct structure which represents a thread in the Linux kernel. This depends on the CONFIG_THREAD_INFO_IN_TASK kernel configuration option which is enabled by default for x86_64. You can verify this if you look in the init/Kconfig configuration build file:

config THREAD_INFO_IN_TASK
    bool
    help
      Select this to move thread_info off the stack into task_struct.  To
      make this work, an arch will need to remove all thread_info fields
      except flags and fix any runtime bugs.

      One subtle change that will be needed is to use try_get_task_stack()
      and put_task_stack() in save_thread_stack_tsk() and get_wchan().

and arch/x86/Kconfig:

config X86
    def_bool y
        ...
        ...
        ...
        select THREAD_INFO_IN_TASK
        ...
        ...
        ...

So, in this way we may just get the end of a thread stack from the given task_struct structure:

#ifdef CONFIG_THREAD_INFO_IN_TASK
static inline unsigned long *end_of_stack(const struct task_struct *task)
{
    return task->stack;
}
#endif

As we got the end of the init process stack, we write STACK_END_MAGIC there. After the canary is set, we can check it like this:

if (*end_of_stack(task) != STACK_END_MAGIC) {
        //
        // handle stack overflow here
        //
}
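
The same idea can be played with in user space: put a magic value into the lowest word of a downward-growing "stack" buffer and check it after a simulated overflow. The buffer and the overflow below are artificial; only the magic value is the kernel's:

#include <stdio.h>

#define STACK_END_MAGIC 0x57AC6E9DUL
#define STACK_WORDS     64

static unsigned long fake_stack[STACK_WORDS];

int main(void)
{
        /* The "end" of a downward-growing stack is its lowest word. */
        unsigned long *stackend = &fake_stack[0];
        *stackend = STACK_END_MAGIC;

        /* Simulate an overflow that runs past the usable area
         * and tramples the canary word. */
        for (int i = STACK_WORDS - 1; i >= 0; i--)
                fake_stack[i] = 0;

        if (*stackend != STACK_END_MAGIC)
                printf("stack overflow detected\n");
        else
                printf("stack is fine\n");

        return 0;
}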

The next function after the set_task_stack_end_magic is smp_setup_processor_id. This function has an empty body for x86_64:

void __init __weak smp_setup_processor_id(void)
{
}

as it is not implemented for all architectures, only for some such as s390 and arm64.

The next function in start_kernel is debug_objects_early_init. Its implementation is almost the same as lockdep_init, but it fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which are for debugging purposes in this chapter.

After the debug_objects_early_init function we can see the call of the boot_init_stack_canary function which fills task_struct->stack_canary with the canary value for the -fstack-protector gcc feature. This function depends on the CONFIG_CC_STACKPROTECTOR configuration option; if this option is disabled, boot_init_stack_canary does nothing, otherwise it generates random numbers based on the random pool and the TSC:

get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);

After we got a random number, we fill the stack_canary field of task_struct with it:

current->stack_canary = canary;

and write this value to the top of the IRQ stack with the:

this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write
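
For the curious, here is a rough user-space approximation of this canary generation, using getrandom in place of the kernel's random pool and the rdtsc intrinsic for the time stamp counter; it is only a sketch of the idea on x86_64 (64-bit unsigned long), not the real boot_init_stack_canary:

#include <stdio.h>
#include <sys/random.h>   /* getrandom() */
#include <x86intrin.h>    /* __rdtsc()   */

int main(void)
{
        unsigned long canary = 0;
        unsigned long tsc;

        /* Random bytes play the role of get_random_bytes(). */
        if (getrandom(&canary, sizeof(canary), 0) != (ssize_t)sizeof(canary))
                return 1;

        /* Mix in the time stamp counter, as the kernel does. */
        tsc = __rdtsc();
        canary += tsc + (tsc << 32UL);

        /* The kernel would now store this in current->stack_canary. */
        printf("canary: %#lx\n", canary);
        return 0;
}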

Again, we will not dive into details here, we will cover it in the part about IRQs. As the canary is set, we disable local and early boot IRQs and register the bootstrap CPU in the CPU maps. We disable local IRQs (interrupts for the current CPU) with the local_irq_disable macro which expands to the call of the arch_local_irq_disable function from arch/x86/include/asm/irqflags.h:

static inline notrace void arch_local_irq_disable(void)
{
        native_irq_disable();
}

where native_irq_disable is the cli instruction for x86_64. As interrupts are disabled we can register the current CPU with the given ID in the CPU bitmap.

The first processor activation

The next function from start_kernel is boot_cpu_init. This function initializes various CPU masks for the bootstrap processor. First of all it gets the bootstrap processor id with a call to:

int cpu = smp_processor_id();

For now it is just zero. If the CONFIG_DEBUG_PREEMPT configuration option is disabled, smp_processor_id just expands to the call of raw_smp_processor_id which expands to the:

#define raw_smp_processor_id() (this_cpu_read(cpu_number))

this_cpu_read, like many other functions of its kind (this_cpu_write, this_cpu_add etc.), is defined in include/linux/percpu-defs.h and represents a this_cpu operation. These operations provide a way of optimizing access to the per-cpu variables which are associated with the current processor. In our case it is this_cpu_read:

__pcpu_size_call_return(this_cpu_read_, pcp)

Remember that we have passed cpu_number as pcp to the this_cpu_read from the raw_smp_processor_id. Now let's look at the __pcpu_size_call_return implementation:

#define __pcpu_size_call_return(stem, variable)                         \
({                                                                      \
        typeof(variable) pscr_ret__;                                    \
        __verify_pcpu_ptr(&(variable));                                 \
        switch(sizeof(variable)) {                                      \
        case 1: pscr_ret__ = stem##1(variable); break;                  \
        case 2: pscr_ret__ = stem##2(variable); break;                  \
        case 4: pscr_ret__ = stem##4(variable); break;                  \
        case 8: pscr_ret__ = stem##8(variable); break;                  \
        default:                                                        \
                __bad_size_call_parameter(); break;                     \
        }                                                               \
        pscr_ret__;                                                     \
})

Yes, it looks a little strange but it's easy. First of all we can see the definition of the pscr_ret__ variable with the int type. Why int? Ok, the variable is cpu_number and it was declared as a per-cpu int variable:

DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);

In the next step we call __verify_pcpu_ptr with the address of cpu_number. __verify_pcpu_ptr is used to verify that the given parameter is a per-cpu pointer. After that we set the pscr_ret__ value which depends on the size of the variable. Our cpu_number variable is int, so it is 4 bytes in size. It means that we will get this_cpu_read_4(cpu_number) in pscr_ret__. At the end of the __pcpu_size_call_return we just call it. this_cpu_read_4 is a macro:

#define this_cpu_read_4(pcp)       percpu_from_op("mov", pcp)

which calls percpu_from_op and passes the mov instruction and the per-cpu variable there. percpu_from_op will expand to the inline assembly call:

asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (cpu_number))

Let's try to understand how it works and what it does. The gs segment register contains the base of the per-cpu area. Here we just copy cpu_number, which is in memory, to pfo_ret__ with the movl instruction. Or in other words:

this_cpu_read(cpu_number)

is the same as:

movl %gs:$cpu_number, $pfo_ret__

As we haven't set up the per-cpu areas yet, we have only one - for the current running CPU - and we will get zero as the result of smp_processor_id.
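
The size-dispatch trick used by __pcpu_size_call_return can be reproduced outside of the kernel. The sketch below (relying on the same GCC statement-expression and typeof extensions) selects a reader function based on sizeof of the variable; the read_1/read_4/read_8 helpers and the macro name are invented for the example:

#include <stdio.h>
#include <stdint.h>

/* Pretend "per-cpu" readers for different variable sizes. */
static uint8_t  read_1(uint8_t *p)  { return *p; }
static uint32_t read_4(uint32_t *p) { return *p; }
static uint64_t read_8(uint64_t *p) { return *p; }

/* Pick the right reader from sizeof(variable), just like
 * __pcpu_size_call_return does with stem##1/2/4/8. */
#define size_call_return(var)                                   \
({                                                              \
        typeof(var) ret__;                                      \
        switch (sizeof(var)) {                                  \
        case 1: ret__ = read_1((void *)&(var)); break;          \
        case 4: ret__ = read_4((void *)&(var)); break;          \
        case 8: ret__ = read_8((void *)&(var)); break;          \
        default: ret__ = 0; break;                              \
        }                                                       \
        ret__;                                                  \
})

int main(void)
{
        int cpu_number = 0;          /* like the per-cpu cpu_number */
        uint64_t big = 42;

        printf("cpu_number = %d\n", (int)size_call_return(cpu_number));
        printf("big        = %llu\n",
               (unsigned long long)size_call_return(big));
        return 0;
}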

As we got the current processor id, boot_cpu_init sets the given CPU online, active, present and possible with the:

set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);

All of these functions use the concept of cpumask. cpu_possible is a set of CPU IDs which can be plugged in at any time during the life of that system boot. cpu_present represents which CPUs are currently plugged in. cpu_online represents a subset of cpu_present and indicates CPUs which are available for scheduling. These masks depend on the CONFIG_HOTPLUG_CPU configuration option and if this option is disabled possible == present and active == online. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is true, it calls cpumask_set_cpu, otherwise it calls cpumask_clear_cpu.

For example let's look at set_cpu_possible. As we passed true as the second parameter, the:

cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));

will be called. First of all let's try to understand the to_cpumask macro. This macro casts a bitmap to a struct cpumask *. CPU masks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. A CPU mask is represented by the cpumask structure:

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

which is just a bitmap declared with the DECLARE_BITMAP macro:

#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

As we can see from its definition, the DECLARE_BITMAP macro expands to the array of unsigned long. Now let's look at how the to_cpumask macro is implemented:

#define to_cpumask(bitmap)                                              \
        ((struct cpumask *)(1 ? (bitmap)                                \
                            : (void *)sizeof(__check_is_bitmap(bitmap))))

I don't know about you, but it looked really weird to me the first time. We can see a ternary operator here which is true every time, but why is __check_is_bitmap here? It's simple, let's look at it:

static inline int __check_is_bitmap(const unsigned long *bitmap)
{
        return 1;
}

Yeah, it just returns 1 every time. Actually we need it here only for one purpose: at compile time it checks that the given bitmap is a bitmap, or in other words it checks that the given bitmap has the type unsigned long *. So we just pass cpu_possible_bits to the to_cpumask macro for converting the array of unsigned long to the struct cpumask *. Now we can call the cpumask_set_cpu function with the cpu - 0 - and struct cpumask *cpu_possible_bits. This function makes only one call of the set_bit function which sets the given cpu in the cpumask. All of these set_cpu_* functions work on the same principle.
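
A stripped-down user-space model of cpumask may make this clearer. The sketch below re-creates DECLARE_BITMAP, a tiny struct cpumask and the set/test helpers in plain C; note that the kernel's real set_bit is an atomic bit operation, which this sketch does not reproduce:

#include <stdio.h>
#include <limits.h>

#define NR_CPUS              8
#define BITS_PER_LONG        (sizeof(long) * CHAR_BIT)
#define BITS_TO_LONGS(bits)  (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

/* One bit position per possible CPU number. */
struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); };

static void cpumask_set_cpu(int cpu, struct cpumask *mask)
{
        mask->bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

static int cpumask_test_cpu(int cpu, const struct cpumask *mask)
{
        return (mask->bits[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1;
}

int main(void)
{
        struct cpumask cpu_possible_mask = { { 0 } };

        /* What boot_cpu_init effectively does for the bootstrap CPU 0. */
        cpumask_set_cpu(0, &cpu_possible_mask);

        printf("cpu0 possible: %d\n", cpumask_test_cpu(0, &cpu_possible_mask));
        printf("cpu1 possible: %d\n", cpumask_test_cpu(1, &cpu_possible_mask));
        return 0;
}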

If the set_cpu_* operations and cpumask are not yet clear to you, don't worry about it. You can get more info by reading the special part about it - cpumask - or the documentation.

As we activated the bootstrap processor, it's time to go to the next function in the start_kernel. Now it is page_address_init, but this function does nothing in our case, because it executes only when all RAM can't be mapped directly.

The next call is pr_notice:

#define pr_notice(fmt, ...) \
    printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)

as you can see it just expands to the printk call. At this moment we use pr_notice to print the Linux banner:

pr_notice("%s", linux_banner);

which is just the kernel version with some additional parameters:

Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #319 SMP

Architecture-dependent parts of initialization

The next step is architecture-specific initialization. The Linux kernel does it with the call of the setup_arch function. This is a very big function like start_kernel and we do not have time to consider all of its implementation in this part. Here we'll only start to do it and continue in the next part. As it is architecture-specific, we need to go again to the arch/ directory. The setup_arch function is defined in the arch/x86/kernel/setup.c source code file and takes only one argument - the address of the kernel command line.

This function starts by reserving a memory block for the kernel _text and _data sections, which starts from the _text symbol (you may remember it from arch/x86/kernel/head_64.S) and ends before __bss_stop. We are using memblock for reserving this memory block:

memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);

You can read about memblock in the Linux kernel memory management Part 1. As you may remember, the memblock_reserve function takes two parameters:

  • base physical address of a memory block;

  • size of a memory block.

We can get the base physical address of the _text symbol with the __pa_symbol macro:

#define __pa_symbol(x) \
    __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))

First of all it calls the __phys_reloc_hide macro on the given parameter. The __phys_reloc_hide macro does nothing for x86_64 and just returns the given parameter. The implementation of the __phys_addr_symbol macro is easy. It just subtracts the kernel text mapping base virtual address (you may remember that it is __START_KERNEL_map) from the symbol address and adds phys_base, the physical address that the kernel text mapping starts from:

#define __phys_addr_symbol(x) \
 ((unsigned long)(x) - __START_KERNEL_map + phys_base)

After we have got the physical address of the _text symbol, memblock_reserve can reserve the memory block starting at _text with the size __bss_stop - _text.
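
To make the arithmetic concrete, here is a tiny sketch with invented numbers: only the __START_KERNEL_map value and the formula itself are real, the _text address and phys_base below are made up for the example:

#include <stdio.h>

#define __START_KERNEL_map 0xffffffff80000000UL

/* Hypothetical values for this example only. */
#define TEXT_VIRT 0xffffffff81000000UL   /* made-up _text address     */
#define PHYS_BASE 0x0000000001000000UL   /* made-up value of phys_base */

/* The same calculation as __phys_addr_symbol(). */
#define __phys_addr_symbol(x) ((unsigned long)(x) - __START_KERNEL_map + PHYS_BASE)

int main(void)
{
        /* 0xffffffff81000000 - 0xffffffff80000000 + 0x1000000 = 0x2000000 */
        printf("physical address of _text: %#lx\n", __phys_addr_symbol(TEXT_VIRT));
        return 0;
}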

Reserve memory for initrd

The next step after reserving a place for the kernel text and data is reserving a place for the initrd. We will not see the details about initrd in this post; for now you just need to know that it is a temporary root file system stored in memory and used by the kernel during its startup. The early_reserve_initrd function does all the work. First of all this function gets the base address of the ram disk, its size and its end address with:

u64 ramdisk_image = get_ramdisk_image();
u64 ramdisk_size  = get_ramdisk_size();
u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);

All of these parameters are taken from boot_params. If you have read the chapter about the Linux Kernel Booting Process, you must remember that we filled the boot_params structure during boot time. The kernel setup header contains a couple of fields which describe the ramdisk, for example:

Field name:    ramdisk_image
Type:        write (obligatory)
Offset/size:    0x218/4
Protocol:    2.00+

  The 32-bit linear address of the initial ramdisk or ramfs.  Leave at
  zero if there is no initial ramdisk/ramfs.

So we can get all the information that interests us from boot_params. For example let's look at get_ramdisk_image:

static u64 __init get_ramdisk_image(void)
{
        u64 ramdisk_image = boot_params.hdr.ramdisk_image;

        ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;

        return ramdisk_image;
}

Here we get the lower 32 bits of the ramdisk address from boot_params and combine them with the upper 32 bits shifted left by 32. We need to do this because, as you can read in Documentation/x86/zero-page.txt:

0C0/004    ALL    ext_ramdisk_image ramdisk_image high 32bits

So after shifting ext_ramdisk_image left by 32 bits, we get a 64-bit address in ramdisk_image and return it. get_ramdisk_size works on the same principle as get_ramdisk_image, but it uses ext_ramdisk_size instead of ext_ramdisk_image. After we have got the ramdisk's size, base address and end address, we check that the bootloader provided the ramdisk with:

if (!boot_params.hdr.type_of_loader ||
    !ramdisk_image || !ramdisk_size)
    return;

and finally reserve the memory block with the calculated addresses for the initial ramdisk:

memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
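
Finally, a short sketch with invented boot_params values shows how the two 32-bit halves and PAGE_ALIGN fit together; the structure below is a tiny stand-in for the few fields we used, not the kernel's real boot_params:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE        4096ULL
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* A tiny stand-in for the pieces of boot_params we used here. */
struct fake_boot_params {
        uint32_t ramdisk_image;      /* low 32 bits  (setup header) */
        uint32_t ramdisk_size;       /* low 32 bits  (setup header) */
        uint32_t ext_ramdisk_image;  /* high 32 bits (zero page)    */
        uint32_t ext_ramdisk_size;   /* high 32 bits (zero page)    */
};

int main(void)
{
        /* Invented values a bootloader could have filled in. */
        struct fake_boot_params bp = {
                .ramdisk_image     = 0x37c08000,
                .ramdisk_size      = 0x00801234,
                .ext_ramdisk_image = 0x0,
                .ext_ramdisk_size  = 0x0,
        };

        uint64_t ramdisk_image = bp.ramdisk_image | ((uint64_t)bp.ext_ramdisk_image << 32);
        uint64_t ramdisk_size  = bp.ramdisk_size  | ((uint64_t)bp.ext_ramdisk_size  << 32);
        uint64_t ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);

        printf("reserve [%#llx - %#llx), %llu bytes\n",
               (unsigned long long)ramdisk_image,
               (unsigned long long)ramdisk_end,
               (unsigned long long)(ramdisk_end - ramdisk_image));
        return 0;
}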

Conclusion

This is the end of the fourth part about the Linux kernel initialization process. In this part we started to dive into the generic kernel code from the start_kernel function and stopped at the architecture-specific initialization in setup_arch. In the next part we will continue with the architecture-dependent initialization steps.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
