
Golang Internals, Part 5: the Runtime Bootstrap Process

Siarhei Matsiukevich

All parts: Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6
 
The bootstrapping process is the key to understanding how the Go runtime works. Learning it is essential if you want to move forward with Go. So the fifth installment in our Golang Internals series is dedicated to the Go runtime and, specifically, the Go bootstrap process. This time you will learn about:

  • Go bootstrapping
  • resizable stacks implementation
  • internal TLS implementation

Note that this post contains a lot of assembler code and you will need at least some basic knowledge of it to proceed (here is a quick guide to Go’s assembler). So let’s get going!

 

Finding an entry point

First, we need to find what function is executed immediately after we start a Go program. To do this, we will write a simple Go app:

package main

func main() {
	print(123)
}

Then we need to compile and link it:

go tool 6g test.go
go tool 6l test.6

This will create an executable file called 6.out in your current directory. The next step involves the objdump tool, which is specific to Linux. Windows and Mac users can find analogs or skip this step altogether. Now run the following command:

objdump -f 6.out

You should get output that will contain the start address:

6.out:     file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x000000000042f160

Next, we need to disassemble our executable and find what function is located at this address:

objdump -d 6.out > disassemble.txt

Then we need to open the disassemble.txt file and search for “42f160.” Here is what I got:

000000000042f160 <_rt0_amd64_linux>:
  42f160:	48 8d 74 24 08       		lea    0x8(%rsp),%rsi
  42f165:	48 8b 3c 24          		mov    (%rsp),%rdi
  42f169:	48 8d 05 10 00 00 00 	lea    0x10(%rip),%rax        # 42f180 <main>
  42f170:	ff e0               		 	jmpq   *%rax

Nice, we have found it! The entry point for my OS and architecture is a function called _rt0_amd64_linux.
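If objdump is not available, the same lookup can be sketched in Go itself with the standard debug/elf package. This is only an illustration under a few assumptions: the file is an ELF executable, its symbol table has not been stripped, and the helper name entryInfo is made up for this example.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// entryInfo (a hypothetical helper) opens an ELF executable and returns its
// start address together with the name of the first symbol located at that
// address. The name is empty if there is no symbol table or no match.
func entryInfo(path string) (uint64, string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return 0, "", err
	}
	defer f.Close()

	// Symbols fails for stripped binaries; we still report the address.
	syms, err := f.Symbols()
	if err != nil {
		return f.Entry, "", nil
	}
	for _, s := range syms {
		if s.Value == f.Entry {
			return f.Entry, s.Name, nil
		}
	}
	return f.Entry, "", nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: entry <elf-file>")
		return
	}
	addr, name, err := entryInfo(os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("start address 0x%016x %s\n", addr, name)
}
```

Running it against the 6.out file from above should report the same start address that objdump -f printed.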
 

The starting sequence

Now we need to find this function in Go runtime sources. It is located in the rt0_linux_amd64.s file. If you look inside the Go runtime package, you can find many filenames with suffixes related to OS and architecture names. When the runtime package is built, only the files that correspond to the current OS and architecture are selected. The rest are skipped. Let’s take a closer look at rt0_linux_amd64.s:

TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
	LEAQ	8(SP), SI // argv
	MOVQ	0(SP), DI // argc
	MOVQ	$main(SB), AX
	JMP	AX

TEXT main(SB),NOSPLIT,$-8
	MOVQ	$runtime·rt0_go(SB), AX
	JMP	AX

The _rt0_amd64_linux function is very simple. It saves the arguments (argc and argv) in registers (DI and SI) and then jumps to the main function. The arguments are located on the stack and can be accessed via the SP (stack pointer) register. The main function is also very simple. It jumps to runtime.rt0_go. The runtime.rt0_go function is longer and more complicated, so I will break it into small parts and describe each one separately.

The first section goes like this:

	MOVQ	DI, AX		// argc
	MOVQ	SI, BX		// argv
	SUBQ	$(4*8+7), SP		// 2args 2auto
	ANDQ	$~15, SP
	MOVQ	AX, 16(SP)
	MOVQ	BX, 24(SP)

Here, we copy the previously saved command line argument values into the AX and BX registers and decrease the stack pointer. The subtraction reserves space for four eight-byte slots (two arguments and two local variables), and the ANDQ instruction aligns the stack pointer to a 16-byte boundary. Finally, we move the arguments back to the stack.
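The stack pointer arithmetic from this fragment can be replayed in plain Go. This is a small sketch, not runtime code: the function name alignedSP and the sample stack pointer value are made up for illustration.

```go
package main

import "fmt"

// alignedSP mirrors the two instructions
//	SUBQ $(4*8+7), SP
//	ANDQ $~15, SP
// It reserves at least 32 bytes and rounds the result down to a multiple
// of 16 bytes.
func alignedSP(sp uint64) uint64 {
	sp -= 4*8 + 7 // room for four 8-byte slots plus slack
	sp &^= 15     // clear the low four bits: 16-byte alignment
	return sp
}

func main() {
	sp := uint64(0x7ffd1000) // hypothetical initial stack pointer
	fmt.Printf("%#x\n", alignedSP(sp)) // prints 0x7ffd0fd0
}
```

Whatever the starting value is, the result is a multiple of 16 and lies at least 32 bytes below the original stack pointer, so the two arguments stored at 16(SP) and 24(SP) always fit.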

	// create istack out of the given (operating system) stack.
	// _cgo_init may update stackguard.
	MOVQ	$runtime·g0(SB), DI
	LEAQ	(-64*1024+104)(SP), BX
	MOVQ	BX, g_stackguard0(DI)
	MOVQ	BX, g_stackguard1(DI)
	MOVQ	BX, (g_stack+stack_lo)(DI)
	MOVQ	SP, (g_stack+stack_hi)(DI)

The second part is a bit more tricky. First, we load the address of the global runtime.g0 variable into the DI register. This variable is defined in the proc1.go file and is of the runtime.g type. Variables of this type are created for each goroutine in the system. As you can guess, runtime.g0 describes a root goroutine. Then we initialize the fields that describe the stack of the root goroutine. The meaning of stack.lo and stack.hi should be clear. These are pointers to the beginning and the end of the stack for the current goroutine, but what are the stackguard0 and stackguard1 fields? To understand this, we need to set aside the investigation of the runtime.rt0_go function and take a closer look at stack growth in Go.
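To make the assembly easier to follow, here is a rough Go rendering of what these six instructions do. The types below are simplified stand-ins that contain only the fields touched by the fragment, and initRootStack is a hypothetical name, so treat this as a sketch rather than the actual runtime definitions.

```go
package main

import "fmt"

// stack and g are simplified stand-ins for the runtime types; only the
// fields used by the assembly fragment are included.
type stack struct {
	lo, hi uintptr
}

type g struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
}

// initRootStack (hypothetical) carves a stack for the root goroutine out of
// the operating system stack, given the current stack pointer.
func initRootStack(gp *g, sp uintptr) {
	lo := sp - 64*1024 + 104 // LEAQ (-64*1024+104)(SP), BX
	gp.stackguard0 = lo
	gp.stackguard1 = lo
	gp.stack.lo = lo
	gp.stack.hi = sp
}

func main() {
	var g0 g
	initRootStack(&g0, 0x7ffd0fd0) // hypothetical stack pointer value
	fmt.Printf("lo=%#x hi=%#x size=%d\n",
		g0.stack.lo, g0.stack.hi, g0.stack.hi-g0.stack.lo)
}
```

Note that at this point both stack guards simply coincide with the lower bound of the stack.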
 

Resizable stack implementation in Go

The Go language uses resizable stacks. Each goroutine starts with a small stack and its size changes each time a certain threshold is reached. Obviously, there is a way to check whether we have reached this threshold or not. In fact, the check is performed at the beginning of each function. To see how it works, let’s compile our sample program one more time with the -S flag (this will show the generated assembler code). The beginning of the main function looks like this:

"".main t=1 size=48 value=0 args=0x0 locals=0x8
	0x0000 00000 (test.go:3)	TEXT	"".main+0(SB),$8-0
	0x0000 00000 (test.go:3)	MOVQ	(TLS),CX
	0x0009 00009 (test.go:3)	CMPQ	SP,16(CX)
	0x000d 00013 (test.go:3)	JHI	,22
	0x000f 00015 (test.go:3)	CALL	,runtime.morestack_noctxt(SB)
	0x0014 00020 (test.go:3)	JMP	,0
	0x0016 00022 (test.go:3)	SUBQ	$8,SP

First, we load a value from thread local storage (TLS) to the CX register (I have already explained what TLS is in one of my previous posts). This value always contains a pointer to the runtime.g structure that corresponds to the current goroutine. Then we compare the stack pointer to the value located at an offset of 16 bytes in the runtime.g structure. We can easily calculate that this corresponds to the stackguard0 field.
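The offset calculation can be checked with a few lines of Go. The struct below is a simplified stand-in that assumes runtime.g begins with the stack bounds followed by the two guards, as the bootstrap code suggests; the helper name offsets is made up for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stack and g are simplified stand-ins for the beginning of runtime.g.
type stack struct {
	lo, hi uintptr
}

type g struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
}

// offsets (hypothetical) returns the byte offset of stackguard0 inside g
// and the size of a machine word on the current platform.
func offsets() (guard0, word uintptr) {
	var gp g
	return unsafe.Offsetof(gp.stackguard0), unsafe.Sizeof(uintptr(0))
}

func main() {
	guard0, word := offsets()
	fmt.Printf("word size: %d, offset of stackguard0: %d\n", word, guard0)
}
```

On a 64-bit platform the word size is 8 bytes, so the two pointers in stack occupy bytes 0 through 15 and stackguard0 starts at offset 16, which matches the 16(CX) operand in the generated code.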

So, this is how we check if we have reached the stack threshold. If the stack pointer is still above stackguard0, the JHI instruction jumps straight to the function body. Otherwise, we call the runtime.morestack_noctxt function and then jump back to the beginning of the function to repeat the check, so the call is made again until enough memory has been allocated for the stack. The stackguard1 field works very similarly to stackguard0, but it is used inside the C stack growth prologue instead of Go. The inner workings of runtime.morestack_noctxt are also a very interesting topic, but we will discuss them later. For now, let’s return to the bootstrap process.
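The control flow of this prologue (compare, call, jump back, compare again) can be simulated in Go. Everything here is a toy model: the goroutine type, the doubling policy in morestack, and the fixed stack pointer are assumptions made for illustration, and the real runtime does considerably more work when it grows a stack.

```go
package main

import "fmt"

// goroutine is a toy stand-in for runtime.g.
type goroutine struct {
	lo, hi      uintptr // stack bounds
	stackguard0 uintptr
}

// morestack is a toy replacement for runtime.morestack_noctxt. It doubles
// the stack by moving the lower bound down and updates the guard.
func (gp *goroutine) morestack() {
	size := gp.hi - gp.lo
	gp.lo -= size
	gp.stackguard0 = gp.lo
}

// enter models the function prologue for a given stack pointer and returns
// how many times morestack had to be called before the check passed.
func (gp *goroutine) enter(sp uintptr) int {
	calls := 0
	for sp <= gp.stackguard0 { // CMPQ SP,16(CX): JHI is not taken
		gp.morestack() // CALL runtime.morestack_noctxt(SB)
		calls++        // JMP 0: start over and repeat the check
	}
	return calls // JHI taken: continue with the function body
}

func main() {
	gp := &goroutine{lo: 0x10000, hi: 0x10800, stackguard0: 0x10000}
	fmt.Println(gp.enter(0x10400)) // enough room: prints 0
	fmt.Println(gp.enter(0xf000))  // needs two doublings: prints 2
}
```

In practice a single call is usually enough; the loop only shows that the check is repeated after every return from the growth routine.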
 

Continuing the investigation of Go bootstrapping

We will proceed with the starting sequence by looking at the next portion of code inside the runtime.rt0_go function:

	// find out information about the processor we're on
	MOVQ	$0, AX
	CPUID
	CMPQ	AX, $0
	JE	nocpuinfo

	// Figure out how to serialize RDTSC.
	// On Intel processors LFENCE is enough. AMD requires MFENCE.
	// Don't know about the rest, so let's do MFENCE.
	CMPL	BX, $0x756E6547  // "Genu"
	JNE	notintel
	CMPL	DX, $0x49656E69  // "ineI"
	JNE	notintel
	CMPL	CX, $0x6C65746E  // "ntel"
	JNE	notintel
	MOVB	$1, runtime·lfenceBeforeRdtsc(SB)
notintel:

	MOVQ	$1, AX
	CPUID
	MOVL	CX, runtime·cpuid_ecx(SB)
	MOVL	DX, runtime·cpuid_edx(SB)
nocpuinfo:	

This part is not crucial for understanding major Go concepts, so we will look through it briefly. Here, we are trying to figure out what processor we are using. If it is Intel, we set the runtime·lfenceBeforeRdtsc variable. The runtime·cputicks function is the only place where this variable is used. Depending on the value of runtime·lfenceBeforeRdtsc, it issues either LFENCE or MFENCE to serialize RDTSC when reading CPU ticks. Finally, we execute the CPUID instruction again, this time with AX set to 1, and save the result in the runtime·cpuid_ecx and runtime·cpuid_edx variables. These are used in the alg.go file to select a proper hashing algorithm that is natively supported by your computer’s architecture.
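The three magic constants in the comparison are just the vendor string packed into registers, four ASCII characters per register in little-endian order. A short Go sketch makes this visible; the function name vendor is made up for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// vendor (hypothetical) rebuilds the CPU vendor string from the values
// that CPUID with AX=0 leaves in BX, DX, and CX, in that order.
func vendor(bx, dx, cx uint32) string {
	buf := make([]byte, 12)
	binary.LittleEndian.PutUint32(buf[0:], bx)
	binary.LittleEndian.PutUint32(buf[4:], dx)
	binary.LittleEndian.PutUint32(buf[8:], cx)
	return string(buf)
}

func main() {
	// The constants are taken from the assembly fragment above.
	fmt.Println(vendor(0x756E6547, 0x49656E69, 0x6C65746E)) // prints GenuineIntel
}
```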

Ok, let’s move on and examine another portion of code:

	// if there is an _cgo_init, call it.
	MOVQ	_cgo_init(SB), AX
	TESTQ	AX, AX
	JZ	needtls
	// g0 already in DI
	MOVQ	DI, CX	// Win64 uses CX for first parameter
	MOVQ	$setg_gcc<>(SB), SI
	CALL	AX

	// update stackguard after _cgo_init
	MOVQ	$runtime·g0(SB), CX
	MOVQ	(g_stack+stack_lo)(CX), AX
	ADDQ	$const__StackGuard, AX
	MOVQ	AX, g_stackguard0(CX)
	MOVQ	AX, g_stackguard1(CX)

	CMPL	runtime·iswindows(SB), $0
	JEQ ok

This fragment is only executed when cgo is enabled. cgo is a topic for a separate discussion and we might talk about it in one of the upcoming posts. At this point, we only want to understand the basic bootstrap workflow, so we will skip it.

The next code fragment is responsible for setting up TLS:

needtls:
	// skip TLS setup on Plan 9
	CMPL	runtime·isplan9(SB), $1
	JEQ ok
	// skip TLS setup on Solaris
	CMPL	runtime·issolaris(SB), $1
	JEQ ok

	LEAQ	runtime·tls0(SB), DI
	CALL	runtime·settls(SB)

	// store through it, to make sure it works
	get_tls(BX)
	MOVQ	$0x123, g(BX)
	MOVQ	runtime·tls0(SB), AX
	CMPQ	AX, $0x123
	JEQ 2(PC)
	MOVL	AX, 0	// abort

I have already mentioned TLS before. Now it is time to understand how it is implemented.
 

Internal TLS implementation

If you look at the previous code fragment carefully, you can easily understand that the only lines that do actual work are:

	LEAQ	runtime·tls0(SB), DI
	CALL	runtime·settls(SB)

All the other stuff is used to skip TLS setup when it is not supported on your OS and check that TLS works correctly. The two lines above store the address of the runtime·tls0 variable in the DI register and call the runtime·settls function. The code of this function is shown below:

// set tls base to DI
TEXT runtime·settls(SB),NOSPLIT,$32
	ADDQ	$8, DI	// ELF wants to use -8(FS)

	MOVQ	DI, SI
	MOVQ	$0x1002, DI	// ARCH_SET_FS
	MOVQ	$158, AX	// arch_prctl
	SYSCALL
	CMPQ	AX, $0xfffffffffffff001
	JLS	2(PC)
	MOVL	$0xf1, 0xf1  // crash
	RET

From the comments, we can understand that this function makes an arch_prctl system call and passes ARCH_SET_FS as an argument. We can also see that this system call sets a base for the FS segment register. In our case, we set TLS to point to the runtime·tls0 variable.

Do you remember the instruction that we saw at the beginning of the assembler code for the main function?

	0x0000 00000 (test.go:3)	MOVQ	(TLS),CX

I have previously explained that it loads the address of the runtime.g structure instance into the CX register. This structure describes the current goroutine and is stored in thread local storage. Now we can find out how this instruction is translated into machine code. If you open the previously created disassemble.txt file and look for the main.main function, the first instruction inside it should look like this:

400c00:       64 48 8b 0c 25 f0 ff    mov    %fs:0xfffffffffffffff0,%rcx

The colon in this instruction (%fs:0xfffffffffffffff0) stands for segmentation addressing (you can read more on it here).
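The large constant in this instruction is easier to read as a signed number. Here is a tiny Go check; the function name displacement is made up for this example.

```go
package main

import "fmt"

// displacement (hypothetical) reinterprets a 64-bit displacement printed by
// the disassembler as a signed two's-complement value.
func displacement(raw uint64) int64 {
	return int64(raw)
}

func main() {
	fmt.Println(displacement(0xfffffffffffffff0)) // prints -16
}
```

In other words, the instruction reads eight bytes located 16 bytes below the base address stored in the FS register.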
 

Returning to the starting sequence

Finally, let’s look at the last two parts of the runtime.rt0_go function:

ok:
	// set the per-goroutine and per-mach "registers"
	get_tls(BX)
	LEAQ	runtime·g0(SB), CX
	MOVQ	CX, g(BX)
	LEAQ	runtime·m0(SB), AX

	// save m->g0 = g0
	MOVQ	CX, m_g0(AX)
	// save m0 to g0->m
	MOVQ	AX, g_m(CX)

Here, we load the TLS address into the BX register and save the address of the runtime·g0 variable in TLS. We also initialize the runtime.m0 variable by linking it with runtime.g0: the m_g0 field of m0 is set to point to g0 and the g_m field of g0 is set to point to m0. If runtime.g0 stands for root goroutine, then runtime.m0 corresponds to the root operating system thread used to run this goroutine. We may take a closer look at runtime.g0 and runtime.m0 structures in upcoming blog posts.
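In Go terms, this fragment boils down to three pointer assignments. The sketch below uses heavily simplified stand-ins for the runtime types, and the tlsG variable and the link function are assumptions that model the TLS slot and the assembly fragment for illustration only.

```go
package main

import "fmt"

// m and g are simplified stand-ins for runtime.m and runtime.g that keep
// only the two fields used by the fragment.
type m struct {
	g0 *g
}

type g struct {
	m *m
}

var (
	g0   g  // root goroutine
	m0   m  // root operating system thread
	tlsG *g // stand-in for the TLS slot that holds the current g
)

// link (hypothetical) performs the same stores as the assembly fragment.
func link() {
	tlsG = &g0  // MOVQ CX, g(BX)
	m0.g0 = &g0 // MOVQ CX, m_g0(AX)
	g0.m = &m0  // MOVQ AX, g_m(CX)
}

func main() {
	link()
	// From the current g we can reach its m and get back to the same g.
	fmt.Println(tlsG.m.g0 == tlsG) // prints true
}
```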

The final part of the starting sequence initializes arguments and calls different functions, but this is a topic for a separate discussion.
 

More on Golang

So, we have learned the inner mechanisms of the bootstrap process and found out how stacks are implemented. To move forward, we need to analyze the last part of the starting sequence. That will be the subject of my next post. If you want to get notified as soon as it comes out, hit the subscribe button below or follow @altoros.
 


About the author: Siarhei Matsiukevich is a Cloud Engineer and Go Developer at Altoros. With 6+ years in software engineering, he is an expert in cloud automation and designing architectures for complex cloud-based systems. An active member of the Go community, Siarhei is a frequent contributor to open-source projects, such as Ubuntu and Juju Charms.



