If you ever run a Go 1.15 program under valgrind --tool=massif, you might see something surprising:

09:22:23 mrogers@mothra:[~:]:                                                                      
$ cat hello_world.go                                                                                
package main                 
                                                
import "fmt"

func main() {
        fmt.Println("foo")
}
09:22:33 mrogers@mothra:[~:]:
$ go build -o foo hello_world.go 
09:22:45 mrogers@mothra:[~:]:
$ valgrind --tool=massif --trace-children=yes --detailed-freq=1 --max-snapshots=200 --massif-out-file=foomassif.%p --pages-as-heap=yes --time-unit=B env GODEBUG=madvdontneed=1 ./foo
==1305185== Massif, a heap profiler
==1305185== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
==1305185== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1305185== Command: env GODEBUG=madvdontneed=1 ./foo
==1305185== 
==1305185== Massif, a heap profiler
==1305185== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
==1305185== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1305185== Command: ./foo
==1305185== 
foo
==1305185== 
09:23:00 mrogers@mothra:[~:]:
$ ms_print foomassif.1305185 | head -n 40                                                           
--------------------------------------------------------------------------------
Command:            ./foo
Massif arguments:   --detailed-freq=1 --max-snapshots=200 --massif-out-file=foomassif.%p --pages-as-heap=yes --time-unit=B
ms_print arguments: foomassif.1305185
--------------------------------------------------------------------------------


    MB
750.5^                                                                       @
     |                                                                    @@@@
     |                                                                    @  @
     |                                                              @@@@@@@  @
     |                                                              @     @  @
     |                                                        @@@@@@@     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |                                                        @     @     @  @
     |       @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     @     @  @
   0 +----------------------------------------------------------------------->MB
     0                                                                   750.5

Number of snapshots: 31
Detailed snapshots: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9 (peak), 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]

--------------------------------------------------------------------------------
n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
0        630,784          630,784          630,784             0            0
100.00% (630,784B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.

According to this, the Go runtime is allocating 750.5MB of memory for a “hello world”! The --pages-as-heap=yes option passed to valgrind makes massif follow allocations down to the page level rather than only tracking malloc’d blocks. That matters for Go: the runtime requests its memory from the kernel with mmap rather than through malloc, so massif’s default heap profiling would see almost nothing. (The GODEBUG=madvdontneed=1 setting makes the runtime release memory with MADV_DONTNEED instead of MADV_FREE, so freed pages are returned to the kernel immediately rather than lazily.) From the massif docs:

However, if you wish to measure all the memory used by your program, you can use the --pages-as-heap=yes option. When this option is enabled, Massif’s normal heap block profiling is replaced by lower-level page profiling. Every page allocated via mmap and similar system calls is treated as a distinct block. This means that code, data and BSS segments are all measured, as they are just memory pages. Even the stack is measured, since it is ultimately allocated (and extended when necessary) via mmap; for this reason --stacks=yes is not allowed in conjunction with --pages-as-heap=yes.

It’s not immediately clear from this description that (on Linux, at least) this would include reserved virtual memory (VSZ), so I had a classic “Linux ate my RAM” moment while trying to peek into the Go runtime. In discussion with colleagues I was directed towards this comment on a golang issue raised about programs running under very restricted environments. In short, the Go 1.14 release changed the page allocator. The crucial bit is in the implementation section:

… Overall this uses between KiB and hundreds of MiB of address space on systems with smaller address spaces (~600 MiB for a 48-bit address space, ~12 KiB for a 32-bit address space). For a full 64-bit address space, this layout requires ~37 TiB of reserved memory.

At first glance, this seems like an enormous amount, but in reality that’s an extremely small fraction (~0.00022%) of the full address space. Furthermore, this address space is very cheap since we’ll only commit what we use, and to reduce the size of core dumps and eliminate issues with overcommit we will map the space as PROT_NONE (only MEM_RESERVE on Windows) and map it as read/write explicitly when we grow the heap (an infrequent operation).

There are only two known adverse effects of this large mapping on Linux:

1. ulimit -v, which restricts even PROT_NONE mappings.
2. Programs like top, when they report virtual memory footprint, include PROT_NONE mappings.

In the grand scheme of things, these are relatively minor consequences. The former is not used often, and in cases where it is, it’s used as an inaccurate proxy for limiting a process’s physical memory use. The latter is mostly cosmetic, though perhaps some monitoring system uses it as a proxy for memory use, and will likely result in some harmless questions. …

So essentially, since the 1.14 page allocator changes, the Go runtime reserves a large chunk of virtual address space up front, mapped PROT_NONE, and only flips it to read/write with mprotect as the heap actually grows. Address space is cheap; only the committed pages ever consume physical memory.
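
That reserve-then-commit pattern is easy to demonstrate outside the runtime. Below is a minimal sketch of my own (not the runtime’s actual code, and assuming 4KiB pages) using golang.org/x/sys/unix: reserve a gigabyte with PROT_NONE, then commit a single page with mprotect. Run it under top and the full gigabyte shows up in VIRT while RES stays tiny.

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 1 << 30 // reserve 1GiB of address space

	// Reserve: PROT_NONE pages count towards VSZ, but they can't be
	// touched and consume no physical memory.
	mem, err := unix.Mmap(-1, 0, size, unix.PROT_NONE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}

	// Commit: mark a single page read/write only when it's needed,
	// analogous to what the runtime does when it grows the heap.
	if err := unix.Mprotect(mem[:4096], unix.PROT_READ|unix.PROT_WRITE); err != nil {
		panic(err)
	}
	mem[0] = 42 // touching the committed page is now legal

	fmt.Println("reserved 1GiB, committed one page")
}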

A little test gives us some experimental proof: restrict RLIMIT_AS, the limit on the process’s total address space, to approx. 700MB, then compare against a ceiling of approx. 800MB:

$ prlimit --as=700000000 ./foo
fatal error: out of memory allocating heap arena map

runtime stack:
runtime.throw(0x4c2de1, 0x27)
        /home/mrogers/.gvm/gos/go1.15.8/src/runtime/panic.go:1116 +0x72 fp=0x7ffd43b31620 sp=0x7ffd43b315f0 pc=0x4322d2
runtime.(*mheap).sysAlloc(0x56b620, 0x400000, 0x0, 0x5)
        /home/mrogers/.gvm/gos/go1.15.8/src/runtime/malloc.go:737 +0x76d fp=0x7ffd43b316c8 sp=0x7ffd43b31620 pc=0x40bc4d
runtime.(*mheap).grow(0x56b620, 0x1, 0x0)
        /home/mrogers/.gvm/gos/go1.15.8/src/runtime/mheap.go:1344 +0x85 fp=0x7ffd43b31730 sp=0x7ffd43b316c8 pc=0x4253a5
runtime.(*mheap).allocSpan(0x56b620, 0x1, 0x7268746f6d002a00, 0x5859c8, 0x0)
        /home/mrogers/.gvm/gos/go1.15.8/src/runtime/mheap.go:1160 +0x6b6 fp=0x7ffd43b317b0 sp=0x7ffd43b31730 pc=0x425156
runtime.(*mheap).alloc.func1()
        /home/mrogers/.gvm/gos/go1.15.8/src/runtime/mheap.go:907 +0x65 fp=0x7ffd43b31808 sp=0x7ffd43b317b0 pc=0x45c105

$ prlimit --as=800000000 ./foo
foo
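
The same gap is visible from inside the process. Here’s a quick Linux-only snippet of my own (not part of the original investigation) that prints VmSize and VmRSS from /proc/self/status; dropped into the hello world above on Go 1.15, VmSize should land in the same ~750MB ballpark massif reported, while VmRSS stays at a few MB.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// VmSize is the total mapped address space, which is what massif's
	// --pages-as-heap=yes and prlimit --as see. VmRSS is what is
	// actually resident in RAM.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmSize:") || strings.HasPrefix(line, "VmRSS:") {
			fmt.Println(line)
		}
	}
}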

I tracked this down as part of a wider investigation into why the OpenShift file-integrity-operator’s daemon workload occasionally shows large plateaus of memory usage in Prometheus, well beyond what the workload should need. I started with pprof profiling and have worked my way down to the runtime to get the full picture. More on that soon.