Not long ago I needed to benchmark the performance of Golang on a many-core machine. I took several of the benchmarks that are bundled with the Go source code, copied them, and modified them to run on all available threads. In that case the machine has 24 cores and 48 threads.
I started with ECDSA P256 Sign, probably because I have warm feeling for that function, since I optimized it for amd64.
First, I ran the benchmark on a single goroutine:
ECDSA-P256 Sign,30618.50, op/s
That looks good; next I ran it on 48 goroutines:
ECDSA-P256 Sign,78940.67, op/s.
OK, that is not what I expected. Just over 2X speedup, from 24 physical cores? I must be doing something wrong. Maybe Go only uses two cores? I ran
top, it showed 2,266% utilization. That is not the 4,800% I expected, but it is also way above 400%.
How about taking a step back, and running the benchmark on two goroutines?
ECDSA-P256 Sign,55966.40, op/s. Almost double, so pretty good. How about four goroutines?
ECDSA-P256 Sign,108731.00, op/s. That is actually faster than 48 goroutines, what is going on?
I ran the benchmark for every number of goroutines from 1 to 48:
Looks like the number of signatures per second peaks at 274,622, with 17 goroutines. And starts dropping rapidly after that.
Time to do some profiling.
(pprof) top 10 Showing nodes accounting for 47.53s, 50.83% of 93.50s total Dropped 224 nodes (cum <= 0.47s) Showing top 10 nodes out of 138 flat flat% sum% cum cum% 9.45s 10.11% 10.11% 9.45s 10.11% runtime.procyield /state/home/vlad/go/src/runtime/asm_amd64.s 7.55s 8.07% 18.18% 7.55s 8.07% runtime.futex /state/home/vlad/go/src/runtime/sys_linux_amd64.s 6.77s 7.24% 25.42% 19.18s 20.51% runtime.sweepone /state/home/vlad/go/src/runtime/mgcsweep.go 4.20s 4.49% 29.91% 16.28s 17.41% runtime.lock /state/home/vlad/go/src/runtime/lock_futex.go 3.92s 4.19% 34.11% 12.58s 13.45% runtime.(*mspan).sweep /state/home/vlad/go/src/runtime/mgcsweep.go 3.50s 3.74% 37.85% 15.92s 17.03% runtime.gcDrain /state/home/vlad/go/src/runtime/mgcmark.go 3.20s 3.42% 41.27% 4.62s 4.94% runtime.gcmarknewobject /state/home/vlad/go/src/runtime/mgcmark.go 3.09s 3.30% 44.58% 3.09s 3.30% crypto/elliptic.p256OrdSqr /state/home/vlad/go/src/crypto/elliptic/p256_asm_amd64.s 3.09s 3.30% 47.88% 3.09s 3.30% runtime.(*lfstack).pop /state/home/vlad/go/src/runtime/lfstack.go 2.76s 2.95% 50.83% 2.76s 2.95% runtime.(*gcSweepBuf).push /state/home/vlad/go/src/runtime/mgcsweepbuf.go
Clearly Go spends a disproportionate amount of time collecting garbage. All my benchmark does is generates signatures and then dumps them.
So what are our options? The Go runtime states the following:
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
The GODEBUG variable controls debugging variables within the runtime. It is a comma-separated list of name=val pairs setting these named variables:
Let’s see what setting
gc 1 @0.021s 0%: 0.15+0.37+0.25 ms clock, 3.0+0.19/0.39/0.60+5.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 2 @0.024s 0%: 0.097+0.94+0.16 ms clock, 0.29+0.21/1.3/0+0.49 ms cpu, 4->4->1 MB, 5 MB goal, 48 P gc 3 @0.027s 1%: 0.10+0.43+0.17 ms clock, 0.60+0.48/1.5/0+1.0 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 4 @0.028s 1%: 0.18+0.41+0.28 ms clock, 0.18+0.69/2.0/0+0.28 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 5 @0.031s 1%: 0.078+0.35+0.29 ms clock, 1.1+0.26/2.0/0+4.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 6 @0.032s 1%: 0.11+0.50+0.32 ms clock, 0.22+0.99/2.3/0+0.64 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 7 @0.034s 1%: 0.18+0.39+0.27 ms clock, 0.18+0.56/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 8 @0.035s 2%: 0.12+0.40+0.27 ms clock, 0.12+0.63/2.2/0+0.27 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 9 @0.036s 2%: 0.13+0.41+0.26 ms clock, 0.13+0.52/2.2/0+0.26 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 10 @0.038s 2%: 0.099+0.51+0.20 ms clock, 0.19+0.56/1.9/0+0.40 ms cpu, 4->5->0 MB, 5 MB goal, 48 P gc 11 @0.039s 2%: 0.10+0.46+0.20 ms clock, 0.10+0.23/1.3/0.005+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 12 @0.040s 2%: 0.066+0.46+0.24 ms clock, 0.93+0.40/1.7/0+3.4 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 13 @0.041s 2%: 0.099+0.30+0.20 ms clock, 0.099+0.60/1.7/0+0.20 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 14 @0.042s 2%: 0.095+0.45+0.24 ms clock, 0.38+0.58/2.0/0+0.98 ms cpu, 4->5->0 MB, 5 MB goal, 48 P gc 15 @0.044s 2%: 0.095+0.45+0.21 ms clock, 1.0+0.78/1.9/0+2.3 ms cpu, 4->4->0 MB, 5 MB goal, 48 P gc 16 @0.045s 3%: 0.10+0.45+0.23 ms clock, 0.10+0.70/2.1/0+0.23 ms cpu, 4->5->0 MB, 5 MB goal, 48 P gc 17 @0.046s 3%: 0.088+0.40+0.17 ms clock, 0.088+0.45/1.9/0+0.17 ms cpu, 4->4->0 MB, 5 MB goal, 48 P . . . . gc 6789 @9.998s 12%: 0.17+0.91+0.24 ms clock, 0.85+1.8/5.0/0+1.2 ms cpu, 4->6->1 MB, 6 MB goal, 48 P gc 6790 @10.000s 12%: 0.086+0.55+0.24 ms clock, 0.78+0.30/4.2/0.043+2.2 ms cpu, 4->5->1 MB, 6 MB goal, 48 P
The first round of GC kicks in at 0.021s, then it starts collecting every 3ms and then every 1ms. That is insane, the benchmark runs for 10 seconds, and I saw 6,790 rounds of GC. The number that starts with @ is the time since program start, followed by a percentage that supposedly states the amount of time spent collecting garbage. This number is clearly misleading, because the performance indicates at least 90% of the time is wasted (indirectly) on GC, not 12%. The synchronization overhead is not taken into account. What really is interesting are the three numbers separated by arrows. They show the size of the heap at GC start, GC end, and the live heap size. Remember that a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage, and defaults to 100%.
I am running a benchmark, where all allocated data is immediately discarded, and collected at the next GC cycle. The only live heap is fixed to the Go runtime, and having more goroutines does not add to the live heap. In contrast the freshly allocated data grows much faster with each additional goroutine, triggering increasingly frequent, and expensive GC cycles.
Clearly what I needed to do next was to run the benchmark with the GC disabled, by setting
GOGC=off. This lead to a dramatic improvement:
ECDSA-P256 Sign,413740.30, op/s.
But still not the number I was looking for, and running an application without garbage collection is unsustainable in the long run. I started playing with the
GOGC variable. First I set it to 2,400, which made sense since we have 24 cores, perhaps collecting garbage 24 times less frequently will do the trick:
ECDSA-P256 Sign,671538.90, op/s, oh my that is getting better.
What if I tried 4,800, for the number of threads?
ECDSA-P256 Sign,685810.90, op/s. Getting warmer.
I ran a script to find the best value, from 100 to 20,000, in increments of 100. This is what I got:
Looks like the optimal value for
GOGC in that case is 11,300 and it gets us 691,054 signatures/second. That is 22.56X times faster than the single core score, and overall pretty good for a 24 core processor. Remember that when running on a single core, the CPU frequency is 3.0GHz, and only 2.1GHz when running on all cores.
Per goroutine performance when running with
GOGC=11330 now looks like that:
The scaling looks much better, and even past 24 goroutines, when we run out of physical cores, and start sharing cores with hyper-threading, the overall performance improves.
The bottom line here is that although this type of benchmarking is definitely an edge case for garbage collection, where 48 threads allocate large amounts of short lived data, this situation can occur in real world scenarios. As many-core CPUs become a commodity, one should be aware of the pitfalls.
Most languages with garbage collection offer some sort of garbage collection control. Go has the GOGC variable, that can also be controlled with the SetGCPercent function in the runtime/debug package. Don't be afraid to tune the GC to suit your needs.
We're always looking for Go programmers, so if you found this blog post interesting, why not check out our jobs page?Cloudflare Railgun is available when using Cloud Server Webuzo, Cloud Web Apps. Contact us to find out our latest offers!