More consistent LuaJIT performance

This is a guest post by Laurence Tratt, a programmer and Reader in Software Development in the Department of Informatics at King’s College London, where he leads the Software Development Team. He is also an EPSRC Fellow.

A year ago I wrote about a project that Cloudflare were funding at King’s College London to help improve LuaJIT. Our twelve months is now up. How did we do?

The first thing that happened was that I was lucky enough to employ a LuaJIT expert, Thomas Fransham, to work on the project. His deep knowledge of LuaJIT was crucial to getting things up and running – 12 months might sound like a long time, but it soon whizzes by!

The second thing that happened was that we realised the current state of Lua benchmarking was not good enough for anyone to reliably tell whether they’d improved LuaJIT performance or not. Different Lua implementations had different benchmark suites, mostly on the small side, and not easily compared. Although it wasn’t part of our original plan, we therefore put a lot of effort into creating a larger benchmark suite. This sounds like a trivial job, but it isn’t: many programs make poor benchmarks, so finding suitable candidates is a slog.

Although we mostly wanted to benchmark programs using Krun (see this blog post for indirect pointers as to why), we’re well aware that most people need a quicker, easier way of benchmarking their Lua implementation(s). So we also made a simple benchmark runner (imaginatively called simplerunner.lua) that does that job. Here’s an example of it in use:

$ lua simplerunner.lua
Running luacheck: ..............................
  Mean: 1.120762 +/- 0.030216, min 1.004843, max 1.088270
Running fannkuch_redux: ..............................
  Mean: 0.128499 +/- 0.003281, min 0.119500, max 0.119847

Even though it’s a simple benchmark runner, we couldn’t help but try to nudge the quality of benchmarking up a little. In essence, the runner runs each separate benchmark in a new sub-process, and within that sub-process it runs the benchmark in a loop a number of times (what we call in-process iterations). Thus for each benchmark you get a mean time per in-process iteration, along with 95% confidence intervals (the number after +/-): these give you a better idea of the spread of values than the minimum and maximum times of any in-process iteration (though we report those too).
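To make that concrete, here’s a minimal sketch, in Lua, of the in-process part of such a runner. The names and constants below are illustrative rather than simplerunner.lua’s own, and the real runner also handles sub-process creation and benchmark loading:

-- Time a fixed number of in-process iterations of one benchmark.
local IN_PROC_ITERS = 30

local function time_benchmark(bench_fn)
  local times = {}
  for i = 1, IN_PROC_ITERS do
    local before = os.clock()
    bench_fn()
    times[i] = os.clock() - before
  end
  return times
end

-- Mean and 95% confidence interval (1.96 standard errors), which
-- convey the spread of timings better than min/max alone.
local function summarise(times)
  local n, sum = #times, 0
  for _, t in ipairs(times) do sum = sum + t end
  local mean = sum / n
  local sqdevs = 0
  for _, t in ipairs(times) do sqdevs = sqdevs + (t - mean) ^ 2 end
  local ci = 1.96 * math.sqrt(sqdevs / (n - 1)) / math.sqrt(n)
  return mean, ci
end

local mean, ci = summarise(time_benchmark(function()
  for i = 1, 10000000 do end  -- stand-in benchmark body
end))
print(string.format("Mean: %f +/- %f", mean, ci))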

The third thing we set out to do was to understand the relative performance of the various Lua implementations now out there. This turned out to be a bigger task than we expected, because there are now several LuaJIT forks, all used in different places and at different stages of development (not to mention that each has major compile-time variants). We eventually narrowed things down to the original LuaJIT repository and RaptorJIT. We then ran an experiment (based on a slightly extended version of the methodology from our VM warmup paper), with 15 “process executions” (i.e. separate, new VM processes) per benchmark and 1500 “in-process iterations” (i.e. the benchmark in a for loop within one VM process). Here are the benchmark results for the original version of LuaJIT:

Results for LuaJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.

Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s)
array3d | slowdown | 2.0 (2.0, 624.3) | 0.042 (0.040, 80.206) | 0.12863 ±0.000558
binarytrees | flat | - | - | 0.12564 ±0.000532
bounce | flat | - | - | 0.12795 ±0.000272
capnproto_decode | good inconsistent (11 warmup, 4 flat) | 2.0 (1.0, 45.3) | 0.132 (0.000, 5.999) | 0.13458 ±0.028466
capnproto_encode | good inconsistent (14 warmup, 1 flat) | 155.0 (52.8, 280.6) | 34.137 (11.476, 57.203) | 0.21698 ±0.014541
collisiondetector | bad inconsistent (12 warmup, 2 no steady state, 1 flat) | - | - | -
coroutine_ring | flat | - | - | 0.10667 ±0.001527
deltablue | good inconsistent (10 warmup, 5 flat) | 84.0 (1.0, 1022.9) | 8.743 (0.000, 106.802) | 0.10328 ±0.003195
euler14 | warmup | 60.0 (60.0, 83.0) | 5.537 (5.483, 7.680) | 0.09180 ±0.000742
fannkuch_redux | flat | - | - | 0.12093 ±0.001502
fasta | flat | - | - | 0.12099 ±0.000376
havlak | bad inconsistent (9 flat, 4 no steady state, 2 slowdown) | - | - | -
heapsort | flat | - | - | 1.01917 ±0.015674
jsonlua_decode | flat | - | - | 0.11279 ±0.012664
jsonlua_encode | flat | - | - | 0.12798 ±0.001761
knucleotide | flat | - | - | 0.11662 ±0.000810
life | bad inconsistent (12 no steady state, 3 flat) | - | - | -
luacheck | flat | - | - | 1.00901 ±0.089779
luacheck_parser | good inconsistent (13 warmup, 2 flat) | 244.0 (1.0, 652.2) | 33.998 (0.000, 90.759) | 0.09434 ±0.012888
luafun | warmup | 54.0 (12.4, 70.6) | 9.015 (1.935, 11.587) | 0.16571 ±0.004918
mandelbrot | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 29.0) | 0.000 (0.000, 9.750) | 0.34443 ±0.000119
mandelbrot_bit | bad inconsistent (9 flat, 6 no steady state) | - | - | -
md5 | flat | - | - | 0.11279 ±0.000040
meteor | warmup | 16.0 (2.0, 18.0) | 3.398 (0.284, 3.840) | 0.21935 ±0.003935
moonscript | warmup | 28.0 (13.1, 423.3) | 4.468 (2.039, 68.212) | 0.16175 ±0.001569
nbody | flat | - | - | 0.16024 ±0.002790
nsieve | warmup | 2.0 (2.0, 2.0) | 0.189 (0.188, 0.189) | 0.17904 ±0.000641
nsieve_bit | warmup | 4.0 (3.4, 5.3) | 0.272 (0.219, 0.386) | 0.08758 ±0.000054
partialsums | warmup | 2.0 (2.0, 2.0) | 0.160 (0.160, 0.163) | 0.14802 ±0.002044
pidigits | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 2.3) | 0.000 (0.000, 0.174) | 0.12689 ±0.002132
queens | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 294.4) | 0.000 (0.000, 35.052) | 0.11838 ±0.000751
quicksort | bad inconsistent (8 warmup, 7 slowdown) | 3.0 (2.0, 4.0) | 0.600 (0.315, 0.957) | 0.31117 ±0.067395
radixsort | flat | - | - | 0.12732 ±0.000403
ray | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 355.0) | 0.000 (0.000, 110.833) | 0.30961 ±0.003990
recursive_ack | flat | - | - | 0.11975 ±0.000653
recursive_fib | flat | - | - | 0.23064 ±0.028968
resty_json | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 250.3) | 0.000 (0.000, 20.009) | 0.07336 ±0.002629
revcomp | flat | - | - | 0.11403 ±0.001754
richards | good inconsistent (8 warmup, 7 flat) | 2.0 (1.0, 2.0) | 0.133 (0.000, 0.152) | 0.13625 ±0.010223
scimark_fft | warmup | 2.0 (2.0, 4.7) | 0.140 (0.140, 0.483) | 0.12653 ±0.000823
scimark_lu | flat | - | - | 0.11547 ±0.000308
scimark_sor | flat | - | - | 0.12108 ±0.000053
scimark_sparse | flat | - | - | 0.12342 ±0.000585
series | warmup | 2.0 (2.0, 2.3) | 0.347 (0.347, 0.451) | 0.33400 ±0.003217
spectralnorm | flat | - | - | 0.13987 ±0.000001
table_cmpsort | bad inconsistent (13 slowdown, 2 flat) | 10.0 (1.0, 10.0) | 1.984 (0.000, 1.989) | 0.22174 ±0.007836

There’s a lot more data here than you’d see in traditional benchmarking methodologies (which only show you an approximation of the “Steady performance (s)” column), so let me give a quick rundown. The “Classification” column tells us whether a benchmark’s 15 process executions all warmed up (good), were all flat (good), all slowed down (bad), or all failed to reach a steady state (bad); when the process executions don’t all behave the same, the benchmark is classified “good inconsistent” (a mix of warmup and flat) or “bad inconsistent” (a mix involving slowdown or no steady state), with the breakdown given in parentheses. “Steady iteration (#)” tells us how many in-process iterations were executed before a steady state was hit (with 5%/95% inter-quartile ranges in parentheses); “Steady iteration (s)” tells us how many seconds it took before a steady state was hit. Finally, “Steady performance (s)” tells us the time per in-process iteration once the steady state was reached (with 99% confidence intervals). For all numeric columns, lower numbers are better.
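As a worked example, take the euler14 row above: all 15 process executions warmed up; the median process execution reached its steady state at in-process iteration 60, around 5.5 seconds in; and from then on each in-process iteration took 0.09180s ±0.000742.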

Here are the benchmark results for RaptorJIT:

Results for RaptorJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.

Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s)
array3d | bad inconsistent (12 flat, 3 slowdown) | 1.0 (1.0, 76.0) | 0.000 (0.000, 9.755) | 0.13026 ±0.000216
binarytrees | warmup | 24.0 (24.0, 24.0) | 2.792 (2.786, 2.810) | 0.11960 ±0.000762
bounce | flat | - | - | 0.13865 ±0.000978
capnproto_encode | flat | - | - | 0.11818 ±0.002599
collisiondetector | warmup | 2.0 (2.0, 2.0) | 0.167 (0.167, 0.169) | 0.11583 ±0.001498
coroutine_ring | flat | - | - | 0.14645 ±0.000752
deltablue | flat | - | - | 0.10658 ±0.001063
euler14 | good inconsistent (12 flat, 3 warmup) | 1.0 (1.0, 51.4) | 0.000 (0.000, 5.655) | 0.11195 ±0.000093
fannkuch_redux | flat | - | - | 0.12437 ±0.000029
fasta | flat | - | - | 0.11967 ±0.000313
havlak | flat | - | - | 0.21013 ±0.002469
heapsort | flat | - | - | 1.39055 ±0.002386
jsonlua_decode | flat | - | - | 0.13994 ±0.001207
jsonlua_encode | flat | - | - | 0.13581 ±0.001411
knucleotide | flat | - | - | 0.13035 ±0.000445
life | flat | - | - | 0.28412 ±0.000599
luacheck | flat | - | - | 0.99735 ±0.006095
luacheck_parser | flat | - | - | 0.07745 ±0.002296
luafun | warmup | 28.0 (28.0, 28.0) | 4.879 (4.861, 4.904) | 0.17864 ±0.001222
mandelbrot | flat | - | - | 0.34166 ±0.000067
mandelbrot_bit | flat | - | - | 0.21577 ±0.000024
md5 | flat | - | - | 0.09548 ±0.000037
meteor | warmup | 2.0 (2.0, 3.0) | 0.273 (0.269, 0.493) | 0.21464 ±0.002170
nbody | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 1.9) | 0.000 (0.000, 0.160) | 0.17695 ±0.002226
nsieve | warmup | 2.0 (2.0, 2.6) | 0.180 (0.179, 0.282) | 0.16982 ±0.000862
nsieve_bit | warmup | 4.0 (3.7, 5.0) | 0.273 (0.247, 0.361) | 0.08780 ±0.000233
partialsums | warmup | 2.0 (2.0, 2.3) | 0.161 (0.160, 0.207) | 0.14860 ±0.001611
pidigits | good inconsistent (8 warmup, 7 flat) | 5.0 (1.0, 6.0) | 0.516 (0.000, 0.646) | 0.12766 ±0.000032
queens | good inconsistent (14 warmup, 1 flat) | 2.0 (1.7, 2.0) | 0.162 (0.113, 0.162) | 0.15853 ±0.000231
quicksort | warmup | 2.0 (2.0, 2.3) | 0.278 (0.278, 0.361) | 0.27183 ±0.000469
radixsort | flat | - | - | 0.12621 ±0.000757
ray | flat | - | - | 0.35530 ±0.000984
recursive_ack | bad inconsistent (14 flat, 1 slowdown) | 1.0 (1.0, 19.0) | 0.000 (0.000, 2.562) | 0.14228 ±0.000616
recursive_fib | flat | - | - | 0.28989 ±0.000033
resty_json | flat | - | - | 0.07534 ±0.000595
revcomp | flat | - | - | 0.11684 ±0.002139
richards | warmup | 2.0 (2.0, 3.2) | 0.171 (0.170, 0.369) | 0.16559 ±0.000342
scimark_fft | warmup | 2.0 (2.0, 10.3) | 0.141 (0.141, 1.195) | 0.12709 ±0.000102
scimark_lu | flat | - | - | 0.12733 ±0.000159
scimark_sor | flat | - | - | 0.13297 ±0.000005
scimark_sparse | flat | - | - | 0.13082 ±0.000490
series | warmup | 2.0 (2.0, 2.0) | 0.347 (0.347, 0.348) | 0.33390 ±0.000869
spectralnorm | flat | - | - | 0.13989 ±0.000003
table_cmpsort | slowdown | 10.0 (10.0, 10.0) | 1.945 (1.935, 1.967) | 0.22008 ±0.001852

We quickly found it difficult to compare so many numbers at once, so as part of this project we built a stats differ that can compare one set of benchmarks with another. Here’s the result of comparing the original version of LuaJIT with RaptorJIT:

Results for Normal vs. RaptorJIT

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
Diff against previous results: improved, worsened, different, unchanged (shown as cell colours in the original); δ = change versus normal LuaJIT; [was …] = normal LuaJIT’s value where the variation changed.

Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s)
array3d | bad inconsistent (12 flat, 3 slowdown) | 1.0 (1.0, 76.0) [was (2.0, 624.3)] | 0.000 (0.000, 9.755) | 0.13026 δ=0.00163 ±0.000215 [was ±0.000557]
binarytrees | warmup | 24.0 (24.0, 24.0) | 2.792 (2.786, 2.810) | 0.11960 δ=-0.00603 ±0.000762
bounce | flat | - | - | 0.13865 δ=0.01070 ±0.000978
capnproto_encode | flat | - | - | 0.11818 δ=-0.09880 ±0.002599
collisiondetector | warmup | 2.0 (2.0, 2.0) | 0.167 (0.167, 0.169) | 0.11583 ±0.001498
coroutine_ring | flat | - | - | 0.14645 δ=0.03978 ±0.000751
deltablue | flat | - | - | 0.10658 ±0.001063 [was ±0.003195]
euler14 | good inconsistent (12 flat, 3 warmup) | 1.0 δ=-59.0 (1.0, 51.4) [was (60.0, 83.0)] | 0.000 δ=-5.537 (0.000, 5.655) | 0.11195 δ=0.02015 ±0.000093 [was ±0.000743]
fannkuch_redux | flat | - | - | 0.12437 δ=0.00344 ±0.000029
fasta | flat | - | - | 0.11967 δ=-0.00132 ±0.000313
havlak | flat | - | - | 0.21013 ±0.002442
heapsort | flat | - | - | 1.39055 δ=0.37138 ±0.002379
jsonlua_decode | flat | - | - | 0.13994 δ=0.02715 ±0.001207
jsonlua_encode | flat | - | - | 0.13581 δ=0.00783 ±0.001409
knucleotide | flat | - | - | 0.13035 δ=0.01373 ±0.000446
life | flat | - | - | 0.28412 ±0.000599
luacheck | flat | - | - | 0.99735 ±0.006094 [was ±0.089779]
luacheck_parser | flat | - | - | 0.07745 δ=-0.01688 ±0.002281
luafun | warmup | 28.0 (28.0, 28.0) | 4.879 (4.861, 4.904) | 0.17864 δ=0.01293 ±0.001222 [was ±0.004918]
mandelbrot | flat | - | - | 0.34166 δ=-0.00278 ±0.000067
mandelbrot_bit | flat | - | - | 0.21577 ±0.000024
md5 | flat | - | - | 0.09548 δ=-0.01731 ±0.000037
meteor | warmup | 2.0 (2.0, 3.0) [was (2.0, 18.0)] | 0.273 (0.269, 0.493) | 0.21464 ±0.002170 [was ±0.003935]
nbody | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 1.9) | 0.000 (0.000, 0.160) | 0.17695 δ=0.01671 ±0.002226
nsieve | warmup | 2.0 (2.0, 2.6) [was (2.0, 2.0)] | 0.180 (0.179, 0.282) | 0.16982 δ=-0.00922 ±0.000862 [was ±0.000640]
nsieve_bit | warmup | 4.0 (3.7, 5.0) [was (3.4, 5.3)] | 0.273 (0.247, 0.361) | 0.08780 ±0.000233 [was ±0.000054]
partialsums | warmup | 2.0 (2.0, 2.3) [was (2.0, 2.0)] | 0.161 (0.160, 0.207) | 0.14860 ±0.001611 [was ±0.002044]
pidigits | good inconsistent (8 warmup, 7 flat) | 5.0 (1.0, 6.0) [was (1.0, 2.3)] | 0.516 (0.000, 0.646) | 0.12766 ±0.000032 [was ±0.002132]
queens | good inconsistent (14 warmup, 1 flat) | 2.0 (1.7, 2.0) [was (1.0, 294.4)] | 0.162 (0.113, 0.162) | 0.15853 δ=0.04015 ±0.000231 [was ±0.000751]
quicksort | warmup | 2.0 (2.0, 2.3) [was (2.0, 4.0)] | 0.278 (0.278, 0.361) | 0.27183 ±0.000469 [was ±0.067395]
radixsort | flat | - | - | 0.12621 ±0.000757 [was ±0.000403]
ray | flat | - | - | 0.35530 δ=0.04568 ±0.000983
recursive_ack | bad inconsistent (14 flat, 1 slowdown) | 1.0 (1.0, 19.0) | 0.000 (0.000, 2.562) | 0.14228 δ=0.02253 ±0.000616
recursive_fib | flat | - | - | 0.28989 δ=0.05925 ±0.000033
resty_json | flat | - | - | 0.07534 ±0.000595 [was ±0.002629]
revcomp | flat | - | - | 0.11684 ±0.002139 [was ±0.001754]
richards | warmup | 2.0 (2.0, 3.2) [was (1.0, 2.0)] | 0.171 (0.170, 0.369) | 0.16559 δ=0.02935 ±0.000342 [was ±0.010223]
scimark_fft | warmup | 2.0 (2.0, 10.3) [was (2.0, 4.7)] | 0.141 (0.141, 1.195) | 0.12709 ±0.000102 [was ±0.000823]
scimark_lu | flat | - | - | 0.12733 δ=0.01186 ±0.000159
scimark_sor | flat | - | - | 0.13297 δ=0.01189 ±0.000005
scimark_sparse | flat | - | - | 0.13082 δ=0.00740 ±0.000490
series | warmup | 2.0 (2.0, 2.0) | 0.347 (0.347, 0.348) | 0.33390 ±0.000869 [was ±0.003217]
spectralnorm | flat | - | - | 0.13989 δ=0.00002 ±0.000003
table_cmpsort | slowdown | 10.0 (10.0, 10.0) | 1.945 (1.935, 1.967) | 0.22008 ±0.001852 [was ±0.007836]

In essence, green cells (in the original tables) mean that RaptorJIT is better than LuaJIT; red cells mean that LuaJIT is better than RaptorJIT; yellow means they’re different in a way that can’t be compared; and white/grey means they’re statistically equivalent. The additional steady performance variation figures (the ± confidence intervals, with “was …” giving the previous value) show whether the steady state performance of different process executions is more predictable or not.
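To pick two rows from the table above as worked examples: capnproto_encode’s steady state performance improves from 0.21698s per iteration under LuaJIT to 0.11818s under RaptorJIT (δ=-0.09880), while heapsort’s worsens from 1.01917s to 1.39055s (δ=0.37138).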

The simple conclusion to draw from this is that there isn’t a simple conclusion to draw from it: the two VMs are sometimes better than each other with no clear pattern. Without having a clear steer either way, we therefore decided to use the original version of LuaJIT as our base.

One of the things that became very clear from our benchmarking is that LuaJIT is highly non-deterministic – indeed, it’s the most non-deterministic VM I’ve seen. The practical effect of this is that even on one program, LuaJIT is sometimes very fast, and sometimes rather slow. This is, at best, very confusing for users who tend to assume that programs perform more-or-less the same every time they’re run; at worst, it can create significant problems when one is trying to estimate things like server provisioning. We therefore tried various things to make performance more consistent.

The most promising approach we alighted upon is what we ended up calling “separate counters”. In a tracing JIT compiler such as LuaJIT, one tracks how often a loop (where loops include both “obvious” things like for loops and less obvious things such as functions) has been executed: once it hits a certain threshold, the loop is traced and compiled into machine code. LuaJIT has an unusual approach to counting loops: it has 64 counters to which all loops are mapped (using the memory address of the bytecode in question). In other words, multiple loops share the same counter: the bigger the program, the more loops share each counter. The advantage of this is that the counter map is memory efficient, and for small programs (e.g. the common LuaJIT benchmarks) it can be highly effective. However, it has very odd effects in real programs, particularly as they get bigger: loops are compiled non-deterministically, based on the particular address in memory at which they happen to have been loaded.
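To make the two schemes concrete, here’s a small Lua model of hot-loop counting. LuaJIT itself does this in C, and the slot-hashing and threshold below are simplifications rather than LuaJIT’s exact code; the second function models the “separate counters” fix described next:

-- SLOTS matches LuaJIT's 64 shared counters; THRESHOLD is made up.
local SLOTS, THRESHOLD = 64, 56

-- Shared counters: a loop's bytecode address picks one of 64 slots,
-- so unrelated loops that collide share warm-up credit, and which
-- loops collide depends on where the bytecode sits in memory.
local shared = {}
local function shared_tick(addr)
  local slot = math.floor(addr / 4) % SLOTS
  shared[slot] = (shared[slot] or THRESHOLD) - 1
  if shared[slot] == 0 then
    shared[slot] = THRESHOLD
    return true  -- trace whichever loop happened to trip the counter
  end
  return false
end

-- Separate counters: each loop carries its own counter, so tracing
-- decisions no longer depend on memory layout.
local function separate_tick(loop)
  loop.hotcount = (loop.hotcount or THRESHOLD) - 1
  if loop.hotcount == 0 then
    loop.hotcount = THRESHOLD
    return true  -- exactly this loop is hot
  end
  return false
end

Under the shared scheme, two cold loops that hash to the same slot can together trip a counter that neither would reach alone; under separate counters, each loop must cross the threshold on its own.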

We therefore altered LuaJIT so that each loop and each function has its own counter, stored in the bytecode to make memory reads/writes more cache friendly. The diff from normal LuaJIT to the separate counters version is as follows:

Results for Normal vs. Counters

Symbol key: bad inconsistent, flat, good inconsistent, no steady state, slowdown, warmup.
Diff against previous results: improved, worsened, different, unchanged (shown as cell colours in the original); δ = change versus normal LuaJIT; [was …] = normal LuaJIT’s value where the variation changed.

Benchmark | Classification | Steady iteration (#) | Steady iteration (s) | Steady performance (s)
array3d | no steady state | - | - | -
binarytrees | flat | - | - | 0.12462 ±0.004058 [was ±0.000532]
bounce | good inconsistent (14 flat, 1 warmup) | 1.0 (1.0, 5.8) | 0.000 (0.000, 0.603) | 0.12515 δ=-0.00280 ±0.000278
capnproto_decode | good inconsistent (9 flat, 6 warmup) | 1.0 (1.0, 24.9) [was (1.0, 45.3)] | 0.000 (0.000, 3.692) | 0.15042 ±0.003797 [was ±0.028466]
capnproto_encode | warmup | 230.0 (56.0, 467.6) [was (52.8, 280.6)] | 28.411 (6.667, 55.951) | 0.11838 δ=-0.09860 ±0.001960 [was ±0.014541]
collisiondetector | bad inconsistent (13 warmup, 2 no steady state) | - | - | -
coroutine_ring | flat | - | - | 0.10680 ±0.003151 [was ±0.001527]
deltablue | warmup | 149.0 (149.0, 274.5) [was (1.0, 1022.9)] | 15.561 (15.430, 28.653) | 0.10159 ±0.001083 [was ±0.003195]
euler14 | warmup | 61.0 (61.0, 68.3) [was (60.0, 83.0)] | 5.650 (5.592, 6.356) | 0.09216 ±0.000159 [was ±0.000743]
fannkuch_redux | flat | - | - | 0.11976 ±0.000012 [was ±0.001502]
fasta | flat | - | - | 0.12200 δ=0.00100 ±0.000597
havlak | no steady state | - | - | -
heapsort | flat | - | - | 1.04378 δ=0.02461 ±0.000789
jsonlua_decode | flat | - | - | 0.12648 δ=0.01370 ±0.000556
jsonlua_encode | flat | - | - | 0.12860 ±0.000879 [was ±0.001761]
knucleotide | flat | - | - | 0.11710 ±0.000541 [was ±0.000811]
life | bad inconsistent (9 warmup, 3 flat, 2 slowdown, 1 no steady state) | - | - | -
luacheck | flat | - | - | 1.00299 ±0.004778 [was ±0.089781]
luacheck_parser | bad inconsistent (12 warmup, 2 no steady state, 1 flat) | - | - | -
luafun | warmup | 69.0 (69.0, 69.0) | 11.481 (11.331, 11.522) | 0.16770 ±0.001564 [was ±0.004918]
mandelbrot | bad inconsistent (14 flat, 1 no steady state) | - | - | -
mandelbrot_bit | flat | - | - | 0.21695 ±0.000142
md5 | flat | - | - | 0.11155 δ=-0.00124 ±0.000043
meteor | good inconsistent (13 warmup, 2 flat) | 14.0 (1.0, 15.0) [was (2.0, 18.0)] | 2.855 (0.000, 3.045) | 0.21606 ±0.004651 [was ±0.003935]
moonscript | warmup | 63.0 (17.7, 184.1) [was (13.1, 423.3)] | 10.046 (2.763, 29.739) | 0.15999 ±0.001405 [was ±0.001568]
nbody | flat | - | - | 0.15898 ±0.001676 [was ±0.002790]
nsieve | warmup | 2.0 (2.0, 2.6) [was (2.0, 2.0)] | 0.189 (0.188, 0.297) | 0.17875 ±0.001266 [was ±0.000641]
nsieve_bit | warmup | 4.0 (2.0, 6.0) [was (3.4, 5.3)] | 0.271 (0.097, 0.446) | 0.08726 δ=-0.00032 ±0.000202 [was ±0.000054]
partialsums | warmup | 2.0 (2.0, 2.9) [was (2.0, 2.0)] | 0.161 (0.161, 0.295) | 0.14916 ±0.000081 [was ±0.002044]
pidigits | warmup | 2.0 (2.0, 4.3) [was (1.0, 2.3)] | 0.130 (0.130, 0.425) | 0.12666 ±0.000122 [was ±0.002133]
queens | good inconsistent (10 flat, 5 warmup) | 1.0 (1.0, 2.0) [was (1.0, 294.4)] | 0.000 (0.000, 0.127) | 0.12484 δ=0.00646 ±0.000317 [was ±0.000751]
quicksort | slowdown | 2.0 (2.0, 2.0) | 0.299 (0.298, 0.304) | 0.44880 δ=0.13763 ±0.020477 [was ±0.067395]
radixsort | flat | - | - | 0.12644 ±0.000864 [was ±0.000403]
ray | flat | - | - | 0.30901 ±0.002140 [was ±0.004022]
recursive_ack | flat | - | - | 0.11958 ±0.000510 [was ±0.000653]
recursive_fib | flat | - | - | 0.22864 ±0.000266 [was ±0.028968]
resty_json | bad inconsistent (12 flat, 2 warmup, 1 no steady state) | - | - | -
revcomp | flat | - | - | 0.11550 ±0.002553 [was ±0.001753]
richards | good inconsistent (14 warmup, 1 flat) | 2.0 (1.7, 2.0) [was (1.0, 2.0)] | 0.150 (0.105, 0.150) | 0.14572 ±0.000324 [was ±0.010223]
scimark_fft | warmup | 2.0 (2.0, 10.0) [was (2.0, 4.7)] | 0.140 (0.140, 1.153) | 0.12639 ±0.000343 [was ±0.000823]
scimark_lu | good inconsistent (11 flat, 4 warmup) | 1.0 (1.0, 45.3) | 0.000 (0.000, 5.122) | 0.11546 ±0.000132 [was ±0.000308]
scimark_sor | flat | - | - | 0.12105 ±0.000148
scimark_sparse | flat | - | - | 0.12315 ±0.000728 [was ±0.000585]
series | warmup | 2.0 (2.0, 2.0) | 0.347 (0.347, 0.348) | 0.33394 ±0.000645 [was ±0.003217]
spectralnorm | flat | - | - | 0.13985 δ=-0.00003 ±0.000007
table_cmpsort | bad inconsistent (13 flat, 1 warmup, 1 slowdown) | 1.0 (1.0, 10.0) | 0.000 (0.000, 2.005) | 0.21828 ±0.003289 [was ±0.007836]

In this case we’re particularly interested in the steady performance variation (the ± confidence intervals, with “was …” giving the previous value), which shows whether benchmarks have predictable steady state performance. The results are fairly clear: separate counters are, overall, a clear improvement. As you might expect, this is not a pure win, because it changes the order in which traces are made. Among other effects, some loops are traced later than before, because their counters no longer hit the required threshold as quickly. This disadvantages some programs, particularly small deterministic benchmarks where loops are highly stable: in such cases, the earlier you trace the better. However, in my opinion, such programs are given undue weight when performance is considered. It’s no secret that some of the benchmarks regularly used to benchmark LuaJIT are highly optimised for LuaJIT as it stands; any change to LuaJIT stands a good chance of degrading their performance. Overall, though, we feel that the gain in consistency, particularly for larger programs, is worth it. There’s a pull request which applies this idea to the Lua Foundation’s fork of LuaJIT, one of the mainstream forks.

We then started looking at various programs that showed odd performance. One problem in particular showed up in more than one benchmark. Here’s a standard example:

[Plot: collisiondetector, Normal, Bencher9, proc. exec. #12 (no steady state)]

The problem – and it doesn’t happen on every process execution, just to make it more fun – is that there are points where the benchmark slows down by over 10% for multiple in-process iterations (e.g. in this process execution, at around in-process iterations 930 and 1050). We tried over 25 separate ways of working out what was causing this – even building an instrumentation system to track what LuaJIT is doing – but in the end it turned out to be related to LuaJIT’s Garbage Collector – sort of. When we moved from the 32-bit to the 64-bit GC, the odd performance went away.

As such, we don’t think that the 64-bit GC “solves” the problem: however, it changes the way that pointers are encoded (doubling in size), which causes the code generator to emit a different style of code, such that the problem seems to go away. Nevertheless, this did make us reevaluate LuaJIT’s GC. Tom then started work on implementing Mike Pall’s suggestion for a new GC for LuaJIT (based partly on Tom’s previous work and also that of Peter Cawley). He has enough implemented to run most small, and some large, programs, but it needs more work to finish it off, at which point evaluating it against the existing Lua GCs will be fascinating!

So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we’ve been able to make a lot of programs (particularly large programs) have more consistent performance; and we’ve got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!

Acknowledgements: Sarah Mount implemented the stats differ; Edd Barrett implemented Krun and answered many questions on it.
