War Stories: Debugging Julia with GDB inside Docker

One of the batch jobs at work is written in Julia and runs for a few hours every night on a handful of beefy AWS instances (36 cores and 64 GB of RAM each). Recently, this job started hanging and getting killed by timeouts. Debugging this turned out to be quite an adventure. I'm recording that experience here, in case others can learn from it (or just be entertained by it).

First, because the job took some 140 core-hours, it was impractical to run the whole thing on my laptop. I tried to reproduce the problem locally by feeding the job a small subset of the data, but I could not get it to hang.

So I ssh-ed into the instance with the hung job. I saw that one of the Julia worker processes was using 100% CPU, while the rest of the system was idle. I ran strace and ltrace on the busy process, waited a few minutes, and got nothing.
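
The invocations were roughly these (the pid is illustrative):

# -f follows threads, -tt adds timestamps; both printed nothing at all
sudo strace -f -tt -p $PID
sudo ltrace -p $PID

So the process wasn't making any system calls or library calls. I started to suspect some kind of infinite loop.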

Some languages have awesome tools that allow you to attach a remote debugger (Java) or a remote shell (Erlang) to a running VM, and inspect what's going on. Julia is JIT-compiled down to the metal, and does not have a real debugger, so my best bet was to attach GDB to the running process and try to see where it was stuck.

This was complicated by the fact that the Julia process was running inside a Docker container. Trying to attach gdb from outside the container didn't work, even with sudo. So I got inside the container with

docker exec -it $CONTAINER_NAME bash

and continued from there. I ran ps aux | grep julia to get the pid of the process inside the container (which is different from its pid outside the container), installed gdb inside the container with apt-get, and tried to attach to the process.
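
In shell terms, roughly (the pid being whatever ps reported):

ps aux | grep julia                       # pid inside the container
apt-get update && apt-get install -y gdb
gdb -p $PID

I got the following error: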

ptrace: Operation not permitted.

Some googling quickly suggested adding --privileged to my docker exec command, but that did not solve the issue. More googling revealed that I also needed to stop apparmor from blocking Docker's ptrace attempt, using

sudo aa-complain /etc/apparmor.d/docker

But this opened another can of worms, as I got

apparmor.common.AppArmorException: 'Syntax Error: Invalid Regex @{PROC}/{*,**^[0-9*],sys/kernel/shm*} in file: /etc/apparmor.d/docker line: 16'

Turns out this was a bug in the apparmor profile that ships with Docker, and I had to fix the offending regex in /etc/apparmor.d/docker as follows:

- deny @{PROC}/{*,**^[0-9*],sys/kernel/shm*} wkx,
+ deny @{PROC}/{*,**^[0-9]*,sys/kernel/shm*} wkx,

After doing this, I checked the apparmor messages in /var/log/syslog to make sure that it wasn't blocking ptrace anymore. But GDB still gave me ptrace: Operation not permitted. I was stuck for a while, until a colleague helpfully reminded me to use --privileged. I learned that both fixes are necessary (the apparmor bit and the --privileged bit); neither works by itself.
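
For the record, the combination that finally worked was roughly this (container name illustrative, run from the host):

# put Docker's apparmor profile into complain mode (after fixing the regex)
sudo aa-complain /etc/apparmor.d/docker
# check that apparmor is no longer logging ptrace denials
grep -i apparmor /var/log/syslog | tail
# re-enter the container with extended privileges
docker exec --privileged -it $CONTAINER_NAME bash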

So I finally managed to attach GDB to the offending process. Since it had been pegged at 100% CPU for hours and hours, I used a technique known as the poor man's sampling profiler to figure out where it was spending its time: interrupt the process, look at the backtrace, continue, and repeat.
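
The same sampling can be scripted from outside an interactive session; a rough sketch (pid and sample count illustrative):

for i in $(seq 10); do
    # each attach briefly stops the process, prints a backtrace, and detaches
    gdb -p $PID -batch -ex 'bt'
    sleep 2
done

Most backtraces looked something like this: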

(gdb) bt
#0  0x00007ff4045ee59d in llvm::InstVisitor<(anonymous namespace)::SCCPSolver, void>::visit(llvm::Instruction&) [clone .constprop.345] () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#1  0x00007ff4045ef0a9 in (anonymous namespace)::SCCPSolver::Solve() [clone .constprop.343] () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#2  0x00007ff4045f4308 in (anonymous namespace)::SCCP::runOnFunction(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#3  0x00007ff4049215bf in llvm::FPPassManager::runOnFunction(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#4  0x00007ff4049216ae in llvm::FunctionPassManagerImpl::run(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#5  0x00007ff4049217d3 in llvm::FunctionPassManager::run(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#6  0x00007ff404047939 in to_function (li=li@entry=0x7ff204f190f0) at codegen.cpp:675
#7  0x00007ff404047aac in jl_compile (f=<optimized out>) at codegen.cpp:812
#8  0x00007ff40402001b in jl_get_specialization (f=<optimized out>, types=0x7ff202467190) at gf.c:1475
#9  0x00007ff40404e022 in emit_known_call (ff=<optimized out>, args=args@entry=0x7ff201f13010, nargs=nargs@entry=1, ctx=ctx@entry=0x7fffd70cdd50, theFptr=theFptr@entry=0x7fffd70cd610, theF=theF@entry=0x7fffd70cd618, expr=0x7ff2022b2670) at codegen.cpp:2041
#10 0x00007ff404052c99 in emit_call (expr=0x7ff2022b2670, ctx=0x7fffd70cdd50, arglen=2, args=<optimized out>) at codegen.cpp:2644
#11 emit_expr (expr=expr@entry=0x7ff2022b2670, ctx=ctx@entry=0x7fffd70cdd50, isboxed=isboxed@entry=true, valuepos=valuepos@entry=true) at codegen.cpp:3289
#12 0x00007ff4040597d9 in emit_assignment (bp=bp@entry=0x2e82dd0, r=r@entry=0x7ff2022b2670, declType=declType@entry=0x7ff202c1bf10, isVolatile=isVolatile@entry=false, used=used@entry=true, ctx=ctx@entry=0x7fffd70cdd50) at codegen.cpp:3004
#13 0x00007ff4040538f5 in emit_assignment (ctx=0x7fffd70cdd50, r=0x7ff2022b2670, l=<optimized out>) at codegen.cpp:3051
#14 emit_expr (expr=expr@entry=0x7ff2022b2650, ctx=ctx@entry=0x7fffd70cdd50, isboxed=isboxed@entry=false, valuepos=valuepos@entry=false) at codegen.cpp:3292
#15 0x00007ff404045343 in emit_function (lam=lam@entry=0x7ff2032370d0) at codegen.cpp:4835
#16 0x00007ff4040478d1 in to_function (li=li@entry=0x7ff2032370d0) at codegen.cpp:644
#17 0x00007ff404047aac in jl_compile (f=f@entry=0x7ff2022efcf0) at codegen.cpp:812
#18 0x00007ff404029329 in jl_trampoline_compile_function (sig=0x7ff202a49ac0, always_infer=0, f=0x7ff2022efcf0) at builtins.c:1015
#19 jl_trampoline (F=0x7ff2022efcf0, args=0x7fffd70ce2e8, nargs=2) at builtins.c:1030
#20 0x00007ff40401ef95 in jl_apply (nargs=2, args=0x7fffd70ce2e8, f=<optimized out>) at julia.h:1331
#21 jl_apply_generic (F=0x7ff200758db0, args=0x7fffd70ce2e8, nargs=<optimized out>) at gf.c:1708
#22 0x00007ff4050fd230 in ?? ()
#23 0x00007ff1ff634010 in ?? ()
#24 0x000000000000000c in ?? ()
#25 0x00007fffd70ce350 in ?? ()
#26 0x00000000058a9240 in ?? ()
#27 0x0000000000000000 in ?? ()

Okay, so we are deep inside the internals of Julia's compiler, the part that uses LLVM for codegen. Now what? Luckily, Julia provides some basic tools for debugging with GDB. I played around with this for a while, trying to get a stack trace of the Julia program. I wasn't able to get a full stack trace, but I did get some interesting details using variations of the following:

(gdb) f 18
#18 0x00007ff404029329 in jl_trampoline_compile_function (sig=0x7ff202a49ac0, always_infer=0, f=0x7ff2022efcf0) at builtins.c:1015
1015    in builtins.c
(gdb) p f
$28 = (jl_function_t *) 0x7ff2022efcf0
(gdb) p jl_(f)
$29 = void
(gdb) p f->env
$30 = (jl_value_t *) 0x7ff1ff634010
(gdb) p jl_(f->env)
$31 = void
(gdb) p jl_(f->linfo)
$32 = void
(gdb) p to_function(f->linfo)
$33 = (llvm::Function *) 0x6a74ff0
(gdb) p jl_dump_llvm_value(to_function(f->linfo))
$34 = void

When you do jl_(x), you get a printout of the value x in the stderr stream of the Julia process being debugged. In my case, that stream was redirected to a log file, so I got nothing in the gdb session and had to go read the log file instead. There I saw a bunch of unintelligible garbage, with long strings of Float64, Float64, ..., like this:

Expr(:lambda, Array{Any, 1}[:f, :t], Array{Any, 1}[Array{Any, 1}[Array{Any, 1}[:f, Function, 0], Array{Any, 1}[:t, Tuple{Float64, Float64, Float64, Float64, ...SNIP..., Float64, Float64, Float64, Float64, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void}, 0]], Array{Any, 1}[], Array{Any, 1}[Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}}, Tuple{Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}}, Tuple{Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}}, Tuple{Any}, Tuple{Any}, Tuple{Any}], Array{Any, 1}[]], Expr(:body,
# line 63 tuple.jl,
GenSym(5) = Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, t::Tuple{Float64, Float64, Float64, Float64, ...SNIP..., Float64, Float64, Float64, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void}, 1)::Float64)::Any)::Tuple{Any},
GenSym(0) = Expr(:call, Base.tail, t::Tuple{Float64, Float64, Float64, Float64, ...SNIP..., Float64, Float64, Float64, Float64, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void})::Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}},
GenSym(4) = Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, GenSym(0), 1)::Float64)::Any)::Tuple{Any},
GenSym(1) = Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, Base.argtail, GenSym(0))::Tuple{Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}},
GenSym(3) = Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, GenSym(1), 1)::Float64)::Any)::Tuple{Any},
GenSym(2) = Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, Base.argtail, GenSym(1))::Tuple{Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}},
Expr(:return, Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), GenSym(5), Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), GenSym(4), Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), GenSym(3), Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, GenSym(2), 1)::Float64)::Any)::Tuple{Any}, Expr(:call, Base.map, f::Function, Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, Base.argtail, GenSym(2))::Tuple{Float64, Float64, Float64, Vararg{Union{Float64, Void}}})::Tuple{Any, Vararg{Any}})::Tuple{Any, Any, Vararg{Any}})::Tuple{Any, Any, Any, Vararg{Any}})::Tuple{Any, Any, Any, Any, Vararg{Any}})::Tuple{Any, Any, Any, Any, Any, Vararg{Any}})::Any)::Tuple{Any, Any, Any, Any, Any, Vararg{Any}})::Any

I looked at line 63 of tuple.jl in the version of Julia that I was running, but it was utterly unenlightening. After a few hours spent on this, I didn't seem to be getting anywhere. Disgusted, I went home and slept.

The next day, I was thinking about the Vararg{Union{Float64, Void}} bits seen in the dump above, and I had two strokes of insight. First: Vararg meant that there was a function with a variable number of arguments, using "splat" syntax like f(arg1, arg2, more_args...). The program was about 1500 lines, and there were about ten places using the splat syntax. This narrowed down the search space for the bug considerably.

Second: Union{Float64, Void} meant that Julia was dealing with a value that could be either a float or a nothing. That was weird. I couldn't think of where that might happen, but then I realized that we were parsing a bunch of float arrays from JSON. Could a null somehow sneak in among those floats? A quick scan of our input dataset, using jq and grep, confirmed this hypothesis: the arrays were supposed to be all floats, but contained the occasional null. (This was the result of serializing NaN values into JSON, which in turn was due to a freshly introduced bug in another system.)
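
Something along these lines was enough to confirm it (the file name is illustrative):

# count nulls anywhere in the document
jq '[.. | select(. == null)] | length' input.json
# or just count the lines mentioning the literal token
grep -c null input.json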

Putting these two insights together, I was able to pinpoint the location in our Julia code that was the cause of the apparent infinite loop. It was an array concatenation using splat syntax, like result = [x..., y...]. When x and y only contained floats, this was not an issue. But when one of them contained an occasional nothing, Julia's type inference for the resulting array went haywire, and took time exponential in the length of the arrays. Here's a minimal example to illustrate this exponential growth in running time (in Julia 0.4.6):

julia> for n in [100, 150, 200, 250, 300, 350, 400]
           x = zeros(Float64, n)
           y = [0.0, nothing]
           print("n=$n: ")
           @time result = [x..., y...]
       end
n=100:   0.711594 seconds (208.66 k allocations: 12.082 MB, 3.53% gc time)
n=150:   1.409423 seconds (88.71 k allocations: 4.886 MB)
n=200:   2.961411 seconds (100.57 k allocations: 5.549 MB)
n=250:   7.311390 seconds (126.25 k allocations: 6.930 MB, 0.07% gc time)
n=300:  16.698541 seconds (153.91 k allocations: 8.400 MB)
n=350:  25.183418 seconds (183.35 k allocations: 9.966 MB, 0.02% gc time)
n=400:  39.814280 seconds (214.75 k allocations: 11.587 MB)

So it wasn't an infinite loop at all, just an incredibly slow operation. (This also explains why I did not observe the problem when running the program locally on a subset of the data.) The workaround was to use result = Any[x..., y...], which explicitly constrains the element type of the resulting array, or result = vcat(x, y), which avoids the splatting entirely; either one sidesteps this worst-case inference behavior. I actually went with result = Float64[x..., y...], since I wanted the program to fail hard if there were any nothings in the arrays.

I bundled up the minimal example and the workarounds and submitted a bug report to Julia, feeling a mixture of triumph and futility at having tracked down this nasty bug, just so a few bits of the Internet will continue to work for a few more days.