War Stories: Debugging Julia with GDB inside Docker
One of the batch jobs at work is written in Julia, and runs for a few hours every night on a handful of beefy instances on AWS (36 cores, 64G RAM each). Recently, this job started hanging and getting killed by timeouts. Debugging this turned out to be quite an adventure. I'm recording that experience here, in case others can learn from it (or just be entertained by it).
First, because the job took some 140 core-hours, it was impractical to run the whole thing on my laptop. I tried to reproduce the problem locally by feeding the job a small subset of the data, but I could not get it to hang.
So I ssh-ed into the instance with the hung job. I saw that one of the Julia worker processes was using 100% CPU, while the rest of the system was idle. I ran strace and ltrace on the busy process, waited a few minutes, and got nothing. So it wasn't making any system calls or library calls. I started to suspect some kind of infinite loop.
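Attaching these tools to an already-running process looks roughly like this, with $PID standing in for the busy worker's pid:
# -p attaches to a running process; -f follows threads/forks too
sudo strace -f -p $PID
sudo ltrace -f -p $PID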
Some languages have awesome tools that allow you to attach a remote debugger (Java) or a remote shell (Erlang) to a running VM, and inspect what's going on. Julia is JIT-compiled down to the metal, and does not have a real debugger, so my best bet was to attach GDB to the running process and try to see where it was stuck.
This was complicated by the fact that the Julia process was running inside a Docker container. Trying to attach gdb from outside the container didn't work, even with sudo. So I got inside the container with
docker exec -it $CONTAINER_NAME bash
and continued from there. I ran ps aux | grep julia to get the pid of the process inside the container (which is different from the pid outside the container). Then I apt-get installed gdb inside the container, and tried to attach it with gdb -p $PID. I got the following error:
ptrace: Operation not permitted.
Some googling quickly suggested adding --privileged to my docker exec command, but that did not solve the issue. More googling revealed that I also needed to stop apparmor from blocking Docker's ptrace attempt, using
sudo aa-complain /etc/apparmor.d/docker
But this opened another can of worms, as I got
apparmor.common.AppArmorException: 'Syntax Error: Invalid Regex @{PROC}/{*,**^[0-9*],sys/kernel/shm*} in file: /etc/apparmor.d/docker line: 16'
Turns out this is another bug in Docker, and I had to fix the offending regex in /etc/apparmor.d/docker as follows:
- deny @{PROC}/{*,**^[0-9*],sys/kernel/shm*} wkx,
+ deny @{PROC}/{*,**^[0-9]*,sys/kernel/shm*} wkx,
After doing this, I checked the apparmor messages in /var/log/syslog to make sure that it wasn't blocking ptrace anymore. But GDB still gave me ptrace: Operation not permitted. I was stuck for a while, until a colleague helpfully reminded me to use --privileged. I learned that both fixes are necessary (the apparmor bit and the --privileged bit); neither works by itself.
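For anyone retracing these steps, the combination that finally worked for me looked roughly like this (container name and pid are placeholders; remember to fix the regex in /etc/apparmor.d/docker first):
# on the host
sudo aa-complain /etc/apparmor.d/docker
docker exec -it --privileged $CONTAINER_NAME bash
# inside the container
apt-get update && apt-get install -y gdb
ps aux | grep julia    # note: this is the in-container pid, not the host pid
gdb -p $PID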
So I finally managed to attach GDB to the offending process. Knowing that it was using 100% CPU for hours and hours, I used a technique known as the poor man's sampling profiler to try to figure out where the slowness was. I interrupted and continued the process several times, looking at the backtrace every time (a scripted version of this sampling appears right after the backtrace below). Most backtraces looked something like this:
(gdb) bt
#0 0x00007ff4045ee59d in llvm::InstVisitor<(anonymous namespace)::SCCPSolver, void>::visit(llvm::Instruction&) [clone .constprop.345] () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#1 0x00007ff4045ef0a9 in (anonymous namespace)::SCCPSolver::Solve() [clone .constprop.343] () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#2 0x00007ff4045f4308 in (anonymous namespace)::SCCP::runOnFunction(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#3 0x00007ff4049215bf in llvm::FPPassManager::runOnFunction(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#4 0x00007ff4049216ae in llvm::FunctionPassManagerImpl::run(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#5 0x00007ff4049217d3 in llvm::FunctionPassManager::run(llvm::Function&) () from /mnt/data/julia/julia-2ac304dfba/bin/../lib/julia/libjulia.so
#6 0x00007ff404047939 in to_function (li=li@entry=0x7ff204f190f0) at codegen.cpp:675
#7 0x00007ff404047aac in jl_compile (f=<optimized out>) at codegen.cpp:812
#8 0x00007ff40402001b in jl_get_specialization (f=<optimized out>, types=0x7ff202467190) at gf.c:1475
#9 0x00007ff40404e022 in emit_known_call (ff=<optimized out>, args=args@entry=0x7ff201f13010, nargs=nargs@entry=1, ctx=ctx@entry=0x7fffd70cdd50, theFptr=theFptr@entry=0x7fffd70cd610, theF=theF@entry=0x7fffd70cd618, expr=0x7ff2022b2670) at codegen.cpp:2041
#10 0x00007ff404052c99 in emit_call (expr=0x7ff2022b2670, ctx=0x7fffd70cdd50, arglen=2, args=<optimized out>) at codegen.cpp:2644
#11 emit_expr (expr=expr@entry=0x7ff2022b2670, ctx=ctx@entry=0x7fffd70cdd50, isboxed=isboxed@entry=true, valuepos=valuepos@entry=true) at codegen.cpp:3289
#12 0x00007ff4040597d9 in emit_assignment (bp=bp@entry=0x2e82dd0, r=r@entry=0x7ff2022b2670, declType=declType@entry=0x7ff202c1bf10, isVolatile=isVolatile@entry=false, used=used@entry=true, ctx=ctx@entry=0x7fffd70cdd50) at codegen.cpp:3004
#13 0x00007ff4040538f5 in emit_assignment (ctx=0x7fffd70cdd50, r=0x7ff2022b2670, l=<optimized out>) at codegen.cpp:3051
#14 emit_expr (expr=expr@entry=0x7ff2022b2650, ctx=ctx@entry=0x7fffd70cdd50, isboxed=isboxed@entry=false, valuepos=valuepos@entry=false) at codegen.cpp:3292
#15 0x00007ff404045343 in emit_function (lam=lam@entry=0x7ff2032370d0) at codegen.cpp:4835
#16 0x00007ff4040478d1 in to_function (li=li@entry=0x7ff2032370d0) at codegen.cpp:644
#17 0x00007ff404047aac in jl_compile (f=f@entry=0x7ff2022efcf0) at codegen.cpp:812
#18 0x00007ff404029329 in jl_trampoline_compile_function (sig=0x7ff202a49ac0, always_infer=0, f=0x7ff2022efcf0) at builtins.c:1015
#19 jl_trampoline (F=0x7ff2022efcf0, args=0x7fffd70ce2e8, nargs=2) at builtins.c:1030
#20 0x00007ff40401ef95 in jl_apply (nargs=2, args=0x7fffd70ce2e8, f=<optimized out>) at julia.h:1331
#21 jl_apply_generic (F=0x7ff200758db0, args=0x7fffd70ce2e8, nargs=<optimized out>) at gf.c:1708
#22 0x00007ff4050fd230 in ?? ()
#23 0x00007ff1ff634010 in ?? ()
#24 0x000000000000000c in ?? ()
#25 0x00007fffd70ce350 in ?? ()
#26 0x00000000058a9240 in ?? ()
#27 0x0000000000000000 in ?? ()
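As promised, here is roughly how this kind of sampling can be scripted using GDB's batch mode; this is a sketch, not the exact commands I ran:
# take ten stack samples, a few seconds apart
for i in $(seq 10); do
    gdb -batch -p $PID -ex 'bt' 2>/dev/null
    sleep 3
done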
Okay, so we are deep inside the internals of Julia's compiler, the part that uses LLVM for codegen. Now what? Luckily, Julia provides some basic tools for debugging with GDB. I played around with this for a while, trying to get a stack trace of the Julia program. I wasn't able to get a full stack trace, but I did get some interesting details using variations of the following:
(gdb) f 18
#18 0x00007ff404029329 in jl_trampoline_compile_function (sig=0x7ff202a49ac0, always_infer=0, f=0x7ff2022efcf0) at builtins.c:1015
1015 in builtins.c
(gdb) p f
$28 = (jl_function_t *) 0x7ff2022efcf0
(gdb) p jl_(f)
$29 = void
(gdb) p f->env
$30 = (jl_value_t *) 0x7ff1ff634010
(gdb) p jl_(f->env)
$31 = void
(gdb) p jl_(f->linfo)
$32 = void
(gdb) p to_function(f->linfo)
$33 = (llvm::Function *) 0x6a74ff0
(gdb) p jl_dump_llvm_value(to_function(f->linfo))
$34 = void
When you do jl_(x), you get a printout of the value x in the stderr stream of the Julia process being debugged. In my case, that stream was redirected to a log file, so I got nothing in the gdb session, and had to go read the log file. I saw a bunch of unintelligible garbage, with long strings of Float64, Float64, ..., like this:
Expr(:lambda, Array{Any, 1}[:f, :t], Array{Any, 1}[Array{Any, 1}[Array{Any, 1}[:f, Function, 0], Array{Any, 1}[:t, Tuple{Float64, Float64, Float64, Float64, ...SNIP..., Float64, Float64, Float64, Float64, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void}, 0]], Array{Any, 1}[], Array{Any, 1}[Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}}, Tuple{Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}}, Tuple{Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}}, Tuple{Any}, Tuple{Any}, Tuple{Any}], Array{Any, 1}[]], Expr(:body,
# line 63 tuple.jl,
GenSym(5) = Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, t::Tuple{Float64, Float64, Float64, Float64, ...SNIP..., Float64, Float64, Float64, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void}, 1)::Float64)::Any)::Tuple{Any},
GenSym(0) = Expr(:call, Base.tail, t::Tuple{Float64, Float64, Float64, Float64, ...SNIP..., Float64, Float64, Float64, Float64, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void, Void})::Tuple{Float64, Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}},
GenSym(4) = Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, GenSym(0), 1)::Float64)::Any)::Tuple{Any},
GenSym(1) = Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, Base.argtail, GenSym(0))::Tuple{Float64, Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}},
GenSym(3) = Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, GenSym(1), 1)::Float64)::Any)::Tuple{Any},
GenSym(2) = Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, Base.argtail, GenSym(1))::Tuple{Float64, Float64, Float64, Float64, Vararg{Union{Float64, Void}}},
Expr(:return, Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), GenSym(5), Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), GenSym(4), Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), GenSym(3), Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, top(:tuple), Expr(:call, top(:tuple), Expr(:call, f::Function, Expr(:call, Base.getfield, GenSym(2), 1)::Float64)::Any)::Tuple{Any}, Expr(:call, Base.map, f::Function, Expr(:call, top(:_apply), Expr(:call, top(:getfield), Base, :call)::Function, Base.argtail, GenSym(2))::Tuple{Float64, Float64, Float64, Vararg{Union{Float64, Void}}})::Tuple{Any, Vararg{Any}})::Tuple{Any, Any, Vararg{Any}})::Tuple{Any, Any, Any, Vararg{Any}})::Tuple{Any, Any, Any, Any, Vararg{Any}})::Tuple{Any, Any, Any, Any, Any, Vararg{Any}})::Any)::Tuple{Any, Any, Any, Any, Any, Vararg{Any}})::Any
I looked at line 63 of tuple.jl in the version of Julia that I was running, but that was utterly unenlightening. After a few hours spent on this, I didn't seem to be getting anywhere. Disgusted, I went home and slept.
The next day, I was thinking about the Vararg{Union{Float64, Void}} bits seen in the dump above, and I had two strokes of insight. First: Vararg meant that there was a function with a variable number of arguments, using "splat" syntax like f(arg1, arg2, more_args...). The program was about 1500 lines, and there were about ten places using the splat syntax. This narrowed down the search space for the bug considerably.
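Finding those call sites was just a matter of grepping for the splat ellipsis, along these lines (the path here is made up; ours differed):
grep -n '\.\.\.' src/*.jl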
Second: Union{Float64, Void} meant that Julia was dealing with a value that could be either a float or a nothing. That was weird. I couldn't think of where that might happen, but then I realized that we were parsing a bunch of float arrays from JSON. Could a null somehow sneak in among those floats? A quick scan of our input dataset, using jq and grep, confirmed this hypothesis: the arrays were supposed to be all floats, but contained the occasional null. (This was the result of serializing NaN values into JSON, which in turn was due to a freshly introduced bug in another system.)
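The scan itself was nothing fancy; something along these lines, with a made-up filename (I don't remember the exact query):
# count nulls anywhere in the document
jq '[.. | select(. == null)] | length' input.json
# or, cruder:
grep -o null input.json | wc -l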
Putting these two insights together, I was able to pinpoint the location in our Julia code that was the cause of the apparent infinite loop. It was an array concatenation using splat syntax, like result = [x..., y...]. When x and y only contained floats, this was not an issue. But when one of them contained an occasional nothing, Julia's type inference for the resulting array went haywire, and took time exponential in the length of the arrays. Here's a minimal example to illustrate this exponential growth in running time (in Julia 0.4.6):
julia> for n in [100, 150, 200, 250, 300, 350, 400]
           x = zeros(Float64, n)
           y = [0.0, nothing]
           print("n=$n: ")
           @time result = [x..., y...]
       end
n=100: 0.711594 seconds (208.66 k allocations: 12.082 MB, 3.53% gc time)
n=150: 1.409423 seconds (88.71 k allocations: 4.886 MB)
n=200: 2.961411 seconds (100.57 k allocations: 5.549 MB)
n=250: 7.311390 seconds (126.25 k allocations: 6.930 MB, 0.07% gc time)
n=300: 16.698541 seconds (153.91 k allocations: 8.400 MB)
n=350: 25.183418 seconds (183.35 k allocations: 9.966 MB, 0.02% gc time)
n=400: 39.814280 seconds (214.75 k allocations: 11.587 MB)
So it wasn't an infinite loop at all, just an incredibly slow operation. (This also explains why I did not observe any problem when running the program locally on a subset of the data.) The workaround was to use result = Any[x..., y...] or result = vcat(x, y), thus constraining the type of the resulting array and avoiding this worst-case inference behavior. I actually went with result = Float64[x..., y...], since I wanted the program to fail hard if there were any nothings in the arrays.
I bundled up the minimal example and the workarounds and submitted a bug report to the Julia project, feeling a mixture of triumph and futility at being able to track down this nasty bug, just so a few bits of the Internet would continue to work for a few more days.