Debugging for Systems Engineers

Originally posted as part of sysadvent

Overview

I’ve mentioned before that my favourite prior sysadvent articles is Down The Rabbit Hole.

This year amongst some of the reading I’ve done, I really enjoyed the articles from [http://jvns.ca/](Julia Evans) on strace. I wanted to write something for sysadvent that would be interesting but focussed on debuggers. There isn’t enough space here to give a full debugger tutorial but instead wanted to give some cases when I reach for a debugger and what for, with some specific tips and examples thrown in. If you’re already an expert at using debuggers you can probably stop reading now.

Many of us are familiar with some of variety this quote:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

Brian Kerninghan

learning debugging techniques for programs you haven’t written is a skill I’d recommend systems engineers/administrators practise.

Forms of debugging

Print statement debugging

The most effective debugging tool is still careful thought, coupled with judiciously placed print statements

Brian Kernighan

If you spend any time on the golang-nuts list you’ll see that using gdb with highly concurrent runtime such as go is problematic, and certainly the latter part of the above quote is often recommended. If you have the luxury of being able to rebuild and rerun your application or service - such as when developing it then print debugging is a valuable part of your too chain. If you’ve written a helloworld program in a language you know at least one way of getting your program to print information out to you.

Program tracing

Julia has a fantastic blog post on debugging your programs like they’re closed source covers using strace and friends to do black box analysis using operating systems tools.

Using a debugger

This is where I want to share some of my knowledge. …

Setup

Lets build a container for our debugging experiments and launch a shell in it. I’m using Docker to make it easy for you to follow along with the same environment, lets create a directory for this with the following Dockerfile

FROM fedora:20

RUN yum -y update && yum clean all
RUN yum -y install gdb yum-utils && yum clean all
RUN debuginfo-install -y coreutils && yum clean all
RUN mkdir -p /tmp/a && touch /tmp/a/foo

Most distributions ship with stripped and optimized binaries, so to get some more meaningful traces we need to install some additional packages for debug symbols. Debian and Fedora derived distros (and probably others) include ways to get these extra packages:

$ docker build -t gdb .
$ docker run -ti gdb /bin/bash

Running a program through the debugger

Now off lets setup to launch a program with some args in the debugger:

gdb  --args /bin/ls /tmp/a
GNU gdb (GDB) Fedora 7.7.1-21.fc20
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /bin/ls...Reading symbols from /usr/lib/debug/usr/bin/ls.debug...done.
done.
(gdb)

As you can see gdb is quite verbose by default so following invocations of gdb will use the quiet flag -q to suppres the banner and license information.

All that has happened is so far is that gdb has loaded the symbols from the file and the additional debuginfo, we’ve not run the program yet. Lets just do that now by typing run - note that (gdb) is the prompt output from gdb.

(gdb) run
Starting program: /usr/bin/ls /tmp/a
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
foo
[Inferior 1 (process 28) exited normally]
Missing separate debuginfos, use: debuginfo-install pcre-8.33-7.fc20.x86_64 xz-libs-5.1.2-12alpha.fc20.x86_64

So far so good - we see it starts the program, prints some info on how it is running and then the foo is the output of our ls command.

To make it clearer maybe we should have chosen ls -l lets change the state of the debugger to update the arguments

(gdb) show args
Argument list to give program being debugged when it is started is "/tmp/a".
(gdb) set args -l /tmp/a
(gdb) show args
Argument list to give program being debugged when it is started is "-l /tmp/a".
(gdb) run
Starting program: /usr/bin/ls -l /tmp/a
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
total 0
-rw-r--r-- 1 root root 0 Dec  3 15:44 foo
[Inferior 1 (process 32) exited normally]

Exit gdb using quit or Ctrl-D.

Stop the world

Still in our container lets start gdb again this time explicitly using ls -l:

gdb -q --args /bin/ls -l /tmp/a
Reading symbols from /bin/ls...Reading symbols from /usr/lib/debug/usr/bin/ls.debug...done.
done.
(gdb)

Running a program straight through isn’t that interesting, if you’ve done basic C you’ll know that the main method is the entry point to our program. We’re going to tell gdb to create a breakpoint to pause the program when it hits the main method.

(gdb) break main
Breakpoint 1 at 0x402c60: file src/ls.c, line 1242.

Using gdb you can abbreviate commands until they become ambiguous. We could have used b main. With the breakpoint in place lets run again:

Starting program: /usr/bin/ls -l /tmp/a
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, main (argc=3, argv=0x7fffffffe798) at src/ls.c:1242
1242	{
Missing separate debuginfos, use: debuginfo-install pcre-8.33-7.fc20.x86_64 xz-libs-5.1.2-12alpha.fc20.x86_64
(gdb)

The above tells us we’ve hit the breakpoint at line 1242 which has a single { , but not much else. Lets see some more context by using the list command:

(gdb) list
1237	    }
1238	}
1239
1240	int
1241	main (int argc, char **argv)
1242	{
1243	  int i;
1244	  struct pending *thispend;
1245	  int n_files;
1246
(gdb)

Repeated use of list will show us the next few lines, with gdb just hitting return will repeat the previous command - eg if you hit return after the above listing we see the next few lines.

(gdb)
1247	  /* The signals that are trapped, and the number of such signals.  */
1248	  static int const sig[] =
1249	    {
1250	      /* This one is handled specially.  */
1251	      SIGTSTP,
1252
1253	      /* The usual suspects.  */
1254	      SIGALRM, SIGHUP, SIGINT, SIGPIPE, SIGQUIT, SIGTERM,
1255	#ifdef SIGPOLL
1256	      SIGPOLL,

Even though we’re paging through the source code we’ve still not executed it - if you get confused you can always ask gdb where you are:

(gdb) where
#0  main (argc=3, argv=0x7fffffffe798) at src/ls.c:1242

If you’re not familiar with C the argc is the count of arguments and argv is an array of strings. We can use gdb’s print command - which abbreviates to p to inspect these.

(gdb) print argc
$2 = 3
(gdb) p argv[0]
$3 = 0x7fffffffe97f "/usr/bin/ls"
(gdb) p argv[1]
$4 = 0x7fffffffe98b "-l"
(gdb) p argv[2]
$5 = 0x7fffffffe98e "/tmp/a"

For now ignore the $2 = part and just focus on the values. For argc which is an int we just get the numeric value. Using the size we can look at the three strings forming argv using array access.

ls is quite a large program - for now lets tell it to continue executing:

(gdb) cont
Continuing.
total 0
-rw-r--r-- 1 root root 0 Dec  3 15:44 foo
[Inferior 1 (process 50) exited normally]

Now we know a little on how to set a breakpoint and get some basic information on variables in the program. Lets try another program - setting a breakpoint in another method. For now I’m choosing to use cat.

[root@7dffa749b76c /]# gdb -q --args /bin/cat /etc/passwd
Reading symbols from /bin/cat...Reading symbols from /usr/lib/debug/usr/bin/cat.debug...done.
done.
(gdb) break simple_cat
Breakpoint 1 at 0x4026b8: file src/cat.c, line 177.
(gdb) run
Starting program: /usr/bin/cat /etc/passwd

Breakpoint 1, main (argc=2, argv=<optimized out>) at src/cat.c:730
730	          ok &= simple_cat (ptr_align (inbuf, page_size), insize);

So we stop before we execute the simple_cat method

(gdb) list
725	             || show_tabs || squeeze_blank))
726	        {
727	          insize = MAX (insize, outsize);
728	          inbuf = xmalloc (insize + page_size - 1);
729
730	          ok &= simple_cat (ptr_align (inbuf, page_size), insize);
731	        }
732	      else
733	        {
734	          inbuf = xmalloc (insize + 1 + page_size - 1);

Rather than continuing the exection we are going to tell the debugger to execute the simple_cat method then stop using the next command. Some debuggers call this “stepping over”.

(gdb) next
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
Breakpoint 1, main (argc=2, argv=<optimized out>) at src/cat.c:730
730	          ok &= simple_cat (ptr_align (inbuf, page_size), insize);
(gdb) p ok
$1 = true

As you can see we’ve not moved lines but have executed the subroutine - which output the file and returned. We can see the value of ok set. Stepping over is useful when you want to follow the flow of something but don’t care about what it is calling.

Type cont to continue execution.

Going deeper

In addition to stepping over, most debuggers allow you to follow the code into the method by stepping into. In gdb the command is step

gdb -q --args /bin/cat /etc/passwd
Reading symbols from /bin/cat...Reading symbols from /usr/lib/debug/usr/bin/cat.debug...done.
done.
(gdb) break simple_cat
Breakpoint 1 at 0x4026b8: file src/cat.c, line 177.
(gdb) run
Starting program: /usr/bin/cat /etc/passwd

Breakpoint 1, main (argc=2, argv=<optimized out>) at src/cat.c:730
730	          ok &= simple_cat (ptr_align (inbuf, page_size), insize);
(gdb) step
simple_cat (bufsize=65536,
    buf=0x60f000 "root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\ndaemon:x:2:2:daemon:/sbin:/sbin/nologin\nadm:x:3:4:adm:/var/adm:/sbin/nologin\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\nsync:x:5:0:sync:/"...) at src/cat.c:177
177	      if (n_read == 0)

When you’ve entered a function cont acts as a sort of step return where you run all the code in the method and stop at the return point.

(gdb) cont
Continuing.
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin

Breakpoint 1, main (argc=2, argv=<optimized out>) at src/cat.c:730
730	          ok &= simple_cat (ptr_align (inbuf, page_size), insize);

Getting to know your codebases

I learned this technique from a friend as an early thing I do when learning a new service or application I’m going to work on or support.

Requirements:

I make sure have a US Letter/A4 pad which I usually make sure is landscape, some pencils/pens next to me.
Next step is to learn how to checkout, build and run the application locally.
Choose a point to dive in for - eg for a command line app this might be main()

Then I use step-debugging to follow through and sketch out the main conceptually flow of the program, and dive in to parts I’m interested in. This works particularly well within IDE debuggers with the source code such as Eclipse or Intellij (or gud mode of emacs). Where you can set a break from the sourcecode and using the debugger. If you’re working in a Java environment I’d strongly encourage you to learn the IDE your developer’s use and how the debugger works there. Learning to navigate the codebase and understanding what live thread and stack dumps look like can really help you reason better about the application when troubleshooting.

Test-driven debugging

This is an expansion of the previous point - some language debuggers such as pdb or pry make it easy to add a breakpoint programatically within the test, or using the IDE or even just running the test in the debugger set the breakpoint and step into.

I’ve often used this as a way to improve my understanding of the code under test and the API before writing a new test. It allows you to drill deep.

Debugging production issues

At various points in my career I’ve had to dive in at quite a low level and figure out what is happening - with large and complex multithreaded applications this can be non-trivial. Often you really just want to get the stack trace of the program when the issue manifests.

You want to apply careful thought before ever doing this.

Come up with a hypothesis
Attempt to isolate problem
Run with debugger in isolated environment

file descriptor leaks within a large JVM application including JNI database client.

When facing these issues gdb can be very powerful - in addition to the example of running the program in gdb you can attach to processes or analyse core dumps.

Scripting your debugger

gdb supports various forms of automation and scripting - for example you can define helper functions. One reason I’ve used this in the path is when debugging the JVM - the JVM users various signals internally so we can to configure gdb to ignore these.

define java_signals
  handle SIGTSTP noprint nostop
  handle SIGSEGV noprint nostop
  handle SIGQUIT noprint nostop
  handle SIGABRT stop print nopass 
end

SIGQUIT gives us a JVM stacktrace so we can understand how where we are natively corresponds to where we are in our Java codebase. You can also define a gdb function to force that.

define jtrace
  call kill(getpid(), 3)
  cont
end

The times I’ve had to connect debugger to real live production tasks - this should always be done with caution - if you can ensure it’s not serving real user traffic either by is best as debuggers suspend execution which can impact user requests. You also may find that your monitoring or health checking may fail and kill the process under the debugger. Often a stacktrace is enough.

gdb supports attaching commands on a breakpoint so you can run some code so you could for example run the above jtrace function to force a Java stack trace when you hit some specific state and continue. If you’re absolutely stuck with no alternative but to run the debugger on a production serving process practice in a test environment first to learn how to use breakpoint commands to minimize the amount of time the program will be suspended.

In addition to gdb scripting modern gdb supports python to write richer extensions, pretty printers, etc. Two examples I’ve used as reference are gdb-heap which allows inspection of python’s memory usage and Go’s runtime-gdb.py which is an interesting way to learn some of the runtime internals by seeing how it looks up goroutines, etc.

Conclusion

So we’ve learned the basic debugger primatives of

Setting a breakpoint
Inspecting some state
Advancing execution by stepping over a call
Debugging further in by stepping into a call

We’ve taken a look at some actual real world tasks where debugger knowledge is useful.

Working with multithreaded, concurrent and complex applications with a debugger is more challenging than the simple walk through here. Like many tools in your portfolio it’s best to practice using your debuggers before you’re in a situation where that’s the only way you could get information.

Pris Nasrat