So far, we've parsed enough of the contents of an ELF executable to get the various debug info sections as raw byte arrays. Now, we'll use all these sections together to make progress towards properly inspecting a running instance of the target program.
This post's goal is to construct the Debug Information Entry (DIE) tree. This is a long task with several side quests along the way.
A DIE is a piece of information about a particular node of a compiled program, and they are stored in a tree. When compiler authors write parsers/lexers, the first step in the pipeline is generally attempting to turn the source text of a program in to an Abstract Syntax Tree, and you can sort of think of the DIE tree as a similar thing for DWARF debuggers. All your compile units, functions, variables, struct definitions, etc. each get an entry in the DIE tree. A DIE tree is really more like a forest in that each compile unit in your executable gets its own tree.
We will use these DIEs later on to ascertain key facts about a running process such as "what is the data type of the variable named x
, and where in memory/registers are its bytes?", which is critical for building a debugger.
Perhaps it's easiest to show what it looks like. Here's the relevant section of cloop's DIE tree using dwarfdump cloop
:
< 1><0x000002dd> DW_TAG_subprogram
DW_AT_external yes(1)
DW_AT_name main
DW_AT_decl_file 0x00000001 /home/jcalabro/local/data/cloop/main.c
DW_AT_decl_line 0x00000004
DW_AT_decl_column 0x00000005
DW_AT_type <0x00000058>
DW_AT_low_pc 0x00401156
DW_AT_high_pc <offset-from-lowpc> 86 <highpc: 0x004011ac>
DW_AT_frame_base len 0x0001: 0x9c:
DW_OP_call_frame_cfa
DW_AT_call_all_tail_calls yes(1)
DW_AT_sibling <0x0000031c>
< 2><0x000002ff> DW_TAG_variable
DW_AT_name pid
DW_AT_decl_file 0x00000001
DW_AT_decl_line 0x00000005
DW_AT_decl_column 0x0000000b
DW_AT_type <0x0000009d>
DW_AT_location len 0x0002: 0x9164:
DW_OP_fbreg -28
< 2><0x0000030d> DW_TAG_variable
DW_AT_name ndx
DW_AT_decl_file 0x00000001
DW_AT_decl_line 0x00000006
DW_AT_decl_column 0x00000018
DW_AT_type <0x0000031c>
DW_AT_location len 0x0002: 0x9168:
DW_OP_fbreg -24
There's not a ton to see in our cloop
program because it's so simple, but there is a function (aka subprogram
) named main
and two variables, ndx
and pid
, just as we'd expect. It tells us on what lines they appear, and what their data types are, and DW_AT_location
actually tells us how to find the value of a variable at runtime! There's a lot of good stuff in here.
You can see that all these entries have a single tag that indicates the type of entry (each prefixed with
DW_TAG_
), then some number of attributes, or metadata describing that particular entry (each prefixed with DW_AT_
). If you look at the full output, you can see a lot of DIEs for primitive C types and the standard library.
It's a bit tough to tell because we're only one level deep, but this data is stored in a tree. You can see it in the indentation of the tags and attributes if you look carefully, or you can use the numbers on the far left as reference of depth (< 1>
and < 2>
). This is just how dwarfdump
renders the DIE tree; the actual format on disk is not of this form.
All that being said, the very first thing we have to parse actually is not the DIE tree. No, we will start by reading the full set of abbreviation tables
. The abbrev tables give you a lot of information that's used in constructing the DIE tree. It tells you, for each potential entry in the tree, what is its tag (AST node type), and what attributes each node has. For each attribute, it tells you its data type via its form (is it a string, is it a constant integer value, and so on). The set of abbrev tables lives in the .debug_abbrev
ELF section.
Parsing the abbrev table is pretty straightforward: it's an array of declarations (decls), each with a code, a tag, and some number of attributes. To see it in action, you run dwarfdump --print-abbrev cloop
:
< 1><0x00000000> DW_TAG_member DW_children_no
<0x00000003> DW_AT_name DW_FORM_strp
<0x00000005> DW_AT_decl_file DW_FORM_implicit_const <4 (0x4)>
<0x00000008> DW_AT_decl_line DW_FORM_data1
<0x0000000a> DW_AT_decl_column DW_FORM_data1
<0x0000000c> DW_AT_type DW_FORM_ref4
<0x0000000e> DW_AT_data_member_location DW_FORM_data1
< 2><0x00000012> DW_TAG_base_type DW_children_no
<0x00000015> DW_AT_byte_size DW_FORM_data1
<0x00000017> DW_AT_encoding DW_FORM_data1
<0x00000019> DW_AT_name DW_FORM_strp
< 3><0x0000001d> DW_TAG_pointer_type DW_children_no
<0x00000020> DW_AT_byte_size DW_FORM_implicit_const <8 (0x8)>
<0x00000023> DW_AT_type DW_FORM_ref4
< 4><0x00000027> DW_TAG_typedef DW_children_no
<0x0000002a> DW_AT_name DW_FORM_strp
<0x0000002c> DW_AT_decl_file DW_FORM_data1
<0x0000002e> DW_AT_decl_line DW_FORM_data1
<0x00000030> DW_AT_decl_column DW_FORM_data1
<0x00000032> DW_AT_type DW_FORM_ref4
etc...
We can write a parser for this with the following methodology:
Check the DWARF documentation for what each of those enum values should be, and see the appendix for more information on LEB128. Translated in to code, that looks like:
type abbrevTable struct {
offset int
decls map[uint64]abbrevDecl
}
type abbrevDecl struct {
code uint64
hasChildren uint8
attrs []abbrevAttribute
}
type abbrevAttribute struct {
name uint64
form uint64
implicitConstVal int64
}
func parseAbbrevTables(abbrevSecion []byte) []abbrevTable {
buf := bytes.NewBuffer(abbrevSecion)
tables := []abbrevTable{}
reader := NewBinaryReader(buf, binary.NativeEndian)
for {
table := abbrevTable{
offset: BufferOffset(len(abbrevSecion), buf),
decls: map[uint64]abbrevDecl{},
}
if table.offset >= len(abbrevSecion) {
break // nothing left to read in the section
}
for {
decl := abbrevDecl{}
decl.code, _ = leb128.DecodeU64(reader)
if decl.code == 0 {
break // we've finished with this table
}
decl.tag, _ = leb128.DecodeU64(reader)
decl.hasChildren, _ = Read[uint8](reader)
for {
attr := abbrevAttribute{}
attr.name, _ = leb128.DecodeU64(reader)
attr.form, _ = leb128.DecodeU64(reader)
if attr.name == 0 && attr.form == 0 {
break
}
if attr.form == 0x21 { // DW_FORM_implicit_const
attr.implicitConstVal, _ = leb128.DecodeS64(reader)
}
decl.attrs = append(decl.attrs, attr)
}
table.decls[decl.code] = decl
}
tables = append(tables, table)
}
return tables
}
That's it with abbrev tables. We'll use these quite a lot as we parse the DIE tree. You can see that we already parsed something that resembles a tree, but we're not at the point where the data is meaningful yet. Note that cloop
only has one table, but there will probably be many in real-world programs.
Now we're ready to start tackling the .debug_info
section, which by is the section that contains most of the DIE information. We're going to parse one or more compilation units and their DIEs in a loop, starting with the compilation unit header. The header isn't itself a DIE, but it does give us some critical information in dealing with the remainder of the tree. So our outer loop looks like:
type CU struct {
header *CUHeader
dies []DIE
}
type CUHeader struct {
length uintptr
version uint16
unitType uint8 // added in v5
debugAbbrevOffset int
addrSize uint8
// not a real field in the DWARF standard,
// but helpful for bookkeeping
is32Bit bool
}
cus := []*CU{}
infoReader := NewBinaryReader(bytes.NewBuffer(sections.info))
for {
if infoReader.Offset() >= len(sections.info) {
break
}
cuHeader := parseCUHeader(infoReader)
// choose the correct abbrev table for this CU
abbrev := abbrevTables[0]
for _, a := range abbrevTables {
if a.offset == cuHeader.debugAbbrevOffset {
abbrev = a
break
}
}
cu := parseCU(infoReader, cuHeader, sections, abbrev)
cus = append(cus, cu)
}
Then, we can implement parseCUHeader
. Each compilation unit header contains these fields in order, but note that the order of debug_abbrev_offset
and addr_size
is reversed as of DWARF v5.
Additionally, see the appendix for more information on fields of type initial length
.
length, initial length
: the number of bytes taken by this CU header in the .debug_info
section (this does not include the number of bytes it takes to store the length itself, so either 4 or 12 bytes)version, uint16
: DWARF version numberunit_type, uint8
: type of compilation unit (this field was added in DWARF v5 and there is no such field in v4 or below, see DWARF v5, section 3.1 for more details, and for our purposes we only care about DW_UT_compile
)debug_abbrev_offset, uint32 if is_32_bit, else uint64
: how far to seek in to the abbrev section in order to find this CU's abbrev tableaddr_size, uint8
: how large an address is in this CUSo we can parse a CU header with something like:
func parseCUHeader(reader *BinaryReader) *CUHeader {
header := &CUHeader{}
header.length = readInitialLength(reader)
// we know it's a 32 bit binary if our initial length
// was 4 bytes, not 12
header.is32Bit = reader.Offset() == 4
header.version, _ = Read[uint16](reader)
if header.version >= 5 {
// this field was added in DWARF v5
header.unitType, _ = Read[uint8](reader)
// DWARF v5 changes the order of fields (switches abbrev_offset and addr_size)
header.addrSize, _ = Read[uint8](reader)
header.debugAbbrevOffset = parseAbbrevOffset(reader, header)
} else {
header.debugAbbrevOffset = parseAbbrevOffset(reader, header)
header.addrSize, _ = Read[uint8](reader)
}
return header
}
func parseAbbrevOffset(reader *BinaryReader, header *CUHeader) int {
if header.is32Bit {
offset, _ := Read[uint32](reader)
return int(offset)
}
offset, _ := Read[uint64](reader)
return int(offset)
}
You can check your work against readelf --debug-dump=info cloop
:
Contents of the .debug_info section:
Compilation Unit @ offset 0:
Length: 0x320 (32-bit)
Version: 5
Unit Type: DW_UT_compile (1)
Abbrev Offset: 0
Pointer Size: 8
etc...
Now, we're ready to begin parsing the DIE tree in this compilation unit! What we really want at the end of this section is actually just an array of DIEs, but we do need to interpret them as a tree in order to parse the .debug_info
section correctly.
First, we should pick an abbrev table by searching the list of abbrev table offsets for the one that matches with the debug_abbrev_offset
field from the CU header. This is the table we'll use for the rest of parsing this CU. This code is already handled in the "outer loop" example above.
Next, we'll begin reading through the .debug_info
section to determine the list of DIEs, their depth in the tree, and the form
of each of their attributes. DWARF forms are basically just data types, so this is just saying that we want to figure out how to interpret the bytes of each attribute (should we read a uint8
, a string
, etc.). Forms are an enum with the prefix DW_FORM_
.
Additionally, based on which form we're looking at, we should also be sure to store its associated attribute class for more information on how to interpret this data. We won't be using this much since our code is just a demo, but real-world, robust parsers will use this a lot. Read the spec to learn more.
To complete this, we should:
.debug_info
sectioncloop
on my machine in the code sample below. Refer to the documentation for the version(s) of DWARF you care about to see the full list.DW_CHILDREN_yes
was set on the decl, push an item on to the DIE treeEasy to say, but somewhat tedious to actually do. First, let's write that algorithm:
func parseCU(
reader *BinaryReader,
header *CUHeader,
sections *DWARFSections,
abbrev abbrevTable,
) *CU {
dies := []DIE{} // list of all DIEs
dieTree := []uint64{} // stack of abbrev codes
for {
dieOffset := reader.Offset()
abbrevCode, _ := leb128.DecodeU64(reader)
if abbrevCode == 0 {
if len(dieTree) == 1 {
break // we're done
}
// this DIE has no more children, pop the stack
dieTree = dieTree[:len(dieTree)-1]
continue
}
abbrevDecl := abbrev.decls[abbrevCode]
die := DIE{
offset: dieOffset,
depth: len(dieTree),
tag: abbrevDecl.tag,
}
for _, attr := range abbrevDecl.attrs {
form := chooseFormAndAdvanceBySize(reader, header, sections, attr)
die.forms = append(die.forms, form)
}
dies = append(dies, die)
if abbrevDecl.hasChildren == 1 {
dieTree = append(dieTree, abbrevCode)
}
}
Next, let's define a few types and "enums" (Go doesn't really have enums):
type DIE struct {
offset int
depth int
tag uint64
forms []Form
}
type Form struct {
data any
class Class
}
type DWARFForm int
const (
// we're not going to use all of these today, but a real parser should
DW_FORM_addr DWARFForm = 0x01
DW_FORM_block2 DWARFForm = 0x03
DW_FORM_block4 DWARFForm = 0x04
DW_FORM_data2 DWARFForm = 0x05
DW_FORM_data4 DWARFForm = 0x06
DW_FORM_data8 DWARFForm = 0x07
DW_FORM_string DWARFForm = 0x08
DW_FORM_block DWARFForm = 0x09
DW_FORM_block1 DWARFForm = 0x0a
DW_FORM_data1 DWARFForm = 0x0b
DW_FORM_flag DWARFForm = 0x0c
DW_FORM_sdata DWARFForm = 0x0d
DW_FORM_strp DWARFForm = 0x0e
DW_FORM_udata DWARFForm = 0x0f
DW_FORM_ref_addr DWARFForm = 0x10
DW_FORM_ref1 DWARFForm = 0x11
DW_FORM_ref2 DWARFForm = 0x12
DW_FORM_ref4 DWARFForm = 0x13
DW_FORM_ref8 DWARFForm = 0x14
DW_FORM_ref_udata DWARFForm = 0x15
DW_FORM_indirect DWARFForm = 0x16
DW_FORM_sec_offset DWARFForm = 0x17
DW_FORM_exprloc DWARFForm = 0x18
DW_FORM_flag_present DWARFForm = 0x19
DW_FORM_strx DWARFForm = 0x1a
DW_FORM_addrx DWARFForm = 0x1b
DW_FORM_ref_sup4 DWARFForm = 0x1c
DW_FORM_strp_sup DWARFForm = 0x1d
DW_FORM_data16 DWARFForm = 0x1e
DW_FORM_line_strp DWARFForm = 0x1f
DW_FORM_ref_sig8 DWARFForm = 0x20
DW_FORM_implicit_const DWARFForm = 0x21
DW_FORM_loclistx DWARFForm = 0x22
DW_FORM_rnglistx DWARFForm = 0x23
DW_FORM_ref_sup8 DWARFForm = 0x24
DW_FORM_strx1 DWARFForm = 0x25
DW_FORM_strx2 DWARFForm = 0x26
DW_FORM_strx3 DWARFForm = 0x27
DW_FORM_strx4 DWARFForm = 0x28
DW_FORM_addrx1 DWARFForm = 0x29
DW_FORM_addrx2 DWARFForm = 0x2a
DW_FORM_addrx3 DWARFForm = 0x2b
DW_FORM_addrx4 DWARFForm = 0x2c
// extensions, etc...
)
type Class int
const (
address Class = iota
addrptr
block
constant
exprloc
flag
lineptr
loclist
loclistptr
macptr
rnglist
rnglistptr
reference
str
stroffsetpt
)
Then, we can write our chooseFormAndAdvanceBySize
function and a couple helpers:
func chooseFormAndAdvanceBySize(
reader *BinaryReader,
header *CUHeader,
sections *DWARFSections,
attr abbrevAttribute,
) Form {
form := Form{}
dwarfForm := DWARFForm(attr.form)
switch dwarfForm {
// read N bytes of data as a constant value
case DW_FORM_data1:
form.data, _ = Read[uint8](reader)
form.class = constant
case DW_FORM_data2:
form.data, _ = Read[uint16](reader)
form.class = constant
case DW_FORM_data4:
form.data, _ = Read[uint32](reader)
form.class = constant
case DW_FORM_data8:
form.data, _ = Read[uint64](reader)
form.class = constant
case DW_FORM_sdata:
form.data, _ = leb128.DecodeS64(reader)
form.class = constant
case DW_FORM_udata:
form.data, _ = leb128.DecodeU64(reader)
form.class = constant
// read an address
case DW_FORM_addr:
form.data, _ = Read[uintptr](reader)
form.class = address
// read a reference of N bytes
case DW_FORM_ref_addr:
form.data = readOffset(header, reader)
form.class = reference
case DW_FORM_ref1:
form.data, _ = Read[uint8](reader)
form.class = reference
case DW_FORM_ref2:
form.data, _ = Read[uint16](reader)
form.class = reference
case DW_FORM_ref4:
form.data, _ = Read[uint32](reader)
form.class = reference
case DW_FORM_ref8:
form.data, _ = Read[uint64](reader)
form.class = reference
// flags
case DW_FORM_flag:
form.data, _ = Read[uint8](reader)
form.class = flag
case DW_FORM_flag_present:
form.data = []byte{1} // just indicates true
form.class = flag
// strings
case DW_FORM_string:
// read a string from the .debug_info section
form.data = readNullTerminatedString(reader)
form.class = str
case DW_FORM_strp:
// read a string from the .debug_str section
offset := readOffset(header, reader)
strSection := sections.str[offset:]
strReader := NewBinaryReader(bytes.NewBuffer(strSection))
form.data = readNullTerminatedString(strReader)
form.class = str
case DW_FORM_line_strp: // first introduced in DWARF v5
// read a string from the .debug_line_str section
offset := readOffset(header, reader)
strSection := sections.line_str[offset:]
strReader := NewBinaryReader(bytes.NewBuffer(strSection))
form.data = readNullTerminatedString(strReader)
form.class = str
// read a DWARF expression as an N byte buffer
// (much more on these in a later post!)
case DW_FORM_exprloc:
length, _ := leb128.DecodeU64(reader)
buf := make([]byte, length)
reader.Read(buf)
form.data = buf
form.class = exprloc
// offset in to one of many sections based on the attribute
case DW_FORM_sec_offset:
form.data = readOffset(header, reader)
// there are many more of these, and this is a real pain
// when writing a real parser, so be sure to RTFM
switch attr.form {
case 0x10: // DW_AT_stmt_list
form.class = lineptr
}
// this is a hard-coded constant value that comes from
// the attribute itself in the .debug_abbrev section
case DW_FORM_implicit_const:
form.data = attr.implicitConstVal
form.class = constant
}
return form
}
func readOffset(header *CUHeader, reader *BinaryReader) uint64 {
if header.is32Bit {
val, _ := Read[uint32](reader)
return uint64(val)
}
val, _ := Read[uint64](reader)
return val
}
func readNullTerminatedString(reader *BinaryReader) string {
buf := []byte{}
ch, _ := Read[uint8](reader)
for ch != 0 {
buf = append(buf, ch)
ch, _ = Read[uint8](reader)
}
return string(buf)
}
Phew! That was a lot. However, now we have a full DIE tree! Be sure to check your work against dwarfdump cloop
because a lot of these are subtle and it's easy to make mistakes, plus a single mistake tends to lead to cascading failures (i.e. if you read one byte instead of two, all subsequent reads are now off by one).
Today, we parsed the .debug_abbrev
and .debug_info
sections to ultimately construct a tree of debug information entries.
As was mentioned throughout the article, there's a lot more nuance writing a .debug_info
parser that's able to handle any binary you'd encounter in the wild. Refer to the DWARF documentation for more info, but this hopefully was a helpful jumping off point for understanding the structure of the section.
Stay tuned for the next part where we'll learn how to read line number information so we can map addresses in the program text back to their source location!