Jim Calabro

About
RSS | Atom | JSON
---
LinkedIn
Chess
Mastodon
Last.fm

How DWARF Works: Debug Information Entries

Sep 28, 2024
This is part of the series on DWARF.

Debug Information Entries

So far, we've parsed enough of the contents of an ELF executable to get the various debug info sections as raw byte arrays. Now, we'll use all these sections together to make progress towards properly inspecting a running instance of the target program.

This post's goal is to construct the Debug Information Entry (DIE) tree. This is a long task with several side quests along the way.

A DIE is a piece of information about a particular node of a compiled program, and they are stored in a tree. When compiler authors write parsers/lexers, the first step in the pipeline is generally attempting to turn the source text of a program in to an Abstract Syntax Tree, and you can sort of think of the DIE tree as a similar thing for DWARF debuggers. All your compile units, functions, variables, struct definitions, etc. each get an entry in the DIE tree. A DIE tree is really more like a forest in that each compile unit in your executable gets its own tree.

We will use these DIEs later on to ascertain key facts about a running process such as "what is the data type of the variable named x, and where in memory/registers are its bytes?", which is critical for building a debugger.

Perhaps it's easiest to show what it looks like. Here's the relevant section of cloop's DIE tree using dwarfdump cloop:

< 1><0x000002dd>    DW_TAG_subprogram
                      DW_AT_external              yes(1)
                      DW_AT_name                  main
                      DW_AT_decl_file             0x00000001 /home/jcalabro/local/data/cloop/main.c
                      DW_AT_decl_line             0x00000004
                      DW_AT_decl_column           0x00000005
                      DW_AT_type                  <0x00000058>
                      DW_AT_low_pc                0x00401156
                      DW_AT_high_pc               <offset-from-lowpc> 86 <highpc: 0x004011ac>
                      DW_AT_frame_base            len 0x0001: 0x9c:
                          DW_OP_call_frame_cfa
                      DW_AT_call_all_tail_calls   yes(1)
                      DW_AT_sibling               <0x0000031c>
< 2><0x000002ff>      DW_TAG_variable
                        DW_AT_name                  pid
                        DW_AT_decl_file             0x00000001
                        DW_AT_decl_line             0x00000005
                        DW_AT_decl_column           0x0000000b
                        DW_AT_type                  <0x0000009d>
                        DW_AT_location              len 0x0002: 0x9164:
                            DW_OP_fbreg -28
< 2><0x0000030d>      DW_TAG_variable
                        DW_AT_name                  ndx
                        DW_AT_decl_file             0x00000001
                        DW_AT_decl_line             0x00000006
                        DW_AT_decl_column           0x00000018
                        DW_AT_type                  <0x0000031c>
                        DW_AT_location              len 0x0002: 0x9168:
                            DW_OP_fbreg -24

There's not a ton to see in our cloop program because it's so simple, but there is a function (aka subprogram) named main and two variables, ndx and pid, just as we'd expect. It tells us on what lines they appear, and what their data types are, and DW_AT_location actually tells us how to find the value of a variable at runtime! There's a lot of good stuff in here.

You can see that all these entries have a single tag that indicates the type of entry (each prefixed with DW_TAG_), then some number of attributes, or metadata describing that particular entry (each prefixed with DW_AT_). If you look at the full output, you can see a lot of DIEs for primitive C types and the standard library.

It's a bit tough to tell because we're only one function call deep, but this data is stored in a tree. You can see it in the indentation of the tags and attributes if you look carefully, or you can use the numbers on the far left as reference of depth (< 1> and < 2>). This is just how dwarfdump renders the DIE tree; the actual format on disk is not of this form.

Abbrev Tables

All that being said, the very first thing we have to parse actually is not the DIE tree. No, we will start by reading the full set of abbreviation tables. The abbrev tables give you a lot of information that's used in constructing the DIE tree. It tells you, for each potential entry in the tree, what is its tag (AST node type), and what attributes each node has. For each attribute, it tells you its data type via its form (is it a string, is it a constant integer value, and so on). The set of abbrev tables lives in the .debug_abbrev ELF section.

Parsing the abbrev table is pretty straightforward: it's an array of declarations (decls), each with a code, a tag, and some number of attributes. To see it in action, you run dwarfdump --print-abbrev cloop:

<    1><0x00000000> DW_TAG_member               DW_children_no
       <0x00000003>              DW_AT_name                  DW_FORM_strp
       <0x00000005>              DW_AT_decl_file             DW_FORM_implicit_const <4 (0x4)>
       <0x00000008>              DW_AT_decl_line             DW_FORM_data1
       <0x0000000a>              DW_AT_decl_column           DW_FORM_data1
       <0x0000000c>              DW_AT_type                  DW_FORM_ref4
       <0x0000000e>              DW_AT_data_member_location  DW_FORM_data1
<    2><0x00000012> DW_TAG_base_type            DW_children_no
       <0x00000015>              DW_AT_byte_size             DW_FORM_data1
       <0x00000017>              DW_AT_encoding              DW_FORM_data1
       <0x00000019>              DW_AT_name                  DW_FORM_strp
<    3><0x0000001d> DW_TAG_pointer_type         DW_children_no
       <0x00000020>              DW_AT_byte_size             DW_FORM_implicit_const <8 (0x8)>
       <0x00000023>              DW_AT_type                  DW_FORM_ref4
<    4><0x00000027> DW_TAG_typedef              DW_children_no
       <0x0000002a>              DW_AT_name                  DW_FORM_strp
       <0x0000002c>              DW_AT_decl_file             DW_FORM_data1
       <0x0000002e>              DW_AT_decl_line             DW_FORM_data1
       <0x00000030>              DW_AT_decl_column           DW_FORM_data1
       <0x00000032>              DW_AT_type                  DW_FORM_ref4

       etc...

We can write a parser for this with the following methodology:

  • Create a new abbrev table, storing our current offset (the number of bytes we've read so far through the section)
  • If we've read to the end of the section, we're done
  • If we're not done, read N decls for this table via:
    • Create a new decl and read its code
    • If the code is zero, we're done with this table (all tables end with a "null entry" whose code is zero)
    • Read the decl's tag enum value
    • Read a bool that indicates whether or not this node in the tree has children
    • Read all attributes for this decl
      • Read the attribute name enum value
      • Read the attribute form enum value
      • If the name and the form are both zero, we've read all of this decl's attributes
      • If the form is DW_FORM_implicit_const, read and store the implicit const value (we'll see how this is used later in this post)

Check the DWARF documentation for what each of those enum values should be, and see the appendix for more information on LEB128. Translated in to code, that looks like:

type abbrevTable struct {
    offset int
    decls  map[uint64]abbrevDecl
}

type abbrevDecl struct {
    code        uint64
    hasChildren uint8
    attrs       []abbrevAttribute
}

type abbrevAttribute struct {
    name             uint64
    form             uint64
    implicitConstVal int64
}

func parseAbbrevTables(abbrevSecion []byte) []abbrevTable {
    buf := bytes.NewBuffer(abbrevSecion)
    tables := []abbrevTable{}
    reader := NewBinaryReader(buf, binary.NativeEndian)

    for {
        table := abbrevTable{
            offset: BufferOffset(len(abbrevSecion), buf),
            decls:  map[uint64]abbrevDecl{},
        }
        if table.offset >= len(abbrevSecion) {
            break // nothing left to read in the section
        }

        for {
            decl := abbrevDecl{}
            decl.code, _ = leb128.DecodeU64(reader)
            if decl.code == 0 {
                break // we've finished with this table
            }

            decl.tag, _ = leb128.DecodeU64(reader)
            decl.hasChildren, _ = Read[uint8](reader)

            for {
                attr := abbrevAttribute{}
                attr.name, _ = leb128.DecodeU64(reader)
                attr.form, _ = leb128.DecodeU64(reader)
                if attr.name == 0 && attr.form == 0 {
                    break
                }

                if attr.form == 0x21 { // DW_FORM_implicit_const
                    attr.implicitConstVal, _ = leb128.DecodeS64(reader)
                }

                decl.attrs = append(decl.attrs, attr)
            }

            table.decls[decl.code] = decl
        }

        tables = append(tables, table)
    }

    return tables
}

That's it with abbrev tables. We'll use these quite a lot as we parse the DIE tree. You can see that we already parsed something that resembles a tree, but we're not at the point where the data is meaningful yet. Note that cloop only has one table, but there will probably be many in real-world programs.

Compilation Unit Headers

Now we're ready to start tackling the .debug_info section, which by is the section that contains most of the DIE information. We're going to parse one or more compilation units and their DIEs in a loop, starting with the compilation unit header. The header isn't itself a DIE, but it does give us some critical information in dealing with the remainder of the tree. So our outer loop looks like:

type CU struct {
    header *CUHeader
    dies   []DIE
}

type CUHeader struct {
    length            uintptr
    version           uint16
    unitType          uint8 // added in v5
    debugAbbrevOffset int
    addrSize          uint8

    // not a real field in the DWARF standard,
    // but helpful for bookkeeping
    is32Bit bool
}

cus := []*CU{}
infoReader := NewBinaryReader(bytes.NewBuffer(sections.info))
for {
    if infoReader.Offset() >= len(sections.info) {
        break
    }

    cuHeader := parseCUHeader(infoReader)

    // choose the correct abbrev table for this CU
    abbrev := abbrevTables[0]
    for _, a := range abbrevTables {
        if a.offset == cuHeader.debugAbbrevOffset {
            abbrev = a
            break
        }
    }

    cu := parseCU(infoReader, cuHeader, sections, abbrev)
    cus = append(cus, cu)
}

Then, we can implement parseCUHeader. Each compilation unit header contains these fields in order, but note that the order of debug_abbrev_offset and addr_size is reversed as of DWARF v5.

Additionally, see the appendix for more information on fields of type initial length.

  • length, initial length: the number of bytes taken by this CU header in the .debug_info section (this does not include the number of bytes it takes to store the length itself, so either 4 or 12 bytes)
  • version, uint16: DWARF version number
  • unit_type, uint8: type of compilation unit (this field was added in DWARF v5 and there is no such field in v4 or below, see DWARF v5, section 3.1 for more details, and for our purposes we only care about DW_UT_compile)
  • debug_abbrev_offset, uint32 if is_32_bit, else uint64: how far to seek in to the abbrev section in order to find this CU's abbrev table
  • addr_size, uint8: how large an address is in this CU

So we can parse a CU header with something like:

func parseCUHeader(reader *BinaryReader) *CUHeader {
    header := &CUHeader{}

    header.length = readInitialLength(reader)

    // we know it's a 32 bit binary if our initial length
    // was 4 bytes, not 12
    header.is32Bit = reader.Offset() == 4

    header.version, _ = Read[uint16](reader)

    if header.version >= 5 {
        // this field was added in DWARF v5
        header.unitType, _ = Read[uint8](reader)

        // DWARF v5 changes the order of fields (switches abbrev_offset and addr_size)
        header.addrSize, _ = Read[uint8](reader)
        header.debugAbbrevOffset = parseAbbrevOffset(reader, header)
    } else {
        header.debugAbbrevOffset = parseAbbrevOffset(reader, header)
        header.addrSize, _ = Read[uint8](reader)
    }

    return header
}

func parseAbbrevOffset(reader *BinaryReader, header *CUHeader) int {
    if header.is32Bit {
        offset, _ := Read[uint32](reader)
        return int(offset)
    }

    offset, _ := Read[uint64](reader)
    return int(offset)
}

You can check your work against readelf --debug-dump=info cloop:

Contents of the .debug_info section:

  Compilation Unit @ offset 0:
   Length:        0x320 (32-bit)
   Version:       5
   Unit Type:     DW_UT_compile (1)
   Abbrev Offset: 0
   Pointer Size:  8

etc...

Compilation Units

Now, we're ready to begin parsing the DIE tree in this compilation unit! What we really want at the end of this section is actually just an array of DIEs, but we do need to interpret them as a tree in order to parse the .debug_info section correctly.

First, we should pick an abbrev table by searching the list of abbrev table offsets for the one that matches with the debug_abbrev_offset field from the CU header. This is the table we'll use for the rest of parsing this CU. This code is already handled in the "outer loop" example above.

Next, we'll begin reading through the .debug_info section to determine the list of DIEs, their depth in the tree, and the form of each of their attributes. DWARF forms are basically just data types, so this is just saying that we want to figure out how to interpret the bytes of each attribute (should we read a uint8, a string, etc.). Forms are an enum with the prefix DW_FORM_.

Additionally, based on which form we're looking at, we should also be sure to store its associated attribute class for more information on how to interpret this data. We won't be using this much since our code is just a demo, but real-world, robust parsers will use this a lot. Read the spec to learn more.

To complete this, we should:

  • Read a ULEB128, that's the code of an item in the abbrev table that we will look up
  • If the code is zero, we are either done reading DIEs for the entire compilation unit, or we're done reading child DIEs for a node in the tree
    • If you are at the root level of the DIE tree, you're done parsing this compile unit
    • Else, pop up one level in the tree and continue
  • Use the code to look up the abbrev decl we care about in the abbrev table
  • For each attribute in the decl we're going to want to choose which type of form to use to interpret the bytes of this attribute, and read that many bytes from the .debug_info section
    • Note that list of form types is very long, so I'll only implement enough to parse cloop on my machine in the code sample below. Refer to the documentation for the version(s) of DWARF you care about to see the full list.
  • If DW_CHILDREN_yes was set on the decl, push an item on to the DIE tree

Easy to say, but somewhat tedious to actually do. First, let's write that algorithm:

func parseCU(
    reader *BinaryReader,
    header *CUHeader,
    sections *DWARFSections,
    abbrev abbrevTable,
) *CU {
    dies := []DIE{}       // list of all DIEs
    dieTree := []uint64{} // stack of abbrev codes

    for {
        dieOffset := reader.Offset()

        abbrevCode, _ := leb128.DecodeU64(reader)
        if abbrevCode == 0 {
            if len(dieTree) == 1 {
                break // we're done
            }

            // this DIE has no more children, pop the stack
            dieTree = dieTree[:len(dieTree)-1]
            continue
        }

        abbrevDecl := abbrev.decls[abbrevCode]
        die := DIE{
            offset: dieOffset,
            depth:  len(dieTree),
            tag:    abbrevDecl.tag,
        }

        for _, attr := range abbrevDecl.attrs {
            form := chooseFormAndAdvanceBySize(reader, header, sections, attr)
            die.forms = append(die.forms, form)
        }

        dies = append(dies, die)
        if abbrevDecl.hasChildren == 1 {
            dieTree = append(dieTree, abbrevCode)
        }
    }

Next, let's define a few types and "enums" (Go doesn't really have enums):

type DIE struct {
    offset int
    depth  int
    tag    uint64
    forms  []Form
}

type Form struct {
    data  any
    class Class
}

type DWARFForm int

const (
    // we're not going to use all of these today, but a real parser should
    DW_FORM_addr           DWARFForm = 0x01
    DW_FORM_block2         DWARFForm = 0x03
    DW_FORM_block4         DWARFForm = 0x04
    DW_FORM_data2          DWARFForm = 0x05
    DW_FORM_data4          DWARFForm = 0x06
    DW_FORM_data8          DWARFForm = 0x07
    DW_FORM_string         DWARFForm = 0x08
    DW_FORM_block          DWARFForm = 0x09
    DW_FORM_block1         DWARFForm = 0x0a
    DW_FORM_data1          DWARFForm = 0x0b
    DW_FORM_flag           DWARFForm = 0x0c
    DW_FORM_sdata          DWARFForm = 0x0d
    DW_FORM_strp           DWARFForm = 0x0e
    DW_FORM_udata          DWARFForm = 0x0f
    DW_FORM_ref_addr       DWARFForm = 0x10
    DW_FORM_ref1           DWARFForm = 0x11
    DW_FORM_ref2           DWARFForm = 0x12
    DW_FORM_ref4           DWARFForm = 0x13
    DW_FORM_ref8           DWARFForm = 0x14
    DW_FORM_ref_udata      DWARFForm = 0x15
    DW_FORM_indirect       DWARFForm = 0x16
    DW_FORM_sec_offset     DWARFForm = 0x17
    DW_FORM_exprloc        DWARFForm = 0x18
    DW_FORM_flag_present   DWARFForm = 0x19
    DW_FORM_strx           DWARFForm = 0x1a
    DW_FORM_addrx          DWARFForm = 0x1b
    DW_FORM_ref_sup4       DWARFForm = 0x1c
    DW_FORM_strp_sup       DWARFForm = 0x1d
    DW_FORM_data16         DWARFForm = 0x1e
    DW_FORM_line_strp      DWARFForm = 0x1f
    DW_FORM_ref_sig8       DWARFForm = 0x20
    DW_FORM_implicit_const DWARFForm = 0x21
    DW_FORM_loclistx       DWARFForm = 0x22
    DW_FORM_rnglistx       DWARFForm = 0x23
    DW_FORM_ref_sup8       DWARFForm = 0x24
    DW_FORM_strx1          DWARFForm = 0x25
    DW_FORM_strx2          DWARFForm = 0x26
    DW_FORM_strx3          DWARFForm = 0x27
    DW_FORM_strx4          DWARFForm = 0x28
    DW_FORM_addrx1         DWARFForm = 0x29
    DW_FORM_addrx2         DWARFForm = 0x2a
    DW_FORM_addrx3         DWARFForm = 0x2b
    DW_FORM_addrx4         DWARFForm = 0x2c

    // extensions, etc...
)

type Class int

const (
    address Class = iota
    addrptr
    block
    constant
    exprloc
    flag
    lineptr
    loclist
    loclistptr
    macptr
    rnglist
    rnglistptr
    reference
    str
    stroffsetpt
)

Then, we can write our chooseFormAndAdvanceBySize function and a couple helpers:

func chooseFormAndAdvanceBySize(
    reader *BinaryReader,
    header *CUHeader,
    sections *DWARFSections,
    attr abbrevAttribute,
) Form {
    form := Form{}

    dwarfForm := DWARFForm(attr.form)
    switch dwarfForm {

    // read N bytes of data as a constant value
    case DW_FORM_data1:
        form.data, _ = Read[uint8](reader)
        form.class = constant
    case DW_FORM_data2:
        form.data, _ = Read[uint16](reader)
        form.class = constant
    case DW_FORM_data4:
        form.data, _ = Read[uint32](reader)
        form.class = constant
    case DW_FORM_data8:
        form.data, _ = Read[uint64](reader)
        form.class = constant
    case DW_FORM_sdata:
        form.data, _ = leb128.DecodeS64(reader)
        form.class = constant
    case DW_FORM_udata:
        form.data, _ = leb128.DecodeU64(reader)
        form.class = constant

    // read an address
    case DW_FORM_addr:
        form.data, _ = Read[uintptr](reader)
        form.class = address

    // read a reference of N bytes
    case DW_FORM_ref_addr:
        form.data = readOffset(header, reader)
        form.class = reference
    case DW_FORM_ref1:
        form.data, _ = Read[uint8](reader)
        form.class = reference
    case DW_FORM_ref2:
        form.data, _ = Read[uint16](reader)
        form.class = reference
    case DW_FORM_ref4:
        form.data, _ = Read[uint32](reader)
        form.class = reference
    case DW_FORM_ref8:
        form.data, _ = Read[uint64](reader)
        form.class = reference

    // flags
    case DW_FORM_flag:
        form.data, _ = Read[uint8](reader)
        form.class = flag
    case DW_FORM_flag_present:
        form.data = []byte{1} // just indicates true
        form.class = flag

    // strings
    case DW_FORM_string:
        // read a string from the .debug_info section
        form.data = readNullTerminatedString(reader)
        form.class = str
    case DW_FORM_strp:
        // read a string from the .debug_str section
        offset := readOffset(header, reader)
        strSection := sections.str[offset:]
        strReader := NewBinaryReader(bytes.NewBuffer(strSection))
        form.data = readNullTerminatedString(strReader)
        form.class = str
    case DW_FORM_line_strp: // first introduced in DWARF v5
        // read a string from the .debug_line_str section
        offset := readOffset(header, reader)
        strSection := sections.line_str[offset:]
        strReader := NewBinaryReader(bytes.NewBuffer(strSection))
        form.data = readNullTerminatedString(strReader)
        form.class = str

    // read a DWARF expression as an N byte buffer
    // (much more on these in a later post!)
    case DW_FORM_exprloc:
        length, _ := leb128.DecodeU64(reader)
        buf := make([]byte, length)
        reader.Read(buf)
        form.data = buf
        form.class = exprloc

    // offset in to one of many sections based on the attribute
    case DW_FORM_sec_offset:
        form.data = readOffset(header, reader)

        // there are many more of these, and this is a real pain
        // when writing a real parser, so be sure to RTFM
        switch attr.form {
        case 0x10: // DW_AT_stmt_list
            form.class = lineptr
        }

    // this is a hard-coded constant value that comes from
    // the attribute itself in the .debug_abbrev section
    case DW_FORM_implicit_const:
        form.data = attr.implicitConstVal
        form.class = constant
    }

    return form
}

func readOffset(header *CUHeader, reader *BinaryReader) uint64 {
    if header.is32Bit {
        val, _ := Read[uint32](reader)
        return uint64(val)
    }

    val, _ := Read[uint64](reader)
    return val
}

func readNullTerminatedString(reader *BinaryReader) string {
    buf := []byte{}
    ch, _ := Read[uint8](reader)
    for ch != 0 {
        buf = append(buf, ch)
        ch, _ = Read[uint8](reader)
    }
    return string(buf)
}

Phew! That was a lot. However, now we have a full DIE tree! Be sure to check your work against dwarfdump cloop because a lot of these are subtle and it's easy to make mistakes, plus a single mistake tends to lead to cascading failures (i.e. if you read one byte instead of two, all subsequent reads are now off by one).

Summary

Today, we parsed the .debug_abbrev and .debug_info sections to ultimately construct a tree of debug information entries.

As was mentioned throughout the article, there's a lot more nuance writing a .debug_info parser that's able to handle any binary you'd encounter in the wild. Refer to the DWARF documentation for more info, but this hopefully was a helpful jumping off point for understanding the structure of the section.

Stay tuned for the next part where we'll learn how to read line number information so we can map addresses in the program text back to their source location!

Thank you for reading the series on DWARF. Please don't hesitate to reach out with comments, questions, or errata to jim at this domain dot com.