Jim Calabro

How DWARF Works: Parsing Just Enough ELF

Sep 25, 2024
This is part of the series on DWARF.

What are ELF and DWARF?

Executable and Linkable Format (ELF) is a file format for executables, object files, shared libraries, and more that's used on various Unix-like systems. If you've ever downloaded and run a program on Linux, you've used an ELF executable. It's akin to an .exe file on Windows.

DWARF is a debugging information format that is used with ELF files. Debug information allows you to do neat things with a running program such as:

  • Map the compiled machine code stored inside back to the original source code
  • Figure out where variables are stored throughout the lifetime of a program as it executes
  • Unwind the callstack to generate a backtrace to find out where your program is stopped and where it came from
  • Much more!

In this series, we'll go into a lot of detail on topics such as these. Let's take it from the top: parsing ELF files, which contain DWARF debug information.

Our Test Program

Let's parse some files! In order to do so, we'll need a program to play around with. Throughout the rest of this series, I'm going to use this dead-simple C program called cloop that gets its own process ID, then loops forever and prints it once per second. C is a good choice because it is simple, has no runtime, and it's well-supported by all the tools we'll be using. Here's the full program:

#include <unistd.h>
#include <stdio.h>

int main() {
    pid_t pid = getpid();
    unsigned long long ndx = 0;
    while (1) {
        printf("c looping (pid %d): %llu\n", pid, ndx);
        fflush(stdout);
        ndx++;
        sleep(1);
    }

    return 0;
}

I'll be compiling this with gcc 14.2.1 on Manjaro Linux with kernel version 6.9.12 using this build.sh script, but feel free to play around with CC and DWARF as we go (it defaults to DWARF version 5):

#!/usr/bin/env bash

${CC:-gcc} -Wall -Wextra -Werror -no-pie -O0 -g -gdwarf-${DWARF:-5} -o cloop main.c

Additionally, for this series, I'll give some short code examples of Go code to help illustrate various concepts. I chose Go because it's popular, terse, simple to read, and has a large standard library to help us out. I'll intentionally omit error handling and not worry about writing efficient code to keep the examples short. I'm using Go version 1.22.7.

Parsing The ELF File Header

There's a lot of data contained within ELF files, but for our needs it's pretty straightforward, and we can ignore most of it. We just want to grab the raw binary of each debug info section as well as a couple facts about the executable.

Each binary file starts with the ELF header, followed by various "sections", each of which is just a region of the file that has a distinct job. For instance, the program text (machine code) of your executable or object is in the .text section.

We first want to open and read the contents of the binary file. It starts with 16 bytes of "ELF Identifier" header data. The first four of those bytes are the magic number 0x7f followed by 0x45, 0x4c, 0x46, or ELF in ASCII. So using our BinaryReader, we'd do:

fileBuf, _ := os.ReadFile(filePath)
reader := NewBinaryReader(bytes.NewBuffer(fileBuf), binary.NativeEndian)

magic := []byte{0x7f, 'E', 'L', 'F'}

magicBuf := make([]byte, len(magic))
reader.Read(magicBuf)

if !slices.Equal(magic, magicBuf) {
    panic("incorrect ELF magic number")
}

Next comes the rest of the e_ident header section, which contains several one-byte fields, each prefixed with EI_, then some padding, which you should skip over. They are, in order:

  • EI_CLASS: address size (1 for a 32-bit binary, 2 for a 64-bit binary)
  • EI_DATA: byte order (1 for 2's complement little-endian, 2 for 2's complement big-endian)
  • EI_VERSION: file format version (should always be 1 as of time of writing)
  • EI_OSABI: operating system and ABI (see Go's implementation for a list of values)
  • EI_ABIVERSION: often ignored on Linux
  • padding: 7 bytes (gives us 16 total bytes in the e_ident section, including the 4-byte magic number)
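To make the byte layout concrete, here is a minimal sketch that decodes e_ident straight from the first 16 bytes of a file using only the standard library. The Ident type and the parseIdent helper are hypothetical names invented for this illustration; they are not part of the BinaryReader used elsewhere in this post.

```go
package main

import (
    "errors"
    "fmt"
)

// Ident holds the decoded one-byte fields of the 16-byte e_ident array.
type Ident struct {
    Class      byte // EI_CLASS: 1 = 32-bit, 2 = 64-bit
    Data       byte // EI_DATA: 1 = little-endian, 2 = big-endian
    Version    byte // EI_VERSION
    OSABI      byte // EI_OSABI
    ABIVersion byte // EI_ABIVERSION
}

// parseIdent (hypothetical helper) validates the magic number and picks
// the one-byte fields out of the first 16 bytes of the file.
func parseIdent(buf []byte) (Ident, error) {
    if len(buf) < 16 {
        return Ident{}, errors.New("file too short for e_ident")
    }
    if buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F' {
        return Ident{}, errors.New("incorrect ELF magic number")
    }
    return Ident{
        Class:      buf[4],
        Data:       buf[5],
        Version:    buf[6],
        OSABI:      buf[7],
        ABIVersion: buf[8],
        // buf[9:16] is the 7 bytes of padding, which we skip
    }, nil
}

func main() {
    // Hand-written e_ident for a 64-bit little-endian binary.
    buf := []byte{0x7f, 'E', 'L', 'F', 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    ident, err := parseIdent(buf)
    fmt.Println(ident, err)
}
```

Because every field is a single byte, no byte-order handling is needed yet; EI_DATA is what tells you how to decode the multi-byte fields that follow.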

Next up is the rest of the ELF file header, again in order. Refer to the documentation or a robust implementation such as Go or Zig for more information on each field and its values.

  • e_type, uint16: file type
  • e_machine, uint16: machine type
  • e_version, uint32: file format version
  • e_entry, uintptr: virtual address at which the start of the program resides
  • e_phoff, uintptr: byte offset from the start of the file at which the program header table is located
  • e_shoff, uintptr: byte offset from the start of the file at which the section header table is located
  • e_flags, uint32: processor-specific flags
  • e_ehsize, uint16: the number of bytes in this ELF header
  • e_phentsize, uint16: the number of bytes in one entry in the program header table (all entries are the same size)
  • e_phnum, uint16: the number of entries in the program header table
  • e_shentsize, uint16: the number of bytes in one entry in the section header table (all entries are the same size)
  • e_shnum, uint16: the number of entries in the section header table
  • e_shstrndx, uint16: the section header table index of the entry associated with the section name string table

The header gives us a few facts about the binary, then the offset, entry size, and entry count of the program header table (the fields that start with e_ph) and of the section header table (the fields that start with e_sh). We'll use the latter to look up the section header table, read each entry in the table, and use those entries to find the debug sections we care about.
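As one possible way to read these fields, here is a sketch that decodes the whole 64-bit header in a single call with encoding/binary. It is an alternative to reading field by field: Header64 and readHeader64 are hypothetical names, the struct uses fixed uint64 fields rather than uintptr, and the sketch assumes a 64-bit binary (it rejects anything else).

```go
package main

import (
    "bytes"
    "encoding/binary"
    "errors"
    "fmt"
)

// Header64 mirrors the 64-bit ELF file header. Every field is fixed-width
// and exported so that binary.Read can fill the struct in one call.
type Header64 struct {
    Ident     [16]byte // magic number, EI_* bytes, and padding
    Type      uint16   // e_type
    Machine   uint16   // e_machine
    Version   uint32   // e_version
    Entry     uint64   // e_entry
    Phoff     uint64   // e_phoff
    Shoff     uint64   // e_shoff
    Flags     uint32   // e_flags
    Ehsize    uint16   // e_ehsize
    Phentsize uint16   // e_phentsize
    Phnum     uint16   // e_phnum
    Shentsize uint16   // e_shentsize
    Shnum     uint16   // e_shnum
    Shstrndx  uint16   // e_shstrndx
}

// readHeader64 (hypothetical helper) picks the byte order from EI_DATA
// and decodes the first 64 bytes of the file.
func readHeader64(fileBuf []byte) (Header64, error) {
    var hdr Header64
    if len(fileBuf) < 16 {
        return hdr, errors.New("file too short for e_ident")
    }
    if fileBuf[4] != 2 { // EI_CLASS must say 64-bit for this layout
        return hdr, errors.New("not a 64-bit ELF file")
    }
    var order binary.ByteOrder = binary.LittleEndian
    if fileBuf[5] == 2 { // EI_DATA == 2 means big-endian
        order = binary.BigEndian
    }
    err := binary.Read(bytes.NewReader(fileBuf), order, &hdr)
    return hdr, err
}

func main() {
    // Synthetic header: only the fields we print are filled in.
    buf := make([]byte, 64)
    copy(buf, []byte{0x7f, 'E', 'L', 'F', 2, 1, 1})
    binary.LittleEndian.PutUint64(buf[40:], 0x1234) // e_shoff
    binary.LittleEndian.PutUint16(buf[58:], 64)     // e_shentsize
    binary.LittleEndian.PutUint16(buf[60:], 36)     // e_shnum
    hdr, err := readHeader64(buf)
    fmt.Println(hdr.Shoff, hdr.Shentsize, hdr.Shnum, err)
}
```

The trade-off is that a 32-bit binary would need a second struct with uint32 address fields, whereas the field-by-field approach can switch widths at run time.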

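Note that in Go, uintptr is the built-in unsigned integer type of your machine's address size, meaning 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. Using it here assumes the parser runs on a machine with the same address size as the binary being parsed.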

Also, in digging through the docs, you may have noticed some values such as LOPROC = 0xff00; HIPROC = 0xffff;. Both ELF and DWARF commonly reserve large ranges of high values for each processor, programming language, OS, etc. to define their own custom values for various enums. We won't be using these too much, but be aware that GNU, Go, Zig, and others commonly make use of these. You'll be able to get more information on each by reading through various compilers.

Parsing The Section Header Table

Next up, we need to parse each section header contained within the file. The "table" is just a fancy word for "an array of section header entries". So once we're done, we'll have a list of where all sections start and end within the binary, the name of each section, and some other data.

The section header table starts at the e_shoff'th byte in the file, and is e_shentsize * e_shnum bytes long.
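The arithmetic can be sketched as follows. sectionHeaderTableBounds is a hypothetical helper, and the numbers in main are made up for illustration. The one detail worth calling out is that e_shentsize and e_shnum are both uint16, so each should be widened before multiplying.

```go
package main

import (
    "errors"
    "fmt"
)

// sectionHeaderTableBounds (hypothetical helper) returns the half-open
// byte range [start, end) of the section header table within the file.
func sectionHeaderTableBounds(shoff uint64, shentsize, shnum uint16, fileLen uint64) (start, end uint64, err error) {
    // Widen before multiplying: uint16 * uint16 wraps around at 65536.
    size := uint64(shentsize) * uint64(shnum)
    if shoff > fileLen || size > fileLen-shoff {
        return 0, 0, errors.New("section header table extends past end of file")
    }
    return shoff, shoff + size, nil
}

func main() {
    // Illustrative values: 36 entries of 64 bytes each, starting at byte 13000.
    start, end, err := sectionHeaderTableBounds(13000, 64, 36, 20000)
    fmt.Println(start, end, err)
}
```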

The fields of each section header are:

  • sh_name, uint32: name of the section, as a byte offset into the section header string table
  • sh_type, uint32: section type enum
  • sh_flags, uintptr: flags for this section
  • sh_addr, uintptr: the address at which this section should reside within the address space of the process, if relevant
  • sh_offset, uintptr: offset from the first byte of the ELF file to where the start of this section resides
  • sh_size, uintptr: the number of bytes in the section
  • sh_link, uint32: section header table index of a related section; its meaning depends on the section type
  • sh_info, uint32: extra information about this section; its meaning depends on the section type
  • sh_addralign, uintptr: constraints on the alignment of addresses on the target platform (0 and 1 mean no constraints)
  • sh_entsize, uintptr: if the section contains a table of fixed-size elements (e.g. a symbol table), this is the size of each element

Read e_shnum entries, which should be exactly enough bytes. To give an example of how this might look in code, consider:

type ELFSectionHeader struct {
    sh_name      uint32
    sh_type      uint32
    sh_flags     uintptr
    sh_addr      uintptr
    sh_offset    uintptr
    sh_size      uintptr
    sh_link      uint32
    sh_info      uint32
    sh_addralign uintptr
    sh_entsize   uintptr

    // this is not part of the standard, but we'll
    // look up and store the name on this struct later
    name string
}

// widen each uint16 before multiplying so the product cannot wrap at 65536
sectionHeaderTable := fileBuf[shOff : shOff+uintptr(shentSize)*uintptr(shNum)]
sectionHeaderTableReader := NewBinaryReader(
    bytes.NewBuffer(sectionHeaderTable),
    binary.NativeEndian,
)

sectionHeaders := []*ELFSectionHeader{}
for ndx := 0; ndx < int(shNum); ndx++ {
    header := &ELFSectionHeader{}
    header.sh_name, _ = Read[uint32](sectionHeaderTableReader)
    header.sh_type, _ = Read[uint32](sectionHeaderTableReader)
    header.sh_flags, _ = Read[uintptr](sectionHeaderTableReader)
    header.sh_addr, _ = Read[uintptr](sectionHeaderTableReader)
    header.sh_offset, _ = Read[uintptr](sectionHeaderTableReader)
    header.sh_size, _ = Read[uintptr](sectionHeaderTableReader)
    header.sh_link, _ = Read[uint32](sectionHeaderTableReader)
    header.sh_info, _ = Read[uint32](sectionHeaderTableReader)
    header.sh_addralign, _ = Read[uintptr](sectionHeaderTableReader)
    header.sh_entsize, _ = Read[uintptr](sectionHeaderTableReader)

    sectionHeaders = append(sectionHeaders, header)
}

Once we have all this information, we're going to want to use the sh_name field to look up our section name in the section header string table. This is the ELF section with index e_shstrndx, named .shstrtab. In my case with the test C program, it's the 35th section, though yours may be different.

This table is a series of null-terminated strings all next to each other in one long array. You can read the entire table into an array, then use the sh_name field as the byte offset at which the section's name starts.

I'll use the sh_size and sh_offset fields of the e_shstrndx'th entry to find our location within the binary:

sectionNames := sectionHeaders[shStrTabNdx]

start := sectionNames.sh_offset
end := start + sectionNames.sh_size
sectionNamesBuf := fileBuf[start:end]

for _, header := range sectionHeaders {
    for ndx := header.sh_name; ; ndx++ {
        ch := sectionNamesBuf[ndx]
        if ch == 0 {
            break
        }
        header.name += string(ch)
    }
}
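The byte-by-byte loop above can also be written with bytes.IndexByte, which finds the terminating null in one call and avoids growing a string one character at a time. This is a sketch of that alternative; sectionName is a hypothetical helper, and the string table in main is hand-built rather than read from cloop.

```go
package main

import (
    "bytes"
    "fmt"
)

// sectionName (hypothetical helper) returns the null-terminated string
// that starts at byte offset off in the string table strtab.
func sectionName(strtab []byte, off uint32) (string, bool) {
    if uint64(off) >= uint64(len(strtab)) {
        return "", false // offset points outside the table
    }
    rest := strtab[off:]
    end := bytes.IndexByte(rest, 0)
    if end < 0 {
        return "", false // missing null terminator
    }
    return string(rest[:end]), true
}

func main() {
    // A tiny hand-built string table; real ones also begin with a null byte.
    strtab := []byte("\x00.text\x00.debug_info\x00")
    name, ok := sectionName(strtab, 7)
    fmt.Println(name, ok)
}
```

The bounds checks matter for untrusted input: a corrupt sh_name would otherwise index past the end of the table and panic.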

Now we're able to look up each debug information section by name!

Debug Info Sections

There's a fair number of sections in there! You may recognize some of them, but for the most part, we care about the ones that start with .debug_, though we also care about .eh_frame. If you want to check your work, you can do so with readelf --headers cloop. We'll get into what each of these sections means over time.

There are actually a few sections that are missing from the binary on my machine that we also would want to save if they were present (there are some sections that were present in older versions of DWARF for instance, but were dropped when v5 was released). We'll want to take the content of each one of those sections and save them for parsing later:

    type DWARFSections struct {
        abbrev      []byte
        line        []byte
        info        []byte
        addr        []byte
        aranges     []byte
        frame       []byte
        eh_frame    []byte
        line_str    []byte
        loc         []byte
        loclists    []byte
        names       []byte
        macinfo     []byte
        macro       []byte
        pubnames    []byte
        pubtypes    []byte
        ranges      []byte
        rnglists    []byte
        str         []byte
        str_offsets []byte
        types       []byte
    }

    getSection := func(header *ELFSectionHeader) []byte {
        start := header.sh_offset
        end := header.sh_offset + header.sh_size
        return fileBuf[start:end]
    }

    sections := &DWARFSections{}
    for _, header := range sectionHeaders {
        switch header.name {
        case ".debug_abbrev":
            sections.abbrev = getSection(header)
        case ".debug_line":
            sections.line = getSection(header)
        case ".debug_info":
            sections.info = getSection(header)
        case ".debug_addr":
            sections.addr = getSection(header)
        case ".debug_aranges":
            sections.aranges = getSection(header)
        case ".debug_frame":
            sections.frame = getSection(header)
        case ".eh_frame":
            sections.eh_frame = getSection(header)
        case ".debug_line_str":
            sections.line_str = getSection(header)
        case ".debug_loc":
            sections.loc = getSection(header)
        case ".debug_loclists":
            sections.loclists = getSection(header)
        case ".debug_names":
            sections.names = getSection(header)
        case ".debug_macinfo":
            sections.macinfo = getSection(header)
        case ".debug_macro":
            sections.macro = getSection(header)
        case ".debug_pubnames":
            sections.pubnames = getSection(header)
        case ".debug_pubtypes":
            sections.pubtypes = getSection(header)
        case ".debug_ranges":
            sections.ranges = getSection(header)
        case ".debug_rnglists":
            sections.rnglists = getSection(header)
        case ".debug_str":
            sections.str = getSection(header)
        case ".debug_str_offsets":
            sections.str_offsets = getSection(header)
        case ".debug_types":
            sections.types = getSection(header)
        }
    }

Now we're almost ready to start parsing those debug info sections into something that allows us to inspect a running program!

PIE 🥧

The last thing we'll want to do with our ELF file for now is examine it to determine if it is a position independent executable (PIE, closely related to position independent code, or PIC). PIE means that the code can be loaded and executed at any address in the process' memory space, and is the opposite of absolute code, which must be loaded at a fixed address in memory. You can enable PIE with the -fPIE compiler flag and the -pie linker flag in gcc and clang. It ultimately doesn't restrict our capabilities as a debugger at all; it just means that we need to look up where in the process' address space our code is loaded when we start the program (we'll do that much later).

For now, we can determine if we're PIE based on the value of the DT_FLAGS_1 entry in the .dynamic section like so:

var dynamicHeader *ELFSectionHeader
for _, header := range sectionHeaders {
    if header.name == ".dynamic" {
        dynamicHeader = header
        break
    }
}

pie := false
if dynamicHeader != nil { // statically linked binaries have no .dynamic section
    dynamicBuf := getSection(dynamicHeader)
    dynamicReader := NewBinaryReader(bytes.NewBuffer(dynamicBuf), binary.NativeEndian)
    for {
        tag, _ := Read[uintptr](dynamicReader)
        val, err := Read[uintptr](dynamicReader)

        // stop when we run out of data or hit the terminating DT_NULL entry
        if err != nil || tag == 0 {
            break
        }

        if tag == 0x6fff_fffb { // DT_FLAGS_1
            pie = (val & 0x0800_0000) != 0 // DF_1_PIE
            break
        }
    }
}

You can check your work on this using readelf --dynamic cloop. You may want to try compiling cloop with -fPIE and without -no-pie and re-running your parser to make sure things are looking good.
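For experimenting without a real binary, here is a self-contained sketch of the same scan over a synthetic .dynamic section. It assumes a 64-bit little-endian layout, where each entry is an 8-byte tag followed by an 8-byte value. isPIE and dynEntry are hypothetical helpers written for this example only.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

const (
    dtNull   = 0          // DT_NULL: marks the end of the dynamic array
    dtFlags1 = 0x6ffffffb // DT_FLAGS_1
    df1PIE   = 0x08000000 // DF_1_PIE
)

// isPIE (hypothetical helper) scans the raw bytes of a 64-bit
// little-endian .dynamic section, one 16-byte (tag, value) pair at a time.
func isPIE(dynamic []byte) bool {
    for off := 0; off+16 <= len(dynamic); off += 16 {
        tag := binary.LittleEndian.Uint64(dynamic[off:])
        val := binary.LittleEndian.Uint64(dynamic[off+8:])
        switch tag {
        case dtNull:
            return false // end of the array without seeing DT_FLAGS_1
        case dtFlags1:
            return val&df1PIE != 0
        }
    }
    return false
}

// dynEntry (hypothetical helper) encodes one entry for the synthetic input.
func dynEntry(tag, val uint64) []byte {
    buf := make([]byte, 16)
    binary.LittleEndian.PutUint64(buf[0:], tag)
    binary.LittleEndian.PutUint64(buf[8:], val)
    return buf
}

func main() {
    var dynamic []byte
    dynamic = append(dynamic, dynEntry(1, 0)...)               // DT_NEEDED, ignored here
    dynamic = append(dynamic, dynEntry(dtFlags1, df1PIE|1)...) // DF_1_NOW | DF_1_PIE
    dynamic = append(dynamic, dynEntry(dtNull, 0)...)
    fmt.Println(isPIE(dynamic))
}
```

A binary built with -no-pie may have no DT_FLAGS_1 entry at all, which is why reaching DT_NULL is treated as "not PIE" here.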

Summary

That's it for today! We learned what ELF and DWARF are as well as how to parse just enough ELF to get the debug information sections we care about. ELF is probably the easiest section in this series, so strap in.

Thank you for reading the series on DWARF. Please don't hesitate to reach out with comments, questions, or errata to jim at this domain dot com.