Object Code Classification

Introduction

This page summarizes preliminary research into applying machine learning to compiled object code. It includes the dataset and a summary of the results so far.

Publications

Dataset

This dataset contains the raw object code extracted from sample executable and library files for a range of architectures. No file-format metadata (such as ELF or PE headers) is included with the code; each sample is simply the raw opcodes and operands.

This version contains samples from 20 different targets including:

  • Intel (amd64, i386, ia64)
  • ARM (arm64, armel, armhf)
  • SPARC (sparc, sparc64)
  • IBM System/390 and z/Architecture (s390, s390x)
  • PowerPC (powerpc, ppc64)
  • MIPS (mips, mipsel)
  • DEC Alpha (alpha)
  • HP PA-RISC (hppa)
  • AVR 8-bit (avr)
  • NVIDIA CUDA (cuda)
  • SH4 (sh4)
  • Motorola M68000 (m68k)

Download

The dataset is available on request: email clemej1@umbc.edu and I’ll send it to you.
It is approximately 160 MB in size.

Directory Layout

Each directory contains a set of code samples and a JSON-formatted file with some basic information about each sample. The samples were generated from individual files (taken mainly from Debian binary distributions). The file name of each ‘.code’ file is the md5sum of the original file. More metadata about the original file is contained in the JSON file in each sample directory.

     base/
          <arch name>/
                  <arch name>.json
                  <md5 hash of original file>.code
                  <md5 hash of original file>.code
                  ...
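
As a rough illustration of this layout, the following Python sketch walks a base directory and lists every sample it finds. The function name `iter_samples` and the `base_dir` argument are hypothetical; the only things taken from the description above are the one-directory-per-architecture layout and the ‘.code’ extension.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_samples(base_dir: str) -> Iterator[Tuple[str, str, Path]]:
    """Yield (arch name, md5 hex digest, path) for every .code sample.

    Assumes base_dir holds one sub-directory per architecture, each
    containing <md5 hash>.code files next to a JSON metadata file.
    """
    for arch_dir in sorted(Path(base_dir).iterdir()):
        if not arch_dir.is_dir():
            continue
        for sample in sorted(arch_dir.glob("*.code")):
            # The file stem is the md5sum of the original file.
            yield arch_dir.name, sample.stem, sample
```

Because the hash is part of the file name, the stem can be used to look up the matching entry in the architecture's JSON metadata file.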

For each .code sample in a directory, the JSON metadata file contains an entry with information taken from the original file the sample was extracted from. This includes:

  • Original file name and file size
  • MD5 hash of the original file (the sample is named ‘<md5 hash>.code’)
  • Architecture description of the file (name, endianness, wordsize)
  • The output of the UNIX file command on the original file
  • A set of tags (including the arch name used in this dataset)
  • Information about the object file sections this code sample was extracted from
    • The section name
    • The size (in bytes) of the sample related to that section
    • The byte offset of that section’s code within the sample file

The section information is only relevant if you wish to know which part of the sample came from which section.
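
If you do want the per-section bytes, a minimal sketch along these lines may help. It assumes each section record is a `[name, size, offset]` triple, as in the `code_sections` field of the armhf example entry on this page; the function name `split_sections` is hypothetical.

```python
from typing import List, Sequence, Tuple


def split_sections(sample: bytes, code_sections: Sequence[Sequence]) -> List[Tuple[str, bytes]]:
    """Slice a .code sample into (section name, section bytes) pairs.

    Assumes each record is [name, size in bytes, offset into the sample].
    """
    parts = []
    for name, size, offset in code_sections:
        if offset < 0 or offset + size > len(sample):
            raise ValueError(f"section {name!r} lies outside the sample")
        parts.append((name, sample[offset:offset + size]))
    return parts
```

A list of pairs is returned rather than a dictionary so that a file with two sections of the same name would not silently lose one of them.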

Here’s an example of the JSON entry for a file in the armhf directory:

    {
            "arch": "Advanced Risc Machines ARM.",
            "code_sections": [
                    [
                            ".text",
                            15792,
                            0
                    ],
                    [
                            ".init",
                            10,
                            15792
                    ],
                    [
                            ".plt",
                            968,
                            15802
                    ],
                    [
                            ".fini",
                            6,
                            16770
                    ]
            ],
            "endian": "little",
            "filehash": "3c2ae7a15be942fd8111bde7664b5aa0",
            "fileinfo": "ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 2.6.26, BuildID[sha1]=8c11b8ccbc39824dc3e16f7f5705a3e08cd28f09, stripped",
            "filename": "../wheezy-armhf/lib/udev/udev-acl",
            "filesize": 30428,
            "tags": [
                    "armhf"
            ],
            "wordsize": 32
    },
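
In this entry, each offset equals the running total of the preceding section sizes, which suggests that the sections are stored back to back in the sample file. The sketch below checks that property for the values shown above; whether it holds for every entry in the dataset is an assumption, and `check_contiguous` is a hypothetical helper name.

```python
# Section records copied from the armhf example entry: [name, size, offset].
code_sections = [
    [".text", 15792, 0],
    [".init", 10, 15792],
    [".plt", 968, 15802],
    [".fini", 6, 16770],
]


def check_contiguous(sections):
    """Return the total size if every section starts where the previous one ended."""
    expected_offset = 0
    for name, size, offset in sections:
        if offset != expected_offset:
            raise ValueError(f"{name}: offset {offset}, expected {expected_offset}")
        expected_offset += size
    return expected_offset


print(check_contiguous(code_sections))  # 16776
```

Note that the 16776 bytes covered by the sections is smaller than the `filesize` of 30428, since `filesize` describes the original file rather than the extracted sample.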

Changelog

v0.1, March 2015

  • Sample files collected and analyzed for paper

v1.0, Aug 24, 2015

  • Initial public release of dataset
  • Extract object code into separate files
    • This brings the dataset size down from 1.7G to less than 700M
  • Create JSON metadata files