Reading IDX Files in D, an introduction to compile time programming

Author: Dr Chibisi Chima-Okereke Created: August 21, 2020 01:53:12 GMT Published: August 21, 2020 01:53:12 GMT

Introduction

IDX files store multi-dimensional array data and are used in various applications, one of which is as input data for machine learning models. The data type of each element is encoded in the file itself, so unless the file name or creator gives you a hint, you cannot be certain what the element type is until you read the file. This can be tricky in statically typed languages because the data types of all objects must be known at compile time. Programmers often resort to dynamic typing or type masking techniques, which means that dynamic methods must be used rather than more computationally efficient static function overloading.

The D programming language has extensive generic programming and compile time capabilities. Using a few of these tools, it is fairly straightforward to read an IDX file and determine its data type entirely at compile time, which can be a good alternative to resorting to dynamic typing techniques. The compile time tools used in this article are:

Templates: the same concept and similar usage to templates in C++, they are a very powerful tool for code generation and even compile time evaluation. We can create templates using the template keyword though a shorthand exists for creating template structs, classes, or functions.
alias: the alias keyword in D has many uses in compile time code, in this case it is used to declare types. For example
alias T = double;
declares T as an alias for the type double (this happens at compile time). By way of a further example, here is a template that converts a type T to the pointer type T*:
template P(T){ alias P = T*; }
Because the alias inside has the same name as the template, this is known as an eponymous template. We can use the template like this:
P!(double)[] x;
to declare x as an array of pointers (equivalent to double*[] x;). Those new to D but with some familiarity with C++ will notice that there are no angle brackets <> here; instead D uses the (in my opinion much nicer) TEMPLATE!(ARGS...) form to specify the compile time template arguments (ARGS).
enum: in D enums are used to create enumerations, but they can be used to declare compile time constants of any type. For example:
enum string home = "Earth";
will create a compile time string constant home that has the value "Earth".
static if: is not a regular if statement. It’s conditional compilation meaning that one block of code is compiled rather than another. More on this later.
static foreach: essentially a compile time iteration over a range whose values are known at compile time.
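
The tools listed above can be tried together in one small program. This is only a sketch: P and home come from the examples above, while SizeName is a hypothetical helper introduced here purely to show static if.

```d
import std.stdio : writeln;

// Eponymous template: P!(T) is an alias for the pointer type T*.
template P(T)
{
    alias P = T*;
}

// Compile time string constant.
enum string home = "Earth";

// Hypothetical helper (not part of the article's code): picks one
// of two constants at compile time using static if.
template SizeName(T)
{
    static if (T.sizeof == 1)
        enum string SizeName = "one byte";
    else
        enum string SizeName = "several bytes";
}

void main()
{
    alias T = double;

    // All of these checks are evaluated by the compiler.
    static assert(is(P!(T) == double*));
    static assert(home.length == 5);
    static assert(SizeName!(ubyte) == "one byte");
    static assert(SizeName!(T) == "several bytes");

    // static foreach is unrolled by the compiler into three writeln calls.
    static foreach (i; 0 .. 3)
    {
        writeln(home, " ", i);
    }
}
```

If any of the static assert lines were false, the program would fail to compile rather than fail at run time.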

In this article we won’t be forming a multi-dimensional array, only showing that we can read the dimensions and the data, with the element type specified in the file, into a contiguous array at compile time, which is sufficient to subsequently create a multi-dimensional array.

Preliminaries

Three functions from the standard library will be used.

import std.stdio: writeln;
import std.conv: to;
import std.bitmanip: bigEndianToNative;

The writeln function prints output.
The to (template) function performs type conversion; for example, converting "3.14159" from a string to a double is done using
auto x = to!(double)("3.14159");
The bigEndianToNative function is also a template function and converts from big endian byte order (the format the data is stored in) to whatever byte order the system uses.

All three of the above functions are actually template functions. It’s interesting that we only need three library functions to write this read function; that’s how straightforward this is in D.
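
As a quick illustration of the two conversion functions, the sketch below converts a string to a double and decodes four big endian bytes into an int. The byte values are made up for the example; they encode 10000 (0x00002710).

```d
import std.bitmanip : bigEndianToNative;
import std.conv : to;
import std.stdio : writeln;

void main()
{
    // String to double conversion.
    auto x = to!(double)("3.14159");
    writeln(x);

    // Four big endian bytes encoding the 32-bit integer 10000 (0x00002710).
    // bigEndianToNative takes a fixed-size array whose length must equal
    // the size of the target type.
    ubyte[4] raw = [0x00, 0x00, 0x27, 0x10];
    int n = bigEndianToNative!(int, 4)(raw);
    assert(n == 10_000);
    writeln(n);
}
```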

Lookup tables

The third byte in the IDX file is used to code the type of the elements in the multi-dimensional array. Below is a mapping table taken from the MNIST website:

0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)

We can create a mapping table from byte to type using a template. Although D has associative arrays, they map values to values and cannot map a value to a type, so a template is used here instead.

template DataType(ubyte byteCode)
{
  static if(byteCode == 0x08)
  {
    alias DataType = ubyte;
  }else static if(byteCode == 0x09)
  {
    alias DataType = byte;
  }else static if(byteCode == 0x0B)
  {
    alias DataType = short;
  }else static if(byteCode == 0x0C)
  {
    alias DataType = int;
  }else static if(byteCode == 0x0D)
  {
    alias DataType = float;
  }else static if(byteCode == 0x0E)
  {
    alias DataType = double;
  }
}

This is where static if comes into play: when the code is compiled, only the branch whose condition holds is substituted at the point of use, and the rest are discarded. Notice also that a template parameter can be a value of a specified type, in this case ubyte (unsigned byte), rather than a type parameter. Note: all template arguments must be known at compile time. I have chosen to represent the size in bytes in a separate table:

template Stride(ubyte byteCode)
{
  static if(byteCode == 0x08)
  {
    enum long Stride = 1;
  }else static if(byteCode == 0x09)
  {
    enum long Stride = 1;
  }else static if(byteCode == 0x0B)
  {
    enum long Stride = 2;
  }else static if(byteCode == 0x0C)
  {
    enum long Stride = 4;
  }else static if(byteCode == 0x0D)
  {
    enum long Stride = 4;
  }else static if(byteCode == 0x0E)
  {
    enum long Stride = 8;
  }
}
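
Both tables can be checked by the compiler with static assert. The sketch below restates DataType in compact form so that it compiles on its own, adds a final static assert branch that rejects unknown codes (an addition, not part of the tables above), and shows a hypothetical alternative, StrideOf, that derives the element size from the type instead of keeping a second table.

```d
// Compact restatement of the DataType table so this block compiles alone.
template DataType(ubyte byteCode)
{
    static if (byteCode == 0x08)      alias DataType = ubyte;
    else static if (byteCode == 0x09) alias DataType = byte;
    else static if (byteCode == 0x0B) alias DataType = short;
    else static if (byteCode == 0x0C) alias DataType = int;
    else static if (byteCode == 0x0D) alias DataType = float;
    else static if (byteCode == 0x0E) alias DataType = double;
    // Added for this sketch: fail the build on an unrecognised code.
    else static assert(0, "Unknown IDX type code");
}

// Hypothetical alternative to the Stride table: ask the type for its size.
enum long StrideOf(ubyte byteCode) = DataType!(byteCode).sizeof;

static assert(is(DataType!(0x08) == ubyte));
static assert(is(DataType!(0x0D) == float));
static assert(StrideOf!(0x08) == 1);
static assert(StrideOf!(0x0B) == 2);
static assert(StrideOf!(0x0E) == 8);

void main() {}
```

Keeping a separate Stride table, as in the article, makes the byte sizes explicit; deriving them from the type removes the risk of the two tables drifting apart.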

The `readIDX()` function

The declaration

Now that we have described mapping tables, we can begin to create the function for reading IDX files that I am calling readIDX. Firstly the declaration:
auto readIDX(string filePath)(){/*... code ...*/}
is shorthand for doing this:

template readIDX(string filePath)
{
  auto readIDX()
  {
    /*... code ...*/
  }
}
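
To see what a compile time function parameter buys us, here is a small sketch using a hypothetical function greet (not part of the IDX reader). Because name is a template argument, it can be used wherever a constant is required, for example in a static assert.

```d
import std.stdio : writeln;

// Hypothetical example: `name` is a compile time parameter.
auto greet(string name)()
{
    static assert(name.length > 0, "name must not be empty");
    return "Hello, " ~ name;
}

void main()
{
    // Template arguments follow `!`; the second pair of parentheses
    // holds the (here empty) run time argument list.
    writeln(greet!("IDX")());

    // The call can also be evaluated entirely by the compiler.
    static assert(greet!("D")() == "Hello, D");
}
```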

Compile time I/O

Now we step into the internals of the function. In D the import keyword for importing packages and modules has a little known alternative use, which is to read files at compile time! As I stated before enums can be used to denote compile time constants:
enum ubyte[] idxData = cast(ubyte[])import(filePath);
Here I read the data in as a ubyte array. From this point onwards the rest of the process is all about converting the bytes that we have just read into recognisable data and returning it. Next we declare the return element type (R) and the element size in bytes (stride), which is used later when we extract the data. The MNIST website also specifies that the fourth byte denotes the number of dimensions in the array, so we extract that and use it to create an integer array of the appropriate size.

alias R = DataType!(idxData[2]);
enum stride = Stride!(idxData[2]);
enum N = to!(long)(idxData[3]);
int[N] dims;
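
The header layout can be explored without a file by writing the bytes out by hand. The sketch below uses a made-up header for a 2 x 3 array of unsigned bytes; the name header and its contents are assumptions for illustration only.

```d
// Hypothetical header of an IDX file holding a 2 x 3 array of ubyte:
// two zero bytes, the type code, the number of dimensions, then one
// big endian 32-bit size per dimension.
enum ubyte[] header = [0x00, 0x00, 0x08, 0x02,
                       0x00, 0x00, 0x00, 0x02,
                       0x00, 0x00, 0x00, 0x03];

enum ubyte typeCode = header[2]; // third byte: element type
enum long N = header[3];         // fourth byte: number of dimensions

static assert(typeCode == 0x08);
static assert(N == 2);

// The element data starts after the 4-byte magic number and N sizes.
static assert((N + 1) * 4 == 12);
static assert(header.length == 12);

void main() {}
```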

static foreach

Below we read and convert the dimensions of the array into the integer array using a static foreach. Note the unusual use of double "{{" brackets. The outer braces of a static foreach do not introduce a new scope, so declarations in its body (the enums here) would be repeated in the enclosing scope on every iteration and clash. The inner pair of braces opens an ordinary block that limits the scope of the constants created here. If the body declared nothing, single curly brackets would be enough.

static foreach(i; 0..N)
{{
  enum start = (i + 1) * 4;
  enum end = start + 4;
  enum ubyte[4] tmp = cast(ubyte[4])idxData[start..end];
  dims[i] = bigEndianToNative!(int, 4)(tmp);
}}
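
The scoping behaviour is easier to see in isolation. In this sketch (the array squares is made up for the example) each unrolled iteration declares its own enum sq, which only works because of the inner braces.

```d
import std.stdio : writeln;

void main()
{
    int[3] squares;

    // With single braces the three unrolled copies of `enum sq` would land
    // in the same scope and clash; the inner braces give each copy its own.
    static foreach (i; 0 .. 3)
    {{
        enum sq = i * i;
        squares[i] = sq;
    }}

    assert(squares == [0, 1, 4]);
    writeln(squares);
}
```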

Converting the data elements

Firstly we calculate the total number of elements in the array as the product of its dimensions:

long nitems = 1;
foreach(dim; dims)
{
  nitems *= dim;
}

Then we declare the output data and use static if to conditionally compile one of two cases. When the data is ubyte we don’t need to convert anything: data = idxData[pre..$]; assigns the appropriate slice of the array. When the data is one of the other types we need to take its size into account and apply a byte order conversion. The line
data[i] = bigEndianToNative!(R, stride)(tmp);
takes the stride bytes of one element, converts them from big endian to the native byte order and returns them as the data type R specified in the mapping table. Note that bigEndianToNative takes a fixed-size ubyte[stride] array and returns a value of type R, so the type passed to it must be R rather than a hard-coded int.

R[] data;
immutable(long) pre = ((N + 1) * 4);
static if(is(R == ubyte))
{
  // Bytes need no conversion: take everything after the header.
  data = idxData[pre..$];
}else{
  data = new R[nitems];
  foreach(i; 0..data.length)
  {
    long start = stride * i + pre;
    long end = start + stride;
    // Copy the bytes of one element into a fixed-size buffer.
    ubyte[stride] tmp = idxData[start..end];
    // Decode the big endian bytes into a native value of type R.
    data[i] = bigEndianToNative!(R, stride)(tmp);
  }
}
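
The element conversion can be tried on its own with an in-memory buffer. The sketch below uses a hypothetical helper, decode, that is not part of readIDX; it assumes the bytes hold consecutive big endian values of type R, and the sample bytes are made up.

```d
import std.bitmanip : bigEndianToNative;

// Hypothetical helper: decode `bytes` as consecutive big endian values
// of type R. The element size is taken from the type itself.
R[] decode(R)(const(ubyte)[] bytes)
{
    enum stride = R.sizeof;
    assert(bytes.length % stride == 0);
    auto data = new R[bytes.length / stride];
    foreach (i; 0 .. data.length)
    {
        // Copy one element into a fixed-size buffer, then convert it.
        ubyte[stride] tmp = bytes[i * stride .. (i + 1) * stride];
        data[i] = bigEndianToNative!(R, stride)(tmp);
    }
    return data;
}

void main()
{
    // Three big endian shorts: 1, 256 and -2.
    ubyte[] rawShorts = [0x00, 0x01, 0x01, 0x00, 0xFF, 0xFE];
    short[] expectedShorts = [1, 256, -2];
    assert(decode!(short)(rawShorts) == expectedShorts);

    // Two big endian floats: 1.0 and -2.0.
    ubyte[] rawFloats = [0x3F, 0x80, 0x00, 0x00, 0xC0, 0x00, 0x00, 0x00];
    float[] expectedFloats = [1.0f, -2.0f];
    assert(decode!(float)(rawFloats) == expectedFloats);
}
```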

That’s basically it.

Compilation

To compile code that does a compile time file read with import, you must tell the compiler where to look for the file named by filePath, using the compiler flag “-J="LOCATION"” (for the dmd compiler). It’s not a big deal, because you can just use “-J="."” if you have given a path relative to the current directory (as on Linux). For example, the line I used to compile my code with filePath = "data/t10k-images.idx3-ubyte" (Ubuntu OS) is:

dmd idx.d -J="." && ./idx