IDX files store multi-dimensional array data and are used in various applications, one of which is as input data for machine learning models. The data type of each element is encoded in the file itself, so unless the file name or its creator gives you a hint, you cannot be certain what the element type is until you read the file. This can be tricky in statically typed languages because the types of all objects must be known at compile time. Programmers often resort to dynamic typing or type masking techniques, which means using dynamic methods rather than more computationally efficient static function overloading.
The D programming language has extensive generic programming and compile-time capabilities. Using a few of these tools, it is fairly straightforward to read an IDX file and determine its data type entirely at compile time, which can be a good alternative to dynamic typing techniques. The compile-time tools used in this article are:
The template keyword declares a template, though a shorthand exists for creating template structs, classes, or functions.
The alias keyword in D has many uses in compile-time code; in this case it is used to declare types. For example
alias T = double;
declares T as type double (this happens at compile time). By way of a further example, here is a template that converts a type T to a pointer type T*:
template P(T){ alias P = T*; }
so that
P!(double)[] x;
declares x as an array of pointers (equivalent to double*[] x;). Those new to D but with some familiarity with C++ will notice that there are no angle brackets <> here; instead D uses the (in my opinion much nicer) TEMPLATE!(ARGS...) form to specify compile-time template arguments (ARGS).
The enum keyword declares a compile-time constant. For example
enum string home = "Earth";
declares a constant home that has the value "Earth".
In this article we won’t be forming a multi-dimensional array, only showing that we can read data of the type specified in the file, along with its dimensions, into a contiguous array at compile time, which is sufficient to subsequently create a multi-dimensional array.
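The three tools above can be tried together in a few lines. The following is a minimal sketch, assuming nothing beyond the standard library; the static assert lines are an addition for illustration and simply ask the compiler to confirm each claim.

```d
import std.stdio : writeln;

// A template that maps a type T to its pointer type T*.
template P(T)
{
    alias P = T*;
}

// alias gives a type a new name at compile time.
alias T = double;

// enum declares a compile-time constant.
enum string home = "Earth";

// static assert is evaluated by the compiler; a false condition stops the build.
static assert(is(T == double));
static assert(is(P!(double) == double*));
static assert(home == "Earth");

void main()
{
    double value = 1.5;
    P!(double)[] x = [&value];   // same type as double*[]
    writeln(*x[0]);
    writeln(home);
}
```

If any of the static assert conditions were false, the program would fail to compile rather than fail at run time, which is the property the rest of the article relies on.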
Three functions from the standard library will be used.
import std.stdio: writeln;
import std.conv: to;
import std.bitmanip: bigEndianToNative;
The writeln function allows the printing of outputs.
The to (template) function allows us to do type conversion; for example, converting "3.14159" from a string to a double is done using
auto x = to!(double)("3.14159");
The bigEndianToNative function also happens to be a template function and is used to convert from big endian byte order (the format the data is stored in) to whatever byte order the system uses.
All three of the above functions are actually template functions. It’s interesting that to write this read function we only need three library functions; that’s how straightforward this is in D.
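As a quick illustration of the library functions just described, here is a small sketch. The byte values are made up for the example; they encode the integer 10000 in big endian order, the way an IDX header stores a dimension.

```d
import std.stdio : writeln;
import std.conv : to;
import std.bitmanip : bigEndianToNative;

void main()
{
    // to!(T) converts between types, here from string to double.
    auto x = to!(double)("3.14159");

    // Four illustrative bytes in big endian order: 0x00002710 is 10000.
    ubyte[4] raw = [0x00, 0x00, 0x27, 0x10];

    // bigEndianToNative takes a fixed-size ubyte[n] and returns a value of type T.
    int n = bigEndianToNative!(int, 4)(raw);

    writeln(x);
    writeln(n);
}
```

Note that bigEndianToNative takes a static array whose length must equal the size of the target type, which matters later when the element data is converted.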
The third byte in the IDX file is used to code the type of the elements in the multi-dimensional array. Below is a mapping table taken from the MNIST website:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
We can create a mapping table from byte code to type using a template. D has associative arrays, but they map values to values and cannot hold types, so a template is the natural tool here.
template DataType(ubyte byteCode)
{
  static if(byteCode == 0x08)
  {
    alias DataType = ubyte;
  }else static if(byteCode == 0x09)
  {
    alias DataType = byte;
  }else static if(byteCode == 0x0B)
  {
    alias DataType = short;
  }else static if(byteCode == 0x0C)
  {
    alias DataType = int;
  }else static if(byteCode == 0x0D)
  {
    alias DataType = float;
  }else static if(byteCode == 0x0E)
  {
    alias DataType = double;
  }
}
In the above code static if comes into play: when the code is compiled, only one of those branches is actually substituted at the point of use and the rest go away. Notice also that a template parameter can be a value of a specified data type, in this case ubyte (unsigned byte), rather than a type parameter. Note: all template arguments must be known at compile time. I have chosen to represent the size in bytes in a separate table:
template Stride(ubyte byteCode)
{
  static if(byteCode == 0x08)
  {
    enum long Stride = 1;
  }else static if(byteCode == 0x09)
  {
    enum long Stride = 1;
  }else static if(byteCode == 0x0B)
  {
    enum long Stride = 2;
  }else static if(byteCode == 0x0C)
  {
    enum long Stride = 4;
  }else static if(byteCode == 0x0D)
  {
    enum long Stride = 4;
  }else static if(byteCode == 0x0E)
  {
    enum long Stride = 8;
  }
}
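The two tables can be cross-checked against each other, because the size of each element also follows from the type through .sizeof. The sketch below uses a trimmed three-entry copy of the mapping so that it stands alone; the names ElementType and elementSize are hypothetical and not part of the article's code, and the static assert fallback for unknown codes is likewise an addition.

```d
import std.stdio : writeln;

// Trimmed, illustrative copy of the byte-code-to-type mapping.
// ElementType is a hypothetical name used only in this sketch.
template ElementType(ubyte byteCode)
{
    static if (byteCode == 0x08)
        alias ElementType = ubyte;
    else static if (byteCode == 0x0B)
        alias ElementType = short;
    else static if (byteCode == 0x0E)
        alias ElementType = double;
    else
        static assert(0, "unsupported IDX type code");
}

// The element size derived from the type itself rather than a second table.
enum long elementSize(ubyte byteCode) = ElementType!(byteCode).sizeof;

// These checks run at compile time.
static assert(is(ElementType!(0x0B) == short));
static assert(elementSize!(0x08) == 1);
static assert(elementSize!(0x0B) == 2);
static assert(elementSize!(0x0E) == 8);

void main()
{
    writeln(elementSize!(0x0E));
}
```

Keeping a separate size table, as the article does, makes the sizes explicit; deriving them from the type removes the chance of the two tables drifting apart. Either choice works for the purposes of readIDX.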
The readIDX() function
Now that we have described the mapping tables, we can begin to create the function for reading IDX files, which I am calling readIDX. Firstly the declaration:
auto readIDX(string filePath)(){/*... code ...*/}
is shorthand for doing this:
template readIDX(string filePath)
{
  auto readIDX()
  {
    /*... code ...*/
  }
}
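To see the shorthand in action without involving any file, here is a small sketch of a function with a compile-time string parameter. The function name fileLabel and the path are hypothetical; the path is used only as a string and no file is opened.

```d
import std.stdio : writeln;

// Shorthand form: the first parameter list holds compile-time arguments,
// the second (empty here) holds run-time arguments.
auto fileLabel(string filePath)()
{
    // filePath is a compile-time constant inside the function body,
    // so it can be used to initialise an enum.
    enum label = "reading " ~ filePath;
    return label;
}

void main()
{
    // Hypothetical path, used only as a string here.
    writeln(fileLabel!("data/example.idx")());
}
```

Because filePath is a template argument rather than a run-time argument, it is available to other compile-time constructs inside the function, which is what lets readIDX pass it to import.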
Now we step into the internals of the function. In D the import keyword for importing packages and modules has a little-known alternative use, which is to read files at compile time! As I stated before, enums can be used to denote compile-time constants:
enum ubyte[] idxData = cast(ubyte[])import(filePath);
Here I read the data in as a ubyte array by default. From this point onwards the rest of the process is all about converting the bytes that we have just read into recognisable data and returning it. Next we declare the return type (R) and the element size in bytes (stride), which is used later when we extract the data. The MNIST website also specifies that the fourth byte denotes the number of dimensions in the array, so we extract that and use it to create an integer array of the appropriate size.
alias R = DataType!(idxData[2]);
enum stride = Stride!(idxData[2]);
enum N = to!(long)(idxData[3]);
int[N] dims;
Below we read and convert the dimensions of the array into the integer array using a static foreach. Note the unusual use of double "{{" brackets. This is because a static foreach body written with a single "{" does not introduce a new scope, so constants declared in it would be redeclared on every iteration. The second pair of curly brackets creates an ordinary block scope that limits each set of constants to its own iteration. Without this need for scoping we could use the single curly bracket with static foreach.
static foreach(i; 0..N)
{{
  enum start = (i + 1) * 4;
  enum end = start + 4;
  enum ubyte[4] tmp = cast(ubyte[4])idxData[start..end];
  dims[i] = bigEndianToNative!(int, 4)(tmp);
}}
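The dimension-reading step can be tried on its own with a header written out as a literal. The sketch below is an assumption-laden stand-in: the header bytes are illustrative (type code 0x08, three dimensions of 10000, 28 and 28), readDims is a hypothetical helper name, and the four bytes are gathered into an array literal rather than with a cast.

```d
import std.stdio : writeln;
import std.bitmanip : bigEndianToNative;

// Illustrative header: two zero bytes, type code 0x08 (ubyte), 3 dimensions,
// then the dimensions 10000, 28 and 28 as big endian 4-byte integers.
enum ubyte[] header = [0, 0, 0x08, 3,
                       0, 0, 0x27, 0x10,
                       0, 0, 0, 28,
                       0, 0, 0, 28];

// readDims is a hypothetical helper used only in this sketch.
int[] readDims()
{
    enum N = header[3];   // number of dimensions, known at compile time
    int[N] dims;
    static foreach (i; 0 .. N)
    {{
        // The inner braces open a scope, so start and tmp can be
        // declared afresh on every unrolled iteration.
        enum start = (i + 1) * 4;
        enum ubyte[4] tmp = [header[start], header[start + 1],
                             header[start + 2], header[start + 3]];
        dims[i] = bigEndianToNative!(int, 4)(tmp);
    }}
    return dims.dup;
}

void main()
{
    writeln(readDims());
}
```

Removing one pair of braces from the static foreach body makes the compiler reject the second iteration, because start and tmp would then be declared twice in the same scope.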
Firstly we calculate the total number of elements in the array as the product of its dimensions:
long nitems = 1;
foreach(dim; dims)
{
  nitems *= dim;
}
Then we declare the output data and use static if to conditionally compile for two cases. When the data is ubyte we don’t need to convert anything; data = idxData[pre..$]; assigns the slice of the array that follows the header. When the data is one of the other types we need to take its size into account and apply a byte order conversion if necessary. The line data[i] = bigEndianToNative!(R, stride)(tmp); takes the stride bytes of one element, converts them from big endian to the native byte order, and returns them as the correct data type R specified in the mapping table. Note that bigEndianToNative expects a fixed-size ubyte[stride] argument and returns a value of type R, which is why tmp is declared as a static array.
R[] data;
immutable(long) pre = ((N + 1) * 4);  // header length: 4 bytes plus 4 per dimension
static if(is(R == ubyte))
{
  // ubyte data needs no conversion
  data = idxData[pre..$];
}else{
  data = new R[nitems];
  foreach(i; 0..data.length)
  {
    long start = stride * i + pre;
    long end = start + stride;
    // copy the bytes of one element into a fixed-size buffer
    ubyte[stride] tmp = idxData[start..end];
    // convert byte order and reinterpret as R in one step
    data[i] = bigEndianToNative!(R, stride)(tmp);
  }
}
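The element conversion can be exercised separately from the file reading. The following sketch wraps it in a run-time helper that works for any of the supported element types; decodeBigEndian is a hypothetical name, the element size is taken from R.sizeof rather than a table, and the input bytes are made up for the example.

```d
import std.stdio : writeln;
import std.bitmanip : bigEndianToNative;

// Decode a run of big endian bytes into an array of R.
// decodeBigEndian is a hypothetical helper, not part of the article's code.
R[] decodeBigEndian(R)(const(ubyte)[] raw)
{
    enum stride = R.sizeof;
    auto data = new R[raw.length / stride];
    foreach (i; 0 .. data.length)
    {
        // bigEndianToNative takes a fixed-size ubyte[stride],
        // so copy the bytes of one element into a static array first.
        ubyte[stride] tmp;
        tmp[] = raw[i * stride .. (i + 1) * stride];
        data[i] = bigEndianToNative!(R, stride)(tmp);
    }
    return data;
}

void main()
{
    // 0x0102 and 0xFFFE read as signed 16-bit values.
    ubyte[] shorts = [0x01, 0x02, 0xFF, 0xFE];
    writeln(decodeBigEndian!(short)(shorts));

    // The IEEE 754 bit pattern of 1.0f in big endian order.
    ubyte[] floats = [0x3F, 0x80, 0x00, 0x00];
    writeln(decodeBigEndian!(float)(floats));
}
```

Since bigEndianToNative returns a value of the requested type directly, no pointer cast is needed to reinterpret the bytes.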
That’s basically it.
To compile code that does a compile-time file read with import, you must tell the compiler which directory such files may be read from using the compiler flag -J="LOCATION" (for the dmd compiler); filePath is then resolved relative to that location. It’s not a big deal because it means that you can just use -J="." if you have given a path relative to the current directory. For example, the line I used to compile my code for filePath = "data/t10k-images.idx3-ubyte" (Ubuntu OS) is:
dmd idx.d -J="." && ./idx