Kernel Matrix Calculations in Chapel

Author: Dr Chibisi Chima-Okereke Created: December 23, 2020 20:34:46 GMT Published: January 6, 2021 05:56:04 GMT

A previous article looked at Kernel Matrix calculations in Nim. In this article, we look at Kernel Matrix calculations in the Chapel programming language and offer some tips on performance optimization. Chapel is a statically compiled programming language developed by Cray, aimed at high-performance programming for both multicore computers and clusters, with an emphasis on data and task parallelism in scientific computing. It is heavily influenced by Fortran in that it has matrix/tensor-style arrays on which the user can run element-wise arithmetic operations. Chapel has both records (value types similar to C/C++ structs) and single-dispatch reference classes. More information on the Chapel programming language is given on the official website. The particulars of Kernel Matrix calculations are discussed in the previous article and will not be repeated here.
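To illustrate the Fortran-style array arithmetic mentioned above, here is a small sketch (illustrative only, not part of the kernel code) of element-wise operations on Chapel arrays:

```chapel
//arrays.chpl (illustrative sketch)
var x: [1..3] real = [1.0, 2.0, 3.0];
var y: [1..3] real = [4.0, 5.0, 6.0];
//Arithmetic operators are promoted element-wise over whole arrays
var z = x + y;       //element-wise sum
var w = 2.0 * x;     //element-wise scaling
writeln(z);
writeln(w);
```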

Chapel appeared in 2009 and is now at version:

chpl --version
chpl version 1.23.0
Copyright 2020 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

Here I acknowledge the help provided by the Chapel community, and would like to thank them for helping me with the performance-optimized code and for their patience with my questions.

Hello World in Chapel

In Chapel, there is no need for a main function; executable code can be written script-style, so “Hello World!” can be as simple as:

//hello.chpl
writeln("Hello World!");

A more comprehensive version is below. The proc keyword denotes a procedure (or function) in Chapel:

proc main()
{
  writeln("Hello World!");
}

One thing that stands out is the time taken to compile even a simple program in Chapel. In this case it takes nearly 5 seconds.

/usr/bin/time -v chpl hello.chpl
Command being timed: "chpl hello.chpl"
User time (seconds): 4.56
System time (seconds): 0.19
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.74
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 270168
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 145735
Voluntary context switches: 393
Involuntary context switches: 47
Swaps: 0
File system inputs: 0
File system outputs: 18112
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

By way of comparison, “Hello World!” in D is given below:

import std.stdio: writeln;

void main()
{
  writeln("Hello World!");
}

The compilation timings are given below, first for the dmd compiler v2.090.1:

/usr/bin/time -v dmd hello.d
Command being timed: "dmd hello.d"
User time (seconds): 0.17
System time (seconds): 0.28
Percent of CPU this job got: 93%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 53492
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 36
Minor (reclaiming a frame) page faults: 26428
Voluntary context switches: 1079
Involuntary context switches: 13
Swaps: 0
File system inputs: 7018
File system outputs: 1840
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

and then for D’s LLVM-based ldc2 compiler v1.23.0:

/usr/bin/time -v ldc2 hello.d
Command being timed: "ldc2 hello.d"
User time (seconds): 0.12
System time (seconds): 1.13
Percent of CPU this job got: 95%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.32
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 87444
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 326
Minor (reclaiming a frame) page faults: 29116
Voluntary context switches: 3028
Involuntary context switches: 21
Swaps: 0
File system inputs: 40166
File system outputs: 2032
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

These put the timing of the Chapel compiler into perspective.

Records in Chapel

Records in Chapel are created using the record keyword, and the associated procedures are nested inside the record itself. The records for the Dot and Gaussian kernels are given below, with comments:

//kernelbasic.chpl

//Here we load the module
use Math;

// record keyword for creating struct-like objects
record DotProduct {
  //Here the type is a parameter of the object (parametric types)
  type T;
  //The kernel method takes xrow, an array with domain D and element
  //type T, and yrow with the same domain and element type.
  //The question marks request inference; the return type is given by `: T`
  proc kernel(xrow: [?D] ?T, yrow: [D] T): T
  {
    //Declare a variable dist of type T with var
    var dist: T = 0: T;
    //Iterating over a Domain
    for i in D {
      dist += xrow[i] * yrow[i];
    }
    return dist;
  }
}

record Gaussian {
  type T;
  //Data member of the record; const means it cannot be changed once assigned
  const theta: T;
  proc kernel(xrow: [?D] ?T, yrow: [D] T): T
  {
    var dist: T = 0: T;
    for i in D {
      var tmp = xrow[i] - yrow[i];
      dist += tmp * tmp;
    }
    return exp(-sqrt(dist)/this.theta);
  }
}

//...
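As a quick sanity check, the records above can be used as follows (a sketch; for these inputs the dot product is 1*2 + 2*1 + 3*0 = 4):

```chapel
//Usage sketch for the kernel records defined above
var x: [1..3] real = [1.0, 2.0, 3.0];
var y: [1..3] real = [2.0, 1.0, 0.0];
var dot = new DotProduct(real);
var gauss = new Gaussian(real, 1.0);
writeln(dot.kernel(x, y));   //4.0
writeln(gauss.kernel(x, y)); //exp(-sqrt(11)/1.0), since the squared distance is 11
```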

A word on arrays and domains

In Chapel, arrays are defined over domains, which describe the range of indices used to access the elements of the array. In the example below a 3 x 4 matrix is created using the domain notation {1..3, 1..4}:

var el: real = 1;
//Here we create an array on a domain with 64 bit floating point elements
var A: [{1..3, 1..4}] real;
//Basic for loops in chapel
for j in 1..4 {
    for i in 1..3 {
        A[i, j] = el;
        el += 1;
    }
}

//Outputs the particulars of the matrix
writeln("A.domain: ", A.domain);
writeln("A.domain(0): ", A.domain.dim(0));
writeln("A.domain(1): ", A.domain.dim(1));
writeln("A: \n", A);

Output:

A.domain: {1..3, 1..4}
A.domain(0): 1..3
A.domain(1): 1..4
A: 
1.0 4.0 7.0 10.0
2.0 5.0 8.0 11.0
3.0 6.0 9.0 12.0

Setting up basic thread-local parallelism in Chapel

Chapel defaults to a certain number of threads for your system (usually 4). To make use of more threads, or to set the thread count to the maximum number of logical cores, set the environment variable CHPL_RT_NUM_THREADS_PER_LOCALE to the string 'MAX_LOGICAL'. For example, on a Linux (Ubuntu) system, add the following to .profile:

export CHPL_RT_NUM_THREADS_PER_LOCALE='MAX_LOGICAL'
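Once this is set, a quick way to confirm the parallelism the runtime sees is to query here.maxTaskPar, which reports the maximum degree of task parallelism on the current locale (a small sketch):

```chapel
//taskcheck.chpl (illustrative sketch)
//Reports the maximum degree of task parallelism on the current locale,
//which reflects the CHPL_RT_NUM_THREADS_PER_LOCALE setting
writeln("Max tasks per locale: ", here.maxTaskPar);
```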

The function for the kernel matrix calculation is given below. Notice that it does not have an explicit return type, since this can be inferred automatically by the Chapel compiler. The function also uses a forall loop, which distributes the loop iterations over tasks (threads) to take advantage of the thread count specified previously:

proc calculateKernelMatrix(K /* auto inferred parameter */, data: [?D] ?T)
{
  var n = D.dim(0).last;
  //Creates an n x n domain 
  var E: domain(2) = {D.dim(0), D.dim(0)};
  //Uses the domain to create the output array
  var mat: [E] T;
  
  //forall loops parallelise the loop over tasks (threads)
  forall j in D.dim(0) {
    for i in j..n {
      mat[i, j] = K.kernel(data[i, ..], data[j, ..]);
      mat[j, i] = mat[i, j];
    }
  }
  return mat;
}

Benchmark for “naive” implementation

A script to demonstrate this kernel matrix implementation is given below:

//scriptbasic.chpl
//For random generation of a matrix
use Random;
use kernelbasic;


proc main()
{
  //var data: [1..10, 1..5] real;
  var data: [1..10_000, 1..512] real;
  //Fills the matrix with random data
  fillRandom(data);
  //Creates a dot kernel record
  var Kernel = new DotProduct(real);
  //Executes the kernel matrix calculation
  var mat = calculateKernelMatrix(Kernel, data);
}

To compile and run:

chpl scriptbasic.chpl && /usr/bin/time -v ./scriptbasic

Output:
Command being timed: "./scriptbasic"
User time (seconds): 3915.87
System time (seconds): 0.59
Percent of CPU this job got: 639%
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:12.07
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 828532
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206200
Voluntary context switches: 6967
Involuntary context switches: 51744
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

So it takes about 10 minutes and 12 seconds to run the calculation on a “vanilla” basis, but the next sections show how this can be reduced to a tiny fraction of the current elapsed time.

Compiler based performance optimization in Chapel

Chapel compiler flags can be printed using the chpl --help command. One very powerful way to improve performance is the --fast flag, which switches on many optimizations; we shall come back to this later. Let’s first look at the effect of switching on the --optimize flag:

chpl scriptbasic.chpl --optimize && /usr/bin/time -v ./scriptbasic 
Command being timed: "./scriptbasic"
User time (seconds): 689.88
System time (seconds): 0.45
Percent of CPU this job got: 650%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:46.11
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 828140
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206170
Voluntary context switches: 1360
Involuntary context switches: 10969
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

The --optimize flag speeds up the calculation about 6 times, to 1 minute and 46 seconds. Next, we can turn off the checks (bounds, numeric, and other checks) by adding the --no-checks flag:

chpl scriptbasic.chpl --no-checks --optimize && /usr/bin/time -v ./scriptbasic 
Command being timed: "./scriptbasic"
User time (seconds): 219.67
System time (seconds): 0.43
Percent of CPU this job got: 649%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.89
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 828100
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206114
Voluntary context switches: 431
Involuntary context switches: 3721
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

This speeds things up by a further factor of about 3. However, we can do much better.

Implementation based optimization

Instead of passing arrays, we can pass pointers, so the implementation of our Dot kernel becomes:

record DotProduct {
  type T;
  proc kernel(xrow: c_ptr(?T), yrow: c_ptr(T), p: int): T
  {
    var dist: T = 0: T;
    for i in 0..#p {
      dist += xrow[i] * yrow[i];
    }
    return dist;
  }
}

and the implementation of the kernel function becomes:

// This module is used for the guided iteration
use DynamicIters;
proc calculateKernelMatrix(K, data: [?D] ?T) /* : [?E] T */
{
  var n = D.dim(0).last;
  var p = D.dim(1).last;
  var E: domain(2) = {D.dim(0), D.dim(0)};
  var mat: [E] T;
  //Create all the pointers to be iterated over
  var rowPointers: [1..n] c_ptr(T) =
    forall i in 1..n do c_ptrTo(data[i, 1]);
  
  //Here we use guided iteration, which should improve load balancing
  forall j in guided(1..n by -1) {
    for i in j..n {
      mat[i, j] = K.kernel(rowPointers[i], rowPointers[j], p);
      mat[j, i] = mat[i, j];
    }
  }
  return mat;
}

The vanilla compilation performance for this implementation is:

chpl scriptperf.chpl && /usr/bin/time -v ./scriptperf
Command being timed: "./scriptperf"
User time (seconds): 173.09
System time (seconds): 0.33
Percent of CPU this job got: 1157%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.98
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 828060
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206110
Voluntary context switches: 173
Involuntary context switches: 9008
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

That’s less than half the time of the optimized naive implementation. Let’s throw in more flags and see what we get:

chpl scriptperf.chpl --no-checks --optimize --vectorize --inline && /usr/bin/time -v ./scriptperf
Command being timed: "./scriptperf"
User time (seconds): 46.63
System time (seconds): 0.35
Percent of CPU this job got: 1129%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.16
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 827544
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206021
Voluntary context switches: 54
Involuntary context switches: 2597
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

That’s pretty good, nearly four times faster. There is one more thing we can do: use fast math, which speeds up calculations by removing certain checks (e.g. for NaN) and by approximating certain numeric operations, so it may introduce numerical errors. In some cases, though, it is perfectly reasonable to use, and we can enable it with the --no-ieee-float flag:

chpl scriptperf.chpl --no-checks --optimize --vectorize --no-ieee-float --inline && /usr/bin/time -v ./scriptperf
Command being timed: "./scriptperf"
User time (seconds): 35.59
System time (seconds): 0.48
Percent of CPU this job got: 1126%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.20
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 827740
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206026
Voluntary context switches: 55
Involuntary context switches: 2032
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

which shaves off about a second. What about the --fast flag? It turns out that, with this implementation, we get pretty much all of the optimizations up to this point by using that flag alone:

chpl scriptperf.chpl --fast && /usr/bin/time -v ./scriptperf
Command being timed: "./scriptperf"
User time (seconds): 42.83
System time (seconds): 0.52
Percent of CPU this job got: 1146%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.78
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 827772
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206028
Voluntary context switches: 38
Involuntary context switches: 2340
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

and if we just add fast math, we get:

chpl scriptperf.chpl --no-ieee-float --fast && /usr/bin/time -v ./scriptperf
Command being timed: "./scriptperf"
User time (seconds): 32.62
System time (seconds): 0.35
Percent of CPU this job got: 1149%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.86
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 827564
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206026
Voluntary context switches: 28
Involuntary context switches: 1792
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Lastly, you may have noticed that a guided forall loop was used; if we drop this and use a regular forall, we get:

chpl scriptperf.chpl --fast --no-checks --optimize --vectorize --no-ieee-float --inline && /usr/bin/time -v ./scriptperf
Command being timed: "./scriptperf"
User time (seconds): 30.77
System time (seconds): 0.39
Percent of CPU this job got: 698%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 827440
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 206025
Voluntary context switches: 82
Involuntary context switches: 125
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

which is slower.
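For reference, the regular (non-guided) version of the loop referred to above simply iterates the row range directly, with the same body as before (a sketch, assuming mat, rowPointers, K, n, and p are in scope as in calculateKernelMatrix):

```chapel
//Plain forall: the default iterator partitions 1..n statically over tasks
forall j in 1..n {
  for i in j..n {
    mat[i, j] = K.kernel(rowPointers[i], rowPointers[j], p);
    mat[j, i] = mat[i, j];
  }
}
```

Because row j performs n - j + 1 inner iterations, the workload is triangular and the default static partitioning leaves some tasks idle near the end; the guided iterator hands out progressively smaller chunks of work, which balances this kind of uneven load better.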

Summary

This article introduced the Chapel programming language, a statically compiled, object-oriented language for high-performance computing, by way of basic Kernel Matrix computations. It demonstrated basic use of arrays and parallelism, and showed how performance can be optimized both at the compiler interface and by modifying the implementation. Chapel is an interesting language for high-performance computing and shows some promise; it should be noted, however, that parts of the language are still under development, and the compiler and ecosystem are not as mature or as heavily used as those of other languages in the same space, such as D and Julia.

Thank you for reading!

Merry Christmas and Happy Holidays!