Why counting from one matters, R & Julia vs Python & C/C++

Active Analytics Ltd: posted 13 August 2014 by Chibisi Chima-Okereke


When we are young, we are taught to count from one ... 1, 2, 3, ... and so on. In mathematics the position of items in a vector or matrix start with number one and not zero. When implementing algorithms from mathematical equations in a language like C where the position identifier starts from the integer zero things can get a little confusing. It sounds like a trivial issue but a lot of effort can be spent making sure that algorithms are correct and bugs easily occur requiring re-writing of algorithms or resulting in release of buggy software. Python and C/C++ share the same numbering system - both start from zero while R and Julia programming languages number vectors from one.


To illustrate this issue, let's take a simple example. The equation below calculates the mean squared error of the reserves from the Mack Chain Ladder paper (top of page 220).

\(\begin{equation} \hat{mse(\hat{R})} = \sum_{i=2}^{I}{\{(mse(\hat{R})) + \hat{C}_{iI}(\sum_{j=i+1}^{I-1}{\hat{{C}}_{jI})} \sum_{k=I+1-i}^{I-1}{\frac{2\hat{\sigma}_{k}^{2} \hat{f}_{k}^{2}}{\sum^{I-k}_{n=1}{C_{nk}}}}\}} \qquad \ldots (1) \end{equation}\)

Consider an implementation in Julia below:

for i = 2:I
	term1 = term2 = 0
	for j = (i + 1):I
		term1 += ciI[j]
	term1 *= ciI[i]
	for k = (I + 1 - i):(I - 1)
		term2 += (2*sigmas[k]/(f[k]^2))/ECnk[k]
	tmseR += mseR[i] + term1*term2

Each of the items ciI, sigmas (\(\sigma^2\)), f, ECnk, and mseR are vectors. The sum \(\sum^{I-k}_{n=1}{C_{nk}}\) is pre-summed over n to give ECnk and so is also a vector in index k. You can see that this implementation is a natural interpretation of the equation. The equivalent implementation in C is this:

int i, j, k;
double term1, term2;
for(i = 1; i < I; ++i)
	term1 = term2 = 0;
	for(j = i + 1; j < I; ++j)
		term1 += ciI[j];
	term1 *= ciI[i];
	for(k = I - 1 - i; k < I - 1; ++k)
		term2 += (2.*sigmas[k]/pow(fact[k], 2))/ECnk[k];
	tmseR += mseR[i] + term1*term2;

There are three potential points of confusion:

  1. The tradition in C is to use the "<" or ">" sign as the limit, you could use "<=" or ">=" though this is not seen often. You learn from books to use the former set and so when you stare at the for loop you have to "remind" yourself that less than the limit is okay - you don't want to hit the limit because you started numbering from zero and to do so will result in undefined behaviour.
  2. The presence of a decrement can also cause confusion. You get so used to ++i that when you should write --i you may forget and create a loop that doesn't end - oops!
  3. The start of a loop can cause you a headache. Even in the simple algorithm above, at the outer loop you think: We are starting the numbering of i from one not two to get the correct vector position - fine. In the first inner loop the numbering of j starts from (i + 1) just as in the paper which is fine since we have already started numbering i from one. The correct starting point for the second inner loop is not immediately obvious but if you get out a pen and paper, you'll find that the correct starting point is (I - 1 - i) quite different from the paper which sports (I + 1 - i).

With more complex algorithms this can get worse. So at Active Analytics we like counting from one like R and Julia - it saves us alot of time. The advantage with Julia is that since for loops are fast, you can be very literal with your algorithm in relation to equations which means much faster development times. In R though the literal numbering is nice, there are fewer chances to make full use of it since for loops are slow, see a previous blog article for more details.

Data Science Consulting & Software Training

Active Analytics Ltd. is a data science consultancy, and Open Source Statistical Software Training company. Please contact us for more details or to comment on the blog.

Dr. Chibisi Chima-Okereke, Training, Statistics and Data Science Consulting.