I wanted to better understand the concept of axis in the Python libraries numpy and pandas, because I often mix it up with similar concepts in R. After trying a few things out and reading around, I think I understand both worlds better now.

During this process, a post on Stack Overflow was particularly helpful.

Axis in Python

Consider the following code snippet:

import numpy as np
import pandas as pd

ar = np.array([[3,4,5], [4,5,6]])
df = pd.DataFrame({'A':[3,4,5], 'B':[4,5,6]})

## in numpy, if axis is `None`, the mean of the flattened array is reported
ar.mean() # 4.5
## axis=0 means that the operation acts on all *rows* in each column
ar.mean(axis=0) ## array([3.5, 4.5, 5.5])
## axis=1 means that the operation acts on all *columns* in each row
ar.mean(axis=1) ## array([4., 5.])

## in pandas, if axis is not given, the mean of the columns (axis=0) is reported
## output
##> A    4.0
##> B    5.0
##> dtype: float64
df.mean()
## axis=0 means that the operation acts on all *rows* in each column
## equivalently, one can use `df.mean(axis='rows')` or `df.mean(axis='index')`.
df.mean(axis=0)
## axis=1 means that the operation acts on all *columns* in each row
## output
##> 0    3.5
##> 1    4.5
##> 2    5.5
##> dtype: float64
df.mean(axis=1)

In the numpy documentation, the axis parameter is described as the "Axis or axes along which the means are computed". Unfortunately, I find the phrase "along which" particularly confusing.
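One reading of "along which" that helps me (my own interpretation, not wording from the numpy documentation) is that the axis you pass is the one that disappears from the shape of the result. A minimal sketch with the same array `ar`:

```python
import numpy as np

ar = np.array([[3, 4, 5], [4, 5, 6]])  # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the first dimension (length 2): one value per column remains
col_means = ar.mean(axis=0)
# axis=1 collapses the second dimension (length 3): one value per row remains
row_means = ar.mean(axis=1)

print(ar.shape, col_means.shape, row_means.shape)  # (2, 3) (3,) (2,)
```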

The Python behaviour can be better understood with a three-dimensional array

It turns out that the concept of axis is easier to understand if we use an example of a three-dimensional array.

>>> ar2 = np.array([[[3,4],[5,6]],[[7,8],[9,10]]])
>>> ar2
array([[[ 3,  4],
        [ 5,  6]],

       [[ 7,  8],
        [ 9, 10]]])
>>> ar2.mean()
6.5
>>> ar2.mean(axis=0) # mean of 3 and 7, 4 and 8, 5 and 9, and 6 and 10
array([[5., 6.],
       [7., 8.]])
>>> ar2.mean(axis=1) # mean of 3 and 5, 4 and 6, 7 and 9, and 8 and 10
array([[4., 5.],
       [8., 9.]])
>>> ar2.mean(axis=2) # mean of 3 and 4, 5 and 6, 7 and 8, and 9 and 10
array([[3.5, 5.5],
       [7.5, 9.5]])

In essence, when we run ar2.mean(axis=0), we ask numpy to go through ar2[i, 0, 0], where i runs from 0 to ar2.shape[0] - 1, and to calculate the mean of the values it sees during the iteration. Next, numpy goes through ar2[i, 0, 1] and does the same calculation. Then it goes through ar2[i, 1, 0], and finally through ar2[i, 1, 1].
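The iteration described above can be written out by hand. This is only an illustrative sketch of the logic, not how numpy computes the mean internally; the names `manual`, `j`, and `k` are mine.

```python
import numpy as np

ar2 = np.array([[[3, 4], [5, 6]], [[7, 8], [9, 10]]])

# Hand-rolled version of ar2.mean(axis=0): i runs over the first index,
# while the other two indices (j, k) stay fixed for each output cell.
manual = np.empty(ar2.shape[1:])
for j in range(ar2.shape[1]):
    for k in range(ar2.shape[2]):
        values = [ar2[i, j, k] for i in range(ar2.shape[0])]
        manual[j, k] = sum(values) / len(values)

print(manual)
print(np.allclose(manual, ar2.mean(axis=0)))
```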

The same logic applies to other values of the axis parameter. The only change is the position of i: it is put in the axis-th position of the index used to fetch an element from the n-dimensional array. If you doubt this, you can verify the results above with the logic just described. The logic also applies to arrays of higher (or lower) dimensions.
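To check this for every axis at once, here is a small sketch of a hypothetical helper, `mean_along_axis` (my own name, not a numpy function), that places the running index i in the axis-th position. It assumes a non-negative `axis` and is meant for verification only.

```python
import numpy as np

def mean_along_axis(arr, axis):
    """Hypothetical helper: mean over `axis`, written with explicit indexing."""
    # The output has every dimension of the input except the axis-th one
    out_shape = arr.shape[:axis] + arr.shape[axis + 1:]
    out = np.empty(out_shape)
    for idx in np.ndindex(*out_shape):
        # Put the running index i in the axis-th position of the full index
        values = [arr[idx[:axis] + (i,) + idx[axis:]]
                  for i in range(arr.shape[axis])]
        out[idx] = sum(values) / len(values)
    return out

ar2 = np.array([[[3, 4], [5, 6]], [[7, 8], [9, 10]]])
for axis in range(ar2.ndim):
    # Each hand-computed result should agree with numpy's own mean
    print(axis, np.allclose(mean_along_axis(ar2, axis), ar2.mean(axis=axis)))
```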

In summary, in numpy and pandas, the axis parameter of mean tells the library to calculate the mean of all values that can be fetched in the form array[0, 0, ..., i, ..., 0], where i iterates through all possible values. The process is repeated with the position of i fixed while the indices of the other dimensions vary one after another (starting from the rightmost one). For an n-dimensional array, the result is an (n-1)-dimensional array.

MARGIN in R

My confusion at the beginning may come from similar operations in R with apply, where the parameter MARGIN is a vector giving the subscripts which the function will be applied over. Compare the results below with the ones above.

mymat <- matrix(c(3,4,5,4,5,6), byrow=TRUE, nrow=2)
apply(mymat, 1, mean) ## identical to `rowMeans(mymat)`, reporting c(4, 5)
apply(mymat, 2, mean) ## identical to `colMeans(mymat)`, reporting c(3.5, 4.5, 5.5)

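As you see, setting MARGIN to the first or second dimension has the opposite effect of setting axis to that dimension in Python: MARGIN=1 (the first dimension in R) yields the mean of each row, whereas axis=0 (the first dimension in Python) yields the mean of each column.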

Apply apply to a three-dimensional array in R

Let us give it a try.

> (d3array <- array(3:10, c(2,2,2)))
, , 1

     [,1] [,2]
[1,]    3    5
[2,]    4    6

, , 2

     [,1] [,2]
[1,]    7    9
[2,]    8   10
> d3array[1,,] # this may help us understand the first result better
     [,1] [,2]
[1,]    3    7
[2,]    5    9
> mean(d3array[1,,])
[1] 6
> apply(d3array, 1, mean)
[1] 6 7
> apply(d3array, 2, mean)
[1] 5.5 7.5
> apply(d3array, 3, mean)
[1] 4.5 8.5

It turns out the logic can be understood easily. apply(d3array, 1, mean) will calculate the mean values of d3array[i,,] where i takes all possible values, and return the results in a vector. Similarly, apply(d3array, 2, mean) will calculate the mean values of d3array[,i,], etc.

In summary, in R, the MARGIN parameter lets the apply function calculate the mean of all values that can be fetched in the form array[, ... , i, ... ,], where i iterates through all possible values. Once i has taken every value, the process is not repeated for the other dimensions. The result is therefore a simple vector.
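For comparison, the R behaviour can be imitated in numpy by averaging over every axis except the one named by MARGIN. The sketch below uses a hypothetical helper `apply_mean` (my own name) and handles only a single, 1-based MARGIN. It relies on `order="F"` to mimic R's column-major filling of `array(3:10, c(2,2,2))`.

```python
import numpy as np

def apply_mean(arr, margin):
    """Hypothetical helper mimicking R's apply(arr, MARGIN, mean) for one MARGIN.

    `margin` is 1-based, as in R; every other dimension is averaged away.
    """
    keep = margin - 1  # convert R's 1-based MARGIN to a 0-based numpy axis
    other_axes = tuple(k for k in range(arr.ndim) if k != keep)
    return arr.mean(axis=other_axes)

# R fills arrays column-major, so order="F" reproduces array(3:10, c(2,2,2))
d3array = np.arange(3, 11).reshape((2, 2, 2), order="F")

print(apply_mean(d3array, 1))  # [6. 7.]
print(apply_mean(d3array, 2))  # [5.5 7.5]
print(apply_mean(d3array, 3))  # [4.5 8.5]
```

In other words, MARGIN names the dimension that is kept, while axis names the dimension that is collapsed.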

Conclusions

While I can understand the logic of either convention, I find it easy to mix the two up. I am not sure whether I am the only one who confuses axis in Python with MARGIN in R. Therefore, I document the differences here, in the hope that I can at least remind myself when I am confused again.

In pandas, one can use axis="rows" or axis="index" to calculate the mean of each column, equivalent to colMeans in R. We say that we get the mean along rows or along the index in Python, and the mean of columns in R.

Alternatively, one uses axis="columns" to calculate the mean of each row, equivalent to rowMeans in R. We say that we get the mean along columns in Python, and the mean of rows in R.
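As a quick reminder for my future self, here is a short sketch of the named-axis forms on the same data frame as above. The variable names `col_means` and `row_means` are mine.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 4, 5], 'B': [4, 5, 6]})

# axis="index" is the same as axis=0: one mean per column, like colMeans in R
col_means = df.mean(axis="index")
# axis="columns" is the same as axis=1: one mean per row, like rowMeans in R
row_means = df.mean(axis="columns")

print(col_means.equals(df.mean(axis=0)))
print(row_means.equals(df.mean(axis=1)))
```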

I thank Iakov Davydov for pointing out the advantage of using rows and columns.