Axis in Python and MARGIN in R explained
I wanted to better understand the concept of axis in the Python libraries numpy
and pandas, because I often mix it up with similar concepts in R. After
trying a few things out and reading around, I think I understand both worlds
better now.
During this process, a post on Stack Overflow was particularly helpful.
Axis in Python
Consider the following code snippet:
import numpy as np
import pandas as pd
ar = np.array([[3,4,5], [4,5,6]])
df = pd.DataFrame({'A':[3,4,5], 'B':[4,5,6]})
## in numpy, if axis is `None`, the mean of the flattened array is reported
ar.mean() # 4.5
## axis=0 means that the operation acts on all *rows* in each column
ar.mean(axis=0) ## array([3.5, 4.5, 5.5])
## axis=1 means that the operation acts on all *columns* in each row
ar.mean(axis=1) ## array([4., 5.])
## in pandas, if axis is not given, the mean of the columns (axis=0) is reported
## output
##> A 4.0
##> B 5.0
##> dtype: float64
df.mean()
## axis=0 means that the operation acts on all *rows* in each column
## equivalently, one can use `df.mean(axis='rows')` or `df.mean(axis='index')`.
df.mean(axis=0)
## axis=1 means that the operation acts on all *columns* in each row
## output
##> 0 3.5
##> 1 4.5
##> 2 5.5
##> dtype: float64
df.mean(axis=1)
In the documentation of numpy, it is stated that the axis parameter specifies
the "axis or axes along which the means are computed". Unfortunately, I find
the concept of ‘along which’ particularly confusing.
The Python behaviour can be better understood with a three-dimensional array
It turns out that the concept of axis is easier to understand if we use
an example of a three-dimensional array.
>>> ar2 = np.array([[[3,4],[5,6]],[[7,8],[9,10]]])
>>> ar2
array([[[ 3, 4],
[ 5, 6]],
[[ 7, 8],
[ 9, 10]]])
>>> ar2.mean()
6.5
>>> ar2.mean(axis=0) # mean of 3 and 7, 4 and 8, 5 and 9, and 6 and 10
array([[5., 6.],
[7., 8.]])
>>> ar2.mean(axis=1) # mean of 3 and 5, 4 and 6, 7 and 9, and 8 and 10
array([[4., 5.],
[8., 9.]])
>>> ar2.mean(axis=2) # mean of 3 and 4, 5 and 6, 7 and 8, and 9 and 10
array([[3.5, 5.5],
[7.5, 9.5]])
In essence, when we run ar2.mean(axis=0), we ask numpy to go through ar2[i,
0, 0], where i takes the values from 0 to ar2.shape[0] - 1, and to calculate
the mean of the values that it sees during the iteration. Next, numpy goes
through ar2[i, 0, 1] and does the same calculation. Next, it goes through
ar2[i, 1, 0]. And finally, it goes through ar2[i, 1, 1].
The same logic applies to other values of the parameter axis. The only change
is the position of i: it is put in the axis-th position (counting from 0) of
the index list used to fetch an element of the n-dimensional array. If in
doubt, you can verify the results above with the logic that we have just
described. Sure enough, the logic also applies to arrays of higher (or lower)
dimensions.
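The logic described above can be checked with explicit loops. The following is only a sketch: mean_along_axis is a hypothetical helper written for this post, not a numpy function, and it assumes that axis is a single non-negative integer.

```python
import numpy as np

def mean_along_axis(arr, axis):
    """Hypothetical helper: reproduce arr.mean(axis=axis) with explicit loops."""
    # Shape of the result: the input shape with the axis-th entry removed.
    out_shape = arr.shape[:axis] + arr.shape[axis + 1:]
    out = np.empty(out_shape)
    # Visit every combination of the remaining indices.
    for idx in np.ndindex(*out_shape):
        # Put the running index i in the axis-th position of the index tuple.
        values = [arr[idx[:axis] + (i,) + idx[axis:]]
                  for i in range(arr.shape[axis])]
        out[idx] = sum(values) / len(values)
    return out

ar2 = np.array([[[3, 4], [5, 6]], [[7, 8], [9, 10]]])
for axis in range(ar2.ndim):
    # The loop version agrees with numpy for every axis.
    assert np.allclose(mean_along_axis(ar2, axis), ar2.mean(axis=axis))
```

The loop is far slower than ar2.mean(axis=...); its only purpose is to make the position of i visible.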
In summary, in numpy and pandas, the axis parameter of mean tells the library
to calculate the mean of all values that can be fetched in the form of
array[0, 0, ..., i, ..., 0], where i sits in the axis-th position and iterates
through all possible values. The process is repeated with the position of i
fixed while the indices of the other dimensions vary one after the other
(starting from the rightmost one). The result is an (n-1)-dimensional array.
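A quick way to see that one dimension disappears is to look at the shapes of the results. The array below is an arbitrary example of shape (2, 3, 4) chosen for illustration; its values do not matter.

```python
import numpy as np

arr = np.zeros((2, 3, 4))  # illustrative 3-D array, values are irrelevant

# Reducing along an axis removes that axis from the shape.
shapes = {axis: arr.mean(axis=axis).shape for axis in range(arr.ndim)}
# shapes[0] == (3, 4), shapes[1] == (2, 4), shapes[2] == (2, 3)

# keepdims=True keeps the reduced axis with length 1 instead of dropping it.
kept = arr.mean(axis=1, keepdims=True).shape  # (2, 1, 4)
```

In other words, axis names the dimension that is collapsed, and every other dimension survives in the result.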
MARGIN in R
My confusion at the beginning may come from similar operations in R with
apply, where the parameter MARGIN is a vector giving the subscripts which
the function will be applied over. Compare the results below with the ones
above.
mymat <- matrix(c(3,4,5,4,5,6), byrow=TRUE, nrow=2)
apply(mymat, 1, mean) ## identical to `rowMeans(mymat)`, reporting c(4, 5)
apply(mymat, 2, mean) ## identical to `colMeans(mymat)`, c(3.5, 4.5, 5.5)
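As you see, MARGIN=1, the first dimension in R, gives the row means, whereas
axis=0, the first dimension in Python, gives the column means. The behaviour
of setting MARGIN to 1 and 2 is therefore the opposite of that of setting axis
to 0 and 1 in Python.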
Apply apply to a three-dimensional array in R
Let us give it a try.
> (d3array <- array(3:10, c(2,2,2)))
, , 1
[,1] [,2]
[1,] 3 5
[2,] 4 6
, , 2
[,1] [,2]
[1,] 7 9
[2,] 8 10
> d3array[1,,] # this may help us understand the first result better
[,1] [,2]
[1,] 3 7
[2,] 5 9
> mean(d3array[1,,])
[1] 6
> apply(d3array, 1, mean)
[1] 6 7
> apply(d3array, 2, mean)
[1] 5.5 7.5
> apply(d3array, 3, mean)
[1] 4.5 8.5
It turns out the logic can be understood easily. apply(d3array, 1, mean) will
calculate the mean values of d3array[i,,] where i takes all possible values,
and return the results in a vector. Similarly, apply(d3array, 2, mean) will
calculate the mean values of d3array[,i,], etc.
In summary, in R, the MARGIN parameter lets the apply function calculate the
mean of all values that can be fetched in the form of array[, ... , i, ... ,]
where i iterates through all possible values. The process is not repeated when
all i values have been iterated. The result is therefore a simple vector.
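To compare the two conventions side by side, the behaviour of apply with a single MARGIN can be imitated in numpy by collapsing every axis except the one that MARGIN names. This is only a sketch: apply_mean is a hypothetical helper, and the array is filled with order="F" on the assumption that we want the same column-major layout as R's array(3:10, c(2,2,2)).

```python
import numpy as np

def apply_mean(arr, margin):
    """Hypothetical helper imitating R's apply(arr, MARGIN, mean) for one MARGIN."""
    keep = margin - 1  # R counts dimensions from 1, numpy from 0
    # Collapse every axis except the one that MARGIN keeps.
    collapse = tuple(a for a in range(arr.ndim) if a != keep)
    return arr.mean(axis=collapse)

# order="F" fills the array column-major, matching the R array above.
d3array = np.arange(3, 11).reshape((2, 2, 2), order="F")

apply_mean(d3array, 1)  # array([6., 7.])
apply_mean(d3array, 2)  # array([5.5, 7.5])
apply_mean(d3array, 3)  # array([4.5, 8.5])
```

Seen this way, MARGIN lists the dimension to keep, while axis lists the dimension (or dimensions) to collapse.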
Conclusions
While I can understand the logic of either convention, I find it easy to mix up
the two. I am not sure whether I am the only one who easily mixes up axis in
Python and MARGIN in R. Therefore, I document the differences here, with the
hope that at least I can remind myself when I am confused again.
In pandas, one can use axis="rows" or axis="index" to calculate mean values
of each column, equal to colMeans in R. We say that we get mean along rows or
mean along index in Python, and mean of columns in R.
Alternatively, one uses axis="columns" to calculate mean values of each row,
equal to rowMeans in R. We say that we get mean along columns in Python, and
mean of rows in R.
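As a small illustration of the named axes, here is the data frame from the beginning of this post again. The variable names col_means and row_means are only chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4, 5], "B": [4, 5, 6]})

# Mean along the index: one value per column, like colMeans in R.
col_means = df.mean(axis="index")    # same as axis=0

# Mean along the columns: one value per row, like rowMeans in R.
row_means = df.mean(axis="columns")  # same as axis=1
```

The named form spells out which dimension is collapsed, which I find easier to remember than 0 and 1.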
I thank Iakov Davydov for pointing out the advantage of using rows and columns.