Axis in Python and MARGIN in R explained
I wanted to better understand the concept of axis in the Python libraries numpy and pandas, because I often confuse it with similar concepts in R. After trying a few things out and reading around, I think I understand both worlds better now.
During this process, a post on StackOverflow was particularly helpful.
Axis in Python
Consider the following code snippet:
import numpy as np
import pandas as pd
ar = np.array([[3,4,5], [4,5,6]])
df = pd.DataFrame({'A':[3,4,5], 'B':[4,5,6]})
## in numpy, if axis is `None`, the mean of the flattened array is reported
ar.mean() # 4.5
## axis=0 means that the operation acts on all *rows* in each column
ar.mean(axis=0) ## array([3.5, 4.5, 5.5])
## axis=1 means that the operation acts on all *columns* in each row
ar.mean(axis=1) ## array([4., 5.])
## in pandas, if axis is not given, the mean of the columns (axis=0) is reported
## output
##> A 4.0
##> B 5.0
##> dtype: float64
df.mean()
## axis=0 means that the operation acts on all *rows* in each column
## equivalently, one can use `df.mean(axis='rows')` or `df.mean(axis='index')`.
df.mean(axis=0)
## axis=1 means that the operation acts on all *columns* in each row
## output
##> 0 3.5
##> 1 4.5
##> 2 5.5
##> dtype: float64
df.mean(axis=1)
In the documentation of numpy, it is stated that the axis parameter specifies the "Axis or axes along which the means are computed". Unfortunately, I find the concept of ‘along which’ particularly confusing.
The Python behaviour can be better understood with a three-dimensional array
It turns out that the concept of axis is easier to understand if we use an example of a three-dimensional array.
>>> ar2 = np.array([[[3,4],[5,6]],[[7,8],[9,10]]])
>>> ar2
array([[[ 3,  4],
        [ 5,  6]],

       [[ 7,  8],
        [ 9, 10]]])
>>> ar2.mean()
6.5
>>> ar2.mean(axis=0) # mean of 3 and 7, 4 and 8, 5 and 9, and 6 and 10
array([[5., 6.],
       [7., 8.]])
>>> ar2.mean(axis=1) # mean of 3 and 5, 4 and 6, 7 and 9, and 8 and 10
array([[4., 5.],
       [8., 9.]])
>>> ar2.mean(axis=2) # mean of 3 and 4, 5 and 6, 7 and 8, and 9 and 10
array([[3.5, 5.5],
       [7.5, 9.5]])
In essence, when we run ar2.mean(axis=0), we ask numpy to go through ar2[i, 0, 0], where i takes every value from 0 to ar2.shape[0] - 1, and to calculate the mean of the values that it sees during the iteration. Next, numpy goes through ar2[i, 0, 1] and does the same calculation. Next, it goes through ar2[i, 1, 0]. And finally, it goes through ar2[i, 1, 1].
The same logic applies to other values of the parameter axis. The only change is the position of i: it is put in the position given by axis (counting from 0) in the index list used to fetch an element of the n-dimensional array. If you have doubts about that, you can verify the results above with the logic that we have just described. Sure enough, the logic also applies to arrays of higher (or lower) dimensions.
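The indexing logic described above can be written out as explicit loops. The following is only a sketch of that mental model, not how numpy computes means internally; the helper name mean_along_axis is made up for this illustration.

```python
import numpy as np

def mean_along_axis(arr, axis):
    """Hypothetical helper: reproduce arr.mean(axis=axis) with explicit indexing."""
    # The result has the original shape with the entry at position `axis` removed.
    out_shape = arr.shape[:axis] + arr.shape[axis + 1:]
    out = np.empty(out_shape, dtype=float)
    # Visit every combination of the *other* indices ...
    for rest in np.ndindex(*out_shape):
        # ... and let i run over the position given by `axis` only.
        values = [arr[rest[:axis] + (i,) + rest[axis:]]
                  for i in range(arr.shape[axis])]
        out[rest] = sum(values) / len(values)
    return out

ar2 = np.array([[[3, 4], [5, 6]], [[7, 8], [9, 10]]])
# The loop version agrees with numpy for every axis of the example array.
agrees = [bool(np.allclose(mean_along_axis(ar2, k), ar2.mean(axis=k)))
          for k in range(ar2.ndim)]
```

Here agrees holds one boolean per axis, so it is a quick way to check the description against numpy itself.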
In summary, in numpy and pandas, the axis parameter of mean tells the library to calculate the mean of all values that can be fetched in the form of array[0, 0, ..., i, ..., 0], where i iterates through all possible values. The process is repeated with the position of i fixed while the indices of the other dimensions vary one after the other (starting from the right-most one). The result is an (n-1)-dimensional array.
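One way to see that the given axis is the one that disappears is to look at shapes. The array below is a made-up example with three different axis lengths, chosen only so that the removed dimension is easy to spot.

```python
import numpy as np

# Hypothetical example array whose three axes have lengths 2, 3 and 4.
arr = np.arange(24).reshape(2, 3, 4)

# Taking the mean over an axis drops that axis from the shape.
shapes = {axis: arr.mean(axis=axis).shape for axis in range(arr.ndim)}
# axis 0 -> (3, 4), axis 1 -> (2, 4), axis 2 -> (2, 3)

# keepdims=True keeps the collapsed axis with length 1 instead of dropping it.
kept = arr.mean(axis=1, keepdims=True).shape  # (2, 1, 4)
```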
MARGIN in R
My confusion at the beginning may come from similar operations in R with apply, where the parameter MARGIN is "a vector giving the subscripts which the function will be applied over". Compare the results below with the ones above.
mymat <- matrix(c(3,4,5,4,5,6), byrow=TRUE, nrow=2)
apply(mymat, 1, mean) ## identical to `rowMeans(mymat)`, reporting c(4, 5)
apply(mymat, 2, mean) ## identical to `colMeans(mymat)`, reporting c(3.5, 4.5, 5.5)
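As you can see, setting MARGIN to 1 and 2 behaves in the opposite way to Python: MARGIN=1, the first dimension, yields the row means, whereas axis=0, also the first dimension, yields the column means.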
Applying apply to a three-dimensional array in R
Let us give it a try.
> (d3array <- array(3:10, c(2,2,2)))
, , 1

     [,1] [,2]
[1,]    3    5
[2,]    4    6

, , 2

     [,1] [,2]
[1,]    7    9
[2,]    8   10

> d3array[1,,] # this may help us understand the first result better
     [,1] [,2]
[1,]    3    7
[2,]    5    9
> mean(d3array[1,,])
[1] 6
> apply(d3array, 1, mean)
[1] 6 7
> apply(d3array, 2, mean)
[1] 5.5 7.5
> apply(d3array, 3, mean)
[1] 4.5 8.5
It turns out the logic can be understood easily. apply(d3array, 1, mean) will calculate the mean values of d3array[i,,], where i takes all possible values, and return the results in a vector. Similarly, apply(d3array, 2, mean) will calculate the mean values of d3array[,i,], etc.
In summary, in R, the MARGIN parameter lets the apply function calculate the mean of all values that can be fetched in the form of array[, ... , i, ... ,], one mean for each possible value of i. Unlike in numpy, the other indices are not held fixed and the process is not repeated for them: each slice is reduced to a single number. With a single MARGIN, the result is therefore a simple vector.
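To make the contrast concrete, R's behaviour can be imitated in numpy by collapsing every axis except the one named by MARGIN. This is only a sketch: the helper apply_mean is hypothetical and handles a single, 1-based MARGIN. Since R fills arrays in column-major order, the example array is built with order="F".

```python
import numpy as np

def apply_mean(arr, margin):
    """Hypothetical helper mimicking R's apply(arr, MARGIN, mean) for one 1-based MARGIN."""
    keep = margin - 1  # R counts dimensions from 1, numpy from 0
    # MARGIN names the dimension to keep, so every other axis is collapsed.
    collapse = tuple(k for k in range(arr.ndim) if k != keep)
    return arr.mean(axis=collapse)

# Column-major fill reproduces R's array(3:10, c(2,2,2)).
d3array = np.arange(3, 11).reshape((2, 2, 2), order="F")

m1 = apply_mean(d3array, 1)  # like apply(d3array, 1, mean)
m2 = apply_mean(d3array, 2)  # like apply(d3array, 2, mean)
m3 = apply_mean(d3array, 3)  # like apply(d3array, 3, mean)
```

Put differently, MARGIN names the dimension that survives, while axis names the dimension that is averaged away.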
Conclusions
While I can understand the logic of either convention, I find it easy to mix the two up. I am not sure whether I am the only one who easily mixes up axis in Python and MARGIN in R. Therefore, I document the differences here, in the hope that I can at least remind myself when I am confused again.
In pandas, one can use axis="rows" or axis="index" to calculate the mean value of each column, equivalent to colMeans in R. We say that we get the mean along rows or the mean along index in Python, and the mean of columns in R.
Alternatively, one uses axis="columns" to calculate the mean value of each row, equivalent to rowMeans in R. We say that we get the mean along columns in Python, and the mean of rows in R.
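As a small check of the named axes, here is a sketch that reuses the data frame from the first example. It uses only "index" and "columns", the two names listed in the pandas documentation.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 4, 5], 'B': [4, 5, 6]})

# Mean along the index (rows): one value per column, like colMeans in R.
col_means = df.mean(axis="index")    # same as df.mean(axis=0)

# Mean along the columns: one value per row, like rowMeans in R.
row_means = df.mean(axis="columns")  # same as df.mean(axis=1)
```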
I thank Iakov Davydov for pointing out the advantage of using rows and columns.