Introduction

The Wilcoxon-Mann-Whitney rank sum test is one of the most commonly used algorithms for non-parametric tests. The current implementations in R however are not optimal for multiple hypothesis testing. The BioQC package implements the algorithm in a native C routine to allow fast computation in scenarios where a numeric vector (e.g. expression values of all genes) is compared against a collection of subsets of the vector (e.g. expression values of genes belonging to a collection of gene sets). We demonstrate the use of the package with the BioQC algorithm, which uses the Wilcoxon-Mann-Whitney rank sum test and a collection of tissue-preferentially gene signatures to detect tissue heterogeneity in high-throughput gene expression studies.

In this vignette we demonstrate the use of BioQC by a simulated expression dataset, and compare the performance of Wilcoxon-Mann-Whitney rank sum test algorithm implemented in BioQC to other implementations available in R. To use BioQC, the users only need to provide an expression dataset, in the form of a numeric matrix, or an ExpressionSet object. The BioQC package provides tissue-specific genes that can be used directly with the algorithm. The output is one score for each tissue type and each sample. The ranks of the score within each sample can be compared with prior knowledge about the sample to infer tissue heterogeneity. The hypotheses generated by BioQC should be further tested in follow-up experiments.

A dummy example

We demonstrate the basic use of the package with a dummy example. First, we load BioQC library and the tissue signatures into the R session.

library(Biobase)
library(BioQC)
gmtFile <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt", package="BioQC")
gmt <- readGmt(gmtFile)

Next, we synthesize an ExpressionSet object, using randomly generated data.

Nrow <- 2000L
Nsample <- 5L
gss <- unique(unlist(sapply(gmt, function(x) x$genes)))
myEset <- new("ExpressionSet",
              exprs=matrix(rnorm(Nrow*Nsample), nrow=Nrow),
              featureData=new("AnnotatedDataFrame",
                              data.frame(GeneSymbol=sample(gss, Nrow))))

Finally we run the BioQC algorithm and print the summary of the results. As expected, no single tissue scored significantly after multiple correction.

dummyRes <- wmwTest(myEset, gmt, valType="p.greater", simplify=TRUE)
summary(p.adjust(dummyRes, "BH"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9215  0.9347  1.0000  0.9743  1.0000  1.0000

Using basic data structures

The dummy example above shows how to run BioQC algorithm with an ExpressionSet and a list read from a GMT file (a file format capturing gene sets). Users can also use more basic data structures (e.g. matrices and list of integer indexes) to run the algorithm as shown by the following example. Other data structures will be coerced into these basic data structures; we refer the interested user to the documentation of the wmwTest function.

myMatrix <- matrix(rnorm(Nrow*Nsample),
                   ncol=Nsample,
                   dimnames=list(NULL, LETTERS[1:Nsample]))
myList <- list(signature1=sample(1:Nrow, 100),
               signature2=sample(1:Nrow, 50),
               signature3=sample(1:Nrow, 200))
wmwTest(myMatrix, myList, valType="p.greater", simplify=TRUE)
##                    A          B          C         D         E
## signature1 0.1529726 0.39838925 0.44942433 0.0827896 0.8544985
## signature2 0.1521229 0.63237801 0.01732592 0.5765494 0.3946468
## signature3 0.8593754 0.07167512 0.16760316 0.0523685 0.3693992

Case study with real dataset

We have applied BioQC to a real gene expression profiling dataset. BioQC helped to generated hypotheses about potential contamination of rat kidney examples by pancreas tissues, which could be confirmed with qRT-PCR.

See the BioQC-kidney vignette for more details.

Benchmarking against R implementation

In the core of wmwTest, an efficient C-implementation makes it feasible to run a large number of Wilcoxon-Mann-Whitney tests on large-scale expression profiling datasets and with many signature lists. Compared to native R implementations in stats (wilcox.text) and limma (rankSumTestWithCorrelation) packages, the BioQC implementation requires less memory and avoids repetitive statistical ranking of data.

The following code, though in a small scale, demonstrates the difference between the performances of two implementations.

bm.Nrow <- 22000
bw.Nsample <- 5
bm.Ngs <- 5
bm.Ngssize <- sapply(1:bm.Ngs, function(x) sample(1:bm.Nrow/2, replace=TRUE))
ind <- lapply(1:bm.Ngs, function(i) sample(1:bm.Nrow, bm.Ngssize[i]))
exprs <- matrix(round(rnorm(bm.Nrow*bw.Nsample),4), nrow=bm.Nrow)

system.time(Cres <- wmwTest(exprs, ind, valType="p.less", simplify=TRUE))
##    user  system elapsed 
##   0.087   0.000   0.087
system.time(Rres <- apply(exprs, 2, function(x)
                          sapply(ind, function(y)
                                 wmwTestInR(x, y, valType="p.less"))))
##    user  system elapsed 
##   1.557   0.015   1.573

With 22000 genes, five samples, and five gene sets, the BioQC implementation is about 20x faster than the R implementation (dependent on individual machines and settings). Our benchmark shows that with the same number of genes, 2000 samples and 200 gene sets (similar to the total number of tissues collected in the BioQC signature list), the BioQC implementation can be about 1000x faster than the R implementation.

Technical notes

Background matters

Even using the same set of signatures, the p-value of enrichment may vary depending on the background. A toy example is shown below: assuming that in a gene expression profile with values of 20000 genes, there are half of them lower expressed than the other half, mimicking the commonly observed patterns that some genes are almost not or very low expressed; depending on whether including these lowly expressed genes or not, the p-value reported by the Wilcoxon’s test varies.

bgVec <- rnorm(20000)
bgVec[1:10000] <- bgVec[1:10000] + 2
bgVec[1:10] <- bgVec[1:10] + 1
ind <- c(1:10)

(pAllGenes <- wmwTest(bgVec, ind))
## [1] 4.2681e-06
(pHighExpGenes <- wmwTest(bgVec[1:10000], ind))
## [1] 0.0002268254

It is therefore recommended to report any filtering prior to the BioQC analysis to make the results reproducible.

Acknowledgement

We would like to thank Iakov Davydov for suggestions to improve the package.

R Session Info

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BioQC_1.21.4        Biobase_2.53.0      BiocGenerics_0.39.2
## [4] knitr_1.33         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7         edgeR_3.35.0       magrittr_2.0.1     lattice_0.20-44   
##  [5] R6_2.5.0           ragg_1.1.3         rlang_0.4.11       fastmap_1.1.0     
##  [9] stringr_1.4.0      tools_4.1.0        grid_4.1.0         xfun_0.25         
## [13] jquerylib_0.1.4    htmltools_0.5.1.1  systemfonts_1.0.2  yaml_2.2.1        
## [17] digest_0.6.27      rprojroot_2.0.2    pkgdown_1.6.1.9001 crayon_1.4.1      
## [21] textshaping_0.3.5  sass_0.4.0         fs_1.5.0           memoise_2.0.0     
## [25] cachem_1.0.5       evaluate_0.14      rmarkdown_2.10     limma_3.49.4      
## [29] stringi_1.7.3      compiler_4.1.0     bslib_0.2.5.1      desc_1.3.0        
## [33] locfit_1.5-9.4     jsonlite_1.7.2