Discoveries Weekly No. 7 (May 25-31, 2020)

Here are the things I discovered this week that fascinate me.

Biology
Computational biology and genomics
- Nucleosome positioning and its prediction
Chemistry
- OCR software for chemical structures
- Predicting physicochemical properties of small molecules
Programming
Statistics and machine learning
Other gems

Biology

Human macrophage development at single-cell resolution

Bian et al. report in Nature their effort to characterize human macrophage development at single-cell resolution.

Macrophages are part of the innate immune system. They are also the first cells of the nascent immune system to emerge during embryonic development. In mice, embryonic macrophages infiltrate developing organs and differentiate there into tissue-resident macrophages. While this process is well characterized in mice, much less is known about how macrophages differentiate during human development.

The authors set out to better understand this process from the perspective of the gene expression profile of each individual macrophage. To this end, they isolated CD45+ haematopoietic cells, the cells that are responsible for the formation of blood cellular components (haema = blood, poiesis = to make, in Greek), from human embryos. Next, they used single-cell RNA sequencing to characterize them. In particular, they performed single-cell culture of a subset of cells, known as yolk sac-derived myeloid-biased progenitors, that express the surface markers CD45, CD34, and CD44.

Such myeloid progenitors later develop into myeloid cells, including megakaryocytes, erythrocytes, mast cells, and myeloblasts, which in turn give rise to basophils, neutrophils, eosinophils, and monocytes. Monocytes mature into macrophages. The alternative to myeloid differentiation is differentiation into lymphoid cells, including natural killer cells, T cells, and B cells. Together they constitute the innate and adaptive immune system. The following figure by A. Rad and Mikael Häggström, M.D., used under the GNU Free Documentation License, provides a high-level overview of this process.

They investigated macrophage heterogeneity across multiple anatomical sites and identified different types of tissue-resident macrophages in the central nervous system, liver, lung, and skin. Furthermore, they analysed the specification trajectories of resident macrophages. Finally, they compared embryonic resident macrophages with their adult counterparts (data for the latter were collected from published studies).

The dataset can be interesting for macrophage researchers, in particular researchers of tissue-resident macrophages such as microglia, a special type of macrophage in the brain. It can also serve as a reference dataset to benchmark macrophages derived from induced pluripotent stem cells.

The single-cell sequencing data of human yolk sacs are available on NCBI Gene Expression Omnibus under the accession number GSE137010. The data of haematopoietic cells in human embryos at different developmental stages (known as Carnegie stages) and from multiple sites (yolk sac, head, liver, blood, lung, and skin) are available under GSE133345.

Lysyl oxidase mediates drug resistance in triple-negative breast cancer

I took notes here in another blog post.

Leverage loss of heterozygosity of essential genes to find cancer targets

Human cells contain two copies of each chromosome, one from the father and the other from the mother. Genes, which reside on chromosomes, therefore usually come in two copies (except for genes encoded by the sex chromosomes X and Y in males). The copies are also called alleles. If one allele is lost for any reason, the event is called loss of heterozygosity, often abbreviated as LOH. It is frequently observed in cancer, for instance.

Hundreds of genes undergo loss of heterozygosity in cancer cells. Nichols et al. reported in Nature Communications how they used information on loss of heterozygosity of polymorphisms in essential genes to propose drug targets. Their hypothesis is that inactivating the single allele retained in tumours can selectively kill cancer cells while sparing normal cells, which still carry both alleles.
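
The logic of the hypothesis can be sketched with a toy model. This is only an illustration of the reasoning, not the authors' method: the allele labels A and B and the function survives are hypothetical names, and a cell is assumed to need at least one functional copy of the essential gene.

```python
# Toy model of allele-specific targeting after loss of heterozygosity (LOH).
# "A" and "B" are hypothetical labels for two variants of one essential gene.

def survives(alleles, targeted_allele):
    """Return True if at least one copy of the essential gene stays
    functional after every copy of the targeted allele is inactivated."""
    remaining = [a for a in alleles if a != targeted_allele]
    return len(remaining) > 0

normal_cell = ["A", "B"]  # heterozygous: both alleles are present
tumour_cell = ["A"]       # LOH: the "B" allele has been lost

for name, cell in [("normal", normal_cell), ("tumour", tumour_cell)]:
    status = "survives" if survives(cell, targeted_allele="A") else "dies"
    print(f"{name} cell {status} when allele A is inactivated")
```

The normal cell falls back on allele B, whereas the tumour cell has no second copy left.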

To test the hypothesis, they sifted through many datasets and identified more than 5000 variants in 1278 essential genes that undergo LOH in cancer, and evaluated the potential of each to be targeted using allele-specific gene editing, small interfering RNA, or small molecules. They showed that mono-allelic inactivation of two essential genes, PRIM1 (DNA primase small subunit 1, GeneID 5557) and EXOSC8 (exosome component 8, GeneID 11340), selectively inhibited cancer cell growth.

While the results remain to be validated in vivo and the toxicity profiles of such treatments need to be characterized, the idea is an interesting one, and it may inspire further research in this direction.

For those who are interested in how they arrived at the variants, here is a short summary. They mined the Exome Aggregation Consortium (ExAC) database and kept only common variants in genes for which copy number calls were available through the NCI Genomic Data Commons. They found 228,440 potentially targetable variants.

They also looked at patient-derived genome-wide copy number and LOH data from The Cancer Genome Atlas (TCGA).

The essential gene list was derived from three genome-wide loss-of-function screens of haploid human cell lines. It was filtered using CCLE gene copy-number and RNA expression data to keep only reasonably expressed genes. Other filters, for instance one based on tumour suppressor genes, were also applied.
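
The filtering steps summarized above can be sketched roughly as follows. This is a minimal sketch and not the authors' pipeline: the gene names, variant IDs, numbers, and both thresholds (min_af, min_loh) are invented for illustration, and I assume that tumour suppressor genes are excluded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    variant_id: str
    gene: str
    allele_frequency: float  # population frequency of the variant (made up)
    loh_fraction: float      # fraction of tumours with LOH at the gene (made up)

def select_candidates(variants, essential, tumour_suppressors,
                      min_af=0.05, min_loh=0.10):
    """Keep common variants in essential genes that often undergo LOH,
    skipping tumour suppressor genes (an assumption of this sketch)."""
    return [
        v for v in variants
        if v.gene in essential
        and v.gene not in tumour_suppressors
        and v.allele_frequency >= min_af
        and v.loh_fraction >= min_loh
    ]

variants = [
    Variant("var1", "GENE_A", 0.30, 0.25),   # passes every filter
    Variant("var2", "GENE_A", 0.001, 0.25),  # too rare in the population
    Variant("var3", "GENE_B", 0.40, 0.30),   # gene is not essential
    Variant("var4", "GENE_C", 0.20, 0.40),   # essential, but a tumour suppressor
    Variant("var5", "GENE_D", 0.20, 0.02),   # LOH is too infrequent
]
essential = {"GENE_A", "GENE_C", "GENE_D"}
tumour_suppressors = {"GENE_C"}

hits = select_candidates(variants, essential, tumour_suppressors)
print([v.variant_id for v in hits])
```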

Computational biology and genomics

Nucleosome positioning and its prediction

In the last few weeks I read several interesting papers and reviews about the regulation of gene expression by nucleosomes and transcription factors. In a blog post, I share my learning notes about nucleosome positioning and its prediction.

Chemistry

OCR software for chemical structures

OSRA (Optical Structure Recognition Application) is software that converts graphical representations of chemical structures, such as those in PDF and image files, into computer-readable formats such as SMILES and SD files. It is developed at the U.S. National Cancer Institute (NCI). The link at NCI points to the latest version, v1.4.0, as of May 2020. A major rewrite, OSRA II, is available at SourceForge.

I gave it a try and it seems to work well with an example that I found.

Predicting physicochemical properties of small molecules

SwissADME computes physicochemical descriptors and predicts ADME (absorption, distribution, metabolism, and excretion) parameters. They can be used as reference values to predict the pharmacokinetic properties, drug-like nature, and medicinal chemistry friendliness of small molecules.

As with any other prediction tool, the results should be taken with a grain of salt and interpreted with caution, ideally with the help of a human expert. The tool is nevertheless useful for anyone who works with drug-like molecules to get a first, rough idea of the drug-related physicochemical properties of his or her favourite molecule.

Another web-based tool for similar purposes is Molinspiration. It was used in the Medicinal Chemistry online course on edX that I took. The course was given by Erland Stevens, an outstanding teacher, and is offered by Davidson College and Novartis.
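
To make the notion of drug-like nature a bit more concrete, here is a small sketch of one widely used rule of thumb, Lipinski's rule of five. It is not SwissADME's implementation and not its API. The function name lipinski_violations is my own, and the descriptor values for aspirin are approximate numbers typed in by hand.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """Count violations of Lipinski's rule of five."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        log_p > 5,          # octanol-water partition coefficient above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)

# Approximate descriptor values for aspirin, entered by hand for illustration.
aspirin = dict(mol_weight=180.16, log_p=1.2, h_donors=1, h_acceptors=4)
print("Rule-of-five violations for aspirin:", lipinski_violations(**aspirin))
```

A count of zero or one is commonly read as compatible with oral drug-likeness, but like any such heuristic it is a rough filter rather than a verdict.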

Programming

Containerization with Docker

I have an application that I need to build locally and share and run remotely. For this purpose, one can use Docker, which essentially runs a program in an encapsulated environment to keep it isolated from the host and from other containers.

To achieve this, each container interacts with its own private file system. This file system is provided by a Docker image. In other words, an image includes everything an application needs: the script or the binary code, dependencies, and data files.

Though it is perfectly fine to use the Docker daemon as explained in the Docker tutorial, some issues make people wary of it. A solution to these issues is to use Podman. The reasons why using Podman makes sense are explained by William Henry in a Red Hat Developers blog post.

For laymen like me, it suffices to know that Podman can do the job of a Docker daemon. Images of Podman and Docker are compatible, and the commands of Podman are identical to those of Docker. That means we can set up a Docker image and run it on Podman.

Center aligned figure and figure caption with kramdown in Jekyll

Thanks to a blog post and a StackOverflow answer, I implemented a simple solution to align figures and figure captions horizontally in the centre.

To do this, follow these three steps:

Step 1: Create a new file named image.html in _includes:

<figure class="image">
  <img class="centre-image" src="" alt="">
  <figcaption></figcaption>
</figure>

Step 2: Add the following CSS code to assets/css/style.scss or another CSS file that your page loads.

@import
  "minima/skins/classic",
  "minima/initialize";

.centre-image {
    margin: 0 auto;
    display: block;
}

figcaption {
  display: block;
  margin: 0 auto;
  margin-bottom: 1em;
  text-align: center;
}

Step 3: Display the image from your markdown with:

<!-- solution found on
	https://stackoverflow.com/questions/19331362/using-an-image-caption-in-markdown-jekyll
-->

<figure class="image">
  <img class="centre-image" src="/assets/image/my-equation.jpg" alt="This equation changed my world.">
  <figcaption>This equation changed my world.</figcaption>
</figure>

Other tricks that I learned

- To specify the compiler, or the parameters passed to the compiler, used to build R packages, edit the file ~/.R/Makevars and set variables like CC (used for C code) and CXX (used for C++ code).
- The reason I needed to learn the trick above is that my R package rqubic in Bioconductor reported an error when gcc-10 was used with the parameter -fno-common. See explanation of this flag here. In essence, the flag causes an error when a variable is defined multiple times, for instance by a variable definition in a header file that is included in multiple source files. The solution: use extern in the header to tell C that the variable is defined somewhere else (known as a declaration), and define the variable only once in the C code.
- It is considered good typography to use curly quotes instead of straight quotes. In kramdown, a Markdown variant used by Jekyll, use “ and ” for left and right double quotes (“I shall leave now.”). Similarly, use ‘ and ’ for left and right single quotes (‘I shall leave now.’).
- In kramdown, type three dots (...) to produce an ellipsis, as in 1, 2, …. For other typographic symbols, including em-dash (—), en-dash (–), left guillemet («), and right guillemet (»), see the documentation.
- I found out that it is possible to use vim key bindings in RStudio. The option is in Menu->Options->Code->Editing->General->Keybindings. Use Ctrl-1 to move to the source pane and Ctrl-2 to move to the console. Now I can use RStudio as if I were using vim.

Statistics and machine learning

Rule-based models in R with Cubist and RuleFit

Max Kuhn published a blog post on modern rule-based models on R Views. He introduced the R packages dplyr and rpart, and highlighted the C5.0 rules, the Cubist method, and the RuleFit method.

The Cubist method is an ensemble method that works like boosting: it builds multiple models sequentially to make predictions. It also applies a nearest-neighbour adjustment after model prediction, using similar data points in the training set to update the predicted values.

One step further, while Cubist creates rules as data subsets and then estimates a linear regression model within each subset, RuleFit creates rules as predictors and then fits one, optionally generalized, linear model.
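
The difference between the two ways of using a rule can be illustrated with a toy example. It is a deliberately minimal sketch in plain Python and not what the R packages do internally: there is one hand-written rule (x < 5) instead of rules derived from trees, no boosting, no neighbour adjustment, and no regularization. All names (fit_line, rule, and so on) are my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rule(x):
    """A single hand-written rule, used here only for illustration."""
    return x < 5

# Toy data: y follows a different straight line on each side of x = 5.
xs = list(range(10))
ys = [1 + 2 * x if x < 5 else 20 - x for x in xs]

# Cubist-like use: the rule defines data subsets, one linear model per subset.
inside = [(x, y) for x, y in zip(xs, ys) if rule(x)]
outside = [(x, y) for x, y in zip(xs, ys) if not rule(x)]
model_inside = fit_line(*zip(*inside))
model_outside = fit_line(*zip(*outside))

# RuleFit-like use: the rule becomes a 0/1 predictor in one linear model.
rule_feature = [1 if rule(x) else 0 for x in xs]
model_rulefit = fit_line(rule_feature, ys)

print("Cubist-like models (intercept, slope):", model_inside, model_outside)
print("RuleFit-like model (intercept, rule coefficient):", model_rulefit)
```

On this toy data the two Cubist-like models recover the two underlying lines, while the RuleFit-like model summarizes the rule with a single coefficient, namely the shift in the mean of y when the rule applies.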

Bayesian analysis with R for drug development

I read the first three chapters of the book Bayesian Analysis with R for Drug Development by Harry Yang and Steven Novick, who work at AstraZeneca. The first three chapters introduce drug development, concepts and algorithms of Bayesian statistics, and sample size and power calculation.

I like the style of writing, which is clear, compact, and application-oriented. The authors introduce drug development, including discovery, clinical trials, and chemistry, manufacturing, and control (CMC). They guide readers through Bayes' theorem, prior setting, and different sampling methods such as rejection sampling, Gibbs sampling, and Metropolis-Hastings sampling. Using the binomial and normal distributions as examples, they show readers how the algorithms work in principle and, importantly, how day-to-day analysis is done with commonly used open-source software for Bayesian analysis, including WinBUGS/OpenBUGS, JAGS, and Stan. In the third and last chapter of the Background section of the book, they explain an important application of Bayesian statistics in drug discovery and development, namely estimating the minimal sample size of a clinical trial and calculating the statistical power of such a trial.

The rest of the book deals with case studies of Bayesian statistics in preclinical and clinical research (section II) and applications in CMC (section III). I have not dived into the details yet. But skimming through the chapters left me with the impression that the authors have done a good job of capturing the many aspects of applied Bayesian statistics in self-contained short stories, which help readers grasp the essence and advantages of Bayesian approaches quickly.
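
To give a flavour of such sampling methods, here is my own minimal sketch of Metropolis-Hastings sampling for a binomial response rate. It is not code from the book. The data (7 responders out of 20), the uniform Beta(1, 1) prior, the step size, and all function names are assumptions chosen for illustration. With a Beta prior the posterior is known in closed form, which makes it easy to check the sampler.

```python
import math
import random

def log_posterior(theta, successes, trials, a=1.0, b=1.0):
    """Unnormalized log posterior of a binomial rate under a Beta(a, b) prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return ((successes + a - 1) * math.log(theta)
            + (trials - successes + b - 1) * math.log(1.0 - theta))

def metropolis_hastings(successes, trials, n_samples=20000, step=0.1, seed=1):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, trials)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, trials)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples[n_samples // 10:]  # discard the first 10% as burn-in

successes, trials = 7, 20  # hypothetical trial data
samples = metropolis_hastings(successes, trials)
estimate = sum(samples) / len(samples)
exact = (successes + 1) / (trials + 2)  # mean of the Beta(8, 14) posterior
print(f"MCMC posterior mean: {estimate:.3f}, exact: {exact:.3f}")
```

In practice one would leave the sampling to tools such as JAGS or Stan. The point of the sketch is only to show that the accept-or-reject loop is short.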

I recommend the book to researchers who are new to Bayesian statistics and who want to apply it in drug discovery and development.

TensorFlow Probability

The GitHub repository tensorflow/probability brings probabilistic reasoning and statistical analysis to TensorFlow.

Building on the numerical operations of TensorFlow (layer 0), it implements distributions (tfp.distributions) and bijectors (transformations of random variables, tfp.bijectors) as statistical building blocks in layer 1.

In layer 2, the package provides tools for model building, including joint distributions and probabilistic layers (tfp.layers) that extend TensorFlow layers with uncertainty over the functions they represent.

In layer 3, tools for probabilistic inference are provided. They include Markov chain Monte Carlo (tfp.mcmc), which approximates integrals via sampling; variational inference (tfp.vi), which approximates integrals via optimization; as well as optimizers and tools to compute Monte Carlo expectations.
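
The ideas behind these building blocks can be mimicked in a few lines of plain Python. The sketch below does not use TensorFlow Probability and does not reflect its API. The helper names are my own, and the snippet only illustrates three concepts: a distribution we can sample from, a bijector-like transformation (here exp, which turns a standard normal variable into a log-normal one), and a Monte Carlo estimate of an expectation.

```python
import math
import random

def sample_standard_normal(rng):
    """Stand-in for a distribution object: draw one sample from N(0, 1)."""
    return rng.gauss(0.0, 1.0)

def monte_carlo_expectation(f, sampler, n=100_000, seed=0):
    """Approximate E[f(X)] by averaging f over n random draws of X."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# E[X^2] for X ~ N(0, 1) is the variance, which equals 1.
second_moment = monte_carlo_expectation(lambda x: x * x, sample_standard_normal)

# Applying exp to X gives a log-normal variable with mean exp(1/2).
lognormal_mean = monte_carlo_expectation(math.exp, sample_standard_normal)

print(f"E[X^2]    estimate {second_moment:.3f}, exact 1.000")
print(f"E[exp(X)] estimate {lognormal_mean:.3f}, exact {math.exp(0.5):.3f}")
```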

The project is under active development, and the interfaces may not be stable. However, it provides a toolkit to view probabilistic reasoning and statistical inference from a machine-learning and computer-science perspective.

Other gems

- I discovered Pure bash bible by Dylan Araps on GitHub. It seems to be an interesting collection of tips and tricks for performing routine tasks with bash.
- Amazon AWS services explained in one line each, discovered through Hacker News.
- I enjoyed reading Why weren’t we ready for the coronavirus by David Quammen in The New Yorker. The author argues that the U.S. failed to learn the lessons of past outbreaks.
- The article by Quammen was complemented by the article What the coronavirus crisis reveals about American medicine by Siddhartha Mukherjee. He dissects the strengths and weaknesses of the American medical system revealed by the pandemic. He argues that medicine is a complex web of systems and processes. At its core, it is a health-care delivery system, a research program, a set of protocols for quality control, and a forum for information sharing that allows iterative improvement.
- The new outfit for the SpaceX launch: maybe it will become the new fashion for car drivers, and not only for those who drive a Tesla.

New outfit for SpaceX Launch, expected on 30.05.2020. Copyright: Kim Shiflett/NASA via AP

Happy weekend!