Here are the things I discovered this week that fascinates me.

Piano piece of the week: Musette

I am learning piano playing since Feburary 2019. Here is a recording this week of the Musette from Kavierbüchlein für Anna Magdalena Bach.

JDZhang · Musette aus Kavierbüchlein für Anna Magdalena Bach

Computer science: Mapping three decades of intellectual change in academia

The paper by Daniel Ramage et al. was discovered from WeChat ID 科研圈 and the paper appears on The authors used a statistical model to analyse dissertation abstracts to uncover the hidden relationships between academic disciplines and how they have changed over the last three decades.

An interesting question that the authors tried to address is how one scientific field affects another, for instance computer science on genetics. To quantify this, the authors developed a measure of language incorporation: which words in each dissertation can be attributed to which source areas in academia?

About methodology: the authors associated the words in each dissertation abstract with their most similar area by using Partially Labelled Dirichlet Allocation (PLDA), a Bayesian statistical text mining approach developed and published by the authors in 2011 based on earlier statistical topic models.

An interesting finding of the authors (though not necessarily surprising): among the subjects, the highest exporting areas tend to be methodological or concern abstract reasoning: mathematics, philosophy, computer science, and statistics. Where as mathematics and philosophy present a fundamental form of knowledge with direct applicability in other fields, the authors found that computer science and statistics are coming to inhabit a similarly central role. The authors explained their observation by that contemporary science and engineering become ever more data driven, and therefore fields that enable the collection, processing, and interpretation of large datasets are becoming more important.

Another interesting finding is that biological sciences still morrow more language from other areas compared with other areas. And the split of biology is an interesting observation. According to the authors: ‘the majority of the field of biology is now linked to reductionist approaches and applications in medicine, while ecology and evolutionary biology has nearly split off into … environmental studies … in the earth and agricultural sciences.”

The authors used the example of computational biology to illustrate their points about new fields and asymmetric flow of language between fields. I am very fortunate to work in this new interdisciplinary area. On the other hand, I ask myself whether it is possible, and if so, how to bring languages in biology into other areas. Another question I have is how scientists experience the flow between languages and methods between subjects, whether the experience is consistent with findings from dissertation abstracts, and what are the implications for scientific communication, including communication within the field, communication with decision makers, and public communication. Last not least, the method of PLDA, a supervised generative model, can be interesting for tagged data.

Drug discovery: the discovery of Sofosbuvir

I read an interesting story about the discovery of Sofosbuvir, the ‘HCV cure’ molecule by Gilead. The article was written by 嘤其鸣 and appeared first on Interesting points include that

  • There are six main genotypes of HCV. Sofosbuvir targets the NS5B RNA-dependent RNA polymerase (RdRp) and is efficacious against all genotypes of HCV, because the Gly317-Asp318-Asp319 (GDD) triplet in HCV NS5B is conserved in all genotypes.
  • Major challanges of nucleos(t)ide analogues: toxicity due to off-target effects, competition with endogeneous nucleoside metabolism, and affinity to RdRp.
  • A derivative of uridine was chosen as the backbone. However, the first lead compound PSI-6130, despite its good inhibitory effect of viral replication, the PK profile was rather poor. The other lead compound PSI-6206 failed due to low bioavailability (25%), despite good tolerability and efficacy profile.
  • A pro-drug approach was taken, which led to the discovery of mericitabine. It has an improved PK profile, but its efficacy is inferior to other compounds and its half life is short.
  • Active metabolite of PSI-6130 was fount to be its triphosphate form, and it was further found out that the monophosphate form of PSI-6130 was good enough, because it has a long half-life and can be converted to the triphosphate form.
  • Further medchem work used the prodrug approach, improved the physico-chemical properties while keeping the efficacy based on experience and known structure-activity-relationships (SAR). Animal models (rat, dog, and non-human primates) were used to determine the safety, PK and PD profile of drug candidates.
  • The improved lead compound showed good safety and PD profile, was single-dosed per day, and well tolerated. However, it was found out that it was a racemic mixture. Further work showed that that the S-form was active (EC_90=0.42uM), which is sofosbuvir.

Like the story of vemurafenib, I awe at the many turn-arounds and coming-back-and-forth between molecular, cellular, and organ-and-system level work.




Glia-to-neuron conversion by single-gene knockout

Haibo Zhou et al. reported on Cell that down-regulation of a single RNA-binding protein, polypyrimidine tract-binding protein 1 (Ptbp1) resulted in the conversion of Müller glia cells into retinal ganglion cells. The approach also induced neurons with dopaminergic features in the striatum, and alleviated motor defects in a Parkinson’s disease mouse model.

I am impressed by the study, not only because it is a lot of hard work, but also because I am impressed by the cell-type conversion induced by knockdown of a single gene. I am not pretty sure whether the delivery of the CRISPR system described in the paper (CasRx) is feasible in human. However, it seems that Ptbp1 (PTBP1 in human) or its downstream effector genes can be interesting drug targets.

A knock-in screening of primary human T cell fitness

Roth et al. reported a knock-in screening of primary human T cell fitness. They combined barcoding, single-cell sequencing, and functional assays to identify cell abundance and cell state ex vivo and in vivo.

A main biological finding is a novel transforming growth factor beta (TGF-beta) chimeric receptor that improved solid tumor clearance.

It seems to be an interesting approach to identify knock-in elements for cell therapies.

Eight types of nCov-19 vaccines in development

An article by Ewen Callaway, published on Nature, illustrates eight major types of nCOV-19 vaccines: weakened virus, inactivated virus, replicating viral vector, non-replicating viral vector, DNA-based vaccine, RNA-based vaccine, protein subunit, virus-like particles. Protein subunit is worked on by most teams.

Computational biology

Scirpy: A Scanpy extension for analysing single-cell T-cell receptor

sequencing data

This preprint on bioRxiv comes from Gregor Sturm, a former master student with us, now a PhD candidate in the group of Francesca Finotello at the Medical University of Innsbruck, Austria, and his colleagues.

Scanpy is one of the most popular libraries to analyse single-cell sequencing data. Scirpy is built on the Scanpy platform and allows the analysis of visualization of immune repertoires from single cells.

I will try to learn the advantages of Scirpy over established tools such as pyVDJ. Until then, a big thumb up to Gregor and the team for making and sharing the tool! The source code and documentation are available at

Learning causal networks using inducible transcription factors and transcriptome-wide series

The paper by Hackett et al. on Molecular Systems Biology, together with a News and Views article by C. Jackson and Gresham, is an interesting read for people interested in gene expression regulation. The authors, mostly from Calico (the biotech company focusing on lifespan, backed up by Alphabet/Google) and Google, report a large dataset of hundreds of inducible transcription factors (TFs) and time-series data in budding yeast.

With the data, the authors built a whole-cell transcriptional model to predict and validate new transcriptional regulators. I agree with the authors that the data can be of general use to understand temporal transcriptional regulation patterns, and to elucidate relationships between TFs and target genes.

A clear drawback is that the authors used microarray, which may fail to capture lowly expressed genes and their dynamics. And though the authors claim to establish causal relationships between TFs and their targets by using inducible TFs and measuring their targets, I am not sure yet whether it is possible to reverse engineer TFs.

What I found interesting: the authors used only two types of models to fit the temporal expression data: either a sigmoidal or a impulse-like response (double sigmoidal), by claiming that time-courses with significant signal across the dataset only follows this two types of models. Computationally, they fit a Bayesian version of the Chechkik & Koller (CK) kinetic model to each time-course. The software code is available on GitHub calico/impulse.

Despite the drawback, the dataset can be very interesting for anyone interested in gene expression regulation. The data is available here:

And I find it may be interesting to read this paper together with another recent paper on mathematical modelling of gene expression in eukaryotic cells by Cao and Grima on PNAS: Cao, Z., & Grima, R. (2020). Analytical distributions for detailed models of stochastic gene expression in eukaryotic cells. Proceedings of the National Academy of Sciences, 117(9), 4682–4692.

Other interesting papers

The Science X newsletter

I found the Science X Newsletter by an interesting daily newsletter of scientific news. Many articles apparently come from University PR offices, whose objectivity can be argued. It seems that the impact statements often only come from the authors of the covered publications, but not from other experts in the field. Nevertheless, it has a wide range of coverage of topics, and can be interesting for a quick read to know what is going on.

The Science X Newsletter confirms my ‘pure hypothesis’: the pure a subject is (say physics), the more likely that its researchers will be informed about development in the less pure subjects (biology, chemistry, etc.). The term of ‘purity’ comes from the xkcd comic here

I guess it is, at least partially, because the pure subjects find more applications in less pure subjects. Researchers in the pure subject therefore has more opportunity to know development of less pure subjects (or application of their knowledge and skill sets there).

Does it also reflect the ability of knowledge and skill-set transfer between subjects? The article above about the efflux and influx of language may be also interpreted from this angle.

Other gems