Information

What do HG and NA mean in Geuvadis project RNAseq sample labels?

What do HG and NA mean in Geuvadis project RNAseq sample labels?


We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I'm looking at RNASeq Data from Geuvadis website e.g. the file GD660.GeneQuantRPKM.txt.gz.

The samples are labeled by e.g. HG00105.1.M_120209_7 or NA20812.2.M_111216_6

What do HG and NA mean? Are they ethnic backgrounds?

If not, how can I look up the ethnic mapping to the samples?

Thanks!


If you googleHG00105, among the first hits is geo accession GSM649517 with the titleHG00105/NA12878.

Channel1:

Characteristics gender: Male cell line: lymphoblast cell line HG00105 ethnicity: British from England and Scotland, UK (1000 Genomes codes: GBR)

Channel2:

gender: female cell line: lymphoblast cell line NA12878 ethnicity: Northwest European American from Utah (HapMap code: CEU)

So no, they are not coding for the ethnic background but are part of the unique identifier of the sample.


Genetic association of molecular traits: A help to identify causative variants in complex diseases

In the past 15 years, major progresses have been made in the understanding of the genetic basis of regulation of gene expression. These new insights have revolutionized our approach to resolve the genetic variation underlying complex diseases. Gene transcript levels were the first expression phenotypes that were studied. They are heritable and therefore amenable to genome-wide association studies. The genetic variants that modulate them are called expression quantitative trait loci. Their study has been extended to other molecular quantitative trait loci (molQTLs) that regulate gene expression at the various levels, from chromatin state to cellular responses. Altogether, these studies have generated a wealth of basic information on the genome-wide patterns of gene expression and their inter-individual variation. Most importantly, molQTLs have become an invaluable asset in the genetic study of complex diseases. Although the identification of the disease-causing variants on the basis of their overlap with molQTLs requires caution, molQTLs can help to prioritize the relevant candidate gene(s) in the disease-associated regions and bring a functional interpretation of the associated variants, therefore, bridging the gap between genotypes and clinical phenotypes.


SYNOPSIS

  • In scVI, datasets from different labs and technologies are integrated in a joint latent space.
  • In scANVI, cell type annotations are transferred between datasets and across different scenarios.
  • Uncertainties of differential gene expression in multiple samples are quantified.
  • The performance of scVI and scANVI in data integration and cell state annotation is superior to other related methods.

Introduction

The incidence of breast cancer is influenced by multiple factors that include age, genetics, and reproductive history. An understanding of normal tissue biology and its inherent heterogeneity is an important step toward dissecting mechanisms that lead to oncogenesis. Normal breast tissue comprises a complex epithelial ductal system embedded in a stromal matrix that is composed of fibroblasts, adipocytes, endothelial, and immune cells. In human breast, puberty-induced branching results in a complex branched ductal system in which the ducts terminate in a cluster of acini termed a terminal duct lobular unit (TDLU) (Fu et al, 2020 ). The dynamic changes occurring in the breast epithelium during puberty, pregnancy, and lactation are driven by the concerted action of systemic hormones and growth factors, among which the ovarian hormones estrogen and progesterone play a key role (Brisken & O'Malley, 2010 ). Over the lifetime of a woman, sustained exposure to ovarian steroid hormones is a well-established risk factor for breast cancer, with a clear correlation between the number of menstrual cycles and breast cancer risk (Clemons & Goss, 2001 Hankinson et al, 2004 ). Indeed, early ovarian ablation is protective against breast cancer (Parker et al, 2009a ).

Breast cancer comprises a diverse set of diseases characterized by heterogeneity that influences treatment response and patient outcome. This heterogeneity cannot be precisely defined through the classic parameters of histopathology, tumor grade, and nodal involvement. Expression profiling has proven pivotal in defining the intrinsic subtypes of breast cancer: luminal A and luminal B, triple-negative (often used interchangeably with basal-like), HER2-overexpressing, and claudin-low (Perou et al, 2000 Sorlie et al, 2001 ). These likely reflect distinct “cells of origin”, unique differentiation blockades, and different repertoires of mutations. More recent genome sequencing efforts have defined recurrent “driver” genes and copy number changes among the different breast tumor subtypes (Cancer Genome Atlas, 2012 Alexandrov et al, 2013 Nik-Zainal et al, 2016 ). The advent of single-cell technologies has enabled an understanding of cellular heterogeneity at an unprecedented level. This is particularly relevant to tumors, which exist as ecosystems composed of malignant cells interspersed with stromal and immune cells. Emerging data from single-cell genomics indicate significant tumor heterogeneity, while single-cell transcriptomic profiling of breast tumors indicates diverse immune cell populations (Chung et al, 2017 Azizi et al, 2018 Karaayvaz et al, 2018 Kim et al, 2018 Savas et al, 2018 Qian et al, 2020 ). In addition, recent evaluation of the proteomes of a large number of tumors for up to 70 proteins (Wagner et al, 2019 ) yielded insights into the immune compartments of tumors and potential cellular cross-talk. Cellular diversity among the different breast cancer subtypes, however, has not been evaluated systematically. In the context of normal breast tissue, single-cell profiling of epithelial cells has confirmed the presence of three primary epithelial populations and predicted cell trajectories (Nguyen et al, 2018 ) but the normal milieu of the ductal network awaits further investigation.

Here we sought to further probe cellular heterogeneity within normal and neoplastic breast tissue (and involved LNs) through single-cell transcriptome analysis. We posed the following questions: What is the complexity within the normal breast ductal microenvironment and does hormonal or BRCA1 mutation status influence molecular diversity? What is the degree of heterogeneity within the cancer cell compartment and its microenvironment across tumor subtypes? What is the relationship between primary breast tumors and malignant cells that seed lymph nodes? Single-cell profiling was performed on tissue specimens from normal or preneoplastic BRCA1 +/− tissue (28 specimens), and tumors (34 specimens) representing estrogen receptor (ER) + , HER2 + , and triple-negative (TNBC) breast cancers, including male tumors and seven matched pairs of ER + tumors and involved lymph nodes. Not surprisingly, extensive changes in the immune/stromal landscape were found between the preneoplastic versus neoplastic states in BRCA1 mutation carriers. While all tumor subtypes exhibited intra-tumoral heterogeneity, distinct changes occurred within the microenvironment of different cancer subtypes. Moreover, we observed either clonal migration of genomically distinct ER + breast cancer cells into the axillary lymph nodes or mass migration of tumor cells. Together, this large-scale integration of patient samples encompassing the transcriptomes of > 340,000 cells provides a framework for deciphering the clinical relevance of heterogeneity within normal tissue and breast tumors.


Emerging challenges in cancer immunotherapy

Despite the considerable progress that has been made in the field of cancer immunotherapy, significant challenges remain.

Cancer immunotherapy requires personalization

During tumorigenesis, cancer cells acquire different numbers and types of mutations. Also, as a result of the immunoediting process, mutations that stimulate an antitumor immunity might be lost, whereas less immunogenic mutations might be selectively maintained. Consequently, neoantigens are rarely shared between patients [ 42], and thus neoantigen-based cancer immunotherapies require personalization.

In addition, although some immune escape strategies, such as the upregulation of PD-L1 expression, are more commonly used than others [ 43], this observation may not justify the use of a single immunotherapy approach to treat all cancer patients. Indeed, for more successful therapeutic outcomes, it might be necessary to have a complete understanding of how a given individual tumor managed to escape antitumor immunity. In support of this hypothesis is the fact that only a small percentage of patients may benefit from immune checkpoint blockades [ 44]. In fact, nivolumab or ipilimumab induce durable responses in only 10–30% of melanoma patients when either agent is used alone [ 45, 46].

The need for better biomarkers in cancer immunotherapy

In an attempt to increase the proportion of responders to immunotherapies, a great amount of recent effort has been devoted to identifying predictive biomarkers. To date, many biomarkers have been evaluated in preclinical and clinical studies. These biomarkers include PD-1/PD-L1 expression, tumor-infiltrating immune cells, absolute lymphocyte counts, TCR clonality, tumor mutational load, immune-related gene expression profiles, MHC class I epitope frequency/specificity and tumor mismatch–repair status [ 47]. In addition, serum biomarkers have received much attention recently. This approach relies on the expectation that it might be possible to identify molecules that can predict the response to immunotherapies with ease and reliability [ 48]. Nonetheless, biomarker studies have illustrated that predicting which patients are likely to respond using a single biomarker is quite challenging. As an example, melanoma patients who have high PD-L1 expression show higher response rates compared with patients with low PD-L1 expression. Nevertheless, not all patients with high PD-L1 expression respond to the treatment, and, at the same time, PD1/PD-L1-targeted therapies are effective in some PD-L1 negative patients [ 49]. More strikingly, PD-L1 expression is not significantly associated with anti-PD-1/PD-L1 treatment responses in some other malignancies, such as RCC [ 50]. Collectively, these findings illustrate that there is still much work to be done and that integrative analysis of multiple biomarkers might be necessary to improve the prediction of therapeutic response. In line with that, the stratification of tumors based on both the presence/absence of tumor-infiltrating T cell in addition to PD-L1 expression has been suggested as a better predictive method to design and identify ideal immunotherapies, rather than evaluating these two factors individually [ 51].

The difficulty of selecting the right combination therapy

The combination of PD1/PD-L1- and CTLA-4-targeted immunotherapy is a major breakthrough in cancer treatment. Immune checkpoint inhibitor combinations have improved the median overall survival of melanoma patients with advanced disease from <1 year [ 52] to 37.6 months using nivolumab, 19.9 months using ipilimumab and >3 years with nivolumab-plus-ipilimumab. This dramatic improvement was seen in more than half of nivolumab- and nivolumab-plus-ipilimumab-treated patients, and in about one-third of the ipilimumab-receiving group [ 53]. Likewise, the inhibition of the enzyme indoleamine 2, 3-dioxygenase (IDO), a T-cell-response suppressor, in conjunction with other immune checkpoint inhibitors, resulted in a significant increase in patients’ response rates and overall survival rates. This observation leads to the development of several IDO inhibitors that have been/are currently being tested in Phase 1, 2 and 3 clinical trials [ 54]. Although these are encouraging findings, not all of the treated patients had complete responses. Also, these numbers are less dramatic in the case of other malignancies [ 55]. Therefore, there remains a need to identify combination treatment options.

Furthermore, besides combining different modalities of immunotherapy, conventional cancer treatments may act synergistically with immunity-based treatments. For example, some conventional cancer treatments trigger immunogenic cancer cell death. This effect results in the release of more tumor antigens, and thus it stimulates host antitumor immunity [ 56]. Supporting evidence comes from the observation that the combination of ipilimumab with whole-brain radiation therapy or stereotactic radiosurgery increased average overall survival by 13 months, compared with those only receiving radiation, in patients with melanoma brain metastases. The risk of death was also significantly reduced [ 57]. However, there are thousands of possible drug combinations, making it challenging to know where to begin in solidifying the optimal combination of treatments.

Resistance to cancer immunotherapy

One of the major obstacles to effective response to cancer immunotherapy is resistance. For example, cancer patients undergoing transplantation, HIV positive individuals, and elderly people have preexisting systemic intrinsic resistance to immunity-based treatments [ 58]. Moreover, many individuals have an intact immune system but may lack antitumor immune activity only at the site of the cancer. In fact, immunological factors, including the density and geography of tumor-infiltrating CD8 T cells (immunoscore) [ 59–61], and CD4:CD8 T cells, can play an important role in immunotherapy resistance [ 62]. Furthermore, over the course of cancer progression, tumor cells consistently acquire changes in their genetic, epigenetic, transcriptional and metabolic profiles, as well as alterations in their oncogenic signaling [ 63]. Similarly, stromal cells continually change the expression of their cell surface molecules, the activity of their intracellular signaling pathways and their cellular metabolism [ 64–68]. As might be expected, some of these modifications can confer immune resistance and thus attenuate the effectiveness of cancer treatments [ 64]. Notably, immunotherapies may also promote these cellular changes [ 58]. As a result, to overcome emerged resistance to immunotherapy, it would be necessary to perform a comprehensive integrative analysis to consider all of these factors.


3 RESULTS AND DISCUSSION

3.1 Genome summary

Multiple libraries with different insert sizes were constructed from DNA extracted from the eggs of the purified X12 population. In total, 95.22 Gb of sequencing data was generated, of which 13.65 Gb (96.81X coverage) was produced from Illumina reads, 28.48 Gb (201.97X coverage) from PacBio reads, 31.32 Gb (222.13X coverage) from 10X Genomics linked-read libraries and 21.77 Gb (154.39X coverage) from the Hi-C library (Table S1). The assembled genome is estimated to be 141.01 Mb, with scaffold and contig N50 sizes of 16.27 Mb and 330.54 kb, respectively (Figure 2b). In addition, the sequencing results (SCN_Lian) were compared with the newly released sequencing results (SCN_Masonbrink) of 2019 (Masonbrink et al., 2019 ) and the genomes of the plant–parasitic nematode M. hapla (Opperman et al., 2008 ) and the free-living nematode C. elegans (The C. elegans sequencing consortium, 1998 ) (Table 1). The genome size of SCN_Lian is 141.01 Mb, which is almost identical to that of SCN_Masonbrink, at 123 Mb. Notably, SCN_Masonbrink did not assemble the genome of H. glycines at the chromosome scale, though SCN_Lian did. The BUSCO value of SCN_Lian is 53.4% compared with 72% for SCN_Masonbrink, but the BUSCO value of SCN_Masonbrink is ~54% when analysed using the nematode database and the genomic data supplied by Masonbrink et al. Therefore, there is little difference in assembly quality between the genomes of SCN_Masonbrink and SCN_Lian. The GC content of SCN_Lian (36.89%) is similar to that of C. elegans (35.4%), whereas M. hapla has an unusually low GC content of 27.4%. SCN_Masonbrink annotated 29,769 genes, and SCN_Lian annotated 11,882 genes.

141 Mb with scaffold and contig N50 size of 16.27 Mb and 330.54 Kb. Also listed in the table are the size and number of N60, N70, N80 and N90 of contigs and scaffolds. (c) Clustering of scaffolds using Hi-C data into pseudochromosome-scale scaffolds. Listed are the 258 scaffolds of total length

12 Mb used for clustering. Also listed in the table are the cluster numbers, the number of contigs and the reference length of contigs [Colour figure can be viewed at wileyonlinelibrary.com]

H. glycines M. hapla C. elegans
SCN-Masonbrink SCN-Lian
Sequencing material Inbred population TN10 (Hg type 1.2.6.7) Natural population X12 (Hg type 1.2.3.4.5.6.7)
Genome size, Mb 123.85 141.01 54 100
Contigs, bp 738 889 3,452 N/A
Contig N50, Kb 304,130 330,544 N/A N/A
Scaffolds N/A 267 1,523 N/A
Scaffolds N50, bp N/A 16,265,615 83,645 17,494,000
Assembled, bp N/A 141,354,287 53,578,246 100,267,623
Sequence coverage, % N/A 98.33 99.2 100
Per cent complete BUSCO, % 72 53.4 59 99.6
G+C, % N/A 36.89 27.4 35.4
Annotated genes 29,769 11,882 14,420 20,060
Repeats numbers accounted for the genome, % 34 51.10 17 16.5
Identified SNPs 1,619,134 247,046 N/A N/A
Chromosomes NA 18 16 6
Chromosomes-level assembly N/A Yes N/A Yes

The data quality control results are shown in Tables S2–S4 and Figures S1 and S2. The following was obtained for assessment of polymerase length distribution: read number of 2,080,111, with mean read length of 13,703 and read length N50 of 23,355. Insert size length distribution showed the following: read number of 2,080,111, with mean read length of 9,875 and read length N50 of 14,429. Assessment of subread length distribution revealed that the read number was 3,179,171, with mean read length of 8,948 and read length N50 of 12,988. According to bwa software, the mapping rate of all small fragment reads to the genome was approximately 90.72%, and the coverage rate was approximately 98.33% (Table S5) thus, the reads show good agreement with the assembled genome. After sorting chromosome coordinates, removing repeated sequences and performing single nucleotide polymorphism (SNP) calling for the BWA comparison results, 247,046 SNPs were obtained, with 0.213% SNP heterozygosity and 0.0024% SNP homozygosity based on SAMtools (http://samtools.sourceforge.net/) (Table S6) therefore, the genome assembly has high single-base accuracy. In addition, the GC content and average depth of the assembled genome were calculated and mapped using 10k Windows without repeated calculation. The results showed that the GC content is concentrated in a region encompassing 40% of the genome, without apparent separation, which showed that the genome was not contaminated by foreign sources (Table S7 and Figure S3).

The results of CEGMA analysis demonstrated that the assembly was complete, with mapping rate of 86.29% (a total of 214 genes) (Table S8). BUSCO evaluation results also indicated that the assembly result was complete, with 53.4% assembled complete single-copy genes of 978 homologous single-copy genes (Table S9). Remarkably, only 53.4% of the genes in the H. glycines assembly are single copy according to the BUSCO analysis, with 3.7% duplicated. For comparison, the BUSCO results for SCN_Masonbrink indicate that 56% of the genes in H. glycines are single copy, with 16% duplicated (Masonbrink et al., 2019 ).

Results of repeat prediction showed that the X12 genome contains 51.10% repeat sequences. Repetitive sequence statistics and classification results are shown in Tables S10 and S11 and Figure S4. The genome of H. glycines is diploid and consists of repeated sequences with higher nucleotide divergence (19.21%) than the genomes of Meloidogyne species, which are polyploid and consist of duplicated regions with low nucleotide divergence (

8%) (Abad et al., 2008 Blanc-Mathieu et al., 2017 Sato et al., 2018 Szitenberg et al., 2017 ).

Gene structure prediction was performed, and 11,882 protein-coding genes were predicted, with a mean of 1,233.92 bp of coding sequence (CDS) and 8.3 exons per gene (Table S12 and Figure S5). The transcript lengths of genes, CDSs, exons and introns of SCN are comparable to those of the genomes used for homology-based prediction (Table S13 and Figure S6). In addition, noncoding RNA genes were predicted in the SCN genome, including a total length of 17,688-bp ribosomal RNA (rRNA), 46,685-bp transfer RNA (tRNA), 39,375-bp microRNA (miRNA) and 21,549-bp snRNA genes (Table S14). Based on functional annotation of protein-coding genes, 64.5% (7,663), 76.5% (9,093), 60% (7,126), 70.7% (8,405), 49.1% (5,840) and 61.5% (7,303) of genes are annotated in Swiss-Prot, Nr, kegg , InterPro, GO and Pfam, respectively. The four life stages of SCN were isolated and then mixed before sequencing for genome annotation. In total, 9,383 protein-coding genes (79.0%) with conserved functional motifs and functional terms were successfully annotated (Table S15 and Figure S7). The distribution of genes, GC contents, long terminal repeats (LTRs), long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), tRNAs, miRNAs, snRNAs and rRNAs in X12’s chromosomes are shown in Figure 3.

There are some differences regarding the results for SCN_Lian and SCN_Masonbrink, such as the number of annotated genes. The possible reasons are as follows. First, there were differences in the sequencing technologies used. For SCN_Masonbrink, PacBio long-read technology was mainly used, whereas combined Illumina short-read and PacBio long-read technologies were employed for SCN_Lian. Second, there were differences in the materials sequenced. The inbred population TN10 (Hg type 1.2.6.7) was used for SCN_Masonbrink, but the natural population X12 (Hg type 1.2.3.4.5.6.7), which is the most virulent SCN population identified to date, was utilized for SCN_Lian. The differences in the pathogenicity of these populations may also be judged from the differential proportions of S genes (50.4% and 45.3% in SCN_Lian and SCN_Masonbrink, respectively) and D genes (2.3% and 8.7% in SCN_Lian and SCN_Masonbrink, respectively) in the BUSCO results (Table 2). Third, different annotation methods were applied. Gene annotations were performed using Braker for SCN_Masonbrink with an unmasked assembly, which annotated 29,769 genes including 12,357 expressed repetitive elements and showed that the H. glycines genome has a significant number of repeats, at 34% of the genome. To prevent the number of genes from being too high, which can be caused by false positives from repeats during gene annotation, repeat masking before structure annotation was performed for SCN_Lian, as also conducted in many other studies (Xu et al., 2013 Zhang et al., 2019 ). To obtain more comprehensive and accurate repeat sequences, homologous sequence alignment and ab initio prediction were performed. Ultimately, 11,882 annotated genes and 51.10% nonredundant repeat sequences were obtained.

Scientific name Version Genome size Gene number BUSCO genome
Caenorhabditis_elegans ensembl.metazoa.v32 98M C:98.6% (S:98.0%, D0.6%), F:0.8%, M:0.6%, n:982
Caenorhabditis_briggsae ensembl.metazoa.v32 106M C:97.7% (S:97.0%, D0.7%), F:1.5%, M:0.8%, n:982
Ascaris_suum ensembl.metazoa.v32 265M C:89.8% (S:88.0%, D1.8%), F:6.6%, M:3.6%, n:982
Brugia_malayi wormbase.WBPS6 93M C:96.6% (S:96.0%, D0.6%), F:2.4%, M:1.0%, n:982
Onchocerca_volvulus ensembl.metazoa.v32 94M C:97.6% (S:97.3%, D0.3%), F:1.7%, M:0.7%, n:982
Meloidogyne hapla 54M 14420 C:59.9% (S:58.7%, D1.2%), F:9.4%, M:30.7%, n:982
Meloidogyne incognita 184M 43718/45351 C:61.8% (S:25.8%, D36.0%), F:8.1%, M:30.1%, n:982
Heterodera glycines (SCN-Lian) 135M 11882 C:52.7% (S:50.4%, D2.3%), F:9.6%, M:37.7%, n:982
H. glycines (SCN-Masonbrink) 129M 29769 C:54.0% (S:45.3%, D8.7%), F:10.4%, M:35.6%, n:982

3.2 Chromosome observation and Hi-C scaffolding

The chromosome number of H. glycines during meiosis was observed under a fluorescence microscope using 450–490 nm excitation (2n = 18) (Figure 4). The Illumina-based Hi-C data were remapped to the PacBio assembly, clustering into nine pseudomolecules using the Proximo Hi-C scaffolding pipeline (Figure 2a). The Hi-C scaffolding was able to anchor and order with high confidence all of the 258 scaffolds into nine pseudomolecules. The scaffold sizes ranged from 7.6 to 185 Mb with an N50 of 16.3 Mb (Figure 2c). The overall scaffolding rate was 91.2% (Table S16).

3.3 Evolutionary analysis

A total of 25,535 gene family clusters were constructed. The genes used for gene family clustering in each species are shown in Table S17. In total, 482 single-copy gene families are common to all 12 species. The distribution of single-copy orthologs, multiple-copy orthologs, genes unique to H. glycines and other orthologs in different species is shown in Table S18. Protein sequences from the 482 single-copy gene families were used for phylogenetic tree reconstruction, and the estimation of divergence time was performed (Figure 5) with mcmctree software. Synteny diminished as phylogenetic relatedness declined, and our results showed that the divergence time between H. glycines and M. hapla is approximately 143.6 million years. Thus, the divergence of H. glycines preceded that of the model nematode C. elegans. Moreover, because plant parasitism is a lifestyle found in three different clades in the nematode tree of life, plant parasitism appeared at least three times independently during the evolution of nematodes (Danchin & Perfus-Barbeoch, 2009 ). It was also inferred that plant parasites evolved from fungus-feeding nematodes, according to previous results that showed consistent coclustering of plant parasites with fungivorous species (Holterman et al., 2006 ).


RESULTS

The goal of this study was to examine the transcriptional landscape of CD34 + /CD38 − blasts by comparing them to CD34 + /CD38 + blasts and to normal CD34 + /CD38 − cells. We used flow cytometry to sort CD33 + /CD34 + /CD38 − and CD33 + /CD34 + /CD38 + blast populations from the bone marrow cell suspensions of two patients with AML (AML1 (M0) and AML2 (M5)) and four patients with normal bone marrow (N) (refer to the Materials and Methods), and we then performed single-cell mRNA sequencing on 359 sorted cells. After stringent filtering, we generated 311 single-cell RNA-seq profiles with an average of 7 × 10 6 uniquely mapped reads per cell (Figure 1A and B, see the Materials and Methods). As expected, the number of detected genes per cell was comparable between the conditions and variable between the cells [19, 20] (Figure 1). On average, 1764 transcribed genes were detected per cell (RPKM > 10). Consistent with previous reports [19–21], we observed a substantial variability in the cell-to-cell transcriptome (Pearson 0.0007 < r < 1, mean 0.57) (Figure 1C). Bulk RNA from each individual was prepared for bulk-to-single cell transcriptome correlations. As described, the bulk transcriptomes were highly correlated with the single-cell transcriptomes (Pearson 0.38 < r < 0.89, mean 0.63) (Figure 1C), thus supporting the assumption that bulk samples reflect the average of single-cell populations [20, 22].

(A) Boxplots showing the number of uniquely mapped reads and transcribed genes detected in 313 single cells from 4 healthy patients (N1, N2, N3, N4) and 2 patients with AML (AML1, AML2). CD38 − and CD38 + blast cells from patient AML2 were sorted separately (refer to the Materials and Methods). See also Figure S1. The right panel shows the cumulative average number of transcribed genes detected per cell for each sample and per RPKM category. (B) Hierarchical clustering of 267 single-cell samples (refer to the Materials and Methods) based on Pearson correlation. The cell labels are colored according to the sample origin (same color code as the right panel in (A)). CD38 − and CD38 + single cells are not included in this plot. The correlation coefficient is also colored according to the scale ranging from 1.0 (blue) to - 1.0 (red). (C) Cell-to-cell and cell-to-bulk correlation matrix including 7 single-cell samples (sc) from N3 and bulk samples from N3 and N4. Scatter plots show correlation with gene expression (RPKM > 0). Numbers represent pairwise Pearson’s correlation coefficients.

Individual leukemic CD34 + /CD38 − cells trigger pathways that promote stemness and cancer progression

We first compared the transcriptome profiles of 24 CD34 + /CD38 − cells and 24 CD34 + /CD38 + cells from the same patient with AML (AML2), and an analysis of the differentially expressed transcripts identified 625 genes that presented significant changes (p-value < 0.05) (refer to the supplementary data, Table S1). As expected, the hierarchical clustering map of the differentially expressed transcripts showed two distinct cell populations (Figure 2A). An analysis of the gene ontology terms conducted on this set of differentially expressed transcripts revealed significantly enriched terms associated with the genes mainly implicated in the NOD-like receptor pathway (e.g., CXCL2, CXCL8, NLRP3, and TRAF6) (Figure 2B). This term also includes chemokines that are critical for the survival and proliferation of cancer cells, such as CXC-motif ligand 8 (CXCL8, IL-8) [23, 24]. Malignant CD34 + /CD38 − cells displayed reduced cell cycle activity, and genes that promote cell proliferation and cell cycling (CDK7, CDKN2A, HDAC1, MCM3, PCNA, and MYC) were downregulated compared with the more differentiated blasts. Interestingly, DNA replication genes (POLA2, RFC3, and RNASEH2A) were activated in those cells, whereas the nucleotide excision repair pathway was significantly downregulated, thus suggesting a promoted DNA damage (Figure 2B), which was previously described in quiescent cells [25–27]. A gene set enrichment analysis (GSEA) was performed to evaluate the presence of well-characterized stem cell regulatory genes in our dataset (Figure 2C). We tested predefined gene lists from published gene expression profiles of pathways activated in LSCs and HSCs (refer to the supplementary data, Table S4) [11, 12, 28, 29] and confirmed that the stem-associated genes expressed by our sorted CD33 + /CD34 + /CD38 − leukemic cells were consistent with published data (FDR < 0.01). Several studies have implicated the Wnt/β-catenin pathway in the development of leukemia stem cells [30]. A GSEA of this specific pathway confirmed the overexpression of 49 of 148 genes in CD34 + /CD38 − blasts (Figure 2C).

(A) Heat map and supervised hierarchical clustering of differentially expressed genes between CD38 − and CD38 + cells (refer to the supplementary data, Table S1 for gene list details). The rows represent genes and columns represent single cells from CD38 − and CD38 + subpopulations. Color coding denotes log2-transformed RPKM values. (B) GO term analysis of CD38 − and CD38 + differentially expressed transcripts (fold enrichment DAVID analysis). The bar diagram shows the significantly enriched terms (p-value < 0.05, Fisher’s Exact Test) among downregulated transcripts (blue) and upregulated transcripts (grey). See also Figure S1. (C) Signature enrichment plots from GSEA using 5 different gene target lists (refer to the supplementary data, Table S4 for gene list details). Black vertical bars are genes ordered according to their fold change expression. Values indicate normalized enrichment score (NES), FDR adjusted p-value and number of significant enriched genes out of the total genes tested. (D) Gene network visualization of differentially expressed genes between leukemic CD38 − and CD38 + single cells. Stemness-related genes and pathways are highlighted. See also Figure S2.

The network map illustrates the functional connectivity among the differentially expressed genes (Figure 2D). CD34 + /CD38 − leukemic cells expressed genes related to the following 4 distinct signaling pathways: TNFα/NF-κB, c-Kit–mediated stem cell factor (SCF), Rb/E2F, ERK/MAPK and AKT. Remarkably, genes important for hematopoiesis and leukemogenesis, such as BMI1 (FC CD38-/CD38+: 4.47, p-value: 0.00113, Kolmogorov-Smirnov statistical test), HOX genes (refer to the supplementary data, Figure S2), and MYB have been previously reported to be deregulated in leukemic cells [31, 32]. Moreover, these genes have been extensively associated with cancer stem cell maintenance in general [31, 33–35] and support the “stemness” and tumorigenic potential of sorted CD33 + /CD34 + /CD38 − blasts. Remarkably, expression profiles generated by single-cell RNA-seq on a few number of cells enable to transcriptionally distinguish the molecular features of CD33 + /CD34 + /CD38 − from CD33 + /CD34 + /CD38 + blasts.

CD34 + /CD38 − single-cell transcription profiles distinguish leukemic cells from normal stem cells

To assess whether our framework was suitable for characterizing the transcriptome profiles of CD34 + /CD38 − blasts, we analyzed 267 individual CD33 + /CD34 + /CD38 − cells from four patients with normal bone marrow and two patients with AML. Interestingly, SOX4 was among the top 200 highly expressed genes (refer to the supplementary data, Figure S3) in normal CD34 + /CD38 − cells and in the CD34 + /CD38 − blasts of the two AML samples, and its presence was clearly associated with “stemness” properties [36–39], thus reinforcing the suggestion that all of the captured cells might possess the hallmarks of putative stem cells. Remarkably, we identified 5 transcriptionally different clusters on the t-SNE map (Figure 3, Figure S4) (refer to the Materials and Methods). Fifteen cells were unassigned and deliberately allocated to cluster 1. The remaining 219 CD34 + /CD38 − cells grouped into four clusters. AML1 and AML2 cells distinctly clustered in clusters 2 and 4, respectively, whereas all of the non-AML cells formed a distinct and unique cluster, cluster 3 (Figure 3). Cluster 5 consisted of 5 outlier cells from normal patients (Figure 3). The observed clustering indicated that inter-individual variability had less impact on mRNA expression profiles than the disease phenotype and that identified clusters reflect disease states. To identify genes associated with the disease state, the D 3 E method [40] was used to identify differentially expressed transcripts among the AML1, AML2 and non-AML CD34 + /CD38 − cells (Figure 4A). We found 858 and 763 genes that were differentially transcribed in the AML1 and AML2 cells compared with the non-AML cells (p-value < 0.05), respectively, and counted 185 genes that were essentially related to cell cycle regulation and cancer pathways in all cell types (Figure 4B). Among the upregulated or downregulated genes in the AML1 and AML2 cells, the most enriched gene ontology categories were related to the cell cycle, DNA replication, DNA repair, cellular senescence and self-renewal and stemness, such as the JNK pathway, FOXM1 and PLK1 networks, and TGF-beta signaling pathway (Figure 4B). A GSEA performed with literature-based gene lists confirmed that the AML1 and AML2 blasts showed significant enrichment in the published stem cell gene sets and leukemia-activated pathways as well as in a prognostic gene signature (refer to the supplementary data, Figure S5). The overlap of enriched functional-related gene groups in the two AML cells revealed common properties of LSCs in “stemness”-related signaling pathways that control survival, tumorigenesis and self-renewal. The patients with AML1 and AML2 belonged to different subtypes of AML. The leukemic cells of patients with AML2 harbored the common FLT3-ITD mutation, which corresponds to an internal tandem duplication (ITD) in the Fms-like tyrosine kinase 3 gene (FLT3), which encodes the receptor for the cytokine FLT3 ligand (FLT3L). Normally, the FLT3 receptor is expressed at the surface of HSCs and is required for the development of myeloid progenitors. The FLT3-ITD mutation results in a hyper-sensitivity of the FLT3 receptor, which promotes uncontrolled cell proliferation mediated by AKT-, MEK-, and ERK-activated pathways [41, 42]. Indeed, we observed that FLT3 transcripts are overexpressed in AML2 cells compared with non-AML and AML1 cells (refer to the supplementary data, Figure S6). The pathways deregulated by this overexpression, i.e., RAS, ERK, AKT, TGF-beta, and GPCR, are illustrated in the networking map (Figure 5A). A comparison with the cellular network map from the AML1 sample showed that these perturbations were specific for AML2 CD34 + /CD38-blasts (Figure 5A).

t-SNE map with cells colored by cluster identities (left plot) or by individuals (right plot). Cells were classified into 5 clusters according to their expression patterns using the SEURAT algorithm with the default parameters. Cells that could not be assigned were placed by default into cluster 1.

(A) Heat maps of the differentially expressed genes (D 3 E analysis) discriminating N cells from AML cells. Genes are shown in rows. Log2-transformed RPKM values are indicated in the color map (refer to the supplementary data, Table S2 and S3). The Venn diagram indicates the number of genes differentially expressed between N and AML1 samples (858 genes) and between N and AML2 samples (763 genes). (B) GO term analysis of differentially expressed genes between AML and N samples (GeneAnalytics tool, adj. p-value < 0.05). The purple color indicates the significant enrichment of gene ontology pathways in AML1 (673 genes), AML2 (578 genes) and AML1 & 2 (185 genes).

(A) Interaction map of differentially expressed genes (D 3 E analysis) for AML1 (left) and AML2 (right). Relevant signaling pathways are highlighted. Genes representing the main nodes are colored in red. * indicates pathways known to be perturbed by FL3-ITD mutation. (B) Graphic representation of TFBs enriched in differentially expressed gene sets for AML1 (left) and AML2 (right). Circles are sized in proportion to the number of differentially expressed genes enriched for the labelled TFBs.

Overall, our single-cell transcriptome analysis conducted on CD33 + /CD34 + /CD38 − cells from two AML samples provided valuable information on the nature of the AML. The effects of somatic mutations on the cancer cells are detectable providing valuable insight into disease-associated and patient-specific gene networks.

Core set of transcription factors are co-activated in leukemic CD34 + /CD38 − cells

To further investigate the regulatory networks of leukemic CD34 + /CD38 − cells, we investigated whether previously identified differentially expressed genes in the AML1 and AML2 cells were co-regulated by a set of transcription factors. Therefore, we evaluated the enrichment of transcription factor binding sites (TFBs) associated with those differentially expressed gene sets (refer to the Materials and Methods). Of the 763 significantly differentially expressed genes in the CD34 + /CD38 − AML2 cells, nine TFBs (PAX4, CEBP, MEF2, POU3F2, E2F, TATA, NFY, FREAC3 (FOXC1) and HLF) were significantly enriched (adjusted p-value < 0.05), and of the 858 differentially expressed genes in the CD34 + /CD38 − AML1 cells, 62 TFBs (the nine TFBs associated with the AML2 dataset as well as other relevant TBFs, such as OCT1, GATA1, EVI1 and MEF2 (Figure 5B)) were significantly enriched (adjusted p-value < 0.05). Interestingly, nine common TFBs are found in common between the CD34 + /CD38 − blasts of both AML samples. This result suggests a relationship among leukemic CD33 + /CD34 + /CD38 − cells such as a joint transcription program or a shared cellular identity.

Transcription profiles of leukemic CD34 + /CD38 − single-cells can identify genes associated with putative survival outcomes in the AML-TCGA cohort

Reports from AML patients have indicated that the gene expression signatures of leukemic blasts at diagnosis have prognostic significance [11, 43]. Thus, we hypothesized that these signatures could be detected in leukemic CD33 + /CD34 + /CD38 − single-cell transcription profiles. RNA-seq data and clinical outcome data of 163 individuals with AML accessible on The Cancer Genome Atlas (TCGA) public website [44] were examined, and then the associations among 1675 differentially expressed genes from our study were assessed against the clinical outcomes in the AML-TCGA cohort. Specifically, we ranked the genes according to the p-values derived from a univariate Cox regression. Thirty-six genes from the AML2 gene set and 22 genes from the AML1 gene set were identified as significant for overall survival based on a comparison of the patients in the top 50% of gene expression against those in the bottom 50% of gene expression (p-value ≤ 0.05) (refer to the supplementary data, Table S5). Five genes (MPO, ITGAX, RUFY3, FEM1C, and HSF2) were common between the AML1 and AML2 gene sets. The effect of the relative expression of selected genes on the survival of AML-TCGA patients is shown in Kaplan-Meier plots in Figure 6. Interestingly, the overall expression of myeloperoxidase (MPO) was higher in the normal CD33 + /CD34 + /CD38 − single cells compared with the AML cells (FC N/AML2=24.77, p-value=3.92x10 −10 FC N/AML1=32, p-value=3.97x10 −9 ), whereas high levels of MPO transcripts were significantly associated with improved survival in the AML-TCGA samples (p-value=0.0042) (Figure 6). MPO is a myeloid lineage marker that has been described as a potential prognostic factor for AML [45–48]. This analysis identified genes whose single-cell expression in leukemic cells was associated with the survival of AML-TCGA patients (refer to the supplementary data, Table S5). Moreover, we found an enrichment of previously published prognostic gene signatures [12, 49] in single leukemic CD34 + /CD38 − cells (refer to the supplementary data, Figure S5). Collectively, these findings indicate that the transcriptional landscape of leukemic CD33 + /CD34 + /CD38 − cells contains putative relevant disease outcome information and the single-cell transcriptomic strategy presented here is of potential use in further prognostic evaluations.

(Left) Kaplan-Meier survival curves for AML-TCGA patients stratified based on transcript levels. Patients were divided into two groups: 50% lowest gene expression and 50% highest gene expression. P-values were calculated using a univariate Cox regression analysis. (Right) Box-plots displaying the distribution of transcript levels (log2-transformed RPKM) in AML1, AML2 and N single-cell sample sets. The groups of cells with the highest significant expression are shown in red (p-value < 0.05) (refer to the supplementary data, Table S5). HZ: hazard ratio.


Discussion

Figure 5. Probing sequencing gaps and nucleotide isomers by MS/MS. (a) Automated sequencing of a modified 20-nt RNA with a 2′-O-methylation in position C11 (R5: AUAGCCCAGUCmAGUCUACGC). 2′-O-modifications inhibit acid hydrolysis of the phosphate backbone leading to weak or missing ladder fragments. The mass of the missing fragment in the lower 3′ ladder corresponded to A + C + methyl (gray text). (b) Structure of the missing dimer determined by MS/MS analysis, with characteristic fragment ions as labeled. (c) Fragmentation spectrum following extended acid hydrolysis (80 °C, 75% (v/v) formic acid, 2 h) to increase the abundance of the dimer. MS/MS data was collected for the modified dimer and fragment ions were used to confirm that the methylation is on the ribose 2′ position of cytidine and the sequence is CmA. Assignable fragments labels are indicated on the dimer structure in panel (b).


Dimensional reduction

Here, we represent the transcription profiles of all cells in the data set in a 2D or 3D projection, typically using the t-SNE or UMAP methods.

SNN graph

The SNN graph was calculated earlier in the FindNeighbors() function.

Calculate UMAP

He we calculate the UMAP using the same 15 PCs we have also used for clustering.

Overview

After calculation of the UMAP, we plot it while coloring the cells by the number of transcripts, sample, cluster, and cell cycle.

Samples

To understand how much the samples overlap, we plot the UMAP again and split the samples in different panels. Evidently, they overlap very well.

Clusters

Instead, for the clusters we simply add a label to the geometric center of each cluster.

Poincaré map (work-in-progress)


Comments on this article Comments (0)

All my comments were addressed by the authors. Some of the suggested improvements were introduced, other not but the authors explained they did not overlap with the objectives of the article.

I only suggest to correct a . Continue reading

All my comments were addressed by the authors. Some of the suggested improvements were introduced, other not but the authors explained they did not overlap with the objectives of the article.

  • "Normlisation” instead of "Normalisation” occurs once in the chapter “Transformation to FPKM values and normalisation”
  • "those those” – repeated word in the sentence started from “Figure 2 shows that the set of down-regulated genes has more discriminant power…”
  • there is no dot at the end of the last sentence of the Summary chapter. Moreover, I suggest not to use a term “wild-type NPM1c samples” formed as an equivalent of “NPM1c mutated samples”, e.g. in the description of the Figure 3 and 4, as “NPM1c” means “cytoplasmic NPM1” which is typical for samples with the mutated NPM1 gene. The term “wild-type NPM1 samples” is more appropriate.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genomics, transcriptomics, biology of acute myeloid leukemia

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Transcriptome analysis, Gene expression regulation

Introduction
The article presents a new package, which was developed from a previous one (‘singscore’) elaborated by the same authors. The package ‘singscore’ is a single-sample gene set scoring method which is valuable for analysis of transcriptomes of samples collected through the long time and not sequenced in the same run, experiment, platform or laboratory. Here, the authors apply the method for classification of TCGA acute myeloid (AML) samples using transcriptional ‘gene signatures’ identified by other authors (Verhaak and Ross) as typical for the NPM1c mutation, KMT2A (MLL) gene fusions, and PML-RARA gene fusions.

AML is a heterogeneous and multi-clonal disease. The transcriptome picture of AML is very complex and can result from different mutations, genomic rearrangements, and aberrant regulation of gene expression at different levels (see my last review paper 1 ). Sometimes, the gene expression profiles can overlap between samples with different genetic lesions. The examples are HOX-gene-based discriminative signatures determined by Verhaak and Ross, not limited to AML with mutated NPM1, but specific also for AML cases with 11q23 abnormalities and KMT2A (MLL) gene rearrangements. The authors are aware of this fact and underline it in the article, too. On the other side, samples with the same mutation can present varying expression profiles because of additional features which also affect the transcriptome. Therefore, interpretation of the results must be done with caution. For me, the authors interpret the results a little bit too optimistically. The presented method is not sufficient to determine the mutation of interest, but can be used as a supplementary approach, a screening tool or (in future) a personalized medicine tool which could classify patients based on transcriptomic profiles, associated with specific treatment response.

Testing package
I have used the package in R with TCGA data and confirm the code works, it is fast and generates the same results and plots as presented in the article. Moreover, I have tested the whole procedure on my own RNA-seq dataset (not published yet but used for supplementary expression analysis of NPM1 alternative transcripts, see the article 2 ). My dataset contains 28 AML samples, including 8 with NPM1 mutation (verified with three independent approaches). The data load was more tricky as I started from the csv table with counts and I had to convert the DESeqDataSet object into the RangedSummarizedExperiment object. The best discrimination was achieved with the Verhaak signature, but only with the down-regulated genes. All samples with NPM1 mutation were clearly separated from others, however, 2 additional samples, without NPM1 mutation, were grouped together with NPM1c. It is possible they have KMT2A (MLL) rearrangements, but I cannot verify it now. Analyzing the plots, up-regulated genes and all genes form the signature are not as efficient, separating only 2 or 3 NPM1c samples from the rest. In the article, the authors also admit that the set of down-regulated genes has more discriminant power, but they claim up-regulated genes also contribute to the discrimination.

I have no samples with PML-RARA fusion in my dataset and the status of KMT2A (MLL) gene was unknown in my patients, so I could not compare the efficiency of signatures other than NPM1c. APL with PML-RARA is the most distinctive AML subtype which is easily separated from other AML samples based on transcriptome profile, so I would expect good results. This example shows the package works well for samples with very specific gene expression profile.

Interpretation of the results
Considering the result interpretation, I have some doubts. The singscores are composed of two components, an enrichment score and a dispersion estimate of ranks. I am informed that “high expression of up-regulated genes and low expression of down-regulated genes would result in higher scores”. This is logical. I also know the maximal ranges ([−1, 1] for signatures involving up- and down-regulated genes). However, which value is really high? For example, is 0.2 enough (it looks like from the plots in the article and from my own data, too) or maybe I should expect much more, e.g. 0.7? Similarly, which value of dispersion should I expect? I suppose, the lowest, but which is low enough or which range is optimal? What is difficult to understand is that “despite the range of scores increasing, the discriminant power drops moderately” (for up-regulated genes from Verhaak signature, when compared to down-regulated genes). I see the same paradox in my data – the scores are higher for up-reg. genes, which are much less efficient in discrimination between NPM1c+ and NPM1c- samples. It looks like scores only do not reflect the trends observed from score vs dispersion plots.

  • Instruction how to prepare and load the data other than those deposited in the GDC database.
  • Annotation of samples on the plots with unique sample IDs (when a sample is misclassified, a user does not know which one it is it will also be helpful to localize a sample on various plots, e.g. generated with a signature of up- and down-regulated genes) – it is written in the article that the ‘singscore’ package supports different types of annotations (“Annotations of interest can be overlayed on each plot”) but I found only color code annotation whereas I would like to have color code for mutation type and additional text labels with sample IDs on the same plot.
  • A command which lists the samples typed as strong candidates for having particular mutation, ordered according to the calculated metrics.
  • Generating a threshold line between samples with and without mutation.

What I would expect the most from a package designed to identify mutations in transcriptomes, is a direct identification of mutations in RNAseq data. The results of mutation typing based on gene expression profile will be strongly supported when a particular mutation will be covered by RNAseq reads. From my own experience, I know it is possible (for NPM1 mutation detection, see 4 ). For genes with high and middle expression level, the coverage can be even higher than that obtained from genome- or exome-level data. And in the case when DNA data are unavailable, it would be really fantastic. Because it demands completely different data processing, it may be considered for future versions of the package.

Is the rationale for developing the new software tool clearly explained?

Is the description of the software tool technically sound?

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

1. Handschuh L: Not Only Mutations Matter: Molecular Picture of Acute Myeloid Leukemia Emerging from Transcriptome Studies. Journal of Oncology. 2019 2019: 1-36 Publisher Full Text
2. Handschuh L, Wojciechowski P, Kazmierczak M, Marcinkowska-Swojak M, et al.: NPM1 alternative transcripts are upregulated in acute myeloid and lymphoblastic leukemia and their expression level affects patient outcome. Journal of Translational Medicine. 2018 16 (1). Publisher Full Text
3. Alcalay M, Tiacci E, Bergomas R, Bigerna B, et al.: Acute myeloid leukemia bearing cytoplasmic nucleophosmin (NPMc+ AML) shows a distinct gene expression profile characterized by up-regulation of genes involved in stem-cell maintenance.Blood. 2005 106 (3): 899-902 PubMed Abstract | Publisher Full Text
4. Marcinkowska-Swojak M, Handschuh L, Wojciechowski P, Goralski M, et al.: Simultaneous detection of mutations and copy number variation of NPM1 in the acute myeloid leukemia using multiplex ligation-dependent probe amplification.Mutat Res. 2016 786: 14-26 PubMed Abstract | Publisher Full Text

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genomics, transcriptomics, biology of acute myeloid leukemia.

We thank the reviewer for their time and effort directed towards reviewing our manuscript and for the helpful feedback they have provided. Where appropriate we have modified our manuscript to . Continue reading We thank the reviewer for their time and effort directed towards reviewing our manuscript and for the helpful feedback they have provided. Where appropriate we have modified our manuscript to address the reviewer’s comments.
Below we include a point by point response to the reviewer’s comments and where appropriate we list corresponding changes to the manuscript.

Introduction
LH: The article presents a new package, which was developed from a previous one (‘singscore’) elaborated by the same authors. The package ‘singscore’ is a single-sample gene set scoring method which is valuable for analysis of transcriptomes of samples collected through the long time and not sequenced in the same run, experiment, platform or laboratory. Here, the authors apply the method for classification of TCGA acute myeloid (AML) samples using transcriptional ‘gene signatures’ identified by other authors (Verhaak and Ross) as typical for the NPM1c mutation, KMT2A (MLL) gene fusions, and PML-RARA gene fusions.

DDB: We note that the purpose of the initial manuscript may have been unclear, in part because it is listed under the F1000 Software Tool Article section. The purpose of this article is to present a workflow demonstrating the use of singscore, and it is a R/Bioconductor workflow illustrating the use of singscore, and so is not intended as a new package or tool.
https://www.bioconductor.org/packages/release/workflows/
https://www.bioconductor.org/packages/release/workflows/html/SingscoreAMLMutations.html
Some of the review comments below assume that this manuscript presents a specific software package for detecting NPM1 mutations, so we have clarified the manuscript as outlined below in order to make the purpose and intention more clear (note that the title of the manuscript has changed from ‘predicting mutations’ to ‘predicting mutation status’ based upon feedback from reviewer 1).
This work arose from an observation in another project that the Verhaak signature scored with singscore appeared to correlate strongly with mutation status, and we thought this would be an interesting example that might also help researchers to investigate the links between genetic lesions and corresponding transcriptional changes, an area in which the reviewer clearly has expertise.

LH: AML is a heterogeneous and multi-clonal disease. The transcriptome picture of AML is very complex and can result from different mutations, genomic rearrangements, and aberrant regulation of gene expression at different levels (see my last review paper 1 ). Sometimes, the gene expression profiles can overlap between samples with different genetic lesions. The examples are HOX-gene-based discriminative signatures determined by Verhaak and Ross, not limited to AML with mutated NPM1, but specific also for AML cases with 11q23 abnormalities and KMT2A (MLL) gene rearrangements. The authors are aware of this fact and underline it in the article, too. On the other side, samples with the same mutation can present varying expression profiles because of additional features which also affect the transcriptome.

LH: Therefore, interpretation of the results must be done with caution. For me, the authors interpret the results a little bit too optimistically. The presented method is not sufficient to determine the mutation of interest, but can be used as a supplementary approach, a screening tool or (in future) a personalized medicine tool which could classify patients based on transcriptomic profiles, associated with specific treatment response.

DDB: We agree with the reviewer that gene set scoring of transcriptomic data should not be the only method used to identify patient samples carrying genetic lesions. The Bioconductor workflow we present is intended to provide an example of applying singscore to study mutation/fusion based gene sets as we believe that singscore provides a relatively flexible and intuitive approach for investigating different gene sets across large data sets. We feel that a particularly useful feature is the ability to combine different signatures/gene sets (including phenotype/cell-cycle signatures etc) to explore how these transcriptional changes are associated over different samples, demonstrated in Figures 5-7.
We have modified the “Description of biological relevance” section to address the reviewer’s comments around the complexity of AML genomic lesions and corresponding transcriptomic changes. The review highlighted is particularly relevant for this work and accordingly we mention the complexity of AML and direct the reader towards this resource by extending the first paragraph (new text underlined):
.. A more recent study which has focused primarily on genomic data has further refined the clinically significant AML subtypes [Papaemmanuil (2016), NEJM], highlighting a number of co-occurring as well as mutually exclusive mutations. As the identification of putative driver fusions/mutations continues, work has also been directed towards how these lesions interact with each other and other features (e.g. cellular proliferation, changes due to phenotypic plasticity, or variation in post-transcriptional regulators such as microRNAs) to drive transcriptional changes as discussed in a recent review [Hanschuch (2019), J. Oncol.].

We have also added a paragraph at the end of this section which discusses some of the limitations for our approach and provides more context in which it could be applied:
While we demonstrate that singscore is capable of inferring mutation status from the transcriptional profile of AML samples, we note that it is best used to supplement alternative data which can provide a more definitive resolution of these lesions. Processing of raw RNA‑seq data will directly identify the presence of gene‑fusion products or mutations within protein-coding regions, although for many large data sets the quantified transcript abundance data are much easier to obtain without access agreements. The method can also be applied to legacy microarray data sets where genome and RNA sequencing data are unavailable. As such singscore provides a useful approach to supplement established methods for the study of genetic lesions in cancer. By exploring associations between different genomic and phenotypically relevant signatures, it may also help to better characterise true driver mutations which exert consistent effects on the transcriptome of AML samples and other cancers.

Testing package
LH: I have used the package in R with TCGA data and confirm the code works, it is fast and generates the same results and plots as presented in the article. Moreover, I have tested the whole procedure on my own RNA-seq dataset (not published yet but used for supplementary expression analysis of NPM1 alternative transcripts, see the article 2 ). My dataset contains 28 AML samples, including 8 with NPM1 mutation (verified with three independent approaches). The data load was more tricky as I started from the csv table with counts and I had to convert the DESeqDataSet object into the RangedSummarizedExperiment object. The best discrimination was achieved with the Verhaak signature, but only with the down-regulated genes. All samples with NPM1 mutation were clearly separated from others, however, 2 additional samples, without NPM1 mutation, were grouped together with NPM1c. It is possible they have KMT2A (MLL) rearrangements, but I cannot verify it now. Analyzing the plots, up-regulated genes and all genes form the signature are not as efficient, separating only 2 or 3 NPM1c samples from the rest. In the article, the authors also admit that the set of down-regulated genes has more discriminant power, but they claim up-regulated genes also contribute to the discrimination.

LH: I have no samples with PML-RARA fusion in my dataset and the status of KMT2A (MLL) gene was unknown in my patients, so I could not compare the efficiency of signatures other than NPM1c. APL with PML-RARA is the most distinctive AML subtype which is easily separated from other AML samples based on transcriptome profile, so I would expect good results. This example shows the package works well for samples with very specific gene expression profile.

DDB: We thank the reviewer for the exceptional effort and time invested to test our workflow on independent data - we hope results from this analysis have been informative for identifying other features within their data. Our workflow includes guidance for users who wish to import data from other sources such as those used by the reviewer.

Interpretation of the results
LH: Considering the result interpretation, I have some doubts. The singscores are composed of two components, an enrichment score and a dispersion estimate of ranks. I am informed that “high expression of up-regulated genes and low expression of down-regulated genes would result in higher scores”. This is logical. I also know the maximal ranges ([−1, 1] for signatures involving up- and down-regulated genes). However, which value is really high? For example, is 0.2 enough (it looks like from the plots in the article and from my own data, too) or maybe I should expect much more, e.g. 0.7? Similarly, which value of dispersion should I expect? I suppose, the lowest, but which is low enough or which range is optimal? What is difficult to understand is that “despite the range of scores increasing, the discriminant power drops moderately” (for up-regulated genes from Verhaak signature, when compared to down-regulated genes). I see the same paradox in my data – the scores are higher for up-reg. genes, which are much less efficient in discrimination between NPM1c+ and NPM1c- samples. It looks like scores only do not reflect the trends observed from score vs dispersion plots.

DDB: Interpretation of singscores is intentionally left to be problem specific as it generally requires some domain-specific knowledge of the biological system and corresponding signature genes – ideally the computational biologist or bioinformaticians working on each project can provide some guidance.
The basic interpretation of the score is the normalised mean rank of genes within the signature relative to all other genes in the sample. At the extrema this interpretation is relatively simple - near 1, a higher value would indicate that genes in the signature are expressed at higher levels relative to other genes. For scores towards zero, however, the interpretation can be much more difficult – a score of zero (on the range [-1,1]) could indicate the signature genes are tightly clustered around the sample-wide mean abundance, or it could indicate a highly-dispersed almost uniform distribution across the entire abundance range (with a mean near the mean of all genes). By exploring the singscores together with dispersion estimates this information is summarised, assisting in estimation of effect size variability.
Interpretation of scores depends on the context of the experiment and the typical behaviour of the gene set. A “high” score is best determined relative to other samples. This is achieved either by comparing scores from other samples in large datasets such as TCGA, or better, across a set of samples from a given experiment with known conditions. Other methods normalise the data before computing scores, and we note that a recent paper has applied z-score normalisation to results from singscore for comparison to ssGSEA [Cui et al (2019) Oncogene, DOI: 10.1038/s41388-019-1026-9].
All singscores for samples remain the same and do not have to be recomputed upon addition of new samples, and interpretation will improve as more samples are added to the study. For example, Gaussian mixture modelling could be used to separate the NPM1c scores based on our expectation that there are two groups. This could be switched with other unsupervised classification algorithms such as hierarchical clustering or k-means clustering. We have added an example analysis to the manuscript to demonstrate how scores can be interpreted in an unsupervised setting, under the section “Transcriptional signatures to predict mutation status/Unsupervised classification of mutations”.
There may be some cases where sample annotation is not available. In such scenarios, we are unable to build regression models to interpret scores. A higher singscore would provide stronger evidence for the signature but the magnitude is difficult to interpret without a reference. One approach to deal with this situation is to compare scores to those from other datasets where the mutations status is known. An alternative approach would be to compare scores within the dataset using unsupervised learning methods.
Here we demonstrate the use of three clustering methods (Gaussian mixture decomposition, k-means clustering, and hierarchical clustering) to stratify samples, and as we have done previously [wang et al (2012) Journal of clinical bioinformatics] use the adjusted Rand index (ARI) to compare classifications. As expected, supervised (GLM) classification results in the best prediction. This is followed by clustering based on the score using Gaussian mixture decomposition. Any other classification algorithm along with prior knowledge could be used to decompose scores into groups. The important characteristic of singscores is that they maintain the discriminant power of gene signatures therefore can be coupled with supervised, semi-supervised or unsupervised algorithms to perform stratification.

#Gaussian mixture model
m1 = Mclust(scoredf$Score, G = 2, verbose = FALSE)
#k-means clustering
m2 = kmeans(scoredf[, 5:6], centers = 2, nstart = 100)
#hierarchical clustering
m3 = hclust(dist(scoredf[, 5:6]))

mutation_inference = cbind(
'GLM' = prediction,
'mclust' = m1$classification,
'k-means' = m2$cluster,
'hclust' = cutree(m3, k = 2)
)
apply(mutation_inference, 2, adjustedRandIndex, scoredf$NPM1c.Mut)
```
Possible improvements
LH: Although the package is generally useful and well described, the following improvements will make it more friendly for less advanced R users, e.g.

LH: Instruction how to prepare and load the data other than those deposited in the GDC database.

DDB: We have noted in text that the rank matrix can be computed using either a SummarizedExperiment object, DGEList object, ExpressionSet object, numeric matrix or a numeric data frame. As such, a numeric matrix with sample names as column names and genes as row names would suffice. Scoring should be performed on length bias corrected measurements such as RPKM/FPKM or TPM and not CPM or raw counts.
Transcriptional signatures to predict mutation status/Score TCGA AML samples using the Verhaak signature - extract from text: “The `rankGenes` function will compute ranks from expression data in the form of either a numeric matrix, numeric data frame, ExpressionSet object, DGEList object or a SummarizedExperiment object”

LH: Annotation of samples on the plots with unique sample IDs (when a sample is misclassified, a user does not know which one it is it will also be helpful to localize a sample on various plots, e.g. generated with a signature of up- and down-regulated genes) – it is written in the article that the ‘singscore’ package supports different types of annotations (“Annotations of interest can be overlayed on each plot”) but I found only color code annotation whereas I would like to have color code for mutation type and additional text labels with sample IDs on the same plot.

DDB: Sample labels can be added to landscape plots but were not supported in other visualisations. We have added functionality to the latest version of the singscore package (v1.5.1) to allow labelling of samples in the score vs. dispersion plots. We have modified the text to clarify, and modified Figure 6 to label samples where classification uncertainty (NMP1c vs WT) is high to demonstrate this feature. The changes below have been made to the section: “Transcriptional signatures to predict mutation status/Diagnostics of the Verhaak signature”.
Sample annotations of interest (e.g. clinical annotations) can be colour coded on each plot. …

Figure 6:
```
select_aml = !mutated_gene %in% 'Other'

#label samples with an mclust NPM1c classification uncertainty of > 0.3
label_samples = substr(rownames(verhaak_scores), 6, 12) #sample ID from barcodes
label_samples[m1$uncertainty < 0.3] = NA

#project mutations onto the landscape
p1 = projectScoreLandscape(
p_mll_npm1c,
verhaak_scores,
rossmll_scores,
subSamples = select_aml,
annot = mutated_gene[select_aml],
sampleLabels = label_samples[select_aml]
)
p1 + theme(legend.box = 'vertical')
```
LH: A command which lists the samples typed as strong candidates for having particular mutation, ordered according to the calculated metrics.

DDB: As discussed in an earlier comment, we recommend such analyses to be problem specific. Generally, a higher score would indicate a stronger effect of genes in the signature relative to WT samples therefore samples with higher scores would be stronger candidates for mutations. Alternatively, the partitioning created using Gaussian mixture modelling could be used as a guide for separation and samples with a score much higher than the threshold would be the strongest candidates for the mutation.

LH: Generating a threshold line between samples with and without mutation.

DDB: See above discussion/recommendation.

LH: In future, it would be also good to include other signatures, e.g. the signature of 369 genes identified by Alcalay et al. in 2005 3 , discriminating AML patients with NPMc+ from patients with NPMc-, even in cases with rare chromosomal abnormalities.

LH: What I would expect the most from a package designed to identify mutations in transcriptomes, is a direct identification of mutations in RNAseq data. The results of mutation typing based on gene expression profile will be strongly supported when a particular mutation will be covered by RNAseq reads. From my own experience, I know it is possible (for NPM1 mutation detection, see 4 ). For genes with high and middle expression level, the coverage can be even higher than that obtained from genome- or exome-level data. And in the case when DNA data are unavailable, it would be really fantastic. Because it demands completely different data processing, it may be considered for future versions of the package.

DDB: As we have outlined at the start of this review there may have been some misunderstanding around the purpose of this Workflow package/paper. We agree with the reviewer that direct detection of mutations/fusions from RNA-seq data is the best approach, and we now note this in the ‘Description of biological relevance section’ as stated above. The other gene signatures mentioned above could be incorporated in a workflow as singscore supports analysis and comparison of multiple gene sets.

We thank the reviewer for their time and effort directed towards reviewing our manuscript and for the helpful feedback they have provided. Where appropriate we have modified our manuscript to address the reviewer’s comments.
Below we include a point by point response to the reviewer’s comments and where appropriate we list corresponding changes to the manuscript.

Introduction
LH: The article presents a new package, which was developed from a previous one (‘singscore’) elaborated by the same authors. The package ‘singscore’ is a single-sample gene set scoring method which is valuable for analysis of transcriptomes of samples collected through the long time and not sequenced in the same run, experiment, platform or laboratory. Here, the authors apply the method for classification of TCGA acute myeloid (AML) samples using transcriptional ‘gene signatures’ identified by other authors (Verhaak and Ross) as typical for the NPM1c mutation, KMT2A (MLL) gene fusions, and PML-RARA gene fusions.

DDB: We note that the purpose of the initial manuscript may have been unclear, in part because it is listed under the F1000 Software Tool Article section. The purpose of this article is to present a workflow demonstrating the use of singscore, and it is a R/Bioconductor workflow illustrating the use of singscore, and so is not intended as a new package or tool.
https://www.bioconductor.org/packages/release/workflows/
https://www.bioconductor.org/packages/release/workflows/html/SingscoreAMLMutations.html
Some of the review comments below assume that this manuscript presents a specific software package for detecting NPM1 mutations, so we have clarified the manuscript as outlined below in order to make the purpose and intention more clear (note that the title of the manuscript has changed from ‘predicting mutations’ to ‘predicting mutation status’ based upon feedback from reviewer 1).
This work arose from an observation in another project that the Verhaak signature scored with singscore appeared to correlate strongly with mutation status, and we thought this would be an interesting example that might also help researchers to investigate the links between genetic lesions and corresponding transcriptional changes, an area in which the reviewer clearly has expertise.

LH: AML is a heterogeneous and multi-clonal disease. The transcriptome picture of AML is very complex and can result from different mutations, genomic rearrangements, and aberrant regulation of gene expression at different levels (see my last review paper 1 ). Sometimes, the gene expression profiles can overlap between samples with different genetic lesions. The examples are HOX-gene-based discriminative signatures determined by Verhaak and Ross, not limited to AML with mutated NPM1, but specific also for AML cases with 11q23 abnormalities and KMT2A (MLL) gene rearrangements. The authors are aware of this fact and underline it in the article, too. On the other side, samples with the same mutation can present varying expression profiles because of additional features which also affect the transcriptome.

LH: Therefore, interpretation of the results must be done with caution. For me, the authors interpret the results a little bit too optimistically. The presented method is not sufficient to determine the mutation of interest, but can be used as a supplementary approach, a screening tool or (in future) a personalized medicine tool which could classify patients based on transcriptomic profiles, associated with specific treatment response.

DDB: We agree with the reviewer that gene set scoring of transcriptomic data should not be the only method used to identify patient samples carrying genetic lesions. The Bioconductor workflow we present is intended to provide an example of applying singscore to study mutation/fusion based gene sets as we believe that singscore provides a relatively flexible and intuitive approach for investigating different gene sets across large data sets. We feel that a particularly useful feature is the ability to combine different signatures/gene sets (including phenotype/cell-cycle signatures etc) to explore how these transcriptional changes are associated over different samples, demonstrated in Figures 5-7.
We have modified the “Description of biological relevance” section to address the reviewer’s comments around the complexity of AML genomic lesions and corresponding transcriptomic changes. The review highlighted is particularly relevant for this work and accordingly we mention the complexity of AML and direct the reader towards this resource by extending the first paragraph (new text underlined):
.. A more recent study which has focused primarily on genomic data has further refined the clinically significant AML subtypes [Papaemmanuil (2016), NEJM], highlighting a number of co-occurring as well as mutually exclusive mutations. As the identification of putative driver fusions/mutations continues, work has also been directed towards how these lesions interact with each other and other features (e.g. cellular proliferation, changes due to phenotypic plasticity, or variation in post-transcriptional regulators such as microRNAs) to drive transcriptional changes as discussed in a recent review [Hanschuch (2019), J. Oncol.].

We have also added a paragraph at the end of this section which discusses some of the limitations for our approach and provides more context in which it could be applied:
While we demonstrate that singscore is capable of inferring mutation status from the transcriptional profile of AML samples, we note that it is best used to supplement alternative data which can provide a more definitive resolution of these lesions. Processing of raw RNA‑seq data will directly identify the presence of gene‑fusion products or mutations within protein-coding regions, although for many large data sets the quantified transcript abundance data are much easier to obtain without access agreements. The method can also be applied to legacy microarray data sets where genome and RNA sequencing data are unavailable. As such singscore provides a useful approach to supplement established methods for the study of genetic lesions in cancer. By exploring associations between different genomic and phenotypically relevant signatures, it may also help to better characterise true driver mutations which exert consistent effects on the transcriptome of AML samples and other cancers.

Testing package
LH: I have used the package in R with TCGA data and confirm the code works, it is fast and generates the same results and plots as presented in the article. Moreover, I have tested the whole procedure on my own RNA-seq dataset (not published yet but used for supplementary expression analysis of NPM1 alternative transcripts, see the article 2 ). My dataset contains 28 AML samples, including 8 with NPM1 mutation (verified with three independent approaches). The data load was more tricky as I started from the csv table with counts and I had to convert the DESeqDataSet object into the RangedSummarizedExperiment object. The best discrimination was achieved with the Verhaak signature, but only with the down-regulated genes. All samples with NPM1 mutation were clearly separated from others, however, 2 additional samples, without NPM1 mutation, were grouped together with NPM1c. It is possible they have KMT2A (MLL) rearrangements, but I cannot verify it now. Analyzing the plots, up-regulated genes and all genes form the signature are not as efficient, separating only 2 or 3 NPM1c samples from the rest. In the article, the authors also admit that the set of down-regulated genes has more discriminant power, but they claim up-regulated genes also contribute to the discrimination.

LH: I have no samples with PML-RARA fusion in my dataset and the status of KMT2A (MLL) gene was unknown in my patients, so I could not compare the efficiency of signatures other than NPM1c. APL with PML-RARA is the most distinctive AML subtype which is easily separated from other AML samples based on transcriptome profile, so I would expect good results. This example shows the package works well for samples with very specific gene expression profile.

DDB: We thank the reviewer for the exceptional effort and time invested to test our workflow on independent data - we hope results from this analysis have been informative for identifying other features within their data. Our workflow includes guidance for users who wish to import data from other sources such as those used by the reviewer.

Interpretation of the results
LH: Considering the result interpretation, I have some doubts. The singscores are composed of two components, an enrichment score and a dispersion estimate of ranks. I am informed that “high expression of up-regulated genes and low expression of down-regulated genes would result in higher scores”. This is logical. I also know the maximal ranges ([−1, 1] for signatures involving up- and down-regulated genes). However, which value is really high? For example, is 0.2 enough (it looks like from the plots in the article and from my own data, too) or maybe I should expect much more, e.g. 0.7? Similarly, which value of dispersion should I expect? I suppose, the lowest, but which is low enough or which range is optimal? What is difficult to understand is that “despite the range of scores increasing, the discriminant power drops moderately” (for up-regulated genes from Verhaak signature, when compared to down-regulated genes). I see the same paradox in my data – the scores are higher for up-reg. genes, which are much less efficient in discrimination between NPM1c+ and NPM1c- samples. It looks like scores only do not reflect the trends observed from score vs dispersion plots.

DDB: Interpretation of singscores is intentionally left to be problem specific as it generally requires some domain-specific knowledge of the biological system and corresponding signature genes – ideally the computational biologist or bioinformaticians working on each project can provide some guidance.
The basic interpretation of the score is the normalised mean rank of genes within the signature relative to all other genes in the sample. At the extrema this interpretation is relatively simple - near 1, a higher value would indicate that genes in the signature are expressed at higher levels relative to other genes. For scores towards zero, however, the interpretation can be much more difficult – a score of zero (on the range [-1,1]) could indicate the signature genes are tightly clustered around the sample-wide mean abundance, or it could indicate a highly-dispersed almost uniform distribution across the entire abundance range (with a mean near the mean of all genes). By exploring the singscores together with dispersion estimates this information is summarised, assisting in estimation of effect size variability.
Interpretation of scores depends on the context of the experiment and the typical behaviour of the gene set. A “high” score is best determined relative to other samples. This is achieved either by comparing scores from other samples in large datasets such as TCGA, or better, across a set of samples from a given experiment with known conditions. Other methods normalise the data before computing scores, and we note that a recent paper has applied z-score normalisation to results from singscore for comparison to ssGSEA [Cui et al (2019) Oncogene, DOI: 10.1038/s41388-019-1026-9].
All singscores for samples remain the same and do not have to be recomputed upon addition of new samples, and interpretation will improve as more samples are added to the study. For example, Gaussian mixture modelling could be used to separate the NPM1c scores based on our expectation that there are two groups. This could be switched with other unsupervised classification algorithms such as hierarchical clustering or k-means clustering. We have added an example analysis to the manuscript to demonstrate how scores can be interpreted in an unsupervised setting, under the section “Transcriptional signatures to predict mutation status/Unsupervised classification of mutations”.
There may be some cases where sample annotation is not available. In such scenarios, we are unable to build regression models to interpret scores. A higher singscore would provide stronger evidence for the signature but the magnitude is difficult to interpret without a reference. One approach to deal with this situation is to compare scores to those from other datasets where the mutations status is known. An alternative approach would be to compare scores within the dataset using unsupervised learning methods.
Here we demonstrate the use of three clustering methods (Gaussian mixture decomposition, k-means clustering, and hierarchical clustering) to stratify samples, and as we have done previously [wang et al (2012) Journal of clinical bioinformatics] use the adjusted Rand index (ARI) to compare classifications. As expected, supervised (GLM) classification results in the best prediction. This is followed by clustering based on the score using Gaussian mixture decomposition. Any other classification algorithm along with prior knowledge could be used to decompose scores into groups. The important characteristic of singscores is that they maintain the discriminant power of gene signatures therefore can be coupled with supervised, semi-supervised or unsupervised algorithms to perform stratification.

#Gaussian mixture model
m1 = Mclust(scoredf$Score, G = 2, verbose = FALSE)
#k-means clustering
m2 = kmeans(scoredf[, 5:6], centers = 2, nstart = 100)
#hierarchical clustering
m3 = hclust(dist(scoredf[, 5:6]))

mutation_inference = cbind(
'GLM' = prediction,
'mclust' = m1$classification,
'k-means' = m2$cluster,
'hclust' = cutree(m3, k = 2)
)
apply(mutation_inference, 2, adjustedRandIndex, scoredf$NPM1c.Mut)
```
Possible improvements
LH: Although the package is generally useful and well described, the following improvements will make it more friendly for less advanced R users, e.g.

LH: Instruction how to prepare and load the data other than those deposited in the GDC database.

DDB: We have noted in text that the rank matrix can be computed using either a SummarizedExperiment object, DGEList object, ExpressionSet object, numeric matrix or a numeric data frame. As such, a numeric matrix with sample names as column names and genes as row names would suffice. Scoring should be performed on length bias corrected measurements such as RPKM/FPKM or TPM and not CPM or raw counts.
Transcriptional signatures to predict mutation status/Score TCGA AML samples using the Verhaak signature - extract from text: “The `rankGenes` function will compute ranks from expression data in the form of either a numeric matrix, numeric data frame, ExpressionSet object, DGEList object or a SummarizedExperiment object”

LH: Annotation of samples on the plots with unique sample IDs (when a sample is misclassified, a user does not know which one it is it will also be helpful to localize a sample on various plots, e.g. generated with a signature of up- and down-regulated genes) – it is written in the article that the ‘singscore’ package supports different types of annotations (“Annotations of interest can be overlayed on each plot”) but I found only color code annotation whereas I would like to have color code for mutation type and additional text labels with sample IDs on the same plot.

DDB: Sample labels can be added to landscape plots but were not supported in other visualisations. We have added functionality to the latest version of the singscore package (v1.5.1) to allow labelling of samples in the score vs. dispersion plots. We have modified the text to clarify, and modified Figure 6 to label samples where classification uncertainty (NMP1c vs WT) is high to demonstrate this feature. The changes below have been made to the section: “Transcriptional signatures to predict mutation status/Diagnostics of the Verhaak signature”.
Sample annotations of interest (e.g. clinical annotations) can be colour coded on each plot. …

Figure 6:
```
select_aml = !mutated_gene %in% 'Other'

#label samples with an mclust NPM1c classification uncertainty of > 0.3
label_samples = substr(rownames(verhaak_scores), 6, 12) #sample ID from barcodes
label_samples[m1$uncertainty < 0.3] = NA

#project mutations onto the landscape
p1 = projectScoreLandscape(
p_mll_npm1c,
verhaak_scores,
rossmll_scores,
subSamples = select_aml,
annot = mutated_gene[select_aml],
sampleLabels = label_samples[select_aml]
)
p1 + theme(legend.box = 'vertical')
```
LH: A command which lists the samples typed as strong candidates for having particular mutation, ordered according to the calculated metrics.

DDB: As discussed in an earlier comment, we recommend such analyses to be problem specific. Generally, a higher score would indicate a stronger effect of genes in the signature relative to WT samples therefore samples with higher scores would be stronger candidates for mutations. Alternatively, the partitioning created using Gaussian mixture modelling could be used as a guide for separation and samples with a score much higher than the threshold would be the strongest candidates for the mutation.

LH: Generating a threshold line between samples with and without mutation.

DDB: See above discussion/recommendation.

LH: In future, it would be also good to include other signatures, e.g. the signature of 369 genes identified by Alcalay et al. in 2005 3 , discriminating AML patients with NPMc+ from patients with NPMc-, even in cases with rare chromosomal abnormalities.

LH: What I would expect the most from a package designed to identify mutations in transcriptomes, is a direct identification of mutations in RNAseq data. The results of mutation typing based on gene expression profile will be strongly supported when a particular mutation will be covered by RNAseq reads. From my own experience, I know it is possible (for NPM1 mutation detection, see 4 ). For genes with high and middle expression level, the coverage can be even higher than that obtained from genome- or exome-level data. And in the case when DNA data are unavailable, it would be really fantastic. Because it demands completely different data processing, it may be considered for future versions of the package.

DDB: As we have outlined at the start of this review there may have been some misunderstanding around the purpose of this Workflow package/paper. We agree with the reviewer that direct detection of mutations/fusions from RNA-seq data is the best approach, and we now note this in the ‘Description of biological relevance section’ as stated above. The other gene signatures mentioned above could be incorporated in a workflow as singscore supports analysis and comparison of multiple gene sets.



Comments:

  1. Diedrick

    Bravo, a sentence ..., great idea

  2. Keyser

    Yes indeed. It was with me too. Let's discuss this issue.

  3. Daran

    I think they will help you find the right solution. Don't be upset.

  4. Tugrel

    So what is next?

  5. Yolkree

    and I even really liked ...



Write a message