
Living systems are composed of multiple layers that encode information about the system. The primary layers are:

- Epigenome: Defined by chromatin configuration. The structure of chromatin is based on the way that histones organize DNA. DNA is divided into nucleosome and nucleosome-free regions, forming its final shape and influencing gene expression.
- Genome: Includes coding and non-coding DNA. Genes defined by coding DNA are used to build RNA, and cis-regulatory elements regulate the expression of these genes.
- Transcriptome: RNAs (e.g. mRNA, miRNA, ncRNA, piRNA) are transcribed from DNA. Some have regulatory functions, and mRNAs serve as templates for protein synthesis.
- Proteome: Composed of proteins. This includes transcription factors, signaling proteins, and metabolic enzymes.

Interactions between these components are all different, but understanding them can put particular parts of the system into the context of the whole. To discover relationships and interactions within and between layers, we can use networks.

## Introducing Biological Networks

Biological networks are composed as follows:

**Regulatory Net** – set of regulatory interactions in an organism.

- Nodes are regulators (e.g. transcription factors) and associated targets.
- Edges correspond to regulatory interaction, directed from the regulatory factor to its target. They are signed according to the positive or negative effect and weighted according to the strength of the reaction.

**Metabolic Net** – connects metabolic processes. There is some flexibility in the representation, but an example is a graph displaying shared metabolic products between enzymes.

- Nodes are enzymes.
- Edges correspond to regulatory reactions, and are weighted according to the strength of the reaction.

**Signaling Net** – represents paths of biological signals.

- Nodes are proteins called signaling receptors.
- Edges are transmitted and received biological signals, directed from transmitter to receiver.

**Protein Net** – displays physical interactions between proteins.

- Nodes are individual proteins.
- Edges are physical interactions between proteins.

**Co-Expression Net** – describes co-expression relationships between genes. This representation is quite general: unlike the other types of nets, it captures functional rather than physical interactions. It is a powerful tool in computational analysis of biological data.

- Nodes are individual genes.
- Edges are co-expression relationships.

Today, we will focus exclusively on regulatory networks. Regulatory networks control context-specific gene expression, and thus have a great deal of control over development. They are worth studying because their malfunction can cause disease.

## Interactions Between Biological Networks

Individual biological networks (that is, layers) can themselves be considered nodes in a larger network representing the entire biological system. We can, for example, have a signaling network sensing the environment governing the expression of transcription factors. In this example, the network would display that TFs govern the expression of proteins, proteins can play roles as enzymes in metabolic pathways, and so on.

The general paths of information exchange between these networks are shown in figure 21.4.

## Studying Regulatory Networks

In general, networks are used to represent dependencies among variables. Structural dependencies can be represented by the presence of an edge between nodes; as such, unconnected nodes are conditionally independent. Probabilistically, edges can be assigned a "weight" that represents the strength or the likelihood of the interaction. Networks can also be viewed as matrices, allowing mathematical operations. These frameworks provide an effective way to represent and study biological systems.
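The matrix view can be made concrete with a small sketch. The three gene names, the edge weights, and the helper functions below are hypothetical and chosen only for illustration; the convention assumed here is that entry `W[i][j]` holds the signed weight of the edge from regulator `i` to target `j`, with 0 meaning no edge.

```python
# Hypothetical three-gene regulatory network (names and weights are illustrative).
genes = ["tfA", "tfB", "geneC"]

# Each edge: (regulator, target, signed weight).
# Positive weight = activation, negative weight = repression.
edges = [("tfA", "tfB", 0.8), ("tfA", "geneC", 0.5), ("tfB", "geneC", -0.6)]


def adjacency_matrix(genes, edges):
    """Return W where W[i][j] is the signed weight of edge i -> j (0.0 if absent)."""
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    w = [[0.0] * n for _ in range(n)]
    for regulator, target, weight in edges:
        w[index[regulator]][index[target]] = weight
    return w


def out_degree(w):
    """Number of targets each node regulates (non-zero entries per row)."""
    return [sum(1 for x in row if x != 0) for row in w]


def in_degree(w):
    """Number of regulators acting on each node (non-zero entries per column)."""
    return [sum(1 for row in w if row[j] != 0) for j in range(len(w))]


W = adjacency_matrix(genes, edges)
```

Row sums and column sums of the non-zero pattern then give out-degree and in-degree, which is one simple example of a mathematical operation that the matrix form makes available.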

These networks are particularly interesting to study because malfunctions can have a large effect. Many diseases are caused by rewirings of regulatory networks. Regulatory networks control context-specific expression in development. Because of this, they can be used in systems biology to predict development, cell state, system state, and more. In addition, they encapsulate much of the evolutionary difference between organisms that are genetically similar.

To describe regulatory networks, there are several challenging questions to answer.

**Element Identification** What are the elements of a network? Elements constituting regulatory networks were identified last lecture. These include upstream motifs and their associated factors.

**Network Structure Analysis** How are the elements of a network connected? Given a network, structure analysis consists of examination and characterization of important properties. It can be done on biological networks but is not restricted to them.

**Network Inference** How do regulators interact and turn on genes? This is the task of identifying gene edges and characterizing their actions.

**Network Applications** What can we do with networks once we have them? Applications include predicting function of regulating genes and predicting expression levels of regulated genes.

^{1}More in the epigenetics lecture.

## Computational inference and analysis of genetic regulatory networks via a supervised combinatorial-optimization pattern

The post-genome era has brought about diverse categories of omics data. The inference and analysis of genetic regulatory networks play a prominent role in extracting inherent mechanisms, in discovering and interpreting the biological principles beneath complex phenomena, and eventually in promoting human well-being.

### Results

A supervised combinatorial-optimization pattern based on information and signal-processing theories is introduced into the inference and analysis of genetic regulatory networks. An associativity measure is proposed to define the regulatory strength/connectivity, and a phase-shift metric determines regulatory directions among components of the reconstructed networks. Thus, it solves the problem of undirected regulations arising from most current linear/nonlinear relevance methods. To limit computational and topological redundancy, we constrain the classified group size of pair candidates within a multiobjective combinatorial optimization (MOCO) pattern.

### Conclusions

We test the proposed approach on two real-world microarray datasets with different statistical characteristics. Thus, we reveal the inherent design mechanisms for genetic networks by quantitative means, facilitating further theoretical analysis and experimental design with diverse research purposes. Qualitative comparisons with other methods and certain related focuses needing further work are illustrated within the discussion section.

## Abstract

Unraveling molecular regulatory networks underlying disease progression is critically important for understanding disease mechanisms and identifying drug targets. The existing methods for inferring gene regulatory networks (GRNs) rely mainly on time-course gene expression data. However, most available omics data from cross-sectional studies of cancer patients lack sufficient temporal information, leading to a key challenge for GRN inference. Through quantifying the latent progression using random walks-based manifold distance, we propose a latent-temporal progression-based Bayesian method, PROB, for inferring GRNs from the cross-sectional transcriptomic data of tumor samples. The robustness of PROB to the measurement variabilities in the data is mathematically proved and numerically verified. Performance evaluation on real data indicates that PROB outperforms other methods in both pseudotime inference and GRN inference. Applications to bladder cancer and breast cancer demonstrate that our method is effective in identifying key regulators of cancer progression or drug targets. The identified ACSS1 is experimentally validated to promote epithelial-to-mesenchymal transition of bladder cancer cells, and the predicted FOXM1-target interactions are verified and are predictive of relapse in breast cancer. Our study suggests new effective approaches to clinical transcriptomic data modeling for characterizing cancer progression and facilitates the translation of regulatory network-based approaches into precision medicine.

## 2 SEBINI ARCHITECTURE

SEBINI uses a standard three-tier architecture: (1) a web-based client user interface, (2) an application logic middle tier consisting of a suite of Java servlets and other Java programs (>100 Java classes) and (3) a relational database storing the data required by the middle tier. Inferred networks (as well as the raw data, discretized data and algorithm parameter selections used to generate the networks) are permanently stored in the database for visualization, topological and statistical analysis, and for later export in a human-readable or program-specific format. Inference and discretization (binning) algorithms can be any sort of executable program; a Java handler class is added for each new algorithm to handle communication between the invocation web page, the database and the algorithm. Security is implemented on a project basis, with one owner and possibly multiple users per project.

Major design issues included (1) the interface for user navigation among possibly huge datasets, allowing easy drill down from a network set to a specific network to a specific node or edge and (2) producing an efficient, understandable mapping from the inferred networks and inferred edges back to the corresponding original expression data. Note that we have one-to-many relationships from an expression dataset to a binned expression dataset, as well as a one-to-many relationship between a binned dataset and the inferred network and inferred edges created by the selected inference algorithm. Records for each of these data types are permanently stored and connected to the appropriate records of the other data types. Other design decisions: all inter-servlet communication is routed through a CentralControl servlet, for a clear (and reusable) flow of control. Each binning and inference algorithm is invoked in a separate Java thread that performs job posting to the database, thus allowing dynamic monitoring of job progress by the user. Jobs are timed to the millisecond, allowing comparison between algorithms of relative speed versus relative power.

SEBINI was initially implemented on a Dell desktop running Red Hat Linux, using Java ver. 1.4, PostgreSQL ver. 7.4 and Tomcat 4.1. SEBINI has also been installed on a Windows 2003 Web Server. Machine-specific parameters are stored in an easily changed properties text file. Mathworks' MATLAB is required for some of the inference algorithms.

## CHARACTERIZATION OF NETWORK TOPOLOGY

Perhaps the most general level of network analysis comes from global network measures that allow us to characterize and compare the given network topologies (i.e. the configuration of the nodes and their connecting edges). Global measures such as the degree distribution (the degree of a node is the number of edges it participates in) and the clustering coefficient (the number of edges connecting the neighbours of the node divided by the maximum number of such edges) have recently been thoroughly reviewed in the context of cellular networks [8] and in proteomics [9]. It has been proposed that these quantitative graph concepts can efficiently capture the cellular network organization, providing insights into their evolution, function, stability and dynamic responses [10]. For instance, several types of surveyed biological networks, such as PPI, gene regulation and metabolic networks, are thought to display scale-free topologies (i.e. most nodes have only a few connections whereas some nodes are highly connected), characterized by a power-law degree distribution that decays more slowly than an exponential. This particular type of network topology is also frequently observed in numerous non-biological networks and it can be generated by simple and elegant evolutionary models, where new nodes attach preferentially to sites that are already highly connected. Numerous improvements to this generic model include, for instance, iterative network duplication and integration to its original core, leading to hierarchical network topologies, which are characterized by non-constant clustering coefficient distribution [8, 10].
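The two global measures defined above can be computed directly from an edge list. The following is a minimal sketch for an undirected, unweighted graph; the function names and the toy four-node graph are assumptions made for illustration and do not come from the reviewed literature.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


def clustering_coefficient(adj, node):
    """Edges among the neighbours of `node`, divided by the maximum possible number."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no neighbour pair exists
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


# Toy graph: a triangle a-b-c plus a pendant node d attached to a.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")]
toy_adj = build_adjacency(toy_edges)
```

In the toy graph, node `a` has three neighbours of which only one pair (`b`, `c`) is connected, so its clustering coefficient is 1/3, while `b` sits in a closed triangle and scores 1.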

It should be noted, however, that, in practice, the architecture of large-scale biological networks is determined with sampling methods, resulting in subnets of the true network, and only these partial networks can be applied to characterize the topology of the underlying, hidden network [11]. It has recently been recognized that it is possible to extrapolate from subnets to the properties of the whole network only if the degree distributions of the whole network and randomly sampled subnets share the same family of probability distributions [12]. While this is the case in specific classes of network graph models, including classical Erdős–Rényi and exponential random graphs, the condition is not satisfied for scale-free degree distributions. Accordingly, recent studies in interactome networks have revealed that the commonly accepted scale-free model for PPI networks may fail to fit the data [13]. Moreover, limited sampling alone may as well give rise to apparent scale-free topologies, irrespective of the original network topology [14]. These results suggest that interpretation of the global properties of the complete network structure based on the current, still limited, accuracy and coverage of the observed networks should be made with caution. Moreover, while the scale-free and hierarchical graph properties can efficiently characterize some large-scale attributes of networks, local modularity and network clustering are likely to be the key concepts in understanding most cellular mechanisms and functions.

## 1 Introduction

Modeling the coupled dynamics of gene (protein) expression patterns in accordance with changing internal and environmental conditions is an important task in systems biology. To characterize and uncover the exact dynamics of genome-wide gene regulatory networks (GRNs), significant research effort has been devoted to continuously refining computational methods that will allow researchers to understand the complex interactions of gene regulations (Hughes *et al.*, 2000). Such methods, often referred to as reverse engineering (Karlebach and Shamir, 2008; Madhamshettiwar *et al.*, 2012; Prill *et al.*, 2010; Stolovitzky *et al.*, 2007), have been used to fit discrete models of GRNs to high-throughput experimental data. In the literature, gene expression-based inference approaches have shown modest performance when applied to real data compared to *in silico* expression data (Madhamshettiwar *et al.*, 2012; Marbach *et al.*, 2012). In addition, predictive performance over a purely microarray expression-based approach can be improved by incorporating multiple types of data, such as gene set enrichment (Chouvardas *et al.*, 2016), sequence information (Yu *et al.*, 2014) and network topology (Hartemink *et al.*, 2001).

On the other hand, GRNs have commonly been modeled using ordinary differential equations (ODE), Boolean networks and probabilistic graphical models including Bayesian networks (de Hoon *et al.*, 2002; Friedman *et al.*, 2000; Lovrics *et al.*, 2014). To reassess reconstructed GRN models in light of additional evidence, computational methodologies have recently been developed and formalized mathematically, in order to rigorously integrate prior biological knowledge and high-throughput measurements (Covert *et al.*, 2004; Gat-Viks *et al.*, 2006). Furthermore, such methodologies have been formalized in a manner that allows for good predictive descriptions of experimental data. Regardless of the modeling or computational approach applied, it is important to assess the validity of such networks. Given the topology of a biological network and a partial set of microarray expression profiles for all genes in the network, a reverse engineering algorithm must infer a probabilistic dynamical system that best *explains* the observed experimental data. In this article, we consider this reverse engineering problem. We describe the *dynamics* of a network as trajectories of gene-expression levels at steady state, given experimental conditions.

In the literature, some methods that can take a biological network and simulate biological data of different genes as either time-series data or steady-state values have been proposed. One of these is *sgnesR* (Tripathi *et al.*, 2017), an R package used to simulate a gene-expression profile from a given gene network using the stochastic simulation algorithm, for which the reaction parameters are specified under defined constraints. Similarly, a multi-view genomic data simulator proposed by Fratello *et al.* (2015) can generate synthetic biological data from ODE-based network models with known parameters, constructed through an iterative procedure. Simulated datasets, although fully controlled, are often too simplistic to efficiently explain the complex regulatory interactions among biological entities compared to real gene-expression data. Another widely used simulation and modeling tool in systems biology is the complex pathway simulator (COPASI) (Hoops *et al.*, 2006; Klipp *et al.*, 2008). COPASI is a stand-alone program that specializes in setting up and analyzing biochemical and kinetic network models while also providing some basic stoichiometric analyses. It allows for more detailed and fine-grained analysis, but also demands more knowledge, namely about the kinetics of individual processes. An important factor in the simulation of these models is the knowledge of kinetic reaction parameters. This information can be extracted from the literature; however, it is hard to find (Klipp *et al.*, 2008). The lack of kinetic constants stems from difficulty in measurements and uncertainties in the function of many proteins and their interactions, and thus limits the application of some of these approaches. However, these simulators provide valuable information that can be used to test network inference methods qualitatively, as well as to identify model parameters.

In our work, we apply a probabilistic model to statistically assess the global consistency between GRNs and the gene-expression profile of diverse experimental conditions. Therefore, we explore a probabilistic framework that allows us to model uncertainty in cellular networks through integration of prior biological knowledge and high-throughput experimental data. We formalize the model as a probabilistic factor graph (Kschischang *et al.*, 2001), which can handle highly complex systems and extensive datasets. This probabilistic model allows us to overcome the drawbacks of models that assume noiseless observations, because it is able to mix noisy continuous measurements with discrete regulatory relations among variables. Furthermore, it does not require the explicit determination of network kinetic parameters. Our method is applied to *Escherichia coli* DNA microarray data, where it is successfully used to predict the global allowable steady state of genes in the respective extracted sub-networks. Our analyses are performed on real gene-expression data and networks. The method is further validated using network perturbation techniques (Maslov, 2008), as well as gene deletion experiments. The rest of this article is organized as follows: In Section 2, we formulate a probabilistic factor graph network (FGN) framework for the analysis of biological networks given experimental data. We then describe the inference model, which applies a message-passing algorithm. Section 3 elucidates examples of the regulatory networks with a brief discussion on data discretization methodology. Section 4 presents statistical analyses of cellular network examples using the described framework. The article is concluded in Section 5.

## Methods

Based on probability and signal processing theories, the following section introduces a dimensionless metric for regulatory strengths and a phase-shift metric for determining regulatory orientations. For network inference, we propose a combinatorial-optimization framework for constraining the inference complexities. The framework allows the possibility of incorporating acquired knowledge and specific aims for integrative mining and analysis.

### Probability theory-based inference of biological network structures

Correlation analysis aims to reveal the strength of a linear relationship between random variables (R.V.); the statistical correlation coefficient represents the departure of two R.V. from independence. Among the various metrics often used to measure correlation or association, the *Pearson* product-moment correlation coefficient is applicable to data of diverse characteristics. Normally, the correlation $\rho_{X,Y}$ is defined as the covariance of two R.V. divided by the product of their standard deviations, which can be represented as [7, 10, 12, 13]

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where cov indicates covariance, $E$ is the expected value operator, $\mu_X = E(X)$, and $\sigma_X^2 = E[(X - E(X))^2] = E(X^2) - E^2(X)$.
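As a hedged illustration of the definition above, the sample version of the coefficient can be computed in a few lines. The function name and the error handling are choices made for this sketch; expectations are replaced by sample means over paired observations.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and both variances cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


rho_up = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
rho_down = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A perfectly increasing linear relationship gives a value of 1 and a perfectly decreasing one gives -1, up to floating-point rounding.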

When interpreting the *Pearson* product-moment correlation coefficient, Cohen noted that the proposed interpretative criteria were arbitrary in general and that specific treatments should be adopted for specific cases in those ranging from physics to other social sciences [22]. Apart from the parametric statistic, nonparametric correlation metrics such as the $\chi^2$ test, Spearman's *ρ*, and Kendall's *τ* are proposed, and those metrics can be applied to problems of diverse nonnormal distributions [23].

### Information-theoretic inference of biological network structures

To quantify the mutual dependence of two R.V., mutual information is frequently adopted in information-theoretic applications as an alternative to the above metric. The mutual information of two discrete R.V. can be defined as [24]

$$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log_b \frac{p(x,y)}{p_1(x)\, p_2(y)}$$

where $p(x,y)$ denotes the joint probability distribution of $X$ and $Y$, and $p_1(x)$ and $p_2(y)$ represent the marginal probability distributions of $X$ and $Y$ respectively. The measure normally adopts the well-defined form $I(X, Y, b)$, where $b$ denotes the base of the logarithm. In general, a base of 2 is specified, so that the measure is expressed in bits. Thus, for analysis within this context, we consistently use the base of 2.
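A minimal sketch of the discrete definition above, assuming the probabilities are estimated by counting paired observations. The function name is hypothetical, and continuous expression values would first need to be discretized (binned), which is not shown here.

```python
import math
from collections import Counter


def mutual_information(xs, ys, base=2):
    """Plug-in estimate of I(X, Y) from paired discrete observations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))   # counts for p(x, y)
    count_x = Counter(xs)          # counts for the marginal p1(x)
    count_y = Counter(ys)          # counts for the marginal p2(y)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y), base)
    return mi


# Two identical binary profiles share 1 bit; two independent ones share none.
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Only observed (x, y) combinations enter the sum, which matches the usual convention that terms with zero joint probability contribute nothing.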

### Associativity measure for describing regulatory connectivity

The above-described measures illustrate the correlation and dependence relationships of R.V. Normally, these R.V. characterize different entities within a system. The interconnections in the biological network can be weighted by the probability of association between the pairs being investigated [25]. Since the above metrics, *i.e.* the *Pearson* product-moment correlation and mutual information, are dimensionless vector quantities, we introduce an associativity measure (AM) for illuminating the connectivity between candidate pairs. Within this uniform measure, the quantities of mutual information and correlation metrics can be projected onto the orthogonal coordinates of a 2D plane. The metric is represented in a formal term as,

where *MI*_{i} and *Cor*_{i} denote the mutual information and correlation quantities respectively; *ω*_{i1} and *ω*_{i2} represent the weights of both quantities; *α*_{i} is the phase difference for the *i*th pair candidate; and *N* is the set of natural numbers. Note that the weights here aim to leverage any possible asymmetric distribution within the datasets of the above subterms *MI*_{i} and *Cor*_{i}. The weights can be derived from previously acquired knowledge or from a specific theoretical hypothesis, *e.g.* the respective centroids of datasets.

### Phase-shift metric for determining regulatory directions

Currently, most gene expression profiles are discrete time-series data. The data samples are diverse expression densities measured at multiple time points, and the data intervals represent the sampling periods. When *n* samples are compared, a total of *n*(*n*-1)/2 pairwise comparisons are obtained. Butte *et al.* utilized a type of signal processing method to cluster and compare the similarity of expression profiles [26]. For every potential pairwise regulation, the activities of the investigated genes can be modularized as a subsystem. Their expression patterns might be viewed as input and output signals, as shown in Figure 9.

Each pairwise association might be modularized as a subsystem with the expression patterns serving as input and output signals.

For each pair, the coherence, gain, and phase shift might be calculated by discrete Fourier transform (DFT) of the inputs and outputs. The coherence of signals *a* and *b* is a function of the power spectral density (PSD) and the cross power spectral density (CPSD), defined as below,

$$Coh_{ab}(f) = \frac{|CPSD_{ab}(f)|^2}{PSD_{aa}(f)\, PSD_{bb}(f)}$$

where *PSD*_{aa}(*f*), *PSD*_{bb}(*f*), and *CPSD*_{ab}(*f*) measure the PSD and CPSD of the associated pairwise signals. The symbol *f* denotes frequency. Normally, signals *a* and *b* are of the same length. A coherence of 1 indicates a perfectly linear (scalar-multiple) relationship between the two investigated signals at that frequency, while 0 indicates that the two signals are not linearly related. The transfer function (TF) between two associated input/output signals measures the signal amplification and related time lag/latency properties, and is defined as,

$$TF_{ab}(f) = \frac{CPSD_{ab}(f)}{PSD_{aa}(f)}$$

The transfer functions are in general complex-valued. The arctangents of the ratios of their imaginary to real parts are the corresponding transfer phases (TP), and their absolute values are the related transfer gains (TG). Both metrics are represented as,

$$TP_{ab}(f) = \arctan\frac{\operatorname{Im}\, TF_{ab}(f)}{\operatorname{Re}\, TF_{ab}(f)}, \qquad TG_{ab}(f) = \left| TF_{ab}(f) \right|$$

Theoretically, the TP illustrates the phase shift between the investigated pairwise signals, *i.e.* the input and output. The phase shift ranges might be allocated within -π to π, where -π represents a phase lead of half a wavelength and π denotes a phase lag of half a wavelength. Whether the input signals are amplified or attenuated at the output is indicated by the transfer gain, which quantifies the degree of amplification at different frequencies. The larger the gain, the less energy is lost at the output. Note that at different frequencies, the transfer phase and relative transfer gain might differ from each other. An effective evaluation criterion for these metrics is the related coherence: at frequencies where the coherence values are high, the corresponding transfer phases and gains are much more reliable than at others.
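The coherence, transfer gain, and transfer phase described in this section can be sketched with a plain DFT. This is an illustrative sketch under stated assumptions, not the authors' implementation: spectra are averaged over non-overlapping segments (with a single segment the coherence is identically 1 and therefore uninformative), the cross spectrum is taken as conj(A)·B, and all function names are hypothetical. Under this sign convention a delayed output yields a negative phase, so the sign should be flipped if a lag-positive convention is wanted.

```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def averaged_spectra(a, b, seg_len):
    """Average PSD_aa, PSD_bb and CPSD_ab over non-overlapping segments."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    n_seg = len(a) // seg_len
    if n_seg < 1:
        raise ValueError("seg_len is longer than the signals")
    psd_aa = [0.0] * seg_len
    psd_bb = [0.0] * seg_len
    cpsd_ab = [0j] * seg_len
    for s in range(n_seg):
        fa = dft(a[s * seg_len:(s + 1) * seg_len])
        fb = dft(b[s * seg_len:(s + 1) * seg_len])
        for k in range(seg_len):
            psd_aa[k] += abs(fa[k]) ** 2 / n_seg
            psd_bb[k] += abs(fb[k]) ** 2 / n_seg
            cpsd_ab[k] += fa[k].conjugate() * fb[k] / n_seg
    return psd_aa, psd_bb, cpsd_ab


def coherence_gain_phase(a, b, seg_len, k):
    """Coherence, transfer gain and transfer phase at frequency bin k."""
    psd_aa, psd_bb, cpsd_ab = averaged_spectra(a, b, seg_len)
    coherence = abs(cpsd_ab[k]) ** 2 / (psd_aa[k] * psd_bb[k])
    tf = cpsd_ab[k] / psd_aa[k]
    # cmath.phase is a quadrant-aware arctangent with values in [-pi, pi].
    return coherence, abs(tf), cmath.phase(tf)


# Synthetic pair: the output is the input doubled and delayed by a quarter period.
period = 8
sig_a = [math.sin(2 * math.pi * t / period) for t in range(32)]
sig_b = [2.0 * math.sin(2 * math.pi * (t - 2) / period) for t in range(32)]
coh, gain, phase = coherence_gain_phase(sig_a, sig_b, seg_len=period, k=1)
```

For this noise-free synthetic pair the coherence at the signal frequency is 1, the gain is 2, and the phase corresponds to a quarter-period delay. With real expression profiles the number of time points is usually small, so the segment length and the resulting frequency resolution become a genuine trade-off.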

The advantages of such metrics lie in the flexible and quantitative characteristics of determining the regulatory delay via dynamic threshold. Factual regulatory mechanisms have multiple possibilities, and inherent regulatory delay effects might vary during the whole biological processes. The phase-shift metric determines such possibilities underlying regulatory mechanisms in a quantitative manner. The advantages include the inherent capabilities of integrating *a priori* biological knowledge. This kind of knowledge-based inference method avoids redundant false-positive connectivities within pairwise candidates.

Such dynamic threshold is applicable to the majority of problems facing theoretical and experimental biologists. Since regulatory connectivity underlying pairwise candidates may differ in diverse processes or at different sampling times, systematic and quantitative determination of these regulations with empirical and theoretical knowledge will be much more effective than those generated by most currently-available computational approaches [17]. Such types of flexible network connectivities and regulations characterize major regulatory processes from the perspectives of information and signal processing theories.

### A MOCO pattern for constraining computational complexities

In the following sections, we extract inherent regulations and decipher network structures by introducing a pairwise gene hierarchy criterion (PGHC) for classifying possible gene pairs into three major groups as follows.

Authentic Pairwise Genes (APGs): These include pairs with mutual information values and correlation coefficients larger than specific thresholds. Moreover, the corresponding *P* value is below the significance level, namely, smaller than 0.05.

Questionable Pairwise Genes (QPGs): These include pairs that do not satisfy both of the thresholds mentioned above. The group contains pairs of two classes. One class has pairs with mutual information larger than specific thresholds that satisfy neither the criteria of correlation coefficients nor those of *P* values. The other class includes pairs with correlation coefficients larger than specific thresholds and with *P* values below the significance level, but whose mutual information does not satisfy specific thresholds.

Unauthentic Pairwise Genes (UPGs): These include those pair candidates that do not satisfy any criteria of the APGs or QPGs defined above.

The QPGs actually act as a subsidiary candidate pool for the APGs in case the empirical thresholds are set too high to extract structures merely from the APGs. Under such conditions, the QPGs will be ranked according to mutual information values, correlation coefficients, and *P* values. Optimal pairs will be allocated to the APGs to refine the former network connectivity. The algorithm for the supervised PGHC is shown in table 1.
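The three-group classification can be sketched as follows. This is an illustrative reading of the criteria above rather than the algorithm of table 1: the function names, the example pairs, and the thresholds are hypothetical, and the sketch assumes that the correlation criterion counts as satisfied only when both the coefficient exceeds its threshold and the *P* value is below the significance level.

```python
def classify_pair(mi, cor, p_value, mi_thr, cor_thr, alpha=0.05):
    """Assign one gene pair to 'APG', 'QPG' or 'UPG'."""
    mi_ok = mi > mi_thr
    # Assumption: the correlation test passes only if it is also significant.
    cor_ok = cor > cor_thr and p_value < alpha
    if mi_ok and cor_ok:
        return "APG"   # both criteria satisfied
    if mi_ok or cor_ok:
        return "QPG"   # exactly one criterion satisfied
    return "UPG"       # neither criterion satisfied


def partition_pairs(pairs, mi_thr, cor_thr, alpha=0.05):
    """Group (name, mi, cor, p_value) tuples by their PGHC class."""
    groups = {"APG": [], "QPG": [], "UPG": []}
    for name, mi, cor, p_value in pairs:
        label = classify_pair(mi, cor, p_value, mi_thr, cor_thr, alpha)
        groups[label].append(name)
    return groups


# Hypothetical candidates: (pair name, mutual information, correlation, P value).
candidates = [
    ("g1-g2", 0.9, 0.8, 0.01),
    ("g1-g3", 0.9, 0.2, 0.50),
    ("g2-g3", 0.1, 0.8, 0.01),
    ("g3-g4", 0.1, 0.2, 0.50),
]
moderate = partition_pairs(candidates, mi_thr=0.5, cor_thr=0.6)
very_strict = partition_pairs(candidates, mi_thr=0.95, cor_thr=0.9)
```

Raising both thresholds moves pairs out of the APGs and into the UPGs, which is the situation in which the QPGs would serve as a reserve pool; how the QPGs are ranked and promoted is left out of this sketch.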

Thus, network reconstruction might be transformed into a class of MOCO problems [10, 12, 13]. The optimization objectives include first reaching suitable thresholds for mutual information and correlation coefficient to maximize the feasible components in the APGs. The inference might be carried out with much more confidence and reliability. The second objective is to maximize the UPGs. The larger the UPGs, the fewer the problems faced during further solution searching. This decreases the feasible solution space for subsequent computations. In addition, the following relative constraints exist. There are nonnegative constraints for the sizes of groups, and the total number of pair candidates is fixed, *i.e.* the valid combinatorial space is limited. The gain thresholds for guaranteeing valid network connectivity and previously-acquired biochemical knowledge and different experimental conditions constitute other prominent constraints for the reconstruction process. The MOCO paradigm is described as follows,

where *F*_{i} is the multiobjective function set; *S*_{1} is the set of feasible group combinations for APGs, QPGs, and UPGs; *S*_{2} is the number set of all gene pairs (*S*_{2} = <*n*(*n*-1)/2>, where *n* is the total number of genes); *S*_{3} is the set of necessary gain constraints (GC); and *S*_{4} is the set of possible constraints from acquired biological knowledge (ABK).

Recently quite a few authors have argued the necessity of incorporating the preferences of decision-makers (DM) into MOCO solution selection [27–29]. For the problem under investigation, the DM’s preferences mainly stem from the GC (*S*_{3}) and ABK (*S*_{4}) illustrated above.

In cases governed by lower thresholds of mutual information and correlation metrics, APGs will form the group with the maximum components within the total pair candidates. On the other hand, with the heightened thresholds, many more pairs might be grouped into UPGs. This reduces the computational complexity for network reconstruction since APGs have fewer components in such situations. If APGs are classified with above-normal sizes, the reconstructed network will be densely connected and will have much more redundancies. On the contrary, a sparsely connected structure will be inferred with an undersized candidate group of APGs.

Because biological theoreticians and experimentalists may vary the mutual information and correlation thresholds to incorporate empirical or concrete knowledge into the reconstruction procedure, coordination through the MOCO framework may be both feasible and significant, especially for networks containing pivotal structural connectivity or for specific analysis purposes.

The APGs, QPGs, and UPGs evolve as the thresholds on the above metrics change dynamically and as related biochemical knowledge is applied, as shown in Figure 10.

**Schematic representation of the MOCO problem by dynamic thresholding of mutual information and correlation metrics.** Total pairs are classified into APGs, QPGs and UPGs. The upper rightward horizontal arrow represents dynamic thresholding by mutual information, and the left descending arrow is for thresholding of the correlation measure.

## Differential gene regulatory networks in development and disease

Gene regulatory networks, in which differential expression of regulator genes induces differential expression of their target genes, underlie diverse biological processes such as embryonic development, organ formation and disease pathogenesis. An archetypical systems biology approach to mapping these networks combines (1) high-throughput sequencing-based transcriptome profiling (RNA-seq) of biopsies under diverse network perturbations and (2) network inference based on gene-gene expression correlation analysis. Comparing such correlation networks across cell types or states, known as differential correlation network analysis, can identify specific molecular signatures and functional modules that underlie the state transition or have context-specific function. Here, we review the basic concepts of network biology and correlation network inference, and the prevailing methods for differential analysis of correlation networks. We discuss applications of gene expression network analysis in the context of embryonic development, cancer, and congenital diseases.

**Keywords:** Coexpression networks; Correlation; Systems biology; Transcriptomics
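As a rough illustration of differential correlation network analysis, the sketch below compares gene-gene Pearson correlations between two states using the Fisher z-transformation, one generic way to score a change in correlation. It is not the procedure of any specific method covered by the review; the function name, the synthetic data, and the choice of Pearson correlation are assumptions made for this example.

```python
import numpy as np

def differential_correlation(expr_a, expr_b):
    """Z-scores for the change in gene-gene correlation between two states.

    expr_a and expr_b are genes x samples arrays with the same gene order.
    """
    n_a, n_b = expr_a.shape[1], expr_b.shape[1]
    # Clip so that arctanh stays finite for perfect correlations (diagonal).
    r_a = np.clip(np.corrcoef(expr_a), -0.999999, 0.999999)
    r_b = np.clip(np.corrcoef(expr_b), -0.999999, 0.999999)
    # Fisher z-transform; the difference has variance 1/(n_a-3) + 1/(n_b-3).
    z = (np.arctanh(r_a) - np.arctanh(r_b)) / np.sqrt(
        1.0 / (n_a - 3) + 1.0 / (n_b - 3)
    )
    np.fill_diagonal(z, 0.0)
    return z

# Synthetic example: genes 0 and 1 are coupled in state A only.
rng = np.random.default_rng(0)
state_a = rng.normal(size=(3, 200))
state_a[1] = state_a[0] + 0.1 * rng.normal(size=200)
state_b = rng.normal(size=(3, 200))

z_scores = differential_correlation(state_a, state_b)
print(np.round(z_scores, 2))
```

Gene pairs with a large absolute z-score are candidates for state-specific coexpression; in practice the scores would still need multiple-testing correction.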

Dana-Farber Cancer Institute, Medical Oncology, Boston, MA, USA

Universität der Bundeswehr München, Department of Computer Science, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany

Tampere University of Technology, Computational Medicine and Statistical Learning Laboratory, Department of Signal Processing, Tampere, Finland

UMIT –The Health and Life Sciences University, Eduard Wallnoefer Zentrum 1, 6060 Hall Austria

Nankai University, College of Computer and Control Engineering, 300071 Tianjin, P.R. China

Tampere University of Technology, Predictive Medicine and Analytics Lab Department of Signal Processing, Tampere, Finland

### Summary

This chapter presents the basic steps required to conduct genome-scale gene regulatory network (GRN) inference and network-based functional analysis in the R programming environment. The analysis is performed on a large-scale multiple myeloma gene expression data set. The first step in inferring a GRN is data retrieval and preprocessing: the chapter shows how to retrieve gene expression data sets from the NCBI “GeoDB” database, preprocess them, and summarize probe sets for gene annotation based on “Entrez” gene identifiers and gene symbols. It uses a publicly available, preprocessed multiple myeloma data set from “GeoDB” with the accession “GSE4581”. The chapter then gives the basic gene expression data processing requirements for inferring and analyzing a GRN with the “bc3net” R package. “bc3net” is a bagging version of “c3net”: it aggregates an ensemble of “c3net” GRNs inferred from bootstrap samples of a gene expression data set.
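The bagging idea can be sketched in a few lines of Python. This is not the API of the “bc3net” R package, and it leaves out the statistical tests a real analysis would need. It assumes a simplified “c3net”-style rule in which each gene keeps only the edge to its highest-scoring partner, and it swaps in a Gaussian approximation of mutual information computed from the Pearson correlation. All function names and parameters are hypothetical.

```python
import numpy as np

def gaussian_mi(expr):
    """Mutual information under a Gaussian assumption: -0.5 * log(1 - r^2)."""
    r = np.corrcoef(expr)
    np.fill_diagonal(r, 0.0)                 # ignore self-pairs
    r = np.clip(r, -0.999999, 0.999999)
    return -0.5 * np.log(1.0 - r ** 2)

def c3net_like(expr, mi_cutoff=0.0):
    """Each gene contributes at most one edge, to its maximum-MI partner."""
    mi = gaussian_mi(expr)
    adjacency = np.zeros(mi.shape, dtype=bool)
    for gene, partner in enumerate(mi.argmax(axis=1)):
        if mi[gene, partner] > mi_cutoff:
            adjacency[gene, partner] = adjacency[partner, gene] = True
    return adjacency

def bagged_c3net(expr, n_boot=50, seed=0):
    """Fraction of bootstrap networks in which each edge appears."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = expr.shape
    counts = np.zeros((n_genes, n_genes))
    for _ in range(n_boot):
        # Resample the samples (columns) with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        counts += c3net_like(expr[:, idx])
    return counts / n_boot

# Synthetic example: gene 1 follows gene 0, gene 3 follows gene 2.
rng = np.random.default_rng(1)
expr = rng.normal(size=(4, 100))
expr[1] = expr[0] + 0.1 * rng.normal(size=100)
expr[3] = expr[2] + 0.1 * rng.normal(size=100)

edge_freq = bagged_c3net(expr)
print(np.round(edge_freq, 2))
```

Edges that recur in most bootstrap networks are the ones an aggregation step would retain; the cutoff for retaining an edge is a further choice that this sketch leaves open.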

### Supplementary Figure 1 Comparison of datasets simulated from synthetic networks by using BoolODE and GeneNetWeaver.

Each row corresponds to the synthetic network indicated by the label on the left. (a) The network itself, with red edges representing inhibition and blue edges representing activation. (b) A 2D t-SNE visualization of one BoolODE-generated dataset for 2,000 cells. The color of each point indicates the simulation time: blue for earlier, green for intermediate, and yellow for later times. (c) Each color corresponds to a different subset of cells obtained by using *k*-means clustering of the BoolODE-generated dataset, with *k* set to the number of expected steady states. (d) A 2D t-SNE visualization of one GeneNetWeaver output.

### Supplementary Figure 2 Box plots of AUPRC values for synthetic networks.

Each row corresponds to one of the six synthetic networks. Each column corresponds to an algorithm. Red, blue, yellow, purple and green box plots correspond to AUPRC values for 10 datasets with 100, 200, 500, 2,000, and 5,000 cells, respectively. The gray dotted line indicates the AUPRC value for a random predictor, which is equal to the network’s density. In every boxplot, the box shows the 1st and 3rd quartiles, and whiskers denote 1.5 times the interquartile range.

### Supplementary Figure 3 Box plots of AUROC values for synthetic networks.

Each row corresponds to one of the six synthetic networks. Each column corresponds to an algorithm. Red, blue, yellow, purple and green box plots correspond to AUROC values for 10 datasets with 100, 200, 500, 2,000, and 5,000 cells, respectively. The gray dotted line indicates the AUROC value for a random predictor (0.5). In every boxplot, the box shows the 1st and 3rd quartiles, and whiskers denote 1.5 times the interquartile range.

### Supplementary Figure 4 Box plots of AUPRC values for curated models.

Each row corresponds to one of the four curated models. Each column corresponds to an algorithm. Red, blue and yellow box plots correspond to AUPRC values for 10 datasets with no dropouts, a dropout rate of *q* = 50, and a dropout rate of *q* = 70, respectively. The gray dotted line indicates the AUPRC value for a random predictor, i.e., the network density. In every boxplot, the box shows the 1st and 3rd quartiles, and whiskers denote 1.5 times the interquartile range.

### Supplementary Figure 5 Box plots of AUROC values for curated models.

Each row corresponds to one of the four curated models. Each column corresponds to an algorithm. Red, blue and yellow box plots correspond to AUROC values for 10 datasets with no dropouts, a dropout rate of *q* = 50, and a dropout rate of *q* = 70, respectively. The gray dotted line indicates the AUROC value for a random predictor (0.5). In all boxplots, the box shows the 1st and 3rd quartiles, and whiskers denote 1.5 times the interquartile range.

### Supplementary Figure 6 Box plots of early precision values for curated models.

Each row corresponds to one of the four curated models. Each column corresponds to an algorithm. Red, blue and yellow box plots correspond to early precision values for 10 datasets with no dropouts, a dropout rate of *q* = 50, and a dropout rate of *q* = 70, respectively. The gray dotted line indicates the early precision value for a random predictor (network density). In each boxplot, the box shows the 1st and 3rd quartiles, and whiskers denote 1.5 times the interquartile range.
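To make the random-predictor baseline concrete, the following Python sketch computes early precision for a ranked edge list and compares it with the network density. It assumes that early precision is the fraction of true edges among the top-*k* predictions, with *k* equal to the number of edges in the reference network, and that density is counted over directed edges without self-loops. The function names and the toy scores are hypothetical.

```python
def early_precision(edge_scores, true_edges):
    """Fraction of true edges among the top-k predictions, k = len(true_edges)."""
    k = len(true_edges)
    top_k = sorted(edge_scores, key=edge_scores.get, reverse=True)[:k]
    return sum(edge in true_edges for edge in top_k) / k

def network_density(true_edges, n_genes):
    """Density over directed edges without self-loops (assumed convention)."""
    return len(true_edges) / (n_genes * (n_genes - 1))

# Toy reference network on 4 genes and hypothetical predicted edge scores.
n_genes = 4
true_edges = {(0, 1), (1, 2), (2, 3)}
edge_scores = {
    (i, j): 0.1 for i in range(n_genes) for j in range(n_genes) if i != j
}
edge_scores.update({(0, 1): 0.9, (1, 2): 0.8, (3, 0): 0.7, (2, 3): 0.6})

ep = early_precision(edge_scores, true_edges)
density = network_density(true_edges, n_genes)
ratio = ep / density  # values above 1 beat the random-predictor baseline
print(ep, density, ratio)
```

Dividing early precision by the density yields a ratio in which 1 marks the performance expected from a random predictor.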

### Supplementary Figure 7 Scalability of GRN algorithms on experimental single-cell RNA-Seq datasets.

Variation in running time and memory usage of GRN inference algorithms with respect to number of genes for three experimental single-cell RNA-Seq datasets. Each point represents the mean running time or memory across all three datasets and the shaded regions correspond to one standard deviation around the mean. Missing values indicate that the method either did not complete after one day or gave a runtime error. We did not consider SCNS since it took over a day on the 19-gene GSD Boolean model. We obtained these results on a computer with a 32-core 2.0GHz processor and 32GB of memory running Ubuntu 18.04.

### Supplementary Figure 8 Summary of EPR values for experimental single-cell RNA-Seq datasets with 500 and 1000 genes.

Summary of early precision ratio (EPR) results for experimental single-cell RNA-seq datasets. The left half of the figure (500 genes) shows results for datasets composed of the 500 most-varying genes. Each row corresponds to one scRNA-seq dataset. The first three columns report network statistics. The next six columns report EPR values. The right half (1000 genes) shows results for the 1000 most-varying genes. In both sections, algorithms are sorted by median EPR across the datasets (rows) for the 500 gene set. For each dataset, the color in each cell is proportional to the corresponding value scaled between 0 and 1 (ignoring values that are less than that of a random predictor, which are shown as black squares). We display the highest and lowest values for each dataset inside the corresponding cells. Abbreviations: GENI: GENIE3, GRNB: GRNBoost2, PCOR: PPCOR, SINC: SINCERITIES.

### Supplementary Figure 9 Summary of AUPRC ratio values for experimental single-cell RNA-Seq datasets with TFs + 500 and TFs + 1000 genes.

Summary of AUPRC ratio results for experimental single-cell RNA-seq datasets. The left half of the figure (TFs+500 genes) shows results for datasets composed of all significantly-varying TFs and the 500 most-varying genes. Each row corresponds to one scRNA-seq dataset. The first three columns report network statistics. The next six columns report AUPRC ratios. The right half (TFs+1000 genes) shows results for all significantly-varying TFs and the 1000 most-varying genes. In both sections, algorithms are sorted by median AUPRC ratio across the datasets (rows) for the TFs+500 gene set. For each dataset, the color in each cell is proportional to the corresponding value scaled between 0 and 1 (ignoring values that are less than that of a random predictor, which are shown as black squares). We display the highest and lowest values for each dataset inside the corresponding cells. Abbreviations: GENI: GENIE3, GRNB: GRNBoost2, PCOR: PPCOR, SINC: SINCERITIES.

### Supplementary Figure 10 Summary of AUPRC ratio values for experimental single-cell RNA-Seq datasets with 500 and 1000 genes.

Summary of AUPRC ratio values for experimental single-cell RNA-seq datasets. The left half of the figure (500 genes) shows results for datasets composed of the 500 most-varying genes. Each row corresponds to one scRNA-seq dataset. The first three columns report network statistics. The next six columns report AUPRC ratios. The right half (1000 genes) shows results for the 1000 most-varying genes. In both sections, algorithms are sorted by median AUPRC ratios across the datasets (rows) for the 500 gene set. For each dataset, the color in each cell is proportional to the corresponding value scaled between 0 and 1 (ignoring values that are less than that of a random predictor, which are shown as black squares). We display the highest and lowest values for each dataset inside the corresponding cells. Abbreviations: GENI: GENIE3, GRNB: GRNBoost2, PCOR: PPCOR, SINC: SINCERITIES.