Background: Machine learning (ML) is an intelligent data mining technique that has been widely used in engineering , computer sciences and informatics. The robustness of ML technologies have recently been evidenced in transcriptomic data analysis for clustering gene expression patterns and classifying human diseases and cancers. Differential network (DN) analysis is a recently developed approach for identifying biologically important genes with individual features generated by considering the changes of genes' interactions between two or more networks. DN analysis has been validated to be complementary to traditional differnetial expression (DE) analysis and is especially effective in detecting biologically important genes that have less dramatic expression changes for certain experiments. However, the combination of ML technologies and differnetial network analytic method to transcriptomic data analysis for more effectively identifying biologically important genes is still rarely performed.
Results: We developed a R package mlDNA for machine learning (ML)-based
differential network analysis of transcriptimic data. mlDNA can be used to:
(1) ML-based gene filtering for meaningful gene co-expression network inference by simultaneously considering four different classes of expression characteristics;
(2) Construct gene co-expression networks from gene expression data with five correlation and two noncorrelation measures;
(3) Perform differential network analysis to characterize genes with more than 30 network characteristics(such as degree, closeness, eccentricity, eigenvector, PageRank, and so on );
(4) Identify candidate genes of interest (e.g., stress-related genes) with ML-based prediction models integrated multiple network characteristics;
(5) Estimate the covergence degree between different experimential conditions;
(6) Quantify the activity of pathways;
(7) Detect condition-specifc expressed genes.
Application: mlDNA package has been successfully applied to reanalyze a series of Arabidopsis stress-response microarray datasets and identify a large number of stress-related candidate genes for salt, cold, drought, wound, heat and genotoxic stresses. More than 80 candidates have been evidenced by high-throughput phenotypic screening experiment. (see details in mlDNA paper).
Perspectives: The input of mlDNA only require a set of genes of interest and a pair of gene expression dataset under two biological conditions. This package would have a broad application potential for many conditions and species.
Notes: To successfully run mlDNA,  make sure the required packages (e.g., rsgcc, igraph, snowfall, bigmemory, e1071, gplots, hash, et al.) have already installed; When perform network inference and analysis for large number of genes (>10000), the bigmemory package (only run on Linux and Mac) is necessary for storing and operating big matrix generated from network inference and network characteristic calculation; Computers with a good amount of physical memory (12 GB to start with, more is no luxury) and processors are recommended for running mlDNA, since parallel programming will be performed for rapidly calcluating network properties (e.g., betweeness, closness, et al.).
mlDNA R package has been accepted and published on the Comprehensive R Archive Network
(CRAN; http://cran.r-project.org/web/packages/mlDNA. To install mlDNA and related packages, please type following commend in the R programming environment:
 mlDNA package (Version 1.1) currently only works with R Version 2.15.1 or higher. Please upgrade your R if you have an older version of R.
 mlDNA package requires the following packages: biwt, cairoDevice, e1071, fBasics, GeneSelector, grDevices, gplots, gWidgets, gWidgetsRGtk2, hash, igraph, nnet, parmigene, pROC, qvalue, ROCR, rsgcc, stringr, snowfall, and bigmemory. If you failed to install mlDNA, please try to load these packages (e.g., library(packageName) ) to check whether they have been installed successfully. The GeneSelector and qvalue packages are from Bioconductor and installed with the following commands:
The following tutorial guides the reader to implement mlDNA for predict stress-related candidate genes.
Before starting, the user should load the mlDNA package, and specify the number of CPUs for parallel computing,
the file directories for recording PSOL iteration results
and network-related results.
|#load mlDNA and related packages|
|cpus <- 6|
|# file directory for storing PSOL iteration results|
|PSOLResDic <- "/home/wanglab/mlDNA/PSOL/"|
|dir.create(PSOLResDic, showWarnings = FALSE)|
|#file directory for storing network-related results|
|netResFileDic <- "/home/wanglab/mlDNA/network/"|
|dir.create(netResFileDic, showWarnings = FALSE)|
Before running mlDNA, the user need to prepare a set of genes of interest (positive samples) and normalized gene expression datasets
from two different biological conditions. The sample size of gene expression data is at least five. In mlDNA package, we provided a sample data including 895 known
salt stress-related genes and two sets of time-series microarray data profiled in roots of Arabidopsis under normal condition and salt conditions.
|#load sample data.|
|#get gene expression data under control condition. Rownames and colnames are gene IDs and sample IDs, respectively.|
|ControlExpMat <- as.matrix(sampleData$ControlExpMat)|
|#get gene expression data under salt stress condition. Rownames and colnames are gene IDs and sample IDs, respectively|
|SaltExpMat <- as.matrix(sampleData$StressExpMat)|
|#get known salt stress-related genes (positive samples in the machine learning process)|
|positiveSamples <- as.character(sampleData$KnownSaltGenes)|
|# take a glance at the sample data|
|#Gene numbers, and sample size in ControlExpMat|
|#Expression values for the first five genes in ControlExpMat|
|#Gene numbers, and sample size in SaltExpMat|
|#Expression values for the first five genes in SaltExpMat|
|#Number of positive samples|
|#First 10 genes in positiveSamples|
mlDNA generated expression characteristics for each gene
based on the absolute expression values, within-condition expression variations
measured with z-score, between-condition expression changes measured as fold-change,
and coefficient of variation (cv) in two conditions.
|#Create two vectors for describing the sample information in the ControlExpMat and SaltExpMat.|
|#The numbers in these two vectors represent biological conditions.|
|#The replications under the same conditions were denoted with the same number.|
|#Two expression datasets should contain the same number of condtions.|
|sampleVec1 <- c(1, 2, 3, 4, 5, 6)|
|sampleVec2 <- c(1, 2, 3, 4, 5, 6)|
|#generate expression feature matrix with totally 32 characteristics for four measures including z-Score, fold change, cv and absoulte expression value)|
|featureMat <- expFeatureMatrix( expMat1 = ControlExpMat, sampleVec1 = sampleVec1, expMat2 = SaltExpMat, sampleVec2 = sampleVec2, logTransformed = TRUE, base = 2, features = c("zscore", "foldchange", "cv", "expression") )|
|take a glance at the sample data|
The mlDNA package can selectively output a subset of expression characteristics by modifying the parameter values of "features" in function "expFeatureMatrix". You can also add new characteristics to each gene into the feature matrix "featureMat".
The ML-based gene filtering process can be used to remove "non-informative" genes for co-expression network construction, providing a complementary approach to traditional differential expression analytic methods. As no negative samples were pre-determined, positive sample only learning (PSOL) algorithm was performed to solve the binary classification problem (genes whether filtered out or retained for network construction ) with supervised ML technologies.
|#"unlabeled" samples are genes not included in the positive sample set|
|unlabelSamples <- setdiff( rownames(featureMat), positiveSamples )|
|#start to implement PSOL algorithm. Note that the running time of PSOL algorithm is related to the iteration number, the size of positive sample set and the number of "unlabeled" samples. The parallel computing is highly recommended here.|
|#first, selecting an initial set of negative samples (i.e., "non-informative" genes) for building ML-based classification model. Three results ("featureMatrix_ED_adjmat_bfile", "featureMatrix_ED_adjmat_dfile" and "PSOL_InitialNegativeSelection_res.RData") will be generated in the file directory PSOLResDic. The first two are the bigmatrix backing and description files of adjacency matrix recording the Euclidean distances between any pairs of genes included in microarray dataset.|
|InitalRes <- PSOL_InitialNegativeSelection(featureMatrix = featureMat, positives = positiveSamples, unlabels = unlabelSamples, negNum = length(positiveSamples), cpus = cpus, PSOLResDic = PSOLResDic)|
|#get initial negative set and the updated unlabeled samples|
|negatives <- InitalRes$negatives|
|unlables <- InitalRes$unlabels|
|#Then, expanding negative samples at different iteration times of PSOL. The Random forest algorithm is selected to build ML-based prediction model, whose performance is eccessed with five-fold cross validation test.|
|PSOL_NegativeExpansion(featureMat = featureMat, positives = positiveSamples, negatives = negatives, unlabels = unlables, cpus = cpus, iterator = 50, cross = 5, TPR = 0.98, method = "randomForest", plot = TRUE, trace = TRUE, PSOLResDic = PSOLResDic, ntrees = 200 )|
|#extract PSOL results at the iteration times 1:10|
|res <- PSOL_ResultExtraction ( PSOLResDic, iterations = c(1:15) )|
|#obtain signal genes (genes in positive and unlabeled sample sets) at the 12th iteration of PSOL|
|signalGenes <- c( res[]$positives, res[]$unlabels )|
|#If plot is TRUE in the function "PSOL_NegativeExpansion", the distribution of filtered-out gene number and AUC at different iteration times (see Figure 1) will be automatically drawn in the file "PSOL_NegativeIncreasement.pdf". The detailed numbers can be found in the file "PSOL_NegativeIncreasement.txt"|
 mlDNA also provides functions to select a set of candidate genes for network consturction with differential expression analytic methods (running geneRanker), and interaction-removal method (running interactionRemoval).
With the selected "informative" genes and gene expression data,
the gene co-expression network is then constructed with the Gini
correlation coefficient algorithm. In fault, the shortest-path distances
between any pairs of genes are also calculated after the network
has been constructed. As the calculation of shortest-path distances
is time-consuming, the parallel computing is highly recommended here.
You can also omit this procedure by setting the parameter "distmatFlag" as FALSE.
|#As an example, here we only consider 3000 "informative" genes for the network construction and differential network analysis|
|signalGenes <- signalGenes[1:8000]|
|#from gene expression data to gene co-expression network. You can save controlNet and stressNet for re-use with the command save in R.|
|controlNet <- exp2net( expmat = ControlExpMat[signalGenes,], method = "GCC", pvalue = 0.01, cpus = cpus, expDescribe = "Control", connListFlag = TRUE, distmatFlag = TRUE, saveType = "bigmatrix", netResFileDic = netResFileDic )|
|stressNet <- exp2net( expmat = SaltExpMat[signalGenes,], method = "GCC", pvalue = 0.01, cpus = cpus, expDescribe = "Stress", connListFlag = TRUE, distmatFlag = TRUE, saveType = "bigmatrix", netResFileDic = netResFileDic )|
|# take a glance at the results|
|#correlations between the first five genes in the adjacency matrix|
|#In the constructed gene co-expression network, distances between the first five genes in shortest-path distance matrix|
|#for one interested gene|
|gene <- signalGenes|
|#the connected genes of this gene in the network|
|Connected <- controlNet$connectivityList[[gene]]|
|#the correlations between connected gene with the interested gene are positive|
|posConnected <- Connected [[gene ]]$pos|
|#The correlations between connected gene with the interested gene are negative|
|negConn <- Connected [[gene]]$neg|
The mlDNA package can generate more than 30 network characteristics for considering
the changes in the connections of one gene in different networks.
The "closeness", "eccentricity", "dis2knodes", "closeness2knodes",
"eccenticity2knodes" are calculated based on the shortest-path distances
between genes. Please see the mlDNA paper for detailed description of these network characteristics
|#generate network feature matrix from differential network analysis|
|netFeatureMat <- netFeatureMatrix( net1 = controlNet, net2 = stressNet, nodes = signalGenes, knodes = positiveSamples, cpus = cpus, verbose = TRUE, netResFileDic = netResFileDic, features = c( "expDistance", "ASC", "corDistance", "AllConnectivity", "PosConnectivity", "NegConnectivity", "closeness", "eccentricity", "eigenvector", "page.rank", "dis2knodes", "closeness2knodes", "eccenticity2knodes") )|
|# take a glance at the results|
|#genes selected by PSOL algorithm were regarded as control samplesHere the users can also define negative samples by themselves.|
|negativeSamples <- setdiff(signalGenes, positiveSamples)|
|#generate a random seed for dividing samples as training and testing dataset in cross-validation|
|seed <- randomSeed()|
|#five-fold cross validation method is used to train and test the random forest-based classification model constructed with 33 network characteristics.|
|cvRes <- cross_validation(seed = seed, method = "randomForest", featureMat = netFeatureMat, positives = positiveSamples, negatives = negativeSamples, cross = 5, cpus = cpus, ntree = 100 )|
|#five receiver operating characteristic (ROC) curves are plotted to visualize the performance of RF-based classification model. please see Figure 2.|
|# AUC (area under the curve) values for five rounds of cross validation|
|aucVec <- rep(0, 5)|
|for( i in 1:5 )|
|aucVec[i] = cvRes[[i]]$test.AUC|
|#the performance of RF-based classification model evaluated with the average of these 5 AUC values|
|#the best classifier obtained in the five-fold cross validation experiment. This classifier is usually used to perform the prediction on new datasets.|
|idx <- which(aucVec == max(aucVec) )|
|RFClassifier <- cvRes[[idx]]$classifier|
|#To perform the prediction, a network feature matrix is input, and the prediction score of each gene to be stress-related genes is output|
|predScore <- predictor(method = "randomForest", classifier = RFClassifier, featureMat = netFeatureMat)|
|# The prediction scores of first 10 genes.|
|#Given the prediction score for a set of positive and negative samples. Here we used the result on testing dataset from the first round of cross validation as an example.|
|positiveSampleScores <- cvRes[]$positives.test.score|
|negativeSampleScores <- cvRes[]$negatives.test.score|
|#Find the optimal prediction score cutoff with the F-score measurement.|
|FScoreRes <- optimalScore( positiveSampleScores, negativeSampleScores, beta = 2, plot = TRUE )|
Figure 3 shows the distribution of F-score at threshold of all possible prediction scores. The optimal prediction score is selected when F-score reaches the maximal value. Then, for a gene with prediction score higher than the optimal score, this gene is regarded as a promising stress-related candidate genes.
mlDNA package also provides several statistical methods for
downstream bioinformatics analysis of ML-based prediction results.
The function "ConvergenceDegree" quantitatively estimates the
convergence degree between two conditions, and yields a value ranged
from 0 (no convergence) to 1 (absolute convergence).
|#suppose the set of genes related to two experimental conditions are respectively A and B, then the convergence degree between these two conditions is:|
|ConvergenceDegree (A, B)|
For multiple biological conditions (n >= 3), the overlap coefficient output by this function can been organized with a graph, in which the nodes represent biological conditions, edges represent the covergence between two conditions. The node size indicate genes in the biological condition of interest, and edge width represents the overlap coefficient between two biological conditions. As exemplified in Figure 4 for displaying the convergence between salt, cold, drought, heat and genotoxic stresses.
The "AverageRankScore" function in mlDNA package quantifies the
activity of a pathway based on the sum of the rank of expression
changes of all of the genes in this pathway, using the average
rank-based score (AS) algorithm [PMID: 22096597]. Since this rank-based statistics is
normalized by the number of genes in the analyzed pathway and whole
genome, it is thus robust for directly comparing the activities of
pathways with different gene numbers under different
|#generate expression feature matrix|
|featureMat <- expFeatureMatrix( expMat1 = ControlExpMat, sampleVec1 = sampleVec1, expMat2 = SaltExpMat, sampleVec2 = sampleVec2, logTransformed = TRUE, base = 2, features = "foldchange" )|
|#calculate the activity of a pathway, suppose this pathway contain the first 20 genes in the positive sample set, then:|
|genes <- positiveSamples[1:20]|
|res <- AverageRankScore( featureMat = featureMat, selGenes = genes )|
This function calculates one value for one modular/pathay under one biological conditions. If multiple biological conditions are available, the output can be organized with a heatmap for intuitively displaying the changes of regulation activity of one pathway across different biological conditions. An example is shown in Figure 5.
mlDNA package can also be used to identify condition-specifical
express genes, which are highly expressed in one condition, but not or
lowly expression in other experimental conditions.
|##show colnames of SaltExpMat|
|as the dot is used to indicate the replications in one condition, we need to change the colname of genes at the 0.5 time point|
|colnames(SaltExpMat) <- "Salt_0_5h"|
|#get specifically expressed genes at diferent time points|
|res <- ConditionSpecificGenes( expmat = SaltExpMat, logtransformed = TRUE, base = 2, threshold = 0.75 )|
|#number of condition-specifical expressed genes|
Q1: How to solve the error "Error in .cor_all(xs = mat, rowidx = row.idx, colidx = col.idx, corMethod = method, :
could not find function "as.big.matrix""?
A1: This error is caused due to the failure of loading the package "bigmemory", which is only avaiable for Linux and Mac users. Please load this package using the command library(bigmemory) before running functions in mlDNA package.
Q2: Why my program was killed automatically, when running exp2net function on Linux?
A2: While running exp2net, the distances between any pair of genes in the network will be calcuated with igraph package. This calcuation requires huge memory and long time if the network is density. So the program was killed automatically by Linux if the limited memory is available.
Q3: The error I met is not included in this Q&A Section, Could you please help me to get ride of this error?
A3: Yes, please send the detailed information about the error to: Chuang Ma (firstname.lastname@example.org).
2013-11-18, an updated version of mlDNA (version 1.1) were generated. A new function "interactionRemoval" has been added to identify a set of candidate genes with the interaction-removal approach [PMID: 22252388].
2013-06-18, the source codes and binary packages of mlDNA (version 1.0) were released at the Comprehensive R Archive Network (CRAN).