standardcor-WGCNA • standardcor

suppressWarnings(library(standardcor))
#> Loading required package: qlcMatrix
#> Loading required package: Matrix
#> Loading required package: slam
#> Loading required package: sparsesvd
suppressWarnings(library(WGCNA))
#> Loading required package: dynamicTreeCut
#> Loading required package: fastcluster
#> 
#> Attaching package: 'fastcluster'
#> The following object is masked from 'package:stats':
#> 
#>     hclust
#> 
#> 
#> Attaching package: 'WGCNA'
#> The following object is masked from 'package:stats':
#> 
#>     cor
suppressWarnings(library(tidyverse))
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.6
#> ✔ forcats   1.0.1     ✔ stringr   1.6.0
#> ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
#> ✔ purrr     1.2.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ tidyr::expand() masks Matrix::expand()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ✖ purrr::none()   masks qlcMatrix::none()
#> ✖ tidyr::pack()   masks Matrix::pack()
#> ✖ tidyr::unpack() masks Matrix::unpack()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

WGCNA, Integration of multi-omics

This notebook covers a correlation standardization technique (more information here: https://github.com/PriceLab/standardcor) to combine omics for a downstream WGCNA analysis. To combine omics, we start with correlations separately by each omics. Each omics tends to have a different distribution of correlations; for example, randomly selected analytes from a proteomics assay tends to be positively correlated (also true for transcriptomics), but randomly selected metabolites are typically closer to uncorrelated. This is not an issue when computing adjacencies for any of the single ’omics, because the transformation from correlations to adjacencies would (in WGCNA) be fit to each of these separate distributions separately. When we combine correlations across different ’omics, or between analytes measured in different ’omics experiments, WGCNA would attempt to find one power 𝑘 to fit them all, and that would result in very different distributions of adjacency for analyte pairs from different source experiments. One particular source of differences in distribution is sample size. If we compute correlations between metabolites from 1,000 individuals, we should expect much more accurate correlation values than if we compute them from only 10; sampling variation alone will result in a sqrt(1000/10) = 10-fold difference in variance!

We will therefore make a smooth model of the distribution of correlations of each type, and then transform the correlation values to a single, shared smooth model. While the resulting values are no longer interpretable as correlations, this will standardize the significance of correlations from different sources onto a single, shared significance scale. When WGCNA fits the values on this scale, it will be applying the same significance standards to all the correlations, regardless of their original source.

Data and preprocessing

The required input files are:

Phenotype Table - containing the outcome of inPrimary Merest
MetabolitPrimary e Table - metabolite abunda
Lipid Table - Lipid abundance values
Biogenic Amine Table - Biogenic Amine Tablence values
Protein Table - protein abundance values

Data for this analysis were synthesized from Longevity Consortium generated proteomic and metabolomic data. We used a Gaussian Mixture Model (GMM) to create 1000 synthetic participants for each omic type, 500 cases and 500 controls, followed by the addition of a synthetic case/control signal. The signal was generated by adding 0.1 to a set of proteins and metabolites for all cases.

The original Longevity Consortium datasets were filtered for high missingness (>20%), imputed using random-forest imputation and log normalized prior to running GMM.

Load data

#Phenotypes
pheno <- read_delim("../data/WGCNA/case_synth.tsv", show_col_types = FALSE)

### Load proteins
prots <- read_delim("../data/WGCNA/proteins_synth.tsv", show_col_types = FALSE)

### Metabolites
lipids_df <- read_delim("../data/WGCNA/Lipids_synth.tsv", show_col_types = FALSE)
amines_df  <- read_delim("../data/WGCNA/BA_synth.tsv", show_col_types = FALSE)
primary_df  <- read_delim("../data/WGCNA/Primary_synth.tsv", show_col_types = FALSE)
 
### Metabolite features
features <- read_delim("../data/WGCNA/Met_Features.tsv", show_col_types = FALSE)

## Merge together
in_df <- merge(primary_df, lipids_df, by="subjectID")
in_df <- merge(in_df, amines_df, by="subjectID")
in_df <- merge(in_df, prots, by="subjectID")

# Check dimensions of each dataframe
dim(primary_df)
#> [1] 500 150
dim(lipids_df)
#> [1] 500 713
dim(amines_df)
#> [1] 500 322
dim(prots)
#> [1] 500 284
dim(in_df)
#> [1]  500 1466
dim(pheno)
#> [1] 500   2

# Drop id column and get features
num.analytes <- setdiff(unique(c(colnames(primary_df),colnames(lipids_df),colnames(amines_df),colnames(prots))),'subjectID')
num_df <- in_df[,colnames(in_df) %in% num.analytes]
num_df <- as.matrix(num_df)
rownames(num_df) <- in_df$subjectID

## Filter samples and features based on WGCNA NA criteria (50%)
gsg = goodSamplesGenes(num_df, verbose = 5);
#>  Flagging genes and samples with too many missing values...
#>   ..step 1
gsg$allOK
#> [1] TRUE
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0) 
    printFlush(paste("Removing genes:", paste(names(num_df)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0) 
    printFlush(paste("Removing samples:", paste(rownames(num_df)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  num_df = num_df[gsg$goodSamples, gsg$goodGenes]
}

dim(num_df)
#> [1]  500 1465

# Get the names of remaining analytes overall (all.analytes) by category
cat.prots <- intersect(colnames(prots),colnames(num_df))
cat.primary  <- intersect(colnames(primary_df),colnames(num_df))
cat.lipid <- intersect(colnames(lipids_df),colnames(num_df))
cat.amines <- intersect(colnames(amines_df),colnames(num_df))

all.analytes <- c(cat.prots,cat.primary,cat.lipid,cat.amines)
print(paste(length(cat.prots),length(cat.lipid),length(cat.primary),length(cat.amines),length(all.analytes)))
#> [1] "283 712 149 321 1465"

# We will construct a correlation matrix Z in parts corresponding to each category of analyte pairs.
n.analytes <- length(all.analytes)
# To compute correlations, features must be numeric. We will use Spearman
all_df <- num_df[,all.analytes]

Generate Correlations

We use an implementation of sparse spearman correlation which saves memory and time. This helps to prevent the notebook from crashing! We use Spearman rank correlationsince it is not affected by scalar multiples or by log-transformation. More information on this technique can be found here: https://github.com/saketkc/blog/blob/main/2022-03-10/SparseSpearmanCorrelation2.ipynbs

s.prots <- as(all_df[,cat.prots], "sparseMatrix")
s.primary <- as(all_df[,cat.primary], "sparseMatrix")
s.lipid <- as(all_df[,cat.lipid], "sparseMatrix")
s.amines <- as(all_df[,cat.amines], "sparseMatrix")

# Within-category correlations
Z.pp <- SparseSpearmanCor2(s.prots)
Z.mm <- SparseSpearmanCor2(s.primary)
Z.ll <- SparseSpearmanCor2(s.lipid)
Z.aa <- SparseSpearmanCor2(s.amines)
# Cross-category correlations
Z.pm <- SparseSpearmanCor2(s.prots, s.primary)
Z.pl <- SparseSpearmanCor2(s.prots, s.lipid)
Z.pa <- SparseSpearmanCor2(s.prots, s.amines)
Z.ml <- SparseSpearmanCor2(s.primary, s.lipid)
Z.ma <- SparseSpearmanCor2(s.primary, s.amines)
Z.la <- SparseSpearmanCor2(s.lipid, s.amines)

# Add row and column names to each dataframe
dimnames(Z.pp) <- list(colnames(all_df[,cat.prots]), colnames(all_df[,cat.prots]))
dimnames(Z.mm) <- list(colnames(all_df[,cat.primary]), colnames(all_df[,cat.primary]))
dimnames(Z.ll) <- list(colnames(all_df[,cat.lipid]), colnames(all_df[,cat.lipid]))
dimnames(Z.aa) <- list(colnames(all_df[,cat.amines]), colnames(all_df[,cat.amines]))

dimnames(Z.pm) <- list(colnames(all_df[,cat.prots]), colnames(all_df[,cat.primary]))
dimnames(Z.pl) <- list(colnames(all_df[,cat.prots]), colnames(all_df[,cat.lipid]))
dimnames(Z.pa) <- list(colnames(all_df[,cat.prots]), colnames(all_df[,cat.amines]))
dimnames(Z.ml) <- list(colnames(all_df[,cat.primary]), colnames(all_df[,cat.lipid]))
dimnames(Z.ma) <- list(colnames(all_df[,cat.primary]), colnames(all_df[,cat.amines]))
dimnames(Z.la) <- list(colnames(all_df[,cat.lipid]), colnames(all_df[,cat.amines]))

Within-category standarized distributions

Below we use the standardcor functions to estimate a Beta parameters from mean and std dev for each omic.

### Protein Protein
Z.unique <- Z.pp[row(Z.pp) < col(Z.pp)]
vw <- estimateShape(Z.pp)
v.pp <- vw[1]
w.pp <- vw[2]
print(paste("Protein pairs: rho_ij ~ Beta(v =",round(v.pp,3),",w =",round(w.pp,3),")"))
#> [1] "Protein pairs: rho_ij ~ Beta(v = 36.482 ,w = 34.418 )"

fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,4),
     main="Pairwise protein correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, v.pp, w.pp)/2, lwd=3, col="MediumBlue")

The blue line shows the model distrubtion. We can see it fits the background distribution. We will repeat this process for all other omics

### Metabolite-Metabolite
Z.unique <- as.vector(Z.mm[row(Z.mm) < col(Z.mm)])
vw <- estimateShape(Z.mm)
v.mm <- vw[1]
w.mm <- vw[2]
print(paste("Metabolite Pairs: rho_ij ~ Beta(v =",round(v.mm,3),",w =",round(w.mm,3),")"))
#> [1] "Metabolite Pairs: rho_ij ~ Beta(v = 102.351 ,w = 101.913 )"

fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise primary metabolite correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, v.mm, w.mm)/2, lwd=3, col="MediumBlue")

### Lipid - Lipid
Z.unique <- as.vector(Z.ll[row(Z.ll) < col(Z.ll)])
vw <- estimateShape(Z.ll)
v.ll <- vw[1]
w.ll <- vw[2]
print(paste("Lipid Pairs: rho_ij ~ Beta(v =",round(v.ll,3),",w =",round(w.ll,3),")"))
#> [1] "Lipid Pairs: rho_ij ~ Beta(v = 28.808 ,w = 25.44 )"

fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise lipid correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, v.ll, w.ll)/2, lwd=3, col="MediumBlue")

### Amine - Amine
Z.unique <- as.vector(Z.aa[row(Z.aa) < col(Z.aa)])
vw <- estimateShape(Z.aa)
v.aa <- vw[1]
w.aa <- vw[2]
print(paste("Amine Pairs: rho_ij ~ Beta(v =",round(vw[1],3),",w =",round(vw[2],3),")"))
#> [1] "Amine Pairs: rho_ij ~ Beta(v = 116.617 ,w = 110.49 )"

fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise amine correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, vw[1], vw[2])/2, lwd=3, col="MediumBlue")

Cross-category

We continue the same process for each of the cross-category omics combinations. consider.

dim(Z.pm)
#> [1] 283 149
Z.unique <- as.vector(Z.pm) # there are no self-comparisons, nor are there repeats due to symmetry
vw <- estimateShape(Z.pm)
v.pm <- vw[1]
w.pm <- vw[2]
print(paste("Protein-metabolite: rho_ij ~ Beta(v =",round(vw[1],3),",w =",round(vw[2],3),")"))
#> [1] "Protein-metabolite: rho_ij ~ Beta(v = 177.718 ,w = 174.852 )"

# The distribution of these cross-correlations is
# markedly narrower than either of the contributing 'omics
fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise Protein-metabolite correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, vw[1], vw[2])/2, lwd=3, col="MediumBlue")

# Modeling cross-category correlations: protein-lipid
Z.unique <- as.vector(Z.pl) # there are no self-comparisons, nor are there repeats due to symmetry
vw <- estimateShape(Z.pl)
v.pl <- vw[1]
w.pl <- vw[2]
print(paste("Protein-metabolite: rho_ij ~ Beta(v =",round(vw[1],3),",w =",round(vw[2],3),")"))
#> [1] "Protein-metabolite: rho_ij ~ Beta(v = 89.235 ,w = 89.5 )"

# The distribution of these cross-correlations is
# markedly narrower than either of the contributing 'omics
fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise Protein-metabolite correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, vw[1], vw[2])/2, lwd=3, col="MediumBlue")

# Modeling cross-category correlations: protein-amine
Z.unique <- as.vector(Z.pa) # there are no self-comparisons, nor are there repeats due to symmetry
vw <- estimateShape(Z.pa)
v.pa <- vw[1]
w.pa <- vw[2]
print(paste("Protein-metabolite: rho_ij ~ Beta(v =",round(vw[1],3),",w =",round(vw[2],3),")"))
#> [1] "Protein-metabolite: rho_ij ~ Beta(v = 138.697 ,w = 137.27 )"

# The distribution of these cross-correlations is
# markedly narrower than either of the contributing 'omics
fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise Protein-metabolite correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, vw[1], vw[2])/2, lwd=3, col="MediumBlue")

# Modeling cross-category correlations: primary-lipid
Z.unique <- as.vector(Z.ml) # there are no self-comparisons, nor are there repeats due to symmetry
vw <- estimateShape(Z.ml)
v.ml <- vw[1]
w.ml <- vw[2]
print(paste("Protein-lipid: rho_ij ~ Beta(v =",round(vw[1],3),",w =",round(vw[2],3),")"))
#> [1] "Protein-lipid: rho_ij ~ Beta(v = 189.953 ,w = 186.133 )"

# The distribution of these cross-correlations is
# markedly narrower than either of the contributing 'omics
fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise Protein-lipid correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, vw[1], vw[2])/2, lwd=3, col="MediumBlue")

# Modeling cross-category correlations: primary-amine
Z.unique <- as.vector(Z.ma) # there are no self-comparisons, nor are there repeats due to symmetry
vw <- estimateShape(Z.ma)
v.ma <- vw[1]
w.ma <- vw[2]
print(paste("Protein-amine: rho_ij ~ Beta(v =",round(vw[1],3),",w =",round(vw[2],3),")"))
#> [1] "Protein-amine: rho_ij ~ Beta(v = 190.21 ,w = 186.059 )"

# The distribution of these cross-correlations is
# markedly narrower than either of the contributing 'omics
fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise Protein-amine correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, vw[1], vw[2])/2, lwd=3, col="MediumBlue")

# Modeling cross-category correlations: lipid-amine
Z.unique <- as.vector(Z.la) # there are no self-comparisons, nor are there repeats due to symmetry
vw <- estimateShape(Z.la)
v.la <- vw[1]
w.la <- vw[2]
print(paste("Lipid-amine: rho_ij ~ Beta(v =",round(vw[1],3),",w =",round(vw[2],3),")"))
#> [1] "Lipid-amine: rho_ij ~ Beta(v = 131.107 ,w = 129.04 )"

# The distribution of these cross-correlations is
# markedly narrower than either of the contributing 'omics
fine <- 40
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Z.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,7.5),
     main="Pairwise lipid-amine correlations", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, vw[1], vw[2])/2, lwd=3, col="MediumBlue")

Merging correlations from disparate data subsets

This centering process is similar to, but not the same as “quantile normalization”. In quantile normalization, each value $x_i$ in the observed data has a cumulative probability $p_i = \Pr{\{x < x_i\} }$ among the observed data $\{ x \}$ which, accounting for the possibility of ties in a finite dataset, is the average rank of all observed values equal to $x_i$ divided by the total number of observed values. The value $x_i$ is then transformed to the value $q_i$ with cumulative probability $p_i$ according to a normalizing probability distribution $D$ (i.e. $\Pr{ \{ d < q_i | D \} } = p_i$ ).

In this centering process, rather than using the rank of $x_i$ in the observed data, we use a null model of the observed data; $p_i$ is therefore a p-value for $x_i$ under the null model, and the transformed value $q_i$ has the same significance under the normalized probability distribution $D$ as $x_i$ had in the original null distribution. This allows us to merge datasets while retaining the significance they had in their original cohort. Only when the empirical distribution is used as the null model are these two processes the same.

Here, we use a beta distribution as a null model precisely because the appropriate null model for correlations of arbitrary independent vectors is, under some reasonable assumptions, indistinguishable from a Beta distribution. We believe this is a more appropriate null model of correlations, and suggest that the primary effect of this null model is to account for the effective number of dimensions in the observed data, prior to using a 2D geometric model to compute correlations between the observations.

nu.std <- 34 # As wide as the widest compoonent, and centered at 0

Zc.pp <- centerBeta(Z.pp, v.pp, w.pp, nu.std)
Zc.mm <- centerBeta(Z.mm, v.mm, w.mm, nu.std)
Zc.ll <- centerBeta(Z.ll, v.ll, w.ll, nu.std)
Zc.aa <- centerBeta(Z.aa, v.aa, w.aa, nu.std)
Zc.pm <- centerBeta(Z.pm, v.pm, w.pm, nu.std)
Zc.pl <- centerBeta(Z.pl, v.pl, w.pl, nu.std)
Zc.pa <- centerBeta(Z.pa, v.pa, w.pa, nu.std)
Zc.ml <- centerBeta(Z.ml, v.ml, w.ml, nu.std)
Zc.ma <- centerBeta(Z.ma, v.ma, w.ma, nu.std)
Zc.la <- centerBeta(Z.la, v.la, w.la, nu.std)

# Combined, centered correlations

Zc <- matrix(0, nrow = length(all.analytes),
             ncol = length(all.analytes))
rownames(Zc) <- all.analytes
colnames(Zc) <- all.analytes

### Construct a final dataframe that contains all the correlation values.
###
# Block-structured correlation matrix
# Zc = [ PP     PM   PL  PA  |
#      | PM^T   MM   ML   MA |
#      | PL^T  MC^T  LL  LA  |
#      | PA^T  MA^T  LA^T AA ]
###
Zc[cat.prots, cat.prots] <- Zc.pp
Zc[cat.primary,  cat.primary]  <- Zc.mm
Zc[cat.lipid,  cat.lipid]  <- Zc.ll
Zc[cat.amines,  cat.amines]  <- Zc.aa

Zc[cat.prots, cat.primary]  <- Zc.pm
Zc[cat.primary, cat.prots]  <- t(Zc.pm)

Zc[cat.prots, cat.lipid]  <- Zc.pl
Zc[cat.lipid, cat.prots]  <- t(Zc.pl)

Zc[cat.prots, cat.amines]  <- Zc.pa
Zc[cat.amines, cat.prots]  <- t(Zc.pa)

Zc[cat.primary, cat.lipid]  <- Zc.ml
Zc[cat.lipid, cat.primary]  <- t(Zc.ml)

Zc[cat.primary, cat.amines]  <- Zc.ma
Zc[cat.amines, cat.primary]  <- t(Zc.ma)

Zc[cat.lipid, cat.amines]  <- Zc.la
Zc[cat.amines, cat.lipid]  <- t(Zc.la)


print(str_c("nrow: ", nrow(Zc)))
#> [1] "nrow: 1465"
Zc[1:5,1:5]
#>              1433Z        A1AG2        A1AT        A1BG       A2AP
#> 1433Z  1.000000000 -0.009608999  0.01402562 -0.02168788 -0.3082587
#> A1AG2 -0.009608999  1.000000000  0.11051047  0.07297294 -0.1927106
#> A1AT   0.014025625  0.110510466  1.00000000  0.08730883 -0.3104823
#> A1BG  -0.021687883  0.072972937  0.08730883  1.00000000 -0.0553188
#> A2AP  -0.308258749 -0.192710596 -0.31048227 -0.05531880  1.0000000

Z.unique <- Zc[row(Zc) < col(Zc)]
print(paste("Target: rho_ij ~ Beta(v =",round(nu.std,3),",w =",round(nu.std,3),")"))
#> [1] "Target: rho_ij ~ Beta(v = 34 ,w = 34 )"

x <- (1+Z.unique)/2
mZ <- mean(x)
s2Z <- var(x)
v.c <- mZ*(mZ*(1-mZ)/s2Z - 1)
w.c <- (1-mZ)*(mZ*(1-mZ)/s2Z - 1)
print(paste("Method of moments: rho_ij ~ Beta(v =",round(v.c,3),",w =",round(w.c,3),")"))
#> [1] "Method of moments: rho_ij ~ Beta(v = 21.514 ,w = 21.308 )"


fine <- 100
Zc.unique <- as.vector(Zc[row(Zc) < col(Zc)])
Bs <- (c(-fine:(1+fine))-0.5)/fine
hist(Zc.unique, breaks=Bs, xlab="Correlation", ylab="Density", ylim=c(0,5),
     main="All pairwise correlations, centered", prob=TRUE)
box()
abline(v=c(-1:1),lty=3)

r <- c(-fine:fine)/fine
lines(r, dbeta((1+r)/2, nu.std, nu.std)/2, lwd=3, col="orangered")
lines(r, dbeta((1+r)/2, v.c, w.c)/2, lwd=3, col="MediumBlue")

These are now standardized correlations. The mean and variance of this distribution suggest a model (shown in blue) that fits less well than the standardizing model (in orange); this is a consequence of the differences between the models we fitted and the empirical distributions, and indicates that the enrichment of high correlations we observed in the individual ’omics distributions has been preserved. If we had used quantile normalization, the overabundance of high correlations would have been shifted to lower correlation values, and the fitted blue model would be identical to the standardizing model.

WGCNA

The code below follows a standard WGCNA analysis. More information can be found here: https://peterlangfelder.com/2018/11/25/wgcna-resources-on-the-web/

#Manually convert the pairwise correlation DF to the signed network DF
Zc_signed <- 0.5 + 0.5 * Zc

print(str_c("nrow: ", nrow(Zc_signed)))
#> [1] "nrow: 1465"

#Choose a set of soft-thresholding powers
powers <- c(c(1:10), seq(from=11, to=15, by=1))
cutoff <- 0.8

#Call the network topology analysis function
sft <- pickSoftThreshold.fromSimilarity(Zc_signed, RsquaredCut=cutoff, powerVector=powers, blockSize=5000, verbose=5)
#>  pickSoftThreshold: calculating connectivity for given powers...
#>    ..working on genes 1 through 1465 of 1465
#> Warning: executing %dopar% sequentially: no parallel backend registered
#>    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
#> 1      1  0.13800 10.800          0.949  736.00   735.000  797.0
#> 2      2  0.13000  4.190          0.917  378.00   375.000  450.0
#> 3      3  0.00102 -0.188          0.731  199.00   195.000  270.0
#> 4      4  0.34300 -2.250          0.722  107.00   103.000  176.0
#> 5      5  0.77800 -2.830          0.859   59.80    55.300  124.0
#> 6      6  0.91600 -2.820          0.919   34.50    30.400   92.5
#> 7      7  0.96200 -2.510          0.952   20.70    17.000   73.1
#> 8      8  0.93900 -2.260          0.922   13.00     9.640   60.6
#> 9      9  0.96400 -2.000          0.954    8.64     5.620   52.0
#> 10    10  0.96300 -1.810          0.953    6.06     3.300   46.0
#> 11    11  0.96000 -1.670          0.949    4.49     2.040   41.6
#> 12    12  0.97100 -1.550          0.963    3.52     1.290   38.4
#> 13    13  0.97600 -1.460          0.970    2.89     0.808   35.8
#> 14    14  0.97800 -1.400          0.975    2.47     0.526   33.9
#> 15    15  0.96900 -1.360          0.965    2.18     0.339   32.3

#Plot the results
options(repr.plot.width=9, repr.plot.height=5)
par(mfrow=c(1,2))
cex1 <- 0.8
##Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)", ylab="Scale Free Topology Model Fit, signed R^2", type="n",
     main=paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers, cex=cex1, col="black")
##Line corresponds to using an R^2 cut-off of h
abline(h=cutoff, col="red")
##Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)", ylab="Mean Connectivity", type="n",
     main=paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1, col="black")


print(str_c("Estimated soft-thresholding power: ", sft$powerEstimate))
#> [1] "Estimated soft-thresholding power: 6"

#Choose the power that best approximates a scale free topology while still maintaining high level of connectivity in the network
softPower <- sft$powerEstimate
softPower
#> [1] 6
#Generate the adjacency matrix using the chosen soft-thresholding power
adjacency <- adjacency.fromSimilarity(Zc, power=softPower, type="signed")

print(str_c("nrow: ", nrow(adjacency)))
#> [1] "nrow: 1465"
#head(adjacency)

#Turn adjacency into topological overlap
##You can input whatever matrix you want here!
# Turn adjacency into topological overlap
TOM = TOMsimilarity(adjacency,TOMType = "signed");
#> ..connectivity..
#> ..matrix multiplication (system BLAS)..
#> ..normalization..
#> ..done.
# Turn into distance matrix
dissTOM = 1-TOM
colnames(dissTOM) <- colnames(all_df)
rownames(dissTOM) <- colnames(dissTOM)

# Cluster the TOM distance matrix to find modules
# Can call whatever clusting method you want here

# Call the hierarchical clustering function
geneTree = hclust(as.dist(dissTOM), method = "ward.D2");
# Plot the resulting clustering tree (dendrogram)
#sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
    labels = FALSE, hang = 0.04);
box()

#Larger modules can be easier to interpret, so we set the minimum module size relatively high
minModuleSize <- max(c(20, round(ncol(all_df)/200, digits=0)))
print(str_c("minClusterSize = ", minModuleSize))
#> [1] "minClusterSize = 20"

#Module identification using dynamic tree cut
dynamicMods <- cutreeDynamic(dendro=geneTree, distM=dissTOM,
                             deepSplit=4, pamStage=TRUE, pamRespectsDendro=FALSE,
                             minClusterSize=minModuleSize)
#>  ..cutHeight not given, setting it to 5.17  ===>  99% of the (truncated) height range in dendro.
#>  ..done.
table(dynamicMods)
#> dynamicMods
#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
#> 431 183 105  79  60  56  53  40  40  38  36  36  34  32  31  30  30  28  28  27 
#>  21  22  23 
#>  25  22  21

#Convert numeric lables into colors
dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)
#> dynamicColors
#>         black          blue         brown          cyan     darkgreen 
#>            53           183           105            32            22 
#>       darkred darkturquoise         green   greenyellow        grey60 
#>            25            21            60            36            30 
#>     lightcyan    lightgreen   lightyellow       magenta  midnightblue 
#>            30            28            28            40            31 
#>          pink        purple           red     royalblue        salmon 
#>            40            38            56            27            34 
#>           tan     turquoise        yellow 
#>            36           431            79

#Plot the dendrogram and colors underneath
options(repr.plot.width=12, repr.plot.height=6)
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
                    dendroLabels=FALSE, hang=0.03,
                    addGuide=TRUE, guideHang=0.05,
                    main="Gene dendrogram and module colors")

#Calculate eigengenes
MEList <- moduleEigengenes(all_df, colors=dynamicColors, impute=TRUE, nPC=2)
MEs <- MEList$eigengenes
print(str_c("nrow: ", nrow(MEs)))
#> [1] "nrow: 500"
head(MEs)
#>       MEblack       MEblue      MEbrown     MEcyan  MEdarkgreen    MEdarkred
#> 1  0.01769292 -0.051160929 -0.044378898 0.01880385  0.026300750  0.056728907
#> 2  0.04529345 -0.027186301  0.081716559 0.04014007 -0.065170342 -0.027008752
#> 3  0.03116145 -0.064842191  0.050969124 0.03833421 -0.027185755  0.034050318
#> 4 -0.01276477 -0.011334389 -0.024473474 0.01594035  0.005975657  0.003775484
#> 5  0.02255580 -0.108910882  0.020253466 0.01890705 -0.087284917 -0.091232433
#> 6  0.01620370 -0.008809908  0.006610517 0.04426588  0.015338840 -0.008089793
#>   MEdarkturquoise      MEgreen MEgreenyellow     MEgrey60  MElightcyan
#> 1    -0.054838015  0.007742242  -0.006937975  0.023279528 -0.025568891
#> 2     0.081574426  0.027353168   0.037621498  0.003638059  0.031146586
#> 3    -0.024189996  0.075929834  -0.010496068  0.028760979  0.011325765
#> 4     0.001072615  0.026527175  -0.040967189  0.020079862 -0.003064056
#> 5     0.028547619 -0.042833872  -0.050583886 -0.053727660  0.044451861
#> 6     0.067679731  0.052427209   0.029833876 -0.008533758  0.029449067
#>    MElightgreen MElightyellow    MEmagenta MEmidnightblue      MEpink
#> 1  0.0831709906   0.054407199 -0.011885640    -0.02318667 -0.05903108
#> 2 -0.0223851354   0.054747059  0.002597031    -0.01508430  0.04217619
#> 3  0.0063840596   0.041592726 -0.014797012    -0.04415341 -0.01475084
#> 4 -0.0280092308   0.008061438  0.010143341     0.00139921 -0.01220036
#> 5  0.0120321816  -0.018502448 -0.124222277    -0.09145141  0.06847406
#> 6 -0.0004212465  -0.021378014 -0.060088746    -0.07637588  0.05340293
#>       MEpurple        MEred  MEroyalblue     MEsalmon        MEtan MEturquoise
#> 1  0.001787618 -0.044040999  0.051904178  0.001376754 -0.045764356 0.095295608
#> 2  0.066292815 -0.015444106  0.009153908  0.035193925  0.009062544 0.002543363
#> 3  0.046233328 -0.042038230  0.029811076  0.013927902 -0.055651008 0.010488607
#> 4 -0.019138426 -0.002514375  0.009501255  0.040701101  0.010042727 0.016468088
#> 5 -0.006243943 -0.121599175  0.012112474 -0.013905834 -0.100395931 0.064945152
#> 6 -0.056277156 -0.066816892 -0.019450368  0.024405079 -0.042246700 0.052961943
#>       MEyellow
#> 1  0.021543695
#> 2  0.048948253
#> 3  0.056389149
#> 4 -0.024116585
#> 5 -0.023709852
#> 6  0.005627724

#Calculate dissimilarity of module eigengenes
MEDiss <- 1 - cor(MEs, use="pairwise.complete.obs")

#Cluster module eigengenes
METree <- hclust(as.dist(MEDiss), method="ward.D2")

#Plot the result
options(repr.plot.width=10, repr.plot.height=5)
plot(METree, main="Clustering of module eigengenes",
     xlab="", sub="")
MEDissThres <- 0.3
abline(h=MEDissThres, col="red")

#Call an automatic merging function
merge <- mergeCloseModules(all_df, dynamicColors, cutHeight=MEDissThres, verbose=0)

#Eigengenes of the new merged modules
mergedMEs <- merge$newMEs

#The merged module colors
mergedColors <- merge$colors
table(mergedColors)
#> mergedColors
#>          blue         brown          cyan     darkgreen darkturquoise 
#>           219           283            32            47            21 
#>         green   greenyellow        grey60   lightyellow       magenta 
#>            60            36            30            28            40 
#>  midnightblue        purple           red        salmon     turquoise 
#>            31            38            56            34           431 
#>        yellow 
#>            79

#Plot the dendrogram and module colors
options(repr.plot.width=12, repr.plot.height=6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
                    c("Dynamic Tree Cut", "Merged dynamic"),
                    dendroLabels=FALSE, hang=0.03,
                    addGuide=TRUE, guideHang=0.05,
                    main="Gene dendrogram and module colors")


#Rename
moduleColors <- mergedColors
MEs <- mergedMEs

#Rename
moduleColors <- mergedColors
MEs <- mergedMEs

#Clean the module eigengene table
eigengene_df <- MEs %>%
    rownames_to_column(var="public_client_id")
names(eigengene_df)[2:ncol(eigengene_df)] <- names(eigengene_df)[2:ncol(eigengene_df)] %>%
    str_replace(., "^ME", "") %>%
    str_to_title(.)
print("Module eigengene table")
#> [1] "Module eigengene table"
print(str_c("- nrow: ", nrow(eigengene_df)))
#> [1] "- nrow: 500"
head(eigengene_df)
#>   public_client_id   Turquoise         Blue          Red      Magenta
#> 1                1 0.095295608 -0.052219663 -0.044040999 -0.011885640
#> 2                2 0.002543363 -0.001818211 -0.015444106  0.002597031
#> 3                3 0.010488607 -0.063797679 -0.042038230 -0.014797012
#> 4                4 0.016468088  0.005857586 -0.002514375  0.010143341
#> 5                5 0.064945152 -0.107111127 -0.121599175 -0.124222277
#> 6                6 0.052961943 -0.040473960 -0.066816892 -0.060088746
#>   Midnightblue        Brown Darkturquoise  Lightyellow       Cyan        Green
#> 1  -0.02318667  0.003551357  -0.054838015  0.054407199 0.01880385  0.007742242
#> 2  -0.01508430  0.035070077   0.081574426  0.054747059 0.04014007  0.027353168
#> 3  -0.04415341  0.020193625  -0.024189996  0.041592726 0.03833421  0.075929834
#> 4   0.00139921 -0.011460655   0.001072615  0.008061438 0.01594035  0.026527175
#> 5  -0.09145141  0.035817706   0.028547619 -0.018502448 0.01890705 -0.042833872
#> 6  -0.07637588  0.018605726   0.067679731 -0.021378014 0.04426588  0.052427209
#>         Grey60    Darkgreen  Greenyellow       Salmon       Purple       Yellow
#> 1  0.023279528  0.046261983 -0.006937975  0.001376754  0.001787618  0.021543695
#> 2  0.003638059 -0.046143879  0.037621498  0.035193925  0.066292815  0.048948253
#> 3  0.028760979  0.007247343 -0.010496068  0.013927902  0.046233328  0.056389149
#> 4  0.020079862  0.006183944 -0.040967189  0.040701101 -0.019138426 -0.024116585
#> 5 -0.053727660 -0.095823745 -0.050583886 -0.013905834 -0.006243943 -0.023709852
#> 6 -0.008533758  0.003092429  0.029833876  0.024405079 -0.056277156  0.005627724

##Sample metadata
sample_tbl <- pheno[pheno$subjectID %in% rownames(MEs),]
print("Sample metadata after the filter")
#> [1] "Sample metadata after the filter"
print(str_c("- nrow: ", nrow(sample_tbl)))
#> [1] "- nrow: 500"

#Code sex and race
phenotype_tbl <- sample_tbl 

phenotype_tbl <- phenotype_tbl[match(rownames(MEs), rownames(phenotype_tbl)),]

#Calculate the numbers of modules and samples
#nModules <- ncol(MEs)
nSamples <- nrow(phenotype_tbl)

#Names (colors) of the modules
modNames = substring(names(MEs), 3)

##Check ID order before the cor() function
print(str_c("Matched IDs?: ", all(rownames(MEs)==rownames(phenotype_tbl))))
#> [1] "Matched IDs?: TRUE"

#Calculate module–trait relationship
moduleTraitCor <- as.data.frame(cor(MEs, phenotype_tbl, use="p"))
rownames(moduleTraitCor) <- str_to_title(modNames)
print("Module–trait relationship table")
#> [1] "Module–trait relationship table"
print(str_c("nrow: ", nrow(moduleTraitCor)))
#> [1] "nrow: 16"

#Calculate statisitcal significance of module–trait relationship
MTRpval <- as.data.frame(corPvalueStudent(as.matrix(moduleTraitCor), nSamples))
rownames(MTRpval) <- str_to_title(modNames)
print("Module–trait relationship p-value table")
#> [1] "Module–trait relationship p-value table"
print(str_c("- nrow: ", nrow(MTRpval)))
#> [1] "- nrow: 16"

#Eliminate the dummy module (Grey)
moduleTraitCor <- moduleTraitCor[rownames(moduleTraitCor)!="Grey",]
MTRpval <- MTRpval[rownames(MTRpval)!="Grey",]

#P-value adjustment across modules (per trait) using Benjamini–Hochberg method
MTRpval_adj <- as.data.frame(apply(MTRpval, 2, function(x){p.adjust(x, length(x), method="BH")}))
print("Module–trait relationship adjusted p-value table")
#> [1] "Module–trait relationship adjusted p-value table"
print(str_c("- nrow: ", nrow(MTRpval_adj)))
#> [1] "- nrow: 16"

#Prepare text labels as matrix
textMatrix <- paste("r = ",signif(as.matrix(moduleTraitCor), 3),"\n(P = ",
                    signif(as.matrix(MTRpval_adj), 2),")", sep="")
dim(textMatrix) <- dim(moduleTraitCor)
#Revert module names back to apply color conversion
temp_c <- rownames(moduleTraitCor) %>%
    str_to_lower(.) %>%
    str_c("ME",.)

#Visualize
options(repr.plot.width=10, repr.plot.height=10)
par(mar=c(5, 5, 3, 2))
labeledHeatmap(Matrix=moduleTraitCor,
               xLabels=colnames(moduleTraitCor),
               yLabels=temp_c,
               #ySymbols=rownames(moduleTraitCor),
               colorLabels=FALSE,
               colors=blueWhiteRed(50),
               textMatrix=textMatrix,
               setStdMargins=FALSE,
               cex.text=1,
               zlim=c(-1,1),
               main=paste("Module–trait relationships"))

Now that we have identified modules asssociated with a phenotype, we can take the analysis in many other directions. Please refer to the WGCNA documentation for more information.