This analysis/workflow characterizes transcription factor (TF) expression patterns during endogenous control (D21) differentiation across developmental timepoints. Specifically, this workflow identifies, via hierarchical clustering, temporal, stage-specific expression of transcription factor modules in time-series bulk RNA-seq data. It also gives a framework for visualization of the ‘wave-like’ expression of these transcription factor modules.
TFs are important regulators of gene expression programs, and many TFs are used as cell-type specific marker genes. Hence, if you are characterizing a time-series differentiation, identifying stage-specific transcription factors in your control samples provides a baseline for understanding how these TF modules are perturbed in your experimental condition.
Furthermore, at the end of this analysis, I provide a basic framework for understanding how these stage-specific transcription factor module gene expression ‘waves’ are ultimately perturbed or changed as a result of a treatment/condition.
Depending on your experimental design (number of timepoints/replicates, your experimental group (genotype, condition, treatment), etc.), you can modify aspects of this work. For example, if you have 5+ timepoints, you will likely have to use a higher k (number of clusters) to capture and partititon all biologically meaningful, homogeneous transcription factor gene expression modules.
If you have any questions of suggestions, feel free to email me at zohebk@uchicago.edu or post on Biostars.
For this workflow, I re-processed publicly available data from the following source. You can use this data to follow along in this workflow, or you can use your own data.
Note that for this dataset, the experimental group is based on genotype (‘T21’). ‘D21’ samples represent the control samples; hence, we will use D21 samples to establish baseline transcription factor expression.
Data source: Martinez JL et al. Transcriptional consequences of trisomy 21 on neural induction. Front Cell Neurosci. 2024;18:1341141. doi:10.3389/fncel.2024.1341141
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000 May;25(1):25-9. doi: 10.1038/75556. PMID: 10802651; PMCID: PMC3037419.
For this analysis, we require four objects: - vst counts: DESeq2 variance-stabilized transformed expression matrix (for heatmaps) - raw counts: Raw counts as input for DESeq2 likelihood ratio test - cpm matrix: Counts per million (for filtering/removal of lowly expressed TFs) - metadata: sample information including genotype and timepoint for various steps of workflow
data_dir <- "data"
vst_counts <- readRDS(file.path(data_dir, "counts/vst/WC24_vst_counts.rds"))
raw_counts <- readRDS(file.path(data_dir, "counts/raw/WC24_filt_raw_counts.rds"))
cpm_matrix <- readRDS(file.path(data_dir, "counts/cpm/WC24_cpm.rds"))
metadata <- readRDS(file.path(data_dir, "metadata/WC24_metadata_clean.rds"))
cat("Data dimensions:\n")## Data dimensions:
## VST: 13826 genes x 18 samples
## Raw: 13826 genes x 18 samples
## sample_id timepoint genotype stage rep isogenic
## BA1D6_S1 BA1D6_S1 6 D21 neuroepithelium 1 WC24
## BA2D6_S2 BA2D6_S2 6 D21 neuroepithelium 2 WC24
## BA3D6_S3 BA3D6_S3 6 D21 neuroepithelium 3 WC24
## BA1D10_S4 BA1D10_S4 10 D21 early neuroectoderm 1 WC24
## BA2D10_S5 BA2D10_S5 10 D21 early neuroectoderm 2 WC24
## BA3D10_S6 BA3D10_S6 10 D21 early neuroectoderm 3 WC24
## genotype_timepoint isogenic_genotype_timepoint
## BA1D6_S1 D21_D6 WC24_D21_D6
## BA2D6_S2 D21_D6 WC24_D21_D6
## BA3D6_S3 D21_D6 WC24_D21_D6
## BA1D10_S4 D21_D10 WC24_D21_D10
## BA2D10_S5 D21_D10 WC24_D21_D10
## BA3D10_S6 D21_D10 WC24_D21_D10
## isogenic_genotype_timepoint_rep
## BA1D6_S1 WC24_D21_D6_1
## BA2D6_S2 WC24_D21_D6_2
## BA3D6_S3 WC24_D21_D6_3
## BA1D10_S4 WC24_D21_D10_1
## BA2D10_S5 WC24_D21_D10_2
## BA3D10_S6 WC24_D21_D10_3
##
## 6 10 17
## D21 3 3 3
## T21 3 3 3
To establish and identify baseline transcription factor temporal expression, we will focus exclusively on control (D21) samples to characterize normal differentiation patterns before any genotype comparisons. So we will subset metadata and our counts objects to only contain control sample information.
meta_d21 <- metadata %>%
filter(genotype == "D21") %>%
arrange(timepoint)
d21_samples <- meta_d21$sample_id
vst_d21 <- vst_counts[, d21_samples]
raw_d21 <- raw_counts[, d21_samples]
cpm_d21 <- cpm_matrix[, d21_samples]
cat("D21 samples:", length(d21_samples), "\n")## D21 samples: 9
##
## 6 10 17
## 3 3 3
We use the Gene Ontology annotation GO:0003700 (DNA-binding transcription factor activity) to identify all human transcription factors. If your data is in mouse, switch org.Hs.eg.db to org.Mm.eg.db.
We can see that this GO term identifies ~1500 TFs, some of which will be appreciably expressed in our data.
tf_annotation <- AnnotationDbi::select(
org.Hs.eg.db,
keys = "GO:0003700",
keytype = "GOALL",
columns = "SYMBOL"
)
human_tfs <- unique(na.omit(tf_annotation$SYMBOL))
tfs_in_data <- intersect(human_tfs, rownames(raw_d21))
cat("Human TFs (GO:0003700):", length(human_tfs), "\n")## Human TFs (GO:0003700): 1466
## TFs in expression data: 1039
While many TFs will be expressed in bulk RNA-seq time-series data, we do not necessarily want to include all of them in our analysis.
Specifically, many TFs will not show a temporal expression pattern. In other words, some TFs will not show any expression pattern that is related to time. A TF might be basally expressed at a constant level across time, for example. For the purposes of this analysis, we do not necessarily concern ourselves with these TFs whose expression is not time-dependent. Rather, we want to focus our analysis on TFs whose expression ‘peaks’ at a particular stage in our differentiation, as this indicates the TF is important for establishing or maintaining that particular developmental stage.
Thus, to identify TFs whose expression is stage-specific/temporally-related, we use DESeq2’s likelihood ratio test (LRT) to identify TFs with significant expression changes across differentiation timepoints. I have provided two links below to read about LRT from the DESeq2 vignette and HBC training resources, but in brief, the LRT compares a full model (~timepoint) against a reduced model (~1) to test whether timepoint explains significant variance in expression.
Why do we use LRT here? - the LRT tests whether a factor (timepoint) contributes significantly to expression - unlike Wald tests, LRT doesn’t compare specific pairs of timepoints - genes passing LRT show significant temporal dynamics regardless of direction
You can read more about DESeq2’s LRT here: - https://hbctraining.github.io/DGE_workshop/lessons/08_DGE_LRT.html - https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html (command-F LRT)
meta_d21$timepoint <- factor(meta_d21$timepoint, levels = sort(unique(meta_d21$timepoint)))
stopifnot(all(colnames(raw_d21) == meta_d21$sample_id))
raw_d21_tfs <- raw_d21[tfs_in_data, ]
dds <- DESeqDataSetFromMatrix(
countData = raw_d21_tfs,
colData = meta_d21,
design = ~ timepoint
)
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
lrt_results <- results(dds)
summary(lrt_results)##
## out of 1038 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 375, 36%
## LFC < 0 (down) : 345, 33%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
The output of DESeq2’s LRT on our control, endogenous differentiation samples is going to be a set of pvalues/FDR-adjusted p-values for each TF.
The LRT won’t output a useable log2FoldChange value in this case because we simply looking at temporally variable TFs in our control differentiation. We aren’t comparing the expression of these TFs to another condition/genotype, so we will simply use a padj cutoff here to define TFs whose expression pattern across time is ‘temporally significant.’
padj_threshold <- 0.05
temporal_tfs <- rownames(lrt_results)[which(lrt_results$padj < padj_threshold)]
temporal_tfs <- temporal_tfs[!is.na(temporal_tfs)]
temporal_tf_df <- as.data.frame(lrt_results) %>%
rownames_to_column("gene_id") %>%
filter(padj < padj_threshold) %>%
arrange(padj)
cat("Temporally dynamic TFs (padj <", padj_threshold, "):", length(temporal_tfs), "\n\n")## Temporally dynamic TFs (padj < 0.05 ): 670
## gene_id baseMean stat padj
## 1 NKX2-1 454.4607 1069.2465 6.796379e-230
## 2 ZFP42 903.3953 985.6268 4.886861e-212
## 3 RFX4 1495.2113 959.5038 1.532750e-206
## 4 ST18 687.9726 944.2318 2.381254e-203
## 5 RARG 530.4785 923.1016 7.383331e-199
## 6 ELF4 2528.3210 845.4633 4.446564e-182
## 7 NKX2-2 289.5501 760.3032 1.184020e-163
## 8 NR2E1 1771.3166 700.5637 9.719455e-151
## 9 NTN1 1666.5245 696.5273 6.500740e-150
## 10 HESX1 722.3745 685.6152 1.370061e-147
This step in the workflow is optional.
Essentially, although a TF might be stage-specific in terms of its expression pattern, that does not provide insight into the level (or magnitude) of the expression of that TF at that particular stage.
If we directly used the DESeq2 LRT significat temporal TFs for our downstream analysis, we would likely have a non-trivial number of TFs whose expression at a given stage is not appreciably high. The assumption here is that, if a TF is truly biologically relevant for defining a particular stage of differentiation (ie, cardiac progenitor to cardiomyocyte transition), then that TF should be expressed an appreciably high level at one of those stages.
With that logic, I filter TFs by expression level here to focus on biologically relevant TFs that are maintaining the identity of a particular timepoint in differentiation. I use a fairly stringent expression cutoff here defined by edgeR TMM-normalized counts per million. Specifically, a TF must have mean CPM > 10 at at least one differentiation stage to be retained. In other words, a TF must reach this expression threshold at one or more differentiation stages to be considered biologically relevant for our analysis.
cpm_d21_temporal <- cpm_d21[intersect(temporal_tfs, rownames(cpm_d21)), ]
timepoints <- unique(meta_d21$timepoint)
mean_cpm_by_stage <- sapply(timepoints, function(tp) {
tp_samples <- meta_d21$sample_id[meta_d21$timepoint == tp]
rowMeans(cpm_d21_temporal[, tp_samples, drop = FALSE])
})
colnames(mean_cpm_by_stage) <- paste0("D", timepoints)
cpm_threshold <- 10
max_mean_cpm <- apply(mean_cpm_by_stage, 1, max)
expressed_tfs <- names(max_mean_cpm)[max_mean_cpm > cpm_threshold]
cat("CPM threshold:", cpm_threshold, "\n")## CPM threshold: 10
## TFs passing filter: 531
## TFs filtered out: 139
So based on the previous filtering/pre-processing steps now have our final set of TFs: 1. Filter/subset list of genes to all human TFs
Subset all TFs to just those with significant temporal expression changes (via DESeq2 LRT)
Further refine our list of stage-specific TFs to those that are appreciably in at least 1 ‘stage’ or timepoint in control differentiation
Next, we prepare the VST matrix for visualization as follows:
2.Certain TFs at particular stages might be extreme outliers in terms of expression that can distort visualization color scales in our heatmap. To address this, we apply winsorization limits (capping values at ±3 standard deviations for z-score vst counts).
final_tfs <- intersect(expressed_tfs, rownames(vst_d21))
sample_order <- meta_d21 %>%
arrange(timepoint) %>%
pull(sample_id)
vst_final <- vst_d21[final_tfs, sample_order]
# z-score normalization
vst_zscore <- t(scale(t(vst_final)))
# winsorize extreme values for heatmap visualization
winsorize_limit <- 3
vst_zscore[vst_zscore > winsorize_limit] <- winsorize_limit
vst_zscore[vst_zscore < -winsorize_limit] <- -winsorize_limit
cat("Final TFs for clustering:", length(final_tfs), "\n")## Final TFs for clustering: 531
## Z-score range: [ -2.38 , 2.4 ]
To organize the temporally expressed transcription factors into coherent and biologically meaningful modules, we perform hierarchical clustering on z-score (row-normalized) VST counts, followed by a specific temporal reordering step for heatmap visualization purposes.
First, we apply hierarchical clustering (Euclidean distance, complete linkage) to partition the transcription factors into \(k=5\) distinct expression modules. This groups genes based on the similarity of their expression profiles over time, regardless of their absolute expression levels.
We then reorder these clusters based on their peak expression timing (in other words, at what stage/timepoint does a particular cluster of TFs ‘peak’ in expression? we order clusters based on this so that the heatmap is more visually cherent) This forces the heatmap to display a “cascade” or “wave-like” pattern:
k_clusters <- 5
# initial hierarchical clustering
initial_hclust <- hclust(dist(vst_zscore, method = "euclidean"), method = "complete")
initial_clusters <- cutree(initial_hclust, k = k_clusters)
initial_cluster_df <- data.frame(
gene_id = names(initial_clusters),
cluster = initial_clusters,
stringsAsFactors = FALSE
)
# calculate mean log2CPM per gene per timepoint
cpm_final <- cpm_d21[final_tfs, sample_order]
log2cpm_final <- log2(cpm_final + 1)
mean_log2cpm_by_gene <- sapply(timepoints, function(tp) {
tp_samples <- meta_d21$sample_id[meta_d21$timepoint == tp]
rowMeans(log2cpm_final[, tp_samples, drop = FALSE])
})
colnames(mean_log2cpm_by_gene) <- paste0("D", timepoints)
# calculate cluster-level profiles for sorting
cluster_profiles <- sapply(1:k_clusters, function(cl) {
genes_in_cluster <- initial_cluster_df$gene_id[initial_cluster_df$cluster == cl]
colMeans(mean_log2cpm_by_gene[genes_in_cluster, , drop = FALSE])
})
colnames(cluster_profiles) <- paste0("Cluster_", 1:k_clusters)
cluster_profiles <- t(cluster_profiles)
# sort clusters by peak timing
cluster_sort_metrics <- data.frame(
cluster = 1:k_clusters,
d6 = cluster_profiles[, "D6"],
d10 = cluster_profiles[, "D10"],
d17 = cluster_profiles[, "D17"]
) %>%
mutate(
peak_timepoint = case_when(
d6 >= d10 & d6 >= d17 ~ "D6",
d10 >= d6 & d10 >= d17 ~ "D10",
TRUE ~ "D17"
),
decline_d6_to_d10 = d10 - d6,
peak_order = case_when(
peak_timepoint == "D6" ~ 1,
peak_timepoint == "D10" ~ 2,
TRUE ~ 3
),
secondary_sort = case_when(
peak_timepoint == "D6" ~ decline_d6_to_d10,
peak_timepoint == "D10" ~ -abs(d10 - (d6 + d17)/2),
TRUE ~ -decline_d6_to_d10
)
) %>%
arrange(peak_order, secondary_sort)
sorted_cluster_order <- cluster_sort_metrics$cluster
cluster_mapping <- setNames(1:k_clusters, sorted_cluster_order)
# swap cluster 2 and cluster 3 for visualization
swap_mapping <- c("1" = 1, "2" = 3, "3" = 2, "4" = 4, "5" = 5)
# reorder genes by sorted clusters and apply swap
cluster_df <- initial_cluster_df %>%
mutate(
original_cluster = cluster,
cluster = cluster_mapping[as.character(cluster)],
cluster = swap_mapping[as.character(cluster)]
) %>%
arrange(cluster, gene_id)
ordered_genes <- c()
for (cl in 1:k_clusters) {
genes_in_cluster <- cluster_df$gene_id[cluster_df$cluster == cl]
if (length(genes_in_cluster) > 1) {
sub_hclust <- hclust(dist(vst_zscore[genes_in_cluster, ], method = "euclidean"), method = "complete")
genes_in_cluster <- genes_in_cluster[sub_hclust$order]
}
ordered_genes <- c(ordered_genes, genes_in_cluster)
}
vst_zscore_ordered <- vst_zscore[ordered_genes, ]
cluster_df <- cluster_df %>%
mutate(gene_id = factor(gene_id, levels = ordered_genes)) %>%
arrange(gene_id) %>%
mutate(gene_id = as.character(gene_id))
cluster_genes <- split(cluster_df$gene_id, cluster_df$cluster)
print(table(cluster_df$cluster))##
## 1 2 3 4 5
## 97 141 43 100 150
Here, I define curated marker gene lists for four stages of neural differentiation (neuroepithelium, early neuroectoderm, definitive neuroectoderm, mature neuron).
For this particular experiment, the authors defined the 3 timepoints as neuroepithelium, early neuroectoderm, and definitive neuroectoderm. Based on canonical TFs from the literature, I then check which of these markers are present in my filtered TF set and annotate them to the specific clusters they belong to. In the heatmap, I selectively label only these markers to help orient and validate whether the clusters align with expected developmental stages.
neuroepithelium <- c('SOX1','PAX6','OTX2','ZIC1','ZIC2','SIX3','LHX2','GBX2','SOX2','HES3')
early_neuroectoderm <- c('PAX6','SOX1','FOXG1','OTX1','EMX2','LHX2','HES5','SOX3','IRX3','NR2F1')
definitive_neuroectoderm <- c('PAX6','FOXG1','EMX1','EOMES','NEUROG1','NEUROG2','ASCL1','SOX11','POU3F2','DLX2')
mature_neuron <- c('TBR1','SATB2','BCL11B','FEZF2','CUX1','CUX2','MEF2C','NEUROD1','NEUROD2','NEUROD6')
all_markers <- unique(c(neuroepithelium, early_neuroectoderm, definitive_neuroectoderm, mature_neuron))
markers_in_data <- intersect(all_markers, final_tfs)
cat("Markers in dataset:", length(markers_in_data), "/", length(all_markers), "\n")## Markers in dataset: 21 / 35
## SOX1, PAX6, OTX2, ZIC1, ZIC2, SIX3, LHX2, SOX2, HES3, FOXG1, OTX1, EMX2, HES5, SOX3, NR2F1, EOMES, ASCL1, POU3F2, SATB2, FEZF2, NEUROD1
Now, we are finally ready to generate two heatmaps of the z-scored VST expression matrix: one without row labels for a clean overview, and one with selective labeling of the canonical stage markers defined above. For this heatmap, I first cluster each individual ‘cluster’ of TFs individually. I then used this ordering of genes in each subcluster when I integrate all clusters together in the expression wave heatmap.
# annotations
annotation_col <- meta_d21[match(sample_order, meta_d21$sample_id), ] %>%
select(timepoint) %>%
mutate(Timepoint = factor(paste0("D", timepoint), levels = c("D6", "D10", "D17"))) %>%
select(Timepoint)
rownames(annotation_col) <- sample_order
annotation_row <- data.frame(
Cluster = factor(cluster_df$cluster[match(ordered_genes, cluster_df$gene_id)])
)
rownames(annotation_row) <- ordered_genes
cluster_colors <- c(
"1" = "#E41A1C", "2" = "#377EB8", "3" = "#4DAF4A",
"4" = "#984EA3", "5" = "#FF7F00"
)
annotation_colors <- list(
Timepoint = c("D6" = "#66C2A5", "D10" = "#FC8D62", "D17" = "#8DA0CB"),
Cluster = cluster_colors
)
heatmap_colors <- colorRampPalette(
c("#053061", "#2166AC", "#4393C3", "#92C5DE", "#D1E5F0",
"#FDDBC7", "#F4A582", "#D6604D", "#B2182B", "#67001F")
)(100)
n_per_timepoint <- table(meta_d21$timepoint[match(sample_order, meta_d21$sample_id)])
gap_positions_col <- cumsum(n_per_timepoint[-length(n_per_timepoint)])
cluster_sizes <- table(factor(cluster_df$cluster[match(ordered_genes, cluster_df$gene_id)], levels = 1:k_clusters))
gap_positions_row <- cumsum(cluster_sizes[-length(cluster_sizes)])
pheatmap(
vst_zscore_ordered,
color = heatmap_colors,
cluster_rows = FALSE,
cluster_cols = FALSE,
annotation_col = annotation_col,
annotation_row = annotation_row,
annotation_colors = annotation_colors,
show_rownames = FALSE,
show_colnames = FALSE,
fontsize = 10,
border_color = NA,
gaps_col = gap_positions_col,
gaps_row = gap_positions_row,
main = paste0("Temporally Dynamic TFs in D21 Differentiation (k=", k_clusters, ", n=", nrow(vst_zscore_ordered), ")")
)labels_row <- ifelse(ordered_genes %in% markers_in_data, ordered_genes, "")
pheatmap(
vst_zscore_ordered,
color = heatmap_colors,
cluster_rows = FALSE,
cluster_cols = FALSE,
annotation_col = annotation_col,
annotation_row = annotation_row,
annotation_colors = annotation_colors,
labels_row = labels_row,
show_rownames = TRUE,
show_colnames = FALSE,
fontsize = 10,
fontsize_row = 8,
border_color = NA,
gaps_col = gap_positions_col,
gaps_row = gap_positions_row,
main = paste0("Temporally Dynamic TFs with Marker Labels (k=", k_clusters, ")")
)Then we extract the genes contained within each cluster of TFs
for (i in 1:k_clusters) {
peak <- cluster_sort_metrics$peak_timepoint[cluster_sort_metrics$cluster == sorted_cluster_order[i]]
cat("Cluster", i, "(peak:", peak, ",", length(cluster_genes[[as.character(i)]]), "genes):",
paste(head(cluster_genes[[as.character(i)]], 5), collapse = ", "),
ifelse(length(cluster_genes[[as.character(i)]]) > 5, "...", ""), "\n")
}## Cluster 1 (peak: D6 , 97 genes): HES1, ZNF766, CREB3L4, MYCN, ZNF670 ...
## Cluster 2 (peak: D10 , 141 genes): FEZF2, RAX, ZBTB12, ERF, ZNF777 ...
## Cluster 3 (peak: D10 , 43 genes): ABHD2, ZSCAN20, IKZF2, ZBED6, FOXK1 ...
## Cluster 4 (peak: D17 , 100 genes): ZNF112, ZNF225, ZNF302, ZNF211, ZNF410 ...
## Cluster 5 (peak: D17 , 150 genes): GLI1, SOHLH2, IRF2, LITAF, KLF7 ...
To complement the heatmap visualization, I also visualize the expression trend for each individual cluster with lineplots using mean log2(CPM) across the differentiation time-course
lineplot_data <- as.data.frame(mean_log2cpm_by_gene) %>%
rownames_to_column("gene_id") %>%
left_join(cluster_df %>% select(gene_id, cluster), by = "gene_id") %>%
pivot_longer(cols = starts_with("D"), names_to = "timepoint", values_to = "log2cpm") %>%
mutate(
timepoint = factor(timepoint, levels = c("D6", "D10", "D17")),
cluster = factor(cluster)
)
cluster_summary <- lineplot_data %>%
group_by(cluster, timepoint) %>%
summarise(
mean_log2cpm = mean(log2cpm),
sd_log2cpm = sd(log2cpm),
n_genes = n(),
.groups = "drop"
)
ggplot(cluster_summary, aes(x = timepoint, y = mean_log2cpm, group = cluster)) +
geom_ribbon(aes(ymin = mean_log2cpm - sd_log2cpm,
ymax = mean_log2cpm + sd_log2cpm, fill = cluster),
alpha = 0.3) +
geom_line(aes(color = cluster), linewidth = 1.5) +
geom_point(aes(color = cluster), size = 3) +
scale_color_manual(values = cluster_colors) +
scale_fill_manual(values = cluster_colors) +
facet_wrap(~ cluster, ncol = 3, scales = "free_y",
labeller = labeller(cluster = function(x) paste("Cluster", x))) +
labs(title = "TF Expression Trends by Cluster",
subtitle = "Mean expression with ±1 SD confidence interval",
x = "Timepoint", y = "Mean log2(CPM + 1)") +
theme_bw(base_size = 12) +
theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
strip.background = element_rect(fill = "gray90"),
strip.text = element_text(size = 11, face = "bold"),
legend.position = "none",
panel.grid.minor = element_blank())ggplot(cluster_summary, aes(x = timepoint, y = mean_log2cpm, color = cluster, group = cluster)) +
geom_line(linewidth = 2) +
geom_point(size = 5, shape = 21, fill = "white", stroke = 2) +
geom_point(aes(fill = cluster), size = 4, shape = 21, stroke = 0) +
scale_color_manual(values = cluster_colors,
labels = sapply(1:k_clusters, function(i) paste0("Cluster ", i, " (n=", sum(cluster_df$cluster == i), ")"))) +
scale_fill_manual(values = cluster_colors,
labels = sapply(1:k_clusters, function(i) paste0("Cluster ", i, " (n=", sum(cluster_df$cluster == i), ")"))) +
labs(title = "TF Cluster Expression Patterns",
subtitle = "Mean log2(CPM + 1) across D21 differentiation",
x = "Timepoint", y = "Mean log2(CPM + 1)", color = "Cluster", fill = "Cluster") +
theme_bw(base_size = 14) +
theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 13, hjust = 0.5, color = "gray40"),
axis.title = element_text(size = 14, face = "bold"),
legend.position = "right",
legend.key.size = unit(1.2, "cm"),
panel.grid.minor = element_blank()) +
guides(fill = guide_legend(override.aes = list(alpha = 1)))example_genes <- cluster_df %>%
left_join(data.frame(gene_id = rownames(mean_log2cpm_by_gene),
mean_expr = rowMeans(mean_log2cpm_by_gene)), by = "gene_id") %>%
group_by(cluster) %>%
arrange(desc(mean_expr)) %>%
slice_head(n = 1) %>%
ungroup() %>%
select(cluster, gene_id)
example_plot_data <- log2cpm_final[example_genes$gene_id, ] %>%
as.data.frame() %>%
rownames_to_column("gene_id") %>%
pivot_longer(cols = -gene_id, names_to = "sample_id", values_to = "log2cpm") %>%
left_join(meta_d21 %>% select(sample_id, timepoint), by = "sample_id") %>%
left_join(example_genes, by = "gene_id") %>%
mutate(timepoint = factor(paste0("D", timepoint), levels = c("D6", "D10", "D17")),
cluster = factor(cluster),
label = paste0("Cluster ", cluster, ": ", gene_id))
ggplot(example_plot_data, aes(x = timepoint, y = log2cpm, group = 1)) +
geom_point(aes(color = cluster), size = 3, alpha = 0.7) +
geom_line(data = example_plot_data %>%
group_by(gene_id, timepoint, cluster, label) %>%
summarise(log2cpm = mean(log2cpm), .groups = "drop"),
aes(color = cluster), linewidth = 1.2) +
scale_color_manual(values = cluster_colors) +
facet_wrap(~ label, ncol = 3, scales = "free_y") +
labs(title = "Example TF Expression Across Differentiation",
subtitle = "Individual samples (points), mean (line)",
x = "Timepoint", y = "log2(CPM + 1)") +
theme_bw(base_size = 12) +
theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
strip.background = element_rect(fill = "gray90"),
strip.text = element_text(size = 10, face = "bold"),
legend.position = "none",
panel.grid.minor = element_blank())cluster_patterns <- cluster_summary %>%
pivot_wider(names_from = timepoint, values_from = c(mean_log2cpm, sd_log2cpm, n_genes)) %>%
select(cluster, starts_with("mean_log2cpm"), n_genes_D6) %>%
rename(d6 = mean_log2cpm_D6, d10 = mean_log2cpm_D10, d17 = mean_log2cpm_D17, n_genes = n_genes_D6) %>%
mutate(
fc_d10_vs_d6 = d10 - d6,
fc_d17_vs_d6 = d17 - d6,
fc_d17_vs_d10 = d17 - d10,
peak = case_when(d6 >= d10 & d6 >= d17 ~ "D6", d10 >= d6 & d10 >= d17 ~ "D10", TRUE ~ "D17"),
pattern = case_when(
fc_d17_vs_d6 > 0.5 & fc_d10_vs_d6 > 0.25 ~ "Increasing",
fc_d17_vs_d6 < -0.5 & fc_d10_vs_d6 < -0.25 ~ "Decreasing",
fc_d10_vs_d6 > 0.5 & fc_d17_vs_d10 < -0.5 ~ "Transient Peak (D10)",
fc_d10_vs_d6 < -0.5 & fc_d17_vs_d10 > 0.5 ~ "Transient Dip (D10)",
abs(fc_d17_vs_d6) < 0.5 & abs(fc_d10_vs_d6) < 0.5 ~ "Stable",
TRUE ~ "Complex"
)
)
print(cluster_patterns %>% select(cluster, n_genes, peak, pattern) %>% as.data.frame())## cluster n_genes peak pattern
## 1 1 97 D6 Decreasing
## 2 2 141 D10 Complex
## 3 3 43 D10 Transient Peak (D10)
## 4 4 100 D17 Increasing
## 5 5 150 D17 Complex
output_dir <- "results"
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
write.csv(cluster_df, file.path(output_dir, "d21_temporal_tf_clusters.csv"), row.names = FALSE)
write.csv(cluster_summary, file.path(output_dir, "d21_cluster_expression_summary.csv"), row.names = FALSE)
write.csv(cluster_patterns, file.path(output_dir, "d21_cluster_patterns.csv"), row.names = FALSE)
write.csv(temporal_tf_df, file.path(output_dir, "d21_temporal_tf_lrt_results.csv"), row.names = FALSE)
saveRDS(cluster_genes, file.path(output_dir, "d21_cluster_gene_lists.rds"))
write.csv(example_genes, file.path(output_dir, "d21_example_genes_per_cluster.csv"), row.names = FALSE)Having defined stage-specific TF modules from control (D21) differentiation, I now ask how these same modules behave in experimental group (T21) samples. This approach uses the D21-derived clusters as a reference framework to identify where T21 expression diverges from normal temporal patterns.
Differences between the D21 and T21 trajectories within a TF cluster indicates that the TF module is dysregulated in T21. This provides a starting point for identifying which stages of differentiation are most affected and which transcriptional programs may be delayed/accelerated/etc.
meta_t21 <- metadata %>%
filter(genotype == "T21") %>%
arrange(timepoint)
t21_samples <- meta_t21$sample_id
cpm_t21 <- cpm_matrix[final_tfs, t21_samples]
# calculate median CPM per gene per timepoint for each genotype
median_cpm_t21_by_gene <- sapply(timepoints, function(tp) {
tp_samples <- meta_t21$sample_id[meta_t21$timepoint == tp]
apply(cpm_t21[, tp_samples, drop = FALSE], 1, median)
})
colnames(median_cpm_t21_by_gene) <- paste0("D", timepoints)
median_cpm_d21_by_gene <- sapply(timepoints, function(tp) {
tp_samples <- meta_d21$sample_id[meta_d21$timepoint == tp]
apply(cpm_final[, tp_samples, drop = FALSE], 1, median)
})
colnames(median_cpm_d21_by_gene) <- paste0("D", timepoints)
# calculate cluster-level medians
cluster_median_d21 <- sapply(1:k_clusters, function(cl) {
genes <- intersect(cluster_df$gene_id[cluster_df$cluster == cl], rownames(median_cpm_d21_by_gene))
if (length(genes) > 0) apply(median_cpm_d21_by_gene[genes, , drop = FALSE], 2, median) else rep(NA, 3)
})
colnames(cluster_median_d21) <- paste0("Cluster_", 1:k_clusters)
cluster_median_t21 <- sapply(1:k_clusters, function(cl) {
genes <- intersect(cluster_df$gene_id[cluster_df$cluster == cl], rownames(median_cpm_t21_by_gene))
if (length(genes) > 0) apply(median_cpm_t21_by_gene[genes, , drop = FALSE], 2, median) else rep(NA, 3)
})
colnames(cluster_median_t21) <- paste0("Cluster_", 1:k_clusters)
genotype_comparison_data <- bind_rows(
as.data.frame(t(cluster_median_d21)) %>%
rownames_to_column("cluster") %>%
mutate(cluster = gsub("Cluster_", "", cluster)) %>%
pivot_longer(cols = starts_with("D"), names_to = "timepoint", values_to = "median_cpm") %>%
mutate(genotype = "D21"),
as.data.frame(t(cluster_median_t21)) %>%
rownames_to_column("cluster") %>%
mutate(cluster = gsub("Cluster_", "", cluster)) %>%
pivot_longer(cols = starts_with("D"), names_to = "timepoint", values_to = "median_cpm") %>%
mutate(genotype = "T21")
) %>%
mutate(
cluster = factor(cluster, levels = as.character(1:k_clusters)),
timepoint = factor(timepoint, levels = c("D6", "D10", "D17")),
genotype = factor(genotype, levels = c("D21", "T21"))
)
genotype_colors <- c("D21" = "royalblue3", "T21" = "red3")
ggplot(genotype_comparison_data, aes(x = timepoint, y = median_cpm, color = genotype, group = genotype)) +
geom_line(linewidth = 1.8) +
geom_point(size = 4, shape = 21, aes(fill = genotype), stroke = 1.5) +
scale_color_manual(values = genotype_colors) +
scale_fill_manual(values = genotype_colors) +
facet_wrap(~ cluster, ncol = 3, scales = "free_y",
labeller = labeller(cluster = function(x) paste("Cluster", x))) +
labs(title = "D21 vs T21 Expression by TF Cluster",
subtitle = "Median CPM (clusters defined from D21)",
x = "Timepoint", y = "Median CPM", color = "Genotype", fill = "Genotype") +
theme_bw(base_size = 13) +
theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
strip.background = element_rect(fill = "gray90"),
strip.text = element_text(size = 12, face = "bold"),
legend.position = "top",
panel.grid.minor = element_blank())ggplot(genotype_comparison_data, aes(x = timepoint, y = median_cpm,
color = cluster, group = interaction(cluster, genotype),
linetype = genotype)) +
geom_line(linewidth = 1.5) +
geom_point(size = 3.5, aes(shape = genotype)) +
scale_color_manual(values = cluster_colors) +
scale_linetype_manual(values = c("D21" = "solid", "T21" = "dashed")) +
scale_shape_manual(values = c("D21" = 16, "T21" = 17)) +
labs(title = "TF Cluster Expression: D21 vs T21",
subtitle = "Solid = D21, Dashed = T21",
x = "Timepoint", y = "Median CPM", color = "Cluster", linetype = "Genotype", shape = "Genotype") +
theme_bw(base_size = 14) +
theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 13, hjust = 0.5, color = "gray40"),
legend.position = "right",
panel.grid.minor = element_blank()) +
guides(color = guide_legend(order = 1), linetype = guide_legend(order = 2), shape = guide_legend(order = 2))## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.3.2 viridis_0.6.5
## [3] viridisLite_0.4.2 RColorBrewer_1.1-3
## [5] pheatmap_1.0.12 ggplot2_4.0.1
## [7] tibble_3.2.1 tidyr_1.3.1
## [9] dplyr_1.1.4 edgeR_4.4.1
## [11] limma_3.62.1 org.Hs.eg.db_3.20.0
## [13] AnnotationDbi_1.68.0 DESeq2_1.46.0
## [15] SummarizedExperiment_1.36.0 Biobase_2.66.0
## [17] MatrixGenerics_1.18.0 matrixStats_1.4.1
## [19] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1
## [21] IRanges_2.40.1 S4Vectors_0.44.0
## [23] BiocGenerics_0.52.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.46.0 gtable_0.3.6 xfun_0.52
## [4] bslib_0.9.0 lattice_0.22-6 vctrs_0.6.5
## [7] tools_4.4.2 generics_0.1.4 parallel_4.4.2
## [10] RSQLite_2.3.9 blob_1.2.4 pkgconfig_2.0.3
## [13] Matrix_1.7-1 S7_0.2.0 lifecycle_1.0.4
## [16] GenomeInfoDbData_1.2.13 compiler_4.4.2 farver_2.1.2
## [19] Biostrings_2.74.0 statmod_1.5.0 codetools_0.2-20
## [22] htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [25] pillar_1.11.1 crayon_1.5.3 jquerylib_0.1.4
## [28] BiocParallel_1.40.0 DelayedArray_0.32.0 cachem_1.1.0
## [31] abind_1.4-8 tidyselect_1.2.1 locfit_1.5-9.10
## [34] digest_0.6.37 purrr_1.0.2 labeling_0.4.3
## [37] fastmap_1.2.0 grid_4.4.2 colorspace_2.1-1
## [40] cli_3.6.3 SparseArray_1.6.0 magrittr_2.0.3
## [43] S4Arrays_1.6.0 dichromat_2.0-0.1 withr_3.0.2
## [46] scales_1.4.0 UCSC.utils_1.2.0 bit64_4.5.2
## [49] rmarkdown_2.29 XVector_0.46.0 httr_1.4.7
## [52] bit_4.5.0.1 gridExtra_2.3 png_0.1-8
## [55] memoise_2.0.1 evaluate_1.0.4 knitr_1.50
## [58] rlang_1.1.4 Rcpp_1.0.13-1 glue_1.8.0
## [61] DBI_1.2.3 rstudioapi_0.17.1 jsonlite_1.8.9
## [64] R6_2.6.1 zlibbioc_1.52.0
This workflow identified 531 transcription factors with:
These TFs were clustered into 5 expression modules, sorted by peak expression timing:
| Cluster | N Genes | Peak | Pattern |
|---|---|---|---|
| 1 | 97 | D6 | Decreasing |
| 2 | 141 | D10 | Complex |
| 3 | 43 | D10 | Transient Peak (D10) |
| 4 | 100 | D17 | Increasing |
| 5 | 150 | D17 | Complex |
Cluster sorting logic:
These stage-specific TF modules can now be compared between genotypes or treatment conditions to identify dysregulated transcriptional programs.