1 Overview

This analysis/workflow characterizes transcription factor (TF) expression patterns during endogenous control (D21) differentiation across developmental timepoints. Specifically, this workflow identifies, via hierarchical clustering, temporal, stage-specific expression of transcription factor modules in time-series bulk RNA-seq data. It also gives a framework for visualization of the ‘wave-like’ expression of these transcription factor modules.

TFs are important regulators of gene expression programs, and many TFs are used as cell-type specific marker genes. Hence, if you are characterizing a time-series differentiation, identifying stage-specific transcription factors in your control samples provides a baseline for understanding how these TF modules are perturbed in your experimental condition.

Furthermore, at the end of this analysis, I provide a basic framework for understanding how these stage-specific transcription factor module gene expression ‘waves’ are ultimately perturbed or changed as a result of a treatment/condition.

Depending on your experimental design (number of timepoints/replicates, your experimental group (genotype, condition, treatment), etc.), you can modify aspects of this work. For example, if you have 5+ timepoints, you will likely have to use a higher k (number of clusters) to capture and partititon all biologically meaningful, homogeneous transcription factor gene expression modules.

If you have any questions of suggestions, feel free to email me at zohebk@uchicago.edu or post on Biostars.

2 Dataset for Analysis

For this workflow, I re-processed publicly available data from the following source. You can use this data to follow along in this workflow, or you can use your own data.

Note that for this dataset, the experimental group is based on genotype (‘T21’). ‘D21’ samples represent the control samples; hence, we will use D21 samples to establish baseline transcription factor expression.

Data source: Martinez JL et al. Transcriptional consequences of trisomy 21 on neural induction. Front Cell Neurosci. 2024;18:1341141. doi:10.3389/fncel.2024.1341141

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000 May;25(1):25-9. doi: 10.1038/75556. PMID: 10802651; PMCID: PMC3037419.

2.1 Workflow Summary

Data loading: Load all required input data (DESeq2 vst counts, raw counts, cpm, and metadata)
Subset to control samples: filter to D21 genotype only
Get transcription factor list: retrieve human TFs from GO:0003700 annotation
Temporal trend analysis: run DESeq2 likelihood ratio test (LRT) on TF genes
Expression filtering: filter TFs by mean CPM > 10 at any differentiation stage
Clustering and visualization: generate heatmap with k=5 clusters, sorted by peak expression timing
Cluster characterization: extract clusters and visualize expression trends

3 Load Dependencies

library(DESeq2)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(edgeR)
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(pheatmap)
library(RColorBrewer)
library(viridis)
library(patchwork)

set.seed(2025)

4 Load Data

For this analysis, we require four objects: - vst counts: DESeq2 variance-stabilized transformed expression matrix (for heatmaps) - raw counts: Raw counts as input for DESeq2 likelihood ratio test - cpm matrix: Counts per million (for filtering/removal of lowly expressed TFs) - metadata: sample information including genotype and timepoint for various steps of workflow

data_dir <- "data"

vst_counts <- readRDS(file.path(data_dir, "counts/vst/WC24_vst_counts.rds"))
raw_counts <- readRDS(file.path(data_dir, "counts/raw/WC24_filt_raw_counts.rds"))
cpm_matrix <- readRDS(file.path(data_dir, "counts/cpm/WC24_cpm.rds"))
metadata <- readRDS(file.path(data_dir, "metadata/WC24_metadata_clean.rds"))

cat("Data dimensions:\n")

## Data dimensions:

cat("  VST:", nrow(vst_counts), "genes x", ncol(vst_counts), "samples\n")

##   VST: 13826 genes x 18 samples

cat("  Raw:", nrow(raw_counts), "genes x", ncol(raw_counts), "samples\n")

##   Raw: 13826 genes x 18 samples

print(head(metadata))

##           sample_id timepoint genotype               stage rep isogenic
## BA1D6_S1   BA1D6_S1         6      D21     neuroepithelium   1     WC24
## BA2D6_S2   BA2D6_S2         6      D21     neuroepithelium   2     WC24
## BA3D6_S3   BA3D6_S3         6      D21     neuroepithelium   3     WC24
## BA1D10_S4 BA1D10_S4        10      D21 early neuroectoderm   1     WC24
## BA2D10_S5 BA2D10_S5        10      D21 early neuroectoderm   2     WC24
## BA3D10_S6 BA3D10_S6        10      D21 early neuroectoderm   3     WC24
##           genotype_timepoint isogenic_genotype_timepoint
## BA1D6_S1              D21_D6                 WC24_D21_D6
## BA2D6_S2              D21_D6                 WC24_D21_D6
## BA3D6_S3              D21_D6                 WC24_D21_D6
## BA1D10_S4            D21_D10                WC24_D21_D10
## BA2D10_S5            D21_D10                WC24_D21_D10
## BA3D10_S6            D21_D10                WC24_D21_D10
##           isogenic_genotype_timepoint_rep
## BA1D6_S1                    WC24_D21_D6_1
## BA2D6_S2                    WC24_D21_D6_2
## BA3D6_S3                    WC24_D21_D6_3
## BA1D10_S4                  WC24_D21_D10_1
## BA2D10_S5                  WC24_D21_D10_2
## BA3D10_S6                  WC24_D21_D10_3

table(metadata$genotype, metadata$timepoint)

##      
##       6 10 17
##   D21 3  3  3
##   T21 3  3  3

5 Subset to Control Samples

To establish and identify baseline transcription factor temporal expression, we will focus exclusively on control (D21) samples to characterize normal differentiation patterns before any genotype comparisons. So we will subset metadata and our counts objects to only contain control sample information.

meta_d21 <- metadata %>%
  filter(genotype == "D21") %>%
  arrange(timepoint)

d21_samples <- meta_d21$sample_id

vst_d21 <- vst_counts[, d21_samples]
raw_d21 <- raw_counts[, d21_samples]
cpm_d21 <- cpm_matrix[, d21_samples]

cat("D21 samples:", length(d21_samples), "\n")

## D21 samples: 9

table(meta_d21$timepoint)

## 
##  6 10 17 
##  3  3  3

6 Retrieve Transcription Factor List

We use the Gene Ontology annotation GO:0003700 (DNA-binding transcription factor activity) to identify all human transcription factors. If your data is in mouse, switch org.Hs.eg.db to org.Mm.eg.db.

We can see that this GO term identifies ~1500 TFs, some of which will be appreciably expressed in our data.

tf_annotation <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys = "GO:0003700",
  keytype = "GOALL",
  columns = "SYMBOL"
)

human_tfs <- unique(na.omit(tf_annotation$SYMBOL))
tfs_in_data <- intersect(human_tfs, rownames(raw_d21))

cat("Human TFs (GO:0003700):", length(human_tfs), "\n")

## Human TFs (GO:0003700): 1466

cat("TFs in expression data:", length(tfs_in_data), "\n")

## TFs in expression data: 1039

7 Temporal Trend Analysis

7.1 DESeq2 Likelihood Ratio Test

While many TFs will be expressed in bulk RNA-seq time-series data, we do not necessarily want to include all of them in our analysis.

Specifically, many TFs will not show a temporal expression pattern. In other words, some TFs will not show any expression pattern that is related to time. A TF might be basally expressed at a constant level across time, for example. For the purposes of this analysis, we do not necessarily concern ourselves with these TFs whose expression is not time-dependent. Rather, we want to focus our analysis on TFs whose expression ‘peaks’ at a particular stage in our differentiation, as this indicates the TF is important for establishing or maintaining that particular developmental stage.

Thus, to identify TFs whose expression is stage-specific/temporally-related, we use DESeq2’s likelihood ratio test (LRT) to identify TFs with significant expression changes across differentiation timepoints. I have provided two links below to read about LRT from the DESeq2 vignette and HBC training resources, but in brief, the LRT compares a full model (~timepoint) against a reduced model (~1) to test whether timepoint explains significant variance in expression.

Why do we use LRT here? - the LRT tests whether a factor (timepoint) contributes significantly to expression - unlike Wald tests, LRT doesn’t compare specific pairs of timepoints - genes passing LRT show significant temporal dynamics regardless of direction

You can read more about DESeq2’s LRT here: - https://hbctraining.github.io/DGE_workshop/lessons/08_DGE_LRT.html - https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html (command-F LRT)

meta_d21$timepoint <- factor(meta_d21$timepoint, levels = sort(unique(meta_d21$timepoint)))
stopifnot(all(colnames(raw_d21) == meta_d21$sample_id))

raw_d21_tfs <- raw_d21[tfs_in_data, ]

dds <- DESeqDataSetFromMatrix(
  countData = raw_d21_tfs,
  colData = meta_d21,
  design = ~ timepoint
)

dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
lrt_results <- results(dds)

summary(lrt_results)

## 
## out of 1038 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 375, 36%
## LFC < 0 (down)     : 345, 33%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

7.2 Filter Temporally Significant TFs

The output of DESeq2’s LRT on our control, endogenous differentiation samples is going to be a set of pvalues/FDR-adjusted p-values for each TF.

The LRT won’t output a useable log2FoldChange value in this case because we simply looking at temporally variable TFs in our control differentiation. We aren’t comparing the expression of these TFs to another condition/genotype, so we will simply use a padj cutoff here to define TFs whose expression pattern across time is ‘temporally significant.’

padj_threshold <- 0.05

temporal_tfs <- rownames(lrt_results)[which(lrt_results$padj < padj_threshold)]
temporal_tfs <- temporal_tfs[!is.na(temporal_tfs)]

temporal_tf_df <- as.data.frame(lrt_results) %>%
  rownames_to_column("gene_id") %>%
  filter(padj < padj_threshold) %>%
  arrange(padj)

cat("Temporally dynamic TFs (padj <", padj_threshold, "):", length(temporal_tfs), "\n\n")

## Temporally dynamic TFs (padj < 0.05 ): 670

print(head(temporal_tf_df[, c("gene_id", "baseMean", "stat", "padj")], 10))

##    gene_id  baseMean      stat          padj
## 1   NKX2-1  454.4607 1069.2465 6.796379e-230
## 2    ZFP42  903.3953  985.6268 4.886861e-212
## 3     RFX4 1495.2113  959.5038 1.532750e-206
## 4     ST18  687.9726  944.2318 2.381254e-203
## 5     RARG  530.4785  923.1016 7.383331e-199
## 6     ELF4 2528.3210  845.4633 4.446564e-182
## 7   NKX2-2  289.5501  760.3032 1.184020e-163
## 8    NR2E1 1771.3166  700.5637 9.719455e-151
## 9     NTN1 1666.5245  696.5273 6.500740e-150
## 10   HESX1  722.3745  685.6152 1.370061e-147

8 CPM-Based Expression Filtering (Optional)

This step in the workflow is optional.

Essentially, although a TF might be stage-specific in terms of its expression pattern, that does not provide insight into the level (or magnitude) of the expression of that TF at that particular stage.

If we directly used the DESeq2 LRT significat temporal TFs for our downstream analysis, we would likely have a non-trivial number of TFs whose expression at a given stage is not appreciably high. The assumption here is that, if a TF is truly biologically relevant for defining a particular stage of differentiation (ie, cardiac progenitor to cardiomyocyte transition), then that TF should be expressed an appreciably high level at one of those stages.

With that logic, I filter TFs by expression level here to focus on biologically relevant TFs that are maintaining the identity of a particular timepoint in differentiation. I use a fairly stringent expression cutoff here defined by edgeR TMM-normalized counts per million. Specifically, a TF must have mean CPM > 10 at at least one differentiation stage to be retained. In other words, a TF must reach this expression threshold at one or more differentiation stages to be considered biologically relevant for our analysis.

cpm_d21_temporal <- cpm_d21[intersect(temporal_tfs, rownames(cpm_d21)), ]
timepoints <- unique(meta_d21$timepoint)

mean_cpm_by_stage <- sapply(timepoints, function(tp) {
  tp_samples <- meta_d21$sample_id[meta_d21$timepoint == tp]
  rowMeans(cpm_d21_temporal[, tp_samples, drop = FALSE])
})
colnames(mean_cpm_by_stage) <- paste0("D", timepoints)

cpm_threshold <- 10
max_mean_cpm <- apply(mean_cpm_by_stage, 1, max)
expressed_tfs <- names(max_mean_cpm)[max_mean_cpm > cpm_threshold]

cat("CPM threshold:", cpm_threshold, "\n")

## CPM threshold: 10

cat("TFs passing filter:", length(expressed_tfs), "\n")

## TFs passing filter: 531

cat("TFs filtered out:", nrow(mean_cpm_by_stage) - length(expressed_tfs), "\n")

## TFs filtered out: 139

9 Prepare Final VST Matrix

So based on the previous filtering/pre-processing steps now have our final set of TFs: 1. Filter/subset list of genes to all human TFs

Subset all TFs to just those with significant temporal expression changes (via DESeq2 LRT)
Further refine our list of stage-specific TFs to those that are appreciably in at least 1 ‘stage’ or timepoint in control differentiation

Next, we prepare the VST matrix for visualization as follows:

We calculate row-wise Z-scores to center TF gene expression at zero and scale variance to one. This transformation creates a standardized expression for a gene across samples, which allows us to cluster TF based on relative expression patterns (e.g., “peak at Day 10”) rather than absolute abundance

2.Certain TFs at particular stages might be extreme outliers in terms of expression that can distort visualization color scales in our heatmap. To address this, we apply winsorization limits (capping values at ±3 standard deviations for z-score vst counts).

final_tfs <- intersect(expressed_tfs, rownames(vst_d21))

sample_order <- meta_d21 %>%
  arrange(timepoint) %>%
  pull(sample_id)

vst_final <- vst_d21[final_tfs, sample_order]

# z-score normalization
vst_zscore <- t(scale(t(vst_final)))

# winsorize extreme values for heatmap visualization
winsorize_limit <- 3
vst_zscore[vst_zscore > winsorize_limit] <- winsorize_limit
vst_zscore[vst_zscore < -winsorize_limit] <- -winsorize_limit

cat("Final TFs for clustering:", length(final_tfs), "\n")

## Final TFs for clustering: 531

cat("Z-score range: [", round(min(vst_zscore), 2), ",", round(max(vst_zscore), 2), "]\n")

## Z-score range: [ -2.38 , 2.4 ]

10 Clustering and Visualization

10.1 Clustering by Peak Expression Timing

To organize the temporally expressed transcription factors into coherent and biologically meaningful modules, we perform hierarchical clustering on z-score (row-normalized) VST counts, followed by a specific temporal reordering step for heatmap visualization purposes.

First, we apply hierarchical clustering (Euclidean distance, complete linkage) to partition the transcription factors into \(k=5\) distinct expression modules. This groups genes based on the similarity of their expression profiles over time, regardless of their absolute expression levels.

We then reorder these clusters based on their peak expression timing (in other words, at what stage/timepoint does a particular cluster of TFs ‘peak’ in expression? we order clusters based on this so that the heatmap is more visually cherent) This forces the heatmap to display a “cascade” or “wave-like” pattern:

k_clusters <- 5

# initial hierarchical clustering
initial_hclust <- hclust(dist(vst_zscore, method = "euclidean"), method = "complete")
initial_clusters <- cutree(initial_hclust, k = k_clusters)

initial_cluster_df <- data.frame(
  gene_id = names(initial_clusters),
  cluster = initial_clusters,
  stringsAsFactors = FALSE
)

# calculate mean log2CPM per gene per timepoint
cpm_final <- cpm_d21[final_tfs, sample_order]
log2cpm_final <- log2(cpm_final + 1)

mean_log2cpm_by_gene <- sapply(timepoints, function(tp) {
  tp_samples <- meta_d21$sample_id[meta_d21$timepoint == tp]
  rowMeans(log2cpm_final[, tp_samples, drop = FALSE])
})
colnames(mean_log2cpm_by_gene) <- paste0("D", timepoints)

# calculate cluster-level profiles for sorting
cluster_profiles <- sapply(1:k_clusters, function(cl) {
  genes_in_cluster <- initial_cluster_df$gene_id[initial_cluster_df$cluster == cl]
  colMeans(mean_log2cpm_by_gene[genes_in_cluster, , drop = FALSE])
})
colnames(cluster_profiles) <- paste0("Cluster_", 1:k_clusters)
cluster_profiles <- t(cluster_profiles)

# sort clusters by peak timing
cluster_sort_metrics <- data.frame(
  cluster = 1:k_clusters,
  d6 = cluster_profiles[, "D6"],
  d10 = cluster_profiles[, "D10"],
  d17 = cluster_profiles[, "D17"]
) %>%
  mutate(
    peak_timepoint = case_when(
      d6 >= d10 & d6 >= d17 ~ "D6",
      d10 >= d6 & d10 >= d17 ~ "D10",
      TRUE ~ "D17"
    ),
    decline_d6_to_d10 = d10 - d6,
    peak_order = case_when(
      peak_timepoint == "D6" ~ 1,
      peak_timepoint == "D10" ~ 2,
      TRUE ~ 3
    ),
    secondary_sort = case_when(
      peak_timepoint == "D6" ~ decline_d6_to_d10,
      peak_timepoint == "D10" ~ -abs(d10 - (d6 + d17)/2),
      TRUE ~ -decline_d6_to_d10
    )
  ) %>%
  arrange(peak_order, secondary_sort)

sorted_cluster_order <- cluster_sort_metrics$cluster
cluster_mapping <- setNames(1:k_clusters, sorted_cluster_order)

# swap cluster 2 and cluster 3 for visualization
swap_mapping <- c("1" = 1, "2" = 3, "3" = 2, "4" = 4, "5" = 5)

# reorder genes by sorted clusters and apply swap
cluster_df <- initial_cluster_df %>%
  mutate(
    original_cluster = cluster,
    cluster = cluster_mapping[as.character(cluster)],
    cluster = swap_mapping[as.character(cluster)]
  ) %>%
  arrange(cluster, gene_id)

ordered_genes <- c()
for (cl in 1:k_clusters) {
  genes_in_cluster <- cluster_df$gene_id[cluster_df$cluster == cl]
  if (length(genes_in_cluster) > 1) {
    sub_hclust <- hclust(dist(vst_zscore[genes_in_cluster, ], method = "euclidean"), method = "complete")
    genes_in_cluster <- genes_in_cluster[sub_hclust$order]
  }
  ordered_genes <- c(ordered_genes, genes_in_cluster)
}

vst_zscore_ordered <- vst_zscore[ordered_genes, ]

cluster_df <- cluster_df %>%
  mutate(gene_id = factor(gene_id, levels = ordered_genes)) %>%
  arrange(gene_id) %>%
  mutate(gene_id = as.character(gene_id))

cluster_genes <- split(cluster_df$gene_id, cluster_df$cluster)

print(table(cluster_df$cluster))

## 
##   1   2   3   4   5 
##  97 141  43 100 150

10.2 Define Marker Transcription Factors

Here, I define curated marker gene lists for four stages of neural differentiation (neuroepithelium, early neuroectoderm, definitive neuroectoderm, mature neuron).

For this particular experiment, the authors defined the 3 timepoints as neuroepithelium, early neuroectoderm, and definitive neuroectoderm. Based on canonical TFs from the literature, I then check which of these markers are present in my filtered TF set and annotate them to the specific clusters they belong to. In the heatmap, I selectively label only these markers to help orient and validate whether the clusters align with expected developmental stages.

neuroepithelium <- c('SOX1','PAX6','OTX2','ZIC1','ZIC2','SIX3','LHX2','GBX2','SOX2','HES3')
early_neuroectoderm <- c('PAX6','SOX1','FOXG1','OTX1','EMX2','LHX2','HES5','SOX3','IRX3','NR2F1')
definitive_neuroectoderm <- c('PAX6','FOXG1','EMX1','EOMES','NEUROG1','NEUROG2','ASCL1','SOX11','POU3F2','DLX2')
mature_neuron <- c('TBR1','SATB2','BCL11B','FEZF2','CUX1','CUX2','MEF2C','NEUROD1','NEUROD2','NEUROD6')

all_markers <- unique(c(neuroepithelium, early_neuroectoderm, definitive_neuroectoderm, mature_neuron))
markers_in_data <- intersect(all_markers, final_tfs)

cat("Markers in dataset:", length(markers_in_data), "/", length(all_markers), "\n")

## Markers in dataset: 21 / 35

cat(paste(markers_in_data, collapse = ", "), "\n")

## SOX1, PAX6, OTX2, ZIC1, ZIC2, SIX3, LHX2, SOX2, HES3, FOXG1, OTX1, EMX2, HES5, SOX3, NR2F1, EOMES, ASCL1, POU3F2, SATB2, FEZF2, NEUROD1

10.3 Temporal Expression Heatmap

Now, we are finally ready to generate two heatmaps of the z-scored VST expression matrix: one without row labels for a clean overview, and one with selective labeling of the canonical stage markers defined above. For this heatmap, I first cluster each individual ‘cluster’ of TFs individually. I then used this ordering of genes in each subcluster when I integrate all clusters together in the expression wave heatmap.

# annotations
annotation_col <- meta_d21[match(sample_order, meta_d21$sample_id), ] %>%
  select(timepoint) %>%
  mutate(Timepoint = factor(paste0("D", timepoint), levels = c("D6", "D10", "D17"))) %>%
  select(Timepoint)
rownames(annotation_col) <- sample_order

annotation_row <- data.frame(
  Cluster = factor(cluster_df$cluster[match(ordered_genes, cluster_df$gene_id)])
)
rownames(annotation_row) <- ordered_genes

cluster_colors <- c(
  "1" = "#E41A1C", "2" = "#377EB8", "3" = "#4DAF4A",
  "4" = "#984EA3", "5" = "#FF7F00"
)

annotation_colors <- list(
  Timepoint = c("D6" = "#66C2A5", "D10" = "#FC8D62", "D17" = "#8DA0CB"),
  Cluster = cluster_colors
)

heatmap_colors <- colorRampPalette(
  c("#053061", "#2166AC", "#4393C3", "#92C5DE", "#D1E5F0",
    "#FDDBC7", "#F4A582", "#D6604D", "#B2182B", "#67001F")
)(100)

n_per_timepoint <- table(meta_d21$timepoint[match(sample_order, meta_d21$sample_id)])
gap_positions_col <- cumsum(n_per_timepoint[-length(n_per_timepoint)])

cluster_sizes <- table(factor(cluster_df$cluster[match(ordered_genes, cluster_df$gene_id)], levels = 1:k_clusters))
gap_positions_row <- cumsum(cluster_sizes[-length(cluster_sizes)])

pheatmap(
  vst_zscore_ordered,
  color = heatmap_colors,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  annotation_col = annotation_col,
  annotation_row = annotation_row,
  annotation_colors = annotation_colors,
  show_rownames = FALSE,
  show_colnames = FALSE,
  fontsize = 10,
  border_color = NA,
  gaps_col = gap_positions_col,
  gaps_row = gap_positions_row,
  main = paste0("Temporally Dynamic TFs in D21 Differentiation (k=", k_clusters, ", n=", nrow(vst_zscore_ordered), ")")
)

labels_row <- ifelse(ordered_genes %in% markers_in_data, ordered_genes, "")

pheatmap(
  vst_zscore_ordered,
  color = heatmap_colors,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  annotation_col = annotation_col,
  annotation_row = annotation_row,
  annotation_colors = annotation_colors,
  labels_row = labels_row,
  show_rownames = TRUE,
  show_colnames = FALSE,
  fontsize = 10,
  fontsize_row = 8,
  border_color = NA,
  gaps_col = gap_positions_col,
  gaps_row = gap_positions_row,
  main = paste0("Temporally Dynamic TFs with Marker Labels (k=", k_clusters, ")")
)

11 Cluster Membership

Then we extract the genes contained within each cluster of TFs

for (i in 1:k_clusters) {
  peak <- cluster_sort_metrics$peak_timepoint[cluster_sort_metrics$cluster == sorted_cluster_order[i]]
  cat("Cluster", i, "(peak:", peak, ",", length(cluster_genes[[as.character(i)]]), "genes):",
      paste(head(cluster_genes[[as.character(i)]], 5), collapse = ", "),
      ifelse(length(cluster_genes[[as.character(i)]]) > 5, "...", ""), "\n")
}

## Cluster 1 (peak: D6 , 97 genes): HES1, ZNF766, CREB3L4, MYCN, ZNF670 ... 
## Cluster 2 (peak: D10 , 141 genes): FEZF2, RAX, ZBTB12, ERF, ZNF777 ... 
## Cluster 3 (peak: D10 , 43 genes): ABHD2, ZSCAN20, IKZF2, ZBED6, FOXK1 ... 
## Cluster 4 (peak: D17 , 100 genes): ZNF112, ZNF225, ZNF302, ZNF211, ZNF410 ... 
## Cluster 5 (peak: D17 , 150 genes): GLI1, SOHLH2, IRF2, LITAF, KLF7 ...

12 Cluster Expression Trends

To complement the heatmap visualization, I also visualize the expression trend for each individual cluster with lineplots using mean log2(CPM) across the differentiation time-course

lineplot_data <- as.data.frame(mean_log2cpm_by_gene) %>%
  rownames_to_column("gene_id") %>%
  left_join(cluster_df %>% select(gene_id, cluster), by = "gene_id") %>%
  pivot_longer(cols = starts_with("D"), names_to = "timepoint", values_to = "log2cpm") %>%
  mutate(
    timepoint = factor(timepoint, levels = c("D6", "D10", "D17")),
    cluster = factor(cluster)
  )

cluster_summary <- lineplot_data %>%
  group_by(cluster, timepoint) %>%
  summarise(
    mean_log2cpm = mean(log2cpm),
    sd_log2cpm = sd(log2cpm),
    n_genes = n(),
    .groups = "drop"
  )

ggplot(cluster_summary, aes(x = timepoint, y = mean_log2cpm, group = cluster)) +
  geom_ribbon(aes(ymin = mean_log2cpm - sd_log2cpm,
                  ymax = mean_log2cpm + sd_log2cpm, fill = cluster),
              alpha = 0.3) +
  geom_line(aes(color = cluster), linewidth = 1.5) +
  geom_point(aes(color = cluster), size = 3) +
  scale_color_manual(values = cluster_colors) +
  scale_fill_manual(values = cluster_colors) +
  facet_wrap(~ cluster, ncol = 3, scales = "free_y",
             labeller = labeller(cluster = function(x) paste("Cluster", x))) +
  labs(title = "TF Expression Trends by Cluster",
       subtitle = "Mean expression with ±1 SD confidence interval",
       x = "Timepoint", y = "Mean log2(CPM + 1)") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
        strip.background = element_rect(fill = "gray90"),
        strip.text = element_text(size = 11, face = "bold"),
        legend.position = "none",
        panel.grid.minor = element_blank())

ggplot(cluster_summary, aes(x = timepoint, y = mean_log2cpm, color = cluster, group = cluster)) +
  geom_line(linewidth = 2) +
  geom_point(size = 5, shape = 21, fill = "white", stroke = 2) +
  geom_point(aes(fill = cluster), size = 4, shape = 21, stroke = 0) +
  scale_color_manual(values = cluster_colors,
                     labels = sapply(1:k_clusters, function(i) paste0("Cluster ", i, " (n=", sum(cluster_df$cluster == i), ")"))) +
  scale_fill_manual(values = cluster_colors,
                    labels = sapply(1:k_clusters, function(i) paste0("Cluster ", i, " (n=", sum(cluster_df$cluster == i), ")"))) +
  labs(title = "TF Cluster Expression Patterns",
       subtitle = "Mean log2(CPM + 1) across D21 differentiation",
       x = "Timepoint", y = "Mean log2(CPM + 1)", color = "Cluster", fill = "Cluster") +
  theme_bw(base_size = 14) +
  theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 13, hjust = 0.5, color = "gray40"),
        axis.title = element_text(size = 14, face = "bold"),
        legend.position = "right",
        legend.key.size = unit(1.2, "cm"),
        panel.grid.minor = element_blank()) +
  guides(fill = guide_legend(override.aes = list(alpha = 1)))

12.1 Example Gene Expression per Cluster

example_genes <- cluster_df %>%
  left_join(data.frame(gene_id = rownames(mean_log2cpm_by_gene),
                       mean_expr = rowMeans(mean_log2cpm_by_gene)), by = "gene_id") %>%
  group_by(cluster) %>%
  arrange(desc(mean_expr)) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(cluster, gene_id)

example_plot_data <- log2cpm_final[example_genes$gene_id, ] %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  pivot_longer(cols = -gene_id, names_to = "sample_id", values_to = "log2cpm") %>%
  left_join(meta_d21 %>% select(sample_id, timepoint), by = "sample_id") %>%
  left_join(example_genes, by = "gene_id") %>%
  mutate(timepoint = factor(paste0("D", timepoint), levels = c("D6", "D10", "D17")),
         cluster = factor(cluster),
         label = paste0("Cluster ", cluster, ": ", gene_id))

ggplot(example_plot_data, aes(x = timepoint, y = log2cpm, group = 1)) +
  geom_point(aes(color = cluster), size = 3, alpha = 0.7) +
  geom_line(data = example_plot_data %>%
              group_by(gene_id, timepoint, cluster, label) %>%
              summarise(log2cpm = mean(log2cpm), .groups = "drop"),
            aes(color = cluster), linewidth = 1.2) +
  scale_color_manual(values = cluster_colors) +
  facet_wrap(~ label, ncol = 3, scales = "free_y") +
  labs(title = "Example TF Expression Across Differentiation",
       subtitle = "Individual samples (points), mean (line)",
       x = "Timepoint", y = "log2(CPM + 1)") +
  theme_bw(base_size = 12) +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
        strip.background = element_rect(fill = "gray90"),
        strip.text = element_text(size = 10, face = "bold"),
        legend.position = "none",
        panel.grid.minor = element_blank())

13 Cluster Characterization

cluster_patterns <- cluster_summary %>%
  pivot_wider(names_from = timepoint, values_from = c(mean_log2cpm, sd_log2cpm, n_genes)) %>%
  select(cluster, starts_with("mean_log2cpm"), n_genes_D6) %>%
  rename(d6 = mean_log2cpm_D6, d10 = mean_log2cpm_D10, d17 = mean_log2cpm_D17, n_genes = n_genes_D6) %>%
  mutate(
    fc_d10_vs_d6 = d10 - d6,
    fc_d17_vs_d6 = d17 - d6,
    fc_d17_vs_d10 = d17 - d10,
    peak = case_when(d6 >= d10 & d6 >= d17 ~ "D6", d10 >= d6 & d10 >= d17 ~ "D10", TRUE ~ "D17"),
    pattern = case_when(
      fc_d17_vs_d6 > 0.5 & fc_d10_vs_d6 > 0.25 ~ "Increasing",
      fc_d17_vs_d6 < -0.5 & fc_d10_vs_d6 < -0.25 ~ "Decreasing",
      fc_d10_vs_d6 > 0.5 & fc_d17_vs_d10 < -0.5 ~ "Transient Peak (D10)",
      fc_d10_vs_d6 < -0.5 & fc_d17_vs_d10 > 0.5 ~ "Transient Dip (D10)",
      abs(fc_d17_vs_d6) < 0.5 & abs(fc_d10_vs_d6) < 0.5 ~ "Stable",
      TRUE ~ "Complex"
    )
  )

print(cluster_patterns %>% select(cluster, n_genes, peak, pattern) %>% as.data.frame())

##   cluster n_genes peak              pattern
## 1       1      97   D6           Decreasing
## 2       2     141  D10              Complex
## 3       3      43  D10 Transient Peak (D10)
## 4       4     100  D17           Increasing
## 5       5     150  D17              Complex

14 Export Results

output_dir <- "results"
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)

write.csv(cluster_df, file.path(output_dir, "d21_temporal_tf_clusters.csv"), row.names = FALSE)
write.csv(cluster_summary, file.path(output_dir, "d21_cluster_expression_summary.csv"), row.names = FALSE)
write.csv(cluster_patterns, file.path(output_dir, "d21_cluster_patterns.csv"), row.names = FALSE)
write.csv(temporal_tf_df, file.path(output_dir, "d21_temporal_tf_lrt_results.csv"), row.names = FALSE)
saveRDS(cluster_genes, file.path(output_dir, "d21_cluster_gene_lists.rds"))
write.csv(example_genes, file.path(output_dir, "d21_example_genes_per_cluster.csv"), row.names = FALSE)

15 D21 vs T21 Comparison

Having defined stage-specific TF modules from control (D21) differentiation, I now ask how these same modules behave in experimental group (T21) samples. This approach uses the D21-derived clusters as a reference framework to identify where T21 expression diverges from normal temporal patterns.

Differences between the D21 and T21 trajectories within a TF cluster indicates that the TF module is dysregulated in T21. This provides a starting point for identifying which stages of differentiation are most affected and which transcriptional programs may be delayed/accelerated/etc.

meta_t21 <- metadata %>%
  filter(genotype == "T21") %>%
  arrange(timepoint)

t21_samples <- meta_t21$sample_id
cpm_t21 <- cpm_matrix[final_tfs, t21_samples]

# calculate median CPM per gene per timepoint for each genotype
median_cpm_t21_by_gene <- sapply(timepoints, function(tp) {
  tp_samples <- meta_t21$sample_id[meta_t21$timepoint == tp]
  apply(cpm_t21[, tp_samples, drop = FALSE], 1, median)
})
colnames(median_cpm_t21_by_gene) <- paste0("D", timepoints)

median_cpm_d21_by_gene <- sapply(timepoints, function(tp) {
  tp_samples <- meta_d21$sample_id[meta_d21$timepoint == tp]
  apply(cpm_final[, tp_samples, drop = FALSE], 1, median)
})
colnames(median_cpm_d21_by_gene) <- paste0("D", timepoints)

# calculate cluster-level medians
cluster_median_d21 <- sapply(1:k_clusters, function(cl) {
  genes <- intersect(cluster_df$gene_id[cluster_df$cluster == cl], rownames(median_cpm_d21_by_gene))
  if (length(genes) > 0) apply(median_cpm_d21_by_gene[genes, , drop = FALSE], 2, median) else rep(NA, 3)
})
colnames(cluster_median_d21) <- paste0("Cluster_", 1:k_clusters)

cluster_median_t21 <- sapply(1:k_clusters, function(cl) {
  genes <- intersect(cluster_df$gene_id[cluster_df$cluster == cl], rownames(median_cpm_t21_by_gene))
  if (length(genes) > 0) apply(median_cpm_t21_by_gene[genes, , drop = FALSE], 2, median) else rep(NA, 3)
})
colnames(cluster_median_t21) <- paste0("Cluster_", 1:k_clusters)

genotype_comparison_data <- bind_rows(
  as.data.frame(t(cluster_median_d21)) %>%
    rownames_to_column("cluster") %>%
    mutate(cluster = gsub("Cluster_", "", cluster)) %>%
    pivot_longer(cols = starts_with("D"), names_to = "timepoint", values_to = "median_cpm") %>%
    mutate(genotype = "D21"),
  as.data.frame(t(cluster_median_t21)) %>%
    rownames_to_column("cluster") %>%
    mutate(cluster = gsub("Cluster_", "", cluster)) %>%
    pivot_longer(cols = starts_with("D"), names_to = "timepoint", values_to = "median_cpm") %>%
    mutate(genotype = "T21")
) %>%
  mutate(
    cluster = factor(cluster, levels = as.character(1:k_clusters)),
    timepoint = factor(timepoint, levels = c("D6", "D10", "D17")),
    genotype = factor(genotype, levels = c("D21", "T21"))
  )

genotype_colors <- c("D21" = "royalblue3", "T21" = "red3")

ggplot(genotype_comparison_data, aes(x = timepoint, y = median_cpm, color = genotype, group = genotype)) +
  geom_line(linewidth = 1.8) +
  geom_point(size = 4, shape = 21, aes(fill = genotype), stroke = 1.5) +
  scale_color_manual(values = genotype_colors) +
  scale_fill_manual(values = genotype_colors) +
  facet_wrap(~ cluster, ncol = 3, scales = "free_y",
             labeller = labeller(cluster = function(x) paste("Cluster", x))) +
  labs(title = "D21 vs T21 Expression by TF Cluster",
       subtitle = "Median CPM (clusters defined from D21)",
       x = "Timepoint", y = "Median CPM", color = "Genotype", fill = "Genotype") +
  theme_bw(base_size = 13) +
  theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
        strip.background = element_rect(fill = "gray90"),
        strip.text = element_text(size = 12, face = "bold"),
        legend.position = "top",
        panel.grid.minor = element_blank())

ggplot(genotype_comparison_data, aes(x = timepoint, y = median_cpm,
                                      color = cluster, group = interaction(cluster, genotype),
                                      linetype = genotype)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3.5, aes(shape = genotype)) +
  scale_color_manual(values = cluster_colors) +
  scale_linetype_manual(values = c("D21" = "solid", "T21" = "dashed")) +
  scale_shape_manual(values = c("D21" = 16, "T21" = 17)) +
  labs(title = "TF Cluster Expression: D21 vs T21",
       subtitle = "Solid = D21, Dashed = T21",
       x = "Timepoint", y = "Median CPM", color = "Cluster", linetype = "Genotype", shape = "Genotype") +
  theme_bw(base_size = 14) +
  theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 13, hjust = 0.5, color = "gray40"),
        legend.position = "right",
        panel.grid.minor = element_blank()) +
  guides(color = guide_legend(order = 1), linetype = guide_legend(order = 2), shape = guide_legend(order = 2))

write.csv(genotype_comparison_data, file.path(output_dir, "d21_t21_cluster_comparison.csv"), row.names = FALSE)

16 Session Information

sessionInfo()

## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.3.2             viridis_0.6.5              
##  [3] viridisLite_0.4.2           RColorBrewer_1.1-3         
##  [5] pheatmap_1.0.12             ggplot2_4.0.1              
##  [7] tibble_3.2.1                tidyr_1.3.1                
##  [9] dplyr_1.1.4                 edgeR_4.4.1                
## [11] limma_3.62.1                org.Hs.eg.db_3.20.0        
## [13] AnnotationDbi_1.68.0        DESeq2_1.46.0              
## [15] SummarizedExperiment_1.36.0 Biobase_2.66.0             
## [17] MatrixGenerics_1.18.0       matrixStats_1.4.1          
## [19] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1        
## [21] IRanges_2.40.1              S4Vectors_0.44.0           
## [23] BiocGenerics_0.52.0        
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.46.0         gtable_0.3.6            xfun_0.52              
##  [4] bslib_0.9.0             lattice_0.22-6          vctrs_0.6.5            
##  [7] tools_4.4.2             generics_0.1.4          parallel_4.4.2         
## [10] RSQLite_2.3.9           blob_1.2.4              pkgconfig_2.0.3        
## [13] Matrix_1.7-1            S7_0.2.0                lifecycle_1.0.4        
## [16] GenomeInfoDbData_1.2.13 compiler_4.4.2          farver_2.1.2           
## [19] Biostrings_2.74.0       statmod_1.5.0           codetools_0.2-20       
## [22] htmltools_0.5.8.1       sass_0.4.10             yaml_2.3.10            
## [25] pillar_1.11.1           crayon_1.5.3            jquerylib_0.1.4        
## [28] BiocParallel_1.40.0     DelayedArray_0.32.0     cachem_1.1.0           
## [31] abind_1.4-8             tidyselect_1.2.1        locfit_1.5-9.10        
## [34] digest_0.6.37           purrr_1.0.2             labeling_0.4.3         
## [37] fastmap_1.2.0           grid_4.4.2              colorspace_2.1-1       
## [40] cli_3.6.3               SparseArray_1.6.0       magrittr_2.0.3         
## [43] S4Arrays_1.6.0          dichromat_2.0-0.1       withr_3.0.2            
## [46] scales_1.4.0            UCSC.utils_1.2.0        bit64_4.5.2            
## [49] rmarkdown_2.29          XVector_0.46.0          httr_1.4.7             
## [52] bit_4.5.0.1             gridExtra_2.3           png_0.1-8              
## [55] memoise_2.0.1           evaluate_1.0.4          knitr_1.50             
## [58] rlang_1.1.4             Rcpp_1.0.13-1           glue_1.8.0             
## [61] DBI_1.2.3               rstudioapi_0.17.1       jsonlite_1.8.9         
## [64] R6_2.6.1                zlibbioc_1.52.0

17 Summary

This workflow identified 531 transcription factors with:

Significant temporal expression changes during D21 differentiation (DESeq2 LRT, padj < 0.05)
Adequate expression levels (mean CPM > 10 at any differentiation stage)

These TFs were clustered into 5 expression modules, sorted by peak expression timing:

TF Expression Clusters (sorted by peak timing)
Cluster	N Genes	Peak	Pattern
1	97	D6	Decreasing
2	141	D10	Complex
3	43	D10	Transient Peak (D10)
4	100	D17	Increasing
5	150	D17	Complex

Cluster sorting logic:

Clusters ordered by peak expression timepoint (D6 → D10 → D17)
Within same peak, sub-sorted by trajectory

These stage-specific TF modules can now be compared between genotypes or treatment conditions to identify dysregulated transcriptional programs.

Temporal transcription factor-based characterization of bulk time-series RNA-seq data for control/endogenous differentiation

Zoheb Khan

September 19, 2025