The tfclassify package provides a curated database of
~1,600 human transcription factors based on the Lambert
et al. 2018 census (Cell 172:650-665). It validates genes against this
gold-standard database and assigns TF family classifications.
Key features:
# TF database summary
db_info <- tf_database_info()
data.frame(
Metric = c("Total TFs", "TF Families", "DBD Classes", "Source"),
Value = c(db_info$total_tfs, db_info$n_families, db_info$n_classes, db_info$source)
)# top TF families by count
data.frame(
Family = names(db_info$top_families),
Count = as.numeric(db_info$top_families)
)To demonstrate the functionalities of this workflow, we analyze a publicly available dataset from:
Martinez JL et al. Transcriptional consequences of trisomy 21 on neural induction. Front Cell Neurosci. 2024;18:1341141. doi:[10.3389/fncel.2024.1341141](https://doi.org/10.3389/fncel.2024.1341141)
This dataset examines transcriptional changes during iPSC-derived neural differentiation. We compare Day 17 vs Day 6 of differentiation in control (D21) cells to identify TFs involved in neural lineage commitment.
# Load DESeq2 results: Day 17 vs Day 6 differentiation (apeglm shrunken LFC)
res_file <- "data/deg/apeglm/unfiltered/WC24_D21_D17_vs_D21_D6.rds"
res <- readRDS(res_file)
# convert to data frame
deseq_results <- as.data.frame(res)
# Get gene symbols from count matrix (DESeq2 results use numeric row indices)
counts <- readRDS("data/counts/normalized/WC24_norm_counts.rds")
gene_symbols <- rownames(counts)
# map gene symbols by row index
deseq_results$gene_id <- gene_symbols[as.integer(rownames(deseq_results))]
rownames(deseq_results) <- NULL
# reorder columns
deseq_results <- deseq_results[, c("gene_id", setdiff(names(deseq_results), "gene_id"))]
head(deseq_results)## [1] 13826 5
## [1] "gene_id" "baseMean" "log2FoldChange" "lfcSE"
## [5] "padj"
## baseMean log2FoldChange padj
## Min. : 19.2 Min. :-12.03797 Min. :0.0000000
## 1st Qu.: 297.7 1st Qu.: -0.39686 1st Qu.:0.0000002
## Median : 933.5 Median : 0.01519 Median :0.0096932
## Mean : 2454.9 Mean : 0.04478 Mean :0.1812361
## 3rd Qu.: 2210.8 3rd Qu.: 0.50331 3rd Qu.:0.2842740
## Max. :671301.4 Max. : 11.11793 Max. :0.9998523
## NA's :3
Interpretation: Positive log2FC = upregulated at D17 (late differentiation); Negative log2FC = downregulated at D17 (higher at D6/early).
The main function classify_tfs() adds family annotations
to your data:
# classify all genes
classified <- classify_tfs(deseq_results, gene_col = "gene_id")
c(total = nrow(classified),
tfs = sum(classified$is_tf),
pct_tfs = round(mean(classified$is_tf) * 100, 2))## total tfs pct_tfs
## 13826.00 1026.00 7.42
# View TFs with family assignments
classified %>%
dplyr::filter(is_tf) %>%
dplyr::select(gene_id, log2FoldChange, padj, tf_family, tf_class) %>%
dplyr::arrange(desc(abs(log2FoldChange))) %>%
head(15)| Column | Description | Example |
|---|---|---|
tf_family |
TF family name | SOX, FOX, ZNF, HOX |
tf_class |
DNA-binding domain class | HMG, Forkhead, C2H2 zinc finger |
tf_subfamily |
More specific annotation | Estrogen receptor, PPAR |
is_tf |
Boolean TF status | TRUE/FALSE |
Use classify_tfs() with significance thresholds to get
significant TFs with family annotations:
# Significant TFs (|log2FC| >= 1, padj <= 0.05)
sig_tfs <- classify_tfs(
deseq_results,
gene_col = "gene_id",
filter_to_tfs = TRUE,
log2fc_col = "log2FoldChange",
log2fc_threshold = 1,
padj_col = "padj",
padj_threshold = 0.05
)
nrow(sig_tfs)## [1] 170
# upregulated at D17 (positive LFC) vs downregulated at D17 (negative LFC)
up_tfs <- sig_tfs %>% dplyr::filter(log2FoldChange > 0) %>% dplyr::arrange(desc(log2FoldChange))
down_tfs <- sig_tfs %>% dplyr::filter(log2FoldChange < 0) %>% dplyr::arrange(log2FoldChange)
c(up_at_D17 = nrow(up_tfs), down_at_D17 = nrow(down_tfs))## up_at_D17 down_at_D17
## 90 80
# Top TFs upregulated at D17 (late differentiation)
up_tfs %>%
dplyr::select(gene_id, log2FoldChange, padj, tf_family) %>%
head(10)# top TFs downregulated at D17 (higher at D6/early)
down_tfs %>%
dplyr::select(gene_id, log2FoldChange, padj, tf_family) %>%
head(10)# all TFs in dataset
all_tfs <- classify_tfs(deseq_results, gene_col = "gene_id", filter_to_tfs = TRUE)
plot_tf_families(
all_tfs,
top_n = 20,
title = "TF Family Distribution in Dataset",
palette = "Set2"
)TFs with generic family assignments (only DBD class known):
## total_tfs unmapped
## 1026 44
The ZNF family dominates with 500+ members. Exclude it to see other patterns:
plot_tf_families(
all_tfs,
top_n = 20,
title = "TF Family Distribution (Excluding ZNF)",
palette = "Set2",
exclude_families = "ZNF"
)plot_tf_families(
sig_tfs,
top_n = 15,
title = "Significantly DE TF Families (D17 vs D6)",
palette = "Set3",
exclude_families = "ZNF"
)# count families by direction
up_counts <- up_tfs %>%
dplyr::count(tf_family, name = "count") %>%
dplyr::mutate(direction = "Up at D17")
down_counts <- down_tfs %>%
dplyr::count(tf_family, name = "count") %>%
dplyr::mutate(count = -count, direction = "Down at D17")
combined <- dplyr::bind_rows(up_counts, down_counts)
# Top families (excluding ZNF)
top_fams <- combined %>%
dplyr::group_by(tf_family) %>%
dplyr::summarize(total = sum(abs(count)), .groups = "drop") %>%
dplyr::filter(!tf_family %in% c("ZNF")) %>%
dplyr::arrange(desc(total)) %>%
dplyr::slice_head(n = 15) %>%
dplyr::pull(tf_family)
plot_data <- combined %>%
dplyr::filter(tf_family %in% top_fams) %>%
dplyr::mutate(tf_family = factor(tf_family, levels = rev(top_fams)))
ggplot(plot_data, aes(x = tf_family, y = count, fill = direction)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("Up at D17" = "#E64B35", "Down at D17" = "#4DBBD5")) +
geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
labs(
title = "TF Families by Regulation Direction",
subtitle = "Neural differentiation: Day 17 vs Day 6",
x = NULL, y = "Number of TFs", fill = NULL
) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "gray40"),
legend.position = "top")## [1] "SOX1" "SOX10" "SOX11" "SOX12" "SOX13" "SOX14" "SOX15" "SOX17" "SOX18"
## [10] "SOX2" "SOX21" "SOX3" "SOX30" "SOX4" "SOX5" "SOX6" "SOX7" "SOX8"
## [19] "SOX9" "SRY"
# SOX TFs in our data
classified %>%
dplyr::filter(tf_family == "SOX") %>%
dplyr::select(gene_id, log2FoldChange, padj) %>%
dplyr::arrange(desc(abs(log2FoldChange)))tf_summary <- sig_tfs %>%
dplyr::group_by(tf_family, tf_class) %>%
dplyr::summarize(
n = n(),
mean_log2FC = round(mean(log2FoldChange), 2),
n_up = sum(log2FoldChange > 0),
n_down = sum(log2FoldChange < 0),
.groups = "drop"
) %>%
dplyr::arrange(desc(n))
tf_summary %>% head(15)# classify a gene list directly
my_genes <- c("SOX2", "NANOG", "POU5F1", "GATA4", "GAPDH", "ACTB")
classify_tfs(my_genes)## SOX2 NANOG POU5F1 GATA4 GAPDH ACTB
## "SOX" "NANOG" "POU" "GATA" NA NA
## OCT4 SLUG P53
## "POU" "SNAI" "p53"
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tfclassify_0.1.0 testthat_3.2.3 lubridate_1.9.4 forcats_1.0.0
## [5] stringr_1.6.0 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [9] tidyr_1.3.1 tibble_3.2.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 httr2_1.2.1 xfun_0.52 bslib_0.9.0
## [5] htmlwidgets_1.6.4 devtools_2.4.5 remotes_2.5.0 tzdb_0.4.0
## [9] vctrs_0.6.5 tools_4.4.2 generics_0.1.4 pkgconfig_2.0.3
## [13] RColorBrewer_1.1-3 S7_0.2.0 desc_1.4.3 lifecycle_1.0.4
## [17] compiler_4.4.2 farver_2.1.2 brio_1.1.5 httpuv_1.6.16
## [21] htmltools_0.5.8.1 usethis_3.1.0 sass_0.4.10 yaml_2.3.10
## [25] later_1.4.2 pillar_1.11.1 jquerylib_0.1.4 urlchecker_1.0.1
## [29] ellipsis_0.3.2 cachem_1.1.0 sessioninfo_1.2.3 mime_0.13
## [33] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4 labeling_0.4.3
## [37] rprojroot_2.1.0 fastmap_1.2.0 grid_4.4.2 cli_3.6.3
## [41] magrittr_2.0.3 dichromat_2.0-0.1 pkgbuild_1.4.8 withr_3.0.2
## [45] rappdirs_0.3.3 scales_1.4.0 promises_1.3.3 timechange_0.3.0
## [49] rmarkdown_2.29 hms_1.1.3 memoise_2.0.1 shiny_1.11.1
## [53] evaluate_1.0.4 knitr_1.50 viridisLite_0.4.2 miniUI_0.1.2
## [57] profvis_0.4.0 rlang_1.1.4 Rcpp_1.0.13-1 xtable_1.8-4
## [61] glue_1.8.0 pkgload_1.4.0 rstudioapi_0.17.1 jsonlite_1.8.9
## [65] R6_2.6.1 fs_1.6.5
| Function | Description |
|---|---|
classify_tfs() |
Classify genes with TF family annotations |
filter_tfs() |
Filter to TFs only (no annotations) |
filter_expression() |
Filter by expression level |
is_tf() |
Quick TF check |
plot_tf_families() |
Bar chart visualization |
plot_tf_pie() |
Pie chart visualization |
get_family_members() |
Get TFs in a family |
get_unmapped_tfs() |
Get TFs with generic assignments |
list_tf_families() |
List all family names |
tf_database_info() |
Database statistics |
TF Database: Lambert SA et al. (2018) “The Human Transcription Factors.” Cell 172(4):650-665. doi:[10.1016/j.cell.2018.01.029](https://doi.org/10.1016/j.cell.2018.01.029)
Example Data: Martinez JL et al. (2024) “Transcriptional consequences of trisomy 21 on neural induction.” Front Cell Neurosci. 18:1341141. doi:[10.3389/fncel.2024.1341141](https://doi.org/10.3389/fncel.2024.1341141)