Introduction

The tfclassify package provides a curated database of ~1,600 human transcription factors based on the Lambert et al. 2018 census (Cell 172:650-665). It validates genes against this gold-standard database and assigns TF family classifications.

Key features:

  • Validate genes as TFs against curated database
  • Assign TF families (SOX, FOX, GATA, ZNF, etc.)
  • DNA-binding domain classification
  • Filter by significance and expression thresholds
  • Publication-ready visualizations
library(tidyverse)
library(ggplot2)
devtools::load_all("tfclassify")
# TF database summary
db_info <- tf_database_info()

data.frame(
  Metric = c("Total TFs", "TF Families", "DBD Classes", "Source"),
  Value = c(db_info$total_tfs, db_info$n_families, db_info$n_classes, db_info$source)
)
# top TF families by count
data.frame(
  Family = names(db_info$top_families),
  Count = as.numeric(db_info$top_families)
)

Loading data

To demonstrate the functionalities of this workflow, we analyze a publicly available dataset from:

Martinez JL et al. Transcriptional consequences of trisomy 21 on neural induction. Front Cell Neurosci. 2024;18:1341141. doi:[10.3389/fncel.2024.1341141](https://doi.org/10.3389/fncel.2024.1341141)

This dataset examines transcriptional changes during iPSC-derived neural differentiation. We compare Day 17 vs Day 6 of differentiation in control (D21) cells to identify TFs involved in neural lineage commitment.

# Load DESeq2 results: Day 17 vs Day 6 differentiation (apeglm shrunken LFC)
res_file <- "data/deg/apeglm/unfiltered/WC24_D21_D17_vs_D21_D6.rds"
res <- readRDS(res_file)

# convert to data frame
deseq_results <- as.data.frame(res)

# Get gene symbols from count matrix (DESeq2 results use numeric row indices)
counts <- readRDS("data/counts/normalized/WC24_norm_counts.rds")
gene_symbols <- rownames(counts)

# map gene symbols by row index
deseq_results$gene_id <- gene_symbols[as.integer(rownames(deseq_results))]
rownames(deseq_results) <- NULL

# reorder columns
deseq_results <- deseq_results[, c("gene_id", setdiff(names(deseq_results), "gene_id"))]

head(deseq_results)
dim(deseq_results)
## [1] 13826     5
names(deseq_results)
## [1] "gene_id"        "baseMean"       "log2FoldChange" "lfcSE"         
## [5] "padj"
summary(deseq_results[, c("baseMean", "log2FoldChange", "padj")])
##     baseMean        log2FoldChange           padj          
##  Min.   :    19.2   Min.   :-12.03797   Min.   :0.0000000  
##  1st Qu.:   297.7   1st Qu.: -0.39686   1st Qu.:0.0000002  
##  Median :   933.5   Median :  0.01519   Median :0.0096932  
##  Mean   :  2454.9   Mean   :  0.04478   Mean   :0.1812361  
##  3rd Qu.:  2210.8   3rd Qu.:  0.50331   3rd Qu.:0.2842740  
##  Max.   :671301.4   Max.   : 11.11793   Max.   :0.9998523  
##                                         NA's   :3

Interpretation: Positive log2FC = upregulated at D17 (late differentiation); Negative log2FC = downregulated at D17 (higher at D6/early).


Classifying TFs

Basic classification

The main function classify_tfs() adds family annotations to your data:

# classify all genes
classified <- classify_tfs(deseq_results, gene_col = "gene_id")

c(total = nrow(classified),
  tfs = sum(classified$is_tf),
  pct_tfs = round(mean(classified$is_tf) * 100, 2))
##    total      tfs  pct_tfs 
## 13826.00  1026.00     7.42
# View TFs with family assignments
classified %>%
  dplyr::filter(is_tf) %>%
  dplyr::select(gene_id, log2FoldChange, padj, tf_family, tf_class) %>%
  dplyr::arrange(desc(abs(log2FoldChange))) %>%
  head(15)

Classification columns

Column Description Example
tf_family TF family name SOX, FOX, ZNF, HOX
tf_class DNA-binding domain class HMG, Forkhead, C2H2 zinc finger
tf_subfamily More specific annotation Estrogen receptor, PPAR
is_tf Boolean TF status TRUE/FALSE

Significant TFs

Filtering to significant DE TFs

Use classify_tfs() with significance thresholds to get significant TFs with family annotations:

# Significant TFs (|log2FC| >= 1, padj <= 0.05)
sig_tfs <- classify_tfs(
  deseq_results,
  gene_col = "gene_id",
  filter_to_tfs = TRUE,
  log2fc_col = "log2FoldChange",
  log2fc_threshold = 1,
  padj_col = "padj",
  padj_threshold = 0.05
)
nrow(sig_tfs)
## [1] 170

Upregulated vs downregulated

# upregulated at D17 (positive LFC) vs downregulated at D17 (negative LFC)
up_tfs <- sig_tfs %>% dplyr::filter(log2FoldChange > 0) %>% dplyr::arrange(desc(log2FoldChange))
down_tfs <- sig_tfs %>% dplyr::filter(log2FoldChange < 0) %>% dplyr::arrange(log2FoldChange)

c(up_at_D17 = nrow(up_tfs), down_at_D17 = nrow(down_tfs))
##   up_at_D17 down_at_D17 
##          90          80
# Top TFs upregulated at D17 (late differentiation)
up_tfs %>%
  dplyr::select(gene_id, log2FoldChange, padj, tf_family) %>%
  head(10)
# top TFs downregulated at D17 (higher at D6/early)
down_tfs %>%
  dplyr::select(gene_id, log2FoldChange, padj, tf_family) %>%
  head(10)

Visualization

TF family distribution

# all TFs in dataset
all_tfs <- classify_tfs(deseq_results, gene_col = "gene_id", filter_to_tfs = TRUE)

plot_tf_families(
  all_tfs,
  top_n = 20,
  title = "TF Family Distribution in Dataset",
  palette = "Set2"
)

Unmapped TFs

TFs with generic family assignments (only DBD class known):

unmapped <- get_unmapped_tfs(all_tfs)
c(total_tfs = nrow(all_tfs), unmapped = nrow(unmapped))
## total_tfs  unmapped 
##      1026        44
unmapped %>%
  dplyr::select(gene_id, tf_family, tf_class) %>%
  head(15)

Excluding ZNF family

The ZNF family dominates with 500+ members. Exclude it to see other patterns:

plot_tf_families(
  all_tfs,
  top_n = 20,
  title = "TF Family Distribution (Excluding ZNF)",
  palette = "Set2",
  exclude_families = "ZNF"
)

Significant TFs plot

plot_tf_families(
  sig_tfs,
  top_n = 15,
  title = "Significantly DE TF Families (D17 vs D6)",
  palette = "Set3",
  exclude_families = "ZNF"
)

Up vs down regulation

# count families by direction
up_counts <- up_tfs %>%
  dplyr::count(tf_family, name = "count") %>%
  dplyr::mutate(direction = "Up at D17")

down_counts <- down_tfs %>%
  dplyr::count(tf_family, name = "count") %>%
  dplyr::mutate(count = -count, direction = "Down at D17")

combined <- dplyr::bind_rows(up_counts, down_counts)

# Top families (excluding ZNF)
top_fams <- combined %>%
  dplyr::group_by(tf_family) %>%
  dplyr::summarize(total = sum(abs(count)), .groups = "drop") %>%
  dplyr::filter(!tf_family %in% c("ZNF")) %>%
  dplyr::arrange(desc(total)) %>%
  dplyr::slice_head(n = 15) %>%
  dplyr::pull(tf_family)

plot_data <- combined %>%
  dplyr::filter(tf_family %in% top_fams) %>%
  dplyr::mutate(tf_family = factor(tf_family, levels = rev(top_fams)))

ggplot(plot_data, aes(x = tf_family, y = count, fill = direction)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("Up at D17" = "#E64B35", "Down at D17" = "#4DBBD5")) +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
  labs(
    title = "TF Families by Regulation Direction",
    subtitle = "Neural differentiation: Day 17 vs Day 6",
    x = NULL, y = "Number of TFs", fill = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "gray40"),
        legend.position = "top")


Exploring TF families

Query specific families

# get all members of a family
sox_members <- get_family_members("SOX")
sox_members
##  [1] "SOX1"  "SOX10" "SOX11" "SOX12" "SOX13" "SOX14" "SOX15" "SOX17" "SOX18"
## [10] "SOX2"  "SOX21" "SOX3"  "SOX30" "SOX4"  "SOX5"  "SOX6"  "SOX7"  "SOX8" 
## [19] "SOX9"  "SRY"
# SOX TFs in our data
classified %>%
  dplyr::filter(tf_family == "SOX") %>%
  dplyr::select(gene_id, log2FoldChange, padj) %>%
  dplyr::arrange(desc(abs(log2FoldChange)))

Nuclear receptors (with subfamily)

classified %>%
  dplyr::filter(tf_family == "Nuclear Receptor") %>%
  dplyr::select(gene_id, log2FoldChange, padj, tf_subfamily) %>%
  dplyr::arrange(desc(abs(log2FoldChange))) %>%
  head(15)

Summary table

tf_summary <- sig_tfs %>%
  dplyr::group_by(tf_family, tf_class) %>%
  dplyr::summarize(
    n = n(),
    mean_log2FC = round(mean(log2FoldChange), 2),
    n_up = sum(log2FoldChange > 0),
    n_down = sum(log2FoldChange < 0),
    .groups = "drop"
  ) %>%
  dplyr::arrange(desc(n))

tf_summary %>% head(15)

Additional features

Character vector input

# classify a gene list directly
my_genes <- c("SOX2", "NANOG", "POU5F1", "GATA4", "GAPDH", "ACTB")
classify_tfs(my_genes)
##    SOX2   NANOG  POU5F1   GATA4   GAPDH    ACTB 
##   "SOX" "NANOG"   "POU"  "GATA"      NA      NA

Quick TF check

is_tf(c("SOX2", "GAPDH", "FOXF1", "ACTB"))
## [1]  TRUE FALSE  TRUE FALSE

Gene aliases

# common aliases are resolved
classify_tfs(c("OCT4", "SLUG", "P53"))
##   OCT4   SLUG    P53 
##  "POU" "SNAI"  "p53"
# Maps to: POU5F1, SNAI2, TP53

Mouse data

# use species = "mouse" for mouse gene symbols
classify_tfs(c("Sox2", "Nanog", "Pou5f1"), species = "mouse")
##    Sox2   Nanog  Pou5f1 
##   "SOX" "NANOG"   "POU"

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tfclassify_0.1.0 testthat_3.2.3   lubridate_1.9.4  forcats_1.0.0   
##  [5] stringr_1.6.0    dplyr_1.1.4      purrr_1.0.2      readr_2.1.5     
##  [9] tidyr_1.3.1      tibble_3.2.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       httr2_1.2.1        xfun_0.52          bslib_0.9.0       
##  [5] htmlwidgets_1.6.4  devtools_2.4.5     remotes_2.5.0      tzdb_0.4.0        
##  [9] vctrs_0.6.5        tools_4.4.2        generics_0.1.4     pkgconfig_2.0.3   
## [13] RColorBrewer_1.1-3 S7_0.2.0           desc_1.4.3         lifecycle_1.0.4   
## [17] compiler_4.4.2     farver_2.1.2       brio_1.1.5         httpuv_1.6.16     
## [21] htmltools_0.5.8.1  usethis_3.1.0      sass_0.4.10        yaml_2.3.10       
## [25] later_1.4.2        pillar_1.11.1      jquerylib_0.1.4    urlchecker_1.0.1  
## [29] ellipsis_0.3.2     cachem_1.1.0       sessioninfo_1.2.3  mime_0.13         
## [33] tidyselect_1.2.1   digest_0.6.37      stringi_1.8.4      labeling_0.4.3    
## [37] rprojroot_2.1.0    fastmap_1.2.0      grid_4.4.2         cli_3.6.3         
## [41] magrittr_2.0.3     dichromat_2.0-0.1  pkgbuild_1.4.8     withr_3.0.2       
## [45] rappdirs_0.3.3     scales_1.4.0       promises_1.3.3     timechange_0.3.0  
## [49] rmarkdown_2.29     hms_1.1.3          memoise_2.0.1      shiny_1.11.1      
## [53] evaluate_1.0.4     knitr_1.50         viridisLite_0.4.2  miniUI_0.1.2      
## [57] profvis_0.4.0      rlang_1.1.4        Rcpp_1.0.13-1      xtable_1.8-4      
## [61] glue_1.8.0         pkgload_1.4.0      rstudioapi_0.17.1  jsonlite_1.8.9    
## [65] R6_2.6.1           fs_1.6.5

Function reference

Function Description
classify_tfs() Classify genes with TF family annotations
filter_tfs() Filter to TFs only (no annotations)
filter_expression() Filter by expression level
is_tf() Quick TF check
plot_tf_families() Bar chart visualization
plot_tf_pie() Pie chart visualization
get_family_members() Get TFs in a family
get_unmapped_tfs() Get TFs with generic assignments
list_tf_families() List all family names
tf_database_info() Database statistics

References

TF Database: Lambert SA et al. (2018) “The Human Transcription Factors.” Cell 172(4):650-665. doi:[10.1016/j.cell.2018.01.029](https://doi.org/10.1016/j.cell.2018.01.029)

Example Data: Martinez JL et al. (2024) “Transcriptional consequences of trisomy 21 on neural induction.” Front Cell Neurosci. 18:1341141. doi:[10.3389/fncel.2024.1341141](https://doi.org/10.3389/fncel.2024.1341141)