Functional Enrichment Analysis

This protocol is a walk-through of three popular functional enrichment analysis tools: Enrichr, WebGestalt, and Interactive Enrichment Analysis.

Example Data - TCGA lung cancer study

This tour can be used with your own data, or with an example dataset. Three example datasets are provided in the following slides.

The first dataset is from the TCGA lung cancer study and compares transcript expression in lung cancer biopsies versus normal tissue. The data were processed as follows to produce the example files:

  1. Calculate the log2 fold change and adjusted p-value for cancer vs. normal per gene from group average expression values.
  2. Filter for significantly up-regulated genes: log2 fold change >= 2 AND adjusted p-value <= 0.05; list of 367 up-regulated genes (NCBI gene symbol). Download data.
  3. Filter for significantly down-regulated genes: log2 fold change <= -2 AND adjusted p-value <= 0.05; list of 516 down-regulated genes (NCBI gene symbol). Download data.
  4. Calculate a gene ranking for all genes using sign(log2 fold change) * -log10(p-value); file with Ensembl gene ID and rank value for all genes. Download data.
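
The processing steps above can be sketched in Python. This is a minimal illustration, not the pipeline used to produce the example files: the pseudocount, the p-value floor, the record field names (`symbol`, `log2fc`, `padj`) and the toy values are all assumptions made for the example.

```python
import math


def log2_fold_change(mean_cancer, mean_normal, pseudocount=1.0):
    """Log2 ratio of group averages; the pseudocount (an assumption) avoids log(0)."""
    return math.log2((mean_cancer + pseudocount) / (mean_normal + pseudocount))


def rank_score(log2fc, pvalue, floor=1e-300):
    """sign(log2 fold change) * -log10(p-value), as in step 4."""
    sign = (log2fc > 0) - (log2fc < 0)
    # Flooring the p-value (an assumption) keeps log10 defined when p == 0.
    return sign * -math.log10(max(pvalue, floor))


def split_significant(genes, lfc_cutoff=2.0, padj_cutoff=0.05):
    """Return (up, down) gene symbol lists using the cutoffs from steps 2 and 3."""
    up = [g["symbol"] for g in genes
          if g["log2fc"] >= lfc_cutoff and g["padj"] <= padj_cutoff]
    down = [g["symbol"] for g in genes
            if g["log2fc"] <= -lfc_cutoff and g["padj"] <= padj_cutoff]
    return up, down


# Toy records with made-up values, for illustration only.
genes = [
    {"symbol": "GENE_A", "log2fc": 3.1, "padj": 0.001},
    {"symbol": "GENE_B", "log2fc": -2.4, "padj": 0.02},
    {"symbol": "GENE_C", "log2fc": 2.5, "padj": 0.2},
]
up, down = split_significant(genes)
print(up, down)
```

GENE_C passes the fold-change cutoff but not the adjusted p-value cutoff, so it appears in neither list.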

These input files can be used for ORA and GSEA analysis in Enrichr and WebGestalt.

Example Data - Pinto et al. SARS-CoV-2 study

The Pinto et al. study is a multi-omics study of SARS-CoV-2 host responses in lung epithelial cells. The data files below were adapted from the supplementary data files provided with the publication, which were already filtered.

  1. Differentially regulated transcripts identified in cells infected with SARS-CoV-2; list of 2656 genes (NCBI gene symbol). Download data.
  2. Differentially expressed proteins identified in cells infected with SARS-CoV-2; list of 225 proteins. Download data with UniProt identifiers / Download data with NCBI gene symbols.

These input files can be used for ORA analysis in Enrichr and WebGestalt.

Example Data - Voineagu et al. autism study

The Voineagu et al. study compares the transcriptomes of autistic and normal brain. The data were downloaded from Expression Atlas and cover all measured genes, with the following comma-separated columns:

  • NCBI gene symbol
  • log2 fold change
  • p value

This data is used for ORA and GSEA analysis in the Interactive Enrichment Analysis tool. It is pre-loaded in the tool.

Enrichr

The Enrichr tool offers an easy-to-use interface for basic Over-Representation Analysis (ORA) for a large number of gene set libraries.

Before starting analysis, you can browse the available gene sets under Libraries in the top menu.

Enrichr: Gene Set Libraries

Enrichr includes over 200 gene set libraries, covering Gene Ontology terms, pathways, disease-associated sets, cell type markers, and more.

You can also search for gene sets either by the term name under Term search or by gene under Gene search.

Enrichr: Start Analysis

Analysis is started by simply copying the list of genes from an input text file into the input box on the right of the Analyze tab and clicking Submit.
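
When the gene list comes from a file, it can help to tidy it before pasting. The helper below is a small sketch; the function name and the choice to drop blank lines and duplicates are assumptions for this example, not requirements stated by Enrichr.

```python
def clean_gene_list(text):
    """Return one gene symbol per line: trimmed, blanks dropped, duplicates removed."""
    seen = set()
    cleaned = []
    for line in text.splitlines():
        symbol = line.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            cleaned.append(symbol)
    return "\n".join(cleaned)


# Example input with stray whitespace, a blank line and a duplicate.
raw = "TP53\n  EGFR \n\nTP53\nKRAS\n"
print(clean_gene_list(raw))
```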

The list used here corresponds to the TCGA example data file containing up-regulated genes.

Enrichr: Results Overview

The results are displayed as a grid of libraries for each functional category (selected at the top). The image below shows the results in the Pathways category.

Clicking on one of the squares will show the detailed results for that library:

Enrichr: Gene Set Results

In the detailed results for a specific library, the gene sets are sorted by p-value by default. Clicking on any of the bars re-sorts the bar graph by a different score.

Enrichr: Save Results

To download an image of the bar chart, click any of the SVG, PNG, or JPG buttons at the top right of the chart.

Enrichr: Table View

In addition to the default bar chart, results are also available as a table under Table. The table includes the four enrichment scores calculated by Enrichr: P-value, Adjusted p-value, Odds Ratio and Combined score. More information about these scores can be found in the Enrichr documentation.
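
To make the first three columns more concrete, the sketch below computes an over-representation p-value, an odds ratio, and Benjamini-Hochberg adjusted p-values with the Python standard library. It uses the textbook one-sided hypergeometric test and a plain 2x2 odds ratio as stand-ins; the exact procedures behind Enrichr's columns, including the Combined score, are described in the Enrichr documentation and may differ. All function names and numbers here are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, set_size, background):
    """P(X >= overlap) when list_size genes are drawn from the background."""
    upper = min(list_size, set_size)
    favourable = sum(
        comb(set_size, i) * comb(background - set_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, list_size)


def odds_ratio(overlap, list_size, set_size, background):
    """Odds ratio of the 2x2 table (in list / not in list) x (in set / not in set)."""
    in_list_not_set = list_size - overlap
    in_set_not_list = set_size - overlap
    in_neither = background - set_size - list_size + overlap
    denominator = in_list_not_set * in_set_not_list
    if denominator == 0:
        return float("inf")
    return overlap * in_neither / denominator


def benjamini_hochberg(pvalues):
    """Adjust p-values for multiple testing, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers: 5 of 10 input genes fall in a 20-gene set, background of 100 genes.
p = hypergeom_pvalue(5, 10, 20, 100)
print(p, odds_ratio(5, 10, 20, 100))
```

In practice one p-value is computed per gene set in a library, and the adjustment is applied across all of those p-values together.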

At the bottom of the table there is a link to Export entries to table which will download a .txt file with the full results table.

WebGestalt

WebGestalt offers more advanced analysis options for a smaller number of gene set libraries. Using WebGestalt, it is possible to run Over-Representation Analysis (ORA), Gene Set Enrichment Analysis (GSEA) or Network Topology-based Analysis (NTA).

WebGestalt: Start Analysis

To start analysis, the Basic parameters section should be completed with the correct Organism of interest, Method of Interest (ORA, GSEA or NTA) and Functional Database. Choose a functional database category, for example pathway, and then specify which data source to use, for example Wikipathway.

WebGestalt: Start Analysis

The input data is defined in the List section, and the dataset or list of genes can either be pasted into the input field or uploaded as a text file. Specify the identifier type under ID type for uploaded reference list. The Analyte Type should match the identifier type in the input data. The list used here corresponds to the TCGA example input file containing the full dataset with associated gene rank.
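
For a GSEA run, the input pairs each gene identifier with its rank value. The sketch below formats such pairs as two tab-separated columns, sorted from highest to lowest score. The tab-delimited layout and the `.rnk` file extension are assumptions about what WebGestalt expects for uploads; check the WebGestalt manual before relying on them. The identifiers, scores and file name are made up.

```python
def format_rank_lines(scores):
    """Return 'gene<TAB>score' lines, highest score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ordered)


# Placeholder identifiers and scores, for illustration only.
scores = {"ENSG_EXAMPLE_1": 4.2, "ENSG_EXAMPLE_2": -3.7, "ENSG_EXAMPLE_3": 0.5}
text = format_rank_lines(scores)

# To save the text for upload (hypothetical file name):
# with open("tcga_ranked.rnk", "w") as handle:
#     handle.write(text)
print(text, end="")
```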

Clicking Submit at the bottom left starts the analysis.

WebGestalt: Results Summary

The top of the results page includes a job summary and a link to download the full results.

WebGestalt: Enrichment Results

Results are displayed as a bar chart by default. Right-clicking on the bar chart lets you download in either PNG or SVG format.

Clicking on the bars updates the pathway-specific display at the bottom of the page.

WebGestalt: Enrichment Results

The results specific to a gene set (pathway in this case) include the scoring statistics calculated for the enrichment, including the enrichment score, a sortable table and the enrichment plot.

The enrichment plot is described in detail here. Briefly, the plot in the upper half shows the running enrichment score (ES) as the analysis walks down the ranked list of genes (bottom section), starting at the most highly ranked gene. When a gene in the ranked list belongs to the pathway, the score goes up; when it does not, the score goes down. The position of each pathway gene in the ranked list is marked by a line in the middle section. The maximum deviation of the running score from zero is the enrichment score reported for the gene set.
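
The walk down the ranked list can be sketched in a few lines of Python. This is the simple unweighted variant, in which every hit and every miss contributes equally; GSEA implementations commonly weight hits by the rank metric, so the values WebGestalt reports may differ. The function name and toy genes are hypothetical.

```python
def running_enrichment_score(ranked_genes, gene_set):
    """Return (running scores, ES) for an unweighted walk down the ranked list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(hits) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("need at least one gene inside and one outside the set")

    running = []
    score = 0.0
    for is_hit in hits:
        # Step up for a gene in the set, step down otherwise.
        score += 1.0 / n_hits if is_hit else -1.0 / n_misses
        running.append(score)

    # The enrichment score is the largest deviation from zero, sign preserved.
    es = max(running, key=abs)
    return running, es


# Toy ranked list: set members at the top give a positive ES.
running, es = running_enrichment_score(["A", "B", "C", "D"], {"A", "B"})
print(running, es)
```

A gene set whose members cluster at the bottom of the ranked list produces a negative score instead.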

WebGestalt: Enrichment Results

For WikiPathways results, the link in the upper left is clickable and will open a pathway view with the overlapping/leading edge genes highlighted.

WebGestalt: Enrichment Results

In addition to the bar chart, the main results overview can also be visualized as a table or volcano plot by clicking the Table and Volcano buttons at the top left.

The volcano plot has options for customized downloads available, as well as pan/zoom controls.

Interactive Enrichment Analysis

Interactive Enrichment Analysis is a user-friendly interactive tool for performing enrichment analysis on multiple datasets across multiple public databases. This tool can run both Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA). Follow these steps to get started:

  • Install the latest versions of R and RStudio. Follow the steps provided on that page and then return here.
  • Open RStudio.
  • Set your working directory.
  • Copy/paste these lines into the Console and hit return:
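    # Install devtools and httr if devtools is missing, then load both
    if(!require(devtools)) install.packages(c("devtools","httr")); library(devtools); library(httr)
    # Open the Shiny app in an external browser window (RStudio only)
    options(shiny.launch.browser = .rs.invokeShinyWindowExternal)
    # Download the project and launch the app
    source_url("https://raw.github.com/gladstone-institutes/Interactive-Enrichment-Analysis/main/launch_app.R")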

This will install some basic dependencies, download the project to your working directory (Files tab in RStudio) and launch the tool in your browser.

Interactive Enrichment Analysis

The tool will open in your browser:

Interactive Enrichment Analysis: Database

A set of public database collections is provided in the drop-down in the Database section. These are updated monthly and date stamped. It is also possible to construct your own database collection under BUILD NEW DATABASE by providing one or more GMT files.
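
A GMT file stores one gene set per line as tab-separated fields: the set name, a description, and then the member genes. The sketch below writes and reads that layout. The set names and genes are made up, and any additional requirements the tool places on GMT files are not covered here.

```python
def format_gmt(gene_sets, description="NA"):
    """Serialize {set name: [genes]} to GMT text, one set per line."""
    lines = ["\t".join([name, description, *genes])
             for name, genes in gene_sets.items()]
    return "\n".join(lines) + "\n"


def parse_gmt(text):
    """Parse GMT text into {set name: [genes]}, skipping the description field."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        gene_sets[fields[0]] = [gene for gene in fields[2:] if gene]
    return gene_sets


# Hypothetical custom collection with two small sets.
custom = {"MY_SET_1": ["TP53", "EGFR", "KRAS"], "MY_SET_2": ["BRCA1", "BRCA2"]}
gmt_text = format_gmt(custom)
print(gmt_text, end="")
```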

Once a database collection is selected, the specific databases contained in the collection will be displayed.

Interactive Enrichment Analysis: Datasets

Input datasets are selected in the Datasets section; one or more datasets can be selected if they are all of the same format. The supported formats for input datasets are:

  • a simple list of genes (for ORA only)
  • a list of genes with a rank column (for GSEA only)
  • a list of genes with a p.value column (for GSEA and ORA)
  • a list of genes with p.value and fold.change columns (for GSEA and ORA)
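
The mapping from columns to supported methods in the list above can be expressed as a small lookup. This sketch is not the tool's own detection code: the function name is hypothetical, and the precedence given to a rank column when several columns are present is an assumption.

```python
def supported_methods(columns):
    """Return the analysis methods implied by a dataset's column names."""
    names = {column.strip().lower() for column in columns}
    if "rank" in names:
        return ["GSEA"]          # ranked list: GSEA only
    if "p.value" in names:
        return ["GSEA", "ORA"]   # p.value, with or without fold.change
    return ["ORA"]               # plain gene list: ORA only


print(supported_methods(["gene", "p.value", "fold.change"]))
```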


Interactive Enrichment Analysis: Dataset Options

The tool will display a preview of the first few rows of the first chosen dataset along with the required and optional columns that were detected.

Gene identifier and organism should match the data. The analysis parameters can also be adjusted here.

Interactive Enrichment Analysis: Start Analysis

When Databases and Datasets are loaded and verified, the side panel will be updated with a Run Analyses option.

Once analysis is started, the initial set of panels will collapse and a Progress panel will open along with a progress bar in the lower-right corner. The Progress panel will report the individual analysis steps, any errors, as well as start and finish times for the run, and the location of the output folder.

Interactive Enrichment Analysis: View Results

Analysis results can be viewed by clicking View Results. Results are viewed in another app that allows you to interactively explore results and download individual files.

In the results app, the Data tab displays the processed input data as a searchable table and volcano plot. A checkbox at the top allows you to also view the genes that were excluded from the analysis, e.g., due to failed identifier mapping. Both the table and plot are downloadable.

Interactive Enrichment Analysis: Data plots

The input data is visualized in a volcano plot, which plots genes by statistical significance (p.value) versus magnitude of change (fold.change). Genes are highlighted based on the p.value and fold.change cutoffs selected during setup. Selected genes are labeled; genes can be selected either as the top n genes or by name.
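
The coordinates and highlighting of a single gene in such a plot can be sketched as follows. The cutoff defaults, the assumption that fold.change is on a log2 scale, and the status labels are choices made for this example rather than the tool's settings.

```python
import math


def volcano_point(fold_change, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, status) for one gene in a volcano plot."""
    x = fold_change                 # assumed to be a log2 fold change
    y = -math.log10(p_value)        # larger y means more significant
    if p_value <= p_cutoff and fold_change >= fc_cutoff:
        status = "up"
    elif p_value <= p_cutoff and fold_change <= -fc_cutoff:
        status = "down"
    else:
        status = "not highlighted"
    return x, y, status


print(volcano_point(2.0, 0.001))
print(volcano_point(-0.3, 0.001))
```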

A bar plot of your input data is also available (via a drop-down), highlighting positive and negative fold-change values for a subset of genes, either top n genes or genes selected by name.

Interactive Enrichment Analysis: Analysis Results

The Results tab displays a table of enriched terms/pathways, a plot of top results, and a plot of a selected result from the table. The table is paginated and supports selection, search and filtering. The plots for top results include dot, heatmap, emap, concept networks, etc.

The table and plots can all be downloaded.

Interactive Enrichment Analysis: Analysis Results

Using the left side panel, one can switch between GSEA and ORA while viewing the results for a particular database, to explore hits that are shared between or unique to the two methods. Here we are comparing the ORA and GSEA results for the WikiPathways database. Note the differences in the table and dot plot.

Interactive Enrichment Analysis: Dot plot

A dot plot of results highlights multiple statistics for the top enriched terms/pathways. By default, the size of each dot corresponds to the percentage of genes in the term/pathway that overlap with the dataset genes, and the color represents the adjusted p-value. These parameters and other plot view options are customizable in a set of controls under the plot.
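
The default dot size described above is a simple ratio, sketched below together with the ordering of top results. The function name, dictionary keys and toy data are hypothetical, and the tool's own plotting code may compute or label these values differently.

```python
def dot_plot_rows(dataset_genes, term_genes, adjusted_p, top_n=10):
    """Return the top terms with the statistics a dot plot would encode."""
    dataset = set(dataset_genes)
    rows = []
    for term, genes in term_genes.items():
        members = set(genes)
        overlap = dataset & members
        rows.append({
            "term": term,
            # Dot size: share of the term's genes found in the dataset.
            "gene_ratio": len(overlap) / len(members),
            # Dot color: adjusted p-value.
            "p_adjust": adjusted_p[term],
        })
    rows.sort(key=lambda row: row["p_adjust"])
    return rows[:top_n]


# Toy terms and values, for illustration only.
terms = {"Pathway X": ["A", "B", "C", "D"], "Pathway Y": ["C", "E"]}
rows = dot_plot_rows(["A", "B", "C"], terms, {"Pathway X": 0.01, "Pathway Y": 0.2})
print(rows)
```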

Interactive Enrichment Analysis: Overlapping genes

Several plot types are available for visualizing the gene overlap between results.

Heatmap plots the top results (terms/pathways) versus top genes to highlight overlapping genes among the most enriched results. As with all plots, the view options are customizable.

Enrichment map (emap) connects results (terms/pathways) with overlapping genes in a network view. The interactions (edges) in the network represent gene overlap between terms/pathways (nodes), and nodes are sized based on the number of genes in them.
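
The structure of such a network can be sketched as follows: one node per term, sized by gene count, and one edge per pair of terms that share genes. The function names, the `min_shared` threshold and the toy terms are assumptions for this example; the tool may use a different similarity measure or cutoff.

```python
from itertools import combinations


def emap_nodes(term_genes):
    """Node size is the number of genes in each term."""
    return {term: len(set(genes)) for term, genes in term_genes.items()}


def emap_edges(term_genes, min_shared=1):
    """Return (term_a, term_b, n_shared) for term pairs that share genes."""
    edges = []
    for term_a, term_b in combinations(sorted(term_genes), 2):
        shared = set(term_genes[term_a]) & set(term_genes[term_b])
        if len(shared) >= min_shared:
            edges.append((term_a, term_b, len(shared)))
    return edges


# Toy terms: T1 and T2 share two genes, T3 stands alone.
terms = {"T1": ["A", "B", "C"], "T2": ["B", "C", "D"], "T3": ["E"]}
print(emap_nodes(terms))
print(emap_edges(terms))
```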

Concept network (cnet) connects results via their overlapping genes. Results (terms/pathways) are represented by the larger nodes, and each gene in that term/pathway is connected via interactions to it. Interactions are color-coded by term/pathway. In this way, it is easy to see which genes are relevant to more than one term/pathway.
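
A concept network is a two-level graph of terms and genes. The sketch below builds the term-to-gene edges and lists the genes linked to more than one term, which are the ones that stand out in the plot. Function names and toy data are hypothetical.

```python
from collections import defaultdict


def cnet_edges(term_genes):
    """Return (term, gene) edges, one per gene in each term."""
    return [(term, gene)
            for term in sorted(term_genes)
            for gene in sorted(set(term_genes[term]))]


def genes_in_multiple_terms(term_genes):
    """Map each gene that appears in more than one term to its terms."""
    terms_by_gene = defaultdict(list)
    for term, gene in cnet_edges(term_genes):
        terms_by_gene[gene].append(term)
    return {gene: terms for gene, terms in terms_by_gene.items() if len(terms) > 1}


# Toy terms: genes B and C belong to both T1 and T2.
terms = {"T1": ["A", "B", "C"], "T2": ["B", "C", "D"]}
print(cnet_edges(terms))
print(genes_in_multiple_terms(terms))
```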

Interactive Enrichment Analysis: GSEA plot

For each result in the results table, result-specific plots are available depending on the database the result is from and the analysis method used. For example, for GSEA, the GSEA Enrichment Score plot is available. As described earlier, the plot in the upper half shows the running enrichment score (ES) as the analysis walks down the ranked list of genes. The middle section shows where the members of the gene set appear in the ranked list. The maximum deviation of the running score from zero is the enrichment score reported for the gene set.

Interactive Enrichment Analysis: Pathway plot

Some database-specific visualizations are also available, for example WikiPathways visualizations of the data on models from the WikiPathways database. Genes that met the criteria for fold change and p value cutoff (for ORA), or leading edge genes (for GSEA) are highlighted in the pathway (orange for up, blue for down). The WikiPathways button will take you to the pathway model at WikiPathways.

Opening the pathway in a new window gives you access to a larger, interactive format of the pathway.

Functional Enrichment Analysis: Downstream analysis

Once you have completed exploring your data and results in either of these tools, there are several options for continued analysis and exploration: