GeneFEAST

User Guide

Installation

Option 1: Don’t install! Instead, download the GeneFEAST ready-to-use Docker container!

To download the latest container from the repository:

docker pull ghcr.io/avigailtaylor/genefeast:latest


This Docker image is designed for “standard Docker installations” on hosts with AMD64 and ARM64 CPUs.


Option 2: Locally install the package and its dependencies

  1. Install Python 3.12
  2. Install Graphviz
  3. Create and activate a virtual environment

     IMPORTANT
       • We strongly recommend installing GeneFEAST in a virtual environment because of its dependencies and requirements.
       • Make sure to create the virtual environment using Python 3.12 explicitly, rather than your computer's default version.
  4. Install the most recent version of setuptools
  5. Install GeneFEAST



Setup

To run GeneFEAST, you will need:

Functional enrichment analysis (FEA) results file(s)

Type | ID | Description | GeneID

  • Type: Term type/originating database
  • ID: Term ID in database
  • Description: Term description
  • GeneID: "/"-separated list of gene IDs corresponding to GoIs annotated by the term
    • Note that in a GSEA-type FEA this is known as the term's leading edge subset

Example
Type | ID | Description | GeneID
"GO" | "GO:0071774" | "response to fibroblast growth factor" | "CCN2/THBS1/EGR3/FGF2/SPRY4/NDST1/CCL2/IER2/FLRT3/PRKD2/CXCL8/SPRY2/FRS2/FGFR1/SPRY1/RUNX2/HYAL1/KDM5B/NOG/ZFP36L1/COL1A1/CASR/FGFR3/FGF1/EXT1/FGFBP1/GATA3/NR4A1"
"GO" | "GO:0002294" | "CD4-positive alpha-beta T cell differentiation involved in immune response" | "RARA/BCL6/SMAD7/SOCS3/PTGER4/JUNB/ZC3H12A/FOXP1/ENTPD7/NFKBIZ/NLRP3/RC3H1/RORC/RIPK2/ANXA1/RELB/MYB/IL6/LGALS9/GATA3"
"GO" | "GO:2000514" | "regulation of CD4-positive alpha-beta T cell activation" | "RARA/BCL6/SMAD7/JUNB/RUNX1/ZC3H12A/NFKBIZ/NLRP3/RC3H1/CD274/CBLB/RIPK2/ANXA1/AGER/RUNX3/SOCS1/VSIR/PRKCQ/LGALS9/GATA3"

The corresponding table in CSV format:

Type,ID,Description,GeneID
"GO","GO:0071774","response to fibroblast growth factor","CCN2/THBS1/EGR3/FGF2/SPRY4/NDST1/CCL2/IER2/FLRT3/PRKD2/CXCL8/SPRY2/FRS2/FGFR1/SPRY1/RUNX2/HYAL1/KDM5B/NOG/ZFP36L1/COL1A1/CASR/FGFR3/FGF1/EXT1/FGFBP1/GATA3/NR4A1"
"GO","GO:0002294","CD4-positive alpha-beta T cell differentiation involved in immune response","RARA/BCL6/SMAD7/SOCS3/PTGER4/JUNB/ZC3H12A/FOXP1/ENTPD7/NFKBIZ/NLRP3/RC3H1/RORC/RIPK2/ANXA1/RELB/MYB/IL6/LGALS9/GATA3"
"GO","GO:2000514","regulation of CD4-positive alpha-beta T cell activation","RARA/BCL6/SMAD7/JUNB/RUNX1/ZC3H12A/NFKBIZ/NLRP3/RC3H1/CD274/CBLB/RIPK2/ANXA1/AGER/RUNX3/SOCS1/VSIR/PRKCQ/LGALS9/GATA3"

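As a quick sanity check before running GeneFEAST, a file in this basic format can be read with Python's standard csv module. The sketch below is illustrative only and is not part of GeneFEAST; the function name read_fea_rows is made up for this example, and the gene list is shortened from the example above.

```python
import csv
import io


def read_fea_rows(handle):
    """Yield (type, term_id, description, gene_ids) from a basic FEA results CSV.

    Hypothetical helper for illustration; not a GeneFEAST function.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None:  # empty file: nothing to yield
        return
    # Tolerate stray whitespace around the column names in the header row.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    for row in reader:
        # The GeneID column is a "/"-separated list of gene IDs.
        genes = [g for g in row["GeneID"].split("/") if g]
        yield row["Type"], row["ID"], row["Description"], genes


example = io.StringIO(
    "Type,ID,Description,GeneID\n"
    '"GO","GO:0071774","response to fibroblast growth factor","CCN2/THBS1/EGR3"\n'
)
rows = list(read_fea_rows(example))
```

Checking that every row yields a non-empty gene list is a cheap way to catch a wrong separator or a misnamed column.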

Genes of interest (GoI) file(s)

IMPORTANT
  • GoI must be listed using IDs that match those used in the FEA results file.
  • If you do not have quantitative data, you can just provide a dummy column with the same numerical value entered for each gene.

Example
GeneID | log2FC
PDGFB | 2.845276684
GTPBP4 | 1.396754262
C12orf49 | 1.469143469
SLC2A1 | 1.618759309
CCN2 | 2.593769464
CXCR4 | 2.528192609
NCOA5 | 2.137989231
CDKN1A | 3.154969844
RARA | 1.444539048

The corresponding table in CSV format:

GeneID,log2FC
PDGFB,2.845276684
GTPBP4,1.396754262
C12orf49,1.469143469
SLC2A1,1.618759309
CCN2,2.593769464
CXCR4,2.528192609
NCOA5,2.137989231
CDKN1A,3.154969844
RARA,1.444539048
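
If you have no quantitative data, the dummy column mentioned above can be generated with a few lines of Python. This is a minimal sketch, not part of GeneFEAST: the helper name write_dummy_goi, the dummy value 1.0, and the reuse of the "log2FC" header are all assumptions made for illustration.

```python
import csv
import io


def write_dummy_goi(gene_ids, handle, dummy_value=1.0, value_header="log2FC"):
    """Write a GoI CSV in which every gene gets the same placeholder value.

    Hypothetical helper; dummy_value and value_header are illustrative choices.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["GeneID", value_header])
    for gene_id in gene_ids:
        writer.writerow([gene_id, dummy_value])


buffer = io.StringIO()
write_dummy_goi(["PDGFB", "GTPBP4", "CCN2"], buffer)
goi_csv = buffer.getvalue()
```

Whatever dummy value you choose should fall inside the range set by the HEATMAP_MIN and HEATMAP_MAX parameters described later in this guide.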


IMPORTANT README FOR ORA-type FEAs
  • GoI files can contain just significantly differentially expressed genes, or they can contain results for all genes tested in a given experiment.
  • Either way, GeneFEAST will only store quantitative data for GoI that are also present in the FEA results file for the corresponding experiment.

IMPORTANT README FOR GSEA-type FEAs
  • Since, by definition, GSEA-type FEAs do not select a subset of genes for analysis, the GoI file in this scenario is just the results for all genes tested in a given experiment.
  • Importantly, GeneFEAST will only store quantitative data for genes present in at least one leading edge subset in the GSEA-type FEA results file, and it is this set of genes that is treated as the de facto GoI for the rest of the GeneFEAST analysis.

IMPORTANT NOTE ABOUT BEST PRACTICE FOR GENEFEAST ANALYSIS OF GSEA-TYPE FEAs
  • Standard GSEAs look for enrichment of gene sets amongst either the most over-expressed or the most under-expressed genes in a list of genes ranked by expression.
  • Thus, for any significantly enriched term, either a positive enrichment score (corresponding to enrichment amongst over-expressed genes) or a negative enrichment score (corresponding to enrichment amongst under-expressed genes) is reported, but never both.
  • Consequently, in the GeneFEAST setting, you should split your GSEA results into two FEA results files: one for positive scoring terms and one for negative scoring terms, then treat these as separate FEA results files for GeneFEAST summarisation purposes.
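
The split described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the enrichment score column is assumed to be called "NES", which is not specified in this guide, and the function name and example rows are made up.

```python
import csv
import io


def split_by_score_sign(rows, score_column="NES"):
    """Split GSEA result rows into positive- and negative-scoring terms.

    score_column is an assumed column name; adjust it to match your GSEA output.
    Rows with a score of exactly zero are placed in neither list.
    """
    positive, negative = [], []
    for row in rows:
        score = float(row[score_column])
        if score > 0:
            positive.append(row)
        elif score < 0:
            negative.append(row)
    return positive, negative


# Made-up rows, for illustration only.
example = io.StringIO(
    "Type,ID,Description,GeneID,NES\n"
    "DEMO,DEMO:1,term A,GENE1/GENE2,1.8\n"
    "DEMO,DEMO:2,term B,GENE3/GENE4,-2.1\n"
)
up_terms, down_terms = split_by_score_sign(csv.DictReader(example))
```

Each list can then be written to its own FEA results file and given its own entry in the setup YAML file.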


A YAML setup file

You will use this setup file to give GeneFEAST the ID(s) of the FEA(s) to summarise, the location(s) of the FEA results file(s), and the location(s) of the GoI file(s).

To summarise a single FEA:

FEAs:
    - id: "FEA_1"
      goi_file: "goi_file_for_FEA_1"
      fea_file: "FEA_1_results_file"

To summarise multiple FEAs (e.g. three FEAs):

FEAs:
    - id: "FEA_1"
      goi_file: "goi_file_for_FEA_1"
      fea_file: "FEA_1_results_file"

    - id: "FEA_2"
      goi_file: "goi_file_for_FEA_2"
      fea_file: "FEA_2_results_file"

    - id: "FEA_3"
      goi_file: "goi_file_for_FEA_3"
      fea_file: "FEA_3_results_file"

You can create a YAML setup file using this template.
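
When summarising many FEAs, it can be convenient to generate the FEAs section programmatically. The sketch below builds the same layout as the examples above using plain string formatting; the function name render_setup_yaml is hypothetical, and the code assumes IDs and file names contain no double quotes, since it does no escaping.

```python
def render_setup_yaml(feas):
    """Render the FEAs section of a setup YAML file.

    feas is an iterable of (fea_id, goi_file, fea_file) tuples.
    Hypothetical helper; values are inserted as-is, without escaping.
    """
    lines = ["FEAs:"]
    for fea_id, goi_file, fea_file in feas:
        lines.append(f'    - id: "{fea_id}"')
        lines.append(f'      goi_file: "{goi_file}"')
        lines.append(f'      fea_file: "{fea_file}"')
        lines.append("")  # blank line between records
    return "\n".join(lines).rstrip("\n") + "\n"


setup_text = render_setup_yaml(
    [("FEA_1", "goi_file_for_FEA_1", "FEA_1_results_file")]
)
```

The returned text can be written to a file in your project directory and extended by hand with any of the optional settings described below.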


In addition, you can also provide GeneFEAST with:

An FEA results file output by one of the 'enrich' family of functions available from the clusterProfiler R package. This replaces the basic FEA results file described above and is useful if you want GeneFEAST to create dot plots of your FEA results.

Type | ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | GeneID | count

IMPORTANT
  • Columns ID through to count are output by the enrich functions.
  • However, you will need to add the "Type" column manually, e.g., using Excel or Vim.

When using 'enrich'-formatted FEA results, you will need to add the following line to your YAML setup file:

ENRICH: True

You can then also add this line if you want GeneFEAST to output dot plots of your FEA results:

DOTPLOTS: True

You can create a setup YAML file with these additional lines of code using this template.


Search terms to look up alongside your GoI.

As part of the report generation process, GeneFEAST conducts a literature search for each GoI via the National Center for Biotechnology Information's Gene and PubMed services (Sayers, et al., 2021).


This literature search can incorporate additional search terms, which you can specify in your YAML setup file using the following code:



SEARCH_WORDS:
- search term 1
- search term 2
- etc.

You can create a setup YAML file with these additional lines of code using this template.


Extra annotations for genes.

Occasionally, you may wish to keep track of an a priori set of genes relevant to your study, for example those that are members of a particular biological signature, throughout the GeneFEAST report.


Each extra annotation will be displayed as an additional row at the top of the term-GoI heatmap panel in the split heatmap created for each community of terms (similarly for each meta community of communities).


To do this, first make an extra annotation (EA) file. The EA file is a headerless CSV file, with one EA per row, and two columns:

  1. Extra annotation name.
  2. "/"-separated list of gene IDs to be labelled with the EA.

Example EA file:

RNA_DRG_IFN,STAT1/IFI16/SP110/MX1/IFIT5/PARP12/EIF2AK2/IFI44/PARP14/TRIM21/DDX60L/IFI127/ADAR/HERC6/IFI35/ISG20/LGALS9/UBE2L6/DHX58/STAT2/OAS3/ISG15/IRF7/IFI6/IFI44L/IFITM1/OAS1/D$
Proteome_DRG_IFN,IFIT2/IFIT1/IFIT3/OAS2/MX2/OASL/IFIH1/ISG15/MX1/SP110/IFI44/CMPK2/IFI44L/OAS1/DDX58/STAT1/IFIT5/DDX60/PARP12/IFI16/DDX60L/OAS3/EIF2AK2/ISG20/ADAR/IFI35/STAT2/LGAL$
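
An EA file in this layout can also be written from Python. The following is a minimal sketch, not part of GeneFEAST; the helper name write_ea_file is made up, and the gene list is a shortened version of the first example row.

```python
import csv
import io


def write_ea_file(annotations, handle):
    """Write a headerless EA CSV: annotation name, then "/"-joined gene IDs.

    annotations maps each extra annotation name to a list of gene IDs.
    Hypothetical helper for illustration.
    """
    writer = csv.writer(handle, lineterminator="\n")
    for name, gene_ids in annotations.items():
        writer.writerow([name, "/".join(gene_ids)])


buffer = io.StringIO()
write_ea_file({"RNA_DRG_IFN": ["STAT1", "IFI16", "SP110"]}, buffer)
ea_csv = buffer.getvalue()
```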

Then, add this line of code to your setup YAML file:


EA_FILE: extra_annotation_file

You can create a setup YAML file with these additional lines of code using this template.


Pre-made PNG images for significantly enriched/over-represented terms.

For example, if KEGG pathway images have been generated as part of the FEA, these images can be incorporated into the report.


For each FEA being summarised you have the option of providing a directory (folder) containing at most one image for each enriched term identified in that FEA.


In the setup YAML file, specify the image directory for an FEA by adding the field "input_img_dir" to that FEA's record:



FEAs:
    - id: "FEA_1"
      goi_file: "goi_file_for_FEA_1"
      fea_file: "FEA_1_results_file"
      input_img_dir: "image_directory_for_FEA_1"

You can create a setup YAML file with these additional lines of code using this template.


IMPORTANT
  • GeneFEAST automatically generates a GO hierarchy for all terms with a Type string starting with "GO" (case is ignored, so "go", "Go", and "gO" also match). If you provide an image for such a term, it will be ignored. The workaround, should you wish to provide alternative images for GO terms, is to change their Type field in the FEA file to a string that does not start with "GO" (in any case variant).
  • Similarly, for MSIGDB terms, GeneFEAST will always try to include an HTML tabular description of the term, and any provided image will be ignored. As for GO terms, the workaround is to change the Type field in the FEA file to a string that does not start with "MSIGDB" (in any case variant).
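
To check in advance which of your terms fall under these two rules, the prefix test can be mimicked in Python. This is a sketch of the rule as described in the bullets above, not GeneFEAST's own code; the function name is hypothetical.

```python
def image_will_be_ignored(term_type):
    """Return True if a provided image would be ignored for this Type string.

    Mirrors the rule described above: Type strings starting with "GO" or
    "MSIGDB", in any mix of upper and lower case. Hypothetical helper.
    """
    return term_type.upper().startswith(("GO", "MSIGDB"))


checks = {t: image_will_be_ignored(t) for t in ["GO", "go_bp", "MSigDB_H", "KEGG"]}
```

Note that any Type string beginning with these letters matches, so a renamed Type must avoid the prefix entirely.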

A GO OBO file.

GeneFEAST ships with a GO OBO file, but if you want to provide a more recent version of this yourself, you can do so in the setup YAML file by adding this line of code:



OBO_FILE: "GO_OBO_file"

An MSIGDB HTML file.

GeneFEAST ships with an MSIGDB HTML file containing an HTML tabular summary of each MSIGDB term, but if you want to provide a more recent version of this yourself, you can do so in the setup YAML file by adding this line of code:



MSIGDB_HTML: "MSIGDB_HTML_file"

GeneFEAST parameters

GeneFEAST runs with preconfigured parameter settings for summarising and visualising FEA results from bulk RNA-seq experiments. However, these parameters can be overridden to tailor performance to your FEA(s).


The following parameters can be set in the setup YAML file. They are shown with their default values.


# **************************************************************************************************************************
# *** Parameters for filtering terms prior to summarisation ***
# **************************************************************************************************************************

MIN_NUM_GENES: 10
# Number of genes of interest that a term must annotate in order to be included in the GeneFEAST report.

MAX_DCNT: 50
MIN_LEVEL: 3
# These parameters pertain to GO terms. 
# MAX_DCNT means maximum descendant count allowed for a GO term to be included in the GeneFEAST summary report.
# MIN_LEVEL means the minimum level in the GO hierarchy that a GO term must have to be included in the GeneFEAST summary report.
# Please refer to article https://doi.org/10.1038/s41598-018-28948-z for further explanation of these terms.

# **************************************************************************************************************************
# *** Parameters affecting how terms and communities are clustered into communities and meta communities, respectively ***
# **************************************************************************************************************************

TT_OVERLAP_MEASURE: OC
# Overlap measure to use when calculating the gene set overlap between terms. Two values are recognised:
# OC (Overlap Coefficient)
# JI (Jaccard Index)
# We recommend using OC here.

MIN_WEIGHT_TT_EDGE: 0.5
# Minimum gene set overlap between terms (as measured using TT_OVERLAP_MEASURE) required for two terms to be
# connected (i.e. to have an edge between them) in the term-term network that GeneFEAST constructs prior to finding
# communities of terms. (Please see GeneFEAST paper for further details).

SC_BC_OVERLAP_MEASURE: OC
# Overlap measure to use when calculating the gene set overlap between a term and a community of terms. Two values are recognised:
# OC (Overlap Coefficient)
# JI (Jaccard Index)
# We recommend using OC here.

MIN_WEIGHT_SC_BC: 0.25
# Minimum gene set overlap required between a term and a community of terms for that term to be considered weakly connected
# to the community of terms (i.e. having some connectivity to the community, but not enough to be considered part of that community).

BC_BC_OVERLAP_MEASURE: JI
# Overlap measure to use when calculating the gene set overlap between two communities of terms. Two values are recognised:
# OC (Overlap Coefficient)
# JI (Jaccard Index)
# We recommend using JI here.

MIN_WEIGHT_BC_BC: 0.1
# Minimum gene set overlap required between two communities of terms for those two communities to be connected (i.e. to 
# have an edge between them) in the community-community network that GeneFEAST constructs prior to finding
# meta-communities of communities. (Please see GeneFEAST paper for further details).

MAX_COMMUNITY_SIZE_THRESH: 15
MAX_META_COMMUNITY_SIZE_THRESH: 15
# In GeneFEAST, the size of communities and meta communities is attenuated using an adaptive algorithm (see main paper for details).
# These two values are parameters for the adaptive algorithm, which will ensure that community and meta-community sizes do
# not exceed these thresholds.

COMBINE_TERM_TYPES: False
# If you are using GeneFEAST to summarize terms from multiple databases, such that the set of terms to be summarised contains more than one type,
# then you can choose either to only allow clustering of terms when terms are from the same database/ share their type (COMBINE_TERM_TYPES: False),
# or to allow the clustering of terms into communities comprised of terms from different databases (COMBINE_TERM_TYPES: True).


# **************************************************************************************************************************
# *** Parameters affecting display of heatmaps ***
# **************************************************************************************************************************

QUANT_DATA_TYPE: log2 FC
# This is the label for the colourmap legend in the split heatmaps

HEATMAP_WIDTH_MIN: 10
HEATMAP_HEIGHT_MIN: 6.5
# These parameters control the size of the split heatmaps. These may need adjusting depending on the size of your display.

HEATMAP_MIN: -4
HEATMAP_MAX: 4
# These parameters give the range of values expected for the provided quantitative data type, and will be used to set the scale
# for the colourmap used in the split heatmap. You should adjust these to match your data. If you do not have
# quantitative data for your genes of interest and have filled this column with a single dummy value, you should set these
# values so that your dummy value is within the range.

# **************************************************************************************************************************
# *** Parameters affecting HTML report ***
# **************************************************************************************************************************

TOOLTIPS: False
# If set to True, the HTML report will be rendered with tooltips to help the novice user better understand the report's contents.

DEFAULT_META_VIEW: circos
# This parameter sets the default first plot shown for meta communities. Accepted values are: "circos","upset","heatmapa","heatmapb","heatmapc", and "litsearch".

DEFAULT_COMMUNITY_VIEW: circos
# This parameter sets the default first plot shown for communities. Accepted values are: "circos","upset","heatmapa","heatmapb","heatmapc", and "litsearch".
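
For readers unfamiliar with the two overlap measures named in the parameter list above, the sketch below uses their standard set-based definitions. It is illustrative only: the exact implementation inside GeneFEAST is not shown in this guide, and the function names and toy gene sets are made up.

```python
def overlap_coefficient(a, b):
    """Overlap Coefficient (OC): |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty sets
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """Jaccard Index (JI): |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Toy gene sets, for illustration only.
term_1 = {"GENE1", "GENE2", "GENE3", "GENE4"}
term_2 = {"GENE1", "GENE2", "GENE4", "GENE5", "GENE6", "GENE7"}

oc = overlap_coefficient(term_1, term_2)  # 3 shared genes / min(4, 6)
ji = jaccard_index(term_1, term_2)        # 3 shared genes / 7 genes in the union
```

With these toy sets, OC is 0.75 and JI is 3/7, so the same pair of terms would clear a threshold of 0.5 under OC but not under JI. This is why the choice of measure and the minimum weight thresholds should be considered together.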



Running GeneFEAST

Below are simple/quick-start instructions for setting up and running GeneFEAST. Users with more computational experience can refer to the note at the end of this section.


Setting up your GeneFEAST project directory (folder)

Start by making your GeneFEAST project directory (folder) and navigating there. For example, in Linux:

mkdir my_genefeast_project
cd my_genefeast_project


Next, copy the following files to this directory:

  • FEA results file(s)
  • GoI file(s)


For example, in Linux, use the following commands to copy your fea_file and goi_file from their original locations to your GeneFEAST project directory:

cp /full/path/to/fea_file .
cp /full/path/to/goi_file .

# Run these cp commands from inside your GeneFEAST project directory.


(Optional) Also copy over the following, as needed:

  • Extra annotation (EA) file
  • Image directories for your FEA(s)
  • GO OBO file
  • MSIGDB HTML file


Now create your setup YAML file (you can use this template), and also save it in your GeneFEAST project directory.



You are now ready to run GeneFEAST! Stay in your GeneFEAST project directory and run GeneFEAST using one of the following options:



Running GeneFEAST from a Docker container

  docker run --rm -v ${PWD}:/data -w /data ghcr.io/avigailtaylor/genefeast gf <SETUP_YAML_FILE> <OUTPUT_DIR>

  # The precondition for this command is that the setup YAML file and data files are located in directory ${PWD}.
  # (Technical note for more advanced users: the -v flag is bind-mounting directory ${PWD} on the host machine to the directory called data in the container.)


Running GeneFEAST from installation

If you have installed GeneFEAST, then you can run it on the command line or from inside a Python session.

Command line:

gf <SETUP_YAML_FILE> <OUTPUT_DIR>

Python session:

from genefeast import gf

gf.gf(<SETUP_YAML_FILE>, <OUTPUT_DIR>)


NOTES



A note for users with more computational experience on the use of file paths when calling GeneFEAST.

If, like me, you prefer to use a directory structure that separates code, input, and output, that's absolutely fine. If you know what you're doing, you can replace file names with file paths, in either or both of the main call to GeneFEAST and the setup YAML file, and GeneFEAST will know what to do. I've presented the simplest process above so that all users can get started with GeneFEAST!

On an extra technical note: if you're using GeneFEAST via its Docker container, please make sure that all the directories referenced in your GeneFEAST call and setup YAML file are bind-mounted to the correct directory on the host computer :)



Viewing the GeneFEAST report

To view a GeneFEAST HTML report, navigate to the output directory (the one you specified in the <OUTPUT_DIR> parameter, above) and use a web browser to open the file GeneFEAST_REPORT_<FEA_IDENTIFIER(S)>.html.

This file will be listed first when the contents of <OUTPUT_DIR> are listed alphabetically.
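
If you prefer to locate the report from a script, a glob on the file name pattern given above is enough. This is a minimal sketch using only the Python standard library; the function name find_reports is hypothetical, and the temporary directory simply stands in for a real output directory.

```python
import tempfile
from pathlib import Path


def find_reports(output_dir):
    """Return GeneFEAST report HTML files in output_dir, sorted by name.

    Hypothetical helper; relies only on the file name pattern given above.
    """
    return sorted(Path(output_dir).glob("GeneFEAST_REPORT_*.html"))


# Demonstration with a stand-in output directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "GeneFEAST_REPORT_FEA_1.html").write_text("<html></html>")
    (Path(tmp) / "other_output.csv").write_text("")
    report_names = [p.name for p in find_reports(tmp)]
```

A path returned by this function can be opened in your default browser with the standard library's webbrowser.open, passing path.resolve().as_uri().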


IMPORTANT



The figure below summarises the structure and contents of HTML reports generated by GeneFEAST:

[Figure: GeneFEAST report structure]


Reports summarising a single FEA have a 'Communities overview' front page (grey inset), which provides a list of meta communities, communities, and terms (green frame in grey inset), a silhouette plot of communities (i), and a graphical grid search of community detection parameters (ii). The Communities overview homepage has the following anchor links (black, solid arrows) into the ‘Full report’:

A top navigation bar with ‘Communities overview’ and ‘Full report’ dropdown menus is fixed at the top of the report and always visible. This provides direct access to every part of the report at all times.

Reports summarising multiple FEAs start with a front page showing an upset plot of the sets of terms identified as enriched in each of the input FEAs (top left green frame). We refer to each set of terms found in two or more FEAs as an "FEA term-set intersection". The navigation bar at the top of this front page provides a ‘Reports’ dropdown menu from which the user can navigate to separate reports summarising the terms in each FEA term-set intersection. Each of the separate reports has the structure of a report summarising a single FEA, as described above.


TOP TIP!


For each community of enriched terms, GeneFEAST reports:

Where applicable, community frames have links back to their meta community and also to sibling communities in their meta community (black, dashed arrows); separately, they also have a list of links to terms sharing some gene set overlap, where that overlap was too weak for membership of the community (black, dotted arrow).

Term frames have similar but reduced content compared with community frames. In particular, they do not include upset plots or dot plots, and the term-gene heatmap element of their split heatmaps is extended to highlight which genes, if any, contribute to enriched terms that have been clustered into a community; in this case the corresponding gene-community pair is depicted in the heatmap. Term frames have links back to weakly connected communities (black, dotted arrow).


TOP TIP!


Meta community frames contain:

In addition, meta community frames have links to member communities (black, dashed arrow).


TOP TIP!


Further information on GeneFEAST report elements:



Additional GeneFEAST output

Term-community membership and term- and experiment-GoI relationships are also output in CSV format, for input into downstream programs. The columns are:


NOTE

This table structure is necessarily repetitive, with possibly multiple entries per gene. In particular, each gene has one entry per FEA/experiment-term pair where:


Where to find these CSV files