To download the latest container from the repository:
docker pull ghcr.io/avigailtaylor/genefeast:latest
This Docker image is designed for “standard Docker installations” on hosts with AMD64 and ARM64 CPUs.
IMPORTANT
- We strongly recommend installing GeneFEAST in a virtual environment because of its dependencies and requirements.
- Make sure to create a virtual environment using Python 3.12 explicitly, rather than your computer's default version.
Type | ID | Description | GeneID |
- Type: Term type/originating database
- ID: Term ID in database
- Description: Term description
- GeneID: "/"-separated list of gene IDs corresponding to GoIs annotated by the term
- Note that in a GSEA-type FEA this is known as the term's leading edge subset
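The column layout above can be checked before running GeneFEAST. The following is a minimal Python sketch, not part of GeneFEAST itself; the function name `read_fea_rows` and the shortened in-memory example are assumptions made for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ["Type", "ID", "Description", "GeneID"]


def read_fea_rows(handle):
    """Read FEA rows and split each GeneID field into a list of gene IDs."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"FEA file is missing column(s): {missing}")
    rows = []
    for row in reader:
        # Gene IDs are "/"-separated; strip stray whitespace around each ID.
        genes = [g.strip() for g in row["GeneID"].split("/") if g.strip()]
        rows.append({"Type": row["Type"], "ID": row["ID"],
                     "Description": row["Description"], "genes": genes})
    return rows


# Small in-memory example with a shortened gene list (illustrative only).
example = io.StringIO(
    'Type,ID,Description,GeneID\n'
    '"GO","GO:0071774","response to fibroblast growth factor","CCN2/THBS1/EGR3"\n'
)
fea_rows = read_fea_rows(example)
print(fea_rows[0]["ID"], fea_rows[0]["genes"])
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.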
Type | ID | Description | GeneID |
"GO" | "GO:0071774" | "response to fibroblast growth factor" | "CCN2/THBS1/EGR3/FGF2/SPRY4/NDST1/CCL2/IER2/FLRT3/PRKD2/CXCL8/SPRY2/FRS2/FGFR1/SPRY1/RUNX2/HYAL1/KDM5B/NOG/ZFP36L1/COL1A1/CASR/FGFR3/FGF1/EXT1/FGFBP1/GATA3/NR4A1" |
"GO" | "GO:0002294" | "CD4-positive alpha-beta T cell differentiation involved in immune response" | "RARA/BCL6/SMAD7/SOCS3/PTGER4/JUNB/ZC3H12A/FOXP1/ENTPD7/NFKBIZ/NLRP3/RC3H1/RORC/RIPK2/ANXA1/RELB/MYB/IL6/LGALS9/GATA3" |
"GO" | "GO:2000514" | "regulation of CD4-positive alpha-beta T cell activation" | "RARA/BCL6/SMAD7/JUNB/RUNX1/ZC3H12A/NFKBIZ/NLRP3/RC3H1/CD274/CBLB/RIPK2/ANXA1/AGER/RUNX3/SOCS1/VSIR/PRKCQ/LGALS9/GATA3" |
Type,ID,Description,GeneID
"GO","GO:0071774","response to fibroblast growth factor","CCN2/THBS1/EGR3/FGF2/SPRY4/NDST1/CCL2/IER2/FLRT3/PRKD2/CXCL8/SPRY2/FRS2/FGFR1/SPRY1/RUNX2/HYAL1/KDM5B/NOG/ZFP36L1/COL1A1/CASR/FGFR3/FGF1/EXT1/FGFBP1/GATA3/NR4A1"
"GO","GO:0002294","CD4-positive alpha-beta T cell differentiation involved in immune response","RARA/BCL6/SMAD7/SOCS3/PTGER4/JUNB/ZC3H12A/FOXP1/ENTPD7/NFKBIZ/NLRP3/RC3H1/RORC/RIPK2/ANXA1/RELB/MYB/IL6/LGALS9/GATA3"
"GO","GO:2000514","regulation of CD4-positive alpha-beta T cell activation","RARA/BCL6/SMAD7/JUNB/RUNX1/ZC3H12A/NFKBIZ/NLRP3/RC3H1/CD274/CBLB/RIPK2/ANXA1/AGER/RUNX3/SOCS1/VSIR/PRKCQ/LGALS9/GATA3"
IMPORTANT
- GoI must be listed using IDs that match those used in the FEA results file.
- If you do not have quantitative data, you can just provide a dummy column with the same numerical value entered for each gene.
GeneID | log2FC |
PDGFB | 2.845276684 |
GTPBP4 | 1.396754262 |
C12orf49 | 1.469143469 |
SLC2A1 | 1.618759309 |
CCN2 | 2.593769464 |
CXCR4 | 2.528192609 |
NCOA5 | 2.137989231 |
CDKN1A | 3.154969844 |
RARA | 1.444539048 |
GeneID,log2FC
PDGFB,2.845276684
GTPBP4,1.396754262
C12orf49,1.469143469
SLC2A1,1.618759309
CCN2,2.593769464
CXCR4,2.528192609
NCOA5,2.137989231
CDKN1A,3.154969844
RARA,1.444539048
- GoI files can contain just significantly differentially expressed genes, or they can contain results for all genes tested in a given experiment.
- Either way, GeneFEAST will only store quantitative data for GoI that are also present in the FEA results file for the corresponding experiment.
- Since, by definition, GSEA-type FEAs do not select a subset of genes for analysis, the GoI file in this scenario is just the results for all genes tested in a given experiment.
- Importantly, GeneFEAST will only store quantitative data for genes present in at least one leading edge subset in the GSEA-type FEA results file, and it is this set of genes that is treated as the de facto GoI for the rest of the GeneFEAST analysis.
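The filtering rule described above can be pictured with a small Python sketch. This is only an illustration of the stated behaviour, not GeneFEAST's own code; the function name `genes_with_stored_data` and the shortened gene lists are assumptions.

```python
def genes_with_stored_data(goi_values, fea_gene_lists):
    """Return the GoI (with their values) that also appear in the FEA results.

    goi_values: dict mapping gene ID -> quantitative value (e.g. log2FC)
    fea_gene_lists: iterable of "/"-separated GeneID strings, one per term
    """
    annotated = set()
    for gene_list in fea_gene_lists:
        annotated.update(g.strip() for g in gene_list.split("/") if g.strip())
    # Keep only genes that are annotated by at least one enriched term.
    return {gene: value for gene, value in goi_values.items() if gene in annotated}


goi_values = {"PDGFB": 2.845276684, "CCN2": 2.593769464, "RARA": 1.444539048}
fea_gene_lists = ["CCN2/THBS1/EGR3", "RARA/BCL6/SMAD7"]  # shortened for illustration
kept = genes_with_stored_data(goi_values, fea_gene_lists)
print(sorted(kept))
```

In this toy input, PDGFB is dropped because no term in the FEA gene lists annotates it.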
IMPORTANT NOTE ABOUT BEST PRACTICE FOR GENEFEAST ANALYSIS OF GSEA-TYPE FEAs
- Standard GSEAs look for enrichment of gene sets amongst either the most over-expressed or the most under-expressed genes in a list of genes ranked by expression.
- Thus, for any significantly enriched term, either a positive enrichment score (corresponding to enrichment amongst over-expressed genes) or a negative enrichment score (corresponding to enrichment amongst under-expressed genes) is reported, but never both.
- Consequently, in the GeneFEAST setting, you should split your GSEA results into two FEA results files: one for positive scoring terms and one for negative scoring terms, then treat these as separate FEA results files for GeneFEAST summarisation purposes.
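One way to do this split is sketched below in Python. It is a rough illustration under stated assumptions: the score column is assumed to be called `NES` (pass a different `score_column` if your GSEA tool names it otherwise), and the term IDs and gene names are placeholders.

```python
import csv
import io


def split_by_score_sign(handle, score_column="NES"):
    """Split GSEA result rows into positive- and negative-scoring terms."""
    reader = csv.DictReader(handle)
    positive, negative = [], []
    for row in reader:
        score = float(row[score_column])
        if score > 0:
            positive.append(row)
        elif score < 0:
            negative.append(row)
    return reader.fieldnames, positive, negative


def write_rows(handle, fieldnames, rows):
    """Write rows with a header, keeping the original column order."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Placeholder GSEA results; "NES" is an assumed column name.
gsea = io.StringIO(
    "Type,ID,Description,NES,GeneID\n"
    "EXAMPLE,TERM_A,example term A,1.8,GENE1/GENE2\n"
    "EXAMPLE,TERM_B,example term B,-2.1,GENE3/GENE4\n"
)
fields, pos_rows, neg_rows = split_by_score_sign(gsea)
pos_out, neg_out = io.StringIO(), io.StringIO()
write_rows(pos_out, fields, pos_rows)  # would become one FEA results file
write_rows(neg_out, fields, neg_rows)  # would become a second FEA results file
print(len(pos_rows), len(neg_rows))
```

Each of the two outputs can then be listed as its own FEA, with its own id, in the setup file.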
You will use this setup file to give GeneFEAST the id(s) of the FEA(s) to summarise, the location(s) of the FEA file(s), and the location(s) of the GoI file(s).
To summarise a single FEA:
FEAs:
  - id: "FEA_1"
    goi_file: "goi_file_for_FEA_1"
    fea_file: "FEA_1_results_file"
To summarise multiple FEAs:
FEAs:
  - id: "FEA_1"
    goi_file: "goi_file_for_FEA_1"
    fea_file: "FEA_1_results_file"
  - id: "FEA_2"
    goi_file: "goi_file_for_FEA_2"
    fea_file: "FEA_2_results_file"
  - id: "FEA_3"
    goi_file: "goi_file_for_FEA_3"
    fea_file: "FEA_3_results_file"
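If you have many FEAs, the setup file can also be generated rather than typed by hand. The Python sketch below builds the `FEAs:` block with plain string formatting; the function name `setup_yaml` and the two-space indentation are assumptions, and values containing double quotes are not escaped.

```python
def setup_yaml(feas):
    """Build setup YAML text from (id, goi_file, fea_file) tuples."""
    lines = ["FEAs:"]
    for fea_id, goi_file, fea_file in feas:
        lines.append(f'  - id: "{fea_id}"')
        lines.append(f'    goi_file: "{goi_file}"')
        lines.append(f'    fea_file: "{fea_file}"')
    return "\n".join(lines) + "\n"


text = setup_yaml([
    ("FEA_1", "goi_file_for_FEA_1", "FEA_1_results_file"),
    ("FEA_2", "goi_file_for_FEA_2", "FEA_2_results_file"),
])
print(text)
```

Write `text` to a file in your project directory to use it as the setup YAML file.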
You can create a YAML setup file using this template.
Type | ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | GeneID | count |
IMPORTANT
- Columns ID through to count are output by the enrich functions.
- However, you will need to add the "Type" column manually, e.g., using Excel or Vim.
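As an alternative to editing by hand, the "Type" column can be added with a short script. This is a minimal Python sketch; the function name `add_type_column`, the value "KEGG", and the shortened example table are assumptions for illustration.

```python
import csv
import io


def add_type_column(in_handle, out_handle, term_type):
    """Copy a CSV, prepending a "Type" column holding the same value in every row."""
    reader = csv.reader(in_handle)
    writer = csv.writer(out_handle, lineterminator="\n")
    header = next(reader)
    writer.writerow(["Type"] + header)
    for row in reader:
        writer.writerow([term_type] + row)


# Shortened stand-in for an enrich-function output table.
enrich_output = io.StringIO(
    "ID,Description,GeneID\n"
    "TERM_A,example term,GENE1/GENE2\n"
)
with_type = io.StringIO()
add_type_column(enrich_output, with_type, "KEGG")
print(with_type.getvalue())
```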
ENRICH: True
DOTPLOTS: True
You can create a setup YAML file with these additional lines of code using this template.
As part of the report generation process, GeneFEAST conducts a literature search for each GoI via the National Center for Biotechnology Information's Gene and PubMed services (Sayers et al., 2021).
This literature search can incorporate additional search terms, which you can specify in your YAML setup file using the following code:
SEARCH_WORDS:
- search term 1
- search term 2
- etc.
You can create a setup YAML file with these additional lines of code using this template.
Occasionally, you may wish to keep track of an a priori set of genes relevant to your study, for example those that are members of a particular biological signature, throughout the GeneFEAST report.
Each extra annotation will be displayed as an additional row at the top of the term-GoI heatmap panel in the split heatmap created for each community of terms (similarly for each meta community of communities).
To do this, first make an extra annotation (EA) file. The EA file is a headerless CSV file, with one EA per row, and two columns:
RNA_DRG_IFN,STAT1/IFI16/SP110/MX1/IFIT5/PARP12/EIF2AK2/IFI44/PARP14/TRIM21/DDX60L/IFI127/ADAR/HERC6/IFI35/ISG20/LGALS9/UBE2L6/DHX58/STAT2/OAS3/ISG15/IRF7/IFI6/IFI44L/IFITM1/OAS1/D$
Proteome_DRG_IFN,IFIT2/IFIT1/IFIT3/OAS2/MX2/OASL/IFIH1/ISG15/MX1/SP110/IFI44/CMPK2/IFI44L/OAS1/DDX58/STAT1/IFIT5/DDX60/PARP12/IFI16/DDX60L/OAS3/EIF2AK2/ISG20/ADAR/IFI35/STAT2/LGAL$
Then, add this line of code to your setup YAML file:
EA_FILE: extra_annotation_file
You can create a setup YAML file with these additional lines of code using this template.
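An EA file can be written from a dictionary of gene sets, as in the Python sketch below. Judging from the example rows above, the first column holds the annotation name and the second a "/"-separated list of gene IDs; that reading, the function name `write_ea_file`, and the annotation name `MY_SIGNATURE` are assumptions.

```python
import csv
import io


def write_ea_file(handle, annotations):
    """Write one extra annotation per row: name, then "/"-joined gene IDs. No header."""
    writer = csv.writer(handle, lineterminator="\n")
    for name, genes in annotations.items():
        writer.writerow([name, "/".join(genes)])


annotations = {"MY_SIGNATURE": ["STAT1", "IFI16", "SP110"]}  # hypothetical signature
ea = io.StringIO()
write_ea_file(ea, annotations)
print(ea.getvalue())
```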
For example, if KEGG pathway images have been generated as part of the FEA, these images can be incorporated into the report.
For each FEA being summarised you have the option of providing a directory (folder) containing at most one image for each enriched term identified in that FEA.
In the setup YAML file, specify the image directory for an FEA by adding the field "input_img_dir" to that FEA's record:
FEAs:
  - id: "FEA_1"
    goi_file: "goi_file_for_FEA_1"
    fea_file: "FEA_1_results_file"
    input_img_dir: "image_directory_for_FEA_1"
You can create a setup YAML file with these additional lines of code using this template.
IMPORTANT
- GeneFEAST automatically generates a GO hierarchy for all terms with a Type string starting "GO" (case is ignored, so "go", "Go", and "gO" also match). So, if you provide a corresponding image for such a term, it will be ignored. The workaround, should you wish to provide alternative images for GO terms, is to change their Type field in the FEA file to something that does not start with "GO" in any case variant.
- Similarly, for MSIGDB terms, GeneFEAST will always try to include an HTML tabular description of the term, and any provided image will be ignored. As for GO terms, the workaround is to change the Type field in the FEA file to something that does not start with "MSIGDB" (or any case variant).
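The workaround above amounts to a case-insensitive prefix check on the Type field. A minimal Python sketch follows; the function names and the replacement type "CUSTOM" are hypothetical, and this is not GeneFEAST's own implementation of the check.

```python
def is_reserved_type(term_type, prefixes=("GO", "MSIGDB")):
    """True if Type starts with a reserved prefix, ignoring case."""
    return term_type.upper().startswith(tuple(p.upper() for p in prefixes))


def retype(rows, new_type):
    """Return copies of rows, replacing reserved Type values with new_type."""
    if is_reserved_type(new_type):
        raise ValueError("new_type must not start with GO or MSIGDB")
    return [
        {**row, "Type": new_type} if is_reserved_type(row["Type"]) else dict(row)
        for row in rows
    ]


rows = [{"Type": "go", "ID": "TERM_A"}, {"Type": "KEGG", "ID": "TERM_B"}]
retyped = retype(rows, "CUSTOM")
print([r["Type"] for r in retyped])
```

Rows whose Type is not reserved are copied unchanged, so images supplied for those terms are unaffected.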
GeneFEAST ships with a GO OBO file, but if you want to provide a more recent version of this yourself, you can do so in the setup YAML file by adding this line of code:
OBO_FILE: "GO_OBO_file"
GeneFEAST ships with an MSIGDB HTML file containing an HTML tabular summary of each MSIGDB term, but if you want to provide a more recent version of this yourself, you can do so in the setup YAML file by adding this line of code:
MSIGDB_HTML: "MSIGDB_HTML_file"
GeneFEAST runs with preconfigured parameter settings for summarising and visualising FEA results from bulk RNA-seq experiments. However, these parameters can be overridden, potentially giving better performance tailored to the user's FEA(s).
# **************************************************************************************************************************
# *** Parameters for filtering terms prior to summarisation ***
# **************************************************************************************************************************
MIN_NUM_GENES: 10
# Number of genes of interest that a term must annotate in order to be included in the GeneFEAST report.
MAX_DCNT: 50
MIN_LEVEL: 3
# These parameters pertain to GO terms.
# MAX_DCNT means maximum descendant count allowed for a GO term to be included in the GeneFEAST summary report.
# MIN_LEVEL means the minimum level in the GO hierarchy that a GO term must have to be included in the GeneFEAST summary report.
# Please refer to article https://doi.org/10.1038/s41598-018-28948-z for further explanation of these terms.
# **************************************************************************************************************************
# *** Parameters affecting how terms and communities are clustered into communities and meta communities, respectively ***
# **************************************************************************************************************************
TT_OVERLAP_MEASURE: OC
# Overlap measure to use when calculating the gene set overlap between terms. Two values are recognised:
# OC (Overlap Coefficient)
# JI (Jaccard Index)
# We recommend using OC here.
MIN_WEIGHT_TT_EDGE: 0.5
# Minimum gene set overlap between terms (as measured using TT_OVERLAP_MEASURE) required for two terms to be
# connected (i.e. to have an edge between them) in the term-term network that GeneFEAST constructs prior to finding
# communities of terms. (Please see GeneFEAST paper for further details).
SC_BC_OVERLAP_MEASURE: OC
# Overlap measure to use when calculating the gene set overlap between a term and a community of terms. Two values are recognised:
# OC (Overlap Coefficient)
# JI (Jaccard Index)
# We recommend using OC here.
MIN_WEIGHT_SC_BC: 0.25
# Minimum gene set overlap required between a term and a community of terms for that term to be considered weakly connected
# to the community of terms (i.e. having some connectivity to the community, but not enough to be considered part of that community).
BC_BC_OVERLAP_MEASURE: JI
# Overlap measure to use when calculating the gene set overlap between two communities of terms. Two values are recognised:
# OC (Overlap Coefficient)
# JI (Jaccard Index)
# We recommend using JI here.
MIN_WEIGHT_BC_BC: 0.1
# Minimum gene set overlap required between two communities of terms for those two communities to be connected (i.e. to
# have an edge between them) in the community-community network that GeneFEAST constructs prior to finding
# meta-communities of communities. (Please see GeneFEAST paper for further details).
MAX_COMMUNITY_SIZE_THRESH: 15
MAX_META_COMMUNITY_SIZE_THRESH: 15
# In GeneFEAST, the size of communities and meta communities is attenuated using an adaptive algorithm (see main paper for details).
# These two values are parameters for the adaptive algorithm, which will ensure that community and meta-community sizes do
# not exceed these thresholds.
COMBINE_TERM_TYPES: False
# If you are using GeneFEAST to summarise terms from multiple databases, such that the set of terms to be summarised contains more than one type,
# then you can choose either to only allow clustering of terms when terms are from the same database/share their type (COMBINE_TERM_TYPES: False),
# or to allow the clustering of terms into communities comprised of terms from different databases (COMBINE_TERM_TYPES: True).
# **************************************************************************************************************************
# *** Parameters affecting display of heatmaps ***
# **************************************************************************************************************************
QUANT_DATA_TYPE: log2 FC
# This is the label for the colourmap legend in the split heatmaps
HEATMAP_WIDTH_MIN: 10
HEATMAP_HEIGHT_MIN: 6.5
# These parameters control the size of the split heatmaps. These may need adjusting depending on the size of your display.
HEATMAP_MIN: -4
HEATMAP_MAX: 4
# These parameters give the range of values expected for the provided quantitative data type, and will be used to set the scale
# for the colourmap used in the split heatmap. You should adjust these to match your data. In the case that you do not have
# quantitative data for your genes of interest and have replaced this column with a single dummy value, you should set these
# values so that your dummy value is in the range.
# **************************************************************************************************************************
# *** Parameters affecting HTML report ***
# **************************************************************************************************************************
TOOLTIPS: False
# If set to True, the HTML report will be rendered with tooltips to help the novice user better understand the report's contents.
DEFAULT_META_VIEW: circos
# This parameter sets the default first plot shown for meta communities. Accepted values are: "circos","upset","heatmapa","heatmapb","heatmapc", and "litsearch".
DEFAULT_COMMUNITY_VIEW: circos
# This parameter sets the default first plot shown for communities. Accepted values are: "circos","upset","heatmapa","heatmapb","heatmapc", and "litsearch".
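For readers unfamiliar with the two overlap measures named in the parameters above, the Python sketch below computes the Overlap Coefficient (OC) and the Jaccard Index (JI) for two gene sets using their standard definitions. The gene sets are shortened examples, and treating an overlap equal to the threshold as passing is an assumption; see the GeneFEAST paper for how the thresholds are actually applied.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|; 0.0 if both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


term_1 = {"RARA", "BCL6", "SMAD7", "JUNB"}
term_2 = {"RARA", "BCL6", "RUNX1"}
oc = overlap_coefficient(term_1, term_2)  # 2 shared genes / min(4, 3)
ji = jaccard_index(term_1, term_2)        # 2 shared genes / 5 distinct genes
MIN_WEIGHT_TT_EDGE = 0.5
print(oc, ji, oc >= MIN_WEIGHT_TT_EDGE)
```

OC is never smaller than JI for the same pair of sets, so the same numeric threshold is more permissive under OC than under JI.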
Below are simple/quick-start instructions for setting up and running GeneFEAST. Users with more computational experience can refer to the note at the end of this section.
Start by making your GeneFEAST project directory (folder) and navigating there. For example, in Linux:
mkdir my_genefeast_project
cd my_genefeast_project
Next, copy the following files to this directory:
For example, in Linux, use the following commands to copy your fea_file and goi_file from their original locations to your GeneFEAST project directory:
cp /full/path/to/fea_file .
cp /full/path/to/goi_file .
# The precondition for these cp commands is that they are called from inside your GeneFEAST project directory.
(Optional) Also copy over the following, as needed:
Now create your setup YAML file (you can use this template), and also save it in your GeneFEAST project directory.
You are now ready to run GeneFEAST! Stay in your GeneFEAST project directory and run GeneFEAST using one of the following options:
docker run --rm -v ${PWD}:/data -w /data ghcr.io/avigailtaylor/genefeast gf <SETUP_YAML_FILE> <OUTPUT_DIR>
# The precondition for this command is that the setup YAML file and data files are located in directory ${PWD}.
# (Technical note for more advanced users: the -v flag is bind-mounting directory ${PWD} on the host machine to the directory called data in the container.)
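If you prefer to launch the container from a script, the same call can be assembled in Python. The sketch below only builds the argument list for the docker command shown above; the function name `docker_command` and the file names `setup.yml` and `genefeast_output` are placeholders.

```python
from pathlib import Path


def docker_command(project_dir, setup_yaml_file, output_dir,
                   image="ghcr.io/avigailtaylor/genefeast"):
    """Build the argv for the docker run call shown above, mounting project_dir at /data."""
    host_dir = Path(project_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data",  # bind-mount the host project directory
        "-w", "/data",              # run inside the mounted directory
        image, "gf", setup_yaml_file, output_dir,
    ]


cmd = docker_command(".", "setup.yml", "genefeast_output")
print(" ".join(cmd))
# To execute the command: import subprocess; subprocess.run(cmd, check=True)
```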
If you have installed GeneFEAST, then you can run it on the command line or from inside a Python session.
gf <SETUP_YAML_FILE> <OUTPUT_DIR>
from genefeast import gf
gf.gf(<SETUP_YAML_FILE>, <OUTPUT_DIR>)
NOTES
- When you run GeneFEAST, it will use the setup YAML file to count how many FEAs are being summarised, and then generate single or multi FEA summary reports accordingly.
- Make sure <OUTPUT_DIR> does not already exist.
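When calling GeneFEAST from Python, the second note can be enforced with a small pre-flight check. This is a sketch under stated assumptions: `preflight` and `run_genefeast` are hypothetical helper names, the file names are placeholders, and only the `gf.gf(...)` call comes from the instructions above.

```python
import tempfile
from pathlib import Path


def preflight(setup_yaml_file, output_dir):
    """Raise if the setup file is missing or the output directory already exists."""
    if not Path(setup_yaml_file).is_file():
        raise FileNotFoundError(f"Setup YAML file not found: {setup_yaml_file}")
    if Path(output_dir).exists():
        raise FileExistsError(f"Output directory already exists: {output_dir}")


def run_genefeast(setup_yaml_file, output_dir):
    """Check the inputs, then call GeneFEAST as shown above."""
    preflight(setup_yaml_file, output_dir)
    # Imported here so that preflight can be used without GeneFEAST installed.
    from genefeast import gf
    gf.gf(setup_yaml_file, output_dir)


# Demonstrate the check in a throwaway directory (GeneFEAST itself is not run).
with tempfile.TemporaryDirectory() as project_dir:
    setup_file = Path(project_dir) / "setup.yml"  # placeholder name
    setup_file.write_text("FEAs:\n")
    preflight(setup_file, Path(project_dir) / "genefeast_output")  # passes
    try:
        preflight(setup_file, project_dir)  # project_dir already exists
        already_exists_detected = False
    except FileExistsError:
        already_exists_detected = True
print(already_exists_detected)
```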
If, like me, you prefer to use a directory structure that separates code, input, and output, that's absolutely fine. If you know what you're doing you can replace file names with file paths, in either or both of the main call to GeneFEAST and the setup YAML file, and GeneFEAST will know what to do. I've presented the simplest process above so that all users can get started with GeneFEAST!
On an extra technical note - if you're using GeneFEAST via its Docker container please do make sure that all the directories referenced in your GeneFEAST call and setup YAML file are bind-mounted to the correct directory on the host computer :)
To view a GeneFEAST HTML report, navigate to the output directory (specified by you in the <OUTPUT_DIR> parameter, above) and use a web browser to open the file GeneFEAST_REPORT_<FEA_IDENTIFIER(S)>.html. This file will be listed first when the contents of <OUTPUT_DIR> are listed alphabetically.
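The report file can also be located from a script. The Python sketch below searches an output directory for files matching the report name pattern given above; the function name `find_reports` and the file name `GeneFEAST_REPORT_FEA_1.html` are illustrative assumptions, and a temporary directory stands in for a real output directory.

```python
import tempfile
from pathlib import Path


def find_reports(output_dir):
    """Return report HTML files in output_dir, sorted by name."""
    return sorted(Path(output_dir).glob("GeneFEAST_REPORT_*.html"))


# Stand-in for a real GeneFEAST output directory.
with tempfile.TemporaryDirectory() as output_dir:
    (Path(output_dir) / "GeneFEAST_REPORT_FEA_1.html").write_text("<html></html>")
    (Path(output_dir) / "other_file.txt").write_text("")
    report_names = [p.name for p in find_reports(output_dir)]
print(report_names)
# To open a report in the default browser:
#   import webbrowser; webbrowser.open(report_path.resolve().as_uri())
```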
IMPORTANT
- Viewing the HTML output report requires a web browser with HTML5 and JavaScript 1.6 support.
- Please make sure to keep all the output generated by GeneFEAST in the output directory; the HTML report uses relative links to images, and will break if the relative directory structure is broken.
The figure below summarises the structure and contents of HTML reports generated by GeneFEAST:
Reports summarising a single FEA have a 'Communities overview' front page (grey inset), which provides a list of meta communities, communities, and terms (green frame in grey inset), a silhouette plot of communities (i), and a graphical grid search of community detection parameters (ii). The Communities overview homepage has the following anchor links (black, solid arrows) into the ‘Full report’:
Reports summarising multiple FEAs start with a front page showing an upset plot of the sets of terms identified as enriched in each of the input FEAs (top left green frame). We refer to each set of terms found in two or more FEAs as a "FEA term-set intersection". The navigation bar at the top of this front page provides a ‘Reports’ dropdown menu from which the user can navigate to separate reports summarising the terms in each FEA term-set intersection. Each of the separate reports has the structure of a report summarising a single FEA, as described above.
TOP TIP!
- To help the novice user understand the contents of a GeneFEAST report, run GeneFEAST with tooltips switched on.
- Do this by adding the line TOOLTIPS: True to the setup YAML file.
For each community of enriched terms, GeneFEAST reports:
Term frames have similar but reduced content compared with community frames. In particular, they do not include upset plots and dot plots, and the term-gene heatmap element of their split heatmaps is extended to highlight which genes, if any, contribute to enriched terms that have been clustered into a community; in this case the corresponding gene-community pair is depicted in the heatmap. Term frames have links back to weakly connected communities (black, dotted arrow).
TOP TIP!
- The default first plot shown for communities is a circos plot, but you can change this by setting DEFAULT_COMMUNITY_VIEW in the setup YAML file.
- Accepted values are "circos", "upset", "heatmapa", "heatmapb", "heatmapc", and "litsearch".
- Note that term frames will also be affected by this setting, except when "upset" or "circos" are chosen as the default first plot.
Meta community frames contain:
TOP TIP!
- The default first plot shown for meta communities is a circos plot, but you can change this by setting DEFAULT_META_VIEW in the setup YAML file.
- Accepted values are "circos", "upset", "heatmapa", "heatmapb", "heatmapc", and "litsearch".
Further information on GeneFEAST report elements:
Term-community membership, term- and experiment-GoI relationships are also output in CSV format, for input into downstream programs. The columns are:
NOTE
This table structure is necessarily repetitive, with possibly multiple entries per gene. In particular, each gene has one entry per FEA/experiment-term pair where:
- the gene was identified as a GoI in the experiment that underwent FEA;
- the gene is annotated by the term;
- and the term has been identified as significantly enriched in the FEA.
Where to find these CSV files
- When GeneFEAST has been used in single FEA summarisation mode, view this CSV file by navigating to <OUTPUT_DIR> and opening file
GeneFEAST_TABLE_
.html
- When GeneFEAST has been used in multiple FEA summarisation mode there will be an output CSV file for each FEA term-set intersection (FEATSI). Find these files by navigating to <OUTPUT_DIR> and opening directory (folder)
GeneFEAST_TABLES_
.html