Title: | Comprehensive GO Terms Comparison Between Species |
---|---|
Description: | Supports the assessment of functional enrichment analyses obtained for several lists of genes and provides a workflow to analyze them between two species via weighted graphs. Methods are described in Sosa et al. (2023) <doi:10.1016/j.ygeno.2022.110528>. |
Authors: | Chrystian Camilo Sosa [aut, cre, cph] , Diana Carolina Clavijo-Buriticá [aut], Mauricio Alberto Quimbaya [aut], Maria Victoria Diaz [ctb], Camila Riccio Rengifo [ctb], Nicolas López-Rozo [ctb], Victor Hugo García Merchán [aut, ctb], Arlen James Mosquera [ctb], Andrés Álvarez [ctb] |
Maintainer: | Chrystian Camilo Sosa <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.2.1 |
Built: | 2024-11-22 05:38:08 UTC |
Source: | https://github.com/ccsosa/gocompare |
GOCompare is an R package used to compare lists of GO terms between two species.
Package: | GOCompare |
Type: | Package |
Version: | 1.0.2.1 |
Date: | 2022-12-02 |
License: | GPL-3 |
This dataset is the original dataset obtained for Clavijo-Buriticá (In preparation)
A_thaliana
A data frame with 4063 rows and 6 variables:
Numeric: False discovery rate values for the GO term
numeric: Number of genes in the list of genes for a given GO term
numeric: Number of genes in the genome of a species for a given GO term
character: GO term name or GO term id
character: Genes found for a given GO term
character: The comparison group (feature) to which the GO term belongs
https://data.mendeley.com/datasets/myyy2wxd59/1
Clavijo-Buriticá, Sosa, C.C., Mosquera, A.J. Álvarez, A., Medina, J. Quimbaya, M.A. A systematic comparison of the molecular machinery associated with Cancer-Hallmarks between plants and humans reveals Arabidopsis thaliana as a useful model to understand specific carcinogenic events (to be submitted, Target journal: Plos Biology)
This dataset is a subset of the original dataset obtained for Clavijo-Buriticá (In preparation)
A_thaliana_compress
A data frame with 120 rows and 6 variables (30 GO terms per cancer hallmark):
Numeric: False discovery rate values for the GO term
numeric: Number of genes in the list of genes for a given GO term
numeric: Number of genes in the genome of a species for a given GO term
character: GO term name or GO term id
character: Genes found for a given GO term
character: The comparison group (feature) to which the GO term belongs
https://data.mendeley.com/datasets/myyy2wxd59/1
Clavijo-Buriticá, Sosa, C.C., Mosquera, A.J. Álvarez, A., Medina, J. Quimbaya, M.A. A systematic comparison of the molecular machinery associated with Cancer-Hallmarks between plants and humans reveals Arabidopsis thaliana as a useful model to understand specific carcinogenic events (to be submitted, Target journal: Plos Biology)
compareGOspecies function provides a simple workflow to compare results of functional enrichment analysis for two species.
To use this function you will need two data frames, each with a column that represents the features to be compared (e.g., feature). The function extracts the unique GO terms from both data frames and generates a presence-absence matrix in which rows represent a combination of category and species (e.g., H.sapiens AID) and columns represent the GO terms analyzed. It then calculates Jaccard distances and returns a list with four slots: (1) a principal coordinates analysis (PCoA), (2) the Jaccard distance matrix, (3) a list of the GO terms shared between species, and (4) a list of the unique GO terms and the species to which each belongs.
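The steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy data frames, the `group` column, and the choice of `dist(method = "binary")` and `cmdscale()` are all assumptions made for the example.

```r
# Toy enrichment results (hypothetical data, not the package datasets)
df1 <- data.frame(feature = c("AID", "AID", "GI"),
                  Functional_Category = c("GO:1", "GO:2", "GO:2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(feature = c("AID", "GI", "GI"),
                  Functional_Category = c("GO:2", "GO:2", "GO:3"),
                  stringsAsFactors = FALSE)

# Label each row with its species/category combination (hypothetical column)
df1$group <- paste("H. sapiens", df1$feature)
df2$group <- paste("A. thaliana", df2$feature)
both <- rbind(df1, df2)

# Presence-absence matrix: rows are groups, columns are GO terms
tab <- table(both$group, both$Functional_Category)
pa <- matrix(as.numeric(tab > 0), nrow = nrow(tab),
             dimnames = list(rownames(tab), colnames(tab)))

# For 0/1 data, dist(method = "binary") gives the Jaccard distance
jaccard <- dist(pa, method = "binary")

# Principal coordinates analysis (classical scaling) on the distances
pcoa <- cmdscale(jaccard, k = 2)

# GO terms shared between the two species
shared <- intersect(df1$Functional_Category, df2$Functional_Category)
```

Groups with identical GO term sets end up at distance 0, so they overlap in the PCoA plot.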
compareGOspecies( df1, df2, GOterm_field, species1, species2, skipPCoA = FALSE, paired_lists = TRUE )
df1 | A data frame with the results of a functional enrichment analysis for species 1, with an extra column "feature" containing the features to be compared |
df2 | A data frame with the results of a functional enrichment analysis for species 2, with an extra column "feature" containing the features to be compared |
GOterm_field | A string with the name of the column holding the GO terms (e.g., "Functional_Category") |
species1 | A string with the name of species 1 (e.g., "H. sapiens") |
species2 | A string with the name of species 2 (e.g., "A. thaliana") |
skipPCoA | A boolean indicating whether the PCoA graphics should be skipped |
paired_lists | A boolean indicating whether both species have the same comparable categories (gene lists). If paired_lists is FALSE, the counts will be done only per species and the categories will be kept in the outcomes. Use paired_lists = FALSE with care. |
This function will return a list with four slots: graphics, distance, shared_GO_list, and unique_GO_list
Do not use "-" in the feature column. This will lead to wrong results!
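A quick check for this constraint can be run before calling the function. The sketch below is an assumption-laden example: the `features` vector is hypothetical, and replacing "-" with "_" is just one possible workaround, not something the package prescribes.

```r
# Hypothetical feature labels, one of which contains a hyphen
features <- c("AID", "GI", "cell-cycle")

# Flag labels containing "-" (fixed = TRUE treats "-" literally)
has_hyphen <- grepl("-", features, fixed = TRUE)

# One possible workaround (an assumption): replace "-" with "_"
features_clean <- gsub("-", "_", features, fixed = TRUE)
```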
# Loading example datasets
data(H_sapiens_compress)
data(A_thaliana_compress)
# Defining the column with the GO terms to be compared
GOterm_field <- "Functional_Category"
# Defining the species names
species1 <- "H. sapiens"
species2 <- "A. thaliana"
# Running function
x <- compareGOspecies(df1 = H_sapiens_compress,
                      df2 = A_thaliana_compress,
                      GOterm_field = GOterm_field,
                      species1 = species1,
                      species2 = species2,
                      skipPCoA = FALSE,
                      paired_lists = TRUE)
## Not run:
# Displaying PCoA results
x$graphics
# Checking shared GO terms between species
print(tapply(x$shared_GO_list$feature, x$shared_GO_list$feature, length))
## End(Not run)
This dataset is the result of running the compareGOspecies function and is composed of four slots:
PCoA graphics
numeric: Jaccard distance matrix
data.frame with shared GO terms between species
data.frame with unique GO terms and the species to which each belongs
comparison_ex_compress
An object of class list of length 4.
https://data.mendeley.com/datasets/myyy2wxd59/1
Clavijo-Buriticá, Sosa, C.C., Mosquera, A.J. Álvarez, A., Medina, J. Quimbaya, M.A. A systematic comparison of the molecular machinery associated with Cancer-Hallmarks between plants and humans reveals Arabidopsis thaliana as a useful model to understand specific carcinogenic events (to be submitted, Target journal: Plos Biology)
This dataset is the result of running the compareGOspecies function and is composed of three slots:
numeric: Jaccard distance matrix
data.frame with shared GO terms between species
data.frame with unique GO terms and the species to which each belongs
comparison_ex_compress_CH
An object of class list of length 3.
https://data.mendeley.com/datasets/myyy2wxd59/1
Clavijo-Buriticá, Sosa, C.C., Mosquera, A.J. Álvarez, A., Medina, J. Quimbaya, M.A. A systematic comparison of the molecular machinery associated with Cancer-Hallmarks between plants and humans reveals Arabidopsis thaliana as a useful model to understand specific carcinogenic events (to be submitted, Target journal: Plos Biology)
evaluateCAT_species provides a simple function to compare the results of functional enrichment analysis for two species through the use of proportion tests or Pearson's Chi-squared tests and a false discovery rate correction
evaluateCAT_species(df1, df2, species1, species2, GOterm_field, test = "prop")
df1 | A data frame with the results of a functional enrichment analysis for species 1, with an extra column "feature" containing the features to be compared |
df2 | A data frame with the results of a functional enrichment analysis for species 2, with an extra column "feature" containing the features to be compared |
species1 | A string with the name of species 1 (e.g., "H. sapiens") |
species2 | A string with the name of species 2 (e.g., "A. thaliana") |
GOterm_field | A string with the name of the column holding the GO terms (e.g., "Functional_Category") |
test | A string with the hypothesis test to be performed. Two options are provided, "prop" and "chi-squared" (default value="prop") |
This function will return a data.frame with the following fields:
CAT | Category |
pvalue | p-value obtained through the use of Pearson's Chi-squared Test |
FDR | Multiple comparison correction for the p-value column |
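To illustrate the kind of output described above, the following sketch runs a two-sample proportion test per category and applies a false discovery rate correction with base R. The counts are invented, and how the package actually builds the counts fed to each test is not stated here, so treat the inputs as assumptions.

```r
# Hypothetical counts per category: GO terms of interest (x) out of a total (n)
cats   <- c("AID", "GI", "IGS")
x_sp1  <- c(12, 5, 20); n_sp1 <- c(30, 30, 30)   # species 1
x_sp2  <- c(10, 15, 6); n_sp2 <- c(30, 30, 30)   # species 2

# Two-sample proportion test for each category
pvalue <- vapply(seq_along(cats), function(i) {
  prop.test(x = c(x_sp1[i], x_sp2[i]),
            n = c(n_sp1[i], n_sp2[i]))$p.value
}, numeric(1))

# Benjamini-Hochberg false discovery rate correction
res <- data.frame(CAT = cats,
                  pvalue = pvalue,
                  FDR = p.adjust(pvalue, method = "fdr"),
                  stringsAsFactors = FALSE)

# For two groups, the same p-value comes from Pearson's Chi-squared test
# on the 2 x 2 table of hits and misses
p_chi <- chisq.test(matrix(c(12, 18, 10, 20), nrow = 2))$p.value
```

With only two groups the two tests agree, so the `test` argument would matter mainly for how the counts are arranged rather than for the resulting p-value in a case like this.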
# Loading example datasets
data(H_sapiens)
data(A_thaliana)
# Defining the column with the GO terms to be compared
GOterm_field <- "Functional_Category"
# Defining the species names
species1 <- "H. sapiens"
species2 <- "A. thaliana"
# Running function
x <- evaluateCAT_species(df1 = H_sapiens,
                         df2 = A_thaliana,
                         species1 = species1,
                         species2 = species2,
                         GOterm_field = GOterm_field,
                         test = "prop")
print(x)
evaluateGO_species provides a simple function to compare results of functional enrichment analysis for two species through the use of proportion tests or Pearson's Chi-squared Tests and a False discovery rate correction
evaluateGO_species(df1, df2, species1, species2, GOterm_field, test = "prop")
df1 | A data frame with the results of a functional enrichment analysis for species 1, with an extra column "feature" containing the features to be compared |
df2 | A data frame with the results of a functional enrichment analysis for species 2, with an extra column "feature" containing the features to be compared |
species1 | A string with the name of species 1 (e.g., "H. sapiens") |
species2 | A string with the name of species 2 (e.g., "A. thaliana") |
GOterm_field | A string with the name of the column holding the GO terms (e.g., "Functional_Category") |
test | A string with the hypothesis test to be performed. Two options are provided, "prop" and "chi-squared" (default value="prop") |
This function will return a data.frame with the following fields:
GO | GO term analyzed |
pvalue | p-value obtained through the use of Pearson's Chi-squared Test |
FDR | Multiple comparison correction for the p-value column |
# Loading example datasets
data(H_sapiens)
data(A_thaliana)
# Defining the column with the GO terms to be compared
GOterm_field <- "Functional_Category"
# Defining the species names
species1 <- "H. sapiens"
species2 <- "A. thaliana"
# Running function
x <- evaluateGO_species(df1 = H_sapiens,
                        df2 = A_thaliana,
                        species1 = species1,
                        species2 = species2,
                        GOterm_field = GOterm_field,
                        test = "prop")
print(x)
graph_two_GOspecies is a function to create undirected graphs
graph_two_GOspecies is an analog of the graphGOspecies function and has the same options ("Categories" and "GO"). Nevertheless, the way in which the edge and node weights are calculated is slightly different. Since two species are compared, three graphs are possible: \(G_1\), \(G_2\), and \(G_3\). \(G_1\) and \(G_2\) represent each of the species analyzed, and \(G_3\) is a subgraph of \(G_1\) and \(G_2\) that contains the GO terms or categories co-occurring in both species.
Categories option: (Weight): The nodes \((V)\) represent groups of gene lists (categories), and the edges \((E)\) represent GO terms co-occurring between pairs of categories. The weight of a node measures how well GO terms are conserved between the two species across a series of categories, but it is biased towards categories.
\[\widehat{K}_w(u)=\sum_{v \in V_1} w(u,v) + \sum_{v \in V_2} w(u,v) \tag{5}\]
(shared weight): The nodes \((V)\) represent groups of gene lists (categories), and the edges \((E)\) represent GO terms co-occurring between pairs of categories that are shared between species. The node weight \(K_s\) is computed from the shared weight of the edges \(s\), where \(N_1\) and \(N_2\) are the sets of GO terms associated with the edge \(e = (u,v)\) for species 1 and 2, respectively. The shared weight of an edge is the ratio of the GO terms common to both sets to the GO terms found in either set (Equation 6), and the node shared weight \(K_s(u)\) is the sum of \(s\) over the edges in which \(u\) participates (Equation 7).
\[s(e) = \frac{\mid N_1 \cap N_2 \mid}{\mid N_1 \cup N_2 \mid} \tag{6}\]
\[K_s(u)=\sum_{v \in (V_1 \cup V_2)} s(u,v) \tag{7}\]
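As a small worked example of Equations 6 and 7, the sketch below computes the shared weight of one edge from two GO term sets and then sums shared edge weights per node. The helper `shared_edge_weight()`, the GO identifiers, and the edge table are hypothetical and serve only to illustrate the formulas.

```r
# Equation 6: shared GO terms over all GO terms on the edge (hypothetical helper)
shared_edge_weight <- function(n1, n2) {
  all_terms <- union(n1, n2)
  if (length(all_terms) == 0) return(0)  # guard against an empty edge
  length(intersect(n1, n2)) / length(all_terms)
}

# GO terms on the edge e = (u, v) in each species (invented identifiers)
N1 <- c("GO:1", "GO:2", "GO:3")
N2 <- c("GO:2", "GO:3", "GO:4")
s_e <- shared_edge_weight(N1, N2)  # 2 shared out of 4 in total

# Equation 7: node shared weight as the sum over incident edges
edges <- data.frame(SOURCE = c("AID", "AID", "GI"),
                    TARGET = c("GI", "IGS", "IGS"),
                    SHARED_WEIGHT = c(0.5, 0.25, 0),
                    stringsAsFactors = FALSE)
nodes <- unique(c(edges$SOURCE, edges$TARGET))
K_s <- vapply(nodes, function(u) {
  sum(edges$SHARED_WEIGHT[edges$SOURCE == u | edges$TARGET == u])
}, numeric(1))
```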
(combined weight): The node weight \(K_c(u)\) is a combination of the weight and the shared weight. The idea of this combined weight is to find the categories with the most frequently co-occurring GO terms, in order to observe functional similarities between two species while balancing the GO terms co-occurring among gene lists (categories) against those co-occurring between the two species. This node weight varies from -1 (categories with GO terms found in only one species and in few categories) to 1 (categories with GO terms shared widely between species and among other categories). The combined node weight \(K_c\) is defined as the sum of the min-max normalized weights \(\widehat{K}_w\) and \(K_s\) minus 1.
\[\operatorname{minmax}(y)=\frac{y-\min(y)}{\max(y)-\min(y)} \tag{8}\]
\[K_c(u)= \operatorname{minmax}(\widehat{K}_w(u)) + \operatorname{minmax}(K_s(u)) - 1 \tag{9}\]
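Equations 8 and 9 translate almost directly into R. The node weights below are invented for illustration, and the `minmax()` helper is a sketch written for this example rather than a function exported by the package.

```r
# Equation 8: min-max normalization to the [0, 1] range
# (undefined when all values are equal, since the denominator is 0)
minmax <- function(y) (y - min(y)) / (max(y) - min(y))

# Hypothetical node weights for three categories
K_w <- c(AID = 0.9, GI = 0.3, IGS = 0.6)    # weight (Equation 5)
K_s <- c(AID = 0.75, GI = 0.25, IGS = 0.5)  # shared weight (Equation 7)

# Equation 9: combined weight, ranging from -1 to 1
K_c <- minmax(K_w) + minmax(K_s) - 1
```

Here the category with the largest values on both scales gets a combined weight of 1, the one with the smallest gets -1, and the one halfway between gets 0.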
GO option: As before, three graphs are possible: \(G_1\), \(G_2\), and \(G_3\). \(G_1\) and \(G_2\) represent each of the species analyzed, and \(G_3\) is a subgraph of \(G_1\) and \(G_2\) that contains the GO terms or categories co-occurring in both species. In this case, nodes are GO terms and edges are the categories in which a pair of GO terms co-occurs. The node weight is similar to the GO weight calculated by the graphGOspecies function and is computed as in Equation 5.
\[\widehat{K}_w(u)=\sum_{v \in V_1} w(u,v) + \sum_{v \in V_2} w(u,v) \tag{5}\]
graph_two_GOspecies( x, species1, species2, GOterm_field, saveGraph = FALSE, option = "Categories", numCores = 2, outdir = NULL, filename = NULL )
x | A list obtained as output of the compareGOspecies function |
species1 | A string with the name of species 1 (e.g., "H. sapiens") |
species2 | A string with the name of species 2 (e.g., "A. thaliana") |
GOterm_field | A string with the name of the column holding the GO terms (e.g., "Functional_Category") |
saveGraph | logical, if |
option | (values: "Categories" or "GO"). This option allows creating either a graph where nodes are categories (features) and edges are the GO terms they share, with species membership as an edge attribute, or a graph where nodes are GO terms and edges are the categories in which they co-occur (default value="Categories") |
numCores | numeric, number of cores to use for the process (default value numCores=2). For the example below, only one core will be used |
outdir | The folder where the graph file will be saved (e.g., "D:"). This parameter only works when saveGraph=TRUE |
filename | The name of the graph file to be saved in the outdir specified by the user. This parameter only works when saveGraph=TRUE |
This function will return a list with two slots: edges and nodes.
(Categories):
Edges list columns:
Column | Description |
SOURCE and TARGET | The source and target categories (Nodes in the edge) |
GO_N | The number of GO terms between the categories |
WEIGHT | Edge weight |
GO | GO terms available for both nodes |
SP1 | Number of GO terms for the species 1 |
SP2 | Number of GO terms for the species 2 |
SHARED | Number of GO terms shared or co-occurring between the categories |
SHARED_WEIGHT | Shared weight for the edge |
Node list columns:
Column | Description |
CAT | Category name |
CAT_WEIGHT | Node weight |
SHARED_WEIGHT | Shared weight for the node |
COMBINED_WEIGHT | Combined weight for the node |
(GO):
Edges list columns:
Column | Description |
SOURCE and TARGET | The source and target GO terms (Nodes in the edge) |
FEATURE | The number of Categories where both GO Terms were found |
SP | Species where the GO terms were found (Species 1, Species 2 or Shared) |
WEIGHT | Edge weight |
Node list columns:
Column | Description |
GO | GO term node name |
GO_WEIGHT | Node weight |
GOterm_field <- "Functional_Category"
data(comparison_ex_compress_CH)
# Defining the species names
species1 <- "H. sapiens"
species2 <- "A. thaliana"
# Running function
x_graph <- graph_two_GOspecies(x = comparison_ex_compress_CH,
                               species1 = species1,
                               species2 = species2,
                               GOterm_field = GOterm_field,
                               numCores = 1,
                               saveGraph = FALSE,
                               option = "Categories",
                               outdir = NULL,
                               filename = NULL)
graphGOspecies is a function to create undirected graphs using two options:
Categories option:
The nodes \((V)\) represent groups of gene lists (categories), and the edges \((E)\) represent GO terms co-occurring between pairs of categories. More specifically, two categories \(u,v \in V\) are connected by an edge \(e=(u,v)\). The edge weight \(w(e)\) is defined as the ratio of the number of GO terms (e.g., biological processes) co-occurring between the two categories, \(BP_{u} \cap BP_{v}\), to the total number of GO terms available (Equation 1). A node weight \(K_{w}(u)\) is defined as the sum of the weights of the edges in which the node \(u\) participates (Equation 2). Thus, the node weight represents how frequently GO terms are reported and expressed in a biological phenomenon.
\[w(e) = \frac{\mid BP_{u} \cap BP_{v} \mid}{\mid BP \mid} \tag{1}\]
\[K_{w}(u) = \sum_{v \in V} w(u,v) \tag{2}\]
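Equations 1 and 2 can be traced on a toy data frame. The sketch below is not the package's code: the data are invented, \(\mid BP \mid\) is assumed to be the number of distinct GO terms in the data frame, and edges with zero weight are kept here only to make the arithmetic visible.

```r
# Toy enrichment results for a single species (hypothetical data)
df <- data.frame(
  feature = c("AID", "AID", "GI", "GI", "IGS"),
  Functional_Category = c("GO:1", "GO:2", "GO:2", "GO:3", "GO:3"),
  stringsAsFactors = FALSE
)

# GO terms per category, and the total number of distinct GO terms (|BP|)
go_by_cat <- split(df$Functional_Category, df$feature)
n_go <- length(unique(df$Functional_Category))

# One candidate edge per pair of categories
pairs <- t(combn(names(go_by_cat), 2))
edges <- data.frame(SOURCE = pairs[, 1], TARGET = pairs[, 2],
                    stringsAsFactors = FALSE)

# Equation 1: GO terms shared by both categories over |BP|
edges$WEIGHT <- mapply(function(u, v) {
  length(intersect(go_by_cat[[u]], go_by_cat[[v]])) / n_go
}, edges$SOURCE, edges$TARGET, USE.NAMES = FALSE)

# Equation 2: node weight as the sum of the weights of incident edges
K_w <- vapply(names(go_by_cat), function(u) {
  sum(edges$WEIGHT[edges$SOURCE == u | edges$TARGET == u])
}, numeric(1))
```

In this toy case GI shares one GO term with each of the other two categories, so it receives the highest node weight.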
GO option:
The nodes \(V'\) represent GO terms and the edges \(E'\) represent categories where a pair of GO terms co-occur. More specifically, two GO terms are connected by an edge \(e'=(u',v')\). The edge weight \(w'(e')\) corresponds to the number of categories in which the GO terms \(u'\) and \(v'\) co-occur, compared with the total number of GO terms (Equation 3), where \(C_{u'}\) and \(C_{v'}\) are the sets of categories in which \(u'\) and \(v'\) are found. A node weight \(K'_w(u')\) is defined as the sum of the weights of the edges in which \(u'\) participates (Equation 4); in this case the weight represents the importance of a GO term (more frequently co-occurring). Please be patient: this option requires a long time to finish.
\[w'(e')=\frac{\mid C_{u'} \cap C_{v'} \mid}{\mid BP \mid} \tag{3}\]
\[K'_w(u')=\sum_{v' \in V'} w'(u',v') \tag{4}\]
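For the GO option, the co-occurrence counts in Equation 3 can be obtained with a matrix cross-product. This is a sketch on invented data, again assuming that \(\mid BP \mid\) is the number of distinct GO terms; the package may compute the weights differently.

```r
# Toy enrichment results for a single species (hypothetical data)
df <- data.frame(
  feature = c("AID", "AID", "GI", "GI", "IGS", "IGS"),
  Functional_Category = c("GO:1", "GO:2", "GO:2", "GO:3", "GO:2", "GO:3"),
  stringsAsFactors = FALSE
)

# Presence-absence matrix: rows are categories, columns are GO terms
tab <- table(df$feature, df$Functional_Category)
pa <- matrix(as.numeric(tab > 0), nrow = nrow(tab),
             dimnames = list(rownames(tab), colnames(tab)))

# GO x GO matrix: number of categories in which both GO terms occur
co <- crossprod(pa)
diag(co) <- 0  # a GO term is not connected to itself

# Equation 3: co-occurrence counts over |BP|
w <- co / ncol(pa)

# Equation 4: node weight as the sum of the weights of incident edges
K_go <- rowSums(w)
```

The number of GO term pairs grows quadratically with the number of GO terms, which is consistent with the note above that this option takes a long time on real data.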
graphGOspecies( df, GOterm_field, option = "Categories", numCores = 2, saveGraph = FALSE, outdir = NULL, filename = NULL )
df | A data frame with the results of a functional enrichment analysis for a species, with an extra column "feature" containing the features to be compared |
GOterm_field | A string with the name of the column holding the GO terms (e.g., "Functional_Category") |
option | (values: "GO" or "Categories"). This option allows creating either a graph where nodes are GO terms and edges are features or, alternatively, a graph where nodes are features and edges are GO terms (default value="Categories") |
numCores | numeric, number of cores to use for the process (default value numCores=2). For the example below, only one core will be used |
saveGraph | logical, if |
outdir | The folder where the graph file will be saved (e.g., "D:"). This parameter only works when saveGraph=TRUE |
filename | The name of the graph file to be saved in the outdir specified by the user. This parameter only works when saveGraph=TRUE |
This function will return a list with two slots: edges and nodes.
(Categories):
Edges list columns:
Column | Description |
SOURCE and TARGET | The source and target categories (Nodes in the edge) |
FEATURES_N | The number of GO terms between the categories |
WEIGHT | Edge weight |
FEATURES | GO terms available for both nodes |
Node list columns:
Column | Description |
feature | Category name |
GO_count | GO terms counts for the node |
WEIGHT | Node weight |
(GO):
Edges list columns:
Column | Description |
SOURCE and TARGET | The source and target GO terms (Nodes in the edge) |
FEATURE | The number of Categories where both GO Terms were found |
WEIGHT | Edge weight |
Node list columns:
Column | Description |
GO | GO term node name |
GO_WEIGHT | Node weight |
# Loading example datasets
data(H_sapiens_compress)
GOterm_field <- "Functional_Category"
# Running function
x <- graphGOspecies(df = H_sapiens_compress,
                    GOterm_field = GOterm_field,
                    option = "Categories",
                    numCores = 1,
                    saveGraph = FALSE,
                    outdir = NULL,
                    filename = NULL)
This dataset is a subset of the original dataset obtained for Clavijo-Buriticá (In preparation)
H_sapiens
A data frame with 5000 rows and 6 variables:
Numeric: False discovery rate values for the GO term
numeric: Number of genes in the list of genes for a given GO term
numeric: Number of genes in the genome of a species for a given GO term
character: GO term name or GO term id
character: Genes found for a given GO term
character: The comparison group (feature) to which the GO term belongs
https://data.mendeley.com/datasets/myyy2wxd59/1
Clavijo-Buriticá, Sosa, C.C., Mosquera, A.J. Álvarez, A., Medina, J. Quimbaya, M.A. A systematic comparison of the molecular machinery associated with Cancer-Hallmarks between plants and humans reveals Arabidopsis thaliana as a useful model to understand specific carcinogenic events (to be submitted, Target journal: Plos Biology)
This dataset is a subset of the original dataset obtained for Clavijo-Buriticá (In preparation)
H_sapiens_compress
A data frame with 120 rows and 6 variables (30 GO terms per cancer hallmark):
Numeric: False discovery rate values for the GO term
numeric: Number of genes in the list of genes for a given GO term
numeric: Number of genes in the genome of a species for a given GO term
character: GO term name or GO term id
character: Genes found for a given GO term
character: The comparison group (feature) to which the GO term belongs
https://data.mendeley.com/datasets/myyy2wxd59/1
Clavijo-Buriticá, Sosa, C.C., Mosquera, A.J. Álvarez, A., Medina, J. Quimbaya, M.A. A systematic comparison of the molecular machinery associated with Cancer-Hallmarks between plants and humans reveals Arabidopsis thaliana as a useful model to understand specific carcinogenic events (to be submitted, Target journal: Plos Biology)
Provides an easy way to get the frequency of GO terms such as biological processes for a data frame and a series of features
mostFrequentGOs(df, GOterm_field)
df | A data frame with the results of a functional enrichment analysis for a species, with an extra column "feature" containing the features to be compared |
GOterm_field | A string with the name of the column holding the GO terms (e.g., "Functional_Category") |
This function will return a table with the frequency of GO terms per feature
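A comparable frequency table can be built with base R, which may help when checking the function's output by hand. The data frame below is invented, and the exact layout returned by mostFrequentGOs may differ from this sketch.

```r
# Toy enrichment results (hypothetical data)
df <- data.frame(
  feature = c("AID", "AID", "GI", "GI", "IGS"),
  Functional_Category = c("GO:1", "GO:2", "GO:2", "GO:3", "GO:2"),
  stringsAsFactors = FALSE
)

# Cross-tabulate GO terms (rows) against features (columns)
freq <- table(df$Functional_Category, df$feature)

# Number of features in which each GO term appears, most frequent first
n_features <- sort(rowSums(freq > 0), decreasing = TRUE)
```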
# Loading example datasets
data(H_sapiens)
# Defining the column with the GO terms to be compared
GOterm_field <- "Functional_Category"
# Running function
x <- mostFrequentGOs(df = H_sapiens, GOterm_field = GOterm_field)
# Displaying results
head(x)