orngGsea: Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) is a method for identifying groups of genes that are regulated together. It is implemented in the orngGsea module, which is part of the Orange for Functional Genomics package; install that package to use orngGsea.

GSEA

GSEA takes gene expression data for multiple samples, together with their phenotypes, and computes enrichment for the given gene sets. To use it, call runGSEA with the following arguments:

Arguments

data
An ExampleTable with gene expression data. Each example should correspond to a sample with its phenotype (class value). Attributes represent individual genes, and their names should be meaningful gene aliases.
classValues
A pair of class values that define the two distinct phenotypes on which gene correlations are computed. Only examples with one of the chosen class values are included in the analysis. If not specified, the first two class values in the classVar attribute descriptor are used.
organism
Organism code as used in KEGG. Needed for matching gene names in data to those in gene sets. Some examples: hsa for human, mmu for mouse. Default: hsa.
geneSets
A Python dictionary of gene sets: each key is a gene set name and its value is a list of gene aliases for the genes in that set. Default: gene sets from MSigDB.
n
GSEA computes gene set significance by permutation tests. This parameter specifies the number of permutations. Default: 100.
permutation
Type of permutation. If "class", class values (phenotypes) are permuted. This is the default. However, if the number of samples is small (fewer than 10), it is advisable to use "gene" permutations, even though they ignore gene-gene interactions.
minSize, maxSize
Minimum and maximum number of genes in a gene set that must also be present in the data set for the gene set to be analysed. Defaults: 3 and 1000.
minPart
Minimum fraction of genes in a gene set that must also be present in the data set for the gene set to be analysed. Default: 0.1.
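
The interplay of minSize, maxSize and minPart can be sketched in plain Python, without Orange installed. This is only an illustration of the thresholds as described above, not the code used inside orngGsea: the function name passes_size_filter is hypothetical, the bounds are assumed to be inclusive, and genes are matched by exact name only (runGSEA additionally matches aliases through KEGG).

```python
def passes_size_filter(gene_set, data_genes,
                       min_size=3, max_size=1000, min_part=0.1):
    """Return True if a gene set would be kept under the documented thresholds.

    Hypothetical helper: inclusive bounds and exact-name matching are
    assumptions made for this sketch.
    """
    if not gene_set:
        return False
    # Genes of the gene set that are also present in the data set.
    matched = [g for g in gene_set if g in data_genes]
    n_matched = len(matched)
    # Fraction of the whole gene set that was found in the data set.
    part = float(n_matched) / len(gene_set)
    return min_size <= n_matched <= max_size and part >= min_part


# Attribute names of the iris data set play the role of gene aliases.
data_genes = set(["sepal length", "sepal width",
                  "petal length", "petal width"])
petal = ["petal length", "petal width", "petal color"]

kept_default = passes_size_filter(petal, data_genes)
kept_relaxed = passes_size_filter(petal, data_genes, min_size=2)
print("default: %s, minSize=2: %s" % (kept_default, kept_relaxed))
```

Under these assumptions the script prints "default: False, minSize=2: True": only two of the three "petal" genes are present in the data, so the set is dropped at the default minSize of 3 and kept once minSize is lowered to 2.
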
The runGSEA function returns a dictionary in which each key is a gene set label and each value is a list of:
  • enrichment score,
  • normalised enrichment score,
  • P-value,
  • FDR,
  • whole gene set size,
  • matched genes from the gene set,
  • gene aliases for matched genes from the gene set.
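
Because each result is a plain list, the fields are addressed by position. The sketch below wraps such lists in a namedtuple so they can be read by name, then keeps the gene sets below an FDR cut-off. The field names, the helper significant_sets and the cut-off value are our own choices for illustration; the sixth field is assumed to be a count of matched genes. The numbers in toy_res are invented placeholders that only show the shape of the dictionary, not output of runGSEA.

```python
from collections import namedtuple

# Field order follows the list above; the names are labels chosen for this sketch.
GseaResult = namedtuple(
    "GseaResult", "es nes p_value fdr size matched genes")


def significant_sets(res, max_fdr=0.25):
    """Return (label, GseaResult) pairs with FDR <= max_fdr,
    ordered by decreasing absolute normalised enrichment score."""
    named = [(label, GseaResult(*values)) for label, values in res.items()]
    kept = [(label, r) for label, r in named if r.fdr <= max_fdr]
    return sorted(kept, key=lambda item: abs(item[1].nes), reverse=True)


# Placeholder values, invented purely to illustrate the result layout.
toy_res = {
    "setA": [0.62, 1.85, 0.01, 0.05, 40, 35, ["g1", "g2"]],
    "setB": [-0.30, -0.90, 0.60, 0.80, 25, 20, ["g3"]],
    "setC": [-0.71, -2.10, 0.00, 0.02, 30, 28, ["g4", "g5"]],
}

for label, r in significant_sets(toy_res):
    print("%s NES=%.2f FDR=%.2f" % (label, r.nes, r.fdr))
```

With the placeholder values, setB is filtered out and setC is listed before setA. The same helper could be applied to the dictionary returned by runGSEA, provided the field order matches the list above.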

A note on gene name matching. Gene names are matched with the help of the KEGG database. Each gene in a gene set is compared against the genes in the data set. If an alias for a gene in the gene set is identical to an alias for a gene in the data set, the two are matched. Otherwise, both aliases are looked up in the KEGG database for the given organism: if they are aliases of the same gene, they are matched.
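
The two-step matching rule can be sketched as follows. This is not the orngGsea implementation: the function match_gene is hypothetical, and the KEGG lookup is replaced by a small hand-written list of alias groups, each group holding aliases of one gene.

```python
def match_gene(set_alias, data_aliases, alias_groups):
    """Return the data-set alias that matches set_alias, or None.

    alias_groups is a stand-in for the KEGG lookup: a list of sets,
    each containing the known aliases of a single gene.
    """
    # Step 1: the same alias appears in the data set.
    if set_alias in data_aliases:
        return set_alias
    # Step 2: some data-set alias names the same gene as set_alias.
    for group in alias_groups:
        if set_alias in group:
            for alias in data_aliases:
                if alias in group:
                    return alias
    return None


# Hand-written alias groups used only for this example (not read from KEGG).
alias_groups = [set(["TP53", "P53", "LFS1"]),
                set(["ERBB2", "HER2", "NEU"])]
data_aliases = ["P53", "ERBB2"]

direct = match_gene("ERBB2", data_aliases, alias_groups)
via_alias = match_gene("TP53", data_aliases, alias_groups)
missing = match_gene("BRCA1", data_aliases, alias_groups)
print("%s %s %s" % (direct, via_alias, missing))
```

Here "ERBB2" matches directly, "TP53" matches the data-set alias "P53" because both belong to the same alias group, and "BRCA1" has no match, so the script prints "ERBB2 P53 None".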

Example

We present a simple usage example. The data used here are not gene expression data, so for the method to work we had to specify our own sets of attributes that seem to "belong together".

gsea1.py (uses iris.tab)

import orange, orngGsea
data = orange.ExampleTable("iris")
gen1 = dict([
    ("sepal", ["sepal length", "sepal width"]),
    ("petal", ["petal length", "petal width", "petal color"])
])
res = orngGsea.runGSEA(data, minSize=2, geneSets=gen1)
print "%5s %6s %6s %s" % ("LABEL", "NES", "P-VAL", "GENES")
for name,resu in res.items():
    print "%5s %6.3f %6.3f %s" % (name, resu[1], resu[2], str(resu[6]))

Corresponding output:

LABEL    NES  P-VAL GENES
petal -1.125  0.732 ['petal length', 'petal width']
sepal  1.080  0.623 ['sepal length', 'sepal width']

We can see that a "gene" labelled "petal color" was not used, because it couldn't be matched to any attribute in the data set.


References

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 2005.