[protocol]GO enrichment analysis
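Background:
What is enrichment analysis? You can look it up yourself; so far I have not found an introduction that is really easy to follow. It is enough to think of it as a statistical method for checking whether something (here, a GO term in a gene list) is significantly over-represented.
There are many kinds of enrichment analysis. The most common is GO enrichment analysis; there is also pathway enrichment analysis. [I don't know how to do the pathway one yet ::>_<:: ]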
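To make "a statistical method for checking significance" concrete: the simplest form of GO enrichment asks, for each term, whether the study set contains more genes annotated to that term than you would expect when drawing the same number of genes at random from the whole population. The sketch below computes that one-sided hypergeometric p-value with the standard library only. The function name and all numbers are illustrative assumptions, and this is the plain term-for-term test, not the Parent-Child-Union calculation selected in the Ontologizer command later in this protocol.

```python
from math import comb  # requires Python 3.8 or newer


def enrichment_p_value(pop_total, pop_term, study_total, study_term):
    """One-sided hypergeometric p-value: P(X >= study_term).

    pop_total   -- number of genes in the population
    pop_term    -- population genes annotated to the GO term
    study_total -- number of genes in the study set
    study_term  -- study-set genes annotated to the GO term
    """
    denominator = comb(pop_total, study_total)
    upper = min(pop_term, study_total)
    tail = sum(
        comb(pop_term, k) * comb(pop_total - pop_term, study_total - k)
        for k in range(study_term, upper + 1)
    )
    return tail / denominator


# Hypothetical numbers: 20000 genes in total, 200 annotated to the term,
# 100 genes in the study set, 8 of which carry the term (about 1 expected).
p = enrichment_p_value(20000, 200, 100, 8)
print(p)
```

A small p-value means the term is seen in the study set more often than chance alone would explain. Because one test is run per GO term, the raw values still need a multiple-testing correction, which is what the -m option in the Ontologizer command is for.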
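Tools:
There are many of these as well; here I mainly use Ontologizer (link: http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html)
Required files:
1. gene_ontology_edit.obo (this is essentially the GO library file; link: http://www.geneontology.org/GO.downloads.ontology.shtml)
2. whole_gene_list (the list of all genes in the species under study)
The following command extracts the gene list from a CDS file: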
grep "^>" sequence.fasta | tr -d ">"> sequenceHeader.list
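3. sample_1_gene_list (the gene list for one of the sample points)
4. The last one is also the most important: you need an association file. It is generated with the script below.
4.1 This needs two files and one script.
File 1: the GO_ids file, with one gene per line: the gene id, then a comma-separated list of GO ids. The two columns must be separated by a tab in the actual file, because the script splits each line on tabs.
gene136870 GO:0006950,GO:0005737
gene143400 GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020
File 2: GOIDTermIndexFile, which can be downloaded from: http://www.geneontology.org/GO.downloads.files.shtml
Script 1: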
import sys

# inFile is your custom annotation file (gene id <TAB> comma-separated GO ids).
# goFile is the GO ID to GO category index downloaded from the GO website.
inFile = open(sys.argv[1], 'r')
goFile = open(sys.argv[2], 'r')

# Parse the GOID/terms index and store it in the dictionary goHash
# (GO id -> aspect).
goHash = {}
for line in goFile:
    # Skip lines that start with '!' as they are comment headers.
    if line[0] != "!":
        data = line.strip().split('\t')
        # Skip obsolete terms.
        if data[-1] != 'obs':
            for info in data:
                if info[0:3] == "GO:":
                    # Map the term to its aspect (last column).
                    goHash[data[0]] = data[-1]

# Columns that the GAF format wants.
# Since Ontologizer doesn't care about these, we can just make them up.
DB = 'yourOrganism'
DBRef = 'PMID:0000000'
DBEvi = 'ISO'
DBObjectType = 'gene'
DBTaxon = 'taxon:79327'
DBDate = '23022011'
DBAssignedBy = 'PFAM'

# GO ids from your annotation that are missing from the index
# (potentially obsolete).
potentialObs = []

# With exactly two arguments the script prints the .gaf file.
if len(sys.argv) == 3:
    print('!gaf-version: 2.0')

# Loop through the GO annotation file and generate association lines.
for line in inFile:
    data = line.strip().split('\t')
    # Only genes that have GO annotations.
    if len(data) > 1:
        # gid is the gene id, goIDs is the list of associated GO ids.
        gid = data[0]
        goIDs = data[1].split(',')
        # Second column of the .gaf file, which Ontologizer doesn't care about.
        DBID = "db." + gid
        # Third column is the gene name, which Ontologizer does use.
        DBObjSym = gid
        # For each GO id on the line, print one association line.
        for goID in goIDs:
            if goID in goHash:
                DBAspect = goHash[goID]
                DBObjName = 'myOrganism' + DBID  # computed but not written out
                outArray = [DB, DBID, DBObjSym, '0', goID, DBRef, DBEvi, '0', DBAspect, '0', '0', DBObjectType, DBTaxon, DBDate, DBAssignedBy]
                # Only print the .gaf line when no third argument was given.
                if len(sys.argv) == 3:
                    print('\t'.join(outArray))
            else:
                potentialObs.append(goID)

# With a third argument, print the potentially obsolete GO ids instead.
if len(sys.argv) == 4:
    print('\n'.join(set(potentialObs)))
Run it:
- python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf
With all four files in place, you can run Ontologizer. Reference: http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html
java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology_edit.obo -s moso_target_gene.txt -p sequenceHeader.list -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000
5. Ontologizer can also draw graphs, but rendering them requires extra software (Graphviz). Install and use it as follows (as root):
- yum list available 'graphviz*'
- yum install 'graphviz*'
- dot -Tpng view-4hourSMinduced-Parent-Child-Westfall-Young-Single-Step.dot -oExample.png
Most gene enrichment websites only let you find enrichments for popular model organisms, using pre-established gene ontology annotations. I ran into this problem early in my PhD, when I had to generate enrichment data for Schmidtea mediterranea.
Software and inputs
To get custom enrichments, I used Ontologizer. The inputs for Ontologizer are:
- Gene Ontology OBO file
A flat file containing information on all current GO terms and their relationships to other terms. You can get it from the GO downloads page linked earlier in this post.
- List of all gene ids
If you have a .fasta file of all your genes, you can use this command to extract the gene id information:
- grep "^>" sequence.fasta | tr -d ">" > sequenceHeader.list
- List of gene ids you want to find enrichment for
The subset of all your genes that you want the enrichment for, e.g. differentially expressed genes. Ontologizer can take in an entire folder of gene id lists if you set a directory as input.
- Association file for your organism
This is a GO Annotation File (.gaf); the format is specified by the Gene Ontology Consortium. Fortunately, Ontologizer doesn't really care about most of the database information columns in the .gaf file, so you can fake most of them. To generate this file, you first have to produce GO annotations for your genes with software like InterproScan or Blast2Go. You also need a tab-delimited index of GO IDs to GO categories, which is available from the GO downloads page linked earlier in this post.
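One thing to watch with the grep command in the list above: it keeps the whole header line, including any description that follows the id, while the ids in the gene lists have to match the ids used in the annotation file. If your FASTA headers carry descriptions, a variant along these lines keeps only the first whitespace-delimited token. This is a minimal sketch; the function name `fasta_ids` and the example records are made up for illustration.

```python
def fasta_ids(lines):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # ignore a bare ">" with nothing after it
                yield fields[0]


# Made-up records; with a real file, pass the open file handle instead,
# e.g. open("sequence.fasta").
example = [
    ">gene136870 hypothetical protein\n",
    "ATGGCC\n",
    ">gene144530\n",
    "ATGAAA\n",
]
print("\n".join(fasta_ids(example)))
```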
Generating a gene association file
Assuming your GO annotations are formatted in two columns where first column is the gene id and second column is a comma separated list of your GO ids:
- gene136870 GO:0006950,GO:0005737
gene143400 GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020
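The generator script does not complain about badly formatted input: a line whose columns are not separated by a tab is skipped, and a GO id it cannot find in the index (for example one that lost its "GO:" prefix) produces no association line. A quick check before generating the .gaf file can therefore save some confusion. The sketch below is one way to do it; the function name `check_annotation_line` is hypothetical, and the only format rule assumed is that a GO id is "GO:" followed by seven digits.

```python
import re

GO_ID = re.compile(r"^GO:\d{7}$")


def check_annotation_line(line):
    """Return (gene_id, valid_ids, malformed_ids) for one annotation line."""
    fields = line.rstrip("\n").split("\t")
    gene_id = fields[0]
    # No second tab-separated column means no GO ids could be read.
    go_ids = fields[1].split(",") if len(fields) > 1 and fields[1] else []
    valid = [g for g in go_ids if GO_ID.match(g)]
    malformed = [g for g in go_ids if not GO_ID.match(g)]
    return gene_id, valid, malformed


# The first GO id is missing its "GO:" prefix and is reported as malformed.
gene, valid, malformed = check_annotation_line("gene143400\t0006468,GO:0005737\n")
print(gene, valid, malformed)
```

If a whole line comes back as the gene id with two empty lists, the columns are probably separated by spaces rather than a tab.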
The commented Python script that generates the association file is the one already listed earlier in this post.
Putting it all together
Make sure you have your custom GO annotation file in the format shown above, and the GO ID to term aspect index file (the text file version of 'Terms, IDs, secondary IDs, obsoletes' on the GO downloads page linked earlier in this post). Save the script and run it with:
- python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf
If you want to check whether any of your annotations are potentially deprecated or obsolete, use this command instead (any extra third argument, here y, switches the script to this mode):
- python myScript.py GO.customAnnotationFile GOIDTermIndexFile y > potentialDeprecated.goids
Now that you have the Gene Association File, you can run Ontologizer with:
- java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology.1_2.obo -o out -p myOrganism.ids -s in
The parameters used are:
- -a: the gaf file
- -g: the obo file
- -o: the output directory
- -p: the list of all gene ids
- -s: the input directory/file of gene subsets
The enrichments, with all relevant statistics, are written to one tab-delimited file per input gene list.
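Since the result is a plain tab-delimited table, it is easy to filter afterwards. The sketch below keeps the rows whose adjusted p-value is below a threshold. The column names "ID", "p.adjusted" and "name" are assumptions made for this example, so check the header line of your own output file and pass the right column name; the demo table is made up and is not real Ontologizer output.

```python
import csv
import io


def significant_rows(handle, column="p.adjusted", threshold=0.05):
    """Return rows of a tab-delimited table whose `column` value is below `threshold`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[column]) < threshold]


# Made-up table used only to show the call.
demo = io.StringIO(
    "ID\tp.adjusted\tname\n"
    "GO:0006950\t0.003\tresponse to stress\n"
    "GO:0016020\t0.410\tmembrane\n"
)
for row in significant_rows(demo):
    print(row["ID"], row["p.adjusted"], row["name"])
```

With a real result file, open it and pass the file handle in place of `demo`.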