GO Term Annotation of Protein Sequences, and Problems Encountered

#=============================== Version 1 ===============================================
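Three ways to use InterProScan
InterProScan predicts protein function by searching sequences against databases of protein domains and functional sites. It is developed by EBI on top of InterPro, a non-redundant database that integrates protein families, domains, and functional sites. InterProScan combines several of the most widely used member databases and is used to give proteins of unknown function InterPro and GO annotations.
The original guide describes 3 ways to run InterPro annotation; the third, a local installation, is covered here: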

3. Annotation with a local InterProScan installation
3.1 Installing and configuring a local InterProScan

3.1.1 Download the following 5 files from ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan:

RELEASE/latest/iprscan_v4.8.tar.gz

BIN/4.x/iprscan_bin4.x_[PLATFORM].tar.gz

DATA/iprscan_DATA_[LATESTDATAVERSION].tar.gz

DATA/iprscan_PTHR_DATA_[LATESTDATAVERSION].tar.gz

DATA/iprscan_MATCH_DATA_[LATESTDATAVERSION].tar.gz

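3.1.2 Extract the 5 files into a single directory, then run the Config.pl script found there to configure InterProScan.
3.1.3 If you choose to set up the local web interface during configuration, edit the configuration file of your local WWW server so that the web version can run locally.
3.2 Using the local InterProScan
3.2.1 Running iprscan from the command line: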

$bin/iprscan -cli -iprlookup -goterms -format xml -i test.fasta -o test.out

# help

http://www.chenlianfu.com/?tag=iprscan

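iprscan needs two Perl modules, XML::Parser and XML::Parser::Expat. XML::Parser::Expat has to build first, and XML::Parser follows it. Because XML::Parser::Expat is a C-level binding to the Expat library, the Expat development package must be installed on the system beforehand. Otherwise the build stops with: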

Expat must be installed prior to building XML::Parser and I can't find it in the standard library directories. Install 'expat-devel' (or 'libexpat1-dev') package

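Tip (needs root or sudo): run yum install expat-devel, or apt-get install libexpat1-dev on Debian/Ubuntu. The exact package name depends on your distribution and version.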
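
The install order above can be sketched as follows. This is a minimal sketch, not part of the original guide: the helper name expat_dev_package is hypothetical, the two package names are the ones quoted in the error message, and the final cpan step assumes the stock cpan client is available on the machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a package manager to the Expat development
# package named in the XML::Parser build error.
expat_dev_package() {
    case "$1" in
        yum)     echo "expat-devel" ;;
        apt-get) echo "libexpat1-dev" ;;
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Usage (needs root or sudo); uncomment the line that matches your system:
# sudo yum install -y "$(expat_dev_package yum)"
# sudo apt-get install -y "$(expat_dev_package apt-get)"
# cpan XML::Parser   # assumption: the standard cpan client is installed
```

Installing the system package first matters because the Perl module compiles against the Expat headers; installing XML::Parser first simply reproduces the error shown above.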

#============================================== Version 2 =============================================

Source: https://github.com/ebi-pf-team/interproscan/wiki

Step 1: Environment setup

Software requirements:

64-bit Linux
Perl (default on most Linux distributions)
Python 2.7.x only
Oracle's Java JDK/JRE version 8 (required by InterProScan 5.17-56.0 onwards). Earlier InterProScan release versions required Java 6 (version 6u4 and above) or Java 7.
Environment variables set:
- $JAVA_HOME should point to the location of the JVM
- $JAVA_HOME/bin should be added to the $PATH
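
The Java requirement can be checked before running anything. The sketch below is an illustration under stated assumptions: the helper name java_major_version is hypothetical, and the JAVA_HOME path in the usage comments is only an example location that will differ per machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a Java version string into its major version.
# "1.8.0_151" -> 8 (old 1.x numbering), "11.0.2" -> 11 (newer numbering).
java_major_version() {
    local v="$1"
    v="${v#1.}"         # drop the leading "1." used by Java 8 and earlier
    echo "${v%%[._]*}"  # keep everything before the first "." or "_"
}

# Usage; /usr/lib/jvm/java-8 is an assumed path, adjust it to your system:
# export JAVA_HOME=/usr/lib/jvm/java-8
# export PATH="$JAVA_HOME/bin:$PATH"
# ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
# [ "$(java_major_version "$ver")" = "8" ] || echo "warning: this release expects Java 8" >&2
```

Putting the two export lines in a shell startup file such as ~/.bashrc keeps the setting across sessions.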

Step 2: Download

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz.md5

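md5sum -c interproscan-5.27-66.0-64-bit.tar.gz.md5   # verify integrity before extracting; the .tar.gz and .tar.gz.md5 files must be in the same directory

tar -pxvzf interproscan-5.27-66.0-64-bit.tar.gz   # -p preserves file permissions; -v only prints each file as it is extracted and can be dropped

(The extracted directory contains a data directory. The Panther data, which is downloaded separately, should be extracted into it, since that is the default path in the configuration file. If you put the data somewhere else, update the configuration accordingly.)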

Step 3: Test run

./interproscan.sh -i test_proteins.fasta -f tsv

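./interproscan.sh -i test_proteins.fasta -cpu 4 -f GFF3 -goterms -iprlookup -t p -T 20171127tmp

# Options: -i input file; -cpu number of cores (4 is only an example value); -f output format; -goterms and -iprlookup add GO and InterPro annotation; -t sequence type (p = protein); -T temporary directory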
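
Once a run with -goterms has produced TSV output, the GO terms can be pulled out per protein. The sketch below assumes the InterProScan 5 TSV layout in which column 1 is the protein ID and column 14 holds GO terms separated by "|". The helper name extract_go_terms and the output file name protein2go.tsv are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read InterProScan TSV on stdin and print one
# "protein<TAB>GO term" pair per line, without duplicates.
# Assumes column 1 is the protein ID and column 14 holds "|"-separated GO terms.
extract_go_terms() {
    awk -F'\t' '
        NF >= 14 && $14 != "" && $14 != "-" {
            n = split($14, terms, "|")
            for (i = 1; i <= n; i++) {
                key = $1 "\t" terms[i]
                if (!(key in seen)) { seen[key] = 1; print key }
            }
        }'
}

# Usage, assuming a TSV run with -goterms wrote test_proteins.fasta.tsv:
# extract_go_terms < test_proteins.fasta.tsv > protein2go.tsv
```

Rows without a 14th column (matches that carry no GO annotation) are skipped, and the same GO term reported by several member databases for one protein is printed only once.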

Tip:

TSV stands for tab-separated values.
CSV stands for comma-separated values.

#============================= Full option list ========================================

27/11/2017 14:41:35:049 Welcome to InterProScan-5.27-66.0

usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts

            -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar

            interproscan-5.jar

Please give us your feedback by sending an email to

interhelp@ebi.ac.uk

 -appl,--applications <ANALYSES>            Optional, comma separated list

                                            of analyses.  If this option

                                            is not set, ALL analyses will

                                            be run.

 -b,--output-file-base <OUTPUT-FILE-BASE>   Optional, base output filename

                                            (relative or absolute path).

                                            Note that this option, the

                                            --output-dir (-d) option and

                                            the --outfile (-o) option are

                                            mutually exclusive.  The

                                            appropriate file extension for

                                            the output format(s) will be

                                            appended automatically. By

                                            default the input file

                                            path/name will be used.

 -cpu,--cpu <CPU>                           Optional, number of cores for

                                            inteproscan.

 -d,--output-dir <OUTPUT-DIR>               Optional, output directory.

                                            Note that this option, the

                                            --outfile (-o) option and the

                                            --output-file-base (-b) option

                                            are mutually exclusive. The

                                            output filename(s) are the

                                            same as the input filename,

                                            with the appropriate file

                                            extension(s) for the output

                                            format(s) appended

                                            automatically .

 -dp,--disable-precalc                      Optional.  Disables use of the

                                            precalculated match lookup

                                            service.  All match

                                            calculations will be run

                                            locally.

 -dra,--disable-residue-annot               Optional, excludes sites from

                                            the XML, JSON output

 -f,--formats <OUTPUT-FORMATS>              Optional, case-insensitive,

                                            comma separated list of output

                                            formats. Supported formats are

                                            TSV, XML, JSON, GFF3, HTML and

                                            SVG. Default for protein

                                            sequences are TSV, XML and

                                            GFF3, or for nucleotide

                                            sequences GFF3 and XML.

 -goterms,--goterms                         Optional, switch on lookup of

                                            corresponding Gene Ontology

                                            annotation (IMPLIES -iprlookup

                                            option)

 -help,--help                               Optional, display help

                                            information

 -i,--input <INPUT-FILE-PATH>               Optional, path to fasta file

                                            that should be loaded on

                                            Master startup. Alternatively,

                                            in CONVERT mode, the

                                            InterProScan 5 XML file to

                                            convert.

 -iprlookup,--iprlookup                     Also include lookup of

                                            corresponding InterPro

                                            annotation in the TSV and GFF3

                                            output formats.

 -ms,--minsize <MINIMUM-SIZE>               Optional, minimum nucleotide

                                            size of ORF to report. Will

                                            only be considered if n is

                                            specified as a sequence type.

                                            Please be aware of the fact

                                            that if you specify a too

                                            short value it might be that

                                            the analysis takes a very long

                                            time!

 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>    Optional explicit output file

                                            name (relative or absolute

                                            path).  Note that this option,

                                            the --output-dir (-d) option

                                            and the --output-file-base

                                            (-b) option are mutually

                                            exclusive. If this option is

                                            given, you MUST specify a

                                            single output format using the

                                            -f option.  The output file

                                            name will not be modified.

                                            Note that specifying an output

                                            file name using this option

                                            OVERWRITES ANY EXISTING FILE.

 -pa,--pathways                             Optional, switch on lookup of

                                            corresponding Pathway

                                            annotation (IMPLIES -iprlookup

                                            option)

 -t,--seqtype <SEQUENCE-TYPE>               Optional, the type of the

                                            input sequences (dna/rna (n)

                                            or protein (p)).  The default

                                            sequence type is protein.

 -T,--tempdir <TEMP-DIR>                    Optional, specify temporary

                                            file directory (relative or

                                            absolute path). The default

                                            location is temp/.

 -version,--version                         Optional, display version

                                            number

 -vtsv,--output-tsv-version                 Optional, includes a TSV

                                            version file along with any

                                            TSV output (when TSV output

                                            requested)

Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge,

UK. (http://www.ebi.ac.uk) The InterProScan software itself is provided

under the Apache License, Version 2.0

(http://www.apache.org/licenses/LICENSE-2.0.html). Third party components

(e.g. member database binaries and models) are subject to separate

licensing - please see the individual member database websites for

details.

Available analyses:

                      TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs

                         SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs

                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.

                      PANTHER (12.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.

                       Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database

                        Hamap (2017_10) : High-quality Automated and Manual Annotation of Microbial Proteomes

                        Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins

              ProSiteProfiles (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

                        SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs

                          CDD (3.16) : Prediction of CDD domains in Proteins

                       PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family

              ProSitePatterns (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

                         Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)

                       ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.

                   MobiDBLite (1.0) : Prediction of disordered domains Regions in Proteins

                        PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:

                      Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl

                  SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp

        SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp

                        TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model

        SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
