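Note: all of these tools have limited applicability, and some were designed only for animals, so be sure to test them on ground truth data before relying on them. I want to predict the transcripts of a plant, so the well-annotated Arabidopsis thaliana is a natural test case. (The test results were rather surprising.)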

CPC

(Familiar names: it turns out CPC was developed by Gao Ge and Wei Liping at Peking University.)

While searching for the paper I discovered that CPC2 had already been released in 2017.

CPC can be used online.
a Support Vector Machine-based classifier, named Coding Potential Calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features.
Coding Potential Calculator distinguish protein-coding from non-coding RNAs based on the sequence features of the input transcripts. Our preliminary performance assessment suggests the CPC can reliably discriminate the coding and non-coding transcripts in ~98% accuracy. We provide an online version of CPC here.
The authors claim about 98% accuracy.

bin/run_predict.sh (input_seq) (result_in_table) (working_dir) (result_evidence)

CPC RESULTS (The first column is the input sequence ID; the second column is the input sequence length; the third column is the coding status; and the fourth column is the coding potential score, i.e. the "distance" to the SVM classification hyper-plane in the feature space.)

AF282387	528	coding	3.32462
Tsix_mus	4300	noncoding	-1.30047
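To compare CPC calls against a ground truth set, the result table first has to be read into something structured. Below is a minimal sketch that assumes only the four-column layout described above; the names `CpcResult` and `parse_cpc_line` are my own and are not part of CPC.

```python
from typing import NamedTuple


class CpcResult(NamedTuple):
    """One row of the CPC result table (hypothetical helper type)."""
    seq_id: str
    length: int
    status: str   # "coding" or "noncoding"
    score: float  # signed distance to the SVM hyper-plane


def parse_cpc_line(line: str) -> CpcResult:
    # str.split() with no argument accepts tabs as well as runs of spaces,
    # so both rows shown above are parsed the same way.
    seq_id, length, status, score = line.split()
    return CpcResult(seq_id, int(length), status, float(score))


example = """AF282387\t528\tcoding\t3.32462
Tsix_mus 4300 noncoding -1.30047"""

results = [parse_cpc_line(row) for row in example.splitlines() if row.strip()]
for r in results:
    print(r.seq_id, r.status, r.score)
```

A positive score corresponds to a "coding" call and a negative score to "noncoding", which matches the two example rows.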

HOMO EVIDENCE
ORF EVIDENCE

AF282387	ORF_FRAMEFINDER	4	529	99.43	109.41	Full
Tsix_mus	ORF_FRAMEFINDER	4077	4206	3.00	27.50	Full

FRAME FINDER

>AF282387 Filobasidiella neoformans calcineurin B regulatory subunit (CNB1) mRNA, complete cds [framefinder (3,528) score=109.41 used=99.43% {forward,strict} ]
MGAAESSMFNSLEKNSNFSGPELMRLKKRFMKLDKDGSGSIDKDEFLQIPQIANNPLAHR
MIAIFDEDGSGTVDFQEFVGGLSAFSSKGGRDEKLRFAFKVYDMDRDGYISNGELYLVLK
QMVGNNLKDQQLQQIVDKTIMEADKDGDGKLSFEEFTQMVASTDIVKQMTLEDLF
>Tsix_mus NR_002844.1 Mus musculus X (inactive)-specific transcript, antisense (Tsix) on chromosome X [framefinder (4076,4205) score=27.50 used=3.00% {forward,strict} ]
MKGYVLKLSSWAGEIAQWLGVLTALPEGLSSILNNFVVAHSHL

BLAST RESULT

CPC2

CPC2 runs ∼1000 times faster than CPC1 and exhibits superior accuracy compared with CPC1, especially for long non-coding transcripts. Moreover, the model of CPC2 is species-neutral, making it feasible for ever-growing non-model organism transcriptomes.

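In my own tests, CPC1 is reasonably fast without BLAST, but once BLAST is enabled it becomes painfully slow. Behind the scenes it still calls the ancient blastall program; these days even BLAST feels slow and most of us simply use DIAMOND.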

CPC2 was rewritten in Python, but it still calls libsvm for the classification.

How CPC works, roughly:

1. Feature selection. CPC2 uses four intrinsic features: Fickett TESTCODE score, open reading frame (ORF) length, ORF integrity and isoelectric point (pI).

2. Build a classification model: a support vector machine (SVM) is trained on these features.

3. Validate the model's performance on data from multiple species. Evaluation metrics: sensitivity, specificity and accuracy.
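The three evaluation metrics from step 3 are also what I would compute when checking any of these tools against ground truth annotations. Here is a minimal sketch, assuming the labels are the strings "coding" and "noncoding" and that "coding" is treated as the positive class; the function names and the toy labels are made up for illustration.

```python
def confusion_counts(truth, predicted, positive="coding"):
    """Count true/false positives and negatives for paired label lists."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(truth, predicted, positive="coding"):
    """Return sensitivity, specificity and accuracy as a dict."""
    tp, fp, tn, fn = confusion_counts(truth, predicted, positive)
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # recall on coding
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # recall on noncoding
        "accuracy": (tp + tn) / total if total else nan,
    }


# Toy labels, not real data.
truth = ["coding", "coding", "coding", "noncoding", "noncoding"]
predicted = ["coding", "coding", "noncoding", "noncoding", "coding"]
metrics = evaluate(truth, predicted)
print(metrics)
```

Reporting sensitivity and specificity separately matters here because lncRNA test sets are often unbalanced, so accuracy alone can look good even when one class is mostly misclassified.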

With a method this simple, don't you suddenly get the illusion that you could publish in NAR too~~

PLEK

(predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme)

an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes.

There seems to be no website and no GitHub repository; the program is hosted on SourceForge.

Basic principle:

Core idea: k-mers plus an SVM

It is suitable for vertebrates lacking high-quality genome sequences and annotation information and is especially effective for the de novo assembled transcriptome data generated by PacBio or 454 sequencing platforms.

A k-mer pattern is a specific string of k nucleotides, each of which can be A, C, G or T. For k = 1 to 5, we had 4 + 16 + 64 + 256 + 1024 = 1,364 patterns: 4 one-mer patterns, 16 two-mer patterns, 64 three-mer patterns, 256 four-mer patterns, and 1,024 five-mer patterns.

So five k-mer sizes are used (k = 1 to 5).
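The 1,364-pattern feature vector is easy to reproduce. The sketch below enumerates all patterns for k = 1 to 5 and computes plain sliding-window frequencies for one sequence. It is only an illustration: it does not reproduce PLEK's own weighting or calibration of the frequencies, and the function names are mine.

```python
from itertools import product


def kmer_patterns(max_k=5):
    """All k-mer patterns over A/C/G/T for k = 1..max_k, in a fixed order."""
    return ["".join(p)
            for k in range(1, max_k + 1)
            for p in product("ACGT", repeat=k)]


def kmer_frequencies(seq, max_k=5):
    """Relative frequency of every pattern, using a sliding window of step 1."""
    seq = seq.upper().replace("U", "T")
    freqs = {pattern: 0.0 for pattern in kmer_patterns(max_k)}
    for k in range(1, max_k + 1):
        windows = len(seq) - k + 1
        if windows <= 0:
            continue  # sequence shorter than k
        for i in range(windows):
            kmer = seq[i:i + k]
            if kmer in freqs:  # windows containing N or other symbols are skipped
                freqs[kmer] += 1.0 / windows
    return freqs


patterns = kmer_patterns()
print(len(patterns))  # 4 + 16 + 64 + 256 + 1024

freqs = kmer_frequencies("ACGTACGT")
print(freqs["A"], freqs["AC"])
```

Keeping the pattern order fixed means the dictionary values can be turned directly into a feature vector for an SVM library such as libsvm.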

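The feature selection is very conventional and in the end it still calls libsvm, yet it was published in BMC Bioinformatics. After reading it, don't you feel like publishing one yourself?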

CNCI

Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts

Feature selection

To distinguish protein-coding sequences from the non-coding sequences, we extracted five features, i.e. the length and S-score of MLCDS, length-percentage, score-distance and codon-bias. The length and S-score of MLCDS were used as the first two features, which assess the extent and quality of the MLCDS, respectively. Moreover, as demonstrated earlier in the text, protein-coding transcripts possess a special reading frame obviously distinct from the other five in the distribution of ANT. We analyzed six MLCDS candidates outputted by dynamic programming of the six reading frames for each transcript, with the assumption that there must exist one best MLCDS (as described earlier in the text); however, this phenomenon does not generally exist for non-coding transcripts. Thus, we defined other two features, length-percentage and score-distance, as follows:

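Test result: in my hands CNCI could not process FASTA sequences directly; with FASTA input the output was empty. Only when I supplied a GTF file together with the genome 2bit file did it produce valid results.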

CPAT

CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model

Documentation: http://rna-cpat.sourceforge.net/

Feature selection:

The first feature was the maximum length of the open reading frame (ORF).

The second feature was ORF coverage defined as the ratio of ORF to transcript lengths.
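The first two features are simple enough to sketch in a few lines. The version below makes some assumptions that may differ from what CPAT actually does: it scans only the three forward frames, requires an ATG start and an in-frame stop codon, and counts the stop codon as part of the ORF. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Returns (0, 0) when no complete ORF is found.
    """
    seq = seq.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


def orf_features(seq):
    """Maximum ORF length and ORF coverage (ORF length / transcript length)."""
    start, end = longest_orf(seq)
    orf_len = end - start
    coverage = orf_len / len(seq) if seq else 0.0
    return orf_len, coverage


print(orf_features("CCATGAAATTTTAGGG"))
```

For the toy sequence the only complete ORF is ATG AAA TTT TAG, so the ORF length is 12 out of 16 nucleotides.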

The third feature we used was the Fickett TESTCODE score (termed ‘Fickett score’ hereafter), which is a simple linguistic feature that distinguishes protein-coding RNA and ncRNA according to the combinational effect of nucleotide composition and codon usage bias (22).

The fourth feature we used was hexamer usage bias (termed ‘hexamer score’ hereafter). This may be the most discriminating feature because of the dependence between adjacent amino acids in proteins (23).
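One common way to turn hexamer usage bias into a number is a mean log-likelihood ratio between hexamer frequencies observed in coding and in non-coding training sequences. The sketch below follows that idea; it is my own simplified reading and is not taken from the CPAT source code, and the pseudocount and the in-frame step of 3 are assumptions.

```python
import math
from collections import Counter


def hexamer_counts(seqs, step=3):
    """Count hexamers in a list of training sequences, moving `step` nt at a time."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts


def hexamer_score(orf, coding_counts, noncoding_counts, pseudocount=1.0):
    """Mean log ratio of coding vs non-coding hexamer probabilities along an ORF."""
    # The pseudocount is added for each of the 4**6 possible hexamers so that
    # hexamers unseen in training do not cause a division by zero.
    n_coding = sum(coding_counts.values()) + pseudocount * 4 ** 6
    n_noncoding = sum(noncoding_counts.values()) + pseudocount * 4 ** 6
    orf = orf.upper()
    log_ratios = []
    for i in range(0, len(orf) - 5, 3):
        hexamer = orf[i:i + 6]
        p_coding = (coding_counts[hexamer] + pseudocount) / n_coding
        p_noncoding = (noncoding_counts[hexamer] + pseudocount) / n_noncoding
        log_ratios.append(math.log(p_coding / p_noncoding))
    return sum(log_ratios) / len(log_ratios) if log_ratios else 0.0


# Toy training sets, not real data.
coding = hexamer_counts(["ATGGCCATGGCC"])
noncoding = hexamer_counts(["TTTTTTTTTTTT"])
print(hexamer_score("ATGGCCATGGCC", coding, noncoding))
print(hexamer_score("TTTTTTTTTTTT", coding, noncoding))
```

A positive score means the ORF uses hexamers that are more typical of the coding training set, and a negative score means the opposite.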

We build a logistic regression model using these four linguistic features as predictor variables. A χ2 test was used to evaluate whether our logit model with predictors fits the training data significantly better than the null model, which had only an intercept.
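Once the four features are available, applying a logistic regression model is just a weighted sum passed through the logistic function. In the sketch below the coefficients, the intercept and the feature values are made-up placeholders chosen only to show the arithmetic; they are NOT CPAT's fitted parameters, which come from its species-specific training data.

```python
import math


def coding_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical placeholder parameters, for illustration only.
HYPOTHETICAL_COEFS = {
    "orf_length": 0.005,
    "orf_coverage": 2.0,
    "fickett": 3.0,
    "hexamer": 4.0,
}
HYPOTHETICAL_INTERCEPT = -6.0

# Made-up feature values for two imaginary transcripts.
long_orf_transcript = {"orf_length": 900, "orf_coverage": 0.8,
                       "fickett": 1.1, "hexamer": 0.3}
short_orf_transcript = {"orf_length": 90, "orf_coverage": 0.05,
                        "fickett": 0.6, "hexamer": -0.2}

p_long = coding_probability(long_orf_transcript,
                            HYPOTHETICAL_COEFS, HYPOTHETICAL_INTERCEPT)
p_short = coding_probability(short_orf_transcript,
                             HYPOTHETICAL_COEFS, HYPOTHETICAL_INTERCEPT)
print(p_long, p_short)
```

A transcript is then called coding when its probability exceeds a cutoff, and the appropriate cutoff depends on the trained model, so it should be taken from the tool's documentation rather than from this sketch.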

FEELnc

FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome

OrfPredictor

OrfPredictor: predicting protein-coding regions in EST-derived sequences

PhyloCSF

PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions

Predicting the coding potential of lncRNAs: using PhyloCSF

I will test each of them in turn later.

To be continued~~~
