之前的文章:构建NCBI本地BLAST数据库 (NR NT等) | blastx/diamond使用方法 | blast构建索引 | makeblastdb

本地运行blast时,需要指定out format。

常见的网页版blast结果可以参照:Blast结果的详细解析

*** Formatting options
-outfmt <String>
alignment view options:
0 = Pairwise,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
18 = Organism Report Options 6, 7 and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
When not provided, the default value is:
'qaccver saccver pident length mismatch gapopen qstart qend sstart send
evalue bitscore', which is equivalent to the keyword 'std'
Default = `0'

默认是0,也就是会输出比对的结果。

但是这样的结果显然不适合批量处理,批量处理的文件格式显然必须是dataframe。

所以网上有人推荐“outfmt 7 or 10 works perfect”,所以一般就选10吧。

Diamond的输出格式:

--outfmt (-f)          output format
0 = BLAST pairwise
5 = BLAST XML
6 = BLAST tabular
100 = DIAMOND alignment archive (DAA)
101 = SAM Value 6 may be followed by a space-separated list of these keywords: qseqid means Query Seq - id
qlen means Query sequence length
sseqid means Subject Seq - id
sallseqid means All subject Seq - id(s), separated by a ';'
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive - scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive - scoring matches
qframe means Query frame
btop means Blast traceback operations(BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
qcovhsp means Query Coverage Per HSP
qtitle means Query title Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

diamond就选6吧,便于批量处理。

diamond 比对转录本到Pfam库的部分结果,可以看到,格式6非常适合做批量处理。

wgs.RNAseq_00000687     M1AFS7.1        60.9    179     65      2       337     861     102     279     8.1e-55 221.5
wgs.RNAseq_00000687 A0A0B2PIQ0.1 61.0 177 66 2 337 861 319 494 4.0e-54 219.2
wgs.RNAseq_00000687 A0A0L9UG27.1 61.2 178 65 3 337 861 315 491 4.0e-54 219.2
wgs.RNAseq_00000687 A0A0L9UG27.1 63.2 38 14 0 6 119 250 287 1.0e-04 55.1
wgs.RNAseq_00000688 A0A0D2QMP5.1 65.8 114 35 2 218 550 317 429 1.7e-34 153.3
wgs.RNAseq_00000688 A0A0D2QMP5.1 53.2 62 26 1 36 221 256 314 8.7e-10 71.2
wgs.RNAseq_00000688 A0A0D2TR34.1 65.8 114 35 2 218 550 322 434 1.7e-34 153.3
wgs.RNAseq_00000688 A0A0D2TR34.1 53.2 62 26 1 36 221 261 319 8.7e-10 71.2
wgs.RNAseq_00000688 F6H0G0.1 65.8 111 36 2 221 550 320 429 5.6e-33 148.3  

每一列是什么呢? BLASTn output format 6

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

 1.  qseqid  query (e.g., gene) sequence id
 2.  sseqid  subject (e.g., reference genome) sequence id
 3.  pident  percentage of identical matches
 4.  length  alignment length
 5.  mismatch  number of mismatches
 6.  gapopen  number of gap openings
 7.  qstart  start of alignment in query
 8.  qend  end of alignment in query
 9.  sstart  start of alignment in subject
 10.  send  end of alignment in subject
 11.  evalue  expect value
 12.  bitscore  bit score

比对结果的第三列和第四列非常有用,尤其是在鉴别软件stringtie等预测出来的转录本是否为有效转录本时,其实预测出来的转录本大部分都是没有意义的,但是又能部分hit到蛋白上,这时我们就只能选出比对最长的那个转录本,其余的可以看作是无效的转录本。

操作比较简单,先对比对长度排个序,从长到短。

cat extract_no_N_200.fasta.diamond.nr | sort -n -r -k4 > extract_no_N_200.fasta.diamond.nr.sort

其次就是利用python的panda模块去冗余就好了。

import pandas as pd

infile = "extract_no_N_200.fasta.diamond.nr.sort"
#infile = "test.sort" df = pd.read_csv(infile, sep="\t", header=None)
df.columns = ["l1","l2","l3","l4","l5","l6","l7","l8","l9","l10","l11","l12"] df1 = df.sort_values(['l4'],ascending=False) df2 = df1.drop_duplicates(subset=[df1.columns[1]], keep = 'first') df2.to_csv("rm_dup_protein_"+infile, sep="\t", index=False, header=False) df3 = df2.sort_values(['l3'],ascending=False) df4 = df3.drop_duplicates(subset=[df3.columns[0]], keep = 'first') df4.to_csv("unique_protein_"+infile, sep="\t", index=False, header=False)

  

这样我们得到的转录本就基本上是无冗余的了。

然后对每个转录本取identical match比例最高的蛋白就好了。其实信息是可以全部保留的。

blast和diamond怎么只输出最优的hit?

blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1

  

待续~

blast | diamond 输出结果选择和解析 | 比对的更多相关文章

  1. ubuntu12.04软件中心打开错误和 ubuntu 包管理之“:E: 读错误 - read (5: 输入/输出错误) E: 无法解析或打开软件包的列表或是状态文件。”的解决

    执行ubuntu软讲中心时打不开.老是崩溃,从终端也下载不了软件. 执行包管理的update或者search等等会报错: E: 读错误 - read (5: 输入/输出错误) E: 无法解析或打开软件 ...

  2. el表达式原样输出,不被解析

    今天遇到了,在jar包都有的前提下EL表达式原样输出,不被解析,原因是: page指令中确少 isELIgnored="false" 加上就好了 <%@ page langu ...

  3. Bioperl 解析blast的输出结果

    用bioperl 解析blast的默认输出结果, 整理成-m8格式的输出 #!/usr/bin/perl use Bio::SearchIO; my ($blast) = @ARGV; my $sea ...

  4. Java五道输出易错题解析(避免小错误)

    收集了几个易错的或好玩的Java输出题,分享给大家,以后在编程学习中稍微注意下就OK了. 1. 看不见的空格? 下面的输出会正常吗? package basic; public class Integ ...

  5. 【转】Python实现修改Windows CMD命令行输出颜色(完全解析)

    用Python写命令行程序的时候,单一的输出颜色太单调.其实我们可以加些色彩,比如用红色表示警告,绿色表示结果正常等.网上也有几篇类似的帖子,但是没有把问题讲清楚,贴的代码也不是太清晰.这里,对Win ...

  6. Java五道输出易错题解析(进来挑战下)

    转自:http://blog.csdn.net/lanxuezaipiao/article/details/41985243 收集了几个易错的或好玩的Java输出题,分享给大家,以后在编程学习中稍微注 ...

  7. 解决Editor.md通过代码块原样输出Emoji被强制解析问题

    Editor.md是一款优秀的开源Markdown 编辑器,在使用中遇到的一些问题和功能改进分享给需要的伙伴. 项目地址 https://github.com/pandao/editor.md 问题 ...

  8. 如何让浏览器直接输出HTML代码而不解析

    方法一: 将HTML代码嵌入到<script type='text/html' style='display:block'></scipt>中 <script type= ...

  9. samtools faidx输出的fai文件格式解析 | fasta转bed | fasta to bed

    fai示例: Sc0000003 2774837 10024730 60 61 Sc0000004 2768176 12845826 60 61 Sc0000005 2756750 15660150 ...

随机推荐

  1. OpenJudge cdqz/Data Structure Challenge 2 (Problem 5822) - 可持久化线段树

    描述 给一个空数列,有M次操作,每次操作是以下三种之一: (1)在数列后加一个数 (2)求数列中某位置的值 (3)撤销掉最后进行的若干次操作(1和3) 输入 第一行一个正整数M. 接下来M行,每行开头 ...

  2. 尚硅谷面试第一季-09SpringMVC中如何解决POST请求中文乱码问题GET的又如何处理呢

    目录结构: 关键代码: web.xml <filter> <filter-name>CharacterEncodingFilter</filter-name> &l ...

  3. 复旦大学2016--2017学年第二学期高等代数II期末考试情况分析

    一.期末考试成绩班级前十五名 林晨(93).朱民哲(92).何陶然(91).徐钰伦(91).吴嘉诚(91).于鸿宝(91).宁盛臻(90).杨锦文(89).占文韬(88).章俊鑫(87).颜匡萱(87 ...

  4. topcoder srm 715 div1 -23

    1.一个计算器,它执行的是一个只包含‘+’,‘-’的字符串$s$.初始化值为0,每遇到一个‘+’增加1,否则减少1.并保存运算过程的最大最小值$Max,Min$,最后的答案是$Max-Min$.比如$ ...

  5. Python3基础 list len 输出元素的个数

             Python : 3.7.0          OS : Ubuntu 18.04.1 LTS         IDE : PyCharm 2018.2.4       Conda ...

  6. shiro中authc和user的权限区别

    前者(authc)是认证过,后者(user)是登录过,如果开启了rememberMe功能的话,后者(user)也是可以通过的,而前者(authc)通过不了.故我们用authc来校验一些关键操作,比如购 ...

  7. (转载)WinCC 卸载后 Simatic Shell 的删除

    现象:WinCC卸载后,在计算机(我的电脑)中仍有Simatic Shell文件夹,双击无反应 解决:1.按Win+X,运行“regedit”,打开注册表2.在注册表中,选中HKEY_LOCAL_MA ...

  8. java.lang.OutOfMemoryError:GC overhead limit exceeded

    在调测程序时报java.lang.OutOfMemoryError:GC overhead limit exceeded 错误 错误原因:在用程序进行数据切割时报了该错误.由于在本地执行数据切割测试的 ...

  9. WijmoJS 全面支持 Angular 7

    概述 首先恭喜Angular团队发布Angular 7.0.0版本! 对于大多数开发人员,只需要执行一个命令就可以更新到Angular 7: ng update \@angular/cli \@ang ...

  10. HDU 4323 Magic Number(编辑距离DP)

    http://acm.hdu.edu.cn/showproblem.php?pid=4323 题意: 给出n个串和m次询问,每个询问给出一个串和改变次数上限,在不超过这个上限的情况下,n个串中有多少个 ...