blast | diamond: choosing and parsing the output format

Previous post: Building a local NCBI BLAST database (NR, NT, etc.) | how to use blastx/diamond | building a BLAST index | makeblastdb

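When running blast locally, you need to specify the output format with -outfmt.

For the familiar output of the web version of blast, see: A detailed explanation of Blast results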

*** Formatting options

 -outfmt <String>

   alignment view options:

     0 = Pairwise,

     1 = Query-anchored showing identities,

     2 = Query-anchored no identities,

     3 = Flat query-anchored showing identities,

     4 = Flat query-anchored no identities,

     5 = BLAST XML,

     6 = Tabular,

     7 = Tabular with comment lines,

     8 = Seqalign (Text ASN.1),

     9 = Seqalign (Binary ASN.1),

    10 = Comma-separated values,

    11 = BLAST archive (ASN.1),

    12 = Seqalign (JSON),

    13 = Multiple-file BLAST JSON,

    14 = Multiple-file BLAST XML2,

    15 = Single-file BLAST JSON,

    16 = Single-file BLAST XML2,

    18 = Organism Report

   Options 6, 7 and 10 can be additionally configured to produce

   a custom format specified by space delimited format specifiers.

   The supported format specifiers are:

   	    qseqid means Query Seq-id

   	       qgi means Query GI

   	      qacc means Query accession

   	   qaccver means Query accession.version

   	      qlen means Query sequence length

   	    sseqid means Subject Seq-id

   	 sallseqid means All subject Seq-id(s), separated by a ';'

   	       sgi means Subject GI

   	    sallgi means All subject GIs

   	      sacc means Subject accession

   	   saccver means Subject accession.version

   	   sallacc means All subject accessions

   	      slen means Subject sequence length

   	    qstart means Start of alignment in query

   	      qend means End of alignment in query

   	    sstart means Start of alignment in subject

   	      send means End of alignment in subject

   	      qseq means Aligned part of query sequence

   	      sseq means Aligned part of subject sequence

   	    evalue means Expect value

   	  bitscore means Bit score

   	     score means Raw score

   	    length means Alignment length

   	    pident means Percentage of identical matches

   	    nident means Number of identical matches

   	  mismatch means Number of mismatches

   	  positive means Number of positive-scoring matches

   	   gapopen means Number of gap openings

   	      gaps means Total number of gaps

   	      ppos means Percentage of positive-scoring matches

   	    frames means Query and subject frames separated by a '/'

   	    qframe means Query frame

   	    sframe means Subject frame

   	      btop means Blast traceback operations (BTOP)

   	    staxid means Subject Taxonomy ID

   	  ssciname means Subject Scientific Name

   	  scomname means Subject Common Name

   	sblastname means Subject Blast Name

   	 sskingdom means Subject Super Kingdom

   	   staxids means unique Subject Taxonomy ID(s), separated by a ';'

   			 (in numerical order)

   	 sscinames means unique Subject Scientific Name(s), separated by a ';'

   	 scomnames means unique Subject Common Name(s), separated by a ';'

   	sblastnames means unique Subject Blast Name(s), separated by a ';'

   			 (in alphabetical order)

   	sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

   			 (in alphabetical order)

   	    stitle means Subject Title

   	salltitles means All Subject Title(s), separated by a '<>'

   	   sstrand means Subject Strand

   	     qcovs means Query Coverage Per Subject

   	   qcovhsp means Query Coverage Per HSP

   	    qcovus means Query Coverage Per Unique Subject (blastn only)

   When not provided, the default value is:

   'qaccver saccver pident length mismatch gapopen qstart qend sstart send

   evalue bitscore', which is equivalent to the keyword 'std'

   Default = `0'

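The default is 0, which prints the pairwise alignments themselves.

That output is clearly unsuitable for batch processing; for batch processing the file has to be a table that can be loaded as a dataframe.

This is why people online recommend that "outfmt 7 or 10 works perfect", so 10 is the usual choice.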
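To make the difference between the tabular formats concrete, here is a minimal pandas sketch for loading -outfmt 6, 7 or 10 output. It assumes the default 12 'std' columns were used; the helper name read_blast_table and the inline example rows are made up for illustration and are not part of BLAST.

```python
import io
import pandas as pd

# Default 'std' columns of BLAST tabular output (see the help text above).
STD_COLUMNS = [
    "qaccver", "saccver", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def read_blast_table(source, outfmt=6):
    """Load BLAST -outfmt 6, 7 or 10 output that uses the default columns.

    Hypothetical helper: 6 and 7 are tab-separated (7 adds '#' comment
    lines), 10 is comma-separated.
    """
    if outfmt not in (6, 7, 10):
        raise ValueError("only outfmt 6, 7 and 10 are tabular")
    sep = "," if outfmt == 10 else "\t"
    comment = "#" if outfmt == 7 else None
    return pd.read_csv(source, sep=sep, comment=comment,
                       header=None, names=STD_COLUMNS)

# Made-up example rows, only to show the call.
fmt7 = ("# BLASTN\n# Query: q1\n"
        "q1\ts1\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-50\t350\n")
fmt10 = "q1,s1,98.5,200,3,0,1,200,11,210,1e-50,350\n"

df7 = read_blast_table(io.StringIO(fmt7), outfmt=7)
df10 = read_blast_table(io.StringIO(fmt10), outfmt=10)
print(df7)
```

One caveat worth keeping in mind: if free-text columns such as stitle are added to format 10, commas inside titles may need extra care, which is one reason the tab-separated formats can be easier to parse.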

DIAMOND's output formats:

--outfmt (-f)          output format

	0   = BLAST pairwise

	5   = BLAST XML

	6   = BLAST tabular

	100 = DIAMOND alignment archive (DAA)

	101 = SAM

	Value 6 may be followed by a space-separated list of these keywords:

	qseqid means Query Seq-id

	qlen means Query sequence length

	sseqid means Subject Seq-id

	sallseqid means All subject Seq-id(s), separated by a ';'

	slen means Subject sequence length

	qstart means Start of alignment in query

	qend means End of alignment in query

	sstart means Start of alignment in subject

	send means End of alignment in subject

	qseq means Aligned part of query sequence

	sseq means Aligned part of subject sequence

	evalue means Expect value

	bitscore means Bit score

	score means Raw score

	length means Alignment length

	pident means Percentage of identical matches

	nident means Number of identical matches

	mismatch means Number of mismatches

	positive means Number of positive-scoring matches

	gapopen means Number of gap openings

	gaps means Total number of gaps

	ppos means Percentage of positive-scoring matches

	qframe means Query frame

	btop means Blast traceback operations (BTOP)

	staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)

	stitle means Subject Title

	salltitles means All Subject Title(s), separated by a '<>'

	qcovhsp means Query Coverage Per HSP

	qtitle means Query title

	Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

For diamond, choose 6; it is the convenient one for batch processing.
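As a rough illustration of how format 6 is combined with a custom keyword list, the sketch below only builds a diamond blastx command line and prints it; it does not run anything. The file names are placeholders, the helper diamond_command is hypothetical, and the choice of extra columns (qlen, slen) is just an example taken from the keyword list above.

```python
import shlex

# The 12 default columns plus two extra keywords from the list above.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "qlen", "slen",
]

def diamond_command(query, db, out, fields=FIELDS):
    """Build (but do not run) a diamond blastx call with custom columns."""
    return ["diamond", "blastx", "--query", query, "--db", db,
            "--out", out, "--outfmt", "6", *fields]

# Placeholder file names, assumed for this example only.
cmd = diamond_command("transcripts.fa", "nr.dmnd", "hits.tsv")
print(shlex.join(cmd))
```

To actually run it, the list could be passed to subprocess.run(cmd, check=True). Whatever column order is chosen here has to be reused when the file is parsed later.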

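Below is part of the result of aligning transcripts against the Pfam database with diamond. As you can see, format 6 is very well suited to batch processing.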

wgs.RNAseq_00000687     M1AFS7.1        60.9    179     65      2       337     861     102     279     8.1e-55 221.5

wgs.RNAseq_00000687     A0A0B2PIQ0.1    61.0    177     66      2       337     861     319     494     4.0e-54 219.2

wgs.RNAseq_00000687     A0A0L9UG27.1    61.2    178     65      3       337     861     315     491     4.0e-54 219.2

wgs.RNAseq_00000687     A0A0L9UG27.1    63.2    38      14      0       6       119     250     287     1.0e-04 55.1

wgs.RNAseq_00000688     A0A0D2QMP5.1    65.8    114     35      2       218     550     317     429     1.7e-34 153.3

wgs.RNAseq_00000688     A0A0D2QMP5.1    53.2    62      26      1       36      221     256     314     8.7e-10 71.2

wgs.RNAseq_00000688     A0A0D2TR34.1    65.8    114     35      2       218     550     322     434     1.7e-34 153.3

wgs.RNAseq_00000688     A0A0D2TR34.1    53.2    62      26      1       36      221     261     319     8.7e-10 71.2

wgs.RNAseq_00000688     F6H0G0.1        65.8    111     36      2       221     550     320     429     5.6e-33 148.3

What does each column mean? See BLASTn output format 6:

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1.	qseqid	query (e.g., gene) sequence id
2.	sseqid	subject (e.g., reference genome) sequence id
3.	pident	percentage of identical matches
4.	length	alignment length
5.	mismatch	number of mismatches
6.	gapopen	number of gap openings
7.	qstart	start of alignment in query
8.	qend	end of alignment in query
9.	sstart	start of alignment in subject
10.	send	end of alignment in subject
11.	evalue	expect value
12.	bitscore	bit score
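The column list above maps naturally onto a small typed record. The following is a standard-library sketch, assuming tab-separated input with exactly the 12 default columns; the names Hit and parse_hit are illustrative only.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float

# One converter per column, in the same order as the fields of Hit.
CONVERTERS = (str, str, float, int, int, int, int, int, int, int, float, float)

def parse_hit(line):
    """Convert one tab-separated format-6 line into a typed Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONVERTERS):
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return Hit(*(conv(value) for conv, value in zip(CONVERTERS, fields)))

# First row of the diamond output shown earlier, rebuilt with tabs.
row = "\t".join(["wgs.RNAseq_00000687", "M1AFS7.1", "60.9", "179", "65", "2",
                 "337", "861", "102", "279", "8.1e-55", "221.5"])
hit = parse_hit(row)
print(hit.sseqid, hit.pident, hit.length)
```

Converting the numeric columns up front avoids the classic mistake of comparing e-values or lengths as strings.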

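The third and fourth columns of the result (pident and length) are very useful, especially when deciding whether transcripts predicted by tools such as stringtie are valid. Most predicted transcripts are in fact meaningless, yet they can still partially hit a protein. In that case all we can do is pick the transcript with the longest alignment and treat the rest as invalid.

The procedure is simple. First sort by alignment length (column 4), longest first.

sort -k4,4nr extract_no_N_200.fasta.diamond.nr > extract_no_N_200.fasta.diamond.nr.sort

Next, remove the redundancy with Python's pandas module.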

import pandas as pd

# Sorted diamond format-6 output from the previous step.
infile = "extract_no_N_200.fasta.diamond.nr.sort"
# infile = "test.sort"

df = pd.read_csv(infile, sep="\t", header=None)
df.columns = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# For each protein (sseqid), keep only the hit with the longest alignment.
df1 = df.sort_values(["length"], ascending=False)
df2 = df1.drop_duplicates(subset=["sseqid"], keep="first")
df2.to_csv("rm_dup_protein_" + infile, sep="\t", index=False, header=False)

# For each transcript (qseqid), keep only the hit with the highest identity.
df3 = df2.sort_values(["pident"], ascending=False)
df4 = df3.drop_duplicates(subset=["qseqid"], keep="first")
df4.to_csv("unique_protein_" + infile, sep="\t", index=False, header=False)

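The transcripts obtained this way are essentially non-redundant.

After that, for each transcript simply keep the protein with the highest percentage of identical matches. In fact, all of the information could be retained instead of being dropped.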
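One possible way to retain all of the information is to flag rows rather than drop them. The sketch below is an alternative to drop_duplicates, not the method used above; it works on four rows copied from the diamond output shown earlier, and the flag column names are made up. Unlike drop_duplicates, ties are all flagged.

```python
import pandas as pd

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Four rows taken from the diamond output shown earlier.
rows = [
    ["wgs.RNAseq_00000687", "A0A0L9UG27.1", 61.2, 178, 65, 3, 337, 861, 315, 491, 4.0e-54, 219.2],
    ["wgs.RNAseq_00000687", "A0A0L9UG27.1", 63.2, 38, 14, 0, 6, 119, 250, 287, 1.0e-04, 55.1],
    ["wgs.RNAseq_00000688", "A0A0D2QMP5.1", 65.8, 114, 35, 2, 218, 550, 317, 429, 1.7e-34, 153.3],
    ["wgs.RNAseq_00000688", "A0A0D2QMP5.1", 53.2, 62, 26, 1, 36, 221, 256, 314, 8.7e-10, 71.2],
]
df = pd.DataFrame(rows, columns=COLUMNS)

# Flag the longest alignment for each protein instead of deleting the rest.
df["longest_for_protein"] = (
    df["length"] == df.groupby("sseqid")["length"].transform("max")
)

# Among the flagged rows, flag the highest-identity protein per transcript.
kept = df[df["longest_for_protein"]]
best_pident = kept.groupby("qseqid")["pident"].transform("max")
df["best_for_transcript"] = False
df.loc[kept.index, "best_for_transcript"] = kept["pident"] == best_pident

print(df[["qseqid", "sseqid", "length", "pident",
          "longest_for_protein", "best_for_transcript"]])
```

Filtering on the two flag columns afterwards gives a result comparable to the two deduplicated files, while the full table stays available for inspection.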

How do you make blast and diamond output only the best hit?

blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1
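Independently of the command-line options, the best hit per query can also be picked afterwards from tabular output of either tool. The following is a hedged sketch of such a post-filter, assuming the 12 default columns and defining "best" as highest bit score with ties broken by lowest e-value. It is not a description of what -max_target_seqs does internally, and best_hit_per_query is a made-up name.

```python
import io
import pandas as pd

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hit_per_query(df):
    """Keep one row per query: highest bitscore, then lowest evalue."""
    ordered = df.sort_values(["qseqid", "bitscore", "evalue"],
                             ascending=[True, False, True])
    return ordered.drop_duplicates(subset="qseqid", keep="first")

# Four rows from the diamond output shown earlier, rebuilt with tabs.
text = "\n".join("\t".join(r) for r in [
    ["wgs.RNAseq_00000687", "A0A0B2PIQ0.1", "61.0", "177", "66", "2", "337", "861", "319", "494", "4.0e-54", "219.2"],
    ["wgs.RNAseq_00000687", "M1AFS7.1", "60.9", "179", "65", "2", "337", "861", "102", "279", "8.1e-55", "221.5"],
    ["wgs.RNAseq_00000688", "F6H0G0.1", "65.8", "111", "36", "2", "221", "550", "320", "429", "5.6e-33", "148.3"],
    ["wgs.RNAseq_00000688", "A0A0D2QMP5.1", "65.8", "114", "35", "2", "218", "550", "317", "429", "1.7e-34", "153.3"],
])
hits = pd.read_csv(io.StringIO(text), sep="\t", header=None, names=COLUMNS)
best = best_hit_per_query(hits)
print(best[["qseqid", "sseqid", "evalue", "bitscore"]])
```

Doing the selection in post-processing keeps the full hit list on disk, so the definition of "best" can be changed later without rerunning the search.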

To be continued~
