le final.snp.list | perl -lane '{$a+=1;print "$a\t$F[0]\t$F[1]\t$F[1]"}' | less >snp_site
le final.indel.vcf | grep -v '^#' | less -S | perl -lane '{$a+=1;$b=$F[1]+length($F[3]);print "$a\t$F[0]\t$F[1]\t$b"}' | less -S >indel_site
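For readers who prefer Python, the indel one-liner above can be sketched roughly as follows. This is only an illustration of the same arithmetic (end = POS + length of REF), assuming whitespace-separated VCF records with CHROM, POS and REF in columns 1, 2 and 4; the function name `indel_sites` is made up for this example, and unlike the one-liner it also skips blank lines.

```python
from typing import Iterable, Iterator, Tuple


def indel_sites(vcf_lines: Iterable[str]) -> Iterator[Tuple[int, str, int, int]]:
    """Yield (index, chrom, pos, pos + len(REF)) for each VCF record."""
    index = 0
    for line in vcf_lines:
        # Skip header lines (like grep -v '^#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        index += 1  # running counter, same role as $a in the one-liner
        yield index, chrom, pos, pos + len(ref)


if __name__ == "__main__":
    # Hypothetical records, not taken from any real VCF.
    demo = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t100\t.\tAT\tA",
    ]
    for row in indel_sites(demo):
        print("\t".join(str(x) for x in row))
```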
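This is effectively the second installment; the previous one was: Edit Distance (NM tag) - an advanced look at the sam/bam format. The MD tag is a string representation of mismatch positions, and it is apparently used when calling SNPs and indels. Here, though, I only use it to count the number of mismatches.
import re
MD = line.get_tag('MD')
pat = "[0-9]+[ATGC]+"
MD_list = re.findall(pat, MD)
for i in MD_list:
    for j in i:
        if j == '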
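Since the excerpt above is cut off, here is a minimal, self-contained sketch of one way to count mismatches from an MD string. It is not the author's code: the function name `count_mismatches` is invented for illustration, and it assumes the MD value follows the layout described in the SAM specification (match counts, single mismatched reference bases, and deleted bases prefixed with `^`). Deleted bases are not counted as mismatches.

```python
import re

# One token is a run of digits (matches), a deletion (^ followed by bases),
# or a single mismatched reference base.
_MD_TOKEN = re.compile(r"\d+|\^[A-Za-z]+|[A-Za-z]")


def count_mismatches(md: str) -> int:
    """Count mismatched bases in an MD tag value, ignoring deletions."""
    count = 0
    for token in _MD_TOKEN.findall(md):
        # Digits are match lengths and "^..." tokens are deletions;
        # only bare single letters are mismatches.
        if token[0].isalpha():
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical MD strings for illustration.
    for md in ("36", "10A5^AC6", "0A0C10"):
        print(md, count_mismatches(md))
```

With pysam, the string would come from something like `read.get_tag('MD')`, as in the excerpt; the sketch deliberately takes a plain string so it runs without a BAM file.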
bcftools is quite complex: it has roughly 20 commands, and each command has many parameters of its own.
annotate .. edit VCF files, add or remove annotations
call .. SNP/indel calling (former "view")
cnv .. Copy Number Variation caller
concat .. concatenate VCF/BCF files from the same set of samples
consensus ..
See: Question: How to extract all non-sequenced positions from a genome (Fasta file)?
test.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtaaattgttt
t
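One possible way to answer the question is sketched below. It is not the solution from the linked post: it is a minimal example that joins the sequence lines of each FASTA record and reports every run of N (or n) as a 0-based, half-open interval, BED style. The function name `n_runs` and the tiny input are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def n_runs(fasta_lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, start, end) for each run of N/n, 0-based half-open."""

    def runs(name: str, chunks: List[str]) -> Iterator[Tuple[str, int, int]]:
        # Join the wrapped lines so runs spanning line breaks stay whole.
        seq = "".join(chunks)
        for m in re.finditer(r"[Nn]+", seq):
            yield name, m.start(), m.end()

    name = None
    chunks: List[str] = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield from runs(name, chunks)
            # Record name is the first word after ">".
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield from runs(name, chunks)


if __name__ == "__main__":
    demo = [">chr1", "NNNNacgt", "NNac"]
    for name, start, end in n_runs(demo):
        print(f"{name}\t{start}\t{end}")
```

This holds one whole record in memory at a time, which is simple but may be heavy for very large chromosomes; a streaming version that tracks the current offset would avoid that.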
The code is as follows:
#!/usr/bin/perl -w
use strict;
# Both arguments are required: the VCF file and the genome FASTA.
die "perl $0 <vcf> <genome>" if(@ARGV < 2);
#Author:yueyao@genomics.cn
my $vcf=shift;
my $genome=shift;
my %hash;
my $id;
open GENOME,$genome or die $!;
while(<GENOME>){
    chomp;
    if(/^>/)