Carp genome: http://www.ntv.cn/a/20140923/52953.shtml
 
Regarding the sequencing of the carp genome, the data quality control has been called into question.
Why you should QC your reads AND your assembly?
 
http://grahametherington.blogspot.co.uk/2014/09/why-you-should-qc-your-reads-and-your.html

The genome sequence of the Common Carp (Cyprinus carpio) was published in Nature last week. By coincidence, I was doing some QC on some domesticated ferret (Mustela putorius furo) reads, which had thrown some k-mer warnings in the FastQC tool. I blasted the k-mers at NCBI and was quite perplexed by the number of hits I found in the carp genome: nearly all of the first 150 hits were from it. Anyway, I looked a bit further into my odd k-mers, and it turns out that they were the ends of some Illumina adapter sequences that had presumably been incorporated into the paired reads from fragments at the shorter end of the insert-size range. This took me back to the carp genome: what had crept into that?

In the paper, the authors state that they used 454, Illumina and SOLiD sequencing, along with some previously published BAC-end sequences. The BAC-end and 454 sequences were assembled with the Celera assembler, and the Illumina, SOLiD and 454 8 kb mate-pair sequences were mapped to the assembly to construct the scaffolds. Finally, they used the pairing information from the short paired-end reads to fill the gaps between the scaffolds. The final assembly consists of 9377 scaffolds.

The only quality control they speak of is "We then filtered out low-quality and short reads to obtain a set of usable reads".

So I thought I'd look at what was actually in their assembly. I downloaded the carp genome assembly (9377 scaffolds) and created a BLAST database from it, then created a FASTA file of Illumina adapter sequences (found here) and used them as query sequences to blast against the carp genome. There is some redundancy in the Illumina adapter sequences, so I collapsed them, retaining only unique sequences, and then removed any adapter sequences that were sub-sequences of a longer adapter (the final file consisted of 81 sequences). The blast resulted in 3750 hits (e-value < 8.00E-06), of which 1009 had 100% identity.
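The collapsing step described above can be sketched roughly as follows. This is a minimal illustration and not the script actually used for the post: the function name `collapse_adapters` and the toy sequences are made up for the example, and it assumes plain string containment is enough (it does not consider reverse complements).

```python
def collapse_adapters(sequences):
    """Return unique sequences that are not contained in another, longer one.

    Hypothetical helper for illustration only; reverse complements are
    deliberately ignored.
    """
    # Normalise case and drop exact duplicates, keeping first-seen order.
    unique = list(dict.fromkeys(s.strip().upper() for s in sequences if s.strip()))

    kept = []
    for seq in unique:
        # A sequence is redundant if it occurs inside any other unique sequence.
        contained = any(seq != other and seq in other for other in unique)
        if not contained:
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Toy sequences, not real Illumina adapters.
    toy = ["ACGT", "acgt", "ACGTAA", "TTTT", "CGT"]
    print(collapse_adapters(toy))  # ['ACGTAA', 'TTTT']
```

The pairwise containment check is quadratic in the number of sequences, which is fine for a list of adapters that collapses down to 81 entries.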

This gave me a final tally of at least 20 Illumina adapter sequences incorporated into the final Common Carp genome assembly. Out of the 9377 scaffolds, 277 appear to have Illumina adapter sequences in them. I've included the counts of the different (non-redundant) Illumina adapter sequences for the scaffolds at the bottom of the page.
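For anyone wanting to produce a similar tally, here is a hedged sketch of how the hits could be counted. It assumes the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`: query, subject, percent identity, ..., e-value, bit score); the post does not say which output format was actually used, and the function name `summarise_hits` is invented for the example.

```python
import csv
from collections import Counter


def summarise_hits(lines, max_evalue=8e-6):
    """Tally adapter-vs-scaffold hits from BLAST tabular (outfmt 6) lines.

    Assumes adapters were the queries (column 1) and scaffolds the
    subjects (column 2). Hypothetical helper for illustration only.
    """
    total = 0
    perfect = 0
    scaffolds = set()
    per_adapter = Counter()

    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        adapter, scaffold = row[0], row[1]
        pident, evalue = float(row[2]), float(row[10])
        if evalue >= max_evalue:
            continue  # keep only hits below the e-value cutoff
        total += 1
        if pident == 100.0:
            perfect += 1
        scaffolds.add(scaffold)
        per_adapter[adapter] += 1

    return {
        "hits": total,
        "perfect": perfect,
        "scaffolds": scaffolds,
        "per_adapter": per_adapter,
    }
```

Passing an open file handle for the BLAST output as `lines` works, since `csv.reader` accepts any iterable of strings; `len(result["scaffolds"])` then gives the number of scaffolds with at least one adapter hit, and `result["per_adapter"]` the count for each adapter.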

I've not yet looked for adapter sequences used in SOLiD or 454 sequencing. It would be interesting to see what that throws up.

So, a lesson to be learned here. QC your assembly, especially if you're not overly stringent with your read QC.

Here's the data:
Common Carp genome scaffolds
Illumina adapter sequences
Illumina adapter sequences collapsed
Illumina adapters v Carp genome blast
