http://www.cbcb.umd.edu/software/jellyfish/
 
http://www.genome.umd.edu/jellyfish.html
https://github.com/gmarcais/Jellyfish/releases
 
 
wget https://github.com/gmarcais/Jellyfish/releases/download/v2.2.3/jellyfish-2.2.3-CentOS6.tar.gz
tar -zxvf jellyfish-2.2.3-CentOS6.tar.gz
 
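The jellyfish executable is in the bin directory; copy it into a directory on your PATH and it is ready to use.
Location of the installation on the server: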
/home/cmiao/jellyfish-2.2.3
 
 
 
 
 
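jellyfish cannot read fq.gz files; decompress them to fq first.
Decompress one file at a time (reads.fq.gz is a placeholder name): gunzip -c reads.fq.gz > reads.fq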
 
$ jellyfish count -t 30 -C -m 21 -s 150G  --min-quality=20 --quality-start=33 ./*.fastq
 
Assume a haploid genome, for simplicity. In the picture provided, the first peak at depth ~31 indicates amount of 1-copy content (in other words, the genome has exactly 1 copy of that kmer, so it is unique). The weak peak at ~62x indicates the amount of 2-copy content. Everything under ~11x can be assumed to be error kmers, unrelated to genome size.

So, to estimate manually, sum the counts of distinct kmers under the first peak and multiply by 1; add the sum of the counts of distinct kmers under the peak at 2x the depth of the first peak, multiplied by 2; and so on for all peaks. This will give you the haploid genome size. So if your genome is tetraploid, the actual size will be 1/4 of your result, since the first peak will correspond to mutations present on only 1 ploidy (1/0/0/0 genotype).
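
A minimal Python sketch of the manual estimate described above. It assumes the histogram is already in memory as a dict mapping depth to the number of distinct kmers at that depth. The function name, the toy histogram, and the rule that assigns each depth to its nearest peak are illustrative assumptions, not part of the original answer.

```python
def estimate_genome_size(histogram, peak_depth, error_cutoff):
    """Sum kmer counts weighted by copy number (hypothetical helper).

    histogram    -- dict: depth -> number of distinct kmers seen at that depth
    peak_depth   -- depth of the 1-copy peak (about 31 in the text)
    error_cutoff -- depths below this are treated as error kmers (about 11)
    """
    total = 0
    for depth, n_kmers in histogram.items():
        if depth < error_cutoff:
            continue  # error kmers, unrelated to genome size
        # Assumption: assign each depth to the nearest multiple of the
        # first peak, so ~31x -> 1 copy, ~62x -> 2 copies, and so on.
        copies = max(1, int(depth / peak_depth + 0.5))
        total += n_kmers * copies
    return total


# Toy numbers, invented for illustration only.
toy_histogram = {1: 1000, 5: 50, 30: 100, 31: 120, 32: 100, 62: 10, 93: 2}
size = estimate_genome_size(toy_histogram, peak_depth=31, error_cutoff=11)
print(size)  # (100 + 120 + 100) * 1 + 10 * 2 + 2 * 3 = 346
```

With real data, peak_depth and error_cutoff are read off the plotted histogram, so the result is only as good as those two calls.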

You can make this more accurate by modelling the peaks as a sum of Gaussian curves, but that probably won't change the result much. Of course, this method is subjective because calling peaks is subjective.

Please note - I think 17-mers are too short for this kind of analysis. I prefer 31-mers because they are the longest computationally-efficient kmers. Also, FYI, BBNorm is faster than Jellyfish and can also generate kmer-frequency histograms:

khist.sh in=reads.fq hist=khist.txt

Also, it makes more sense to plot these things as log-log rather than linear-linear; and the Y-axis should be count, not frequency, which is useless for the purpose of genome-size estimation.
 

Outline

  1. count k-mer occurrences using Jellyfish (jellyfish count)
  2. summarize as a histogram (jellyfish histo)
  3. plot the graph with R
  4. determine the total number of k-mers analyzed and the peak position
  5. compare the peak shape with a Poisson distribution
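
Step 5 can be sketched in Python as follows. This assumes the peak position has already been read off the histogram and is used directly as the Poisson mean; the function name and the example values (peak at 31, one million distinct k-mers) are hypothetical.

```python
import math


def poisson_expected(peak_depth, n_distinct, max_depth):
    """Expected number of distinct k-mers at each depth 1..max_depth,
    assuming depth follows Poisson(lambda = peak_depth). Hypothetical helper."""
    lam = float(peak_depth)
    expected = {}
    for depth in range(1, max_depth + 1):
        # Poisson pmf computed in log space to avoid overflow in depth!
        log_p = depth * math.log(lam) - lam - math.lgamma(depth + 1)
        expected[depth] = n_distinct * math.exp(log_p)
    return expected


expected = poisson_expected(peak_depth=31, n_distinct=1_000_000, max_depth=200)
mode = max(expected, key=expected.get)  # depth with the highest expected count
```

Plotting these values next to the observed counts around the peak shows how far the real peak departs from the Poisson shape.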

Count k-mer occurrence

In this example we have 5 pairs of fastq files in three different subdirectories. The files to process can be specified with "*/*.qf.fastq" and verified with ls.

  $ ls */*.qf.fastq
  run1/s_1_1_sequence.qf.fastq run2/s_2_2_sequence.qf.fastq
  run1/s_1_2_sequence.qf.fastq run3/s_1_1_sequence.qf.fastq
  run2/s_1_1_sequence.qf.fastq run3/s_1_2_sequence.qf.fastq
  run2/s_1_2_sequence.qf.fastq run3/s_2_1_sequence.qf.fastq
  run2/s_2_1_sequence.qf.fastq run3/s_2_2_sequence.qf.fastq

Next, we issue the jellyfish count command

  $ jellyfish count -t 8 -C -m 25 -s 5G -o spec1_25mer --min-quality=20 --quality-start=33 */*.qf.fastq
-t 8
specifies the number of threads to use. Set it to the number of cores on the machine, or to the number of slots you reserved through the job management system ($NSLOTS in SGE or UGE).
-C
specifies that both strands are considered. Without it the apparent depth would be halved, which is undesirable.
-m 25
specifies the k-mer length; here 25-mers are counted (k=25).
-s 5G
specifies the hash size (the number of entries in the hash table). Make it as large as the physical memory allows: a larger hash makes counting faster, but exceeding the available memory leads to failure or extremely slow counting.
-o spec1_25mer
specifies the prefix of the output file names.
--quality-start=33
specifies that the quality strings in your fastq files use an offset of 33. Check your data format carefully: depending on the sequencing system and software version, the offset may be 64. This option is relevant only when you specify --min-quality.
--min-quality=20
specifies that nucleotides with a quality value lower than 20 are not included in the count. This reduces the number of k-mers derived from sequencing errors and makes the peak clearer.
*/*.qf.fastq
is expanded by the shell to the ten file names listed above, which are passed to jellyfish as input files.

Summarize as histogram (jellyfish histo)

First confirm that you got the output file

  $ ls spec1_25mer*
  spec1_25mer_0

There is a single file, spec1_25mer_0, so pass it to jellyfish histo:

  $ jellyfish histo -o spec1_25mer.histo spec1_25mer_0

Confirm that you got the output

  $ ls spec1_25mer*
  spec1_25mer_0 spec1_25mer.histo

Inspect the numbers by eye

  $ head -25 spec1_25mer.histo
  1 461938583
  2 95606044
  3 19280477
  4 13836754
  5 11018480
  6 9555090
  7 8557935
  8 7863244
  9 7319505
  10 6920880
  11 6589723
  12 6321923
  13 6148638
  14 6036120
  15 5972264
  16 5962234
  17 5987696
  18 6051171
  19 6154429
  20 6297373
  21 6485135
  22 6700579
  23 6932570
  24 7217627
  25 7533211
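
Step 4 of the outline (total number of k-mers and peak position) can be sketched in Python from the histogram text. The helper names are hypothetical. Using the first rise in counts as the boundary between error k-mers and the main peak is an assumption made for this sketch. The excerpt below is only the first 25 lines shown above, so it is enough to locate the valley but not the peak; the other two helpers are meant for the full file.

```python
def parse_histo(text):
    """Parse 'depth count' lines, as written by jellyfish histo, into a dict."""
    hist = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            hist[int(fields[0])] = int(fields[1])
    return hist


def find_valley(hist):
    """Depth of the first local minimum: the last depth before counts rise."""
    depths = sorted(hist)
    for prev, cur in zip(depths, depths[1:]):
        if hist[cur] > hist[prev]:
            return prev
    return None  # counts never rise, so no valley is visible


def find_peak(hist, valley):
    """Depth with the highest count to the right of the valley."""
    right = {d: c for d, c in hist.items() if d > valley}
    return max(right, key=right.get)


def total_kmers(hist, valley):
    """Total k-mer occurrences (depth * count) above the error valley."""
    return sum(d * c for d, c in hist.items() if d > valley)


excerpt = """\
1 461938583
2 95606044
3 19280477
4 13836754
5 11018480
6 9555090
7 8557935
8 7863244
9 7319505
10 6920880
11 6589723
12 6321923
13 6148638
14 6036120
15 5972264
16 5962234
17 5987696
18 6051171
19 6154429
20 6297373
21 6485135
22 6700579
23 6932570
24 7217627
25 7533211
"""
hist = parse_histo(excerpt)
print(find_valley(hist))  # 16: the count at 16 is the lowest before the rise
```

On the full spec1_25mer.histo file, find_peak(hist, valley) gives the peak position and total_kmers(hist, valley) the number of k-mers counted above the error range.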
 
 
Runtime error:
terminate called after throwing an instance of 'jellyfish::invertible_hash::ErrorAllocation'
  what():  Failed to allocate 628292358736 bytes of memory
Aborted (core dumped)
Solution:
50G should be more than enough. That is the amount of memory I usually am using and I have never had any problems. This also means that you probably do not need the high memory nodes.
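
A rough way to pick a smaller -s value, instead of asking for as much as possible, is the rule of thumb from the Jellyfish documentation: the expected number of distinct k-mers is about G + G*c*e*k, for genome size G, coverage c, per-base error rate e and k-mer length k. Below is a minimal Python sketch of that arithmetic; the function name and the example values are assumptions chosen for illustration only.

```python
def suggested_hash_size(genome_size, coverage, error_rate, k):
    """Rough number of distinct k-mers to expect: G + G*c*e*k."""
    return int(genome_size * (1 + coverage * error_rate * k))


# Hypothetical values: 1 Gb genome, 30x coverage, 1% error rate, k = 21.
entries = suggested_hash_size(1_000_000_000, 30, 0.01, 21)
print(f"-s {entries / 1e9:.1f}G")  # about 7.3G entries, far below 150G
```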
 
 
 
 
 
freemao
FAFU
