STAR manual
Source: STARmanual.pdf
Source: Calling variants in RNAseq
PART0 Preparation
#Tools that must be installed before STAR
#Red Hat, CentOS, Fedora.
sudo yum update
sudo yum install make
sudo yum install gcc-c++
sudo yum install glibc-static
PART1 Quick start
#STAR Basic workflow
###1. Generating genome indexes files
###(build the index)
###2. Mapping reads to the genome
###(align the reads to the genome)
#Example
##1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:
##(building the index, using the human genome as an example)
genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir \
--genomeFastaFiles hg19.fa --runThreadN <n>
##2) Alignment jobs were executed as follows:
##(alignment)
runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq \
--runThreadN <n>
##3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:
##(extract the splice junction information from the 1-pass STAR results and build the index again)
genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir \
--genomeFastaFiles hg19.fa \
--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab \
--sjdbOverhang 75 --runThreadN <n>
##4) The resulting index is then used to produce the final alignments as follows:
##(align the reads against the new index to produce the final output files)
runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq \
--runThreadN <n>
PART2 Generating genome indexes files
#Basic options for building the index
--runThreadN NumberOfThreads (number of threads)
--runMode genomeGenerate (run mode that generates the index)
--genomeDir /path/to/genomeDir (directory where the index is stored)
--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 ... (reference genome sequence files, e.g. Ref.fa)
--sjdbGTFfile /path/to/annotations.gtf (reference genome annotation file, e.g. Ref.gtf)
### STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is highly recommended whenever they are available. Starting from 2.4.1a, the annotations can also be included on the fly at the mapping step.
--sjdbOverhang ReadLength-1
###Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.
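The max(ReadLength)-1 rule can be checked directly against a FASTQ file. The helper below is only a sketch: the function name `sjdb_overhang` is made up for these notes, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STAR): print max(ReadLength)-1 for a FASTQ file.
# Assumes 4-line FASTQ records, so the sequence is line 2 of every group of 4.
# With no argument (or "-") the reads are taken from standard input.
sjdb_overhang() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { if (max > 0) print max - 1; else exit 1 }' "${1:--}"
}
```

For compressed reads, something like `zcat mate1.fq.gz | sjdb_overhang` should work; the printed value is what would be passed to --sjdbOverhang.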
###Genome files comprise binary genome sequence, suffix arrays, text chromosome names/lengths, splice junction coordinates, and transcript/gene information.
PART3 mapping jobs
#Basic options for mapping jobs
--runThreadN NumberOfThreads (number of threads)
--genomeDir /path/to/genomeDir (directory where the index is stored)
--readFilesIn /path/to/read1 [/path/to/read2 ]
(RNA-seq FASTQ/FASTA files)
###If the read files are compressed, use this option:
#####--readFilesCommand UncompressionCommand
#####For example, for gzipped files (*.gz):
--readFilesCommand zcat
or
--readFilesCommand gunzip -c
#####For example, for bzip2-compressed files:
--readFilesCommand bunzip2 -c
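When a pipeline handles a mix of compressed and uncompressed inputs, the choice above can be made from the file extension. This is a sketch only: `read_files_command` is a hypothetical name, and the extension-to-command mapping simply mirrors the three cases listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STAR): pick a --readFilesCommand value
# from the extension of the read file. Prints an empty line for plain files.
read_files_command() {
    case "$1" in
        *.gz)  echo "zcat" ;;
        *.bz2) echo "bunzip2 -c" ;;
        *)     echo "" ;;
    esac
}

# Usage sketch (commented out so that nothing is executed here):
#   cmd=$(read_files_command mate1.fq.gz)
#   STAR --genomeDir "$genomeDir" --readFilesIn mate1.fq.gz mate2.fq.gz \
#        ${cmd:+--readFilesCommand $cmd}
# $cmd is left unquoted on purpose so that "bunzip2 -c" splits into two words.
```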
PART4 Output files
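#STAR produces several output files. By default they are written to the current working directory, but the output directory and file name prefix can be set as follows: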
--outFileNamePrefix /path/to/output/dir/prefix.
#log files
###Log.out/Log.progress.out/Log.final.out
#SAM
###Aligned.out.sam - alignments in standard SAM format.
#STAR can write the alignments in a chosen format
--outSAMtype BAM Unsorted
#Writes unsorted BAM instead of SAM; the output file is Aligned.out.bam
--outSAMtype BAM SortedByCoordinate
#Writes BAM sorted by coordinate; the output file is Aligned.sortedByCoord.out.bam, similar to the result of the samtools sort command
--outSAMtype BAM Unsorted SortedByCoordinate
#Writes two files: the unsorted Aligned.out.bam and the sorted Aligned.sortedByCoord.out.bam
#Splice junctions
#SJ.out.tab contains the following 9 columns
column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undefined, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang
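A quick way to get a feel for these columns is to tally them with awk. The function below is a sketch; the name `sj_summary` and the one-line report format are invented for these notes, and it relies only on the column meanings listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STAR): summarise an SJ.out.tab file.
sj_summary() {
    awk -F'\t' '
        { total++; uniq += $7 }      # column 7: uniquely mapping reads
        $6 == 0 { novel++ }          # column 6: 0 = unannotated junction
        $5 == 0 { noncanon++ }       # column 5: 0 = non-canonical motif
        END {
            printf "junctions=%d novel=%d noncanonical=%d unique_reads=%d\n",
                   total, novel, noncanon, uniq
        }
    ' "$1"
}
```

Calling `sj_summary SJ.out.tab` after a mapping run prints a single line with the four totals.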
STAR-Fusion is a software package for detecting fusion transcripts from STAR chimeric output.
PART5 Options related to downstream analysis
With the --quantMode TranscriptomeSAM option, STAR will output alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file (in addition to alignments in genomic coordinates in Aligned.*.sam/bam files).
With the --quantMode GeneCounts option, STAR will count the number of reads per gene while mapping.
The counts coincide with those produced by htseq-count with default parameters.
This option writes the file ReadsPerGene.out.tab:
- column 1: gene ID
- column 2: counts for unstranded RNA-seq
- column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
- column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
With --quantMode TranscriptomeSAM GeneCounts, STAR produces both files:
1) Aligned.toTranscriptome.out.bam
2) ReadsPerGene.out.tab
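Since only one of columns 2 to 4 is correct for a given library, downstream tools usually need a two-column table. The sketch below picks the column from a strandedness keyword borrowed from the htseq-count -s values above. The function name `gene_counts` is hypothetical, and skipping rows whose ID starts with "N_" (the summary rows such as N_unmapped at the top of the file) is an assumption not covered in the notes above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STAR): print "gene<TAB>count" from
# ReadsPerGene.out.tab for one strandedness setting.
#   unstranded -> column 2, yes -> column 3, reverse -> column 4
gene_counts() {
    local file=$1 strand=${2:-unstranded} col
    case "$strand" in
        unstranded) col=2 ;;
        yes)        col=3 ;;
        reverse)    col=4 ;;
        *) echo "unknown strandedness: $strand" >&2; return 1 ;;
    esac
    # Assumption: summary rows start with "N_" and are not genes, so skip them.
    awk -F'\t' -v c="$col" '$1 !~ /^N_/ { print $1 "\t" $c }' "$file"
}
```

For example, `gene_counts ReadsPerGene.out.tab reverse > counts.tsv` would extract column 4 for a reverse-stranded library.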
PART6 2-pass mapping
To detect novel splice junctions accurately, STAR's 2-pass mode is strongly recommended.
It does not increase the number of detected novel junctions, but allows detection of more spliced reads mapping to novel junctions.
The basic idea is:
first, run the 1st-pass STAR mapping (basic options are enough) and collect the splice junctions;
second, run the 2nd-pass STAR mapping using the splice junctions from the previous step.
#For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.
###1. Run 1st mapping pass for all samples with "usual" parameters. Using annotations is recommended either at the genome generation step, or at the mapping step.
###2. Run 2nd mapping pass for all samples, listing SJ.out.tab files from all samples in --sjdbFileChrStartEnd /path/to/sj1.tab /path/to/sj2.tab ....
#Per-sample 2-pass mapping.
###Annotated junctions will be included in both the 1st and 2nd passes. To run STAR 2-pass mapping for each sample separately, use --twopassMode Basic option. STAR will perform the 1st pass mapping, then it will automatically extract junctions, insert them into the genome index, and, finally, re-map all reads in the 2nd mapping pass. This option can be used with annotations, which can be included either at the run-time (see #1), or at the genome generation step.
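###In short: to run 2-pass mapping separately for each sample, the recommended option is --twopassMode Basic.
#####STAR runs the 1st mapping pass, extracts the junctions automatically, inserts them into the genome index, and finally re-maps all reads in the 2nd mapping pass.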
### --twopass1readsN defines the number of reads to be mapped in the 1st pass.
#####The default and most sensitive approach is to set it to -1 (or make it bigger than the number of reads in the sample) in which case all reads in the input read file(s) are used in the 1st pass.
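Put together, a per-sample 2-pass run is an ordinary mapping command with the two options above added. The sketch below only assembles and prints such a command so it can be inspected before running. The function name `twopass_cmd` and its argument order are invented for these notes; the STAR options are the ones described in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) a per-sample 2-pass STAR command.
# Arguments: genome directory, thread count, then one or two read files.
twopass_cmd() {
    local genome_dir=$1 threads=$2
    shift 2
    echo "STAR --runThreadN $threads --genomeDir $genome_dir" \
         "--readFilesIn $* --twopassMode Basic --twopass1readsN -1"
}
```

`twopass_cmd /path/to/hg19 8 mate1.fq mate2.fq` prints the full command line. Setting --twopass1readsN to -1 is spelled out here only for clarity, since it is the default.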
2-pass mapping with re-generated genome
- Run 1st pass STAR for all samples with “usual” parameters. Genome indices generated with annotations are recommended.
- Collect all junctions detected in the 1st pass by merging SJ.out.tab files from all runs. Filter the junctions by removing likely false positives, e.g. junctions in the mitochondrion genome, or non-canonical junctions supported by a few reads. If you are using annotations, only novel junctions need to be considered here, since annotated junctions will be re-used in the 2nd pass anyway.
- Use the filtered list of junctions from the 1st pass with --sjdbFileChrStartEnd option, together with annotations (via --sjdbGTFfile option) to generate the new genome indices for the 2nd pass mapping. This needs to be done only once for all samples.
- Run the 2nd pass mapping for all samples with the new genome index.
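The merge-and-filter step can be sketched with awk. Everything specific here is an assumption made for illustration: the function name `filter_junctions`, the mitochondrial chromosome names (chrM, MT), the read threshold, and the choice to sum uniquely mapping reads across samples. The output uses four tab-separated columns (chromosome, start, end, strand written as +, - or .), which is believed to be the layout --sjdbFileChrStartEnd expects; check this against the manual for your STAR version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STAR): merge SJ.out.tab files from several
# samples and keep the novel junctions worth adding to the 2nd-pass index.
# Usage: filter_junctions MIN_READS sample1/SJ.out.tab sample2/SJ.out.tab ...
filter_junctions() {
    local min_reads=$1
    shift
    awk -F'\t' -v OFS='\t' -v min="$min_reads" '
        $1 == "chrM" || $1 == "MT" { next }   # assumed mitochondrial names
        $6 == 1 { next }                       # annotated: re-used from the GTF
        {
            strand = ($4 == 1 ? "+" : ($4 == 2 ? "-" : "."))
            key = $1 OFS $2 OFS $3 OFS strand
            reads[key] += $7                   # unique reads summed over samples
            if ($5 > 0) canonical[key] = 1     # column 5 > 0: canonical motif
        }
        END {
            # keep canonical junctions, and non-canonical ones with enough reads
            for (k in reads)
                if ((k in canonical) || reads[k] >= min) print k
        }
    ' "$@" | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}
```

The result can be redirected to a file, e.g. `filter_junctions 3 */SJ.out.tab > SJ.filtered.tab`, and that file passed to --sjdbFileChrStartEnd when the 2nd-pass index is generated.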