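[4] Proteomics identification software: MSGFPlus
1. Introduction
MSGF+ is another protein identification tool that has been widely used in recent years. It is written in Java, was first published in 2008 in JPR (Journal of Proteome Research), and an upgraded version was published in 2014 in NC (Nature Communications). It is free, open source, and still actively updated and maintained. Researchers who have compared different proteomics identification tools have also reported that MSGF+ performs very well (I cannot locate the reference at the moment).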
GitHub source: https://github.com/MSGFPlus/msgfplus
Supported input formats: mzML, mzXML, Mascot Generic File (mgf), MS2 files, Micromass Peak List files (pkl), Concatenated DTA files (_dta.txt)
MS-GF+ mainly targets the HUPO PSI standard formats: mzML for input and mzIdentML (abbreviated mzid) for output, which is easy to convert to TSV.
For the mzIdentML format, see http://www.psidev.info/mzidentml
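2. Installation and running
Download: https://github.com/MSGFPlus/msgfplus/releases
For usage, MS-GF+ has very detailed documentation: https://msgfplus.github.io/msgfplus/index.html
Parameter configuration files:
https://github.com/MSGFPlus/msgfplus/tree/master/docs/ParameterFiles
For running the search, the documentation provides many examples and explanations of the parameters:
https://msgfplus.github.io/msgfplus/MSGFPlus.html
Run example 1: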
java -Xmx4000M -jar MSGFPlus.jar \
-s test.mzML \
-d uniprot_swissprot_human_20190313_20417.fasta \
-t 20ppm -ti -1,2 -ntt 0 -tda 1 -e 0 -m 3 -inst 3 -minCharge 1 -maxCharge 6 -addFeatures 1 \
-mod Mods.txt \
-o test.mzid
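The `-t 20ppm -ti -1,2` pair in the command above controls precursor matching. The parameter-file comments quoted later in this post describe the rule as abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2. The following Python sketch illustrates that rule only; it is not MS-GF+ code, the function name is hypothetical, and it assumes the ppm tolerance is taken relative to the calculated mass.

```python
# Spacing between adjacent isotope peaks (Da), as quoted in the MS-GF+ parameter comments.
ISOTOPE_SPACING_DA = 1.00335


def matching_isotope_offsets(exp_mass, calc_mass, tol_ppm=20.0, iso_range=(-1, 2)):
    """Return every isotope offset n for which
    |exp_mass - calc_mass - n * 1.00335| is below the ppm tolerance.

    Assumption: the ppm tolerance is converted to Da using calc_mass.
    """
    tol_da = calc_mass * tol_ppm * 1e-6
    low, high = iso_range
    return [
        n
        for n in range(low, high + 1)
        if abs(exp_mass - calc_mass - n * ISOTOPE_SPACING_DA) < tol_da
    ]


if __name__ == "__main__":
    # Made-up masses: the observed precursor sits one isotope peak above the theoretical mass.
    print(matching_isotope_offsets(1501.0040, 1500.0))
    # Made-up masses: half a Dalton off, so no offset in the range explains the difference.
    print(matching_isotope_offsets(1500.5, 1500.0))
```

An empty list means the candidate peptide would be rejected under this rule; widening `iso_range` admits more candidates at the cost of a larger search space.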
The modification file Mods.txt contains the following:
# This file is used to specify modifications
# # for comments
#
# Max Number of Modifications per peptide
# If this value is large, the search takes long.
NumMods=2
# To input a modification, use the following command:
# Mass or CompositionStr, Residues, ModType, Position, Name (all the five fields are required).
# CompositionStr (C[Num]H[Num]N[Num]O[Num]S[Num]P[Num]Br[Num]Cl[Num]Fe[Num])
# - C (Carbon), H (Hydrogen), N (Nitrogen), O (Oxygen), S (Sulfur), P (Phosphorus), Br (Bromine), Cl (Chlorine), Fe (Iron), and Se (Selenium) are allowed.
# - Negative numbers are allowed.
# - E.g. C2H2O1 (valid), H2C1O1 (invalid)
# Mass can be used instead of CompositionStr. It is important to specify accurate masses (integer masses are insufficient).
# - E.g. 15.994915
# Residues: affected amino acids (must be upper letters)
# - Must be upper letters or *
# - Use * if this modification is applicable to any residue.
# - * should not be "anywhere" modification (e.g. "15.994915, *, opt, any, Oxidation" is not allowed.)
# - E.g. NQ, *
# ModType: "fix" for fixed modifications, "opt" for variable modifications (case insensitive)
# Position: position in the peptide where the modification can be attached.
# - One of the following five values should be used:
# - any (anywhere), N-term (peptide N-term), C-term (peptide C-term), Prot-N-term (protein N-term), Prot-C-term (protein C-term)
# - Case insensitive
# - "-" can be omitted
# - E.g. any, Any, Prot-n-Term, ProtNTerm => all valid
# Name: name of the modification (Unimod PSI-MS name)
# - For proper mzIdentML output, this name should be the same as the Unimod PSI-MS name
# - E.g. Phospho, Acetyl
# - Visit http://www.unimod.org to get PSI-MS names.
C2H3N1O1,C,fix,any,Carbamidomethyl # Fixed Carbamidomethyl C
#144.102063,*,fix,N-term,iTRAQ4plex # iTRAQ 4 plex
#144.102063,K,fix,any,iTRAQ4plex # iTRAQ 4 plex
# Variable Modifications (default: none)
O1,M,opt,any,Oxidation # Oxidation M
#15.994915,M,opt,any,Oxidation # Oxidation M (mass is used instead of CompositionStr)
H-1N-1O1,NQ,opt,any,Deamidated # Negative numbers are allowed.
#C2H3NO,*,opt,N-term,Carbamidomethyl # Variable Carbamidomethyl N-term
#H-2O-1,E,opt,N-term,Glu->pyro-Glu # Pyro-glu from E
#H-3N-1,Q,opt,N-term,Gln->pyro-Glu # Pyro-glu from Q
#C2H2O,*,opt,Prot-N-term,Acetyl # Acetylation Protein N-term
#C2H2O1,K,opt,any,Acetyl # Acetylation K
#CH2,K,opt,any,Methyl # Methylation K
#HO3P,STY,opt,any,Phospho # Phosphorylation STY
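The five-field format described in the comments above (Mass or CompositionStr, Residues, ModType, Position, Name) is simple enough to check by hand before a long search. The Python sketch below parses one line of Mods.txt and converts a composition string to a monoisotopic mass delta. It is an illustration, not the parser MS-GF+ uses: the function names are hypothetical, the element table uses standard monoisotopic masses rounded to 8 decimals, and only C, H, N, O, S, and P are covered.

```python
import re

# Standard monoisotopic masses in Da (rounded); only a subset of the allowed elements.
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.99491462,
    "S": 31.97207100,
    "P": 30.97376163,
}

_TOKEN = re.compile(r"([A-Z][a-z]?)(-?\d+)?")
_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def composition_mass(composition):
    """Sum element masses for a string such as 'C2H3N1O1' or 'H-1N-1O1'.
    A missing count means 1 (e.g. 'HO3P')."""
    total, pos = 0.0, 0
    for match in _TOKEN.finditer(composition):
        if match.start() != pos:
            break  # gap means an unparseable character
        element, count = match.group(1), match.group(2)
        if element not in MONO_MASS:
            raise ValueError(f"element not in table: {element}")
        total += MONO_MASS[element] * (int(count) if count else 1)
        pos = match.end()
    if pos == 0 or pos != len(composition):
        raise ValueError(f"cannot parse composition: {composition!r}")
    return total


def parse_mod_line(line):
    """Return a dict for a modification line, or None for comments,
    blank lines, and settings such as NumMods=2."""
    body = line.split("#", 1)[0].strip()
    if not body or "=" in body:
        return None
    fields = [f.strip() for f in body.split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    mass_or_comp, residues, mod_type, position, name = fields
    if _NUMBER.fullmatch(mass_or_comp):
        mass = float(mass_or_comp)
    else:
        mass = composition_mass(mass_or_comp)
    return {
        "mass": mass,
        "residues": residues,
        "type": mod_type.lower(),  # ModType is case insensitive
        "position": position,
        "name": name,
    }


if __name__ == "__main__":
    for text in (
        "C2H3N1O1,C,fix,any,Carbamidomethyl # Fixed Carbamidomethyl C",
        "H-1N-1O1,NQ,opt,any,Deamidated # Negative numbers are allowed.",
        "#HO3P,STY,opt,any,Phospho # commented out, so skipped",
    ):
        print(parse_mod_line(text))
```

Comparing the computed delta with the mass listed on Unimod is a quick way to catch a mistyped composition.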
Run example 2:
java -Xmx4g -Xms1g -jar MSGFPlus.jar \
-conf MSGFPlus_Parameters.txt \
-d test.fasta \
-s test.mzML \
-o test.mzid
The parameter configuration file MSGFPlus_Parameters.txt contains the following:
#Parent mass tolerance
# Examples: 2.5Da or 30ppm
# Use comma to set asymmetric values, for example "0.5Da,2.5Da" will set 0.5Da to the left (expMass<theoMass) and 2.5Da to the right (expMass>theoMass)
PrecursorMassTolerance=20ppm
#Max Number of Modifications per peptide
# If this value is large, the search will be slow
NumMods=5
#Modifications (see below for examples)
StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed Carbamidomethyl C
DynamicMod=O1, M, opt, any, Oxidation # Oxidized methionine
DynamicMod=H-1N-1O1, NQ, opt, any, Deamidated # Deamidation of Asparagine/Glutamine (+0.984016)
#Custom amino acids
CustomAA=C3H5NO, U, custom, U, Selenocysteine # Custom amino acids can only have C, H, N, O, and S
#CustomAA=H0, X, custom, X, RemoveAA # Remove AA
#Fragmentation Method
# 0 means as written in the spectrum or CID if no info (Default)
# 1 means CID
# 2 means ETD
# 3 means HCD
# 4 means Merge spectra from the same precursor (e.g. CID/ETD pairs, CID/HCD/ETD triplets)
FragmentationMethodID=3
#Instrument ID
# 0 means Low-res LCQ/LTQ (Default for CID and ETD); use InstrumentID=0 if analyzing a dataset with low-res CID and high-res HCD spectra
# 1 means High-res LTQ (Default for HCD; also appropriate for high res CID); use InstrumentID=1 for Orbitrap, Lumos, and QEHFX instruments
# 2 means TOF
# 3 means Q-Exactive
InstrumentID=1
#Enzyme ID
# 0 means No enzyme used
# 1 means Trypsin (Default); use this along with NTT=0 for a no-enzyme search of a tryptically digested sample
# 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Enzyme (for peptidomics)
EnzymeID=1
#Isotope error range
# Takes into account the error introduced by choosing a non-monoisotopic peak for fragmentation.
# Useful for accurate precursor ion masses
# Ignored if the parent mass tolerance is > 0.5Da or 500ppm
# The combination of -t and -ti determines the precursor mass tolerance.
# e.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
IsotopeErrorRange=0,3
#Number of tolerable termini
# The number of peptide termini that must have been cleaved by the enzyme (default 1)
# For trypsin, 2 means fully tryptic only, 1 means partially tryptic, and 0 means no-enzyme search
NTT=2
#Target/Decoy search mode
# 0 means don't search decoy database (default)
# 1 means search decoy database to compute FDR (source FASTA file must be forward-only proteins)
TDA=1
#Number of Threads (by default, uses all available cores)
NumThreads=8
#Minimum peptide length to consider
MinPepLength=6
#Maximum peptide length to consider
MaxPepLength=50
#Minimum precursor charge to consider (if not specified in the spectrum)
MinCharge=1
#Maximum precursor charge to consider (if not specified in the spectrum)
MaxCharge=6
#Number of matches per spectrum to be reported
#If this value is greater than 1 then the FDR values computed by MS-GF+ will be skewed by high-scoring 2nd and 3rd hits
NumMatchesPerSpec=1
#Amino Acid Modification Examples
# Specific static modifications using one or more StaticMod= entries
# Specific dynamic modifications using one or more DynamicMod= entries
# Modification format is:
# Mass or CompositionStr, Residues, ModType, Position, Name (all the five fields are required).
# Examples:
# C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed Carbamidomethyl C (alkylation)
# O1, M, opt, any, Oxidation # Oxidation M
# 15.994915, M, opt, any, Oxidation # Oxidation M (mass is used instead of CompositionStr)
# H-1N-1O1, NQ, opt, any, Deamidated # Negative numbers are allowed.
# CH2, K, opt, any, Methyl # Methylation K
# C2H2O1, K, opt, any, Acetyl # Acetylation K
# HO3P, STY,opt, any, Phospho # Phosphorylation STY
# C2H3NO, *, opt, N-term, Carbamidomethyl # Variable Carbamidomethyl N-term
# H-2O-1, E, opt, N-term, Glu->pyro-Glu # Pyro-glu from E
# H-3N-1, Q, opt, N-term, Gln->pyro-Glu # Pyro-glu from Q
# C2H2O, *, opt, Prot-N-term, Acetyl # Acetylation Protein N-term
#Custom amino acids examples
# Only supports empirical formulas of elements C H N O S.
# If other elements are needed, or a specific mass is needed, they can be added as fixed modifications on the custom AA
# Maximum atom counts: 255 C, 255 H, 63 N, 63 O, 15 S
# Format spec is:
# EmpiricalFormula, ResidueSymbol, custom, OriginalAA, Name (all the five fields are required, though OriginalAA is not actually used for anything)
# Examples:
# C5H7N1O2S0,J,custom,P,Hydroxylation # Hydroxyproline
# C3H6N2O0S1,X,custom,C,Amidation # C-terminal amidation of Cys
# C5H5N1O1S0,Z,custom,E,Glu->pyro-Glu # N-terminal pyroGlu residue, from either Glu OR Gln
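Because the parameter file is plain `Key=Value` text with `#` comments and repeatable keys (StaticMod, DynamicMod, CustomAA), it is easy to read back programmatically, for example to log the settings used for a search. The Python sketch below is one way to do that; it is not part of MS-GF+, the function name is hypothetical, and it assumes that everything after the first `#` on a line is a comment.

```python
from collections import defaultdict


def parse_msgfplus_params(text):
    """Parse Key=Value lines into {key: [values...]}.

    Repeated keys (e.g. DynamicMod) accumulate in order.
    Assumption: '#' always starts a comment, so commented-out
    settings such as '#CustomAA=...' are ignored.
    """
    params = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()].append(value.strip())
    return dict(params)


if __name__ == "__main__":
    # A few lines copied from the parameter file shown above.
    sample = """\
#Parent mass tolerance
PrecursorMassTolerance=20ppm
NumMods=5
StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed Carbamidomethyl C
DynamicMod=O1, M, opt, any, Oxidation # Oxidized methionine
DynamicMod=H-1N-1O1, NQ, opt, any, Deamidated
#CustomAA=H0, X, custom, X, RemoveAA # Remove AA
TDA=1
"""
    for key, values in parse_msgfplus_params(sample).items():
        print(key, values)
```

Keeping values as lists, even for single-valued keys, avoids special-casing the modification entries.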
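3. Results
The native output format is mzIdentML; example file: test.mzid.
There are two ways to convert the mzid file to TSV so that the results are easier to read. See https://msgfplus.github.io/msgfplus/MzidToTsv.html for details:
- The first is the MzIDToTsv tool built into MSGFPlus.jar. It is easy to use but slow on large files.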
Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv
-i MzIDFile (MS-GF+ output file (*.mzid))
[-o TSVFile] (TSV output file (*.tsv) (Default: MzIDFileName.tsv))
[-showQValue 0/1] (0: do not show Q-values, 1: show Q-values (Default))
[-showDecoy 0/1] (0: do not show decoy PSMs (Default), 1: show decoy PSMs)
[-unroll 0/1] (0: merge shared peptides (Default), 1: unroll shared peptides)
- The second is the standalone MzidToTsvConverter.exe tool. It converts quickly and handles large files, but it is limited to Windows (Linux requires mono).
MzidToTsvConverter.exe -mzid:SearchResults.mzid -unroll -showDecoy
Example file after conversion to TSV: test_Unrolled.tsv
The header contains the following columns:
1 #SpecFile
2 SpecID
3 ScanNum
4 FragMethod
5 Precursor
6 IsotopeError
7 PrecursorError(ppm)
8 Charge
9 Peptide
10 Protein
11 DeNovoScore
12 MSGFScore
13 SpecEValue
14 EValue
15 QValue
16 PepQValue
ref:
https://msgfplus.github.io/msgfplus/index.html
http://www.psidev.info/mzidentml
https://omics.pnl.gov/software/ms-gf
https://github.com/MSGFPlus/msgfplus
https://github.com/MSGFPlus/msgfplus/tree/master/docs/ParameterFiles
https://msgfplus.github.io/msgfplus/MzidToTsv.html
https://github.com/MSGFPlus/msgfplus/releases
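Summary of this series on proteomics identification and quantification software:
[1] Protein identification software: X!Tandem
[2] Protein identification software: Comet
[3] Protein identification software: Mascot
[4] Proteomics identification software: MSGFPlus
[5] Proteomics identification and quantification software: PD
[6] Proteomics identification and quantification software: MaxQuant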