GWAS Catalog

The NHGRI-EBI Catalog of published genome-wide association studies

A database of published GWAS studies, maintained by EBI.

Catalog stats

  • Last data release on 2019-09-24
  • 4220 publications
  • 107486 SNPs
  • 157336 associations
  • Genome assembly GRCh38.p12
  • dbSNP Build 151
  • Ensembl Build 96

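Basic search methods

Search by trait: e.g. breast carcinoma. This returns well-standardized trait information. EFO (Experimental Factor Ontology), much like GO, is a controlled vocabulary for classifying traits. The results also list the genes associated with the trait.

Search by SNP: e.g. rs7329174. This returns details of the variant and the corresponding gene.

Search by author name: e.g. Yao. This returns the related publications.

Search by cytogenetic region: e.g. 2q37.1

Search by gene: e.g. HBS1L

Search by region: e.g. 6:16000000-25000000

Although it is called a database, it is really just one table, which can be downloaded from the Catalog site and is only about 100 MB.

The table contains the following columns: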

DATE ADDED TO CATALOG* +: Date a study is published in the catalog

PUBMEDID* +: PubMed identification number

FIRST AUTHOR* +: Last name and initials of first author

DATE* +: Publication date (online (epub) date if available)

JOURNAL* +: Abbreviated journal name

LINK* +: PubMed URL

STUDY* +: Title of paper

DISEASE/TRAIT* +: Disease or trait examined in study

INITIAL SAMPLE DESCRIPTION* +: Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)

REPLICATION SAMPLE DESCRIPTION* +: Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)

REGION*: Cytogenetic region associated with rs number

CHR_ID*: Chromosome number associated with rs number

CHR_POS*: Chromosomal position associated with rs number

REPORTED GENE(S)*: Gene(s) reported by author

MAPPED GENE(S)*: Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.

UPSTREAM_GENE_ID*: Entrez Gene ID for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_ID*: Entrez Gene ID for nearest downstream gene to rs number, if not within gene

SNP_GENE_IDS*: Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts

UPSTREAM_GENE_DISTANCE*: distance in kb for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_DISTANCE*: distance in kb for nearest downstream gene to rs number, if not within gene

STRONGEST SNP-RISK ALLELE*: SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype.

SNPS*: Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)

MERGED*: denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes;)

SNP_ID_CURRENT*: current rs number (will differ from strongest SNP when merged = 1)

CONTEXT*: SNP functional class

INTERGENIC*: denotes whether SNP is in intergenic region (0 = no; 1 = yes)

RISK ALLELE FREQUENCY*: Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.

P-VALUE*: Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 x 10^-7 is rounded to 5 x 10^-7).

PVALUE_MLOG*: -log(p-value)

P-VALUE (TEXT)*: Information describing context of p-value (e.g. females, smokers).

OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1. Appropriate unit and increase/decrease are included for beta coefficients.

95% CI (TEXT)*: Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.

PLATFORM (SNPS PASSING QC)*: Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable

CNV*: Study of copy number variation (yes/no)

ASSOCIATION COUNT+: Number of associations identified for this study
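A minimal sketch of reading such a table in Python, assuming the download is a tab-separated file whose header uses the column names listed above. The two rows are made-up placeholders rather than real Catalog records, `load_associations` is a hypothetical helper, and the base-10 logarithm for PVALUE_MLOG is an assumption.

```python
import csv
import io
import math

# Made-up two-row excerpt using a few of the column names listed above.
SAMPLE_TSV = (
    "SNPS\tCHR_ID\tCHR_POS\tP-VALUE\tOR or BETA\n"
    "rs0000001\t6\t16000001\t5E-8\t1.25\n"
    "rs0000002\t2\t23000000\t3E-12\t1.10\n"
)

def load_associations(handle):
    """Parse a tab-separated association table into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        p = float(row["P-VALUE"])
        row["P-VALUE"] = p
        # Same quantity as the PVALUE_MLOG column, assuming a base-10 log.
        row["mlog_p"] = -math.log10(p)
        rows.append(row)
    return rows

associations = load_associations(io.StringIO(SAMPLE_TSV))
# Keep only SNPs strictly below the conventional 5e-8 threshold.
genome_wide = [r["SNPS"] for r in associations if r["P-VALUE"] < 5e-8]
```

For the real file, the same function should work on `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` wrapper.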

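Some questions:

What is genotyping technology?

What is an Experimental Factor Ontology trait?

What is a cytogenetic region? (See karyotype.)

What does "trait + risk allele" mean? Here the concepts of SNP and allele must be kept apart: a SNP is a site, whereas an allele is the base found at that site. Also keep in mind the two DNA strands and ploidy.

What is the risk/effect allele frequency?

What kind of measure is the odds ratio in GWAS? From the wiki: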

The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of case for individuals having a specific allele and the odds of case for individuals who do not have that same allele.

As an example, suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).

When the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.[18]
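The T/C example above can be sketched in a few lines of Python. The counts are made up, the function names are hypothetical, and the p-value uses the plain Pearson chi-squared statistic with one degree of freedom (no continuity correction), which is one reading of the "simple chi-squared test" mentioned in the quote.

```python
import math

def allelic_odds_ratio(a, b, x, y):
    """Odds ratio for allele T: (A/B) / (X/Y).

    a, b: counts with allele T in cases / controls
    x, y: counts with allele C in cases / controls
    """
    return (a / b) / (x / y)

def chi2_2x2(a, b, x, y):
    """Pearson chi-squared statistic and p-value (1 d.o.f.) for the 2x2 table."""
    n = a + b + x + y
    chi2 = n * (a * y - b * x) ** 2 / ((a + b) * (x + y) * (a + x) * (b + y))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts: allele T is enriched in cases.
A, B, X, Y = 60, 40, 40, 60
odds_ratio = allelic_odds_ratio(A, B, X, Y)
chi2, p_value = chi2_2x2(A, B, X, Y)
```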

What is MAF? The frequency of the minor allele.

What annotations can GWAS data carry? Phenotype annotation, population and linkage disequilibrium (LD) information.

What are CP loci? An effective region associated with at least two phenotypes.

What is genotype calling?

What are the most basic QC steps for GWAS?

Quality Control Procedures for Genome Wide Association Studies

Data quality control in genetic case-control association studies

  • minor allele frequency (MAF) > 0.01; statistical power is extremely low for rare SNPs. This is easy to understand: a very rare SNP needs a very large sample size to reach sufficient power.
  • Hardy-Weinberg equilibrium (HWE) test p-value > 5E-05;
  • missing genotypes rate < 10%; Genotypes are classified as missing if the genotype-calling algorithm cannot infer the genotype with sufficient confidence. Can be calculated across each individual and/or SNP.
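As a rough illustration of the MAF and missingness criteria above, here is a sketch for a single SNP. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for a missing call. The coding, the function names and the example data are all assumptions for illustration, and the HWE criterion is left out.

```python
def snp_qc_stats(genotypes):
    """Return (maf, missing_rate) for one SNP.

    genotypes: alternate-allele counts per individual (0, 1 or 2),
    with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return None, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is rarer.
    maf = min(alt_freq, 1 - alt_freq)
    return maf, missing_rate

def passes_qc(genotypes, min_maf=0.01, max_missing=0.10):
    """Apply the MAF and missingness thresholds listed above."""
    maf, missing_rate = snp_qc_stats(genotypes)
    return maf is not None and maf > min_maf and missing_rate < max_missing

# Made-up data: 20 individuals with one missing call, and a rare SNP.
common = [0, 1, 2, 0, None, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
rare = [0] * 99 + [1]
common_ok = passes_qc(common)
rare_ok = passes_qc(rare)
```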

What is the Experimental Factor Ontology?

What is LD information (r2 and D' values)?

Mathematical properties of the r2 measure of linkage disequilibrium

The square of the correlation coefficient between two indicator variables – one representing the presence or absence of a particular allele at the first locus and the other representing the presence or absence of a particular allele at the second locus. Note the frequency dependence of r2: the values r2 can take depend on the allele frequencies (MAF) at the two loci.

Introduction to different measures of linkage disequilibrium (LD) and their calculation: covers the two common measures and how to compute them.
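To make the two measures concrete, here is a small sketch that computes D, D' and r2 from haplotype and allele frequencies, using the standard textbook definitions. The function name and the example frequencies are made up, and both loci are assumed to be polymorphic (otherwise r2 is undefined).

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    Assumes 0 < p_a < 1 and 0 < p_b < 1.
    """
    d = p_ab - p_a * p_b
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Made-up example: allele A (10%) only ever occurs together with allele B (50%).
d, d_prime, r2 = ld_measures(p_ab=0.1, p_a=0.1, p_b=0.5)
```

In this example D' reaches its maximum while r2 stays low, because the two allele frequencies differ so much; this is the frequency dependence of r2 mentioned above.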

NLM Catalog
The NLM Catalog provides access to NLM bibliographic data for journals, books, audiovisuals, computer software, electronic resources and other materials. Links to the library's holdings in LocatorPlus, NLM's online public access catalog, are also provided.

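Journal names in NCBI and in this database are all abbreviated. How can they be converted to full names?

Download the corresponding information from the NCBI database (NLM Catalog).

A little reformatting in Sublime Text is enough to obtain the mapping between abbreviations and full names.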
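The same reformatting can also be scripted. The sketch below assumes the downloaded journal list is a plain-text file of records separated by dashed lines, with `JournalTitle:` and `MedAbbr:` fields. That layout is an assumption to check against the actual file, and the sample text (including its `JrId` values) is only illustrative.

```python
# Illustrative sample in the assumed record layout.
SAMPLE = """\
--------------------------------------------------------
JrId: 1
JournalTitle: Nature genetics
MedAbbr: Nat Genet
ISSN (Print): 1061-4036
--------------------------------------------------------
JrId: 2
JournalTitle: American journal of human genetics
MedAbbr: Am J Hum Genet
"""

def parse_journal_list(text):
    """Build a dict mapping abbreviated journal name -> full title."""
    mapping = {}
    title = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # A line without a colon (e.g. the dashed separator) ends the record.
            title = None
            continue
        value = value.strip()
        if key == "JournalTitle":
            title = value
        elif key == "MedAbbr" and title and value:
            mapping[value] = title
    return mapping

abbr_to_title = parse_journal_list(SAMPLE)
```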

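How can the LD scores of these variants be computed within a specific population?

There is a ready-made resource for this: LDlink.

LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups.

There is also an R package that can be called directly: LDlinkR: Access LDlink API with R.

Problems:

  • The data cannot all be retrieved in one go. With hundreds of thousands of SNPs, reading them is difficult, and it is best to pass the function only 100 SNPs at a time.
  • Version mismatch: not every SNP in the GWAS Catalog has an LD score, so the code has to handle that case.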
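Both problems come down to batching the queries and recording the SNPs that return nothing. A language-neutral sketch in Python follows, where `query` is a hypothetical stand-in for whatever function actually talks to LDlink (it is not part of LDlinkR), and `fake_query` exists only to make the example runnable.

```python
def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def collect_ld(rsids, query, size=100):
    """Query SNPs in batches and separate hits from SNPs without a result.

    query: hypothetical callable taking a list of rsIDs and returning a
    dict {rsid: value} for the SNPs it knows about.
    """
    results, missing = {}, []
    for chunk in batches(rsids, size):
        found = query(chunk)
        for rsid in chunk:
            if rsid in found:
                results[rsid] = found[rsid]
            else:
                missing.append(rsid)
    return results, missing

# Stand-in query: pretends only even-numbered rsIDs have an LD score.
batch_sizes = []

def fake_query(chunk):
    batch_sizes.append(len(chunk))
    return {r: 0.5 for r in chunk if int(r[2:]) % 2 == 0}

rsids = [f"rs{i}" for i in range(250)]
results, missing = collect_ld(rsids, fake_query)
```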

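How do we learn to pose questions and test them with statistics and simulation?

The single most important question: could the result we observed have arisen by chance?

Answering it means running simulations and shuffling on our observations.

This part is very important, and also a lot of fun.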
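One common form of shuffling is a permutation test. The sketch below is a generic example, not the specific analysis of these notes: it asks whether the difference in means between two groups is larger than expected when group labels are shuffled at random. The data, the function name and the number of permutations are all made up.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values breaks any link between value and group.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]
group_b = [4.1, 3.9, 4.0, 4.2, 3.8, 4.3]
p_separated = permutation_p_value(group_a, group_b)
p_identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
```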

How to filter SNPs from the 1000 Genomes Project

quality control (QC)

  • Variants with minor allele frequency (MAF) > 0.01;
  • Variants with Hardy-Weinberg equilibrium (HWE) test p-value > 5E-05;
  • Variants with missing genotypes rate < 10%;

The 1000 Genomes data are distributed in VCF 4.1 format.

Quickly bulk-download the files in an FTP directory:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/*.gz

Convert VCF to PLINK format:

# Step 1 (already run, left commented out): convert each VCF to PLINK binary format.
#for i in `seq 2 22`
#do
#plink --vcf ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --set-missing-var-ids @:# --make-bed --out chr$i.phase3.v5a.plink
#done
#plink --vcf ALL.chrX.phase3_shapeit2_mvncall_integrated_v1a.20130502.genotypes.vcf.gz --missing-var-code NA --make-bed --out chrX.phase3.v5a.plink
#plink --vcf ALL.chr17.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --missing-var-code NA --make-bed --out chr17.phase3.v5a.plink
#plink --vcf ALL.chr8.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --missing-var-code NA --make-bed --out chr8.phase3.v5a.plink
# Step 2 (left commented out): earlier merge attempt.
#plink --bfile ./chr1.phase3.v5a.plink --merge-list merge.txt --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp --make-bed --out chr1_22_XY.v5a.plink
# Step 3: drop the variants listed in the .missnp file from every chromosome.
for i in `seq 1 22`
do
plink --bfile chr$i.phase3.v5a.plink --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp --make-bed --out chr$i.phase3.v5a.rmdup.plink
done
plink --bfile chrX.phase3.v1a.plink --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp --make-bed --out chrX.phase3.v1a.rmdup.plink
plink --bfile chrY.phase3.v1a.plink --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp --make-bed --out chrY.phase3.v1a.rmdup.plink
# Step 4: merge the cleaned per-chromosome files listed in merge.txt.
plink --bfile ./chr1.phase3.v5a.rmdup.plink --merge-list merge.txt --make-bed --out chr1_22_XY.v5a.plink

PLINK documentation - Whole genome association analysis toolset

PLINK 2.00 alpha

  

##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CS,Number=1,Type=String,Description="Source call set.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of this variant">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=MC,Number=.,Type=String,Description="Merged calls.">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this call, will be NULL NULL for START and E
##INFO=<ID=MEND,Number=1,Type=Integer,Description="Mitochondrial end coordinate of inserted sequence">
##INFO=<ID=MLEN,Number=1,Type=Integer,Description="Estimated length of mitochondrial insert">
##INFO=<ID=MSTART,Number=1,Type=Integer,Description="Mitochondrial start coordinate of inserted sequence">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="SV length. It is only calculated for structural variation MEIs. For other types of SVs, one may calculate the SV length by INFO:END-START+1
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=TSD,Number=1,Type=String,Description="Precise Target Site Duplication for bases, if unknown, value will be NULL">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth; only low coverage data were counted towards the DP, exome data were not used">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF,
##INFO=<ID=VT,Number=.,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=OLD_VARIANT,Number=1,Type=String,Description="old variant location. Format chrom:position:REF_allele/ALT_allele">

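Among these fields, AF (Estimated allele frequency in the range (0,1)) is the overall frequency of the alternate allele across all samples. It equals the MAF only when the alternate allele is the minor one; for a biallelic site, MAF = min(AF, 1 - AF).

How to read a VCF with random access: Introduction to vcfR

In fact the three metrics above are not that simple to obtain; they have to be computed yourself: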

Minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. They play a surprising role in heritability since MAF variants which occur only once, known as "singletons," drive an enormous amount of selection. Single nucleotide polymorphisms (SNPs) with a minor allele frequency of 0.05 (5%) or greater were targeted by the HapMap project.

How can I get the allele frequency of my variant?

You can calculate your frequency by dividing AC (allele count) by AN (allele number).
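A minimal sketch of that division, working directly on the INFO column of a VCF record. The INFO string below is made up (only the field names AC and AN come from the header shown earlier), and the helper names are hypothetical.

```python
def allele_frequencies(info):
    """Return one frequency per alternate allele, computed as AC / AN.

    info: the INFO column of a VCF record, e.g. "AC=...;AN=...;...".
    """
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    an = int(fields["AN"])
    # AC holds one count per alternate allele, comma-separated.
    return [int(ac) / an for ac in fields["AC"].split(",")]

def minor_allele_frequency(info):
    """MAF for a biallelic site: the rarer of the ALT and REF frequencies."""
    (alt_freq,) = allele_frequencies(info)
    return min(alt_freq, 1 - alt_freq)

# Made-up record in which the alternate allele is the more common one.
info = "AC=3004;AN=5008;NS=2504;VT=SNP"
af = allele_frequencies(info)[0]
maf = minor_allele_frequency(info)
```

Here the alternate-allele frequency is above 0.5, so the MAF is the frequency of the reference allele instead.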

Allele frequency calculator

perl -MCPAN -e shell

install IPC::System::Simple

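This is a Perl script. It is too old and the results it produced were not good, so I do not use it. It cost me a lot of time; the official method is more convenient.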

perl calculate_allele_frq_from_vcf.pl -vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -out_dir . -sample_panel integrated_call_samples_v3.20130502.ALL.panel \
-pop EAS,CHB -region 1:10177-565255 -tabix /Users/surgery/LZXworkdir/bin/miniconda3/bin/tabix -vcftools_dir /Users/surgery/LZXworkdir/bin/vcftools_0.1.13

The method recommended on the official website (a web version is also available):

grep EAS integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EAS.samples.list
grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.samples.list
vcf-subset -c EAS.samples.list ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | fill-an-ac |
bgzip -c > EAS.chr1.vcf.gz

  

Time to tie together the basic statistical genetics concepts:

Reference genome

Alignment

SNP

allele

Population

Allele frequency

Genotype frequency

Relationship and differences between allele frequency and MAF

Next:

How to obtain the HWE p-value of a SNP?

A genome-wide study of Hardy–Weinberg equilibrium with next generation sequence data

Evolution and the tree of life - a good open course on genetics

Allele Frequencies and Hardy‐Weinberg Equilibrium - a fairly thorough explanation of HWE

Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes - a practical reference

It is therefore of interest to test whether a population is in HWE at a locus. We will discuss the two most popular ways of testing HWE.

Hardy‐Weinberg Assumptions

  • infinite population
  • discrete generations
  • random mating
  • no selection
  • no migration in or out of population
  • no mutation
  • equal initial genotype frequencies in the two sexes

Filtering on HWE keeps us from selecting SNPs that deviate wildly; in other words, we only want SNPs that roughly satisfy the HWE assumptions.
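One way to get an HWE p-value by hand is a chi-squared goodness-of-fit test on the three genotype counts. The sketch below follows the textbook test with one degree of freedom. The counts and the function name are made up, the SNP is assumed to be polymorphic, and PLINK's `--hwe` filter relies on an exact test, so its p-values will not match these exactly.

```python
import math

def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Chi-squared HWE test from genotype counts; returns (chi2, p_value).

    Assumes both alleles are observed, so no expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts: the first SNP matches HWE, the second lacks heterozygotes.
chi2_ok, p_ok = hwe_chi2_test(25, 50, 25)
chi2_bad, p_bad = hwe_chi2_test(40, 20, 40)
keep_bad = p_bad > 5e-5  # threshold from the QC list above
```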

How to obtain the missing genotype rate?

The missing genotype rate is the complement of the call rate (the proportion of non-missing genotypes). Call rates were calculated using PLINK 1.90.

install.packages("LDlinkR")

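Installation requires R 3.5.2 or later.

It is best to run this in parallel; otherwise it is really too slow.

SNP filtering can be done with a single command (note that the thresholds used here, MAF 0.05 and HWE 0.001, are stricter than the ones listed in the QC criteria above):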

plink --file EAS.pop.founder.phase3 --maf 0.05 --geno 0.1 --hwe 0.001
plink --file EUR.pop.founder.phase3 --maf 0.05 --geno 0.1 --hwe 0.001

  

To be continued.
