Preliminary data processing (Perl)

  • Count each amplicon's RC (read counts), and merge amplicons that mutually overlap by more than 75% (a rough sketch follows this list)
  • Compute each amplicon's GC content and mean read count
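
    A rough R sketch of the merge step, under stated assumptions: a data frame `amp` with chr/start/end/rc columns sorted by chr then start (the name is illustrative), reciprocal overlap > 75% as the merge rule, and pooling of read counts on merge (the pooling rule is my guess, not confirmed by the source):

    # Hypothetical input: amp = data.frame(chr, start, end, rc), sorted by chr then start
    merge.amplicons <- function(amp, min.overlap = 0.75) {
      keep <- amp[1, ]
      for (k in 2:nrow(amp)) {
        last <- keep[nrow(keep), ]
        ov <- min(last$end, amp$end[k]) - max(last$start, amp$start[k]) + 1
        if (amp$chr[k] == last$chr &&
            ov / min(last$end - last$start + 1, amp$end[k] - amp$start[k] + 1) > min.overlap) {
          # reciprocal overlap above 75%: extend the interval and pool the read counts
          keep$end[nrow(keep)] <- max(last$end, amp$end[k])
          keep$rc[nrow(keep)]  <- last$rc + amp$rc[k]
        } else {
          keep <- rbind(keep, amp[k, ])
        }
      }
      keep
    }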



Sex identification and correction; normalization for library size, amplicon length and GC content

  • Library-size normalization

    NRC = an amplicon's library contribution in a sample (read counts * amplicon length) / the sample's average per-amplicon library size (sum(read counts * amplicon length) / number of amplicons); see the sketch after the figures below

    (Figure: before normalization)



    (Figure: after normalization)
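
    A minimal R sketch of this step, following the formula above (the names `rc`, a read-count matrix with amplicons in rows and samples in columns, and `len`, the amplicon lengths, are illustrative, not from the script):

    # Hypothetical inputs: rc = read-count matrix (amplicons x samples), len = amplicon lengths
    normalize.library.size <- function(rc, len) {
      apply(rc, 2, function(col) {
        total <- sum(col * len)          # this sample's library size
        col * len / total * length(len)  # divide by the mean per-amplicon library size
      })
    }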

  • Sex identification and correction

    Use mclust to cluster, with a Gaussian mixture model, each sample's ratio of chrX NRC to the mean chrX NRC. If two clusters emerge, the controls are presumably a mix of male and female samples; if one, they are presumably all male or all female. Male samples' chrX NRC must then be multiplied by 2 (a sketch follows the figure below).



    (Figure: the clustering shows a mix of male and female samples)
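
    A sketch of the sex check under stated assumptions (`nrc` = normalized read-count matrix, `chr` = per-amplicon chromosome vector; both names are illustrative):

    library(mclust)
    detect.sex <- function(nrc, chr) {
      x <- which(chr %in% c("chrX", "X"))
      # each sample's share of total chrX coverage, scaled so that ~1 means "average sample"
      ratio <- colSums(nrc[x, ]) / sum(nrc[x, ]) * ncol(nrc)
      mc <- Mclust(ratio, G = c(1, 2))   # Gaussian mixture with 1 or 2 components
      if (mc$G == 2) {
        mc$classification / 2            # mixed sexes: 0.5 = male, 1 = female
      } else {
        rep(1, ncol(nrc))                # one cluster: assume a single sex
      }
    }

    Dividing a male sample's chrX NRC by the returned 0.5 is the same as multiplying it by 2, which is what the script does.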

  • Remove amplicons with GC content below the 0.01 quantile or above the 0.99 quantile; these are excluded from the loess fit

  • Remove amplicons with length below the 0.005 quantile or above the 0.995 quantile; these are excluded from the loess fit as well

  • Normalize for length and GC content with loess (the fit operates on log NRC); see the sketch after the figures below

    (Figure: before GC normalization)



    (Figure: after GC normalization)



    (Figure: after the second GC normalization pass, which excludes amplicons falling below or above the threshold after the first pass)



    (Figure: after GC normalization, before length normalization)



    (Figure: after length normalization)



    (Figure: the same scatter plots when amplicons with extreme GC and amplicon-length values are included)
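
    A condensed sketch of the loess step for one sample, mirroring the script below (`lognrc`, `gc`, `len` are per-amplicon vectors, `keep` the indices surviving the quantile filters; names are illustrative):

    normalize.gc.len <- function(lognrc, gc, len, keep, threshold = -2) {
      tt <- keep[lognrc[keep] > threshold]
      fit1 <- loess(lognrc[tt] ~ gc[tt], degree = 2,
                    control = loess.control(surface = "direct"))
      r1 <- lognrc - predict(fit1, gc)   # first GC pass
      tt <- keep[r1[keep] > threshold]   # drop points pushed below the threshold
      fit2 <- loess(r1[tt] ~ gc[tt], degree = 2,
                    control = loess.control(surface = "direct"))
      r2 <- r1 - predict(fit2, gc)       # second GC pass
      fit3 <- loess(r2[tt] ~ len[tt], degree = 2,
                    control = loess.control(surface = "direct"))
      r2 - predict(fit3, len)            # length pass; result is corrected log NRC
    }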



Independent component analysis (ICA)

  • Remove amplicons whose depth in one sample is far below that in the other samples; such sample-specific drop-outs are not caused by systematic experimental bias

    Fit a robust linear regression between every pair of samples and compute predicted values; if an amplicon's observed log NRC falls more than 2 below the prediction, exclude it from the ICA (see the sketch below)
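
    A sketch of this screen, assuming `obs` is the GC/length-corrected log NRC matrix (amplicons x samples; the name is illustrative):

    library(MASS)
    flag.outliers <- function(obs, threshold = -2) {
      bad <- NULL
      for (i in seq_len(ncol(obs))) for (j in seq_len(ncol(obs))) {
        if (i == j) next
        fit <- rlm(obs[, j] ~ obs[, i])   # robust regression of sample j on sample i
        pred <- fit$coefficients[1] + fit$coefficients[2] * obs[, i]
        bad <- union(bad, which(obs[, j] - pred < threshold))
      }
      bad                                 # amplicon indices to exclude from the ICA
    }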
  • ICA

    Run independent component analysis with fastICA, using logcosh as the G function; the resulting independent components (or principal components) are taken to represent the main experimental biases.

    (Figure: proportion of variance explained by each component, computed with princomp since fastICA does not report it)



    Accordingly, only one component is kept: compute the correlation of each component with the amplicons' mean log NRC across all samples, and take the one with the largest absolute correlation as the main component.
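
    A sketch of the fastICA call and this selection, assuming `X` is the filtered log NRC matrix (the script calls it `observations`) with no NAs:

    library(fastICA)
    k <- 3   # number of components kept (NUMBEROFPC in the script)
    a <- fastICA(X, k, alg.typ = "parallel", fun = "logcosh",
                 method = "C", row.norm = FALSE, maxit = 200, tol = 1e-6)
    pcs <- a$X %*% a$K                                 # projections ("PCs" in the script)
    mainComp <- which.max(abs(cor(rowMeans(X), pcs)))  # most correlated with mean depth
    # princomp reports the explained variance that fastICA does not:
    ev <- cumsum(princomp(X)$sdev^2) / sum(princomp(X)$sdev^2)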
  • Center each sample's log NRC by subtracting the sample's mean depth, then multiply by the K weight matrix (part of the fastICA output) to obtain each amplicon's value on every component, yielding a matrix with amplicons as rows and components as columns

  • For each amplicon, average its depth-corrected log NRC values (log NRC minus the sample's mean depth) weighted by the main component's loadings (e.g. PC1's loadings); this weighted mean is the amplicon's baseline (see the one-liner after the figures below)

    (Figure: log NRC before ICA)



    (Figure: log NRC after the between-sample correction used for the baseline, i.e. after subtracting each sample's mean depth)



    Some amplicons are visibly corrected, especially the two with lower mean depth.
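
    In code, the baseline is one weighted mean per amplicon; a one-line restatement of the script's loop (`X`, `a` and `mainComp` as in the sketch above):

    # weighted mean across samples; weights = the main component's column of the K matrix
    baseLine <- apply(X, 1, weighted.mean, w = a$K[, mainComp], na.rm = TRUE)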
  • Take amplicons with baseline > -2, regress their log NRC on the component values, and keep the residuals; this removes the effect attributed to experimental bias. Variance correction is then applied to these points (see the sketch after the figures).



    (Figure: log NRC vs PC1 before component normalization)



    (Figure: log NRC vs PC1 after component normalization)



    (Figure: log NRC vs PC2 before component normalization)



    (Figure: log NRC vs PC2 after component normalization)
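
    A condensed restatement of the regression loop in the script below (`observations`, `S` and `ncont` as defined there; `baseLine` assumed already restricted to the rows of `observations`, which the script does via indexNoNul):

    library(MASS)
    for (i in seq_len(ncont)) {
      tt <- which(baseLine > -2 & observations[, i] > -2)   # reliable amplicons only
      fit <- rlm(observations[tt, i] ~ S[tt, ])             # log NRC vs all components
      resi <- observations[, i] - fit$coefficients[1]
      for (pc in seq_len(ncol(S))) resi <- resi - fit$coefficients[pc + 1] * S[, pc]
      observations[, i] <- resi                             # bias-free residuals
    }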

The linear-regression coefficients are very small, so the impact is minor:

Coefficients:
Value Std. Error t value
(Intercept) 0.0287 0.0050 5.7175
S[tt, ]1 0.0054 0.0031 1.7508
S[tt, ]2 -0.0076 0.0041 -1.8356
S[tt, ]3 -0.0018 0.0040 -0.4615
S[tt, ]4 -0.0023 0.0028 -0.8080
S[tt, ]5 -0.0016 0.0021 -0.7327
S[tt, ]6 -0.0012 0.0027 -0.4470
S[tt, ]7 0.0000 0.0024 0.0087
S[tt, ]8 -0.0020 0.0018 -1.1415
S[tt, ]9 -0.0012 0.0021 -0.5675
S[tt, ]10 -0.0008 0.0016 -0.4773
S[tt, ]11 -0.0022 0.0018 -1.2418
S[tt, ]12 0.0010 0.0016 0.6654
S[tt, ]13 -0.0014 0.0018 -0.7660
S[tt, ]14 -0.0015 0.0013 -1.1455

It is also apparent that the variance of log NRC decreases as PC1 (which tracks mean log NRC) increases.

Variance correction

The total variance clearly differs between samples; some are more dispersed than others.

  • Assume the final log NRC follows a normal distribution with mean 0 and a variance that can be explained by PC1 (the baseline)

    Take amplicons with log NRC > -2 and compute the MAD (median absolute deviation) of all log NRC values pooled together, sd.all, then each sample's MAD, sd.sample. Finally, within each sample, correct the variance (sketch after the figure below): logNRC = (logNRC - median(logNRC)) / (sd.sample / sd.all) + median(logNRC)



    (Figure: after variance correction)
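
    A sketch of this rescaling (`obs` = matrix of final log NRC values, amplicons x samples; the name is illustrative):

    scale.variance <- function(obs, threshold = -2) {
      sd.all <- mad(unlist(obs), na.rm = TRUE)   # pooled MAD over all samples
      for (j in seq_len(ncol(obs))) {
        tt <- which(obs[, j] >= threshold)
        alph <- mad(obs[tt, j], na.rm = TRUE) / sd.all
        m <- median(obs[, j], na.rm = TRUE)
        obs[, j] <- (obs[, j] - m) / alph + m    # shrink or stretch around the median
      }
      obs
    }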

  • Sort the amplicons by baseline in ascending order and group every 50 amplicons into a bin. For each bin, compute the MAD (older versions used SD) of each sample's 50 log NRC values, set the bin's MAD to the largest of these per-sample MADs (the maximum within-bin MAD), and take the bin's mean baseline as the bin's baseline



  • Fit a loess regression of bin MAD on bin baseline; feed every amplicon's baseline into the model to obtain a predicted MAD, then divide the prediction by the largest per-sample MAD of the amplicon log NRC values. The resulting ratio is stored as the amplicon's sd (see the sketch below)
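
    A sketch of this noise model (`obs` = variance-corrected log NRC matrix, `mu` = baselines for the same amplicons; names are illustrative):

    ord <- order(mu)
    obs <- obs[ord, ]; mu <- mu[ord]
    nbin <- floor(length(mu) / 50)   # bins of 50 amplicons, ordered by baseline
    binMu <- binMad <- numeric(nbin)
    for (b in seq_len(nbin)) {
      rows <- ((b - 1) * 50 + 1):(b * 50)
      binMu[b] <- mean(mu[rows])
      # worst (largest) per-sample MAD within the bin
      binMad[b] <- max(apply(obs[rows, ], 2, mad, center = 0, na.rm = TRUE))
    }
    fit <- loess(binMad ~ binMu, degree = 2, control = loess.control(surface = "direct"))
    # predicted MAD per amplicon, scaled by the largest per-sample MAD overall
    sdRel <- predict(fit, mu) / max(apply(obs, 2, mad, center = 0, na.rm = TRUE))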





library(mclust) #for clustering to detect gender of control samples
library(MASS)
library(scales) #for graph transparency
library(fastICA) #for ICA
library(psych)
library(ggplot2)

inputFile <- "~/project/ONCOCNV_RES/Control.stats.txt"
controlFile <- "~/project/ONCOCNV_RES/MyControl.stats.Processed.txt"
GCfile <- "~/project/ONCOCNV_RES/target.GC.txt"

NUMBEROFPC=3
minFractionOfShortOrLongAmplicons=0.05 #change if needed
if (NUMBEROFPC<1) {
  NUMBEROFPC=3
}
if (NUMBEROFPC>3) {
  cat("Can keep maximum 3 principal components")
  NUMBEROFPC=3
}
magicThreshold=-2 #normalized read count less than this value will not be considered if it is not consistent between the control replicates
maxLoess=50000 #maximum number of points for the regression, starting from v6.5

dataTable <- read.table(inputFile, header=TRUE);
data <- data.frame(dataTable)
ncont <- length(data)-6
all.observations <- data[,7:length(data)]
data$len <- data$end-data$start
#pairs.panels(all.observations[,c(1:5)])

if (ncont >1) {
  totalTargetLen <- sum(all.observations[,1]*data$len)
} else {
  totalTargetLen <- sum(all.observations*data$len)
}

nulInd <- NULL
if (ncont >1) {
  for (i in (1:(ncont))) {
    nulInd <- c(nulInd,which(all.observations[,i]==0 | is.na(all.observations[,i])==TRUE | data$chr=="chrX" | data$chr=="chrY" | data$chr=="X" | data$chr=="Y"))
    all.observations[,i] <- all.observations[,i]*data$len/totalTargetLen*length(data$len) #for ampli-based, correct back for amplicon length
  }
  noNulIdex <- c(1:length(all.observations[,1]))
} else {
  nulInd <- c(nulInd,which(all.observations==0 | is.na(all.observations)==TRUE | data$chr=="chrX" | data$chr=="chrY" | data$chr=="X" | data$chr=="Y"))
  all.observations <- all.observations*data$len/totalTargetLen*length(data$len) #for ampli-based, correct back for amplicon length
  noNulIdex <- c(1:length(all.observations))
}
#pairs.panels(all.observations[,c(1:5)])

if (length(nulInd)>0) {
  indexNoNul <- noNulIdex[-sort(unique(nulInd))]
} else {
  indexNoNul <- noNulIdex
}
if (ncont >1) {
  observations <- all.observations[indexNoNul,]
} else {
  observations <- all.observations[indexNoNul]
}
noMakeNA <- NULL

if (ncont >1) {

  NUMBEROFPC = ncont-1;

  #detect female/male: starting from version 5.0:
  sex.vector=NULL
  tt <- which(all.observations[,i]>0 & is.na(all.observations[,i])==FALSE & (data$chr=="chrX" | data$chr=="X"))
  if (length(tt)>0) {
    for (i in (1:ncont)) {
      sex.vector[i] <- sum(all.observations[,i][tt])/sum(all.observations[tt,])*ncont
    }
    mc <- Mclust(sex.vector,G=c(1,2))
    #plot(mc)
    if (mc$G==2) {
      cat("Warning: you have both male and female samples in the control. We will try to assign sex using read coverage on chrX\n")
      sex.vector <- mc$classification/2
      cat(sex.vector);cat("\n")
    } else {
      cat("all your samples have the same number of chrX. We assume they are all male; change the code for assume they are all female\n")
      propX = sum(all.observations[tt,])/length(tt)/sum(all.observations[-tt,])*length(all.observations[-tt,1])
      for (i in (1:ncont)) {
        if (propX<0.9) {sex.vector[i] <- 0.5;} else {sex.vector[i] <- 1;}
        #sex.vector[i] <- 1 #uncomment if all your control samples are female
      }
    }
    #correct the number of reads on chr X for Gender #starting from version 5.0:
    for (i in (1:ncont)) {
      tt <- which(all.observations[,i]>0 & is.na(all.observations[,i])==FALSE & (data$chr=="chrX" | data$chr=="chrY" | data$chr=="X" | data$chr=="Y"))
      all.observations[,i][tt] <- all.observations[,i][tt] / sex.vector[i]
    }
  }

  lmin <- quantile(data$len,probs = 0.005, na.rm = TRUE)
  lmax <- quantile(data$len,probs = 0.995, na.rm = TRUE)
  minFrac = length(which(data$len<lmin))/length(data$len) #strictly from v6.5
  maxFrac = length(which(data$len>lmax))/length(data$len)
  #remove too short or too long amplicons (only if there are less than 5% of them: parameter minFractionOfShortOrLongAmplicons)
  if (minFrac< minFractionOfShortOrLongAmplicons & maxFrac<minFractionOfShortOrLongAmplicons) {
    nulInd <- which(data$len<lmin | data$len>lmax)
  } else if (minFrac< minFractionOfShortOrLongAmplicons & maxFrac>=minFractionOfShortOrLongAmplicons) {
    nulInd <- unique(which(data$len<lmin))
  } else if (minFrac>= minFractionOfShortOrLongAmplicons & maxFrac<minFractionOfShortOrLongAmplicons) {
    nulInd <- unique(which(data$len>lmax))
  }

  Control.stats.GC <- read.delim(GCfile)
  data$gc <- Control.stats.GC$GC
  lmin <- quantile(data$gc,probs = 0.01, na.rm = TRUE)
  lmax <- quantile(data$gc,probs = 0.99, na.rm = TRUE)
  nulInd <- unique(c(nulInd,which(data$gc<=lmin | data$gc>=lmax)))

  data$mean <- rowMeans(all.observations)
  lmin <- quantile(data$mean,probs = 0.01, na.rm = TRUE)
  lmax <- quantile(data$mean,probs = 0.99, na.rm = TRUE)
  minFrac = length(which(data$mean<=lmin))/length(data$mean)
  maxFrac = length(which(data$mean>=lmax))/length(data$mean)
  #remove too short or too long amplicons (only if there are less than 5% of them: parameter minFractionOfShortOrLongAmplicons)
  if (minFrac< minFractionOfShortOrLongAmplicons & maxFrac<minFractionOfShortOrLongAmplicons) {
    nulInd <- unique(c(nulInd,which(data$mean<=lmin | data$mean>=lmax)))
  } else if (minFrac< minFractionOfShortOrLongAmplicons & maxFrac>=minFractionOfShortOrLongAmplicons) {
    nulInd <- unique(c(nulInd,which(data$mean<=lmin)))
  } else if (minFrac>= minFractionOfShortOrLongAmplicons & maxFrac<minFractionOfShortOrLongAmplicons) {
    nulInd <- unique(c(nulInd,which(data$mean>=lmax)))
  }

  noNulIdex <- c(1:length(all.observations[,1]))
  indexNoNul <- noNulIdex[-sort(unique(nulInd))]
  observations <- all.observations[indexNoNul,]
  correctedObs=matrix(nrow=length(all.observations[,1]),ncol=length(all.observations),byrow=T)

  for (i in (1:(ncont))) {
    # tt <- which(all.observations[,i][indexNoNul]>0 & log(all.observations[,i][indexNoNul])>magicThreshold)
    # gcCount.spl <- smooth.spline(data$gc[indexNoNul][tt], log(all.observations[,i][indexNoNul][tt]))
    # predictions <- predict(gcCount.spl,data$gc)$y
    # a1 <- log(all.observations[,i])-predictions
    #
    # #recorrect the at the tails of GC-content #need it because of the strange cases like control 3 and 9...
    # tt <- which(all.observations[,i][indexNoNul]>0 & a1[indexNoNul] > magicThreshold)
    # gcCount.spl <- smooth.spline(data$gc[indexNoNul][tt], a1[indexNoNul][tt])
    # predictions <- predict(gcCount.spl,data$gc)$y
    # resi <- a1-predictions
    # len.spl <- smooth.spline(data$len[indexNoNul][tt], resi[indexNoNul][tt])
    # correctedObs[,i] <- resi-predict(len.spl,data$len)$y
    # correctedObs[,i][-indexNoNul] <- NA

    tt <- which(all.observations[,i][indexNoNul]>0 & log(all.observations[,i][indexNoNul])>magicThreshold) #starting from 5.3 - LOESS
    if (length(tt)>maxLoess) {tt=sort(sample(tt, maxLoess, replace = FALSE))} #starting from v6.5 to use on exome data
    #plot(data$gc[indexNoNul][tt],log(all.observations[,i][indexNoNul][tt]),cex=0.5,pch=19,col="lightblue")
    gcCount.loess <- loess(log(all.observations[,i][indexNoNul][tt])~data$gc[indexNoNul][tt], control = loess.control(surface = "direct"),degree=2) #loess starting from version 5.2
    predictions1 <- predict(gcCount.loess,data$gc)
    a1 <- log(all.observations[,i])-predictions1
    #plot(data$gc[indexNoNul][tt],a1[indexNoNul][tt],cex=0.5,pch=19,col="lightblue")
    tt <- which(all.observations[,i][indexNoNul]>0 & a1[indexNoNul] > magicThreshold)
    if (length(tt)>maxLoess) {tt=sample(tt, maxLoess, replace = FALSE)} #starting from v6.5 to use on exome data
    #plot(data$gc[indexNoNul][tt],a1[indexNoNul][tt],cex=0.5,pch=19,col="lightblue")
    gcCount.loess <- loess(a1[indexNoNul][tt]~data$gc[indexNoNul][tt],control = loess.control(surface = "direct"),degree=2) #loess starting from version 5.2
    predictions <- predict(gcCount.loess,data$gc)
    resi <- a1-predictions
    #plot(data$gc[indexNoNul][tt],resi[indexNoNul][tt],cex=0.5,pch=19,col="lightblue")
    #plot(data$len[indexNoNul][tt],resi[indexNoNul][tt],cex=0.5,pch=19,col="lightblue")
    len.loess <- loess(resi[indexNoNul][tt]~data$len[indexNoNul][tt], control = loess.control(surface = "direct"),degree=2)
    correctedObs[,i] <- resi-predict(len.loess,data$len)
    #plot(data$len[indexNoNul][tt],correctedObs[,i][indexNoNul][tt],cex=0.5,pch=19,col="lightblue")
    #plot(data$gc[indexNoNul][tt],correctedObs[,i][indexNoNul][tt],cex=0.5,pch=19,col="lightblue")
    #plot(data$len,correctedObs[,i],cex=0.5,pch=19,col="lightblue")
    #plot(data$gc,correctedObs[,i],cex=0.5,pch=19,col="lightblue")
  }
  for (i in (1:(ncont))) {
    correctedObs[,i][-indexNoNul] <- NA
    correctedObs[,i][which(is.infinite(correctedObs[,i]))] <- NA
    #plot(data$len,correctedObs[,i],col=2)
  }

  observations <- correctedObs[indexNoNul,] #they are already in LOG scale!!!
  makeNA <- NULL
  for (i in (1:ncont)) {
    for (j in (1:ncont)) {
      if (i!=j) {
        my.lm <- rlm((observations[,j]) ~ (observations[,i]))
        #hist((observations[,j])-my.lm$fitted,n=100)
        predictions <- (correctedObs[,i])*my.lm$coefficients[2]+my.lm$coefficients[1]
        tt <- (which((correctedObs[,j])-predictions<magicThreshold))
        tt2 <- which(is.na(correctedObs[,j]))
        makeNA <- unique(c(makeNA,tt,tt2))
      }
    }
  }
  if (length(makeNA)>0) {
    noNulIdex <- c(1:length(correctedObs[,1]))
    indexToFit <- noNulIdex[-sort(unique(makeNA))]
    indexToFit <- intersect(indexToFit,indexNoNul)
  } else {
    indexToFit <- indexNoNul
  }
  observations <- correctedObs[indexToFit,]

  #do ICA for remaining points:
  X <- observations #observations are in LOG scale
  a <- fastICA(X,NUMBEROFPC, alg.typ = "parallel", fun = "logcosh", alpha = 1,
               method = "C", row.norm = FALSE, maxit = 200,
               tol = 0.000001, verbose = TRUE)
  my.PCA <- princomp(X)
  cumExplainVar <- cumsum(my.PCA[["sdev"]]^2)/sum(my.PCA[["sdev"]]^2)
  cat("Explained variance by the first principal components of PCA:")
  cat(names(cumExplainVar))
  cat(cumExplainVar)
  #screeplot(my.PCA)
  mainComp=1;myCor=0;
  for (i in (1:NUMBEROFPC)) {
    if (abs(cor(rowMeans(X),(a$X %*% a$K)[,i]))>myCor) {
      myCor=abs(cor(rowMeans(X),a$X %*% a$K[,i]));
      mainComp=i;
    }
  }
  if (cor(rowMeans(X),a$X %*% a$K[,mainComp])<0) {a$K=-a$K;}
  shifts <- colMeans(X)

  controlFilePDF <- paste(controlFile,".pdf",sep="")
  pdf(file = controlFilePDF, width=7*ncont/3, height=7*ncont/3)
  par(mfrow=c((ncont-1),(ncont-1)))
  atmp <- NULL
  CellCounter=1
  for (i in (1:(ncont-1))) {
    for (j in (1:(ncont-1))) {
      if (i<=j) {atmp<-c(atmp,CellCounter);CellCounter=CellCounter+1} else {atmp<-c(atmp,0);}
    }
  }
  layout(matrix(c(atmp), ncont-1, ncont-1, byrow = TRUE))
  for (i in (1:(ncont-1))) {
    for (j in ((i+1):ncont)) {
      my.lm <- rlm((observations[,j]) ~ (observations[,i]))
      myCstr1 <- paste("control",i)
      myCstr2 <- paste("control",j)
      plot((observations[,i]),(observations[,j]),xlab=bquote(~NRC[.(myCstr1)]),ylab=bquote(~NRC[.(myCstr2)]),
           col="white",pch=21,cex=0.8) #553
      points((correctedObs[,i])[makeNA],(correctedObs[,j])[makeNA],
             col="darkgrey",bg=alpha("grey",0.5),pch=21,cex=0.8)
      tt <- which((observations[,i])<(observations[,j]))
      points((observations[,i])[tt],(observations[,j])[tt],
             col=colors()[553],bg=alpha(colors()[553],0.5),pch=21,cex=0.8)
      points((observations[,i])[-tt],(observations[,j])[-tt],
             col=colors()[90],bg=alpha(colors()[90],0.5),pch=21,cex=0.8)
      abline(my.lm,col="grey",lwd=2)
      abline(a=0,b=1,col="black",lwd=2,lty=3)
    }
  }
  dev.off()

} else {

  NUMBEROFPC = 1;

  #detect female/male: starting from version 5.0:
  tt <- which(all.observations>0 & is.na(all.observations)==FALSE & (data$chr=="chrX" | data$chr=="X"))
  tt2 <- which(all.observations>0 & is.na(all.observations)==FALSE & (data$chr!="chrX" | data$chr=="X"))
  if (length(tt)>0) {
    propX = sum(all.observations[tt])/length(tt)/(sum(all.observations[tt2])/length(tt2))
    if (propX<0.9) {sex.vector <- 0.5;} else {sex.vector <- 1;}
    #correct the number of reads on chr X for Gender #starting from version 5.0:
    tt <- which(all.observations>0 & is.na(all.observations)==FALSE & (data$chr=="chrX" | data$chr=="chrY" | data$chr=="X" | data$chr=="Y"))
    all.observations[tt] <- all.observations[tt] / sex.vector
  }

  lmin <- quantile(data$len,probs = 0.005, na.rm = TRUE)
  lmax <- quantile(data$len,probs = 0.995, na.rm = TRUE)
  minFrac = length(which(data$len<lmin))/length(data$len) #strictly from v6.5
  maxFrac = length(which(data$len>lmax))/length(data$len)
  #remove too short or too long amplicons (only if there are less than 5% of them: parameter minFractionOfShortOrLongAmplicons)
  if (minFrac< minFractionOfShortOrLongAmplicons & maxFrac<minFractionOfShortOrLongAmplicons) {
    nulInd <- which(data$len<lmin | data$len>lmax)
  } else if (minFrac< minFractionOfShortOrLongAmplicons & maxFrac>=minFractionOfShortOrLongAmplicons) {
    nulInd <- unique(which(data$len<lmin))
  } else if (minFrac>= minFractionOfShortOrLongAmplicons & maxFrac<minFractionOfShortOrLongAmplicons) {
    nulInd <- unique(which(data$len>lmax))
  }

  Control.stats.GC <- read.delim(GCfile)
  data$gc <- Control.stats.GC$GC
  lmin <- quantile(data$gc,probs = 0.01, na.rm = TRUE)
  lmax <- quantile(data$gc,probs = 0.99, na.rm = TRUE)
  nulInd <- unique(c(nulInd,which(data$gc<=lmin | data$gc>=lmax)))

  data$mean <- all.observations
  lmin <- quantile(data$mean,probs = 0.01, na.rm = TRUE)
  lmax <- quantile(data$mean,probs = 0.99, na.rm = TRUE)
  nulInd <- unique(c(nulInd,which(data$mean<=lmin | data$mean>=lmax)))

  noNulIdex <- c(1:length(all.observations))
  indexNoNul <- noNulIdex[-sort(unique(nulInd))]
  observations <- all.observations[indexNoNul]

  tt <- which(all.observations[indexNoNul]>0 & log(all.observations[indexNoNul])>magicThreshold) #starting from 5.3 - LOESS
  if (length(tt)>maxLoess) {tt=sample(tt, maxLoess, replace = FALSE)} #starting from v6.5 to use on exome data
  gcCount.loess <- loess(log(all.observations[indexNoNul][tt])~data$gc[indexNoNul][tt], control = loess.control(surface = "direct"),degree=2) #loess starting from version 5.2
  predictions1 <- predict(gcCount.loess,data$gc)
  a1 <- log(all.observations)-predictions1
  tt <- which(all.observations[indexNoNul]>0 & a1[indexNoNul] > magicThreshold)
  if (length(tt)>maxLoess) {tt=sample(tt, maxLoess, replace = FALSE)} #starting from v6.5 to use on exome data
  gcCount.loess <- loess(a1[indexNoNul][tt]~data$gc[indexNoNul][tt], control = loess.control(surface = "direct"),degree=2) #loess starting from version 5.2
  predictions <- predict(gcCount.loess,data$gc)
  resi <- a1-predictions
  len.loess <- loess(resi[indexNoNul][tt]~data$len[indexNoNul][tt], control = loess.control(surface = "direct"),degree=2)
  correctedObs <- resi-predict(len.loess,data$len)
  correctedObs[-indexNoNul] <- NA
  correctedObs[which(is.infinite(correctedObs))] <- NA
}

if (ncont >1) {
  #calculate baseline for all datapoints (should be similar to take a mean(log))
  X <- correctedObs #correctedObs are in LOG scale

  library(RColorBrewer)
  layout(matrix(data=c(1,2), nrow=1, ncol=2), widths=c(8,1),
         heights=c(1,1))
  par(mar = c(5,6,4,7),oma=c(0.2,0.2,0.2,0.2),mex=0.5)
  breaks=seq(-2,4,1)
  pal=brewer.pal(7,"YlOrRd")
  image(x=1:ncol(correctedObs[indexNoNul,]),y=1:nrow(correctedObs[indexNoNul,]),z=t(correctedObs[indexNoNul,]),xlab="control",ylab="amplicon",col=pal[2:7],breaks=breaks)
  par(mar = c(6,0,4,3))
  breaks2 <- breaks[-length(breaks)]
  image(x=1, y=0:length(breaks2),z=t(matrix(breaks2))*1.001,
        col=pal[(1:length(breaks2))+1],axes=FALSE,breaks=breaks,
        xlab="", ylab="",xaxt="n")
  axis(4,at=0:(length(breaks2)-1), labels=breaks2, col="white",
       las=1)
  abline(h=0:length(breaks2),col="grey",lwd=1,xpd=F)

  for (i in (1:(ncont))) {
    X[,i][which(is.infinite(X[,i]))]=NA
    X[,i] <- (X[,i]-shifts[i])
  }

  layout(matrix(data=c(1,2), nrow=1, ncol=2), widths=c(8,1),
         heights=c(1,1))
  par(mar = c(5,6,4,7),oma=c(0.2,0.2,0.2,0.2),mex=0.5)
  breaks=seq(-2,4,1)
  pal=brewer.pal(7,"YlOrRd")
  image(x=1:ncol(X[indexNoNul,]),y=1:nrow(X[indexNoNul,]),z=t(X[indexNoNul,]),xlab="control",ylab="amplicon",col=pal[2:7],breaks=breaks)
  par(mar = c(6,0,4,3))
  breaks2 <- breaks[-length(breaks)]
  image(x=1, y=0:length(breaks2),z=t(matrix(breaks2))*1.001,
        col=pal[(1:length(breaks2))+1],axes=FALSE,breaks=breaks,
        xlab="", ylab="",xaxt="n")
  axis(4,at=0:(length(breaks2)-1), labels=breaks2, col="white",
       las=1)
  abline(h=0:length(breaks2),col="grey",lwd=1,xpd=F)

  # weights <- a$K %*% a$W
  weights <- a$K
  X[is.na(X)] <- 0
  # S <- (X[indexNoNul,] %*% a$K %*% a$W) #ICs
  S <- (X[indexNoNul,] %*% a$K) #PCs
  observations <- correctedObs[indexNoNul,] #already in LOG scale

  baseLine <- NULL
  for (i in (1:(length(X[,1])))) {
    baseLine[i] <- weighted.mean(X[i,],w=weights[,mainComp],na.rm = TRUE)
  }

  for (i in (1:(ncont))) {
    tt <- which(baseLine[indexNoNul]>magicThreshold & observations[,i]>magicThreshold)
    my.lm <- rlm(observations[tt,i] ~ S[tt,])
    #qplot(S[tt,1],observations[tt,i],xlab="PC1",ylab="Sample1logNRC",alpha=0.1,colour=3)
    #qplot(S[tt,2],observations[tt,i],xlab="PC2",ylab="Sample1logNRC",alpha=0.1,colour=3)
    #qplot(S[tt,2],resi[tt],xlab="PC2",ylab="Sample1logNRC",alpha=0.1,colour=3)
    resi <- observations[,i]-my.lm$coefficients[1]
    for (PC in c(1:NUMBEROFPC)) {
      resi <- resi-my.lm$coefficients[PC+1]*S[,PC]
    }
    #plotRatioFree(observations[,i],data$chr[indexNoNul],data$start[indexNoNul])
    # #check sign:
    ## if (my.lm$coefficients[mainComp+1]<0) {resi<- -resi} - bug in version 5.5
    observations[,i] <- resi
  }
  mu <- (baseLine[indexNoNul])
  correctedObs[indexNoNul,] <- observations

  data$IC1 <- NA
  if (NUMBEROFPC>1) {
    data$IC2 <- NA
  }
  if (NUMBEROFPC>2) {
    data$IC3 <- NA
  }
  tmp <- 1:NUMBEROFPC
  data$IC1[indexNoNul] <- S[,mainComp]
  tmp <- setdiff(tmp,mainComp)
  if (NUMBEROFPC>1) {
    data$IC2[indexNoNul] <- S[,tmp[1]]
  }
  if (NUMBEROFPC>2) {
    data$IC3[indexNoNul] <- S[,tmp[2]]
  }
} else {
  baseLine <- correctedObs
  data$IC1=correctedObs
}
#par(mfrow=c(1,2))
#plot(S[tt,1],resi[tt],xlab="PC1",ylab="Sample2logNRC",col="lightblue",pch=19,cex=0.5)

if (ncont >1) {
  #observations<-correctedObs[indexToFit,] #fit SD on indexToFit points:
  #mu<- baseLine[indexToFit]
  observations <- correctedObs[indexNoNul,] #fit SD on "indexNoNul" points starting from v.5.2
  mu <- baseLine[indexNoNul]

  #some control samples are noisier than others.., normalyze for it starting from version 4.0
  sd.all <- mad(unlist(observations),na.rm=TRUE) #"mad" instead of "sd" starting from version 4.0 : 0.185[indexToFit] => 0.24[indexNoNul]
  for (j in (1:(ncont))) { #starting from version 4.0 correct for variance
    tt <- which(observations[,j]>=magicThreshold)
    sd.j <- mad(unlist(observations[,j][tt]),na.rm=TRUE)
    alph <- sd.j/sd.all
    observations[,j] <- (observations[,j]-median(observations[,j],na.rm=TRUE))/alph+median(observations[,j],na.rm=TRUE) #use starting from version 4.0
  }
  # mu <- baseLine[indexToFit] #fit SD on "indexNoNul" points starting from v.5.2
  #plot(S[,1],observations[,j],xlab="PC1",ylab="Sample1logNRC",col="lightblue",pch=19,cex=0.5)

  totalPoints <- length(mu)
  evalVal=50
  if (totalPoints/evalVal<5) {
    evalVal=10
  }
  numSeg=floor(totalPoints/evalVal)
  possibleMu <- NULL
  sigma <- NULL
  observations <- observations[order(mu),]
  mu <- mu[order(mu)]
  for (i in c(0:(numSeg-1))) {
    possibleMu[i+1] <- mean(mu[(i*evalVal+1):(i*evalVal+evalVal)])
    sigma[i+1] <- max(apply(observations[(i*evalVal+1):(i*evalVal+evalVal),],FUN=mad,MARGIN=2,center=0,na.rm=TRUE)) #"mad" instead of "sd" starting from version 4.0
  }
  #sigma.spl <- smooth.spline(possibleMu, sigma)
  #data$sd<- predict(sigma.spl,baseLine)$y/max(apply(observations,FUN=mad,MARGIN=2,center=0,na.rm=TRUE))
  sigma.spl <- loess(sigma~possibleMu, control = loess.control(surface = "direct"),degree=2) #loess starting from version 5.2
  data$sd <- predict(sigma.spl,baseLine)/max(apply(observations,FUN=mad,MARGIN=2,center=0,na.rm=TRUE))
  #plot(possibleMu,sigma,col="orange",cex=0.5,pch=19)
  #lines(possibleMu,predict(sigma.spl,possibleMu),col="red",lwd=0.5)
  tt <- which(data$sd<0)
  if (length(tt)>0) {
    cat(paste("Warning:",length(tt),"of amplicons have negative SD!!!"))
    cat("Will assign the maximum value of SD")
    data$sd[tt] <- max(data$sd,na.rm=TRUE)
  }
} else {
  data$sd=rep(1,length(data$mean))
}

data$mean <- baseLine
data$mean[-indexNoNul] <- NA
#plot(baseLine[indexNoNul],data$sd[indexNoNul],col="orange",pch=19,cex=0.5)
#plot(baseLine[indexNoNul],data$sd[indexNoNul],col="orange",pch=19,cex=0.5,xlim=c(-2,3))
write.table(data,file=controlFile,quote=FALSE,sep="\t",row.names=FALSE)

References

ONCOCNV paper: https://academic.oup.com/bioinformatics/article/30/24/3443/2422154

ONCOCNV code: http://boevalab.com/ONCOCNV/
