利用支持向量机对基因表达标本是否癌变的预测

As we mentioned earlier, gene expression analysis has a wide variety of applications, including cancer studies. In 1999, Uri Alon analyzed gene expression data for 2,000 genes from 40 colon tumor tissues and compared them with data from colon tissues belonging to 21 healthy individuals, all measured at a single time point. We can represent his data as a 2,000 × 61 gene expression matrix, where the first 40 columns describe tumor samples and the last 21 columns describe normal samples.

Now, suppose you performed a gene expression experiment with a colon sample from a new patient, corresponding to a 62nd column in an augmented gene expression matrix. Your goal is to predict whether this patient has a colon tumor. Since the partition of tissues into two clusters (tumor vs. healthy) is known in advance, it may seem that classifying the sample from a new patient is easy. Indeed, since each patient corresponds to a point in 2,000-dimensional space, we can compute the center of gravity of these points for the tumor sample and for the healthy sample. Afterwards, we can simply check which of the two centers of gravity is closer to the new tissue.

Alternatively, we could perform a blind analysis, pretending that we do not already know the classification of samples into cancerous vs. healthy, and analyze the resulting 2,000 x 62 expression matrix to divide the 62 samples into two clusters. If we obtain a cluster consisting predominantly of cancer tissues, this cluster may help us diagnose colon cancer.

Final Challenge: These approaches may seem straightforward, but it is unlikely that either of them will reliably diagnose the new patient. Why do you think this is? Given Alon’s 2,000 × 61 gene expression matrix and gene data from a new patient, derive a superior approach to evaluate whether this patient is likely to have a colon tumor.

一、原理

参见

https://www.cnblogs.com/dfcao/p/3462721.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

二、

数据:

40 Cancer Samples

21 Healthy Samples

Unknown Sample

问题分析:

这是一个分类问题,训练集有61个,特征量有2000个,如果利用高斯核函数的SVM会出现过拟合,故选择线性核函数

代码

 from os.path import dirname
import numpy as np
import math
import random
import matplotlib.pyplot as plt
from sklearn import datasets, svm def Input():
X = []
Y = []
check_x=[]
check_y=[] dataset1 = open(dirname(__file__)+'colon_cancer.txt').read().strip().split('\n')
dataset1=[list(map(float,line.split()))[:] for line in dataset1]
X += dataset1[10:]
check_x += dataset1[:10]
Y += [1]*(len(dataset1)-10)
check_y += [1]*10 dataset2 = open(dirname(__file__)+'colon_healthy.txt').read().strip().split('\n')
dataset2=[list(map(float,line.split()))[:] for line in dataset2]
X += dataset2[5:]
check_x += dataset2[:5]
Y += [0]*(len(dataset2)-5)
check_y += [0]*5 dataset3 = open(dirname(__file__)+'colon_test.txt').read().strip().split('\n')
test_X = [list(map(float,line.split()))[:] for line in dataset3] return [X ,Y , test_X , check_x , check_y] if __name__ == '__main__':
INF = 999999 [X_train ,y_train , test_X,check_x, check_y] = Input() kernel = 'linear' # 线性核函数 clf = svm.SVC(kernel=kernel, gamma=10)
clf.fit(X_train,y_train) predict_for_ckeck = clf.predict(check_x)
cnt=0
for i in range(len(check_y)):
if check_y[i]==predict_for_ckeck[i]:
cnt+=1
print('Accuracy %.2f%%'%(cnt/len(check_y))) print(clf.predict(test_X))
Accuracy 87%
[0]

奇怪的是,只选择前20个基因进行分析,训练集预测正确率居然上升到90%

Accuracy 93%

[0]

实战--利用SVM对基因表达标本是否癌变的预测的更多相关文章

  1. 实战--利用HierarchicalClustering 进行基因表达聚类分析

    利用建立分级树对酵母基因表达数据进行聚类分析 一.原理 根据基因表达数据,得出距离矩阵 ↓ 最初,每个点都是一个集合 每次选取距离最小的两个集合,将他们合并,然后更新这个新集合与其它点的距离 新集合与 ...

  2. 机器学习实战之SVM

    一引言: 支持向量机这部分确实很多,想要真正的去理解它,不仅仅知道理论,还要进行相关的代码编写和测试,二者想和结合,才能更好的帮助我们理解SVM这一非常优秀的分类算法 支持向量机是一种二类分类算法,假 ...

  3. Weblogic CVE-2020-2551漏洞复现&CS实战利用

    Weblogic CVE-2020-2551漏洞复现 Weblogic IIOP 反序列化 漏洞原理 https://www.anquanke.com/post/id/199227#h3-7 http ...

  4. Druid未授权访问实战利用

    Druid未授权访问实战利用 ​ 最近身边的同学都开始挖src了,而且身边接触到的挖src的网友也是越来越多.作者也是在前几天开始了挖src之路.惊喜又遗憾的是第一次挖src就挖到了一家互联网公司的R ...

  5. opencv利用svm训练

    ]]]]]])rand2 = np.array([[]]]]]])label = np.array([[]]]]]]]]]]])data = np.vstack((rand1]]])pt_data = ...

  6. 实战--利用Lloyd算法进行酵母基因表达数据的聚类分析

    背景:酵母会在一定的时期发生diauxic shift,有一些基因的表达上升,有一些基因表达被抑制,通过聚类算法,将基因表达的变化模式聚成6类. ORF Name R1.Ratio R2.Ratio ...

  7. 机器学习实战------利用logistics回归预测病马死亡率

    大家好久不见,实战部分一直托更,很不好意思.本文实验数据与代码来自机器学习实战这本书,倾删. 一:前期代码准备 1.1数据预处理 还是一样,设置两个数组,前两个作为特征值,后一个作为标签.当然这是简单 ...

  8. 06机器学习实战之SVM

    对偶的概念 https://blog.csdn.net/qq_34531825/article/details/52872819?locationNum=7&fps=1 拉格朗日乘子法.KKT ...

  9. 在opencv3中利用SVM进行图像目标检测和分类

    采用鼠标事件,手动选择样本点,包括目标样本和背景样本.组成训练数据进行训练 1.主函数 #include "stdafx.h" #include "opencv2/ope ...

随机推荐

  1. [z]dbms_stats.lock_table_stats对于没有统计信息的表分区同样有效

    常见的分区表DDL如 split partition.add partition都会生成没有统计信息的表分区table partition,长期以来我对dbms_stats.lock_table_st ...

  2. centos 命令学习

    关机&重启 shutdown -h 10          #计算机将于10分钟后关闭,且会显示在登录用户的当前屏幕中 shutdown -h now       #计算机会立刻关机 shut ...

  3. asp.net core webapi iis jquery No 'Access-Control-Allow-Origin' header is present on访问跨域问题

    我的解决方案是:设置特定method允许所有请求源访问,具体看业务需求 第一步:starup文件下ConfigureServices中增加如下配置 //跨域//设置了允许所有来源 services.A ...

  4. IIS 7.5 上传文件大小限制

    上传插件:uploadify IIS版本:7.5 描述: 从IIS6升级到IIS7.5以后,网站上传文件大小被限制了,在Chrome下提示:ERR_CONNECTION_RESET,网上的各种方法都试 ...

  5. linux工具介绍

    http://linuxtools-rst.readthedocs.io/zh_CN/latest/tool/index.html 工具参考篇 1. gdb 调试利器 2. ldd 查看程序依赖库 3 ...

  6. BZOJ1059或洛谷1129 [ZJOI2007]矩阵游戏

    BZOJ原题链接 洛谷原题链接 通过手算几组例子后,很容易发现,同一列的\(1\)永远在这一列,且这些\(1\)有且仅有一个能产生贡献,行同理. 所以我们可以只考虑交换列,使得每一行都能匹配一个\(1 ...

  7. Spring Environment(二)源码分析

    Spring Environment(二)源码分析 Spring 系列目录(https://www.cnblogs.com/binarylei/p/10198698.html) Spring Envi ...

  8. 知名APP(支付宝、微信、花瓣等)首页设计技巧及原型实例讲解

    APP首页设计对APP自身来说是至关重要的,一款优秀的APP产品,其首页设计不仅需要清晰的展示产品核心功能,给用户创造良好的用户体验,而且还需要展示公司的品牌形象,提升在用户心中的品牌认知度.今天,就 ...

  9. js使用sessionStorage、cookie保存token

    本文是参考别人的博客写的,图片直接用的别人的 1.Token:token是客户端频繁向服务器端请求数据,服务器频繁的去数据库查询用户名和密码进行对比,判断用户名和密码正确与否,并作出相应的提示,在这样 ...

  10. android udp 无法收到数据 (模拟器中)

    解决方法:1. 运行模拟器2. 打开window 命令行执行:telnet localhost 55545554是模拟器的端口,执行之后会进入android console3. 在console下执行 ...