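Machine Learning in Action 4: AdaBoost boosting: the sick-horse example and imbalanced classification
AdaBoost is often regarded as one of the two most effective off-the-shelf algorithms in machine learning (the other is the support vector machine, SVM), and interviewers frequently ask how it works. This article also covers ways to handle imbalanced classification, another common interview topic: in a credit-card dataset, for instance, defaulters are a tiny minority, so how do you classify accurately at a ratio of 5:10000?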
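I Introduction
1 Meta-algorithms (ensemble methods): a combination of several weak classifiers. A weak classifier has low accuracy, only slightly better than 50%, which is close to random guessing.
The combination can use different algorithms, the same algorithm under different settings, or different parts of the dataset assigned to different classifiers;
2 bagging: resample the original dataset at random, with replacement, into S new datasets of the same size as the original (so duplicates are allowed), train S classifiers, and combine their results by voting;
Representative: random forest
3 boosting: each new classifier is obtained by concentrating on the data that the existing classifiers misclassified;
Representative: AdaBoost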
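bagging and boosting are similar: both train several classifiers of the same type on different versions of the training data (especially useful when data is limited) and then combine them. The difference is that bagging trains its classifiers independently of one another and gives every classifier an equal vote, whereas boosting is serial: each new classifier is trained on top of the previous ones, concentrating on the samples they misclassified by giving those samples larger weights, and the classifiers' votes are weighted as well;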
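The bagging recipe described above (resample with replacement, train, vote) can be sketched in a few lines. This is a minimal illustration, not the article's code: the function names `bootstrap_sample` and `majority_vote`, the fixed seed, and the stand-in data are all assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement, so duplicates may appear."""
    return [rng.choice(dataset) for _ in dataset]

def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed, purely for repeatability
data = list(range(10))            # stand-in for ten training samples
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # S = 3 resampled datasets
vote = majority_vote([+1, -1, +1])  # three classifiers vote on one sample
print(samples[0], vote)
```

Each resampled dataset has the same size as the original, and every classifier's vote counts equally, which is the contrast with boosting's weighted vote.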
II AdaBoost (adaptive boosting)
How it works: give every sample an equal weight (D = 1/n) and train the first weak classifier on this weighted dataset. Compute its error rate $\varepsilon$, which determines alpha, the weight of this classifier in the final vote:

$$\alpha = \frac{1}{2}\ln\left(\frac{1-\varepsilon}{\varepsilon}\right)$$

The weight of a misclassified sample is raised:

$$D_i^{(t+1)} = \frac{D_i^{(t)}\,e^{\alpha}}{\mathrm{Sum}(D)}$$

and the weight of a correctly classified sample is lowered:

$$D_i^{(t+1)} = \frac{D_i^{(t)}\,e^{-\alpha}}{\mathrm{Sum}(D)}$$

The second weak classifier is then trained on the re-weighted dataset, and the process repeats until the training error rate reaches 0 or the specified number of weak classifiers has been built;
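To make this concrete, take three rounds. In the first round every sample has equal weight, and the error rate gives alpha = 0.69. The sample weights are then adjusted, raising those of the misclassified samples, and the second classifier gets alpha = 0.97; in the same way the third classifier gets alpha = 0.90. In the final vote the combined result is sign(0.69*h1 + 0.97*h2 + 0.90*h3), where h1, h2 and h3 are the +1/-1 outputs of the three classifiers.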
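One round of the update can be worked through by hand. The sketch below assumes a toy set of five samples where exactly one is misclassified (so the weighted error is 0.2); the helper names `classifier_weight` and `update_weights` are made up for this example and use only the standard library.

```python
import math

def classifier_weight(error, eps=1e-16):
    """alpha = 0.5 * ln((1 - error) / error); eps guards against error == 0."""
    return 0.5 * math.log((1.0 - error) / max(error, eps))

def update_weights(D, labels, predictions, alpha):
    """Scale each weight by exp(-alpha * y * h(x)), then renormalise to sum to 1."""
    scaled = [d * math.exp(-alpha * y * h) for d, y, h in zip(D, labels, predictions)]
    total = sum(scaled)
    return [s / total for s in scaled]

labels      = [+1, +1, -1, -1, +1]
predictions = [-1, +1, -1, -1, +1]      # only the first sample is misclassified
D = [1.0 / len(labels)] * len(labels)   # equal starting weights, 0.2 each

# weighted error = total weight of the misclassified samples
error = sum(d for d, y, h in zip(D, labels, predictions) if y != h)
alpha = classifier_weight(error)
new_D = update_weights(D, labels, predictions, alpha)
print(round(alpha, 2), [round(d, 3) for d in new_D])
```

With an error of 0.2 the classifier weight comes out at about 0.69; the misclassified sample's weight rises from 0.2 to 0.5 while the four correctly classified samples drop to 0.125 each, so the next weak classifier is pushed towards the hard sample.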
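(1) Weak classifier: this article uses a single-level classifier, also called a decision stump, which is the simplest kind of decision tree;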
from numpy import *  # asmatrix replaces the older mat alias, which NumPy 2.0 removed

def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    # classify every sample by thresholding one feature; output is +1.0 or -1.0
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray

def buildStump(dataArr, classLabels, D):
    dataMatrix = asmatrix(dataArr); labelMat = asmatrix(classLabels).T
    m, n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}; bestClasEst = asmatrix(zeros((m, 1)))
    minError = inf  # init error sum to +infinity
    for i in range(n):  # loop over all dimensions
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):  # loop over all thresholds in the current dimension
            for inequal in ['lt', 'gt']:  # go over less than and greater than
                threshVal = rangeMin + float(j) * stepSize
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = asmatrix(ones((m, 1)))
                errArr[predictedVals == labelMat] = 0  # 1 where misclassified, 0 where correct
                weightedError = (D.T * errArr)[0, 0]  # total error weighted by D, as a scalar
                # print("split: dim %d, thresh %.2f, thresh inequal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError))
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst
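How it works: loop over every feature, step through candidate thresholds at a fixed step size, and try both "less than" and "greater than"; the goal is the line perpendicular to a coordinate axis that separates the samples with the lowest weighted error rate;
For example, for an instance ins = (a, b, c), the weak classifier found is a line such as a = 1, b = 2, or c = 3, each perpendicular to one coordinate axis;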
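The threshold search can also be written without NumPy, which makes the three nested loops easier to see. This is a dependency-free restatement under assumptions: the names `stump_predict` and `best_stump` and the five toy points are chosen for illustration and are not part of the article's listing.

```python
def stump_predict(rows, dim, thresh, ineq):
    """Predict -1 on the chosen side of the threshold and +1 on the other side."""
    if ineq == 'lt':
        return [-1.0 if row[dim] <= thresh else 1.0 for row in rows]
    return [-1.0 if row[dim] > thresh else 1.0 for row in rows]

def best_stump(rows, labels, D, num_steps=10):
    """Try every feature, threshold and inequality; keep the lowest weighted error."""
    best, min_error = None, float('inf')
    for dim in range(len(rows[0])):
        column = [row[dim] for row in rows]
        step = (max(column) - min(column)) / num_steps
        for j in range(-1, num_steps + 1):
            thresh = min(column) + j * step
            for ineq in ('lt', 'gt'):
                preds = stump_predict(rows, dim, thresh, ineq)
                # weighted error = total weight of the misclassified samples
                error = sum(d for d, p, y in zip(D, preds, labels) if p != y)
                if error < min_error:
                    best = {'dim': dim, 'thresh': thresh, 'ineq': ineq}
                    min_error = error
    return best, min_error

# toy 2-D points, chosen for illustration
rows = [[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]]
labels = [1.0, 1.0, -1.0, -1.0, 1.0]
D = [0.2] * 5                     # equal sample weights
stump, err = best_stump(rows, labels, D)
print(stump, err)
```

No single axis-aligned line separates these five points perfectly, so the best stump still misclassifies one sample; that residual error is what the later boosting rounds work on.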
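(2) Code for training the AdaBoost classifier;
The principle is as described above. Training produces, for each weak classifier, the parameters dim, thresh, ineq and alpha: the first three are the parameters of the decision stump itself, and alpha is the weight used when combining the results of the weak classifiers;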
def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = asmatrix(ones((m, 1)) / m)  # init D to all equal
    aggClassEst = asmatrix(zeros((m, 1)))
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)  # build stump
        print('error', error)
        # print("D:", D.T)
        # calc alpha; max(error, 1e-16) avoids division by zero when error = 0
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)  # store stump params in array
        print("classEst: ", classEst.T)
        # exponent for the D update: -alpha for correct samples, +alpha for misclassified ones
        expon = multiply(-1 * alpha * asmatrix(classLabels).T, classEst)
        D = multiply(D, exp(expon))  # calc new D for next iteration
        D = D / D.sum()
        print('D', D)
        # calc training error of all classifiers so far; if it is 0, quit the loop early
        aggClassEst += alpha * classEst
        aggErrors = multiply(sign(aggClassEst) != asmatrix(classLabels).T, ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print("total error: ", errorRate)
        if errorRate == 0.0: break
    return weakClassArr, aggClassEst
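(3) Code for testing AdaBoost:
Using the parameters obtained from training, each weak classifier is configured and applied to the test samples, and the final result is the alpha-weighted combination of their outputs;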
def adaClassify(datToClass, classifierArr):
    dataMatrix = asmatrix(datToClass)  # same accumulation as aggClassEst in adaBoostTrainDS
    m = shape(dataMatrix)[0]
    aggClassEst = asmatrix(zeros((m, 1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])  # call stump classify
        aggClassEst += classifierArr[i]['alpha'] * classEst
        print(aggClassEst)
    return sign(aggClassEst)
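For a single sample, the alpha-weighted vote reduces to a one-line sum. The sketch below reuses the three alpha values quoted earlier in this article (0.69, 0.97, 0.90); the function name `weighted_vote` and the stump outputs are assumptions made for the example.

```python
def weighted_vote(alphas, stump_outputs):
    """Return sign(sum_t alpha_t * h_t(x)) for one sample."""
    total = sum(a * h for a, h in zip(alphas, stump_outputs))
    return 1.0 if total > 0 else -1.0 if total < 0 else 0.0

alphas = [0.69, 0.97, 0.90]
# the first stump says +1, the other two say -1: 0.69 - 0.97 - 0.90 < 0
label = weighted_vote(alphas, [+1, -1, -1])
print(label)
```

Here the two stumps voting -1 carry more total weight than the one voting +1, so the combined prediction is -1 even though no stump is individually reliable.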
III The sick-horse dataset example
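# loadDataSet is not defined in this article; it is assumed to be the tab-separated loader from the book's code
datArr, labelArr = loadDataSet('HorseTraining2.txt')
classifierArr, aggClassEst = adaBoostTrainDS(datArr, labelArr, 50)  # the trainer returns two values
# note: this reloads the training file; point it at a held-out test file to measure test error
testArr, testLabelArr = loadDataSet('HorseTraining2.txt')
prediction = adaClassify(testArr, classifierArr)
m = len(testLabelArr)
error = asmatrix(ones((m, 1)))
print(error[prediction != asmatrix(testLabelArr).T].sum() / m)  # fraction misclassified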
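This example simply calls the AdaBoost functions above. Note that this sick-horse dataset is the one used in the previous article on logistic regression, where the error rate was 0.3 because the dataset has many missing values and is hard to predict; AdaBoost with 50 weak classifiers brings the error rate down to 0.21;
Note: the number of weak classifiers matters. Too few tends to underfit and too many tends to overfit, so a moderate number works best. In the classic plot with the number of weak classifiers on the horizontal axis, the training error keeps falling while the test error falls and then rises again; the count at the turning point is the best choice, neither overfitting nor underfitting.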
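IV Imbalanced classification
An imbalanced problem is one where the ratio of positive to negative examples is extreme, for example whether a credit-card account defaults: 5 positive examples against 5000 negative ones;
1 Solutions
1) Preprocessing level: oversampling, undersampling, and hybrid sampling;
The sampling can be done at random or in a specified way:
(1) Oversampling: duplicate positive samples to increase their number, or add new samples similar to the positive ones;
(2) Undersampling: delete negative samples that lie far from the decision boundary; to fully balance the example above, 4995 negatives would have to be removed;
(3) A mix of oversampling and undersampling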
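The random variants of these sampling strategies fit in a few lines. This is a sketch under assumptions: it samples at random rather than by distance to the decision boundary, the names `oversample` and `undersample` are made up, and the target size of 50 per class is arbitrary.

```python
import random

def oversample(minority, target_size, rng):
    """Random oversampling: keep every minority sample and add random duplicates."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, rng):
    """Random undersampling: keep a random subset of the majority class."""
    return rng.sample(majority, target_size)

rng = random.Random(0)                          # fixed seed for repeatability
positives = [('pos', i) for i in range(5)]      # 5 minority samples
negatives = [('neg', i) for i in range(5000)]   # 5000 majority samples

# hybrid approach: grow the minority and shrink the majority to meet at 50 each
more_pos = oversample(positives, 50, rng)
fewer_neg = undersample(negatives, 50, rng)
print(len(more_pos), len(fewer_neg))
```

Pure duplication adds no new information and can encourage overfitting to the few positives, while aggressive undersampling throws away most of the negatives, which is why the hybrid of the two is often the practical compromise.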
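2) Algorithm level: cost-sensitive learning;
An example shows what a cost-sensitive classifier is:
Cost matrix for a binary classifier:
Actual \ Predicted | +1 | -1 |
+1 | -5 | 1 |
-1 | 50 | 0 |
From the cost matrix, compute the total cost of each candidate class and choose the class with the lowest cost as the final prediction.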
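One way to apply the cost matrix is to pick, per sample, the label with the lowest expected cost. The sketch below assumes the classifier can supply an estimate of P(actual = +1) for the sample, which the article does not specify; `COST`, `expected_cost` and `min_cost_label` are names invented for the example, and the rows are read as actual class and the columns as predicted class.

```python
# COST[(actual, predicted)], copied from the cost matrix above
COST = {(+1, +1): -5, (+1, -1): 1, (-1, +1): 50, (-1, -1): 0}

def expected_cost(p_pos, predicted):
    """Expected cost of predicting `predicted` when P(actual = +1) is p_pos."""
    return p_pos * COST[(+1, predicted)] + (1.0 - p_pos) * COST[(-1, predicted)]

def min_cost_label(p_pos):
    """Pick the label whose expected cost is lowest."""
    return min((+1, -1), key=lambda label: expected_cost(p_pos, label))

for p in (0.9, 0.8):
    print(p, expected_cost(p, +1), expected_cost(p, -1), min_cost_label(p))
```

Because a false positive costs 50, the classifier only answers +1 when it is very sure: with this matrix the break-even point is p = 50/56, roughly 0.89, so a sample with p = 0.8 is still labelled -1.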
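2 AUC calculation code:
Accuracy alone is no longer a suitable metric for imbalanced problems. With 100 samples, 90 positive and 10 negative, crudely labelling all 100 as positive already gives 90% accuracy, which is clearly not what we want. Recall helps here: it measures what fraction of a class was classified correctly, and in this example it is 100% for the positive class but 0% for the negative class.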
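The numbers for the 90/10 example can be checked directly. The helper names `accuracy` and `recall` below are written for this illustration only.

```python
def accuracy(actual, predicted):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall(actual, predicted, cls):
    """Fraction of the samples whose true class is `cls` that were predicted as `cls`."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
    return hits / sum(1 for a in actual if a == cls)

actual = [+1] * 90 + [-1] * 10   # 90 positives, 10 negatives
predicted = [+1] * 100           # label everything positive

print(accuracy(actual, predicted))     # 0.9
print(recall(actual, predicted, +1))   # 1.0
print(recall(actual, predicted, -1))   # 0.0
```

The classifier looks good on accuracy yet never finds a single negative, which is exactly the failure that accuracy hides on imbalanced data.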
AUC is the preferred metric here (it can be computed from the ranking of positive/negative pairs):
def plotROC(predStrengths, classLabels):
    # predStrengths is expected as a 1 x m matrix, e.g. aggClassEst.T from adaBoostTrainDS
    import matplotlib.pyplot as plt
    cur = (1.0, 1.0)  # cursor, starts at the top-right corner
    ySum = 0.0  # variable to calculate AUC
    numPosClas = sum(array(classLabels) == 1.0)
    yStep = 1 / float(numPosClas); xStep = 1 / float(len(classLabels) - numPosClas)
    sortedIndicies = predStrengths.argsort()  # get sorted index, lowest score first
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)
    # loop through all the values, drawing a line segment at each point
    for index in sortedIndicies.tolist()[0]:
        if classLabels[index] == 1.0:
            delX = 0; delY = yStep
        else:
            delX = xStep; delY = 0
            ySum += cur[1]  # accumulate the curve height at every step along x
        # draw line from cur to (cur[0]-delX, cur[1]-delY)
        ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='b')
        cur = (cur[0] - delX, cur[1] - delY)
    ax.plot([0, 1], [0, 1], 'b--')  # diagonal of a random classifier
    plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
    plt.title('ROC curve for AdaBoost horse colic detection system')
    ax.axis([0, 1, 0, 1])
    plt.show()
    print("the Area Under the Curve is: ", ySum * xStep)
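The pair-ranking view of AUC mentioned above can be computed without plotting: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, counting ties as half. The function name `auc_by_pairs` and the four scores are assumptions for illustration; the quadratic loop is fine for small data but slow for large sets.

```python
def auc_by_pairs(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.1]   # classifier scores, higher means "more positive"
labels = [1, -1, 1, -1]
print(auc_by_pairs(scores, labels))
```

Of the four positive/negative pairs here, three are ranked correctly (only the positive scored 0.3 falls below the negative scored 0.8), giving an AUC of 0.75; a perfect ranking gives 1.0 and a random one about 0.5.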
V Summary
Pros: fairly high accuracy, and almost no parameters to tune;
Cons: sensitive to outliers;
Data types: numeric and nominal values;