This lab focuses on click-through rate (CTR) prediction. The dataset comes from Kaggle's Criteo Labs dataset. The corresponding ipynb file is available on my GitHub.

The assignment has five parts: featurize categorical data with one-hot encoding; construct a one-hot encoding dictionary; parse the CTR data and generate OHE features; predict CTR with logistic regression; and reduce the feature dimension via feature hashing.

Part 1 Featurize categorical data using one-hot-encoding

One-hot-encoding

In this part we implement one-hot encoding. Before working with the real data, we practice on a toy dataset of three samples. Each sample has three features: the animal, its color, and what it eats; the last feature is optional. The first feature takes three values (bear, cat, mouse), the second two (black, tabby), and the third two (mouse, salmon). The first step is to map each (featureID, category) pair to a distinct integer starting from 0.

    # Data for manual OHE
    # Note: the first data point does not include any value for the optional third feature
    sampleOne = [(0, 'mouse'), (1, 'black')]
    sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
    sampleThree = [(0, 'bear'), (1, 'black'), (2, 'salmon')]
    sampleDataRDD = sc.parallelize([sampleOne, sampleTwo, sampleThree])

    # TODO: Replace <FILL IN> with appropriate code
    sampleOHEDictManual = {}
    sampleOHEDictManual[(0, 'bear')] = 0
    sampleOHEDictManual[(0, 'cat')] = 1
    sampleOHEDictManual[(0, 'mouse')] = 2
    sampleOHEDictManual[(1, 'black')] = 3
    sampleOHEDictManual[(1, 'tabby')] = 4
    sampleOHEDictManual[(2, 'mouse')] = 5
    sampleOHEDictManual[(2, 'salmon')] = 6

Sparse vectors

When working with sparse data we typically use SparseVector rather than a dense NumPy array, since it saves both computation and storage. Here we verify that SparseVector and NumPy give the same results. The SparseVector constructor looks like this:

    pyspark.mllib.linalg.SparseVector(size, *args)

size is the length of the vector; the remaining arguments are typically two lists, the first giving the indices of the nonzero entries and the second the corresponding values. For example:

    a = SparseVector(4, [1, 3], [3.0, 4.0])

Here a represents the dense vector [0, 3.0, 0, 4.0].

    import numpy as np
    from pyspark.mllib.linalg import SparseVector

    # TODO: Replace <FILL IN> with appropriate code
    aDense = np.array([0., 3., 0., 4.])
    aSparse = SparseVector(4, [1, 3], [3., 4.])
    bDense = np.array([0., 0., 0., 1.])
    bSparse = SparseVector(4, {3: 1})
    w = np.array([0.4, 3.1, -1.4, -.5])
    print aDense.dot(w)
    print aSparse.dot(w)
    print bDense.dot(w)
    print bSparse.dot(w)
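If everything is consistent, each dense/sparse pair prints the same value: 7.3 for a·w (3 × 3.1 + 4 × (-0.5)) and -0.5 for b·w, up to floating-point formatting, confirming that the two representations behave identically for dot products.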

OHE features as sparse vectors

Here we manually convert the three samples above into SparseVectors using the dictionary we just built:

    # TODO: Replace <FILL IN> with appropriate code
    sampleOneOHEFeatManual = SparseVector(7, [2, 3], [1, 1])
    sampleTwoOHEFeatManual = SparseVector(7, [1, 4, 5], [1, 1, 1])
    sampleThreeOHEFeatManual = SparseVector(7, [0, 3, 6], [1, 1, 1])

Define an OHE function

In this step we write a function that uses the dictionary defined above to one-hot encode the data.

    # TODO: Replace <FILL IN> with appropriate code
    def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
        """Produce a one-hot-encoding from a list of features and an OHE dictionary.

        Note:
            You should ensure that the indices used to create a SparseVector are sorted.

        Args:
            rawFeats (list of (int, str)): The features corresponding to a single observation. Each
                feature consists of a tuple of featureID and the feature's value. (e.g. sampleOne)
            OHEDict (dict): A mapping of (featureID, value) to unique integer.
            numOHEFeats (int): The total number of unique OHE features (combinations of featureID and
                value).

        Returns:
            SparseVector: A SparseVector of length numOHEFeats with indices equal to the unique
                identifiers for the (featureID, value) combinations that occur in the observation and
                with values equal to 1.0.
        """
        sparseIndex = np.sort(list(OHEDict[i] for i in rawFeats))
        sparseValues = np.ones(len(rawFeats))
        return SparseVector(numOHEFeats, sparseIndex, sparseValues)

    # Calculate the number of features in sampleOHEDictManual
    numSampleOHEFeats = len(sampleOHEDictManual)

    # Run oneHotEncoding on sampleOne
    sampleOneOHEFeat = oneHotEncoding(sampleOne, sampleOHEDictManual, numSampleOHEFeats)
    print sampleOneOHEFeat

Apply OHE to a dataset

    # TODO: Replace <FILL IN> with appropriate code
    sampleOHEData = sampleDataRDD.map(lambda x: oneHotEncoding(x, sampleOHEDictManual, numSampleOHEFeats))
    print sampleOHEData.collect()

Part 2 Construct an OHE dictionary

Pair RDD of (featureID, category)

Now we build the OHE dictionary programmatically. First we flatten all (featureID, value) pairs into a single RDD and remove duplicates.

    # TODO: Replace <FILL IN> with appropriate code
    sampleDistinctFeats = (sampleDataRDD
                           .flatMap(lambda x: x)
                           .distinct())

OHE Dictionary from distinct features

Next we construct the OHE dictionary itself, mainly with zipWithIndex and collectAsMap: the former assigns each RDD element a consecutive index, and the latter collects the RDD into a Python dict.

    # TODO: Replace <FILL IN> with appropriate code
    sampleOHEDict = (sampleDistinctFeats
                     .zipWithIndex()
                     .collectAsMap())
    print sampleOHEDict
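The generated dictionary maps the same seven (featureID, value) pairs to the integers 0-6, but because distinct() does not guarantee an ordering, the specific index assigned to each pair may differ from the manual dictionary above.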

Automated creation of an OHE dictionary


    # TODO: Replace <FILL IN> with appropriate code
    def createOneHotDict(inputData):
        """Creates a one-hot-encoder dictionary based on the input data.

        Args:
            inputData (RDD of lists of (int, str)): An RDD of observations where each observation is
                made up of a list of (featureID, value) tuples.

        Returns:
            dict: A dictionary where the keys are (featureID, value) tuples and map to values that are
                unique integers.
        """
        inputOHEDict = (inputData
                        .flatMap(lambda x: x)
                        .distinct()
                        .zipWithIndex()
                        .collectAsMap())
        return inputOHEDict

    sampleOHEDictAuto = createOneHotDict(sampleDataRDD)
    print sampleOHEDictAuto

Part 3 Parse CTR data and generate OHE features

Loading and splitting the data

First we need to download the data; this code also appeared in lab 1.


    # Run this code to view Criteo's agreement
    from IPython.lib.display import IFrame

    IFrame("http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/",
           600, 350)

    # TODO: Replace <FILL IN> with appropriate code
    # Just replace <FILL IN> with the url for dac_sample.tar.gz
    import glob
    import os.path
    import tarfile
    import urllib
    import urlparse

    # Paste url, url should end with: dac_sample.tar.gz
    url = '<FILL IN>'
    url = url.strip()

    baseDir = os.path.join('data')
    inputPath = os.path.join('cs190', 'dac_sample.txt')
    fileName = os.path.join(baseDir, inputPath)
    inputDir = os.path.split(fileName)[0]

    def extractTar(check = False):
        # Find the zipped archive and extract the dataset
        tars = glob.glob('dac_sample*.tar.gz*')
        if check and len(tars) == 0:
            return False
        if len(tars) > 0:
            try:
                tarFile = tarfile.open(tars[0])
            except tarfile.ReadError:
                if not check:
                    print 'Unable to open tar.gz file. Check your URL.'
                return False
            tarFile.extract('dac_sample.txt', path=inputDir)
            print 'Successfully extracted: dac_sample.txt'
            return True
        else:
            print 'You need to retry the download with the correct url.'
            print ('Alternatively, you can upload the dac_sample.tar.gz file to your Jupyter root ' +
                   'directory')
            return False

    if os.path.isfile(fileName):
        print 'File is already available. Nothing to do.'
    elif extractTar(check = True):
        print 'tar.gz file was already available.'
    elif not url.endswith('dac_sample.tar.gz'):
        print 'Check your download url. Are you downloading the Sample dataset?'
    else:
        # Download the file and store it in the same directory as this notebook
        try:
            urllib.urlretrieve(url, os.path.basename(urlparse.urlsplit(url).path))
        except IOError:
            print 'Unable to download and store: {0}'.format(url)
        extractTar()
    import os.path
    baseDir = os.path.join('data')
    inputPath = os.path.join('cs190', 'dac_sample.txt')
    fileName = os.path.join(baseDir, inputPath)

    if os.path.isfile(fileName):
        rawData = (sc
                   .textFile(fileName, 2)
                   .map(lambda x: x.replace('\t', ',')))  # work with either ',' or '\t' separated data
        print rawData.take(1)

After loading the data, we split it into three sets (training, validation, test) and cache them.

    # TODO: Replace <FILL IN> with appropriate code
    weights = [.8, .1, .1]
    seed = 42

    # Use randomSplit with weights and seed
    rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights, seed)

    # Cache the data
    rawTrainData.cache()
    rawValidationData.cache()
    rawTestData.cache()

    nTrain = rawTrainData.count()
    nVal = rawValidationData.count()
    nTest = rawTestData.count()
    print nTrain, nVal, nTest, nTrain + nVal + nTest
    print rawData.take(1)

Extract features

The parsed features carry no column information on their own, so we convert each observation into a list of (featureID, value) tuples, where featureID is the column index of the feature and value is its original string value. We then count the number of distinct (featureID, value) pairs per column.

    # TODO: Replace <FILL IN> with appropriate code
    def parsePoint(point):
        """Converts a comma separated string into a list of (featureID, value) tuples.

        Note:
            featureIDs should start at 0 and increase to the number of features - 1.

        Args:
            point (str): A comma separated string where the first value is the label and the rest
                are features.

        Returns:
            list: A list of (featureID, value) tuples.
        """
        featuresList = point.split(',')
        return list((i, featuresList[i + 1]) for i in range(len(featuresList) - 1))

    parsedTrainFeat = rawTrainData.map(parsePoint)

    numCategories = (parsedTrainFeat
                     .flatMap(lambda x: x)
                     .distinct()
                     .map(lambda x: (x[0], 1))
                     .reduceByKey(lambda x, y: x + y)
                     .sortByKey()
                     .collect())
    print numCategories[2][1]

Create an OHE dictionary from the dataset

The training features are now in the same form as in Part 2, so we can build the OHE dictionary for the CTR data and check its size.

    # TODO: Replace <FILL IN> with appropriate code
    ctrOHEDict = createOneHotDict(parsedTrainFeat)
    numCtrOHEFeats = len(ctrOHEDict.keys())
    print numCtrOHEFeats
    print ctrOHEDict[(0, '')]

Apply OHE to the dataset

Building on the above, we now include the label as well; in other words, an enhanced version of parsePoint that returns a LabeledPoint.

    from pyspark.mllib.regression import LabeledPoint

    # TODO: Replace <FILL IN> with appropriate code
    def parseOHEPoint(point, OHEDict, numOHEFeats):
        """Obtain the label and feature vector for this raw observation.

        Note:
            You must use the function `oneHotEncoding` in this implementation or later portions
            of this lab may not function as expected.

        Args:
            point (str): A comma separated string where the first value is the label and the rest
                are features.
            OHEDict (dict of (int, str) to int): Mapping of (featureID, value) to unique integer.
            numOHEFeats (int): The number of unique features in the training dataset.

        Returns:
            LabeledPoint: Contains the label for the observation and the one-hot-encoding of the
                raw features based on the provided OHE dictionary.
        """
        pointList = point.split(',')
        pointLabel = pointList[0]
        pointFeaturesRaw = list((i, pointList[i + 1]) for i in range(len(pointList) - 1))
        pointFeatures = oneHotEncoding(pointFeaturesRaw, OHEDict, numOHEFeats)
        return LabeledPoint(pointLabel, pointFeatures)

    OHETrainData = rawTrainData.map(lambda point: parseOHEPoint(point, ctrOHEDict, numCtrOHEFeats))
    OHETrainData.cache()
    print OHETrainData.take(1)

    # Check that oneHotEncoding function was used in parseOHEPoint
    backupOneHot = oneHotEncoding
    oneHotEncoding = None
    withOneHot = False
    try: parseOHEPoint(rawTrainData.take(1)[0], ctrOHEDict, numCtrOHEFeats)
    except TypeError: withOneHot = True
    oneHotEncoding = backupOneHot

Handling unseen features

The validation and test sets may contain (featureID, value) pairs that never appear in the training set, so we update oneHotEncoding() to simply ignore feature values that are not in the OHE dictionary.

    # TODO: Replace <FILL IN> with appropriate code
    def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
        """Produce a one-hot-encoding from a list of features and an OHE dictionary.

        Note:
            If a (featureID, value) tuple doesn't have a corresponding key in OHEDict it should be
            ignored.

        Args:
            rawFeats (list of (int, str)): The features corresponding to a single observation. Each
                feature consists of a tuple of featureID and the feature's value. (e.g. sampleOne)
            OHEDict (dict): A mapping of (featureID, value) to unique integer.
            numOHEFeats (int): The total number of unique OHE features (combinations of featureID and
                value).

        Returns:
            SparseVector: A SparseVector of length numOHEFeats with indices equal to the unique
                identifiers for the (featureID, value) combinations that occur in the observation and
                with values equal to 1.0.
        """
        # Skip any (featureID, value) pair that was never seen in the training data
        sparseIndex = np.sort([OHEDict[feat] for feat in rawFeats if feat in OHEDict])
        sparseValues = np.ones(len(sparseIndex))
        return SparseVector(numOHEFeats, sparseIndex, sparseValues)

    OHEValidationData = rawValidationData.map(lambda point: parseOHEPoint(point, ctrOHEDict, numCtrOHEFeats))
    OHEValidationData.cache()
    print OHEValidationData.take(1)
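As a quick sanity check (not part of the lab code; the sampleUnseen variable here is purely illustrative), encoding a toy sample that contains a category absent from the dictionary should simply drop that entry:

    # (1, 'purple') never appears in sampleOHEDictManual, so only index 2 for (0, 'mouse') is set
    sampleUnseen = [(0, 'mouse'), (1, 'purple')]
    print oneHotEncoding(sampleUnseen, sampleOHEDictManual, numSampleOHEFeats)
    # Expected output: (7,[2],[1.0])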

Part 4 CTR prediction and logloss evaluation

Logistic regression

With the data prepared, we can now train a classifier, in this case logistic regression. The idea is to train with LogisticRegressionWithSGD, which returns a LogisticRegressionModel.

    from pyspark.mllib.classification import LogisticRegressionWithSGD

    # fixed hyperparameters
    numIters = 50
    stepSize = 10.
    regParam = 1e-6
    regType = 'l2'
    includeIntercept = True

    model0 = LogisticRegressionWithSGD.train(OHETrainData, iterations=numIters, step=stepSize,
                                             regParam=regParam, regType=regType, intercept=includeIntercept)
    sortedWeights = sorted(model0.weights)
    print sortedWeights[:5], model0.intercept

Log loss
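For a predicted probability p and a true label y, log loss is defined as

    \ell(p, y) = -\bigl(y \log p + (1 - y)\log(1 - p)\bigr)

which is what computeLogLoss below computes; since log(0) is undefined, a small epsilon keeps p strictly inside (0, 1).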

    # TODO: Replace <FILL IN> with appropriate code
    from math import log

    def computeLogLoss(p, y):
        """Calculates the value of log loss for a given probability and label.

        Note:
            log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
            and when p is 1 we need to subtract a small value (epsilon) from it.

        Args:
            p (float): A probability between 0 and 1.
            y (int): A label. Takes on the values 0 and 1.

        Returns:
            float: The log loss value.
        """
        epsilon = 10e-12
        if p == 0:
            p += epsilon
        if p == 1:
            p -= epsilon
        if y == 1:
            return -log(p)
        if y == 0:
            return -log(1 - p)

Baseline log loss

Now we use the loss function above to compute a baseline training log loss. The baseline model simply predicts the mean of the training labels (the empirical click-through rate) for every observation.

    # TODO: Replace <FILL IN> with appropriate code
    # Note that our dataset has a very high click-through rate by design
    # In practice click-through rate can be one to two orders of magnitude lower
    classOneFracTrain = OHETrainData.map(lambda lp: lp.label).sum() / OHETrainData.count()
    print classOneFracTrain

    logLossTrBase = (OHETrainData
                     .map(lambda lp: computeLogLoss(classOneFracTrain, lp.label))
                     .sum()) / OHETrainData.count()
    print 'Baseline Train Logloss = {0:.3f}\n'.format(logLossTrBase)

Predicted probability

We now use the trained model to compute the predicted probability for each observation.
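For logistic regression with weights w and intercept b, the predicted probability is the sigmoid of the raw margin,

    P(y = 1 \mid x) = \frac{1}{1 + e^{-(w \cdot x + b)}}

and, as the docstring notes, the raw margin w·x + b is clipped to [-20, 20] before the sigmoid is applied, to avoid numerical overflow in the exponential.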


    # TODO: Replace <FILL IN> with appropriate code
    from math import exp  # exp(-t) = e^-t

    def getP(x, w, intercept):
        """Calculate the probability for an observation given a set of weights and intercept.

        Note:
            We'll bound our raw prediction between 20 and -20 for numerical purposes.

        Args:
            x (SparseVector): A vector with values of 1.0 for features that exist in this
                observation and 0.0 otherwise.
            w (DenseVector): A vector of weights (betas) for the model.
            intercept (float): The model's intercept.

        Returns:
            float: A probability between 0 and 1.
        """
        rawPrediction = x.dot(w) + intercept

        # Bound the raw prediction value, then apply the sigmoid
        rawPrediction = min(rawPrediction, 20)
        rawPrediction = max(rawPrediction, -20)
        return 1 / (1 + exp(-rawPrediction))

    trainingPredictions = OHETrainData.map(lambda lp: getP(lp.features, model0.weights, model0.intercept))
    print trainingPredictions.take(5)

Evaluate the model


    # TODO: Replace <FILL IN> with appropriate code
    def evaluateResults(model, data):
        """Calculates the log loss for the data given the model.

        Args:
            model (LogisticRegressionModel): A trained logistic regression model.
            data (RDD of LabeledPoint): Labels and features for each observation.

        Returns:
            float: Log loss for the data.
        """
        dataPrediction = data.map(lambda lp: (getP(lp.features, model.weights, model.intercept), lp.label))
        logLoss = dataPrediction.map(lambda (x, y): computeLogLoss(x, y)).sum() / dataPrediction.count()
        return logLoss

    logLossTrLR0 = evaluateResults(model0, OHETrainData)
    print ('OHE Features Train Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossTrBase, logLossTrLR0))

Validation log loss

We repeat the same computation on the validation set.


    # TODO: Replace <FILL IN> with appropriate code
    logLossValBase = (OHEValidationData
                      .map(lambda lp: computeLogLoss(classOneFracTrain, lp.label))
                      .sum()) / OHEValidationData.count()

    logLossValLR0 = evaluateResults(model0, OHEValidationData)
    print ('OHE Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossValBase, logLossValLR0))

Finally, we obtain the ROC curve by sweeping the classification threshold over the predicted probabilities.
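The notebook supplies the plotting code; a minimal sketch of how the ROC points can be computed from the validation predictions might look like the following (variable names such as labelsAndScores and truePositiveRate are illustrative, not from the lab code):

    # Score every validation point with the trained model, then sort by score descending
    labelsAndScores = OHEValidationData.map(lambda lp:
                                            (lp.label, getP(lp.features, model0.weights, model0.intercept)))
    labelsAndWeights = labelsAndScores.collect()
    labelsAndWeights.sort(key=lambda (k, v): v, reverse=True)
    labelsByWeight = np.array([k for (k, v) in labelsAndWeights])

    # Cumulative true positives at each threshold; false positives are rank minus true positives
    length = labelsByWeight.size
    truePositives = labelsByWeight.cumsum()
    numPositive = truePositives[-1]
    falsePositives = np.arange(1.0, length + 1, 1.) - truePositives

    truePositiveRate = truePositives / numPositive
    falsePositiveRate = falsePositives / (length - numPositive)
    # Plotting falsePositiveRate against truePositiveRate gives the ROC curve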

Part 5 Reduce feature dimension via feature hashing

The results above show that OHE features give decent accuracy, but the number of features is huge, around 233K. Feature hashing reduces the dimensionality: each (featureID, value) pair is hashed into one of a fixed number of buckets, so the feature dimension stays fixed no matter how many distinct categories appear, at the cost of occasional collisions.

Hash function


    from collections import defaultdict
    import hashlib

    def hashFunction(numBuckets, rawFeats, printMapping=False):
        """Calculate a feature dictionary for an observation's features based on hashing.

        Note:
            Use printMapping=True for debug purposes and to better understand how the hashing works.

        Args:
            numBuckets (int): Number of buckets to use as features.
            rawFeats (list of (int, str)): A list of features for an observation. Represented as
                (featureID, value) tuples.
            printMapping (bool, optional): If true, the mappings of featureString to index will be
                printed.

        Returns:
            dict of int to float: The keys will be integers which represent the buckets that the
                features have been hashed to. The value for a given key will contain the count of the
                (featureID, value) tuples that have hashed to that key.
        """
        mapping = {}
        for ind, category in rawFeats:
            featureString = category + str(ind)
            mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
        if (printMapping): print mapping
        sparseFeatures = defaultdict(float)
        for bucket in mapping.values():
            sparseFeatures[bucket] += 1.0
        return dict(sparseFeatures)

    # Reminder of the sample values:
    # sampleOne = [(0, 'mouse'), (1, 'black')]
    # sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
    # sampleThree = [(0, 'bear'), (1, 'black'), (2, 'salmon')]

    # TODO: Replace <FILL IN> with appropriate code
    # Use four buckets
    sampOneFourBuckets = hashFunction(4, sampleOne, True)
    sampTwoFourBuckets = hashFunction(4, sampleTwo, True)
    sampThreeFourBuckets = hashFunction(4, sampleThree, True)

    # Use one hundred buckets
    sampOneHundredBuckets = hashFunction(100, sampleOne, True)
    sampTwoHundredBuckets = hashFunction(100, sampleTwo, True)
    sampThreeHundredBuckets = hashFunction(100, sampleThree, True)

    print '\t\t 4 Buckets \t\t\t 100 Buckets'
    print 'SampleOne:\t {0}\t\t {1}'.format(sampOneFourBuckets, sampOneHundredBuckets)
    print 'SampleTwo:\t {0}\t\t {1}'.format(sampTwoFourBuckets, sampTwoHundredBuckets)
    print 'SampleThree:\t {0}\t {1}'.format(sampThreeFourBuckets, sampThreeHundredBuckets)

Creating hashed features

These steps mirror the OHE pipeline above, except that features are hashed into buckets instead of looked up in a dictionary.

    # TODO: Replace <FILL IN> with appropriate code
    def parseHashPoint(point, numBuckets):
        """Create a LabeledPoint for this observation using hashing.

        Args:
            point (str): A comma separated string where the first value is the label and the rest are
                features.
            numBuckets: The number of buckets to hash to.

        Returns:
            LabeledPoint: A LabeledPoint with a label (0.0 or 1.0) and a SparseVector of hashed
                features.
        """
        pointList = point.split(',')
        pointSize = len(pointList) - 1
        pointLabel = pointList[0]
        pointFeatureRaw = list((i, pointList[i + 1]) for i in range(pointSize))
        # Leave printMapping at its default (False) so we don't print a mapping for every row
        pointFeature = SparseVector(numBuckets, hashFunction(numBuckets, pointFeatureRaw))
        return LabeledPoint(pointLabel, pointFeature)

    numBucketsCTR = 2 ** 15
    hashTrainData = rawTrainData.map(lambda x: parseHashPoint(x, numBucketsCTR))
    hashTrainData.cache()
    hashValidationData = rawValidationData.map(lambda x: parseHashPoint(x, numBucketsCTR))
    hashValidationData.cache()
    hashTestData = rawTestData.map(lambda x: parseHashPoint(x, numBucketsCTR))
    hashTestData.cache()
    print hashTrainData.take(1)

Sparsity

Compute the average sparsity of the hashed and OHE features; the definition is given in the docstring below.

    # TODO: Replace <FILL IN> with appropriate code
    def computeSparsity(data, d, n):
        """Calculates the average sparsity for the features in an RDD of LabeledPoints.

        Args:
            data (RDD of LabeledPoint): The LabeledPoints to use in the sparsity calculation.
            d (int): The total number of features.
            n (int): The number of observations in the RDD.

        Returns:
            float: The average of the ratio of features in a point to total features.
        """
        return data.map(lambda x: len(x.features.values)).sum() / float(d * n)

    averageSparsityHash = computeSparsity(hashTrainData, numBucketsCTR, nTrain)
    averageSparsityOHE = computeSparsity(OHETrainData, numCtrOHEFeats, nTrain)
    print 'Average OHE Sparsity: {0:.7e}'.format(averageSparsityOHE)
    print 'Average Hash Sparsity: {0:.7e}'.format(averageSparsityHash)
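Since each observation has 39 raw features, the OHE sparsity should be on the order of 39 / 233K (roughly 1.7e-4), while the hashed sparsity is on the order of 39 / 2^15 (roughly 1.2e-3, slightly lower because of collisions): the hashed representation is about an order of magnitude denser but still extremely sparse.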

Logistic model with hashed features


    numIters = 500
    regType = 'l2'
    includeIntercept = True

    # Initialize variables using values from initial model training
    bestModel = None
    bestLogLoss = 1e10

    # TODO: Replace <FILL IN> with appropriate code
    stepSizes = (1, 10)
    regParams = (1e-6, 1e-3)
    for stepSize in stepSizes:
        for regParam in regParams:
            model = (LogisticRegressionWithSGD
                     .train(hashTrainData, numIters, stepSize, regParam=regParam, regType=regType,
                            intercept=includeIntercept))
            logLossVa = evaluateResults(model, hashValidationData)
            print ('\tstepSize = {0:.1f}, regParam = {1:.0e}: logloss = {2:.3f}'
                   .format(stepSize, regParam, logLossVa))
            if (logLossVa < bestLogLoss):
                bestModel = model
                bestLogLoss = logLossVa

    print ('Hashed Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossValBase, bestLogLoss))

Evaluate on the test set

Same as above, but now the best model is evaluated on the test set.

    # TODO: Replace <FILL IN> with appropriate code
    # Log loss for the best model from (5d)
    logLossTest = evaluateResults(bestModel, hashTestData)

    # Log loss for the baseline model
    logLossTestBaseline = (hashTestData
                           .map(lambda lp: computeLogLoss(classOneFracTrain, lp.label))
                           .sum()) / hashTestData.count()
    print ('Hashed Features Test Log Loss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossTestBaseline, logLossTest))
