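This lab studies click-through rate (CTR) prediction. The dataset is the Criteo Labs dataset from Kaggle. The accompanying ipynb file is on my GitHub.

The assignment has five parts: featurizing categorical data with one-hot encoding; constructing a one-hot encoding dictionary; parsing the CTR data and generating features; predicting CTR with logistic regression; and reducing the feature dimension via feature hashing.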

Part 1 Featurize categorical data using one-hot-encoding


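In this part we implement one-hot encoding (OHE). Before working with the real data, we practice on a dataset of three samples. Each sample has three features: the type of animal, its color, and what it eats; the last feature is optional. The first feature takes three values (bear, cat, mouse), the second takes two (black, tabby), and the third takes two (mouse, salmon). The first step is to map each (featureID, category) pair to a consecutive integer starting from 0.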

    # Data for manual OHE
    # Note: the first data point does not include any value for the optional third feature
    sampleOne = [(0, 'mouse'), (1, 'black')]
    sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
    sampleThree = [(0, 'bear'), (1, 'black'), (2, 'salmon')]
    sampleDataRDD = sc.parallelize([sampleOne, sampleTwo, sampleThree])

    # TODO: Replace <FILL IN> with appropriate code
    sampleOHEDictManual = {}
    sampleOHEDictManual[(0, 'bear')] = 0
    sampleOHEDictManual[(0, 'cat')] = 1
    sampleOHEDictManual[(0, 'mouse')] = 2
    sampleOHEDictManual[(1, 'black')] = 3
    sampleOHEDictManual[(1, 'tabby')] = 4
    sampleOHEDictManual[(2, 'mouse')] = 5
    sampleOHEDictManual[(2, 'salmon')] = 6

Sparse vectors


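    pyspark.mllib.linalg.SparseVector(size, *args)

A SparseVector stores only the size of the vector plus the indices and values of its non-zero entries. For example:

    a = SparseVector(4, [1, 3], [3.0, 4.0])

Here a represents the dense vector [0, 3.0, 0, 4.0].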

    import numpy as np
    from pyspark.mllib.linalg import SparseVector

    # TODO: Replace <FILL IN> with appropriate code
    aDense = np.array([0., 3., 0., 4.])
    aSparse = SparseVector(4, [1, 3], [3, 4])

    bDense = np.array([0., 0., 0., 1.])
    bSparse = SparseVector(4, {3: 1})

    w = np.array([0.4, 3.1, -1.4, -.5])
    print aDense.dot(w)
    print aSparse.dot(w)
    print bDense.dot(w)
    print bSparse.dot(w)
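
To see why the sparse and dense dot products agree, here is a minimal pure-Python sketch that runs without Spark. The helper names `to_dense` and `sparse_dot` are hypothetical and introduced only for illustration; they are not part of pyspark.

```python
def to_dense(size, indices, values):
    """Expand (size, indices, values) into a plain list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def sparse_dot(indices, values, w):
    """Dot product that only touches the non-zero positions."""
    return sum(v * w[i] for i, v in zip(indices, values))


w = [0.4, 3.1, -1.4, -0.5]
a_dense = to_dense(4, [1, 3], [3.0, 4.0])
dense_result = sum(x * y for x, y in zip(a_dense, w))
sparse_result = sparse_dot([1, 3], [3.0, 4.0], w)
print(dense_result, sparse_result)
```

Both results are the same number because the zero entries contribute nothing to the sum; the sparse version simply skips them.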

OHE features as sparse vectors


    # TODO: Replace <FILL IN> with appropriate code
    sampleOneOHEFeatManual = SparseVector(7, [2, 3], [1, 1])
    sampleTwoOHEFeatManual = SparseVector(7, [1, 4, 5], [1, 1, 1])
    sampleThreeOHEFeatManual = SparseVector(7, [0, 3, 6], [1, 1, 1])

Define an OHE function


    # TODO: Replace <FILL IN> with appropriate code
    def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
        """Produce a one-hot-encoding from a list of features and an OHE dictionary.
        Note:
            You should ensure that the indices used to create a SparseVector are sorted.
        Args:
            rawFeats (list of (int, str)): The features corresponding to a single observation. Each
                feature consists of a tuple of featureID and the feature's value. (e.g. sampleOne)
            OHEDict (dict): A mapping of (featureID, value) to unique integer.
            numOHEFeats (int): The total number of unique OHE features (combinations of featureID and
                value).
        Returns:
            SparseVector: A SparseVector of length numOHEFeats with indices equal to the unique
                identifiers for the (featureID, value) combinations that occur in the observation and
                with values equal to 1.0.
        """
        sparseIndex = np.sort(list(OHEDict[i] for i in rawFeats))
        sparseValues = np.ones(len(rawFeats))
        return SparseVector(numOHEFeats, sparseIndex, sparseValues)

    # Calculate the number of features in sampleOHEDictManual
    numSampleOHEFeats = len(sampleOHEDictManual)

    # Run oneHotEncoding on sampleOne
    sampleOneOHEFeat = oneHotEncoding(sampleOne, sampleOHEDictManual, numSampleOHEFeats)
    print sampleOneOHEFeat

Apply OHE to a dataset

    # TODO: Replace <FILL IN> with appropriate code
    sampleOHEData = sampleDataRDD.map(lambda x: oneHotEncoding(x, sampleOHEDictManual, numSampleOHEFeats))
    print sampleOHEData.collect()

Part 2 Construct an OHE dictionary

Pair RDD of (featureID, category)


    # TODO: Replace <FILL IN> with appropriate code
    sampleDistinctFeats = (sampleDataRDD
                           .flatMap(lambda x: x)
                           .distinct())

OHE Dictionary from distinct features


    # TODO: Replace <FILL IN> with appropriate code
    sampleOHEDict = (sampleDistinctFeats
                     .zipWithIndex()
                     .collectAsMap())
    print sampleOHEDict

Automated creation of an OHE dictionary

    # TODO: Replace <FILL IN> with appropriate code
    def createOneHotDict(inputData):
        """Creates a one-hot-encoder dictionary based on the input data.
        Args:
            inputData (RDD of lists of (int, str)): An RDD of observations where each observation is
                made up of a list of (featureID, value) tuples.
        Returns:
            dict: A dictionary where the keys are (featureID, value) tuples and map to values that are
                unique integers.
        """
        inputOHEDict = (inputData
                        .flatMap(lambda x: x)
                        .distinct()
                        .zipWithIndex()
                        .collectAsMap())
        return inputOHEDict

    sampleOHEDictAuto = createOneHotDict(sampleDataRDD)
    print sampleOHEDictAuto
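
The flatMap, distinct, zipWithIndex pipeline can be mimicked on a local list to check the idea without a Spark context. This is only a sketch: `create_one_hot_dict_local` is a hypothetical name, and sorting the keys is an assumption made here for reproducibility, since the Spark version assigns indices in whatever order `distinct()` happens to return.

```python
def create_one_hot_dict_local(observations):
    """Map every distinct (featureID, value) pair to a unique integer.

    Sorting is an assumption for reproducibility; any assignment of
    distinct integers 0..n-1 would be an equally valid OHE dictionary.
    """
    distinct_feats = {feat for obs in observations for feat in obs}
    return {feat: idx for idx, feat in enumerate(sorted(distinct_feats))}


sample_one = [(0, 'mouse'), (1, 'black')]
sample_two = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
sample_three = [(0, 'bear'), (1, 'black'), (2, 'salmon')]

ohe_dict = create_one_hot_dict_local([sample_one, sample_two, sample_three])
print(ohe_dict)
```

With sorted keys the result happens to coincide with the hand-built dictionary from the first part; the Spark output may use a different but equally valid numbering.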

Part 3 Parse CTR data and generate OHE features

Loading and splitting the data


    # Run this code to view Criteo's agreement
    from IPython.lib.display import IFrame

    IFrame("http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/",
           600, 350)

    # TODO: Replace <FILL IN> with appropriate code
    # Just replace <FILL IN> with the url for dac_sample.tar.gz
    import glob
    import os.path
    import tarfile
    import urllib
    import urlparse

    # Paste url, url should end with: dac_sample.tar.gz
    url = '<FILL IN>'

    url = url.strip()
    baseDir = os.path.join('data')
    inputPath = os.path.join('cs190', 'dac_sample.txt')
    fileName = os.path.join(baseDir, inputPath)
    inputDir = os.path.split(fileName)[0]

    def extractTar(check = False):
        # Find the zipped archive and extract the dataset
        tars = glob.glob('dac_sample*.tar.gz*')
        if check and len(tars) == 0:
            return False

        if len(tars) > 0:
            try:
                tarFile = tarfile.open(tars[0])
            except tarfile.ReadError:
                if not check:
                    print 'Unable to open tar.gz file. Check your URL.'
                return False

            tarFile.extract('dac_sample.txt', path=inputDir)
            print 'Successfully extracted: dac_sample.txt'
            return True
        else:
            print 'You need to retry the download with the correct url.'
            print ('Alternatively, you can upload the dac_sample.tar.gz file to your Jupyter root ' +
                   'directory')
            return False

    if os.path.isfile(fileName):
        print 'File is already available. Nothing to do.'
    elif extractTar(check = True):
        print 'tar.gz file was already available.'
    elif not url.endswith('dac_sample.tar.gz'):
        print 'Check your download url. Are you downloading the Sample dataset?'
    else:
        # Download the file and store it in the same directory as this notebook
        try:
            urllib.urlretrieve(url, os.path.basename(urlparse.urlsplit(url).path))
        except IOError:
            print 'Unable to download and store: {0}'.format(url)

        extractTar()

    import os.path

    baseDir = os.path.join('data')
    inputPath = os.path.join('cs190', 'dac_sample.txt')
    fileName = os.path.join(baseDir, inputPath)

    if os.path.isfile(fileName):
        rawData = (sc
                   .textFile(fileName, 2)
                   .map(lambda x: x.replace('\t', ',')))  # work with either ',' or '\t' separated data
        print rawData.take(1)


    # TODO: Replace <FILL IN> with appropriate code
    weights = [.8, .1, .1]
    seed = 42

    # Use randomSplit with weights and seed
    rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights, seed)

    # Cache the data
    rawTrainData.cache()
    rawValidationData.cache()
    rawTestData.cache()

    nTrain = rawTrainData.count()
    nVal = rawValidationData.count()
    nTest = rawTestData.count()
    print nTrain, nVal, nTest, nTrain + nVal + nTest
    print rawData.take(1)

Extract features


    # TODO: Replace <FILL IN> with appropriate code
    def parsePoint(point):
        """Converts a comma separated string into a list of (featureID, value) tuples.
        Note:
            featureIDs should start at 0 and increase to the number of features - 1.
        Args:
            point (str): A comma separated string where the first value is the label and the rest
                are features.
        Returns:
            list: A list of (featureID, value) tuples.
        """
        featuresList = point.split(',')
        return list((i, featuresList[i+1]) for i in range(len(featuresList)-1))

    parsedTrainFeat = rawTrainData.map(parsePoint)

    numCategories = (parsedTrainFeat
                     .flatMap(lambda x: x)
                     .distinct()
                     .map(lambda x: (x[0], 1))
                     .reduceByKey(lambda x, y: x + y)
                     .sortByKey()
                     .collect())
    print numCategories[2][1]
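
As a quick check of the parsing step outside Spark, here is a small sketch on a made-up record. The string below is an assumption for illustration and is not taken from the Criteo file; `parse_point` is a hypothetical local counterpart of the function above.

```python
def parse_point(point):
    """Split 'label,f0,f1,...' into [(0, f0), (1, f1), ...], dropping the label."""
    fields = point.split(',')
    return [(i, value) for i, value in enumerate(fields[1:])]


# A made-up record: label 0 followed by three feature values, one of them empty.
line = '0,1,,68fd1e64'
parsed = parse_point(line)
print(parsed)
```

Note that a missing value is kept as the empty string, so a pair such as (1, '') is treated as a category of its own when the OHE dictionary is built.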

Create an OHE dictionary from the dataset

The data is now in the same form as in Part 2, so we can build the dictionary the same way and look at its size.

    # TODO: Replace <FILL IN> with appropriate code
    ctrOHEDict = createOneHotDict(parsedTrainFeat)
    numCtrOHEFeats = len(ctrOHEDict.keys())
    print numCtrOHEFeats
    print ctrOHEDict[(0, '')]

Apply OHE to the dataset


    from pyspark.mllib.regression import LabeledPoint

    # TODO: Replace <FILL IN> with appropriate code
    def parseOHEPoint(point, OHEDict, numOHEFeats):
        """Obtain the label and feature vector for this raw observation.
        Note:
            You must use the function `oneHotEncoding` in this implementation or later portions
            of this lab may not function as expected.
        Args:
            point (str): A comma separated string where the first value is the label and the rest
                are features.
            OHEDict (dict of (int, str) to int): Mapping of (featureID, value) to unique integer.
            numOHEFeats (int): The number of unique features in the training dataset.
        Returns:
            LabeledPoint: Contains the label for the observation and the one-hot-encoding of the
                raw features based on the provided OHE dictionary.
        """
        pointList = point.split(',')
        pointLabel = pointList[0]
        pointFeaturesRaw = list((i, pointList[i+1]) for i in range(len(pointList)-1))
        pointFeatures = oneHotEncoding(pointFeaturesRaw, OHEDict, numOHEFeats)
        return LabeledPoint(pointLabel, pointFeatures)

    OHETrainData = rawTrainData.map(lambda point: parseOHEPoint(point, ctrOHEDict, numCtrOHEFeats))
    OHETrainData.cache()
    print OHETrainData.take(1)

    # Check that oneHotEncoding function was used in parseOHEPoint
    backupOneHot = oneHotEncoding
    oneHotEncoding = None
    withOneHot = False
    try: parseOHEPoint(rawTrainData.take(1)[0], ctrOHEDict, numCtrOHEFeats)
    except TypeError: withOneHot = True
    oneHotEncoding = backupOneHot

Handling unseen features


    # TODO: Replace <FILL IN> with appropriate code
    def oneHotEncoding(rawFeats, OHEDict, numOHEFeats):
        """Produce a one-hot-encoding from a list of features and an OHE dictionary.
        Note:
            If a (featureID, value) tuple doesn't have a corresponding key in OHEDict it should be
            ignored.
        Args:
            rawFeats (list of (int, str)): The features corresponding to a single observation. Each
                feature consists of a tuple of featureID and the feature's value. (e.g. sampleOne)
            OHEDict (dict): A mapping of (featureID, value) to unique integer.
            numOHEFeats (int): The total number of unique OHE features (combinations of featureID and
                value).
        Returns:
            SparseVector: A SparseVector of length numOHEFeats with indices equal to the unique
                identifiers for the (featureID, value) combinations that occur in the observation and
                with values equal to 1.0.
        """
        # Keep only the features that have an entry in the dictionary
        sparseIndex = sorted(OHEDict[f] for f in rawFeats if f in OHEDict)
        sparseValues = np.ones(len(sparseIndex))
        return SparseVector(numOHEFeats, sparseIndex, sparseValues)

    OHEValidationData = rawValidationData.map(lambda point: parseOHEPoint(point, ctrOHEDict, numCtrOHEFeats))
    OHEValidationData.cache()
    print OHEValidationData.take(1)
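
The effect of ignoring unseen features is easiest to see on the small animal dataset. The following is a minimal sketch in plain Python; `one_hot_indices` is a hypothetical helper that returns only the active indices instead of a SparseVector, and the 'dog' category is invented for the example.

```python
def one_hot_indices(raw_feats, ohe_dict):
    """Return the sorted OHE indices of the features present in ohe_dict."""
    return sorted(ohe_dict[f] for f in raw_feats if f in ohe_dict)


ohe_dict = {(0, 'bear'): 0, (0, 'cat'): 1, (0, 'mouse'): 2,
            (1, 'black'): 3, (1, 'tabby'): 4,
            (2, 'mouse'): 5, (2, 'salmon'): 6}

seen = one_hot_indices([(0, 'mouse'), (1, 'black')], ohe_dict)
# (0, 'dog') never appeared when the dictionary was built, so it is dropped.
unseen = one_hot_indices([(0, 'dog'), (1, 'tabby'), (2, 'salmon')], ohe_dict)
print(seen, unseen)
```

An observation whose categories are all unknown would end up with an all-zero feature vector, so the model's prediction for it would depend only on the intercept.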

Part 4 CTR prediction and logloss evaluation

Logistic regression


    from pyspark.mllib.classification import LogisticRegressionWithSGD

    # fixed hyperparameters
    numIters = 50
    stepSize = 10.
    regParam = 1e-6
    regType = 'l2'
    includeIntercept = True

    model0 = LogisticRegressionWithSGD.train(OHETrainData, iterations=numIters, step=stepSize,
                                             regParam=regParam, regType=regType,
                                             intercept=includeIntercept)
    sortedWeights = sorted(model0.weights)
    print sortedWeights[:5], model0.intercept

Log loss

    # TODO: Replace <FILL IN> with appropriate code
    from math import log

    def computeLogLoss(p, y):
        """Calculates the value of log loss for a given probability and label.
        Note:
            log(0) is undefined, so when p is 0 we need to add a small value (epsilon) to it
            and when p is 1 we need to subtract a small value (epsilon) from it.
        Args:
            p (float): A probability between 0 and 1.
            y (int): A label. Takes on the values 0 and 1.
        Returns:
            float: The log loss value.
        """
        epsilon = 10e-12
        if p == 0:
            p += epsilon
        if p == 1:
            p -= epsilon
        if y == 1:
            return -log(p)
        if y == 0:
            return -log(1-p)
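
A few numeric checks make the behaviour of the loss concrete: it is -log(p) for a positive label and -log(1 - p) for a negative one, so confident wrong predictions are penalised heavily. The sketch below restates the function so that it runs on its own; `compute_log_loss` is a hypothetical stand-alone copy, and its default epsilon of 1e-11 equals the 10e-12 used above.

```python
from math import log


def compute_log_loss(p, y, epsilon=1e-11):
    """Log loss of predicting probability p for a 0/1 label y."""
    # Nudge p away from exactly 0 or 1 so that log() stays finite.
    if p == 0:
        p += epsilon
    elif p == 1:
        p -= epsilon
    return -log(p) if y == 1 else -log(1 - p)


loss_coin_flip = compute_log_loss(0.5, 1)   # equals log(2)
loss_right = compute_log_loss(0.9, 1)       # confident and right: small loss
loss_wrong = compute_log_loss(0.9, 0)       # confident and wrong: large loss
loss_edge = compute_log_loss(0.0, 1)        # finite thanks to epsilon
print(loss_coin_flip, loss_right, loss_wrong, loss_edge)
```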

Baseline log loss

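Next we use the loss function above to compute a baseline log loss on the training set. The baseline always predicts the mean of the training labels, i.e. the fraction of training points with label 1.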

    # TODO: Replace <FILL IN> with appropriate code
    # Note that our dataset has a very high click-through rate by design
    # In practice click-through rate can be one to two orders of magnitude lower
    classOneFracTrain = OHETrainData.map(lambda lp: lp.label).sum() / OHETrainData.count()
    print classOneFracTrain

    logLossTrBase = (OHETrainData
                     .map(lambda lp: computeLogLoss(classOneFracTrain, lp.label))
                     .sum() / OHETrainData.count())
    print 'Baseline Train Logloss = {0:.3f}\n'.format(logLossTrBase)
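
When every point receives the same prediction p and p is the fraction of positive labels, the average log loss reduces to the binary entropy of p. A small sketch with made-up labels (not the Criteo data) illustrates this:

```python
from math import log

labels = [1, 0, 0, 1, 0, 0, 0, 0]        # made-up labels, 2 of 8 positive
p = sum(labels) / float(len(labels))     # fraction of positives

# Average log loss when every point is predicted with probability p
baseline = sum(-log(p) if y == 1 else -log(1 - p) for y in labels) / len(labels)

# Closed form: binary entropy of p
entropy = -(p * log(p) + (1 - p) * log(1 - p))
print(p, baseline, entropy)
```

This gives a useful sanity check: any model worth keeping should score below this value on the same data.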

Predicted probability


    # TODO: Replace <FILL IN> with appropriate code
    from math import exp  # exp(-t) = e^-t

    def getP(x, w, intercept):
        """Calculate the probability for an observation given a set of weights and intercept.
        Note:
            We'll bound our raw prediction between 20 and -20 for numerical purposes.
        Args:
            x (SparseVector): A vector with values of 1.0 for features that exist in this
                observation and 0.0 otherwise.
            w (DenseVector): A vector of weights (betas) for the model.
            intercept (float): The model's intercept.
        Returns:
            float: A probability between 0 and 1.
        """
        # The raw prediction is the linear score, before the sigmoid is applied
        rawPrediction = x.dot(w) + intercept

        # Bound the raw prediction value
        rawPrediction = min(rawPrediction, 20)
        rawPrediction = max(rawPrediction, -20)
        return 1.0 / (1 + exp(-rawPrediction))

    trainingPredictions = OHETrainData.map(lambda lp: getP(lp.features, model0.weights, model0.intercept))
    print trainingPredictions.take(5)
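
The bound of 20 on the raw prediction matters because `exp` overflows for large arguments. The following sketch, with a hypothetical helper `sigmoid_clamped`, clips the linear score (x·w + intercept) before applying the logistic function and shows what would go wrong without the clip.

```python
from math import exp


def sigmoid_clamped(raw, bound=20.0):
    """Logistic function applied to a raw score clipped to [-bound, bound]."""
    raw = max(min(raw, bound), -bound)
    return 1.0 / (1.0 + exp(-raw))


try:
    exp(1000.0)   # what an unclamped sigmoid would evaluate for raw = -1000
    overflowed = False
except OverflowError:
    overflowed = True

print(overflowed)
print(sigmoid_clamped(-1000.0), sigmoid_clamped(0.0), sigmoid_clamped(1000.0))
```

Clipping also keeps the probabilities strictly inside (0, 1), which in turn keeps the log loss finite.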

Evaluate the model

    # TODO: Replace <FILL IN> with appropriate code
    def evaluateResults(model, data):
        """Calculates the log loss for the data given the model.
        Args:
            model (LogisticRegressionModel): A trained logistic regression model.
            data (RDD of LabeledPoint): Labels and features for each observation.
        Returns:
            float: Log loss for the data.
        """
        dataPrediction = data.map(lambda lp: (getP(lp.features, model.weights, model.intercept), lp.label))
        logLoss = dataPrediction.map(lambda (x, y): computeLogLoss(x, y)).sum() / dataPrediction.count()
        return logLoss

    logLossTrLR0 = evaluateResults(model0, OHETrainData)
    print ('OHE Features Train Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossTrBase, logLossTrLR0))

Validation log loss


    # TODO: Replace <FILL IN> with appropriate code
    logLossValBase = (OHEValidationData
                      .map(lambda lp: computeLogLoss(classOneFracTrain, lp.label))
                      .sum() / OHEValidationData.count())
    logLossValLR0 = evaluateResults(model0, OHEValidationData)
    print ('OHE Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossValBase, logLossValLR0))


Part 5 Reduce feature dimension via feature hashing

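The example above shows that OHE features give reasonably good predictions, but the number of features is very large, as many as 233K. This is why we turn to feature hashing.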

Hash function

    from collections import defaultdict
    import hashlib

    def hashFunction(numBuckets, rawFeats, printMapping=False):
        """Calculate a feature dictionary for an observation's features based on hashing.
        Note:
            Use printMapping=True for debug purposes and to better understand how the hashing works.
        Args:
            numBuckets (int): Number of buckets to use as features.
            rawFeats (list of (int, str)): A list of features for an observation. Represented as
                (featureID, value) tuples.
            printMapping (bool, optional): If true, the mappings of featureString to index will be
                printed.
        Returns:
            dict of int to float: The keys will be integers which represent the buckets that the
                features have been hashed to. The value for a given key will contain the count of the
                (featureID, value) tuples that have hashed to that key.
        """
        mapping = {}
        for ind, category in rawFeats:
            featureString = category + str(ind)
            mapping[featureString] = int(int(hashlib.md5(featureString).hexdigest(), 16) % numBuckets)
        if(printMapping): print mapping
        sparseFeatures = defaultdict(float)
        for bucket in mapping.values():
            sparseFeatures[bucket] += 1.0
        return dict(sparseFeatures)

    # Reminder of the sample values:
    # sampleOne = [(0, 'mouse'), (1, 'black')]
    # sampleTwo = [(0, 'cat'), (1, 'tabby'), (2, 'mouse')]
    # sampleThree = [(0, 'bear'), (1, 'black'), (2, 'salmon')]

    # TODO: Replace <FILL IN> with appropriate code
    # Use four buckets
    sampOneFourBuckets = hashFunction(4, sampleOne, True)
    sampTwoFourBuckets = hashFunction(4, sampleTwo, True)
    sampThreeFourBuckets = hashFunction(4, sampleThree, True)

    # Use one hundred buckets
    sampOneHundredBuckets = hashFunction(100, sampleOne, True)
    sampTwoHundredBuckets = hashFunction(100, sampleTwo, True)
    sampThreeHundredBuckets = hashFunction(100, sampleThree, True)

    print '\t\t 4 Buckets \t\t\t 100 Buckets'
    print 'SampleOne:\t {0}\t\t {1}'.format(sampOneFourBuckets, sampOneHundredBuckets)
    print 'SampleTwo:\t {0}\t\t {1}'.format(sampTwoFourBuckets, sampTwoHundredBuckets)
    print 'SampleThree:\t {0}\t {1}'.format(sampThreeFourBuckets, sampThreeHundredBuckets)
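
For readers on Python 3, the same hashing idea can be sketched without Spark. The only substantive difference is that `hashlib.md5` requires bytes, so the feature string is encoded first; `hash_features` is a hypothetical name, and UTF-8 is an assumed encoding.

```python
import hashlib
from collections import defaultdict


def hash_features(num_buckets, raw_feats):
    """Hash (featureID, value) pairs into a dict of bucket -> count."""
    mapping = {}
    for ind, category in raw_feats:
        feature_string = category + str(ind)
        digest = hashlib.md5(feature_string.encode('utf-8')).hexdigest()
        mapping[feature_string] = int(digest, 16) % num_buckets

    buckets = defaultdict(float)
    for bucket in mapping.values():
        buckets[bucket] += 1.0
    return dict(buckets)


sample_three = [(0, 'bear'), (1, 'black'), (2, 'salmon')]
four = hash_features(4, sample_three)
hundred = hash_features(100, sample_three)
print(four, hundred)
```

With only four buckets, collisions between features are likely and show up as counts greater than 1.0; with one hundred buckets they become much rarer. No dictionary has to be built or shipped to the workers, which is the main practical advantage over OHE.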

Creating hashed features


    # TODO: Replace <FILL IN> with appropriate code
    def parseHashPoint(point, numBuckets):
        """Create a LabeledPoint for this observation using hashing.
        Args:
            point (str): A comma separated string where the first value is the label and the rest are
                features.
            numBuckets: The number of buckets to hash to.
        Returns:
            LabeledPoint: A LabeledPoint with a label (0.0 or 1.0) and a SparseVector of hashed
                features.
        """
        pointList = point.split(',')
        pointSize = len(pointList) - 1
        pointLabel = pointList[0]
        pointFeatureRaw = list((i, pointList[i+1]) for i in range(pointSize))
        # printMapping is left at its default (False) so that workers do not print every mapping
        pointFeature = SparseVector(numBuckets, hashFunction(numBuckets, pointFeatureRaw))
        return LabeledPoint(pointLabel, pointFeature)

    numBucketsCTR = 2 ** 15
    hashTrainData = rawTrainData.map(lambda x: parseHashPoint(x, numBucketsCTR))
    hashTrainData.cache()
    hashValidationData = rawValidationData.map(lambda x: parseHashPoint(x, numBucketsCTR))
    hashValidationData.cache()
    hashTestData = rawTestData.map(lambda x: parseHashPoint(x, numBucketsCTR))
    hashTestData.cache()

    print hashTrainData.take(1)



    # TODO: Replace <FILL IN> with appropriate code
    def computeSparsity(data, d, n):
        """Calculates the average sparsity for the features in an RDD of LabeledPoints.
        Args:
            data (RDD of LabeledPoint): The LabeledPoints to use in the sparsity calculation.
            d (int): The total number of features.
            n (int): The number of observations in the RDD.
        Returns:
            float: The average of the ratio of features in a point to total features.
        """
        return data.map(lambda x: len(x.features.values)).sum() / float(d*n)

    averageSparsityHash = computeSparsity(hashTrainData, numBucketsCTR, nTrain)
    averageSparsityOHE = computeSparsity(OHETrainData, numCtrOHEFeats, nTrain)

    print 'Average OHE Sparsity: {0:.7e}'.format(averageSparsityOHE)
    print 'Average Hash Sparsity: {0:.7e}'.format(averageSparsityHash)
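
The sparsity measure is simply the average share of non-zero entries per observation. A toy calculation, with numbers invented for illustration rather than taken from the Criteo data, shows how the ratio changes when the same observations are squeezed into fewer dimensions; `average_sparsity` is a hypothetical local helper.

```python
def average_sparsity(nonzero_counts, d):
    """Mean fraction of non-zero features per observation."""
    n = len(nonzero_counts)
    return sum(nonzero_counts) / float(d * n)


# Toy numbers: 3 observations with 2, 3 and 3 active features.
wide = average_sparsity([2, 3, 3], 7)     # d = 7 one-hot features
narrow = average_sparsity([2, 3, 3], 4)   # same counts in d = 4 buckets
print(wide, narrow)
```

The toy example assumes no hash collisions; in practice collisions can lower the number of non-zero entries slightly, but the hashed representation is still expected to be denser because d is much smaller.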

Logistic model with hashed features

    numIters = 500
    regType = 'l2'
    includeIntercept = True

    # Initialize variables using values from initial model training
    bestModel = None
    bestLogLoss = 1e10

    # TODO: Replace <FILL IN> with appropriate code
    stepSizes = (1, 10)
    regParams = (1e-6, 1e-3)
    for stepSize in stepSizes:
        for regParam in regParams:
            model = (LogisticRegressionWithSGD
                     .train(hashTrainData, numIters, stepSize, regParam=regParam, regType=regType,
                            intercept=includeIntercept))
            logLossVa = evaluateResults(model, hashValidationData)
            print ('\tstepSize = {0:.1f}, regParam = {1:.0e}: logloss = {2:.3f}'
                   .format(stepSize, regParam, logLossVa))
            if (logLossVa < bestLogLoss):
                bestModel = model
                bestLogLoss = logLossVa

    print ('Hashed Features Validation Logloss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossValBase, bestLogLoss))
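
The nested loops above are a plain grid search: train once per (stepSize, regParam) pair and keep whichever model has the lowest validation log loss. The pattern can be sketched independently of Spark as follows. `grid_search` and `train_and_score` are hypothetical names, and the loss values are made up purely to exercise the code; they are not results from the lab.

```python
from itertools import product


def grid_search(step_sizes, reg_params, train_and_score):
    """Return (best_params, best_loss) over all (stepSize, regParam) pairs."""
    best_params, best_loss = None, float('inf')
    for step_size, reg_param in product(step_sizes, reg_params):
        loss = train_and_score(step_size, reg_param)
        if loss < best_loss:
            best_params, best_loss = (step_size, reg_param), loss
    return best_params, best_loss


# Stand-in scoring function with made-up values; in the lab this step would
# train LogisticRegressionWithSGD and return the validation log loss.
fake_losses = {(1, 1e-6): 0.48, (1, 1e-3): 0.49,
               (10, 1e-6): 0.46, (10, 1e-3): 0.47}
best = grid_search((1, 10), (1e-6, 1e-3), lambda s, r: fake_losses[(s, r)])
print(best)
```

Selecting on the validation set and reporting on the held-out test set afterwards keeps the final number an honest estimate.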

Evaluate on the test set


    # TODO: Replace <FILL IN> with appropriate code
    # Log loss for the best model from (5d)
    logLossTest = evaluateResults(bestModel, hashTestData)

    # Log loss for the baseline model
    logLossTestBaseline = (hashTestData
                           .map(lambda lp: computeLogLoss(classOneFracTrain, lp.label))
                           .sum() / hashTestData.count())

    print ('Hashed Features Test Log Loss:\n\tBaseline = {0:.3f}\n\tLogReg = {1:.3f}'
           .format(logLossTestBaseline, logLossTest))


