Spark机器学习读书笔记-CH03
3.1.获取数据:
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
3.2.探索与可视化数据:
In [3]: user_data=sc.textFile("file:///root/studio/MachineLearningWithSpark/ch03/ml-100k/u.user")
In [4]: user_data.first()
Out[4]: u'1|24|M|technician|85711'
In [5]: user_fields=user_data.map(lambda line: line.split("|"))
In [8]: num_users = user_fields.map(lambda fields: fields[0]).count()
In [10]: num_genders=user_fields.map(lambda fields: fields[2]).distinct().count()
In [11]: num_occupations=user_fields.map(lambda fields: fields[3]).distinct().count()
In [12]: num_zIpcodes=user_fields.map(lambda fields: fields[4]).distinct().count()
In [16]: print "Users: %d, genders: %d, occupations: %d, zip codes: %d" %(num_users, num_genders, num_occupations, num_zipcodes)
Users: 943, genders: 2, occupations: 21, zip codes: 795
In [17]: ages = user_fields.map(lambda x: int(x[1])).collect()
In [18]: hist(ages, bins=20, color='lightblue', normed=True)
Out[18]:
(array([ 0.00064269, 0.00192808, 0.00449886, 0.0279572 , 0.02956393,
0.03374144, 0.04563129, 0.02538642, 0.02088756, 0.01863813,
0.02088756, 0.01606735, 0.0170314 , 0.01863813, 0.00674829,
0.00482021, 0.0054629 , 0.00192808, 0.00128539, 0.00128539]),
array([ 7. , 10.3, 13.6, 16.9, 20.2, 23.5, 26.8, 30.1, 33.4,
36.7, 40. , 43.3, 46.6, 49.9, 53.2, 56.5, 59.8, 63.1,
66.4, 69.7, 73. ]),
<a list of 20 Patch objects>)
n [19]: fig = matplotlib.pyplot.gcf()
In [20]: fig.set_size_inches(16, 10)
In [23]: count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
In [24]: import numpy as np
In [25]: x_axis1 = np.array([c[0] for c in count_by_occupation])
In [26]: y_axis1 = np.array([c[1] for c in count_by_occupation])
In [27]: x_axis = x_axis1[np.argsort(x_axis1)]
In [28]: y_axis = y_axis1[np.argsort(y_axis1)]
In [29]: pos = np.arange(len(x_axis))
In [30]: width = 1.0
In [31]: ax = plt.axes()
In [32]: ax.set_xticks(pos + (width / 2))
Out[32]:
[<matplotlib.axis.XTick at 0x7f1257bc6f50>,
<matplotlib.axis.XTick at 0x7f1257bc6a10>,
<matplotlib.axis.XTick at 0x7f1256fa2050>,
<matplotlib.axis.XTick at 0x7f1256fa2910>,
<matplotlib.axis.XTick at 0x7f1256fbe090>,
<matplotlib.axis.XTick at 0x7f1256fbe7d0>,
<matplotlib.axis.XTick at 0x7f1256fbef10>,
<matplotlib.axis.XTick at 0x7f1256fc9690>,
<matplotlib.axis.XTick at 0x7f1256fc9dd0>,
<matplotlib.axis.XTick at 0x7f124e6033d0>,
<matplotlib.axis.XTick at 0x7f1257b604d0>,
<matplotlib.axis.XTick at 0x7f124e603c90>,
<matplotlib.axis.XTick at 0x7f1257b602d0>,
<matplotlib.axis.XTick at 0x7f1257b60d90>,
<matplotlib.axis.XTick at 0x7f124e60f510>,
<matplotlib.axis.XTick at 0x7f124e60fc50>,
<matplotlib.axis.XTick at 0x7f124e6183d0>,
<matplotlib.axis.XTick at 0x7f124e618b10>,
<matplotlib.axis.XTick at 0x7f124e623290>,
<matplotlib.axis.XTick at 0x7f124e6239d0>,
<matplotlib.axis.XTick at 0x7f121c583150>]
In [34]: ax.set_xticklabels(x_axis)
Out[34]:
[<matplotlib.text.Text at 0x7f1257bc6410>,
<matplotlib.text.Text at 0x7f1257b68350>,
<matplotlib.text.Text at 0x7f1256fa2790>,
<matplotlib.text.Text at 0x7f1256fa2ed0>,
<matplotlib.text.Text at 0x7f1256fbe650>,
<matplotlib.text.Text at 0x7f1256fbed90>,
<matplotlib.text.Text at 0x7f1256fc9510>,
<matplotlib.text.Text at 0x7f1256fc9c50>,
<matplotlib.text.Text at 0x7f1256fd23d0>,
<matplotlib.text.Text at 0x7f1257c29ad0>,
<matplotlib.text.Text at 0x7f124e603f10>,
<matplotlib.text.Text at 0x7f1257b60510>,
<matplotlib.text.Text at 0x7f1257b60c10>,
<matplotlib.text.Text at 0x7f124e60f390>,
<matplotlib.text.Text at 0x7f124e60fad0>,
<matplotlib.text.Text at 0x7f124e618250>,
<matplotlib.text.Text at 0x7f124e618990>,
<matplotlib.text.Text at 0x7f124e623110>,
<matplotlib.text.Text at 0x7f124e623850>,
<matplotlib.text.Text at 0x7f124e623f90>,
<matplotlib.text.Text at 0x7f121c583710>]
In [35]: plt.bar(pos, y_axis, width, color='lightblue')
Out[35]: <Container object of 21 artists>
In [36]: plt.xticks(rotation=30)
Out[36]:
(array([ 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5,
9.5, 10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5,
18.5, 19.5, 20.5]), <a list of 21 Text xticklabel objects>)
In [37]: fig = matplotlib.pyplot.gcf()
In [38]: fig.set_size_inches(16, 10)
In [39]: count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
In [46]: print "Map-reduce approach: "
Map-reduce approach:
In [47]: print dict(count_by_occupation)
{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, u'librarian': 51, u'programmer': 66, u'artist': 28, u'salesman': 12, u'other': 105, u'homemaker': 7, u'engineer': 67}
In [48]: print ""
In [49]: print "countByValue approach:"
countByValue approach:
In [50]: print dict(count_by_occupation2)
{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}
In [51]: movie_data=sc.textFile("file:///root/studio/MachineLearningWithSpark/ch03/ml-100k/u.item")
In [52]: print movie_data.first()
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
In [53]: num_movies = movie_data.count()
In [54]: print "Movies: %d " % num_movies
Movies: 1682
In [51]: movie_data=sc.textFile("file:///root/studio/MachineLearningWithSpark/ch03/ml-100k/u.item")
In [52]: print movie_data.first()
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
In [53]: num_movies = movie_data.count()
In [54]: print "Movies: %d " % num_movies
Movies: 1682
In [55]: def convert_year(x):
....: try:
....: return int(x[-4:])
....: except:
....: return 1990
....:
In [56]: movie_fields = movie_data.map(lambda lines: lines.split("|"))
In [57]: years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
In [58]: years_filtered = years.filter(lambda x: x != 1900)
In [59]: movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
In [60]: values = movie_ages.values()
In [61]: bins = movie_ages.keys()
In [62]: hist(values, bins=bins, color='lightblue', normed=True)
Out[62]:
(array([ 0. , 0.07575758, 0.09090909, 0.09090909, 0.18181818,
0.18181818, 0.04545455, 0.07575758, 0.07575758, 0.03030303,
0. , 0.01515152, 0.01515152, 0.03030303, 0. ,
0.03030303, 0. , 0. , 0. , 0. ,
0. , 0. , 0.01515152, 0. , 0. ,
0.01515152, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.01515152, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.01515152, 0. , 0. , 0. , 0. ]),
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 72, 76]),
<a list of 70 Patch objects>)
In [63]: fig = matplotlib.pyplot.gcf()
In [64]: fig.set_size_inches(16, 10)
In [65]: rating_data = sc.textFile("file:///root/studio/MachineLearningWithSpark/ch03/ml-100k/u.data")
In [66]: print rating_data.first()
196 242 3 881250949
In [67]: num_ratings = rating_data.count()
In [68]: print "Ratings: %d " % num_ratings
Ratings: 100000
In [76]: rating_data = rating_data.map(lambda line: line.split("\t"))
In [77]: ratings = rating_data.map(lambda fields: int(fields[2]))
In [78]: max_rating = ratings.reduce(lambda x, y: max(x, y))
In [79]: min_rating = ratings.reduce(lambda x, y: min(x, y))
In [80]: mean_rating = ratings.reduce(lambda x, y: x + y)/num_ratings
In [81]: median_rating = np.median(ratings.collect())
In [82]: ratings_per_uer = num_ratings / num_users
In [76]: rating_data = rating_data.map(lambda line: line.split("\t"))
In [77]: ratings = rating_data.map(lambda fields: int(fields[2]))
In [78]: max_rating = ratings.reduce(lambda x, y: max(x, y))
In [79]: min_rating = ratings.reduce(lambda x, y: min(x, y))
In [80]: mean_rating = ratings.reduce(lambda x, y: x + y)/num_ratings
In [81]: median_rating = np.median(ratings.collect())
In [82]: ratings_per_uer = num_ratings / num_users
In [83]: ratings_per_movie = num_ratings / num_movies
In [84]: print "Min ratings: %d" % min_rating
Min ratings: 1
In [85]: print "Max ratings: %d" % max_rating
Max ratings: 5
In [86]: print "Average rating: %2.2f" % mean_rating
Average rating: 3.00
In [87]: print "Median rating: %d" % mean_rating
Median rating: 3
In [88]: print "Average # of ratings per user: %2.2f" % ratings_per_uer
Average # of ratings per user: 106.00
In [89]: print "Average # of ratings per movie: %2.2f" % ratings_per_movie
Average # of ratings per movie: 59.00
In [90]: ratings.stats()
Out[90]: (count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
In [91]: count_by_rating = ratings.countByValue()
In [92]: x_axis = np.array(count_by_rating.keys())
In [93]: y_axis = np.array([float(c) for c in count_by_rating.values()])
In [94]: y_axis_normed = y_axis / y_axis.sum()
In [95]: pos = np.arange(len(x_axis))
In [96]: width = 1.0
In [97]: ax = plt.axes()
In [98]: ax.set_xticks(pos + (width / 2))
Out[98]:
[<matplotlib.axis.XTick at 0x7f121c371250>,
<matplotlib.axis.XTick at 0x7f121c360d90>,
<matplotlib.axis.XTick at 0x7f121c2e0e10>,
<matplotlib.axis.XTick at 0x7f121c2df5d0>,
<matplotlib.axis.XTick at 0x7f121c2dfd10>]
In [99]: ax.set_xticklabels(x_axis)
Out[99]:
[<matplotlib.text.Text at 0x7f121c290ed0>,
<matplotlib.text.Text at 0x7f121c298c90>,
<matplotlib.text.Text at 0x7f121c2df450>,
<matplotlib.text.Text at 0x7f121c2dfb90>,
<matplotlib.text.Text at 0x7f121c2fd310>]
In [100]:
In [100]: plt.bar(pos, y_axis_normed, width, color='lightblue')
Out[100]: <Container object of 5 artists>
In [101]: plt.xticks(rotation=30)
Out[101]: (array([ 0.5, 1.5, 2.5, 3.5, 4.5]), <a list of 5 Text xticklabel objects>)
In [102]: fig = matplotlib.pyplot.gcf()
In [103]: fig.set_size_inches(16, 10)
In [104]: user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
In [105]: user_ratings_by_user = user_ratings_grouped.map(lambda (k, v): (k, len(v)))
In [106]: user_ratings_by_user.take(5)
Out[106]: [(2, 62), (4, 24), (6, 211), (8, 59), (10, 184)]
In [107]: user_ratings_by_user_local = user_ratings_by_user.map(lambda (k, v): v).collect()
In [108]: hist(user_ratings_by_user_local, bins=200, color='lightblue', normed=True)
Out[108]:
(array([ 0.02958007, 0.02129765, 0.01212783, 0.01212783, 0.00798662,
0.00946562, 0.00916982, 0.00739502, 0.00769082, 0.00621181,
0.00887402, 0.00532441, 0.00562021, 0.00414121, 0.00384541,
0.00532441, 0.00236641, 0.00354961, 0.0017748 , 0.0017748 ,
0.00295801, 0.00266221, 0.00325381, 0.00414121, 0.00414121,
0.00266221, 0.0017748 , 0.00236641, 0.00266221, 0.00295801,
0.0020706 , 0.0020706 , 0.00354961, 0.0017748 , 0.00236641,
0.00384541, 0.0017748 , 0.00295801, 0.001479 , 0.00266221,
0.0011832 , 0.001479 , 0.0017748 , 0.0008874 , 0.001479 ,
0.00236641, 0.0020706 , 0.001479 , 0.0008874 , 0.001479 ,
0.0008874 , 0.0020706 , 0.0011832 , 0.0008874 , 0.0020706 ,
0.0002958 , 0.0017748 , 0.0011832 , 0.0011832 , 0.0017748 ,
0.001479 , 0.0011832 , 0.0008874 , 0.0002958 , 0.0005916 ,
0.0002958 , 0.0008874 , 0.0008874 , 0.0002958 , 0.0008874 ,
0.0017748 , 0.001479 , 0.0008874 , 0.0008874 , 0.0005916 ,
0. , 0.0011832 , 0.0002958 , 0.0002958 , 0.0011832 ,
0.0002958 , 0.0005916 , 0.0005916 , 0.0005916 , 0.0005916 ,
0.0008874 , 0. , 0.0008874 , 0. , 0.0002958 ,
0. , 0. , 0.0002958 , 0. , 0.0011832 ,
0.0002958 , 0.0002958 , 0.0002958 , 0. , 0.0002958 ,
0.0005916 , 0. , 0.0011832 , 0. , 0. ,
0.0008874 , 0.0002958 , 0.0002958 , 0. , 0.0002958 ,
0. , 0. , 0. , 0. , 0. ,
0.0005916 , 0. , 0. , 0. , 0.0002958 ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.0002958 , 0.0002958 ,
0. , 0.0005916 , 0. , 0. , 0. ,
0. , 0. , 0. , 0.0002958 , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.0002958 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.0002958 , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.0002958 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.0002958 ]),
array([ 20. , 23.585, 27.17 , 30.755, 34.34 , 37.925,
41.51 , 45.095, 48.68 , 52.265, 55.85 , 59.435,
63.02 , 66.605, 70.19 , 73.775, 77.36 , 80.945,
84.53 , 88.115, 91.7 , 95.285, 98.87 , 102.455,
106.04 , 109.625, 113.21 , 116.795, 120.38 , 123.965,
127.55 , 131.135, 134.72 , 138.305, 141.89 , 145.475,
149.06 , 152.645, 156.23 , 159.815, 163.4 , 166.985,
170.57 , 174.155, 177.74 , 181.325, 184.91 , 188.495,
192.08 , 195.665, 199.25 , 202.835, 206.42 , 210.005,
213.59 , 217.175, 220.76 , 224.345, 227.93 , 231.515,
235.1 , 238.685, 242.27 , 245.855, 249.44 , 253.025,
256.61 , 260.195, 263.78 , 267.365, 270.95 , 274.535,
278.12 , 281.705, 285.29 , 288.875, 292.46 , 296.045,
299.63 , 303.215, 306.8 , 310.385, 313.97 , 317.555,
321.14 , 324.725, 328.31 , 331.895, 335.48 , 339.065,
342.65 , 346.235, 349.82 , 353.405, 356.99 , 360.575,
364.16 , 367.745, 371.33 , 374.915, 378.5 , 382.085,
385.67 , 389.255, 392.84 , 396.425, 400.01 , 403.595,
407.18 , 410.765, 414.35 , 417.935, 421.52 , 425.105,
428.69 , 432.275, 435.86 , 439.445, 443.03 , 446.615,
450.2 , 453.785, 457.37 , 460.955, 464.54 , 468.125,
471.71 , 475.295, 478.88 , 482.465, 486.05 , 489.635,
493.22 , 496.805, 500.39 , 503.975, 507.56 , 511.145,
514.73 , 518.315, 521.9 , 525.485, 529.07 , 532.655,
536.24 , 539.825, 543.41 , 546.995, 550.58 , 554.165,
557.75 , 561.335, 564.92 , 568.505, 572.09 , 575.675,
579.26 , 582.845, 586.43 , 590.015, 593.6 , 597.185,
600.77 , 604.355, 607.94 , 611.525, 615.11 , 618.695,
622.28 , 625.865, 629.45 , 633.035, 636.62 , 640.205,
643.79 , 647.375, 650.96 , 654.545, 658.13 , 661.715,
665.3 , 668.885, 672.47 , 676.055, 679.64 , 683.225,
686.81 , 690.395, 693.98 , 697.565, 701.15 , 704.735,
708.32 , 711.905, 715.49 , 719.075, 722.66 , 726.245,
729.83 , 733.415, 737. ]),
<a list of 200 Patch objects>)
In [109]: fig = matplotlib.pyplot.gcf()
In [110]: fig.set_size_inches(16, 10)
In [111]: hist(user_ratings_by_user_local, bins=200, color='lightblue', normed=True)
Out[111]:
(array([ 0.02958007, 0.02129765, 0.01212783, 0.01212783, 0.00798662,
0.00946562, 0.00916982, 0.00739502, 0.00769082, 0.00621181,
0.00887402, 0.00532441, 0.00562021, 0.00414121, 0.00384541,
0.00532441, 0.00236641, 0.00354961, 0.0017748 , 0.0017748 ,
0.00295801, 0.00266221, 0.00325381, 0.00414121, 0.00414121,
0.00266221, 0.0017748 , 0.00236641, 0.00266221, 0.00295801,
0.0020706 , 0.0020706 , 0.00354961, 0.0017748 , 0.00236641,
0.00384541, 0.0017748 , 0.00295801, 0.001479 , 0.00266221,
0.0011832 , 0.001479 , 0.0017748 , 0.0008874 , 0.001479 ,
0.00236641, 0.0020706 , 0.001479 , 0.0008874 , 0.001479 ,
0.0008874 , 0.0020706 , 0.0011832 , 0.0008874 , 0.0020706 ,
0.0002958 , 0.0017748 , 0.0011832 , 0.0011832 , 0.0017748 ,
0.001479 , 0.0011832 , 0.0008874 , 0.0002958 , 0.0005916 ,
0.0002958 , 0.0008874 , 0.0008874 , 0.0002958 , 0.0008874 ,
0.0017748 , 0.001479 , 0.0008874 , 0.0008874 , 0.0005916 ,
0. , 0.0011832 , 0.0002958 , 0.0002958 , 0.0011832 ,
0.0002958 , 0.0005916 , 0.0005916 , 0.0005916 , 0.0005916 ,
0.0008874 , 0. , 0.0008874 , 0. , 0.0002958 ,
0. , 0. , 0.0002958 , 0. , 0.0011832 ,
0.0002958 , 0.0002958 , 0.0002958 , 0. , 0.0002958 ,
0.0005916 , 0. , 0.0011832 , 0. , 0. ,
0.0008874 , 0.0002958 , 0.0002958 , 0. , 0.0002958 ,
0. , 0. , 0. , 0. , 0. ,
0.0005916 , 0. , 0. , 0. , 0.0002958 ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.0002958 , 0.0002958 ,
0. , 0.0005916 , 0. , 0. , 0. ,
0. , 0. , 0. , 0.0002958 , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.0002958 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.0002958 , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.0002958 , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.0002958 ]),
array([ 20. , 23.585, 27.17 , 30.755, 34.34 , 37.925,
41.51 , 45.095, 48.68 , 52.265, 55.85 , 59.435,
63.02 , 66.605, 70.19 , 73.775, 77.36 , 80.945,
84.53 , 88.115, 91.7 , 95.285, 98.87 , 102.455,
106.04 , 109.625, 113.21 , 116.795, 120.38 , 123.965,
127.55 , 131.135, 134.72 , 138.305, 141.89 , 145.475,
149.06 , 152.645, 156.23 , 159.815, 163.4 , 166.985,
170.57 , 174.155, 177.74 , 181.325, 184.91 , 188.495,
192.08 , 195.665, 199.25 , 202.835, 206.42 , 210.005,
213.59 , 217.175, 220.76 , 224.345, 227.93 , 231.515,
235.1 , 238.685, 242.27 , 245.855, 249.44 , 253.025,
256.61 , 260.195, 263.78 , 267.365, 270.95 , 274.535,
278.12 , 281.705, 285.29 , 288.875, 292.46 , 296.045,
299.63 , 303.215, 306.8 , 310.385, 313.97 , 317.555,
321.14 , 324.725, 328.31 , 331.895, 335.48 , 339.065,
342.65 , 346.235, 349.82 , 353.405, 356.99 , 360.575,
364.16 , 367.745, 371.33 , 374.915, 378.5 , 382.085,
385.67 , 389.255, 392.84 , 396.425, 400.01 , 403.595,
407.18 , 410.765, 414.35 , 417.935, 421.52 , 425.105,
428.69 , 432.275, 435.86 , 439.445, 443.03 , 446.615,
450.2 , 453.785, 457.37 , 460.955, 464.54 , 468.125,
471.71 , 475.295, 478.88 , 482.465, 486.05 , 489.635,
493.22 , 496.805, 500.39 , 503.975, 507.56 , 511.145,
514.73 , 518.315, 521.9 , 525.485, 529.07 , 532.655,
536.24 , 539.825, 543.41 , 546.995, 550.58 , 554.165,
557.75 , 561.335, 564.92 , 568.505, 572.09 , 575.675,
579.26 , 582.845, 586.43 , 590.015, 593.6 , 597.185,
600.77 , 604.355, 607.94 , 611.525, 615.11 , 618.695,
622.28 , 625.865, 629.45 , 633.035, 636.62 , 640.205,
643.79 , 647.375, 650.96 , 654.545, 658.13 , 661.715,
665.3 , 668.885, 672.47 , 676.055, 679.64 , 683.225,
686.81 , 690.395, 693.98 , 697.565, 701.15 , 704.735,
708.32 , 711.905, 715.49 , 719.075, 722.66 , 726.245,
729.83 , 733.415, 737. ]),
<a list of 200 Patch objects>)
3.3. 处理与转换数据;
In [112]: years_pre_processed = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x)).collect()
In [113]: years_pre_processed_array = np.array(years_pre_processed)
In [114]: mean_year = np.mean(years_pre_processed_array[years_pre_processed_array != 1900])
In [115]: median_year = np.median(years_pre_processed_array[years_pre_processed_array != 1900])
In [122]: index_bad_data = np.where(years_pre_processed_array == 1900)[0]
In [123]: index_bad_data
Out[123]: array([], dtype=int64)
In [124]: years_pre_processed_array[index_bad_data] = median_year
In [125]: print "Mean year of release: %d" % mean_year
Mean year of release: 1989
In [126]: print "Median year of release: %d" % median_year
Median year of release: 1995
In [130]: print "Index of '1900' after assigning median: %s" % np.where(years_pre_processed_array == 1900)[0]
Index of '1900' after assigning median: []
3.4.从数据中提取有用特征:
In [131]: all_occupations = user_fields.map(lambda fields: fields[3]).distinct().collect()
In [132]: all_occupations.sort()
In [133]:
In [133]: idx = 0
In [134]: all_occupations_dict = {}
In [135]: for o in all_occupations:
.....: all_occupations_dict[o] = idx
.....: idx += 1
.....:
In [136]: print "Encoding of 'doctor': %d" %all_occupations_dict['doctor']
Encoding of 'doctor': 2
In [137]: print "Encoding of 'programmer': %d" %all_occupations_dict['programmer']
Encoding of 'programmer': 14
In [139]: k = len(all_occupations_dict)
In [140]: binary_x = np.zeros(k)
In [141]: k_programmer = all_occupations_dict['programmer']
In [142]: binary_x[k_programmer] = 1
In [143]: print "Binary feature vector: %s" %binary_x
Binary feature vector: [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0.]
In [144]: print "Length of binary vector: %d" %k
Length of binary vector: 21
In [145]: def extract_datetime(ts):
.....: import datetime
.....: return datetime.datetime.fromtimestamp(ts)
.....:
In [149]: timestamps = rating_data.map(lambda fields: int(fields[3]))
In [150]: hour_of_day = timestamps.map(lambda ts: extract_datetime(ts).hour)
In [151]: hour_of_day.take(5)
Out[151]: [23, 3, 15, 13, 13]
In [154]: def assign_tod(hr):
.....: times_of_day = {
.....: 'morning' : range(7, 12),
.....: 'lunch' : range(12, 14),
.....: 'afternoon' : range(14, 18),
.....: 'evening' : range(18, 23),
.....: 'night' : range(23, 7)
.....: }
.....: for k, v in times_of_day.iteritems():
.....: if hr in v:
.....: return k
.....:
In [166]: def assign_tod(hr):
.....: times_of_day = {
.....: 'morning' : range(7, 12),
.....: 'lunch' : range(12, 14),
.....: 'afternoon' : range(14, 18),
.....: 'evening' : range(18, 23),
.....: 'night' : range(23, 24) + range(0, 7)
.....: }
.....: for k, v in times_of_day.iteritems():
.....: if hr in v:
.....: return k
.....:
In [167]:
In [167]: time_of_day = hour_of_day.map(lambda hr: assign_tod(hr))
In [168]: time_of_day.take(5)
Out[168]: ['night', 'night', 'afternoon', 'lunch', 'lunch']
In [170]: def extract_titile(raw):
.....: import re
.....: grps = re.search("\((\w+)\)", raw)
.....: if grps:
.....: return raw[:grps.start()].strip()
.....: else:
.....: return raw
.....:
In [171]: raw_titles = movie_fields.map(lambda fields: fields[1])
In [172]: for raw_title in raw_titles.take(5):
.....: print extract_titile(raw_title)
.....:
Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat
In [173]: movie_titles = raw_titles.map(lambda m: extract_titile(m))
In [174]: title_terms = movie_titles.map(lambda t: t.split(" "))
In [175]: print title_terms.take(5)
[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'], [u'Get', u'Shorty'], [u'Copycat']]
In [176]: all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
In [177]: idx = 0
In [178]: all_terms_dict = {}
In [179]: for term in all_terms:
.....: all_occupations_dict[term] = idx
.....: idx += 1
.....:
In [180]: print "Total number of terms: %d" % len(all_terms_dict)
Total number of terms: 0
In [181]: print "Index of term 'Dead': %d" % all_occupations_dict['Dead']
Index of term 'Dead': 147
In [182]: print "Index of term 'Rooms': %d" % all_occupations_dict['Rooms']
Index of term 'Rooms': 1963
In [184]: %paste
def create_vector(terms, term_dict):
from scipy import sparse as sp
num_terms = len(term_dict)
x = sp.csc_matrix((1, num_terms))
for t in terms:
if t in term_dict:
idx = term_dict[t]
x[0, idx] = 1
return x
## -- End pasted text --
In [185]:
In [185]: all_terms_bcast = sc.broadcast(all_terms_dict)
In [186]: term_vectors = title_terms.map(lambda terms: create_vector(terms, all_terms_bcast.value))
In [187]: term_vectors.take(5)
Out[187]:
[<1x0 sparse matrix of type '<type 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Column format>,
<1x0 sparse matrix of type '<type 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Column format>,
<1x0 sparse matrix of type '<type 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Column format>,
<1x0 sparse matrix of type '<type 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Column format>,
<1x0 sparse matrix of type '<type 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Column format>]
In [188]: np.random.seed(42)
In [189]: x = np.random.randn(10)
In [190]: norm_x_2 = np.linalg.norm(x)
In [191]: normalized_x = x /norm_x_2
In [192]: print "x: \n%s" % x
x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
In [193]: print "Normalized x: \n%s" % normalized_x
Normalized x:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
In [194]: print "2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x)
2-Norm of normalized_x: 1.0000
In [199]: vector = sc.parallelize([x])
In [200]: from pyspark.mllib.feature import Normalizer
In [201]: normalizer = Normalizer()
In [202]: vector = sc.parallelize([x])
In [203]: normalized_x_mllib = normalizer.transform(vector).first().toArray()
In [204]: print "x: \n%s" % x
x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
In [205]: print "2-Norm of x: %2.4f" % norm_x_2
2-Norm of x: 2.5908
In [206]: print "Normalized x MLlib: \n%s" % normalized_x_mllib
Normalized x MLlib:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
In [207]: print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)
2-Norm of normalized_x_mllib: 1.0000
Spark机器学习读书笔记-CH03的更多相关文章
- Spark机器学习读书笔记-CH05
5.2.从数据中提取合适的特征 [root@demo1 ch05]# sed 1d train.tsv > train_noheader.tsv[root@demo1 ch05]# lltota ...
- Spark机器学习读书笔记-CH04
[root@demo1 ch04]# spark-shell --master yarn --jars /root/studio/jblas-1.2.3.jar scala> val rawDa ...
- 视觉机器学习读书笔记--------BP学习
反向传播算法(Back-Propagtion Algorithm)即BP学习属于监督式学习算法,是非常重要的一种人工神经网络学习方法,常被用来训练前馈型多层感知器神经网络. 一.BP学习原理 1.前馈 ...
- 视觉机器学习读书笔记--------SVM方法
SVM是一种二类分类模型,有监督的统计学习方法,能够最小化经验误差和最大化几何边缘,被称为最大间隔分类器,可用于分类和回归分析.支持向量机的学习策略就是间隔最大化,可形式化为一个求解凸二次规划的问题, ...
- 机器学习读书笔记(一)k-近邻算法
一.机器学习是什么 机器学习的英文名称叫Machine Learning,简称ML,该领域主要研究的是如何使计算机能够模拟人类的学习行为从而获得新的知识和技能,并且重新组织已学习到的知识和和技能,使之 ...
- 机器学习读书笔记(七)支持向量机之线性SVM
一.SVM SVM的英文全称是Support Vector Machines,我们叫它支持向量机.支持向量机是我们用于分类的一种算法. 1 示例: 先用一个例子,来了解一下SVM 桌子上放了两种颜色的 ...
- 机器学习读书笔记(五)AdaBoost
一.Boosting算法 .Boosting算法是一种把若干个分类器整合为一个分类器的方法,在boosting算法产生之前,还出现过两种比较重要的将多个分类器整合为一个分类器的方法,即boostrap ...
- 机器学习读书笔记(二)使用k-近邻算法改进约会网站的配对效果
一.背景 海伦女士一直使用在线约会网站寻找适合自己的约会对象.尽管约会网站会推荐不同的任选,但她并不是喜欢每一个人.经过一番总结,她发现自己交往过的人可以进行如下分类 不喜欢的人 魅力一般的人 极具魅 ...
- 【Todo】【读书笔记】机器学习-周志华
书籍位置: /Users/baidu/Documents/Data/Interview/机器学习-数据挖掘/<机器学习_周志华.pdf> 一共442页.能不能这个周末先囫囵吞枣看完呢.哈哈 ...
随机推荐
- 并查集 poj2236
网址:http://poj.org/problem?id=2236 题意:有n台坏的电脑,如果每两台电脑的距离不能超过d,那么这两台电脑有联系,用字符串O 表示标记第x台电脑维修了,用S判断从X到y是 ...
- JavaScript中的各种宽高属性
转自慕课网:http://www.imooc.com/article/14516 在js中,存在着N多的关于高度和宽度的属性,比如:clientHeight.offsetHeight.scroll ...
- 把数据输出到Word (非插件形式)
项目开发过程中,我们要把数据以各种各样的形式展现给客户.把数据以文档的形式展现给客户相信是一种比较头疼的问题,如果没有好的方法会 使得我的开发繁琐,而且满足不了客户的需求.接下来我会通过两种开发方式介 ...
- 高效的插入子节点DocumentFragment
DocumentFragment 对象 DocumentFragment 接口表示文档的一部分(或一段).更确切地说,它表示一个或多个邻接的 Document 节点和它们的所有子孙节点. Docume ...
- Media Player插件
<object classid="clsid:22D6F312-B0F6-11D0-94AB-0080C74C7E95" id="MediaPlayer1" ...
- MFC编程入门之二十二(常用控件:按钮控件Button、Radio Button和Check Box)
本节继续讲解常用控件--按钮控件的使用. 按钮控件简介 按钮控件包括命令按钮(Button).单选按钮(Radio Button)和复选框(Check Box)等.命令按钮就是我们前面多次提到的侠义的 ...
- 转:深入浅出UML类图(具体到代码层次)
深入浅出UML类图 作者:刘伟 ,发布于:2012-11-23,来源:CSDN 在UML 2.0的13种图形中,类图是使用频率最高的UML图之一.Martin Fowler在其著作<UML ...
- C# for循环 循环中先求长度还是使用list.Count,哪个效率高
这种问法可能问的不好.应该说成是第一种方法完全没有必要,很多人可能以为那样会为代码带来效率,而实际上是不会给效率带来任何提升. 因为事实上,索引器内部,为了安全期间,还是会去求整个list的count ...
- ngui的tween的tweenFactor属性
ngui的tween的tweenFactor属性 这个属性是用来记录动画运行的位置的.可以通过设置它来达到动画运行到一半从新设置从新开始
- 关于unity碰撞检测器的用法
今天已经是我第三次忘记了这两种碰撞检测的用法,混淆了.特意整理一下 首先把今天要解决涉及到的东西列出来 碰撞方法: public void OnTriggerEnter(Collider other) ...