Spark 实践——基于 Spark MLlib 和 YFCC 100M 数据集的景点推荐系统

1.前言

上接 YFCC 100M数据集分析笔记和使用百度地图api可视化聚类结果, 在对 YFCC 100M 聚类出的景点信息的基础上，使用 Spark MLlib 提供的 ALS 算法构建推荐模型。

本节代码可见：https://github.com/libaoquan95/TRS/tree/master/Analyse/recommend

数据信息：https://github.com/libaoquan95/TRS/tree/master/Analyse/dataset

2.数据预处理

在用户数据(user.csv) 和用户-景点数据(user-attraction.csv) 中，用户标识和景点标识都使用了字符串进行表示，但在 Spark MLlib 提供的 ALS 算法中，要求这两者是整数类型，所以首先要对数据进行预处理，将其转化为整数。

对于 userName, 联立 user.csv 和 user-attraction.csv，将 user-attraction.csv 中的 userName 转化为 userId 即可。

对于 provinceId, 可以考虑将其编码，provinceId 格式为省份标识_省内景点编号，如 HK_100 标识使用在香港拍摄的照片聚类出的第 100 个景点。

编码方式很简单，首先将 _ 前的省份标识转化为数字，之后与 _ 后的数字连接即可。

编码与解码代码如下：

val provinceToCode = Map(

  "LN" -> "10",

  "ShanX" -> "11",

  "ZJ" -> "12",

  "CQ" -> "13",

  "HLJ" -> "14",

  "AH" -> "15",

  "SanX" -> "16",

  "SD" -> "17",

  "SH" -> "18",

  "XJ" -> "19",

  "HuN" -> "20",

  "GS" -> "21",

  "HeN" -> "22",

  "BJ" -> "23",

  "NMG" -> "24",

  "YN" -> "25",

  "JX" -> "26",

  "HuB" -> "27",

  "JL" -> "28",

  "NX" -> "29",

  "TJ" -> "30",

  "FJ" -> "31",

  "SC" -> "32",

  "TW" -> "33",

  "GX" -> "34",

  "GD" -> "35",

  "HeB" -> "36",

  "HaiN" -> "37",

  "Macro" -> "38",

  "XZ" -> "39",

  "GZ" -> "40",

  "JS" -> "41",

  "QH" -> "42",

  "HK" -> "43"

)

val codeToProvince = Map(

  "10" -> "LN",

  "11" -> "ShanX",

  "12" -> "ZJ",

  "13" -> "CQ",

  "14" -> "HLJ",

  "15" -> "AH",

  "16" -> "SanX",

  "17" -> "SD",

  "18" -> "SH",

  "19" -> "XJ",

  "20" -> "HuN",

  "21" -> "GS",

  "22" -> "HeN",

  "23" -> "BJ",

  "24" -> "NMG",

  "25" -> "YN",

  "26" -> "JX",

  "27" -> "HuB",

  "28" -> "JL",

  "29" -> "NX",

  "30" -> "TJ",

  "31" -> "FJ",

  "32" -> "SC",

  "33" -> "TW",

  "34" -> "GX",

  "35" -> "GD",

  "36" -> "HeB",

  "37" -> "HaiN",

  "38" -> "Macro",

  "39" -> "XZ",

  "40" -> "GZ",

  "41" -> "JS",

  "42" -> "QH",

  "43" -> "HK"

)

// 编码

def codeing(str: String): String = {

  var code: String = ""

  val Array(province, index) = str.split('_')

  code = provinceToCode(province) + index

  code

}

// 解码

def decodeing(str: String): String = {

  var decode: String = ""

  decode = codeToProvince(str(0).toString+str(1).toString) + "_"

  for (i <- 1 to str.length-1){

    decode += str(i).toString

  }

  decode

}

之后加载用户数据 user.scv，并去除头标题。

val dataDirBase = "..\\dataset\\"

val userIdToName = sc.read.

  textFile(dataDirBase + "user.csv").

  flatMap{ line =>

    var Array(userId, userName) = line.split(',')

    if(userId == "userId"){

      None

    } else {

      Some((userId, userName))

    }

  }.collect().toMap

val userNameToId = sc.read.

  textFile(dataDirBase + "user.csv").

  flatMap{ line =>

    var Array(userId, userName) = line.split(',')

    if(userId == "userId"){

      None

    } else {

      Some((userName, userId))

    }

  }.collect().toMap

转化 user-attraction 数据

val userAttractionDF = sc.read.

  textFile(dataDirBase + "user-attraction.csv").

  flatMap{ line =>

    val Array(userName, attractionId, count, rating) = line.split(',')

    if (userName == "userName"){

      None

    } else {

      Some((userNameToId(userName).toInt, codeing(attractionId).toInt, count.toInt))

    }

  }.toDF("user", "attraction", "count").cache()

3.建立推荐模型

Spark MLlib ALS 算法接受三元组矩阵数据，分别代表用户标识，景点标识，评分数据，其中用户标识，景点标识必须是整数。

ALS 是最小交替二乘的简称，是使用矩阵分解算法来填补稀疏矩阵，预测评分，具体参见矩阵分解在协同过滤推荐算法中的应用

经历过上面的步骤后，userAttractionDF 已经转化为适应 ALS 算法的数据。之后可以建立推荐模型了，将数据拆分为训练集和测试集，使用训练集训练模型。具体算法如下：

val Array(trainData, cvData) = userAttractionDF.randomSplit(Array(0.9, 0.1))

val model = new ALS().

  setSeed(Random.nextLong()).

  setImplicitPrefs(true).

  setRank(10).

  setRegParam(0.01).

  setAlpha(1.0).

  setMaxIter(5).

  setUserCol("user").

  setItemCol("attraction").

  setRatingCol("count").

  setPredictionCol("prediction").

  fit(trainData)

4.进行推荐

Spark MLlib ALS 一次只能对一个用户进行推荐，代码如下：

def recommendByUser(userId: Int, topN: Int): Array[String] = {

  val toRecommend = model.itemFactors.

    select($"id".as("attraction")).

    withColumn("user", lit(userId))

  val topRecommendations  = model.transform(toRecommend).

    select("attraction", "prediction").

    orderBy($"prediction".desc).

    limit(topN)

  val recommends = topRecommendations.select("attraction").as[Int].collect()

  recommends.map(line => decodeing(line.toString))

}

推荐效果如下：

5.评测系统

验证推荐模型的正确率

def testRecommend(): Unit ={

  val topN = 10

  val users = cvData.select($"user").distinct().collect().map(u => u(0))

  var hit = 0.0

  var rec_count = 0.0

  var test_count = 0.0

  for (i <- 0 to users.length-1) {

    val recs = recommendByUser(users(i).toString.toInt, topN).toSet

    val temp = cvData.select($"attraction").

      where($"user" === users(i).toString.toInt).

      collect().map(a => decodeing(a(0).toString)).

      toSet

    hit += recs.&(temp).size

    rec_count += recs.size

    test_count += temp.size

  }

  print ("正确率：" + (hit / rec_count))

  print ("召回率：" + (hit / test_count))

}