Project Walkthrough: Multi-Class Text Classification with PySpark
Original article: https://cloud.tencent.com/developer/article/1096712
Building on the original author's work, this post adds some notes and annotations on what I learned along the way.
TARGET: classify San Francisco crime records (San Francisco Crime Description) into 33 categories.
Source code and dataset: to be posted later.
1. Load the dataset
import time
from pyspark.sql import SQLContext
from pyspark import SparkContext
# Load the CSV file directly with the spark-csv package
sc = SparkContext()
sqlContext = SQLContext(sc)
data = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
inferschema='true').load('train.csv')
# Take a ~1% sample (withReplacement=False, fraction=0.01, seed=100) to cut runtime
data = data.sample(False, 0.01, 100)
print(data.count())
Result:
8703
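A side note: com.databricks.spark.csv is the external spark-csv package needed on Spark 1.x. On Spark 2.x and later the CSV reader is built in, and SparkSession replaces SQLContext as the entry point. A minimal equivalent load, assuming Spark 2.x+ (the app name is arbitrary):
from pyspark.sql import SparkSession
# Spark 2.x+ sketch: built-in CSV reader, then the same ~1% sample with seed 100
spark = SparkSession.builder.appName('sf-crime-classification').getOrCreate()
data = spark.read.csv('train.csv', header=True, inferSchema=True)
data = data.sample(False, 0.01, 100)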
1.1 Drop columns irrelevant to the task
# Drop the unneeded columns and show the first five rows
drop_list = ['Dates', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
data = data.select([column for column in data.columns if column not in drop_list])
data.show(5)
1.2 Inspect the schema
# Show the data schema with printSchema()
data.printSchema()
Result:
root
|-- Category: string (nullable = true)
|-- Descript: string (nullable = true)
1.3 Top 20 crime categories
# The 20 most frequent crime categories
from pyspark.sql.functions import col
data.groupBy('Category').count().orderBy(col('count').desc()).show()
Result:
+--------------------+-----+
| Category|count|
+--------------------+-----+
| LARCENY/THEFT| 1725|
| OTHER OFFENSES| 1230|
| NON-CRIMINAL| 962|
| ASSAULT| 763|
| VEHICLE THEFT| 541|
| DRUG/NARCOTIC| 494|
| VANDALISM| 447|
| WARRANTS| 406|
| BURGLARY| 347|
| SUSPICIOUS OCC| 295|
| MISSING PERSON| 284|
| ROBBERY| 225|
| FRAUD| 159|
| SECONDARY CODES| 124|
|FORGERY/COUNTERFE...| 109|
| WEAPON LAWS| 86|
| TRESPASS| 63|
| PROSTITUTION| 59|
| DISORDERLY CONDUCT| 54|
| DRUNKENNESS| 52|
+--------------------+-----+
only showing top 20 rows
1.4 Top 20 crime descriptions
# The 20 most frequent crime descriptions
data.groupBy('Descript').count().orderBy(col('count').desc()).show()
Result:
+--------------------+-----+
| Descript|count|
+--------------------+-----+
|GRAND THEFT FROM ...| 569|
| LOST PROPERTY| 323|
| BATTERY| 301|
| STOLEN AUTOMOBILE| 262|
|DRIVERS LICENSE, ...| 244|
|AIDED CASE, MENTA...| 223|
| WARRANT ARREST| 222|
|PETTY THEFT FROM ...| 216|
|SUSPICIOUS OCCURR...| 211|
|MALICIOUS MISCHIE...| 184|
| TRAFFIC VIOLATION| 168|
|THREATS AGAINST LIFE| 154|
|PETTY THEFT OF PR...| 152|
| FOUND PROPERTY| 138|
|MALICIOUS MISCHIE...| 138|
|ENROUTE TO OUTSID...| 121|
|GRAND THEFT OF PR...| 115|
|MISCELLANEOUS INV...| 101|
| DOMESTIC VIOLENCE| 99|
| FOUND PERSON| 98|
+--------------------+-----+
only showing top 20 rows
2. Tokenize the crime descriptions
2.1 Tokenize Descript: split into words, then remove stop words
The flow is similar to the scikit-learn version and has three steps:
1. regexTokenizer: split text into words using a regular expression
2. stopwordsRemover: remove stop words
3. countVectors: build term-frequency vectors
RegexTokenizer: splits documents into words based on a regex pattern
inputCol: input column name
outputCol: output column name
pattern: the regex to match; words are split on the matched text
CountVectorizer: builds term-frequency vectors
vocabSize: maximum vocabulary size
minDF: if a float, a term must appear in at least this fraction of documents to enter the vocabulary;
if an int, it must appear in at least this many documents
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
# Regex tokenizer
# inputCol: input column name
# outputCol: output column name
regexTokenizer = RegexTokenizer(inputCol='Descript', outputCol='words', pattern='\\W')
# Stop words
add_stopwords = ['http', 'https', 'amp', 'rt', 't', 'c', 'the']
stopwords_remover = StopWordsRemover(inputCol='words', outputCol='filtered').setStopWords(add_stopwords)
# Build term-frequency vectors
count_vectors = CountVectorizer(inputCol='filtered', outputCol='features', vocabSize=10000, minDF=5)
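As a quick sanity check (a small sketch added here, not in the original article), the two Transformer stages can be run on a couple of toy descriptions before fitting the full pipeline. CountVectorizer is an Estimator and needs fit(), so it is left out:
# Toy data purely for illustration
sample = sqlContext.createDataFrame(
    [('GRAND THEFT FROM LOCKED AUTO',), ('STOLEN AUTOMOBILE',)], ['Descript'])
tokenized = regexTokenizer.transform(sample)                 # adds the 'words' column
stopwords_remover.transform(tokenized).show(truncate=False)  # adds the 'filtered' column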
2.2 Index labels by frequency, with the most frequent label mapped to 0
StringIndexer
StringIndexer encodes a column of string labels into a column of indices ordered by label frequency: the most frequent label gets index 0.
In this example the labels are encoded as integers from 0 to 32, with the most frequent label encoded as 0.
Pipeline is a high-level DataFrame-based API that makes it easy to build and debug machine-learning pipelines by running multiple stages in sequence.
fit(): trains the pipeline on a DataFrame and returns a Transformer (the fitted model)
transform(): appends the computed columns (tokens, feature vectors, label indices) to the DataFrame
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
label_stringIdx = StringIndexer(inputCol='Category', outputCol='label')
pipeline = Pipeline(stages=[regexTokenizer, stopwords_remover, count_vectors, label_stringIdx])
# fit the pipeline to training documents
pipeline_fit = pipeline.fit(data)
dataset = pipeline_fit.transform(data)
dataset.show(5)
Result:
+---------------+--------------------+--------------------+--------------------+--------------------+-----+
| Category| Descript| words| filtered| features|label|
+---------------+--------------------+--------------------+--------------------+--------------------+-----+
| LARCENY/THEFT|GRAND THEFT FROM ...|[grand, theft, fr...|[grand, theft, fr...|(309,[0,2,3,4,6],...| 0.0|
| VEHICLE THEFT| STOLEN AUTOMOBILE|[stolen, automobile]|[stolen, automobile]|(309,[9,27],[1.0,...| 4.0|
| NON-CRIMINAL| FOUND PROPERTY| [found, property]| [found, property]|(309,[5,32],[1.0,...| 2.0|
|SECONDARY CODES| JUVENILE INVOLVED|[juvenile, involved]|[juvenile, involved]|(309,[67,218],[1....| 13.0|
| OTHER OFFENSES|DRIVERS LICENSE, ...|[drivers, license...|[drivers, license...|(309,[14,23,28,30...| 1.0|
+---------------+--------------------+--------------------+--------------------+--------------------+-----+
only showing top 5 rows
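To see which Category each numeric label stands for, the fitted StringIndexerModel can be inspected (a sketch, assuming it is the last stage of pipeline_fit as built above):
label_model = pipeline_fit.stages[-1]  # the fitted StringIndexerModel
# labels are ordered by descending frequency; labels[0] is the class encoded as 0.0
print(label_model.labels[:5])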
3. Train/test split
# Split into 70% training / 30% test, with seed 100 for reproducibility
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print('Training Dataset Count:{}'.format(trainingData.count()))
print('Test Dataset Count:{}'.format(testData.count()))
Result:
Training Dataset Count:6117
Test Dataset Count:2586
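One caveat worth flagging: randomSplit is not stratified, so with 33 imbalanced classes a rare class can land entirely in one split. A quick check (a sketch):
# Verify that both splits cover (roughly) all 33 classes
print(trainingData.select('label').distinct().count())
print(testData.select('label').distinct().count())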
4. Model training and evaluation
4.1 Logistic regression with term-frequency features
Score the model on the test set and look at the 10 predictions with the highest probabilities:
LogisticRegression: logistic regression model
maxIter: maximum number of iterations
regParam: regularization strength
elasticNetParam: elastic-net mixing parameter; 0 = pure L2 penalty, 1 = pure L1 penalty
start_time = time.time()
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
# Keep only the rows predicted as class 0
predictions.filter(predictions['prediction'] == 0) \
    .select('Descript', 'Category', 'probability', 'label', 'prediction') \
    .orderBy('probability', ascending=False) \
    .show(n=10, truncate=30)
Result:
+--------------------------+--------+------------------------------+-----+----------+
| Descript|Category| probability|label|prediction|
+--------------------------+--------+------------------------------+-----+----------+
| ARSON OF A VEHICLE| ARSON|[0.1194196587417514,0.10724...| 26.0| 0.0|
| ARSON OF A VEHICLE| ARSON|[0.1194196587417514,0.10724...| 26.0| 0.0|
| ARSON OF A VEHICLE| ARSON|[0.1194196587417514,0.10724...| 26.0| 0.0|
| ATTEMPTED ARSON| ARSON|[0.12978385966276762,0.1084...| 26.0| 0.0|
| CREDIT CARD, THEFT OF| FRAUD|[0.21637136655265077,0.0836...| 12.0| 0.0|
| CREDIT CARD, THEFT OF| FRAUD|[0.21637136655265077,0.0836...| 12.0| 0.0|
| CREDIT CARD, THEFT OF| FRAUD|[0.21637136655265077,0.0836...| 12.0| 0.0|
| CREDIT CARD, THEFT OF| FRAUD|[0.21637136655265077,0.0836...| 12.0| 0.0|
| CREDIT CARD, THEFT OF| FRAUD|[0.21637136655265077,0.0836...| 12.0| 0.0|
|ARSON OF A VACANT BUILDING| ARSON|[0.22897903829071928,0.0980...| 26.0| 0.0|
+--------------------------+--------+------------------------------+-----+----------+
only showing top 10 rows
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# predictionCol: name of the prediction column
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
# Evaluate on the test set (the default metric is weighted F1)
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)
Result:
0.9641817609126011
8.245999813079834
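A note on the score above: MulticlassClassificationEvaluator defaults to metricName='f1', so 0.9641 is a weighted F1 score rather than plain accuracy. On Spark 2.x+ accuracy can be requested explicitly (a sketch):
# Explicitly ask for accuracy instead of the default F1
acc_evaluator = MulticlassClassificationEvaluator(predictionCol='prediction',
                                                  metricName='accuracy')
print(acc_evaluator.evaluate(predictions))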
4.2 Logistic regression with TF-IDF features
from pyspark.ml.feature import HashingTF, IDF
start_time = time.time()
# numFeatures: number of hash buckets (the maximum feature dimension)
hashingTF = HashingTF(inputCol='filtered', outputCol='rawFeatures', numFeatures=10000)
# minDocFreq: ignore terms that appear in fewer than this many documents
idf = IDF(inputCol='rawFeatures', outputCol='features', minDocFreq=5)
pipeline = Pipeline(stages=[regexTokenizer, stopwords_remover, hashingTF, idf, label_stringIdx])
pipeline_fit = pipeline.fit(data)
dataset = pipeline_fit.transform(data)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lr_model = lr.fit(trainingData)
predictions = lr_model.transform(testData)
predictions.filter(predictions['prediction'] == 0).select('Descript', 'Category', 'probability', 'label', 'prediction').\
orderBy('probability', ascending=False).show(n=10, truncate=30)
Result:
+----------------------------+-------------+------------------------------+-----+----------+
| Descript| Category| probability|label|prediction|
+----------------------------+-------------+------------------------------+-----+----------+
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.865376337558355,0.018892...| 0.0| 0.0|
+----------------------------+-------------+------------------------------+-----+----------+
only showing top 10 rows
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)
Result:
0.9653361434618551
12.998999834060669
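Unlike CountVectorizer, the HashingTF stage used above builds no vocabulary: it hashes each token into one of numFeatures buckets, so it needs no fitting for the TF part, but feature indices cannot be mapped back to words and distinct tokens may collide. A toy illustration (hypothetical data, not from the article):
toy = sqlContext.createDataFrame([(['stolen', 'automobile'],)], ['filtered'])
hashingTF.transform(toy).select('rawFeatures').show(truncate=False)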
4.3 Cross-validation
Tune hyperparameters with cross-validation; here we tune the logistic regression model built on term-frequency features.
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
start_time = time.time()
pipeline = Pipeline(stages=[regexTokenizer, stopwords_remover, count_vectors, label_stringIdx])
pipeline_fit = pipeline.fit(data)
dataset = pipeline_fit.transform(data)  # rebuild the dataset with term-frequency features
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
# Build the parameter grid for cross-validation
# ParamGridBuilder: builds a grid of parameter combinations for grid-search model selection
# addGrid: adds a list of values to try for a given parameter
# regParam: regularization strength
# maxIter: maximum number of iterations
# numFeatures: feature dimension (only relevant when tuning HashingTF; commented out below)
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.3, 0.5])
             .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2])
             .addGrid(lr.maxIter, [10, 20, 50])
             # .addGrid(idf.numFeatures, [10, 100, 1000])
             .build())
# Five-fold cross-validation
# estimator: the estimator to cross-validate
# estimatorParamMaps: the parameter grid to search over
# evaluator: the evaluator used to select the best model
# numFolds: number of folds
cv = CrossValidator(estimator=lr,\
estimatorParamMaps=paramGrid,\
evaluator=evaluator,\
numFolds=5)
cv_model = cv.fit(trainingData)
predictions = cv_model.transform(testData)
# Model evaluation
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)
Result:
0.9807684755923513
368.97300004959106
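To see which grid point won, CrossValidatorModel exposes the per-combination metrics and the refit best model (a sketch):
print(cv_model.avgMetrics)        # mean CV metric for each parameter combination
best_lr = cv_model.bestModel      # LR model refit on all training data with the best params
print(best_lr.extractParamMap())  # includes the winning regParam, maxIter, etc.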
4.4 Naive Bayes
from pyspark.ml.classification import NaiveBayes
start_time = time.time()
# smoothing: additive (Laplace) smoothing parameter
nb = NaiveBayes(smoothing=1)
model = nb.fit(trainingData)
predictions = model.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
.select('Descript', 'Category', 'probability', 'label', 'prediction') \
.orderBy('probability', ascending=False) \
.show(n=10, truncate=30)
Result:
+----------------------+-------------+------------------------------+-----+----------+
| Descript| Category| probability|label|prediction|
+----------------------+-------------+------------------------------+-----+----------+
| PETTY THEFT BICYCLE|LARCENY/THEFT|[1.0,1.236977662838925E-20,...| 0.0| 0.0|
| PETTY THEFT BICYCLE|LARCENY/THEFT|[1.0,1.236977662838925E-20,...| 0.0| 0.0|
| PETTY THEFT BICYCLE|LARCENY/THEFT|[1.0,1.236977662838925E-20,...| 0.0| 0.0|
|GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...| 0.0| 0.0|
|GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...| 0.0| 0.0|
|GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...| 0.0| 0.0|
|GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...| 0.0| 0.0|
|GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...| 0.0| 0.0|
|GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...| 0.0| 0.0|
|GRAND THEFT PICKPOCKET|LARCENY/THEFT|[1.0,7.699728277574397E-24,...| 0.0| 0.0|
+----------------------+-------------+------------------------------+-----+----------+
only showing top 10 rows
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)
Result:
0.977432832447723
5.371000051498413
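Worth noting: the default modelType='multinomial' assumes non-negative features, which the term-frequency vectors here satisfy. The equivalent explicit construction (a sketch):
nb = NaiveBayes(smoothing=1.0, modelType='multinomial')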
4.5 Random forest
from pyspark.ml.classification import RandomForestClassifier
start_time = time.time()
# numTrees: number of trees to train
# maxDepth: maximum tree depth
# maxBins: maximum number of bins when discretizing continuous features
rf = RandomForestClassifier(labelCol='label', \
featuresCol='features', \
numTrees=100, \
maxDepth=4, \
maxBins=32)
# Train model with Training Data
rfModel = rf.fit(trainingData)
predictions = rfModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
.select('Descript','Category','probability','label','prediction') \
.orderBy('probability', ascending=False) \
    .show(n=10, truncate=30)
Result:
+----------------------------+-------------+------------------------------+-----+----------+
| Descript| Category| probability|label|prediction|
+----------------------------+-------------+------------------------------+-----+----------+
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
|PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.33206188381818563,0.1168...| 0.0| 0.0|
+----------------------------+-------------+------------------------------+-----+----------+
only showing top 10 rows
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction')
print(evaluator.evaluate(predictions))
end_time = time.time()
print(end_time - start_time)
Result:
0.27929770811242954
36.63699984550476
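For a peek at why the forest struggles, its feature importances can be mapped back to vocabulary terms (a hedged sketch; it assumes pipeline_fit is still the term-frequency pipeline fit in section 4.3, with the CountVectorizerModel at stage index 2):
vocab = pipeline_fit.stages[2].vocabulary        # CountVectorizerModel vocabulary
weights = rfModel.featureImportances.toArray()   # importance per feature index
top10 = sorted(zip(vocab, weights), key=lambda p: -p[1])[:10]
print(top10)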
The results show that random forests are excellent, robust general-purpose models, but they are not a good choice for high-dimensional sparse data like these text features.
Clearly, the cross-validated logistic regression model is the one to pick.
One caveat: cross-validation makes training considerably slower, so in a real application choose the model that best fits the business constraints.
题目传送门 [题目大意] 一个口袋里装了t种颜色的球,第i种颜色的球的数目为a[i],每次随机抽一个小球,然后再放d个这种颜色的小球进口袋. 给出n个要求,第x个抽出的球颜色为y,求满足条件的概率. ...