Data analysis: notes on commonly used NumPy and pandas APIs

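1. np.percentile(train_list["wnum1"], [10, 90, 95, 99]) computes any set of percentiles of an array. A multi-dimensional input is flattened unless axis is given. Percentiles refer to the values ranked in ascending order, so the 10th percentile is the value below which roughly 10% of the data falls.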
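A minimal sketch of np.percentile on made-up data (the array `values` is hypothetical, not the original `wnum1` column). It shows that the input need not be pre-sorted and that `axis` controls how a 2-D array is handled:

```python
import numpy as np

# Hypothetical sample data: the integers 1..100 in shuffled order.
rng = np.random.default_rng(0)
values = rng.permutation(np.arange(1, 101))

# The input does not have to be sorted; percentiles are taken over ranked values.
p10, p90, p95, p99 = np.percentile(values, [10, 90, 95, 99])
print(p10, p90, p95, p99)

# For a 2-D array, axis=0 gives one percentile per column.
matrix = values.reshape(10, 10)
column_medians = np.percentile(matrix, 50, axis=0)
print(column_medians.shape)
```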

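2. fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6)) creates the figure (the drawing canvas) and its two side-by-side subplots in a single call.

　　- When building a plot step by step, first create the figure: fig = plt.figure()
　　- Then add a subplot with ax = fig.add_subplot(rows, columns, position)
　　- Once that is done, draw with ax.plot() or df.plot(ax=ax)
　　- In Jupyter Notebook, enable plotting with the magic command %matplotlib notebook. A plain script does not need it, but the whole plotting sequence has to be written and run together in one go.
　　- Always finish with plt.show()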
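The steps above can be sketched as follows. The data points are made up, and the `Agg` backend is an assumption added only so the sketch runs without a display; in a notebook or desktop session that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 3))           # 1. create the canvas
ax_left = fig.add_subplot(1, 2, 1)         # 2. add subplots: (rows, columns, position)
ax_right = fig.add_subplot(1, 2, 2)
ax_left.plot([0, 1, 2, 3], [0, 1, 4, 9])   # 3. draw on a specific subplot
ax_right.plot([0, 1, 2, 3], [9, 4, 1, 0])
plt.show()                                 # 4. render; may only warn under Agg
```

plt.subplots(1, 2) does steps 1 and 2 in one call, which is why the original line unpacks both fig and (ax1, ax2).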

A detailed explanation of DataFrame.plot(): https://blog.csdn.net/brucewong0516/article/details/80524442

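3. train_list.plot(kind="hist", y=["wnum1", "wnum2"], bins=20, alpha=0.5, density=True, ax=ax1)

　　kind selects the plot type; "hist" draws a histogram.

　　bins is the number of bins (bars) in the histogram; it is set to 20 here, and the default is 10.

　　alpha is the opacity of the fill, from 0 (transparent) to 1 (opaque).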
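What bins and density=True mean can be checked numerically without drawing anything. The sketch below uses np.histogram, which matplotlib's histogram is built on, with made-up word counts (`lengths` is hypothetical). With density=True the bar heights are scaled so that the total area of the bars is 1.

```python
import numpy as np

rng = np.random.default_rng(42)
lengths = rng.integers(1, 40, size=1000)   # hypothetical word counts per question

counts, edges = np.histogram(lengths, bins=20)              # raw count per bin
density, _ = np.histogram(lengths, bins=20, density=True)   # normalized heights

widths = np.diff(edges)                    # 20 bins are bounded by 21 edges
total_area = float(np.sum(density * widths))
print(len(counts), len(edges), total_area)
```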

4. ax1.legend()

　　Displays the legend.

5. train_unique_q = np.concatenate([train_data["q2"].unique(), train_data["q1"].unique()])

　　Example:

　　>>> a=np.array([1,2,3])
　　>>> b=np.array([11,22,33])
　　>>> c=np.array([44,55,66])
　　>>> np.concatenate((a,b,c),axis=0) # axis=0 is the default and can be omitted
　　array([ 1, 2, 3, 11, 22, 33, 44, 55, 66]) # 1-D arrays have only axis 0, so axis=1 would raise an error
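The axis argument only starts to matter with two or more dimensions. A small sketch with two made-up 2x2 arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

stacked_rows = np.concatenate((a, b), axis=0)  # appends rows: shape (4, 2)
stacked_cols = np.concatenate((a, b), axis=1)  # appends columns: shape (2, 4)
print(stacked_rows.shape, stacked_cols.shape)
```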

6. from collections import Counter

　　train_question = np.concatenate([train_data["q1"], train_data["q2"]])
　　test_question = np.concatenate([test_data["q1"], test_data["q2"]])
　　train_counter = Counter(train_question)
　　test_counter = Counter(test_question)
　　print(train_counter.most_common(10))
　　print(test_counter.most_common(10))

Output:

[('Q489328', 112), ('Q119369', 109), ('Q632400', 109), ('Q382228', 108), ('Q081677', 107), ('Q555455', 107), ('Q149996', 105), ('Q436579', 105), ('Q143237', 105), ('Q424359', 104)]

[('Q066137', 34), ('Q209532', 31), ('Q526780', 31), ('Q665740', 29), ('Q405292', 28), ('Q092597', 28), ('Q195819', 26), ('Q150486', 26), ('Q263180', 26), ('Q492945', 25)]

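　　Counter is a tool for fast, convenient counting; most_common(n) returns the n most frequent items together with their counts.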
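A self-contained sketch of the same pattern on a tiny made-up list of question ids (the ids are hypothetical):

```python
from collections import Counter

question_ids = ["Q1", "Q2", "Q1", "Q3", "Q1", "Q2"]
counter = Counter(question_ids)

top_two = counter.most_common(2)   # list of (item, count), most frequent first
print(top_two)                     # [('Q1', 3), ('Q2', 2)]
print(counter["Q9"])               # missing keys count as 0 instead of raising
```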

7. words = questions["words"].str.split(" ").tolist()
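For illustration, assuming the "words" column holds space-separated tokens (the frame below is made up), .str.split(" ") turns each string into a list and .tolist() converts the column into a plain list of lists:

```python
import pandas as pd

questions = pd.DataFrame({"words": ["W1 W2 W3", "W4 W5"]})  # hypothetical data
words = questions["words"].str.split(" ").tolist()
print(words)  # [['W1', 'W2', 'W3'], ['W4', 'W5']]
```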

8. from gensim.corpora import Dictionary

　　word_dict = Dictionary(words)
　　char_dict = Dictionary(chars)

　　dictionary.id2token (id to token) result: {0: 'human', 1: 'interface', 2: 'computer', 3: 'survey', 4: 'user'}

　　dictionary.token2id (token to id) result: {'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, ..................}

　　dictionary.dfs (number of documents containing each token id) result: {0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2}

　　Reference blog post: https://blog.csdn.net/qq_19707521/article/details/79174533

9. char_df = pd.concat([char_series, char_prop], axis=1)
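A sketch of what axis=1 does here, under the assumption that char_series holds counts and char_prop the matching proportions (both are made up below): the two Series are aligned on their index and placed side by side as columns.

```python
import pandas as pd

# Hypothetical inputs: a count per character and its share of the total.
char_series = pd.Series([120, 80, 50], index=["C1", "C2", "C3"], name="count")
char_prop = (char_series / char_series.sum()).rename("prop")

# axis=1 joins column-wise, aligning rows on the shared index.
char_df = pd.concat([char_series, char_prop], axis=1)
print(char_df)
```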

10. train_list = train_data.merge(questions, how="left", left_on="q1", right_on="qid").drop("qid", axis=1)

11.

　　train_list = train_data.merge(questions, how="left", left_on="q1", right_on="qid").drop("qid", axis=1)
　　train_list.head()
　　train_list = train_list.rename(columns={"words": "w1", "chars": "c1"})
　　train_list = train_list.merge(questions, how="left", left_on="q2", right_on="qid").drop("qid", axis=1)
　　train_list = train_list.rename(columns={"words": "w2", "chars": "c2"})
　　train_list["wnum1"] = train_list["w1"].map(len)
　　train_list["cnum1"] = train_list["c1"].map(len)
　　train_list["wnum2"] = train_list["w2"].map(len)
　　train_list["cnum2"] = train_list["c2"].map(len)
　　train_list["wboth"] = train_list.apply(lambda x: len([t for t in x["w1"] if t in x["w2"]]), axis=1)
　　train_list["cboth"] = train_list.apply(lambda x: len([t for t in x["c1"] if t in x["c2"]]), axis=1)
　　train_list.head()
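The same merge, rename and feature pipeline can be tried on a toy dataset. The frames below are made up and only carry a "words" column; the real `questions` table also has "chars".

```python
import pandas as pd

# Hypothetical question table and training pairs.
questions = pd.DataFrame({
    "qid": ["Q1", "Q2", "Q3"],
    "words": [["W1", "W2"], ["W2", "W3", "W4"], ["W5"]],
})
train_data = pd.DataFrame({"label": [1, 0], "q1": ["Q1", "Q3"], "q2": ["Q2", "Q1"]})

# Attach the words of q1, then the words of q2, dropping the helper key each time.
train_list = (
    train_data.merge(questions, how="left", left_on="q1", right_on="qid")
    .drop("qid", axis=1)
    .rename(columns={"words": "w1"})
    .merge(questions, how="left", left_on="q2", right_on="qid")
    .drop("qid", axis=1)
    .rename(columns={"words": "w2"})
)

train_list["wnum1"] = train_list["w1"].map(len)   # length of question 1
train_list["wnum2"] = train_list["w2"].map(len)   # length of question 2
# Number of tokens of w1 that also occur in w2 (repeats in w1 are counted).
train_list["wboth"] = train_list.apply(
    lambda row: len([t for t in row["w1"] if t in row["w2"]]), axis=1
)
print(train_list)
```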

12. label = train_data["label"].values.copy() copies the values into a separate NumPy array, so later changes to label do not affect train_data.
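A short check on made-up data that the copy really is independent of the DataFrame:

```python
import pandas as pd

train_data = pd.DataFrame({"label": [0, 1, 1]})   # hypothetical labels
label = train_data["label"].values.copy()

label[0] = 99                       # modify only the copy
print(label.tolist())               # [99, 1, 1]
print(train_data["label"].tolist()) # [0, 1, 1], the DataFrame is unchanged
```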

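13. question_feature = total_question.value_counts().reset_index()

　　value_counts() counts how often each value occurs. Note that Counter (from collections import Counter) can do the same counting, but the return format differs: value_counts() returns a Series sorted by descending frequency, while Counter returns a dict-like object.

　　reset_index() resets the index to the default integer index; the old index (here, the distinct values) becomes a regular column.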
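A side-by-side sketch of the two counting approaches on made-up data. The column names produced by reset_index() depend on the pandas version, so the sketch avoids relying on them.

```python
from collections import Counter

import pandas as pd

total_question = pd.Series(["Q1", "Q2", "Q1", "Q3", "Q1", "Q2"])  # hypothetical

vc = total_question.value_counts()   # Series: value -> count, most frequent first
question_feature = vc.reset_index()  # DataFrame with two columns: value and count
counter = Counter(total_question)    # dict-like: value -> count

print(question_feature)
print(counter)
```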

14.

　　unique_question = total_question.drop_duplicates().reset_index(drop=True)
　　question_dict = pd.Series(unique_question.index,unique_question).to_dict()

　　Notes: drop_duplicates() removes duplicates, reset_index() resets the index (drop=True discards the old one), and Series.to_dict() converts the Series to a dictionary.
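On a tiny made-up Series the two lines build a mapping from each distinct question to a consecutive integer id, in order of first appearance:

```python
import pandas as pd

total_question = pd.Series(["Q7", "Q3", "Q7", "Q5", "Q3"])  # hypothetical

unique_question = total_question.drop_duplicates().reset_index(drop=True)
# The values of unique_question become the keys, its new 0..n-1 index the ids.
question_dict = pd.Series(unique_question.index, unique_question).to_dict()
print(question_dict)
```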

15. from keras.preprocessing.text import Tokenizer

　　char_tokenizer = Tokenizer()
　　char_tokenizer.fit_on_texts(question_data["chars"])
　　char_tokenizer.word_index

　　To use Tokenizer, first call fit_on_texts to learn the vocabulary from the texts. word_index is then the dict that maps each word to an integer, and with this dict every word of every string can be converted to a number.
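To make the idea concrete without requiring Keras, here is a simplified pure-Python sketch of the same mapping. It is not Keras's implementation: the helper names `build_word_index` and `texts_to_sequences` are hypothetical, and the sketch skips the lowercasing and punctuation filtering that Tokenizer applies by default. It does keep the convention that ids start at 1 and the most frequent token gets the smallest id.

```python
from collections import Counter


def build_word_index(texts):
    """Map each token to an integer id, most frequent token first, starting at 1."""
    counts = Counter(token for text in texts for token in text.split())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {token: rank for rank, (token, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known token with its id; unknown tokens are dropped."""
    return [[word_index[t] for t in text.split() if t in word_index] for text in texts]


chars = ["C1 C2 C3", "C2 C3", "C3"]   # hypothetical space-separated tokens
word_index = build_word_index(chars)
sequences = texts_to_sequences(chars, word_index)
print(word_index)
print(sequences)
```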

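16. word_count = sorted(list(word_tokenizer.word_counts.items()), key=lambda x: x[1], reverse=True)

　　lambda defines an anonymous function; it is a keyword, so it cannot be spelled any other way. x stands for one element of the list, here a (word, count) tuple, and the name x is arbitrary. x[0] is the first element of the tuple and x[1] the second, so this statement sorts the list by the second element of each tuple, the count.

　　items() returns the (key, value) pairs of the dict as an iterable view (a list in Python 2).

　　reverse=True sorts in descending order.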
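A small sketch contrasting the two sort keys on a made-up dict of word counts:

```python
word_counts = {"the": 5, "cat": 2, "sat": 3}   # hypothetical counts

# Sort by the second tuple element (the count), largest first.
by_count = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)
# Sort by the first tuple element (the word), alphabetically.
by_word = sorted(word_counts.items(), key=lambda pair: pair[0])

print(by_count)  # [('the', 5), ('sat', 3), ('cat', 2)]
print(by_word)   # [('cat', 2), ('sat', 3), ('the', 5)]
```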

17. data["word_same"] = data.apply(lambda x: len(set(x["words1"]).intersection(set(x["words2"]))), axis=1)

　　Explanation: let a and b be two lists.
　　　　1. Intersection

　　　　list(set(a).intersection(set(b)))

　　　　2. Union

　　　　list(set(a).union(set(b)))

　　　　3. Difference (elements in list a that are not in list b)

　　　　list(set(a).difference(set(b)))
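The three operations on two made-up token lists, using the equivalent operators &, | and -. Because sets discard repeats, the set-based overlap can be smaller than a list-comprehension count over the same lists.

```python
a = ["W1", "W2", "W3", "W2"]   # hypothetical tokens, "W2" appears twice
b = ["W2", "W3", "W4"]

common = set(a) & set(b)    # intersection
either = set(a) | set(b)    # union
only_a = set(a) - set(b)    # difference: in a but not in b

word_same = len(common)                               # distinct shared tokens
overlap_with_repeats = len([t for t in a if t in b])  # repeats in a are counted
print(common, either, only_a, word_same, overlap_with_repeats)
```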

18. word_embedding_data = pd.read_csv(WORD_EMBED_PATH, delimiter=" ", header=None, index_col=0)

　　The delimiter is a space, header=None means the file has no header row, and index_col=0 uses column 0 as the row index.
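A sketch of the same call on an in-memory stand-in for the embedding file. The two lines of `raw` are made up, assuming the file stores one token followed by its vector components per line:

```python
import io

import pandas as pd

# Hypothetical file content: token, then three vector components.
raw = "W1 0.1 0.2 0.3\nW2 0.4 0.5 0.6\n"

word_embedding_data = pd.read_csv(
    io.StringIO(raw), delimiter=" ", header=None, index_col=0
)
print(word_embedding_data)
print(word_embedding_data.loc["W2"].tolist())  # vector for token "W2"
```

With the token as the row index, looking up a vector is a plain .loc access.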
