The function pandas.pivot_table can be used to create spreadsheet-style pivot tables. It takes a number of arguments: data, a DataFrame object; values, a column or a list of columns to aggregate; index, a column, Grouper, or array of the same length as the data (or a list of them) to group by on the pivot table rows.
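For example, a minimal sketch (the sales data and column names here are made up for illustration):

import pandas as pd

# hypothetical data: one row per sale
df = pd.DataFrame({
    'region':  ['East', 'East', 'West', 'West'],
    'product': ['A', 'B', 'A', 'B'],
    'sales':   [100, 150, 200, 50],
})

# mean sales with region on the rows and product on the columns
table = pd.pivot_table(df, values='sales', index='region',
                       columns='product', aggfunc='mean')
print(table)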
pandas.DataFrame.groupby: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs). Group series using a mapper (dict or key function: apply the given function to each group and return the result as a series) or by a series of columns.
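A minimal groupby sketch, using the same kind of hypothetical sales data as above:

import pandas as pd

# hypothetical sales data
df = pd.DataFrame({'region': ['East', 'East', 'West', 'West'],
                   'sales':  [100, 150, 200, 50]})

# total sales per region; the result is a Series indexed by region
print(df.groupby('region')['sales'].sum())

# several aggregates at once
print(df.groupby('region')['sales'].agg(['count', 'sum', 'mean']))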
import pandas as pd
import numpy as np
Split-apply-combine: the MapReduce of big data. The most general-purpose GroupBy method is apply, which is the subject of the rest of this section. As illustrated in Figure 10-2, apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then concatenates the pieces back together.
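A small sketch of split-apply-combine with apply (again on hypothetical sales data):

import pandas as pd

df = pd.DataFrame({'region':  ['East', 'East', 'West', 'West'],
                   'product': ['A', 'B', 'A', 'B'],
                   'sales':   [100, 150, 200, 50]})

def top_sale(group):
    # keep the best-selling row within each group
    return group.sort_values('sales', ascending=False).head(1)

# apply splits df by region, runs top_sale on each piece,
# and glues the pieces back into one DataFrame
print(df.groupby('region').apply(top_sale))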
Groupby Count
# Party's Frequency of donations
nyc.groupby('Party')['contb receipt amt'].count()
The command returns a Series where the index is the name of a Party and the value is the number of donations from that Party. Note that the Series is ordered by the name of the Party, since groupby sorts the group keys by default.
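A runnable sketch with a tiny hypothetical stand-in for the nyc contribution data:

import pandas as pd

nyc = pd.DataFrame({
    'Party': ['DEM', 'REP', 'DEM', 'DEM', 'REP'],
    'contb receipt amt': [25.0, 100.0, 50.0, 10.0, 250.0],
})

counts = nyc.groupby('Party')['contb receipt amt'].count()
print(counts)                               # indexed and ordered by Party name
print(counts.sort_values(ascending=False))  # reorder by frequency instead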
map works only on a single Series. apply works on any one or more columns, or one or more rows, of a whole DataFrame, i.e. it can operate along either axis. Used on a single column, apply has the same effect as map; with multiple columns only apply will do. applymap applies a function to every element of the whole DataFrame. Map: it iterates over each element of a Series. df['column1'].map(lambda x: 10 + x) will add 10 to each element of column1.
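A sketch contrasting the three on a small hypothetical frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

plus_ten = df['a'].map(lambda x: x + 10)            # element-wise on one Series
same_thing = df['a'].apply(lambda x: x + 10)        # apply on one column behaves like map
col_sums = df.apply(lambda col: col.sum())          # apply down each column (axis=0)
row_sums = df.apply(lambda row: row.sum(), axis=1)  # apply across each row
all_plus_ten = df.applymap(lambda x: x + 10)        # element-wise on the whole DataFrame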
Deduplicated counts in Hive. The core point first: both approaches count in the map phase, but in the reduce phase distinct is handled by a single reducer, whereas group by can aggregate in parallel across many reducers, so group by is faster. You still see this often at work: many veterans deduplicate with distinct, which easily leads to data skew once the data volume is large. Thanks to Chong-ge for pointing this out last time. Anyone who uses Hive probably relies on deduplicated counts all the time, yet the performance of this deduplication rarely gets attention; once a table is very large, a simple count(distinct order_no) turns out to run extremely slowly compared with the equivalent group by rewrite.
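As a sketch of that rewrite in HiveQL (the orders table is hypothetical; order_no comes from the note above):

-- slow on a large table: the distinct is funneled through a single reducer
SELECT COUNT(DISTINCT order_no) FROM orders;

-- usually much faster: group by deduplicates in parallel, then the outer count is cheap
SELECT COUNT(*) FROM (
  SELECT order_no
  FROM orders
  GROUP BY order_no
) t;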