A First Look at pandas: Series / DataFrame
import numpy as np
import pandas as pd
pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops (array-oriented programming).
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data. -> pandas is one of the best options for tabular data (the author is rather good at talking it up)
Since becoming an open source project in 2010, pandas has matured into a quite large library that is applicable in a broad set of real-world use cases. -> it is widely used. The developer community has grown to over 800 distinct contributors, who have been helping build the project as they have used it to solve their day-to-day data problems. -> it solves a lot of everyday data-processing problems
Throughout the rest of the book, I use the following import convention for pandas:
import pandas as pd
# from pandas import Series, DataFrame
Thus, whenever you see pd in code, it is referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are frequently used:
from pandas import Series, DataFrame
To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.
Series
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data. -> A Series is like a one-dimensional NumPy array with an index.
obj = pd.Series([4, 7, -5, 3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N-1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively: -> access them through the values and index attributes.
obj.values
array([ 4, 7, -5, 3], dtype=int64)
obj.index # like range(4)
RangeIndex(start=0, stop=4, step=1)
Often it will be desirable to create a Series with an index identifying each data point with a label:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
"print the index"
obj2.index
d 4
b 7
a -5
c 3
dtype: int64
'print the index'
Index(['d', 'b', 'a', 'c'], dtype='object')
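Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values. -> select one or several elements by index label
"select a single element: [label]"
obj2['a']
"modify an element by direct assignment; the change is in-place"
obj2['d'] = 'cj'
"select several elements: [[labels]]; note that a missing label gives NaN, which is fairly robust"
obj2[['c', 'a', 'd', 'xx']]
'select a single element: [label]'
-5
'modify an element by direct assignment; the change is in-place'
'select several elements: [[labels]]; note that a missing label gives NaN, which is fairly robust'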
c:\python\python36\lib\site-packages\pandas\core\series.py:851: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self.loc[key]
c 3
a -5
d cj
xx NaN
dtype: object
"assigning to an element modifies the Series in place by default"
obj2
'assigning to an element modifies the Series in place by default'
d cj
b 7
a -5
c 3
dtype: object
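The FutureWarning printed for the lookup with the missing label 'xx' recommends .reindex() as the alternative, and in recent pandas releases that list lookup raises KeyError instead of returning NaN. A minimal sketch, assuming a fresh Series shaped like obj2 (the name s is only for illustration):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# reindex returns a new Series; labels missing from the index become NaN
picked = s.reindex(['c', 'a', 'd', 'xx'])
print(picked)
```

Because NaN is a float, the result is upcast to float64 even though the original values were integers.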
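Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers. -> to use several index labels, put them in a list and pass that list as the single indexer.
Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link: -> operate on it as on a NumPy array: boolean arrays, scalar multiplication, math functions, and so on.
"keep the elements of the Series greater than 0, with their index labels"
"restore the data first; strings cannot be compared with numbers"
obj2['d'] = 4
obj2[obj2 > 0]
"scalar arithmetic"
obj2 * 2
"call a NumPy function"
"cj thinks .values is needed to strip the index, otherwise an error is raised"
np.exp(obj.values)
'keep the elements of the Series greater than 0, with their index labels'
'restore the data first; strings cannot be compared with numbers'
d 4
b 7
c 3
dtype: object
'scalar arithmetic'
d 8
b 14
a -10
c 6
dtype: object
'call a NumPy function'
'cj thinks .values is needed to strip the index, otherwise an error is raised'
array([5.45981500e+01, 1.09663316e+03, 6.73794700e-03, 2.00855369e+01])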
"cj test"
obj2 > 0
np.exp(obj2)
'cj test'
d True
b True
a False
c True
dtype: bool
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-86002a981278> in <module>
2 obj2 > 0
3
----> 4 np.exp(obj2)
AttributeError: 'int' object has no attribute 'exp'
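The AttributeError above most likely comes from the dtype rather than from the index: once the string 'cj' had been stored in obj2, the Series became object dtype, and NumPy ufuncs such as np.exp do not work on plain Python objects. A small sketch, assuming an object-dtype Series built directly (the names s_obj and s_num are illustrative), shows that converting back to a numeric dtype lets np.exp run while keeping the index:

```python
import numpy as np
import pandas as pd

# An object-dtype Series, similar to obj2 after a string was assigned into it
s_obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'], dtype=object)

# Convert back to integers, then apply the ufunc to the Series itself
s_num = s_obj.astype('int64')
result = np.exp(s_num)
print(result)
```

The result is still a Series with the labels d, b, a, c, so .values is not needed to call a NumPy function.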
Another way to think about a Series is as a fixed-length, ordered dict, as it's a mapping of index values to data values. -> (a Series can be seen as an ordered dict mapping: the keys are the index, the values are the data.) It can be used in many contexts where you might use a dict:
"as with a dict, iteration and membership tests work on the keys by default"
'b' in obj2
'xxx' in obj2
'as with a dict, iteration and membership tests work on the keys by default'
True
False
Should you have data contained in a Python dict, you can create a Series from it by passing the dict: -> a Python dict converts directly to a Series; the keys become the index.
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
"a dict converts directly to a Series"
obj3 = pd.Series(sdata)
obj3
'a dict converts directly to a Series'
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
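# cj test
"nested dicts are accepted too, but only the top level is shown as structure"
cj_data = {'Ohio':{'sex':1, 'age':18}, 'Texas':{'cj':123}}
pd.Series(cj_data)
'nested dicts are accepted too, but only the top level is shown as structure'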
Ohio {'sex': 1, 'age': 18}
Texas {'cj': 123}
dtype: object
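When you are only passing a dict, the index in the resulting Series follows the order of the dict's keys, as the output above shows (older pandas versions sorted the keys). You can override this by passing the dict keys in the order you want them to appear in the resulting Series: -> with a dict, the default index is its keys; passing an explicit index gives whatever result we want:
"override the original index"
states = ['California', 'Ohio', 'Oregon', 'Texas']
"matching labels take their values from the dict; labels with no match show up as NA"
obj4 = pd.Series(sdata, index=states)
obj4
'override the original index'
'matching labels take their values from the dict; labels with no match show up as NA'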
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
Here, three values found in sdata were placed in the appropriate locations (matching labels are filled in), but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.
I will use the terms 'missing' or 'NA' interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:
"pd.isnull() and pd.notnull() detect missing values"
pd.isnull(obj4)
"the opposite check"
pd.notnull(obj4)
"Series also has these as instance methods:"
obj4.notnull()
'pd.isnull() and pd.notnull() detect missing values'
California True
Ohio False
Oregon False
Texas False
dtype: bool
'the opposite check'
California False
Ohio True
Oregon True
Texas True
dtype: bool
'Series also has these as instance methods:'
California False
Ohio True
Oregon True
Texas True
dtype: bool
I discuss working with missing data in more detail in Chapter 7.
A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. -> In arithmetic, Series align on index labels automatically: entries with the same label are matched up. This point is key.
obj3
obj4
"obj3 + obj4: values with the same label are added; labels present on only one side give NaN"
obj3 + obj4
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
'obj3 + obj4: values with the same label are added; labels present on only one side give NaN'
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation. -> (data alignment resembles a database join: inner, left, right)
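To make the join analogy concrete, here is a small sketch of label alignment with two made-up Series (the names a and b and their values are illustrative). Plain addition behaves like an outer join on the labels, and the add method with fill_value is one way to avoid NaN for labels that exist on only one side:

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'California': 1000.0, 'Ohio': 35000.0, 'Texas': 71000.0})

# + aligns on the union of labels; a label missing on either side gives NaN
total = a + b

# add() with fill_value=0 treats a label missing on one side as 0
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```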
Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality: -> (the name attribute ties into other parts of pandas)
"set the name of the values: obj4.name='xxx'"
obj4.name = 'population'
"set the name of the index: obj4.index.name = 'xxx'"
obj4.index.name = 'state'
obj4
"set the name of the values: obj4.name='xxx'"
"set the name of the index: obj4.index.name = 'xxx'"
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
A Series's index can be altered in-place by assignment. -> the index can be changed in place by assignment
obj
"obj.index = 'xxx' changes the index in place; a length mismatch raises an error"
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
"obj.index = 'xxx' changes the index in place; a length mismatch raises an error"
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). -> (each column can hold a different data type) The DataFrame has both a row and a column index (the row index is index, the column index is columns).
It can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame's internals are outside the scope of this book.
While a DataFrame is physically two-dimensional, you can use it to represent higher dimensional data in a tabular format using hierarchical indexing, a subject we will discuss in Chapter 8 and an ingredient in some of the more advanced data-handling features in pandas. -> hierarchical indexing handles multidimensional data; it and other advanced features for higher-dimensional data come later.
There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays. -> (the most common way to build a DataFrame is to pass a dict of equal-length lists or arrays)
data = {
'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
The resulting DataFrame will have its index assigned automatically as with Series, and the columns are placed in the order of the dict's keys, as the output below shows (older pandas versions sorted them):
frame
| | state | year | pop |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table.
For large DataFrames, the head method selects only the first five rows: -> df.head() shows the first 5 rows by default
frame.head()
| | state | year | pop |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
If you specify a sequence of columns, the DataFrame's columns will be arranged in that order: -> specify the column order
"arrange the columns in the given order"
pd.DataFrame(data, columns=['year', 'state', 'pop'])
'arrange the columns in the given order'
| | year | state | pop |
---|---|---|---|
0 | 2000 | Ohio | 1.5 |
1 | 2001 | Ohio | 1.7 |
2 | 2002 | Ohio | 3.6 |
3 | 2001 | Nevada | 2.4 |
4 | 2002 | Nevada | 2.9 |
5 | 2003 | Nevada | 3.2 |
If you pass a column that isn't contained in the dict, it will appear with missing values in the result:
frame2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five', 'six'])
"a column that is not in the data is created and filled with NaN"
frame2
"an index that does not match raises an error; frame.columns shows the column index"
frame2.columns
'a column that is not in the data is created and filled with NaN'
| | year | state | pop | debt |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | NaN |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
six | 2003 | Nevada | 3.2 | NaN |
'an index that does not match raises an error; frame.columns shows the column index'
Index(['year', 'state', 'pop', 'debt'], dtype='object')
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute: -> (bracket indexing by column name, or df.column_name)
"bracket indexing: [column name]"
frame2['state']
"attribute access: df.column_name"
frame2.state
'bracket indexing: [column name]'
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
'attribute access: df.column_name'
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
Attribute-like access (e.g., frame2.year) and tab completion of column names in IPython are provided as a convenience. -> selecting a column by attribute is quite handy.
frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name.
Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.
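A short sketch of where attribute access falls short, using a made-up frame (the name df, the 'birth rate' column and its values are illustrative, not from the text). Besides names that are not valid identifiers, a column such as 'pop' also clashes with an existing DataFrame method, so only bracket notation reaches it:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                   'pop': [1.5, 2.4],
                   'birth rate': [12.1, 11.3]})

state_col = df.state          # fine: valid identifier with no name clash
pop_col = df['pop']           # df.pop is the DataFrame.pop method, not the column
rate_col = df['birth rate']   # the space rules out attribute access

print(callable(df.pop))       # the attribute resolves to a method
print(pop_col.name, rate_col.name)
```

Bracket notation is therefore the safer habit in scripts, with attribute access kept for interactive exploration.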
Rows can also be retrieved by position or name with the special loc attribute (much more on this later): -> the loc attribute selects rows
"select the row labeled three: loc[index]"
frame2.loc['three']
"select the second and third rows: frame.loc[1:2]"
frame.loc[1:2]
'select the row labeled three: loc[index]'
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
'select the second and third rows: frame.loc[1:2]'
| | state | year | pop |
---|---|---|---|
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
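Note that frame.loc[1:2] returned two rows: loc works with labels and includes both ends of a slice. Its positional counterpart is iloc, which follows the usual Python rule of excluding the end. A minimal sketch with made-up frames (the names labeled and plain are illustrative):

```python
import pandas as pd

labeled = pd.DataFrame({'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6]},
                       index=['one', 'two', 'three'])

by_label = labeled.loc['two']     # the row whose label is 'two'
by_position = labeled.iloc[1]     # the second row, counted from 0

# With the default integer index the two slicing rules differ visibly
plain = pd.DataFrame({'year': [2000, 2001, 2002, 2003]})
label_slice = plain.loc[1:2]      # labels 1 and 2, end included
position_slice = plain.iloc[1:2]  # position 1 only, end excluded

print(len(label_slice), len(position_slice))
```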
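Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values: -> values are changed in place
frame2['debet'] = 16.5
"the whole column is set in place (the key here is spelled debet, so a new column is added and debt stays NaN)"
frame2
'the whole column is set in place (the key here is spelled debet, so a new column is added and debt stays NaN)'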
| | year | state | pop | debt | debet |
---|---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN | 16.5 |
two | 2001 | Ohio | 1.7 | NaN | 16.5 |
three | 2002 | Ohio | 3.6 | NaN | 16.5 |
four | 2001 | Nevada | 2.4 | NaN | 16.5 |
five | 2002 | Nevada | 2.9 | NaN | 16.5 |
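six | 2003 | Nevada | 3.2 | NaN | 16.5 |
"modified in place, aligned automatically"
frame2['debet'] = np.arange(6)
"drop the debt column: axis=1 means columns, inplace=True drops in place"
frame2.drop(labels='debt', axis=1, inplace=True)
frame2
'modified in place, aligned automatically'
'drop the debt column: axis=1 means columns, inplace=True drops in place'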
| | year | state | pop | debet |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 0 |
two | 2001 | Ohio | 1.7 | 1 |
three | 2002 | Ohio | 3.6 | 2 |
four | 2001 | Nevada | 2.4 | 3 |
five | 2002 | Nevada | 2.9 | 4 |
six | 2003 | Nevada | 3.2 | 5 |
frame2.columns
Index(['year', 'state', 'pop', 'debet'], dtype='object')
frame2['debet']
one 0
two 1
three 2
four 3
five 4
six 5
Name: debet, dtype: int32
When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame (the inserted data has to line up in length). If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
"aligned automatically by index"
frame2['debet'] = val
frame2
'aligned automatically by index'
| | year | state | pop | debet |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | -1.2 |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | -1.5 |
five | 2002 | Nevada | 2.9 | -1.7 |
six | 2003 | Nevada | 3.2 | NaN |
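The difference between the two assignment rules can be sketched as follows, with a made-up three-row frame (the names df and error are illustrative). A list of the wrong length is rejected, while a shorter Series is accepted because it is aligned by label:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

error = None
try:
    df['debt'] = [1.0, 2.0]   # two values for three rows
except ValueError as exc:
    error = str(exc)          # pandas reports the length mismatch

# A Series is matched on labels, so rows without a matching label get NaN
df['debt'] = pd.Series([1.0], index=['two'])

print(error)
print(df)
```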
Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dict. -> del removes a column
As an example of del, I first add a new column of boolean values where the state column equals 'Ohio':
frame2['eastern'] = frame2.state == 'Ohio'
"first add a new column, eastern"
frame2
"then delete that column with the del keyword"
del frame2['eastern']
"the column names show that eastern is gone; the drop() method works too"
frame2.columns
'first add a new column, eastern'
| | year | state | pop | debet | eastern |
---|---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN | True |
two | 2001 | Ohio | 1.7 | -1.2 | True |
three | 2002 | Ohio | 3.6 | NaN | True |
four | 2001 | Nevada | 2.4 | -1.5 | False |
five | 2002 | Nevada | 2.9 | -1.7 | False |
six | 2003 | Nevada | 3.2 | NaN | False |
'then delete that column with the del keyword'
'the column names show that eastern is gone; the drop() method works too'
Index(['year', 'state', 'pop', 'debet'], dtype='object')
The column returned from indexing a DataFrame is a view on the underlying data, not a copy (a view, so changes are in-place). Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series's copy method. -> copy the column explicitly, otherwise you are working on a view.
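Whether a change to the returned column writes through to the DataFrame depends on the pandas version (newer releases move to Copy-on-Write semantics), so an explicit copy is the dependable way to get an independent column. A minimal sketch with a made-up frame (the names df and col are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

col = df['pop'].copy()   # an independent copy of the column
col['one'] = 99.0        # only the copy changes

print(df.loc['one', 'pop'])   # the DataFrame still holds 1.5
print(col['one'])
```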
Another common form of data is a nested dict of dicts:
pop = {
'Nevada': {2001:2.4, 2002:2.9},
'Ohio': {2000:1.5, 2001:1.7, 2002:3.6}
}
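If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices: -> (with one level of nesting, the outer keys become the columns and the inner keys become the index)
frame3 = pd.DataFrame(pop)
"outer dict keys become the columns, the inner dicts' keys become the index"
frame3
'outer dict keys become the columns, the inner dicts' keys become the index'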
| | Nevada | Ohio |
---|---|---|
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:
"transpose"
frame3.T
'transpose'
| | 2000 | 2001 | 2002 |
---|---|---|---|
Nevada | NaN | 2.4 | 2.9 |
Ohio | 1.5 | 1.7 | 3.6 |
The keys in the inner dicts (the index labels) are combined and sorted to form the index in the result. This isn't true if an explicit index is specified:
# pd.DataFrame(pop, index=('a', 'b','c'))
Dicts of Series are treated in much the same way.
pdata = {
'Ohio': frame3['Ohio'][:-1],
'Nevada': frame3['Nevada'][:2]
}
pd.DataFrame(pdata)
| | Ohio | Nevada |
---|---|---|
2000 | 1.5 | NaN |
2001 | 1.7 | 2.4 |
For a complete list of things you can pass to the DataFrame constructor, see Table 5-1.
If a DataFrame's index and columns have their name attributes set, these will also be displayed: -> set the name attribute of the row and column indexes
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state | Nevada | Ohio |
---|---|---|
year | ||
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray: -> the values attribute returns a 2D array
frame3.values
array([[nan, 1.5],
[2.4, 1.7],
[2.9, 3.6]])
If the DataFrame's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns.
"a dtype is chosen automatically to hold every column's type"
frame2.values
'a dtype is chosen automatically to hold every column's type'
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, nan],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, nan],
[2002, 'Nevada', 2.9, nan],
[2003, 'Nevada', 3.2, nan]], dtype=object)
Table 5-1 Possible data inputs to DataFrame constructor
- 2D ndarray A matrix of data, passing optional row and column labels
- ... (the rest can wait until it is needed)
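The one entry listed above, a 2D ndarray with optional labels, can be sketched like this (the array contents and the labels 'a', 'b', 'x', 'y', 'z' are made up for illustration):

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)   # a 2 x 3 matrix holding 0..5

# Row and column labels are optional keyword arguments
df = pd.DataFrame(arr, index=['a', 'b'], columns=['x', 'y', 'z'])

# Without labels, both axes get default integer labels
default = pd.DataFrame(arr)

print(df)
print(default)
```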
Index Objects
pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]
obj
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')
a 0
b 1
c 2
dtype: int64
Index objects are immutable and thus can't be modified by the user:
index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-a452e55ce13b> in <module>
----> 1 index[1] = 'd'
c:\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
2063
2064 def __setitem__(self, key, value):
-> 2065 raise TypeError("Index does not support mutable operations")
2066
2067 def __getitem__(self, key):
TypeError: Index does not support mutable operations
"the index is immutable"
index
'the index is immutable'
Index(['a', 'b', 'c'], dtype='object')
labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0 1.5
1 -2.5
2 0.0
dtype: float64
obj2.index is labels
True
Unlike Python sets, a pandas Index can contain duplicate labels.
Selections with duplicate labels will select all occurrences of that label.
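A pandas Index may hold the same label more than once, and selecting such a label returns every matching entry. A minimal sketch (the labels 'foo' and 'bar' and the names dup_labels and s are made up for illustration):

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
s = pd.Series([1, 2, 3, 4], index=dup_labels)

foo_values = s['foo']          # a Series holding both 'foo' entries

print(dup_labels.is_unique)    # False, since labels repeat
print(foo_values)
```

With a unique label the same lookup would return a single scalar, so code that relies on the result type should check is_unique first.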
Each Index has a number of methods and properties for set logic, which answer other common questions about the data it contains. Some useful ones are summarized in Table 5-2:
- append Concatenate with additional Index objects, producing a new index
- difference Compute set difference as Index
- intersection Compute set intersection
- union Compute set union
- isin Compute boolean array indicating whether each value is contained in the passed collection
- delete Compute new index with element at index i deleted
- drop Compute new index by deleting passed values
- insert Compute new index by inserting element at index i
- is_unique Return True if the index has no duplicate values
- unique Compute the array of unique values in the index.
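A few of the methods listed above in action, as a small sketch with two made-up indexes (the names left and right are illustrative). Each call returns a new object, which fits with Index being immutable:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.intersection(right)      # labels present in both
either = left.union(right)           # labels present in at least one
only_left = left.difference(right)   # labels present only in left
mask = left.isin(['a', 'c'])         # boolean array, one entry per label
inserted = left.insert(1, 'z')       # new Index; left itself is unchanged

print(list(both), list(either), list(only_left))
print(mask.tolist(), list(inserted), left.is_unique)
```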