pandas: Time Series Indexing

import numpy as np

import pandas as pd

Introduction

A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects:

from datetime import datetime

dates = [
    datetime(2011, 1, 2),
    datetime(2011, 1, 5),
    datetime(2011, 1, 7),
    datetime(2011, 1, 8),
    datetime(2011, 1, 10),
    datetime(2011, 1, 12),
]

ts = pd.Series(np.random.randn(6), index=dates)

ts

2011-01-02    0.825502

2011-01-05    0.453766

2011-01-07    0.077024

2011-01-08   -1.320742

2011-01-10   -1.109912

2011-01-12   -0.469907

dtype: float64

Under the hood, these datetime objects have been put in a DatetimeIndex:

ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
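The conversion described above can be checked directly. The following is a minimal sketch using made-up values (demo_ts and explicit_index are illustrative names, not part of the original example); it assumes pd.to_datetime builds the same index that the Series constructor creates implicitly.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Illustrative dates and values (not the random data used above)
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7)]
demo_ts = pd.Series(np.arange(3, dtype="float64"), index=dates)

# The plain datetime objects end up in a DatetimeIndex
print(type(demo_ts.index).__name__)

# Building the index explicitly gives an equal index
explicit_index = pd.to_datetime(dates)
print(demo_ts.index.equals(explicit_index))
```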

Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

ts + ts[::2]

2011-01-02    1.651004

2011-01-05         NaN

2011-01-07    0.154049

2011-01-08         NaN

2011-01-10   -2.219823

2011-01-12         NaN

dtype: float64

Recall that ts[::2] selects every second element in ts, so the dates it skips have no partner in the sum and come out as NaN.
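To make the alignment easier to follow, here is a small sketch with fixed values instead of random ones (demo_ts, every_other and total are illustrative names). It uses .iloc[::2] for the positional step, which selects the same elements as the [::2] slice above.

```python
import pandas as pd

idx = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07", "2011-01-08"])
demo_ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Every second element: keeps 2011-01-02 and 2011-01-07
every_other = demo_ts.iloc[::2]

# Addition matches values by date label, not by position
total = demo_ts + every_other
print(total)

# Dates present in only one operand produce NaN
print(total.isna().tolist())
```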

pandas stores timestamps using NumPy's datetime64 data type, here at nanosecond resolution:

ts.index.dtype

dtype('<M8[ns]')

Scalar values from a DatetimeIndex are pandas Timestamp objects:

stamp = ts.index[0]

stamp

Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information, if any (in pandas versions before 2.0, which removed Timestamp.freq), and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later.

(That is, the various conversion operations available for time series.)
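A short sketch of what "substituted anywhere" means in practice: a Timestamp is an instance of datetime and compares equal to the matching datetime. The time zone names below are arbitrary choices for illustration, and the snippet assumes the time zone database is available on the system.

```python
from datetime import datetime

import pandas as pd

stamp = pd.Timestamp("2011-01-02")

# Timestamp subclasses datetime, so isinstance checks pass
print(isinstance(stamp, datetime))

# It also compares equal to the corresponding datetime object
print(stamp == datetime(2011, 1, 2))

# Time zone handling: attach UTC, then view the same instant in Tokyo time
localized = stamp.tz_localize("UTC")
converted = localized.tz_convert("Asia/Tokyo")
print(converted)
```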

Indexing and Slicing

Time series behave like any other pandas.Series when you are indexing and selecting data based on label:

stamp = ts.index[2]

ts[stamp]

0.0770243257021936

As a convenience, you can also pass a string that is interpretable as a date:

ts['1/10/2011']

-1.109911691867437

ts['20110110']

-1.109911691867437
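The sketch below, using fixed values rather than the random data above, suggests that several spellings of the same date resolve to the same label. The names demo_ts and by_* are illustrative; note that an ambiguous string such as '1/10/2011' is read month first by default.

```python
import pandas as pd

idx = pd.to_datetime(["2011-01-08", "2011-01-10", "2011-01-12"])
demo_ts = pd.Series([10.0, 20.0, 30.0], index=idx)

by_slashes = demo_ts["1/10/2011"]      # month/day/year
by_compact = demo_ts["20110110"]       # YYYYMMDD
by_iso = demo_ts["2011-01-10"]         # ISO 8601
by_stamp = demo_ts[pd.Timestamp(2011, 1, 10)]

print(by_slashes, by_compact, by_iso, by_stamp)
```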

For longer time series, a year, or a year and month, can be passed to easily select slices of data:

longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))

longer_ts[:5]

2000-01-01    0.401394

2000-01-02    0.720214

2000-01-03    0.488505

2000-01-04    0.446179

2000-01-05   -2.129299

Freq: D, dtype: float64

longer_ts['2001'][:5]

2001-01-01    0.315472

2001-01-02    0.796386

2001-01-03    0.611503

2001-01-04    0.980799

2001-01-05    0.184401

Freq: D, dtype: float64

Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:

longer_ts['2001-05'][:5]

2001-05-01    0.439009

2001-05-02   -0.304236

2001-05-03    0.603268

2001-05-04   -0.726460

2001-05-05   -0.521669

Freq: D, dtype: float64
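As a sanity check on partial string selection, the following sketch counts how many rows a year and a month select from a daily series. It uses .loc and sequential integers (demo is an illustrative name) so the result does not depend on random data.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting 2000-01-01 (daily frequency is the default)
idx = pd.date_range("2000-01-01", periods=1000)
demo = pd.Series(np.arange(1000), index=idx)

year_2001 = demo.loc["2001"]       # every day in 2001
may_2001 = demo.loc["2001-05"]     # every day in May 2001

print(len(year_2001), len(may_2001))
print(year_2001.index[0], year_2001.index[-1])
```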

Slicing with datetime objects works as well:

ts[datetime(2011, 1, 7):]

2011-01-07    0.077024

2011-01-08   -1.320742

2011-01-10   -1.109912

2011-01-12   -0.469907

dtype: float64

Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:

ts

2011-01-02    0.825502

2011-01-05    0.453766

2011-01-07    0.077024

2011-01-08   -1.320742

2011-01-10   -1.109912

2011-01-12   -0.469907

dtype: float64

ts['1/6/2011': '1/11/2011']

2011-01-07    0.077024

2011-01-08   -1.320742

2011-01-10   -1.109912

dtype: float64
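A minimal sketch of the range query with deterministic values (demo_ts, window and inclusive are illustrative names). It also shows that label-based slices include both endpoints, and it assumes the index is sorted, as in the example above.

```python
import pandas as pd

idx = pd.to_datetime([
    "2011-01-02", "2011-01-05", "2011-01-07",
    "2011-01-08", "2011-01-10", "2011-01-12",
])
demo_ts = pd.Series(range(6), index=idx)

# Neither endpoint exists in the index; the slice returns what lies between
window = demo_ts.loc["2011-01-06":"2011-01-11"]
print(window)

# Endpoints that do exist are both included
inclusive = demo_ts.loc["2011-01-07":"2011-01-10"]
print(inclusive.equals(window))
```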

As before, you can pass a string date, a datetime, or a Timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays: no data is copied, and modifications to the slice are reflected in the original data. (With pandas' Copy-on-Write mode enabled, modifying the slice no longer changes the original.)

There is an equivalent instance method, truncate, that slices a Series between two dates:

ts.truncate(after='1/9/2011')

2011-01-02    0.825502

2011-01-05    0.453766

2011-01-07    0.077024

2011-01-08   -1.320742

dtype: float64
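To illustrate the equivalence, this sketch passes both before and after to truncate and compares the result with a label slice over the same bounds. The data and names (demo_ts, trimmed, sliced) are illustrative, and the index is assumed to be sorted.

```python
import pandas as pd

idx = pd.to_datetime([
    "2011-01-02", "2011-01-05", "2011-01-07",
    "2011-01-08", "2011-01-10", "2011-01-12",
])
demo_ts = pd.Series(range(6), index=idx)

# Drop everything before 2011-01-06 and after 2011-01-11
trimmed = demo_ts.truncate(before="2011-01-06", after="2011-01-11")

# The same selection expressed as a label slice
sliced = demo_ts.loc["2011-01-06":"2011-01-11"]

print(trimmed)
print(trimmed.equals(sliced))
```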

All of this holds true for DataFrame as well, indexing on its rows:

# periods: how many dates to generate; freq: the spacing between them
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])

long_df.loc['5-2001']

             Colorado      Texas   New York       Ohio
2001-05-02   0.972317   0.407519   0.628906   1.995901
2001-05-09   0.299961  -1.208505   1.019247   2.244728
2001-05-16   0.628163  -0.716498   0.621912   1.257635
2001-05-23   0.508852   0.753517  -0.793127   0.273496
2001-05-30  -1.443141  -0.878143  -0.680227   0.455401
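The row selection above can be reproduced with deterministic data. In this sketch (demo_df and may are illustrative names) the partial string goes through .loc, because plain demo_df['2001-05'] would be looked up as a column name rather than a row label.

```python
import numpy as np
import pandas as pd

# 100 Wednesdays starting from the first one on or after 2000-01-01
dates = pd.date_range("2000-01-01", periods=100, freq="W-WED")
demo_df = pd.DataFrame(
    np.arange(400).reshape(100, 4),
    index=dates,
    columns=["Colorado", "Texas", "New York", "Ohio"],
)

# Select all rows that fall in May 2001
may = demo_df.loc["2001-05"]
print(may.shape)
print(may.index.day.tolist())
```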

Time Series with Duplicate Indices

Key tools in this section: index.is_unique and groupby(level=0).

In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])

dup_ts = pd.Series(np.arange(5), index=dates)

dup_ts

2000-01-01    0

2000-01-02    1

2000-01-02    2

2000-01-02    3

2000-01-03    4

dtype: int32

We can tell that the index is not unique by checking its is_unique property:

dup_ts.index.is_unique

False
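Beyond the yes/no answer from is_unique, it can help to see which labels repeat. One possible approach, sketched here with illustrative names (demo_dup, dup_mask), is Index.duplicated with keep=False, which flags every occurrence of a repeated label.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex([
    "2000-01-01", "2000-01-02", "2000-01-02", "2000-01-02", "2000-01-03",
])
demo_dup = pd.Series(np.arange(5), index=dates)

print(demo_dup.index.is_unique)

# keep=False marks all occurrences of a repeated label, not just the later ones
dup_mask = demo_dup.index.duplicated(keep=False)
print(dup_mask.tolist())

# Rows whose timestamp appears more than once
print(demo_dup[dup_mask])
```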

Indexing into this time series will now produce either scalar values or slices, depending on whether a timestamp is duplicated:

dup_ts['1/3/2000']  # not duplicated

dup_ts['1/2/2000']  # duplicated

2000-01-02    1

2000-01-02    2

2000-01-02    3

dtype: int32
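Since a notebook shows only the last result of a cell, the two return types are easy to miss. This sketch (demo_dup, single and multiple are illustrative names) inspects both lookups through .loc.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex([
    "2000-01-01", "2000-01-02", "2000-01-02", "2000-01-02", "2000-01-03",
])
demo_dup = pd.Series(np.arange(5), index=dates)

single = demo_dup.loc["2000-01-03"]     # label occurs once
multiple = demo_dup.loc["2000-01-02"]   # label occurs three times

print(type(single).__name__, single)
print(type(multiple).__name__, len(multiple))
```

Code that must handle both cases can request a Series every time by passing a list of labels or a slice instead of a single label.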

Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:

grouped = dup_ts.groupby(level=0)  # level=0 groups by the index; groupby() with neither by nor level raises an error

grouped.mean()

2000-01-01    0

2000-01-02    2

2000-01-03    4

dtype: int32

grouped.count()

2000-01-01    1

2000-01-02    3

2000-01-03    1

dtype: int64
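For completeness, here is a self-contained sketch of the aggregation, plus one alternative that is not part of the original example: keeping only the first observation per timestamp instead of aggregating. The names demo_dup, grouped and first_only are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex([
    "2000-01-01", "2000-01-02", "2000-01-02", "2000-01-02", "2000-01-03",
])
demo_dup = pd.Series(np.arange(5), index=dates)

# Group rows that share the same index label (level 0 of the index)
grouped = demo_dup.groupby(level=0)
means = grouped.mean()
counts = grouped.count()
print(means)
print(counts)

# Alternative (an assumption about what you may want): drop later duplicates
first_only = demo_dup[~demo_dup.index.duplicated(keep="first")]
print(first_only)
```

Whether to aggregate or to drop duplicates depends on what the repeated observations represent; aggregation keeps information from every row, while dropping keeps only one.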