Tools - pandas

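The pandas library provides high-performance, easy-to-use data structures and data analysis tools. Its main data structure is the DataFrame, which you can think of as an in-memory 2D table, like a spreadsheet with column names and row labels. Many features available in Excel can be done programmatically, such as creating pivot tables, computing new columns from other columns, and plotting graphs. You can also group rows by column value, or join tables much as you would in SQL. pandas is also good at handling time series.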

Before diving into pandas, however, you need some NumPy basics. If you are not yet familiar with NumPy, see the NumPy quick start tutorial.

Importing pandas

import pandas as pd

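Series objects

The pandas library contains the following useful data structures:

Series objects: a 1D array, similar to a column in a spreadsheet.
DataFrame objects: a 2D table, similar to a spreadsheet (with row labels and column names).
Panel objects: can be thought of as a dictionary of DataFrame objects; rarely used. Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so it is not available in current releases.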

Creating a Series

s = pd.Series([2, -1, 3, 5])
s

Output:

0    2
1   -1
2    3
3    5
dtype: int64
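
The printed output above shows three things at once: the index (left column), the values (right column), and the dtype. As a minimal sketch, each part can also be inspected separately. The exact integer width may differ by platform, so the code only checks that the dtype is an integer type.

```python
import pandas as pd

s = pd.Series([2, -1, 3, 5])

# The default index is a RangeIndex counting from 0.
labels = list(s.index)        # [0, 1, 2, 3]

# The underlying data as a plain NumPy array.
values = s.to_numpy()

# dtype.kind is 'i' for signed integers, whatever their width.
is_integer = s.dtype.kind == 'i'

print(labels, values, is_integer)
```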

Similar to a 1D ndarray

Series objects behave much like one-dimensional ndarrays, so you can pass a Series to NumPy functions. The returned object is also a Series.

import numpy as np
np.exp(s)

Output:

0      7.389056
1      0.367879
2     20.085537
3    148.413159
dtype: float64
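
To make the point about return types concrete, here is a small sketch that checks what a NumPy function returns for a Series versus for the underlying array. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, -1, 3, 5])

# A NumPy ufunc applied to a Series returns a new Series with the same index.
squared = np.square(s)
print(type(squared).__name__)          # Series
print(squared.index.equals(s.index))   # True

# Convert explicitly with to_numpy() when a plain ndarray is wanted instead.
squared_array = np.square(s.to_numpy())
print(type(squared_array).__name__)    # ndarray
```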

Arithmetic operations are also supported and, as with ndarrays, they apply elementwise.

s + [1000, 2000, 3000, 4000]

Output:

0    1002
1    1999
2    3003
3    4005
dtype: int64

As in NumPy, if you add a single number to a Series, that number is added to every element of the Series. This is called broadcasting.

s + 1000

Output:

0    1002
1     999
2    1003
3    1005
dtype: int64

The same is true for all binary operations and conditional operations: as in NumPy, they are broadcast.

s < 0

Output:

0    False
1     True
2    False
3    False
dtype: bool
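
A boolean Series like the one above is most often used as a mask to select elements. The following is a minimal sketch of that usage; the names mask, negatives, and non_negatives are illustrative.

```python
import pandas as pd

s = pd.Series([2, -1, 3, 5])

# Comparing against a scalar broadcasts and yields a boolean Series.
mask = s < 0

# Indexing with the mask keeps the elements where it is True,
# along with their original index labels.
negatives = s[mask]

# ~ inverts the mask.
non_negatives = s[~mask]

print(negatives)
print(non_negatives)
```

Note that the selected elements keep their original labels (here 1 for the negative value), they are not renumbered from 0.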

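Index labels

Each item in a Series has an identifier called its index label. By default it is simply the position of the item in the Series (starting at 0), but you can also set the index labels manually.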

s2 = pd.Series([68, 83, 112, 68], index=['alice', 'bob', 'charles', 'darwin'])
s2

Output:

alice       68
bob         83
charles    112
darwin      68
dtype: int64

You can then use the Series just like a dictionary.

s2['bob']

Output:

83

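You can still access items by integer position, as in a regular array. Note that recent pandas releases deprecate this form when the index is not integer-based and emit a FutureWarning, so the iloc form shown further down is preferable.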

s2[1]

Output:

83

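To make it clear whether you are accessing by index label or by integer position, it is recommended to always use the loc attribute when accessing by label and the iloc attribute when accessing by integer position.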

s2.loc['bob']

Output:

83

s2.iloc[1]

Output:

83
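
The difference between loc and iloc matters most when the labels are themselves integers. The sketch below uses a hypothetical Series named ages (not part of the tutorial's data) whose integer labels do not match the positions.

```python
import pandas as pd

# Hypothetical data: integer labels 10, 20, 30 at positions 0, 1, 2.
ages = pd.Series([30, 40, 50], index=[10, 20, 30])

by_label = ages.loc[10]      # label 10    -> 30
by_position = ages.iloc[0]   # position 0  -> 30
last = ages.iloc[-1]         # last item   -> 50

# There is no label 0, so loc raises KeyError.
try:
    ages.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, last, label_zero_found)
```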

Slicing a Series also slices its index labels.

s2.iloc[1:3]

Output:

bob         83
charles    112
dtype: int64
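
One detail worth knowing when slicing: iloc slices exclude the end position, as in plain Python, whereas loc slices by label include the end label. A minimal sketch using the same s2 data:

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=['alice', 'bob', 'charles', 'darwin'])

# Positional slice: positions 1 and 2, end position 3 is excluded.
by_position = s2.iloc[1:3]

# Label slice: both 'bob' and 'charles' are included.
by_label = s2.loc['bob':'charles']

print(by_position.equals(by_label))   # True
```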

With the default numeric labels this can lead to unexpected results, so be careful.

surprise = pd.Series([1000, 1001, 1002, 1003])
surprise

Output:

0    1000
1    1001
2    1002
3    1003
dtype: int64

surprise_slice = surprise[2:]
surprise_slice

Output:

2    1002
3    1003
dtype: int64

In the slice above, the first element has index label 2. The element with index label 0 is not part of the slice.

try:
    surprise_slice[0]
except KeyError as e:
    print('Key error:', e)

Output:

Key error: 0

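Remember, though, that you can access elements by integer position using the iloc attribute. This is another reason why loc and iloc are recommended for accessing elements of a Series.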

surprise_slice.iloc[0]

Output:

1002
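
If you would rather have the slice renumbered from 0, one option (not covered in the text above) is reset_index with drop=True, which discards the old labels. A minimal sketch:

```python
import pandas as pd

surprise = pd.Series([1000, 1001, 1002, 1003])
surprise_slice = surprise.iloc[2:]     # keeps the original labels 2 and 3

# drop=True discards the old labels instead of keeping them as a column.
renumbered = surprise_slice.reset_index(drop=True)

print(list(surprise_slice.index))   # [2, 3]
print(list(renumbered.index))       # [0, 1]
print(renumbered.loc[0])            # 1002
```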

Initializing from a dict

You can create a Series from a dict. The keys of the dict are used as index labels.

weights = {'alice': 68, 'bob': 83, 'colin': 86, 'darwin': 68}
s3 = pd.Series(weights)
s3

Output:

alice     68
bob       83
colin     86
darwin    68
dtype: int64

You can control which elements are included in the Series, and in what order, by explicitly specifying the desired index.

s4 = pd.Series(weights, index=['colin', 'alice'])
s4

Output:

colin    86
alice    68
dtype: int64
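
A related case is an index that asks for a key the dict does not contain. In that situation pandas fills the entry with NaN and, because NaN is a float, the dtype becomes float64. The sketch below uses a hypothetical extra label 'eve' to show this.

```python
import pandas as pd

weights = {'alice': 68, 'bob': 83, 'colin': 86, 'darwin': 68}

# 'eve' is a hypothetical label that is not a key of the dict.
s = pd.Series(weights, index=['colin', 'alice', 'eve'])

print(s)
print(s.isna().tolist())   # [False, False, True]
```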

Automatic alignment

When an operation involves multiple Series objects, pandas automatically aligns the items by matching index labels.

print(s2.keys())
print(s3.keys())
s2 + s3

Output:

Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')
Index(['alice', 'bob', 'colin', 'darwin'], dtype='object')

alice      136.0
bob        166.0
charles      NaN
colin        NaN
darwin     136.0
dtype: float64

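The result contains the union of the index labels of s2 and s3. Because colin is missing from s2 and charles is missing from s3, those items have the value NaN (Not-a-Number, which here means missing).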
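
If NaN is not what you want for labels present on only one side, two common options are to treat the missing side as 0 with the add method's fill_value parameter, or to drop the incomplete rows afterwards. This is a sketch of both, not something the original text prescribes.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=['alice', 'bob', 'charles', 'darwin'])
s3 = pd.Series({'alice': 68, 'bob': 83, 'colin': 86, 'darwin': 68})

# Treat a label missing from one Series as 0 before adding.
total = s2.add(s3, fill_value=0)

# Or keep only the labels present in both Series.
common = (s2 + s3).dropna()

print(total)
print(common)
```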

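Automatic alignment is very handy when working with data that may come from sources with different structures and missing items. However, if you forget to set the right index labels, you can get surprising results.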

s5 = pd.Series([1000, 1000, 1000, 1000])
print('s2 = ', s2.values)
print('s5 = ', s5.values)

s2 + s5

Output:

s2 =  [ 68  83 112  68]
s5 =  [1000 1000 1000 1000]

alice     NaN
bob       NaN
charles   NaN
darwin    NaN
0         NaN
1         NaN
2         NaN
3         NaN
dtype: float64

In the result above, pandas could not align the two Series because their index labels do not match at all, so every entry is NaN.
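
When the two Series really are meant to be combined position by position, one way to avoid the all-NaN result is to bypass alignment, either by adding the raw values or by giving the second Series the same index. The following sketch shows both, assuming the positions do correspond.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=['alice', 'bob', 'charles', 'darwin'])
s5 = pd.Series([1000, 1000, 1000, 1000])

# Option 1: add the underlying array, so no label alignment takes place.
by_position = s2 + s5.to_numpy()

# Option 2: rebuild s5 with the labels of s2, then add as usual.
s5_relabelled = pd.Series(s5.to_numpy(), index=s2.index)
by_label = s2 + s5_relabelled

print(by_position)
print(by_position.equals(by_label))   # True
```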

Initializing with a scalar

You can also initialize a Series using a scalar and a list of index labels. All items are set to the scalar value.

meaning = pd.Series(42, index=['life', 'universe', 'everything'])
meaning

Output:

life          42
universe      42
everything    42
dtype: int64

Series name

A Series can also have a name.

s6 = pd.Series([83, 68], index=['bob', 'alice'], name='weights')
s6

Output:

bob      83
alice    68
Name: weights, dtype: int64
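
One place the name becomes visible is when a Series is converted to a DataFrame: the name is used as the column name. The sketch below also shows rename, which returns a copy with a different name. The new name 'weight_kg' is purely illustrative.

```python
import pandas as pd

s6 = pd.Series([83, 68], index=['bob', 'alice'], name='weights')

# The Series name becomes the column name of the resulting DataFrame.
frame = s6.to_frame()
print(list(frame.columns))   # ['weights']

# rename with a scalar returns a new Series; the original is unchanged.
renamed = s6.rename('weight_kg')
print(renamed.name, s6.name)   # weight_kg weights
```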

Plotting a Series

pandas makes it easy to plot Series data with matplotlib: just import matplotlib and call the Series' plot method.

# IPython/Jupyter magic; omit this line when running as a plain Python script
%matplotlib inline
import matplotlib.pyplot as plt
temperatures = [4.4, 5.1, 6.1, 6.2, 6.1, 6.1, 5.7, 5.2, 4.7, 4.1, 3.9, 3.5]
s7 = pd.Series(temperatures, name='Temperature')
s7.plot()
plt.show()