1. 数据操作(Pandas)

1.1. pandas 对象

引入 pandas 等包,DataFrame、Series 属于常用的,所以直接引入

[1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

DataFrame 对象:Pandas DataFrame 是一个表格型的数据结构,有行索引也有列索引

[2]:
from IPython.display import Image
Image(filename='../../image/DataFrame.png', width=400)
[2]:
../_images/base_01_pandas_5_0.png

Series 对象:类似于一维数组的对象,由一组同样 type 的数组和索引组成

[3]:
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'

index 对象:即 Series 和 DataFrame 的索引

[4]:
# 获取索引
df = DataFrame(s1)
idx = s1.index
idx = df.columns # the column index
idx = df.index # the row index
[5]:
# 索引的一些特性
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexe
[6]:
# 索引的一些方法
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
# idx = idx.astype(dtype) # change data type
# b = idx.equals(other) # check for equality 看看是否是相同的索引
[7]:
# union of two indexes 合并两个索引
# idx = idx.union(other)

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])
idx1.union(idx2)
[7]:
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
[8]:
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

创建 Series 和 DataFrame

http://pandas.pydata.org/pandas-docs/stable/dsintro.html

1.2. DataFrame 入门

[2]:
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

DataFrame 的一些实用查看方法

[10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A    10 non-null float64
B    10 non-null float64
C    10 non-null float64
D    10 non-null float64
dtypes: float64(4)
memory usage: 392.0 bytes
[11]:
n=4
dfh = df.head(n) # 看前 n 行
[12]:
dft = df.tail(n) # 看后 n 行
[13]:
dfs = df.describe() # 各类统计信息
[14]:
top_left_corner_df = df.iloc[:5, :5]
[15]:
dfT = df.T # transpose rows and cols

DataFrame index 的一些特性

[16]:
l = df.axes # list row and col indexes
l
[16]:
[RangeIndex(start=0, stop=10, step=1),
 Index([u'A', u'B', u'C', u'D'], dtype='object')]
[17]:
(r, c) = df.axes # from above
[18]:
s = df.dtypes # Series column data types
s
[18]:
A    float64
B    float64
C    float64
D    float64
dtype: object
[19]:
b = df.empty # True for empty DataFrame
b
[19]:
False
[20]:
i = df.ndim # number of axes (2)
i
[20]:
2
[21]:
t = df.shape # (row-count, column-count)
t
[21]:
(10, 4)
[22]:
(r, c) = df.shape # from above
(r, c)
[22]:
(10, 4)
[23]:
i = df.size # row-count * column-count
i
[23]:
40
[24]:
a = df.values # get a numpy array for df

实用方法

[27]:
df = DataFrame([1, 23, 3, 5, 2])
[28]:
dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default) 把每个值的地位列出了
dfs = df.sort() # sort each col (default)
# dfc = df.astype(dtype) # type conversion
/Users/Scott/Library/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(....) is deprecated, use sort_index(.....)
  app.launch_new_instance()
[29]:
# 下面的两个方法没怎么搞懂
df.iteritems()# (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')
Col name: 0
First value: 1

通用函数

method ##
df = df.abs() absolute values
df = df.add(o) add df, Series or value
s = df.count() non NA/null values
df = df.cummax() (cols default axis)
df = df.cummin() (cols default axis)
df = df.cumsum() (cols default axis)
df = df.cumprod() (cols default axis)
df = df.diff() 1st diff (col def axis)
df = df.div(o) div by df, Series, value
df = df.dot(o) matrix dot product
s = df.max() max of axis (col def)
s = df.mean() mean (col default axis)
s = df.median() median (col default)
s = df.min() min of axis (col def)
df = df.mul(o) mul by df Series val
s = df.sum() sum axis (cols default)

1.3. DataFrame Columns 列处理

column 其实也是一个 Series

[2]:
df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
idx = df.columns # get col index
label = df.columns[0] # 1st col label
lst = df.columns.tolist() # get as a list
[31]:
lst
[31]:
['A', 'B', 'C', 'D']
[32]:
label
[32]:
'A'
[33]:
idx
[33]:
Index([u'A', u'B', u'C', u'D'], dtype='object')

column 改名

[34]:
# df.rename(columns={'old':'new'}, inplace=True)
# df = df.rename(columns={'a':1,'b':'x'})

选择 columns, 也就是提取列

[ ]:
s = df['C'] # select col to Series
df = df[['C']] # select col to df
df = df[['A','B']] # select 2 or more
df = df[['C', 'B', 'A']]# change order 改变排序了
s = df[df.columns[0]] # select by number
f = df[df.columns[[0, 3, 4]] # by number
s = df.pop('C') # get col & drop from df == df['C']

python 特性提取列

[4]:
s = df.A # same as s = df['A'],
# 但不能用 python 特性创建新的 columns
# df['new_col'] = df.a / df.b

添加新的 columns,添加一个 column 是极为方便的,只要能添加一组数据就行

[6]:
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
[8]:
df.head(2)
[8]:
A B C D new_col random index_as_col
0 0.458326 -1.402187 0.446208 -0.459079 NaN 0.920599 0
1 0.366833 0.618661 -0.727332 1.152775 NaN 0.503750 1

详情参考 df1[[‘b’,’c’]] = df2[[‘e’,’f’]] df3 = df1.append(other=df2)

判定函数 pd.Series.where

[17]:
# 符合 >0 条件的保持原值,其他 =0
df['A'] = df['A'].where(df['A']>0, other=0)
# df['d']=df['a'].where(df.b!=0,other=df.c)

数据格式 转换一列的格式时非常有用。

s = df[‘col’].astype(str) # Series dtype na = df[‘col’].values # numpy array pl = df[‘col’].tolist() # python list

columns 的一些特性和方法

value = df[‘col’].dtype # type of data value = df[‘col’].size # col dimensions value = df[‘col’].count()# non-NA count value = df[‘col’].sum() value = df[‘col’].prod() value = df[‘col’].min() value = df[‘col’].max() value = df[‘col’].mean() value = df[‘col’].median() value = df[‘col’].cov(df[‘col2’]) s = df[‘col’].describe() s = df[‘col’].value_counts()

找出最小值和最大值的位置

[33]:
df['B'].idxmax()
df['B'].idxmin()
[33]:
7

元素级方法

s = df[‘col’].isnull() s = df[‘col’].notnull() # not isnull() s = df[‘col’].astype(float) s = df[‘col’].round(decimals=0) s = df[‘col’].diff(periods=1) s = df[‘col’].shift(periods=1) s = df[‘col’].to_datetime() s = df[‘col’].fillna(0) # replace NaN w 0 s = df[‘col’].cumsum() s = df[‘col’].cumprod() s = df[‘col’].pct_change(periods=4) s = df[‘col’].rolling_sum(periods=4, window=4)

[42]:
df = df.mul(s, axis=0) # on matched rows,相当于 * other Series 每行都与之相乘
[47]:
df.columns.get_loc('B')
[47]:
0
[ ]:
df = df.iloc[:, 0:2] # exclusive

获取 columns 的具体位置

[153]:
df
[153]:
year state pop debt
one 2000 Ohino 1.5 NaN
two 2001 Ohino 1.7 NaN
three 2002 Ohino 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

下面那个好像没什么软用

[167]:
for i in ['pop', 'state']:
    print df.columns.get_loc(i)
2
1
[152]:
Series(df.columns)
[152]:
0     year
1    state
2      pop
3     debt
dtype: object

1.4. DataFrame rows 行处理

获取索引和标签

[3]:
idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list

改变索引或行名

[ ]:
df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},inplace=True)

添加行

待补。

Drop row 删除行

df = df.drop(‘row_label’) df = df.drop(row1) # multi-row df = df.drop([‘row1’,’row2’]) # multi-row

查找一些行

[23]:
# fake up some data
data = {1:[1,2,3], 2:[4,1,9], 3:[1,8,27]}
df = pd.DataFrame(data)
[4]:
# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] # 这里看不太懂

对行做排序

[ ]:
df_obj.sort(columns = ‘’)#按列名进行排序
df_obj.sort_index(by=[‘’,’’])#多列排序,使用时报该函数已过时,请用sort_values
df_obj.sort_values(by=['',''])同上

索引前奏

[122]:
df0 = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}, index=[3, 2, 1])
df1 = DataFrame([[1, 2, 3,], [3, 4, 5], [6, 7, 8]], index=[3, 2, 1])
[118]:
df0
[118]:
x y
3 1 3
2 2 4
1 3 5
[120]:
df0[1:2]
[120]:
x y
2 2 4
[17]:
df1
[17]:
0 1 2
3 1 2 3
2 3 4 5
1 6 7 8
[18]:
df1[0]
[18]:
3    1
2    3
1    6
Name: 0, dtype: int64
[19]:
df1[0:2]
[19]:
0 1 2
3 1 2 3
2 3 4 5
[21]:
df1.ix[:, 0:2]
[21]:
0 1 2
3 1 2 3
2 3 4 5
1 6 7 8
[124]:
df0[['x','y']]
[124]:
x y
3 1 3
2 2 4
1 3 5

1.5. 索引和切片

整数时一般是不包含的,非整数则会包含尾巴(基于 label)

[41]:
foo = DataFrame([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])
bar = DataFrame([4.5, 7.2, -5.3, 3.6], index=range(4))
[100]:
print(foo)
print '------'
print(bar)
     0
a  4.5
b  7.2
c -5.3
d  3.6
------
     0
0  4.5
1  7.2
2 -5.3
3  3.6
[99]:
print foo[:2]
print '------'
print bar[:2]
print '------'
print foo[:'c']
     0
a  4.5
b  7.2
------
     0
0  4.5
1  7.2
------
     0
a  4.5
b  7.2
c -5.3

ix[::, ::] 可以接受两套切片(axis=0)横向,(axis=1)列向

[2]:
data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}

df = DataFrame(data,index=['one','two','three','four','five'],
               columns=['year','state','pop','debt'])
[48]:
df
[48]:
year state pop debt
one 2000 Ohino 1.5 NaN
two 2001 Ohino 1.7 NaN
three 2002 Ohino 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
[81]:
df.ix[:, 'state':'pop']
[81]:
state pop
one Ohino 1.5
two Ohino 1.7
three Ohino 3.6
four Nevada 2.4
five Nevada 2.9
[84]:
df.ix[1] # 切的是行,所以说 ix 默认切的行, 也就是 axis=0
[84]:
year      2001
state    Ohino
pop        1.7
debt       NaN
Name: two, dtype: object

非 ix

ix 可以说是 pandas 的标准切法,而没有 ix 时,情况就略复杂些了,作者说:

  • 索引时,选取的是列
  • 切片时,选取的是行

记住一点,如果你想看单列或少数列的索引,那么直接用 df[‘column’], 其他就

[6]:
print(type(df['year']))
print(type(df[['year']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
[128]:
# df['one'] # 会报错,没办法这样索引,这是行
df[['year', 'state']] # 可运行
[128]:
year state
one 2000 Ohino
two 2001 Ohino
three 2002 Ohino
four 2001 Nevada
five 2002 Nevada
[91]:
df[0:1] # 切第一行,直接 df[0] 是会报错的。而 ix 不会。
[91]:
year state pop debt
one 2000 Ohino 1.5 NaN
[129]:
df['one':'two'] # 所以他也是可以整数切,也能标签切
[129]:
year state pop debt
one 2000 Ohino 1.5 NaN
two 2001 Ohino 1.7 NaN
[89]:
print(df.columns.tolist())
print(df.index.tolist())
['year', 'state', 'pop', 'debt']
['one', 'two', 'three', 'four', 'five']
[102]:
df.loc[:, 'year':'state']
[102]:
year state
one 2000 Ohino
two 2001 Ohino
three 2002 Ohino
four 2001 Nevada
five 2002 Nevada
[104]:
df.iloc[:, 1:2]
[104]:
state
one Ohino
two Ohino
three Ohino
four Nevada
five Nevada
  • .loc[label] 这是严格基于标签的索引
  • .iloc[inte] 这是严格基于整数位置的索引
  • .ix[] 更像是这两种严格方式的智能整合版。
[111]:
# df.loc[1:2] 用 label 的去切整数,自然会出错
# df.iloc['two':'three'] 也会出错
[109]:
df.loc['two':'three']
[109]:
year state pop debt
two 2001 Ohino 1.7 NaN
three 2002 Ohino 3.6 NaN
[105]:
df.iloc[1:2]
[105]:
year state pop debt
two 2001 Ohino 1.7 NaN

小结:

  • 尽量写两套切片,除非是索引单列则用 df[column]
  • 多用 ilocloc, 除非你很清晰的基于标签
[ ]:
# 待补:当标签和整数冲突时
[ ]:
df2.info() # 说明,ix在这种非整数的整数标签上,他的切片跟loc一样,是基于标签的,而另外两个刚好相反。