1. 数据操作（Pandas）¶

1.1. pandas 对象¶

引入 pandas 等包，DataFrame、Series 属于常用的，所以直接引入

[1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

DataFrame 对象：Pandas DataFrame 是一个表格型的数据结构，有行索引也有列索引

[2]:

from IPython.display import Image
Image(filename='../../image/DataFrame.png', width=400)

[2]:

Series 对象：类似于一维数组的对象，由一组同样 type 的数组和索引组成

[3]:

s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'

index 对象：即 Series 和 DataFrame 的索引

[4]:

# 获取索引
df = DataFrame(s1)
idx = s1.index
idx = df.columns # the column index
idx = df.index # the row index

[5]:

# 索引的一些特性
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexe

[6]:

# 索引的一些方法
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
# idx = idx.astype(dtype) # change data type
# b = idx.equals(other) # check for equality 看看是否是相同的索引

[7]:

# union of two indexes 合并两个索引
# idx = idx.union(other)

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])
idx1.union(idx2)

[7]:

Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

[8]:

i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

创建 Series 和 DataFrame

http://pandas.pydata.org/pandas-docs/stable/dsintro.html

1.2. DataFrame 入门¶

[2]:

df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

DataFrame 的一些实用查看方法

[10]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A    10 non-null float64
B    10 non-null float64
C    10 non-null float64
D    10 non-null float64
dtypes: float64(4)
memory usage: 392.0 bytes

[11]:

n=4
dfh = df.head(n) # 看前 n 行

[12]:

dft = df.tail(n) # 看后 n 行

[13]:

dfs = df.describe() # 各类统计信息

[14]:

top_left_corner_df = df.iloc[:5, :5]

[15]:

dfT = df.T # transpose rows and cols

DataFrame index 的一些特性

[16]:

l = df.axes # list row and col indexes
l

[16]:

[RangeIndex(start=0, stop=10, step=1),
 Index([u'A', u'B', u'C', u'D'], dtype='object')]

[17]:

(r, c) = df.axes # from above

[18]:

s = df.dtypes # Series column data types
s

[18]:

A    float64
B    float64
C    float64
D    float64
dtype: object

[19]:

b = df.empty # True for empty DataFrame
b

[19]:

False

[20]:

i = df.ndim # number of axes (2)
i

[20]:

[21]:

t = df.shape # (row-count, column-count)
t

[21]:

(10, 4)

[22]:

(r, c) = df.shape # from above
(r, c)

[22]:

(10, 4)

[23]:

i = df.size # row-count * column-count
i

[23]:

[24]:

a = df.values # get a numpy array for df

实用方法

[27]:

df = DataFrame([1, 23, 3, 5, 2])

[28]:

dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default) 把每个值的地位列出了
dfs = df.sort() # sort each col (default)
# dfc = df.astype(dtype) # type conversion

/Users/Scott/Library/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(....) is deprecated, use sort_index(.....)
  app.launch_new_instance()

[29]:

# 下面的两个方法没怎么搞懂
df.iteritems()# (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')

Col name: 0
First value: 1

通用函数

method	##
df = df.abs()	absolute values
df = df.add(o)	add df, Series or value
s = df.count()	non NA/null values
df = df.cummax()	(cols default axis)
df = df.cummin()	(cols default axis)
df = df.cumsum()	(cols default axis)
df = df.cumprod()	(cols default axis)
df = df.diff()	1st diff (col def axis)
df = df.div(o)	div by df, Series, value
df = df.dot(o)	matrix dot product
s = df.max()	max of axis (col def)
s = df.mean()	mean (col default axis)
s = df.median()	median (col default)
s = df.min()	min of axis (col def)
df = df.mul(o)	mul by df Series val
s = df.sum()	sum axis (cols default)

1.3. DataFrame Columns 列处理¶

column 其实也是一个 Series

[2]:

df = DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
idx = df.columns # get col index
label = df.columns[0] # 1st col label
lst = df.columns.tolist() # get as a list

[31]:

lst

[31]:

['A', 'B', 'C', 'D']

[32]:

label

[32]:

'A'

[33]:

idx

[33]:

Index([u'A', u'B', u'C', u'D'], dtype='object')

column 改名

[34]:

# df.rename(columns={'old':'new'}, inplace=True)
# df = df.rename(columns={'a':1,'b':'x'})

选择 columns, 也就是提取列

[ ]:

s = df['C'] # select col to Series
df = df[['C']] # select col to df
df = df[['A','B']] # select 2 or more
df = df[['C', 'B', 'A']]# change order 改变排序了
s = df[df.columns[0]] # select by number
f = df[df.columns[[0, 3, 4]] # by number
s = df.pop('C') # get col & drop from df == df['C']

用 python 特性提取列

[4]:

s = df.A # same as s = df['A'],
# 但不能用 python 特性创建新的 columns
# df['new_col'] = df.a / df.b

添加新的 columns，添加一个 column 是极为方便的，只要能添加一组数据就行

[6]:

df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index

[8]:

df.head(2)

[8]:

	A	B	C	D	new_col	random	index_as_col
0	0.458326	-1.402187	0.446208	-0.459079	NaN	0.920599	0
1	0.366833	0.618661	-0.727332	1.152775	NaN	0.503750	1

详情参考 df1[[‘b’,’c’]] = df2[[‘e’,’f’]] df3 = df1.append(other=df2)

判定函数 pd.Series.where

[17]:

# 符合 >0 条件的保持原值，其他 =0
df['A'] = df['A'].where(df['A']>0, other=0)
# df['d']=df['a'].where(df.b!=0,other=df.c)

数据格式 转换一列的格式时非常有用。

s = df[‘col’].astype(str) # Series dtype na = df[‘col’].values # numpy array pl = df[‘col’].tolist() # python list

columns 的一些特性和方法

value = df[‘col’].dtype # type of data value = df[‘col’].size # col dimensions value = df[‘col’].count()# non-NA count value = df[‘col’].sum() value = df[‘col’].prod() value = df[‘col’].min() value = df[‘col’].max() value = df[‘col’].mean() value = df[‘col’].median() value = df[‘col’].cov(df[‘col2’]) s = df[‘col’].describe() s = df[‘col’].value_counts()

找出最小值和最大值的位置

[33]:

df['B'].idxmax()
df['B'].idxmin()

[33]:

元素级方法

s = df[‘col’].isnull() s = df[‘col’].notnull() # not isnull() s = df[‘col’].astype(float) s = df[‘col’].round(decimals=0) s = df[‘col’].diff(periods=1) s = df[‘col’].shift(periods=1) s = df[‘col’].to_datetime() s = df[‘col’].fillna(0) # replace NaN w 0 s = df[‘col’].cumsum() s = df[‘col’].cumprod() s = df[‘col’].pct_change(periods=4) s = df[‘col’].rolling_sum(periods=4, window=4)

[42]:

df = df.mul(s, axis=0) # on matched rows,相当于 * other Series 每行都与之相乘

[47]:

df.columns.get_loc('B')

[47]:

[ ]:

df = df.iloc[:, 0:2] # exclusive

获取 columns 的具体位置

[153]:

df

[153]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN
two	2001	Ohino	1.7	NaN
three	2002	Ohino	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN

下面那个好像没什么软用

[167]:

for i in ['pop', 'state']:
    print df.columns.get_loc(i)

2
1

[152]:

Series(df.columns)

[152]:

   year
  state
    pop
   debt
dtype: object

1.4. DataFrame rows 行处理¶

获取索引和标签

[3]:

idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list

改变索引或行名

[ ]:

df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},inplace=True)

添加行

待补。

Drop row 删除行

df = df.drop(‘row_label’) df = df.drop(row1) # multi-row df = df.drop([‘row1’,’row2’]) # multi-row

查找一些行

[23]:

# fake up some data
data = {1:[1,2,3], 2:[4,1,9], 3:[1,8,27]}
df = pd.DataFrame(data)

[4]:

# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] # 这里看不太懂

对行做排序

[ ]:

df_obj.sort(columns = ‘’)#按列名进行排序
df_obj.sort_index(by=[‘’,’’])#多列排序,使用时报该函数已过时,请用sort_values
df_obj.sort_values(by=['',''])同上

索引前奏

[122]:

df0 = DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}, index=[3, 2, 1])
df1 = DataFrame([[1, 2, 3,], [3, 4, 5], [6, 7, 8]], index=[3, 2, 1])

[118]:

df0

[118]:

	x	y
3	1	3
2	2	4
1	3	5

[120]:

df0[1:2]

[120]:

	x	y
2	2	4

[17]:

df1

[17]:

	0	1	2
3	1	2	3
2	3	4	5
1	6	7	8

[18]:

df1[0]

[18]:

  1
  3
  6
Name: 0, dtype: int64

[19]:

df1[0:2]

[19]:

	0	1	2
3	1	2	3
2	3	4	5

[21]:

df1.ix[:, 0:2]

[21]:

	0	1	2
3	1	2	3
2	3	4	5
1	6	7	8

[124]:

df0[['x','y']]

[124]:

	x	y
3	1	3
2	2	4
1	3	5

1.5. 索引和切片¶

整数时一般是不包含的，非整数则会包含尾巴（基于 label）

[41]:

foo = DataFrame([4.5, 7.2, -5.3, 3.6], index=['a', 'b', 'c', 'd'])
bar = DataFrame([4.5, 7.2, -5.3, 3.6], index=range(4))

[100]:

print(foo)
print '------'
print(bar)

     0
a  4.5
b  7.2
c -5.3
d  3.6
------
     0
0  4.5
1  7.2
2 -5.3
3  3.6

[99]:

print foo[:2]
print '------'
print bar[:2]
print '------'
print foo[:'c']

     0
a  4.5
b  7.2
------
     0
0  4.5
1  7.2
------
     0
a  4.5
b  7.2
c -5.3

ix[::, ::] 可以接受两套切片（axis=0)横向，(axis=1)列向

[2]:

data = {'state':['Ohino','Ohino','Ohino','Nevada','Nevada'],
        'year':[2000,2001,2002,2001,2002],
        'pop':[1.5,1.7,3.6,2.4,2.9]}

df = DataFrame(data,index=['one','two','three','four','five'],
               columns=['year','state','pop','debt'])

[48]:

df

[48]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN
two	2001	Ohino	1.7	NaN
three	2002	Ohino	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN

[81]:

df.ix[:, 'state':'pop']

[81]:

	state	pop
one	Ohino	1.5
two	Ohino	1.7
three	Ohino	3.6
four	Nevada	2.4
five	Nevada	2.9

[84]:

df.ix[1] # 切的是行，所以说 ix 默认切的行， 也就是 axis=0

[84]:

year      2001
state    Ohino
pop        1.7
debt       NaN
Name: two, dtype: object

非 ix

ix 可以说是 pandas 的标准切法，而没有 ix 时，情况就略复杂些了，作者说：

索引时，选取的是列
切片时，选取的是行

记住一点，如果你想看单列或少数列的索引，那么直接用 df[‘column’], 其他就

[6]:

print(type(df['year']))
print(type(df[['year']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

[128]:

# df['one'] # 会报错，没办法这样索引,这是行
df[['year', 'state']] # 可运行

[128]:

	year	state
one	2000	Ohino
two	2001	Ohino
three	2002	Ohino
four	2001	Nevada
five	2002	Nevada

[91]:

df[0:1] # 切第一行，直接 df[0] 是会报错的。而 ix 不会。

[91]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN

[129]:

df['one':'two'] # 所以他也是可以整数切，也能标签切

[129]:

	year	state	pop	debt
one	2000	Ohino	1.5	NaN
two	2001	Ohino	1.7	NaN

[89]:

print(df.columns.tolist())
print(df.index.tolist())

['year', 'state', 'pop', 'debt']
['one', 'two', 'three', 'four', 'five']

[102]:

df.loc[:, 'year':'state']

[102]:

	year	state
one	2000	Ohino
two	2001	Ohino
three	2002	Ohino
four	2001	Nevada
five	2002	Nevada

[104]:

df.iloc[:, 1:2]

[104]:

	state
one	Ohino
two	Ohino
three	Ohino
four	Nevada
five	Nevada

.loc[label] 这是严格基于标签的索引
.iloc[inte] 这是严格基于整数位置的索引
.ix[] 更像是这两种严格方式的智能整合版。

[111]:

# df.loc[1:2] 用 label 的去切整数，自然会出错
# df.iloc['two':'three'] 也会出错

[109]:

df.loc['two':'three']

[109]:

	year	state	pop	debt
two	2001	Ohino	1.7	NaN
three	2002	Ohino	3.6	NaN

[105]:

df.iloc[1:2]

[105]:

	year	state	pop	debt
two	2001	Ohino	1.7	NaN

小结：

尽量写两套切片，除非是索引单列则用 df[column]
多用 iloc 和 loc, 除非你很清晰的基于标签

[ ]:

# 待补：当标签和整数冲突时

[ ]:

df2.info() # 说明，ix在这种非整数的整数标签上，他的切片跟loc一样，是基于标签的，而另外两个刚好相反。