Statistics with Python pandas


Basic filtering and calculation in pandas

  • Slice-like row selection: df[1:3]
  • df.loc[]
    df.loc[df['enumHouseType'] == 7, ["id", "enumHouseType"]]
  • Filtering on multiple columns
    df.loc[(df["grade"] == "B") & (df["loan_amnt"] > 5000), ["member_id", "term", "loan_amnt", "grade", "sub_grade", "int_rate"]].head()
  • Filtering, then computing statistics
    df.loc[df['enumHouseType'] == 7].get('enumHouseType').max()
    df.loc[df['enumHouseType'] == 7].get('enumHouseType').min()
    df.loc[df['enumHouseType'] == 7].get('enumHouseType').mean()
    df.loc[df['enumHouseType'] == 7].get('enumHouseType').count()
  • Grouped calculation
    df.groupby('enumHouseType')['id'].count()

    df.groupby('enumDataEntityStatus')['enumHouseType'].agg(['size', 'sum', 'mean'])
  • Sampling
    # Set the sampling weights manually (one weight per row)
    weights = [0, 0, 0, 0, 0.5, 0.5]
    df_inner.sample(n=2, weights=weights)
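
The filtering and grouping patterns in the list above can be tried on a small frame. The following is only a minimal sketch: the data and the `price` column are invented for illustration, and only the `id` and `enumHouseType` column names come from the snippets above.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "enumHouseType": [7, 3, 7, 3, 7, 5],
    "price": [100, 250, 300, 150, 200, 400],  # hypothetical column
})

# Boolean mask plus a column list: rows where enumHouseType is 7, two columns only
subset = df.loc[df["enumHouseType"] == 7, ["id", "price"]]

# Statistics over the filtered rows
price_max = subset["price"].max()
price_mean = subset["price"].mean()

# Row count, sum and mean of price for each house type
summary = df.groupby("enumHouseType")["price"].agg(["size", "sum", "mean"])

print(subset)
print(price_max, price_mean)
print(summary)
```

Passing the aggregation names as strings keeps the snippet free of a numpy import, and `size` counts every row in a group, including rows with missing values.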

Read more: http://bluewhale.cc/2017-04-21/use-python-for-data-analysis-like-excel-3.html#ixzz54LDfXg83
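
The weighted sampling call in the list above depends on `df_inner`, which comes from the linked tutorial and is not defined in this post. As a rough sketch, assuming a hypothetical six-row frame (the `weights` list needs one entry per row), it behaves like this:

```python
import pandas as pd

# Hypothetical six-row stand-in for df_inner; the values are invented
df_inner = pd.DataFrame({
    "id": [101, 102, 103, 104, 105, 106],
    "city": ["A", "B", "C", "D", "E", "F"],
})

# One weight per row, in row order; rows with weight 0 can never be drawn
weights = [0, 0, 0, 0, 0.5, 0.5]

# Sampling is without replacement by default, so with n=2 and only two
# non-zero weights the result is always the last two rows (in some order)
picked = df_inner.sample(n=2, weights=weights)

# random_state makes a single draw reproducible across runs
picked_again = df_inner.sample(n=1, weights=weights, random_state=0)

print(picked.sort_index())
print(picked_again)
```

If the weights do not sum to 1, pandas normalizes them before drawing.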

Reading text data with pandas and writing it to Excel

# -*- coding: utf-8 -*-
import json

import pandas as pd

data = []
# The input file holds one JSON object per line
with open("2017-06-21-tripadvisor_restaurant_detail .txt", "r", encoding="utf-8") as f:
    for line in f:
        lineDic = json.loads(line)
        # Use the second and third breadcrumb entries as country and city
        lineDic['country'] = lineDic['breadcrumbs'][1]
        lineDic['city'] = lineDic['breadcrumbs'][2]
        # Flatten list fields into ';'-separated strings
        lineDic['breadcrumbs'] = ';'.join(lineDic['breadcrumbs'])
        lineDic['languages'] = ';'.join(lineDic['languages'])
        lineDic['season_peopele'] = ';'.join(lineDic['season_peopele'])
        lineDic['traveller_type'] = ';'.join(lineDic['traveller_type'])
        lineDic['restaurant_imgs'] = ';'.join(lineDic['restaurant_imgs'])
        lineDic['buiness_hours'] = ';'.join(lineDic['buiness_hours'])
        # Serialize rate_percent to a JSON string
        lineDic['rate_percent'] = json.dumps(lineDic['rate_percent'])
        data.append(lineDic)

dataFrame = pd.DataFrame(data)
# Drop the flattened columns before export
dataFrame.drop(['rate_percent', 'buiness_hours', 'breadcrumbs', 'languages', 'season_peopele', 'traveller_type', 'restaurant_imgs'], axis=1, inplace=True)
# Export only rows 30000 to 39999
df = dataFrame[30000:40000]

# df = df.applymap(illegal_char_remover)
# results
filename = 'tripadvisor_data_test2.xlsx'
df.to_excel(filename, sheet_name="Sheet1", engine="openpyxl")
print("Ok!!! the file in", filename)
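
The commented-out `applymap(illegal_char_remover)` line refers to a helper that this post never defines. One plausible version, offered purely as an assumption about what it was meant to do, strips the control characters that openpyxl refuses to write into a worksheet cell. The pattern name `ILLEGAL_CHARS` is hypothetical.

```python
import re

# Control characters that XML 1.0 (and therefore .xlsx) cannot store.
# Tab (\x09), newline (\x0a) and carriage return (\x0d) are allowed.
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def illegal_char_remover(value):
    """Strip illegal control characters from strings; pass other values through."""
    if isinstance(value, str):
        return ILLEGAL_CHARS.sub("", value)
    return value


cleaned = illegal_char_remover("Caf\x08e\tBar")
print(repr(cleaned))  # 'Cafe\tBar'
```

With a helper like this in place, the commented line could be enabled to clean every cell before `to_excel` is called. Recent pandas releases expose the same element-wise operation as `DataFrame.map`.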

…… to be continued


Author: Callable
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Callable as the source when reposting!