Python Data Analysis (3) [Mac]

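Data:
Baby-name frequency data from 1880 to 2010, provided by the U.S. Social Security Administration (SSA).
Each yearly file contains only the names that occurred at least 5 times in that year.
Loading: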

### read data
## 1880-2010
## read data of each year
import pandas as pd
import numpy as np
columns=['name', 'sex', 'births']
years = range(1880, 2011)
pieces = []
for year in years:
    path = 'pydata-book-master/ch02/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
## merge
names = pd.concat(pieces, ignore_index=True)
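
As a small illustration of the merge step, the sketch below builds two hypothetical per-year frames by hand (the names and counts are invented, not taken from the SSA files) and shows that `ignore_index=True` gives the combined frame one continuous row index.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files; the values are invented for illustration
frame_a = pd.DataFrame({'name': ['Ann', 'Bob'], 'sex': ['F', 'M'], 'births': [10, 20]})
frame_a['year'] = 1880
frame_b = pd.DataFrame({'name': ['Ann', 'Bob'], 'sex': ['F', 'M'], 'births': [12, 18]})
frame_b['year'] = 1881

# ignore_index=True discards each frame's own 0..n-1 index and renumbers the rows
combined = pd.concat([frame_a, frame_b], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the row labels 0 and 1 would each appear twice in the combined frame.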

 

 

Tasks:

To make things easier to follow, each task is presented together with its analysis in the next part.

Main operations:

1. Count the total births by sex for each year;

### 1. get the sum of 'births' of different 'sex' in each year
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
#total_births.tail()
total_births.plot(title='Total births by sex and year', xticks=range(1880, 2020, 10))

Result: [figure: total births by sex and year]


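2. Compute each name's births as a proportion of the total births for its year and sex;

### 2. get ratio of each name of different 'sex' ['births' / sum('births')] in each year
def add_prop(group):
    # convert to float before dividing so the result is not truncated to an integer
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

# group_keys=False keeps the original flat row index on newer pandas versions
names = names.groupby(['year', 'sex'], group_keys=False).apply(add_prop)
## check if the sum of prop is 1
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)

Result:

True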
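
The same per-group proportion can also be computed without `apply`. The sketch below is an alternative formulation, not the code used in this post: it uses `transform('sum')` on a hypothetical toy frame (`toy`, with invented values), which does not depend on how `apply` treats the grouping columns, something that has changed across pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2000],
    'sex': ['F', 'F', 'M', 'M'],
    'name': ['Ann', 'Beth', 'Carl', 'Dan'],
    'births': [30, 10, 25, 75],
})

# transform('sum') returns the group total aligned to every original row
group_totals = toy.groupby(['year', 'sex'])['births'].transform('sum')
toy['prop'] = toy['births'] / group_totals

# Sanity check: proportions within each (year, sex) group sum to 1
print(np.allclose(toy.groupby(['year', 'sex'])['prop'].sum(), 1))
```

Because the result of `transform` has the same index as the input, it can be assigned directly as a new column.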
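3. Get the top 1000 names for each sex/year combination;

### 3. get the top1000 names of different 'sex' in each year
def get_top1000(group):
    # sort_values replaces the removed sort_index(by=...) form
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
# drop the group index; it is not needed for the following steps
top1000 = top1000.reset_index(drop=True)
total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')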
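
For comparison, a top-N-per-group selection can also be written by sorting once and taking the head of each group. This is a minimal sketch on a hypothetical toy frame with invented values and N = 2, not the approach used above.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    'year': [2000] * 6,
    'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
    'name': ['Ann', 'Beth', 'Cara', 'Dan', 'Ed', 'Finn'],
    'births': [5, 30, 10, 7, 2, 9],
})

# Sort once by births, then keep the first 2 rows of each (year, sex) group
top2 = (toy.sort_values('births', ascending=False)
           .groupby(['year', 'sex'])
           .head(2)
           .reset_index(drop=True))
print(top2)
```

Here `head(2)` plays the role of the `[:1000]` slice, and `reset_index(drop=True)` leaves a plain integer index.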

4. Track how the number of babies named 'John', 'Harry', 'Mary', and 'Marilyn' changes over time

### 4. get the sum of names ['John', 'Harry', 'Mary', 'Marilyn'] in each year (both sexes combined)
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year", xticks=range(1880, 2020, 10))

Result: [figure: number of births per year, one panel per name]

5. Sum the proportion of births covered by the 1000 most popular names, aggregated by year and sex;

### 5. get the sum of ratio of top1000 names of different 'sex' in each year
table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc='sum')
table.plot(title='Sum of table1000.prop by year and sex',
           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

Result: [figure: sum of table1000.prop by year and sex]

6. Count, for each year and sex, how many distinct names make up the top 50% of births;

### 6. count the number of popular names of different 'sex' in each year whose cumulative ratio reaches 0.5
def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    # position where the cumulative proportion reaches q, plus 1 because positions start at 0
    return group.prop.cumsum().searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex').astype(float)
diversity.plot(title='Number of popular names in top 50%', xticks=range(1880, 2020, 10))

Result: [figure: number of popular names in top 50%]

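Note:

When run as printed in the book, the code raises an error at the plot call:

TypeError: Empty 'DataFrame': no numeric data to plot

The likely cause is that searchsorted returns a one-element array rather than a plain number in some pandas versions, which leaves the unstacked table with object dtype. To fix this, append .astype(float) to diversity = diversity.unstack('sex'), that is:

diversity = diversity.unstack('sex').astype(float)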
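
The counting logic inside `get_quantile_count` can be checked on a small hand-made array. The sketch below uses NumPy only; the proportions in `props` are hypothetical and chosen so that the answer is easy to verify by hand.

```python
import numpy as np

# Hypothetical proportions for one (year, sex) group, already sorted in descending order
props = np.array([0.30, 0.15, 0.10, 0.05, 0.05])

# Running total: roughly 0.30, 0.45, 0.55, 0.60, 0.65
cumulative = props.cumsum()

# searchsorted gives the first zero-based position whose running total reaches 0.5;
# adding 1 turns that position into a count of names
count = int(cumulative.searchsorted(0.5)) + 1
print(count)  # 3
```

The first two names cover about 45% of births and the third pushes the total past 50%, so three names are needed.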

7. Proportion of each last letter in boys' and girls' names

### 7. count the sum of each last letter of names of different 'sex' in each year
get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births', index=last_letters, columns=['sex', 'year'], aggfunc='sum')
# keep three representative years
subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
# normalize each (sex, year) column so that it sums to 1
letter_prop = subtable / subtable.sum().astype(float)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)

Result: [figure: bar charts of last-letter proportions, panels 'Male' and 'Female']

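8. Proportion of boys born each year whose names end in d/n/y

### 8. count the ratio of last letter [d/n/y] of boys' names in each year
letter_prop = table / table.sum().astype(float)
# .loc replaces the removed .ix indexer
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
dny_ts.plot(xticks=range(1880, 2020, 10))

Result: [figure: proportion of boys' names ending in d, n, and y by year]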

9. Share of boys and girls among names containing 'lesl', by year

### 9. count the ratio of names [containing 'lesl'] of different 'sex' in each year
all_names = top1000.name.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesley_like = all_names[mask]
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
table = filtered.pivot_table('births', index='year', columns='sex', aggfunc='sum')
# normalize each row so that the two sexes sum to 1 within a year
table = table.div(table.sum(axis=1), axis=0)
table.plot(style={'M': 'k-', 'F': 'k--'}, xticks=range(1880, 2020, 10))

Result: [figure: share of 'lesl' names by sex and year]
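
The row-wise normalization used in step 9 can be seen in isolation on a tiny table. This is a minimal sketch with a hypothetical frame `counts` holding invented values.

```python
import pandas as pd

# Hypothetical births by year (rows) and sex (columns), invented for illustration
counts = pd.DataFrame({'F': [10, 30], 'M': [30, 10]}, index=[1900, 2000])

# counts.sum(axis=1) is the total per row; div(..., axis=0) aligns those totals
# with the row labels, so every row is divided by its own total
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Each row of `shares` sums to 1, which is what lets the plot be read as the split between the sexes in a given year.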
