Lesson 2

import numpy as np
import pandas as pd
text = pd.read_csv('train_chinese.csv')
text.head()
   乘客ID  是否幸存  仓位等级  姓名                                               性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
0  1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
1  2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
2  3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S
3  4       1         1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1             0             113803            53.1000  C123  S
4  5       0         3         Allen, Mr. William Henry                           male    35.0  0             0             373450            8.0500   NaN   S
frame = pd.DataFrame(np.arange(8).reshape((2,4)),
                     index = ['three','one'],
                     columns = ['d','a','b','c'])

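Sorting and ranking

  • Sorting is ascending by default; pass ascending = False for descending order
  • sort_index sorts by row index labels; with axis = 1 it sorts by column names
  • sort_values sorts by value; sort_values(by = [...]) sorts by one or more column names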
frame.sort_index()

       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis = 1,ascending = False) 
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
frame.sort_values(by = ['b','c'],ascending = False)
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
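
The second key in `by` only matters when the first key has ties, which the two-row `frame` above cannot show. Here is a minimal sketch using a made-up frame `df` (its labels and values are assumptions for illustration, not taken from the dataset). Passing a list to `ascending` sets the direction for each key separately.

```python
import pandas as pd

# Hypothetical data: column 'b' has ties, so column 'c' decides the order within each tie.
df = pd.DataFrame({'b': [2, 1, 2, 1], 'c': [10, 40, 30, 20]},
                  index=['w', 'x', 'y', 'z'])

# Both keys ascending: rows with b == 1 come first, each group ordered by c.
both_up = df.sort_values(by=['b', 'c'])
print(list(both_up.index))        # ['z', 'x', 'w', 'y']

# One direction per key: b ascending, c descending.
mixed = df.sort_values(by=['b', 'c'], ascending=[True, False])
print(list(mixed.index))          # ['x', 'z', 'y', 'w']

# sort_index with axis=1 orders by column name instead of row label.
by_col_desc = df.sort_index(axis=1, ascending=False)
print(list(by_col_desc.columns))  # ['c', 'b']
```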

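Ranking a Series

  • rank assigns ranks from 1 up to the number of valid (non-missing) values; by default, tied values share the average of their ranks
  • rank(method = 'first') ranks in order of appearance: among tied values, the one that appears earlier gets the lower rank
  • rank(ascending = False,method = 'max') ranks in descending order and gives every tied value the highest rank in its group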
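
To make "from 1 up to the number of valid values" concrete, here is a small sketch on a made-up Series `s` (an assumption for illustration) that contains one missing entry and one tie. With the default settings the missing entry keeps a NaN rank and does not count toward the total.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one missing entry and one tie (the two 3.0s).
s = pd.Series([3.0, np.nan, 1.0, 3.0])

avg = s.rank()                         # default method='average'
print(avg.tolist())                    # [2.5, nan, 1.0, 2.5]

first = s.rank(method='first')         # ties broken by position in the data
print(first.tolist())                  # [2.0, nan, 1.0, 3.0]

desc_max = s.rank(ascending=False, method='max')
print(desc_max.tolist())               # [2.0, nan, 3.0, 2.0]
```

The highest rank under `method='first'` is 3, which equals `s.count()`, the number of non-missing values.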

Ranking a DataFrame

frame.rank(axis = 'columns')
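
The output of this call is not shown, so the following sketch rebuilds the same `frame` and contrasts `axis = 'columns'` (rank within each row) with the default `axis = 0` (rank within each column). The variable names `across` and `down` are mine, not the author's.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

# axis='columns': rank the values inside each row, left to right.
across = frame.rank(axis='columns')
print(across.loc['three'].tolist())  # [1.0, 2.0, 3.0, 4.0]

# Default axis=0: rank the values inside each column, top to bottom.
down = frame.rank()
print(down['d'].tolist())            # [1.0, 2.0]
```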

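Tie-breaking methods

  • average (default): give each tied group the average of its ranks
  • min: use the lowest rank for the whole group
  • max: use the highest rank for the whole group
  • first: assign ranks in the order the values appear in the data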
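
As a side-by-side comparison, this sketch ranks the same six values with each of the four methods, in ascending order. Only the two tied 7s (positions 0 and 2) change between columns. The helper names `methods` and `ranks` are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -2, 7, 4, 6, 8])

# One column per tie-breaking method; the two 7s compete for ranks 4 and 5.
methods = ['average', 'min', 'max', 'first']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
#    average  min  max  first
# 0      4.5  4.0  5.0    4.0
# 1      1.0  1.0  1.0    1.0
# 2      4.5  4.0  5.0    5.0
# 3      2.0  2.0  2.0    2.0
# 4      3.0  3.0  3.0    3.0
# 5      6.0  6.0  6.0    6.0
```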
obj = pd.Series([7,-2,7,4,6,8])
# Pass the boolean False: the string 'False' is truthy and gives ascending ranks (5, 1, 5, 2, 3, 6)
obj.rank(ascending = False,method = 'max')
0    3.0
1    6.0
2    3.0
3    5.0
4    4.0
5    1.0
dtype: float64
text.sort_values(by = ['年龄'],ascending = False).head(10)
     乘客ID  是否幸存  仓位等级  姓名                                  性别    年龄  兄弟姐妹个数  父母子女个数  船票信息    票价     客舱  登船港口
630  631     1         1         Barkworth, Mr. Algernon Henry Wilson  male    80.0  0             0             27042       30.0000  A23   S
851  852     0         3         Svensson, Mr. Johan                   male    74.0  0             0             347060      7.7750   NaN   S
493  494     0         1         Artagaveytia, Mr. Ramon               male    71.0  0             0             PC 17609    49.5042  NaN   C
96   97      0         1         Goldschmidt, Mr. George B             male    71.0  0             0             PC 17754    34.6542  A5    C
116  117     0         3         Connors, Mr. Patrick                  male    70.5  0             0             370369      7.7500   NaN   Q
672  673     0         2         Mitchell, Mr. Henry Michael           male    70.0  0             0             C.A. 24580  10.5000  NaN   S
745  746     0         1         Crosby, Capt. Edward Gifford          male    70.0  1             1             WE/P 5735   71.0000  B22   S
33   34      0         2         Wheadon, Mr. Edward H                 male    66.0  0             0             C.A. 24579  10.5000  NaN   S
54   55      0         1         Ostby, Mr. Engelhart Cornelius        male    65.0  0             1             113509      61.9792  B30   C
280  281     0         3         Duane, Mr. Frank                      male    65.0  0             0             336439      7.7500   NaN   Q
text.sort_values(by = ['年龄']).head(10)
     乘客ID  是否幸存  仓位等级  姓名                             性别    年龄  兄弟姐妹个数  父母子女个数  船票信息         票价      客舱     登船港口
803  804     1         3         Thomas, Master. Assad Alexander  male    0.42  0             1             2625             8.5167    NaN      C
755  756     1         2         Hamalainen, Master. Viljo        male    0.67  1             1             250649           14.5000   NaN      S
644  645     1         3         Baclini, Miss. Eugenie           female  0.75  2             1             2666             19.2583   NaN      C
469  470     1         3         Baclini, Miss. Helene Barbara    female  0.75  2             1             2666             19.2583   NaN      C
78   79      1         2         Caldwell, Master. Alden Gates    male    0.83  0             2             248738           29.0000   NaN      S
831  832     1         2         Richards, Master. George Sibley  male    0.83  1             1             29106            18.7500   NaN      S
305  306     1         1         Allison, Master. Hudson Trevor   male    0.92  1             2             113781           151.5500  C22 C26  S
827  828     1         2         Mallet, Master. Andre            male    1.00  0             2             S.C./PARIS 2079  37.0042   NaN      C
381  382     1         3         Nakid, Miss. Maria ("Mary")      female  1.00  0             2             2653             15.7417   NaN      C
164  165     0         3         Panula, Master. Eino Viljami     male    1.00  4             1             3101295          39.6875   NaN      S

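In the two samples above, only 1 of the 10 oldest passengers survived, while 9 of the 10 youngest did. Ignoring other factors, this suggests that the survival rate falls as age rises, although 20 rows are a hint rather than proof.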
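
One way to check this beyond the extremes is to group passengers into age bands and compare the mean of 是否幸存 in each band. The sketch below is only an illustration: the helper `survival_rate_by_age`, the band edges (18 and 60) and the band labels are my assumptions, and `sample` holds just the 20 ages and outcomes listed in the two tables, so the script runs without the CSV.

```python
import pandas as pd

def survival_rate_by_age(df, bins, labels):
    """Mean of 是否幸存 per age band; rows with a missing age fall out of the grouping."""
    bands = pd.cut(df['年龄'], bins=bins, labels=labels)
    return df.groupby(bands, observed=True)['是否幸存'].mean()

# The 10 oldest and 10 youngest passengers shown in the tables.
sample = pd.DataFrame({
    '年龄': [80.0, 74.0, 71.0, 71.0, 70.5, 70.0, 70.0, 66.0, 65.0, 65.0,
             0.42, 0.67, 0.75, 0.75, 0.83, 0.83, 0.92, 1.0, 1.0, 1.0],
    '是否幸存': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

rates = survival_rate_by_age(sample, bins=[0, 18, 60, 100],
                             labels=['child', 'adult', 'senior'])
print(rates)  # child 0.9, senior 0.1 (no adults in this sample)
```

On the full dataset the same call would take `text` in place of `sample`, which gives a rate for every band rather than only the two extremes.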

DataFrame arithmetic and alignment

df1 = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['three','one','a'],
                   columns = ['a','b','c','d']
                  )
df1
       a  b   c   d
three  0  1   2   3
one    4  5   6   7
a      8  9  10  11
df2 = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['b','one','a'],
                   columns = ['a','e','c','f']
                  )
df2
     a  e   c   f
b    0  1   2   3
one  4  5   6   7
a    8  9  10  11
df2+df1
          a    b     c    d    e    f
a      16.0  NaN  20.0  NaN  NaN  NaN
b       NaN  NaN   NaN  NaN  NaN  NaN
one     8.0  NaN  12.0  NaN  NaN  NaN
three   NaN  NaN   NaN  NaN  NaN  NaN
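
The result above follows from label alignment: the output index and columns are the sorted union of both frames' labels, and a cell has a value only where both frames carry that row and that column. The sketch below checks this on the same `df1` and `df2`. As an aside that the original post does not cover, it also shows `DataFrame.add` with `fill_value = 0`, which treats a value missing on one side as 0 but still leaves NaN where both sides are missing.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)),
                   index=['three', 'one', 'a'], columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.arange(12).reshape((3, 4)),
                   index=['b', 'one', 'a'], columns=['a', 'e', 'c', 'f'])

total = df1 + df2
print(list(total.index))               # ['a', 'b', 'one', 'three']
print(list(total.columns))             # ['a', 'b', 'c', 'd', 'e', 'f']
print(int(total.notna().sum().sum()))  # 4 cells: rows a/one x columns a/c

# fill_value=0 substitutes 0 for a value missing in only one of the frames.
filled = df1.add(df2, fill_value=0)
print(filled.loc['a', 'b'])            # 9.0 (present in df1 only)
print(filled.loc['three', 'e'])        # nan (missing in both)
```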
max(text['兄弟姐妹个数']+text['父母子女个数'])
10
text['票价'].describe()
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: 票价, dtype: float64
text['父母子女个数'].describe()
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: 父母子女个数, dtype: float64
posted @ 2021-06-17 15:32  visionwpc