Python for Data Analysis | MovieLens

Background

The MovieLens 1M dataset contains about 1 million ratings of roughly 4,000 movies from about 6,000 users.

ratings.dat

UserID::MovieID::Rating::Timestamp

users.dat

UserID::Gender::Age::Occupation::Zip-code

movies.dat

MovieID::Title::Genres

 

Use pandas.read_table to load each table into its own pandas DataFrame object.

* Note: header=None is case-sensitive (None must be capitalized).

In [1]: import pandas as pd

In [2]: unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
In [3]: users = pd.read_table('C:/Users/I******/Desktop/.../movielens/users.dat', sep='::', header=None, names=unames, engine='python')

In [4]: rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
In [5]: ratings = pd.read_table('C:/Users/I******/Desktop/.../movielens/ratings.dat', sep='::', header=None, names=rnames, engine='python')

In [6]: mnames = ['movie_id', 'title', 'genres']
In [7]: movies = pd.read_table('C:/Users/I******/Desktop/.../movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')
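Because '::' is longer than one character, pandas interprets it as a regular expression, which only the Python parsing engine supports. The following is a minimal sketch of the same loading step on a small in-memory sample, so it runs without the data files. The sample rows, the name users_demo, and the choice to force zip to a string are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample in the users.dat layout:
# UserID::Gender::Age::Occupation::Zip-code
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' is named explicitly because a multi-character separator
# is treated as a regex; dtype keeps leading zeros in zip codes.
users_demo = pd.read_csv(
    io.StringIO(sample),
    sep='::',
    header=None,
    names=unames,
    engine='python',
    dtype={'zip': str},
)

print(users_demo)
```

Reading zip as a string is a design choice for the sketch: a zip code such as 02460 would lose its leading zero if it were parsed as an integer.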

Use Python slice syntax to look at the first few rows of each DataFrame and verify that the data loaded correctly.

In [8]: users[:5]
Out[8]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

In [9]: ratings[:5]
Out[9]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

In [10]: movies[:5]
Out[10]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

In [11]: ratings
Out[11]:
         user_id  movie_id  rating  timestamp
0              1      1193       5  978300760
1              1       661       3  978302109
2              1       914       3  978301968
3              1      3408       4  978300275
4              1      2355       5  978824291
5              1      1197       3  978302268
6              1      1287       5  978302039
7              1      2804       5  978300719
8              1       594       4  978302268
9              1       919       4  978301368
10             1       595       5  978824268
11             1       938       4  978301752
12             1      2398       4  978302281
13             1      2918       4  978302124
14             1      1035       5  978301753
15             1      2791       4  978302188
16             1      2687       3  978824268
17             1      2018       4  978301777
18             1      3105       5  978301713
19             1      2797       4  978302039
20             1      2321       3  978302205
21             1       720       3  978300760
22             1      1270       5  978300055
23             1       527       5  978824195
24             1      2340       3  978300103
25             1        48       5  978824351
26             1      1097       4  978301953
27             1      1721       4  978300055
28             1      1545       4  978824139
29             1       745       3  978824268
...          ...       ...     ...        ...
1000179     6040      2762       4  956704584
1000180     6040      1036       3  956715455
1000181     6040       508       4  956704972
1000182     6040      1041       4  957717678
1000183     6040      3735       4  960971654
1000184     6040      2791       4  956715569
1000185     6040      2794       1  956716438
1000186     6040       527       5  956704219
1000187     6040      2003       1  956716294
1000188     6040       535       4  964828734
1000189     6040      2010       5  957716795
1000190     6040      2011       4  956716113
1000191     6040      3751       4  964828782
1000192     6040      2019       5  956703977
1000193     6040       541       4  956715288
1000194     6040      1077       5  964828799
1000195     6040      1079       2  956715648
1000196     6040       549       4  956704746
1000197     6040      2020       3  956715288
1000198     6040      2021       3  956716374
1000199     6040      2022       5  956716207
1000200     6040      2028       5  956704519
1000201     6040      1080       4  957717322
1000202     6040      1089       4  956704996
1000203     6040      1090       3  956715518
1000204     6040      1091       1  956716541
1000205     6040      1094       5  956704887
1000206     6040       562       5  956704746
1000207     6040      1096       4  956715648
1000208     6040      1097       4  956715569

[1000209 rows x 4 columns]

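First merge ratings with users using the pandas merge function, then merge movies into the result. pandas infers which columns to use as the merge (join) keys from the overlapping column names.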

In [12]: data = pd.merge(pd.merge(ratings, users), movies)

In [13]: data
Out[13]:
         user_id  movie_id  rating   timestamp gender  age  occupation    zip  \
0              1      1193       5   978300760      F    1          10  48067
1              2      1193       5   978298413      M   56          16  70072
2             12      1193       4   978220179      M   25          12  32793
3             15      1193       4   978199279      M   25           7  22903
4             17      1193       5   978158471      M   50           1  95350
5             18      1193       4   978156168      F   18           3  95825
6             19      1193       5   982730936      M    1          10  48073
7             24      1193       5   978136709      F   25           7  10023
8             28      1193       3   978125194      F   25           1  14607
9             33      1193       5   978557765      M   45           3  55421
10            39      1193       5   978043535      M   18           4  61820
11            42      1193       3   978038981      M   25           8  24502
12            44      1193       4   978018995      M   45          17  98052
13            47      1193       4   977978345      M   18           4  94305
14            48      1193       4   977975061      M   25           4  92107
15            49      1193       4   978813972      M   18          12  77084
16            53      1193       5   977946400      M   25           0  96931
17            54      1193       5   977944039      M   50           1  56723
18            58      1193       5   977933866      M   25           2  30303
19            59      1193       4   977934292      F   50           1  55413
20            62      1193       4   977968584      F   35           3  98105
21            80      1193       4   977786172      M   56           1  49327
22            81      1193       5   977785864      F   25           0  60640
23            88      1193       5   977694161      F   45           1  02476
24            89      1193       5   977683596      F   56           9  85749
25            95      1193       5   977626632      M   45           0  98201
26            96      1193       3   977621789      F   25          16  78028
27            99      1193       2   982791053      F    1          10  19390
28           102      1193       5  1040737607      M   35          19  20871
29           104      1193       2   977546620      M   25          12  00926
...          ...       ...     ...         ...    ...  ...         ...    ...
1000179     4933      3084       3   962757020      M   25          15  94040
1000180     4802      2218       2  1014866656      M   56           1  40601
1000181     4812      2308       2   962932391      M   18          14  25301
1000182     4874       624       4   962781918      F   25           4  70808
1000183     5059      1434       4   962484364      M   45          16  22652
1000184     5947      1434       4   957190428      F   45          16  97215
1000185     5077      1868       3   962417299      M   25           2  20037
1000186     5944      1868       1   957197520      F   18          10  27606
1000187     5105       404       3   962337582      M   50           7  18977
1000188     5185       404       4   963402617      F   35           4  44485
1000189     5532       404       5   959619841      M   25          17  27408
1000190     5543       404       3   960127592      M   25          17  97401
1000191     5220      2543       3   961546137      M   25           7  91436
1000192     5754      2543       4   958272316      F   18           1  60640
1000193     5227       591       3   961475931      M   18          10  64050
1000194     5795       591       1   958145253      M   25           1  92688
1000195     5313      3656       5   960920392      M   56           0  55406
1000196     5328      2438       4   960838075      F   25           4  91740
1000197     5334      3323       3   960796159      F   56          13  46140
1000198     5334       127       1   960795494      F   56          13  46140
1000199     5334      3382       5   960796159      F   56          13  46140
1000200     5420      1843       3   960156505      F    1          19  14850
1000201     5433       286       3   960240881      F   35          17  45014
1000202     5494      3530       4   959816296      F   35          17  94306
1000203     5556      2198       3   959445515      M   45           6  92103
1000204     5949      2198       5   958846401      M   18          17  47901
1000205     5675      2703       3   976029116      M   35          14  30030
1000206     5780      2845       1   958153068      M   18          17  92886
1000207     5851      3607       5   957756608      F   18          20  55410
1000208     5938      2909       4   957273353      M   25           1  35401

                                                     title  \
0                   One Flew Over the Cuckoo's Nest (1975)
1                   One Flew Over the Cuckoo's Nest (1975)
2                   One Flew Over the Cuckoo's Nest (1975)
3                   One Flew Over the Cuckoo's Nest (1975)
4                   One Flew Over the Cuckoo's Nest (1975)
5                   One Flew Over the Cuckoo's Nest (1975)
6                   One Flew Over the Cuckoo's Nest (1975)
7                   One Flew Over the Cuckoo's Nest (1975)
8                   One Flew Over the Cuckoo's Nest (1975)
9                   One Flew Over the Cuckoo's Nest (1975)
10                  One Flew Over the Cuckoo's Nest (1975)
11                  One Flew Over the Cuckoo's Nest (1975)
12                  One Flew Over the Cuckoo's Nest (1975)
13                  One Flew Over the Cuckoo's Nest (1975)
14                  One Flew Over the Cuckoo's Nest (1975)
15                  One Flew Over the Cuckoo's Nest (1975)
16                  One Flew Over the Cuckoo's Nest (1975)
17                  One Flew Over the Cuckoo's Nest (1975)
18                  One Flew Over the Cuckoo's Nest (1975)
19                  One Flew Over the Cuckoo's Nest (1975)
20                  One Flew Over the Cuckoo's Nest (1975)
21                  One Flew Over the Cuckoo's Nest (1975)
22                  One Flew Over the Cuckoo's Nest (1975)
23                  One Flew Over the Cuckoo's Nest (1975)
24                  One Flew Over the Cuckoo's Nest (1975)
25                  One Flew Over the Cuckoo's Nest (1975)
26                  One Flew Over the Cuckoo's Nest (1975)
27                  One Flew Over the Cuckoo's Nest (1975)
28                  One Flew Over the Cuckoo's Nest (1975)
29                  One Flew Over the Cuckoo's Nest (1975)
...                                                    ...
1000179                                   Home Page (1999)
1000180                            Juno and Paycock (1930)
1000181                                Detroit 9000 (1973)
1000182                               Condition Red (1995)
1000183                               Stranger, The (1994)
1000184                               Stranger, The (1994)
1000185                                  Truce, The (1996)
1000186                                  Truce, The (1996)
1000187  Brother Minister: The Assassination of Malcolm...
1000188  Brother Minister: The Assassination of Malcolm...
1000189  Brother Minister: The Assassination of Malcolm...
1000190  Brother Minister: The Assassination of Malcolm...
1000191                          Six Ways to Sunday (1997)
1000192                          Six Ways to Sunday (1997)
1000193                            Tough and Deadly (1995)
1000194                            Tough and Deadly (1995)
1000195                                       Lured (1947)
1000196                               Outside Ozona (1998)
1000197                              Chain of Fools (2000)
1000198  Silence of the Palace, The (Saimt el Qusur) (1...
1000199                             Song of Freedom (1936)
1000200                     Slappy and the Stinkers (1998)
1000201                           Nemesis 2: Nebula (1995)
1000202                          Smoking/No Smoking (1993)
1000203                                 Modulations (1998)
1000204                                 Modulations (1998)
1000205                              Broken Vessels (1998)
1000206                                  White Boys (1999)
1000207                           One Little Indian (1973)
1000208        Five Wives, Three Secretaries and Me (1998)

                         genres
0                         Drama
1                         Drama
2                         Drama
3                         Drama
4                         Drama
5                         Drama
6                         Drama
7                         Drama
8                         Drama
9                         Drama
10                        Drama
11                        Drama
12                        Drama
13                        Drama
14                        Drama
15                        Drama
16                        Drama
17                        Drama
18                        Drama
19                        Drama
20                        Drama
21                        Drama
22                        Drama
23                        Drama
24                        Drama
25                        Drama
26                        Drama
27                        Drama
28                        Drama
29                        Drama
...                         ...
1000179             Documentary
1000180                   Drama
1000181            Action|Crime
1000182   Action|Drama|Thriller
1000183                  Action
1000184                  Action
1000185               Drama|War
1000186               Drama|War
1000187             Documentary
1000188             Documentary
1000189             Documentary
1000190             Documentary
1000191                  Comedy
1000192                  Comedy
1000193   Action|Drama|Thriller
1000194   Action|Drama|Thriller
1000195                   Crime
1000196          Drama|Thriller
1000197            Comedy|Crime
1000198                   Drama
1000199                   Drama
1000200       Children's|Comedy
1000201  Action|Sci-Fi|Thriller
1000202                  Comedy
1000203             Documentary
1000204             Documentary
1000205                   Drama
1000206                   Drama
1000207    Comedy|Drama|Western
1000208             Documentary

[1000209 rows x 10 columns]
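The key inference used by the nested merge can be made explicit with the on parameter. The sketch below uses three tiny hypothetical tables (the rows and the *_demo names are invented for illustration) and checks that the implicit and explicit forms give the same result.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings_demo = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating': [5, 3, 4],
})
people_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
films_demo = pd.DataFrame({
    'movie_id': [10, 20],
    'title': ['Movie A (1990)', 'Movie B (1995)'],
})

# Implicit keys: pandas joins on the column names the two frames share.
implicit = pd.merge(pd.merge(ratings_demo, people_demo), films_demo)

# Explicit keys: the same joins, with the key columns spelled out.
explicit = (
    ratings_demo
    .merge(people_demo, on='user_id')
    .merge(films_demo, on='movie_id')
)

print(explicit)
```

Naming the keys is a little more verbose, but it protects against an accidental join on some other column that happens to share a name in both tables.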

Inspect a specific record

Warning

In [14]: data.ix[0]
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

Solution

In [15]: data.iloc[0]
Out[15]:
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
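The difference between the two replacements for .ix only shows up when the index labels are not 0, 1, 2, and so on. A small sketch with a hypothetical frame whose labels are 10, 20, 30:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
frame = pd.DataFrame({'rating': [5, 3, 4]}, index=[10, 20, 30])

by_position = frame.iloc[0]   # first row, whatever its label is
by_label = frame.loc[10]      # row whose index label is 10

# There is no label 0 in this index, so label-based lookup fails.
try:
    frame.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_position['rating'], by_label['rating'], label_zero_found)
```

For data, the merged frame has a default integer index, so data.iloc[0] and data.loc[0] happen to return the same row.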

Use the pivot_table method to compute the mean rating of each movie by gender.

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All') [01]

In [16]: mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

In [17]: mean_ratings[:10]
Out[17]:
gender                                    F         M
title
$1,000,000 Duck (1971)             3.375000  2.761905
'Night Mother (1986)               3.388889  3.352941
'Til There Was You (1997)          2.675676  2.733333
'burbs, The (1989)                 2.793478  2.962085
...And Justice for All (1979)      3.828571  3.689024
1-900 (1994)                       2.000000  3.000000
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
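To see what pivot_table does here, it may help to run it on a handful of rows. The sketch below uses made-up ratings (the titles 'A' and 'B' and the name demo are hypothetical) and also shows a groupby/unstack formulation that should produce the same table.

```python
import pandas as pd

# Hypothetical ratings: two titles, rated by viewers of both genders.
demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating.
mean_demo = demo.pivot_table(
    values='rating', index='title', columns='gender', aggfunc='mean'
)

# Same computation: group by both keys, then move gender into the columns.
alt = demo.groupby(['title', 'gender'])['rating'].mean().unstack()

print(mean_demo)
```

In this sample, title 'A' has female ratings 4 and 2, so its F cell is their mean.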

Filter out movies with fewer than 250 ratings.

First group by title, then call size() to get a Series containing the size of each movie's group:

In [18]: ratings_by_title = data.groupby('title').size()

In [19]: ratings_by_title[:10]
Out[19]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

Keep the titles of movies that have at least 250 ratings.

In [20]: active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [21]: active_titles
Out[21]:
Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)',
       ...
       u'X-Men (2000)', u'Year of Living Dangerously (1982)',
       u'Yellow Submarine (1968)', u"You've Got Mail (1998)",
       u'Young Frankenstein (1974)', u'Young Guns (1988)',
       u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
       u'Zero Effect (1998)', u'eXistenZ (1999)'],
      dtype='object', name=u'title', length=1216)

Use these titles to select the required rows from mean_ratings.

Warning

In [22]: mean_ratings = mean_ratings.ix[active_titles]
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

Solution

In [23]: mean_ratings = mean_ratings.loc[active_titles]

In [24]: mean_ratings
Out[24]:
gender                                                     F         M
title
'burbs, The (1989)                                  2.793478  2.962085
10 Things I Hate About You (1999)                   3.646552  3.311966
101 Dalmatians (1961)                               3.791444  3.500000
101 Dalmatians (1996)                               3.240000  2.911215
12 Angry Men (1957)                                 4.184397  4.328421
13th Warrior, The (1999)                            3.112000  3.168000
2 Days in the Valley (1996)                         3.488889  3.244813
20,000 Leagues Under the Sea (1954)                 3.670103  3.709205
2001: A Space Odyssey (1968)                        3.825581  4.129738
2010 (1984)                                         3.446809  3.413712
28 Days (2000)                                      3.209424  2.977707
39 Steps, The (1935)                                3.965517  4.107692
54 (1998)                                           2.701754  2.782178
7th Voyage of Sinbad, The (1958)                    3.409091  3.658879
8MM (1999)                                          2.906250  2.850962
About Last Night... (1986)                          3.188679  3.140909
Absent Minded Professor, The (1961)                 3.469388  3.446809
Absolute Power (1997)                               3.469136  3.327759
Abyss, The (1989)                                   3.659236  3.689507
Ace Ventura: Pet Detective (1994)                   3.000000  3.197917
Ace Ventura: When Nature Calls (1995)               2.269663  2.543333
Addams Family Values (1993)                         3.000000  2.878531
Addams Family, The (1991)                           3.186170  3.163498
Adventures in Babysitting (1987)                    3.455782  3.208122
Adventures of Buckaroo Bonzai Across the 8th Di...  3.308511  3.402321
Adventures of Priscilla, Queen of the Desert, T...  3.989071  3.688811
Adventures of Robin Hood, The (1938)                4.166667  3.918367
African Queen, The (1951)                           4.324232  4.223822
Age of Innocence, The (1993)                        3.827068  3.339506
Agnes of God (1985)                                 3.534884  3.244898
...                                                      ...       ...
White Men Can't Jump (1992)                         3.028777  3.231061
Who Framed Roger Rabbit? (1988)                     3.569378  3.713251
Who's Afraid of Virginia Woolf? (1966)              4.029703  4.096939
Whole Nine Yards, The (2000)                        3.296552  3.404814
Wild Bunch, The (1969)                              3.636364  4.128099
Wild Things (1998)                                  3.392000  3.459082
Wild Wild West (1999)                               2.275449  2.131973
William Shakespeare's Romeo and Juliet (1996)       3.532609  3.318644
Willow (1988)                                       3.658683  3.453543
Willy Wonka and the Chocolate Factory (1971)        4.063953  3.789474
Witness (1985)                                      4.115854  3.941504
Wizard of Oz, The (1939)                            4.355030  4.203138
Wolf (1994)                                         3.074074  2.899083
Women on the Verge of a Nervous Breakdown (1988)    3.934307  3.865741
Wonder Boys (2000)                                  4.043796  3.913649
Working Girl (1988)                                 3.606742  3.312500
World Is Not Enough, The (1999)                     3.337500  3.388889
Wrong Trousers, The (1993)                          4.588235  4.478261
Wyatt Earp (1994)                                   3.147059  3.283898
X-Files: Fight the Future, The (1998)               3.489474  3.493797
X-Men (2000)                                        3.682310  3.851702
Year of Living Dangerously (1982)                   3.951220  3.869403
Yellow Submarine (1968)                             3.714286  3.689286
You've Got Mail (1998)                              3.542424  3.275591
Young Frankenstein (1974)                           4.289963  4.239177
Young Guns (1988)                                   3.371795  3.425620
Young Guns II (1990)                                2.934783  2.904025
Young Sherlock Holmes (1985)                        3.514706  3.363344
Zero Effect (1998)                                  3.864407  3.723140
eXistenZ (1999)                                     3.098592  3.289086

[1216 rows x 2 columns]
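The count, filter, and select steps can be traced on a few rows. This is a sketch with invented data; min_ratings is a hypothetical stand-in for the threshold of 250, lowered so the tiny sample has something to filter.

```python
import pandas as pd

# Hypothetical ratings: 'A' has 3 ratings, 'B' has 1, 'C' has 2.
demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4, 2, 5, 3, 1, 5],
})

# Step 1: number of ratings per title.
counts = demo.groupby('title').size()

# Step 2: a boolean mask on the counts selects the index labels to keep.
min_ratings = 2  # stand-in for 250 in the full dataset
active = counts.index[counts >= min_ratings]

# Step 3: use those labels to pick rows from another title-indexed object.
means = demo.groupby('title')['rating'].mean()
filtered = means.loc[active]

print(filtered)
```

The mask works because counts and means share the same title index, so the labels taken from one can be used to select from the other.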

To find the movies that female viewers liked most, sort by the F column in descending order.

Warning

In [25]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)

Solution

In [26]: top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

In [27]: top_female_ratings[:10]
Out[27]:
gender                                                     F         M
title
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248
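sort_values orders rows by the values of a column, whereas sort_index orders them by the index labels. A small sketch on hypothetical scores; the nlargest call is shown as an alternative when only the top few rows are needed, not as something the original session used.

```python
import pandas as pd

# Hypothetical mean ratings for three titles.
scores = pd.DataFrame(
    {'F': [3.5, 4.6, 4.1], 'M': [3.9, 4.4, 4.0]},
    index=['A', 'B', 'C'],
)

# Full sort by the F column, highest first.
top_f = scores.sort_values(by='F', ascending=False)

# Alternative for "top n" queries: only the two highest F values.
top_two = scores.nlargest(2, 'F')

print(top_f)
```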

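Measure the rating disagreement between genders:

Add a column to mean_ratings holding the difference between the mean ratings, then sort by it --> movies preferred by female viewers;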

In [28]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

In [29]: sorted_by_diff = mean_ratings.sort_values(by='diff')

In [30]: sorted_by_diff[:15]
Out[30]:
gender                                        F         M      diff
title
Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
Grease (1978)                          3.975265  3.367041 -0.608224
Little Women (1994)                    3.870588  3.321739 -0.548849
Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
Anastasia (1997)                       3.800000  3.281609 -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
Color Purple, The (1985)               4.158192  3.659341 -0.498851
Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
Free Willy (1993)                      2.921348  2.438776 -0.482573
French Kiss (1995)                     3.535714  3.056962 -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688 -0.470312
Guys and Dolls (1955)                  4.051724  3.583333 -0.468391
Mary Poppins (1964)                    4.197740  3.730594 -0.467147
Patch Adams (1998)                     3.473282  3.008746 -0.464536

Reverse the sorted result and take the first 15 rows --> movies preferred by male viewers;

In [31]: sorted_by_diff[::-1][:15]
Out[31]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
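Since diff is computed as M minus F, the most negative values mark titles rated higher by women and the most positive ones titles rated higher by men. A sketch on three hypothetical titles, which also checks that reversing the ascending sort gives the same order as sorting in descending order when there are no ties:

```python
import pandas as pd

# Hypothetical mean ratings by gender.
scores = pd.DataFrame(
    {'F': [3.8, 3.5, 4.0], 'M': [3.0, 4.2, 4.1]},
    index=['A', 'B', 'C'],
)

# Negative diff: rated higher by women; positive diff: rated higher by men.
scores['diff'] = scores['M'] - scores['F']

by_diff = scores.sort_values(by='diff')
female_pref = by_diff.index[0]        # most negative diff
male_pref = by_diff[::-1].index[0]    # most positive diff

# Reversing an ascending sort matches a descending sort here (no ties).
descending = scores.sort_values(by='diff', ascending=False)

print(female_pref, male_pref)
```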

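Ignoring gender, measure the disagreement among viewers using the variance or standard deviation of the ratings (the standard deviation is used here).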

In [32]: rating_std_by_title = data.groupby('title')['rating'].std()

In [33]: rating_std_by_title = rating_std_by_title.loc[active_titles]

In [34]: rating_std_by_title.sort_values(ascending=False)[:10]
Out[34]:
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
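One detail worth knowing when reading these numbers: the pandas std method computes the sample standard deviation (ddof=1) by default, dividing by n - 1 rather than n. A sketch on hypothetical ratings where the arithmetic is easy to check by hand:

```python
import pandas as pd

# Hypothetical ratings: 'A' is divisive (1, 3, 5), 'B' is unanimous (3, 3, 3).
demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B', 'B'],
    'rating': [1, 3, 5, 3, 3, 3],
})

# Default: sample standard deviation, dividing by n - 1.
# For 'A': squared deviations 4 + 0 + 4 = 8, and 8 / 2 = 4, so std = 2.
std_by_title = demo.groupby('title')['rating'].std()

# Population standard deviation, dividing by n instead.
pop_std_by_title = demo.groupby('title')['rating'].std(ddof=0)

print(std_by_title)
```

With hundreds of ratings per title, as in the filtered data, the two definitions differ only slightly, so the ranking of the most divisive movies is unlikely to change.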

 

Reference

01 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

posted @ 2017-08-22 11:53  PrinceMay