Python for Data Analysis | MovieLens
Background
The MovieLens 1M dataset contains about 1 million ratings of roughly 4,000 movies from about 6,000 users.
ratings.dat
UserID::MovieID::Rating::Timestamp
users.dat
UserID::Gender::Age::Occupation::Zip-code
movies.dat
MovieID::Title::Genres
Use pandas.read_table to load each table into its own pandas DataFrame object.
* Pass header=None because the files have no header row; note that None is case-sensitive.
In [1]: import pandas as pd

In [2]: unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

In [3]: users = pd.read_table('C:/Users/I******/Desktop/.../movielens/users.dat', sep='::', header=None, names=unames)

In [4]: rnames = ['user_id', 'movie_id', 'rating', 'timestamp']

In [5]: ratings = pd.read_table('C:/Users/I******/Desktop/.../movielens/ratings.dat', sep='::', header=None, names=rnames)

In [6]: mnames = ['movie_id', 'title', 'genres']

In [7]: movies = pd.read_table('C:/Users/I******/Desktop/.../movielens/movies.dat', sep='::', header=None, names=mnames)
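Recent pandas versions emit a ParserWarning when a multi-character separator such as '::' is used with the default C parser, so it may help to request the Python engine explicitly. The sketch below is only an illustration: the three rows are copied from the users output in this post, and the `sample` string, `io.StringIO`, and the `dtype={'zip': str}` choice (to keep leading zeros in zip codes) are assumptions standing in for the real users.dat file.

```python
import io

import pandas as pd

# Three rows in the users.dat layout: UserID::Gender::Age::Occupation::Zip-code
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

users_sample = pd.read_csv(
    io.StringIO(sample),   # in-memory stand-in for the real file path
    sep='::',              # multi-character separator
    header=None,           # the data has no header row
    names=unames,
    engine='python',       # the C parser does not support multi-character separators
    dtype={'zip': str},    # keep zip codes as text so '02460' is not read as 2460
)

print(users_sample)
```

With a real file, the `io.StringIO(sample)` argument would simply be replaced by the path to users.dat.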
Use Python slice syntax to look at the first few rows of each DataFrame and verify that the data loaded correctly.
In [8]: users[:5]
Out[8]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

In [9]: ratings[:5]
Out[9]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

In [10]: movies[:5]
Out[10]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

In [11]: ratings
Out[11]:
         user_id  movie_id  rating  timestamp
0              1      1193       5  978300760
1              1       661       3  978302109
2              1       914       3  978301968
3              1      3408       4  978300275
4              1      2355       5  978824291
5              1      1197       3  978302268
6              1      1287       5  978302039
7              1      2804       5  978300719
8              1       594       4  978302268
9              1       919       4  978301368
10             1       595       5  978824268
11             1       938       4  978301752
12             1      2398       4  978302281
13             1      2918       4  978302124
14             1      1035       5  978301753
15             1      2791       4  978302188
16             1      2687       3  978824268
17             1      2018       4  978301777
18             1      3105       5  978301713
19             1      2797       4  978302039
20             1      2321       3  978302205
21             1       720       3  978300760
22             1      1270       5  978300055
23             1       527       5  978824195
24             1      2340       3  978300103
25             1        48       5  978824351
26             1      1097       4  978301953
27             1      1721       4  978300055
28             1      1545       4  978824139
29             1       745       3  978824268
...          ...       ...     ...        ...
1000179     6040      2762       4  956704584
1000180     6040      1036       3  956715455
1000181     6040       508       4  956704972
1000182     6040      1041       4  957717678
1000183     6040      3735       4  960971654
1000184     6040      2791       4  956715569
1000185     6040      2794       1  956716438
1000186     6040       527       5  956704219
1000187     6040      2003       1  956716294
1000188     6040       535       4  964828734
1000189     6040      2010       5  957716795
1000190     6040      2011       4  956716113
1000191     6040      3751       4  964828782
1000192     6040      2019       5  956703977
1000193     6040       541       4  956715288
1000194     6040      1077       5  964828799
1000195     6040      1079       2  956715648
1000196     6040       549       4  956704746
1000197     6040      2020       3  956715288
1000198     6040      2021       3  956716374
1000199     6040      2022       5  956716207
1000200     6040      2028       5  956704519
1000201     6040      1080       4  957717322
1000202     6040      1089       4  956704996
1000203     6040      1090       3  956715518
1000204     6040      1091       1  956716541
1000205     6040      1094       5  956704887
1000206     6040       562       5  956704746
1000207     6040      1096       4  956715648
1000208     6040      1097       4  956715569

[1000209 rows x 4 columns]
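Besides slicing, the same sanity check can be done with head(), shape and dtypes, which avoids printing a whole million-row table. The snippet below is a small sketch on an in-memory frame; `ratings_sample` is a hypothetical stand-in built from the first six ratings rows printed in this post.

```python
import pandas as pd

# First six rows of the ratings output, rebuilt by hand for illustration.
ratings_sample = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 1],
    'movie_id': [1193, 661, 914, 3408, 2355, 1197],
    'rating': [5, 3, 3, 4, 5, 3],
    'timestamp': [978300760, 978302109, 978301968,
                  978300275, 978824291, 978302268],
})

first_rows = ratings_sample[:5]     # row slice, as used in the post
same_rows = ratings_sample.head()   # head() defaults to the first 5 rows

print(first_rows.equals(same_rows))  # the two selections hold the same rows
print(ratings_sample.shape)          # (rows, columns) without printing the data
print(ratings_sample.dtypes)         # column types, useful for spotting bad parses
```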
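First use the pandas merge function to merge ratings with users, then merge movies into the result. pandas infers which columns to use as the merge (or join) keys from the overlapping column names.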
In [12]: data = pd.merge(pd.merge(ratings, users), movies)

In [13]: data
Out[13]:
         user_id  movie_id  rating   timestamp  gender  age  occupation    zip  \
0              1      1193       5   978300760       F    1          10  48067
1              2      1193       5   978298413       M   56          16  70072
2             12      1193       4   978220179       M   25          12  32793
3             15      1193       4   978199279       M   25           7  22903
4             17      1193       5   978158471       M   50           1  95350
5             18      1193       4   978156168       F   18           3  95825
6             19      1193       5   982730936       M    1          10  48073
7             24      1193       5   978136709       F   25           7  10023
8             28      1193       3   978125194       F   25           1  14607
9             33      1193       5   978557765       M   45           3  55421
10            39      1193       5   978043535       M   18           4  61820
11            42      1193       3   978038981       M   25           8  24502
12            44      1193       4   978018995       M   45          17  98052
13            47      1193       4   977978345       M   18           4  94305
14            48      1193       4   977975061       M   25           4  92107
15            49      1193       4   978813972       M   18          12  77084
16            53      1193       5   977946400       M   25           0  96931
17            54      1193       5   977944039       M   50           1  56723
18            58      1193       5   977933866       M   25           2  30303
19            59      1193       4   977934292       F   50           1  55413
20            62      1193       4   977968584       F   35           3  98105
21            80      1193       4   977786172       M   56           1  49327
22            81      1193       5   977785864       F   25           0  60640
23            88      1193       5   977694161       F   45           1  02476
24            89      1193       5   977683596       F   56           9  85749
25            95      1193       5   977626632       M   45           0  98201
26            96      1193       3   977621789       F   25          16  78028
27            99      1193       2   982791053       F    1          10  19390
28           102      1193       5  1040737607       M   35          19  20871
29           104      1193       2   977546620       M   25          12  00926
...          ...       ...     ...         ...     ...  ...         ...    ...
1000179     4933      3084       3   962757020       M   25          15  94040
1000180     4802      2218       2  1014866656       M   56           1  40601
1000181     4812      2308       2   962932391       M   18          14  25301
1000182     4874       624       4   962781918       F   25           4  70808
1000183     5059      1434       4   962484364       M   45          16  22652
1000184     5947      1434       4   957190428       F   45          16  97215
1000185     5077      1868       3   962417299       M   25           2  20037
1000186     5944      1868       1   957197520       F   18          10  27606
1000187     5105       404       3   962337582       M   50           7  18977
1000188     5185       404       4   963402617       F   35           4  44485
1000189     5532       404       5   959619841       M   25          17  27408
1000190     5543       404       3   960127592       M   25          17  97401
1000191     5220      2543       3   961546137       M   25           7  91436
1000192     5754      2543       4   958272316       F   18           1  60640
1000193     5227       591       3   961475931       M   18          10  64050
1000194     5795       591       1   958145253       M   25           1  92688
1000195     5313      3656       5   960920392       M   56           0  55406
1000196     5328      2438       4   960838075       F   25           4  91740
1000197     5334      3323       3   960796159       F   56          13  46140
1000198     5334       127       1   960795494       F   56          13  46140
1000199     5334      3382       5   960796159       F   56          13  46140
1000200     5420      1843       3   960156505       F    1          19  14850
1000201     5433       286       3   960240881       F   35          17  45014
1000202     5494      3530       4   959816296       F   35          17  94306
1000203     5556      2198       3   959445515       M   45           6  92103
1000204     5949      2198       5   958846401       M   18          17  47901
1000205     5675      2703       3   976029116       M   35          14  30030
1000206     5780      2845       1   958153068       M   18          17  92886
1000207     5851      3607       5   957756608       F   18          20  55410
1000208     5938      2909       4   957273353       M   25           1  35401

         title  \
0        One Flew Over the Cuckoo's Nest (1975)
1        One Flew Over the Cuckoo's Nest (1975)
2        One Flew Over the Cuckoo's Nest (1975)
3        One Flew Over the Cuckoo's Nest (1975)
4        One Flew Over the Cuckoo's Nest (1975)
5        One Flew Over the Cuckoo's Nest (1975)
6        One Flew Over the Cuckoo's Nest (1975)
7        One Flew Over the Cuckoo's Nest (1975)
8        One Flew Over the Cuckoo's Nest (1975)
9        One Flew Over the Cuckoo's Nest (1975)
10       One Flew Over the Cuckoo's Nest (1975)
11       One Flew Over the Cuckoo's Nest (1975)
12       One Flew Over the Cuckoo's Nest (1975)
13       One Flew Over the Cuckoo's Nest (1975)
14       One Flew Over the Cuckoo's Nest (1975)
15       One Flew Over the Cuckoo's Nest (1975)
16       One Flew Over the Cuckoo's Nest (1975)
17       One Flew Over the Cuckoo's Nest (1975)
18       One Flew Over the Cuckoo's Nest (1975)
19       One Flew Over the Cuckoo's Nest (1975)
20       One Flew Over the Cuckoo's Nest (1975)
21       One Flew Over the Cuckoo's Nest (1975)
22       One Flew Over the Cuckoo's Nest (1975)
23       One Flew Over the Cuckoo's Nest (1975)
24       One Flew Over the Cuckoo's Nest (1975)
25       One Flew Over the Cuckoo's Nest (1975)
26       One Flew Over the Cuckoo's Nest (1975)
27       One Flew Over the Cuckoo's Nest (1975)
28       One Flew Over the Cuckoo's Nest (1975)
29       One Flew Over the Cuckoo's Nest (1975)
...      ...
1000179  Home Page (1999)
1000180  Juno and Paycock (1930)
1000181  Detroit 9000 (1973)
1000182  Condition Red (1995)
1000183  Stranger, The (1994)
1000184  Stranger, The (1994)
1000185  Truce, The (1996)
1000186  Truce, The (1996)
1000187  Brother Minister: The Assassination of Malcolm...
1000188  Brother Minister: The Assassination of Malcolm...
1000189  Brother Minister: The Assassination of Malcolm...
1000190  Brother Minister: The Assassination of Malcolm...
1000191  Six Ways to Sunday (1997)
1000192  Six Ways to Sunday (1997)
1000193  Tough and Deadly (1995)
1000194  Tough and Deadly (1995)
1000195  Lured (1947)
1000196  Outside Ozona (1998)
1000197  Chain of Fools (2000)
1000198  Silence of the Palace, The (Saimt el Qusur) (1...
1000199  Song of Freedom (1936)
1000200  Slappy and the Stinkers (1998)
1000201  Nemesis 2: Nebula (1995)
1000202  Smoking/No Smoking (1993)
1000203  Modulations (1998)
1000204  Modulations (1998)
1000205  Broken Vessels (1998)
1000206  White Boys (1999)
1000207  One Little Indian (1973)
1000208  Five Wives, Three Secretaries and Me (1998)

         genres
0        Drama
1        Drama
2        Drama
3        Drama
4        Drama
5        Drama
6        Drama
7        Drama
8        Drama
9        Drama
10       Drama
11       Drama
12       Drama
13       Drama
14       Drama
15       Drama
16       Drama
17       Drama
18       Drama
19       Drama
20       Drama
21       Drama
22       Drama
23       Drama
24       Drama
25       Drama
26       Drama
27       Drama
28       Drama
29       Drama
...      ...
1000179  Documentary
1000180  Drama
1000181  Action|Crime
1000182  Action|Drama|Thriller
1000183  Action
1000184  Action
1000185  Drama|War
1000186  Drama|War
1000187  Documentary
1000188  Documentary
1000189  Documentary
1000190  Documentary
1000191  Comedy
1000192  Comedy
1000193  Action|Drama|Thriller
1000194  Action|Drama|Thriller
1000195  Crime
1000196  Drama|Thriller
1000197  Comedy|Crime
1000198  Drama
1000199  Drama
1000200  Children's|Comedy
1000201  Action|Sci-Fi|Thriller
1000202  Comedy
1000203  Documentary
1000204  Documentary
1000205  Drama
1000206  Drama
1000207  Comedy|Drama|Western
1000208  Documentary

[1000209 rows x 10 columns]
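Letting merge infer the keys is convenient, but spelling them out makes the join easier to read and guards against an accidental extra shared column. The sketch below shows one way this could look on tiny hand-built frames; the `*_s` names are hypothetical, the rows are taken from the output above, and `validate='many_to_one'` is an optional check added here rather than something the original session used.

```python
import pandas as pd

# Tiny stand-ins for the three tables (values taken from the outputs in this post).
ratings_s = pd.DataFrame({
    'user_id': [1, 2],
    'movie_id': [1193, 1193],
    'rating': [5, 5],
    'timestamp': [978300760, 978298413],
})
users_s = pd.DataFrame({
    'user_id': [1, 2],
    'gender': ['F', 'M'],
    'age': [1, 56],
    'occupation': [10, 16],
    'zip': ['48067', '70072'],
})
movies_s = pd.DataFrame({
    'movie_id': [1193],
    'title': ["One Flew Over the Cuckoo's Nest (1975)"],
    'genres': ['Drama'],
})

# Name the join keys explicitly; validate raises if a key is duplicated on the right.
merged = (
    ratings_s
    .merge(users_s, on='user_id', how='inner', validate='many_to_one')
    .merge(movies_s, on='movie_id', how='inner', validate='many_to_one')
)

print(merged)
```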
View a specific record
Error
In [14]: data.ix[0]
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
Solution
In [15]: data.iloc[0]
Out[15]:
user_id       1
movie_id      1193
rating        5
timestamp     978300760
gender        F
age           1
occupation    10
zip           48067
title         One Flew Over the Cuckoo's Nest (1975)
genres        Drama
Name: 0, dtype: object
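The warning matters more on newer installations: .ix was removed entirely in pandas 1.0, so only .loc and .iloc remain. The two give the same row when the index is the default 0, 1, 2, ..., but differ once rows are reordered. A small sketch with a made-up frame (`df` and its values are purely illustrative):

```python
import pandas as pd

# A frame whose index labels are deliberately out of order.
df = pd.DataFrame({'rating': [5, 3, 4]}, index=[2, 0, 1])

by_label = df.loc[0]      # row whose index label is 0
by_position = df.iloc[0]  # first row in storage order (label 2)

print(by_label['rating'])     # rating stored under label 0
print(by_position['rating'])  # rating stored in the first physical row
```

In the merged data above the index is the default range, which is why data.iloc[0] and data.loc[0] would return the same record there.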
Use the pivot_table method to compute each movie's mean rating by gender.
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All') [01]
In [16]: mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

In [17]: mean_ratings[:10]
Out[17]:
gender                                    F         M
title
$1,000,000 Duck (1971)             3.375000  2.761905
'Night Mother (1986)               3.388889  3.352941
'Til There Was You (1997)          2.675676  2.733333
'burbs, The (1989)                 2.793478  2.962085
...And Justice for All (1979)      3.828571  3.689024
1-900 (1994)                       2.000000  3.000000
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
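It can help to see that this pivot table is essentially a groupby on (title, gender) followed by unstack. The sketch below compares the two on a made-up frame; `toy` and its ratings are invented for illustration and are not MovieLens data.

```python
import pandas as pd

# Invented ratings: movie 'A' rated by one woman and two men, 'B' by two women only.
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [4, 2, 4, 5, 3],
})

# Mean rating per title (rows) and gender (columns).
via_pivot = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# The same aggregation written as groupby + unstack.
via_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(via_pivot)
print(via_groupby)
```

A title with no ratings from one gender ends up with NaN in that column, which is one reason the post filters out rarely rated movies next.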
Filter out movies with fewer than 250 ratings.
First group the data by title, then use size() to get a Series holding the group size for each title:
In [18]: ratings_by_title = data.groupby('title').size()

In [19]: ratings_by_title[:10]
Out[19]:
title
$1,000,000 Duck (1971)              37
'Night Mother (1986)                70
'Til There Was You (1997)           52
'burbs, The (1989)                 303
...And Justice for All (1979)      199
1-900 (1994)                         2
10 Things I Hate About You (1999)  700
101 Dalmatians (1961)              565
101 Dalmatians (1996)              364
12 Angry Men (1957)                616
dtype: int64
Keep the titles of the movies that received at least 250 ratings.
In [20]: active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [21]: active_titles
Out[21]:
Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)',
       ...
       u'X-Men (2000)', u'Year of Living Dangerously (1982)',
       u'Yellow Submarine (1968)', u"You've Got Mail (1998)",
       u'Young Frankenstein (1974)', u'Young Guns (1988)',
       u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
       u'Zero Effect (1998)', u'eXistenZ (1999)'],
      dtype='object', name=u'title', length=1216)
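The filtering step is a boolean mask applied to the index of the counts Series. A minimal sketch on invented data follows; `toy` and `min_ratings` are hypothetical names, and the threshold of 2 stands in for the 250 used above.

```python
import pandas as pd

# Invented ratings: 'A' has 3 ratings, 'B' has 1, 'C' has 2.
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4, 5, 3, 2, 4, 4],
})

counts = toy.groupby('title').size()  # number of ratings per title

min_ratings = 2                        # plays the role of 250 in the post
mask = counts >= min_ratings           # boolean Series aligned with counts
active = counts.index[mask]            # keep only the titles that pass

print(counts)
print(active)
```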
Use this index to select the required rows from mean_ratings.
Error
In [22]: mean_ratings = mean_ratings.ix[active_titles]
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
Solution
In [23]: mean_ratings = mean_ratings.loc[active_titles]

In [24]: mean_ratings
Out[24]:
gender                                                     F         M
title
'burbs, The (1989)                                  2.793478  2.962085
10 Things I Hate About You (1999)                   3.646552  3.311966
101 Dalmatians (1961)                               3.791444  3.500000
101 Dalmatians (1996)                               3.240000  2.911215
12 Angry Men (1957)                                 4.184397  4.328421
13th Warrior, The (1999)                            3.112000  3.168000
2 Days in the Valley (1996)                         3.488889  3.244813
20,000 Leagues Under the Sea (1954)                 3.670103  3.709205
2001: A Space Odyssey (1968)                        3.825581  4.129738
2010 (1984)                                         3.446809  3.413712
28 Days (2000)                                      3.209424  2.977707
39 Steps, The (1935)                                3.965517  4.107692
54 (1998)                                           2.701754  2.782178
7th Voyage of Sinbad, The (1958)                    3.409091  3.658879
8MM (1999)                                          2.906250  2.850962
About Last Night... (1986)                          3.188679  3.140909
Absent Minded Professor, The (1961)                 3.469388  3.446809
Absolute Power (1997)                               3.469136  3.327759
Abyss, The (1989)                                   3.659236  3.689507
Ace Ventura: Pet Detective (1994)                   3.000000  3.197917
Ace Ventura: When Nature Calls (1995)               2.269663  2.543333
Addams Family Values (1993)                         3.000000  2.878531
Addams Family, The (1991)                           3.186170  3.163498
Adventures in Babysitting (1987)                    3.455782  3.208122
Adventures of Buckaroo Bonzai Across the 8th Di...  3.308511  3.402321
Adventures of Priscilla, Queen of the Desert, T...  3.989071  3.688811
Adventures of Robin Hood, The (1938)                4.166667  3.918367
African Queen, The (1951)                           4.324232  4.223822
Age of Innocence, The (1993)                        3.827068  3.339506
Agnes of God (1985)                                 3.534884  3.244898
...                                                      ...       ...
White Men Can't Jump (1992)                         3.028777  3.231061
Who Framed Roger Rabbit? (1988)                     3.569378  3.713251
Who's Afraid of Virginia Woolf? (1966)              4.029703  4.096939
Whole Nine Yards, The (2000)                        3.296552  3.404814
Wild Bunch, The (1969)                              3.636364  4.128099
Wild Things (1998)                                  3.392000  3.459082
Wild Wild West (1999)                               2.275449  2.131973
William Shakespeare's Romeo and Juliet (1996)       3.532609  3.318644
Willow (1988)                                       3.658683  3.453543
Willy Wonka and the Chocolate Factory (1971)        4.063953  3.789474
Witness (1985)                                      4.115854  3.941504
Wizard of Oz, The (1939)                            4.355030  4.203138
Wolf (1994)                                         3.074074  2.899083
Women on the Verge of a Nervous Breakdown (1988)    3.934307  3.865741
Wonder Boys (2000)                                  4.043796  3.913649
Working Girl (1988)                                 3.606742  3.312500
World Is Not Enough, The (1999)                     3.337500  3.388889
Wrong Trousers, The (1993)                          4.588235  4.478261
Wyatt Earp (1994)                                   3.147059  3.283898
X-Files: Fight the Future, The (1998)               3.489474  3.493797
X-Men (2000)                                        3.682310  3.851702
Year of Living Dangerously (1982)                   3.951220  3.869403
Yellow Submarine (1968)                             3.714286  3.689286
You've Got Mail (1998)                              3.542424  3.275591
Young Frankenstein (1974)                           4.289963  4.239177
Young Guns (1988)                                   3.371795  3.425620
Young Guns II (1990)                                2.934783  2.904025
Young Sherlock Holmes (1985)                        3.514706  3.363344
Zero Effect (1998)                                  3.864407  3.723140
eXistenZ (1999)                                     3.098592  3.289086

[1216 rows x 2 columns]
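One thing to keep in mind with .loc and a collection of labels: in pandas 1.0 and later, a label that is missing from the index raises a KeyError instead of producing a NaN row. That cannot happen in the session above, because active_titles was derived from the same data, but it can when the labels come from elsewhere. The sketch below uses an invented two-row frame (`means`) to illustrate; intersecting the labels with the index first is one possible workaround, not the only one.

```python
import pandas as pd

# Invented mean ratings for two titles.
means = pd.DataFrame(
    {'F': [4.0, 3.0], 'M': [3.5, 2.5]},
    index=pd.Index(['A', 'B'], name='title'),
)

# Selecting by labels that all exist works as in the post.
selected = means.loc[pd.Index(['A'], name='title')]

# A label that is not in the index ('Z') makes .loc raise KeyError (pandas >= 1.0).
try:
    means.loc[['A', 'Z']]
    missing_raises = False
except KeyError:
    missing_raises = True

# Keeping only the labels that exist avoids the error.
wanted = means.index.intersection(['A', 'Z'])
safe = means.loc[wanted]

print(selected)
print(missing_raises)
print(safe)
```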
To see which movies female viewers liked most, sort by the F column in descending order.
Error
In [25]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
Solution
In [26]: top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

In [27]: top_female_ratings[:10]
Out[27]:
gender                                                     F         M
title
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248
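When only the top few rows are needed, DataFrame.nlargest is a possible shortcut for sort_values followed by a slice. A small sketch on invented numbers (`means` is a made-up frame, not the real mean_ratings):

```python
import pandas as pd

# Invented mean ratings for three titles.
means = pd.DataFrame(
    {'F': [4.6, 3.2, 4.5], 'M': [4.4, 3.0, 4.6]},
    index=pd.Index(['A', 'B', 'C'], name='title'),
)

# Sort the whole table by F, highest first, then keep the first two rows.
top_sorted = means.sort_values(by='F', ascending=False)[:2]

# Ask directly for the two rows with the largest F.
top_nlargest = means.nlargest(2, 'F')

print(top_sorted)
print(top_nlargest)
```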
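Measure how much the ratings differ between genders:
Add a column to mean_ratings holding the difference between the mean ratings (M minus F) and sort by it. The first rows are the movies that female viewers preferred: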
In [28]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

In [29]: sorted_by_diff = mean_ratings.sort_values(by='diff')

In [30]: sorted_by_diff[:15]
Out[30]:
gender                                        F         M       diff
title
Dirty Dancing (1987)                   3.790378  2.959596  -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358  -0.676359
Grease (1978)                          3.975265  3.367041  -0.608224
Little Women (1994)                    3.870588  3.321739  -0.548849
Steel Magnolias (1989)                 3.901734  3.365957  -0.535777
Anastasia (1997)                       3.800000  3.281609  -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131  -0.512885
Color Purple, The (1985)               4.158192  3.659341  -0.498851
Age of Innocence, The (1993)           3.827068  3.339506  -0.487561
Free Willy (1993)                      2.921348  2.438776  -0.482573
French Kiss (1995)                     3.535714  3.056962  -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688  -0.470312
Guys and Dolls (1955)                  4.051724  3.583333  -0.468391
Mary Poppins (1964)                    4.197740  3.730594  -0.467147
Patch Adams (1998)                     3.473282  3.008746  -0.464536
Reverse the sorted result and take the first 15 rows to get the movies that male viewers preferred:
In [31]: sorted_by_diff[::-1][:15]
Out[31]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
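The sign convention is easy to mix up, so here is the same idea on three invented titles: because diff is M minus F, the most negative values are the titles women rated higher, and the reversed order puts the titles men rated higher first. Sorting with ascending=False is an alternative to the [::-1] slice; the names `means`, `reversed_order` and `descending` are made up for this sketch.

```python
import pandas as pd

# Invented mean ratings: 'A' is preferred by women, 'B' by men, 'C' by neither.
means = pd.DataFrame(
    {'F': [4.0, 3.0, 2.0], 'M': [3.0, 3.5, 2.0]},
    index=pd.Index(['A', 'B', 'C'], name='title'),
)

means['diff'] = means['M'] - means['F']             # negative => women rated it higher

sorted_by_diff = means.sort_values(by='diff')       # most "female-preferred" first
reversed_order = sorted_by_diff[::-1]               # most "male-preferred" first
descending = means.sort_values(by='diff', ascending=False)  # same order, no ties here

print(sorted_by_diff)
print(reversed_order)
```

With tied diff values the two descending variants may order the tied rows differently, so the equivalence is only guaranteed when the values are distinct.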
Leaving gender aside, compute the variance or standard deviation of the ratings for each movie.
In [32]: rating_std_by_title = data.groupby('title')['rating'].std()

In [33]: rating_std_by_title = rating_std_by_title.loc[active_titles]

In [34]: rating_std_by_title.sort_values(ascending=False)[:10]
Out[34]:
title
Dumb & Dumber (1994)                   1.321333
Blair Witch Project, The (1999)        1.316368
Natural Born Killers (1994)            1.307198
Tank Girl (1995)                       1.277695
Rocky Horror Picture Show, The (1975)  1.260177
Eyes Wide Shut (1999)                  1.259624
Evita (1996)                           1.253631
Billy Madison (1995)                   1.249970
Fear and Loathing in Las Vegas (1998)  1.246408
Bicentennial Man (1999)                1.245533
Name: rating, dtype: float64
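Note that pandas computes the sample standard deviation by default (ddof=1, dividing by n - 1), not the population one. For titles with hundreds of ratings the difference is small, but it is visible on a tiny invented example such as the one below, where `toy` holds four made-up ratings per title.

```python
import pandas as pd

# Invented ratings: 'A' is polarising (1s and 5s), 'B' is unanimous.
toy = pd.DataFrame({
    'title': ['A'] * 4 + ['B'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 3, 3],
})

grouped = toy.groupby('title')['rating']

sample_std = grouped.std()            # default ddof=1: divide by n - 1
population_std = grouped.std(ddof=0)  # divide by n instead

print(sample_std)
print(population_std)
```

For title 'A' the mean is 3 and every rating is 2 away from it, so the population value is exactly 2, while the sample value is sqrt(16 / 3), a little larger.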
Reference
01 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html