Loading...

Tuesday, September 2, 2014

Python and Pandas: Part 2. Movie Ratings


Hello Readers,

Here is Part 2 of the Pandas and Python series, where we examine movie ratings data from University of Minnesota's Movielens recommendation system. The 1 million rows of data are available here as a 'zip' and 'readme' file. We will use Pandas in Python to read, manipulate, and massage the rating data. The sizable data include over 1 million ratings from 6,000 users, concerning 4,000 movies.

As a reference, I am using Python 2.7.8 via IPython 2.1.0. Start IPython and let's jump in.


Loading the Rating Data

After navigating to the proper directory with the three ending in .dat, start IPython with "matplotlib". Immediately import "pylab" and "pandas". While loading the user, rating and movie data files, we create the variable names for each DataFrame. 

For example, the "users" DataFrame has "user_id", "gender", "age", "occupation", and "zip" code for its variables. Preview the first 5 rows by slicing "[:5]".

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: Qt4Agg

In [1]: import pylab

In [2]: import pandas as pd

In [3]: unames = ['user_id','gender','age','occupation','zip']

In [4]: users = pd.read_table('movielens/users.dat',sep='::',header=None,
   ...: names=unames)

In [5]: users[:5]
Out[5]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

In [16]: rnames = ['user_id','movie_id','rating','timestamp']

In [17]: ratings = pd.read_table('movielens/ratings.dat',sep='::',header=None,names=rnames)

In [18]: ratings[:5]
Out[18]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

In [19]: mnames = ['movie_id','title','genres']

In [20]: movies = pd.read_table('movielens/movies.dat',sep='::',header=None,names=mnames)

In [21]: movies[:5]
Out[21]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

In [22]: ratings
Out[22]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
user_id      1000209  non-null values
movie_id     1000209  non-null values
rating       1000209  non-null values
timestamp    1000209  non-null values
dtypes: int64(4)

Looking at the "ratings" DataFrame specifically, we find that it has 1,000,209 entries (rows), with 4 columns- "user_id", "movie_id", "rating", and "timestamp". We see that the ratings are integers up to 5. The common variables among the 3 DataFrames are "user_id" and "movie_id".



Merging DataFrames

Working with three data from three separate DataFrames will hamper our efficiency in querying rows we require for analysis. The files were normalized for size efficiency for downloading, and now since we have them on our hard drive, we can go ahead and merge them together (de-normalize). 

Once we merge the DataFrames, there will be redundant elements, as user information will be repeated for various movies they rated, and movie information will be repeated as well since movies have multiple users rating them. That is why the file will be larger in size (think megabytes vs kilobytes). For storage purposes, a normalized format is more optimal, whereas for fast read/write times, de-normalized tables are more efficient.

From pandas, use the ".merge()" method twice to merge all three DataFrames together. The method will merge by similar column names, which is why we named them "user_id", "movie_id". 

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
In [23]: data = pd.merge(pd.merge(ratings,users),movies)

In [24]: data
Out[24]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
user_id       1000209  non-null values
movie_id      1000209  non-null values
rating        1000209  non-null values
timestamp     1000209  non-null values
gender        1000209  non-null values
age           1000209  non-null values
occupation    1000209  non-null values
zip           1000209  non-null values
title         1000209  non-null values
genres        1000209  non-null values
dtypes: int64(6), object(4)

In [25]: data.ix[0]
Out[25]:
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object

Now the complete table has 1,000,209 rows with 10 variable columns. Taking the first row index with ".ix[0]", we can examine the merge results within the first row. This female gave the movie "One Flew Over the Cuckoo's Nest" the highest 5 star rating. And looking at her age, 1, we only know that she is less than 18 years of age. Note that the occupation categories are explained in the included README file. (Go ahead, read it).



Pivot Table

Since there are 10 variables, we can massage the data into showing us some patterns between the variables. Not only can we see the average ratings for each movie, we can stratify them by gender. We accomplish this with ".pivot_table()", while specifying the value we want stratified, the variables we want to see in the rows and columns, and how we want to values to be presented (mean).

Taking the first 8 movies, and observe that they are ordered alphabetically by symbol/punctuation, numbers, capitalized letters, and lowercase letters. We can see the average ratings for the movies by gender.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
In [26]: mean_ratings = data.pivot_table('rating',rows='title',cols='gender',aggfunc='mean')

In [27]: mean_ratings[:8]
Out[27]:
gender                                    F         M
title
$1,000,000 Duck (1971)             3.375000  2.761905
'Night Mother (1986)               3.388889  3.352941
'Til There Was You (1997)          2.675676  2.733333
'burbs, The (1989)                 2.793478  2.962085
...And Justice for All (1979)      3.828571  3.689024
1-900 (1994)                       2.000000  3.000000
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000

For the cartoon, 101 Dalmations released in 1961, the average female rating was ~3.79, while the average male rating was slightly lower at 3.50. Sometimes the average rating would be close, and others one gender would rate a movie more favorably than the other gender.



Movie Rating Subsetting and Sorting

Here we aim to work with a robust set of movies with 250 or more ratings by subsetting the data. First we create an index in active_titles of movies grouped by their title using ".groupby('title').size()" or their count. Then using the ratings_by_title, we select only those with 250 or more ratings through the ".index[]" into active_titles.

From the gender stratified mean_ratings created above, we can subset those movies titles with 250 or more ratings, with ".ix[]" for index. Now the DataFrame is reduced to 1,216 movies, from 6,000.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
In [9]: ratings_by_title = data.groupby('title').size()

In [10]: ratings_by_title[:10]
Out[10]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

In [11]: active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [12]: active_titles
Out[12]: Index(['burbs, The (1989), 10 Things I Hate About You (1999), 101 Dalmatians (1961), ..., Young Sherlock Holmes (1985), Zero Effect (1998), eXistenZ (1999)], dtype=object)

In [13]: mean_ratings = mean_ratings.ix[active_titles]

In [14]: mean_ratings
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
Data columns (total 2 columns):
F    1216  non-null values
M    1216  non-null values
dtypes: float64(2)

In [15]: top_female_ratings = mean_ratings.sort_index(by='F',ascending=False)

In [16]: top_female_ratings[:10]
Out[16]:
gender                                                         F         M
title
Close Shave, A (1995)                                   4.644444  4.473795
Wrong Trousers, The (1993)                              4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
Schindler's List (1993)                                 4.562602  4.491415
Shawshank Redemption, The (1994)                        4.539075  4.560625
Grand Day Out, A (1992)                                 4.537879  4.293255
To Kill a Mockingbird (1962)                            4.536667  4.372611
Creature Comforts (1990)                                4.513889  4.272277
Usual Suspects, The (1995)                              4.513317  4.518248

From this refined data, we can reorder and display the top rated titles by females with ".sort_index()". The top female averaged rated title goes to 'A Close Shave', with 'The Wrong Trousers' coming in second. While the average male ratings were not ordered, they are similar in range, due to the quality of the movies. So, users enjoying a movie would be more inclined to rate a movie, thus this subset of movies with 250+ ratings is slightly more biased than the original total data.



Measuring Rating Disagreement

However females and males did not rate movies similarly. We can create a new variable, 'diff', for rating difference of males - females. Then we can sort the mean_ratings with the largest difference first. The biggest disparity in rating where female movie ratings measured higher than males belonged to 'Dirty Dancing', 'Jumpin' Jack Flash', and 'Grease'.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
## measuring rating disagreement

In [17]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

In [19]: sorted_by_diff = mean_ratings.sort_index(by='diff')

In [20]: sorted_by_diff[:10]
Out[20]:
gender                                        F         M      diff
title
Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
Grease (1978)                          3.975265  3.367041 -0.608224
Little Women (1994)                    3.870588  3.321739 -0.548849
Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
Anastasia (1997)                       3.800000  3.281609 -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
Color Purple, The (1985)               4.158192  3.659341 -0.498851
Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
Free Willy (1993)                      2.921348  2.438776 -0.482573

In [21]: sorted_by_diff[::-1][:15]
Out[21]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020

In [33]: rating_std_by_title = data.groupby('title')['rating'].std()

In [34]: rating_std_by_title = rating_std_by_title.ix[active_titles]

In [35]: rating_std_by_title.order(ascending=False)[:10]
Out[35]:
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64

Conversely, if we flipped the order around, where the male ratings were higher than female ratings, the differential pointed to 'The Good, The Bad and the Ugly', 'The Kentucky Fried Movie', and 'Dumb & Dumber'. Both sets of films, while good, cater stereo-typically to different genders.

Another way to measure the differential is through the spread, otherwise known as the standard deviation. Among all movies and users, 'Dumb & Dumber' took the top prize with the highest standard deviation of 1.321. Therefore, the ratings users assigned to 'Dumb & Dumber' differed more among each other than for other movies, although 'The Blair Witch Project' came in a close second at 1.316



Average User Age by Movie

Grouping by movie, we can find the average age of the user who rates the movies. We also use active_titles to subset those movies with a large number of ratings. The movies with the youngest users rating them are: 'Can't Hardly Wait', 'Friday', and 'Empire Records', all just over 21. Note the year of the movies are more recent in the 1990s or 2000.

On the older end, the highest average age rated movies turned out to be: 'Hud', 'Klute', and 'Around the World in 80 Days'. Again, note the movie dates as they are in the 1960s, 1970s, and 1980s, more resonant with more esteemed users who have seen the movies when they were younger.

Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
In [37]: avg_age = data.groupby('title')['age'].mean()

In [45]: avg_age = avg_age.ix[active_titles]

In [47]: avg_age.order(ascending=True)[:10]
Out[47]:
title
Can't Hardly Wait (1998)                                        21.284314
Friday (1995)                                                   21.768025
Empire Records (1995)                                           21.940959
Billy Madison (1995)                                            22.253521
Mallrats (1995)                                                 22.432373
Road Trip (2000)                                                22.584726
Scream 3 (2000)                                                 22.818024
Teenage Mutant Ninja Turtles II: The Secret of the Ooze (1991)  23.107570
Skulls, The (2000)                                              23.463576
Tommy Boy (1995)                                                23.585366
Name: age, dtype: float64

In [48]: avg_age.order(ascending=False)[:10]
Out[48]:
title
Hud (1963)                                    41.971326
Klute (1971)                                  40.817881
Around the World in 80 Days (1956)            40.732342
Tender Mercies (1983)                         40.196970
Breaker Morant (1980)                         39.655556
Cat Ballou (1965)                             39.575000
Atlantic City (1980)                          39.483553
Mister Roberts (1955)                         39.323040
Taking of Pelham One Two Three, The (1974)    39.217391
In the Heat of the Night (1967)               38.681034

Take heed though, of the age group less than 18. It is numerated with 1. So taking the average age is somewhat misleading, as those less than one skew the mean more than their actual age would. I assume that one year-olds would not be rating movies. We will have to correct this issue later.


Well, congratulations on reaching the end of this picture-less post! Here we used pandas in Python to create and merge DataFrames, and created more specific-use DataFrames through subsetting, and viewing the various orders through sorting. I hope you have an idea of the versatility of Python for data analysis with pandas by reading this series! Stay tuned for more posts!




Thanks for reading,

Wayne
@beyondvalence
LinkedIn


Python and Pandas Series:
1. Python and Pandas: Part 1: bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names
.

3 comments:

  1. Deposit Menggunakan GO Pay di Agen Judi Online terpopuler di Indonesia !

    Minimal Deposit 50ribu saja, Tersedia di situs www.bolavita.fun Anda dapat menikmati berbagai jenis judi online yang tersedia ...

    Info selengkapnya silahkan hubungi :
    WA : +62812-2222-995
    BBM : BOLAVITA

    Keluaran Togel Singapura Hari Ini yang terbaru

    ReplyDelete