Saturday, November 10, 2007

Statistical Analysis of Keywords in Movie Titles

I felt experimental this evening and wanted to dive into some modules for graphing data in Python. This started out as "hey, that's cool, graphs... those can be useful". Several hours later, it turned into "wow, there's some potential application here, and this is a very cool view of such a large set of data...".

You'll notice in the resulting graph that I tried to choose "common thematic" words, giving us deep insight into the history of action movie naming conventions. I would like to cross-reference these results with the year each movie was made. It would be even more interesting to overlay those dates on a historical timeline of major events to see how heavily movie naming conventions are influenced by current events.

On a more technical note, I generated this graph using the pychart module, along with a few more common ones. Graph and associated code to follow :)

#! /usr/bin/env python
 
#Copyright (C) 2007  Christopher M. Ball (chris.m.ball@gmail.com)
#Originally Posted At: <http://strainthebrain.blogspot.com>
#
#This program is free software: you can redistribute it and/or modify
#it under the terms of the GNU General Public License as published
#by the Free Software Foundation, either version 3 of the License, or
#(at your option) any later version.
#
#This program is distributed in the hope that it will be useful,
#but WITHOUT ANY WARRANTY; without even the implied warranty of
#MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
#the GNU General Public License for more details.
#
#You should have received a copy of the GNU General Public License
#along with this program.  If not, see <http://www.gnu.org/licenses/>.

from pychart import *
import commands, operator

theme.use_color = True
theme.reinitialize()
theme.get_options()

#Runs a case-insensitive grep count for each keyword against the data file
#and returns the results as a list of (keyword, count) pairs
def keyworddata():
   keywords = ['love','sex','war','money','god','murder','kill','mission','journey','freedom']
   keywords += ['happy','blood','steal','travel','fun','sad','power','peace','drug','friend']
   keywords += ['marriage','country','patriot','tax','prison']
   results = []
   
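   #For each keyword, perform a grep count, then form our keyword/value tuple.
   #getstatusoutput returns (exit status, output); with -c the output is the
   #number of matching lines, so a keyword also counts inside longer words.
   for currentWord in keywords:
       ycount = commands.getstatusoutput("grep -i -c " + currentWord + " movielist.txt")
       results.append((currentWord, int(ycount[1])))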
   
   return results                        

#Setting up the tick and axis
tick = tick_mark.Circle(size=10, fill_style=fill_style.red)
xaxis = axis.X(label="Movie Title Keyword")
yaxis = axis.Y(label="Quantity of Action Movies (23,530 total)")

#Before assigning the results from the function, sort the list of tuples by the 2nd item of each pair (the count), descending.
plotpoints = sorted(keyworddata(), key=operator.itemgetter(1), reverse=True)

#Setting up and binding our data with graph settings
ar = area.T(x_coord=category_coord.T(plotpoints, 0),
            x_axis=xaxis, y_axis=yaxis,
            x_grid_interval=20, y_grid_interval=25,
            size=(1000,600), legend=None, y_range=(0, 450))
ar.add_plot(line_plot.T(data = plotpoints, tick_mark=tick))
ar.draw()
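
The counting step in the script shells out to grep once per keyword. As a rough sketch only, and not what the script above actually does, the same kind of count can be done in plain Python. The function names count_matching_lines and keyword_counts and the sample titles are made up for illustration; the real data lives in movielist.txt, whose exact contents are not shown here. Like `grep -i -c`, this counts lines that contain the keyword anywhere, ignoring case, so "war" also matches "Warriors".

```python
def count_matching_lines(lines, keyword):
    """Count lines containing keyword, ignoring case (similar to grep -i -c)."""
    needle = keyword.lower()
    return sum(1 for line in lines if needle in line.lower())


def keyword_counts(lines, keywords):
    """Return (keyword, count) pairs sorted by count, highest first."""
    lines = list(lines)  # keep the titles in memory so each keyword can rescan them
    pairs = [(word, count_matching_lines(lines, word)) for word in keywords]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample titles, for illustration only (not taken from movielist.txt).
sample_titles = [
    "Star Wars",
    "War of the Worlds",
    "The Warriors",
    "Love Actually",
]

print(keyword_counts(sample_titles, ["love", "war", "peace"]))
```

To run it against a real file, something like `keyword_counts(open("movielist.txt"), keywords)` should work, assuming one title per line. The resulting list has the same shape as plotpoints, so it could be handed to the plotting code in its place.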

1 comment:

brighteyes said...

where's the written analysis? i thought you're going to give more info on action movie-naming conventions??