Intro¶
Today we have Martin Smith show us how he used Pandas and Seaborn to solve our Marvel Comics Challenge. Take it away Martin!
Goals¶
These are the goals for this challenge:
- Get the most popular characters based on the number of appearances they made in comics over the years.
- Get the year with most and least new Marvel characters introduced respectively, return a (max_year, min_year) tuple. Expect min/max to be pretty far apart.
- What percentage of the comics characters is female? Please give us the percentage rounded to 2 digits.
- Good vs bad characters: return a dictionary of bad vs good vs neutral characters per sex. The keys are Bad Characters, Good Characters, Neutral Characters, the values are integer percentages. Who plays the villain more often, a man or a woman?
Setup¶
Import libraries and print out list of files in the work folder.
In [1]:
#import libs
import pandas as pd
import seaborn as sns
#Enable inline plots
%matplotlib inline
#List the files in the work folder
from subprocess import check_output
print(check_output(["ls", "../work"]).decode("utf8"))
Business Finances.ipynb
IBNK1117.CSV
Marvel Comics Data.ipynb
marvel-wikia-data.csv
Test.ipynb
Untitled.ipynb
Get Data¶
Import the data from a csv file into a Pandas dataframe and examine.
In [2]:
df = pd.read_csv('../work/marvel-wikia-data.csv')
df.info()
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
page_id 16376 non-null int64
name 16376 non-null object
urlslug 16376 non-null object
ID 12606 non-null object
ALIGN 13564 non-null object
EYE 6609 non-null object
HAIR 12112 non-null object
SEX 15522 non-null object
GSM 90 non-null object
ALIVE 16373 non-null object
APPEARANCES 15280 non-null float64
FIRST APPEARANCE 15561 non-null object
Year 15561 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB
In [3]:
#Show first few rows of the dataframe
df.head()
Out[3]:
page_id | name | urlslug | ID | ALIGN | EYE | HAIR | SEX | GSM | ALIVE | APPEARANCES | FIRST APPEARANCE | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1678 | Spider-Man (Peter Parker) | \/Spider-Man_(Peter_Parker) | Secret Identity | Good Characters | Hazel Eyes | Brown Hair | Male Characters | NaN | Living Characters | 4043.0 | Aug-62 | 1962.0 |
1 | 7139 | Captain America (Steven Rogers) | \/Captain_America_(Steven_Rogers) | Public Identity | Good Characters | Blue Eyes | White Hair | Male Characters | NaN | Living Characters | 3360.0 | Mar-41 | 1941.0 |
2 | 64786 | Wolverine (James \”Logan\” Howlett) | \/Wolverine_(James_%22Logan%22_Howlett) | Public Identity | Neutral Characters | Blue Eyes | Black Hair | Male Characters | NaN | Living Characters | 3061.0 | Oct-74 | 1974.0 |
3 | 1868 | Iron Man (Anthony \”Tony\” Stark) | \/Iron_Man_(Anthony_%22Tony%22_Stark) | Public Identity | Good Characters | Blue Eyes | Black Hair | Male Characters | NaN | Living Characters | 2961.0 | Mar-63 | 1963.0 |
4 | 2460 | Thor (Thor Odinson) | \/Thor_(Thor_Odinson) | No Dual Identity | Good Characters | Blue Eyes | Blond Hair | Male Characters | NaN | Living Characters | 2258.0 | Nov-50 | 1950.0 |
In [4]:
#Describe the numeric values in the dataframe
df.describe()
Out[4]:
page_id | APPEARANCES | Year | |
---|---|---|---|
count | 16376.000000 | 15280.000000 | 15561.000000 |
mean | 300232.082377 | 17.033377 | 1984.951803 |
std | 253460.403399 | 96.372959 | 19.663571 |
min | 1025.000000 | 1.000000 | 1939.000000 |
25% | 28309.500000 | 1.000000 | 1974.000000 |
50% | 282578.000000 | 3.000000 | 1990.000000 |
75% | 509077.000000 | 8.000000 | 2000.000000 |
max | 755278.000000 | 4043.000000 | 2013.000000 |
Cleanse¶
Fix up the data a bit to make it more manageable. I will perform the following tasks to clean the data:
- Drop unused columns
- Make all the column names consistant
- Fill in the empty values
- Clean the values in align, sex and alive. Remove the word Characters from the values.
In [5]:
#The following columns can be dropped as they don't contribute towards the goals of the challenge
df = df.drop(['GSM','urlslug', 'page_id', 'EYE', 'HAIR', 'ID'], axis=1)
#Make all the columns lower case
df.columns = map(str.lower, df.columns)
df.head()
Out[5]:
name | align | sex | alive | appearances | first appearance | year | |
---|---|---|---|---|---|---|---|
0 | Spider-Man (Peter Parker) | Good Characters | Male Characters | Living Characters | 4043.0 | Aug-62 | 1962.0 |
1 | Captain America (Steven Rogers) | Good Characters | Male Characters | Living Characters | 3360.0 | Mar-41 | 1941.0 |
2 | Wolverine (James \”Logan\” Howlett) | Neutral Characters | Male Characters | Living Characters | 3061.0 | Oct-74 | 1974.0 |
3 | Iron Man (Anthony \”Tony\” Stark) | Good Characters | Male Characters | Living Characters | 2961.0 | Mar-63 | 1963.0 |
4 | Thor (Thor Odinson) | Good Characters | Male Characters | Living Characters | 2258.0 | Nov-50 | 1950.0 |
In [6]:
#Check what the unique values for alive are
df.alive.unique()
Out[6]:
array(['Living Characters', 'Deceased Characters', nan], dtype=object)
In [7]:
#Fill empty values for appearances, align and sex
df.appearances = df.appearances.fillna(1)
df['align'] = df['align'].fillna('Unknown')
df.sex = df.sex.fillna('Unknown')
df.info()
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 7 columns):
name 16376 non-null object
align 16376 non-null object
sex 16376 non-null object
alive 16373 non-null object
appearances 16376 non-null float64
first appearance 15561 non-null object
year 15561 non-null float64
dtypes: float64(2), object(5)
memory usage: 895.6+ KB
In [8]:
#Remove the word characters from alive, sex and align
df['alive'] = df['alive'].str.replace(' Characters', '')
df['sex'] = df['sex'].str.replace(' Characters', '')
df['align'] = df['align'].str.replace(' Characters', '')
df.head()
Out[8]:
name | align | sex | alive | appearances | first appearance | year | |
---|---|---|---|---|---|---|---|
0 | Spider-Man (Peter Parker) | Good | Male | Living | 4043.0 | Aug-62 | 1962.0 |
1 | Captain America (Steven Rogers) | Good | Male | Living | 3360.0 | Mar-41 | 1941.0 |
2 | Wolverine (James \”Logan\” Howlett) | Neutral | Male | Living | 3061.0 | Oct-74 | 1974.0 |
3 | Iron Man (Anthony \”Tony\” Stark) | Good | Male | Living | 2961.0 | Mar-63 | 1963.0 |
4 | Thor (Thor Odinson) | Good | Male | Living | 2258.0 | Nov-50 | 1950.0 |
Solve Challenges¶
1. Characters with most appearances¶
In [9]:
#Sort appearances in descending order
newdf = df.sort_values(by=['appearances'], ascending=False).head(10)
#Plot a graph of top 10 characters with most appearances
newdf.plot(kind='bar', x='name', y='appearances')
Out[9]:
2. Years with most and least new characters¶
In [10]:
#Get the year with most and least new Marvel characters introduced respectively, return a (max_year, min_year) tuple. Expect min/max to be pretty far apart.
new = df.groupby(df['year'])['name'].count().reset_index()
minNewChars = new.sort_values(by=['name']).head(1)['year'].iloc[0]
maxNewChars = new.sort_values(by=['name']).tail(1)['year'].iloc[0]
#Print the results
print('Year with the most new characters', int(maxNewChars), 'and the year with the least new characters', int(minNewChars))
Year with the most new characters 1993 and the year with the least new characters 1958
In [11]:
#Plot bar graph of character introductions per year
sns.barplot(x='year', y='name', data=new)
Out[11]:
3. Percentage of female characters¶
In [12]:
#What percentage of the comics characters is female? Please give us the percentage rounded to 2 digits.
sex = df.groupby(by=df['sex'])['name'].count().reset_index(name='count')
sex['percent'] = sex['count'] / sex['count'].sum() * 100
sex['percent'] = sex['percent'].round(2)
sex = sex.set_index('sex')
percentagefemale = sex.at['Female', 'percent']
print('Percentage of female characters', percentagefemale, '%')
Percentage of female characters 23.43 %
4. Good vs Bad Characters¶
In [13]:
#Let's look at the number of female villians in the data set
df.query('align == "Bad" & sex == "Female"')
Out[13]:
name | align | sex | alive | appearances | first appearance | year | |
---|---|---|---|---|---|---|---|
101 | Raven Darkholme (Earth-616) | Bad | Female | Living | 371.0 | Apr-78 | 1978.0 |
112 | Felicia Hardy (Earth-616) | Bad | Female | Living | 332.0 | Jul-79 | 1979.0 |
164 | Amora (Earth-616) | Bad | Female | Living | 247.0 | Apr-64 | 1964.0 |
231 | Ophelia Sarkissian (Earth-616) | Bad | Female | Living | 175.0 | Feb-69 | 1969.0 |
239 | Hela (Earth-616) | Bad | Female | Deceased | 170.0 | Mar-64 | 1964.0 |
265 | Veranke (Earth-616) | Bad | Female | Deceased | 156.0 | Jan-05 | 2005.0 |
276 | Giuletta Nefaria (Earth-616) | Bad | Female | Living | 149.0 | Jan-68 | 1968.0 |
284 | Madelyne Pryor (Earth-616) | Bad | Female | Living | 146.0 | Apr-83 | 1983.0 |
375 | Mary Walker (Earth-616) | Bad | Female | Living | 111.0 | May-88 | 1988.0 |
384 | Mary MacPherran (Earth-616) | Bad | Female | Living | 108.0 | Jul-84 | 1984.0 |
418 | Yuriko Oyama (Earth-616) | Bad | Female | Living | 99.0 | Aug-83 | 1983.0 |
455 | Selene Gallio (Earth-616) | Bad | Female | Living | 92.0 | Nov-83 | 1983.0 |
461 | Sinthea Shmidt (Earth-616) | Bad | Female | Living | 91.0 | Feb-84 | 1984.0 |
477 | Satana Hellstrom (Earth-616) | Bad | Female | Living | 84.0 | Oct-73 | 1973.0 |
491 | Zelda DuBois (Earth-616) | Bad | Female | Living | 81.0 | Mar-65 | 1965.0 |
505 | Maria Callasantos (Earth-616) | Bad | Female | Deceased | 77.0 | Mar-91 | 1991.0 |
556 | Blanche Sitznski (Earth-616) | Bad | Female | Living | 69.0 | Jun-80 | 1980.0 |
569 | Criti Noll (Clone) (Earth-616) | Bad | Female | Deceased | 68.0 | Sep-06 | 2006.0 |
589 | Morgan le Fay (Earth-616) | Bad | Female | Living | 65.0 | May-55 | 1955.0 |
600 | Lily Hollister (Earth-616) | Bad | Female | Living | 64.0 | Jan-08 | 2008.0 |
611 | Vertigo (Savage Land Mutate) (Earth-616) | Bad | Female | Deceased | 63.0 | Mar-82 | 1982.0 |
664 | Philippa Sontag (Earth-616) | Bad | Female | Living | 57.0 | Oct-86 | 1986.0 |
668 | Hera Argeia (Earth-616) | Bad | Female | Living | 56.0 | Apr-49 | 1949.0 |
674 | Nebula (Earth-616) | Bad | Female | Living | 56.0 | Jul-85 | 1985.0 |
740 | Llyra Morris (Earth-616) | Bad | Female | Living | 50.0 | Dec-70 | 1970.0 |
758 | Deidre Wentworth (Earth-616) | Bad | Female | Living | 49.0 | Jul-91 | 1991.0 |
767 | Umar (Earth-616) | Bad | Female | Living | 48.0 | Nov-66 | 1966.0 |
771 | Nekra Sinclair (Earth-616) | Bad | Female | Deceased | 48.0 | Aug-73 | 1973.0 |
779 | Lorelei (Asgardian) (Earth-616) | Bad | Female | Living | 48.0 | Nov-83 | 1983.0 |
782 | Frances Barrison (Earth-616) | Bad | Female | Living | 48.0 | May-93 | 1993.0 |
… | … | … | … | … | … | … | … |
15741 | Fang (Shi’ar Prisoner) (Earth-616) | Bad | Female | Deceased | 1.0 | Sep-93 | 1993.0 |
15742 | H’rpra (Earth-616) | Bad | Female | Deceased | 1.0 | Feb-93 | 1993.0 |
15753 | T.K.O. (Earth-616) | Bad | Female | Living | 1.0 | Jul-93 | 1993.0 |
15755 | Tartessus (Earth-616) | Bad | Female | Living | 1.0 | Mar-93 | 1993.0 |
15773 | Javelynne (Earth-616) | Bad | Female | Living | 1.0 | Jan-94 | 1994.0 |
15780 | May Reilly (Impostor) (Earth-616) | Bad | Female | Deceased | 1.0 | Aug-94 | 1994.0 |
15782 | Nirvana (Deviant) (Earth-616) | Bad | Female | Living | 1.0 | Jul-94 | 1994.0 |
15826 | Alchemist (Earth-616) | Bad | Female | Living | 1.0 | Sep-96 | 1996.0 |
15832 | Catherine D’Antan (Earth-616) | Bad | Female | Living | 1.0 | May-96 | 1996.0 |
15849 | Kristine Calverly (Earth-616) | Bad | Female | Living | 1.0 | Dec-96 | 1996.0 |
15856 | Pepper (Salt & Pepper) (Earth-616) | Bad | Female | Living | 1.0 | May-96 | 1996.0 |
15873 | Kylie Kopelkin (Earth-616) | Bad | Female | Living | 1.0 | Jan-97 | 1997.0 |
15874 | Megan O’Toole (Earth-616) | Bad | Female | Living | 1.0 | Feb-97 | 1997.0 |
15875 | Molly Peterson (Earth-616) | Bad | Female | Living | 1.0 | Jan-97 | 1997.0 |
15876 | Ra’al (Earth-616) | Bad | Female | Living | 1.0 | Mar-97 | 1997.0 |
15938 | Grace & Mary Mercy (Earth-616) | Bad | Female | Deceased | 1.0 | May-01 | 2001.0 |
16053 | Ch’rith (Earth-616) | Bad | Female | Deceased | 1.0 | Nov-08 | 2008.0 |
16056 | Fry’lu (Earth-616) | Bad | Female | Living | 1.0 | Oct-08 | 2008.0 |
16143 | Zarda Shelton (Zombie Clone) (Earth-616) | Bad | Female | Deceased | 1.0 | May-11 | 2011.0 |
16148 | Da’o Coy Manh (Earth-616) | Bad | Female | Deceased | 1.0 | Jun-12 | 2012.0 |
16159 | Libra (Thanos’ Zodiac) (Earth-616) | Bad | Female | Deceased | 1.0 | May-12 | 2012.0 |
16170 | Virgo (Thanos’ Zodiac) (Earth-616) | Bad | Female | Deceased | 1.0 | May-12 | 2012.0 |
16173 | Katy & Timothy Bashir (Earth-616) | Bad | Female | Deceased | 1.0 | Feb-13 | 2013.0 |
16253 | Illumination (Earth-616) | Bad | Female | Living | 1.0 | NaN | NaN |
16283 | Malefactor (Earth-616) | Bad | Female | Living | 1.0 | NaN | NaN |
16297 | Mrs. Stane (Earth-616) | Bad | Female | Deceased | 1.0 | NaN | NaN |
16331 | Shivoor (Earth-616) | Bad | Female | Living | 1.0 | NaN | NaN |
16336 | Silhouette (Hand) (Earth-616) | Bad | Female | Living | 1.0 | NaN | NaN |
16358 | Zora Loftus (Earth-616) | Bad | Female | Living | 1.0 | NaN | NaN |
16362 | Farbauti (Earth-616) | Bad | Female | Living | 1.0 | NaN | NaN |
976 rows × 7 columns
In [14]:
#Group by alignment and sex and do a count for each group
goodvbad = df.groupby(by=['align', 'sex'])['name'].count().reset_index(name='count')
#Create a percentage column
goodvbad['percent'] = goodvbad['count'] / goodvbad['count'].sum() * 100
goodvbad['percent'] = goodvbad['percent'].round(2)
#Add an index so that I can select rows for specific groupings
goodvbad['index'] = goodvbad['align'].map(str) + goodvbad['sex']
goodvbad = goodvbad.set_index('index')
goodvbad['alignsex'] = goodvbad.index
goodvbad
Out[14]:
align | sex | count | percent | alignsex | |
---|---|---|---|---|---|
index | |||||
BadAgender | Bad | Agender | 20 | 0.12 | BadAgender |
BadFemale | Bad | Female | 976 | 5.96 | BadFemale |
BadMale | Bad | Male | 5338 | 32.60 | BadMale |
BadUnknown | Bad | Unknown | 386 | 2.36 | BadUnknown |
GoodAgender | Good | Agender | 10 | 0.06 | GoodAgender |
GoodFemale | Good | Female | 1537 | 9.39 | GoodFemale |
GoodGenderfluid | Good | Genderfluid | 1 | 0.01 | GoodGenderfluid |
GoodMale | Good | Male | 2966 | 18.11 | GoodMale |
GoodUnknown | Good | Unknown | 122 | 0.74 | GoodUnknown |
NeutralAgender | Neutral | Agender | 13 | 0.08 | NeutralAgender |
NeutralFemale | Neutral | Female | 640 | 3.91 | NeutralFemale |
NeutralGenderfluid | Neutral | Genderfluid | 1 | 0.01 | NeutralGenderfluid |
NeutralMale | Neutral | Male | 1440 | 8.79 | NeutralMale |
NeutralUnknown | Neutral | Unknown | 114 | 0.70 | NeutralUnknown |
UnknownAgender | Unknown | Agender | 2 | 0.01 | UnknownAgender |
UnknownFemale | Unknown | Female | 684 | 4.18 | UnknownFemale |
UnknownMale | Unknown | Male | 1894 | 11.57 | UnknownMale |
UnknownUnknown | Unknown | Unknown | 232 | 1.42 | UnknownUnknown |
In [15]:
#Create a chart of the distribution of alignment and sex
sns.barplot(x='percent', y='alignsex', data=goodvbad)
Out[15]:
In [16]:
#Show the percentage of bad males
males = goodvbad[['sex', 'align', 'count']].copy()
males = males.query('sex == "Male" & align != "Unknown"')
males['percent'] = males['count'] / males['count'].sum() * 100
males['percent'] = males['percent'].round()
males
Out[16]:
sex | align | count | percent | |
---|---|---|---|---|
index | ||||
BadMale | Male | Bad | 5338 | 55.0 |
GoodMale | Male | Good | 2966 | 30.0 |
NeutralMale | Male | Neutral | 1440 | 15.0 |
In [17]:
percentagebadmales = males.at['BadMale', 'percent']
print('Percentage of Bad Males: ', int(percentagebadmales))
Percentage of Bad Males: 55
In [18]:
females = goodvbad[['sex', 'align', 'count']].copy()
females = females.query('sex == "Female" & align != "Unknown"')
females['percent'] = females['count'] / females['count'].sum() * 100
females['percent'] = females['percent'].round()
females
Out[18]:
sex | align | count | percent | |
---|---|---|---|---|
index | ||||
BadFemale | Female | Bad | 976 | 31.0 |
GoodFemale | Female | Good | 1537 | 49.0 |
NeutralFemale | Female | Neutral | 640 | 20.0 |
In [19]: