Using Pandas and Seaborn to solve PyBites Marvel Challenge

By on 7 December 2017

Intro

Today we have Martin Smith show us how he used Pandas and Seaborn to solve our Marvel Comics Challenge. Take it away Martin!

Goals

These are the goals for this challenge:

  1. Get the most popular characters based on the number of appearances they made in comics over the years.
  2. Get the year with most and least new Marvel characters introduced respectively, return a (max_year, min_year) tuple. Expect min/max to be pretty far apart.
  3. What percentage of the comics characters is female? Please give us the percentage rounded to 2 digits.
  4. Good vs bad characters: return a dictionary of bad vs good vs neutral characters per sex. The keys are Bad Characters, Good Characters, Neutral Characters, the values are integer percentages. Who plays the villain more often, a man or a woman?

Setup

Import libraries and print out list of files in the work folder.

In [1]:

#import libs
import pandas as pd
import seaborn as sns

#Enable inline plots
%matplotlib inline

#List the files in the work folder
from subprocess import check_output
print(check_output(["ls", "../work"]).decode("utf8"))
Business Finances.ipynb
IBNK1117.CSV
Marvel Comics Data.ipynb
marvel-wikia-data.csv
Test.ipynb
Untitled.ipynb

Get Data

Import the data from a csv file into a Pandas dataframe and examine.

In [2]:

df = pd.read_csv('../work/marvel-wikia-data.csv')
df.info()

RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
page_id             16376 non-null int64
name                16376 non-null object
urlslug             16376 non-null object
ID                  12606 non-null object
ALIGN               13564 non-null object
EYE                 6609 non-null object
HAIR                12112 non-null object
SEX                 15522 non-null object
GSM                 90 non-null object
ALIVE               16373 non-null object
APPEARANCES         15280 non-null float64
FIRST APPEARANCE    15561 non-null object
Year                15561 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB
In [3]:

#Show first few rows of the dataframe
df.head()
Out[3]:

page_id name urlslug ID ALIGN EYE HAIR SEX GSM ALIVE APPEARANCES FIRST APPEARANCE Year
0 1678 Spider-Man (Peter Parker) \/Spider-Man_(Peter_Parker) Secret Identity Good Characters Hazel Eyes Brown Hair Male Characters NaN Living Characters 4043.0 Aug-62 1962.0
1 7139 Captain America (Steven Rogers) \/Captain_America_(Steven_Rogers) Public Identity Good Characters Blue Eyes White Hair Male Characters NaN Living Characters 3360.0 Mar-41 1941.0
2 64786 Wolverine (James \”Logan\” Howlett) \/Wolverine_(James_%22Logan%22_Howlett) Public Identity Neutral Characters Blue Eyes Black Hair Male Characters NaN Living Characters 3061.0 Oct-74 1974.0
3 1868 Iron Man (Anthony \”Tony\” Stark) \/Iron_Man_(Anthony_%22Tony%22_Stark) Public Identity Good Characters Blue Eyes Black Hair Male Characters NaN Living Characters 2961.0 Mar-63 1963.0
4 2460 Thor (Thor Odinson) \/Thor_(Thor_Odinson) No Dual Identity Good Characters Blue Eyes Blond Hair Male Characters NaN Living Characters 2258.0 Nov-50 1950.0
In [4]:

#Describe the numeric values in the dataframe
df.describe()
Out[4]:

page_id APPEARANCES Year
count 16376.000000 15280.000000 15561.000000
mean 300232.082377 17.033377 1984.951803
std 253460.403399 96.372959 19.663571
min 1025.000000 1.000000 1939.000000
25% 28309.500000 1.000000 1974.000000
50% 282578.000000 3.000000 1990.000000
75% 509077.000000 8.000000 2000.000000
max 755278.000000 4043.000000 2013.000000

Cleanse

Fix up the data a bit to make it more manageable. I will perform the following tasks to clean the data:

  • Drop unused columns
  • Make all the column names consistant
  • Fill in the empty values
  • Clean the values in align, sex and alive. Remove the word Characters from the values.
In [5]:

#The following columns can be dropped as they don't contribute towards the goals of the challenge
df = df.drop(['GSM','urlslug', 'page_id', 'EYE', 'HAIR', 'ID'], axis=1)
#Make all the columns lower case
df.columns = map(str.lower, df.columns)
df.head()
Out[5]:

name align sex alive appearances first appearance year
0 Spider-Man (Peter Parker) Good Characters Male Characters Living Characters 4043.0 Aug-62 1962.0
1 Captain America (Steven Rogers) Good Characters Male Characters Living Characters 3360.0 Mar-41 1941.0
2 Wolverine (James \”Logan\” Howlett) Neutral Characters Male Characters Living Characters 3061.0 Oct-74 1974.0
3 Iron Man (Anthony \”Tony\” Stark) Good Characters Male Characters Living Characters 2961.0 Mar-63 1963.0
4 Thor (Thor Odinson) Good Characters Male Characters Living Characters 2258.0 Nov-50 1950.0
In [6]:

#Check what the unique values for alive are
df.alive.unique()
Out[6]:

array(['Living Characters', 'Deceased Characters', nan], dtype=object)
In [7]:

#Fill empty values for appearances, align and sex
df.appearances = df.appearances.fillna(1)
df['align'] = df['align'].fillna('Unknown')
df.sex = df.sex.fillna('Unknown')
df.info()

RangeIndex: 16376 entries, 0 to 16375
Data columns (total 7 columns):
name                16376 non-null object
align               16376 non-null object
sex                 16376 non-null object
alive               16373 non-null object
appearances         16376 non-null float64
first appearance    15561 non-null object
year                15561 non-null float64
dtypes: float64(2), object(5)
memory usage: 895.6+ KB
In [8]:

#Remove the word characters from alive, sex and align
df['alive'] = df['alive'].str.replace(' Characters', '')
df['sex'] = df['sex'].str.replace(' Characters', '')
df['align'] = df['align'].str.replace(' Characters', '')
df.head()
Out[8]:

name align sex alive appearances first appearance year
0 Spider-Man (Peter Parker) Good Male Living 4043.0 Aug-62 1962.0
1 Captain America (Steven Rogers) Good Male Living 3360.0 Mar-41 1941.0
2 Wolverine (James \”Logan\” Howlett) Neutral Male Living 3061.0 Oct-74 1974.0
3 Iron Man (Anthony \”Tony\” Stark) Good Male Living 2961.0 Mar-63 1963.0
4 Thor (Thor Odinson) Good Male Living 2258.0 Nov-50 1950.0

Solve Challenges

1. Characters with most appearances

In [9]:

#Sort appearances in descending order
newdf = df.sort_values(by=['appearances'], ascending=False).head(10)
#Plot a graph of top 10 characters with most appearances
newdf.plot(kind='bar', x='name', y='appearances')
Out[9]:

weSsFaSjMhwRwAAAABJRU5ErkJggg==

2. Years with most and least new characters

In [10]:

#Get the year with most and least new Marvel characters introduced respectively, return a (max_year, min_year) tuple. Expect min/max to be pretty far apart.
new = df.groupby(df['year'])['name'].count().reset_index()
minNewChars = new.sort_values(by=['name']).head(1)['year'].iloc[0]
maxNewChars = new.sort_values(by=['name']).tail(1)['year'].iloc[0]
#Print the results
print('Year with the most new characters', int(maxNewChars), 'and the year with the least new characters', int(minNewChars))
Year with the most new characters 1993 and the year with the least new characters 1958
In [11]:

#Plot bar graph of character introductions per year
sns.barplot(x='year', y='name', data=new)
Out[11]:

HwCp+rThvwmAAAAA
AElFTkSuQmCC

3. Percentage of female characters

In [12]:

#What percentage of the comics characters is female? Please give us the percentage rounded to 2 digits.
sex = df.groupby(by=df['sex'])['name'].count().reset_index(name='count')
sex['percent'] = sex['count'] / sex['count'].sum() * 100
sex['percent'] = sex['percent'].round(2)
sex = sex.set_index('sex')
percentagefemale = sex.at['Female', 'percent']
print('Percentage of female characters', percentagefemale, '%')
Percentage of female characters 23.43 %

4. Good vs Bad Characters

In [13]:

#Let's look at the number of female villians in the data set
df.query('align == "Bad" & sex == "Female"')
Out[13]:

name align sex alive appearances first appearance year
101 Raven Darkholme (Earth-616) Bad Female Living 371.0 Apr-78 1978.0
112 Felicia Hardy (Earth-616) Bad Female Living 332.0 Jul-79 1979.0
164 Amora (Earth-616) Bad Female Living 247.0 Apr-64 1964.0
231 Ophelia Sarkissian (Earth-616) Bad Female Living 175.0 Feb-69 1969.0
239 Hela (Earth-616) Bad Female Deceased 170.0 Mar-64 1964.0
265 Veranke (Earth-616) Bad Female Deceased 156.0 Jan-05 2005.0
276 Giuletta Nefaria (Earth-616) Bad Female Living 149.0 Jan-68 1968.0
284 Madelyne Pryor (Earth-616) Bad Female Living 146.0 Apr-83 1983.0
375 Mary Walker (Earth-616) Bad Female Living 111.0 May-88 1988.0
384 Mary MacPherran (Earth-616) Bad Female Living 108.0 Jul-84 1984.0
418 Yuriko Oyama (Earth-616) Bad Female Living 99.0 Aug-83 1983.0
455 Selene Gallio (Earth-616) Bad Female Living 92.0 Nov-83 1983.0
461 Sinthea Shmidt (Earth-616) Bad Female Living 91.0 Feb-84 1984.0
477 Satana Hellstrom (Earth-616) Bad Female Living 84.0 Oct-73 1973.0
491 Zelda DuBois (Earth-616) Bad Female Living 81.0 Mar-65 1965.0
505 Maria Callasantos (Earth-616) Bad Female Deceased 77.0 Mar-91 1991.0
556 Blanche Sitznski (Earth-616) Bad Female Living 69.0 Jun-80 1980.0
569 Criti Noll (Clone) (Earth-616) Bad Female Deceased 68.0 Sep-06 2006.0
589 Morgan le Fay (Earth-616) Bad Female Living 65.0 May-55 1955.0
600 Lily Hollister (Earth-616) Bad Female Living 64.0 Jan-08 2008.0
611 Vertigo (Savage Land Mutate) (Earth-616) Bad Female Deceased 63.0 Mar-82 1982.0
664 Philippa Sontag (Earth-616) Bad Female Living 57.0 Oct-86 1986.0
668 Hera Argeia (Earth-616) Bad Female Living 56.0 Apr-49 1949.0
674 Nebula (Earth-616) Bad Female Living 56.0 Jul-85 1985.0
740 Llyra Morris (Earth-616) Bad Female Living 50.0 Dec-70 1970.0
758 Deidre Wentworth (Earth-616) Bad Female Living 49.0 Jul-91 1991.0
767 Umar (Earth-616) Bad Female Living 48.0 Nov-66 1966.0
771 Nekra Sinclair (Earth-616) Bad Female Deceased 48.0 Aug-73 1973.0
779 Lorelei (Asgardian) (Earth-616) Bad Female Living 48.0 Nov-83 1983.0
782 Frances Barrison (Earth-616) Bad Female Living 48.0 May-93 1993.0
15741 Fang (Shi’ar Prisoner) (Earth-616) Bad Female Deceased 1.0 Sep-93 1993.0
15742 H’rpra (Earth-616) Bad Female Deceased 1.0 Feb-93 1993.0
15753 T.K.O. (Earth-616) Bad Female Living 1.0 Jul-93 1993.0
15755 Tartessus (Earth-616) Bad Female Living 1.0 Mar-93 1993.0
15773 Javelynne (Earth-616) Bad Female Living 1.0 Jan-94 1994.0
15780 May Reilly (Impostor) (Earth-616) Bad Female Deceased 1.0 Aug-94 1994.0
15782 Nirvana (Deviant) (Earth-616) Bad Female Living 1.0 Jul-94 1994.0
15826 Alchemist (Earth-616) Bad Female Living 1.0 Sep-96 1996.0
15832 Catherine D’Antan (Earth-616) Bad Female Living 1.0 May-96 1996.0
15849 Kristine Calverly (Earth-616) Bad Female Living 1.0 Dec-96 1996.0
15856 Pepper (Salt & Pepper) (Earth-616) Bad Female Living 1.0 May-96 1996.0
15873 Kylie Kopelkin (Earth-616) Bad Female Living 1.0 Jan-97 1997.0
15874 Megan O’Toole (Earth-616) Bad Female Living 1.0 Feb-97 1997.0
15875 Molly Peterson (Earth-616) Bad Female Living 1.0 Jan-97 1997.0
15876 Ra’al (Earth-616) Bad Female Living 1.0 Mar-97 1997.0
15938 Grace & Mary Mercy (Earth-616) Bad Female Deceased 1.0 May-01 2001.0
16053 Ch’rith (Earth-616) Bad Female Deceased 1.0 Nov-08 2008.0
16056 Fry’lu (Earth-616) Bad Female Living 1.0 Oct-08 2008.0
16143 Zarda Shelton (Zombie Clone) (Earth-616) Bad Female Deceased 1.0 May-11 2011.0
16148 Da’o Coy Manh (Earth-616) Bad Female Deceased 1.0 Jun-12 2012.0
16159 Libra (Thanos’ Zodiac) (Earth-616) Bad Female Deceased 1.0 May-12 2012.0
16170 Virgo (Thanos’ Zodiac) (Earth-616) Bad Female Deceased 1.0 May-12 2012.0
16173 Katy & Timothy Bashir (Earth-616) Bad Female Deceased 1.0 Feb-13 2013.0
16253 Illumination (Earth-616) Bad Female Living 1.0 NaN NaN
16283 Malefactor (Earth-616) Bad Female Living 1.0 NaN NaN
16297 Mrs. Stane (Earth-616) Bad Female Deceased 1.0 NaN NaN
16331 Shivoor (Earth-616) Bad Female Living 1.0 NaN NaN
16336 Silhouette (Hand) (Earth-616) Bad Female Living 1.0 NaN NaN
16358 Zora Loftus (Earth-616) Bad Female Living 1.0 NaN NaN
16362 Farbauti (Earth-616) Bad Female Living 1.0 NaN NaN

976 rows × 7 columns

In [14]:

#Group by alignment and sex and do a count for each group
goodvbad = df.groupby(by=['align', 'sex'])['name'].count().reset_index(name='count')
#Create a percentage column
goodvbad['percent'] = goodvbad['count'] / goodvbad['count'].sum() * 100
goodvbad['percent'] = goodvbad['percent'].round(2)
#Add an index so that I can select rows for specific groupings
goodvbad['index'] = goodvbad['align'].map(str) + goodvbad['sex']
goodvbad = goodvbad.set_index('index')
goodvbad['alignsex'] = goodvbad.index
goodvbad
Out[14]:

align sex count percent alignsex
index
BadAgender Bad Agender 20 0.12 BadAgender
BadFemale Bad Female 976 5.96 BadFemale
BadMale Bad Male 5338 32.60 BadMale
BadUnknown Bad Unknown 386 2.36 BadUnknown
GoodAgender Good Agender 10 0.06 GoodAgender
GoodFemale Good Female 1537 9.39 GoodFemale
GoodGenderfluid Good Genderfluid 1 0.01 GoodGenderfluid
GoodMale Good Male 2966 18.11 GoodMale
GoodUnknown Good Unknown 122 0.74 GoodUnknown
NeutralAgender Neutral Agender 13 0.08 NeutralAgender
NeutralFemale Neutral Female 640 3.91 NeutralFemale
NeutralGenderfluid Neutral Genderfluid 1 0.01 NeutralGenderfluid
NeutralMale Neutral Male 1440 8.79 NeutralMale
NeutralUnknown Neutral Unknown 114 0.70 NeutralUnknown
UnknownAgender Unknown Agender 2 0.01 UnknownAgender
UnknownFemale Unknown Female 684 4.18 UnknownFemale
UnknownMale Unknown Male 1894 11.57 UnknownMale
UnknownUnknown Unknown Unknown 232 1.42 UnknownUnknown
In [15]:

#Create a chart of the distribution of alignment and sex
sns.barplot(x='percent', y='alignsex', data=goodvbad)
Out[15]:

oIPyHm5h8AAAAASUVORK5CYII=

In [16]:

#Show the percentage of bad males
males = goodvbad[['sex', 'align', 'count']].copy()
males = males.query('sex == "Male" & align != "Unknown"')
males['percent'] = males['count'] / males['count'].sum() * 100
males['percent'] = males['percent'].round()
males
Out[16]:

sex align count percent
index
BadMale Male Bad 5338 55.0
GoodMale Male Good 2966 30.0
NeutralMale Male Neutral 1440 15.0
In [17]:

percentagebadmales = males.at['BadMale', 'percent']
print('Percentage of Bad Males: ', int(percentagebadmales))
Percentage of Bad Males:  55
In [18]:

females = goodvbad[['sex', 'align', 'count']].copy()
females = females.query('sex == "Female" & align != "Unknown"')
females['percent'] = females['count'] / females['count'].sum() * 100
females['percent'] = females['percent'].round()
females
Out[18]:

sex align count percent
index
BadFemale Female Bad 976 31.0
GoodFemale Female Good 1537 49.0
NeutralFemale Female Neutral 640 20.0
In [19]:

#Show the percentage of bad females
percentagebadfemales = females.at['BadFemale', 'percent']
print('Percentage of Bad Females: ', int(percentagebadfemales))
Percentage of Bad Females:  31

It would seem that the villain is more often a male than female. From the chart above we can also see that there are more bad male characters than any other kind.

Keep Calm and Code in Python!

Martin



Want a career as a Python Developer but not sure where to start?