Intro¶

Today we have Martin Smith show us how he used Pandas and Seaborn to solve our Marvel Comics Challenge. Take it away Martin!

Goals¶

These are the goals for this challenge:

Get the most popular characters based on the number of appearances they made in comics over the years.
Get the year with most and least new Marvel characters introduced respectively, return a (max_year, min_year) tuple. Expect min/max to be pretty far apart.
What percentage of the comics characters is female? Please give us the percentage rounded to 2 digits.
Good vs bad characters: return a dictionary of bad vs good vs neutral characters per sex. The keys are Bad Characters, Good Characters, Neutral Characters, the values are integer percentages. Who plays the villain more often, a man or a woman?

Setup¶

Import libraries and print out list of files in the work folder.

In [1]:

#import libs
import pandas as pd
import seaborn as sns

#Enable inline plots
%matplotlib inline

#List the files in the work folder
from subprocess import check_output
print(check_output(["ls", "../work"]).decode("utf8"))

Business Finances.ipynb
IBNK1117.CSV
Marvel Comics Data.ipynb
marvel-wikia-data.csv
Test.ipynb
Untitled.ipynb

Get Data¶

Import the data from a csv file into a Pandas dataframe and examine.

In [2]:

df = pd.read_csv('../work/marvel-wikia-data.csv')
df.info()


RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
page_id             16376 non-null int64
name                16376 non-null object
urlslug             16376 non-null object
ID                  12606 non-null object
ALIGN               13564 non-null object
EYE                 6609 non-null object
HAIR                12112 non-null object
SEX                 15522 non-null object
GSM                 90 non-null object
ALIVE               16373 non-null object
APPEARANCES         15280 non-null float64
FIRST APPEARANCE    15561 non-null object
Year                15561 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB

In [3]:

#Show first few rows of the dataframe
df.head()

Out[3]:

	page_id	name	urlslug	ID	ALIGN	EYE	HAIR	SEX	GSM	ALIVE	APPEARANCES	FIRST APPEARANCE	Year
0	1678	Spider-Man (Peter Parker)	\/Spider-Man_(Peter_Parker)	Secret Identity	Good Characters	Hazel Eyes	Brown Hair	Male Characters	NaN	Living Characters	4043.0	Aug-62	1962.0
1	7139	Captain America (Steven Rogers)	\/Captain_America_(Steven_Rogers)	Public Identity	Good Characters	Blue Eyes	White Hair	Male Characters	NaN	Living Characters	3360.0	Mar-41	1941.0
2	64786	Wolverine (James \”Logan\” Howlett)	\/Wolverine_(James_%22Logan%22_Howlett)	Public Identity	Neutral Characters	Blue Eyes	Black Hair	Male Characters	NaN	Living Characters	3061.0	Oct-74	1974.0
3	1868	Iron Man (Anthony \”Tony\” Stark)	\/Iron_Man_(Anthony_%22Tony%22_Stark)	Public Identity	Good Characters	Blue Eyes	Black Hair	Male Characters	NaN	Living Characters	2961.0	Mar-63	1963.0
4	2460	Thor (Thor Odinson)	\/Thor_(Thor_Odinson)	No Dual Identity	Good Characters	Blue Eyes	Blond Hair	Male Characters	NaN	Living Characters	2258.0	Nov-50	1950.0

In [4]:

#Describe the numeric values in the dataframe
df.describe()

Out[4]:

	page_id	APPEARANCES	Year
count	16376.000000	15280.000000	15561.000000
mean	300232.082377	17.033377	1984.951803
std	253460.403399	96.372959	19.663571
min	1025.000000	1.000000	1939.000000
25%	28309.500000	1.000000	1974.000000
50%	282578.000000	3.000000	1990.000000
75%	509077.000000	8.000000	2000.000000
max	755278.000000	4043.000000	2013.000000

Cleanse¶

Fix up the data a bit to make it more manageable. I will perform the following tasks to clean the data:

Drop unused columns
Make all the column names consistant
Fill in the empty values
Clean the values in align, sex and alive. Remove the word Characters from the values.

In [5]:

#The following columns can be dropped as they don't contribute towards the goals of the challenge
df = df.drop(['GSM','urlslug', 'page_id', 'EYE', 'HAIR', 'ID'], axis=1)
#Make all the columns lower case
df.columns = map(str.lower, df.columns)
df.head()

Out[5]:

	name	align	sex	alive	appearances	first appearance	year
0	Spider-Man (Peter Parker)	Good Characters	Male Characters	Living Characters	4043.0	Aug-62	1962.0
1	Captain America (Steven Rogers)	Good Characters	Male Characters	Living Characters	3360.0	Mar-41	1941.0
2	Wolverine (James \”Logan\” Howlett)	Neutral Characters	Male Characters	Living Characters	3061.0	Oct-74	1974.0
3	Iron Man (Anthony \”Tony\” Stark)	Good Characters	Male Characters	Living Characters	2961.0	Mar-63	1963.0
4	Thor (Thor Odinson)	Good Characters	Male Characters	Living Characters	2258.0	Nov-50	1950.0

In [6]:

#Check what the unique values for alive are
df.alive.unique()

Out[6]:

array(['Living Characters', 'Deceased Characters', nan], dtype=object)

In [7]:

#Fill empty values for appearances, align and sex
df.appearances = df.appearances.fillna(1)
df['align'] = df['align'].fillna('Unknown')
df.sex = df.sex.fillna('Unknown')
df.info()


RangeIndex: 16376 entries, 0 to 16375
Data columns (total 7 columns):
name                16376 non-null object
align               16376 non-null object
sex                 16376 non-null object
alive               16373 non-null object
appearances         16376 non-null float64
first appearance    15561 non-null object
year                15561 non-null float64
dtypes: float64(2), object(5)
memory usage: 895.6+ KB

In [8]:

#Remove the word characters from alive, sex and align
df['alive'] = df['alive'].str.replace(' Characters', '')
df['sex'] = df['sex'].str.replace(' Characters', '')
df['align'] = df['align'].str.replace(' Characters', '')
df.head()

Out[8]:

	name	align	sex	alive	appearances	first appearance	year
0	Spider-Man (Peter Parker)	Good	Male	Living	4043.0	Aug-62	1962.0
1	Captain America (Steven Rogers)	Good	Male	Living	3360.0	Mar-41	1941.0
2	Wolverine (James \”Logan\” Howlett)	Neutral	Male	Living	3061.0	Oct-74	1974.0
3	Iron Man (Anthony \”Tony\” Stark)	Good	Male	Living	2961.0	Mar-63	1963.0
4	Thor (Thor Odinson)	Good	Male	Living	2258.0	Nov-50	1950.0

Solve Challenges¶

1. Characters with most appearances¶

In [9]:

#Sort appearances in descending order
newdf = df.sort_values(by=['appearances'], ascending=False).head(10)
#Plot a graph of top 10 characters with most appearances
newdf.plot(kind='bar', x='name', y='appearances')

Out[9]:

2. Years with most and least new characters¶

In [10]:

#Get the year with most and least new Marvel characters introduced respectively, return a (max_year, min_year) tuple. Expect min/max to be pretty far apart.
new = df.groupby(df['year'])['name'].count().reset_index()
minNewChars = new.sort_values(by=['name']).head(1)['year'].iloc[0]
maxNewChars = new.sort_values(by=['name']).tail(1)['year'].iloc[0]
#Print the results
print('Year with the most new characters', int(maxNewChars), 'and the year with the least new characters', int(minNewChars))

Year with the most new characters 1993 and the year with the least new characters 1958

In [11]:

#Plot bar graph of character introductions per year
sns.barplot(x='year', y='name', data=new)

Out[11]:

3. Percentage of female characters¶

In [12]:

#What percentage of the comics characters is female? Please give us the percentage rounded to 2 digits.
sex = df.groupby(by=df['sex'])['name'].count().reset_index(name='count')
sex['percent'] = sex['count'] / sex['count'].sum() * 100
sex['percent'] = sex['percent'].round(2)
sex = sex.set_index('sex')
percentagefemale = sex.at['Female', 'percent']
print('Percentage of female characters', percentagefemale, '%')

Percentage of female characters 23.43 %

4. Good vs Bad Characters¶

In [13]:

#Let's look at the number of female villians in the data set
df.query('align == "Bad" & sex == "Female"')

Out[13]:

	name	align	sex	alive	appearances	first appearance	year
101	Raven Darkholme (Earth-616)	Bad	Female	Living	371.0	Apr-78	1978.0
112	Felicia Hardy (Earth-616)	Bad	Female	Living	332.0	Jul-79	1979.0
164	Amora (Earth-616)	Bad	Female	Living	247.0	Apr-64	1964.0
231	Ophelia Sarkissian (Earth-616)	Bad	Female	Living	175.0	Feb-69	1969.0
239	Hela (Earth-616)	Bad	Female	Deceased	170.0	Mar-64	1964.0
265	Veranke (Earth-616)	Bad	Female	Deceased	156.0	Jan-05	2005.0
276	Giuletta Nefaria (Earth-616)	Bad	Female	Living	149.0	Jan-68	1968.0
284	Madelyne Pryor (Earth-616)	Bad	Female	Living	146.0	Apr-83	1983.0
375	Mary Walker (Earth-616)	Bad	Female	Living	111.0	May-88	1988.0
384	Mary MacPherran (Earth-616)	Bad	Female	Living	108.0	Jul-84	1984.0
418	Yuriko Oyama (Earth-616)	Bad	Female	Living	99.0	Aug-83	1983.0
455	Selene Gallio (Earth-616)	Bad	Female	Living	92.0	Nov-83	1983.0
461	Sinthea Shmidt (Earth-616)	Bad	Female	Living	91.0	Feb-84	1984.0
477	Satana Hellstrom (Earth-616)	Bad	Female	Living	84.0	Oct-73	1973.0
491	Zelda DuBois (Earth-616)	Bad	Female	Living	81.0	Mar-65	1965.0
505	Maria Callasantos (Earth-616)	Bad	Female	Deceased	77.0	Mar-91	1991.0
556	Blanche Sitznski (Earth-616)	Bad	Female	Living	69.0	Jun-80	1980.0
569	Criti Noll (Clone) (Earth-616)	Bad	Female	Deceased	68.0	Sep-06	2006.0
589	Morgan le Fay (Earth-616)	Bad	Female	Living	65.0	May-55	1955.0
600	Lily Hollister (Earth-616)	Bad	Female	Living	64.0	Jan-08	2008.0
611	Vertigo (Savage Land Mutate) (Earth-616)	Bad	Female	Deceased	63.0	Mar-82	1982.0
664	Philippa Sontag (Earth-616)	Bad	Female	Living	57.0	Oct-86	1986.0
668	Hera Argeia (Earth-616)	Bad	Female	Living	56.0	Apr-49	1949.0
674	Nebula (Earth-616)	Bad	Female	Living	56.0	Jul-85	1985.0
740	Llyra Morris (Earth-616)	Bad	Female	Living	50.0	Dec-70	1970.0
758	Deidre Wentworth (Earth-616)	Bad	Female	Living	49.0	Jul-91	1991.0
767	Umar (Earth-616)	Bad	Female	Living	48.0	Nov-66	1966.0
771	Nekra Sinclair (Earth-616)	Bad	Female	Deceased	48.0	Aug-73	1973.0
779	Lorelei (Asgardian) (Earth-616)	Bad	Female	Living	48.0	Nov-83	1983.0
782	Frances Barrison (Earth-616)	Bad	Female	Living	48.0	May-93	1993.0
…	…	…	…	…	…	…	…
15741	Fang (Shi’ar Prisoner) (Earth-616)	Bad	Female	Deceased	1.0	Sep-93	1993.0
15742	H’rpra (Earth-616)	Bad	Female	Deceased	1.0	Feb-93	1993.0
15753	T.K.O. (Earth-616)	Bad	Female	Living	1.0	Jul-93	1993.0
15755	Tartessus (Earth-616)	Bad	Female	Living	1.0	Mar-93	1993.0
15773	Javelynne (Earth-616)	Bad	Female	Living	1.0	Jan-94	1994.0
15780	May Reilly (Impostor) (Earth-616)	Bad	Female	Deceased	1.0	Aug-94	1994.0
15782	Nirvana (Deviant) (Earth-616)	Bad	Female	Living	1.0	Jul-94	1994.0
15826	Alchemist (Earth-616)	Bad	Female	Living	1.0	Sep-96	1996.0
15832	Catherine D’Antan (Earth-616)	Bad	Female	Living	1.0	May-96	1996.0
15849	Kristine Calverly (Earth-616)	Bad	Female	Living	1.0	Dec-96	1996.0
15856	Pepper (Salt & Pepper) (Earth-616)	Bad	Female	Living	1.0	May-96	1996.0
15873	Kylie Kopelkin (Earth-616)	Bad	Female	Living	1.0	Jan-97	1997.0
15874	Megan O’Toole (Earth-616)	Bad	Female	Living	1.0	Feb-97	1997.0
15875	Molly Peterson (Earth-616)	Bad	Female	Living	1.0	Jan-97	1997.0
15876	Ra’al (Earth-616)	Bad	Female	Living	1.0	Mar-97	1997.0
15938	Grace & Mary Mercy (Earth-616)	Bad	Female	Deceased	1.0	May-01	2001.0
16053	Ch’rith (Earth-616)	Bad	Female	Deceased	1.0	Nov-08	2008.0
16056	Fry’lu (Earth-616)	Bad	Female	Living	1.0	Oct-08	2008.0
16143	Zarda Shelton (Zombie Clone) (Earth-616)	Bad	Female	Deceased	1.0	May-11	2011.0
16148	Da’o Coy Manh (Earth-616)	Bad	Female	Deceased	1.0	Jun-12	2012.0
16159	Libra (Thanos’ Zodiac) (Earth-616)	Bad	Female	Deceased	1.0	May-12	2012.0
16170	Virgo (Thanos’ Zodiac) (Earth-616)	Bad	Female	Deceased	1.0	May-12	2012.0
16173	Katy & Timothy Bashir (Earth-616)	Bad	Female	Deceased	1.0	Feb-13	2013.0
16253	Illumination (Earth-616)	Bad	Female	Living	1.0	NaN	NaN
16283	Malefactor (Earth-616)	Bad	Female	Living	1.0	NaN	NaN
16297	Mrs. Stane (Earth-616)	Bad	Female	Deceased	1.0	NaN	NaN
16331	Shivoor (Earth-616)	Bad	Female	Living	1.0	NaN	NaN
16336	Silhouette (Hand) (Earth-616)	Bad	Female	Living	1.0	NaN	NaN
16358	Zora Loftus (Earth-616)	Bad	Female	Living	1.0	NaN	NaN
16362	Farbauti (Earth-616)	Bad	Female	Living	1.0	NaN	NaN

976 rows × 7 columns

In [14]:

#Group by alignment and sex and do a count for each group
goodvbad = df.groupby(by=['align', 'sex'])['name'].count().reset_index(name='count')
#Create a percentage column
goodvbad['percent'] = goodvbad['count'] / goodvbad['count'].sum() * 100
goodvbad['percent'] = goodvbad['percent'].round(2)
#Add an index so that I can select rows for specific groupings
goodvbad['index'] = goodvbad['align'].map(str) + goodvbad['sex']
goodvbad = goodvbad.set_index('index')
goodvbad['alignsex'] = goodvbad.index
goodvbad

Out[14]:

	align	sex	count	percent	alignsex
index
BadAgender	Bad	Agender	20	0.12	BadAgender
BadFemale	Bad	Female	976	5.96	BadFemale
BadMale	Bad	Male	5338	32.60	BadMale
BadUnknown	Bad	Unknown	386	2.36	BadUnknown
GoodAgender	Good	Agender	10	0.06	GoodAgender
GoodFemale	Good	Female	1537	9.39	GoodFemale
GoodGenderfluid	Good	Genderfluid	1	0.01	GoodGenderfluid
GoodMale	Good	Male	2966	18.11	GoodMale
GoodUnknown	Good	Unknown	122	0.74	GoodUnknown
NeutralAgender	Neutral	Agender	13	0.08	NeutralAgender
NeutralFemale	Neutral	Female	640	3.91	NeutralFemale
NeutralGenderfluid	Neutral	Genderfluid	1	0.01	NeutralGenderfluid
NeutralMale	Neutral	Male	1440	8.79	NeutralMale
NeutralUnknown	Neutral	Unknown	114	0.70	NeutralUnknown
UnknownAgender	Unknown	Agender	2	0.01	UnknownAgender
UnknownFemale	Unknown	Female	684	4.18	UnknownFemale
UnknownMale	Unknown	Male	1894	11.57	UnknownMale
UnknownUnknown	Unknown	Unknown	232	1.42	UnknownUnknown

In [15]:

#Create a chart of the distribution of alignment and sex
sns.barplot(x='percent', y='alignsex', data=goodvbad)

Out[15]:

In [16]:

#Show the percentage of bad males
males = goodvbad[['sex', 'align', 'count']].copy()
males = males.query('sex == "Male" & align != "Unknown"')
males['percent'] = males['count'] / males['count'].sum() * 100
males['percent'] = males['percent'].round()
males

Out[16]:

	sex	align	count	percent
index
BadMale	Male	Bad	5338	55.0
GoodMale	Male	Good	2966	30.0
NeutralMale	Male	Neutral	1440	15.0

In [17]:

percentagebadmales = males.at['BadMale', 'percent']
print('Percentage of Bad Males: ', int(percentagebadmales))

Percentage of Bad Males:  55

In [18]:

females = goodvbad[['sex', 'align', 'count']].copy()
females = females.query('sex == "Female" & align != "Unknown"')
females['percent'] = females['count'] / females['count'].sum() * 100
females['percent'] = females['percent'].round()
females

Out[18]:

	sex	align	count	percent
index
BadFemale	Female	Bad	976	31.0
GoodFemale	Female	Good	1537	49.0
NeutralFemale	Female	Neutral	640	20.0

In [19]:

#Show the percentage of bad females
percentagebadfemales = females.at['BadFemale', 'percent']
print('Percentage of Bad Females: ', int(percentagebadfemales))

Percentage of Bad Females:  31

It would seem that the villain is more often a male than female. From the chart above we can also see that there are more bad male characters than any other kind.

—

Keep Calm and Code in Python!

— Martin

Want a career as a Python Developer but not sure where to start?

Using Pandas and Seaborn to solve PyBites Marvel Challenge

By Martin Smith on 7 December 2017

Intro¶

Goals¶

Setup¶

Get Data¶

Cleanse¶

Solve Challenges¶

1. Characters with most appearances¶

2. Years with most and least new characters¶

3. Percentage of female characters¶

4. Good vs Bad Characters¶

Intro¶

Goals¶

Setup¶

Get Data¶

Cleanse¶

Solve Challenges¶

1. Characters with most appearances¶

2. Years with most and least new characters¶

3. Percentage of female characters¶

4. Good vs Bad Characters¶

Related articles