Click here to code!

Data Analysis of Pybites Community Branch Activity

Posted by Martin Uribe on Thu 18 October 2018 in Data • 15 min read

I wanted to play around with a dataset and see what I could find out about it. I decided on analyzing the little bit of data that I could collect from Github without having to use an OAuth key, which limits it to just 300 events.

To Run All of The Cells

You have the option of running each of the cells one at a time or you can just run them all in sequential order. Selecting a cell and either clicking on the Run button on the menu or using the key combination Shift+Enter will run the code in that cell if its code.

To run them all you will have to use the menu: Cell > Run All

In [1]:
import json
from collections import Counter
from pathlib import Path

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from dateutil.parser import parse
from matplotlib import rc
from matplotlib.pyplot import figure
In [2]:
data_location = Path.cwd().joinpath("data")

Retrieving and Importing the Data

The following code will load the three event json files in the data directory if the data directory exists. If the direcotry is not found it will be created and the files will be pulled down from Github and then loaded into memory.

In [3]:
def retrieve_data():
    if not data_location.exists():
        data_location.mkdir()
    url = "https://api.github.com/repos/pybites/challenges/events?page={}&per_page=100"
    for page in range(1, 4):
        response = requests.get(url.format(page))
        if response.ok:
            file_name = data_location.joinpath(f"events{page}.json")
            try:
                file_name.write_text(json.dumps(response.json()))
                print(f"  Created: {file_name.name}")
            except Exception as e:
                print(e)
        else:
            print(f"Something went wrong: [response.status_code]: {response.reason}")


def load_data():
    if data_location.exists():
        for page in range(1, 4):
            file_name = data_location.joinpath(f"events{page}.json")
            events.extend(json.loads(file_name.read_text()))
            print(f"  Loaded: {file_name.name}")
    else:
        print("Data directory was not found:")
        retrieve_data()
        load_data()

NOTE: If you want to work with the latest data, just remove the data directory and all its contents to have it pulled down once again.

In [4]:
events = []
load_data()
print(f"Total Events Loaded: {len(events)}")
  Loaded: events1.json
  Loaded: events2.json
  Loaded: events3.json
Total Events Loaded: 300

Parsing the Data

From what I hear, we should just get used to cleaning data up before we can use it and its no exception here. I'm interested in exploring a few key points from the data. Mostly I'm interested in the following:

  • Pull Request Events
  • Data that they were created
  • The username of the developer
  • The amount of time spent on the challenge
  • How difficult they found the challenge to be
In [5]:
# helper function
def parse_data(line):
    if '[' in line:
        data = line.split(': [')[1].replace(']', '').strip()
    else:
        data = line.split(': ')[1].strip()
    return data


# list to store the data
created = []
devs = []
dev_ids = []
diff_levels = []
time_spent = []

for event in events:
    # only insterested in pull request events
    if event['type'] == 'PullRequestEvent':
        # developer username
        dev = event['actor']['login']
        dev_id = event['actor']['id']
        # ignore pybites ;)
        if dev != 'pybites':
            # store developer username and use the id for privacy
            devs.append(dev)
            dev_ids.append(dev_id)
            # store the date
            created.append(event['created_at'].split('T')[0])
            # parse comment from user for data
            comment = event['payload']['pull_request']['body']
            for line in comment.split('\n'):
                # get difficulty level and time spent
                if 'Difficulty level (1-10):' in line:
                    diff = parse_data(line)
                elif 'Estimated time spent (hours):' in line:
                    spent = parse_data(line)
            # pandas DataFrames require that all columns are the same length
            # so if we have a missing value, None is used in its place
            if diff:
                diff_levels.append(int(diff))
            else:
                diff_levels.append(None)
            if spent:
                time_spent.append(int(spent))
            else:
                time_spent.append(None)

Creating The DataFrame

Now that we have the lists with the data that we parsed, a DataFrame can be created with them.

In [6]:
df = pd.DataFrame({
    'Developers': dev_ids, 
    'Difficulty_Levels': diff_levels, 
    'Time_Spent': time_spent,
    'Date': created,
})

Data Exploration

Here, we can start exploring the data. To take a quick peek at how it's looking, there is no better choice then to use head().

In [7]:
df.head()
Out[7]:
Developers Difficulty_Levels Time_Spent Date
0 16877140 5.0 8.0 2018-10-19
1 38558860 3.0 3.0 2018-10-19
2 15842536 2.0 1.0 2018-10-18
3 969189 5.0 4.0 2018-10-17
4 7138575 5.0 4.0 2018-10-17

To get some quick statistacaly metrics on the dataset, describe() can be used.

In [8]:
df.describe()
Out[8]:
Developers Difficulty_Levels Time_Spent
count 4.600000e+01 40.000000 40.000000
mean 1.447647e+07 3.725000 3.500000
std 1.382362e+07 1.601081 3.411895
min 2.531750e+05 1.000000 1.000000
25% 3.473518e+06 2.000000 1.000000
50% 7.138575e+06 4.000000 3.000000
75% 2.021234e+07 4.250000 4.000000
max 4.140364e+07 8.000000 20.000000

As you can see, sometimes the data isn't handled right, like the user id. Based on what I could see above, I Wanted to get a feel for the following portions. I can see the average difficulty level above, next to the 50%, but I also wanted to show you how to pull that out individually.

In [9]:
print(f'Developers: {len(df["Developers"])}')
print(f'Average Difficulty: {df["Difficulty_Levels"].median()}')
print(f'Time Spent: {df["Time_Spent"].sum()}')
Developers: 46
Average Difficulty: 4.0
Time Spent: 140.0

The following Counters are just me exploring the data even further.

In [10]:
developers = Counter(df['Developers']).most_common(6)
developers
Out[10]:
[(7138575, 8),
 (38558860, 7),
 (253175, 5),
 (969189, 3),
 (387927, 3),
 (16894718, 3)]
In [11]:
bite_difficulty = Counter(df['Difficulty_Levels'].dropna()).most_common()
bite_difficulty
Out[11]:
[(4.0, 13), (2.0, 9), (3.0, 6), (6.0, 6), (5.0, 3), (1.0, 2), (8.0, 1)]
In [12]:
bite_duration = Counter(df['Time_Spent'].dropna()).most_common()
bite_duration
Out[12]:
[(1.0, 11),
 (2.0, 8),
 (3.0, 7),
 (4.0, 6),
 (8.0, 4),
 (5.0, 2),
 (20.0, 1),
 (6.0, 1)]
In [13]:
created_at = sorted(Counter(df['Date'].dropna()).most_common())
created_at
Out[13]:
[('2018-10-04', 2),
 ('2018-10-05', 8),
 ('2018-10-07', 7),
 ('2018-10-08', 4),
 ('2018-10-09', 2),
 ('2018-10-10', 1),
 ('2018-10-11', 1),
 ('2018-10-12', 4),
 ('2018-10-13', 3),
 ('2018-10-14', 3),
 ('2018-10-15', 3),
 ('2018-10-16', 2),
 ('2018-10-17', 3),
 ('2018-10-18', 1),
 ('2018-10-19', 2)]

Hmm, how many days are we looking at?

In [14]:
len(created_at)
Out[14]:
15

Who's got the most recent activity?

In [15]:
top_ninja = Counter(devs).most_common(1)[0]
top_ninja
Out[15]:
('clamytoe', 8)

Time To Get Down To Business

Now that we've loaded our data and cleaned it up, lets see what it can tell us.

Number of Pull Request per Day

Pretty amazing that Pybites Blog Challenges had over 300 distinct github interactions in such a short time!

In [16]:
# resize graph
figure(num=None, figsize=(6, 6), dpi=80, facecolor='w', edgecolor='k')

# gather data into a custom DataFrame
dates = [day[0] for day in created_at]
prs = [pr[1] for pr in created_at]
df_prs = pd.DataFrame({'xvalues': dates, 'yvalues': prs})
 
# plot
plt.plot('xvalues', 'yvalues', data=df_prs)

# labels
plt.xticks(rotation='vertical', fontweight='bold')

# title
plt.title('Number of Pull Request per Day')

# show the graphic
plt.show()

Top Blog Challenge Ninjas

Although there are many more contributors, I had to limit the count so that the data would be easier to visualize.

In [17]:
# resize graph
figure(num=None, figsize=(6, 6), dpi=80, facecolor='w', edgecolor='k')

# create labels
labels = [dev[0] for dev in developers]
labels[0] = top_ninja[0]

# get a count of the pull requests
prs = [dev[1] for dev in developers]

# pull out top ninja slice
explode = [0] * len(developers)
explode[0] = 0.1

# create the pie chart
plt.pie(prs, explode=explode, labels=labels, shadow=True, startangle=90)

# add title and center
plt.axis('equal')
plt.title('Top Blog Challenge Ninjas')

# show the graphic
plt.show()

Time Spent vs Difficulty Level per Pull Request

Finally I wanted to explore what the relation between time spent per PR vs how difficult the develop found the challenge to be.

In [18]:
# resize graph
figure(num=None, figsize=(15, 6), dpi=80, facecolor='w', edgecolor='k')

# drop null values
df_clean = df.dropna()

# add legend
diff = mpatches.Patch(color='#557f2d', label='Difficulty Level')
time = mpatches.Patch(color='#2d7f5e', label='Time Spent')
plt.legend(handles=[time, diff])

# y-axis in bold
rc('font', weight='bold')

# values of each group
bars1 = df_clean['Difficulty_Levels']
bars2 = df_clean['Time_Spent']

# heights of bars1 + bars2
bars = df_clean['Difficulty_Levels'] + df_clean['Time_Spent']

# position of the bars on the x-axis
r = range(len(df_clean))

# names of group and bar width
names = df_clean['Developers']
barWidth = 1

# create green bars (bottom)
plt.bar(r, bars1, color='#557f2d', edgecolor='white', width=barWidth)
# create green bars (top), on top of the firs ones
plt.bar(r, bars2, bottom=bars1, color='#2d7f5e', edgecolor='white', width=barWidth)

# custom X axis
plt.xticks(r, names, rotation='vertical', fontweight='bold')
plt.xlabel("Developers", fontweight='bold')

# title
plt.title('Time Spent vs Difficulty Level per Pull Request')

# show graphic
plt.show()

Conclusions

As you can see, the Pybites Ninjas are an active bunch. With such a small limited dataset its plain to see that some good information can be extracted from it. Would be interesting to see which challenges are getting the most action though, but I'll leave that as an exercise for you to explore!

---

Keep Calm and Code in Python!

-- Martin




See an error in this post? Please submit a pull request on Github.