As usual, let's begin by importing some useful libraries. Note the presence of several visualization customizations. plt.style.use()
lets us specify a style family among some predefined samples (see the other options on the matplotlib style reference page). Then, the rcParams
dictionary can be used to change the default style settings and behaviors of matplotlib. In this case, we are increasing the default figure size to 12x6 inches. Finally, setting the max_rows
option for pandas limits the maximum number of rows displayed whenever a DataFrame is printed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-notebook')
pd.options.display.max_rows = 10
plt.rcParams["figure.figsize"] = (12, 6)
Let's load the two files using pandas. The first file contains several pieces of information related to Points of Interest (POI) in the New York area. The second one contains the IDs of those POIs located in the New York municipality. To retain only the latter points, we can use the DataFrame index functionalities. The operation is equivalent to a set intersection between the two indices.
ny_pois_ids = np.loadtxt("ny_municipality_pois_id.csv")
# This dictionary maps attributes in the table with correct data types
d_types = {'@type':str, '@lat':float, '@lon':float, 'amenity':str, 'name':str,
'shop':str, 'public_transport':str, 'highway':str}
all_pois_df = pd.read_csv("pois_all_info", sep='\t', index_col='@id', dtype=d_types)
# Filter only POIS in NY municipality: intersection on indices
pois_df = all_pois_df.loc[ny_pois_ids]
pois_df
Remember that pandas offers a small set of methods to quickly inspect any pandas DataFrame. Let's apply some.
pois_df.info()
We have 8 columns. While @type, @lat and @lon are present in every record, the other ones contain many missing values.
pois_df.describe()
The describe method instead provides a few statistical characteristics of each column (e.g. mean, standard deviation, min and max values). As you might have expected, it includes only numerical variables in the report.
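If you are also interested in the non-numeric columns, describe accepts an include argument; the following optional check summarizes the object (string) columns as well.
pois_df.describe(include='object')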
Additional note: the attribute @type can be misleading, as it does not contain any type-related information. Indeed, it has a single value in the dataset, so we can ignore it.
pois_df['@type'].unique()
Another way to manually count the number of missing values per attribute is by chaining methods that operate column-wise (the default behavior for many DataFrame operations).
pois_df.isna().sum()
There is something interesting here. It seems that the information is not encoded in a canonical way: the columns "amenity", "shop", "public_transport" and "highway" may correspond to distinct categories (let's put "name" aside for now), each containing several possible values. The high number of missing values suggests that each record belongs to a single category, leaving the other fields empty. Let's quickly check that.
from collections import Counter
def get_categories():
return ['amenity', 'shop', 'public_transport', 'highway']
cats = get_categories()
# Count NaNs per row and inspect their frequencies
check_df = pois_df[cats]
row_nans = check_df.isna().sum(axis=1)
print(Counter(row_nans))
Every row that has a NaN count equal to 3 contains a single non-missing category value. Our assumption was almost correct. Yet, there are rows with two non-empty values. Let's see some of them.
pois_df[row_nans == 2].head()
The pairs "cafe" and "books", and "platform" and "bus_stop" appear together. That seems plausible. So, we can assume that each POI can have one or two categories. Note that we also have 10208 records with no category at all.
We can now carry out a deeper analysis of our four categories. To do so, we can plot the distribution of each type within a given category. For quick charts, one can leverage pandas itself (which, in turn, uses matplotlib internally): Series and DataFrame objects expose handy visualization APIs. However, keep in mind that pandas plotting functionalities are limited: for complex tasks and customizations, matplotlib or seaborn are better choices.
For the sake of simplicity, we can narrow the analysis down to the most frequent types. Of course, this is not a general-purpose approach: pruning the visual exploration can be dangerous, since useful information can be lost or missed. Focusing on the most frequent data points should be limited to simplifying visualization. In this exercise, we retain only the types that cover the top 80% of occurrences within each category. Let's define a function that evaluates the percentage share of each type within a category and filters out the ones below a given threshold.
We will use the cumsum method to obtain the cumulative percentage share of each type within a category. Let's briefly see how it works on a pandas Series and how it can be used to filter with a percentage threshold.
In our introductory example, we can imagine a bag of colored balls. We represent their occurrences in the bag as a Series.
s = pd.Series([10, 12, 20, 2], index=['red', 'green', 'blue', 'yellow'])
s
We can now sort the occurrences in descending order and convert them into the percentage share each color accounts for, by dividing by the total number of balls.
so = s.sort_values(ascending=False)
so
sp = so / so.sum()
sp
We now can compute the cumulative sum.
spc = sp.cumsum()
spc
Finally, we can create a boolean mask for the elements whose cumulative share exceeds a given percentage threshold (i.e. the elements beyond the ones that account for the top p% of occurrences). The first item in the mask with a True value will be the first non-frequent item. We can discover its positional index using argmax.
mask = spc > .8
mask
mask.values.argmax()
Thus, blue and green balls constitute the largest subset of balls that accounts for no more than 80% of the ball occurrences.
def get_top_perc(series, perc_value=.8):
perc = series.cumsum() / series.sum()
arg = (perc > perc_value).values.argmax()
return series.iloc[:arg+1]
for col in get_categories():
p = .8
valc = pois_df[col].value_counts()
valf = get_top_perc(valc, p)
fig, ax = plt.subplots()
valf.plot(kind='bar', ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
fig.suptitle(f"Top {p*100:.0f}% points in the category: {col}")
From the charts we know that amenity and shop categories have many types (but with quite different occurrences), while public transport and highway contain only a few.
It is time to draw our POIs on a real map. We can define a Map class for the task.
We will make use of matplotlib's colormaps. Let's talk about them briefly. Simplifying a bit, every colormap can be seen as a continuous sequence of colors, interpolating between an initial and a final one. The choice of the two edge colors changes the nature of the scale (e.g. sequential, diverging). You can get a colormap object with the get_cmap function. Let's try with viridis and BrBG, a Perceptually Uniform Sequential map and a Diverging map respectively.
from matplotlib.cm import get_cmap
x = np.linspace(0, 1, 256)
x = np.vstack((x, x))
fig, axes = plt.subplots(nrows=2, figsize=(12,2))
plt.subplots_adjust(hspace=2) # adjust the vertical space between subplots
for ax, name in zip(axes, ['viridis', 'BrBG']):
ax.imshow(x, aspect='auto', cmap=plt.get_cmap(name))
ax.grid(False)
ax.get_yaxis().set_visible(False)
ax.set_title(name)
As you might have noticed, the values of x
(i.e. 256 floating point numbers linearly distributed between 0 and 1) are used to pick colors from the colormap.
x[0, :10]
Each color is expressed as four floating point numbers (RGBA notation) between 0 and 1. To get these values, use the callable nature of the colormap object itself.
cmap = plt.get_cmap('viridis')
cmap(x[0, :10])
import seaborn as sns
class Map:
def __init__(self, df):
""" Store Dataset with POIs information."""
self.pois_df = df
self.lat_min = df['@lat'].min()
self.lat_max = df['@lat'].max()
self.long_min = df['@lon'].min()
self.long_max = df['@lon'].max()
def plot_map(self):
""" Display an image with NY map and return the Axes object."""
fig, ax = plt.subplots()
nyc_img = plt.imread('./New_York_City_Map.PNG')
ax.imshow(nyc_img, zorder=0, extent=[self.long_min,
self.long_max,
self.lat_min,
self.lat_max])
ax.grid(False)
return ax
def plot_pois(self, ax, category, mask):
"""Plot data on specified Axis."""
df = self.pois_df.loc[mask]
# Version 1: using pandas
types = df[category].unique()
cmap = get_cmap('viridis')
colors = cmap(np.linspace(0, 1, types.size))
for i, t in enumerate(types):
df_t = df.loc[df[category] == t]
c = [colors[i]] * df_t.shape[0]
df_t.plot.scatter(x='@lon', y='@lat', ax=ax, c=c, alpha=.6, label=t)
# Version 2: using seaborn
# sns.scatterplot(df['@lon'], df['@lat'], hue=df[category], ax=ax,
# marker='o', s=3, linewidth=0, palette="viridis", legend='full')
ax.legend() # show the legend, required by Version 1
ax.grid(False)
return ax
The class exposes a method that creates a new figure, draws the New York map onto it, and finally returns the Axes object for future plots. Focus on the extent
parameter: it specifies the left, right, bottom and top boundaries of the image, in terms of coordinates. In this case, it is required to correctly align the background with our latitude and longitude values.
Without prior knowledge on that, we can suppose that the image provided with the dataset is bounded by the minimum and maximum values of latitude and longitude over all POIs. We can check whether this assumption is correct by simply plotting the POIs of a category.
We would like you to focus on the plot_pois
method. The actual plotting is presented in two versions: one with pandas functions and one with the seaborn library. The uncommented version exploits pandas. As you can see, the approach requires iterating over the unique types of the category, picking a color from the colormap for each of them, and drawing a separate scatter plot for each subset of points.
The commented line (i.e. Version 2) uses the seaborn library to obtain the same result. The hue
parameter is used to infer the labels and the color of each data point, while palette
lets you select the desired colormap. Pretty neat, isn't it?
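For reference, here is a standalone sketch of that seaborn call with explicit keyword arguments (recent seaborn versions expect x, y and data to be passed by name); df, category and ax are assumed to be defined as inside plot_pois.
# Sketch of Version 2 with keyword arguments (assumes df, category, ax as in plot_pois)
sns.scatterplot(data=df, x='@lon', y='@lat', hue=category, ax=ax,
                marker='o', s=3, linewidth=0, palette='viridis', legend='full')
ax.grid(False)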
def show_category_on_map(df, column, perc_value):
"""
Plot the New York map with POIs of a specific category.
Only the top 'perc_value'% frequent types are plotted.
"""
counts = df[column].value_counts()
top_freq = get_top_perc(counts, perc_value)
ny_map = Map(df)
ax = ny_map.plot_map()
mask = df[column].isin(top_freq.index)
ny_map.plot_pois(ax, column, mask)
show_category_on_map(pois_df, 'amenity', .8)
There is an evident concentration of pharmacies in the Manhattan borough. However, we should consider that later calls may superimpose points on top of those drawn by earlier calls. Let's look at the same chart while lowering the frequency threshold.
show_category_on_map(pois_df, 'amenity', .5)
show_category_on_map(pois_df, 'amenity', .3)
Manhattan seems to be the core of many of the most frequent types of amenities.
We now focus on a grid-based subdivision system. The rationale here follows the principle of data transformation: we want to explore the POIs from another perspective, starting from a different representation. Let's see how it can be achieved.
We can decide to subdivide the map into non-overlapping, rectangular cells based on latitude and longitude. Even if this is not the only possibility, a grid like that guarantees that each point falls into a single cell.
Let's write down a class to map POIs' coordinates to the respective cell (or cell_id).
class Cell_converter:
def __init__(self, df, n_splits):
self.lat_min = df['@lat'].min()
self.lat_max = df['@lat'].max()
self.long_min = df['@lon'].min()
self.long_max = df['@lon'].max()
self.n_splits = n_splits
def plot_grid(self, ax):
lat_steps = np.linspace(self.lat_min, self.lat_max, self.n_splits + 1)
long_steps = np.linspace(self.long_min, self.long_max, self.n_splits + 1)
ax.hlines(lat_steps, self.long_min, self.long_max)
ax.vlines(long_steps, self.lat_min, self.lat_max)
def point_to_cell_coord(self, long, lat):
x = int((long - self.long_min)/(self.long_max - self.long_min)*self.n_splits)
y = int((lat - self.lat_min)/(self.lat_max - self.lat_min)*self.n_splits)
return x, y
def point_to_cell_id(self, long, lat):
x, y = self.point_to_cell_coord(long, lat)
return y * self.n_splits + x
n_splits = 20
cell_conv = Cell_converter(pois_df, n_splits)
pois_df['cell_id'] = pois_df.apply(lambda x: cell_conv.point_to_cell_id(x['@lon'], x['@lat']), axis=1)
pois_df.head()
The point_to_cell_id
method creates the final mapping. As you can see, the cell_id
will be an integer, starting from 0 (the bottom-left cell) and following a row-major ordering.
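To make the ordering concrete, here is a small, purely illustrative sanity check on two points of the grid.
# Illustrative check of the row-major ordering
# The bottom-left corner of the bounding box falls in cell 0
print(cell_conv.point_to_cell_id(cell_conv.long_min, cell_conv.lat_min))
# A point one cell above it (same column) gets id n_splits, i.e. 20 here
one_cell_up = cell_conv.lat_min + 1.5 * (cell_conv.lat_max - cell_conv.lat_min) / n_splits
print(cell_conv.point_to_cell_id(cell_conv.long_min, one_cell_up))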
Also, focus on the line where we assign pois_df['cell_id']: that is a simple way to assign the result of some operation to a new column, effectively creating it on the fly.
As you might have understood, we do like maps, scatter plots, and scatter plots on maps. But we do not dislike grids either. That is the reason for the plot_grid
method. Let's apply it.
yet_another_map = Map(pois_df)
ax = yet_another_map.plot_map()
cell_conv.plot_grid(ax)
Now that we have the cell_id
assigned, we are able to carry out a different analysis.
The cell-wise distribution of POIs re-frames the exploration task, providing a completely new representation of our dataset.
The obtained cells (that now represent a specific sub-area in the NY municipality) become atomic units, summarizing part of our data.
As such, we can decide to characterize each cell by counting the POIs contained, for each POI type. The outcome will be a new DataFrame.
def get_df_count(df, column, perc_value):
counts = df[column].value_counts()
top_freq = get_top_perc(counts, perc_value)
mask = df[column].isin(top_freq.index)
freq_df = df.loc[mask]
# for each cell_id count the number of POIs for each type
count_dframe = []
for cell_id in range(n_splits**2):
count_vals = freq_df.loc[freq_df['cell_id'] == cell_id][column].value_counts()
count_vals.name = cell_id
count_dframe.append(count_vals)
cells_features_df = pd.DataFrame(count_dframe)
cells_features_df = cells_features_df.fillna(0)
return cells_features_df
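As a side note, the same cell-by-type matrix can be obtained more concisely with pd.crosstab; the sketch below is equivalent to the loop above (up to the cells that contain no frequent POI, which reindex fills with zeros).
# Sketch: same result via pd.crosstab (rows: cell_id, columns: POI types, values: counts)
def get_df_count_crosstab(df, column, perc_value):
    counts = df[column].value_counts()
    top_freq = get_top_perc(counts, perc_value)
    freq_df = df.loc[df[column].isin(top_freq.index)]
    cells_features_df = pd.crosstab(freq_df['cell_id'], freq_df[column])
    # include empty cells as rows of zeros, like the loop-based version
    return cells_features_df.reindex(range(n_splits**2), fill_value=0)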
Let's see a few records of the result for frequent amenities.
amenities_df = get_df_count(pois_df, 'amenity', .6)
amenities_df.head()
Now that we have some new features to characterize each cell, we can inspect whether they are correlated or not. This kind of question becomes interesting especially with data from different categories.
Let's begin computing the cell-wise representation for shops as well.
shops_df = get_df_count(pois_df, 'shop', .6)
shops_df.head()
Now, concatenate it with the DataFrame from amenities. Remember that the horizontal stacking operation (i.e. on columns) is obtained with axis=1
.
final_df = pd.concat([amenities_df, shops_df], axis=1)
Pandas provides a handy method to compute the pairwise correlation of columns, excluding null values. We can use the resulting matrix to build a heatmap.
final_corr = final_df.corr()
final_corr.head()
Again, let's develop two versions, to highlight how the seaborn library can significantly simplify your tasks. As you can see, matplotlib does not provide a heatmap method out of the box: you have to rely on the image display functionality (imshow), adjusting the axes and adding a colorbar yourself. Conversely, the seaborn implementation (the commented one) achieves the same result with a single line of code.
# Version 1
fig, ax = plt.subplots()
im = ax.imshow(final_corr)
ax.set_xticks(np.arange(final_corr.columns.size))
ax.set_yticks(np.arange(final_corr.columns.size))
ax.set_xticklabels(final_corr.columns)
ax.set_yticklabels(final_corr.columns)
plt.setp(ax.get_xticklabels(), rotation=90, ha="right", va="center",
rotation_mode="anchor") # rotate labels on x-axis
cbar = ax.figure.colorbar(im, ax=ax)
_ = cbar.ax.set_ylabel('pearson correlation', rotation=-90, va="bottom")
# Version 2
# sns.heatmap(final_corr)
There are some interesting pairs, like restaurant with bicycle parking, fast food, cafe, clothes and bakery: these NY areas are likely related to food activities. Laundry is highly correlated with convenience shops, beauty, hairdresser and dry cleaning: the latter group seems to identify cells with commercial activities and shops.
Bear in mind that these are preliminary results. The nature of the cells may vary, depending on the parameter n_splits
used to build the grid. Also, remember that we focused on amenities and shops only and, among them, we filtered out the least frequent POI types (the bottom 40%). Further analyses are left to you as a side exercise.
The main idea behind correlation inspection is to find cells where some activities or shops appear together. We can take the analysis one step further and search for clusters of cells. Given the new representation at our disposal (i.e. cells characterized by the presence of certain POIs), computing clusters of cells would be equivalent to discovering areas of interest in the city.
We leave cluster analysis to you as a side exercise: for example, you could cluster the cell-wise counts to discover groups of cells hosting similar sets of POIs, and then inspect where those groups lie on the map.
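A minimal sketch of such an analysis, assuming scikit-learn is available (the number of clusters is arbitrary here and should be tuned):
from sklearn.cluster import KMeans

# Sketch: group cells with similar POI profiles (n_clusters chosen arbitrarily)
kmeans = KMeans(n_clusters=5, random_state=42)
cell_clusters = pd.Series(kmeans.fit_predict(final_df), index=final_df.index)
cell_clusters.value_counts()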
We begin by loading the dataset. The argument parse_dates
is used to parse a raw date string into a numpy datetime64 object. Pandas time series / date functionalities are extremely versatile and should be used any time a column represents a date or timestamp. Working with datetime types lets you, for example, query specific dates (without resorting to string comparison) or date intervals. That becomes extremely useful when the DataFrame index is of datetime type: the corresponding index class is known as DatetimeIndex.
df = pd.read_csv('data/831394006_T_ONTIME.csv', parse_dates=["FL_DATE"]).rename(columns=str.lower)
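As a quick, optional illustration of what the datetime functionality enables, once fl_date is parsed we can slice by date strings after promoting it to the index.
# Optional aside: with a sorted DatetimeIndex, rows can be selected via date-string slices
df.set_index('fl_date').sort_index().loc['2017-01-01':'2017-01-07'].head()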
Let's call now some descriptive methods.
df.head()
df.describe()
df.info()
The dataset is sufficiently large (450k entries, 33 columns and 113 MB of memory footprint). We can easily note that the attribute unnamed: 32 is null everywhere. Even if it is not strictly necessary, let's get rid of it.
df = df.drop("unnamed: 32", axis=1)
Let's list all the available carriers and airports and count them.
df.unique_carrier.unique(), df.unique_carrier.unique().size
distinct_airports = pd.concat([df["origin"], df["dest"]])
print(f"First ten airports: {distinct_airports.unique()[:10]}. Size: {distinct_airports.unique().size}")
Note how we have accessed the DataFrame columns. The first cell uses attribute access: pandas creates these attributes, matching the column names, at instantiation time. In the latter cell we used square brackets. That syntax is more verbose, but it lets you specify a list of columns to slice on, returning a DataFrame as a result.
df[["origin", "dest"]]
Since the fl_date
is now a datetime64 column, we can ask for something like:
df.fl_date.min(), df.fl_date.max()
We can filter out canceled flights for the following analysis.
print('Shape before:', df.shape)
df = df.loc[df.cancelled == 0]
print('Shape after:', df.shape)
Using the groupby method we can obtain both pieces of information: the number of flights operated by each carrier and its mean delay, broken down by delay type.
df_by_carrier = df.groupby('unique_carrier')
df_count = df_by_carrier['fl_date'].count()
df_count
df_by_carrier[['carrier_delay',
'weather_delay',
'nas_delay',
'security_delay',
'late_aircraft_delay']].mean()
We have already seen how a new column can be generated. Since our fl_date
is in datetime format, obtaining the day of the week is straightforward.
df['weekday'] = df['fl_date'].dt.dayofweek
df['weekday'].head()
Then, a simple element-wise difference between columns is enough to obtain deltadelay.
df['deltadelay'] = df['arr_delay'] - df['dep_delay']
df['deltadelay'].head()
To accomplish the task we can use a boxplot. The boxplot is a method for graphically depicting sets of numerical values through their quartiles.
df.boxplot(by='weekday', column='arr_delay')
Several outliers squash each box plot. Let's exclude them.
df.boxplot(by='weekday', column='arr_delay', showfliers=False)
In both cases, there is no clear correlation between the arrival delay and the day of the week.
We can now exploit a bar chart.
we_delay = df.loc[df.weekday > 4].groupby('unique_carrier').arr_delay.mean()
wd_delay = df.loc[df.weekday <= 4].groupby('unique_carrier').arr_delay.mean()
we_delay.name = "weekend delay"
wd_delay.name = "working days delay"
ax = pd.concat([wd_delay, we_delay], axis=1).plot.bar()
ax.grid(True)
The most uneven behavior is that of Delta Airlines. Considering the working days, it has a negative mean delay, meaning that, on average, its airplanes landed earlier than scheduled. Conversely, during weekends, it has the highest mean arrival delay.
The creation of the multi-index on an existing DataFrame is achieved by setting multiple columns to the index itself.
multi_df = df.set_index(['unique_carrier', 'origin', 'dest', 'fl_date']).sort_index()
multi_df[multi_df.columns[:4]].head()
Working with multi-level indices is extremely useful sometimes. To access specific rows, you can specify an indexing procedure for each level.
multi_df.loc[(['AA', 'DL'], ['LAX']), ['dep_time', 'dep_delay']]
Note that the first element passed to .loc
is a tuple, specifying a fancy indexing access on the first two levels of the index.
Let's break down the problem. We first detect all the interested records. To do so, we can exploit the datetime type to filter on dates.
# fw_df = multi_df.loc[(:, :, 'LAX', '2017-01-01':'2017-01-08'), :] # not allowed
fw_df = multi_df.loc[(slice(None), slice(None), 'LAX', slice('2017-01-01','2017-01-08')), :]
The first line of the previous cell is commented out, since the slicing character :
cannot be used inside a tuple to access index levels. If you want to use that syntax anyway, pandas provides the IndexSlice object, which handles the translation for you. The previous result can be obtained via:
fw_df = multi_df.loc[pd.IndexSlice[:, :, 'LAX', '2017-01-01':'2017-01-08'], :]
We can now complete the task using again the groupby operator.
fw_df.groupby('fl_num')['arr_delay'].mean()
It is time to explore the use of pandas pivot table. We want to count the flights departed each day of the week, for each carrier. Thus, the aggregation function would be count
.
cfd = pd.pivot_table(df, values='fl_num', index='unique_carrier', columns='weekday', aggfunc='count')
cfd
Then, to compute the pairwise correlation between carriers on the different days, we can again count on the pandas corr()
method. However, it is applied to columns, hence we first need to transpose the DataFrame. The result is directly passed to seaborn for the heatmap visualization.
plt.figure(figsize=(10,10))
_ = sns.heatmap(cfd.T.corr())
The matrix represents the degree of correlation along the week (the Pearson correlation is used by default): correlated carriers distribute their flights across the days of the week in a similar way.
It seems that Hawaiian Airlines (HA) has a different flight schedule compared to most of the other companies. Delta Airlines, instead, shares its plan with many companies.
Just as before, we need a pandas pivot table.
plt.figure(figsize=(10,10))
_ = sns.heatmap(pd.pivot_table(df, values='arr_delay', index='unique_carrier', columns='weekday', aggfunc='mean').T.corr())
Now, a high correlation means that the two carriers' mean arrival delays follow a similar pattern across the days of the week. AA and DL, and VX and HA, show a strong correlation, positive and negative respectively. The rest of the heatmap does not convey significant information.
We need again a pivot table, but the input DataFrame needs to be filtered beforehand. To do so, we can make use of a boolean mask.
mask = df.unique_carrier.isin(["HA", "DL", "AA", "AS"])
dcw = pd.pivot_table(df.loc[mask], values='deltadelay', index='unique_carrier', columns='weekday', aggfunc='mean')
dcw
Finally, we can use the pandas plot
method. Calling it on a whole DataFrame creates a line for each column, with the index values on the x-axis. Thus, we only need to transpose the DataFrame in advance.
dcw.T.plot()