import pandas as pd
housing_data = pd.read_csv('data/rental_contracts_sg.csv', skiprows=2)
housing_data.head()
Cleaning up...
Data obtained online is usually messy and has not been cleaned up. Let me show you what I mean.
housing_data.info()
- First, we see that the columns do not all have the same number of non-null entries.
- The postal district column should be an integer rather than a float64.
- The floor area column has dtype object because each entry is actually a range of unit areas, e.g. 120 to 130 for the first entry.
Let's have a look at the last rows first.
housing_data.tail()
Shocks! I also read in the comments at the end of the CSV file, so we obviously need to remove them. Please take note that information or comments at the end of a file may be valuable in explaining the dataset, but let's just remove them for now.
# keep only the data rows; .copy() avoids a SettingWithCopyWarning when a column is reassigned later
housing_data = housing_data.iloc[:1523].copy()
housing_data.tail()
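The fixed row count works for this particular file. As an aside, here is a minimal sketch of a way to drop trailing notes without hard-coding the count, assuming the comment rows never have a numeric postal district. The miniature CSV and its values are made up for illustration; only the column names come from the dataset used here.

```python
import io
import pandas as pd

# Hypothetical miniature of the CSV: three data rows followed by trailing notes.
raw = io.StringIO(
    "Postal District,Monthly Gross Rent($)\n"
    "20,3200\n"
    "22,2800\n"
    "25,2500\n"
    "Note: figures are illustrative only,\n"
    "Source: made up for this sketch,\n"
)
toy = pd.read_csv(raw)

# Rows whose district cannot be parsed as a number are treated as trailing comments.
district = pd.to_numeric(toy["Postal District"], errors="coerce")
is_data = district.notna()

# Keep the data rows and store the district as an integer.
clean = toy[is_data].copy()
clean["Postal District"] = district[is_data].astype(int)
print(clean)
```

This trades the magic number for an assumption about the data, so it is worth checking how many rows were dropped before moving on.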
The last rows look clean now. Next, we should fix the data type of the postal district column.
housing_data['Postal District']=housing_data['Postal District'].astype('int')
housing_data.info()
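The floor area column noted earlier is still stored as text ranges. One possible way to make it numeric is sketched below, assuming every entry follows the "120 to 130" pattern. The column name `Floor Area` and the sample values are hypothetical, since the real ones may differ.

```python
import pandas as pd

# Hypothetical column name and values; only the "120 to 130" pattern comes from the text.
toy = pd.DataFrame({"Floor Area": ["120 to 130", "90 to 100", "200 to 210"]})

# Pull the two numbers out of each range string.
bounds = toy["Floor Area"].str.extract(r"(\d+)\s*to\s*(\d+)").astype(float)
bounds.columns = ["area_min", "area_max"]

# The midpoint is one simple numeric summary of each range.
toy["area_mid"] = bounds.mean(axis=1)
print(toy)
```

Whether the midpoint, the lower bound, or both bounds are most useful depends on what the column is used for later.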
housing_data.describe()
The describe function displays summary statistics for the whole dataset, but I want to see them per district so I can compare. We can use the groupby method that pandas provides on DataFrames.
groupedby_district = housing_data.groupby('Postal District')
#use the describe function again
groupedby_district.describe()
Seems like it worked. Still, looking at and interpreting these numbers may make us dizzy, so let's try using plots. We first plot histograms and ECDFs of the grouped dataset.
#do some imports first
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style()
Means and histogram
import numpy as np
def ecdf(data):
    """Compute Empirical Cumulative Distribution Function for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n + 1) / n
    return x, y
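To see what the ECDF function returns, here is a small self-contained check on made-up rents (the numbers are illustrative, not taken from the dataset). The function is repeated so the snippet runs on its own.

```python
import numpy as np

def ecdf(data):
    """Return the sorted values and the fraction of points at or below each one."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

rents = np.array([3000, 2500, 4000, 2500])  # made-up monthly rents
x, y = ecdf(rents)
print(x)  # [2500 2500 3000 4000]
print(y)  # [0.25 0.5  0.75 1.  ]
```

Reading the result: the y-value paired with 3000 is 0.75, meaning 75% of the rents in this toy sample are at or below 3000.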
fig_hist, ax_hist = plt.subplots()
fig_cdf, ax_cdf = plt.subplots()
for name, group in groupedby_district:
    label = 'District {}'.format(name)
    # density=True replaces the 'normed' argument, which newer matplotlib versions no longer accept
    _ = group.hist(column='Monthly Gross Rent($)', ax=ax_hist, bins=30, alpha=0.5, label=label, density=True)
    # compute the ECDF of the monthly gross rent for each group
    x, y = ecdf(group['Monthly Gross Rent($)'])
    # generate ECDF plot
    _ = ax_cdf.plot(x, y, marker='.', linestyle='none', label=label)
#put legends in the plots
_ = ax_hist.legend()
_ = ax_cdf.legend()
Medians, quartiles and boxplots
fig, ax = plt.subplots()
ax = housing_data.boxplot(column='Monthly Gross Rent($)',by='Postal District',ax=ax)
_ = ax.set_title('') #this is just to fix the messy title generated by the .boxplot function
_ = ax.set_ylabel('Monthly Gross Rent ($)') #label matches the plotted column
As seen here, the median rent (the green line in the middle of each box) for district 25 tends to be lower than for districts 20 and 22. The circles represent data outside the 'whiskers' of the boxplots, suggesting they are outliers. Note that these outliers might still be important in making our models, so we just keep them for this exercise.
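To make the outlier rule concrete, here is a minimal sketch of how points beyond the whiskers can be identified by hand. It assumes the default whisker setting, where whiskers reach at most 1.5 times the interquartile range beyond the box. The rents are made up for illustration.

```python
import numpy as np

rents = np.array([2200, 2400, 2500, 2600, 2700, 2800, 6000])  # made-up monthly rents

# The box spans the first to third quartile.
q1, q3 = np.percentile(rents, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as circles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = rents[(rents < lower) | (rents > upper)]
print(outliers)  # [6000]
```

Listing the flagged values this way makes it easier to inspect them individually before deciding whether to keep them.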
Now to the 'sexy' part of data science, or is it?
Stay tuned to my website for my next article about Predicting House Prices in SG!