Calculate And Plot 95% Range Of Data On Scatter Plot In Python
I wish to know, for a given predicted commute journey duration in minutes, the range of actual commute times I might expect. For example, if Google Maps predicts my commute to be
Solution 1:
The relationship between actual duration of a commute and the prediction should be linear, so I can use quantile regression:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# Import the commute data
commutes = pd.read_csv('https://raw.githubusercontent.com/blokeley/commutes/master/commutes.csv')
# Create the quantile regression model
model = smf.quantreg('duration ~ prediction', commutes)
# Create a list of quantiles to calculate
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
# Create a list of fits
fits = [model.fit(q=q) for q in quantiles]
# Create a new figure and axes
figure, axes = plt.subplots()
# Plot the scatter of data points
x = commutes['prediction']
axes.scatter(x, commutes['duration'], alpha=0.4)
# Create an array of predictions from the minimum to maximum to create the regression line
_x = np.linspace(x.min(), x.max())
for index, quantile in enumerate(quantiles):
    # Plot the quantile lines
    _y = fits[index].params['prediction'] * _x + fits[index].params['Intercept']
    axes.plot(_x, _y, label=quantile)
# Plot the line of perfect prediction
axes.plot(_x, _x, 'g--', label='Perfect prediction')
axes.legend()
axes.set_xlabel('Predicted duration (minutes)')
axes.set_ylabel('Actual duration (minutes)');
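This gives a scatter plot of actual against predicted duration, overlaid with the five quantile lines and the dashed line of perfect prediction.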
Many thanks to my colleague Philip for the quantile regression tip.
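To answer the original question for one specific prediction, the fitted quantile lines can be evaluated at that value. The following is a minimal sketch rather than part of the original answer: the helper name commute_range and the coefficients in example_params are hypothetical placeholders, not values fitted to the commute data. Note that the 0.05 and 0.95 quantiles bracket 90% of commutes; for a 95% range you would fit the 0.025 and 0.975 quantiles instead.

```python
def commute_range(params, predicted_minutes, lower_q=0.05, upper_q=0.95):
    """Evaluate the lower and upper quantile lines at one predicted duration.

    params maps each quantile to an (intercept, slope) pair.
    """
    lower_intercept, lower_slope = params[lower_q]
    upper_intercept, upper_slope = params[upper_q]
    lower = lower_intercept + lower_slope * predicted_minutes
    upper = upper_intercept + upper_slope * predicted_minutes
    return lower, upper


# Hypothetical coefficients for illustration only (NOT fitted to the commute data)
example_params = {0.05: (2.0, 0.75), 0.95: (4.0, 1.25)}

lower, upper = commute_range(example_params, 30)
print(f"A 30 minute prediction suggests roughly {lower:.1f} to {upper:.1f} minutes")
```

With the fits from the code above, the real coefficients could be collected with something like {q: (f.params['Intercept'], f.params['prediction']) for q, f in zip(quantiles, fits)}.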
Solution 2:
You could fit a Gaussian (normal) distribution to your data. About 95% of values lie within 2 standard deviations of the mean (1.96 to be precise), while 3 standard deviations cover about 99.7%.
Look up the normal distribution.
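A rough sketch of the normal-distribution approach, under assumptions that are not in the original answer: the prediction errors (actual minus predicted duration) are taken to be normally distributed with the same spread for every prediction, and the data here are synthetic. The helper name normal_range is hypothetical. It uses 1.96 standard deviations, which covers about 95% of a normal distribution.

```python
import numpy as np


def normal_range(errors, predicted_minutes, z=1.96):
    """Return the range predicted + mean(error) +/- z * std(error)."""
    errors = np.asarray(errors, dtype=float)
    centre = predicted_minutes + errors.mean()
    half_width = z * errors.std(ddof=1)  # sample standard deviation
    return centre - half_width, centre + half_width


# Synthetic errors for illustration only (NOT the real commute data)
rng = np.random.default_rng(0)
synthetic_errors = rng.normal(loc=3.0, scale=4.0, size=5000)

low, high = normal_range(synthetic_errors, 30)
print(f"Approximate 95% range for a 30 minute prediction: {low:.1f} to {high:.1f}")
```

With the real data, errors could be computed as commutes['duration'] - commutes['prediction']. If the spread grows with the predicted duration, the constant-spread assumption fails and quantile regression is likely the better fit.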