Best Fit Line On Log Log Scales In Python 2.7
Solution 1:
Data that falls along a straight line on a log-log scale follows a power relationship of the form y = c*x^(m)
. By taking the logarithm of both sides, you get the linear equation that you are fitting:
log(y)= m*log(x)+c
Calling np.polyfit(log(x), log(y), 1)
provides the values of m
and c
. You can then use these values to calculate the fitted values of log_y_fit
as:
log_y_fit = m*log(x) + c
and the fitted values that you want to plot against your original data are:
y_fit =exp(log_y_fit)=exp(m*log(x)+c)
So, the two problems you are having are that:
you are calculating the fitted values using the original x coordinates, not the log(x) coordinates
you are plotting the logarithm of the fitted y values without transforming them back to the original scale
I've addressed both of these in the code below by replacing plt.plot(z, np.poly1d(np.polyfit(logA, logB, 1))(z))
with:
m,c= np.polyfit(logA, logB,1)# fit log(y) = m*log(x) + c
y_fit = np.exp(m*logA +c)# calculate the fitted values of y
plt.plot(z, y_fit,':')
This could be placed on one line as: plt.plot(z, np.exp(np.poly1d(np.polyfit(logA, logB, 1))(logA)))
, but I find that makes it harder to debug.
A few other things that are different in the code below:
You are using a list comprehension when you calculate
logA
fromz
to filter out any values < 1, butz
is a linear range and only the first value is < 1. It seems easier to just createz
starting at 1 and this is how I've coded it.I'm not sure why you have the term
x*log(x)
in your list comprehension forlogA
. This looked like an error to me, so I didn't include it in the answer.
This code should work correctly for you:
fig=plt.figure()
ax = fig.add_subplot(111)
z=np.arange(1, len(x)+1) #startat1, to avoid error fromlog(0)
logA = np.log(z) #no need for list comprehension since all z values>=1
logB = np.log(y)
m, c = np.polyfit(logA, logB, 1) # fit log(y) = m*log(x) + c
y_fit = np.exp(m*logA + c) # calculate the fitted valuesof y
plt.plot(z, y, color ='r')
plt.plot(z, y_fit, ':')
ax.set_yscale('symlog')
ax.set_xscale('symlog')
#slope, intercept = np.polyfit(logA, logB, 1)
plt.xlabel("Pre_referer")
plt.ylabel("Popularity")
ax.set_title('Pre Referral URL Popularity distribution')
plt.show()
When I run it on simulated data, I get the following graph:
Notes:
The 'kinks' on the left and right ends of the line are the result of using "symlog" which linearizes very small values as described in the answers to What is the difference between 'log' and 'symlog'? . If this data was plotted on "log-log" axes, the fitted data would be a straight line.
You might also want to read this answer: https://stackoverflow.com/a/3433503/7517724, which explains how to use weighting to achieve a "better" fit for log-transformed data.
Solution 2:
I figured out another solution for this problem. Sharing this because it might be helpful.
fig=plt.figure()
ax = fig.add_subplot(111)
z=np.arange(len(x)) + 1print z
print y
rank = [np.log10(i) for i in z]
freq = [np.log10(i) for i in y]
m, b, r_value, p_value, std_err = stats.linregress(rank, freq)
print"slope: ", m
print"r-squared: ", r_value**2print"intercept:", b
plt.plot(rank, freq, 'o',color = 'r')
abline_values = [m * i + b for i in rank]
plt.plot(rank, abline_values)
This essentially achieves the objective as well. It uses the stats module.
Post a Comment for "Best Fit Line On Log Log Scales In Python 2.7"