
Python - How To Normalize Time-series Data

I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples; however, I do not want to take into account differences due to scale.

Solution 1:

The solutions given are good for series that are neither incremental nor decremental (i.e. stationary). For financial time series (or any other series with a bias) the formula given is not right: the series should first be detrended, or the scaling should be based on the latest 100-200 samples. And if the time series doesn't come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (a standard CDF, for example) to compress the outliers. The Aronson and Masters book (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):

V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50

Where:

- X: data point
- F50: median of the latest 200 points (50th percentile)
- F75: 75th percentile
- F25: 25th percentile
- N: the standard normal CDF
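As a sketch, the formula can be computed for the most recent point of a series with NumPy and SciPy (the function name aronson_normalize and the sample data are illustrative, not from the book):

```python
import numpy as np
from scipy.stats import norm

def aronson_normalize(x, window=200):
    """Normalize the last point of x against the latest `window` samples
    (a sketch of the book's formula; the helper name is hypothetical)."""
    chunk = np.asarray(x)[-window:]
    f50 = np.percentile(chunk, 50)  # median, not mean
    f75 = np.percentile(chunk, 75)
    f25 = np.percentile(chunk, 25)
    # compress through the normal CDF, then map into (-50, 50)
    return 100 * norm.cdf(0.5 * (chunk[-1] - f50) / (f75 - f25)) - 50
```

The CDF guarantees the output stays inside (-50, 50) no matter how extreme the last point is.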

Solution 2:

Assuming that your timeseries is an array, try something like this:

(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())

This will confine your values between 0 and 1.
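For instance, with NumPy (the sample array is made up for illustration):

```python
import numpy as np

timeseries = np.array([3.0, 7.0, 5.0, 10.0, 4.0])  # made-up example data

# min-max scaling: the minimum maps to 0.0, the maximum to 1.0
scaled = (timeseries - timeseries.min()) / (timeseries.max() - timeseries.min())
```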

Solution 3:

Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization. It needs a pandas DataFrame as input and doesn't check for that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array, convert it to a pandas.DataFrame() first.

This function is slow, so it's advisable to run it just once and store the results.

    from scipy.stats import norm
    import pandas as pd

    def get_NormArray(df, n, mode='total', linear=False):
        '''
        It computes the normalized value on the stats of n values (modes: total or scale)
        using the formulas from the book "Statistically Sound Machine Learning..."
        (Aronson and Masters), but the decision to apply a non-linear scaling is left
        to the user. It is modified to fit the data from -1 to 1 instead of -100 to 100.
        df is an input DataFrame; it returns a DataFrame, but it could return a list.
        n defines the number of data points used to get the median and the quartiles
        for the normalization.
        modes: scale: scale without centering. total: center and scale.
        '''
        temp = []

        for i in range(len(df))[::-1]:

            if i >= n:
                # there will be a traveling norm until we reach the initial n values;
                # those values will be normalized using the last computed F50, F75 and F25
                F50 = df[i-n:i].quantile(0.5)
                F75 = df[i-n:i].quantile(0.75)
                F25 = df[i-n:i].quantile(0.25)

            if linear == True and mode == 'total':
                v = 0.5 * ((df.iloc[i] - F50) / (F75 - F25)) - 0.5
            elif linear == True and mode == 'scale':
                v = 0.25 * df.iloc[i] / (F75 - F25) - 0.5
            elif linear == False and mode == 'scale':
                v = 0.5 * norm.cdf(0.25 * df.iloc[i] / (F75 - F25)) - 0.5
            else:
                # even if strange values are given, it performs full normalization
                # with compression as the default
                v = norm.cdf(0.5 * (df.iloc[i] - F50) / (F75 - F25)) - 0.5

            temp.append(v[0])
        return pd.DataFrame(temp[::-1])
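If you'd rather avoid the explicit loop, roughly the same computation (for the default mode) can be sketched with pandas rolling windows; the sample series here is invented, and shift(1) keeps each window strictly in the past, as the loop above does with df[i-n:i]:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

s = pd.Series(np.linspace(0.0, 10.0, 300))  # made-up sample data
n = 200

# rolling stats over the previous n points, excluding the current one
f50 = s.rolling(n).median().shift(1)
f75 = s.rolling(n).quantile(0.75).shift(1)
f25 = s.rolling(n).quantile(0.25).shift(1)

# norm.cdf keeps every normalized value inside (-0.5, 0.5);
# the first n entries are NaN because no full window precedes them
v = norm.cdf(0.5 * (s - f50) / (f75 - f25)) - 0.5
```

Unlike the loop, this version doesn't reuse the last computed quantiles for the initial n values; it simply leaves them as NaN.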

Solution 4:

I'm not going to give the Python code, but the definition of normalizing is that for every value (data point) you calculate (value - mean) / stdev. Your values will not fall between 0 and 1 (or 0 and 100), but I don't think that's what you want. You want to compare the variation, which is what you are left with if you do this.
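A minimal z-score sketch of the above (the sample values are invented):

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

# (value - mean) / stdev; NumPy's std() is the population std by default
z = (values - values.mean()) / values.std()
# the result has mean 0 and (population) standard deviation 1
```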

Solution 5:

from sklearn import preprocessing

normalized_data = preprocessing.minmax_scale(data)

You can take a look at normalize-standardize-time-series-data-python and sklearn.preprocessing.minmax_scale.
