How To Fill NaN Values Of A Column Using The Mean Of Surrounding (top And Bottom) Values Of That Column?

I have a df which has some NaN values. For example here is the df: import numpy as np import pandas as pd np.random.seed(100) data = np.random.rand(10,3) data[3,0] = np.NaN data[6

Solution 1:

I think DataFrame.interpolate should help here:

df1 = df.interpolate()
print (df1)
          0         1         2
0  0.543405  0.278369  0.424518
1  0.844776  0.004719  0.280612
2  0.670749  0.825853  0.136707
3  0.428039  0.891322  0.209202
4  0.185328  0.108377  0.219697
5  0.978624  0.191225  0.171941
6  0.959327  0.274074  0.254026
7  0.940030  0.323453  0.336112
8  0.175410  0.372832  0.175683
9  0.252426  0.795663  0.015255

If there are multiple consecutive NaNs, interpolate fills them linearly between the surrounding values; it does not replace them with the mean:

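import numpy as np
import pandas as pd

np.random.seed(100)
data = np.random.rand(10,3)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
data[3,0] = np.nan
data[6,0] = np.nan

data[5,1] = np.nan
data[7,1] = np.nan

# rows 1 and 2 form a run of two consecutive NaNs in column 2
data[1,2] = np.nan
data[2,2] = np.nan
data[8,2] = np.nan
data[6,2] = np.nan

df = pd.DataFrame(data)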
print (df)
          0         1         2
0  0.543405  0.278369  0.424518
1  0.844776  0.004719       NaN
2  0.670749  0.825853       NaN
3       NaN  0.891322  0.209202
4  0.185328  0.108377  0.219697
5  0.978624       NaN  0.171941
6       NaN  0.274074       NaN
7  0.940030       NaN  0.336112
8  0.175410  0.372832       NaN
9  0.252426  0.795663  0.015255

df1 = df.interpolate()
print (df1)
          0         1         2
0  0.543405  0.278369  0.424518
1  0.844776  0.004719  0.352746
2  0.670749  0.825853  0.280974
3  0.428039  0.891322  0.209202
4  0.185328  0.108377  0.219697
5  0.978624  0.191225  0.171941
6  0.959327  0.274074  0.254026
7  0.940030  0.323453  0.336112
8  0.175410  0.372832  0.175683
9  0.252426  0.795663  0.015255

Solution for the mean of the nearest valid values above and below:

df2 = df.ffill().add(df.bfill()).div(2)
print (df2)
          0         1         2
0  0.543405  0.278369  0.424518
1  0.844776  0.004719  0.316860
2  0.670749  0.825853  0.316860
3  0.428039  0.891322  0.209202
4  0.185328  0.108377  0.219697
5  0.978624  0.191225  0.171941
6  0.959327  0.274074  0.254026
7  0.940030  0.323453  0.336112
8  0.175410  0.372832  0.175683
9  0.252426  0.795663  0.015255
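
The difference between the two approaches is easiest to see on a tiny column. The following is a minimal sketch, not part of the original answer; the Series `s` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: two consecutive NaNs between 1.0 and 4.0
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation spaces the gap evenly between the neighbours
interpolated = s.interpolate()

# ffill/bfill averaging gives every NaN in the gap the same value:
# the mean of the nearest valid value above and the nearest one below
neighbour_mean = s.ffill().add(s.bfill()).div(2)

print(interpolated.tolist())    # [1.0, 2.0, 3.0, 4.0]
print(neighbour_mean.tolist())  # [1.0, 2.5, 2.5, 4.0]
```

A NaN in the first or last row has no valid neighbour on one side, so the ffill/bfill sum stays NaN there.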

Solution 2:

Using interpolate per your specifications (only one index row away):

df.interpolate(method='index', limit=1)

Or doing it directly using combine_first:

# fillna(method=...) is deprecated in recent pandas; ffill/bfill accept limit directly
fills = 0.5 * (df.ffill(limit=1)
               + df.bfill(limit=1))
df.combine_first(fills)
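
To make the effect of `limit=1` concrete, here is a small sketch on a made-up Series `s` (not from the original answer), written with the `ffill`/`bfill` methods. It assumes the goal is to fill only isolated NaNs and leave longer runs untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical column: one isolated NaN, then a run of two NaNs
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# Each fill may travel at most one row, so inside the run of two the
# forward fill reaches only the first NaN and the backward fill only the second;
# their sum is therefore NaN at both positions
fills = 0.5 * (s.ffill(limit=1) + s.bfill(limit=1))
result = s.combine_first(fills)

print(result.tolist())  # [1.0, 2.0, 3.0, nan, nan, 9.0]
```

Only the isolated NaN at position 1 is replaced by the mean of its two neighbours; the run at positions 3 and 4 stays NaN.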

Solution 3:

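Alternatively, use sklearn's mean imputer. Note that it fills each NaN with the mean of its whole column, not the mean of the values directly above and below: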

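import numpy as np
from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer replaces it and always imputes column-wise (no axis argument)
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

mean_imputer = mean_imputer.fit(df.values)

imputed_df = mean_imputer.transform(df.values)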


imputed_df

array([[0.54340494, 0.27836939, 0.42451759],
       [0.84477613, 0.00471886, 0.21620453],
       [0.67074908, 0.82585276, 0.13670659],
       [0.5738436 , 0.89132195, 0.20920212],
       [0.18532822, 0.10837689, 0.21969749],
       [0.97862378, 0.44390102, 0.17194101],
       [0.5738436 , 0.27407375, 0.21620453],
       [0.94002982, 0.44390102, 0.33611195],
       [0.17541045, 0.37283205, 0.21620453],
       [0.25242635, 0.79566251, 0.01525497]])
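
In the array above, every NaN within a column receives the same value (for example 0.5738436 in the first column), because mean imputation uses the mean of the whole column rather than the neighbouring rows. A minimal pandas-only sketch of that difference follows. The frame `df_demo` is made up, and `fillna(df_demo.mean())` is used here as a stand-in for the sklearn imputer, not as the answer's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a single NaN between 2.0 and 4.0
df_demo = pd.DataFrame({"a": [10.0, 2.0, np.nan, 4.0]})

# Column-mean fill: the NaN gets the mean of all valid values, (10 + 2 + 4) / 3
column_mean = df_demo.fillna(df_demo.mean())

# Neighbour-mean fill: the NaN gets the mean of the rows above and below, (2 + 4) / 2
neighbour_mean = df_demo.ffill().add(df_demo.bfill()).div(2)

print(column_mean["a"].tolist())
print(neighbour_mean["a"].tolist())
```

The column mean is pulled up by the distant value 10.0, while the neighbour mean depends only on the two adjacent rows, which is what the question asks for.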
