Skip to content Skip to sidebar Skip to footer

Delete Specific Rows From Csv Using Pandas

I have a csv file in the format shown below: I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just

Solution 1:

sample DataFrame built with @andrew_reece's code

In [9]: df
Out[9]:
           center         left         right  steering  throttle  brake
0   center_54.jpg  left_75.jpg  right_39.jpg         1         0      0
1   center_20.jpg  left_81.jpg  right_49.jpg         3         1      1
2   center_34.jpg  left_96.jpg  right_11.jpg         0         4      2
3   center_98.jpg  left_87.jpg  right_34.jpg         0         0      0
4   center_67.jpg  left_12.jpg  right_28.jpg         1         1      0
5   center_11.jpg  left_25.jpg  right_94.jpg         2         1      0
6   center_66.jpg  left_27.jpg  right_52.jpg         1         3      3
7   center_18.jpg  left_50.jpg  right_17.jpg         0         0      4
8   center_60.jpg  left_25.jpg  right_28.jpg         2         4      1
9   center_98.jpg  left_97.jpg  right_55.jpg         3         3      0
..            ...          ...           ...       ...       ...    ...
90  center_31.jpg  left_90.jpg  right_43.jpg         0         1      0
91  center_29.jpg   left_7.jpg  right_30.jpg         3         0      0
92  center_37.jpg  left_10.jpg  right_15.jpg         1         0      0
93  center_18.jpg   left_1.jpg  right_83.jpg         3         1      1
94  center_96.jpg  left_20.jpg  right_56.jpg         3         0      0
95  center_37.jpg  left_40.jpg  right_38.jpg         0         3      1
96  center_73.jpg  left_86.jpg  right_71.jpg         0         1      0
97  center_85.jpg  left_31.jpg   right_0.jpg         3         0      4
98  center_34.jpg  left_52.jpg  right_40.jpg         0         0      2
99  center_91.jpg  left_46.jpg  right_17.jpg         0         0      0

[100 rows x 6 columns]

In [10]: df.steering.value_counts()
Out[10]:
0    43    # NOTE: 43 zeros
1    18
2    15
4    12
3    12
Name: steering, dtype: int64

In [11]: df.shape
Out[11]: (100, 6)

your solution (unchanged):

In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)

In [13]: df.steering.value_counts()
Out[13]:
1    18
2    15
4    12
3    12
0     4        # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64

In [14]: df.shape
Out[14]: (61, 6)

NOTE: make sure that steering column has numeric dtype! If it's a string (object) then you would need to change your code as follows:

df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
#  NOTE:                         ^ ^

after that you can save the modified (reduced) DataFrame to CSV:

df.to_csv('/path/to/filename.csv', index=False)

Solution 2:

Here's a one-line approach, using concat() and sample():

import numpy as np
import pandas as pd

# first, some sample data

# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)

# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)

print(df.shape)
(100,3)

The first few rows of sample output:

df.head()
           center         left         right  steering  throttle  brake
0   center_72.jpg  left_26.jpg  right_59.jpg         3         3      0
1   center_75.jpg  left_68.jpg  right_26.jpg         0         0      2
2   center_29.jpg   left_8.jpg  right_88.jpg         0         1      0
3   center_22.jpg  left_26.jpg  right_23.jpg         1         0      0
4   center_88.jpg   left_0.jpg  right_56.jpg         4         1      0
5   center_93.jpg  left_18.jpg  right_15.jpg         0         0      0

Now drop all but 10% of rows with steering==0:

newdf = pd.concat([df.loc[df.steering!=0], 
                   df.loc[df.steering==0].sample(frac=0.1)])

With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.


Solution 3:

Using a mask on steering combined with a random number should work:

df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]

This does generate some extra random values, but it's nice and compact.

Edit: That said, I tried your example code and it worked as well. My guess is the error is coming from the fact that your df.query() statement is returning an empty dataframe, which probably means that the "sample" column does not contain any zeros, or alternatively that the column is read as strings rather than numeric. Try converting the column to integer before running the above snippet.


Post a Comment for "Delete Specific Rows From Csv Using Pandas"