
Creating A Representative Sample From A Large CSV

I have the following dataset:

head -2 trip_data_1.csv
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_

Solution 1:

You can use awk like this:

awk 'rand()>0.9' trip_data_1.csv

It generates a random number between 0 and 1 as it reads each record, and prints the record if that number is > 0.9 - so it prints roughly 10% of your records on average.
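A quick sanity check of that claim, using a generated throwaway file (filenames here are made up, not the taxi data):

```shell
# Build a 10,000-line file and sample it the same way as above.
seq 10000 > data.txt
awk 'rand()>0.9' data.txt > sample.txt
wc -l sample.txt    # roughly 10% of the lines
```

The exact count drifts from run to run (and from one awk implementation to another), which is expected for probabilistic sampling.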

If you want the header as well, use:

awk 'FNR==1 || rand()>0.9' trip_data_1.csv

If you want it truly random, rather than predictably random (without srand(), awk starts from the same seed on every run) :-)

awk 'BEGIN{srand()} FNR==1 || rand()>0.9' trip_data_1.csv
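Conversely, passing an explicit seed to srand() makes the sample reproducible, which is useful when someone else needs to regenerate the same subset. A small sketch with made-up filenames and an arbitrary seed:

```shell
# Same seed, same awk binary => identical sample both times.
seq 1000 > nums.txt
awk 'BEGIN{srand(7)} rand()>0.9' nums.txt > a.txt
awk 'BEGIN{srand(7)} rand()>0.9' nums.txt > b.txt
cmp -s a.txt b.txt && echo "reproducible"
```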

Solution 2:

Get a random sample:

sort -R filename | head -n $(($(wc -l filename | awk '{print $1}') / 10))
# random sort    | take the first 10% (line count divided by 10)

You have to remove the CSV header first, and then attach it back afterwards. That is left as an exercise :)
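One way to do that exercise, sketched with a toy CSV (filenames are made up): keep the first line aside, shuffle only the body with sort -R as above, then concatenate.

```shell
# Build a toy CSV: a header plus 100 data rows.
{ echo 'id,value'; seq 100 | sed 's/$/,x/'; } > data.csv

# Header first, then 10% of the shuffled body appended after it.
head -n 1 data.csv > sample.csv
tail -n +2 data.csv | sort -R | head -n 10 >> sample.csv
```

Note that sort -R is a GNU sort option; on other systems you may need GNU coreutils installed.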

For efficiency reasons, you might want to implement this with a native app, since sort -R has to shuffle the entire file just to keep 10% of it.
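Short of a native app, a single awk pass already avoids the full shuffle: reservoir sampling keeps an exact-size sample while reading the file once. A sketch (the sample size k and filenames are made up for illustration):

```shell
# One-pass reservoir sampling (Algorithm R): keep an exact k-line sample.
seq 100000 > big.txt
awk -v k=1000 '
  BEGIN { srand() }
  NR <= k { pool[NR] = $0; next }       # fill the reservoir first
  {
    i = int(rand() * NR) + 1            # pick a slot in 1..NR
    if (i <= k) pool[i] = $0            # keep this line with probability k/NR
  }
  END { for (j = 1; j <= k; j++) print pool[j] }
' big.txt > sample.txt
```

Unlike the rand()>0.9 filter, this yields exactly k lines regardless of input size, and unlike sort -R it never holds more than k lines in memory.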
