Creating A Representative Sample From A Large CSV
I have the following dataset:

head -2 trip_data_1.csv
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_
Solution 1:
You can use awk like this:
awk 'rand()>0.9' trip_data_1.csv
This generates a random number between 0 and 1 as each record is read and prints the record if that number is greater than 0.9, so it prints roughly 10% of your records on average.
If you want the header as well, use:
awk 'FNR==1 || rand()>0.9' trip_data_1.csv
If you want a different sample on each run, rather than a predictably random one :-) (awk's rand() uses the same seed by default), seed it first:
awk 'BEGIN{srand()} FNR==1 || rand()>0.9' trip_data_1.csv
Solution 2:
Get a random sample:
sort -R filename | head -n $(( $(wc -l < filename) / 10 ))
# shuffle the lines, then keep the first 10% (line count divided by 10)
You have to remove the CSV header first and then attach it back afterwards; that is left as an exercise :)
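One way to do that exercise, as a sketch: keep the header with head, shuffle only the data rows, and append the sampled 10% to the output. The file names here (trip_data_1.csv, sample.csv) follow the question; note that sort -R is a GNU coreutils option and may be absent on BSD/macOS sort.

```shell
# write the header line to the output file first
head -n 1 trip_data_1.csv > sample.csv
# shuffle the data rows (everything after line 1) and append ~10% of them
tail -n +2 trip_data_1.csv | sort -R \
  | head -n "$(( $(tail -n +2 trip_data_1.csv | wc -l) / 10 ))" >> sample.csv
```

The resulting sample.csv has the original header plus a uniformly shuffled tenth of the data rows.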
For efficiency reasons, you might want to implement this with a native app.