
How To Customize Pytorch Data

I am trying to make a customized DataLoader using PyTorch. I've seen some code like this (class omitted, sorry): def __init__(self, data_root, transform=None, training=True, return_

Solution 1:

First, you want to customize (subclass) data.Dataset, not data.DataLoader. That is perfectly fine for your use case.

What you can do, instead of loading all the data into RAM, is read and store "meta data" in __init__ and open only the relevant csv file whenever __getitem__ is called for a specific entry.
A sketch of your Dataset would look something like:

import bisect
import csv
from torch.utils import data

class ManyCSVsDataset(data.Dataset):
  def __init__(self, csv_paths):
    super(ManyCSVsDataset, self).__init__()
    # store the paths of all csvs and a cumulative row count per file,
    # instead of loading everything into RAM
    self.paths = list(csv_paths)
    self.cum_counts = []
    total = 0
    for path in self.paths:
      with open(path, 'r') as f:
        total += sum(1 for _ in f)
      self.cum_counts.append(total)
    self.num_items = total

  def __len__(self):
    return self.num_items

  def __getitem__(self, index):
    # based on the index, determine which csv file holds the matching row
    file_idx = bisect.bisect_right(self.cum_counts, index)
    row_idx = index - (self.cum_counts[file_idx - 1] if file_idx else 0)
    with open(self.paths[file_idx], 'r') as f:
      for i, row in enumerate(csv.reader(f)):
        if i == row_idx:
          return row
    raise IndexError(index)
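The index-to-file lookup inside __getitem__ can be exercised on its own. A minimal sketch, assuming three hypothetical csv files with 3, 2, and 4 rows, so the cumulative counts are [3, 5, 9]:

```python
import bisect

# hypothetical cumulative row counts for three csv files:
# file 0 has 3 rows, file 1 has 2 rows, file 2 has 4 rows
counts = [3, 5, 9]

def locate(index):
    # which file holds global row `index`, and at what local offset?
    file_idx = bisect.bisect_right(counts, index)
    row_idx = index - (counts[file_idx - 1] if file_idx else 0)
    return file_idx, row_idx

print(locate(0))  # (0, 0): first row of file 0
print(locate(4))  # (1, 1): second row of file 1
print(locate(8))  # (2, 3): last row of file 2
```

bisect keeps the lookup O(log number_of_files) per item, which matters once the metadata covers many files.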

This implementation is not efficient in the sense that it reads the same csv file over and over and caches nothing. On the other hand, you can take advantage of data.DataLoader's multi-processing support to have many parallel worker processes doing all this file access in the background while you actually use the data for training.
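A minimal, self-contained sketch of that parallel loading, using a simplified stand-in dataset (the file names, the 3-rows-per-file layout, and the integer contents are all hypothetical, chosen just so the example runs end to end):

```python
import csv
import os
import tempfile
import torch
from torch.utils.data import DataLoader, Dataset

# hypothetical setup: write two small csv files to a temp directory
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    path = os.path.join(tmpdir, "part%d.csv" % i)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for j in range(3):
            writer.writerow([i * 3 + j, (i * 3 + j) * 2])
    paths.append(path)

class TinyCSVsDataset(Dataset):
    # simplified stand-in for ManyCSVsDataset: assumes a fixed
    # number of rows per file, so the lookup is a single divmod
    def __init__(self, paths, rows_per_file=3):
        self.paths = paths
        self.rows_per_file = rows_per_file

    def __len__(self):
        return len(self.paths) * self.rows_per_file

    def __getitem__(self, index):
        file_idx, row_idx = divmod(index, self.rows_per_file)
        with open(self.paths[file_idx], newline="") as f:
            row = list(csv.reader(f))[row_idx]
        return torch.tensor([int(v) for v in row])

# num_workers > 0 spawns worker processes that each call __getitem__,
# so the repeated csv reads overlap with the training loop
loader = DataLoader(TinyCSVsDataset(paths), batch_size=2, num_workers=2)
batches = list(loader)
print(len(batches))  # 6 rows / batch_size 2 -> 3 batches
```

On platforms that use the spawn start method (e.g. Windows), the DataLoader iteration should live under an `if __name__ == "__main__":` guard.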

