
How To Handle The Frequent Changes In Dataset In Azure Machine Learning Studio?

How can I handle frequent changes to a dataset in Azure Machine Learning Studio? My dataset may change over time, and I need to add more rows to it. How do I refresh the data?

Solution 1:

When you register an AzureML Dataset, no data is moved; only some metadata, such as where the data lives and how it should be loaded, is stored. The purpose is to make accessing the data as simple as calling dataset = Dataset.get_by_name(workspace, name="my dataset")

In the snippet below (full example), if I register the dataset, I could technically overwrite weather/2018/11.csv with a new version after registering. My Dataset definition would stay the same, but the new data would be picked up if you use the Dataset in training after the overwrite.

from azureml.core import Dataset

# `datastore` is assumed to be an existing Datastore object from your workspace
# create a TabularDataset from 3 paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

However, there are two more recommended approaches (my team does both):

  1. Isolate your data and register a new version of the Dataset, so that you can always roll back to a previous version. See Dataset Versioning Best Practice.
  2. Use a wildcard/glob datapath to refer to a folder that has new data loaded into it on a regular basis. In this way you can have a Dataset that is growing in size over time without having to re-register.
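Both approaches can be sketched roughly as follows, assuming the v1 azureml-core SDK. The helper names register_new_version and load_version are hypothetical, as is the dataset name "weather"; workspace and datastore are assumed to be objects you have already retrieved (for example via Workspace.from_config() and its get_default_datastore() method).

```python
def register_new_version(workspace, datastore, name, paths):
    """Register `paths` as a new version of the dataset called `name`.

    Hypothetical helper: `paths` are datastore-relative paths and may
    contain wildcards such as 'weather/2019/*.csv'.
    """
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from azureml.core import Dataset

    dataset = Dataset.Tabular.from_delimited_files(
        path=[(datastore, p) for p in paths]
    )
    # create_new_version=True keeps earlier versions available for roll-back.
    return dataset.register(
        workspace=workspace, name=name, create_new_version=True
    )


def load_version(workspace, name, version="latest"):
    """Fetch a registered dataset; pass an integer version to roll back."""
    from azureml.core import Dataset

    return Dataset.get_by_name(workspace, name=name, version=version)
```

For approach 1, call register_new_version(ws, datastore, "weather", [...]) each time you isolate a new batch of files, and use load_version(ws, "weather", version=1) to return to an earlier snapshot. For approach 2, register once with a wildcard path such as 'weather/2019/*.csv'; files that later land in that folder are read the next time the dataset is loaded, with no re-registration.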

Solution 2:

Does that work for you? https://stackoverflow.com/a/60639631/12925558

You can manipulate the dataset object.
