Skip to content Skip to sidebar Skip to footer

Coding Variables With Pandas Timeseries

As a follow up to something I was struggling with in a previous question, I've been working for a long time on an analysis of some pretty complicated behavioural data from a mouse-

Solution 1:

I've been asked via email if I ever found a solution to what I wanted to do here, so I'm sharing what I've been doing to date. This might not be the canonical way of using pandas, but it's sufficed for me.

In short, I've split my data into a couple of data frames. The first, data, is as above, but I only use the columns which correspond to single values, like trial, stimuli, resp, and rt.

For my time series data, I use two additional data frames, one for the x-coordinate data, and one for the y. Although there's probably a more elegant way of generating these, my code does the following.

data.iloc[0]

    time_stamp                                     21/11/201313:06
    subject                                                 1276270
    trial                                                         0
    stimuli                                                      14
    resp                                                          2
    rt                                                         1145
    x             [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
    y             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
    t             [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
    Name: 0, dtype: object

data['nx'], data['ny'] = zip(*
     [even_time_steps(x, y, t)
     for x, y, t, in zip(data.x, data.y, data.t)])
     # Using function even_time_steps from package squeak
     # https://github.com/EoinTravers/Squeak 
     # Simpler applications could use
     # data['nx'] = [pd.TimeSeries(x) for y in data['x']]
     # data['ny'] = [pd.TimeSeries(x) for y in data['y']]

# Seperate DataFrames
nx = pd.concat(list(data.nx), axis=1).T
ny = pd.concat(list(data.ny), axis=1).T

# Remove redundant columns
redundant = ['nx', 'ny', 'x', 'y'] # etc...
data = data.drop(redundant, axis=1)

# Important - reindex data
data.index = range(len(data)) # 0, 1, 2, ..., len(data)

Now data contains all my coding information, nx all my x-coordinate information, and ny my y coordinate information.

nx.head()

       0    1    2    3    4    5    6    7    8        9     ...          91
    0    0    0    0    0    0    0    0    0    0  0.00000   ...     0.953960   
    1    0    0    0    0    0    0    0    0    0  0.00099   ...     1.000000   
    2    0    0    0    0    0    0    0    0    0  0.00000   ...     1.010000   
    3    0    0    0    0    0    0    0    0    0  0.00000   ...     0.870396   
    4    0    0    0    0    0    0    0    0    0  0.00000   ...     1.000000   
             92        93        94       95        96   97   98   99   100  
    0  0.993564  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    1  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    2  1.010000  1.008812  1.003960  1.00000  1.000000    1    1    1    1  
    3  0.906238  0.936931  0.973564  0.98604  0.993366    1    1    1    1  
    4  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    [5 rows x 101 columns]

Finally, to select specific subsets of the x and y data, according to coding variables stored in data, I just take the index of the relevant subset of data

subject1_index = data[data.subject==1].index
print subject1_index

    Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    18, 19, 20,  21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
    36, 37, 38, 39], dtype='int64')

and select a matching subset of nx and ny using the iloc method.

sub1_x = nx.iloc[subject1_index]
sub1_y = ny.iloc[subject1_index]
foriin subject1_index:
    plt.plot(nx.iloc[i], ny.iloc[i], 'r', alpha=.3)
plt.plot(sub1_x.mean(), sub1_y.mean(), 'r', linewidth=2)

enter image description here


EDIT: For completeness, note that a lot of my analysis requires long format data (and is carried out in R). Again, there may be a more elegant way of doing this (so use at your own risk!), but my code goes (note, this is real code, from a different dataset, and I haven't bothered to change the variable names to match the original example):

# Long format data
wide_data = data.copy()
steps = nx.columns
for i in steps:
    wide_data['nx_%i' % i] = nx[i]
    wide_data['ny_%i' % i] = ny[i]

id_vars = ['subject_nr', 'condition', 'count_trial_sequence',
    'trial_id', 'choice', 'accuracy']

# Long data with 'nx' as the variable
long_data = pd.melt(wide_data, id_vars=id_vars, value_vars = ['nx_%i' % i for i in steps])
long_data['step'] = long_data.variable.map(lambda s: int(s[3:]))
long_data['nx'] = long_data.value

# Same with 'ny'
tmp_long = pd.melt(wide_data, id_vars=id_vars, value_vars = ['ny_%i' % i for i in steps])
# Combine in single data frame
long_data['ny'] = tmp_long['value']
del tmp_long

long_data = long_data.drop(['variable', 'value'], axis=1)
long_data.to_csv(os.path.join('data', 'long_data.csv'))

long_data.head()
Out[41]: 
       subject_nr      condition  count_trial_sequence  trial_id choice accuracy  
    0505250022              A                     013   rsp1     True1505250022              A                     116   rsp1     True2505250022              B                     22   rsp2    False3505250022              B                     30   rsp1    False4505250022              C                     433   rsp2    False   

       step  nx  ny  
    00001000200030004000

Post a Comment for "Coding Variables With Pandas Timeseries"