Select Specific Rows And Cells In Text File And Put Into Data Frame: Python Or R
Either python or R is fine to use for this but could someone advise me on how to select the 'Basic stats' rows a text file that looks like the one below. I want this information an
Solution 1:
With R, using:
# read the text file
txt <- readLines('https://dl.dropboxusercontent.com/u/45095175/rois_all.txt')# create an index for the lines that are needed
ti <-rep(which(grepl('ROI:', txt)), each =3)+1:3# create a grouping vector of the same length
grp <-rep(1:33, each =3)# filter the text with the index 'ti' # and split into a list with grouping variable 'grp'
lst <- split(txt[ti], grp)# loop over the list a read the text parts in as dataframes
lst <- lapply(lst,function(x) read.table(text = x, sep ='\t', header =TRUE,
blank.lines.skip =TRUE))# bind the dataframes in the list together in one data.frame
DF <- do.call(rbind, lst)# change the name of the first columnnames(DF)[1]<-'ROI'# get the correct ROI's for the ROI-column
DF$ROI <- sub('.*: (\\w+).*$','\\1', txt[grepl('ROI: ', txt)])
gives:
> DF ROI Min Max Mean Stdev
1 mrc_ranch_house -20.208261 6.025762 -8.866403 5.289712
2 river_1 -20.187374 -6.694543 -12.227586 2.664640
3 river_2 -18.365091 -5.820825 -13.164463 2.851231
4 river_3 -18.291010 -4.583666 -12.092995 3.479293
5 river_4 -17.074295 -4.926921 -9.970926 2.897855
6 river_5 -16.849176 -8.622208 -12.387085 2.168462
7 adjacent_river_2 -18.987597 -7.957749 -13.392523 1.962263
8 adjacent_river_3 -19.426531 -8.640042 -13.467425 1.888105
9 adjacent_river_4 -20.452566 -6.830183 -12.833450 2.124761
10 bcs_1_ -23.612043 -8.221417 -16.032305 2.080695
11 bcs_2_ -24.018219 -10.648975 -16.814048 1.948863
12 bcs_3_ -23.011086 -9.106754 -15.404174 1.867498
13 red_1_ -22.313442 -7.839107 -14.768196 2.134152
14 red_2_ -22.551537 -7.236300 -14.613618 2.204253
15 red_3_ -22.057703 -7.746992 -14.483161 2.123497
16 bcs_4 -22.705107 -8.972753 -15.201623 1.817122
17 bcs_5 -24.109459 -10.113716 -15.776537 1.849163
18 glade_1_ -19.913187 -6.189866 -12.695884 3.303929
19 glade_2_ -19.812855 -4.672865 -11.995191 4.840168
20 glade_3_ -10.078033 -2.828722 -5.877417 1.941401
21 mwea_b -13.979379 -4.977155 -11.392434 2.019037
22 kaga -13.114172 -8.889531 -10.649324 1.290551
23 huku -14.206743 -7.853305 -10.608210 1.441250
24 ruai -18.643108 -12.645180 -14.540123 1.224183
25 tumaini -19.543234 -13.164941 -15.899968 1.812876
26 nkando -19.973492 -7.040238 -11.716987 2.617544
27 jikaze -16.408030 -9.001065 -12.323898 1.942196
28 miarage_b -15.126486 -6.661448 -10.391111 1.764279
29 batian -15.269146 -9.603316 -11.962470 1.168859
30 gitaraga -17.037708 -7.495215 -10.886802 2.561877
31 wiumiririe -9.578024 -6.225223 -7.688715 1.059796
32 chumvi -14.883148 -10.327570 -12.819469 1.231636
33 next_to_airstrip -17.242777 -5.207252 -10.601750 1.987712
The last part (from binding the list together in one dataframe and onwards) can also be done with the rbindlist
-function from the data.table
-package:
# load the 'data.table' package for the 'rbindlist' function
library(data.table)
# bind the dataframes in the list together to a data.table (enhanced version of a data.frame)
DT <- rbindlist(lst)
# change the name of the first column
setnames(DT, 1, 'ROI')
# get the correct ROI's for the ROI-column
DT[, ROI := sub('.*: (\\w+).*$', '\\1', txt[grepl('ROI: ', txt)])]
Solution 2:
Here is another ugly solution. The result is a good old regular data.frame
.
rois_all <- file("https://dl.dropboxusercontent.com/u/45095175/rois_all.txt")
xy <- readLines(rois_all)
# find lines where ROI starts
roin <- grep(pattern= "ROI: ", x = xy)
roi <- xy[roin]
roi <- gsub(".*ROI: (\\w+).*$", "\\1", roi)
# find lines with stats
stats <- grep(pattern= "Basic Stats", x = xy)
# trim whitespace andcollect Col
cn <- trimws(sapply(strsplit(xy[stats][1], "\t"), "[", 2:5, simplify =FALSE)[[1]])
# split the stat line by \t and extract only elements 2to5.mergerow-wise
out<- do.call(rbind, sapply(strsplit(xy[stats +1], "\t"), "[", 2:5, simplify =FALSE))
out<- as.data.frame(apply(out, MARGIN =2, as.numeric))
# add ROI column extracted earlier
out<- cbind(roi, out)
colnames(out) <- c("ROI", cn)
out
ROI Min Max Mean Stdev
1 mrc_ranch_house -20.2082616.025762-8.8664035.2897122 river_1 -20.187374-6.694543-12.2275862.6646403 river_2 -18.365091-5.820825-13.1644632.8512314 river_3 -18.291010-4.583666-12.0929953.4792935 river_4 -17.074295-4.926921-9.9709262.8978556 river_5 -16.849176-8.622208-12.3870852.1684627 adjacent_river_2 -18.987597-7.957749-13.3925231.9622638 adjacent_river_3 -19.426531-8.640042-13.4674251.8881059 adjacent_river_4 -20.452566-6.830183-12.8334502.12476110 bcs_1_ -23.612043-8.221417-16.0323052.08069511 bcs_2_ -24.018219-10.648975-16.8140481.94886312 bcs_3_ -23.011086-9.106754-15.4041741.86749813 red_1_ -22.313442-7.839107-14.7681962.13415214 red_2_ -22.551537-7.236300-14.6136182.20425315 red_3_ -22.057703-7.746992-14.4831612.12349716 bcs_4 -22.705107-8.972753-15.2016231.81712217 bcs_5 -24.109459-10.113716-15.7765371.84916318 glade_1_ -19.913187-6.189866-12.6958843.30392919 glade_2_ -19.812855-4.672865-11.9951914.84016820 glade_3_ -10.078033-2.828722-5.8774171.94140121 mwea_b -13.979379-4.977155-11.3924342.01903722 kaga -13.114172-8.889531-10.6493241.29055123 huku -14.206743-7.853305-10.6082101.44125024 ruai -18.643108-12.645180-14.5401231.22418325 tumaini -19.543234-13.164941-15.8999681.81287626 nkando -19.973492-7.040238-11.7169872.61754427 jikaze -16.408030-9.001065-12.3238981.94219628 miarage_b -15.126486-6.661448-10.3911111.76427929 batian -15.269146-9.603316-11.9624701.16885930 gitaraga -17.037708-7.495215-10.8868022.56187731 wiumiririe -9.578024-6.225223-7.6887151.05979632 chumvi -14.883148-10.327570-12.8194691.23163633 next_to_airstrip -17.242777-5.207252-10.6017501.987712
Solution 3:
I have not found a single import solution as each row in data
is called Band 1
but it is a good start.
import pandas as pd
data = pd.read_csv(r'rois_all.txt', delimiter='\t', error_bad_lines=False, skiprows=[0, 1])
data = data.dropna()
data = data.ix[data.ix[:, 'Basic Stats']!='Basic Stats', :]
data
Example of output
Basic Stats Min Max Mean Stdev
0 Band 1 -20.208261 6.025762 -8.866403 5.289712
3 Band 1 -20.187374 -6.694543 -12.227586 2.664640
6 Band 1 -18.365091 -5.820825 -13.164463 2.851231
I have now extracted all of the Basic Stats names as follows,
names = pd.read_csv(r'rois_all.txt', delimiter='\t', error_bad_lines=False, skiprows=[0, 1])
names = names.ix[names.ix[:, 'Basic Stats'] != ' Band 1']
names = names.ix[names.ix[:, 'Basic Stats'] != 'Basic Stats']
names = names.ix[:, 'Basic Stats'].str.extract('Stats for ROI: (.*) \[.*\] [0-9]*')
names.loc[0] = 'mrc_ranch_house'
names = names.sort_index()
names = names.reset_index(drop=True)
This looks as follows,
0 mrc_ranch_house
1 river_1
2 river_2
Joining data
and names
like so,
data.ix[:, 'Basic Stats'] = names
gives this the results as desired,
Basic Stats Min Max Mean Stdev
0 mrc_ranch_house -20.208261 6.025762 -8.866403 5.289712
1 river_1 -20.187374 -6.694543 -12.227586 2.664640
2 river_2 -18.365091 -5.820825 -13.164463 2.851231
Post a Comment for "Select Specific Rows And Cells In Text File And Put Into Data Frame: Python Or R"