
The Best Way To Mark (split?) Dataset In Each String

I have a dataset containing 485k strings (1.1 GB). Each string is about 700 characters long and encodes about 250 variables (1-16 characters per variable), but it doesn't have any split markers (delimiters).
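Since the records are fixed-width, the field boundaries can be derived from the widths alone. A minimal pure-Python sketch (the sample string and widths below are illustrative, matching the examples in the answers):

```python
from itertools import accumulate

def split_fixed_width(record, widths):
    """Split one fixed-width record into fields using cumulative offsets."""
    ends = list(accumulate(widths))   # e.g. [5, 8, 9, 13]
    starts = [0] + ends[:-1]          # e.g. [0, 5, 8, 9]
    return [record[s:e] for s, e in zip(starts, ends)]

print(split_fixed_width("0123456789012", [5, 3, 1, 4]))
# ['01234', '567', '8', '9012']
```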

Solution 1:

Pandas could load this using read_fwf:

In [321]:

import io
import pandas as pd
t = """0123456789012"""
pd.read_fwf(io.StringIO(t), widths=[5, 3, 1, 4], header=None)

Out[321]:

      0    1  2     3
0  1234  567  8  9012

This gives you a DataFrame in which each fixed-width field is an individual column, which you can then access for whatever purpose you require.
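For example (a sketch; the two-record string stands in for the real 1.1 GB file, which could be streamed with read_fwf's chunksize parameter rather than loaded at once):

```python
import io
import pandas as pd

# two sample 13-character records standing in for the real file
data = "0123456789012\n1234567890123"
df = pd.read_fwf(io.StringIO(data), widths=[5, 3, 1, 4], header=None)

# each fixed-width field is now an addressable column
print(df[0].tolist())  # [1234, 12345]  (numeric inference drops the leading zero)
print(df[3].tolist())  # [9012, 123]
```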

Solution 2:

Try this in R:

x <- "0123456789012"

y <- c(5, 3, 1, 4)

# starts are 1 and cumsum(y)+1, ends are cumsum(y); the extra start past the
# end of the string yields an empty element, which the next line drops
output <- paste(substring(x, c(1, cumsum(y) + 1), cumsum(y)), sep = ",")
output <- output[-length(output)]
# "01234" "567"   "8"     "9012"
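The same cumulative-sum indexing can be sketched in Python (not part of the original answer; note that R's substring is 1-based and end-inclusive, while Python slices are 0-based and end-exclusive):

```python
from itertools import accumulate

x = "0123456789012"
y = [5, 3, 1, 4]

ends = list(accumulate(y))                 # R: cumsum(y)         -> [5, 8, 9, 13]
starts = [1] + [e + 1 for e in ends[:-1]]  # R: c(1, cumsum(y)+1) -> [1, 6, 9, 10]

# R's substring(x, first, last) corresponds to x[first-1:last]
output = [x[s - 1:e] for s, e in zip(starts, ends)]
print(output)  # ['01234', '567', '8', '9012']
```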

Solution 3:

In R read.fwf would work:

# inputs
x <- c("0123456789012", "1234567890123")
widths <- c(5, 3, 1, 4)

read.fwf(textConnection(x), widths, colClasses = "character")

giving:

     V1  V2 V3   V4
1 01234 567  8 9012
2 12345 678  9 0123

If numeric rather than character columns were desired then drop the colClasses argument.
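The pandas counterpart of colClasses="character" is the dtype argument: passing dtype=str keeps every field as character data, preserving the leading zeros that numeric inference would otherwise drop (a sketch using the same sample records):

```python
import io
import pandas as pd

data = "0123456789012\n1234567890123"
df = pd.read_fwf(io.StringIO(data), widths=[5, 3, 1, 4], header=None, dtype=str)

print(df.iloc[0].tolist())  # ['01234', '567', '8', '9012']
print(df.iloc[1].tolist())  # ['12345', '678', '9', '0123']
```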

Solution 4:

One option in R is:

indx1 <- c(1, cumsum(len)[-length(len)] + 1)  # field start positions
indx2 <- cumsum(len)                          # field end positions
toString(vapply(seq_along(len), function(i)
         substr(str1, indx1[i], indx2[i]), character(1)))
# [1] "01234, 567, 8, 9012"

data

str1 <- '0123456789012'
len <- c(5, 3, 1, 4)
