Trimming a string in R to remove everything after a specified character


Solution 1:

We can use sub for this. Match one or more characters that are not : from the start of the string (^[^:]+), followed by a :, followed by one or more characters that are not : ([^:]+), and enclose the whole match in parentheses so that it is captured. The captured group is then referred to as \\1 in the replacement.

sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607" 

The solution above works for the example in the question. To remove everything after the nth delimiter in the general case, build the pattern from n:

n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607" 

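Checking with a different n

n <- 3

and repeating the same steps (the pattern has to be rebuilt, because it depends on n)

pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)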
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
#[3] "chr7:55241607:55241607"  
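
For a self-contained run, the pattern-building steps above can be wrapped in a small function. This is only a sketch: the real myvec is not shown here, so the sample vector below is made up to match the printed output, keep_first_n is a hypothetical helper name, and perl = TRUE is my addition so that the non-capturing group (?:...) is handled by the PCRE engine.

```r
# Hypothetical sample data; the trailing fields are assumptions.
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")

# keep_first_n() is an illustrative name, not a base R function.
keep_first_n <- function(x, n) {
  stopifnot(n >= 1)
  # One field, then (n - 1) repetitions of ":" + field, then the remainder.
  pat <- paste0("^([^:]+(?::[^:]+){", n - 1, "}).*")
  sub(pat, "\\1", x, perl = TRUE)
}

res2 <- keep_first_n(myvec, 2)
res3 <- keep_first_n(myvec, 3)
print(res2)
print(res3)
```

Because the pattern is rebuilt inside the function on every call, changing n cannot leave a stale pat behind.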

Another option is to split each string on : with strsplit and then paste the first n components back together.

n <- 2
vapply(strsplit(myvec, ':'), function(x)
            paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607" 
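
One thing to watch with the split-and-paste approach is input that has fewer than n fields. The sketch below uses a made-up two-element vector (an assumption, not data from the question) to show the difference between indexing with seq.int(n) and using head, which is a variation I am suggesting rather than part of the answer above.

```r
# Hypothetical input: the second element has only two fields.
x <- c("chr2:213403244:213403244:G:T:snp", "chr7:55240586")
n <- 3

# Indexing past the end pads with NA, which paste() turns into the text "NA".
padded <- vapply(strsplit(x, ":", fixed = TRUE), function(p)
  paste(p[seq.int(n)], collapse = ":"), character(1L))

# head() keeps at most n components, so short inputs pass through unchanged.
guarded <- vapply(strsplit(x, ":", fixed = TRUE), function(p)
  paste(head(p, n), collapse = ":"), character(1L))

print(padded)
print(guarded)
```

If every string is known to have at least n fields, the two versions give the same result.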


Solution 2:


Here are some alternatives that remove everything after the kth colon. For the example in the question k equals 2; the sample runs below use k equal to 3.

1) read.table: Read the fields into a data.frame with read.table, select the first k columns, and paste them back together.

k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))

giving:

[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
[3] "chr7:55241607:55241607"  

2) sprintf/sub: Use sprintf to build a regular expression of the required form, such as ^((.*?:){2}.*?):.* for k equal to 3, and apply it with sub.

k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)

giving:

[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
[3] "chr7:55241607:55241607"  
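
To see what sprintf actually produces, it can help to print the patterns for a few values of k before handing them to sub. This is a small sketch; the sample vector is hypothetical, chosen only to be consistent with the output shown above.

```r
# Hypothetical sample data; the trailing fields are assumptions.
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")

# sprintf is vectorized, so this gives the patterns for k = 1, 2, 3.
pats <- sprintf("^((.*?:){%d}.*?):.*", 0:2)
print(pats)

# Apply the pattern for k = 3 (the third element).
res <- sub(pats[3], "\\1", myvec)
print(res)
```

The %d placeholder receives k-1 because the lazy group (.*?:) consumes one field together with its trailing colon, and the final .*? picks up the kth field without its colon.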

Note 1:
The regular expression can be simplified for particular values of k. For k = 1 the call reduces to sub(":.*", "", myvec), and for k = n-1, where n is the number of fields, it reduces to sub(":[^:]*$", "", myvec).
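
As a quick illustration of these two special cases, here they are applied to a made-up vector with n = 6 fields (the data is an assumption; the real myvec is not shown here).

```r
# Hypothetical sample data with six colon-separated fields.
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")

first_only <- sub(":.*", "", myvec)      # k = 1: drop from the first colon on
drop_last  <- sub(":[^:]*$", "", myvec)  # k = n - 1: drop only the last field

print(first_only)
print(drop_last)
```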

Note 2:
For k equal to 3 the general regular expression is ^((.*?:){2}.*?):.*

[Figure: regular expression visualization (Debuggex Demo)]

3) Iteratively delete the last field: Apply the last regular expression from Note 1 above n-k times, removing one trailing field on each pass.

n <- 6 # number of fields
k <- 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)

To set n automatically, replace the line that hard-codes n with this:

n <- count.fields(textConnection(myvec[1]), sep = ":")
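
A slightly more defensive sketch of the same idea is below. The sample vector is hypothetical, and the lengths/strsplit variant is an alternative I am suggesting in case the number of fields might differ between elements; it is not part of the answer above. The text connection is closed explicitly because it is opened by textConnection and count.fields does not close a connection that was already open.

```r
# Hypothetical sample data with six colon-separated fields.
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")

# Count the fields in the first element, then close the connection.
con <- textConnection(myvec[1])
n <- count.fields(con, sep = ":")
close(con)
print(n)

# Alternative: one field count per element.
n_each <- lengths(strsplit(myvec, ":", fixed = TRUE))
print(n_each)
```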

4) gregexpr: Use gregexpr to locate the positions of all the colons, take the position of the kth colon and subtract one so that the colon itself is excluded, then use substr to extract that many characters from each string.

k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)

giving:

[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
[3] "chr7:55241607:55241607"  
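
The intermediate values may make the gregexpr approach easier to follow. The sketch below uses a hypothetical sample vector plus one deliberately short string, and the ifelse guard for strings without a kth colon is my own addition rather than part of the solution above.

```r
# Hypothetical sample data; the last element has fewer than k colons.
x <- c("chr2:213403244:213403244:G:T:snp",
       "chr7:55240586:55240586:T:G:snp",
       "chr7:55241607")
k <- 3

# Position of the kth colon in each string (NA when there is none).
kth <- sapply(gregexpr(":", x, fixed = TRUE), "[", k)
print(kth)

# Without a guard, an NA position makes substr() return NA.
plain <- substr(x, 1, kth - 1)

# With a guard, strings lacking a kth colon are returned unchanged.
guarded <- ifelse(is.na(kth), x, substr(x, 1, kth - 1))
print(plain)
print(guarded)
```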

Note 3:
Suppose there are n fields. The question asks to remove everything after the kth delimiter, so a solution should work for k = 1, ..., n-1. It need not work for k = n, since there are not n delimiters. However, if we instead define k as the number of fields to return, then k = n makes sense, and (1) and (3) work in that case too. (2) and (4) do not work for this extension, but we can get them to work by using paste0(myvec, ":") as the input instead of myvec.
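
As an illustration of that workaround for the gregexpr approach, here is a minimal sketch with k = n on a made-up vector (the data is an assumption). Appending a colon gives every string an nth delimiter to find.

```r
# Hypothetical sample data with six colon-separated fields.
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
n <- 6
k <- n  # ask for all n fields

# Without the trailing colon there is no kth colon, so the result is NA.
without <- substr(myvec, 1, sapply(gregexpr(":", myvec, fixed = TRUE), "[", k) - 1)

# With the trailing colon appended, the kth colon exists and is then excluded.
x <- paste0(myvec, ":")
with_colon <- substr(x, 1, sapply(gregexpr(":", x, fixed = TRUE), "[", k) - 1)

print(without)
print(with_colon)
```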

Note 4:
A performance comparison of the four solutions, with n and k set as above:

library(rbenchmark)
benchmark(
 .read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
 .sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
 .for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)},
 .gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
  order = "elapsed", replications = 1000)[1:4]

giving:

          test replications elapsed relative
2 .sprintf.sub         1000    0.11    1.000
4    .gregexpr         1000    0.14    1.273
3         .for         1000    0.15    1.364
1  .read.table         1000    2.16   19.636

In this benchmark the sprintf/sub solution is the fastest, with gregexpr and the for loop about 1.3 to 1.4 times slower and read.table roughly 20 times slower. The sprintf/sub solution does rely on a fairly complex regular expression, whereas the others use simpler regular expressions or none at all and might be preferred on grounds of simplicity.

