Remove everything after a given character in a string

data-mining r data-cleaning
2021-09-27 23:48:02

I have a dataset like the one below. I want to remove every character that comes after the © symbol. How can I do this in R?

data_clean_phrase <- c("Copyright © The Society of Geomagnetism and Earth", 
"© 2013 Chinese National Committee ")

data_clean_df <- as.data.frame(data_clean_phrase)
2 Answers

For example:

 rs<-c("copyright @ The Society of mo","I want you to meet me @ the coffeshop")
 s<-gsub("@.*","",rs)
 s
 [1] "copyright "             "I want you to meet me "

Or, if you want to keep the @ character:

 s<-gsub("(@).*","\\1",rs)
 s
 [1] "copyright @"             "I want you to meet me @"
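
Applied to the data from the question, the same two patterns might look like the sketch below. It swaps @ for © and writes the results into two new columns; the column names without_symbol and with_symbol are made up for illustration and are not part of the original data.

```r
data_clean_phrase <- c("Copyright © The Society of Geomagnetism and Earth",
                       "© 2013 Chinese National Committee ")
data_clean_df <- data.frame(data_clean_phrase, stringsAsFactors = FALSE)

# Drop the © and everything after it (hypothetical column name)
data_clean_df$without_symbol <- sub("©.*", "", data_clean_df$data_clean_phrase)

# Keep the © but drop everything after it (hypothetical column name)
data_clean_df$with_symbol <- sub("(©).*", "\\1", data_clean_df$data_clean_phrase)

data_clean_df[, c("without_symbol", "with_symbol")]
```

Note that the second phrase starts with ©, so dropping the symbol as well leaves an empty string for that row.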

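Edit: if you want to remove everything from the last @ onward, use the same approach with a suitable regular expression. The greedy (.*) consumes as much as it can, so the match backtracks only as far as the last @. For example: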

rs<-c("copyright @ The Society of mo located @ my house","I want you to meet me @ the coffeshop")
s<-gsub("(.*)@.*","\\1",rs)
s
[1] "copyright @ The Society of mo located " "I want you to meet me "

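Because each pattern here ends in .* and therefore consumes the rest of the string, there is at most one match per string, so sub and gsub give the same result.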
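
A quick way to check this claim is to run both functions on the same input and compare. This is only a small sketch reusing the example strings above; the variable names are arbitrary.

```r
rs <- c("copyright @ The Society of mo located @ my house",
        "I want you to meet me @ the coffeshop")

# The pattern swallows the rest of the string, so only one match is possible
with_sub  <- sub("@.*", "", rs)
with_gsub <- gsub("@.*", "", rs)

identical(with_sub, with_gsub)

# Without the trailing .*, the two functions differ when @ occurs more than once
sub("@", "", rs[1]) == gsub("@", "", rs[1])
```

The first comparison should be TRUE and the second FALSE, since the first string contains two @ characters.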

For completeness: you can also use the stringr package to extract the part you want.

library(stringr)
data_clean_phrase <- c("Copyright © The Society of Geomagnetism and Earth", 
                       "© 2013 Chinese National Committee ")

str_extract(data_clean_phrase, "^(.*?©)") # up to and including the first ©
str_extract(data_clean_phrase, "^.*(?=(©))") # everything before the last ©, excluding it

Note: I used str_extract here; you could use str_remove instead to delete the unwanted part.
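
A str_remove version could look roughly like this. It is a sketch that assumes stringr is installed; the patterns and the variable names dropped and kept are my own choices rather than something from the answer above. The lookbehind (?<=©) keeps the symbol and removes only what follows it.

```r
library(stringr)

data_clean_phrase <- c("Copyright © The Society of Geomagnetism and Earth",
                       "© 2013 Chinese National Committee ")

# Remove the © and everything after it
dropped <- str_remove(data_clean_phrase, "©.*")

# Remove only what follows the ©, keeping the symbol itself
kept <- str_remove(data_clean_phrase, "(?<=©).*")

dropped
kept
```

str_remove deletes the first match of the pattern, which is enough here because the pattern runs to the end of the string.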