Problem loading data into R

data-mining r dataset
2022-03-07 19:03:20

I'm having trouble loading data into R:

fileUrl <- "http://jadi.net/files/iran_it_status_1394_detail_data_jadi_net.tsv"

download.file(fileUrl , destfile="iran_it_status_1394_detail_data_jadi_net.tsv")

dev <- read.delim("iran_it_status_1394_detail_data_jadi_net.tsv",
              header=TRUE,sep="\t",blank.lines.skip = TRUE,
              na.strings="",fileEncoding="UTF-8",
              stringsAsFactors=FALSE,skipNul = TRUE)

I get the following error:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input
  In addition: Warning message:
  In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection    'iran_it_status_1394_detail_data_jadi_net.tsv'

Edit: the dataset has 1217 rows and 33 variables.

names(data) <- c("timestamp","age","sex","birth_province","work_province","experience","education",
            "certificate","learn","project","book","language","wish_language","db","desktop_os",
            "wish_os","mobile","env","theme","src_ctrl","tab_space","drink","items","device","title",
            "org_type","org_emp","income","perk","job_contract","job_type","hour_wage","happy")

For the language variable, I expect this output:

data[1:3,"language"]

C#, Javascript, R, SQL

Java, C#, Javascript, Objective C, Swift, SQL

C#, SQL

Python solutions are also welcome.

2 Answers

I was able to load the dataset like this:

 dev <- read.table("iran_it_status_1394_detail_data_jadi_net.tsv",
                   header=TRUE, sep="\t", blank.lines.skip = TRUE,
                   na.strings="",
                   stringsAsFactors=FALSE, skipNul = TRUE, fill=TRUE, quote="")

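Note the removal of the encoding argument (so that the function "finds" the lines in the file), the addition of fill (which allows a ragged table with empty cells), and the disabling of quoting (there is apparently a badly quoted line somewhere near line 585).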
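To locate ragged rows, `count.fields()` can report the number of fields on each line. The sketch below uses a small made-up file rather than the real dataset; the file contents and variable names are assumptions for illustration only. It also shows one caveat of switching from `read.delim()` to `read.table()`: `read.table()` defaults to `comment.char = "#"`, so a value such as `C#, SQL` is cut off at the `#` unless `comment.char = ""` is passed.

```r
# Hypothetical four-line TSV: file line 3 is missing its last field.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tlanguage\tdb",
             "1\tC#, SQL\tMySQL",
             "2\tJava",
             "3\tR, SQL\tPostgreSQL"),
           tmp)

# Fields per line, with quoting and comment handling switched off.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
# File line numbers (header included) whose field count differs from the header.
ragged <- which(n_fields != n_fields[1])

# Default comment.char = "#": everything from "#" onward is dropped.
with_default <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                           fill = TRUE, stringsAsFactors = FALSE)

# comment.char = "" keeps "#" as ordinary data.
no_comment <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                         fill = TRUE, comment.char = "",
                         stringsAsFactors = FALSE)

print(ragged)
print(with_default$language[1])
print(no_comment$language[1])
```

On the real file, the same `count.fields()` call with and without `quote = ""` may help narrow down where the badly quoted line sits.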

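This yields a table full of encoded characters. You would need to know more about the source data to work out how to handle them, but opening the file in a plain text editor (for example Sublime Text) may give you some clues.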
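The file can also be checked from within R. The following is a minimal sketch, assuming the problem is a few lines that are not valid UTF-8; it builds a hypothetical three-line file containing one stray byte instead of using the real dataset, and flags the offending lines with `validUTF8()` (available since R 3.3.0).

```r
# Hypothetical file: two clean lines plus one containing a stray 0xFF byte,
# which can never occur in valid UTF-8.
tmp <- tempfile(fileext = ".tsv")
con <- file(tmp, "wb")
writeBin(c(charToRaw("id\tname\n"),
           charToRaw("1\tok\n"),
           charToRaw("2\tbad"), as.raw(0xff), charToRaw("\n")),
         con)
close(con)

# No encoding argument is given, so the bytes are read without re-encoding.
lines <- readLines(tmp, warn = FALSE)

# Indices of the lines that are not valid UTF-8.
bad_lines <- which(!validUTF8(lines))
print(bad_lines)
```

Once the offending lines are known, they can be inspected by hand or passed through `iconv()` with a guessed source encoding; which encoding applies to this dataset is not known from the thread.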

A character encoding problem?

Try using tryCatch.

See here: http://stackoverflow.com/questions/13613270/how-to-fix-the-error-in-r-of-no-lines-available-in-input
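As a rough illustration of the tryCatch suggestion, the sketch below wraps `read.delim()` so that a read error yields `NULL` and a message instead of stopping the script. The helper name `read_tsv_or_null` and the temporary files are hypothetical. Note that this only handles the error; it does not fix the underlying encoding or quoting problem.

```r
# Hypothetical helper: returns a data frame, or NULL if the file cannot be read.
read_tsv_or_null <- function(path) {
  tryCatch(
    read.delim(path, quote = "", stringsAsFactors = FALSE),
    error = function(e) {
      message("Could not read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A zero-byte file makes read.delim() fail with "no lines available in input".
empty_file <- tempfile(fileext = ".tsv")
file.create(empty_file)

# A small well-formed file for comparison.
good_file <- tempfile(fileext = ".tsv")
writeLines(c("id\tlanguage", "1\tC#, SQL"), good_file)

res_empty <- read_tsv_or_null(empty_file)   # error is caught, NULL is returned
res_good  <- read_tsv_or_null(good_file)    # one-row data frame
```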