Using ldply in the R package plyr: "Results do not have equal lengths"

data-mining r
2022-03-07 07:20:53

I found a few similar questions, but I'm new to R and can't figure out how they apply to my specific problem. Here is my code:

library(rvest)
library(plyr)
library(stringr)

#function passes in letter and extracts bold text from each page
fetch_current_players<-function(letter){
  url<-paste0("http://www.baseball-reference.com/players/", letter, "/")
  urlHTML<-read_html(url)
  playerData<-html_nodes(urlHTML, "b a")
  player<-html_text(playerData)
  player
}

#list of letters to pass into function
atoz<-c("a","b","c","d","e","f","g","h",
        "i","j","k","l","m","n","o","p","q","r",
        "s","t","u","v","w","x","y","z")
player_list<-ldply(atoz, fetch_current_players, .progress="text")

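What this code tries to do is use the site's URL structure to pass the letters a through z into my function and build a list of the names shown in bold. I think the problem is that the player list returned for each letter has a different length, and that this is what triggers the error, because the function seems to work when I call it manually with one letter at a time.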

Any help is appreciated, thanks!

1 Answer

Here is a slightly modified version that uses some of the newer "tidyverse" packages:

library(rvest)
library(purrr) # flatten/map/safely
library(dplyr) # progress bar

# just in case there isn't a valid page
safe_read <- safely(read_html)

fetch_current_players <- function(letter){

  URL <- sprintf("http://www.baseball-reference.com/players/%s/", letter)
  pg <- safe_read(URL)

  if (is.null(pg$result)) return(NULL)

  player_data <- html_nodes(pg$result, "b a")

  html_text(player_data)

}

pb <- progress_estimated(length(letters))
player_list <- flatten_chr(map(letters, function(x) {
  pb$tick()$print()
  fetch_current_players(x)
}))
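
As for why the original ldply() call fails: as far as I understand it, ldply() tries to turn each returned character vector into one row of a data frame, so every letter would need to return the same number of names. Collecting the results in a list and concatenating them avoids that constraint. Below is a minimal base R sketch of the idea. Note that fake_fetch() is a hypothetical stand-in for fetch_current_players() so the example runs without network access, and the names it produces are made up.

```r
# Hypothetical stand-in for fetch_current_players(): returns a character
# vector whose length depends on the letter ("a" -> 1 name, "b" -> 2, ...).
# It does not touch the network; the values are placeholders, not real players.
fake_fetch <- function(letter) {
  n <- match(letter, letters)
  paste0(letter, seq_len(n))
}

three <- c("a", "b", "c")

# lapply() keeps each result as its own list element,
# so results of different lengths are not a problem.
per_letter <- lapply(three, fake_fetch)

# unlist() concatenates the pieces into a single character vector.
player_list <- unlist(per_letter)

# If a data frame is wanted, build it explicitly in long format,
# recording which letter each name came from.
player_df <- data.frame(
  letter = rep(three, lengths(per_letter)),
  player = player_list,
  stringsAsFactors = FALSE
)

print(player_df)
```

With the real function, the same pattern would be unlist(lapply(letters, fetch_current_players)), which is essentially what flatten_chr(map(...)) does in the answer above.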