I have an HTML file like this:
<h1>Group 1</h1>
<table>
<tr>
<td>Col1</td>
<td>Col2</td>
<td>Col3</td>
</tr>
<tr>
<td>ValA</td>
<td>ValB</td>
<td>ValC</td>
</tr>
</table>
<h1>Group 2</h1>
<table>
<tr>
<td>Col1</td>
<td>Col2</td>
<td>Col3</td>
</tr>
<tr>
<td>ValP</td>
<td>ValQ</td>
<td>ValR</td>
</tr>
</table>
I would like to read it into pandas as if it had this structure:
<table>
<tr>
<td>Caption</td>
<td>Col1</td>
<td>Col2</td>
<td>Col3</td>
</tr>
<tr>
<td>Group 1</td>
<td>ValA</td>
<td>ValB</td>
<td>ValC</td>
</tr>
<tr>
<td>Group 2</td>
<td>ValP</td>
<td>ValQ</td>
<td>ValR</td>
</tr>
</table>
I can do this easily in Power BI's Power Query language:
let
Source = Web.Page(File.Contents("multiple_tables.html")),
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3"}, {"Col1", "Col2", "Col3"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Caption] <> "Document") and ([Col1] <> "Col1"))
in
#"Filtered Rows"
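Is there a way to get the same result in fewer than 10 lines of code using Python/pandas plus an open-source HTML parsing library? Or should I resign myself to writing lower-level code to handle this?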
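For reference, here is a rough sketch of the kind of thing I have in mind, not a definitive solution. It assumes Beautiful Soup (`bs4`) as the parser library, that every `<h1>` is followed by its `<table>` as a sibling element, and that the first row of each table holds the column names. The function name `read_captioned_tables` and the inline sample string are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample input mirroring the file in the question (inlined for illustration).
HTML = """
<h1>Group 1</h1>
<table>
<tr><td>Col1</td><td>Col2</td><td>Col3</td></tr>
<tr><td>ValA</td><td>ValB</td><td>ValC</td></tr>
</table>
<h1>Group 2</h1>
<table>
<tr><td>Col1</td><td>Col2</td><td>Col3</td></tr>
<tr><td>ValP</td><td>ValQ</td><td>ValR</td></tr>
</table>
"""


def read_captioned_tables(html: str) -> pd.DataFrame:
    """Hypothetical helper: stack each table's rows, tagged with its <h1> text."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib-backed parser, no lxml needed
    records = []
    for h1 in soup.find_all("h1"):
        # Assumption: the table belonging to a heading is its next sibling table.
        table = h1.find_next_sibling("table")
        rows = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")
        ]
        header, *body = rows  # assumption: first row is the header row
        caption = h1.get_text(strip=True)
        records += [{"Caption": caption, **dict(zip(header, r))} for r in body]
    return pd.DataFrame(records)


df = read_captioned_tables(HTML)
print(df)
```

For the real file, the string could be read with `open("multiple_tables.html", encoding="utf-8").read()` (the encoding is a guess). The core of the function is under 10 lines, but it breaks if a heading has no table after it, so I am not sure it is the idiomatic way to do this.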