How can I read HTML tables under multiple headings and combine them into a single pandas DataFrame?

data-mining python pandas 力比
2021-09-23 14:41:07

I have an HTML file like this:

<h1>Group 1</h1>
<table>
  <tr>
    <td>Col1</td>
    <td>Col2</td>
    <td>Col3</td>
  </tr>
  <tr>
    <td>ValA</td>
    <td>ValB</td>
    <td>ValC</td>
  </tr>
</table>
<h1>Group 2</h1>
<table>
  <tr>
    <td>Col1</td>
    <td>Col2</td>
    <td>Col3</td>
  </tr>
  <tr>
    <td>ValP</td>
    <td>ValQ</td>
    <td>ValR</td>
  </tr>
</table>

I want to read it into pandas as if it had this structure:

<table>
 <tr>
   <td>Caption</td>
   <td>Col1</td>
   <td>Col2</td>
   <td>Col3</td>
 </tr>
 <tr>
   <td>Group 1</td>
   <td>ValA</td>
   <td>ValB</td>
   <td>ValC</td>
 </tr>
 <tr>
   <td>Group 2</td>
   <td>ValP</td>
   <td>ValQ</td>
   <td>ValR</td>
 </tr>
</table>

I can do this easily with Power BI's Power Query language:

let
    Source = Web.Page(File.Contents("multiple_tables.html")),
    #"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3"}, {"Col1", "Col2", "Col3"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Caption] <> "Document") and ([Col1] <> "Col1"))
in
    #"Filtered Rows"

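Is there a way to get the same result in fewer than 10 lines of code using Python/pandas plus an open-source HTML parsing library? Or should I resign myself to writing lower-level code to handle this?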

1 Answer

With a little parsing via BeautifulSoup, we can build a pandas.DataFrame like the one below using pandas.read_html().

Code:

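from io import StringIO

def get_tables(source):
    # find_all returns the <h1> and <table> elements in document order: h1, table, h1, table, ...
    elems = iter(BeautifulSoup(source, 'lxml').find_all(['table', 'h1']))
    # Each loop step takes an <h1>; next(elems) then takes the <table> that follows it.
    # Only the first data row of each table is kept (.iloc[0]), renamed to the heading text.
    # StringIO is used because newer pandas versions deprecate passing literal HTML strings.
    df = pd.DataFrame(
        pd.read_html(StringIO(str(next(elems))), header=0)[0].iloc[0].rename(h1.text)
        for h1 in elems)
    df.index.names = ['Caption']
    return df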
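
The function relies on one iterator being consumed in two places: the for clause takes a heading, and next() takes the table right after it. The sketch below shows that pairing on plain strings so it can be tried without any HTML; the list values are hypothetical stand-ins for the parsed elements. It also shows zip() over the same iterator, which pairs items the same way but stops quietly if a trailing heading has no table, whereas next() would raise StopIteration in that case (inside a generator expression this surfaces as a RuntimeError).

```python
# Stand-ins for the parsed elements, in document order (hypothetical values).
elems = iter(["Group 1", "<table 1>", "Group 2", "<table 2>"])

# The for clause pulls a heading; next(elems) then pulls the item that follows it.
pairs = [(heading, next(elems)) for heading in elems]
print(pairs)

# Passing the same iterator to zip twice pairs consecutive items as well,
# and simply stops when the last heading has no table after it.
elems = iter(["Group 1", "<table 1>", "Group 2"])
safe_pairs = list(zip(elems, elems))
print(safe_pairs)
```

Both approaches assume the headings and tables strictly alternate; a heading followed by another heading would be mispaired either way.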

Test code:

import pandas as pd
from bs4 import BeautifulSoup

with open('test.html', 'r') as f:
    print(get_tables(f))

Result:

         Col1  Col2  Col3
Caption                  
Group 1  ValA  ValB  ValC
Group 2  ValP  ValQ  ValR
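
The function above keeps only the first data row of each table, which fits the sample file. If a table can hold several rows, one possible variant is sketched below: it reads each table in full, adds the heading as a Caption column, and concatenates the pieces. The function name get_all_rows and the sample HTML are illustrative and not part of the original answer, and the sketch assumes every h1 is followed by its own table.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def get_all_rows(source):
    """Return every row of every table, labelled with the preceding <h1> text."""
    soup = BeautifulSoup(source, "lxml")
    frames = []
    for h1 in soup.find_all("h1"):
        # Assumption: the next <table> in document order belongs to this heading.
        table = h1.find_next("table")
        if table is None:
            continue
        # header=0 treats the first table row as the column names.
        df = pd.read_html(StringIO(str(table)), header=0)[0]
        df.insert(0, "Caption", h1.get_text(strip=True))
        frames.append(df)
    # Note: pd.concat raises ValueError if no tables were found.
    return pd.concat(frames, ignore_index=True)


# Illustrative input: the first group has two data rows.
html = """
<h1>Group 1</h1>
<table>
  <tr><td>Col1</td><td>Col2</td></tr>
  <tr><td>ValA</td><td>ValB</td></tr>
  <tr><td>ValC</td><td>ValD</td></tr>
</table>
<h1>Group 2</h1>
<table>
  <tr><td>Col1</td><td>Col2</td></tr>
  <tr><td>ValP</td><td>ValQ</td></tr>
</table>
"""
combined = get_all_rows(html)
print(combined)
```

Here Caption is an ordinary column rather than the index, which matches the target structure in the question; calling combined.set_index("Caption") would give a layout closer to the result shown above.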