
Unable to parse XML in Pig

Tags: data-mining, data-cleaning, apache-hadoop, apache-pig

2022-03-10 07:42:31

I have an XML file with this structure (though it is not exactly a tree):

<posthistory>
<row Id="1" PostHistoryTypeId="2" PostId="1" 
RevisionGUID="689cb04a-8d2a-4fcb-b125-bce8b7012b88" 
CreationDate="2015-01-27T20:09:32.720" UserId="4" Text="I just got a 
pound of microroasted, local coffee and am curious what the optimal 
way to store it is (what temperature, humidity, etc)" />

I am using Apache Pig to extract only the "Text" part, with this code:

grunt> A = load 'hdfs:///parsingdemo/PostHistory.xml' using 
org.apache.pig.piggybank.storage.XMLLoader('posthistory') as(x:chararray);

grunt> result = foreach A generate XPath(x, 'posthistory/Text');

This returns "()" (empty).

After reading up on XML files, I understood that my XML file should be in the following format:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

But my XML data file (it is actually a Stack Overflow data dump) is not in this format. Is there a way to impose a tree structure? What is wrong with my Pig query?

2 Answers

This XPath looks for a tag named <Text> inside a tag named <posthistory>:

XPath(x, 'posthistory/Text');

What you want is the Text attribute of the row tag inside the posthistory tag.

An XPath like this will do that: /posthistory/row/@Text

See the example here: http://www.xpathtester.com/xpath/bac9874ec344f9d8ebcfb250633aaf65 and click "Test" to see the result set.

Read more about XPath notation.
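To make the element-versus-attribute distinction concrete, here is a minimal sketch in Python rather than Pig. It uses the standard library's xml.etree.ElementTree as a stand-in for Pig's XPath UDF, so it illustrates the XPath semantics only, not Pig's behavior. ElementTree supports only a subset of XPath and cannot evaluate `/@Text` directly, so the attribute is read with `.get()`. The sample is a shortened copy of the question's data. The closing `</posthistory>` tag and the second row are assumptions added so the snippet parses and shows more than one result.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the question's PostHistory.xml.
# Assumptions: the closing </posthistory> tag and the second row are
# added here for illustration; they are not in the original snippet.
SAMPLE = """<posthistory>
<row Id="1" PostHistoryTypeId="2" PostId="1" UserId="4"
     Text="I just got a pound of microroasted, local coffee" />
<row Id="2" PostHistoryTypeId="1" PostId="1" UserId="4"
     Text="How should I store whole bean coffee?" />
</posthistory>"""

root = ET.fromstring(SAMPLE)  # root is the <posthistory> element

# Counterpart of 'posthistory/Text': child *elements* named Text.
# The data has none, which is why the original query comes back empty.
text_elements = root.findall("Text")

# Counterpart of '/posthistory/row/@Text': the Text *attribute* of each row.
text_attributes = [row.get("Text") for row in root.findall("row")]

print(text_elements)
print(text_attributes)
```

The first lookup finds nothing because Text is never an element in this file. The second returns one string per row.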

Use regular expressions. Here is the general format:

 foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<child>\\s*<subchild1>(.*)</subchild1>\\s*<subchild2>(.*)</subchild2>\\s*</child>'));
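The capture-group idea behind that pattern can be tried outside Pig. The sketch below uses Python's `re` module as a stand-in for REGEX_EXTRACT_ALL, so the escaping differs: a raw string needs single backslashes. It also uses non-greedy groups, a deliberate change from the greedy `(.*)` above. The `record` and `row` strings are hypothetical sample data, and the attribute pattern at the end is an assumption about how the same approach might be adapted to the question's `<row ... Text="..." />` layout.

```python
import re

# Hypothetical record in the <child>/<subchild> shape the pattern expects.
record = """<child>
  <subchild1>first value</subchild1>
  <subchild2>second value</subchild2>
</child>"""

# Same structure as the Pig pattern, with non-greedy groups so each
# capture stops at the nearest closing tag.
pattern = re.compile(
    r"<child>\s*<subchild1>(.*?)</subchild1>\s*"
    r"<subchild2>(.*?)</subchild2>\s*</child>"
)

match = pattern.search(record)
fields = match.groups() if match else ()
print(fields)

# The same idea applied to an attribute value instead of element content.
# Hypothetical row, shortened from the question's data.
row = '<row Id="1" PostId="1" Text="I just got a pound of coffee" />'
attr = re.search(r'\bText="([^"]*)"', row)
text_value = attr.group(1) if attr else None
print(text_value)
```

A regex like this does not decode XML entities such as `&quot;` and breaks if the markup layout changes, so it is best treated as a fallback for when an XML-aware approach is not available.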
