Data mining - Storing structured data from the Stackoverflow API

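I am using the Stackoverflow API and Python to download Stackoverflow questions and answers for a specific tag.

The goal is to perform document clustering in order to find related terms in the documents and measure the similarity between them.

Example: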

curl --compressed -H "Accept-Encoding: gzip" -X GET 'http://api.stackexchange.com/2.2/questions?page_number=1&pagesize=25&order=desc&sort=activity&tagged=amp-html&site=stackoverflow&run=true'

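This is the process I follow, using Python 2.7 and the requests library:

Request the question through the API
Create a Question object and, after running the API response through BeautifulSoup, assign it to the body attribute.
Find the related answers in the question.
Request the answers through the API
Create an array of Answer objects
Assign the Answer array to the parent Question object.
For each Answer object, assign the API response to the body attribute after running it through BeautifulSoup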

API question request:

curl --compressed -H "Accept-Encoding: gzip" -X GET 'https://api.stackexchange.com/2.1/questions/37745529?site=stackoverflow&filter=withbody'

Example API response:

{"items":[{"tags":["html","input","amp-html"],"owner":{"reputation":314,"user_id":5426326,"user_type":"registered","accept_rate":39,"profile_image":"https://www.gravatar.com/avatar/cbff9d2f96733be04cb022f8a724ad49?s=128&d=identicon&r=PG&f=1","display_name":"Pranav Bilurkar","link":"http://stackoverflow.com/users/5426326/pranav-bilurkar"},"is_answered":true,"view_count":47,"accepted_answer_id":37787460,"answer_count":1,"score":1,"last_activity_date":1465850038,"creation_date":1465553340,"last_edit_date":1465797429,"question_id":37745529,"link":"http://stackoverflow.com/questions/37745529/input-tags-elements-in-amp-html","title":"Input Tags/Elements in AMP-html","body":"<p>we are implementing our current <a href=\"http://articles.mercola.com/sites/articles/archive/2099/12/31/alzheimers-disease-early-detection.aspx\" rel=\"nofollow\">site</a> in <code>AMP version</code> and i am new to this. I have below queries regarding <code>AMP</code> -</p>\n\n<ol>\n<li>How to get user <code>input</code> from user in <code>AMP-HTML</code>.</li>\n<li>In the above mentioned site [<code>desktop version</code>] we have <code>comment section</code> at the bottom of the page. We need to implement the same functionality in AMP.</li>\n<li>Are there any <code>websites</code> which are build and developed in <code>AMP</code>? If yes, i need links to check.</li>\n</ol>\n\n<p>Any suggestion and ideas related to above queries would be appreciated.\nThanks.</p>\n"}],"has_more":false,"quota_max":300,"quota_remaining":298}
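
As a rough illustration of pulling the text out of a response like this one, the sketch below parses a trimmed-down payload and strips the HTML from body. It is written for Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The names TextExtractor and html_to_text, and the shortened sample payload, are made up for this example.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


# Shortened stand-in for the API response above (hypothetical, illustration only).
sample = json.dumps({
    "items": [{
        "question_id": 37745529,
        "title": "Input Tags/Elements in AMP-html",
        "body": "<p>How to get user <code>input</code> in <code>AMP-HTML</code>.</p>\n",
    }],
    "has_more": False,
})

response = json.loads(sample)
for item in response["items"]:
    print(item["question_id"], html_to_text(item["body"]))
```

Entities such as &amp;amp; are decoded by the parser, so the resulting string contains the literal characters rather than HTML escapes.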

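Because the body attribute of the questions and answers contains HTML, URLs, special characters, double quotes, single quotes, commas and so on, I want to convert this text into structured data that can be represented as a single block of text and analysed as tokens, to which I then apply TF-IDF.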
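
A minimal sketch of that last step, assuming the bodies have already been reduced to plain text: split each document into tokens and weight the terms by TF-IDF. It uses one common formulation (relative term frequency times log(N / document frequency)), which is not necessarily the variant any particular library applies. tokenize and tf_idf are hypothetical helper names, and the two sample strings are shortened phrases taken from the question body above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (inner hyphens kept)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]

    # Count in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "How to get user input in AMP-HTML",
    "Comment section in AMP",
]
for doc_weights in tf_idf(docs):
    # Print the terms of each document, highest weight first.
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

With this formulation a term that occurs in every document, such as "in" here, gets a weight of zero, which is the behaviour that makes TF-IDF useful for picking out distinguishing terms.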

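Details:

There are 2 objects: Question and Answer. The Question object has an array of Answer objects as an attribute. Each Question and Answer object has a body attribute, a string that holds the text of that post.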
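
To make the structure concrete, here is a sketch of what the two objects might look like. The attribute names other than body (question_id, title, answer_id, answers) and the to_dict helpers are assumptions for illustration, and the body strings are placeholders rather than real post text. The nested dictionary is shown only to make the shape visible, not as a storage recommendation.

```python
import json


class Answer(object):
    """A single answer; body holds the answer text as one string."""

    def __init__(self, answer_id, body):
        self.answer_id = answer_id
        self.body = body

    def to_dict(self):
        return {"answer_id": self.answer_id, "body": self.body}


class Question(object):
    """A question plus the array of Answer objects that belong to it."""

    def __init__(self, question_id, title, body, answers=None):
        self.question_id = question_id
        self.title = title
        self.body = body
        self.answers = list(answers or [])

    def to_dict(self):
        # Nest the answers under the parent question.
        return {
            "question_id": self.question_id,
            "title": self.title,
            "body": self.body,
            "answers": [answer.to_dict() for answer in self.answers],
        }


# IDs and title come from the response example; the bodies are placeholders.
question = Question(
    37745529,
    "Input Tags/Elements in AMP-html",
    "(question text after HTML removal)",
    answers=[Answer(37787460, "(answer text after HTML removal)")],
)
print(json.dumps(question.to_dict(), indent=2))
```

The classes avoid Python 3 only features, so the same code should also run under the Python 2.7 setup described above.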

What is the recommended way to store this information?