What insights can a data scientist get from text analysis?

data-mining python text-mining nltk text-classification
2022-03-06 05:25:53

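I have a lot of texts that I am trying to analyze. After tokenizing them, studying word frequencies, spotting any typos, and looking at punctuation, I have been working on POS tagging. Since this is my first time doing text mining and manipulation, I would like to know what insights I can get from this information and what the best way to present this kind of analysis is.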

For example, suppose I have many texts like the following:

Hemingway =["When spring came, even the false spring, there were no problems except where to be happiest. The only thing that could spoil a day was people and if you could keep from making engagements, each day had no limits. People were always the limiters of happiness except for the very few that were as good as spring itself.","Most people were heartless about turtles because a turtle’s heart will beat for hours after it has been cut up and butchered. But the old man thought, I have such a heart too.","Perhaps as you went along you did learn something. I did not care what it was all about. All I wanted to know was how to live in it. Maybe if you found out how to live in it you learned from that what it was all about.","The people that I liked and had not met went to the big cafes because they were lost in them and no one noticed them and they could be alone in them and be together."]

Shakespeare=["These violent delights have violent ends And in their triump die, like fire and powder Which, as they kiss, consume","Let me not to the marriage of true minds Admit impediments. Love is not love Which alters when it alteration finds, Or bends with the remover to remove. O no, it is an ever-fixed mark That looks on tempests and is never shaken; It is the star to every wand'ring barque, Whose worth's unknown, although his height be taken. Love's not Time's fool, though rosy lips and cheeks Within his bending sickle's compass come; Love alters not with his brief hours and weeks, But bears it out even to the edge of doom. If this be error and upon me proved, I never writ, nor no man ever loved.","O serpent heart hid with a flowering face!Did ever a dragon keep so fair a cave? Beautiful tyrant, feind angelical, dove feather raven, wolvish-ravening lamb! Despised substance of devinest show, just opposite to what thou justly seemest - A dammed saint,honourable villain!","Lord Polonius: What do you read, my lord? Hamlet: Words, words, words. Lord Polonius: What is the matter, my lord? Hamlet: Between who?  Lord Polonius: I mean, the matter that you read, my lord."]

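Are the word frequency distribution and the POS tag distribution useful? I have never presented any results, so I would like to learn more about how data scientists/analysts "tell a story" with this kind of data and how they look at it.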

1 Answer

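Well, the use cases obviously depend on the industry. Also, I assume you are looking for use cases that are actually useful. But let's think of some examples: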

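  • I once worked with a book distributor that tagged every book it sold with keywords (fantasy, horror, etc.). If you have a large enough dataset of tagged books, you can automate the tagging process. You could do the same for your phrases (inspirational, funny, etc.), but you probably don't have labeled data. It would be nice to be able to ask an application for a certain kind of phrase :).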

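  • Sentiment analysis is easier: are most phrases positive? Negative? You don't need labels of your own for this, since pretrained and lexicon-based sentiment models exist.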

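  • Style transfer: maybe you have several phrases from, say, Shakespeare. You could try to transfer his style to Einstein's phrases. This is hard but doable: look into Generative Adversarial Networks.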
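
To make the first bullet concrete, here is a minimal sketch of automated tagging, assuming the author name plays the role of the tag (in place of something like fantasy/horror). It is a tiny bag-of-words Naive Bayes classifier written with the standard library only. The names `tokenize`, `train`, `predict` and `corpus` are made up for this illustration, the regex tokenizer is a crude stand-in for a real tokenizer, and the four training sentences are shortened from the quotes in the question. With this little data the result is only a toy; in practice you would likely use a library classifier and far more text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (crude regex tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_texts):
    """Count words per label. labeled_texts maps label -> list of strings."""
    word_counts = {label: Counter() for label in labeled_texts}
    doc_counts = Counter()
    for label, texts in labeled_texts.items():
        for text in texts:
            word_counts[label].update(tokenize(text))
            doc_counts[label] += 1
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest Naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        for token in tokens:
            # Add-one (Laplace) smoothing so a zero count does not
            # drive the probability to zero.
            score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data, shortened from the quotes in the question.
corpus = {
    "Hemingway": [
        "But the old man thought, I have such a heart too.",
        "All I wanted to know was how to live in it.",
    ],
    "Shakespeare": [
        "Love is not love Which alters when it alteration finds.",
        "I never writ, nor no man ever loved.",
    ],
}

model = train(corpus)
print(predict(model, "the old man wanted to live"))
print(predict(model, "love alters not"))
```

The word counts that feed this classifier are the same frequency distribution you already computed, which is one way those distributions become useful: they turn into features for a model instead of being the end result.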