How do I map phrase IDs to sentence IDs in the Stanford Sentiment Analysis dataset?

data-mining dataset sentiment-analysis
2021-10-06 05:19:15

I am trying to reproduce results obtained on the Stanford Sentiment Analysis dataset, which contains sentiment annotations for the syntactic components of sentences extracted from Rotten Tomatoes reviews.

The datasetSentences.txt and datasetSplit.txt files divide the data by sentence into train, dev, and test sets. The components of those sentences are mapped to sentiment annotations in the dictionary.txt and sentiment_labels.txt files. The README states that sentence and phrase IDs are different, but I can't find any mapping between them, so I don't know how to split the sentiment annotations into the partitions used in the experiment I'm trying to reproduce.
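For reference, here is roughly how I have been inspecting the files (a minimal sketch, run from inside the dataset directory); the header lines show that sentence indexes and phrase ids live in separate files with no shared key:

# Print the first two lines of each file in the dataset download.
for name in ["datasetSentences.txt", "datasetSplit.txt",
             "dictionary.txt", "sentiment_labels.txt"]:
    with open(name, encoding="utf-8") as f:
        print(name, [next(f).rstrip() for _ in range(2)])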

Of course, I don't expect the mapping from phrases to sentences to be one-to-one (the same phrase may appear in multiple sentences), but I would still expect an explicit mapping rather than having to compare substrings.

Has anyone worked with this dataset? Is there something I'm overlooking?

1 Answer

Only complete sentences are used for testing and validation, while both sentences and phrases are used for training. For an overview of various text classification experiments, see Yoon Kim (2014), Convolutional Neural Networks for Sentence Classification.
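The full sentences themselves appear as entries in dictionary.txt, so exact string matching recovers the sentence-to-phrase-id mapping. A minimal sketch of that lookup, assuming the standard file layout (a handful of sentences may fail to match exactly because of tokenization and encoding quirks in datasetSentences.txt):

import pandas

# Load the sentence list (tab-separated, with a header line).
sentences = pandas.read_csv("datasetSentences.txt", sep="\t")
# Load the phrase dictionary (pipe-separated, no header line).
dictionary = pandas.read_csv("dictionary.txt", sep="|", header=None,
                             names=["phrase", "phrase_id"])
# Exact-match each sentence against the phrase dictionary.
mapping = sentences.merge(dictionary, left_on="sentence", right_on="phrase")
print(mapping[["sentence_index", "phrase_id"]].head())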

Below is code that builds train, dev, and test CSV files from the text files in the dataset download.

"""
Put all the Stanford Sentiment Treebank phrase data into test, training, and dev CSVs.

Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive Deep Models
for Semantic Compositionality Over a Sentiment Treebank. Presented at the Conference on Empirical Methods in Natural
Language Processing EMNLP.

https://nlp.stanford.edu/sentiment/
"""

import os
import sys

import pandas


def get_phrase_sentiments(base_directory):
    def group_labels(label):
        if label in ["very negative", "negative"]:
            return "negative"
        elif label in ["positive", "very positive"]:
            return "positive"
        else:
            return "neutral"

    # dictionary.txt has no header line: each row is "phrase|phrase id".
    dictionary = pandas.read_csv(os.path.join(base_directory, "dictionary.txt"), sep="|",
                                 header=None, names=["phrase", "id"])
    dictionary = dictionary.set_index("id")

    # sentiment_labels.txt has a header line ("phrase ids|sentiment values");
    # rename the columns to shorter names.
    sentiment_labels = pandas.read_csv(os.path.join(base_directory, "sentiment_labels.txt"), sep="|")
    sentiment_labels.columns = ["id", "sentiment"]
    sentiment_labels = sentiment_labels.set_index("id")

    # Join each phrase to its sentiment score on the shared phrase id.
    phrase_sentiments = dictionary.join(sentiment_labels)

    phrase_sentiments["fine"] = pandas.cut(phrase_sentiments.sentiment, [0, 0.2, 0.4, 0.6, 0.8, 1.0],
                                           include_lowest=True,
                                           labels=["very negative", "negative", "neutral", "positive", "very positive"])
    phrase_sentiments["coarse"] = phrase_sentiments.fine.apply(group_labels)
    return phrase_sentiments


def get_sentence_partitions(base_directory):
    # datasetSentences.txt is tab-separated; datasetSplit.txt is comma-separated.
    sentences = pandas.read_csv(os.path.join(base_directory, "datasetSentences.txt"), index_col="sentence_index",
                                sep="\t")
    splits = pandas.read_csv(os.path.join(base_directory, "datasetSplit.txt"), index_col="sentence_index")
    # Index by sentence text so partition() can join phrases against it.
    return sentences.join(splits).set_index("sentence")


def partition(base_directory):
    phrase_sentiments = get_phrase_sentiments(base_directory)
    sentence_partitions = get_sentence_partitions(base_directory)
    # Phrases that are themselves full sentences pick up that sentence's split label.
    data = phrase_sentiments.join(sentence_partitions, on="phrase")
    # Phrases that are not full sentences get no label from the join;
    # assign them to the training split (label 1).
    data["splitset_label"] = data["splitset_label"].fillna(1).astype(int)
    # Reattach tokenized clitics ("do n't" -> "don't", "it 's" -> "it's").
    data["phrase"] = data["phrase"].str.replace(r"\s('s|'d|'re|'ll|'m|'ve|n't)\b",
                                                lambda m: m.group(1), regex=True)
    return data.groupby("splitset_label")


base_directory, output_directory = sys.argv[1:3]
os.makedirs(output_directory, exist_ok=True)
# README convention for splitset_label: 1 = train, 2 = test, 3 = dev.
for splitset_label, split_data in partition(base_directory):
    split_name = {1: "train", 2: "test", 3: "dev"}[splitset_label]
    filename = os.path.join(output_directory, "stanford-sentiment-treebank.%s.csv" % split_name)
    del split_data["splitset_label"]
    split_data.to_csv(filename)
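To run the script, pass the dataset directory and an output directory on the command line. Here is a quick sanity check of the result (the script name and paths are hypothetical):

# Example invocation (hypothetical script name and paths):
#   python partition_sst.py stanfordSentimentTreebank/ output/
import pandas

# The dev and test CSVs hold only full sentences, while train also holds
# sub-sentence phrases, so train should be far larger than the other two.
for name in ["train", "dev", "test"]:
    df = pandas.read_csv("output/stanford-sentiment-treebank.%s.csv" % name)
    print(name, len(df))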

Here is a gist with the code.