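Data mining - Is there an off-the-shelf classifier to tell whether a web page is an article? - 吾爱随笔录

Is there an off-the-shelf classifier to tell whether a web page is an article?

data-mining, classification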

2022-03-03 13:12:01

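I am trying to solve a problem where I have a set of URLs and need to keep only those that can be classified as articles (that is, generally excluding pages such as privacy policies, terms of use, unsubscribe pages, feeds, and so on).

Is there a trained classifier that can be used to filter these web pages? All the material on web page classification I have found is about sorting articles into categories, which is not what I need. I need to take a step back and work out whether a web page is an article at all.

If there is no off-the-shelf classifier, what is the best way to build one?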

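Update: to clarify a few points raised below. The problem is not to identify articles from the URL alone, but to find a way to label a web page as article/non-article based on its content (HTML, visual representation, possibly other metadata). From the research I have done since posting this question, I believe this is a variant of what is called blog identification. I know that the definition of an article matters here and that the requirement is very vague, but that is exactly the difficulty I keep running into while looking for an algorithm to do this. If I had a fixed data set, I should be able to identify common patterns and come up with my own definition of an article. But because the set of web pages to filter is very dynamic and not known in advance, I would rather not rely on my own custom rules for what an article looks like. I would prefer a method backed by recognized research, or at least by field-tested validation. It seems to me that someone must have solved this problem before me, and that solutions based on that work may already exist.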

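3 Answers

I am assuming that you "have a set of URLs" that you need to prefilter before doing anything else. Under that assumption, you can use urlparse and a few string patterns to filter out the X percent of URLs in the set that are most likely non-articles.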

import re as regex
from urllib.parse import urlparse

urls = ['https://candid.org/privacy-policy',
        'https://happify.com/health/privacy-policy',
        'https://www.abstract.com/legal/customer-terms-of-service',
        'https://pesa.org.au/membership/terms-of-service-and-privacy-statement',
        'https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html',
        'https://www.nasa.gov/content/nasa-rss-feeds',
        'https://www.sciencedaily.com/newsfeeds.htm',
        'https://www.ndtv.com/rss']

# Substrings whose presence in the URL path marks a likely non-article page
patterns = ['privacy-policy', 'customer-terms-of-service', 'terms-of-service-and-privacy-statement', 'rss-feeds',
            'newsfeeds', 'rss']

for url in urls:
    split_url = urlparse(url)
    # Collect every non-article pattern that occurs in the URL path
    matched_patterns = [pattern for pattern in patterns if regex.search(pattern, split_url.path)]
    if not matched_patterns:
        print(f'Possible article: {url}')

# Output:
# Possible article: https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html

You can extend the regular expressions above to flag URLs that match common article URL patterns, such as date strings or common keywords like news or wireStory.

import re as regex
from urllib.parse import urlparse

urls = ['https://www.ndtv.com/rss',
        'https://candid.org/privacy-policy',
        'https://www.sciencedaily.com/newsfeeds.htm',
        'https://happify.com/health/privacy-policy',
        'https://www.nasa.gov/content/nasa-rss-feeds',
        'https://www.bbc.com/news/technology-55675826',
        'https://www.abstract.com/legal/customer-terms-of-service',
        'https://pesa.org.au/membership/terms-of-service-and-privacy-statement',
        'https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html',
        'https://www.cnet.com/news/samsungs-galaxy-s21-upgrades-likely-wont-spell-an-end-to-galaxy-fe-or-note-lines-yet',
        'https://abcnews.go.com/Politics/wireStory/trump-leave-washington-morning-bidens-inauguration-75278801?cid=clicksource_4380645_6_heads_hero_live_headlines_hed']

# Substrings whose presence in the URL path marks a likely non-article page
non_article_patterns = ['privacy-policy', 'customer-terms-of-service', 'terms-of-service-and-privacy-statement',
                        'rss-feeds', 'newsfeeds', 'rss']

# Raw strings keep the regex escapes intact: a YYYY/MM/DD date or a common keyword
known_article_patterns = [r'\d{4}/\d{2}/\d{2}', 'news', 'wireStory']

for url in urls:
    split_url = urlparse(url)
    # Skip the URL as soon as any non-article pattern occurs in its path
    if any(regex.search(pattern, split_url.path) for pattern in non_article_patterns):
        continue
    # Otherwise report it only if it matches a known article pattern
    if any(regex.search(pattern, split_url.path) for pattern in known_article_patterns):
        print(f'Possible article: {url}')

# Output:
# Possible article: https://www.bbc.com/news/technology-55675826
# Possible article: https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html
# Possible article: https://www.cnet.com/news/samsungs-galaxy-s21-upgrades-likely-wont-spell-an-end-to-galaxy-fe-or-note-lines-yet
# Possible article: https://abcnews.go.com/Politics/wireStory/trump-leave-washington-morning-bidens-inauguration-75278801?cid=clicksource_4380645_6_heads_hero_live_headlines_hed

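But as Erwan pointed out in his comment, it is unclear "how you define the difference between an article and a non-article", so my answer is only one part of a potential solution to your URL classification problem.

It is hard to make any reliable recommendation for an "out-of-the-box" solution that would satisfy an undefined use case.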

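Here are some links that may help you:

I still think your use case will probably need a multi-pronged approach, one that is refined as more URLs are classified as "article or non-article".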
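As one possible second prong, the URL prefilter could be paired with a check on the page content itself, which is what the question's update asks about. The following is only a minimal sketch using the standard library's html.parser. The rule it encodes (an `<article>` element, or at least `min_words` words inside `<p>` elements) and the names `ParagraphStats`, `looks_like_article`, and `min_words` are assumptions made for illustration, not an established definition of an article.

```python
from html.parser import HTMLParser


class ParagraphStats(HTMLParser):
    """Collects two simple structural signals from an HTML document."""

    def __init__(self):
        super().__init__()
        self.has_article_tag = False  # True once an <article> element is seen
        self.paragraph_words = 0      # number of words found inside <p> elements
        self._in_paragraph = False    # True while the parser is inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'article':
            self.has_article_tag = True
        elif tag == 'p':
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraph_words += len(data.split())


def looks_like_article(html, min_words=150):
    """Hypothetical rule: an <article> tag or enough paragraph text.

    The threshold of 150 words is an arbitrary assumption to tune on real pages.
    """
    stats = ParagraphStats()
    stats.feed(html)
    stats.close()
    return stats.has_article_tag or stats.paragraph_words >= min_words


short_page = '<html><body><nav>Privacy Terms</nav><p>short text</p></body></html>'
long_page = '<html><body><p>' + 'word ' * 200 + '</p></body></html>'
print(looks_like_article(short_page))  # False
print(looks_like_article(long_page))   # True
```

A rule like this will misjudge long legal pages, which also contain plenty of paragraph text, so it is better treated as one signal to combine with the URL patterns than as a classifier on its own.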

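There will not be a pretrained article/non-article classifier, because different people define article/non-article differently.

The first step is to define article/non-article for your specific use case. This can be done in two ways:

A set of rules - List the criteria for article/non-article, then encode those rules in a program. For example, an article does not contain the phrase "terms of service". This strategy suits deterministic, narrow definitions that will not change.
Train a binary classifier - Label thousands of web pages as article/non-article. Then preprocess them so that they are suitable for machine learning. Then fit a binary classifier; a naive Bayes classifier is a common first choice. This strategy suits probabilistic, complex definitions that may change.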
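To make the second option concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. The `NaiveBayes` class, the `tokenize` helper, and the six training sentences are all invented for illustration. A real classifier would need thousands of labeled pages, text extracted from the HTML first, and most likely an established machine learning library instead of hand-written code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r'[a-z]+', text.lower())


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)  # number of documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary: every word seen in any training document
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # Start from the log prior of the class
            score = math.log(self.doc_counts[c] / total_docs)
            total_words = sum(self.word_counts[c].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                # Add the smoothed log likelihood of the word given the class
                score += math.log((self.word_counts[c][word] + 1)
                                  / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy, hand-written training data (purely illustrative, not real pages)
train_texts = [
    'the senate passed the budget bill after a long debate on friday',
    'scientists report a new study on climate and ocean temperature',
    'the company announced record earnings and a new phone this week',
    'we collect personal data and cookies as described in this privacy policy',
    'by using this service you agree to these terms and conditions',
    'subscribe to our rss feeds or unsubscribe from email at any time',
]
train_labels = ['article', 'article', 'article',
                'non-article', 'non-article', 'non-article']

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict('you agree to our privacy policy and terms'))  # non-article
print(model.predict('a new study on the budget debate'))           # article
```

Because the model is learned from labels and not from hand-written rules, changing the definition of an article means relabeling and refitting, not rewriting code.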

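There are APIs that can do this, such as Diffbot's Analyze API.
As far as I know, there is no off-the-shelf open-source model that does this, so if you do not plan to use an API, you will have to build your own model.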
