I'm assuming you have a set of URLs that you need to pre-filter before doing anything else with them. Based on that assumption, you can use urlparse together with a handful of string patterns to filter out the URLs in your set that are most likely non-articles.
import re as regex
from urllib.parse import urlparse

urls = ['https://candid.org/privacy-policy',
        'https://happify.com/health/privacy-policy',
        'https://www.abstract.com/legal/customer-terms-of-service',
        'https://pesa.org.au/membership/terms-of-service-and-privacy-statement',
        'https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html',
        'https://www.nasa.gov/content/nasa-rss-feeds',
        'https://www.sciencedaily.com/newsfeeds.htm',
        'https://www.ndtv.com/rss']

# path fragments that usually indicate a non-article page
patterns = ['privacy-policy', 'customer-terms-of-service', 'terms-of-service-and-privacy-statement',
            'rss-feeds', 'newsfeeds', 'rss']

for url in urls:
    split_url = urlparse(url)
    # collect any non-article patterns that occur in the URL's path
    non_article = [pattern for pattern in patterns if regex.findall(pattern, split_url.path)]
    if not non_article:
        print(f'Possible article: {url}')
# output
Possible article: https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html
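If you need to run this check over a large batch of URLs, a small refinement is to compile the patterns once and wrap the filter in a reusable function. The sketch below does exactly that; the function name filter_non_articles is my own, not part of any library:

import re as regex
from urllib.parse import urlparse

def filter_non_articles(urls, patterns):
    # compile each pattern once instead of re-scanning the raw strings per URL
    compiled = [regex.compile(pattern) for pattern in patterns]
    for url in urls:
        path = urlparse(url).path
        # keep the URL only if no non-article pattern matches its path
        if not any(p.search(path) for p in compiled):
            yield url

Calling list(filter_non_articles(urls, patterns)) with the lists above yields the same single CNN URL.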
You can extend the patterns above to also flag URLs that match common article URL shapes, such as those containing a date string or common keywords like news or wire stories.
import re as regex
from urllib.parse import urlparse

urls = ['https://www.ndtv.com/rss',
        'https://candid.org/privacy-policy',
        'https://www.sciencedaily.com/newsfeeds.htm',
        'https://happify.com/health/privacy-policy',
        'https://www.nasa.gov/content/nasa-rss-feeds',
        'https://www.bbc.com/news/technology-55675826',
        'https://www.abstract.com/legal/customer-terms-of-service',
        'https://pesa.org.au/membership/terms-of-service-and-privacy-statement',
        'https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html',
        'https://www.cnet.com/news/samsungs-galaxy-s21-upgrades-likely-wont-spell-an-end-to-galaxy-fe-or-note-lines-yet',
        'https://abcnews.go.com/Politics/wireStory/trump-leave-washington-morning-bidens-inauguration-75278801?cid=clicksource_4380645_6_heads_hero_live_headlines_hed']

non_article_patterns = ['privacy-policy', 'customer-terms-of-service', 'terms-of-service-and-privacy-statement',
                        'rss-feeds', 'newsfeeds', 'rss']

# raw strings keep the regex escapes intact; a date in the path (e.g. /2021/01/14/) is a strong article signal
known_article_patterns = [r'\d{4}/\d{2}/\d{2}', 'news', 'wireStory']

for url in urls:
    split_url = urlparse(url)
    non_article = [pattern for pattern in non_article_patterns if regex.findall(pattern, split_url.path)]
    if not non_article:
        possible_article = [pattern for pattern in known_article_patterns if regex.findall(pattern, split_url.path)]
        if possible_article:
            print(f'Possible article: {url}')
# output
Possible article: https://www.bbc.com/news/technology-55675826
Possible article: https://www.cnn.com/2021/01/14/economy/unemployment-benefits-coronavirus/index.html
Possible article: https://www.cnet.com/news/samsungs-galaxy-s21-upgrades-likely-wont-spell-an-end-to-galaxy-fe-or-note-lines-yet
Possible article: https://abcnews.go.com/Politics/wireStory/trump-leave-washington-morning-bidens-inauguration-75278801?cid=clicksource_4380645_6_heads_hero_live_headlines_hed
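Since the logic above is really a three-way decision, you could also fold it into a small classifier that returns a label instead of printing. This is just a sketch under the same assumptions; the function and label names are mine, not from any standard API:

import re as regex
from urllib.parse import urlparse

def classify_url(url, non_article_patterns, known_article_patterns):
    path = urlparse(url).path
    # a known non-article pattern in the path wins outright
    if any(regex.search(p, path) for p in non_article_patterns):
        return 'non-article'
    # otherwise look for positive article signals such as a date string
    if any(regex.search(p, path) for p in known_article_patterns):
        return 'possible article'
    # no signal either way
    return 'unknown'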
But as Erwan pointed out in his comment, it isn't clear "how you define the difference between an article and a non-article", so my answer is only one piece of a potential solution to your URL classification problem.
It's hard to make any solid recommendation for an "out of the box" solution that can satisfy an undefined use case.
I still think your use case will likely need a multi-pronged approach, one that gets refined as more URLs are classified as "article or non-article".
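As one illustration of what a multi-pronged approach could look like, here is a hypothetical scoring sketch that combines several weak signals; the signals, weights, and threshold are placeholders you would want to tune against URLs you have already labelled:

import re as regex
from urllib.parse import urlparse

def article_score(url):
    path = urlparse(url).path
    score = 0
    if regex.search(r'\d{4}/\d{2}/\d{2}', path):        # date string in the path
        score += 2
    if regex.search(r'news|story|article', path):       # common article keywords
        score += 1
    if len([p for p in path.split('/') if p]) >= 3:     # article pages tend to sit deeper in the site
        score += 1
    return score

# treating score >= 2 as "possible article" is an arbitrary starting threshold
possible_articles = [url for url in urls if article_score(url) >= 2]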