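I need to better understand how to create a machine learning algorithm from scratch using my own model, built on boolean features (derived from things such as the number of words in a text, the number of punctuation marks, the number of uppercase letters, etc.), to determine whether a text is formal or informal. For example, I have: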
Text
there is a new major in this town
WTF?!?
you're a great person. Really glad to have met you
I don't know what to say
BYE BYE BABY
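I created some rules to assign labels on this (small) training dataset, but I need to understand how to apply these rules to a new (test) dataset:
- if there is an uppercase word, then I;
- if there is a short expression (contraction), like don't, 'm, 's, ..., then I;
- if there are two symbols (punctuation marks) next to each other, then I;
- if a word is in a list of extra words, then I;
- otherwise F.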
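To make the rules above concrete, here is a minimal sketch of them as a plain Python function. The regular expressions are assumptions, since the rules are only stated in words: in particular, an "uppercase word" is taken to mean two or more capital letters, so that the pronoun "I" on its own does not trigger the rule. The extra-word list `["u", "hey"]` is borrowed from the code later in the post, and `label_text` is a hypothetical name.

```python
import re

# Assumed patterns; the post states the rules only in words.
UPPER_WORD = re.compile(r"\b[A-Z]{2,}\b")   # all-caps word of 2+ letters, so "I" alone does not count
SHORT_EXP = re.compile(r"\w'\w")            # contraction such as don't, I'm, it's
TWO_SYMBOLS = re.compile(r"[^\w\s]{2}")     # two adjacent punctuation symbols


def label_text(text, extra_words=()):
    """Return "I" (informal) if any rule fires, otherwise "F" (formal)."""
    if UPPER_WORD.search(text):
        return "I"
    if SHORT_EXP.search(text):
        return "I"
    if TWO_SYMBOLS.search(text):
        return "I"
    # Compare lowercased words against the lowercased extra-word list.
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & {w.lower() for w in extra_words}:
        return "I"
    return "F"


train_texts = [
    "there is a new major in this town",
    "WTF?!?",
    "you're a great person. Really glad to have met you",
    "I don't know what to say",
    "BYE BYE BABY",
]
train_labels = [label_text(t, extra_words=["u", "hey"]) for t in train_texts]
print(train_labels)  # ['F', 'I', 'I', 'I', 'I']
```

Because the function only looks at the text it is given, the same call works unchanged on any new sentence.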
Suppose I have a dataframe to test, and I want to assign these labels (I or F) to it:
FREEDOM!!! I don't need to go to school anymore
What are u thinking?
Hey men!
I am glad to hear that.
How can I apply my model to this new dataset and add the labels?
Test                                               Output
FREEDOM!!! I don't need to go to school anymore    I
What are u thinking?                               I
Hey men!                                           I
I am glad to hear that.                            F
Update after mnm's comment:
Would the following be considered a machine learning problem?
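import re
import pandas as pd

data = {"ID": [1, 2, 3, 4],
        "Text": ["FREEDOM!!! I don't need to go to school anymore",
                 "What are u thinking?",
                 "Hey men!",
                 "I am glad to hear that."]}
df = pd.DataFrame(data)

# here there should be the part of modelling: one boolean column per rule
# True if there is an upper case word (assumed: 2+ capital letters, so "I" alone does not count)
df['upper'] = df['Text'].str.contains(r"\b[A-Z]{2,}\b")
# True if there is a short exp (contraction) such as don't, 'm, 's
df['short_exp'] = df['Text'].str.contains(r"\w'\w")
# True if there are two consecutive symbols (punctuation)
df['two_cons'] = df['Text'].str.contains(r"[^\w\s]{2}")
list_extra = ['u', 'hey']
# True if row contains at least one of the words included in list_extra (case-insensitive)
extra_pattern = r"\b(?:" + "|".join(map(re.escape, list_extra)) + r")\b"
df['extra'] = df['Text'].str.contains(extra_pattern, case=False)

# the rule columns are assigned directly on df, so nothing has to be copied from a separate df1
df_new = df.copy()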
However, the last part, the one based on conditions, is still not clear to me. How can I predict the values for other, new texts?
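One possible way to "predict" for new texts with a rule-based model is simply to re-run the same rules on the new dataframe. The sketch below wraps the rules in a reusable function that adds one boolean column per rule plus a final `Label` column. The function name `add_labels`, the `RULES` dictionary and the regular expressions are all assumptions made for illustration, not something given in the post. Since the rules are written by hand rather than learned from labelled data, this would usually be described as a rule-based classifier. A learned model could instead take the boolean columns as input features.

```python
import re
import pandas as pd

# Assumed regular expressions for the rules; the names mirror the columns in the post.
RULES = {
    "upper": r"\b[A-Z]{2,}\b",   # all-caps word of 2+ letters ("I" alone is ignored)
    "short_exp": r"\w'\w",       # contraction such as don't, I'm, it's
    "two_cons": r"[^\w\s]{2}",   # two consecutive punctuation symbols
}


def add_labels(frame, extra_words, text_col="Text"):
    """Return a copy of `frame` with one boolean column per rule plus a Label column."""
    out = frame.copy()
    for name, pattern in RULES.items():
        out[name] = out[text_col].str.contains(pattern, regex=True)
    if extra_words:
        # Whole-word, case-insensitive match against the extra-word list.
        extra_pattern = r"\b(?:" + "|".join(map(re.escape, extra_words)) + r")\b"
        out["extra"] = out[text_col].str.contains(extra_pattern, case=False, regex=True)
    else:
        out["extra"] = False
    # A text is informal ("I") if at least one rule fires, otherwise formal ("F").
    fired = out[list(RULES) + ["extra"]].any(axis=1)
    out["Label"] = fired.map({True: "I", False: "F"})
    return out


test = pd.DataFrame({"Text": [
    "FREEDOM!!! I don't need to go to school anymore",
    "What are u thinking?",
    "Hey men!",
    "I am glad to hear that.",
]})
labelled = add_labels(test, extra_words=["u", "hey"])
print(labelled[["Text", "Label"]])
```

Calling `add_labels` on any other dataframe with a `Text` column applies exactly the same conditions, so no separate training step is involved.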