Logistic regression cost function: math error when trying to compute log(0)

data-mining machine-learning python logistic-regression
2022-02-20 12:53:46

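I am learning machine learning, and after reading up on logistic regression I tried to implement it from scratch in Python using gradient descent.

It works well in some cases, but in others it raises a math error, which is understandable given the situation described below.

The cost function in logistic regression is -(y * log(predicted) + (1 - y) * log(1 - predicted))

What happens when the prediction is 1? The code fails because it tries to compute log(1 - 1) = log(0), which is undefined. Specifically, Python raises this error:

ValueError('math domain error')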
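
The failure described above can be reproduced in isolation. The snippet below is a minimal sketch, not part of the original code; `sigmoid` is a helper name introduced here that mirrors the expression used for `guess`, and the input 40 is an arbitrary value chosen to be large enough to saturate.

```python
from math import exp, log

def sigmoid(z):
    """Logistic function, written the same way as `guess` in the question."""
    return 1 / (1 + exp(-z))

# For a moderately large input, exp(-z) is smaller than the spacing of
# floats just above 1.0, so 1 + exp(-z) rounds to exactly 1.0.
guess = sigmoid(40)
print(guess == 1.0)  # True

try:
    log(1 - guess)  # this is log(0.0)
except ValueError as err:
    # The exact wording of the message depends on the Python version.
    print("ValueError:", err)
```

So the prediction does not need to be mathematically equal to 1; it only needs to be close enough that floating-point rounding makes it so.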

Please help me understand how to prevent this.

The code is as follows:

from numpy.random import RandomState

import pandas as panda
import matplotlib.pyplot as plot 
import random
from math import sqrt, exp, log
remote_location = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

def standard_deviation(values):
    average = sum(values) / len(values)

    variance = sum([(average - i)**2/len(values) for i in values])

    return sqrt(variance)


class LogisticRegression(object):

    def __init__(self, epochs, learning_rate, _x_training_set, _y_training_set, standardize = False, random_state = None):
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.standardize = standardize        
        self._x_training_set = _x_training_set
        self._y_training_set = _y_training_set
        self.number_of_training_set = len(self._y_training_set)
        self.weights = []      
        self.random_state = RandomState(random_state if random_state else 1)

    def standardizeInputData(self):
        """

        Standardizing a feature set means subtracting the mean from
        each feature value and dividing the result by the standard
        deviation. This method does so per training sample:

        1. take the average of the j features of the i-th training sample, say avg
        2. calculate the variance over those j features
        3. variance = sum((avg - x(j))**2) / len(features)
        4. standard deviation = sqrt(variance)

        so standardized(x(j)) = (x(j) - avg) / standard deviation

        """
        temp = []

        for i in range(len(self._x_training_set)):

            mean = sum(self._x_training_set[i])/ len(self._x_training_set[i])
            std_deviation = standard_deviation(self._x_training_set[i])
            temp.append([ (j - mean)/std_deviation for j in self._x_training_set[i]])            

        return temp

    def setup(self):

        if self.standardize:
            self._x_training_set = self.standardizeInputData()

        self.initialize_weights(len(self._x_training_set[0]) + 1)

    def initialize_weights(self, number_of_weights):

        self.weights = list(self.random_state.normal(loc=0.0, scale=0.01, size=number_of_weights))

    def learn(self):

        self.setup() 
        epoch_data = {}
        error = 0

        for epoch in range(self.epochs):

            cost =0 

            for i in range(self.number_of_training_set):
                _x = self._x_training_set[i]
                _desired = self._y_training_set[i]
                _weight = self.weights

                weighted_sum = _weight[0] + sum([_weight[j+1] * _x[j] for j in range(len(_x))])

                guess = 1 / ( 1 + exp(- weighted_sum))

                error = _desired - guess 

                ## update all the weights whenever the prediction is off
                if error != 0:

                    ## update the bias unit (accumulate, do not overwrite)
                    self.weights[0] += error * self.learning_rate
                    self.weights[1:] = [self.weights[j+1] + error * self.learning_rate * _x[j]
                                        for j in range(len(_x))]

                    ## cross-entropy loss; log() raises ValueError when guess is exactly 0.0 or 1.0
                    cost+= - ( _desired * log(guess) + (1 - _desired) *log(1-guess))

            #saving error at the end of the training set        
            epoch_data[epoch] = cost ##summation of all such y predictions for a training set

        print(epoch_data)

    def predict(self, _x_test_data):
        """

            Given algorithm has been trained using the #learn method
            this method will predict the y values based on the last
            values calculated for weights. This is because
            by the end of the learn method, algorithm has already
            converged as close to 0 error as it can
        """
        prediction = []

        for i in range(len(_x_test_data)):

            weighted_sum = self.weights[0] +  \
                    sum([self.weights[j+1] * _x_test_data[i][j] \
                        for j in range(len(_x_test_data[i]))])

            guess = 1 / ( 1 + exp(- weighted_sum))

            prediction.append( 1 if guess >= 0.5 else 0)

        print(prediction)
        return prediction

Client code:

import pandas as panda

from sklearn.model_selection import train_test_split
from predicting_logistic_regression import LogisticRegression
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn import datasets

remote_location = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'


# data = panda.read_csv(remote_location)       
# _x_training_set = list(data.iloc[0:, [0,2]].values)
# _y_training_set = [0 if i.lower()!='iris-setosa' else 1 for i in data.iloc[0:, 4].values]

data = datasets.load_iris()
_x_training_set = data.data[:,[2,3]]
_y_training_set = data.target 


_x_train, _x_test, _y_train, _y_test = train_test_split(
    _x_training_set,
    _y_training_set,
    test_size=0.3,
    random_state=1,
    stratify=_y_training_set)


random_generator_start = -1
random_generator_end = 1

logistic_regression = LogisticRegression(
    learning_rate=0.01,
    epochs=40,
    _x_training_set=_x_train,
    _y_training_set=_y_train,
    standardize=False)

logistic_regression.learn()
_y_predicted = logistic_regression.predict(_x_test)

print(_y_predicted)
print(_y_test)
print(accuracy_score(_y_test, _y_predicted))
print(mean_absolute_error(_y_test, _y_predicted))
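
One thing worth checking in the client code above, offered as an observation and not as part of the original question: `datasets.load_iris()` returns targets with three classes (0, 1 and 2), whereas the cost function assumes y is either 0 or 1. For a sample with y = 2, `_desired - guess` is always positive, so the updates tend to keep pushing `guess` toward 1. The commented-out lines reduce the task to setosa versus the rest; a minimal sketch of the same idea follows, where `binarize_labels` and `positive_class` are hypothetical names introduced here.

```python
def binarize_labels(targets, positive_class=0):
    """Map multi-class labels to 1 for `positive_class` and 0 otherwise."""
    return [1 if t == positive_class else 0 for t in targets]

# Stand-in for data.target from load_iris(), which holds 0, 1 and 2
# (class 0 is setosa).
example_targets = [0, 0, 1, 2, 1, 2]
print(binarize_labels(example_targets))  # [1, 1, 0, 0, 0, 0]
```

In the client code this could be applied as `_y_training_set = binarize_labels(data.target)` before the train/test split.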

3 Answers

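Although in theory you never get a value exactly equal to 1 or 0, in practice it does happen because of floating-point arithmetic (when the value gets too close to 0 or 1). You can prevent it by enforcing a minimum and a maximum value for the `guess` variable, and you are less likely to run into it if you add regularization. It is also less likely to happen on "harder" datasets, where you do not reach close to 100% accuracy; I assume you hit this error on a toy dataset such as iris.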
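
The minimum/maximum idea from this answer can be sketched with plain `min` and `max`, in keeping with the question's use of the `math` module. The function name `clipped_cost` and the tolerance `EPSILON = 1e-15` are assumptions made for illustration; the answer does not specify a value.

```python
from math import exp, log

EPSILON = 1e-15  # assumed tolerance, not a value taken from the answer

def clipped_cost(desired, weighted_sum, eps=EPSILON):
    """Cross-entropy for one sample, with guess kept inside [eps, 1 - eps]."""
    guess = 1 / (1 + exp(-weighted_sum))
    guess = min(max(guess, eps), 1 - eps)  # never exactly 0.0 or 1.0
    return -(desired * log(guess) + (1 - desired) * log(1 - guess))

# The sigmoid saturates to exactly 1.0 here, yet the cost stays finite.
print(clipped_cost(0, 40))
print(clipped_cost(1, 40))
```

Note that this only guards the logarithm; `exp(-weighted_sum)` can still overflow for very negative sums.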

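Thanks to a lot of Google searching and several articles on logistic regression issues, this is what I came up with.

If you look at the code, there is a potential problem in these particular lines:

weighted_sum = _weight[0] + sum([_weight[j+1] * _x[j] for j in range(len(_x))])
guess = 1 / ( 1 + exp(- weighted_sum))

When the argument passed to exp is greater than about 709 (here, when weighted_sum is below about -709), the result is too large for a float and exp raises an overflow error. Likewise, for very negative arguments exp underflows to 0, which makes guess exactly 1.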

To get around this, I used a normalization technique. Credit: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

Here is the updated code:

weighted_sum = [_weight[0]] + [_weight[j+1] * _x[j] for j in range(len(_x))]

normalized_weighted_sum = (sum(weighted_sum) - min(weighted_sum)) / (max(weighted_sum) - min(weighted_sum))

guess = 1 / (1 + exp(-normalized_weighted_sum))

This works like a charm.
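
As an alternative to rescaling the weighted sum, which changes the value the sigmoid sees, a commonly used way to avoid the overflow is to evaluate the sigmoid in a form whose `exp` argument is never positive. This is a sketch of that different technique, not the approach used in this answer; `stable_sigmoid` is a hypothetical helper name.

```python
from math import exp

def stable_sigmoid(z):
    """Sigmoid that only ever calls exp() on a non-positive argument."""
    if z >= 0:
        return 1 / (1 + exp(-z))
    ez = exp(z)  # z < 0, so 0 <= ez < 1 and cannot overflow
    return ez / (1 + ez)

print(stable_sigmoid(-1000))  # 0.0
print(stable_sigmoid(0))      # 0.5
print(stable_sigmoid(1000))   # 1.0
```

This removes the overflow but can still return exactly 0.0 or 1.0, so the value would still need to be kept away from those bounds before taking the logarithm.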

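You can clip the values away from exactly 0 and 1, using values very close to them instead, to avoid NaN results. For example, you could adjust the relevant line of code like this:

some_variable = np.clip(some_function_output, 1e-15, 1 - 1e-15)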
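
Applied to a whole batch of predictions, the clipping might look like the sketch below. It assumes NumPy is available; `clipped_log_loss` and its `eps` default are hypothetical names and values chosen here, and the loss is summed to match the way the question accumulates `cost`.

```python
import numpy as np

def clipped_log_loss(y_true, y_prob, eps=1e-15):
    """Summed cross-entropy with probabilities clipped into [eps, 1 - eps]."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Predictions of exactly 1.0 and 0.0 no longer produce NaN or infinity.
loss = clipped_log_loss([1, 0, 1], [1.0, 0.0, 0.9])
print(np.isfinite(loss))  # True
```

The choice of `eps` caps the penalty for a confidently wrong prediction at roughly -log(eps) per sample.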