数据挖掘 - 试图理解逻辑回归的实现 - 吾爱随笔录

试图理解逻辑回归的实现

数据挖掘 Python 逻辑回归梯度下降

2021-10-09 13:47:02

我目前正在使用以下代码作为起点来加深我对正则化逻辑回归的理解。作为第一遍，我只是尝试对 iris 数据集的一部分进行二进制分类。

我认为我遇到的一个问题是负对数损失（计算损失并存储在 loss_vec 中）从一次迭代到下一次迭代没有太大变化。

我面临的另一个挑战是在学习逻辑回归系数后试图弄清楚如何绘制决策边界。使用系数来绘制 0.5 决策边界还很遥远。这让我觉得我在某个地方犯了一个错误

http://fa.bianp.net/blog/2013/numerical-optimizers-for-logistic-regression/

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets


iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target


X = X[:100,:]
Y = Y[:100]

def phi(t):
    # logistic function, returns 1 / (1 + exp(-t))
    idx = t > 0
    out = np.empty(t.size, dtype=np.float)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out


def loss(x0, X, y, alpha):
    # logistic loss function, returns Sum{-log(phi(t))}
    #x0 is the weight vector w are the paramaters, c is the bias term
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    yz = y * z
    idx = yz > 0
    out = np.zeros_like(yz)
    out[idx] = np.log(1 + np.exp(-yz[idx]))
    out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
    out = out.sum() / X.shape[0] + .5 * alpha * w.dot(w)
    return out


def gradient(x0, X, y, alpha):
    # gradient of the logistic loss
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    z = phi(y * z)
    z0 = (z - 1) * y
    grad_w = X.T.dot(z0) / X.shape[0] + alpha * w
    grad_c = z0.sum() / X.shape[0]
    return np.concatenate((grad_w, [grad_c]))


def bgd(X, y, alpha, max_iter):
    step_sizes = np.array([100,10,1,.1,.01,.001,.0001,.00001])
    iter_no = 0
    x0 = np.random.random(X.shape[1] + 1) #initialize weight vector

    #placeholder for coefficients to test against the loss function
    next_thetas = np.zeros((step_sizes.shape[0],X.shape[1]+1) )   
    J = loss(x0,X,y,alpha)
    running_loss = []
    while iter_no < max_iter:
        grad = gradient(x0,X,y,alpha)
        next_thetas = -(np.outer(step_sizes,grad)-x0)
        loss_vec = []
        for i in range(np.shape(next_thetas)[0]):
            loss_vec.append(loss(next_thetas[i],X,y,alpha))
        ind = np.argmin(loss_vec)
        x0 = next_thetas[ind]
        if iter_no % 500 == 0:
            running_loss.append(loss(x0,X,y,alpha))
        iter_no += 1  
    return next_thetas

2个回答

我在实施中看到了几个问题。有些只是不必要的复杂方法，但有些是真正的错误。

主要内容

A：尝试从模型背后的数学开始。逻辑回归是一个相对简单的回归。找到你需要的两个方程并坚持下去，一个字母一个字母地复制它们。

B：矢量化。如果您退后一步并考虑最佳矢量化实现，它将为您节省许多不必要的步骤和计算。

C：在代码中多写注释。它会帮助那些试图帮助你的人。它还将帮助您更好地理解每个部分，并可能自己发现错误。

现在让我们一步一步地查看代码。

1.sigmoid函数

有没有理由在中进行如此复杂的实施phi(t)？假设这t是一个向量（一个 numpy 数组），那么你真正需要的是：

def phi(t):
   1. / (1. + np.exp(-t))

Asnp.exp()在数组上按元素操作。理想情况下，我会将它实现为一个也可以返回其导数的函数（这里不是必需的，但如果您尝试使用 sigmoid 激活实现一个基本的神经网络，可能会很方便）：

def phi(t, dt = False):
   if dt:
      phi(t) * (1. - phi(t))
   else:
      1. / (1. + np.exp(-t))

2.成本函数

通常，逻辑成本函数以下列方式定义为对数成本（矢量化）： $\frac{1}{m} (-(y^T \log{(\phi(X\theta))})-(1-y^T)(\log{(1 - \phi(X\theta))}) + \frac{\lambda}{2m} \theta^{1T}\theta$ 在哪里 $\phi(z)$ 是逻辑（sigmoid）函数， $\theta$ 是完整的参数向量（包括偏置权重）， $\theta^1$ 是参数向量 $\theta_1=0$ （按照惯例，偏差不是正则化的）和 $\lambda$ 是正则化参数。

我真正不明白的是你相乘的部分y * z。假设y是你的标签向量 $y$ z，为什么在应用 sigmoid 函数之前将它与你的相乘？为什么需要将成本函数分成 0 和 1 并分别计算任一样本的损失？我认为您代码中的问题确实在于这部分：您一定是错误地乘以 $y$ 和 $X\theta$ 申请前 $\phi(.)$ .

另外，这里有点：X.dot(w) + c。你c的偏差参数也是如此，对吧？你为什么将它添加到每个元素 $X\theta$ ? 它不应该被添加 - 它应该是向量的第一个元素 $X\theta$ . 是的，你没有对其进行正则化，但你需要在损失函数的“预测”部分使用。

在您的代码中，我还认为成本函数过于复杂。这是我会尝试的：

def loss(X,y,w,lam):
   #calculate "prediction"
   Z = np.dot(X,w)
   #calculate cost without regularization
   #shape of X is (m,n), shape of w is (n,1)
   J = (1./len(X)) * (-np.dot(y.T, np.log(phi(Z))) * np.dot((1-y.T),np.log(1 - phi(Z))))
   #add regularization
   #temporary weight vector
   w1 = copy.copy(w) #import copy to create a true copy
   w1[0] = 0
   J += (lam/(2.*len(X))) * np.dot(w1.T, w)
   return J

3.渐变

再次，让我们先回顾一下逻辑损失梯度的公式，再一次，向量化： $\frac{1}{m} ((\phi(X\theta) - y)^TX)^T + \frac{\lambda}{m}\theta^1$ . 这将返回所有参数的导数向量（即梯度），并适当地正则化（没有正则化的偏置项）。

再一次，你乘以 $y$ 太快了：phi(y * z)。事实上，你不应该乘以 $y$ 完全在渐变中。

这就是我要为渐变做的事情：

def gradient(X, y, w, lam):
   #calculate the prediction
   Z = np.dot(X,w)
   #temporary weight vector
   w1 = copy.copy(w) #import copy to create a true copy
   w1[0] = 0
   #calc gradient
   grad = (1./len(X)) * (np.dot((phi(Z) - y).T,X).T) + (lam/len(X)) * w1
   return grad

实际的梯度下降实现对我来说似乎没问题，但由于梯度和成本函数实现存在错误，它无法交付：/

希望这将帮助您走上正轨。

下面是如何在 Python 中实现梯度下降：

def sigmoid(z):
    s= 1/(1 + np.exp(-z))
    return s

def propagate(w, b, X, Y):

    m = X.shape[1]

    A = sigmoid(np.dot(w.T,X)+b)                                     # compute activation

    cost = -1/m * np.sum(Y * np.log(A) + (1-Y) * (np.log(1-A)))

    dz= (1/m)*(A - Y)
    dw = np.dot(X, dz.T)
    db = np.sum(dz)


    cost = np.squeeze(cost)
    grads = {"dw": dw,
             "db": db}

    return grads, cost


def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):

    costs = []

    for i in range(num_iterations):
        m = X.shape[1]
        grads,cost = propagate(w, b, X, Y)
        b = b - learning_rate*grads["db"]
        w = w - learning_rate*grads["dw"]
        if i % 100 == 0:
            costs.append(cost)
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    params = {"w": w,
              "b": b}
    return params, grads, costs



def predict(w, b, X):
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    A = sigmoid(np.dot(w.T,X)+ b)

    for i in range(A.shape[1]):
        x_exp = np.exp(A)
        x_sum = np.sum(x_exp,axis=1,keepdims=True)
        s = np.divide(x_exp,x_sum)

    Y_prediction = 1. * (A > 0.5)

    return Y_prediction



def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):
    w, b = initialize_with_zeros(X_train.shape[0])

    print("learning rate:",learning_rate)

    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = False)

    w = parameters["w"]
    b = parameters["b"]

    Y_prediction_train = predict(w,b,X_train)
    Y_prediction_test = predict(w,b,X_test)
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}

    return d

d = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)

其它你可能感兴趣的问题

上一篇我们可以将 word2vec 向量与 doc2vec 向量进行比较吗？下一篇在对测试数据进行最终评估后删除特征