How can I estimate the slope of the lines in a scatter plot?

data-mining machine-learning tensorflow regression machine-learning-model

2021-10-05 11:01:49

I have a list of coordinate pairs. To the human eye, they form lines with a constant slope:

This is how I generated the image above:

import numpy as np
np.random.seed(42)

slope = 1.2 # all lines have the same slope

offsets = np.arange(10) # we will have 10 lines, each with different y-intercept
xslist=[]
yslist=[]

for offset in offsets:
    # each line will be described by a variable number of points:
    size = np.random.randint(low=50,high=100)

    # each line starts somewhere between -5 and -2 and ends somewhere between 2 and 5
    xs = np.random.uniform(low=np.random.uniform(-5,-2), high=np.random.uniform(2,5),size=size)

    # add one random offset per line (size=1) and independent noise per point (size=size)
    ys = slope * xs + offset + np.random.normal(loc=0, scale=0.1, size=1) + np.random.normal(loc=0, scale=0.01, size=size)
    xslist.append(xs)
    yslist.append(ys)

# bring all x and y points together to single arrays
xs = np.concatenate(xslist) # xs: array([-0.37261674,  0.58267626, -3.72592914 ...
ys = np.concatenate(yslist) # ys: array([-0.53638699,  0.61729781, -4.52132114, 

# plot results
import matplotlib.pyplot as plt
plt.scatter(xs,ys)

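I can generate as many xs and ys as I like. In my real-world scenario, however, I do not know which point belongs to which line, so I cannot simply split the points into groups and apply a least-squares fit to each group separately.

How can I use machine learning, or some other approach, to build a function that takes xs and ys as input and returns an estimate of the slope of the lines in an image like the one above?

Why a simple least-squares fit does not seem to work

Let's generate new data in which the failure of the least-squares fit is more obvious, with a slope of 2.4 and y-intercepts between 0 and a few hundred.

Data generation: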

import numpy as np
np.random.seed(42)

slope = 2.4

offsets = np.arange(0,500,100)
xslist=[]
yslist=[]

for offset in offsets:

    size = np.random.randint(low=50,high=100)

    xs = np.random.uniform(low=np.random.uniform(-5,-2), high=np.random.uniform(2,5),size=size)
    ys = slope * xs + offset + np.random.normal(loc=0, scale=0.1, size=1) + np.random.normal(loc=0, scale=0.01, size=size)

    xslist.append(xs)
    yslist.append(ys)

xs = np.concatenate(xslist)
ys = np.concatenate(yslist)

A least-squares fit of a line using np.polyfit():

a, b = np.polyfit(xs, ys, deg=1)

Note that I cannot fit just one of the lines, because I do not know which points belong to which line.

Plotting the result:

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.scatter(xs,ys)

line_x = np.arange(-5,5,0.01)
line_y = a*line_x + b
plt.plot(line_x,line_y,c='r',linewidth=10)

plt.gca().set_aspect(1/8)

[Figure: scatter plot of the data with the least-squares line drawn in red]

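The slope obtained from the least-squares fit (the slope of the red line) differs considerably from the slope of the lines formed by the dots. (Note that the x and y axes have different scales.)

Printing a (our slope estimate) and the true slope, slope: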

print(a)
print(slope)

gives:

4.295790412452058
2.4

This error is too large for my real-world application.
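One way to see where the error comes from is to compare the pooled fit with a fit that removes each line's intercept first. The sketch below is an illustration under stated assumptions: it regenerates similar data with `np.random.default_rng` (so the numbers will not match the output above), and it keeps the line label of every point in a hypothetical `groups` array, which is only possible with simulated data. The pooled slope mixes the true within-line slope with a between-line component that depends on how each line's mean x happens to line up with its intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
slope = 2.4
offsets = np.arange(0, 500, 100)

xs_parts, ys_parts, group_parts = [], [], []
for group, offset in enumerate(offsets):
    size = rng.integers(50, 100)
    x = rng.uniform(rng.uniform(-5, -2), rng.uniform(2, 5), size=size)
    y = slope * x + offset + rng.normal(0, 0.1) + rng.normal(0, 0.01, size=size)
    xs_parts.append(x)
    ys_parts.append(y)
    group_parts.append(np.full(size, group))  # remember which line each point came from

xs = np.concatenate(xs_parts)
ys = np.concatenate(ys_parts)
groups = np.concatenate(group_parts)

# Pooled fit: ignores which line a point belongs to.
pooled_slope = np.polyfit(xs, ys, deg=1)[0]

# Within-line fit: subtract each line's own mean from x and y,
# which cancels the intercepts, then fit a slope through the origin.
x_centered = xs.copy()
y_centered = ys.copy()
for g in np.unique(groups):
    mask = groups == g
    x_centered[mask] -= xs[mask].mean()
    y_centered[mask] -= ys[mask].mean()
within_slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)

print("pooled slope:     ", pooled_slope)
print("within-line slope:", within_slope)
```

The within-line estimate is not usable in the real-world setting, because it needs the labels. It does suggest that recovering the grouping, even approximately, is what makes the slope recoverable.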

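A function for generating simulated data

As requested in the comments, here is a function that generates data similar to the example above: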

def get_data(number_of_examples):

    np.random.seed(42)

    list_of_xs = []
    list_of_ys = []
    true_slopes = []

    for _ in range(number_of_examples):

        slope = np.random.uniform(low=-10, high=10)

        offsets = np.arange(0,
                            np.random.randint(low=20, high=200),
                            np.random.randint(low=1, high=10))
        xslist=[]
        yslist=[]

        for offset in offsets:

            size = np.random.randint(low=np.random.randint(low=40, high=60),
                                     high=np.random.randint(low=80, high=100))

            xs = np.random.uniform(low=np.random.uniform(-5,-2),
                                   high=np.random.uniform(2,5),size=size)
            ys = slope * xs + offset + \
                np.random.normal(loc=0, scale=0.1, size=1) + \
                np.random.normal(loc=0, scale=0.01, size=size)

            xslist.append(xs)
            yslist.append(ys)

        xs = np.concatenate(xslist)
        ys = np.concatenate(yslist)

        list_of_xs.append(xs)
        list_of_ys.append(ys)
        true_slopes.append(slope)
    
    return list_of_xs, list_of_ys, true_slopes

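Trying it out to get 10 examples:

list_of_xs, list_of_ys, true_slopes = get_data(10)

Plotting the results (the slope of the red line is what I am trying to predict from the coordinates of the blue dots):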

for xs, ys, true_slope in zip(list_of_xs, list_of_ys, true_slopes):
    plt.figure()
    plt.scatter(xs, ys)
    plt.plot(xs, xs*true_slope, c='r')

And so on.
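To compare candidate estimators on examples like these, a small scoring helper can be useful. The following is a minimal sketch, not part of the original question: the names `polyfit_slope` and `mean_absolute_slope_error` are made up for illustration, and the demo data is a tiny hand-made set of noise-free single lines rather than the simulated examples.

```python
import numpy as np

def polyfit_slope(xs, ys):
    """Baseline estimator: slope of one least-squares line through all points."""
    return np.polyfit(xs, ys, deg=1)[0]

def mean_absolute_slope_error(estimator, list_of_xs, list_of_ys, true_slopes):
    """Average |estimated slope - true slope| over all examples."""
    errors = [abs(estimator(xs, ys) - true_slope)
              for xs, ys, true_slope in zip(list_of_xs, list_of_ys, true_slopes)]
    return float(np.mean(errors))

# Tiny hand-made check: two examples, each a single noise-free line,
# where the baseline should recover the slope almost exactly.
demo_slopes = [1.5, -2.0]
demo_xs = [np.linspace(-3, 3, 20), np.linspace(-4, 2, 30)]
demo_ys = [s * x + 7.0 for s, x in zip(demo_slopes, demo_xs)]

demo_error = mean_absolute_slope_error(polyfit_slope, demo_xs, demo_ys, demo_slopes)
print(demo_error)
```

With the generator above, the same helper could be called as `mean_absolute_slope_error(polyfit_slope, *get_data(10))`, since `get_data` returns the three lists in that order.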

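1 Answer

One procedure you could use is the following. First, cluster your data with a Gaussian mixture model, then fit a linear regression to each cluster. This approach also works for multiple lines with different slopes. It should be able to handle intersections as well, because points near an intersection can plausibly belong to either cluster, and a misclassification there does not change the regression result much.

I will post the complete code.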

# Your code for generating the data
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
slope = 2.4

offsets = np.arange(0,500,100)
xslist=[]
yslist=[]

for offset in offsets:

    size = np.random.randint(low=50,high=100)

    xs = np.random.uniform(low=np.random.uniform(-5,-2), high=np.random.uniform(2,5),size=size)
    ys = slope * xs + offset + np.random.normal(loc=0, scale=0.1, size=1) + np.random.normal(loc=0, scale=0.01, size=size)

    xslist.append(xs)
    yslist.append(ys)

xs = np.concatenate(xslist)
ys = np.concatenate(yslist)

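We will fit several Gaussian mixture models to your data points and fix the number of components at the value that minimizes the Bayesian information criterion (BIC).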

# Create multiple Gaussian Mixture models
from sklearn.mixture import GaussianMixture
X = np.vstack((xs, ys)).T
n_components = np.arange(1, 21)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(X) for n in n_components]

# Get optimal number of components by using the index of the components with the minimal value for the Bayesian Information Criterion (BIC)
n_components_optimal = np.argmin(np.array([model.bic(X) for model in models])) + 1
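For reference, the criterion being minimized has the standard form shown below, where $n$ is the number of data points, $\hat{L}$ is the maximized likelihood of the fitted mixture, and $k$ is the number of free parameters. The parameter count is worked out here for this specific setting (two-dimensional data, `covariance_type='full'`, $m$ components), as an illustration rather than a general formula.

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}

% Free parameters for m full-covariance components in 2 dimensions:
%   means:        2 per component
%   covariances:  3 unique entries per component (symmetric 2x2 matrix)
%   weights:      m - 1 (they sum to 1)
k = 2m + 3m + (m - 1) = 6m - 1
```

Each extra component therefore costs $6 \ln n$ in penalty, so a component is only added when it improves the log-likelihood by more than half of that.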

Plot the result to see how well the clustering works with the optimal number of components.

# Code for plotting
gaussian_mixture_model_optimal = GaussianMixture(n_components_optimal, covariance_type='full', random_state=0).fit(X)
labels = gaussian_mixture_model_optimal.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')

Now use the clustered data: build a sub-dataframe for each cluster and fit your linear regression to it.

import pandas as pd 
    
df = pd.DataFrame({
    "x": xs,
    "y": ys,
    "cluster": labels,
})

# Select the rows assigned to one cluster (columns: x, y, cluster)
cluster_number = 1
X_sub = df.query('cluster == @cluster_number').values
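The remaining step, fitting a line to each cluster and combining the slopes into one estimate, might look like the following sketch. The helper name `median_cluster_slope` and the `min_points` threshold are assumptions for illustration, not part of the original code. Taking the median is one possible choice: it limits the influence of a cluster that happens to mix points from two lines. The demo uses hand-made labels in place of the mixture model's output, so it runs without scikit-learn.

```python
import numpy as np

def median_cluster_slope(xs, ys, labels, min_points=10):
    """Fit one least-squares line per cluster and return the median slope.

    min_points is an illustrative threshold: smaller clusters are skipped.
    """
    slopes = []
    for label in np.unique(labels):
        mask = labels == label
        if mask.sum() < min_points:
            continue
        cluster_slope, _intercept = np.polyfit(xs[mask], ys[mask], deg=1)
        slopes.append(cluster_slope)
    if not slopes:
        raise ValueError("no cluster has at least min_points points")
    return float(np.median(slopes))

# Self-contained check: 5 parallel lines with slope 2.4, with
# hand-made labels standing in for the clustering result.
rng = np.random.default_rng(0)
demo_labels = np.repeat(np.arange(5), 60)
demo_xs = rng.uniform(-4, 4, size=demo_labels.size)
demo_ys = (2.4 * demo_xs + 100.0 * demo_labels
           + rng.normal(0, 0.01, size=demo_labels.size))

demo_estimate = median_cluster_slope(demo_xs, demo_ys, demo_labels)
print(demo_estimate)
```

With the variables defined above, the call would be `median_cluster_slope(xs, ys, labels)`. How good the estimate is depends on how well the mixture components line up with the actual lines, which is not guaranteed.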
