I have a list of coordinate pairs. To the human eye, they form lines with a constant slope:
This is how I generated the image above:
import numpy as np
np.random.seed(42)
slope = 1.2 # all lines have the same slope
offsets = np.arange(10) # we will have 10 lines, each with different y-intercept
xslist=[]
yslist=[]
for offset in offsets:
    # each line is described by a variable number of points
    size = np.random.randint(low=50,high=100)
    # each line starts somewhere between -5 and -2 and ends between 2 and 5
    xs = np.random.uniform(low=np.random.uniform(-5,-2), high=np.random.uniform(2,5),size=size)
    # add a random per-line offset and some random per-point noise
    ys = slope * xs + offset + np.random.normal(loc=0, scale=0.1, size=1) + np.random.normal(loc=0, scale=0.01, size=size)
    xslist.append(xs)
    yslist.append(ys)
# bring all x and y points together to single arrays
xs = np.concatenate(xslist) # xs: array([-0.37261674, 0.58267626, -3.72592914 ...
ys = np.concatenate(yslist) # ys: array([-0.53638699, 0.61729781, -4.52132114,
# plot results
import matplotlib.pyplot as plt
plt.scatter(xs,ys)
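I can generate as many xs and ys as I like. In my real-world scenario, I don't know which point belongs to which line, so I can't simply split the points into groups and apply a least squares fit to each group separately.
How can I build a function, using machine learning or any other approach, that takes xs and ys as input and returns an estimate of the slope of the lines in an image like the one above?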
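To make the shape of the function being asked for concrete, here is a minimal sketch of one possible direction. It is not from the original post. The name `estimate_slope`, the candidate grid, and the clustering score are all assumptions. The idea: for a candidate slope m, the residuals ys - m*xs collapse into a few tight clusters (one per line) only when m is close to the true slope, and the sum of square roots of the gaps between sorted residuals is small when the residuals are tightly clustered. It presumes low noise and a known slope range, so it may well not carry over to real data.

```python
import numpy as np

def estimate_slope(xs, ys, candidate_slopes):
    """Return the candidate slope whose residuals ys - m*xs are most tightly clustered.

    Hypothetical helper: the score (sum of sqrt of gaps between sorted
    residuals) is an assumption chosen for illustration.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    best_slope, best_score = None, np.inf
    for m in candidate_slopes:
        residuals = np.sort(ys - m * xs)
        gaps = np.diff(residuals)      # non-negative because residuals are sorted
        score = np.sqrt(gaps).sum()    # small when points bunch into few clusters
        if score < best_score:
            best_slope, best_score = m, score
    return best_slope

# Simulated data generated the same way as in the question (slope 1.2, 10 lines).
np.random.seed(42)
slope = 1.2
xslist, yslist = [], []
for offset in np.arange(10):
    size = np.random.randint(low=50, high=100)
    x = np.random.uniform(low=np.random.uniform(-5, -2),
                          high=np.random.uniform(2, 5), size=size)
    y = (slope * x + offset
         + np.random.normal(loc=0, scale=0.1, size=1)
         + np.random.normal(loc=0, scale=0.01, size=size))
    xslist.append(x)
    yslist.append(y)
xs = np.concatenate(xslist)
ys = np.concatenate(yslist)

# Assumed search range and step; a real application would need its own.
estimate = estimate_slope(xs, ys, np.arange(-10, 10, 0.01))
print(estimate, slope)
```

The grid search is brute force on purpose: the score has a sharp minimum, so a gradient-based optimizer could easily miss it without a good starting point.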
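Why a simple least squares fit doesn't seem to work
Let's generate new data where the failure of the least squares fit is more obvious. Let's use a slope of 2.4 and y-intercepts between 0 and a few hundred.
Data generation: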
import numpy as np
np.random.seed(42)
slope = 2.4
offsets = np.arange(0,500,100)
xslist=[]
yslist=[]
for offset in offsets:
    size = np.random.randint(low=50,high=100)
    xs = np.random.uniform(low=np.random.uniform(-5,-2), high=np.random.uniform(2,5),size=size)
    ys = slope * xs + offset + np.random.normal(loc=0, scale=0.1, size=1) + np.random.normal(loc=0, scale=0.01, size=size)
    xslist.append(xs)
    yslist.append(ys)
xs = np.concatenate(xslist)
ys = np.concatenate(yslist)
Least squares fit of a single line using np.polyfit():
a, b = np.polyfit(xs, ys, deg=1)
Note that I can't fit each line on its own, because I don't know which points belong to which line.
Plotting the results:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.scatter(xs,ys)
line_x = np.arange(-5,5,0.01)
line_y = a*line_x + b
plt.plot(line_x,line_y,c='r',linewidth=10)
plt.gca().set_aspect(1/8)
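That is:
The slope obtained with the least squares fit (i.e. the slope of the red line) is very different from the slope of the lines formed by the scattered points. (Note that the x and y axes have different scales.)
Printing a (our slope estimate) and the actual slope, slope: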
print(a)
print(slope)
This gives:
4.295790412452058
2.4
This error is too large for my real-world application.
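The size of the error can be traced to the pooled fit itself. For any data, the least squares slope equals cov(x, y) / var(x). Writing y = slope * x + c, where c collects each point's line offset and noise, the fitted slope is slope + cov(x, c) / var(x). Each line covers a slightly different x range, so the offsets end up correlated with x by chance, and offsets of several hundred magnify that. The sketch below checks this on the simulated data and also shows that per-line fits would recover the slope if the grouping were known. That is only possible here because the simulation keeps `xslist` and `yslist`. The names `per_line_slopes`, `pooled_slope`, `intercepts`, and `bias` are introduced for illustration.

```python
import numpy as np

# Same data generation as above (slope 2.4, five lines, offsets 0..400).
np.random.seed(42)
slope = 2.4
offsets = np.arange(0, 500, 100)
xslist, yslist = [], []
for offset in offsets:
    size = np.random.randint(low=50, high=100)
    x = np.random.uniform(low=np.random.uniform(-5, -2),
                          high=np.random.uniform(2, 5), size=size)
    y = (slope * x + offset
         + np.random.normal(loc=0, scale=0.1, size=1)
         + np.random.normal(loc=0, scale=0.01, size=size))
    xslist.append(x)
    yslist.append(y)
xs = np.concatenate(xslist)
ys = np.concatenate(yslist)

# Per-line fits: usable only because the simulation remembers the groups.
per_line_slopes = np.array([np.polyfit(x, y, deg=1)[0]
                            for x, y in zip(xslist, yslist)])

# Pooled fit over all points, as in the question.
pooled_slope = np.polyfit(xs, ys, deg=1)[0]

# Everything in ys that is not slope * xs: per-line offset plus noise.
intercepts = ys - slope * xs
bias = np.cov(xs, intercepts, bias=True)[0, 1] / np.var(xs)

print(per_line_slopes)             # each close to the true slope
print(pooled_slope, slope + bias)  # the two values agree
```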
A function to generate simulated data
As requested in the comments, here is a function that generates data similar to the examples above:
def get_data(number_of_examples):
    np.random.seed(42)
    list_of_xs = []
    list_of_ys = []
    true_slopes = []
    for _ in range(number_of_examples):
        # every example has its own slope, shared by all of its lines
        slope = np.random.uniform(low=-10, high=10)
        # random number of lines with a random, constant spacing between offsets
        offsets = np.arange(0,
                            np.random.randint(low=20, high=200),
                            np.random.randint(low=1, high=10))
        xslist=[]
        yslist=[]
        for offset in offsets:
            size = np.random.randint(low=np.random.randint(low=40, high=60),
                                     high=np.random.randint(low=80, high=100))
            xs = np.random.uniform(low=np.random.uniform(-5,-2),
                                   high=np.random.uniform(2,5),size=size)
            ys = slope * xs + offset + \
                 np.random.normal(loc=0, scale=0.1, size=1) + \
                 np.random.normal(loc=0, scale=0.01, size=size)
            xslist.append(xs)
            yslist.append(ys)
        # bring all points of this example together into single arrays
        xs = np.concatenate(xslist)
        ys = np.concatenate(yslist)
        list_of_xs.append(xs)
        list_of_ys.append(ys)
        true_slopes.append(slope)
    return list_of_xs, list_of_ys, true_slopes
Try it out and get 10 examples:
list_of_xs, list_of_ys, true_slopes = data = get_data(10)
Plot the results (the slope of the red line is what I'm trying to predict from the coordinates of the blue dots):
for xs, ys, true_slope in zip(list_of_xs, list_of_ys, true_slopes):
    plt.figure()
    plt.scatter(xs, ys)
    plt.plot(xs, xs*true_slope, c='r')
And so on.
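To compare candidate estimators against the known slopes, a small scoring helper along these lines might be useful. It is only a sketch: `slope_errors` and `polyfit_slope` are hypothetical names, and the toy data below is a stand-in with one noiseless line per example so that the block runs on its own. In practice the three lists returned by get_data would be passed in instead.

```python
import numpy as np

def slope_errors(estimator, list_of_xs, list_of_ys, true_slopes):
    """Absolute error of estimator(xs, ys) against the true slope, per example."""
    return np.array([abs(estimator(xs, ys) - true_slope)
                     for xs, ys, true_slope
                     in zip(list_of_xs, list_of_ys, true_slopes)])

def polyfit_slope(xs, ys):
    """Baseline: slope of a single least squares line through all points."""
    return np.polyfit(xs, ys, deg=1)[0]

# Toy stand-in for the output of get_data: one noiseless line per example.
rng = np.random.default_rng(0)
toy_slopes = [-3.0, 0.5, 7.0]
toy_xs = [rng.uniform(-5, 5, size=200) for _ in toy_slopes]
toy_ys = [s * x for s, x in zip(toy_slopes, toy_xs)]

errors = slope_errors(polyfit_slope, toy_xs, toy_ys, toy_slopes)
print(errors)         # per-example absolute error
print(errors.mean())  # mean absolute error across examples
```

With a single line per example the baseline is essentially exact. The interesting comparison is on multi-line examples, where the pooled fit degrades.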





