数据挖掘 - 在R中按组找到线性方程组最优解的最佳方法 - 吾爱随笔录

在R中按组找到线性方程组最优解的最佳方法

数据挖掘 r 优化

2022-03-14 12:22:48

我目前正在 R 中建模定价和折扣系统。

我的数据框如下所示：

df = structure(
  list(
    Customers = structure(
      c(1L, 1L, 1L, 2L, 2L, 2L),
      .Label = c("A", "B"),
      class = "factor"
    ),
    Products = structure(
      c(1L,
        2L, 3L, 1L, 2L, 3L),
      .Label = c("P1", "P2", "P3"),
      class = "factor"
    ),
    Old_Price = c(5, 2, 10, 7, 4, 8),
    New_Price = c(6, 3, 9,
                  6, 3, 9)
  ),
  class = "data.frame",
  row.names = c(NA,-6L)
)

有几个客户使用“旧价格”和“新价格”购买不同的产品。我现在想为每个客户确定一个折扣参数（从 -1.0 到 1.0 的实数），以最小化旧价格和新价格之间的差异。

因为我对优化等方面了解不多，所以我目前的方法是执行以下操作，这似乎非常低效，并且无论如何可能不会导致最佳解决方案：

df %>%
    mutate(Individual_Discount = (Old_Price-New_Price)/New_Price) %>% # Identify optimal discount individually
    group_by(Customers) %>%
    mutate(Optimal_Discount = mean(Individual_Discount)) # Average individual discount to get approximate discount for customer

解决这种情况的最佳方法是什么？如何在 R 中实现它？

更新：

更清楚地说明问题。有一个如下所示的数据框：

Customers | Product | Old Price | New Price | Delta  | Discount | Discounted New Price
CustA     | ProdA   | 10.00     | 12.00     |  2.00  | -0.167   | 10.00
CustA     | ProdB   | 30.00     | 25.00     | -5.00  |  0.2     | 30.00
CustB     | ProdA   | 15.00     | 12.00     | -3.00  |  0.25    | 15.00
CustB     | ProdB   | 20.00     | 25.00     |  5.00  | -0.2     | 20.00

折扣表示将旧价格和新价格之间的差异减少到零的最佳折扣（因此新价格 2 将计算为新价格 + 新价格 * 折扣）。但是每个客户只能获得一个折扣，那么我应该为每个客户选择哪个折扣以最小化剩余的增量（折扣新价格和旧价格之间的增量）？

更新2：数学关系

Delta = New_Price - Old_Price

折扣 = Delta / -New_Price

Discounted_New_Price = New_Price+New_Price*Discount

更新3：

我已经根据评论拟合了一个线性模型，但是基于分组线性模型的梯度的“线性折扣”产生的结果比我的“平均黑客”更差：

df %>%
  group_by(Customers) %>%
  do({ co <- coef(lm(Old_Price ~ New_Price, .))
       mutate(., linear_discount = co[2])
  }) %>%
  ungroup %>%
  mutate(linear_discount = 1/linear_discount-1) %>%
  mutate(linear_price = New_Price+New_Price*linear_dis

结果是

Customers | Product | Old Price | New Price | Linear Discount  | Linear Price | Discounted New Price
CustA     | Prod1   | 05.00     | 06.00     |  -0.25           | 4.50
CustA     | Prod2   | 02.00     | 03.00     |  -0.25           | 2.25
CustA     | Prod3   | 10.00     | 09.00     |  -0.25           | 6.75
CustB     | Prod1   | 07.00     | 06.00     |   0.50           | 9.00
...

1个回答

您可以使用不带截距的线性回归来完成此操作。

如果我理解正确，您在 Update 3 中的线性回归实现有两个问题：首先，您使用截距拟合线性模型，然后在应用模型时放弃截距。二、下面一行有错别字

linear_discount = 1/linear_discount-1

在下面的代码片段中，我通过两种方法扩展了您的分析：不带截距的线性回归和带截距的线性回归。前一种情况应该完全符合您的要求，即每个客户使用单一的相对折扣率。后一种对应于放宽假设的情况，即一种相对贴现率和一种绝对贴现率（与价格水平无关）。

请注意，对于您提供的说明性数据集，最佳（根据均方误差）贴现率将为零。

df_ext <- df %>%
  # mean hack  
  mutate(Discount_indiv = (Old_Price-New_Price)/New_Price) %>% # Identify optimal discount individually
  group_by(Customers) %>%
  mutate(Discount_mean = mean(Discount_indiv),
         Old_Price_mean = New_Price + New_Price*Discount_mean) %>% # Average individual discount to get approximate discount for customer

  # incorrect linear regression (fitted with intercept, then intercept dropped)  
  do({ co <- coef(lm(Old_Price ~ New_Price, .))
  mutate(., Discount_wrong = co[2])
  }) %>%
  ungroup %>%
  mutate(Discount_wrong = 1 / Discount_wrong-1) %>%
  mutate(Old_Price_wrong = New_Price+New_Price*Discount_wrong) %>%

  # correct linear regression w/o intercept
  group_by(Customers) %>%
  mutate(Discount_regr = coef(lm(Old_Price ~ New_Price -1))[1] - 1,
         Old_Price_regr1 = New_Price+New_Price*Discount_regr) %>% 

  # correct linear regression with intercept (i.e. two discounts per customer)
  mutate(
    Discount_abs = coef(lm(Old_Price ~ New_Price))[1],
    Discount_rel = coef(lm(Old_Price ~ New_Price))[2] - 1,
    Old_Price_regr2 = New_Price + New_Price*Discount_rel + Discount_abs) %>% 

  #calculate residuals
  ungroup() %>%
  mutate(Resid_mean = Old_Price_mean - Old_Price,
         Resid_wrong = Old_Price_wrong - Old_Price,
         Resid_regr1 = Old_Price_regr1 - Old_Price,
         Resid_regr2 = Old_Price_regr2 - Old_Price)


#transform data for visualisation
df_gat <- select(df_ext, matches("Customers|Price")) %>% 
  gather(key="Approach", value="Old_Price", -Customers, -New_Price)

ggplot(df_gat, aes(x=New_Price, y=Old_Price, group=Approach, colour=Approach, shape=Approach)) + 
  geom_line() + geom_point(size=3) + facet_wrap(~Customers, ncol = 1)

select(df_ext, matches("Customers|Disc"))
select(df_ext, matches("Customers|Price"))

#calculate mean square error for all approaches
select(df_ext, matches("Customers|Res")) %>% 
  group_by(Customers) %>%
  summarise_all(~mean(.^2))

所有方法的结果：客户的 MSE 和一些可视化

  Customers Resid_mean Resid_wrong Resid_regr1 Resid_regr2
  <fct>          <dbl>       <dbl>       <dbl>       <dbl>
1 A               1.71        3.62       1.          0.222
2 B               1.71       11.5        1.000       0.222

其它你可能感兴趣的问题

上一篇为什么替代假设的曲线与原假设的曲线相比发生了偏移？下一篇神经网络中的网络大小