Kelly criterion in an xgboost loss function

data-mining python optimization xgboost

2022-03-13 06:32:08

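I have a model that predicts the outcome of ATP tennis matches. The quality of its predictions varies, so I would like to develop a second, binary classification model that uses some features of each match to optimise the decision to bet (or not to bet). The features of the second model are the probabilities produced by the first model and the archived bookmaker odds for each match. The size of each bet is determined by the Kelly criterion. The training data has been labelled so that every bet that would have won = 1 and every bet that would have lost = 0. I am using xgboost.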

I am trying to incorporate the Kelly criterion into my xgb loss function, without success.

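I have looked at the custom objective example in the xgb demos. As I understand it, for xgb to maximise the expected value of the logarithm of the bankroll, my objective function needs to return the first derivative: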

$\frac{\partial}{\partial x}\left(p\,\log(1 + bx) + (1-p)\,\log(1-x)\right) = \frac{-(b+1)\,p + b\,x + 1}{(x-1)(b\,x+1)}$

and the second derivative:

$\frac{\partial^2}{\partial x^2}\left(p\,\log(1 + bx) + (1-p)\,\log(1-x)\right) = -\frac{b^{2}p}{(bx+1)^{2}} - \frac{1-p}{(1-x)^{2}}$

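where:

b is the net odds received on the bet ("b to 1"); that is, a winning $1 bet pays $b in profit, on top of the returned $1 stake;
p is the probability of winning;
x is the fraction of the bankroll that is staked.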
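As a sanity check on the two derivatives above, the following sketch compares them with finite differences and confirms that the first derivative vanishes at the Kelly fraction, the same expression the evaluation code uses for `kelly_fraction`. The values of `p`, `b`, and `x` are illustrative assumptions, not taken from the data set, and the function names are made up for this example.

```python
import math


def log_growth(x, p, b):
    """Expected log-bankroll growth when staking fraction x at net odds b."""
    return p * math.log(1 + b * x) + (1 - p) * math.log(1 - x)


def first_derivative(x, p, b):
    """Closed-form first derivative of log_growth with respect to x."""
    return (-(b + 1) * p + b * x + 1) / ((x - 1) * (b * x + 1))


def second_derivative(x, p, b):
    """Closed-form second derivative of log_growth with respect to x."""
    return -(b ** 2) * p / (b * x + 1) ** 2 - (1 - p) / (1 - x) ** 2


def kelly_fraction(p, b):
    """Stake that sets the first derivative to zero."""
    return (p * (b + 1) - 1) / b


# Illustrative values only (assumptions for this check).
p, b, x, h = 0.6, 1.5, 0.2, 1e-4

# Central finite differences of log_growth around x.
numeric_first = (log_growth(x + h, p, b) - log_growth(x - h, p, b)) / (2 * h)
numeric_second = (
    log_growth(x + h, p, b) - 2 * log_growth(x, p, b) + log_growth(x - h, p, b)
) / h ** 2

x_star = kelly_fraction(p, b)

print("first derivative :", numeric_first, first_derivative(x, p, b))
print("second derivative:", numeric_second, second_derivative(x, p, b))
print("Kelly fraction   :", x_star, "slope there:", first_derivative(x_star, p, b))
```

The second derivative is negative everywhere on 0 <= x < 1, so the expression is concave in x and the Kelly fraction is its maximum. Any loss handed to a minimiser therefore has to be the negative of this expression.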

I have also implemented my own cost function to calculate the profit or loss of each bet.

My code so far is below.

import io

import numpy as np
import pandas as pd
import requests
import xgboost as xgb


url_train = 'https://gist.githubusercontent.com/martinstaniforth/162b9691132f7099b4da08fd14defc39/raw/9372c5cac42b545ecde4200503b97f895e24cbfe/train.csv'
url_test = 'https://gist.githubusercontent.com/martinstaniforth/4445b884abea22d4ae238cda869b5e0e/raw/84d3dead9122a644f8d273e04b898920a8e5a811/test.csv'

# Download the raw CSV bytes for both data sets.
train_content = requests.get(url_train).content
test_content = requests.get(url_test).content

train_df = pd.read_csv(io.StringIO(train_content.decode('utf-8')), index_col='match_id')
test_df = pd.read_csv(io.StringIO(test_content.decode('utf-8')), index_col='match_id')

train_target_df = train_df.reset_index(drop=True)[['bet_wins']]
train_df = train_df.reset_index(drop=True).drop(['bet_wins'], axis=1)

test_target_df = test_df.reset_index(drop=True)[['bet_wins']]
test_df = test_df.reset_index(drop=True).drop(['bet_wins'], axis=1)

odds_train = train_df['player_odds'].values - 1
probs_train = train_df['win_prob'].values

odds_test = test_df['player_odds'].values - 1
probs_test = test_df['win_prob'].values

dtrain = xgb.DMatrix(train_df.values, train_target_df.values)
dtest = xgb.DMatrix(test_df.values, test_target_df.values)

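# The number of boosting rounds comes from num_round below;
# xgb.train() does not use 'n_estimators', so it is left out here.
param = {
    'max_depth': 3,
    'eta': 0.05,
    'verbosity': 0,  # replaces the deprecated 'silent' flag
    'seed': 366}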
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 200


def kelly_loss(odds_train, probs_train, odds_test, probs_test):
    def logregobj(x, dmatrix):
        bet_outcome = dmatrix.get_label()

        odds = odds_train if len(bet_outcome) == len(odds_train) else odds_test
        probs = probs_train if len(bet_outcome) == len(probs_train) else probs_test

        y = -((odds + 1) * probs + odds * x + 1) / ((x - 1) * (odds * x + 1))
        grad = y - bet_outcome
        hess = -(np.power(odds, 2) * probs) / np.power(odds * x + 1, 2) - \
                (1 - probs) / np.power(1 - x, 2)

        return grad, hess

    return logregobj


def kelly_error(odds_train, probs_train, odds_test, probs_test):
    def evalerror(preds, dmatrix):
        bet_outcome = dmatrix.get_label()

        odds = odds_train if len(bet_outcome) == len(odds_train) else odds_test
        probs = probs_train if len(bet_outcome) == len(probs_train) else probs_test

        kelly_fraction = (probs * (odds + 1) - 1) / odds

        def value_bets(f):  # ignore any bets with a negative kelly fraction
            return 0 if f < 0 else f

        kelly_fraction = np.array([value_bets(x) for x in kelly_fraction])

        profit = preds * kelly_fraction * odds
        loss = (1 - preds) * kelly_fraction

        total_profit = float(sum(profit - loss))

        return 'error', total_profit

    return evalerror

bst = xgb.train(
    param,
    dtrain,
    num_boost_round=num_round,
    evals=watchlist,
    obj=kelly_loss(odds_train, probs_train, odds_test, probs_test),
    # newer xgboost releases name this argument custom_metric
    feval=kelly_error(odds_train, probs_train, odds_test, probs_test))

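When I run the code, xgb does not update its predictions. I suspect the logregobj function is incorrect, because I do not yet fully understand what it is for. Can anyone help with implementing the Kelly criterion correctly in a binary classification model? The training data is available in the gists referenced in the code.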

Thanks in advance.

2 Answers

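I have solved this problem to some extent. xgb is able to optimise the problem without a change to the objective function. I still do not understand when I should pass my own function as an argument, but it does not appear to be necessary.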

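Below I have fixed some errors in my kelly_error function, including changing the code to optimise ROI. You also need to tell xgb to maximise the output (the default is to minimise).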

def kelly_error(odds_train, probs_train, odds_test, probs_test):
    def evalerror(preds, dmatrix):
        bet_outcome = dmatrix.get_label()

        odds = odds_train if len(bet_outcome) == len(odds_train) else odds_test
        probs = probs_train if len(bet_outcome) == len(probs_train) else probs_test

        kelly_fraction = (probs * (odds + 1) - 1) / odds

        def value_bets(f):  # ignore any bets with a negative kelly fraction
            return 0 if f < 0 else f

        kelly_fraction = np.array([value_bets(x) for x in kelly_fraction])

        profit = preds * bet_outcome * kelly_fraction * odds
        loss = preds * (1 - bet_outcome) * kelly_fraction

        stake = float(sum(preds * kelly_fraction))
        total_profit = float(sum(profit - loss))

        roi = 100 * total_profit / stake
        return 'roi', roi

    return evalerror
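To make the ROI arithmetic in evalerror easier to follow, here is a vectorised sketch of the same calculation as a standalone function, run on three made-up bets. The name `kelly_roi`, the sample numbers, and the zero-stake guard are assumptions added for illustration; they are not part of the answer's code.

```python
import numpy as np


def kelly_roi(preds, bet_outcome, odds, probs):
    """ROI (in percent) of Kelly-sized bets weighted by the model output.

    odds are net odds (decimal odds minus 1), probs are win probabilities.
    """
    # Kelly fraction per bet; negative fractions mean "no value", so stake 0.
    kelly = np.maximum((probs * (odds + 1) - 1) / odds, 0.0)
    stake = preds * kelly
    # A winning bet returns stake * odds, a losing bet costs the stake.
    pnl = stake * np.where(bet_outcome == 1, odds, -1.0)
    total_stake = stake.sum()
    # Guard (an addition in this sketch): no bets placed means no ROI.
    if total_stake == 0:
        return 0.0
    return 100.0 * pnl.sum() / total_stake


# Three hypothetical bets: the model backs the first two and skips the third.
preds = np.array([1.0, 1.0, 0.0])
bet_outcome = np.array([1.0, 0.0, 1.0])
odds = np.array([1.0, 2.0, 0.5])
probs = np.array([0.6, 0.5, 0.5])

roi = kelly_roi(preds, bet_outcome, odds, probs)
print(roi)
```

Here the Kelly stakes are 0.2, 0.25, and 0 (the third fraction is negative and is clipped). The first bet wins 0.2, the second loses 0.25, so the result is -0.05 on 0.45 staked, an ROI of about -11.1%.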

When you only change the eval function, XGB does not optimise against that function. XGB uses it only to report model performance, or for things such as early stopping.

If you really want to optimise for a specific function, you need to implement it as the objective function.
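One possible shape for such an objective is sketched below. It is not the asker's method and rests on several assumptions: the loss is the negative realised log growth, using the 0/1 label in place of the probability p; the raw model output is passed through a sigmoid so that the stake fraction stays between 0 and 1; and the Hessian is clipped to stay positive, which is a heuristic. The names `make_kelly_objective`, `negative_log_growth`, and `LabelsOnly` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def negative_log_growth(raw_preds, outcome, odds):
    """Per-bet loss: minus the realised log-bankroll growth for stake sigmoid(raw_preds)."""
    f = sigmoid(raw_preds)
    return -(outcome * np.log1p(odds * f) + (1.0 - outcome) * np.log1p(-f))


def make_kelly_objective(odds):
    """Build an objective with the (preds, dmatrix) -> (grad, hess) signature.

    odds must be the net odds of the training rows, in the same order.
    """
    odds = np.asarray(odds, dtype=float)

    def kelly_objective(raw_preds, dmatrix):
        outcome = dmatrix.get_label()
        f = sigmoid(raw_preds)            # stake fraction in (0, 1)
        df = f * (1.0 - f)                # df/dz
        d2f = df * (1.0 - 2.0 * f)        # d2f/dz2
        # Derivatives of the loss with respect to the stake fraction f.
        dl = -(outcome * odds / (1.0 + odds * f) - (1.0 - outcome) / (1.0 - f))
        d2l = outcome * odds ** 2 / (1.0 + odds * f) ** 2 + (1.0 - outcome) / (1.0 - f) ** 2
        # Chain rule back to the raw prediction z.
        grad = dl * df
        hess = d2l * df ** 2 + dl * d2f
        # The exact Hessian can be negative for winning bets; clip it (heuristic).
        return grad, np.maximum(hess, 1e-6)

    return kelly_objective


class LabelsOnly:
    """Stand-in for xgb.DMatrix so the check below runs without xgboost."""

    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype=float)

    def get_label(self):
        return self._labels


# Finite-difference check of the gradient on made-up values.
z = np.array([-1.0, 0.0, 0.5, 2.0])
outcome = np.array([1.0, 0.0, 1.0, 0.0])
net_odds = np.array([0.8, 1.5, 2.0, 0.4])
grad, hess = make_kelly_objective(net_odds)(z, LabelsOnly(outcome))
step = 1e-5
numeric_grad = (
    negative_log_growth(z + step, outcome, net_odds)
    - negative_log_growth(z - step, outcome, net_odds)
) / (2 * step)
print(np.max(np.abs(grad - numeric_grad)))
```

With xgboost installed, the closure would be passed as `obj=make_kelly_objective(odds_train)` to `xgb.train`. The objective is only ever evaluated on the training matrix, so only the training odds are needed. With a custom objective the booster predicts raw margins, so the stake fraction is recovered by applying the sigmoid to the predictions.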
