Data Mining - What needs to be done to make n_jobs work properly in sklearn? In particular on ElasticNetCV? - 吾爱随笔录


Tags: data-mining, machine-learning, scikit-learn, cross-validation, parallel, elastic-net

2021-09-17 11:57:03

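The constructor of sklearn.linear_model.ElasticNetCV takes n_jobs as an argument. Quoting the documentation here:

n_jobs: int, default=None

Number of CPUs to use during the cross validation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

However, running the simple program below on my 4-core machine (spec details below) shows that performance is best with n_jobs = None and deteriorates progressively as n_jobs increases, all the way to n_jobs = -1 (which supposedly requests all cores).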

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    # ------------- Setup  X, y -------------
    IMPORTANT_K = 10
    TOTAL_K = 30
    SAMPLE_N = 1000
    ERROR_VOL = 1

    np.random.seed(0)
    important_X = np.random.rand(SAMPLE_N, IMPORTANT_K)
    other_X = np.random.rand(SAMPLE_N, TOTAL_K - IMPORTANT_K)
    actual_coefs = np.linspace(0.1, 1, IMPORTANT_K)
    noise = np.random.rand(SAMPLE_N)
    y = important_X @ actual_coefs + noise * ERROR_VOL
    total_X = np.concatenate((important_X, other_X), axis=1)

    # ---------------- Setup ElasticNetCV -----------------
    LASSO_RATIOS = np.linspace(0.01, 1, 10)
    CV = 10

    def enet_fit(X, y, n_jobs: int):
        enet_cv = ElasticNetCV(l1_ratio=LASSO_RATIOS, fit_intercept=True,
                               cv=CV, n_jobs=n_jobs)
        enet_cv.fit(X=X, y=y)
        return enet_cv

    # ------------------- n_jobs test --------------
    N_JOBS = [None, 1, 2, 3, 4, -1]
    import time
    for n_jobs in N_JOBS:
        start = time.perf_counter()
        enet_cv = enet_fit(X=total_X, y=y, n_jobs=n_jobs)
        print('n_jobs = {}, perf_counter = {}'.format(n_jobs, time.perf_counter() - start))

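What needs to be done to make this work as expected?

Some people seem to think this is broken on Windows, as described here. Indeed, I am running Windows 10 on an Intel i7-7700 4.2GHz machine, but unfortunately the question that link points to has no comments or answers. Also unfortunately, I do not have access to a Unix machine to check whether this works on Unix.

Some people here say that running from an interactive session is a problem. What I observe is that the program above performs the same way whether it is run in Jupyter Lab or as a script from the terminal.

I would also add that I have run this on several version combinations of sklearn and joblib, but could not resolve the problem that way.

Below is one of these combinations. It was put together by conda env create with only the python version specified, python==3.6.10, thus allowing conda to install the versions it considers most suitable. I also include the installed numpy, mkl and scipy dependencies.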

intel-openmp              2020.1                      216
joblib                    0.14.1                     py_0
mkl                       2020.1                      216
mkl-service               2.3.0            py36hb782905_0
mkl_fft                   1.0.15           py36h14836fe_0
mkl_random                1.1.0            py36h675688f_0
numpy                     1.18.1           py36h93ca92e_0
numpy-base                1.18.1           py36hc3f5095_1
python                    3.6.10               h9f7ef89_2
scikit-learn              0.22.1           py36h6288b17_0
scipy                     1.4.1            py36h9439919_0

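Questions

Does the simple program above give you better performance with increasing n_jobs when you run it?

On what OS / setup?

Is there anything that has to be tweaked for this to work properly?

Many thanks for any help.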

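3 Answers

When I ran your script I had the same impression, that n_jobs is hurting performance. However, you have to consider that parallelizing cross-validation only pays off when you have more data samples. With little data, the communication overhead is indeed more expensive than the processing cost of the tasks themselves.

I tried your script with more samples, SAMPLE_N = 100000, and got the following results. Setup: macOS, i5, 8 GB.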

n_jobs = None, perf_counter = 21.605680438
n_jobs = 1, perf_counter = 22.555127251
n_jobs = 2, perf_counter = 15.129894496000006
n_jobs = 3, perf_counter = 11.280528144999998
n_jobs = 4, perf_counter = 13.050180971000003
n_jobs = -1, perf_counter = 20.031103849000004
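
To make these timings easier to compare, the speedup relative to the n_jobs = None run can be computed directly from the numbers above. This is only arithmetic on the reported figures; the variable names are illustrative.

```python
# Timings (seconds) copied from the results above, SAMPLE_N = 100000.
timings = {
    None: 21.605680438,
    1: 22.555127251,
    2: 15.129894496000006,
    3: 11.280528144999998,
    4: 13.050180971000003,
    -1: 20.031103849000004,
}

baseline = timings[None]  # n_jobs = None runs sequentially

# A speedup above 1 means faster than the sequential baseline.
speedups = {n_jobs: baseline / seconds for n_jobs, seconds in timings.items()}

for n_jobs, speedup in speedups.items():
    print('n_jobs = {}, speedup = {:.2f}x'.format(n_jobs, speedup))

best_n_jobs = max(speedups, key=speedups.get)
print('best n_jobs:', best_n_jobs)
```

On these figures the best setting is n_jobs = 3, at a little under 2x, while n_jobs = -1 is barely faster than the sequential run.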

I will try to answer based on my experience and understanding of parallel computing when putting DS/ML models into production.

To answer your questions at a high level:

Does the simple program above give you better performance with increasing n_jobs when you run it? Answer: yes, as can be seen in the results below.
On what OS / setup? Answer: OS: Ubuntu, 2 CPUs x 16 cores + 512 GB RAM, with python=3.7, joblib>=0.14.1 and sklearn>=0.22.1.
Is there anything that has to be tweaked for this to work properly? Yes: change/force the parallel_backend to something other than sequential (this requires the joblib approach with a registered parallel_backend; you can use sklearn.utils.parallel_backend). I took the sklearn model you have, which runs sequentially even with n_jobs=-1, into joblib Parallel and got huge scaling. I still need to check the results for correctness, but I did see a huge improvement when scaling to 100 million samples on my machine, so it is worth testing; I was amazed by the performance with a predefined backend.

My conda setup:

scikit-image              0.16.2           py37h0573a6f_0  
scikit-learn              0.22.1           py37hd81dba3_0  
ipython                   7.12.0           py37h5ca1d4c_0  
ipython_genutils          0.2.0                    py37_0  
msgpack-python            0.6.1            py37hfd86e86_1  
python                    3.7.6                h357f687_2    conda-forge
python-dateutil           2.8.1                      py_0  
python_abi                3.7                     1_cp37m    conda-forge
joblib                    0.14.1                     py_0

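If you are using a personal machine or workstation, try to reserve 1 core for the machine itself with n_jobs=-2. You can also increase the data, since large data is what joblib is optimized for (not all algorithms support this approach, but that is out of scope here), and change the backend, because by default the tasks are not executed in parallel and only the sequential backend is used. Perhaps an automatic "mode" kicks in with more data, but I am not sure what it would be based on: I tested with 1k, 10k, 100k, 1 million and 10 million samples, and without the loky backend ElasticNetCV never left the sequential backend.

Joblib is optimized to be fast and robust on large data in particular, and has specific optimizations for numpy arrays.

As an explanation, let us look at how the resources are calculated:

For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. None is a marker for "unset" that will be interpreted as n_jobs=1 (sequential execution).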
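
The quoted rule can be written out as a few lines of Python. The helper below is hypothetical (the name resolve_n_jobs is illustrative and this is not joblib's own code); it only mirrors the rule as quoted, with two assumptions marked in the comments.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Hypothetical helper mirroring the quoted rule; not joblib's own code."""
    if n_jobs is None:
        return 1  # "unset" is interpreted as sequential execution
    if n_jobs == 0:
        # Assumption: zero workers is treated as an invalid request.
        raise ValueError('n_jobs == 0 has no meaning')
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        # Assumption: never resolve to fewer than 1 worker.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for n_jobs in [None, 1, 2, -1, -2]:
    workers = resolve_n_jobs(n_jobs, n_cpus=4)
    print('n_jobs = {} -> {} worker(s) on 4 CPUs'.format(n_jobs, workers))
```

With 32 CPUs the same rule gives 32 workers for n_jobs = -1 and 31 for n_jobs = -2, which matches the LokyBackend worker counts in the verbose log further down.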

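Your code performs poorly with n_jobs=-1, so try n_jobs=-2, for the following reasons:

n_jobs=-1 does use all CPU cores (according to the documentation), but you can change it to use threads by registering a parallel_backend from joblib on your machine. Using every core is slow and degrades performance if other processes are also using CPU threads/cores, and that is what is happening in your case (you are running the OS and other processes that need CPU power to run). It is also not taking full advantage of "threads", so, judging by your performance issue, it is using "cores".

As an example, you would use "n_jobs=-1" in cluster mode, where the workers, as containers, already have cores allocated and will take advantage of the parallel approach to distribute the optimization or computation part.

In this case you starve the CPU of resources. Do not forget that parallelism is not "cheap": the same data is copied for every "job", so you get all the "allocations" at the same time.
The sklearn parallel implementation is not perfect, so in your case I would try n_jobs=-2; or, if you want to use joblib directly, you have more room to optimize the algorithm. Your CV part is where all the performance degradation happens, because that is the part that gets parallelized.

I will add the following from the joblib documentation, to give a better understanding of your case, its limitations and the differences between backends: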

backend: str, ParallelBackendBase instance or None, default: ‘loky’

    Specify the parallelization backend implementation. Supported backends are:

        “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
        “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
        “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy).
        finally, you can register backends by calling register_parallel_backend. This will allow you to implement a backend of your liking.
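
One small way to see the difference between the thread-based and process-based backends described above is to ask each worker for its process id. This is a sketch that assumes joblib is installed. In joblib, prefer="threads" is only a soft hint, so a surrounding parallel_backend context takes precedence over it.

```python
import os
from joblib import Parallel, delayed, parallel_backend

parent_pid = os.getpid()

# Thread workers live inside the parent process, so they share its pid (and its GIL).
thread_pids = Parallel(n_jobs=2, prefer="threads")(
    delayed(os.getpid)() for _ in range(4)
)

# Inside this context the process-based loky backend is used instead,
# even though the Parallel call still carries the prefer="threads" hint.
with parallel_backend('loky'):
    process_pids = Parallel(n_jobs=2, prefer="threads")(
        delayed(os.getpid)() for _ in range(4)
    )

print('parent pid  :', parent_pid)
print('thread pids :', sorted(set(thread_pids)))
print('process pids:', sorted(set(process_pids)))
```

The thread workers all report the parent's pid, while the loky workers report different pids, since each one is a separate Python process with its own interpreter.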

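From the implementation in the source code, I can see that sklearn does use cores, or at least prefers them, and that threads are not the default for all algorithms: _joblib.py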

    import warnings as _warnings

    with _warnings.catch_warnings():
        _warnings.simplefilter("ignore")
        # joblib imports may raise DeprecationWarning on certain Python
        # versions
        import joblib
        from joblib import logger
        from joblib import dump, load
        from joblib import __version__
        from joblib import effective_n_jobs
        from joblib import hash
        from joblib import cpu_count, Parallel, Memory, delayed
        from joblib import parallel_backend, register_parallel_backend


    __all__ = ["parallel_backend", "register_parallel_backend", "cpu_count",
               "Parallel", "Memory", "delayed", "effective_n_jobs", "hash",
               "logger", "dump", "load", "joblib", "__version__"]

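However, the CV part of the Elastic Net model you list does use "threads" as the preference (_joblib_parallel_args(prefer="threads")), and there seems to be a bug on Windows where only cores are considered: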

    mse_paths = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                         **_joblib_parallel_args(prefer="threads"))(jobs)

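Note: this answer comes from day-to-day work experience using joblib with Spark through parallel_backend('spark'), and with parallel_backend('dask'). These scale as expected and run fast, but do not forget that each executor I have basically has 4 cores and 4-32 GB of RAM, so with n_jobs=-1 part of the joblib tasks really are executed in parallel inside each executor, and the copying of the same data goes unnoticed because the work is distributed.

The CV and fit parts run perfectly, and I use n_jobs=-1 when executing the fit or CV parts.

My results with the OP's default settings:

# Without tracking/progress (this executes faster, but progress output had to be added for clarity):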

    n_jobs = None, perf_counter = 1.4849148329813033 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 1, perf_counter = 1.4728297910187393 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 2, perf_counter = 1.470994730014354 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 4, perf_counter = 1.490676686167717 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 8, perf_counter = 1.465600558090955 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 12, perf_counter = 1.463360101915896 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 16, perf_counter = 1.4638906640466303 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 20, perf_counter = 1.4602260519750416 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 24, perf_counter = 1.4646347570233047 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 28, perf_counter = 1.4710926250554621 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = -1, perf_counter = 1.468439529882744 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = -2, perf_counter = 1.4649679311551154 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)

# With tracking/progress and verbose details, added for clarity:

0%|          | 0/12 [00:00<?, ?it/s][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
  8%|▊         | 1/12 [00:02<00:31,  2.88s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = None, perf_counter = 2.8790326060261577
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 17%|█▋        | 2/12 [00:05<00:28,  2.87s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 1, perf_counter = 2.83856769092381
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 25%|██▌       | 3/12 [00:08<00:25,  2.85s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 2, perf_counter = 2.8207667160313576
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 33%|███▎      | 4/12 [00:11<00:22,  2.84s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 4, perf_counter = 2.8043343869503587
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.7s finished
 42%|████▏     | 5/12 [00:14<00:19,  2.81s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 8, perf_counter = 2.730375789105892
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
 50%|█████     | 6/12 [00:16<00:16,  2.82s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 12, perf_counter = 2.8604282720480114
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 58%|█████▊    | 7/12 [00:19<00:14,  2.83s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 16, perf_counter = 2.847634136909619
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 67%|██████▋   | 8/12 [00:22<00:11,  2.84s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 20, perf_counter = 2.8461739809717983
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
 75%|███████▌  | 9/12 [00:25<00:08,  2.85s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 24, perf_counter = 2.8684673600364476
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
 83%|████████▎ | 10/12 [00:28<00:05,  2.87s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 28, perf_counter = 2.9122865139506757
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.1s finished
 92%|█████████▏| 11/12 [00:31<00:02,  2.94s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = -1, perf_counter = 3.1204342890996486
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.3s finished
100%|██████████| 12/12 [00:34<00:00,  2.91s/it]
n_jobs = -2, perf_counter = 3.347235122928396

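The magic starts here:

So I would say this actually is a bug: even when n_jobs is specified, it does not take effect and the fit still runs as "None" or "1". The small differences in time are probably due to the caching of results using joblib.Memory and Checkpoint, but I need to look more closely at this part of the source code (I would bet on it, since performing CV would otherwise be expensive).

For reference, here are the results of using joblib and executing the parallel part with a parallel_backend. Specifying parallel_backend('loky') sets the default backend that Parallel uses inside the with block, instead of the 'auto' mode:

# Without tracking/progress (this executes faster, but progress output had to be added for clarity):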

n_jobs = None, perf_counter = 1.7306506633758545, sec
n_jobs = 1, perf_counter = 1.7046034336090088, sec
n_jobs = 2, perf_counter = 2.1097865104675293, sec
n_jobs = 4, perf_counter = 1.4976494312286377, sec
n_jobs = 8, perf_counter = 1.380277156829834, sec
n_jobs = 12, perf_counter = 1.3992164134979248, sec
n_jobs = 16, perf_counter = 0.7542541027069092, sec
n_jobs = 20, perf_counter = 1.9196209907531738, sec
n_jobs = 24, perf_counter = 0.6940577030181885, sec
n_jobs = 28, perf_counter = 0.780998945236206, sec
n_jobs = -1, perf_counter = 0.7055854797363281, sec
n_jobs = -2, perf_counter = 0.4049191474914551, sec
Completed

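The output below explains all the limitations you are running into (your expectation of parallelism versus how parallelism is actually done in the sklearn algorithm), and more generally what is being executed and how the workers are allocated:

# With tracking/progress and verbose details, added for clarity: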

0%|          | 0/12 [00:00<?, ?it/s][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
    [Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.4s finished
8%|▊         | 1/12 [00:03<00:37,  3.44s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
        ......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

n_jobs = None, perf_counter = 3.4446191787719727, sec

 [Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.5s finished
 17%|█▋        | 2/12 [00:06<00:34,  3.45s/it]

n_jobs = 1, perf_counter = 3.460832357406616, sec

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    2.0s finished
 25%|██▌       | 3/12 [00:09<00:27,  3.09s/it]

n_jobs = 2, perf_counter = 2.2389445304870605, sec

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    1.7s finished
 33%|███▎      | 4/12 [00:10<00:21,  2.71s/it]

n_jobs = 4, perf_counter = 1.8393192291259766, sec

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    1.3s finished
 42%|████▏     | 5/12 [00:12<00:16,  2.36s/it]

n_jobs = 8, perf_counter = 1.517085075378418, sec

[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=12)]: Done  77 out of 100 | elapsed:    1.5s remaining:    0.4s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    1.6s finished
 50%|█████     | 6/12 [00:14<00:13,  2.17s/it]

n_jobs = 12, perf_counter = 1.7410166263580322, sec

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.1s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.7s finished
 58%|█████▊    | 7/12 [00:15<00:09,  1.81s/it]

n_jobs = 16, perf_counter = 0.9577205181121826, sec

[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    1.6s
[Parallel(n_jobs=20)]: Done 100 out of 100 | elapsed:    1.9s finished
 67%|██████▋   | 8/12 [00:17<00:07,  1.88s/it]

n_jobs = 20, perf_counter = 2.0630648136138916, sec

[Parallel(n_jobs=24)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.5s finished
 75%|███████▌  | 9/12 [00:18<00:04,  1.55s/it]

n_jobs = 24, perf_counter = 0.7588121891021729, sec

[Parallel(n_jobs=28)]: Using backend LokyBackend with 28 concurrent workers.
[Parallel(n_jobs=28)]: Done 100 out of 100 | elapsed:    0.6s finished
 83%|████████▎ | 10/12 [00:18<00:02,  1.34s/it]

n_jobs = 28, perf_counter = 0.8542406558990479, sec

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.7s finished
 92%|█████████▏| 11/12 [00:19<00:01,  1.21s/it][Parallel(n_jobs=-2)]: Using backend LokyBackend with 31 concurrent workers.

n_jobs = -1, perf_counter = 0.8903687000274658, sec

[Parallel(n_jobs=-2)]: Done 100 out of 100 | elapsed:    0.5s finished
100%|██████████| 12/12 [00:20<00:00,  1.69s/it]

n_jobs = -2, perf_counter = 0.544947624206543, sec

# Here I show what is happening behind the scenes, to help understand the differences in times and to explain the 'None' vs '1' execution time (it is all about the pickling process and the Memory caching implementation for parallel execution).
[Parallel(n_jobs=-2)]: Done  71 out of 100 | elapsed:    0.9s remaining:    0.4s
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
Pickling array (shape=(1000, 30), dtype=float64).
Pickling array (shape=(1000,), dtype=float64).
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
Pickling array (shape=(1000, 30), dtype=float64).
Pickling array (shape=(1000,), dtype=float64).
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
Pickling array (shape=(1000, 30), dtype=float64).
Pickling array (shape=(1000,), dtype=float64).
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
[Parallel(n_jobs=-2)]: Done  73 out of 100 | elapsed:    0.9s remaining:    0.3s
Completed

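References: n_jobs in sklearn; Parallelism in sklearn; Thread-based parallelism vs process-based parallelism

After profiling and stepping through sklearn's code, I have some answers.

Summary:

Contrary to what has been suggested, the poor scalability of sklearn's ElasticNetCV() with n_jobs is not due to:

The overhead of launching threads or processes.
SequentialBackend always being used regardless of n_jobs. (I could not reproduce this issue as described in n1tk's answer, but I can confirm that SequentialBackend is used with n_jobs = 1 regardless of the actual contextual backend. This seems like reasonable behavior rather than a bug.)

The actual problem is that sklearn's ElasticNetCV uses joblib's threading parallel_backend by default, while the tasks it sends to joblib's Parallel() for parallelization spend most of their time holding the GIL.

Clearly, this is counterproductive.

The above is true for the OP's problem size of n=1000 and k=30. Once n or k becomes very large, the coordinate descent algorithm (the only code that runs outside the GIL) starts to take up enough cycles for threading to reduce execution time relative to sequential.

This means that for problems that take about a second to solve, sklearn's out-of-the-box parallelism does not help. If you have 10,000 such problems to solve at any one time, you will have to find another parallel solution or wait the better part of three hours between runs.

This situation can be improved significantly in a variety of ways, with varying degrees of effort versus payoff.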
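
The GIL effect described in this summary can be reproduced with the standard library alone. The sketch below is not sklearn code: gil_bound_task is a hypothetical stand-in for work that holds the GIL, and the sizes are arbitrary. On a standard CPython build (with the GIL) the threaded run is typically no faster than the sequential one.

```python
import threading
import time


def gil_bound_task(n):
    """Pure-Python loop: it holds the GIL for its whole duration."""
    total = 0
    for i in range(n):
        total += i * i
    return total


N = 200_000
TASKS = 8

# Sequential baseline.
start = time.perf_counter()
sequential_results = [gil_bound_task(N) for _ in range(TASKS)]
sequential_time = time.perf_counter() - start

# Same tasks, one thread each; the threads take turns holding the GIL.
threaded_results = [None] * TASKS


def worker(slot):
    threaded_results[slot] = gil_bound_task(N)


start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(slot,)) for slot in range(TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.perf_counter() - start

print('sequential: {:.3f}s, threaded: {:.3f}s'.format(sequential_time, threaded_time))
```

Both runs compute the same results; only the wall-clock time is of interest, and it will vary from machine to machine.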

Details

The ElasticNetCV call stack looks like this:

ElasticNetCV.fit()
    Parallel(...)
        _path_residuals(...)
            enet_path(...)
                cd_fast.enet_coordinate_descent_gram(...)

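It is not until cd_fast.enet_coordinate_descent_gram() that the GIL is released. You can see the with nogil: statement here.

%prun shows that 99.3% of the time in this stack is spent in Parallel(...).

%lprun shows that only 40% of the time is spent in cd_fast.enet_coordinate_descent_gram(...).

The remaining 60% of the time in Parallel(...) is spent under the GIL and is therefore not parallelizable.

A back-of-the-envelope calculation suggests that 40% parallelizable code should allow 2 threads (more than 2 will not help) to complete the 100 tasks this problem is broken into in 60% × 99% + 1% = 60.4% of the sequential time.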
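
The back-of-the-envelope figure can be reproduced with a small model. This is an idealized sketch under stated assumptions: the function name is hypothetical, threads are assumed to overlap perfectly, and only the GIL-holding share of each task is treated as serialized.

```python
def estimated_fraction(gil_share, parallel_share, n_threads):
    """Estimated run time as a fraction of the sequential run time.

    gil_share:      share of each task spent holding the GIL (not parallelizable)
    parallel_share: share of the total run time spent inside Parallel(...)
    Idealized model (an assumption): threads overlap perfectly and only the
    GIL-holding part of each task is serialized.
    """
    # Inside Parallel, the GIL-holding work is a floor no thread count can beat.
    inside = max(gil_share, 1.0 / n_threads)
    outside = 1.0 - parallel_share
    return inside * parallel_share + outside


for n_threads in [1, 2, 4, 8]:
    fraction = estimated_fraction(gil_share=0.60, parallel_share=0.99,
                                  n_threads=n_threads)
    print('threads = {}: {:.1%} of sequential time'.format(n_threads, fraction))
```

Under this model, going beyond 2 threads changes nothing as long as 60% of each task holds the GIL: the estimate stays at 60.4%.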

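So why is the actual performance with n_jobs = 2 worse than with n_jobs = 1?

I am not sure, but on my 4-core machine (8 logical), n_jobs = 2 spawns 8 additional threads, not just one additional thread (or two workers). Since 60% of the execution is not parallelizable, this may cause an oversubscription problem. [...] when n_jobs = 2. This is indicative of oversubscription.

I do not yet know why setting n_jobs = 2 results in 8 additional threads rather than just one (or two workers).

The simplest solution

The problem can be alleviated to some extent by instructing joblib to switch to the loky multiprocessing backend.

This is done using a with statement: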

    from joblib import parallel_backend

    with parallel_backend('loky'):
        ElasticNetCV.fit(...)
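
A fuller, runnable version of this fix might look like the sketch below. It assumes numpy, joblib and scikit-learn are installed; the data sizes, l1_ratio grid and cv value are illustrative choices, not the OP's exact setup.

```python
import numpy as np
from joblib import parallel_backend
from sklearn.linear_model import ElasticNetCV

# Small synthetic problem (sizes are illustrative, not the OP's exact setup).
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = X @ np.linspace(0.1, 1, 10) + rng.rand(200)

# Reference fit with the default (sequential) execution.
sequential = ElasticNetCV(l1_ratio=[0.1, 0.5, 1.0], cv=5, n_jobs=None).fit(X, y)

# Same estimator, but the cross-validation tasks go to loky worker processes.
with parallel_backend('loky'):
    parallel = ElasticNetCV(l1_ratio=[0.1, 0.5, 1.0], cv=5, n_jobs=2).fit(X, y)

# The backend changes where the work runs, not what is computed.
print('alpha    :', sequential.alpha_, parallel.alpha_)
print('l1_ratio :', sequential.l1_ratio_, parallel.l1_ratio_)
```

On a problem this small the process-based run is unlikely to be faster; the point is only that the context manager is all that needs to change around the fit call.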

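On my 4-core (8 logical) machine, the greatest improvement with loky is achieved with n_jobs = -1 (equivalent to n_jobs = 8). This gives a x2.7 performance improvement. That falls short of the x4 improvement I would expect from fully parallelizable cython code on 4 cores (8 logical), and which I do obtain with such code on this machine.

Perhaps this is because loky is not completely immune to oversubscription. The processes it spawns themselves spawn threads (2 additional threads in the original process and 4 threads in each child process on my 4-core machine). These threads compete for their process's GIL. Nevertheless, since there are more GILs to go around (one per spawned process), performance is better than with the default threading backend.

It is also certain that process spawning and communication overhead contribute to the performance gap between x2.7 and x4.

Finally, I am not sure how well this solution scales to 32 cores compared with a properly functioning multithreaded solution. I would appreciate comments from anyone able to run the OP's script on a 32-core machine using the simple with parallel_backend('loky'): fix.

For those who want even more performance, or who are bothered by non-parallelizable tasks being sent to a parallel infrastructure, read on.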

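Fixing sklearn

There is a bug in sklearn's coordinate_descent.py whereby ElasticNetCV.fit(...) ends up calling enet_path() without setting check_input = False. The checks are redundant by the time execution reaches enet_path(), and the documentation does recommend setting check_input = False if the caller has already ensured that the contents have been checked. It seems that sklearn is not following its own advice here, and failed to implement it in ElasticNetCV's case.

Fixing this bug reduces the non-parallelizable time from 60% to 30% when running sequentially, and reduces total runtime by 40%.

However, the threading backend still fails to improve performance, even though only 30% of the execution time is now non-parallelizable. Again, a back-of-the-envelope calculation indicates that 4 cores could be used on our 100 subtasks and finish in about 30% of the sequential execution time. In practice, however, 8 new threads are spawned when n_jobs = 2, and 4 additional threads are spawned for every unit increment in n_jobs. The end result is slower compute times for n_jobs > 1, accompanied by the symptom of the kernel dominating CPU usage (between 50% and 80%).

This again indicates oversubscription, caused by too many threads competing for the same GIL. Once again, I do not yet know why the threading backend spawns this seemingly inflated number of threads. Does anyone know?

This fix also improves performance with loky, with diminishing returns as you increase n_jobs. On my machine with 8 logical cores, with n_jobs = 8 and the check_input = False fix, I get x3.7 relative to the out-of-the-box sklearn sequential time. Note that this is only x2 relative to the check_input = False sequential time, and that a properly functioning multithreaded solution would give us about x4.

One caveat, and perhaps the reason check_input = False is not implemented in the official release of sklearn: the input checks for MultiTaskElasticNet are only performed inside enet_path(), and they would need to be refactored elsewhere so that this fix does not potentially break the MultiTask versions of the algorithm. As far as I can tell, however, the single-output versions work fine without having to worry about this.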

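Other matters:

Parallel(...) computes np.dot(X.T, X) for every fold and (unnecessarily) for every l1_ratio. The unnecessary computations can be removed, and the necessary ones moved out of Parallel(...). According to my profiler, this would further reduce the GIL time of the parallel tasks from 30% to 20%.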
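
The idea of hoisting the Gram matrix can be sketched with numpy alone. This illustrates the principle only and is not sklearn's actual code: the fold splitting and the helper name training_gram are assumptions made for the example. The key point is that X_train.T @ X_train depends on the fold but not on l1_ratio.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 30)
l1_ratios = np.linspace(0.01, 1, 10)
n_folds = 10

# Row indices of each test fold; the training rows are everything else.
test_folds = np.array_split(np.arange(X.shape[0]), n_folds)


def training_gram(test_idx):
    """Gram matrix X_train.T @ X_train for one fold (independent of l1_ratio)."""
    mask = np.ones(X.shape[0], dtype=bool)
    mask[test_idx] = False
    X_train = X[mask]
    return np.dot(X_train.T, X_train)


# Naive layout: one Gram matrix per (l1_ratio, fold) task.
naive_count = 0
for _ in l1_ratios:
    for test_idx in test_folds:
        training_gram(test_idx)
        naive_count += 1

# Hoisted layout: one Gram matrix per fold, reused for every l1_ratio.
grams = [training_gram(test_idx) for test_idx in test_folds]
hoisted_count = len(grams)

print('Gram products: naive = {}, hoisted = {}'.format(naive_count, hoisted_count))
```

With 10 folds and 10 l1_ratio values, the naive layout forms the product once per task, while the hoisted layout forms it once per fold.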

Ultimately, you do not want GIL-bound code going into a Parallel(...) threading backend. You can either take things out of it, or cythonize what you put in using with nogil:.
