Loading collections of datasets - Python code examples

data-mining dataset
2021-10-07 00:31:26

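Sometimes you may want to test an idea on several datasets. Collections of datasets are available in several places.

Question: could you share some Python scripts showing how to download multiple datasets from these (or other) dataset collections?

Ideally, one should be able to: 1) get the list of datasets, 2) select the desired ones by some conditions, 3) download the selected ones. But if you have something different, please share it anyway.

For the "openml" database I have a script; see my own answer. But what about other collections: Kaggle, UCI ...?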


Here are some examples of dataset collections:

https://www.openml.org/

https://archive.ics.uci.edu/ml/index.php

https://ieee-dataport.org/datasets

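Kaggle contains a large number of datasets. There are also specialized collections: for collections of graphs see the list at https://mathoverflow.net/a/359449/10446, and a lot of biological data is at https://www.ncbi.nlm.nih.gov/gds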

3 Answers

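How to get Kaggle data from Python code?

  1. Install the kaggle package: C:\Users\TalgatHafiz> pip install kaggle

  2. Log in to your Kaggle account. Click the icon in the top-right corner -> My Account, scroll down to the API section, and click "Create New API Token". A "kaggle.json" file is created and saved locally.

  3. Create a ".kaggle" directory: C:\Users\TalgatHafiz>mkdir .kaggle and move "kaggle.json" into that directory.

  4. List all active competitions by running: C:\Users\TalgatHafiz>kaggle competitions list

  5. Choose one of the competitions you are registered for, e.g. https://www.kaggle.com/c/contradictory-my-dear-watson/data# and scroll down. Just before the "Data Explorer" section there should be an API line: "kaggle competitions download -c contradictory-my-dear-watson". Copy it.

  6. Run these commands from a notebook:
     import kaggle
     !kaggle competitions download -c contradictory-my-dear-watson

  7. The zipped data file is downloaded to the same directory as your notebook: C:\Users\TalgatHafiz\conda\contradictory-my-dear-watson.zip So now you can unzip it and start working with the data.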
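The unzip in step 7 can also be done from Python. Here is a minimal sketch using only the standard library. The function name `extract_archive` and the default target folder are illustrative assumptions, not part of the kaggle package:

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir=None):
    """Unpack a downloaded .zip archive and return the names it contained."""
    zip_path = Path(zip_path)
    # Default target (an assumption): a folder named after the archive, next to it
    target = Path(target_dir) if target_dir is not None else zip_path.with_suffix("")
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

For the competition above, `extract_archive("contradictory-my-dear-watson.zip")` would unpack the files into a `contradictory-my-dear-watson` folder next to the archive.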

If you still have questions, read https://medium.com/@jeff.daniel77/accessing-the-kaggle-com-api-with-jupyter-notebook-on-windows-d6f330bc6953

Here is a script for the "openml" collection of datasets. Hopefully something similar can be provided for other databases.

#see docs: https://docs.openml.org/Python-guide/

!pip install openml
import openml

import numpy as np
import pandas as pd
import time


# Get information on all collection of openml datasets:
datalist = openml.datasets.list_datasets(output_format="dataframe")

# Select datasets by some conditions (plain pandas) - we will get just 4 such datasets
datasets_selected = datalist[
    (datalist.NumberOfInstances > 300) & (datalist.NumberOfInstances < 2550)
    & (datalist.NumberOfFeatures > 10000) & (datalist.NumberOfFeatures < 40000)
    & (datalist.NumberOfFeatures != 10937)
].sort_values(["NumberOfInstances"], ascending=False)  # .head(n=20)
print(datasets_selected.shape)

# load all selected datasets and print short info: 
for i in range(len(datasets_selected)):
  nm = datasets_selected['name'].iloc[i]
  print(nm, i )
  did =  int( datasets_selected['did'].iloc[i] ) # did - dataset_id 
  t0 = time.time()
  data = openml.datasets.get_dataset(did)
  X, y, categorical_indicator, attribute_names = data.get_data(
      dataset_format="array", target=data.default_target_attribute )
  print(X.shape, y.shape, time.time()-t0,'secs passed' )

Here is a simpler example for the datasets built into sklearn:

import time
from sklearn import datasets

# 'load_boston' was in the original list, but it was removed in scikit-learn 1.2
list_id = ['load_iris', 'load_diabetes', 'load_digits', 'load_linnerud', 'load_wine', 'load_breast_cancer'] + \
 ['fetch_california_housing', 'fetch_covtype', 'fetch_lfw_people', 'fetch_20newsgroups_vectorized', 'fetch_olivetti_faces']
# 'fetch_rcv1' - takes too long
# 'fetch_lfw_pairs' - TypeError: fetch_lfw_pairs() got an unexpected keyword argument 'return_X_y'
# 'fetch_kddcup99' - sometimes fails
for name in list_id:  # renamed from `id` to avoid shadowing the builtin
  print(name)
  t0 = time.time()
  func_load = getattr(datasets, name)  # look up the loader function by name
  X, y = func_load(return_X_y=True)
  print(name, X.shape, time.time() - t0, 'secs passed')
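As the commented-out entries note, some loaders fail or take a long time. One way to keep the loop going is to catch the exception for each loader and record it. This is only a sketch: `load_all` and its return format are hypothetical helpers, not part of sklearn, and the loaders are passed in as plain callables.

```python
import time


def load_all(loaders):
    """Call each loader, timing it and recording failures instead of stopping.

    `loaders` maps a name to a zero-argument callable returning (X, y).
    """
    loaded, failed = {}, {}
    for name, loader in loaders.items():
        t0 = time.time()
        try:
            X, y = loader()
        except Exception as exc:  # e.g. TypeError for loaders without return_X_y
            failed[name] = repr(exc)
            continue
        loaded[name] = (X, y, time.time() - t0)
    return loaded, failed
```

With sklearn, the input could be built as `{n: functools.partial(getattr(datasets, n), return_X_y=True) for n in list_id}`, after which `failed` lists the loaders that need special handling.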

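OpenML has a gallery of examples for different use cases, including browsing and downloading datasets from Python, and running benchmarks: https://openml.github.io/openml-python/master/examples/index.html

When you want to benchmark a new algorithm, the gist is as follows: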

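import numpy as np
import openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline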

suite = openml.study.get_suite('OpenML-CC18') # get benchmark suite
tasks = np.random.choice(suite.tasks, size=10, replace=False) # sample 10 tasks randomly
clf = make_pipeline(SimpleImputer(),RandomForestClassifier()) # simple pipeline
for task_id in tasks:
    task = openml.tasks.get_task(task_id)
    print("Running on task",task.get_dataset().name)
    run = openml.runs.run_model_on_task(clf, task)
    print(run.get_metric_fn(accuracy_score))

Output (these are 10-fold CV tasks):

Running on task credit-approval
[0.928 0.884 0.841 0.768 0.913 0.884 0.884 0.841 0.899 0.884]
Running on task pc1
[0.955 0.919 0.946 0.955 0.937 0.973 0.919 0.928 0.919 0.918]
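Per-fold scores are easier to compare once reduced to a mean and a spread. Here is a small sketch using the standard library on the two score lists printed above. `summarize_folds` is a hypothetical helper, and in practice the input would be the array returned by run.get_metric_fn(accuracy_score).

```python
from statistics import mean, stdev


def summarize_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    scores = [float(s) for s in scores]
    return mean(scores), stdev(scores)


# Fold accuracies copied from the output above
fold_scores = {
    "credit-approval": [0.928, 0.884, 0.841, 0.768, 0.913, 0.884, 0.884, 0.841, 0.899, 0.884],
    "pc1": [0.955, 0.919, 0.946, 0.955, 0.937, 0.973, 0.919, 0.928, 0.919, 0.918],
}
for task_name, scores in fold_scores.items():
    avg, spread = summarize_folds(scores)
    print(f"{task_name}: mean={avg:.4f}, std={spread:.4f}")
```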

Optionally, you can also share the run directly on OpenML with run.publish()

Disclaimer: I am one of the core developers of OpenML