我正在尝试将 covid-19 大流行的 JSON 格式的所有最新数据集导入 pandas 数据框。我相信使用它应该是可能的,json_normalize
但我不能让它工作。
首先,我初始化并导入 json 数据:
import pandas as pd
import numpy as np
import requests
import json
from pandas.io.json import json_normalize
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.json'
r = requests.get(url)
json = r.json()
这个 JSON 的类型是dict
. 当我检查 时json
,我可以看到一个复杂的嵌套结构。首先,每个位置都有dict
一个dict
(使用“iso_code”的名称)。这些dicts
有一些整体统计数据作为键/值,并且包含时间序列的“数据”字典list
作为dicts
. 我希望我说对了...
输出是一条长线,我可以从中显示第一个片段:
{"AFG":{"continent":"Asia","location":"Afghanistan","population":39835428.0,"population_density":54.422,"median_age":18.6,"aged_65_older":2.581,"aged_70_older":1.337,"gdp_per_capita":1803.987,"cardiovasc_death_rate":597.029,"diabetes_prevalence":9.59,"handwashing_facilities":37.746,"hospital_beds_per_thousand":0.5,"life_expectancy":64.83,"human_development_index":0.511,"data":[{"date":"2020-02-24","total_cases":5.0,"new_cases":5.0,"total_cases_per_million":0.126,"new_cases_per_million":0.126,"stringency_index":8.33},{"date":"2020-02-25",
为了便于阅读,我已经格式化了一个带有换行符和缩进的片段:
{
"AFG":{
"continent":"Asia",
"location":"Afghanistan",
"population":39835428.0,
"population_density":54.422,
"median_age":18.6,
"aged_65_older":2.581,
"aged_70_older":1.337,
"gdp_per_capita":1803.987,
"cardiovasc_death_rate":597.029,
"diabetes_prevalence":9.59,
"handwashing_facilities":37.746,
"hospital_beds_per_thousand":0.5,
"life_expectancy":64.83,
"human_development_index":0.511,
"data":[
{
"date":"2020-02-24",
"total_cases":5.0,
"new_cases":5.0,
"total_cases_per_million":0.126,
"new_cases_per_million":0.126,
"stringency_index":8.33
},
. .
.
{
"date":"2021-12-27",
"total_cases":157967.0,
"new_cases":16.0,
"new_cases_smoothed":24.286,
"total_deaths":7354.0,
"new_deaths":0.0,
"new_deaths_smoothed":2.714,
"total_cases_per_million":3965.49,
"new_cases_per_million":0.402,
"new_cases_smoothed_per_million":0.61,
"total_deaths_per_million":184.61,
"new_deaths_per_million":0.0,
"new_deaths_smoothed_per_million":0.068
}
]
},
.
.
"ZWE":{
"continent":"Africa",
"location":"Zimbabwe",
"population":14862927.0,
"population_density":42.729,
"median_age":19.6
"aged_65_older":2.822,
"aged_70_older":1.882,
"gdp_per_capita":1899.775,
"extreme_poverty":21.4,
"cardiovasc_death_rate":307.846,
"diabetes_prevalence":1.82,
"female_smokers":1.6,
"male_smokers":30.7,
"handwashing_facilities":36.791,
"hospital_beds_per_thousand":1.7,
"life_expectancy":61.49,
"human_development_index":0.571,
"data":[
{
"date":"2020-03-23",
"total_cases":3.0,
"new_cases":0.0,
"total_deaths":1.0,
"new_deaths":1.0,
"total_cases_per_million":0.202,
"new_cases_per_million":0.0,
"total_deaths_per_million":0.067,
"new_deaths_per_million":0.067,
"stringency_index":45.37
},
.
.
{
"date":"2021-12-28",
"total_cases":207548.0,
"new_cases":2099.0,
"new_cases_smoothed":1397.143,
"total_deaths":4940.0,
"new_deaths":32.0,
"new_deaths_smoothed":17.286,
"total_cases_per_million":13752.031,
"new_cases_per_million":139.079,
"new_cases_smoothed_per_million":92.574,
"total_deaths_per_million":327.322,
"new_deaths_per_million":2.12,
"new_deaths_smoothed_per_million":1.145,
"total_vaccinations":7222341.0,
"people_vaccinated":4107151.0,
"people_fully_vaccinated":3115190.0,
"new_vaccinations":3903.0,
"new_vaccinations_smoothed":10056.0,
"total_vaccinations_per_hundred":47.85,
"people_vaccinated_per_hundred":27.21,
"people_fully_vaccinated_per_hundred":20.64,
"new_vaccinations_smoothed_per_million":666.0,
"new_people_vaccinated_smoothed":5407.0,
"new_people_vaccinated_smoothed_per_hundred":0.036
}
]
}
}
我的目标首先是能够访问每个位置的时间序列,例如“new_cases_per_million”。进一步,我可以想象将“total_cases_per_million”或“new_vaccinations_smoothed_per_million”与“gdp_per_capita”等联系起来。
由于实际的时间序列嵌套在“数据”dict
中,我尝试了这样的事情:
data = pd.json_normalize(json,record_path=['data'],errors='ignore')
这给出了错误:
KeyError Traceback (most recent call last)
<ipython-input-20-602e9c355c71> in <module>
----> 1 data = pd.json_normalize(json,record_path=['data'],errors='ignore')
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
334 records.extend(recs)
335
--> 336 _recursive_extract(data, record_path, {}, level=0)
337
338 result = DataFrame(records)
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
307 else:
308 for obj in data:
--> 309 recs = _pull_records(obj, path[0])
310 recs = [
311 nested_to_record(r, sep=sep, max_level=max_level)
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _pull_records(js, spec)
246 if has non iterable value.
247 """
--> 248 result = _pull_field(js, spec)
249
250 # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
/usr/local/lib/python3.8/site-packages/pandas/io/json/_normalize.py in _pull_field(js, spec)
237 result = result[field]
238 else:
--> 239 result = result[spec]
240 return result
241
KeyError: 'data'
我可以record_path
在不设置这样的情况下运行命令:
data = pd.json_normalize(json)
但这当然不会扩展嵌套数据。
我有另一个熊猫项目,我从 owid 将疫苗接种数据导入为 json(见下文):
json 数据的结构略有不同。
用换行符和缩进格式化,它看起来像:
[
{
"country": "Afghanistan",
"iso_code": "AFG",
"data": [
{
"date": "2021-02-22",
"total_vaccinations": 0,
"people_vaccinated": 0,
"total_vaccinations_per_hundred": 0.0,
"people_vaccinated_per_hundred": 0.0
},
.
.
{
"date": "2021-12-30",
"total_vaccinations": 305216136,
"people_vaccinated": 193716017,
"people_fully_vaccinated": 125428809,
"total_boosters": 860635,
"daily_vaccinations_raw": 1947912,
"daily_vaccinations": 1267521,
"total_vaccinations_per_hundred": 22.22,
"people_vaccinated_per_hundred": 14.1,
"people_fully_vaccinated_per_hundred": 9.13,
"total_boosters_per_hundred": 0.06,
"daily_vaccinations_per_million": 923,
"daily_people_vaccinated": 688914,
"daily_people_vaccinated_per_hundred": 0.05
}
]
},
.
.
{
"country": "Zimbabwe",
"iso_code": "ZWE",
"data": [
{
"date": "2021-02-18",
"total_vaccinations": 39,
"people_vaccinated": 39,
"total_vaccinations_per_hundred": 0.0,
"people_vaccinated_per_hundred": 0.0
},
.
.
{
"date": "2021-12-29",
"total_vaccinations": 7234293,
"people_vaccinated": 4112517,
"people_fully_vaccinated": 3121776,
"daily_vaccinations_raw": 11952,
"daily_vaccinations": 9526,
"total_vaccinations_per_hundred": 47.93,
"people_vaccinated_per_hundred": 27.25,
"people_fully_vaccinated_per_hundred": 20.68,
"daily_vaccinations_per_million": 631,
"daily_people_vaccinated": 4796,
"daily_people_vaccinated_per_hundred": 0.032
}
]
}
]
我可以用
url = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.json"
r = requests.get(url)
df = pd.json_normalize(
j,
record_path = ['data'],
errors = 'ignore',
meta = ['iso_code','country']
)
获得的数据框的工作方式与此处预期的一样。
通过对列名进行一些小的编辑,我可以运行/得到这个:
df[(df['iso_code']=='DNK')][['total_pct','fully_pct','boost_pct']]
total_pct fully_pct boost_pct
date
2020-12-01 0.00 NaN NaN
2020-12-02 NaN NaN NaN
2020-12-03 NaN NaN NaN
2020-12-04 NaN NaN NaN
2020-12-05 NaN NaN NaN
... ... ... ...
2021-12-24 82.29 78.03 42.34
2021-12-25 82.30 78.03 42.65
2021-12-26 82.33 78.06 43.12
2021-12-27 82.44 78.26 44.70
2021-12-28 82.52 78.43 46.15
我无法弄清楚为什么疫苗数据集可以标准化时完整数据集无法标准化。显然,结构有点不同。完整数据集位于顶层,而疫苗接种数据集位于顶层。这会是某种解释,我该如何克服呢?dict
list