
How to plot trends properly

Cross Validated · data visualization

2022-01-22 11:12:02

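I am creating a chart to show trends in death rates (per 1000 people) in different countries. The story that should come out of the plot is that Germany (light blue line) is the only country whose trend is increasing after 1932. This is my first (basic) attempt.

In my opinion, this plot already shows what we want it to tell, but it is not very intuitive. Do you have any suggestions for making the trends easier to tell apart? I was thinking of plotting growth rates, but I tried that and it was not much better.

The data are as follows: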

year     de     fr      be       nl     den      ch     aut     cz       pl
1927    10.9    16.5    13      10.2    11.6    12.4    15      16      17.3
1928    11.2    16.4    12.8    9.6     11      12      14.5    15.1    16.4
1929    11.4    17.9    14.4    10.7    11.2    12.5    14.6    15.5    16.7
1930    10.4    15.6    12.8    9.1     10.8    11.6    13.5    14.2    15.6
1931    10.4    16.2    12.7    9.6     11.4    12.1    14      14.4    15.5
1932    10.2    15.8    12.7    9       11      12.2    13.9    14.1    15
1933    10.8    15.8    12.7    8.8     10.6    11.4    13.2    13.7    14.2
1934    10.6    15.1    11.7    8.4     10.4    11.3    12.7    13.2    14.4
1935    11.4    15.7    12.3    8.7     11.1    12.1    13.7    13.5    14
1936    11.7    15.3    12.2    8.7     11      11.4    13.2    13.3    14.2
1937    11.5    15      12.5    8.8     10.8    11.3    13.3    13.3    14

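4 Answers

Sometimes less is more. With less detail about the year-to-year variation and the differences between countries, you can provide more information about the trends. Since the other countries mostly move together, you can get by without separate colors.

In using a smoother, you are asking the reader to trust that you have not smoothed over any interesting variation.

Update after getting a couple of requests for code:

I made this in JMP's interactive Graph Builder. The JMP script is: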

Graph Builder(
Size( 528, 456 ), Show Control Panel( 0 ), Show Legend( 0 ),
// variable role assignments:
Variables( X( :year ), Y( :Deaths ), Overlay( :Country ) ),
// spline smoother:
Elements( Smoother( X, Y, Legend( 3 ) ) ),
// customizations:
SendToReport(
    // x scale, leaving room for annotations
    Dispatch( {},"year",ScaleBox,
        {Min( 1926.5 ), Max( 1937.9 ), Inc( 2 ), Minor Ticks( 1 )}
    ),
    // customize colors and DE line width
    Dispatch( {}, "400", ScaleBox, {Legend Model( 3,
        Properties( 0, {Line Color( "gray" )}, Item ID( "aut", 1 ) ),
        Properties( 1, {Line Color( "gray" )}, Item ID( "be", 1 ) ),
        Properties( 2, {Line Color( "gray" )}, Item ID( "ch", 1 ) ),
        Properties( 3, {Line Color( "gray" )}, Item ID( "cz", 1 ) ),
        Properties( 4, {Line Color( "gray" )}, Item ID( "den", 1 ) ),
        Properties( 5, {Line Color( "gray" )}, Item ID( "fr", 1 ) ),
        Properties( 6, {Line Color( "gray" )}, Item ID( "nl", 1 ) ),
        Properties( 7, {Line Color( "gray" )}, Item ID( "pl", 1 ) ),
        Properties( 8, {Line Color("dark red"), Line Width( 3 )}, Item ID( "de", 1 ))
    )}),
    // add line annotations (omitted)

));
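For readers without JMP, the same idea (smooth every series, draw the other countries in gray, emphasise Germany) can be sketched in base R. This is only a rough analogue under stated assumptions: `smooth.spline()` with `df = 4` is an arbitrary smoothness choice made here for illustration and is not claimed to match JMP's smoother, and the helper name `draw_smoothed` is hypothetical.

```r
# Data from the question (deaths per 1000 people).
d <- read.table(header = TRUE, text = "
year de   fr   be   nl   den  ch   aut  cz   pl
1927 10.9 16.5 13.0 10.2 11.6 12.4 15.0 16.0 17.3
1928 11.2 16.4 12.8  9.6 11.0 12.0 14.5 15.1 16.4
1929 11.4 17.9 14.4 10.7 11.2 12.5 14.6 15.5 16.7
1930 10.4 15.6 12.8  9.1 10.8 11.6 13.5 14.2 15.6
1931 10.4 16.2 12.7  9.6 11.4 12.1 14.0 14.4 15.5
1932 10.2 15.8 12.7  9.0 11.0 12.2 13.9 14.1 15.0
1933 10.8 15.8 12.7  8.8 10.6 11.4 13.2 13.7 14.2
1934 10.6 15.1 11.7  8.4 10.4 11.3 12.7 13.2 14.4
1935 11.4 15.7 12.3  8.7 11.1 12.1 13.7 13.5 14.0
1936 11.7 15.3 12.2  8.7 11.0 11.4 13.2 13.3 14.2
1937 11.5 15.0 12.5  8.8 10.8 11.3 13.3 13.3 14.0
")
countries <- names(d)[-1]

# Fit one smoothing spline per country and evaluate it on a fine year grid.
# df = 4 is an assumption for illustration, not JMP's setting.
grid <- seq(min(d$year), max(d$year), by = 0.1)
smoothed <- sapply(countries, function(cn) {
  fit <- smooth.spline(d$year, d[[cn]], df = 4)
  predict(fit, grid)$y
})

draw_smoothed <- function() {
  # Other countries first, all in the same gray; Germany drawn last, on top.
  others <- setdiff(countries, "de")
  matplot(grid, smoothed[, others], type = "l", lty = 1, col = "gray60",
          ylim = range(smoothed), xlab = "Year", ylab = "Deaths per 1000")
  lines(grid, smoothed[, "de"], col = "darkred", lwd = 3)
  text(max(grid), smoothed[nrow(smoothed), "de"], "de", pos = 3,
       col = "darkred")
}
if (interactive()) draw_smoothed()
```

Changing `df` shows how much the picture depends on the amount of smoothing, which is exactly the trust issue mentioned above.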

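There are good answers here. Let me take you at your word that you want to show that Germany's trend differs from the rest. Levels versus changes is a common distinction in economics. Your data are in levels, but your question is stated as seeking changes. The way to do that is to set the reference level (here 1932) to $1$. From there, each successive year is expressed as a fraction of the previous year. (It is common to take logs to make the changes more stable and symmetrical. That does change the meaning of the exact numbers somewhat, if you really want someone to read them off the plot, but usually for this kind of thing people just want to be able to see the pattern.) Then you take the running sum of each series and multiply it by $100$, by convention. That is what you plot. Your case is less common in that your reference point is in the middle of the series, so I ran this in both directions from 1932. Below is a simple example, coded in R (there are lots of ways to make the code and the plot nicer, but this should show the idea straightforwardly). I made Germany's line thicker in addition to distinguishing it in the legend, and I added a reference line at $100$. It is very easy to see that Germany stands out from the rest. You can also see that all the other countries ended up with lower rates in 1937 than in 1932, and that their year-to-year changes were much less volatile after 1932 than in the preceding years.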

d = read.table(text="
year     de     fr      be       nl     den      ch     aut     cz       pl
1927    10.9    16.5    13      10.2    11.6    12.4    15      16      17.3
...
1937    11.5    15      12.5    8.8     10.8    11.3    13.3    13.3    14",
header=TRUE)

d2          = d  # we'll end up needing both
d2[6,2:10]  = 1  # set 1932 as 1
for(j in 2:10){   
  for(i in 7:11){
      # changes moving forward from 1932:
    d2[i,j] = log( d[i,j]/d[i-1,j] )
      # running sum moving forward from 1932:
    d2[i,j] = d2[i,j]+d2[i-1,j]
  }
  for(i in 5:1){
      # changes moving backward from 1932:
    d2[i,j] = log( d[i,j]/d[i+1,j] )
      # running sum moving backward from 1932:
    d2[i,j] = d2[i+1,j]+d2[i,j]
  }
}
d2[,2:10]   = d2[,2:10]*100  # multiply all values by 100

dev.new()  # plot of changes; portable replacement for windows()
  plot(1,1, xlim=c(1927,1937), ylim=c(82,118), xlab="Year", 
       ylab="Change from 1932", main="European death rates")
  abline(h=100, col="lightgray")
  for(j in 2:10){
    lines(1927:1937, d2[,j], col=rainbow(9)[j-1], lwd=ifelse(j==2,2,1))
  }
  legend("bottomleft", legend=colnames(d2)[2:10], lwd=c(2,rep(1,8)), lty=1, 
         col=rainbow(9), ncol=2)

dev.new()  # plot of levels; portable replacement for windows()
  plot(1,1, xlim=c(1927,1937), ylim=c(8,18.4), xlab="Year", 
       ylab="Deaths per thousand", main="European death rates")
  abline(h=d[6,2:10], col="gray90")
  points(rep(1932,9), d[6,2:10], col=rainbow(9), pch=16)
  for(j in 2:10){
    lines(1927:1937, d[,j], col=rainbow(9)[j-1], lwd=ifelse(j==2,2,1))
  }
  legend("topright", legend=colnames(d)[2:10], lwd=c(2,rep(1,8)), lty=1, 
         col=rainbow(9), ncol=2)

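In contrast, below is the corresponding plot of the data in levels. Even so, I tried to make it possible to see that Germany alone rises after 1932, in two ways: I put a prominent point on each series at 1932, and I drew a faint gray line in the background at each of those levels.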
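A note on the arithmetic of the change index: because a running sum of log changes telescopes, the value for year $t$ reduces to $100 \times (1 + \log(y_t / y_{1932}))$, where $y_t$ is the death rate in year $t$. The sketch below, which assumes the same data as the question, computes the index in one vectorised step and checks it against a loop-based running sum. The names `index` and `loop_index` are hypothetical helpers introduced here for illustration.

```r
# Data from the question (deaths per 1000 people).
d <- read.table(header = TRUE, text = "
year de   fr   be   nl   den  ch   aut  cz   pl
1927 10.9 16.5 13.0 10.2 11.6 12.4 15.0 16.0 17.3
1928 11.2 16.4 12.8  9.6 11.0 12.0 14.5 15.1 16.4
1929 11.4 17.9 14.4 10.7 11.2 12.5 14.6 15.5 16.7
1930 10.4 15.6 12.8  9.1 10.8 11.6 13.5 14.2 15.6
1931 10.4 16.2 12.7  9.6 11.4 12.1 14.0 14.4 15.5
1932 10.2 15.8 12.7  9.0 11.0 12.2 13.9 14.1 15.0
1933 10.8 15.8 12.7  8.8 10.6 11.4 13.2 13.7 14.2
1934 10.6 15.1 11.7  8.4 10.4 11.3 12.7 13.2 14.4
1935 11.4 15.7 12.3  8.7 11.1 12.1 13.7 13.5 14.0
1936 11.7 15.3 12.2  8.7 11.0 11.4 13.2 13.3 14.2
1937 11.5 15.0 12.5  8.8 10.8 11.3 13.3 13.3 14.0
")

# Closed form: divide every column by its 1932 value, take logs, rescale.
ref   <- unlist(d[d$year == 1932, -1])
ratio <- sweep(as.matrix(d[, -1]), 2, ref, "/")
index <- 100 * (1 + log(ratio))
rownames(index) <- d$year

# The long way, for one series: start at 1 in row k (1932) and accumulate
# log changes forward and backward from there.
loop_index <- function(y, k) {
  out <- numeric(length(y))
  out[k] <- 1
  for (i in seq_along(y)[seq_along(y) > k]) {
    out[i] <- out[i - 1] + log(y[i] / y[i - 1])
  }
  for (i in rev(seq_len(k - 1))) {
    out[i] <- out[i + 1] + log(y[i] / y[i + 1])
  }
  100 * out
}

k <- which(d$year == 1932)
all.equal(unname(index[, "de"]), loop_index(d$de, k))
```

Every country sits at 100 in 1932 by construction, and `index["1937", "de"]` comes out at about 112, while the 1937 values for the other eight countries are all below 100.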

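There are many good ideas in the other answers, but they do not exhaust the possible good solutions. The first graph in this answer takes it that the different levels of death rates can be discussed and explained separately. By allowing each series to fill most of the available space, it focuses the reader's attention on the pattern of relative change.

Alphabetical order by country is often a dopey default, and it is not insisted on here. Fortuitously, Germany as de is at the centre of this 3 x 3 display. A simple narrative (Look! Germany's pattern is exceptional, with an upturn from 1932) is made possible and plausible.

Fortunately, 9 countries are enough to justify trying separate panels, but not so many that the design becomes impracticable (with, say, 30 panels, and certainly with 300, there could (would) be too many panels to scan, each too small to scrutinise).

Evidently, there is room here to give fuller country names. (In some other answers, the legend takes up a substantial fraction of the available space, while still being a little cryptic. In practice, people interested in such data would find the country abbreviations easy to decode, but how far a legend is needed is often a vexing question in graph design.)

Stata code for the record: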

clear
input int year double(de fr be nl den ch aut cz pl)
1927 10.9 16.5   13 10.2 11.6 12.4   15   16 17.3
1928 11.2 16.4 12.8  9.6   11   12 14.5 15.1 16.4
1929 11.4 17.9 14.4 10.7 11.2 12.5 14.6 15.5 16.7
1930 10.4 15.6 12.8  9.1 10.8 11.6 13.5 14.2 15.6
1931 10.4 16.2 12.7  9.6 11.4 12.1   14 14.4 15.5
1932 10.2 15.8 12.7    9   11 12.2 13.9 14.1   15
1933 10.8 15.8 12.7  8.8 10.6 11.4 13.2 13.7 14.2
1934 10.6 15.1 11.7  8.4 10.4 11.3 12.7 13.2 14.4
1935 11.4 15.7 12.3  8.7 11.1 12.1 13.7 13.5   14
1936 11.7 15.3 12.2  8.7   11 11.4 13.2 13.3 14.2
1937 11.5   15 12.5  8.8 10.8 11.3 13.3 13.3   14
end

rename (de-pl) (death=)
reshape long death, i(year) j(country) string
set scheme s1color 
line death year, by(country, yrescale note("")) xtitle("") xla(1927(5)1937)
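For comparison, a similar 3 x 3 panel display can be sketched in base R. This is a minimal illustration, not a port of the Stata graph: the helper name `draw_panels` and the margin settings are assumptions. Each call to `plot()` picks its own y range, which plays the role of `yrescale`, and alphabetical order happens to put de in the fifth (centre) cell.

```r
# Data from the question (deaths per 1000 people).
d <- read.table(header = TRUE, text = "
year de   fr   be   nl   den  ch   aut  cz   pl
1927 10.9 16.5 13.0 10.2 11.6 12.4 15.0 16.0 17.3
1928 11.2 16.4 12.8  9.6 11.0 12.0 14.5 15.1 16.4
1929 11.4 17.9 14.4 10.7 11.2 12.5 14.6 15.5 16.7
1930 10.4 15.6 12.8  9.1 10.8 11.6 13.5 14.2 15.6
1931 10.4 16.2 12.7  9.6 11.4 12.1 14.0 14.4 15.5
1932 10.2 15.8 12.7  9.0 11.0 12.2 13.9 14.1 15.0
1933 10.8 15.8 12.7  8.8 10.6 11.4 13.2 13.7 14.2
1934 10.6 15.1 11.7  8.4 10.4 11.3 12.7 13.2 14.4
1935 11.4 15.7 12.3  8.7 11.1 12.1 13.7 13.5 14.0
1936 11.7 15.3 12.2  8.7 11.0 11.4 13.2 13.3 14.2
1937 11.5 15.0 12.5  8.8 10.8 11.3 13.3 13.3 14.0
")

# Alphabetical panel order: aut be ch cz de den fr nl pl.
panel_order <- sort(names(d)[-1])

draw_panels <- function() {
  op <- par(mfrow = c(3, 3), mar = c(2.5, 3, 2, 1))
  on.exit(par(op))  # restore the previous graphics settings afterwards
  for (cn in panel_order) {
    # No ylim is given, so every panel is scaled to its own series.
    plot(d$year, d[[cn]], type = "l", main = cn, xlab = "", ylab = "",
         xaxt = "n", las = 1)
    axis(1, at = c(1927, 1932, 1937))
  }
}
if (interactive()) draw_panels()
```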

EDIT:

Tim Morris suggested a simple improvement to this graph, namely highlighting the year in which the maximum occurred:

egen max = max(death) , by(country)
replace max = max == death
twoway line death year || scatter death year if max, ms(O)  ///
by(country, yrescale note("") legend(off)) xtitle("") xla(1927(5)1937)
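The year of the maximum can also be read off numerically. The following R sketch, using the data from the question, finds for each country the year in which its death rate peaked. `max_year` is a hypothetical name, and `which.max()` returns only the first maximum, so it would not flag ties the way the Stata indicator does (there are no tied maxima in these data).

```r
# Data from the question (deaths per 1000 people).
d <- read.table(header = TRUE, text = "
year de   fr   be   nl   den  ch   aut  cz   pl
1927 10.9 16.5 13.0 10.2 11.6 12.4 15.0 16.0 17.3
1928 11.2 16.4 12.8  9.6 11.0 12.0 14.5 15.1 16.4
1929 11.4 17.9 14.4 10.7 11.2 12.5 14.6 15.5 16.7
1930 10.4 15.6 12.8  9.1 10.8 11.6 13.5 14.2 15.6
1931 10.4 16.2 12.7  9.6 11.4 12.1 14.0 14.4 15.5
1932 10.2 15.8 12.7  9.0 11.0 12.2 13.9 14.1 15.0
1933 10.8 15.8 12.7  8.8 10.6 11.4 13.2 13.7 14.2
1934 10.6 15.1 11.7  8.4 10.4 11.3 12.7 13.2 14.4
1935 11.4 15.7 12.3  8.7 11.1 12.1 13.7 13.5 14.0
1936 11.7 15.3 12.2  8.7 11.0 11.4 13.2 13.3 14.2
1937 11.5 15.0 12.5  8.8 10.8 11.3 13.3 13.3 14.0
")

# For each country column, look up the year of its largest value.
max_year <- sapply(d[-1], function(y) d$year[which.max(y)])
max_year
```

In these data Germany peaks in 1936, whereas every other country peaks in 1927 or 1929, which is why marking the maximum makes Germany's panel stand out.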

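EDIT 2 (revised to show simpler code):

Alternatively, the next design shows each series separately, but each time with the other series as a backdrop. The general idea is discussed in this related thread.

There are losses as well as gains here. While each series can be seen more easily in the context of the others, space is lost through repetition.

Stata code for the record:

(Code for input, reshape and rename as earlier in this answer.)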

* type "ssc inst fabplot" to install
fabplot line death year, by(country, compact note("countries highlighted in turn")) ///
ytitle("death rate, yearly deaths per 1000") yla(8(2)18, ang(h)) ///
xla(1927(5)1937, format(%tyY)) xtitle("") front(connected)

fabplot is to be understood as front and back, or foreground and backdrop, or foreground and background plot, and not as an echo of the 1960s slang "fabulous".
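The front-and-back idea is not tied to Stata. Below is a hedged base-R sketch of the same design: each panel draws the eight other series in light gray and then the highlighted series on top. It is an illustration under assumptions, not a reimplementation of fabplot; `split_series` and `draw_front_back` are hypothetical helper names, and the colours are arbitrary.

```r
# Data from the question (deaths per 1000 people).
d <- read.table(header = TRUE, text = "
year de   fr   be   nl   den  ch   aut  cz   pl
1927 10.9 16.5 13.0 10.2 11.6 12.4 15.0 16.0 17.3
1928 11.2 16.4 12.8  9.6 11.0 12.0 14.5 15.1 16.4
1929 11.4 17.9 14.4 10.7 11.2 12.5 14.6 15.5 16.7
1930 10.4 15.6 12.8  9.1 10.8 11.6 13.5 14.2 15.6
1931 10.4 16.2 12.7  9.6 11.4 12.1 14.0 14.4 15.5
1932 10.2 15.8 12.7  9.0 11.0 12.2 13.9 14.1 15.0
1933 10.8 15.8 12.7  8.8 10.6 11.4 13.2 13.7 14.2
1934 10.6 15.1 11.7  8.4 10.4 11.3 12.7 13.2 14.4
1935 11.4 15.7 12.3  8.7 11.1 12.1 13.7 13.5 14.0
1936 11.7 15.3 12.2  8.7 11.0 11.4 13.2 13.3 14.2
1937 11.5 15.0 12.5  8.8 10.8 11.3 13.3 13.3 14.0
")
countries <- names(d)[-1]

# Split the data into the highlighted series and the backdrop series.
split_series <- function(cn) {
  list(front = d[[cn]],
       back  = as.matrix(d[setdiff(countries, cn)]))
}

draw_front_back <- function() {
  op <- par(mfrow = c(3, 3), mar = c(2.5, 3, 2, 1))
  on.exit(par(op))
  ylim <- range(as.matrix(d[countries]))  # common y scale in every panel
  for (cn in sort(countries)) {
    s <- split_series(cn)
    # Backdrop: all other countries in light gray.
    matplot(d$year, s$back, type = "l", lty = 1, col = "gray80",
            ylim = ylim, main = cn, xlab = "", ylab = "", las = 1)
    # Foreground: the highlighted country as a connected line with points.
    lines(d$year, s$front, type = "o", pch = 16, col = "blue", lwd = 2)
  }
}
if (interactive()) draw_front_back()
```

Unlike the separate-panel design with rescaled axes, a common y scale is kept here so that the backdrop series remain comparable across panels.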

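Your graph is reasonable, but it needs some refinement, including a title, axis labels, and full country labels. If your goal is to emphasise that Germany was the only country with an increase in death rate over the observation period, then a simple way to do this is to highlight that line in the plot, using a thicker line, a different line type, or alpha transparency. You could also augment your time-series plot with a bar plot showing the change in death rate over time, which reduces the complexity of the time-series lines to a single measure of change.

Here is how you could produce these plots using ggplot in R: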

library(tidyr);
library(dplyr);
library(ggplot2);

#Create data frame in wide format
DATA_WIDE <- data.frame(Year = 1927L:1937L,
                        DE   = c(10.9, 11.2, 11.4, 10.4, 10.4, 10.2, 10.8, 10.6, 11.4, 11.7, 11.5),
                        FR   = c(16.5, 16.4, 17.9, 15.6, 16.2, 15.8, 15.8, 15.1, 15.7, 15.3, 15.0),
                        BE   = c(13.0, 12.8, 14.4, 12.8, 12.7, 12.7, 12.7, 11.7, 12.3, 12.2, 12.5),
                        NL   = c(10.2,  9.6, 10.7,  9.1,  9.6,  9.0,  8.8,  8.4,  8.7,  8.7,  8.8),
                        DEN  = c(11.6, 11.0, 11.2, 10.8, 11.4, 11.0, 10.6, 10.4, 11.1, 11.0, 10.8),
                        CH   = c(12.4, 12.0, 12.5, 11.6, 12.1, 12.2, 11.4, 11.3, 12.1, 11.4, 11.3),
                        AUT  = c(15.0, 14.5, 14.6, 13.5, 14.0, 13.9, 13.2, 12.7, 13.7, 13.2, 13.3),
                        CZ   = c(16.0, 15.1, 15.5, 14.2, 14.4, 14.1, 13.7, 13.3, 13.5, 13.3, 13.3),
                        PL   = c(17.3, 16.4, 16.7, 15.6, 15.5, 15.0, 14.2, 14.4, 14.0, 14.2, 14.0));

#Convert data to long format
DATA_LONG <- DATA_WIDE %>% gather(Country, Measurement, DE:PL);

#Set line-types and sizes for plot
#Germany (DE) is the fifth country in the plot
LINETYPE <- c("dashed", "dashed", "dashed", "dashed", "solid", "dashed", "dashed", "dashed", "dashed");
SIZE     <- c(1, 1, 1, 1, 2, 1, 1, 1, 1);

#Create time-series plot
theme_set(theme_bw());
PLOT1 <- ggplot(DATA_LONG, aes(x = Year, y = Measurement, colour = Country)) + 
         geom_line(aes(size = Country, linetype = Country)) +
         scale_size_manual(values = SIZE) +
         scale_linetype_manual(values = LINETYPE) +
         scale_x_continuous(breaks = 1927:1937) +
         scale_y_continuous(limits = c(0, 20)) +
         labs(title = "Annual Time Series Plot: Death Rates over Time", 
              subtitle = "Only Germany (DE) trends upward from 1927-37") +
         xlab("Year") + ylab("Crude Death Rate\n(per 1,000 population)");


#Create new data frame for differences
DATA_DIFF <- data.frame(Country = c("DE", "FR", "BE", "NL", "DEN", "CH", "AUT", "CZ", "PL"),
                        Change  = as.numeric(DATA_WIDE[11, 2:10] - DATA_WIDE[1, 2:10]));

#Create bar plot
PLOT2 <- ggplot(DATA_DIFF, aes(x = reorder(Country, - Change), y = Change, colour = Country, fill = Country)) + 
         geom_bar(stat = "identity") +
         labs(title = "Bar  Plot: Change in Death Rates from 1927-37", 
              subtitle = "Only Germany (DE) shows an increase in death rate") +
         xlab(NULL) + ylab("Change in crude Death Rate\n(per 1,000 population)");

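This produces the following plots:

Note: I am aware that the OP intended to highlight the change in death rates since 1932, when the trend in Germany started going up. This seems to me a bit like cherry-picking, and I find it dubious when time intervals are chosen to obtain a particular trend. For this reason, I have looked at the change over the whole data range, which is a different comparison from the OP's.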
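To see how much the choice of interval matters, the change can be computed for both windows. The sketch below uses base R and the data from the question; the helper name `change` is hypothetical.

```r
# Data from the question (deaths per 1000 people).
d <- read.table(header = TRUE, text = "
year de   fr   be   nl   den  ch   aut  cz   pl
1927 10.9 16.5 13.0 10.2 11.6 12.4 15.0 16.0 17.3
1928 11.2 16.4 12.8  9.6 11.0 12.0 14.5 15.1 16.4
1929 11.4 17.9 14.4 10.7 11.2 12.5 14.6 15.5 16.7
1930 10.4 15.6 12.8  9.1 10.8 11.6 13.5 14.2 15.6
1931 10.4 16.2 12.7  9.6 11.4 12.1 14.0 14.4 15.5
1932 10.2 15.8 12.7  9.0 11.0 12.2 13.9 14.1 15.0
1933 10.8 15.8 12.7  8.8 10.6 11.4 13.2 13.7 14.2
1934 10.6 15.1 11.7  8.4 10.4 11.3 12.7 13.2 14.4
1935 11.4 15.7 12.3  8.7 11.1 12.1 13.7 13.5 14.0
1936 11.7 15.3 12.2  8.7 11.0 11.4 13.2 13.3 14.2
1937 11.5 15.0 12.5  8.8 10.8 11.3 13.3 13.3 14.0
")

# Change in death rate between two years, one value per country.
change <- function(from, to) {
  unlist(d[d$year == to, -1] - d[d$year == from, -1])
}

full_range <- change(1927, 1937)  # the interval used in this answer
since_1932 <- change(1932, 1937)  # the interval the OP had in mind

round(rbind(full_range, since_1932), 1)
```

With these data Germany is the only country with a positive change under either window (about +0.6 over 1927-37 and about +1.3 over 1932-37), so the qualitative conclusion does not hinge on starting at 1932, although the size of the effect does.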
