Guidelines for discovering new knowledge in data

Cross Validated  data visualization  exploratory data analysis  knowledge discovery
2022-03-31 20:06:50

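I plot things to make a point to myself or to someone else. Usually a question starts the process, and the person asking often hopes for a particular answer.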

How can I learn interesting things about the data in a less biased way?

Right now I roughly follow this approach:

  1. Summary statistics.
  2. Bar charts.
  3. Scatter plots.
  4. Perhaps repeat with an interesting subset of the data.
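
As a rough illustration, steps 1, 2 and 4 of this routine could be sketched in Python with only the standard library. The records, the field names `group` and `value`, and the helper names `summarize` and `text_bar_chart` are all made up for the example. The scatter plot of step 3 is left out because it needs a plotting library.

```python
import statistics
from collections import Counter

# Hypothetical records; the field names and values are illustrative only.
records = [
    {"group": "a", "value": 2.0},
    {"group": "a", "value": 3.5},
    {"group": "b", "value": 9.0},
    {"group": "b", "value": 7.5},
    {"group": "b", "value": 8.0},
    {"group": "c", "value": 4.0},
]


def summarize(values):
    """Return basic summary statistics for a list of at least two numbers."""
    values = list(values)
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def text_bar_chart(labels):
    """Render category counts as rows of '#' characters, sorted by label."""
    counts = Counter(labels)
    width = max(len(str(k)) for k in counts)
    rows = [f"{str(k):<{width}} | {'#' * c} ({c})" for k, c in sorted(counts.items())]
    return "\n".join(rows)


# Step 1: summary statistics for the whole data set.
print(summarize(r["value"] for r in records))

# Step 2: a bar chart of how many records fall in each group.
print(text_bar_chart(r["group"] for r in records))

# Step 4: repeat on a subset that looked interesting (here, group "b").
subset = [r["value"] for r in records if r["group"] == "b"]
print(summarize(subset))
```

Keeping the summary in one function makes it cheap to rerun on any subset, which is most of what step 4 amounts to.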

But this does not seem methodical or scientific enough.

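Are there guidelines or procedures I can follow that reveal things about the data I would not think to ask? How do I know when I have done an adequate analysis?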

3 Answers

There is an entire field of exploratory data analysis (EDA), and an excellent book on the subject called Exploratory Data Analysis, by John W. Tukey.

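I like that you are using graphs. Many other graphs can be useful, depending on your data: how many variables are there? What kind of variables are they (categorical, numeric, continuous, counts, ordinal)?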

One graph that is often useful for data with multiple variables is a scatterplot matrix.

You can look for various types of outliers, which are often the interesting points.
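
One common way to flag candidate outliers in a single numeric variable is Tukey's fences, which mark anything more than 1.5 interquartile ranges beyond the quartiles. The sketch below is only one of many possible rules. The function names, the sample values and the choice of the "inclusive" quartile method are assumptions made for the example.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (low, high) fences placed k IQRs outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one reading far from the rest.
sample = [10, 12, 11, 13, 12, 11, 95]
print(tukey_fences(sample))
print(flag_outliers(sample))
```

A flagged point is a prompt to look closer, not a verdict that the value is wrong.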

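But I don't think this whole process can really be made methodical and scientific; exploration is what happens before the methodical and scientific approaches are brought in. Here, I think the key aspect is playfulness.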

If you have chronological data, i.e. time series data, then there are "knowns", and the "unknowns" are waiting to be discovered. For example, if you have a sequence of data points for 10 periods such as 1,9,1,9,1,5,1,9,1,9, then based on this sample one can reasonably expect 1,9,1,9,... to arise in the future. What data analysis reveals is that there is an "unusual" reading at period 6, even though it is well within the ±3 sigma limits, suggesting that the data-generating function (DGF) did not hold there. Unmasking the inlier/outlier allows us to reveal things about the data. We also note that the mean value is not the expected value. This idea easily extends to detecting mean shifts and/or local time trends that may have been unknown before the data was analyzed (hypothesis generation). Now, it is quite possible that the next 10 readings are also 1,9,1,9,1,5,1,9,1,9, suggesting that the "5" is not necessarily untoward.

If we observe an error process from a suitable model that exhibits provable non-constant variance, we might be revealing one of the following states of nature:

  1. The parameters might have changed at a particular point in time.
  2. There may be a need for weighted analysis (GLS).
  3. There may be a need to transform the data via a power transform.
  4. There may be a need to actually model the variance of the errors.

If you have daily data, good analysis might reveal that there is a window of response (lead, contemporaneous and lag structure) around each holiday, reflecting consistent and predictable behavior. You might also be able to reveal that certain days of the month have a significant effect, or that Fridays before a Monday holiday have exceptional activity.
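
The ten-period example can be checked with a small sketch. It assumes a deliberately simple stand-in for "a suitable model": the expected value at each period is the median of the readings that share its position in a two-period cycle. The helper names and the choice of the median are illustrative assumptions, not part of the answer above.

```python
import statistics

readings = [1, 9, 1, 9, 1, 5, 1, 9, 1, 9]  # the ten periods from the text

# Classical check: is any reading outside mean +/- 3 sample standard deviations?
mean = statistics.mean(readings)
sigma = statistics.stdev(readings)
lower, upper = mean - 3 * sigma, mean + 3 * sigma
outside_limits = [
    (t, y) for t, y in enumerate(readings, start=1) if not lower <= y <= upper
]


def phase_medians(series, period=2):
    """Median reading at each position within a repeating cycle."""
    return [statistics.median(series[phase::period]) for phase in range(period)]


def residuals(series, period=2):
    """Observed value minus the cycle-based expected value, per period."""
    expected = phase_medians(series, period)
    return [y - expected[i % period] for i, y in enumerate(series)]


# Pattern-aware check: which periods depart from the alternating pattern?
# Comparing with zero only works for this noise-free toy series; real data
# would need a tolerance.
unusual = [(t, r) for t, r in enumerate(residuals(readings), start=1) if r != 0]

# Expected value for period 11, which falls on the first position of the cycle.
next_expected = phase_medians(readings)[len(readings) % 2]

print(mean, outside_limits, unusual, next_expected)
```

Under these assumptions no reading falls outside the ±3 sigma limits, period 6 is the only one with a nonzero residual, and the expected value for the next period differs from the overall mean, which matches the points made in the text.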

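Data mining can be divided into two categories. If you are interested in measuring the effect of a dataset or a set of variables on one particular variable, that is considered supervised learning. If you are exploring in depth with no target variable, you are doing unsupervised learning.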

Plotting and statistical analysis of the data (understanding the distributions and building intuition) is the first step.