Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
208 changes: 208 additions & 0 deletions lessons/StatisticalConsiderations/L2_Early_Stopping-zh.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,208 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 提前结束实验\n",
"\n",
"如果你在数据收集完之前窥视实验结果,并且因为检验展现出统计显著性而提前结束实验,那么 I 型错误率可能会显著上升,即认为实验已产生效果,但实际上没有。在此 notebook 中,你将重复视频中提出的断言:实验基于一个传统统计学检验,在实验运行到一半时窥视一次将使 I 型错误率从 5% 上升到约 8.6%。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import scipy.stats as stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面的模拟函数使用伯努利/二项式成功模型作为结果指标,并以历史基准为衡量依据。也就是说,每个观测值都好比投掷硬币一次,成功概率为“p”。如果基准值“p”的成功次数不太正常,则声明统计显著性结果。我们将实验分成多个模块,每个模块完成后都检查实验状态。目标输出是在任何检验中都具有统计显著性的试验所占的比例,以及在每个模块之后都具有统计显著性的试验所占的比例。\n",
"\n",
"填写 `peeking_sim()` 函数主要包含三个步骤。\n",
"\n",
"1.模拟数据\n",
" - 计算每个模块的试验次数。为简单起见,假设每个模块的试验次数都一样:每个模块最后的试验次数可能比对应的函数参数要大。\n",
" - 使用每个模块观察到的成功次数创建一个数据矩阵:行数等于模拟次数,列数等于模块数量。调用 numpy 的[`random.binomial`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.binomial.html) 函数一次即可完成此操作\n",
" \n",
"2.计算每个峰值处的 z 分数\n",
" - 对于每行,使用 numpy 的 [`cumsum`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cumsum.html) 函数计算每个模块之后的累积成功次数。结果应该是维度与数据相同的矩阵,但是每列都累积地将到此点的行值相加。\n",
" - 在每个模块之后计算成功次数的预期均值和标准差。注意,分布将基于[二项分布](https://en.wikipedia.org/wiki/Binomial_distribution),并且以原始计数居中,而不是成功次数的比例。为了便于计数,有必要根据每个模块之后的试验次数累积之和创建一个向量。\n",
" - 使用累积计数、预期计数和标准差计算每个峰值的z 分数\n",
" \n",
"3.汇总检验结果\n",
" - 使用假设的 I 型错误率计算临界 z 值。根据此临界值标记哪些 z 分数具有统计显著性,哪些没有。\n",
" - 在任何检验中都具有显著性的试验所占的比例将等于至少具有一个标记值的行所占的比例。在每个模块都具有显著性的试验所占的比例将等于每列被标记值的平均数量,它是一个一维数组。将这两个值返回为函数的输出。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def peeking_sim(alpha = .05, p = .5, n_trials = 1000, n_blocks = 2, n_sims = 10000):\n",
" \"\"\"\n",
" This function simulates the rate of Type I errors made if an early\n",
" stopping decision is made based on a significant result when peeking ahead.\n",
" \n",
" Input parameters:\n",
" alpha: Supposed Type I error rate\n",
" p: Probability of individual trial success\n",
" n_trials: Number of trials in a full experiment\n",
" n_blocks: Number of times data is looked at (including end)\n",
" n_sims: Number of simulated experiments run\n",
" \n",
" Return:\n",
" p_sig_any: Proportion of simulations significant at any check point, \n",
" p_sig_each: Proportion of simulations significant at each check point\n",
" \"\"\"\n",
" \n",
" # generate data\n",
" trials_per_block = \n",
" data = \n",
" \n",
" # standardize data\n",
" data_cumsum = \n",
" block_sizes = \n",
" block_means = \n",
" block_sds = \n",
" data_zscores =\n",
" \n",
" # test outcomes\n",
" z_crit = \n",
" sig_flags = \n",
" p_sig_any = \n",
" p_sig_each = \n",
" \n",
" return (p_sig_any, p_sig_each)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"根据提供的默认参数运行函数应该返回一个结果元组,其中在两个模块中任何具有统计显著性的检验结果的概率应该约为 8.6%,在每个模块检查点具有统计显著性的检验结果概率约为 5%。增加试验次数和模拟次数可以获得更准确的估算值。并且增加峰值后,总体 I 型错误率应该会增加。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"peeking_sim()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 提前窥视多次比较法\n",
"\n",
"避免多次检查并作出糟糕的提前结束决策的最保险方法是不要这么做。规划好实验并检查了所有分配流程后,应该让实验一直运行完毕,并在最终结束时评估结果。这并不是说不能提前结束,但是需要额外的规划工作。\n",
"\n",
"解决多次窥视的一种方式是调整单个检验的显著水平,使总体错误率保持理想水平。但是采用之前演示的帮费罗尼或 Šidák 校正法肯定过于保守了,因为我们知道峰值之间的检验结果存在联系。如果我们在中间位置看到某些模拟检验的 z 分数高于阈值,那么与在中间点不具统计显著性的其他模拟检验相比,这些检验在结束时更有可能高于该阈值。获得更好的显著性阈值的一种方式是采用模拟法。在执行了上述第 1 步和第 2 步后,我们希望得出使期望的模拟检验所占比例具有统计显著性的显著水平:\n",
"\n",
"1. 模拟数据(同上)\n",
"2. 在每个峰值计算 z 分数(同上)\n",
"3. 达到所需的单个检验错误率\n",
" - 如果在任何峰值都超过临界值,则一次实验被视为具有统计显著性。从每行获取最大 z 分数,作为零假设被错误拒绝的最坏情形。\n",
" - 算出会拒绝期望的总体 I 型错误率的 z 分数阈值。\n",
" - 将 z 分数转换为对等的单个检验错误率。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def peeking_correction(alpha = .05, p = .5, n_trials = 1000, n_blocks = 2, n_sims = 10000):\n",
" \"\"\"\n",
" This function uses simulations to estimate the individual error rate necessary\n",
" to limit the Type I error rate, if an early stopping decision is made based on\n",
" a significant result when peeking ahead.\n",
" \n",
" Input parameters:\n",
" alpha: Desired overall Type I error rate\n",
" p: Probability of individual trial success\n",
" n_trials: Number of trials in a full experiment\n",
" n_blocks: Number of times data is looked at (including end)\n",
" n_sims: Number of simulated experiments run\n",
" \n",
" Return:\n",
" alpha_ind: Individual error rate required to achieve overall error rate\n",
" \"\"\"\n",
" \n",
" # generate data\n",
" # You can copy the code from the previous function.\n",
" \n",
" \n",
" # standardize data\n",
" # You can copy the code from the previous function.\n",
" \n",
" \n",
" # find necessary individual error rate\n",
" max_zscores = \n",
" z_crit_ind = \n",
" alpha_ind = \n",
" \n",
" return alpha_ind"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用默认参数运行函数应该会达到所需的单个错误率 (约 0.029)。注意,该错误率比帮费罗尼校正和 Šidák 校正生成的错误率 .025 或 .0253 高一些。通过更多的模拟和试验获得更准确的估算值,并尝试不同数量的模块,看看会如何更改所需的单个错误率。结果应该与[这篇文章](https://www.evanmiller.org/how-not-to-run-an-ab-test.html)中间的表格中列出的数字大致相同;注意,窥视 $n$ 次表示将实验拆分为 $n + 1$ 个模块。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"peeking_correction()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Loading