Chapter 8 makes an interesting point about observer bias on the Red Line, but it took me a while to understand why the distribution of passengers' observed wait times sits to the right of the distribution of true wait times. After some thought, it turns out I was assuming a more complicated model than the text does. I don't think either model is unreasonable; my intuition just wasn't on the same page, and I didn't find an explicit reason in the text to rule out my model. The correct model might be obvious to most, but perhaps the clarification below will help someone in the future:
The text reads:
The average time between trains, as seen by a random passenger, is substantially higher than the true average.
Why? Because a passenger is more like (sic) to arrive during a large interval than a small one. Consider a simple example: suppose that the time between trains is either 5 minutes or 10 minutes with equal probability. In that case the average time between trains is 7.5 minutes.
But a passenger is more likely to arrive during a 10 minute gap than a 5 minute gap; in fact, twice as likely. If we surveyed arriving passengers, we would find that 2/3 of them arrived during a 10 minute gap, and only 1/3 during a 5 minute gap. So the average time between trains, as seen by an arriving passenger, is 8.33 minutes.
For this to be true, I believe we have to assume that a passenger arriving 0 minutes after the previous train has the same observed waiting time as a passenger arriving any arbitrary n > 0 minutes after it. In other words, a passenger who just missed the previous train and waited the full gap is treated the same as a passenger who just barely made it to the next train.
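The numbers in the quoted passage (1/3, 2/3 and 8.33) can be reproduced by weighting each gap by its length. The snippet below is only a minimal sketch of that arithmetic; the helper name `biased_mean_gap` is mine and does not come from the book.

```python
from fractions import Fraction


def biased_mean_gap(gaps, probs):
    """Return the length-biased gap probabilities and their mean.

    Hypothetical helper: a gap of length g is seen with probability
    proportional to p * g, where p is the true probability of that gap.
    """
    weights = [p * g for g, p in zip(gaps, probs)]
    total = sum(weights)
    biased_probs = [w / total for w in weights]
    mean = sum(g * bp for g, bp in zip(gaps, biased_probs))
    return biased_probs, mean


gaps = [5, 10]                            # Minutes between trains.
probs = [Fraction(1, 2), Fraction(1, 2)]  # Equal true probability.

true_mean = sum(g * p for g, p in zip(gaps, probs))
biased_probs, observed_mean = biased_mean_gap(gaps, probs)

print(float(true_mean))      # 7.5
print(biased_probs)          # [Fraction(1, 3), Fraction(2, 3)]
print(float(observed_mean))  # 8.333...
```

Exact fractions are used so the 1/3 and 2/3 split shows up without rounding noise.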
My intuition was as follows: in reality, a passenger can arrive at the 9th minute of a 10 minute gap or the 4th minute of a 5 minute gap. Both passengers wait 1 minute. If you model it this way, the biased distribution actually shifts to the left. Why? Let's say there are two passengers arriving per minute (lam = 2). For a 2 minute gap, you might have the following wait times for 4 passengers: [0, 0, 1, 1]. For a 3 minute gap, you might have the following wait times for 6 passengers: [0, 0, 1, 1, 2, 2]. A passenger who waits 0 has arrived just before the train departs. For an n minute gap, a wait time of n-1 indicates the passenger arrived within the first minute after the previous train departed. From the 2-minute and 3-minute gaps above, you can deduce that across all trains P(wait n) <= P(wait n-1), with strict inequality whenever some gap is too short to allow a wait of n. That is, there is always a chance for a passenger to wait 0 minutes, but for a 5 minute gap, for example, it's impossible to wait 6 minutes.
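To make the counting argument concrete, the two example gaps can be pooled and tallied. This is just a sketch of the discretized model described above; `wait_times` is a hypothetical helper, and it assumes exactly `lam` passengers arrive in each whole minute of the gap.

```python
from collections import Counter


def wait_times(gap_minutes, lam=2):
    """Whole-minute waits when lam passengers arrive in each minute of a gap."""
    return [w for w in range(gap_minutes) for _ in range(lam)]


# Pool the passengers from one 2-minute gap and one 3-minute gap.
pooled = wait_times(2) + wait_times(3)
counts = Counter(pooled)

print(wait_times(2))           # [0, 0, 1, 1]
print(wait_times(3))           # [0, 0, 1, 1, 2, 2]
print(sorted(counts.items()))  # [(0, 4), (1, 4), (2, 2)]
```

The counts never increase with the wait time: every gap contributes passengers who wait 0, but only the longer gap contributes passengers who wait 2.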
Here is some code to simulate the process and plot the resulting histograms.
from math import floor

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
n = 50000  # Number of trains.
lam = 2  # Passengers arriving per minute.
T = np.random.normal(10, 2, n)  # True time between trains (minutes).
W1 = []  # Passengers' observed waiting time (my initial formulation).
W2 = []  # Passengers' observed waiting time (Think Bayes formulation).
for t in T:
    size = int(floor(t * lam))  # This many passengers will end up on the next train.
    # My formulation: each passenger waits a uniform time within the gap.
    W1 += list(np.random.uniform(0, floor(t), size))
    # Think Bayes formulation: each passenger records the full gap.
    W2 += list(np.ones(size) * t)
bins = int(T.max() - T.min())
# density=True replaces the older normed=True, which newer Matplotlib releases no longer accept.
# Raw strings keep the \mu in the labels from being read as an escape sequence.
plt.hist(T, color='red', bins=bins, alpha=0.3, density=True, label=r'True wait $\mu=%.3f$' % T.mean())
plt.hist(W1, color='blue', bins=bins, alpha=0.3, density=True, label=r'Observed wait $\mu=%.3f$' % np.mean(W1))
plt.hist(W2, color='green', bins=bins, alpha=0.3, density=True, label=r'Observed wait simplified $\mu=%.3f$' % np.mean(W2))
plt.legend(fontsize=8)
plt.show()
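The means in the legend can also be checked without building the per-passenger lists. The following is a rough sketch under the same assumptions as the simulation (gaps drawn from Normal(10, 2), two passengers per minute); the names `rng`, `mean_w2` and `mean_w1_expected` are mine, and the seed is arbitrary. For the full-gap formulation the mean should land close to E[T^2]/E[T] = 104/10 = 10.4, and for the uniform-arrival formulation close to half of that.

```python
import numpy as np

rng = np.random.default_rng(0)  # Arbitrary seed for this sketch.
lam = 2                         # Passengers arriving per minute.
T = rng.normal(10, 2, 50000)    # True gaps between trains (minutes).

# Passengers boarding after each gap, matching the loop version.
size = np.floor(T * lam)

# Full-gap formulation: every passenger records the whole gap t,
# so the mean is the passenger-weighted average of the gaps.
mean_w2 = np.sum(size * T) / np.sum(size)

# Uniform-arrival formulation: waits are uniform on [0, floor(t)), so the
# expected wait per gap is floor(t) / 2. This is the expected mean, not a
# sampled one.
mean_w1_expected = np.sum(size * np.floor(T) / 2) / np.sum(size)

print(T.mean(), mean_w2, mean_w1_expected)
```

Weighting each gap by its passenger count is what produces the bias: longer gaps carry more passengers, so they count for more in the average.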
