The Gap Statistic - Ethan Young

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value of k that maximizes the gap statistic.

**Implementation Steps:**

- Compute the log of the sum of squared distances between data points and their cluster centroids for a range of k values.
- Generate uniform random data (reference datasets) and compute the log of the sum of squared distances for the same k values, averaging over the reference datasets.
- The gap statistic is the difference between the log sum of squared distances for the uniform data and for the observed data. The optimal k is the value that maximizes this gap.
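The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. The helper names `within_cluster_ss` and `gap_statistic`, the farthest-first initialisation, the bounding-box reference distribution, and the defaults (`n_refs=5`, `n_iter=25`) are all assumptions made for this example. In practice a library k-means would normally replace the hand-rolled one.

```python
import math
import random


def within_cluster_ss(points, k, rng, n_iter=25):
    """Within-cluster sum of squared distances from a basic Lloyd's k-means."""
    # Assumption: farthest-first initialisation. Start from a random point, then
    # repeatedly add the point farthest from all centroids chosen so far.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


def gap_statistic(points, k_values, n_refs=5, seed=0):
    """Return {k: gap(k)}, where gap(k) = mean(log W_ref(k)) - log W_obs(k)."""
    rng = random.Random(seed)
    lows = [min(axis) for axis in zip(*points)]
    highs = [max(axis) for axis in zip(*points)]
    # Reference datasets: uniform samples over the bounding box of the data,
    # generated once and reused for every k.
    refs = [
        [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
        for _ in range(n_refs)
    ]
    gaps = {}
    for k in k_values:
        log_w_obs = math.log(within_cluster_ss(points, k, rng))
        log_w_ref = [math.log(within_cluster_ss(ref, k, rng)) for ref in refs]
        gaps[k] = sum(log_w_ref) / n_refs - log_w_obs
    return gaps


# Hypothetical demo data: three well-separated 2-D blobs of 20 points each.
data_rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in centers
    for _ in range(20)
]

gaps = gap_statistic(points, range(1, 6))
best_k = max(gaps, key=gaps.get)  # the k with the largest gap
print(gaps, best_k)
```

On well-separated data like this, the gap at k = 3 should clearly exceed the gaps at k = 1 and k = 2. The differences between larger values of k are smaller and depend on the random reference samples. Reusing the same reference datasets for every k keeps the comparison between neighbouring k values less noisy.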