Improved Sampling

Use improved sampling techniques to determine appropriate sample sizes for association rule generation with performance guarantees.

Association rules (AR) can use a good sample size with performance guarantee, based on the work of Riondato and Upfal.The AR algorithm computes the sample size by the following inputs:

  • d-index of the dataset

  • Absolute error ε

  • Confidence level γ

d-index is defined as the maximum integer d such that the dataset contains at least d transactions of length d at the minimum. It is the upper bound of Vapnik-Chervonenkis (VC) dimension. The AR algorithm computes d-index of the dataset by scanning the length of all transactions in the dataset.

Users specify absolute error ε and confidence level γ parameters. A large d-index, small AR support, small ε or large γ can cause a large sample size. The sample size theoretically guarantees that the absolute error of both the support and confidence of the approximated AR (from sampling) is less than ε compared to the exact AR with probability (or confidence level) at least γ. In this document this sample size is called AR-specific sample size.