Sunday, October 31, 2010

Statistical Analysis EC2 Spot Pricing Evolution

First of all I'm sorry for the lack of updates the last week (my sickness had something to do with it). But now I'm fully focusing on my thesis again.

The different scenarios of dividing the spot prices I'm having a look at are:
  • Average per day for the last 2 months
  • Average per week
  • Average per day of the week
  • Average per hour of the day
  • Average for the day vs the night
After generating the required graphs (also blox pots cfr what is it) for these scenarios (all output by the way can be downloaded here), I started having a look at how this data (or statistics extracted from this data) could be used to perform intelligent capacity planning.

The statistical values I store for every scenario (and this for each os-region-instance type combination possible) are:
  • Kurtosis: "it is a measure of the 'peakedness' of the probability distribution of a real-valued random variable. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations."
  • Skewness: "it is a measure of the asymmetry of the probability distribution of a real-valued random variable. The skewness value can be positive or negative, or even undefined. Qualitatively, a negative skew indicates that the tail on the left side of the probability density function is longer than the right side and the bulk of the values (including the median) lie to the right of the mean. A positive skew indicates that the tail on the right side is longer than the left side and the bulk of the values lie to the left of the mean. A zero value indicates that the values are relatively evenly distributed on both sides of the mean, typically but not necessarily implying a symmetric distribution."
  • Arithmetic mean: "it is often referred to as simply the mean or average when the context is clear, is a method to derive the central tendency of a sample space."
  • Geometric mean: "it indicates the central tendency or typical value of a set of numbers. It is similar to the arithmetic mean, which is what most people think of with the word "average", except that the numbers are multiplied and then the nth root (where n is the count of numbers in the set) of the resulting product is taken."
  • Number of Values: "it is an indication of the number of price changes."
  • Maximum: "it is the greatest value in the set."
  • Minimum: "it is the smallest value in the set."
  • Percentile: "it is the value of a variable below which a certain percent of observations fall. The 25th percentile is also known as the first quartile (Q1); the 50th percentile as the median or second quartile (Q2); the 75th percentile as the third quartile (Q3)."
  • Variance: "it is used as one of several descriptors of a distribution. It describes how far values lie from the mean."
  • Standard deviation: "it is a widely used measurement of variability or diversity used in statistics and probability theory. It shows how much variation or 'dispersion' there is from the 'average' (mean, or expected/budgeted value). A low standard deviation indicates that the data points tend to be very close to the mean, whereas high standard deviation indicates that the data is spread out over a large range of values."
I'll explain all my findings in the document I'm working on that summarizes the research I've done and tells how I think to extract information from it that can be helpful to make an intelligent capacity planner. The project that I used to generate the graphs and statistical ouput is called 'SpotPricing' and can be found on my SVN location or here.

No comments:

Post a Comment