Wednesday, December 8, 2010

OVMP Analysis

Today I had another look at the "Optimal Virtual Machine Placement across Multiple Cloud Providers" paper and the accompanying GAMS software. This approach to finding an optimal solution (lowest cost) for scheduling tasks in a cloud environment has some advantages:
  • Finding a solution using a solver is rather quick.
  • This approach gives an optimal solution, while my heuristic gives a suboptimal solution (for workload model 2). The question, however, is whether the model described here is realistic.
  • In this paper they manage to determine a solution under uncertainty of demand and prices: the workload demand at a certain moment is modeled probabilistically, and the on-demand price can change during the simulation with a certain probability.
Because of these advantages it seemed like a good idea to have another look at it, and to determine whether their description of the situation is realistic and whether the model can be made more realistic by extending certain variables.

So, I read the paper again and tried to fully understand the GAMS source code. An overview of this analysis follows. The source code of the file we'll be discussing can be downloaded here, and I found a webpage from which this script can be submitted as a job to the NEOS server here. The first remarkable thing is that only 3 types of tasks/VM classes were used in the simulation, with the following resource demands.
To make this more realistic there should be more possible task types, with a probability distribution specifying which tasks occur more or less often (to get a real-life load model).

Furthermore, 4 instance types were offered (from different cloud providers), with their characteristics specified in the following table. Since we would use this to schedule tasks and determine an optimal division between reserved and on-demand instances, we probably only need one instance type here (the one on which we'll schedule the given tasks).

The resource provisioning happens in three phases:
  • Reserving a number of resources.
  • Determining the actual usage of these resources.
  • Determining the on-demand resources (all resources that are still needed after the reserved resources have been provisioned).
Each of these 3 phases has an associated cost, shown in the next table. Note that the reservation and usage cost together will generally still be smaller than the on-demand cost for the same resource.
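To make these three cost components concrete, here is a minimal sketch of how the total cost of one provisioning plan would be composed. The rates and counts below are made up for illustration; they are not the values from the paper's tables.

    # Hypothetical illustration of the three provisioning cost components
    # (the rates and counts below are made up, not the paper's values).

    def total_cost(reserved, used_hours, on_demand_hours,
                   reservation_fee, usage_rate, on_demand_rate):
        """Reservation phase + utilization phase + on-demand phase."""
        reservation_cost = reserved * reservation_fee        # paid up front per reserved VM
        usage_cost       = used_hours * usage_rate           # paid for actually using reserved VMs
        on_demand_cost   = on_demand_hours * on_demand_rate  # remaining demand at the higher rate
        return reservation_cost + usage_cost + on_demand_cost

    # Example: 2 reserved VMs, 1500 hours of actual use, 300 hours left for on-demand.
    print(total_cost(reserved=2, used_hours=1500, on_demand_hours=300,
                     reservation_fee=325.0, usage_rate=0.12, on_demand_rate=0.34))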

To simplify the model, we could eliminate some of these costs (and for example only keep the CPU-time cost, since that is the one that really matters in my research). This would make finding the solution quicker and possibly allow us to extend the model to many different tasks and a longer period of time. Note that the period over which the VMs are currently scheduled is at most 1200 hours, which equals 50 days. It would become interesting if this could be extended to a year (which for Amazon equals 8736 hours). It may seem strange that different physical machines with the same characteristics have different prices (for example the reserved price of machine 1 vs. machine 2), but this is possible because in this model they represent instances from different cloud providers.

The uncertainty about the price can be found in the parameters that depend on the set M: with a probability of 70 percent the on-demand price stays the same, and with a probability of 30 percent it increases. It does not increase by a little either; it is multiplied by 3. The following image illustrates this. This does not seem realistic: as you can see in my previous blog post, for EC2 this chance is a lot smaller and the price actually decreases. Here the increase is used as a penalty, to represent the change in the price difference between on-demand and reserved resources. We noticed, however, that both prices tend to change by about the same amount and that the changes take effect immediately for both kinds of resources. So taking this penalty into account is probably not realistic; it would be better to account for a price decrease for both on-demand and reserved instances.
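As a quick sanity check of what this uncertainty model implies, the following small sketch computes the expected on-demand price under the 70/30 rule and samples one scenario. The base rate is illustrative, not the value from the GAMS file.

    import random

    # Toy illustration of the price uncertainty described above: with probability
    # 0.7 the on-demand price stays the same, with probability 0.3 it is tripled.
    # (Illustrative numbers only, not the values from the GAMS file.)

    def sampled_price(base_price, p_increase=0.3, factor=3.0):
        return base_price * factor if random.random() < p_increase else base_price

    base = 0.34  # hypothetical on-demand rate in $/hour
    expected = 0.7 * base + 0.3 * base * 3.0
    print("expected on-demand price:", expected)   # 1.6 * base, a heavy penalty
    print("one sampled scenario:", sampled_price(base))

In expectation the on-demand price is thus multiplied by 1.6, which shows how strong this penalty really is.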

In this document, the different elements that make up the formulas for the objective (total cost) and for the constraints are explained.

Finally, I tried to adapt the GAMS source file so that it optimizes (finds the solution with minimal cost) the problem in which tasks, distributed by the given function, are scheduled over a number of instances with the same characteristics, such that an instance only becomes reserved when enough of its hours are used during the time period of one year. Prices can change in this model as well (with a different probability for on-demand than for reserved resources to decrease their price), but only the price of CPU-time is taken into account. I am, however, still having trouble expressing the constraint that an instance should only be reserved when it is loaded enough. The current GAMS file can be downloaded here.
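The constraint I am after is essentially a break-even condition. A rough sketch of the idea (in Python rather than GAMS, with hypothetical prices) would look like this:

    # Rough sketch of the "only reserve when loaded enough" condition
    # (hypothetical prices; the actual constraint still has to be expressed in GAMS).

    def worth_reserving(hours_used, one_time_fee, reserved_rate, on_demand_rate):
        """Reserve only if the fee plus the discounted hourly rate beats
        paying the on-demand rate for the same number of hours."""
        reserved_cost  = one_time_fee + hours_used * reserved_rate
        on_demand_cost = hours_used * on_demand_rate
        return reserved_cost <= on_demand_cost

    # Break-even number of hours: fee / (on_demand_rate - reserved_rate)
    fee, r_rate, od_rate = 227.50, 0.03, 0.085   # illustrative numbers
    print(fee / (od_rate - r_rate))              # ~4136 hours
    print(worth_reserving(5000, fee, r_rate, od_rate))   # True

Expressing exactly this threshold as a condition on the reservation decision variable is what I still need to get right in the GAMS model.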

A good guide to using GAMS can be found here by the way.

Sunday, December 5, 2010

EC2 On-Demand/Reserved Pricing History

Amazon EC2 instance prices are not fixed: Amazon has the ability to change these prices at any given moment in time. It should not come as a surprise that prices have decreased over the last couple of years. This is a rather natural phenomenon ... the instances that existed back then still have the same specifications, but their hardware costs have decreased dramatically, so Amazon is able to stay competitive by lowering its prices. When making a division between the different instance types, this possibility should be taken into account. Since one would pay the reserved instance rate for the rest of the period the instance was rented, the predicted price change could be taken into account as a discount on the on-demand prices. To be able to draw such conclusions one should look at the price decreases that happened in the past, so I'll give a little history overview of the EC2 instances and their pricing.
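To give an idea of what treating a predicted price change as a discount could look like, here is a tiny example with made-up numbers; the probability and the size of the reduction are assumptions, not measurements.

    # Hypothetical example of treating an expected price decrease as a discount
    # on the on-demand rate (probabilities and rates are made up).
    on_demand_rate = 0.085          # listed $/hour
    p_decrease     = 0.10           # assumed yearly chance of a reduction
    reduction      = 0.15           # assumed size of that reduction
    effective_rate = on_demand_rate * (1 - p_decrease * reduction)
    print(effective_rate)           # 0.085 * 0.985 = 0.083725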

The history of EC2 and its instances is given in a chronological order, when prices did change they will be presented as well:
  • [24/08/2006] Amazon Elastic Compute Cloud beta is announced. There was only one instance type (1.7GHz Xeon processor/1.74 GB of RAM) available for $0.10 per hour.

  • [22/10/2007] EC2 in unlimited beta and new instance types are announced.

  • [29/05/2008] High-CPU instances announced.

  • [23/10/2008] EC2 exits beta and offers SLA with commitment of 99.95% availability within a region.
  • [23/10/2008] EC2 instances running Windows Server available.

  • [10/12/2008] the European Region is added.
  • [08/01/2009] the AWS management console is launched.
  • [03/03/2009] Windows instances announced for the EU region.
  • [12/03/2009] EC2 introduces Reserved Instances.

  • [15/04/2009] Reserved instances now available in Europe.
  • [20/08/2009] prices of Reserved instances are decreased.

  • [30/09/2009] distinction between instances running Windows and instances running Windows with Authentication Services is removed (the price of the general Windows instances is put in place).
  • [27/10/2009] new High-Memory instances announced.

  • [27/10/2009] all other on-demand prices are lowered by up to 15%.

  • [03/12/2009] Northern California region launched.
  • [14/12/2009] Amazon EC2 Spot instances announced.
  • [23/02/2010] Extra Large High Memory instances get introduced.

  • [23/02/2010] EC2 instances with Windows now available.

  • [29/04/2010] Asia Pacific (Singapore) Region announced.
  • [13/07/2010] Cluster Compute instances announced.

  • [01/09/2010] Lower prices for High Memory Double and Quadruple XL instances.

  • [09/09/2010] Micro instances are announced.

  • [21/10/2010] AWS Free Usage Tier introduced.
  • [15/11/2010] Cluster GPU instances announced.

  • [03/12/2010] Free monitoring of CPU, disk and network performance metrics through CloudWatch is introduced.
Your hourly rate for reserved instances is changed as well when prices are changed, which can be seen in the official press release of the price reduction that happened in September:
Effective September 1, 2010, we've reduced the On-Demand and Reserved Instance prices on the m2.2xlarge (High-Memory Double Extra Large) and the m2.4xlarge (High-Memory Quadruple Extra Large) by up to 19% and the Reserved Instances by up to 17%. If you have existing Reserved Instances your hourly usage rate will automatically be lowered to the new usage rate and your estimated bill will reflect these changes later this month.
You did, however, still pay the full one-time fee, so these possible changes could still be taken into account. We also notice that not that many price reductions have happened in EC2's history; most new prices were set when a new instance type was introduced. The only real general price reduction happened in October 2009, when all on-demand prices were lowered by up to 15%. But prices will probably have to change more often in the future, as market pressure rises.

Saturday, December 4, 2010

White Paper: Analytics for Internal Cloud Management

This paper describes, at a high level, the resource provisioning model that will be needed in a cloud computing environment. Some snippets from the paper are quoted below.
For many organizations, cloud is more of a business change than a technical change, as it alters the way departments interact and the way costs are allocated. Cloud computing enables IT departments to disintermediate themselves from the day-to-day process of providing access to applications, software platforms and IT infrastructure. Instead it allows them to focus on aligning supply and demand, and efficiently provisioning infrastructure in a way that bridges the gap between capex-oriented procurement and opex-oriented consumption. From an IT consumer perspective, it puts control back in their hands by allowing them to respond to their business needs more quickly, while isolating them from the arcane business of buying, installing and managing IT infrastructure.
The addition of specific automation, analytics and cost allocation models is key to making a cloud a cloud, as it enables the self-service model while at the same time assuring infrastructure is managed as efficiently as possible.
Rather than using old-school capacity management approaches that focus on trend and threshold models, what is required is a new focus on workload placement and resource allocations. In this new model, workload placements dictate the optimal use of capacity, much like in the game of Tetris™.
In such a cloud computing environment it is important to perform strategic optimization: a proactive, long-term placement of resources based on supply and demand. This form of resource allocation comes with two possible pitfalls:
  • Over-provisioning, which leads to resources with a low level of utilization, and thus to wasted money.
  • Under-provisioning, which leads to operational risks: on some occasions the capacity cannot meet the demand.
This is illustrated by the following graph:


The corresponding white paper can be downloaded here and from the original site.

Scheduling of Tasks

A distinction is made in the EC2 Broker application between two workload models:
  • In the first workload model a task has a deadline and a certain amount of VM hours needed to execute it; these hours can be divided over time in whatever way best fits the scheduler (more freedom to determine the number of VMs used in a given hour).

  • The second model has tasks specified with a deadline and a certain length, but for every hour it is also specified how many VMs are needed to execute the task. When a task is generated using the application's task generator, a file specifying the workload of the model is generated as well. The number of EC2 compute units associated with a certain instance type is used to determine the number of VMs needed: the workload generator takes an upper and a lower boundary and lets the workload fluctuate between these two according to the compute units of the corresponding instance type. A snippet from a CSV file containing such a load description is given here; a minimal sketch of such a generator follows below.
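The sketch below shows what such a workload generator could look like. The function and column names are hypothetical and not taken from the EC2 Broker code; it simply writes a fluctuating per-hour VM count between two boundaries to a CSV file.

    import csv
    import random

    def generate_workload(hours, lower, upper, filename="workload.csv", seed=1):
        """Toy workload generator: for every hour, draw a number of VMs between
        the given lower and upper boundary and write it to a CSV file.
        (Hypothetical sketch, not the actual EC2 Broker generator.)"""
        random.seed(seed)
        with open(filename, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["hour", "vms_needed"])
            for hour in range(hours):
                writer.writerow([hour, random.randint(lower, upper)])

    generate_workload(hours=24, lower=2, upper=8)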


Workload Model 1 (total VM hours specified)
To determine the optimal division between reserved and on-demand instances it is important to schedule the tasks/loads in such a way that the VMs are in use as much of the time as possible. Scheduling the tasks was rather easy for this first type of workload, so I'll have a look at this model first: a task consists of a certain amount of VM hours that need to be executed before a deadline. When no scheduling is performed, the following would be an example of a possible result, with a corresponding total price of 619.92 US dollars.




When scheduling is performed we obtain a result with a total price of 516 US dollars. Here all 4 instances used are reserved instances, whereas without the intelligent scheduler there were 3 on-demand instances as well. For this example the scheduling algorithm resulted in a 16.76 percent decrease of the total cost.




To perform the scheduling for this workload model, the tasks are first sorted in an earliest-deadline-first manner. Then, for every task, we start filling the VM hours of each instance from the earliest available VM hour up to the current task's deadline hour. If the existing instances are filled up to the deadline and there are still VM hours to distribute, the needed extra instances are added. This actually yields an optimal scheduling solution, so that the right amount of reserved instances will be provisioned. How to determine the boundary between a reserved and an on-demand instance (from what amount of VM hour utilization it is cheaper to rent a reserved instance) was explained earlier on this blog.
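A compact sketch of this filling procedure (simplified, in Python, and independent of the actual EC2 Broker implementation) could look as follows:

    def schedule(tasks, horizon):
        """tasks: list of (deadline_hour, vm_hours). Returns a list of instances,
        each a boolean usage vector of length `horizon` (True = hour in use).
        Simplified sketch of the earliest-deadline-first filling described above."""
        instances = []
        for deadline, vm_hours in sorted(tasks):          # earliest deadline first
            remaining = vm_hours
            for usage in instances:                       # fill existing instances first
                for hour in range(deadline):
                    if remaining and not usage[hour]:
                        usage[hour] = True
                        remaining -= 1
            while remaining > 0:                          # add extra instances if needed
                usage = [False] * horizon
                for hour in range(min(deadline, remaining)):
                    usage[hour] = True
                remaining -= min(deadline, remaining)
                instances.append(usage)
        return instances

    # Example: three tasks (deadline, VM hours) over a 10-hour horizon.
    print(len(schedule([(4, 6), (8, 10), (10, 3)], horizon=10)))   # 2 instances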

Workload Model 2 (every hour #VMs specified)
Getting the scheduling in order for this second workload model was rather difficult. I'll again first present the results and then describe the scheduling algorithm itself. Using the basic scheduler (which means that all tasks simply start at their starting date), the following result is obtained, with a total cost of 37509.08 US dollars.




When scheduling is performed we obtain a result with a total price of 34531.56 US dollars, and we only need 27 instances instead of the 39 we needed before. In addition, 13 instances are now reserved ones, whereas before only 11 were. For this example the scheduling algorithm resulted in a 7.94 percent decrease of the total cost.




I first developed a way to optimally schedule these tasks as well, but afterwards I came to the conclusion that its time complexity was too high, so a suboptimal algorithm had to be put in place. I'll first describe the optimal solution that I tried initially. It consisted of calculating all the different divisions of a task's workload hours over the time available for that task; this is equal to choosing a combination of x out of y, with y the total number of time slots available and x the number of time slots used by the task. Because not all these combinations can be stored in memory, a combination was only generated at the moment it was used. This was made possible by a combination generator class that was initialized with x and y and could be used to iterate over all the possible combinations. By then determining the optimal combination of task schedules for the different tasks, we could determine the optimal solution. Optimal is here interpreted as the solution that minimizes the number of instances needed. This already makes the solution suboptimal, since this requirement will yield an even distribution in most cases, which is not necessarily the cheapest case. I also had a look at optimizing the number of reserved instances needed, which would yield the real optimal solution; I'll explain later why this wasn't workable either.

To make the algorithm feasible again I decided to use the above algorithm in smaller intervals ... which means dividing the time into intervals and dividing the workload evenly between them. The size of such an interval is determined by looking at how many combinations would be possible within the interval and comparing this with a threshold value that has to be set. The deadlines of the tasks also become mandatory interval borders. In each interval the algorithm described above is used to determine a suboptimal solution. This is also why optimizing on the number of reserved instances wasn't workable: within a single interval it is impossible to determine how many reserved instances would be needed in total. I did try a scheduler which extrapolated the number of hours an instance was needed in the interval to the bigger time frame, but this didn't seem to give better results than the scheduler that minimizes the number of instances needed in each interval. Next I tried to minimize the standard deviation of the number of instances needed each hour in the interval, and this seemed to yield a better result for the example.
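The lazy generation of combinations can be illustrated with a small sketch (in Python and purely illustrative, not the combination generator class from the EC2 Broker):

    from itertools import combinations
    from math import comb

    # Toy illustration of iterating lazily over all possible placements of a task's
    # VM hours in its available time slots (x used slots out of y available slots),
    # without materializing the full list in memory.
    def placements(used_slots, available_slots):
        """Yield every way of choosing `used_slots` hours out of `available_slots`."""
        yield from combinations(range(available_slots), used_slots)

    x, y = 3, 6
    print(comb(y, x), "possible placements")      # 20
    for placement in placements(x, y):
        pass                                      # each placement is generated on demand

The number of such placements (y choose x) is also what gets compared against the threshold when choosing the interval size in the suboptimal scheduler.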



There are a lot of things that could still be tried to make this scheduler better; maybe some updates will be posted in the coming weeks/months. By the way, a new version of the EC2 Broker can be found in the SVN repository.