Synchronous, Async and Parallel Programming Performance in Windows Azure

This post discusses the performance benefits of effectively using .NET TPL  when doing I/O bound operations.

Intent

When there is a need for non-synchronous programming pattern (asynchronous and/or parallel) in Azure applications, the pattern of choice must be based on the target VM size we have chosen for that app and the type of operation particular part does.

Detail

.NET provides TPL (Task Parallel Library) to write non-synchronous programming much easier way. The asynchronous API enables to perform I/O bound and compute-bound asynchronous operations which lets the main thread to do the remaining operations without waiting for the asynchronous operations to complete. Refer http://snip.udooz.net/Hbmib2 for details. The parallel API enables to effectively utilizes the multicore processors on your machine to perform data intensive or task intensive operations. Refer http://snip.udooz.net/HTLrVv for details.

When writing azure applications, we may need to interact with many external resources like blob, queues, tables, etc. So, it is very obvious to think asynchronous or parallel programming patterns when the amount of I/O operations are higher. In these cases, we should be more cautious on selecting asynchronous and parallel. The extra-small instance provides shared CPU power, the small instance provides single core and medium or above provide multicore. Hence, asynchronous pattern would be the better option for extra-small and small instances. For problem those are highly parallel in nature, then the application should be placed on Medium or above instance with parallel pattern.

To confirm the above statement, I did a small proof of concept which has high I/O operation. The program interacts with Azure blob to get large number of blobs to get data to solve a problem. I’ve taken a small amount of Enron Email dataset from http://www.cs.cmu.edu/~enron/ which contains email messages for various Enron users on their respective Inbox folder as shown in figure 1 and figure 2.

The above figure shows the “inbox” for the user “benson-r”. Every users have approximately more than 200 email messages. A message contains the following content:

Message-ID: <21651803.1075842014433.JavaMail.evans@thyme>
Date: Tue, 5 Feb 2002 11:06:50 -0800 (PST)
From: robert.stalford@enron.com
To: jay.webb@enron.com
Subject: online power option change request
Cc: andy.zipper@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
======= OTHER HEADERS=======
Jay,
It was ..... ====== remaining message body ======

The program going to solve how many times particular user written email to this user. The email messages are resided in a blob container with appropriate blob directory. Hence, the pseudo code is some thing like:

[code lang="csharp"] for every user
get the blob sub-directory for the user from the blob container
create new dictionary // key - sender email ID, value - count

for every blob in the sub-directory
get blob content
parse the “From” value from the message
if the “From” value already exists on dictionary
increment the value by 1
else
add From field value as key and value as 1 into the dictionary
write the result
[/code]

I apply “sync, async and parallel” along with normal Task.StartNew and Task.StartNew + ContinueWith programming patterns on “fetching and parsing email messages” logic (more chatty I/O).

The Code

The normal procedural flow is shown below:

[code lang="csharp"] // rootContainer is CloudBlobDirectory represents "maildir" container
var mailerInbox = rootContainer.GetSubdirectory(mailerFolder + "/inbox");
foreach (var blob in mailerInbox.ListBlobs())
{
//don't see the subfolders if any
if (blob is CloudBlobDirectory) continue;
var email = mailerInbox.GetBlobReference(blob.Uri.ToString()).DownloadText();

//parsing From field
var match = Regex.Match(email, @"From\W*(\w[-.\w]*@[-a-z0-9]+(\.[-a-z0-9]+)*)");
if (match.Groups.Count > 0)
{
var key = match.Groups[1].Value;
//estimate is a Dictionary contains From email id and the count
if (estimate.ContainsKey(key))
estimate[key] = estimate[key]++;
else
estimate.Add(key, 1);
}
}
var sb = new StringBuilder();
foreach (var kv in estimate)
{
sb.AppendFormat("{0}: {1}\n", kv.Key, kv.Value);
}
//writing the result to a blob
var result = mailerInbox.GetBlobReference("result_normal_" + attempt);
result.UploadText(sb.ToString());
[/code]

The parallel version is shown below:

[code lang="csharp"] var mailerInbox = rootContainer.GetSubdirectory(mailerFolder + "/inbox");

Parallel.ForEach(mailerInbox.ListBlobs(), blob =>
{
if (!(blob is CloudBlobDirectory))
{
var email = mailerInbox.GetBlobReference(blob.Uri.ToString()).DownloadText();
var match = Regex.Match(email, @"From\W*(\w[-.\w]*@[-a-z0-9]+(\.[-a-z0-9]+)*)");
if (match.Groups.Count > 0)
{
var key = match.Groups[1].Value;

// used ConcurrentDictionary
cestimate.AddOrUpdate(key, 1, (k,v) => v++);
}
}
});

//the result writing part is here, similar to normal version
[/code]

The asynchronous version is:

[code lang="csharp"] var mailerInbox = rootContainer.GetSubdirectory(mailerFolder + "/inbox");

var tasks = new Queue();

foreach (var blob in mailerInbox.ListBlobs())
{
if (blob is CloudBlobDirectory) continue;

// blobStorage is a wrapper for Azure Blob storage REST API
var webRequest = blobStorage.GetWebRequest(blob.Uri.ToString());

tasks.Enqueue(Task.Factory.FromAsync(webRequest.BeginGetResponse,
webRequest.EndGetResponse, TaskCreationOptions.None)
.ContinueWith(t =>
{
var response = t.Result;
var stream = new StreamReader(response.GetResponseStream());
var emailMsg = stream.ReadToEnd();

stream.Close();
response.Close();

var match = regex.Match(emailMsg);
if (match.Groups.Count > 0)
{
var key = match.Groups[1].Value;
cestimate.AddOrUpdate(key, 1, (k, v) => v++);
}
}));
}

Task.WaitAll(tasks.ToArray());
[/code]

The major difference in the “fetching and parsing” part is, instead of managed API, I have used REST API with a wrapper so that I can access the Blob asynchronously. In addition the above, I have used normal TPL tasks in two different way.  In the first way, I just processed “fetching and parsing” stuff as shown below:

[code lang="csharp"] foreach (var blob in mailerInbox.ListBlobs())
{
if (blob is CloudBlobDirectory) continue;
string blobUri = blob.Uri.ToString();
tasks.Enqueue(Task.Factory.StartNew(() =>
{
var email = mailerInbox.GetBlobReference(blobUri).DownloadText();
var match = Regex.Match(email, @"From\W*(\w[-.\w]*@[-a-z0-9]+(\.[-a-z0-9]+)*)");
if (match.Groups.Count > 0)
{
var key = match.Groups[1].Value;
cestimate.AddOrUpdate(key, 1, (k, v) => v++);
}
}));
}

Task.WaitAll(tasks.ToArray());
[/code]

Another one way, I have used ContinueWith option with the Task as shown below:

[code lang="csharp"] foreach (var blob in mailerInbox.ListBlobs())
{
if (blob is CloudBlobDirectory) continue;
string blobUri = blob.Uri.ToString();
tasks.Enqueue(Task.Factory.StartNew(() =>
{
return mailerInbox.GetBlobReference(blobUri).DownloadText();
}).ContinueWith(t =>
{
var match = regex.Match(t.Result);
if (match.Groups.Count > 0)
{
var key = match.Groups[1].Value;
cestimate.AddOrUpdate(key, 1, (k, v) => v++);
}
}, TaskContinuationOptions.OnlyOnRanToCompletion));
}

Task.WaitAll(tasks.ToArray());
[/code]

Results

I’ve hosted the work role and storage account at “Southeast Asia”.  On every VM size, I’ve made 6 runs and removed the first time result.  I have given 12 concurrent connection in the ServicePointManager for all the testing.  I did not change this value in medium and large instances. All the results are in millisecond.

Extra Small

Normal Parallel Async Task Task & ContinueWith
Run 1 4326 1209 1004 1807 1671
Run 2 4773 1319 972 1399 1887
Run 3 4189 1027 1050 1590 1322
Run 4 4769 1299 964 1778 1728
Run 5 4416 1665 952 1313 1150

Small

Normal Parallel Async Task Task & ContinueWith
Run 1 4044 1319 687 2003 2045
Run 2 4116 1229 972 2070 1854
Run 3 4060 1468 981 1584 1501
Run 4 4375 1316 909 1208 1924
Run 5 4167 931 797 1272 1109

Medium

Normal Parallel Async Task Task & ContinueWith
Run 1 4086 1839 933 1326 1385
Run 2 4245 1204 751 1069 1064
Run 3 4193 1449 753 1176 1291
Run 4 4426 1076 619 1300 1395
Run 5 4145 811 674 888 951

Large

Normal Parallel Async Task Task & ContinueWith
Run 1 4124 1269 697 1159 1091
Run 2 4013 945 892 1028 1299
Run 3 4277 977 657 1228 1148
Run 4 4322 1270 840 820 1072
Run 5 4141 1154 729 1059 1151

Surprisingly, irrespective of the VM size, when an operation is I/O bound, asynchronous pattern outshines all the other approaches followed by Parallel.

Final Words

Hence, the “asynchronous” approach won the I/O bound operation (shown as a diagram also here).

 

Let me come up with one more test which covers on which area Parallel approach will shine.  In addition to these, when you have lesser I/O and want smooth multithreading, Task and Task + ContinueWith may help you.

What do you think? Share your thoughts!

I highly thank  Steve Marx and Nuno  for validating my approach and the results which are actually improved my overall testing strategy.

The source code is available at http://udooz.net/file-drive/doc_download/23-mailanalyzerasyncpoc.html

Mark this!
  • Digg
  • del.icio.us
  • Facebook
  • LinkedIn
  • Add to favorites
  • RSS
  • Twitter