Apr 6 2012
Synchronous, Async and Parallel Programming Performance in Windows Azure
This post discusses the performance benefits of using the .NET TPL effectively for I/O-bound operations.
Intent
When an Azure application needs a non-synchronous programming pattern (asynchronous and/or parallel), the pattern of choice should be based on the VM size chosen for that application and the type of operation each part performs.
Detail
.NET provides the TPL (Task Parallel Library) to make non-synchronous programming much easier. The asynchronous API performs I/O-bound and compute-bound operations asynchronously, which lets the main thread continue with other work without waiting for those operations to complete; refer to http://snip.udooz.net/Hbmib2 for details. The parallel API effectively utilizes the multicore processors on your machine for data-intensive or task-intensive operations; refer to http://snip.udooz.net/HTLrVv for details.
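For a flavor of the two APIs, here is a minimal, illustrative sketch (not taken from the PoC below): an I/O-bound download started with Task.Factory.FromAsync versus a compute-bound loop spread across cores with Parallel.For.

using System;
using System.Net;
using System.Threading.Tasks;

class TplFlavors
{
    static void Main()
    {
        // I/O bound: the calling thread is free while the request is in flight
        var request = WebRequest.Create("http://example.com/");
        var download = Task<WebResponse>.Factory
            .FromAsync(request.BeginGetResponse, request.EndGetResponse, null)
            .ContinueWith(t => Console.WriteLine(t.Result.ContentLength));

        // Compute bound: the iterations are spread across the available cores
        var squares = new double[1000000];
        Parallel.For(0, squares.Length, i => squares[i] = Math.Sqrt(i));

        download.Wait();
    }
}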
When writing Azure applications, we often need to interact with many external resources such as blobs, queues, and tables. So it is natural to reach for asynchronous or parallel programming patterns when the amount of I/O is high. Even so, we should be careful in choosing between asynchronous and parallel. The extra-small instance provides shared CPU power, the small instance provides a single core, and medium and above provide multiple cores. Hence, the asynchronous pattern is the better option for extra-small and small instances. For problems that are highly parallel in nature, the application should be placed on a medium or larger instance with the parallel pattern.
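As an illustrative sketch of this rule of thumb (my own heuristic, not an Azure API), the choice could even be made at run time from the core count the VM exposes:

using System;

class PatternChooser
{
    // Hypothetical helper: prefer the parallel pattern only when the VM
    // exposes more than one core (medium instances and above);
    // extra-small and small instances report a single core.
    public static bool PreferParallel()
    {
        return Environment.ProcessorCount > 1;
    }
}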
To confirm the above statement, I did a small proof of concept with heavy I/O. The program interacts with Azure blob storage, fetching a large number of blobs to get the data needed to solve a problem. I took a small portion of the Enron email dataset from http://www.cs.cmu.edu/~enron/, which contains email messages for various Enron users in their respective Inbox folders, as shown in figures 1 and 2.
The above figure shows the "inbox" for the user "benson-r". Every user has roughly 200 or more email messages. A message contains the following content:
Message-ID: <21651803.433.JavaMail.evans@thyme>
Date: Tue, 5 Feb 2002 11:06:50 -0800 (PST)
From:
To:
Subject: online power option change request
Cc:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
======= OTHER HEADERS =======
Jay, It was .....
====== remaining message body ======
The program determines how many emails each particular sender has written to this user. The email messages reside in a blob container under appropriate blob directories. The pseudo code is something like the following (a concrete sketch of the parsing step follows it):
for every user
    get the blob sub-directory for the user from the blob container
    create a new dictionary // key - sender email ID, value - count
    for every blob in the sub-directory
        get the blob content
        parse the "From" value from the message
        if the "From" value already exists in the dictionary
            increment its value by 1
        else
            add the "From" value as key with value 1 into the dictionary
    write the result
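To make the parsing step concrete, here is a small self-contained sketch of the "From" extraction against a hypothetical message (the address below is invented for the demo):

using System;
using System.Text.RegularExpressions;

class FromParserDemo
{
    static void Main()
    {
        // Hypothetical message fragment in the dataset's header format
        var email = "Message-ID: <1.2.JavaMail.evans@thyme>\n" +
                    "From: jay.smith@enron.com\n" +
                    "Subject: online power option change request\n";

        // Same pattern the PoC uses to pull the sender address
        var match = Regex.Match(email,
            @"From\W*(\w[-.\w]*@[-a-z0-9]+(\.[-a-z0-9]+)*)");
        if (match.Success)
            Console.WriteLine(match.Groups[1].Value); // jay.smith@enron.com
    }
}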
I applied the sync, parallel, and async patterns, along with plain Task.Factory.StartNew and Task.Factory.StartNew + ContinueWith, to the "fetching and parsing email messages" logic (the chattiest I/O).
The Code
The normal procedural flow is shown below:
// rootContainer is a CloudBlobDirectory representing the "maildir" container
var mailerInbox = rootContainer.GetSubdirectory(mailerFolder + "/inbox");

foreach (var blob in mailerInbox.ListBlobs())
{
    // skip the subfolders, if any
    if (blob is CloudBlobDirectory) continue;

    var email = mailerInbox.GetBlobReference(blob.Uri.ToString()).DownloadText();

    // parsing the "From" field
    var match = Regex.Match(email, @"From\W*(\w[-.\w]*@[-a-z0-9]+(\.[-a-z0-9]+)*)");
    if (match.Success)
    {
        var key = match.Groups[1].Value;
        // estimate is a Dictionary<string, int> of From email id to count
        if (estimate.ContainsKey(key))
            estimate[key]++;
        else
            estimate.Add(key, 1);
    }
}

var sb = new StringBuilder();
foreach (var kv in estimate)
{
    sb.AppendFormat("{0}: {1}\n", kv.Key, kv.Value);
}

// writing the result to a blob
var result = mailerInbox.GetBlobReference("result_normal_" + attempt);
result.UploadText(sb.ToString());
The parallel version is shown below:
var mailerInbox = rootContainer.GetSubdirectory(mailerFolder + "/inbox");

Parallel.ForEach(mailerInbox.ListBlobs(), blob =>
{
    if (!(blob is CloudBlobDirectory))
    {
        var email = mailerInbox.GetBlobReference(blob.Uri.ToString()).DownloadText();
        var match = Regex.Match(email, @"From\W*(\w[-.\w]*@[-a-z0-9]+(\.[-a-z0-9]+)*)");
        if (match.Success)
        {
            var key = match.Groups[1].Value;
            // cestimate is a ConcurrentDictionary<string, int>
            cestimate.AddOrUpdate(key, 1, (k, v) => v + 1);
        }
    }
});

// the result-writing part goes here, similar to the normal version
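One knob worth noting: by default, Parallel.ForEach lets the TPL decide how many concurrent iterations to run, which can oversubscribe a shared core. If you want to cap the number of simultaneous downloads, ParallelOptions exposes MaxDegreeOfParallelism. A sketch (the cap of 4 is an arbitrary example, not a value from my tests):

// Sketch: cap the concurrent blob downloads; 4 is an arbitrary example value
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(mailerInbox.ListBlobs(), options, blob =>
{
    // same body as in the parallel version above
});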
The asynchronous version is:
var mailerInbox = rootContainer.GetSubdirectory(mailerFolder + "/inbox");
var tasks = new Queue<Task>();

foreach (var blob in mailerInbox.ListBlobs())
{
    if (blob is CloudBlobDirectory) continue;

    // blobStorage is a wrapper over the Azure Blob storage REST API
    var webRequest = blobStorage.GetWebRequest(blob.Uri.ToString());

    tasks.Enqueue(Task.Factory.FromAsync<WebResponse>(
        webRequest.BeginGetResponse, webRequest.EndGetResponse, null /* state */)
        .ContinueWith(t =>
        {
            var response = t.Result;
            var stream = new StreamReader(response.GetResponseStream());
            var emailMsg = stream.ReadToEnd();
            stream.Close();
            response.Close();

            // regex holds the same "From" pattern used above
            var match = regex.Match(emailMsg);
            if (match.Success)
            {
                var key = match.Groups[1].Value;
                cestimate.AddOrUpdate(key, 1, (k, v) => v + 1);
            }
        }));
}

Task.WaitAll(tasks.ToArray());
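The blobStorage object above is a thin wrapper over the REST API (more on that below). Since the post does not show it, here is a minimal, hypothetical sketch of what its GetWebRequest could look like. It assumes a publicly readable container; a private one would additionally need the SharedKey Authorization header, which is omitted here.

using System.Net;

class BlobStorageWrapper
{
    // Hypothetical: build a plain GET request for a blob URI.
    // Real Azure storage auth (SharedKey signing) is intentionally omitted.
    public HttpWebRequest GetWebRequest(string blobUri)
    {
        var request = (HttpWebRequest)WebRequest.Create(blobUri);
        request.Method = "GET";
        request.Headers.Add("x-ms-version", "2011-08-18"); // storage API version header
        return request;
    }
}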
The major difference in the "fetching and parsing" part is that, instead of the managed API, I used the REST API through a wrapper so that I could access the blobs asynchronously. In addition to the above, I used plain TPL tasks in two different ways. In the first, I simply wrapped the "fetching and parsing" work in tasks, as shown below:
var tasks = new Queue<Task>();

foreach (var blob in mailerInbox.ListBlobs())
{
    if (blob is CloudBlobDirectory) continue;

    string blobUri = blob.Uri.ToString();
    tasks.Enqueue(Task.Factory.StartNew(() =>
    {
        var email = mailerInbox.GetBlobReference(blobUri).DownloadText();
        var match = Regex.Match(email, @"From\W*(\w[-.\w]*@[-a-z0-9]+(\.[-a-z0-9]+)*)");
        if (match.Success)
        {
            var key = match.Groups[1].Value;
            cestimate.AddOrUpdate(key, 1, (k, v) => v + 1);
        }
    }));
}

Task.WaitAll(tasks.ToArray());
In the second way, I used ContinueWith on the task, so that download and parsing run as separate stages, as shown below:
var tasks = new Queue<Task>();

foreach (var blob in mailerInbox.ListBlobs())
{
    if (blob is CloudBlobDirectory) continue;

    string blobUri = blob.Uri.ToString();
    tasks.Enqueue(Task.Factory.StartNew(() =>
    {
        return mailerInbox.GetBlobReference(blobUri).DownloadText();
    }).ContinueWith(t =>
    {
        var match = regex.Match(t.Result);
        if (match.Success)
        {
            var key = match.Groups[1].Value;
            cestimate.AddOrUpdate(key, 1, (k, v) => v + 1);
        }
    }, TaskContinuationOptions.OnlyOnRanToCompletion));
}

Task.WaitAll(tasks.ToArray());
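One caveat with TaskContinuationOptions.OnlyOnRanToCompletion: if a download task faults, its continuation is cancelled, so the download's exception surfaces only when Task.WaitAll throws, if at all. A minimal sketch of pairing the chain with an OnlyOnFaulted continuation so failures get logged (the logging line is illustrative, not from the PoC):

var download = Task.Factory.StartNew(() =>
    mailerInbox.GetBlobReference(blobUri).DownloadText());

// sketch: observe and log faults explicitly so they do not go unnoticed
// (in .NET 4, unobserved task exceptions can escalate at finalization)
download.ContinueWith(
    t => Console.WriteLine("Download failed: " + t.Exception.InnerException.Message),
    TaskContinuationOptions.OnlyOnFaulted);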
Results
I hosted the worker role and storage account in "Southeast Asia". On every VM size, I made 6 runs and discarded the first (cold) result. I set 12 concurrent connections on the ServicePointManager for all tests, and I did not change this value for the medium and large instances. All results are in milliseconds.
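The post does not show the setup code; the connection limit was presumably set once at role startup, along these lines:

using System.Net;

// allow 12 concurrent outbound connections, as used in all the tests below
ServicePointManager.DefaultConnectionLimit = 12;
// a common companion tweak, though not mentioned in this post:
// ServicePointManager.Expect100Continue = false;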
Extra Small
Run | Normal | Parallel | Async | Task | Task & ContinueWith
Run 1 | 4326 | 1209 | 1004 | 1807 | 1671
Run 2 | 4773 | 1319 | 972 | 1399 | 1887
Run 3 | 4189 | 1027 | 1050 | 1590 | 1322
Run 4 | 4769 | 1299 | 964 | 1778 | 1728
Run 5 | 4416 | 1665 | 952 | 1313 | 1150
Small
Run | Normal | Parallel | Async | Task | Task & ContinueWith
Run 1 | 4044 | 1319 | 687 | 2003 | 2045
Run 2 | 4116 | 1229 | 972 | 2070 | 1854
Run 3 | 4060 | 1468 | 981 | 1584 | 1501
Run 4 | 4375 | 1316 | 909 | 1208 | 1924
Run 5 | 4167 | 931 | 797 | 1272 | 1109
Medium
Run | Normal | Parallel | Async | Task | Task & ContinueWith
Run 1 | 4086 | 1839 | 933 | 1326 | 1385
Run 2 | 4245 | 1204 | 751 | 1069 | 1064
Run 3 | 4193 | 1449 | 753 | 1176 | 1291
Run 4 | 4426 | 1076 | 619 | 1300 | 1395
Run 5 | 4145 | 811 | 674 | 888 | 951
Large
Run | Normal | Parallel | Async | Task | Task & ContinueWith
Run 1 | 4124 | 1269 | 697 | 1159 | 1091
Run 2 | 4013 | 945 | 892 | 1028 | 1299
Run 3 | 4277 | 977 | 657 | 1228 | 1148
Run 4 | 4322 | 1270 | 840 | 820 | 1072
Run 5 | 4141 | 1154 | 729 | 1059 | 1151
Surprisingly, irrespective of the VM size, when an operation is I/O bound the asynchronous pattern outshines all the other approaches, followed by Parallel.
Final Words
Hence, the "asynchronous" approach wins for the I/O-bound operation (also shown as a diagram here).
I will come up with one more test that covers the areas where the Parallel approach shines. In addition, when you have less I/O and want straightforward multithreading, Task and Task + ContinueWith may help you.
What do you think? Share your thoughts!
My sincere thanks to and for validating my approach and the results; their feedback improved my overall testing strategy.
The source code is available at https://udooz.net/file-drive/doc_download/23-mailanalyzerasyncpoc.html
Maqsood
Apr 09, 2012 @ 12:49:29
When the collection is huge, the Parallel library uses the runtime thread pool, which will try to spawn as many threads as possible and may cause performance issues. Writing code with custom threads (classical threading) using ManualResetEvents and wait handles can be a much better option than relying on the parallel framework. The results can be under 200 milliseconds. Give it a try.
udooz
Apr 10, 2012 @ 11:34:00
Thanks Maqsood. I will work that out in my next post and discuss it with you then. Thanks.