
2009/10/11

URL Canonicalization with 301 redirects for ASP.NET

There are lots of pages talking about the benefits of canonicalization (c14n for short). The common agreement is that it is a set of rules for having our pages indexed in the most standardized, simplified and optimal way possible. This allows us to consolidate our PageRank instead of having it spread among all the possible ways of writing the URL for a particular page. In this post we will cover some canonicalization cases and their implementations for an IIS server running ASP.NET.

These different cases include:

  • Secure versus non secure versions of a page: Are http://www.example.com and https://www.example.com the same?
  • Upper and lowercase characters in the URL: Are ~/Default.aspx, ~/default.aspx and ~/DeFaUlT.aspx the same page?
  • www versus non-www domain: Do http://example.com and http://www.example.com return the same contents?
  • Parameters in the QueryString: Should ~/page.aspx?a=123&b=987 and ~/page.aspx?b=987&a=123 be considered the same? Are we handling bogus parameters? What happens if someone links us with a parameter that is not expected/used such as ~/page.aspx?useless=33 ?
  • Percent encoding: Do ~/page.aspx?p=d%0Fa and ~/page.aspx?p=d%0fa return the same page?

If your answer is yes in all cases, you should keep on reading. If you answered yes only in some cases, this post will be interesting for you anyway; you can skip the points that do not apply to your scenario by commenting out some lines of code, or modify them to match your needs. A sample VS2008 website project with full VB source code is available for download.

In our sample code we will be following these assumptions:

  • We prefer the non-secure version over the secure one, except for some particular (secure) paths: If we receive an https request from a non-authenticated user for a page that should not be served as secure, we will do a 301 redirect to the same requested URL but without the secure ‘s’.
  • We prefer lowercase for all the URLs: If we receive a request that contains any uppercase character (parameter names and their values are not considered), we will do a permanent 301 redirect to the lowercase variant of the URL being requested.
  • www vs. non-www should be handled by creating a new website in IIS for the non-www version and placing there a 301 redirect to the www version. This case is not covered by our code in ASP.NET since it only needs some IIS configuration work.
  • The parameters must be alphabetically ordered: If we receive a request for ~/page.aspx?b=987&a=123, we will do a permanent redirect to ~/page.aspx?a=123&b=987, since alphabetically a sorts before b. Regarding lower and uppercase variants, either in the name of a parameter or in its value, we will consider them different pages; in other words, no redirect will be done if the name of a QueryString parameter is found in upper/mixed/lowercase. The same applies to the values of those parameters: ~/page.aspx?a=3T, ~/page.aspx?A=3T and ~/page.aspx?a=3t will be considered different pages and no redirect will be done. In pages that accept parameters, extra coding must be done to check that no parameters other than the allowed ones are used.
  • We prefer percent-encoded characters in their uppercase variant; for that reason %2f, for instance, will be redirected to %2F wherever it appears in the value of any parameter. This way we follow RFC 3986, which states:
    Although host is case-insensitive, producers and normalizers should use lowercase for registered names and hexadecimal addresses for the sake of uniformity, while only using uppercase letters for percent-encodings.
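
Putting the rules above together, here is a minimal sketch of the redirect logic placed in Global.asax. This is not the actual code from the downloadable sample project; IsSecurePath is a hypothetical helper standing in for whatever decides which paths must stay under https.

' Global.asax - a minimal sketch of the c14n rules above (not the sample project's actual code)
' IsSecurePath is a hypothetical helper that returns True for paths that must stay under https
Sub Application_BeginRequest(ByVal sender As Object, ByVal e As EventArgs)
    Dim url As Uri = Request.Url

    ' Rule 1: prefer http over https, except for the declared secure paths
    If Request.IsSecureConnection AndAlso Not IsSecurePath(url.AbsolutePath) Then
        Do301("http://" & url.Host & url.PathAndQuery)
    End If

    ' Rule 2: prefer lowercase for the path (query string names/values are left untouched)
    If url.AbsolutePath <> url.AbsolutePath.ToLowerInvariant() Then
        Do301(url.Scheme & "://" & url.Host & url.AbsolutePath.ToLowerInvariant() & url.Query)
    End If

    ' Rules 3 and 4 (parameter ordering and uppercase percent-encodings) work the same way:
    ' rebuild the canonical query string and issue a 301 if it differs from the one requested.
End Sub

Private Sub Do301(ByVal location As String)
    Response.Clear()
    Response.StatusCode = 301                     ' Moved Permanently
    Response.AddHeader("Location", location)
    Response.End()
End Sub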

<link rel="canonical" …>

In February 2009 Google announced through their Google Webmaster Central Blog a way for you to explicitly declare your preferred canonical version of every page (see Specify your canonical). By simply adding a <link> tag inside the <head> section of your pages, you can tell spiders the way you prefer them to index your content, the canonical preferred way. This helps concentrate the GoogleJuice onto that particular canonical URL from any other URL version or variation pointing to it in this way (the link rel=canonical way). This very same method was later adopted by Ask.com, Microsoft Live Search and Yahoo!, so it can be considered a de facto standard.

We will adopt this relatively new feature in our sample code. Most of the time we will be using permanent 301 redirects, but there may be cases where you do not want to do a redirect and simply return the requested page as is (with no redirection), returning the canonical URL as a hint for search engines. Whenever we receive a request for a page that includes bogus parameters in the query string, we will handle the request as a normal one but discard the useless parameters when calculating the link rel=canonical version of the page.

In particular, if you are using Google AdWords, your landing pages will be hit with an additional parameter called gclid that is used for AdWords auto-tagging. We do not want to handle those requests differently, nor treat them as errors in any way. We will only discard the unknown parameters when creating the rel=canonical URL for any request.
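
For reference, here is a hedged sketch of how the tag can be emitted from a page's code-behind. BuildCanonicalUrl is a hypothetical helper standing in for whatever logic sorts the parameters and drops unknown ones such as gclid.

Imports System.Web.UI.HtmlControls

Partial Class SamplePage
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
        ' BuildCanonicalUrl is a hypothetical helper: sorts parameters, drops gclid and other unknown ones
        Dim canonicalUrl As String = BuildCanonicalUrl(Request.Url)

        Dim linkTag As New HtmlLink()
        linkTag.Href = canonicalUrl
        linkTag.Attributes.Add("rel", "canonical")
        Page.Header.Controls.Add(linkTag)           ' requires <head runat="server"> in the master page
    End Sub
End Class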

Related links.

Internet Information Services IIS optimization
Are upper- and lower-case URLs the same page?
Google Indexing my Home Page as https://. 
http:// and https:// - Duplicate Content? 
SEO advice: url canonicalization

Q: What about https and http versions? I have a site is indexed for https, in place of http. I am sure this too is a form of canonical URIs and how do you suggest we go about it?
A: Google can crawl https just fine, but I might lean toward doing a 301 redirect to the http version (assuming that e.g. the browser doesn’t support cookies, which Googlebot doesn’t).

Specify your canonical

Keywords.

canonicalization, seo, optimization, link, rel, canonical, c14n, asp.net, http vs. https, uppercase vs. lowercase

2009/08/30

Automatic generation of META tags for ASP.NET

Some of the well-known tags commonly used in SEO are the following three meta tags: the meta title tag, the meta keywords tag and the meta description tag:

<meta name="title" content="title goes here" /> 
<meta name="keywords" content="keywords, for, the, page, go, here"/> 
<meta name="description" content="Here you will find a textual description of the page" />

A lot has been written about the benefits of using them, and almost as much saying that they are no longer considered by search engines. Anyway, no matter whether they are used or not in the calculation of SERPs (Search Engine Results Pages), nobody disputes the benefits of having them correctly set on all your pages. At least meta description tags are somewhat considered by Google, since Google Webmaster Tools warns you about pages with duplicate meta descriptions:

Differentiate the descriptions for different pages. Using identical or similar descriptions on every page of a site isn't very helpful when individual pages appear in the web results. In these cases we're less likely to display the boilerplate text. Wherever possible, create descriptions that accurately describe the specific page. [...]

Download the VB project code

The question is not "should I use meta tags in my pages?"; the real question (and here comes the problem) is "how can I manage to create individual meta descriptions for all my pages?" and "how can I automate the process of creating meta keywords?". That would be too much work (or too much technical work) for you (or your users, if they create content on their own).

For instance, consider a CMS (Content Management System) in which users are prompted for some fields to create a new entry. In the simplest form, the CMS asks the user to enter a title and the content for the new entry. In advanced-user mode, the CMS could also ask the user to suggest some keywords, but the user will probably enter just two, three or four words (if any). The CMS needs a way to automatically guess and suggest a default set of meta keywords based on the content before finally saving the new entry. Those could be reviewed, and eventually completed, by the user and then accepted. Meta title and meta description are much easier, but they will also be covered in our code.

In our sample VB project we will not suggest keywords for the user to confirm; we will just calculate them on the fly and set them without user intervention. We will use a dummy VirtualPathProvider that overrides the GetFile function in order to retrieve the virtualPath file from the real file system, so it is not a real VirtualPathProvider in the full sense, just a wrapper to take control of the files being served to ASP.NET before they are actually compiled. A VirtualPathProvider is commonly used to seamlessly integrate path URLs with databases or any other source of data rather than the file system itself. Our custom class inheriting from VirtualPathProvider will be called FileWrapperPathProvider. In our case it will not use the full potential of VirtualPathProviders, since we will only retrieve the data from the file system, make minor changes to the source code on the fly and return it to be compiled. This introduces a bit of overhead and some extra CPU cycles before the compilation of the pages, but that will only happen once, until the file needs to be compiled again (because the underlying file has changed, for instance).

Our FileWrapperPathProvider.GetFile function will return a FileWrapperVirtualFile whenever the requested virtualPath meets the conditions of the IsPathVirtual function: the file extension is .aspx or .aspx.vb and the path of the requested URL follows the scheme of ~/xx/, that is to say, it is under a folder of two characters (for the language: ~/en/, ~/de/, ~/fr/, ~/es/, …). Otherwise, it will return a VirtualFile handled by the previously registered VirtualPathProvider; i.e. none, or the file system itself without any change.

We have chosen to use a VirtualPathProvider wrapper around the real file system just to show what kind of things can be done with that class. If your data is in a database instead of static files, you will probably be using your own VirtualPathProvider, and in that case it will work by virtualizing the path being requested and retrieving the file contents from the database instead of the file system. Whatever the case, you can adapt it to your scenario in order to make use of the idea that we illustrate in this post.

The idea is somewhat twisted or cumbersome:

  1. Parse the code-behind file of the page being requested (the .aspx.vb file) and, using regular expressions (regex), replace the base class so that the page no longer inherits from System.Web.UI.Page but from System_Web_UI_ProxyPage instead (a custom class of our own). This proxy page class declares public MetaTitle, MetaDescription and MetaKeywords properties and links them to the underlying meta title, meta description and meta keywords declared inside the head tag of the master page. When a page inherits from our System_Web_UI_ProxyPage, it exposes those 3 properties so they can be easily set. See System_Web_UI_ProxyPage.OnLoad in our sample project.
  2. Read and parse the .aspx file linked to the former .aspx.vb file (the same name without the .vb) and call the JAGBarcelo.MetasGen.GuessMetasFromString method, which does the main job with the file contents. See the FileWrapperVirtualFile.Open function in the sample project.
  3. Besides changing the base class to our own, we add some lines to create (or extend) the Page_Init method in that .aspx.vb file. In those few lines of code that are added on the fly, we set the three properties exposed by the System_Web_UI_ProxyPage class that we have just calculated.
  4. Return the Stream as the output of the VirtualFile.Open function with the modified contents so that it can be compiled by the ASP.NET engine, based on the underlying real file and using the previously calculated meta title, meta keywords and meta description. Note that this is done in memory; the actual file system is not written to at any time. The real files (.aspx.vb and .aspx) are read, parsed, and the virtual contents created on the fly and handed to ASP.NET. You need to be really careful, since you can run into compile-time errors in places that are hard to understand, because the file system version of the files is the base content but not the actual content being compiled.
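
As a rough idea of the provider's shape, here is a hedged, stripped-down sketch. The regex, the class body and the FileWrapperVirtualFile constructor shown here are simplifications and assumptions; the real FileWrapperPathProvider in the sample project does more than this.

Imports System.Web
Imports System.Web.Hosting
Imports System.Text.RegularExpressions

Public Class FileWrapperPathProviderSketch
    Inherits VirtualPathProvider

    ' Only .aspx / .aspx.vb files under a two-letter language folder (~/en/, ~/es/, ...) are wrapped
    Private Function IsPathVirtual(ByVal virtualPath As String) As Boolean
        Dim appRelative As String = VirtualPathUtility.ToAppRelative(virtualPath)
        Return Regex.IsMatch(appRelative, "^~/[a-z]{2}/.+\.aspx(\.vb)?$", RegexOptions.IgnoreCase)
    End Function

    Public Overrides Function GetFile(ByVal virtualPath As String) As VirtualFile
        If IsPathVirtual(virtualPath) Then
            ' FileWrapperVirtualFile reads the real file and rewrites it on the fly in its Open() method
            Return New FileWrapperVirtualFile(virtualPath)
        Else
            Return Previous.GetFile(virtualPath)    ' defer to the default provider (the file system)
        End If
    End Function
End Class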

The way we calculate the metas in JAGBarcelo.MetasGen.GuessMetasFromString is:

  1. Select the proper noise-word set depending on the language of the text.
  2. Look for the content inside the whole text. It must be inside ContentPlaceHolders (we assume you are using master pages), and we will look for the particular ContentPlaceHolder that contains the main body/contents of the page. Change the LookForThisContentPlaceHolder const inside the MetasGen.vb file to customise it for your own master page's ContentPlaceHolder names.
  3. Calculate the meta title as the text within the first <h1> tags right after the searched ContentPlaceHolder.
  4. Iterate through the rest of the content, counting word occurrences and two-word phrases occurrences, discarding noise words for the given language.
  5. Calculate the keywords, creating a string that will be filled with the most frequent single-word occurrences (up to 190 characters), and two-word most frequent occurrences (up to 250 characters in total).
  6. Calculate the description, concatenating previously parsed content to create a string of between 200 and 394 characters. Those two figures are not randomly chosen: Google Webmaster Tools warns you when any of your pages has a meta description shorter than 200 or longer than 394 characters (based on my experience).
  7. Return the calculated title, keywords and description in the proper ByRef parameters.

A good thing about this approach, using a VirtualFile, is that you can apply it to your already existing website easily. No matter how many pages your site has (hundreds, thousands, ...), this code adds meta title, meta keywords and meta description to all your pages automatically, transparently, without user intervention, with very few modifications (if any) to your already existing pages, and it scales well.

Counting word occurrences.

We iterate through the words within the text under consideration (the ContentPlaceHolder) and store their occurrences in a Hashtable (ht1 for single words and ht2 for two-word phrases). All words are considered in their lowercase variant. A word must have more than two characters to be taken into account and must not start with a number. If it passes this fast initial test, it is checked against a noise-word list. If it is not a noise word, it is checked against the existing values in the proper Hashtable and either included (ht1.Add(word, 1)) or its value incremented (ht1(word) = ht1(word) + 1) if it was already there.
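
In VB, the single-word counting loop can be sketched roughly as follows. This is a simplified version, not the project's actual code; the noise-word arrays are described in the next paragraph and are passed in here as a plain list.

Imports System.Collections
Imports System.Text.RegularExpressions

Public Module WordCountSketch
    ' Counts single-word occurrences, skipping short words, words starting with a digit and noise words
    Public Function CountWords(ByVal contentText As String, ByVal noiseWords As IList) As Hashtable
        Dim ht1 As New Hashtable()
        For Each rawWord As String In Regex.Split(contentText, "[^\p{L}\p{Nd}]+")
            Dim word As String = rawWord.ToLowerInvariant()
            If word.Length > 2 AndAlso Not Char.IsDigit(word(0)) AndAlso Not noiseWords.Contains(word) Then
                If ht1.ContainsKey(word) Then
                    ht1(word) = CInt(ht1(word)) + 1
                Else
                    ht1.Add(word, 1)
                End If
            End If
        Next
        Return ht1
    End Function
End Module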

Regarding the noise words, we first considered some word frequency lists available out there, but then we thought about using verb conjugations as well. So we first created MostCommonWordsEN, an array based on simple frequency lists, and then we also created MostCommonVerbsEN based on another frequency list which considered only verbs. Finally we created MostCommonConjugatedVerbsEN, where we stored all the conjugations of the former most common English verbs. When checking a word against one of these word lists we only use MostCommonWordsXX and MostCommonConjugatedVerbsXX (where XX is one of EN, ES, FR, DE, IT). Yes, we did the same for other languages like Spanish, French, German and Italian, whose conjugations are much more complex than the -ed, -ing and -s terminations. For the automatic generation of all possible conjugations of the given verbs (in their infinitive form) we used http://www.verbix.com/

Calculating meta title.

It will be the text between the first <h1> and </h1> heading tags right after the main ContentPlaceHolder of the page.

Calculating meta description.

Most of the time, a description of what a whole text is about is (or at least should be) found within its first paragraphs. Based on that assumption, we try to parse and concatenate the text within paragraphs (<p></p> tags) after the first <h1> tag. Based on our experience, when the meta description tag is longer than 394 characters, Google Webmaster Tools complains about it being too long. With that point in mind, we try to concatenate html-cleaned text from the first paragraphs of the text to create the meta description tag, ensuring it is not longer than 394 characters. Once we know the way our meta descriptions are automatically created, all we need to do is create our pages starting with an <h1> header tag followed by one or more paragraphs (<p></p>) that will be the source for creating the meta description of the page. This will be suitable for most scenarios. In other cases, you should modify the way you create your pages or update the code to match your needs.

Calculating meta keywords.

Provided with those noise-word lists for a given language, calculating the keyword (single word) and key phrase (two words) occurrences within the text is straightforward. We just iterate through the text, check against the noise words, and add a new keyword or increment its frequency if the given keyword is already in the Hashtable. At the end of the iteration, we sort the Hashtables by descending frequency (using a custom class implementing System.Collections.IComparer). The final keywords list is a combination of the most frequent single keywords (ht1), up to 190 characters, and the most common two-word key phrases (ht2), until completing a maximum of 250 characters. All of them will be comma-separated values in lowercase.
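
The descending sort can be done with a tiny comparer like the following sketch (the equivalent class in the sample project may be named and organized differently):

Imports System.Collections

' Sorts DictionaryEntry items by their numeric value, highest frequency first
Public Class FrequencyComparer
    Implements IComparer

    Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer _
        Implements IComparer.Compare
        Dim entryX As DictionaryEntry = CType(x, DictionaryEntry)
        Dim entryY As DictionaryEntry = CType(y, DictionaryEntry)
        Return CInt(entryY.Value).CompareTo(CInt(entryX.Value))
    End Function
End Class

' Usage:
'   Dim entries As New ArrayList(ht1)          ' copies the DictionaryEntry items of the Hashtable
'   entries.Sort(New FrequencyComparer())      ' most frequent words first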

Summary.

Having meta tags correctly set is a must; however, it is sometimes difficult to set them manually on every page, let alone remember all possible keyword combinations. Too frequently only a few words are added, and this is where automatic keyword handling can help. If you think this might be your case, please download our sample VB project and give it a try (and a few debug traces too). I will be waiting for your comments.

Links.

Internet Information Services IIS optimization

2009/03/29

Conditional GET and ETag implementation for ASP.NET

This post continues the series of Internet Information Services IIS optimization. See the link if you want to follow the whole series.

You can download the VB project code of this article. Another way of optimizing your web site is setting it up to support conditional GET, that is, implementing the logic for handling requests whose headers specify If-None-Match (ETag) and/or If-Modified-Since values. This is not easy, since ASP.NET does not offer direct support for it, nor does it have primitives/methods/functions for it, and by default it always returns 200 OK, no matter what the headers of the request are (apart from errors, such as 404, and so on).

The idea behind this is quite simple; let’s suppose a dialog between a browser (B) and a web server (WS):

B: Hi, can you give me a copy of ~/document.aspx?
WS: Of course. Here you are: 200Kb of code. Thanks for coming, 200 OK.
B: Hi again, can you give me a copy of ~/another-document.aspx?
WS: Yes, we’re here to serve. Here you are: 160Kb. Thanks for coming, 200 OK.
(Now the user clicks on a link that points to ~/document.aspx or goes back in his browsing history)
B: Sorry for disturbing you again, can I have another copy of ~/document.aspx
WS: No problem at all. Here you are: 200Kb of code (the same as before). Thanks for coming, 200 OK.

Stupid, isn’t it? The way to enhance the dialogue and avoid unnecessary traffic is to have a richer vocabulary (If-None-Match & If-Modified-Since). Here is the same dialogue with these improvements:

B: Hi can you give me a copy of ~/document.aspx?
WS: Of course. Here you are: 200Kb of code. ISBN is 555111222 (ETag) and this is the 2009 edition (Last-Modified). Thanks for coming, 200 OK.
B: Hi again, can you give me a copy of ~/another-document.aspx?
WS: Yes, we are here to serve. Here you are: 160Kb. ISBN is 555111333 (ETag) and it is the 2007 edition (Last-Modified). Thanks for coming, 200 OK.
(Now the time passes and the user goes back to ~/document.aspx, maybe it was in his favorites, or arrived to the same file after browsing for a while)
B: Hi again, I already have a copy of ~/document.aspx, ISBN is 555111222 (If-None-Match), dated 2009 (If-Modified-Since). Is there any update for it?
WS: Let me check… No, you are up to date, 0Kb transferred, 304 Not modified.

It sounds more logical. It takes a little more dialogue (negotiation) prior to the transaction, but if the conditions are met, these extra words save time and money (bandwidth) for both parties.

Most browsers nowadays support such negotiation, but the web server must do so too in order to get the benefits. Unfortunately IIS only supports conditional GET natively for static files. If you want to use it also for dynamic content (ASP.NET files), you need to add support for it programmatically. That is what we are going to show here.

 

Calculating Last-Modified response header.

To begin with, the server needs to know when a page was last modified. This is very easy for static content: a simple mapping between the web page being requested and the file in the underlying file system and you are done. The calculation of this date for .ASPX files is a little more complicated. You need to consider all the dependencies of the content being served and calculate the most recent date among them. For instance, let’s suppose the browser requests a page at ~/default.aspx and this file is based on a master page called ~/MasterPage.master which has a menu inside it that grabs its contents from the file ~/web.sitemap. In the simplest scenario (no content being retrieved from a database, no user controls), ~/default.aspx will contain static content. In this case, the Last-Modified value will be the most recent last modification time of these files:
  • ~/default.aspx
  • ~/default.aspx.vb (Optionally, depending on your pages having code behind which modifies the output or not)
  • ~/MasterPage.master
  • ~/MasterPage.master.vb (Optionally)
  • ~/web.sitemap

The last-modified time is retrieved using System.IO.File.GetLastWriteTime. In case the content is retrieved from a database, you must have a column storing the last modification time (when the content was last written) in order to use this functionality.
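
A minimal sketch of that calculation, with the dependency list hard-coded for clarity (the sample project resolves the dependencies in its own way):

Imports System.IO
Imports System.Web

' Returns the most recent write time among the page and its dependencies (list hard-coded for the example)
Public Function GetLastModified(ByVal server As HttpServerUtility) As DateTime
    Dim dependencies() As String = New String() { _
        "~/default.aspx", "~/default.aspx.vb", _
        "~/MasterPage.master", "~/MasterPage.master.vb", "~/web.sitemap"}
    Dim lastModified As DateTime = DateTime.MinValue
    For Each virtualPath As String In dependencies
        Dim writeTime As DateTime = File.GetLastWriteTime(server.MapPath(virtualPath))
        If writeTime > lastModified Then lastModified = writeTime
    Next
    Return lastModified
End Function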

 

Calculating ETag response header.

The second key of the dialogue is the ETag value. It is simply a hash of the final contents being served. If you have any way (with a low CPU footprint) of calculating a hash from certain textual input, it can be used. In our implementation we used CRC32, but any other will work the same way. We calculate the ETag value by making a CRC32 checksum of all dependent content plus the last modification dates of those dependencies. In our simplest case, the concatenation of all these strings:
  • ~/default.aspx last write time
  • ~/default.aspx.vb last write time (not likely, but optionally necessary)
  • ~/MasterPage.master last write time
  • ~/MasterPage.master.vb last write time (Optionally)
  • ~/web.sitemap last write time
  • ~/default.aspx contents
  • ~/default.aspx.vb contents (Optionally, but not likely, to speed up calculations)
  • ~/MasterPage.master contents
  • ~/MasterPage.master.vb (Optionally)
  • ~/web.sitemap contents

And then a CRC32 of the whole. If your content is really dynamically generated (from a database, or by code), you will need to include it as well, like any other dependency, in the former list.
It might seem too much burden, too much CPU usage, but, as with everything, it really depends on the website:

  • High volume, high CPU usage: This scenario might not cope with the extra CPU needed. See Note*.
  • High volume, low CPU usage: You can safely spend CPU cycles in order to save some bandwidth. Implementing conditional GETs is a must.
  • Low volume, high CPU usage: What kind of web server is it? Definitely not a public web server as we know them.
  • Low volume, low CPU usage: Implementing conditional GETs will give your website the impression of being served faster.

Note*: Consider this question: is your CPU usage so high partly because the same contents are requested again and again by the same users? If your answer is yes (or maybe), the extra CPU usage spent on allowing client-side caching and conditional GETs will, globally viewed, lower your overall CPU usage and also the bandwidth being used. Give this idea a try and decide for yourself afterwards.
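
As a sketch of the idea (the sample project uses its own CRC32.vb class; here MD5 from the framework is used as a stand-in hash so the snippet stands alone):

Imports System.IO
Imports System.Text
Imports System.Security.Cryptography

' Builds an ETag by hashing the dependencies' last write times plus their contents
Public Function BuildETag(ByVal physicalPaths() As String) As String
    Dim sb As New StringBuilder()
    For Each path As String In physicalPaths
        sb.Append(File.GetLastWriteTime(path).Ticks)
        sb.Append(File.ReadAllText(path))
    Next
    Dim hash() As Byte = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()))
    Return """" & BitConverter.ToString(hash).Replace("-", "") & """"   ' ETag values are quoted
End Function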

 

Returning them in the response.

Once we have calculated both the Last-Modified & ETag values, we need to return them with the response of the page. This is done using the following lines of code:
Response.Cache.SetLastModified(LastModifiedValue.ToUniversalTime)    
Response.Cache.SetETag(ETagValue)

 

Looking for the values in request’s headers.

Now that our pages’ responses are generated with Last-Modified and ETag headers, we need to check for those values in the requests too. The names of those parameters, when they come back in the request headers, differ from the original names:

  • The Last-Modified response header corresponds to the If-Modified-Since request header.
  • The ETag response header corresponds to the If-None-Match request header.

The logic for deciding if we should return 200 OK or 304 Not modified is as follows:

  • If both values (If-Modified-Since & If-None-Match) were provided in the request and both match, return 304 and no content (0 bytes)
  • If any of them do NOT match, return 200 and the complete page
  • If only one of them was specified (If-Modified-Since or If-None-Match), it decides.
  • If none were provided, always return 200 and the complete page.
In order to return 304 and no content for the page the code to be used is:
Response.Clear()    
Response.StatusCode = CInt(System.Net.HttpStatusCode.NotModified)     
Response.SuppressContent = True
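
Putting the decision logic into code, here is a hedged sketch of the check; the actual ConditionalGET method in HttpSnippets.vb may differ in its details.

Imports System.Web

' Returns True when a 304 was sent and the page should stop processing
Public Function TryReturn304(ByVal request As HttpRequest, ByVal response As HttpResponse, _
                             ByVal etag As String, ByVal lastModified As DateTime) As Boolean
    Dim ifNoneMatch As String = request.Headers("If-None-Match")
    Dim ifModifiedSince As String = request.Headers("If-Modified-Since")

    Dim etagMatches As Boolean = (ifNoneMatch IsNot Nothing) AndAlso (ifNoneMatch = etag)
    Dim dateMatches As Boolean = False
    Dim since As DateTime
    If ifModifiedSince IsNot Nothing AndAlso DateTime.TryParse(ifModifiedSince, since) Then
        ' HTTP dates have one-second resolution
        dateMatches = (lastModified.ToUniversalTime() <= since.ToUniversalTime().AddSeconds(1))
    End If

    ' If both headers are present, both must match; if only one is present, it decides
    Dim notModified As Boolean
    If ifNoneMatch IsNot Nothing AndAlso ifModifiedSince IsNot Nothing Then
        notModified = etagMatches AndAlso dateMatches
    Else
        notModified = etagMatches OrElse dateMatches
    End If

    If notModified Then
        response.Clear()
        response.StatusCode = CInt(System.Net.HttpStatusCode.NotModified)
        response.SuppressContent = True
    End If
    Return notModified
End Function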

 

Test 1: Requests to ~/default.aspx

With the ideas in place, we have reused the VB project from the previous post, ASP.NET menu control optimization, and added support for conditional GETs. In the sample VB project for this post there are 2 new files under App_Code: CRC32.vb, which implements a CRC32 checksum algorithm, and another one named HttpSnippets.vb, which implements a method called ConditionalGET that does most of the job explained in this post. We have used Fiddler2 to debug two requests made to ~/default.aspx.

The first one, shown in the left column (red arrow), is made without the browser having any cached information about the page. As you can see, the browser makes the request without providing any If-Modified-Since or If-None-Match headers. The response given by the server sets the ETag and Last-Modified values for the browser to use in the future, in case it supports them.

The second request, shown in the right column (green arrow), is made by the same browser some seconds later. The browser already has information about the page being requested and provides it along with the request: the If-Modified-Since and If-None-Match headers. The result from the server in this case is different. Instead of returning 200 OK and the whole page, it returns 304 Not Modified, and the size of the body is 0. You are saving bandwidth at the cost of some CPU cycles and a few more bytes in the negotiation (headers).

 

Test 2: Requests to ~/default-optimized.aspx

Following on from the ASP.NET menu control optimization project, we also added support for conditional GET to our ~/default-optimized.aspx page, which saves the menu in an external client-side cacheable page, in order to reduce (even more) the size of the pages being transferred.

In this case the first column (red arrow) belongs to the request for ~/default-optimized.aspx. As you can see, the size of the page being transferred completely is 3785 bytes (in the previous example it was 18358 bytes). This reduction is solely due to the ASP.NET menu control optimization. For more info about this, check the previous article. Regarding the conditional GET, the first request does not know anything about the page and no conditional data is provided in the request. The response includes ETag and Last-Modified values.

The second request of interest is in the right column (green arrow) and belongs to the same browser requesting the same file some seconds later. This time, information about the page is provided by the browser in the headers (the If-Modified-Since and If-None-Match values). The server then checks them and decides that the content has not changed, returning 304 Not Modified and a body length of 0 bytes.

It seems that the ASP.NET Development Server (“Cassini”), the web server used for debugging with VS2008, does not handle static files very well. As you can see, menu.css and some other static files under ~/resources/ are transferred completely with every request. No ETag or Last-Modified values are returned for them automatically. This does not happen in real production environments with IIS, which handles static files correctly (calculating ETag and Last-Modified values) to avoid transferring them unnecessarily.

 

Resources and links.

Internet Information Services IIS optimization
For live websites (on the public internet) you can easily test whether they support conditional GETs using the HTTP compression and HTTP conditional GET test tool
Another valuable resource is Fiddler2.
The VB website project source sample is available for you to download.

2009/03/28

VIEWSTATE size minimization

This post continues the series of Internet Information Services IIS optimization. See the link if you want to follow the whole series.

According to Microsoft, in the .NET Framework Class Library documentation for the Control.ViewState property:
A server control's view state is the accumulation of all its property values. In order to preserve these values across HTTP requests, ASP.NET server controls use this property […]
That means that the bigger the contents of the control, the bigger its ViewState property must be.
What is it used for? When server technologies such as ASP, ASP.NET, PHP and so on are used, a high-level and powerful language runs on the server side. These languages have advanced server controls (such as grids, treeviews, etc.) and they can do validations of any kind (on database access, etc.). The final purpose of this high-level language is transforming the ‘server page’ into a final page that a browser can understand (HTML + JavaScript). If, on the one hand, you have server controls that are rendered into HTML when they are output to the browser, what happens when the user does a postback/callback and sends the page back to the server? Here is where the ViewState plays its role, helping to recreate the page objects at the server, in the OOP sense (<asp:TextBox ID=...), based on the HTML controls (<input name="...).

Wouldn’t it be easier to forget about all this and handle it the traditional way? For recreating simple controls such as a text box or a select box, it could be feasible to fetch the values right from the HTML, without using the ViewState, but imagine trying to recreate a GridView from only the HTML, having to strip out the formatting. Besides, without the ViewState we could not send to the server certain events, such as a change of selection in a DropDownList (the previously selected element is saved in the ViewState).

OK, we will need the ViewState after all, but is there any way of minimizing it? Yes. As Microsoft states:
View state is enabled for all server controls by default, but there are circumstances in which you will want to disable it. For more information, see Performance Overview.
If you read Performance Overview, you will be advised to:
Save server control view state only when it is required. View state enables server controls to repopulate property values on a round trip without requiring you to write code […]
Take that into consideration when writing your master pages, since most of the controls in a master page will be static (or at most written only once by the server) and probably not needed at all again in case of a postback or callback (unless, for instance, a DropDownList for changing the language of the site is placed in the master page).

When can we disable the view state? Basically, when we use data that will be read-only, or that will not be needed again by the server in case of a postback/callback, for instance a grid that does not have associated server events for sorting or selection.

There are several ways for controlling the viewstate:
  • On a page-by-page basis: If you have a particular page in which you know you will not need the viewstate, you can disable it completely in the page declaration:
    <%@ Page Title="Home" Language="C#" MasterPageFile="~/MasterPage.master" AutoEventWireup="true" CodeFile="default.aspx.cs" Inherits="_default" EnableViewState="false" %>
    However, doing so might render your master page controls (if any) unusable for that particular page. For instance, if you have a DropDownList control in your master page for changing/selecting the language of the website and you disable the viewstate for several individual pages of your site, that language selector may no longer work properly on those pages.
  • In the master page declaration: In a similar way as you do for a single page, you can also do it in the master page. The result will be that, unless you override this option for a single page (explicitly declaring single pages as having it), all pages using a master page declared this way will not have ViewState (and if they do, it will not contain any info about controls from the master page):
    <%@ Master Language="C#" AutoEventWireup="true" CodeFile="MasterPage.master.cs" Inherits="MasterPage" EnableViewState="false" %>
  • On a control-by-control basis: A more flexible (due to its granularity) way of controlling the view state is enabling/disabling it control by control:
    <asp:TextBox ID="TextBox1" runat="server" EnableViewState="false" ></asp:TextBox> 
    This will probably be the easiest method and the one that interferes least with the rest of a website; besides, its effects (on the size of the viewstate and on functionality) can be easily checked and easily reverted if something does not work.
Most of the controls in a master page fall into the light control group (see Viewstate Optimization Strategies in ASP.NET), meaning that including or excluding them from the view state makes very little difference (their footprint is very small). Even so, you should make sure you set the EnableViewState="false" attribute for them just in case.

One of the ASP.NET controls that makes the View State grow heavily is the asp:Menu control. As I showed in my previous post ASP.NET menu control optimization, moving it out of the master page and placing it in another standalone client-side cacheable file can work wonders. However, if you do not implement such a suggestion, you can at least disable the view state for the menu control. The menu control will still be rendered within every page, but the size of the View State will be significantly smaller without further effort. For one of our customers, simply adding EnableViewState="false" to the menu control definition reduced the size of their homepage (for example) from 150Kb to around 109Kb. Since the menu was in the master page, the reduction was similar for all the pages of their site.
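
In the master page, that one-line change looks like the following (the attributes shown here are only illustrative; keep whatever your own menu definition already has):

<asp:Menu ID="Menu1" runat="server" DataSourceID="SiteMapDataSource1"
    Orientation="Horizontal" EnableViewState="false">
</asp:Menu>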

Links.

Internet Information Services IIS optimization

2009/03/25

ASP.NET menu control size reduction (a graphical proof)

I have prepared a graphical proof of the kind of optimization I suggested a couple of days ago in my post ASP.NET menu control optimization. I just saved the following requests into plain txt files:

  • ~/default.aspx: The original menu sample page without any optimization
  • ~/default-optimized.aspx & ~/resources/menu-js.aspx: The optimized version that splits the menu-related html into an external client-cacheable file (that is requested only once).

Then I opened those .txt files with MS Word, reduced the font size to 6,5 for all of them (to keep the number of pages reasonable), and did some highlighting:

  • Green: The useful real contents of the page.
  • Red: The __VIEWSTATE variable.
  • Blue: Menu related code.
  • Light blue (in menu-js.aspx): Parsed & modified menu-related code, converted into a javascript string to be written directly by the browser.

~/default.aspx

The original menu page output

As you can see in the original, non-optimized version, the page is mostly filled with content related to the menu and the __VIEWSTATE variable. The worst part of this original implementation is that, in all the pages, 70%-80% of the contents are the same. The client's browser is downloading mostly the same contents again and again. My idea consists of taking those common factors out of the pages and placing them in a single, separate cacheable page.

~/default-optimized.aspx

In the optimized version, menu-related code is reduced drastically (blue). Content with a white background is master page related code (formatting).

~/resources/menu-js.aspx

Most of the menu-related code is moved to an external file that is requested at the end of the master page. That file is cached client-side and thus requested only once per session. The result is that afterwards only pages with optimal contents are downloaded.

If you like this approach and want to see the whole post explaining the idea in detail, with source code project and all the stuff, see my previous post ASP.NET menu control optimization.

Links.

Internet Information Services IIS optimization

2009/03/23

IE8 breaks <asp:Menu> control

There has been a lot of controversy since the public release of IE8 last March 19th (even before, when it was in beta) because it follows the standards and because it does not properly render <asp:Menu> controls under certain conditions (because the <asp:Menu> control developers did not follow the standards).

If you look at this error feedback in Microsoft Connect regarding the ASP.NET menu control not working in IE 8 beta 2, they closed it as ‘By design’, which means that IE8 behaves as it should and that the source of the problem is not IE8 but the ASP.NET engine. As they say, Microsoft will be releasing a KB for ASP.NET that addresses this problem sooner or later.

In the meantime there are some workarounds, as explained in ASP.NET Menu and IE8 rendering white issue:

  1. Overriding the z-index property.
  2. Using CSS Friendly Control Adapters.
  3. Adding the IE7 META tag to the project.

Up to here, nothing new under the sun; I am just introducing some facts already on the public web… The bad news of this post is that my solution for ASP.NET menu control optimization, posted yesterday, needs an update because it uses <asp:Menu> behind the scenes and shows the same behavior in IE8. The good news is that, since the menu is taken from an external file, only 2 files need to be updated, and not the master page, nor all the pages of the whole website, etc.

~/resources/menu.css should be updated to include .IE8Fix { z-index: 100; } and ~/resources/menu.aspx should be updated so that DynamicMenuStyle includes the attribute CssClass="IE8Fix".

Those little changes are already implemented in the project available for you to download, so you do not need to worry about this problem and can concentrate on optimizing.

2009/03/22

ASP.NET menu control optimization

This is the first of my posts regarding Internet Information Services IIS optimization. See the link if you want to follow the whole series.

One of the controls that our website uses the most is the <asp:Menu> control. It is used in the master page so that, in the end, it is used on every page of the site, along with breadcrumbs. I have prepared sample VS2008 website projects in VB and C# where you can see the facts and follow the steps for yourself. In this sample project, the master page sets up several ContentPlaceHolders arranged for a multicolumn webpage. One row at the top contains the logo, breadcrumbs and menu for the website, a second row with 2 columns contains the left content and main content, and a third row at the end contains the footer with fixed text for all website pages. Of course, if you want to do it right, you should not use <table> tags for the layout of the content, you should use <div> tags and CSS styles, but that is out of the scope of this post. Here we will only cover and explain a way to optimize your pages that use <asp:Menu> controls.

The layout of the master page is shown using ~/default.aspx in the following image:

Using Fiddler2, the http debugging proxy, we see this file is 18214 bytes (17,78 Kb) in size when browsed with IE7 (see the User-Agent string). I strongly recommend Fiddler2 if you want to optimize or debug your web server. It has a lot of useful features, one of the most interesting being the Timeline, which shows graphically how your server performs overall (considering all the requests for pages, css files, images, scripts, etc.), with time on the X axis. In this case we will just prepare a request using the Request Builder and see the results using the Inspector tab:

fiddler request for non-optimal page

Further analysis of the received page yields these values:

Code         Description                               Size      Percentage
CPH          ContentPlaceHolders (2)                   1,12 Kb   6,30%
VS           __VIEWSTATE                               2,81 Kb   15,80%
M            Menu contents, scripts & related styles   11,80 Kb  66,37%
T-CPH-VS-M   The rest, due to layout (master page)     2,05 Kb   11,53%
T            TOTAL                                     17,78 Kb  100,00%

As you can see, most of the content of the page is menu-related code. Furthermore, if the menu does not change between subsequent requests from the visitor (very, very probable), we are sending out the same contents again and again, since the menu is in our master page and the same menu-related content is rendered for the browser on every page. What a waste of bandwidth (probably money too, if you pay your ISP by traffic) and of your visitors' time. Bandwidth being broader and broader nowadays is no reason for wasting it absurdly.

Besides, if you can read html and look through the generated file, you will see that the html code for the menu is near the top, exactly where we placed the <asp:Menu> control in the master page. What would happen if we could delay the load of the menu while giving priority to the real contents of the page? I mean, delay the load of the menu until the contents are shown in the visitor’s browser, and then (afterwards) load the menu. That would increase the responsiveness of the website; the page would not seem stalled while loading a big menu before the actual contents. The users could start reading the contents and in the meantime, even without noticing, the menu would appear in its right place.

In subsequent requests, since the menu is already loaded, the visitor would not need to re-download those 11,80 Kb (in our case) of menu-related html. In our example, the 17,78 Kb page could be reduced to 1,12 + 2,81 + 2,05 = 5,98 Kb. The sample page would be 66% smaller, just by stripping the menu-related html out of the page and placing it in another page. This can be reduced even more by minimizing the size of the __VIEWSTATE variable, but that will be another post.

The main things to be replaced.

If you read through the html code generated for the menu, you will find several distinct pieces of code:

  • The <styles> used in the menu, in our example:
    <style type="text/css">
    .ctl00_Menu1_0 { background-color:white;visibility:hidden;display:none;position:absolute;left:0px;top:0px; } 
    .ctl00_Menu1_1 { color:Black;text-decoration:none; } 
    .ctl00_Menu1_2 { color:Black; } 
    .ctl00_Menu1_3 { } 
    .ctl00_Menu1_4 { background-color:Transparent;border-color:Transparent;padding:0px 5px 0px 5px; } 
    .ctl00_Menu1_5 { background-color:White;border-color:Transparent; } 
    .ctl00_Menu1_6 { color:Black; } 
    .ctl00_Menu1_7 { background-color:White;border-color:White;border-width:1px;border-style:solid;padding:0px 5px 0px 5px; } 
    .ctl00_Menu1_8 { background-color:White;border-color:#BBBBBB;border-width:1px;border-style:solid; } 
    .ctl00_Menu1_9 { color:White; } 
    .ctl00_Menu1_10 { color:White;background-color:#BBBBBB;border-color:Transparent; } 
    .ctl00_Menu1_11 { color:White; } 
    .ctl00_Menu1_12 { color:White;background-color:#BBBBBB;border-color:Transparent;border-width:1px;border-style:solid; } 
    </style>
  • Two calls to WebResource.axd for retrieving scripts:
    <script src="/www.mytestsite.com/WebResource.axd?d=Fg4XkH9c9OdEq6bmF8mMjg2&amp;t=633691223257795724" 
      type="text/javascript"></script>
    <script src="/www.mytestsite.com/WebResource.axd?d=-JPtlwQvfdzq429NBDEh_w2&amp;t=633691223257795724"
      type="text/javascript"></script>
  • The actual text for the menu, which is coded using tables (when the browser is IE7) and starts with the string: <a href="#ctl00_Menu1_SkipLink"><img alt...
  • Near the end of the page, there is a script that is also related to the menu, where the object is initialized with the styles and values defined for it. You will find something similar to:
    <script type="text/javascript"> 
    //<![CDATA[ var ctl00_Menu1_Data = new Object(); 
    ctl00_Menu1_Data.disappearAfter = 5000; 
    ctl00_Menu1_Data.horizontalOffset = 0; 
    ctl00_Menu1_Data.verticalOffset = 0; 
    ctl00_Menu1_Data.hoverClass = 'ctl00_Menu1_12'; 
    ctl00_Menu1_Data.hoverHyperLinkClass = 'ctl00_Menu1_11'; 
    ctl00_Menu1_Data.staticHoverClass = 'ctl00_Menu1_10'; 
    ctl00_Menu1_Data.staticHoverHyperLinkClass = 'ctl00_Menu1_9'; 
    //]]> </script>

The problem is that the ASP.NET menu control renders differently depending on the User-Agent (browser), so we cannot take these values as fixed constants to create static files with them. However, we can still do something else: create a simple page with only the menu (between searchable placeholders), self-request this menu-only file on behalf of the browser making the real request, parse (using regex) and transform the result to create a script file, cache it on the server side too (varying for every User-Agent) and return it to the browser (unless a valid cached version is already stored).

The steps.

1. Create a standalone menu.aspx file for showing the menu only.

We need to create a ~/resources/ directory under the root of the site (any other name will do the job as long as it is explicitly excluded from being browsed in robots.txt), and, as you may have imagined, modify robots.txt to insert:

User-agent: *
Disallow: /resources/ 
Disallow: /WebResource.axd

We will create a simple aspx file (not master page based) called ~/resources/menu.aspx and we will insert the <asp:SiteMapDataSource> and <asp:Menu> just as they were in the master page (copy & paste) inside the <form> tag. This way we keep the format and properties of the menu but get rid of everything else. This page will render just the menu, nothing else. Then surround the start and the end of the <asp:Menu> tags with some comments that we will use afterwards, when parsing the page, to identify exactly where the menu starts and ends (something like <!-- MENU STARTS HERE --> and <!-- MENU ENDS HERE --> will do the job). A reduced markup sketch is shown below.

menu.aspx
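
Reduced to its essentials, the markup of that standalone page looks roughly like this (the Inherits class name and the menu attributes are placeholders; reuse the exact attributes you had in the master page):

<%@ Page Language="VB" AutoEventWireup="false" CodeFile="menu.aspx.vb" Inherits="resources_menu" %>
<html>
<body>
    <form id="form1" runat="server">
        <!-- MENU STARTS HERE -->
        <asp:SiteMapDataSource ID="SiteMapDataSource1" runat="server" ShowStartingNode="false" />
        <asp:Menu ID="Menu1" runat="server" DataSourceID="SiteMapDataSource1" Orientation="Horizontal">
        </asp:Menu>
        <!-- MENU ENDS HERE -->
    </form>
</body>
</html>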

2. Create menu-js.aspx that will be called by the masterpage.

Then we need to create another web form (not master page based) that we will call ~/resources/menu-js.aspx. This .aspx file will only have the <%@ Page ... %> directive, no contents at all at design time. The contents will be generated by the code-behind, which will do the parsing of the former menu.aspx page and will be responsible for caching and sending the menu to the client’s browser after having rendered it as a javascript file. The contents of the javascript file that is sent to the client’s browser are simply:

var placement = document.getElementById("aspmenu"); 
placement.innerHTML = *** ALL THE MENU CONTENTS ***

This way the menu is rendered after the page has already been loaded and shown in the client’s browser, using javascript, because the call to this menu-js.aspx is near the end of the page. This method works in the latest versions of IE, Firefox, Safari, Opera & Chrome, provided that they have javascript enabled. In text-only browsers (Lynx and similar), or if javascript is not enabled, this method degrades nicely, not showing any menu but keeping the overall appearance provided by the master page intact.
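
A hedged sketch of that code-behind: it self-requests menu.aspx, cuts out the fragment between the two comment markers, escapes it and writes it out as JavaScript. The class name is a placeholder, and the real file in the project also caches the result per User-Agent; this is only an outline of the idea.

Imports System.Net
Imports System.Text.RegularExpressions

Partial Class resources_menu_js
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
        ' Self-request menu.aspx pretending to be the original browser, so the menu is rendered for that User-Agent
        Dim menuUrl As String = Request.Url.GetLeftPart(UriPartial.Authority) & ResolveUrl("~/resources/menu.aspx")
        Dim client As New WebClient()
        client.Headers.Add("User-Agent", Request.UserAgent)
        Dim html As String = client.DownloadString(menuUrl)

        ' Keep only the fragment between the markers and turn it into a single javascript string
        Dim menuHtml As String = Regex.Match(html, "<!-- MENU STARTS HERE -->(.*)<!-- MENU ENDS HERE -->", _
                                             RegexOptions.Singleline).Groups(1).Value
        Dim jsString As String = menuHtml.Replace("\", "\\").Replace("'", "\'") _
                                         .Replace(vbCr, " ").Replace(vbLf, " ")

        Response.ContentType = "text/javascript"
        Response.Cache.SetCacheability(HttpCacheability.Public)
        Response.Cache.SetExpires(DateTime.Now.AddMinutes(60))     ' client-side cache for an hour
        Response.Write("var placement = document.getElementById(""aspmenu"");" & vbCrLf)
        Response.Write("placement.innerHTML = '" & jsString & "';")
    End Sub
End Class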

3. Create the stylesheet for the menu.

We need to create a ~/resources/menu.css with all the styles that were defined by the original <asp:Menu> control, those named like ctl00_Menu1_xx shown before.

4. Changes in the masterpage.

4.1. Link to the former css file.

You need to include a link to the former css file in the masterpage (see MasterPage-Optimized.master file in the downloadable project, the line is <link href="~/resources/menu.css" rel="stylesheet" type="text/css" />).

4.2. Replace <asp:Menu> by identified <div>.

You must also include an empty <div> tag with id="aspmenu" in the place where the original <asp:Menu> was:

<div id="aspmenu" title="Menu"></div> 

This div tag called aspmenu is the placeholder where the javascript file will try to insert the real contents of the menu after the page has been loaded. See point 2 above, in document.getElementById("aspmenu").

4.3. Changes after the <form> tag.

Right after the <form> tag, include a literal control <asp:Literal ID="ltWebResourceMenu" runat="server" EnableViewState="false" />. In the code-behind, this will be set to <script> tags that load the files menu-webresource-axd-a.js & menu-webresource-axd-b.js that we will prepare in step 5.

4.4. Changes near the end of the masterpage.

The script that was near the end of a non-optimized page now needs to be hard-coded into the master page. Thus, right before the </body> tag, we need to write:

<script type="text/javascript"> 
//<![CDATA[ var Menu1_Data = new Object(); 
Menu1_Data.disappearAfter = 5000; 
Menu1_Data.horizontalOffset = 0; 
Menu1_Data.verticalOffset = 0; 
Menu1_Data.hoverClass = 'Menu1_12'; 
Menu1_Data.hoverHyperLinkClass = 'Menu1_11'; 
Menu1_Data.staticHoverClass = 'Menu1_10'; 
Menu1_Data.staticHoverHyperLinkClass = 'Menu1_9'; 
//]]> 
</script> 
</form> 
<asp:Literal ID="ltMenuScript" runat="server" EnableViewState="false" />

5. Save WebResource.axd used resources as static files under ~/resources/.

In the original, non-optimized ~/default.aspx you have probably noticed some lines requesting a file called WebResource.axd with 2 parameters (d & t). In our case the menu contains some resources that we will grab and save as static files:

  • <img src="/www…com/WebResource.axd?d=p51493b-… saved as menu-arrow.gif (a right arrow)
  • <script src="/www…com/WebResource.axd?d=Fg4XkH9c9O… saved as menu-webresource-axd-a.js (20,3Kb javascript file)
  • <script src="/www…com/WebResource.axd?d=-JPtlwQvfdz… saved as menu-webresource-axd-b.js (32,4Kb javascript file)
  • <img alt="Skip navigation links" … src="/…/WebResource.axd?d=vlTL… saved as menu-webresource-axd-1x1.gif (blank gif)

6. Modify menu.aspx to use those static files.

Now that we have saved those resources as static files, we need to modify the menu.aspx we created in step 1 to use these files instead of calls to WebResource.axd. This can be done using the IDE, but the result (in code) should be similar to including these attributes in the <asp:Menu> control:

DynamicPopOutImageUrl="~/images/menu-arrow.gif" 
ScrollDownImageUrl="~/images/menu-scroll-down.gif" 
ScrollUpImageUrl="~/images/menu-scroll-up.gif" 
StaticPopOutImageUrl="~/images/menu-arrow.gif"

7. Test the whole thing.

I think I have not left any step behind. Anyway, you have the whole projects (in VB and C#, around 28Kb each zip file) to download and see the idea working for yourself. After all that, using Fiddler2, if we request the page ~/default-optimized.aspx, we get the following results:

fiddler request for the optimized-menu page

The file size is 3821 bytes (the original was 18214): that means an improvement of 79% in size reduction!!! Much better than we expected; that is because the size of the __VIEWSTATE has been reduced too (since the <asp:Menu> control no longer resides in the page). Of course, the menu-js.aspx still needs to be downloaded, and its size is 9153 bytes (in our example), but using client-side caching, this file only needs to be downloaded once an hour (Response.Cache.SetExpires(DateTime.Now.AddMinutes(60))).

Another advantage of having the menu rendered in a different file is that the PageRank that any of your pages might have will not dilute its outgoing value among all the rest of the pages due to links in the menu. This way the outgoing links from your pages are far fewer: only those in the master page (which you can easily set to rel="nofollow") and those that are real links inside your content. No more outgoing links from any page to any other page because of the menu.

An alternative to my approach for improving the performance (and compatibility) of the <asp:Menu> control is the use of CSS Friendly Control Adapters. However, in that case, the menu is still rendered inside the page (not in a different page request). Their improvement reduces the html used to render the menu to about half its size by using CSS and <ul> tags instead of <table> tags. Though an improvement (mostly in compatibility), what we achieve with our approach is much better, since we strip any menu-related html out of every page and place it in another file. By using client-side caching, that file is only requested once per client/connection. Maybe a hybrid solution would be the best: using CSS Friendly Control Adapters and placing the html code related to the menu in a different page, but that has not been done yet. For now you will need to make up your mind for one or the other; you cannot have the best of both in a single solution.

I hope you find this article useful and I am willing to hear your comments about this approach to the problem.

2009-03-23 Update: I have just installed IE8 and checked the well-known issue of dropdown menus appearing as blank boxes. Unfortunately, my solution for <asp:Menu> optimization shows the same behavior, but it has been fixed in both projects (VB and C#). For more info see my post IE8 breaks asp menu control.

2009/03/21

Internet Information Services IIS optimization

It has been a long time since my last post. For the last 8 months I have been working on web pages, IIS based, ASP.NET 3.5, using master pages, and so on.

When I thought that most of the work was almost done (master page design, CSS/HTML editing, linking between pages, and so on), I faced the other side of the problem: SEO optimization, page size optimization, download times, conditional GETs, metas (title, keywords, description), page compression (gzip, deflate). The biggest part of the iceberg was under the water; I had a lot to learn, and a lot of lines to code.

Now all those things are already in place and running, so I am willing to share everything I have learnt with the community, in a series of posts that will cover:

  • ASP.NET menu control optimization; to reduce the page size, increase download speed, desirable to have in place before using conditional GETs.
  • __VIEWSTATE size minimization; in our case it simply doubled the size of the page. A proper optimization can make the page half the size (or less).
  • Conditional GET and ETag implementation for ASP.NET; generation of ETag and Last-Modified headers, when and how to return 304 – Not modified with no content (saves bandwidth and increases responsiveness of your site).
  • Solve the CryptographicException: Padding is invalid and cannot be removed when requesting WebResource.axd; this problem is somewhat common but you will fill your EventLog with these errors if you start using conditional GETs.
  • Automatic generation of meta tags: meta title, meta description, meta keywords; this way the editing of pages will be much simpler and faster.
  • URL canonicalization with 301 redirects for ASP.NET; solve problems of http/https, www/non-www, upper/lower case, dupe content indexing among others.
  • Serve different versions of robots.txt: Whether they are requested via http or https you can serve different contents for your robots.txt.
  • Enforce robots.txt directives; to ban those robots in the wild that misbehave and do not follow the rules in robots.txt; we will ban them for some months and prevent them from wasting our valuable bandwidth.
  • Distinguish a crawl from the real Googlebot from someone else pretending to be Googlebot (or any other well-known bot), in order to ban those pretenders for a while.
  • Set up honey-pots being excluded in robots.txt and ban anyone visiting that forbidden URL; very good against screen-scrapers, offline explorers, and so on.
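
To make the conditional GET item above a bit more concrete, here is a minimal sketch (VB) of the idea, not the full implementation the upcoming post will describe; it assumes the caller has already computed an ETag and a Last-Modified value for the requested content:

Imports System
Imports System.Web

' A minimal sketch of the conditional GET idea: emit the validators, compare them
' with what the client sent, and answer 304 Not Modified with an empty body when
' nothing has changed.
Public Module ConditionalGetSketch

    ' Returns True (and prepares a bodiless 304) when the client already has the
    ' current version; returns False when a full response must be rendered.
    Public Function TrySend304(ByVal context As HttpContext, _
                               ByVal etag As String, _
                               ByVal lastModified As DateTime) As Boolean
        ' Always emit the validators so the client can revalidate next time.
        context.Response.Cache.SetCacheability(HttpCacheability.Public)
        context.Response.Cache.SetETag(etag)
        context.Response.Cache.SetLastModified(lastModified)

        Dim ifNoneMatch As String = context.Request.Headers("If-None-Match")
        Dim ifModifiedSince As String = context.Request.Headers("If-Modified-Since")

        Dim notModified As Boolean = False
        If Not String.IsNullOrEmpty(ifNoneMatch) Then
            notModified = (ifNoneMatch = etag)
        Else
            Dim since As DateTime
            If DateTime.TryParse(ifModifiedSince, since) Then
                ' HTTP dates have one-second resolution.
                notModified = (lastModified <= since.AddSeconds(1))
            End If
        End If

        If notModified Then
            context.Response.StatusCode = 304
            context.Response.SuppressContent = True  ' headers only, no body
        End If
        Return notModified
    End Function

End Module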

Since we use Google Webmaster Tools and Google Analytics for all our websites, we had the opportunity to check the consequences of every change. For instance, here is the graph that shows the decrease in the number of KB downloaded per day when we enabled HTTP compression and put conditional GETs in place. Note how the number of crawled pages stays more or less the same during the period, while the KB downloaded per day slides down past mid-January (the peaks match several master page updates).

2008/01/09

Chain e-mail traffic forwarded by your own internal users... is this spam?

This entry is not about anything related to Microsoft or any of its products. I work for a medium-sized company as a DBA, but there are other things I have to handle on a daily basis. One of them is the e-mail server. It does not matter which product/vendor we use; the problem is more general than that. We all know what spam is, and there are lots of protection schemes out there: SPF, DKIM, SenderID, and many more, but... what happens when the email that is consuming your bandwidth is generated from the inside, by your own users?

Of course it is a good idea to have an Employee Internet Usage Guidelines document, or Acceptable Usage Policy, and to let your employees know about it. That policy document might start with "This policy explains what is classified as acceptable and unacceptable use of the Internet and e-mail systems...", but how do you enforce that? That is the main question: how do you enforce your e-mail policy?

If your company hosts email on its own servers, your bandwidth will be an issue. Download bandwidth is relatively cheap, but upload bandwidth is not. If an employee receives a chain email with a video attachment (those are becoming more and more popular) of 2.5 MB on average, that already has some associated cost, but the big cost (from the point of view of the IT staff) arises afterwards, when that employee decides to forward that amusing video / joke / PowerPoint presentation / document / image to his/her 15 friends and 5 relatives and the dog. When your email server tries to upload the forwarded email to 20 recipients (fortunately his/her dog does not have an email account yet), your SMTP server queue will grow by 50 MB, and your upload bandwidth will collapse for several minutes.

If the Internet connection in your company is shared by other users/services (e.g. web servers, DNS, VPNs, VoIP, file replication between branch and main offices), all of them will suffer some kind of delay during those peak minutes because a single user violated (at least) netiquette.

We have been suffering from this problem and some problematic employees for a while. No matter how much you warn them, after some months they (or others) will collapse the server again, up to the point that other users come or call the IT department asking if there is a problem with the Internet connection because it is terribly slow, or an annoyed customer calls because he has not received the email that someone promised was delivered two hours ago. Then we open the monitoring graphs, see the upload peak on the email server, go and check the queue and voila: there are tons of legitimate emails waiting to be delivered because, at some point, the outbound queue held 30 non-work-related emails totalling 80 MB.

There is an article (dated 2003) that states that 35% of corporate e-mail is non-work related. And I wonder... does that mean 35% of the bandwidth used or 35% of the number of emails? If it is 35% of the number of emails, that could easily amount to 40-50% of the bandwidth, given the sizes of the attachments. And the figures keep growing day by day, as you can see in another article, dated March 2007, which states that half of corporate web traffic is not work related. Companies are failing to enforce their Acceptable Usage Policies, basically because (in the case of email, at least) there are no built-in tools to help with this.

ISVs and MTA vendors are fighting spam thinking that the enemy is out there, that the bad guys are on the other side of our firewall, but that is only partly true. Legitimate users inside your company are the other part of the problem.

I've been thinking about this subject for a while and found a common pattern in those chain e-mails with huge attachments:

  1. An email with relatively big attachment(s) arrives at our email server.
  2. It comes from a legit user outside our company and the To: list has several recipients.
  3. After a short amount of time (from minutes to 72 hours), the same attachment(s) are sent out of the company.
  4. This time the recipients are different, but in most cases there are again more than three.

Having this pattern in mind, we have created a filter that does the following:

For incoming emails

  1. If the message has attachments and the size of the message exceeds a certain threshold, an MD5 hash is computed for every attachment.
  2. A daily log file is created that stores the MD5 hash, file size, file name of the attachment, and the To and From fields. This file stores ONLY incoming content.

For outgoing emails

  1. If the message has attachments, the number of recipients is above a certain number and the size of the message multiplied by the number of recipients is above a given threshold, MD5 hashes are calculated for every attachment in the email.
  2. Every MD5 hash is checked against the known MD5 hashes in the log files of the last n days.
  3. If a match is found, the filter notifies the MTA (the .exe program exits with ErrorCode 1) and the outgoing email can be quarantined/deleted/moved to a lower-priority queue (the action depends on the MTA filter options).

We found this approach smarter than just forbidding certain extensions from being delivered through your email server. Many email servers have options to block certain types of attachments, e.g. .pps or .mp3 files, but sooner or later you will face a case in which a legit user needs to send legitimate content of one of those blocked file types. This is a different approach: it relies on the idea of blocking outgoing content that did not originate at your site (self-made content), fine-tuned with parameters such as the number of retention days, number of recipients, attachment sizes, outgoing queue threshold, etc. The filter is content-agnostic; it does not matter what the attachment's file name or content type is: as long as it has the same MD5 hash and has been received within the defined time span, the email will be pinpointed.
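
To illustrate the idea, here is a rough sketch (VB), not the actual .exe filter; the class name, thresholds, folder layout and tab-separated log format are made up for the example:

Imports System
Imports System.IO
Imports System.Security.Cryptography

' Sketch of the core of the filter described above: hash incoming attachments into a
' daily log, then flag outgoing mails whose attachments match a hash seen recently.
Public Class AttachmentFilter

    Private ReadOnly logFolder As String
    Private ReadOnly retentionDays As Integer

    Public Sub New(ByVal logFolder As String, ByVal retentionDays As Integer)
        Me.logFolder = logFolder
        Me.retentionDays = retentionDays
    End Sub

    ' Hex-encoded MD5 of one attachment body.
    Public Shared Function HashAttachment(ByVal content As Byte()) As String
        Using hasher As MD5 = MD5.Create()
            Return BitConverter.ToString(hasher.ComputeHash(content)).Replace("-", "")
        End Using
    End Function

    ' Incoming mail: remember the attachment hash in today's log file.
    Public Sub LogIncoming(ByVal content As Byte(), ByVal fileName As String, _
                           ByVal fromAddr As String, ByVal toAddr As String)
        Dim line As String = String.Join(vbTab, New String() { _
            HashAttachment(content), content.Length.ToString(), fileName, fromAddr, toAddr})
        File.AppendAllText(Path.Combine(logFolder, DateTime.Today.ToString("yyyyMMdd") & ".log"), _
                           line & Environment.NewLine)
    End Sub

    ' Outgoing mail: True if this attachment was received within the last retentionDays days.
    Public Function IsForwardedContent(ByVal content As Byte()) As Boolean
        Dim hash As String = HashAttachment(content)
        For i As Integer = 0 To retentionDays - 1
            Dim logFile As String = Path.Combine(logFolder, _
                DateTime.Today.AddDays(-i).ToString("yyyyMMdd") & ".log")
            If File.Exists(logFile) Then
                For Each line As String In File.ReadAllLines(logFile)
                    If line.StartsWith(hash) Then Return True
                Next
            End If
        Next
        Return False
    End Function

End Class

Retention is handled here simply by looking back over the daily log files for the configured number of days; purging the older files can be left to a scheduled task.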

Keywords:
bandwidth waste, chain emails, SMTP, queue, upload, productivity losses, filter, quarantine, enforce internet usage policy, free, algorithm, procedure, method

2006/08/25

Using TOP 100 PERCENT in a view (SQL 2000 vs 2005 differences)

I'm still in the middle of migrating our SQL Server 2000 servers to 2005. So far, all I have is a SQL 2005 publisher, while the subscriber is still running the 2000 version. I ran into problems when I started publishing all my views because of the different syntax between the 2000 and 2005 versions regarding the TOP clause.

In SQL Server 2000 the syntax is TOP n [PERCENT], while in SQL Server 2005 it is TOP (expression) [PERCENT]. Please note the parentheses.

Those parentheses are the reason the publications fail when the initial snapshot is sent to the subscriber (2000 version). I have been trying to find a way to keep SQL 2005 from generating the extra parentheses, but if you use Management Studio and the query designer to create the view, there seems to be no way to get rid of them. I also tried setting the database compatibility level to 80 (2000), but it makes no difference. Whether the compatibility level is 90 (2005) or 80 (2000), the extra parentheses are still there, making it impossible to publish a view that has an ORDER BY clause (for instance) to SQL Server 2000 subscribers.

Solution: I had to remove all TOP (n) PERCENT and ORDER BY clauses to be able to publish my views correctly, and modify my application accordingly. Of course, if you do not mind writing the T-SQL by hand instead of using the visual query designer, you could still create your views with the 2000 syntax (without the parentheses), which is still valid in 2005; but in my case it was not worth the extra work, so I decided to keep using the query designer and remove the TOP clauses.

Just for your information, I found the document SQL 2000 v. 2005 - Using Top 100 Percent in a View, which shed some light on the drawbacks of having TOP clauses in a view:

The obvious Pro is simplicity in access. While adding the ORDER BY to the query against the view really isn't all that difficult, it does make it a bit easier for quick/simple query access. BUT - there's a HUGE con here too. If the view starts getting used for other purposes (like in joins to other tables), then the being ordered before the joins, etc. can cause you an additional step that is NOT necessary. As a result, performance was compromised.

Because of those performance problems, I decided to solve my problem the hard way: removing them completely. Now I have to apply the ORDER BY clauses elsewhere (in the client application), not in the views.