2009/11/21

Segmentation fault on Ubuntu 9.10 Server under Windows 7 x64 Virtual PC

I have been using Ubuntu since version 8.10 Intrepid Ibex and I was anxiously awaiting the release of Ubuntu 9.10 Karmic Koala some weeks ago. In previous versions of Ubuntu it was a nightmare to have it running under Microsoft virtual environments (i.e. Virtual Server 2007, Virtual PC 2004, Virtual PC 2007 and so on). Problems with screen resolutions, bouncing mouse cursors, and skewing clocks were common and somewhat hard to solve for the novice Linux user I was back then.

The fact is that I tried Ubuntu 9.10 Server beta, some weeks before the final version was released, under Virtual PC 2007 on Windows Vista Ultimate x64 on my desktop computer. When the bare-bones LAMP server was installed, I logged in and installed gnome-desktop, crossing my fingers. I was pleasantly surprised that everything worked fine right after the reboot: no screen flickering, no bouncing mouse, all OK. Great. It was still the beta but it was a promising start.

When the final version of Ubuntu Server 9.10 was released, I downloaded the ISO and tried to install it on my laptop, which runs Windows 7 x64 with Virtual PC (the one shipped with Windows 7, not the Virtual PC 2007 that you must use in Vista). All my expectations fell helplessly into the mud.

Everything seemed to be fine when the installer told me to reboot the system for the first time:

[Screenshot: Installation is complete]

I rebooted the virtual machine and … oops… segmentation fault. What? I rebooted once again and got the same error: segmentation fault. Sometimes the virtual machine window simply closed; otherwise the console showed me the same error: Segmentation fault and rubbish all over the screen.

[Screenshots: Segmentation fault 1, Segmentation fault 2, Segmentation fault 3]

There was no way I could run Ubuntu 9.10 Server under Virtual PC from Windows 7 x64. I tried various install configurations (LAMP, DNS, nothing at all) and different RAM sizes, and I even tried changing some settings in the guest BIOS, all without any luck. In all cases, when the machine booted, I got the segmentation fault error.

I then read some documentation about segfault errors in general and found that they happen when the code being executed tries to read or write a memory location that it should not access, or an invalid memory address.

It sounded to me like something related to Data Execution Prevention, or DEP. You can find those settings in Windows 7 under System Properties –> Advanced Options –> Performance settings –> Data Execution Prevention.

I tried to disable Data Execution Prevention for %windir%\system32\vpc.exe (the executable file for Virtual PC) but since it was a 64-bit system I got an error message: You cannot set DEP attributes on 64 bit executables. No luck this way either.

According to Microsoft about Data Execution Prevention:

32-bit versions of Windows Server 2003 with Service Pack 1 utilize the no-execute page-protection (NX) processor feature as defined by AMD or the Execute Disable bit (XD) feature as defined by Intel. In order to use these processor features, the processor must be running in Physical Address Extension (PAE) mode. The 64-bit versions of Windows use the NX or XD processor feature on 64-bit extensions processors and certain values of the access rights page table entry (PTE) field on IPF processors.

XD processor feature? Umhhh, my BIOS (the laptop's physical one) had such a thing… My laptop is a Dell Vostro 1700 and it has a setting called CPU XD Support. Why not try disabling it? I rebooted to check that setting and saw that it was Enabled (by default). Just to run one more test, I disabled it and restarted.

[Screenshot: CPU XD (Execute Disable) Support]

I then started the Ubuntu 9.10 Server virtual machine and… it worked!!! I was even able to install gnome-desktop, and everything worked just as it had under Windows Vista on my desktop computer.

But is it safe to disable such a feature for the whole system? Just to be able to try and play with Ubuntu as a VM sometimes? I suppose not. I then rebooted and reset the value to Enabled (by default).

There must be something wrong with either Ubuntu or Virtual PC. Maybe Ubuntu is trying to execute certain memory addresses that are code for the guest, but data for the host. I don’t know.

At least I have found a workaround for the problem. Whenever I want to test something in Ubuntu, it costs me a reboot, a change of the CPU XD Support value in the BIOS and a restart… ah… and another reboot to change it back to the safe value.

If someone else finds a better workaround for this problem, I am willing to hear about it!

2009/10/30

Windows 7: Disable builtin DHCP server for “Internal network” in Virtual PC

I recently installed Windows 7 and I had been waiting for the final release of XP Mode and Virtual PC, which came out on the 22nd of October. I previously had (in Windows Vista and using Virtual PC 2007) a virtual domain, composed of virtual machines such as:

  • server2003: a domain controller and DHCP server, with fixed IP address, connected to the “internal network” of Virtual PC.
  • isa2006: with two interfaces (dual homed), one connected to the physical host network adapter (for connecting to the internet), the other one connected to the “internal network”. Both IPs are manually set.
  • sql2008: the database server for the tests with this virtual domain, IP address assigned dynamically through DHCP.
  • vs2008xp: a Windows XP with Visual Studio 2008, belonging to the domain for testing and developing, IP configured through DHCP (that should be handled by server2003).

With such a testing environment, all traffic that should go to/from the internet passes through isa2006. If isa2006 is not running (for instance), the virtual domain is isolated and the virtual machines can only see each other (the members of the domain).

This was the scenario that I had configured in my old Vista using Virtual PC 2007, and I wanted to reuse the .vhd files so that I would not need to rebuild the playground from scratch.

It was quite simple: I just recreated every single virtual machine using the wizard, and when asked for the hard disk, I selected ‘the existing one’ instead of creating an empty one. Then, when each machine was first started, I reinstalled the Virtual Machine Additions (now called Integration Components), and after a couple of restarts everything seemed to be working… but it only seemed so.

Then I realized that sql2008 and vs2008xp (both configured to use dynamic IPs through DHCP) could not browse the internet, nor ping any other server in the domain. They were using the “Internal network”, but their IP addresses had not been assigned by the DHCP server running on server2003, since they were not in the expected range/mask.

After Googling for a while I learned that Virtual PC has its own builtin DHCP server and it seems it is (incorrectly) enabled for the “Internal network”. Fortunately there is a fix for it:

  1. Turn off or hibernate all your running Virtual Machines.
  2. From the Task manager, kill vpc.exe if it does not exit on its own.
  3. Edit "%localappdata%\microsoft\Windows Virtual PC\options.xml"
  4. Search for the “Internal network” section, and then inside the <dhcp> section, disable it: <enabled type="boolean">false</enabled> and save the file. You can keep a backup of the original xml file just in case.
  5. Turn on your VMs and verify everything runs as expected.
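
For those who prefer to script step 4, here is a minimal Python sketch. It works on an XML string only, and the sample layout (the preferences and virtual_network element names, and the name child holding the network name) is an assumption made for illustration; only the <dhcp> section and its <enabled type="boolean"> element come from the steps above. Check your own options.xml, and keep a backup, before adapting it.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, for illustration only: the real options.xml may use
# different element names around the <dhcp> section.
SAMPLE_OPTIONS = """<preferences>
  <virtual_network>
    <name type="string">Internal Network</name>
    <dhcp>
      <enabled type="boolean">true</enabled>
    </dhcp>
  </virtual_network>
  <virtual_network>
    <name type="string">Some physical adapter</name>
    <dhcp>
      <enabled type="boolean">true</enabled>
    </dhcp>
  </virtual_network>
</preferences>"""


def disable_internal_dhcp(xml_text, network_name="Internal Network"):
    """Set dhcp/enabled to false inside the section named network_name.

    Returns the modified XML text and the number of elements changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for section in root.iter():
        # A section qualifies when it has a direct <name> child that matches.
        name = section.find("name")
        if name is None:
            continue
        if (name.text or "").strip().lower() != network_name.lower():
            continue
        for enabled in section.findall(".//dhcp/enabled"):
            enabled.text = "false"
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    new_xml, changed = disable_internal_dhcp(SAMPLE_OPTIONS)
    print(changed, "dhcp section(s) disabled")
```

Reading and writing the real file (and respecting its encoding) is deliberately left out; the sketch only shows which element is toggled.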

2009/10/11

URL Canonicalization with 301 redirects for ASP.NET

There are lots of pages talking about the benefits of canonicalization (c14n for short). It is commonly agreed that it is just a set of rules for getting our pages indexed in the most standardized, simplified and optimal way possible. This allows us to consolidate our PageRank instead of having it spread among all the possible ways of writing a URL for a particular page. In this post we will cover some canonicalization cases and their implementations for our IIS server running ASP.NET.

These different cases include:

  • Secure versus non secure versions of a page: Are http://www.example.com and https://www.example.com the same?
  • Upper and lowercase characters in the URL: Are ~/Default.aspx, ~/default.aspx and ~/DeFaUlT.aspx the same page?
  • www versus non-www domain: Do http://example.com and http://www.example.com return the same contents?
  • Parameters in the QueryString: Should ~/page.aspx?a=123&b=987 and ~/page.aspx?b=987&a=123 be considered the same? Are we handling bogus parameters? What happens if someone links us with a parameter that is not expected/used such as ~/page.aspx?useless=33 ?
  • Percent encoding: Do ~/page.aspx?p=d%0Fa and ~/page.aspx?p=d%0fa return the same page?

If your answer is yes in all cases, you must keep on reading. If you only answer yes in some cases, this post will still be interesting for you; you can skip the points that do not apply to your scenario by commenting out some lines of code, or modify them to match your needs. A sample VS2008 website project with full VB source code is available for download.

In our sample code we will be following these assumptions:

  • We prefer the non-secure version over the secure version, except for some particular (secure) paths: If we receive an https request from a non-authenticated user for a page that should not be served as secure, we will do a 301 redirect to the same requested URL but without the secure ‘s’.
  • We will prefer lowercase for all the URLs: If we receive a request that contains any uppercase character (parameter names and their values are not considered), we will do a permanent 301 redirect to the lowercase variant of the URL being requested.
  • www vs. non-www should be handled by creating a new website in IIS for the non-www version and placing there a 301 redirect to the www version. This case is not covered by our code in ASP.NET since it only needs some IIS configuration work.
  • The parameters must be alphabetically ordered: If we receive a request for ~/page.aspx?b=987&a=123, we will do a permanent redirect to ~/page.aspx?a=123&b=987, since a sorts before b alphabetically. Regarding lowercase and uppercase variants, either in the name of a parameter or in its value, we will consider them to be different pages; in other words, no redirect will be done whether the name of a QueryString parameter is found in upper, mixed or lowercase. The same applies to the values of those parameters: ~/page.aspx?a=3T, ~/page.aspx?A=3T and ~/page.aspx?a=3t will be considered different pages and no redirection will be done. In pages that accept parameters, extra coding must be done to check that only the allowed parameters are used.
  • We will prefer percent-encoded characters in their uppercase variant; for that reason %2f, for instance, will be redirected to %2F whenever it appears in the value of any parameter. This way we follow RFC 3986, which states:
    Although host is case-insensitive, producers and normalizers should use lowercase for registered names and hexadecimal addresses for the sake of uniformity, while only using uppercase letters for percent-encodings.
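
To make the assumptions above concrete, here is a minimal sketch of the rules in Python. It is not the code from the sample VB project, just an illustration of the logic: the function name canonical_url and the secure_prefixes parameter are hypothetical, the authentication check is left out, and the www rule is omitted because it is handled in IIS.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches one percent-encoded triplet such as %2f or %0F.
_PERCENT_TRIPLET = re.compile(r"%[0-9a-fA-F]{2}")


def _uppercase_triplets(text):
    """Uppercase the hex digits of every percent-encoded triplet."""
    return _PERCENT_TRIPLET.sub(lambda m: m.group(0).upper(), text)


def canonical_url(url, secure_prefixes=()):
    """Return the canonical form of url under the assumptions listed above.

    secure_prefixes is a hypothetical tuple of lowercase path prefixes that
    are allowed to stay on https (for example ("/secure/",)).
    """
    parts = urlsplit(url)

    # Lowercase the path only; parameter names and values keep their case.
    # Triplets in the path are re-uppercased (an extension of the rule,
    # which the post states for parameter values).
    path = _uppercase_triplets(parts.path.lower())

    # Prefer http unless the path is one of the secure ones.
    scheme = parts.scheme
    if scheme == "https" and not path.startswith(tuple(secure_prefixes)):
        scheme = "http"

    # Sort the parameters alphabetically by name.
    pairs = [p for p in parts.query.split("&") if p]
    pairs.sort(key=lambda p: p.split("=", 1)[0])
    query = _uppercase_triplets("&".join(pairs))

    return urlunsplit((scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    requested = "https://www.example.com/Page.aspx?b=987&a=123"
    target = canonical_url(requested)
    if target != requested:
        print("301 ->", target)
```

Comparing the requested URL with the canonical one, and answering with a permanent 301 redirect when they differ, keeps all the rules in a single place.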

<link rel="canonical" …>

In February 2009 Google announced through their Google Webmaster Central Blog a way for you to explicitly declare your preferred canonical version for every page (see Specify your canonical). By simply adding a <link> tag inside the <head> section of your pages, you can tell spiders the way you prefer them to index your content, the canonical preferred way. This helps to concentrate the GoogleJuice on that particular canonical URL from any other URL version or variation pointing to it in this way (the link rel=canonical way). This very same method was later adopted by Ask.com, Microsoft Live Search and Yahoo!, so it can be considered a de facto standard.

We will adopt this relatively new feature in our sample code. Most of the time we will be using permanent 301 redirects, but there might be cases where you do not want to do a redirect and prefer to simply return the requested page as is (with no redirection), returning the canonical URL as a hint for search engines. Whenever we receive a request for a page that includes bogus parameters in the query string, we will handle the request as a normal one but we will discard the useless parameters when calculating the link rel=canonical version of the page.

In particular, if you are using Google Adwords, your landing pages will be hit with an additional parameter called gclid that is used for Adwords auto-tagging. We do not want to handle those requests differently, nor treat them as errors in any way. We will only discard the unknown variables when creating the rel=canonical URL for any request.
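
As an illustration of discarding unknown parameters, the following Python sketch builds the link tag for a request. The function name canonical_link_tag and the allowed_params argument are hypothetical; the sample VB project may organise this differently.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(url, allowed_params):
    """Build a <link rel="canonical"> tag, dropping unknown parameters.

    allowed_params is the (hypothetical) set of parameter names that the
    page really uses; anything else, such as gclid, is discarded.
    """
    parts = urlsplit(url)
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0] in allowed_params
    ]
    kept.sort(key=lambda pair: pair.split("=", 1)[0])
    canonical = urlunsplit(
        (parts.scheme, parts.netloc, parts.path.lower(), "&".join(kept), "")
    )
    # The URL goes into an HTML attribute, so & must become &amp;.
    return '<link rel="canonical" href="%s" />' % escape(canonical, quote=True)


if __name__ == "__main__":
    print(canonical_link_tag(
        "http://www.example.com/page.aspx?gclid=abc&b=987&a=123", {"a", "b"}))
```

The request itself is served normally; only the hint given to search engines ignores the extra parameters.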

Related links.

Internet Information Services IIS optimization
Are upper- and lower-case URLs the same page?
Google Indexing my Home Page as https://. 
http:// and https:// - Duplicate Content? 
SEO advice: url canonicalization

Q: What about https and http versions? I have a site is indexed for https, in place of http. I am sure this too is a form of canonical URIs and how do you suggest we go about it?
A: Google can crawl https just fine, but I might lean toward doing a 301 redirect to the http version (assuming that e.g. the browser doesn’t support cookies, which Googlebot doesn’t).

Specify your canonical

Keywords.

canonicalization, seo, optimization, link, rel, canonical, c14n, asp.net, http vs. https, uppercase vs. lowercase

2009/08/30

Automatic generation of META tags for ASP.NET

Some of the well-known tags commonly used in SEO are the following three meta tags: the meta title tag, the meta keywords tag and the meta description tag.

<meta name="title" content="title goes here" />
<meta name="keywords" content="keywords, for, the, page, go, here" />
<meta name="description" content="Here you will find a textual description of the page" />

A lot has been written about the benefits of using them, and almost as much claiming that search engines no longer consider them. Anyway, whether or not they are used in the calculation of the SERP (Search Engine Results Page), nobody disputes the benefits of having them correctly set on all your pages. At least meta description tags are somehow considered by Google, since Google Webmaster Tools warns you about pages with duplicate meta descriptions:

Differentiate the descriptions for different pages. Using identical or similar descriptions on every page of a site isn't very helpful when individual pages appear in the web results. In these cases we're less likely to display the boilerplate text. Wherever possible, create descriptions that accurately describe the specific page. [...]

Download the VB project code

The question is not “should I use meta tags in my pages?”, the real question (and here comes the problem) is “how can I manage to create individual meta descriptions for all my pages?” and “how can I automate the process of creating meta keywords?”. That would be too much work (or too much technical work) for you (or your users, if they create content on their own).

For instance, consider a CMS (Content Management System) in which users are prompted for some fields to create a new entry. In the simplest form, the CMS can ask the user to enter a title and content for the new entry. In advanced-user mode, the CMS could also ask the user to suggest some keywords, but the user will probably enter just two, three or four words (if any). The CMS needs a way to automatically guess and suggest a default set of meta keywords based on the content before finally saving the new entry. Those could be checked, and eventually completed by the user, and then accepted. Meta title and meta descriptions are much easier, but will also be covered in our code.

In our sample VB project we will not suggest keywords for the user to confirm; we will just calculate them on the fly and set them without user intervention. We will use a dummy VirtualPathProvider that overrides the GetFile function in order to retrieve the virtualPath file from the real file system, so it is not a real VirtualPathProvider in the full sense, just a wrapper to take control of the files being served to ASP.NET before they are actually compiled. A VirtualPathProvider is commonly used to seamlessly integrate path URLs with databases or any other source of data other than the file system itself. Our custom class inheriting from VirtualPathProvider will be called FileWrapperPathProvider. In our case it will not use the full potential of VirtualPathProviders, since we will only retrieve the data from the file system, make minor changes to the source code on the fly and return it to be compiled. This will introduce a bit of overhead and some extra CPU cycles before the compilation of the pages, but that will only happen once, until the file needs to be compiled again (because the underlying file has changed, for instance).

Our FileWrapperPathProvider.GetFile function will return a FileWrapperVirtualFile whenever the requested virtualPath meets the conditions of the IsPathVirtual function: the file extension is .aspx or .aspx.vb and the path of the requested URL follows the scheme ~/xx/, that is to say, it is under a folder of two characters (for the language: ~/en/, ~/de/, ~/fr/, ~/es/, …). Otherwise, it will return a VirtualFile handled by the previously registered VirtualPathProvider, i.e. none, or the filesystem itself without any change.

We have chosen to use a VirtualPathProvider wrapper around the real file system just to show what kind of things can be done with that class. If your data is in a database instead of static files, you will probably be using your own VirtualPathProvider, and in that case it will work by virtualizing the path being requested and retrieving the file contents from the database instead of the filesystem. Whatever the case, you can adapt it to your scenario in order to make use of the idea that we will illustrate in this post.
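
The test described for IsPathVirtual can be sketched as a single regular expression. This is a Python illustration of the condition, not the VB function from the sample project; it assumes that the two characters are letters and that pages in subfolders below the language folder also qualify.

```python
import re

# "~/xx/" followed by anything that ends in .aspx or .aspx.vb.
# Assumptions: the two characters are letters, and pages nested deeper
# than the language folder are accepted as well.
_LANGUAGE_PAGE = re.compile(r"^~/[a-z]{2}/.*\.aspx(\.vb)?$", re.IGNORECASE)


def is_path_virtual(virtual_path):
    """Return True when the path should be served through the wrapper."""
    return bool(_LANGUAGE_PAGE.match(virtual_path))


if __name__ == "__main__":
    for path in ("~/en/default.aspx", "~/es/default.aspx.vb",
                 "~/default.aspx", "~/images/logo.png"):
        print(path, is_path_virtual(path))
```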

The idea is somewhat twisted or cumbersome:

  1. Parse the code-behind file for the page being requested (the .aspx.vb file) and, using regular expressions (regex), replace the base class so that the page no longer inherits from System.Web.UI.Page and inherits from System_Web_UI_ProxyPage instead (a custom class of our own). This proxy page class declares public MetaTitle, MetaDescription and MetaKeywords properties and links them to the underlying meta title, meta description and meta keywords declared inside the head tag in the masterpage. When a page inherits from our System_Web_UI_ProxyPage, it will expose those 3 properties, which can be easily set. See System_Web_UI_ProxyPage.OnLoad in our sample project.
  2. Read and parse the .aspx file linked to the former .aspx.vb file (the same name without the .vb) and make a call to the JAGBarcelo.MetasGen.GuessMetasFromString method, which does the main job with the file contents. See the FileWrapperVirtualFile.Open function in the sample project.
  3. Besides changing the base class to our own, we add some lines to create (or extend) the Page_Init method in that .aspx.vb file. In those few lines of code that are added on the fly we set the three properties exposed by the System_Web_UI_ProxyPage class, using the values that we have just calculated.
  4. Return the Stream as the output of the VirtualFile.Open function with the modified contents so that it can be compiled by the ASP.NET engine, based on the underlying real file, using the formerly calculated meta title, meta keywords and meta description. Note that this is done in memory; the actual filesystem is not written at any time. The real files are read (.aspx.vb and .aspx), parsed, and the virtual contents created on the fly and given to ASP.NET. You need to be really careful, since you can run into compile-time errors in places that will be hard to understand: the filesystem version of the files is the base content, but not the actual content being compiled.
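
Steps 1 and 3 boil down to a text transformation of the code-behind source. The Python sketch below illustrates that transformation under simplifying assumptions: it only covers the "create Page_Init" case (not extending an existing one), it inserts the method before the last End Class, and the helper names are hypothetical. The class and property names are the ones mentioned above.

```python
import re

_BASE_CLASS = re.compile(r"(Inherits\s+)System\.Web\.UI\.Page\b", re.IGNORECASE)
_END_CLASS = re.compile(r"^[ \t]*End[ \t]+Class[ \t]*\r?$",
                        re.IGNORECASE | re.MULTILINE)


def vb_string(text):
    """Quote text as a single-line VB string literal (quotes are doubled)."""
    return '"' + " ".join(text.split()).replace('"', '""') + '"'


def rewrite_code_behind(source, title, description, keywords):
    """Swap the base class and inject a Page_Init that sets the metas."""
    source = _BASE_CLASS.sub(r"\1System_Web_UI_ProxyPage", source, count=1)

    page_init = (
        "    Protected Sub Page_Init(ByVal sender As Object, "
        "ByVal e As System.EventArgs) Handles Me.Init\n"
        "        Me.MetaTitle = " + vb_string(title) + "\n"
        "        Me.MetaDescription = " + vb_string(description) + "\n"
        "        Me.MetaKeywords = " + vb_string(keywords) + "\n"
        "    End Sub\n"
    )

    matches = list(_END_CLASS.finditer(source))
    if not matches:
        return source  # nothing recognisable to extend; leave it untouched
    position = matches[-1].start()
    return source[:position] + page_init + source[position:]


if __name__ == "__main__":
    original = ("Partial Class en_Default\n"
                "    Inherits System.Web.UI.Page\n"
                "\n"
                "End Class\n")
    print(rewrite_code_behind(original, "My title", "My description", "a, b"))
```

Printing the rewritten source, as in the usage lines, is also a handy debugging aid, since the text that ASP.NET compiles never exists on disk.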

The way we calculate the metas in JAGBarcelo.MetasGen.GuessMetasFromString is:

  1. Select the proper set of noise words depending on the language of the text.
  2. Look for the content inside the whole text. It must be inside ContentPlaceholders (we will suppose you will be using masterpages), and we will look for a particular ContentPlaceHolder that contains the main body/contents of the page. Change the LookForThisContentPlaceHolder const inside MetasGen.vb file in order to customise it for your own masterpage ContentPlaceHolder's names.
  3. Calculate the meta title as the text within the first <h1> tags right after the searched ContentPlaceHolder.
  4. Iterate through the rest of the content, counting word occurrences and two-word phrases occurrences, discarding noise words for the given language.
  5. Calculate the keywords, creating a string that will be filled with the most frequent single-word occurrences (up to 190 characters), and two-word most frequent occurrences (up to 250 characters in total).
  6. Calculate the description, concatenating content previously parsed, to create a string of between 200 and 394 characters. Those two figures are not randomly chosen: Google Webmaster Tools warns you when any of your pages has a meta description shorter than 200 or longer than 394 characters (based on my experience).
  7. Return the calculated title, keywords and description in the proper ByRef parameters.

A good thing about this approach, using a VirtualFile, is that you can easily apply it to your existing website. No matter how many pages your site has (hundreds, thousands, ...), this code adds meta title, meta keywords and meta descriptions to all your pages automatically, transparently, without user intervention and with very few modifications (if any) to your existing pages, and it scales well.

Counting word occurrences.

We iterate through the words within the text under consideration (ContentPlaceHolder) and store their occurrences in a HashTable (ht1 for single words and ht2 for two-word phrases). All words are considered in their lowercase variant. A word must have more than two characters to be taken into account and must not start with a number. If it passes that fast test, it is checked against a noise-word list. If it is not a noise word, it is checked against the existing values in the proper HashTable and either included (ht1.Add(word, 1)) or, if it was already there, its value incremented (ht1(word) = ht1(word) + 1).
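
The counting can be sketched in a few lines of Python, with Counter objects standing in for the ht1 and ht2 HashTables. The tiny noise-word set is a placeholder for the real lists, and requiring both words of a two-word phrase to pass the filter is an assumption; the sample VB project may treat phrases differently.

```python
import re
from collections import Counter

# Placeholder list; the real project uses much longer per-language lists.
NOISE_WORDS_EN = frozenset({"the", "and", "for", "are", "was", "with", "that"})


def is_candidate(word, noise_words):
    """More than two characters, not starting with a digit, not noise."""
    return len(word) > 2 and not word[0].isdigit() and word not in noise_words


def count_occurrences(text, noise_words=NOISE_WORDS_EN):
    """Return (single-word counts, two-word phrase counts) for the text."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    singles = Counter(w for w in words if is_candidate(w, noise_words))
    # Assumption: a phrase counts only when both of its words qualify.
    pairs = Counter(
        first + " " + second
        for first, second in zip(words, words[1:])
        if is_candidate(first, noise_words) and is_candidate(second, noise_words)
    )
    return singles, pairs


if __name__ == "__main__":
    ht1, ht2 = count_occurrences("The virtual machine and the virtual machine")
    print(ht1.most_common(), ht2.most_common())
```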

Regarding the noise words, we first considered some word frequency lists available out there, but then we thought about using verb conjugations as well. So we first created MostCommonWordsEN, an array based on simple frequency lists, and then we also created MostCommonVerbsEN, based on another frequency list which considered only verbs. Finally we created MostCommonConjugatedVerbsEN, where we stored all the conjugations of those most common English verbs. When checking a word against these word lists we only use MostCommonWordsXX and MostCommonConjugatedVerbsXX (where XX is one of EN, ES, FR, DE, IT). Yes, we did the same for other languages like Spanish, French, German and Italian, whose conjugations are much more complex than the -ed, -ing and -s endings. For automatic generation of all possible conjugations of the given verbs (in their infinitive form) we used http://www.verbix.com/

Calculating meta title.

It will be the text enclosed between the first <h1> and </h1> heading tags right after the main ContentPlaceHolder of the page.

Calculating meta description.

Most of the time, a description of what a whole text is about is (or at least should be) within its first paragraphs. Based on that assumption, we try to parse and concatenate text within paragraphs (<p></p> tags) after the first <h1> tag. In our experience, when the meta description tag is longer than 394 characters, Google Webmaster Tools complains about it being too long. With that in mind, we concatenate html-cleaned text from the first paragraphs of the text to create the meta description tag, ensuring it is not longer than 394 characters. Once we know how our meta descriptions are automatically created, all we need to do is create our pages starting with an <h1> header tag followed by one or more paragraphs (<p></p>) that will be the source for the meta description of the page. This will be suitable for most scenarios. In other cases, you should modify the way you create your pages or update the code to match your needs.
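
A rough Python sketch of the title and description guessing follows. It takes the markup already extracted from the main ContentPlaceHolder, and the function names are hypothetical. How the sample VB project trims an over-long description is not described here, so cutting at the last word boundary is an assumption.

```python
import re
from html import unescape

_H1 = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
_P = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def clean_html(fragment):
    """Strip tags, decode entities and collapse whitespace."""
    return " ".join(unescape(_TAG.sub(" ", fragment)).split())


def guess_title_and_description(content_html, min_len=200, max_len=394):
    """Title: first <h1>. Description: the paragraphs that follow it."""
    heading = _H1.search(content_html)
    title = clean_html(heading.group(1)) if heading else ""
    rest = content_html[heading.end():] if heading else content_html

    description = ""
    for paragraph in _P.finditer(rest):
        text = clean_html(paragraph.group(1))
        if not text:
            continue
        candidate = (description + " " + text).strip()
        if len(candidate) > max_len:
            # Assumption: cut at the last word boundary before the limit.
            description = candidate[:max_len].rsplit(" ", 1)[0]
            break
        description = candidate
        if len(description) >= min_len:
            break
    return title, description


if __name__ == "__main__":
    page = "<h1>Hello <em>world</em></h1><p>First paragraph.</p>"
    print(guess_title_and_description(page))
```

If the page has too little text, the description simply stays shorter than 200 characters; the sketch does not try to pad it.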

Calculating meta keywords.

Given those noise-word lists for a given language, calculating the occurrences of keywords (single words) and key phrases (two words) within the text was straightforward. We just iterate through the text, check against the noise words, and add a new keyword or increment the frequency if the given keyword is already in the HashTable. At the end of the iteration, we sort the HashTables by descending frequency (using a custom class implementing System.Collections.IComparer). The final keywords list is a combination of the most frequent single keywords (ht1), up to 190 characters, and the most common two-word key phrases (ht2), up to a maximum of 250 characters in total. All of them will be comma-separated values in lowercase.
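
The final assembly can be sketched as follows in Python, taking the two frequency tables as input. The ", " separator, the alphabetical tie-break between equally frequent entries and the decision to stop at the first entry that does not fit are assumptions; only the 190 and 250 character limits come from the description above.

```python
from collections import Counter


def by_frequency(counts):
    """Entries sorted by descending frequency (ties broken alphabetically)."""
    return [word for word, _ in
            sorted(counts.items(), key=lambda item: (-item[1], item[0]))]


def build_keywords(singles, pairs, single_limit=190, total_limit=250):
    """Fill with single words up to single_limit, then phrases up to total_limit."""
    keywords = []
    length = 0
    for entries, limit in ((by_frequency(singles), single_limit),
                           (by_frequency(pairs), total_limit)):
        for entry in entries:
            extra = len(entry) + (2 if keywords else 0)  # 2 for ", "
            if length + extra > limit:
                break
            keywords.append(entry)
            length += extra
    return ", ".join(keywords)


if __name__ == "__main__":
    ht1 = Counter({"ubuntu": 3, "server": 2})
    ht2 = Counter({"virtual machine": 2})
    print(build_keywords(ht1, ht2))
```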

Summary.

Having meta tags correctly set is a must; however, it is sometimes difficult to set them manually on every page, let alone remember all the possible keyword combinations. All too often only a few words are added, and this is where automatic keyword handling can help. If you think this might be your case, please download our sample VB project and give it a try (and a few debug traces too). I will be waiting for your comments.

Links.

Internet Information Services IIS optimization