How to detect the user’s preferred language – smarter than Google

Sometimes I find Google annoying. Don’t get me wrong, I am not one of the many Google haters out there. They have done amazing things, and especially since they released WebM as an open source, royalty-free video format, they have earned a bonus in my attitude towards them. And even though Google Chrome is not my personal favorite, I have to admit that Chrome has brought great momentum to the development of web browsers (and therefore the WWW as a whole), which users of any browser benefit from. Last but not least, I use a lot of Google’s services and they do their job quite well. So all in all, Google’s record from my point of view is pretty good.

But one thing really sucks. I use an English operating system. I use an English Firefox (locale en-US). All the language preferences in the Google profiles of my accounts are set to English. Nevertheless, I often get pages from Google delivered in German by default. How they do that is quite obvious, or let’s say, there is only one plausible explanation for how they do it: they identify my home country based on my IP address. There are services that provide this data and they are easy to implement, sometimes even for free, like MaxMind’s GeoLite Country database. I use these services too, and I believe there is nothing wrong in doing so, except for a few things, and one of them is guessing a web page visitor’s language preference from his or her assumed home country.

Thankfully, I am only a lightweight victim of this practice, because I speak the language which I get served based on their guesses. But how happy would a Japanese tourist on vacation in Austria be to get his pages delivered in German, only because he uses an Austrian IP address? There are many reasons why a person located in one country may not understand the language that is spoken there. I only have to drive about 20 kilometers to get into such a country (the Czech Republic). So to assume that a person’s preferred language is the language spoken by the majority of a country (what about minorities or countries that have more than one language?) is very wrong.

What are better solutions? As I mentioned, I use an English operating system and an English browser. Almost every browser sends a user agent string. The user agent string I currently send most often, for example, is “Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7”. Every web server that receives a request from a user agent (usually a browser) can read this user agent string and process it accordingly.

What can we read in mine, amongst other information? en-US. So the locale tells a web server that it’s a browser (Firefox 3.6.7) in American English. Now guess what my preferred language will be? Is it likely that I use a browser in a language which I don’t want? Is it likely that a tourist in a different country uses a browser in a language that he or she doesn’t want? It’s certainly possible (for example at a terminal in a hotel or in an Internet café), but nowadays most people carry their own laptops, notebooks, smart phones etc. And their browser language is still their preferred one, regardless of where they are currently located. So the browser language is a much better criterion for guessing a web site visitor’s preferred language.
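To make the idea concrete, here is a minimal sketch of pulling the locale token out of a user agent string like the one above. The helper name locale_from_user_agent() is made up for this illustration, and the regular expression is only an assumption about how such strings tend to look; not every browser includes a locale token in its user agent string, so treat this as a demonstration rather than something to rely on.

```php
<?php
// Hypothetical helper (not part of any library): extract a locale token
// such as "en-US" or "de" from a user agent string.
// Returns null when no such token is present.
function locale_from_user_agent($user_agent)
{
    // Look for a token of the form "xx" or "xx-YY" that is delimited by
    // "(" or ";" on the left and ";" or ")" on the right.
    if (preg_match('/[(;]\s*([a-z]{2}(?:-[A-Z]{2})?)\s*[;)]/', $user_agent, $matches)) {
        return $matches[1];
    }
    return null;
}

$user_agent = 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7';
$locale = locale_from_user_agent($user_agent);
```

For the string quoted above, the only token that fits the pattern is the en-US part.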

Alright, no big rocket science so far, and to be honest, there is no big rocket science to come. At one point, when I was annoyed about Google’s failure to work out which language to serve me, I wondered how hard it would be to implement the (in my humble opinion) better solution, using the web browser’s locale. The answer is far from being rocket science. The solution can be as easy as this:

First, get yourself a recent copy of the Zend Framework and install it so that its libraries are on your include path. You can register an autoload function with spl_autoload_register(), something like this:

spl_autoload_register(function ($classname) {
    // Only handle Zend classes; Zend_Foo_Bar maps to Zend/Foo/Bar.php
    if (strpos($classname, 'Zend') === 0) {
        require_once str_replace('_', '/', $classname) . '.php';
    }
});

Or, even easier, use the Zend Framework autoloader:

require_once 'Zend/Loader/Autoloader.php';

Zend_Loader_Autoloader::getInstance();

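Everything we need to identify the browser language and locale can be found inside Zend_Locale. Note that it does not parse the user agent string; it reads the Accept-Language header that the browser sends with each request, which reflects the language preferences configured in the browser.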

The constructor of the Zend_Locale class lets you set the desired language or locale. In addition, there are three predefined constants named Zend_Locale::BROWSER, Zend_Locale::ENVIRONMENT and Zend_Locale::FRAMEWORK. According to the Zend_Locale introduction manual page, they do the following:

Zend_Locale::FRAMEWORK – “When Zend Framework has a standardized way of specifying component defaults (planned, but not yet available), then using this constant during instantiation will give preference to choosing a locale based on these defaults. If no matching locale can be found, then preference is given to ENVIRONMENT and lastly BROWSER.”

Zend_Locale::ENVIRONMENT – “PHP publishes the host server’s locale via the PHP internal function setlocale(). If no matching locale can be found, then preference is given to FRAMEWORK and lastly BROWSER.”

Zend_Locale::BROWSER – “The user’s Web browser provides information with each request, which is published by PHP in the global variable HTTP_ACCEPT_LANGUAGE. if no matching locale can be found, then preference is given to ENVIRONMENT and lastly FRAMEWORK.”

And this is what we want!
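For the curious, here is roughly what is in that header. Accept-Language is a comma-separated list of language tags, each with an optional q weight between 0 and 1 (a missing weight counts as 1). The following is a minimal sketch of ranking those tags without the framework. The function name parse_accept_language() is hypothetical, and this is not meant to mirror how Zend_Locale does it internally.

```php
<?php
// Hypothetical helper: turn an Accept-Language header value into a list of
// language tags ordered from most to least preferred.
function parse_accept_language($header)
{
    $entries = array();
    foreach (explode(',', $header) as $position => $part) {
        $pieces = explode(';', trim($part));
        $tag = trim($pieces[0]);
        $q = 1.0; // a tag without an explicit weight counts as q=1

        // Look for a "q=..." parameter after the tag.
        for ($i = 1; $i < count($pieces); $i++) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $pieces[$i], $m)) {
                $q = (float) $m[1];
            }
        }

        // Skip empty parts, the "*" wildcard, and tags marked q=0.
        if ($tag === '' || $tag === '*' || $q <= 0) {
            continue;
        }
        $entries[] = array('tag' => $tag, 'q' => $q, 'pos' => $position);
    }

    // Highest weight first; equal weights keep their order in the header.
    usort($entries, function ($a, $b) {
        if ($a['q'] != $b['q']) {
            return ($a['q'] > $b['q']) ? -1 : 1;
        }
        return $a['pos'] - $b['pos'];
    });

    $tags = array();
    foreach ($entries as $entry) {
        $tags[] = $entry['tag'];
    }
    return $tags;
}

// In a web request the raw value is available to PHP as
// $_SERVER['HTTP_ACCEPT_LANGUAGE']; a fixed example value is used here.
$preferred = parse_accept_language('en-US,en;q=0.5');
```

With the example value, en-US ranks ahead of en because its implied weight of 1 is higher than 0.5.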

Let’s get to the code:

$zend_locale = new Zend_Locale(Zend_Locale::BROWSER);

// returns en for English, de for German etc.
$browser_language = $zend_locale->getLanguage();

// returns en_US for American English, en_GB for British English etc.
$browser_locale = $zend_locale->toString();

Three lines of code to get all the information we need. And how do we choose which language to serve based on that? Let’s assume we have a site in English, French and German. French browsers should get French, German browsers should get German and everybody else should get English:

$site_language = "en";
switch ($browser_language) {
    case "de" :
        $site_language = "de";
        break;
    case "fr" :
        $site_language = "fr";
}

And that’s it. That’s all Google would have to do to make me happy. I assume their IP-address-based identification code requires more lines, and it is still the poorer solution in my opinion. There are additional benefits to this solution. Want to display date and time according to the locale? Here is what to do:

$zend_date = Zend_Date::now();

$date_time = $zend_date->get(Zend_Date::DATETIME_FULL, $zend_locale);

print $date_time;

My default browser now shows me “Friday, July 23, 2010 2:14:58 AM Europe/Vienna”. If I fire up a German browser, the same page delivers me “Freitag, 23. Juli 2010 02:17:11 Europe/Vienna” and there may be answers like “vendredi 23 juillet 2010 02:17:56 Europe/Vienna” (French), “viernes 23 de julio de 2010 02:18:29 Europe/Vienna” (Spanish), “pátek, 23. července 2010 02:18:53 Europe/Vienna” (Czech) or “2010年7月23日金曜日2時19分17秒 Europe/Vienna” (Japanese) as well. It’s pretty cool how the Zend Framework can make such tasks very simple ;).

Where is a solution like mine being used in the wild? At the time of this writing I don’t know of any site, but I know where it will be used very soon. The MySQL website will soon launch new localized content, delivered to the people we think really want to see these languages. We try very hard not to annoy people by forcing a language upon them that they would not choose themselves.

And remember: making good guesses is good. The best solution, however, is to provide choice! If you make assumptions, even well-founded ones, always leave people the choice to select what they want. There may be situations you have not thought of, and because of some unusual circumstances, somebody may still want to choose differently than you would expect.
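As a minimal sketch of that principle: an explicit choice by the visitor should win over the browser-based guess, which in turn wins over the site default. The function name choose_site_language() and the "lang" query parameter are assumptions made up for this illustration; a real site might keep the choice in a cookie or a session instead.

```php
<?php
// Hypothetical helper: pick the language to serve.
// Priority: explicit choice, then browser language, then site default.
function choose_site_language($explicit_choice, $browser_language, array $supported, $default)
{
    // Only accept values the site actually offers.
    if ($explicit_choice !== null && in_array($explicit_choice, $supported, true)) {
        return $explicit_choice;
    }
    if ($browser_language !== null && in_array($browser_language, $supported, true)) {
        return $browser_language;
    }
    return $default;
}

$supported = array('en', 'de', 'fr');

// "lang" is an assumed parameter name, e.g. set by a link to ?lang=fr
$explicit_choice = isset($_GET['lang']) ? $_GET['lang'] : null;

// Stands in for the value of $zend_locale->getLanguage()
$browser_language = 'de';

$site_language = choose_site_language($explicit_choice, $browser_language, $supported, 'en');
```

Checking the explicit choice against the list of supported languages also means a mistyped or unexpected parameter value simply falls through to the browser-based guess.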

6 thoughts on “How to detect the user’s preferred language – smarter than Google”

  1. Please don’t forget the standard HTTP content negotiation (the Accept-Language header); this is what the user explicitly chose in the browser preferences. I use an English browser (because translated menus tend to be harder to understand than the original, and the translations are not stable between products or versions), but I prefer German pages, with English as a second choice. Don’t browsers send their own interface language in this header if you didn’t choose one, anyway? This is how this should be handled. Intelligent solutions tend to be different everywhere, and it is hard for users to find out why one site behaves differently from all the others. I would accept this heuristic as a fallback if everything else failed, though. (This _may_ be what you are suggesting; in this case ignore this comment.)

  2. The Zend Framework is smart enough to take this into account.

    If, for example, you pick a US English Firefox, add another language (let’s say Arabic) in the browser’s language preferences (Edit – Preferences – Content tab – Languages) and put it on top, this code now recognizes Arabic as the desired language, rather than US English ;)

  3. The best and the most simple solution I have seen yet on how to detect client browser language.
    Thanks. I will use this solution in the future :)

  4. The completely ridiculous language selection done by Google also drives me nuts. For some of their services, I have read the whole service description in English from my English browser on an English OS, filled in their forms in English, and suddenly here comes an incomprehensible license I need to accept, in German! Crazy.

    So I’m glad to see I am not alone in complaining. And the funny thing is that I also once wrote a language detection thingy for web sites, which I have used with small variations on several sites. In case it amuses you to see how someone else applied the same basic idea, see http://bahut.alma.ch/2006/03/simple-multi-language-web-site_30.html

    Maybe you
