
Bo Simonsen: The Scandinavian “curse” – sorting æ, ø, and å


Lately, I have been quite absent from my blog: just one blog post in February, but wow, it has been a busy month. The Cathedral sprint took a week out of the regular schedule, but it was a real experience for me, as mentioned in the earlier blog post. After the sprint we got back to work on our customers’ sites with refreshed energy. I have been working on one problem that I found quite interesting, so I wanted to share it.

In Scandinavia we have some letters in addition to the regular Latin alphabet, which are (notice the sorting order): æ, ø, and å.

We found out that sorting these letters can actually be a problem. If you rely on standard Python sorting (the issue is not specific to Python), the letters are sorted according to their position in the Unicode table (see e.g. here), i.e. by each character’s ordinal value. My native language is Danish, and most of our customers have Danish as their primary language. Relying on the Unicode order for sorting does not work: the order there is å, æ, ø, while the expected order is æ, ø, å. Hence we need a better way of sorting the letters.
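As a small illustration of the code-point ordering described above, here is a minimal sketch. It is written for Python 3 (the indexer later in this post is Python 2), and the variable names are only for illustration.

```python
# The three Danish letters in their expected alphabetical order.
letters = ["æ", "ø", "å"]

# Plain sorted() compares strings by code point, not by any alphabet.
by_code_point = sorted(letters)

# The code points explain the result: å (U+00E5) < æ (U+00E6) < ø (U+00F8).
code_points = [hex(ord(ch)) for ch in by_code_point]

print(by_code_point)  # ['å', 'æ', 'ø']
print(code_points)    # ['0xe5', '0xe6', '0xf8']
```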

On a UNIX system, you rely on the locale system to get things like the sorting order correct. For example, suppose we create a text file called sorting-test.txt containing:

å
ø
æ

If we execute cat sorting-test.txt |LANG=C sort we get an incorrect result with respect to the Danish sorting order, since the characters are sorted according to their ordinal values. However, if we execute cat sorting-test.txt |LANG=da_DK.UTF-8 sort the characters are sorted correctly: the right locale is given, so the system knows which sorting rules (collation) to use. By the way, the last example does not seem to work on OS X (I am using OS X Mavericks), but it works on Linux systems.
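Roughly the same experiment can be done from Python with the standard locale module. The following is a minimal sketch for Python 3; the helper names collation and sort_with_locale are made up for this example. It switches only LC_COLLATE, restores the previous setting afterwards, and assumes the requested locale is installed (setlocale raises locale.Error otherwise). Note that setlocale affects the whole process and is not thread-safe.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation(locale_name):
    """Temporarily switch LC_COLLATE, restoring the previous value afterwards."""
    # Calling setlocale() without a second argument only queries the setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    # Raises locale.Error if the locale is not installed on this system.
    locale.setlocale(locale.LC_COLLATE, locale_name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


def sort_with_locale(words, locale_name):
    """Sort words using the collation rules of the given locale."""
    with collation(locale_name):
        # strxfrm turns each string into a key that compares in locale order.
        return sorted(words, key=locale.strxfrm)


if __name__ == "__main__":
    letters = ["å", "ø", "æ"]
    try:
        print(sort_with_locale(letters, "da_DK.UTF-8"))
    except locale.Error:
        print("da_DK.UTF-8 is not installed on this system")
```

Whether the Danish result comes out as æ, ø, å depends on the collation data shipped with the operating system, which matches the Linux versus OS X difference noted above.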

In Plone, this of course also causes problems. We use a custom adapter to get the sortable_title index correct. This adapter contains the basic functionality from the CatalogTool defined in Products.CMFPlone, but additionally calls the strxfrm function from the locale module to get the sorting order right. The indexing adapter is defined as follows (we place it in indexers.py):

from Products.CMFCore.interfaces import IContentish

from Products.CMFPlone.CatalogTool import num_sort_regex, zero_fill
from Products.CMFPlone.utils import safe_callable
from Products.CMFPlone.utils import safe_unicode

from plone.indexer import indexer

import locale

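@indexer(IContentish)
def sortable_title(obj):
    """Return a locale-aware sort key for the object's title (Python 2)."""
    # FIXME: Move this somewhere more sensible, or save and restore the old
    # locale. Overwriting the process-wide locale is not nice; use with caution.
    locale.setlocale(locale.LC_ALL, 'da_DK.UTF-8')

    title = getattr(obj, 'Title', None)
    if title is not None:
        if safe_callable(title):
            title = title()

        if isinstance(title, basestring):
            # Basic normalization taken from the CatalogTool: lowercase,
            # strip, and zero-fill numbers so they sort naturally.
            sortabletitle = safe_unicode(title).lower().strip()
            sortabletitle = num_sort_regex.sub(zero_fill, sortabletitle)
            # Truncate, then encode: in Python 2, strxfrm expects a byte string.
            sortabletitle = sortabletitle[:70].encode('utf-8')
            # Transform the string so that plain comparison follows the
            # collation rules of the locale set above.
            return locale.strxfrm(sortabletitle)

    return ''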

The adapter then simply needs to be registered in configure.zcml:

<adapter factory=".indexers.sortable_title" name="sortable_title"/>

What would be really nice is a generic fix in Plone. The solution proposed above may work well for a single-language site. For multilingual sites, however, you may get into trouble. If the locale is set to e.g. Danish and your site also supports other languages, strings in those languages are still sorted by the Danish rules, which may not be correct for them. Imagine that the ordering of the Norwegian letters was different; then we would be in trouble for sure. I am not an expert in languages, but I am sure you can find languages that share characters but order them differently.
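To make the multilingual concern concrete, here is a minimal Python 3 sketch that picks the sort order from a language code. It does not use the locale system at all: the ALPHABETS table and the sort_key helper are hypothetical, and a single alphabet string is a simplification of real collation rules. The Danish alphabet ends with æ, ø, å, while the Swedish one ends with å, ä, ö, so the shared letter å ends up in a different position depending on the language.

```python
# Hypothetical, hand-written alphabet tables, for illustration only.
ALPHABETS = {
    "da": "abcdefghijklmnopqrstuvwxyzæøå",  # Danish: ..., æ, ø, å
    "sv": "abcdefghijklmnopqrstuvwxyzåäö",  # Swedish: ..., å, ä, ö
}


def sort_key(text, language):
    """Return a sort key for text according to the alphabet of language."""
    alphabet = ALPHABETS.get(language, "")
    rank = {ch: i for i, ch in enumerate(alphabet)}
    # Characters outside the alphabet sort after it, in code-point order.
    # For an unknown language this degrades to plain code-point sorting.
    return [rank.get(ch, len(alphabet) + ord(ch)) for ch in text.lower()]


danish = sorted(["åre", "ære", "øre"], key=lambda t: sort_key(t, "da"))
swedish = sorted(["öl", "ål", "äl"], key=lambda t: sort_key(t, "sv"))

print(danish)   # ['ære', 'øre', 'åre']
print(swedish)  # ['ål', 'äl', 'öl']
```

An indexer could in principle compute such a key from the language of each object, but a catalog index holds a single key per object, so mixed-language listings would still need a decision about which language's rules apply.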

In conclusion, the right solution would involve locale-based sorting where the language of each object is taken into account, which would require extending the functionality of the indexer. I am going to dig more into this problem to figure out a good solution, so maybe I will post an update with further findings. In the meantime, feel free to comment with your thoughts on the problem.

