Netsight Developers: Optimising Local Role Security Reindexing in Plone

Ever since the introduction of local roles in Plone way back in the prehistoric ages, there has been a problem to contend with: how do you get the site search in Plone to abide by the local roles that have been set on content? That is, if I give a local role of 'Reader' to the user 'Bob' on a page, then I expect Bob to be able to find that page in the search.

In order to do this, the catalog in Plone keeps an index called allowedRolesAndUsers which keeps a list of all roles, users or groups that have the ability to view that piece of content. So when you do a search, Plone just looks up what roles you have, and what groups you are a member of and passes them as part of the search query to the catalog. Nice and efficient.
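To make the idea concrete, here is a toy model of such an index. It is not Plone's catalog code: the ToyCatalog class and its methods are made up for illustration, and the 'user:' prefix for user and group tokens is modelled on the convention we understand Plone to use, so treat it as an assumption.

```python
from collections import defaultdict


class ToyCatalog:
    """Toy stand-in for the allowedRolesAndUsers index (hypothetical)."""

    def __init__(self):
        # token (role name, or 'user:<id>') -> set of content paths
        self.index = defaultdict(set)

    def index_object(self, path, allowed_tokens):
        # Drop any stale entries for this path, then store the new tokens.
        for paths in self.index.values():
            paths.discard(path)
        for token in allowed_tokens:
            self.index[token].add(path)

    def search(self, user_id, roles, groups):
        # The query is simply: every token the current user can match.
        tokens = ["user:" + user_id] + list(roles)
        tokens += ["user:" + group for group in groups]
        results = set()
        for token in tokens:
            results |= self.index.get(token, set())
        return results


catalog = ToyCatalog()
catalog.index_object("/hr/policy", {"Manager", "user:bob"})
catalog.index_object("/news", {"Anonymous"})

bob_results = catalog.search("bob", ["Anonymous", "Member"], [])
alice_results = catalog.search("alice", ["Anonymous", "Member"], [])
```

Bob finds both pages, Alice only the public one. All the work at query time is a handful of set lookups, which is why this side of the trade-off is cheap.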

Except it isn't. Well, it is at query time. But that nicety comes at the expense of having to recompute that index whenever a piece of content's workflow state changes, or a local role is added. That doesn't sound that expensive, does it? The kicker here is that due to Plone having an insanely powerful access control system, local roles are inherited by child items. For example, if I have a site with /departments/hr/policies as a folder path, and I grant 'Bob' the 'Reader' role on the folder hr, then by default Bob will get that role on all content within that folder, i.e. everything in the policies folder. So Plone has to reindex the allowedRolesAndUsers index, not just for the hr folder, but for everything under it, i.e. all gazillion HR policy documents in that folder.
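A rough sketch of why the cost grows with the size of the subtree is below. It is a toy model only: the Node class, the single 'Reader' view role and the reindex_security function are assumptions made for illustration, not Plone's real implementation.

```python
class Node:
    """Toy content object with a parent pointer and local roles."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.local_roles = {}  # principal id -> set of role names
        if parent is not None:
            parent.children.append(self)


def allowed_principals(node, view_roles=frozenset({"Reader"})):
    """Principals who may view node via a local role set here or above."""
    allowed = set()
    current = node
    while current is not None:  # inheritance: walk up to the root
        for principal, roles in current.local_roles.items():
            if roles & view_roles:
                allowed.add("user:" + principal)
        current = current.parent
    return allowed


def reindex_security(node, index):
    """Recompute the entry for node and every descendant; count the work."""
    index[node] = allowed_principals(node)
    touched = 1
    for child in node.children:
        touched += reindex_security(child, index)
    return touched


departments = Node("departments")
hr = Node("hr", departments)
policies = Node("policies", hr)
docs = [Node("policy-%d" % i, policies) for i in range(1000)]

index = {}
hr.local_roles["bob"] = {"Reader"}
touched = reindex_security(hr, index)  # hr + policies + 1000 documents
```

One small change on hr touches 1002 objects here. In a real site each of those is a content object that has to be loaded from the ZODB first.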

This operation can take many minutes to complete on a large site... and on a busy site can be catastrophic for performance. The catalog is one of the most contended areas of the database, and so lots is going on there. Operations that spend lots of time manipulating the database can very easily cause other transactions to conflict and need to be retried.

This issue is especially prevalent on intranets, and sites that have some notion of 'membership' to a particular area of the site. Any site with things called 'workgroups' or 'workspaces' is likely to be affected. Whenever you add a new participant to your workgroup you grant them local roles of some kind on that workgroup. That results in every single piece of content in that workgroup being loaded in from the ZODB and reindexed in the allowedRolesAndUsers index.

The common way this issue is worked around is by (ab)using groups in Plone. Instead of adding local roles to a user on a workgroup, you have a Plone user group assigned to that workgroup and you add the user to that user group. Due to the underlying way in which Plone works, this is more efficient, but it can lead to a proliferation of groups in the site that often have no semantic meaning, e.g. 'People with the Contributor role of the folder at /x/y/z'. Or more normally the group name will be based on a UUID and be something more opaque like 'contributors-1234-0123-4978-2484'.
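Our understanding of why the group trick is cheaper is that group membership is resolved when the query is built, not when content is indexed, so the index entries never need to change. The sketch below illustrates that reasoning with a toy index; the group name 'workgroup-members' and both helper functions are hypothetical.

```python
# Toy index: path -> allowed tokens, written once when the *group* was
# granted its local role on the workgroup folder.
index = {
    "/workgroup/doc-%d" % i: {"user:workgroup-members"} for i in range(1000)
}
group_members = {"workgroup-members": set()}


def query_tokens(user_id):
    """Tokens for a user's search: their own id plus every group they are in."""
    groups = [g for g, members in group_members.items() if user_id in members]
    return {"user:" + user_id} | {"user:" + g for g in groups}


def visible_paths(user_id):
    tokens = query_tokens(user_id)
    return {path for path, allowed in index.items() if allowed & tokens}


snapshot = {path: set(tokens) for path, tokens in index.items()}

visible_before = len(visible_paths("bob"))
group_members["workgroup-members"].add("bob")  # no index entry is rewritten
visible_after = len(visible_paths("bob"))

index_unchanged = index == snapshot
```

Bob goes from seeing nothing to seeing all 1000 documents, and not a single index entry was rewritten.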

This is an issue that we wanted to solve as part of the Plone Intranet project. There have been numerous 3rd party products written for Plone to try and provide some notion of 'workgroups'. Most of these also have to deal with solving this performance issue. We wanted to try and see if we could solve the issue at source, so that the actual workgroup implementation could concentrate on the semantic meaning of a workgroup and not have to sully itself with performance workarounds.

We ran two week-long sprints at Netsight to try and solve this problem. And we believe we have succeeded. This is an area of Plone that goes right back to the origins of one of Plone's underpinnings, the CMF. And there are few people who really understand Plone at this depth, and can mentally keep track of all the things going on with the security and permission system. It was very hard work: we backtracked many times, used up copious quantities of flipchart paper, and drew literally dozens of object graphs.

Together with my colleagues, Ben Cole and Matt Russell, we think we have solved it, and now want to get greater feedback and testing from the wider Plone community.

Our basic approach for optimising the security reindex operations is to build a 'shadow tree' which is a lightweight tree of python objects that mirrors the site structure. This tree is annotated with hints that then allow us to group content into sets of content for which the allowedRolesAndUsers index would contain the same result. We can then just calculate the allowedRolesAndUsers for real on just one piece of content per group and submit 'dummy' objects to the catalog to be indexed in lieu of waking up and re-calculating the allowedRolesAndUsers on each and every piece of content.
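The grouping idea can be sketched roughly as follows. This is not the code from experimental.securityindexing: ShadowNode, group_by_hint, reindex and the choice of hints (workflow state plus the local roles accumulated below the reindex root) are assumptions made for illustration. It also ignores cases such as blocked local role inheritance.

```python
from collections import defaultdict


class ShadowNode:
    """Lightweight stand-in for one content object (hypothetical)."""

    def __init__(self, path, state, local_roles=()):
        self.path = path
        self.state = state  # workflow state name
        self.local_roles = frozenset(local_roles)  # (principal, role) pairs
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


def group_by_hint(root):
    """Group paths whose index value must be identical.

    Two nodes share a group when they have the same workflow state and the
    same local roles accumulated from the reindex root down to them.
    """
    groups = defaultdict(list)
    stack = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        effective = inherited | node.local_roles
        groups[(node.state, effective)].append(node.path)
        for child in node.children:
            stack.append((child, effective))
    return groups


def reindex(root, compute_allowed):
    """Run the expensive computation once per group and reuse the result."""
    index = {}
    expensive_calls = 0
    for paths in group_by_hint(root).values():
        value = compute_allowed(paths[0])  # wake up one real object
        expensive_calls += 1
        for path in paths:  # 'dummy' entries for the rest of the group
            index[path] = value
    return index, expensive_calls


def fake_compute(path):
    # Stand-in for loading the real object and computing its allowed tokens.
    return ("computed-from", path)


hr = ShadowNode("/hr", "internal", {("bob", "Reader")})
policies = hr.add(ShadowNode("/hr/policies", "internal"))
for i in range(500):
    policies.add(ShadowNode("/hr/policies/doc-%d" % i, "published"))
policies.add(ShadowNode("/hr/policies/draft", "private", {("alice", "Editor")}))

index, expensive_calls = reindex(hr, fake_compute)
```

In this example 503 objects end up with an index entry, but the expensive computation runs only three times: once for the two folders, once for the 500 published documents, and once for the draft with its extra local role.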

Our initial results show a pretty amazing performance increase, but the improvements will be very much dependent on the size of your site and the patterns of local roles and workflow states.

Figure: Reindex Object Security results chart.

We tested our optimisations with both legacy ATContentTypes-based content types, and also with the newer Dexterity-based plone.app.contenttypes. You will notice that re-indexing the security of the Dexterity (DX) content types is over 4 times slower than for the older Archetypes (AT) based types. This surprised us a lot, as we thought Dexterity was going to be much faster than Archetypes. We spent nearly a day digging around investigating this, but that is a separate topic for discussion. In short, looking up an attribute, interface or adapter on Dexterity types is very slow because the __getattr__ code is written in Python rather than C, which is the price of the flexibility of Dexterity types.

However in either case, with our optimisation product installed the time taken to reindex the object security is significantly less, and in many cases near-instantaneous.

The code is on Github at https://github.com/ploneintranet/experimental.securityindexing and on PyPI at https://pypi.python.org/pypi/experimental.securityindexing

We welcome other people to install and test this on development versions of their site and see how it works out. At the moment, we'd advise against installing it on production sites until it has been battle tested a bit more. That said, we have twice as many lines of code in our tests as we do in our implementation and nearly 100% test coverage. In the 0.1 release it will not un-install cleanly and will leave an annotation on the site root containing the shadow tree. This can be deleted manually from a debug shell. Once we work out how to get GenericSetup to run Python methods on un-install we'll get that fixed.

Once you install this, it will lazily build up the shadow tree, so the performance optimisations won't show immediately. But if you add a local role at the root of the site and then remove it again, you will force it to build the shadow tree for the whole site up front.