Channel: Planet Plone - Where Developers And Integrators Write

Izak Burger: Cleaning documents polluted by copy-paste from MSWord

This problem is much less severe now that newer versions of Plone use TinyMCE, but we still run into trouble with older documents created in Kupu on older versions of Plone.

Case in point, yesterday I dumped the content of such a document to a file and cleaned it up. This resulted in a reduction in file size of more than 90%.

-rw-r--r-- 1 izak izak 3.2M 2011-09-15 15:52 /tmp/before.html
-rw-r--r-- 1 izak izak 205K 2011-09-15 16:09 /tmp/after.html
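
As a quick sanity check on the "more than 90%" figure, here is a minimal sketch of the arithmetic. The helper name `percent_reduction` is made up for illustration, and the inputs are the rounded sizes from the listing above, so the result is only approximate.

```python
def percent_reduction(before_bytes, after_bytes):
    """Return how much smaller `after` is than `before`, as a percentage."""
    return 100.0 * (before_bytes - after_bytes) / before_bytes


# Rounded sizes from the listing above: 3.2M before, 205K after.
before = 3.2 * 1024 * 1024
after = 205 * 1024
print("%.1f%% smaller" % percent_reduction(before, after))
```

With the real files at hand, `os.path.getsize` would give exact byte counts to feed into the same function.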

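One thing TinyMCE definitely doesn't handle as well as Kupu is a 3.2M document, so we can no longer ignore the MSWord bloat. I wrote the following bit of code to make the cleanup easier. It uses lxml's etree module (an ElementTree-compatible API) and a small XSLT stylesheet that drops comments, style elements and link elements and copies everything else unchanged.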

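import sys
from lxml import etree
from lxml.etree import HTMLParser

# Parse the (possibly malformed) HTML file named on the command line.
# Binary mode lets the parser work out the document encoding itself.
parser = HTMLParser()
with open(sys.argv[1], 'rb') as fp:
    tree = etree.parse(fp, parser)

# Identity transform that copies everything, followed by empty templates
# that drop comments, <style> and <link>. The empty templates come last so
# they still win if the processor breaks a priority tie by stylesheet order.
xslt = etree.XML("""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
    <xsl:template match="comment()" />
    <xsl:template match="style" />
    <xsl:template match="link" />
</xsl:stylesheet>""")
transform = etree.XSLT(xslt)

# Apply the stylesheet and write the cleaned document to stdout.
newtree = transform(tree)
print(str(newtree))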
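
If lxml is not available, a similar cleanup can be approximated with the standard library alone. The following is a minimal sketch, not the approach used above: it swaps the XSLT for Python's `html.parser` and re-emits the markup while skipping comments, style blocks and link tags. The names `WordCleaner` and `clean_word_html` are made up for illustration, and unlike lxml this does not repair malformed markup.

```python
from html.parser import HTMLParser


class WordCleaner(HTMLParser):
    """Re-emit HTML, dropping comments, <style> blocks and <link> tags."""

    def __init__(self):
        # Keep entities as written instead of decoding them into text.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <style>

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.skip_depth += 1
        elif tag != "link" and not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as written.
        if tag not in ("style", "link") and not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag != "link" and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)

    def handle_decl(self, decl):
        # Preserve the doctype declaration, if there is one.
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        # Processing instructions are passed through untouched.
        self.out.append("<?%s>" % data)

    def handle_comment(self, data):
        # Comments, including Word's conditional comments, are dropped.
        pass


def clean_word_html(html):
    """Return `html` without comments, <style> blocks and <link> tags."""
    cleaner = WordCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


# Small made-up sample of Word-style markup.
dirty = (
    '<html><head><style>p.MsoNormal {margin: 0}</style>'
    '<link rel="File-List" href="filelist.xml"></head>'
    '<body><!--[if gte mso 9]><xml>junk</xml><![endif]-->'
    '<p class="MsoNormal">Hello &amp; goodbye</p></body></html>'
)
print(clean_word_html(dirty))
```

Because it only filters the token stream, this keeps the remaining markup byte for byte. For documents that are badly broken, the lxml version is the safer choice since its HTML parser fixes up the tree first.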

I hope this is useful to someone.

