This problem is much less severe now that newer versions of Plone use TinyMCE, but we still run into it with older documents that were created in Kupu on older versions of Plone.
Case in point, yesterday I dumped the content of such a document to a file and cleaned it up. This resulted in a reduction in file size of more than 90%.
-rw-r--r-- 1 izak izak 3.2M 2011-09-15 15:52 /tmp/before.html
-rw-r--r-- 1 izak izak 205K 2011-09-15 16:09 /tmp/after.html
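One thing that TinyMCE definitely doesn't handle as well as Kupu is a 3.2M document, so we can no longer ignore the MS Word bloat. I wrote the following bit of code to make the cleanup easier. It uses lxml's etree module (an ElementTree-compatible API) together with a small XSLT stylesheet that copies everything except comments, style elements and link elements.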
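import sys

from lxml import etree
from lxml.etree import HTMLParser

# Parse the (possibly malformed) HTML file named on the command line.
# Binary mode lets the parser work out the encoding itself.
parser = HTMLParser()
with open(sys.argv[1], 'rb') as fp:
    tree = etree.parse(fp, parser)

# The identity template copies every node and attribute. It is listed
# first because comment() and node() have the same default priority, and
# a processor may break that tie in favour of the later template.
xslt = etree.XML("""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Empty templates: matching nodes produce no output. -->
  <xsl:template match="comment()" />
  <xsl:template match="style" />
  <xsl:template match="link" />
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)
newtree = transform(tree)
print(str(newtree))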
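To see what the stylesheet does without a 3.2M file at hand, here is a minimal sketch that runs the same rules over a small inline sample. The helper name `clean_html` and the sample markup are made up for illustration; the sample only imitates the kind of markup Word tends to leave behind. The identity template is listed first here as a precaution, since `comment()` and `node()` share the same default priority in XSLT 1.0.

```python
from lxml import etree

# Same rules as the script: copy everything, but drop comments,
# <style> elements and <link> elements.
CLEANUP_XSLT = etree.XML("""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="comment()" />
  <xsl:template match="style" />
  <xsl:template match="link" />
</xsl:stylesheet>""")

_transform = etree.XSLT(CLEANUP_XSLT)


def clean_html(markup):
    """Hypothetical helper: parse an HTML string and return cleaned markup."""
    root = etree.fromstring(markup, etree.HTMLParser())
    return str(_transform(root))


# Invented sample imitating Word-style bloat (not taken from a real document).
SAMPLE = (
    '<html><head><style>p.MsoNormal {margin: 0}</style>'
    '<link rel="File-List" href="filelist.xml"></head>'
    '<body><!--[if gte mso 9]><xml>Word settings</xml><![endif]-->'
    '<p class="MsoNormal">Hello</p></body></html>'
)

cleaned = clean_html(SAMPLE)
print(cleaned)
```

Note that the paragraph keeps its `class="MsoNormal"` attribute: the stylesheet only removes whole comments and the two named elements, so attribute-level cleanup would need additional templates.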
I hope this is useful to someone.