Izak Burger: Cleaning documents polluted by copy-paste from MSWord

This problem is much less severe now that newer versions of Plone use TinyMCE, but we still run into it with older documents that were created in Kupu on older versions of Plone.

Case in point: yesterday I dumped the content of such a document to a file and cleaned it up. This reduced the file size by more than 90%.

-rw-r--r-- 1 izak izak 3.2M 2011-09-15 15:52 /tmp/before.html
-rw-r--r-- 1 izak izak 205K 2011-09-15 16:09 /tmp/after.html

One thing that TinyMCE definitely doesn't handle as well as Kupu is 3.2M documents, so we can no longer ignore the MSWord bloat. I wrote the following bit of code to make the cleanup easier. It uses lxml (which provides an ElementTree-compatible API) together with a small XSLT stylesheet.

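import sys
from lxml import etree

# Parse the input leniently; Word-generated HTML is rarely well-formed.
parser = etree.HTMLParser()
with open(sys.argv[1], 'rb') as fp:
    tree = etree.parse(fp, parser)

# Identity transform plus empty templates that drop comments, <style> and
# <link>. The identity template comes first so that the more specific
# comment() rule, which has the same default priority as node(), wins.
xslt = etree.XML("""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
    <xsl:template match="comment()" />
    <xsl:template match="style" />
    <xsl:template match="link" />
</xsl:stylesheet>""")
transform = etree.XSLT(xslt)

# Apply the stylesheet and write the cleaned document to stdout.
newtree = transform(tree)
print(str(newtree))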
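
To see what the stylesheet does without a 3.2M file at hand, here is a minimal sketch that runs the same three rules against a tiny inline document. The SAMPLE markup and the clean() helper are made up for illustration; they are not part of the original script, and real Word output is far messier.

```python
from lxml import etree

# Hypothetical sample: a small fragment imitating Word paste residue.
SAMPLE = b"""<html><head>
<style>p.MsoNormal { margin: 0cm; }</style>
<link rel="File-List" href="filelist.xml">
</head><body>
<!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]-->
<p class="MsoNormal">Hello</p>
</body></html>"""

# Same rules as the script above: copy everything, except comments,
# <style> elements and <link> elements, which are dropped.
STYLESHEET = etree.XML(b"""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
    <xsl:template match="comment()" />
    <xsl:template match="style" />
    <xsl:template match="link" />
</xsl:stylesheet>""")


def clean(html_bytes):
    """Parse HTML bytes leniently and return the transformed result tree."""
    root = etree.fromstring(html_bytes, etree.HTMLParser())
    return etree.XSLT(STYLESHEET)(root)


output = str(clean(SAMPLE))
print(output)
```

Note that only comments, style elements and link elements are removed. Attributes such as class="MsoNormal" on the paragraph pass through the identity template untouched, so they would need their own rule if you want them gone as well.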

I hope this is useful to someone.
