Parsing external emails in Python

Our startup's application receives a few thousand emails a day from every kind of mailbox imaginable, including local ones. We parse those raw emails and display them to our users as safe HTML5.

Finding Python libraries suitable for this task was quite challenging. We tested at least 10 different solutions before concluding that a perfect one doesn't exist.

With a huge test set of 100,000+ emails from all around the world and from most of the major email clients, I compiled a list of 30 comprehensive tests. For each one I copy-pasted the interesting part of an email and wrote down the parsed result I expected. Over time I added multiple alternative expected results, because different libraries produced output that was not exactly intuitive but still valid and displayable.
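To give an idea of how a test can accept several alternative expected results, here is a minimal sketch. The helper names `normalize` and `matches_any` are made up for this illustration and are not part of our real test suite; the whitespace handling is deliberately crude.

```python
import re


def normalize(html):
    # Drop whitespace next to tags and collapse the rest, so that
    # indentation differences do not fail a comparison.
    # Caveat: whitespace between inline elements can matter when rendered;
    # this sketch ignores that.
    html = re.sub(r'>\s+', '>', html.strip())
    html = re.sub(r'\s+<', '<', html)
    return re.sub(r'\s+', ' ', html)


def matches_any(actual, alternatives):
    # True if the parsed output equals at least one accepted variant.
    wanted = {normalize(alt) for alt in alternatives}
    return normalize(actual) in wanted
```

Inside a test this could be used as `self.assertTrue(matches_any(parsed, [expected_a, expected_b]))`.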

One test failed with every single library, or combination of libraries, that we tried.

First, stripping Outlook tags, which genshi handled nicely:


def test_should_strip_outlook_tags(self):
    msg = u'''
          <p>text</p>
          <!--[if gte mso 9]><o:custom></o:custom><![endif]-->
          <p>after</p>
          '''
    expected = u'''
               <p>text</p>
               <p>after</p>
               '''
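For illustration only, the same stripping can be approximated with the standard library. This is a rough sketch using a regular expression, not how genshi does it internally; the function name `strip_outlook_conditionals` is made up for this example.

```python
import re

# Matches downlevel-hidden conditional comments such as
# <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(
    r'<!--\[if[^\]]*\]>.*?<!\[endif\]-->',
    re.DOTALL | re.IGNORECASE,
)


def strip_outlook_conditionals(html):
    # Remove each conditional comment together with everything inside it.
    return CONDITIONAL_COMMENT.sub('', html)
```

A regular expression like this is fine for a quick check, but for real email we still prefer a proper parser, because conditional comments are only one of many things Outlook adds.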

 

Then there is one of the weird things Outlook tends to do: an equals sign that appears out of the blue.

 

msg = u'''
         <p>text</p>
         <p = class=3DMsoNormal>inner</p>
        '''

expected = u'''
            <p>text</p>
            <p class="3DMsoNormal">inner</p>
           '''
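As a rough idea of what the cleanup involves, the stray equals sign can be removed before the markup reaches the parser. This is only a sketch with a made-up function name, `drop_stray_equals`; it assumes the stray sign sits directly after the tag name and is followed by a real attribute. Quoting the attribute value, as in the expected result above, is left to the HTML serializer.

```python
import re

# A lone "=" right after the tag name, followed by something that
# looks like a real attribute (name=...).
STRAY_EQUALS = re.compile(r'<([A-Za-z][\w:-]*)\s+=\s+(?=[\w:-]+\s*=)')


def drop_stray_equals(html):
    # Keep the tag name, drop the orphaned "=".
    return STRAY_EQUALS.sub(r'<\1 ', html)
```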

Unfortunately, no library could cope with this kind of contorted HTML:


@unittest.skip('This is valid for HTML5, however html5lib processes this against expectations')
def test_should_strip_self_closed_b_without_rendering_it_further(self):
    msg = u'''
         <div>
              <div class="bold" />
              not_bold_content
         </div>
         should_not_be_bold
         '''

    expected = u'''
                  <div>
                      <div class="bold"></div>
                      not_bold_content
                  </div>
                  should_not_be_bold
               '''

What we got instead was a self-closed tag. This is actually a pretty big issue, as a self-closed tag like this causes Chrome to display the entire page in bold. So we ended up fixing the library itself.
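To show the behaviour we wanted, here is a minimal sketch built on the standard library's `html.parser`. It is not the patch we applied to the library; the class `SelfClosingExpander` and the function `expand_self_closed` are made up for this example. It re-emits the markup and turns every self-closed non-void tag into an explicit open and close pair.

```python
from html import escape
from html.parser import HTMLParser

# Void elements legitimately have no closing tag in HTML5.
VOID_ELEMENTS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                 'input', 'link', 'meta', 'source', 'track', 'wbr'}


class SelfClosingExpander(HTMLParser):
    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _open_tag(self, tag, attrs):
        # Attribute values arrive unescaped, so escape them again.
        rendered = ''.join(
            ' %s' % name if value is None
            else ' %s="%s"' % (name, escape(value, quote=True))
            for name, value in attrs)
        return '<%s%s>' % (tag, rendered)

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # <div ... /> becomes <div ...></div>; <br/> becomes <br>.
        self.parts.append(self._open_tag(tag, attrs))
        if tag not in VOID_ELEMENTS:
            self.parts.append('</%s>' % tag)

    def handle_endtag(self, tag):
        self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)


def expand_self_closed(html):
    parser = SelfClosingExpander()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)
```

Feeding it the `msg` from the skipped test yields the explicit `<div class="bold"></div>` pair, so a browser no longer treats the rest of the page as a child of that element.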