Escape title attribute in HTML links with Python

What

<a href="toto.html" title="<strong>I like it!</strong><br />Jerome">
  Toto
</a>

to

<a href="toto.html" title="&lt;strong&gt;I like it!&lt;/strong&gt;&lt;br /&gt;Jerome">
  Toto
</a>

Why have HTML in the title attribute?

Lightbox and similar plugins use the title attribute as the caption.

Usage

# one file
python escape.py toto.html

# multiple files
python escape.py *.html
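
The `*.html` form relies on the shell expanding the wildcard before Python starts; cmd.exe on Windows passes the pattern through unchanged. A minimal sketch of expanding patterns inside the script instead, assuming a hypothetical helper `expand_paths` that is not part of escape.py:

```python
import glob


def expand_paths(args):
    """Expand wildcard patterns; keep arguments that match nothing as-is."""
    paths = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        # a plain file name that does not exist yet is passed through untouched
        paths.extend(matches if matches else [arg])
    return paths
```

The main loop could then iterate over `expand_paths(sys.argv[1:])` rather than over the raw arguments.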

Notes

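  • modifies files in place, so keep a backup or use version control
  • processes files line by line, so it also works on incomplete HTML files and PHP files
  • only rewrites lines that start with an opening <a> tag and end with ">, as in the example above
  • removes <p> and </p> from the title before escaping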
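
Because the script applies a regular expression line by line rather than parsing the document, any line it does not recognise passes through unchanged. A minimal sketch of that behaviour, reusing the script's pattern; `escape_line` and the sample fragment are made up for illustration:

```python
import re

# Same pattern as escape.py: the line starts with <a and ends with ">
TITLE_RE = re.compile(r'^\s*<a.* title="(.*)">$')


def escape_line(line):
    m = TITLE_RE.match(line)
    if not m:
        return line  # PHP code, text, other tags: left alone
    escaped = m.group(1).replace('<', '&lt;').replace('>', '&gt;')
    return line[:m.start(1)] + escaped + line[m.end(1):]


fragment = [
    '<?php include("header.php"); ?>\n',
    '<a href="toto.html" title="<br />Jerome">\n',
    '<a href="toto.html" title="<b>x</b>">Toto</a>\n',
]
result = [escape_line(line) for line in fragment]
```

Only the second line is rewritten. The third one is skipped because the opening tag does not end the line, which is worth knowing before running the script on markup formatted differently from the example.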

Script

escape.py

# -*- coding: utf-8 -*-

import re
import sys

# Matches a line that starts with an opening <a> tag and ends with ">
TITLE_RE = re.compile(r'^\s*<a.* title="(.*)">$')


def escape(s):
    """Return line s with the HTML inside its title attribute escaped."""
    m = TITLE_RE.match(s)
    if not m:
        return s

    title = m.group(1)
    # drop paragraph tags, then escape the remaining angle brackets
    title_new = title.replace('<p>', '').replace('</p>', '')
    title_new = title_new.replace('<', '&lt;').replace('>', '&gt;')

    # rewrite only the captured span, not every occurrence of the same text
    return s[:m.start(1)] + title_new + s[m.end(1):]


def process_file(path):
    # read file, escaping only the lines that contain a link
    with open(path) as f:
        s = ''.join(escape(l) if '<a ' in l else l for l in f)
    print(s)

    # write file
    with open(path, 'w') as f:
        f.write(s)


# main
if __name__ == '__main__':
    for path in sys.argv[1:]:
        print(path)
        process_file(path)
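
The chained `replace` calls leave `&` and `"` untouched. If captions may contain those characters, the standard library's `html.escape` (Python 3) handles them as well. This is an alternative, not what escape.py does, and it would double-escape captions that already contain entities such as `&amp;`. The name `escape_caption` is hypothetical:

```python
import html


def escape_caption(title):
    """Strip <p> tags, then escape &, <, > and quotes for use in an attribute."""
    title = title.replace('<p>', '').replace('</p>', '')
    return html.escape(title, quote=True)


print(escape_caption('<p>Tom & "Jerry"<br /></p>'))
# prints: Tom &amp; &quot;Jerry&quot;&lt;br /&gt;
```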

Feedback