It looks like html-sanitizer does a lot of wibbly-wobbly things to produce its result. It seems possible to implement every html-sanitizer feature as a plugin for a custom SAX-style parser, with each sanitizing rule living in its own plugin.
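To make the idea concrete, here is a minimal sketch of what such a plugin system could look like. It is not html-sanitizer's actual API: the `Rule` interface, the three example rules, and the `sanitize` helper are all hypothetical names, and Python's standard `html.parser.HTMLParser` stands in for the event-driven (SAX-style) parser.

```python
import re
from collections import Counter
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # elements without a closing tag (partial list)


class Rule:
    """Hypothetical plugin interface; the default hooks pass events through."""

    def start(self, tag, attrs):
        # Return (tag, attrs) to keep the tag, or None to drop it.
        return tag, attrs

    def text(self, data):
        return data


class AllowTags(Rule):
    def __init__(self, allowed):
        self.allowed = set(allowed)

    def start(self, tag, attrs):
        return (tag, attrs) if tag in self.allowed else None


class StripAttributes(Rule):
    def __init__(self, allowed):
        self.allowed = allowed  # maps tag name -> set of permitted attributes

    def start(self, tag, attrs):
        keep = self.allowed.get(tag, set())
        return tag, [(k, v) for k, v in attrs if k in keep]


class CollapseWhitespace(Rule):
    def text(self, data):
        return re.sub(r"\s+", " ", data)


class PluginSanitizer(HTMLParser):
    """Feeds every parser event through the rule chain, then re-serializes it."""

    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.out = []
        self.open_tags = Counter()  # emitted start tags still awaiting a close

    def handle_starttag(self, tag, attrs):
        event = (tag, attrs)
        for rule in self.rules:
            event = rule.start(*event)
            if event is None:
                return  # a rule dropped the tag; its text content still passes
        tag, attrs = event
        rendered = "".join(' {}="{}"'.format(k, escape(v or "")) for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags[tag] += 1

    def handle_endtag(self, tag):
        # Only close tags whose start tag was actually emitted.
        if self.open_tags[tag] > 0:
            self.open_tags[tag] -= 1
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        for rule in self.rules:
            data = rule.text(data)
        self.out.append(escape(data, quote=False))


def sanitize(html, rules):
    parser = PluginSanitizer(rules)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


rules = [AllowTags({"p", "a", "br"}), StripAttributes({"a": {"href"}}), CollapseWhitespace()]
print(sanitize('<p class="x" onclick="evil()">Hello   <blink>world</blink></p>', rules))
# prints: <p>Hello world</p>
```

Each feature is one small class, so adding or removing behaviour means editing the `rules` list rather than the parser. The sketch leaves out plenty that a real sanitizer needs, for example discarding the text inside dropped `<script>` elements and repairing badly nested tags; those would presumably need hooks that see end tags and nesting state as well.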