
Edwin Vriethoff

 

8 September 2015

Remove unwanted HTML attributes with PowerShell and regular expressions

I’m currently working on a project to migrate content from an old website to a new one. Some of the content on the old site contains HTML markup.
It’s fine for the HTML to be structured with paragraphs and headers, but in some cases it also contains inline styling and class names that are not wanted on the new website.

The migration scripts are written in PowerShell, so I also wanted to remove the class and style attributes with PowerShell and regular expressions. It was hard to find a sample on the internet that removes only the HTML attributes and not the complete elements.

<p class="mv-Element-P">Microsoft heeft voor het derde jaar...

Solutions like
$string -replace '<[^>]+>',''
and
$string -replace '<[^>]+((?:style|class)="[^"]*")[^>]*>',''
do not work for me, because they remove the complete tags instead of only the unwanted attributes.
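
To make the problem concrete, here is a small sketch of what the two commands above do. The sample markup and the variable names $stripAll and $stripTagged are my own illustration, not content from the migrated site.

```powershell
# Illustrative input (an assumption, not real site content).
$string = '<p class="intro" style="color:red">Hello</p>'

# Attempt 1: the pattern matches every tag, so all markup is stripped.
$stripAll = $string -replace '<[^>]+>', ''

# Attempt 2: the pattern matches only tags that carry a style or class
# attribute, but the whole tag is still replaced, not just the attribute.
$stripTagged = $string -replace '<[^>]+((?:style|class)="[^"]*")[^>]*>', ''

$stripAll      # Hello
$stripTagged   # Hello</p>
```

In the second case the opening tag disappears while the closing tag stays behind, which leaves the markup unbalanced.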

I finally found a nice post about cleaning Word HTML with regular expressions in C#, which gave me the right solution for my case. Because my base requirement is to remove all inline styling from the HTML, I kept the extra attributes from that regex in my PowerShell code.

The working PowerShell command to clean out styling attributes from HTML elements is:
$string -replace "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>","<`$1`$2>"

Note that the dollar signs in $1$2 are escaped with backticks so that they are passed on to the regex engine instead of being treated as PowerShell variables.
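
Two details of this command are worth a sketch. First, single-quoted strings avoid the backtick escaping, at the cost of doubling the single quotes inside the pattern. Second, because the first capture group is greedy, one pass removes at most one attribute per tag, so a tag with both class and style needs more than one pass. The helper below is a minimal sketch under those assumptions; the function name Remove-StyleAttribute and the sample input are hypothetical and not part of the original solution.

```powershell
# Hypothetical helper: repeats the replacement until the string stops changing.
function Remove-StyleAttribute {
    param([string]$Html)

    # Single-quoted strings: '' is a literal single quote, and $1/$2 are
    # left alone by PowerShell, so no backticks are needed.
    $pattern = '<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:''[^'']*''|"[^"]*"|[^\s>]+)([^>]*)>'
    $replacement = '<$1$2>'

    # Each pass strips at most one attribute per tag, so loop until stable.
    do {
        $previous = $Html
        $Html = $Html -replace $pattern, $replacement
    } while ($Html -cne $previous)

    return $Html
}

# Sample input is an assumption for illustration.
$result = Remove-StyleAttribute '<p class="mv-Element-P" style="color:red">Text</p>'
$result   # <p  >Text</p>
```

The whitespace that surrounded the removed attributes stays inside the tag. Browsers ignore it, but it could be collapsed in a separate step if the output needs to be tidy.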

Originally posted at: http://edwin.vriethoff.net/2015/09/08/remove-unwanted-html-attributes-powershell-regular-expressions/