QRegExp Help; remove all html tag



patrik08
27th July 2006, 11:48
I want to remove all HTML tags in order to reformat a document.
Tidy cannot do the job.

I tested QString::remove with QRegExp: lines 10 and 11 of the function below remove the closing tags. Now I want to remove the opening tags. I tried line 13, but it removes everything.
How can I do this?




QString QLess::CleanTag( QString body )
{
    qDebug() << "### start clean tag ";
    body.replace("<br>","##break##");
    body.replace("</br>","##break##");
    body.replace("</p>","##break##");
    body.replace("</td>","##break##");
    body.remove(QRegExp("<head>(.*)</head>"));
    body.remove(QRegExp("<form(.*)</form>"));
    body.remove(QRegExp("</(div|span|tr|td|br|body|html|tt|a|strong|p)>"));
    body.remove(QRegExp("</(DIV|SPAN|TR|TD|BR|BODY|HTML|TT|A|STRONG|P)>"));
    /*body.remove(QRegExp("<(div|span|tr|td|br|body|html|tt|a|strong|p)>"));*/
    /*body.remove(QRegExp("<(div|span|tr|td|br|body|html|tt|a|strong|p)( )(.*)(!>)>"));*/
    qDebug() << "### newbody " << body;
    return body;
}

jacek
27th July 2006, 12:10
You need something like:
body.remove( QRegExp( "<(?:div|span|tr|td|br|body|html|tt|a|strong|p)[^>]*>", Qt::CaseInsensitive ) );
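
A minimal sketch of what that pattern does, written with std::regex instead of QRegExp so it builds without Qt (that swap is an assumption, and the helper name removeOpenTags is hypothetical). The non-capturing group lists the tag names, [^>]* swallows any attributes up to the closing bracket, and the icase flag stands in for Qt::CaseInsensitive:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: strips opening tags from a fixed list of names,
// with or without attributes, ignoring case.
std::string removeOpenTags(const std::string &body)
{
    static const std::regex openTag(
        "<(?:div|span|tr|td|br|body|html|tt|a|strong|p)[^>]*>",
        std::regex::ECMAScript | std::regex::icase);
    // Closing tags start with "</", so they are left untouched here.
    return std::regex_replace(body, openTag, "");
}
```

Note that the pattern only checks how the tag name starts, so a tag such as <pre> is also removed because it begins with "p". Putting a word boundary (\\b) right after the group is one way to avoid that, if it matters.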

patrik08
27th July 2006, 12:22
Thanks, the opening tags are removed now. Only these remain:



<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!--UdmComment-->

patrik08
27th July 2006, 12:46
Now it runs and cleans all tags:



QString QLess::CleanTag( QString body )
{
    qDebug() << "### start clean tag ";
    body.replace("<br>","##break##");
    body.replace("</br>","##break##");
    body.replace("</p>","##break##");
    body.replace("</td>","##break##");
    body.remove(QRegExp("<head>(.*)</head>"));
    body.remove(QRegExp("<form(.*)</form>"));
    body.remove( QRegExp( "<(.)[^>]*>"));
    qDebug() << "### newbody " << body;
    return body;
}

jacek
27th July 2006, 12:53
body.remove(QRegExp("<form(.*)</form>"));
What if a page contains more than one form?

patrik08
27th July 2006, 13:13
"What if a page contains more than one form?"

I hope that body.remove( QRegExp( "<(.)[^>]*>")); removes the second form tag as well. In any case, my CMS contains only news articles. I reformat the color and style, insert the new line breaks, and then pass the result to Tidy to check it.

jacek
27th July 2006, 13:20
"I hope that body.remove( QRegExp( "<(.)[^>]*>")); removes the second form tag as well."
Then you had better try your code on:

text1
<form>form1</form>
text2
<form>form2</form>
text3
hint (http://doc.trolltech.com/4.1/qregexp.html#setMinimal)
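
The hint points at QRegExp::setMinimal, which switches the pattern from greedy to minimal matching. As a rough sketch of the difference, here are both behaviours written with std::regex (an assumption, to keep the example free of Qt; the function names are hypothetical). In std::regex the lazy quantifier *? plays the role of setMinimal( true ), and [\s\S] is used because the ECMAScript "." does not match newlines:

```cpp
#include <regex>
#include <string>

// Greedy: the wildcard runs on to the LAST "</form>" in the input,
// so everything between the first and the last form is lost too.
std::string removeFormsGreedy(const std::string &body)
{
    static const std::regex greedy("<form[\\s\\S]*</form>", std::regex::icase);
    return std::regex_replace(body, greedy, "");
}

// Minimal (lazy): the wildcard stops at the FIRST "</form>" it can reach,
// so each form is removed separately.
std::string removeFormsMinimal(const std::string &body)
{
    static const std::regex lazy("<form[\\s\\S]*?</form>", std::regex::icase);
    return std::regex_replace(body, lazy, "");
}
```

On the test input above, the greedy version keeps only text1 and text3, while the minimal version also keeps text2.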

patrik08
27th July 2006, 14:40

Now it handles more than one form, as well as JavaScript and style blocks.

It runs like this:





QString QLess::CleanTag( QString body )
{
    qDebug() << "### start clean tag "; /* &nbsp; */
    body.replace("&nbsp;"," ");
    body.replace("<br>","##break##");
    body.replace("</br>","##break##");
    body.replace("</p>","##break##");
    body.replace("</td>","##break##");
    body.remove(QRegExp("<head>(.*)</head>",Qt::CaseInsensitive));
    body.remove(QRegExp("<form(.)[^>]*</form>",Qt::CaseInsensitive));
    body.remove(QRegExp("<script(.)[^>]*</script>",Qt::CaseInsensitive));
    body.remove(QRegExp("<style(.)[^>]*</style>",Qt::CaseInsensitive));
    body.remove(QRegExp("<(.)[^>]*>"));
    body.replace("##break##","</br>");
    qDebug() << "### newbody " << body;
    return body;
}



html result:

text1

text2

text3
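
The result is clean for this test input, but the form pattern in the last version relies on [^>]* and so can only match when there is no further ">" between the opening tag and </form>. A small sketch of that limitation, again using std::regex as a stand-in for QRegExp (an assumption; the function name is hypothetical, and [\s\S] replaces "." so that newlines are matched as they are in QRegExp):

```cpp
#include <regex>
#include <string>

// Mirrors the pattern "<form(.)[^>]*</form>" from the last version.
// After one arbitrary character, only non-'>' characters are allowed
// before "</form>", so nested tags inside the form block the match.
std::string removeFormsNegatedClass(const std::string &body)
{
    static const std::regex re("<form([\\s\\S])[^>]*</form>", std::regex::icase);
    return std::regex_replace(body, re, "");
}
```

A plain <form>form1</form> is removed, but a form with attributes or with nested elements such as <input> is left in place. The later catch-all tag removal then strips its tags while keeping any text inside it. Minimal matching, as in the hint, does not have this restriction.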