Website crawler | find html tags ("a" "h" "meta" ...)
Hi,
I would like to write a small program that can find some information on a website,
e.g. meta definitions, headlines or links.
So I started with finding a-tags and checking whether they reference internal or external sites.
The HTML code to parse was saved to a tempfile on my disk beforehand.
At first it worked really well. Later I tried it with some other sites ...
sometimes there are results and sometimes nothing.
So I compared the sources line by line and found the problem:
if the website needs external resources like JavaScript or images, my code doesn't work.
For example:
<link href='http://fonts.googleapis.com/css?family=Questrial' rel='stylesheet' type='text/css'>
<img src="http://www.website.de/images/slider/apparatur.jpg" title="Zahntechnisches Labor" alt="Zahntechnisches Labor">
So I deleted all these external resources from the source code in my tempfile, and then it works.
My question is therefore: how can I suppress such external accesses? For my small console application,
graphical representation doesn't matter.
Qt 5.2 code:
Code:
void Crawler::crawl_Page()
{
    QWebPage frame;

    qDebug() << "Open tempfile";
    QString htmlContent = file->readAll();
    qDebug() << "Count html :: " << htmlContent.count();

    frame.mainFrame()->setHtml(htmlContent);
    qDebug() << "Mainframe size :: " << frame.mainFrame()->contentsSize();

    QWebElement doc = frame.mainFrame()->documentElement();
    QWebElementCollection linkCollection = doc.findAll("a");
    qDebug() << "Found " << linkCollection.count() << " links";
    foreach (QWebElement link, linkCollection) {
        qDebug() << "found link " << link.attribute("href");
    }
}
Results when it works:
Attachment 11011
There is no error message when it doesn't work (sites with external resources),
only the qDebug output ... "Found 0 links" ...
Thx
Re: Website crawler | find html tags ("a" "h" "meta" ...)
The QWebPage cannot know the final result of loading the page HTML until it has loaded all the external scripts/css/images. Since your code never returns to the event loop the QWebPage never gets a chance to commence loading these. I suspect, therefore, that the HTML document remains empty until then.
Do your parsing work in a slot attached to the loadFinished() signal. Use QWebSettings to disable JavaScript or image loading etc. if you need to.
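To make that concrete, here is a minimal configuration sketch against the Qt 5.2 QtWebKit API (a fragment, not a complete program; `receiver` and `parseLinks()` are placeholders for your own QObject and slot):

```cpp
// Disable script execution and automatic image loading before setHtml().
// AutoLoadImages and JavascriptEnabled are standard QWebSettings attributes.
QWebPage page;
page.settings()->setAttribute(QWebSettings::AutoLoadImages, false);
page.settings()->setAttribute(QWebSettings::JavascriptEnabled, false);

// Parsing must wait until the page has finished loading, even for local HTML,
// so connect a slot to loadFinished() and do the findAll() work there.
QObject::connect(page.mainFrame(), SIGNAL(loadFinished(bool)),
                 receiver, SLOT(parseLinks()));
page.mainFrame()->setHtml(htmlContent);
```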
Re: Website crawler | find html tags ("a" "h" "meta" ...)
Hey,
thanks for the reply.
With your hints I have adapted my code as follows, and it works:
Code:
void Crawler::crawl_Page()
{
    frame = new QWebPage(this);
    QWebSettings::setObjectCacheCapacities(0, 0, 0);
    frame->settings()->setAttribute(QWebSettings::LocalContentCanAccessFileUrls, false);
    frame->settings()->setAttribute(QWebSettings::LocalContentCanAccessRemoteUrls, false);
    QObject::connect(frame->mainFrame(), SIGNAL(loadFinished(bool)),
                     this, SLOT(parsingWork()));

    qDebug() << "Open tempfile";
    QString htmlContent = file->readAll();
    qDebug() << "Count Chars :: " << htmlContent.count();
    frame->mainFrame()->setHtml(htmlContent);
    doc = frame->mainFrame()->documentElement();
}

void Crawler::parsingWork()
{
    qDebug() << "Start parsing content .....";
    QWebElementCollection linkCollection = doc.findAll("a");
    qDebug() << "Found " << linkCollection.count() << " links";
    foreach (QWebElement link, linkCollection) {
        qDebug() << "found link " << link.attribute("href");
    }
    qDebug() << "stop parsing content .....";
}
Re: Website crawler | find html tags ("a" "h" "meta" ...)
I would recommend allocating your QFile on the stack; there is no need to allocate it on the heap since you only need it within the scope of that function.
That has the nice side effect that you don't have to delete it manually (which you currently forget to do).
Also, if crawl_Page() is called more than once, you might want to explicitly delete the QWebPage at some point, e.g. using deleteLater() in the slot connected to loadFinished().
Cheers,
_