Website crawler | find html tags ("a" "h" "meta" ...)



terpentin86
10th March 2015, 17:29
Hi,

I would like to write a small program that extracts some information from a website,
such as meta definitions, headlines, or links.

I started by finding a-tags and checking whether they reference internal or external sites.
The HTML code was saved to a temp file on my disk beforehand.

At first it worked really well. Later I tested it with some other sites:
sometimes there are results and sometimes nothing.
So I compared the sources line by line and found the problem.

If the website needs external resources like JavaScript or pictures, my code doesn't work.
For example:
<link href='http://fonts.googleapis.com/css?family=Questrial' rel='stylesheet' type='text/css'>
<img src="http://www.website.de/images/slider/apparatur.jpg" title="Zahntechnisches Labor" alt="Zahntechnisches Labor">

If I delete all these external resources from the source code in my temp file, it works.

So my question is how I can suppress such external accesses. For my small console application,
graphical rendering doesn't matter.

Qt 5.2 code:



void Crawler::crawl_Page()
{
    QWebPage frame;

    QFile* file = new QFile("D:/temp.html");

    if(file->open(QIODevice::ReadOnly | QIODevice::Text))
    {
        qDebug() << "Open tempfile ";

        QString htmlContent = file->readAll();
        qDebug() << "Count html :: " << htmlContent.count();

        frame.mainFrame()->setHtml(htmlContent);
        qDebug() << "Mainframe size :: " << frame.mainFrame()->contentsSize();

        QWebElement doc = frame.mainFrame()->documentElement();

        QWebElementCollection linkCollection = doc.findAll("a");
        qDebug() << "Found " << linkCollection.count() << " links";

        foreach (QWebElement link, linkCollection) {
            qDebug() << "found link " << link.attribute("href");
        }
    }
}


Results if it works:
11011

There is no error message when it doesn't work (sites with external resources),
only the qDebug output "Found 0 links".

Thx

ChrisW67
10th March 2015, 20:56
The QWebPage cannot know the final result of loading the page HTML until it has loaded all the external scripts/css/images. Since your code never returns to the event loop the QWebPage never gets a chance to commence loading these. I suspect, therefore, that the HTML document remains empty until then.

Do your parsing work in a slot attached to the loadFinished() signal. Use QWebSettings to disable JavaScript or image loading etc. if you need to.

terpentin86
15th March 2015, 22:42
Hey,

thanks for the reply.

With your hints I have adapted my code as follows, and it works:



void Crawler::crawl_Page()
{
    frame = new QWebPage(this);

    QWebSettings::setObjectCacheCapacities(0,0,0);
    frame->settings()->setAttribute(QWebSettings::LocalContentCanAccessFileUrls,false);
    frame->settings()->setAttribute(QWebSettings::LocalContentCanAccessRemoteUrls,false);

    QObject::connect(frame->mainFrame(), SIGNAL(loadFinished(bool)),
                     this, SLOT(parsingWork()));

    QFile* file = new QFile("D:/tempfile.txt");

    if(file->open(QIODevice::ReadOnly | QIODevice::Text))
    {
        qDebug() << "Open tempfile ";
        QString htmlContent = file->readAll();

        qDebug() << "Count Chars :: " << htmlContent.count();
        frame->mainFrame()->setHtml(htmlContent);

        doc = frame->mainFrame()->documentElement();
    }
}


void Crawler::parsingWork()
{
    qDebug() << "Start parsing content .....";

    QWebElementCollection linkCollection = doc.findAll("a");
    qDebug() << "Found " << linkCollection.count() << " links";

    foreach (QWebElement link, linkCollection)
    {
        qDebug() << "found link " << link.attribute("href");
    }

    qDebug() << "stop parsing content .....";
}

anda_skoa
16th March 2015, 07:31
I would recommend allocating your QFile on the stack; there is no need to allocate it on the heap since you only need it within the scope of that function.

This has the nice side effect that you don't have to delete it manually (which you currently forget to do).

Also, if crawl_Page() is called more than once, you might want to explicitly delete the QWebPage at some point, e.g. using deleteLater() in the slot connected to loadFinished().

Cheers,
_