Thread: Website crawler | find html tags ("a" "h" "meta" ...)

  1. #1

    Website crawler | find html tags ("a" "h" "meta" ...)

    Hi,

    I would like to write a small program that can extract some information from a website,
    such as meta definitions, headlines, or links.

    I started by finding a-tags and checking whether they reference internal or external sites.
    The HTML code was saved to a temp file on my disk beforehand.

    At first it worked really well. Later I tested it with some other sites:
    sometimes there are results and sometimes nothing.
    So I compared the sources line by line and found the problem.

    If the website needs external resources like JavaScript or pictures, my code doesn't work.
    For example:
    <link href='http://fonts.googleapis.com/css?family=Questrial' rel='stylesheet' type='text/css'>
    <img src="http://www.website.de/images/slider/apparatur.jpg" title="Zahntechnisches Labor" alt="Zahntechnisches Labor">

    When I delete all these external resources from the source code in my temp file, it works.

    So my question is how I can suppress such external accesses. For my small console application,
    graphical rendering doesn't matter.

    Qt 5.2 code:

    Qt Code:
    void Crawler::crawl_Page()
    {
        QWebPage frame;

        QFile* file = new QFile("D:/temp.html");

        if (file->open(QIODevice::ReadOnly | QIODevice::Text))
        {
            qDebug() << "Open tempfile ";

            QString htmlContent = file->readAll();
            qDebug() << "Count html :: " << htmlContent.count();

            frame.mainFrame()->setHtml(htmlContent);
            qDebug() << "Mainframe size :: " << frame.mainFrame()->contentsSize();

            QWebElement doc = frame.mainFrame()->documentElement();

            QWebElementCollection linkCollection = doc.findAll("a");
            qDebug() << "Found " << linkCollection.count() << " links";

            foreach (QWebElement link, linkCollection) {
                qDebug() << "found link " << link.attribute("href");
            }
        }
    }

    Results when it works: see attachment result.jpg

    There is no error message when it doesn't work (sites with external resources),
    only the qDebug output "Found 0 links".

    Thx

  2. #2

    Re: Website crawler | find html tags ("a" "h" "meta" ...)

    The QWebPage cannot know the final result of loading the page HTML until it has loaded all the external scripts/css/images. Since your code never returns to the event loop the QWebPage never gets a chance to commence loading these. I suspect, therefore, that the HTML document remains empty until then.

    Do your parsing work in a slot attached to the loadFinished() signal. Use QWebSettings to disable JavaScript or image loading etc. if you need to.
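To make the ordering problem concrete, here is a toy model in plain standard C++ with no Qt involved. `ToyEventLoop` and `ToyPage` are hypothetical names invented for this sketch and are not how QWebPage is implemented; they only mimic the situation suspected above, where work started by setHtml() is queued and nothing queued runs until control returns to the event loop.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Toy stand-in for an event loop (hypothetical, not Qt's implementation):
// posted work only runs when processEvents() is called.
class ToyEventLoop {
public:
    void post(std::function<void()> task) { tasks_.push(std::move(task)); }

    void processEvents() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            task();
        }
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Toy stand-in for a page (hypothetical): its content becomes available
// only after the event loop has run the deferred load.
class ToyPage {
public:
    explicit ToyPage(ToyEventLoop &loop) : loop_(loop) {}

    // Register the callback that plays the role of a loadFinished() slot.
    void onLoadFinished(std::function<void(bool)> slot) { slot_ = std::move(slot); }

    // Queue the load; nothing is stored until the loop processes it.
    void setHtml(const std::string &html) {
        loop_.post([this, html]() {
            content_ = html;
            if (slot_) slot_(true);
        });
    }

    const std::string &content() const { return content_; }

private:
    ToyEventLoop &loop_;
    std::string content_;
    std::function<void(bool)> slot_;
};
```

In the toy, `page.content()` is still empty directly after `setHtml()`; only once `loop.processEvents()` has run does the callback registered with `onLoadFinished()` fire and see the HTML. In Qt the corresponding pieces would be the application's event loop and a slot connected to loadFinished(bool).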

  3. The following user says thank you to ChrisW67 for this useful post:

    terpentin86 (15th March 2015)

  4. #3

    Re: Website crawler | find html tags ("a" "h" "meta" ...)

    Hey,

    thanks for the reply.

    With your hints I have adapted my code as follows, and it works:

    Qt Code:
    void Crawler::crawl_Page()
    {
        frame = new QWebPage(this);

        QWebSettings::setObjectCacheCapacities(0, 0, 0);
        frame->settings()->setAttribute(QWebSettings::LocalContentCanAccessFileUrls, false);
        frame->settings()->setAttribute(QWebSettings::LocalContentCanAccessRemoteUrls, false);

        QObject::connect(frame->mainFrame(), SIGNAL(loadFinished(bool)),
                         this, SLOT(parsingWork()));

        QFile* file = new QFile("D:/tempfile.txt");

        if (file->open(QIODevice::ReadOnly | QIODevice::Text))
        {
            qDebug() << "Open tempfile ";
            QString htmlContent = file->readAll();

            qDebug() << "Count Chars :: " << htmlContent.count();
            frame->mainFrame()->setHtml(htmlContent);

            doc = frame->mainFrame()->documentElement();
        }
    }


    void Crawler::parsingWork()
    {
        qDebug() << "Start parsing content .....";

        QWebElementCollection linkCollection = doc.findAll("a");
        qDebug() << "Found " << linkCollection.count() << " links";

        foreach (QWebElement link, linkCollection)
        {
            qDebug() << "found link " << link.attribute("href");
        }

        qDebug() << "stop parsing content .....";
    }

  5. #4

    Re: Website crawler | find html tags ("a" "h" "meta" ...)

    I would recommend allocating your QFile on the stack; there is no need to allocate it on the heap since you only need it within the scope of that function.

    This has the nice side effect that you don't have to delete it manually (which you currently don't do).

    Also, if crawl_Page() is called more than once, you might want to explicitly delete the QWebPage at some point, e.g. using deleteLater() in the slot connected to loadFinished().
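The stack-versus-heap point can be sketched in plain standard C++ without Qt. `ToyFile` and `liveFiles()` are hypothetical names made up for this illustration; `ToyFile` stands in for QFile only in the sense that it has a constructor and a destructor, and the counter makes visible when the destructor actually runs.

```cpp
#include <cstddef>
#include <string>

// Counter of ToyFile objects currently alive (hypothetical helper).
int &liveFiles() {
    static int count = 0;
    return count;
}

// Hypothetical stand-in for QFile: it only tracks construction and destruction.
class ToyFile {
public:
    explicit ToyFile(const std::string &path) : path_(path) { ++liveFiles(); }
    ~ToyFile() { --liveFiles(); }
    ToyFile(const ToyFile &) = delete;
    ToyFile &operator=(const ToyFile &) = delete;

    const std::string &path() const { return path_; }

private:
    std::string path_;
};

// Stack allocation: the destructor runs automatically when the function returns.
std::size_t readOnStack() {
    ToyFile file("D:/temp.html");
    return file.path().size();
}

// Heap allocation: the object stays alive until someone calls delete on it.
ToyFile *openOnHeap() {
    return new ToyFile("D:/temp.html");
}
```

After `readOnStack()` returns, the counter is back to zero without any cleanup code. The object from `openOnHeap()` stays alive until the caller deletes it, which is the step missing for the QFile in the posted code.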

    Cheers,
    _
