Thread: Web Crawler Design suggestions required

  1. #1
    Join Date
    Jul 2012
    Posts
    201
    Thanks
    26
    Thanked 1 Time in 1 Post
    Qt products
    Qt4
    Platforms
    Windows

    Web Crawler Design suggestions required

    Hi there guys. For the past three months I have been developing a web crawler, and everything seems to be working fine, but I feel the design of the application as a whole is not efficient. The application uses a client-server architecture, and the crawling is performed by the server. Its basic job is to crawl 50 websites and retrieve data from them. At the moment it crawls serially (i.e. one website at a time), which is very time consuming. I would like to adopt multi-threading (i.e. using QThreadPool) to increase efficiency: my server runs on a 3 Mb/s internet connection, and I would like the application to crawl 4 sites at a time. I don't need code or anything like that; all I want is to pick your brains as to how you would approach such a problem. Thank you in advance.

  2. #2
    Join Date
    Jan 2006
    Location
    Graz, Austria
    Posts
    8,416
    Thanks
    37
    Thanked 1,544 Times in 1,494 Posts
    Qt products
    Qt3 Qt4 Qt5
    Platforms
    Unix/X11 Windows

    Re: Web Crawler Design suggestions required

    I think there are two levels of jobs here and thus two potential parallelisation possibilities.

    1) Crawling sites in parallel
    2) Making multiple requests for one site in parallel

    The latter could be tricky, as the site might limit the number of connections from the same client or require some information to be shared between connections, e.g. cookies.

    For (1) I would personally go with multiple processes instead of threads, but of course it depends a bit on how you store/process the crawling results.

    Cheers,

  3. #3

    Re: Web Crawler Design suggestions required

    Thank you for your reply, anda_skoa. I use QWebEnginePage::runJavaScript() to fetch the data I need, and I store it in local storage on the server as .html files. If I may ask, why do you prefer processes over threads in this scenario?

  4. #4

    Re: Web Crawler Design suggestions required

    Multithreading adds complexity because all threads of a process share one address space.

    E.g. the classes involved must either not use static member variables or have them explicitly protected (for instance by a mutex), etc.
    Also problematic are global/singleton objects, in this case the global QWebEngineSettings.
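
    To illustrate what "explicitly protected" can look like, here is a minimal sketch in plain C++ rather than Qt. The class PageCounter and its members are made-up names for illustration, not part of any crawler discussed in this thread. It only shows the locking pattern a static member would need once several threads touch it.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical helper with a static (process-wide) counter.
// Every thread in the process sees the same pagesFetched_ variable,
// so each access goes through the same mutex.
class PageCounter {
public:
    // Safe to call from several threads: the lock serialises the increment.
    static void pageFetched() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++pagesFetched_;
    }

    // Reads the counter under the same lock.
    static std::size_t total() {
        std::lock_guard<std::mutex> lock(mutex_);
        return pagesFetched_;
    }

private:
    static std::mutex mutex_;
    static std::size_t pagesFetched_;
};

// Out-of-class definitions of the static members.
std::mutex PageCounter::mutex_;
std::size_t PageCounter::pagesFetched_ = 0;
```

    With one process per crawler, each process would have its own copy of such a static and no lock would be needed, which is the complexity the point above refers to.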

    Running multiple processes is also almost "for free": you have an existing and working program, and you just start it multiple times with different input.
    "Almost" because you will likely want a manager process, but that will require no or only minor changes to the crawler program and is thus unlikely to break what already works.
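
    As a rough sketch of the bookkeeping such a manager might do for "50 sites, 4 at a time", assuming plain C++ and a made-up class name CrawlQueue: it starts nothing itself and only decides when another crawler process may be launched.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical helper: tracks which sites still need a crawler process
// and how many crawler processes are currently running.
class CrawlQueue {
public:
    CrawlQueue(std::deque<std::string> sites, std::size_t maxParallel)
        : pending_(std::move(sites)), maxParallel_(maxParallel) {
        assert(maxParallel_ > 0);
    }

    // Stores the next site in 'site' and returns true if another crawler
    // may be started; returns false if the parallel limit is reached or
    // no sites are left.
    bool takeNext(std::string &site) {
        if (running_ >= maxParallel_ || pending_.empty())
            return false;
        site = std::move(pending_.front());
        pending_.pop_front();
        ++running_;
        return true;
    }

    // Call when a crawler process has exited, normally or by crashing.
    void markFinished() {
        assert(running_ > 0);
        --running_;
    }

    // True once every site has been handed out and every crawler has exited.
    bool done() const { return pending_.empty() && running_ == 0; }

private:
    std::deque<std::string> pending_;
    std::size_t maxParallel_;
    std::size_t running_ = 0;
};
```

    In a Qt manager, each successful takeNext() would presumably be followed by a QProcess::start() of the existing crawler program with the site as its argument, and markFinished() would be called from a slot connected to QProcess::finished, followed by another takeNext().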

    Then there is job control: a process can be killed at any time, even if it is stuck, whereas a thread can only be asked to quit. This is less of a problem with an event-driven thread, but still something to consider.
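
    The "can only be asked to quit" part can be sketched in plain C++ as follows. The function name runUntilStopped is made up for illustration. The stop flag is only noticed between steps, so a step that hangs never gets as far as checking it.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

// Runs the steps in order, checking the stop flag before each one.
// Returns how many steps were actually executed.
std::size_t runUntilStopped(const std::vector<std::function<void()>> &steps,
                            const std::atomic<bool> &stopRequested) {
    std::size_t executed = 0;
    for (const auto &step : steps) {
        if (stopRequested.load())  // the only point where "quit" is noticed
            break;
        step();                    // if this call hangs, the flag is never re-read
        ++executed;
    }
    return executed;
}
```

    A separate crawler process does not depend on such cooperation: the manager can call QProcess::kill() on it, for example once a deadline has passed.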

    Stability: if one crawler process encounters data that makes it crash, only that crawler crashes.

    Cheers,

  5. The following user says thank you to anda_skoa for this useful post:

    ayanda83 (15th February 2017)
