Web Crawler Design suggestions required



ayanda83
13th February 2017, 07:10
Hi there guys, for the past three months I have been developing a web crawler and everything seems to be working fine, but I feel the design of the application is not efficient at all. The application uses a client-server architecture, and the crawling is performed by the server. I would like to adopt multi-threading (i.e. QThreadPool) in my design to increase efficiency. The application's basic job is to crawl 50 websites and retrieve data from them. At the moment it crawls serially (i.e. one website at a time), which is very time consuming. My server runs on a 3 Mb/s internet connection, and I would like the application to crawl 4 sites at a time. I don't need code or anything like that; all I want is to pick your brains as to how you would approach such a problem. Thanks in advance.

anda_skoa
13th February 2017, 08:50
I think there are two levels of jobs here and thus two potential parallelisation possibilities.

1) Crawling sites in parallel
2) Making multiple requests for one site in parallel

The latter could be tricky, as the site might limit the number of connections from the same client, or require some information to be shared between connections, e.g. cookies.

For (1) I would personally go with multiple processes instead of threads but of course it depends a bit on how you store/process the crawling results.

Cheers,
_

ayanda83
13th February 2017, 09:01
Thank you for your reply, anda_skoa. I use QWebEnginePage::runJavaScript() to fetch the data I need and store it as .html files in local storage on the server. If I may ask, why do you prefer processes over threads in this scenario?

anda_skoa
14th February 2017, 09:25
Multithreading adds complexity due to the shared address space between threads.

E.g. the classes involved must not use static member variables, or must explicitly protect them, etc.
Global/singleton objects are also problematic, in this case the global QWebEngineSettings.

Running multiple processes is also almost "for free": you have an existing, working program, and you just start it multiple times with different input.
"Almost" because you will likely want a manager process, but that requires no or only minor changes to the crawler program itself and is thus unlikely to affect its working condition.

Then there is job control: a process can be killed at any time, even if it is stuck, while a thread can only be asked to quit. That is less of a problem with an event-driven thread, but still something to consider.

Stability: if one crawler process encounters data that makes it crash, only that crawler crashes.

Cheers,
_