Spider Models


Author: yifei / Created: May 30, 2017, 12:13 p.m. / Modified: May 30, 2017, 3 p.m.

Basically, I think there are two kinds of crawling: one is for search engines, whose purpose is to archive the web; the other is for harvesting structured data from the web.

Here we try to solve this problem: given one page/url and a sample of the data we want, how do we crawl all of it?

Our basic assumption is that each page with structured data presents it in some kind of table. For crawling data from the web, this pattern applies to roughly 90% of the pages we are interested in.

To sum it up, the flow is: hub page / list of urls -> data page -> mark cells -> calculate item xpath -> regenerate cell relative xpaths -> crawl.
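Here is a minimal sketch of the "mark cells -> item xpath -> relative xpath" steps, assuming the marked cells arrive as absolute XPaths for one sample item; the page, field names and helper functions below are made up for illustration, using lxml:

```python
from lxml import html

# Made-up data page: the structured data sits in a repeated <tr> "table" pattern.
PAGE = """<html><body><table id="books">
  <tr><td class="title">Book A</td><td class="price">10</td></tr>
  <tr><td class="title">Book B</td><td class="price">12</td></tr>
</table></body></html>"""

# The user marks one sample cell per field on the data page.
marked_cells = {
    "title": "/html/body/table/tr[1]/td[1]",
    "price": "/html/body/table/tr[1]/td[2]",
}

def item_xpath(cell_xpaths):
    """Longest common prefix of the marked cells, with the positional index
    on the last common step stripped so it matches every repeated item."""
    parts = [xp.strip("/").split("/") for xp in cell_xpaths]
    common = []
    for steps in zip(*parts):
        if len(set(steps)) != 1:
            break
        common.append(steps[0])
    common[-1] = common[-1].split("[")[0]   # tr[1] -> tr
    return "/" + "/".join(common)

def relative_xpaths(cells, item_xp):
    """Re-express each marked cell relative to the item xpath."""
    depth = len(item_xp.strip("/").split("/"))
    return {name: "./" + "/".join(xp.strip("/").split("/")[depth:])
            for name, xp in cells.items()}

item_xp = item_xpath(list(marked_cells.values()))   # /html/body/table/tr
rel = relative_xpaths(marked_cells, item_xp)        # {'title': './td[1]', 'price': './td[2]'}

doc = html.fromstring(PAGE)
for node in doc.xpath(item_xp):                     # crawl every repeated item
    print({name: node.xpath(xp)[0].text_content().strip()
           for name, xp in rel.items()})
```

This is obviously simplified; a real implementation has to handle cells at different depths, optional fields, and attribute values instead of text, but the shape of the computation is the same.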

One more thing: the whole process could be recursive, and our data page might itself be a hub page leading to more detailed data pages.
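As a rough sketch of that recursion (the function names and depth bound here are illustrative, not from the original post): the crawl is a bounded depth-first walk where the extractors generated in the previous step are applied to every page, and any page that also yields links is treated as a hub.

```python
import requests

def crawl(url, extract_items, extract_links, depth=0, max_depth=2, seen=None):
    """Depth-first crawl: every page is treated as a data page and, up to
    max_depth, also as a hub page linking to more detailed data pages."""
    seen = set() if seen is None else seen
    if url in seen:
        return
    seen.add(url)
    page = requests.get(url, timeout=10).text
    yield from extract_items(page)              # structured cells on this page
    if depth < max_depth:                       # this page may also be a hub
        for link in extract_links(page):
            yield from crawl(link, extract_items, extract_links,
                             depth + 1, max_depth, seen)
```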

PS:

* On a single page there might be several sections, each with its own data; e.g. a blog post page has both the post and a list of comments.
* There is also the problem of finding the next page when the hub page or the data page is paginated; we find it with a regular expression (see the sketch after this list).
* The other problem is that the data might live in JS or in an ajax call; it could be metadata, or only shown after the user acts in a specific way. Two options:
  a. Use phantomjs or something similar to render it back into a normal page, which might require recording the user's behavior.
  b. Use a regular expression to yank the data out directly (also sketched below).
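A minimal, standard-library-only sketch of the last two points: pulling the next-page URL out of the raw HTML with a regular expression, and yanking data that ships inside a `<script>` tag instead of the rendered markup (option b). The HTML snippet and patterns are made-up examples:

```python
import json
import re

HTML = """
<a class="next" href="/list?page=3">Next</a>
<script>var initialData = {"items": [{"title": "Book A", "price": 10}]};</script>
"""

# 1. Pagination: pull the next-page URL out of the raw HTML.
next_match = re.search(r'<a[^>]+class="next"[^>]+href="([^"]+)"', HTML)
next_url = next_match.group(1) if next_match else None
print(next_url)                        # /list?page=3

# 2. Data shipped in JS: grab the JSON literal and parse it.
data_match = re.search(r'var initialData = (\{.*?\});', HTML, re.S)
if data_match:
    data = json.loads(data_match.group(1))
    print(data["items"][0]["title"])   # Book A
```

In practice the next-page pattern is usually configured per site, and rendering with phantomjs (option a) is the fallback when the data never appears in the raw response at all.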

