
Advanced Web Crawler

bloodyveins

New Member
What web crawlers usually do is index an address with its directory path, like: http://www.mysite.com/path/to/file

What I'm concerned with is creating a web crawler that can read server arguments, like http://www.mysite.com?node=1-14

Instead of manipulating the page address (in the case above, rewriting it as http://www.mysite.com/1-14/ so that common web crawlers can crawl the page), could anybody here point me to a good reference, idea, or consideration for this kind of method?
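To make it concrete, here is a minimal sketch (Python, standard library only) of what I mean: a crawler that keeps the query string as part of a page's identity instead of stripping it. The seed URL and the node parameter are just the hypothetical examples from above, not a real site.

```python
# Minimal sketch: a crawler that treats query-string URLs as distinct pages.
# The seed URL and the ?node= parameter are hypothetical examples.
import urllib.request
from urllib.parse import urljoin, urlparse, parse_qs
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=100):
    seen, queue = set(), [seed]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue
        # Index the page under its FULL URL, query string included, so that
        # ?node=1-14 and ?node=1-15 count as two different documents.
        args = parse_qs(urlparse(url).query)
        print(url, "->", args)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            queue.append(urljoin(url, href))
    return seen

# crawl("http://www.mysite.com/?node=1-14")
```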
 
Google did it, but the results are not that good.
When the server arguments get longer, it tends to neglect the equals sign (=), which means that not all pages are indexed properly.

What I think is that the web crawler tries every possibility and then indexes the page.
The problem arises when a page (a website) has lots of arguments, e.g.:

arg 1 => 100 possibilities
arg 2 => 50
arg 3 => 20
arg 4 => 10

By the multiplication principle, there are 100 * 50 * 20 * 10 = 1,000,000 (a million) combinations to try in order to index the whole site.
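As a quick sanity check, the count is just the product of the per-argument possibilities:

```python
# Sanity check: the number of distinct URLs is the product of the
# per-argument possibilities from the example above.
from math import prod

possibilities = [100, 50, 20, 10]
print(prod(possibilities))  # 1000000 -- one million combinations
```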

If there are billions of websites (or more) around the world, what the crawler will face is:
a. a huge index file
b. slow indexing
and it ends in a search that can't be trusted.

Since a page is updated regularly, does the same step have to be done all over again? And how long would that process take?
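One thing that bounds the cost of recrawling is a conditional HTTP GET: the crawler only re-downloads a page when the server says it has changed. A hedged sketch (the URL is hypothetical, and not every server honors Last-Modified):

```python
# Sketch: skip unchanged pages on recrawl with a conditional GET.
# Servers that support it answer 304 Not Modified instead of resending
# the page; servers that don't will simply return 200 every time.
import urllib.request
import urllib.error

def fetch_if_changed(url, last_modified=None):
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        resp = urllib.request.urlopen(req, timeout=10)
        # 200 OK: the page changed; re-index it and store the new timestamp.
        return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            # 304 Not Modified: keep the existing index entry untouched.
            return None, last_modified
        raise
```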

Any better ideas?
 