
Advanced Web Crawler

bloodyveins

New Member
What web crawlers usually do is index an address with its directory path, like: http://www.mysite.com/path/to/file

What I'm concerned with is creating a web crawler that can read server arguments, like http://www.mysite.com?node=1-14

Instead of manipulating the page address (in the case above, rewriting it as http://www.mysite.com/1-14/ so that common web crawlers can crawl the page), could anybody here point me to a good reference, idea, or consideration for this kind of method?
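To make it concrete, here is a minimal sketch (Python, standard library only) of what I mean: a crawler that keeps the query string as part of a page's identity instead of stripping it. The seed URL and the node parameter are just the hypothetical examples from above, not a real site.

```python
# Minimal sketch: a crawler that treats query-string URLs as distinct pages.
# The seed URL and the ?node= parameter are hypothetical examples.
import urllib.request
from urllib.parse import urljoin, urlparse, parse_qs
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=100):
    seen, queue = set(), [seed]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue
        # Index the page under its FULL URL, query string included, so that
        # ?node=1-14 and ?node=1-15 count as two different documents.
        args = parse_qs(urlparse(url).query)
        print(url, "->", args)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            queue.append(urljoin(url, href))
    return seen

# crawl("http://www.mysite.com/?node=1-14")
```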
 
Google did it, but the results are not that good.
When the server arguments get longer, it tends to neglect the equals sign (=), which means that not all pages are indexed properly.

What I think is that the web crawler tries every possibility and then indexes the page.
The problem arises when a page (a website) has lots of arguments, e.g.:

arg 1 => 100 possibilities
arg 2 => 50
arg 3 => 20
arg 4 => 10

By the multiplication principle, there are 100 * 50 * 20 * 10 = 1,000,000 (a million) combinations to try in order to index the whole site.
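As a quick sanity check, the count is just the product of the per-argument possibilities:

```python
# Sanity check: the number of distinct URLs is the product of the
# per-argument possibilities from the example above.
from math import prod

possibilities = [100, 50, 20, 10]
print(prod(possibilities))  # 1000000 -- one million combinations
```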

If there are billions of websites (or more) around the world, what the crawler will face is:
a. a huge index file
b. slow indexing
and it ends in a search that can't be trusted.

Since a page is updated regularly, does the same step have to be done all over again? And how long would that process take?
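One thing that bounds the cost of recrawling is a conditional HTTP GET: the crawler only re-downloads a page when the server says it has changed. A hedged sketch (the URL is hypothetical, and not every server honors Last-Modified):

```python
# Sketch: skip unchanged pages on recrawl with a conditional GET.
# Servers that support it answer 304 Not Modified instead of resending
# the page; servers that don't will simply return 200 every time.
import urllib.request
import urllib.error

def fetch_if_changed(url, last_modified=None):
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        resp = urllib.request.urlopen(req, timeout=10)
        # 200 OK: the page changed; re-index it and store the new timestamp.
        return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            # 304 Not Modified: keep the existing index entry untouched.
            return None, last_modified
        raise
```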

Any better ideas?
 