grabbing html from a separate page
keith
August 1st, 2001, 16:04
how would i do this? i have a page at something.com/page.cgi and i want to include the source code of http://www.geocities.com/keith in its html output.
i tried the format:
print "Content-type: text/html\n\n";
include "http://www.geocities.com/keith";
but that didn't work. i can get it in php, but i need it in cgi, and i'm somewhat cgi dumb. any suggestions?
lucifer
August 1st, 2001, 16:21
when you say include the source code, do you mean:
a) include the code, not processed
b) include the output of the code (run on the other server)
c) include the source in your cgi and run the whole thing
keith
August 1st, 2001, 16:25
i mean just include the textual html of the page being copied from.
so if the entire index page at geocities.com/keith read simply:
<title>whoopity doo</title>
...then it would print the text "<title>whoopity doo</title>" in the cgi output wherever the command appeared. it basically just needs to print whatever html is on the page being grabbed. i just don't know how to get it to work.
php uses
<? include "http://geocities.com/keith" ?>
...and it works fine, but it's gotta be cgi.
lucifer
August 1st, 2001, 16:56
in perl:
use LWP::Simple;
$url="http://domain.com/mypage.html";
# send the cgi header first, then fetch the page and print it straight to stdout
print "Content-type: text/html\n\n";
getprint($url);
should work
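one thing to watch: if the fetch fails, the header has already gone out and the visitor just gets an empty page. a minimal sketch of a variant that checks the result first is below. build_response is a made-up helper name (not part of LWP), the 502 status is just one reasonable choice, and the fetcher is passed in as a code reference so the logic can be tried without a network connection.

```perl
use strict;
use warnings;

# build_response() is a hypothetical helper, not part of LWP.
# $fetch is any code reference that takes a url and returns the page
# body, or undef on failure (LWP::Simple's get() behaves this way).
sub build_response {
    my ($fetch, $url) = @_;
    my $html = $fetch->($url);
    if (defined $html) {
        return "Content-type: text/html\n\n" . $html;
    }
    # tell the browser the upstream fetch failed instead of sending nothing
    return "Status: 502 Bad Gateway\nContent-type: text/plain\n\n"
         . "could not fetch $url\n";
}

# in the real cgi script this would be wired up roughly as:
#   use LWP::Simple;
#   print build_response(\&get, "http://domain.com/mypage.html");
```

the only real difference from getprint is that nothing is printed until the script knows whether the fetch worked.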
keith
August 1st, 2001, 17:48
dude, you rock!
quick question though, i'm using this for a new redirection option at wzr.net, to make your subdomain appear as a real address and not a framed redirect. it acts like ip pointing, only it points at a webpage, not an ip.
so, if you visit http://keith.wzr.net, it goes to http://www.weezer.com with this new redirection format
if you right click the page and hit 'properties', you'll see the page reads as keith.wzr.net. so it appears that keith.wzr.net is a real domain rather than a redirect.
but if you right click on an image and hit 'properties', you'll see it reads 'keith.wzr.net/imagename.jpg'. so not only does it load the html through the subdomain, it looks like the subdomain is actually hosting the images.
so what i'm wondering is, does it really load the images through wzr.net? or does it load them directly from the weezer.com server and only appear to be hosted by keith.wzr.net? because if it loads them off weezer.com through the redirect, i'm not going to do it, unfortunately. that would obviously eat up a ton of bandwidth.
if that's the case, i'd add
<base href="$url">
in there as well. that'd take care of the problem, but not before i figure out how it loads the images. it's actually kind of cool to see 'keith.wzr.net/image.jpg', but there's always the bandwidth deal...
niv
August 1st, 2001, 17:57
since it's server side, the images are going to load through a script much like the one lucifer posted, which passes the data from your server on to the client, so it uses up your bandwidth. :(
keith
August 1st, 2001, 18:00
damn... thanks a lot. looks like i'll be adding the <base> tag.
well, at least it'll still load the html through it, making it look like real domain.
lucifer
August 1st, 2001, 18:19
maybe you could do something like
print "Location: $real_url\n\n";
if it's an image etc (or not text)
keith
August 1st, 2001, 18:27
actually that's what the <base href="$url"> tag would do, but it's not 100% failsafe.
i'd need a way of locating code with .jpg, .gif, .zip, etc... extensions and parse in the real url. any ideas, oh cgi guru? :)
hey, any way i could have the cgi locate:
<img src="
in the code and change them all to:
<img src="$url/
? ...that would be pretty sweet, but probably more work than it's worth.
lucifer
August 1st, 2001, 18:43
use LWP::Simple;
$content = get("http://.....");
would let you parse $content and then print it out. to change the links, just use regular expressions.
also in LWP::Simple:
head($url)
Get document headers. Returns the following 5 values if successful: ($content_type, $document_length, $modified_time, $expires, $server)
use that first and decide whether to give real or fake info,
i.e. pass the content through or use print "location: .....\n\n"
do you want all this extra bandwidth?
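a rough sketch of that decision, assuming the script sees the request for every file. choose_action and redirect_header are hypothetical helper names, and the rule "only text/html is passed through" is an assumption, not something head() decides for you.

```perl
use strict;
use warnings;

# choose_action() is a hypothetical helper. it takes the content type
# that head() returned (undef if the request failed) and says what
# the cgi script should do with that url.
sub choose_action {
    my ($content_type) = @_;
    return 'error' unless defined $content_type;
    # the header can carry a charset suffix, e.g. "text/html; charset=iso-8859-1"
    return 'proxy' if $content_type =~ m{^text/html\b}i;
    return 'redirect';    # images, zips etc. go straight to the real server
}

# redirect_header() builds the cgi redirect for the non-html case
sub redirect_header {
    my ($real_url) = @_;
    return "Location: $real_url\n\n";
}

# rough wiring, assuming LWP::Simple is installed:
#   use LWP::Simple;
#   my ($type) = head($url);
#   my $action = choose_action($type);
#   print redirect_header($url) if $action eq 'redirect';
```

note that head() costs one extra request to the other server for each file, which is worth weighing against the bandwidth it saves.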
keith
August 1st, 2001, 18:53
alright, i think you lost me there. what would i have to do to this code to get what you're saying?:
{
use LWP::Simple;
print "Content-type: text/html\n\n";
getprint($url);
}
i'm guessing it shouldn't be too much more bandwidth if i can filter out everything but the html. hmmm, but maybe i'm wrong.
lucifer
August 2nd, 2001, 04:48
yep, that wasn't very well written. I needed to get to bed.
two things:
1) you could check file types using head to see if things are text/html. it's probably easier to just look for .htm and .html
2) using get you can fetch the contents of a page, and then you could change all the href/src as wanted using regular expressions
so instead of
use LWP::Simple;
print "Content-type: text/html\n\n";
getprint($url);
you have
use LWP::Simple;
$content=get($url);
# transform paths - you need a better reg exp than this (just an example)
$content =~ s!<img src="!<img src="$url/!ig;
print "Content-type: text/html\n\n";
print $content;
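for what a slightly better reg exp might look like, here is a sketch that leaves already-absolute urls alone and copes with single quotes and extra attributes. absolutize_imgs is a made-up helper name, and it assumes the base passed in is the root of the other site. it does not resolve ../ paths or handle unquoted attributes.

```perl
use strict;
use warnings;

# absolutize_imgs() is a hypothetical helper: it prefixes relative
# <img src="..."> values with the base url and leaves absolute ones alone.
sub absolutize_imgs {
    my ($html, $base) = @_;
    $base =~ s{/+\z}{};    # drop trailing slashes so we don't double them
    $html =~ s{
        (<img\b[^>]*?\bsrc\s*=\s*)   # group 1: everything up to the attribute value
        (["'])                       # group 2: opening quote
        (?![a-z][a-z0-9+.-]*:|//)    # skip values that already have a scheme or host
        /?                           # swallow one leading slash, if any
        ([^"'>]*)                    # group 3: the relative path
        \2                           # matching closing quote
    }{$1$2$base/$3$2}gix;
    return $html;
}

# usage inside the script above would be something like:
#   $content = absolutize_imgs($content, $url);
```

if the URI module is installed, URI->new_abs is the more usual tool for resolving relative urls properly. the regex version is only meant to show the idea.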
jetalomar
August 2nd, 2001, 06:36
what about <? readfile "http://geocities.com/keith" ?>
?
Magic2K2
August 2nd, 2001, 06:53
Does this work so you can jack code from other sites? (like real-time news sites)
lucifer
August 2nd, 2001, 07:03
Originally posted by Magic2K2
Does this work so you can jack code from other sites? (like real-time news sites)
with a little work ;) it's easier in php IMO
Coight
August 2nd, 2001, 07:48
I would also like to know how to do this, but via PHP. I want a script where, say, you enter www.mydomain.com and it behaves like keith.wzr.net does (full path forwarding etc., so it looks like a real domain).
I need this in PHP. If anyone has any information, it would be greatly appreciated.
Magic2K2
August 2nd, 2001, 08:16
Originally posted by lucifer
with a little work ;) it's easier in php IMO
How is it done in PHP?
lucifer
August 2nd, 2001, 08:29
<?
# get html
$stuff=implode("", file("http://mydomain.com/page.html"));
#rewrite urls etc ('some patten' and 'some replace' are placeholders to fill in)
$stuff=preg_replace( some patten , some replace , $stuff);
#output
echo $stuff;
?>
Coight
August 2nd, 2001, 08:34
Thanks lucifer didnt work though:
Parse error: parse error in /home/test/public_html/test.php on line 5
lucifer
August 2nd, 2001, 08:39
try changing
'some patten' & 'some replace'
Coight
August 2nd, 2001, 08:45
Lucifer, I am not experienced with php. New language to me :). I have started learning. Could you give me some idea what I should change these two
BTW im a quick learner!
lucifer
August 2nd, 2001, 08:54
Originally posted by Myacen Network
Lucifer, I am not experienced with php. New language to me :). I have started learning. Could you give me some idea what I should change these two
BTW im a quick learner!
go to http://php.net download the manual. look in Perl regular expression section.
regular expressions are very cool for finding/replacing stuff in text strings. They are funny little things that look like /href=(["']?)(http:\/\/)?(.*?)\1[\s>]/ which should get the url into $3 from href="..." without any http:// bits - I have not checked it, it's just an overly complicated example :p
learn reg expressions and you'll be able to rip info out of webpages and all sorts :)
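as a small illustration of ripping info out of a page, here is a perl sketch that collects every quoted href value. extract_links is a made-up helper name, and unquoted attributes are ignored. php's preg_match_all takes perl-compatible patterns, so the same pattern idea should carry over.

```perl
use strict;
use warnings;

# extract_links() is a hypothetical helper that returns every quoted
# href value found in the html, in document order.
sub extract_links {
    my ($html) = @_;
    my @links;
    # group 1 is the opening quote, group 2 is the url, \1 is the same quote again
    while ($html =~ m{\bhref\s*=\s*(["'])(.*?)\1}gis) {
        push @links, $2;
    }
    return @links;
}

# example call:
#   my @links = extract_links($content);
```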
Magic2K2
August 2nd, 2001, 09:21
Take out line 5 for now and it should print the page as is, correct? If you are new, first see what you are getting then worry about processing it.
keith
August 2nd, 2001, 15:54
there's a php program that does exactly this at http://www.4cm.com [grabs specific sections of websites]
lucifer
August 3rd, 2001, 05:30
how's the cgi script coming along? I'd like to see it when it's done
Dusty
August 3rd, 2001, 11:46
<?
# get html
$stuff=implode("", file("http://mydomain.com/page.html"));
#rewrite urls etc
$stuff=preg_replace( some patten , some replace , $stuff);
#output
echo $stuff;
?>
That was easier in PHP? I really don't understand it when people say things are easier in PHP than in Perl, in Perl it's done exactly the same way only with slightly different syntax:
use LWP::Simple;
$stuff=get("http://mydomain.com/page.html");
$stuff=~s/some patten/some replace/g;
print "Content-type: text/html\n\n";
print $stuff;
Maybe I'm just very used to seeing it the hard way and don't realize that it's hard...
Also, unless you've got your script set up in some odd way, why would the images load through your server? Webpages aren't sent as one big chunk. The script fetches the HTML, and the HTML makes relative links to the images, which should show up broken because the images aren't on your server. If they're not broken, either they are there (on the same server) or you've set up some sort of script that runs on 404 errors and is redirecting you to the right URL. In either case, all you've got to do to fix it is print out a <BASE HREF="..."> tag before you print out the page. That fixes the relative URLs, so there's no need for the script to go through and edit all the URLs.
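A minimal sketch of the BASE HREF approach, for what it's worth. add_base_tag is a hypothetical helper name, and it assumes the URL contains no double quotes. It puts the tag right after the opening <head> if there is one, otherwise at the very top of the page.

```perl
use strict;
use warnings;

# add_base_tag() is a hypothetical helper: it puts a <base href> right
# after the opening <head> tag, or at the very top if there is no <head>.
sub add_base_tag {
    my ($html, $url) = @_;
    my $tag = qq{<base href="$url">};
    # try to insert after <head ...>; s/// returns true if it matched
    return $html if $html =~ s{(<head\b[^>]*>)}{$1$tag}i;
    return $tag . $html;
}

# in the script this would replace the plain print, roughly:
#   print "Content-type: text/html\n\n";
#   print add_base_tag($stuff, "http://mydomain.com/");
```

With this in place the browser resolves relative image paths against the other server itself, so the image requests never touch your script.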
lucifer
August 6th, 2001, 14:52
Originally posted by Dusty
That was easier in PHP? I really don't understand it when people say things are easier in PHP than in Perl, in Perl it's done exactly the same way only with slightly different syntax:
Perl has these extra lines - ok only 2 but that's 40% of the script
use LWP::Simple;
print "Content-type: text/html\n\n";
also in perl you have to get your form/cookie variables in code, not automatically
and in php there's no CHMODing to do.
not much in it really ;)
in the long run perl is probably the more compact code, but there are just these little extra bits to do
Coight
September 27th, 2001, 01:10
Hello,
We did not manage to get this script working (The one that Keith uses). If anyone gets this working you will get 50mb's on our paid server in return for your services.
Robert
Dusty
September 27th, 2001, 12:32
use LWP::Simple;
$page=get("http://domain.com/mypage.html");
print "Content-type: text/html\n\n".$page;
...is correct. If it doesn't work for you, perhaps you don't have LWP installed on your server?
By the way, where's Lucifer? He hasn't posted in three and a half weeks. :(
keith
September 27th, 2001, 13:30
Originally posted by Myacen Network
Hello,
We did not manage to get this script working (The one that Keith uses). If anyone gets this working you will get 50mb's on our paid server in return for your services.
Robert
i got it working fine...