HTTrack Website Copier
Free software offline browser - FORUM
Subject: website recursiveness problems
Author: Haudy Kazemi
Date: 04/10/2002 10:07

I've come across an occasional problem on websites 
whose pages contain improper HTML.  What happens is 
that the URL for a link on that site is given with 
two "/" forward slashes in a row.
Ex: instead of    <>
the link goes to  <>
(there isn't anything wrong with these pages here...)
On complex sites with many links back and forth 
between pages, if HTTrack runs long enough you can 
get stuff like:
which are saved as c:\web\www.httrack.com_____\*.*

Web browsers do the same thing as HTTrack, but for a 
web browser this isn't a problem because it isn't 
trying to save everything.  The server software seems 
irrelevant; at least, it happens/can happen on sites 
with either Apache or Microsoft IIS.

My current workaround is to set the internal depth 
limit to 4 or 5 levels.

I think a better solution is to check URLs for 
duplicated "/"s and rewrite them with just one "/".  
I can't think of a legitimate case to actually have 
two "/"s in a row in any URL, except the initial http://
Nonetheless this would be best implemented as a 
configurable feature just in case it 'breaks' 
compatibility with the way HTTrack gathered sites 
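
To make the suggestion concrete, here is a rough sketch 
in Python of the kind of rewrite I mean.  This is not 
HTTrack code; the function name collapse_slashes and the 
enabled switch are made up for illustration, and 
www.example.com is just a placeholder host.  It only 
touches the path part of the URL, so the "//" after 
http: and anything in the query string are left alone.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str, enabled: bool = True) -> str:
    """Collapse runs of "/" in the path of a URL into a single "/".

    Hypothetical helper: 'enabled' stands in for the configurable
    switch proposed above, so the rewrite can be turned off.
    """
    if not enabled:
        return url
    parts = urlsplit(url)
    # Only the path is rewritten; scheme, host, query and fragment
    # are passed through unchanged.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


# Two spellings of the same page end up as one entry, so a crawler
# that tracks visited URLs would not fetch and save it twice.
seen = set()
for link in (
    "http://www.example.com/docs/index.html",
    "http://www.example.com//docs///index.html",
):
    seen.add(collapse_slashes(link))
```

With the rewrite applied before the "already visited" 
check, both links above map to the same entry.  Leaving 
the query string untouched is a cautious choice on my 
part, since a "//" there may be part of a value passed 
to the server.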
