HTTrack Website Copier
Free software offline browser - FORUM
Subject: httrack fills in entry?
Author: FLY
Date: 03/17/2009 01:13
 
I have successfully used httrack to download a few websites, but ran into
trouble with the Continental Airlines site
(http://www.continental.com/web/en-US/default.aspx).
In case you are curious, I am gathering web data to build language
models.

Since I care only about the text, I used this rule set: "-*.css -*.js
-ad.doubleclick.net/* -mime:application/foobar -*.gif -*.jpg -*.png -*.tif
-*.bmp -*.zip -*.tar -*.tgz -*.gz -*.rar -*.z -*.exe
-*.mov -*.mpg -*.mpeg -*.avi -*.asf -*.mp3 -*.mp2 -*.rm -*.wav -*.vob -*.qt
-*.vid -*.ac3 -*.wma -*.wmv
-*/*signout* -*/*logout*"
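
For anyone who wants to reproduce this outside the GUI, the rule set above can presumably be handed to the command-line httrack as a list of filter arguments after the URL and the -O output path. The following is only a minimal Python sketch under that assumption: the output directory name is made up for illustration, build_command is a hypothetical helper, and the script only prints the command line rather than running it.

```python
import shlex

# The rule set quoted above, copied verbatim as one string.
RULES = """-*.css -*.js
-ad.doubleclick.net/* -mime:application/foobar -*.gif -*.jpg -*.png -*.tif
-*.bmp -*.zip -*.tar -*.tgz -*.gz -*.rar -*.z -*.exe
-*.mov -*.mpg -*.mpeg -*.avi -*.asf -*.mp3 -*.mp2 -*.rm -*.wav -*.vob -*.qt
-*.vid -*.ac3 -*.wma -*.wmv
-*/*signout* -*/*logout*"""

START_URL = "http://www.continental.com/web/en-US/default.aspx"
OUTPUT_DIR = "./continental-mirror"  # hypothetical path, not from the post


def build_command(url, output_dir, rules):
    """Return an argv list of the form: httrack <url> -O <dir> <filter> ..."""
    # Each whitespace-separated token in the rule set is one filter pattern.
    filters = rules.split()
    return ["httrack", url, "-O", output_dir, *filters]


if __name__ == "__main__":
    argv = build_command(START_URL, OUTPUT_DIR, RULES)
    # Print a shell-quoted command line so the wildcards are not expanded
    # by the shell if the line is pasted into a terminal.
    print(shlex.join(argv))
```

Building the arguments as a list keeps each filter a separate argument, so the "*" wildcards reach httrack unchanged if the list is later passed to something like subprocess.run.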

I also disabled the "parse java files" option.

The problem is that httrack downloaded thousands of similar files, for
example, a lot of default*.html files.  I do not think the time stamp caused
this, but I do not know how it happened.  More interesting still, httrack
was able to fill in the "from" entry (the city to depart from) with a city
name (such as Cleveland), and did the same for the "to" entry.  How can
httrack be "smart" enough to know what to fill in?  What information does
it use to decide?  Am I getting thousands of similar files because httrack
is trying every possible combination of values for these entries?
Does anybody have hints, thoughts, or even guesses?
Thanks!
 





