HTTrack Website Copier
Free software offline browser - FORUM
Subject: Re: Downloading only certain file types
Author: mm
Date: 01/09/2020 14:44
 
> 1) Always post the ACTUAL command line used (or log file line two) so we know what the site is, what ALL your settings are, etc.
> 2) Always post the URLs you're not getting and from what URL it is referenced.
> 3) Always post anything USEFUL from the log file.
> 4) If you want everything, use the near flag (get non-html files related), not filters.
> 5) I always run with A) No External Pages, so I know where the mirror ends. With B) browser ID=msie 6 pulldown, as some sites don't like an HTTrack one. With C) Attempt to detect all links (for JS/CSS). With D) Timeout=60, retry=9, to avoid temporary network interruptions from deleting files.
> 
> > Hi. I want to only download certain file types stored within websites, such as *.doc, *.pdf, *.xls etc.
> > 
> > Is there a way to scan an entire website for these files without downloading anything else?
> 
> If you have the URLs of those files, no problem. Otherwise it can't be done. You MUST let it spider the site (get the html) to get them.
> You can filter everything else out:
> -* +*.html +*.doc +*.pdf +*.xls
> 
> > Also, if the above is possible, can the files be sent to a single folder rather than the folder the files were in at the website?
> 
> Change the local structure
> <http://www.httrack.com/html/step9_opt5.html>
> (html in web, other in web/other, or xxx in web/xxx).
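For what it's worth, here is a rough command-line equivalent of points 4 and 5 quoted above, as I read the httrack manual. The URL, the output folder and the exact MSIE 6 string are placeholders, and mapping the two GUI checkboxes to -x (no external pages) and -%P (attempt to detect all links) is my own assumption:

    httrack "https://www.example.com/" -O "C:\mirrors\example" -n -x -%P -T60 -R9 -F "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

Here -n is the "near" flag from point 4 (also fetch non-html files referenced by the pages), -T60 and -R9 are the 60-second timeout and 9 retries, and -F sets the browser identity.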
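And a sketch of the filter-only approach from the quoted answer, again with a placeholder URL and output folder. One caveat of my own: +*.html only lets the spider fetch pages whose URLs actually end in .html, so if the site uses .htm, .php or extension-less pages you would likely need extra + patterns for those as well:

    httrack "https://www.example.com/" -O "C:\mirrors\docs-only" "-*" "+*.html" "+*.doc" "+*.pdf" "+*.xls"

For the single-folder question, the structure presets on the step9_opt5 page linked above are the relevant setting; on the command line that is the -N option (for example -N1 puts HTML in web/ and everything else in web/images/, while -N4 drops other files into per-extension web/xxx/ folders; numbering from memory, so double-check the linked page).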

HTTrack3.49-2+htsswf+htsjava launched on Thu, 09 Jan 2020 19:41:53 at <https://www.facebook.com/profile.php?id=100010200485321&lst=100000613543182%3A100010200485321%3A1578177714&sk=friends&source_ref=pb_friends_tl> +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
(winhttrack -qYC2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" -%l "en, *" -Y <https://www.facebook.com/profile.php?id=100010200485321&lst=100000613543182%3A100010200485321%3A1578177714&sk=friends&source_ref=pb_friends_tl> -O1 "D:\New folder\FACEBOOK" +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information, such as username/password authentication for websites mirrored in this project
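For the record, the scan rules in the log line above (+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar) look like WinHTTrack's stock defaults, so as I understand them they add images/CSS/JS on top of the normal mirror rather than restricting it. To do the documents-only grab discussed in this thread, the rules would instead start with -* and whitelist the wanted types, roughly like this (a sketch only, reusing the output folder from the log and leaving the URL as a placeholder):

    httrack "<URL from the log above>" -O "D:\New folder\FACEBOOK" "-*" "+*.html" "+*.doc" "+*.pdf" "+*.xls"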
 