We have a file full of scrape rules that we use on every scrape. Instead of
typing in all of these parameters each time, we just call a file named
ScanRulesFull.txt. The file eventually had to be moved to a new folder, so
some doit.logs still have the old file path in them.
This is an example of an old doit.log:
-O "X:\\egRawScraped\\dublincore.org,X:\\egCache\\dublincore.org" -%S
"X:\\UpdateSW\\HTTrack\\ScanRulesFull.txt" +dublincore.org/* dublincore.org/
-iC2 -O "X:\\egRawScraped\\dublincore.org,X:\\egCache\\dublincore.org" -iC2 -O
"X:\\egRawScraped\\dublincore.org,X:\\egCache\\dublincore.org"
This is one with the new file path:
-O "X:\\egRawScraped\\openlearn.open.ac.uk,X:\\egCache\\openlearn.open.ac.uk"
-%S "X:\\UpdateSW\\HTTrackScanRules\\ScanRulesFull.txt"
+openlearn.open.ac.uk/* openlearn.open.ac.uk/
We're just looking for a way to make the old scrape jobs use the new file
path for ScanRulesFull.txt during an update, instead of reusing the old path
stored in the doit.logs. We (at the WiderNet Project) are working with 1,400
sites, so that's a lot of logs to change by hand, but we can do it if needed.
Is there any way to override what's in the doit.log during an update?
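If there's no built-in way to override it, we could script the manual fix
instead of editing 1,400 logs one at a time. Here's a minimal sketch of what
we had in mind, assuming the doit.log files all sit somewhere under one
parent folder and that the old path appears in them exactly as pasted above
(both assumptions we'd verify against a real log before running anything):

import os

# All of these are assumptions about our own layout; adjust before running.
ROOT = r"X:\egCache"  # top folder to search for doit.log files

# The old and new rule-file paths, exactly as they appear inside the logs.
# Our pasted examples show doubled backslashes; if the logs actually contain
# single backslashes, drop the doubling here.
OLD = r"X:\\UpdateSW\\HTTrack\\ScanRulesFull.txt"
NEW = r"X:\\UpdateSW\\HTTrackScanRules\\ScanRulesFull.txt"

for dirpath, dirnames, filenames in os.walk(ROOT):
    if "doit.log" not in filenames:
        continue
    log = os.path.join(dirpath, "doit.log")
    with open(log, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    if OLD in text:
        # Rewrite the log with the new rule-file path substituted in.
        with open(log, "w", encoding="utf-8") as f:
            f.write(text.replace(OLD, NEW))
        print("updated:", log)

We'd do a dry run first (comment out the write and keep just the print) to
sanity-check which logs match before letting it rewrite anything. But we'd
still prefer an option that makes the update ignore the stored path, if one
exists.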
Thanks!