Regarding the javascript parser improvements - HTTrack Website Copier Forum

Subject: Regarding the javascript parser improvements
Author: Xavier Roche
Date: 10/27/2008 21:56
Hi folks,

I'm just CCing my reply to M. Clifford Missen to the forum, regarding
javascript parsing, as the question raised is really interesting. In
short: each step towards the perfect mirror is harder than the last.

The httrack javascript parser currently has two logical "levels":
(1) it parses the basic underlying javascript structure (isolating comments,
strings, code body, ...)
(2) it detects "known" functions and basic expressions within this structure,
such as document.url="some string" or document.write("some html expression
with a URL embedded inside") and scans/replaces them

This is quite basic, but it handles a large share (something like 9 cases out
of 10) of crawled data.
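
To make level (1) concrete, here is a minimal sketch in Python of a scanner that isolates comments and strings from the rest of the code. It is only an illustration, not httrack's actual code: the function name split_js is made up for this example, and it deliberately ignores regex literals and template strings.

```python
def split_js(source):
    """Split JavaScript source into (kind, text) chunks.

    kind is 'comment', 'string' or 'code'. Hypothetical helper for
    illustration only; regex literals and template strings are not handled.
    """
    chunks = []
    i, n = 0, len(source)
    code_start = 0  # start of the pending run of plain code
    while i < n:
        ch = source[i]
        two = source[i:i + 2]
        if two == "//":
            # Line comment: runs up to (not including) the newline.
            end = source.find("\n", i)
            end = n if end == -1 else end
            kind = "comment"
        elif two == "/*":
            # Block comment: runs through the closing */ if there is one.
            end = source.find("*/", i + 2)
            end = n if end == -1 else end + 2
            kind = "comment"
        elif ch in "\"'":
            # String literal: skip escaped characters until the same quote.
            end = i + 1
            while end < n and source[end] != ch:
                end += 2 if source[end] == "\\" else 1
            end = min(end + 1, n)
            kind = "string"
        else:
            i += 1
            continue
        if i > code_start:
            chunks.append(("code", source[code_start:i]))
        chunks.append((kind, source[i:end]))
        i = code_start = end
    if n > code_start:
        chunks.append(("code", source[code_start:n]))
    return chunks


if __name__ == "__main__":
    sample = 'var a = "x.html"; // go to "y.html"\n/* c */ b = \'z\';'
    for kind, text in split_js(sample):
        print(kind, repr(text))
```

Note that the quoted "y.html" inside the line comment is reported as part of the comment, not as a string, which is the whole point of isolating the structure before looking for URLs.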

Now the big surprise is:

Level (1) was actually the only level that was coded at the beginning of
httrack -- the parser only detected strings, and "recognized" valid URLs
within strings (i.e. strings ending with ".html", for example). It took only a
few hours to code. But it missed, say, 30% of all links. Which means that
this simple code successfully handled 70% of all cases. Something so basic
that it could probably be handled by a basic regexp search/replace.
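
The "basic regexp search/replace" idea can be sketched like this in Python. This is an assumption-laden illustration, not the original code: the extension list, the name rewrite_string_urls and the to_local callback are all invented for the example.

```python
import re

# Quoted string with no spaces that ends in a "page-like" extension.
# The extension list is an assumption made for this sketch.
URL_LIKE = re.compile(
    r"""(["'])([^"'\s]+\.(?:html?|gif|jpe?g|png|css|js))\1""",
    re.IGNORECASE,
)


def rewrite_string_urls(source, to_local):
    """Replace every URL-looking string literal with to_local(url)."""
    def repl(match):
        quote, url = match.group(1), match.group(2)
        return quote + to_local(url) + quote
    return URL_LIKE.sub(repl, source)


if __name__ == "__main__":
    def to_local(url):
        # Hypothetical mapping: keep only the file name under "local/".
        return "local/" + url.split("/")[-1]

    print(rewrite_string_urls(
        'window.location = "http://example.com/news/index.html";', to_local))
    # A URL assembled at run time is invisible to this kind of search:
    print(rewrite_string_urls("img.src = 'pic' + n + '.gif';", to_local))
```

The second call leaves the code untouched, since no single string looks like a URL. That kind of construct is typical of the links a string-only approach misses.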

Level (2) was something else to code. It involved more complex analysis, with
basic expression detection, inlined html code parsing, and many ugly things
that required months of coding and debugging/testing. After that, something
like 50% of the failing cases were solved. This means that solving something
like 15% of all cases (half of the missing 30%) required MUCH more effort and
pain. And when I say "much more", I mean it :)
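
As a very reduced illustration of what level (2) looks for, the Python sketch below detects document.write() calls whose argument is a single string literal and pulls the href/src values out of the embedded html. The names and patterns are assumptions for this example only; the real work involves far more cases than two regular expressions can cover.

```python
import re

# document.write("...") or document.write('...') with one string literal.
WRITE_CALL = re.compile(
    r"""document\.write\(\s*(["'])((?:\\.|(?!\1).)*)\1\s*\)""",
    re.DOTALL,
)
# href=... or src=... inside the html fragment; quotes may be missing
# or escaped with a backslash.
ATTR_URL = re.compile(
    r"""\b(?:href|src)\s*=\s*\\?["']?([^"'\\\s>]+)""",
    re.IGNORECASE,
)


def urls_in_document_write(source):
    """Return the URLs found in html written by document.write()."""
    urls = []
    for call in WRITE_CALL.finditer(source):
        urls.extend(ATTR_URL.findall(call.group(2)))
    return urls


if __name__ == "__main__":
    js = 'document.write("<img src=\\"logo.gif\\"><a href=page2.html>next</a>");'
    print(urls_in_document_write(js))
    # An argument built from an expression is not recognized at all:
    print(urls_in_document_write('document.write(prefix + name + ".html");'))
```

The second call finds nothing, because the argument is an expression rather than a literal. Handling that properly means analysing expressions, which is where the months of work go.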

Now, "improving" the js parser to solve, say, 50% of the remaining cases
would mean a lot of code, and very complicated code, with function/expression
analysis, a complex inlining handler (handling external libraries included
with <script> tags) and things that are normally handled in a browser (i.e.
using thousands or tens of thousands of lines of code).

And then, handling 50% of the remaining-remaining unparsed code would require
even more effort - possibly a lot more than what was done in current
browsers, because you don't just have to execute javascript, but also rewrite
the inline code so that it will work locally, and scan all possible logic
paths (example: javascript code that cycles through 3 different images over
an hour); possibly using better-suited languages, such as functional ones.

Solving all cases is in my opinion totally impossible, even with advanced AI.

To summarize the problem, I've been thinking of improving the javascript
parser for some time, but the conclusion was always the same: the effort
would be huge, for a very limited impact. I know that improvements would be a
great thing to do -- and maybe some easy improvements can be developed anyway
to solve a few remaining cases.

But my opinion of the problem is that the effort needed to solve each
remaining unhandled % is "violently" exponential and out of sight.
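
The shrinking gains can be worked out with the rough figures used above (70% handled by the first pass, then about half of what remains at each further step). The Python sketch below only does that arithmetic. The figures are approximate guesses, the function name is invented, and the effort side is not modelled because no numbers are given for it.

```python
def coverage_steps(first_pass=0.70, fraction_fixed=0.50, steps=4):
    """Cumulative share of links handled after each step.

    first_pass: share handled by the simplest parser (rough figure).
    fraction_fixed: share of the remaining failures fixed by each new step.
    """
    covered = first_pass
    history = [covered]
    for _ in range(steps - 1):
        covered += (1.0 - covered) * fraction_fixed
        history.append(covered)
    return history


if __name__ == "__main__":
    previous = 0.0
    for step, covered in enumerate(coverage_steps(), start=1):
        gain = covered - previous
        print(f"step {step}: total {covered:.2%}, gained {gain:.2%}")
        previous = covered
```

Each step adds half as much as the previous one (15%, then 7.5%, then 3.75% of all cases), while the work needed for each step keeps growing.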

(apologies for the approximate english syntax)