<!DOCTYPE html> <html> <head> <title>crawler.js</title> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <link rel="stylesheet" media="all" href="docco.css" /> </head> <body> <div id="container"> <div id="background"></div> <table cellpadding="0" cellspacing="0"> <thead> <tr> <th class="docs"> <h1> crawler.js </h1> </th> <th class="code"> </th> </tr> </thead> <tbody> <tr id="section-1"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-1">¶</a> </div> <p><strong>Crawler.js</strong> is a web crawler written in JavaScript/PhantomJS.</p>
<p>Originally developed for <a href="http://pagerjs.com">pager.js</a>.</p>
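<p>In short, the crawler follows every <code>#!/</code> link that contains the start URL and matches no ignore pattern, and stores a cleaned-up HTML snapshot of each page. As a rough illustration of where a snapshot ends up, the path logic of the "Save content" step can be sketched in isolation. The helper name <code>snapshotPath</code> is hypothetical and not part of crawler.js, and <code>'/'</code> is assumed in place of <code>fs.separator</code>:</p>
<pre><code>// Hypothetical helper, not part of crawler.js: mirrors the path logic of
// the "Save content" step, with '/' standing in for fs.separator.
function snapshotPath(startUrl, pageUrl) {
  // Part of the page URL that follows the start URL.
  var rest = pageUrl.substring(startUrl.length);
  var parts = rest.split('#!/');
  // Text before '#!/' is appended directly to the folder prefix.
  var folder = '_escaped_fragment_' + parts[0];
  // Text after '#!/' (if any) becomes the sub-folder tree.
  var parameter = parts[1] ? parts[1] : '';
  var total = folder + (parameter ? '/' + parameter : '');
  return total + '/index.html';
}

// prints "_escaped_fragment_/some/cool/page/index.html"
console.log(snapshotPath('http://example.com/', 'http://example.com/#!/some/cool/page'));
</code></pre>
<p>Under the same assumptions, the start URL itself would map to <code>_escaped_fragment_/index.html</code>.</p>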
<p>Source code on <a href="http://github.com/finnsson/crawlerjs/">GitHub</a>. MIT License.</p> </td> <td class="code"> <div class="highlight"><pre><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span></pre></div> </td> </tr> <tr id="section-2"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-2">¶</a> </div> <p>Get all arguments passed to crawler</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">args</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s2">"system"</span><span class="p">).</span><span class="nx">args</span><span class="p">;</span></pre></div> </td> </tr> <tr id="section-3"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-3">¶</a> </div> <p>Set the start argument index to 1 (since 0 is the name of the file)</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">index</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span></pre></div> </td> </tr> <tr id="section-4"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-4">¶</a> </div> <p>Find out if any URL pattern should be ignored while crawling the site</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">ignorePatterns</span> <span class="o">=</span> <span class="p">[];</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">args</span><span class="p">[</span><span class="nx">index</span><span class="p">]</span> <span class="o">===</span> <span class="s1">'-i'</span> <span class="o">||</span> <span class="nx">args</span><span class="p">[</span><span class="nx">index</span><span class="p">]</span> <span class="o">===</span> <span class="s1">'--ignore'</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">ignorePatterns</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">args</span><span class="p">[</span><span class="nx">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]);</span>
<span class="nx">index</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span></pre></div> </td> </tr> <tr id="section-5"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-5">¶</a> </div> <p>Fetch the URL from the argument list</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">startUrl</span> <span class="o">=</span> <span class="nx">args</span><span class="p">[</span><span class="nx">index</span><span class="p">];</span></pre></div> </td> </tr> <tr id="section-6"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-6">¶</a> </div> <p>Add the start URL to the list of total URLs
and to the list of not yet visited pages.
<code>visitingPages</code> counts the pages that are currently being loaded.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">pages</span> <span class="o">=</span> <span class="p">[</span><span class="nx">startUrl</span><span class="p">];</span>
<span class="kd">var</span> <span class="nx">notVisitedPages</span> <span class="o">=</span> <span class="p">[</span><span class="nx">startUrl</span><span class="p">];</span>
<span class="kd">var</span> <span class="nx">visitingPages</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span></pre></div> </td> </tr> <tr id="section-7"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-7">¶</a> </div> <p><code>notVisitingPages</code> holds all URLs that are about to be visited but PhantomJS
hasn't loaded yet (since we don't want to load hundreds of pages at the same time).</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">notVisitingPages</span> <span class="o">=</span> <span class="p">[];</span></pre></div> </td> </tr> <tr id="section-8"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-8">¶</a> </div> <p>If no URL is provided, print the help information:</p>
<pre><code>Usage: phantomjs --load-images=no crawler.js [options] url
Options:
--ignore, -i url pattern to ignore
Example:
phantomjs --load-images=no
crawler.js -i /some/url/\d* http://example.com/
</code></pre> </td> <td class="code"> <div class="highlight"><pre> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">startUrl</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">"Usage: phantomjs --load-images=no crawler.js [options] url"</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">""</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">"Options:"</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">" --ignore, -i url pattern to ignore"</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">""</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">"Example:"</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">" phantomjs --load-images=no crawler.js -i /some/url/\\d* http://example.com/ "</span><span class="p">);</span>
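<span class="nx">phantom</span><span class="p">.</span><span class="nx">exit</span><span class="p">();</span>
<span class="k">return</span><span class="p">;</span>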
<span class="p">}</span></pre></div> </td> </tr> <tr id="section-9"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-9">¶</a> </div> <p>Include the file system module. We'll need it when storing files to disc later</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">fs</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s2">"fs"</span><span class="p">);</span></pre></div> </td> </tr> <tr id="section-10"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-10">¶</a> </div> <p>Include the webpage module. This module can create web pages!</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">webpage</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'webpage'</span><span class="p">);</span></pre></div> </td> </tr> <tr id="section-11"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-11">¶</a> </div> <p>Load the scripts jquery ($), q (Q) and underscore (_) so we can use them</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">phantom</span><span class="p">.</span><span class="nx">injectJs</span><span class="p">(</span><span class="s1">'jquery-1.8.2.min.js'</span><span class="p">);</span>
<span class="nx">phantom</span><span class="p">.</span><span class="nx">injectJs</span><span class="p">(</span><span class="s1">'q.min.js'</span><span class="p">);</span>
<span class="nx">phantom</span><span class="p">.</span><span class="nx">injectJs</span><span class="p">(</span><span class="s1">'underscore-min.js'</span><span class="p">);</span></pre></div> </td> </tr> <tr id="section-12"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-12">¶</a> </div> <p>Helper method for removing an item from an array
(every occurrence is removed, in place). Used like:</p>
<pre><code>removeA([1,2,4,5], 4) // returns [1,2,5]
</code></pre> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">function</span> <span class="nx">removeA</span><span class="p">(</span><span class="nx">arr</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">what</span><span class="p">,</span> <span class="nx">a</span> <span class="o">=</span> <span class="nx">arguments</span><span class="p">,</span> <span class="nx">L</span> <span class="o">=</span> <span class="nx">a</span><span class="p">.</span><span class="nx">length</span><span class="p">,</span> <span class="nx">ax</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="nx">L</span> <span class="o">></span> <span class="mi">1</span> <span class="o">&&</span> <span class="nx">arr</span><span class="p">.</span><span class="nx">length</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">what</span> <span class="o">=</span> <span class="nx">a</span><span class="p">[</span><span class="o">--</span><span class="nx">L</span><span class="p">];</span>
<span class="k">while</span> <span class="p">((</span><span class="nx">ax</span> <span class="o">=</span> <span class="nx">arr</span><span class="p">.</span><span class="nx">indexOf</span><span class="p">(</span><span class="nx">what</span><span class="p">))</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">arr</span><span class="p">.</span><span class="nx">splice</span><span class="p">(</span><span class="nx">ax</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">arr</span><span class="p">;</span>
<span class="p">}</span></pre></div> </td> </tr> <tr id="section-13"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-13">¶</a> </div> <h2>processPage</h2>
<p>This method is called with a loaded page instance and the URL of
that page instance.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">processPage</span> <span class="o">=</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">page</span><span class="p">,</span> <span class="nx">url</span><span class="p">)</span> <span class="p">{</span></pre></div> </td> </tr> <tr id="section-14"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-14">¶</a> </div> <h3>saveHref</h3>
<p><code>actions</code> is an object with a key <code>saveHref</code>. This method
is invoked whenever <code>window.callPhantom</code> is called inside a <code>page.evaluate</code>
callback. Using <code>callPhantom</code> it is possible for PhantomJS to pass
serializable data out from a web page into this script.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">actions</span> <span class="o">=</span> <span class="p">{</span></pre></div> </td> </tr> <tr id="section-15"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-15">¶</a> </div> <p>The <code>saveHref</code> method will analyze the href provided
and decide if the href is a link to a new URL that
should be crawled.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">saveHref</span><span class="o">:</span><span class="kd">function</span> <span class="p">(</span><span class="nx">href</span><span class="p">)</span> <span class="p">{</span></pre></div> </td> </tr> <tr id="section-16"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-16">¶</a> </div> <p>check if href contains #!/</p> </td> <td class="code"> <div class="highlight"><pre> <span class="k">if</span> <span class="p">(</span><span class="nx">href</span> <span class="o">&&</span> <span class="nx">href</span><span class="p">.</span><span class="nx">indexOf</span><span class="p">(</span><span class="s1">'#!/'</span><span class="p">)</span> <span class="o">!==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span></pre></div> </td> </tr> <tr id="section-17"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-17">¶</a> </div> <p>check if URI is inside <code>startUrl</code>.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="k">if</span> <span class="p">(</span><span class="nx">href</span><span class="p">.</span><span class="nx">indexOf</span><span class="p">(</span><span class="nx">startUrl</span><span class="p">)</span> <span class="o">!==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span></pre></div> </td> </tr> <tr id="section-18"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-18">¶</a> </div> <p>make sure URL isn't in pages-array already</p> </td> <td class="code"> <div class="highlight"><pre> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">_</span><span class="p">.</span><span class="nx">contains</span><span class="p">(</span><span class="nx">pages</span><span class="p">,</span> <span class="nx">href</span><span class="p">))</span> <span 
class="p">{</span></pre></div> </td> </tr> <tr id="section-19"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-19">¶</a> </div> <p>check if the URL should be ignored</p> </td> <td class="code"> <div class="highlight"><pre> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nx">_</span><span class="p">.</span><span class="nx">any</span><span class="p">(</span><span class="nx">ignorePatterns</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">pattern</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="nx">href</span><span class="p">.</span><span class="nx">match</span><span class="p">(</span><span class="nx">pattern</span><span class="p">)</span> <span class="o">!=</span> <span class="kc">null</span><span class="p">;</span>
<span class="p">}))</span> <span class="p">{</span></pre></div> </td> </tr> <tr id="section-20"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-20">¶</a> </div> <p>if not: add to pages-array and notVisitedPages</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">console</span><span class="p">.</span><span class="nx">error</span><span class="p">(</span><span class="s2">"adding URL: "</span> <span class="o">+</span> <span class="nx">href</span><span class="p">);</span>
<span class="nx">pages</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">href</span><span class="p">);</span>
<span class="nx">notVisitedPages</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">href</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">visitingPages</span> <span class="o"><</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">visitingPages</span><span class="o">++</span><span class="p">;</span>
<span class="nx">processUrl</span><span class="p">(</span><span class="nx">href</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nx">notVisitingPages</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">href</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">};</span></pre></div> </td> </tr> <tr id="section-21"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-21">¶</a> </div> <h3>page.onCallback</h3>
<p>React to <code>window.callPhantom</code>. If <code>obj.action === 'saveHref'</code> the method above
will be executed with the data in <code>obj.data</code>.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">page</span><span class="p">.</span><span class="nx">onCallback</span> <span class="o">=</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">obj</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">obj</span><span class="p">.</span><span class="nx">action</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">actions</span><span class="p">[</span><span class="nx">obj</span><span class="p">.</span><span class="nx">action</span><span class="p">](</span><span class="nx">obj</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span></pre></div> </td> </tr> <tr id="section-22"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-22">¶</a> </div> <h3>page.evaluate</h3>
<p>Evaluate the page and run the callback in the scope of the web page.
Observe that the callback cannot access variables in this script that are outside
the callback! Instead the method must use <code>window.callPhantom</code> in order to send data
out of the callback.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">htmlCode</span> <span class="o">=</span> <span class="nx">page</span><span class="p">.</span><span class="nx">evaluate</span><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">htmlTag</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="s1">'html'</span><span class="p">,</span> <span class="nb">document</span><span class="p">);</span></pre></div> </td> </tr> <tr id="section-23"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-23">¶</a> </div> <ul>
<li>Find links.
For each a-tag in the page, extract the href and send it to <code>saveHref</code>.</li>
</ul> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">links</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="s1">'a'</span><span class="p">,</span> <span class="nx">htmlTag</span><span class="p">);</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">links</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">linkIndex</span><span class="p">,</span> <span class="nx">link</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">href</span> <span class="o">=</span> <span class="nx">link</span><span class="p">.</span><span class="nx">href</span><span class="p">;</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">callPhantom</span><span class="p">({</span>
<span class="nx">action</span><span class="o">:</span><span class="s1">'saveHref'</span><span class="p">,</span>
<span class="nx">data</span><span class="o">:</span><span class="nx">href</span>
<span class="p">});</span>
<span class="p">});</span></pre></div> </td> </tr> <tr id="section-24"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-24">¶</a> </div> <ul>
<li>Clean up currently visible HTML by removing all elements that are hidden.</li>
</ul> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">hidden</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="s1">'body'</span><span class="p">,</span> <span class="nx">htmlTag</span><span class="p">).</span><span class="nx">find</span><span class="p">(</span><span class="s1">':hidden'</span><span class="p">);</span>
<span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">hidden</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">hiddenIndex</span><span class="p">,</span> <span class="nx">hiddenElement</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">$el</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="nx">hiddenElement</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">$el</span><span class="p">.</span><span class="nx">css</span><span class="p">(</span><span class="s1">'visibility'</span><span class="p">)</span> <span class="o">===</span> <span class="s1">'hidden'</span> <span class="o">||</span> <span class="nx">$el</span><span class="p">.</span><span class="nx">css</span><span class="p">(</span><span class="s1">'display'</span><span class="p">)</span> <span class="o">===</span> <span class="s1">'none'</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">$el</span><span class="p">.</span><span class="nx">remove</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">});</span></pre></div> </td> </tr> <tr id="section-25"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-25">¶</a> </div> <ul>
<li>Remove script- and link-tags from the page.</li>
</ul> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">$</span><span class="p">(</span><span class="s1">'script'</span><span class="p">,</span> <span class="nx">htmlTag</span><span class="p">).</span><span class="nx">remove</span><span class="p">();</span>
<span class="nx">$</span><span class="p">(</span><span class="s1">'link'</span><span class="p">,</span> <span class="nx">htmlTag</span><span class="p">).</span><span class="nx">remove</span><span class="p">();</span></pre></div> </td> </tr> <tr id="section-26"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-26">¶</a> </div> <p>Return the HTML content in the cleaned up HTML site as a string</p> </td> <td class="code"> <div class="highlight"><pre> <span class="k">return</span> <span class="nx">htmlTag</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">innerHTML</span>
<span class="p">});</span></pre></div> </td> </tr> <tr id="section-27"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-27">¶</a> </div> <h3>Save content</h3>
<p>Save polished HTML as {absolute-URL}_escaped_fragment_/hash/bang/value/index.html
E.g.</p>
<p>http://example.com/#!/some/cool/page</p>
<p>is stored as</p>
<p>_escaped_fragment_/some/cool/page/index.html</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">compactHtml</span> <span class="o">=</span> <span class="nx">_</span><span class="p">.</span><span class="nx">compact</span><span class="p">(</span><span class="nx">htmlCode</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="s1">'\n'</span><span class="p">)).</span><span class="nx">join</span><span class="p">(</span><span class="s1">'\n'</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">totalFileName</span> <span class="o">=</span> <span class="nx">page</span><span class="p">.</span><span class="nx">url</span><span class="p">.</span><span class="nx">substring</span><span class="p">(</span><span class="nx">startUrl</span><span class="p">.</span><span class="nx">length</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">decodedParameter</span> <span class="o">=</span> <span class="nx">totalFileName</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="s1">'#!/'</span><span class="p">)[</span><span class="mi">1</span><span class="p">];</span>
<span class="kd">var</span> <span class="nx">parameter</span> <span class="o">=</span> <span class="nx">decodedParameter</span> <span class="o">?</span> <span class="nx">decodedParameter</span> <span class="o">:</span> <span class="s1">''</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">folderName</span> <span class="o">=</span> <span class="s1">'_escaped_fragment_'</span> <span class="o">+</span> <span class="nx">totalFileName</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="s1">'#!/'</span><span class="p">)[</span><span class="mi">0</span><span class="p">];</span>
<span class="kd">var</span> <span class="nx">totalFolderName</span> <span class="o">=</span> <span class="nx">folderName</span> <span class="o">+</span> <span class="p">(</span><span class="nx">parameter</span> <span class="o">?</span> <span class="nx">fs</span><span class="p">.</span><span class="nx">separator</span> <span class="o">+</span> <span class="nx">parameter</span> <span class="o">:</span> <span class="s1">''</span><span class="p">);</span></pre></div> </td> </tr> <tr id="section-28"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-28">¶</a> </div> <p>create folder/tree with name <code>totalFolderName</code></p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">fs</span><span class="p">.</span><span class="nx">makeTree</span><span class="p">(</span><span class="nx">totalFolderName</span><span class="p">);</span></pre></div> </td> </tr> <tr id="section-29"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-29">¶</a> </div> <p>create file index.html in folder with the content <code>compactHtml</code></p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">fs</span><span class="p">.</span><span class="nx">write</span><span class="p">(</span><span class="nx">totalFolderName</span> <span class="o">+</span> <span class="nx">fs</span><span class="p">.</span><span class="nx">separator</span> <span class="o">+</span> <span class="s1">'index.html'</span><span class="p">,</span> <span class="nx">compactHtml</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">);</span>
<span class="nx">removeA</span><span class="p">(</span><span class="nx">notVisitedPages</span><span class="p">,</span> <span class="nx">url</span><span class="p">);</span>
<span class="nx">visitingPages</span><span class="o">--</span><span class="p">;</span></pre></div> </td> </tr> <tr id="section-30"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-30">¶</a> </div> <p>Release the web page</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">page</span><span class="p">.</span><span class="nx">release</span><span class="p">();</span></pre></div> </td> </tr> <tr id="section-31"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-31">¶</a> </div> <p>Are there any pages left to visit, and are fewer than 2 pages being visited at the moment?</p> </td> <td class="code"> <div class="highlight"><pre> <span class="k">if</span> <span class="p">(</span><span class="nx">notVisitingPages</span><span class="p">.</span><span class="nx">length</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">visitingPages</span> <span class="o"><</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">lastUrl</span> <span class="o">=</span> <span class="nx">notVisitingPages</span><span class="p">.</span><span class="nx">pop</span><span class="p">();</span>
<span class="nx">visitingPages</span><span class="o">++</span><span class="p">;</span></pre></div> </td> </tr> <tr id="section-32"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-32">¶</a> </div> <p>Then visit a not yet visited page.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">processUrl</span><span class="p">(</span><span class="nx">lastUrl</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">};</span></pre></div> </td> </tr> <tr id="section-33"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-33">¶</a> </div> <h2>processUrl</h2>
<p>Process the URL provided. Start by logging the URL to the console.
Then create a new PhantomJS headless WebKit browser page.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="kd">var</span> <span class="nx">processUrl</span> <span class="o">=</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">url</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">"processing url: "</span> <span class="o">+</span> <span class="nx">url</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">page</span> <span class="o">=</span> <span class="nx">webpage</span><span class="p">.</span><span class="nx">create</span><span class="p">();</span>
<span class="kd">var</span> <span class="nx">pageIsOpened</span> <span class="o">=</span> <span class="nx">$</span><span class="p">.</span><span class="nx">Deferred</span><span class="p">();</span></pre></div> </td> </tr> <tr id="section-34"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-34">¶</a> </div> <p>Load the web page.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">page</span><span class="p">.</span><span class="nx">open</span><span class="p">(</span><span class="nx">url</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">status</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">status</span> <span class="o">!==</span> <span class="s1">'success'</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">'Unable to access network'</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nx">pageIsOpened</span><span class="p">.</span><span class="nx">resolve</span><span class="p">(</span><span class="nx">page</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">});</span></pre></div> </td> </tr> <tr id="section-35"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-35">¶</a> </div> <p>Wait for 3 seconds once the page is loaded and then</p>
<ul>
<li>inject jquery into the page</li>
<li>start processing the page</li>
</ul> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">pageIsOpened</span><span class="p">.</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span> <span class="p">(</span><span class="nx">page</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">setTimeout</span><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
<span class="nx">page</span><span class="p">.</span><span class="nx">includeJs</span><span class="p">(</span><span class="s1">'http://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
<span class="nx">processPage</span><span class="p">(</span><span class="nx">page</span><span class="p">,</span> <span class="nx">url</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">},</span> <span class="mi">3000</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">};</span></pre></div> </td> </tr> <tr id="section-36"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-36">¶</a> </div> <h2>Start</h2>
<p>Start processing the start URL.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">processUrl</span><span class="p">(</span><span class="nx">startUrl</span><span class="p">);</span></pre></div> </td> </tr> <tr id="section-37"> <td class="docs"> <div class="pilwrap"> <a class="pilcrow" href="#section-37">¶</a> </div> <h2>Exit</h2>
<p>Check every second whether all pages have been visited. If so, exit PhantomJS.</p> </td> <td class="code"> <div class="highlight"><pre> <span class="nx">setInterval</span><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">notVisitedPages</span><span class="p">.</span><span class="nx">length</span> <span class="o">===</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">phantom</span><span class="p">.</span><span class="nx">exit</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">},</span> <span class="mi">1000</span><span class="p">);</span>
<span class="p">}());</span>
</pre></div> </td> </tr> </tbody> </table> </div> </body> </html>