Selective Scrape Pages

We add restrictions to Scrape Pages so that it runs faster and finds more relevant content. We've also added a JSON download for downstream visualization and an example Node application that reads it. github

Click scrape.html to run with defaults.

Click scrape.svg to see a download drawn with scrape.js.

pages/selective-scrape-pages

# Parameters

We accept parameters, each of which has a default. Click the asset scrape.html to see a scrape run with defaults. Edit the new tab's URL to override them.

days=10

Limit the graph to include only pages edited in the last 10 days. Default is 30 days. A fork alone doesn't count as an edit. example

site=code.fed.wiki

Start the scrape at the specified site. The default is found.ward.bay.wiki.org. The scrape discovers more when pages fork or otherwise reference new sites. example
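Parameter handling along these lines can be sketched as follows. The `scrapeParams` helper and query-string parsing are assumptions for illustration; only the parameter names and defaults come from this page.

```javascript
// Hypothetical sketch: parse scrape parameters from a query string,
// falling back to the documented defaults.
function scrapeParams (search) {
  const params = new URLSearchParams(search)
  return {
    days: Number(params.get('days') || 30),
    site: params.get('site') || 'found.ward.bay.wiki.org'
  }
}
```

For example, `scrapeParams('?days=10&site=code.fed.wiki')` overrides both defaults, while `scrapeParams('')` returns them unchanged.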

# Download

We construct page objects within site objects as we scrape. We have a lot of latitude for what and when we record. A site may be present with no pages selected for inclusion.

{
  "found.ward.bay.wiki.org": {
    "debt-metaphor-explained": {
      "date": 1571510763204,
      "title": "Debt Metaphor Explained",
      "synopsis": "While programming we ...",
      "links": [ "Quantifying Technical Debt" ],
      "sites": [ "found.ward.bay.wiki.org", "ward.bay.wiki.org" ]
    },
    "scrape-pages": {
      "date": 1571252854227,
      "title": "Scrape Pages",
      "synopsis": "A good way to understand ...",
      "links": [ "How Scrape Works", "Search Index Download" ],
      "sites": [ "found.ward.bay.wiki.org" ]
    },
    ...

Pages are recorded as objects with the fields shown. A slug is a lower-case, hyphenated version of the title. Links are a list of the links found on the page. Sites are a list of additional sites where those links may resolve, in the order they should be checked.
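A downstream reader can walk this nested site → slug structure directly. The traversal below is a sketch under that assumption; `listPages` is a hypothetical name, not part of the download.

```javascript
// Hypothetical sketch: walk a downloaded scrape object,
// listing each page under the site that holds it.
function listPages (scrape) {
  const rows = []
  for (const [site, pages] of Object.entries(scrape)) {
    for (const [slug, page] of Object.entries(pages)) {
      rows.push({ site, slug, title: page.title, links: page.links || [] })
    }
  }
  return rows
}
```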

function asSlug (title) {
  return title
    .replace(/\s/g, '-')
    .replace(/[^A-Za-z0-9-]/g, '')
    .toLowerCase()
}

We now properly punctuate the scrape results and download them as a JSON file once the scrape has completed. The file name is generated from the scrape parameters.

`scrape-${site}-${days}.json`
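In a browser, producing that download might look like the following sketch. The Blob-and-anchor technique is a standard browser pattern, and `saveScrape` is a hypothetical name; only the file-name template comes from this page.

```javascript
// Name the file from the scrape parameters, per the template above.
function downloadName (site, days) {
  return `scrape-${site}-${days}.json`
}

// Hypothetical sketch: offer the completed JSON for download
// in a browser using a Blob and a temporary anchor element.
function saveScrape (scrape, site, days) {
  const blob = new Blob([JSON.stringify(scrape, null, 2)],
    { type: 'application/json' })
  const anchor = document.createElement('a')
  anchor.href = URL.createObjectURL(blob)
  anchor.download = downloadName(site, days)
  anchor.click()
}
```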

# Example

As an example we've created a Node application, scrape.js, that can read a download file and render it as SVG using graphviz dot notation. We connect sites with two different kinds of lines. enlarge

A solid line means a link followed by a click.

A dashed line means a twin forked from another site.
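In dot notation those two line styles might be emitted as follows. The `edge` helper is a sketch, not scrape.js itself; only the solid-versus-dashed distinction comes from this page.

```javascript
// Hypothetical sketch: emit graphviz dot edges for the two
// relationships the drawing distinguishes. Forked twins are
// drawn dashed; followed links are drawn solid (the default).
function edge (from, to, kind) {
  const style = kind === 'fork' ? ' [style=dashed]' : ''
  return `  "${from}" -> "${to}"${style}`
}

const dot = [
  'digraph scrape {',
  edge('found.ward.bay.wiki.org', 'ward.bay.wiki.org', 'link'),
  edge('found.ward.bay.wiki.org', 'code.fed.wiki', 'fork'),
  '}'
].join('\n')
```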

# Modifications

An advantage of this scrape over other approaches is that the code runs locally and is easily modified to collect additional application-specific data from the pages it chooses to visit.

function do_page(site, slug, page) {
  var page_node = json_graph[site][slug]
  ...

The page object is delivered to do_page for each selected site and slug. Insert into page_node as required.
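For instance, a modification might record how many links each visited page carries. The `link_count` field and the tiny `json_graph` stub here are purely illustrative.

```javascript
// Hypothetical sketch: json_graph is the accumulating download;
// do_page inserts an application-specific field into the page node.
var json_graph = { 'found.ward.bay.wiki.org': { 'scrape-pages': {} } }

function do_page (site, slug, page) {
  var page_node = json_graph[site][slug]
  // illustrative addition: count the links on the visited page
  page_node.link_count = (page.links || []).length
}
```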

See Json Schema for page object details.