Sitemap Scrape Improvements

The Ruby Sitemap Scrape provides the first full text search of the visible federation. We've learned a lot by building this ourselves from grep-like utilities. Here we list todos that have surfaced and been completed.


Find all pages that share any items with the current page. Prototype search link now available. github
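A minimal sketch of how shared items could be found, assuming the scraped item ids for each page are already loaded into a hash. The hash shape and both method names are hypothetical, not taken from the scraper itself. The sketch inverts the page-to-ids mapping, then collects every other page listed under any of the current page's ids.

```ruby
require 'set'

# Build an inverted index from item id to the set of pages containing it.
# `items_by_page` maps a page name to its array of item ids
# (a hypothetical in-memory stand-in for the scraped items.txt files).
def index_items(items_by_page)
  index = Hash.new { |hash, id| hash[id] = Set.new }
  items_by_page.each do |page, ids|
    ids.each { |id| index[id] << page }
  end
  index
end

# Return the sorted names of the other pages that share at least one
# item id with `page`. Unknown pages simply share nothing.
def pages_sharing_items(items_by_page, page)
  index = index_items(items_by_page)
  shared = Set.new
  items_by_page.fetch(page, []).each { |id| shared.merge(index[id]) }
  shared.delete(page)
  shared.to_a.sort
end
```

Rebuilding the index on every call keeps the sketch short. A real search would probably build it once per scrape.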

Scrape item ids and save them in items.txt files. Devise some convenient way to initiate a search from any paragraph. See Link Symmetry
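The id scrape might look something like the sketch below. It assumes a page is a JSON object with a "story" array whose entries each carry an "id", and that items.txt holds one id per line. The method names and the directory argument are hypothetical.

```ruby
require 'json'

# Extract the id of every story item from a page's JSON text.
# Assumes a top-level "story" array; pages without one yield no ids.
def item_ids(page_json)
  page = JSON.parse(page_json)
  (page['story'] || []).map { |item| item['id'] }.compact
end

# Write the ids, one per line, to items.txt inside `dir`
# (the per-page directory layout is an assumption).
def save_item_ids(dir, page_json)
  lines = item_ids(page_json).map { |id| "#{id}\n" }.join
  File.write(File.join(dir, 'items.txt'), lines)
end
```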

Refactor to have separate what, help and query details pages. Keep them up to date.

match: and or

Add Newly Found Sites to the activity report even if they are not recently active. This would be reporting the consequence of some other activity that linked to the site.

Add a permalink to the search results so that searches can be saved and rerun with a single click. search
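One way to make a search rerunnable is to carry its whole state in the query string. This is a sketch only: the base url and the parameter names (find, within, match) are hypothetical and may not match what the search app uses.

```ruby
require 'uri'

# Build a permalink that encodes a search in its query string.
# All parameter names and defaults here are illustrative assumptions.
def search_permalink(base, find:, within: 'words', match: 'and')
  query = URI.encode_www_form(find: find, within: within, match: match)
  "#{base}?#{query}"
end
```

Because the link holds everything needed to repeat the search, saving the link saves the search.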

Find and remove old rosters after a week or so. Possibly merge them into whole days before that.

find activity/*00 -mtime +7 -exec rm {} \;

Momentarily defeat the scrape's incremental mechanism in order to retrieve the new indices, items.txt and plugins.txt from all pages. See Full Scrape

Scrape item types and save them in plugins.txt files.

Add the html plugin since we're now generating lots of html items.

Improve the grep sequence so it doesn't blow up with "too many arguments" from the shell. github
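The shell raises "too many arguments" when a glob expands to more file names than it can pass to one command. One way around this, not necessarily the fix in the linked commit, is to walk the files from Ruby so the shell never sees the list. The method name and the directory layout in the sketch are hypothetical.

```ruby
# Return the files under `root` matching `glob` that contain `word`.
# Each file is read line by line in Ruby rather than being passed
# to grep as one enormous argument list.
def files_containing(root, glob, word)
  Dir.glob(File.join(root, glob)).sort.select do |path|
    File.foreach(path).any? { |line| line.include?(word) }
  end
end
```

A call such as files_containing('sites', '*/words.txt', 'federation') would then play the role of grep -l over every site.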

We've somehow lost utf-8 decoding in the scraper. These error messages are new. 140 sites lost from view. This was the first successful run from cron. Solution online. post

can't do sitemap for, "\xE2" on US-ASCII

grep "can't do sitemap for" logs/Sat-1900 | \
  cut -d , -f 2 | sort | uniq -c

  11 "\xC2" on US-ASCII
  15 "\xC3" on US-ASCII
   3 "\xCB" on US-ASCII
   3 "\xCE" on US-ASCII
   1 "\xCF" on US-ASCII
   1 "\xE1" on US-ASCII
 106 "\xE2" on US-ASCII
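Errors of this shape are typical when Ruby treats multi-byte text as US-ASCII, which can happen under cron because no locale is set there. That diagnosis is an assumption on our part, and the fix described in the linked post may differ. A common remedy is to tag fetched bytes as UTF-8 explicitly, as in this sketch with a hypothetical helper name.

```ruby
# Interpret raw bytes as UTF-8 regardless of the process locale.
# Invalid sequences are replaced with '?' so later string
# operations cannot raise on them.
def as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub('?')
end
```

The helper copies its argument, so the caller's string keeps its original encoding.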

We've figured out how to set CORS headers on the port 3030 sinatra server that delivers the recent-activity.json after giving up on the default 'public' behavior. github
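The essential part is that every response carries an Access-Control-Allow-Origin header. The sketch below shows this with a bare Rack-style callable rather than Sinatra, so it runs without any gems. The method names and the wildcard origin policy are assumptions, not the settings in the linked commit.

```ruby
require 'json'

# Headers that let a page from any origin read the response.
# The wildcard origin is an assumed policy.
def cors_headers
  {
    'Access-Control-Allow-Origin' => '*',
    'Access-Control-Allow-Methods' => 'GET, OPTIONS'
  }
end

# Minimal Rack-style app: a callable that takes the request env and
# returns [status, headers, body] with the activity encoded as JSON.
def activity_app(activity)
  lambda do |_env|
    headers = cors_headers.merge('Content-Type' => 'application/json')
    [200, headers, [JSON.generate(activity)]]
  end
end
```

In Sinatra the same headers would be set inside the route or a filter rather than returned in a triple.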

We run scrapes four times a day and find sites with new activity. We'll list active sites here automatically.

Grep the words.txt with a simple web app. site

Recent activity now includes new sites in a more compact format. github

I've added a report listing all sites with ten or more pages. I attempt to group these logically based on their subdomain hierarchy.

Here we report all sites found, organized by domain name, excluding sites with fewer than ten pages.
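A rough sketch of such a report, assuming page counts are available as a hash keyed by host name. The method name and input shape are hypothetical. Taking the last two labels as the domain is a simplification that would misgroup hosts under suffixes like co.uk, and host names that include a port are not handled.

```ruby
# Group sites by the last two labels of their host name, keeping only
# sites with at least `minimum` pages. `page_counts` maps a host name
# to its page count (a hypothetical input shape).
def report_by_domain(page_counts, minimum = 10)
  large = page_counts.select { |_site, pages| pages >= minimum }
  groups = large.keys.group_by { |site| site.split('.').last(2).join('.') }
  groups.sort.map do |domain, sites|
    # Sorting on reversed labels places a parent host before its subdomains.
    [domain, sites.sort_by { |site| site.split('.').reverse }]
  end
end
```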

I've revised my 2011 cron job that feeds home sensor network data into the federation on a five-minute cycle. This polluted the scrape's activity report until I modified the perl script to date pages with the install date of each sensor, not the date of the reading. site

$date = (stat($r))[9]*1000;

I've revised my 2012 cron job that reports farm activity to date the activity in the journal with the date that it happened. A second commit suspends reporting until there is activity after the last report. github github