We now automatically add new sites to the list of sites we watch. This has been developed as a separate pass, mostly as a coding convenience. Here we consider the impact this has on reporting and content availability.
This separation is why content from newly discovered sites won't be available to search until six hours after the site itself is discovered and reported in Recent Activity.
# Previously
I have found public sites while scraping private wikis. I'll block the private sites, but scrapes of the public sites should remain so long as there is something there.
```
diff <(ls sites) sites.txt | \
  grep '<' | cut -d ' ' -f 2 | \
  while read i; do wc -l sites/$i/words.txt; done
```
I introduce sites to be scraped by adding them to the sites directory after some inspection of the diff, and I add blocks for sites that shouldn't be scraped.
```
diff <(ls sites) sites.txt | \
  grep '>' | cut -d ' ' -f 2 | \
  egrep -v 'local|/|^192.168|^127.0' | \
  while read i; do mkdir sites/$i; done
```
# Automation
We've merged the new site discovery into our routine scrape but have preserved the independence provided by a separate pass through the available data. github
We added discovery to our activity reporting but identify newly discovered sites on rows of their own because the sites present could have appeared for different reasons.
We could find a newly created site mentioned in an update of a currently scraped site.
We could find an old and inactive site mentioned in an update of a currently scraped site.
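One way to keep discovered sites on rows of their own is to check whether a site directory has any scraped content yet. This is a minimal sketch, assuming the sites directory layout and words.txt files shown above; it is not the actual reporting code.

```
# Hypothetical sketch: label each site for the activity report.
# A directory created by discovery has no words.txt until its
# content is scraped on a later pass.
for dir in sites/*/; do
  site=${dir#sites/}
  site=${site%/}
  if [ -f "sites/$site/words.txt" ]; then
    echo "updated    $site"    # previously known, content on hand
  else
    echo "discovered $site"    # new this pass, content deferred
  fi
done
```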
# Sequence
The order in which we perform these two passes will influence when and how activity shows up in the Roster reported as Recent Activity.
We're running our scrape four times a day. We've chosen to report newly discovered sites on the same pass they are discovered but in an order that will defer the actual scraping of content from those sites for another six hours.
Will the scraper find this site? I hope so.
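Here is a minimal sketch of how one pass could be ordered so that discovery runs after the content scrape. The names scrape_content and discover_sites are placeholders rather than the actual scripts, and the six-hour timing is assumed to come from cron running the pass four times a day as described above.

```
# Hypothetical pass, assumed to run every six hours from cron.

# 1. Refresh content for every site we already watch.
for site in $(ls sites); do
  scrape_content "$site"    # placeholder for the real scraper
done

# 2. Then create directories for any newly mentioned sites.
#    Their content won't be fetched until the next pass,
#    roughly six hours from now.
discover_sites | \
  egrep -v 'local|/|^192.168|^127.0' | \
  while read i; do mkdir -p sites/$i; done
```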
# Experience
We've seen bursts of "activity" when our scrape finally discovers a cluster of previously unknown sites. One case was a group of sites made for an upcoming class which were finally added to a Roster on a site we scraped. Another was seventeen empty "submissions" made to an experimental site 22 months ago; one edit by the experimenter caused us to notice these other sites.
It would be possible to construct a server that appeared to be many sites but was in reality an algorithm designed to abuse our scrape. By analogy, I once published a Sudoku game where every possible move was given a url, and thereby trapped multiple robots in a near-endless search of the game space.