We're exploring how a more data-centric federated wiki server might work. We'd like this to be driven by technological innovation and by improving applications we already have.
The tech revolves around ES6 modules, which is pure geek. The application I'm looking into is federation search. See How Scrape Works.
I'm halfway through rewriting this in JS async/await style, with a queue and a clock that starts another scrape operation every second. This is the geek part. I'm also thinking this should play with the new federation search in a way that grows gracefully to tens of thousands of sites.
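The queue-and-clock idea might be sketched like this (all names here are hypothetical, not the actual scrape code): a timer dequeues one pending operation per second and starts it, so request starts are paced even when individual requests run long.

```typescript
// A queue of scrape operations and a clock that launches one per second.
// The clock only paces start times; slow requests are free to overlap.
type Task = () => Promise<void>;

const queue: Task[] = [];

function startClock(intervalMs = 1000): ReturnType<typeof setInterval> {
  return setInterval(() => {
    const task = queue.shift();
    if (task) task().catch((err) => console.log("site trouble", err));
  }, intervalMs);
}
```

Each site or page to fetch becomes one `Task` pushed onto `queue`; the clock is stopped with `clearInterval` once the queue stays empty.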
We propose a control panel suitable for both initial and incremental scrape.
But rather than indexing each site as I go, I think I should leave that to the full-text search recently implemented. Better to use federation indexing to build an over-the-horizon model of what we now think of as Rosters.
Say you have a Roster of a dozen sites and it isn't surfacing what you are looking for. So you say: federation, what's out beyond that? Maybe we then go to the federation map and compose together another 100 sites. This is still a small portion of the federation, but it is 10x as likely to have what you are looking for. Not good enough? Try 100x.
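One way that expansion could work, assuming the scrape has already produced a who-cites-whom graph (a hypothetical sketch, not the actual implementation):

```typescript
// Grow a roster by one hop in the citation graph:
// every site cited by a site already in the roster joins it.
type Graph = Map<string, Set<string>>;

function expandRoster(roster: Set<string>, graph: Graph): Set<string> {
  const expanded = new Set(roster);
  for (const site of roster) {
    for (const cited of graph.get(site) ?? []) expanded.add(cited);
  }
  return expanded;
}
```

Calling this repeatedly widens the horizon one hop at a time; stop when the roster reaches the 10x or 100x size you want to search.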
So I am thinking the next generation scrape will just build a who-cites-whom directed graph of the federation as it exists right now. This is actually simpler than what we've done to date and will probably work much better too.
We launch a scrape with a list of root sites as arguments. This builds a flat file index of pages within sites listing sites mentioned on those pages.
deno run --allow-net --allow-read --allow-write \
  scrape.ts fed.wiki.org
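Each scraped page would contribute one small file: the set of sites that page cites. A sketch of the extraction, assuming the standard wiki page JSON where story items (references) and journal actions (forks) may each carry a `site` field:

```typescript
// Collect the sites a wiki page names by fork or reference.
interface Page {
  story?: { type?: string; site?: string }[];
  journal?: { type?: string; site?: string }[];
}

function citedSites(page: Page): Set<string> {
  const sites = new Set<string>();
  for (const item of page.story ?? []) if (item.site) sites.add(item.site);
  for (const action of page.journal ?? []) if (action.site) sites.add(action.site);
  return sites;
}
```

Writing each page's set as JSON under `data/<site>/<slug>.json` would give exactly the flat file layout the tallies below assume.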
This emits one network request every second until the scrape is complete. File modification dates are adjusted to reflect sitemap dates and are used to detect when changes require a new scrape of a page.
We should wait until the last write completes before exiting but otherwise allow network operations to overlap when they are slow.
We count unique errors reported while scraping.
pbpaste | grep trouble | \
  perl -pe 's/site trouble .*? //' | \
  perl -pe 's/\(.*?\)//' | \
  sort | uniq -c | sort -n >err-tally.txt
We count unique slugs over all sites.
ls data | \
  while read i; do ls data/$i | cat; done | \
  sort | uniq -c | sort -n
  10 chorus-of-voices.json
  10 incremental-paragraphs.json
  10 indie-web-camp.json
  10 json-schema.json
  10 recent-submissions.json
  14 local-editing.json
  15 smallest-federated-wiki.json
  17 how-to-wiki.json
  25 scratch.json
  40 ward-cunningham.json
  73 welcome-visitors.json
We count unique sites named by fork or reference.
ls data | \
  while read i; do cat data/$i/* | \
  jq -r '.'; done | \
  sort | uniq -c | sort -n
  20 mehaffy.fed.wiki.org
  20 sensors.c2.com
  20 stefanie.fed.wiki.org
  20 wiki.sfw.c2.com
  21 nrn.io
  23 yala.fed.wiki.org
  24 sites.fed.wiki.org
  25 design.fed.wiki.org
  26 fed.coevolving.com
  26 nmsi.fed.wiki.org
  28 plugins.fed.wiki.org
  30 maha.uk.fedwikihappening.net
  33 clive.tries.fed.wiki
  35 journal14.hapgood.net:3000
  35 video.fed.wiki.org
  35 wiki-paul90.rhcloud.com
  36 goals.pod.rodwell.me
  37 future.fedwiki.org
  37 wiki.dbbs.co
  39 code.fed.wiki.org
  40 sfw.c2.com
  47 localhost:3000
  49 journal.hapgood.net
  71 fed.wiki.org
  86 ward.bay.wiki.org
  88 found.ward.bay.wiki.org
 135 forage.ward.fed.wiki.org
 147 ward.asia.wiki.org
 284 ward.fed.wiki.org
See Over the Horizon Later for overnight stats.
Handy command for finding scrape data files modified within the last week.
ls -ldtr `find . -mtime -7 -print`