Search over the Horizon

We're exploring how a more data-centric federated wiki server might work. We'd like this to be driven by technological innovation and by improvements to applications we already have.

The tech revolves around ES6 modules, which is pure geek. The application I'm looking into is federation search. See How Scrape Works.

We've rewritten this in animated JavaScript before. Let's make this even easier with deno wiki.

I'm halfway through rewriting this in JS async/await style, with a queue and a clock that starts another scrape operation every second. This is the geek part. I'm also thinking that this should play well with the new federation search in a way that grows gracefully to tens of thousands of sites.
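A minimal sketch of that queue-and-clock loop. The `Task` type, `scrapePage` placeholder, and `runScrape` driver are all hypothetical names, not the actual code; the point is the pacing, where one operation starts per tick while slow ones overlap:

```typescript
// Hypothetical sketch: a queue of pages to scrape, paced by a clock.
type Task = { site: string; slug: string };

// Placeholder fetch. The real version would GET the page's json
// and return any newly discovered pages as more tasks.
async function scrapePage(task: Task): Promise<Task[]> {
  return [];
}

async function runScrape(queue: Task[], intervalMs = 1000): Promise<void> {
  const pending = new Set<Promise<unknown>>();
  while (queue.length > 0 || pending.size > 0) {
    if (queue.length === 0) {
      // queue is empty but scrapes are in flight; they may add more
      await Promise.allSettled([...pending]);
      continue;
    }
    const task = queue.shift()!;
    const op = scrapePage(task)
      .then((more) => queue.push(...more))
      .catch((err) => console.log("site trouble", task.site, err))
      .then(() => { pending.delete(op); });
    pending.add(op);
    // the clock: start the next operation one interval later,
    // whether or not this one has finished
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Slow sites delay nothing but themselves; the one-second default matches the polite pace described below.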

We propose a control panel suitable for both initial and incremental scrape.

But rather than indexing each site as I go, I think I should leave that to the full-text search recently implemented. Better to use federation indexing to build an over-the-horizon model of what we now think of as Rosters.

Say you have a Roster of a dozen sites and it isn't finding what you are looking for. So you say, federation, what's out beyond that? Maybe we then go to the federation map and compose another 100 sites. This is still a small portion of the federation, but it is 10x as likely to have what you are looking for. Not good enough? Try 100x.

So I am thinking the next-generation scrape will just build a who-cites-whom directed graph of the federation as it exists right now. This is actually simpler than what we've done to date and will probably work much better too.
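A sketch of what that who-cites-whom graph might look like, with hypothetical `addCitation` and `beyond` helpers. `beyond` is the roster expansion described above: a breadth-first walk that grows a dozen sites toward 10x or 100x:

```typescript
// Hypothetical who-cites-whom graph: site → sites it mentions.
type Graph = Map<string, Set<string>>;

function addCitation(graph: Graph, from: string, to: string): void {
  if (!graph.has(from)) graph.set(from, new Set());
  graph.get(from)!.add(to);
}

// Breadth-first expansion: the roster plus everything reachable
// within `hops` citations of it -- the view over the horizon.
function beyond(graph: Graph, roster: string[], hops: number): Set<string> {
  const seen = new Set(roster);
  let frontier = roster;
  for (let i = 0; i < hops; i++) {
    const next: string[] = [];
    for (const site of frontier) {
      for (const cited of graph.get(site) ?? []) {
        if (!seen.has(cited)) {
          seen.add(cited);
          next.push(cited);
        }
      }
    }
    frontier = next;
  }
  return seen;
}
```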

# Scrape

We launch a scrape with a list of root sites as arguments. This builds a flat file index of pages within sites listing sites mentioned on those pages.

deno --allow-net --allow-read --allow-write \
  scrape.ts
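The flat file layout this assumes, inferred from the tallies under Consider, is one JSON array of mentioned sites per page. A hypothetical `indexEntry` helper shows the shape; the real scraper writes these files, this just computes path and body:

```typescript
// Hypothetical helper showing the assumed flat file layout:
//   data/<site>/<slug>.json → ["mentioned-site", ...]
function indexEntry(site: string, slug: string, mentions: string[]) {
  return {
    path: `data/${site}/${slug}.json`,
    body: JSON.stringify([...new Set(mentions)].sort(), null, 2),
  };
}
```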

This emits one network request every second until the scrape is complete. File modification dates are adjusted to reflect sitemap dates and are used to detect when changes require a new scrape of a page.
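The freshness check can then be a comparison of two dates. `needsScrape` and `SitemapEntry` here are hypothetical names, a sketch of the rule rather than the actual code:

```typescript
// Hypothetical freshness check. We set each saved file's mtime to the
// sitemap date at write time, so a later sitemap date means the page
// was edited since we last scraped it.
type SitemapEntry = { slug: string; date: number }; // ms since epoch

function needsScrape(entry: SitemapEntry, fileMtime: number | null): boolean {
  if (fileMtime === null) return true; // never scraped this page
  return entry.date > fileMtime;       // edited since our copy
}
```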


# Consider

We should wait until the last write completes before exiting, but otherwise allow network operations to overlap when they are slow.

We count unique errors reported while scraping.

pbpaste | grep trouble | \
  perl -pe 's/site trouble .*? //' | \
  perl -pe 's/\(.*?\)//' | \
  sort | uniq -c | sort -n >err-tally.txt

We count unique slugs over all sites.

ls data | \
  while read i; do ls data/$i | cat; done | \
  sort | uniq -c | sort -n

  10 chorus-of-voices.json
  10 incremental-paragraphs.json
  10 indie-web-camp.json
  10 json-schema.json
  10 recent-submissions.json
  14 local-editing.json
  15 smallest-federated-wiki.json
  17 how-to-wiki.json
  25 scratch.json
  40 ward-cunningham.json
  73 welcome-visitors.json

We count unique sites named by fork or reference.

ls data | \
  while read i; do cat data/$i/* | \
    jq -r '.[]'; done | \
  sort | uniq -c | sort -n

  20
  20
  20
  20
  21
  23
  24
  25
  26
  26
  28
  30
  33
  35
  35
  35
  36
  37
  37
  39
  40
  47 localhost:3000
  49
  71
  86
  88
  135
  147
  284

See Over the Horizon Later for overnight stats.


Handy command for finding scrape data files modified within the last week.

ls -ldtr `find . -mtime -7 -print`