Our scraping experience suggests we should distribute the search function across all servers and move slowly enough that site owners can direct the spider's progress.
See Search Index coding details.
Search accumulates information by scraping and delivers it by query. We will consider each in turn.
A page, when viewed, scrapes its neighbors: we examine the sites mentioned in the page and fetch their sitemaps. This is the scrape step. It need not be directed by readers.
We suggest that every server should scrape the neighborhood of the pages it serves before they are viewed by readers.
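The scrape step above can be sketched in a few lines. This assumes the usual federated wiki page shape, where story items and journal actions may carry a `site` attribute naming a remote origin; the `neighborhood` function name and the exact property names are illustrative, not a fixed API.

```javascript
// Sketch: collect a page's neighborhood from its JSON.
// Story items and journal actions that carry a `site` attribute
// point at remote origins whose sitemaps a server would fetch
// before readers arrive. Network code is omitted here.
function neighborhood(page) {
  const sites = new Set()
  for (const item of page.story || []) {
    if (item.site) sites.add(item.site)
  }
  for (const action of page.journal || []) {
    if (action.site) sites.add(action.site)
  }
  return [...sites]
}
```

A scrape would then retrieve something like `http://<site>/system/sitemap.json` for each site found, storing the result for later query.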
A server could scrape further, to the neighborhood of its neighbors or beyond. For a small server, with few sites and few pages within those sites, a deeper scrape might or might not make sense.
We've seen the utility of searching scraped sites for many properties, from words present to plugins in use. But we find queries for items of common origin to be the most interesting, and also among the easiest to pose.
Show me pages that share this page's history. Show me pages with ids I have here. Show me where this came from and where it is going. Show me more.
A page could be adorned with a 'more' query in the space remaining after we add license, json and site of residence. A link to 'more' would ask the server hosting the page in question to show us more based on page ids.
Aside: We have a trial implementation of the 'more' query where it is called simply 'search'. github
The 'more' query would bring up a search result resembling a Roster of sites, organized by the page names they share in common and restricted to pages containing at least one story item whose id matches one in our source.
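That grouping might be sketched as follows. The index shape here, rows of `{site, slug, ids}`, is a hypothetical stand-in for whatever the scrape actually stores, and `more` is an illustrative name, not the trial implementation mentioned above.

```javascript
// Sketch of the 'more' query result. Keep only pages sharing at
// least one item id with the source page, then group the matching
// sites under each page slug, roster-style.
function more(sourceIds, index) {
  const want = new Set(sourceIds)
  const roster = {}
  for (const row of index) {
    if (row.ids.some(id => want.has(id))) {
      const sites = roster[row.slug] || (roster[row.slug] = [])
      sites.push(row.site)
    }
  }
  return roster
}
```

The result maps each shared page name to the list of sites holding a matching copy, which is the structure a Roster renders.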
As with familiar Rosters, the 'more' search result can be explored by calling into the browser's neighborhood the sitemaps of individual sites or by retrieving whole rows of sitemaps from the » link.
Should the result of a search disappoint, should it not reach far enough into the federation, then the result itself could include another 'more' link that would extend the query to the hosts found in the current result. By the application of 'more', 'more', 'more', the reader can progress breadth first to where they have not read before.
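One press of 'more' is one breadth-first step, which might be sketched like this. Here `queryHost` stands in for asking a remote host which further sites its own index knows about; the function and parameter names are illustrative.

```javascript
// Sketch: extend the search frontier by one breadth-first step.
// `visited` accumulates hosts already queried, so repeated presses
// of 'more' always advance into territory not yet read.
function moreStep(frontier, visited, queryHost) {
  const next = new Set()
  for (const host of frontier) {
    if (visited.has(host)) continue
    visited.add(host)
    for (const found of queryHost(host)) {
      if (!visited.has(found)) next.add(found)
    }
  }
  return [...next]
}
```

Each returned frontier becomes the input to the next press of 'more', so the reader ratchets outward without revisiting hosts.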
We suggested that the scrape of any given server may not need to go beyond the sites found on its own pages. We suggest that is true because the curious reader can extend this frontier at will by saving even one page from the beyond. The server's next scrape, which could be only minutes away, will then include pages from the site newly brought within the server's natural neighborhood.
When we exploit unbounded search to provide symmetrical links between pages we expose ourselves to "follower spam" of the kind now found on Facebook and Twitter. The unbounded link store provided by unrestricted search exhibits this vulnerability. Our practical desire to limit the depth of automatic search is exactly the protection we need, so long as there is a human-mediated mechanism for extending these limits.
Should we make farm servers responsible for exploring the server-visible neighborhoods described here, then we will find that we have placed additional trust in those writing within the farm. Farms must then protect themselves against the bad actor who would trick the server's search into extending into undesirable neighborhoods.
A site operator will need the ability to expel any site that fails to operate within the best interests of the others and to thereby expunge bad neighbors from its search. Site operators will then become the judges who in their small realm must distinguish the progressive from the subversive, the griefers from the good.
The intellectual health of the federation and the culture it supports then depends on careful admission of new authors into farms hosted and paid for with a purpose. Should these become a network of ingroups then we will face an arms race between progressives and griefers which I expect the progressives will win.
Search Thoughts describes two search cases that intersect in curious ways. Both could be improved by the locality suggested here.
NeDB is a dependency-free in-memory db with indices and an append-only backing store. It seems to be good for tens of thousands of documents. github
An Introduction to Information Retrieval. This book presents the scientific underpinnings of this field, at a level accessible to graduate students as well as advanced undergraduates. pdf