Search Index Downloads

We make the search index available as downloadable files in several formats, including JSON objects designed for network graphing.

Our sitemap scraper runs four times a day. Logs from each run can be viewed online. page

We report the page count and domain name of sites found to be online and reporting pages in their sitemaps. page

We distribute the index files individually and as a single 48-megabyte compressed tar file. tgz

tar czf public/sites.tgz sites *.txt
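A reader who has downloaded the archive could unpack it along these lines. This is only a sketch: the local file name and destination directory are whatever the caller passes in, and the function name `extract_index` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_index(archive: str, dest: str) -> list[str]:
    """Unpack a gzip-compressed tar archive into dest and return its member names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        # Use the "data" extraction filter where this Python version offers it;
        # it refuses members that would be written outside dest.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names
```

Calling `extract_index("sites.tgz", "index")` would then leave the `sites` directory and the rollup text files under `index/`, matching the two arguments given to the tar command above.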

# Index

The index is organized as a collection of text files containing unique words extracted from various fields of each page. These are grouped into directories by site and then page within site.

sites/ words.txt links.txt sites.txt items.txt pages/ how-to-wiki/ words.txt links.txt sites.txt items.txt
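To illustrate how this layout lends itself to search, here is a minimal sketch that finds the pages of one site whose `words.txt` contains every query word. It assumes each file holds one entry per line and that words are matched exactly as stored; neither detail is stated above. The function names and the `site_dir` parameter are hypothetical.

```python
from pathlib import Path


def read_entries(path: Path) -> set[str]:
    """Read one index file, assuming one entry per line (an assumption)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pages_matching(site_dir: Path, query: list[str]) -> list[str]:
    """Return page slugs under site_dir/pages whose words.txt has every query word."""
    wanted = set(query)
    hits = []
    for words_file in sorted(site_dir.glob("pages/*/words.txt")):
        # Subset test: every wanted word must appear in this page's word set.
        if wanted <= read_entries(words_file):
            hits.append(words_file.parent.name)
    return hits
```

Because each page has its own small file, a query only has to read the files of the sites it cares about, which may be why the per-page files rather than the rollups are used for search.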

We include federation-wide rollups of these files, which we don't use for search but maintain anyway.

words.txt links.txt sites.txt items.txt

We now include a federation-wide rollup containing the slugs of pages found in the sites searched. This might be useful for title completion. It is an experiment. txt
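Title completion over the slug rollup could be sketched as a prefix lookup in a sorted list. The example assumes the file holds one slug per line, which is not stated above; the function names and the `path` argument are hypothetical.

```python
import bisect


def load_slugs(path: str) -> list[str]:
    """Load slugs, assuming one per line (an assumption), sorted and de-duplicated."""
    with open(path, encoding="utf-8") as f:
        return sorted({line.strip() for line in f if line.strip()})


def complete(slugs: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Return up to `limit` slugs that start with `prefix`; `slugs` must be sorted."""
    # Every slug with this prefix sorts at or after the prefix itself,
    # so the matches form one contiguous run starting here.
    start = bisect.bisect_left(slugs, prefix)
    matches = []
    for slug in slugs[start:]:
        if not slug.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(slug)
    return matches
```

With the slugs that appear elsewhere on this page, `complete(slugs, "add")` would return `add-pages` and `add-paragraphs`.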


# Counts

We accumulate various counts in another text file with one line of JSON for each scrape. txt


Here we show a sample line after being formatted as indented text. The scan counts are read from logs while the index counts are line counts of the site rollup text files.

{
  "date": 1441545903,
  "scan": {
    "sites": 676,
    "pages": 31983
  },
  "index": {
    "counts": 3,
    "items": 258354,
    "links": 48549,
    "plugins": 48,
    "sites": 776,
    "words": 115927
  }
}
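One line of the counts file can be read with a standard JSON parser. The sketch below assumes that `date` is a Unix timestamp in seconds, which the magnitude of the sample value suggests but the text does not confirm. The function name `summarize_scrape` and the derived `pages_per_site` figure are illustrative additions.

```python
import json
from datetime import datetime, timezone


def summarize_scrape(line: str) -> dict:
    """Parse one JSON line of counts and derive a few readable values."""
    record = json.loads(line)
    # Assumption: "date" is seconds since the Unix epoch.
    when = datetime.fromtimestamp(record["date"], tz=timezone.utc)
    scan = record["scan"]
    return {
        "when": when.isoformat(),
        "pages_per_site": scan["pages"] / scan["sites"],
        "index": record["index"],
    }
```

Under that assumption, the sample line above corresponds to a scrape on 2015-09-06 at 13:25:03 UTC, with roughly 47 pages per scanned site.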

See Sitemap Scrape Statistics for counts plotted.

# Graphs

We aggregate information from the index into single files representing graphs of nodes and arcs, in two forms.

Nodes are site names and arcs point to remote sites. json

"": { "pages": 86, "links": [ "", "", "", "", "" ] }

Nodes are page slugs and arcs are internal links. json

"how-to-wiki": { "forks": 37, "links": [ "add-pages", "add-paragraphs", "copy-pages", "find-sites", "follow-links" ] }
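Each entry records outgoing arcs only, so following links backwards means inverting the graph first. The sketch below assumes each file is a single JSON object keyed by node name, with entries shaped like the samples above; the function name `backlinks` is hypothetical.

```python
from collections import defaultdict


def backlinks(graph: dict) -> dict[str, list[str]]:
    """Invert a node -> {"links": [...]} graph into target -> sorted source nodes."""
    reverse = defaultdict(list)
    for node, info in graph.items():
        for target in info.get("links", []):
            reverse[target].append(node)
    return {target: sorted(sources) for target, sources in reverse.items()}


# Usage with an entry shaped like the slug sample above.
graph = {
    "how-to-wiki": {
        "forks": 37,
        "links": ["add-pages", "add-paragraphs", "copy-pages", "find-sites", "follow-links"],
    }
}
print(backlinks(graph)["add-pages"])
```

The same function applies to the site graph, since both forms keep their arcs in a `links` list.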

We offer JavaScript versions of the aggregated graph data files that can be included in a web page with a script tag. site-web.js slug-web.js

# Applications

Title Network Browser lets one navigate from page title to page title, following links forward and backward.

Site Network Diagram shows all visible sites connected by arcs where there are neighborhood citations.

Recent Activity Report shows sites found to have new activity in the last week.

Neo4J with batch loading and an experimental interactive query plugin.

Item Distribution computed from site and page items.txt.

Possibly intersect with the list of bad words. twitter file