Neo4j is an open-source graph database implemented in Java and accessible from software written in other languages using the Cypher query language through a transactional HTTP endpoint. wikipedia site

I've sought the kind assistance of my work colleague, Erika Arnold, author of wikiGraph, a shortest-path visualizing application for Wikipedia based on Neo4j. page

Here I follow her approach.

# Load

Build node and relation csv files from Search Index Downloads with a ruby converter that assigns numeric ids to sites, pages and titles. github

Runtime 3.5 min, 8 MB output. We now repeat this build after every scrape. Look for new data at 1:00 and 7:00, am and pm, Pacific time. nodes.csv rels.csv
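The id assignment is the heart of the converter, so here is a minimal sketch of how it might work. This is not the actual converter from github. The input shape (site, slug, linked titles), the `IdTable` and `build_graph` names, and the `:ID,title,:LABEL` header are my assumptions, chosen so that each node carries a single title property.

```ruby
require 'csv'

# Hands out sequential integer ids, one per distinct (label, key) pair.
class IdTable
  attr_reader :rows

  def initialize
    @ids = {}
    @rows = [] # node rows, in the order ids were assigned
  end

  # Return the id for (label, key), assigning the next integer on first sight.
  def id_for(label, key, title = key)
    @ids.fetch([label, key]) do |pair|
      id = @ids.size + 1
      @ids[pair] = id
      @rows << [id, title, label]
      id
    end
  end
end

# pages: array of [site, slug, linked_titles] (hypothetical input shape).
# Returns [node_rows, relationship_rows].
def build_graph(pages)
  table = IdTable.new
  rels = []
  pages.each do |site, slug, linked_titles|
    site_id = table.id_for('Site', site)
    # The same slug on two sites is two pages, so key pages by site and slug.
    page_id = table.id_for('Page', "#{site}/#{slug}", slug)
    rels << [site_id, page_id, 'HAS']
    linked_titles.each do |title|
      rels << [page_id, table.id_for('Title', title), 'LINK']
    end
  end
  [table.rows, rels]
end

# Render rows as csv text under the given header.
def to_csv(header, rows)
  CSV.generate do |csv|
    csv << header
    rows.each { |row| csv << row }
  end
end

# Usage (assumed file names match the ones above):
#   nodes, rels = build_graph(pages)
#   File.write('nodes.csv', to_csv([':ID', 'title', ':LABEL'], nodes))
#   File.write('rels.csv', to_csv([':START_ID', ':END_ID', ':TYPE'], rels))
```

Titles are shared between sites while pages are not, which is what lets a query later hop from one site to another through a common title.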

Build a graph database from csv files with neo4j's import command. docs

neo4j-import \
  --into wiki.db \
  --nodes nodes.csv \
  --relationships rels.csv

IMPORT DONE in 10s 298ms. 92237 nodes 321452 relationships 92237 properties

Find where neo4j resources have been installed.

locate neo4j

Move the constructed db to the server's realm.

cd /var/lib/neo4j/data
sudo mv ~/neo-wiki/wiki.db .
chmod -R a+w wiki.db

Edit the config and restart the server.

cd /var/lib/neo4j/conf
sudo vi
sudo service neo4j-service restart

Open an ssh tunnel to the remote server.

ssh -L 7474:localhost:7474

Then view the graph using the builtin app. localhost

# Query

It's hard to know what to look for until you have a real need and some experience formulating queries. I read docs and tried things. Some impressed me enough to save the svg.

For fast queries find a good place to start and then traverse from there. I picked .org sites and looked for links to titles about Education. svg

match (s:Site)-[:HAS]->(p:Page)-[:LINK]->(t:Title)
where s.title =~ '.*org' and t.title =~ '.*Education.*'
return s,p,t
limit 100
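The same kind of query can be sent from a script through the transactional HTTP endpoint instead of the builtin app. Here is a minimal sketch, assuming the ssh tunnel above is open, that the server exposes the legacy `/db/data/transaction/commit` path (this depends on the Neo4j version), and that no authentication is required. The `cypher_request` and `run_cypher` names are mine.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed endpoint: the tunnel from above, and the legacy transactional path.
ENDPOINT = URI('http://localhost:7474/db/data/transaction/commit')

# Build (but do not send) a POST carrying one Cypher statement.
def cypher_request(statement, parameters = {}, endpoint = ENDPOINT)
  request = Net::HTTP::Post.new(endpoint)
  request['Content-Type'] = 'application/json'
  request['Accept'] = 'application/json'
  request.body = JSON.generate(
    statements: [{ statement: statement, parameters: parameters }]
  )
  request
end

# Send the statement and return the parsed JSON response.
def run_cypher(statement, parameters = {}, endpoint = ENDPOINT)
  response = Net::HTTP.start(endpoint.hostname, endpoint.port) do |http|
    http.request(cypher_request(statement, parameters, endpoint))
  end
  JSON.parse(response.body)
end

# Usage, with the server reachable:
#   run_cypher("match (s:Site) return count(s)")
```

Keeping the request builder separate from the sender makes it possible to inspect the JSON body without a running server.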

I try retrieving nodes by the shape of their relations alone. This is slow. I find sites that have/link the same title. svg

match (a)-->(b)-->(c)<--(d)<--(e) return * limit 300

Top page counts for happening sites.

match (s:Site)-->(p)
where s.title =~ '.*'
with s.title as site, count(p) as pages
where pages >= 100
return pages, site
order by pages desc

pages,site
3089,
294,
225,
220,
216,
178,
164,
147,
140,
134,
134,
133,
119,
106,

Shortest path between titles with sites that hold the pages along the way. svg

match (here:Title { title:"How Life Works" }),
  (want:Title { title:"Federated Wiki On Digital Ocean" }),
  paths = allShortestPaths((here)-[*]-(want))
with nodes(paths) as way
match (s:Site)-->(p:Page)
where p in way
return *
limit 40

But wait, this path goes through unrelated 'scratch' pages. It also disregards the relations' direction. We need to constrain the path to sites we know.
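To see why direction matters, here is a small sketch of a breadth-first shortest path over a toy graph. It is not how Neo4j computes paths internally. The node names and edges are made up: a 'scratch' page links to both ends, so ignoring direction offers a shortcut that nobody could actually click through.

```ruby
# Breadth-first search over [from, to] edge pairs.
# Returns one shortest path as an array of nodes, or nil if none exists.
def shortest_path(edges, from, to, directed: true)
  neighbors = Hash.new { |hash, key| hash[key] = [] }
  edges.each do |a, b|
    neighbors[a] << b
    neighbors[b] << a unless directed # walk edges backwards too
  end

  previous = { from => nil } # doubles as the visited set
  queue = [from]
  until queue.empty?
    node = queue.shift
    if node == to
      path = []
      while node
        path.unshift(node)
        node = previous[node]
      end
      return path
    end
    neighbors[node].each do |nxt|
      next if previous.key?(nxt)
      previous[nxt] = node
      queue << nxt
    end
  end
  nil
end

# Hypothetical toy graph: the scratch page points at both ends.
TOY_EDGES = [
  [:here, :a], [:a, :b], [:b, :want],
  [:scratch, :here], [:scratch, :want]
].freeze

p shortest_path(TOY_EDGES, :here, :want, directed: false)
p shortest_path(TOY_EDGES, :here, :want, directed: true)
```

The undirected search goes through `:scratch`, while the directed search takes the longer route that follows links the way a reader would.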

# Knows

Revise the batch import to include sites found on each page as KNOWS between a Page and neighborhood Sites. github

Add directional HAS|KNOWS pattern to the shortest path to constrain result to operationally discoverable sites. I add Titles to the path ends with IS relations. This adds a bit of ambiguity as to which page we're starting at. svg

match (here:Title { title:"Hacker Beach" })<-[h:IS]-(start),
  (end)-[w:IS]->(want:Title { title:"Naval Undersea Museum" }),
  paths = shortestPath((start)-[:HAS|:KNOWS*]->(end))
return here, h, want, w, paths
limit 40

I've tested the path by clicking through it. It works.

We can find the sites with the most neighbors by counting distinct KNOWS relations.

match (s:Site)-[h:HAS]->(p:Page)-[k:KNOWS]->(n:Site)
with s.title as site, count(DISTINCT n.title) as neighbors
return *
order by neighbors desc
limit 20

The numbers are much higher than we might expect. This is because we conflate forks, references and rosters while we scrape. For the graph database we should do better.

526 407 363 170 160 157 131 128 125 115 110 108 96 94 92 92 90 88 87 86

This agrees with the command-line word count of the site-wide rollup of sites.txt files.

ls | while read i
do wc -l $i/sites.txt
done | sort -n
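The distinct count in the query above can be mimicked outside the database. Here is a minimal sketch, assuming the KNOWS data is available as (site, page, neighbor) triples. That shape and the `neighbor_counts` name are my invention, not the scraper's actual output.

```ruby
require 'set'

# knows: array of [site, page, neighbor_site] triples (hypothetical shape).
# Returns [site, distinct_neighbor_count] pairs, largest count first.
def neighbor_counts(knows)
  seen = Hash.new { |hash, key| hash[key] = Set.new }
  knows.each { |site, _page, neighbor| seen[site] << neighbor }
  seen.map { |site, set| [site, set.size] }
      .sort_by { |site, count| [-count, site] }
end

# Usage with made-up triples: two pages on a.org know the same neighbor.
sample = [
  ['a.org', 'welcome', 'b.org'],
  ['a.org', 'about', 'b.org'],
  ['a.org', 'about', 'c.org'],
  ['b.org', 'welcome', 'a.org']
]
p neighbor_counts(sample)
```

A neighbor reached from several pages of the same site counts once, which is what `count(DISTINCT n.title)` does in the query.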

We can write a meaningful who-links-here that will resolve to the page at least as a twin. svg

match (s0:Site {title:''})-[:HAS]->
  (p0:Page {title: 'federation-search'})-[:IS]->
  (t0:Title)<-[a:LINK]-(p1:Page)<-[b:HAS]-(s1:Site)
where (p1)-[:KNOWS]->(s0)
return a,b