Stepping the Async Scrape

We've designed a scrape that runs with some parallelism, expressed through liberal use of async and await. Now we think we can single-step this long-running computation, but we have to think through what that even means.

digraph {
  start sites slugs
  node [shape=box style=filled fillcolor=lightblue]
  rankdir=LR
  dotime dosite doslug
  node [fillcolor=palegreen]
  dotime -> running -> reschedule
  dotime -> ready -> start
  dosite -> update -> modified -> slugs
  update -> new -> slugs
  doslug -> item -> sites
  doslug -> action -> sites
}

We will keep separate queues for the sites and the slugs still to be examined. We will preload the sites queue with a few broadly connected sites. Visiting sites will produce slugs; visiting slugs will produce sites.

Aside: a slug is a page title in lower case with spaces turned to hyphens and other punctuation removed.
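
The rule in the aside could be sketched like this. The function name as_slug and the exact character classes are assumptions made for illustration; in particular, this sketch treats anything other than ASCII letters, digits and hyphens as punctuation to be removed.

```python
import re

def as_slug(title):
    """Turn a page title into a slug (illustrative sketch, not the project's code)."""
    # Spaces (any whitespace) become hyphens.
    hyphenated = re.sub(r"\s", "-", title)
    # Assumption: everything except ASCII letters, digits and hyphens is dropped.
    cleaned = re.sub(r"[^A-Za-z0-9-]", "", hyphenated)
    # Lower case the result.
    return cleaned.lower()
```

Under these assumptions the title of this page would become stepping-the-async-scrape.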

We'll single-step through dosite, pausing after each sitemap fetch to report availability, activity and errors.

We'll single-step through doslug, pausing after each page fetch to report new and familiar sites and page format errors.
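
One way to think about what a step means is a gate that the worker waits on after each fetch. The following is a minimal sketch in Python's asyncio, not the scrape's actual code. Stepper, fake_fetch_sitemap, the report fields and the example site names are all hypothetical, and the fetch is a stand-in that never touches the network.

```python
import asyncio

class Stepper:
    """Hypothetical gate: a worker pauses here until someone calls step()."""
    def __init__(self):
        self._permits = asyncio.Semaphore(0)  # no steps granted yet
        self.reports = []                     # what each pause reported

    async def pause(self, report):
        self.reports.append(report)
        await self._permits.acquire()         # block until step() is called

    def step(self):
        self._permits.release()               # allow exactly one more step

async def fake_fetch_sitemap(site):
    # Stand-in for a real sitemap fetch; returns one made-up slug per site.
    await asyncio.sleep(0)
    return [f"{site}-page"]

async def dosite(sites, slugs, stepper):
    while sites:
        site = sites.pop(0)
        try:
            sitemap = await fake_fetch_sitemap(site)
            slugs.extend(sitemap)
            report = {"site": site, "available": True, "pages": len(sitemap)}
        except OSError as err:
            report = {"site": site, "available": False, "error": str(err)}
        await stepper.pause(report)           # pause after each sitemap fetch

async def demo():
    stepper = Stepper()                       # created inside the running loop
    sites, slugs = ["a.example", "b.example"], []
    task = asyncio.create_task(dosite(sites, slugs, stepper))
    await asyncio.sleep(0.01)                 # let dosite reach its first pause
    first = list(stepper.reports)
    stepper.step()
    await asyncio.sleep(0.01)                 # let it reach the second pause
    second = list(stepper.reports)
    stepper.step()
    await task                                # queue is empty, dosite returns
    return first, second, slugs
```

Running asyncio.run(demo()) walks dosite through two sites one step at a time. A doslug worker could pause on the same kind of gate after each page fetch.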

It makes sense to run either dosite or doslug against its own queue independently. Both must run to complete a scrape.

A scrape will launch with a few seed sites and complete when both queues are empty and no work is in flight. A cron job can be configured to launch scrapes on a regular schedule.
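
The completion rule can be sketched by counting every item that is either queued or being worked on, and finishing when that count returns to zero. This is only an illustration under assumptions: scrape, visit_site and visit_slug are hypothetical names, the two visit functions are supplied by the caller, and errors are simply collected rather than reported step by step.

```python
import asyncio

async def scrape(seeds, visit_site, visit_slug):
    """Run both queues until they are empty and nothing is in flight (sketch)."""
    sites, slugs = asyncio.Queue(), asyncio.Queue()
    seen_sites, seen_slugs = set(), set()
    errors = []
    pending = 0                     # items queued or currently being visited
    done = asyncio.Event()

    def enqueue(queue, seen, item):
        nonlocal pending
        if item not in seen:        # familiar items are not queued again
            seen.add(item)
            pending += 1
            queue.put_nowait(item)

    def finished():
        nonlocal pending
        pending -= 1
        if pending == 0:            # both queues empty, no work in flight
            done.set()

    async def worker(queue, visit, out_queue, out_seen):
        while True:
            item = await queue.get()
            try:
                for found in await visit(item):
                    enqueue(out_queue, out_seen, found)
            except Exception as err:    # keep the worker alive on a bad fetch
                errors.append((item, err))
            finally:
                finished()

    for site in seeds:
        enqueue(sites, seen_sites, site)
    if pending == 0:                # no seeds, nothing to do
        return seen_sites, seen_slugs, errors

    workers = [
        asyncio.create_task(worker(sites, visit_site, slugs, seen_slugs)),
        asyncio.create_task(worker(slugs, visit_slug, sites, seen_sites)),
    ]
    await done.wait()
    for w in workers:
        w.cancel()                  # workers are idle on empty queues by now
    await asyncio.gather(*workers, return_exceptions=True)
    return seen_sites, seen_slugs, errors
```

Because new items are counted before the item that produced them is marked finished, the count cannot reach zero while either queue still holds work.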