This work started with a simple scrape of Amazon's Look Inside preview. See Cities Alive Inside
We've encoded most of our logic in a state machine driven by the class names of <p> tags, read in document order. github
<script src="... resources/js/index.js"></script>
We switched on the p tag class name, adding cases for tags of interest and leaving the default case to capture the tag and text in a plain paragraph. We used section and chapter headings to begin new pages, which are exported on completion. github
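The switch described above might look roughly like the following. This is a minimal sketch, not the code behind the github link: the function name paragraphsToPages, the 'Untitled' fallback title and the plain {className, text} input objects are assumptions made for illustration, and the class names are borrowed from the code lists further down in these notes.

```javascript
// Sketch only: paragraphs are plain {className, text} objects so the loop
// runs anywhere. In a browser they might come from document.querySelectorAll('p').
function paragraphsToPages(paragraphs) {
  const pages = [];   // finished pages, ready to export
  let page = null;    // the page currently being filled

  // Close the page in progress (if any) and open a new one.
  function beginPage(title) {
    if (page) pages.push(page);
    page = { title, story: [] };
  }

  for (const { className, text } of paragraphs) {
    switch (className) {
      case 'BOOK_Titles_Section':
      case 'BOOK_Titles_Chapter':
        beginPage(text);
        break;
      case 'BOOK_BODY_body':
        if (!page) beginPage('Untitled'); // hypothetical title for leading text
        page.story.push({ type: 'paragraph', text });
        break;
      default:
        // Not yet understood: keep the class name visible next to the text.
        if (!page) beginPage('Untitled');
        page.story.push({ type: 'paragraph', text: `${className}: ${text}` });
    }
  }
  if (page) pages.push(page); // export whatever is still open
  return pages;
}
```

Each new case added to the switch removes one kind of annotated paragraph from the output, which makes progress easy to see while reading the result.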
The export.json files we write appear as download notices at the bottom of the browser window. These can be dragged as is to another wiki and inspected there. We found it more convenient to import each export file immediately into a new wiki prepared to receive it. We wrote a couple of scripts to do this for us. github
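As a rough illustration of what goes into such a file, the sketch below serializes a list of pages. It assumes the export is a single JSON object keyed by page slug, each entry holding a title, a story and a journal, and it assumes a simple slug rule. Both the layout and the helper names asSlug and buildExport are assumptions to check against the wiki being targeted. Triggering the browser download is left out.

```javascript
// Assumed slug rule: whitespace to hyphens, drop other punctuation, lowercase.
function asSlug(title) {
  return title.replace(/\s/g, '-').replace(/[^A-Za-z0-9-]/g, '').toLowerCase();
}

// Build the text of an export file from pages shaped like {title, story}.
// Assumed layout: one object keyed by slug, one entry per page.
function buildExport(pages) {
  const site = {};
  for (const page of pages) {
    site[asSlug(page.title)] = {
      title: page.title,
      story: page.story,
      journal: [], // no edit history is reconstructed in this sketch
    };
  }
  return JSON.stringify(site, null, 2);
}
```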
We shortened page titles by recognizing patterns based on quotes and colons. We recorded these in our own table of contents. With this and specific conversions for block quotes, images and references we could get a sense of what browsing as hypertext would be like. github
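The exact patterns are in the linked code. As a hedged sketch of the idea, the two rules below are guesses at what "quotes and colons" could mean: prefer a quoted phrase if there is one, otherwise keep the text before the first colon. The function names and both rules are assumptions.

```javascript
// Shorten one heading using two assumed patterns.
function shortenTitle(title) {
  // Assumed pattern 1: a phrase in straight or curly double quotes.
  const quoted = title.match(/["“]([^"”]+)["”]/);
  if (quoted) return quoted[1].trim();
  // Assumed pattern 2: "Main Title: Subtitle" keeps the main title.
  const colon = title.indexOf(':');
  if (colon > 0) return title.slice(0, colon).trim();
  return title.trim();
}

// Record short and full titles side by side, as a table of contents might.
function recordTitles(titles) {
  return titles.map((full) => ({ short: shortenTitle(full), full }));
}
```

Keeping the full title beside the short one means the table of contents can still show the original wording while links use the shorter page name.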
We separated class names at the first hyphen to find a selector independent of formatting variations specific to the book versions. With this simplification and handlers for the remaining cases we had a rough but complete translation. github
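Splitting at the first hyphen is a one-line operation. A minimal sketch follows, where the function name selectorFor and the example suffix are hypothetical:

```javascript
// Keep only the part of a class name before its first hyphen.
// A name without a hyphen is returned unchanged.
function selectorFor(className) {
  const hyphen = className.indexOf('-');
  return hyphen === -1 ? className : className.slice(0, hyphen);
}

// Example with an invented suffix:
// selectorFor('BOOK_BODY_body-variant') gives 'BOOK_BODY_body'
```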
We reached back into previously processed items when a subsequent p tag added an attribution to a block quote or a caption to an image. github
python -m SimpleHTTPServer 9000
The structure of the work fits within the structure of the document, which itself has been encoded into a sequence of document-specific formatting codes. Having now rendered the body of the document, we know the code.
Some codes indicated transitions between sections, chapters, and photo collections.
case 'BOOK_Titles_Section': bold in table of contents
case 'BOOK_Titles_Chapter': begin new wiki page, link in table of contents
case 'BOOK_Titles_Photo': bold in current wiki page
Some codes indicated the role of text.
case 'BOOK_BODY_Photo': emphasized text
case 'BOOK_BODY_': emphasized * * * separator
case 'BOOK_BODY_Lists_list':
case 'BOOK_BODY_footnote': de-emphasized text
case 'BOOK_BODY_body': normal text
One code served for both a quotation and its attribution.
case 'BOOK_BODY_Quotes_quote': block quote text, merge when adjacent
Two codes identified photographs and their captions.
case 'BOOK_BODY_Images_Figure': image with default caption
case 'BOOK_BODY_Images_Caption': revise caption of preceding image
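Both the quote and caption cases depend on reaching back to the item emitted just before. A minimal sketch of that step, assuming story items are plain objects, is shown here. The item shapes, the default caption text and the name appendItem are assumptions, not the project's code.

```javascript
// Add one coded paragraph to a story, revising the previous item when
// the new paragraph belongs with it.
function appendItem(story, code, text) {
  const last = story[story.length - 1]; // undefined for an empty story
  switch (code) {
    case 'BOOK_BODY_Images_Figure':
      story.push({ type: 'image', caption: 'Uncaptioned image' }); // assumed default
      break;
    case 'BOOK_BODY_Images_Caption':
      // Reach back: the caption replaces the preceding image's default.
      if (last && last.type === 'image') last.caption = text;
      else story.push({ type: 'paragraph', text });
      break;
    case 'BOOK_BODY_Quotes_quote':
      // Reach back: adjacent quote paragraphs (such as an attribution) merge.
      if (last && last.type === 'quote') last.text += '\n' + text;
      else story.push({ type: 'quote', text });
      break;
    default:
      story.push({ type: 'paragraph', text });
  }
  return story;
}
```

Because only the immediately preceding item is inspected, a caption or attribution separated from its image or quote by any other paragraph is not merged.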
A default case would visibly annotate text with the not-yet-understood code that accompanied it. We would know the structure when there were no remaining unknown codes.
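One way to tell when no unknown codes remain is to tally the codes that would fall through to the default case. The sketch below does that with the codes listed above. The name remainingUnknown is hypothetical, and the real default case annotated the page rather than counting.

```javascript
// Codes with a dedicated case, taken from the lists in these notes.
const KNOWN_CODES = new Set([
  'BOOK_Titles_Section', 'BOOK_Titles_Chapter', 'BOOK_Titles_Photo',
  'BOOK_BODY_Photo', 'BOOK_BODY_', 'BOOK_BODY_Lists_list',
  'BOOK_BODY_footnote', 'BOOK_BODY_body', 'BOOK_BODY_Quotes_quote',
  'BOOK_BODY_Images_Figure', 'BOOK_BODY_Images_Caption',
]);

// Count how often each not-yet-understood code occurs.
// An empty map means every code in the document has a handler.
function remainingUnknown(codes) {
  const counts = new Map();
  for (const code of codes) {
    if (!KNOWN_CODES.has(code)) counts.set(code, (counts.get(code) || 0) + 1);
  }
  return counts;
}
```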
The missing image from section I chapter 1.
Sometimes the codes mislead. For example, the photograph introducing section I chapter 1 appears in our mindless rendering cited above but not in our carefully interpreted version. This is because both figure and caption were coded as caption, and in a caption we look only for words, not images.
The work was complete enough to review with the author/publisher to consider how this could become a useful part of this and all future publications.
We will meet to consider our next steps.
1. fine tune heuristics used to adjust text
2. improve wiki’s handling of photo albums
3. improve wiki’s handling of citations and attributions
4. editorial decisions regarding hypertext modularity
There is work in progress elsewhere addressing 2 and 3. Number 4 is a subject that should be considered in the context of other business and social objectives.