Cleaning the Search Index

Search has gotten slow because we have inadvertently indexed base64 text as words, filling our words.txt indices with junk.

# Examples

Items of type html can contain anything inside their tags, especially inline image data.

{ "id": "79704000", "type": "html", "text": "<img width=\"100%\" alt=\"NREL 2022-03-26T22:44-0700 winds 10-25mph<br>\" src=\"data:image/png;base64,iVBORw0KG...AiRAYII=\"><p>NREL 2022-03-26T22:44-0700 winds 10-25mph<br></p>" },

Items of type markdown can contain bare urls, or bracketed [url word] links of any kind.

{ "type": "markdown", "id": "7bbc463a08dc611b", "text": "Just heard his talk at Chicago CTO Summit. https://www.bizjournals.com/baltimore/news/2021/06/25/best-in-tech-awards-2021-jonathan-moore-rowdyorbit.html via matrix " },

# Scripts

We look for the largest per-site words.txt files, then for the largest per-page words.txt files within those largest sites.

ls sites | while read s; do echo `cat sites/$s/words.txt | wc -c` $s; done | sort -n >yy

(cd sites/dreyeck.ch/pages; ls | while read p; do echo `cat $p/words.txt | wc -c` $p; done | sort -n)

By this measure most sites are under 100K; five sites exceed 1M, some substantially more.

2249677 old.viki.wiki
29143944 nixos.ralfbarkow.ch
65888443 http.wiki.ralfbarkow.ch
93449511 wiki.ralfbarkow.ch
94484063 dreyeck.ch

We will try running the scrape on one page from one site whose source content is my own work. It has three pages with words.txt over 10K.

19452 boulder-wind-captured
105396 ncar-fire-2022-03-21
692156 tiny-habits

cat sites/wiki.dbbs.co/pages/tiny-habits/words.txt

We've found that Ruby's gsub fails for a very long match: we don't get an exception, but the intended replacement doesn't happen.
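A small harness can check that behavior on a given Ruby version. This is a sketch with a synthetic payload, not the actual failing page; it only reports whether the replacement happened.

payload = 'iVBORw0KG' + 'A' * 500_000                       # synthetic stand-in for long base64 image data
text = "<img src=\"data:image/png;base64,#{payload}\"><p>caption</p>"
before = text.length
text.gsub! /<(.|\n)*?>/, ''                                  # same tag-stripping substitution as in the resolution below
puts "length #{before} -> #{text.length}"                    # only 'caption' should remain if the long match succeeded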

find sites -name words.txt -size +1M

# Resolution

We found that our <tag> regex needed to explicitly allow newlines within the tag. For tiny-habits this was a 692156 ⇒ 556 byte improvement.

text.gsub! /<(.|\n)*?>/, '' if item['type']=='html'          # strip tags; (.|\n) allows newlines inside a tag
text.gsub! /\[((http|https|ftp):.*?) (.*?)\]/, '\3'          # keep the link word, drop the url
text.scan /[A-Za-z]+/ do |word|                              # index the remaining words
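A quick sanity check is to run those substitutions over the html example above. A minimal sketch, assuming the item hash matches the JSON shown earlier (attribute text and base64 payload abbreviated):

item = {
  'type' => 'html',
  'text' => '<img width="100%" alt="NREL winds" src="data:image/png;base64,iVBORw0KG...AiRAYII="><p>NREL winds 10-25mph<br></p>'
}
text = item['text'].dup
text.gsub! /<(.|\n)*?>/, '' if item['type'] == 'html'        # strips the img tag, data uri and all
text.gsub! /\[((http|https|ftp):.*?) (.*?)\]/, '\3'
words = []
text.scan(/[A-Za-z]+/) { |w| words << w }
puts words.join(' ')                                         # => NREL winds mph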

We could possibly improve the scan to see base64 as a single word and then reject it as being too long. Or just remove base64 when longer than the longest English word, pneumonoultramicroscopicsilicovolcanoconiosis.

text.gsub! /[A-Za-z0-9+\/]{46,}/, ''                         # 46+ chars of base64 alphabet is longer than any English word
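The scan-side alternative mentioned above would widen the word pattern so a base64 run is seen as one word, then skip anything longer than the longest English word. A sketch, with a made-up sample and the indexing step elided:

LIMIT = 'pneumonoultramicroscopicsilicovolcanoconiosis'.length   # 45 letters
text = 'wind data iVBORw0KGgoAAAANSUhEUgAAByIAAAQsCAYAAAD0u81xAiRAYII niwot ridge'
words = []
text.scan /[A-Za-z0-9+\/]+/ do |word|
  next if word.length > LIMIT                                # base64-sized run, skip it
  words << word
end
puts words.join(' ')                                         # => wind data niwot ridge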

We can force a full site scrape by erasing the date-of-last-scrape flag. We try this for dreyeck.ch.

mark = "sites/#{site}/scraped" if File.exist? mark since = File.mtime(mark).to_i*1000 else since = 0 end

# Scrape

We had an unusual scrape after making the above change and some housecleaning of empty sites. We'll keep some records in case we want to analyze them later.

pages/cleaning-the-search-index

We tried removing sites entries for sites with no pages. They seem to have come back, or reappeared in recent changes somehow. Beware using this list: it seems to include one legitimate site.

7800ec3b.ngrok.io 810699.pmint.fed.wiki.org 944b5652.ngrok.io a6ccfbc0.ngrok.io aaron.asia.wiki.org ach.asia.wiki.org allen.fed.wiki allmende.io almereyda.de alyson.sf2.fedwikihappening.net anders.outlandish.academy andysylvester.com annalisamanca.uk2.fedwikihappening.net aribadernatal.com attricia.asia.wiki.org bill.seitz.fed.wiki.org blendedworkshop.fed.wiki boggie.fed.wiki bookmark-outpost-proof.glitch.me c2.com cascade-sys.net caulfield.pw chn.io chohag.asia.wiki.org chris.asia.wiki.org chrisellis.me christiqn.fed.wiki.org cl.blendedworkshop.fed.wiki clay.ny2.fedwikihappening.net cocultures.federated.wik coevolving.com controverse.asia.wiki.org controvese.asia.wiki.org cool-bat-29.deno.dev creativefinance.io dao.lexon.wiki data.fed.wiki.org david.agency.pods.wiki.org david.lexon.wiki david.reimage.fed.wiki.fed.wiki david.reimage.fed.wikiig.fed.wiki dbbs.co dilgreenarchitect.co.uk djon.es dlm.hapgood.net don.noyes.asia.fed.wiki dots.asia.wiki.org drone.dork.wiki.org ebike.fed.wiki ec83aa54.ngrok.io ego.ebike.fed.wiki eterprobertson.tries.fed.wiki etudianutt.asia.wiki.org eugene.agency.pods.wiki.org fandsuamprep.outlandish.academy fautsuamprep.outlandish.academy feast.fm fed.wikabout.fed.wiki federation.asia.wiki.org fedwiki.maxlath.eu fedwikihappening.net foo.dojo.fed.wiki foo.sofi.dojo.fed.wiki foo.tries.fed.wiki forage.ward.fed.wiki frances.uk2.fedwikihappening.net frankmcpherson.net freedombone.net fullmoon.academy futurelaw.org garden.ward.asia.wiki.org glitch.me graziano.ic.wiki.openlearning.cc gyuri.ic.wiki.openlearning.cc hapgood.net hashbase.io haythemben.asia.wiki.org hello.asia.wiki.org hello.ward.asia.wiki.org hi.asia.wiki.org home.asia.wiki.org if.fed.wiki.org innovateoregon.org iscon.fed.wiki.org ittybittyfedwiki.com jack.agency.pods.wiki.org jack.ries.fed.wiki jasongreen.net je.fedwiki.ssatuk.co.uk jeffist.com jhp.mike.asia.wiki.org jillstudent.gpacw.asia.wiki.org journal.asia.wiki.org journal19a:3000 jph.asia.wiki.org jph.mike.asia.wiki.org jupidu.uk2.fedwikihappening.net karen.mitte.tries.fed.wiki kelley.asia.wiki.org kerry.fed.wiki kerry.sofi.tries.fed.wiki knuth.fed.wiki.org lao.lexon.wiki lexon.wiki libre.sh life.asia.wiki.org life.org.asia.wiki.org life.war.asia.wiki.org lifeward.asia.wiki.orglinda.fed.wiki.org livecode.viral.academ lph.asia.wiki.org martinlindner.uk2.fedwikihappening.net matslats.net matts.wiki mcmorgan.org misinfo:3000 mk.asia.wiki.org mks.asia.wiki.org models.asia.wiki monlien.asia.wiki.org ndevil.asia.wiki.org nik.tries.fed.wiki nlafferty.uk2.fedwikihappening.net nrm-io-paul90.hashbase.io nwiki.synesthesia.co.uk olaf.fed.wiki olafbrugman.dojo.fed.wiki ort.fed.wiki.org pamela.podclub.cc patternwork.federated.wikidance.proto.institute paul.agency.pods.wiki.org pax.academy permaculture.daviding.openlearning.cc pete.agency.pods.wiki.org peterdaguru.dojo.fed.wiki philip.cryptoacademy.org plugin.dork.wiki.org plugin.fed.wiki plugins.dork.wiki.org read.fed.wiki recycler rest.livecode.world rest.liveworld.org revgniter.livecode.world search.asia.wiki.org silke.fed.wiki simnet.ward.asia.wiki smallest.fed.wiki stephanie.fed.wiki.org style.fed.wiki.fed.wiki survival.mk.asia.wiki.org swarm.cryptoacademy.org tamara.asia.wiki.org tech-cico.fr test.about.fed.wiki test.paul.asia.wiki.org test2.ward.dojo.fed.wiki thing.richardbatty.fed.wiki thompson.dayton.fed.wiki tips.noyes.asia.wiki.org tol.noyes.asia.wiki.org tomaskafka.uk2.fedwikihappening.net topics.don.noyes.asia.wiki.org trails.asia.wiki.org transformap.co tyler.goaljam.org user1.helmul.tries.fed.wiki 
utt.sfw.trials.asia.wiki.org uttyassine.asia.wiki.org video.fed.wiki ward.goals.pods.wiki.org wardcunningham.dork.wiki.org wiki.borgesianrhapsody.com wiki.geosemantik.de wiki.org william.ward.asia.wiki.org wsuv.wiki wvengen.fed.wiki.org ww1.newspeak.cc yala.fed.wiki yassachqir.asia.wiki.org yassinachqir.asia.wiki.org yourname.write.asia.wiki.org

# Retire

A better strategy is to look for sites that throw errors when we try to fetch a sitemap. If our logs show the same failure for a week, we'll retire that site, which blocks it from being rediscovered.

cat logs/* | grep "sitemap: 767:" | sort | uniq -c | egrep '^ *28 ' | perl -pe 's/ 28 (.*?),.*/$1/' >xx
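For reference, the same selection in Ruby; a sketch that assumes each matching log line begins with the site name followed by a comma, as the perl substitution above implies:

failures = Hash.new(0)                                       # identical failure line => count across all logs
Dir.glob('logs/*').each do |log|
  File.foreach(log) do |line|
    failures[line] += 1 if line.include?('sitemap: 767:')
  end
end
sites = failures.select { |line, n| n == 28 }.keys           # 28 is the count the pipeline above treats as a week of failures
sites = sites.map { |line| line.split(',').first }.uniq
File.write('xx', sites.join("\n") + "\n")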

The retired directory already held some pages that were about to be retired. I don't know why. We save a backup before mass retiring.

tar czf retired-2023.tgz retired

cat xx | while read s; do mv sites/$s retired/$s; done