We notice that scrapes take longer and longer. A pet theory is that sitemap fetches are timing out. We've changed the curl -m time limit from 10 to 5 seconds. Now we wonder what effect that will have.
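A minimal sketch of the kind of fetch we mean, assuming the per-site sitemap lives at /system/sitemap.json (an assumption about the scrape script, which may differ):

# fetch one site's sitemap, giving up after 5 seconds total (was -m 10)
curl -s -m 5 "http://$site/system/sitemap.json"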
We notice a long-term increase in the scrape run time, from 20 to 80 minutes. Sudden drops in the plot are actually the clock wrapping around at 80 minutes, so a run that really takes 95 minutes reads as 15. (see plots)
We plot a reasonably accurate count of active sites by counting those that return a correctly formatted sitemap. This count stays in the mid 800s for each scrape, without the volatility we see above.
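A minimal sketch of such a count, assuming the fetched sitemaps are saved as sitemaps/*.json and that jq is available; both are assumptions about the scrape layout:

# count the sites whose saved sitemap parses as valid JSON
valid=0
for f in sitemaps/*.json; do
  jq empty "$f" 2>/dev/null && valid=$((valid+1))
done
echo "active sites: $valid"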
We see timeouts in the log as JSON parse failures; empty results complain that a JSON text requires at least two octets. (see logs)
We use this script to count how many sitemap fetch failures we had each run over the last week.
# one line per log file: filename, then the count of 'octet' (JSON parse) errors
for i in logs/*; do echo "$i" $(grep -c octet "$i"); done
These counts are from Sept 18, 2016, right before we shortened the timeout from 10 to 5 seconds.
logs/Fri-0100 151
logs/Fri-0700 165
logs/Fri-1300 166
logs/Fri-1900 152
logs/Mon-0100 151
logs/Mon-0700 159
logs/Mon-1300 166
logs/Mon-1900 161
logs/Sat-0100 152
logs/Sat-0700 153
logs/Sat-1300 155
logs/Sat-1900 150
logs/Sun-0100 156
logs/Sun-0700 160
logs/Sun-1300 152
logs/Sun-1900 152
logs/Thu-0100 152
logs/Thu-0700 158
logs/Thu-1300 158
logs/Thu-1900 154
logs/Tue-0100 177
logs/Tue-0700 184
logs/Tue-1300 183
logs/Tue-1900 157
logs/Wed-0100 152
logs/Wed-0700 207
logs/Wed-1300 172
logs/Wed-1900 156
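A hedged follow-up sketch over the same logs/ directory; it only summarizes the counts above, adding nothing new:

# average and peak sitemap failures per run across the week
for i in logs/*; do grep -c octet "$i"; done |
  awk '{sum+=$1; if ($1>max) max=$1; n++} END {printf "runs=%d avg=%.1f max=%d\n", n, sum/n, max}'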