We've sought advice from Fred Heberts, which we collect here without too much concern for provenance.
The critical point is that you should not necessarily model the causes of failure, but rather the places where failure should be acceptable. It is still helpful to define what sorts of failure modes exist, since that informs what should be acceptable. In other words, focus on “Can this fail? If so, how do we deal with it?” rather than “How will this fail, and can we anticipate that specific failure mode?”
I believe the pattern becomes more useful the less your recovery is tied to a specific failure mode and the more it is tied to a failure domain.
If all my parsing errors abort quickly, then all I have to do to survive is find an acceptable way to handle an aborted parse. From there, two things happen:
1. Push all failure modes towards clearly defined aborts to reinforce that pattern (on the failure side, modes transition towards known domains).
2. Progressively add ways to handle specific modes and possibly do better (slowly reducing the likelihood of those events).
Of course some of these faults might be critical and impossible to handle, and establishing that is also useful.
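To make the parsing example concrete, here is a minimal Python sketch of recovery tied to a failure domain rather than to a failure mode. Everything in it is hypothetical and chosen only for illustration: the `ParseAbort` exception, the `parse_config` and `load_config` functions, and the default port are not taken from the advice above.

```python
import json


class ParseAbort(Exception):
    """The one clearly defined abort for the whole parsing domain (hypothetical name)."""


DEFAULT_CONFIG = {"port": 8080}  # assumed fallback, purely illustrative


def parse_config(text):
    """Parse a config string, or abort. No failure mode is handled individually."""
    try:
        data = json.loads(text)
        return {"port": int(data["port"])}
    except Exception as exc:
        # Bad JSON, a missing key, a wrong type: every mode is pushed
        # towards the same known abort.
        raise ParseAbort(str(exc)) from exc


def load_config(text):
    """Survive an aborted parse without caring which mode caused it."""
    try:
        return parse_config(text)
    except ParseAbort:
        return dict(DEFAULT_CONFIG)


if __name__ == "__main__":
    print(load_config('{"port": "9000"}'))
    print(load_config("not json at all"))
    print(load_config("[1, 2, 3]"))
```

The caller only answers “can parsing fail, and what do we do then?”. Handling for specific modes can be added inside `parse_config` later without touching `load_config`. Faults that do not belong to this domain, such as `KeyboardInterrupt` or `SystemExit`, are not subclasses of `Exception` in Python, so they pass through untouched, which matches the idea that some faults are simply not ours to handle here.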