Monday, June 10, 2013

Seven Deadly Sins of CRCs and Checksums

We're wrapping up the final report for an FAA-sponsored study of CRC and Checksum performance for aviation applications, although the results in general apply to all uses of those error detection codes.

As part of our results we came up with an informal list of "Seven Deadly Sins" (bad ideas):
  1. Picking a CRC based on a popularity contest instead of analysis
    • This includes using “standard” polynomials such as IEEE 802.3
  2. Saying that a good checksum is as good as a bad CRC
    • Many “standard” polynomials have poor HD at long lengths
  3. Evaluating with randomly corrupted data instead of BER fault model
    • Any useful error code looks good on random error patterns vs. BER random bit flips
  4. Blindly using polynomial factorization to choose a CRC
    • It works for long dataword special cases, but not beyond that
    • Divisibility by (x+1) doubles undetected fraction on even # bit errors
  5. Failing to protect message length field
    • Results in pointing to data as FCS, giving HD=1
  6. Failing to pick an accurate fault model and apply it
    • “We added a checksum, so ‘all’ errors will be caught” (untrue!)
    • Assuming a particular standard BER without checking the actual system
  7. Ignoring interaction with bit encoding
    • E.g., bit stuffing compromises HD to give HD=2
    • E.g., 8b10b encoding – seems to be OK, but depends on specific CRC polynomial
(I haven't tried to map it onto the more traditional sin list... if someone comes up with a clever mapping I'll post it!)

Thanks to Kevin Driscoll and Brendan Hall at Honeywell for their work as co-investigators. You can read more about the research on my CRC and Checksum Blog.  That blog has more detailed postings, slide sets, and will have the final research report when it is made publicly available.


11 comments:

  1. Hello,

    Length field is a subtle one: if your protocol has a 'command number' at the start and different commands have different lengths, one's effectively the other. Glad I realized this yesterday instead of next week when it'd have been too late...

    Thanks

    James

    ReplyDelete
  2. James -- this is a great point! Glad you found it sooner rather than later.

    Cheers,
    -- Phil

    ReplyDelete
  3. Actually, I've been pondering this today and framing seems to be real trouble...

    For instance Modbus, a single flipped start bit after a packet would shift to half of CRC + FF, the 'idle 3.5 characters' having received an additional character.

    If instead one used a high bit (address etc.) to indicate the first byte of a packet, and the CRC were directly after, two flips would get you checking the CRC at the wrong location as well.

    Do you have any pointers/references/etc. for how to avoid this problem? (Or have I missed something that makes it less of an issue?)

    Thanks

    James

    ReplyDelete
  4. Right again -- this is a good example of how this topic gets pretty tricky if you want to plug all the gaps. (As I get time I'll post things I've run into... but time always seems to be scarce.)

    Quick version: Yes you have to think of any way in which a message can be misunderstood that can affect the length. That includes length implicit in message headers. One way to solve this is make all messages the same length so the problem can't happen, fragmenting if necessary.

    Another way is to put a CRC on the message header to detect header corruption (FlexRay does this; for a protocol that doesn't do this you could make the first byte of the payload a CRC on the header field).

    A third way is to have headers that have a high hamming distance between each other. One way to do this is, for example, to use the first few bits of the header as the actual header and the rest of the header bits as a CRC protecting those first few bits. The network protocol will just see this as a sparsely populated header space, but you'll know that it is hard for a single bit flip (or more perhaps) to accidentally change the header and therefore change the implicit length.

    The last way I've seen it done is with a high-hamming-distance set of synchronization bit patterns (I think this was Train Control Network), with the sync bit pattern determining which of a few message types are being used and thus implicitly the length.

    ReplyDelete
  5. Just one last question... The 'CRC on the message header' approach, I noticed on USB that the header CRC is HD 3 yet the data is HD 4. I've seen this elsewhere as well. Is there some logic behind why the header can be weaker? This makes the whole packet HD 3 effectively..

    Thanks

    James

    ReplyDelete
  6. It does make the whole packet HD=3, and that is probably why some protocols use the same HD for header protection as packet body protection.

    I don't know why USB did it that way. But the obvious argument would be that the header is a smaller target to hit with random bit errors than the packet body. You might, for example, argue that a particular random independent bit error rate HD=3 on the header is good enough because it is very unlikely to get 3 errors there, while HD=4 is necessary for the packet body because the body is so big that 3 errors will happen often enough there to be a problem. To make this argument for your system you'd need to decide that random independent bit errors is the right fault model for your system, and then do the probability math to decide if number of undetected header errors is low enough to be acceptable to you.


    ReplyDelete
  7. Guess what happens in IEEE 802.15.4 / Zigbee.... last I checked - no CRC or checksum protection of a leading length field. Nobody I mentioned this to seemed to find it a problem. It's a disaster waiting to happen.

    The most robust protocols get around these kind of problems by use of things like illegal sequences (coding or symbol violations) to insert start / end of frame markers, and then use things like bit stuffing to ensure such symbols can never appear inside a valid data frame (eg HDLC).

    The problem is always one of distinguishing control fields in the presence of bit errors. Ignoring the issue does not make it go away. Of course, redundant coding (and thus decent HD) can aid, but its not nice for bandwidth - you don't get something for nothing.

    ReplyDelete
  8. Anecdote on length field importance and rule 6: Recently used CRC16-CCITT for an embedded design. The packet length is known after there is a silence in the line.
    I Assumed CRC would magically detect errors, wrong lengths and the possibility of an error was anything but remote... Turns out line silence isn't always properly detected and therefore 0x00 ends up sometimes in the packet as the last byte (erroneously).
    Now check this out: ALL and EVERY valid packet with CRC16-CCITT with any extra (bogus) 0x00 at the end has also a valid CRC and passes the CRC check!!!!!!!!
    Example:
    D: DATA Byte
    X: CRC Byte 1
    y: CRC Byte 2

    DDDDXY <- XY is the valid CRC for DDDD
    DDDDXY0 <- Y0 is a valid CRC but X (CRC Byte 1) will be taken as data!!
    DDDDXY00 <- 00 is a valid CRC but XY will be taken as data!!
    DDDDXY000<- 00 is a valid CRC but XY0 will be taken as data!!
    and so on...

    At first I thought I had found a 1 in a 65536 test case (at least) and not so bad. Fortunately I wondered how likely was it and created a montecarlo program to find out, ohh was I surprised that every time it was the case!

    I guess I will have to change my algorithm

    Thank you for the information you have provided to the community

    ReplyDelete
  9. Thanks for the story. This sort of issue is why many protocols start with a length field in a known place.

    ReplyDelete
  10. What are the effects of corruptions in the CRC itself?

    I'm assuming that a single bit corruption in the CRC makes all HD promises invalid.

    If that's the case, how does this affect the error detection analysis?

    ReplyDelete
    Replies
    1. Corruption of the CRC field itself is accounted for in the HD. The HD accounts for any combination of codeword bit corruptions, including both the dataword and the Frame Check Sequence (CRC value)

      Delete

Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!

Job and Career Advice

I sometimes get requests from LinkedIn contacts about help deciding between job offers. I can't provide personalize advice, but here are...