This appeared in a thread over on the Arduino forums: someone was reporting a TX timeout that was apparently being caused by SPI interface data corruption. It might be of interest to those building their own nodes. The post was apparently by someone from Semtech.
The reported issue of an SPI data corruption which leads to a TX TIMEOUT is NOT related to a bug in the Semtech LoRa transceiver. It has been demonstrated by Semtech and independent parties that the corruption only occurs when SPI lines are not properly routed on the PCB and/or when the SPI specifications are not followed as per Semtech datasheet. Examples of the common causes are:
• SPI SCK line placed too close to MOSI and MISO lines.
• SPI interface lines are too long and too close to other interface signal lines.
• No proper ground plane under the SPI lines.
• SPI clock and timing driven by the Master are not in compliance with Semtech datasheet.
*Any of the practices above could corrupt one or more bits in the SPI data which may lead the Semtech transceiver into an unknown state and consequently resulting in a TX TIMEOUT.*
As an effort to mitigate the effort of a major redesign, Semtech has come up with a workaround which involves a soft reset of the Semtech LoRa transceiver. It’s important to note that the creation of this workaround is NOT to patch a bug on the Semtech LoRa transceiver or reference designs, but rather to offer an alternative simple solution. And for this reason, Semtech does not see the need to document this matter in the errata document. To improve clarity, Semtech will update the comments provided with the code for the workaround.
Now, seeing as the SCK, MOSI and MISO lines are right next to each other on the chip and on almost all modules, it's not so easy to avoid the traces being close together.
Interesting to also note that only the later LoRa devices (SX1261, SX1262, SX1268, etc.) have a TX timeout capability built in. On the SX1272, SX1276 and SX1278 you need to have support in the library.
Although all the items listed are of course good engineering principles, I can say from my experience with the RFM95 that the SPI port can take a whole lot of abuse! My prototypes are sometimes built under strict time constraints and may bring tears to your eyes, but never have I had problems with corruption on the SPI port. And in a proper PCB design, always include a ground plane, as you will need it anyway when working with RF signals.
There’s a huge difference between lines being adjacent on a package, and lines being adjacent for a long run. Arduino-realm people routinely do unwise things, including using 30 cm jumper wires to hook things up. At slow clocks you can typically get away with a lot since most of the grunge on the data has died out by the appropriate clock edge to sample it, but ringing on the SPI clock line itself can confuse some SPI engines.
If designing a board, leave some SMT footprints at each end of the clock line for various termination strategies. Even if the series elements end up zero ohm jumpers and the shunt elements open, it’s a cheap way to have options. After seeing a 433 MHz OOK project deafened by the harmonics of its own serial debug output, I tend to put ferrite beads on all digital signals in projects of this type anyway.
Interesting to also note that only the later LoRa devices (SX1261, SX1262, SX1268, etc.) have a TX timeout capability built in. On the SX1272, SX1276 and SX1278 you need to have support in the library.
These are likely very different causes of “timeout” - in the first case you mention, it is not actually a transmit timeout, it is the chip never being correctly put in transmit mode to begin with.
Hardware can’t detect a timeout of something you haven’t even asked it to do.
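To make the distinction between those two failure modes concrete, here is a rough sketch of what library-side timeout support on an SX127x might look like. It reads the mode bits back after requesting TX, so a write that never took effect is reported separately from a transmission that started but never finished. The register addresses and bit positions (RegOpMode 0x01, RegIrqFlags 0x12, TxDone as bit 3) are the LoRa-mode values from the SX127x datasheet. `RadioIo`, `TxResult` and `transmitWithTimeout` are made-up names for illustration and do not come from any particular library. The hooks are passed in so the logic can be exercised off-target.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// SX127x LoRa-mode registers and bit fields (per the datasheet register map).
constexpr uint8_t REG_OP_MODE   = 0x01;
constexpr uint8_t REG_IRQ_FLAGS = 0x12;
constexpr uint8_t MODE_MASK     = 0x07;  // RegOpMode bits 2..0
constexpr uint8_t MODE_TX       = 0x03;
constexpr uint8_t IRQ_TX_DONE   = 0x08;  // RegIrqFlags bit 3

// Hypothetical hardware hooks: on a real node these would wrap the SPI
// transfer routines and the platform's millisecond tick.
struct RadioIo {
    std::function<uint8_t(uint8_t)> readReg;
    std::function<void(uint8_t, uint8_t)> writeReg;
    std::function<uint32_t()> millis;
};

enum class TxResult { Done, NeverEnteredTx, TimedOut };

// Assumes the FIFO and payload length have already been set up.
TxResult transmitWithTimeout(const RadioIo& io, uint32_t timeoutMs) {
    assert(timeoutMs > 0);

    // Request TX while preserving the upper RegOpMode bits.
    const uint8_t opMode = io.readReg(REG_OP_MODE);
    io.writeReg(REG_OP_MODE,
                static_cast<uint8_t>((opMode & ~MODE_MASK) | MODE_TX));

    // Read the mode back right away: if the write was corrupted on the bus,
    // the radio was never asked to transmit, so this is not a real timeout.
    if ((io.readReg(REG_OP_MODE) & MODE_MASK) != MODE_TX) {
        return TxResult::NeverEnteredTx;
    }

    // Poll for TxDone; unsigned subtraction stays correct across tick wrap.
    const uint32_t start = io.millis();
    for (;;) {
        if (io.readReg(REG_IRQ_FLAGS) & IRQ_TX_DONE) {
            io.writeReg(REG_IRQ_FLAGS, IRQ_TX_DONE);  // write 1 to clear
            return TxResult::Done;
        }
        if (static_cast<uint32_t>(io.millis() - start) >= timeoutMs) {
            return TxResult::TimedOut;
        }
    }
}
```

A caller could treat `NeverEnteredTx` as a cue to reset and reconfigure the radio, and `TimedOut` as a reason to check the airtime calculation or the DIO wiring. The timeout itself would normally be derived from the expected time on air plus some margin.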
Also worth considering: LMiC in particular isn’t really willing to try to “heal” the radio state. If it detects an unexpected state of a radio mode bit that it does look at (vs the many it ignores), it will typically crash with an assert. If you have a watchdog, that should lead to a restart, but often a restart with a complete loss of state (OTAA join, fCnt reset, etc). Since those things are undesirable, real usage should either add logic to heal the radio, or make it so that network state can survive at least the first watchdog reboot.
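For the second option, one possible approach is to keep a copy of the session in memory that survives a watchdog reset and to validate it on boot before trusting it. The sketch below is only an illustration under assumptions: `SessionState` is a made-up subset of the fields a stack would need, not LMiC's actual structures, and where the slot lives (a no-init RAM section, RTC memory, EEPROM) is platform-specific and left out. FNV-1a is used as a simple integrity check. A CRC would do equally well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical subset of LoRaWAN session state worth keeping across a reset.
struct SessionState {
    uint32_t devAddr;
    uint8_t  nwkSKey[16];
    uint8_t  appSKey[16];
    uint32_t fCntUp;
    uint32_t fCntDown;
};

// Layout of the slot placed in reset-surviving memory (placement not shown).
struct PersistedSession {
    uint32_t     magic;
    SessionState state;
    uint32_t     checksum;
};

constexpr uint32_t SESSION_MAGIC = 0x4C6F5261u;  // arbitrary marker value

// 32-bit FNV-1a hash over a byte range.
uint32_t fnv1a(const void* data, std::size_t len) {
    assert(data != nullptr);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

// Store the session; call after a join and after each frame counter change.
void saveSession(PersistedSession& slot, const SessionState& s) {
    slot.state = s;
    slot.checksum = fnv1a(&slot.state, sizeof slot.state);
    slot.magic = SESSION_MAGIC;  // written last so a partial save is rejected
}

// Returns true and fills `out` only if the slot holds an intact session.
bool restoreSession(const PersistedSession& slot, SessionState& out) {
    if (slot.magic != SESSION_MAGIC) return false;
    if (fnv1a(&slot.state, sizeof slot.state) != slot.checksum) return false;
    out = slot.state;
    return true;
}
```

On boot, firmware would try `restoreSession` first and fall back to a fresh OTAA join only if it returns false. To cover repeated crashes, a reboot counter could be stored alongside the session so that a persistently bad state eventually forces a clean rejoin.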