tl;dr: two devices were assigned the same dev_addr. Device A went offline, and device B’s uplinks started being sent to our backend as if they were from device A, until device A initiated a rejoin and both devices returned to normal operation.
more context:
I’ve got a fleet of end-devices running a port of the SWL2001 LoRaWAN library on an STM32WL chip. We are using the LoRaWAN 1.0.4 standard.
For simplicity I’m going to call the affected devices Device A and Device B here, but we use the devuids that TTN attaches to the data packet to identify which uplinks are associated with which device.
The incident
Prior to the incident both devices were operating and sending data as expected.
Device A was deployed locally. On Dec 13 it was turned off. We know for certain that this device was not powered and did not have access to a gateway. Device B was deployed at a remote site (over 400km away from Device A). Both devices shared the same dev_addr: 27FDB84F
On Dec 15 Device B “stopped sending data” at the same time Device A “started sending data” again. On Dec 17 Device A was turned back on, initiated a rejoin, and sent data as expected. At the same time Device B started sending data again.
Inspecting the data that “Device A” sent from Dec 15-17, it is clear that this was actually Device B. It was transmitted via the same gateway that Device B was using (one that Device A was over 400km from). The data was consistent with the data that Device B sent both before and after the incident. I’m reasonably confident that these two devices were mixed up during this period. As an interesting coincidence, at the uplink where the swap happened both devices had the same frame count.
The questions
Obviously this is undesirable behaviour. Currently I don’t have the tools in place to reliably detect this happening, but I have detected at least one other time that this might have occurred (although it is less clear than it was in this case). I’m very confused about how this happened. The AppSKey and NwkSKey for these two devices should have been unique, so even if there was somehow a collision such that both NwkSKeys were valid for the MIC, the resulting decode of the payload should have been nonsense, yet the data was decoded sensibly through the entire incident. We had trouble getting hw rand to work on the device and are using srand, which might be seeded with a functionally static value (I haven’t had the opportunity to check that yet), but even in this case my understanding is that the network server generates keys that should be unique regardless.
Do you mean both had dev_addr 27FDB84F assigned at the network server in separate session records for each DevEUI, or that both were using this dev_addr value?
Are both devices using the same AppKey value?
I assume so, otherwise the Join Accept from the gateway would be discarded.
Also session keys could not match if different AppKeys are used.
If two devices use the same keys and attempt to join at the same time, both may receive the same JoinAccept and create the same session.
Unique AppKeys for each DevEUI will prevent this from happening.
You didn’t state it explicitly but I am assuming the devices use OTAA.
Sorry to pick nits but this is a recurring misunderstanding. Devices do not connect to gateways. Devices connect to the LNS and a gateway is just a dumb media converter (IP to radio waves).
From the LoRaWAN specification:
JoinNonce is a non-repeating value provided by the Join Server and used by the end-device to derive the two session keys NwkSKey and AppSKey, which SHALL be calculated as follows:
NwkSKey = aes128_encrypt(AppKey, 0x01 | JoinNonce | NetID | DevNonce | pad16)
AppSKey = aes128_encrypt(AppKey, 0x02 | JoinNonce | NetID | DevNonce | pad16)
If your DevNonce is not unique (and given that LoRaWAN 1.0.4 specifies it should be an incrementing counter starting at 0, that might very well be the case), your session keys will be the same if TTN uses a JoinNonce that is only unique per device (so the same value can occur for different devices) and your AppKeys are the same.
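To make the collision concrete, here is a minimal sketch of the 16-byte block that gets encrypted with the AppKey in the formulas above. It leaves out the AES step entirely, because the point is only that identical inputs plus an identical AppKey must give identical keys. The function names are made up for illustration, the values are placeholders, and the little-endian byte order is my reading of the 1.0.x spec rather than something verified against SWL2001.

```c
#include <stdint.h>
#include <string.h>

/* Build the block: prefix | JoinNonce (3) | NetID (3) | DevNonce (2) | pad.
 * prefix is 0x01 for NwkSKey and 0x02 for AppSKey.
 * Assumption: multi-byte fields are serialized least significant byte first. */
static void build_skey_block(uint8_t out[16], uint8_t prefix,
                             uint32_t join_nonce, uint32_t net_id,
                             uint16_t dev_nonce)
{
    memset(out, 0, 16);                   /* zero padding up to 16 bytes */
    out[0] = prefix;
    out[1] = (uint8_t)(join_nonce);
    out[2] = (uint8_t)(join_nonce >> 8);
    out[3] = (uint8_t)(join_nonce >> 16);
    out[4] = (uint8_t)(net_id);
    out[5] = (uint8_t)(net_id >> 8);
    out[6] = (uint8_t)(net_id >> 16);
    out[7] = (uint8_t)(dev_nonce);
    out[8] = (uint8_t)(dev_nonce >> 8);
}

/* Returns 1 when two devices on the same network would feed the same block
 * into AES. With a shared AppKey that means the same NwkSKey and AppSKey. */
static int same_session_key_inputs(uint32_t join_nonce_a, uint16_t dev_nonce_a,
                                   uint32_t join_nonce_b, uint16_t dev_nonce_b,
                                   uint32_t net_id)
{
    uint8_t a[16];
    uint8_t b[16];
    build_skey_block(a, 0x01, join_nonce_a, net_id, dev_nonce_a);
    build_skey_block(b, 0x01, join_nonce_b, net_id, dev_nonce_b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Two devices that have both joined the same number of times, with counters on both sides, end up with the same JoinNonce and DevNonce, so same_session_key_inputs() returns 1 for them. Only a different AppKey keeps their session keys apart.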
This feels likely. I’m not sure how TTI generates the JoinNonce, but given that our devices share the same AppKey, if it’s not random then I’m going to be seeing collisions all over the place. I guess I’ll figure out how to assign these devices random AppKeys.
Because of the rather literal way my brain reads things, just in case the use of the word incident had any connection with this being an “official” report, for clarity, this is a volunteer run best endeavours forum. Any commercial issues arising should be addressed via TTI support.
I turned the standard random number generator into a spectacular FootGun a few years back: turn on 25 devices for testing at the same time, then watch the collisions, because the jitter wasn’t actually random and they all kept retrying in the same sequence.
Turning on the hardware random generator solved the issue. I’ll dig out the code for implementation.
However, the 1.0.4 spec requires the next DevNonce to be higher than the last one, so random won’t cut it here. The SWL2001 should be using incrementing DevNonces saved to some non-volatile RAM, but, perversely, it hasn’t got an official port for the STM32WL. If that part was “left for later” when you ported it, and/or you have turned off DevNonce checking on the LNS, then YMMV.
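For anyone wondering what “incrementing and saved” looks like in practice, here is a rough sketch. The nonce_store_t hooks are hypothetical and are not part of SWL2001 or the ST HAL; on a real device they would wrap whatever flash or EEPROM-emulation routines the port uses. The RAM-backed pair at the bottom is only there so the sketch can be run on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical persistence hooks (assumption, not a real library API). */
typedef struct {
    bool (*load)(uint16_t *value);  /* false if nothing has been stored yet */
    bool (*store)(uint16_t value);  /* false if the write failed */
} nonce_store_t;

/* Pick the DevNonce for the next Join-request and persist it BEFORE it is
 * transmitted, so a reset mid-join can never cause a value to be reused. */
static bool next_dev_nonce(const nonce_store_t *nvm, uint16_t *out)
{
    uint16_t last;
    uint16_t next = 0;               /* first join ever starts at 0 */

    if (nvm->load(&last)) {
        if (last == UINT16_MAX) {
            return false;            /* counter exhausted */
        }
        next = (uint16_t)(last + 1u);
    }
    if (!nvm->store(next)) {
        return false;                /* do not join with an unsaved nonce */
    }
    *out = next;
    return true;
}

/* RAM-backed stand-in for the storage hooks, for trying this off-target. */
static uint16_t ram_value;
static bool ram_valid;

static bool ram_load(uint16_t *value)
{
    if (!ram_valid) {
        return false;
    }
    *value = ram_value;
    return true;
}

static bool ram_store(uint16_t value)
{
    ram_value = value;
    ram_valid = true;
    return true;
}

static const nonce_store_t ram_nonce_store = { ram_load, ram_store };
```

Note that this makes the DevNonce unique per device only. Two devices with the same join history will still present the same DevNonce, which is why it does nothing for the shared AppKey problem.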
I checked, and we’ve got this footgun locked and loaded in our codebase as well (initializing srand with HAL_GetTick is not going to generate very many unique seeds). I’ll add it to my list of bugs to resolve. But given that the DevNonce is not random (I verified this on our firmware), I don’t think it resolves my issue. I’ll look into finding a way to generate unique AppKeys for each end-node.
I appreciate you taking the time to help me with this!
If someone else stumbles upon this: I briefly investigated TTN’s behaviour for generating the JoinNonce, and it appears they have implemented TR001 Remedy #1 (which may be a requirement for the JS now, I’m no expert).
This means that, of all the variables that go into generating the NwkSKey and AppSKey (AppKey, JoinNonce, NetID, DevNonce) in the 1.0.4 spec, only the AppKey is really “random”.
So if you are using a single shared AppKey between all of your devices, you will almost certainly run into problems where you are getting identical NwkSKeys/AppSKeys between devices.
Whilst personally I’m pretty relaxed about the levels of encryption for a payload with some blindingly obvious data, typically temperature & humidity, and I may use the exact same AppKey for a dozen or so devices when developing, there is no justification for deploying devices with any part of the credentials duplicated.
Which leads me to ask the rather awkward question: where are you getting your EUIs from? If they’re derived from the hot mess of the ST silicon’s unique id, be aware that downsizing the id to an EUI has been shown to end up with some unexpected duplicates. And even if it is unique, the right bits need to be set to make it a locally derived address.
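On the “right bits” point, a minimal sketch, assuming the EUI is held as 8 bytes with the most significant byte first and that the usual IEEE convention applies (in the first octet, bit 1 is the universal/local flag and bit 0 is the individual/group flag). The function name is made up for illustration. It only marks the address as locally administered; it does nothing to make a truncated unique id actually unique.

```c
#include <stdint.h>

/* Mark an 8-byte EUI as locally administered and individual (unicast).
 * Assumption: eui[0] is the most significant byte as the EUI is written. */
static void eui64_make_local(uint8_t eui[8])
{
    eui[0] |= 0x02u;             /* U/L bit = 1: locally administered */
    eui[0] &= (uint8_t)~0x01u;   /* I/G bit = 0: individual address */
}
```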
Putting credentials into a secrets.h file that gets written out by a script from a database of EUIs, and then triggers a compile & flash, is well worth the effort.
Seeding from HAL_GetTick gives a totally consistent value: the ticks start once the MCU is running stable, so there is no variation on an individual device or across devices.
Whereas the hardware random number generator works just fine & dandy. I’ll dig out the code in the morning.
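A small sketch of why the tick-based seed bites, runnable on a PC. The function name and the idea of a retry backoff are illustrative only; tick_at_seed stands in for whatever HAL_GetTick() returns at the point in start-up where srand is called.

```c
#include <stdint.h>
#include <stdlib.h>

/* First retry delay a device would pick after seeding the C library
 * generator with its tick count. max_ms must be greater than 0. */
static uint32_t first_backoff_ms(uint32_t tick_at_seed, uint32_t max_ms)
{
    srand((unsigned)tick_at_seed);        /* same tick means same sequence */
    return (uint32_t)rand() % max_ms;
}
```

Every device that reaches the srand call at the same tick count gets the same return value here, and the same goes for every later rand() call, so a bench full of devices powered up together will retry in lockstep.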
The hardware random number generator (which is a very small cup of tea detecting Brownian motion inside the MCU):
I configure it in CubeMX. It’s a check box only, no other setup.
And then, depending on the scope of the RNG_HandleTypeDef hrng, you can just ask for a random number with:
uint32_t aRNG = 0;
// status is HAL_OK when aRNG holds a fresh random value
HAL_StatusTypeDef status = HAL_RNG_GenerateRandomNumber(&hrng, &aRNG);
There are a few utility functions for interrupts and callbacks if that’s your thing. You have to roll your own calculations for making it fit into the range you need, but it really is that simple.
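For the “fit into a range” part, one possible approach is rejection sampling, which avoids the slight bias a plain modulo gives. This is a sketch rather than anything taken from ST’s code: rng32_fn and random_in_range are names invented here, and demo_rng is a deliberately non-random stand-in so the example runs off-target.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Source of 32-bit random words; returns false if the source failed. */
typedef bool (*rng32_fn)(uint32_t *out);

/* Uniform value in [lo, hi] inclusive. Words that fall in the incomplete
 * top slice of the 32-bit range are thrown away and drawn again. */
static bool random_in_range(rng32_fn rng, uint32_t lo, uint32_t hi,
                            uint32_t *out)
{
    if (rng == NULL || out == NULL || lo > hi) {
        return false;
    }
    uint32_t span = hi - lo + 1u;         /* wraps to 0 for the full range */
    uint32_t r;
    if (span == 0u) {
        if (!rng(&r)) {
            return false;
        }
        *out = r;
        return true;
    }
    uint32_t rem = (UINT32_MAX % span + 1u) % span;   /* 2^32 mod span */
    uint32_t max_ok = UINT32_MAX - rem;   /* highest word that is accepted */
    do {
        if (!rng(&r)) {
            return false;
        }
    } while (r > max_ok);
    *out = lo + (r % span);
    return true;
}

/* Stand-in word source for host testing only. NOT random. */
static uint32_t demo_state;

static bool demo_rng(uint32_t *out)
{
    demo_state = demo_state * 1664525u + 1013904223u;
    *out = demo_state;
    return true;
}
```

On the device, the rng argument would be a small wrapper that calls HAL_RNG_GenerateRandomNumber(&hrng, out) and returns true only when the status is HAL_OK.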