Confirm Uplinks - Node failed to receive acknowledgement

Up until very recently (3 weeks ago) everything had been going very smoothly with my array of nodes. Then suddenly one of them went offline for around 12 hours, and last weekend they all went down. I have tracked this down to the fact that they do not receive the acknowledgement from a confirmed uplink. After 3 failed confirmed uplinks my node will attempt a join request; if that fails, it conforms to the back-off timer and does not resend a message until the next 24-hour period. I can’t work out why a join request would fail - the node is no more than 10 m away from the gateway.

I am using the Semtech LoRaWAN stack (GitHub - Lora-net/LoRaMac-node: Reference implementation and documentation of a LoRa network node) on an STM32WLE5. If anyone has any idea why this has just started happening, some insight would be greatly appreciated.
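
For reference, this is roughly how the uplink is queued (a simplified sketch against the LoRaMac-node MCPS API - the port, payload and retry values are illustrative rather than my exact application code, and newer stack releases drop NbTrials and handle retransmissions internally):

```c
#include <stdbool.h>
#include <stdint.h>
#include "LoRaMac.h"

/* Simplified sketch of queueing a confirmed uplink with LoRaMac-node.
 * fPort, the payload and NbTrials are illustrative values only. */
static bool SendConfirmedUplink( uint8_t *payload, uint8_t size )
{
    McpsReq_t mcpsReq;

    mcpsReq.Type = MCPS_CONFIRMED;
    mcpsReq.Req.Confirmed.fPort = 2;
    mcpsReq.Req.Confirmed.fBuffer = payload;
    mcpsReq.Req.Confirmed.fBufferSize = size;
    mcpsReq.Req.Confirmed.NbTrials = 3;    /* rejoin triggered after 3 failures */
    mcpsReq.Req.Confirmed.Datarate = DR_0;

    /* The McpsConfirm callback reports AckReceived == false when no
     * acknowledgement arrives; that is what starts the rejoin logic. */
    return LoRaMacMcpsRequest( &mcpsReq ) == LORAMAC_STATUS_OK;
}
```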

Many thanks.

Might be too close! Try moving further away or put an absorber in between. I assume this is just a test node rather than the full array of field-deployed nodes? If so, that won’t explain all the others…

also

How often were you asking for confirmed uplinks? These trigger downlinks, which are frowned upon and limited under the FUP to a max of 10/day/node - and should really be thought of as 10 per week or even 10 per month! :wink: They render the GW deaf to all other users for the duration…

Also, back off your TX rate - again FUP - IIRC the base code has a very naughty TX rate embedded (10 s?). Think in terms of minutes & hours between TX, not a few seconds…
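
For example, in the stock STM32CubeWL / LoRaMac-node periodic-uplink examples the interval is usually a single define (name and stock value vary by example; this assumes the common APP_TX_DUTYCYCLE pattern, in milliseconds):

```c
/* lora_app.h / main application config - value is in milliseconds.
 * The stock examples ship with something on the order of 10 s;
 * stretch it to minutes or hours to stay inside the FUP. */
/* #define APP_TX_DUTYCYCLE    10000 */              /* naughty: every 10 s */
#define APP_TX_DUTYCYCLE       ( 30 * 60 * 1000 )    /* every 30 minutes */
```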


Thank you for your response.

Yes, this was just for a test node, the others are all further than 100m away.

I am sending the max of 10 confirms per day, so I will reduce it to 1 per day.

The TX rate is due to the node not receiving a response and trying again straight away. I shall add a delay to ensure that it complies with the FUP.

You probably should not be using confirmed uplinks to begin with, as every uplink then requires a downlink, and that’s very expensive for the network: a gateway cannot receive on any of the 8 channels while it is transmitting on just one.

But also, your strategy is far, far too aggressive. Any temporary issue will cause all your nodes to start trying to rejoin. Look at typical ADR backoff timers - they can often take a day or more just to step down to the slowest data rate, and even that is not giving up.

What are the circumstances under which you think re-joining would actually be beneficial?

Really the only reason why a re-join would be useful would be if the state of the network servers (or the node) were irrecoverably lost. That’s a very rare occurrence - even the TTN v2 to v3 cutover doesn’t really count.

Essentially, your aggressive failure strategy is likely to cause far more total downtime than waiting several days for the network to start responding again before giving up.
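
As a sketch of gentler failure handling (all names and values here are illustrative, not from any particular stack): keep the session, keep sending unconfirmed uplinks, and just stretch the interval:

```c
#include <stdint.h>

/* Illustrative exponential backoff for a "network seems down" state:
 * instead of rejoining, keep the session and widen the uplink interval
 * from the normal cadence up to a once-a-day retry. */
#define BASE_INTERVAL_S  ( 30UL * 60UL )         /* normal cadence: 30 min */
#define MAX_INTERVAL_S   ( 24UL * 60UL * 60UL )  /* worst case: once a day */

static uint32_t NextUplinkDelayS( uint32_t consecutiveFailures )
{
    if( consecutiveFailures > 6 )  /* clamp the shift to avoid overflow */
    {
        consecutiveFailures = 6;
    }
    uint32_t delay = BASE_INTERVAL_S << consecutiveFailures; /* 30m, 1h, 2h... */
    return ( delay > MAX_INTERVAL_S ) ? MAX_INTERVAL_S : delay;
}
```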

Currently, I am doing a confirmed uplink every 2.5 hours. Sometimes, for some reason, these fail, which makes the node think it is no longer connected to the network, so it tries again; after 3 retries it will then try to rejoin the network. If this rejoin fails it will try again at the end of the next 24-hour period (back-off). Once this time has elapsed and the new rejoin is attempted, the node joins the network successfully until the next confirmation fails. This could be days later.

Are you suggesting that I should keep doing confirmed uplinks until I get an acknowledgement?

No, never, not on TTN. Off TTN you can continue if you really hate your neighbours on the ISM band that much and have detailed logs to show how you are keeping to the legal duty cycles - testing by hammering the airwaves is not testing, it’s just hammering. How can you assess what’s going on with uplinks at such short intervals?

So I think you’ll find that most of us here would be suggesting a total re-design of your connection strategy.

The core schemes built in to the specification cover the majority of situations, and as LoRaWAN use cases go, it’s not suitable as a guaranteed-delivery scheme. The TTI-documented metric is to plan for 10% packet loss, the majority down to airtime issues - like the gateway transmitting when a device uplinks - something you are actually making worse.
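
To get a feel for the numbers, here is a sketch of the standard LoRa time-on-air formula (Semtech AN1200.13), with EU868 defaults assumed - the calculator linked below does the same maths:

```c
#include <math.h>
#include <stdio.h>

/* Sketch of the standard LoRa time-on-air formula (Semtech AN1200.13),
 * assuming EU868 defaults: 125 kHz bandwidth, coding rate 4/5, 8-symbol
 * preamble, explicit header, CRC on, low-data-rate optimisation at SF11+. */
static double LoRaAirtimeMs( int sf, int phyPayloadBytes )
{
    const double bw = 125000.0;
    const int cr = 1;                     /* coding rate 4/5 */
    const int crc = 1, ih = 0;            /* CRC on, explicit header */
    const int de = ( sf >= 11 ) ? 1 : 0;  /* low data rate optimisation */
    double tSym = pow( 2, sf ) / bw * 1000.0;  /* symbol time, ms */

    double tmp = ceil( ( 8.0 * phyPayloadBytes - 4.0 * sf + 28 + 16 * crc - 20 * ih )
                       / ( 4.0 * ( sf - 2 * de ) ) ) * ( cr + 4 );
    double nPayload = 8 + ( ( tmp > 0 ) ? tmp : 0 );

    return ( 8 + 4.25 + nPayload ) * tSym;     /* preamble + payload */
}

int main( void )
{
    /* A 33-byte PHY payload (20 application bytes + 13 bytes of LoRaWAN
     * overhead) at SF12 comes out around 1.8 s on air. */
    printf( "%.0f ms\n", LoRaAirtimeMs( 12, 33 ) );
    return 0;
}
```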

You could try running ‘normally’ with the ADR backoff scheme and a LinkCheck at around every 60 uplinks, and see if that gives you the data you need. For payload size vs data rate to stay within the FUP, you can check out https://avbentem.github.io/airtime-calculator/ttn/eu868/20
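
With LoRaMac-node, a LinkCheck is an MLME request rather than a confirmed uplink - a minimal sketch, assuming the every-60-uplinks cadence above (counter and threshold are illustrative):

```c
#include <stdint.h>
#include "LoRaMac.h"

/* Sketch: request a LinkCheck on roughly every 60th uplink instead of
 * using confirmed uplinks. */
static void MaybeRequestLinkCheck( void )
{
    static uint32_t uplinkCount = 0;

    if( ++uplinkCount % 60 == 0 )
    {
        MlmeReq_t mlmeReq;
        mlmeReq.Type = MLME_LINK_CHECK;

        /* The LinkCheckReq MAC command rides along with the next uplink;
         * the answer (demodulation margin and gateway count) arrives in
         * the MlmeConfirm callback. */
        LoRaMacMlmeRequest( &mlmeReq );
    }
}
```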

This is a mix-up of two almost completely distinct concepts.

One is the ability to move radio packets between the node and the servers by means of a gateway. This is present or not at any given instant, but doesn’t really have any persistent “state”.

The other is the agreement of state which constitutes a joined session. Having all in-range gateways (or even the server infrastructure itself) go down for a period of hours or even weeks doesn’t mean that there’s any disagreement of state.

So there’s no benefit to trying to join again - and a substantial cost, because a join packet carries no useful data, while an uplink packet does, even if the node isn’t getting any reassuring replies from the network.

The only time joining again is actually useful to fix a network issue is if the TTN device database has been completely destroyed and everything has to start over from scratch. That’s not something that’s happened at all recently. Don’t be so hasty to give up on a joined session that could start working again at any instant once gateways and/or servers come back up, in order to protect against a sort of failure that’s likely to occur only on an interval of years.

(And if you actually do want to protect against a failure on the TTN side, the thing to do would be to back up the session state from the TTN servers - in theory, you could then at least decode uplinks received through your own gateways on your own, and with some judicious guessing of a reasonable fast forward of the downlink frame count, probably even re-inject the session details into revived or replaced infrastructure.)

Not for the last 5 years at least. I have nodes that have been uplinking for over 5 years without any issues. Even occasional server or gateway issues didn’t break things, just lost a few data points.