Issue with LoRaWAN Confirmed Payloads and Multiple Gateways

Hello TTN Community,

I’m encountering an issue in a deployment that uses Microchip RN2483 radios and Things Indoor Gateway Pro devices. The radios are set to max TX power and DR3, using OTAA.

Background
Our devices monitor freezers and require confirmed payloads to ensure that the server has received the messages. This is critical because, in the event of a power outage lasting several hours, the devices switch to a special mode to log the average, minimum, and maximum temperatures during the outage.

The Problem
At sites with multiple gateways, devices sometimes communicate with a far-away gateway instead of the nearest one. For example:

Nearest Gateway: ~20 meters away with strong RSSI (-74) and SNR (13.75).
Far Gateway: >100 meters away with weak RSSI (-114) and SNR (-4).
When a far-away gateway processes the uplink, the ACK often fails to reach the device, leading to reliability issues.

Logs
Communication with the Nearest Gateway:

{
  "name": "as.up.data.forward",
  "time": "2025-01-03T20:50:46.332528067Z",
  "identifiers": [
    {
      "device_ids": {
        "device_id": "2433-06-0091",
        "application_ids": {
          "application_id": "multipurpose-monitor"
        },
        "dev_eui": "-------------",
        "join_eui": "-----------------",
        "dev_addr": "2609569A"
      }
    }
  ],
  "data": {
    "@type": "type.googleapis.com/ttn.lorawan.v3.ApplicationUp",
    "end_device_ids": {
      "device_id": "2433-06-0091",
      "application_ids": {
        "application_id": "multipurpose-monitor"
      },
      "dev_eui": "-----------",
      "join_eui": "--------------",
      "dev_addr": "2609569A"
    },
    "correlation_ids": [
      "gs:uplink:01JGPYYM77QHB4Z8D8RJ832NCH"
    ],
    "received_at": "2025-01-03T20:50:46.327268136Z",
    "uplink_message": {
      "session_key_id": "AZJtbi6lePOKMlDWMcVuRA==",
      "f_port": 12,
      "f_cnt": 23742,
      "frm_payload": "MCw1MzIsMzA4OzAyMDAyMTAyMDAyMDAyMDAyMTAyMDAyMDAyMDAyMDswMTkwMTkwMTkwMTkwMTkwMTkwMTkwMTkwMTkwMTk7MDswOzAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwOw==",
      "decoded_payload": {
        "node_payload": {
          "ch1": [
            20,
            21,
            20,
            20,
            20,
            21,
            20,
            20,
            20,
            20
          ],
          "ch2": [
            19,
            19,
            19,
            19,
            19,
            19,
            19,
            19,
            19,
            19
          ],
          "ch3": "00000000000000000000000000000000000000000000000000",
          "details": {
            "battery": 532,
            "duration": 308,
            "is_summation": 0
          },
          "env": {
            "humid": 0,
            "temp": 0
          }
        }
      },
      "rx_metadata": [
        {
          "gateway_ids": {
            "gateway_id": "ttn-managed-002",
            "eui": "------------------"
          },
          "timestamp": 1050163735,
          "rssi": -74,
          "signal_rssi": -74,
          "channel_rssi": -74,
          "snr": 13.75,
          "uplink_token": "---------------------------------",
          "received_at": "2025-01-03T20:50:45.207886390Z"
        }
      ],
      "settings": {
        "data_rate": {
          "lora": {
            "bandwidth": 125000,
            "spreading_factor": 7,
            "coding_rate": "4/5"
          }
        },
        "frequency": "905100000",
        "timestamp": 1050163735
      },
      "received_at": "2025-01-03T20:50:46.120018772Z",
      "confirmed": true,
      "consumed_airtime": "0.230656s",
      "network_ids": {
        "net_id": "000013",
        "ns_id": "EC656E0000000102",
        "tenant_id": "-----------",
        "cluster_id": "nam1",
        "cluster_address": "nam1.cloud.thethings.industries",
        "tenant_address": "----------------"
      }
    }
  },
  "correlation_ids": [
    "gs:uplink:01JGPYYM77QHB4Z8D8RJ832NCH"
  ],
  "origin": "ip-10-22-7-103.us-west-1.compute.internal",
  "context": {
    "tenant-id": "------"
  },
  "visibility": {
    "rights": [
      "RIGHT_APPLICATION_TRAFFIC_READ"
    ]
  }
}

Communication with the Far-Away Gateway:

{
  "name": "as.up.data.forward",
  "time": "2025-01-03T14:55:40.765581583Z",
  "identifiers": [
    {
      "device_ids": {
        "device_id": "2433-06-0091",
        "application_ids": {
          "application_id": "multipurpose-monitor"
        },
        "dev_eui": "--------------------",
        "join_eui": "0000000000000000",
        "dev_addr": "2609569A"
      }
    }
  ],
  "data": {
    "@type": "type.googleapis.com/ttn.lorawan.v3.ApplicationUp",
    "end_device_ids": {
      "device_id": "2433-06-0091",
      "application_ids": {
        "application_id": "multipurpose-monitor"
      },
      "dev_eui": "-----------------",
      "join_eui": "0000000000000000",
      "dev_addr": "2609569A"
    },
    "correlation_ids": [
      "gs:uplink:01JGPAME0DCX3R2DSA8PTXW85K"
    ],
    "received_at": "2025-01-03T14:55:40.762624502Z",
    "uplink_message": {
      "session_key_id": "AZJtbi6lePOKMlDWMcVuRA==",
      "f_port": 12,
      "f_cnt": 23675,
      "frm_payload": "MCw1MzEsMzA4OzAyMTAyMDAyMTAyMTAyMTAyMTAyMjAyMDAyMjAyMTswMTkwMTkwMTkwMTkwMTkwMTkwMTkwMTkwMTkwMTk7MDswOzAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwOw==",
      "decoded_payload": {
        "node_payload": {
          "ch1": [
            21,
            20,
            21,
            21,
            21,
            21,
            22,
            20,
            22,
            21
          ],
          "ch2": [
            19,
            19,
            19,
            19,
            19,
            19,
            19,
            19,
            19,
            19
          ],
          "ch3": "00000000000000000000000000000000000000000000000000",
          "details": {
            "battery": 531,
            "duration": 308,
            "is_summation": 0
          },
          "env": {
            "humid": 0,
            "temp": 0
          }
        }
      },
      "rx_metadata": [
        {
          "gateway_ids": {
            "gateway_id": "client-02",
            "eui": "-----------"
          },
          "timestamp": 3255658627,
          "rssi": -114,
          "channel_rssi": -114,
          "snr": -4,
          "uplink_token": "-----------------------------------",
          "received_at": "2025-01-03T14:55:40.480805518Z"
        }
      ],
      "settings": {
        "data_rate": {
          "lora": {
            "bandwidth": 125000,
            "spreading_factor": 7,
            "coding_rate": "4/5"
          }
        },
        "frequency": "903900000",
        "timestamp": 3255658627
      },
      "received_at": "2025-01-03T14:55:40.558372056Z",
      "confirmed": true,
      "consumed_airtime": "0.230656s",
      "network_ids": {
        "net_id": "000013",
        "ns_id": "EC656E0000000102",
        "tenant_id": "-----",
        "cluster_id": "nam1",
        "cluster_address": "nam1.cloud.thethings.industries",
        "tenant_address": "--------------------------"
      }
    }
  },
  "correlation_ids": [
    "gs:uplink:01JGPAME0DCX3R2DSA8PTXW85K"
  ],
  "origin": "ip-10-22-15-96.us-west-1.compute.internal",
  "context": {
    "tenant-id": "--------"
  },
  "visibility": {
    "rights": [
      "RIGHT_APPLICATION_TRAFFIC_READ"
    ]
  },
  "unique_id": "01JGPAME6XT2VC4HEP2F06NSJ6"
}
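
For reference, a minimal Python sketch for spotting this pattern across many uplinks. It assumes a file of exported events like the two above (one JSON object per line, hypothetical name events.jsonl) and only uses field names visible in the logs:

import json

# Summarise which gateway(s) heard each uplink, using the field names from the
# "as.up.data.forward" events above. Assumes one JSON event per line.
with open("events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        up = event["data"]["uplink_message"]
        f_cnt = up.get("f_cnt", 0)
        for gw in up.get("rx_metadata", []):
            print(f"f_cnt={f_cnt} "
                  f"gateway={gw['gateway_ids']['gateway_id']} "
                  f"rssi={gw.get('rssi')} snr={gw.get('snr')}")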

What I’ve Tried/Considered
Ensuring the nearest gateway is optimally positioned.
Reviewing network server configurations to prioritize closer gateways.
Exploring ways to ignore far-away gateways for specific devices (but this doesn’t seem possible).
Questions
Has anyone encountered similar issues with confirmed payloads in multi-gateway setups?
Is there a way to configure gateways or devices to prefer the nearest gateway or to ignore far-away gateways?
Any suggestions for improving ACK reliability in this scenario?
Thank you for your time and insights!

Restated as

The Problem

What may well be happening is that the nearest gateway is busy responding to a device that requested a confirmed-uplink ACK, and so it doesn’t hear you: a gateway that is transmitting is deaf to ANY uplinks from any device. The farther gateway then picks up the message, and the LNS chooses it to respond to the device requiring the ACK. This is inherent in the technology and implementation, and is one reason why we discourage the use of confirmed messages, especially on the Community network… which I assume you are not using for real-world monitoring, even if you may be using it for test purposes. Can you find other methods for deploying the app? Can you use a link check (occasionally)? How often are you uplinking? Can you decimate the number of uplinks before you need a confirmation(*)? P.S. please be aware of the TTN FUP/FAP limit on the number of downlinks allowed.

(*) In the early days the Laird RS186 T&H for cold chain only did confirmed uplinks, and a response was needed for correct operation. This wasn’t TTN friendly, so they later added both a non-proprietary payload (Cayenne LPP) and an occasional-confirmation option (1:10 or 1:20, IIRC).
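
On the downlink-limit point above: a back-of-the-envelope sketch, assuming the commonly cited Community-network Fair Use figure of roughly 10 downlinks per device per day and every confirmed uplink costing one ACK downlink:

# Rough Fair Use budget check: how many ACK downlinks a confirmed-uplink
# cadence costs per day, versus the commonly cited limit.
FUP_DOWNLINKS_PER_DAY = 10

def acks_per_day(uplink_period_minutes, confirm_every_n=1):
    """ACK downlinks per day if every n-th uplink is confirmed."""
    return (24 * 60 / uplink_period_minutes) / confirm_every_n

for period in (5, 15, 60):
    print(f"uplink every {period:2d} min, all confirmed: {acks_per_day(period):6.1f} ACKs/day")
    print(f"uplink every {period:2d} min, 1-in-10 conf.: {acks_per_day(period, 10):6.1f} ACKs/day")
print(f"budget: {FUP_DOWNLINKS_PER_DAY} downlinks/day")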

This doesn’t really explain the design or rationale in enough detail to make sense. It is generally safe to assume that uplinks get through; confirmed uplinks should be used sparingly. Whilst TTI suggest designing for a potential 10% packet loss, reality is around 2%, so blanket confirmations are a waste.

If the power goes out, why would the devices stop being able to send? If they are mains powered, they should have a battery backup so they can send a message to say that there is a power outage! At any time the firmware can send min, max, average etc. and, cunningly, can send the deltas of the last couple of readings, so if any do happen to be dropped you’ve still got a record. So the ‘special mode’ seems redundant.

Also, there is absolutely nothing the device can do if the message doesn’t get through because the network is down, so hammering away with confirmed uplinks is pointless.

But with all the deaf gateways transmitting ACKs, you probably do see a few lost uplinks.

You should have the server-side app monitoring for lost uplinks. The most reasonable strategy for device health may be one confirmed uplink per day or, even better, a scheduled downlink to let the device know it’s still talking to the mothership. A downlink can also be used to ask for a resend of any missed data.
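
A minimal sketch of that server-side check, assuming The Things Stack MQTT integration and the paho-mqtt 1.x client API; the application ID and cluster address are taken from the logs above, while the tenant and API key are placeholders:

import json
import paho.mqtt.client as mqtt

# Track the last seen frame counter per device and flag gaps (missed uplinks).
# Note: f_cnt restarts after a rejoin, so ignore apparent backwards jumps.
last_fcnt = {}

def on_message(client, userdata, msg):
    up = json.loads(msg.payload)
    dev = up["end_device_ids"]["device_id"]
    f_cnt = up["uplink_message"].get("f_cnt", 0)
    prev = last_fcnt.get(dev)
    if prev is not None and f_cnt > prev + 1:
        print(f"{dev}: missed {f_cnt - prev - 1} uplink(s) between {prev} and {f_cnt}")
    last_fcnt[dev] = f_cnt

client = mqtt.Client()  # paho-mqtt 1.x style constructor assumed
client.username_pw_set("multipurpose-monitor@TENANT", "NNSXS.XXXX")  # placeholder API key
client.tls_set()
client.on_message = on_message
client.connect("nam1.cloud.thethings.industries", 8883)
client.subscribe("v3/multipurpose-monitor@TENANT/devices/+/up")
client.loop_forever()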

Has anyone encountered similar issues with confirmed payloads? If they have, they are doing it wrong.

And confirmed uplinks directly lead to the issue that you have created for yourself. Gateways have to limit their transmissions to the legal duty cycle like all other devices on the shared ISM band. So if the nearer gateway is benched because of previous uplinks it has ACK’d, another gateway has to take over. Or the nearest gateway is busy ACK’ing another device and can’t hear the uplink sent by the second device.

Is there a way to prefer the nearest gateway? No, there is no shared knowledge of location. The LNS will know by implication of signal strength, but you’ve already hobbled it with the task at hand.

Any suggestions for improving ACK reliability? Yes, stop using confirmed uplinks. And if you are on TTN (the Community network) and sending more than one confirmed uplink every ~2.5 hours, you are in breach of the Fair Use Policy (at most 10 downlinks per device per day, ACKs included) and should move to a paid-for instance; I’m sure @rish1 will be happy to help.

And once you are on TTI, you can put in more gateways.


Thanks for the replies. I did not know that confirmed payloads are not recommended.

We are on a paid enterprise plan, not on a public network.

What is really strange is that we have several sites using more than 25 devices and 1 gateway, and the packet success rate is nearly 100%, including ACKs.

The sites with multiple gateways have considerably fewer devices (5 or fewer), and these are the sites experiencing ACK failures. If the busy gateway theory is valid, I would expect to see considerable losses on the single gateway sites.

I will adjust to remove the ACK and see if that helps.

We’d need lots more detail to debug this - device, antenna, height above ground, gateway, uplink period - and why you aren’t using ADR, which will dramatically reduce transmission time; most devices I’d expect to be on DR5 for gateways ~20 m away - that’s a quarter of the air time.
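
To put rough numbers on the airtime point, here is a sketch of the standard LoRa time-on-air formula (from the Semtech SX127x datasheet); a ~140-byte PHY payload at SF7/125 kHz reproduces the 0.230656 s consumed_airtime in the logs above, and the same payload at higher SFs shows how quickly airtime grows (the exact DR-to-SF mapping depends on the region):

from math import ceil

def lora_airtime(payload_bytes, sf, bw_hz=125_000, cr=1, preamble=8,
                 explicit_header=True, crc=True):
    """LoRa time-on-air in seconds (Semtech SX127x datasheet formula)."""
    t_sym = (2 ** sf) / bw_hz
    de = 1 if (sf >= 11 and bw_hz == 125_000) else 0   # low data-rate optimisation
    ih = 0 if explicit_header else 1
    n_payload = 8 + max(
        ceil((8 * payload_bytes - 4 * sf + 28 + 16 * crc - 20 * ih)
             / (4 * (sf - 2 * de))) * (cr + 4), 0)
    return (preamble + 4.25 + n_payload) * t_sym

# ~140-byte PHYPayload: 13 bytes of LoRaWAN overhead plus the application payload
for sf in (7, 8, 9, 10):
    print(f"SF{sf}: {lora_airtime(140, sf) * 1000:.1f} ms")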

Also the power supply - if they all come on at the same time, then they can end up in sync.

There will be some wrinkle in the details that upsets the 5-device / multi-gateway sites - the fact you ended up with more gateways there probably hints at part of the reason.

Speculating here (as Nick says, we’d need a lot more detail to nail the exact behaviour, possibly looking at logs and local environmental issues, and that would go beyond volunteers helping with ideas on a forum), but with a single GW the likely behaviour is that it ‘has’ to deal with everything, and if devices don’t see an ACK they assume the uplink wasn’t received and try again, whereas multiple GWs would catch most messages first time around. You then run into issues with collisions and overlapping messages (especially with the ACK from the distant GW), with local reception being impacted by the classic RF ‘near/far’ problem (GIYF).

One of the beauties of LoRa modulation is its resilience (note: not immunity) where this is a factor, with mileage determined in part by the mix of RF strength, channel used and (usually DR/SF-related) collision characteristics. Though the fact you use a fixed DR reduces variation and complexity and helps evaluation on one hand, it also removes one of the benefits of LoRa: the pseudo-orthogonality of the SFs provides some of the resilience one would expect to improve reliability. As Nick says, ADR is your friend for a fixed install :slight_smile: Where the devices settle in terms of individual SF will depend on where they are mounted - external device with the sensor inside on a flying lead, or the whole sensor inside the unit - plus unit construction and materials, orientation relative to the GW location, etc.

I have been running a real monitor/eval system here at my home office covering 3 fridges and 3 freezers plus a number of other temperature sensors for other applications (and other devices(*)); some using the aforementioned Lairds, some modified TTN UNOs, some TTN Nodes and various others (Dragino, Elsys, RAK, etc.), with multiple GWs in short range (variable numbers depending on what I deploy locally or have in the lab for commissioning/test or updates and redeployments - typically 4-12 in range here or in the neighbourhood at any time!). The core system has been running for over 6 years, scaling up and down as I have added/removed test sensors for evaluations etc. The basic setup - GW plus fridge & freezer monitoring (usually plus other sensors) - has been replicated on a smaller scale at several test sites: student accommodation, tenanted properties, rentals/holiday homes, remote properties, offices etc.

With usually low packet loss (Nick’s 2% is probably close to the average, though some show long-term trends <<1%), most have been single-GW reception, but several have had at least one other TTN or peered GW in range, either picking up most packets or providing backup when the local GW/connectivity goes offline. A classic is a setup in the Thames Valley near me where the onsite GW fails occasionally (usually a cellular backhaul issue) and, after a short period, the device manages to connect via a TTN peered GW (thanks SmartBerks/local authority!) approx 3/4 of a km away (just about LOS), and I then see the DR switch up to maintain the connection… I try to fix it quickly (days/weeks in this context!!!) to minimise the hit to the device battery :wink:

(*) Interestingly, though originally set up as a 3-6 month experiment/demo for a potential client deployment and system eval, I ended up keeping and tracking it long term. Over that time I have noted long-term emerging issues with the monitored items - sticking thermostats changing cooling/freezing behaviour, long-term temperature drift as units lost efficiency, even putting vibration sensors on compressors and seeing degrading behaviour and predictive-maintenance potential! Also (as a small-scale model) it provided a demo/eval for larger-scale industrial walk-in units, to provide alarming for open/badly closed doors etc. :slight_smile: (options for door-closure sensors or an internal light monitor etc.)

Stealing that idea right now!

In return, monitor the power use - some freezers can keep the temperature in range when a door is compromised - chest ones particularly, with those sliding windows that the great unwashed leave open in supermarkets - but as the chiller unit has to work much harder, it’s a good way to determine something’s wrong.

You’re welcome! :slight_smile:

Conceived a smart PM implementation in the early days of LoRa (long before LoRaWAN/TTN became a thing!) that is too valuable for a general post as it is used commercially, but I will share it directly via email under the usual confidentiality later - busy now with other stuff, but if you don’t see it in the next 24 hrs poke me to remind :wink:

Indeed - monitoring the changing cycles (especially over seasons and changing local environmental conditions) can tell you a lot. You may recall the charity community/pro bono deployment I did a couple of years back; it included monitoring their freezers over several weeks and months and showed that they were actually running uneconomically - far too cold for their needs - with extra wear on the system and excessive electricity consumption. We adjusted operation and saved enough electricity to pay for the monitor in under 12 months, as well as reducing the failure/repair rate, extending life and avoiding replacement costs :slight_smile: (Some of their fridges, on the other hand, were often fine when recorded manually for records/audit, but with frequent use during a normal day were way too warm for too long and needed the temperature tweaking down, as the ’stats were set too high and the compressors too slow/old to catch up and maintain targets.) … I guess we are getting off topic now!

Did a stint in the Walls Ice Cream factory in Gloucester where they had airlock doors, which can be stupidly helpful too - it stops the door to the warehouse being propped open so that the space heats up and loses an insulation layer - insignificant in many situations, but when it’s an issue, it’s potential for huge savings.

The main storage warehouses there ARE the freezer, which is a whole other ball game.

Thanks for the input guys.

I’ll work on refactoring our firmware to mostly remove confirmed payloads and just check in with a confirmed uplink every 10 cycles or so, to see if that helps.
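
For the RN2483 that cadence is just a matter of alternating the uplink type on its serial command set (mac tx cnf vs mac tx uncnf). A rough host-side sketch in Python with pyserial, with a placeholder serial port, port 12 as in the logs, and a dummy payload - real firmware would build the sensor record instead:

import time
import serial  # pyserial

PORT = "/dev/ttyUSB0"   # placeholder serial port
CONFIRM_EVERY = 10      # one confirmed uplink per 10 cycles

def send(uart, cmd):
    """Send one RN2483 command and return its immediate response line."""
    uart.write((cmd + "\r\n").encode())
    return uart.readline().decode().strip()

uart = serial.Serial(PORT, 57600, timeout=15)
cycle = 0
while True:
    payload_hex = "00"  # dummy payload; build the real sensor record here
    kind = "cnf" if cycle % CONFIRM_EVERY == 0 else "uncnf"
    print(send(uart, f"mac tx {kind} 12 {payload_hex}"))  # "ok" or an error
    print(uart.readline().decode().strip())               # "mac_tx_ok", "mac_rx ..." or "mac_err"
    cycle += 1
    time.sleep(600)     # uplink period; tune to your airtime/downlink budget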

There are whole chunks of the spec that should be doing this for you already, which you might just be able to turn on if you have a full LoRaWAN stack - look for LinkCheck (LinkCheckReq/LinkCheckAns).

And if the packets are relatively small you can always send the latest reading with the previous reading as a delta, so you get some overlap. Sending 1 dp temperatures in two bytes with a one-byte diff gives ±12.7 °C for the delta. Depending on payload size, you can add one to three bytes for free due to the way the chirps work.
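
A sketch of that packing (current reading as a signed 16-bit value in 0.1 °C steps, previous reading carried as a signed one-byte delta, three bytes on the wire; the decoder side would be mirrored in your payload formatter):

import struct

def encode(curr_temp_c, prev_temp_c):
    """Pack current reading (int16, 0.1 degC steps) plus delta to previous (int8)."""
    curr = round(curr_temp_c * 10)
    delta = max(-127, min(127, round((prev_temp_c - curr_temp_c) * 10)))
    return struct.pack(">hb", curr, delta)

def decode(buf):
    curr, delta = struct.unpack(">hb", buf)
    return curr / 10.0, (curr + delta) / 10.0

payload = encode(-18.2, -18.9)         # three bytes on the wire
print(payload.hex(), decode(payload))  # recovers current and previous reading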

And don’t forget the ADR!