TTIG Reliability Check

vicatcu · August 6, 2019, 2:49pm

My TTIG has been decidedly less effective at forwarding packets to TTN since “the event” in this post (the outage from 7/26 to 7/29). I’ve calculated the number of missing reports from one device that nominally reports once every ten minutes and plotted that count over time from 7/21 to 8/6. Here’s what it looks like (minutes from start along the x-axis):

My endpoint tries to report on a regular 10 minute interval, but the actual interval is more like 8.2 minutes. The x-axis is minutes from the start of the data. The number of missing reports is inferred from the time since the last report (i.e. ROUND(deltaT / 8.2) - 1, with deltaT in minutes). There’s a pretty obvious increase in the incidence and duration of runs of missing reports after the outage. Here are histograms of the missing reports before and after the event.
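For anyone who wants to run the same check on their own data, here is a minimal sketch of the calculation described above. It assumes you have a sorted list of report timestamps; the function name `missing_reports` and the sample timestamps are made up for illustration, and 8.2 minutes is just the interval observed for this one device.

```python
import math
from datetime import datetime, timedelta

NOMINAL_INTERVAL_MIN = 8.2  # observed interval for this device; adjust for yours


def missing_reports(timestamps, interval_min=NOMINAL_INTERVAL_MIN):
    """Return (minutes_from_start, missing_count) for each consecutive pair.

    missing_count follows the formula from the post:
    ROUND(deltaT / interval) - 1, clamped at zero.
    """
    results = []
    start = timestamps[0]
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta_min = (curr - prev).total_seconds() / 60.0
        # floor(x + 0.5) rounds halves up, like a spreadsheet ROUND;
        # Python's built-in round() rounds halves to the nearest even number.
        expected = math.floor(delta_min / interval_min + 0.5)
        missing = max(0, expected - 1)
        minutes_from_start = (curr - start).total_seconds() / 60.0
        results.append((minutes_from_start, missing))
    return results


if __name__ == "__main__":
    # Hypothetical data: three on-time reports, then a gap of three intervals.
    t0 = datetime(2019, 7, 21)
    sample = [t0 + timedelta(seconds=s) for s in (0, 492, 984, 2460)]
    for minutes, missing in missing_reports(sample):
        print(f"{minutes:8.1f} min  missing={missing}")
```

Splitting the resulting list at the outage and counting how often each `missing` value occurs on either side would give the before/after histograms.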

Is it just me, or is this compelling evidence of something the broader (at least us-west) community is also experiencing?

P.S. This is a continuation of “TTIGs Report Not Connected in TTN Console”.

vicatcu · August 6, 2019, 2:54pm

Here’s how it shows up in my data. You can see the difference in the time-density of data points before and after the 3-day gap.

cslorabox · August 6, 2019, 3:11pm

Comparing experiences with others is of course potentially interesting.

The general problem with your situation is that there are at least three distinct points of possible failure (really, four) and you don’t have visibility into where that occurs.

1. Your node could be failing to transmit (or could be getting stuck trying to get a join accept). Ideally you could verify this with serial debug output or similar from the node firmware.
2. You could be suffering radio interference. Sporadic failures that started at some point could sort of point to that. Testing in another location could help isolate it. A histogram of packet frequencies could also be interesting: I had a situation where one channel showed a far lower success rate for a while. Of course, you need to compare it against the histogram of frequencies actually transmitted, or be confident that your channel selection is truly fair.
3. The gateway itself could be having issues. There have been some rumors of concern in the past, where people ran side-by-side tests with other gateways, but those may have been clouded by the oddities of nodes placed unreasonably close. It is also hard to evaluate given the lack of firmware sources to add debugging to.
4. There could be server infrastructure issues. My understanding is that traffic from a TTIG in particular has to go through an additional translation stage before hitting the main server infrastructure used by other gateway types, as it uses a scheme that the rest of the system hasn’t yet caught up with.
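The per-channel comparison mentioned under radio interference can be sketched roughly as follows. This assumes the node logs the frame counter and frequency of every uplink it actually transmits (for example over serial debug), and that you can export the frame counters that arrived at the application. The function name and the sample values are hypothetical.

```python
from collections import Counter


def per_channel_success(sent, received_counters):
    """Return {frequency: fraction of transmitted uplinks that were received}.

    sent: iterable of (frame_counter, freq_mhz) logged by the node.
    received_counters: set of frame counters seen on the server side.
    """
    sent_by_freq = Counter()
    ok_by_freq = Counter()
    for fcnt, freq in sent:
        sent_by_freq[freq] += 1
        if fcnt in received_counters:
            ok_by_freq[freq] += 1
    # Dividing by what was actually transmitted on each channel avoids
    # assuming that channel selection is fair.
    return {freq: ok_by_freq[freq] / n for freq, n in sent_by_freq.items()}


if __name__ == "__main__":
    # Illustrative values only, not real measurements.
    sent = [(1, 903.9), (2, 904.1), (3, 903.9), (4, 904.1)]
    received = {1, 3, 4}
    for freq, rate in sorted(per_channel_success(sent, received).items()):
        print(f"{freq:.1f} MHz: {rate:.0%} received")
```

A channel whose rate sits well below the others would hint at interference on that frequency rather than a gateway or server problem, though with small sample counts the difference may just be noise.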

In a production-grade system, one has the ability to evaluate the overall operation from the perspective of each of these stages and components. Where some stages are opaque, as here, the debugging process becomes more challenging.