TTIG Problems: no location data, wrong date/time, wrong channel and stability issues

Franz_Refle · September 8, 2019, 3:26pm

The only difference in Bootlogs is as follows:

1970-01-01 00:00:05.276 [SYS:DEBU]   Free Heap: 55888 (min=55456) wifi=1 mh=3 cups=0 tc=0
scandone
state: 0 -> 2 (b0)
state: 2 -> 2 (8a0)
state: 2 -> 0 (2)
reconnect

Sorry for my poor English, and for the bad descriptions.

09.09.2019:
I’ve tested the TTIG connected to a mobile Hotspot (Smartphone) this morning.
Same problems as above …
After a power cycle, the connection gets lost after sending the first packet; the LED stays constant green.
Only pressing setup for 10 sec, waiting some seconds until it blinks slowly red, and then pressing setup for 5 sec recovered it.
So I think it isn’t a matter of my WLAN setup.
Same results whether connected to a Fritzbox, Cisco AP, Asus AP or mobile hotspot.

Any Suggestions ?
Should I complain to RS Components a second time ?
Or throw away the damned thing as suggested in the first posts ?

If I terminate and reenable the mobile Hotspot, things go wrong again …
same as power cycling the TTIG

After losing the connection the TTIG reboots (as seen in the Debug Output) and immediately rescans the WLAN. If the AP is available in this scan, things go wrong as described above.
If I switch on the AP AFTER this first rescan, the TTIG rescans the WLAN a second time after about 30 s (without rebooting), finds the AP and things go right … the TTIG functions properly.

But there’s no workaround in case of

short power loss
short wifi loss

Why the reboot if WiFi gets lost for more than 1 sec ?
Maybe a bug …
Can anybody out there verify these reboots ?

@bei: I’m not allowed to post a reply for the next 5 hours …
Here’s my Setup:
I have a simple TTN Node (Arduino+RFM95) which sends a temperature packet every 15 secs. After the described faulty reboot the TTIG forwards exactly 1 packet, as seen in the TTN Console of both gateway and device. Logging the DEBUG output of the TTIG doesn’t show any differences between “good” and “bad” state. The TTIG receives all packets sent by the node but stops forwarding them after the first sent packet. This is the condition ‘connection gets lost after sending the first packet’. If I activate my DIY 1ch RASPI GW, this GW forwards exactly every 8th packet (1 channel of 8) without any problems over the same infrastructure.

@bei: I think I should wait the 4 hours until I’m permitted to reply …
TTIG Debug Output in failure state doesn’t differ from “good” state:

2019-09-09 08:37:08.883 [S2E:VERB] RX 867.7MHz DR5 SF7/BW125 snr=10.0 rssi=-12 xtime=0x77000005C08C3C - updf mhdr=40 DevAddr=26011CB2 FCtrl=80 FCnt=870 FOpts=[] 01AFB05F mic=2133244814 (16 bytes)
2019-09-09 08:37:09.298 [SYS:DEBU] Free Heap: 18872 (min=17064) wifi=5 mh=7 cups=8 tc=4
And yes, I’m looking at the TTN Console Traffic page of the GW.
Packet capture maybe next weekend; I have to recover an old WRT54G first. Or is it possible to do a capture on my Android smartphone ?
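
One way to check from the debug output alone whether the TTIG really receives every uplink is to pull the frame counters out of the RX lines and look for gaps. This is only a rough sketch: the regular expression is derived from the single sample line quoted above and may not fit other firmware versions, and `parse_rx` and `missing_fcnts` are hypothetical helper names, not part of any TTIG or Basic Station tooling.

```python
import re

# Matches the "[S2E:VERB] RX ..." uplink lines as quoted in this post.
# Assumption: the pattern is inferred from that one sample line only.
RX_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[S2E:VERB\] RX "
    r"(?P<freq>[\d.]+)MHz DR(?P<dr>\d+) SF(?P<sf>\d+)/BW(?P<bw>\d+) "
    r"snr=(?P<snr>-?[\d.]+) rssi=(?P<rssi>-?\d+) .*?"
    r"DevAddr=(?P<devaddr>[0-9A-Fa-f]{8}) .*?FCnt=(?P<fcnt>\d+)"
)


def parse_rx(line):
    """Return the interesting fields of an RX log line, or None for other lines."""
    m = RX_LINE.match(line)
    if m is None:
        return None
    return {
        "ts": m["ts"],
        "freq_mhz": float(m["freq"]),
        "snr": float(m["snr"]),
        "rssi": int(m["rssi"]),
        "devaddr": m["devaddr"],
        "fcnt": int(m["fcnt"]),
    }


def missing_fcnts(lines, devaddr):
    """List frame counters of one device that never showed up in the log.

    Counter wrap-around is ignored to keep the sketch short.
    """
    records = (parse_rx(line) for line in lines)
    seen = {r["fcnt"] for r in records if r and r["devaddr"] == devaddr}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# The two log lines quoted in the post.
SAMPLE_LOG = [
    "2019-09-09 08:37:08.883 [S2E:VERB] RX 867.7MHz DR5 SF7/BW125 snr=10.0 "
    "rssi=-12 xtime=0x77000005C08C3C - updf mhdr=40 DevAddr=26011CB2 FCtrl=80 "
    "FCnt=870 FOpts=[] 01AFB05F mic=2133244814 (16 bytes)",
    "2019-09-09 08:37:09.298 [SYS:DEBU] Free Heap: 18872 (min=17064) "
    "wifi=5 mh=7 cups=8 tc=4",
]

if __name__ == "__main__":
    print(parse_rx(SAMPLE_LOG[0]))
    print(missing_fcnts(SAMPLE_LOG, "26011CB2"))
```

An empty list from `missing_fcnts` over a longer capture would support the statement that the gateway hears every packet; it says nothing about whether the LNS routes them.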

bei · September 9, 2019, 7:18am

After a power cycle, the connection gets lost after sending the first packet; the LED stays constant green.
Only pressing setup for 10 sec, waiting some seconds until it blinks slowly red, and then pressing setup for 5 sec recovered it.

Do you have logs which show that? How do you determine the condition ‘connection gets lost after sending the first packet’?

EDIT 1:
@Franz_Refle Thanks for the clarification. So, the TTIG logs do not indicate connection loss in what you call ‘bad state’, correct? If the TTIG was unable to forward received LoRa packets to the LNS, the internal packet buffer would fill up and you would see log messages indicating that (see src/ral_lgw.c in the lorabasics/basicstation repository on GitHub). How do you determine that ‘the TTIG receives all Packets sent by the Node but stops forwarding them after the first sent packet’? My guess is that you are looking at the TTN console. The robust way to determine that is to do an IP packet capture on the websocket connection. Do you have the means to do that?

EDIT 2:
I was able to reproduce the issue on my TTIG. Let me try to formalize:

The failure condition

In the failure condition, the TTIG has an active TCP connection to the LNS back-end (solid GREEN LED) but LoRa uplinks do not appear in the TTN console. The DEBUG logs indicate that LoRa messages are received and forwarded to the LNS via the websocket connection. An IP packet capture on the websocket connection supports the observation that TCP packets are sent to the LNS back-end: on the TCP layer, the LNS acks all packets sent by the TTIG. This shows that the TCP connection is healthy and that forwarded packets are received by the LNS websocket server. However, the LoRa uplink messages do not make it to the LNS’s packet routing logic. In low-activity scenarios, the LNS will reset the websocket connection if no TCP packets are transmitted over the connection for a certain amount of time. After this server-initiated reset and re-connect, everything is back to normal. In high-activity scenarios the websocket connection will not be reset by the LNS because it is seeing TCP packets coming in. In that case, the failure condition persists in steady state.

How the failure condition is triggered

The failure condition occurs whenever there is a ‘fast’ TTIG-initiated reconnect.
The TTN LNS regularly performs server-initiated connection resets whenever no activity is detected on the websocket connection (apparently ‘activity’ in this context is measured on the TCP level and not on the LoRa-packet level). These server-initiated re-connects do not trigger the failure condition.
In some scenarios the client (TTIG) may reset and re-establish its connection quickly, for example due to ‘short’ power loss or ‘short’ loss of wifi connectivity. After re-connection, the LNS receives and routes the first LoRa packets (they show up in the TTN console). After a short time (around 10-15s after re-connect) the failure condition kicks in: the packets stop appearing in the console, although the TCP connection is alive and healthy.

A wild guess at the root cause

To me this looks like a race condition in the connection reset logic of the LNS websocket server. Apparently, the LNS websocket server is measuring connection activity on the TCP level and resets the connection after no activity is detected for a particular timeout T. Probably this logic also triggers the destruction of data structures representing the gateway in order to free up resources associated with the just closed connection. In the case of an (unclean) client-side connection reset, the same resource cleanup logic is triggered after timeout T on the dead connection. If the client re-connects before T has expired, there will be two TCP level connections tied to the same internal logical gateway representation. After timeout T hits due to inactivity on the first (dead) connection, the gateway data structure is destroyed and any packets coming in through the second (healthy) connection will not have the required context to properly route the LoRa packets up the stack.
As said, this is a wild guess, but to me this could explain the behavior we are seeing. Hopefully this helps to locate the true issue in the back-end.
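
To make the guess above concrete, here is a toy simulation of it. It is not the TTN back-end and not based on its source: `ToyLNS`, `run`, the 10 s timeout and the 5 s uplink interval are all made up for illustration. The only thing it models is the suspected behaviour that the idle cleanup of a dead connection destroys a gateway record that a newer connection still depends on.

```python
class ToyLNS:
    """Toy model of the guessed server behaviour (hypothetical, for illustration)."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # the timeout T from the post
        self.gateways = {}                # gateway id -> routing context
        self.conns = {}                   # connection id -> [gateway id, last activity]

    def connect(self, conn_id, gw_id, now):
        # A new connection reuses the gateway record if one already exists.
        self.gateways.setdefault(gw_id, {"id": gw_id})
        self.conns[conn_id] = [gw_id, now]

    def uplink(self, conn_id, now):
        """Return True if the uplink reaches the routing logic."""
        gw_id = self.conns[conn_id][0]
        self.conns[conn_id][1] = now      # activity is counted on the TCP level
        return gw_id in self.gateways     # no record: acked on TCP, then dropped

    def tick(self, now):
        for conn_id, (gw_id, last) in list(self.conns.items()):
            if now - last >= self.idle_timeout:
                del self.conns[conn_id]
                # Suspected flaw: cleanup is keyed on the gateway, not on the
                # connection, so it also hits a newer, healthy connection.
                self.gateways.pop(gw_id, None)


def run(reconnect_at, idle_timeout=10):
    """The first connection dies uncleanly at t=0; the TTIG reconnects later.

    Returns, for every uplink sent on the new connection (one every 5 s up to
    t=30), whether it was routed.
    """
    lns = ToyLNS(idle_timeout)
    lns.connect("c1", "ttig", now=0)
    routed = []
    connected = False
    for now in range(1, 31):
        lns.tick(now)
        if now == reconnect_at:
            lns.connect("c2", "ttig", now)
            connected = True
        if connected and now % 5 == 0:
            routed.append(lns.uplink("c2", now))
    return routed


if __name__ == "__main__":
    print("fast reconnect (t=3): ", run(3))
    print("slow reconnect (t=12):", run(12))
```

In this model a reconnect before T lets the first uplink through and then drops everything once the dead connection times out, while a reconnect after T works throughout. That matches the observed pattern, but it only shows the guess is self-consistent, not that it is what actually happens in the back-end.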

Franz_Refle · September 9, 2019, 12:58pm

@bei: full ack to your description of failure condition and trigger.
I only wonder why the TTIG is rebooting if the WLAN outage lasts longer than a second …
My findings are as follows:

if WLAN Outage < 1 sec then reconnect, – no problems
if WLAN Outage > 1 sec and shorter than about 10 sec then reboot, reconnect, – problems
if WLAN Outage > 10 sec then reboot, 1st reconnect fails, 2nd reconnect succeeds, – no problems

Is this reboot after 1 sec WLAN outage normal behaviour of the TTIG or is it caused by a flaw ?
The problem seems to me to be the 1st reconnect after a reboot, but I’m not a big SW engineer who knows the TTIG FW in detail … only a hobbyist. This behaviour is reproducible and I have logs of it.

UdLoRa · September 9, 2019, 2:04pm

This is why I have not answered yet. It needs some time, however I will do it soon.
However, I see there is definitely a batch of TTIGs with problems. Mine was one of the first, although not taken at the conference.

By the way, for me the NOC provides coordinates, metadata in packets do not.

bei · September 9, 2019, 2:09pm

I don’t see any problem on the TTIG side. Dropping the connection uncleanly and re-establishing it shortly thereafter is not an uncommon case in networking which the server should be able to handle gracefully.

Franz_Refle · September 9, 2019, 2:43pm

@bei: full ack, server should be able to handle this …
I’m only in doubt whether there’s a need to reboot the whole gateway 1 sec after WLAN is lost …
Is this behavior normal ?

Both events, a short power loss and also a short WLAN loss, may occur sometimes in real life …
But those two events aren’t handled correctly, either by the server or by the gateway … that’s my problem

Can’t find a workaround

hilltronic · September 9, 2019, 3:17pm

Hi Guys,

I just installed my brand new TTIG and have problems getting my test device to join.

If it is placed in a location where only the TTIG can receive it, it doesn’t join, even though I can see all the Join Requests and also the Accept answers in the console.

Then I placed the device in a different location where it could be heard by other gateways, and it could join there.
The only difference I can see is that the date on my gateway is 1970… Could this be the reason for not being able to join?

(Before I came across this 1970 problem I created this post, sorry.)

Thanks for help
Wolf.

jpmeijers · September 9, 2019, 3:33pm

TTN Mapper has been updated to use the gateway-data API and not the noc API. Things Indoor Gateways should therefore start to appear on TTN Mapper.

Franz_Refle · September 9, 2019, 4:16pm

joining works flawlessly on my TTIG
I’ve joined a PAX-Counter without any problems

Martin_Kautenburger · September 9, 2019, 7:32pm

Yeah, very good quickfix solution

Franz_Refle · September 10, 2019, 7:05am

@bei: if there’s no problem on the TTIG side, will the problem be fixed on server side in the future ?
Otherwise I’m thinking of the following workaround: I’ll build a little ATTiny14 or something similar into the TTIG, which listens to the Debug output. If it discovers a reboot of the TTIG it will switch off the WLAN AP for about 10 seconds, so that the first reconnect fails … the second reconnect will then succeed and all is fine. That’s what I’ve tested manually. In German one would say: “Von hinten durch die Brust ins Auge … aber funktioniert” (roughly: a very roundabout way of doing it … but it works).
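
The detection half of this workaround could look roughly like the sketch below, written in Python rather than as microcontroller firmware. It rests on an assumption that is not confirmed anywhere in this thread: that a reboot can be recognised by the debug timestamps restarting at 1970-01-01 (as in the boot log quoted earlier) before the clock is set again. `RebootWatcher` and the `on_reboot` callback are hypothetical names; how the AP is actually switched off is left open.

```python
import re

# Leading timestamp of a debug line, e.g. "1970-01-01 00:00:05.276 ".
TIMESTAMP = re.compile(r"^(\d{4})-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ")


class RebootWatcher:
    """Calls on_reboot() once each time the log clock falls back to 1970.

    Assumption: a fresh boot logs with 1970-01-01 timestamps until the
    clock has been set.
    """

    def __init__(self, on_reboot):
        self.on_reboot = on_reboot
        self.in_boot = False  # True while we are seeing 1970 timestamps

    def feed(self, line):
        m = TIMESTAMP.match(line)
        if m is None:
            return  # lines such as "scandone" carry no timestamp
        unsynced = m.group(1) == "1970"
        if unsynced and not self.in_boot:
            self.on_reboot()
        self.in_boot = unsynced


if __name__ == "__main__":
    events = []
    # Stand-in for "switch the AP off for about 10 seconds".
    watcher = RebootWatcher(lambda: events.append("block AP for ~10 s"))
    for line in [
        "2019-09-09 08:37:09.298 [SYS:DEBU] Free Heap: 18872 (min=17064) wifi=5 mh=7 cups=8 tc=4",
        "1970-01-01 00:00:05.276 [SYS:DEBU]   Free Heap: 55888 (min=55456) wifi=1 mh=3 cups=0 tc=0",
        "scandone",
        "reconnect",
    ]:
        watcher.feed(line)
    print(events)
```

A gateway whose clock never leaves 1970 would only trigger this once, so the timestamp heuristic would need replacing with some other boot marker in that case.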

BoRRoZ · September 10, 2019, 7:12am

seiichiro0185 · September 10, 2019, 7:26am

Thanks for the fix, my TTIG showed up on ttnmapper.org just now (although it is still marked offline despite being online, but at least it’s visible at all).
Edit: switched to online just now, so ttnmapper seems fine now.

hilltronic · September 10, 2019, 8:58am

My problem with “Joining” is solved: I needed to set ATS220=3 on my Adeunis Tester.
(extend RX Windows Timing for Testhouse certification).

Now everything works fine with the TTIG (I also don’t see regular reboots etc.)

Thanks
Wolf.

Franz_Refle · September 11, 2019, 6:36am

I’d also like to know this.

redwirelessus · September 11, 2019, 5:59pm

This is why we’ve chosen to ‘wait’ and/or use something a bit more ‘robust’ like Laird Sentrius for indoor commercial applications…

There, I couldn’t read

redwirelessus · September 11, 2019, 6:08pm

We asked them (TTI) why Ethernet was NOT an option for this little very nicely priced IG, and we haven’t got an answer. Relying solely on WiFi as backhaul, especially on the very crowded 2.4GHz RF space with all the unpredictability that WiFi entails, IOHO, is not very wise, and it is starting to show. I think the reasoning here, much like Semtech is doing with their new ‘kit’, is to follow the ‘in a Box’ biz model. We don’t think, again IOHO, that’s a very wise move. Add that to the ‘bugs’ everyone’s referring to here, and well, you get this long thread. Our Laird Sentrius are definitely more expensive, 2-3 times the price, but they work (ironically) right out of the ‘box’ the 1st time and no problems thereafter…

kersing · September 11, 2019, 6:49pm

Apart from TTIG bashing, what is your point?

The TTIG does not support wired ethernet because it is based on a cheap chipset that does WiFi. The hardware was designed some time ago for a different product and with new software re-released as TTIG at a very competitive price point. Redesign and re-certification (CE/FCC etc) would increase the price and add little (imho).

redwirelessus · September 11, 2019, 7:26pm

I find that insulting - who’s bashing anyone here? It is a valid, professional point that, bashing or not, is still pressing and very much relevant. As such, if it’s already giving so much headache, how can that be considered ‘competitive’? Sounds more like ‘rushed’ or ‘cheap’, which is very un-TTI-esque. We are very appreciative of TTN/TTI, extremely supportive and collaborative, so if you or anyone consider the sound based criticism as ‘bashing’ then maybe that’s something you need to work with - not us.

kersing · September 11, 2019, 8:55pm

Ok, I should not have called it ‘bashing’, my apologies.

I think you could make your point differently. Posting a huge screenshot of a tweet where you could just as easily have posted a link to twitter and provided some context in the message does set the tone. Stating you prefer wired connections, which as we all know the TTIG does not support, and comparing it with a product which is 2-3 times the price (and had issues with the first firmware release as well (frequent packet forwarder failures)) is not particularly helpful for people with issues with the TTIG. The Sentrius gateways have been available for over two years, so if anyone considered them an option they would have bought one and not waited for the hard to obtain TTIG. (And with the newer firmware I would certainly recommend the Sentrius gateways!)

With regard to the stability of the TTIG, maybe it does need some time to mature? A brand new packet forwarder, not supported by the back-end (yet) and as a result needing a protocol bridge that does not scale well, is not an advertisement for a mature product. Let’s face it, releasing a gateway that is not finished is very TTI-esque. Ask many backers of the original TTN gateway or read the messages on the forum regarding the problems people are (still) having.

TTI does a decent job providing the community back-end, however I’ve talked to several commercial customers that have shied away from TTN lately because of the stability issues and outages. A year ago the back-end was a lot more stable. Once people start looking for alternatives the community loses gateways and coverage.