TTIG Problems, - no location data, wrong date/time, wrong channel and stability issues

kersing · September 8, 2019, 1:25pm

Why do you think I referenced a TTN core member in the message?

Franz_Refle · September 8, 2019, 1:28pm

Hi Guys, i’ve managed to activate the USB Debug Port, here some lines of normal Debug Output:

2019-09-08 09:44:05.891 [SYN:INFO] Time sync qualities: min=1049 q90=1070 max=1079 (previous q90=1071)
2019-09-08 09:44:06.684 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:07.688 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:07.997 [SYN:VERB] Time sync rejected: quality=1079 threshold=1070
2019-09-08 09:44:08.692 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:09.696 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:10.700 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:11.705 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:12.708 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:13.712 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:14.716 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:15.723 [S2E:VERB] RX 867.1MHz DR5 SF7/BW125 snr=9.5 rssi=-31 xtime=0xFA00033E3CA40B - updf mhdr=40 DevAddr=26011CB2 FCtrl=80 FCnt=28611 FOpts=[] 0199235D mic=-1920876201 (16 bytes)
2019-09-08 09:44:15.734 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:16.740 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:17.744 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:18.748 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:19.752 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:20.755 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:21.759 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:22.763 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:23.767 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:24.771 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:25.774 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:26.778 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:27.782 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:28.786 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:29.790 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4

…
and then a failure:

2019-09-08 09:44:29.790 [SYS:DEBU]   Free Heap: 18872 (min=17072) wifi=5 mh=7 cups=8 tc=4
9-09-08 09:44:34.809 [SYS:DEBU]   Free Heap: 17272 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:35.813 [SYS:DEBU]   Free Heap: 17272 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:36.817 [SYS:DEBU]   Free Heap: 17272 (min=17072) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:37.821 [SYS:DEBU]   Free Heap: 17272 (min=17072) wifi=5 mh=7 cups=8 tc=4
bcn_timout,ap_probe_send_start
2019-09-08 09:44:38.824 [SYS:DEBU]   Free Heap: 17272 (min=16680) wifi=5 mh=7 cups=8 tc=4
2019-09-08 09:44:39.828 [SYS:DEBU]   Free Heap: 17272 (min=16680) wifi=5 mh=7 cups=8 tc=4
ap_probe_send_over, rest wifi status to disassoc
state: 5 -> 0 (1)
rm 0
pm close 7 0 0/1080320871
2019-09-08 09:44:40.832 [SYS:DEBU]   Free Heap: 18320 (min=16680) wifi=4 mh=7 cups=8 tc=4
reconnect
scandone
no FRITZ!Box 7590 AC found, reconnect after 1s
reconnect
2019-09-08 09:44:41.835 [SYS:DEBU]   Free Heap: 18320 (min=16680) wifi=1 mh=7 cups=8 tc=4
2019-09-08 09:44:42.004 [SYS:INFO] Restarting
scandone
del if0
usl
sul 0 0

… and a reboot

 ets Jan  8 2013,rst cause:1, boot mode:(3,7)

load 0x40100000, len 2408, room 16 
tail 8
chksum 0xe5
load 0x3ffe8000, len 776, room 0 
tail 8
chksum 0x84
load 0x3ffe8310, len 632, room 0 
tail 8
chksum 0xd8
csum 0xd8

2nd boot version : 1.6
  SPI Speed      : 80MHz
  SPI Mode       : QIO
  SPI Flash Size & Map: 32Mbit(1024KB+1024KB)
jump to run user1 @ 1000

~r��b�����b��blb���b|r�l��lb��n��nn����l��lll�����l`�n������bll�r�b����bl�b�lrl����n��r��n|�llll`�mode : sta(2c:f4:32:50:bf:bb)
add if0
1970-01-01 00:00:02.150 [SYS:INFO] FSCK found section marker A0
1970-01-01 00:00:02.162 [SYS:INFO] FSCK section A: 83 records, 12164 bytes used, 1011836 bytes free
1970-01-01 00:00:02.632 [SYS:INFO] FSCK section A followed by erased flash - all clear.
1970-01-01 00:00:02.643 [any:INFO] HW pipe bound with fd=0
1970-01-01 00:00:02.645 [any:INFO] AIO registered for HW pipe &aio=0x3ffeea30


1970-01-01 00:00:00.006 [SYS:DEBU] ======= VER ======
1970-01-01 00:00:00.008 [SYS:DEBU] Station Version   2.0.0(minihub/debug)
1970-01-01 00:00:00.010 [SYS:DEBU] Version Commit    e17c5af
1970-01-01 00:00:00.014 [SYS:DEBU] Station Build     2018-12-06 09:30:37
1970-01-01 00:00:00.020 [SYS:DEBU] Firmware Version  2.0.0
1970-01-01 00:00:00.025 [SYS:DEBU] FW Flavor ID      semtech0
1970-01-01 00:00:00.031 [SYS:DEBU] Model             minihub
1970-01-01 00:00:00.041 [SYS:DEBU] ======= SYS ======
1970-01-01 00:00:00.042 [SYS:DEBU] CPU Freq          80 / 80000000 / 80000000
1970-01-01 00:00:00.048 [SYS:DEBU] Random Number     -1915626743
1970-01-01 00:00:00.053 [SYS:DEBU] Reset cause       4
1970-01-01 00:00:00.058 [SYS:DEBU] Booting USER_BIN  1
1970-01-01 00:00:00.063 [SYS:DEBU] FW start addr     0x00001000
1970-01-01 00:00:00.069 [SYS:DEBU] SDK version       2.0-dev(9ec59b5)
1970-01-01 00:00:00.075 [SYS:DEBU] Free Heap Startup 56160 bytes
1970-01-01 00:00:00.081 [SYS:DEBU] ======= MFG ======
1970-01-01 00:00:00.086 [SYS:DEBU] SN                TBMH100868003850
1970-01-01 00:00:00.092 [SYS:DEBU] MTIME             2019-07-16 06:53:48
1970-01-01 00:00:00.101 [SYS:DEBU] PERSO_HASH        0xfcd3e0c8
1970-01-01 00:00:00.104 [SYS:DEBU] AP_PW             MEw6pHTD

kersing · September 8, 2019, 1:30pm

From LoRaWAN perspective neither the time nor the gps data of the gateway is critical. Your nodes can send and receive data using the gateway without issues while these bugs are present.

The power-up issue can have more impact as this requires manual intervention, however this looks like it could be an issue with TTIG in combination with certain Wi-Fi access points. (I do not observe this issue with Ubiquiti APs)

Not having gateway GPS data does impact TTNmapper, that is inconvenient if your primary use case would be mapping the gateway, however that is of course not a real use case.

Franz_Refle · September 8, 2019, 1:31pm

The Attached Reboot contains Firmware Version and Build Date, maybe it’s useful for someone

Franz_Refle · September 8, 2019, 1:35pm

The issue happens on a FritzBox 7590 and also on a Cisco 25315 …
I’ll use a WRT54G but this will last a while because i have to reactivate this old thing with OpenWRT or ddWRT

kersing · September 8, 2019, 1:35pm

Please take a moment to format your postings!

Franz_Refle · September 8, 2019, 1:45pm

Sorry, I’ll read the formatting guidelines and will use them in the future…
Don’t know if my findings are interesting enough for the community …
maybe it’s only a problem of my WLAN environment and i have to use another GW
But My RasPi 1ch GW and the rest of my WLAN is working without any problems

Franz_Refle · September 8, 2019, 1:53pm

@kersing: As you mentioned time and gps issue aren’t a problem. The only critical thing is the power up issue Imho. I’ll verify that with that old WRT54GL and in the future with a mobile hotspot, i have to buy.

Franz_Refle · September 8, 2019, 2:04pm

After verifying the formatted posts above … there’s a lost line between the first two lines of the second log, here’s the original line:

2019-09-08 09:44:30.300 [S2E:VERB] RX 867.3MHz DR5 SF7/BW125 snr=9.8 rssi=-33 xtime=0xFA00033F1AF0FB - updf mhdr=40 DevAddr=26011CB2 FCtrl=80 FCnt=28612 FOpts=[] 018C458F mic=1566300477 (16 bytes)

Before receiving this packet the free heap reduces from 18872 to 17272.
Something strange happened here

kersing · September 8, 2019, 2:18pm

Could you boot the gateway and leave it in error mode for 30 minutes to see if it recovers by itself?

Franz_Refle · September 8, 2019, 2:25pm

ok, I’ll boot the gateway again and leave it in the failure state fo min 30 min. I’ll report it
I’ve done that some times in the past, but the GW didn’t recover
only pressing setup 10 sec, waiting some secs until blinking slowly red and then pressing setup for 5 sec did recover it

Franz_Refle · September 8, 2019, 3:21pm

@kersing: TTN Console reports: Last Seen 35 minutes ago for this GW
it didn’t recover …
I’ll compare the bootlogs for a “good” boot with GW alive and a “bad” boot with GW not forwarding any packets …

Franz_Refle · September 8, 2019, 3:26pm

The only difference in Bootlogs is as follows:

1970-01-01 00:00:05.276 [SYS:DEBU]   Free Heap: 55888 (min=55456) wifi=1 mh=3 cups=0 tc=0
scandone
state: 0 -> 2 (b0)
state: 2 -> 2 (8a0)
state: 2 -> 0 (2)
reconnect

Sorry fo my Poor English, Sorry, for the bad descriptions

09.09.2019:
I’ve tested the TTIG connected to a mobile Hotspot (Smartphone) this morning.
Same problems as above …
After a Power Cycle Connection gets lost after sending the first packet, LED constant green
only pressing setup 10 sec, waiting some secs until blinking slowly red and then pressing setup for 5 sec did recover it
So I think it isn’t a matter of my WLAN Setup
Same Results if connected to Fritzbox, Cisco AP, Asus AP and mobile Hotspot

Any Suggestions ?
Should I complain on RS-Components a second time ?
Or throw away the dammned thing as suggested in the first posts ?

If I terminate and reenable the mobile Hotspot, things go wrong again …
same as power cycling the TTIG

After loosing connection the TTIG reboots (as seen in the Debug Output) and rescans immediately the WLAN. If the AP is available in this scan, things go wrong as described above.
If I switch on the AP AFTER this first rescan, the TTIG rescans the WLAN after about 30 s a second time (without rebooting), finds the AP and things go right … TTIG functions properly

But there’s no workaround in case of

short power loss
short wifi loss

Why the reboot if WIFI gets lost more then 1 sec ?
Maybe a Bug …
Anybody out there who may verify this reboots ?

@bei: I’m not allowed to post a reply for the next 5 hours …
Here’s my Setup:
@bei: I have a simple TTN Node (Arduino+RFM95) which sends a Temparature Packet every 15 secs. After the described faulty reboot the TTIG forwards exactly 1 Packet … seen in the TTN Console of Gateway and Device. Logging the DEBUG Output of the TTIG doesn’t show any differences between “good” and “bad” state. The TTIG receives all Packets sent by the Node but stops forwarding them after the first sent packet. This is the condition ‘connection gets lost after sending the first packet’. If I activate my DIY 1ch RASPI GW, this GW forwards exactly every 8th Packet (1ch of 8) without any Problems over the same Infrastructure.

@bel: I think I should wait the 4 hours to be permitted for replying …
TTIG Debug Output in failure state doesn’t differ from “good” state:

2019-09-09 08:37:08.883 [S2E:VERB] RX 867.7MHz DR5 SF7/BW125 snr=10.0 rssi=-12 xtime=0x77000005C08C3C - updf mhdr=40 DevAddr=26011CB2 FCtrl=80 FCnt=870 FOpts=[] 01AFB05F mic=2133244814 (16 bytes)
2019-09-09 08:37:09.298 [SYS:DEBU] Free Heap: 18872 (min=17064) wifi=5 mh=7 cups=8 tc=4
and yes, I’m looking at the TTN Console Traffic Page of the GW
Packet Capture maybe next weekend, i have to recover an old WRT54G or is it possible to do a capture on my Android Smartphone ?

bei · September 9, 2019, 7:18am

After a Power Cycle Connection gets lost after sending the first packet, LED constant green
only pressing setup 10 sec, waiting some secs until blinking slowly red and then pressing setup for 5 sec did recover it

Do you have logs which show that? How do you determine the condition ‘connection gets lost after sending the first packet’?

EDIT 1:
@Franz_Refle Thanks for the clarification. So, the TTIG logs do not indicate connection loss in what you call ‘bad state’, correct? If the TTIG was unable to forward received LoRa packets to the LNS, the internal packet buffer would fill up and you would see log messages indicating that (see basicstation/src/ral_lgw.c at master · lorabasics/basicstation · GitHub). How do you determine that ‘the TTIG receives all Packets sent by the Node but stops forwarding them after the first sent packet’? My guess is that you are looking at the TTN console. The robust way to determine that is to do an IP packet capture on the websocket connection. Do you have the means to do that?

EDIT 2:
I was able to reproduce the issue on my TTIG. Let me try to formalize:

The failure condition

In the failure condition, the TTIG has an active TCP connection to the LNS back-end (solid GREEN LED) but LoRa uplinks do not appear in the TTN console. The DEBUG logs indicate that LoRa messages are received and forwarded to the LNS via the websocket connection. An IP packet capture on the websocket connection supports the observation that TCP packets are sent to the LNS back-end: on the TCP layer, the LNS acks all packet sent by the TTIG. This shows: TCP connection is healthy, forwarded packets are received by the LNS websocket server. However, the LoRa uplink messages do not make it to the LNS’s packet routing logic. In low-activity scenarios, the LNS will reset the websocket connection if no TCP packets are transmitted over the connection in a certain amount of time. After this server-initiated reset and re-connect, everything is back to normal. In high-activity scenarios the websocket connection will not be reset by the LNS because it is seeing TCP packets coming in. In that case, the failure condition will sustain in steady state.

How the failure condition is triggered

The failure condition occurs whenever there is a ‘fast’ TTIG-initiated reconnect.
The TTN LNS regularly performs server-initiated connection resets whenever no activity is detected on the websocket connection (apparently ‘activity’ in this context is measured on the TCP level and not on the LoRa-packet level). These server-initiated re-connects do not trigger the failure condition.
In some scenarios the client (TTIG) may reset and re-establish its connection quickly, for example due to ‘short’ power loss or ‘short’ loss of wifi connectivity. After re-connection, the LNS is receiving and routing the first received LoRa packets (they show up in the TTN console). After a short time (around 10-15s after re-connect) the failure condition kicks in: the packets stop to appear in the console, although the TCP connection is alive and healthy.

A wild guess at the root cause

To me this looks like a race condition in the connection reset logic of the LNS websocket server. Apparently, the LNS websocket server is measuring connection activity on the TCP level and resets the connection after no activity is detected for a particular timeout T. Probably this logic also triggers the destruction of data structures representing the gateway in order to free up resources associated to the just closed connection. In the case of an (unclean) client-side connection reset, the same resource cleanup logic is triggered after timeout T on the dead connection. If the client re-connects before T is expired, there will be two TCP level connections tied to the same internal logical gateway representation. After timeout T hits due to inactivity on the first (dead) connection, the gateway datastructure is destroyed and any packets coming in through the second (healthy) connection will not have the required context to properly route the LoRa packets up the stack.
As said, this is a wild guess, but to me this could explain the behavior we are seeing. Hopefully this helps to locate the true issue in the back-end.

Franz_Refle · September 9, 2019, 12:58pm

@bei:full ack to your description of failure condition and trigger.
I only wonder why the TTIG is rebooting if WLAN outage lasts longer then a second …
My Findings are as follows:

if WLAN Outage < 1 sec then reconnect, – no problems
if WLAN Outage > 1 sec and shorter then about 10 sec then reboot, reconnect, – problems
if WLAN Outage > 10 sec then reboot, 1st reconnect fails, 2nd reconnect success, – no probs

Ist this reboot after 1 sec WLAN Outage normal behaviour of the TTIG or is it caused by a flaw ?
The problem seems to me the 1st reconnect after a reboot, but I’m not the big SW Engineer, knowing the TTIG FW in Detail … only hobbyist. Logs of this behaviour are reproducable.

UdLoRa · September 9, 2019, 2:04pm

This is why I did not answer yet . It needs some time, however I will do it soon.
However, I see there is definitely a batch of TTIGs with problems. Mine was one of the first, although not taken at the conference.

By the way, for me the NOC provides coordinates, metadata in packets do not.

bei · September 9, 2019, 2:09pm

I don’t see any problem on the TTIG side. Dropping the connection uncleanly and re-establishing it shortly thereafter is not an uncommon case in networking which the server should be able to handle gracefully.

Franz_Refle · September 9, 2019, 2:43pm

@bei: full ack, server should be able to handle this …
I’m only in doubt, if there’s a need to reboot the hole gateway 1 sec after WLAN is lost …
is this behavior normal ?

Both events, a short powerl loss and also a short WLAN loss may occur sometimes in real life …
But those two events aren’t handled correct either by server or by gateway … thats my problem

Can’t find a workaround

hilltronic · September 9, 2019, 3:17pm

Hi Guys,

i just installed my brand new TTIG and have problems with letting my Testdevice Join.

If placed in a location that only the TTIG can receive it, it doesnt Join, even if i can see in the console all the Join Requests and also the Accept answers.

Than i placed the device on a different location where it could be heard from other gateways and it could join there.
The only diference i can see that Date on my gateway is 1970… Could this be the reason for not able to Join?

(Before i came to this 1970 problem i created this post (sorry)

Thanks for help
Wolf.

jpmeijers · September 9, 2019, 3:33pm

TTN Mapper has been updated to use the gateway-data API and not the noc API. Things Indoor Gateways should therefore start to appear on TTN Mapper.