Just like many other users on this forum, I’ve been fighting with TTN gateway reliability. Today I had a bit of a breakthrough:
- My gateway is connected to both wifi and ethernet. Ethernet is a steady 1 ms ping with 0% packet loss; wifi is spotty, anywhere from 5 ms to 1000 ms with 0 to 100% packet loss.
- The breakthrough was noticing that when I had the gateway in one particular spot in my office it worked about 90% of the time (all 4 solid blue lights after boot), while in another spot I never got it to work (constant reboots).
- All the while, the gateway console showed that it had obtained both ethernet and wifi IP addresses successfully, and I was able to ping it at all times (with major packet loss in the bad position, of course).
So … based on this I have a theory about what’s wrong with the TTN gateway and why it’s so tricky to get to the bottom of the issue:
- If you have no wifi at all, there is no problem: you only get an ethernet IP and off you go
- If you have 100% reliable wifi, there is no problem either
- If you have 50% reliable wifi, there is about a 50% chance that your gateway won’t be able to finish the boot process. It will likely get as far as receiving an IP through DHCP, but it won’t be able to communicate with TTN after that. It appears that 50% wifi is good enough for DHCP because that particular code retries until it succeeds, but the rest of the code that communicates with TTN seems to be less resilient. If your wifi goes down, the TTN connection crashes in a bad way and doesn’t recover.
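To put rough numbers on this theory, here is a back-of-the-envelope sketch in Python. It assumes every attempt succeeds independently with probability p (the "50% wifi" shorthand), that DHCP gets several retries, and that each later TTN step gets exactly one try. The function names, the retry count of 10 and the step counts are made up for illustration; none of this is taken from the gateway firmware.

```python
def success_with_retries(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** attempts


def success_without_retries(p: float, steps: int) -> float:
    """Chance that `steps` consecutive single-shot operations all succeed."""
    return p ** steps


if __name__ == "__main__":
    p = 0.5  # assumed per-attempt success rate on "50% wifi"
    print(f"DHCP, up to 10 tries: {success_with_retries(p, 10):.4f}")
    print(f"1 step, no retry:     {success_without_retries(p, 1):.4f}")
    print(f"3 steps, no retry:    {success_without_retries(p, 3):.4f}")
```

Under these assumptions the retried step succeeds about 99.9% of the time, a single unretried step succeeds half the time, and three unretried steps in a row succeed only 12.5% of the time. That would fit a gateway that always gets an IP but often fails right after.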
My fix is to relocate my TTN gateway to a location with more reliable wifi (and if that still gives me trouble, I will turn off wifi altogether).
The proper fix, of course, is for the TTN developers to reproduce this issue the way I have, by putting a gateway on wifi that’s just barely good enough to complete DHCP but not good enough to continue with the rest. After that the fix should be easy: display a proper error in the status instead of an infinite boot loop or reboot. Ex: “Error: poor wifi connection, try again with better wifi signal” … or even better, just have the thing retry without crashing. Those packets ARE getting through, just very unreliably. So if the code can work around that lack of reliability, things will work.
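As a rough illustration of "retry without crashing", here is a minimal Python sketch of retrying with exponential backoff. This is not the gateway's firmware or any TTN API: `retry_with_backoff`, its parameters and the delays are all hypothetical, and the operation passed in stands for whatever step talks to TTN after DHCP.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is used up.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at max_delay)
    between attempts. If every attempt fails, the last error is re-raised so
    the caller can show a clear message instead of rebooting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:  # covers ConnectionError and TimeoutError
            if attempt == max_attempts:
                raise
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")


def report_connection(operation: Callable[[], T]) -> str:
    """Hypothetical status helper: return a readable status, never crash."""
    try:
        retry_with_backoff(operation)
        return "Connected"
    except OSError:
        return "Error: poor wifi connection, try again with better wifi signal"
```

The point of the design is that a network failure is treated as a normal, expected outcome: a few lost packets cost some extra seconds, and a persistently bad link ends in a readable status message rather than a boot loop.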
I hope this helps others.
Note: when I talk about 50% reliable wifi, of course there is no such thing. It’s just shorthand for wifi that has high latency and some level of packet loss at times, roughly 50% of what ideal wifi looks like.