OK, whatever caused the outage has been resolved and I have the TTN gateway working again now connected to ethernet. I will report its behavior over the next day or two to see if it has changed.
Are you able and willing to swap power supply and lora board between working and non working gw?
And make some detailed pictures of both circuit boards where chip ids are readable?
Peeking in the firmware code I finally understand the weird values in HTTP: Got 1232 bytes
: the URLs printed in the logs are not the ones actually used, but ?filter=ttn
is appended. And indeed, https://account.thethingsnetwork.org/api/v2/frequency-plans/EU_863_870?filter=ttn is much smaller than https://account.thethingsnetwork.org/api/v2/frequency-plans/EU_863_870
I’m also trying to enable debug logging and add much, much more logging, but then the UART output really starts losing many bytes, making it very hard to decipher my new logging. I tried to double DEBUG_PRINT_BUFFER_SIZE
, but then the compiler fails to allocate space. To be continued…
The good news: I do see responses from the LoRa card, like LORA: recv_rpl: 0x23 0x35 0x1 0x0 0x0 0x0 0x0 0x59 0xd
.
The bad news: the very first step in the configuration already fails:
status *= configureRXChain(0, appGWActivationData.configuration_sx1301.rfchain[0].enable,
appGWActivationData.configuration_sx1301.rfchain[0].freq);
do we already have the ‘programming’ i.e. the command set details for the LoraCard?
UPDATE: one of the good working gateways, now also in an booting loop. Sometimes it receive an sensor packages, but most of the send sensor packages are lost due to rebooting cyclus. For this gateway, i had switch on the automatic update in the console. I changes the power supply, but no change.
At the moment , the gateway without problems, i have switch-off the automatic updates in the console.
So, my situation, only 1 of the three gateways is usable.
Increasing the value of DRV_USART_BAUD_RATE_IDX0
to, say, 921600
helps making the logs a bit more readable, though still not perfect. These problems might also be caused by my UART-to-USB cable. And of course, the same setting must be used when capturing the output, like in LOG="ttn-gateway-
date +’%Y%m%d-%H%M%S’.log"; pio serialports monitor --raw -b 921600 | while read l; do echo "[
date +’%F %T’] $l" | tee -a $LOG; done
, or when using a Raspberry Pi to capture the output.
This allows for changing the log level to SYS_ERROR_DEBUG
in:
…and for enabling some disabled logging in the original code, and adding many more calls to SYS_DEBUG(SYS_ERROR_DEBUG, ...)
.
I added the following at the end of sendCommand
in src/app_lora.c
to know when things go wrong (not sure why the code uses status *= ...
which will still try all next configuration steps if a previous one has already failed):
SYS_DEBUG(SYS_ERROR_WARNING, "LORA: sendCommand %s\r\n", gotresponse ? "OK" : "ERROR");
…yielding:
CNFG: Configuring LoRa module
LORA: Changing state from 2 to 4
LORA: Starting reconfiguration
LORA: send_cmd: 0x23 0x31 0x1 0x0 0x0 0x55 0xd
LORA: recv_rpl: 0x23 0x31 0x1 0x0 0x0 0x55 0xd
LORA: sendCommand OK
LORA: send_cmd: 0x23 0x3a 0x1 0x0 0x0 0x5e 0xd
LORA: recv_rpl: 0x23 0x3a 0x10 0x0 0x1 0x1 0x4c 0x47 0x38 0x35 0x30 0x31 0x36 0x30 0x31 0x37 0x38 0x32 0x4 0x1 0xd
LORA: sendCommand OK
LORA: version: 01
LORA: configureRXChain(0, ...)
RF: 0,1,867500000
LORA: send_cmd: 0x23 0x34 0x6 0x0 0x0 0x1 0xe0 0xff 0xb4 0x33 0x24 0xd
LORA: recv_rpl: 0xd
LORA: sendCommand ERROR
LORA: configureRXChain(1, ...)
RF: 1,1,868500000
LORA: send_cmd: 0x23 0x34 0x6 0x0 0x1 0x1 0x20 0x42 0xc4 0x33 0xb8 0xd
LORA: recv_rpl: 0x23 0x34 0x1 0x0 0x0 0x58 0xd
LORA: sendCommand OK
LORA: configureIFChainX(0, ...)
Above, the call to configureRXChain(1, ...)
actually consistently succeeds. Or maybe the if(appData.rx_uart_buffer[1] == command)
in the call to configureRXChain(1, ...)
is seeing the response of the earlier call to configureRXChain(0, ...)
, as it somehow gets in too late? The if(TIMEOUT(4))
is not being hit. After the above lines the log is a mess again, but I see both OK
s and ERROR
s.
I need to run now; just posting my early findings in case it helps someone right now; will try to trigger the default configuration later, to ensure the config that is fetched from the internet is okay. To be continued…
Would it make sense, after the gateway that stopped working properly has been turned on and powered up for you to press the reset button for five seconds so that it needs to be reactivated?. And then reactivate it selecting to not have automatic updates?
Yes!
Seeing LORA: recv_rpl: 0xd
above, which is a newline, I wondered what would happen if I always first read any pending RX from the LoRa module’s UART before sending new commands.
Guess what: in two occasions there was indeed such excessive newline pending in the receive buffer while the firmware was about to send a new command. Even better: discarding those makes the LoRa configuration complete without any error. Next, the activation simply succeeds and it even downloads new firmware, which obviously no longer includes my fixes, but somehow is not booted (yet) as the SD card with my own firmware is still in the gateway, my own firmware is loaded again after the reboot. So, using my own firmware it’s now activated and happily forwarding packets…! And the debug logs no longer show garbage either.
To be continued, but: in my case the reboot loop during activation is apparently totally fixable by just using new firmware.
And I’m not even alarmed by this weird double occurrence of frame counter 9
, on two frequencies…
(Well, I am…)
@arjanvanb Can you share the image you created, curious to see of this fixes the reboots I have every so many hours.
Here goes, based on today’s develop
branch: https://drive.google.com/open?id=15UMxp7voWhCAHY0_xvZDkWDPHf74pPuX
I will create a PR tomorrow have created a PR, so if you can wait a few days for TTP/TWTG to validate it: just wait
If you don’t want to wait:
- I don’t have the factory firmware for you, so: no way back!
- No need to wipe any existing configuration.
- Unpack the
/update
folder to the root of a FAT32 formatted SD card. - Remove power, insert the SD card, attach power.
- If you’re using a serial cable for logging: set the baudrate of your monitor to 921600.
- Leave the SD card in your gateway to avoid any downloaded firmware from overwriting it.
- After the updated firmware was downloaded, when the SD card is still in place, you’ll see:
…which makes me think that even when removing the SD card after that, it won’t overwrite the firmware until something new is released. But I did not test.FIRM: Starting download FIRM: available bytes: 79 FIRM: (Downloaded FOTA key) 69 AE B7 78 1F 49 4E 7F BC B6 C7 CD 9C 59 4F 5D FA AA 3D 81 D4 9C 56 90 A6 83 81 98 FF 18 88 6A FIRM: (Stored FOTA key) 69 AE B7 78 1F 49 4E 7F BC B6 C7 CD 9C 59 4F 5D FA AA 3D 81 D4 9C 56 90 A6 83 81 98 FF 18 88 6A FIRM: Firmware is already downloaded MAIN: No new firmware available
This basically adds a bit more logging, plus:
// flushUart removes any pending bytes from the receive buffer.
void flushUart(DRV_HANDLE handle)
{
bool flushing = false;
uint8_t buffer[1];
while(DRV_USART_Read(handle, buffer, 1) > 0)
{
if(!flushing)
{
SYS_DEBUG(SYS_ERROR_DEBUG, "LORA: flushing: ");
flushing = true;
}
SYS_DEBUG(SYS_ERROR_DEBUG, "%02x ", buffer[0]);
}
if(flushing)
SYS_DEBUG(SYS_ERROR_DEBUG, "\r\n");
}
(Trying to adhere to the existing code style… Also, it would be nice to use, e.g., DRV_USART_ReceiverBufferIsEmpty
, but the tooling refuses to find that?)
The above is then invoked at the start of:
bool sendCommand(uint8_t command, uint8_t* payload, uint16_t len)
{
flushUart(appData.USARTHandle);
bool gotresponse = false;
...
SYS_DEBUG(SYS_ERROR_DEBUG, "LORA: sendCommand %s\r\n", gotresponse ? "OK" : "ERROR");
return gotresponse;
}
I only get LORA: flushing: 0d
(being an empty line) once now.
However, the mystery continues: I doubt getting two lines as a reply is normal, and I don’t understand how the code could ever print the following!? (See followup post for an explanation.)
RF: 1,1,868500000
LORA: send_cmd: 23 34 06 00 01 01 20 42 c4 33 b8 0d
LORA: recv_rpl: 23 34 07 00 00 01 01 80 e5 f9 ff be 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
Okay, the code in my firmware uses "%02x"
to print the hexadecimal values, which is different from the "%#x"
in the code on GitHub. But how could it ever (consistently!) print two subsequent lines for LORA: recv_rpl
for this very command…!?
My logging
This is complete; see next post for an explanation.
CNFG: Configuring LoRa module
LORA: Changing state from 2 to 4
LORA: Starting reconfiguration
LORA: send_cmd: 23 31 01 00 00 55 0d
LORA: recv_rpl: 23 31 01 00 00 55 0d
LORA: sendCommand OK
LORA: send_cmd: 23 3a 01 00 00 5e 0d
LORA: recv_rpl: 23 3a 10 00 01 01 4c 47 38 35 30 31 36 30 31 37 38 32 04 01 0d
LORA: sendCommand OK
LORA: version: 01
LORA: configureRXChain(0, ...)
RF: 0,1,867500000
LORA: flushing: 0d
LORA: send_cmd: 23 34 06 00 00 01 e0 ff b4 33 24 0d
LORA: recv_rpl: 23 34 01 00 00 58 0d
LORA: sendCommand OK
LORA: configureRXChain(1, ...)
RF: 1,1,868500000
LORA: send_cmd: 23 34 06 00 01 01 20 42 c4 33 b8 0d
LORA: recv_rpl: 23 34 07 00 00 01 01 80 e5 f9 ff be 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChainX(1, ...)
IF: 1,1,1,-200000
LORA: send_cmd: 23 35 07 00 01 01 01 c0 f2 fc ff 0f 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChainX(2, ...)
IF: 2,1,1,0
LORA: send_cmd: 23 35 07 00 02 01 01 00 00 00 00 63 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChainX(3, ...)
IF: 3,1,0,-400000
LORA: send_cmd: 23 35 07 00 03 01 00 80 e5 f9 ff c0 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChainX(4, ...)
IF: 4,1,0,-200000
LORA: send_cmd: 23 35 07 00 04 01 00 c0 f2 fc ff 11 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChainX(5, ...)
IF: 5,1,0,0
LORA: send_cmd: 23 35 07 00 05 01 00 00 00 00 00 65 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChainX(6, ...)
IF: 6,1,0,200000
LORA: send_cmd: 23 35 07 00 06 01 00 40 0d 03 00 b6 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChainX(7, ...)
IF: 7,1,0,400000
LORA: send_cmd: 23 35 07 00 07 01 00 80 1a 06 00 07 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
LORA: configureIFChain8(...)
IF8: 1,1,-200000,250000,7
LORA: send_cmd: 23 36 08 00 01 01 c0 f2 fc ff 02 02 14 0d
LORA: recv_rpl: 23 36 01 00 00 5a 0d
LORA: sendCommand OK
LORA: configureIFChain9(...)
IF9: 1,1,300000,125000,50000
LORA: send_cmd: 23 37 0b 00 01 01 e0 93 04 00 03 50 c3 00 00 f4 0d
LORA: recv_rpl: 23 37 01 00 00 5b 0d
LORA: sendCommand OK
LORA: send_cmd: 23 40 01 00 34 98 0d
LORA: recv_rpl: 23 40 01 00 00 64 0d
LORA: sendCommand OK
LORA: send_cmd: 23 31 01 00 00 55 0d
LORA: recv_rpl: 23 31 01 00 00 55 0d
LORA: sendCommand OK
LORA: send_cmd: 23 30 01 00 00 54 0d
MON: SYS Stack size: 2870
MON: heap usage: 152KB (233KB), free: 187KB
LORA: recv_rpl: 23 30 01 00 00 54 0d
LORA: sendCommand OK
LORA: configLora OK
LORA: Configuration succeeded
LORA: Starting operation
Hi Arjan,
I was in for an experiment, so I downloaded your firmware.
It did not work right a way however. I tried reset, new activation etc.
At a certain point I pushed the LORA card and I felt it went a bit further into the connector.
And now it suddenly works.
Could it be a combination of problems?
It is a pity I did not try to push the card before I changed firmware. And since there is no way back…
But at least it seams to work now and that is very nice.
Thanks for your excellent work to find the cause.
The original code is also full of timed waits/delays with comments that they are there to give time to print something. doesn’t look very stabil.
Do you have a debug cable? Then I wonder if you see the line LORA: flushing
. (Using 921600 baud.)
About the lora board and it’s connector, it looks like they don’t match, as if the board is thinner than the connector is build for. When you remove the board and put it back in, there is no pressure when you press it down, like you have when you insert a flash/sdd card in a slot like that…
I also wonder if the same UART timing issues and excessive newlines could be present in the LoRa card’s output for received LoRaWAN packets, which it might then drop.
But when peeking in TTN Console I think I don’t see my own test packets missing out on frame counters. (Except for the weird double LoRaWAN packet with two different frequencies that I showed earlier.) So maybe my LoRa card is simply a slightly different revision that has different firmware itself, only affecting the configuration commands? At its bottom side, mine is labeled:
LG8501601782
LG-X271
REV: C
If you see packets in ttn console your gw runs longer than mine, mine usually don’t get this far.I have module sn LG8451600649, rest is the same. What I also find weird is that when my gw is powered off for say 24 hrs. and I power it on for first time, it runs the longest. If I do this a few minutes or hours after bootloop it barely has time to connect until I wait a day/night again. It’s not temperature related though.
Yes, it has been running overnight, and a good part of yesterday too. I’ve rebooted manually quite often (it’s a tiring process to copy the firmware to the SD card and so on…) but it has never rebooted unexpectedly.
Any examples? I’ve actually searched for such code but couldn’t find it. I’d assume one can actually peek into the buffer to see if it’s empty, rather than waiting some specific time… But then, the compiler cannot find, e.g., DRV_USART_ReceiverBufferIsEmpty
while it can find DRV_USART_Read
from the very same header file. So I guess somehow the project defines some static linking that I cannot find in the tooling.
For example the vTaskDelay() calls in system_tasks.c
Okay, I figured out the double response in the log, but wow, this is driving me nuts. Bottom line: even at 921600 baud the/my logging is just too often incomplete.
Like this should create 9 entries in the log:
SYS_DEBUG(SYS_ERROR_DEBUG, "\r\nLORA: configureRXChain(1, ...)\r\n");
status *= configureRXChain(1, appGWActivationData.configuration_sx1301.rfchain[1].enable,
appGWActivationData.configuration_sx1301.rfchain[1].freq);
for(i = 0; i <= 7; i++)
{
SYS_DEBUG(SYS_ERROR_DEBUG, "\r\nLORA: configureIFChainX(%d, ...)\r\n", i);
status *= configureIFChainX(i, appGWActivationData.configuration_sx1301.ifchain[i].enable,
appGWActivationData.configuration_sx1301.ifchain[i].radio,
appGWActivationData.configuration_sx1301.ifchain[i].freqOffset);
}
But I get 1 with some weird double response, and 7 for the loop where the first of for(i = 0; i <= 7; i++)
, is not (properly) shown:
LORA: configureRXChain(1, ...)
RF: 1,1,868500000
LORA: send_cmd: 23 34 06 00 01 01 20 42 c4 33 b8 0d
LORA: recv_rpl: 23 34 07 00 00 01 01 80 e5 f9 ff be 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
// The following is missing!
// LORA: configureIFChainX(0, ...)
// IF: 0,1,1,-400000
LORA: configureIFChainX(1, ...)
IF: 1,1,1,-200000
LORA: send_cmd: 23 35 07 00 01 01 01 c0 f2 fc ff 0f 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
// Likewise proper output for i = 2, 3, 4, 5, 6
LORA: configureIFChainX(7, ...)
IF: 7,1,0,400000
LORA: send_cmd: 23 35 07 00 07 01 00 80 1a 06 00 07 0d
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK
So, I am quite sure that the weird second response (the line LORA: recv_rpl: 23 35 01 00 00 59 0d
and the next LORA: sendCommand OK
) is actually the response for i = 0
, and even more: the payload shown for the RX in the line LORA: recv_rpl: 23 34 07 00 00 01 01 80 e5 f9 ff be 0d
is actually the payload for the TX in LORA: send_cmd
for i = 0
!
Compared to when logging happens to be okay, after:
LORA: send_cmd: 23 34 06 00 01 01 20 42 c4 33 b8 0d
…apparently the first and last lines of:
LORA: recv_rpl: 23 34 01 00 00 58 0d
LORA: sendCommand OK
LORA: configureIFChainX(0, ...)
IF: 0,1,1,-400000
LORA: send_cmd: 23 35 07 00 00 01 01 80 e5 f9 ff be 0d
…have somehow been combined into this single bogus line, discarding anything in between:
LORA: recv_rpl: 23 34 07 00 00 01 01 80 e5 f9 ff be 0d
Changing the output format for TX and RX a bit also makes clear it’s indeed just a matter of weird printing. (It’s not like the RX and TX data are mixed up in memory.)
TL;DR: for me the fix to flush the buffer is still working, and I guess I can submit the PR, but one needs to be careful trying to read the log.
The only reference to logging I see is:
But that looks okay, as it’s about to reboot and then logging would be lost without such delay.
Hmmm, not all open source code is quite readable (yet?). It seems the web server resources (HTML, images, JavaScript) have been pushed to Git as some disk image, rather than their source code?
(Of course, one might fetch most, if not all, using a regular browser.)