PowerDown and ABP

verdiagriculture · May 28, 2020, 11:19pm

Hello everyone!
Something is going wrong with my endnode. I can’t say exactly what because the equipment to recreate the bug comes in next week. The endnodes uplink after I reset them, but all of them eventually stop uplinking. I am guessing that it may have something to do with how I stitched together the LowPower library and arduino-lmic.

I am using the usual beginner hardware, Atmega328p with RFM95W. The activation is ABP. My loop contains only

void loop() {
    os_runloop_once();
}

Everything interesting that happens is in the EV_TXCOMPLETE callback:

        case EV_TXCOMPLETE:
            Serial.println(F("EV_TXCOMPLETE (includes waiting for RX windows)"));
            dnlink_flg = 0; //Rx2Timeout, will be changed later if data is received
            ToS = 0;
            ToN = 0;
            ToE = 0;
            ChgDef_flg = 0;
            TimeToStart = invalid_time;
            TimeToEnd   = invalid_time;
            TimeToNextPoll = invalid_time;
            if (LMIC.txrxFlags & TXRX_ACK)
              Serial.println(F("Received ack"));
            if (LMIC.dataLen) {
            //   Serial.println(F("Received "));
            //   Serial.println(LMIC.dataLen);
            //   Serial.println(F(" bytes of payload"));
            //   Serial.println(F("Data: "));
              for (int i = 0; i < LMIC.dataLen; i++) {
                dnlinkcmd[i] = LMIC.frame[LMIC.dataBeg + i];
              }
            //   for (int i = 0; i < dnlinkcmd_size; i++) {
            //       Serial.print(dnlinkcmd[i],HEX);
            //       Serial.print("_");
            //   }
            //   Serial.println();
              dnlink_flg = 1; //DnLinkRecieved
              act_on_cmd();
              clear_dnlinkcmd();
            }
            pick_next_state();
            get_next_sleep();
            // Schedule next transmission
            Serial.print("NextSleepTime:\t");
            Serial.println(NextSleepTime);
            
            //Serial.print("Millis since last sleep:\t");
            //Serial.println(millis()-previousTime);
            Serial.flush();
            for(unsigned long i = 0; i<NextSleepTime; i++){
                LowPower.powerDown(SLEEP_4S,ADC_OFF,BOD_OFF);
            } 
            //previousTime = millis();
            current_state_output();
            clear_mydata();
            fill_mydata();
            //strcpy( (char*) mydata, "Send data" );
            do_send(&sendjob);
            // os_setTimedCallback(&sendjob, os_getTime()+sec2osticks(TX_INTERVAL), do_send);
            break;

Every single function here is made of only if statements and for loops. There is no explicit time manipulation anywhere.
Is there someone with ABP and powerdown experience who can point out the potential failures in this code?

Any help is much appreciated, thank you!

descartes · May 29, 2020, 8:34am

This looks suspiciously related to your other thread about duplicated keys …

Which version of LMIC are you running?

Can you run it without your low power additions to see if that is a factor?

kersing · May 29, 2020, 9:19am

I would move the sleep and do_send logic from the event handler to the loop. By calling do_send from the event handler you might well be eating your stack space (recursion). Your code effectively never returns from the event handler before calling a new command to send data. That also means all other (possible) events during that time will never be served.
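
A minimal sketch of that restructuring, written as a host-side model so it can be compiled off the device. The names `on_tx_complete`, `loop_once`, `sleep_4s_stub` and `do_send_stub` are hypothetical; on a real node the stubs would be `LowPower.powerDown(SLEEP_4S, ADC_OFF, BOD_OFF)` and `do_send(&sendjob)`, and `os_runloop_once()` would run at the top of each pass. The point is only that the event handler returns immediately and the long sleep happens outside it.

```cpp
#include <cassert>
#include <cstdint>

// Set by the event handler, consumed by the main loop.
static volatile bool tx_complete = false;

// Bookkeeping so the stubs below have an observable effect.
static uint32_t sleep_cycles_done = 0;
static uint32_t sends_queued = 0;

// Hypothetical stand-in for LowPower.powerDown(SLEEP_4S, ADC_OFF, BOD_OFF).
static void sleep_4s_stub() { ++sleep_cycles_done; }

// Hypothetical stand-in for do_send(&sendjob).
static void do_send_stub() { ++sends_queued; }

// What the EV_TXCOMPLETE branch shrinks to: record the event and return.
void on_tx_complete() { tx_complete = true; }

// One pass of loop(). On the device, os_runloop_once() would be called first.
void loop_once(uint32_t next_sleep_cycles) {
    if (!tx_complete) {
        return;  // nothing finished yet, keep pumping the scheduler
    }
    tx_complete = false;

    // The long sleep now runs with the event handler already unwound.
    for (uint32_t i = 0; i < next_sleep_cycles; ++i) {
        sleep_4s_stub();
    }
    do_send_stub();  // stage the next uplink
}
```

With this shape the handler's stack frame is gone before the sleep starts, and each completed transmission triggers exactly one sleep-then-send cycle.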

cslorabox · May 29, 2020, 2:37pm

While it’s traditional to schedule do_send() from the event loop rather than calling it, doing so directly is likely not recursion and not going to take the stack too much deeper, though if it were close to overflow that could be the final straw (edit: and now that attention has been called to the ATmega328, there’s not much stack space to begin with)

do_send() doesn’t actually send anything, what it does is stage the next desired application packet. The actual building of the data frame, and even the decision if that frame should be an application one or something like a (re)join is made asynchronously from this staging in the engine’s update routine.

A major reason for scheduling do_send() rather than calling it from the last stage of a previous attempt (ie, where the receive windows either have or haven’t obtained anything) would be to put some reasonable pacing on the transmission timing by scheduling it at some point in the future.

eg, common usage would be:

        case EV_TXCOMPLETE:
            // generate some sort of debug log output indicating
            // rx windows are done

            // Schedule next transmission
            os_setTimedCallback(&sendjob,
                                os_getTime()+sec2osticks(TX_INTERVAL),
                                do_send);
            break;

verdiagriculture · May 29, 2020, 4:47pm

It is very related to the other thread. Although I am pretty certain that duplicate keys are not the cause of my issues. I decided to make a new thread post rather than do a complete 180 on the other post… for forum continuity?

I am using this arduino-lmic library. It seems to work both with and without low power. The issue usually occurs only after a few days of running.

verdiagriculture · May 29, 2020, 4:53pm

Hey cslorabox, thanks for the good reply.

Could you explain what you mean by reasonable pacing? The reason I did not use os_setTimedCallback is that I want the endnode to be in powerdown for as long as possible.

The time between transmissions shouldn’t be a problem since my NextSleepTime provides at least a few minutes before each call to do_send. Is the reasonable pacing some sort of wiggle room the os needs to properly schedule a transmission?

verdiagriculture · May 29, 2020, 4:54pm

Didn’t think about that. Big thanks kersing!

I’ll move my code to the loop and only set flags in EV_TXCOMPLETE =)

cslorabox · May 29, 2020, 5:26pm

LMiC is really supposed to operate on scheduling.

The only thing that should be in the Arduino loop() is a call to the pump function of LMiC’s scheduler, eg from your link above:

void loop() {
    os_runloop_once();
}

You don’t need to “set flags” when you use the scheduler, as flagging something that needs to happen in the future is basically what the scheduler does.

If you have a state-preserving low power mode, one way to do it is just re-rig the LMiC scheduler to figure out when the next task is sufficiently far in the future that it makes sense to do a low power sleep rather than a busy wait. In some cases that could be as little as a fraction of a second, but it depends on the overhead of the mode used (the gains in going to low power for less than a second between TX and following RX windows are slight and may not be worth the trouble at least until you have everything else working well)
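
The "is the next task far enough away" decision can be sketched as plain arithmetic. This is a host-side model under stated assumptions: `sleep_periods_before` is a hypothetical helper, the 4000 ms period is the nominal length of the LowPower library's `SLEEP_4S`, and the 100 ms wake-up margin is an invented placeholder, not a measured value. How the time until the next job is obtained from the LMiC scheduler is left out.

```cpp
#include <cassert>
#include <cstdint>

// Nominal length of one watchdog sleep period (SLEEP_4S), in milliseconds.
constexpr uint32_t kSleepPeriodMs = 4000;

// Assumed margin for waking up and getting back into the scheduler.
// Placeholder value, to be replaced by a measurement on real hardware.
constexpr uint32_t kWakeOverheadMs = 100;

// Number of whole sleep periods that fit before the next scheduled job
// while still leaving the wake-up margin. Zero means "busy wait instead".
uint32_t sleep_periods_before(uint32_t ms_until_next_job) {
    if (ms_until_next_job <= kWakeOverheadMs) {
        return 0;
    }
    return (ms_until_next_job - kWakeOverheadMs) / kSleepPeriodMs;
}
```

Under these numbers a job due in one second gets no sleep at all, while a job five minutes away allows 74 sleep periods. Shorter watchdog periods would be needed to sleep between TX and the RX windows.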

In a situation where a low power mode loses most state, then you actually need to start up the program from scratch, but trick it by restoring critical LoRaWAN state - both obvious things like frame counts in both directions but also less obvious ones like ADR and/or OTAA time counters. There are people who have figured that out, but it’s not simple.
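
For the frame counters alone, one possible approach is sketched below as a host-side model. Everything here is an assumption for illustration: the `SavedSession` layout, the magic value, the save interval, and the `nv_store` variable standing in for EEPROM. In arduino-lmic the live counters are `LMIC.seqnoUp` and `LMIC.seqnoDn`; ADR and other session state are not covered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record kept in non-volatile memory.
struct SavedSession {
    uint32_t magic;     // marks the record as valid
    uint32_t seqno_up;  // uplink frame counter at the time of saving
    uint32_t seqno_dn;  // downlink frame counter at the time of saving
};

constexpr uint32_t kMagic = 0x4C4D4943;  // arbitrary marker ("LMIC" in ASCII)
constexpr uint32_t kSaveEvery = 16;      // assumed interval, limits write wear

// Stand-in for EEPROM; on a device this would be read and written there.
static SavedSession nv_store = {0, 0, 0};

// Save only every kSaveEvery uplinks rather than on each one.
bool should_save(uint32_t seqno_up) { return seqno_up % kSaveEvery == 0; }

void save_session(uint32_t seqno_up, uint32_t seqno_dn) {
    nv_store = {kMagic, seqno_up, seqno_dn};
}

// Called at boot. Skips the uplink counter ahead by kSaveEvery so that a
// counter value used after the last save is never sent a second time.
bool restore_session(uint32_t &seqno_up, uint32_t &seqno_dn) {
    if (nv_store.magic != kMagic) {
        return false;  // nothing stored yet, start from zero
    }
    seqno_up = nv_store.seqno_up + kSaveEvery;
    seqno_dn = nv_store.seqno_dn;
    return true;
}
```

Skipping ahead trades a gap in the uplink counter for fewer non-volatile writes. A gap of this size is normally tolerated by the network, but that tolerance is an assumption to verify against the network server in use.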

descartes · May 29, 2020, 5:55pm

I have LMIC v2.2.2 stable for our TinyThing - Arduino Pro Mini with RFM95W but I have to compile using Arduino 1.8.9 or it fills the flash. More up to date versions of LMIC tend to fill flash even with 1.8.9.

Some configurations I’ve tried with different sensors have dropped the reported free memory to the point that it all grinds to a halt, and even after putting some feedback into the LMIC code I still couldn’t precisely say what was causing it all to stop.

All that said, I’ve nodes that have been running for 6+ months on this hardware combo with a low power mode but efforts to use more up to date libraries or IDE haven’t worked YET.

I have to build a test TinyThing in the morning, I’ll put my code on GitHub & link here when I get to that stage.

At some point when I’m in a good mood and have a day to waste, I’ll do some more scientific tests to see if I can bottom the issue, as well as try it on a Mega2560. Mostly I’ve moved to RAK modules (specifically the 4200) for co-development work.

cslorabox · May 29, 2020, 5:59pm

I’d missed that the asker was (like you) using an ATmega328 - that’s really a squeeze and not a very recommended solution when the LoRaWAN stack runs on the application processor. There are much better alternatives today like the various brands of Cortex M0 that don’t cost more and give you better debug with cheap tools, too.

Also worth noting that the LMiC code has a lot of subtle bugs in possible but less usual cases which various branch maintainers have fixed to various degrees - look at the commit history and open issues for MCCI LMiC for example, which has been one of the most proactive.

descartes · May 29, 2020, 6:18pm

I only use that combo now for single temp sensors, throw-away devices and canary applications. It works and it’s less than a tenner (£10 GBP) in parts.

For co-development of a POC I use a RAK4200 module with Arduino so the client can use the resources online to build something without worrying about the MAC.

For an actual device, the RAK4260 is pretty neat - an ATSAMR34J18 - Cortex M0 exactly as you suggested.

verdiagriculture · June 16, 2020, 5:58am

Update

I didn’t want to leave this thread hanging w/o a conclusion.
The issue was the race condition. For some reason, my code/hardware was hitting it more often than Descartes. After I started calling os_runloop_once every 8 seconds, the race condition seems to be hit less often, but I still hit it.

Alas, I had to set up my watchdog so that my endnode would reset itself, but that means I had to turn off the frame counter check for the time being.

If I switch to a 32-bit MCU like a Cortex M0, I suppose that there is a different library I can use which doesn’t have this bizarre race condition?

FYI, thanks for your help descartes, now my nodes are at least functional =)

descartes · June 16, 2020, 10:06pm

Not something I have a problem with - did you patch your data collection into my framework or make alterations to yours?

You may end up with a variant of LMIC. Or you may well have something in your code that’s the culprit.

If you want to put your entire code base either on a gist or pastebin or in a message and send me a private message, I’ll have a look see.