First of all, apologies for the late response from my side. I was traveling last week and off yesterday, but I’ve been following internal and public (#ops, Twitter, here) messages closely. I very much appreciate the constructive feedback, nice words and support. Thanks @LanMarc77 for raising this, and “special thanks” to the person who spent a few hours on a Twitter post with the TTN logo in flames and a skull, suggesting we’re hacked. It shows, positively and negatively, that people care. And that’s a good thing after all.
There’s an unfortunate mix of circumstances that caused the long downtime:
- what: the busiest cluster (EU, operated by The Things Network Foundation *)
- why: issues with component interconnections that are notoriously fragile and hard to get right due to the design of V2; add to that a network that keeps doubling year-on-year and a development focus that shifted to V3 two years ago
- when: the weekend (which we were actually enjoying)
* A small note on operations: yes, the TTN core team is on the TTI payroll, but TTN is not a TTI-operated service and will not be. We allow time to be spent on non-profit Foundation activities, including operations, but it is best effort.
That being said, I see a task for myself in bridging the gap between the limited operational availability of TTI staff and the seemingly unlimited willingness of the TTN community to maximize service uptime.
I’m replying to a few suggestions above to steer the discussion in that direction. Please feel free to make other suggestions.
This assumption is correct; I would say that “fixing” over 90% of operational issues (in V2) is simply a matter of restarting Docker containers in the right order. Unfortunately, we cannot give direct access to Docker, as that implies system root access and access to keys and passwords, which implies access to user data, which results in privacy and GDPR issues, etc.
How about selecting 10-15 active and trusted community members, spread across timezones, who have access to a custom Slackbot command (e.g. `/restart ttn-eu`) that triggers a predefined restart sequence? Only if, say, 3 individual members execute that command independently within 15 minutes does it actually happen. And to avoid abuse: at most once every hour. I know it’s monkey patching, but it gives control; see the sketch below.
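To make this concrete, here’s a rough sketch of what that quorum logic could look like. Purely illustrative: the thresholds and function names are assumptions mirroring the 3-members/15-minutes/once-per-hour idea above, not an existing tool.

```python
import time

QUORUM = 3            # independent members required
WINDOW = 15 * 60      # confirmation window in seconds
COOLDOWN = 60 * 60    # at most one restart per hour per cluster

votes = {}            # cluster -> {member: vote timestamp}
last_restart = {}     # cluster -> timestamp of last triggered restart

def restart_sequence(cluster):
    # Placeholder for the predefined, audited restart: bring the Docker
    # containers back up in the right order, without handing out root access.
    print(f"restarting {cluster} components in predefined order")

def handle_restart_command(cluster, member, now=None):
    now = time.time() if now is None else now
    if now - last_restart.get(cluster, 0) < COOLDOWN:
        return "cooldown active, try again later"
    cluster_votes = votes.setdefault(cluster, {})
    cluster_votes[member] = now  # one (most recent) vote per member
    # keep only votes that fall inside the confirmation window
    recent = {m: t for m, t in cluster_votes.items() if now - t <= WINDOW}
    votes[cluster] = recent
    if len(recent) >= QUORUM:
        votes[cluster] = {}
        last_restart[cluster] = now
        restart_sequence(cluster)
        return f"quorum reached, restart of {cluster} triggered"
    return f"vote recorded ({len(recent)}/{QUORUM})"
```

The Slack plumbing around it is the easy part; the point is that no single member can trigger anything, the restart sequence itself stays predefined, and nobody gets root access or keys.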
Yes, and I’m happy to cover this topic at the other 2019 Conferences where I’ll be (India and Adelaide).
Right, we need to improve this as well. As we speak, we’re spending time improving (real-time) communications on TTN operations, including a new status page. If automated reporting works well, we’ll be using the @ttnstatus Twitter account too.
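For what it’s worth, the core of such automated reporting can stay very simple. A hypothetical sketch, where post() stands in for whatever the status page or @ttnstatus integration ends up being:

```python
last_state = {}   # cluster -> last reported health

def post(message):
    # Placeholder delivery: status page API, @ttnstatus, a Slack webhook, etc.
    print(message)

def report(cluster, healthy):
    # Only post on state changes, so the channel stays low-noise.
    if last_state.get(cluster) != healthy:
        post(f"{cluster} is {'operational' if healthy else 'degraded'}")
        last_state[cluster] = healthy
```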
TTN is a special network and is operationally very different from TTI networks: TTN clusters are fully interconnected and the TTN Foundation only operates part of them (a big part, but not all); there’s heavy load; there’s lots of experimentation and testing at device and gateway level that would flag alerts in an enterprise network (and hence put unnecessary burden on TTI ops); etc. That being said, as we migrate to V3, we will converge the operational setup of the TTN Foundation’s clusters and TTI Cloud Hosted. It will not be the same infrastructure, but it will be easier to manage.
Yes. This is what Packet Broker is going to provide. In V3, Packet Broker is the backbone of LoRaWAN traffic between public TTN clusters, private TTI clusters, anyone running the open source The Things Stack, and even other network servers (through LoRaWAN stateless passive roaming or by implementing the Packet Broker protocol).
As we’ve been promising for a long time, it will also become easier for the community to operate TTN clusters (cc @kersing). I have several requests from all parts of the world to operate one, and there is nothing wrong with overlapping regions. However, we do need Packet Broker for this. We also need to start structurally measuring the uptime of all TTN clusters, in a way that lets the community choose which TTN cluster to pick for connecting a gateway and registering devices. If it turns out that the public TTN Groningen community cluster has better uptime than the TTN Foundation’s EU cluster, so be it; we’ll hand out an award for the best community-contributed cluster at the Conference. We’ll figure it out; willingness is not the issue, but the technology is not there yet.
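On measuring uptime comparably: a minimal sketch of an independent probe, assuming each cluster exposes an HTTP health endpoint (the URLs below are made up for illustration):

```python
import time
import urllib.request

# Hypothetical cluster health endpoints; real URLs would differ.
CLUSTERS = {
    "ttn-eu": "https://eu.example.org/healthz",
    "ttn-groningen": "https://groningen.example.org/healthz",
}

def is_up(url, timeout=5.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

# Run a probe round on a schedule; aggregating these samples over time
# (ideally from probes run by independent parties) gives comparable uptime
# percentages per cluster, published for everyone to see.
while True:
    print({name: is_up(url) for name, url in CLUSTERS.items()})
    time.sleep(60)
```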
Now, I hear you thinking: “but who operates Packet Broker, and how can we make that redundant?” PB will be TTI-operated, but the APIs are open and V3 peering allows for redundancy (i.e. multiple independent upstreams). We are open to working with neutral parties (e.g. universities, internet exchanges, domain registrars) to provide redundancy for peering in the TTN community network. That is regardless of the community-contributed TTN clusters described above: both provide decentralization.
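To illustrate the multiple-independent-upstreams point (this is not the Packet Broker API or protocol, just the shape of the redundancy):

```python
# Illustrative only: the upstream names and send() are placeholders.
UPSTREAMS = ["packet-broker.example", "neutral-exchange.example"]

def send(upstream, message):
    # Placeholder transport; a real forwarder would speak gRPC, MQTT, ...
    print(f"delivering {len(message)} bytes via {upstream}")
    return True

def forward(message):
    # Deliver to every configured upstream; one peer going down does not
    # take peering down, because the paths are independent.
    delivered = 0
    for upstream in UPSTREAMS:
        try:
            delivered += bool(send(upstream, message))
        except Exception:
            continue
    return delivered
```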
I don’t think that throwing money at it will help. Also, to whom would TTI provide the SLA? And what if we don’t meet it? This, plus the operational differences outlined above, makes me believe that we should keep TTN truly a community initiative, where TTI is a contributor on a best-effort basis, like, hopefully, many other contributors that will offer a public TTN cluster. Sure, the TTN Foundation-operated clusters are a showcase of TTI’s commercial abilities, but only to a certain extent, just as we do and will allow other community network contributors to optionally (!) upsell services.
I’m not against donations though; in fact, I think it’s worth considering them to cover the operational costs of the TTN Foundation. Currently, TTI pays the exponentially increasing Azure bills and for the time spent on ops; it would be great if TTN could become financially self-sustaining (through donations) and operationally independent (through decentralization, and potentially Slackbots?).