Archway 50/50 #41
zanicar
announced in Archway Blog
Archway 50/50
Photo by Tim Mossholder on Unsplash
Upgrading Archway network to CosmosSDK v0.50 and CosmWasm wasmd v0.51
Abstract
This article provides an inside perspective on the release pipeline of Archway upgrades in light of security advisories, testing issues and continual improvement. It covers our most recent planned upgrade to CosmosSDK v0.50 and CosmWasm wasmd v0.51 (which was still at v0.50 at the onset).
Background
The protocol team at Phi Labs, core contributors of the Archway network, is constantly working on innovative and novel features, such as Callbacks and FeeGrants, that allow for some very interesting smart contract use cases. However, as with most things in life, we are often faced with mundane or less exciting tasks such as upgrades. Specifically, in this case, the upgrade to CosmosSDK v0.50 and CosmWasm wasmd v0.51. These ‘boring’ tasks are however required on the path to new exciting features, and more often than not they result in some unscheduled ‘excitement’ when things don’t go exactly as expected…
Expecting the Unexpected
As experienced engineers we know to expect the unexpected. For this reason we have a number of mechanisms and processes to deal with unforeseen circumstances, such as automated testing, robust release processes, monitoring and control mechanisms, etc. However, there are times when these systems and processes interact with one another to produce unexpected results.
For example, during a recent protocol release cycle we already had a release candidate tagged for release to our public test network. However, on the very day we were scheduled to deploy this version we received a security advisory regarding an important upstream dependency. The required upgrade was consensus breaking and critical to our mainnet. Thus, we had to tag and release a new mainnet version and on this front things went exactly according to our emergency release plan. However, it resulted in a version bump (see our blog post on blockchain versioning) with some unforeseen consequences in our general release pipeline…
Emergent Behaviour
Our emergency release plan allows for coordinated binary swaps when critical security upgrades are in play. Thus, our mainnet went from Archway v7 to Archway v8… but on our testnet our tagged release candidate now had to be bumped to Archway v9. This initially resulted in some ambiguous conversations as we had to keep track of the version formerly known as v8 (to avoid this issue in future we are now using release code names, hence Archway 50/50). However, something went silently wrong with getting this new version deployed to our testnet…
Our robust automation tooling detected an issue with upgrading from v7 to v9, and silently reverted to v7. In our eagerness to get the release back on schedule we also didn’t notice this as the network booted back up and started producing blocks. We then conducted our internal tests and notified our developer community about the upcoming release. At this point our developer community noticed the discrepancy in network versions…
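In hindsight, an explicit post-upgrade assertion would have surfaced the silent revert immediately. Here is a minimal sketch of the idea (not our actual automation; the endpoint and expected version are placeholder assumptions) that queries the node’s abci_info RPC endpoint and fails loudly on a mismatch:

```ts
// Minimal sketch of a post-upgrade version assertion. The endpoint and
// expected application version below are placeholders for illustration.
const RPC_ENDPOINT = "https://rpc.example.org";
const EXPECTED_APP_VERSION = "9.0.0";

async function assertUpgradeApplied(): Promise<void> {
  // CometBFT's RPC exposes the application's reported software version via abci_info.
  const res = await fetch(`${RPC_ENDPOINT}/abci_info`);
  const body = await res.json();
  const appVersion: string = body.result.response.version;

  if (!appVersion.startsWith(EXPECTED_APP_VERSION)) {
    // Fail loudly instead of letting the network quietly keep running the old binary.
    throw new Error(`Expected app version ${EXPECTED_APP_VERSION}, node reports ${appVersion}`);
  }
}
```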
Back to the Unexpected
This time we confirmed the appropriate network version and updated our tooling and automation processes to be very explicit in this regard. At the protocol level all tests were passing… but then we got reports that internal application tests were failing! Smart contract queries returned success conditions but with zero content; our indexer was not indexing any events post upgrade and likewise was not reporting any issues or errors.
On the one hand, we uncovered a bug in the arch3.js library. The version sniffing mechanism that determines which client to use (Tendermint37Client or Comet38Client) disconnects the inappropriate client. However, the client disconnect cascaded into the HttpBatchClient, leading to misbehavior in the appropriate client. Ensuring that the HttpBatchClient is only created after version sniffing has concluded resolves this matter.
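To make the shape of that fix concrete, here is a minimal sketch (not the actual arch3.js code; the connectBatchClient helper and the version check are ours for illustration) of creating the batching client only after the node version is known:

```ts
import {
  Comet38Client,
  HttpBatchClient,
  HttpClient,
  Tendermint37Client,
} from "@cosmjs/tendermint-rpc";

// Sketch: sniff the node's CometBFT/Tendermint version with a throwaway,
// non-batching HttpClient first, and only then create the long-lived
// HttpBatchClient, so the sniffing client's disconnect cannot affect it.
export async function connectBatchClient(endpoint: string) {
  const probe = await Comet38Client.create(new HttpClient(endpoint));
  const version = (await probe.status()).nodeInfo.version; // e.g. "0.38.12"
  probe.disconnect();

  // The batching client is created only after version sniffing has concluded.
  const batchRpc = new HttpBatchClient(endpoint);
  return version.startsWith("0.37.")
    ? Tendermint37Client.create(batchRpc)
    : Comet38Client.create(batchRpc);
}
```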
On the other hand, the issue with the indexer, albeit a distinct issue, turned out to be very closely related. Our engineers highlighted that upstream dependencies should be confirmed to use the appropriate Comet38Client. They then uncovered an upstream dependency PR that addresses build-related issues with correct client detection based on the relevant Cosmos SDK version. An updated version of this dependency includes this PR and support for Comet38Client. Upgrading to this version of the dependency resolved the issue with the indexer.
Conclusion and Key Takeaways
The conclusion and key takeaways here are that we expected these issues to report as errors and be caught by our tests. However, since they failed silently, they managed to slip through the cracks. These cracks are easily addressed by including some regression and integration tests that simply confirm expected behaviors even when functional tests pass.
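As a hypothetical example of such a check (the endpoint, contract address and query message below are placeholders, not our actual test suite), a query that succeeds should also be asserted to return content:

```ts
import { CosmWasmClient } from "@cosmjs/cosmwasm-stargate";

// Placeholders for illustration only.
const RPC_ENDPOINT = "https://rpc.example.org";
const CONTRACT_ADDRESS = "archway1...";

test("smart query returns a non-empty payload after the upgrade", async () => {
  const client = await CosmWasmClient.connect(RPC_ENDPOINT);
  const result = await client.queryContractSmart(CONTRACT_ADDRESS, { config: {} });

  // A passing functional test only proves the call did not throw; this
  // assertion also catches the "success with zero content" failure mode.
  expect(result).toBeDefined();
  expect(Object.keys(result).length).toBeGreaterThan(0);
});
```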
The Case for “Live” Testing
“Can automated testing prevent all issues on testnet?” Good question. In fact, this could easily be classified as a “final boss” question, because strong opposing opinions may arise. But IMHO the answer would be “No”. We cannot possibly test for every case without perfect knowledge, and if we had perfect knowledge we would not need tests in the first place; and without perfect knowledge unforeseen edge cases can always arise…
As the dust of the client selection issues settled and the general testnet environment was stable again, a more sinister issue raised its head. We noticed transaction failures across the board for services using our Guzzler Club product. For some inexplicable reason, these transactions were running out of gas (exceeding the limit set by our FeeGrant module). Something in the new consensus engine or the upgraded smart contract engine results in higher gas consumption, but only in certain cases…
Our engineers revisited our contracts and applied further optimizations to reduce gas consumption to previous levels, effectively addressing the issue. However, at this time we have not established the source or cause of this, and a number of theories abound. The upgraded consensus engine and smart contract engine may result in different execution pathways from the previous version that add to gas consumption. Thus any previously optimized contracts may need to be revisited to ensure they are optimized for the latest version.
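One way to catch this kind of regression early is to simulate representative transactions against the upgraded network and compare the estimated gas with the granted allowance. The sketch below uses CosmJS (which arch3.js builds on); the endpoint, contract address, execute message and gas ceiling are all assumed placeholders, not our actual values:

```ts
import { SigningCosmWasmClient } from "@cosmjs/cosmwasm-stargate";
import { toUtf8 } from "@cosmjs/encoding";
import { DirectSecp256k1HdWallet } from "@cosmjs/proto-signing";
import { MsgExecuteContract } from "cosmjs-types/cosmwasm/wasm/v1/tx";

// Placeholders for illustration only.
const RPC_ENDPOINT = "https://rpc.example.org";
const CONTRACT_ADDRESS = "archway1...";
const GAS_CEILING = 250_000; // assumed FeeGrant allowance per transaction

async function checkGasRegression(mnemonic: string): Promise<void> {
  const wallet = await DirectSecp256k1HdWallet.fromMnemonic(mnemonic, { prefix: "archway" });
  const [{ address: sender }] = await wallet.getAccounts();
  const client = await SigningCosmWasmClient.connectWithSigner(RPC_ENDPOINT, wallet);

  // A representative execute message for the contract under test.
  const msg = {
    typeUrl: "/cosmwasm.wasm.v1.MsgExecuteContract",
    value: MsgExecuteContract.fromPartial({
      sender,
      contract: CONTRACT_ADDRESS,
      msg: toUtf8(JSON.stringify({ increment: {} })),
      funds: [],
    }),
  };

  // Simulation returns the gas the node estimates the transaction will consume.
  const gasUsed = await client.simulate(sender, [msg], "gas regression check");
  if (gasUsed > GAS_CEILING) {
    throw new Error(`Gas regression: ${gasUsed} exceeds the granted ceiling of ${GAS_CEILING}`);
  }
}
```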
Sure, performance testing and benchmarking may be able to detect this type of issue. The point is that we test for the things we expect might fail, for the things we expect might change, for the ways we expect bad actors might attempt to exploit a system… but unknown and unforeseen issues may still crop up, and we had best be prepared to deal with them quickly and efficiently. Regardless of what automated testing may entail, at the end of the day we still need to test products, services and applications in an environment that is as close to the live environment as possible… and for this reason we deploy network upgrades to our testnet, for both ourselves and our developer community, before we deploy them to mainnet.
What about an additional testnet?
Currently, we utilize two testnets before any release to mainnet. All protocol changes, updates and upgrades get tested locally before progressing to our internal testnet, Titus. Titus runs on the same infrastructure specifications as our public testnet, Constantine, and our mainnet, Triomphe. However, it is configured with very short governance periods and other parameters conducive to testing and faster iteration. Very importantly, it is also an unstable network, meaning state resets should be expected. Consequently, only the most common smart contracts and the most basic state are typically present on this network, making it unsuitable for general state management, state transition and state continuity testing.
Constantine is our public testnet where developers get to deploy and test their smart contracts. It attempts to be as close as is reasonably possible to the stability and continuity of Triomphe, our mainnet. However, it is the only place where the above-mentioned state-related tests can reliably be conducted, and sometimes that means it will experience issues. It is, after all, still a testnet.
Both internally and from our developer community we have been asked if we can produce an additional testnet to sit in the gap between Titus and Constantine: a testnet that maintains some state, more than Titus but less than Constantine, to allow for efficient state testing. I personally advocated for this idea, as from a theoretical perspective it makes sense. But we have to evaluate the cost-benefit of this endeavor to ensure its practicality…
On closer examination it turns out that even though the idea is theoretically sound, it is unfortunately impractical. First, this network would require state, specifically state from smart contracts, which means those contracts need to be deployed to the network. They would also need to operate and be upgraded to stay in alignment with the state encountered on Constantine and Triomphe in order to remain relevant. Thus, the very same developers and internal resources would be burdened with owning this task as additional overhead. In addition, when testing on this network is conducted, those same resources would have to take responsibility to identify, report and potentially rectify any issues that may be uncovered. This renders such a network impractical, as its purpose in the first place is to reduce the burden on development resources… instead it would only shift the burden to this network, with added overhead for both Phi Labs and our developer community.
Conclusion
At the time of publication we will have dealt with not one, not two, but three distinct security advisories that each individually caused delays to the planned release of Archway 50/50, on top of the testing issues we uncovered. We have gained valuable experience and adapted our pipelines to incorporate the lessons learned, reducing the burden and limiting these types of disruptions in the future. However, we can safely conclude that there will always be events or circumstances that are not directly catered for, and developing the capability, preparedness and processes to respond to them efficiently as a team is necessary. I am personally very grateful to be part of the team at Phi Labs, which most certainly has this capability and is continually working to improve it!