Postmortem of a fun couple bugs
###############################
:date: 2021-11-11 14:55
:author: tyrel
:category: Tech
:tags: Go, dbus, bugs
:slug: postmortem-of-a-fun-couple-bugs
:status: published

Story at my previous job:

Tieg: Hey Tyrel, I can't run ``invoke sign 5555``, can you help with this?

This is how my night started last night at 10pm. My coworker Tieg did some work on our `CLI <https://tidelift.com/cli>`_ project and was trying to release the latest version. We use `invoke <https://www.pyinvoke.org/>`_ to run our code signing and deployment scripts, so I thought it was just a quick "oh maybe I screwed up some Python!" fix. It wasn't.

I spent from 10:30pm until 1:30am this morning looking into why Tieg wasn't able to sign the code. The first thing I did was re-run the build on CircleCI, which had the same error, so hey! At least it was reproducible. The problem was that in our Makefile scripts we run ``tidelift version > tidelift-cli.version`` and then upload that file to our deployment directories, but this was failing for some reason. We let clients download this file to see what the latest version is, and our CLI tool can then ``selfupdate`` (except on Homebrew) to pull the latest version if you're outdated.

Once I knew what was failing, I was able to use CircleCI's SSH access to log in and see what happened, but I was getting some other errors. I was seeing some problems with ``dbus-launch``, so I promptly (mistakenly) yelled to the void on Twitter about ``dubs-launch``. Well, would you know it, I may have mentioned this before, but I work with Havoc Pennington.

Havoc Pennington: fortunately I wrote dbus-launch so may be able to tell you something, unfortunately it was like 15 years ago

Pumped about this new revelation, I started looking at our ``keychain`` dependency, because I thought the issue was there, as that's the only thing that uses ``dbus`` on Linux. Then we decided (Havoc pointed it out) that it was a red herring, and maybe the problem was elsewhere. I at least learned a bit about dbus and what it does, but not enough to really talk about it in any detail.

Would you know it, the problem was elsewhere. Tieg was running ``dtruss`` and saw that in a failing run the binary checked his ``/etc/hosts`` file, while in a passing run it did NOT. He then pointed out a 50ms lookup to our ``download.tidelift.com`` host.

Tieg then found `Issue 49517 <https://github.com/golang/go/issues/49517>`_, where someone mentions that Go 1.17.3 was failing for them on ``net/http`` calls, but not the right way.

It turns out that it wasn't the keyring stuff, and it wasn't *technically* the version calls that failed. What was happening is that every command starts with a check to https://download.tidelift.com/cli/tidelift-cli.version, which we then compare to the currently running version. If it's different and we're outdated, we say "you can run selfupdate!". What fails is that call to download.tidelift.com, because of compiling with Go 1.17.3 and a ``context canceled`` due to stream cleanup, I guess?
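
For illustration, here is a rough sketch of what a startup version check like that could look like in Go. This is not the actual Tidelift CLI code: ``fetchLatest``, ``parseVersion``, ``isOutdated``, the five second timeout, and the assumption that the ``.version`` file holds a bare ``1.2.5`` style string are all made up for the example, and a local ``httptest`` server stands in for download.tidelift.com so it runs without a network.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

// fetchLatest downloads the published version string from url.
// The 5s timeout is a hypothetical choice, not taken from the real CLI.
func fetchLatest(ctx context.Context, client *http.Client, url string) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		// A failure like "context canceled" would surface here.
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

// parseVersion turns "1.2.5" (or "v1.2.5") into [1 2 5].
func parseVersion(v string) ([]int, error) {
	parts := strings.Split(strings.TrimPrefix(strings.TrimSpace(v), "v"), ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("bad version %q: %w", v, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// isOutdated reports whether current is strictly older than latest.
// Missing trailing components count as zero, so "1.2" equals "1.2.0".
func isOutdated(current, latest string) (bool, error) {
	c, err := parseVersion(current)
	if err != nil {
		return false, err
	}
	l, err := parseVersion(latest)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(c) || i < len(l); i++ {
		var ci, li int
		if i < len(c) {
			ci = c[i]
		}
		if i < len(l) {
			li = l[i]
		}
		if ci != li {
			return ci < li, nil
		}
	}
	return false, nil
}

func main() {
	// Stand-in for the real download host, serving a bare version string.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "1.2.5")
	}))
	defer srv.Close()

	latest, err := fetchLatest(context.Background(), srv.Client(), srv.URL)
	if err != nil {
		fmt.Println("version check failed:", err)
		return
	}
	// "1.2.4" is a hypothetical running version.
	if outdated, err := isOutdated("1.2.4", latest); err == nil && outdated {
		fmt.Println("you can run selfupdate!")
	}
}
```

The point of the sketch is the failure mode: if the HTTP call itself errors out, the check fails before any version comparison happens, which matches what we saw.
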

Okay, so we need to downgrade to Go 1.17.2 to fix this. Last night while I was trying things, I noticed that our CircleCI config was using ``circleci/golang:1.16`` as its Docker image, which has been superseded by the ``cimg/go:1.16.x`` style of images. But I ran into some problems with that while upgrading to ``cimg/go:1.17.x``. The new image has different permissions, so I couldn't write to the directories that worked properly back when Mike wrote our ``config.yml`` file.

Tieg and I paired over Zoom and finished this up by cutting all the testing/scanning stuff out of our config files and getting down to just the Build and Deploy steps. We found ANOTHER bug: Build seems to run as the ``circleci`` user, but Deploy was running as ``root``. So in Build, a ``working_directory`` setting of ``~/go/tidelift/cli`` worked. But when we restored the saved cache in Deploy, it still put the files in ``/home/circleci/go/tidelift/cli``, while the ``working_directory`` of ``~/go/tidelift/cli`` was now relative to ``/root/``. What a nightmare!
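
To make the tilde problem concrete, here is a tiny sketch of how the same ``~/go/tidelift/cli`` string resolves to two different directories depending on whose home directory is in play. ``expandTilde`` is a hypothetical helper written for this post, not something CircleCI exposes, and the two home directories are assumed to be the usual ones for the ``circleci`` and ``root`` users.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// expandTilde resolves a leading "~" against the given home directory,
// mimicking what a shell does with the current user's HOME.
func expandTilde(p, home string) string {
	if p == "~" {
		return home
	}
	if strings.HasPrefix(p, "~/") {
		return path.Join(home, p[2:])
	}
	return p // absolute or relative paths are left alone
}

func main() {
	const workingDir = "~/go/tidelift/cli"

	// Build job: assumed to run as the circleci user.
	fmt.Println(expandTilde(workingDir, "/home/circleci")) // /home/circleci/go/tidelift/cli

	// Deploy job: assumed to run as root, so the same setting points elsewhere.
	fmt.Println(expandTilde(workingDir, "/root")) // /root/go/tidelift/cli
}
```

An absolute path passes through unchanged, which is why spelling the directory out in full sidesteps the mismatch.
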

With all tildes expanded to ``/home/circleci/go/tidelift/cli``, the Makefile hacks undone (removing windows+darwin+arm64 builds from your scripts during testing makes things A LOT faster!), and the PR merged, we were ready to roll.

I merged the PR, we cut a new version of TideliftCLI, 1.2.5, updated the changelog, and signed, sealed, delivered a new version which uses Go 1.17.2 and writes the proper ``tidelift-cli.version`` file in the deployment steps. We were ready to ROCK!

That was a fun day. Now it's time to write some rspec tests.