
IoT: Internet-Optional Things

I both love and hate the idea of "smart" devices in my home. It's tough to balance the convenience (turning lights on and off automatically, adjusting thermostats from my phone) against the risk that all of my devices are doing evil things to my fellow Internet citizens. But I think I've landed on a compromise that works.

I've had Internet-connected devices for a long time now. I've even built devices that can go online. At some point a year or two ago, I realized that I could do better than what I had. Here's a loose list of requirements I made up for my own "IoT" setup at home:

  • works locally as a primary objective
  • works when my Internet connection is down or slow
  • avoids phoning home to the vendor's (or worse: a third party's) API or service
  • can be fully firewalled off from the actual Internet, ideally through a physical limitation
  • isn't locked up in a proprietary platform that will either become expensive, limited, or will cease to exist when it's no longer profitable

My setup isn't perfect; it doesn't fully meet all of these criteria, but it's close, and it's been working well for me.

At the core of my home IoT network is a device called Hubitat Elevation. It works as a bridge between the actual Internet and my devices, which are (for the most part) incapable of connecting to the Internet directly. My devices, which range from thermostats, to lights, to motion sensors, to switchable outlets, and more, use either Zigbee or Z-Wave to communicate with each other (they form repeating mesh networks automatically) and with the hub. Again, they don't have a connection to my WiFi or my LAN, except through the hub, because they're physically incapable of connecting to my local network (they have neither ethernet ports nor WiFi radios). The hub brokers all of these connections and helps me control and automate these devices.

The hub—the Hubitat Elevation—is fairly inexpensive and not as fully "open" as I'd like, but it has good integration abilities, is well-maintained, is compatible with many devices (many of which are also compatible with the more-proprietary but similar SmartThings hub), and has an active community of people answering questions, coming up with new ideas, and maintaining add-ons. These add-ons are written in Groovy, which I hadn't really used in earnest before working with the Hubitat, and you can write and modify them to suit your needs.

The hub itself is mostly controlled through a web UI, which I'll admit is clunky, or through a mobile app. The mobile app adds capabilities like geo-fencing, presence, and notifications. The hub can also be connected to other devices; I have mine connected to my Echo, for example, so I can say "Alexa, turn off the kitchen lights."

The devices themselves are either mains-powered (such as my Hue lightbulbs, baseboard thermostats, and switching outlets) or battery-powered (such as motion sensors, door contact switches, and buttons). Many of these devices also passively measure things like local temperature and relay this data, along with battery health, to the hub.

I'm going to get into some examples of how I have things set up here (though this isn't a full getting-started tutorial), but first I wanted to mention a few things that were not immediately obvious to me, and that will get you off on the right foot if you choose to follow a similar path to mine.

  • Third-party hub apps are a bit weird in their structure (there are usually parent and child apps), and keeping them up to date can be a pain. Luckily, Hubitat Package Manager exists, and many add-ons can be maintained through this useful tool.
  • There's a built-in app called "Maker API" which provides a REST interface to your various devices. This technically goes against one of my loose requirements above, but I have it limited to the LAN and requiring authentication, and that feels like a fair trade-off for when I want to use this kind of connection (there's a small sketch of what a Maker API call looks like after this list).
  • There's an app that will send measured data to InfluxDB, which is a timeseries database that I have running locally on my SAN (as a Docker container on my Synology DSM), and it works well as a data source for Grafana (the graphs in this post come from Grafana).
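
For example, a Maker API call from the LAN looks roughly like this. The hub address, app ID, device ID, and token below are placeholders, and this is just a sketch of the documented URL shape, not production code:

# A sketch of driving a device through Hubitat's Maker API from the LAN.
# The hub address, app ID, device ID, and token below are placeholders.
import requests

HUB = "http://192.168.1.50"  # the hub's LAN address (placeholder)
APP_ID = "38"                # the Maker API app instance ID (placeholder)
TOKEN = "REDACTED"           # access token generated by the Maker API app

def device_command(device_id, command):
    # e.g. device_command("12", "off") to switch a device off
    url = f"{HUB}/apps/api/{APP_ID}/devices/{device_id}/{command}"
    resp = requests.get(url, params={"access_token": TOKEN}, timeout=5)
    resp.raise_for_status()
    return resp.json()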

Programmable Thermostats

My house is heated primarily through a centralized heat pump (which also provides cooling in the summer), but many rooms have their own baseboard heaters + independent thermostats. Before automation, these thermostats were either completely manual, or had a hard-to-manage on-device scheduling function.

I replaced many of these thermostats with connected versions. My main heat pump's thermostat (low voltage) is the Honeywell T6 Pro Z-Wave, and my baseboard heaters (line voltage) are now controlled with Zigbee thermostats from Sinopé.

Managing these through the web app is much better than the very limited UI available directly on programmable thermostats. The Hubitat has a built-in app called "Thermostat Scheduler." Here's my office, for example (I don't like cold mornings (-: ):

Lighting

An often-touted benefit of IoT is lighting automation, and I have several lights I control with my setup. Much of this is through cooperation with the Hue bridge, which I do still have on my network, but I could remove at some point, since the bulbs speak Zigbee. The connected lights that are not Hue bulbs are mostly controlled by Leviton Decora dimmers, switches, and optional dimmer remotes for 3-way circuits. Most of this is boring/routine stuff such as "turn on the outdoor lighting switch at dusk and off at midnight," configured on the hub with the "Simple Automation Rules" app, but I have a couple more interesting applications.

Countertop

My kitchen counter is long down one side—a "galley" style. There's under-cabinet countertop lighting the whole length of the counter, but it's split into two separate switched/dimmed circuits of LED fixtures—one to the left of the sink and one to the right. I have these set to turn on in the morning and off at night. It's kind of annoying that there are two dimmers that work independently, though, and I find it aesthetically displeasing when half of the kitchen is lit up bright and the other half is dim.

Automation to the rescue, though. I found an app called Switch Bindings that allows me to gang these two dimmers together. Now, when I adjust the one on the left, the dimmer on the right matches the new brightness, and vice versa. A mild convenience, but it sure is nice to be able to effectively rewire these circuits in software.

Cellar

I have an extensive beer cellar that I keep cool and dark most of the time. I found myself sometimes forgetting to turn off the lights next to the bottles, and—as someone who is highly sensitive to mercaptans/thiols (products of lightstruck beers, a "skunky" smell/fault)—I don't want my beer to see any more light than is necessary.

With my setup, I can have the outlet that my shelf lighting is plugged into turn on and off when the door is opened or closed. There's also a temperature sensor and a moisture sensor on the floor, so I can find out quickly (via the notification system) if the floor drain backs up, or if a bottle somehow breaks/leaks enough for the sensor to notice, and I can keep track of cellar temperature over time.

these lights turn on and off when the door is opened and closed, respectively

I also receive an alert on my phone when the door is opened/closed, which is increasingly useful as the kids get older.

Foyer

Our house has an addition built onto the front, and there's an entrance room that is kind of separated off from the rest of the living space. The lighting in here has different needs from elsewhere because of this. Wouldn't it be nice if the lights in here could automatically turn on when they need to?

Thanks to Simple Automation Rules (the built-in app), and a combination of the SmartThings motion sensor and the DarkSky Device Driver (which will need to be replaced at some point, but it still works for now), I can have the lights in there—in addition to being manually controllable from the switch panels—turn on when there's motion, but only if it's dark enough outside for this to be needed. The lights will turn themselves off when there's no more motion.

Ice Melting

We have a fence gate that we keep closed most of the time so Stanley can safely hang out in our backyard. We need to use it occasionally, and during the winter this poses a problem: that side of the house has a bit of water runoff that is normally not a big deal, but in the winter it sometimes gets dammed up by the surrounding snow/ice and freezes, making the gate impossible to open.

In past winters, I've used ice melting chemicals to help free the gate, but it's a pain to keep these on hand, and they corrode the fence posts where the powder coating has chipped off. Plus, it takes time for the melting to work and bags of this stuff are sometimes hard to find (cost aside).

This year, I invested in a snow melting mat. Electricity is relatively cheap here in Quebec, thanks to our extensive Hydro-Electric investment, but it's still wasteful to run this thing when it's not needed (arguably still less wasteful than bag after bag of ice melter). I'm still tweaking the settings on this one, but I have the mat turn on when the temperature drops and off when the ambient temperature is warmer. It's working great so far:

Desk foot-warming mat

My office is in the back corner of our house. The old part. I suspect it's poorly insulated, and the floor gets especially cold. I bought a warming mat on which to rest my feet (similar to this one). It doesn't need to be on all of the time, but I do like to be able to call for heat on demand, and have it turn itself off after a few minutes.

I have the mat plugged into a switchable outlet. In the hub, I have rules set up to turn this mat on when I press a button on my desk. The mat turns itself off after 15 minutes, thanks to a second rule in the built-in app "Rule Machine". Warm toes!

When I first set this up, I found myself wondering if the mat was already on. If I pressed the button and didn't hear a click from the outlet's relay, I guessed it was already on. But the hub allows me to get a bit more insight. I didn't want something as distracting (and redundant) as an alert on my phone. I wanted something more of an ambient signifier. I have a Hue bulbed lamp on my desk that I have set up to tint red when the mat is on, and when it turns off, to revert to the current colour and brightness of another similar lamp in my office. Now I have a passive reminder of the mat's state.

Graphs

Another interesting aspect of all of this (to me, as someone who uses this kind of tooling in my actual work, anyway) is that I can get a visual representation of the different sensors in my house, now that we have these non-intrusive devices.

For example, you can see here that I used my office much less over the past two weeks (both in presence and in the amount I used the foot mat), since we took a much-needed break (ignore the CO2 bits for now, that's maybe a separate post):

As I mentioned on Twitter a while back, a graph helped me notice that a heating/cooling vent was unintentionally left open when we switched from cooling to heating:

Or, want to see how well that outdoor mat on/off switching based on temperature is working?

An overview of the various temperatures in my house (and outside; the coldest line) over the past week:

Tools

What's really nice about having all of this stuff set up, aside from the aforementioned relief of it not being directly compromisable from the Internet, is that I now have tools that I can use within this infrastructure. For example, when we plugged in the Christmas tree lights this year, I had the outlet's schedule match the living room lighting, so they never get accidentally left on overnight.

Did it now

I originally wrote this one to publish on Reddit, but also didn't want to lose it.

Many many years ago, I worked at a company in Canada that ran some financial services.

The owner was the kind of guy who drove race cars on weekends, and on weekdays would come into the programmers' room to complain that our fingers weren't typing fast enough.

On a particularly panicky day, one of the web servers in the pool that served our app became unresponsive. We had these servers hosted in a managed rack at a hosting provider offsite. After several hours of trying to bring it back, our hosting partner admitted defeat and declared that they couldn't revive WEB02. It had a hardware failure of some sort. We only had a few servers back then, and they were named according to their roles in our infrastructure: WEB01, WEB02, CRON01, DB03, etc.

Traffic and backlog started piling up with WEB02 out of the cluster, despite our efforts to mitigate the loss (which we considered temporary). Our head of IT was on the phone with our hosting provider trying to come up with a plan to replace the server. This was before "cloud" was a thing and each of our resources was a physically present piece of hardware. The agreed-upon solution was to replace WEB02 with a new box, which they were rushing into place from their reserve of hardware, onsite.

By this point, the race-car-driving, finger-typing-speed-complaining owner of the company was absolutely losing it. It seemed like he was screaming at anyone and everyone who dared make eye contact, even if they had truly nothing to do with the server failure or its replacement.

Our teams worked together to get the new box up and running in record time, and were well into configuring the operating system and necessary software when they realized that no one wanted to go out on a limb and give the new machine a name. President Screamy was very particular about these names for some reason and this had been the target of previous rage fests, so neither the hosting lead nor our internal soldiers wanted to make a decision that they knew could be deemed wrong and end up the target of even more yelling. So, they agreed that the hosting provider would call the CEO and ask him what he'd like to name the box.

But before that call could be made, the CEO called our hosting provider to tear them up. He was assured that the box was almost ready, and that the only remaining thing was whether to name it WEB02 to replace the previous box or to give it a whole new name like WEB06. Rage man did not like this at all, and despite being at the other end of the office floor from his office, we could all hear him lay fully into the otherwise-innocent phone receiver on the other end: "I just need that box up NOW. FIX IT. I don't care WHAT you call it! It just needs to be live! DO IT NOW!"

And that, friends, is how we ended up with a web pool of servers named WEB01, WEB03, WEB04, WEB05, and (the new server) DOITNOW. It also served well as a cautionary tale for new hires who happened to notice.

Cache-Forever Assets

I originally wrote this to help Stoyan out with Web Performance Calendar; republishing here.

A long time ago, we had a client with a performance problem. Their entire web app was slow. The situation with this client's app was a bit tricky; this client was a team within a very large company, and often—in my experience, anyway—large companies mean that there are a lot of different people/teams who exert control over deployed apps and there's a lot of bureaucracy in order to get anything done.

The client's team that had asked us to help with slow page loads only had passive access to logs (they couldn't easily add new logging), they were mostly powerless to do things like optimize SQL queries (even though those queries were already being logged), and they really only controlled the web app itself, which was a very heavy Java/Spring-based app. The team we were working with knew just enough to maintain the user-facing parts of the app.

We, a contracted team brought in to help with guidance (and we did eventually build some interesting technology for this client), had no direct ability to modify the deployed app, nor did we even get access to the server-side source code. But we still wanted to help, and the client wanted us to help, given all of these constraints. So, we did a bit of what-we-can-see analysis, and came up with a number of simple, but unimplemented optimizations. "Low-hanging fruit" if you will.

These optimizations included things like "reduce the size of these giant images (and here's how to do it without losing any quality)", "concatenate and minify these CSS and JavaScript assets" (the app was fronted by an HTTP 1.x reverse proxy), and "improve user-agent caching". It's the last of these that I'm going to discuss here.

Now, before we get any deeper into this, I want to make it clear that the strategy we implemented (or, more specifically: advised the client to implement) is certainly not ground-breaking—far from it. This client, whether due to geographic location, or perhaps being shielded from outside influence within their large corporate infrastructure, had not implemented even the most basic of browser-facing optimizations, so we had a great opportunity to teach them things we'd been doing for years—maybe even decades—at this point.

We noticed that all requests were slow. Even the smallest requests. Static pages, pages dynamically rendered for the logged-in user, images, CSS, even redirects were slow. And we knew that we were not in a position to do much about this slowness, other than to identify it and hope the team we were in contact with could request that the controlling team look into the more general problem. "Put the assets on a CDN and avoid the stack/processing entirely" was something we recommended, but it wasn't something we could realistically expect to be implemented given the circumstances.

"Reduce the number of requests" was already partially covered in the "concatenate and minify" recommendation I mentioned above, but we noticed that because all requests were slow, the built-in strategy of using the stack's HTTP handler to return 304 not modified if a request could be satisfied via Last-Modified or ETag was, itself, sometimes taking several seconds to respond.

A little background: normally (lots of considerations like cache visibility glossed over here), when a user agent makes a request for an asset that it already has in its cache, it tells the server "I have a copy of this asset that was last modified at this specific time" and the server, once it sees that it doesn't have a newer copy, will say "you've already got the latest version, so I'm not going to bother sending it to you" via a 304 Not Modified response. Alternatively, a browser might say "I've got a copy of this asset that you've identified to have unique properties based on this ETag you sent me; here's the ETag back so we can compare notes" and the server will—again, if the asset is already current—send back a 304 response. In both cases, if the server has a newer version of the asset it will (likely) send back a 200 and the browser will use and cache a new version.
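
As a rough illustration of that conversation (with a placeholder URL, and using Python's requests library rather than a browser), a conditional request looks like this:

# A sketch of revalidation with a conditional GET; the URL is a placeholder.
import requests

url = "https://example.com/assets/app.css"

first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Later, instead of re-downloading, ask "has this changed since my copy?"
conditional_headers = {}
if etag:
    conditional_headers["If-None-Match"] = etag
if last_modified:
    conditional_headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=conditional_headers)
if second.status_code == 304:
    print("Not Modified: the cached copy is still current; no body was sent")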

It's these 304 responses that were slow on the server side, like all other requests. The browser was still making the request and waiting a (relatively) long time for the confirmation that it already had the right version in its cache, which it usually did.

The strategy we recommended (remember: because we were extremely limited in what we expected to be able to change) was to avoid this Not Modified conversation altogether.

With a little work at "build" time, we were able to give each of these assets not only a unique ETag (as determined by the HTTP dæmon itself), but a fully unique URL, based on its content. By doing so, and by setting appropriate HTTP headers (more on the specifics of this below), we could tell the browser "you never even need to ask the server if this asset is up to date." We could cache "forever" (in practice: a year in most cases, but that was close enough for the performance gain we needed here).

Fast forward to present time. For our own apps, we do use a CDN, but I still like to use this cache-forever strategy. We now often deploy our main app code to AWS Lambda, and find ourselves uploading static assets to S3, to be served via CloudFront (Amazon Web Services' CDN service).

We have code that collects (via either a pre-set lookup, or by filesystem traversal) the assets we want to upload. We do whatever preprocessing we need to do to them, and when it's time to upload to S3, we're careful to set certain HTTP headers that indicate unconditional caching for the browser:

def upload_collected_files(self, force=False):
    for f, dat in self.collected_files.items():

        key_name = os.path.join(
            self.bucket_prefix, self.versioned_hash(dat["hash"]), f
        )

        if not force:
            try:
                s3.Object(self.bucket, key_name).load()
            except botocore.exceptions.ClientError as e:
                if e.response["Error"]["Code"] == "404":
                    # key doesn't exist, so fall through and upload it below
                    pass
                else:
                    # Something else has gone wrong.
                    raise
            else:
                # The object does exist.
                print(
                    f"Not uploading {key_name} because it already exists, and not in FORCE mode"
                )
                continue

        # RFC 2616:
        # "HTTP/1.1 servers SHOULD NOT send Expires dates more than one year in the future"
        headers = {
            "CacheControl": "public,max-age=31536000,immutable",
            "Expires": datetime.today() + timedelta(days=365),
            "ContentType": dat["mime"],
            "ACL": "public-read",
        }

        self.upload_file(
            dat["path"],
            key_name,
            self.bucket,
            headers,
            dry_run=os.environ.get("DRY_RUN") == "1",
        )

The key name (which extends to the URL) is a shortened representation of a file's contents, plus a "we need to bust the cache without changing the contents" version on our app's side, followed by the asset's natural filename, such as (the full URL): https://static.production.site.faculty.net/c7a1f31f4ed828cbc60271aee4e4f301708662e8a131384add7b03e8fd305da82f53401cfd883d8b48032fb78ef71e5f-2020101000/images/topography-overlay.png
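
For a sense of how a key like that could be derived, here's a minimal sketch; the SHA-512 choice and the date-style "bust" suffix are assumptions based on the shape of the URL above, not our actual build code:

# A sketch only: derive a content-addressed key for an asset. The hash algorithm
# (SHA-512) and the date-based bust suffix are assumptions, not our real tooling.
import hashlib

def versioned_key(natural_name, contents, bust_version="2020101000", prefix=""):
    digest = hashlib.sha512(contents).hexdigest()
    return f"{prefix}{digest}-{bust_version}/{natural_name}"

with open("images/topography-overlay.png", "rb") as fh:
    print(versioned_key("images/topography-overlay.png", fh.read()))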

Setting those headers effectively tells S3 to relay Cache-Control and Expires to the browser (via CloudFront), so the asset isn't considered stale for a year. Because of this, the browser doesn't even make a request for the asset if it's got it cached.

We control cache busting (such as a new version of a CSS, JS, image, etc.) completely via the URL; our app has access (via a lookup dictionary) to the uploaded assets, and can reference the full URL to always be the latest version.
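
That lookup is conceptually just a dictionary from the asset's natural name to its uploaded URL; something like this illustrative sketch (the names and helper here are hypothetical):

# Illustrative only: a manifest produced at build/upload time, and a tiny helper
# that templates can call to get the current (content-hashed) URL for an asset.
ASSET_MANIFEST = {
    "images/topography-overlay.png": (
        "https://static.production.site.faculty.net/"
        "c7a1f31f...2020101000/images/topography-overlay.png"  # hash shortened here
    ),
}

def asset_url(natural_name):
    # a missing asset raises KeyError, which we'd rather see at render time
    return ASSET_MANIFEST[natural_name]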

The real beauty of this approach is that the browser can entirely avoid even asking the server if it's got the latest version—it just knows it does—as illustrated here:

Developer tools showing "cached" requests for assets on faculty.com

How I helped fix Canadaʼs COVID Alert app

On July 31st, Canada's COVID Alert app was made available for general use, though it does not have support for actually reporting a diagnosis in most provinces, yet.

In Quebec, we can run the tracing part of the app, and if diagnosis codes become available here, the app can retroactively report contact. It uses the tracing mechanism that Google and Apple created together, and in my opinion—at least for now—Canadians should be running this thing to help us all deal with COVID-19. I won't run it forever, but for now, it seems to me that the benefits outweigh the "government can track me" fear (it's not actually tracking you; it doesn't even know who you are), and it's enabled on my phone.

But, before I decided to take this position and offer up my own movement data, I wanted to be sure the app is doing what it says it's doing—at least to the extent of my abilities to be duly diligent. (Note: it's not purely movement data that's shared—at least without more context—but it's actual physical interactions with other people whose phones are available within the radio range of Bluetooth LE.)

Before installing the app on my real daily-carry phone, I decided to put it on an old phone I still have, and to do some analysis on the most basic level of communication: who is it contacting?

In 2015, I gave a talk at ConFoo entitled "Inspect HTTP(S) with Your Own Man-in-the-Middle Non-Attacks", and this is exactly what I wanted to do here. The tooling has improved in the past 5 years, and firing up mitmproxy, even without ever having used it on this relatively new laptop, was a one-liner, thanks to Nix:

nix-shell -p mitmproxy --run mitmproxy

This gave me a terminal-based UI and proxy server that I pointed my old phone at (via the WiFi network settings, under HTTP proxy, pointed to my laptop's local IP address). I needed to have mitmproxy create a Certificate Authority that it could use to generate and sign "trusted" certificates, and then have my phone trust that authority, by visiting http://mitm.it/ in mobile Safari and doing the certificate acceptance dance (this is even more complicated on the latest versions of iOS). Also worth noting: certain endpoints such as the Apple App Store appear to use Certificate Pinning, so you'll want to do things like install the COVID Alert app from the App Store before turning on the proxy.

Once I was all set up to intercept my own traffic, I visited some https:// URLs and saw the request flows in mitmproxy.

I fired up the COVID Alert app again, and noticed something strange… something disturbing:

shows that the app is accessing clients.google.com

In addition to the expected traffic to canada.ca (I noticed it's using .alpha.canada.ca, which I suspect is due to the often-reported, unbearably long bureaucratic hassle of getting a .canada.ca TLS certificate, but that's another story), my phone, when running COVID Alert, was contacting Google.

HEAD https://clients4.google.com/generate_204

A little web searching helped me discover that this is a commonly-used endpoint that helps developers determine if the device is behind a "captive portal" (an interaction that requires log-in or payment, or at least acceptance of terms before granting wider access to the Web). I decided that this was probably unintended by the developers of COVID Alert, but it still bothered me that an app, designed for tracking interactions between people['s devices], that the government wants us to run is telling Google that I'm running it, and disclosing my IP address in doing so:

shows that the User-Agent header identifies the app

(Note that the app clearly identifies itself in the User-Agent header.)
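
For context, a generate_204-style reachability check is conceptually very simple; this sketch shows the general idea (it is not react-native-netinfo's actual implementation, and the URL is a placeholder):

# The general idea behind a captive-portal / reachability check; not the actual
# react-native-netinfo code. The URL here is a placeholder.
import requests

def behind_captive_portal(url="https://connectivity.example.com/generate_204"):
    try:
        resp = requests.get(url, timeout=5, allow_redirects=False)
    except requests.RequestException:
        return True  # no connectivity at all
    # A captive portal usually intercepts the request and returns its own login
    # page (200) or a redirect, instead of the expected empty 204 response.
    return resp.status_code != 204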

A bit more quick research turned up a statement by Canada's Privacy Commissioner:

An Internet Protocol (IP) address can be considered personal information if it can be associated with an identifiable individual. For example, in one complaint finding, we determined that some of the IP addresses that an internet service provider (ISP) was collecting were personal information because the ISP had the ability to link the IP addresses to its customers through their subscriber IDs.

It's not too difficult to imagine that Google probably has enough data on Canadians for this to be a real problem.

I discovered that this app is maintained by the Canadian Digital Service, and that the source code is on GitHub, but that the code itself didn't directly contain any references to clients3.google.com.

It's a React Native app, and I figured that the call out to Google must be in one of the dependencies, which—considering the norm with JavaScript apps—are pleasantly restrained, mostly to React itself. I had no idea which of these libraries was calling out to Google.

Now, I could have run this app on the iOS Simulator (which I did end up doing to test my patches, below), but I thought "let's see what my actual phone is doing." I threw caution to the wind and ran checkra1n on my old phone, which gave me ssh access, which in turn allowed me to copy the app's application bundle to my laptop, where I could do a little more analysis (note the app is bundled as CovidShield because it was previously developed by volunteers at Shopify and was then renamed by CDS (or so I gather, anyway)).

~/De/C/iphone/CovidShield.app ▶ grep -r 'clients3.google.com' *
main.jsbundle:__d(function(g,r,i,a,m,e,d){Object.defineProperty(e,"__esModule",{value:!0}),
e.default=void 0;var t={reachabilityUrl:'https://clients3.google.com/generate_204',
reachabilityTest:function(t){return Promise.resolve(204===t.status)},reachabilityShortTimeout:5e3,
reachabilityLongTimeout:6e4,reachabilityRequestTimeout:15e3};e.default=t},708,[]);

(Line breaks added for legibility.) Note reachabilityUrl:'https://clients3.google.com/generate_204'. Found it! A bit more searching led me to a package called react-native-netinfo (which was directly in the above-linked package.json), and its default configuration that sets the reachabilityUrl to Google.

Now that I knew where it was happening, I could fix it.

To make this work the same way, we needed a reliable 204 endpoint that the app could hit, and to keep with the expectation that this app should not "leak" data outside of canada.ca, I ended up submitting a patch for the server side code that the app calls. (It turns out that this was not necessary after all, but I'm still glad I added this to my report.)

I also patched, and tested the app code itself via the iOS Simulator.

I then submitted a write-up of what was going wrong and why it's bad, to the main app repository, as cds-snc/covid-alert-app issue 1003, and felt pretty good about my COVID Civic Duty of the day.

The fine folks at the Canadian Digital Service seemed to recognize the problem and agree that it was something that needed to be addressed. A few very professional back-and-forths later (I'll be honest: I barely knew anything about the CDS and I expected some runaround from a government agency like this, and I was pleasantly surprised), we landed on a solution that simply didn't call the reachability URL at all, and they released a version of the app that fixed my issue!

With the new version loaded, I once again checked the traffic and can confirm that the new version of the app does not reach out to anywhere but .canada.ca.

A mitmproxy flow showing traffic to canada.ca and not google.com

New Site (same as old site)

You're looking at the new seancoates.com.

"I'm going to pay attention to my blog" posts on blogs are… passé, but…

I moved this site to a static site generator a few years ago when I had to move some server stuff around, and had let it decay. I spent most of the past week of evenings and weekend updating to what you see now.

It's still built on Nikola, but now the current version.

I completely reworked the HTML and CSS. Turns out—after not touching it in earnest in probably a decade (‼️)—that CSS is a much more pleasant experience these days. Lately, I've been doing almost exclusively back-end and server/operations work, so it was actually a bit refreshing to see how far CSS has come along. Last time I did this kind of thing, nothing seemed to work—or if it did work, it didn't work the same across browsers. This time, I used Nikola's SCSS builder and actually got some things done, including passing the Accessibility tests (for new posts, anyway) in Lighthouse (one of the few reasons I fire up Chrome), and a small amount of Responsive Web Design to make some elements reflow on small screens. When we built the HTML for the previous site, so long ago, small screens were barely a thing, and neither were wide browsers for the most part.

From templates that I built, Nikola generates static HTML, which has a few limitations when it comes to serving requests. The canonical URL for this post is https://seancoates.com/blogs/new-site-same-as-old-site. Note the lack of trailing slash. There are ways to accomplish this directly with where I wanted to store this generated HTML + assets (on S3), but it's always janky. I've been storing static sites on S3 and serving them up through CloudFront for what must be 7+ years, now, and it works great as long as you don't want to do anything "fancy" like redirects. You just have to name your files in a clever way, and be sure to set the metadata's Content-Type correctly. The file you're reading right now comes from a .md file that is compiled into [output]/blogs/new-site-same-as-old-site/index.html. Dealing with the "directory" path, and index.html are a pain, so I knew I wanted to serve it through a very thin HTTP handling app.

At work, we deploy mostly on AWS (API Gateway and Lambda, via some bespoke tooling, a forked and customized runtime from Zappa, and SAM for packaging), but all of that seemed too heavy for what amounts to a static site with a slightly-more-intelligent HTTP handler. Chalice had been on my radar for quite a while now, and this seemed like the perfect opportunity to try it.

It has a few limitations, such as horrific 404s, and I couldn't get binary serving to work (but I don't need it, since I put the very few binary assets on a different CloudFront + S3 distribution), but all of that considered, it's pretty nice.

Here's the entire [current version of] app.py that serves this site:

 import functools

 from chalice import Chalice, Response
 import boto3


 app = Chalice(app_name="seancoates")
 s3 = boto3.client("s3")
 BUCKET = "seancoates-site-content"

 REDIRECT_HOSTS = ["www.seancoates.com"]


 def fetch_from_s3(path):
     k = f"output/{path}"
     obj = s3.get_object(Bucket=BUCKET, Key=k)
     return obj["Body"].read()


 def wrapped_s3(path, content_type="text/html; charset=utf-8"):
     if app.current_request.headers.get("Host") in REDIRECT_HOSTS:
         return redirect("https://seancoates.com/")

     try:
         data = fetch_from_s3(path)
         return Response(
             body=data, headers={"Content-Type": content_type}, status_code=200,
         )
     except s3.exceptions.NoSuchKey:
         return Response(
             body="404 not found.",
             headers={"Content-Type": "text/plain"},
             status_code=404,
         )


 def check_slash(handler):
     @functools.wraps(handler)
     def slash_wrapper(*args, **kwargs):
         path = app.current_request.context["path"]
         if path[-1] == "/":
             return redirect(path[0:-1])
         return handler(*args, **kwargs)

     return slash_wrapper


 def redirect(path, status_code=303):
     return Response(
         body="Redirecting.",
         headers={"Content-Type": "text/plain", "Location": path},
         status_code=status_code,
     )


 @app.route("/")
 def index():
     return wrapped_s3("index.html")


 @app.route("/assets/css/{filename}")
 def assets_css(filename):
     return wrapped_s3(f"assets/css/{filename}", "text/css")


 @app.route("/blogs/{slug}")
 @check_slash
 def blogs_slug(slug):
     return wrapped_s3(f"blogs/{slug}/index.html")


 @app.route("/brews")
 @app.route("/shares")
 @app.route("/is")
 @check_slash
 def pages():
     return wrapped_s3(f"{app.current_request.context['path'].lstrip('/')}/index.html")


 @app.route("/archive")
 @app.route("/blogs")
 def no_page():
     return redirect("/")


 @app.route("/archive/{archive_page}")
 @check_slash
 def archive(archive_page):
     return wrapped_s3(f"archive/{archive_page}/index.html")


 @app.route("/rss.xml")
 def rss():
     return wrapped_s3("rss.xml", "application/xml")


 @app.route("/assets/xml/rss.xsl")
 def rss_xsl():
     return wrapped_s3("assets/xml/rss.xsl", "application/xml")


 @app.route("/feed.atom")
 def atom():
     return wrapped_s3("feed.atom", "application/atom+xml")

Not bad for less than 100 lines (if you don't count the mandated whitespace, at least).

Chalice handles the API Gateway, Custom Domain Name, permissions granting (for S3 access, via IAM policy) and deployments. It's pretty slick. I provided DNS and a certificate ARN from Certificate Manager.

Last thing: I had to trick Nikola into serving "pretty URLs" without a trailing slash. It has two modes, basically: /blogs/post/ or /blogs/post/index.html. I want /blogs/post. Now, avoiding the trailing slash usually invokes a 30x HTTP redirect when the HTTPd that's serving your static files needs to add it so you get the directory (index). But in my case, I was handling HTTP a little more intelligently, so I didn't want it. You can see my app.py above handles the trailing slash redirects in the wrapped_s3 function, but to get Nikola to handle this in the RSS/Atom (I had control in the HTML templates, but not in the feeds), I had to trick it with some ugliness in conf.py:

# hacky hack hack monkeypatch
# strips / from the end of URLs
from nikola import post

post.Post._unpatched_permalink = post.Post.permalink
post.Post.permalink = lambda self, lang=None, absolute=False, extension=".html", query=None: self._unpatched_permalink(
    lang, absolute, extension, query
).rstrip("/")

I feel dirty about that part, but pretty great about the rest.

Faculty

Officially, as of the start of the year, I've joined Faculty.

Faculty is not just new to me, but something altogether new. It's also something that feels older than it is. The familiar, experienced kind of old. The good kind. The kind I like.

It was founded by my good friend and long-term colleague Chris Shiflett, whom I'm very happy to be working with, directly, again.

People often ask us how long we've worked together, and the best answer I can come up with is "around 15 years"—nearly all of the mature part of my career. Since the early 2000s, Chris and I have attended and spoken at conferences together, he wrote a column for PHP Architect under my watch as the Editor-in-Chief, I worked on Chris's team at OmniTI, we ran Web Advent together, and we worked collaboratively at Fictive Kin.

A surprised "fifteen years!" is a common response, understandably—that's an eternity in web time. On a platform that reinvents itself regularly, where we (as a group) often find ourselves jumping from trend to trend, it's a rare privilege to be able to work with friends with such a rich history.

This kind of long-haul thinking also informs the ethos of Faculty, itself. We're particularly interested in mixing experience, proven methodologies and technology, and core values, together with promising newer (but solid) technologies and ideas. We help clients reduce bloat, slowness, and inefficiencies, while building solutions to web and app problems.

We care about doing things right, not just quickly. We want to help clients build projects they (and we) can be proud of. We remember the promise—and the output—of Web 2.0, and feel like we might have strayed a little away from the best Web that we can build. I'm really looking forward to leveraging our team's collective and collected experience to bring excellence, attention to detail, durability, and robustness to help—even if in some small way—influence the next wave of web architecture and development.

If any of that sounds interesting to you, we're actively seeking new clients. Drop us a note. I can't wait.

Shortbread

December 1st. Shortbread season begins today.

(Here's something a little different for long-time blog readers.)

Four years ago, in December, I made dozens and dozens of batches of shortbread, slowly tweaking my recipe, batch after batch. By Christmas, that year, I had what I consider (everyone's tastes are different) the perfect ratio of flour/butter/sugar/salt, and exactly the right texture: dense, crisp, substantial, and short—not fluffy and light like whipped/cornstarch shortbread, and not crumbly like Walkers.

I present for you, friends, that recipe.

shortbread

Ingredients

  • 70g granulated white sugar (don't use the fancy kinds)
  • 130g unsalted butter (slightly softened; it's colder than normal room temperature in my kitchen)
  • 200g all purpose white flour (I use unbleached)
  • 4g kosher salt

Method

In a stand mixer, with the paddle attachment, cream (mix until smooth) the sugar and butter. I like to throw the salt in here, too. You'll need to scrape down the sides with a spatula, as it mixes (turn the mixer off first, of course).

When well-creamed, and with the mixer on low speed, add the flour in small batches, a little at a time. Mix until combined, so it looks like cheese curds. It's not a huge deal if you over-mix, but in my experience the best cookies come when the mixture is a little crumblier than a cohesive dough.

Turn out the near-dough onto a length of waxed paper, and roll it into a log that's ~4cm in diameter, pressing the "curds" together firmly. Wrap it in the wax paper and refrigerate for 30 minutes.

Preheat your oven to 325°F, with a rack in the middle position.

Slice the chilled log (with a sharp, non-serrated knife) into ~1.5cm thick rounds, and place them onto a baking sheet with a silicone mat or parchment paper. (If you refrigerate longer, you'll want to let the log soften a little before slicing.)

Bake until right before the tops/sides brown. In my oven, this takes 22 minutes. Remove from oven and allow to cool on the baking sheet.

Eat and share!

Don't try to add vanilla or top them with anything, unless you like inferior shortbread. (-; Avoid the temptation to eat them right away, because they're 100 times better when they've cooled. Pop a couple in the freezer for 5 mins if you're really impatient (and I am).

Enjoy! Let me know if you make these, and how they turned out.

API Gateway timeout workaround

The last of the four (see previous posts) big API Gateway limitations is the 30 second integration timeout.

This means that API Gateway will give up on trying to serve your request to the client after 30 seconds—even though Lambda has a 300 second limit.

In my opinion, this is a reasonable limit. No one wants to be waiting around for an API for more than 30 seconds. And if something takes longer than 30s to render, it should probably be batched. "Render this for me, and I'll come back for it later. Thanks for the token. I'll use this to retrieve the results."

In an ideal world, all HTTP requests should definitely be served within 30 seconds. But in practice, that's not always possible. Sometimes realtime requests need to go to a slow third party. Sometimes the client isn't advanced enough to use the batch/token method hinted at above.

Indeed, 30s often falls below the client's own timeout. We're seeing a practical limitation where clients can often run for 90-600 seconds before timing out.

Terrible user experience aside, I needed to find a way to serve long-running requests, and I really didn't want to violate our serverless architecture to do so.

But this 30 second timeout in API Gateway is a hard limit. It can't be increased via the regular AWS support request method. In fact, AWS says that it can't be increased at all—which might even be true. (-:

As I mentioned in an earlier post, I did a lot of driving this summer. Lots of driving led to lots of thinking, and lots of thinking led to a partial solution to this problem.

What if I could use API Gateway to handle the initial request, but buy an additional 30 seconds, somehow? Or better yet, what if I could buy up to an additional 270 seconds (5 minutes total)?

Simply put, an initial request can kick off an asynchronous job, and if it takes a long time, after nearly 30 seconds, we can return an HTTP 303 (See Other) redirect to another endpoint that checks the status of this job. If the result still isn't available after another (nearly) 30s, redirect again. Repeat until the Lambda function call is killed after the hard-limited 300s, but if we don't get to the hard timeout, and we find the job has finished, we can return that result instead of a 303.

But I didn't really have a simple way to kick off an asynchronous job. Well, that's not quite true. I did have a way to do that: Zappa's asynchronous task execution. What I didn't have was a way to get the results from these jobs.

So I wrote one, and Zappa's maintainer, Rich, graciously merged it. And this week, it was released. Here's a post I wrote about it over on the Zappa blog.

The result:

$ time curl -L 'https://REDACTED.apigwateway/dev/payload?delay=40'
{
  "MESSAGE": "It took 40 seconds to generate this."
}

real    0m52.363s
user    0m0.020s
sys     0m0.025s

Here's the code (that uses Flask and Zappa); you'll notice that it also uses a simple backoff algorithm:

from time import sleep

from flask import Flask, abort, jsonify, redirect, request, url_for
from zappa.asynchronous import task, get_async_response  # zappa.async in older Zappa releases

app = Flask(__name__)


@app.route('/payload')
def payload():
    delay = request.args.get('delay', 60)
    x = longrunner(delay)
    if request.args.get('noredirect', False):
        return 'Response will be here in ~{}s: <a href="{}">{}</a>'.format(
            delay, url_for('response', response_id=x.response_id), x.response_id)
    else:
        return redirect(url_for('response', response_id=x.response_id))


@app.route('/async-response/<response_id>')
def response(response_id):
    response = get_async_response(response_id)
    if response is None:
        abort(404)

    backoff = float(request.args.get('backoff', 1))

    if response['status'] == 'complete':
        return jsonify(response['response'])

    sleep(backoff)
    return "Not yet ready. Redirecting.", 303, {
        'Content-Type': 'text/plain; charset=utf-8',
        'Location': url_for(
            'response', response_id=response_id,
            backoff=min(backoff*1.5, 28)),
        'X-redirect-reason': "Not yet ready.",
    }


@task(capture_response=True)
def longrunner(delay):
    sleep(float(delay))
    return {'MESSAGE': "It took {} seconds to generate this.".format(delay)}

That's it. Long-running tasks in API Gateway. Another tool for our serverless arsenal.

API Gateway Limitations

As I've mentioned a couple times in the past, I've been working with Lambda and API Gateway.

We're using it to host/deploy a big app for a big client, as well as some of the ancillary tooling to support the app (such as testing/builds, scheduling, batch jobs, notifications, authentication services, etc.).

For the most part, I love it. It's helped evaporate the most boring—and often most difficult—parts of deploying highly-available apps.

But it's not all sunshine and rainbows. Once the necessary allowances are made for a new architecture (things like: if we have concurrency of 10,000, a runaway process's consequences are amplified, database connection pools are easily exhausted, there's no simple way to use static IP addresses), there are 4 main problems that I've encountered with serving an app on Lambda and API Gateway.

The first two problems are essentially the same. Both headers and query string parameters are clobbered by API Gateway when it creates an API event object.

Consider the following Lambda function (note: this does not use Zappa, but functions provisioned by Zappa have the same limitation):

import json

def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'body': err.message if err else json.dumps(res),
        'headers': {
            'Content-Type': 'application/json',
        },
    }

def lambda_handler(event, context):
    return respond(None, event.get('queryStringParameters'))

Then if you call your (properly-configured) function via the API Gateway URL such as: https://lambdatest.example.com/test?foo=1&foo=2, you will only get the following queryStringParameters:

{"foo": "2"}

Similarly, a modified function that dumps the event's headers, when called with duplicate headers, such as with:

curl 'https://lambdatest.example.com/test' -H'Foo: 1' -H'Foo: 2'

…will result in the second header overwriting the first:

{
    ...
    "headers": {
        ...
        "Foo": "2",
        ...
    },
    ...
}

The AWS folks have backed themselves into a bit of a corner, here. It's not trivial to change the way these events work, without affecting the thousands (millions?) of existing API Gateway deployments out there.

If they could make a change like this, it might make sense to turn queryStringParameters into an array when there would previously have been a clobbering:

{"foo": ["1", "2"]}

This is a bit more dangerous for headers:

{
    ...
    "headers": {
        ...
        "Foo": [
            "1",
            "2"
        ],
        ...
    },
    ...
}

This is not impossible, but it is a BC-breaking change.

What AWS could do, without breaking BC, is (even optionally, based on the API's configuration/metadata in API Gateway) supply us with an additional field in the event object: rawQueryString. In our example above, it would be foo=1&foo=2, and it would be up to my app to parse this string into something more useful.
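
Parsing that raw string ourselves is trivial in most stacks; in Python, for example:

# Parsing a hypothetical rawQueryString field ourselves, preserving repeated keys.
from urllib.parse import parse_qs

print(parse_qs("foo=1&foo=2"))  # {'foo': ['1', '2']}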

Again, headers are a bit more difficult, but (optionally, as above), one solution might be to supply a rawHeaders field:

{
    ...
    "rawHeaders": [
        "Foo: 1",
        "Foo: 2",
        ...
    ],
    ...
}

We've been lucky so far in that these first two quirks haven't been showstoppers for our apps. I was especially worried about a conflict with access-es, which is effectively a proxy server.

The next two limitations (API Gateway, Lambda) are more difficult, but I've come up with some workarounds:

Lambda payload size workaround

Another of the AWS Lambda + API Gateway limitations is in the size of the response body we can return.

AWS states that the full payload size limit for API Gateway is 10 MB, and that Lambda's invocation payload (request and response bodies, for synchronous calls) is limited to 6 MB.

In practice, I've found this number to be significantly lower than 6 MB, but perhaps I'm just calculating incorrectly.

Using a Flask route like this:

@app.route('/giant')
def giant():
    payload = "x" * int(request.args.get('size', 1024 * 1024 * 6))
    return payload

…and calling it with curl, I get the following cutoff:

$ curl -s 'https://REDACTED/dev/giant?size=4718559' | wc -c
 4718559
$ curl -s 'https://REDACTED/dev/giant?size=4718560'
{"message": "Internal server error"}

Checking the logs (with zappa tail), I see the non-obvious-unless-you've-come-across-this-before error message:

body size is too long

Let's just call this limit "4 MB" to be safe.

So, why does this matter? Well, sometimes—like it or not—APIs need to return more than 4 MB of data. In my opinion, this should usually (but not always) be resolved by requesting smaller results. But sometimes we don't get control over this, or it's just not practical.

Take Kibana, for example. In the past year, we started using Elasticsearch for logging certain types of structured data. We elected to use the AWS Elasticsearch Service to host this. AWS ES has an interesting authentication method: it requires signed requests, based on AWS IAM credentials. This is super useful for our Lambda-based app because we don't have to rely on DB connection pools, firewalls, VPCs, and much of the other pain that comes with using an RDBMS in a highly-distributed system. Our app can use its inherited IAM profile to sign requests to AWS ES quite easily, but we also wanted to give our developers and certain partners access to our structured logs.
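
For a sense of what those signed requests look like in practice, here's a rough sketch of signing a request to an AWS ES domain from Python; the domain name is a placeholder, and requests-aws4auth is just one library that does the SigV4 work (not necessarily what our app uses internally):

# A sketch of a SigV4-signed request to an AWS Elasticsearch domain, using the
# credentials boto3 resolves from the environment/instance/Lambda role.
# The domain name is a placeholder.
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    creds.access_key, creds.secret_key, region, "es", session_token=creds.token
)

resp = requests.get(
    "https://search-example-logs.us-east-1.es.amazonaws.com/_cluster/health",
    auth=awsauth,
)
print(resp.status_code, resp.json())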

At first, we had our developers run a local copy of aws-es-kibana, which is a proxy server that uses the developer's own AWS credentials (we distribute user or role credentials to our devs) to sign requests. Running a local proxy is a bit of a pain, though—especially for 3rd parties.

So, I wrote access-es (which is still in a very early "unstable" state, though we do use it "in production" (but not in user request flows)) to allow our users to access Kibana (and Elasticsearch). access-es runs on Lambda and effectively works as a reverse HTTPS proxy that signs requests for session/cookie-authenticated users, based on the IAM profile. This was a big win for our no-permanent-servers-managed-by-us architecture.

But the very first day we used access-es to load some large logs in Kibana, it failed on us.

It turns out that if you have large documents in Elasticsearch, Kibana loads very large blobs of JSON in order to render the discover stream (and possibly other streams). Larger than "4 MB", I noticed. Our (non-structured) logs filled with body size is too long messages, and I had to make some adjustments to the page size in the discover feed. This bought us some time, but we ran into the payload size limitation far too often, and at the most inopportune moments, such as when trying to rapidly diagnose a production issue.

The "easy" solution to this problem is to concede that we probably can't use Lambda + API Gateway to serve this kind of app. Maybe we should fire up some EC2 instances, provision them with Salt, manage upgrades, updates, security alerts, autoscalers, load balancers… and all of those things that we know how to do so well, but were really hoping to leave behind with the new "serverless" (no permanent servers managed by us) architecture.

This summer, I did a lot of driving, and during one of the longest of those driving sessions, I came up with an idea about how to handle this problem of using Lambda to serve documents that are larger than the Lambda maximum response size.

"What if," I thought, "we could calculate the response, but never actually serve it with Lambda. That would fix it." Turns out it did. The solution—which will probably seem obvious once I state it—is to use Lambda to calculate the response body, store that response body in a bucket in S3 (where we don't have to manage any servers), use Lambda + API Gateway to redirect the client to the new resource on S3.

Here's how I did it in access-es:

req = method(
    target_url,
    auth=awsauth,
    params=request.query_string,
    data=request.data,
    headers=headers,
    stream=False
)

content = req.content

if overflow_bucket is not None and len(content) > overflow_size:

    # the response would be bigger than overflow_size, so instead of trying to serve it,
    # we'll put the resulting body on S3, and redirect to a (temporary, signed) URL
    # this is especially useful because API Gateway has a body size limitation, and
    # Kibana serves *huge* blobs of JSON

    # UUID filename (same suffix as original request if possible)
    u = urlparse(target_url)
    if '.' in u.path:
        filename = str(uuid4()) + '.' + u.path.split('.')[-1]
    else:
        filename = str(uuid4())

    s3 = boto3.resource('s3')
    s3_client = boto3.client(
        's3', config=Config(signature_version='s3v4'))

    bucket = s3.Bucket(overflow_bucket)

    # actually put it in the bucket. beware that boto is really noisy
    # for this in debug log level
    obj = bucket.put_object(
        Key=filename,
        Body=content,
        ACL='authenticated-read',
        ContentType=req.headers['content-type']
    )

    # URL only works for 60 seconds
    url = s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': overflow_bucket, 'Key': filename},
        ExpiresIn=60)

    # "see other"
    return redirect(url, 303)

else:
    # otherwise, just serve it normally
    return Response(content, content_type=req.headers['content-type'])

If the body size is larger than overflow_size, we store the result on S3, and the client receives a 303 see other with an appropriate Location header, completely bypassing the Lambda body size limitation, and saving the day for our "serverless" architecture.

The resulting URL is signed by AWS to make it only valid for 60 seconds, and the resource isn't available without such a signature (unless otherwise authenticated with IAM + appropriate permissions). Additionally, we use S3's lifecycle management to automatically delete old objects.
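
The lifecycle piece is a small bit of one-time bucket configuration; something along these lines (the bucket name and one-day expiry here are illustrative):

# Illustrative: expire overflow objects automatically after a day, so the bucket
# doesn't accumulate stale response bodies. The bucket name is a placeholder.
import boto3

s3_client = boto3.client("s3")
s3_client.put_bucket_lifecycle_configuration(
    Bucket="my-overflow-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-overflow-bodies",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)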

For clients that are modern browsers, though, you'll need to properly manage the CORS configuration on that S3 bucket.

This approach fixed our Kibana problem, and now sits in our arsenal of tools for when we need to handle large responses in our other serverless Lambda + API Gateway apps.