Scaling the Atmosphere

Jim Calabro · 12 min

All right, I'm going to go real fast because we've got a lot to get through, but I'm Jim. I run the platform team at Bluesky. The platform team is the team that runs our infrastructure, our data centers, our cloud stuff, and writes a lot of back-end code as well. And yeah, we have a lot to get through, so I'm going to go super fast, but hit me up after if you have any questions or want to talk more. Who am I? I'm Jim. I live in Boston. I run the platform team, as I said. I've been at Bluesky for about a year, and I'm here today to share some information with you on how we do stuff: our atmospheric systems, our AppView. I want to talk about what's going well, what could be improved, and maybe give some recommendations, or at least food for thought, for you as well.

All right, let's get into it. Who's this for? There are a lot of people here who are in the weeds of atproto, and a lot of people have different needs and wants out of atproto. This is really a talk for people who want to achieve high scale, such as running a big, whole-world AppView like the Bluesky AppView. Another persona might be running a large fleet of PDSs. Eurosky is coming online, that's awesome. Blacksky. There's so much movement on this, and it's really cool.

So not all projects are in this category, and that's totally rad. It's super awesome. And yeah, I'll tell you a little bit about what we do. This is just a fact dump, so we're going to get through it real quick. PDS fleet: we run about 110 rented bare-metal cloud hosts in US East and US West. They range from 16 to 64 cores, depending on what year we spun them up. They have 256 gigs of RAM, about a gig up and a gig down, and various disk configs. This is kind of fun and goofy; I learn about new disk configs every now and then. Most of them are XFS set up in RAID-1. One of them, I learned, is running ZFS. That's a nifty little experiment; it's been live for a long time. That's cool. We also run one, against our will, on Ceph. We had an emergency situation where we had to lift and shift a PDS, and all we had was a Ceph array in our data centers. That was right after Christmas. They cost about $600 a month each on OVH and i3D, the two suppliers we use there.

And I'd say one thing on this: if you are looking to set up a big fleet of PDSs, we are way over-provisioned on these things. Here's a btop screenshot, and that is not showing up at all, but you can see we're using somewhere around 5% of the CPU and 5% of the RAM, so that's like 12.5 gigs. It's doing relatively little network I/O as well, even though you do kind of want to have at least a little bit of a beefy setup there. Most of it actually is disk storage. We're coming up on actually starting to fill up disks, and we're going to have to expand and stuff. You can do a lot with a little with the PDS. I'll also say we have about half a million users per host. That was my production PDS, by the way, Dapperling; there's half a million users running on that box. There's no special sauce. We're just running the open-source code. There's nothing behind the scenes, no rate-limit bypass tokens or anything. It's just the code. One note on that: we do have our own auth server.

Our config looks like this. We run HAProxy on each one. There are 16 PDS containers per host. Each user has a SQLite database, and we back those SQLites up with rclone once a day and use Litestream for live replication (there's a rough sketch of that nightly backup pass below). We have some Redis, Datadog for monitoring, and Tailscale for network auth. It's all fully automated, zero-touch provisioning: you just run one Ansible command and, boom, you've got a new one. It's really fast to stand up new ones.

Second major topic is our POPs. A POP is a point of presence, and it's basically a co-location setup, a small data center where we run hardware that we own; we bought it. We run the Relay in our data centers. We run the AppView, primarily powered by ScyllaDB, in the data centers. We run Discover in the data centers, and there are a few other things; our search cluster's in there. There are two POPs, one in California and one in Ashburn, Virginia, with about 80 very large servers in each that we own and operate. We have super-duper fast networks, and it's easy to add more. Super-fast disks, and a lot of them. It's really high bandwidth, and we have active-active everything: two ISPs, two copies of pretty much everything. We want that high degree of redundancy so you can provide really solid service. Here's some pictures. This is the posting factory. You can see me and Austin; Austin's over there, down in the back. My friend Patrick's actually behind Austin. Sorry, Patrick. But yeah, this is one of them.
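A rough sketch of what that nightly backup pass could look like, assuming one SQLite file per user under a single directory and an already-configured rclone remote. The directory path and remote name here are hypothetical placeholders, not our actual layout:

```go
// backup_sqlite.go: minimal sketch of a once-a-day PDS backup pass.
// Assumes one SQLite file per user under dataDir and an rclone remote
// named "offsite" — both names are hypothetical, not the real config.
// Continuous WAL replication (Litestream or similar) runs separately.
package main

import (
	"log"
	"os/exec"
	"path/filepath"
)

func main() {
	const dataDir = "/pds/data"          // hypothetical path to per-user SQLite files
	const remote = "offsite:pds-backups" // hypothetical rclone remote and bucket

	dbs, err := filepath.Glob(filepath.Join(dataDir, "*.sqlite"))
	if err != nil {
		log.Fatal(err)
	}
	for _, db := range dbs {
		// "rclone copy <src> <dst>" uploads the file, skipping it if unchanged.
		cmd := exec.Command("rclone", "copy", db, remote)
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Printf("backup failed for %s: %v\n%s", db, err, out)
			continue
		}
		log.Printf("backed up %s", db)
	}
}
```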
This is what a data center looks like. You get a cage; that's, like, your full suite. And in it, you put a bunch of racks. Here's one of the racks. Within the rack, you have a couple of switches, and then you have a bunch of compute servers. And then you have two of those, right? Two of everything. That's kind of what it looks like. It's pretty bog-standard.

Next is our AWS account. I'm not going to talk to you too much about AWS, but it's where we run a bunch of singleton stuff. Lots of stuff in there is super important; some of it's kind of chill. PLC is, like, obviously really important. We have a ton of Postgres in there. Postgres is really hard to run. It's really annoying.

RDS is great. It's very expensive. What's going well? POPs are goated. POPs are the GOAT. They're super-duper cheap, and they're extremely high-performance. The PDS fleet is working quite well, actually. It's really easy to add more servers, so as we're growing, we can just chuck new servers up, and they're very reasonably priced. AWS is AWS. It's fine. RDS is so good; it's worth every penny. Renting GPUs is quite convenient, and we do run a bunch of GPUs up there for various things. Besides that, it is very, very expensive.

Doing some very rough napkin math, and I'm really not going to get into this too, too much, it costs us probably about $800,000 a year to run the POPs, amortizing depreciation of the assets over four years (there's a sketch of that amortization math below). The roughly equivalent AWS install is literally impossible, because our total switching capacity in the POPs is just shocking. It's crazy. You couldn't do our POPs setup in AWS, but if you could, it would probably be about 10x that, so about $8 million a year, and that's with heavily negotiated long-term reservations. Probably about $14 million if you were doing on-demand. Heavy asterisks: I vibe-coded all that. Yeah, vibe finance.

That being said, the POPs are an absolute shitload of work. They require deep expertise. Austin has gone absolutely crazy on re-provisioning all this stuff to make it sane and easy to work with. They have really slow iteration cycles: once you're getting new hardware, it takes a while to get it online unless you have excellent operational practices. Again, kudos, Austin. RAM and storage pricing is also up and to the right, unfortunately, and so we bought a bunch of stuff, thanks to Jazz, like, here-ish. Yeah, about here. So Jazz is the GOAT.

So what's next? And I'm running out of time, so I'm going to go fast. POPs: make them easier to operate. As I said, Austin's been doing yeoman's work on this. Increase our compute density as well. Previously, we were basically assigning one service to a box, and oftentimes the service would need less than 1% of one of the CPUs, and each one of those boxes has, like, 256. I'm going to say the evil Kubernetes word. So yeah, trying to improve our density there. We're improving our provisioning, making it faster to get new stuff online.
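For reference, the shape of that amortization math is just straight-line depreciation plus recurring costs. Here's a sketch; the server count and the four-year window come from above, but every dollar figure in it is a made-up placeholder, not our actual spend:

```go
// popcost.go: napkin math for annualized POP cost.
// All dollar amounts are hypothetical placeholders; the only inputs
// taken from the talk are ~160 owned servers and four-year depreciation.
package main

import "fmt"

func main() {
	const (
		servers        = 160      // ~80 per POP, two POPs
		serverCost     = 15000.0  // hypothetical purchase price per server, USD
		depreciationYr = 4.0      // straight-line depreciation over four years
		recurringYr    = 250000.0 // hypothetical colo + power + transit per year
	)

	hardwarePerYear := servers * serverCost / depreciationYr
	totalPerYear := hardwarePerYear + recurringYr

	fmt.Printf("amortized hardware: $%.0f/yr\n", hardwarePerYear)
	fmt.Printf("total POP cost:     $%.0f/yr\n", totalPerYear)
}
```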

We migrated our network architecture from a single switch into a spine-and-leaf Clos network topology. It's really fun; it's like a network of networks, essentially. This is what everybody does, or at least a lot of people do. Kubernetes for compute density. The net result is higher engineering velocity, robust, high-availability systems, and we can actually reclaim a lot of cloud spend and bring that back to our POPs.

We're also going to improve the PDS hosting in some way. We're still talking about this, but when you have 110 servers that you rent, those are bare-metal servers. They're ours, not virtual; I literally have IPMI logins on all those things. They all fail independently, and OVH is not sending their best. So as you add more servers, the expected time between failures across the fleet shrinks, meaning your on-call burden goes up a lot (there's a quick back-of-the-envelope sketch of this below). You're at the mercy of your hosting providers: when a server goes down, we're waiting six hours to get notice from OVH, and in the meantime it's like, okay, we can restore to a different server, or we're just going to eat it. That sucks.

Shared storage is also a big thing that we're talking about. SQLite on the server is rough. The PDS is the best case for it, but I am kind of a SQLite hater, so I'm just going to leave it at that, and we will chat. Boo you! I'm thinking about a virtual PDS with shared storage, whatever that looks like; I'm just kind of hand-waving. More interesting PDS implementations are coming online. I'd love to talk about it if you have weird PDS ideas.
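Here's the back-of-the-envelope version of that failure-rate point. With independently failing hosts, the expected gap between failures shrinks roughly linearly with fleet size; the per-host failure rate below is a made-up illustration, not a measured number:

```go
// mtbf.go: why more independently failing hosts means more frequent pages.
// The per-host failure rate is a hypothetical illustration only.
package main

import "fmt"

func main() {
	const (
		hosts             = 110  // roughly the rented PDS fleet size
		failuresPerHostYr = 0.25 // hypothetical: one failure per host every four years
	)

	fleetFailuresPerYear := hosts * failuresPerHostYr
	daysBetweenFailures := 365.0 / fleetFailuresPerYear

	fmt.Printf("expected fleet failures per year: %.1f\n", fleetFailuresPerYear)
	fmt.Printf("expected days between failures:   %.1f\n", daysBetweenFailures)
}
```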

Advice or lessons learned. I'm going to start with the do-nots. Don't have single points of failure: you must have two of everything. Single points of failure will die, and you will be sad, and your users will be sad. Reputation is hard-earned and quickly lost. Skip the SQLite layer in your AppView; go with my personal favorite, MySQL, or Postgres if you don't like good things, and only move past that when you're sure you need to. Start simple, basically. And don't accept local maxima. You can do hard stuff. You can do great things. That's a do-not.

Now, the dos. First: think really hard about your data access patterns. We've really optimized the absolute daylights out of the Bluesky data plane. You want to have this notion of mechanical sympathy: be in tune with your hardware and try to optimize the shit out of it, because you'll pay for it otherwise in dollars, and also your sanity. Bloom filters are your friend (there's a minimal sketch of that idea at the end). Memcache is your friend. Redis is not your friend. You should have elastic compute and storage, even on-prem. You should use cattle, not pets; that kind of comes back to two of everything, right? You should have very, very thorough observability, be able to answer just about any question about your systems, and do it before you have an outage. So here are some recs on that.

And then finally, one more do real quick: build a team that's very strong operationally. You can't do it alone. And then come talk to us. Let's organize. Brian posted a while ago about what a NANOG of atproto would look like; let's talk about it. I posted a minute ago about L7 BGP. Let's go talk to each other and figure out the right way to do this. There's no silver bullet; it's really hard work. That's it. [applause]
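On the bloom-filter point above: here's a minimal sketch of fronting an expensive lookup with an in-memory bloom filter, so definite misses never touch the cache or database at all. The sizing, the keys, and the lookup it guards are hypothetical illustrations, not the actual Bluesky data plane:

```go
// bloom.go: a tiny bloom filter used to skip expensive lookups on definite misses.
// Bit count, hash count, and the guarded store are hypothetical sizing choices.
package main

import (
	"fmt"
	"hash/fnv"
)

type bloom struct {
	bits []uint64
	k    uint32 // number of hash functions
}

func newBloom(nbits int, k uint32) *bloom {
	return &bloom{bits: make([]uint64, (nbits+63)/64), k: k}
}

// positions derives k bit positions from one FNV-1a hash via double hashing.
func (b *bloom) positions(key string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	n := uint32(len(b.bits) * 64)
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (h1 + i*h2) % n
	}
	return out
}

func (b *bloom) Add(key string) {
	for _, p := range b.positions(key) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MayContain returns false only if the key was definitely never added.
func (b *bloom) MayContain(key string) bool {
	for _, p := range b.positions(key) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	seen := newBloom(1<<20, 4) // ~1M bits, 4 hashes: hypothetical sizing
	seen.Add("did:plc:example123")

	for _, did := range []string{"did:plc:example123", "did:plc:notthere"} {
		if !seen.MayContain(did) {
			// Definite miss: skip the cache/database round trip entirely.
			fmt.Printf("%s: skip lookup\n", did)
			continue
		}
		fmt.Printf("%s: maybe present, do the real lookup\n", did)
	}
}
```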