What is the engineering used to distinguish a weak startup from a growing company, you ask? Well... Googlers again use arbitrary numbers instead of logic (a human-interaction-avoidance firewall) and set the floor at $30M of "publicly declared investment capital". So what happens when you're the GCP architect consultant hired to help a successful startup productionize their GCP infra, but their last round was private? Google tells the soon-to-be-$100M company they aren't real yet... so they go get their virtual CPU, RAM, and disk from AWS, who knows how to treat customers right: they hire account managers who pick up the phone and invite you to lunch to talk about your successful startup growing on AWS. Googlers are the biggest risk factor to the far superior GCP infrastructure, for any business, startup or Fortune 10.
We add support when we want to do something new, like MediaTailor + SSAI. At that point we're exploring and trying to get our heads around how things work. Once it works there's no real point in support.
That said, you need to ask your account manager about (1) discounts in exchange for spend commitments, and (2) technical assistance. In general we have a talk with our AM when we're doing something new, and they rope in SMEs from the various products for us.
We're not that big, and I haven't worked for large companies, and it's always been a mystery to me why people have problems dealing with AWS. I've always found them to be super responsive and easy to get ahold of. OTOH we actually know what we're doing technically.
Google Cloud, OTOH, is super fucked up. I mean seriously, I doubt anyone there has any idea WTF is happening or how anything works anymore. There's no real cohesion, or at least there wasn't the last time I was abused by GCP.
Depending what precisely you mean by the second one, you may not even need an AM/support for that.
They won't help me use the platform, but they will still address issues with the platform. If you run into bugs, things not behaving how they're documented, or something that simply isn't exposed/available to customers they seem to be pretty good about getting it resolved regardless of your spend or support level.
(On my personal account with minimal spend, no AM, and no support... I've had engineers from the relevant teams email me directly after submitting a ticket for issues.)
So yeah, "if you know what you're doing" you probably don't even need the paid-for support.
Hard disagree. I have to engage with AWS support almost once every 6 months. A lot of them end up being bugs identified in their services. Premium support is extremely valuable when your production services are down and you need to get them back up asap.
Both times they were serious production bugs that took at least a week to resolve, though I only had the lowest tier of support package.
GCP’s architecture seems clearly better to me especially if you are looking to be global.
Every organization I’ve ever witnessed eventually ends up with some kind of struggle with AWS’ insane organizations and accounts nightmare.
GCP’s use of folders makes way more sense.
GCP having global VPCs is also potentially a huge benefit if you want your users to hit servers that are physically close to them. On AWS you have to architect your own solution with global accelerator which becomes even more insane if you need to cross accounts, which you’ll probably have to do eventually because of the aforementioned insanity of AWS account/organization best practices.
Know how you find all the permissions a single user in GCP has? You have to make 9+ API calls, then filter/merge all the results. They finally added a web tool to try and "discover" the permissions for a user... you sit there and watch it spin while it madly calls backend APIs to try to figure it out. Permissions for a single user can be assigned to users, groups, orgs, projects, folders, resources, (and more I forget), and there's inheritance to make it more complex. It can take all day to track down every single place the permissions could be set for a single user in a single hierarchical organization, or where something is blocking some permission. The complexity increases as you have more GCP projects, folders, orgs. But, of course, if you don't do all this, GCP will fight you every step of the way.
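For what it's worth, the filter/merge step described above is mechanical once you have the policies in hand. A toy sketch (policy shapes loosely modeled on `getIamPolicy` responses; all names are made up, and the genuinely hard parts, actually fetching policies from every org/folder/project/resource level and expanding group memberships, are exactly what this leaves out):

```python
def roles_for_member(policies: list[dict], member: str) -> set[str]:
    """Collect every role granted directly to `member` across policies
    gathered from org, folder, project, and resource levels."""
    roles = set()
    for policy in policies:
        for binding in policy.get("bindings", []):
            if member in binding.get("members", []):
                roles.add(binding["role"])
    return roles

# Toy policies, one per hierarchy level.
org = {"bindings": [{"role": "roles/viewer",
                     "members": ["user:a@example.com"]}]}
project = {"bindings": [{"role": "roles/editor",
                         "members": ["user:a@example.com",
                                     "group:devs@example.com"]}]}
bucket = {"bindings": [{"role": "roles/storage.objectAdmin",
                        "members": ["group:devs@example.com"]}]}

print(sorted(roles_for_member([org, project, bucket], "user:a@example.com")))
# -> ['roles/editor', 'roles/viewer']
```

Note that the user is still in `group:devs@example.com` and so effectively holds `roles/storage.objectAdmin` too; resolving that requires yet another set of API calls, which is the complexity being complained about.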
Compare that to AWS, where you just click a user, and you see what's assigned to it. They engineered it specifically so it wouldn't be a pain in the ass.
> Every organization I’ve ever witnessed eventually ends up with some kind of struggle with AWS’ insane organizations and accounts nightmare.
This was an issue in the early days, but it's well solved now with newer integrations/services. Follow their Well Architected Framework (https://docs.aws.amazon.com/wellarchitected/latest/framework...), ask customer support for advice, implement it. I'm not exaggerating when I say this is the best description of the best information systems engineering practice in the world, and it's achievable by startups. It just takes a long time to read. If you want to become an excellent systems engineer/engineering manager/CTO/etc, this is your bible. (Note: you have to read the entire thing, especially the appendixes; you can't skim it like StackOverflow)
The problem is that no company I’ve ever worked for implemented the well architected framework with their AWS environment, and not one company will ever invest the time to make their environment match that level of quality.
I think what you describe with the web tool to discover user permissions sounds a lot like the AWS VPC Reachability Analyzer, which I had to live in for quite a while, because figuring out where my traffic was getting blocked between an endless array of AWS accounts and cross-region transit gateways was a nightmare that wouldn't exist with GCP's global VPCs and project/folder-based permissions.
I don’t like the GCP console, but I also wouldn’t consider a lot of the AWS console to be top tier software. Slow/buggy/inconsistent are words I would use with the AWS console. I can concede that AWS has better documentation, but I don’t think it’s a standout, either.
Architecturally I'd go with GCP in a heartbeat. BigQuery was also one of the biggest wins in my previous role. Completely changed our business for almost everyone, vs Redshift, which cost us a lot of money to learn that it sucked.
You could say I'm biased as I work at Google (but not on any of this), but for me it was definitely the other way around: I joined Google in part because of the experience of using GCP and migrating AWS workloads to it.
What are these struggles? The product I work on uses AWS and we have ~5 accounts (I hear there used to be more, TBF), but nowadays all the infrastructure is on one of them and the others are for niche stuff (tech support?). I could see how going overboard with many accounts could be an issue, but I don't really see issues with having everything on one account.
The way to automate provisioning of new AWS accounts requires you to engage with Control Tower in some way, like the author did with Account Factory for Terraform.
Just before they announced that, I was working on creating org accounts specifically to contain S3 buckets, then permitting the primary app to use those accounts just for their bucket allocation.
AWS themselves recommend an account per developer, IIRC.
It's as you say, some policy or limitation might require lots of accounts and lots of accounts can be pretty challenging to manage.
I have almost 40 AWS accounts on my login portal.
Two accounts per product, one for development environments and one for production environments, every new company acquisition has their own accounts, then we have accounts that solely exist to help traverse accounts or host other ops stuff.
Maybe you don’t see issues with everything in one account but my company would.
I don’t really think they’re following current best practices but that’s a political issue that I have no control over, and I think if you went back enough years you’d find that we followed AWS’ advice at the time.
Undersea cable failures are probably more likely than a google core networking failure.
In AWS a lot of "global" things are actually just hosted in us-east-1.
The routing isn’t centralized, it’s distributed. The VPCs are a logical abstraction, not a centralized dependency.
If you have a region/AZ going down in your global VPC, the other ones are still available.
I think it’s also not that much of an advantage for AWS to be able to say its outages are confined to a region. That doesn’t help you very much if their architecture makes architecting global services more difficult in the first place. You’re just playing region roulette hoping that your region isn’t affected. Outages frequently impact all/multiple AZs.
I, too, prefer McDonald's cheeseburgers to ground glass mixed with rusty nails. It's not so much that I love Terraform (spelled OpenTofu) as that it's far and away the least bad tool I've used in the space.
Terragrunt is the only sane way to deploy terraform/openTofu in a professional environment though.
What you can do if you _really_ like Ansible is use it to generate the Terraform files (typically from a Jinja2 template). In practice, I think Terragrunt is easier to use if you already have Terraform modules. But if I were back at my first "real" job, where we had between 50 and 80 Ansible modules (very short ones; it was really good, I've never seen an infrastructure that complex handled that concisely and easily), and we had to use Terraform, I would 100% use Ansible to generate the Terraform files.
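One low-tech variant of the generate-Terraform-from-a-program idea: Terraform also accepts JSON syntax (`*.tf.json`), so you can skip templating HCL text entirely and emit the config from a plain script. A hedged sketch; the inventory, file name, and resource attributes here are all made up:

```python
import json

# Hypothetical inventory; in practice this might come from Ansible vars.
servers = [
    {"name": "web1", "size": "t3.small"},
    {"name": "web2", "size": "t3.small"},
    {"name": "db1", "size": "t3.large"},
]

# Build one aws_instance resource per inventory entry.
config = {
    "resource": {
        "aws_instance": {
            s["name"]: {"instance_type": s["size"], "ami": "ami-12345678"}
            for s in servers
        }
    }
}

# Terraform picks up *.tf.json files alongside *.tf files.
with open("main.tf.json", "w") as f:
    json.dump(config, f, indent=2)
```

You get real loops, functions, and data structures for free, at the cost of losing the comments and readability of hand-written HCL.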
However, I often find Ansible modules confusing to use. Maybe with LLMs it's now easier to draft Ansible roles and maintain them, but I always got aggravated whenever I needed to go back to the docs for something I'd done many times, just because the modules are that inconsistent.
Ansible modules are trivial to write and more people should. Most just consist of a few underlying API calls in practice. A dozen-line snippet you fully understand is generally not a maintenance burden; a couple of thousand lines someone else wrote might be.
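To illustrate how small these can be, here's the rough shape of a custom module stripped to its essentials: read JSON args, do the one thing (imagined here as a single fake API call), print a JSON result. This is only a skeleton; a real module would use `AnsibleModule` from `ansible.module_utils.basic` for argument parsing and validation:

```python
#!/usr/bin/env python3
import json
import sys

def run(params: dict) -> dict:
    # In a real module this would wrap a few underlying API calls;
    # here it just pretends to create the named resource.
    name = params.get("name", "unnamed")
    return {"changed": True, "msg": f"created {name}"}

if __name__ == "__main__":
    # Real Ansible passes args differently; argv JSON is just for the sketch.
    params = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
    print(json.dumps(run(params)))
```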
Infrastructure needs to be consistent, intuitive, and reproducible. Imperative languages are too unconstrained; in particular, they let you write code whose output is unpredictable (for example, it'd be easy to write code that creates a resource based on the current time of day...).
With infrastructure, you want predictability and reproducibility. You want to focus more on writing _what_ your infra should look like, less _how_ to get there.
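A toy reconciler shows the idea: you declare the *what* (a desired-state map) and the tool derives the *how* (a plan of actions), and the same inputs always produce the same plan. Everything here is illustrative, not any real tool's behavior:

```python
def plan(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Diff two name->config maps into create/update/delete actions."""
    actions = []
    for name, cfg in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != cfg:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return sorted(actions)

desired = {"web": {"size": "small"}, "db": {"size": "large"}}
actual = {"web": {"size": "medium"}, "cache": {"size": "small"}}
print(plan(desired, actual))
# -> [('create', 'db'), ('delete', 'cache'), ('update', 'web')]
```

Running `plan` against an already-converged state yields an empty plan, which is the reproducibility property: nothing about the current time of day, ordering, or previous runs can sneak into the output.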
I have written both TF and then CDKTF extensively (!), and I am absolutely never going back to raw TF. TF vs CDKTF isn't declarative vs imperative, it's "anemic untyped slow feedback mess" vs "strong typesystem, expressive builtins and LSP". You can build things in CDKTF that are humanly intractable in raw TF and it requires far less discipline, not more, to keep it from becoming an unmaintainable mess. Having a typechecker for your providers is a "cannot unsee" experience. As is being able to use for loops and defining functions.
That being said, would I have preferred a CDKTF in Haskell, or a typed Nix dialect? Hell yes. CDKTF was awful, it was just the least bad thing around. Just like TF itself, in a way.
But I have few problems with HCL as a compilation target. Rich ecosystem, and the abstractions seem sensible. Maybe that's Stockholm syndrome? Ironically, CDKTF has made me stop hating TF :)
Now that Hashicorp put the kibosh on CDKTF though, the question is: where next...
There are things I think Terraform could do to improve its declarative specs without violating the spirit. Yet, I still prefer it as-is to any imperative alternatives.
Is that an easy mistake to make and a hard one to recover from, in your experience?
The way you have to bend over backwards in Terraform just to instantiate a thing multiple times based on some data really annoys me.
If you're alone in a codebase? Probably not.
In a company with many contributors of varying degrees of competence (from your new grad to your incompetent senior staff), yes.
In large repositories, without extremely diligent reviewers, it's impossible to prevent developers from creating the most convoluted anti-patterny spaghetti code, that will get copy/pasted ad nauseam across your codebase.
Terraform as a tool and HCL as a programming language leave a lot to be desired (in hindsight only, because, let's be honest, they've been a boon for automation), but their constrained nature makes it easier to rein in the zealous junior developer who just discovered OOP and insists on trying it everywhere...
I don't think this is true anymore. Junior devs of today seem to be black pilled on OOP.
I loved reading code
Granted, I'm a programmer, have been for a long time, so using programming tools is a no brainer for me. If someone wants to manage infra but doesn't have programming skills, then learning the Terraform config language is a great idea. Just kidding, it's going to be just as confusing and obnoxious as learning the basic skills you need in python/js to get up and running with Pulumi.
For my current startup I ended up not going a direction where I needed ansible. I've now got everything in helm charts and deployable to K8S clusters, and packaged with Dockerfiles. Not really missing ansible, but not exactly in love with K8S either. It works well enough I guess.
You ended up needing Terraform too for the infrastructure though. At that point why not just use Terraform?
I had originally used Ansible to interact with the cloud provider and do the provisioning too, but someone on the corporate infrastructure team wanted to use terraform for that instead, so they did the migration.
There are all sorts of requirements that pop up, especially during downtime or when testing infra migrations in production, and it's much easier to manually edit the Terraform state.
I'm trying to make the decision for where to go with my home lab, and while Pulumi and Cue look neat, cdk8s seems so predictable & has such clear structure & form to it.
That said, the l1/l2/l3 distinction can be a brute to deal with. There's significant hidden complexity there.
Homelab CDKs: https://github.com/shepherdjerred/monorepo/tree/main/package...
Script I wrote to generate types from Helm charts: https://github.com/shepherdjerred/monorepo/tree/main/package...
That made me laugh. Yes I get that they probably didn't use all of these at the same time.
If the author had a Ko-Fi they would've just earned $50 USD from me.
I've been thinking of making the leap away from JIRA, and I concur on RDS, Terraform for IaC, and FaaS whenever possible. Google support is non-existent and I only recommend GC for pure compute. I hear good things about Bigtable, but I've never used it in production.
I disagree on the Slack usage, aside from the postmortem automation. Slack is just gonna be messy no matter what policies are put in place.
Possible, but with inconveniences that make it not ideal:
- cold starts can hamper latency-sensitive apps (language dependent, and there are things you can do)
- if you have consistent traffic, it's not very good value for money
- if you value local debugging
Other options are email, of course, and what, Teams for instant messages?
Organized by topics, must be threaded, and default to asynchronous communications. You can still opt in to notifications, and history is well organized and preserved.
It’s funny how we get an instant messaging platform and derive best practices that try to emulate a previous technology.
Btw, email is pretty instant.
See the other point in the article about discouraging one on one private messages and encouraging public discussion. That is the main reason.
* Half a day later, or days later if you do true async, but that's fine.
But aren’t mailing lists and distribution groups pretty ubiquitous?
I've been working across time zones via IM and email since ... ICQ.
I'm probably biased by that, but I consider email the place for question lists and long statuses with requests for comments, and for info that I want retained somewhere. IM, meanwhile, is a transient medium where you throw out a quick question or statement or whine every couple of hours - and check what everyone else is whining about.
But clearly, that's cultural.
If you keep your eyes on the Linux kernel mailing list you’ll see a lot of (on-topic) short and informal messages flying in all directions.
If you keep your eyes on the emails from big tech CEOs that sometimes appear in court documents, you’ll see that the way they use email is the same way I’d use Slack or an instant messenger.
That's likely because it's the tool they have available: we have IM tools that connect us to the people we need (inside the company), making email the only place for long-form content, which means it's only perceived as being for long-form content.
But when people have to use something federated more often, it does seem like email is actually used this way.
For us the free version of Slack was insufficient, the commercial one too expensive, and anyway, given that it's a cloud-based system, it's not compliant with our internal rules for confidential information (unless we can get some specific agreement with them). On the side, there is a bit too much analytics/telemetry in the Slack client.
but... you are spending so much on AWS and premium support... surely you can afford that
If you're starting everything from scratch, you might think that going to other providers (like Hetzner) is a good idea, and it may definitely be! But then you need to set up a Site2Site VPN because the second big customer of your B2B SaaS startup uses on-premises infrastructure and AWS has that out of the box, while you need an expert networking guy to do that the right way on Hetzner.
The key insight: for read-heavy workloads on a single machine, SQLite eliminates the network hop entirely. Response times drop to sub-15ms for full-text search queries. The tradeoff is write concurrency, but if your write volume is low (mine is ~20/day), it's a non-issue.
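As a rough illustration of the kind of full-text query this enables (assuming your SQLite build includes the FTS5 extension, which most do; the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(title, body)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [
        ("Choosing a database", "SQLite removes the network hop entirely"),
        ("Scaling writes", "Postgres handles concurrent writers better"),
    ],
)
# MATCH runs the full-text search; `rank` orders by relevance.
rows = conn.execute(
    "SELECT title FROM posts WHERE posts MATCH ? ORDER BY rank", ("network",)
).fetchall()
print(rows)  # -> [('Choosing a database',)]
```

The whole round trip is an in-process function call, which is where the sub-15ms numbers come from: there is no socket, no connection pool, no serialization boundary.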
The one thing I'd add to the article: the biggest infrastructure regret I see is premature complexity. Running Postgres + Redis + a message queue when your app gets 100 requests/day is solving problems you don't have while creating problems you do (operational overhead, debugging distributed state, config drift between environments).
If you can manage docker containers in a cloud, you can manage them on your local. Plus you get direct access to your own containers, local filesystems and persistence, locally running processes, quick access for making environmental tweaks or manual changes in tandem with your agents, etc. Not to mention the cost savings.
There's _so_ many providers of 'bare metal' dedicated servers - Hetzner and OVH come up a lot, but _before_ AWS there was ev1servers (anyone remember them?).
Tech is for all intents and purposes a planned economy (we are in the middle of the LLM five-year plan, comrade).
This is a classic. I'd say that for every company, big or small, this ends up taking the #1 spot on technical debt.
[1]: https://martinfowler.com/bliki/IntegrationDatabase.html
This post was a great read.
Tangent to this, I've always found "best practices" to be a bit of a misnomer. In most cases in software and especially devops I have found it means "pay for this product that constrains the way that you do things so you don't shoot yourself in the foot". It's not really a "practice" if you're using a product that gives you one way to do something. That said my company uses a very similar tech stack and I would choose the same one if I was starting a company tomorrow, despite the fact that, as others have mentioned, it's a ton to keep in your head all at once.
The good thing about a lot of devops saas is that you're not paying anyone on staff to understand the problem domain and guide your team. The bad thing is that you're not paying anyone on staff to understand the problem domain and guide your team.
This is an important point.
But I don't like calling this tech debt. The tech debt concept is about taking on debt explicitly, as in choosing the sub-optimal path on purpose to meet a deadline then promising a "payment plan" to remove the debt in the future. Tech debt implies that you've actually done your homework but picked door number 2 instead. A very explicit choice, and one where decision makers must have skin in the game.
A hurried, implicit choice has none of those characteristics - it's ignorance leading (inevitably?) to novel problems. That doesn't fit the debt metaphor at all. We need to distinguish tech debt from plain old sloppy decision making. Maybe management can even start taking responsibility for decisions instead of shrugging and saying "Tech debt, what can you do, amirite?"
> Regret
Thanks for this data point. I am currently trying to make this call, and I was still on the fence. This has tipped me to the separate db side.
Can anyone else share their experience with this decision?
[0] https://cep.dev/posts/every-infrastructure-decision-i-endors...
In my experience, it's easier to take schema out into a new DB in the off-chance it makes sense to do so.
The big place I'd disagree with this is when "your" data is actually customer data, and then you want 1 DB per customer whenever you can and SQLite is your BFF here. You have 1 DB for your stuff(accounting, whatever) and then 1 SQLite file per customer, that holds their data. Your customer wants a copy, you run .backup and send the file, easy peasy. They get pissed, rage quit and demand you delete all their data, easy!
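The `.backup` flow mentioned above is one call from Python too, via `Connection.backup()`; file names here are illustrative:

```python
import sqlite3

# One file per customer.
cust = sqlite3.connect("customer_42.db")
cust.execute(
    "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT)"
)
cust.execute("INSERT INTO notes (text) VALUES ('hello')")
cust.commit()

# "They want a copy": an online, consistent snapshot of just their data,
# safe to take while the source connection is still in use.
export = sqlite3.connect("customer_42_export.db")
cust.backup(export)
export.close()

# "They rage quit": deleting one file deletes all their data.
# os.remove("customer_42.db")
```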
Some big companies have massive monolith code bases. This is not a generalization you could apply universally. There are a lot of other considerations: what kind of features are we talking about, what kind of I/O patterns are planned, what is the scale of data expected, etc.
[1] https://www.amazon.com/Designing-Data-Intensive-Applications... [2] https://www.amazon.com/Monolith-Microservices-Evolutionary-P...
Having multiple teams with one code base that has one database is fine. But every line of code, table, and column needs to be owned by exactly ONE team.
Ownership is the most important part of making an organization effective.
- Crud accumulates in the [infrastructure thingie], and it’s unclear if it can be deleted.
- When there are performance issues, infrastructure (without deep product knowledge) has to debug the [infrastructure thingie] and figure out who to redirect to
- [infrastructure thingie] users can push bad code that does bad things to the [infrastructure thingie]. These bad things may PagerDuty alert the infrastructure team (since they own the [infrastructure thingie]). It feels bad to wake up one team for another team’s issue. With application owned [infrastructure thingies], the application team is the first responder.
The only thing that worried me is that one product might need SOC 2 sooner than another. I thought separate databases would give a smaller compliance surface to worry about. However, this will be my first time going through this process, so I am pretty uninformed here.
Whoa, now there is a truth bomb. I've seen this happen a bunch, but never put it this succinctly before.
Bookmarked for my own infrastructure transformations. Honestly, if Okta could spit out a container or appliance that replaces on-prem ADDCs for LDAP, GPOs, and Kerberos, I’d give them all the money. They’re just so good.
By the same token, it's more efficient to let an LLM operate all these tools (and more) than to force an LLM to keep all of that on its "mind", that is, context.
Just because they can run tools doesn't mean they run them reliably. Running tools is not the be-all and end-all of the problem.
Amdahl's law is still in play when it comes to agents orchestrating entire business processes on their own.
I would personally recommend https://www.shortcut.com which is very well designed, and also made some really sensible improvements over the time that we used it.
It's Amazon warehouse worker tracking software for developers and thus we hate it.
These days AI in docs, specs, and the production lifecycle means we need AI-first ticket tooling. I haven’t used Linear, but I suspect it works far better with AI than JIRA does.
past discussion: https://news.ycombinator.com/item?id=39313623
Almost every infrastructure decision I endorse or regret - https://news.ycombinator.com/item?id=39313623 - Feb 2024 (626 comments)
I've worked with hundreds of customers to integrate IdPs with our application, and Google Workspace was by far the worst of the big players (Entra ID, Okta, Ping). It's extremely inflexible for even the most basic SAML configuration. Stay far, far away.
I've been working mostly at startups most of my career (for Sydney Australia values of "start up" which mostly means "small and new or new-ish business using technology", not the Silicon Valley VC money powered moonshot crapshoot meaning). Two of those roles (including the one I'm in now) have been longer that a decade.
And it's pretty much true that almost all infrastructure (and architecture) decisions are things that 4-5 years later become regrets. Some standouts from 30 years:
I didn't choose Macromind/Macromedia Director in '94 but that was someone else's decision I regretted 5 years later.
I shouldn't have chosen to run a web business on ISP web hosting and Perl4 in '95 (yay /cgi-bin).
I shouldn't have chosen globally colocated desktop pc linux machines and MySQL in '98/99 (although I got a lot of work trips and airline miles out of that).
I shouldn't have chosen Python2 in 2007, or even worse Angular2 in 2011.
I _probably_ shouldn't have chosen Arch Linux (and a custom/bastardised Pacman repo) for a hardware startup in 2013.
I didn't choose Groovy on Grails in 2014 but I regretted being recruited into being responsible for it by 2018 or so.
I shouldn't have chosen Java/MySQL in 2019 (or at least I should have kept a much tighter leash on the backend team and their enterprise architecture astronaut).
The other perspective on all those decisions, though: each of them allowed a business to do the things it needed to take money off customers (I know, I know, that's not the VC startup way...). Although I regretted each of those later, even in retrospect I think I made decent pragmatic choices at the time. And at this stage of my career I've become happy enough knowing that every decision is probably going to have regrets over a 4-5 year timeframe, but that most projects never last long enough for you to get there - either the business doesn't pan out and the project gets closed down, or a major ground-up rewrite happens for reasons often unrelated to 5-year-old infrastructure or architecture choices.
We didn't have to invent any homegrown orchestration tool. Our infra is hundreds of VMs across 4 regions.
Can you give an example of what you needed to do?
I wish luck to the (imo) fools chasing the "you may not need it" logic. The vacuum that attitude creates in its wake demands many, many complex and gnarly home-cooked solutions.
Can you? Sure, absolutely! But you are doing that on your own, gluing it all together every step of the way. There's no other glue layer anywhere remotely as integrative, one that can universally bind to so much. The value is astronomical, imho.
1: https://kubernetes.io/blog/2026/01/29/ingress-nginx-statemen...
Knative on k8s works well for us, there's some oddities about it but in general does the job
Everything in the article is an excellent point, but another big one is that schema changes become extremely difficult, because you have unknown applications possibly relying on that schema.
Also, at a certain point the database becomes absolutely massive and you will need teams of DBAs to care for and feed it.
Everyone tries to plan for a world where they've become one of the hyperscalers. Better to optimize for the much more likely scenarios.
Database is still 40TB with 3200 stored procedures.
Granted, DB size isn't the best metric to be using here in terms of performance, but it's the one you used.
Con: it’s sadly likely that no one on your staff knows a damn thing about how an RDBMS works, and is seemingly incapable of reading documentation, so you’re gonna run into footguns faster. To be fair, this will also happen with isolated DBs, and will then be much more effort to rein in.
I also reached a lot of similar decisions and challenges, even where we differ (ECS vs EKS) I completely understand your conclusions.
modal.com exists now
FaaS is almost certainly a mistake. I get the appeal from an accountant's perspective, but from a debugging and development perspective it's really fucking awful compared to using a traditional VM. Getting at logs in something like azure functions is a great example of this.
I pushed really hard for FaaS until I had to support it. It's the worst kind of trap. I still get sweaty thinking about some of the issues we had with it.
This is the least of the problems I've experienced with Azure Functions. You'd have to try very hard to NOT end up with useful logs in Application Insights if you use any of the standard Functions project templates. I'm wondering how this went wrong for you?
Though I never really understood the appeal of FaaS over something like Google Cloud Run.
It SUCKS. There's no interactive debugging. A deploy takes a minute or five depending on the changes, then you trigger the lambda and wait another five minutes for all the logs to show up. Then proceed with printf/stack-trace debugging.
For reasons I've forgotten, running the lambda code locally on my dev box wasn't applicable. Neither was deploying the cloud environment locally.
I wasn't around for the era but I imagine it's like working on an ancient mainframe with long compile times and a very slow printer.
Surprised to see datadog as a regret - it is expensive but it's been enormously useful for us. Though we don't run kubernetes, so perhaps my baseline of expensive is wrong.
The open source stack has gotten genuinely viable: Prometheus/VictoriaMetrics for metrics, Grafana for viz, and OpenTelemetry as the collection layer means you're not locked into anyone's agent. The gap used to be in correlation - connecting a metric spike to a trace to a log line - but that's narrowed significantly.
The actual hard part of leaving DD isn't technical, it's organizational. DD becomes load-bearing for on-call runbooks, alert routing, and team muscle memory. Migration is less "swap the backend" and more "retrain your incident response."
If you're evaluating: the question I'd ask isn't "which vendor has the best dashboards" but "can I get from alert to root cause in under 5 minutes with this tool?" That's the metric that actually correlates with MTTR, and it's where most monitoring setups (including expensive ones) fail.
Curious to hear more about Renovate vs Dependabot. Is it complicated to debug _why_ it's making a choice to upgrade from A to B? Working on a tool to do app-specific breaking change analysis so winning trust and being transparent about what is happening is top of mind.
When were you using quay.io? In the pre-CoreOS years, CoreOS years (2014-2018), or the Red Hat years?
I love modal. I think they got FaaS for GPU exactly right, both in terms of their SDK and the abstractions/infra they provide.
RDS is a very quick way to expand your bill, followed by EC2, followed by S3. RDS for production is great, but you should avoid the bizarre HN trope of "Postgres for everything" with RDS. It makes your database unnecessarily larger which expands your bill. Use it strategically and your cost will remain low while also being very stable and easy to manage. You may still end up DIYing backups. Aurora Serverless v2 is another useful way to reduce bill. If you want to do custom fancy SQL/host/volume things, RDS Custom may enable it.
I'm starting to think ElastiCache is a code smell. I see teams adopt it when they literally don't know why they're using it. Like the "Postgres for everything" people, they're often wasteful, adding cost and complexity for no benefit. If you do decide to use ElastiCache, Serverless with Valkey is the cheapest option.
Always use ECR in AWS. Even if you have some enterprise artifact manager with container support, run your prod container pulls through ECR. Don't enable container scanning; it just increases your bill, and nobody ever looks at the scan results.
I no longer endorse using GitHub Actions except for non-business-critical stuff. I was bullish early on about the Actions ecosystem, but the whole thing is a mess now, from the UX to the docs to the features and stability. I use it for my OSS projects but that's it. Most managed CI/CD sucks. Use Drone.io for free if you're small; use WoodpeckerCI otherwise.
Buying an IP block is a complicated and fraught thing (it may not seem like it, but eventually it is). Buy reserved IPs from AWS, keep them as long as you want, you never have to deal with strange outages from an RIR not getting the correct contact updated in the correct amount of time or some foolishness.
He mentions K8s, and it really is useful, but as a staging and dev environment. For production you run the risk of insane complexity exploding, plus the constant death march of upgrades and compatibility issues from the 12-month EOL; I would not recommend even managed K8s for prod. But for staging/dev, it's fantastic. Give your devs their own namespace (or, ideally, virtual cluster) and they can go hog wild deploying infrastructure and testing apps in a protected private environment. You can spin things up and down much more easily than with typical AWS infra (no need for Terraform, just use Helm) and with less risk, and horizontal autoscaling makes it easier to save money. Compare that to the difficulty of least-privilege AWS IAM for experiments, where you're constantly risking blowing up real infra.
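The namespace-per-dev setup is small to stand up. A sketch with guardrails so experiments can't eat the cluster (names and limits here are made up, tune to your cluster):

```yaml
# Hypothetical per-developer sandbox: a namespace plus a quota and
# default limits, so "go hog wild" stays contained.
apiVersion: v1
kind: Namespace
metadata:
  name: dev-alice              # one namespace per developer
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: sandbox-quota
  namespace: dev-alice
spec:
  hard:
    requests.cpu: "4"          # cap total CPU requests in the sandbox
    requests.memory: 8Gi
    pods: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: sandbox-defaults
  namespace: dev-alice
spec:
  limits:
    - type: Container
      default:                 # applied when a pod spec omits limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
```

Apply with `kubectl apply -f sandbox.yaml`; the quota is what makes the "less risk" claim real, since a runaway experiment hits the namespace ceiling instead of the cluster.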
Helm is a perfectly acceptable way to quickly install K8s components, and there are big libraries of apps on https://artifacthub.io/. A big advantage is its atomic rollouts, which make simple deploy/rollback a breeze. But ExternalSecrets is one of the most over-complicated, annoying garbage projects I've ever dealt with. It's useful, but I will fight hard to avoid it in the future. There are multiple ways to use it, all with arcane syntax, yet it still lacks some useful functionality. I spent way too much time trying to get it to do basic things, and troubleshooting it is difficult. Beware.
I don't see a lot of architectural advice, which is strange. You should start your startup out with as much of the AWS Well-Architected Framework as could possibly apply to your current startup. That means things like:
1) multiple AWS accounts (the more the better), with a management account and a security account,
2) Identity Center SSO, no IAM users for humans,
3) reserved CIDRs for VPCs,
4) transit gateway between accounts,
5) a hard split between stage and prod,
6) an OpenVPN or WireGuard proxy on each VPC to get into private networks,
7) tagging and naming standards, and everything you build gets the tags,
8) management-account policies and CloudTrail to enforce limitations on all the accounts, e.g. default protections and auditing.
If you're thinking "well, my startup doesn't need that" - only if your startup dies will you not need it, and it will be an absolute nightmare to do it later (ever changed the wheels on a moving bus?). And if you plan on working for more than one startup in your life, doing it once early on means it's easier the second time. Finally, if you think "well, that will take too long!" - we have AI now; just ask it to do the thing and it'll do it for you.
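Point 8 can be made concrete with a service control policy attached from the management account. A minimal illustrative sketch (the allowed regions and exempted global services are assumptions; real SCPs need more exemptions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllOutsideAllowedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "route53:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
        }
      }
    }
  ]
}
```

Attached at the org root, this denies API calls in any region you don't use (global services like IAM are carved out via `NotAction`), which is one of the cheapest "default protections" you can buy.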
God I wish that were true. Unfortunately, ECR scanning is often cheaper and easier to start consuming than buying $giant_enterprise_scanner_du_jour, and plenty of people consider free/OSS scanners insufficient.
Stupid, self-inflicted problems to be sure, but far from “nobody uses ECR scanning”.
modal.com???
The insidious part with on-call tooling specifically is that switching costs are higher than almost any other category. Your escalation chains, schedules, integrations with monitoring, incident templates, post-mortem workflows - it all becomes organizational muscle memory. Migrating monitoring backends is a weekend project compared to migrating on-call routing.
What I've seen work: teams that treat on-call routing as a thin layer rather than a platform. If your schedules live in something portable (even a YAML file synced to whatever tool) and your alert routing is OpenTelemetry-native, swapping the actual dispatch tool becomes manageable. The teams that get locked in are the ones who build their entire incident process inside PD's UI.
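The "schedules live in something portable" idea really can be that thin: data plus a small resolver that any dispatch tool syncs from. A sketch (the rotation, epoch, and names are invented):

```python
from datetime import date

# Hypothetical weekly rotation kept as plain data; in practice this
# could be a YAML file in the repo, synced to whatever paging tool.
ROTATION = ["alice", "bob", "carol"]
EPOCH = date(2024, 1, 1)  # a Monday; week 0 of the rotation starts here

def on_call(day: date) -> str:
    """Return who is on call for a given day under a weekly rotation."""
    weeks_elapsed = (day - EPOCH).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]

print(on_call(date(2024, 1, 3)))   # week 0 -> alice
print(on_call(date(2024, 1, 10)))  # week 1 -> bob
```

Because the source of truth is a file, swapping PagerDuty for something else means rewriting the sync step, not re-entering every schedule by hand in a UI.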
I used to use Replit for educational purposes, to be able to create simple programs in any language and share them with others (teachers, students). That was really useful.
Now Replit is a frontend to some AI chat that is supposed to write software for me.
Is this jumping into AI bandwagon everywhere a new trend? Is this really needed? Is this really profitable?
For the same amount of memory they should cost _nearly_ the same. Run the numbers; they're not significantly different services. Aside from that, you do NOT pay for IPv4 when using Lambda but you do on EC2, so Lambda is almost always less expensive.
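"Run the numbers" hinges on utilization, so here is a sketch of the break-even point. The prices below are illustrative assumptions, not current list prices:

```python
# Assumed illustrative prices (check current pricing before deciding):
LAMBDA_PER_GB_SECOND = 0.0000166667   # per GB-second of execution
EC2_HOURLY_1GB = 0.0084               # a small 1 GB instance, per hour

# Cost of one hour of Lambda compute at 1 GB, if busy 100% of the time:
lambda_hourly_full = LAMBDA_PER_GB_SECOND * 3600  # about $0.06/hour

# Utilization below which Lambda is cheaper than an always-on instance:
break_even = EC2_HOURLY_1GB / lambda_hourly_full
print(f"break-even utilization ~ {break_even:.0%}")
```

Under these assumed prices, below roughly 14% average utilization Lambda wins and above it an always-on instance does; per-request charges and the IPv4 point above shift the line a bit in each direction.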
On Lambda, load balancing is handled out of the box, but you may need to introduce things like connection poolers for the DB that you could have gotten away without on EC2.
Think it also depends on whether you're CPU- or memory-constrained. Lambda seemed more expensive for CPU-heavy workloads since you're stuck with certain CPU:mem ratios, and there's more flexibility in EC2 instance types.
It is true that it can be hard to size workloads into Lambda's rather unusual CPU configuration; however, the real beauty of Lambda is that you can just fork several parallel copies of your function. We sometimes fork up to 250 instances for a single "job."
If you're in the same boat we are where your workloads parallelize easily then Lambda has been incredibly cost effective for this use case.
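The fan-out in the comments above is essentially a map over payloads. A local sketch with the actual Lambda call stubbed out (in real code the stub would serialize the payload and call the function via boto3's `lambda` client):

```python
from concurrent.futures import ThreadPoolExecutor

def invoke_worker(payload: dict) -> int:
    # Stub standing in for a Lambda invocation; it just "processes"
    # its shard locally so the pattern is runnable without AWS.
    return payload["shard"] * payload["shard"]

# One invocation per shard of the job, 250 in all, like the comment above.
payloads = [{"shard": i} for i in range(250)]

# Fire the shards concurrently; this is the whole fan-out pattern.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(invoke_worker, payloads))

print(len(results), sum(results))
```

The cost-effectiveness claim follows from this shape: you pay for 250 short executions instead of keeping one large instance sized for the peak.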
Hire a DBA ASAP. They also need to rein in the laziness of the other developers when designing and interacting with the DB. The horrors a dev can create in a DB can take years to undo.
1. Cost tracking meetings with your finance team are useful, but for AWS and other services that support it I highly recommend setting billing alarms. The sooner you can know about runaway costs, the sooner you can do something about it.
2. Highly recommend PGAnalyze (https://pganalyze.com/) if you're running Postgres in your stack. It's really intuitive, and has proven itself invaluable many times when debugging issues.
3. Having used Notion for like 7 years now, I don't think I love it as much as I used to. I feel like the "complexity" of documents gets inflated by Notion and the number of tools it gives you, and the experience of just writing text in Notion isn't super smooth IMO.
4. +1 to moving off JIRA. We moved to Shortcut years ago, I know Linear is the new hotness now.
5. I would put Datadog as an "endorse". It's certainly expensive but I feel we get loads of value out of it since we leaned so heavily into it as a central platform.
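On the billing alarms in point 1: the alarm itself is just a CloudWatch metric alarm on `EstimatedCharges` (which only exists in us-east-1). A sketch that builds the parameters you'd pass to boto3's `put_metric_alarm`; the threshold and topic ARN are made up:

```python
def billing_alarm_params(threshold_usd: float, sns_topic_arn: str) -> dict:
    """Parameters for a CloudWatch billing alarm on total estimated charges."""
    return {
        "AlarmName": f"billing-over-{threshold_usd:.0f}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 6 * 3600,          # the billing metric updates a few times a day
        "EvaluationPeriods": 1,
        "Threshold": threshold_usd,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = billing_alarm_params(500, "arn:aws:sns:us-east-1:123456789012:billing-alerts")
print(params["AlarmName"])
# To create it for real (requires billing alerts enabled in the account):
# boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Set a ladder of these (e.g. at 50%, 80%, 120% of expected spend) so a runaway cost pages someone before the finance meeting does.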