Unfortunately the June pricing change for Copilot forced me personally as well as my entire department at work to switch to Claude Code. With Copilot we were hitting a few dollars of extra spend over the included credits in April and May, then in June we started chewing through the monthly budget every 2-3 days.
Just a completely insane price hike from the customer's perspective, I don't know what MS were thinking there.
Even if that is the price they need to be sustainable they should have waited until the competition changed their prices first. I wouldn't be surprised if Copilot lost 50% or more of their customer base last month.
Eventually this could be where all the major players set their prices, so the thought occurs to me that nations should run some form of "public access AI", just like they did for TV. Use the free open models and use tax money to finance a few datacenters. Geo-lock the use and set strict throttles to manage load, but let school children and citizens use that AI freely otherwise.
If Copilot's pricing is the level for all AI in a few years, only the unicorn companies can afford to use them, and everybody else has no chance of competing with a company that can use AI.
They did...
They're literally just passing on the costs https://platform.claude.com/docs/en/about-claude/pricing
Anthropic just provides a subscription - which Enterprise usually doesn't want you to use because everything you're submitting through that will be trained on / becomes part of their model.
So if you use it without explicit permission from your employer you may be committing a contract violation which can have serious consequences - up to jail time - as they can sue you for that.
My Pro account very clearly has a toggle for "Help improve our AI models: Allow the use of your chats and coding sessions to train and improve Anthropic AI models."
> Our use of Materials [...] Even if you opt out, we will use Materials for model training when: (1) you provide Feedback to us regarding any Materials, or (2) your Materials are flagged for safety review to improve our ability to detect harmful content, enforce our policies, or advance our safety research.
The last part is essentially a catch-all, which lets them train on everything they want - and they probably are.
But the important bit here isn't actually whether they're training on it - that doesn't matter from the legality aspect of it. You're liable anyway, as all contracts I've ever signed explicitly forbid me from sharing internal data of any kind (including code) with third parties.
You can be prosecuted just for using it - whether Anthropic decides to train its model on it or not.
If you use Claude via API in your own app, you're paying full price.
If you have an "API Plan" for Claude Code (i.e., free), you're paying full price.
If you have a Pro, Max, Max 5x, or Max 20x, your tokens are subsidized up until a rate limit. Then you pay full price for usage thereafter, until the end of the billing cycle.
The widespread belief in industry right now is that the per-seat pricing (which Copilot bailed from first) is going to go away in the near-term.
At my company we did the comparison and Copilot still wins: for 20$ you get a seat and 20$ of usage, whereas with Claude enterprise you get a seat and then usage is completely added. Moreover usage in Copilot is exactly the price of the providers AND it allows us to use various models from multiple providers.
The case that might be less expensive is if you negotiate a volume discount with AWS for Bedrock usage, but that is also possible with GitHub and Microsoft.
Edit: wording on the cost saving effort
I've swapped to the 20x Claude plan for a month or two to knock out two ideas I need to get to MVP - expecting Claude to go token-priced soon.
Pleasantly surprised: Claude's way ahead in tooling, but the ability to designate what model your subagents use and having access to all models is a better feature than everything Claude offers combined atm.
The only limit on the amount of AI I can consume in a month at work is dollars, so anything that helps with cost is the best model/harness for me.
It also did a better job at smartly designating subagents itself, whereas Claude often used higher-cost models.
For example: https://github.com/monooso/dotfiles/tree/main/.claude/agents
Honest question, can you elaborate? If given the option, I use OpenCode but what do you find in Copilot CLI that makes you prefer it to Claude Code?
There is also IMO a distinct difference in "tone" in the dialogue. Claude seems to impersonate a human a bit more than I like.
Claude is of course very good as well and does a few things better than copilot too, but overall I'd prefer to use Copilot.
For my personal work, I still use Claude Code as it's cheaper and the limits don't bother me too much, but it feels a bit like being handcuffed to Anthropic vs being at work and freely selecting models.
I also use the Copilot ACP server inside PyCharm and that works decently well too, although it has some annoying bugs, but if you're a JetBrains user you're used to annoying bugs.
However, last month they introduced a new pricing model (I know the old pricing was not sustainable), and my USD 10 was exhausted within days. Because of that, I switched to Claude Code and Codex and have never looked back. Yes, tokens on Claude Code and Codex are subsidized heavily, but let's just enjoy good things while they last.
I do feel there is a difference between using Claude via Copilot versus using Claude directly in Claude Code. I'm not sure what Microsoft is doing behind the scenes.
Anthropic seems to have a modest lead on their harness and models, so it’s a best-of-both-worlds scenario.
> I'm not sure what Microsoft is doing behind the scenes
It’s probably the exact same model, but the tools and the prompts around it are worse, so you get worse results.
The new pricing model where I got banned from using Opus entirely and half a day of work (with weaker models) consumed the 10$ plan was.
I'm now using a Claude Max subscription and I can get close to the daily limits but I'm fairly happy with the overall plan consumption.
ACP is just a standard that bridges harnesses easily into IDEs, Text Editors, or whatever consumes it (I wrote a TUI that consumes them)
The registry for all the agents (tool harnesses) is here https://github.com/agentclientprotocol/registry if you ever are curious to what Zed or IntelliJ are really hooking into
When using Zed with the CoPilot integration I use Claude Opus and never had this issue.
I paid $6 yesterday for DeepSeek V4 Flash on OpenRouter. That's like $120 a month, and it's not even a good model.
The performance, if we trust the benchmarks, puts it at Sonnet 4.6.
Let’s see if it’s worth it with GitHub's pricing.
I don't trust these benchmarks. I used Kimi K2.7 a number of times and I was disappointed. It would run in circles on things that Claude would do in one shot. However, my usage was via Ollama cloud, and I have no idea if they serve the actual model or a quantized version, so it may have been the quantization that degraded the performance.
The great news, in my opinion, is the precedent. If Microsoft is now serving Kimi K2.7, then very soon they might start serving GLM 5.2, and that is indeed a very competitive model.
I'm going to be called a shill again, but at this point I don't care as it is relevant. Synthetic runs their own models for a reasonable price, GLM5.2 & Kimi K2.7-Code included.
Referral link :
ps opencode cli is quite nice too
Cache hit (most important): $0.19
Output: $4.00
This is the same as how much Moonshot charges for it, and it puts it at roughly the price of GPT 5.4 mini, not a bad option.
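To make those rates concrete, here is a rough cost sketch. It assumes the quoted prices are USD per million tokens, which the comment does not state explicitly. The `request_cost` helper and the token counts are made up for illustration, and the cache-miss input price is left as a parameter because it is not given in the thread.

```python
# Assumption: the quoted prices are USD per one million tokens.
CACHE_HIT_PER_M = 0.19   # cache-hit input price quoted in the thread
OUTPUT_PER_M = 4.00      # output price quoted in the thread


def request_cost(cached_input_tokens: int, output_tokens: int,
                 uncached_input_tokens: int = 0,
                 uncached_input_per_m: float = 0.0) -> float:
    """Estimate the USD cost of one request.

    uncached_input_per_m is a placeholder: the cache-miss input price is
    not given in the thread, so pass the real figure if you know it.
    """
    total = (cached_input_tokens * CACHE_HIT_PER_M
             + output_tokens * OUTPUT_PER_M
             + uncached_input_tokens * uncached_input_per_m)
    return total / 1_000_000


# Hypothetical agent turn: 100k tokens of cached context, 2k tokens of output.
print(f"${request_cost(100_000, 2_000):.4f}")
```

With these made-up numbers the cached context costs more than the output (about 1.9 cents versus 0.8 cents), which is probably why the cache-hit rate is called the most important one for agentic use.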
For some context here is a stupid prompt that wastes tokens: "Play a game of tic tac toe against yourself on a 5x5 board, you need 5 in a row to win."
It costs $0.006 on Kimi K2.7, and you get to see the whole raw reasoning trace.
GPT-5.4 mini costs $0.016 and it's summarized.
And in case you are wondering, both play incredibly stupidly.
Kimi:
  A B C D E
1 . . . . .
2 . . . . .
3 X X X X X
4 . O O O O
5 . . . . .
GPT 5.4 mini:
1: X X X X X
2: O O . . .
3: . . O . .
4: . . . O .
5: . . . . O
Fable manages to make a reasonable game, at a cost of 40 cents.
X X O O O
O O X X X
X X X O O
X O O X O
X O X X O
We're sticking with Cursor for now, using Kimi as our daily driver (branded as "Composer").
Saw in a discussion on Reddit that the team is evaluating glm5.2 so hopefully more to come!
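For anyone who wants to check the 5x5 tic-tac-toe boards quoted a few comments up, here is a small sketch that finds the winner and counts the pieces. The boards are transcribed from the thread with the row and column labels stripped. The helper names (`parse`, `winner`, `counts`) are my own, and it assumes a win means filling an entire row, column, or main diagonal, since five in a row on a 5x5 board leaves no other option.

```python
from typing import Optional

N = 5  # board size, and also the run length needed to win

KIMI = """
. . . . .
. . . . .
X X X X X
. O O O O
. . . . .
"""

GPT_MINI = """
X X X X X
O O . . .
. . O . .
. . . O .
. . . . O
"""

FABLE = """
X X O O O
O O X X X
X X X O O
X O O X O
X O X X O
"""


def parse(board: str) -> list[list[str]]:
    """Turn whitespace-separated rows of X / O / . into a grid."""
    return [row.split() for row in board.strip().splitlines()]


def winner(grid: list[list[str]]) -> Optional[str]:
    """Return 'X' or 'O' if that player fills a whole row, column, or diagonal."""
    lines = [list(row) for row in grid]                            # rows
    lines += [[grid[r][c] for r in range(N)] for c in range(N)]    # columns
    lines.append([grid[i][i] for i in range(N)])                   # main diagonal
    lines.append([grid[i][N - 1 - i] for i in range(N)])           # anti-diagonal
    for line in lines:
        if line[0] != "." and all(cell == line[0] for cell in line):
            return line[0]
    return None


def counts(grid: list[list[str]]) -> tuple[int, int]:
    """Return (number of X, number of O) on the board."""
    flat = [cell for row in grid for cell in row]
    return flat.count("X"), flat.count("O")


for name, text in [("Kimi", KIMI), ("GPT 5.4 mini", GPT_MINI), ("Fable", FABLE)]:
    grid = parse(text)
    print(name, "winner:", winner(grid), "pieces (X, O):", counts(grid))
```

The piece counts are the interesting part. If X moves first and completes five on its fifth move, O can only have placed four pieces by then, so a final board with five of each suggests the game either kept going after the win or O moved first.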
Has their reputation tanked so much that the alternatives get all the buzz? Or is it that non-enterprise users are priced out by the usage costs, so no free marketing?
This comes up all the time at work because the vendor management people don’t understand the LLM ecosystem and think Claude through Copilot is the same as Claude through Claude Code.
When I’m asked to explain the difference, a simple side-by-side comparison shows dramatic underperformance 3 or 4 times out of 5.
https://fireworks.ai/blog/kimi-k2p7-code
I don’t know much about them but they did a deal with Microsoft in March:
https://azure.microsoft.com/en-us/blog/introducing-fireworks...
Says that they are run by Moonshot
From your link: https://docs.github.com/en/copilot/reference/ai-models/model....
I tried adding a Foundry LLM as a GitHub Copilot custom model and failed miserably. But with VSCode BYOK (and GitHub Copilot as the interface) I did get it working, and I can now use DeepSeek V4 Flash with Copilot.
I work at GitHub but even then I often use OpenRouter models in the CLI and Copilot App
https://docs.github.com/en/copilot/reference/ai-models/model...
I find it laughable.
Unless you have a time machine to 2005 (EC2 came out in 2006 that should have been the signal) there is no way to compete now. That train has left the platform.
Second, Nokia and Ericsson dominate mobile infra in the west, but that is good I guess as they are EU? What does USA think about that?
Third, let us say you get rid of MS. Now you have no MS but all network infra for broadband is Cisco, Huawei, Juniper etc. Good luck ripping that out. And for what?
Same with AI. Mistral was amazing at first, Le Chat. Almost as good, generous free limits, good docs. Now? Just plain bad. DeepSeek is better (I dislike China so I avoid it). EU should have gone in 500% the moment Mistral showed promise.
But lately we let USA and China take the lead on everything and EU can write a strongly worded letter after about how bad it is.
People will "care" when EU starts making good stuff again.
And lastly lol, people do know everything ends in Taiwan in the end right?
They are run by Moonshot itself, so probably China
But then again they released MAI despite this, so I don't know.
I've set up a small rig, mostly settled on Qwen3.6 and I'm slowly adding features myself. It probably can't compete with Claude. I don't even know, I've stopped checking. It's providing a ton of value to me as is, and it only keeps getting better. All it takes is to realize that it doesn't actually matter if the grass is (maybe even objectively) greener somewhere else. Feels so good to know that it won't change under my feet. I've got this amazing, highly extensible tool, and it's mine.
Just wanted to leave a note for folks who might not have the memory to run a big 32gb model - I just found out there are some pruned models that have really good performance, and if I had a smaller machine I might try this pruned unsloth Q4 quant of GLM 4.7 flash that sits at 14gb: https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GG...
I usually use LM Studio for this type of thing but unsloth has their own studio type app that might be even better suited for these quants.
I used GLM 4.7 flash as my main model for months and it was an incredibly tenacious model and very very fast - I think on restricted hardware, this could be a great choice.
It's like we're mostly treading mud at this point. New editions are released, a version number increases, but I have to wonder if all steps are forward or they're more just tuned differently with similar actual perf per dollar as when this year began.
Most in fact seem to be happening to me with small models. Like your Qwen. Or Gemma 4 31B which is kinda magic especially when considering multilingual abilities. So yes, in that sense I can see "development" probably as we refine data sets and training methods but I see it less on the big hulking beasts with daily limits (unless you turn it up to 11 like Fable).
Edit: As I posted this, I saw a "before and after" comparison for Fable and the reintroduced version is seeing a catastrophic drop in BridgeBench performance as they're still mucking with the model. Go figure... https://x.com/Hesamation/status/2072692225100612032
There isn’t any benefit to running a windows machine.
Haven't had much time to test it other than asking a few questions & changing some HTML in cline so it might be thick as a brick for all I know, but still worth trying
I wrote up how I run local LLMs, with numbers and a focus on running Qwen 3.6 and Gemma 4. I prefer Gemma 4 31b, even though the general consensus is that Qwen 3.6 is better for code, and it is better on most coding focused benchmarks...it doesn't seem to be for my use cases, Gemma feels smarter. And, with QAT, you get more smarts in less memory, so it's fast and runs on more hardware.
https://swelljoe.com/post/how-i-run-local-llms/
Currently, the sweet spot for self-hosted models is either Qwen 3.6 or Gemma 4, and those top out at 31B (Gemma) and 35B (for Qwen, but you want the dense Qwen 3.6 27B if you can run it at a reasonable speed...the dense models are much smarter), so for now, a system with 64GB or 128GB is going to be running the same models. Going to a bigger model doesn't get you better performance because there aren't any better models that are a little bigger. I wish there was a ~70B or even ~120B MoE in the Qwen 3.6 or Gemma 4 families, as I've got a Strix Halo running a model that leaves a lot of memory on the table (and it's not very fast, to boot...an MoE would be faster, and hopefully smarter if it's a much bigger model, like double or triple sized).
In short, right now, 64GB is all you need for the best models you can self-host on anything short of five-figure machines, but, I wouldn't buy any hardware right now, if you can wait a while. Tokens from DeepSeek are so cheap, you can wait out the memory shortage and get access to models you could never host locally. And, OpenRouter always has free models in preview or just because that you can use lightly, as they're rate-limited (but your self-hosted models are going to be rate-limited, too, because a Mac Mini can't run models very fast). Google AI Studio has the Gemma 4 models for free too, also rate/usage limited.
Hmmm
Then I assumed for cost and battery/heat reasons that a Mini would be better than a laptop.
27B dense can fit on a consumer graphics card. Even without getting into various "intrusive" ways to shrink the size of a model (e.g. REAP), something like an NVFP4 quant of Qwen3.6 27b
https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4
should fit within ~22GB of VRAM. So easily on a 5090. It would also fit on a 3090/4090, but iirc they don't have NVFP4 natively, so you would want a different quant for them.
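A back-of-the-envelope check on that size, as a sketch only. It assumes NVFP4 stores each weight in 4 bits plus one 8-bit scale per block of 16 weights, and that every one of the 27B parameters is quantized. Real checkpoints usually keep some layers in higher precision, so treat the result as a lower bound. The function name is made up for illustration.

```python
def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 27e9  # parameter count, taken at face value from the model name

# Assumption: 4-bit values plus one 8-bit scale shared by 16 weights,
# i.e. about 4 + 8/16 = 4.5 bits per weight on average.
NVFP4_BITS = 4 + 8 / 16

weights = weight_gigabytes(PARAMS, NVFP4_BITS)
print(f"NVFP4 weights: ~{weights:.1f} GB")
print(f"Headroom on a 24 GB card: ~{24 - weights:.1f} GB")
print(f"BF16 weights for comparison: ~{weight_gigabytes(PARAMS, 16):.1f} GB")
```

The weights alone come out to roughly 15 GB under these assumptions. The gap up to the ~22GB figure would then be KV cache, activations, and any layers left unquantized, all of which grow or shrink with context length and runtime settings.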
You can see /r/LocalLLaMA for some discussions. See this (random) post about Qwen3.6-27B on a 3090 at ~100 tok/s
https://www.reddit.com/r/LocalLLaMA/comments/1ujo46r/qwen_36...
Note that it is possible you could still do this stuff with a Mac, as there are ways of hooking up an eGPU to Macs and using it for inference. My understanding is they're all fairly hacky though, so it would likely be preferable to just get a 3090 (or a non-nvidia option, e.g. an AMD r9700 pro has ~32GB of VRAM for much cheaper than a 5090).
https://www.reddit.com/r/LocalLLaMA/comments/1u50hnm/qwen_27...
that seems considerably slower though (~30 tok/s). I don't know if that's an outlier/misconfigured setup or what. In general there will be much better resources for local setups using 3090s, as they're quite popular. Note that 3090s (but not 4090s nor 5090s) have NVLink, so you can network the cards fairly effectively. For this reason 2x 3090 setups are fairly popular as well. I've heard that club 3090 makes that relatively straightforward
https://github.com/noonghunna/club-3090
but don't have experience myself.
You really don’t need them. After a certain point, bigger models give diminishing returns. If you can get 80% of the productivity gain with a free local model, use the local model. It will still be way faster than doing everything by hand, but you also don’t have to pay for tokens to a cloud provider and the tools won’t be ripped away from you on a whim.
This is the new attitude enlightened people should adopt. Reject the arms race.
There are guardrails you can and must add to protect your team if you take the vibe approach: a good type system, a good database with clearly written business model and a good data model to drive your business. Make it loud and clear when something breaks with your tooling.
But... I'd definitely not vibe everything after a certain point. Reading and fixing code is also a lot of fun.
I have to advocate for the vibe-coded mess-colony.
There are applications where it either works or it doesn't, and it's simultaneously obvious whether it does. Think stock price prediction software. I've killed time in the evenings verbally chatting with agents about that specifically, and what emerged worked! It didn't work well, but it clearly outperformed randomness, and I was able to verify that myself easily.
I didn't look at a line of code, but I had an absolute blast.
Except you kinda do. Try getting a job today without mentioning Claude experience. In another year it'll probably be something else. Saying you like to use Copilot today makes one seem elderly.
Not saying you need frontier models on a technical basis, but for career PR you probably do.
I tried out a few models and ended up going with either Qwen3-Coder-Next (no think, just do) or Qwen3.6-35B (thinking, w/ llama.cpp token budget). Created a customized prompt that works fairly well up to ~60k tokens, and then it's a toss-up whether it has poisoned itself or I've directly steered it wrong. When it's clear that's happened, if it's important to continue, ask it to write a doc, then start fresh.
I don't know how anyone could have witnessed the last 2 decades of American VC-funded tech startups and tell themselves, "you know, this will be a reliable technology with no hidden problems".
Even a sober technical evaluation is just two steps:
1. You're proposing to build an app on a non-deterministic model.
2. That model is hosted behind a non-deterministic system (model alignment, model guardrails, system context subterfuge, cost/token pricing)
---
So you want to build your app and you think you're going to keep up with both #1 and #2?
LLMs are, as far as the nastiness of the Real World goes, really fucking benign. Future models outperform past models, both in open weight land and at the big frontier labs. Performance per $ only ever goes up. That's just nice.
Except the Enterprise, and a lot of what people want compute for, is built on deterministic systems or processes. I'm not saying the non-deterministic nature of LLMs isn't useful. However, I've worked with a lot of organizations on SOAR projects, for example. When you can weave the deterministic and non-deterministic together you get a relatively efficient system: a workflow that will stay on the rails and will come to a conclusion as expected. And the "as expected" part is critical in these types of systems. The reality, using SOAR as an example, is also that most enterprises would be much better served by fast SLMs. Parse an email and validate if it's spam / phishing, or read a chunk of firewall logs and look for outliers / indications for escalation - those things can get messy in a deterministic system because of potentially unstructured data.
I don't believe it's either / or. And I believe that LLMs just aren't efficient, fast or reliable in the sense that deterministic systems are. It seems, at least to me, a better-together story.
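A minimal sketch of what weaving the two together could look like for the email case. Everything here is hypothetical: the blocklist, the labels, and `fake_model`, which stands in for a call to a small language model. The idea is that deterministic checks run first, the model only sees what the rules cannot decide, and a deterministic guard constrains whatever the model returns.

```python
from typing import Callable

# Hypothetical blocklist and label set, for illustration only.
BLOCKED_DOMAINS = {"malware.example", "phish.example"}
LABELS = {"spam", "phishing", "clean"}


def triage(sender: str, body: str, classify: Callable[[str], str]) -> str:
    """Return 'spam', 'phishing', 'clean', or 'escalate'."""
    domain = sender.rsplit("@", 1)[-1].lower()
    # Deterministic step: known-bad senders never reach the model.
    if domain in BLOCKED_DOMAINS:
        return "phishing"
    # Non-deterministic step: a model reads the unstructured text.
    label = classify(body).strip().lower()
    # Deterministic guard: anything outside the allowed labels goes to a human.
    return label if label in LABELS else "escalate"


def fake_model(text: str) -> str:
    """Stand-in for a small language model call."""
    if "verify your account" in text.lower():
        return "Phishing"
    return "not sure, maybe fine?"


print(triage("a@phish.example", "hi", fake_model))
print(triage("bob@corp.example", "Please verify your account now", fake_model))
print(triage("bob@corp.example", "lunch?", fake_model))
```

The workflow always ends in one of four known states regardless of what the model says, which is the "as expected" property described above.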
LLMs are what made me start considering this. Imagine a company using an LLM that was fully deterministic. All RNG was either removed or seeded in such a way that the same input (so maybe the seed counts as part of the input) gave the exact same output. Fully deterministic.
But such an LLM, with a slight drift in input, could still produce very different outputs. This isn't being non-deterministic, but more that the change in outputs does not naturally follow from the change in input. I'm thinking of how 2 double pendulums can (but do not always) greatly diverge given a very small change in their input.
So in light of that I've begun to call this new property non-chaotic. So Enterprise depends on non-chaotic systems, which are a subset of deterministic systems, and then wrangling the chaotic elements they cannot remove as much as possible.
The follow-up question I now have is whether all LLMs are inherently chaotic, or if it is possible to have a non-chaotic LLM.
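The deterministic-but-chaotic distinction is easy to see in a toy system. The sketch below uses the logistic map rather than a double pendulum or an LLM, purely because it fits in a few lines. It says nothing about whether LLMs behave this way; it only shows that the two properties are separate.

```python
def trajectory(x0: float, steps: int, r: float = 4.0) -> list[float]:
    """Iterate the logistic map x -> r * x * (1 - x), with no randomness."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


a = trajectory(0.2, 100)
b = trajectory(0.2, 100)            # same input
c = trajectory(0.2 + 1e-10, 100)    # input nudged by one part in ten billion

gaps = [abs(x - y) for x, y in zip(a, c)]
print("same input, same output:", a == b)
print("gap after 5 steps:", gaps[5])
print("largest gap over 100 steps:", max(gaps))
```

Rerunning with the same input reproduces the sequence exactly, so the system is deterministic. The nudged run stays close for the first few steps and then drifts until the two sequences have nothing to do with each other, which is the chaotic part.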
#2 is not fine; that non-determinism you do not control, have no insight into, etc.
I'm saying sure, give me #1 if it means I can build a harness around it and smooth over the edges. But I'm not taking #1 and #2. There's zero reasonable way to manage two non-deterministic systems.
So piracy on an AI model that was itself trained by piracy..
Alibaba didn't steal Opus weights, they used opus output to train their model.
If this is piracy, then so is reverse engineering efforts powering a bunch of Linux drivers.
Also, yeah, they already stole their copyrighted works, so a thief from a thief is still...thieves?
If you have ethical concerns, model distillation feels like an arbitrary line to draw. Why is the first type of piracy ok, the second not? You should restrict yourself to ethical open source models. Which is btw where I genuinely hope the future of local models is going to lie. Open weights is not enough, we need fully open source models to be sustainable. Even for simple things like updating the knowledge cutoff. How we are going to distribute the training effort will be an interesting problem where I don't see an obvious solution yet. Maybe the blockchain/federated learning people can suggest something. Or university consortia, or some public sector solutions. Or something really boring - I for one would absolutely be willing to pay for DRM-free weights of an open source model (even if I could pirate them for free).
Btw, ethically sourced, open source LLMs exist! Check out eg Olmo by Allen AI: https://allenai.org/olmo
Not gonna lie, if you're coming from ChatGPT/Claude Code, you'll mostly be adding back features you've taken for granted, or solving problems you wouldn't have had. But sometimes you do get some extra utility, like uncensored models, which have become my go-to. Not because I'm doing anything saucy, but I hated how I'd become trained to pre-emptively self-censor my prompts. The guardrails in open weights models are no less strong than in proprietary ones, subjectively even a bit stronger in Qwen. But luckily there's an entire sub-discipline of model ablation. Another advantage would be better control over image generation (although I can't attest to that, yet).