Kimi K2.7 Code is generally available in GitHub Copilot
361 points by unliftedq 15 hours ago | 152 comments

c7b 9 hours ago
Gotta say, I've lost all interest in cloud-based AI products. Too many cool features and workflows that I was once excited about that I can't or don't use anymore for a variety of reasons (price hikes, subjectively nerfed, disappeared altogether, replaced,...) for me to even remember. It's tiring.

I've set up a small rig, mostly settled on Qwen3.6 and I'm slowly adding features myself. It probably can't compete with Claude. I don't even know, I've stopped checking. It's providing a ton of value to me as is, and it only keeps getting better. All it takes is to realize that it doesn't actually matter if the grass is (maybe even objectively) greener somewhere else. Feels so good to know that it won't change under my feet. I've got this amazing, highly extensible tool, and it's mine.

reply
kamranjon 2 hours ago
I'm really happy this is one of the top comments here, I am fully local as well.

Just wanted to leave a note for folks who might not have the memory to run a big 32gb model - I just found out there are some pruned models that have really good performance and If I had a smaller machine I might try this pruned unsloth Q4 quant of GLM 4.7 flash that sits at 14gb: https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GG...

I usually use LM Studio for this type of thing but unsloth has their own studio type app that might be even better suited for these quants.

I used GLM 4.7 flash as my main model for months and it was an incredibly tenacious model and very very fast - I think on restricted hardware, this could be a great choice.

reply
jug 28 minutes ago
I often feel like we're nowadays mostly pushing AI developments in the ways of finetuning differences. Like how new editions of Claude are tuned for agentic coding which might even be detrimental if you're using it for non-agentic coding. Or how Fable 5 in fact do look great but at a huge cost for inference and a high likelihood of post-launch nerfs or limit/price revisions. How Gemini 3.5 has more liberal limits but on the other hand underperforms a bit.

It's like we're mostly treading mud at this point. New editions are released, a version number increases, but I have to wonder if all steps are forward or they're more just tuned differently with similar actual perf per dollar as when this year began.

Most in fact seem to be happening to me with small models. Like your Qwen. Or Gemma 4 31B which is kinda magic especially when considering multilingual abilities. So yes, in that sense I can see "development" probably as we refine data sets and training methods but I see it less on the big hulking beasts with daily limits (unless you turn it up to 11 like Fable).

Edit: As I posted this, I saw a "before and after" comparison for Fable and the reintroduced version is seeing a catastrophic drop in BridgeBench performance as they're still mucking with the model. Go figure... https://x.com/Hesamation/status/2072692225100612032

reply
unleaded 8 hours ago
Qwen3.6-35B-A3B-UD-Q4_K_M runs at about 11 tokens/second on my poor old 1060. Absolutely nuts how far we've come
reply
piyh 5 hours ago
I tried running any model on my 1070 and it instantly crashes my old tower, probably time to get off windows and run linux on it.
reply
SV_BubbleTime 4 hours ago
Understated how much of a boon for Linux that AI development has been.

There isn’t any benefit to running a windows machine.

reply
selectodude 4 hours ago
Au contraire, I run models on WSL and my desktop reliably wakes up from sleep. Best of both worlds.
reply
greenavocado 3 hours ago
Sounds like a hardware issue, though NVIDIA driver issues can't be ruled out, they're much rarer these days
reply
broodbucket 8 hours ago
Mind sharing your llama.cpp settings for that?
reply
unleaded 7 hours ago

  .\llama-server.exe -m ..\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -ngl 999 --n-cpu-moe 41 -c 262144 --port 8081 --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3 --no-mmap --mlock --host 0.0.0.0 -t 8 -tb 8 -np 1
Using this llama.cpp fork https://github.com/TheTom/llama-cpp-turboquant and mostly copying from this video https://www.youtube.com/watch?v=8F_5pdcD3HY

Haven't had much time to test it other than asking a few questions & changing some HTML in cline so it might be thick as a brick for all I know, but still worth trying

reply
unleaded 2 hours ago
I just tested it with some risc-v code and it wrote down a "mov" instruction several times.. yeah something needs tuning maybe
reply
pyreko 30 minutes ago
Same here, been happy throwing Qwen3.6 on my old MBP - no it's not as fast as Claude which I use at work, but it works well enough locally and I don't have to worry about credits or shit like the rug getting pulled under me in terms of capabilities.
reply
JSR_FDED 9 hours ago
This sounds very appealing. What size Mac mini would I need for that?
reply
SwellJoe 9 minutes ago
A 4-bit quantization of either Qwen 3.6 27b or Gemma 4 31b will run on a 32GB Mac with a decent-sized, but not full-sized, context. 64GB gets you the full ~256k context and you don't need to quantize your KV cache (though 8-bit quantization of KV may be worth it for performance). The 4-bit QAT version of Gemma 4 has practically identical performance to the full size version or the 8-bit version in most benchmarks and my tests, so there's no reason to run anything else. The 4-bit Qwen is a little bit lossy, as it hasn't gotten the QAT treatment, but not catastrophically lossy. A 6-bit dynamic quantization would be better for that model, but it's ~25GB on disk, and you'll need more than 32GB to run it with a big context.

I wrote up how I run local LLMs, with numbers and a focus on running Qwen 3.6 and Gemma 4. I prefer Gemma 4 31b, even though the general consensus is that Qwen 3.6 is better for code, and it is better on most coding focused benchmarks...it doesn't seem to be for my use cases, Gemma feels smarter. And, with QAT, you get more smarts in less memory, so it's fast and runs on more hardware.

https://swelljoe.com/post/how-i-run-local-llms/

Currently, the sweet spot for self-hosted models is either Qwen 3.6 or Gemma 4, and those top out at 31B (Gemma) and 35B (for Qwen, but you want the dense Qwen 3.6 27B if you can run it as reasonable speed...the dense models are much smarter), so for now, a system with 64GB or 128GB are going to be running the same models. Going to a bigger model doesn't get you better performance because there aren't any better models that are a little bigger. I wish there was a ~70B or even ~120B MoE in the Qwen 3.6 or Gemma 4 families, as I've got a Strix Halo running a model that leaves a lot of memory on the table (and it's not very fast, to boot...an MoE would be faster, and hopefully smarter if it's a much bigger model, like double or triple sized).

In short, right now, 64GB is all you need for the best models you can self-host on anything short of five-figure machines, but, I wouldn't buy any hardware right now, if you can wait a while. Tokens from DeepSeek are so cheap, you can wait out the memory shortage and get access to models you could never host locally. And, OpenRouter always has free models in preview or just because that you can use lightly, as they're rate-limited (but your self-hosted models are going to be rate-limited, too, because a Mac Mini can't run models very fast). Google AI Studio has the Gemma 4 models for free too, also rate/usage limited.

reply
c7b 5 hours ago
Personally, I would always max out the RAM you can fit into your budget. You might get lower bandwidth (= slower generation) than you do on a Mac if you choose a Strix Halo or DGX Spark, but there are always new tweaks being discovered to speed things up. That being said, with 32GB you should be able to fit an ok quant of 35B-A3B or 27B with some context, with 64GB you should be golden.
reply
sleepybrett 2 hours ago
i have issues on a m5/64g with 35b-a3b (mlx) it eventually hits a memory cap around 52gb... but i'm pretty happy with `Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-mlx-8Bit`
reply
c7b 2 hours ago
I'm sure there will be a fix for it, but it illustrates an important broader point I should probably have made above: if you opt for local AI today, expect to run into some issues. Expect to learn a bit about the tools you're using, the not-so-fun way. I'm not recommending it to non-technical friends (yet).
reply
jadbox 8 hours ago
A PC with an nvidia card with 16gb vram works just fine for Qwen MoE models, and these have worked great as a daily driver for me.
reply
mathgeek 7 hours ago
reply
coredog64 5 hours ago
> That's not hypothetical — it's a real measurement on the base model Mac Mini.

Hmmm

reply
blensor 8 hours ago
I am curious if you implicitly assumed they are Macs or if that's what you are looking for specifically?
reply
JSR_FDED 7 hours ago
I assumed the 27B dense model would be preferable to a MoE model, and that it wouldn’t fit into a consumer graphics card, which leaves the Macs.

Then I assumed for cost and battery/heat reasons that a Mini would be better than a laptop.

reply
mswphd 3 hours ago
dense models are (more) compute heavy, so are generally worse to run on mac. mac tends to be better for (larger) MoE models.

27B dense can fit on a consumer graphics card. Even without getting into various "intrusive" ways to shrink the size of a model (e.g. REAP), something like a NVFP4 quant of Qwen3.6 27b

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4

should fit within ~22GB of VRAM. So easily on a 5090. It would also fit on a 3090/4090, but iirc they don't have NVFP4 natively, so you would want a different quant for them.

you can see /r/LocalLLama for some discussions. See this (random) post about Qwen3.6-27B on a 3090 at ~100 tok/s

https://www.reddit.com/r/LocalLLaMA/comments/1ujo46r/qwen_36...

Note that it is possible you could still do this stuff with a mac, as there are ways of hooking up a eGPU to macs and using it for inference. My understanding is they're all fairly hacky though, so it would likely be preferrable to just get a 3090 (or a non-nvidia option, e.g. an AMD r9700 pro has ~32GB of VRAM for much cheaper than a 5090.

https://www.reddit.com/r/LocalLLaMA/comments/1u50hnm/qwen_27...

that seems considerably slower though (~30 tok/s). I don't know if that's an outlier/misconfigured setup or what. In general there will be much better resources for local setups using 3090s, as they're quite popular. Note that 3090s (but not 4090s nor 5090s) have NVLink, so you can network the cards fairly effectively. For this reason 2x 3090 setups are fairly popular as well. I've heard that club 3090 makes that relatively straightforward

https://github.com/noonghunna/club-3090

but don't have experience myself.

reply
blensor 7 hours ago
The reason why I was curious is that I am running my stuff on a Strix Halo and I get the feeling that this class of devices ( gmktek, minisforum, lenovo, etc. ) seem to becoming a pretty good alternative
reply
c7b 4 hours ago
Unified memory feels like the future of consumer hardware, agreed! Do check out r/StrixHalo
reply
blensor 45 minutes ago
Agreed, it was a bit of a pain to get running on my Ubuntu machine because I had old amdgpu-dkms-firmware packages installed without realizing it. But now that it's running it's amazing how well it works
reply
c7b 19 minutes ago
Sounds like you got it sorted, but more generally this may be interesting: https://github.com/kyuz0/amd-strix-halo-toolboxes
reply
adastra22 4 hours ago
Strix Halo is better performance than a Mac Mini, but not as good as a Mac Studio. But the 128GB unified memory is awesome for larger models.
reply
deadbabe 4 hours ago
People want to make it seem like you need to always use the latest and greatest frontier models to be taken seriously as a developer.

You really don’t need them. After a certain point, bigger models give diminishing returns. If you can get 80% of the productivity gain with a free local model, use the local model. It will still be way faster than doing everything by hand, but you also don’t have to pay for tokens to a cloud provider and the tools won’t be ripped away from you on a whim.

This is the new attitude enlightened people should adopt. Reject the arms race.

reply
organsnyder 4 hours ago
The biggest appeal of the frontier models is for those trying to get autonomous agentic systems running that do real work with minimal human input. I went down a rabbit hole trying that with frontier models, and after a lot of initial promise it ended up actually slowing me down.
reply
pimeys 4 hours ago
We've all been through that no? In the beginning you can do a ton of stuff without reading code. But the LLMs miss all the good abstractions, they just push and push unmaintainable code until at some point you start having more bugs and then you NEED that LLM to fix the codebase you don't understand anymore.

There are guardrails you can and must add to protect your team if you take the vibe approach: a good type system, a good database with clearly written business model and a good data model to drive your business. Make it loud and clear when something breaks with your tooling.

But... I'd definitely not vibe everything after a certain point. Reading and fixing code is also a lot of fun.

reply
apitman 9 minutes ago
They're insanely good for prototypes though. To be able to actually see something working before deciding whether it's worth investing the time to build it for real is invaluable.
reply
dmbrThnU 2 hours ago
Made an account to semi-disagree with you, haha!

I have to advocate for the vibe-coded mess-colony.

There are applications where it either works or it doesn't, and it's simultaneously obvious whether it does. Think stock price prediction software. I've killed time in the evenings verbally chatting with agents about that specifically, and what emerged worked! It didn't work well, but it clearly outperformed randomness, and I was able to verify that myself easily.

I didn't look at a line of code, but I had an absolute blast.

reply
austinthetaco 48 minutes ago
You couldn't have possibly verified that. Stock prediction based on what? What's your sample size over what period of time? Using what indicators? how far is your lookback?
reply
ravenstine 4 hours ago
> People want to make it seem like you need to always use the latest and greatest frontier models to be taken seriously as a developer.

Except you kinda do. Try getting a job today without mentioning Claude experience. In another year it'll probably be something else. Saying you like to use Copilot today makes one seem elderly.

Not saying you need frontier models on a technical basis, but for career PR you probably do.

reply
cyanydeez 8 hours ago
I never got into any of the AI models because it was clear local first was going to be more valueable, if they were to replace coding tasks.

I tried out a few models and ended up going with either Qwen3-Coder-Next (no think, just do) and Qwen3.6-35B (thinking, w/llamacpp token budget). Created a customized prompt that works fairly well to around ~60k tokens and then is a toss up on whether it's poisoned itself or I've directly steered it into the wrong. When it's clear that's happened, if it's important to continue, ask it to write a doc then start fresh.

I don't kno whow any one cold have witnessed the last 2 decades of American VC funded tech startups and tell themselves, "you know, this will be a reliable technolgy with no hidden problems".

Even a sober technical evaluation is just two steps:

1. You're proposing to build a app on a non-deterministic model.

2. That model is hosted behind a non-deterministic system (model alignment, model guardrails, system context subterfuge, cost/token pricing)

---

So you want to build your app and you think you're going to kep up with both #1 and #2?

reply
c7b 5 hours ago
Cool! Anything you want to share? I haven't looked much into my system prompt yet, do you have any tips?
reply
ACCount37 8 hours ago
We live in a non-deterministic world. Anything "deterministic" in it is a castle built on quicksand.

LLMs are, as far as the nastiness of the Real World goes, really fucking benign. Future models outperform past models, both in open weight land and at the big frontier labs. Performance per $ only ever goes up. That's just nice.

reply
windexh8er 6 hours ago
> We live in a non-deterministic world. Anything "deterministic" in it is a castle built on quicksand.

Except the Enterprise, and a lot of what people want compute for, is built on deterministic systems or processes. I'm not saying the non-deterministic nature of LLMs isn't useful. However I've worked with a lot of organizations on SOAR projects, for example. When you can weave the deterministic and non-deterministic together you get a relatively efficient system. A workflow that will stay on the rails and will come to a conclusion as expected. And the "as expected" part is critical in these types of systems. The reality of, using SOAR as an example, is also that most enterprise would be much better served by fast SLMs. Parse an email and validate if it's SPAM / Phishing or read a chunk of firewall logs and look for outliers / indications for escalation - those things can get messy in a deterministic system because of potentially unstructured data.

I don't believe it's either / or. And I believe that LLMs just aren't efficient, fast or reliable in the sense that deterministic are. It seems, at least to me, a better together story.

reply
SkyBelow 5 hours ago
I think it might be built on something more than deterministic systems. Some property that is a subset of deterministic, so all your argument still apply, but merely being deterministic is not good enough.

LLMs are what made me start considering this. Imagine a company using an LLM that was fully deterministic. All RNG was either removed or seeded in such a way that the same input (so many the seed counts as part of the input) gave the exact same output. Fully deterministic.

But such an LLM, with a slight drift in input, could still produce very different outputs. This isn't being non-deterministic, but more than the change in outputs does not naturally follow from the input. I'm thinking like how 2 double pendulums can (but not always do) greatly diverge given a very small change in their input.

So in light of that I've begun to call this new property non-chaotic. So Enterprise depends on non-chaotic systems, which are a subset of deterministic systems, and then wrangling the chaotic elements they cannot remove as much as possible.

The follow question I now have is if all LLMs are inherently chaotic, or if it is possible to have a non-chaotic LLM.

reply
cyanydeez 7 hours ago
YES, but you seem to not understand that having two non-deterministic layers is incompatible. #1 is fine: it has random issue and you build around those random issues; those issues don't change unless you change them.

#2 is not fine; that non-determinism you do not control, have no insight into, etc.

I'm saying sure, give me #1 if it means I can build a harness around it and smooth over the edges. But I'm not taking #1 and #2. There's zero reasonable way to manae two non-deterministic systems.

reply
maykthewessen 8 hours ago
Qwen is the Alibaba distilled Anthropic Claude model

So piracy on an by piracy trained ai model..

reply
cogman10 8 hours ago
Piracy? Lol.

Alibaba didn't steal Opus weights, they used opus output to train their model.

If this is piracy, then so is reverse engineering efforts powering a bunch of Linux drivers.

reply
cyanydeez 6 hours ago
If that's piracy, I'm going to the library and arresting everyone there!

Also, yeah, they already stole their copyrighted works, so a thief from a thief is still...theives?

reply
tommica 5 hours ago
Well, Anthropic got paid for it, unlike the sources that they used...
reply
c7b 4 hours ago
I'm not sure what you're trying to say. Is that a good or a bad thing? Model distillation is presumably part of the reason why Qwen is so good, yes. As a consumer, that's a good thing I would say. It's a natural counterbalance to the monopolistic tendencies of other tech segments.

If you have ethical concerns, model distillation feels like an arbitrary line to draw. Why is the first type of piracy ok, the second not? You should restrict yourself to ethical open source models. Which is btw where I genuinely hope the future of local models is going to lie. Open weights is not enough, we need fully open source models to be sustainable. Even for simple things like updating the knowledge cutoff. How we are going to distribute the training effort will be an interesting problem where I don't see an obvious solution yet. Maybe the blockchain/federated learning people can suggest something. Or university consortia, or some public sector solutions. Or something really boring - I for one would absolutely be willing to pay for DRM-free weights of an open source model (even if I could pirate them for free).

reply
zaphirplane 3 hours ago
Are you saying 2 wrongs make a right
reply
c7b 3 hours ago
I'm saying, either you have a problem with the copyright issues related to AI training or you don't. If you do, neither Qwen nor Claude are acceptable, if not then both are. They have similar moral standing to me.

Btw, ethically sourced, open source LLMs exist! Check out eg Olmo by Allen AI: https://allenai.org/olmo

reply
hathym 9 hours ago
Same here, I’ve removed my credit card from Copilot and won’t be renewing
reply
anon373839 9 hours ago
What features/workflows have you added?
reply
c7b 5 hours ago
Web search, MTP (speeds up generation), uncensored models. Lots more things on my bucket list (eg various things related to image generation).

Not gonna lie, if you're coming from ChatGPT/Claude Code, you'll mostly be adding back features you've taken for granted, or solving problems you wouldn't have had. But sometimes you do get some extra utility, like uncensored models, which have become my go-to. Not because I'm doing anything saucy, but I hated how I'd become trained to pre-emptivly self-censor my prompts. The guardrails in open weights models are no less strong than in proprietary ones, subjectively even a bit stronger in Qwen. But luckily there's an entire sub-discipline of model ablation. Another advantage would be better control over image generation (although I can't attest to that, yet).

reply
Kon5ole 10 hours ago
I am a huge fan of Copilot CLI. It just feels so logical and low-friction to use compared to Claude Code. Having the ability to juggle various models at will is really nice too. ("Plan this using Opus 4.6, let GPT 5.4 verify the plan and give feedback before implementing with Sonnet 4.6").

Unfortunately the June pricing change for Copilot forced me personally as well as my entire department at work to switch to Claude Code. With copilot we were hitting a few dollars of extra spend over the included credits in April and May, then in June we started chewing through the monthly budget every 2-3 days.

Just a completely insane price hike from the customer's perspective, I don't know what MS were thinking there.

Even if that is the price they need to be sustainable they should have waited until the competition changed their prices first. I wouldn't be surprised if Copilot lost 50% or more of their customer base last month.

Eventually this could be where all the major players set their prices, so the thought occurs to me that nations should run some form of "public access AI", just like they did for TV. Use the free open models and use tax money to finance a few datacenters. Geo-lock the use and set strict throttles to manage load, but let school children and citizens use that AI freely otherwise.

If Copilot's pricing is the level for all AI in a few years, only the unicorn companies can afford to use them, and everybody else has no chance of competing with a company that can use AI.

reply
ffsm8 9 hours ago
> they should have waited until the competition changed their prices first.

They did...

They're literally just passing on the costs https://platform.claude.com/docs/en/about-claude/pricing

Anthropic just provides a subscription - which Enterprise usually doesn't want you to use because everything you're submitting through that will be trained on / becomes part of their model.

So If you use it without explicit permission from your employer you may be committing a contract violation which can have serious consequences - up to jail time - as they can sue you for that.

reply
ac29 6 hours ago
> Anthropic just provides a subscription - which Enterprise usually doesn't want you to use because everything you're submitting through that will be trained on / becomes part of their model.

My Pro account very clearly has a toggle for "Help improve our AI models: Allow the use of your chats and coding sessions to train and improve Anthropic AI models."

reply
ffsm8 4 hours ago
Which they may or may not adhere to

> Our use of Materials [...] Even if you opt out, we will use Materials for model training when: (1) you provide Feedback to us regarding any Materials, or (2) your Materials are flagged for safety review to improve our ability to detect harmful content, enforce our policies, or advance our safety research.

The last part is essentially a catch all, which let's them train on everything they want - and they probably are.

But the important bit here isn't actually wherever they're actually training on it - that doesn't matter from the legality aspect of it. You're liable anyway, as all contracts I've ever signed explicitly forbid me from sharing internal data of any kind (including code) with third parties.

You can be prosecuted just from using it - wherever anthropic decides to train it's model on it or not.

reply
phillipcarter 7 hours ago
It's a little more complicated than that, unfortunately.

If you use Claude via API in your own app, you're paying full price.

If you have an "API Plan" for Claude Code (i.e., free), you're paying full price.

If you have a Pro, Max, Max 5x, or Max 20x, your tokens are subsidized up until a rate limit. Then you pay full price for usage thereafter, until the end of the billing cycle.

The widespread belief in industry right now is that the per-seat pricing (which Copilot bailed from first) is going to go away in the near-term.

reply
ffsm8 6 hours ago
It's not more complicated? I referenced the subscription... I just added a small warning about it as some people may or may not be aware about the fact they're opening themselves up to serious consequences if they decide to use it on their employers code without explicit permission... Depending on their employers digression, eg largish entrenched employers which value their IP will be more willing to inflict damages on you. An upstart will likely not care unless the CEO sees an opportunity to profit personally.
reply
nsbk 10 hours ago
The price hike was insane. My $dayjob is moving away from Copilot and into Claude Code subscriptions. In parallel we are testing AWS bedrock and Deepinfra for open weight models in preparation for when CC inevitably stops being such a good deal and aligns with actual token cost. Fun times.
reply
K3UL 3 hours ago
The price hike was insane yes, but because they were eating the price difference. How exaclty does moving to a Claude sub is better, when it's actually more expensive ?

At my company we did the comparison and Copilot still wins: for 20$ you get a seat and 20$ of usage, whereas with Claude enterprise you get a seat and then usage is completely added. Moreover usage in Copilot is exactly the price of the providers AND it allows us to use various models from multiple providers.

The case that might be less expensive is if you negociate a volume discount with AWS for Bedrock usage, but that is also possible with GitHub and Microsoft.

reply
nsbk 8 minutes ago
Last month we consumed all the subscription credits by the 7th day, and had to top the extra credits up every 2-3 days. Last month was definitely not cheaper than a CC subscription. It actually triggered a cost savings effort across the Engineering org (cancelling subscriptions, stopping environments,...) in order to be able to afford AI usage which was not appropriately budgeted for ¯\_(ツ)_/¯

Edit: wording on the cost saving effort

reply
Incipient 10 hours ago
I had to do the same. I expect everything will go token pricing, and at that point a LOT of small/mid businesses will drastically change how they use code.

I've swapped to the 20x Claude plan for a month or two to knock out two ideas I need to get it MVP - expecting Claude to go token priced soon.

reply
ofjcihen 6 hours ago
Hell it’s past small and mid. I do work for a few of the fortune 100s and what I’m hearing is somewhere between “justify all of your usage or don’t use it” to “you now get 500 bucks a month, go over that and you’re getting it revoked”
reply
lanthissa 7 hours ago
i ran out of claude credits for the first time at work in months and had to fallback to copilot.

pleasantly surprised, claude's way ahead in tooling but the ability to designate what model your subagents use and having access to all models is a better feature than all of what claude offers combine atm.

The only limit on the amount of ai can consume in a month a work is dollars, so anything that helps with cost is the best model/harness for me.

It also did a better job at smart designating subagents itself where as claude often used higher cost models.

reply
monooso 3 hours ago
You can tell Claude Code which model to use for subagents.

For example: https://github.com/monooso/dotfiles/tree/main/.claude/agents

reply
pantulis 9 hours ago
> I am a huge fan of Copilot CLI. It just feels so logical and low-friction to use compared to Claude Code.

Honest question, can you ellaborate? If given the option, I use OpenCode but what do you find in Copilot CLI that makes you prefer it to Claude Code?

reply
Kon5ole 7 hours ago
It's a combination of small things really. The mentioned ability to easily call on various models in the same prompt, having agent definitions be able to orchestrate other agents just by mentioning it in the description, doing things like goal/loop automatically.

There is also IMO a distinct difference in "tone" in the dialogue. Claude seems to impersonate a human a bit more than I like.

Claude is of course very good as well and does a few things better than copilot too, but overall I'd prefer to use Copilot.

reply
dluxem 20 minutes ago
Same mindset here. I really like the ability of having OpenAI, Anthropic and other models available.

For my personal work, I still use Claude Code as its cheaper and the limits don't bother me to much, but it feels a bit like being handcuffed to Anthropic vs being at work and freely selecting models.

reply
gwerbin 8 hours ago
Not OP but Copilot CLI is really straightforward, almost minimal in some sense. It's a lot like OpenCode but stripped down.

I also use the Copilot ACP server inside Pycharm and that works decently well too, although it has some annoying bugs, but if you're a Jetbrains user you're used to annoying bugs.

reply
deckar01 7 hours ago
Letting them automatically pick the model is no longer sustainable, but there are some very efficient models that are capable of executing the plan created by a much nicer model. It’s kind of embarrassing to think that Microsoft’s auto model selection was choosing cutting edge reasoning models for tasks like resolving dependency conflicts back when their pricing was at a loss.
reply
nsoonhui 12 hours ago
I used GitHub Copilot for my VS 2026 development and switched between ChatGPT and Claude. That was before I discovered Claude Code and the Codex app. Copilot was OK for my purposes, and the USD 10 per month fee was enough for my usage.

However, last month they introduced a new pricing model ( I know the old pricing was not sustainable), and my USD 10 was exhausted within days. Because of that, I switched to Claude Code and Codex and have never looked back. Yes, tokens on Claude Code and Codex are subsidized heavily, but let's just enjoy when good things last.

I do feel there is a difference between using Claude via Copilot versus using Claude directly in Claude Code. I'm not sure what Microsoft is doing behind the scenes.

reply
taspeotis 11 hours ago
The harness is super important, what tools are available and the system prompts vary from harness to harness.

Anthropic seems to have a modest lead on their harness and models, so it’s a best-of-both-worlds scenario.

> I'm not sure what Microsoft is doing behind the scenes

It’s probably the exact same model, but the tools and the prompts around it are worse, so you get worse results.

reply
irthomasthomas 8 hours ago
Claude in Claude code has been shown to perform persistently worse in evals than claude + a minimal harness.
reply
kilburn 9 hours ago
The harness was absolutely not an issue in my case.

The new pricing model where I got banned from using Opus entirely and half a day of work (with weaker models) consumed the 10$ plan was.

I'm now using a Claude Max subscription and I can get close to the daily limits but I'm fairly happy with the overall plan consumption.

reply
Vinnl 11 hours ago
So if you use Claude via Copilot in Zed... You use Zed's harness, I think? What does Copilot do, at that point?
reply
acpdev 10 hours ago
I believe you are using https://github.com/github/copilot-cli or potentially this https://github.com/github/copilot-language-server-release#ag... via the Agent Client Protocol https://github.com/agentclientprotocol/agent-client-protocol which means you are indeed using Copilot's harness

ACP is just a standard that bridges harnesses easily into IDEs, Text Editors, or whatever consumes it (I wrote a TUI that consumes them)

The registry for all the agents (tool harnesses) is here https://github.com/agentclientprotocol/registry if you ever are curious to what Zed or IntelliJ are really hooking into

reply
Vinnl 7 hours ago
Ah OK, so the ACP connector ensures tool calls work with Zed, and communicates the available tools and their results to the harness, and then the harness mainly provides a system prompt and the API calls?
reply
pantulis 10 hours ago
It’s providing the inference of Anthropic models
reply
arikrahman 11 hours ago
I had a similar experience moving away from Copilot within Zed. Now using the reasonix harness for Deepseek that makes cache hits almost free. And that's with unsubsidized American providers like Digital Ocean or Cloudflare.
reply
toyg 10 hours ago
I tried using Zed but with local models it constantly breaks on tool calls. I wanted to like it but the smell of vibing is just too much.
reply
arikrahman 55 minutes ago
Likewise, and that's with state of the art technology. I wish a true self-contained binary for Reasonix Desktop was released, for now I have to settle for providing a Flake.nix environment. It isn't nearly as fickle as Zed, but I wish they leveraged that power of the Go toolset more.
reply
arcanemachiner 10 hours ago
You using models released this year? I hear this complaint a lot, and it's often due to using an old model which is not as good at tool calling as newer models.
reply
spockz 8 hours ago
What I noticed is that when the conversation starts the agent is pretty able to read from and write to files. As the conversation continues (and maybe sub agents are spawned) it forgets how to do this, complains, tries to resort to running shell or python code, sometimes it works. Sometimes it asks me to execute the code. If I refuse and point out it worked before than sometimes it remembers how to write, but mostly not and I need to start a new session.

When using Zed with the CoPilot integration I use Claude Opus and never had this issue.

reply
toyg 9 hours ago
Qwen 3.6 and 3.5...
reply
sydneypan 7 hours ago
Yep reasonix is an absolute case study of caching. They literally compiled byte level cache in their design and it is insane. i can one shot many workflows, apps in under 0.05 cents.
reply
k__ 11 hours ago
Nice.

I paid $6 yesterday for DeepSeek V4 Flash on OpenRouter. That's like $120 dollar for a month, and it's not even a good model.

reply
bel8 11 hours ago
For DS4 it's much cheaper and reputable to use OpenCode Go $10/mo subscription, or directly with DeepSeek API.
reply
arikrahman 53 minutes ago
Sometimes $10 is more than I'll do with API tokens. I prefer the top up scheme for peace of mind, but the deal does sound generous. The only concern is sustainability, similar to subsidized copilot pricing having to change.
reply
k__ 10 hours ago
Thanks!

I'll try that.

reply
epolanski 11 hours ago
That's quite an achievement, I managed to spend only 2$ on 16 different tasks of v4 pro.
reply
k__ 10 hours ago
Yeah, v4 flash is dirt cheap, but it's running in circles quite often.

Might very well be that a better model is cheaper if it gets things right the first try.

Maybe I should route to a better model when v4flash hasn't solved after a specific number of tokens.

reply
russelg 7 hours ago
I'm having great success with DS4 Pro as my main model, while using DS4 Flash for subagents.
reply
VortexLain 8 hours ago
What is the average monthly token price for daily reasonix use?
reply
arikrahman 52 minutes ago
For me it's about $5 of work, where I've done equivalent work for about $200.
reply
happyweasel 10 hours ago
Same ,I switched to cursor. I told it how to invoke msbuild and it can edit away without needing a native Visual studio plugin.. no problems at all. Target language c++
reply
seanieb 10 hours ago
GitHub Copilot costs have ballooned in recent week, what once took $100 requires $300. I like using Claude with VS Code through Copilot and I feel it’s given me much better code, that I can control the quality. It’s much more transparent than Claude Code. It’s open source but and the IDE interface gives so many more features to have you context and control over whats generated. The increase in cost isn’t purely due to their price increases but also the Opus models agents use more tokens. So I’ve moved to Claude Code and I’m happily still using Opus 4.6. Fable and 4.7 seem to do much larger units of work, go off on tangents and make assumptions that frequently results in slop.
reply
altmanaltman 11 hours ago
My copilot quota finished in maybe 2-3 prompts with claude 4.8 opus. i was expecting it to suck but not this bad. it was good while it lasted though
reply
andhuman 13 hours ago
Finally an alternative to the big dogs that a company can use. People have been asking for a way to run the Chinese models from a trusted provider. Here GitHub delivered!

The performance, if we trust the benchmarks, put it at Sonnet 4.6.

Let’s see if it’s worth it with GitHubs pricing.

reply
MangoCoffee 13 hours ago
Microsoft needs to offer cheaper option since they change to token based billing. GPT-5.4 used to be x1 for yearly subscriber but now it cost 6x. i run out the premium request for just couple prompts. Github copilot for $10 used to be the best value since you get all the US AI labs model for cheap.
reply
sneezychl 12 hours ago
CoPilot was an insanely good value while it lasted. Only moneysoft could subsidize a service that much.
reply
credit_guy 7 hours ago
> The performance, if we trust the benchmarks, put it at Sonnet 4.6.

I don't trust these benchmarks. I used a number of times Kimi K2.7 and I was disappointed. It would run in circles for things that Claude would do in one shot. However, my usage was via Ollama cloud, and I have no idea if they serve the actual model or a quantized version, and it was the quantization that degraded the performance.

The great news, in my opinion, is the precedent. If Microsoft is now serving Kimi K2.7, then very soon they might start serving GLM 5.2, and that is indeed a very competitive model.

reply
rpdillon 6 hours ago
Check your harness. I use Kimi K2.6 for a lot of stuff with OpenCode and omp and it's extremely effective. I'm gonna try 2.7, but it should be capable model based on what I've seen with previous models.
reply
w4yai 10 hours ago
> People have been asking for a way to run the Chinese models from a trusted provider

I'm going to be called a chiller again, but at this point I don't care as it is relevant. Synthetic runs their own models for a reasonable price, GLM5.2 & Kimi K2.7-Code included.

Referral link :

https://synthetic.new/?referral=kwjqga9QYoUgpZV

reply
hgoel 9 hours ago
Being on Copilot means your employer lets you use it at work. It's essentially Copilot's primary value add in the new billing model.
reply
lostmsu 8 hours ago
Cloudflare offers Kimi and GLM
reply
hgoel 2 hours ago
That is probably similar for companies that rely on Cloudflare in as widespread a manner as GitHub can be.
reply
newaccountman2 5 hours ago
OpenCode is an ez way as well
reply
e2e4 24 minutes ago
Much better value by using K2.7 Code with GitHub cli via opencode subscription - at $10/month gives you $60 worth of usage (for now) - you get $5 usage credits per day (with some weekly / monthly limits)

ps opencode cli is quite nice too

reply
e2e4 23 minutes ago
If you want to get an extra $5 off for the first month (I'll get $5 too) https://opencode.ai/go?ref=XDHX30HEFB
reply
kingstnap 11 hours ago
Input: $0.95

Cache hit (most important): $0.19

Output: $4.00

This is the same as how much Moonshot charges for it, and it puts it at roughly the price of GPT 5.4 mini, not a bad option.

For some context here is a stupid prompt that wastes tokens: "Play a game of tic tac toe against yourself on a 5x5 board, you need 5 in a row to win."

It costs $0.006 on Kimi K2.7, and you get to see the whole raw reasoning trace.

GPT-5.4 mini costs $0.016 and its summarized.

And in case you are wondering both play incredibly stupidly.

Kimi:

      A   B   C   D   E
  1   .   .   .   .   .
  2   .   .   .   .   .
  3   X   X   X   X   X
  4   .   O   O   O   O
  5   .   .   .   .   .

GPT 5.4 mini:

  1: X X X X X
  2: O O . . .
  3: . . O . .
  4: . . . O .
  5: . . . . O
reply
kingstnap 11 hours ago
Btw if anyone is wondering, GPT 5.5 does the same garbage as 5.4 mini for 4 times the cost.

Fable manages to make a reasonable game, at a cost of 40 cents.

  X X O O O
  O O X X X
  X X X O O
  X O O X O
  X O X X O
reply
ubanholzer 11 hours ago
Nice idea. I just asked Haiku to do the same in Claude Chat on iOS: it created a interactive react game, implemented the rules and let it play. Clever move for 1$ input and 5$ output, Anthropic!
reply
a_c 9 hours ago
While LLM models are bad at games, they are perfectly capable of writing a RL agent to train on the game itself.
reply
asimovDev 11 hours ago
when i will be extremely bored, I think I will make two models play chess against each other. I bet there's a chess benchmark / llm tournament already somewhere
reply
rusticpenn 11 hours ago
Models are bad at chess. I am using a middleman to help models play chess and experimenting. https://abhay-ai.github.io/R_Daneel_AI/
reply
fuglede_ 7 hours ago
In fact, you don't even need an LLM tournament when you can have tom7's Elo World tournament: https://www.youtube.com/watch?v=DpXy041BIlA
reply
cbg0 8 hours ago
[flagged]
reply
boronine 7 hours ago
For any small team wanting to try Copilot, heed my warning that you will waste hours navigating their billing settings using various out-of-date documentation. Long story short, I finally got an email from them saying that "Copilot Business is available for teams purchasing 10 or more licenses". This is undocumented but other people are reporting the same: https://github.com/orgs/community/discussions/199346

We're sticking with Cursor for now, using Kimi as our daily driver (branded as "Composer").

reply
mmusc 13 hours ago
Yes significantly cheaper to run compared to the other models, tried it for an hour yesterday and the results look promising.

Saw in a discussion on Reddit that the team is evaluating glm5.2 so hopefully more to come!

reply
scriptsmith 13 hours ago
Is GitHub Copilot the best positioned platform for enterprise? They support Claude, GPT, Gemini, and now even open weight models. Larger orgs are paying at API rates anyway so it costs just as much as anywhere else. They have a pretty good agent CLI and SDK, and now a desktop app. They have hosted agents, and you can run their 'Agentic Workflows' in CI.

Has their reputation tanked so much that the alternatives get all the buzz? Or is it that non-enterprise users are priced out by the usage costs, so no free marketing?

reply
gunalx 13 hours ago
The rugpull with the pricing change without further notice was not taken kindly by enterprice.
reply
lbreakjai 10 hours ago
We just cancelled everyone's plans and rolled liteLLM out internally. We kept it for the insanely cheap tokens, but now that they've switched to the new pricing, they're just like openrouter, just with far fewer models.
reply
attentive 12 hours ago
They were, until they decided to commit suicide for the service.
reply
theplumber 9 hours ago
For some reasons compilot seems dummer than vscode Claude or vscode codex. I can’t tell what’s the exact reason but it didn’t feel right
reply
a_c 9 hours ago
Must be the system prompts. Ask copilot to dump its system prompt, and compare the system prompt with claude. It is not accurate but handy. I bet they are quite different
reply
kasey_junk 9 hours ago
Their harness is terrible compared to any of the other cli based harnesses I test against. Like shockingly bad.

This comes up all the time at work because the vendor management people don’t understand the llm ecosystem and think Claude through copilot is the same as Claude through Claude code.

A simple side by side comparison will show dramatic under performance 3 or 4 times out of five when I’m asked to explain the difference.

reply
SeriousM 13 hours ago
Who really cares? The model multipliers and the artificial currency were the final nail in the Github Copilot coffin.
reply
sognetic 12 hours ago
Enterprises still have big contracts with github, those companies are imposing tight spending limits now and if the open weight models enable those limits to last a bit longer that's probably quite popular.
reply
impact_sy 13 hours ago
When will DeepSeek be available?
reply
pkaye 13 hours ago
The V4 models are already in the Azure AI foundry so maybe a good chance of it coming.
reply
skybrian 12 hours ago
Looks like it’s the same price on Fireworks AI?

https://fireworks.ai/blog/kimi-k2p7-code

I don’t know much about them but they did a deal with Microsoft in March:

https://azure.microsoft.com/en-us/blog/introducing-fireworks...

reply
TiredOfLife 10 hours ago
reply
calumcl 9 hours ago
> These models are hosted on US-based Azure AI Foundry infrastructure managed by GitHub and Microsoft. Customer prompts and responses are not sent to the original model developers.

From your link: https://docs.github.com/en/copilot/reference/ai-models/model....

reply
matrik 7 hours ago
And why should one prefer GitHub Copilot over OpenCode? Worse harness, more expensive prices, unreliable product strategy, limited model support, the list goes on.
reply
esafak 4 hours ago
Legacy corps on the Microsoft steamboat.
reply
KolmogorovComp 9 hours ago
What's the credit cost compared to Gemini, Claude and GPT? As others have said, the last month price update killed copilot for good.
reply
johnathan101 11 hours ago
Competition in coding models has gotten intense. A year ago it felt like choosing between two options. Now the bigger question is which model to route each task to.
reply
boundless88 14 hours ago
When will GitHub Copilot support integrating custom models?
reply
mvATM99 12 hours ago
It does, but it's very poorly documented and quite unstable (on purpose i think). What the other commenter said about the VSCode BYOK seems to be the more reliable way.

I tried adding a Foundry LLM as Github Copilot custom model and failed miserably. But with VSCode BYOK (and Github Copilot as the interfact) i did get it working, and i can now use Deepseek V4 Flash with Copilot.

reply
Klaster_1 13 hours ago
AFAIK you can already use custom models in VSCode Copilot, but probably not for cloud workloads yet.
reply
summarity 9 hours ago
It has supported custom, local, any BYOM for quite a while.

I work at GitHub but even then I often use OpenRouter models in the CLI and Copilot App

reply
ignoramous 13 hours ago
Copilot Chat supports BYOK since Oct 2025 for the VSCode plugin: https://code.visualstudio.com/blogs/2026/06/18/byok-vscode
reply
websap 13 hours ago
Where is the inference running?
reply
pkaye 13 hours ago
Azure. It was already available on the Azure AI Foundry before.

https://docs.github.com/en/copilot/reference/ai-models/model...

reply
Tepix 12 hours ago
On servers that are subject to the CLOUD Act. Expect no GDPR compliance.
reply
cassianoleal 10 hours ago
Most European infrastructure runs on the big clouds, who are all subject to the same act. No one cares, unfortunately.
reply
Scroll_Swe 5 hours ago
There is even mainstream press articles about it here in Sweden. "dependance on microsoft ooh so bad" etc.

I find it laughable.

Unless you have a time machine to 2005 (EC2 came out in 2006 that should have been the signal) there is no way to compete now. That train has left the platform.

Second, Nokia and Ericsson dominate mobile infra in the west, but that is good I guess as they are EU? What does USA think about that?

Third, let us say you get rid of MS. Now you have no MS but all network infra for broadband is Cisco, Huawei, Juniper etc. Good luck ripping that out. And for what?

Same with AI. Mistral was amazing at first, Le Chat. Almost as good, generous free limits, good docs. Now? Just plain bad. Deepseek is better (I dislike china so I avoid it). EU should have gone in 500% the moment Mistral showed promise.

But lately we let USA and China take the lead on everything and EU can write a strongly worded letter after about how bad it is.

People will "care" when EU starts making good stuff again.

And lastly lol, people do know everything ends in Taiwan in the end right?

reply
TiredOfLife 10 hours ago
https://docs.github.com/en/copilot/reference/ai-models/model...

They are run by Moonshot itself, so probably china

reply
rombert 8 hours ago
That page states

> These models are hosted on US-based Azure AI Foundry infrastructure managed by GitHub and Microsoft. Customer prompts and responses are not sent to the original model developers.

So not in China.

reply
theanonymousone 9 hours ago
A very sharp slap in the face of those of us who kept our annual plans and didn't ask for s refund: It seems it will not be available to annual subscriptions.
reply
mellosouls 8 hours ago
where does it say that? its not available to me (also annual) at the moment via cloud but it said it is rolling out gradually, so I'm not too concerned. Tho I'm not overly excited either given Copilot pricing now; I reckon this should be at most 1x.
reply
theanonymousone 3 hours ago
They say it here: https://docs.github.com/en/copilot/reference/copilot-billing...

But then again they released MAI despite this, so I don't know.

reply
tapirl 11 hours ago
Unlike Google, the AI wave appears to deliver positive revenue impacts for Microsoft.

The company does need to integrate the new AI-human-machine interface into its application development SDKs.

reply
grumbelbart2 11 hours ago
Is there a zero-retention option?
reply
jingpostmedia 6 hours ago
[dead]
reply
tarun-pmos 8 hours ago
[flagged]
reply
CurbStomper 8 hours ago
[dead]
reply