The Deployment Memo · Issue #3 · Enterprise AI / Infrastructure
Topics: Open Source, Sovereign AI, Infrastructure, Model Selection, Compliance

Cohere Command A+: The Open-Source Sovereign AI Decision — What Every Enterprise AI Team Must Decide This Week

On Sunday, Cohere released Command A+: 218 billion parameters, an Apache 2.0 license, and a 4-bit quantized build that runs on two H100s. It consolidates text, reasoning, vision, and translation into a single model available on Hugging Face with no usage restrictions. This is the clearest sovereign AI deployment decision the enterprise market has faced in 2026. This memo dissects the options, the tradeoffs, and the questions your team needs to answer before the window closes.

AI Insight Lab · The Deployment Memo · May 24, 2026 · 8 min read

Key Numbers

218B parameters: sparse MoE, competitive with frontier closed models

2 H100 GPUs required to self-host with W4A4 (4-bit) quantization

Apache 2.0: zero usage restrictions, commercial deployment with no vendor terms

$0 per-token licensing cost when self-hosted on your own infrastructure

Background

On Sunday, May 24, 2026, Cohere released Command A+ — a 218-billion-parameter sparse Mixture-of-Experts language model — under a full Apache 2.0 open-source license. This is the first time Cohere has released model weights with unrestricted commercial use rights. The model is available on Hugging Face in three precision formats: BF16 (full precision), FP8, and W4A4 (4-bit weight, 4-bit activation quantization).

The W4A4 format is the detail that matters for most enterprise infrastructure teams: in W4A4, Command A+ runs on a single NVIDIA Blackwell B200 GPU or on two H100s, hardware already deployed in thousands of enterprise data centers and cloud environments. Cohere reports that the W4A4 model matches the uncompressed model on benchmarks across its primary task categories.
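The memory arithmetic explains why the quantized build is the one that fits. The sketch below estimates weights-only memory for 218 billion parameters in each published format. Assumptions not taken from Cohere: decimal gigabytes, 80 GB per H100, 180 GB per B200, and no allowance for KV cache, activations, or runtime overhead, all of which need additional headroom in a real deployment.

```python
# Weights-only memory estimate for a 218B-parameter model.
# Assumptions (illustrative, not Cohere's figures): decimal GB, weights only
# (no KV cache, activations, or framework overhead), 80 GB per H100,
# 180 GB per B200.
TOTAL_PARAMS = 218e9

BYTES_PER_PARAM = {
    "BF16": 2.0,  # 16-bit weights
    "FP8": 1.0,   # 8-bit weights
    "W4A4": 0.5,  # 4-bit weights
}

GPU_MEMORY_GB = {"H100": 80, "B200": 180}


def weight_memory_gb(fmt: str, params: float = TOTAL_PARAMS) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


def fits(fmt: str, gpu: str, count: int) -> bool:
    """True if the weights alone fit in `count` GPUs of type `gpu`."""
    return weight_memory_gb(fmt) <= GPU_MEMORY_GB[gpu] * count


for fmt in BYTES_PER_PARAM:
    print(f"{fmt:5s} {weight_memory_gb(fmt):6.0f} GB  "
          f"2xH100: {fits(fmt, 'H100', 2)}  1xB200: {fits(fmt, 'B200', 1)}")
```

BF16 weights need about 436 GB and FP8 about 218 GB, so neither fits on two 80 GB H100s. W4A4 needs about 109 GB, which leaves roughly 50 GB across two H100s (or about 70 GB on one B200) for KV cache and runtime overhead.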

Command A+ is not a new model family. It consolidates four separate Command A-series models (Command A for text, Command A Reasoning, Command A Vision, and Command A Translate) into a single unified model. The underlying architecture is a decoder-only sparse MoE Transformer: 218 billion total parameters, of which only 25 billion are active during any given generation step. Sparse activation keeps per-token compute far below that of a dense 218B model, but all 218 billion parameters must still be held in GPU memory. The combination of sparse activation and W4A4 quantization is what makes two-H100 inference viable rather than theoretical.
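To make "active parameters" concrete, here is a toy sketch of top-k expert routing, one common MoE variant in which a router scores every expert, only the k best are run, and their outputs are mixed with softmax weights. Command A+'s actual expert count, k, and router design are not stated in the release; the eight scalar "experts" and the gate scores below are hypothetical.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and renormalize their scores with a softmax."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k):
    """Weighted sum of the k routed experts' outputs; unrouted experts are never run."""
    out = [0.0] * len(x)
    calls = 0
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)
        calls += 1
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, calls


# Hypothetical layer: 8 toy experts, each of which just scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in range(8)]
# In a real model these scores come from a learned router per token.
gate_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

route = top_k_route(gate_logits, k=2)
out, calls = moe_forward([1.0, 2.0], experts, gate_logits, k=2)
print(route, calls)
```

Only two of the eight experts execute, yet all eight must be loaded, which is why MoE sparsity cuts compute but not memory. Using the common approximation of about 2 FLOPs per active parameter per token, 25 billion active parameters is roughly 50 GFLOPs per generated token, against roughly 436 GFLOPs for a dense 218B model.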

The release received essentially no mainstream press coverage on its publication day. Cohere published the weights alongside benchmark comparisons, a vLLM and Transformers implementation guide, and continued availability on its own managed inference platform for enterprises that don’t want to operate their own stack. The simultaneous managed and open-weight release is deliberate: Cohere is competing for both the self-hosted and the API market at the same time.
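Cohere's own vLLM guide is the reference for the real launch command. As a hedged illustration of what a two-GPU launch involves, the sketch below assembles a `vllm serve` invocation that shards the model across two GPUs and pins an exact Hugging Face revision. The repository ID and commit are placeholders, not real values, and the context-length cap is an example number. `--tensor-parallel-size`, `--revision`, `--max-model-len`, and `--gpu-memory-utilization` are standard vLLM engine arguments, but check Cohere's guide for any quantization-specific options the W4A4 checkpoint needs.

```python
import shlex
from typing import List, Optional


def build_vllm_serve_cmd(model_id: str, revision: str, tensor_parallel: int = 2,
                         max_model_len: Optional[int] = None,
                         gpu_mem_util: float = 0.90) -> List[str]:
    """Assemble a `vllm serve` argument list that shards the model across
    `tensor_parallel` GPUs and pins an exact model revision."""
    if not revision or revision == "main":
        # Tracking a moving branch makes upgrades implicit; pin a commit instead.
        raise ValueError("pin an explicit commit, not the default branch")
    cmd = [
        "vllm", "serve", model_id,
        "--revision", revision,
        "--tensor-parallel-size", str(tensor_parallel),
        "--gpu-memory-utilization", f"{gpu_mem_util:.2f}",
    ]
    if max_model_len is not None:
        # Capping context length bounds the KV cache that must fit beside the weights.
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# Placeholders (hypothetical): substitute the repo ID and commit from Hugging Face.
cmd = build_vllm_serve_cmd("<command-a-plus-w4a4-repo-id>", "<pinned-commit-sha>",
                           max_model_len=32768)
print(shlex.join(cmd))
```

Pinning the revision in the launch command is one simple way to make model upgrades an explicit, reviewable change rather than a side effect of restarting the server.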

For most enterprise AI teams, the Apache 2.0 release creates a deployment decision that did not exist 48 hours ago. Paying for managed inference on a model of this capability class — from Cohere, from OpenAI, from Anthropic, or from Google — is now a choice, not a requirement. The question is whether it is the right choice for your organization.

Decision Required

The decision your enterprise AI team must make in response to the Command A+ release: Given that a frontier-class, commercially unrestricted model is now available for self-hosted deployment on infrastructure you likely already own (two H100s), at what point does continuing to pay per-token API fees to a managed provider represent a strategic and financial error — and what would need to be true for self-hosting to be the correct answer for your organization?

This is not primarily a technology decision. The model runs. The implementation guides exist. The question is whether your team has the operational infrastructure, the compliance posture, and the organizational will to run AI inference internally — and whether the economics of your current usage pattern justify the transition cost.

The window for this decision is narrower than it appears. Organizations that evaluate and pilot self-hosted deployment now will build the infrastructure knowledge, the security review process, and the operational runbooks before the capability advantage of open weights narrows. Organizations that wait will face the same transition cost with less competitive advantage on the other side.

Options

Option A: Continue managed API (do not engage with Command A+ self-hosting)

Maintain your current API relationship with your managed inference provider. Treat Command A+ as a competitive development to monitor but not act on. This is the default posture for organizations where AI inference is not a core competency, where usage volumes do not justify infrastructure investment, or where regulatory constraints require a specific vendor relationship. The risk is not immediate — managed inference remains fully functional — but organizations at scale with high-volume inference workloads are now paying an increasingly visible premium for the operational simplicity that managed infrastructure provides. That premium will become a board-level question once finance teams understand that a two-H100 deployment exists.

Option B: Full self-hosted migration (move primary inference to Command A+)

Deploy Command A+ on-premises or in a private cloud environment and migrate primary inference workloads away from managed APIs. This is the correct answer for enterprises with three specific profiles: organizations in heavily regulated industries where data residency is not optional (financial services with client data, healthcare with PHI, government with classified or sensitive workflows); organizations at inference volumes where the per-token economics have already made managed API costs a material line item; and organizations that have explicitly decided that AI infrastructure is a core competency they intend to own. The implementation path is real — Cohere has published vLLM guides — but the operational requirement is also real: GPU fleet management, model version control, security patching, and inference latency tuning are all now your problem.

Option C (Recommended): Parallel pilot (self-host for one workload category while maintaining the API for others)

Identify one internal AI workload — ideally one that is high-volume, internally scoped (no customer-facing latency requirements), and low-risk to run on experimental infrastructure — and run it against a self-hosted Command A+ deployment while maintaining your managed API relationship for production workloads. This approach lets you build internal operational knowledge, validate the economics, and complete the security review process without betting production stability on the result. The parallel cost during the pilot period is real (you are running both), but it is bounded and time-limited. This is the correct default posture for organizations that are not yet certain which category above they fall into.

Recommendation

Start the pilot. Specifically: identify your highest-volume, internally scoped inference workload that currently runs against a managed API. If that workload processes more than 50 million tokens per month, self-hosting on two H100s at standard cloud GPU pricing may already be cost-competitive, but the break-even point depends heavily on your blended per-token API rate and on whether the GPUs are already owned, so confirm it with your own numbers. In parallel, commission a legal review of the Apache 2.0 terms and a security review of your data residency requirements.
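The break-even check is simple arithmetic. A minimal sketch, assuming always-on rental of two GPUs at a flat hourly rate plus an optional fixed monthly operations cost; the $3.00 per GPU-hour, the $10 blended API price per million tokens, and the 1,000 tokens-per-second throughput are hypothetical inputs, not quotes or measurements.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def monthly_self_host_cost(gpu_hourly_usd, gpu_count=2, ops_overhead_usd=0.0,
                           hours=HOURS_PER_MONTH):
    """Always-on GPU rental for the month plus any fixed operations cost."""
    return gpu_hourly_usd * gpu_count * hours + ops_overhead_usd


def break_even_tokens(api_usd_per_million, self_host_monthly_usd):
    """Monthly token volume at which managed-API spend equals self-hosting cost."""
    if api_usd_per_million <= 0:
        raise ValueError("API price must be positive")
    return self_host_monthly_usd / api_usd_per_million * 1_000_000


def monthly_capacity_tokens(tokens_per_second, utilization=1.0, hours=HOURS_PER_MONTH):
    """Upper bound on tokens the deployment can serve in a month at a given throughput."""
    return tokens_per_second * 3600 * hours * utilization


# Illustrative inputs only: replace with your cloud GPU quote, blended API rate,
# and throughput measured during the pilot.
cost = monthly_self_host_cost(gpu_hourly_usd=3.00)
needed = break_even_tokens(api_usd_per_million=10.00, self_host_monthly_usd=cost)
capacity = monthly_capacity_tokens(tokens_per_second=1000, utilization=0.5)
print(f"self-host ${cost:,.0f}/mo, break-even {needed:,.0f} tokens/mo, "
      f"capacity {capacity:,.0f} tokens/mo")
```

With these illustrative inputs the break-even lands near 438 million tokens per month. The figure scales directly with the GPU rate and inversely with the blended API rate, and it approaches zero when the GPUs are already owned and would otherwise sit idle, so run it with your own figures before relying on any fixed threshold. The capacity check matters too: a break-even volume above what the deployment can actually serve means the savings are out of reach on two GPUs.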

Do not migrate production workloads to self-hosted inference before completing three things: (1) a GPU infrastructure runbook that your on-call team can execute without the person who built it; (2) a model version control policy that specifies when and how you adopt future Cohere updates versus staying on a pinned version; (3) a latency baseline comparison between your self-hosted deployment and your current managed API for the specific workloads you intend to migrate. None of these are blocking for the pilot — they are blocking for production migration.
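For item (3), the comparison is most useful when the same requests are replayed against both endpoints and summarized the same way. A minimal sketch, assuming you already have per-request latencies in milliseconds for one workload; the 1.2 tolerance on the p95 ratio is a hypothetical placeholder, not a recommended value.

```python
import statistics


def latency_summary(samples_ms):
    """p50 and p95 from a list of per-request latencies in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points; index 94 is p95
    return {"p50": statistics.median(samples_ms), "p95": cuts[94]}


def compare_to_baseline(self_hosted_ms, managed_ms, max_p95_ratio=1.2):
    """Compare self-hosted latency with the managed-API baseline for one workload."""
    sh = latency_summary(self_hosted_ms)
    api = latency_summary(managed_ms)
    ratio = sh["p95"] / api["p95"]
    return {
        "self_hosted": sh,
        "managed": api,
        "p95_ratio": ratio,
        "within_tolerance": ratio <= max_p95_ratio,
    }


# Hypothetical samples: in practice, replay the same request log against both endpoints.
managed = [float(ms) for ms in range(100, 200)]
self_hosted = [ms * 1.1 for ms in managed]
report = compare_to_baseline(self_hosted, managed)
print(round(report["p95_ratio"], 3), report["within_tolerance"])
```

Tail latency (p95) is compared rather than the mean because on-call pain and user-facing timeouts come from the slow requests, and self-hosted stacks often diverge from managed APIs there first.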

If your organization is in a regulated industry with mandatory data residency requirements, the evaluation should move faster and the recommendation shifts: Command A+’s Apache 2.0 license combined with its two-H100 inference profile is likely the most viable path to a fully sovereign AI deployment your organization has had access to. Engage your infrastructure and compliance teams this week, not next quarter.


Risks

Operational complexity underestimation

The most common failure mode for enterprise self-hosted AI deployments is not the initial setup — it is the second month. Deploying Command A+ on two H100s is documented and achievable. Maintaining that deployment across GPU driver updates, model quantization edge cases, inference latency regression as usage patterns change, and security patching of the underlying vLLM stack requires sustained engineering attention that organizations routinely underestimate. If your current AI infrastructure team is already at capacity managing other tooling, self-hosting is likely to produce worse outcomes than managed inference until that capacity constraint is resolved.

Apache 2.0 compliance gaps in regulated environments

Apache 2.0 is a permissive license with essentially no restrictions on commercial use. However, it does not address your regulatory obligations. Deploying Command A+ in a healthcare environment still requires your own HIPAA technical safeguards. Deploying in financial services still requires your existing data governance controls. The license gives you the right to use the model; it does not give you a compliance shortcut. Organizations that treat “open source” as synonymous with “compliance-resolved” will encounter this gap during their next audit.

Capability version lock

When you self-host a specific model version, you own the decision about when to upgrade. Managed inference providers update models continuously; with self-hosted deployment, each capability improvement requires an infrastructure event. Organizations that need to stay current with model capability improvements (competitive advantage use cases, coding assistants, reasoning-intensive workflows) will find that the upgrade cadence of self-hosted deployment adds friction that managed APIs do not. This is not an argument against self-hosting — it is an argument for explicitly deciding which workloads benefit from staying on a stable version versus which ones require continuous model improvement.

Dual-run cost during transition

If you pilot self-hosted Command A+ while maintaining your managed API contracts, you will be paying for both during the evaluation period. For organizations at high inference volume, this dual-run cost can be material. Define a pilot timeline (90 days is a common length) and specific go/no-go criteria before you start, so the pilot does not become an indefinite parallel spend with no exit condition.
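One way to make the exit condition concrete is to write the criteria down as code before the pilot starts. A hedged sketch: the three metrics (self-hosted cost per million tokens, p95 latency ratio against the managed API, and a quality ratio from your own evaluation set) and every threshold are hypothetical choices for illustration, and the 90-day default simply mirrors the timeline suggested above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PilotCriteria:
    start: date
    max_cost_per_million_usd: float       # ceiling for self-hosted cost, set from your API rate
    max_days: int = 90                    # hard stop for the dual-run period
    max_p95_latency_ratio: float = 1.2    # self-hosted p95 / managed-API p95
    min_quality_ratio: float = 0.95       # self-hosted eval score / managed-API eval score


@dataclass
class PilotResults:
    cost_per_million_usd: float
    p95_latency_ratio: float
    quality_ratio: float


def decide(criteria: PilotCriteria, results: PilotResults, today: date) -> tuple:
    """Return ("go" | "continue" | "no-go", list of failed criteria)."""
    failures = []
    if results.cost_per_million_usd > criteria.max_cost_per_million_usd:
        failures.append("cost")
    if results.p95_latency_ratio > criteria.max_p95_latency_ratio:
        failures.append("latency")
    if results.quality_ratio < criteria.min_quality_ratio:
        failures.append("quality")
    if not failures:
        return "go", failures
    deadline = criteria.start + timedelta(days=criteria.max_days)
    # Past the deadline with any criterion unmet, the pilot ends instead of drifting on.
    return ("no-go" if today >= deadline else "continue"), failures


# Hypothetical pilot configuration and mid-pilot measurements.
criteria = PilotCriteria(start=date(2026, 6, 1), max_cost_per_million_usd=8.0)
print(decide(criteria, PilotResults(6.5, 1.1, 0.97), today=date(2026, 7, 15)))
```

Returning the list of failed criteria alongside the decision keeps the go/no-go review focused on what specifically missed, rather than on a single pass/fail flag.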

Questions Your Team Should Be Answering

These are the questions that distinguish organizations that get this right from those that do not. If your team cannot answer them, that is your first deliverable.

1. What is your current monthly token volume across all managed inference providers, and at what volume does self-hosted H100 inference become cost-neutral against your current per-token pricing?
2. Which of your current AI workloads process internal data only (no customer PII, no third-party data subject to contractual restrictions on external processing), and which workloads are therefore lowest-risk to migrate to self-hosted inference?
3. Does your organization have a documented GPU infrastructure runbook, and who owns it? If the answer is "one person," what is the bus-factor risk of a self-hosted AI deployment?
4. Have you completed a legal review of Apache 2.0 compliance against your existing AI procurement and data governance policies, or is your current managed inference contract the path of least legal resistance?
5. What is your organization's stance on model version pinning versus continuous capability updates, and does your answer differ by workload type?
6. If Command A+ in W4A4 quantization matches benchmark performance of the uncompressed model, at what point does your organization re-evaluate its assumption that frontier AI capability requires a managed API relationship?
