Vertex AI in Production: What Actually Ships in 90 Days (and What Doesn't)
Most “Vertex AI implementation” content online is a tutorial: set up your project, train an AutoML model, deploy an endpoint. That’s not implementation. That’s a demo. A demo ends when the Colab notebook closes. An implementation is still running on a Tuesday morning when someone on-call gets paged about drift.
According to McKinsey’s 2025 State of AI report, 67% of organizations are still in pilot or experimentation — only 5.5% report meaningful financial returns. Gartner revised its estimate in early 2026: at least 50% of GenAI proofs-of-concept were abandoned before reaching production, and it predicts 40% of agentic projects will be canceled by 2027. These aren’t numbers about technology failure. They’re numbers about implementation failure.
This article is about what actually happens in mid-market Vertex AI engagements — the pattern your team will land in, what ships in 90 days, and what doesn’t. There are three patterns. Most teams are in Pattern A. Some reach Pattern B. Almost none reach Pattern C — and that’s fine, because Pattern C isn’t right for most organizations.
One note on naming: Google rebranded Vertex AI to Gemini Enterprise Agent Platform in April 2026. The ML primitives kept their names — Pipelines, Model Registry, Feature Store, Vector Search. The agent surfaces moved: Agent Engine became Agent Runtime, Vertex AI Search became Agent Search. We’ll keep saying “Vertex AI” throughout because that’s what your team still calls it, and because that’s what the search intent still reflects.
The 90-Day Reality in Numbers
McKinsey puts 67% of organizations in pilot or experimentation. Gartner says 50% of GenAI PoCs are abandoned. A late-2025 infrastructure and operations (I&O) survey found only 28% of AI use cases fully met their ROI target. These aren’t outliers — they map directly to the three patterns below.
The pattern distribution across mid-market engagements roughly mirrors the industry data. About 67% of teams land in Pattern A: a pilot running in notebooks that never makes it to a production Endpoint. About 28% reach Pattern B: a prod-grade scaffold with Pipelines, Model Registry, and live Endpoints. About 5% reach Pattern C: a full MLOps platform with Feature Store, Vector Search, custom training, and an automated retrain loop.
Each is a real outcome. Pattern A is appropriate for exploration. Pattern B is where most organizations should aim. Pattern C requires organizational alignment most teams don’t have.
The mistake isn’t being in Pattern A. The mistake is calling it production.
Pattern A: Pilot Stalled in Notebooks
Most mid-market Vertex AI pilots land here: a Workbench notebook runs a model, a demo was impressive six months ago, and “we’ll productionize next quarter” has been the answer for four quarters. Nothing has shipped.
The symptoms are recognizable. A Workbench notebook is running — or was until someone’s billing alert fired. No Vertex Pipeline. No Model Registry entry. No Online Prediction Endpoint. The model artifact lives in a Cloud Storage bucket named something like ml-experiments-v3-final-final. The “production plan” involves Sarah’s laptop and a cron job.
Root causes: no MLOps function (someone owns the model, no one owns the pipeline), training-on-laptop habits that moved to the cloud without infrastructure discipline, and IAM/org-policy friction — aiplatform.googleapis.com not enabled in staging, GenAI Terms of Service not accepted at the project level, region-pin mismatches that produce opaque 405 Method Not Allowed errors and take a week to diagnose.
What ships in 90 days at Pattern A: an evaluation harness, possibly a clean BigQuery feature view, a Slack-bot demo. What doesn’t ship: any scaffolding that makes a model a production system.
Pattern A is fine. Exploration has value. The error is the label — calling a notebook a deployment, calling a demo an implementation.
Pattern B: Pilot-to-Prod Scaffolding
This is the target state for most mid-market teams: two to three ML practitioners, a data engineer, and a BigQuery warehouse already in use. The minimum viable Vertex stack gets one or two models into real production, with a CI/CD trigger, baseline drift detection, and rollbacks under ten minutes.
The minimum viable stack for Pattern B has five components. Vertex AI Pipelines (KFP) — one canonical pipeline per use case; the $0.03/run cost is irrelevant, reproducibility and lineage are the value. Model Registry — every artifact tagged, every Endpoint referencing a registry version, not a filename. Online Prediction Endpoints with autoscaling: minimum one replica, maximum sized to 95th-percentile QPS rather than peak. Vertex AI Experiments for run tracking. And a Cloud Build trigger that fires a training run on a git push — not Sarah’s laptop. A minimal sketch of that pipeline skeleton follows.
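Here is what that skeleton looks like in code — a minimal sketch, assuming the kfp and google-cloud-aiplatform SDKs. The component body, project ID, bucket, and table names below are hypothetical placeholders, not a drop-in implementation.

```python
# Minimal Pattern B skeleton: compile a one-step KFP pipeline and submit it
# as a Vertex AI PipelineJob. All resource names are hypothetical.
from kfp import compiler, dsl
from google.cloud import aiplatform


@dsl.component(base_image="python:3.11")
def validate_features(bq_table: str) -> str:
    # Placeholder step: a real pipeline would run schema and freshness
    # checks against the BigQuery feature view, then hand off to training.
    return bq_table


@dsl.pipeline(name="churn-train-register-deploy")
def training_pipeline(bq_table: str):
    validated = validate_features(bq_table=bq_table)
    # Downstream steps (custom training, Model Registry upload, Endpoint
    # deployment) would hang off validated.output.


compiler.Compiler().compile(
    pipeline_func=training_pipeline, package_path="pipeline.json"
)

aiplatform.init(project="my-project", location="us-central1")  # hypothetical
job = aiplatform.PipelineJob(
    display_name="churn-training",
    template_path="pipeline.json",
    parameter_values={"bq_table": "my-project.ml_features.churn_v1"},
)
job.submit()  # the Cloud Build trigger runs this same submit on every push
```

The point of the skeleton is not the placeholder component. It's that the same `job.submit()` call runs from a Cloud Build step on every push, which is what makes lineage and reproducibility automatic rather than aspirational.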
What ships in 90 days: a BigQuery-to-trained-to-registered-to-deployed pipeline, monitoring baseline, rollbacks under ten minutes. That’s a real production system.
Where Pattern B teams stumble is predictable. Autoscaling lag is the most common production incident. Vertex scales Endpoints on resource utilization — CPU thresholds — not QPS. When a burst hits, CPU crosses the scale-up threshold only after latency has already degraded. Nishant Gupta documented this in a 2025 post-mortem: 6–8 second p95 under load, traced to autoscaling that responded too slowly. Fix: set target utilization to 60–70% rather than the default 80%, and size replicas with headroom for business-hours spikes.
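A minimal sketch of that fix through the Python SDK, assuming google-cloud-aiplatform. The model resource name and replica counts are hypothetical and should be sized to your own traffic curve.

```python
# Deploy with a lower autoscaling target and headroom replicas so the
# scale-up fires before latency degrades. Names are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)
# deploy() creates a new Endpoint when none is passed in.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    min_replica_count=2,                    # headroom for the 9 AM ramp
    max_replica_count=6,                    # sized to p95 QPS, not peak
    autoscaling_target_cpu_utilization=60,  # scale earlier than the 80% default
    traffic_percentage=100,
)
```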
Auth split-brain is the second stumble. The SDK can pick up GOOGLE_API_KEY and vertexai=True simultaneously, routing through a hybrid auth path that fails like a quota error. Enforce GOOGLE_GENAI_USE_VERTEXAI=TRUE and use Application Default Credentials throughout — service accounts with 1-hour OAuth tokens. Payoda’s April 2026 post-mortem covers the region-pinning variant: region mismatch produces a Missing key inputs argument error that looks like a schema problem and isn’t.
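A minimal sketch of pinning the client to the Vertex path, assuming the google-genai SDK. Project and region are placeholders; the region must be one where the model is actually enabled, per the pinning issue above.

```python
# Force the Vertex auth path (ADC + service account) and drop any stray
# API key so requests can't route through the hybrid path.
import os

from google import genai

os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "TRUE"
os.environ.pop("GOOGLE_API_KEY", None)

client = genai.Client(
    vertexai=True,
    project="my-project",      # hypothetical
    location="us-central1",    # pin the region the model is enabled in
)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Health check",
)
print(response.text)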
Drift caught too late is the third. Vertex Model Monitoring surfaces drift. It does not retrain. Automated retraining is on you — the alert triggers a Pipeline run only if you wired it that way.
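Wiring it that way is not much code. A minimal sketch, assuming the monitoring alert routes to a Pub/Sub notification channel that triggers a first-generation Cloud Function; the policy name, bucket path, and table are hypothetical.

```python
# Pub/Sub-triggered Cloud Function: drift alert in, retraining PipelineJob out.
import base64
import json

from google.cloud import aiplatform


def trigger_retrain(event, context):
    """Entry point for the Pub/Sub-triggered function."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Ignore unrelated alerts sharing the same topic.
    if payload.get("incident", {}).get("policy_name") != "churn-drift-alert":
        return

    aiplatform.init(project="my-project", location="us-central1")
    job = aiplatform.PipelineJob(
        display_name="churn-retrain-on-drift",
        template_path="gs://my-bucket/pipelines/pipeline.json",
        parameter_values={"bq_table": "my-project.ml_features.churn_v1"},
    )
    job.submit()
```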
Pattern B monthly cost: $4,000–$15,000 depending on traffic and replica count. Idle replicas — the minimum-one to prevent cold-start — bill at ~$0.437/hr for an n1-standard-8, roughly $315/month per endpoint before a prediction is served.
Pattern C: Full MLOps Platform
The 5% case. Feature Store, Vector Search, custom training, automated retrain loops, and Agent Runtime for GenAI surfaces. This is a function-level investment — minimum three ML engineers, one DevOps engineer with ML pipeline experience, one data engineer — not a tool decision.
Pattern C is Pattern B plus Feature Store, Vector Search, custom training on TPU or A100 where the workload justifies it, automated retrain triggers wired to monitoring alerts, and Agent Runtime if there’s a defined agent surface (not a research project). Feature Store is worth its $0.30/hr-per-instance cost only when three or more models share feature definitions — before that threshold, the operational overhead of maintaining freshness contracts and serving paths exceeds the value of the abstraction.
The hidden cost of Pattern C isn’t the platform bill — it’s the team. Three ML engineers minimum, one DevOps engineer who has operated ML pipelines (not just Kubernetes), one data engineer. Below that headcount, you’re running Pattern B with extra credentials and a Feature Store you can’t maintain. Team cost typically exceeds platform cost by a factor of three to four.
What doesn’t ship in 90 days: multi-region failover, full feature governance, FinOps cost attribution per model. Those are 6–9-month deliverables. Any vendor who tells you otherwise is describing a roadmap.
The lock-in tax is real. Feature Store lock-in isn’t about data export — you can get your data out. It’s workflow and abstraction coupling: once feature definitions, freshness contracts, and serving paths live in Vertex Feature Store, exit cost is rebuilding those pipelines and renegotiating organizational agreements about feature ownership. That’s a six-figure migration, not a weekend project.
Pattern C monthly cost range: $30,000 to $150,000 for a real production workload with active training, multiple endpoints, Feature Store, and Vector Search. Custom training on an A100 40GB runs about $2.93/hr plus management fees; TPU v5e runs higher. The range is wide because workload shape matters more than any rule of thumb.
Vertex AI vs. SageMaker: Honest MLOps Stack Delta
Both platforms will get models into production. The decision should turn on where your data lives, what your autoscaling requirements look like, and whether Gemini integration or TPU access is a first-class need — not on which platform has more features.
| Layer | Vertex AI | SageMaker | Honest Take |
|---|---|---|---|
| Pipelines | KFP-native, serverless-first, $0.03/run + resources | DAG-based, mature, deeper hooks to Step Functions and MLflow | SageMaker is more mature for complex DAGs. Vertex is faster to a first working pipeline. |
| Feature Store | Unified online+offline, BigQuery zero-copy, $0.08–$0.30/hr/node | More configurable, granular control, more setup overhead | Vertex wins on time-to-value if you’re already on BigQuery. SageMaker wins on governance flexibility. |
| Online Endpoints / Autoscaling | Scales on resource utilization; idle nodes still bill | Scales on QPS option available; multi-model endpoints; more inference modes | This is load-bearing. SageMaker wins for spiky traffic. Vertex wins for simplicity. |
| Batch Prediction | Per-node-hour | Batch Transform, often cheaper for bursty workloads | Roughly equivalent. Workload shape determines winner. |
| Monitoring / Drift | Automated drift, low-ops, no auto-retrain | Model Monitor — very customizable baselines | Tie. Vertex is easier to stand up. SageMaker is more powerful to operate. |
| GenAI / Agent Layer | Agent Builder + ADK + Agent Runtime + Gemini 3 native; Model Garden 200+ models | Bedrock (separate product) for agents and foundation models; SageMaker JumpStart | Vertex wins on Gemini-first depth. SageMaker + Bedrock wins on model optionality. |
| Vector Search / RAG | Vector Search GA, RAG Engine GA, Agent Search out-of-box | OpenSearch + Bedrock Knowledge Bases | Vertex is more integrated. AWS is more à la carte. |
| TPU Access | Yes — v5e, v6e | No (GPU only) | Decisive if the workload is TPU-committed. |
| Pricing at Throughput | Per-request scaling; idle endpoint cost accumulates silently | Always-on instances by default; spot fleets, multi-model endpoints, Inference Recommender reduce idle costs | Below 100 QPS, Vertex is usually cheaper. At sustained high QPS, SageMaker’s optimization levers typically win. |
| Lock-in Shape | Workflow and abstraction coupling; hard to peel apart at Feature Store level | Custom glue-code lock-in; easier to architect out, harder operationally | Different shapes of lock-in. Neither is open. |
The autoscaling row is load-bearing. If your inference traffic follows a business-hours curve — flat overnight, 5x spike from 9 AM to noon — Vertex’s resource-utilization autoscaling lags behind the ramp every morning. SageMaker’s QPS-based option responds earlier. That’s not a configuration problem; it’s a design choice.
Vertex wins on Gemini integration and TPU access. SageMaker wins on autoscaling responsiveness and AWS-ecosystem density. Both lose to a Kubeflow + MLflow stack on raw ops control — but that requires a team that can maintain it, which most mid-market organizations don’t have.
For teams already on Anthos or GKE, the Vertex integration story is smoother than the SageMaker equivalent. The BigQuery vs Snowflake vs Redshift decision is upstream of all of this and will constrain it.
The Pricing Trap Nobody Talks About
Vertex AI pricing has three hidden traps: idle endpoint billing, Standard PayGo usage tiers that throttle throughput before you know they exist, and a Gemini model pricing structure that changed three times between 2025 and 2026.
Idle endpoint billing is the first trap teams hit. Online Prediction Endpoints bill for replicas provisioned, not predictions served. A minimum-one replica on an n1-standard-8 runs ~$0.437/hr — $315/month — whether it serves ten requests or ten million. The default assumption from API-based products is that you pay for what you use. You don’t. You pay for what you provision. The Vertex AI pricing page documents this clearly; teams just don’t read it until the bill arrives.
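The arithmetic is worth writing down once. A back-of-envelope sketch, assuming the ~$0.437/hr rate quoted above; validate the rate against the pricing page before budgeting.

```python
# Idle endpoint floor: replicas provisioned x hours, regardless of traffic.
HOURLY_RATE = 0.437        # n1-standard-8 Online Prediction, approx.
HOURS_PER_MONTH = 24 * 30  # 720

idle_cost_per_endpoint = HOURLY_RATE * HOURS_PER_MONTH  # ~$314.64
num_endpoints = 4                                       # hypothetical

print(f"Idle floor: ${idle_cost_per_endpoint * num_endpoints:,.0f}/month "
      "before a single prediction is served")
```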
The Standard PayGo usage tiers are less visible. Vertex ties Gemini API throughput to your 30-day organization spend: Tier 1 ($10–$250) gets 500K tokens per minute for Gemini Pro; Tier 3 (>$2,000) gets 2M TPM. Teams hit these limits before they know the tiers exist, and the error surfaces as a quota exceeded response that looks like a hard cap rather than a spend-gate. Tier structure documented at cloud.google.com/vertex-ai/generative-ai/docs/dsq. At scale, reserved capacity is the only stable economy — and that requires a Google account team conversation, not a console checkbox.
Current Gemini 2.5 Pro rates: $1.25/1M input tokens (≤200K context), $2.50 for longer; $10–$15/1M output. Validate against cloud.google.com/vertex-ai/pricing before locking a budget — this pricing changed three times between 2025 and 2026.
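A rough token-cost model at those rates, with a hypothetical workload shape plugged in; swap in your own request volume and token counts.

```python
# Gemini 2.5 Pro, short-context tier (rates quoted above; re-check before use).
INPUT_PER_M = 1.25    # $/1M input tokens, <=200K context
OUTPUT_PER_M = 10.00  # $/1M output tokens, <=200K context

requests_per_day = 50_000   # hypothetical workload shape
avg_input_tokens = 2_000
avg_output_tokens = 500

daily = requests_per_day * (
    avg_input_tokens * INPUT_PER_M + avg_output_tokens * OUTPUT_PER_M
) / 1_000_000
# ~ $375/day, ~ $11,250/month at this shape
print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month")
```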
For Google Cloud data analytics consulting workloads using Vertex as an inference layer over BigQuery features, the two billing surfaces are independent: BigQuery compute bills on one meter, Vertex inference on another. Not modeling them together is how teams arrive at a $40,000 month-two bill.
When to Walk Away from Vertex AI
The vendors won’t tell you when their platform isn’t the right call. We will. Five signals that Vertex AI is the wrong choice for this engagement, right now.
Fewer than five ML practitioners. Vertex Pipelines, Model Registry, and Feature Store require an MLOps owner — someone treating pipeline infrastructure as their primary responsibility, not a side project. Below that threshold, there’s no one to own it. The right stack: BigQuery ML for training, Cloud Run for serving, Cloud Scheduler for batch jobs. That combination is maintainable without Vertex’s abstraction overhead.
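A minimal sketch of that alternative, assuming the google-cloud-bigquery client; the dataset, table, and model names are hypothetical.

```python
# Train and score entirely in BigQuery ML: no Pipelines, no Registry,
# no Endpoint to keep warm. All identifiers are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
    CREATE OR REPLACE MODEL `my-project.ml.churn_lr`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT * FROM `my-project.ml.churn_features`
""").result()

scores = client.query("""
    SELECT customer_id, predicted_churned_probs
    FROM ML.PREDICT(MODEL `my-project.ml.churn_lr`,
                    TABLE `my-project.ml.churn_features_today`)
""").result()

for row in scores:
    print(row.customer_id, row.predicted_churned_probs)
```

A Cloud Run job or a Cloud Scheduler-triggered query runs the scoring step on whatever cadence the use case needs, which is the whole serving story at this team size.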
No data engineering function. Vertex won’t fix bad data infrastructure — it will expose it faster and more expensively. If features are hand-jammed in notebooks per training run, there is no pipeline to productionize. As Nandini Bedola put it in a January 2026 piece: “Vertex AI doesn’t solve AI problems, it exposes them.” Solve the data engineering problem first.
“We just want a chatbot.” Conversational Agents (Dialogflow CX) or the low-code path in Agent Studio is the right tool. An FAQ bot or support triage agent doesn’t require Vertex Pipelines, Feature Store, or Model Registry. Paying for that stack to ship a chatbot is like buying a data center to host a WordPress site.
Heavy AWS lock-in already in place. S3, Redshift, IAM roles, VPCs — running ML on Vertex means cross-cloud data egress costs, parallel identity management, and duplicate operational overhead. Keep ML where the data lives.
Compliance requires workspace isolation Vertex can’t cleanly deliver. For strict BFSI or healthcare workspace isolation — private-link equivalents, documented blast radius per tenant — Azure ML’s workspace isolation model is often the cleaner fit. Vertex has VPC Service Controls and CMEK, but it wasn’t designed for that posture.
Which Pattern Are You Actually In?
The honest version of Vertex AI planning starts with the question most teams skip: given current headcount, data infrastructure, and CI/CD discipline, which pattern can we actually execute?
Pattern A is the correct starting point for most teams today. A well-run Pattern A engagement produces a clean evaluation harness, validated model performance against a real dataset, and a clear answer on whether the use case justifies Pattern B investment. Pattern B is where most mid-market organizations should be aiming after 90 days: one or two models observable, rollbackable, and not dependent on anyone’s local environment. Pattern C is a board-level commitment, not an implementation plan.
If you want help figuring out which pattern fits your team — and which Google Cloud consulting partners have run this engagement before — start with our ranked partner directory. If the data infrastructure question is still open, read BigQuery vs Snowflake vs Redshift first.
Frequently Asked Questions
How long does it take to get a Vertex AI pilot to production?
A minimal prod-grade deployment — one model in Vertex Pipelines, registered in Model Registry, served from an Online Prediction Endpoint with basic drift monitoring — takes 60 to 90 days with a team of 3 or more ML practitioners and a data engineer. Without that team in place, pilots routinely sit in notebooks for quarters without shipping anything.
What's the difference between Vertex AI and Gemini Enterprise Agent Platform?
Google rebranded Vertex AI to Gemini Enterprise Agent Platform in April 2026. The ML primitives kept their names: Pipelines, Model Registry, Feature Store, Vector Search. The agent surfaces moved: Agent Engine became Agent Runtime, Vertex AI Search became Agent Search, Vertex AI Studio became Agent Studio. Most teams still call the platform Vertex AI, and that's fine — the search intent and the underlying tooling haven't changed.
When should I choose Vertex AI over SageMaker?
Choose Vertex AI when your data already lives in BigQuery, when Gemini model integration is a first-class requirement, or when you need TPU access for custom training. Choose SageMaker when you're deep in the AWS ecosystem (S3, Redshift, IAM), when you need QPS-based autoscaling for spiky inference traffic, or when you need multi-model endpoints to reduce idle-node costs at scale.
What does a Vertex AI implementation actually cost?
Costs break down by pattern. A notebook-stage pilot runs under $500 per month. A prod-scaffolded deployment (Pipelines + Model Registry + Online Endpoint + monitoring) runs $4,000 to $15,000 per month depending on traffic and feature size. A full MLOps platform with Feature Store, Vector Search, and custom training runs $30,000 to $150,000 per month for a real workload. Idle Online Prediction Endpoints are a consistent surprise — they bill even when no predictions are served.
What team size do I need to run Vertex AI in production?
Pattern B (prod scaffolding) requires a minimum of 3 ML practitioners plus 1 data engineer. Pattern C (full MLOps platform) requires at least 3 ML engineers, 1 DevOps engineer with ML pipeline experience, and 1 dedicated data engineer. Below those thresholds, the overhead of maintaining Vertex infrastructure exceeds the value it delivers. Teams under 5 ML practitioners are better served by BigQuery ML and Cloud Run until headcount grows.
Peter Korpak
Chief Analyst & Founder
Data-driven market researcher with 15+ years helping software agencies and IT organizations make evidence-based decisions. Former market research analyst at Aviva Investors and Credit Suisse. Analyzed 200+ verified cloud projects (migrations, implementations, optimizations) to build Cloud Intel.