Building a centralized AI Model Hosting Platform

Introduction – The AI Sprawl in the Enterprise

Artificial Intelligence has undeniably become ubiquitous. Whether it is SaaS providers enhancing their solutions with AI functionalities, internal IT departments expanding customer-facing software with AI, innovation initiatives developing rapid prototypes, or the deployment of third-party software. They all share a common requirement. They need access to foundation AI models (e.g., OpenAI, Claude, DeepSeek) and specialized AI models (e.g., OpenAI Codex), driven by the strict necessity that data processing, costs, and provisioning must remain entirely under the control of the consuming company.

As a result, a true “Bring Your Own Key” (BYOK) paradigm for AI models has emerged. To prevent an unmanageable sprawl early on while still ensuring easy access for developers, the only viable solution is a centralized AI Model Hosting Service. Given the current state of technology, implementing this can be achieved completely frictionless.

The Solution – A centralized AI Model Hosting Platform with Microsoft AI Foundry

With the recent Ignite conference and the introduction of the new AI Foundry, Microsoft has addressed this exact problem. It is now very easy for a central team which does not necessarily need to be highly specialized in deep tech to set up an AI Model Hosting Service and thereby solve the BYOK dilemma completely frictionless.

The starting point for this should be a new Azure Subscription that seamlessly integrates into the Microsoft Landing Zone concept or the specific AI Landing Zone architecture. The main component within this subscription is the Azure AI Foundry, which in turn contains individually manageable Foundry projects.

These projects each provide an API key and a unique endpoint per model. Role assignments, such as “AI Reader”, can then be applied directly on a project basis. This flexibility allows organizations to map out various constellations to accommodate different use cases and company structures. For example, it is possible to deploy one Foundry per subsidiary or business unit, containing separate projects for SaaS applications, in-house developments, or 3rd-party software.

Hourly token limits and cost showback can then be easily configured or displayed per model. To achieve even greater granularity, the AI Gateway offers pre-configured project limitations and monitoring capabilities based on an Azure API Gateway technology.

Furthermore, the AI Foundry provides straightforward ways to establish company-wide Responsible AI (RAI) policies and guardrails (e.g., content and prompt filtering). Regarding model selection, Microsoft offers a wide range of closed, open-source, and specialized models via the Azure AI Catalog. Additionally, Anthropic models are now also available through the Azure Marketplace. Together, all these components are perfectly suited for secure, enterprise-grade operations.

This image has an empty alt attribute; its file name is bildschirmfoto-2026-02-23-um-22.33.22.png

From general knowledge to specialized knowledge – Grounded AI Agents

For many use cases, especially in the SaaS space, simple API access to foundation models with their general knowledge limited by their training cutoff is perfectly sufficient. However, for specific internal or customer-facing solutions, this approach often reaches its limits.

In these scenarios, model hosting must be expanded to include connections to internal or external data sources. A fundamental example of this is “Web Grounding,” which allows the model to access current web content in real time. In more complex use cases, browser automation tools are even required to dynamically read and process web pages much like a human user would.

For this grounding meaning the anchoring of the model in real, current context the open standard MCP (Model Context Protocol) is increasingly being used. MCP establishes a secure, standardized connection to both internal and external data sources. In this way, the general foundation knowledge of the models is seamlessly combined with specialized, proprietary enterprise knowledge.

The newly introduced Microsoft AI Foundry offers an architecturally seamless solution for this exact requirement within AI Model Hosting. At the level of each Foundry project, “Knowledge” can be directly integrated—for example, via native Microsoft and Azure resources such as Microsoft OneLake, Azure Blob Storage, or Azure AI Search Indexes.

Furthermore, the platform enables the direct integration of “Tools” via MCP through a continuously growing catalog. This includes browser automation with Playwright, API interactions via Postman, or code repository access via GitHub. By combining models, knowledge, and tools, the AI Foundry transforms our standard AI Model hosting into a highly specialized enterprise tool.

This image has an empty alt attribute; its file name is bildschirmfoto-2026-02-26-um-00.00.21.png

Model and Agent Governance – From Evaluation to Production

The raw deployment of a foundation model via an endpoint is nowhere near sufficient for secure enterprise operations. As soon as use cases leave the evaluation phase and transition into production, the absence of deep control mechanisms creates incalculable risks regarding security, compliance, and costs. A centralized AI Model Hosting Service therefore requires mandatory governance structures.

The foundation of model security is formed by the integrated guardrails within the Microsoft AI Foundry. Alongside classic content filtering for inputs and outputs which reliably blocks hate speech, violent content, or, via custom filters, sensitive enterprise data (PII) Prompt Shields play a critical role. They block targeted prompt injection attacks (jailbreaks) in real time. Especially for Grounded Agents, they also prevent “indirect injections” malicious instructions hidden in connected external documents or web content and effectively protect the model from external hijacking.

For traffic management and Day-2 operations, the AI Gateway (based on Azure API Management) is positioned in front of the setup. This gateway expands the model hosting with three indispensable enterprise features:

  • Token-Aware Rate Limiting and Quotas: Instead of merely counting API calls, the gateway analyzes the actual token consumption. Hard limits per Foundry project prevent budget overruns caused by excessive usage and enable exact cost showback.
  • Smart Routing and Fallbacks: The gateway acts as a load balancer. In the event of regional capacity limits or traffic spikes (429 errors), it automatically routes traffic to secondary model deployments, thereby guaranteeing High Availability.
  • Semantic Caching: To drastically reduce latency and costs, the gateway caches responses to specific prompts. For identical or semantically very similar requests, the response is delivered directly from the cache without invoking the expensive backend model again.

The Broader Landscape – AWS Bedrock and Google Vertex AI

Alongside the Microsoft AI Foundry, AWS and Google Cloud offer their own platforms for AI model hosting: AWS Bedrock and Google Vertex AI.

AWS Bedrock is a serverless service for foundation models from Amazon (Titan) and third-party providers (e.g., Anthropic, Meta, Mistral). The service was announced in April 2023 and has been generally available since September 2023.

Google Vertex AI is the machine learning platform of the Google Cloud Platform (GCP) and has been available since May 2021. The features for generative AI and model hosting (Model Garden) were introduced in March 2023.

Feature Comparison (As of 02/2026)

FeatureAzure AI FoundryAWS BedrockGoogle Vertex AI
Simple model hosting with API key for SaaS applicationsYesYesYes
Number of available models (approx.)> 1.600> 100> 150
OpenAI Closed models availableYesNoNo
Multi-model deployment in projectsYes (Foundry Projects)Yes (Bedrock Workspaces / AWS Accounts)Yes (GCP Projects)
Token limitsTPM*, RPM*, Project QuotasTPM*, RPM*, Account QuotasTPM*, RPM*, Region Quotas
Showback costsYes (Azure Cost Management)Yes (AWS Cost Explorer)Yes (Google Cloud Billing)
RBACProject levelModel levelProject level
Web grounding possibleYes (Bing Search API)Yes (Web Crawler)Yes (Google Search)
MCP grounding possibleYesYesYes
Grounding on native servicesYes (OneLake, Blob Storage, AI Search)Yes (Amazon S3, OpenSearch)Yes (Cloud Storage, Vertex AI Search)
Content filteringYes (Azure AI Content Safety)Yes (Guardrails for Amazon Bedrock)Yes (Safety Settings)
Prevention of prompt injection attacksYes (Prompt Shields)Yes (Guardrails for Amazon Bedrock)Yes (Vertex AI Security)
Semantic cachingYes (via Azure API Management)Yes (via additional services e.g., ElastiCache)Yes (Context Caching)
Smart routingYes (via Azure API Management)Yes (Cross-region inference)Yes (Traffic Splitting & Load Balancing)

* TPM (Tokens Per Minute) and RPM (Requests Per Minute) are metrics that define the maximum data processing volume and API call frequency allowed for a model deployment within a 60-second window.

Conclusion

A centralized AI model hosting is essential for an enterprise company in today’s fast moving AI reality and should be rolled out as soon as possible.
This enables fast access to AI models for all use cases mentioned in the introduction. The low-maintenance deployment with the hyperscalers Azure, AWS, and GCP makes it possible to provide this service completely frictionless.

Organizations should choose the corresponding hyperscaler based on their existing cloud footprint, as all key features are covered by all providers. Microsoft, however, offers the highest number of models, and the contract with OpenAI grants their customers exclusive access to the proprietary GPT models. The partnership with Anthropic has also significantly improved Microsofts market position here.

Do not know where to start? – Consult me

I offer consulting services for the planning and technical implementation of centralized AI model hosting architectures. Additionally, I provide support for the organizational changes required to establish an AI Center of Excellence (AICoE) within your enterprise.

Leave a comment

Your email address will not be published. Required fields are marked *