Future of APIs

Azure API Management as a GenAI Gateway: Unlocking the Future of APIs

Welcome to a deep dive into the innovative world of Azure API Management as a GenAI Gateway. We’ll explore why API professionals should pay attention to Generative AI (GenAI) and large language models (LLMs). This session is packed with insights, practical applications, and new capabilities that can revolutionize how we interact with APIs.

Understanding GenAI and APIs

At its core, when you interact with large language models, you’re essentially working with APIs. You send requests formatted as prompts, and in return, you receive responses as completions. This familiarity is crucial for API professionals, as it means that managing, securing, and governing these AI APIs is just another layer in our existing API management responsibilities.

However, there are unique challenges associated with AI APIs that need addressing. One major concern is token consumption. Every interaction with a large language model consumes tokens, which can lead to unpredictable costs. Understanding token consumption is vital, especially if you’re sharing an OpenAI instance across multiple teams or departments.

Managing Token Consumption

Imagine hosting a hackathon to explore various use cases for large language models within your organization. You wouldn’t want to be faced with a hefty bill for token consumption afterward! To prevent this, it’s essential to have mechanisms in place to monitor and control token usage effectively.

  • Token Consumption Tracking: You can track token usage by different dimensions, such as user ID or subscription ID. This allows for cross-charging based on consumption, making it easier to allocate costs across teams.
  • Rate Limiting: Implementing policies to limit token consumption can help you manage costs. For instance, you can set a cap on tokens per minute or establish long-term quotas (daily, weekly, or monthly).
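As a sketch of the tracking side, API Management's `azure-openai-emit-token-metric` policy can send token counts to Application Insights broken down by dimension. The `x-user-id` header below is a hypothetical custom header your client applications would need to send; the namespace name is also an assumption:

```xml
<!-- Inbound policy section: emit token-consumption metrics per subscription and user. -->
<azure-openai-emit-token-metric namespace="openai-usage">
    <!-- Built-in dimension: resolved automatically from the calling subscription. -->
    <dimension name="Subscription ID" />
    <!-- Custom value from a hypothetical header set by the client application. -->
    <dimension name="User ID" value="@(context.Request.Headers.GetValueOrDefault("x-user-id", "unknown"))" />
</azure-openai-emit-token-metric>
```

With dimensions like these in place, a simple Application Insights query over the custom metric gives you the per-team breakdown needed for cross-charging.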

Reliability and Security Concerns

When moving from proof of concept (POC) to production, reliability becomes crucial. Azure OpenAI offers pre-provisioned capacity through PTUs (Provisioned Throughput Units), which can serve as your primary instance. During peak periods, such as a Black Friday sale, having a failover to a pay-as-you-go deployment ensures that your applications remain operational.

Security is another important aspect. Sharing a single API key across your organization creates risk. Instead, Azure API Management lets you issue each team its own subscription key, which can be easily managed, rotated, and revoked. This minimizes the risk associated with key management.

Introducing GenAI Gateway Capabilities

To tackle the specific challenges of token consumption and reliability, Azure API Management has introduced a set of GenAI gateway capabilities. Here’s what they include:

  • Azure OpenAI Token Metric Policy: This policy helps you monitor token consumption across different dimensions. By integrating with Application Insights, you can visualize usage by developer, team, or department.
  • Load Balancer and Circuit Breaker: Azure API Management allows you to configure circuit breaker rules for each endpoint. If an endpoint returns a 429 status code (too many requests), API Management can switch to a pay-as-you-go instance to maintain service availability.
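In policy terms, the routing side might look like the sketch below. The backend pool itself (including priority order and the circuit-breaker rules that trip on 429 responses) is defined on the Backends resource in API Management rather than in policy; `openai-backend-pool` is an assumed backend ID:

```xml
<inbound>
    <!-- Route requests to a load-balanced backend pool (defined on the Backends
         resource, where circuit-breaker rules for 429 responses also live). -->
    <set-backend-service backend-id="openai-backend-pool" />
</inbound>
<backend>
    <!-- Retry when a pool member returns 429, letting the load balancer
         select the next healthy backend (e.g. the pay-as-you-go failover). -->
    <retry condition="@(context.Response.StatusCode == 429)" count="2" interval="1" first-fast-retry="true">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```

Buffering the request body is what makes the retry possible, since the prompt has to be replayed against the fallback backend.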

Authentication and Authorization

Managing authentication effectively is crucial for security. Azure API Management supports managed identity authentication toward Azure OpenAI, eliminating the need to distribute the underlying API key. Developers instead authenticate to API Management with subscription keys that can be easily rotated and revoked.
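Under this model, API Management acquires a token for the Azure OpenAI resource itself. A minimal sketch of the documented pattern (the variable name is arbitrary):

```xml
<!-- Inbound: authenticate to Azure OpenAI with API Management's managed
     identity instead of a shared API key. -->
<authentication-managed-identity
    resource="https://cognitiveservices.azure.com"
    output-token-variable-name="managed-id-token" />
<!-- Attach the acquired token as a bearer token on the forwarded request. -->
<set-header name="Authorization" exists-action="override">
    <value>@("Bearer " + (string)context.Variables["managed-id-token"])</value>
</set-header>
```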

You can also implement JWT token validation policies to grant access based on specific claims, allowing for granular control over who can access which models and their associated token consumption rules.
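A sketch of such a validation policy, assuming Microsoft Entra ID as the issuer; the `roles` claim value is a hypothetical app role you would define yourself, and `{tenant-id}` is a placeholder for your tenant:

```xml
<!-- Inbound: reject requests whose JWT lacks the expected role claim. -->
<validate-jwt header-name="Authorization" failed-validation-httpcode="401"
              failed-validation-error-message="Access to this model is not authorized.">
    <openid-config url="https://login.microsoftonline.com/{tenant-id}/v2.0/.well-known/openid-configuration" />
    <required-claims>
        <!-- Hypothetical app role granting access to a specific model deployment. -->
        <claim name="roles" match="any">
            <value>openai.gpt4.user</value>
        </claim>
    </required-claims>
</validate-jwt>
```

Combining this with per-claim token limits is what gives you the granular, per-model access control described above.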

Monitoring and Observability

Understanding token usage is not just about tracking; it’s also about limiting it. With the Azure OpenAI token limit policy, you can set both short-term and long-term limits to keep token consumption in check.
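A sketch of the limiting policy: `tokens-per-minute` is the short-term cap, while the quota attributes shown (available in newer service versions) cover long-term limits — treat the exact attribute names as something to verify against the current policy reference:

```xml
<!-- Inbound: cap token consumption per subscription. -->
<azure-openai-token-limit
    counter-key="@(context.Subscription.Id)"
    tokens-per-minute="5000"
    estimate-prompt-tokens="true"
    remaining-tokens-header-name="x-remaining-tokens"
    token-quota="1000000"
    token-quota-period="Weekly" />
```

Estimating prompt tokens up front lets the gateway reject an over-limit request before it ever reaches the model, so no tokens are wasted on a call that would be refused.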

In addition, the Azure OpenAI semantic caching policy enables caching for semantically similar prompts. This is beneficial in chat applications where users often ask similar questions. By caching completions based on similarity, you can significantly reduce token consumption.
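A sketch of the caching pair; `embeddings-backend` is an assumed backend ID pointing at an embeddings deployment used to compute prompt similarity, and the threshold and duration values are illustrative:

```xml
<inbound>
    <!-- Return a cached completion when a prior prompt's embedding
         similarity meets the threshold. -->
    <azure-openai-semantic-cache-lookup
        score-threshold="0.85"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned" />
</inbound>
<outbound>
    <!-- Store the completion (here for 120 seconds) for reuse by
         semantically similar prompts. -->
    <azure-openai-semantic-cache-store duration="120" />
</outbound>
```

The score threshold is the key tuning knob: set it too low and users may receive answers to questions that are merely related, not equivalent.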

Expanding Model Support

Azure has broadened its support beyond just OpenAI. The Azure AI model catalog now includes models like Mistral, Cohere, and Llama, all accessible via a unified Azure model inference API. This streamlines the process of working with different models while leveraging the same policies previously discussed.

Demonstration of New Features

In a live demonstration, we explored how to set up an API with Azure API Management. By selecting an OpenAI service instance, we configured policies for token quota, tracking usage, and semantic caching. The setup process is straightforward, allowing for rapid deployment of AI capabilities.

We also tested the system by generating interactions, such as creating jokes or solving mathematical problems using image inputs. The results were impressive, showcasing the capabilities of Azure API Management as a GenAI Gateway.

Scaling for Production

As organizations transition from experimentation to production, scaling becomes a necessity. Deploying multiple OpenAI instances allows for greater token throughput. For instance, if one instance can handle 150,000 tokens per minute, two instances can effectively double that capacity.

Moreover, deploying across multiple regions enhances resiliency. If one region experiences an outage, your application can seamlessly switch to another region, ensuring uninterrupted service for users.

Load Balancing and Circuit Breakers in Action

In our demonstration, we showcased how to configure load balancing between two deployments located in different regions. By setting priorities and circuit breakers, we could manage traffic effectively, ensuring that the primary deployment was utilized as much as possible while having a backup ready if needed.

Monitoring and Quota Enforcement

Monitoring usage through Application Insights allows you to pinpoint which applications consume the most tokens. By implementing quotas, you can manage access based on specific needs, ensuring that no single application monopolizes resources.

Conclusion and Next Steps

As we conclude, it’s clear that Azure API Management as a GenAI Gateway provides a robust framework for managing AI-driven APIs. By leveraging the capabilities discussed, organizations can optimize their API strategies while controlling costs and enhancing security.

For those interested in diving deeper, we encourage you to explore the resources available on the Azure documentation site, participate in the GenAI Hub Accelerator, and check out the GenAI Gateway Accelerator for further learning and implementation strategies.

Documentation: aka.ms/apim/openai-docs
GenAI Gateway Labs: aka.ms/apim/genai/labs
GenAI AI Hub Accelerator: aka.ms/ai-hub-gateway
GenAI Gateway Accelerator: aka.ms/apim-genai-lza

We look forward to seeing how you implement these exciting new capabilities in your API management practices!
