Yes, it is possible to build a framework for running Large Language Models (LLMs) using Windows Server 2022 AI components within an IIS (Internet Information Services) instance. Here's a structured approach to making it work:
Framework for Running LLMs on MS Server 2022 with IIS
1. System Requirements
Before setting up, ensure your server has:
- Windows Server 2022 with IIS installed.
- AI & ML components (such as ONNX Runtime, Windows ML, or DirectML for acceleration).
- A GPU (optional but recommended) if leveraging CUDA, DirectML, or ONNX acceleration.
- .NET 6/7+ or Python for hosting AI inference.
- LLM model files (e.g., GPT-2, Llama, Falcon, or another model converted to ONNX; GPT-3 is API-only and cannot be self-hosted).
2. Setting Up IIS for AI Model Hosting
IIS (Internet Information Services) can serve an AI inference API by hosting an application (ASP.NET Core, FastAPI via IIS, or Flask with WSGI).
Steps to configure IIS for LLM API hosting:
- Enable IIS on Windows Server 2022:
Install-WindowsFeature -Name Web-Server -IncludeManagementTools
- Ensure IIS supports .NET and Python apps:
- Install the .NET Core Hosting Bundle if using ASP.NET Core.
- Install a Python bridge for IIS (e.g., wfastcgi for FastCGI or the HttpPlatformHandler module) for Python-based LLM services.
- Deploy an AI web service in IIS.
- Set the IIS application pool to use the right Python or .NET runtime:
- If running Python, have IIS route requests through FastCGI/HttpPlatformHandler (or a reverse proxy, as shown later).
- If using .NET, ensure the AI module is inside an ASP.NET Core application.
3. Running LLM Inference on MS Server 2022 AI Components
Windows Server 2022 supports:
- ONNX Runtime (for running models efficiently; a quick check of the available execution providers is sketched after this list).
- DirectML (GPU-accelerated inference on supported hardware).
- Windows ML (integrates with ONNX and DirectML).
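As a quick illustration, a minimal Python sketch that checks which ONNX Runtime execution providers are available and prefers GPU acceleration when present; the model path is a placeholder for the gpt2.onnx produced in section 3.1 below:
# Minimal sketch: list the available ONNX Runtime execution providers and pick the best one.
# Assumes an onnxruntime build is installed; the model path is a placeholder (see section 3.1).
import onnxruntime as ort

available = ort.get_available_providers()
print("Available providers:", available)

# Preference order: CUDA (NVIDIA), DirectML (any DirectX 12 GPU), then CPU fallback.
preferred = [p for p in ("CUDAExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider")
             if p in available]

session = ort.InferenceSession(r"C:\AIModels\gpt2.onnx", providers=preferred)
print("Session is using:", session.get_providers())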
3.1 Convert and Optimize the LLM for ONNX
- Convert an LLM (e.g., GPT-2, LLaMA, or Falcon) to ONNX format:
from transformers import AutoModelForCausalLM
import torch
# Export GPT-2 to ONNX; a dummy batch of token IDs is the tracing input, and the
# input/output names and dynamic axes let the serving code feed variable-length prompts.
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_input = torch.randint(0, 50256, (1, 64))
torch.onnx.export(model, (dummy_input,), "gpt2.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"}, "logits": {0: "batch", 1: "seq"}})
- Load and serve the model with ONNX Runtime:
import onnxruntime as ort
# Prefer CUDA when available and fall back to CPU otherwise.
session = ort.InferenceSession("gpt2.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
- Integrate this into an IIS-hosted Python API.
3.2 Running ONNX Inference via .NET API
Alternatively, if using ASP.NET with ONNX, a C# web API can run AI inference:
// Requires the Microsoft.ML.OnnxRuntime NuGet package
// (using Microsoft.ML.OnnxRuntime; using Microsoft.ML.OnnxRuntime.Tensors;).
var session = new InferenceSession("gpt2.onnx");
// GPT-2 expects int64 token IDs with shape [batch, sequence_length]; real token IDs come from a tokenizer.
var inputTensor = new DenseTensor<long>(new long[64], new int[] { 1, 64 });
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input_ids", inputTensor) };
var results = session.Run(inputs);
Expose this via an ASP.NET Core API inside IIS.
4. Performance Optimization
- Use DirectML for GPU acceleration (works across NVIDIA, AMD, and Intel GPUs without a CUDA dependency).
- Enable HTTP/2 & compression in IIS:
- Improves API response time for AI queries.
- Set IIS to enable HTTP/2 and gzip compression for AI responses.
- Use WebSockets for streaming:
- IIS supports WebSockets, allowing real-time AI text generation (a minimal streaming sketch follows this list).
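To make the WebSocket point concrete, here is a minimal FastAPI sketch of a streaming endpoint; the generate_tokens helper is a placeholder for a real ONNX inference loop, and IIS must have the WebSocket Protocol feature installed to pass these connections through:
# Sketch of a streaming text endpoint over WebSockets (FastAPI).
# generate_tokens is a placeholder, not a real model call.
from fastapi import FastAPI, WebSocket

app = FastAPI()

def generate_tokens(prompt: str):
    # Placeholder: echo the prompt back one word at a time to simulate token streaming.
    for word in f"Echo: {prompt}".split():
        yield word + " "

@app.websocket("/stream")
async def stream_text(websocket: WebSocket):
    await websocket.accept()
    prompt = await websocket.receive_text()
    for token in generate_tokens(prompt):
        await websocket.send_text(token)  # push each chunk as soon as it is ready
    await websocket.close()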
5. Deployment Steps
- Convert the AI model to ONNX and place it on the server.
- Choose Python (Flask/FastAPI) or .NET (ASP.NET Core) for serving the API.
- Configure IIS to run the AI Web Service.
- Set up ONNX Runtime, DirectML, or CUDA for LLM inference.
- Optimize performance using WebSockets, caching, and compression.
Conclusion
This framework allows Windows Server 2022 to host LLMs inside IIS using:
- ONNX Runtime for inference.
- DirectML or CUDA for acceleration.
- IIS-based APIs for AI response handling.
This setup provides scalable, enterprise-level LLM hosting, capable of serving AI-powered applications on a Windows Server stack. 🚀

Could we make a Powershell Script that Would Ready any 2022 Server for Implementation?
Yes! Below is a PowerShell script that prepares a Windows Server 2022 for hosting an LLM API using IIS, ONNX Runtime, and Python/.NET integration.
This script:
- Installs IIS and configures it to host AI models.
- Installs .NET, Python, and ONNX Runtime for AI inference.
- Sets up WebSockets, HTTP/2, and compression for AI API optimization.
- Configures firewall rules to allow AI API traffic.
- Verifies GPU support for DirectML/CUDA acceleration.
🔹 PowerShell Script to Ready a Windows Server 2022 for LLM Hosting
# PowerShell Script to Prepare Windows Server 2022 for LLM Hosting with IIS & ONNX
# 1. Install IIS and Required Features
Write-Host "Installing IIS and required components..."
Install-WindowsFeature -Name Web-Server, Web-Http-Redirect, Web-WebSockets, Web-App-Dev, Web-Stat-Compression, Web-Dyn-Compression -IncludeManagementTools
# 2. Install .NET 6+ (for ONNX Runtime with ASP.NET API)
Write-Host "Installing .NET 6 Runtime..."
$dotnetInstaller = "https://download.visualstudio.microsoft.com/download/pr/7f8b8947-7b23-47d1-bbe1-88e9dc191ba2/1e3f5e4e8ff87edcf00584cc1edcb3f6/dotnet-hosting-6.0.21-win.exe"
Invoke-WebRequest -Uri $dotnetInstaller -OutFile "dotnet-installer.exe"
Start-Process -FilePath "dotnet-installer.exe" -ArgumentList "/quiet /norestart" -Wait
Remove-Item "dotnet-installer.exe"
# 3. Install Python and ONNX Runtime
Write-Host "Installing Python and ONNX Runtime..."
$pythonInstaller = "https://www.python.org/ftp/python/3.10.9/python-3.10.9-amd64.exe"
Invoke-WebRequest -Uri $pythonInstaller -OutFile "python-installer.exe"
Start-Process -FilePath "python-installer.exe" -ArgumentList "/quiet InstallAllUsers=1 PrependPath=1" -Wait
Remove-Item "python-installer.exe"
# Install ONNX Runtime and the web-serving packages for AI model inference
# (open a new PowerShell session first if pip is not yet on PATH after the Python install;
# install only one onnxruntime build at a time; the GPU check in step 6 adds the accelerated build)
Write-Host "Installing ONNX Runtime..."
pip install onnxruntime transformers flask fastapi uvicorn
# 4. Configure IIS for AI Hosting
Write-Host "Configuring IIS for AI API hosting..."
Import-Module WebAdministration
Set-WebConfigurationProperty -Filter "system.webServer/httpCompression/scheme[@name='gzip']" -Name "dynamicCompressionLevel" -Value "9" -PSPath "IIS:\"
Set-WebConfigurationProperty -Filter "system.webServer/webSocket" -Name "enabled" -Value "True" -PSPath "IIS:\"
# Ensure HTTP/2 is enabled (it is on by default in Windows Server 2022; these values make it explicit)
Write-Host "Ensuring HTTP/2 is enabled..."
Set-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Services\HTTP\Parameters -Name EnableHttp2Tls -Type DWord -Value 1
Set-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Services\HTTP\Parameters -Name EnableHttp2Cleartext -Type DWord -Value 1
# 5. Allow API traffic through the Firewall (Port 5000 for Flask, 8000 for FastAPI, 443 for HTTPS)
Write-Host "Configuring Firewall rules..."
New-NetFirewallRule -DisplayName "Allow AI API Traffic" -Direction Inbound -Protocol TCP -LocalPort 5000,8000,443 -Action Allow
# 6. Verify GPU Acceleration (CUDA or DirectML)
Write-Host "Checking for GPU support..."
$gpuNames = (Get-CimInstance Win32_VideoController).Name
if ($gpuNames -match "NVIDIA") {
    Write-Host "NVIDIA GPU detected. Installing the ONNX Runtime GPU build (the CUDA toolkit and cuDNN must be installed separately)..."
    pip install onnxruntime-gpu
} elseif ($gpuNames -match "AMD" -or $gpuNames -match "Intel") {
    Write-Host "AMD/Intel GPU detected. Installing DirectML support..."
    pip install onnxruntime-directml
} else {
    Write-Host "No GPU detected. Running on CPU only."
}
# 7. Restart IIS to Apply Changes
Write-Host "Restarting IIS..."
iisreset
Write-Host "Windows Server 2022 is now ready for AI model hosting with IIS, ONNX, and LLM APIs!"
🔹 What This Script Does
- Installs IIS for hosting AI models.
- Sets up Python & ONNX Runtime for LLM inference.
- Configures IIS for AI web services (Flask, FastAPI, or ASP.NET Core).
- Optimizes IIS by enabling:
- WebSockets (for real-time AI responses).
- HTTP/2 (for faster API performance).
- Dynamic compression (to reduce response times).
- Opens firewall ports for AI API access.
- Detects GPU hardware and installs the matching ONNX Runtime build: GPU (NVIDIA, CUDA toolkit required) or DirectML (AMD/Intel).
- Restarts IIS to apply all configurations.
🔹 How to Use
- Save the script as setup-ai-server.ps1
- Run PowerShell as Administrator
- Execute the script:
Set-ExecutionPolicy Unrestricted -Force
.\setup-ai-server.ps1
🔹 After Running This Script
Your Windows Server 2022 will be fully prepared to: ✅ Host LLM APIs using IIS
✅ Run ONNX Runtime for AI model inference
✅ Support GPU acceleration with DirectML or CUDA
✅ Serve AI responses using Flask, FastAPI, or ASP.NET Core
🚀 Now you can deploy your LLM model and start serving AI requests!
So how do I Deploy the AI After This?
Deploying the AI Model After Running the Setup Script
Now that your Windows Server 2022 is ready with IIS, ONNX Runtime, and Python/.NET, follow these steps to deploy your Large Language Model (LLM) API.
🚀 Deployment Steps for LLM on IIS (Using Python/ONNX)
1. Choose Your AI API Framework
You can deploy the LLM as a web API using:
- Python (Flask or FastAPI)
- .NET Core (ASP.NET Web API)
- Node.js (Express, but less common for AI inference on Windows Server)
For this example, we will use FastAPI (Python) because it’s lightweight, fast, and integrates well with ONNX Runtime.
2. Upload Your AI Model to the Server
If you converted your model to ONNX, upload it to C:\AIModels\ (for example, via RDP file copy or a network share).
3. Install Python Dependencies (If Not Already Installed)
Log in to your server, open PowerShell, and run:
pip install onnxruntime fastapi uvicorn transformers
This ensures all necessary packages for inference and API deployment are installed.
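Optionally, a small sanity-check sketch (the filename check_env.py is arbitrary) to confirm the packages import and to see which ONNX Runtime providers the server exposes:
# Optional sanity check: verify the AI packages import and report the ONNX Runtime providers.
import fastapi
import onnxruntime
import transformers

print("onnxruntime ", onnxruntime.__version__)
print("providers   ", onnxruntime.get_available_providers())
print("transformers", transformers.__version__)
print("fastapi     ", fastapi.__version__)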
4. Create the AI API Using FastAPI
On the server, create a new folder for the AI service:
mkdir C:\AIService
cd C:\AIService
Create a Python script for the API:
notepad ai_server.py
Paste the following Python code to load the LLM and serve API requests:
from fastapi import FastAPI
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model (CPU provider here; swap in CUDA/DirectML providers if available)
model_path = "C:/AIModels/gpt2.onnx"
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

app = FastAPI()

@app.post("/generate")
async def generate_text(prompt: str, max_new_tokens: int = 20):
    input_ids = tokenizer(prompt, return_tensors="np")["input_ids"].astype(np.int64)
    # The exported model returns logits, so decode greedily one token at a time.
    # (This re-runs the full sequence each step; fine for a demo, slow for production.)
    for _ in range(max_new_tokens):
        logits = session.run(None, {"input_ids": input_ids})[0]
        next_id = int(np.argmax(logits[0, -1]))
        input_ids = np.concatenate([input_ids, np.array([[next_id]], dtype=np.int64)], axis=1)
        if next_id == tokenizer.eos_token_id:
            break
    generated_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return {"response": generated_text}

if __name__ == "__main__":
    # Allows "python ai_server.py"; equivalent to: uvicorn ai_server:app --host 0.0.0.0 --port 8000
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Save the file and close Notepad.
5. Test the AI API Locally
Run the API manually first:
python ai_server.py
Then, in a web browser, open:
http://localhost:8000/docs
You should see the FastAPI Swagger UI, where you can test the /generate endpoint.
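Alternatively, a quick scripted check from the server itself; this is a sketch using only the Python standard library and assumes the API is listening on port 8000 (FastAPI treats the prompt argument above as a query parameter):
# Smoke test for the /generate endpoint.
import json
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({"prompt": "Hello from Windows Server"})
req = urllib.request.Request(f"http://localhost:8000/generate?{query}", method="POST")
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # e.g. {"response": "...generated text..."}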
6. Deploy the AI API as a Windows Service (Persistent)
Instead of running manually, let’s set it up as a Windows Service so it runs automatically.
- Install NSSM (the Non-Sucking Service Manager) to create a persistent service. This assumes Chocolatey is already installed; otherwise install it first (https://chocolatey.org/install) or download NSSM directly from https://nssm.cc:
choco install nssm -y
- Create a service that runs the AI API (adjust the Python path to your installation; with InstallAllUsers=1 it is typically C:\Program Files\Python310\python.exe):
nssm install AIService "C:\Program Files\Python310\python.exe" "C:\AIService\ai_server.py"
- Start the AI API service:
nssm start AIService
Now, the AI service will automatically restart if the server reboots.
7. Configure IIS as a Reverse Proxy for Public Access
Now, let’s make the AI API publicly accessible via IIS.
- Install the URL Rewrite module and Application Request Routing (ARR); both are free IIS add-ons from Microsoft and are not included in the base IIS install.
- Open IIS Manager (run inetmgr in PowerShell).
- Select your server, open Application Request Routing Cache, and choose Server Proxy Settings.
- Enable the reverse proxy ("Enable proxy").
- Click Server Farms > Create Server Farm and add localhost:8000 as the target backend (where FastAPI is running).
- Enable load balancing if needed.
- Create a new IIS website pointing to the proxy.
Now, you can access the AI API using:
https://your-server-ip-or-domain/generate
✅ Final Deployment Checklist
✔ AI Model Deployed in ONNX Format
✔ AI API Running via FastAPI on Port 8000
✔ AI Service Running as a Windows Service (Persistent)
✔ IIS Configured as a Reverse Proxy for Public Access
Now your AI is live and accessible from the web! 🚀
How Would I Publish the AI with .net Core?
Deploying LLM with .NET Core on Windows Server 2022 (IIS & ONNX)
To keep everything Microsoft-compatible, we’ll use .NET Core 6/7+, ONNX Runtime, and IIS for hosting the AI API.
This method ensures: ✅ Full Microsoft stack compatibility
✅ ONNX Runtime acceleration for inference
✅ IIS integration for a secure and scalable deployment
🚀 Steps to Deploy LLM in .NET Core on Windows Server 2022
1. Install Required Software
Ensure your Windows Server 2022 is ready:
# Install IIS and dependencies
Install-WindowsFeature -Name Web-Server, Web-Asp-Net45, Web-WebSockets -IncludeManagementTools
# Install the .NET 7 Hosting Bundle (modify for .NET 6 if needed)
# NOTE: verify this is still the current Hosting Bundle link at https://dotnet.microsoft.com/download/dotnet/7.0
Invoke-WebRequest -Uri "https://download.visualstudio.microsoft.com/download/pr/7f8b8947-7b23-47d1-bbe1-88e9dc191ba2/1e3f5e4e8ff87edcf00584cc1edcb3f6/dotnet-hosting-7.0.10-win.exe" -OutFile "dotnet-hosting.exe"
Start-Process -FilePath "dotnet-hosting.exe" -ArgumentList "/quiet /norestart" -Wait
Remove-Item "dotnet-hosting.exe"
# ONNX Runtime is added later as a NuGet package inside the Web API project (see step 3 below)
2. Prepare AI Model
If you haven’t already converted your LLM to ONNX format, do this:
from transformers import AutoModelForCausalLM
import torch
# Convert GPT-2 to ONNX; the input is named "input_ids" so the .NET code below can reference it.
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_input = torch.randint(0, 50256, (1, 64))
torch.onnx.export(model, (dummy_input,), "gpt2.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"}, "logits": {0: "batch", 1: "seq"}})
Upload gpt2.onnx to C:\AIModels\.
3. Create .NET Core AI Web API
- Open PowerShell and create a new .NET Web API project:
mkdir C:\AIService
cd C:\AIService
dotnet new webapi -n AIService
cd AIService
- Add dependencies:
dotnet add package Microsoft.ML.OnnxRuntime
dotnet add package Microsoft.AspNetCore.Cors
- Modify Program.cs to register controllers and Swagger and to enable CORS:
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();            // required for MapControllers() below
builder.Services.AddEndpointsApiExplorer();   // Swagger support (part of the webapi template)
builder.Services.AddSwaggerGen();
builder.Services.AddCors(options =>
{
    options.AddPolicy("AllowAll", policy =>
    {
        policy.AllowAnyOrigin().AllowAnyMethod().AllowAnyHeader();
    });
});

var app = builder.Build();
app.UseSwagger();       // Swagger enabled unconditionally for this demo
app.UseSwaggerUI();
app.UseCors("AllowAll");
app.MapControllers();
app.Run();
4. Implement ONNX Model Inference
Add a Controllers/AIController.cs file with this code:
using Microsoft.AspNetCore.Mvc;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System.Text.Json;

[Route("api/ai")]
[ApiController]
public class AIController : ControllerBase
{
    private static readonly InferenceSession _session = new InferenceSession("C:/AIModels/gpt2.onnx");

    [HttpPost("generate")]
    public IActionResult GenerateText([FromBody] InputData input)
    {
        // Tokenization of input.Prompt would be required here; hard-coded token IDs keep the demo simple.
        // GPT-2 ONNX models expect int64 token IDs with shape [batch, sequence_length].
        var inputTokens = new long[] { 1, 2, 3, 4, 5 };
        var container = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids",
                new DenseTensor<long>(inputTokens, new int[] { 1, inputTokens.Length }))
        };

        using var result = _session.Run(container);
        // The model returns logits; a real implementation would decode these back into text.
        var output = result[0].AsEnumerable<float>().ToArray();
        return Ok(JsonSerializer.Serialize(new { response = output }));
    }
}

public class InputData
{
    public string Prompt { get; set; }
}
5. Test the API Locally
Run the API:
dotnet run --urls http://localhost:5000
Visit http://localhost:5000/swagger/index.html and test the /api/ai/generate endpoint.
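As a quick client-side check, a sketch using the Python standard library; it assumes the API is listening on port 5000 and sends the prompt as the JSON body bound to InputData:
# Smoke test for the .NET /api/ai/generate endpoint.
import json
import urllib.request

payload = json.dumps({"prompt": "Hello from Windows Server"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:5000/api/ai/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # raw logits array from the demo controller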
6. Publish & Deploy to IIS
- Publish the API:
dotnet publish -c Release -o C:\AIService\publish
- Configure IIS:
- Open IIS Manager (inetmgr).
- Create a new site with physical path C:\AIService\publish and port 5000 (or 80 for public access).
- Set the application pool's .NET CLR version to "No Managed Code" (ASP.NET Core runs through the ASP.NET Core Module installed by the Hosting Bundle).
- Enable WebSockets & compression in IIS:
Import-Module WebAdministration
Set-WebConfigurationProperty -Filter "system.webServer/webSocket" -Name "enabled" -Value "True" -PSPath "IIS:\"
Set-WebConfigurationProperty -Filter "system.webServer/httpCompression/scheme[@name='gzip']" -Name "dynamicCompressionLevel" -Value "9" -PSPath "IIS:\"
- Restart IIS:
iisreset
Now your AI API is live at:
http://your-server-ip/api/ai/generate
✅ Final Deployment Summary
✔ Windows Server 2022 AI-ready
✔ ONNX LLM Model Deployed
✔ .NET Core API Running on IIS
✔ IIS Optimized with WebSockets & Compression
✔ Public AI API Ready to Serve LLM Responses! 🚀
Now you have a Microsoft-native LLM API running on IIS with .NET Core! 🎉