Could we make a framework for using the MS server 2022 AI components to run LLMs in an IIS publishing instance of an AI?

Yes, it is possible to build a framework for running Large Language Models (LLMs) using Windows Server 2022 AI components within an IIS (Internet Information Services) instance. Here's a structured approach to making it work:


Framework for Running LLMs on MS Server 2022 with IIS

1. System Requirements

Before setting up, ensure your server has:

  • Windows Server 2022 with IIS installed.
  • AI & ML components (such as ONNX Runtime, Windows ML, or DirectML for acceleration).
  • A GPU (optional but recommended) if leveraging CUDA, DirectML, or ONNX acceleration.
  • .NET 6/7+ or Python for hosting AI inference.
  • LLM Model Files (e.g., GPT-2, Llama, Falcon, or another model with openly available weights, ideally already converted to ONNX).

2. Setting Up IIS for AI Model Hosting

IIS (Internet Information Services) can serve an AI inference API by hosting an application (ASP.NET Core natively, or FastAPI/Flask behind IIS via HttpPlatformHandler, FastCGI, or a reverse proxy).

Steps to configure IIS for LLM API hosting:

  1. Enable IIS on Windows Server 2022:

    Install-WindowsFeature -name Web-Server -IncludeManagementTools
    
  2. Ensure IIS supports .NET and Python apps:

    • Install the ASP.NET Core Hosting Bundle if using ASP.NET Core.
    • Install HttpPlatformHandler or wfastcgi (FastCGI/WSGI bridge) for Python-based LLM services, or reverse-proxy to a self-hosted Python process.
  3. Deploy an AI Web Service in IIS:

    • Deploy an ASP.NET API (C#) or Flask/FastAPI (Python) for model inference.
    • Ensure the AI model is preloaded on the backend.
    • The AI API should expose endpoints such as (a minimal sketch follows this list):
      POST /predict   → Takes input text, returns LLM-generated response
      
  4. Set IIS Application Pool to Use the Right Python or .NET Runtime

    • If running Python, set IIS to use HttpPlatformHandler or FastCGI (wfastcgi), or reverse-proxy to the Python process.
    • If using .NET, ensure the inference code runs inside an ASP.NET Core application and the app pool is set to "No Managed Code".
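
For reference, here is a minimal sketch of such a /predict endpoint using FastAPI (the PromptRequest model and the echo-style response are illustrative placeholders; the actual ONNX Runtime inference from section 3 would replace the stub):

# Minimal sketch of the /predict endpoint described in step 3 (FastAPI).
# The route name and PromptRequest model are placeholders until the ONNX
# model call from section 3 is wired in.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

@app.post("/predict")
async def predict(request: PromptRequest):
    # Placeholder response; replace with ONNX Runtime inference (section 3).
    return {"response": f"Echo: {request.prompt}"}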

3. Running LLM Inference with Windows Server 2022 AI Components

Windows Server 2022 supports:

  • ONNX Runtime (for running models efficiently).
  • DirectML (GPU-accelerated inference on supported hardware).
  • Windows ML (integrates with ONNX and DirectML).
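
A quick way to confirm which execution providers your onnxruntime installation actually exposes (a small diagnostic sketch; the provider names you see depend on whether you installed onnxruntime, onnxruntime-gpu, or onnxruntime-directml):

# List the ONNX Runtime execution providers available on this server.
# "DmlExecutionProvider" appears only with onnxruntime-directml;
# "CUDAExecutionProvider" only with onnxruntime-gpu plus a working CUDA install.
import onnxruntime as ort

print("ONNX Runtime version:", ort.__version__)
print("Available providers:", ort.get_available_providers())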

3.1 Convert and Optimize the LLM for ONNX

  • Convert an LLM (e.g., GPT-2, LLaMA, or Falcon) to ONNX format (for production, the Hugging Face optimum exporter is more robust, but a direct export looks like this):
    from transformers import AutoModelForCausalLM
    import torch
    
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    # Export with a dummy batch of token IDs; naming the input "input_ids"
    # lets the Python and .NET inference code refer to it by name.
    dummy_input = torch.randint(0, 50256, (1, 64))
    torch.onnx.export(model, (dummy_input,), "gpt2.onnx",
                      input_names=["input_ids"], output_names=["logits"],
                      dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}})
    
  • Load and serve the model with ONNX Runtime:
    import onnxruntime as ort
    
    # CUDAExecutionProvider requires the onnxruntime-gpu package and a CUDA-capable GPU;
    # listing CPUExecutionProvider as a fallback keeps this working on CPU-only servers.
    session = ort.InferenceSession("gpt2.onnx",
                                   providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
    
  • Integrate this into an IIS-hosted Python API.

3.2 Running ONNX Inference via .NET API

Alternatively, if using ASP.NET with ONNX, a C# web API can run AI inference:

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var session = new InferenceSession("gpt2.onnx");
// GPT-2 expects 64-bit integer token IDs, so the tensor element type must be long, not float.
var inputTensor = new DenseTensor<long>(new long[64] /* token IDs from a tokenizer */, new int[] { 1, 64 });
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input_ids", inputTensor) };
var results = session.Run(inputs);

Expose this via an ASP.NET Core API inside IIS.


4. Performance Optimization

  1. Use DirectML for GPU Acceleration

    • If running on a Windows GPU, use ONNX Runtime with DirectML (requires the onnxruntime-directml package):
      ort.InferenceSession("gpt2.onnx", providers=["DmlExecutionProvider"])
      
    • This accelerates inference via DirectML on any DirectX 12-capable GPU (NVIDIA, AMD, or Intel).
  2. Enable HTTP/2 & Compression in IIS

    • Improves API response time for AI queries.
    • Set IIS to enable HTTP/2 and gzip compression for AI responses.
  3. Use WebSockets for Streaming

    • IIS supports WebSockets, allowing real-time streaming of AI text generation (see the sketch below).
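
A hedged sketch of what a streaming endpoint could look like with FastAPI WebSockets (the fake_generate_tokens generator is a stand-in; a real implementation would run the ONNX model autoregressively inside the loop):

# Sketch of a WebSocket endpoint that streams generated text token by token.
# fake_generate_tokens is a placeholder for a real autoregressive ONNX loop.
from fastapi import FastAPI, WebSocket

app = FastAPI()

def fake_generate_tokens(prompt: str):
    # Placeholder generator; replace with ONNX Runtime-based token generation.
    for word in ("This", "is", "a", "streamed", "response."):
        yield word

@app.websocket("/ws/generate")
async def ws_generate(websocket: WebSocket):
    await websocket.accept()
    prompt = await websocket.receive_text()
    # Send tokens as they are produced instead of waiting for the full text.
    for token in fake_generate_tokens(prompt):
        await websocket.send_text(token)
    await websocket.close()

When IIS sits in front as a reverse proxy, the WebSocket protocol must also be enabled in IIS (the setup script below does this).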

5. Deployment Steps

  1. Convert the AI model to ONNX and place it on the server.
  2. Choose Python (Flask/FastAPI) or .NET (ASP.NET Core) for serving the API.
  3. Configure IIS to run the AI Web Service.
  4. Set up ONNX Runtime, DirectML, or CUDA for LLM inference.
  5. Optimize performance using WebSockets, caching, and compression.

Conclusion

This framework allows Windows Server 2022 to host LLMs inside IIS using:

  • ONNX Runtime for inference.
  • DirectML or CUDA for acceleration.
  • IIS-based APIs for AI response handling.

This setup provides scalable, enterprise-level LLM hosting, capable of serving AI-powered applications on a Windows Server stack. 🚀

Could we make a Powershell Script that Would Ready any 2022 Server for Implementation?

Yes! Below is a PowerShell script that prepares a Windows Server 2022 for hosting an LLM API using IIS, ONNX Runtime, and Python/.NET integration.

This script:

  1. Installs IIS and configures it to host AI models.
  2. Installs .NET, Python, and ONNX Runtime for AI inference.
  3. Sets up WebSockets, HTTP/2, and compression for AI API optimization.
  4. Configures firewall rules to allow AI API traffic.
  5. Verifies GPU support for DirectML/CUDA acceleration.

🔹 PowerShell Script to Ready a Windows Server 2022 for LLM Hosting

# PowerShell Script to Prepare Windows Server 2022 for LLM Hosting with IIS & ONNX

# 1. Install IIS and Required Features
Write-Host "Installing IIS and required components..."
Install-WindowsFeature -Name Web-Server, Web-Http-Redirect, Web-WebSockets, Web-App-Dev, Web-Stat-Compression, Web-Dyn-Compression -IncludeManagementTools

# 2. Install .NET 6+ (for ONNX Runtime with ASP.NET API)
Write-Host "Installing .NET 6 Runtime..."
$dotnetInstaller = "https://download.visualstudio.microsoft.com/download/pr/7f8b8947-7b23-47d1-bbe1-88e9dc191ba2/1e3f5e4e8ff87edcf00584cc1edcb3f6/dotnet-hosting-6.0.21-win.exe"
Invoke-WebRequest -Uri $dotnetInstaller -OutFile "dotnet-installer.exe"
Start-Process -FilePath "dotnet-installer.exe" -ArgumentList "/quiet /norestart" -Wait
Remove-Item "dotnet-installer.exe"

# 3. Install Python and ONNX Runtime
Write-Host "Installing Python and ONNX Runtime..."
$pythonInstaller = "https://www.python.org/ftp/python/3.10.9/python-3.10.9-amd64.exe"
Invoke-WebRequest -Uri $pythonInstaller -OutFile "python-installer.exe"
Start-Process -FilePath "python-installer.exe" -ArgumentList "/quiet InstallAllUsers=1 PrependPath=1" -Wait
Remove-Item "python-installer.exe"

# Install ONNX Runtime and Python API packages for AI model inference
# (if pip is not found, open a new PowerShell session so the updated PATH is picked up)
Write-Host "Installing ONNX Runtime..."
pip install onnxruntime transformers flask fastapi uvicorn
# GPU-specific packages (onnxruntime-gpu / onnxruntime-directml) are installed in step 6 below.

# 4. Configure IIS for AI Hosting
Write-Host "Configuring IIS for AI API hosting..."
Set-WebConfigurationProperty -filter "system.webServer/httpCompression" -name "dynamicCompressionLevel" -value "9" -PSPath "IIS:\"
Set-WebConfigurationProperty -filter "system.webServer/webSocket" -name "enabled" -value "True" -PSPath "IIS:\"

# Ensure HTTP/2 is enabled (it is on by default in Windows Server 2022; these values only matter if it was previously disabled)
Write-Host "Ensuring HTTP/2 is enabled..."
Set-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Services\HTTP\Parameters -Name EnableHttp2Tls -Type DWord -Value 1
Set-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Services\HTTP\Parameters -Name EnableHttp2Cleartext -Type DWord -Value 1

# 5. Allow API traffic through the Firewall (Port 5000 for Flask, 8000 for FastAPI, 443 for HTTPS)
Write-Host "Configuring Firewall rules..."
New-NetFirewallRule -DisplayName "Allow AI API Traffic" -Direction Inbound -Protocol TCP -LocalPort 5000,8000,443 -Action Allow

# 6. Verify GPU Acceleration (CUDA or DirectML)
Write-Host "Checking for GPU support..."
$gpuNames = (Get-CimInstance Win32_VideoController).Name
if ($gpuNames -match "NVIDIA") {
    Write-Host "NVIDIA GPU detected. Installing ONNX Runtime GPU (CUDA/cuDNN must be installed separately)..."
    pip install onnxruntime-gpu
} elseif ($gpuNames -match "AMD" -or $gpuNames -match "Intel") {
    Write-Host "AMD/Intel GPU detected. Installing DirectML support..."
    pip install onnxruntime-directml
} else {
    Write-Host "No dedicated GPU detected. Running on CPU only."
}

# 7. Restart IIS to Apply Changes
Write-Host "Restarting IIS..."
iisreset

Write-Host "Windows Server 2022 is now ready for AI model hosting with IIS, ONNX, and LLM APIs!"

🔹 What This Script Does

  1. Installs IIS for hosting AI models.
  2. Sets up Python & ONNX Runtime for LLM inference.
  3. Configures IIS for AI web services (Flask, FastAPI, or ASP.NET Core).
  4. Optimizes IIS by enabling:
    • WebSockets (for real-time AI responses).
    • HTTP/2 (for faster API performance).
    • Dynamic compression (to reduce response times).
  5. Opens firewall ports for AI API access.
  6. Detects GPU hardware and installs onnxruntime-gpu (for NVIDIA, assuming CUDA is present) or onnxruntime-directml (for AMD/Intel).
  7. Restarts IIS to apply all configurations.

🔹 How to Use

  1. Save the script as setup-ai-server.ps1
  2. Run PowerShell as Administrator
  3. Execute the script:
    Set-ExecutionPolicy -Scope Process Bypass -Force
    .\setup-ai-server.ps1
    

🔹 After Running This Script

Your Windows Server 2022 will be fully prepared to:
✅ Host LLM APIs using IIS
✅ Run ONNX Runtime for AI model inference
✅ Support GPU acceleration with DirectML or CUDA
✅ Serve AI responses using Flask, FastAPI, or ASP.NET Core

🚀 Now you can deploy your LLM model and start serving AI requests!

So how do I Deploy the AI After This?

Deploying the AI Model After Running the Setup Script

Now that your Windows Server 2022 is ready with IIS, ONNX Runtime, and Python/.NET, follow these steps to deploy your Large Language Model (LLM) API.


🚀 Deployment Steps for LLM on IIS (Using Python/ONNX)

1. Choose Your AI API Framework

You can deploy the LLM as a web API using:

  • Python (Flask or FastAPI)
  • .NET Core (ASP.NET Web API)
  • Node.js (Express, but less common for AI inference on Windows Server)

For this example, we will use FastAPI (Python) because it’s lightweight, fast, and integrates well with ONNX Runtime.


2. Upload Your AI Model to the Server

If you converted your model to ONNX, upload it to C:\AIModels\. You can use:

  • SFTP/FTP to transfer files
  • Remote Desktop (RDP) copy-paste
  • PowerShell command to download the model:
    Invoke-WebRequest -Uri "https://your-model-source.com/gpt2.onnx" -OutFile "C:\AIModels\gpt2.onnx"
    

3. Install Python Dependencies (If Not Already Installed)

Log in to your server, open PowerShell, and run:

pip install onnxruntime fastapi uvicorn transformers

This ensures all necessary packages for inference and API deployment are installed.
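
Optionally, run a quick sanity check that the inference stack imports cleanly before building the API (the printed versions will vary with your environment):

# Sanity check: confirm the key packages import and print their versions.
import onnxruntime, fastapi, transformers

print("onnxruntime", onnxruntime.__version__)
print("fastapi", fastapi.__version__)
print("transformers", transformers.__version__)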


4. Create the AI API Using FastAPI

On the server, create a new folder for the AI service:

mkdir C:\AIService
cd C:\AIService

Create a Python script for the API:

notepad ai_server.py

Paste the following Python code to load the LLM and serve API requests:

from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load ONNX model
model_path = "C:/AIModels/gpt2.onnx"
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate_text(request: PromptRequest):
    # Tokenize the prompt and run a single forward pass with ONNX Runtime
    inputs = tokenizer(request.prompt, return_tensors="np")
    input_ids = inputs["input_ids"].astype(np.int64)

    # The exported model returns logits (batch, seq_len, vocab_size), not token IDs.
    # A greedy argmax is used here for demonstration; full text generation
    # requires an autoregressive decoding loop.
    outputs = session.run(None, {"input_ids": input_ids})
    predicted_ids = np.argmax(outputs[0], axis=-1)
    generated_text = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)

    return {"response": generated_text}

if __name__ == "__main__":
    # Allows "python ai_server.py"; alternatively run:
    # uvicorn ai_server:app --host 0.0.0.0 --port 8000
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Save the file and close Notepad.


5. Test the AI API Locally

Run the API manually first:

python ai_server.py

Then, in a web browser, open:

http://localhost:8000/docs

You should see the FastAPI Swagger UI to test the /generate endpoint.
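
You can also exercise the endpoint from a short client script instead of the Swagger UI; a sketch using the requests library (it assumes the JSON body shape of the PromptRequest model above):

# Simple client-side test of the /generate endpoint.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Hello, world"},  # matches the PromptRequest body model
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])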


6. Deploy the AI API as a Windows Service (Persistent)

Instead of running manually, let’s set it up as a Windows Service so it runs automatically.

  1. Install NSSM (Non-Sucking Service Manager) to create a persistent service (the command below assumes Chocolatey is installed; otherwise download NSSM from nssm.cc and add it to your PATH):
    choco install nssm -y
    
  2. Create a service that runs the AI API (adjust the Python path to your installation; an all-users install lands in C:\Program Files\Python310):
    nssm install AIService "C:\Program Files\Python310\python.exe" "C:\AIService\ai_server.py"
    
  3. Start the AI API service:
    nssm start AIService
    

Now, the AI service will automatically restart if the server reboots.


7. Configure IIS as a Reverse Proxy for Public Access

Now, let’s make the AI API publicly accessible via IIS.

  1. Install the URL Rewrite and Application Request Routing (ARR) modules (from the official IIS downloads; they are not included with a default IIS install).
  2. Open IIS Manager (inetmgr in PowerShell).
  3. Select the server node, open Application Request Routing Cache > Server Proxy Settings, and enable the proxy.
  4. Click Server Farms > Create Server Farm > Add:
    • localhost:8000 as the target backend (where FastAPI is running).
    • Enable load balancing if needed.
  5. Create a new IIS website (or a URL Rewrite rule) pointing to the proxy.

Now, you can access the AI API using:

https://your-server-ip-or-domain/generate

✅ Final Deployment Checklist

✔ AI Model Deployed in ONNX Format
✔ AI API Running via FastAPI on Port 8000
✔ AI Service Running as a Windows Service (Persistent)
✔ IIS Configured as a Reverse Proxy for Public Access

Now your AI is live and accessible from the web! 🚀

How Would I Publish the AI with .net Core?

Deploying LLM with .NET Core on Windows Server 2022 (IIS & ONNX)

To keep everything Microsoft-compatible, we’ll use .NET 6/7+, ONNX Runtime, and IIS for hosting the AI API.

This method ensures:
✅ Full Microsoft stack compatibility
✅ ONNX Runtime acceleration for inference
✅ IIS integration for a secure and scalable deployment


🚀 Steps to Deploy LLM in .NET Core on Windows Server 2022

1. Install Required Software

Ensure your Windows Server 2022 is ready:

# Install IIS and dependencies (ASP.NET Core is served via the Hosting Bundle, so no .NET Framework features are needed)
Install-WindowsFeature -Name Web-Server, Web-WebSockets -IncludeManagementTools

# Install .NET 7 Hosting Bundle (modify for .NET 6 if needed; verify the current URL at https://dotnet.microsoft.com/download/dotnet/7.0)
Invoke-WebRequest -Uri "https://download.visualstudio.microsoft.com/download/pr/7f8b8947-7b23-47d1-bbe1-88e9dc191ba2/1e3f5e4e8ff87edcf00584cc1edcb3f6/dotnet-hosting-7.0.10-win.exe" -OutFile "dotnet-hosting.exe"
Start-Process -FilePath "dotnet-hosting.exe" -ArgumentList "/quiet /norestart" -Wait
Remove-Item "dotnet-hosting.exe"

# ONNX Runtime is added per-project with "dotnet add package Microsoft.ML.OnnxRuntime" (see step 3 below)

2. Prepare AI Model

If you haven’t already converted your LLM to ONNX format, do this:

from transformers import AutoModelForCausalLM
import torch

# Convert model to ONNX format; naming the input "input_ids" lets the .NET code below reference it
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
torch.onnx.export(model, (torch.randint(0, 50256, (1, 64)),), "gpt2.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}})

Upload gpt2.onnx to C:\AIModels\.
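
Before wiring the model into the .NET API, it is worth confirming the exported file loads and checking its exact input/output names, since the C# code below refers to them by name (a small verification sketch, run on the server or your workstation):

# Verify the exported ONNX model loads and inspect its input/output names,
# which must match the names used by the .NET InferenceSession code.
import onnxruntime as ort

session = ort.InferenceSession("C:/AIModels/gpt2.onnx", providers=["CPUExecutionProvider"])

for i in session.get_inputs():
    print("input:", i.name, i.shape, i.type)
for o in session.get_outputs():
    print("output:", o.name, o.shape, o.type)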


3. Create .NET Core AI Web API

  1. Open PowerShell and create a new .NET Web API project:

    mkdir C:\AIService
    cd C:\AIService
    dotnet new webapi -n AIService
    cd AIService
    
  2. Add Dependencies:

    dotnet add package Microsoft.ML.OnnxRuntime
    dotnet add package Microsoft.AspNetCore.Cors
    
  3. Modify Program.cs to enable CORS, controllers, and Swagger:

    var builder = WebApplication.CreateBuilder(args);
    builder.Services.AddControllers();        // required so MapControllers() can find AIController
    builder.Services.AddEndpointsApiExplorer();
    builder.Services.AddSwaggerGen();         // enables the /swagger test UI used below
    builder.Services.AddCors(options =>
    {
        options.AddPolicy("AllowAll", policy =>
        {
            policy.AllowAnyOrigin().AllowAnyMethod().AllowAnyHeader();
        });
    });
    
    var app = builder.Build();
    app.UseSwagger();
    app.UseSwaggerUI();
    app.UseCors("AllowAll");
    app.MapControllers();
    app.Run();
    

4. Implement ONNX Model Inference

Replace Controllers/AIController.cs with this code:

using Microsoft.AspNetCore.Mvc;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

[Route("api/ai")]
[ApiController]
public class AIController : ControllerBase
{
    private static readonly InferenceSession _session = new InferenceSession("C:/AIModels/gpt2.onnx");

    [HttpPost("generate")]
    public IActionResult GenerateText([FromBody] InputData input)
    {
        // Tokenization of input.Prompt would be required here, but is simplified for demonstration.
        // GPT-2 expects 64-bit integer token IDs, so the tensor element type must be long, not float.
        var inputIds = new long[] { 1, 2, 3, 4, 5 }; // Example input tokens
        var container = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids", new DenseTensor<long>(inputIds, new int[] { 1, inputIds.Length }))
        };

        // The model returns raw logits; a real implementation would decode them back to text.
        var result = _session.Run(container);
        var output = result.First().AsEnumerable<float>().ToArray();

        return Ok(new { response = output });
    }
}

public class InputData
{
    public string Prompt { get; set; }
}

5. Test the API Locally

Run the API:

dotnet run --urls http://localhost:5000

Visit:

http://localhost:5000/swagger/index.html

and test the /api/ai/generate endpoint.
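
As with the Python deployment, you can also hit the endpoint from a short client script; a sketch using Python's requests library (it assumes ASP.NET Core's default case-insensitive JSON binding to InputData.Prompt):

# Test the .NET /api/ai/generate endpoint with a JSON body.
import requests

resp = requests.post(
    "http://localhost:5000/api/ai/generate",
    json={"prompt": "Hello from the client"},  # binds to InputData.Prompt
    timeout=60,
)
resp.raise_for_status()
print(resp.json())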


6. Publish & Deploy to IIS

  1. Publish the API:

    dotnet publish -c Release -o C:\AIService\publish
    
  2. Configure IIS:

    • Open IIS Manager (inetmgr)
    • Create a New Site:
      • Path: C:\AIService\publish
      • Port: 5000 (or 80 for public access)
    • Set the Application Pool's .NET CLR version to "No Managed Code" (ASP.NET Core is handled by the ASP.NET Core Module from the Hosting Bundle)
  3. Enable WebSockets & Compression in IIS:

    Set-WebConfigurationProperty -filter "system.webServer/webSocket" -name "enabled" -value "True" -PSPath "IIS:\"
    Set-WebConfigurationProperty -filter "system.webServer/urlCompression" -name "doDynamicCompression" -value "True" -PSPath "IIS:\"
    Set-WebConfigurationProperty -filter "system.webServer/httpCompression/scheme[@name='gzip']" -name "dynamicCompressionLevel" -value "9" -PSPath "IIS:\"
    
  4. Restart IIS:

    iisreset
    

Now your AI API is live at:

http://your-server-ip/api/ai/generate

✅ Final Deployment Summary

✔ Windows Server 2022 AI-ready
✔ ONNX LLM Model Deployed
✔ .NET Core API Running on IIS
✔ IIS Optimized with WebSockets & Compression
✔ Public AI API Ready to Serve LLM Responses! 🚀

Now you have a Microsoft-native LLM API running on IIS with .NET Core! 🎉

 
