🛡️ Project Info

CUDA: 11.8  |  Python: 3.9 / 3.10  |  Backend: CU111 / CU112  |  GPU: CMP 40HX 8GB  |  Model: Wan 2.2 14B

This project is a personal initiative. I am open to further discussions and collaborations. Please contact me directly for technical details and the latest updates.

💡 Concept & Architecture: Trần Hiếu Nghĩa (Timmyo/V)
🤖 Implementation: Assisted by AI Systems (Gemini, Claude, ChatGPT, Grok)
🌟 Inspiration: ggml, llama.cpp-Python, sd.cpp-Python, ComfyUI-GGUF, Diffusers

📖 Overview

AIO-GGUF Backend V5 is a comprehensive solution designed to run large AI models (such as Wan 2.2 14B and Stable Diffusion XL) on resource-constrained hardware.

The project has been run successfully on a CMP 40HX 8GB GPU (modded to PCIe 1.1 x16), using a CUDA 11.8 environment with the CU112 backend provided by Driver 460.89.

🚀 Optimization Journey: From Crash/1000s+ down to 153s/it

This project shows how far aging hardware can be pushed. With Wan 2.2 14B, one of the newest AI models, the narrow bandwidth of PCIe 1.1 x16 initially caused a severe bottleneck (1000s/it) and system crashes. The core techniques applied:

Bypassing Bandwidth Walls with Pinned Memory (DMA): Rewrote the low-level memory management path around a pin_memory=True buffer. This forces the OS to allocate page-locked physical memory, so the GPU can pull data from system RAM by DMA without the extra CPU-side copy that pageable memory requires.
Core Attention Tuning for Turing Architecture: Restructured the Scaled Dot-Product Attention (SDPA) function. Forcing local processing works around native instruction set limitations of the CMP 40HX GPU and keeps the Tensor Cores as busy as possible.
Data Flow & Batch Size Control: Eliminated redundant computation by tuning the CFG and KSampler parameters, keeping the batch size minimal to halve the data load.
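
The pinned-memory transfer described in the first technique above can be sketched roughly as follows. This is a minimal illustration and not the project's actual memory manager: the helper names `make_staging_buffer` and `upload` are hypothetical, the tensor is a random stand-in for one layer's weights, and the code falls back to ordinary pageable memory when no CUDA device is present.

```python
import torch

def make_staging_buffer(shape, dtype):
    """Allocate a reusable host buffer (hypothetical helper).

    The buffer is page-locked only when CUDA is available, because
    pin_memory=True needs a CUDA-capable runtime.
    """
    return torch.empty(shape, dtype=dtype, pin_memory=torch.cuda.is_available())

def upload(weights_cpu, staging, device):
    """Copy weights into the staging buffer, then move them to `device`."""
    staging.copy_(weights_cpu)  # host-to-host copy into the staging buffer
    # non_blocking=True lets the host-to-device copy run asynchronously,
    # which only takes effect when the source buffer is pinned.
    return staging.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for one layer's weights; a real model would supply these.
weights = torch.randn(1024, 1024).to(torch.float16)

staging = make_staging_buffer(weights.shape, weights.dtype)
on_device = upload(weights, staging, device)

if device.type == "cuda":
    # Wait for the asynchronous copy to finish before the staging
    # buffer is overwritten with the next layer's weights.
    torch.cuda.synchronize()

print(on_device.device, on_device.dtype)
```

Reusing one staging buffer avoids repeated page-locked allocations, which are comparatively expensive; the cost is that each upload has to complete before the buffer is refilled.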

🔥 The Result:

From total system crashes to a successfully generated video using Wan 2.2 14B (81 frames, 480x480, 32 FPS) at a stable 153s/it.
A major breakthrough on legacy PCIe 1.1 x16, and optimization continues!

🧠 AIO-GGUF Backend (Alpha V0.0.1)

I'm building a high-performance GGUF-to-Diffusers backend that leverages (llama.cpp/sd.cpp).dll to load and run quantized models directly through Diffusers. It's designed to be much lighter and faster than the native Safetensors pipeline, specifically addressing the current optimization gaps in Diffusers' GGUF implementation.

# Frameworks supported:
ggml · llama.cpp-Python · sd.cpp-Python · ComfyUI-GGUF · Diffusers

# Hardware tested:
GPU CMP 40HX 8GB  |  PCIe 1.1 x16 (modded)  |  CUDA 11.8  |  Driver 460.89

# Models:
Wan 2.2 14B  |  Stable Diffusion XL  |  Quantized GGUF variants

⚠️ Alpha stage — APIs and interfaces may change. This is a research/optimization project targeting legacy CUDA hardware.