media-api
Unified local AI media service — image generation, vision analysis, TTS, STT, OCR (image + PDF), segmentation, background removal, upscale, compositing.
Runs as a container on the api qube at :8096 (relocated from the ollama qube during the GPU/API split — GPU upstreams are reached over the WireGuard mesh with embedded basic-auth credentials). Accessed via MCP server (mcp-media) on the claude/flex qubes.
👉 docs/use-cases.md — practical recipes with example outputs for every endpoint, plus an honest list of where the local stack falls short of hosted alternatives.
CI + release. Not deployed on the k8s cluster; it runs locally on the qube. The CI pipeline (.woodpecker/ci.yml + release.yml) still runs on the Woodpecker instance from k8s-build-env; build artifacts land in the Forgejo OCI registry at git.loop-coop.net/loco/workloads-media-api.
Endpoints
| Method | Path | Backend |
|---|---|---|
| POST | /v1/images/generate | ComfyUI (SD3.5 / Juggernaut XL) |
| POST | /v1/images/generations | OpenAI-compat shim → /v1/images/generate |
| POST | /v1/images/analyze | Ollama vision (llama3.2-vision / qwen2.5vl / moondream) |
| POST | /v1/images/compose | PIL (in-process) |
| POST | /v1/images/ocr | Tesseract + vision fallback |
| POST | /v1/images/anonymize | OpenCV Haar cascade + pixelation |
| POST | /v1/audio/speech | Speaches Kokoro/piper TTS |
| POST | /v1/audio/transcribe | Speaches Whisper STT |
| POST | /v1/translate | LibreTranslate (Argos / CTranslate2) |
| POST | /v1/translate/detect | LibreTranslate language detection |
| GET | /v1/translate/languages | LibreTranslate loaded language pairs |
| GET | /v1/models | Ollama + image models (Ollama-shape and OpenAI-shape) |
| GET | /v1/jobs | Job queue snapshot |
| GET | /v1/jobs/{id} | Job status + result |
| DELETE | /v1/jobs/{id} | Cancel pending job |
| GET | /health | Service probe |
Add ?async=1 to any POST for async job submission; poll GET /v1/jobs/{id} for result.
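The submit-then-poll flow can be sketched as below. This is a minimal illustration, not the service's client: the job schema is not documented here, so the `status` field and its terminal values are assumptions, and `fetch_job` is a hypothetical callable that performs `GET /v1/jobs/{id}` and returns the parsed JSON.

```python
import time
from typing import Callable


def async_url(base_url: str, path: str) -> str:
    """Append ?async=1 so the POST is queued as a job."""
    return f"{base_url.rstrip('/')}{path}?async=1"


def wait_for_job(fetch_job: Callable[[str], dict], job_id: str,
                 interval: float = 1.0, timeout: float = 120.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll a job through fetch_job until it reaches a terminal state.

    ASSUMPTION: the job document carries a "status" field with the
    terminal values below; adjust to the real schema.
    """
    terminal = {"done", "failed", "cancelled"}  # hypothetical status names
    waited = 0.0
    while True:
        job = fetch_job(job_id)
        if job.get("status") in terminal:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still pending after {timeout}s")
        sleep(interval)
        waited += interval
```

Passing `fetch_job` and `sleep` as parameters keeps the loop testable without a running service; in real use `fetch_job` would wrap an HTTP GET against `/v1/jobs/{id}`.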
Image generation
Default model: SD3.5 medium (sd3.5_medium.safetensors) — better scene composition and prompt-following, ~18s at 1024px.
Override with checkpoint param:
- "checkpoint": "juggernaut": Juggernaut XL v9, ~15s, better for photorealistic portraits/faces
- "checkpoint": "sd3.5": explicit SD3.5 (same as default)
hires_fix: true — upscales output to 2048px via 4x-UltraSharp + img2img refinement pass (~70s total). Use for final/hero images.
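A request body for `/v1/images/generate` using these options might be assembled as follows. The field names `prompt`, `checkpoint` and `hires_fix` come from this README; the helper `generate_payload` and its validation are hypothetical, and the service may accept further checkpoint aliases.

```python
from typing import Optional

# The two checkpoint values this README names (assumed exhaustive here).
KNOWN_CHECKPOINTS = {"sd3.5", "juggernaut"}


def generate_payload(prompt: str, checkpoint: Optional[str] = None,
                     hires_fix: bool = False) -> dict:
    """Build a JSON body for POST /v1/images/generate."""
    if checkpoint is not None and checkpoint not in KNOWN_CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}")
    payload: dict = {"prompt": prompt}
    if checkpoint is not None:
        payload["checkpoint"] = checkpoint  # omit to get the SD3.5 default
    if hires_fix:
        payload["hires_fix"] = True  # 2048px output, roughly 70s total
    return payload
```

For a hero portrait, `generate_payload("studio portrait", checkpoint="juggernaut", hires_fix=True)` yields a body that selects Juggernaut XL and the upscale pass.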
OpenAI-compatible endpoint
POST /v1/images/generations accepts the OpenAI Images API shape:
{"model": "sd3.5", "prompt": "a cat", "n": 1, "size": "1024x1024", "response_format": "b64_json"}
…and returns {"created": <ts>, "data": [{"b64_json": "..."}]}. Used by clients that speak the OpenAI API verbatim (e.g. Nextcloud's integration_openai). Accepted model values: sd3.5, juggernaut, sdxl, or omit to use the auto-detected default. Unsupported sizes are rounded to the nearest SDXL/SD3.5 native size.
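A client that does not use an OpenAI SDK can build the request and decode the response by hand. The sketch below only relies on the request and response shapes shown above; the function names are hypothetical.

```python
import base64
from typing import List


def openai_image_request(prompt: str, model: str = "sd3.5",
                         size: str = "1024x1024") -> dict:
    """Build a body for POST /v1/images/generations (OpenAI Images shape)."""
    return {"model": model, "prompt": prompt, "n": 1,
            "size": size, "response_format": "b64_json"}


def decode_images(response: dict) -> List[bytes]:
    """Turn {"created": ..., "data": [{"b64_json": ...}]} into raw image bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]
```

Each element returned by `decode_images` can be written straight to a file opened in binary mode.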
Default params
| Param | SD3.5 default | Juggernaut recommended |
|---|---|---|
| steps | 20 | 25–35 |
| cfg | 4.5 | 6.5–7.0 |
| sampler | euler | dpmpp_2m |
| scheduler | sgm_uniform | karras |
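The table translates naturally into per-checkpoint presets. In this sketch the Juggernaut values are one pick from the recommended ranges (30 steps, cfg 6.5), chosen for illustration only, and it is an assumption that the request body accepts these four keys under exactly these names.

```python
# SD3.5 values are the defaults from the table above.
SD35_DEFAULTS = {"steps": 20, "cfg": 4.5,
                 "sampler": "euler", "scheduler": "sgm_uniform"}

# Illustrative picks inside the recommended ranges (25-35 steps, cfg 6.5-7.0).
JUGGERNAUT_RECOMMENDED = {"steps": 30, "cfg": 6.5,
                          "sampler": "dpmpp_2m", "scheduler": "karras"}


def sampling_params(checkpoint: str = "sd3.5", **overrides) -> dict:
    """Return sampler settings for a checkpoint, with caller overrides applied last."""
    base = JUGGERNAUT_RECOMMENDED if checkpoint == "juggernaut" else SD35_DEFAULTS
    return {**base, **overrides}
```

Merging the result into a generate request, for example `{"prompt": "...", "checkpoint": "juggernaut", **sampling_params("juggernaut")}`, keeps the Juggernaut settings in one place.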
Modes
| Mode | Required params |
|---|---|
| txt2img | prompt |
| img2img | prompt + image (base64) + strength (0.3–0.9) |
| inpaint | prompt + image + mask (white=replace) |
| ControlNet | controlnet.image + controlnet.model + controlnet.strength |
| IP-Adapter | ipadapter.image + ipadapter.model + ipadapter.strength |
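As an illustration of the img2img and inpaint rows, the helpers below base64-encode the input images and assemble the body. Treating 0.3–0.9 as a hard limit on `strength` is an assumption (the service may simply recommend that range), and the helper names are hypothetical.

```python
import base64


def _b64(data: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(data).decode("ascii")


def img2img_payload(prompt: str, image: bytes, strength: float = 0.6) -> dict:
    """Body for img2img: prompt + base64 image + strength."""
    if not 0.3 <= strength <= 0.9:  # assumed hard bounds, see lead-in
        raise ValueError("strength should be between 0.3 and 0.9")
    return {"prompt": prompt, "image": _b64(image), "strength": strength}


def inpaint_payload(prompt: str, image: bytes, mask: bytes) -> dict:
    """Body for inpaint: white areas of the mask are replaced."""
    return {"prompt": prompt, "image": _b64(image), "mask": _b64(mask)}
```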
Required ComfyUI models
| File | Location | Purpose |
|---|---|---|
| sd3.5_medium.safetensors | models/checkpoints/ | SD3.5 base |
| clip_l.safetensors | models/clip/ | SD3.5 CLIP-L |
| clip_g.safetensors | models/clip/ | SD3.5 CLIP-G |
| juggernaut_xl_v9.safetensors | models/checkpoints/ | SDXL fallback |
| 4x-UltraSharp.pth | models/upscale_models/ | hires_fix upscaler |
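A quick pre-flight check for these files could look like the sketch below. The file names and relative locations are taken from the table; the ComfyUI root directory is whatever path the installation uses, and `missing_models` is a hypothetical helper, not part of media-api.

```python
from pathlib import Path
from typing import List, Union

# Relative path under the ComfyUI root -> purpose (from the table above).
REQUIRED_MODELS = {
    "models/checkpoints/sd3.5_medium.safetensors": "SD3.5 base",
    "models/clip/clip_l.safetensors": "SD3.5 CLIP-L",
    "models/clip/clip_g.safetensors": "SD3.5 CLIP-G",
    "models/checkpoints/juggernaut_xl_v9.safetensors": "SDXL fallback",
    "models/upscale_models/4x-UltraSharp.pth": "hires_fix upscaler",
}


def missing_models(comfyui_root: Union[str, Path]) -> List[str]:
    """Return the required model paths that are not present under comfyui_root."""
    root = Path(comfyui_root)
    return [rel for rel in REQUIRED_MODELS if not (root / rel).is_file()]
```

An empty list means every required file was found; otherwise the list names what still has to be downloaded.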
Vision (/v1/images/analyze, OCR fallback)
Picked by free GPU VRAM (_pick_vision_model):
| Free VRAM | Model | Notes |
|---|---|---|
| ≥ 8.5 GB | llama3.2-vision:11b | richest reasoning; ~7.9 GB resident |
| ≥ 5.5 GB | qwen2.5vl:7b | best OCR + layout among small VLMs; ~5 GB |
| else | moondream | tight-VRAM fallback; ~1.7 GB |
Override via the model field in the request body. Pull the models on the ollama qube:
ollama pull qwen2.5vl:7b
ollama pull llama3.2-vision:11b
ollama pull moondream
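The selection rule in the table can be sketched as a small function. This is not the actual `_pick_vision_model` implementation: the real function's signature is not shown here, and converting megabytes to gigabytes with a factor of 1024 is an assumption.

```python
from typing import Optional


def pick_vision_model(free_vram_mb: float, requested: Optional[str] = None) -> str:
    """Choose a vision model from free VRAM, honouring an explicit override."""
    if requested:                      # the request body's model field wins
        return requested
    free_gb = free_vram_mb / 1024      # assumed MB -> GB conversion
    if free_gb >= 8.5:
        return "llama3.2-vision:11b"
    if free_gb >= 5.5:
        return "qwen2.5vl:7b"
    return "moondream"                 # tight-VRAM fallback
```

Checking the thresholds from largest to smallest means the richest model that fits is always preferred.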
TTS
Speaches Kokoro (default voice af_bella) or piper shorthand (en_GB-alba-medium).
{ "input": "Hello world", "voice": "af_bella" }
STT
Speaches Whisper (Systran/faster-whisper-small). Accepts OGG, WAV, MP3, M4A.
{ "audio": "<base64>", "language": "en" }
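Building the transcription body from a local audio file is mostly a matter of base64-encoding its bytes. The sketch below checks the file extension against the formats listed above; that check is a client-side convenience and an assumption about how strictly the service validates input, and `transcribe_payload` is a hypothetical helper.

```python
import base64
from pathlib import Path

# Extensions matching the formats this README lists for STT.
ACCEPTED_SUFFIXES = {".ogg", ".wav", ".mp3", ".m4a"}


def transcribe_payload(filename: str, audio: bytes, language: str = "en") -> dict:
    """Build a JSON body for POST /v1/audio/transcribe."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ACCEPTED_SUFFIXES:
        raise ValueError(f"unsupported audio format: {suffix or filename!r}")
    return {"audio": base64.b64encode(audio).decode("ascii"),
            "language": language}
```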
Health check
curl http://localhost:8096/health
{
  "version": "v0.11.0",
  "status": "ok",
  "services": { "ollama": true, "comfyui": true, "speaches": true, "tts": true },
  "image_model": "sd3.5_medium.safetensors",
  "vision_model": "qwen2.5vl:7b",
  "vram_free_mb": 9299,
  "queue_depth": 0
}
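A monitoring script can reduce this response to a single readiness flag. The sketch assumes the response keeps the shape shown above; treating any `false` entry under `services` as "not ready" is a policy choice made here, not something the service prescribes.

```python
from typing import List


def down_services(health: dict) -> List[str]:
    """Names of upstream services the health response reports as unavailable."""
    return sorted(name for name, up in health.get("services", {}).items() if not up)


def is_ready(health: dict) -> bool:
    """True when status is "ok" and every listed upstream is reachable."""
    return health.get("status") == "ok" and not down_services(health)
```

`down_services` is also useful on its own for logging which backend (Ollama, ComfyUI, Speaches) has dropped off the mesh.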
MCP server
The MCP server (/usr/bin/mcp-media) wraps all endpoints as Claude tools. Registered in ~/.mcp.json as the media server on claude/flex qubes.
Tools: generate_image, stylize_image, analyze_image, compose_images, synthesize_speech, transcribe_audio, ocr, anonymize_image, list_jobs, get_job, cancel_job.
stylize_image is the high-level entry point for visual style transfer — it wraps generate_image with IP-Adapter (reference-based) or img2img + depth ControlNet (prompt-based, structure-preserving).
