Python 99.9%
Dockerfile 0.1%

woodpecker-bump 114bac2fd4 chore(release): v5.9.2 [skip ci]		2026-05-18 13:30:20 +00:00
.woodpecker	chore(ci): pin shared-ci-shared:v1, swap to shared-python-base:v1	2026-05-13 12:03:44 +02:00
docs	docs: media-api operations + translation pipeline notes	2026-05-09 18:24:02 +02:00
media_api	fix(documents): full HTML entity decode in mode=pages	2026-05-14 21:31:11 +02:00
tests	fix(documents): full HTML entity decode in mode=pages	2026-05-14 21:31:11 +02:00
.gitignore	chore(sonar): wire SonarCloud project config	2026-05-06 00:24:03 +02:00
bao.yml	chore(ci): consolidate to single pipeline + unified bao-checks	2026-05-01 17:31:43 +02:00
CHANGELOG.md	chore(release): v5.9.2 [skip ci]	2026-05-18 13:30:20 +00:00
CLAUDE.md	feat(images): add stylize_image MCP tool	2026-05-10 15:28:35 +02:00
Containerfile	refactor!: split media_api.py into a media_api/ package + add tests	2026-05-05 17:33:09 +02:00
pytest.ini	test: add live integration suite against deployed media-api	2026-05-05 18:18:46 +02:00
README.md	feat(images): add stylize_image MCP tool	2026-05-10 15:28:35 +02:00
requirements-dev.txt	test(conformance): wire schemathesis OpenAPI fuzz against FastAPI	2026-05-05 23:40:21 +02:00
requirements.txt	feat(documents): extract_tables endpoint (pdfplumber + vision fallback)	2026-05-14 08:03:59 +02:00
sonar-project.properties	chore(sonar): wire SonarCloud project config	2026-05-06 00:24:03 +02:00

README.md

media-api

Unified local AI media service — image generation, vision analysis, TTS, STT, OCR (image + PDF), segmentation, background removal, upscale, compositing.

Runs as a container on the api qube at :8096. It was relocated from the ollama qube during the GPU/API split; GPU upstreams are reached over the WireGuard mesh with embedded basic-auth credentials. Clients access it through the MCP server (mcp-media) on the claude/flex qubes.

👉 docs/use-cases.md — practical recipes with example outputs for every endpoint, plus an honest list of where the local stack falls short of hosted alternatives.

CI + release. Not deployed on the k8s cluster — runs locally on the qube. The CI pipeline (.woodpecker/ci.yml + release.yml) still runs on the Woodpecker instance from k8s-build-env; build artifacts land in the Forgejo OCI registry at git.loop-coop.net/loco/workloads-media-api.

Endpoints

Method	Path	Backend
POST	`/v1/images/generate`	ComfyUI (SD3.5 / Juggernaut XL)
POST	`/v1/images/generations`	OpenAI-compat shim → `/v1/images/generate`
POST	`/v1/images/analyze`	Ollama vision (llama3.2-vision / qwen2.5vl / moondream)
POST	`/v1/images/compose`	PIL (in-process)
POST	`/v1/images/ocr`	Tesseract + vision fallback
POST	`/v1/images/anonymize`	OpenCV Haar cascade + pixelation
POST	`/v1/audio/speech`	Speaches Kokoro/piper TTS
POST	`/v1/audio/transcribe`	Speaches Whisper STT
POST	`/v1/translate`	LibreTranslate (Argos / CTranslate2)
POST	`/v1/translate/detect`	LibreTranslate language detection
GET	`/v1/translate/languages`	LibreTranslate loaded language pairs
GET	`/v1/models`	Ollama + image models (Ollama-shape and OpenAI-shape)
GET	`/v1/jobs`	Job queue snapshot
GET	`/v1/jobs/{id}`	Job status + result
DELETE	`/v1/jobs/{id}`	Cancel pending job
GET	`/health`	Service probe

Append ?async=1 to any POST request to submit it as an asynchronous job, then poll GET /v1/jobs/{id} for the result.
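The submit-then-poll flow might look roughly like the sketch below. This README does not document the job payload, so the `status` field and the `done` / `failed` / `cancelled` values are assumptions, and `fetch_job` is a hypothetical callable that would wrap `GET /v1/jobs/{id}` in a real client.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states; the real job schema may use different names.
TERMINAL_STATES = {"done", "failed", "cancelled"}


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a job until it reaches an assumed terminal state or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any particular HTTP library and makes it easy to exercise without a running service.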

Image generation

Default model: SD3.5 medium (sd3.5_medium.safetensors) — better scene composition and prompt-following, ~18s at 1024px.

Override with checkpoint param:

"checkpoint": "juggernaut" — Juggernaut XL v9, ~15s, better for photorealistic portraits/faces
"checkpoint": "sd3.5" — explicit SD3.5 (same as default)

hires_fix: true — upscales output to 2048px via 4x-UltraSharp + img2img refinement pass (~70s total). Use for final/hero images.

OpenAI-compatible endpoint

POST /v1/images/generations accepts the OpenAI Images API shape:

{"model": "sd3.5", "prompt": "a cat", "n": 1, "size": "1024x1024", "response_format": "b64_json"}

The response has the shape {"created": <ts>, "data": [{"b64_json": "..."}]}. This endpoint serves clients that speak the OpenAI API verbatim (e.g. Nextcloud's integration_openai). Accepted model values are sd3.5, juggernaut, and sdxl; omit the field to use the auto-detected default. Unsupported sizes are rounded to the nearest SDXL/SD3.5 native size.
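A client could build the request body and unpack the response along these lines. The helper names `build_generations_body` and `decode_images` are hypothetical; only the field names come from the shapes shown above.

```python
import base64
import json
from typing import Any, Dict, List


def build_generations_body(prompt: str, model: str = "sd3.5", size: str = "1024x1024") -> str:
    """Serialize a request body in the OpenAI Images API shape."""
    return json.dumps(
        {"model": model, "prompt": prompt, "n": 1, "size": size, "response_format": "b64_json"}
    )


def decode_images(response: Dict[str, Any]) -> List[bytes]:
    """Decode every b64_json entry in the response into raw image bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]
```

The decoded bytes can be written straight to disk; the image format itself is not specified in this README.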

Default params

Param	SD3.5 default	Juggernaut recommended
steps	20	25–35
cfg	4.5	6.5–7.0
sampler	euler	dpmpp_2m
scheduler	sgm_uniform	karras
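One way to keep these values in client code is a small preset table. The sketch assumes the four params are top-level request fields alongside `prompt` and `checkpoint`; the Juggernaut entry picks a single point from the recommended ranges, and `with_preset` is a hypothetical helper.

```python
from typing import Any, Dict

# SD3.5 values are the documented defaults; the Juggernaut values are one
# point picked from the recommended ranges (steps 25-35, cfg 6.5-7.0).
SAMPLER_PRESETS: Dict[str, Dict[str, Any]] = {
    "sd3.5": {"steps": 20, "cfg": 4.5, "sampler": "euler", "scheduler": "sgm_uniform"},
    "juggernaut": {"steps": 30, "cfg": 7.0, "sampler": "dpmpp_2m", "scheduler": "karras"},
}


def with_preset(prompt: str, checkpoint: str = "sd3.5", **overrides: Any) -> Dict[str, Any]:
    """Merge a checkpoint preset with caller overrides into a request body."""
    if checkpoint not in SAMPLER_PRESETS:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}")
    body = {"prompt": prompt, "checkpoint": checkpoint, **SAMPLER_PRESETS[checkpoint]}
    body.update(overrides)  # explicit caller values win over the preset
    return body
```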

Modes

Mode	Required params
txt2img	`prompt`
img2img	`prompt` + `image` (base64) + `strength` (0.3–0.9)
inpaint	`prompt` + `image` + `mask` (white=replace)
ControlNet	`controlnet.image` + `controlnet.model` + `controlnet.strength`
IP-Adapter	`ipadapter.image` + `ipadapter.model` + `ipadapter.strength`
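The table can be turned into a client-side check before sending a request. This sketch assumes the dotted names refer to keys inside a nested object (for example `body["controlnet"]["image"]`); the mode keys and `missing_params` are hypothetical and are not part of the service.

```python
from typing import Any, Dict, List

# Required fields per mode, transcribed from the table above.
REQUIRED: Dict[str, List[str]] = {
    "txt2img": ["prompt"],
    "img2img": ["prompt", "image", "strength"],
    "inpaint": ["prompt", "image", "mask"],
    "controlnet": ["controlnet.image", "controlnet.model", "controlnet.strength"],
    "ipadapter": ["ipadapter.image", "ipadapter.model", "ipadapter.strength"],
}


def missing_params(mode: str, body: Dict[str, Any]) -> List[str]:
    """Return the required fields for `mode` that are absent from `body`."""
    missing = []
    for dotted in REQUIRED[mode]:
        node: Any = body
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]  # descend into the nested object
    return missing
```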

Required ComfyUI models

File	Location	Purpose
`sd3.5_medium.safetensors`	`models/checkpoints/`	SD3.5 base
`clip_l.safetensors`	`models/clip/`	SD3.5 CLIP-L
`clip_g.safetensors`	`models/clip/`	SD3.5 CLIP-G
`juggernaut_xl_v9.safetensors`	`models/checkpoints/`	SDXL fallback
`4x-UltraSharp.pth`	`models/upscale_models/`	hires_fix upscaler

Vision (`/v1/images/analyze`, OCR fallback)

Picked by free GPU VRAM (_pick_vision_model):

Free VRAM	Model	Notes
≥ 8.5 GB	`llama3.2-vision:11b`	richest reasoning; ~7.9 GB resident
≥ 5.5 GB	`qwen2.5vl:7b`	best OCR + layout among small VLMs; ~5 GB
else	`moondream`	tight-VRAM fallback; ~1.7 GB
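The thresholds in the table amount to a simple cascade. The function below is an illustrative re-statement, not the actual `_pick_vision_model` source, which may measure VRAM in different units or apply extra conditions.

```python
def pick_vision_model(free_vram_gb: float) -> str:
    """Pick the largest vision model whose threshold fits in free VRAM."""
    if free_vram_gb >= 8.5:
        return "llama3.2-vision:11b"
    if free_vram_gb >= 5.5:
        return "qwen2.5vl:7b"
    return "moondream"  # tight-VRAM fallback
```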

Override via the model field in the request body. Pull the models on the ollama qube:

ollama pull qwen2.5vl:7b
ollama pull llama3.2-vision:11b
ollama pull moondream

TTS

Speaches Kokoro (default voice af_bella) or piper shorthand (en_GB-alba-medium).

{ "input": "Hello world", "voice": "af_bella" }

STT

Speaches Whisper (Systran/faster-whisper-small). Accepts OGG, WAV, MP3, M4A.

{ "audio": "<base64>", "language": "en" }
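Because the audio travels as base64 inside the JSON body, a client has to encode the raw file bytes first. A minimal sketch, with `build_transcribe_body` as a hypothetical helper name:

```python
import base64
import json


def build_transcribe_body(audio: bytes, language: str = "en") -> str:
    """Base64-encode raw audio bytes into the JSON body shown above."""
    encoded = base64.b64encode(audio).decode("ascii")
    return json.dumps({"audio": encoded, "language": language})
```

The bytes would typically come from reading an OGG, WAV, MP3, or M4A file in binary mode.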

Health check

curl http://localhost:8096/health

{
  "version": "v0.11.0",
  "status": "ok",
  "services": { "ollama": true, "comfyui": true, "speaches": true, "tts": true },
  "image_model": "sd3.5_medium.safetensors",
  "vision_model": "qwen2.5vl:7b",
  "vram_free_mb": 9299,
  "queue_depth": 0
}
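A monitoring script could reduce this payload to a yes/no answer. The readiness rule used here (status is `ok` and every entry under `services` is true) is an assumption for illustration, and both function names are hypothetical.

```python
from typing import Any, Dict, List


def down_services(health: Dict[str, Any]) -> List[str]:
    """List the upstream services that the health payload reports as down."""
    return sorted(name for name, up in health.get("services", {}).items() if not up)


def is_ready(health: Dict[str, Any]) -> bool:
    """Treat the service as ready when status is ok and no upstream is down."""
    return health.get("status") == "ok" and not down_services(health)
```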

MCP server

The MCP server (/usr/bin/mcp-media) wraps all endpoints as Claude tools. Registered in ~/.mcp.json as the media server on claude/flex qubes.

Tools: generate_image, stylize_image, analyze_image, compose_images, synthesize_speech, transcribe_audio, ocr, anonymize_image, list_jobs, get_job, cancel_job.

stylize_image is the high-level entry point for visual style transfer — it wraps generate_image with IP-Adapter (reference-based) or img2img + depth ControlNet (prompt-based, structure-preserving).
