
media-api

Unified local AI media service — image generation, vision analysis, TTS, STT, OCR (image + PDF), segmentation, background removal, upscale, compositing.

Runs as a container on the api qube at :8096. It was relocated from the ollama qube during the GPU/API split, so GPU upstreams are now reached over the WireGuard mesh with embedded basic-auth credentials. Clients on the claude/flex qubes access it through the MCP server (mcp-media).

👉 docs/use-cases.md — practical recipes with example outputs for every endpoint, plus an honest list of where the local stack falls short of hosted alternatives.

Figure: hero image showing three generative outputs and a cat cutout.

CI + release. Not deployed on the k8s cluster — runs locally on the qube. The CI pipeline (.woodpecker/ci.yml + release.yml) still runs on the Woodpecker instance from k8s-build-env; build artifacts land in the Forgejo OCI registry at git.loop-coop.net/loco/workloads-media-api.

Endpoints

| Method | Path | Backend |
|---|---|---|
| POST | /v1/images/generate | ComfyUI (SD3.5 / Juggernaut XL) |
| POST | /v1/images/generations | OpenAI-compat shim → /v1/images/generate |
| POST | /v1/images/analyze | Ollama vision (llama3.2-vision / qwen2.5vl / moondream) |
| POST | /v1/images/compose | PIL (in-process) |
| POST | /v1/images/ocr | Tesseract + vision fallback |
| POST | /v1/images/anonymize | OpenCV Haar cascade + pixelation |
| POST | /v1/audio/speech | Speaches Kokoro/piper TTS |
| POST | /v1/audio/transcribe | Speaches Whisper STT |
| POST | /v1/translate | LibreTranslate (Argos / CTranslate2) |
| POST | /v1/translate/detect | LibreTranslate language detection |
| GET | /v1/translate/languages | LibreTranslate loaded language pairs |
| GET | /v1/models | Ollama + image models (Ollama-shape and OpenAI-shape) |
| GET | /v1/jobs | Job queue snapshot |
| GET | /v1/jobs/{id} | Job status + result |
| DELETE | /v1/jobs/{id} | Cancel pending job |
| GET | /health | Service probe |

Add ?async=1 to any POST for async job submission; poll GET /v1/jobs/{id} for result.
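The submit-then-poll flow can be sketched roughly as below. This is only an illustration: the `id` field in the submission response, the `status` field in the job record, and the terminal status values are assumptions, not confirmed by this README. The `request` parameter is injectable so the loop can be exercised without a running service.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8096"  # port from this README; adjust for your host

# ASSUMPTION: hypothetical terminal status names; check the real job schema.
TERMINAL_STATUSES = {"done", "failed", "cancelled"}


def http_json(method, url, payload=None):
    """Send an optional JSON body and decode the JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit_and_wait(path, payload, request=http_json, interval=1.0,
                    timeout=300.0, sleep=time.sleep):
    """POST to `path` with ?async=1, then poll GET /v1/jobs/{id} until it settles."""
    job = request("POST", f"{BASE_URL}{path}?async=1", payload)
    job_id = job["id"]  # ASSUMPTION: submission response carries an "id" field
    deadline = time.monotonic() + timeout
    while True:
        state = request("GET", f"{BASE_URL}/v1/jobs/{job_id}")
        # ASSUMPTION: job records expose a "status" field
        if state.get("status") in TERMINAL_STATUSES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

A call such as `submit_and_wait("/v1/images/generate", {"prompt": "a cat"})` would then block until the job reaches one of the assumed terminal states.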

Image generation

Default model: SD3.5 medium (sd3.5_medium.safetensors) — better scene composition and prompt-following, ~18s at 1024px.

Override with checkpoint param:

  • "checkpoint": "juggernaut" — Juggernaut XL v9, ~15s, better for photorealistic portraits/faces
  • "checkpoint": "sd3.5" — explicit SD3.5 (same as default)

hires_fix: true — upscales output to 2048px via 4x-UltraSharp + img2img refinement pass (~70s total). Use for final/hero images.

OpenAI-compatible endpoint

POST /v1/images/generations accepts the OpenAI Images API shape:

{"model": "sd3.5", "prompt": "a cat", "n": 1, "size": "1024x1024", "response_format": "b64_json"}

…and returns {"created": <ts>, "data": [{"b64_json": "..."}]}. Used by clients that speak the OpenAI API verbatim (e.g. Nextcloud's integration_openai). Accepted model values: sd3.5, juggernaut, sdxl, or omit to use the auto-detected default. Unsupported sizes are rounded to the nearest SDXL/SD3.5 native size.
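As a rough client-side sketch of the request and response shapes described above, the helpers below build an OpenAI-style body and decode the returned `b64_json` entries into raw image bytes. The helper names are hypothetical, and rejecting unknown model names on the client is a choice made for this example, not documented server behaviour.

```python
import base64

# Model names listed in this README for the OpenAI-compatible endpoint.
ACCEPTED_MODELS = {"sd3.5", "juggernaut", "sdxl"}


def build_generations_request(prompt, model=None, size="1024x1024", n=1):
    """Build a body for POST /v1/images/generations (OpenAI Images shape)."""
    body = {"prompt": prompt, "n": n, "size": size, "response_format": "b64_json"}
    if model is not None:
        if model not in ACCEPTED_MODELS:
            raise ValueError(f"unsupported model: {model!r}")
        body["model"] = model  # omit the key to use the auto-detected default
    return body


def decode_images(response):
    """Turn {"data": [{"b64_json": ...}, ...]} into a list of image bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]
```

Each element returned by `decode_images` can be written straight to disk in binary mode.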

Default params

| Param | SD3.5 default | Juggernaut recommended |
|---|---|---|
| steps | 20 | 25-35 |
| cfg | 4.5 | 6.5-7.0 |
| sampler | euler | dpmpp_2m |
| scheduler | sgm_uniform | karras |
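A minimal sketch of how a client might apply the Juggernaut settings when switching checkpoints. It assumes that `steps`, `cfg`, `sampler`, and `scheduler` are accepted as top-level request fields under exactly those names, which the README implies but does not state. The values 30 and 7.0 are illustrative picks for the Juggernaut column, and `build_generate_request` is a hypothetical helper.

```python
# Illustrative picks for the Juggernaut column (not service defaults).
JUGGERNAUT_RECOMMENDED = {
    "steps": 30,
    "cfg": 7.0,
    "sampler": "dpmpp_2m",
    "scheduler": "karras",
}


def build_generate_request(prompt, checkpoint=None, hires_fix=False, **overrides):
    """Build a body for POST /v1/images/generate.

    With no checkpoint, nothing extra is sent and the service applies its
    own SD3.5 defaults.
    """
    body = {"prompt": prompt}
    if checkpoint is not None:
        body["checkpoint"] = checkpoint
    if checkpoint == "juggernaut":
        # ASSUMPTION: sampling params are top-level request fields.
        body.update(JUGGERNAUT_RECOMMENDED)
    if hires_fix:
        body["hires_fix"] = True
    body.update(overrides)  # explicit caller values win over the presets
    return body
```

For example, `build_generate_request("studio portrait", checkpoint="juggernaut", hires_fix=True)` yields a body with the Juggernaut sampler settings and the hires pass enabled.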

Modes

| Mode | Required params |
|---|---|
| txt2img | prompt |
| img2img | prompt + image (base64) + strength (0.3-0.9) |
| inpaint | prompt + image + mask (white=replace) |
| ControlNet | controlnet.image + controlnet.model + controlnet.strength |
| IP-Adapter | ipadapter.image + ipadapter.model + ipadapter.strength |
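The payload shapes for these modes might look roughly like the sketch below. Several things here are assumptions: that dotted names such as `controlnet.image` map to a nested JSON object, that the mask and ControlNet image are base64-encoded the same way as `image`, and that checking the strength range on the client is useful. All helper names are hypothetical.

```python
import base64


def b64(data: bytes) -> str:
    """Base64-encode raw bytes into an ASCII string for a JSON body."""
    return base64.b64encode(data).decode("ascii")


def img2img_request(prompt, image: bytes, strength=0.6):
    """img2img: prompt + source image + strength."""
    if not 0.3 <= strength <= 0.9:
        raise ValueError("strength should be between 0.3 and 0.9")
    return {"prompt": prompt, "image": b64(image), "strength": strength}


def inpaint_request(prompt, image: bytes, mask: bytes):
    """inpaint: white mask pixels mark the region to replace."""
    # ASSUMPTION: the mask is sent base64-encoded like the image.
    return {"prompt": prompt, "image": b64(image), "mask": b64(mask)}


def with_controlnet(body, image: bytes, model: str, strength: float):
    """Return a copy of `body` with a ControlNet section attached."""
    out = dict(body)
    # ASSUMPTION: "controlnet.image" etc. denote keys of a nested object.
    out["controlnet"] = {"image": b64(image), "model": model, "strength": strength}
    return out
```

An IP-Adapter request would follow the same pattern as `with_controlnet`, using an `ipadapter` key instead.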

Required ComfyUI models

| File | Location | Purpose |
|---|---|---|
| sd3.5_medium.safetensors | models/checkpoints/ | SD3.5 base |
| clip_l.safetensors | models/clip/ | SD3.5 CLIP-L |
| clip_g.safetensors | models/clip/ | SD3.5 CLIP-G |
| juggernaut_xl_v9.safetensors | models/checkpoints/ | SDXL fallback |
| 4x-UltraSharp.pth | models/upscale_models/ | hires_fix upscaler |

Vision (/v1/images/analyze, OCR fallback)

Picked by free GPU VRAM (_pick_vision_model):

| Free VRAM | Model | Notes |
|---|---|---|
| ≥ 8.5 GB | llama3.2-vision:11b | richest reasoning; ~7.9 GB resident |
| ≥ 5.5 GB | qwen2.5vl:7b | best OCR + layout among small VLMs; ~5 GB |
| else | moondream | tight-VRAM fallback; ~1.7 GB |
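The selection rule in the table can be written as a small threshold lookup. This is a sketch of the rule as documented here, not the actual `_pick_vision_model` implementation; how the service measures free VRAM, and whether it counts in GB or GiB, is not covered by this README.

```python
# (minimum free VRAM in GB, model) pairs, highest threshold first.
VISION_TIERS = [
    (8.5, "llama3.2-vision:11b"),
    (5.5, "qwen2.5vl:7b"),
]
FALLBACK_MODEL = "moondream"


def pick_vision_model(free_vram_gb: float) -> str:
    """Return the first model whose threshold fits into the free VRAM."""
    for threshold, model in VISION_TIERS:
        if free_vram_gb >= threshold:
            return model
    return FALLBACK_MODEL
```

Keeping the tiers in a list ordered from largest to smallest means a new model only needs one extra row.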

Override via the model field in the request body. Pull the models on the ollama qube:

ollama pull qwen2.5vl:7b
ollama pull llama3.2-vision:11b
ollama pull moondream

TTS

Speaches Kokoro (default voice af_bella) or piper shorthand (en_GB-alba-medium).

{ "input": "Hello world", "voice": "af_bella" }

STT

Speaches Whisper (Systran/faster-whisper-small). Accepts OGG, WAV, MP3, M4A.

{ "audio": "<base64>", "language": "en" }
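For illustration, the two audio request bodies could be assembled as follows. The helper names are hypothetical, and the file-extension check is a client-side convenience based on the formats listed above rather than something the service is known to enforce.

```python
import base64
from pathlib import Path

# Container formats this README lists as accepted for transcription.
SUPPORTED_AUDIO = {".ogg", ".wav", ".mp3", ".m4a"}


def speech_request(text, voice="af_bella"):
    """Body for POST /v1/audio/speech (default Kokoro voice from this README)."""
    return {"input": text, "voice": voice}


def transcribe_request(audio: bytes, filename: str, language="en"):
    """Body for POST /v1/audio/transcribe with base64-encoded audio."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_AUDIO:
        raise ValueError(f"unsupported audio format: {suffix or filename!r}")
    return {
        "audio": base64.b64encode(audio).decode("ascii"),
        "language": language,
    }
```

Reading a file with `Path("clip.wav").read_bytes()` and passing the result to `transcribe_request` gives a body ready to be serialised as JSON.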

Health check

curl http://localhost:8096/health
{
  "version": "v0.11.0",
  "status": "ok",
  "services": { "ollama": true, "comfyui": true, "speaches": true, "tts": true },
  "image_model": "sd3.5_medium.safetensors",
  "vision_model": "qwen2.5vl:7b",
  "vram_free_mb": 9299,
  "queue_depth": 0
}
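A monitoring script could reduce that response to a list of problems, roughly as sketched below. It assumes that `status` takes some value other than `"ok"` when the service is degraded, which the example output does not confirm. `failing_services` is a hypothetical helper.

```python
def failing_services(health: dict) -> list:
    """List the upstreams reported as down, plus "status" if it is not ok."""
    services = health.get("services", {})
    problems = [name for name, up in sorted(services.items()) if not up]
    # ASSUMPTION: any status other than "ok" signals a degraded service.
    if health.get("status") != "ok":
        problems.insert(0, "status")
    return problems
```

An empty list means every upstream in the `services` map reported true and the overall status was `"ok"`.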

MCP server

The MCP server (/usr/bin/mcp-media) wraps all endpoints as Claude tools. Registered in ~/.mcp.json as the media server on claude/flex qubes.

Tools: generate_image, stylize_image, analyze_image, compose_images, synthesize_speech, transcribe_audio, ocr, anonymize_image, list_jobs, get_job, cancel_job.

stylize_image is the high-level entry point for visual style transfer — it wraps generate_image with IP-Adapter (reference-based) or img2img + depth ControlNet (prompt-based, structure-preserving).