OpenAI SDK · Redis · inference backend

openai-rq

Run the OpenAI SDK from behind a locked-down network — by tunnelling inference over Redis.

The problem

Your client sits behind a heavily restricted network. The only reachable outbound endpoint — from both sides — is Redis.

client │ 🔒 │ Redis │ 🔒 │ inference backend

No direct HTTP from the client to the inference box is possible.

So Redis is the only rendezvous

Redis isn't just a queue here — it's the single meeting point between the external client and the in-cloud inference box.

Which means Redis has to carry the request itself, not just a job id.

Key insight

The backend is already an OpenAI HTTP server

So the OpenAI SDK on the client already builds a valid HTTP request. We intercept at the transport layer, ship that raw request over Redis, and the worker simply replays it against the local backend.

  • Generic relay — chat, embeddings, any current or future endpoint, zero changes.
  • Errors propagate for free — the worker returns the real status + body.
  • Streaming is just an SSE HTTP response — same mechanism.

Works with anything that speaks the OpenAI API — vLLM, SGLang, TGI, llama.cpp, Ollama.
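
The worker-side replay can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: `build_replay_request` and `backend_origin` are hypothetical names, the job fields follow the job record shown under "What actually crosses Redis", and the job path is assumed to already carry the /v1 prefix.

```python
import base64
import urllib.request


def build_replay_request(job, backend_origin):
    """Hypothetical helper: rebuild the HTTP request that a client froze into a job."""
    # The body travels base64-encoded so the SDK's exact bytes survive the trip.
    body = base64.b64decode(job["body_b64"]) if job.get("body_b64") else None
    return urllib.request.Request(
        url=backend_origin.rstrip("/") + job["path"],
        data=body,
        headers=dict(job.get("headers", {})),
        method=job["method"],
    )


job = {
    "method": "POST",
    "path": "/v1/chat/completions",
    "headers": {"content-type": "application/json"},
    "body_b64": base64.b64encode(b'{"model": "m", "stream": false}').decode("ascii"),
}
req = build_replay_request(job, "http://localhost:8000")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to the backend. A non-2xx answer surfaces as `urllib.error.HTTPError`, which still carries the status code and body, so a relay built this way can hand the real error back to the client.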

Architecture

OpenAIRQ ⇄ Redis Streams ⇄ openai-rq worker → HTTP → backend :8000

Both client and worker connect only to Redis. The worker is outbound-only — nothing connects into the cloud box.

Client = a drop-in openai.OpenAI

client.py
# swap the class, point at Redis — the rest is identical
from openai_rq import OpenAIRQ

client = OpenAIRQ(redis_url="redis://localhost:6379/0")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}],
)
# streaming + AsyncOpenAIRQ behave like the real SDK

A free bonus

extra_body / extra_headers just work

Because we intercept at the transport, the SDK bakes these into the request and the worker replays them verbatim — no code needed.

provider-specific fields
client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[...],
    extra_body={"guided_json": schema},   # passes through
)

Worker — run it next to the backend

shell
openai-rq worker \
    --redis-url redis://localhost:6379/0 \
    --openai-base-url http://localhost:8000/v1 \
    --concurrency 16

Run as many workers as you like against the same Redis — jobs are load-balanced across them via a Redis consumer group.

Deep dive

What actually crosses Redis

The transport freezes each request into one self-describing JSON job and XADDs it to a stream — no live socket, just a record the worker can replay anywhere.

openai-rq:requests · one job
{
  "id":       "9f2c…",       # uuid; names the result key
  "method":   "POST",
  "path":     "/v1/chat/completions",
  "headers":  { … },         # hop-by-hop headers stripped
  "body_b64": "eyJtb2RlbCI6…",   # the SDK's exact body
  "stream":   true
}
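
A minimal sketch of how a request could be frozen into such a job and thawed again on the worker. The helper names `freeze_request` and `thaw_body` and the exact set of stripped headers are assumptions for illustration; only the field names come from the record above.

```python
import base64
import json
import uuid

# Headers that describe a single connection rather than the request itself.
# Assumption: the real project may strip a different set.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
}


def freeze_request(method, path, headers, body, stream):
    """Hypothetical helper: serialize one HTTP request into a JSON job string."""
    job = {
        "id": uuid.uuid4().hex,
        "method": method,
        "path": path,
        "headers": {
            k.lower(): v for k, v in headers.items() if k.lower() not in HOP_BY_HOP
        },
        # base64 keeps the body byte-for-byte identical inside a JSON document
        "body_b64": base64.b64encode(body).decode("ascii"),
        "stream": stream,
    }
    return json.dumps(job)


def thaw_body(job_json):
    """Hypothetical helper: recover the exact request bytes on the worker side."""
    return base64.b64decode(json.loads(job_json)["body_b64"])


raw = b'{"model": "openai/gpt-oss-120b", "stream": true}'
job_json = freeze_request(
    "POST",
    "/v1/chat/completions",
    {"Content-Type": "application/json", "Connection": "keep-alive"},
    raw,
    stream=True,
)
print(json.loads(job_json)["headers"])
```

With redis-py, the resulting string could then be appended to the request stream with `xadd`, for example under a single field such as `job` (that field name is an assumption).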

A gotcha worth knowing

Is this request a stream?

You'd reach for Accept: text/event-stream — but the OpenAI SDK sends Accept: application/json even when stream=True. The only signal is "stream": true inside the body.

transport.py · _is_stream()
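import json

def _is_stream(request):
    """Return True when the request expects an SSE (streaming) response."""
    accept = request.headers.get("accept", "")
    if accept.startswith("text/event-stream"):
        return True                  # explicit SSE
    # the OpenAI SDK signals streaming only in the body
    ctype = request.headers.get("content-type", "")
    if "application/json" in ctype:
        try:
            body = json.loads(request.content)
        except ValueError:
            return False             # empty or unparseable body: not a stream
        return isinstance(body, dict) and bool(body.get("stream"))
    return False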

Header-only detection routed every SDK stream down the blocking (non-streaming) path. Reading the body fixes it, as verified against the real SDK. (v0.1.2)

Streaming, reassembled over a Redis Stream

The worker writes the upstream SSE response into a per-request Redis Stream as typed entries; the client replays them as an ordinary httpx streaming body.

head → data → data → … → done / error
  • head carries status + headers — the client builds the response from it.
  • data is raw SSE bytes, coalesced in a ~50ms window and never split mid-event.
  • done / error is a terminal sentinel; bytes feed straight into the SDK's own parser.
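
The "coalesce, but never split mid-event" rule for data entries can be sketched as a small buffer. This is an illustration, not the worker's actual code: `SSECoalescer` is a hypothetical name, and the sketch assumes events are separated by a blank line written as "\n\n" (SSE also permits "\r\n" line endings, which are not handled here).

```python
import time


class SSECoalescer:
    """Hypothetical helper: buffer SSE bytes and release only whole events."""

    def __init__(self, window=0.05, clock=time.monotonic):
        self.window = window      # coalescing window in seconds (~50 ms)
        self.clock = clock        # injectable so the timing can be tested
        self.buf = b""
        self.last_emit = clock()

    def feed(self, chunk):
        """Buffer upstream bytes; return a payload once the window has elapsed."""
        self.buf += chunk
        if self.clock() - self.last_emit >= self.window:
            return self.flush()
        return None

    def flush(self, final=False):
        """Emit buffered complete events; with final=True emit whatever is left."""
        if final:
            cut = len(self.buf)
        else:
            boundary = self.buf.rfind(b"\n\n")  # end of the last complete event
            if boundary == -1:
                return None                     # only a partial event so far
            cut = boundary + 2
        out, self.buf = self.buf[:cut], self.buf[cut:]
        self.last_emit = self.clock()
        return out or None


now = [0.0]                                   # fake clock for a repeatable demo
coalescer = SSECoalescer(clock=lambda: now[0])
first = coalescer.feed(b"data: a\n\nda")      # window still open
now[0] = 0.06
second = coalescer.feed(b"ta: b\n\ndata: c")  # window elapsed
rest = coalescer.flush(final=True)            # upstream closed
print(first, second, rest)
```

Each non-empty payload would become one data entry in the per-request stream. The trailing partial event stays in the buffer until its terminating blank line arrives or the upstream response ends.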

Reliability & lifecycle

  • Consumer group, not pub/sub: XREADGROUP load-balances jobs across N workers; persistent and replayable, not fire-and-forget.
  • At-least-once: XACK on completion; XAUTOCLAIM reclaims jobs orphaned by a crashed worker (idle > 60s).
  • Dead-letter: past --max-retries a job is parked, and the waiting client is unblocked with a terminal 502 rather than hanging.
  • Bounded by design: the request stream is maxlen-capped and every result/stream key carries a TTL. The client's read-timeout maps to the Redis block time.
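
The read, acknowledge, and reclaim cycle behind the first two bullets can be sketched against a redis-py style client. Everything named here is an assumption for illustration: the group name `workers`, the helpers `process_batch` and `reclaim_orphans`, and the batch size. The calls `xreadgroup`, `xack`, and `xautoclaim` are standard redis-py methods, and the sketch assumes their default (RESP2) return shapes. Retry counting and dead-lettering are left out.

```python
STREAM = "openai-rq:requests"   # stream name taken from the job example
GROUP = "workers"               # hypothetical consumer-group name


def process_batch(r, consumer, handle, block_ms=5000):
    """Hypothetical helper: read new jobs for this consumer and XACK each success."""
    done = []
    # ">" asks for entries never delivered to any consumer in the group
    batches = r.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=10, block=block_ms)
    for _stream, entries in batches or []:
        for entry_id, fields in entries:
            try:
                handle(fields)
            except Exception:
                continue        # not acknowledged: stays pending for a later reclaim
            r.xack(STREAM, GROUP, entry_id)
            done.append(entry_id)
    return done


def reclaim_orphans(r, consumer, min_idle_ms=60_000):
    """Hypothetical helper: take over entries idle longer than min_idle_ms."""
    result = r.xautoclaim(
        STREAM, GROUP, consumer, min_idle_time=min_idle_ms, start_id="0-0", count=10
    )
    return result[1]            # result is [next_start_id, claimed_entries, ...]
```

Because an entry is acknowledged only after `handle` returns, a crash or exception in between leaves it pending, which is what makes delivery at-least-once rather than at-most-once. In real use `r` would be something like `redis.Redis.from_url("redis://localhost:6379/0", decode_responses=True)`, with the group created once via `r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)`.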

Backend auth, done right

The backend credential is owned by the worker.

Set via --openai-api-key (→ Authorization: Bearer) or --openai-header for custom auth. It's injected at relay time and never transits Redis or reaches the client.
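
Injection at relay time can be sketched as a pure function over the job's headers. The name `inject_backend_auth` and the dictionary form of the custom headers are assumptions for illustration; how the real --openai-header flag is parsed is not shown in this document.

```python
def inject_backend_auth(headers, api_key=None, extra_headers=None):
    """Hypothetical helper: apply the worker-owned credential to outgoing headers."""
    out = {k.lower(): v for k, v in headers.items()}
    if api_key:
        # The worker's key replaces whatever Authorization value the job carried.
        out["authorization"] = f"Bearer {api_key}"
    for name, value in (extra_headers or {}).items():
        out[name.lower()] = value   # custom auth headers, also worker-owned
    return out


upstream_headers = inject_backend_auth(
    {"Authorization": "Bearer placeholder", "Content-Type": "application/json"},
    api_key="sk-backend",
)
print(sorted(upstream_headers))
```

Since the function runs on the worker just before the request is replayed, the real key only ever exists in the worker's process and on the local hop to the backend.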

Recap

Identical client code, any endpoint, zero SDK changes

  • Drop-in OpenAIRQ / AsyncOpenAIRQ.
  • Generic relay — every OpenAI endpoint, now and future.
  • Streams-backed reliability, bounded memory, worker-side auth.

github.com/allen2c/openai-rq