openai-rq
Run the OpenAI SDK from behind a locked-down network — by tunnelling inference over Redis.
The problem
Your client sits behind a heavily restricted network. The only reachable outbound endpoint — from both sides — is Redis.
No direct HTTP from the client to the inference box is possible.
So Redis is the only rendezvous
Redis isn't just a queue here — it's the single meeting point between the external client and the in-cloud inference box.
Which means Redis has to carry the request itself, not just a job id.
The backend is already an OpenAI HTTP server
So the OpenAI SDK on the client already builds a valid HTTP request. We intercept at the transport layer, ship that raw request over Redis, and the worker simply replays it against the local backend.
- Generic relay — chat, embeddings, any current or future endpoint, zero changes.
- Errors propagate for free — the worker returns the real status + body.
- Streaming is just an SSE HTTP response — same mechanism.
Works with anything that speaks the OpenAI API — vLLM, SGLang, TGI, llama.cpp, Ollama.
Architecture
Both client and worker connect only to Redis. The worker is outbound-only — nothing connects into the cloud box.
Client = a drop-in openai.OpenAI
# swap the class, point at Redis; the rest is identical
from openai_rq import OpenAIRQ

client = OpenAIRQ(redis_url="redis://localhost:6379/0")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}],
)
# streaming + AsyncOpenAIRQ behave like the real SDK
extra_body / extra_headers just work
Because we intercept at the transport, the SDK bakes these into the request and the worker replays them verbatim — no code needed.
client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[...],
    extra_body={"guided_json": schema},  # passes through
)
Worker — run it next to the backend
openai-rq worker \
--redis-url redis://localhost:6379/0 \
--openai-base-url http://localhost:8000/v1 \
--concurrency 16
Run as many workers as you like against the same Redis — jobs are load-balanced across them via a Redis consumer group.
What actually crosses Redis
The transport freezes each request into one self-describing JSON job
and XADDs it to a stream — no live socket, just a record the worker can
replay anywhere.
{
"id": "9f2c…", # uuid; names the result key
"method": "POST",
"path": "/v1/chat/completions",
"headers": { … }, # hop-by-hop headers stripped
"body_b64": "eyJtb2RlbCI6…", # the SDK's exact body
"stream": true
}
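A minimal sketch of how a request could be frozen into such a job and recovered on the worker side. The helper names `freeze_request` and `thaw_job` are hypothetical, and the exact set of stripped headers is an assumption (the usual hop-by-hop set); the project's own transport may differ.

```python
import base64
import json
import uuid

# Assumption: the usual hop-by-hop headers, which describe a single
# connection and should not be replayed against another server.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
}


def freeze_request(method, path, headers, body, stream):
    """Serialize one HTTP request into a self-describing JSON job (hypothetical helper)."""
    kept = {k.lower(): v for k, v in headers.items() if k.lower() not in HOP_BY_HOP}
    job = {
        "id": uuid.uuid4().hex,  # also names the result key
        "method": method,
        "path": path,
        "headers": kept,
        # Base64 keeps the SDK's exact bytes intact inside JSON.
        "body_b64": base64.b64encode(body).decode("ascii"),
        "stream": stream,
    }
    return json.dumps(job)


def thaw_job(raw):
    """Worker side: recover (method, path, headers, body bytes) from a job."""
    job = json.loads(raw)
    return job["method"], job["path"], job["headers"], base64.b64decode(job["body_b64"])


raw = freeze_request(
    "POST",
    "/v1/chat/completions",
    {"Content-Type": "application/json", "Connection": "keep-alive"},
    b'{"model": "openai/gpt-oss-120b", "stream": true}',
    stream=True,
)
method, path, headers, body = thaw_job(raw)
```

Because the body travels as opaque bytes, the worker never has to understand the endpoint it is relaying.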
Is this request a stream?
You'd reach for Accept: text/event-stream — but the OpenAI SDK sends
Accept: application/json even when stream=True. The
only signal is "stream": true inside the body.
import json

def _is_stream(request):
    accept = request.headers.get("accept", "")
    if accept.startswith("text/event-stream"):
        return True  # explicit SSE
    # the OpenAI SDK signals streaming only in the body
    ctype = request.headers.get("content-type", "")
    # skip empty bodies so json.loads never sees b""
    if "application/json" in ctype and request.content:
        return bool(json.loads(request.content).get("stream"))
    return False
Header-only detection routed every SDK stream down the blocking path. Reading the body fixes it — proven against the real SDK. (v0.1.2)
Streaming, reassembled over a Redis Stream
The worker writes the upstream SSE response into a per-request Redis Stream as typed entries; the client replays them as an ordinary httpx streaming body.
- head carries status + headers — the client builds the response from it.
- data is raw SSE bytes, coalesced in a ~50ms window and never split mid-event.
- done / error is a terminal sentinel; bytes feed straight into the SDK's own parser.
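The coalescing rule for data entries can be sketched as below. This is an illustration under assumptions, not the project's code: the class name `SSECoalescer` is hypothetical, events are assumed to end with a blank line (`\n\n`), and flushing is only checked when a new chunk arrives, whereas a real worker would likely also flush on a timer.

```python
import time


class SSECoalescer:
    """Buffer upstream SSE bytes; release only whole events, at most once per window."""

    def __init__(self, window=0.05, clock=time.monotonic):
        self.window = window  # seconds; ~50ms as described above
        self.clock = clock    # injectable so the behaviour can be tested
        self.buf = b""
        self.last_flush = clock()

    def feed(self, chunk):
        """Add bytes; return a batch for one data entry, or None to keep waiting."""
        self.buf += chunk
        if self.clock() - self.last_flush < self.window:
            return None
        return self._take_complete()

    def _take_complete(self):
        # Cut at the last event boundary; a trailing partial event stays buffered.
        end = self.buf.rfind(b"\n\n")
        if end == -1:
            return None
        batch, self.buf = self.buf[: end + 2], self.buf[end + 2:]
        self.last_flush = self.clock()
        return batch

    def close(self):
        """Flush whatever is left once the upstream response ends."""
        batch, self.buf = self.buf, b""
        return batch or None
```

Cutting only at event boundaries means every data entry is independently parseable, so the client can hand the bytes to the SDK's parser without reassembling fragments itself.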
Reliability & lifecycle
- Consumer group, not pub/sub: XREADGROUP load-balances jobs across N workers; persistent and replayable, not fire-and-forget.
- At-least-once: XACK on completion; XAUTOCLAIM reclaims jobs orphaned by a crashed worker (idle > 60s).
- Dead-letter: past --max-retries a job is parked, and the waiting client is unblocked with a terminal 502 rather than hanging.
- Bounded by design: the request stream is maxlen-capped and every result/stream key carries a TTL. The client's read-timeout maps to the Redis block time.
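The consumer-group flow above can be sketched with redis-py's stream commands. Everything named here is an assumption for illustration: the key `openai-rq:requests`, the group `workers`, the `job` field, and the helper names. `r` is expected to be a `redis.Redis` client created with `decode_responses=True`. Retry counting and dead-lettering are left out.

```python
STREAM = "openai-rq:requests"  # hypothetical stream key
GROUP = "workers"              # hypothetical consumer group name


def ensure_group(r):
    """Create the consumer group (and the stream) if it does not exist yet."""
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception as exc:
        # Redis answers BUSYGROUP when the group already exists; that is fine.
        if "BUSYGROUP" not in str(exc):
            raise


def enqueue(r, job_json, maxlen=10_000):
    """Client side: append a job; approximate trimming keeps the stream bounded."""
    return r.xadd(STREAM, {"job": job_json}, maxlen=maxlen, approximate=True)


def work_once(r, consumer, handle, block_ms=5000, idle_ms=60_000):
    """Handle reclaimed and new jobs once; return how many were acknowledged."""
    # Reclaim jobs whose previous owner has been idle too long (crashed worker).
    claimed = r.xautoclaim(
        STREAM, GROUP, consumer, min_idle_time=idle_ms, start_id="0-0", count=10
    )[1]
    # Then read jobs never delivered to any consumer in the group (">").
    fresh = r.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=10, block=block_ms) or []
    messages = list(claimed) + [m for _stream, batch in fresh for m in batch]

    done = 0
    for msg_id, fields in messages:
        if not fields:  # entry was trimmed or deleted while pending
            continue
        handle(fields["job"])          # replay against the backend, write the result
        r.xack(STREAM, GROUP, msg_id)  # ack only after success: at-least-once
        done += 1
    return done
```

Acknowledging after the handler returns is what makes delivery at-least-once: if the worker dies mid-job, the entry stays pending until another worker claims it.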
Backend auth, done right
The backend credential is owned by the worker.
Set via --openai-api-key (→ Authorization: Bearer) or
--openai-header for custom auth. It's injected at relay time and
never transits Redis or reaches the client.
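As a rough sketch of injecting the credential at relay time, the worker could rewrite headers just before calling the backend. The function `with_backend_auth` is hypothetical, and dropping any client-sent Authorization header is an assumption about the intended behaviour rather than something the project documents here.

```python
def with_backend_auth(headers, api_key=None, extra_headers=None):
    """Return headers for the upstream call with the worker-owned credential applied."""
    out = {k.lower(): v for k, v in headers.items()}
    # Assumption: whatever Authorization the client SDK sent is discarded.
    out.pop("authorization", None)
    if api_key:
        out["authorization"] = f"Bearer {api_key}"
    # Custom auth headers, e.g. values supplied through --openai-header.
    out.update({k.lower(): v for k, v in (extra_headers or {}).items()})
    return out


upstream_headers = with_backend_auth(
    {"Authorization": "Bearer client-placeholder", "Content-Type": "application/json"},
    api_key="sk-backend-secret",
)
```

The secret lives only in the worker's process, so neither the job record in Redis nor the client ever sees it.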
Identical client code, any endpoint, zero SDK changes
- Drop-in OpenAIRQ / AsyncOpenAIRQ.
- Generic relay: every OpenAI endpoint, now and future.
- Streams-backed reliability, bounded memory, worker-side auth.
github.com/allen2c/openai-rq