
ROXBOX
ROXBOX is a premium, high-performance Progressive Web App (PWA) designed for private music management and playback. Unlike standard streaming services, ROXBOX is built on a "Light Proxy" architecture that bridges cloud availability with local extraction power. It allows users to manage a massive S3-backed library while leveraging home hardware for heavy lifting like audio extraction and AI indexing.
The system is designed for audiophiles and fitness enthusiasts alike, featuring built-in workout timers (Norwegian 4x4) and a glassmorphic aesthetic that feels premium on any device.
System Architecture
The ROXBOX ecosystem is built on a sophisticated multi-tier architecture that prioritizes security and performance. By leveraging a Cloud-to-Home Bridge via Tailscale, the system maintains the global availability of a serverless application while utilizing the heavy-lifting capabilities of local hardware for extraction and AI processing.


NextJS Application
The frontend is built with Next.js 15 and React 19 on the App Router, which splits the application into two parts: the client-side PWA interface and the "Cloud Bridge," a set of stateless API routes that handle authentication, library manifest aggregation, and request proxying.
By using a PWA approach, ROXBOX offers an "installable" experience on iOS and Android without the overhead of native app stores. Styling uses vanilla CSS with a glassmorphic look, targeting smooth 60fps animations even on older mobile hardware. Sentry telemetry is integrated across the entire stack for real-time observability and distributed tracing.


AWS S3 Cloud Storage
ROXBOX uses AWS S3 as its primary storage engine, treating the cloud as a highly available "Source of Truth." A central manifest.json file tracks the metadata for the entire library, allowing near-instant browsing even for large collections.
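As a rough illustration of how a single manifest can back library browsing, the sketch below parses a manifest and filters tracks in memory. The field names (`tracks`, `key`, `title`, `artist`) and the `search_library` helper are assumptions made for this example; the actual schema of manifest.json is not described here.

```python
import json

# Hypothetical manifest layout; the real manifest.json schema may differ.
MANIFEST_JSON = """
{
  "tracks": [
    {"key": "music/a.mp3", "title": "Night Drive", "artist": "Example Artist"},
    {"key": "music/b.mp3", "title": "Morning Run", "artist": "Another Artist"}
  ]
}
"""


def search_library(manifest: dict, query: str) -> list:
    """Return tracks whose title or artist contains the query (case-insensitive)."""
    needle = query.lower()
    return [
        track
        for track in manifest.get("tracks", [])
        if needle in track.get("title", "").lower()
        or needle in track.get("artist", "").lower()
    ]


manifest = json.loads(MANIFEST_JSON)
results = search_library(manifest, "run")
print([track["key"] for track in results])
```

Because the whole manifest is loaded once, browsing and filtering need no further requests to S3 until a track is actually played.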
Audio files are streamed directly from S3, while localforage (IndexedDB) provides a persistent client-side cache for offline playback. This "Cloud-First, Offline-Capable" strategy ensures your music is available whether you're on a gigabit connection or in a dead zone.
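The cache-first behavior can be sketched as a read-through cache. In the app this logic would live in the browser on top of localforage; the Python below only illustrates the control flow, and the `OfflineAudioCache` class, the in-memory dictionary standing in for IndexedDB, and the `fake_s3_fetch` function are all hypothetical.

```python
from typing import Callable, Dict


class OfflineAudioCache:
    """Read-through cache: serve from local storage, else fetch and store."""

    def __init__(self, fetch_remote: Callable[[str], bytes]):
        self._fetch_remote = fetch_remote       # stand-in for an S3 download
        self._store: Dict[str, bytes] = {}      # stand-in for IndexedDB

    def get(self, key: str) -> bytes:
        if key in self._store:
            return self._store[key]             # cache hit: no network needed
        data = self._fetch_remote(key)          # cache miss: go to the cloud
        self._store[key] = data                 # keep a copy for offline use
        return data


remote_calls = []


def fake_s3_fetch(key: str) -> bytes:
    """Hypothetical remote fetch that records each call."""
    remote_calls.append(key)
    return b"audio-bytes-for-" + key.encode()


cache = OfflineAudioCache(fake_s3_fetch)
first = cache.get("music/a.mp3")
second = cache.get("music/a.mp3")   # served locally, no second fetch
print(len(remote_calls))
```

A track that has been played once can then be served again without a connection, which is the property the offline mode relies on.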
AI-Driven Similarity Search
To solve the problem of music discovery in private collections, ROXBOX implements an AI-powered acoustic fingerprinting system. When a new song is added, an AWS Lambda function (packaged as a container image with FFmpeg and Python) extracts its acoustic features and stores them as high-dimensional vectors in Pinecone.
The Next.js frontend can then perform real-time vector similarity searches, allowing users to find "songs that sound like this" with sub-second latency. This turns a static collection into a dynamic, searchable ecosystem.
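To make the idea of "songs that sound like this" concrete, here is a minimal sketch of a nearest-neighbor lookup over feature vectors. In ROXBOX the search is delegated to Pinecone; this example runs the comparison locally, assumes cosine similarity as the metric (the metric actually used is not stated), and uses made-up three-dimensional vectors and track IDs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(query, index, top_k=2):
    """Return the top_k (track_id, score) pairs, best match first."""
    scored = [(track_id, cosine_similarity(query, vec)) for track_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical feature vectors; real ones would be high-dimensional.
index = {
    "track-a": [1.0, 0.0, 0.0],
    "track-b": [0.9, 0.1, 0.0],
    "track-c": [0.0, 0.0, 1.0],
}

matches = most_similar([1.0, 0.0, 0.0], index, top_k=2)
print([track_id for track_id, _ in matches])
```

A vector database performs the same ranking with approximate indexes, so it does not have to compare the query against every stored vector.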


YouTube Audio Extraction
The heavy lifting of audio extraction is delegated to a FastAPI service running on local hardware (the Home Engine). When a user searches YouTube in the PWA, the request is proxied through an AWS Lambda bridge and a Tailscale mesh tunnel to the Home Engine.
The Home Engine uses yt-dlp for high-quality extraction and ffmpeg for MP3 conversion. Once processed, files are automatically uploaded to S3, triggering the cloud indexer. This "Light Proxy" setup provides the security and power of a home server with the global accessibility of a cloud app.
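The extraction step could look roughly like the sketch below, which builds a yt-dlp command line for MP3 extraction and derives an S3 object key for the result. The helper names, the `music/` key prefix, the output directory, and the placeholder URL are assumptions for illustration; the Home Engine's actual options are not documented here.

```python
from pathlib import PurePosixPath


def build_extract_command(video_url: str, out_dir: str) -> list:
    """Build a yt-dlp invocation that extracts audio and converts it to MP3."""
    return [
        "yt-dlp",
        "--extract-audio",                # keep only the audio track
        "--audio-format", "mp3",          # convert via ffmpeg to MP3
        "--audio-quality", "0",           # 0 is the best VBR quality
        "--output", f"{out_dir}/%(id)s.%(ext)s",  # name files by video ID
        video_url,
    ]


def s3_key_for(local_file: str, prefix: str = "music/") -> str:
    """Map a local file path to an object key (the prefix is an assumption)."""
    return prefix + PurePosixPath(local_file).name


url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder, not a real video
cmd = build_extract_command(url, "/tmp/roxbox")
print(" ".join(cmd))
print(s3_key_for("/tmp/roxbox/abc123.mp3"))
```

The command list can be passed to `subprocess.run(cmd, check=True)` on a machine where yt-dlp and ffmpeg are installed; the upload to S3 would follow once the file exists.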


Fitness: Norwegian 4x4 Integration
Beyond simple playback, ROXBOX is designed as a performance training tool. It features a native implementation of the Norwegian 4x4 High-Intensity Interval Training (HIIT) protocol.
The app automatically manages your workout timers (Warm-Up, 4-minute Burn intervals, and 3-minute Active Recovery sets) and synchronizes them with specific playlists in your library. This eliminates the need for external timer apps and ensures your highest-energy tracks hit exactly when your heart rate needs to peak.
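The timer sequence can be sketched as a list of phases plus a lookup by elapsed time. The 4-minute and 3-minute durations come from the description above; the 10-minute warm-up, the choice to skip recovery after the final interval, and the names `Phase`, `norwegian_4x4`, and `phase_at` are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Phase:
    name: str
    seconds: int


def norwegian_4x4(warmup_min: int = 10, rounds: int = 4) -> List[Phase]:
    """Warm-up, then `rounds` 4-minute intervals separated by 3-minute recoveries."""
    phases = [Phase("Warm-Up", warmup_min * 60)]
    for i in range(rounds):
        phases.append(Phase("Burn", 4 * 60))
        if i < rounds - 1:  # assumed: no recovery after the last interval
            phases.append(Phase("Active Recovery", 3 * 60))
    return phases


def phase_at(phases: List[Phase], elapsed_seconds: int) -> Optional[Phase]:
    """Return the phase active at the given elapsed time, or None when finished."""
    boundary = 0
    for phase in phases:
        boundary += phase.seconds
        if elapsed_seconds < boundary:
            return phase
    return None


schedule = norwegian_4x4()
total_seconds = sum(phase.seconds for phase in schedule)
print(total_seconds // 60, "minutes")
print(phase_at(schedule, 600).name)
```

A player could call `phase_at` on each tick and switch playlists whenever the returned phase name changes.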




Podcasts & Long-Form Audio
ROXBOX extends its management capabilities to podcasts and long-form audio. Using the same S3-backed architecture, users can curate their own podcast feeds and audiobooks. The app remembers the playback position of each track and provides specialized controls for navigating longer recordings, so listeners can move between high-energy music and spoken-word content without losing their place.
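Position tracking for long recordings can be sketched as a small key-value store. The `PositionStore` class, the object keys, and the 5-second rewind on resume are assumptions for illustration, not a description of the app's actual behavior.

```python
class PositionStore:
    """Remember the last playback position (in seconds) per track key."""

    def __init__(self):
        self._positions = {}

    def save(self, track_key: str, seconds: float) -> None:
        # Clamp to zero so a bad reading never produces a negative position.
        self._positions[track_key] = max(0.0, float(seconds))

    def resume_point(self, track_key: str, rewind: float = 5.0) -> float:
        """Return where to resume, rewound slightly to restore context."""
        position = self._positions.get(track_key, 0.0)
        return max(0.0, position - rewind)


store = PositionStore()
store.save("podcasts/episode-1.mp3", 1234.5)
print(store.resume_point("podcasts/episode-1.mp3"))
print(store.resume_point("podcasts/never-played.mp3"))
```

Rewinding a few seconds on resume is a common convenience for spoken-word audio; setting `rewind=0` resumes at the exact saved position.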