RL Engineer · LLM Fine-tuning · ML Systems
I fine-tune large language models, design reinforcement learning agents, and build end-to-end ML pipelines — from research on H100s to deployment on consumer hardware. Open to RL + LLM engineering roles.
What I work with
QLoRA, LoRA, HuggingFace PEFT/TRL, GGUF export, llama.cpp, custom eval harnesses. Ran 27B params on RTX 3060.
PPO from scratch, GAE, Transformer-XL memory agents, ViT-based observation encoders, multi-GPU training.
PyTorch, ViT-B/16, DINOv2, CNNs, Transformer architectures, H100 training, cross-GPU model compression.
XGBoost, time-series CV, feature engineering, causal indicators, VWAP/TWAP, walk-forward validation.
FastAPI inference servers, Streamlit UIs, Docker, CUDA optimization, deterministic reproducibility across GPUs.
End-to-end pipelines: dataset curation → training → evaluation → export → deployment. All on consumer hardware.
What I've built
Every project is end-to-end — from dataset curation and training to evaluation and deployment. All production runs on consumer hardware (RTX 3060 12 GB).
A full end-to-end LLM fine-tuning pipeline: 33K samples curated from three sources → QLoRA training on H100 (~3h 48m) → GGUF export → FastAPI OpenAI-compatible server → Streamlit UI → custom eval harness. Runs a 27B model locally on a single RTX 3060 12 GB.
The HumanEval gain of +15pp over the Gemma 3 27B-IT base is real. MBPP (71%) is the honest generalization number — confirms no catastrophic forgetting. Supports Python, JavaScript, Java, C++, C, and SQL.
A production-grade recurrent RL agent solving LunarLander-v3 using Transformer-XL memory inside a custom PPO loop. Converges ~3× faster than vanilla PPO (~35k vs ~100k steps), with a distinct 3-phase learning curve driven by the memory context window filling. Architecture is modular — swap Identity encoder for DinoV2/ViT to handle pixel envs.
Trains a Vision Transformer (ViT-B/16) + Transformer-XL + PPO agent on raw pixel frames using an H100, then compresses and deploys to a consumer RTX 3060 — retaining 96% of peak performance across a ~17× compute reduction. A direct study in hardware-aware ML optimization: train big, deploy lean.
End-to-end quant pipeline: fetches BTCUSDT 15m OHLCV from Binance (2020–2025), engineers 9 causal features (VWAP, TWAP, RSI, EMA, rolling proximity), then trains dual XGBoost regressors to forecast % deviation from the next N-bar High and Low. Strict TimeSeriesSplit prevents any look-ahead bias.
Final year project: spearheaded CV and flight-control architecture for an autonomous drone with real-time person detection via motion capture. Bridged deep learning vision models with onboard electronics, and engineered swarm search algorithms that cut target acquisition time by 40% across 100+ live test flights. Co-authored and published as a peer-reviewed paper.
ML pipeline predicting air quality indices and CO₂ emission levels from environmental factors — temperature, humidity, wind speed, industrial emissions, and traffic data. Designed as a decision-support tool for policymakers and environmental agencies.
Who I am
I'm Kaustubh Kubitkar — An Electronics Graduate and an AI/ ML engineer focused on reinforcement learning, large language model systems and quant. My work spans the full stack: from writing custom PPO training loops and fine-tuning 27B parameter models, to building the inference servers and UIs that expose them.
I care deeply about running things on real hardware under real constraints. Every project here runs on a consumer RTX 3060 — because a model that only works on H100 infrastructure has limited practical value.
I'm open to RL + LLM engineering roles + Quant roles — India onsite or worldwide remote. Reach me at kaustubhkubitkar@gmail.com.