RL Engineer · LLM Fine-tuning · ML Systems

Building intelligent systems from scratch.

I fine-tune large language models, design reinforcement learning agents, and build end-to-end ML pipelines — from research on H100s to deployment on consumer hardware. Open to RL + LLM engineering roles.

PyTorch · QLoRA / Fine-tuning · PPO / RL · Transformers · FastAPI · GGUF / llama.cpp
India onsite · Worldwide remote

Skills & Expertise

🤖

LLM Fine-tuning

QLoRA, LoRA, HuggingFace PEFT/TRL, GGUF export, llama.cpp, custom eval harnesses. Ran 27B params on RTX 3060.

🧠

Reinforcement Learning

PPO from scratch, GAE, Transformer-XL memory agents, ViT-based observation encoders, multi-GPU training.

👁

Deep Learning

PyTorch, ViT-B/16, DINOv2, CNNs, Transformer architectures, H100 training, cross-GPU model compression.

📊

Quantitative ML

XGBoost, time-series CV, feature engineering, causal indicators, VWAP/TWAP, walk-forward validation.

🚀

ML Engineering

FastAPI inference servers, Streamlit UIs, Docker, CUDA optimization, deterministic reproducibility across GPUs.

⚗️

Research to Production

End-to-end pipelines: dataset curation → training → evaluation → export → deployment. All on consumer hardware.

Projects

Every project is end-to-end, from dataset curation and training to evaluation and deployment. Deployed models all run on consumer hardware (RTX 3060 12 GB).

Forge

Local Coding Assistant · Gemma 3 27B QLoRA

98.78% HumanEval pass@1 · 27B params on RTX 3060 · 71% MBPP pass@1 · Q4_K_M GGUF (~16 GB)
QLoRA · Gemma 3 · GGUF · FastAPI · Streamlit · llama.cpp · HumanEval

A full end-to-end LLM fine-tuning pipeline: 33K samples curated from three sources → QLoRA training on H100 (~3h 48m) → GGUF export → FastAPI OpenAI-compatible server → Streamlit UI → custom eval harness. Runs a 27B model locally on a single RTX 3060 12 GB.

The +15pp HumanEval gain over the Gemma 3 27B-IT base is real, but MBPP (71%) is the more honest generalization number, and it indicates that fine-tuning did not cause catastrophic forgetting. Forge supports Python, JavaScript, Java, C++, C, and SQL.
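For readers unfamiliar with the metric, here is a minimal sketch of how pass@k is commonly computed, assuming the standard unbiased estimator used with HumanEval-style benchmarks. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and are not taken from Forge's eval harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c passed the tests."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: four problems, one sample each, three solved.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1))
```

With one sample per problem, pass@1 reduces to the fraction of problems whose single completion passes every test.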

PyTorch · PPO · Transformer-XL · Gymnasium

LunarLander-v3 Agent

+280 reward · 35k steps · 3× sample efficiency

A production-grade recurrent RL agent solving LunarLander-v3 using Transformer-XL memory inside a custom PPO loop. It converges ~3× faster than vanilla PPO (~35k vs ~100k steps), with a distinct 3-phase learning curve driven by the memory context window filling. The architecture is modular: swap the Identity encoder for DINOv2/ViT to handle pixel environments.
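A custom PPO loop needs an advantage estimator, and GAE (listed under Skills above) is the usual choice. Below is a dependency-free sketch of GAE over a single rollout. The function name, argument layout, and default `gamma`/`lam` values are assumptions for illustration, not the project's actual code.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones are per-step lists of equal length;
    last_value is the critic's estimate for the state after the final step.
    Returns (advantages, returns), where returns are the critic targets.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # Do not bootstrap across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae(
        rewards=[1.0, 1.0], values=[0.0, 0.0], dones=[False, True],
        last_value=5.0, gamma=1.0, lam=1.0,
    )
    print(adv, ret)
```

The backward pass means each step's advantage reuses the one computed for the step after it, so the whole rollout is processed in a single linear sweep.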

PyTorch · ViT-B/16 · PPO · Model Compression

CartPole ViT+PPO

480+/500 score · 96% perf retained · ~17× compute reduction

Trains a Vision Transformer (ViT-B/16) + Transformer-XL + PPO agent on raw pixel frames using an H100, then compresses and deploys to a consumer RTX 3060 — retaining 96% of peak performance across a ~17× compute reduction. A direct study in hardware-aware ML optimization: train big, deploy lean.

XGBoost · Pandas · TimeSeriesSplit · Binance API

Crypto Predictive Pipeline

RMSE 3.01 · 300-iter RandomizedSearchCV · 5-fold walk-forward CV

End-to-end quant pipeline: fetches BTCUSDT 15m OHLCV from Binance (2020–2025), engineers 9 causal features (VWAP, TWAP, RSI, EMA, rolling proximity), then trains dual XGBoost regressors to forecast % deviation from the next N-bar High and Low. Strict TimeSeriesSplit prevents any look-ahead bias.
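The two ideas that keep look-ahead bias out are causal features (each value uses only bars up to the current one) and walk-forward splits (each fold validates strictly after its training data). Here is a dependency-free sketch of both, similar in spirit to scikit-learn's `TimeSeriesSplit`. The helper names, window length, and sample values are hypothetical and are not the pipeline's own code.

```python
def rolling_vwap(prices, volumes, window):
    """Causal rolling VWAP: the value at bar t uses only bars <= t."""
    out = []
    for t in range(len(prices)):
        lo = max(0, t - window + 1)
        vol = sum(volumes[lo:t + 1])
        pv = sum(p * v for p, v in zip(prices[lo:t + 1], volumes[lo:t + 1]))
        out.append(pv / vol if vol else None)  # None when the window has no volume
    return out


def walk_forward_splits(n_samples, n_splits=5):
    """Expanding-window folds: train on everything before the validation block."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


if __name__ == "__main__":
    print(rolling_vwap([10.0, 20.0, 30.0], [1.0, 1.0, 2.0], window=2))
    for train_idx, test_idx in walk_forward_splits(12, n_splits=5):
        print(len(train_idx), test_idx)
```

Every validation index is larger than every training index in its fold, so no fold trains on bars from its own future.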

Computer Vision · Deep Learning · Autonomous Systems · Swarm Nav

Autonomous Drone — Motion Capture & Person Detection

95% detection accuracy · 40% faster acquisition · 100+ live flights

Final year project: spearheaded CV and flight-control architecture for an autonomous drone with real-time person detection via motion capture. Bridged deep learning vision models with onboard electronics, and engineered swarm search algorithms that cut target acquisition time by 40% across 100+ live test flights. Co-authored and published as a peer-reviewed paper.

Python · Scikit-learn · Regression · EDA

Air Quality & CO₂ Prediction

ML pipeline predicting air quality indices and CO₂ emission levels from environmental factors — temperature, humidity, wind speed, industrial emissions, and traffic data. Designed as a decision-support tool for policymakers and environmental agencies.

About Me

I'm Kaustubh Kubitkar, an electronics graduate and AI/ML engineer focused on reinforcement learning, large language model systems, and quantitative ML. My work spans the full stack: from writing custom PPO training loops and fine-tuning 27B-parameter models to building the inference servers and UIs that expose them.

I care deeply about running things on real hardware under real constraints. Every project here deploys on a consumer RTX 3060, because a model that only works on H100 infrastructure has limited practical value.

I'm open to RL, LLM engineering, and quant roles, India onsite or worldwide remote. Reach me at kaustubhkubitkar@gmail.com.

27B largest model fine-tuned · 98.8% HumanEval pass@1 (Forge) · 17× compute reduction (CartPole) · Peer-reviewed paper published

Let's work together.

Open to RL, LLM engineering, and quant roles · India onsite · Worldwide remote.

Available for opportunities