tabula rasa RL in OSRS

RL, the great equalizer.

RL in a nutshell.

once you really get into this stuff, you'll learn that the policies exploit everything. scripted policies, ruthlessly. they find all the sim bugs. they're good at hiding their knowledge of them though, and it won't be obvious to you as the dev unless you're really rigorous about your visuals and metrics. and even if you are, you'll still get stumped by the stuff the agents get up to.

a long time ago i was surprised by a policy that had learned to trap a certain scripted opponent in a loop where it constantly chased our guy, intending to finish him off with a special attack. our agent knew this from experience. every time it was about to lose a fight against this particular opponent, which made up a single percent of the total opponent pool by the way, it triggered this flaw in the opponent's AI and pushed it into that eternal loop state, running away and never losing, electing to let the fights time out instead. it doesn't sound particularly interesting the way i'm explaining it here, but i was baffled by how many little intricate things it needed to just know and be able to exploit perfectly in order to land in this state, time and time again.

play the loop.

the same visual runner i use for catching sim bugs, cut down into a browser lab. pick a fight, click into the canvas, and poke at the thing the policies learned against.


the agent is always smarter than the developer.

sounds silly but everything will be used against you. it will learn the wrong hit delay. it will kite a scripted opponent into a bug you did not know existed. it will punish the parts of the code you thought seemed close enough.

RL is a ruthless adversarial test harness.

build the loop so every exploit teaches you something: a missing rule, a bad observation, a dishonest reward, a slow kernel, a fake boss mechanic, maybe a reference you copied without understanding.

the first lessons.

the early target was Old School RuneScape PvP. i've been writing scripts and plugins since my teens. this one was gonna be just another script, but one that joined my "day job" as a machine learning engineer with my passion projects.

turns out with the right harness and a ton of learnings you can train these agents super quick, and super good. i had my fair share of going back and forth about different sets of obs and actions, sometimes modeling the real game closer, sometimes using a ton of scaffolding. failing over and over of course like you tend to do. when i finally got good at it, the results were a bit anticlimactic even.

not because the problem was hard or impossible of course, but because it was too easy. the agents learn to mop the floor with the hardest scripted policies i could come up with, as a domain expert with years and years of high-end PvP experience in this game. and that happens within minutes of training here locally on my macbook. not on 10 GPU nodes, here, on my macbook. the hardest scripted policy is a perfect robot that makes 0 human-like mistakes AND reads the agent's intentions INTO THE FUTURE 50% of the time. not enough, gets absolutely slaughtered.
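
for a sense of what "reads the agent's intentions" can mean in code, here's a minimal sketch. everything in it is hypothetical and not the actual scripted policy: the `PeekingOpponent` class, the three attack styles, and the fallback of praying against the last style it saw are all assumptions for illustration.

```python
import random

STYLES = ("melee", "ranged", "magic")  # assumed attack styles


class PeekingOpponent:
    """hypothetical scripted opponent that sometimes sees the agent's next attack."""

    def __init__(self, peek_prob=0.5, seed=0):
        self.peek_prob = peek_prob
        self.rng = random.Random(seed)
        self.last_seen = STYLES[0]  # style the agent used on the previous tick

    def choose_prayer(self, agent_next_style):
        # with probability peek_prob, cheat: protect against the attack that
        # has not happened yet. otherwise protect against the last one seen.
        if self.rng.random() < self.peek_prob:
            prayer = agent_next_style
        else:
            prayer = self.last_seen
        # the real style becomes visible once the tick resolves
        self.last_seen = agent_next_style
        return prayer
```

in this toy version, an agent that switches styles every tick always beats the fallback branch, so only the peeks ever block anything.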

need to generalize.

PvP was cool but what am i gonna do now? where do i find actual human opponents to train against? not feasible, i don't have the data and i'm not about to go collect it. besides, i've spent hundreds of hours at this point getting good at RL and C. i no longer care about OSRS the game per se, playing it, or creating plugins for it. RL is now my single biggest passion by far.

Zulrah was the first expansion into generalizability. it's an old boss, but it has rotations, phases, venom clouds, snakelings, safespots, prayers, gear choices, and a sparse finish condition (sparse-ish; i didn't know what truly sparse meant at this point in learning, and i probably still don't, tbh). at first the sparse reward made it more annoying than it should have been. in PvP it's enough to give a terminal reward for a win and a penalty for a loss, with some minor behavior-tuning rewards or penalties on top. but ofc you gotta be super careful about those or the policy immediately makes a fool of you.
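
a minimal sketch of that PvP-style terminal reward. the `StepResult` type, the field names, the magnitudes, and the choice to score a timeout as zero are all assumptions here, not the actual reward code.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    done: bool               # episode ended on this tick
    won: bool                # only meaningful when done is True
    timed_out: bool = False  # episode hit the tick limit with nobody dead


WIN_REWARD = 1.0     # assumed magnitude
LOSS_PENALTY = -1.0  # assumed magnitude


def terminal_reward(step: StepResult) -> float:
    """sparse reward: zero on every tick except the last one."""
    if not step.done:
        return 0.0
    if step.timed_out:
        return 0.0  # assumption: a timeout is neither a win nor a loss
    return WIN_REWARD if step.won else LOSS_PENALTY
```

note that under these assumed numbers a timeout (0) beats a loss (-1). that's the kind of gap a policy will happily stall its way into.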

the final lesson from Zulrah was almost funny: once the environment told the truth, agents cleared it in tens of seconds of local training, even in low-budget gear. this boss became a regression test really quick. the env dev took me like a hundred hours since references are weak and it's a shitty boss with weird arbitrary mechanics, but i was delighted to see that once we get to serious training, i can make it work super quick. 100 hours of env dev, and like 3 hours of RL dev to solve it to redundancy.

real problem.

Inferno is not hard because it has one clever mechanic. it is hard because the whole episode is long, brittle, and full of transitions. the agent has to survive waves, preserve pillars, flick prayers, handle blobs, solve Jads, move with the shield, tag healers, choose between sets and Zuk, and do all of that after enough time has passed for tiny sim mistakes to compound. even advanced players don't say it's necessarily mechanically hard. it's just long and really demanding in terms of knowledge and muscle memory, and errors accumulate really quickly. you can't afford to mess up over what can be hours of intense focus.

well, turns out you no longer need hours to solve it once your policy is great. you do it faster than like, 98th percentile humans. you could say that's to be expected, it's a machine playing after all. only thing is that it doesn't play like real players do, like not even remotely. it figures out its own way, and that's the beauty. it solves the env in minutes of training on a 4090, from nothing. it hasn't played the game before, it doesn't understand anything about anything, it just does it. it's fascinating.

this is the first OSRS environment in the project that still feels alive as a research problem. current agents now learn to clear it from start to finish well over 70% of the time in internal runs, with very light scaffolding. often using no supplies, ignoring familiar safety habits (pillars, if you know what i'm talking about), and finding lines through the fight that look wrong until you see them happen in playback. that is the point. the policy is not trying to role-play a human Inferno run. if you want it to play like you, just go supervised or heuristic. it's not gonna get better than you though. or find bugs for you.

the failures are not random. they cluster around transitions: post-Jad into Zuk, healer arrival, post-healer recovery, and low-HP endgame choices. that is where sparse terminal reward gives weak information and shaped reward can teach the wrong habit.

so the logs grew phase labels, low-watermark counters, target diagnostics, healer metrics, forecast observations, terminal penalty modes, curriculum modes, and sweep splits. i've had 200 useful metrics on wandb simultaneously.
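
a rough sketch of what phase labels and low-watermark counters can look like. the `PhaseDiagnostics` class, the phase names, and the metric keys are made up for illustration; the real logging is surely bigger than this.

```python
from collections import Counter


class PhaseDiagnostics:
    """hypothetical per-phase failure tracker."""

    def __init__(self):
        self.episodes = 0
        self.deaths = Counter()  # phase label -> episodes that died there
        self.low_hp = {}         # phase label -> lowest hp seen in that phase

    def on_step(self, phase, hp):
        # low-watermark: remember the worst hp ever observed in this phase
        if hp < self.low_hp.get(phase, float("inf")):
            self.low_hp[phase] = hp

    def on_episode_end(self, phase, died):
        self.episodes += 1
        if died:
            self.deaths[phase] += 1

    def summary(self):
        # flat dict of scalars, the shape most metric loggers accept
        out = {}
        for phase, n in self.deaths.items():
            out[f"death_rate/{phase}"] = n / self.episodes
        for phase, hp in self.low_hp.items():
            out[f"low_hp/{phase}"] = hp
        return out
```

the point of splitting by phase is that "it died" turns into "it died during healer arrival half the time", which is something you can actually go and watch in playback.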

it's interesting how precise you need to be with rewards here. reward tagging and killing the healers? it will kill them, and then suicide immediately after, since it thinks that's the fight. there are more rewards waiting once you actually finish the fight, but nah, why bother; we got a lot of rewards here, might as well suicide now. but you don't see its internal decision making, at least not neatly. on your screen it just dies. you have to deduce these things yourself.

okay these episodes are LONG, what do we do?

late-game Inferno is expensive because most rollouts die before the mistake you care about. Go-Explore style archives attack that directly. store the state, the progress cell, the action chunk, and the recurrent hidden state when needed. then restore from the part of the fight where the policy is thin. this isn't trivial either, though.
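
a minimal sketch of such an archive, under stated assumptions: the `CellArchive` and `ArchiveEntry` names, the keep-the-best-score rule, and the inverse-square-root visit weighting are illustrative picks in the spirit of Go-Explore, not the project's actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchiveEntry:
    state: bytes             # serialized env state to restore from
    actions: tuple           # action chunk that led here
    hidden: Optional[bytes]  # recurrent hidden state, if the policy has one
    score: float
    visits: int = 0


class CellArchive:
    """hypothetical archive keyed by a coarse progress cell."""

    def __init__(self, seed=0):
        self.cells = {}
        self.rng = random.Random(seed)

    def add(self, cell, state, actions, hidden, score):
        # keep one entry per cell, and only replace it with a better one
        old = self.cells.get(cell)
        if old is not None and old.score >= score:
            return False
        visits = old.visits if old is not None else 0
        self.cells[cell] = ArchiveEntry(state, tuple(actions), hidden, score, visits)
        return True

    def sample(self):
        # rarely visited cells get picked more often
        cells = list(self.cells)
        weights = [1.0 / (self.cells[c].visits + 1) ** 0.5 for c in cells]
        cell = self.rng.choices(cells, weights=weights, k=1)[0]
        entry = self.cells[cell]
        entry.visits += 1
        return cell, entry
```

what counts as a "cell" is the real design question. too coarse and you keep restoring into the same spot, too fine and the archive never revisits anything.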

this forced a more serious state model. the environment cannot be a bag of pointers and side effects if you want to copy it, rank it, replay it, and restore it under training pressure. state setting made the env arch stricter as well. not only does RL force you to design the env well, it forces you, indirectly, to program well.
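
to make "not a bag of pointers" concrete, here's a toy sketch where all mutable state, rng included, lives in one flat buffer. `FlatEnv`, its five slots, and the damage rule are invented for the example; it's only meant to show why snapshot and restore get trivial once the state is flat.

```python
from array import array

TICK, HP, X, Y, RNG = range(5)  # slots in the flat state buffer


class FlatEnv:
    """toy env: every bit of mutable state lives in one flat int64 buffer."""

    def __init__(self, seed=1):
        self.s = array("q", [0, 99, 0, 0, seed])

    def step(self, dx, dy):
        s = self.s
        # the rng state is part of the buffer, so a restore replays identically
        s[RNG] = (s[RNG] * 6364136223846793005 + 1442695040888963407) % 2**63
        damage = s[RNG] >> 60  # top three bits: 0..7
        s[TICK] += 1
        s[X] += dx
        s[Y] += dy
        s[HP] = max(0, s[HP] - damage)
        return s[HP]

    def snapshot(self):
        return self.s.tobytes()  # plain bytes, safe to store, copy, or rank

    def restore(self, blob):
        fresh = array("q")
        fresh.frombytes(blob)
        self.s = fresh
```

if a restore followed by the same actions doesn't give the same trajectory, some state is hiding outside the buffer. that check is cheap enough to run as a test.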

throughput is everything.

the PvP lesson only landed because training could run locally. way back, it couldn't. the sim was Java and super slow. RL algos were terrible. then i found SOTA for this kind of work in PufferLib, but ofc it's CUDA first. on mac, you need to run pytorch. it's shit! sorry pytorch guys, i know you're doing good work somewhere, just not here. cue hundreds of hours of Metal kernels, and we've turned a macbook into a useful RL workbench for the smaller models. fast iteration means scripted opponents could be iterated on and then broken in minutes, sim fixes could be tested without booking a GPU, and bad ideas could die super quick. iteration speed is everything, but only if you can do it without sacrificing evaluator complexity, or lack thereof!

on the Metal side. reusable command buffers, argument tables, GPU timing, tensor ops, sampling kernels, MinGRU scan parity, optimizer buffers, allocator bugs, and sync fixes. kernel work is boring and stressful at times as you break things and recover, but if you're not extracting everything you can out of the hardware you probably don't understand the problem in the first place.

on the CUDA side. Aurora was implemented as an optimizer, and Go-Explore style work continues upstream. Aurora did not magically beat Muon in the first passes like Muon did for Adam, but it's a genuinely clever algo.

sim bug after sim bug.

that sounds too strong until you watch enough policies. when something looks dumb, the first useful question isn't "why is PPO bad?" or often even "what did i mess up in the kernels now?" it's the simulator, the observation, the action mask, the reward, or the reset distribution, every time.

environment development is rigorous software engineering because the policies punish every sloppy edge. the only reasonable and sane loop is to implement, train, watch, test, and fail. often in ways that you find incomprehensible at first. it's not magic though and you'll simply be impressed when you find the latest hack.

reward soup. this is the old thinking. let it go. you won't be able to shape your policy to victory.
script comfort. scripted opponents feel hard while encoding the exact blind spots the policy will farm. the policy isn't you, it doesn't have your weaknesses.
state magic. state restore and curriculum are powerful, but they don't forgive a weak state representation or fix all your problems. there's a ton of tuning involved.
visual lies. you can get fooled by what's in front of your own eyes as well. maybe you wired a wrong graphic or action. no free lunches with debugging, unfortunately.
benchmark romance. when optimizing throughput, be rigorous. discard statistically insignificant optimizations, or they'll come back to bite you later. code is a huge liability.
tinkering. what about this little trick? this mix of things? remember that by tinkering you're introducing complexity and lowering throughput. avoid avoid avoid. no matter what clever tricks you come up with, RL is smarter.
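
on the benchmark point, one way to check whether a throughput gain is real is a plain permutation test over repeated runs. this is a sketch with a hypothetical helper name and made-up sample numbers, not the benchmarking setup used here.

```python
import random
from statistics import mean


def permutation_p_value(baseline, candidate, n_perm=5000, seed=0):
    """one-sided test: is the candidate's mean throughput really higher?"""
    rng = random.Random(seed)
    observed = mean(candidate) - mean(baseline)
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(n_perm):
        # shuffle the labels and see how often chance alone does this well
        rng.shuffle(pooled)
        diff = mean(pooled[n:]) - mean(pooled[:n])
        if diff >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)


# made-up steps-per-second samples from repeated runs
before = [100, 101, 99, 100, 102, 98]
after = [120, 121, 119, 122, 118, 120]
p = permutation_p_value(before, after)
```

a small p means the speedup is unlikely to be noise. if it isn't small, drop the change, since the extra code costs you either way.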

the whole loop.

i can now look back at a tabula-rasa RL problem from the individual step to the optimizer kernel. policies don't stay solved forever, and in this field problems are the same but different, and will remain so until the end of time. i can build the world, expose the right state, make it fast, train it, inspect the failures, patch the simulator, and run the proper sweeps to solve it.

the agent starts blank and it can't even see anything except 1s and 0s. i don't, and i can.

the next env should be easier to write.

the whole point of doing things this way is that the next encounter should start higher up the stack for the developers who come after. a developer should write the mechanics, not rebuild the trainer. the renderer should catch state bugs, the sweeps should isolate their knobs, the state buffer should restore without ceremony. RL forces real software engineering out of you like nothing else. quit the python and notebook bullshit, it feels good but it won't fulfill you.