ida-f32

heads up: although i'm fluent in english i can't bother at all to be writing perfect english here since no one is going to see this, and i just spent like all day in a meeting switching between portuguese and english talking, and my mind is totally fried. this is my first SERIOUS blog post ever, and the writeup is being done like a week after everything went down (and like a month or so after the first contact with the rocm) so i don't remember some things exactly. expect typos and approximate dates.

march 2nd

so ok, starting like a month ago, day 2 of march, my birthday. i was checking if i could run rocm to generate images and try pytorch on this bc-250 i got for 550 reais cuz i thought, 16gb of ram that cheap? really cheap bro (the guy said it was broken and he bought it on aliexpress for like 700 brl so idk what to tell you). anyways i tried running rocm and the thing just hung.

then i saw it was falling back to some random ass drivers, like, i even forgor which one, i think it was a rx 6600 xt or something. and i was like ok, i prob just need to recompile everything and it'll be fine, it'll be compiling for my hardware after all.

naahhh. i spent like a week trying to compile this shit and it was crashing on my hardware the same way. so i asked like every ai i could and they were all like "dude give up, it's a miracle this thing even works." soo as the brazilian that i am, i didn't give up at all, mostly cuz all the ais were saying "it'll take a billion hours to compile this bs on this hardware" when it just took like 2 to 3 hours, idk, don't remember exactly.

so i tried tried tried to trace the thing that was crashing, eventually narrowed it down to a function, something hip-prefixed (hipSomething, idk i don't remember the exact name), patched around it, and then out of nowhere it stopped crashing. couldn't reproduce the fix reliably. gave up for like a week.

the discord message

until a dude, out of nowhere, sends this exact text on discord:

[18:10] tubes: fyi for anyone curious ROCM/HIP/PYTORCH compute all work on these boards with a specific setup. its not a hardware issue at all like everyone claims (maybe for graphics? havnt tested that but compute que does work once fixed). There is firmware on the MEC that needs to be changed, bios settings, and recompiling multiple different software stacks to make it all work due to the original firmware not supporting it. Hoping to get VLLM working as well then I will get a guide together. Using Rocm/HIP for compute it did a 2.5-3x on performance for LLMs. spent the last 2 weeks with Claude going through it all

[19:36] tubes: the MEC firmware thats stock points to the wrong registers which causes a hang.

dude wtff???? this man can just pop out of nowhere, say "claude got it working" and disappear completely????? so i dropped claude on the codebase and said: "uga booga make it work, guy on the bc250 community said it worked when he changed firmware or whatever."

and dude, claude cooked for toooo long and nothing. was like "oh for sure they just copypasted ps5 binaries here lol obviously (emdash emdash)." kept diffing the navi ps5 one with the original and some other amd one, every time it compacted, same shit on repeat, which was kinda funny ngl. until i said shut up and just see the fucking binary.

no clue why claude was like this. best theory: it had been searching the web nonstop for ps5 linux leaks, bc-250 forum threads, fail0verflow writeups, random reddit comments on r/cyanskillfish and wherever else, and somewhere in that pile something must've prompt-injected it into a 12-year-old hacker-kid persona. terminal brainrot. (not saying me screaming and cursing at it helped either lmao.)

ok so what is f32 anyway

quick sidebar. when a game or compute runtime hits a draw call or a dispatch, the gpu doesn't magically do anything. there's a tiny embedded cpu on the die called the command processor (actually five of them depending on the generation) that runs proprietary microcode shipped as firmware blobs in /lib/firmware/amdgpu/. the microcode parses pm4 packets, manages hardware queue descriptors (hqd), sets up contexts, and tells the compute units what to do. when tubes said "the mec firmware points to wrong registers", that's the firmware.

the isa is called f32:

32-bit fixed-width instructions
16 gprs (r0–r15). r0 hardwired to zero, r1 is magic (reading it pulls the next dword off the command queue), r2 holds the current pm4 header
arm-inspired mnemonics, otherwise its own thing
used in mec / me / pfp / ce / rlc: micro engine compute, micro engine (graphics), prefetch parser, constant engine, run list controller
present on everything from gcn (sea islands, polaris) through rdna 2 (navi 2x)
on gfx11+ (rdna 3) amd replaced mec/me/pfp with rs64 (risc-v based). rlc and sdma still use f32

so to verify tubes' claim, i needed to read cyan_skillfish2_mec.bin (my silicon) and compare it to navi10_mec.bin (the reference the kernel driver assumes). no existing tool decoded either one cleanly. which brings us back to claude.

shut up and look at the actual binary

and dude, claude actually cooked this time. like, really cooked. he started fucking reading instructions and fucking disassembling things by hand, i dont even know how the fuck he did it. so i was like no way bro, i'll have to build a disassembler for this (i dabble every now and then in binary analysis and have done some freelance jobs for companies analyzing and reimplementing binaries). dude.

the firmware breaks down like this:

cyan_skillfish2_mec.bin (268,592 bytes):
0x000-0x0FF: common firmware header
0x100-0x1FF: psp $PS1 signature header (not encrypted)
0x200-0xF9F7: f32 microcode — 15,870 instructions
0xF9F8-0x411FF: zero padding (~198 KiB)
0x41200-0x415AF: gcn/rdna cleaner shader (NOT f32 code)
0x415B0-0x4192F: pm4 jump table (224 entries)
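
a minimal sketch of slicing the file by that layout, in case you want to sanity-check the numbers yourself. the region offsets are copied from the list above (written with exclusive ends here); the function names and the little-endian word order are my assumptions for the sketch, not something lifted from f32dis.

```python
import struct

# Region boundaries copied from the layout above; end offsets are exclusive here.
REGIONS = [
    ("common header",  0x00000, 0x00100),
    ("psp header",     0x00100, 0x00200),
    ("f32 microcode",  0x00200, 0x0F9F8),
    ("zero padding",   0x0F9F8, 0x41200),
    ("cleaner shader", 0x41200, 0x415B0),
    ("pm4 jump table", 0x415B0, 0x41930),
]


def split_regions(blob):
    """Return {name: bytes} for each region of a cyan_skillfish2_mec.bin image."""
    expected = REGIONS[-1][2]
    if len(blob) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(blob)}")
    return {name: blob[start:end] for name, start, end in REGIONS}


def microcode_words(blob):
    """Microcode region as a list of 32-bit words (little-endian is an assumption)."""
    code = split_regions(blob)["f32 microcode"]
    return [word for (word,) in struct.iter_unpack("<I", code)]
```

on a real image, len(microcode_words(blob)) should come out to the 15,870 instructions quoted above, since (0xF9F8 - 0x200) / 4 = 15,870.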

15,870 real instructions sitting between the psp header and a mountain of padding. the weird high-entropy chunk near the end (the one claude kept insisting was "just a shader cleaning thing, don't worry about it") turned out to be exactly that plus the pm4 jump table. fine. shut up. moving on.

the decode grind

goal was brutally simple: zero unknowns on cyan_skillfish2_mec.bin. fail0verflow's original isa tables got me to ~92%. the remaining 8% was a mix of:

cbz/cbnz with b ≠ 0. the original decoding only accepted b == 0 for conditional branches. rdna-era compilers emit them with all sorts of b values (481 instructions in this one firmware), and they decode fine if you just accept them. all 481 targets resolve to valid code.
scratch ram save/restore. opcode a = 0x37 with immediates 0x4000 / 0xc000 / 0x8000 are save / restore / savef: cooperative context-switch primitives the mec uses when the host preempts a queue. 148 of them, all with rd=r0, all in consecutive runs.
extended reg-reg ops. inside a = 0x1F (the "register-register" opcode space) sub-opcode c = 0x21 is movd (64-bit move) and c = 0x480 is some kind of hwop barrier i couldn't fully pin down but behaves like a fence when you trace surrounding blocks.
extended alu with b = 2/3. the b field is normally a 2-bit target-space selector for memory ops. in alu context the higher values encode sign/zero extension modes that weren't in the 2016 table.
opcodes 0x27–0x2d. conditional branch variants. targets validated against the pm4 jump table in the footer.
opcodes 0x2e–0x2f. extended load/store word with wider immediates.
opcodes 0x38–0x3f. i can decode the encoding but don't know the semantics yet. labeled ext3X.

100% decode. every byte of the microcode section accounted for.
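
roughly how i'd measure "zero unknowns", as a sketch. the save / restore / savef immediates are the ones from the list above; `decode` is a hypothetical callable that returns None for anything it can't decode, and the fields are assumed to be already extracted because the bit positions aren't spelled out in this post.

```python
from collections import Counter

# From the list above: opcode a = 0x37, immediate selects the scratch-ram op.
SCRATCH_OPS = {0x4000: "save", 0xC000: "restore", 0x8000: "savef"}


def name_scratch_op(a, imm):
    """Name a scratch-ram op from already-extracted fields, or None if it isn't one."""
    if a != 0x37:
        return None
    return SCRATCH_OPS.get(imm)


def coverage(words, decode):
    """Return (fraction decoded, Counter of undecoded words).

    `decode` is a hypothetical callable: word -> mnemonic, or None if unknown.
    """
    unknown = Counter(w for w in words if decode(w) is None)
    total = len(words)
    known = total - sum(unknown.values())
    return (known / total if total else 1.0), unknown
```

the Counter is the useful part: sorting the unknown words by frequency tells you which missing encoding to chase next.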

60% of work for nothing

this shit took me like an entire week of work. obviously claude helped cuz idk how or why but dude is like really decent at re. until claude got stumped on some opcodes, decided to search the web for them, and ended up fucking finding fail0verflow's repo.

dam

...

yeah, i did like 60% of the work for nothing.

and claude, still with some type of ai personality problem or whatever, said: "lmao AMD can cope. fail0verflow published the ISA in 2016 at 33C3 and AMD didn't do shit about it. besides this is 'educational research'"

then i checked against fail0verflow's one, i think the offsets were different? (i'll cross this out later if they weren't) like, where the code was placed. either way, their script is gold and it's the base of everything from here forward.

turning a script into an ida module

f32dis prints text. useful for eyeballing one function. useless for diffing two firmwares the size of novels. i needed xrefs, graph view, function boundaries, cross-file comparison. that meant an ida pro processor module.

ida processor modules aren't disassemblers. they're oracles. ida asks them questions ("given these bytes at this address, what instruction is this? what are its operands? does it end a basic block? does it reference another address?") and the module answers. ida uses the answers to build its xref graph, call tree, and autoanalysis state.

operand type modeling

ida has a fixed set of operand types: o_reg, o_imm, o_displ ([reg + imm]), o_phrase ([reg + reg]), o_near (branch target), etc. f32 has operands that don't map cleanly. take ldw r1, reg[r2, #0x2040], where reg is a target space specifier that tells you this memory access goes through the mmio register window instead of system memory.

i encoded the target space as a flag on the operand's specval:

FL_REG_TGT  = 0x02   # reg[...] — mmio
FL_MEM_TGT  = 0x04   # mem[...] — system ram
FL_UNK_TGT  = 0x08   # unk[...] — unknown/unused
FL_64BIT    = 0x20   # double-word op

the output formatter reads those flags and prints the prefix. xrefs to the mmio register table hang off the immediate when the target is reg, so ida auto-annotates 0x208e as CP_HQD_PQ_RPTR without me lifting a finger. the cp register names came from the amdgpu kernel headers.
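
a standalone sketch of that formatter logic, outside of ida. the flag values and the 0x208e name are the ones above; the function names, the one-entry name table, and the "; NAME" comment style are assumptions for illustration (in the real module ida does the annotation through xrefs).

```python
FL_REG_TGT = 0x02  # reg[...] : mmio
FL_MEM_TGT = 0x04  # mem[...] : system ram
FL_UNK_TGT = 0x08  # unk[...] : unknown/unused

# One name taken from the text; a full table would come from the amdgpu headers.
MMIO_NAMES = {0x208E: "CP_HQD_PQ_RPTR"}


def target_prefix(specval):
    """Map the target-space flag stored in specval to its printed prefix."""
    if specval & FL_REG_TGT:
        return "reg"
    if specval & FL_MEM_TGT:
        return "mem"
    if specval & FL_UNK_TGT:
        return "unk"
    return ""


def format_operand(base_reg, imm, specval):
    """Render an operand like reg[r2, #0x2040], naming known mmio registers."""
    prefix = target_prefix(specval)
    text = f"{prefix}[r{base_reg}, #0x{imm:x}]"
    if prefix == "reg" and imm in MMIO_NAMES:
        text += f"  ; {MMIO_NAMES[imm]}"
    return text
```

format_operand(2, 0x2040, FL_REG_TGT) gives reg[r2, #0x2040], matching the ldw example above.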

branch semantics for autoanalysis

this is the part that, if you get it wrong, graph view breaks and ida never finds function boundaries.

each branch needs:

CF_JUMP if it's a branch
CF_CALL if it's a call (writes return address)
CF_STOP if it's an unconditional terminator (ret, unconditional branch, jumptable dispatch)
the target address emitted as an o_near operand so xrefs fire
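
the rules above as a lookup table, just to make them concrete. this is a sketch: the Feature class is a local stand-in so it runs without ida (a real module would use the CF_* constants from the ida sdk), and the instruction-kind labels are hypothetical names for this example, not actual f32 mnemonics, except ret and btab.

```python
from enum import IntFlag, auto


class Feature(IntFlag):
    """Local stand-ins for IDA's CF_* bits; the real values come from the IDA SDK."""
    CF_JUMP = auto()
    CF_CALL = auto()
    CF_STOP = auto()


# Instruction kinds are illustrative labels, following the rules listed above.
FEATURES = {
    "cond_branch":   Feature.CF_JUMP,
    "uncond_branch": Feature.CF_JUMP | Feature.CF_STOP,
    "call":          Feature.CF_CALL,
    "ret":           Feature.CF_STOP,
    "btab":          Feature.CF_JUMP | Feature.CF_STOP,
}


def falls_through(kind):
    """True if execution can continue at the next instruction after `kind`."""
    return not (FEATURES.get(kind, Feature(0)) & Feature.CF_STOP)
```

the fall-through bit is what decides where basic blocks end, which is why a wrong CF_STOP wrecks graph view.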

f32's hairiest case is btab: pm4 jump table dispatch. it's what handles a pm4 type-3 packet by jumping to the handler indexed by the top byte of the packet header. the table isn't in the code, it's in the last size & 0xfff bytes of the firmware file. the loader (f32_fw.py) parses the footer, extracts the 224 jump table entries, and plants xrefs to each handler before autoanalysis runs. without that, ida thinks most of the code is dead.
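
a hedged sketch of the table read. it assumes the 224 entries are the final 896 bytes of the file (per the layout earlier) and that each entry is one little-endian dword; how an entry maps to a handler address isn't covered in this post, so the function just returns the raw values. read_jump_table is a made-up name, not the loader's actual function.

```python
import struct

JUMP_TABLE_ENTRIES = 224  # entry count from the firmware layout


def read_jump_table(blob, entries=JUMP_TABLE_ENTRIES):
    """Read the pm4 jump table from the tail of a firmware image.

    Assumes one little-endian dword per entry; the raw values are returned
    without interpreting them as addresses.
    """
    size = entries * 4
    if len(blob) < size:
        raise ValueError("image is smaller than the jump table")
    return list(struct.unpack_from(f"<{entries}I", blob, len(blob) - size))
```

in a loader you'd then turn each entry into a code address and register it as an entry point before autoanalysis starts.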

the magic `r1` register

reading r1 doesn't read a register. it pops the next dword off the command queue. mov r3, r1 means "consume one dword from the ring buffer." some routines do this 20 times in a row to parse a packet inline. i annotated every r1 read in the disasm so you can tell at a glance when a read is "normal" vs "consumer." the DISPATCH_DIRECT handler turned out to be four mov rN, r1s in a row that shovel grid_x / grid_y / grid_z / flags out of the packet in order.
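
a toy model of that behavior, to show why an r1 read is not a normal read. it's a sketch under assumptions: the class and method names are mine, the destination registers r3 to r6 in the usage note are hypothetical, and writes to r0 / r1 are simply ignored because this post doesn't say what the hardware does with them.

```python
from collections import deque


class RegFile:
    """Toy f32 register file: r0 reads as zero, reading r1 pops the command queue."""

    def __init__(self, queue):
        self.queue = deque(queue)
        self.regs = [0] * 16

    def read(self, n):
        if n == 0:
            return 0                      # hardwired zero
        if n == 1:
            return self.queue.popleft()   # consumes one dword from the queue
        return self.regs[n]

    def write(self, n, value):
        if n in (0, 1):
            return                        # ignored in this toy model
        self.regs[n] = value & 0xFFFFFFFF

    def mov(self, dst, src):
        self.write(dst, self.read(src))
```

feeding it RegFile([8, 4, 1, 0x1]) and running mov(3, 1), mov(4, 1), mov(5, 1), mov(6, 1) drains the queue in order, which is the DISPATCH_DIRECT pattern: grid_x, grid_y, grid_z, flags.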

what the firmware actually does

with 100% decode and a working module, the first real question was: what does the mec actually spend its time on?

address spaces used by the mec:

b	space	unique regs	reads	writes
0	internal (mec private)	291	1,110	2,827
1	mmio/grbm (shared)	331	919	1,118
2	memory	2	2	15
3	unknown (read-only)	99	278	0

the b=0 internal space is the mec's own control surface. 2,827 writes and only 1,110 reads suggests "state machine driver." the most-written internal register (0x0013, 626 writes, zero reads) looks like the dispatch trigger: write to kick off queue work.

the b=1 mmio space is how the mec talks to the rest of the gpu: hqd registers, grbm, eop events. the most-accessed is 0x322B (91 reads / 89 writes), clearly the dispatch/completion loop register. then 0x321F (CP_HQD_EOP_WPTR, 22r/48w), 0x30B5 (CP_HQD_CTX_SAVE_CONTROL, 6r/19w), 0x2E40 (CP_HQD_VMID, 18w), 0x2E00 (CP_MQD_CONTROL, 12w). all the queue-management suspects.
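
the tallies above come from counting accesses per address space. a minimal sketch of that count, assuming the disassembler can hand you (space, register, is_write) tuples; that tuple shape and the function name are assumptions for the example.

```python
from collections import defaultdict


def space_stats(accesses):
    """Summarise register accesses per address space.

    accesses: iterable of (space, reg, is_write) tuples.
    Returns {space: {"unique_regs": n, "reads": n, "writes": n}}.
    """
    stats = defaultdict(lambda: {"regs": set(), "reads": 0, "writes": 0})
    for space, reg, is_write in accesses:
        entry = stats[space]
        entry["regs"].add(reg)
        entry["writes" if is_write else "reads"] += 1
    return {
        space: {
            "unique_regs": len(entry["regs"]),
            "reads": entry["reads"],
            "writes": entry["writes"],
        }
        for space, entry in sorted(stats.items())
    }
```

a register with lots of writes and zero reads, like 0x0013, falls straight out of a per-register version of the same loop.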

cyan_skillfish vs navi10

now the actual point of this whole exercise.

i loaded both firmwares, scripted a mechanical diff between the two disassemblies, and got:

0x00000-0x003EC:  SAME  (251 insn) — entry/init
0x003EC-0x00EEB:  MIXED (scattered branch target diffs)
0x00EEC-0x0FE48:  DIFF  (~12,000 insn) — divergent region
0x0FE48-0x40000:  SAME  (49,262 insn) — packet handler bulk
0x41000-0x41580:  DIFF  (cleaner shader — different per chip)

76.5% byte-identical. 23.5% divergent, and the divergence is concentrated in one 12,000-instruction region.
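
a sketch of what a mechanical diff like this can look like: walk both images one dword at a time and merge neighbors into SAME / DIFF runs. it's simpler than what produced the table above (no MIXED category, no disassembly-level comparison), and the function names are made up for the example.

```python
from itertools import groupby


def diff_runs(fw_a, fw_b):
    """Compare two images dword by dword.

    Returns [(start, end, "SAME" | "DIFF"), ...] with byte offsets, end exclusive.
    Only the common length (rounded down to whole dwords) is compared.
    """
    n = min(len(fw_a), len(fw_b)) // 4 * 4
    flags = [fw_a[i:i + 4] == fw_b[i:i + 4] for i in range(0, n, 4)]
    runs, pos = [], 0
    for same, group in groupby(flags):
        count = sum(1 for _ in group)
        runs.append((pos * 4, (pos + count) * 4, "SAME" if same else "DIFF"))
        pos += count
    return runs


def identical_fraction(runs):
    """Fraction of compared bytes that sit inside SAME runs."""
    total = sum(end - start for start, end, _ in runs)
    same = sum(end - start for start, end, tag in runs if tag == "SAME")
    return same / total if total else 1.0
```

a raw positional diff like this overstates divergence when code merely shifts by a few instructions, so treat its output as a first map rather than the final word.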

the register access patterns confirm it. 112 mmio accesses differ between firmwares with no consistent offset delta. they're individually different, not a simple shift. some regions literally swap which registers they touch:

region	cyan_skillfish uses	navi10 uses
0x016DC-0x019D0	0x2E01, 0x2E16-0x2E1A	0x321F
0x019D4-0x021E4	0x321F	0x2E01, 0x2E16-0x2E1A
0x08A58-0x0DFF0	0x3211-0x3215, 0x323E	0x2270, 0x31DC-0x31E5

same operations, reorganized into different code paths.

and the region that matters: 0x0DFF4-0x0FE48 (1,941 instructions). in navi10, this block programs the hqd registers for compute dispatch:

0x2E07, 0x2E08, 0x2E09: compute dispatch config
0x2E0C, 0x2E0D: compute program address
0x2E12, 0x2E13: compute queue control
0x2E28: compute vmid config
0x2E40: CP_HQD_VMID
0x3213: eop status

cyan_skillfish does not program these in this region. entirely different logic.

yeah and claude also dropped this line: "moggamos a amd completamente" (we completely mogged amd)

👆 yeah here's the whole thing decompiled (stripped some nops cuz it was overwhelming)

the crash path

mapping the full dispatch sequence:

hip runtime sends a DISPATCH_DIRECT pm4 packet to the compute queue
doorbell rings → mec picks up the packet
mec parses the packet (the four mov rN, r1s)
mec programs the hqd registers, writes to its internal dispatch trigger
on completion, mec should write an eop event into memory
host thread blocks in pthread_cond_wait waiting for that write
nothing arrives. hang. eventually ring timeout kills the gpu.

what i initially thought was a clean split between working sync ops and hanging async ops turned out to be timing luck. on a clean boot with a fresh binary i could get hipMemset + hipMemcpy sync + async to all pass, then fail on the first user kernel. rebuild with debug symbols or rerun with a different binary and the same hipMemset hangs. the dispatch path is non-deterministic: sometimes it processes the command before hitting the failure, sometimes not.

what the 23% firmware divergence plus the broken dispatch points at is an internal register layout mismatch. two candidates, in order of how much evidence they have:

internal register offsets differ on oberon. the 291 mec-private registers (space b=0) aren't exposed in any public header, and the amdgpu driver trusts that the navi10 layout applies. if amd swapped offsets for oberon and the firmware was built against the new layout while the driver still talks to the old one, the mec's state machine writes to dead addresses. writes succeed (nothing aborts), nothing happens, nothing completes.
the a53 did setup work the mec expects. oberon's arm a53 co-processor is cut out of the bc-250 binning. if the ps5 boot path had the a53 initialize some mec-local state before anything hit the compute queues, the bc-250's firmware path inherits the assumption and never covers that init.

both sit under the same umbrella: cyan_skillfish firmware was built for oberon's full configuration, bc-250 is a harvested version of that silicon, and nobody validated the firmware against what the harvest removed. writes are going somewhere they don't belong.

where i actually am

and then i got this working yayyyy. module works, 100% decode, xrefs resolve, graph view is clean. the 12,000-instruction divergent region is mapped, i can tell you exactly which handlers differ between cyan_skillfish and navi10 and which internal registers are suspect. that's the map.

now i just need to:

trace the full dispatch handler in cyan_skillfish, record the exact register write sequence
side-by-side with navi10's handler to find the specific divergence point
identify which write is the silent failure: probably a poll-loop that never sees the expected status bit
patch the firmware: replace the broken writes, re-sign the psp header or bypass verification in the driver
re-enable real compute queues in mesa (num_queues > 0) and test hip
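
for the first two steps, the comparison itself is the easy part. a sketch, assuming each handler's trace has already been reduced to an ordered list of (register, value) writes; that trace format and the function name are hypothetical, nothing in the module emits it yet.

```python
from itertools import zip_longest


def first_divergence(writes_a, writes_b):
    """Find the first position where two register-write traces differ.

    Each trace is an ordered list of (register, value) tuples.
    Returns (index, entry_a, entry_b), or None if the traces are identical.
    A missing entry (one trace is shorter) shows up as None.
    """
    for i, (a, b) in enumerate(zip_longest(writes_a, writes_b)):
        if a != b:
            return i, a, b
    return None
```

since the two firmwares reorder code rather than just shifting it, a positional compare will probably need an alignment pass first (python's difflib.SequenceMatcher would be one option).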

so like, i'm maybe 30% done lmao :C. but it's fine, i hope to have time to work on it soon ™.

what's still missing in the module

opcodes 0x38–0x3f. decode fine, don't know semantics.
hwop variants. c = 0x480 behaves like a fence but i haven't nailed the exact memory ordering model.
gfx11 rs64 support. when amd moved mec to rs64 they also changed the firmware container format. rs64 is risc-v so ghidra/ida already decode the instructions; the loader just needs a different psp parser.
sdma f32. sdma firmware also runs f32 but with a different register usage convention i haven't fully mapped.

credits

fail0verflow's original radeon-tools f32dis is the base. the work at 33c3 is the reason any of this is public. my contribution is rdna-era opcode coverage, the ida processor integration, the loader, and the psp firmware handling. also tubes, whoever you are, for dropping that one discord message that turned a dead project back on.

try it yourself

git clone https://github.com/GabriWar/ida-f32
cp ida-f32/f32.py    ~/.idapro/procs/
cp ida-f32/f32_fw.py ~/.idapro/loaders/

then drag /lib/firmware/amdgpu/*mec*.bin into ida. or without ida:

python3 f32dis.py /lib/firmware/amdgpu/cyan_skillfish2_mec.bin | less

all the code is mit. fork it, extend it, send prs. and if anyone figures out ext38 through ext3f, or the complete dispatch path in the cyan_skillfish mec, open an issue.

next up

come back later for me trying (and FAILING REALLY HARD) to turn on vcn on this cursed ps5 apu.

gabriwar