ScreamingPigeon

I like computers...


Developing an Out-Of-Order RISC-V CPU: part one

This post is about 5 months late. I created the markdown file on 12/23/2025 but never got around to writing anything to it. Now, I have some time on my hands (and I feel bad about not writing as often as I had planned to) - so I figured I will give this a shot again.

Last fall, I took ECE 411 the infamous computer-architecture class at my university. This was a class I was really looking forward to. After dealing with some department advising beauracracy (and almost losing my seat in the course) I did end up taking the class - and I quite enjoyed it. The final project of the class involves designing a Out-Of-Order RISC-V CPU. This blog post is going to hopefully go over

EDIT: I am an hour into writing this post. It is clear to me that is going to be hard to fit a semester’s worth of content into a single blog post. Therefore I have decided to split this into 2 posts. This one will focus on theory of operation and the next one will cover my personal experience developing the project, and some exciting news for the future. Hopefully it doesn’t take me another 6 months to write it XD

What is Out-Of-Order?

If you’re not already familiar - I would recommend you look at

In most computers - memory is very limited. CPU registers are implemented using flip-flops which are power hungry, and occupy considerable area. As such you end up consuming a lot of power/area per bit. BUT, the CPU gets instant access to the register. Therefore, the number of CPU registers is usually constrained to a number that optimizes Power/Area/Efficiency for a given use-case. For example, the RV32-I spec of RISC-V comes equipped with 32 32-bit CPU registers.

In order for you to run any non-trivial program - you need fast access to memory. Caches are usually implemented in SRAM, which give you better power/area but are much slower. If the data you want is not present in the cache, then you go to DRAM which is cheaper than SRAM - but slower and physically far away from your CPU –> which means even slower access times.

And if the data you want has not been paged to DRAM, you go hunting for it in your HDD/SSD which is an incredibly slow process for a Processor that runs at MHz/GHz frequencies.

Getting back to why Out of Order Matters

Say for example you are attempting to perform the following operation on


 1. c = b + M[d]

 2. a = c + b

 3. d = d + b

This is a straightforward addition operation. Let’s assume c, b, d, and a are registers.

M[d] means we are attempting to “dereference a pointer” , i.e. access the data stored at the memory address whose value is stored in register d

For the sake of our example, let’s say the data at address d is not present in the cache

A traditional in-order pipelined/multicycle CPU would have to stall all the pursuant instruction until this memory read is resolved, and only then will execute instructions 2 and 3.

This is a huge waste of time. The fetch, decode and execute stages/hardware of the CPU just sits there - unutilized while we wait for the memory system to respond. OOO tries to fix this problem by exploiting Instruction Level Parallelism

Now let’s take a look at instructions 2 and 3. Evidently, instruction 2 is dependent on the outcome of instruction 1. What about instruction 3? It has no dependency on Instructions 1 or 2.

An OOO CPU is capable of executing instructions Out-Of-Order and this example is a great scenario where an OOO processor would utilize it’s “free” hardware resources to execute the pursuant “independent/ready” instructions.

How does it work - Tomasulo

Out of order processors are not a new idea. IBM had already come up with an OOO hardware algorithm back in 1967. I highly recommend watching this video for a more visual explanation. It is what I used to get a sense of what is going on when I took the class, and is probably far better than my attempt here.

Tomasulo Processor Layout

The basic principle of this algorithm, called Tomasulo’s Algorithm - is that instructions are issued in-order, executed out-of-order, and committed/written-back in-order.

In a Tomasulo processor, there is one common frontend (fetch, decode, issue) - multiple “functionl units” (which are ALUs and Memory Units for load/stores). Once executed - the functional units send the result to the commit buffer / reorder buffer - that commits the results of the instruction in-order.

Each kind of functional unit has a queue known as reservation stations.

The reservation stations control when an instruction can execute

Each functional unit has a tag . The reservation stations, and the FUs are connected through a Common Data Bus. Each “broadcast” of a completion from the FU is accompanied by a tag

Instuctions are fetched and decoded in order, and then placed in the issue queue.

Each register/operand in an operation is assigned also assigned a tag - via a technique called “Register Renaming”. In Tomasulo, each operand is assigned a “tag”. Unused registers have a special tag. Registers that are already in use - are assigned a placeholder tag - usuallly pointing to the tag of the functional unit where we expect the result to originate from.

Let’s go back to my initial example

 1. c = b + M[d]

 2. a = c + b

 3. d = d + b

In RISC-V memory and arithmetic operations cannot be muddled. So let’s change the example a little

 0. e = M[d]

 1. c = b + e

 2. a = c + b

 3. d = d + b

Execution Flow with Tomasulo’s Algorithm:

Now obviously I have glossed over some of the details here. Reorder Buffer and commits form different stages of the backend of the CPU.

There is an increased pipeline depth in out of order processors due to more need of processing instruction dispatching

But we were able to demonstrate how instructions were able to execute out of order.

The main reasons for stalls in such a processor come from

Making these structures larger allows us to exploit bigger ILP windows - but also increases the combinational logic needed to make decisions on readiness - limiting clock speed, increasing area and power consumption.

The Alternative: ERR

There is another alternative to Tomasulo’s algorithm: Explicit Register Renaming.

I found these slides that have some nice visuals on ERR. The key difference between ERR and Tomasulo is that ERR de-links scheduling and renaming. Instead of “tagging” registers to functional units - we have an architectural register file and a physical register file. The physical registers are far more than the architectural registers. For example my project CPU had 64 physical CPUs - and only 32 architectural registers. here is a Register Alias Table (the RAT) that keeps track of which physical registers map to which architectrual registers. Kind of like a hash table. There is also a retirement RAT which keeps track of mappings after instruction have completely finished execution and have been committed. This helps us deal with branches (which I will talk about more in my follow-up post).

funny picture of the RAT

Each Functional unit has it’s own issue queue. which holds instructions that have been dispatched. The issue queue contains some issue logic - which is a muxtree that “dispatches” instructions with ready operands to the functional unit.

ERR can be explained with the following examples:

i0. x1 = x2 + x3
i1. x4 = x1 + x5
i2. x1 = x5 + x2
i3. x5 = x1 + x2

Let’s assume our CPU spec presents only 5 architectural registers (x1 - x5). In an ERR processor, we have physical registers - usually more than the number of architectural registers. In our case, let’s say we have 10 physical registers (p1-p10).

The instructions abover present 3 types of “dependencies”:

To make the hazards posed by these data dependencies clear - let’s assume that the ALU takes 3 clock cycles to complete a 32 bit addition. We have 2 ALU Functional Units in this ERR CPU. Let’s assume for the sake of simplicity, that the issue queues have a depth of “2” (these are usually parameterized)

Execution Flow with ERR

And now we have demonstrated how ERR allows for out-of-order execution, solving the problem of WAR and WAW dependencies. ERR is usually more power hungry and larger than Tomasulo style processors. The ILP window can be made larger by increasing the size of the ROB, Issue Queues, and increasing the number of FUs/CDBs

Discussion

And now I have tried to explain the basic premise of OOO cpus. Nothing beats the IPC of pipelined CPUs if memory was ideal and didn’t take ages to respond and was cheap for power/area. But OOO CPUs do a good job with what we have.

It is worth noting that branches get much more complicated to deal with in a OOO CPU. In the case of a branch resolving, and the prediction being incorrect - when said branch instruction is committed from the ROB - and the misprediction is discovered - the entire CPU is flushed. Because of the number of queues and stages in an OOO processor - this is an expensive affair - significantly degrading its performance.

Conclusion

This blog post took me a lot longer to write than I was expecting. It’s my first attempt at trying to explain a fairly technical topic over text only. I have glossed over a lot of implementation details (ROB, CDB, RAT) and only tried to convey the intuiton for how OOO processors beat certain limitations - so reading a textbook/lecture slides on this is highly recommended. It should help cement the foundation of understanding, and go over implementation details I would also recommend taking a look at

Part 2 of this will cover some of my experience designing a baseline CPU, and writing benchmarks to see what needed extra bells and whistles to run better.

If you have an ideas for improving the clarity of this writeup- do let me know :D