anna-docs

When Microcontrollers Struggle with Math: Building Herald

anna — Sun, 29 Mar 2026 05:30:00 +0530

Why I Built Herald

Microcontrollers are great, until you ask them to do math.

Try throwing trigonometry or signal processing at a small MCU and things slow down very quickly. This is especially true for cheaper or simpler chips that either do not have a floating point unit or rely on relatively expensive hardware support. Even when FPUs are available, they are not always the best choice for power- or area-constrained systems.

I wanted something simpler. Maybe a small hardware block that could handle these operations efficiently without dragging in all the overhead of floating point. And that idea eventually became Herald, a fixed-point DSP coprocessor designed for Tiny Tapeout.

The design itself was written in Bluespec SystemVerilog for rapid prototyping, and then compiled down to Verilog to fit into the Tiny Tapeout flow.

At first, this was just an experiment to see if something like this could even fit within Tiny Tapeout's constraints. It turned out to be a lot more involved than expected.

What Herald does

At a high level, Herald is a small hardware accelerator for math operations that tend to be slow on microcontrollers.

It focuses on two main categories:

Trigonometric and vector operations
Multiply-accumulate (MAC) operations

That already covers a lot of real-world use cases like filtering, sensor fusion, and basic control systems. Instead of doing all this in software, the idea is simple: you send inputs to Herald, let it compute in hardware, and read the result back once it's ready.

Internally, this is handled by two main blocks:

A CORDIC engine for trigonometry and vector math
A MAC unit for fast arithmetic accumulation

A small control interface ties everything together and keeps the interaction simple.

Architecture Overview

At a high level, Herald can be split into two parts: a control path and a datapath.

The control path is driven by a small FSM. It handles commands, collects input data, and decides when to start computation and when results are ready.

The datapath is where the actual work happens. Inputs are routed to the appropriate compute block, and the result is selected and sent back through the interface.

Keeping these two parts separate made the design easier to reason about, especially during debugging!

High-level architecture of Herald showing the control FSM, compute engines, and result path.

How Herald really works

For the low-level nerds, this is where things get a bit more interesting.

Herald operates through a simple command-driven flow controlled by an FSM. Each operation follows a fixed sequence:

A command is written to select the operation
Input data is sent byte by byte
The selected compute block is triggered
The result becomes available once computation completes!

Since each operand is 24 bits wide, data is transferred over multiple cycles using an 8-bit interface. The FSM takes care of assembling these bytes and ensuring everything is aligned before execution starts.
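Here is a minimal software sketch of that byte assembly, written in Python rather than HDL. The function name `assemble_operand` and the byte order (most significant byte first) are assumptions for illustration; the post does not state which order Herald actually uses.

```python
def assemble_operand(data_bytes):
    """Assemble three 8-bit transfers into one 24-bit operand.

    Assumption: bytes arrive most-significant first. Herald's real
    byte order is not stated in the post.
    """
    if len(data_bytes) != 3:
        raise ValueError("a 24-bit operand needs exactly 3 bytes")
    value = 0
    for b in data_bytes:
        if not 0 <= b <= 0xFF:
            raise ValueError("each transfer must fit in 8 bits")
        value = (value << 8) | b  # shift in one byte per transfer
    return value


operand = assemble_operand([0x00, 0x38, 0x00])
print(hex(operand))  # 0x3800
```

In hardware the same idea is a shift register plus a byte counter: the FSM only moves on to execution once the counter says all three bytes have landed.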

From the outside, this just looks like a small protocol. Internally though, it keeps the design predictable and, more importantly, saved me from pulling my hair out by avoiding a bunch of timing headaches.

FSM controlling command intake, operand loading, execution and result output.

The Interesting Bits

A lot of the design choices in Herald come down to one idea: keep things simple, but still useful.

Fixed-point arithmetic

Instead of using floating point, Herald uses a fixed-point format (Q12.12). In simple terms, everything is stored as an integer with an implicit scaling factor.

So instead of working with something like 3.5, you scale it by 2^12 (giving 14336) and store it as an integer. All computations then stay in integer form. This avoids the need for an FPU entirely, which saves both area and complexity, while still giving enough precision for most embedded use cases.
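To make the scaling concrete, here is a small Python sketch of Q12.12 encoding and decoding. It assumes a signed, two's-complement 24-bit word, which the post does not spell out; the helper names are made up for this example.

```python
FRAC_BITS = 12             # Q12.12: 12 integer bits, 12 fractional bits
SCALE = 1 << FRAC_BITS     # 4096
WIDTH = 24
MASK = (1 << WIDTH) - 1


def to_q12_12(x):
    """Encode a real number as a 24-bit Q12.12 word.

    Assumption: signed two's-complement representation.
    """
    return round(x * SCALE) & MASK


def from_q12_12(word):
    """Decode a 24-bit Q12.12 word back to a float."""
    if word & (1 << (WIDTH - 1)):   # sign bit set, so the value is negative
        word -= 1 << WIDTH
    return word / SCALE


print(to_q12_12(3.5))                     # 14336
print(from_q12_12(to_q12_12(-1.25)))      # -1.25
```

The smallest step this format can represent is 1/4096, roughly 0.00024, which is where the "enough precision for most embedded use cases" trade-off comes from.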

CORDIC: doing trig without "real" math

CORDIC can be visualized as rotating a vector toward a target angle. In hardware, this happens iteratively in small steps.

The CORDIC engine is probably the most interesting part of the design.

Instead of directly computing things like sin or cos, it works by rotating a vector in small steps until it reaches the desired angle. Each step only uses shifts and additions, which makes it very hardware friendly.

After enough iterations, you end up with a very close approximation of the result without ever using multipliers or complex functions.
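The rotation loop can be sketched in a few lines of Python. This is a generic rotation-mode CORDIC on Q12.12 integers, not Herald's actual implementation: the iteration count (12) and the function name are assumptions, and the lookup table is computed with floating point here only because in hardware it would be a small precomputed ROM.

```python
import math

FRAC_BITS = 12
SCALE = 1 << FRAC_BITS
ITERATIONS = 12  # assumption: the post does not state Herald's iteration count

# atan(2^-i) in Q12.12; in hardware this would be a small constant table.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * SCALE) for i in range(ITERATIONS)]

# Each step stretches the vector slightly, so start from 1/gain
# to end up with a unit-length result.
_gain = 1.0
for _i in range(ITERATIONS):
    _gain *= math.sqrt(1.0 + 2.0 ** (-2 * _i))
INV_GAIN = round(SCALE / _gain)


def cordic_sin_cos(angle):
    """Approximate (sin, cos) of `angle` in radians, roughly -pi/2..pi/2.

    The loop body uses only shifts, additions and subtractions.
    """
    x, y = INV_GAIN, 0
    z = round(angle * SCALE)          # remaining angle to rotate through
    for i in range(ITERATIONS):
        if z >= 0:                    # rotate counter-clockwise
            x, y = x - (y >> i), y + (x >> i)
            z -= ATAN_TABLE[i]
        else:                         # rotate clockwise
            x, y = x + (y >> i), y - (x >> i)
            z += ATAN_TABLE[i]
    return y / SCALE, x / SCALE


s, c = cordic_sin_cos(math.pi / 6)
print(s, c)  # close to 0.5 and 0.866, within fixed-point error
```

Each extra iteration buys roughly one more bit of angular accuracy, so the iteration count is a direct trade between latency and precision.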

MAC: simple, but useful

The MAC unit is much simpler, but it ends up being just as useful.

It maintains an internal accumulator and supports operations like multiply, accumulate and subtract. This makes it ideal for things like dot products or filtering, where values are combined repeatedly.

It is also much faster than the CORDIC path, completing in a single cycle in this design, which makes it a nice complement to the more iterative CORDIC engine.
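As a rough software model, a MAC unit is just an accumulator with a multiply in front of it. The class and method names below are hypothetical, and the accumulator here is an unbounded Python integer; the real unit's accumulator width and overflow behaviour are not described in the post.

```python
FRAC_BITS = 12
SCALE = 1 << FRAC_BITS


class Mac:
    """Toy model of a multiply-accumulate unit working on Q12.12 integers.

    Assumption: unbounded accumulator (no overflow or saturation modelled).
    """

    def __init__(self):
        self.acc = 0

    def clear(self):
        self.acc = 0

    def mac(self, a, b):
        """acc += a * b, rescaled back to Q12.12."""
        self.acc += (a * b) >> FRAC_BITS
        return self.acc

    def msub(self, a, b):
        """acc -= a * b, rescaled back to Q12.12."""
        self.acc -= (a * b) >> FRAC_BITS
        return self.acc


# Dot product of [1.5, 2.0, -0.5] and [2.0, 0.25, 4.0] in Q12.12.
xs = [int(v * SCALE) for v in (1.5, 2.0, -0.5)]
ys = [int(v * SCALE) for v in (2.0, 0.25, 4.0)]

unit = Mac()
for a, b in zip(xs, ys):
    unit.mac(a, b)

print(unit.acc / SCALE)  # 1.5
```

The shift after the multiply matters: the raw product of two Q12.12 values carries 24 fractional bits, so it has to be brought back to 12 before it is added to the accumulator.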

From Code to Silicon

The Big Leagues

Full Tiny Tapeout IHP26A shuttle layout showing multiple user designs sharing the same space!

One of the most satisfying parts of this project was seeing the design turn into actual layout.

At the shuttle level, Herald is just one small block among many designs sharing the same space. It is easy to forget how small your design really is until you see it in that context.

Zooming into Herald

Herald's layout showing standard cells and routing after synthesis and place-and-route.

Grabbing a magnifying glass and looking into the shuttle GDS, we can have a closer look at Herald! This is where things get more interesting.

What started as HDL modules and state machines is now a dense grid of standard cells and routing. Every operation (from simple addition to a full CORDIC iteration) is physically represented here in the most fundamental electronic device: the transistor.

What Went Wrong (And What I Learned)

As is always the case, no hardware project is complete without a few things going wrong, and Herald definitely had its share.

The "it works...wait, no it doesn't" bug

At one point, the CORDIC engine was producing completely incorrect results. Everything looked fine on paper, but the outputs were clearly off.

After digging into it, the issue turned out to be a subtle timing problem. The design was effectively reading results before the computation had actually progressed, so it ended up returning values from an earlier iteration.

The fix itself was small, just an extra control signal to make sure the computation had actually started, but figuring it out took a lot longer than expected.

When data is not what you think it is

Another issue showed up while testing the wrapper interface. Values coming out of the design were correct, but somehow ended up in the wrong places.

This turned out to be due to how tuples are packed in Bluespec SystemVerilog. The ordering was not what I initially assumed, which meant values were being interpreted incorrectly when read out.
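A small Python sketch shows how this kind of mix-up happens. It assumes the first tuple element lands in the most significant bits, which is how Bluespec's pack is generally documented to behave for structs and tuples; the 24-bit field width and the function names are hypothetical, and the post does not say exactly which ordering was assumed.

```python
def pack_pair(first, second, width=24):
    """Concatenate two fields with the FIRST one in the upper bits.

    Assumption: this mirrors how a Bluespec Tuple2 is packed.
    """
    mask = (1 << width) - 1
    return ((first & mask) << width) | (second & mask)


def unpack_pair(word, width=24):
    """Correct reader: first field comes from the upper bits."""
    mask = (1 << width) - 1
    return (word >> width) & mask, word & mask


def unpack_pair_low_first(word, width=24):
    """Buggy reader: wrongly assumes the first field sits in the low bits."""
    mask = (1 << width) - 1
    return word & mask, (word >> width) & mask


packed = pack_pair(0x111111, 0x222222)
print([hex(v) for v in unpack_pair(packed)])            # ['0x111111', '0x222222']
print([hex(v) for v in unpack_pair_low_first(packed)])  # ['0x222222', '0x111111']
```

Both readers return perfectly valid numbers, which is exactly why the bug looks like "correct values in the wrong places" rather than garbage.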

Looks right...until you check it

One interesting moment was verifying the CORDIC gain factor.

Instead of just trusting the theoretical value, I compared expected results with actual hardware outputs. This helped confirm that the implementation was behaving as intended, even with approximation errors from finite iterations.

It is very easy to assume things are correct when they "look right", but actually checking the numbers made a big difference!
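For reference, the theoretical gain is easy to compute: every CORDIC step i stretches the vector by sqrt(1 + 2^(-2i)), and the total gain is the product of those factors. The sketch below is a generic calculation, not Herald's verification code, and the function name is made up for this example.

```python
import math


def cordic_gain(iterations):
    """Product of the per-step stretch factors sqrt(1 + 2^(-2i))."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


for n in (1, 4, 8, 16):
    print(n, cordic_gain(n))
```

The value settles near 1.6468 after only a handful of iterations, so comparing hardware output magnitudes against this number is a quick sanity check on both the iteration count and the scaling constant.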

Wrapping Up

What started as a small experiment turned into a full journey through designing, debugging and finally seeing the project make its way to actual silicon!

Working within the constraints of Tiny Tapeout forced a lot of decisions, from using fixed-point arithmetic to keeping the interface simple. In the end, those constraints shaped the design just as much as the original idea.

More than anything, this project showed me that the real challenges in hardware are often not the big ideas but the small details. Timing and integration tend to matter far more than expected.

I am truly grateful to the Tiny Tapeout community for their quick help whenever I needed it. I am very excited to see how Herald behaves in real silicon soon :)

You Don't Need Expensive Tools to Play with Silicon

anna — Sat, 07 Feb 2026 05:30:00 +0530

So you're sold on RISC-V. You've seen the vision. Now comes the question every beginner asks: "Okay, but how do I actually start?"

In the previous post, we explored why RISC-V is the "Linux of hardware" and how it's making processor design accessible to everyone. Now let's talk about the actual tools that let you build, simulate, and hack on real processors.

The good news? You don't need expensive licenses, proprietary toolchains, or a university lab. The hardware hacking stack is wide open and it's free.

Breaking the Myth: Hardware != Expensive

When most people think "chip design," they imagine locked-down corporate tools with five-figure licenses: Cadence, Synopsys, Mentor Graphics. And yeah, industry still uses those. But here's what's changed in the last decade:

The FOSS hardware ecosystem has grown up.

Just like Linux gave us gcc, vim, and git, the hardware world now has Yosys, Verilator, and OpenROAD. Students, hobbyists, and even startups are designing real chips with these tools: chips that actually get manufactured and work.

The barrier to entry isn't money anymore. It's just knowing where to look.

The Open Hardware Toolchain: Your New Best Friends

Let's walk through the essential tools that take you from "I have an idea" to "I have a working processor." Think of this as your hardware hacker starter pack.

1. Simulation: Verilator & Icarus Verilog

Simulation is basically running your hardware design in software to see how it behaves. Think of it like a test drive before you commit to building the real thing. Writing hardware code (Verilog/VHDL) without simulation is like coding in C without ever compiling. You need to see if your logic actually works: does your processor fetch the right instruction? Does your ALU compute correctly?

Verilator is the Usain Bolt of the bunch. It converts your Verilog code into C++ and compiles it into a native executable, which makes it insanely fast and perfect for simulating large designs like full processors. It's cycle-accurate, meaning it simulates your design clock cycle by clock cycle, just like the real hardware would run. The trade-off? It's a bit more complex to set up, but once you get it running, you'll appreciate the speed :)

Icarus Verilog is the friendlier option for beginners. It's an event-driven simulator that's simpler to use and doesn't require you to write testbenches in C++. Just write your Verilog, compile it with the command iverilog, and run it with vvp. Pair it with GTKWave to visualize signals as waveforms, and you've got a complete simulation workflow. It's not as fast as Verilator for huge designs, but for learning and debugging, it's perfect.

Try simulating a simple counter or a full adder. Watch the waveforms change. It's oddly satisfying to see logic come alive on your screen.

A GTKWave waveform illustrating the timing of control, datapath, and ALU signals in a CPU core.

2. Synthesis: Yosys

This is where your abstract "hardware description" becomes real logic that could be etched onto silicon. Yosys converts your high-level Verilog code into a gate-level netlist: actual logic gates like AND, OR, and flip-flops.

Think of Yosys as the gcc of hardware. It compiles your design into something a chip fab (or FPGA) can understand. Yosys is scriptable, modular, and works with a ton of FPGA and ASIC flows. It's used in actual chip tapeouts through projects like Google's open PDKs (a PDK, or Process Design Kit, is basically the instruction manual for a specific chip fabrication process: it tells your tools how that particular factory makes transistors).

The synthesis flow: your high-level hardware description gets compiled down to actual logic gates that can be implemented in silicon.

So, feed Yosys a simple Verilog module, run synthesis, and look at the output netlist. Watching your code transform into gates is pretty cool.

3. Place & Route: OpenROAD

Synthesis gives you logic. Place & route gives you geometry: where each gate sits, how wires connect, timing constraints, power routing. This is the final step before manufacturing.

OpenROAD is an open-source ASIC flow that takes your gate-level netlist and turns it into an actual physical chip layout. It handles everything from floorplanning (deciding where blocks go) to detailed routing (connecting all the wires) to timing analysis (making sure your chip actually works at the speed you want).

What makes OpenROAD special is that it's not just a toy project; it's being used in real chip tapeouts. Companies and researchers are using it with the open-source SkyWater 130nm PDK (released in partnership with Google) and platforms like ChipFlow to manufacture actual silicon. The same flow that powers multi-million dollar chip designs is now available to anyone with a laptop.

The workflow is surprisingly straightforward: feed it your synthesized netlist, tell it about your target technology (via a PDK), set your constraints, and let it do its thing. It'll optimize placement, route all the connections, check timing, and spit out a GDSII file, the final layout that gets sent to the fab.

So, you're literally designing the physical layout of a chip. This used to be reserved for grad students and industry engineers and now it's something you can mess around with on a Tuesday night.

My chip layout post place & route. Those colorful rectangles and wires? That's real geometry heading to the fab.

4. RISC-V Cores: The Playground

Now that you have the tools, what do you build? Here's where pre-existing RISC-V cores come in. Think of them as reference designs you can learn from, modify, extend and even contribute to.

PicoRV32 is arguably the most famous RISC-V core: tiny, simple, and around 1000 lines of Verilog. Perfect for beginners because you can actually read and understand the whole thing. Simulate it, run a "Hello World" program, trace how instructions execute.

BOOM (Berkeley Out-of-Order Machine) is at the other end of the spectrum: a high-performance, out-of-order RISC-V core. When you're ready to level up, BOOM teaches you modern processor techniques. Study the pipeline, see how dynamic scheduling works.

RISCape (shameless plug) is my 5-stage pipelined RISC-V core. Built by a student, for students. If I can do it, you absolutely can too. Fork it, break it, fix it, extend it. That's how you learn.

We can see the three levels of RISC-V cores: PicoRV32 for learning the basics, RISCape for understanding pipelines, BOOM for exploring modern processor techniques.

5. Tiny Tapeout: Your Shot at Real Silicon

Remember all those tools we just talked about? Tiny Tapeout is where you actually use them to manufacture real chips.

Tiny Tapeout is a platform that makes chip fabrication accessible. Instead of paying hundreds of thousands of dollars for a solo manufacturing run, multiple designs get combined into one "shuttle" and share the cost. For around $100-150, you can get your design fabricated in actual silicon using processes like IHP SG13G2 or SkyWater 130nm. No company backing, no research lab required, just you, your design, and a submission deadline.

I first heard about Tiny Tapeout while working on a System-on-Chip (SoC) project with some friends. The idea that students could tape out real chips felt almost too good to be true. My first real dive was through a crowd-sourced competition where participants contributed peripherals for a RISC-V SoC. That got me hooked. Read about my submission here!

After that, I decided to go all in and do a dedicated tapeout of my own: Herald, a co-processor that handles CORDIC (coordinate rotation) and MAC (multiply-accumulate) operations. The workflow? Design in Verilog, simulate it, synthesize with Yosys, place & route with OpenROAD, verify timing, submit your GDSII file, and wait for the shuttle to tape out.

Sounds smooth, right? It wasn't. Error after error: timing violations, things that worked in simulation but failed in hardware checks. But here's the thing: the Tiny Tapeout community is incredibly active. Ask a question on Discord (even if you think it's silly), and someone will help. The barriers aren't technical anymore; they're just learning curves, and the community helps you climb them.

Real Tiny Tapeout shuttle layout (Shuttle - IHP26A) showing multiple designs (including mine) combined into one manufacturing run. Each colorful section is someone's chip: students, hobbyists, researchers all sharing the same silicon :)

What makes Tiny Tapeout special isn't just that it's affordable. It's that you're using the exact same tools and flows that industry uses. You're not playing with toy simulators or fake environments. You're designing real chips, following real constraints, and getting real manufactured silicon at the end. That's the ultimate validation.

If you've been following along with the tools in this post (Verilator, Yosys, OpenROAD), then Tiny Tapeout is where you put them all together. It's the final level of the training arc: from simulating logic gates to holding a physical chip with your design etched into it.

Closing: The Only Question Left Is "What Will You Build?"

A decade ago, you needed a million-dollar lab to design a processor. Today, you need a laptop and curiosity.

The tools are here. The cores are open. The community is waiting.

You've seen the philosophy here. You've seen the tools in this post. In the next post, we'll go deep into pipelines, out-of-order execution, hazards, caches, and the real magic happening inside modern processors. We'll talk about how I approached building RISCape, the challenges I faced, and what I learned about computer architecture along the way.

But for now? Start small. Run the tools. Let things break. The rest follows.

Why GPU Threads Aren’t What They Look Like: SIMT + Divergence Explained

anna — Tue, 02 Dec 2025 05:30:00 +0530

Tiny Tone - My First Accepted Tiny-Tapeout Design

anna — Sat, 25 Oct 2025 05:30:00 +0530

Introduction

In this blog, I share a small but meaningful design I submitted to the TinyQV tapeout program: what it is, why it matters to me, and what I learned along the way. Before getting into the design itself, let’s clarify a few terms I use in this blog.

What is TinyQV?

TinyQV is a collaborative competition under the Tiny Tapeout program. The goal is to build a small but complete RISC-V SoC, where the CPU and peripherals are designed to be simple and fit within the strict constraints of Tiny Tapeout.

At its heart is the TinyQV core, a lightweight RISC-V CPU that implements the RV32EC instruction set along with the Zcb and Zicond extensions.

The really fun part: the peripherals that complete the SoC are contributed by the open source community. Each submission is a chance to have your logic integrated into silicon and that’s where my project, Tiny-Tone, fits in.

What is Tapeout?

In chip design, tapeout is the moment when a design is functionally verified and sent off to be manufactured on real silicon. Back in the day, this really did involve shipping magnetic tapes to the fabrication plant and while the process is digital now, the name stuck around.

Tapeout is a milestone because it marks the point where your idea leaves the world of simulation and code, and takes its first steps toward becoming a physical chip. For many hardware designers, it’s the most exciting part of the journey.

Introducing Tiny-Tone

My contribution to TinyQV is called Tiny-Tone: a PWM-based audio tone generator written in Bluespec SystemVerilog. The goal was simple: create a small peripheral that could generate audible tones using just digital logic.

This design felt perfect for tapeout: it’s compact, demonstrable, and educational. When the chips come back, I’ll be able to hook up a speaker and hear it play tones (if I’m able to afford it :D).

Understanding PWM (Pulse Width Modulation)

Imagine quickly flipping a light switch on and off. If it’s on half the time and off half the time, the average brightness looks like 50%. If it’s on 90% of the time, it looks almost fully bright. That’s the basic idea behind PWM: controlling the duty cycle, the fraction of time the signal is “on” vs. “off.”

For audio, we can do something similar. Instead of controlling brightness, the rapid on/off pattern controls how a speaker cone moves. If we switch fast enough, the speaker naturally smooths out the square wave, and what you hear is a steady tone.

In Tiny-Tone, I used PWM to generate these patterns at different frequencies. Changing the frequency changes the pitch of the tone, and adjusting the duty cycle changes its character. With just digital logic, you can make a speaker “sing.”

With Tiny-Tone, I can control the frequency of the PWM signal, which changes the pitch of the sound, and adjust the duty cycle, which changes how the sound feels when it comes out of the speaker.

PWM isn’t just for audio; it’s everywhere. For example, the brightness of an LED is often controlled with PWM, and the speed of motors in drones or fans is set the same way.
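The duty cycle and frequency ideas above can be sketched with a simple counter model in Python. This is a generic counter-and-compare PWM, not Tiny-Tone's actual logic; the function names and the 1 MHz clock in the example are assumptions for illustration.

```python
def pwm_wave(period, high_ticks, total_ticks):
    """Simulate a counter-based PWM output, one sample per clock tick.

    A free-running counter wraps every `period` ticks; the output is 1
    while the counter is below `high_ticks` and 0 otherwise.
    """
    if not 0 <= high_ticks <= period:
        raise ValueError("high_ticks must be between 0 and period")
    return [1 if (t % period) < high_ticks else 0 for t in range(total_ticks)]


def tone_period_ticks(clock_hz, tone_hz):
    """Clock ticks per PWM period needed to produce `tone_hz`."""
    return round(clock_hz / tone_hz)


wave = pwm_wave(period=8, high_ticks=2, total_ticks=16)
print(wave)                   # two periods, each high for 2 of 8 ticks
print(sum(wave) / len(wave))  # 0.25, i.e. a 25% duty cycle

# Assumed 1 MHz clock: ticks per period for a 440 Hz tone.
print(tone_period_ticks(1_000_000, 440))  # 2273
```

Changing `period` moves the pitch, while changing `high_ticks` at a fixed `period` only changes the duty cycle, which is the split between "pitch" and "character" described above.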

My Implementation

I started by writing the RTL in Bluespec SystemVerilog, which is great for building modular hardware. Its rule-based approach and strong type system made it easy to organize the design cleanly.

The Tiny-Tone peripheral is essentially a small, configurable PWM generator that can be controlled from software. The design has two main parts:

Tone Generator – This module creates the PWM signal. You can enable or disable it and set the frequency at runtime. The output is a single digital line that toggles according to the programmed tone.
Peripheral Wrapper – This exposes the generator to the rest of the SoC using a simple memory-mapped interface. Software can turn the tone on/off, change the frequency, and read back status, all through a few registers.

I also wrote a Bluespec testbench to make sure everything worked as expected. It exercised the registers, ran the PWM signal, and verified that enabling/disabling the tone behaved correctly.

Finally, the design was compiled to Verilog for Tiny Tapeout, wrapped in a top-level harness, and paired with a Python cocotb testbench for automated simulation. This made integration with the SoC straightforward and allowed for continuous verification.

Challenges I Faced and How I Fixed Them

Every hardware design brings its own set of challenges, not just in writing RTL, but also in meeting real-world constraints like area limits, timing, and integration requirements. In my case, working with Bluespec SystemVerilog added an extra twist: I had to convert my design to Verilog before it could be accepted into Tiny Tapeout.

Integration with Tiny Tapeout

Adapting the Bluespec-generated Verilog to fit the TinyQV SoC interface wasn’t just a drop-in task. I needed to carefully map ports and build a wrapper so that my peripheral played nicely with the rest of the system.

Debugging the Peripheral

One tricky bug showed up during cocotb testing. When I wrote a word (e.g., 0x82345678) to register 0 and tried reading back the byte at that address, I got 0 instead of the expected 0x78.

After digging into the generated Verilog, I realized the problem: my Bluespec peripheral only supported byte writes (data_write_n == 2'b00), but the testbench was doing a word write (data_write_n == 2'b10).

The Fix

The solution was simple once I understood the mismatch: I updated the test to use a byte write (write_byte_reg) instead of a word write. Now the writes lined up with what the peripheral supported, and the test passed cleanly.
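The mismatch is easy to reproduce with a toy Python model of a peripheral that only honours byte writes. The two `data_write_n` encodings come from the description above; the class, its methods, and the "ignore everything else" behaviour are a simplified assumption, not the generated RTL.

```python
BYTE_WRITE = 0b00   # data_write_n == 2'b00
WORD_WRITE = 0b10   # data_write_n == 2'b10


class ByteOnlyPeripheral:
    """Toy model: registers update on byte writes and ignore other widths."""

    def __init__(self):
        self.regs = {}

    def write(self, addr, data, data_write_n):
        if data_write_n == BYTE_WRITE:
            self.regs[addr] = data & 0xFF
        # Any other write width is silently dropped (assumed behaviour).

    def read_byte(self, addr):
        return self.regs.get(addr, 0)


dut = ByteOnlyPeripheral()

dut.write(0, 0x82345678, WORD_WRITE)
print(hex(dut.read_byte(0)))  # 0x0, the word write never landed

dut.write(0, 0x78, BYTE_WRITE)
print(hex(dut.read_byte(0)))  # 0x78
```

Nothing errors out in the failing case, which is why the symptom was a quiet 0 on readback rather than an obvious fault.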

This might sound like a small detail, but debugging it taught me a lot about how interfaces and toolchains can disagree and how important it is to understand what your generated RTL actually supports.

What I Learned

This project taught me a few valuable lessons:

Designing for tapeout constraints is very different from simulating freely. Every gate, every bit of logic matters.
Writing in a high-level HDL (like Bluespec) is powerful, but translating to Verilog forces you to think about the details.
The excitement of knowing your design will physically exist on silicon is unmatched :P

Tiny-Tone is my first accepted tapeout design, and it won’t be my last. Taking this design from idea to tapeout has been equal parts challenging and rewarding, and just as exciting as waiting for the day it comes alive on silicon. It marks the start of many more hardware adventures.

RISC-V: The Linux of Hardware

anna — Sun, 07 Sep 2025 05:30:00 +0530

The Hidden Half of Computing

Most of us discover FOSS through software: cool Linux distros, editors, frameworks and many other tools that make our life easier. But beneath all that is a world not many of us think about: the microchips that run our code. All this while, this world was locked away behind billion-dollar corporations and closed Instruction Set Architectures.

Say hello to RISC-V (the RISC stands for Reduced Instruction Set Computer), an open source Instruction Set Architecture (ISA) that’s doing to hardware what Linux did to operating systems. This blog is a quick introduction to what RISC-V is, why it matters, and how it fits perfectly into the FOSS spirit.

So What’s an ISA anyway?

We all speak different languages with our own unique grammar. Similarly, a processor has its own “grammar” called an Instruction Set Architecture (ISA) — the rules that define how software talks to hardware.

When you think “processor,” names like Intel’s x86 chips or AMD’s Ryzen probably come to mind. Maybe you’ve even heard of ARM chips in mobile phones and the Raspberry Pi. But here’s the catch: the instruction sets behind all of these are proprietary. That means if you want to design a chip with them, you’ll need to pay hefty license fees, sign NDAs, and basically fight uphill just to get started.

Back in 2010, researchers at UC Berkeley decided to change that. Their project: RISC-V, an open source, free and modular ISA. This means anyone can use it, extend it and, guess what, even add their own instructions! The entire script was flipped, making hardware innovation as open as software innovation.

Why should students care about RISC-V?

1. FOSS energy → The same spirit as Linux

The openness of RISC-V has created ecosystems much like Linux did — thriving communities, student-led projects, multiple open-source processor cores, Discord servers buzzing with discussions, and more. It’s not just an ISA; it’s a culture you can join and shape.

2. A real learning playground → No glass walls

Hardware design often feels like wizardry happening in corporate or research labs. But RISC-V opens the doors for everyone. Curious about how a processor works internally? You can:

Browse and study real RISC-V cores on GitHub: from tiny educational cores to ones capable of running Linux.
Simulate a RISC-V CPU using open-source tools like QEMU.

Peek inside a RISC-V CPU: one instruction at a time, courtesy of Ripes. You’re not stuck just reading theory; you can play with the real thing.

3. Structured guidance → Mentorship and learning programs

RISC-V isn’t just open; it’s supportive. RISC-V International (formerly the RISC-V Foundation) hosts mentorship programs, student contests, workshops, and provides online resources. Beginners can pair up with experienced developers, contribute to real cores, and learn in a guided environment, which makes the learning curve far less intimidating.

4. Industry shift → Where the world is heading

This isn’t just about tinkering. RISC-V is becoming a global standard. Tech giants like Intel, NVIDIA, Google, and Qualcomm are adopting it. That means demand for engineers who understand RISC-V is going to rise. As students, learning it now is like learning Linux in the early 2000s: the skills are future-proof.

Why RISC-V Resonates with Me

My first introduction to RISC-V was at Tilde, HSP’s flagship summer mentorship program where we were tasked with building a RISC processor in Verilog. Although I wasn’t able to continue with the project, I was given access to vast resources that opened the door to the world of RISC. I started out with simple Linux Foundation courses and quickly found myself immersed in the subject, having already delved deep into Digital VLSI.

Soon, I was diving into books and research papers to understand the architecture and design principles of RISC-V. At the same time, I was building RISCape, a 5-stage pipelined RISC-V processor. Reading the theory while implementing it side by side made the concepts click in a way no textbook alone could achieve.

This hands-on approach showed me the real power of open architectures: students can explore, experiment, and actually make things work, all without asking for permission. If it got me hooked, it can probably do the same for you: grab a core, start experimenting, and see where it takes you.

Closing Thoughts

RISC-V is more than just a processor architecture: it’s an open, collaborative ecosystem that mirrors the FOSS philosophy we see in software. From student-friendly cores to real-world applications, it’s shaping the way hardware can be explored, understood, and innovated upon.

In the next post, we’ll dive into the tools and resources that make experimenting with hardware easier, from simulators to development frameworks, and see how you can start building your own projects from scratch.

Aetheron: Bringing My Own SoC to Life

anna — Fri, 18 Jul 2025 05:30:00 +0530

Introduction

Aetheron is a small but complete RISC-V System-on-Chip, built entirely in Bluespec SystemVerilog and run in simulation. It features a pipelined CPU, memory-mapped peripherals, my own minimal TileLink interconnect, and can boot minimal bare metal C programs — all stitched together from scratch.

But it’s more than just a collection of modules.

I’ve always loved the idea of designing systems: not just writing code or building hardware in isolation, but understanding how they blend together. Fortunately, I had the privilege to attend a three-day workshop hosted by my seniors, Devesh Bhaskaran and Abhiram Gopal Dasika, where I was introduced to the world of Automotive SoC Design. I decided I wanted to build something that felt real: where the CPU actually boots a program, where peripherals respond to memory-mapped I/O, and where every instruction running in software maps to a transaction on a bus.

"Aetheron" was born out of that drive.

It started with just a CPU core, then quickly spiraled into a full-on SoC: with ROM, RAM, UART, and a TileLink fabric to tie everything together. I didn’t want to reuse cores or plug in pre-built peripherals. This had to be something I understood fully — every wire, every rule, every hex file in the ROM image.

Curious to know what exactly an SoC is? Click here to read my blog about Introduction to SoCs.

Setting the Vision

What I wanted to build

From day one, I wasn’t interested in just blinking LEDs or simulating a few instructions. My goal was clear: build a bootable SoC that could run real software — not a demo, not hardcoded testbenches, but actual compiled C programs.

I wanted:

A pipelined RISC-V CPU (RV32I) I could understand and modify.
ROM and RAM memory regions with clear separation.
A TileLink interconnect, for clean and modular communication.
Memory-mapped peripherals like UART, GPIO, and Timer.
A boot process where the CPU fetched code from ROM, copied payloads into RAM, and executed them like a real embedded system.

In short: something you could imagine dropping into a real chip, even if it never left simulation.

Early decisions I made

One of the first choices was the instruction set. I went with RISC-V because it's clean, open, and designed to be simple — perfect for a project like this. It made writing my CPU pipeline much more enjoyable and gave me a standard to work against.

For the HDL, I chose Bluespec SystemVerilog (BSV). I already had experience with Verilog, but Bluespec’s rule-based semantics and powerful types made building a modular system much easier. Once you wrap your head around guarded atomic actions, you realize how expressive it is — especially for designing protocols like TileLink.

I also went with TileLink as the on-chip interconnect protocol instead of something like AMBA. TileLink’s open, lightweight, and fully specified nature made it easier to reason about formally. Its decoupled, pipelined A/D channel design fit naturally into the rule-based model of BSV, and it scaled well as I added more peripherals.

Finally, I decided early on that simulation-only was totally fine. I wasn’t aiming to synthesize this on an FPGA. I just wanted correctness, modularity, and the ability to run and debug C code end to end. That freedom let me move faster and focus on the design, not timing closure or board quirks.

The First Steps

Designing the building blocks

I began with the bare minimum: a CPU, ROM, RAM, and UART — all stitched together through a custom TileLink interconnect. At this point, the SoC didn’t boot any real programs. It wasn’t even executing C yet. But my focus was on getting the pipeline running in order to see actual instructions being fetched, decoded, and executed.

The CPU design started as a stub — a placeholder module that just returned hardcoded values. It helped me bring up the rest of the system without worrying about instruction execution yet. Once the rest of the SoC took shape, I replaced the stub with a proper 5-stage pipelined core called Specula. At the time, Specula was a relatively simple in-order RISC-V core. It would eventually evolve into something far more ambitious: a true Out-of-Order processor. But for now, it was my way of getting real programs flowing through an instruction pipeline.

The ROM was modeled as a read-only memory block, used to hold the bootloader and payload at simulation time. The RAM was a simple byte-addressable memory model, accessible via TileLink. And the UART, while basic, was my only way to get output from inside the SoC — my single window into what the CPU was doing.

The full ROM → RAM boot process in Aetheron
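The copy step of that boot flow can be sketched in C. This is only an illustration of the idea and not Aetheron's real bootloader, which was written in RISC-V assembly. The name boot_copy_words is hypothetical, and on the target the source, destination, and length would come from linker-defined symbols rather than function arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy a payload image from ROM to RAM one 32-bit
 * word at a time, as a ROM-resident bootloader would do before jumping
 * to the copied code. */
void boot_copy_words(uint32_t *dst, const uint32_t *src, size_t n_words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = src[i];
    }
    /* On real hardware the next step would be a jump to the RAM copy,
     * e.g. through a function pointer set to the payload's entry address. */
}
```

On a host machine the same function can be exercised with two plain arrays standing in for ROM and RAM.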

Testing with assembly

At this stage, I was manually writing tiny RISC-V assembly test programs, assembling them into .text and .data sections, converting them into hex files, and loading them into ROM. No C yet, just handcrafted instructions to test branching, memory stores, and UART writes.

This was also when I first got UART output working in my assembly code.

And just like that the CPU could talk :D

Roadblocks & Debugging Era

As with all systems design, just when things seemed stable — everything broke.

Up to this point, I had a functioning ROM, RAM, UART, and a pipelined CPU. My bootloader, written in RISC-V assembly, was executing correctly, and I could print characters over UART using simple sw instructions. Things felt solid. It was time to try running a C program.

That’s when it all began to crumble.

1. The Case of the Silent UART

I compiled a tiny C program, linked it with my custom script, and embedded the binary into ROM. But when I ran the simulation… nothing. No UART output. Just silence.

I suspected a mapping issue. Was my C code writing to the wrong MMIO address? But UART_ADDRESS = 0x40001000 matched in both hardware and software. Everything checked out — yet nothing was printing.
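For reference, a memory-mapped UART write from C typically looks like the sketch below. UART_ADDRESS is the value quoted above. Everything else is an assumption for illustration: that the transmit register sits at offset 0 and is 32 bits wide, and the names uart_putc and uart_puts. The register pointer is passed in as an argument so the same code can be run on a host against an ordinary variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UART_ADDRESS 0x40001000u  /* from the post; TX at offset 0 is assumed */

/* Write one byte to the transmit register. The pointer is volatile so
 * the compiler cannot drop or merge the stores. */
void uart_putc(volatile uint32_t *tx, char c)
{
    *tx = (uint32_t)(unsigned char)c;
}

/* Send a NUL-terminated string; returns the number of bytes written. */
unsigned uart_puts(volatile uint32_t *tx, const char *s)
{
    assert(tx != NULL && s != NULL);
    unsigned n = 0;
    while (*s != '\0') {
        uart_putc(tx, *s++);
        n++;
    }
    return n;
}
```

On the target, the call would be something like uart_puts((volatile uint32_t *)UART_ADDRESS, "hi\n").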

I turned to trace logs. I dumped every memory access, register write, UART state change. That’s when I noticed: the CPU wasn't even reaching printf(). It was looping somewhere much earlier.

2. Memory Mapping & Peripheral Routing

Maybe the hardware was at fault? I added debug prints in the TileLink interconnect and peripherals. I confirmed: writes to 0x40001000 were routed correctly to the UART. Address decoding wasn’t the issue.
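The kind of address decoding being checked here can be modelled as a table lookup. In the sketch below only the UART base address 0x40001000 comes from the post. The ROM and RAM bases, all region sizes, and the names region_t, memory_map, and decode_address are placeholders made up for illustration. The real interconnect is BSV hardware, not C.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { DEV_NONE, DEV_ROM, DEV_RAM, DEV_UART } device_t;

typedef struct {
    uint32_t base;
    uint32_t size;
    device_t dev;
} region_t;

/* Only the UART base is taken from the post; the rest is hypothetical. */
static const region_t memory_map[] = {
    { 0x00000000u, 0x00010000u, DEV_ROM  },
    { 0x80000000u, 0x00010000u, DEV_RAM  },
    { 0x40001000u, 0x00001000u, DEV_UART },
};

/* Return the device that owns addr, or DEV_NONE if nothing is mapped there. */
device_t decode_address(uint32_t addr)
{
    for (unsigned i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++) {
        assert(memory_map[i].size != 0);
        /* Unsigned subtraction wraps for addr < base, so one comparison
         * covers both the lower and the upper bound. */
        if (addr - memory_map[i].base < memory_map[i].size) {
            return memory_map[i].dev;
        }
    }
    return DEV_NONE;
}
```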

3. Toolchain & Linker Script Confusion

The Makefile, linker scripts, and objcopy pipeline were fragile. I reviewed every line: memory regions, section assignments, symbol names. Using readelf, objdump, and objcopy, I inspected each ELF and binary output. Slowly, I rebuilt a correct linker script that mapped ROM and RAM properly.

But still — the program didn’t run.

4. The Payload That Never Was

Finally, I suspected a deeper issue — not the code itself, but how it was getting into ROM.

See, my bootloader was supposed to copy a .payload section from ROM to RAM, then jump to it. But what if that section wasn’t even being included in the final ROM image?

I checked the ELF file using riscv64-elf-objdump -h and saw something weird: the .payload section was listed, but with VMA = 0x0 and zero file offset. It looked like the section existed… but wasn’t actually backed by any data in the binary.

The culprit looked like objcopy. My Makefile used:

riscv64-elf-objcopy -O binary -j .payload ...

But objcopy can’t extract the section properly if the linker never placed it correctly in the first place.

At one point, my linker script had something like:

.payload : { *(.text .text.* .rodata .data .bss) }

But that only works if the incoming .o file explicitly labels those sections as part of .payload. In my case, it didn’t. The payload data existed under normal sections (.text, .rodata, etc.), but nothing instructed the linker to collect them under .payload.

I rewrote the linker script to explicitly place the payload contents, added the .payload symbol in the source to force section labeling, and rechecked everything with objdump. This time the section showed up with the right address and size. The resulting .bin was properly populated.
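One common way to label code for a dedicated section with GCC or Clang is the section attribute, sketched below. This shows the general technique and is not necessarily the exact change made in Aetheron. IN_PAYLOAD and payload_entry are hypothetical names. The linker script still needs a rule that collects the matching input sections, for example KEEP(*(.payload)) inside the .payload output section.

```c
#include <assert.h>

/* GCC/Clang extension: place the annotated function in an input section
 * named ".payload" instead of the default ".text". "used" keeps it from
 * being discarded when nothing references it. The guard lets the file
 * still build on non-ELF hosts, where this section name is not valid. */
#if defined(__ELF__)
#define IN_PAYLOAD __attribute__((section(".payload"), used))
#else
#define IN_PAYLOAD
#endif

/* Hypothetical payload entry point; the return value is just a marker. */
IN_PAYLOAD int payload_entry(void)
{
    return 0x5A;
}
```

With a matching linker rule in place, the section listing from objdump -h is a quick way to confirm that .payload has a nonzero size.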

I regenerated rom.hex, ran the simulation… and there it was. My rom.hex finally contained a working payload, and Aetheron executed its first C program.

Key Learnings

Looking back, building Aetheron taught me far more than just how to wire up peripherals or write linker scripts.

Hardware/software co-design is real. Designing the SoC and writing code for it in tandem helped me see how tightly coupled these layers are. Every software bug was a chance to inspect my hardware — and vice versa.
Debugging is a full-stack art. From the ROM image to C code to UART signals, every layer matters. Sometimes the problem isn’t the code — it’s how your tools handle it.
Don’t underestimate simulation. I didn't need an FPGA or fancy board. Just a solid simulation setup, trace dumps, and a good mental model were enough to build and boot real programs.

Most importantly, I learned to enjoy the grind — to sit with a problem, question every assumption, and slowly trace my way to a solution.

Interested in reading about my SoC learnings? Click here :D I try to keep it up to date.

What’s Next

Aetheron is far from finished. There’s still a lot more I want to explore and build on top of it. One of the first goals is adding interrupt support, so peripherals like the Timer can trigger software events.

And then there's a fun suggestion from one of my seniors, Joyen Benitto: I should try building a custom accelerator that sits on a TileLink slave port. Maybe a memory-mapped matrix multiplier: write your operands to an address, and read back the result. That idea stuck with me. Now that the TileLink fabric and basic SoC infrastructure are in place, Aetheron feels like the perfect playground to try something like that.
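To make the "write operands, read back the result" idea concrete, here is a small host-side sketch. Nothing in it exists in Aetheron yet. The register layout, the 2x2 size, the start/done handshake, and all names (matmul_regs_t, matmul_model_step, matmul) are assumptions chosen for illustration, and the model function stands in for hardware that would run on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAT_N 2u  /* hypothetical 2x2 accelerator */

/* Hypothetical register block; on hardware it would sit at a
 * memory-mapped base address and be reached through a volatile pointer. */
typedef struct {
    uint32_t a[MAT_N * MAT_N];  /* operand A, row-major */
    uint32_t b[MAT_N * MAT_N];  /* operand B, row-major */
    uint32_t start;             /* write 1 to start a multiply */
    uint32_t done;              /* reads 1 once c[] is valid */
    uint32_t c[MAT_N * MAT_N];  /* result, row-major */
} matmul_regs_t;

/* Software stand-in for the accelerator: computes C = A * B when started. */
void matmul_model_step(volatile matmul_regs_t *r)
{
    if (r->start == 0) {
        return;
    }
    for (unsigned i = 0; i < MAT_N; i++) {
        for (unsigned j = 0; j < MAT_N; j++) {
            uint32_t sum = 0;
            for (unsigned k = 0; k < MAT_N; k++) {
                sum += r->a[i * MAT_N + k] * r->b[k * MAT_N + j];
            }
            r->c[i * MAT_N + j] = sum;
        }
    }
    r->start = 0;
    r->done = 1;
}

/* Driver side: write operands, start, poll for done, read the result. */
void matmul(volatile matmul_regs_t *r, const uint32_t *a,
            const uint32_t *b, uint32_t *c)
{
    assert(r != NULL && a != NULL && b != NULL && c != NULL);
    for (unsigned i = 0; i < MAT_N * MAT_N; i++) {
        r->a[i] = a[i];
        r->b[i] = b[i];
    }
    r->done = 0;
    r->start = 1;
    while (r->done == 0) {
        matmul_model_step(r);  /* host only; real hardware needs no call */
    }
    for (unsigned i = 0; i < MAT_N * MAT_N; i++) {
        c[i] = r->c[i];
    }
}
```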

Final Thoughts

Aetheron was more than just a successful project. It reflects how I think, the systems I dream of, and the way I approach problems. It’s a culmination of goals, ideas, mistakes, and long nights chasing one more bug.

I hope it also serves as a spark for others to build their own systems, question abstractions, and enjoy the messy, brilliant process of bringing hardware to life!

What is a System-on-Chip, Really?

anna — Fri, 18 Jul 2025 05:30:00 +0530

Introduction

You’ve probably heard the term “System-on-Chip” before. Maybe it came up around things we use every day: smartphones, cars, or even a Raspberry Pi board! It sounds important (it is important), maybe even futuristic. But let’s rewind for a moment and ask the real question:

What is a System-on-Chip, really?

I’ve wanted to understand this for a long time. But textbook definitions never satisfied me—I didn’t just want to use SoCs, I wanted to understand what made them tick.

What is a SoC?

At its core, a System-on-Chip (SoC) is exactly what the name suggests: a complete computing system squeezed into a single chip of silicon. Unlike a general-purpose standalone processor that only handles computation and needs support from external memory or I/O, an SoC pulls everything together:

a CPU (or more than one)
memory (ROM, RAM, Flash memory)
peripherals (UART, GPIO, SPI, I2C, Timers… it just goes on)
communication buses (some well known ones like AXI from the AMBA family)
interrupt controllers, clock sources, debug logic—you name it.

It’s like shrinking an entire motherboard into a single chip.
Power-efficient. Compact. Purpose-built. Almost like making a sandwich :P

So is it the same as a CPU or an MCU?

This question comes up a lot, and it’s worth clarifying.

A CPU is just the brain—the execution core. It can’t do anything meaningful on its own without external memory and I/O.
An MCU (Microcontroller Unit) is a type of SoC, but usually simpler and tailored for low-cost embedded tasks (like your microwave or TV remote).
An SoC can scale far beyond that: it powers smartphones, SSD controllers, automotive systems, smartwatches… anywhere that needs tight integration and efficiency.

So no, they’re not the same—but they are related.

SoCs in the Real World

We use SoCs every day, often without realizing it.

The Snapdragon chip in your Android phone? SoC.
The Apple M1/M2/M3 in your MacBook? Massive SoC with powerful CPU, GPU, and NPU blocks plus DRAM controllers.
The ESP32 you play with in hobby electronics? Tiny Wi-Fi capable SoC.
The Tesla FSD chip? A custom automotive-grade SoC.

Even microcontrollers like the STM32 or RP2040 are SoCs — just on the smaller, embedded end of the spectrum.

That’s the beauty: SoCs scale. From your watch to your laptop to your car’s engine control unit — it’s all the same fundamental idea, adapted to the problem at hand.

Why SoCs Fascinate Me

What drew me in wasn’t just the formal textbook definition of a System-on-Chip. It was the magic of integration.

To me, SoC design is the ultimate optimization problem. What used to be a sprawling mess of chips on a PCB is now a carefully crafted rectangle of silicon. There’s something deeply elegant about that. It's not just engineering—it’s compression, clarity, and control.

But more than that, every SoC tells a story — a story about tradeoffs, constraints, and creativity. When you study one closely, you’re looking into the minds of the engineers who made it—what they prioritized, what they left out, and why.

Somewhere along the way, that curiosity turned into obsession. And that’s how Aetheron was born.

What’s Next?

This blog was a gentle introduction to SoCs — their beauty, their complexity, their presence in our everyday lives.

But this is just the start.

If this sparked even a bit of curiosity, I think you’ll enjoy Aetheron, my homegrown RISC-V SoC, built from scratch. It's simulated, not silicon — but it boots real C programs, features custom TileLink peripherals, and captures everything I’ve come to love about SoC design.

See you there :)

→ Read Aetheron: Bringing My Own SoC to Life

Blogs

anna — Thu, 01 Jan 1970 05:30:00 +0530

Listing of blog posts.

Guestbook

anna — Thu, 01 Jan 1970 05:30:00 +0530

Pranav

anna — Thu, 01 Jan 1970 05:30:00 +0530

Hi, I'm Pranav 👋

I’m an undergraduate electronics student drawn to digital hardware, processor design, and embedded systems.

I build open-source RISC-V cores and explore verification methodologies, microarchitecture, and performance-driven design.

Most of my work lives on GitHub, and this site is where I write about what I learn. Feel free to sign my guestbook!

Reach me at pranav.m1205@gmail.com

Herald DSP

anna — Thu, 01 Jan 1970 05:30:00 +0530

A tiny DSP designed for constrained microcontrollers.

Gallery

anna — Thu, 01 Jan 1970 05:30:00 +0530
