
Example: FFT SoC


First off, a big thanks to Discord user Dutra_CGI (blog) for all the practical FFT implementation advice and bug fixes that helped get things going.


This page describes creating a RISC-V based System on Chip (SoC) capable of computing the Fast Fourier Transform (FFT).

The target platform for this design is a Digilent Arty FPGA dev board:

Because the design is written in PipelineC, it is possible to automatically pipeline both the math of some RISC-V instructions and the arbitrary pure functions that describe the custom compute pipelines attached to the CPU. This made it easy to explore a range of compute pipelines, at a range of operating frequencies, on a range of target FPGA devices.

overview diagram

If you have more questions and comments after reading what's here, feel free to stop by the PipelineC Discord to chat :).

CPU Summary

cpu part of overview diagram

The RISC-V CPU used in this design is based on previous work shown here. In this case the CPU is not pipelined; it is a finite state machine with one instruction in flight at a time. The CPU top level .c file for this FSM based design is fsm_risc-v.c. The main loop compiled and running on the CPU is found in i2s_and_fb_test.c.

The base, not-auto-pipelined version of the CPU has an FMAX of ~5-10 MHz due to the long division/remainder combinatorial math. This baseline, combinatorial-only (no auto-pipelining, --comb output of the tool) version running at 6.25 MHz is compared to a 10x faster 62.5 MHz version with the long execute combinatorial paths auto-pipelined to meet timing.
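
For reference, builds like these come from invoking the PipelineC tool on the CPU top level file, with --comb selecting the combinatorial-only (no auto-pipelining) output. The exact path below is an assumption based on the testbench command shown later on this page:

./src/pipelinec examples/risc-v/fsm_risc-v.c --comb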

As seen in the diagram above, the CPU has the following states defined in code:

FETCH_START
FETCH_END / DECODE / REG_READ_START
REG_READ_END / EXECUTE_START
EXECUTE_END
MEMORY_START
MEMORY_END
NEXT_PC / WRITE_BACK
  • rv32im Multi Cycle Design: ~3-N clocks per instruction
  • All instructions make use of the FETCH, DECODE, EXECUTE, WRITE_BACK states.
  • Can pass through multiple FSM states in a single cycle.
    • Red lines in diagram represent single cycle register delays.
    • For example: The path from register file output, to and from memory mapped registers, and finally through the next PC/write back logic can occur in a single cycle.
  • Long math operations, e.g. division, take several cycles depending on CPU clock rate, requiring extra auto-pipelining.
  • Data memory block RAM access takes an extra clock cycle.
  • Data memory AXI access takes a variable amount of time.

The memory map layout is implemented by connecting data memory reads and writes to peripherals as seen here. PipelineC provides hooks for working with buses like AXI via the Shared Resource Bus library.
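
As a rough illustration only (not the actual fsm_risc-v.c memory map code), decoding a data memory address to the peripheral that services it might look like the sketch below; the region bases and sizes are assumptions, apart from the MMIO_AXI0_ADDR name which also appears in the sample-read code later on this page.

#include <stdint.h>

// Hypothetical region bases/sizes for illustration only
#define MMIO_BRAM_ADDR 0x00000000u
#define MMIO_BRAM_SIZE 0x00002000u
#define MMIO_AXI0_ADDR 0x10000000u // shared resource bus to off chip DDR (axi_xil_mem)

typedef enum { DEST_BRAM, DEST_MMIO_REGS, DEST_AXI_DDR } mem_dest_t;

// Decode a CPU data memory address to the peripheral that services it
static mem_dest_t decode_mem_addr(uint32_t addr){
  if(addr >= MMIO_AXI0_ADDR) return DEST_AXI_DDR;                // off chip DDR via AXI
  if(addr < (MMIO_BRAM_ADDR + MMIO_BRAM_SIZE)) return DEST_BRAM; // on chip data block RAM
  return DEST_MMIO_REGS;                                         // memory mapped registers/handshakes
}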

Shared Resource Bus Summary

axi like shared resource bus

The arbitration of managers (M) / 'hosts' making requests to and receiving responses from subordinates (S) / 'devices' is common in SoC design. PipelineC's Shared Resource Bus provides the handshaking based infrastructure for connecting multiple hosts and devices.

In this design I2S, VGA, and the CPU act as hosts making requests to read and write the shared off chip DDR memory. For example, the CPU connects to the AXI bus for the Xilinx DDR controller axi_xil_mem here in the memory map.

Input Samples

Building on previous work with an I2S PMOD, this design adapts the handshake of I2S samples to (transmit/output audio) and from (receive/input audio) the PMOD into reads (of TX data) and writes (of RX data) over the AXI shared resource bus leading to off-chip memory.

I2S DDR loopback diagram

The data path of audio samples into DDR memory and back out again was proved out in i2s_axi_loopback.c as documented on Discord.

In the full design, as audio samples pass through the DDR loopback, the CPU is able to get a copy of the received samples.

The lowest level of code wraps the memory mapped registers where sample descriptors are read from hardware FIFOs.

void i2s_rx_out_desc_READ(axi_descriptor_t* desc_out){
  // Wait for valid data to show up
  while(!mm_handshake_valid->i2s_rx_out_desc){}
  // Copy the data to output
  *desc_out = mm_handshake_data->i2s_rx_out_desc;
  // Signal done with data
  mm_handshake_valid->i2s_rx_out_desc = 0;
}

From there another wrapper layer does some pointer math to output an array of samples as they reside in DDR memory. Reads and writes of samples at those addresses are mapped to transactions over the shared resource AXI bus.

void i2s_read(i2s_sample_in_mem_t** samples_ptr_out, int* n_samples_out){
  // Read description of samples in memory
  axi_descriptor_t samples_desc;
  i2s_rx_out_desc_READ(&samples_desc);
  // gets pointer to samples in AXI DDR memory
  i2s_sample_in_mem_t* samples = (i2s_sample_in_mem_t*)(samples_desc.addr + MMIO_AXI0_ADDR);
  // and number of samples (in u32 word count)
  int n_samples = (samples_desc.num_words*sizeof(uint32_t))/sizeof(i2s_sample_in_mem_t);
  // return outputs
  *samples_ptr_out = samples;
  *n_samples_out = n_samples;
}

Output Display

Building on previous work with a VGA PMOD that is connected to AXI DDR memory via a shared resource bus, this design also has a small state machine doing AXI reads to queue up frame buffer pixels from DDR memory. These pixels are streamed out as VGA timing chases the beam.

The CPU interface for the frame buffer simply reads and writes an array of pixels that have been memory mapped to the AXI shared resource bus.

static volatile pixel_t* FB0 = (pixel_t*)FB0_ADDR;
pixel_t frame_buf_read(uint16_t x, uint16_t y)
{
  return FB0[pos_to_pixel_index(x,y)];
}
void frame_buf_write(uint16_t x, uint16_t y, pixel_t pixel)
{
  FB0[pos_to_pixel_index(x,y)] = pixel;
}

Using these functions to read and write individual pixels, the full final output image is drawn with three helper functions:

void drawRect(int start_x, int start_y, int end_x, int end_y, uint8_t color);
void draw_spectrum(int width, int height, fft_data_t* pwr_bins);
void draw_waveform(int width, int height, int line_width, fft_in_t* input_samples);

Drawing a rectangle with drawRect iterates over an x,y range to fill in the required pixels with a color (grayscale for now). Frame buffer reads and writes are AXI bus transactions to uncached off-chip DDR memory, which is substantially slower than on-chip BRAM or register access. For that reason it was important that draw_spectrum and draw_waveform do as few frame buffer operations as possible.
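
A minimal sketch of what a drawRect built on the frame_buf_write helper above could look like (the real implementation may differ; the pixel_t field names here are assumptions):

// Fill an axis-aligned rectangle with one grayscale value,
// one frame buffer write (one AXI transaction) per pixel
void drawRect(int start_x, int start_y, int end_x, int end_y, uint8_t color){
  pixel_t p;
  p.r = color; p.g = color; p.b = color; // assumed pixel_t field names
  for(int x = start_x; x < end_x; x++){
    for(int y = start_y; y < end_y; y++){
      frame_buf_write((uint16_t)x, (uint16_t)y, p);
    }
  }
}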

draw_waveform simply scales the audio samples' time domain signal into points (rectangles of some line width) in the upper half of the frame (time in x pixels, amplitude in y pixels). draw_spectrum does something similar to draw the spectrum as a series of narrow rectangles; the power in each bin is used to calculate the y-pixel height of each bar.

In both cases, for the line position in the time domain waveform, and for the bar height of each spectrum bin, the helper functions maintain a static local variable of the last drawn height/position. As such, when new time domain samples and frequency spectrum values need to be drawn to the screen, only the changed pixels are written.
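
For example, a sketch of that only-write-what-changed idea for a single spectrum bar (names, bar layout, and colors here are illustrative, not the actual draw_spectrum code):

#define N_BARS 512                 // assumed number of on-screen bins
static int last_height[N_BARS];    // static last-drawn bar heights, as described above

// Update one bar: write only the rows that changed since the last frame
void update_bar(int b, int bar_x, int bar_width, int frame_height, int new_height){
  int old_height = last_height[b];
  if(new_height > old_height){
    // Bar grew: fill only the newly covered rows
    drawRect(bar_x, frame_height - new_height, bar_x + bar_width, frame_height - old_height, 255);
  }else if(new_height < old_height){
    // Bar shrank: clear only the rows no longer covered
    drawRect(bar_x, frame_height - old_height, bar_x + bar_width, frame_height - new_height, 0);
  }
  last_height[b] = new_height;
}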

FFT Algorithm

cooley tukey wiki algorithm pic

The input to this algorithm is a power-of-two sized (N=1024) array of input samples (complex values; in the case of audio, imaginary part = 0).

void compute_fft_cc(fft_in_t* input, fft_out_t* output, uint32_t N);
void compute_fake_power(fft_out_t* output, fft_data_t* output_pwr, int N);

The design shown in compute_fft_cc is based on the Cooley–Tukey FFT algorithm. This iterative, nested loop version (the unrolled recursion of N=2 point FFTs) uses two buffers: one for inputs and one for outputs. The first step copies inputs into the output buffer in bit reversed order; the remaining work then happens in place in the output buffer. After that, O(log2(N)) levels of butterflies combine the signal with complex exponential scaling 'twiddle factors'. This is done by computing O(N*log2(N)) complex multiplies, additions, and subtractions to update each output value. Each iteration also includes another complex multiplication to update the complex exponential omega value for the next sample.

// t = twiddle factor
t = omega * output[k + j + m_1_2];
u = output[k + j];
// calculating y[k + j]
output[k + j] = u + t;
// calculating y[k+j+n/2]
output[k + j + m_1_2] = u - t;
// update omega
omega = omega * omega_m;
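
The bit reversed copy step mentioned above is not shown on this page; a generic sketch of it (not necessarily the exact fft.h code) is:

// Copy inputs into the output buffer in bit-reversed index order (N a power of two)
void bit_reverse_copy(fft_in_t* input, fft_out_t* output, uint32_t N){
  uint32_t bits = 0;
  while((1u << bits) < N) bits += 1; // bits = log2(N)
  for(uint32_t i = 0; i < N; i++){
    uint32_t rev = 0;
    for(uint32_t b = 0; b < bits; b++){
      rev = (rev << 1) | ((i >> b) & 1u); // reverse the low 'bits' bits of i
    }
    output[rev].real = input[i].real;
    output[rev].imag = input[i].imag;
  }
}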

The output of this algorithm is an equal sized output array of FFT frequency bin values (complex) covering a frequency range based on the sample rate fs: [0,fs/2] and [-fs/2,0] (typically shifted in post processing so the range is simply [-fs/2,fs/2]). From there, a power-like (missing the squaring and square root) computation of output_pwr[i] = abs(re) + abs(im); is used to combine the real and imaginary parts of the positive frequency bins into the "faked power spectrum" histogram the user sees on screen.
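
A sketch of compute_fake_power built directly from that expression (the actual fft.h version may differ in types and rounding):

// "Fake" power: |re| + |im| per output bin (no squaring or square root)
void compute_fake_power(fft_out_t* output, fft_data_t* output_pwr, int N){
  for(int i = 0; i < N; i++){
    output_pwr[i] = (output[i].real < 0 ? -output[i].real : output[i].real)
                  + (output[i].imag < 0 ? -output[i].imag : output[i].imag);
  }
}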

Testing

Using a command like:

gcc fft_demo.c -lm -o fft_demo && ./fft_demo | python3 fft_demo.py

it is possible to run a demo of the C code algorithm and plot the resulting output.

fft_demo.c is configured with some parameters:

#define NFFT (1<<10)
// Set sample rate for working out units
#define SAMPLE_RATE_INT_HZ 44100 // Fs in integer Hz
#define TONE_RATE_INT_HZ 12345 // integer Hz

and then demos fft.h functionality by:

  1. Generating a test tone input signal
  2. Dumping input signal values to console stdout
  3. Using compute_fft_cc function to compute FFT
  4. Using compute_fake_power to compute the power histogram bins
  5. Dumping output FFT values and power bins to console
  6. Using drawRect and draw_spectrum, modified to dump pixel writes to the console.

The following fft_demo.py Python script parses all of the stdout console contents to produce plots like those shown below:

example plots from test script

The figures step through the computation: the input time domain signal, a known correct FFT computed by np.fft, the FFT values as computed by the C code, the power bins as computed by the C code, and finally the VGA pixels written by the C code.

CPU Software

The main loop compiled and running on the CPU is found in i2s_and_fb_test.c. It does the following repeatedly, while updating status LEDs for each step:

  1. Get a pointer to a batch of N=1024 I2S samples in DDR memory using i2s_read.
  2. Copy the I2S samples from the DDR part of data memory into a BRAM mapped portion while doing any fixed/floating point conversions needed.
  3. Compute the FFT using compute_fft_cc.
  4. Compute the approximated power histogram using compute_fake_power.
  5. Update the screen with draw_spectrum and draw_waveform.

By closely examining the status LEDs it is clear that the compute_fft_cc FFT computation is the bottleneck, by far the longest part of each iteration.

Floating Point

The fft.h header can be configured in its original #define FFT_TYPE_IS_FLOAT mode. The CPU does not implement RISC-V floating point extensions, so those operations are completed using multiple rv32im instructions. Complex exponential sine and cosine values are computed using standard math.h floating point functions.
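
For intuition, in this mode each twiddle factor e^(-2*pi*i*k/N) boils down to one cosf and one sinf call, roughly like the sketch below (float_twiddle is a hypothetical helper, not the fft.h API; fft_in_t is assumed to hold float real/imag parts in this mode):

#include <math.h>
#include <stdint.h>

// Hypothetical helper: e^(-2*pi*i*k/N) as a (cos, -sin) pair
fft_in_t float_twiddle(uint32_t k, uint32_t N){
  float angle = -2.0f * 3.14159265358979f * (float)k / (float)N;
  fft_in_t w;
  w.real = cosf(angle);
  w.imag = sinf(angle);
  return w;
}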

Results with and without auto-pipelined execute stages for higher FMAX (~8x improvement for a 10x faster clock):

  • 62.5 MHz: ~1 second per compute_fft_cc
    • RV32 M extension division|remainder math has 8 pipeline stages
    • RV32I execute math has 2 pipeline stages
  • 6.25 MHz: ~8 seconds per compute_fft_cc

Fixed Point

fft.h is primarily configured to use the faster #define FFT_TYPE_IS_FIXED mode. In this mode the rv32im CPU does not use floating point operations that take multiple instructions to complete. Complex exponential sine and cosine values are computed using a CORDIC implementation (cordic.h) written in C that works in both software and hardware. The fixed point types used are a mix of Q0.15 int16's and Q15.16 and Q1.30 int32's.
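
For readers unfamiliar with those formats, a Q0.15 multiply with the usual rescaling shift looks roughly like the sketch below (illustrative only, not the exact fft.h fixed point code):

#include <stdint.h>

// Multiply two Q0.15 values (int16) producing a Q0.15 result:
// 32 bit intermediate product (Q0.30), round, then shift back down by 15
static inline int16_t q15_mul(int16_t a, int16_t b){
  int32_t prod = (int32_t)a * (int32_t)b;     // Q0.30
  return (int16_t)((prod + (1 << 14)) >> 15); // round to nearest, back to Q0.15
  // note: saturation of the -1 * -1 corner case omitted for brevity
}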

Results with and without auto-pipelined execute stages for higher FMAX:

  • Approximately 30x faster than the original soft floating point version above.
  • 62.5 MHz: ~30 compute_fft_cc's per second
    • RV32 M extension division|remainder math has 8 pipeline stages
    • RV32I execute math has 2 pipeline stages
  • 6.25 MHz: ~4 compute_fft_cc's per second

Omega Complex Exponential Lookup Table

The FFT code as originally written requires values from a complex exponential (sine,cosine) function to be computed and updated for each sample as the algorithm iterates. These values are reused in each execution of the FFT function. Instead of re-computing these 'omega' values on the fly, they can be computed once and stored for later in a lookup table.

fft.h can be configured with #define FFT_USE_OMEGA_LUT enabling code like so:

volatile fft_in_t OMEGA_LUT[NFFT]={0};
void init_omega_lookup(){
    int N = NFFT;
    int index = 0;
    for (uint32_t s = 1; s < (int)ceil(log2(N) + 1.0); s++){
        int32_t m = 1 << s;
        int32_t m_1_2 = m >> 1;
        fft_in_t omega_m = exp_complex(-1.0 / m); // float division (integer -1/m would truncate to 0)
        fft_in_t omega = {1 , 0}; 
        for (uint32_t j = 0; j < m_1_2; j++){
            OMEGA_LUT[index] = omega;
            index++;
            omega = omega * omega_m;
        }
    }  
}
fft_in_t omega_lookup(int s, int j){
    int index = (1<<(s-1)) - 1 + j;
    return OMEGA_LUT[index];
}

init_omega_lookup is run once at the start of execution and uses the same loop structure as one iteration of the FFT algorithm, stripped down to compute only the omega values needed for the lookup table. omega_lookup can then quickly index into OMEGA_LUT, eliminating the complex multiplication the original algorithm repeated in the inner loop.

Custom Hardware Accelerators

Beginning in a software-only configuration, with fixed point types and the omega lookup table, the current performance as measured by hardware counters is as follows:

  • 1207052 CPU clock cycles per compute_fft_cc
  • at 62.5 MHz that's 1207052 * 16ns
    • ~19.3 milliseconds per FFT
    • ~51 FFT's per sec
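
For context, a measurement like the one above could be taken from software with something like the sketch below; MM_CYCLE_COUNT is a hypothetical memory mapped cycle counter register, not necessarily how the counters in this design are exposed.

#include <stdint.h>

// Hypothetical free-running CPU clock cycle counter register
#define MM_CYCLE_COUNT (*(volatile uint32_t*)0xF0000000u)

uint32_t time_one_fft(fft_in_t* in, fft_out_t* out){
  uint32_t start = MM_CYCLE_COUNT;
  compute_fft_cc(in, out, NFFT); // software FFT as configured above
  return MM_CYCLE_COUNT - start; // elapsed CPU clock cycles
}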

To accelerate this computation in hardware, we look to the inner loop 2-point FFT 'butterfly' section:

// t = twiddle factor
t = omega * output[k + j + m_1_2];
u = output[k + j];
// calculating y[k + j]
output[k + j] = u + t;
// calculating y[k+j+n/2]
output[k + j + m_1_2] = u - t;

The above can be accelerated by creating a hardware pipeline for the omega * output complex multiply, the u + t complex addition, and finally the u - t complex subtraction. A benefit of the iterative Cooley–Tukey FFT algorithm is that it makes it easy to unroll loops to run on parallel hardware. Inside of the k,j loops, each iteration is mostly independent of the others. When the omega complex exponential scaling values are computed in advance, each iteration becomes truly independent and multiple of them can be computed simultaneously.

cooley tukey butterfly

Inside fft.h you will find the inner loop body made into fft_2pt functions that look like so:

typedef struct fft_2pt_out_t
{
  fft_out_t t;
  fft_out_t u;
}fft_2pt_out_t;
fft_2pt_out_t fft_2pt(
  fft_out_t t,
  fft_out_t u,
  fft_in_t omega
){
  // t = twiddle factor
  t = omega * t;
  fft_2pt_out_t o;
  // calculating y[k + j]
  o.u = u + t;
  // calculating y[k+j+n/2]
  o.t = u - t;
  return o;
}

This simple pure stateless combinatorial logic function will be the basis for creating hardware accelerated blocks.

By running the PipelineC tool and reading the output timing reports it can be seen that a combinatorial logic version of the complex multiply, add, and subtract inside the fft_2pt function meets timing as-is in a single CPU cycle at 62.5 MHz (no auto-pipelining is needed).

Data via CPU Memory Mapped Registers

The simplest possible 'accelerator' is some combinatorial logic attached to the CPU. In this case that is done by memory mapping registers to the CPU and placing the 2-point FFT function between those registers.

reg comb logic reg

With that hardware in place, the CPU simply writes to the input register and reads the output register:

static inline fft_2pt_out_t fft_2pt_hardware(fft_2pt_in_t i){
  // Write input registers contents
  mm_ctrl_regs->fft_2pt_in = i;
  //(takes just 1 CPU clock cycle, output ready immediately)
  // Return output register contents
  return mm_status_regs->fft_2pt_out;
}

The algorithm's modified inner loop body using this hardware looks like so:

fft_in_t omega = omega_lookup(s, j);
fft_2pt_in_t fft_in;
fft_in.t = output[t_index];
fft_in.u = output[u_index];
fft_in.omega = omega;
// Invoke hardware
fft_2pt_out_t fft_out = fft_2pt_hardware(fft_in);
output[t_index] = fft_out.t;
output[u_index] = fft_out.u;

  • 736944 CPU clock cycles per compute_fft_cc
  • at 62.5 MHz that's 736944 * 16ns
    • ~11.7 milliseconds per FFT
    • ~84 FFT's per sec

This seems to show that, outside of the complex multiply, add, and subtract math that has been accelerated, the loads and stores of samples and omega values remain a significant portion of the FFT iteration time. For that reason it makes sense to consider moving the omega software lookup table into hardware.

Omega Lookup in Hardware

By writing a PipelineC function like the one below, a combinatorial ROM of omega values can be incorporated into the design:

fft_in_t omega_lut_rom(uint16_t i){
    fft_in_t the_rom[NFFT] = {
    {.real=32767, .imag=0},
    ...
    {.real=0, .imag=0}
    };
    return the_rom_RAM_SP_RF_0(i, NULL_WR_DATA, 0);
}

The existing fft_2pt function can be modified to include the omega lookup in addition to the 2 point FFT math:

typedef struct fft_2pt_w_omega_lut_in_t
{
  fft_out_t t;
  fft_out_t u;
  uint32_t s;
  uint32_t j;
}fft_2pt_w_omega_lut_in_t;
fft_2pt_out_t fft_2pt_w_omega_lut(
  fft_2pt_w_omega_lut_in_t i
){
    // Compute omega index
    uint32_t index = omega_s_j_to_index(i.s, i.j);

    // Omega lookup
    fft_in_t omega = omega_lut_rom(index);

    // 2 point fft
    fft_2pt_in_t fft_2pt_in;
    fft_2pt_in.t = i.t;
    fft_2pt_in.u = i.u;
    fft_2pt_in.omega = omega;
    fft_2pt_out_t fft_2pt_out = fft_2pt_comb_logic(fft_2pt_in);
    return fft_2pt_out;
}

Running the PipelineC tool, it was determined that, with the new lookup table added, the fft_2pt_w_omega_lut function as unpipelined combinatorial logic has an FMAX of about ~40MHz. For that reason the lookup table was switched from being purely combinatorial to one that is registered and infers FPGA block RAMs/ROMs: return the_rom_RAM_SP_RF_1(i, NULL_WR_DATA, 0);. These block RAMs take an extra cycle to return an output from the ROM. However, no software C code needs to change; it is still fine to write the input registers and immediately read the output register. Since each CPU instruction takes at least 3 cycles, these MMIO registers can't be loaded/stored faster than that anyway, so there are a few cycles to spare.

  • 680486 CPU clock cycles per compute_fft_cc
  • at 62.5 MHz that's 680486 * 16ns
    • ~10.8 milliseconds per FFT
    • ~91 FFT's per sec

Data directly to FFT compute unit

Using the 2-point butterfly unit (with omega lookup table) combinatorial logic from above, the PipelineC tool can create pipelines to meet arbitrary FMAX goals. This section describes building an FFT accelerator around this automatically pipelined butterfly function.

The two inputs to the butterfly unit come from two different locations in memory. Additionally, the outputs from the butterfly unit go back to those two separate memory locations. This implies a memory with two read ports and two write ports used simultaneously. Because of this, designs like Xilinx's Radix-2 Burst I/O FFT IP show two RAMs with muxing/switching for moving data in and out of a 2-point unit.

xilinx fft ip butterfly and ram

In this design it was chosen to have a single RAM with just one pair of read and write ports. When operated at a 2x clock rate, the RAM can expose a two read port and two write port interface in the 1x clock domain. As such, the butterfly compute pipeline is able to operate at full duty cycle, taking new inputs and producing outputs every cycle.

block diagram of 2x and 1x domains with ram fsm and pipeline

The execution of the FFT algorithm is orchestrated by a finite state machine (FSM) that connects the streams of data moving between the storage RAM and compute pipeline. The fft_2pt_fsm control FSM is defined in fft_2pt_2x_clk.c and repeatedly goes through the following states:

Accelerator States:

  1. Loading inputs:
    • Samples directly from the I2S PMOD are pulled from an input FIFO
    • FSM uses this stream of samples pulled from FIFO to make write requests to RAM
  2. Computing butterfly iterations (3 components):
    1. FSM requests read data from RAM by using s,k,j triple nested loop iterators from the FFT algorithm to supply a stream of addresses into the RAM.
      • Software equivalent of starting the process of indexing into the array for output[t_index] and output[u_index] to get inputs for the 2-point math
    2. FSM directs read response data from RAM into the 2-point butterfly pipeline
      • Software equivalent of finishing the array lookups and using those values as inputs to start a 2-point butterfly computation
    3. FSM directs pipeline outputs into the RAM via write requests
      • Software equivalent of finishing the 2-point butterfly computation and starting a write back for output[t_index] and output[u_index]
    • path traced through fsm to from ram and pipeline
    • As highlighted by the red arrow above: The above three components, 1) making read requests by providing addresses to RAM, 2) moving read responses of RAM data into pipeline, and 3) finally writing pipeline outputs back into RAM are all occurring simultaneously forming one long dataflow around the design with many butterfly iterations being computed at once.
  3. Unloading outputs:
    • FSM requests reads of FFT output values from RAM
    • Stream of read response FFT output data from RAM is connected to FIFO for CPU to consume

The FSM waits until it receives all NFFT=1024 samples before beginning butterfly iterations, and similarly waits until all output values have been sent to the CPU before restarting. This design does not accept a constant stream of samples (i.e. during butterfly iterations, no input samples are accepted). Bursts of input samples are accepted during the input state and bursts of output data occur during the output state. In this case the audio sample rate is low enough that this not-fully-continuous operation is acceptable.

Input Stream

i2s_axi_loopback.c has an I2S RX 'MAC' block that already outputs a stream of samples. I2S_RX_STREAM_MONITOR_PORT was defined to expose this stream as an external global variable visible to the CPU/FFT blocks as needed.

#ifdef I2S_RX_STREAM_MONITOR_PORT
stream(i2s_samples_t) i2s_rx_samples_monitor_stream;
#endif
...
// External stream port for samples
#ifdef I2S_RX_STREAM_MONITOR_PORT
i2s_rx_samples_monitor_stream.data = mac_rx_samples.data;
i2s_rx_samples_monitor_stream.valid = mac_rx_samples.valid & mac_rx_samples_ready;
#endif

FIFOs

This design uses six as-small-as-possible (depth=16 here) FIFOs, declared like so:

// Declare the fifos linking across clock domains
GLOBAL_STREAM_FIFO(i2s_samples_t, samples_fifo, 16)
GLOBAL_STREAM_FIFO(fft_ram_2x_read_req_t, rd_req_fifo, 16)
GLOBAL_STREAM_FIFO(fft_ram_2x_read_resp_t, rd_resp_fifo, 16)
GLOBAL_STREAM_FIFO(fft_ram_2x_write_req_t, wr_req_fifo, 16)
GLOBAL_STREAM_FIFO(fft_out_t, output_fifo, 16)

For example: the little bit of glue linking the I2S PMOD monitor port into the FFT block input FIFO looks like so:

#pragma MAIN i2s_to_fft_connect
void i2s_to_fft_connect(){
  samples_fifo_in = i2s_rx_samples_monitor_stream;
}

Storage RAM

fft_2pt_2x_clk.c is where the RAM for the FFT is defined and used.

First the RAM module is declared using a helper macro DECL_STREAM_RAM_DP_W_R_1 from ram.h:

// Type of RAM for storing FFT output
// Dual ports: write only + read only, 1 clock latency (block)RAM
DECL_STREAM_RAM_DP_W_R_1(
  fft_out_t, fft_ram, NFFT, "(others => (others => (others => '0')))"
)

And then the RAM module is instantiated with another helper macro RAM_DP_W_R_1_STREAM.

MAIN_MHZ(fft_ram_main, FFT_CLK_2X_MHZ)
void fft_ram_main(){
  // Declare instance of fft_ram called 'ram'
  // declares local variables 'ram_...' w/ valid+ready stream interface
  RAM_DP_W_R_1_STREAM(fft_out_t, fft_ram, ram)
  // Declares wires like...
  //   ram_wr_data_in
  //   ram_wr_addr_in
  //   ram_wr_in_valid
  //   ram_wr_in_ready
  //   ...
  // FSM logic here...
}

Those valid-ready stream wires for connecting to the RAM are used with another small FSM inside fft_ram_main. This FSM manages the 2:1 un/packing of requests and responses into/out of the RAMs<->FIFOs.

Butterfly Pipeline

Using the fft_2pt_w_omega_lut combinatorial logic developed above, along with the helper macro GLOBAL_VALID_READY_PIPELINE_INST from global_func_inst.h, a globally visible auto-pipelined instance of that function called fft_2pt_pipeline is declared, including a streaming valid-ready interface:

// Global instance of butterfly pipeline with valid ready handshake
GLOBAL_VALID_READY_PIPELINE_INST(fft_2pt_pipeline, fft_2pt_out_t, fft_2pt_w_omega_lut, fft_2pt_w_omega_lut_in_t, 16)
// Declares wires like...
//   fft_2pt_pipeline_in.data
//   fft_2pt_pipeline_in.valid
//   fft_2pt_pipeline_in_ready
//   ...

Because of the one cycle latency block ROM used for the omega_lut_rom lookup table, we need to inform the PipelineC tool about this hand-pipelined delay with #pragma FUNC_LATENCY omega_lut_rom 1.

Output Stream

The control FSM eventually begins unloading FFT output results by writing into the declared output_fifo. The read-side of this FIFO is connected to the CPU memory map as such in fsm_risc-v.c:

// Connect the outputs from FFT results FIFO into memory map
output_fifo_out_ready = ~handshake_valid_reg_value.fft_out;
if(output_fifo_out_ready & output_fifo_out.valid){
  handshake_data.fft_out = output_fifo_out.data;
  handshake_valid.fft_out = 1;
}

Memory mapped IO valid-ready handshaking registers are used to allow the CPU to read data out of the FIFO like so:

void fft_out_READ(fft_out_t* out){
  // Wait for valid data to show up
  while(!mm_handshake_valid->fft_out){}
  // Copy the data to output
  *out = mm_handshake_data->fft_out;
  // Signal done with data
  mm_handshake_valid->fft_out = 0;
}

Simulation

Inside of fft_2pt_2x_clk_tb.c a small testbench instantiates just the FFT hardware from fft_2pt_2x_clk.c to demonstrate the flow of data in simulation.

// For test bench to not need multiple clocks,
// keep it simple and define both clocks the same
#define FFT_CLK_2X_MHZ 100
#define FFT_CLK_MHZ 100
#include "fft_2pt_2x_clk.c"

#pragma MAIN tb_i2s_input
void tb_i2s_input(){
  // Drive counters as i2s samples w fixed point adjust
  static uint32_t counter;
  samples_fifo_in.data.l_data.qmn = counter << (24-16);
  samples_fifo_in.data.r_data.qmn = counter << (24-16);
  samples_fifo_in.valid = 1;
  if(samples_fifo_in_ready){
    printf("Testbench: Inputing sample %d\n", counter);
    counter += 1;
  }
}

#pragma MAIN tb_cpu_output
void tb_cpu_output(){
  // Drain CPU output fifo
  output_fifo_out_ready = 1;
  if(output_fifo_out.valid){
    printf(
      "Testbench: Output value real %d imag %d\n", 
      output_fifo_out.data.real, 
      output_fifo_out.data.imag
    );
  }
}

Various printfs have been placed throughout the control FSMs. When a simulation is run using GHDL and cocotb like so: ./src/pipelinec examples/risc-v/fft_2pt_2x_clk_tb.c --comb --sim --cocotb --ghdl --run 15000, you can see output like:

...
Clock:  12386
RAM: Enqueue 2x read resp
t data = -486,
u data = -486
RAM: write request t-side?=0 enabled? 1 (req addr = 503, data = -517)
RAM: rd req t-side?=1 addr 1023
FSM: Enqueued pipeline output 2x write request
t_index = 1017 t data -516,
u_index = 505 u data -520
RAM: read response t-side?=0 data = -486
RAM: Dequeue 2x write req
t enabled 1, t addr = 1015, t data = -465,
u enabled 1, u addr = 503, u data = -517
FSM: Dequeued 2x read response as butterfly pipeline input:
s = 10, k = 0, j = 509,
t data = -513, u data = -513
...

Results

In this #define FFT_USE_FULL_HARDWARE mode of operation, the compute_fft_cc function simply reads NFFT=1024 elements out of the FFT results FIFO repeatedly:

void compute_fft_cc(fft_in_t* input, fft_out_t* output){
    // FFT done in hw, just copy results to output
    for (uint32_t i = 0; i < NFFT; i++)
    {
        fft_out_READ(&(output[i]));
    }
}

Because of that, it no longer makes sense to use the time for one compute_fft_cc as the measure of FFT speed. Measuring the speed of unloading FFT outputs via fft_out_READ is at best an approximate measure of how fast samples can be loaded into the hardware (1 FFT output for each sample input), ~= 44.1K samples per second.

Instead a counter was added into the FFT's 1x clock domain to measure just the computation part of the FFT algorithm executing butterfly iterations (not loading or unloading samples/outputs):

  • FFT Compute Clock = 50MHz, RAM Clock = 100MHz
    • fft_2pt_w_omega_lut 1 cycle pipeline (2 stages)
      • Min 1 cycle from BROM omega LUT
    • 5380 50MHz clock cycles for butterfly iterations of FFT
    • 5380 * 20ns
      • ~107 us per FFT
      • ~9.2K FFT's per sec (ideal not counting samples onload/offload)
      • NFFT=1024 means capable of ~9.2 MSPS ideal sustained input rate
  • FFT Compute Clock = 100MHz, RAM Clock = 200MHz
    • fft_2pt_w_omega_lut 3 cycle pipeline (4 stages)
      • 1 cycle for omega_s_j_to_index math to get ROM index from s,j iterators
      • 1 cycle from BROM omega LUT
      • 1 cycle for 2pt butterfly (cycle is used for multiply part)
    • 5400 100MHz clock cycles for butterfly iterations of FFT
    • 5400 * 10ns
      • 54 us per FFT
      • ~18.5K FFT's per sec (ideal not counting samples onload/offload)
      • NFFT=1024 means capable of ~18.9 MSPS ideal sustained input rate

Quite a long way from the ~1/8th of an FFT per second of the 6.25MHz RV32IM CPU soft floating point version where this started.

Future Work

Continuous streaming

As mentioned above, having the FFT fully in hardware can support several mega-samples per second of throughput. In order to actually accommodate continuously receiving a higher sample rate, the design can be modified to size the input and output FIFOs to NFFT=1024: doing so would allow butterfly iterations to happen while the next and previous batches of inputs and outputs stream in and out of the design.
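
For example, using the same GLOBAL_STREAM_FIFO helper shown earlier, the boundary FIFOs could simply be declared with NFFT entries (a sketch; the internal read/write request FIFOs would stay small):

// Size the sample input and result output FIFOs to hold a full frame
GLOBAL_STREAM_FIFO(i2s_samples_t, samples_fifo, NFFT)
GLOBAL_STREAM_FIFO(fft_out_t, output_fifo, NFFT)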

If still needing to support higher sample rates, multiple copies of the FFT hardware can be instantiated and work distributed across the instances.

Data indirectly to FFT unit via Direct Memory Access?

Often a custom hardware accelerator is being attached to an existing SoC design, with an existing memory bus architecture. Early iterations of this design stored audio samples in off chip DDR memory, where the CPU then manually copied samples into data memory block RAM before computing.

Instead of making the CPU manage data movement between places like block RAM to/from DDR, a direct memory access (DMA) accelerator for data movement can be incorporated into the design. For example, as opposed to directly connecting samples into the FFT block, the CPU could set up the DMA unit to manage a bulk transfer of samples from DDR into the FFT accelerator block. The CPU can be off doing something else during the FFT computation, and then finish by setting up another DMA transfer of FFT results back into DDR memory.

PipelineC Derived FSMs?

PipelineC has allowed writing the control FSMs in this design the way you would in traditional hardware description languages. However, PipelineC also offers a way of deriving finite state machines from C code that looks much more like traditional software.
