Understand the details about ara #292

AD738560581 · 2024-03-14T11:16:02Z

Hello @mp-17
I have been following and learning ARA with the RVV 1.0 spec since August last year. Even though I have been studying for six months and have made many modifications related to the ARA issues, I am still very confused about many of the ARA design details. One example is how the MFPU and ALU state machines process red* instructions, such as vredsum. The state-machine transitions for slide instructions whose slide amount is not a power of 2 have also troubled me for a long time.
Therefore, I would like to ask whether you have any material on the details of the ARA design that you could share with me for learning, such as pictures, documents, or PowerPoint presentations.
Finally, I wish the ARA team and the ARA project continued success.

mp-17 · 2024-03-14T11:59:30Z

Hey @AD738560581,

We don't have documentation for the FSMs, but I can give you some information that can help you understand the hardware description and the waves.

For the reductions:

Reductions happen in a 3-stage process: intra-lane, inter-lane, and SIMD phases.
During the intra-lane phase, the operand requesters of every lane fetch all the input data from the VRF, and these data are all accumulated into one accumulator in the processing unit.
During the inter-lane phase, the #L accumulated results are then reduced in ~log2(L) steps. Half of the lanes with valid accumulated results move their accumulator to the next lane with a valid result that is not being moved. This move happens through the slide unit.
In the end, lane 0 will hold a final 64-bit packet. This packet can contain 1, 2, 4, or 8 data depending on the element width. So, if the element width is less than 64-bit, the partial accumulators in the packet are further reduced to just one datum that is written back into the VRF.
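
To make the three phases above concrete, here is a minimal behavioral sketch in Python. It is not the Ara RTL. The function name `model_reduction_sum`, the element-to-lane mapping, the choice of which lane sends and which receives at each inter-lane step, and the wrap-around addition are all assumptions made for illustration.

```python
def model_reduction_sum(elements, num_lanes=4, sew_bits=16):
    """Behavioral model of a three-phase sum reduction (hypothetical helper)."""
    assert sew_bits in (8, 16, 32, 64)
    assert num_lanes > 0 and num_lanes & (num_lanes - 1) == 0  # power of two
    mask = (1 << sew_bits) - 1
    slots = 64 // sew_bits  # 8, 4, 2, or 1 partial sums per 64-bit packet

    # Phase 1 (intra-lane): every lane accumulates its own elements
    # into a single 64-bit packet of partial sums.
    acc = [[0] * slots for _ in range(num_lanes)]
    for i, x in enumerate(elements):
        lane = i % num_lanes             # assumed element-to-lane mapping
        slot = (i // num_lanes) % slots  # assumed position inside the packet
        acc[lane][slot] = (acc[lane][slot] + x) & mask

    # Phase 2 (inter-lane): at each step, half of the lanes that still hold
    # a valid accumulator hand it to a neighbor. Takes log2(num_lanes) steps.
    steps = 0
    stride = 1
    while stride < num_lanes:
        for dst in range(0, num_lanes, 2 * stride):
            src = dst + stride  # assumed pairing; the result ends in lane 0
            acc[dst] = [(a + b) & mask for a, b in zip(acc[dst], acc[src])]
        stride *= 2
        steps += 1

    # Phase 3 (SIMD): collapse the partial sums packed in lane 0's packet.
    result = 0
    for partial in acc[0]:
        result = (result + partial) & mask
    return result, steps


result, steps = model_reduction_sum(list(range(1, 101)), num_lanes=4, sew_bits=16)
print(result, steps)  # 5050 2
```

With 64-bit elements the packet holds a single value, so the third phase has nothing left to combine, which matches the description above.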

For the slides by non-power-of-2:

The slides are broken down into power-of-two slides (for example, a slide by 39 is broken down into a slide by 32, a slide by 4, a slide by 2, and a slide by 1).
Everything happens in the slide unit. The dispatcher does not inject any micro-operations. This ensures that the whole operation is atomic and there are no spurious writes to the VRF.
Every word from the lanes is slid multiple times to reach the correct non-power-of-two alignment before the write-back. This also ensures that there are no spurious writes.
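
The decomposition above can be sketched in a few lines of Python. This is only a functional model, assuming a slide-down on a plain list with zeros shifted in at the end. The names `power_of_two_steps` and `slide_down` are hypothetical, and the hardware applies the partial slides to each word inside the slide unit rather than to the whole vector at once.

```python
def power_of_two_steps(amount):
    """Split a slide amount into power-of-two slides, largest first."""
    steps = []
    bit = 1 << (amount.bit_length() - 1) if amount else 0
    while bit:
        if amount & bit:
            steps.append(bit)
        bit >>= 1
    return steps


def slide_down(vec, amount):
    """Apply the power-of-two slides one after another (illustrative only)."""
    out = list(vec)
    for step in power_of_two_steps(amount):
        # Assumption: positions that would read past the end are filled with 0.
        out = out[step:] + [0] * min(step, len(out))
    return out


print(power_of_two_steps(39))  # [32, 4, 2, 1]
assert slide_down(list(range(64)), 39) == list(range(39, 64)) + [0] * 39
```

The final assertion checks that chaining the partial slides gives the same result as a single slide by 39. Only that final alignment is written back, which is why no intermediate result reaches the VRF.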

Let me know if this helps,
Matteo

AD738560581 · 2024-03-14T12:19:39Z

Your reply has been of great help to me, and I will try to gain a deeper understanding of ARA. Thanks! @mp-17

mp-17 closed this as completed Jun 13, 2024
