Understand the details about ara #292

Closed
AD738560581 opened this issue Mar 14, 2024 · 2 comments

Comments

@AD738560581

Hello @mp-17
I have been following and learning ARA with the RVV 1.0 spec since August last year. Even though I have been studying it for six months and have made many modifications to the ARA issues, I am still confused about many of ARA's design details. One example is how the MFPU and ALU state machines process the reduction (red*) instructions, such as vredsum. The state-machine transitions for slide instructions whose slide amount is not a power of 2 have also troubled me for a long time.
Therefore, I would like to inquire if you have any information about the details of ARA design that you can provide me with for learning, such as pictures, documents, and PowerPoint presentations, etc.
Finally, I wish the ARA team and the ARA project continued success.

@mp-17
Collaborator

mp-17 commented Mar 14, 2024

Hey @AD738560581,

We don't have documentation for the FSMs, but I can give you some information that can help you understand the hardware description and the waves.

For the reductions:

  1. Reductions happen in a 3-stage process: intra-lane, inter-lane, and SIMD phases.
  2. During the intra-lane phase, the operand requesters of every lane fetch all the input data from the VRF, and these data are all accumulated into one accumulator in the processing unit.
  3. During the inter-lane phase, the L accumulated results (one per lane) are then reduced in ~log2(L) steps. In each step, half of the lanes with valid accumulated results move their accumulator to the next lane with a valid result that is not being moved. This move happens through the slide unit.
  4. In the end, lane 0 will hold a final 64-bit packet. This packet can contain 1, 2, 4, or 8 data depending on the element width. So, if the element width is less than 64-bit, the partial accumulators in the packet are further reduced to just one datum that is written back into the VRF.
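The three phases above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not Ara's RTL: the function name `reduce_sum`, the element-to-lane mapping, and the pairing of lanes in the inter-lane phase are hypothetical choices made for the example, and the scalar operand and masking of `vredsum` are ignored.

```python
def reduce_sum(elements, num_lanes=4, sew=16):
    """Toy model of a three-phase vector sum reduction (not Ara's RTL)."""
    assert num_lanes & (num_lanes - 1) == 0, "model assumes a power-of-two lane count"
    slots = 64 // sew        # partial accumulators held in one 64-bit word
    mask = (1 << sew) - 1    # sums wrap around at the element width

    # Phase 1 (intra-lane): every lane accumulates its own elements.
    # Assumed mapping: element i -> lane i % num_lanes,
    # slot (i // num_lanes) % slots inside that lane's 64-bit accumulator.
    acc = [[0] * slots for _ in range(num_lanes)]
    for i, value in enumerate(elements):
        lane, slot = i % num_lanes, (i // num_lanes) % slots
        acc[lane][slot] = (acc[lane][slot] + value) & mask

    # Phase 2 (inter-lane): log2(num_lanes) steps. In each step, half of the
    # still-valid lanes hand their accumulator to a partner lane.
    # Assumed pairing: lane (dst + stride) sends to lane dst.
    steps = 0
    stride = 1
    while stride < num_lanes:
        for dst in range(0, num_lanes, 2 * stride):
            src = dst + stride
            acc[dst] = [(a + b) & mask for a, b in zip(acc[dst], acc[src])]
        stride *= 2
        steps += 1

    # Phase 3 (SIMD): fold the partial accumulators of lane 0's 64-bit
    # packet into the single datum that would be written back.
    result = 0
    for partial in acc[0]:
        result = (result + partial) & mask
    return result, steps
```

For example, `reduce_sum(range(100), num_lanes=4, sew=16)` goes through two inter-lane steps, and with `sew=64` the SIMD phase has nothing left to fold because the packet holds a single element.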

For the slides by non-power-of-2:

  1. The slides are broken down into power-of-two slides (for example, a slide by 39 is broken down into a slide by 32, a slide by 4, a slide by 2, and a slide by 1).
  2. Everything happens in the slide unit. The dispatcher does not inject any micro-operations. This ensures that the whole operation is atomic and there are no spurious writes to the VRF.
  3. Every word from the lanes is slid multiple times to reach the correct non-power-of-two alignment before the write-back. This also ensures having no spurious writes.
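The decomposition above can be sketched in a few lines. This is a behavioral illustration only, assuming a slide-down with zero fill on a flat list of elements; the helper names `power_of_two_steps` and `slide_down` are hypothetical and do not correspond to signals or modules in Ara.

```python
def power_of_two_steps(amount):
    """Split a slide amount into power-of-two slides, largest first."""
    steps = []
    bit = 1 << max(amount.bit_length() - 1, 0)
    while bit:
        if amount & bit:
            steps.append(bit)
        bit >>= 1
    return steps


def slide_down(vector, amount):
    """Apply a slide-down as a sequence of power-of-two slides (zero fill)."""
    data = list(vector)
    for step in power_of_two_steps(amount):
        # Each partial slide keeps the vector length constant.
        data = data[step:] + [0] * min(step, len(data))
    # Only the fully slid data is returned, mirroring the single write-back.
    return data
```

Here `power_of_two_steps(39)` yields the four slides from the example (32, 4, 2, 1), and the intermediate states stay local to `slide_down`, in the same spirit as keeping the partial slides inside the slide unit.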

Let me know if this helps,
Matteo

@AD738560581
Author

Your reply has been of great help to me, and I will try to gain a deeper understanding of ARA. Thanks, @mp-17!

@mp-17 mp-17 closed this as completed Jun 13, 2024