| 1 | +\documentclass[11pt]{article} |
| 2 | +\usepackage[letterpaper, margin=1in]{geometry} |
| 3 | +\usepackage[utf8]{inputenc} |
| 4 | +\usepackage[T1]{fontenc} |
| 5 | +\usepackage{listings} |
| 6 | +\usepackage{xcolor} |
| 7 | +\usepackage{inconsolata} |
| 8 | +\usepackage{graphicx} |
| 9 | +\usepackage{enumitem} |
| 10 | +\usepackage{amsfonts} |
| 11 | +\usepackage{amsmath} |
| 12 | +\usepackage{indentfirst} |
| 13 | +\usepackage[backend=biber, citestyle=ieee]{biblatex} |
| 14 | + |
| 15 | +\addbibresource{references.bib} |
| 16 | + |
| 17 | +\lstset{ |
| 18 | + language=C, |
| 19 | + belowcaptionskip=1\baselineskip, |
| 20 | + breaklines=true, |
| 21 | + frame=L, |
| 22 | + xleftmargin=\parindent, |
| 23 | + showstringspaces=false, |
| 24 | + basicstyle=\ttfamily, |
| 25 | + keywordstyle=\bfseries\color{green!40!black}, |
| 26 | + commentstyle=\itshape\color{purple!40!black}, |
| 27 | + identifierstyle=\color{blue}, |
| 28 | + stringstyle=\color{orange}, |
| 29 | +} |
| 30 | + |
| 31 | +\title{\vspace{-2cm} {Automating FPGA Offload with Symbolic Execution}} |
| 32 | +\date{\vspace{-2cm}} |
| 33 | + |
| 34 | +\begin{document} |
| 35 | + |
| 36 | +\maketitle |
| 37 | + |
| 38 | +\section{Background} |
| 39 | + |
| 40 | +Field-programmable gate arrays (FPGAs) are an essential component of the latest datacenters and compute clusters. |
| 41 | +They are especially valuable as compute accelerators since they can often outperform CPUs at specific tasks \cite{ma2017optimizing}. |
| 42 | +However, their design also presents a substantial challenge to the users of these clusters and datacenters. |
| 43 | +FPGAs require the user to write program descriptions in a hardware description language (HDL) such as Verilog, which has a significant learning curve compared to high-level languages such as C. |
| 44 | + |
| 45 | +While solutions such as high-level synthesis (HLS) can transpile a high-level language such as C to HDL, integrating the FPGA into a compute workflow still requires significant effort. |
| 46 | +Since not all logic is suitable to be run on an FPGA, some code segments still have to be run on the host. |
| 47 | +Thus, communication libraries between the CPU and FPGA need to be written carefully to ensure maximum performance and often require domain-specific knowledge on interconnects such as PCIe \cite{sommer2017openmp}. |
| 48 | +Validation of FPGA code is also extremely time-consuming due to the inability of CPUs to effectively emulate FPGAs \cite{simpson2010fpga}. |
| 49 | +Thus, it may often be impractical to spend time integrating FPGAs into a workflow, despite their substantial performance benefits. |
| 50 | +A potential solution may be to find a way to automatically select which components of a program to offload to an FPGA. |
| 51 | + |
| 52 | +\section{Proposal} |
| 53 | + |
| 54 | +This project will address the difficulty in creating performant and reliable FPGA offloading solutions by creating a compiler that will automatically convert code segments into HDL. |
| 55 | +This automation can work since the semantics of a program can reflect its suitability for offloading. |
| 56 | + |
| 57 | +The primary requirement for logic that is suitable for offload to an FPGA is that the code can be pipelined \cite{simpson2010fpga}. |
| 58 | +Namely, the code must have minimal overlapping operations, and side effects cannot modify the execution of the function. |
| 59 | +Both of these requirements can be tested for through binary analysis. |
| 60 | + |
| 61 | +Symbolic execution is especially suitable for checking code segments for offload suitability. |
| 62 | +A symbolic execution engine such as \texttt{angr} can generate an abstract syntax tree (AST) from a given binary \cite{wang2017angr}. |
| 63 | +The AST represents the set of operations that will define the value of a variable at some time during the program’s execution. |
| 64 | +Heuristics to determine if the function is pipelineable can be created using this data. |
| 65 | + |
| 66 | +The data dependency graph (DDG) will provide an important metric for determining offload suitability. |
| 67 | +Each node in this graph is a variable in the function, and each directed edge indicates that one variable influences the final state of another. |
| 68 | +A cycle in this graph may indicate that the function cannot be pipelined efficiently. |
| 69 | +The control flow graph (CFG) of the function can also be used to identify loop conditions that may not be efficiently pipelined. |
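| | + |
| | +As a minimal illustrative sketch (the function names \texttt{scale} and \texttt{iterate} are hypothetical and not taken from any benchmark), the two loops below contrast a dependency graph without cycles against one with a cycle. |
| | +In \texttt{scale}, each output element depends only on the matching input element, so successive iterations can overlap in a pipeline. |
| | +In \texttt{iterate}, the value of \texttt{x} in each iteration depends on its value from the previous iteration, which forms a cycle in the DDG and may limit how efficiently the loop can be pipelined. |
| | + |
| | +\begin{lstlisting} |
| | +#include <stddef.h> |
| | +#include <stdint.h> |
| | + |
| | +/* No cycle: out[i] depends only on in[i]. */ |
| | +void scale(const uint32_t *in, uint32_t *out, size_t n) { |
| | +    for (size_t i = 0; i < n; i++) { |
| | +        out[i] = 2u * in[i]; |
| | +    } |
| | +} |
| | + |
| | +/* Cycle: x depends on the x produced by the previous iteration. */ |
| | +uint32_t iterate(uint32_t x, size_t n) { |
| | +    for (size_t i = 0; i < n; i++) { |
| | +        x = x * 1664525u + 1013904223u; |
| | +    } |
| | +    return x; |
| | +} |
| | +\end{lstlisting} |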
| 70 | + |
| 71 | +Other suitability analyses will need to be made as well to ensure that the FPGA can implement the required design. |
| 72 | +For example, the total size of all variables cannot exceed the total number of registers available on the FPGA, and allocated memory cannot exceed the total memory on the FPGA. |
| 73 | +Other considerations may include identifying code regions that may work well on specialized components within the FPGA accelerators. |
| 74 | +For example, many FPGA vendors offer intellectual property cores (IPs) that offer better performance than custom implementations of the same function. |
| 75 | +The compiler may need to take these IPs into account by generating profiles for each FPGA. |
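| | + |
| | +The register and memory constraints can be written as a rough check, using notation introduced here purely for illustration: let $V$ be the set of variables in a candidate region, $w(v)$ the width of variable $v$ in bits, $R$ the number of register bits available on the target FPGA, $m$ the memory the region allocates, and $M$ the memory available on the FPGA. |
| | +Under this simplified model, a region is a candidate for offload only if |
| | +\begin{equation*} |
| | +\sum_{v \in V} w(v) \le R \quad \text{and} \quad m \le M. |
| | +\end{equation*} |
| | +For example, a region holding four 32-bit variables would need at least $4 \times 32 = 128$ register bits. |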
| 76 | + |
| 77 | +This project will create a compiler that can compile C source code into both a CPU portion and an FPGA portion. |
| 78 | +The compiler will compile the input C program into a binary and then analyze each function of the binary as described above. |
| 79 | +This allows the program to make use of compiler optimizations such as code inlining and loop unrolling to detect code regions that may not otherwise be offloadable. |
| 80 | + |
| 81 | +The compiler will then automatically determine which sections of the program can be offloaded to FPGAs through the heuristics above. |
| 82 | +It will extract the largest selection of code regions that was found to be offloadable and decompile that binary to HLS C. |
| 83 | +This preserves the semantics of the compiled program and allows conversion to the subset of C used by the HLS suite while keeping compiler optimizations. |
| 84 | +It will also add functions for communication between the host and the FPGA, and finally flash the compiled HLS project to the FPGA. |
| 85 | + |
| 86 | +This project will also include a library for both the FPGA and the CPU which will enable communication between them. |
| 87 | +This can be done by building a kernel driver and a Verilog \texttt{module} that will interface between the offloaded code segments and the vendor's PCIe IP. |
| 88 | + |
| 89 | +\section{Evaluation} |
| 90 | + |
| 91 | +This project can be evaluated on any computer with an FPGA card supporting PCIe. |
| 92 | +Real-world programs making heavy use of parallelization will be evaluated. |
| 93 | +This will include applications such as the NAMD molecular dynamics library \cite{phillips2005scalable} or the Stockfish chess engine. |
| 94 | +These programs are highly parallelized and will test the pipeline heuristic of the prototype. |
| 95 | +Furthermore, some of these projects also support common APIs such as OpenMP and MPI, which this project should also support. |
| 96 | + |
| 97 | +Benchmarks such as those in NAMD and Stockfish also have custom metrics that can be compared between the two versions. |
| 98 | +These results could take into account side effects that may not be apparent when looking solely at latency and throughput. |
| 99 | + |
| 100 | +Performance comparisons can then be made between the host-only version and the offloaded version. |
| 101 | +Significant metrics to test include the latency and throughput of the application when offloaded, as well as total runtime of each version. |
| 102 | +The performance of the communication library will play a substantial role in the test results as well. |
| 103 | + |
| 104 | +\section{Related Work} |
| 105 | + |
| 106 | +Sommer et al. presented an addition for the OpenMP API that runs user-specified blocks on an FPGA using Vivado HLS \cite{sommer2017openmp}. While this presents an option to write FPGA offloads without writing any communication libraries, the project cannot automatically select code locations to offload. Instead, they are specified using a \texttt{\#pragma} directive, like a standard OpenMP call. However, the project also implemented a communication library between the host and the FPGA using PCIe. |
| 107 | + |
| 108 | +Yamato presented a project that allowed for automatic offloading of specific function blocks \cite{yamato2021automatic}. |
| 109 | +This project relies on static code analysis on the source code and therefore cannot account for compiler optimizations that may unroll loops and present new opportunities for offloading. |
| 110 | +Use of symbolic execution on a compiled binary will take advantage of compiler optimizations such as loop unrolling to identify offloadable blocks that may not be identifiable in the source. |
| 111 | +Furthermore, Yamato's project requires the source code of the original program and thus cannot support offloading from a binary. |
| 112 | + |
| 113 | +\section{Relevance to the Department of Defense} |
| 114 | + |
| 115 | +This project will enable software developers to deploy their applications to FPGA accelerators with minimal effort. |
| 116 | +If successful, the compiler developed by this project may significantly increase the performance of many common computing operations. |
| 117 | + |
| 118 | +This is especially relevant for ARL's research focus on supercomputing technologies, under section KCI-CS-1 of BAA W911NF-17-S-0003. |
| 119 | +In particular, this work will allow users to leverage large-scale on-chip parallelism without the specialist knowledge in RTL programming usually required. |
| 120 | + |
| 121 | +This project may also be applicable to ARL's dynamic binary translation focus, by allowing developers to target FPGAs with pre-existing binaries. |
| 122 | +While the proposed project will work with source code, it can also be used to translate binaries as it operates at an intermediate representation level. |
| 123 | + |
| 124 | +\section{Societal Impacts} |
| 125 | + |
| 126 | +Highly parallel processing is an essential part of many scientific discoveries that may prove to be societally significant. |
| 127 | +For example, the Summit supercomputer was used to model the COVID-19 spike protein and to identify a set of molecules that could disrupt the virus’ infectivity \cite{smith2020repurposing}. |
| 128 | +Similar techniques are used to model the Earth's climate and present possible futures \cite{mizielinski2014high}. |
| 129 | +However, these calculations require considerable amounts of compute time, and the clusters often draw a large amount of power while doing so. |
| 130 | + |
| 131 | +FPGAs also often offer better power efficiency and can potentially speed up computations by orders of magnitude \cite{asano2009performance}. |
| 132 | +Accelerating similar scientific projects through FPGA offloading could benefit other societally significant work by decreasing computation time and power usage. |
| 133 | +Faster results also mean that scientists spend less time waiting for experiments to finish and can undertake calculations that were previously infeasible. |
| 134 | + |
| 135 | +\section{Relevant Qualifications} |
| 136 | + |
| 137 | +A significant portion of my undergraduate experience focused on building high-performance distributed systems for various networked use cases. |
| 138 | +My research on defenses against denial of service attacks familiarized me with the Linux kernel. |
| 139 | +Similarly, my project at an NNSA laboratory on building network testbed technology provided me with a solid background in building high-performance software systems. |
| 140 | +Furthermore, my participation in the Mars 2020 project gave me software development experience that will be essential for building a prototype with real-world applicability. |
| 141 | + |
| 142 | +I also have extensive experience working with FPGAs in a high-performance setting. |
| 143 | +My work with FPGA SmartNICs this past summer provided me with the experience needed to build designs for such accelerators. |
| 144 | +I am also currently working on a related project that is developing a high-performance Verilog-to-C transpiler for networking tasks. |
| 145 | +These experiences have provided me with additional insights into designing for FPGAs with both HDL and Verilog. |
| 146 | + |
| 147 | +I also have significant experience working with the \texttt{angr} binary analysis toolkit, including on a DARPA-funded project to extract mathematical expressions from cyber-physical binaries. |
| 148 | +This work familiarized me with symbolic execution and static analysis methods, and will be invaluable for the success of this project. |
| 149 | + |
| 150 | +\newpage |
| 151 | + |
| 152 | +\printbibliography |
| 153 | + |
| 154 | +\end{document} |