N-Body Simulation on an Altera DE2-115 FPGA

Enrique Gomez (exg) & Maxwell Guo (mwguo)

Project Proposal

We propose to build an n-body simulator implemented entirely on an FPGA using a systolic array. Our engine will stream body data through a linear chain of processing elements, accumulating inter-body forces along the way. All simulation state will be stored in the FPGA's BRAM and sent back to a PC over UART to verify correctness. For comparison, we will also implement an optimized n-body simulator on a GPU with CUDA. The goal is to achieve comparable or even greater performance with the FPGA by leveraging pipelining, DSP blocks, and memory optimizations.

Proposal PDF: View Project Proposal (PDF)

Milestone Report PDF: View Milestone Report (PDF)

Background

In an n-body simulation, each body's velocity and position are updated every time step to model how massive bodies (planets, stars) interact in our universe. These updates depend on the forces exerted by all other bodies in the system, so sequentially computing the net force on a single body requires iterating over the other N - 1 bodies. While the algorithm is highly parallel, the hardware resources required to evaluate every pairwise interaction in parallel over millions of bodies are impractical.
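Concretely, for bodies with masses m_i and positions x_i, the per-step update takes the form below (written here assuming Newtonian gravity and a simple explicit Euler integrator; the integrator choice is our assumption for illustration, not something fixed by the proposal):

    F_i = \sum_{j \neq i} \frac{G \, m_i m_j \, (x_j - x_i)}{\lVert x_j - x_i \rVert^3}
    v_i(t + \Delta t) = v_i(t) + \frac{F_i}{m_i} \, \Delta t
    x_i(t + \Delta t) = x_i(t) + v_i(t + \Delta t) \, \Delta t

Each time step therefore costs O(N^2) pairwise force evaluations, which is the work the systolic array is designed to stream through its processing elements.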

The Challenge

As stated previously, the n-body simulation problem is inherently computationally intensive because its naive formulation scales quadratically with the number of bodies. Mapping such an algorithm onto an FPGA presents several key challenges: computational complexity, precision trade-offs, limited memory size, and verification.

Platform Choice & Resources

Platform Choice

We have chosen to program our FPGA with SystemVerilog and use C++ with CUDA for the CPU + GPU implementation. This is because we are most familiar with these languages and they are standard for their respective applications.

Resources

Our development and testing will primarily occur on the ECE lab machines in HH1305, which are equipped with Synopsys VCS (simulation) and Intel Quartus (synthesis). These tools will allow us to simulate our design for testing and then synthesize it onto the FPGA. If time permits for further optimization, we will also use Synopsys Design Compiler for critical path, power, and area analysis. As for the FPGA itself, we will borrow an Altera DE2-115 board (Cyclone IV FPGA) from a previous professor. Development and testing for the CPU + GPU implementation will primarily occur on our personal machines, which are equipped with NVIDIA GPUs (an RTX 3060 and an RTX 3070). We will also leverage the GHC lab machines with RTX 2080 GPUs for further performance evaluation.

Schedule

Week 1 (April 7 - April 13):
In the first week, we will create a minimal, “naive” N-body simulator using SystemVerilog on the FPGA. This initial version will compute all pairwise forces by iterating over each body in turn, storing positions and velocities in the BRAM for quick access.
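A behavioral golden model of the same O(N^2) update is useful both for this naive version and, later, for checking the systolic design against values read back over UART. The sketch below is simulation-only and 2D for brevity; the module name, body count, softening term, and use of 'real' arithmetic are our assumptions, not the synthesized design.

    // Hypothetical simulation-only golden model of the naive O(N^2) update.
    // 2D, 'real' arithmetic, and the constants below are illustrative choices.
    module naive_nbody_model;
      localparam int  N  = 8;          // small test set
      localparam real G  = 6.674e-11;  // gravitational constant
      localparam real DT = 1.0;        // time step
      real px[N], py[N], vx[N], vy[N], m[N];

      // advance the whole system by one time step
      task automatic step();
        real fx[N], fy[N];
        real dx, dy, dist2, dist, f;
        for (int i = 0; i < N; i++) begin
          fx[i] = 0.0;
          fy[i] = 0.0;
          for (int j = 0; j < N; j++) begin
            if (i == j) continue;
            dx    = px[j] - px[i];
            dy    = py[j] - py[i];
            dist2 = dx*dx + dy*dy + 1.0e-9;  // softening avoids divide-by-zero
            dist  = $sqrt(dist2);
            f     = G * m[i] * m[j] / dist2;
            fx[i] += f * dx / dist;
            fy[i] += f * dy / dist;
          end
        end
        for (int i = 0; i < N; i++) begin
          vx[i] += DT * fx[i] / m[i];
          vy[i] += DT * fy[i] / m[i];
          px[i] += DT * vx[i];
          py[i] += DT * vy[i];
        end
      endtask
    endmodule

A testbench would load the same initial conditions into this model and into the FPGA design, step both, and diff positions within a floating-point tolerance.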
Week 1.5 (April 16 - 19):
  • Systolic Array Architecture Planning (mwguo + egomez): Finalize the block diagram for the systolic array. Determine how many processing elements (PEs) we can reasonably instantiate within the FPGA’s LUT and DSP block budget. Map out the data flow: how bodies are streamed into the array, how partial sums of force are passed along, and how the pipeline handles any arithmetic latency (especially square root). (A minimal PE sketch is given after this list.)
  • Position/Velocity/Acceleration Calculator Updates (mwguo): Extend the naive calculator modules into a form compatible with systolic processing. Insert additional pipeline stages into the floating-point operations (add, mul, div) to allow new operations to issue every clock and mask latencies.
  • Central FSM for Systolic Array (egomez): Write a top-level finite state machine (FSM) that manages data loading, coordinate broadcasting, and result accumulation from all PEs. Verify that each PE receives the correct body data and that the aggregated forces can update body positions.
  • Integration & Debugging (mwguo + egomez): Combine the updated calculators with the FSM into a fully functional systolic array design. Use a small test set of bodies in simulation to ensure correctness before synthesis.
Week 2 (April 20 - 23):
  • Arithmetic Optimization (mwguo): Refine the floating-point pipelines (especially the square root unit) to reduce cycle counts. Experiment with fewer Newton-Raphson iterations for sqrt, or consider new initial guesses to minimize error without large performance hits. (The iteration is written out after this list.)
  • Systolic Array Tuning (egomez): Profile how many bodies per pass we can handle versus resource utilization. Adjust the number of PEs (K) to strike a balance between concurrency and timing closure at 50 MHz. Possibly pipeline the array further if we can’t meet timing at a higher concurrency.
  • Integration Tests & Partial Benchmarks (mwguo + egomez): Run partial benchmarks on the FPGA using small to moderate N values. Compare against the naive design’s performance. Verify that resource usage remains within our device limits.
Week 2.5 (April 24 - 28):
  • Advanced Optimizations (mwguo + egomez): Explore deeper pipelining strategies, data compression, or approximate arithmetic (such as a reduced mantissa width) if needed. Investigate whether unrolling outer loops or reorganizing memory layouts can reduce stall cycles.
  • Data Collection & Graph Generation (mwguo, egomez): Gather detailed throughput and cycle counts for various N. Compare the improved systolic array design against the CPU baseline and naive FPGA design. Plot performance vs. resource utilization, exploring multiple PE configurations.
  • Final Report Preparation (mwguo + egomez): Summarize the pipeline design trade-offs, accuracy vs. performance findings, and resource usage. Compile the results into clear charts showing speedup (or slowdown) relative to the CPU, along with a performance-per-watt analysis.

Project Repository

Check out the full project on GitHub: https://github.com/mwguo15/n-body-on-fpga