Discussion of: "Moving Compute towards Data in Heterogeneous multi-FPGA Clusters using Partial Reconfiguration and I/O Virtualisation"
Latest revision as of 17:32, 30 September 2021
David:
- Started to get lost around page 3
- PR = Partial Reconfiguration = loading a new accelerator into one region of the FPGA while the rest of the device keeps running, without re-flashing the entire board
- I/O virtualization
- Design requires compute resources to be distributed throughout storage resources. (cost/benefit?)
- "... offers users with an illusion of a single and large FPGA, in which they can develop, deploy, and execute applications at large-scale with ease to achieve energy-efficient HPC"
- Does the benefit only come if the data you need to process happens to be spread out over many nodes?
  - Motivates distributing data as evenly as possible over storage.
  - Energy used to extract data from disk? Not a problem for SSDs, but energy could increase if you need to spin up two HDDs instead of one.
Diana:
- NY Times article on Master/Slave terminology: https://www.nytimes.com/2021/04/13/technology/racist-computer-engineering-terms-ietf.html
Michael:
- The Ideal Environment
  - Single Major Function Kernel Build
  - Plug and Play Deployment
  - Multiple Deployments on a Single FPGA
  - Multiple Deployments on Heterogeneous FPGAs
  - Transparent Performance Scaling
- Partial Reconfiguration (PR)
  - Hot Insertion of an FPGA Region
  - Currently done via the Processor Configuration Access Port (PCAP)
  - Design Modularization
- SOTA Issues
  - Each Deployment Requires a Separate Kernel
  - Tools Do Not Abstract FPGA I/O Heterogeneity
  - All Kernels Require Distinct I/O Configuration
  - Substantial Overhead to Swap Kernels
  - Remote Swaps Have Higher Latency
- Solution
  - I/O Virtualization
  - High-Speed ICAP Dynamic Remote Configuration Service
  - High-Level Synthesis
  - UNIMEM: implements a PGAS (Partitioned Global Address Space)
  - UNILOGIC: Virtual FPGA (VF) PR
    - Remote PR
    - VF task location
    - Visible Memory-Mapped Accelerators (Kernels)
    - Transparent Accelerator Access to RDMA
- FPGA Implementation Stack
  - Decoupled Accelerator Builds
  - Hardware Abstractions
  - Accelerator Interface Libraries:
    - Standardized Register Map
    - Generic Drivers: Streaming Interface, Master/Slave Interface
    - Runtime/Execution API: High-Level Software Integration
    - gRPC: Asynchronous Cluster Job Launch
- Performance/Payoff
  - ICAP vs. PCAP: Table I
  - Inter-PR Communication Latency (with I/O Virtualization): 2-3 ns (Fig. 3)
  - Build Flow: 55 vs. 336 mins (~1/6 the time)
  - Execution: Fig. 4
  - Energy: Fig. 4