Difference between revisions of "AI Surrogate Models LDRD"

 
(22 intermediate revisions by 3 users not shown)
Line 4: Line 4:
 
=== Project Description ===
 
  
We propose to develop tools to automatically generate machine learning surrogate models from existing software so that they may utilize modern heterogeneous hardware. A hypothetical future high performance data facility would make extensive use of heterogeneous hardware such as GPUs and FPGAs, but many legacy codes will need to be heavily modified in order to take advantage of this hardware. Surrogate models replace a piece of code which is expensive to run with an approximate model; when the underlying model is a neural net, it runs efficiently on heterogeneous hardware. Thus, they are a promising technique for offloading computation to such hardware while minimizing the necessary changes to the original code. The tools developed during this project would make it substantially simpler to implement a surrogate model, enabling legacy code to access heterogeneous hardware, saving users' time and effort, and eliminating redundant work. Lowering these barriers should enable faster development iterations and make it easier to bring machine learning research code into production. This project includes a proof of principle of a new kind of code analysis tool which could be useful for an even broader variety of problems in high performance computing. Several of the milestones open up opportunities for future research into neural differential equations and automatic identification of functions to surrogate.
+
PHASM ("Parallel Hardware viA Surrogate Models") is a software toolkit, currently under development, for creating AI-based surrogate models of scientific code. AI-based surrogate models are widely used for creating fast and inverse simulations. The project anticipates an additional, future use case: adapting legacy code to modern hardware. Data centers are investing in heterogeneous hardware such as GPUs and FPGAs; meanwhile, many important codebases are unable to take advantage of this hardware's superior parallelism without undergoing a costly rewrite. An alternative is to train a neural net surrogate model to mimic the computationally intensive functions in the code, and deploy the surrogate on the exotic hardware instead. PHASM addresses three specific challenges: (1) systematically discovering which functions can be effectively replaced with a surrogate, (2) automatically identifying, for a given function, the true space of inputs and outputs including those not apparent from the type signature, and (3) integrating a machine learning model into a legacy codebase cleanly and with a high level of abstraction. In the first year of development, a proof of concept has been developed for each challenge. A surrogate API makes it easy to bring PyTorch models into the C++ ecosystem and uses profunctor optics to establish a two-way data binding between C++ datatypes and tensors. A model variable discovery tool performs a dynamic binary analysis using Intel PIN in order to identify a target function's model variable space, including types, shapes, and ranges, and generate the optics code necessary to bind the model to the function. Future work may include exploring the limits of surrogate models for functions of increasing size and complexity, and adaptively generating synthetic training data based on uncertainty estimates.
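The two-way data binding mentioned above can be illustrated with a much simpler construct than profunctor optics: a plain getter/setter pair (a basic lens) per model variable. The following is a minimal sketch in Python rather than C++, and every name in it (`Lens`, `Particle`, `field`, `to_tensor`, `from_tensor`) is hypothetical and not taken from the PHASM API. It only shows the idea of flattening selected fields of a structured object into a tensor-like list and writing model outputs back.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Sequence, TypeVar

S = TypeVar("S")


@dataclass
class Lens(Generic[S]):
    """A getter/setter pair focusing on one float-valued part of a structure S."""
    get: Callable[[S], float]
    put: Callable[[S, float], None]


@dataclass
class Particle:
    """Hypothetical legacy data type; not part of PHASM."""
    x: float
    y: float
    energy: float


def field(name: str) -> Lens:
    """Build a lens that focuses on the attribute called `name`."""
    return Lens(
        get=lambda obj: float(getattr(obj, name)),
        put=lambda obj, value: setattr(obj, name, value),
    )


def to_tensor(obj, lenses: Sequence[Lens]) -> List[float]:
    """Structure -> flat list of floats (the direction used for model inputs)."""
    return [lens.get(obj) for lens in lenses]


def from_tensor(obj, lenses: Sequence[Lens], values: Sequence[float]) -> None:
    """Flat list of floats -> structure (the direction used for model outputs)."""
    if len(values) != len(lenses):
        raise ValueError("tensor length does not match the number of lenses")
    for lens, value in zip(lenses, values):
        lens.put(obj, value)


# Bind only the fields the (hypothetical) model reads and writes.
model_vars = [field("x"), field("energy")]
particle = Particle(x=1.0, y=2.0, energy=3.0)
flat = to_tensor(particle, model_vars)
from_tensor(particle, model_vars, [v * 10.0 for v in flat])
```

Fields that are not bound (here `y`) are left untouched, which is the property that lets a surrogate be attached to only part of a larger legacy data structure.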
  
 
=== General Resources ===
 
  
 
* Source code lives here: https://github.com/nathanwbrei/phasm
 
 +
** [[HOWTO build and run PHASM on Geant4 examples]]
 +
** [https://docs.google.com/spreadsheets/d/19iVKLKfVFlASZSgHDrYQx6XqakzqsAp0i52GIF5nEWs/edit#gid=0 PHASM test result tracking]
 
* [https://wiki.jlab.org/epsciwiki/index.php/SRGS_2022 SRGS 2022]
 
 
* [https://trello.com/b/iHKfTSKB/project-planning Project planning (Trello page)]
 
Line 18: Line 20:
  
 
Passcode: cWidgE
 
 +
 +
* [[Minutes from the 25 August 2022 meeting]]
 +
 +
* [[Minutes from the 1 Dec 2022 meeting]]
  
 
== Documents ==
 
Line 37: Line 43:
 
| Nathan Brei, David Lawrence
 
 
| [https://wiki.jlab.org/epsciwiki/images/e/ee/Phasm_Intro_Slides.pdf PDF]
 
 +
|-
 +
| 2022-08-24
 +
| W&M collab
 +
| Nathan Brei
 +
| [https://wiki.jlab.org/epsciwiki/images/c/cb/Phasm_FY22.pdf PDF]
 +
|-
 +
| 2022-10-27
 +
| ACAT 2022
 +
| Nathan Brei
 +
| [https://indico.cern.ch/event/1106990/contributions/4991290/attachments/2535525/4363650/ACAT_PHASM_talk.pdf PDF]
 +
|-
 +
| 2023-05-01
 +
| CHEP 2023
 +
| Xinxin Mei
 +
| [https://github.com/cissieAB/pytorch-paradnn/blob/master/docs/CHEP_GPU4ML4NP_05022023.pdf poster]
 
|}
 
  
Line 46: Line 67:
 
! Title
 
 
|-
 
| ccc
+
| March 2024
| aaa
+
| [https://www.osti.gov/ US DOE OSTI] Technical Report
| bbb
+
| [https://doi.org/10.2172/2331226 PHASM: A Toolkit for Creating AI Surrogate Models within Legacy Codebases]
 +
|-
 +
| March 2023
 +
| ACAT 2022 proceedings (preprint)
 +
| [[Media:ACAT PHASM paper.pdf|PHASM: A toolkit for creating AI surrogate models within legacy codebases]]
 +
|-
 
|}
 
  
 +
=== Graphics ===
  
=== Notes ===
+
==== PHASM flowchart ====
* [https://jeffersonlab-my.sharepoint.com/:w:/r/personal/xmei_jlab_org/Documents/Roofline_model/Advisor%20and%20Likwid%20report.docx?d=wde1d0406d7644135ac5cb3c942751296&csf=1&web=1&e=m0EqSh The NVIDIA GPU profiling tool for the GPU Roofline analysis], Xinxin Mei
 
* [https://jeffersonlab-my.sharepoint.com/:w:/r/personal/xmei_jlab_org/Documents/Roofline_model/Advisor%20and%20Likwid%20report.docx?d=wde1d0406d7644135ac5cb3c942751296&csf=1&web=1&e=Uuh3UJ Utilizing Advisor and Likwid for CPU Roofline analysis], Xinxin Mei
 
  
== Actionables ==
+
[[File:Phasm_flowchart.png]]
  
* Surrogate toolkit: Implement array inputs and test on a 2D PDE such as diffusion. '''Nathan'''
+
Rectangles denote programs or scripts, parallelograms denote data, diamonds denote decisions, and arrows denote data flow.  
 
 
* Vacuum tool: Automatically identify primitive inputs and outputs from the function signature. '''Nathan'''
 
 
 
* Tool for profiling the memory bounds on a GPU. '''David + Nathan'''
 
 
 
* Start thinking about a neural net model for a 2D PDE. '''Kishan + Nathan'''
 
 
 
* Improve the existing GlueX tracking models in preparation for Q3 and Q4 milestones '''Kishan'''
 
  
 +
=== Notes ===
 +
* [https://jeffersonlab-my.sharepoint.com/:w:/g/personal/xmei_jlab_org/EbW3wNREEVxFm12IvsRs0AEB2BFHyuUrvU5T7OZJtJlqaQ?e=jHHfT9 The NVIDIA GPU profiling tool for the GPU Roofline analysis], Xinxin Mei
 +
* [https://jeffersonlab-my.sharepoint.com/:w:/r/personal/xmei_jlab_org/Documents/Roofline_model/Advisor%20and%20Likwid%20report.docx?d=wde1d0406d7644135ac5cb3c942751296&csf=1&web=1&e=Uuh3UJ Utilizing Advisor and Likwid for CPU Roofline analysis], Xinxin Mei
 +
* [https://github.com/cissieAB/pinn-heat-equation/blob/main/docs/prof_res.md The Roofline analysis of a libtorch C++ GPU PINN-heat equation implementation], Xinxin Mei
  
 
== Useful Links ==
 

Latest revision as of 15:37, 16 September 2024

Running Legacy Code on Heterogeneous Hardware via Surrogate Models

Project Description

PHASM ("Parallel Hardware viA Surrogate Models") is a software toolkit, currently under development, for creating AI-based surrogate models of scientific code. AI-based surrogate models are widely used for creating fast and inverse simulations. The project anticipates an additional, future use case: adapting legacy code to modern hardware. Data centers are investing in heterogeneous hardware such as GPUs and FPGAs; meanwhile, many important codebases are unable to take advantage of this hardware's superior parallelism without undergoing a costly rewrite. An alternative is to train a neural net surrogate model to mimic the computationally intensive functions in the code, and deploy the surrogate on the exotic hardware instead. PHASM addresses three specific challenges: (1) systematically discovering which functions can be effectively replaced with a surrogate, (2) automatically identifying, for a given function, the true space of inputs and outputs including those not apparent from the type signature, and (3) integrating a machine learning model into a legacy codebase cleanly and with a high level of abstraction. In the first year of development, a proof of concept has been developed for each challenge. A surrogate API makes it easy to bring PyTorch models into the C++ ecosystem and uses profunctor optics to establish a two-way data binding between C++ datatypes and tensors. A model variable discovery tool performs a dynamic binary analysis using Intel PIN in order to identify a target function's model variable space, including types, shapes, and ranges, and generate the optics code necessary to bind the model to the function. Future work may include exploring the limits of surrogate models for functions of increasing size and complexity, and adaptively generating synthetic training data based on uncertainty estimates.
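The workflow described above (run the original function while capturing its inputs and outputs, train a model on the captured data, then answer calls from the model) can be sketched as follows. This is an illustration under stated assumptions, not PHASM's API: the names `Surrogate`, `call_and_capture`, `train`, `infer`, and `expensive_function` are hypothetical, and a piecewise-linear interpolation table stands in for the neural net so that the example needs only the Python standard library.

```python
import bisect
import math


def expensive_function(x: float) -> float:
    """Hypothetical stand-in for a costly legacy routine."""
    return math.sin(x) + 0.5 * x


class Surrogate:
    """Wraps a one-argument function: capture its calls, fit a model, then infer."""

    def __init__(self, original):
        self.original = original
        self.samples = []  # captured (input, output) pairs
        self.model = None  # sorted, de-duplicated samples used for interpolation

    def call_and_capture(self, x: float) -> float:
        """Run the real code and record what it computed."""
        y = self.original(x)
        self.samples.append((x, y))
        return y

    def train(self) -> None:
        """Fit the stand-in model. A real surrogate would train a neural net here."""
        table = sorted(dict(self.samples).items())  # drop repeated inputs
        if len(table) < 2:
            raise ValueError("need at least two distinct captured inputs")
        self.model = table

    def infer(self, x: float) -> float:
        """Approximate the original function without calling it."""
        if self.model is None:
            raise RuntimeError("call train() before infer()")
        xs = [point[0] for point in self.model]
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("input lies outside the captured range")
        # Pick the interval [x0, x1] that contains x and interpolate linearly.
        i = min(max(bisect.bisect_right(xs, x), 1), len(xs) - 1)
        (x0, y0), (x1, y1) = self.model[i - 1], self.model[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Capture phase: exercise the original code over the inputs of interest.
surrogate = Surrogate(expensive_function)
for step in range(31):
    surrogate.call_and_capture(step * 0.1)

# Training phase, then inference in place of the original call.
surrogate.train()
approx = surrogate.infer(1.234)
print("absolute error:", abs(approx - expensive_function(1.234)))
```

The sketch refuses to extrapolate beyond the captured range. That is a design choice made here for safety, and it reflects why identifying the true input space and its ranges matters before a surrogate is trusted.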

General Resources

Meetings

Thursday 2-3pm, bi-weekly. Teams.

Meeting ID: 281 724 015 543

Passcode: cWidgE

Documents

Proposals

Presentations

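{|
! Date
! Event
! Presenter
! Slides
|-
| 2022-06-27
| SRGS 2022
| Nathan Brei, David Lawrence
| [https://wiki.jlab.org/epsciwiki/images/e/ee/Phasm_Intro_Slides.pdf PDF]
|-
| 2022-08-24
| W&M collab
| Nathan Brei
| [https://wiki.jlab.org/epsciwiki/images/c/cb/Phasm_FY22.pdf PDF]
|-
| 2022-10-27
| ACAT 2022
| Nathan Brei
| [https://indico.cern.ch/event/1106990/contributions/4991290/attachments/2535525/4363650/ACAT_PHASM_talk.pdf PDF]
|-
| 2023-05-01
| CHEP 2023
| Xinxin Mei
| [https://github.com/cissieAB/pytorch-paradnn/blob/master/docs/CHEP_GPU4ML4NP_05022023.pdf poster]
|}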

Publications

{|
! Date
! Journal
! Title
|-
| March 2024
| [https://www.osti.gov/ US DOE OSTI] Technical Report
| [https://doi.org/10.2172/2331226 PHASM: A Toolkit for Creating AI Surrogate Models within Legacy Codebases]
|-
| March 2023
| ACAT 2022 proceedings (preprint)
| [[Media:ACAT PHASM paper.pdf|PHASM: A toolkit for creating AI surrogate models within legacy codebases]]
|}

Graphics

PHASM flowchart

[[File:Phasm_flowchart.png]]

Rectangles denote programs or scripts, parallelograms denote data, diamonds denote decisions, and arrows denote data flow.

Notes

Useful Links