Minutes from the 1 Dec 2022 meeting

From epsciwiki

Administrative

  • Created a new Trello board for PHASM FY23: https://trello.com/b/lLlXkTuU/phasm-fy23
  • Using better discipline regarding PRs and issues
  • Brad Sawatzky urgently needs a heterogeneous hardware expert (who need not have attended the workshop) to do a write-up on the Future Trends workshop, so I volunteered Cissy
  • Ongoing discussion with Makoto about getting started with Geant/PHASM integration
  • Never made a decision about Colin and Will

Recent accomplishments

  • Nathan
    • PR #10: Plugin loader work
      • Brought JPluginLoader into PHASM codebase
      • Created a phasm::Plugin abstraction
      • Integrated the plugin loader with SurrogateBuilder
      • phasm::SurrogateBuilder delegates model creation to Plugin
  • Cissy
    • PR #7: Add a working TorchScript example
    • PR #3: PINN PDE example executes on both CPU and farm A100/T4/TitanRTX GPU

Ongoing

  • Nathan
    • Issue #9: Finish up plugin milestone by . Hope to finish milestone this week
    • Issue #8: Need to deeply revamp the PHASM Docker containers
      • Add ALL of the dependencies right away on a lower layer: PIN, llvm, ROSE, Geant4, CUDA, halld-recon
      • Maintain a CUDA and a non-CUDA version of the top layer
  • Cissie

Project milestones

FY23 Q1

  • Rearchitect the surrogate API, moving the PyTorch dependency onto a dynamically loaded backend plugin
    • Well underway. PHASM plugin interface created and tested. Working on having the plugin handle model creation.
    • Remaining technical difficulty: Different plugins will want different formats for their tensors. Do we make the tensor type virtual, or store tensors as arrays of primitives and then have the model convert them?
    • Making the tensor virtual is awkward because the underlying tensor type needs to match the underlying model type.
    • Having two levels of conversion is really bad for performance.
    • Bonus question: the tensor needs to support offloading. Can this stay on one side of the abstraction, or not?
    • One important question is whether we can compile plugins using a different C++ language standard (same compiler though) and not get horrifying runtime breakages. Compiler docs are very coy on this topic. The good news is that there is practically no shared code on either side of the plugin interface, so the real constraint is the standard library.
  • Attach PHASM to Geant4 and collect training data; formulate a strategy for model design
    • Need to communicate more with Makoto to find the exact point in the code (this is more challenging than we would have hoped)
    • Plan B: Makoto pointed us to an inefficient, proof-of-concept Geant4 CUDA kernel last year. The point in the code we are looking for is probably that one.
    • ACAT plenary talk included a lot of good advice on generative models for calo sims.
    • Need to close the loop with Malachi and find a DS person to help work on this.
    • Need to compile the Geant4 source and try attaching PHASM. (Note: this depends on the plugin architecture milestone.)
  • Generate on-CPU and off-CPU flame graphs; filter the flame graph data using Amdahl's and Gustafson's laws
  • Since the PHASM PINN example looks like it has a bunch of weird problems, and I have no idea about the performance implications of the tracking example, we may want to jump straight to the Geant4 example.
  • We should be able to re-create Makoto's analysis using our own profiler data. This should also give us an early bound on what sort of performance improvement we can expect from this work.
  • Get a simple static analysis code compiling and running on both ROSE and LLVM
    • The main thing we need right away is better Docker containers so that we can get all these dependencies sorted out.
    • Figuring out how to use ROSE and LLVM is tricky but fun. Good thing for the holiday 'downtime'.
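
For the flame-graph milestone in this list, one way to filter candidates is to turn each frame's share of the runtime into a bound on whole-program speedup. The following is a minimal sketch, assuming the flame graph has already been reduced to (name, fraction of runtime) pairs. `Hotspot`, `amdahl_speedup`, `gustafson_speedup` and `filter_hotspots` are hypothetical names for illustration, not part of PHASM.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record: one flame-graph frame and the fraction of total
// runtime (0..1) spent in it, including its callees.
struct Hotspot {
    std::string name;
    double fraction;
};

// Amdahl's law: overall speedup for a fixed-size problem when a fraction p
// of the runtime is accelerated by a factor s.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

// Gustafson's law: scaled speedup when the accelerated fraction p of the
// workload grows by a factor s while the remaining part stays fixed.
double gustafson_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return (1.0 - p) + p * s;
}

// Keep only the frames whose Amdahl bound reaches min_speedup, assuming a
// surrogate that runs s times faster than the code it replaces.
std::vector<Hotspot> filter_hotspots(const std::vector<Hotspot>& frames,
                                     double s, double min_speedup) {
    std::vector<Hotspot> kept;
    for (const auto& f : frames) {
        if (amdahl_speedup(f.fraction, s) >= min_speedup) kept.push_back(f);
    }
    return kept;
}
```

With p = 0.9 and s = 10 the Amdahl bound is roughly 5.3x, whereas a frame at p = 0.1 can never yield more than about 1.1x no matter how fast the surrogate is, so it would be filtered out.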

Revisiting FY22 Q4

  • Integrate charged particle tracking model with surrogate library.
    • Good news: Tracking model problem was ported to PyTorch. Don't think it ever ran inside PHASM. Also, Cissie apparently finally got TorchScript integration working, so it looks like the model can run inside PHASM. We should immediately try to run this using canned data (don't use real data yet, see bullet 3)
    • Bad news: Never figured out how to compile PHASM+halld_recon+CUDA together. Due to the Lovecraftian horror of compilation constraints that the GPU hardware and CUDA version put on the compiler and language versions, not entirely sure this is possible. Better approach is to migrate to a plugin architecture. This is one of the FY23 Q1 milestones anyhow, so as we complete the FY23 Q1 milestone, we should loop back and demonstrate using it with the GlueX codebase.
    • Bad news: I looked into the question of how to convert JFactories into tensors in a sane way, but ran into the problem of handling arrays of unknown length. Better to hack this problem for now, then circle back and make sure that our profunctor solution is a good match for the problem in practice.
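
As an illustration of the kind of stopgap that "hack this problem for now" could mean, one common workaround is to pad each variable-length array to a fixed capacity and carry the true length alongside it. The sketch below assumes that approach purely for illustration. It is not the profunctor solution mentioned above, and `PaddedTensor` and `pad_to_fixed` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-shape buffer plus the number of entries that hold real data.
struct PaddedTensor {
    std::vector<double> values;  // always `capacity` entries long
    std::size_t valid_count;     // how many entries came from the input
};

// Copy up to `capacity` values from the input and fill the rest with
// `pad_value`. Inputs longer than `capacity` are truncated.
PaddedTensor pad_to_fixed(const std::vector<double>& input,
                          std::size_t capacity, double pad_value = 0.0) {
    assert(capacity > 0);
    PaddedTensor out{std::vector<double>(capacity, pad_value),
                     std::min(input.size(), capacity)};
    std::copy_n(input.begin(), out.valid_count, out.values.begin());
    return out;
}
```

The model then sees a constant input shape, at the cost of picking a capacity up front and silently truncating anything longer, which is exactly why this would need revisiting later.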

Small tasks (~1 PR in size)

  • Tensors are made abstract. Plugins provide a tensor factory that provides a plugin-specific tensor implementation (see: Bridge pattern). SurrogateBuilder uses the plugin's tensor factory instead of the abstract tensor's base class constructors (Nathan)
  • #8: Rejigger docker container in order to fix dependency hell (Nathan)
  • Tensor abstraction supports GPU offloading (Cissie+Nathan)
  • Rerun PHASM inside halld_recon once pluginization is complete
  • Revisit approach for handling variable-length tensors in the context of halld_recon
  • Docker container with up-to-date CUDA support (Cissie)
  • Compile Geant4 with PHASM integration, put in Docker container immediately (depends on pluginized backend milestone) (Nathan)
  • Run profiler on Geant4 and generate flame graphs. Estimate expected performance as a function of model/batch size. (Cissy)
  • Work through LLVM static analysis tutorial (requires a reasonable Docker container) (Nathan)
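
The first task in this list (abstract tensors created through a plugin-provided factory) could look roughly like the following. This is only a sketch: every class and function name here (`Tensor`, `TensorFactory`, `ArrayTensor`, `ArrayTensorFactory`, `make_input_tensor`) is hypothetical and does not reflect the actual PHASM interfaces, and the in-memory `ArrayTensor` merely stands in for a backend-specific type such as a PyTorch tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Abstract tensor: the only view of tensor data that the core library sees.
class Tensor {
public:
    virtual ~Tensor() = default;
    virtual std::size_t size() const = 0;
    virtual double get(std::size_t i) const = 0;
    virtual void set(std::size_t i, double value) = 0;
};

// Each plugin supplies a factory, so the core never names a concrete type.
class TensorFactory {
public:
    virtual ~TensorFactory() = default;
    virtual std::unique_ptr<Tensor> make(std::size_t length) const = 0;
};

// Stand-in backend: a tensor stored as a plain array of doubles.
class ArrayTensor : public Tensor {
    std::vector<double> data_;
public:
    explicit ArrayTensor(std::size_t length) : data_(length, 0.0) {}
    std::size_t size() const override { return data_.size(); }
    double get(std::size_t i) const override {
        assert(i < data_.size());
        return data_[i];
    }
    void set(std::size_t i, double value) override {
        assert(i < data_.size());
        data_[i] = value;
    }
};

// The factory that the stand-in plugin would hand to the core library.
class ArrayTensorFactory : public TensorFactory {
public:
    std::unique_ptr<Tensor> make(std::size_t length) const override {
        return std::make_unique<ArrayTensor>(length);
    }
};

// Core-side helper: builds an input tensor through the plugin's factory
// instead of calling any concrete tensor constructor.
std::unique_ptr<Tensor> make_input_tensor(const TensorFactory& factory,
                                          const std::vector<double>& values) {
    std::unique_ptr<Tensor> t = factory.make(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) t->set(i, values[i]);
    return t;
}
```

The sketch deliberately ignores the performance and offloading questions raised under the FY23 Q1 milestone. Per-element virtual calls would be slow, so a real interface would more likely expose bulk copies or raw buffers.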

Next steps

David

Cissy

  • Add support for GPU offloading to phasm::tensor abstraction
  • Create Docker/Singularity container with up-to-date CUDA integration

Nathan

  • Tensors are made abstract. Plugins provide a tensor factory that provides a plugin-specific tensor implementation (see: Bridge pattern). SurrogateBuilder uses the plugin's tensor factory instead of the abstract tensor's base class constructors
  • #8: Rejigger docker container in order to fix dependency hell