The National Science Foundation (NSF) has made significant investments in major multi-user research facilities, which are the foundation for a robust data-intensive science program. Extracting scientific results from these facilities involves comparing “real” data collected from the experiments with “synthetic” data produced by computer simulations. There is wide and growing interest in using new machine learning (ML) and artificial intelligence (AI) techniques to improve the analysis of data from these facilities and to improve the efficiency of the simulations. The SCAILFIN project will use recently developed algorithms and computing technologies to bring cutting-edge data analysis techniques to such facilities, starting with data from the international Large Hadron Collider (LHC). One result of these advancements will be that research groups at smaller academic institutions will more easily be able to access the necessary computing resources, which are often only available at larger institutions. Removing access barriers to such resources democratizes them, which is key to developing a diverse workforce. This effort will also contribute to workforce development through alignment of high-energy physics data analysis tools with industry computing standards and by training students in high-value data science skills.
The main goal of the SCAILFIN project is to deploy artificial intelligence and likelihood-free inference (LFI) techniques and software on scalable cyberinfrastructure (CI) developed to integrate with existing CI elements, such as the REANA system. The analysis of LHC data is the project’s primary science driver, yet the technology is sufficiently generic to be widely applicable. The LHC experiments generate tens of petabytes of data annually; processing, analyzing, and sharing these data with thousands of physicists around the world is an enormous challenge. To translate the observed data into insights about fundamental physics, the important quantum mechanical processes and the detector’s response to them must be simulated to a high level of detail and accuracy. Investments in scalable CI that empower scientists to employ ML approaches to overcome the challenges inherent in data-intensive science, such as simulation-informed inference, will increase the discovery reach of these experiments. The development of the proposed scalable CI components will catalyze convergent research because 1) the abstract LFI problem formulation has already proven to be a “lingua franca” for a diverse range of scientific problems; 2) the current tools for many tasks lack the scalability needed for data-intensive problems with computationally intensive simulators; 3) the tools the project is developing are designed to be scalable and immediately deployable on a diverse set of computing resources; and 4) the integration of additional commonly used workflow languages to drive the optimization of ML components and to orchestrate large-scale workflows will lower the barrier to entry for researchers from other domains.
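To make the LFI idea concrete, the sketch below shows its simplest form, rejection-based approximate Bayesian computation: parameters are drawn from a prior, passed through a simulator, and kept only when the resulting synthetic data resemble the observed data. The Gaussian simulator, summary statistics, and tolerance here are illustrative assumptions for this toy example only; they are not part of the SCAILFIN software, which targets far more expensive detector simulations and ML-based approaches.

```python
# Minimal sketch of likelihood-free inference via rejection ABC.
# The Gaussian "simulator", summary statistics, and tolerance are
# illustrative stand-ins, not the SCAILFIN project's actual tools.
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n_events=1000):
    """Forward model: generate synthetic data for a parameter value theta."""
    return rng.normal(loc=theta, scale=1.0, size=n_events)

def summary(data):
    """Reduce a dataset to a low-dimensional summary statistic."""
    return np.array([data.mean(), data.std()])

# "Observed" data from an unknown true parameter (here 1.5, for illustration).
observed = rng.normal(loc=1.5, scale=1.0, size=1000)
obs_summary = summary(observed)

# Rejection ABC: draw parameters from the prior, simulate, and keep draws
# whose synthetic summaries fall within a tolerance of the observed summary.
tolerance = 0.1
accepted = []
for _ in range(20000):
    theta = rng.uniform(-5.0, 5.0)            # prior draw
    sim_summary = summary(simulator(theta))   # synthetic-data summary
    if np.linalg.norm(sim_summary - obs_summary) < tolerance:
        accepted.append(theta)

posterior_samples = np.array(accepted)
print(f"Accepted {posterior_samples.size} draws; "
      f"posterior mean approx. {posterior_samples.mean():.2f}")
```

Even in this toy form, most simulator calls are spent on rejected draws, which illustrates why scalable CI and ML components that make better use of expensive, computationally intensive simulators are central to the project.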
This project is supported by the NSF Office of Advanced Cyberinfrastructure in the Directorate for Computer and Information Science and Engineering.