Ensemble Runs on Supercomputers: Towards Energy Efficiency

Inria
November 30, 2022

2022-05412 - Ensemble Runs on Supercomputers: Towards Energy Efficiency

Contract type : Internship agreement

Level of qualifications required : Graduate degree or equivalent

Function : Internship Research

About the research centre or Inria department

The Inria Grenoble - Rhône-Alpes research center groups together almost 600 people in 22 research teams and 7 research support departments.

Staff are present on three campuses in Grenoble, working in close collaboration with other research and higher education institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, …) as well as with key economic players in the area.

Inria Grenoble - Rhône-Alpes is active in the fields of high-performance computing, verification and embedded systems, modeling of the environment at multiple levels, and data science and artificial intelligence. The center is a top-level scientific institute with an extensive network of international collaborations in Europe and the rest of the world.

Main activities

Ensemble Runs on Supercomputers: Towards Energy Efficiency
  • Level: Master Level Research Internship (M2) or equivalent (stage fin étude ingénieur)
  • Where: UGA campus, Grenoble
  • When: Flexible, sometime in 2022-2023, 4 months minimum
  • Financial support: about 500 euros/month (French internship stipend, "gratification de stage")
  • Team: Datamove
  • Advisers: Bruno Raffin (Bruno.Raffin@inria.fr) and Olivier Richard (Olivier.Richard@inria.fr)
  • Context

    Numerical simulations are nowadays commonly used to model complex phenomena or systems in fields such as physics, chemistry, biology or industrial engineering. Some of these simulations require supercomputers to run high-resolution models. A numerical simulation takes a set of input parameters and, through an often complex internal model, produces outputs that can be very large.

    Very large scale supercomputers have the capacity to support the execution of many instances of these numerical simulations, usually called an ensemble run or parameter sweep. Such a large sample of executions serves many purposes, including sensitivity analysis, deep reinforcement learning, deep surrogate training, data assimilation and simulation-based inference.
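
    As a concrete illustration of what a parameter sweep means in practice, the short sketch below draws a few random parameter sets and turns each one into a solver command line. The parameter names and the solver executable are invented for the example; a real ensemble would contain thousands of members and go through the batch scheduler rather than printing commands.

      import random

      def sample_parameters(n_runs, bounds):
          """Draw one random parameter set per simulation instance."""
          return [
              {name: random.uniform(low, high) for name, (low, high) in bounds.items()}
              for _ in range(n_runs)
          ]

      # Hypothetical input parameters and their ranges.
      bounds = {"viscosity": (1e-4, 1e-2), "inflow_velocity": (0.5, 2.0)}
      for i, params in enumerate(sample_parameters(n_runs=5, bounds=bounds)):
          args = " ".join(f"--{k}={v:.6g}" for k, v in params.items())
          print(f"instance {i}: ./solver {args}")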

    Our team developed an original solution for running very large ensembles on supercomputers and processing the data on-the-fly. The framework, called Melissa, is open source and has been used for sensitivity analysis, data assimilation and the training of deep neural networks for physics (see the references below). So far, the largest Melissa ensemble runs have handled 80 000 simulations and processed 278 TB of data on-line, using up to 27 000 compute cores.
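
    The core idea behind on-the-fly processing is to update the quantities of interest incrementally as each simulation result arrives, so the raw outputs never need to be stored. The sketch below illustrates the principle with a classical running mean and variance (Welford-style update) over a simulated output field; it only illustrates the idea and is not Melissa's implementation.

      import numpy as np

      class RunningStats:
          """Running mean and variance of a field, updated one simulation result at a time."""

          def __init__(self, field_size):
              self.count = 0
              self.mean = np.zeros(field_size)
              self.m2 = np.zeros(field_size)   # sum of squared deviations from the mean

          def update(self, field):
              """Welford's update with one new simulation result (per grid point)."""
              self.count += 1
              delta = field - self.mean
              self.mean += delta / self.count
              self.m2 += delta * (field - self.mean)

          def variance(self):
              return self.m2 / max(self.count - 1, 1)

      stats = RunningStats(field_size=10_000)
      for _ in range(100):                      # stand-in for results streamed by the ensemble
          stats.update(np.random.rand(10_000))  # each "field" would come from one simulation
      print(stats.mean[:3], stats.variance()[:3])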

    The objective of this internship is to study and develop strategies to better control the execution of the ensemble, with the goal of reducing execution time and power consumption. Melissa relies on an orchestrator, called the launcher, to control the execution of the different simulation instances. The launcher has several degrees of freedom to tune each simulation (how many CPUs and/or GPUs to use). It can also kill a running simulation and restart it with a different configuration if needed, or increase or decrease the number of simulations running concurrently. The target execution platform is a supercomputer that, like the cloud, is shared between different applications, so the execution environment is uncertain: the availability of resources (CPUs, GPUs, file system and network load) changes over time. The batch scheduler is the service on a supercomputer in charge of deciding when and where (on which nodes) each application runs. Melissa already interacts with the batch scheduler to request simulation executions, but so far with simple strategies. We would like to extend Melissa to monitor the supercomputer environment and adjust the ensemble execution in order to optimize its objectives (duration, energy). A simplified sketch of such a control loop is given below.
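
    The sketch below shows the general shape of such an adaptive control loop, written here against OAR's command-line interface (oarsub to submit, oarstat to query job states). The script name, resource request, thresholds and the load test are placeholders, and this is not Melissa's actual launcher code; it only illustrates the kind of decisions the launcher could take.

      import subprocess, time

      MAX_CONCURRENT = 64          # illustrative cap on concurrently running members

      def submit_member(member_id, cores):
          """Submit one ensemble member through OAR (resource request kept minimal)."""
          out = subprocess.run(
              ["oarsub", "-l", f"core={cores}", f"./run_member.sh {member_id}"],
              capture_output=True, text=True, check=True).stdout
          # oarsub prints a line of the form "OAR_JOB_ID=1234".
          return next(line.split("=")[1] for line in out.splitlines()
                      if line.startswith("OAR_JOB_ID"))

      def job_finished(job_id):
          """Ask OAR for the job state ('oarstat -s -j' prints 'jobid: State')."""
          out = subprocess.run(["oarstat", "-s", "-j", job_id],
                               capture_output=True, text=True).stdout
          return any(state in out for state in ("Terminated", "Error"))

      def platform_is_loaded():
          """Placeholder for monitoring (queue length, energy budget, I/O load...)."""
          return False

      pending = list(range(1000))  # ensemble members still to submit
      running = {}                 # member_id -> OAR job id

      while pending or running:
          # Reap finished members.
          for member, job in list(running.items()):
              if job_finished(job):
                  del running[member]
          # Throttle the ensemble when the platform is busy, otherwise fill it up.
          target = MAX_CONCURRENT // 2 if platform_is_loaded() else MAX_CONCURRENT
          while pending and len(running) < target:
              member = pending.pop()
              running[member] = submit_member(member, cores=4)
          # A real launcher would also kill and resubmit members with a different
          # configuration (more or fewer cores, GPUs...) when it pays off.
          time.sleep(30)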

    Work

    To be able to control the execution environment, reproduce experiments and run them on demand, the work will rely on the following tools and steps:

  • A training period to master the different concepts and tools:
  • This includes Melissa and the Grid'5000 testbed, which will be used for the executions.
  • NixOS Compose is a tool for deploying software and services to a computing platform. It will be used to deploy a self-contained "supercomputer" with Melissa, the OAR batch scheduler and a controllable external load. We already have a setup ready; you will need to learn to control it.
  • Batsim is a batch scheduler simulator. Deploying a whole Melissa cluster with NixOS Compose gives good confidence in the feasibility of the scenarios, but it requires a lot of resources and time to gather results. Batsim will therefore be used to simulate the strategies and find the best scenarios. You will learn how Batsim works and run simple simulations to understand its behaviour.
  • Make a simple model that captures the essential characteristics of the environment (the Melissa application and the supercomputer), start to elaborate some strategies, and evaluate their potential benefits with Batsim and/or NixOS Compose (a toy version of such a model is sketched after this list).
  • Experiment with a promising strategy, analyse the results, revisit and improve the strategy, and repeat.
  • If required, complementary tools are available to extend the experimentation context (runs on larger production supercomputers, or simulation runs in the Batsim environment).
  • The work will be carried out in close collaboration with DataMove members, including engineers, PhD students and postdocs working on these topics. We are the developers of Melissa, the OAR batch scheduler and the NixOS Compose deployment recipes (see the publications in the reference section), which gives us significant freedom to try different strategies.
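
    As a starting point for the simple model mentioned above, the toy sketch below estimates the makespan and energy of one static execution strategy (a fixed number of cores per member and a cap on concurrency) under a crude Amdahl-style speedup law and a linear power model. Every constant and model form is an assumption to be replaced by measurements on Grid'5000 or by Batsim simulations.

      def member_runtime(cores, t_serial=3600.0, parallel_fraction=0.95):
          """Amdahl-style runtime (seconds) of one simulation member on `cores` cores."""
          return t_serial * ((1.0 - parallel_fraction) + parallel_fraction / cores)

      def evaluate(strategy, n_members=1000, total_cores=2048,
                   p_idle=50.0, p_per_core=10.0):
          """Estimate makespan (hours) and energy (kWh) of one static strategy."""
          cores = strategy["cores_per_member"]
          concurrent = min(strategy["max_concurrent"], total_cores // cores)
          waves = -(-n_members // concurrent)         # ceiling division
          runtime = member_runtime(cores)             # seconds per member
          makespan = waves * runtime                  # ignores scheduling gaps
          power = p_idle + p_per_core * cores         # watts drawn by one running member
          energy = n_members * runtime * power        # joules over the whole ensemble
          return makespan / 3600.0, energy / 3.6e6

      for strategy in ({"cores_per_member": 4, "max_concurrent": 512},
                       {"cores_per_member": 16, "max_concurrent": 128},
                       {"cores_per_member": 64, "max_concurrent": 32}):
          hours, kwh = evaluate(strategy)
          print(f"{strategy}: makespan ~ {hours:.1f} h, energy ~ {kwh:.0f} kWh")

    Even such a crude model already exposes the trade-off the internship targets: giving each member more cores shortens the makespan but, past the point where the speedup flattens, wastes energy.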

    This work is part of the REGALE European project. For instance, we will use the EAR tool developed by our partners at the Barcelona Supercomputing Center to monitor the energy usage of the simulations. The candidate will have the opportunity to interact with the other REGALE teams.
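
    For illustration, a launcher-side helper for retrieving a job's energy report could be as simple as the sketch below, which wraps EAR's accounting command eacct. Only the -j <jobid> option is assumed here, and the report is returned as raw text since its exact format depends on the EAR installation; this is a hypothetical helper, not part of Melissa or EAR.

      import subprocess

      def job_energy_report(job_id):
          """Return eacct's report for one job id, as raw text (format is site-dependent)."""
          result = subprocess.run(["eacct", "-j", str(job_id)],
                                  capture_output=True, text=True, check=True)
          return result.stdout

      # The job id below is purely illustrative.
      print(job_energy_report(1234))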

    Our team regularly receives grants for funding PhD and engineer positions. This internship is a good way to join DataMove and, if the fit is good, to stay on for a PhD or an engineering position.

    What you will learn during this internship
  • Scientific writing and teamwork.
  • Gaining expertise with parallel machines.
  • Contributing to open source code development.
  • Conducting experiments, analysing the results and being critical about the findings.
  • Being creative in finding the right solution when facing a problem.
  • Location

    The internship will take place in the DataMove team, located in the IMAG building on the Saint-Martin-d'Hères campus (Univ. Grenoble Alpes) near Grenoble. The DataMove team is a friendly and stimulating environment gathering professors, researchers, PhD and Master's students, all conducting research on high-performance computing.

    The city of Grenoble is a student-friendly city surrounded by the Alps, offering a high quality of life and all kinds of mountain-related outdoor activities.

    References
  • The Challenges of In Situ Analysis for Multiple Simulations. https://hal.inria.fr/hal-02968789
  • Melissa:
  • https://gitlab.inria.fr/melissa
  • An elastic framework for ensemble-based large-scale data assimilation. https://hal.inria.fr/hal-03017033v3
  • Melissa: Large Scale In Transit Sensitivity Analysis Avoiding Intermediate Files. https://hal.inria.fr/hal-01607479
  • NixOS Compose:
  • Painless Transposition of Reproducible Distributed Environments with NixOS Compose. https://www.archives-ouvertes.fr/hal-03723771/
  • https://github.com/oar-team/nur-kapack
  • OAR: https://oar.imag.fr/
  • Batsim: https://gitlab.inria.fr/batsim/batsim
  • EAR: https://gitlab.bsc.es/earteam/ear/-/tree/master
  • Benefits package
  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leaves
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Remuneration

    Grant of 3.90 € per hour

    General Information
  • Theme/Domain : Distributed and High Performance Computing / Scientific computing (BAP E)

  • Town/city : Montbonnot

  • Inria Center : CRI Grenoble - Rhône-Alpes
  • Starting date : 2023-02-01
  • Duration of contract : 6 months
  • Deadline to apply : 2022-11-30
  • Contacts
  • Inria Team : DATAMOVE
  • Recruiter : Raffin Bruno / bruno.raffin@inria.fr
  • About Inria

    Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.

    Instruction to apply

    Defence Security : This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.

    Recruitment Policy : As part of its diversity policy, all Inria positions are accessible to people with disabilities.

    Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
