Post-Doctoral Research Visit F/M Coordinating Scheduling and Control for Resource Harvesting in High-performance Computing

Inria

France

October 20, 2022

Description

2022-05377 - Post-Doctoral Research Visit F/M Coordinating Scheduling and Control for Resource Harvesting in High-performance Computing

Contract type : Fixed-term contract

Level of qualifications required : PhD or equivalent

Function : Post-Doctoral Research Visit

About the research centre or Inria department

The Inria Grenoble - Rhône-Alpes research center groups together almost 600 people in 22 research teams and 7 research support departments.

Staff is present on three campuses in Grenoble, in close collaboration with other research and higher education institutions (Université Grenoble Alpes, CNRS, CEA, INRAE, …), but also with key economic players in the area.

Inria Grenoble - Rhône-Alpes is active in the fields of high-performance computing, verification and embedded systems, modeling of the environment at multiple levels, and data science and artificial intelligence. The center is a top-level scientific institute with an extensive network of international collaborations in Europe and the rest of the world.

Context

This work is a cooperation between the Datamove and Ctrl-A teams (Raphaël Bleuse & Éric Rutten).

Large-scale computing infrastructures process ever larger amounts of data and solve problems requiring ever more computing power. The behavior of such infrastructures has become more variable and more difficult to model and predict, especially with respect to power consumption and application performance. As a result, the management (i.e., configuration) of these infrastructures needs to adapt dynamically to variations, and to be automated. This automatic management can be done in a feedback loop, by periodically monitoring the state of the system and updating the configuration to activate relevant mechanisms. Such feedback loops are the object of autonomic computing [2], and one approach to the design of autonomic managers involves the use of Control Theory, which is widespread in various fields of engineering but still recent in computer science [3].

In this project we consider more particularly the use case of CiGri [1], a lightweight, scalable and fault-tolerant grid system that exploits the unused resources of a set of computing clusters. Idle resources are harvested by submitting jobs from Bag-of-Tasks applications to the clusters as best-effort jobs, with the lowest priority, in order to maximize cluster use.

Such best-effort jobs can be killed when premium computations need resources. However, the allocation of resources to best-effort jobs introduces disturbances for the premium users of the cluster: increased startup delays due to the harvesting of resources, longer scheduling decision delays, degraded performance of the distributed file system, etc. Therefore, the submission must be regulated in a feedback loop with a Control Theory approach [4] in order to minimize these perturbations.
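As an illustration of such a regulation loop, here is a minimal sketch of a discrete proportional-integral (PI) controller that bounds the number of best-effort jobs submitted at each sampling period. It is not the controller used in CiGri: the monitored metric (here, the number of waiting jobs), the gains, and the names PIController and jobs_to_submit are assumptions made for the example, and practical concerns such as anti-windup are left out.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete proportional-integral controller (illustrative gains)."""
    kp: float                # proportional gain (hypothetical value)
    ki: float                # integral gain (hypothetical value)
    setpoint: float          # target value of the monitored metric
    integral: float = 0.0    # accumulated error over past periods

    def step(self, measurement: float) -> float:
        """Return the control action for one sampling period."""
        error = self.setpoint - measurement
        self.integral += error
        return self.kp * error + self.ki * self.integral


def jobs_to_submit(controller: PIController, waiting_jobs: int, max_batch: int) -> int:
    """Map the controller output to a bounded, non-negative number of jobs."""
    action = controller.step(float(waiting_jobs))
    return max(0, min(max_batch, round(action)))


if __name__ == "__main__":
    controller = PIController(kp=0.5, ki=0.1, setpoint=10.0)
    # Hypothetical sequence of measured waiting-queue lengths.
    for waiting in (0, 4, 12, 20):
        print(waiting, jobs_to_submit(controller, waiting, max_batch=100))
```

Each period, the loop measures the metric, computes the error with respect to the target, and submits more jobs when the cluster is under the target and none when it is over.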

Until now, we have been considering the OAR RJMS (resources and jobs management system) of the clusters as a black box.

Assignment

The challenge of this project is to deepen the interaction between the RJMS and the controller, going beyond implicit interferences by coordinating them explicitly: this involves opening the RJMS as a grey box, and enriching the controller with the newly available information. The integration and coordination of multiple management mechanisms is an important topic in Software Engineering for Autonomic Computing. It is important to keep the separation of concerns between different management problems and skills, while making it possible for them to cooperate in order to reach improved performance. In our context, it opens the way to a study of the relationships between Scheduling and Control for HPC systems.

The work proposed in this post-doctoral position is to study, within the framework of CiGri [1], how to integrate information from the RJMS (e.g., the scheduler) into the feedback loop. The RJMS is a key software component in charge of allocating resources to users' jobs. It involves the scheduling component OAR, which holds information about premium users' tasks. This information can help the controller act more predictively, according to planned variations of the resources to be harvested.

Main activities

More precisely, a first approach is to consider the job allocation plan computed by OAR (usually depicted as a Gantt diagram), where information about tasks and their resource usage over time is available. The coordination will be supported by an interface, to be developed, that communicates to the controller the predicted variations in load (upwards: premium jobs taking additional resources; downwards: jobs releasing resources). The controller will be designed to use this predictive information to regulate the submission of best-effort jobs so as to, e.g., minimize the useless computation caused by best-effort jobs that would be killed prematurely by an upward variation.
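To make the idea concrete, the following sketch derives upward and downward load variations from a simplified allocation plan, and computes the instant at which a best-effort job needing a given number of resources would no longer fit. The PlannedJob structure and both functions are hypothetical and do not reflect OAR's actual data model or API; times are plain integers in seconds.

```python
from typing import NamedTuple, Optional


class PlannedJob(NamedTuple):
    start: int      # planned start time (seconds)
    end: int        # planned end time (seconds)
    resources: int  # number of resources reserved by the premium job


def load_variations(plan: list[PlannedJob]) -> list[tuple[int, int]]:
    """Return (time, delta) events, merged per instant and sorted by time.

    A positive delta is an upward variation (premium jobs take resources),
    a negative delta a downward variation (resources are released).
    """
    deltas: dict[int, int] = {}
    for job in plan:
        deltas[job.start] = deltas.get(job.start, 0) + job.resources
        deltas[job.end] = deltas.get(job.end, 0) - job.resources
    return sorted((t, d) for t, d in deltas.items() if d != 0)


def idle_horizon(plan: list[PlannedJob], total: int, now: int, needed: int) -> Optional[int]:
    """Return the first time >= now at which fewer than `needed` resources
    are idle, or None if this never happens within the plan."""
    used = sum(j.resources for j in plan if j.start <= now < j.end)
    if total - used < needed:
        return now
    for time, delta in load_variations(plan):
        if time <= now:
            continue  # already accounted for in `used`
        used += delta
        if total - used < needed:
            return time
    return None


if __name__ == "__main__":
    plan = [PlannedJob(0, 100, 4), PlannedJob(50, 150, 4), PlannedJob(100, 200, 2)]
    print(load_variations(plan))
    print(idle_horizon(plan, total=10, now=10, needed=4))
```

A controller could compare the returned horizon with the expected duration of a best-effort job and refrain from submitting jobs that are unlikely to finish before the next upward variation.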

This simple reactive approach could be enriched to leverage the knowledge of past jobs. For example, one could weight the controller's reaction based on some life expectancy metric. This opens up the door to stochastic approaches.
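One possible reading of such a life expectancy metric, given purely as an assumption for illustration, is the empirical probability that a best-effort job survives at least its expected duration, estimated from the lifetimes of past jobs. The sketch below scales a controller's proposed submission count by that probability; the function names and the fallback used when no history exists are hypothetical choices.

```python
def survival_probability(past_lifetimes: list[float], duration: float) -> float:
    """Empirical fraction of past best-effort jobs that lived at least `duration`."""
    if not past_lifetimes:
        return 1.0  # no history yet: apply no penalty (arbitrary choice)
    survived = sum(1 for lifetime in past_lifetimes if lifetime >= duration)
    return survived / len(past_lifetimes)


def weighted_submission(base_jobs: int, past_lifetimes: list[float], duration: float) -> int:
    """Scale the controller's proposed job count by the survival probability."""
    return int(base_jobs * survival_probability(past_lifetimes, duration))


if __name__ == "__main__":
    history = [10.0, 20.0, 30.0, 40.0]  # hypothetical observed lifetimes (seconds)
    print(weighted_submission(8, history, duration=25.0))
```

This estimator ignores the difference between jobs that were killed and jobs that completed normally; a real study would likely need to account for that.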

The work is performed in a multidisciplinary cooperation with researchers from both the control engineering and computer science fields. On the Control Theory aspects, we are cooperating with Gipsa-lab in Grenoble and the Spirals team at Cristal/Inria in Lille. The objective of this post-doctoral position is to contribute more on the side of HPC, interfacing with OAR, and experimental validation. The latter will follow a reproducible approach, building upon tools developed in the Datamove team.

Steps will feature:

- study of existing techniques related to autonomic computing;
- appropriation of the CiGri and OAR environments, and the recent developments;
- interaction with colleagues designing the control feedback loop exploiting information from the cluster RJMS;
- integration of this controller with the RJMS: interfacing components, extracting useful information;
- experimental evaluation (performance analysis) of the proposed strategy.

References

[1] Bruno Bzeznik and Ghislain Charrier. CiGri. lic: GPL-3.0-or-later. url: https://cigri.imag.fr/, vcs: https://github.com/oar-team/cigri.

[2] Jeffrey O. Kephart and David M. Chess. The Vision of Autonomic Computing. In: IEEE Computer 36.1 (Jan. 2003), pp. 41-50. doi: 10.1109/MC.2003.1160055.

[3] Eric Rutten, Nicolas Marchand, Daniel Simon. Feedback Control as MAPE-K loop in Autonomic Computing. Software Engineering for Self-Adaptive Systems III. Assurances, Lecture Notes in Computer Science 9640, Springer, pp. 349-373, 2018. ⟨10.1007/978-3-319-74183-3_12⟩. ⟨hal-01285014⟩

[4] Quentin Guilloteau, Olivier Richard, Bogdan Robu, Eric Rutten. Controlling the Injection of Best-Effort Tasks to Harvest Idle Computing Grid Resources. ICSTCC 2021 - 25th International Conference on System Theory, Control and Computing, Oct 2021, Iași, Romania. pp. 1-6. ⟨10.1109/ICSTCC52150.2021.9607292⟩. ⟨hal-03363709⟩

Skills

The candidate is expected to hold a PhD (or be close to completing one) in Computer Science, in the domain of High-Performance Computing, with an interest in experimental validation and in multidisciplinary work.

Benefits package
  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

Remuneration

Gross salary: 2,746 euros per month before income taxes.

General Information
  • Theme/Domain : Distributed and High Performance Computing Software Experimental platforms (BAP E)

  • Town/city : Grenoble

  • Inria Center : CRI Grenoble - Rhône-Alpes
  • Starting date : 2022-11-01
  • Duration of contract : 12 months
  • Deadline to apply : 2022-10-20
  • Contacts
  • Inria Team : DATAMOVE
  • Recruiter : Richard Olivier / olivier.richard@inria.fr

    About Inria

    Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.

    Instructions to apply

    Defence Security : This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter such an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.

    Recruitment Policy : As part of its diversity policy, all Inria positions are accessible to people with disabilities.

    Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
