Reproducible Research II: Practices and tools for managing computations and data

Ref. 41023

In this MOOC, we will show you how to improve your practices and your ability to manage and process large amounts of data and complex computations while keeping your software environment under control.

Duration: 4 months
Effort: 35 hours
Pace: ~8h45/month
Languages: English

What you will learn

At the end of this course, you will be able to:

Manage research data:
- understand the challenges posed by large volumes of data
- archive code and data on well-known archives such as Software Heritage and Zenodo
- integrate data into versioning (Git Annex)
- use structured binary data formats (FITS, HDF5)

Use tools and techniques for controlling the software environment:
- understand how software packages are built and managed
- deploy software environments as containers (e.g., Docker)
- manage software environments using a functional package manager (e.g., Guix)
- work in controlled software environments on a daily basis

Automate long or complex computations using workflows:
- understand the challenges of scaling up: long calculations, distributed calculations
- choose a workflow tool adapted to your needs
- automate a data analysis using make and snakemake
- control the software environments of a workflow

Description

Following the success of the MOOC "Reproducible research: methodological principles for transparent science", the authors continue on the same theme, dealing more specifically with the issues of massive data and the complex calculations associated with them. These two MOOCs complement each other and offer a coherent training program on the subject.

In this second MOOC, we will show you how to improve your practices for managing large data and complex computations in controlled software environments:

- you will learn how to use formats like JSON, FITS, and HDF5, platforms like Zenodo and Software Heritage, and tools like git-annex, Docker, Singularity, Guix, make, and snakemake;
- we will show you how to integrate them in a real-life use case: a sunspot detection study. You will see for yourself that our methods and tools allow you to work in a reliable and reproducible way.

The strength of this new MOOC lies in a general and systematic presentation of the major concepts and of how they translate into practical solutions through numerous hands-on sessions with state-of-the-art open-source tools.

Format

This MOOC consists of three independent modules that combine video lectures, quizzes, practical sessions, written course materials, and many exercises that give you hands-on experience with the tools and methods presented.

Most of the exercises can be carried out in a JupyterLab environment made available to each MOOC learner. Some exercises require a Linux computer on which you are able to install system software.

Prerequisites

This course is for everyone who relies on a computer to perform data analysis. You should have some experience with running commands in a terminal, and have a basic knowledge of git (at the level of the first MOOC) and scientific Python.

Assessment and certification

An Open Badge for successful completion of the course will be issued on request to learners who obtain an overall score of at least 50% across all the quizzes and learning activities. Assessment is based on quizzes and practical exercises.

Course plan

Module Introduction

Introduction with expert interviews
Getting started with JupyterLab and sunspot time series

Module Managing data

1.1 Archiving
1.2 File formats
1.3 Project Organization
1.4 Git Annex

Module Managing software

2.1 On the Importance of Software Environment
2.2 Package Management Principles
2.3 Isolation and Containers
2.4 Using Containers
2.5 Building and Sharing Containers
2.6 Functional Package Managers (Guix, ...)

Module Managing computations

3.1 Why do we need workflows?
3.2 From notebooks to shell scripts
3.3 Workflows with `make`
3.4 Workflows with `snakemake`
3.5 Workflows and environments

Course team

Arnaud Legrand

Arnaud Legrand is a CNRS researcher at the Laboratoire d’Informatique in Grenoble. His research focuses on evaluating the performance of large computing infrastructures. Both for performing experiments and for analyzing their outcomes, it is essential to capture the process rigorously.

Christophe Pouzat

Christophe Pouzat is a CNRS researcher at IRMA (Institute for Advanced Mathematical Research, University of Strasbourg). He is actually a neurophysiologist, working on the analysis of experimental data. Reproducible research enables him to communicate explicitly with experimentalists, avoiding many mistakes.

Konrad Hinsen

Konrad Hinsen is a CNRS researcher at the Centre de Biophysique Moléculaire in Orléans and at the Synchrotron SOLEIL in Saint Aubin. He explores the structure and dynamics of proteins by computational methods, which he tries to make reproducible.

Matthieu Simonin

Matthieu Simonin is a research engineer at the Inria Centre at Rennes University. He works closely with teams studying distributed systems and provides support for experimental campaigns that combine hardware and software constraints with data manipulation. Matthieu has recently joined labos1point5, a group of members from the French academic world, helping to develop tools for quantifying the carbon footprint of research activities, the calculations for which must, of course, be reproducible!

Ludovic Courtès

Ludovic Courtès is a research software engineer at Inria in Bordeaux, France. He contributes to Guix, a free software tool to deploy software environments in a reproducible fashion, with an eye towards making it a tool of choice for reproducible research.

Kim tâm Huynh

Kim tâm Huynh is currently a research engineer at Inria Paris. She supports research teams with methodologies and tools for software development.

License

License for the course content

Attribution-NonCommercial-ShareAlike

You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial — You may not use the material for commercial purposes.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

License for the content created by course participants

Attribution-NonCommercial-ShareAlike

You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial — You may not use the material for commercial purposes.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Organizations

Inria

With the support of Fonds national de la science ouverte
