<!doctype html>
|
||
<html lang="en">
|
||
<head>
|
||
<title>Computational reproducibility in practice</title>
|
||
<meta name="description" content="">
|
||
<meta name="author" content="Michael Hanke">
|
||
|
||
<meta charset="utf-8">
|
||
<meta name="apple-mobile-web-app-capable" content="yes" />
|
||
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
|
||
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
|
||
<link rel="stylesheet" href="common/css/main.css" id="theme">
|
||
<link rel="stylesheet" href="index.css">
|
||
<script src="common/js/printpdf.js"></script>
|
||
</head>
|
||
<body>
|
||
|
||
<div class="reveal">
|
||
<div class="slides">
|
||
<section>
|
||
<h1>Computational reproducibility<br><small>How can it be achieved in practice?</small></h1>
|
||
<p>Michael Hanke</p>
|
||
<p>
|
||
<small>Institute of Neuroscience and Medicine, Brain & Behavior (INM-7),
|
||
Research Center Jülich</small><br>
|
||
<small>Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf</small><br>
|
||
<p><img style="height:50px;margin-bottom:-12px;margin-right:10px" data-src="common/img/mastodon.svg" />@mih@mas.to
|
||
<a href="http://psychoinformatics.de">http://psychoinformatics.de</a></p>
|
||
<p style="margin-top:50px"><img style="height:100px;margin-right:100px" data-src="common/img/fzj_logo.svg" />
|
||
<img style="height:100px" data-src="common/img/hhu_logo.svg" /></p>
|
||
</section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Get the talk slides
|
||
|
||
<!-- .element: height="400" -->
|
||
|
||
https://files.inm7.de/mih/talks/computational-reproduccibility-in-practice
|
||
|
||
<small>(sources at https://bits.ngln.eu/mih-talks/computational-reproducibility-in-practice)</small>
|
||
</script></section>
|
||
|
||
|
||
<!--
|
||
I usually work with
|
||
- Windows
|
||
- Mac OS
|
||
- Linux
|
||
- Other
|
||
Have you ever tried to reproduce somebody else's (digital) work?
|
||
- Yeah, no biggie!
|
||
- Yes, with substantial effort
|
||
- Yes, but I could not
|
||
- No
|
||
-->
|
||
<section data-markdown><script type="text/template">
|
||
## Getting to know each other
|
||
<iframe src="https://directpoll.com/r?XDbzPBdEt8j1rJ90qLGZ5bsjLVm4d7atmTGVeCOCkAhEyIx"
|
||
style="border: 0" width="900" height="800"></iframe>
|
||
</script></section>
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Reproducibility: Why should *I* care?
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Your schools, your pension, your comfort!
|
||
|
||
> Reinhart and Rogoff’s work showed average real economic growth slows (a 0.1% decline) when a country’s debt rises to more than 90% of gross domestic product (GDP) – and this 90% figure was employed repeatedly in political arguments over high-profile austerity measures.
|
||
<!-- .element: style="text-align:left;margin:0px;width:100%" -->
|
||
|
||
> The most serious was that, in their Excel spreadsheet, Reinhart and Rogoff had not selected the entire row when averaging growth figures: they omitted data from Australia, Austria, Belgium, Canada and **Denmark**.
|
||
<!-- .element: style="text-align:left;margin:0px;width:100%" -->
|
||
|
||
> When that error was corrected, the **“0.1% decline” data became a 2.2% average increase** in economic growth.
|
||
<!-- .element: style="text-align:left;margin:0px;width:100%" -->
|
||
|
||
<note>https://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646</note>
|
||
</script></section>
|
||
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## The good, the bad, and the ugly
|
||
|
||
> Traditionally, researchers have been taught to record every detail of their work, including experimental design, procedures, equipment, raw results, data processing, statistical methods and other tools used to analyse the results.
|
||
<!-- .element: style="text-align:left;margin:0px;width:100%" -->
|
||
|
||
> In contrast, relatively few researchers who employ computing in modern science [...] typically take such care in their work.
|
||
In most cases, there is no record of workflow, hardware and software configuration, and often even the source code is no longer available (or has been revised numerous times since the study was conducted).
|
||
<!-- .element: style="text-align:left;margin:0px;width:100%" -->
|
||
|
||
> We think this is a seriously lax environment in which **deliberate fraud** and **genuine error** can proliferate.
|
||
<!-- .element: style="text-align:left;margin:0px;width:100%" -->
|
||
|
||
<note>https://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646</note>
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
If everything is important... **track everything!**
|
||
|
||
We can only track things that are written... **write down/out everything!**
|
||
</script></section>
|
||
<section data-markdown><script type="text/template">
|
||
## Objectives for today
|
||
|
||
- Reproduce a real paper, from empirical data to manuscript PDF
|
||
|
||
- Understand why and how that is doable
|
||
|
||
<!-- .element: width="80%" style="margin-bottom:-20px;margin-top:-20px" -->
|
||
|
||
<note>Dar, A. H., Wagner, A. S. & Hanke, M. (2020). REMoDNaV: Robust Eye-Movement Classification for Dynamic Stimulation. Behavior Research Methods, 53, 399–414. https://doi.org/10.3758/s13428-020-01428-x</note>
|
||
</script></section>
|
||
</section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Concept of this session
|
||
|
||
- we can **only scratch the surface**
|
||
|
||
- **focus on doing**, not on theory and background
|
||
|
||
- **written materials are available**, presented slides shall only provide pointers and keywords
|
||
|
||
- **code-along**: everyone invited to run everything on their machines too
|
||
|
||
- hands-on parts are (mostly) self-contained; if you get stuck in one, you can still try the next
|
||
- open the slides in a browser and **copy&paste**, rather than type
|
||
- if you need help: **ask sooner rather than later**, the answers are often insightful to everyone!
|
||
|
||
</script></section>
|
||
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
# Switch to hacker mode!
|
||
<!-- .element: width="900" style="margin-bottom:-20px;margin-top:-20px" -->
|
||
|
||
In 3-2-1
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
### Terminal survival guide<!-- .element: style="margin-top:-50px" -->
|
||
|
||
|
||
<table width="100%" style="line-height:1.0;padding:0px">
|
||
<tr><th>Everywhere</th><th>But on Windows <code>cmd.exe</code></th></tr>
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">Which directory am I in?</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>pwd</code></pre></td><td style="padding:0px"><pre><code>cd</code></pre></td></tr>
|
||
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">Show (sub)directory tree structure</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>tree</code></pre></td><td style="padding:0px"><pre><code>tree</code></pre></td></tr>
|
||
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">Change into a directory; parent directory</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>cd PATH; cd ..</code></pre></td><td style="padding:0px"><pre><code>cd PATH; cd ..</code></pre></td></tr>
|
||
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">Go back into the HOME directory</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>cd</code></pre></td><td style="padding:0px"><pre><code>cd %userprofile%</code></pre></td></tr>
|
||
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">List the content of a directory</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>ls [PATH]</code></pre></td><td style="padding:0px"><pre><code>dir [PATH]</code></pre></td></tr>
|
||
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">Show the content of a file</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>cat PATH</code></pre></td><td style="padding:0px"><pre><code>type PATH</code></pre></td></tr>
|
||
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">Print something to the terminal</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>echo "some text"</code></pre></td><td style="padding:0px"><pre><code>echo some text</code></pre></td></tr>
|
||
|
||
<tr><td colspan=2 style="text-align:left;font-size:70%">Write something to a file</td></tr>
|
||
<tr><td style="padding:0px"><pre><code>echo "some text" > PATH</code></pre></td><td style="padding:0px"><pre><code>echo some text > PATH</code></pre></td></tr>
|
||
</table>
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## 1st time setup
|
||
|
||
- Open your terminal and verify the installation. The output should show a version for `cmd:git` and `cmd:annex`
|
||
|
||
```
|
||
datalad wtf -S dependencies
|
||
```
|
||
|
||
- Open your terminal and run (please customize!):
|
||
|
||
```
|
||
git config --global user.name "FirstName LastName"
|
||
git config --global user.email "email@example.com"
|
||
```
|
||
|
||
- Upgrade to the latest DataLad versions
|
||
|
||
```
|
||
python -m pip install -U datalad datalad-container datalad-next
|
||
```
|
||
|
||
- Bring the future!
|
||
|
||
```
|
||
git config --global --add datalad.extensions.load next
|
||
```
|
||
|
||
|
||
**Most of the hands-on parts are self-contained, and can be done individually, if this setup was successful!**
|
||
|
||
</script></section>
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## What is DataLad?
|
||
<!-- .element: height="600" -->
|
||
http://datalad.org<!-- .element: style="margin-left:800px" -->
|
||
|
||
|
||
<aside class="notes">
|
||
But let's not talk about it, and only talk about features and example implementations in DataLad
|
||
</aside>
|
||
</script></section>
|
||
|
||
<section data-markdown data-transition="none"><script type="text/template">
|
||
## Exhaustive tracking of research components
|
||
<!-- .element: width="100%" -->
|
||
Well-structured datasets (using community standards), and portable computational environments — and their evolution — are the precondition for reproducibility
|
||
|
||
<table width=100% style="padding:0px">
|
||
<tr><td style="padding:0px">
|
||
<code><pre>
|
||
# turn any directory into a dataset
|
||
# with version control
|
||
|
||
% datalad create <directory>
|
||
</pre></code>
|
||
</td><td style="padding:0px">
|
||
<code><pre>
|
||
# save a new state of a dataset with
|
||
# file content of any size
|
||
|
||
% datalad save
|
||
</pre></code>
|
||
</td></tr></table>
|
||
Note:
|
||
- link to prev. statements on description standards
|
||
- your community could be really small (your lab); when data are precious, resources
will be spent to understand them, but information must be captured to make this possible
|
||
</script></section>
|
||
|
||
<section data-markdown data-transition="none"><script type="text/template">
|
||
## Capture computational provenance
|
||
<!-- .element: width="100%" -->
|
||
Which data were needed at which version, as input into which code, running with what parameterization in which
|
||
computational environment, to generate an outcome?
|
||
|
||
<table width=100% style="padding:0px">
|
||
<tr><td style="padding:0px">
|
||
<code><pre>
|
||
# execute any command and capture its output
|
||
# while recording all input versions too
|
||
|
||
% datalad run --input ... --output ... <command>
|
||
</pre></code>
|
||
</td></tr></table>
|
||
|
||
Note:
|
||
The missing link: even when everything is shared, we still don't know how to start.
|
||
README is minimum, but executable prov-records are much better.
|
||
</script></section>
|
||
|
||
<section data-markdown data-transition="none"><script type="text/template">
|
||
## Exhaustive capture enables portability
|
||
<!-- .element: width="100%" -->
|
||
Precise identification of data and computational environments, combined with provenance records, forms a comprehensive and portable data structure that captures all aspects of an investigation.
|
||
|
||
<table width=100% style="padding:0px">
|
||
<tr><td style="padding:0px">
|
||
<code><pre>
|
||
# transfer data and metadata to other sites and services
|
||
# with fine-grained access control for dataset components
|
||
|
||
% datalad push --to <site-or-service>
|
||
</pre></code>
|
||
</td></tr></table>
|
||
|
||
Note:
|
||
Does it fly? Can you give it to someone? Or can you take it with you to your new lab?
|
||
</script></section>
|
||
|
||
<section data-markdown data-transition="none"><script type="text/template">
|
||
## Reproducibility strengthens trust
|
||
<!-- .element: width="100%" -->
|
||
Outcomes of computational transformations can be validated by authorized 3rd-parties. This enables audits, promotes accountability, and streamlines automated "upgrades" of outputs
|
||
|
||
<table width=100% style="padding:0px">
|
||
<tr><td style="padding:0px">
|
||
<code><pre>
|
||
# obtain dataset (initially only identity,
|
||
# availability, and provenance metadata)
|
||
|
||
% datalad clone <url>
|
||
</pre></code>
|
||
</td><td style="padding:0px">
|
||
<code><pre>
|
||
# immediately actionable provenance records
|
||
# full abstraction of input data retrieval
|
||
|
||
% datalad rerun <commit|tag|range>
|
||
</pre></code>
|
||
</td></tr></table>
|
||
Note:
|
||
Goal is automated reproducibility, enables assessment of robustness and benchmarking algorithmic developments
|
||
</script></section>
|
||
|
||
<section data-markdown data-transition="none"><script type="text/template">
|
||
## Ultimate goal: (re)usability
|
||
<!-- .element: width="100%" -->
|
||
Verifiable, portable, self-contained data structures that track all aspects of an investigation exhaustively can be (re)used as modular components in larger contexts — propagating their traits
|
||
|
||
<table width=100% style="padding:0px">
|
||
<tr><td style="padding:0px">
|
||
<code><pre>
|
||
# declare a dependency on another dataset and
|
||
# reuse it at a particular state in a new context
|
||
|
||
% datalad clone -d <superdataset> <url> <path-in-dataset>
|
||
</pre></code>
|
||
</td></tr></table>
|
||
|
||
Note:
|
||
With these in place, reusability is a small(er) step
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Talk is cheap, show me the code: Git vs. DataLad
|
||
|
||
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/Yrg6DgOcbPE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
||
|
||
https://www.youtube.com/watch?v=Yrg6DgOcbPE
|
||
|
||
<aside class="notes">
|
||
- show git limits: commit a change in a 3rd-level submodule
|
||
- show annex limits: get file in a subdataset
|
||
- reveal: datalad makes repo-boundaries vanish -- show save -r
|
||
</aside>
|
||
</script></section>
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Hands-on: DataLad RDM essentials
|
||
|
||
(this section is self-contained, only 1st-time setup needed)
|
||
|
||
**Use a new terminal, or go back to your HOME directory!**
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Dataset creation
|
||
|
||
- Create a DataLad dataset
|
||
|
||
```
|
||
datalad create my1st
|
||
cd my1st
|
||
```
|
||
|
||
- Run the command
|
||
```
|
||
gitk
|
||
```
|
||
and inspect the dataset history. Can you find
|
||
- the dataset identifier?
|
||
- the version label?
|
||
- the dataset creator?
|
||
- the dataset creation date?
|
||
|
||
- Run the command
|
||
```
|
||
gitk --all
|
||
```
|
||
What is the difference from before?
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Content tracking and version control
|
||
|
||
- Check the status of the dataset
|
||
```sh
|
||
datalad status
|
||
```
|
||
-> nothing to save, working tree clean
|
||
|
||
- Create a file inside the dataset (or copy one in with a file manager)
|
||
```sh
|
||
echo "exquisit" > myfile.txt
|
||
```
|
||
|
||
- Inspect the dataset status again
|
||
```sh
|
||
datalad status
|
||
```
|
||
|
||
- Save the dataset modification
|
||
```sh
|
||
datalad save -m "First time I ever datalad-saved"
|
||
```
|
||
|
||
- Run `gitk` again. Can you find
|
||
- the file identifier?
|
||
- the dataset modification date?
|
||
</script></section>
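<section data-markdown><script type="text/template">
## Sketch: what a file identifier encodes

The file identifier visible in `gitk` is derived from the file content, not from the file name or location. Below is a minimal sketch of that idea, assuming an MD5E-style key layout (backend name, size in bytes, MD5 checksum, file extension). The helper `annex_style_key` is hypothetical and not part of DataLad or git-annex.

```python
import hashlib
import os


def annex_style_key(path):
    """Build a content-derived identifier for a file (illustrative only).

    Assumed layout: MD5E-s<size>--<md5 hex digest><extension>
    """
    with open(path, "rb") as f:
        content = f.read()
    checksum = hashlib.md5(content).hexdigest()
    extension = os.path.splitext(path)[1]
    return "MD5E-s{}--{}{}".format(len(content), checksum, extension)
```

Identical content always yields the same key, and any change to the content yields a different one. That property is what makes it possible to record exactly which version of a file was used.
</script></section>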
|
||
</section>
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Hands-on: Use and reuse data module
|
||
|
||
(this section is self-contained, only 1st-time setup needed)
|
||
|
||
**Use a new terminal, or go back to your HOME directory!**
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Create a "YODA dataset"<!-- .element: style="margin-top:-50px" -->
|
||
|
||
- Create a DataLad dataset, applying a specific post-creation routine
|
||
|
||
```sh
|
||
datalad create -c yoda iyoda
|
||
cd iyoda
|
||
```
|
||
|
||
- Run the command
|
||
```sh
|
||
gitk
|
||
```
|
||
- what did that "YODA" setup actually do?
|
||
- how do we know that a data module should go into `inputs/`?
|
||
|
||
- Run the command
|
||
```sh
|
||
datalad clone -d . https://github.com/datalad-handbook/iris_data inputs
|
||
```
|
||
- check the dataset status
|
||
```sh
|
||
datalad status
|
||
```
|
||
- what was actually saved in the `iyoda` dataset?
|
||
```sh
|
||
datalad subdatasets
|
||
```
|
||
- inspect the history of the "iris" dataset with `gitk`. When was it made, what does it contain? Hint: enter the dataset first
|
||
</script></section>
|
||
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Imagine data science!<!-- .element: style="margin-top:-50px" -->
|
||
|
||
- Create a script `code/extract.py` inside `iyoda` with content
|
||
```
|
||
from os.path import join as opj
import csv
with open(opj('inputs', 'iris.csv')) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['variety'] != 'Setosa':
            continue
        print(row['petal.length'])
|
||
```
|
||
- careful: Python requires consistent indentation, with tabs OR spaces!
- It will print the petal length of every row matching the Setosa variety
|
||
- Test the script
|
||
```
|
||
unix: python code/extract.py
win: python code\extract.py
|
||
```
|
||
- stop here and try to figure out why it does not print any lines<br>
|
||
(hint: the script above is fine as-is)
|
||
- try again after running
|
||
```
|
||
datalad get inputs
|
||
```
|
||
- Save the script after confirming that it works
|
||
```
|
||
datalad save -m "Data extraction script" code
```
|
||
</script></section>
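<section data-markdown><script type="text/template">
## Sketch: is the file content actually here?

The script fails before `datalad get` because the dataset knows about the file, but its content has not been downloaded yet. The following is a minimal sketch of how one could check for that, assuming a Unix-like system where files without local content show up as symbolic links with a missing target. On Windows, datasets typically use a different representation and this check does not apply. The helper `content_is_missing` is hypothetical, not a DataLad function.

```python
import os


def content_is_missing(path):
    """Return True if path is a symlink whose target does not exist."""
    # os.path.exists() follows symlinks, so it is False for a dangling link
    return os.path.islink(path) and not os.path.exists(path)
```

For example, `content_is_missing(os.path.join('inputs', 'iris.csv'))` can be called before and after `datalad get inputs` to see the difference.
</script></section>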
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Capture outputs
|
||
|
||
- Run the script and write its output into a file for further processing
|
||
```sh
|
||
unix: python code/extract.py > outputs.dat
|
||
win: python code\extract.py > outputs.dat
|
||
```
|
||
- Check the status and save the dataset modification
|
||
```sh
|
||
datalad status
|
||
datalad save -m "Create the desired setosa variety petal length data file"
|
||
```
|
||
- Use `gitk` to inspect the change record.
|
||
- What information is captured?
|
||
- Imagine yourself in a year: What information would you be missing?
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Capture provenance!
|
||
|
||
- Run the script again, but through DataLad, and declare inputs and outputs
|
||
- Unix:
|
||
```sh
|
||
datalad run -i inputs/iris.csv -o plength.txt "python code/extract.py > {outputs}"
|
||
```
|
||
<!-- .element: style="margin-left:0px;padding:0px;width:100%" -->
|
||
- Windows:
|
||
```cmd
|
||
datalad run -i inputs\iris.csv -o plength.txt "python code\extract.py > {outputs}"
|
||
```
|
||
<!-- .element: style="margin-left:0px;padding:0px;width:100%" -->
|
||
|
||
- Use `gitk` to inspect the change record. What is different now?
|
||
|
||
- Note that `datalad-run` also has the `-m` option to set a message
|
||
</script></section>
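<section data-markdown><script type="text/template">
## Sketch: reading a run record

The change record created by `datalad run` is machine-readable, which is what makes it re-executable later. The following is a minimal sketch of reading such a record, assuming it is stored as JSON between two marker lines in the commit message. The example message is hand-written for illustration and is not actual DataLad output; the helpers `parse_run_record` and `expand_command` are hypothetical.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"

# Hand-written example message, NOT actual DataLad output
EXAMPLE_MESSAGE = """[DATALAD RUNCMD] extract petal lengths

=== Do not change lines below ===
{
 "cmd": "python code/extract.py > {outputs}",
 "inputs": ["inputs/iris.csv"],
 "outputs": ["plength.txt"]
}
^^^ Do not change lines above ^^^
"""


def parse_run_record(message):
    """Extract the JSON record found between the two marker lines."""
    body = message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


def expand_command(record):
    """Fill the {inputs}/{outputs} placeholders (simplified: no quoting)."""
    return record["cmd"].format(
        inputs=" ".join(record["inputs"]),
        outputs=" ".join(record["outputs"]),
    )
```

Calling `expand_command(parse_run_record(EXAMPLE_MESSAGE))` gives back the command line with the placeholder filled in. The real implementation handles more cases (quoting, globs, multiple placeholders) than this sketch does.
</script></section>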
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Actionable provenance to the rescue!
|
||
|
||
- Force DataLad to lose the processed data
|
||
```sh
|
||
datalad drop --reckless availability plength.txt
|
||
```
|
||
|
||
- Verify that the file content is no longer around
|
||
|
||
- Re-execute the provenance record to recreate the file content
|
||
```sh
|
||
datalad rerun HEAD
|
||
```
|
||
|
||
- Check the file content! Check the dataset status
|
||
```sh
|
||
datalad status
|
||
```
|
||
What changed?
|
||
</script></section>
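<section data-markdown><script type="text/template">
## Sketch: did the rerun give the same result?

Whether a re-execution reproduced a result can be decided by comparing the outputs byte for byte. The following is a minimal sketch of that comparison using checksums. The helpers `output_digest` and `is_repeatable` are hypothetical, and this is an illustration of the principle, not a description of how `datalad rerun` works internally.

```python
import hashlib
import subprocess


def output_digest(command):
    """Run a command and return the SHA-256 digest of its standard output."""
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def is_repeatable(command):
    """Run the command twice and report whether both outputs are identical."""
    return output_digest(command) == output_digest(command)
```

For example, `is_repeatable(["python", "code/extract.py"])` could be run inside the `iyoda` dataset once the input data are available. A command is given as a list of arguments, so no shell is involved.
</script></section>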
|
||
</section>
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Hands-on: Let's (start to) reproduce the paper!
|
||
|
||
(this section is self-contained, only 1st-time setup needed)
|
||
|
||
**Use a new terminal, and keep it open. This will run for a while on its own!**
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## All there is to reproduce<!-- .element: style="margin-top:-50px" -->
|
||
|
||
<small>*There is one component still missing to explain. But it is under the hood, and we will get to that in a sec.*</small>
|
||
<!-- .element: style="margin-top:-20px;margin-bottom:-20px" -->
|
||
|
||
- Clone the dataset with the paper
|
||
```sh
|
||
datalad clone https://github.com/psychoinformatics-de/paper-remodnav.git
|
||
```
|
||
<!-- .element: style="width:100%" -->
|
||
|
||
- Use a file manager to take a look at the content
|
||
- `main.tex`: the manuscript in LaTeX format
|
||
- `img`: all figures, in SVG format, are in this directory
|
||
- `results_def.tex`: all statistics reported in the paper
|
||
|
||
- Use DataLad to get an idea about the dataset setup
|
||
```sh
|
||
cd paper-remodnav
|
||
datalad status --annex availability
|
||
datalad subdatasets
|
||
```
|
||
|
||
- Cheat: save time, supply pre-downloaded 1.7G TAR files (optional)
|
||
```
|
||
git annex reinject --known --backend MD5E [PATH-TO]360338....tar
|
||
git annex reinject --known --backend MD5E [PATH-TO]705094....tar
|
||
```
|
||
- Reproduce all figures and statistics reported in the paper
|
||
```sh
|
||
# first downloads and then processes 1GB of eye-tracking data
|
||
datalad rerun results-containerized
|
||
```
|
||
</script></section>
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Back to the main track
|
||
|
||
- we created a (toy) data processing pipeline with Python
|
||
- we can run and rerun it
|
||
- we get the same results each time
|
||
|
||
Is that reproducibility?<!-- .element: class="fragment" data-fragment-index="1" -->
|
||
|
||
**Yes!**<!-- .element: class="fragment" data-fragment-index="2" -->
|
||
|
||
At least right here...<!-- .element: class="fragment" data-fragment-index="3" -->
|
||
|
||
At least right now...<!-- .element: class="fragment" data-fragment-index="4" -->
|
||
</script></section>
|
||
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Main challenges of (long-term) computational reproducibility
|
||
|
||
- not all software/algorithms produce the same results on all computers
|
||
(even with identical input data)
|
||
|
||
- most software requires many other software components to work;
it can be very difficult/impossible to recreate the exact same
installation later/elsewhere
|
||
|
||
- hardware and software go out of fashion, expertise and availability go down
|
||
over time
|
||
</script></section>
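<section data-markdown><script type="text/template">
## Sketch: same numbers, different sum

One reason why identical input data can lead to different results is floating-point arithmetic: addition is not associative, so the order of operations matters. The following is a minimal sketch of that effect. The order of summation stands in for what may differ between machines, libraries, or parallel execution schemes; the function name is made up for this example.

```python
def sum_in_order(values):
    """Add up numbers strictly one after the other."""
    total = 0.0
    for value in values:
        total += value
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

# the two results differ in the last digit
print(forward == backward, forward, backward)
```

The difference is tiny here, but it can grow in long computations, and it is enough to make a byte-for-byte comparison of outputs fail.
</script></section>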
|
||
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Containers! They help... a bit... for now
|
||
|
||
- a **container** is a virtual computer that runs on a real machine
|
||
|
||
- a container embeds an entire software environment, down to the
|
||
operating system, with all software dependencies
|
||
|
||
- a container is stored in an **image** file (when it is not running),
|
||
like the hard drive of a real computer, but in a file
|
||
|
||
- a container is built from a **recipe**, written and executable instructions
|
||
to create a container image from scratch
|
||
|
||
- many container solutions exist (Singularity, Docker, podman, etc.)
|
||
|
||
- a container will run wherever the respective container solution runs, which adds a layer of insulation and an interoperability buffer
|
||
</script></section>
|
||
|
||
<!--
|
||
Have you worked with containers before?
|
||
- Sure, docker!
|
||
- I used singularity (maybe on HPC)
|
||
- Shouldn't we all be using podman instead?
|
||
- Containers?
|
||
-->
|
||
<section data-markdown><script type="text/template">
|
||
## Containers?!
|
||
<iframe src="https://directpoll.com/r?XDbzPBdEt8j1rJ90qLGZ5bsjLVm4d7atmTGVeCOCkAhEyIx"
|
||
style="border: 0" width="900" height="800"></iframe>
|
||
</script></section>
|
||
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Hands-on: container as an app
|
||
|
||
(this section is self-contained)
|
||
|
||
**Use a new terminal, or go back to your HOME directory!**
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Moo in a box
|
||
|
||
- Often a container is used to *package* an application
|
||
|
||
- Use Docker to obtain the `rancher/cowsay` app from the Docker Hub registry
|
||
```sh
|
||
docker pull rancher/cowsay
|
||
```
|
||
|
||
- Most of the time, a container can be executed with some input, and delivers
|
||
some result. Try it!
|
||
```sh
|
||
docker run rancher/cowsay Moo!
|
||
```
|
||
|
||
- Docker is a service that can manage and run many containers. Check out
|
||
these essential commands:
|
||
```
|
||
# what containers are running
|
||
docker ps
|
||
# what container images are available locally
|
||
docker images
|
||
# remove container images (when space runs out)
|
||
docker rmi ...
|
||
```
|
||
</script></section>
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Hands-on: Container-based Python installation
|
||
|
||
(this section is NOT self-contained, but based on the `iyoda` session)
|
||
|
||
**Use a new terminal, and/or `cd` into the `iyoda` dataset!**
|
||
</script></section>
|
||
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## All things end
|
||
|
||
- the Python script in `code/extract.py` was written for an old version of Python,
|
||
version 2, which ran out of support in 2020.
|
||
|
||
- the code still runs with newer Python versions, but there is no guarantee that this will stay this way
|
||
|
||
- one can still install Python 2, but again no guarantee that this will remain possible
|
||
|
||
- not really a problem with a simple 8-line script, but imagine a PY2-based
|
||
analysis toolkit comprising a few 10k lines of code
|
||
|
||
- a Python 2 in a container can postpone death. Let's try that
|
||
</script></section>
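<section data-markdown><script type="text/template">
## Sketch: same code, different result

The extraction script happens to behave the same under Python 2 and 3, but that is not true for all code. The following is a minimal sketch of one well-known difference, the division of two integers. The function names are made up for this example and are unrelated to the script in `code/extract.py`.

```python
def mean_version_dependent(values):
    """Average of a list of integers.

    Python 2: int / int discards the remainder.
    Python 3: int / int is true division.
    """
    return sum(values) / len(values)


def mean_portable(values):
    """Average that behaves the same under Python 2 and Python 3."""
    # converting to float first forces true division everywhere
    return float(sum(values)) / len(values)
```

For the input `[1, 2]`, the first function returns `1` under Python 2 and `1.5` under Python 3, without any error or warning. In a large analysis toolkit such silent changes are hard to find, which is one motivation for freezing the environment in a container.
</script></section>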
|
||
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Python 2 in a box<!-- .element: style="margin-top:-50px" -->
|
||
|
||
- pull the container with Python 2 from Docker Hub
|
||
```sh
|
||
docker pull python:2-slim
|
||
```
|
||
- run it, and check that it is indeed a version 2.x
|
||
```sh
|
||
docker run python:2-slim python --version
|
||
```
|
||
- to be able to use it for data processing, we must give the container access
|
||
to the data location on the host machine
|
||
```sh
|
||
unix: docker run -v $(pwd):/tmp -w /tmp python:2-slim python code/extract.py
|
||
win: docker run -v .:/tmp -w /tmp python:2-slim python code/extract.py
|
||
```
|
||
(slightly odd formatting of the windows command to better show how they are
|
||
almost identical)
|
||
- `-v ...` exposes the current directory as `/tmp` *inside* the container
|
||
- `-w ...` lets the given command run in `/tmp` inside the container AKA the
|
||
current working directory outside the container
|
||
- we can now use the containerized Python just like the system Python for processing
|
||
data in this dataset
|
||
</script></section>
|
||
</section>
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Excursion: Container customization
|
||
|
||
(this section is self-contained)
|
||
|
||
**Use a new terminal, or go back to your HOME directory!**
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## A numerical Python 2<!-- .element: style="margin-top:-50px" -->
|
||
|
||
- containers are customized via *recipes* (called Dockerfiles in Docker)
|
||
|
||
- create a `Dockerfile` file in a `mypy2` directory with this content
|
||
```Dockerfile
|
||
# base container
|
||
FROM python:2-slim
|
||
# code to run to customize it
|
||
RUN python -m pip install numpy
|
||
# command to run when the container is executed
|
||
CMD python
|
||
```
|
||
it extends the provided Python 2 container with the `numpy` package
|
||
- Build the container (run in the parent directory of `mypy2`)
|
||
```sh
|
||
docker build -t mypy2:latest mypy2
|
||
```
|
||
- Try running it
|
||
```
|
||
docker run -it mypy2:latest
|
||
```
|
||
- type
|
||
```
|
||
import numpy
|
||
```
|
||
and hit return, it should not error
|
||
- type ``exit()`` and hit return to stop the container
|
||
</script></section>
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<section data-markdown><script type="text/template">
|
||
## Hands-on: provenance capture with containers
|
||
|
||
(this section is NOT self-contained, but based on the `iyoda` session)
|
||
|
||
**Use a new terminal, and/or `cd` into the `iyoda` dataset!**
|
||
</script></section>
|
||
|
||
<section data-markdown><script type="text/template">
|
||
## Register and run containers in DataLad datasets<!-- .element: style="margin-top:-50px;margin-bottom:-30px" -->
|
||
|
||
- the `datalad-container` extension package provides this feature
|
||
- register a container from Docker Hub in the `iyoda` dataset
|
||
```sh
|
||
# 'container-python' is the name of the container in the dataset
|
||
datalad containers-add -u dhub://python:2-slim container-python
|
||
```
|
||
<!-- .element: style="width:100%" -->
|
||
- inspect the containers that the dataset knows
|
||
```sh
|
||
datalad containers-list
|
||
```
|
||
<!-- .element: style="width:100%" -->
|
||
- run our extraction code via DataLad AND the container
|
||
|
||
<small>Unix:</small>
|
||
```sh
|
||
datalad containers-run -n container-python -i inputs/iris.csv -o plength.txt "python code/extract.py > plength.txt"
|
||
```
|
||
<!-- .element: style="margin-top:-30px;margin-left:0px;white-space:pre-wrap;width:106%;" -->
|
||
|
||
<small>Windows:</small>
|
||
```sh
|
||
datalad containers-run -n container-python -i inputs\iris.csv -o plength.txt "python code/extract.py > plength.txt"
|
||
```
|
||
<!-- .element: style="margin-top:-30px;margin-left:0px;white-space:pre-wrap;width:106%;" -->
|
||
- inspect the provenance record with `gitk`
|
||
- find the container configuration defined in `.datalad/config`
|
||
- try a rerun
|
||
```sh
|
||
# may need 'master' instead of 'main'
|
||
datalad rerun main
|
||
```
|
||
<!-- .element: style="width:100%" -->
|
||
</script></section>
|
||
</section>


<section>
<section data-markdown><script type="text/template">
## Building blocks of computational reproducibility

- **Track any digital information** with DataLad datasets
  - text files
  - binary data
  - code
  - computational environments
- Nest DataLad datasets to **form and combine modular, reusable units** of information
- Capture computational provenance to not only track what was produced, but to also
  **record how it was done**
- Use **containers** as portable encapsulations of arbitrarily complex computing
  environments.
</script></section>
</section>

<section>
<section data-markdown><script type="text/template">
## Hands-on: Reproducing a paper (part 2)

(this section is NOT self-contained)

**Go back to the terminal that executed `datalad rerun` on the paper!**
</script></section>

<section data-markdown><script type="text/template">
## Inspect the reproduced paper
- Verify that everything was reproduced
```sh
datalad status
gitk
```

- Use `gitk` to find the three key pieces that enabled reproduction
  - the `results-containerized` tag, and the associated messages (notice the cryptographic signature)
  - the `datalad-container` setup for `docker-make` (`367bbeea`)
  - the provenance record for building the container image itself (`3dd49e36`), and its recipe

- Build the manuscript PDF
```sh
datalad containers-run -n docker-make main.pdf
```
</script></section>
</section>

<!--
How was that?
- Like a breeze, could go on forever!
- Pheew, but it reproduced!
- It did not work for me...
- What a mess?!
-->
<section data-markdown><script type="text/template">
## How was that?
<iframe src="https://directpoll.com/r?XDbzPBdEt8j1rJ90qLGZ5bsjLVm4d7atmTGVeCOCkAhEyIx"
style="border: 0" width="900" height="800"></iframe>
</script></section>

<section data-markdown><script type="text/template">
## Critical look at the container approach<!-- .element: style="margin-top:-50px" -->

- Ultimately suffers from the same problems as plain software use

  - if the container solution no longer installs or runs, reproduction fails

  - not all container solutions are universally applicable/available

- It adds complexity to an already complex computational environment

- But
  - it can help keep things running today that would otherwise already fail today

  - it improves the portability of computational environments, also in ongoing
    collaborations

  - recipes make containerized environments much more reproducible than real machines;
    this also helps when updating the environment in ongoing projects (e.g., to fix
    software defects)

  - combined with version control, containerization can be introduced
    at a later stage without losing confidence
</script></section>


<section data-markdown><script type="text/template">
## Extensive documentation and training materials
<!-- .element: width="700" style="margin-top:-20px;margin-bottom:-10px" -->

https://handbook.datalad.org (or [ISBN 979-8857037973](https://www.bookfinder.com/isbn/979-8857037973))

Note:
RDM education is key. The handbook helps people be more productive, yielding more FAIR resources as an outcome, but not as the main goal.
</script></section>


<section>
<h2>DataLad contact and more information</h2>
<table>
<tr><td>Website + Demos</td>
<td><a href="http://datalad.org">http://datalad.org</a></td>
</tr><tr><td>Documentation</td>
<td><a href="http://handbook.datalad.org">http://handbook.datalad.org</a></td>
</tr><tr><td>Talks and tutorials</td>
<td><a href="https://youtube.com/datalad">https://youtube.com/datalad</a></td>
</tr><tr><td>Development</td>
<td><a href="http://github.com/datalad">http://github.com/datalad</a></td>
</tr><tr><td>Support</td>
<td><a href="https://matrix.to/#/#datalad:matrix.org">https://matrix.to/#/#datalad:matrix.org</a></td>
</tr><tr><td>Open data</td>
<td><a href="http://datasets.datalad.org">http://datasets.datalad.org</a></td>
</tr><tr><td>Mastodon</td>
<td>@datalad@fosstodon.org</td>
</tr><tr><td>Twitter</td>
<td>@datalad</td>
</tr>
</table>
</section>

<section>
<section data-markdown><script type="text/template">
## Additional topics
</script></section>

<section data-markdown><script type="text/template">
## What goes into (different) data modules?

- Target audience is different
  - public vs. private
  - domain specific vs. domain general

- Pace of evolution is different
  - "factual" raw data vs. choices of (pre-)processing
  - completed acquisition vs. ongoing study

- Size impacts I/O and logistics
  - Git can struggle with 1M+ files
  - filesystems (licensing) can struggle with large numbers of inodes

- Legal/Access constraints
  - personal vs. anonymized data

<aside class="notes">
Note to self
</aside>
</script></section>
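<section data-markdown><script type="text/template">
## Sketch: where are the many files?

A rough sketch, not part of the original talk: one way to spot directories that may be worth splitting into their own dataset is to count the files they hold. The helper names `count_files` and `report_file_counts` are hypothetical, and the count is only a heuristic for the size concerns named above.

```sh
# Hypothetical helper: count regular files below a directory,
# ignoring anything inside a .git directory.
count_files() {
  find "$1" -type f ! -path '*/.git/*' | wc -l | tr -d ' '
}

# Report the file count for each directory given as an argument, largest first.
report_file_counts() {
  for dir in "$@"; do
    printf '%s\t%s\n' "$(count_files "$dir")" "$dir"
  done | sort -rn
}
```

From the top of a project, `report_file_counts */` lists the first-level directories by file count.
</script></section>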
</section>

</div> <!-- /.slides -->
</div> <!-- /.reveal -->

<script src="common/reveal.js/js/reveal.js"></script>

<script>
  // Full list of configuration options available at:
  // https://github.com/hakimel/reveal.js#configuration
  Reveal.initialize({
    // The "normal" size of the presentation, aspect ratio will be preserved
    // when the presentation is scaled to fit different resolutions. Can be
    // specified using percentage units.
    width: 1280,
    height: 960,

    // Factor of the display size that should remain empty around the content
    margin: 0.15,

    // Bounds for smallest/largest possible scale to apply to content
    minScale: 0.2,
    maxScale: 1.0,

    controls: true,
    progress: true,
    history: true,
    center: true,

    transition: 'slide', // none/fade/slide/convex/concave/zoom

    // Optional reveal.js plugins
    dependencies: [
      { src: 'common/reveal.js/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
      { src: 'common/reveal.js/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
      { src: 'common/reveal.js/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
      { src: 'common/reveal.js/plugin/zoom-js/zoom.js', async: true },
      { src: 'common/reveal.js/plugin/notes/notes.js', async: true }
    ]
  });
</script>
</body>
</html>