
<!doctype html>
<html lang="en">
<head>
  <title>Distributed data logistics with DataLad</title>
  <meta name="description" content="Talk at the FZJ IT-Forum">
  <meta name="author" content="Michael Hanke">

  <meta charset="utf-8">
  <meta name="apple-mobile-web-app-capable" content="yes" />
  <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
  <link rel="stylesheet" href="common/css/main.css" id="theme">
  <script src="common/js/printpdf.js"></script>
</head>
<body>

<div class="reveal">
<div class="slides">
<section>
  <h1>DataLad<br><small>Distributed data logistics</small></h1>
  <p>Michael Hanke</p>
  <p>
      <small>Institute of Neuroscience and Medicine, Brain &amp; Behavior (INM-7),
      Research Center Jülich</small><br>
  <small>Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf</small></p>
  <p><img style="height:50px;margin-bottom:-12px;margin-right:10px" data-src="common/img/mastodon.svg" />@mih@mas.to &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  <a href="http://psychoinformatics.de">http://psychoinformatics.de</a></p>
  <p style="margin-top:50px"><img style="height:100px;margin-right:100px" data-src="common/img/fzj_logo.svg" />
  <img style="height:100px" data-src="common/img/hhu_logo.svg" /></p>
  <a href="https://creativecommons.org/licenses/by/4.0">
  <img data-src="img/cc-by.svg" />
  </a>
</section>

<section data-markdown><script type="text/template">
![](img/datalad_logo_wide.svg)<!-- .element: height="500" -->

- Free and open-source software (MIT)
- Continuously developed for 12 years, as an international collaboration
- Numerous topical (third-party) extension packages

https://helmholtz.software/software/datalad

<aside class="notes">
But let's not talk about it, and only talk about features and example implementations in DataLad
</aside>
</script>
</section>

<section>
<section data-markdown><script type="text/template">
## What can DataLad help with?
</script></section>

<section data-markdown><script type="text/template">
## Access an ecosystem of cyberinfrastructure
![](img/ecosystem.webp)

The vast majority of services is already covered. Support for additional ones is easy to add through independent efforts.
</script></section>

<section data-markdown><script type="text/template">
## Remote-Process "cannot-move" Data
![](img/remoteanalysis.webp)

Enables utilization of data resources that cannot be handed out for legal, technical or other reasons.
</script></section>

<section data-markdown><script type="text/template">
## Reproducible HPC workflows
![](img/hpcworkflows.webp)

Enhances trust in computational outcomes through automatically verified reproducibility, even for users who have no access to the original compute resources.

<note>Wagner, Waite, Wierzba, Hoffstaedter, Waite, Poldrack, Eickhoff, Hanke (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data. Scientific Data, 9, 80.</note>
</script></section>

<section data-markdown><script type="text/template">
## Reproducible publications

<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/nhLqmF58SLQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- Oldest example: Peer-reviewed paper published in Behavior Research Methods in 2020<br>[[DOI 10.3758/s13428-020-01428-x](https://doi.org/10.3758/s13428-020-01428-x)]<!-- .element: style="font-size:70%" -->

- See http://handbook.datalad.org/r.html?reproducible-paper and https://youtube.com/datalad

<!-- .element: style="font-size:70%" -->
Note:
- VERY useful prior publication
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Automated data catalogs
![SFB1451 catalog](img/SFB1451_catalog_screenshot.png)<!-- .element: style="width:49%" -->
![NN catalog](img/naturalistic_imaging_catalog.webp)<!-- .element: style="width:49%" -->

Improves (global) findability, populated from existing metadata
<note>Example: https://data.sfb1451.de</note>
</script></section>
</section>

<section>
<section data-markdown><script type="text/template">
## How does this work?
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive tracking of research components
![](img/vamp_0_start.png)<!-- .element: width="100%" -->
Well-structured datasets (using community standards), and portable computational environments &mdash; and their evolution &mdash; are the precondition for reproducibility

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# turn any directory into a dataset
# with version control

% datalad create &lt;directory&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# save a new state of a dataset with
# file content of any size

% datalad save
</pre></code>
</td></tr></table>
Note:
- link to prev. statements on description standards
- your community could be really small (your lab); when data are precious, resources
will be spent to understand them, but information must be captured to make this possible
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Capture computational provenance
![](img/vamp_1_provcapture.png)<!-- .element: width="100%" -->
Which data was needed at which version, as input into which code, running with what parameterization in which
computational environment, to generate an outcome?

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# execute any command and capture its output
# while recording all input versions too

% datalad run --input ... --output ... &lt;command&gt;
</pre></code>
</td></tr></table>

Note:
The missing link: even when everything is shared, we still don't know how to start.
README is minimum, but executable prov-records are much better.
</script></section>
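<section data-markdown><script type="text/template">
## Sketch: what a provenance record holds

A minimal Python sketch of the idea on the previous slide, not DataLad's implementation: it notes which input content went into which command and which output content came out. The names `sha256_of`, `run_with_provenance`, and `demo`, as well as the record layout, are hypothetical, and SHA-256 digests stand in for real version identifiers.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_with_provenance(cmd, inputs, outputs, cwd):
    """Run cmd in cwd; return a record of input digests, command, and output digests."""
    cwd = Path(cwd)
    record = {
        "cmd": list(cmd),
        # digests are taken before execution, so they describe what the command saw
        "inputs": {name: sha256_of(cwd / name) for name in inputs},
    }
    subprocess.run(cmd, cwd=cwd, check=True)
    record["outputs"] = {name: sha256_of(cwd / name) for name in outputs}
    return record


def demo():
    """Uppercase a small text file in a scratch directory and return the record."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        (work / "in.txt").write_bytes(b"hello\n")
        code = "open('out.txt', 'w').write(open('in.txt').read().upper())"
        return run_with_provenance(
            [sys.executable, "-c", code], ["in.txt"], ["out.txt"], work
        )


if __name__ == "__main__":
    print(json.dumps(demo(), indent=2))
```

Running the module prints the record as JSON. The computational environment is left out here to keep the sketch short.
</script></section>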

<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive capture enables portability
![](img/vamp_2_pushtocloud.png)<!-- .element: width="100%" -->
Precise identification of data and computational environments, combined with provenance records, forms a comprehensive and portable data structure that captures all aspects of an investigation.

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# transfer data and metadata to other sites and services
# with fine-grained access control for dataset components

% datalad push --to &lt;site-or-service&gt;
</pre></code>
</td></tr></table>

Note:
Does it fly? Can you give it to someone? Or can you take it with you to your new lab?
</script></section>
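<section data-markdown><script type="text/template">
## Sketch: identifying content precisely

A rough Python illustration of content-based identification, assuming a key layout modeled on the size-plus-checksum keys used by git-annex. The function name `content_key` is hypothetical, and the sketch only shows why identical content can be recognized at any site regardless of file name or location.

```python
import hashlib
import io


def content_key(stream, chunk_size=1 << 20):
    """Derive a key from size and SHA-256 of a binary stream, read chunk by chunk."""
    digest = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        size += len(chunk)
    # assumed layout: <backend>-s<size in bytes>--<hex digest>
    return f"SHA256-s{size}--{digest.hexdigest()}"


if __name__ == "__main__":
    print(content_key(io.BytesIO(b"hello\n")))
```

Because the stream is read in chunks, the same function works for content of any size without loading it into memory.
</script></section>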

<section data-markdown data-transition="none"><script type="text/template">
## Reproducibility strengthens trust
![](img/vamp_3_reproduce.png)<!-- .element: width="100%" -->
Outcomes of computational transformations can be validated by authorized third parties. This enables audits, promotes accountability, and streamlines automated "upgrades" of outputs.

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# obtain dataset (initially only identity,
# availability, and provenance metadata)

% datalad clone &lt;url&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# immediately actionable provenance records
# full abstraction of input data retrieval

% datalad rerun &lt;commit|tag|range&gt;
</pre></code>
</td></tr></table>
Note:
The goal is automated reproducibility, which enables assessment of robustness and benchmarking of algorithmic developments
</script></section>
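<section data-markdown><script type="text/template">
## Sketch: verifying a re-execution

A small Python sketch of what validating an outcome can mean, under the assumption that a record stores the command and a digest per output. The names `digest`, `verify_rerun`, and `demo` and the record layout are hypothetical; this is not what `datalad rerun` does internally.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def digest(path):
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_rerun(record, cwd):
    """Re-execute record['cmd'] in cwd; return names of outputs whose digest differs."""
    cwd = Path(cwd)
    subprocess.run(record["cmd"], cwd=cwd, check=True)
    return [
        name
        for name, expected in record["outputs"].items()
        if digest(cwd / name) != expected
    ]


def demo():
    """Rerun once with the original input and once with a changed input."""
    code = "open('out.txt', 'w').write(open('in.txt').read().upper())"
    record = {
        "cmd": [sys.executable, "-c", code],
        "outputs": {"out.txt": hashlib.sha256(b"HELLO\n").hexdigest()},
    }
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "in.txt").write_bytes(b"hello\n")
        reproduced = verify_rerun(record, tmp)
        (Path(tmp) / "in.txt").write_bytes(b"changed\n")
        diverged = verify_rerun(record, tmp)
    return reproduced, diverged


if __name__ == "__main__":
    print(demo())
```

An empty list means every recorded output was reproduced bit for bit; any listed name marks an outcome that no longer matches its record.
</script></section>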

<section data-markdown data-transition="none"><script type="text/template">
## Ultimate goal: (re-)usability
![](img/vamp_4_reuse.png)<!-- .element: width="100%" -->
Verifiable, portable, self-contained data structures that track all aspects of an investigation exhaustively can be (re-)used as modular components in larger contexts &mdash; propagating their traits

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# declare a dependency on another dataset and
# re-use it at a particular state in a new context

% datalad clone -d &lt;superdataset&gt; &lt;url&gt; &lt;path-in-dataset&gt;
</pre></code>
</td></tr></table>

Note:
With these in place, re-usability is a small(er) step
</script></section>

<section data-markdown><script type="text/template">
## DataLad: Manage (co-)evolution of digital objects
![](img/yoda_decentralized_publishing.png)<!-- .element: width="900" style="margin-bottom:-70px;margin-top:-20px" -->

Consume, create, curate, analyze, publish, and query data with full provenance capture and "universal" metadata support.
<p style="font-size:70%;margin-top:-20px">
DataLad is free and open source (MIT-licensed). http://datalad.org
</p>

<note>
Halchenko, Meyer, Poldrack, ... & Hanke, M. (2021).
DataLad: distributed system for joint management of code, data, and their relationship.
Journal of Open Source Software, 6(63), 3262.
</note>
Note:
- following illustrations contain concrete implementation with datalad
- Software developed to address the needs of long-term maintenance and collab on the studyforrest dataset
</script></section>

<section data-markdown><script type="text/template">
## Talk is cheap, show me the code: Git vs. DataLad

<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/Yrg6DgOcbPE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

https://www.youtube.com/watch?v=Yrg6DgOcbPE

<aside class="notes">
- show git limits: commit a change in a 3rd-level submodule
- show annex limits: get file in a subdataset
- reveal: datalad makes repo-boundaries vanish -- show save -r
</aside>
</script></section>
</section>


<section data-markdown><script type="text/template">
## Extensive documentation and training materials
![](img/datalad_anintroduction_cover.jpg)<!-- .element: width="700" style="margin-top:-20px;margin-bottom:-10px" -->

https://handbook.datalad.org (or ISBN 979-8857037973)

- **educational materials** on technologies &mdash; **targeting researchers**, not developers (executable paper, student supervisor workflow,
  ...)
- handbook on concepts, workflows, and use cases
- **weekly public (virtual) office hour**

Note:
RDM Education is key. Handbook helps people be more productive, yielding more FAIR resources as an outcome, but not as the main goal.
</script></section>

<section>
<section data-markdown data-transition="none"><script type="text/template">
## Machine-driven metadata reporting

![Screenshots](img/machine_driven_metadata.svg)<!-- .element: style="height:650px;margin-bottom:-30px" -->

Formal "open-world" model, query and validated submission<br>
RDF-compatible *and* simultaneously scripting-ready<br>
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Manual annotation and description

<video data-autoplay width="1280" height="720" controls loop>
  <source src="vid/annotate_demo.webm" type="video/webm">
</video>

Preview a live editor: https://annotate.trr379.de/s/demo
</script></section>

<section data-markdown><script type="text/template">
## Full-stack RDM solution
![](img/forgejo.webp)

See https://atris.fz-juelich.de for an FZJ Forgejo-Aneksajo deployment
</script></section>
</section>

<section data-markdown><script type="text/template">
![](img/distribits2025-teaser.webp)

https://distribits.live
</script></section>

<section>
  <h2>DataLad contact and more information</h2>
  <table>
  <tr><td>Website + Demos</td>
  <td><a href="http://datalad.org">http://datalad.org</a></td>
  </tr><tr><td>Documentation</td>
  <td><a href="http://handbook.datalad.org">http://handbook.datalad.org</a></td>
  </tr><tr><td>Talks and tutorials</td>
  <td><a href="https://youtube.com/datalad">https://youtube.com/datalad</a></td>
  </tr><tr><td>Development</td>
  <td><a href="http://github.com/datalad">http://github.com/datalad</a></td>
  </tr><tr><td>Support</td>
  <td><a href="https://matrix.to/#/#datalad:matrix.org">https://matrix.to/#/#datalad:matrix.org</a></td>
  </tr><tr><td>Open data</td>
  <td><a href="http://datasets.datalad.org">http://datasets.datalad.org</a></td>
  </tr>
  <tr><td>Mastodon</td>
  <td>@datalad@fosstodon.org</td>
  </tr>
  </table>
</section>
</div> <!-- /.slides -->
</div> <!-- /.reveal -->

<script src="common/reveal.js/js/reveal.js"></script>

<script>
  // Full list of configuration options available at:
  // https://github.com/hakimel/reveal.js#configuration
  Reveal.initialize({
    // The "normal" size of the presentation, aspect ratio will be preserved
    // when the presentation is scaled to fit different resolutions. Can be
    // specified using percentage units.
    width: 1280,
    height: 960,

    // Factor of the display size that should remain empty around the content
    margin: 0.1,

    // Bounds for smallest/largest possible scale to apply to content
    minScale: 0.2,
    maxScale: 1.0,

    controls: true,
    progress: true,
    history: true,
    center: true,

    transition: 'slide', // none/fade/slide/convex/concave/zoom

    // Optional reveal.js plugins
    dependencies: [
      { src: 'common/reveal.js/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
      { src: 'common/reveal.js/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
      { src: 'common/reveal.js/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
      { src: 'common/reveal.js/plugin/zoom-js/zoom.js', async: true },
      { src: 'common/reveal.js/plugin/notes/notes.js', async: true }
    ]
  });
</script>
</body>
</html>