<!doctype html>
<html lang="en">
<head>
<title>Distributed data logistics with DataLad</title>
<meta name="description" content="Talk at the FZJ IT-Forum">
<meta name="author" content="Michael Hanke">
<meta charset="utf-8">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="common/css/main.css" id="theme">
<script src="common/js/printpdf.js"></script>
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<h1>DataLad<br><small>Distributed data logistics</small></h1>
<p>Michael Hanke</p>
<p>
<small>Institute of Neuroscience and Medicine, Brain &amp; Behavior (INM-7),
Research Center Jülich</small><br>
<small>Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf</small></p>
<p><img style="height:50px;margin-bottom:-12px;margin-right:10px" data-src="common/img/mastodon.svg" />@mih@mas.to &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<a href="http://psychoinformatics.de">http://psychoinformatics.de</a></p>
<p style="margin-top:50px"><img style="height:100px;margin-right:100px" data-src="common/img/fzj_logo.svg" />
<img style="height:100px" data-src="common/img/hhu_logo.svg" /></p>
<a href="https://creativecommons.org/licenses/by/4.0">
<img data-src="img/cc-by.svg" />
</a>
</section>
<section data-markdown><script type="text/template">
![](img/datalad_logo_wide.svg)<!-- .element: height="500" -->
- Free and open-source software (MIT)
- Continuously developed for 12 years, as an international collaboration
- Numerous topical (third-party) extension packages
https://helmholtz.software/software/datalad
<aside class="notes">
But let's not talk about it, and only talk about features and example implementations in DataLad
</aside>
</script>
</section>
<section>
<section data-markdown><script type="text/template">
## What can DataLad help with?
</script></section>
<section data-markdown><script type="text/template">
## Access an ecosystem of cyberinfrastructure
![](img/ecosystem.webp)
The vast majority is already covered. Additional support is easy to add through independent efforts.
</script></section>
<section data-markdown><script type="text/template">
## Remote-Process "cannot-move" Data
![](img/remoteanalysis.webp)
Enables utilization of data resources that cannot be handed out for legal, technical or other reasons.
</script></section>
<section data-markdown><script type="text/template">
## Reproducible HPC workflows
![](img/hpcworkflows.webp)
Enhances trust in computational outcomes through automatically verified reproducibility, even for users that have no access to the original compute resources.
<note>Wagner, Waite, Wierzba, Hoffstaedter, Waite, Poldrack, Eickhoff, Hanke (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data. Scientific Data, 9, 80.</note>
</script></section>
<section data-markdown><script type="text/template">
## Reproducible publications
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/nhLqmF58SLQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- Oldest example: Peer-reviewed paper published in Behavior Research Methods in 2020<br>[[DOI 10.3758/s13428-020-01428-x](https://doi.org/10.3758/s13428-020-01428-x)]<!-- .element: style="font-size:70%" -->
- See http://handbook.datalad.org/r.html?reproducible-paper and https://youtube.com/datalad
<!-- .element: style="font-size:70%" -->
Note:
- VERY useful prior publication
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Automated data catalogs
![SFB1451 catalog](img/SFB1451_catalog_screenshot.png)<!-- .element: style="width:49%" -->
![NN catalog](img/naturalistic_imaging_catalog.webp)<!-- .element: style="width:49%" -->
Improves (global) findability, populated from existing metadata
<note>Example: https://data.sfb1451.de</note>
</script></section>
</section>
<section>
<section data-markdown><script type="text/template">
## How does this work?
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive tracking of research components
![](img/vamp_0_start.png)<!-- .element: width="100%" -->
Well-structured datasets (using community standards), and portable computational environments &mdash; and their evolution &mdash; are the precondition for reproducibility
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# turn any directory into a dataset
# with version control
% datalad create &lt;directory&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# save a new state of a dataset with
# file content of any size
% datalad save
</pre></code>
</td></tr></table>
Note:
- link to prev. statements on description standards
- your community could be really small (your lab); when data are precious, resources
will be spent to understand them, but information must be captured to make this possible
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Capture computational provenance
![](img/vamp_1_provcapture.png)<!-- .element: width="100%" -->
Which data was needed at which version, as input into which code, running with what parameterization in which
computational environment, to generate an outcome?
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# execute any command and capture its output
# while recording all input versions too
% datalad run --input ... --output ... &lt;command&gt;
</pre></code>
</td></tr></table>
Note:
The missing link: even when everything is shared, we still don't know how to start.
README is minimum, but executable prov-records are much better.
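
A minimal sketch of what such an executable provenance record looks like in practice, assuming DataLad and git-annex are installed. The dataset name `analysis` and the files `raw.txt` and `sorted.txt` are hypothetical, and the throwaway Git identity is only there so the example also runs on an unconfigured machine.

```shell
# Sketch only; all names below are made up for illustration.
set -eu
# skip quietly where DataLad is not available
command -v datalad >/dev/null 2>&1 || { echo "datalad not found, skipping"; exit 0; }
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

cd "$(mktemp -d)"
datalad create analysis
cd analysis

# register an input file as a saved dataset state
printf '3\n1\n2\n' > raw.txt
datalad save -m "Add raw input"

# {inputs} and {outputs} are expanded by DataLad to the declared paths
datalad run -m "Sort raw values" \
    --input raw.txt --output sorted.txt \
    "sort -n {inputs} > {outputs}"

# the resulting commit message carries the machine-readable run record
git log -1 --format=%B
```

The record printed at the end is what `datalad rerun` later reads to repeat the computation.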
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive capture enables portability
![](img/vamp_2_pushtocloud.png)<!-- .element: width="100%" -->
Precise identification of data and computational environments, combined with provenance records, forms a comprehensive and portable data structure, capturing all aspects of an investigation.
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# transfer data and metadata to other sites and services
# with fine-grained access control for dataset components
% datalad push --to &lt;site-or-service&gt;
</pre></code>
</td></tr></table>
Note:
Does it fly? Can you give it to someone? Or can you take it with you to your new lab?
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Reproducibility strengthens trust
![](img/vamp_3_reproduce.png)<!-- .element: width="100%" -->
Outcomes of computational transformations can be validated by authorized third parties. This enables audits, promotes accountability, and streamlines automated "upgrades" of outputs.
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# obtain dataset (initially only identity,
# availability, and provenance metadata)
% datalad clone &lt;url&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# immediately actionable provenance records
# full abstraction of input data retrieval
% datalad rerun &lt;commit|tag|range&gt;
</pre></code>
</td></tr></table>
Note:
Goal is automated reproducibility, enables assessment of robustness and benchmarking algorithmic developments
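
A hedged end-to-end sketch of the clone-then-rerun cycle, assuming DataLad and git-annex are installed. A local dataset named `original` stands in for a published one; that name, `copy`, `in.txt` and `out.txt` are hypothetical, and the throwaway Git identity only keeps the example runnable on an unconfigured machine.

```shell
# Sketch only; all names below are made up for illustration.
set -eu
# skip quietly where DataLad is not available
command -v datalad >/dev/null 2>&1 || { echo "datalad not found, skipping"; exit 0; }
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

cd "$(mktemp -d)"

# stand-in for a published dataset with one recorded computation
datalad create original
cd original
printf 'b\na\n' > in.txt
datalad save -m "Add input"
datalad run -m "Sort input" --input in.txt --output out.txt \
    "sort {inputs} > {outputs}"
cd ..

# a third party obtains the dataset; file content is not transferred yet
datalad clone original copy
cd copy

# re-execute the record in the most recent commit; the declared input
# is retrieved on demand before the command runs
datalad rerun

# inspect the recomputed result
cat out.txt
```

If the recomputed output is identical to the recorded one, there should be nothing new to save, which is the verification signal.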
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Ultimate goal: (re-)usability
![](img/vamp_4_reuse.png)<!-- .element: width="100%" -->
Verifiable, portable, self-contained data structures that track all aspects of an investigation exhaustively can be (re-)used as modular components in larger contexts &mdash; propagating their traits
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# declare a dependency on another dataset and
# re-use it at a particular state in a new context
% datalad clone -d &lt;superdataset&gt; &lt;url&gt; &lt;path-in-dataset&gt;
</pre></code>
</td></tr></table>
Note:
With these in place, re-usability is a small(er) step
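
A small sketch of declaring such a dependency, assuming DataLad and git-annex are installed. The datasets `shared-data` and `project` and the path `inputs/shared-data` are hypothetical, and the throwaway Git identity only keeps the example runnable on an unconfigured machine.

```shell
# Sketch only; all names below are made up for illustration.
set -eu
# skip quietly where DataLad is not available
command -v datalad >/dev/null 2>&1 || { echo "datalad not found, skipping"; exit 0; }
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.org"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.org"

cd "$(mktemp -d)"

# a standalone dataset that plays the role of a reusable component
datalad create shared-data
(cd shared-data && printf 'hello\n' > README.txt && datalad save -m "Add README")

# a new project that depends on it
datalad create project
cd project

# register shared-data as a subdataset, pinned to its current version
datalad clone -d . ../shared-data inputs/shared-data

# the superdataset now lists the linked component
datalad subdatasets
```

The link is stored in the superdataset's `.gitmodules` file, so the dependency travels with every clone of `project`.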
</script></section>
<section data-markdown><script type="text/template">
## DataLad: Manage (co-)evolution of digital objects
![](img/yoda_decentralized_publishing.png)<!-- .element: width="900" style="margin-bottom:-70px;margin-top:-20px" -->
Consume, create, curate, analyze, publish, and query data with full provenance capture and "universal" metadata support.
<p style="font-size:70%;margin-top:-20px">
DataLad is free and open source (MIT-licensed). http://datalad.org
</p>
<note>
Halchenko, Meyer, Poldrack, ... & Hanke, M. (2021).
DataLad: distributed system for joint management of code, data, and their relationship.
Journal of Open Source Software, 6(63), 3262.
</note>
Note:
- following illustrations contain concrete implementation with datalad
- Software developed to address the needs of long-term maintenance and collaboration on the studyforrest dataset
</script></section>
<section data-markdown><script type="text/template">
## Talk is cheap, show me the code: Git vs. DataLad
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/Yrg6DgOcbPE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.youtube.com/watch?v=Yrg6DgOcbPE
<aside class="notes">
- show git limits: commit a change in a 3rd-level submodule
- show annex limits: get file in a subdataset
- reveal: datalad makes repo-boundaries vanish -- show save -r
</aside>
</script></section>
</section>
<section data-markdown><script type="text/template">
## Extensive documentation and training materials
![](img/datalad_anintroduction_cover.jpg)<!-- .element: width="700" style="margin-top:-20px;margin-bottom:-10px" -->
https://handbook.datalad.org (or ISBN 979-8857037973)
- **educational materials** on technologies &mdash; **targeting researchers**, not developers (executable paper, student supervisor workflow, ...)
- handbook on concepts, workflows, and use cases
- **weekly public (virtual) office hour**
Note:
RDM Education is key. Handbook helps people be more productive, yielding more FAIR resources as an outcome, but not as the main goal.
</script></section>
<section>
<section data-markdown data-transition="none"><script type="text/template">
## Machine-driven metadata reporting
![Screenshots](img/machine_driven_metadata.svg)<!-- .element: style="height:650px;margin-bottom:-30px" -->
Formal "open-world" model, query and validated submission<br>
RDF-compatible *and* simultaneously scripting-ready<br>
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Manual annotation and description
<video data-autoplay width="1280" height="720" controls loop>
<source src="vid/annotate_demo.webm" type="video/webm">
</video>
Preview a live editor: https://annotate.trr379.de/s/demo
</script></section>
<section data-markdown><script type="text/template">
## Full-stack RDM solution
![](img/forgejo.webp)
See https://atris.fz-juelich.de for an FZJ Forgejo-Aneksajo deployment
</script></section>
</section>
<section data-markdown><script type="text/template">
![](img/distribits2025-teaser.webp)
https://distribits.live
</script></section>
<section>
<h2>DataLad contact and more information</h2>
<table>
<tr><td>Website + Demos</td>
<td><a href="http://datalad.org">http://datalad.org</a></td>
</tr><tr><td>Documentation</td>
<td><a href="http://handbook.datalad.org">http://handbook.datalad.org</a></td>
</tr><tr><td>Talks and tutorials</td>
<td><a href="https://youtube.com/datalad">https://youtube.com/datalad</a></td>
</tr><tr><td>Development</td>
<td><a href="http://github.com/datalad">http://github.com/datalad</a></td>
</tr><tr><td>Support</td>
<td><a href="https://matrix.to/#/#datalad:matrix.org">https://matrix.to/#/#datalad:matrix.org</a></td>
</tr><tr><td>Open data</td>
<td><a href="http://datasets.datalad.org">http://datasets.datalad.org</a></td>
</tr>
<tr><td>Mastodon</td>
<td>@datalad@fosstodon.org</td>
</tr>
</table>
</section>
</div> <!-- /.slides -->
</div> <!-- /.reveal -->
<script src="common/reveal.js/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.1,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.0,
controls: true,
progress: true,
history: true,
center: true,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: 'common/reveal.js/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: 'common/reveal.js/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'common/reveal.js/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'common/reveal.js/plugin/zoom-js/zoom.js', async: true },
{ src: 'common/reveal.js/plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>