<!doctype html>
<html lang="en">
<head>
<title>DataLad beyond Git</title>
<meta name="description" content="DataLad has been built on Git and git-annex as foundational pillars. However, the vast majority of data infrastructures are not Git-aware. Git-annex can work with a much broader array of services, but the need to 'keep the Git repo somewhere' imposes undesirable technical and procedural complexity on users. In this talk I illustrate existing means to take Git-based DataLad datasets to places that Git cannot reach on its own. Moreover, I introduce ongoing work that aims to enable DataLad users to consume non-DataLad resources as native DataLad datasets, and non-DataLad users to consume DataLad resources without DataLad, git-annex, or even Git.">
<meta name="author" content="Michael Hanke">
<meta charset="utf-8">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="common/css/main.css" id="theme">
<script src="common/js/printpdf.js"></script>
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<h1>DataLad beyond Git<br><small>Connecting to the rest of the world</small></h1>
<p>Michael Hanke</p>
<p>
<small>Institute of Neuroscience and Medicine, Brain &amp; Behavior (INM-7),
Research Center Jülich</small><br>
<small>Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf</small></p>
<p><img style="height:50px;margin-bottom:-12px;margin-right:10px" data-src="common/img/mastodon.svg" />@mih@mas.to &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<a href="http://psychoinformatics.de">http://psychoinformatics.de</a></p>
<p style="margin-top:50px"><img style="height:100px;margin-right:100px" data-src="common/img/fzj_logo.svg" />
<img style="height:100px" data-src="common/img/hhu_logo.svg" /></p>
</section>
<section>
<h2>Acknowledgements</h2>
<table>
<tr style="vertical-align:middle">
<td style="vertical-align:middle">
<dl style="margin-bottom:20px">
<dt style="margin-top:20px">DataLad software <br>
&amp; ecosystem</dt>
<dd style="margin-left:5px!important">
<ul style="margin-left:5px!important">
<li>Psychoinformatics Lab, <br>
Research Center Jülich</li>
<li>Center for Open <br>
Neuroscience, <br>
Dartmouth College</li>
<li>Joey Hess (git-annex)</li>
<li><em>>100 additional contributors</em></li>
</ul>
</dd>
</dl>
</td>
<td style="vertical-align:middle">
<div style="margin-top:-20px;margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
<div style="margin-top:-20px">
<img style="height:150px;margin-right:0px" data-src="common/img/nsf.png" />
<img style="height:150px;margin-right:0px;margin-left:40px" data-src="common/img/binc.png" />
<img style="height:150px;margin-left:0px" data-src="common/img/bmbf_datalad.png" />
</div>
<div style="margin-top:-20px">
<img style="height:80px;margin-top:0px;margin-left:30px" data-src="common/img/fzj_logo.svg" />
<img style="height:60px;margin-left:50px;margin-bottom:25px" data-src="common/img/dfg_logo.png" />
</div>
<div style="margin-top:-20px">
<img style="height:100px" data-src="common/img/erc_logo.png" />
<img style="height:60px;margin-bottom:35px" data-src="common/img/erdf.png" />
</div>
<div style="margin-top:-20px">
<img style="height:80px;margin-right:20px;margin-bottom:5px" data-src="common/img/nrw_mkw_logo.png" />
<img style="height:60px;margin-right:20px" data-src="common/img/cbbs_logo.png" />
<img style="height:60px" data-src="common/img/LSA-Logo.png" />
</div>
</td>
</tr>
<tr>
<td colspan=2 width="100%">
<div style="margin-top:0px">
<div style="margin-top:20px;margin-bottom:-50px"><strong>Collaborators</strong></div>
<img style="height:100px;margin:0px;margin-left:100px" data-src="common/img/cbrain_logo.png" />
<img style="height:100px;margin:20px" data-src="common/img/hbp_logo.png" />
<img style="height:100px;margin:20px" data-src="common/img/conp_logo.png" />
<img style="height:120px;margin:10px" data-src="common/img/openneuro_logo.png" />
<img style="height:100px;margin:20px" data-src="common/img/ebrain-health-logo.png"/>
<img style="height:100px;margin:20px" data-src="common/img/GIN_logo.png" />
</div>
<div style="margin-top:-20px;text-align:center">
<img style="height:120px;margin:20px" data-src="common/img/sfb1451_logo.png" />
<img style="height:140px;margin:10px" data-src="common/img/brainlife_logo.png" />
<img style="height:100px;margin:20px" data-src="common/img/vbc_logo.png" />
</div>
</td>
</tr>
</table>
</section>
<section data-markdown data-transition="none"><script type="text/template">
## Mindset: Everything is distributed
Resources, people, expertise, services
![](img/collab_mindset.svg)<!-- .element: height="400" -->
<!-- .element: style="float:right" -->
<div style="width:800px">
- **Version-control** is an organizer/safety wrapper around processes and people (including self)
- Progress requires a **collaboration** with an ever changing group of people, across different locations
- Success is an incremental and **sustainable achievement** built on a trustworthy foundation
</div>
*DataLad is a productivity tool for a distributed world*
(inspired by the free software movement)
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## DataLad "world model"
![](img/world_of_git_annex_datalad.svg)
DataLad is an orchestrator for Git and git-annex
</script></section>
<section>
<section data-markdown data-transition="none"><script type="text/template">
## Questions to answer when going distributed
![](img/planning_mindset.svg)<!-- .element: style="margin-top:100px" height="400" -->
<!-- .element: style="float:right" -->
<div style="width:900px">
- Collaborating or depositing? Are updates expected? From whom? How are they provided?
- Git stuff (code, metadata):
  - Where can it live? Who can have it?
  - Service required/desired for collaboration assistance or visibility?
- Data stuff (large and/or binary blobs):
  - Too big to be everywhere?
  - Target audience exactly identical to that of the Git stuff?
  - Does it evolve?
</div>
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Solutions for joint hosting
Git repo and data hosted at the same location/service
- Self-hosted Git repos with annex (related: DataLad RIA store)
- Git-hosting with Git-LFS
- Git-hosting with built-in annex support (GIN)
*Joint-hosting is attractive (complexity low),<br> and possible at any location that Git can reach.*
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Git is a constraint
- Issues
  - Most cloud storage excluded
  - Many institutions implement Git-is-for-code services only
  - Git-LFS is unsatisfactory when data deletion is (frequently) necessary
  - `git-annex export` is single-version, withholds advantages of distributed VCS from consumers
- Solution: Compound hosting
  - Git repo hosted on Git-aware/compatible infrastructure (e.g. GitHub for reach)
  - Data hosted anywhere (cheap enough, large enough, safe enough)
  - Benefit: access managed separately for data vs metadata (think personal data)
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Good enough?
![](img/disappointed_mindset.svg)<!-- .element: style="margin-top:50px;float:right" height="400" -->
<div style="width:900px;text-align:left">
No!
- Minimum of two separate systems to support/maintain
- Often different authorities
- Need to get another service approved for use
But the benefit of separate access management?
- When target audiences for data and metadata are identical, there is no benefit
</div>
</script></section>
</section>
<section>
<section data-markdown data-transition="none"><script type="text/template">
## Git remote helper: datalad-annex
- Deposit a Git repo via git-annex (possibly inside another annex)
- Establish git-annex remote as common interface for Git or git-annex data transport
![](img/git-via-annex_mindset.svg)<!-- .element: style="margin-top:100px" height="400" -->
<!-- .element: style="float:right" -->
<div style="width:900px">
- Idea:
  - Represent a Git remote as two annex keys
    1. Plain-text list of `refs`
    2. Zipped, bare Git repo with the refs
  - Use a custom git-annex key backend (XDLRA) to bypass any content verification and selectively employ git-annex remotes for transport
  - Key names are *not* content-based <br> one deposit per unique remote setup/annex
</div>
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## datalad-annex internals
![](img/datalad_annex_setup.svg)
- `git-fetch`: check `refs`, copy `repo-export`, unpack, fetch
- `git-push`: fetch, push, pack, copy `repo-export`, update `refs`
</script></section>
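<section data-markdown data-transition="none"><script type="text/template">
## datalad-annex internals: toy model

The fetch and push sequences from the previous slide can be sketched with an in-memory stand-in for the annex remote. This is only an illustration, not the actual implementation: the `store` mapping, the `push`/`fetch` helpers, and the key name `XDLRA--refs` (chosen by analogy with `XDLRA--repo-export`) are assumptions, and the real Git operations are left out.

```python
import io
import zipfile

# Toy stand-in for a git-annex remote: a mapping from key name to content.
REFS_KEY = "XDLRA--refs"  # assumed name, by analogy with the key below
REPO_KEY = "XDLRA--repo-export"


def push(store, refs, repo_files):
    """Pack the repository files, deposit the archive, then update the refs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in sorted(repo_files.items()):
            zf.writestr(name, content)
    store[REPO_KEY] = buf.getvalue()
    # refs are updated last, matching the order of steps on the slide
    store[REFS_KEY] = "\n".join(
        f"{sha} {ref}" for ref, sha in sorted(refs.items())
    )


def fetch(store, known_refs_text):
    """Check refs first; only copy and unpack the archive when they differ."""
    refs_text = store.get(REFS_KEY, "")
    if refs_text == known_refs_text:
        return refs_text, None  # nothing new, skip the transfer
    with zipfile.ZipFile(io.BytesIO(store[REPO_KEY])) as zf:
        files = {name: zf.read(name) for name in zf.namelist()}
    return refs_text, files


store = {}
push(store, {"refs/heads/main": "ab34ef11"}, {"HEAD": b"ref: refs/heads/main\n"})
refs_text, files = fetch(store, known_refs_text="")
again = fetch(store, known_refs_text=refs_text)
print(refs_text, sorted(files), again[1])
```

A second `fetch` with unchanged refs returns no files, which is the reason for keeping the small `refs` key separate from the archive in this sketch.
</script></section>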
<section data-markdown data-transition="none"><script type="text/template">
## datalad-annex examples
- URL-encode full `initremote` parameter list
- Parameter expansion support for URL components.
<div style="text-align:left">
Public S3 bucket (export)
```
datalad-annex::?type=S3&encryption=none&bucket=<BUCKET>&exporttree=yes&public=yes
```
Dataverse dataset (by DOI with annex object tree)
```
datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no 🢱
&url=https%3A//demo.dataverse.org&doi=doi:10.70122/MYT/ESTDOI
```
Zipped repo at (localhost) `/tmp/XDLRA--repo-export`
```
datalad-annex::file:///tmp?type=external&externaltype=uncurl&encryption=none 🢱
&url={noquery}/{{annex_key}}
datalad-annex::ssh://localhost/tmp?type=external&externaltype=uncurl& 🢱
encryption=none&url={noquery}/{{annex_key}}
```
Zipped Git repo at https://example.com/.datalad/dotgit/repo.zip
```
datalad-annex::https://example.com
```
</div>
</script></section>
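<section data-markdown data-transition="none"><script type="text/template">
## datalad-annex URLs: encoding sketch

A minimal sketch of how an `initremote` parameter list could be URL-encoded into a `datalad-annex::` URL, using only the Python standard library. The helper name `datalad_annex_url` is hypothetical and not part of DataLad; the parameter values are taken from the Dataverse example.

```python
from urllib.parse import quote, urlencode


def datalad_annex_url(params, base=""):
    """Hypothetical helper: encode initremote parameters as a URL query."""
    # quote() with safe="/" keeps slashes readable and percent-encodes ":"
    query = urlencode(params, safe="/", quote_via=quote)
    return f"datalad-annex::{base}?{query}"


url = datalad_annex_url(
    {
        "type": "external",
        "externaltype": "dataverse",
        "encryption": "none",
        "exporttree": "no",
        "url": "https://demo.dataverse.org",
    }
)
print(url)
```

The `base` argument corresponds to the optional URL in front of the query, such as `file:///tmp` in the uncurl examples.
</script></section>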
<section data-markdown data-transition="none"><script type="text/template">
## datalad-annex summary
- **Not** a high-performance collaboration utility for centralized contributor workflows
- **But** a flexible repository deposition helper: one provider, many consumers
- Confirmed to work with git-annex v8.20211123 or later,<br>
should work with any annex remote implementation
- Available from `datalad-next` extension package
<note>
http://docs.datalad.org/projects/next/en/latest/generated/datalad_next.gitremotes.datalad_annex.html
</note>
</script></section>
</section>
<section>
<section data-markdown data-transition="none"><script type="text/template">
## Longevity: DataLad is a liability
![](img/datalad_liability.svg)
<note>https://social.sciences.re/@zimoun/112036749331120124</note>
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## And while we are at it: Git and git-annex too
![](img/archeology_mindset.svg)<!-- .element: height="400" -->
<!-- .element: style="float:right" -->
<div style="text-align:left;width:900px">
- Imagine finding a CVS repository from 1994 with pointers to data (on tapes) in some language documented in a TROFF-formatted manual...
- Imagine finding a "fast-exported" git-annex repository in 2054 with pointers to data (stored at something reachable by software from 2024)...
- ... two archeology projects
</div>
<div style="margin-top:100px">
**Data preservation demands a data-description-optimized record,<br>
not a data-use-optimized record.**
*But still, we want and need to use today's data, today, with today's tech.*
</div>
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Future of DataLad: Use while you benefit, only
![](img/datalad_world_role.svg)
Convert DataLad datasets to/from a variety of metadata standards.
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Metadata schema for data-distributions
<div style="font-size:90%">
- A schema, **not a new ontology/vocabulary**
- Semantics comprehensively defined, **RDF-serialization supported**
- **Built on W3C PROV-O and DCAT** (embracing ODRL)
- Developed with `linkml`<br>
  (https://linkml.io; generate OWL, SHACL, ... as needed)
- Able to capture multi-version DataLad datasets with redundant availability
- **Key ideas**
  - Primary subject is file content (`DCAT:Distribution`)
  - Open-world attitude: provides key structural elements, but does not prescribe or limit to a particular domain/vocabulary
  - Almost everything has a globally unique identifier
  - Facilitates version-on-read/export metadata workflows
</div>
<note>Work in progress at: https://concepts.datalad.org</note>
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
#### Example
<div style="font-size:70%">
Stored metadata:
```yaml
id: exthisdsver:./some/path.ext
is_distribution_of: exthisdsver:#some/path
relation:
  - id: exthisdsver:#some/path
    meta_type: dldist:Resource
    description: Some tabular data
    is_part_of: exthisdsver:#
  - id: exthisdsver:#
    meta_type: dldist:Resource
    description: A version of a collection of some data
    is_version_of: exthisds:#
  - id: exthisds:#
    meta_type: dldist:Resource
    description: A collection of some data
```
Reported metadata:
```
> linkml-convert -s &lt;schema&gt; -t ttl ↷
    -P exthisdsver=gitsha:ab34ef11/ -P exthisds=datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/ data.yaml
@prefix dldist: <https://concepts.datalad.org/s/distribution/unreleased/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<gitsha:ab34ef11/./some/path.ext> dldist:is_distribution_of <gitsha:ab34ef11/#some/path> ;
    dldist:meta_type "dldist:Distribution"^^xsd:anyURI ;
    dldist:relation <datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/#>,
        <gitsha:ab34ef11/#>,
        <gitsha:ab34ef11/#some/path> .
<datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/#> dldist:description "A collection of some data" ;
    dldist:meta_type "dldist:Resource"^^xsd:anyURI .
<gitsha:ab34ef11/#> dldist:description "A version of a collection of some data" ;
    dldist:is_version_of <datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/#> ;
    dldist:meta_type "dldist:Resource"^^xsd:anyURI .
<gitsha:ab34ef11/#some/path> dldist:description "Some tabular data" ;
    dldist:is_part_of <gitsha:ab34ef11/#> ;
    dldist:meta_type "dldist:Resource"^^xsd:anyURI .
```
Examples online for: Git blob/tree/commit, annex key/remote, DataLad dataset, publication, study, subject, research topic, instrument, data type, funding, agent/entity roles, ...
</div>
</script></section>
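<section data-markdown data-transition="none"><script type="text/template">
#### Example: prefix binding on read

The stored record uses placeholder prefixes (`exthisdsver`, `exthisds`) that are only bound to a concrete version and dataset at export time via the `-P` options. The effect on identifiers can be sketched as below; the `expand` helper is a hypothetical illustration and not how `linkml-convert` is implemented.

```python
def expand(curie, prefixes):
    """Replace a known prefix (text before the first ':') with its binding."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie  # unknown prefixes (e.g. dldist) are left untouched


# Bindings as given with -P on the command line of the example
prefixes = {
    "exthisdsver": "gitsha:ab34ef11/",
    "exthisds": "datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/",
}
record = {
    "id": "exthisdsver:./some/path.ext",
    "is_distribution_of": "exthisdsver:#some/path",
    "is_version_of": "exthisds:#",
    "meta_type": "dldist:Resource",
}
resolved = {key: expand(value, prefixes) for key, value in record.items()}
for key, value in resolved.items():
    print(f"{key}: {value}")
```

The same stored record can thus be reported for any dataset version by supplying different bindings.
</script></section>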
</section>
<section data-markdown data-transition="none"><script type="text/template">
## The future
![](img/future_mindset.svg)<!-- .element: height="400" style="margin-top:200px;float:right" -->
<div style="text-align:left;width:900px">
- Reduction to **only two primary outward-facing "APIs"**
  - A versioned metadata schema<br>
    (any tooling: JSON-LD, RDF, YAML, ...)
  - git-annex external remote protocol<br>
    (rely on, or provide a suitable implementation for a particular data store)
- **Reduced requirements** for "optimal" dataset hosting
  - Stores files/objects
  - (Optionally) accepts file/object metadata<br>
    (for search/discoverability)
- Continued compatibility with Git/git-annex repositories, but this format will be a **choice** for particular use cases, and certain workflows
</div>
**Will git-annex special remotes learn the concept of object metadata?**
</script></section>
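<section data-markdown data-transition="none"><script type="text/template">
## External remote protocol: sketch

To give an impression of what "providing a suitable implementation for a particular data store" can involve, here is a minimal sketch of a request handler for a few messages of the git-annex external special remote protocol, backed by a plain directory. The `handle` function and the directory layout are assumptions for illustration. A real remote would also announce `VERSION 1` on startup, read requests from stdin, and handle further messages such as `GETCOST`.

```python
import shutil
import tempfile
from pathlib import Path


def handle(line, store):
    """Answer one request line; `store` is the directory holding the keys."""
    cmd, _, rest = line.partition(" ")
    if cmd in ("INITREMOTE", "PREPARE"):
        store.mkdir(parents=True, exist_ok=True)
        return f"{cmd}-SUCCESS"
    if cmd == "TRANSFER":
        # the file name comes last and may contain spaces
        direction, key, file = rest.split(" ", 2)
        if direction == "STORE":
            src, dst = file, store / key
        else:
            src, dst = store / key, file
        try:
            shutil.copyfile(src, dst)
        except OSError as exc:
            return f"TRANSFER-FAILURE {direction} {key} {exc}"
        return f"TRANSFER-SUCCESS {direction} {key}"
    if cmd == "CHECKPRESENT":
        state = "SUCCESS" if (store / rest).exists() else "FAILURE"
        return f"CHECKPRESENT-{state} {rest}"
    if cmd == "REMOVE":
        (store / rest).unlink(missing_ok=True)
        return f"REMOVE-SUCCESS {rest}"
    return "UNSUPPORTED-REQUEST"


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "store"
    payload = Path(tmp) / "payload.bin"
    payload.write_bytes(b"data")
    requests = [
        "PREPARE",
        f"TRANSFER STORE KEY1 {payload}",
        "CHECKPRESENT KEY1",
        "REMOVE KEY1",
        "CHECKPRESENT KEY1",
        "GETCOST",
    ]
    replies = [handle(request, store) for request in requests]
print("\n".join(replies))
```

Any store that can put, get, check, and delete an object by name can be driven this way, which matches the reduced hosting requirements listed on the previous slide.
</script></section>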
<section>
<h2>DataLad contact and more information</h2>
<table>
<tr><td>Website</td>
<td><a href="https://datalad.org">https://datalad.org</a></td>
</tr><tr><td>Documentation</td>
<td><a href="https://handbook.datalad.org">https://handbook.datalad.org</a></td>
</tr><tr><td>Talks and tutorials</td>
<td><a href="https://youtube.com/datalad">https://youtube.com/datalad</a></td>
</tr><tr><td>Development</td>
<td><a href="https://github.com/datalad">https://github.com/datalad</a></td>
</tr><tr><td>Support</td>
<td><a href="https://matrix.to/#/#datalad:matrix.org">https://matrix.to/#/#datalad:matrix.org</a></td>
</tr><tr><td>Schema development</td>
<td><a href="https://concepts.datalad.org">https://concepts.datalad.org</a></td>
</tr><tr><td>Social media</td>
<td>@datalad@fosstodon.org</td>
</tr>
</table>
</section>
</div> <!-- /.slides -->
</div> <!-- /.reveal -->
<script src="common/reveal.js/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.1,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.0,
controls: true,
progress: true,
history: true,
center: true,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: 'common/reveal.js/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: 'common/reveal.js/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'common/reveal.js/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'common/reveal.js/plugin/zoom-js/zoom.js', async: true },
{ src: 'common/reveal.js/plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>