Astro 528: High-Performance Scientific Computing for Astrophysics (Fall 2023)
Week 12 Discussion Topics
Reproducibility & Replicability
Code behind the figures
Sharing code
Package managers & Environments
Creating your own package
Registering your own package
Reproducible Computing Environments
Julia
Docker/Singularity
Q&A
Reproducibility & Replicability
Data behind the figures
NASA grants:
``At a minimum the Data Management Plan (DMP) for ROSES must explain how you will release the data needed to reproduce figures, tables and other representations in publications, at the time of publication. Providing this data via supplementary materials with the journal is one really easy way to do this and it has the advantage that the data and the figures are linked together in perpetuity without any ongoing effort on your part.'' and
``Software, whether a stand-alone program, an enhancement to existing code, or a module that interfaces with existing codes, created as part of a ROSES award, should be made publicly available when it is practical and feasible to do so, and when there is scientific utility in doing so... SMD expects that the source code, with associated documentation sufficient to enable use of the code, will be made publicly available as Open Source Software (OSS) under an appropriately permissive license (e.g., Apache-2, BSD-3-Clause, GPL). This includes all software developed with SMD funding used in the production of data products, as well as software developed to discover, access, visualize, and transform NASA data.'' – NASA SARA DMP FAQ
NSF:
``Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.'' – NSF Data Management Plan Requirements
``Providing software to read and analyze scientific data products can greatly increase value of these products. Investigators should use one of many software collaboration sites, like Github.com. These sites enable code sharing, collaboration and documentation at one location.'' – AST-specific Advice to PIs on the DMP
How to share code
Old-school
Source code for a few functions published as an appendix.
Source code available upon request.
Source code available from my website.
Modern
Practical sharing of evolving code:
Institutional Git server (e.g., PSU's GitLab)
Archiving of code (& data):
Dedicated archive with
Long-term plan
Digital Object Identifier (DOI) for your work
Standard file format
Metadata
Examples:
Zenodo (by CERN)
Dataverse (by Harvard)
ScholarSphere (by Penn State Libraries)
Data Commons (by Penn State EMS)
Problems with sharing non-trivial codes
Compiling for each processor/OS
Linking to libraries
Installing libraries that are needed
Multi-step instructions (different for each OS) that become out-of-date
Package managers
Find package you request
Identify dependencies (direct & indirect).
Find versions that satisfy all requirements
Download requested package & dependencies.
Install requested package & dependencies.
Perform any custom build steps.
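The "identify dependencies (direct & indirect)" step above can be sketched with a toy example. This is only an illustration, not how Julia's package manager is implemented: the registry below is a hand-written, simplified dictionary, and `all_dependencies` is a hypothetical helper name.

```julia
# Toy registry (simplified, hand-written data): package name => its direct dependencies.
toy_registry = Dict(
    "FITSIO"      => ["CFITSIO"],
    "CFITSIO"     => ["CFITSIO_jll"],
    "CFITSIO_jll" => ["LibCURL_jll"],
    "LibCURL_jll" => String[],
)

"""
Return the sorted names of all direct and indirect dependencies of `pkg`,
found by a breadth-first walk of the toy registry.
"""
function all_dependencies(registry::AbstractDict, pkg::AbstractString)
    seen = Set{String}()
    queue = copy(registry[pkg])          # start from the direct dependencies
    while !isempty(queue)
        dep = popfirst!(queue)
        dep in seen && continue          # skip packages that were already visited
        push!(seen, dep)
        append!(queue, get(registry, dep, String[]))   # enqueue indirect dependencies
    end
    return sort!(collect(seen))
end

deps = all_dependencies(toy_registry, "FITSIO")
println(deps)
```

A real package manager also has to pick versions that satisfy every constraint at once, which this sketch leaves out.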
What if you have two projects?
Could let both projects think that they depend on everything the other depends on.
If a dependency breaks, which project(s) break?
What if two projects require different versions?
⇒ Environments
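As a minimal sketch of why separate environments help, the snippet below writes two Project.toml files into separate temporary directories, each asking for a different version of the same package. The project names and version bounds are made up for illustration; only the CSV UUID is taken from the Lab 3 Project.toml in these notes.

```julia
using TOML

# Two hypothetical projects that want different versions of the same package.
requirements = Dict(
    "projectA" => "0.8",
    "projectB" => "0.10",
)

root = mktempdir()
for (name, csv_compat) in requirements
    dir = mkpath(joinpath(root, name))
    project = Dict(
        "deps"   => Dict("CSV" => "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"),
        "compat" => Dict("CSV" => csv_compat),
    )
    # Each directory gets its own Project.toml, i.e. its own environment description.
    open(joinpath(dir, "Project.toml"), "w") do io
        TOML.print(io, project)
    end
end

compat_a = TOML.parsefile(joinpath(root, "projectA", "Project.toml"))["compat"]["CSV"]
compat_b = TOML.parsefile(joinpath(root, "projectB", "Project.toml"))["compat"]["CSV"]
println("projectA wants CSV ", compat_a, "; projectB wants CSV ", compat_b)
```

Because each project carries its own requirements, neither one has to pretend that it depends on what the other needs.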
Environments
Environments allow you to have multiple versions of packages installed and rapidly specify which versions you want made available for the current session. In Julia,
Project.toml: Specifies direct dependencies & version constraints (required)
Manifest.toml: Specifies precise versions of direct & indirect dependencies, so as to offer a fully reproducible environment (optional)
If there is no Manifest.toml, then the package manager can find the most recent versions that satisfy the Project.toml requirements.
julia
starts Julia with the default environment (a separate environment for each minor version number, e.g., 1.9).
julia --project=. or julia --project
starts Julia using the environment specified by Project.toml and Manifest.toml in the current directory (if they don't exist, they will be created when you first add a package).
What do the Project.toml and Manifest.toml files do?
What is the difference between Project.toml and Manifest.toml?
Project.toml from Lab 3:
name = "lab3"
uuid = "3355e5e9-99a6-4e94-be24-d3293f18bccc"
authors = ["Eric Ford <ebf11@psu.edu>"]
version = "0.1.0"
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
FITSIO = "525bcba6-941b-5504-bd06-fd0dc1a4d2eb"
FileIO = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
InteractiveUtils = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
LaTeXStrings = "b964fa9f-0449-5b57-a5c2-d3ea65f4040f"
Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
PlutoTeachingTools = "661c6b06-c737-4d37-b85c-46df65de6f69"
PlutoTest = "cb4044da-4d16-4ffa-a6a3-8cad7f73ebdc"
PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"
PyCall = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0"
Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Manifest.toml from Lab 3:
# This file is machine-generated - editing it directly is not advised
[[Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "84918055d15b3114ede17ac6a7182f68870c16f7"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.3.1"
[[ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"
[[Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"
[[Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
[[BenchmarkTools]]
deps = ["JSON", "Logging", "Printf", "Statistics", "UUIDs"]
git-tree-sha1 = "aa3aba5ed8f882ed01b71e09ca2ba0f77f44a99e"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.1.3"
[[Bzip2_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "c3598e525718abcc440f69cc6d5f60dda0a1b61e"
uuid = "6e34b625-4abd-537c-b88f-471c36dfa7a0"
version = "1.0.6+5"
[[CFITSIO]]
deps = ["CFITSIO_jll"]
git-tree-sha1 = "c860f5545064216f86aa3365ec186ce7ced6a935"
uuid = "3b1b4be9-1499-4b22-8d78-7db3344d1961"
version = "1.3.0"
[[CFITSIO_jll]]
deps = ["Artifacts", "JLLWrappers", "LibCURL_jll", "Libdl", "Pkg"]
git-tree-sha1 = "2fabb5fc48d185d104ca7ed7444b475705993447"
uuid = "b3e40c51-02ae-5482-8a39-3ace5868dcf4"
version = "3.49.1+0"
[[CSV]]
deps = ["Dates", "Mmap", "Parsers", "PooledArrays", "SentinelArrays", "Tables", "Unicode"]
git-tree-sha1 = "b83aa3f513be680454437a0eee21001607e5d983"
uuid = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
version = "0.8.5"
...
Providing both Project.toml and Manifest.toml for an environment maximizes reproducibility (e.g., for code to reproduce figures in a paper). But packages that are meant to be imported by others typically provide only a Project.toml, so the package manager can figure out how best to combine packages. Julia's default registry requires packages to provide [compat] constraints for each dependency.
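To make the difference between the two files concrete, here is a small sketch that parses short excerpts in the format of the Lab 3 files above using Julia's TOML standard library. The excerpts are shortened copies of those listings; the variable names are illustrative.

```julia
using TOML

# Excerpt of a Project.toml: direct dependencies only, identified by name and UUID.
project_text = """
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
"""

# Excerpt of a Manifest.toml: every package (direct or indirect) with an exact version.
manifest_text = """
[[BenchmarkTools]]
deps = ["JSON", "Logging", "Printf", "Statistics", "UUIDs"]
git-tree-sha1 = "aa3aba5ed8f882ed01b71e09ca2ba0f77f44a99e"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.1.3"

[[CFITSIO]]
deps = ["CFITSIO_jll"]
git-tree-sha1 = "c860f5545064216f86aa3365ec186ce7ced6a935"
uuid = "3b1b4be9-1499-4b22-8d78-7db3344d1961"
version = "1.3.0"

[[CSV]]
deps = ["Dates", "Mmap", "Parsers", "PooledArrays", "SentinelArrays", "Tables", "Unicode"]
git-tree-sha1 = "b83aa3f513be680454437a0eee21001607e5d983"
uuid = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
version = "0.8.5"
"""

project  = TOML.parse(project_text)
manifest = TOML.parse(manifest_text)

# In this manifest format each package is stored as a one-element array of tables.
pinned = Dict(name => manifest[name][1]["version"] for name in keys(project["deps"]))

# Packages that appear only in the manifest are indirect dependencies.
indirect = sort!(collect(setdiff(keys(manifest), keys(project["deps"]))))

println("Pinned direct dependencies: ", pinned)
println("Indirect dependencies: ", indirect)
```

The Project.toml says which packages are wanted; the Manifest.toml records exactly which versions were resolved, including packages the project never names directly.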
Project.toml for a simple registered package:
name = "PlutoTeachingTools"
uuid = "661c6b06-c737-4d37-b85c-46df65de6f69"
authors = ["Eric Ford <ebf11@psu.edu> and contributors"]
version = "0.2.13"
[deps]
Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
HypertextLiteral = "ac1192a8-f4b3-4bfe-ba22-af5b92cd3ab2"
LaTeXStrings = "b964fa9f-0449-5b57-a5c2-d3ea65f4040f"
Latexify = "23fbe1c1-3f47-55db-b15f-69d7ec21a316"
Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
PlutoLinks = "0ff47ea0-7a50-410d-8455-4348d5de0420"
PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
[compat]
HypertextLiteral = "0.9"
LaTeXStrings = "1"
Latexify = "0.15, 0.16"
PlutoLinks = "0.1.5"
PlutoUI = "0.7"
julia = "1.7, 1.8, 1.9"
[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
[targets]
test = ["Test"]
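One quick way to see which dependencies lack a [compat] entry is to compare the two tables. The sketch below uses a shortened copy of the listing above; `deps_without_compat` is a hypothetical helper written for illustration, not the check that the registry actually runs.

```julia
using TOML

# Shortened copy of the Project.toml above.
project = TOML.parse("""
[deps]
HypertextLiteral = "ac1192a8-f4b3-4bfe-ba22-af5b92cd3ab2"
Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"

[compat]
HypertextLiteral = "0.9"
PlutoUI = "0.7"
julia = "1.7, 1.8, 1.9"
""")

"Names of dependencies that have no [compat] entry (hypothetical helper)."
function deps_without_compat(project::AbstractDict)
    deps   = keys(get(project, "deps", Dict{String,Any}()))
    compat = keys(get(project, "compat", Dict{String,Any}()))
    return sort!(collect(setdiff(deps, compat)))
end

missing_compat = deps_without_compat(project)
println(missing_compat)
```

In the full listing above, the dependencies without a [compat] entry (Downloads, Markdown, Random) are standard libraries that ship with Julia itself.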
In the readings, they describe package versions as something like x.y.z. What is the difference between x, y, and z, and how do I decide which number my current update should increment?
Semantic Versioning 2.0:
X: Major: Can break things, e.g., improve API.
Y: Minor: Minor changes, new features, bugfixes, refactoring internals, improvements that are unlikely to break things.
Z: Patch: Bugfixes, documentation improvements, low-risk performance upgrades
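The x.y.z scheme maps directly onto Julia's built-in `VersionNumber` type. A minimal sketch of choosing the next version number follows; `bump` is a hypothetical helper name, not part of Julia or Pkg.

```julia
"""
Return the version that follows `v` for a given kind of change
(`:major`, `:minor`, or `:patch`). Components to the right of the
incremented one are reset to zero.
"""
function bump(v::VersionNumber, change::Symbol)
    if change === :major
        return VersionNumber(v.major + 1, 0, 0)
    elseif change === :minor
        return VersionNumber(v.major, v.minor + 1, 0)
    elseif change === :patch
        return VersionNumber(v.major, v.minor, v.patch + 1)
    else
        throw(ArgumentError("change must be :major, :minor, or :patch"))
    end
end

current = v"0.2.13"
println("bugfix release:   ", bump(current, :patch))
println("new feature:      ", bump(current, :minor))
println("breaking change:  ", bump(current, :major))
```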
[compat] allows the developer to specify which versions/upgrades will be allowed.
# A leading caret (^) allows upgrades that would be compatible according to semver
PkgA = "^1.2.3" # [1.2.3, 2.0.0)
PkgB = "^1.2" # [1.2.0, 2.0.0)
PkgC = "^1" # [1.0.0, 2.0.0)
PkgD = "^0.2.3" # [0.2.3, 0.3.0)
# ^ is the default
Example = "0.2.1" # [0.2.1, 0.3.0)
# ~ is more restrictive
PkgA = "~1.2.3" # [1.2.3, 1.3.0)
PkgB = "~1.2" # [1.2.0, 1.3.0)
PkgC = "~1" # [1.0.0, 2.0.0)
# = requires exact equality
PkgA = "=1.2.3" # [1.2.3, 1.2.3]
PkgA = "=0.10.1, =0.10.3" # 0.10.1 or 0.10.3
# - allows for ranges
PkgA = "1.2.3 - 4.5.6" # [1.2.3, 4.5.6]
PkgA = "0.2.3 - 4.5.6" # [0.2.3, 4.5.6]
PkgA = "1.2.3 - 4.5" # 1.2.3 - 4.5.* = [1.2.3, 4.6.0)
PkgA = "1.2.3 - 4" # 1.2.3 - 4.*.* = [1.2.3, 5.0.0)
PkgA = "1.2 - 4.5" # 1.2.0 - 4.5.* = [1.2.0, 4.6.0)
PkgA = "1.2 - 4" # 1.2.0 - 4.*.* = [1.2.0, 5.0.0)
For details and more examples, see documentation.
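The caret rule in the examples above (the left-most non-zero component may not change) can be sketched as a small function. This is an illustration for full x.y.z versions only, not Pkg's own parser; `caret_range` and `caret_allows` are hypothetical names.

```julia
"""
Half-open range [lower, upper) allowed by a caret specifier for a full
x.y.z version: the left-most non-zero component may not change.
"""
function caret_range(v::VersionNumber)
    upper = if v.major > 0
        VersionNumber(v.major + 1, 0, 0)
    elseif v.minor > 0
        VersionNumber(0, v.minor + 1, 0)
    else
        VersionNumber(0, 0, v.patch + 1)
    end
    return (lower = v, upper = upper)
end

"True if `candidate` falls inside the caret range of `spec`."
function caret_allows(spec::VersionNumber, candidate::VersionNumber)
    r = caret_range(spec)
    return r.lower <= candidate < r.upper
end

println(caret_range(v"1.2.3"))
println(caret_range(v"0.2.3"))
println(caret_allows(v"1.2.3", v"1.9.0"))
println(caret_allows(v"0.2.3", v"0.3.0"))
```

For a package at 0.x.y, a minor bump is treated as potentially breaking, which is why "^0.2.3" stops at 0.3.0 rather than 1.0.0.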
Pluto & Package Management/Environments
Pluto has its own package manager!
Automatically creates a new temporary environment for each notebook, based on where it sees using or import and a package name.
Great for reproducibility
Adds a little extra startup time
Each notebook embeds a Project.toml and Manifest.toml
Can edit embedded environment
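import Pkg, Pluto
# activate the environment embedded in the notebook file
Pluto.activate_notebook_environment("~/Documents/hello.jl")
# update the packages recorded in that embedded environment
Pkg.update()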
You can disable Pluto's package manager and use Julia's default package manager by including Pkg.activate(path) anywhere in the notebook (as code, not as text).
begin
import Pkg
# activate an existing project environment that
# can be shared across multiple sessions and/or notebooks
Pkg.activate(Base.current_project())
# load packages that are included in the existing Project.toml & installed
using Plots, PlutoUI, LinearAlgebra
end
This reduces startup cost by reusing an existing environment
But all packages to be used by the notebook must be included in the specified Project.toml and already installed locally.
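The call `Base.current_project()` used above looks for a project file in the current directory and its parent directories, and returns `nothing` if none is found. The sketch below illustrates that idea with a hypothetical helper, `find_project_file`; it is not Julia's actual implementation, and it only looks for the name Project.toml.

```julia
"""
Search `dir` and then its parent directories for a Project.toml file.
Returns the path of the first one found, or `nothing` if there is none.
(Illustrative helper only.)
"""
function find_project_file(dir::AbstractString)
    dir = abspath(dir)
    while true
        candidate = joinpath(dir, "Project.toml")
        isfile(candidate) && return candidate
        parent = dirname(dir)
        parent == dir && return nothing   # reached the filesystem root
        dir = parent
    end
end

# Usage: a temporary project with a nested notebook directory.
root = mktempdir()
touch(joinpath(root, "Project.toml"))
notebooks = mkpath(joinpath(root, "notebooks", "week12"))

found = find_project_file(notebooks)
println(found)
```

Checking for `nothing` before calling `Pkg.activate` gives a clearer failure when a notebook is opened outside any project directory.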