Astro 528: High-Performance Scientific Computing for Astrophysics (Fall 2023)
Week 6 Discussion
Admin Announcements
Labs
Review feedback on Lab 4, Ex 2 via GitHub
No lab this week, so can work on your projects
Still meet Wednesday for a group work session
Class project
Review project instructions on website
Serial code should be ready for peer code review by Oct 2
Sign up for project presentation schedule
Click Fork button to create your own repository
Edit README.md in your repository (can use web interface for small changes)
Commit change to your repo and push (if editing local repo)
Create Pull Request to merge your change into the class repository
Serial Version of Code Rubric
Code performs proposed tasks (1 point)
Comprehensive set of unit tests, at least one integration or regression test (1 point)
Code passes tests (1 point)
Student code uses a version control system effectively (1 point)
Repository includes many regular, small commits (1 point)
Documentation for functions’ purpose and design (1 point)
Comprehensive set of assertions (1 point)
Variable/function names consistent, distinctive & meaningful (1 point)
Useful & consistent code formatting & style (1 point)
Code is modular, rather than having chunks of same code copied and pasted (1 point)
Peer Review Logistics
Access
I'll send you GitHub ID of peer reviewer via Canvas
Make sure reviewer(s) can access your repo
Make sure you can access the repo you are to review
If using Jupyter notebooks, make sure to add Markdown version of code for getting feedback
Install Weave (once): `julia --project=. -e 'import Pkg; Pkg.add("Weave")'`
Convert Jupyter notebook to Markdown (each time you want to update the Markdown): `julia --project=. -e 'using Weave; convert_doc("NOTEBOOK_NAME.ipynb","NOTEBOOK_NAME.jmd")'`
Provide most feedback via GitHub Issues
Q&A
What exactly is "scratch" memory, and what differentiates it from other kinds of memory?
"Scratch" can mean different things depending on context:
A separate physical disk or file system that is intended to be used for temporary files.
E.g., Roar's /storage/scratch/USERID/ provides large storage but autodeletes your files
A portion of memory allocated and reserved for holding scratch data
E.g., preallocating a workspace to be used for auto-differentiation, integration, factoring a matrix, etc.
Garbage collection
Is there a way in Julia where we can easily identify memory that can be released or the program can automatically help us release memory?
That's the garbage collector's job.
But you can make its job easier or harder.
How does one check if the code would cause a memory leak?
If using pure Julia, then garbage collector prevents leaks (at least in theory)
In practice, you can use poor practices that cause it to use lots of memory, e.g.,
Large/many variables in global/module scope
Not organizing code into self-contained functions
Allocating more memory than you really need
Many small allocations
If you call C, Fortran, Python, R, etc., then memory leaks are possible.
Test your code
Use @time or @allocated to count the number/amount of allocations. Does it match what you expect?
In ProfileCanvas.jl, can use @profview_allocs to visually find functions/lines that allocate lots of memory (not necessarily a leak).
How severe will thrashing be in a high-level language such as Julia? Should we worry about it immediately or only during optimization?
Thrashing is a result of programming practices.
When you're a beginning programmer, focus on other things first. Benchmark/profile to find inefficient code and then optimize.
As you gain more experience, you'll start to recognize places where thrashing could occur as you start to write them. In that case, a little planning early on can save work down the road.
What are the main causes of thrashing and how does Julia mitigate it? Specifically, how does Julia’s garbage collection reduce thrashing, if there even is a strong connection?
Lots of small allocations on the heap
Java (probably the first "major" language to have garbage collection built-in) gave garbage collection a bad reputation because it only allows mutable user-defined types (and passes all objects by pointers), making it quite hard to avoid heap allocation of even very small objects.
Julia (and C#) encourage the use of immutable types
Julia passes variables by reference (so it can pass variables that live on the stack)
C# passes variables by value by default (so they stay on stack, but often unnecessary stack allocations) and can pass by reference.
I am still a bit confused about the high-water mark technique for creating flexibly-sized arrays/vectors. How do we determine the logical size that we need to make effective use of memory allocation? Is this just the true size of the vector instead of the actual size?
The idea is to allocate for the maximum possible size. But that only works if you can figure that out in advance.
Is there a different technique for creating flexible array sizes in Julia that doesn't require doubling of your physical size every time you add a row/column?
By doubling the size allocated, you need to change the size much less often than if you increased by 1.
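A minimal sketch of the distinction between logical size and physical size, and of why doubling helps. It is written in Python purely for illustration; the class `GrowableArray`, its `growth` option, and the helper `count_reallocations` are hypothetical names, not part of any library or of the reading.

```python
class GrowableArray:
    """Toy growable array that separates logical length from physical capacity."""

    def __init__(self, growth="double"):
        self._data = [None]      # physical storage; starts with capacity 1
        self.length = 0          # logical size: number of slots actually in use
        self.reallocations = 0   # how many times the storage had to be regrown
        self.growth = growth     # "double" or "one" (hypothetical option names)

    @property
    def capacity(self):
        return len(self._data)

    def push(self, value):
        if self.length == self.capacity:
            # Storage is full: allocate a bigger block and copy everything over.
            if self.growth == "double":
                new_capacity = self.capacity * 2
            else:
                new_capacity = self.capacity + 1
            new_data = [None] * new_capacity
            new_data[:self.length] = self._data
            self._data = new_data
            self.reallocations += 1
        self._data[self.length] = value
        self.length += 1


def count_reallocations(n, growth):
    """Push n values and report (number of reallocations, final capacity)."""
    arr = GrowableArray(growth)
    for i in range(n):
        arr.push(i)
    return arr.reallocations, arr.capacity


print(count_reallocations(1000, "double"))  # (10, 1024)
print(count_reallocations(1000, "one"))     # (999, 1000)
```

With doubling, 1000 pushes trigger only 10 copies, at the cost of up to 2x unused capacity; growing by one triggers a copy on almost every push.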
The reading talks about being cache friendly. If we were performing a search, wouldn't how cache friendly the search is depend on the search algorithm? How do we know what type of search algorithm is being used if we didn't write the code ourselves? How would we know how to optimize the structure of our code based on the search algorithm we are using and how the program/computer accesses memory?
Read the documentation
Choose the algorithm for your problem (e.g., Description of sorting algorithms)
Consider order of algorithm and whether in-place
Most often, I choose the algorithm that's best fit for my data.
But sometimes I might change the data structure to be a better fit for my algorithm
1     1.03402
7    -1.07527
5    -0.179478
6    -0.153796
9     1.33361
10    1.18966
5    -0.740369
1     0.374814
2    -1.07991
7    -0.294624
5     0.8038
10   -0.543084
9    -0.306371
8    -0.363832
3    -0.298261
1     0.614491
2    -0.410525
3     0.537932
5    -0.521134
4    -0.656754
9     0.718671
8    -1.04147
1    -2.66582
7     0.986666
4     0.87463
9    -0.615275
4    -0.350423
7    -0.0950837
10   -0.0355948
3     1.73756
But be careful... sometimes different algorithms give different results. E.g., whether sorting is stable.
1    -1.1371
1    -2.66582
1     0.237909
1     1.03402
1     1.40668
1     0.374814
1     0.614491
2    -0.546596
2    -0.410525
2    -1.08489
2     0.0929487
2    -0.38207
2    -1.07991
2     0.529639
2    -0.226232
3    -1.91813
3    -0.367755
3     0.279415
3    -1.45468
3     1.17601
9    -0.615275
9    -0.442758
9     0.928034
10    1.03915
10   -2.12387
10   -0.543084
10    1.55775
10    1.20597
10    1.18966
10   -0.0355948
1     1.03402
1     0.374814
1     0.614491
1     0.237909
1    -1.1371
1     1.40668
1    -2.66582
2    -1.07991
2    -0.410525
10   -0.0355948

1    -1.1371
1    -2.66582
1     0.237909
1     1.03402
1     1.40668
1     0.374814
1     0.614491
2    -0.546596
2    -0.410525
10   -0.0355948
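To make the stability point concrete, here is a small sketch in Python (for illustration only). The three rows and the function `selection_sort_by_key` are made up for this example; they are not the data or the algorithms used to produce the outputs above. Python's built-in `sorted` is guaranteed stable, while a swap-based selection sort is not.

```python
def selection_sort_by_key(rows):
    """Swap-based selection sort that compares only the key (first field).

    Returns a new sorted list. This algorithm is NOT stable: swapping can
    move a row past another row that has an equal key.
    """
    rows = list(rows)
    for i in range(len(rows)):
        # Index of the first row with the smallest key in rows[i:].
        smallest = min(range(i, len(rows)), key=lambda j: rows[j][0])
        rows[i], rows[smallest] = rows[smallest], rows[i]
    return rows


rows = [(2, "a"), (2, "b"), (1, "c")]  # hypothetical (key, label) pairs

stable = sorted(rows, key=lambda r: r[0])
unstable = selection_sort_by_key(rows)

print(stable)    # [(1, 'c'), (2, 'a'), (2, 'b')]  ties keep their input order
print(unstable)  # [(1, 'c'), (2, 'b'), (2, 'a')]  ties were reordered
```

Both results are correctly sorted by key; they differ only in the order of rows with equal keys, which matters if later code relies on that order.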
Resume here on Wednesday
Other Questions
How does memory function from cell-to-cell within a notebook? Is it more efficient to split up code over many cells, or have them operate in the same one? How does this impact runtime and general performance?
It's more efficient to split up code into separate functions (regardless of whether they are in the same cell or not).
There might be a very slight latency cost of having lots of cells. But that's unlikely to be significant unless you are making a really big notebook.
Can you demonstrate how to import a julia program into python?
First, set up PyCall.jl (used to bridge Julia and Python) by running
> julia -e 'import Pkg; Pkg.add("PyCall");'
You only need to do that once (for each system you're running it on). The Python side also needs the julia package (pyjulia), e.g., via pip install julia.
Then from Python or a Jupyter notebook with a Python kernel
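from julia.api import Julia
# compiled_modules=False is often needed when Python is statically linked to libpython
julia = Julia(compiled_modules=False)
from julia import Base
Base.sind(90)                      # call a function from Julia's Base module
from julia import Main as jl
jl.exp(0)
jl.xs = [1, 2, 3]                  # assign a Python list to the Julia variable xs
jl.eval("sin.(xs)")                # evaluate Julia code given as a string
import numpy as np
x = np.array([1.0, 2.0, 3.0])
jl.sum(x)                          # pass a NumPy array to a Julia function
jl.include("my_julia_code.jl")     # load your own Julia source file
jl.function_in_my_julia_code(x)    # call a function defined in that file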
Inside Jupyter notebook:
In [1]: %load_ext julia.magic
Initializing Julia runtime. This may take some time...
In [2]: %julia [1 2; 3 4] .+ 1
Out[2]:
array([[2, 3],
       [4, 5]], dtype=int64)
In [3]: arr = [1, 2, 3]
In [4]: %julia $arr .+ 1
Out[4]:
array([2, 3, 4], dtype=int64)
In [5]: %julia sum(py"[x**2 for x in arr]")
Out[5]: 14
Data structures
What [is] a linked list?... how we might find it useful in our projects?
Use array when:
Know size at time of creation (or won't need to change size often)
Value fast access to elements (not just the beginning/end)
Value not allocating more memory than necessary
Very common for scientific, performance-sensitive code
Use linked list when:
Likely to insert elements and/or change size often
Don't mind taking longer to access elements (other than beginning/end)
Value not allocating (much) more memory than necessary
Useful for frequent sorting
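The array versus linked list tradeoff above can be sketched with a minimal singly linked list. This is an illustration in Python, not a recommended implementation; `Node`, `LinkedList`, and their methods are hypothetical names chosen for this example.

```python
class Node:
    """One element of a singly linked list: a value plus a link to the next node."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # O(1): no existing element is moved or copied.
        self.head = Node(value, self.head)

    def insert_after(self, node, value):
        # O(1) once you already hold a reference to `node`.
        node.next = Node(value, node.next)

    def get(self, index):
        # O(index): must walk the links from the head, unlike array indexing.
        node = self.head
        for _ in range(index):
            node = node.next
        return node.value

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next


lst = LinkedList()
for v in (3, 2, 1):
    lst.push_front(v)             # list is now 1 -> 2 -> 3
lst.insert_after(lst.head, 1.5)   # list is now 1 -> 1.5 -> 2 -> 3

print(list(lst))   # [1, 1.5, 2, 3]
print(lst.get(2))  # 2
```

Insertion never shifts other elements, but each node is a separate small allocation and random access requires walking the list, which is why arrays are usually preferred in performance-sensitive scientific code.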
Other common data structures to consider...
Hash table (aka dictionary/Dict) when:
Elements unlikely to be accessed in any particular order
Value pretty fast access to individual elements
Don't mind allocating significantly more memory than necessary
Useful for scripting, non-performance sensitive code
- source Wikimedia, Jorge Stolfi, CC-BY-SA-3.0
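As a rough illustration of the hash table tradeoffs listed above, the sketch below compares looking up a value by key in a Python `dict` with scanning a list of pairs. The names `star_0` ... `star_999` and the magnitudes are made-up placeholder data, and the size comparison reflects CPython's container overhead only, so treat it as indicative rather than exact.

```python
import sys

# Hypothetical catalog: 1000 made-up star names with made-up magnitudes.
names = [f"star_{n}" for n in range(1000)]
mags = [10.0 + 0.001 * n for n in range(1000)]

as_list = list(zip(names, mags))  # ordered pairs, compact
as_dict = dict(as_list)           # hash table keyed by name


def lookup_list(key):
    """Linear scan: O(n) comparisons in the worst case."""
    for name, mag in as_list:
        if name == key:
            return mag
    raise KeyError(key)


def lookup_dict(key):
    """Hash lookup: roughly constant time on average."""
    return as_dict[key]


print(lookup_list("star_999") == lookup_dict("star_999"))  # True

# Container overhead only (not the stored objects themselves).
print(sys.getsizeof(as_list), sys.getsizeof(as_dict))
```

The dictionary finds any key without scanning, but its table typically reserves noticeably more memory than the list holding the same pairs.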
Some form of tree when:
Elements have a meaningful order
Value fast adding/removing and searching of elements
Value not allocating (much) more memory than necessary
Don't mind taking longer to access individual elements
Willing to spend some time maintaining well-ordered tree
Common in database type applications
Generic tree (not particularly useful)
- source Wikimedia
Balanced binary tree
- source Wikimedia
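A minimal sketch of a binary search tree, in Python for illustration, showing why it matters to keep the tree well ordered and balanced. This toy version does no rebalancing (real libraries typically use self-balancing variants), and all names here (`TreeNode`, `insert`, `contains`, `in_order`, `height`) are hypothetical.

```python
class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the subtree rooted at root; return the (possibly new) root."""
    if root is None:
        return TreeNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored


def contains(root, key):
    """Search by following one branch per level: cost is proportional to height."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


def in_order(root):
    """Visit keys in sorted order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)


def height(root):
    return 0 if root is None else 1 + max(height(root.left), height(root.right))


balanced = None
for k in [5, 2, 8, 1, 3, 7, 9]:   # insertion order that happens to stay balanced
    balanced = insert(balanced, k)

degenerate = None
for k in [1, 2, 3, 5, 7, 8, 9]:   # sorted insertion order: every node goes right
    degenerate = insert(degenerate, k)

print(in_order(balanced))                    # [1, 2, 3, 5, 7, 8, 9]
print(height(balanced), height(degenerate))  # 3 7
```

Both trees hold the same keys, but the degenerate one has become a linked list in disguise, so searches take up to 7 steps instead of 3.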
Prepping for Code Review
Make it easy for your reviewer
Provide overview of what it's doing in README.md
Include an example of how to run/use code
What files should they focus on?
What files should they ignore?
Where should they start?
What type of feedback would you be most appreciative of?
Make it easy for you
If you have a large code base, then
Move most code out of notebooks and into .jl files, as functions mature.
@ingredients "src/my_great_code.jl" from Pluto notebooks
include("src/my_great_code.jl") from Jupyter notebooks
Near end of semester, I'll describe how to make your code into a registered package. Then you could simply do using MyGreatPackage.
Organize functions into files
.jl files in src directory
Use test, examples, data, docs, etc. directories as appropriate
Peer Code Review Rubric
Constructive suggestions for improving programming practices (1 point)
Specific, constructive suggestions for improving code readability/documentation (1 point)
Specific, constructive suggestions for improving tests and/or assertions (1 point)
Specific, constructive suggestions for improving code modularity/organization/maintainability (1 point)
Specific, constructive suggestions for improving code efficiency (1 point)
Finding any bugs (if code author confirms) (bonus points?)
(Credit: Manu via BetterProgramming.pub)
Learning from others with more experience conducting Code Reviews
Best Practices for Code Review @ Smart Bear
Review fewer than 400 lines of code at a time
Take your time. Inspection rates should be under 500 LOC per hour
Do not review for more than 60 minutes at a time
Set goals and capture metrics
Authors should annotate source code before the review
Use checklists
Establish a process for fixing defects
Foster a positive code review culture
Embrace the subconscious implications of peer review
Practice lightweight code reviews
How to excel at code reviews @ BetterProgramming
The code improves the overall health of the system
Quick code reviews, responses, and feedback
Educate and inspire during the code review
Follow the standards when reviewing code
Resolving code review conflicts
Demo UI changes as a part of code review
Ensure that the code review accompanies all tests
When focused, do not interrupt yourself to do code review
Review everything, and don’t make any assumptions
Review the code with the bigger picture in mind
Code Review Best Practices @ Palantir
Purpose
Does this code accomplish the author’s purpose?
Ask questions.
Implementation
Think about how you would have solved the problem.
Do you see potential for useful abstractions?
Think like an adversary, but be nice about it.
Think about libraries or existing product code.
Does the change follow standard patterns?
Does the change add dependencies?
Think about your reading experience.
Does the code adhere to coding guidelines and code style?
Does this code have TODOs?
Maintainability
Read the tests.
Does this CR introduce the risk of breaking test code, staging stacks, or integrations tests?
Leave feedback on code-level documentation, comments, and commit messages.
Was the external documentation updated?
Peer Code Review Rubric = Reviewer's checklist
Constructive suggestions for improving programming practices
Specific, constructive suggestions for improving code readability/documentation
Specific, constructive suggestions for improving tests and/or assertions
Specific, constructive suggestions for improving code modularity/organization/maintainability
Specific, constructive suggestions for improving code efficiency
Finding any bugs (bonus points?)
Old Questions
Q: Is it always best to ensure code passes all tests before anything else?
A1: Does anyone want to share counterexamples?
A2: Story of Kepler database
Q: Even with a readme file and annotations, what are ways we can determine what the code does and whether the approach is correct or wrong?
Ask for clarification