class: center, middle, inverse, title-slide # Naming & Organization ## Reproducible Computing
@ JSM 2019 ### Colin Rundel ### July 27, 2019 --- ## Naming things? <br><br><br><br> >There are only two hard things in Computer Science: cache invalidation and naming things. <br/> - Phil Karlton <br><br><br> --- ## Face it - There are going to be files - LOTS of files - The files will change over time - The files will have relationships to each other - It'll will get complicated --- ## Mighty weapon - File organization and naming is a mighty weapon against chaos - Make a file's name and location VERY INFORMATIVE about what it is, why it exists, how it relates to other things - The more things are self-explanatory, the better - READMEs are great, but don't document something if you could just make that thing self-documenting by definition --- ## What works, what doesn't? **NO** ~~~ myabstract.docx Joe’s Filenames Use Spaces and Punctuation.xlsx figure 1.png fig 2.png JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt ~~~ **YES** ~~~ 2014-06-08_abstract-for-sla.docx joes-filenames-are-getting-better.xlsx fig01_scatterplot-talk-length-vs-interest.png fig02_histogram-talk-attendance.png 1986-01-28_raw-data-from-challenger-o-rings.txt ~~~ --- ## Three principles for (file) names 1. Human readable 2. Machine parsable 3. Plays well with OS ordering --- ## Machine Parsable - Search (Regular expression and globbing) friendly: Avoid spaces, punctuation, accented characters, case sensitivity - Easy to compute on: Deliberate use of delimiters --- ## Globbing **Excerpt of complete file listing:** ![players_names](img/naming-glob.png) <br> **Example of globbing to narrow file listing:** ``` datasets rundel$ ls *Players* WorldCup-Players-01.csv WorldCup-Players-03.csv WorldCup-Players-02.csv WorldCup-Players-04.csv ``` --- ## Same using Mac OS Finder search facilities ![players_mac_os_search](img/naming-os-search.png) --- ## Using globs or regexs in R ### glob ```r > library(fs) > dir_ls(glob = "*Players*") WorldCup-Players-01.csv WorldCup-Players-02.csv WorldCup-Players-03.csv WorldCup-Players-04.csv ``` ### regex ```r > library(fs) > dir_ls(regexp = "Players-\\d{2}") WorldCup-Players-01.csv WorldCup-Players-02.csv WorldCup-Players-03.csv WorldCup-Players-04.csv ``` --- ## Punctuation Deliberate use of `"-"` and `"_"` allows recovery of meta-data from the filenames: - Use one to delimit units of meta-data you might want later - Use the other to delimit words - Stay consistent For example: ``` 2019-09-01_Experiment-1_Rep-A.csv 2019-09-01_Experiment-1_Rep-B.csv 2019-09-07_Experiment-1_Rep-C.csv 2019-09-07_Experiment-2_Rep-A.csv ``` --- ## Recap: Machine parsable - Easy to search for files later - Easy to narrow file lists based on names - Easy to extract info from file names, e.g. by splitting - New to regular expressions and globbing? be kind to yourself and avoid + Spaces in file names + Punctuation + Accented characters (Unicode in general) + Different files named `foo` and `Foo` --- ## Human readable - Name contains info on content - Connects to concept of a *slug* from semantic URLs <img src="img/nytimes.png" width="100%" /> --- ## Example Which set of file(name)s do you want at 3 a.m. before a deadline? <img src="img/human-readable-not-options.png" width="90%" style="display: block; margin: auto auto auto 0;" /> --- ## Plays well with OS default ordering - Put something numeric first - Use the ISO 8601 standard for dates - Left pad other numbers with zeros --- ## Examples **Chronological order:** ![chronological_order](img/chronological_order.png) **Logical order:** Put something numeric first ![logical_order](img/logical_order.png) --- ## Dates Use the ISO 8601 standard for dates: YYYY-MM-DD ![chronological_order](img/chronological_order.png) --- ## ISO8601 .center[ ![iso_psa](img/iso_8601.png) ] .footnote[ Source: [XKCD #1179](https://xkcd.com/1179/) ] --- ## Comprehensive map of all countries that use the MM-DD-YYYY format .center[ ![map_mmddyyy](img/map_mmddyyy.png) ] .footnote[ Source: https://twitter.com/donohoe/status/597876118688026624. ] --- ## Left pad other numbers with zeros ```shell > ls 01_data-cleaning.R 02_fit-model.R 10_final-figs-for-publication.R ``` <br> If you don’t left pad, you get this: ```shell > ls 10_final-figs-for-publication.R 1_data-cleaning.R 2_fit-model.R ``` --- class: center, middle # Organizing your workflow --- ## There is no one formula that will work for all projects, but use an organization that will allow - you to come back to the project a year later and resume work fairly quickly - your collaborators to figure out what you did and decided and what files they need to look at - works with your tools not against them --- ## Tip 1 - Use Projects / Project Folders Specifically within RStudio, but also more generally as an organizing principal. * Use one (master) folder per project * Everything related to that project needs to live within that folder. (e.g. data, scripts, etc.) * IDEs, git, R sessions are all designed around this principal (don't fight it) * Organize related files within your folder (e.g. `data/`, `scripts/`, `figures/`) --- ## Aside - Raw data is sacrosanct Raw data is foundational to reproducibility and it is critical to have a auditable log of any changes at all times. Create a folder for it, put it there and never touch it. *BAD*: ```shell project/ - data/ ``` .center[vs.] *GOOD* .pull-left[ ```shell project/ - data-raw/ - data-clean/ ``` ] .pull-right[ ```shell project/ - data/ - raw/ - clean/ ``` ] --- ## Relative vs absolute paths .center[ ![abs path computer on fire](img/jenny_fire.png) ] .footnote[ Source: Jenny Bryan's [Zen and the Art of Workflow Maintenance](https://speakerdeck.com/jennybc/zen-and-the-art-of-workflow-maintenance) ] --- ## Working directories Keeping track of working directories can be painful. -- Take a project that looks something like the following: ```shell project/ - project.Rproj - data/ - raw/ - data.csv - clean/ - script/ - 01_clean.R ``` When I run `01_clean.R` what is its working directory? --- ## Using `here` `here` is a package that tries to simplify the process by identifying the root of your project, `project/` in this case and then providing relative paths from that root directory to everything else in your project. ```r here::here() ## [1] "/home/rundel/Desktop/project" here::here("data/raw", "data.csv") ## [1] "/home/rundel/Desktop/project/data/raw/data.csv" ``` --- ## Tips -- - Keep / protect the raw data -- - Give yourself less rope -- - Avoid monoliths (modularize your code / scripts into logical steps) -- - Keep the life cycle of data in mind