Code
Overview
This section covers general coding and data-cleaning principles and practices. For ARTIS-specific data and workflow issues, see the ARTIS manual.
Recommended workflow
The typical lab workflow aims to produce open data and reproducible analysis. This lab typically uses R/RStudio and GitHub.
Steps: each of the following should be contained in its own piece of code (e.g., separate “.R” files, code chunks, or folders):
- Create a version-controlled R project repository within the Seafood Globalization Lab GitHub organization. If a GitHub repository already exists, pull it onto your local machine and work from within it.
Each project folder, whether created by you or pulled from GitHub, should include the following folders:
data - Keep all raw data in this folder and do not modify the raw data
output - Write out all cleaned versions of data (make sure the file name differs from the raw data file name to avoid overwriting and confusion) and analysis output here
figures - Write out all figures here. You may initially want to use descriptive figure names, but ultimately will want to name the files according to the figure numbering in the resulting manuscript (e.g., fig1.png)
scripts or R - Save all R scripts here. If the project will be set up as an R package, then the folder must be named “R”.
Create other folders as needed (e.g., “literature”, “ms”, “archive”, etc.), but add each folder to your .gitignore file so it is not pushed up to GitHub.
You may want to add dates to files that will have multiple versions (e.g., different runs of an analysis, versions of figures, etc.). Use the format YYYY-MM-DD at the beginning of file names so files sort chronologically.
README - A text README file should be created for every project and a full description of the raw data should be included here
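As a sketch, this folder structure can be set up from R at the start of a project (the figure file name below is hypothetical; folder names follow the list above):

```r
# Create the standard project folders if they do not already exist
# (names follow the lab convention described above)
folders <- c("data", "output", "figures", "scripts")

for (f in folders) {
  if (!dir.exists(f)) {
    dir.create(f)
  }
}

# Date-stamp versioned files with a YYYY-MM-DD prefix so they sort correctly
date_prefix <- format(Sys.Date(), "%Y-%m-%d")
fig_file <- file.path("figures", paste0(date_prefix, "_exploratory-plot.png"))
```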
Data Cleaning - do NOT edit the raw data file; instead, do ALL cleaning within your code. The R janitor package is useful for cleaning your data frame. Additional useful R functions include str_to_lower(), case_when(), str_extract(), and separate(). See X for more.
Your final clean data file(s) should be in a tidy (“long”) format. See Wickham for more on tidy format.
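A minimal cleaning sketch using the packages and functions above (the column names, IDs, and recoding rules here are hypothetical, not from a real lab dataset):

```r
library(janitor)
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical raw data with messy column names and inconsistent case
raw <- data.frame(
  `Sample ID` = c("site1_2020", "site2_2021"),
  `Species Name` = c("SALMO SALAR", "Gadus morhua"),
  check.names = FALSE
)

clean <- raw %>%
  clean_names() %>%                            # janitor: "Sample ID" -> sample_id
  mutate(
    species_name = str_to_lower(species_name), # standardize case
    year = str_extract(sample_id, "\\d{4}")    # pull the year out of the ID
  ) %>%
  separate(sample_id, into = c("site", "year_chr"), sep = "_") %>%
  mutate(
    region = case_when(                        # recode sites into regions
      site == "site1" ~ "north",
      site == "site2" ~ "south",
      TRUE ~ NA_character_
    )
  )
```

Each row of `clean` is one observation with one value per column, i.e., tidy format.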
Commenting - There is no such thing as too much commenting and documentation. All code should be extensively commented so that each step of the analysis is clear to any reader, even one who does not know R. Aim for a commented section header for each code chunk along with one comment per line of code. Follow the style guide [FIXIT: add link to style page] for additional notes on commenting code.
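A short sketch of the commenting density to aim for (the analysis and data here are hypothetical):

```r
#### Summarize catch weight by species ----------------------------------------
# Input:  catch_data, one row per landing event (hypothetical example data)
# Output: species_totals, one row per species with total weight in kg

catch_data <- data.frame(
  species = c("cod", "cod", "haddock"),
  weight_kg = c(12.5, 8.0, 4.2)
)

# Sum weight within each species
species_totals <- aggregate(weight_kg ~ species, data = catch_data, FUN = sum)

# Sort so the largest totals appear first
species_totals <- species_totals[order(-species_totals$weight_kg), ]
```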
Plotting and validation - We recommend producing simple scatter plots and boxplots of the cleaned data to check that outliers are not errors.
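For example, a quick validation pass might look like this (the data frame, column names, and outlier threshold are hypothetical choices for illustration):

```r
set.seed(42)  # reproducible example data

# Hypothetical cleaned data frame with one planted data-entry error
landings <- data.frame(
  species = rep(c("cod", "haddock"), each = 50),
  weight_kg = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 5, sd = 1))
)
landings$weight_kg[1] <- 500  # an obvious error to catch

# Save a quick validation plot to the figures folder
dir.create("figures", showWarnings = FALSE)
png(file.path("figures", "validation_weight-by-species.png"))
boxplot(weight_kg ~ species, data = landings,
        main = "Check for outliers in weight by species")
dev.off()

# Flag suspicious values for manual review rather than deleting them
outliers <- subset(landings, weight_kg > quantile(weight_kg, 0.99))
```

Flagged rows should be checked against the raw data before any correction is coded into the cleaning script.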
Saving the cleaned data file - After data cleaning and plotting/validation, output the data in its final, clean format (write this file to the “output” folder). Everyone on the research team should start from the same cleaned dataset. They should also have access to the code that produced the cleaned file so they can refer back to all cleaning steps.
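For instance (the file name is hypothetical; base R's write.csv is shown, readr's write_csv works equally well):

```r
# Write the cleaned data to the output folder under a new name,
# so the raw file in data/ is never overwritten
dir.create("output", showWarnings = FALSE)

# Hypothetical cleaned data frame
cleaned <- data.frame(species = c("cod", "haddock"), weight_kg = c(20.5, 4.2))
write.csv(cleaned, file.path("output", "catch_data_clean.csv"), row.names = FALSE)
```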
Data Analysis - Create separate scripts to process, filter, aggregate, summarize data as needed for different types of analyses.
GitHub
- Make sure you are on the correct branch.
- Pull all changes before beginning work on your code
- All code changes should be pushed to GitHub on a regular basis (it is better to commit more often rather than less)
- Write commit messages that are as descriptive as possible
- In general, do not push the “data,” “output,” or “figures” folders up to GitHub (i.e., include these folders in your .gitignore file) because they are often too large
- Do not push up non-analysis folders (e.g., “literature”, “ms”, “archive” etc.). Instead add to .gitignore
- When setting up a GitHub project, default to private settings. If a project is made public, carefully check for any data that may have been pushed up to ensure it is okay for this to be public
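A .gitignore following these rules might look like the sketch below (folder names taken from the structure above; this is an illustration, not an official lab template):

```
# Large data, output, and figure folders (keep out of version control)
data/
output/
figures/

# Non-analysis folders
literature/
ms/
archive/

# R/RStudio artifacts
.Rhistory
.RData
.Rproj.user/
```

Note that .gitignore only prevents files from being tracked going forward; anything already committed must be removed from the repository history separately.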