To ensure a clean and robust analytic environment, a few dependencies are required. Local Administrator rights are necessary to install some of these. Where possible, I have installed via code rather than using the *.exe installer. SQL Server Management Studio is not included here as that should be installed by ICT.
Chocolatey is a Windows package manager that lets us install and update our toolkit through the CLI. It always requires LocalAdmin to run, and can be run in either cmd.exe or PowerShell. This makes maintaining the tools and versions considerably easier. If WinGet (developed by Microsoft) becomes available, this guide will be updated.
# Administrator: PowerShell
# install chocolatey (the execution policy bypass applies to this process only)
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
# show all installed packages (on Chocolatey 2.x the -l flag was removed; plain `choco list` is local-only there)
choco list -l
# upgrade all installed packages
choco upgrade all
Windows Terminal is an integrated terminal which allows use of many CLI tools in a clean tab-based interface. You will need to download and install it following the instructions on the GitHub page. Once installed, you can set your profiles for the various tools in the settings.json.
# PowerShell
# Note your .config files will live in $HOME
# create a base layer for all project repos
mkdir $HOME\repos
Set the GitBash profile in Windows Terminal to start in %USERPROFILE%\\repos; the rest will probably want to start in %USERPROFILE%.
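As a rough illustration of the profile setting described above, the sketch below builds the kind of entries that sit in the `profiles.list` array of settings.json. The profile names and the Git Bash `commandline` path are assumptions for illustration (check where Git is installed on your machine); `startingDirectory` is the setting that controls where each tab opens.

```python
import json


def make_profile(name, starting_directory, commandline=None):
    """Build one Windows Terminal profile entry as a plain dict."""
    profile = {"name": name, "startingDirectory": starting_directory}
    if commandline is not None:
        profile["commandline"] = commandline
    return profile


profiles = [
    # Assumed install path for Git Bash; adjust to your machine.
    make_profile(
        "GitBash",
        "%USERPROFILE%\\repos",
        "C:\\Program Files\\Git\\bin\\bash.exe",
    ),
    # Every other profile starts in the user folder.
    make_profile("PowerShell", "%USERPROFILE%"),
]

# Print a fragment that can be compared against your own settings.json.
print(json.dumps({"profiles": {"list": profiles}}, indent=2))
```

The output is only a fragment; merge the relevant keys into the existing settings.json by hand rather than overwriting the file.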
# Administrator: PowerShell/cmd
# reset the terminal (imported by chocolatey)
refreshenv
choco install microsoft-windows-terminal
# allows us to run Administrator: PowerShell in Windows Terminal with password
choco install gsudo
Git is essential for version control. Both GitHub and the internal GitLab use the same installation.
# Administrator: PowerShell/cmd
choco install git
Open the config file and add the following to the profile.
# GitBash
# in GitBash, $HOME is /c/Users/$USERNAME
vim ~/.gitconfig
Paste the following to allow alternate usernames for GitLab and GitHub. Remember ESC :w to save and ESC :x to save and exit.
# Git user global config file
[user]
name = Alex Bhattacharya
email = alex.bhattacharya@ukhsa.gov.uk
# set usernames for various git instances
[credential "https://github.com"]
username = alexbhatt
[credential "https://gitlab.phe.gov.uk"]
username = alex.bhattacharya
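To sanity-check which username is set for each host, a file like the one above can be read with a few lines of Python. This is a minimal sketch: the sample text uses placeholder names and hosts, and it leans on the standard-library `configparser`, which handles this simple, unindented layout but is not a full parser for Git's config format.

```python
import configparser

# Placeholder config text in the same shape as the file above.
GITCONFIG = """\
[user]
name = Example User
email = user@example.org
[credential "https://github.com"]
username = example-gh
[credential "https://gitlab.example.org"]
username = example.gl
"""


def credential_usernames(text):
    """Map each credential host URL to its configured username."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    prefix = 'credential "'
    result = {}
    for section in parser.sections():
        if section.startswith(prefix) and section.endswith('"'):
            host = section[len(prefix):-1]
            result[host] = parser[section].get("username")
    return result


print(credential_usernames(GITCONFIG))
```

For the real file, `git config --global --list` remains the authoritative way to see what Git itself has read.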
It may be helpful to set up an SSH key pair for GitLab in GitBash. Paste the public SSH key into the GitLab profile manually.
# GitBash
# generate the SSH key pair
ssh-keygen -t ed25519 -C "[email protected]"
# copy it
cat ~/.ssh/id_ed25519.pub | clip
# test it
ssh -T [email protected]
Vim is a CLI text editor packaged with GitBash and Linux that works through the terminal; read this guide. It's very helpful for quick edits.
# GitBash
# use vim to create a settings file
vim ~/.vimrc
" vim persistent settings
set number
VSCode is another code editor, but this one will allow us to work in WSL and act as a Python IDE. Once I get used to VSCode, I may drop Atom.
# Administrator: PowerShell
choco install vscode -y
Atom is a lightweight text editor made by GitHub. I like it. It has excellent packages for markdown and is very customisable.
# Administrator: PowerShell
choco install atom
refreshenv
# add apm and atom to PATH
$userenv = [System.Environment]::GetEnvironmentVariable("Path", "User")
[System.Environment]::SetEnvironmentVariable("PATH", $userenv + ";$HOME\AppData\Local\atom\bin", "User")
# PowerShell
# syntax highlighting
apm install linter
apm install linter-markdown
apm install linter-ui-default
# markdown support
apm install pp-markdown
apm install markdown-scroll-sync
R is the primary open-source programming language used for scientific work and epidemiology. It can be used in conjunction with Python.
- RStudio: This IDE can run both R and Python code. It also has great markdown support.
- Rtools: This is the toolchain needed to build R packages from source on Windows; it is necessary for package development and is often a dependency in R.
# Administrator: PowerShell
choco install r.project -y
choco install rtools -y
choco install r.studio -y
Use the package manager renv right from the start on a project-by-project basis. There is really no reason not to. It works brilliantly with Docker.
# R
# allow development
install.packages("devtools")
# manage packages
install.packages("renv")
# future updates of renv
renv::upgrade()
# run python in R
install.packages("reticulate")
To render documents from code to PDF, Word or slides, you will need an installation of Pandoc and MiKTeX. This is necessary for any automated reporting.
# Administrator: PowerShell
choco install pandoc miktex
Download and install Python via the exe installer. Python is useful to have installed even if it is not the primary data tool. When we need to use Python properly, we will want to use it through the WSL Linux kernel for a more robust toolset.
- pyenv: install and manage multiple Python versions and environments
- poetry: manage virtual environments and dependencies
- pipenv: similar to poetry; a dependency management tool
- pip: the Python package management tool; you will use this to get the rest of your Python tools, packages and updates
- venv: the environment management tool, like renv; used on a project basis
- conda: an alternative to pip and venv, with Intel-optimised packages for faster execution
# for updating these
python -m pip install --upgrade pip
# upgrade an existing venv environment to the current Python version
python -m venv --upgrade $HOME\repos\VENVDIR
# create a virtual environment
# have a basic python install with a couple data science packages
python -m venv $HOME\repos\penv
cd $HOME\repos\penv
# activate the environment, you'll see (penv) in the CLI after it's activated
.\Scripts\Activate.ps1 # PS
.\Scripts\activate.bat # cmd
source ./Scripts/activate # bash
# add the jupyter kernel (ipykernel must be installed in the venv first)
pip install ipykernel
python -m ipykernel install --user --name=penv
# add your Packages
pip install numpy scipy pandas # manage data
pip install scikit-learn # machine learning
pip install tensorflow # machine learning; install not working
pip install plotly matplotlib seaborn # data visualisation
# save the requirements
pip freeze > requirements.txt
# close up shop
deactivate
# this will allow recall of the environment in the future using
python -m pip install -r requirements.txt
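The requirements.txt written by `pip freeze` is a plain list of `name==version` pins, which makes it easy to inspect or compare between projects. The sketch below is one way to read it; the sample package versions are made up for illustration and the helper name `parse_requirements` is hypothetical.

```python
def parse_requirements(text):
    """Return {package: version} for the pinned 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# Illustrative sample only; real output comes from `pip freeze`.
sample = "numpy==1.19.5\npandas==1.2.3\n# a comment\n\nSciPy==1.6.1\n"
print(parse_requirements(sample))
```

In practice you would pass the contents of the file, for example `parse_requirements(open("requirements.txt").read())`, and compare the result across environments.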
Jupyter is a lightweight IDE for Python and R notebooks that runs in your default browser. Notebooks are very shareable, but I will not really be using them full time; they are more like a scratchpad for analytic ideas.
# PowerShell
pip install jupyterlab notebook
pip install --user ipykernel
# run to launch
jupyter notebook
jupyter lab
To run R within Jupyter, you need to make the R kernel accessible. This will be auto-detected by Jupyter after it's run in R, and only needs to be done once per R installation.
# R
install.packages('IRkernel')
IRkernel::installspec()
To enable WSL, follow the Microsoft guide; for best results activate WSL2.
Docker is a container management system for reproducible environments. It is completely agnostic to the language and tools inside the container. It is necessary to allow us to send analysis to a Kubernetes-based system like the PHE OpenShift high-performance computing cluster.
- WSL activated; can be done without, but it's less efficient
- Download Docker Desktop Installer.exe and run as LocalAdmin
- Add your account to the docker-users group in Administrator: PowerShell
- RESTART the machine to enable the group changes
- RUN Docker Desktop; it will configure to run on start-up after this
# Administrator: PowerShell
net localgroup docker-users "[email protected]" /add
docker image list
docker pull IMAGENAME
- Prerequisite: Python
- Prerequisite: Docker
Airflow is a DAG manager for data pipelines; it can be installed locally or via Docker.
# GitBash
# setup a venv
python -m venv airflow_local
cd airflow_local
source ./Scripts/activate
python -m pip install --upgrade pip
# https://airflow.apache.org/docs/apache-airflow/stable/start/local.html
export AIRFLOW_HOME=~/airflow
AIRFLOW_VERSION=2.0.1
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
# save the environment
pip freeze > requirements.txt
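The constraint URL in the local install above is assembled from two values: the Airflow version and the major.minor version of the running Python. The same logic can be sketched in Python; the function name `constraint_url` is hypothetical, and the URL pattern is copied from the shell variables above.

```python
import sys


def constraint_url(airflow_version, python_version=None):
    """Build the constraints file URL for an Airflow and Python version pair."""
    if python_version is None:
        # Mirror the shell pipeline: keep only major.minor, e.g. "3.8".
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


print(constraint_url("2.0.1", "3.8"))
```

Pinning against the constraints file matters because Airflow has many dependencies, and an unconstrained `pip install` can pull versions that were never tested together.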
# GitBash
mkdir airflow
cd airflow
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.0.1/docker-compose.yaml'
mkdir ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
docker-compose up airflow-init
docker-compose up
- Prerequisite: WSL
Ubuntu is the most common WSL distro; an alternative would be Debian. It is helpful to have a WSL Linux distro available for testing out containers. Note you may not have access to the wider system environment and AD access from within WSL.
- WSL activated
- manually download a Linux distribution as the Microsoft Store is unavailable
- Follow the guide to install
- Set up the root user account the first time you run the distro; this is separate from your AD account and is purely for managing the distro in the closed environment.
NOTE: if you ever really mess it up, just run wsl --unregister Ubuntu-20.04 in PowerShell and then run ubuntu2004.exe again, and it will rebuild the base image.
# Administrator: PowerShell
cd $HOME
# Download distro (Ubuntu 20.04)
Invoke-WebRequest -Uri https://aka.ms/wslubuntu2004 -OutFile .\Downloads\Ubuntu.appx -UseBasicParsing
# Unpack it into the user account folder
Rename-Item .\Downloads\Ubuntu.appx Ubuntu.zip
Expand-Archive .\Downloads\Ubuntu.zip .\AppData\Local\Packages\Ubuntu
# add the unpacked folder to PATH
$userenv = [System.Environment]::GetEnvironmentVariable("Path", "User")
[System.Environment]::SetEnvironmentVariable("PATH", $userenv + ";$HOME\AppData\Local\Packages\Ubuntu", "User")
# Run it and set root user and password when prompted
.\AppData\Local\Packages\Ubuntu\ubuntu2004.exe
Paste the following code after typing sudo vim /home/DNSfix.sh
## hotfix for DNS servers; will need to update once I can get wsl.conf working
## (the [network] / generateResolvConf = false setting belongs in /etc/wsl.conf, not in resolv.conf)
# remove the existing file
sudo rm -f /etc/resolv.conf
# create a new one with the DNS servers
## GoogleDNS
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
echo "nameserver 8.8.4.4" | sudo tee -a /etc/resolv.conf
## OpenDNS
echo "nameserver 208.67.222.222" | sudo tee -a /etc/resolv.conf
cat /etc/resolv.conf
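Two separate files are involved in the DNS fix: /etc/resolv.conf holds the `nameserver` lines, while the `[network]` section with `generateResolvConf = false` is read by WSL from /etc/wsl.conf. The sketch below only renders the text each file would contain, so it can be reviewed before anything is written with sudo. The function names are hypothetical, and the nameserver list is the one used in the script above.

```python
import configparser
import io

# Same servers as the hotfix script: Google DNS, then OpenDNS.
NAMESERVERS = ["8.8.8.8", "8.8.4.4", "208.67.222.222"]


def render_resolv_conf(nameservers):
    """Return the text for /etc/resolv.conf, one nameserver per line."""
    return "".join(f"nameserver {ip}\n" for ip in nameservers)


def render_wsl_conf():
    """Return the text for /etc/wsl.conf that stops resolv.conf regeneration."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the camelCase key exactly as written
    parser["network"] = {"generateResolvConf": "false"}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_resolv_conf(NAMESERVERS))
print(render_wsl_conf())
```

Changes to /etc/wsl.conf only take effect after the distro is restarted, for example with `wsl --shutdown` from PowerShell.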
# WindowsTerminal: Ubuntu-20.04
# The DNS servers don't work, so change them
source /home/DNSfix.sh
# your Vim settings are here if you want to update
# I like to add set number
sudo vim /etc/vim/vimrc
# updates the install with the core dependencies and installed programs
sudo apt update && sudo apt -y upgrade && sudo apt autoremove
# some common dependencies
sudo apt install -y wget curl openssl
sudo apt install -y software-properties-common apt-transport-https gnupg-agent ca-certificates
# libcurl4-openssl-dev and libcurl4-gnutls-dev conflict with each other, so install only one
sudo apt install -y libcurl4-openssl-dev libssl-dev libxml2-dev unixodbc-dev
# msodbcsql17 comes from the Microsoft package repository, which must be added first
sudo apt install -y msodbcsql17
# enable git to work with Windows Credentials
git config --global credential.helper "/mnt/c/Program\ Files/Git/mingw64/libexec/git-core/git-credential-manager-core.exe"
# install python dependencies
sudo apt install -y make build-essential zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev llvm libncurses5-dev xz-utils tk-dev libxmlsec1-dev libffi-dev liblzma-dev
# pyenv
git clone https://github.com/pyenv/pyenv.git ~/.pyenv
cd ~/.pyenv && src/configure && make -C src
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
## RESTART the environment
# poetry (the older get-poetry.py script is deprecated; this is the current official installer)
curl -sSL https://install.python-poetry.org | python3 -
# pipenv
sudo apt install pipenv
# Miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
rm Miniconda3-latest-Linux-x86_64.sh
# install python
sudo apt install libpython3-dev
sudo apt install -y python3 python3-pip python3-venv ipython
pip3 install --user jupyterlab NumPy SciPy pandas Matplotlib seaborn
# to launch jupyter, need --no-browser since in WSL
jupyter lab --no-browser
# then open http://localhost:8888/ in a Windows browser
# install R
# https://support.rstudio.com/hc/en-us/articles/360049776974-Using-RStudio-Server-in-Windows-WSL2
sudo apt install dirmngr
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt install -y r-base r-base-core r-recommended r-base-dev
sudo apt install -y gdebi-core build-essential
# Install RStudio server
wget https://rstudio.org/download/latest/stable/server/bionic/rstudio-server-latest-amd64.deb
sudo gdebi rstudio-server-latest-amd64.deb
sudo rstudio-server start
# login at http://localhost:8787/ using UNIX username:password