-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathpyr.qmd
80 lines (43 loc) · 8.13 KB
/
pyr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
# Python vs. R {#sec-pyr}
As we mentioned at the beginning (@sec-intro-lang), use the language you like as long as it gets the job done *well*. The vast majority of data scientists use Python, but R is very common in academia and statistical analysis in general. Both languages have their strengths and weaknesses, and it's worth getting a sense of what they are[^otherlang].
[^otherlang]: There are other languages that can be used for data science, but they are not nearly as common. For example, Julia is a language that is very fast for some things and generally has a lot of potential for data science. But honestly it came too late to the game, and it's not as user-friendly as Python or R. Proprietary tools like Matlab, SAS, Stata, and similar had their day, and it has long since passed, and closed-source tools will never develop as rapidly as popular open source tools. Still other languages whose primary purpose is not data science may be useful for some tasks (e.g., Spark-ML), but they are not as well-suited for as Python or R.
## Our Background {#sec-pyr-context}
Michael has used both R and Python for many years and now uses Python primarily as a reflection of his own movement from academia to industry. Seth also has used both for many years and teaches with both in his courses every semester. Unlike many practitioners, we have a lot of experience with using both for serious data science, which we hope gives us a pretty good perspective on the strengths and weaknesses of each. In the end though, this is just our opinion, and you would do well to consider your specific needs and preferences when choosing a language.
## Python {#sec-pyr-python}
Python is the leading language for machine learning, deep learning, and data science in general. Many other languages can perform ML and maybe even well, but Python is the most popular, has the most packages, and it's where tools are typically implemented and developed first. Even if it isn't your primary language, it should be for any implementation of machine learning or using deep learning models.
### Pros {#sec-pyr-python-pros}
- **Popular Tools**: Very powerful and very widely used tools.
- **Efficiency**: Typically efficient on memory and fast.
- **Modeling API**: Consistent API across many modeling packages (e.g., [sklearn]{.pack}).
- **Pipelines**: Easy pipeline and reproducibility setup.
- **ML/DL Development**: It is *the* standard for machine/deep learning packages.
- **Versatility**: Versatile, general programming language, and can be used for many other things easily.
- **Origins**: Was not developed by statisticians.
### Cons {#sec-pyr-python-cons}
- **Model Exploration**: It can be difficult to get model summaries, visualizations, and interpretable results.
- **Data Processing**: [Pandas]{.pack} is not as intuitive/user-friendly as [tidyverse]{.pack}.
- **Environment fragility**: Python or package updates often break functionality; environments are often frozen to avoid this, which often means using outdated package versions that can cause miscellaneous issues and retain known bugs.
- **Documentation**: Documentation is inconsistent since there is no standard. This hopefully will be alleviated in the future with AI tools that can write the documentation for you.
- **Interactive Development**: Jupyter notebooks lag behind RMarkdown, but Quarto shows promise for improving Python's usability in this area[^quartopy].
[^quartopy]: Quarto is a relatively new markdown tool that attempts to be language agnostic and is designed to be a successor to RMarkdown. In our experience it is already much better than Jupyter for producing an actual document, and nearly approaching the same level for interactive use.
## R {#sec-pyr-r}
R is the leading language for statistical analysis, visualization and reporting analytical results. Your authors can also definitively say that R is actually great at ML and at production level, as they have used it with data comprising millions of data points for very large and well-known companies. The default tools are not as fast or memory efficient relative to Python, but they are typically more user friendly, and usually have good to even excellent documentation, as package development has been largely standardized for some time. When it comes to statistical models like those from Part I, R is the best tool for the job hands down.
### Pros {#sec-pyr-r-pros}
- **Interactive analysis**: Interactive analysis was always the default approach.
- **User-friendly**: User-friendly and fast data processing, modeling, and visualization.
- **Compatibility**: Practically every tool you'd ever use works with data frames.
- **Post-processing**: Easy post-processing of models with many packages.
- **Documentation**: Documentation is standardized for any CRAN package, and almost all non-CRAN packages use the same approach. Examples are expected for documented functions, and the package will fail to build if any example fails.
- **Package Development**: CRAN packages are checked against the latest version of R, all major operating systems, and even other packages that depend on them.
### Cons {#sec-pyr-r-cons}
- **Speed**: Relatively slow for many modeling tasks.
- **Memory**: R is very memory intensive, with modern machines this is less of an issue for many tabular data applications, but can be very limiting for large data sets.
- **Pipelines**: Pipeline approach and/or reproducibility in a production-level sense has only recently been of focus.
- **Production tools**: Production tools are still relatively recent and not as well-developed.
## Conclusion {#sec-pyr-conclusion}
In summary, R is fantastic for Part I models (statistical modeling) and Python for Part II (machine learning). We can also say that both are great for causal modeling depending on what your approach is. We still feel R is notably easier to use for data processing, visualization, and model exploration, producing data science reports/documents, and obviously for statistical modeling. We would likely use Python for anything that requires a lot of data or is computationally expensive, and for machine learning in general. For deep learning models, Python is the obvious choice.
We like tools like Quarto because it makes it easy to use both, even simultaneously within the same document, so the great thing is that you don't have to choose. More to the point, AI tools like CoPilot now make it easier to program in either, leaving you to focus on the data science instead of the programming.
<!--
R's usage appears to be on the decline recently[^rankings], and ultimately it may reside in academia primarily. But this is mostly because the Python data science community has finally come around to realizing that writing custom functions and classes to do standard model exploration and visualization *every freakin' time* you want to know more about your model than what it predicts is a serious waste of time, money, and other resources. On top of that, Jupyter notebooks have not been close to the level of RMarkdown for years, but Python users are ultimately going to be bailed out by the Rmarkdown successor Quarto which attempts to do everything Rmarkdown did while being language agnostic.
but it's still a great tool for data science. Python is the best tool for ML, but it's not as good for many other things. The best tool for data science is the one that gets the job done well, and that's the one you should use. Quarto makes it easy to use both, so you don't have to choose.
[^rankings]: While none are perfect, various programming language rankings are available. The [IEEE spectrum](https://spectrum.ieee.org/the-top-programming-languages-2023) is one where R was as high as #7 in 2021 and was not in the top 10 in 2023. You can look at others like [TIOBE](https://www.tiobe.com/tiobe-index/) or the [Stack Overflow survey](https://survey.stackoverflow.co/2023/#technology) where it is not in the top 20. It's important to note though, that R is almost exclusively used for data science, and most of other languages listed are general purpose languages. Also, those surveys are specific to industry not academia. So while R might be trending downward in some areas, it is still very popular for data science, and will be for a while. -->