Skip to content

Commit

Permalink
[SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the doc…
Browse files Browse the repository at this point in the history
…umentation base

### What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more details, this PR proposes:
1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you should list up APIs or classes; however, I think this isn't a big issue in PySpark since we're being conservative on adding APIs. I also intentionally listed classes only instead of functions in ML and MLlib to make it relatively easier to manage.

### Why are the changes needed?

Often I hear the complaints, from the users, that current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html compared other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).

It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate.

Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark API documentation will be redesigned.

### How was this patch tested?

Manually tested, and the demo site was made to show.

Closes apache#29188 from HyukjinKwon/SPARK-32179.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
  • Loading branch information
HyukjinKwon committed Jul 27, 2020
1 parent 99f33ec commit 6ab29b3
Show file tree
Hide file tree
Showing 42 changed files with 2,129 additions and 584 deletions.
8 changes: 6 additions & 2 deletions .github/workflows/master.yml
Original file line number Diff line number Diff line change
Expand Up @@ -200,7 +200,9 @@ jobs:
architecture: x64
- name: Install Python linter dependencies
run: |
pip3 install flake8 sphinx numpy
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme
- name: Install R 3.6
uses: r-lib/actions/setup-r@v1
with:
Expand All @@ -218,7 +220,9 @@ jobs:
- name: Install dependencies for documentation generation
run: |
sudo apt-get install -y libcurl4-openssl-dev pandoc
pip install sphinx mkdocs numpy
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
gem install jekyll jekyll-redirect-from rouge
sudo Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2'), repos='https://cloud.r-project.org/')"
- name: Scala linter
Expand Down
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ python/lib/pyspark.zip
python/.eggs/
python/deps
python/docs/_site/
python/docs/source/reference/api/
python/test_coverage/coverage_data
python/test_coverage/htmlcov
python/pyspark/python
Expand Down
2 changes: 1 addition & 1 deletion LICENSE
Original file line number Diff line number Diff line change
Expand Up @@ -223,7 +223,7 @@ Python Software Foundation License
----------------------------------

pyspark/heapq3.py
python/docs/_static/copybutton.js
python/docs/source/_static/copybutton.js

BSD 3-Clause
------------
Expand Down
5 changes: 4 additions & 1 deletion dev/create-release/spark-rm/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,10 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
# These arguments are just for reuse and not really meant to be customized.
ARG APT_INSTALL="apt-get install --no-install-recommends -y"

ARG PIP_PKGS="sphinx==2.3.1 mkdocs==1.0.4 numpy==1.18.1"
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# We should use the latest Sphinx version once this is fixed.
ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1"
ARG GEM_PKGS="jekyll:4.0.0 jekyll-redirect-from:0.16.0 rouge:3.15.0"

# Install extra needed repos and refresh.
Expand Down
18 changes: 17 additions & 1 deletion dev/lint-python
Original file line number Diff line number Diff line change
Expand Up @@ -173,14 +173,30 @@ function sphinx_test {
return
fi

# TODO(SPARK-32279): Install Sphinx in Python 3 of Jenkins machines
PYTHON_HAS_SPHINX=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("sphinx") is not None)')
if [[ "$PYTHON_HAS_SPHINX" == "False" ]]; then
echo "$PYTHON_EXECUTABLE does not have Sphinx installed. Skipping Sphinx build for now."
echo
return
fi

# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
PYTHON_HAS_SPHINX_3_0=$("$PYTHON_EXECUTABLE" -c 'from distutils.version import LooseVersion; import sphinx; print(LooseVersion(sphinx.__version__) < LooseVersion("3.1.0"))')
if [[ "$PYTHON_HAS_SPHINX_3_0" == "False" ]]; then
echo "$PYTHON_EXECUTABLE has Sphinx 3.1+ installed but it requires lower then 3.1. Skipping Sphinx build for now."
echo
return
fi

# TODO(SPARK-32391): Install pydata_sphinx_theme in Jenkins machines
PYTHON_HAS_THEME=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("pydata_sphinx_theme") is not None)')
if [[ "$PYTHON_HAS_THEME" == "False" ]]; then
echo "$PYTHON_EXECUTABLE does not have pydata_sphinx_theme installed. Skipping Sphinx build for now."
echo
return
fi

echo "starting $SPHINX_BUILD tests..."
pushd python/docs &> /dev/null
make clean &> /dev/null
Expand Down
1 change: 1 addition & 0 deletions dev/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,4 @@ jira==1.0.3
PyGithub==1.26.0
Unidecode==0.04.19
sphinx
pydata_sphinx_theme
2 changes: 1 addition & 1 deletion dev/tox.ini
Original file line number Diff line number Diff line change
Expand Up @@ -16,4 +16,4 @@
[pycodestyle]
ignore=E226,E241,E305,E402,E722,E731,E741,W503,W504
max-line-length=100
exclude=python/pyspark/cloudpickle/*.py,heapq3.py,shared.py,python/docs/conf.py,work/*/*.py,python/.eggs/*,dist/*,.git/*
exclude=python/pyspark/cloudpickle/*.py,heapq3.py,shared.py,python/docs/source/conf.py,work/*/*.py,python/.eggs/*,dist/*,.git/*
7 changes: 6 additions & 1 deletion docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,8 +57,13 @@ Note: Other versions of roxygen2 might work in SparkR documentation generation b

To generate API docs for any language, you'll need to install these libraries:

<!--
TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
See also https://github.com/sphinx-doc/sphinx/issues/7551.
-->

```sh
$ sudo pip install sphinx mkdocs numpy
$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
```

## Generating the Documentation HTML
Expand Down
4 changes: 2 additions & 2 deletions docs/_plugins/copy_api_dirs.rb
Original file line number Diff line number Diff line change
Expand Up @@ -126,8 +126,8 @@
puts "Making directory api/python"
mkdir_p "api/python"

puts "cp -r ../python/docs/_build/html/. api/python"
cp_r("../python/docs/_build/html/.", "api/python")
puts "cp -r ../python/docs/build/html/. api/python"
cp_r("../python/docs/build/html/.", "api/python")
end

if not (ENV['SKIP_RDOC'] == '1')
Expand Down
Binary file added docs/img/spark-logo-reverse.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
4 changes: 2 additions & 2 deletions python/docs/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,8 @@
# You can set these variables from the command line.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR ?= .
BUILDDIR ?= _build
SOURCEDIR ?= source
BUILDDIR ?= build

export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.9-src.zip)

Expand Down
90 changes: 0 additions & 90 deletions python/docs/_static/pyspark.css

This file was deleted.

99 changes: 0 additions & 99 deletions python/docs/_static/pyspark.js

This file was deleted.

6 changes: 0 additions & 6 deletions python/docs/_templates/layout.html

This file was deleted.

53 changes: 0 additions & 53 deletions python/docs/index.rst

This file was deleted.

4 changes: 2 additions & 2 deletions python/docs/make2.bat
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,8 @@ REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
set SOURCEDIR=source
set BUILDDIR=build

set PYTHONPATH=..;..\lib\py4j-0.10.9-src.zip

Expand Down
Loading

0 comments on commit 6ab29b3

Please sign in to comment.