Cannot run a Spark job with python #39

Open
xiayank opened this issue May 23, 2017 · 10 comments

Comments

@xiayank

xiayank commented May 23, 2017

I can run the Spark job with spark-submit, but running it directly with python fails with ModuleNotFoundError: No module named 'py4j'.
Here is the log:

NIC@Yan-Mac ~/Documents/504_BankEnd/DemoCode/week7_codelab1 $ python demo0.py demo1.txt
Traceback (most recent call last):
  File "demo0.py", line 2, in <module>
    from pyspark import SparkContext
  File "/usr/local/spark/python/pyspark/__init__.py", line 44, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/spark/python/pyspark/context.py", line 29, in <module>
    from py4j.protocol import Py4JError
ModuleNotFoundError: No module named 'py4j'

Here are my environment variables.

export PATH="/usr/local/git/bin:/sw/bin/:/usr/local/bin:/usr/local/:/usr/local/sbin:/usr/local/mysql/bin:$PATH"
export SPARK_HOME=/usr/local/spark/
export PATH="$SPARK_HOME/bin:$PATH"
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

# added by Anaconda3 4.3.1 installer
export PATH="/Users/NIC/anaconda/bin:$PATH"
@hackjutsu
Member

@xiayank Can you check whether the following path is still valid?

$SPARK_HOME/python/lib/py4j-0.10.1-src.zip

@xiayank
Author

xiayank commented May 23, 2017

Yes, my version is indeed not that one. Solved.

@xiayank
Author

xiayank commented May 23, 2017

My version is py4j-0.10.4-src.zip. After changing the version in PYTHONPATH, the job runs. Thanks, TA!
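
The hard-coded py4j version in PYTHONPATH is what broke here, so one option is to look the zip up at run time instead. The following is a minimal sketch, not part of the course code: the helper names `find_py4j_zip` and `add_pyspark_to_path` are hypothetical, and it assumes Spark keeps its bundled py4j under `$SPARK_HOME/python/lib/` as in the path quoted above.

```python
import glob
import os
import sys


def find_py4j_zip(spark_home):
    """Return the py4j source zip bundled with Spark, or None if absent."""
    # Match any bundled version (py4j-0.10.1, py4j-0.10.4, ...) instead of one.
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


def add_pyspark_to_path(spark_home):
    """Put Spark's Python sources and its bundled py4j on sys.path."""
    py4j_zip = find_py4j_zip(spark_home)
    if py4j_zip is None:
        raise FileNotFoundError("no py4j-*-src.zip under " + spark_home)
    for entry in (os.path.join(spark_home, "python"), py4j_zip):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return py4j_zip
```

Calling `add_pyspark_to_path(os.environ["SPARK_HOME"])` before `from pyspark import SparkContext` would then keep working across Spark upgrades that ship a different py4j version.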

@xiayank xiayank closed this as completed May 23, 2017
@xiayank xiayank reopened this May 23, 2017
@xiayank
Author

xiayank commented May 23, 2017

@hackjutsu I installed nltk with sudo pip install -U nltk and got the error below. I searched for a long time and could not find a solution.

Collecting six (from nltk)
  Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
  Found existing installation: six 1.4.1
    DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
    Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 778, in install
    requirement.uninstall(auto_confirm=True)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_install.py", line 754, in uninstall
    paths_to_remove.remove(auto_confirm)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_uninstall.py", line 115, in remove
    renames(path, new_path)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/__init__.py", line 267, in renames
    shutil.move(old, new)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
    copy2(src, real_dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
    copystat(src, dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
    os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-xXGrka-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

I tried to uninstall it with sudo -H pip uninstall nltk and then install it again, but got the message Cannot uninstall requirement nltk, not installed.

@hackjutsu
Member

hackjutsu commented May 23, 2017

I suggest running Python inside a virtualenv. The package you are trying to install conflicts with a Python package that ships with macOS.
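
To see which interpreter a script actually runs under, and whether it belongs to a virtual environment, a small check like the one below can help. This is only an illustrative sketch; the function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older releases of the virtualenv tool set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # The stdlib venv module makes sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("in virtualenv:", in_virtualenv())
```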

@xiayank Can you share the code you are testing with? Please reduce it to a minimal example so that others can quickly reproduce the problem.

@hackjutsu
Member

@xiayank I will cover how to use a Python virtual environment in Thursday's CodeLab.

@xiayank
Author

xiayank commented May 24, 2017

@hackjutsu OK, thanks. I will look into it on my own first.

@xiayank
Author

xiayank commented May 24, 2017

@hackjutsu
I used python3 to install and download nltk, and that works. But when I run generate_word2vec_training_data.py, it throws an exception: TypeError: cannot use a string pattern on a bytes-like object. I guess the variable query is of type bytes. I tried query_tokens = cleanData(query.decode()) to convert it to a string, but I get the same error.
I am new to Python. I guess this is caused by the differences between Python 2 and 3.
Here is the log:

TypeError                                 Traceback (most recent call last)
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in <module>()
     30                     title = entry["title"].lower().encode('utf-8')
     31                     query = entry["query"].lower().encode('utf-8')
---> 32                     query_tokens = cleanData(query)
     33
     34

/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in cleanData(input)
     15 def cleanData(input) :
     16     #remove stop words
---> 17     list_of_tokens = [i.lower() for i in wordpunct_tokenize(input) if i.lower() not in stop_words ]
     18     return list_of_tokens
     19

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)
    127         # If our regexp matches tokens, use re.findall:
    128         else:
--> 129             return self._regexp.findall(text)
    130
    131     def span_tokenize(self, text):

TypeError: cannot use a string pattern on a bytes-like object

@xiayank
Author

xiayank commented May 26, 2017

@hackjutsu
After activating the Python virtual environment, running the script with python works. But when I submit it with spark-submit, the log shows the warning UserWarning: Attempting to work in a virtualenv, while the exception still points to `/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)`, so the job apparently still runs under the system Python 3.6. Very strange.

(ENV) NIC@Yan-Mac ~/Documents/504_BankEnd/DemoCode/week7_codelab1 $ spark-submit --master "local[4]" generate_word2vec_training_data.py ads_0502.txt traning_data_0502.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py:706: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please install IPython inside the virtualenv.
  warn("Attempting to work in a virtualenv. If you encounter problems, please "
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in <module>()
     30                     title = entry["title"].lower().encode('utf-8')
     31                     query = entry["query"].lower().encode('utf-8')
---> 32                     query_tokens = cleanData(query)
     33
     34

/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in cleanData(input)
     15 def cleanData(input) :
     16     #remove stop words
---> 17     list_of_tokens = [i.lower() for i in wordpunct_tokenize(input) if i.lower() not in stop_words ]
     18     return list_of_tokens
     19

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)
    127         # If our regexp matches tokens, use re.findall:
    128         else:
--> 129             return self._regexp.findall(text)
    130
    131     def span_tokenize(self, text):

TypeError: cannot use a string pattern on a bytes-like object

@hackjutsu
Member

hackjutsu commented May 26, 2017

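That is because spark-submit uses the system Python by default. Virtualenv only puts the Python path inside ENV at the front of PATH. If spark-submit does not pick its Python based on PATH, it may still end up using the system Python.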

-- Update --
A post for reference?
http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/
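
Rather than relying on PATH, PySpark lets the interpreter be named explicitly through the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables. The environment posted earlier in this thread sets `PYSPARK_DRIVER_PYTHON=ipython`, which would be consistent with the IPython warning in the log. The sketch below builds an environment that points both variables at the current interpreter; the helper name `pyspark_env` is hypothetical, and whether this is enough for a given setup is an assumption to verify.

```python
import os
import sys


def pyspark_env(python_exe=sys.executable, base_env=None):
    """Return a copy of the environment that pins PySpark to python_exe."""
    env = dict(os.environ if base_env is None else base_env)
    # Interpreter used for the Python worker processes.
    env["PYSPARK_PYTHON"] = python_exe
    # Interpreter used for the driver; overrides e.g. "ipython".
    env["PYSPARK_DRIVER_PYTHON"] = python_exe
    # Drop notebook options that only make sense for an IPython driver.
    env.pop("PYSPARK_DRIVER_PYTHON_OPTS", None)
    return env


# Usage from inside the activated virtualenv (not executed here):
#   import subprocess
#   subprocess.call(
#       ["spark-submit", "--master", "local[4]",
#        "generate_word2vec_training_data.py",
#        "ads_0502.txt", "traning_data_0502.txt"],
#       env=pyspark_env(),
#   )
```

Run from the virtualenv's own Python, `sys.executable` is the interpreter inside ENV, so spark-submit would be told to use it for both the driver and the workers.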
