Studying the code, a few questions #3

tongbc · 2019-04-03T02:14:06Z

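Hi, I'm a first-year master's student at Beihang University. I have some questions about using the code; could I ask you about them? My QQ is 624360737. If it's convenient, could you add me? It won't take much of your time. Thank you very much!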

raven4752 · 2019-04-03T02:21:35Z

Hi, adding you on QQ isn't really convenient for me. If you have questions, you can ask them directly in this issue.

tongbc · 2019-04-03T06:08:23Z

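OK, thanks! I've downloaded Su's WeChat w2v corpus and loaded it into gensim, but I see that you use tok.pkl, tok_c.pkl, and embedding_matrix_c.npy. What are these files? Also, is w2v.csv something you converted yourself after running w2v? Could you send me the files, or tell me which code generates them? If it's convenient, you could send them to my email, [email protected]. Thank you very much!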

padeoe · 2019-04-03T08:11:55Z

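> OK, thanks! I've downloaded Su's WeChat w2v corpus and loaded it into gensim, but I see that you use tok.pkl, tok_c.pkl, and embedding_matrix_c.npy. What are these files? Also, is w2v.csv something you converted yourself after running w2v? Could you send me the files, or tell me which code generates them? If it's convenient, you could send them to my email, [email protected]. Thank you very much!

Those three files don't exist beforehand. Look at lines 168, 181, and 207: they generate tok.pkl, tok_c.pkl, and embedding_matrix_c.npy, respectively.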
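
For readers wondering what "generated at runtime" can look like, here is a minimal sketch of a build-once, load-afterwards cache. It is only an illustration: `load_or_build` and `build_word_index` are hypothetical names, the vocabulary is made up, and nothing here is taken from the repository's actual code at lines 168, 181, and 207.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the object cached at `path`, building and saving it first if absent."""
    if os.path.exists(path):
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    obj = build()
    with open(path, 'wb') as fh:
        pickle.dump(obj, fh)
    return obj

def build_word_index():
    # Hypothetical stand-in for fitting a tokenizer on the training text.
    texts = [['我', '喜欢', '代码'], ['我', '学习']]
    vocab = sorted({tok for sent in texts for tok in sent})
    # Index 0 is kept free for padding, so word indices start at 1.
    return {tok: i + 1 for i, tok in enumerate(vocab)}

cache_dir = tempfile.mkdtemp()
tok_path = os.path.join(cache_dir, 'tok.pkl')
first = load_or_build(tok_path, build_word_index)   # builds and writes tok.pkl
second = load_or_build(tok_path, build_word_index)  # reads tok.pkl back
```

With a pattern like this, the first run creates the file and later runs reuse it, which would explain why such files are not shipped with the code.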

raven4752 · 2019-04-03T11:15:36Z

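As @padeoe said, those three files are indeed generated at runtime. w2v.csv comes from https://kexue.fm/archives/4304 ; I only converted it to CSV because of a platform restriction, and the content is the same. You can download the files from there and then run: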

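    from gensim.models import word2vec
    # Load the pretrained model downloaded from the link above.
    model = word2vec.Word2Vec.load('word2vec_wx')
    # Export the vectors in word2vec text format (space-separated).
    model.wv.save_word2vec_format('w2v.csv', binary=False)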

Also change the last line of input_online.py to:

    df2 = pd.read_csv('input/w2v.csv', encoding='utf-8', header=None, sep=' ', quoting=3)

tongbc · 2019-04-04T01:30:44Z

Thank you so much! I'll give it a try. I really appreciate it.

raven4752 · 2019-04-04T01:44:38Z

You're welcome. If you have more questions, feel free to reopen this issue.

tongbc · 2019-04-08T01:45:57Z

Hi, I generated the CSV successfully, but loading it keeps raising the error below. Have you run into it?
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 257
I still haven't been able to solve it.

raven4752 · 2019-04-08T02:02:08Z

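This is probably caused by the first line of the exported file, which holds two values: the number of word vectors and the vector dimension. You can delete that line, or read the file with:

    df2 = pd.read_csv('input/w2v.csv', encoding='utf-8', header=None, sep=' ', quoting=3, skiprows=1)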
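
To see why the header line matters, here is a small self-contained sketch. The sample text is made up (two 3-dimensional vectors standing in for the real 256-dimensional file), and it assumes a pandas version that exposes `pandas.errors.ParserError`.

```python
import io
import pandas as pd

# Made-up stand-in for w2v.csv in word2vec text format:
# first line is "<number of vectors> <dimension>", then one word per line.
sample = "2 3\n的 0.1 0.2 0.3\n我 0.4 0.5 0.6\n"

# Without skiprows, the parser takes the column count from the 2-field
# header line and then fails on line 2, which has 1 + 3 fields.
try:
    pd.read_csv(io.StringIO(sample), header=None, sep=' ', quoting=3)
    failed = False
except pd.errors.ParserError as exc:
    failed = True
    print(exc)

# Skipping the header line lets every remaining row have the same width.
df = pd.read_csv(io.StringIO(sample), header=None, sep=' ', quoting=3, skiprows=1)
print(df.shape)
```

Here `quoting=3` is the value of `csv.QUOTE_NONE`, so quote characters that appear as words are not treated as field delimiters.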

tongbc · 2019-04-08T02:14:04Z

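@raven4752 Hi, that doesn't seem to be the problem. For some reason it keeps saying it expected 256 fields but read 257. When I take the existing vectors and split them, they also have 256 values, yet it keeps reporting this error: pandas.io.common.CParserError: Error tokenizing data. C error: Expected 256 fields in line 3, saw 257
Could it be my gensim version? If it's convenient, could you send me a copy of your CSV? I'm stuck at this step and really don't know what to do. Sorry for the trouble.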

raven4752 · 2019-04-08T03:01:09Z

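Hi, I can no longer find the vectors I processed earlier. As I recall, the error was caused by the vectors of some tokens that are whitespace characters; you can delete those entries by hand. Alternatively, you can use the word vectors provided by this project; the results should be about the same. After extracting the archive, read the file the same way:

    df2 = pd.read_csv(file_path, encoding='utf-8', header=None, sep=' ', quoting=3, skiprows=1)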
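
If deleting the whitespace tokens by hand is tedious, one possible alternative is to parse the text format line by line and drop those entries. This is a rough sketch under assumptions, not the method used in this repository: `load_word2vec_text` is a hypothetical helper, and the sample data is made up.

```python
import io

def load_word2vec_text(handle):
    """Parse word2vec text format, skipping entries whose token is blank."""
    # Header line: "<number of vectors> <dimension>".
    _vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    skipped = 0
    for line in handle:
        line = line.rstrip('\n')
        if not line:
            continue
        # Split from the right so the last `dim` fields are always the vector,
        # whatever characters the token itself contains.
        parts = line.rsplit(' ', dim)
        if len(parts) != dim + 1 or not parts[0].strip():
            skipped += 1
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors, skipped

# Made-up sample: the second entry's token is a single space character.
sample = "3 2\n的 0.1 0.2\n  0.3 0.4\n我 0.5 0.6\n"
dim, vectors, skipped = load_word2vec_text(io.StringIO(sample))
print(dim, list(vectors), skipped)
```

A token that is itself a space adds an extra separator to its line, which is the kind of row that makes a fixed-width CSV parser report an unexpected field count.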

tongbc · 2019-04-08T07:58:25Z

Thank you so much! It finally works. For the rest, do I just change the earlier 256 dimension to 300? Thanks again.

raven4752 · 2019-04-08T08:07:44Z

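I didn't hard-code the embedding dimension, so it should run as is. If you mean the hidden-layer size of 256, that is a hyperparameter and is unrelated to the embedding dimension.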
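
As an illustration of what "not hard-coded" can mean, the sketch below infers the embedding width from the loaded vectors, so 256-dimensional and 300-dimensional files go through the same code. It assumes a Keras-style word index that starts at 1; `build_embedding_matrix` and the sample data are hypothetical and not taken from this repository.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors):
    """Stack pretrained vectors into a matrix whose width follows the data."""
    # Infer the dimension from the vectors instead of hard-coding 256 or 300.
    dim = len(next(iter(vectors.values())))
    # Row 0 is left as zeros for padding; words without a vector also stay zero.
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

word_index = {'我': 1, '学习': 2, '代码': 3}  # made-up vocabulary
vectors_256 = {'我': [0.1] * 256, '学习': [0.2] * 256}
vectors_300 = {'我': [0.1] * 300, '学习': [0.2] * 300}

m256 = build_embedding_matrix(word_index, vectors_256)
m300 = build_embedding_matrix(word_index, vectors_300)
print(m256.shape, m300.shape)
```

The hidden-layer size is a separate setting of the model and does not appear anywhere in this step.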

tongbc · 2019-04-08T08:15:48Z

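OK, thanks! I'll give it a try. What I meant is that Su's w2v is 256-dimensional while this one is 300-dimensional, so I thought something might need to change.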

raven4752 closed this as completed Apr 4, 2019

raven4752 reopened this Apr 8, 2019
