Wrong mapping of the raw IDs to the internal IDs #465

benhaf · 2023-03-27T06:53:57Z

Hi,

Description

The mapping from raw user IDs to internal IDs is incorrect when the dataset contains more than 25000 rows. I tried reading the ratings both from a file and from a DataFrame, and in both cases the user IDs are mapped incorrectly. I tested several datasets.
In the code below, I save the training set with its internal IDs to SupriseTrainingSet.csv and then compare Train.txt against SupriseTrainingSet.csv.
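
A minimal sketch of the kind of mapping expected here, in plain Python. It assumes that each raw ID receives the next free integer, starting at 0, the first time it appears; `build_inner_ids` is a hypothetical helper written for illustration and is not Surprise code. The rows are toy data, not taken from Train.txt.

```python
def build_inner_ids(rows):
    """Map raw (user, item, rating) rows to inner IDs.

    Assumption: a raw ID gets the next free integer, starting at 0,
    the first time it is seen. Hypothetical helper, not Surprise code.
    """
    user_ids, item_ids = {}, {}
    mapped = []
    for raw_user, raw_item, rating in rows:
        # setdefault stores the next index only when the raw ID is new
        uid = user_ids.setdefault(raw_user, len(user_ids))
        iid = item_ids.setdefault(raw_item, len(item_ids))
        mapped.append((uid, iid, rating))
    return mapped, user_ids, item_ids


# Toy rows for illustration only
rows = [("1", "225", 2.0), ("1", "154", 5.0), ("2", "286", 4.0), ("2", "225", 3.0)]
mapped, user_ids, item_ids = build_inner_ids(rows)
print(mapped)
```

Under this assumption, every row of one raw user shares a single inner user ID, and two different raw users never share one.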

Steps/Code to Reproduce

from surprise import Dataset, KNNBasic, Reader
import pandas as pd
import csv

# files_dir and folder are path prefixes defined elsewhere (not shown in this report)
train_file = files_dir + folder + "Train.txt"

# Train.txt holds tab-separated "user item rating" rows
reader = Reader(line_format="user item rating", sep="\t")

data = Dataset.load_from_file(train_file, reader=reader)

trainset = data.build_full_trainset()  # creates the training set from the whole dataset

with open(files_dir + folder + "SupriseTrainingSet.csv", "w", newline="") as file:
    writer = csv.writer(file)
    # write each (inner user ID, inner item ID, rating) row to the CSV file
    for row in trainset.all_ratings():
        writer.writerow(row)

algo = KNNBasic()
algo.fit(trainset)
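
To compare the export with Train.txt directly, the inner IDs can be translated back to raw IDs before writing. The following is a sketch, assuming the trainset object exposes `all_ratings()`, `to_raw_uid()` and `to_raw_iid()` as Surprise's `Trainset` does; `raw_rows` and `write_raw_rows` are hypothetical helper names introduced here.

```python
import csv


def raw_rows(trainset):
    """Yield (raw_user, raw_item, rating) for every rating in the trainset."""
    for inner_uid, inner_iid, rating in trainset.all_ratings():
        # translate both inner IDs back to the IDs used in the input file
        yield trainset.to_raw_uid(inner_uid), trainset.to_raw_iid(inner_iid), rating


def write_raw_rows(trainset, file):
    """Write the raw-ID rows as CSV to an already open text file."""
    writer = csv.writer(file)
    writer.writerows(raw_rows(trainset))
```

With the trainset from the code above, this could be called as `write_raw_rows(trainset, file)` inside a `with open(..., "w", newline="") as file:` block. If the mapping is consistent, the written rows should contain the same (user, item, rating) triples as Train.txt, although possibly in a different order.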

Expected Results

#### Original dataset
User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3

#### Internal IDs of Surprise
User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
1 369 4
1 533 5
1 503 3
1 451 1
1 239 4
1 314 4
1 110 4
1 956 4
1 714 4
1 134 4
1 674 4
1 227 5
1 471 1
2 180 5
2 382 5
2 264 4
2 213 3
2 517 1
2 86 1
2 351 5
2 162 5
2 272 2
2 410 4
2 822 2
3 1328 1
3 401 5
3 807 3
3 84 3
3 1074 5
4 415 5
4 589 4

Actual Results

#### Original dataset
User Item Rating
1 225 2
1 154 5
1 73 3
1 43 4
1 199 4
1 34 2
1 227 4
1 94 2
1 74 1
1 76 4
1 181 5
1 105 2
1 253 5
1 200 3
1 61 4
1 93 5
1 272 3
1 53 3
1 174 5
1 193 4
1 161 4
1 129 5
1 195 5
1 9 5
1 156 4
1 262 3
1 99 3
1 21 1
1 35 1
1 123 4
1 104 1
1 148 2
1 184 4
1 249 4
1 54 3
1 66 4
1 107 4
1 8 1
1 145 2
1 102 2
1 134 4
1 125 3
1 165 5
1 49 3
1 114 5
1 32 5
1 252 2
1 209 4
1 153 3
1 26 3
1 137 5
1 133 4
1 217 3
1 245 2
1 24 3
2 286 4
2 292 4
2 313 5
2 272 5
2 290 3
2 10 2
2 312 3
2 280 3
2 281 3
2 14 4
2 296 3
2 1 4
2 279 4
3 332 1
3 339 3
3 350 3
3 319 2
3 352 2
3 260 4
3 336 1
3 348 4
3 345 3
3 271 3
3 346 5
4 327 5
4 357 4
4 329 5
4 288 4
4 300 5
5 457 1
5 2 3

#### Internal IDs of Surprise
User Item Rating
0 0 2
0 1 5
0 2 3
0 3 4
0 4 4
0 5 2
0 6 4
0 7 2
0 8 1
0 9 4
0 10 5
0 11 2
0 12 5
0 13 3
0 14 4
0 15 5
0 16 3
0 17 3
0 18 5
0 19 4
0 20 4
0 21 5
0 22 5
0 23 5
0 24 4
0 25 3
0 26 3
0 27 1
0 28 1
0 29 4
0 30 1
0 31 2
0 32 4
0 33 4
0 34 3
0 35 4
0 36 4
0 37 1
0 38 2
0 39 2
0 40 4
0 41 3
0 42 5
0 43 3
0 44 5
0 45 5
0 46 2
0 47 4
0 48 3
0 49 3
0 50 5
0 51 4
0 52 3
0 53 2
0 54 3
0 369 4
0 533 5
0 503 3
0 451 1
0 239 4
0 314 4
0 110 4
0 956 4
0 714 4
0 134 4
0 674 4
0 227 5
0 471 1
0 180 5
0 382 5
0 264 4
0 213 3
0 517 1
0 86 1
0 351 5
0 162 5
0 272 2
0 410 4
0 822 2
0 1328 1
0 401 5
0 807 3
0 84 3
0 1074 5
0 415 5
0 589 4
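
A mismatch like the one above can be detected without knowing the mapping itself. The sketch below assumes that a correct relabelling of users must preserve the number of ratings each user has, so the sorted per-user counts of both files should be identical. `ratings_per_user` and `same_user_profile` are hypothetical helpers, and the rows are toy data.

```python
from collections import Counter


def ratings_per_user(rows):
    """Count ratings per user ID from (user, item, rating) rows."""
    return Counter(user for user, _item, _rating in rows)


def same_user_profile(original_rows, inner_rows):
    """Return True if both row sets have the same multiset of per-user counts.

    Relabelling users cannot change how many ratings each user has, so a
    difference means the rows are not grouped under users the same way.
    """
    original_counts = sorted(ratings_per_user(original_rows).values())
    inner_counts = sorted(ratings_per_user(inner_rows).values())
    return original_counts == inner_counts


# Toy rows for illustration only
original = [("1", "225", 2), ("1", "154", 5), ("2", "286", 4)]
consistent = [(0, 0, 2), (0, 1, 5), (1, 2, 4)]
collapsed = [(0, 0, 2), (0, 1, 5), (0, 2, 4)]  # every row under user 0

print(same_user_profile(original, consistent))
print(same_user_profile(original, collapsed))
```

This check is only a necessary condition: matching counts do not prove the mapping is right, but differing counts show that it is not a one-to-one relabelling.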


Versions

Windows-10-10.0.22621-SP0
Python 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
surprise 1.1.3

benhaf changed the title ~~Wron mapping of the raw IDs to the internal IDs~~ Wrong mapping of the raw IDs to the internal IDs Mar 28, 2023
