problem on prediction stage #113

Open
lordfiftysix opened this issue Oct 19, 2022 · 9 comments
@lordfiftysix commented Oct 19, 2022

I am attempting to run the following code:

x_test = pandas.read_csv('test(1).csv')
x_test = text_preprocessor.transform(x_test)
predictions = trainer.predict(X_text=x_test)

where x_test consists of a single column with text descriptions,

and I am getting the following output:

predict: 75%|███████▌ | 3/4 [00:00<00:00, 12.85it/s]
/usr/local/lib/python3.7/dist-packages/pytorch_widedeep/training/trainer.py in
--> 581 return np.vstack(preds_l).squeeze(1)

ValueError: cannot select an axis to squeeze out which has size not equal to one

I am wondering how to go about resolving this. I have already tried expanding the dims of x_test and resizing it, but I am still getting the same issue.
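
For context on the traceback: np.squeeze(axis=1) only removes an axis of size one, so np.vstack(preds_l).squeeze(1) fails whenever the stacked predictions have more than one column per sample. A minimal sketch with made-up shapes:

import numpy as np

# predictions are collected batch by batch; here each batch of predictions
# has 6 columns per row (the shapes are illustrative only)
preds_l = [np.random.rand(32, 6), np.random.rand(32, 6)]

stacked = np.vstack(preds_l)   # shape (64, 6)
# this is the call from the traceback; it fails because axis 1 has size 6
# stacked.squeeze(1)           # ValueError: cannot select an axis to squeeze out ...

# with single-column predictions the same call works and returns shape (64,)
single = np.vstack([np.random.rand(32, 1), np.random.rand(32, 1)]).squeeze(1)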

@jrzaurin (Owner) commented Oct 19, 2022

Hey @lordfiftysix,

Could you point me to some code where I can reproduce the error?

I am assuming you have trained the trainer and all that, right?

@lordfiftysix (Author)

yes

@jrzaurin (Owner) commented Oct 19, 2022

Then could you please point me to some code?

Otherwise, maybe I can try later with some dataset I might have and "report back" the results here :)

@lordfiftysix (Author)

I suppose I am fine with the second option

@jrzaurin (Owner)

Here is some fully functioning code:

from sklearn.model_selection import train_test_split

from pytorch_widedeep import Trainer
from pytorch_widedeep.datasets import load_womens_ecommerce
from pytorch_widedeep.models import BasicRNN, WideDeep
from pytorch_widedeep.preprocessing import TextPreprocessor

if __name__ == "__main__":

    df = load_womens_ecommerce(as_frame=True)

    # to be safe, but one can be more gentle here
    df = df.dropna().reset_index(drop=True)

    # just aesthetics
    df.columns = [c.lower().replace(" ", "_") for c in df.columns]

    # the reviews are a bit imbalanced, so we turned the problem into a binary
    # classification problem
    df["target"] = (df.rating >= 4).astype("int")
    text_col = "review_text"
    target = "target"

    # train/test split
    train, test = train_test_split(df, test_size=0.2, stratify=df.target)

    # processing
    text_processor = TextPreprocessor(text_col=text_col)
    X_train = text_processor.fit_transform(train)
    X_test = text_processor.transform(test)

    # model definition. The model component needs to be wrapped with the
    # WideDeep class
    basic_rnn = BasicRNN(
        vocab_size=len(text_processor.vocab.itos),
        embed_dim=100,
        hidden_dim=64,
        n_layers=3,
        bidirectional=True,
        rnn_dropout=0.5,
        padding_idx=1,
        head_hidden_dims=[100, 50],
    )
    model = WideDeep(deeptext=basic_rnn, pred_dim=1)

    # Train
    trainer = Trainer(model, objective="binary")
    trainer.fit(
        X_text=X_train,
        target=train[target].values,
        n_epochs=1,
        batch_size=256,
        val_split=0.2,
    )

    # predict
    preds = trainer.predict(X_text=X_test)

@lordfiftysix (Author) commented Oct 22, 2022

It did not work. I am trying to do multi-output regression. Here is some more of my code.

import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.models import AttentiveRNN, WideDeep
from pytorch_widedeep.preprocessing import TextPreprocessor

train_df = pd.read_csv('train.csv')

# (I also experimented with a TfidfVectorizer and a TabPreprocessor on the
# text column, but those attempts are not used below)
text_preprocessor = TextPreprocessor(text_col='full_text')

text_id = train_df['text_id']
train_df = train_df.drop(['text_id'], axis=1)
train_df = train_df.dropna().reset_index(drop=True)

x = text_preprocessor.fit_transform(train_df)
print(x)
print(x.shape)

cols = cols  # a list with the names of the 6 target columns

rmodel = AttentiveRNN(vocab_size=5741, embed_dim=80)

model = WideDeep(
    deeptext=rmodel,
    pred_dim=6,
)

wd_trainer = Trainer(
    model=model,
    objective='rmse',
    optimizers=torch.optim.AdamW(model.parameters(), lr=0.001),
)

wd_trainer.fit(
    X_text=x,
    target=train_df[cols].values,  # the 6 numerical target columns
    n_epochs=1,
    batch_size=1,
    val_split=0.2,
)

x_test = pd.read_csv('test(1).csv')
x_test = text_preprocessor.transform(x_test)

print(x_test.shape)
x_test = x_test.reshape(80, 3)
# I also tried np.expand_dims(x_test, 2), with the same result
print(x_test.shape)

df_pred = wd_trainer.predict(X_text=x_test)
print(df_pred)

And I am still getting the same error at the prediction stage.

@jrzaurin (Owner)

To do multi-output regression or multi-label classification, we would need to modify the code.

In fact, I don't know what the rmse value it outputs might mean in your case, since the library is designed to work with targets of shape (N, 1), as it is written in the docs: "Losses in this module expect the predictions and ground truth to have the same dimensions for regression and binary classification problems (N_samples, 1). In the case of multiclass classification problems the ground truth is expected to be a 1D tensor with the corresponding classes."

Anyway, if you could point me towards a notebook/Colab with some small dataset or mock data, that would save me a lot of time. Otherwise, I will try to mock some data myself and dig into this later.
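
For anyone who wants to reproduce this without the original CSVs, a minimal sketch of mock data matching the setup described above (one text column, six numeric targets; the score column names here are made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# one free-text column plus six numeric scores, mimicking the setup above
mock_df = pd.DataFrame(
    {"full_text": [f"some short piece of text number {i}" for i in range(n)]}
)
for j in range(6):
    mock_df[f"score_{j}"] = rng.uniform(1.0, 5.0, n).round(1)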

@lordfiftysix (Author)

Hey, I wonder if you were ever able to dig into this problem. I can confirm that I have 6 columns and a few thousand rows as my output, so RMSE probably won't work. That being said, I am struggling to do multi-output regression on these 6 target columns given a single input text column.

@jrzaurin (Owner)

Hey, sorry @lordfiftysix

I am buried at work these days, sorry for the late reply.

No, I did not have the time, sorry 🙁.

Maybe you could consider this as 6 independent problems and then combine the losses?

Alternatively, maybe you could code a custom loss yourself, although this might not be straightforward. I will see if I get a sec towards the end of the week; otherwise, I will see if @5uperpalo can look into it.

@5uperpalo let's have a chat and see if we can code a custom loss that takes multiple inputs and produces a single output.
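
For reference, here is a rough sketch of the kind of custom loss discussed here, written as a plain PyTorch module that computes RMSE per target column and averages the six values. Whether (and how) it can be passed to the Trainer, for example via a custom_loss_function argument, depends on the pytorch_widedeep version, so treat the wiring as an assumption rather than a confirmed API:

import torch
import torch.nn as nn

class MultiColumnRMSELoss(nn.Module):
    """RMSE per target column, averaged over the columns."""

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        # both tensors are expected to have shape (batch_size, n_targets)
        mse_per_col = torch.mean((y_pred - y_true.float()) ** 2, dim=0)
        return torch.sqrt(mse_per_col).mean()

# hypothetical wiring; check the installed version's Trainer signature first
# trainer = Trainer(model, objective="regression",
#                   custom_loss_function=MultiColumnRMSELoss())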

@jrzaurin reopened this Jul 21, 2023