2. Preprocessing
2.1. Train test split
Split your dataset into a training set and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
For classification, if you want the train and test sets to keep the same class proportions as the full target, pass the target to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
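A minimal sketch of what stratification does, using hypothetical toy data (the arrays `X` and `y` below and the `random_state=0` seed are assumptions made for illustration, not part of the original recipe). With 10% of samples in class 1, both splits should keep roughly that share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 90 samples of class 0, 10 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions of y in both splits;
# random_state is fixed here only to make the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Share of class 1 in each split (y is 0/1, so the mean is the share)
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(X_train.shape, X_test.shape)
print(train_ratio, test_ratio)
```

Without `stratify`, a small test set drawn from imbalanced data can end up with very few samples of the minority class, or none, which makes the evaluation metrics unreliable.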
2.2. Remove Punctuation in String
# build the set of characters to remove: punctuation and digits
import string
to_remove_characters = string.punctuation + string.digits
# map each of these characters to a space, then apply the mapping to text
# (both arguments of str.maketrans must have the same length)
table_ = str.maketrans(to_remove_characters, ' ' * len(to_remove_characters))
cleaned_text = text.translate(table_)
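As a self-contained sketch, the same idea can be wrapped in a small function. The function name `remove_punctuation_and_digits` and the sample string are hypothetical, chosen only for illustration. The second variant uses the three-argument form of `str.maketrans`, which deletes the characters instead of replacing them with spaces.

```python
import string

def remove_punctuation_and_digits(text: str) -> str:
    """Replace every punctuation character and digit in text with a space."""
    to_remove = string.punctuation + string.digits
    # Two-argument form: characters at the same index are mapped to each other
    table = str.maketrans(to_remove, " " * len(to_remove))
    return text.translate(table)

sample = "Hello, world! 123"

# Each removed character becomes one space, so the length is unchanged
cleaned = remove_punctuation_and_digits(sample)

# Collapse the runs of spaces left behind by the replacement
collapsed = " ".join(cleaned.split())

# Three-argument form: characters in the third argument are deleted
deleted = sample.translate(
    str.maketrans("", "", string.punctuation + string.digits)
)

print(repr(cleaned))
print(repr(collapsed))
print(repr(deleted))
```

Replacing with spaces keeps words separated when punctuation joins them (for example `"a,b"` becomes `"a b"`), whereas deleting would merge them into `"ab"`. Which behavior is preferable depends on the tokenization that follows.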