
Text preprocessing

This tokenizer has been trained with tweets that have been preprocessed as follows:

  1. User mentions (@user_name) have been replaced with the word user.
  2. URLs have been replaced with the word url.
  3. WIP.

If you plan to use this tokenizer, we recommend preprocessing your own dataset in the same way.
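
The two documented replacements could be sketched in Python as follows. This is a minimal illustration, not the authors' actual preprocessing code: the regular expressions and the function name `preprocess_tweet` are assumptions, since the card does not state the exact rules used for matching mentions and URLs.

```python
import re

# Assumed patterns: the model card does not specify how mentions
# and URLs were detected, so these are illustrative only.
MENTION_RE = re.compile(r"(?<!\w)@\w+")          # e.g. @user_name
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # e.g. https://example.com


def preprocess_tweet(text: str) -> str:
    """Replace URLs with 'url' and user mentions with 'user'."""
    # URLs are replaced first so that an '@' inside a URL is not
    # mistaken for a mention.
    text = URL_RE.sub("url", text)
    text = MENTION_RE.sub("user", text)
    return text


if __name__ == "__main__":
    print(preprocess_tweet("@some_user check https://example.com/page now"))
```

Applying a function like this to every tweet before tokenization keeps the input closer to what the tokenizer saw during training. Any further steps covered by the WIP item are not reflected here.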