

Source code:

Some important functions and attributes of Tokenizer

  • __init__

  • def fit_on_texts(self, texts)  # texts can be a list of strings, a generator of strings, or a list of lists of strings

  • self.word_index  # a dictionary that maps each word to a unique integer index

  • self.index_word  # the reverse of word_index: maps each index back to its word
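To make the relationship between word_index and index_word concrete, here is a minimal pure-Python sketch of how such a mapping can be built. It is not the Keras implementation: the helper name build_word_index and the simplified tokenization (lowercase, strip a few punctuation marks, split on whitespace) are assumptions made only for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map each word to an index, most frequent word first."""
    counts = Counter()
    for text in texts:
        # Simplified tokenization (assumption): lowercase, drop basic punctuation, split on spaces.
        cleaned = "".join(ch for ch in text.lower() if ch not in "!?.,;:")
        counts.update(cleaned.split())
    # Indices start at 1; ties keep first-seen order because sorted() is stable.
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

texts = ['i love my dog', 'i love my cat', 'you love my dog!']
word_index = build_word_index(texts)
# index_word is simply word_index with keys and values swapped.
index_word = {i: w for w, i in word_index.items()}
print(word_index)
print(index_word[1])
```

Because the two dictionaries are inverses, index_word can be used to turn an encoded sequence back into words.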


import tensorflow as tf
from tensorflow import keras
# Tokenizer turns words into integer indices
from tensorflow.keras.preprocessing.text import Tokenizer

# Transform the words into numbers
sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
tokenizer = Tokenizer(num_words=100)
# Build the vocabulary; without this call word_index stays empty
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# Result: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
# Punctuation such as '!' is filtered out by default, so 'dog!' counts as 'dog'



sentences = ['i love my dog', 'i love my cat', 'you love my dog!', 'do you think my dog is amazing']
sequences = tokenizer.texts_to_sequences(sentences)
# Result: [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4], [6, 2, 4]]
The last sentence is encoded as only [6, 2, 4]: 'do', 'think', 'is' and 'amazing' have no index because they did not appear in the texts the tokenizer was fitted on, so they are dropped.

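To solve this problem, we can set an out-of-vocabulary (OOV) token in the tokenizer; it is used to encode any word that was not seen during fitting.

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
After re-running the code with this tokenizer, we get
[[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]
The OOV token takes index 1, so every other word index shifts up by one.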
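The effect of the OOV token can be sketched in plain Python. This is an illustrative approximation, not the Keras code: the function below is a hypothetical stand-in for texts_to_sequences, the oov_index parameter is an assumption, and the word_index is written out by hand to match the vocabulary fitted on the first three sentences.

```python
def texts_to_sequences(texts, word_index, oov_index=None):
    """Hypothetical helper: encode texts; unknown words map to oov_index or are dropped."""
    sequences = []
    for text in texts:
        # Simplified tokenization (assumption): lowercase, drop basic punctuation, split on spaces.
        cleaned = "".join(ch for ch in text.lower() if ch not in "!?.,;:")
        seq = []
        for word in cleaned.split():
            if word in word_index:
                seq.append(word_index[word])
            elif oov_index is not None:
                seq.append(oov_index)  # unseen word: use the OOV index
            # else: unseen word is silently skipped
        sequences.append(seq)
    return sequences

# Vocabulary from the first three sentences, with index 1 reserved for the OOV token.
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}
sentences = ['i love my dog', 'i love my cat', 'you love my dog!',
             'do you think my dog is amazing']

with_oov = texts_to_sequences(sentences, word_index, oov_index=1)
without_oov = texts_to_sequences(sentences, word_index)
print(with_oov[3])
print(without_oov[3])
```

With the OOV index the last sentence keeps all seven positions, so the sequence length still reflects the sentence length even though four of its words are unknown.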

However, the sequences have different lengths, which makes them hard to feed to a neural network, so we need to bring them all to the same length.

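from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences,
                                 padding='post',     # pad with zeros on the right
                                 maxlen=5,           # maximum length of each sequence
                                 truncating='post')  # cut extra tokens from the right
Then we get the result
array([[5, 3, 2, 4, 0],
       [5, 3, 2, 7, 0],
       [6, 3, 2, 4, 0],
       [8, 6, 9, 2, 4]])
These indices differ from the sequences above; they match a tokenizer (with the OOV token) fitted on all four sentences, where 'my' becomes the most frequent word.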
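What 'post' padding and 'post' truncating do can be shown with a small list-based sketch. The helper pad_post is hypothetical and only mimics this one configuration; the real pad_sequences returns a NumPy array and defaults to 'pre' for both padding and truncating. Here it is applied to the OOV-encoded sequences from the tokenizer fitted on the first three sentences.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: cut each sequence on the right, then pad it on the right."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])                    # 'post' truncating keeps the first maxlen items
        fill = [value] * (maxlen - len(trimmed))        # 'post' padding appends zeros at the end
        padded.append(trimmed + fill)
    return padded

sequences = [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]
padded = pad_post(sequences, maxlen=5)
for row in padded:
    print(row)
```

Every row now has exactly five entries: short sequences gain trailing zeros, and the long one loses its last two tokens.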

