In this series of tutorials, you will learn how to use a free resource from Google called Colaboratory and build a simple yet sophisticated Neural Machine Translation model.
Google Colab and Deep Learning Tutorial
We will dive into a real example of deep learning by using an open source machine translation model built on PyTorch. Through this tutorial, you will learn how to use open source translation tools.
Overview of Colab
- Google Colab is a free research tool for machine learning education and research.
- Google provides a free Tesla K80 GPU with about 12 GB of memory.
- An interactive Colab notebook session can run for up to 12 hours. The limit exists to discourage misuse (for example, cryptocurrency mining); after 12 hours you can simply start a new session.
This tutorial assumes that you have prior knowledge of Python programming and Neural Machine Translation. Even if you don't have any hands-on experience, this tutorial should help you understand the basics of machine translation. Now, without wasting much time, let's jump right in and see how to use Google Colab.
Getting Started with Google Colab
Now, you can create a Colab Notebook in two ways.
1st way: Visit Google Drive, then Right Click -> More -> Colaboratory (or New -> More -> Colaboratory) to start a new Colab notebook.
If this is your first time using Colab, you may first need to click on “Connect more apps”, search for “Colaboratory”, and then follow the step above.
2nd way: Visit Colab and start a new Python 3 notebook, or cancel and experiment with some of the existing code snippets.
Connecting to Server and Setting up GPU Runtime
By default, the hardware accelerator is set to None, which means the notebook runs on the CPU. Below you can see how to switch from CPU to GPU.
Open the Runtime menu -> Change Runtime Type -> Select GPU
You can also rename the notebook by clicking its title at the top of the notebook.
To test whether your GPU is set up and available, run the two lines of code below. Use Ctrl/Command + Enter to run the current cell, or simply click the run ► button to the left of the cell.
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
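Since the rest of this tutorial uses PyTorch, you can also confirm that the GPU is visible to PyTorch (torch comes pre-installed on Colab). This is just an optional sanity check, not part of the original workflow.

import torch

# True when a GPU runtime is selected and visible to PyTorch
print(torch.cuda.is_available())
# Name of the device, e.g. "Tesla K80", if a GPU is available
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))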
The screenshot below shows the output when no GPU is available or selected in the runtime options.
Below is the output when the GPU is selected.
The following command gives you detailed information about the graphics card and available memory.
!nvidia-smi
Remember: shell (Linux) commands in Colab have to be preceded by !, for example !python hello.py
Mounting Your Google Drive to Colab Notebook
Since we will be training on textual data and need to save our model for later testing, and since we cannot rely on Colab for persistent storage, it is important to connect your session to Google Drive as external storage.
Running the code below connects your notebook to Google Drive. You will be asked to authorize it through your Google account.
from google.colab import drive
drive.mount('/content/gdrive')
When you run the code above, click the link, select the Google account you want to connect, copy and paste the authorization code into the box and hit Enter.
Once you finish the authorization, you should see something like this.
To confirm that you are connected to Google Drive, you can simply run the !ls command or browse the file explorer in the sidebar. You can also upload any programs you need directly to the drive.
Right Arrow Icon -> Files -> REFRESH.
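As a quick optional check, listing the Drive mount point from the notebook should show your Drive contents; the path below matches the drive.mount() call used earlier.

!ls "/content/gdrive/My Drive"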
Now, use !mkdir to create a folder named PyTrained in Google Drive, which you will use to store all your code and data, and then navigate to that folder from Colab. Note that in a notebook a plain !cd runs in its own subshell and does not persist, so use the %cd magic to change into the folder.
!mkdir "/content/gdrive/My Drive/PyTrained/"
# %cd (not !cd) so the directory change persists in the notebook
%cd "/content/gdrive/My Drive/PyTrained/"
Data Generation
Even though it is easy to find existing compiled datasets for machine translation, we will take a detour and generate some fake data ourselves using the Python library Faker.
!pip install Faker
You can install Faker on Google Colab with the command above; if you are installing it on your own Linux/Mac machine, you might need sudo permission.
Let's run a few basic functions from the Faker library.
from faker import Faker

fake = Faker()
print(fake.name())
print(fake.address())
print(fake.day_of_week())
Running those few lines should give you output similar to the following; it will be different every time you run the code because the library draws values at random.
Data Generation: Data Provider
Now, let’s go ahead and build a proper data provider, which will help us generate some random names.
import random

from faker.providers import BaseProvider

class NamesProvider(BaseProvider):
    def new_name(self):
        # Pick a gender at random and generate a matching first and last name
        SEX = ["F", "M"]
        user_sex = random.choice(SEX)
        if user_sex == 'F':
            first_name = self.generator.first_name_female()
            last_name = self.generator.last_name_female()
        elif user_sex == 'M':
            first_name = self.generator.first_name_male()
            last_name = self.generator.last_name_male()
        return first_name + " " + last_name
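To try the provider on its own (a quick sketch; the full script below does the same registration), you can attach it to a Faker instance with add_provider and call the new method:

from faker import Faker

fake = Faker()
fake.add_provider(NamesProvider)   # register the custom provider defined above
print(fake.new_name())             # prints a randomly generated full name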
Understanding the Basic Names Provider
If you look at the code carefully, all we are doing is generating a first and last name based on a gender chosen at random between male and female. Now let us finish the name generator and write its output to a file.
import argparse
import random
from faker import Faker
from faker.providers import BaseProvider

# Choose the locale for generated names; you can change it to Faker('en_AU')
# for Australian names or Faker('it_IT') for Italian names
fake = Faker('en_GB')

class NamesProvider(BaseProvider):
    def new_name(self):
        SEX = ["F", "M"]
        user_sex = random.choice(SEX)
        if user_sex == 'F':
            first_name = self.generator.first_name_female()
            last_name = self.generator.last_name_female()
        elif user_sex == 'M':
            first_name = self.generator.first_name_male()
            last_name = self.generator.last_name_male()
        return first_name + " " + last_name

def main():
    parser = argparse.ArgumentParser("Generate some random names")
    parser.add_argument("Unique", help="Total Unique Names", type=int)
    args = parser.parse_args()

    fake.add_provider(NamesProvider)

    # Write the generated names to names.txt, one per line
    with open('names.txt', 'w') as f:
        for i in range(args.Unique):
            f.write(fake.new_name() + '\n')

if __name__ == "__main__":
    main()
Diving deeper into the code, the locale for name generation is set to English (Great Britain) right after the imports. You can choose any other locale provided by the Faker library.
fake = Faker('en_GB')
Coming to main(), the program requires one argument that decides how many names to generate, and then registers our names provider class. The generated names are written to a file called 'names.txt'. The command below shows how to run the code.
python goTrainedNames.py 10
Once the code has executed, you should see a 'names.txt' file in your current directory. You can view its contents with the command below, or open it directly in your favorite editor.
cat names.txt
If you don't pass any parameter to the program, it will exit with an error asking for one. In the image below you can see what happens when the program is run without any parameters.
Now, let's go further and add some more functions to our existing Names Provider class.
import argparse
import random
from faker import Faker
from faker.providers import BaseProvider

fake = Faker('en_GB')

class NamesProvider(BaseProvider):
    def new_name(self):
        SEX = ["F", "M"]
        user_sex = random.choice(SEX)
        if user_sex == 'F':
            first_name = self.generator.first_name_female()
            last_name = self.generator.last_name_female()
        elif user_sex == 'M':
            first_name = self.generator.first_name_male()
            last_name = self.generator.last_name_male()
        return first_name + " " + last_name

    # Populate the dataset with repeated names for training
    def repeated_name(self, names):
        getRandomName = random.choice(names)
        return getRandomName

def main():
    parser = argparse.ArgumentParser("Generate some random names")
    parser.add_argument("Unique", help="Total Unique Names", type=int)
    parser.add_argument("Repeated", help="Total Repeated Names", type=int)
    args = parser.parse_args()

    fake.add_provider(NamesProvider)
    names_list = []

    # Append unique names first
    for i in range(args.Unique):
        names_list.append(fake.new_name())

    # Append repeated names drawn from the list of unique names
    for i in range(args.Repeated):
        names_list.append(fake.repeated_name(names_list))

    # Write the final list to a text file
    with open('names.txt', 'w') as outfile:
        outfile.write("\n".join(names_list))

if __name__ == "__main__":
    main()
Comparing the code above to the previous Names Provider, we have added a new function called 'repeated_name'. The idea behind repeated names will be explained later, when we start the neural machine translation. The program now takes two parameters: 1) Unique and 2) Repeated.
It generates X unique names and Y repeated names drawn from the list 'names_list'. Let's run the program with the following command.
python goTrainedNames.py 5 10
After this run, the 'names.txt' file should contain a total of 15 names. If you open the file, you should find some repeated names.
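If you want to verify this programmatically (a small optional check, assuming names.txt is in the current directory), you can count the total and unique names:

# Count total vs. unique names written by goTrainedNames.py
with open('names.txt') as f:
    names = [line.strip() for line in f if line.strip()]
print(len(names), "names in total,", len(set(names)), "unique")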
Data Generation: Error Introduction
Alright, now we have a set of names that will be used as ground truth. Let us introduce some random errors into the names, so we can check whether our machine translation model is able to identify these mistakes and correct them for us once it is fully trained.
import argparse
import os.path
import random

# Check that the names.txt file exists (the one generated by the NamesProvider class)
def is_valid_file(parser, arg):
    if not os.path.exists(arg):
        parser.error("The file %s does not exist!" % arg)
    else:
        return arg

def add_some_error(file_path):
    lines = open(file_path).read().splitlines()

    # A dictionary of character replacements used to corrupt names
    sample_dict = {'h': 'b', 'a': '@', 'o': '0', 'P': 'B', 'e': 'c', 'l': '1', 'n': 'u'}

    output = open("error_names.txt", "w")

    # Total number of lines in the names file
    length = len(lines)
    # 40% of the names will be corrupted
    percentage = round(length * 0.4)
    # Select a random 40% of line indices to introduce errors
    # (a set makes the membership test below fast for large files)
    data = set(random.sample(range(0, length), percentage))

    for i in range(length):
        name = lines[i]
        if i in data:
            for j, k in sample_dict.items():
                name = name.replace(j, k)
        output.write(name + "\n")
    output.close()

def main():
    parser = argparse.ArgumentParser("Error Names")
    parser.add_argument("names_file", help="Please specify names.txt path",
                        metavar="FILE", type=lambda x: is_valid_file(parser, x))
    args = parser.parse_args()
    add_some_error(args.names_file)

if __name__ == "__main__":
    main()
In the function 'add_some_error', we read each name from 'names.txt'. Notice that we only add errors to 40% of the names (an arbitrary choice, not a standard value), because corrupting the entire dataset would not make sense.
random.sample(range(0, length), percentage)
The call above generates 'percentage' unique random indices in the range 0 to the total number of names in the file. We then corrupt the names at those indices, so the errors are scattered rather than sequential.
For example, in the image below, length is 10 and the variable 'data' holds 4 random elements from the range 0-9.
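You can reproduce that behaviour with a couple of lines (purely illustrative):

import random

# Pick 4 distinct indices out of the range 0-9; the result differs on every run
data = random.sample(range(0, 10), 4)
print(data)   # e.g. [7, 0, 3, 9]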
Let us break down the replacement logic to understand it clearly. You can change 'my_text' to your favorite name, add more entries to the dictionary, and test the output.
def replace(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

my_text = 'rakshith'
dictionary = {'r': 'R', 'a': '@', 'i': 'u', 'h': 'G'}
txt = replace(my_text, dictionary)
print(txt)
Now that we know how to create random fake data and introduce errors, and have so far only worked with small samples, let us go ahead and generate 100,000 names: 5,000 unique names, with the rest repeated from those 5,000.
python goTrainedNames.py 5000 95000
python add_error.py names.txt
With this much data to process, 'add_error.py' can take a while, so do not terminate the program until it finishes writing the changes to 'error_names.txt'.
Splitting Data
Training, Testing and Validation Sets
Now that we have enough data, let us split it into training, validation and test sets.
Training set – A subset of data to train the model
Test set – A subset of data to test on our trained model
Validation set – A subset of data used during training to evaluate the model on unseen examples and guide improvements.
Important: never train on the test set.
We divide our dataset into 70% training, 15% validation and 15% test. To understand more about data splitting, I recommend reading Best Practices – Data Splitting.
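To see how the two-stage split in the script below produces these proportions, here is a quick back-of-the-envelope check (assuming the 100,000 names generated earlier):

total = 100000                    # names generated earlier
holdout = round(total * 0.30)     # first split: 30% held out -> 30,000
val = round(holdout * 0.50)       # second split: half of the hold-out -> 15,000
test = holdout - val              # remaining half -> 15,000
train = total - holdout           # -> 70,000
print(train, val, test)           # 70000 15000 15000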
import argparse
import os.path

import pandas as pd
from sklearn.model_selection import train_test_split

# Check that the file exists at the given path
def is_valid_file(parser, arg):
    if not os.path.exists(arg):
        parser.error("The file %s does not exist!" % arg)
    else:
        return arg, os.getcwd()

def split(truth, error, path_to_save):
    dataset_truth = pd.read_csv(truth, header=None, index_col=0)
    dataset_error = pd.read_csv(error, header=None, index_col=0)

    # Split the data 70/30 (train / temporary test set)
    train_src, test_src_temp, train_trg, test_trg_temp = train_test_split(
        dataset_error, dataset_truth, test_size=0.30, random_state=0)

    # Split the temporary test set in half (15% validation, 15% test)
    val_src, test_src, val_trg, test_trg = train_test_split(
        test_src_temp, test_trg_temp, test_size=0.50, random_state=0)

    path = path_to_save + "/splitdata"
    if not os.path.exists(path):
        os.makedirs(path)

    # Write each subset to file; header=None suppresses the header row
    pd.DataFrame(train_src).to_csv(path + '/train_src.txt', header=None)
    pd.DataFrame(train_trg).to_csv(path + '/train_trg.txt', header=None)
    pd.DataFrame(test_src).to_csv(path + '/test_src.txt', header=None)
    pd.DataFrame(test_trg).to_csv(path + '/test_trg.txt', header=None)
    pd.DataFrame(val_src).to_csv(path + '/val_src.txt', header=None)
    pd.DataFrame(val_trg).to_csv(path + '/val_trg.txt', header=None)

def main():
    parser = argparse.ArgumentParser("Train, Test and Validation Sets")
    parser.add_argument("Names", help="Generated names file path",
                        metavar="FILE", type=lambda x: is_valid_file(parser, x))
    parser.add_argument("ErrorNames", help="Generated error names file",
                        metavar="FILE", type=lambda x: is_valid_file(parser, x))
    args = parser.parse_args()
    split(args.Names[0], args.ErrorNames[0], args.Names[1])

if __name__ == "__main__":
    main()
The code above takes two parameters and splits our data into train, validation and test files.
python train_test_split.py names.txt error_names.txt
Running the command above creates a folder 'splitdata' containing six files, whose significance is explained below.
- train_src.txt: training data, 70,000 names with errors (source)
- train_trg.txt: training data, 70,000 names without errors (target / ground truth)
- val_src.txt: validation data, 15,000 names with errors
- val_trg.txt: validation data, 15,000 names without errors
- test_src.txt: test (evaluation) data, 15,000 names with errors
- test_trg.txt: test (evaluation) data, 15,000 names without errors
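As an optional sanity check (assuming you run it from the directory where 'splitdata' was created), you can count the lines in each file:

wc -l splitdata/*.txt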
Diving into OpenNMT
For testing purposes, we will be training our model on the fake dataset we created. However, for practical projects, you can use datasets available at this Translation Task or at OPUS parallel corpus.
Now we are all set to experiment with neural machine translation. Copy these six files into a new folder on Google Drive. Back in Colab, we will access them to train a neural machine translation model. Assuming you have copied the files to Google Drive and changed your present working directory in Colab, let us clone the OpenNMT-py library.
!pwd
!ls
!git clone https://github.com/OpenNMT/OpenNMT-py.git
After this, change the working directory to OpenNMT-py/ so we can install the necessary dependencies (again, use the %cd magic so the change persists across cells).
%cd OpenNMT-py/
!pip install -r requirements.txt
Step 1: PreProcessing
!python preprocess.py -train_src '/content/gdrive/My Drive/PyTrained/'train_src.txt -train_tgt '/content/gdrive/My Drive/PyTrained/'train_trg.txt -valid_src '/content/gdrive/My Drive/PyTrained/'val_src.txt -valid_tgt '/content/gdrive/My Drive/PyTrained/'val_trg.txt -save_data '/content/gdrive/My Drive/PyTrained/'names
Here, we work with the data we created and split into six files, so let us place those six files in a folder on Google Drive.
Since the files are in Google Drive, the path I will be using is down below for your reference.
PATH: '/content/gdrive/My Drive/PyTrained/'
In preprocessing, we use four of our files. Each line in a src file corresponds to the same line in the matching trg file, so the model can learn how the source maps to the target.
train_src.txt
train_trg.txt
val_src.txt
val_trg.txt
The validation files are used to evaluate the model periodically during training and to monitor convergence.
Once the preprocessing is completed, you should be able to see the following three files.
names.train.pt
names.val.pt
names.vocab.pt
These are serialized PyTorch files containing the vocabulary and the indexed training and validation data.
Step 2: Training the Model
!python train.py -data '/content/gdrive/My Drive/PyTrained/'names -save_model '/content/gdrive/My Drive/PyTrained/'model/names-model -gpu_ranks 0 -save_checkpoint_steps 10000 -train_steps 100000 -learning_rate 0.001
Now we train our model, a 2-layer LSTM network, on the GPU. In the command above we save a checkpoint every 10,000 steps and train for up to 100,000 steps, so by the end of training we should have a series of checkpoint .pt files (one per 10,000 steps). We can also increase or decrease the number of training steps and change various hyperparameters. In this example, we run a minimal encoder/decoder model with 500 hidden units on a single GPU.
It is also possible to specify GPU IDs if you have multiple GPUs, or to perform distributed training in parallel to speed up the process.
If you look at the command, we save our model checkpoints into a folder called 'model' via the -save_model parameter.
Tip:
A Colab session lasts up to 12 hours and sometimes disconnects before that. If this happens before training has completed, you can resume from the last saved checkpoint: run the training command again with the option -train_from followed by the path of the last saved checkpoint file, as shown below.
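For example (a sketch; adjust the checkpoint step number to whatever file was actually saved in your model folder), resuming might look like this:

!python train.py -data '/content/gdrive/My Drive/PyTrained/'names -save_model '/content/gdrive/My Drive/PyTrained/'model/names-model -gpu_ranks 0 -save_checkpoint_steps 10000 -train_steps 100000 -learning_rate 0.001 -train_from '/content/gdrive/My Drive/PyTrained/'model/names-model_step_50000.pt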
Training without GPU
If you look at the highlighted part of the screenshot, it warns that training on a CPU could be slow; the time taken to finish 50 steps is also highlighted. Training on the CPU clearly takes much longer than on the GPU, and it may also run into out-of-memory errors further into training.
Step 3: Translating the Output
!python translate.py -model '/content/gdrive/My Drive/PyTrained/model/'names-model_step_50000.pt -src '/content/gdrive/My Drive/PyTrained/'test_src.txt -output '/content/gdrive/My Drive/PyTrained/'output.txt -replace_unk -verbose
Now we have a trained model that can predict on new data. With the -verbose flag you can see the translations on screen as they are produced, and the output is also saved to 'output.txt'.
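If you would like a rough measure of how well the model corrects names (an optional sketch, not part of the original tutorial; the paths assume the Drive folder used throughout), you can compare the output line by line with the ground-truth test targets:

# Exact-match rate between model output and the ground-truth test targets
base = '/content/gdrive/My Drive/PyTrained/'
with open(base + 'output.txt') as f_pred, open(base + 'test_trg.txt') as f_true:
    pred = [line.strip() for line in f_pred]
    true = [line.strip() for line in f_true]

matches = sum(p == t for p, t in zip(pred, true))
print("Exact matches: %d/%d (%.1f%%)" % (matches, len(true), 100.0 * matches / len(true)))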
Conclusion
We have only trained our model on a small, fake dataset, so it may well predict wrong results. For much better results, it is recommended to train on a larger dataset.
The generated fake names serve as a test case showing that OpenNMT can also be used to correct mistakes in text data, and that it can help with tasks such as sentence or article summarization for news applications.
Similarly, we can train a model to translate from one language to another (for example, Chinese -> English) by splitting the data into train, validation and test sets. For more practical work, it is recommended to use data from the Translation DataSet, where you can see the model put to work translating sentences between different languages. You can also play around with various hyperparameters, such as the number of RNN layers, the number of training steps and the learning rate, as sketched below.
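As an illustration only (these are standard OpenNMT-py options, but treat the exact values as placeholders to experiment with), a variation of the training command might look like this:

!python train.py -data '/content/gdrive/My Drive/PyTrained/'names -save_model '/content/gdrive/My Drive/PyTrained/'model/names-model -gpu_ranks 0 -layers 3 -rnn_size 256 -save_checkpoint_steps 10000 -train_steps 50000 -learning_rate 0.0005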
I am a Master's student studying computer science at TU Kaiserslautern, Germany. My area of specialization is Deep Learning, OpenNMT and developing apps for Android. I am currently a researcher at DFKI, working on various machine translation models. I fancy traveling and definitely love new experiences.