Our machine learning-driven target identification uses a range of models, which often require different kinds and formats of input data. This, combined with the fact that the data comes from various sources and in different shapes, makes preprocessing a key step in our pipeline. In our previous post we showed how we use Spark on k8s to scale our NLP processing. In this post, we look at using Spark to speed up feature extraction and data preprocessing for ML models.
Data preprocessing is an integral part of machine learning. In fact, researchers often spend more time on data-related operations than on building the models themselves. Before the data is fed into a model, it first needs to be properly cleaned, normalised and featurised. The quality and speed of preprocessing can have an immense impact on model performance, often exceeding the importance of the model architecture itself.
Depending on the use case, data preprocessing may involve a number of different steps, ranging from handling missing values to sampling and tokenising textual data. For each item in the dataset, such as a document, the series of steps is often the same and can be run in parallel across items. To make this process fast, efficient and scalable, we rely heavily on Apache Spark.
Apache Spark is a framework for processing large amounts of data quickly across multiple computers (i.e. on a cluster). Spark is implemented in Scala and runs on the JVM, but it has APIs for Java, Scala, Python and R, as well as Spark SQL. Since most ML software is written in Python, we mostly use Spark through its Python API, PySpark. Spark also provides User Defined Functions (UDFs), which let us run custom, non-JVM code on the columns of a DataFrame.
In contrast to other widely used packages for data preprocessing, such as Pandas, Spark runs operations in parallel on different nodes of the cluster.
Machine learning models usually require a specific data format as input. One such model is BERT, a state-of-the-art language model that can be used for a range of NLP tasks. Below, we show how we can use Spark to preprocess sentences, turning them from raw text into a format that can be fed directly into BERT.
First, let’s create a SparkSession. How you do this depends on the platform Spark runs on; you can find more details in the Spark deployment documentation. The snippet below should work if you've installed the pyspark package from PyPI.
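Here is a minimal local setup (the app name and the master URL are just examples):

```python
from pyspark.sql import SparkSession

# A local SparkSession for experimentation; on a cluster the master URL
# and resource settings would come from your deployment instead.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("bert-preprocessing")
    .getOrCreate()
)
```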
The next step is to create a DataFrame containing the text we wish to tokenise. In practice this would be read from data files with spark.read, but here we create a small sample dataset. The DataFrame has a single column, text, of type StringType, with three rows.
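For illustration, a small hand-written dataset (the sentences themselves are just examples):

```python
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("text", StringType())])

df = spark.createDataFrame(
    [
        ("Aspirin is a nonsteroidal anti-inflammatory drug.",),
        ("BERT produces contextual token embeddings.",),
        ("Spark distributes preprocessing across the cluster.",),
    ],
    schema=schema,
)
df.show(truncate=False)
```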
Next, we’ll create the tokeniser using AutoTokenizer from the transformers package, which automatically downloads the tokeniser from the Hugging Face Hub. It’s good practice to use Spark’s broadcast mechanism to pre-distribute large objects like this to the executors.
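A sketch of this step, using bert-base-uncased as an example model:

```python
from transformers import AutoTokenizer

# Download the tokeniser once on the driver...
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ...and broadcast it so each executor gets a copy instead of
# re-creating it for every task.
tokenizer_broadcast = spark.sparkContext.broadcast(tokenizer)
```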
To define a UDF, we write a normal Python function that takes one or more column values as input and returns the output, and we specify the output schema using Spark’s data types.
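Here is one way such a UDF could look for BERT tokenisation; the field names, truncation setting and maximum length are our choices for this example:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField

# The UDF returns the three arrays BERT expects for each sentence.
output_schema = StructType([
    StructField("input_ids", ArrayType(IntegerType())),
    StructField("attention_mask", ArrayType(IntegerType())),
    StructField("token_type_ids", ArrayType(IntegerType())),
])

def tokenise(text):
    # .value gives access to the broadcast tokeniser on the executor.
    encoded = tokenizer_broadcast.value(text, truncation=True, max_length=128)
    return (
        encoded["input_ids"],
        encoded["attention_mask"],
        encoded["token_type_ids"],
    )

tokenise_udf = udf(tokenise, output_schema)
```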
A few things in the code above are worth explaining:
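- The broadcast tokeniser is accessed through tokenizer_broadcast.value inside the function, so each executor reuses a single copy rather than deserialising the tokeniser for every row.
- The return value is a tuple whose fields match the StructType we declared; Spark uses this schema to build the output column without having to inspect the Python objects.
- The truncation and max_length settings are illustrative and should match what the downstream BERT model expects.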
Finally, we are ready to run the code and produce our tokenised output.
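As a sketch, we apply the UDF to the text column and add the result as a new column (the name bert_features is our choice here):

```python
from pyspark.sql.functions import col

tokenised_df = df.withColumn("bert_features", tokenise_udf(col("text")))
tokenised_df.show(truncate=False)
```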
The output has a schema like this:
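```
root
 |-- text: string (nullable = true)
 |-- bert_features: struct (nullable = true)
 |    |-- input_ids: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
 |    |-- attention_mask: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
 |    |-- token_type_ids: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
```

(The exact field names and nullability here simply reflect the struct type declared in the example UDF above.)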
You can then further process the tokenised output or write it to disk. We prefer the Avro format for saving features, since it preserves float precision and makes it easy to validate data against a schema. The Avro tooling across programming languages is also very good.
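Writing the features out as Avro might look like the sketch below; the output path is illustrative, and the spark-avro package must be available to your Spark installation (for example via --packages org.apache.spark:spark-avro_2.12:&lt;spark version&gt;):

```python
# Each Spark partition is written as a separate Avro file under this path.
(
    tokenised_df
    .write
    .format("avro")
    .mode("overwrite")
    .save("bert_features_avro")
)
```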
Note that Spark reads, processes and writes data in parallel, so you will usually get multiple output files. Your code should be able to handle this, although machine learning frameworks such as PyTorch and TensorFlow both support reading from multiple input files. You can also write your data in many other formats, such as CSV, JSON or Parquet.
Once the data is properly preprocessed, the next step is to load it into a data loader. This is framework specific; in PyTorch, for example, it would be a DataLoader object. Once the data is loaded, we can use it to train our model. Although Spark supports scheduling tasks on GPUs, we currently prefer using PyTorch Lightning to run our training and prediction tasks.
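As a rough sketch (not our actual pipeline), the tokenised records could be wrapped in a PyTorch Dataset and padded in a collate function; it assumes the records have already been read back into Python, for example with an Avro reader such as fastavro:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class TokenisedDataset(Dataset):
    """Wraps already-loaded tokenised records (e.g. parsed from the Avro files)."""

    def __init__(self, records):
        # Each record is a dict with input_ids, attention_mask and token_type_ids.
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        return {key: torch.tensor(value, dtype=torch.long) for key, value in record.items()}

def collate(batch):
    # Pad each field to the longest sequence in the batch.
    return {
        key: pad_sequence([item[key] for item in batch], batch_first=True)
        for key in batch[0]
    }

# records would come from reading the Avro output, e.g.:
# loader = DataLoader(TokenisedDataset(records), batch_size=32, collate_fn=collate)
```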
Using Spark allows us to focus on features that would otherwise be difficult to develop, particularly when a model is trained on a lot of data. We showed in this post how Spark lets us flexibly combine the DataFrame API with plain Python. We don’t need to worry about reading, writing or aggregating data, since we can rely on Spark doing these much better than any custom Python code we might write ourselves.
The biggest advantage of Spark is that it allows us to process a large amount of training data in a matter of minutes. By leveraging distributed computing and parallel processing, it can handle massive datasets, helping us avoid processing everything sequentially, for example in a single for loop.
Juha has MSc (1996) and PhD (2008) degrees in Computer Science from Tampere University of Technology. After graduation he worked for Nokia in Finland and in China, developing speech recognition, text-to-speech synthesis and predictive text input technologies. In 2013 Juha relocated to London to work as Head of Research at SwiftKey. During this time SwiftKey launched a neural network-based mobile predictive text input system. Juha has been at BenevolentAI since 2017. His main interests are in Deep Learning for biomedical NLP and distributed computing using Apache Spark.
Maciej is a Machine Learning Engineer at BenevolentAI, where he focuses on Natural Language Processing (NLP) and its applications in drug discovery. In the past he worked in research labs at University College London and the University of Edinburgh on multi-agent reinforcement learning and generative adversarial networks. He has worked and published on topics such as entity linking, knowledge graph embedding systems and generative adversarial networks.