How Forte Transforms ML Building with PyTorch
Authors: the CASL Project Team
Forte introduces “DataPack”, a standardized data structure for unstructured data, distilling good software engineering practices such as reusability, extensibility, and flexibility into PyTorch-based ML solutions.
Visit Forte at:
- Github: https://github.com/asyml/forte
- Documentation: https://asyml-forte.readthedocs.io/en/latest
- Technical Report: https://aclanthology.org/2020.emnlp-demos.26/
Introduction
Machine Learning (ML) technologies are now widely used in many day-to-day applications. For example, the systems behind personal assistants like Siri or Alexa are grounded in complex ML technologies, such as Natural Language Processing, Computer Vision, and many more.
While the consumer interface of a Machine Learning system may appear simple, the system behind the scenes can be much more complex than it first appears. For example, building an intelligent medical information retrieval system requires one to stitch together a diverse set of techniques. These could include:
- Preprocessing tools to clean and anonymize data
- A Full-Text Search system to find relevant reports per user query
- A few PyTorch Information Extraction models to identify key entities and link them to medical knowledge bases
- Human interfaces for practitioners to examine the results and provide feedback
Even for this simplified example, the diverse nature of these components makes it challenging to create robust and reusable ML systems.
To solve this problem, we introduce Forte, a data-centric framework designed to handle complex ML workflows. Supported by the researchers at Petuum and our open-source community CASL, Forte employs a unique data structure, the DataPack, as a standardized but customizable data representation that flows through a workflow. This representation allows each component to interact with the data stream in a standard way, effectively decoupling it from other components in the system and reducing the time spent converting (possibly unreliable) data formats back and forth.
This simple idea allows engineers to develop reusable components easily and provides a simple way to compose ML models from the PyTorch ecosystem. Complementary to PyTorch’s power in building neural network models, Forte makes building complex ML solutions feel like assembling LEGO blocks.
Why is ML engineering hard?
Properly engineering a complex ML workflow with many components is a non-trivial task. In our research, we have encountered many frustrating issues, including:
- Too many output formats: The output of each tool differs or even conflicts due to different preprocessing schemes, making it challenging to pass the results from one tool to another.
- Tangled data processing makes it faster to re-write than to re-use: We tend to build a model for a particular use case with the best intentions of reusing it in the future. But since the data processing steps become intertwined with the model, teams find it faster to just do everything again.
- Customization is hard: The ML models and solutions are hard to repurpose — even if it’s the “same” AI problem — to a different domain or a new customer. Sometimes the new data formats do not match the existing model. Other times, new domain concepts violate the assumptions made about the data.
The problems listed above are just the tip of the iceberg. Digging deeper, we find the crux of many of them is that we often fit the data to the workflow, not the other way around. Ad-hoc processing logic from different workflow components tends to make data processing inconsistent and fragile.
Building state-of-the-art models and algorithms can be the most exciting work for ML practitioners. But instead of working on cutting-edge technologies, they often spend most of their time on tedious data processing tasks.
It is easy to fall prey to using quick hacks or gluing patchwork fixes in the pipeline. As a result, the code quality suffers, maintenance costs build up, and glitches and bugs accumulate over time.
Our solution, as hinted above, is simple: we agree on a data scheme in advance, and every component will interface with the data stream in a standard way. As opposed to ad-hoc data processing, we try to fit the workflow to the data, in a data-centric way.
Introducing Forte
Forte is a data-centric system to facilitate the construction or modification of a model pipeline. For example, an Entity Analysis and Visualization pipeline can be literally built with a one-liner:
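To make the shape of such a one-liner concrete, here is a minimal plain-Python mock. The Pipeline, HTMLReader, SpacyProcessor, and StaveProcessor classes below are illustrative stand-ins, not Forte's real classes; see the documentation for the actual imports.

```python
import re


class DataPack:
    """Minimal stand-in: raw text plus a list of annotations."""

    def __init__(self, text):
        self.text = text
        self.annotations = []  # (begin, end, type) character spans


class HTMLReader:
    """Stand-in reader: strips tags so processors see plain text."""

    def read(self, html):
        return DataPack(re.sub(r"<[^>]+>", "", html))


class SpacyProcessor:
    """Stand-in for the SpaCy wrapper: records token spans."""

    def process(self, pack):
        for match in re.finditer(r"\w+", pack.text):
            pack.annotations.append((match.start(), match.end(), "Token"))


class StaveProcessor:
    """Stand-in for the visualizer: prints the annotated spans."""

    def process(self, pack):
        for begin, end, kind in pack.annotations:
            print(kind, pack.text[begin:end])


class Pipeline:
    def __init__(self):
        self._reader, self._processors = None, []

    def set_reader(self, reader):
        self._reader = reader
        return self  # returning self enables the fluent one-liner

    def add(self, processor):
        self._processors.append(processor)
        return self

    def run(self, data):
        pack = self._reader.read(data)
        for processor in self._processors:
            processor.process(pack)
        return pack


# The one-liner: read HTML, run NLP, visualize.
pack = Pipeline().set_reader(HTMLReader()).add(SpacyProcessor()).add(
    StaveProcessor()).run("<p>Forte is data centric.</p>")
```

Because `set_reader` and `add` both return the pipeline itself, the whole workflow composes into a single chained expression.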
In this example script, we build a pipeline that has an “HTMLReader”, followed by a “SpacyProcessor” which utilizes the SpaCy tool to perform NLP tasks like tokenization, part-of-speech tagging, entity linking, and so on. Finally, we add a visualizer (“StaveProcessor”) to conclude the pipeline.
The pipeline can be run immediately by calling the “run” function: we pass an HTML snippet directly to it, the pipeline processes it, and the results are visualized.
Quite simple, isn’t it? This is because each processor only needs to work directly with a data stream with a standard format. This is exactly what our data-centric view aims to achieve: converting data into standard packages, to help different components in a complex ML system work together easily.
Data Abstractions: DataPacks and Ontology
The data-centric idea is straightforward: instead of performing many ad-hoc conversions across different data formats, Forte unifies them under a single new format, the “DataPack”, a standard data format for complex ML tasks (imagine a DataFrame for unstructured data, but with an extensible data schema).
A DataPack consists of two parts: the raw data source (e.g. text, audio, images) and the structured markups on top of it. The markups represent the annotation information associated with certain tasks, such as part-of-speech tags, named entity mentions, dependency links, bounding boxes, and so forth. DataPacks store such information similar to how people highlight text or annotate books. Visually, a DataPack looks like the following:
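A minimal sketch of this idea in plain Python (not Forte's actual implementation) stores the raw text once, untouched, and layers offset-based markups on top:

```python
class DataPack:
    """Sketch of a DataPack: the raw source is stored once, and every
    markup is an offset-based record layered on top, much like
    highlights in a book."""

    def __init__(self, text):
        self.text = text      # raw, unmodified data source
        self.markups = []     # structured annotations on top

    def add_markup(self, kind, begin, end, **attrs):
        self.markups.append(
            {"type": kind, "begin": begin, "end": end, **attrs})

    def get_text(self, markup):
        # Recover the surface string a markup points at.
        return self.text[markup["begin"]:markup["end"]]


pack = DataPack("Aspirin relieves headache.")
pack.add_markup("Token", 0, 7)
pack.add_markup("EntityMention", 0, 7, ner_type="DRUG")
drug = pack.markups[1]
```

Since markups never modify the raw source, any number of components can annotate the same pack without stepping on each other.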
The magic behind the DataPack is a hierarchical ontology system that defines the relations among ML data types. For example, a dependency link associates a head token with a dependent token, and the dependency label is its attribute.
In Forte, these data types work like classes in object-oriented programming: a data type can be inherited or used by others, which gives us the flexibility to define new types at any time. Forte provides a few handy data types that capture typical ML data structures. For example, “Span” represents text spans such as “Token” and “Entity Mention”; “Link” represents relations, such as “Dependency”; “Group” represents clusters or sets, such as an “Entity Coreference Group”.
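As a sketch, the hierarchy can be pictured as ordinary Python classes (Forte generates comparable classes from its ontology specs; the names below are illustrative):

```python
class Annotation:
    """Base type: any markup attached to a data pack."""


class Span(Annotation):
    """A contiguous region of text, identified by character offsets."""

    def __init__(self, begin, end):
        self.begin, self.end = begin, end


class Token(Span):
    pass


class EntityMention(Span):
    def __init__(self, begin, end, ner_type=None):
        super().__init__(begin, end)
        self.ner_type = ner_type


class Link(Annotation):
    """A directed relation between two annotations."""

    def __init__(self, parent, child):
        self.parent, self.child = parent, child


class Dependency(Link):
    def __init__(self, head, dependent, rel_type):
        super().__init__(head, dependent)
        self.rel_type = rel_type  # the dependency label is an attribute


class Group(Annotation):
    """A set of annotations, e.g. an entity coreference cluster."""

    def __init__(self, members):
        self.members = list(members)


head, dep = Token(0, 5), Token(6, 8)
arc = Dependency(head, dep, "nsubj")
```

Any code written against `Span` or `Link` automatically works for every subtype, which is exactly where the reusability comes from.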
This ontology system provides a flexible way to represent structure within unstructured data. Although the idea is simple, it helps us tremendously in standardizing how ML solutions are developed.
How to use Forte
Getting started with Forte is simple:
- Install it through PyPI
- Or install the bleeding-edge version
- Install our open-source SpaCy adaptor (optional)
- Install our visualizer (optional, please note this tool is under development)
- You can try out the above-mentioned one-liner here
Great! Forte is installed! You can try out examples from our Github repository. If you are interested in more technical details, read on! Next, we will walk through an example of how to build an Information System for Clinical Notes.
Building an Information System for Clinical Notes
Imagine we are building a Clinical Notes Analysis system. The requirements include a search system that returns relevant documents based on user queries, and a text analysis system that highlights the relations between medical entity mentions.
The development process in Forte generally includes 4 steps:
- Define your data types,
- Develop the readers,
- Develop the processors,
- Assemble the pipeline!
1. Define your data types with the ontology system
The first step in building a Forte pipeline is always to determine what data types are needed for the task and the domain. By doing so, we force ourselves to understand the problem requirements and learn the domain knowledge before writing code. In this example, we are going to create two data types: one for medical entity mentions and one for UMLS concept links. The ontology is defined using the following JSON file:
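A spec along these lines could read as follows (the key names follow Forte's ontology schema; treat the details as a sketch rather than a verbatim copy):

```json
{
  "name": "medical_ontology",
  "definitions": [
    {
      "entry_name": "ft.onto.medical.MedicalEntityMention",
      "parent_entry": "ft.onto.base_ontology.EntityMention",
      "attributes": [
        {
          "name": "umls_entities",
          "type": "List",
          "item_type": "ft.onto.medical.UMLSConceptLink"
        }
      ]
    },
    {
      "entry_name": "ft.onto.medical.UMLSConceptLink",
      "parent_entry": "forte.data.ontology.top.Generics",
      "attributes": [
        {"name": "cui", "type": "str"},
        {"name": "tuis", "type": "List", "item_type": "str"}
      ]
    }
  ]
}
```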
Here, the “MedicalEntityMention” type inherits the “EntityMention” and defines an additional attribute “umls_entities”, which stores a list of “UMLSConceptLink”. The “UMLSConceptLink” stores additional attributes that are related to a medical knowledge base, such as the CUI and TUIS strings. Now we can call Forte’s ontology generator:
Forte now generates the data classes as the following:
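In sketch form, the generated classes look roughly like this (simplified; the real generated code subclasses Forte's own base types, which we stand in for here):

```python
class EntityMention:
    """Stand-in for Forte's built-in EntityMention type."""

    def __init__(self, begin, end):
        self.begin, self.end = begin, end


class UMLSConceptLink:
    """Links a mention to a UMLS knowledge-base concept."""

    def __init__(self, cui=None, tuis=None):
        self.cui = cui            # Concept Unique Identifier
        self.tuis = tuis or []    # Type Unique Identifiers


class MedicalEntityMention(EntityMention):
    """An EntityMention carrying a list of UMLS concept links."""

    def __init__(self, begin, end):
        super().__init__(begin, end)
        self.umls_entities = []


# Placeholder identifiers, not real UMLS codes.
mention = MedicalEntityMention(0, 7)
mention.umls_entities.append(UMLSConceptLink(cui="C0000001", tuis=["T000"]))
```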
One benefit of Forte is it allows different teams or engineers working on the same problem to discuss and agree on this shared data schema. Afterward, all the developed workflow components can easily communicate through this common ground.
2. Develop the Reader(s)
The next step in the process is to implement Readers, the entry point to the Forte System. Depending on how many different data formats you are dealing with, you may have multiple readers. Here is an example of how to implement an HTML Reader.
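The reader pattern itself can be sketched in plain Python (this mock is not Forte's Reader base-class API): a reader walks a raw data source and yields one standardized pack per document, so nothing downstream ever touches HTML.

```python
import html
import re


def html_reader(documents):
    """Yield one pack (here, a dict) per raw HTML document."""
    for doc in documents:
        text = re.sub(r"<[^>]+>", "", doc)    # drop the tags
        text = html.unescape(text).strip()    # decode entities like &amp;
        yield {"text": text, "annotations": []}


packs = list(html_reader(
    ["<p>Dr. Smith &amp; team</p>", "<div>Next note</div>"]))
```

The generator interface matters: readers can stream large corpora one pack at a time instead of loading everything into memory.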
Data consumed through the “Reader” interface becomes a “DataPack”, and from that point on we no longer need to worry about format differences. Forte “forces” developers to perform all necessary data conversion at the entry point of the whole workflow.
3. Develop the Processor
A pipeline usually consists of multiple processors operating on a stream of DataPacks. As the DataPacks flow along the pipeline, each processor can access or edit the information they contain. This makes it easy to run PyTorch models together with any other models. Below is a sketch implementation of SpacyProcessor, which calls the SpaCy library.
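The processor pattern can be sketched in plain Python (illustrative, not Forte's actual API; a regex stands in for the SpaCy model): the processor receives a pack and writes its output back as offset-based annotations.

```python
import re


def spacy_like_processor(pack):
    """Add Token and EntityMention annotations to the pack in place."""
    for m in re.finditer(r"\w+", pack["text"]):
        pack["annotations"].append(
            {"type": "Token", "begin": m.start(), "end": m.end()})
    # Toy "NER" stand-in: capitalized words become entity mentions.
    for m in re.finditer(r"\b[A-Z]\w+", pack["text"]):
        pack["annotations"].append(
            {"type": "EntityMention", "begin": m.start(), "end": m.end()})


pack = {"text": "Aspirin relieves pain", "annotations": []}
spacy_like_processor(pack)
entities = [a for a in pack["annotations"] if a["type"] == "EntityMention"]
```

Because a processor's only contract is "read the pack, write to the pack", a PyTorch model, a rule-based tool, and a third-party library all plug in the same way.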
Besides making it fairly simple to build a processor from a custom PyTorch model, Forte also wraps the fantastic collections of ML libraries from the PyTorch ecosystem (e.g., AllenNLP) as well as tools from other sources (e.g., SpaCy, HuggingFace).
4. Assemble the Pipeline and See it Running in Action
After building the components we need, it is straightforward to create the pipeline. In the snippet below, we piece together a couple of ready-to-use modules such as the reader, Tokenizer, NER, Entity Linker, and Relation Extractor.
That’s it for the pipeline building! Forte effectively shifts the data processing burden to component building time. Now that the pipeline is ready, let’s see it run in action:
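Consuming the results can be sketched like this (a plain-Python stand-in; Forte's real DataPack offers typed queries for the same purpose): downstream code simply asks the pack for annotations of a given type.

```python
def get_by_type(pack, kind):
    """Yield (surface_text, annotation) pairs for one annotation type."""
    for ann in pack["annotations"]:
        if ann["type"] == kind:
            yield pack["text"][ann["begin"]:ann["end"]], ann


# A pack as it might look after the pipeline has run (offsets are
# character positions into the original text).
pack = {
    "text": "Patient was given aspirin for chest pain.",
    "annotations": [
        {"type": "Token", "begin": 0, "end": 7},
        {"type": "MedicalEntityMention", "begin": 18, "end": 25},
        {"type": "MedicalEntityMention", "begin": 30, "end": 40},
    ],
}

mentions = [text for text, _ in get_by_type(pack, "MedicalEntityMention")]
```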
The resulting DataPack contains the processed results, along with the original text. Visualizing data like this can be quite useful for scientists to understand problems in the data, especially when it has complex structures. In the next section, we introduce how our visualization tool Stave can help us inspect data within a workflow.
Human in the Loop
The unified data representation allows us to standardize the whole workflow, beyond model development alone. DataPacks also enable human-in-the-loop workflows with ease: we treat human operations on DataPacks the same as any other Forte processor. In the future, we will release an extensible framework that allows one to add customized visualization plugins. You can try out the initial version of this tool, Stave, on Github.
You can also call Stave inside a Forte pipeline. Forte introduces StaveProcessor to enable immediate visualization of Forte pipeline results. Forte users can plug it into the pipeline to easily visualize DataPacks with annotations.
Re-purposing the Pipeline
Forte’s design makes it simple to re-purpose the pipeline for a different solution or domain. This can be done on many levels: one can re-purpose the whole pipeline or reuse parts of it for another application. Here we briefly introduce how to re-purpose the above-mentioned pipeline for the newswire domain. To do so, let’s examine the process and determine the following:
Do we need different ontology types?
Before we start coding, we should determine the ontology types. In the newswire domain, for example, we don’t need “MedicalEntityMention”; we can use a different ontology definition. In this example, we’ll use “WikiEntityMention” to extend the “EntityMention” type while carrying additional information such as links to general knowledge bases (e.g., Yago and Wikipedia).
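An ontology spec along these lines could capture the new type (the schema keys follow Forte's ontology format; the attribute names here are illustrative):

```json
{
  "name": "wiki_ontology",
  "definitions": [
    {
      "entry_name": "ft.onto.wiki.WikiEntityMention",
      "parent_entry": "ft.onto.base_ontology.EntityMention",
      "attributes": [
        {"name": "wiki_link", "type": "str"},
        {"name": "yago_link", "type": "str"}
      ]
    }
  ]
}
```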
Note that by defining new annotations, we are making domain-specific assumptions. For example, we allow one “MedicalEntityMention” to carry multiple “UMLSConceptLink”s, but a “WikiEntityMention” will only be linked to one Wikipedia/Yago node. Yet regardless of these differences, the procedures we developed at the “EntityMention” level will still be shared among them.
Rewrite Required Components
We will often need to rewrite the “Reader(s)” to make sure we can ingest datasets of a different format. In this example, we need to rewrite the reader and add the “WikiEntityMention” correctly to the DataPack.
Occasionally, we need to add additional processors or change a particular processor. This can be done easily by changing the Pipeline building script.
Building Reusable Models
The process of converting raw data to model input is tedious but important. Even though PyTorch reduces the need to hand-craft feature extractors, the model performance is still sensitive to data preprocessing. Forte helps to modularize this process since data processing logic can be implemented for high-level data types. For example, a BIO Sequence encoding scheme can be implemented once for MedicalEntityMention, but it will be much more re-usable if we implement it for EntityMention, or even for Span.
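For instance, a BIO encoder written once against generic (begin, end) spans serves every Span subtype, medical or otherwise. A minimal sketch:

```python
def bio_encode(tokens, entities):
    """tokens, entities: lists of (begin, end) character spans.
    Returns one B/I/O tag per token."""
    tags = ["O"] * len(tokens)
    for ent_begin, ent_end in entities:
        inside = False  # becomes True once the entity's first token is tagged
        for i, (tok_begin, tok_end) in enumerate(tokens):
            if tok_begin >= ent_begin and tok_end <= ent_end:
                tags[i] = "I" if inside else "B"
                inside = True
            elif tok_begin >= ent_end:
                break  # tokens are ordered; nothing later can match
    return tags


# "Patient was given aspirin for chest pain." -- token spans below,
# with entities "aspirin" (18, 25) and "chest pain" (30, 40).
tokens = [(0, 7), (8, 11), (12, 17), (18, 25), (26, 29), (30, 35), (36, 40)]
entities = [(18, 25), (30, 40)]
tags = bio_encode(tokens, entities)
```

Because the function only assumes offsets, swapping MedicalEntityMention for WikiEntityMention requires no change to the encoding logic at all.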
Forte further introduces the “Extractor” interface (a new experimental feature), which converts raw data from DataPack into domain-independent features. This interface forces practitioners to separate the domain logic from model logic. The following figures show the configuration of a BIOTaggingExtractor.
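A configuration for such an extractor might be sketched as follows (the key names here are illustrative, not Forte's exact schema): the only domain-specific part is the entry type whose spans get BIO-encoded.

```python
# Medical-domain configuration: BIO-encode MedicalEntityMention spans,
# one tag per Token.
medical_config = {
    "entry_type": "ft.onto.medical.MedicalEntityMention",
    "attribute": "ner_type",
    "tagging_unit": "ft.onto.base_ontology.Token",
}

# Re-purposing to the newswire domain: change only the entry type.
wiki_config = dict(medical_config,
                   entry_type="ft.onto.wiki.WikiEntityMention")
```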
In this example, we change data preprocessing from the medical domain to the Wikipedia domain, by changing only one attribute in the extractor configuration. Since the output of the Extractor will be pure tensors, developers can simply build one PyTorch model without dealing with any domain logic. There is almost no code change needed elsewhere to truly reuse the model.
The extractor interface thus serves as the bridge between model and data, and helps strip unnecessary domain logic away from the models.
And There’s More…
The data-centric abstraction opens the door to many other opportunities, and the Forte team is actively working on many interesting features. One such example is data augmentation: the unified DataPack representation gives us a natural way to develop generic data operations, and our data augmentation interface has several unique characteristics. For example, Forte can attempt to realign the structures stored in the DataPack after certain augmentation operations (e.g., realigning the syntax tree after adding new tokens). The Forte team is also extending DataPacks to handle more types of unstructured data, such as audio, images, and videos!
The following figure shows the current stack of the Forte system; stay tuned for more interesting developments!
Conclusions and Future Remarks
When producing ML solutions, we have witnessed the importance of adhering to good software engineering practices. Yet building ML applications requires a different set of abstractions compared to traditional software. Forte proposes a layer of ML-related abstractions on top of PyTorch tensors, providing a natural interface for domain experts to interact with data and models. We believe Forte can help ML practitioners think about the whole picture from the beginning, enhance the reusability of components, reduce cross-team communication overhead, and ultimately facilitate technological advancement instead of technical debt.
The Forte team’s mission is to help standardize the ML solution building process, which frees engineers and scientists from low-level details, makes the data flow transparent to pipeline users, and enhances the reusability and quality of the solutions and products. If you are interested, please don’t hesitate to visit Forte on Github and leave any comments.
About Petuum and CASL
Forte is a part of CASL, our fast-growing open-source community. CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction.
Petuum was founded with the mission to improve productivity and ease in the ML lifecycle and help accelerate the adoption of AI. We have been developing and guiding the CASL ecosystem with the principle of openness and flexibility, and we are constantly investing in integrations to better support diverse use cases.
In order to better facilitate the use of Forte, the rest of our CASL components, and help AI teams with full MLOps workflows, Petuum is working to build a platform with extensive first-party support. Please visit our website at https://petuum.com to learn more about the work we are doing. And be sure to visit the CASL website to stay up to date on additional CASL and Forte announcements soon.