Custom Entity Recognition in AWS Comprehend

published on 11 November 2024

AWS Comprehend's custom entity recognition lets you train models to find specific terms in your documents - no AI expertise needed. Here's what you need to know:

  • Identify industry-specific entities beyond standard NLP categories
  • Works with plaintext, PDFs, images, and Word docs
  • Train on up to 25 custom entity types
  • No machine learning experience required

To get started:

  1. Prepare training data (entity lists or annotated docs)
  2. Set up an entity recognizer in AWS
  3. Train the model
  4. Deploy for real-time or batch processing
  5. Monitor performance metrics

Key benefits:

  • Automate data extraction from industry-specific documents
  • Improve data quality for better decision making
  • Save time and money vs manual review

Whether you're in finance, manufacturing, healthcare or another field, custom entity recognition can pull out the exact information you need from your documents.

Prerequisites

Before you start with custom entity recognition in AWS Comprehend, you need to set up a few things. Let's go through them:

Set Up AWS Account

First, you need an AWS account. Don't have one? Here's what to do:

  1. Go to the AWS website and hit "Create an AWS Account"
  2. Fill in your email and payment info
  3. Finish the verification process

Once you're done, you can log into the AWS Management Console and use AWS services, including Comprehend.

Set Up IAM Permissions

Next up: IAM permissions. These are crucial for using Comprehend safely. Here's how:

  1. Log into the AWS Management Console
  2. Head to the IAM service
  3. Create a new IAM user (or use an existing one)
  4. Add the right permission policies

For basic access, attach the ComprehendReadOnly policy. But for custom entity recognition, you'll need more.

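Here's an example policy for managing flywheels in Comprehend. The same shape works for custom entity recognition: swap in the actions your workflow calls, such as comprehend:CreateEntityRecognizer and comprehend:DescribeEntityRecognizer: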

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "comprehend:CreateFlywheel",
                "comprehend:DeleteFlywheel",
                "comprehend:StartFlywheelIteration"
            ],
            "Resource": "*"
        }
    ]
}

"Identity-based policies determine whether someone can create, access, or delete Amazon Comprehend resources in your account." - Amazon Comprehend Documentation

Remember: Only give the permissions needed for specific tasks.
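
To make the least-privilege advice concrete, here's a minimal sketch that builds a policy for a train, deploy, and analyze workflow. The action list is an assumption about what such a workflow calls (the names are Amazon Comprehend API operations), and Resource stays as "*" to mirror the example above. Trim the list to what you actually use.

```python
import json

# Assumed action set for a train-deploy-analyze workflow; trim to what you use.
RECOGNIZER_ACTIONS = [
    "comprehend:CreateEntityRecognizer",
    "comprehend:DescribeEntityRecognizer",
    "comprehend:ListEntityRecognizers",
    "comprehend:DeleteEntityRecognizer",
    "comprehend:CreateEndpoint",
    "comprehend:DeleteEndpoint",
    "comprehend:DetectEntities",
    "comprehend:StartEntitiesDetectionJob",
    "comprehend:DescribeEntitiesDetectionJob",
]


def build_recognizer_policy(actions=RECOGNIZER_ACTIONS):
    """Return an identity-based policy document allowing only the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }


policy = build_recognizer_policy()
print(json.dumps(policy, indent=4))
```

Paste the printed JSON into the IAM policy editor, or save it to a file and attach it with your usual tooling.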

Set Up S3 Bucket

Lastly, you need an S3 bucket for your training data and Comprehend output. Here's how to make one:

  1. Open the S3 console in AWS
  2. Click "Create bucket"
  3. Pick a unique name
  4. Choose your AWS Region (make sure it supports Comprehend)
  5. Set up the bucket (default settings are fine for now)

Note: The IAM role that Comprehend assumes must be able to read your input data from this S3 bucket and write output to it.

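Want to use the AWS CLI instead? Try this command. Outside us-east-1, the LocationConstraint must match your Region; for us-east-1, leave out --create-bucket-configuration:

aws s3api create-bucket --bucket your-unique-bucket-name --region your-preferred-region --create-bucket-configuration LocationConstraint=your-preferred-region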
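
The note above about the IAM role can be sketched as two policy documents: a trust policy that lets Comprehend assume the role, and an S3 policy scoped to your bucket. This is a sketch under assumptions, not the only valid setup. The function name is hypothetical, and the bucket name below is the placeholder used later in this article.

```python
import json


def build_data_access_policies(bucket):
    """Return (trust_policy, s3_policy) for a role that Comprehend assumes."""
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "comprehend.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    s3_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {   # reading input and writing output applies to objects
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }
    return trust_policy, s3_policy


trust, s3_access = build_data_access_policies("my-comprehend-bucket")
print(json.dumps(trust, indent=4))
print(json.dumps(s3_access, indent=4))
```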

With these steps done, you're ready to dive into custom entity recognition with AWS Comprehend.

Prepare Your Data

Getting your data ready is key for a solid custom entity recognition model in AWS Comprehend. Here's what you need to know:

Data Format Types

AWS Comprehend works with these formats:

  • Plaintext: Simple and versatile
  • PDF files: Great for annotated training data (English only)
  • Image files: JPG, PNG, and TIFF for input (not for annotation)
  • Word documents: For input (not for annotation)

Pro tip: Use annotated PDF files to create a model that works with all these formats. No extra steps needed.

Create Entity Lists

New to this? Entity lists are a good starting point:

  1. List your entities
  2. Group them by type (max 25 custom types)
  3. Make a CSV file: a Text,Type header line, then one entity and its type per line

Like this:

Text,Type
Amazon,COMPANY
Jeff Bezos,PERSON
AWS,PRODUCT
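
If your entities already live in a spreadsheet or database, a few lines of code can write the list for you. Here's a minimal sketch using only the standard library. The function name and the uppercase-type convention are assumptions, and the 25-type cap is the limit quoted in this article.

```python
import csv
import io

MAX_ENTITY_TYPES = 25  # limit quoted in this article


def build_entity_list(entries):
    """Return CSV text with a Text,Type header from (text, type) pairs."""
    rows = [(text.strip(), entity_type.strip().upper()) for text, entity_type in entries]
    if any(not text or not entity_type for text, entity_type in rows):
        raise ValueError("every row needs both an entity and a type")
    types = {entity_type for _, entity_type in rows}
    if len(types) > MAX_ENTITY_TYPES:
        raise ValueError(f"{len(types)} entity types exceeds {MAX_ENTITY_TYPES}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])  # header line comes first
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_entity_list(
    [("Amazon", "COMPANY"), ("Jeff Bezos", "PERSON"), ("AWS", "PRODUCT")]
)
print(csv_text)
```

The csv module quotes entities that contain commas, so those values stay in a single column.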

Entity lists work best when:

  • You have a ready-made list of entities
  • Entities make sense without context
  • You're only using plaintext docs

Check Data Quality

Good data = accurate model. Remember:

  • Label entities consistently
  • Include diverse examples for each type
  • For annotations, provide context
  • More manually annotated docs = better accuracy

"Bad data in, bad results out." - Amazon Comprehend Team

For annotations:

  • Test data needs at least one annotation per entity type
  • Annotations are crucial for tricky or context-dependent entities
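
A quick sanity check catches many annotation mistakes before you pay for a training run. The sketch below assumes the plaintext annotations CSV layout (File, Line, Begin Offset, End Offset, Type), zero-based line numbers, and an exclusive End Offset. Treat those details as assumptions to confirm against the current Amazon Comprehend docs. The sample document and labels are made up for illustration.

```python
import csv
import io
from collections import Counter


def check_annotations(annotations_csv, document_lines):
    """Return (counts per entity type, list of problems) for an annotations CSV."""
    counts = Counter()
    problems = []
    for row in csv.DictReader(io.StringIO(annotations_csv)):
        line_no = int(row["Line"])
        begin, end = int(row["Begin Offset"]), int(row["End Offset"])
        if not 0 <= line_no < len(document_lines):
            problems.append(f"line {line_no} is outside the document")
            continue
        if not 0 <= begin < end <= len(document_lines[line_no]):
            problems.append(f"bad offsets {begin}-{end} on line {line_no}")
            continue
        counts[row["Type"]] += 1
    return counts, problems


# Hypothetical two-line training document and its annotations.
lines = ["Ship part AX-100 via route R7.", "No entities here."]
annotations = (
    "File,Line,Begin Offset,End Offset,Type\n"
    "docs.txt,0,10,16,PART_ID\n"
    "docs.txt,0,27,29,ROUTE_NUMBER\n"
    "docs.txt,1,5,99,PART_ID\n"
)
counts, problems = check_annotations(annotations, lines)
print(dict(counts))
print(problems)
```

The third row points past the end of its line, so it lands in the problems list instead of the counts.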

Set Up Entity Recognizer

Want to make AWS Comprehend work for your specific business? Let's set up a custom entity recognizer. This tool helps you train models to spot industry-specific terms in your docs.

Configure Model Settings

Here's how to get your custom entity recognizer up and running:

  1. Log into AWS Management Console
  2. Find Amazon Comprehend
  3. Click "Custom" > "Entity recognition"
  4. Hit "Create entity recognizer"

You'll need to provide:

  • A unique name for your recognizer
  • Your training data's language (only English for now with annotated PDFs)
  • Your preferred AWS Region

Now, let's talk training data. You've got two options:

  1. Entity List: Simple and great for plaintext docs with clear-cut entities.
  2. Annotations: Better for image files, PDFs, or Word docs. It's your go-to for context-dependent entities.

For the Entity List method, whip up a CSV file like this:

Text,Type
iPhone X,DEVICE
Samsung Galaxy,DEVICE
Android,DEVICE

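Going with Annotations? Early guidance called for at least 1,000 annotated docs containing your custom entities. Amazon Comprehend now needs far less (see the 25-annotations-per-type figure under Fix Common Problems), so check the current quotas before you start.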

Set Up Entity Types

A few things to keep in mind:

  • You can have up to 25 custom entity types
  • Use clear labels (like DEVICE, PRODUCT_CODE, ROUTE_NUMBER)
  • Be consistent with your labeling

Let's say you're in manufacturing. Your entity types might look like:

  • PART_ID
  • ROUTE_NUMBER
  • WAREHOUSE_CODE

Remember: Your recognizer will only spot the entity types you train it on. Want to catch standard entities like LOCATION or DATE? Include them in your training data.

Once you've set everything up, it's training time. AWS Comprehend handles the heavy lifting, so you can focus on your specific needs.

"Customers can now train state-of-the-art entity recognition models to extract their specific terms, completely automatically. No machine learning experience required." - Nino Bice, Sr. Product Manager leading product for Amazon Comprehend

Keep an eye on the training progress with the DescribeEntityRecognizer operation. When it says TRAINED, you're good to go!
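
Prefer code to console clicks? Here's a hedged sketch of the same setup with boto3's create_entity_recognizer call. The helper names and the S3 paths under train/ are hypothetical. The builder is plain Python, and boto3 is only imported when you actually submit the request.

```python
def build_recognizer_request(name, role_arn, entity_types, documents_uri, entity_list_uri):
    """Build keyword arguments for boto3's create_entity_recognizer call."""
    if not 1 <= len(entity_types) <= 25:
        raise ValueError("expected between 1 and 25 entity types")
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
        "InputDataConfig": {
            "EntityTypes": [{"Type": entity_type} for entity_type in entity_types],
            "Documents": {"S3Uri": documents_uri},
            "EntityList": {"S3Uri": entity_list_uri},
        },
    }


def create_recognizer(request, region="us-east-1"):
    """Submit the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("comprehend", region_name=region)
    return client.create_entity_recognizer(**request)["EntityRecognizerArn"]


request = build_recognizer_request(
    name="my-custom-recognizer",
    role_arn="arn:aws:iam::123456789012:role/ComprehendAccessRole",
    entity_types=["PART_ID", "ROUTE_NUMBER", "WAREHOUSE_CODE"],
    documents_uri="s3://my-comprehend-bucket/train/documents.txt",      # hypothetical path
    entity_list_uri="s3://my-comprehend-bucket/train/entity_list.csv",  # hypothetical path
)
print(request["InputDataConfig"]["EntityTypes"])
```

Calling create_recognizer(request) starts training and returns the recognizer's ARN, which you pass to DescribeEntityRecognizer.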

Train and Track Model

You've set up your custom entity recognizer. Now it's time to train it and keep an eye on how it's doing. Let's break it down.

Start the Training Process

To get the ball rolling:

  1. Head to the Amazon Comprehend console
  2. Pick your custom entity recognizer
  3. Hit "Train"

Amazon Comprehend takes it from there. The training time? It depends. Could be minutes, could be hours. It's all about your dataset's size and complexity.

During this process, Amazon Comprehend does the heavy lifting. It picks the best algorithm, handles sampling, and tweaks the models to find the perfect fit for your data.

Monitor Training Progress

Want to check how things are going? Use the DescribeEntityRecognizer operation. Here's a quick example using the AWS CLI:

aws comprehend describe-entity-recognizer --entity-recognizer-arn "arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer"

This command shows you where your entity recognizer is at. Keep an eye on the Status field:

  • SUBMITTED: We got your request
  • TRAINING: Your model's learning the ropes
  • TRAINED: All done and ready to go
  • IN_ERROR: Oops, something went wrong

When you see TRAINED, your custom entity recognizer is good to go.
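
Rather than rerunning the command by hand, you can poll. The sketch below takes the describe call as a parameter so it runs without AWS access. In real use you'd pass boto3.client("comprehend").describe_entity_recognizer. The function name, poll interval, and the fake client are assumptions for illustration. The API can report statuses beyond the four listed above, which is why the loop is capped.

```python
import time

TERMINAL_STATUSES = {"TRAINED", "IN_ERROR"}  # the two end states listed above


def wait_for_training(describe, recognizer_arn, poll_seconds=60, max_polls=120):
    """Poll until the recognizer reaches a terminal status, then return it."""
    for _ in range(max_polls):
        response = describe(EntityRecognizerArn=recognizer_arn)
        status = response["EntityRecognizerProperties"]["Status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


def make_fake_describe(statuses):
    """Stand-in for the boto3 call that replays a fixed list of statuses."""
    remaining = iter(statuses)

    def describe(EntityRecognizerArn):
        return {"EntityRecognizerProperties": {"Status": next(remaining)}}

    return describe


fake = make_fake_describe(["SUBMITTED", "TRAINING", "TRAINED"])
arn = "arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer"
print(wait_for_training(fake, arn, poll_seconds=0))
```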

Check Model Performance

After training, you'll want to see how well your model's doing. Amazon Comprehend gives you three key metrics:

  1. Precision: How accurate are the positive predictions?
  2. Recall: How many actual positives did we catch?
  3. F1 Score: A balance between precision and recall

These metrics come from true positives (TP), false positives (FP), and false negatives (FN):

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Let's look at an illustrative example:

Say you've trained a custom entity recognizer for a financial services company to spot PERSON and COMPANY entities. After training on 10,000 documents, here's what you get:

Entity Type | Precision | Recall | F1 Score
PERSON      | 0.92      | 0.88   | 0.90
COMPANY     | 0.85      | 0.79   | 0.82

These numbers show your model's doing pretty well, especially with PERSON entities. But there's room to grow when it comes to COMPANY entities.
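
The formulas above are easy to check yourself. Here's a small sketch that computes all three metrics from raw counts and recomputes the F1 column of the example table from its precision and recall values. The function names are hypothetical, and the zero guards are a convenience choice, not something Amazon Comprehend specifies.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp, fp, fn):
    """Compute (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Recompute the F1 column of the example table from its precision and recall.
for entity_type, precision, recall in [("PERSON", 0.92, 0.88), ("COMPANY", 0.85, 0.79)]:
    print(entity_type, round(f1_score(precision, recall), 2))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model can't hide weak recall behind strong precision.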

"When the training is completed the custom model is ready to go. To start analyzing documents looking for custom entities, either use the portal or APIs via the AWS SDK." - Nino Bice, Sr. Product Manager leading product for Amazon Comprehend

Improve Model Performance

Not happy with your model's performance? Try these:

  1. Boost data quality: Stick to Amazon Comprehend's guidelines for annotations or entity lists.
  2. Use annotations: If you're using entity lists now, switching to annotations often works better.
  3. Expand your dataset: More high-quality, annotated documents can bump up accuracy.
  4. Balance your entities: Make sure all entity types are well-represented in your training data.

Deploy Your Model

You've trained and fine-tuned your custom entity recognition model. Now it's time to use it. AWS Comprehend gives you two main options: real-time processing for quick results and batch processing for bigger datasets.

Real-time Processing

Want quick entity detection for single documents or small batches? Here's how to set it up:

1. Create an endpoint using the AWS CLI:

aws comprehend create-endpoint \
--desired-inference-units 1 \
--endpoint-name my-custom-entity-endpoint \
--model-arn arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer \
--tags Key=Project,Value=EntityRecognition

This sets up an endpoint with 1 inference unit (IU), processing 100 characters per second. Need more speed? Just bump up the --desired-inference-units.

2. Start detecting entities in real-time:

aws comprehend detect-entities \
--endpoint-arn arn:aws:comprehend:us-east-1:123456789012:entity-recognizer-endpoint/my-custom-entity-endpoint \
--language-code en \
--text "Andy Jassy became the CEO of Amazon in 2021."

You'll get back detected entities with their types and confidence scores.
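
Once you have a response, you'll usually want to drop low-confidence hits. Here's a minimal sketch that filters the Entities list by Score. The 0.8 threshold, the function name, and the sample response are assumptions for illustration, not values recommended by AWS.

```python
def confident_entities(response, min_score=0.8):
    """Return (text, type, score) tuples for entities at or above min_score."""
    return [
        (entity["Text"], entity["Type"], round(entity["Score"], 2))
        for entity in response.get("Entities", [])
        if entity["Score"] >= min_score
    ]


# Hypothetical response shaped like DetectEntities output.
sample_response = {
    "Entities": [
        {"Text": "Andy Jassy", "Type": "PERSON", "Score": 0.99, "BeginOffset": 0, "EndOffset": 10},
        {"Text": "Amazon", "Type": "COMPANY", "Score": 0.95, "BeginOffset": 29, "EndOffset": 35},
        {"Text": "CEO", "Type": "COMPANY", "Score": 0.41, "BeginOffset": 22, "EndOffset": 25},
    ]
}
print(confident_entities(sample_response))
```

Tune the threshold against your own precision and recall needs: raising it trades missed entities for fewer false alarms.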

Keep an eye on your endpoint's performance with Amazon CloudWatch. It'll help you tweak the number of inference units to balance cost and speed.
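
For a first guess at sizing, you can work backward from expected traffic. This sketch uses the 100 characters per second per inference unit figure quoted above. The function name and the traffic numbers are hypothetical, and real workloads are bursty, so treat the result as a starting point to adjust with CloudWatch data.

```python
import math

CHARS_PER_SECOND_PER_IU = 100  # throughput figure quoted in this article


def inference_units_needed(docs_per_second, avg_chars_per_doc):
    """Estimate inference units for a steady load, with a floor of one unit."""
    chars_per_second = docs_per_second * avg_chars_per_doc
    return max(1, math.ceil(chars_per_second / CHARS_PER_SECOND_PER_IU))


# Hypothetical load: 2 documents per second, about 450 characters each.
print(inference_units_needed(2, 450))
```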

Batch Processing

Got a ton of data? Batch processing is your friend. Here's how:

1. Put your documents in an S3 bucket (e.g., s3://my-comprehend-bucket/input/).

2. Kick off a batch job:

aws comprehend start-entities-detection-job \
--entity-recognizer-arn arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer \
--job-name batch-entity-job-1 \
--data-access-role-arn arn:aws:iam::123456789012:role/ComprehendAccessRole \
--language-code en \
--input-data-config S3Uri=s3://my-comprehend-bucket/input/ \
--output-data-config S3Uri=s3://my-comprehend-bucket/output/ \
--region us-east-1

This processes every document under your input prefix and saves results under the output prefix.

3. Check on your job:

aws comprehend describe-entities-detection-job \
--job-id your-job-id

When it's done, you'll find a compressed archive (output.tar.gz) in your output location, containing the detected entities as JSON.
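
Reading the results takes a little unpacking. The sketch below assumes the job writes a gzip-compressed tar archive whose members hold one JSON object per line, each with an Entities list and a Line number. That matches one-document-per-line input, so verify it against your own output. The function name is hypothetical, and it expects the archive's bytes, for example after downloading the file from S3.

```python
import io
import json
import tarfile


def read_entities(archive_bytes):
    """Yield (line number, entity dict) pairs from an output archive's bytes."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            for raw_line in archive.extractfile(member):
                if not raw_line.strip():
                    continue  # ignore blank lines
                record = json.loads(raw_line)
                for entity in record.get("Entities", []):
                    yield record.get("Line"), entity
```

Loop over read_entities(data) to load the pairs into a spreadsheet, database, or dashboard.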

Here's a real-world example: A financial services company used custom entity recognition on 10,000 quarterly reports. They ran a batch job overnight, spotting custom entities like specific financial products and company terms with 92% accuracy. It saved their analysts over 200 hours of manual review time.

Don't forget to delete your endpoint when you're done to avoid extra charges. You can always create a new one later.

Fix Common Problems

Custom entity recognition in AWS Comprehend can be tricky. Let's look at some common issues and how to solve them.

Fix API Errors

API errors are a pain, but you can usually fix them. Here are some you might run into:

1. Missing Annotations Error

If you see this:

"The augmented manifest referenced in your InputDataConfig.AugmentedManifests at index 0 doesn't have any annotations"

Your input data is missing annotations. To fix it:

  • Check your augmented manifest files
  • Make sure they have correct annotation references
  • Check that annotation files exist and you can access them

2. Authorization Errors

You might see something like:

"User: arn:aws:iam::123456789012:user/mateojackson is not authorized to perform: comprehend:GetWidget on resource: my-example-widget"

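This means you don't have the right permissions. (comprehend:GetWidget and my-example-widget are placeholders from the AWS docs; your message will name the actual action and resource.) Here's what to do: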

  • Look at your IAM policies
  • Make sure they allow the right actions in Amazon Comprehend
  • Ask your AWS admin for help if needed

3. iam:PassRole Error

If you see:

"User: arn:aws:iam::123456789012:user/marymajor is not authorized to perform: iam:PassRole"

Your policies need updating to let you pass a role to Amazon Comprehend. Talk to your AWS admin about this.
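
If you manage your own policies, the missing piece usually looks like the statement below. It's a sketch, assuming the role ARN used in the batch example earlier. The function name is hypothetical. The condition limits the permission so the role can only be handed to Amazon Comprehend.

```python
import json


def pass_role_statement(role_arn):
    """Return a policy statement allowing role_arn to be passed to Comprehend only."""
    return {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": role_arn,
        "Condition": {
            "StringEquals": {"iam:PassedToService": "comprehend.amazonaws.com"}
        },
    }


statement = pass_role_statement("arn:aws:iam::123456789012:role/ComprehendAccessRole")
print(json.dumps(statement, indent=4))
```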

Improve Model Performance

If your model isn't working as well as you'd like, try these:

1. Enhance Data Quality

Bad data leads to bad results. To make it better:

  • Follow Amazon Comprehend's rules for annotations or entity lists
  • Label everything the same way
  • Use different examples for each entity type

2. Increase Dataset Size

More good data usually means better results. Amazon Comprehend doesn't need as much data as before, but more is still better:

  • Try for at least 25 annotations per entity type
  • With just 25 annotations per type, some users got an average F1 score of 84%

3. Use Annotations Instead of Entity Lists

If entity lists aren't working, try annotations. They're good for:

  • Entities that depend on context
  • Training models for image files, PDFs, or Word documents

4. Balance Your Entities

Make sure you have enough examples of all entity types. If you don't, some might not work well.
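
A quick count shows which types are thin. This sketch flags any entity type with fewer annotations than the 25-per-type figure mentioned above. The function name and the sample labels are made up for illustration.

```python
from collections import Counter

MIN_ANNOTATIONS = 25  # per-type figure quoted in this article


def underrepresented_types(annotation_types, minimum=MIN_ANNOTATIONS):
    """Return {entity type: count} for types with fewer than `minimum` annotations."""
    counts = Counter(annotation_types)
    return {entity_type: n for entity_type, n in counts.items() if n < minimum}


# Hypothetical label column pulled from an annotations file.
labels = ["PART_ID"] * 30 + ["ROUTE_NUMBER"] * 10
print(underrepresented_types(labels))
```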

Handle Incorrect Results

Sometimes, even after training, things go wrong. One user said:

"AWS returned the value of 'select common flexbase' as one single 'material' type. This should have been three separate 'material' types (and was annotated hundreds, if not thousands, of times separately in the training annotations)."

To fix this:

  • Check your annotations to make sure they're right
  • Try adding more examples for types that aren't working
  • If it's still not working, you might need to train your model again with different examples

Amazon Comprehend automates the training, but you still need good data to get good results.

Summary

Custom entity recognition in AWS Comprehend lets businesses pull specific info from text. Here's how it works:

1. Data Prep

Gather your entity list or annotated docs. A finance company might list bankruptcy terms or mark up quarterly reports.

2. Set Up the Recognizer

Choose your model settings and entity types. You can train on up to 25 custom entity types at once, which is great for specialized industries.

3. Train the Model

AWS Comprehend does the heavy lifting. It picks the best algorithm and fine-tunes based on your data.

4. Deploy

Go for real-time processing for quick results or batch processing for bigger datasets.

5. Keep an Eye on Performance

Watch precision, recall, and F1 scores to see how well your model's doing.

Custom entity recognition works across industries:

  • Finance: Spot bankruptcy terms in market reports
  • Manufacturing: Pull part IDs and route numbers from logistics docs
  • Healthcare: Find medical terms and treatments in patient records
  • Legal: Catch case numbers and legal jargon

This tech is now open to businesses of all sizes. For example, a mid-sized manufacturer used it to analyze 10,000 logistics docs overnight. They found part IDs and route numbers with 92% accuracy, saving 200+ hours of manual review.

FAQs

Which AI services does Amazon Comprehend provide?

Amazon Comprehend packs a punch with its AI-powered natural language processing (NLP) services. Here's what it can do:

1. Entity Recognition

Spots people, places, and organizations in text. It even does custom recognition for industry-specific terms.

2. Sentiment Analysis

Figures out if a piece of text is positive, negative, neutral, or mixed.

3. Key Phrase Extraction

Pulls out the most important bits from a document.

4. Language Detection

Tells you what language the text is in.

5. Topic Modeling

Groups text documents into topics.

6. Syntax Analysis

Breaks down text to show how words relate to each other.

7. PII Detection

Finds sensitive personal info in documents and can redact it.

8. Document Classification

Sorts documents into categories, including custom ones for specific needs.

Amazon Comprehend isn't picky - it works with plain text, PDFs, Word docs, and even images (JPG, PNG, TIFF). It speaks multiple languages too, including English, Spanish, German, Italian, Portuguese, French, and Japanese.

"Amazon Comprehend uses deep learning technology to accurately analyze text." - Amazon Comprehend Team

The best part? You don't need to be a machine learning guru to use it. It's fully managed and always learning.