AWS Comprehend's custom entity recognition lets you train models to find specific terms in your documents - no AI expertise needed. Here's what you need to know:
- Identify industry-specific entities beyond standard NLP categories
- Works with plaintext, PDFs, images, and Word docs
- Train on up to 25 custom entity types
- No machine learning experience required
To get started:
- Prepare training data (entity lists or annotated docs)
- Set up an entity recognizer in AWS
- Train the model
- Deploy for real-time or batch processing
- Monitor performance metrics
Key benefits:
- Automate data extraction from industry-specific documents
- Improve data quality for better decision making
- Save time and money vs manual review
Whether you're in finance, manufacturing, healthcare or another field, custom entity recognition can pull out the exact information you need from your documents.
Prerequisites
Before you start with custom entity recognition in AWS Comprehend, you need to set up a few things. Let's go through them:
Set Up AWS Account
First, you need an AWS account. Don't have one? Here's what to do:
- Go to the AWS website and hit "Create an AWS Account"
- Fill in your email and payment info
- Finish the verification process
Once you're done, you can log into the AWS Management Console and use AWS services, including Comprehend.
Set Up IAM Permissions
Next up: IAM permissions. These are crucial for using Comprehend safely. Here's how:
- Log into the AWS Management Console
- Head to the IAM service
- Create a new IAM user (or use an existing one)
- Add the right permission policies
For basic access, attach the ComprehendReadOnly policy. But for custom entity recognition, you'll need more.
Here's an example policy for managing flywheels in Comprehend:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "comprehend:CreateFlywheel",
        "comprehend:DeleteFlywheel",
        "comprehend:StartFlywheelIteration"
      ],
      "Resource": "*"
    }
  ]
}
"Identity-based policies determine whether someone can create, access, or delete Amazon Comprehend resources in your account." - Amazon Comprehend Documentation
Remember: Only give the permissions needed for specific tasks.
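For custom entity recognition specifically, a narrower policy than the flywheel example might look like the sketch below. The action names are real Amazon Comprehend and IAM actions, but the selection of actions, the helper name build_recognizer_policy, and the role ARN are assumptions for illustration. Trim or extend the list to match what your users actually do.

```python
import json


def build_recognizer_policy(data_access_role_arn: str) -> dict:
    """Return an identity-based policy document for custom entity recognition."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "comprehend:CreateEntityRecognizer",
                    "comprehend:DescribeEntityRecognizer",
                    "comprehend:ListEntityRecognizers",
                    "comprehend:StartEntitiesDetectionJob",
                    "comprehend:DescribeEntitiesDetectionJob",
                ],
                "Resource": "*",
            },
            {
                # Lets the user hand the S3 data-access role to Comprehend.
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": data_access_role_arn,
            },
        ],
    }


# Hypothetical role ARN, reused from the batch example later in this guide.
policy = build_recognizer_policy("arn:aws:iam::123456789012:role/ComprehendAccessRole")
print(json.dumps(policy, indent=2))
```

Scoping iam:PassRole to a single role keeps the policy in line with the least-privilege reminder above.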
Set Up S3 Bucket
Lastly, you need an S3 bucket for your training data and Comprehend output. Here's how to make one:
- Open the S3 console in AWS
- Click "Create bucket"
- Pick a unique name
- Choose your AWS Region (make sure it supports Comprehend)
- Set up the bucket (default settings are fine for now)
Note: Your IAM role for Comprehend must be able to read from this S3 bucket.
Want to use the AWS CLI instead? Try this command:
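aws s3api create-bucket --bucket your-unique-bucket-name --region your-preferred-region --create-bucket-configuration LocationConstraint=your-preferred-region
(Leave off --create-bucket-configuration if your Region is us-east-1; every other Region needs it.)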
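If you'd rather script this in Python, here's a minimal sketch of the arguments boto3's S3 create_bucket call expects. The one gotcha it captures: outside us-east-1, S3 wants a LocationConstraint, while us-east-1 must not be given one. The helper name create_bucket_kwargs is made up for this example, and the actual boto3 call is left in a comment because it needs credentials.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build keyword arguments for boto3's S3 client create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and is not passed as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("your-unique-bucket-name", "eu-west-1"))

# To actually create the bucket (requires boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="eu-west-1")
#   s3.create_bucket(**create_bucket_kwargs("your-unique-bucket-name", "eu-west-1"))
```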
With these steps done, you're ready to dive into custom entity recognition with AWS Comprehend.
Prepare Your Data
Getting your data ready is key for a solid custom entity recognition model in AWS Comprehend. Here's what you need to know:
Data Format Types
AWS Comprehend works with these formats:
- Plaintext: Simple and versatile
- PDF files: Great for annotated training data (English only)
- Image files: JPG, PNG, and TIFF for input (not for annotation)
- Word documents: For input (not for annotation)
Pro tip: Use annotated PDF files to create a model that works with all these formats. No extra steps needed.
Create Entity Lists
New to this? Entity lists are a good starting point:
- List your entities
- Group them by type (max 25 custom types)
- Make a plaintext CSV file: a Text,Type header row, then one entity and its type per line
Like this:
Text,Type
Amazon,COMPANY
Jeff Bezos,PERSON
AWS,PRODUCT
Entity lists work best when:
- You have a ready-made list of entities
- Entities make sense without context
- You're only using plaintext docs
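If your entities already live in a spreadsheet or database, generating the list file is easy to script. Here's a rough sketch that writes the CSV with a Text,Type header and enforces the 25-type limit mentioned above. The function name write_entity_list is made up, and upper-casing the type labels is a convention I'm assuming, not a rule taken from this guide.

```python
import csv
import io

MAX_ENTITY_TYPES = 25  # limit on custom entity types stated above


def write_entity_list(entities: list[tuple[str, str]]) -> str:
    """Return entity-list CSV text with a Text,Type header row."""
    types = {entity_type.strip().upper() for _, entity_type in entities}
    if len(types) > MAX_ENTITY_TYPES:
        raise ValueError(f"{len(types)} entity types exceeds the limit of {MAX_ENTITY_TYPES}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    for text, entity_type in entities:
        if not text.strip() or not entity_type.strip():
            raise ValueError(f"empty text or type in row: {(text, entity_type)!r}")
        # Assumed convention: store type labels in upper case.
        writer.writerow([text.strip(), entity_type.strip().upper()])
    return buffer.getvalue()


entity_csv = write_entity_list(
    [("Amazon", "COMPANY"), ("Jeff Bezos", "PERSON"), ("AWS", "PRODUCT")]
)
print(entity_csv)
```

Using the csv module instead of string joins means entities that contain commas get quoted correctly.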
Check Data Quality
Good data = accurate model. Remember:
- Label entities consistently
- Include diverse examples for each type
- For annotations, provide context
- More manually annotated docs = better accuracy
"Bad data in, bad results out." - Amazon Comprehend Team
For annotations:
- Test data needs at least one annotation per entity type
- They're crucial for tricky or context-dependent entities
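To make the annotation idea concrete, here's a small sketch that turns known terms into rows for a plaintext annotations CSV (File, Line, Begin Offset, End Offset, Type). It assumes line numbers and offsets start at 0 and that the end offset is exclusive, so double-check that against the current Comprehend docs. The annotate helper and the sample sentences are hypothetical, and plain string matching like this is only a way to bootstrap labels that a human should still review.

```python
import csv
import io


def annotate(file_name: str, lines: list[str], terms: dict[str, str]) -> list[list]:
    """Find each term in each line and emit [file, line, begin, end, type] rows."""
    rows = []
    for line_number, line in enumerate(lines):  # line numbers start at 0
        for term, entity_type in terms.items():
            start = line.find(term)
            while start != -1:
                # Assumption: the end offset is exclusive (begin + length).
                rows.append([file_name, line_number, start, start + len(term), entity_type])
                start = line.find(term, start + len(term))
    return rows


# Hypothetical two-line training document.
doc_lines = [
    "Ship part PX-100 via route R7.",
    "Part PX-200 is stored in warehouse W3.",
]
rows = annotate(
    "documents.txt",
    doc_lines,
    {"PX-100": "PART_ID", "PX-200": "PART_ID", "R7": "ROUTE_NUMBER"},
)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
writer.writerows(rows)
print(buffer.getvalue())
```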
Set Up Entity Recognizer
Want to make AWS Comprehend work for your specific business? Let's set up a custom entity recognizer. This tool helps you train models to spot industry-specific terms in your docs.
Configure Model Settings
Here's how to get your custom entity recognizer up and running:
- Log into AWS Management Console
- Find Amazon Comprehend
- Click "Custom" > "Entity recognition"
- Hit "Create entity recognizer"
You'll need to provide:
- A unique name for your recognizer
- Your training data's language (only English for now with annotated PDFs)
- Your preferred AWS Region
Now, let's talk training data. You've got two options:
- Entity List: Simple and great for plaintext docs with clear-cut entities.
- Annotations: Better for image files, PDFs, or Word docs. It's your go-to for context-dependent entities.
For the Entity List method, whip up a CSV file like this:
Text,Type
iPhone X,DEVICE
Samsung Galaxy,DEVICE
Android,DEVICE
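Going with Annotations? Older guidance called for at least 1,000 annotated docs, but the minimums have dropped a lot since then. Current guidance is at least 25 annotations per entity type, so check the Amazon Comprehend quotas page before you start labeling.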
Set Up Entity Types
A few things to keep in mind:
- You can have up to 25 custom entity types
- Use clear labels (like DEVICE, PRODUCT_CODE, ROUTE_NUMBER)
- Be consistent with your labeling
Let's say you're in manufacturing. Your entity types might look like:
- PART_ID
- ROUTE_NUMBER
- WAREHOUSE_CODE
Remember: Your recognizer will only spot the entity types you train it on. Want to catch standard entities like LOCATION or DATE? Include them in your training data.
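The same setup can be done through the API instead of the console. Below is a minimal sketch of the request for the CreateEntityRecognizer operation using the entity-list method. The parameter names follow the boto3 create_entity_recognizer call as I understand it. The helper name recognizer_request, the bucket paths, and the role ARN are placeholders I've made up.

```python
def recognizer_request(name, role_arn, entity_types, documents_uri, entity_list_uri, language="en"):
    """Build keyword arguments for comprehend.create_entity_recognizer (entity-list training)."""
    if not 1 <= len(entity_types) <= 25:
        raise ValueError("between 1 and 25 entity types are required")
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": language,
        "InputDataConfig": {
            "EntityTypes": [{"Type": entity_type} for entity_type in entity_types],
            "Documents": {"S3Uri": documents_uri},
            "EntityList": {"S3Uri": entity_list_uri},
        },
    }


# Hypothetical names and S3 paths for a manufacturing recognizer.
request = recognizer_request(
    name="my-custom-recognizer",
    role_arn="arn:aws:iam::123456789012:role/ComprehendAccessRole",
    entity_types=["PART_ID", "ROUTE_NUMBER", "WAREHOUSE_CODE"],
    documents_uri="s3://my-comprehend-bucket/train/documents/",
    entity_list_uri="s3://my-comprehend-bucket/train/entity_list.csv",
)
print(request["InputDataConfig"]["EntityTypes"])

# To submit it (requires boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("comprehend").create_entity_recognizer(**request)
#   print(response["EntityRecognizerArn"])
```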
Once you've set everything up, it's training time. AWS Comprehend handles the heavy lifting, so you can focus on your specific needs.
"Customers can now train state-of-the-art entity recognition models to extract their specific terms, completely automatically. No machine learning experience required." - Nino Bice, Sr. Product Manager leading product for Amazon Comprehend
Keep an eye on the training progress with the DescribeEntityRecognizer operation. When it says TRAINED, you're good to go!
Train and Track Model
You've set up your custom entity recognizer. Now it's time to train it and keep an eye on how it's doing. Let's break it down.
Start the Training Process
To get the ball rolling:
- Head to the Amazon Comprehend console
- Pick your custom entity recognizer
- Hit "Train"
Amazon Comprehend takes it from there. The training time? It depends. Could be minutes, could be hours. It's all about your dataset's size and complexity.
During this process, Amazon Comprehend does the heavy lifting. It picks the best algorithm, handles sampling, and tweaks the models to find the perfect fit for your data.
Monitor Training Progress
Want to check how things are going? Use the DescribeEntityRecognizer operation. Here's a quick example using the AWS CLI:
aws comprehend describe-entity-recognizer --entity-recognizer-arn "arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer"
This command shows you where your entity recognizer is at. Keep an eye on the Status field:
- SUBMITTED: We got your request
- TRAINING: Your model's learning the ropes
- TRAINED: All done and ready to go
- IN_ERROR: Oops, something went wrong
When you see TRAINED, your custom entity recognizer is good to go.
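Rather than re-running the command by hand, you can poll until training finishes. This is a sketch, not production code: wait_for_training is a made-up helper, it only treats the TRAINED and IN_ERROR statuses from the list above as final, and it takes the describe function as a parameter so the demo can run against a fake instead of AWS.

```python
import time

TERMINAL_STATUSES = {"TRAINED", "IN_ERROR"}  # statuses treated as final in this sketch


def wait_for_training(describe, recognizer_arn, poll_seconds=60, max_polls=240, sleep=time.sleep):
    """Poll until the recognizer reaches a terminal status and return that status."""
    for _ in range(max_polls):
        response = describe(EntityRecognizerArn=recognizer_arn)
        status = response["EntityRecognizerProperties"]["Status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"recognizer still not finished after {max_polls} polls")


# Demo with a fake describe function that walks through the statuses.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINED"])


def fake_describe(EntityRecognizerArn):
    return {"EntityRecognizerProperties": {"Status": next(fake_statuses)}}


final_status = wait_for_training(
    fake_describe,
    "arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer",
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final_status)

# Against AWS you would pass the real call instead (requires boto3 and credentials):
#   import boto3
#   wait_for_training(boto3.client("comprehend").describe_entity_recognizer, recognizer_arn)
```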
Check Model Performance
After training, you'll want to see how well your model's doing. Amazon Comprehend gives you three key metrics:
- Precision: How accurate are the positive predictions?
- Recall: How many actual positives did we catch?
- F1 Score: A balance between precision and recall
These metrics come from true positives (TP), false positives (FP), and false negatives (FN):
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
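Here are those three formulas as a few lines of Python, in case you want to recompute the metrics from your own evaluation counts. The function name and the sample counts are made up for illustration. Returning 0.0 when a denominator is zero is a choice I've made for this sketch.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 correct detections, 20 spurious ones, 20 missed entities.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```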
Let's look at an illustrative example:
Say you've trained a custom entity recognizer for a financial services company to spot PERSON and COMPANY entities. After training on 10,000 documents, here's what you get:
Entity Type | Precision | Recall | F1 Score |
---|---|---|---|
PERSON | 0.92 | 0.88 | 0.90 |
COMPANY | 0.85 | 0.79 | 0.82 |
These numbers show your model's doing pretty well, especially with PERSON entities. But there's room to grow when it comes to COMPANY entities.
"When the training is completed the custom model is ready to go. To start analyzing documents looking for custom entities, either use the portal or APIs via the AWS SDK." - Nino Bice, Sr. Product Manager leading product for Amazon Comprehend
Improve Model Performance
Not happy with your model's performance? Try these:
- Boost data quality: Stick to Amazon Comprehend's guidelines for annotations or entity lists.
- Use annotations: If you're using entity lists now, switching to annotations often works better.
- Expand your dataset: More high-quality, annotated documents can bump up accuracy.
- Balance your entities: Make sure all entity types are well-represented in your training data.
Deploy Your Model
You've trained and fine-tuned your custom entity recognition model. Now it's time to use it. AWS Comprehend gives you two main options: real-time processing for quick results and batch processing for bigger datasets.
Real-time Processing
Want quick entity detection for single documents or small batches? Here's how to set it up:
1. Create an endpoint using the AWS CLI:
aws comprehend create-endpoint \
--desired-inference-units 1 \
--endpoint-name my-custom-entity-endpoint \
--model-arn arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer \
--tags Key=Project,Value=EntityRecognition
This sets up an endpoint with 1 inference unit (IU), which handles up to 100 characters per second. Need more speed? Just bump up --desired-inference-units.
2. Start detecting entities in real-time:
aws comprehend detect-entities \
--endpoint-arn arn:aws:comprehend:us-east-1:123456789012:entity-recognizer-endpoint/my-custom-entity-endpoint \
--language-code en \
--text "Andy Jassy became the CEO of Amazon in 2021."
You'll get back detected entities with their types and confidence scores.
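Once you have a response, you'll usually want to keep only the entities the model is reasonably sure about. Here's a minimal sketch of that filtering step. The sample response is hypothetical (shaped like DetectEntities output, with Text, Type, Score, and offsets), and the 0.8 threshold is an arbitrary starting point, not a recommendation from AWS.

```python
def confident_entities(response: dict, min_score: float = 0.8) -> list[tuple[str, str, float]]:
    """Return (text, type, score) for every entity at or above min_score."""
    return [
        (entity["Text"], entity["Type"], entity["Score"])
        for entity in response.get("Entities", [])
        if entity["Score"] >= min_score
    ]


# Hypothetical response for "Andy Jassy became the CEO of Amazon in 2021."
sample_response = {
    "Entities": [
        {"Text": "Andy Jassy", "Type": "PERSON", "Score": 0.97, "BeginOffset": 0, "EndOffset": 10},
        {"Text": "Amazon", "Type": "COMPANY", "Score": 0.62, "BeginOffset": 29, "EndOffset": 35},
    ]
}

kept = confident_entities(sample_response)
print(kept)

# With boto3 the response would come from something like:
#   boto3.client("comprehend").detect_entities(Text=text, EndpointArn=endpoint_arn)
```

Low-scoring entities are good candidates for manual review, and they often point at entity types that need more training examples.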
Keep an eye on your endpoint's performance with Amazon CloudWatch. It'll help you tweak the number of inference units to balance cost and speed.
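For a first guess at how many inference units to request, you can work backward from your expected traffic. This back-of-the-envelope sketch uses the 100 characters per second per IU figure quoted above. The function name and the traffic numbers are assumptions, and real sizing should be checked against your CloudWatch metrics.

```python
import math

CHARS_PER_SECOND_PER_IU = 100  # throughput per inference unit, as quoted above


def inference_units_needed(docs_per_second: float, avg_chars_per_doc: int) -> int:
    """Estimate the inference units needed for a steady request rate."""
    required_chars_per_second = docs_per_second * avg_chars_per_doc
    return max(1, math.ceil(required_chars_per_second / CHARS_PER_SECOND_PER_IU))


# Hypothetical load: 2 documents per second, about 450 characters each.
print(inference_units_needed(docs_per_second=2, avg_chars_per_doc=450))
```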
Batch Processing
Got a ton of data? Batch processing is your friend. Here's how:
1. Put your documents in an S3 bucket (e.g., s3://my-comprehend-bucket/input/).
2. Kick off a batch job:
aws comprehend start-entities-detection-job \
--entity-recognizer-arn arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/my-custom-recognizer \
--job-name batch-entity-job-1 \
--data-access-role-arn arn:aws:iam::123456789012:role/ComprehendAccessRole \
--language-code en \
--input-data-config S3Uri=s3://my-comprehend-bucket/input/ \
--output-data-config S3Uri=s3://my-comprehend-bucket/output/ \
--region us-east-1
This processes all docs in your input bucket and saves results to the output bucket.
3. Check on your job:
aws comprehend describe-entities-detection-job \
--job-id your-job-id
When it's done, you'll find a compressed archive (output.tar.gz) in your output bucket, with the detected entities inside as JSON.
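Batch results arrive as a compressed archive, so you need to unpack them before you can use them. The sketch below reads an output.tar.gz you've already downloaded and parses it, assuming the archive holds JSON Lines with one record per line. The record layout in the demo (File plus Entities) and the helper names are my assumptions, so inspect a real output file before relying on it.

```python
import io
import json
import tarfile


def read_job_output(archive_bytes: bytes) -> list[dict]:
    """Parse every JSON line from every file inside a .tar.gz archive."""
    records = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            for raw_line in archive.extractfile(member):
                if raw_line.strip():
                    records.append(json.loads(raw_line))
    return records


def _demo_archive() -> bytes:
    """Build a tiny in-memory archive that mimics the assumed output layout."""
    record = {
        "File": "report1.txt",
        "Entities": [{"Text": "PX-100", "Type": "PART_ID", "Score": 0.95}],
    }
    payload = (json.dumps(record) + "\n").encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="output")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


records = read_job_output(_demo_archive())
for record in records:
    for entity in record["Entities"]:
        print(record["File"], entity["Type"], entity["Text"])
```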
Here's a real-world example: A financial services company used custom entity recognition on 10,000 quarterly reports. They ran a batch job overnight, spotting custom entities like specific financial products and company terms with 92% accuracy. It saved their analysts over 200 hours of manual review time.
Don't forget to delete your endpoint when you're done to avoid extra charges. You can always create a new one later.
Fix Common Problems
Custom entity recognition in AWS Comprehend can be tricky. Let's look at some common issues and how to solve them.
Fix API Errors
API errors are a pain, but you can usually fix them. Here are some you might run into:
1. Missing Annotations Error
If you see this:
"The augmented manifest referenced in your InputDataConfig.AugmentedManifests at index 0 doesn't have any annotations"
Your input data is missing annotations. To fix it:
- Check your augmented manifest files
- Make sure they have correct annotation references
- Check that annotation files exist and you can access them
2. Authorization Errors
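You might see something like this (GetWidget and my-example-widget are placeholders from the AWS documentation; your error will name a real Comprehend action and resource):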
"User: arn:aws:iam::123456789012:user/mateojackson is not authorized to perform: comprehend:GetWidget on resource: my-example-widget"
This means you don't have the right permissions. Here's what to do:
- Look at your IAM policies
- Make sure they allow the right actions in Amazon Comprehend
- Ask your AWS admin for help if needed
3. iam:PassRole Error
If you see:
"User: arn:aws:iam::123456789012:user/marymajor is not authorized to perform: iam:PassRole"
Your policies need updating to let you pass a role to Amazon Comprehend. Talk to your AWS admin about this.
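If you're calling Comprehend from a script, you can map these three error messages to the fixes above so failures are easier to triage. This is just a sketch based on matching message text: the troubleshooting_hint helper is made up, and the matched substrings come from the sample errors in this section, so real messages may be worded differently.

```python
def troubleshooting_hint(error_message: str) -> str:
    """Map an error message to the matching fix described in this section."""
    message = error_message.lower()
    # Check iam:PassRole first, because it is also an authorization error.
    if "iam:passrole" in message:
        return "Ask your AWS admin to allow iam:PassRole for the role you pass to Comprehend."
    if "is not authorized to perform" in message:
        return "Check that your IAM policies allow the Comprehend action named in the error."
    if "doesn't have any annotations" in message:
        return "Check that the augmented manifest references annotation files that exist and are readable."
    return "Unrecognized error; see the Amazon Comprehend documentation."


sample = (
    "User: arn:aws:iam::123456789012:user/marymajor "
    "is not authorized to perform: iam:PassRole"
)
print(troubleshooting_hint(sample))

# With boto3, the message is available on a caught botocore ClientError as
#   err.response["Error"]["Message"]
```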
Improve Model Performance
If your model isn't working as well as you'd like, try these:
1. Enhance Data Quality
Bad data leads to bad results. To make it better:
- Follow Amazon Comprehend's rules for annotations or entity lists
- Label everything the same way
- Use different examples for each entity type
2. Increase Dataset Size
More good data usually means better results. Amazon Comprehend doesn't need as much data as before, but more is still better:
- Try for at least 25 annotations per entity type
- With just 25 annotations per type, some users got an average F1 score of 84%
3. Use Annotations Instead of Entity Lists
If entity lists aren't working, try annotations. They're good for:
- Entities that depend on context
- Training models for image files, PDFs, or Word documents
4. Balance Your Entities
Make sure you have enough examples of all entity types. If you don't, some might not work well.
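A quick count of annotations per entity type will show you where the training data is thin. Here's a minimal sketch that flags any type below the 25-annotation target mentioned above. The function name and the sample counts are made up for illustration.

```python
from collections import Counter

MIN_ANNOTATIONS_PER_TYPE = 25  # target from the dataset-size tip above


def underrepresented_types(annotation_types, expected_types, minimum=MIN_ANNOTATIONS_PER_TYPE):
    """Return {entity_type: count} for every expected type with fewer than `minimum` annotations."""
    counts = Counter(annotation_types)
    return {t: counts[t] for t in expected_types if counts[t] < minimum}


# Hypothetical Type column from an annotations file.
types_in_annotations = ["PART_ID"] * 40 + ["ROUTE_NUMBER"] * 12
gaps = underrepresented_types(
    types_in_annotations,
    expected_types=["PART_ID", "ROUTE_NUMBER", "WAREHOUSE_CODE"],
)
print(gaps)
```

Passing the expected types explicitly means a type with zero annotations still shows up in the report instead of being silently skipped.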
Handle Incorrect Results
Sometimes, even after training, things go wrong. One user said:
"AWS returned the value of 'select common flexbase' as one single 'material' type. This should have been three separate 'material' types (and was annotated hundreds, if not thousands, of times separately in the training annotations)."
To fix this:
- Check your annotations to make sure they're right
- Try adding more examples for types that aren't working
- If it's still not working, you might need to train your model again with different examples
Amazon Comprehend automates the training itself, but you still need good data to get good results.
Summary
Custom entity recognition in AWS Comprehend lets businesses pull specific info from text. Here's how it works:
1. Data Prep
Gather your entity list or annotated docs. A finance company might list bankruptcy terms or mark up quarterly reports.
2. Set Up the Recognizer
Choose your model settings and entity types. You can train on up to 25 custom entities at once - great for specialized industries.
3. Train the Model
AWS Comprehend does the heavy lifting. It picks the best algorithm and fine-tunes based on your data.
4. Deploy
Go for real-time processing for quick results or batch processing for bigger datasets.
5. Keep an Eye on Performance
Watch precision, recall, and F1 scores to see how well your model's doing.
Custom entity recognition works across industries:
- Finance: Spot bankruptcy terms in market reports
- Manufacturing: Pull part IDs and route numbers from logistics docs
- Healthcare: Find medical terms and treatments in patient records
- Legal: Catch case numbers and legal jargon
This tech is now open to businesses of all sizes. The overnight batch approach described earlier works just as well for a mid-sized manufacturer pulling part IDs and route numbers out of logistics docs.
FAQs
Which AI services does Amazon Comprehend provide?
Amazon Comprehend packs a punch with its AI-powered natural language processing (NLP) services. Here's what it can do:
1. Entity Recognition
Spots people, places, and organizations in text. It even does custom recognition for industry-specific terms.
2. Sentiment Analysis
Figures out if a piece of text is positive, negative, neutral, or mixed.
3. Key Phrase Extraction
Pulls out the most important bits from a document.
4. Language Detection
Tells you what language the text is in.
5. Topic Modeling
Groups text documents into topics.
6. Syntax Analysis
Breaks down text to show how words relate to each other.
7. PII Detection
Finds sensitive personal info in documents and can redact it.
8. Document Classification
Sorts documents into categories, including custom ones for specific needs.
Amazon Comprehend isn't picky - it works with plain text, PDFs, Word docs, and even images (JPG, PNG, TIFF). It speaks multiple languages too, including English, Spanish, German, Italian, Portuguese, French, and Japanese.
"Amazon Comprehend uses deep learning technology to accurately analyze text." - Amazon Comprehend Team
The best part? You don't need to be a machine learning guru to use it. It's fully managed and always learning.