AWS Direct Connect provides a private, high-speed connection between your data center and AWS. However, when issues arise, troubleshooting can be complex, as problems may occur across multiple layers: physical, data link, network/transport, and routing. Here's a quick breakdown of how to identify and resolve common issues:
- Physical Layer: Check fiber cables, transceivers, and cross-connect setups for faults. Verify signal strength (–14.4 to 2.50 dBm for 1–10 Gbps; for 100 Gbps, –4.3 to 4.5 dBm transmit and –10.6 to 4.5 dBm receive).
- Data Link Layer: Confirm VLAN tagging (802.1Q) and ARP resolution. Ensure VLAN IDs match and clear ARP cache if needed.
- Network/Transport Layer: Verify BGP configurations (ASN, IPs, MD5 keys) and ensure TCP port 179 is open. Check firewall and ACL rules.
- Routing Layer: Confirm route advertisements and propagation. Stay within AWS limits (e.g., 100 advertised routes for private virtual interfaces).
Use tools like CloudWatch for monitoring and VPC Flow Logs to identify blocked traffic. If problems persist, document your findings and escalate to AWS Support with detailed logs and configurations.
Fixing Physical Connection Problems
When Direct Connect connections fail entirely, the root cause often lies within the physical layer. This includes hardware components like fiber cables, optical transceivers, cross-connects, and router ports. These types of issues tend to show clear symptoms and typically have straightforward fixes.
Checking Cross-Connect Setup and Hardware
Start by confirming that the cross-connect installation is complete. Many connection problems stem from incomplete or incorrect setups at the colocation facility. Reach out to your colocation provider to verify that the cross-connect between your equipment and AWS is finalized. Ask for a cross-connect completion notice and compare the listed ports with those on your LOA-CFA (Letter of Authorization and Connecting Facility Assignment). Keep in mind, the LOA-CFA expires after 90 days if the cross-connect isn't completed. If needed, download a fresh copy from the AWS Direct Connect console.
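If you prefer the command line, here's a minimal sketch of pulling the current LOA-CFA with the AWS CLI; the connection ID is a placeholder, and the document comes back base64-encoded:

```bash
# Fetch the LOA-CFA for a connection and decode it to a PDF.
# dxcon-EXAMPLE is a placeholder connection ID.
aws directconnect describe-loa \
  --connection-id dxcon-EXAMPLE \
  --output text --query loaContent | base64 --decode > loa-cfa.pdf
```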
Next, ensure your equipment and router ports are functioning as expected. Direct Connect requires single-mode fiber optic cables and specific optical transceivers depending on the connection speed:
- 1 Gbps: Use 1000BASE-LX transceivers at 1310 nm
- 10 Gbps: Use 10GBASE-LR transceivers
- 100 Gbps: Use 100GBASE-LR4 transceivers
- 400 Gbps: Use 400GBASE-LR4 transceivers
Auto-negotiation settings can also cause trouble. For speeds over 1 Gbps, disable auto-negotiation and manually configure the port speed and full-duplex mode. For 1 Gbps connections, the requirement depends on the AWS Direct Connect endpoint.
Lastly, ensure your connection path supports 802.1Q VLAN encapsulation, and verify that all intermediate devices are compatible. Your equipment must also support BGP and BGP MD5 authentication to establish connectivity at higher network layers. If issues persist, proceed to test the fiber-optic signals.
Testing Optical Signals and Fiber Connections
Proper optical signal strength is essential for reliable operation. AWS specifies the following signal ranges:
- 1–10 Gbps connections: Transmit and receive signals should be between –14.4 dBm and 2.50 dBm.
- 100 Gbps connections: Transmit signals should range from –4.3 to 4.5 dBm, while receive signals should fall between –10.6 and 4.5 dBm. These ranges apply to each optical lane in 100 Gbps setups.
If signals fall outside these ranges, try flipping the transmit and receive fiber strands. This simple adjustment can correct issues caused by incorrect fiber polarity.
Request an optical signal report from your colocation provider to check the transmit and receive levels across the cross-connect. This report can help pinpoint whether problems originate from your equipment, the cross-connect, or the AWS side.
Perform loopback tests to narrow down the issue. Start by testing the link from the Meet-Me-Room (MMR) to your router. If your device's port becomes active, the link to the MMR is functional. Use commands like `show interfaces eth1 transceiver` on Cisco equipment to check transmit and receive light levels. Then, conduct a loopback test from the MMR toward the AWS router, running it for at least 10 minutes. If the physical components are working correctly, the AWS port should activate, which you can confirm with CloudWatch metrics.
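If you'd rather read the light levels from the AWS side, a CLI sketch like the following pulls recent readings; the connection ID is a placeholder, and the `date` flags assume GNU coreutils:

```bash
# Average transmit light level (dBm) over the past hour,
# in 5-minute buckets. dxcon-EXAMPLE is a placeholder.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DX \
  --metric-name ConnectionLightLevelTx \
  --dimensions Name=ConnectionId,Value=dxcon-EXAMPLE \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```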
If problems persist, request a comprehensive media test from your colocation provider. This should include patch-cord verification, cleaning, and a full examination of the cabling and signaling. You can also use a Visual Fault Locator (VFL) to check fiber continuity - if the red light doesn't travel end-to-end, there may be a fiber break.
Monitoring Physical Errors with CloudWatch
After confirming hardware and signal integrity, use Amazon CloudWatch to monitor for ongoing issues. CloudWatch provides metrics that track the health of the Direct Connect physical layer, updated every 5 minutes by default (1-minute intervals are available upon request).
Key metrics to monitor include:
| Metric | Purpose | Normal Range |
|---|---|---|
| `ConnectionState` | Physical connection status | 1 (up) or 0 (down) |
| `ConnectionLightLevelTx` | Outbound optical signal strength | –14.4 to 2.50 dBm (1–10 Gbps); –4.3 to 4.5 dBm (100 Gbps) |
| `ConnectionLightLevelRx` | Inbound optical signal strength | –14.4 to 2.50 dBm (1–10 Gbps); –10.6 to 4.5 dBm (100 Gbps) |
For 100 Gbps connections, use the `OpticalLaneNumber` dimension to inspect the four optical lanes individually. `ConnectionState` indicates basic connectivity (1 for up, 0 for down), while `ConnectionErrorCount` tracks MAC-level errors, such as CRC errors. Rising error counts may point to damaged cables or failing transceivers.
Set up CloudWatch alarms for these metrics to catch issues early. Also, check the AWS Health Dashboard for planned maintenance or outages, and confirm with your colocation provider that no scheduled activities could disrupt your service.
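As one possible starting point, the sketch below creates an alarm on ConnectionState with the AWS CLI; the connection ID and SNS topic ARN are placeholders:

```bash
# Alarm when the connection reports down (ConnectionState < 1)
# for a single 5-minute period. The ID and ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name dx-connection-down \
  --namespace AWS/DX \
  --metric-name ConnectionState \
  --dimensions Name=ConnectionId,Value=dxcon-EXAMPLE \
  --statistic Minimum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:network-alerts
```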
Fixing Data Link Layer Problems
Once you've ensured the physical layer is stable, the next step is tackling Layer 2 issues, such as VLAN misconfigurations and ARP failures. These problems often disrupt communication between your equipment and AWS, making it crucial to address them promptly.
Checking VLAN and IP Configuration
VLAN misconfigurations are a frequent cause of Direct Connect issues at Layer 2. AWS requires all traffic to carry proper 802.1Q VLAN tags; any deviation from this standard will prevent ARP from resolving on the AWS side.
Start by verifying the VLAN settings on both ends of the connection. In the Direct Connect console, review your virtual interface details, or use the `describe-virtual-interfaces` command. Ensure the AWS peer IP is assigned to the correct VLAN subinterface (e.g., GigabitEthernet0/0.123) rather than the physical interface. This is critical because AWS requires tagged traffic to establish communication.
Check your router configuration to confirm the VLAN tag is unique and falls within the valid range of 1–4094. Download the router config file for precise commands if needed. Additionally, enable VLAN trunking on all intermediate devices between your router and the AWS Direct Connect endpoint. Any device that mishandles the 802.1Q VLAN tag can break connectivity. Once VLAN settings are verified, test Layer 2 connectivity by examining ARP entries.
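For a quick read on what AWS thinks your VLANs are, a CLI query along these lines lists each virtual interface's VLAN and state (the JMESPath expression just trims the output):

```bash
# List VLAN ID and state for every virtual interface.
aws directconnect describe-virtual-interfaces \
  --query 'virtualInterfaces[].{id:virtualInterfaceId,vlan:vlan,state:virtualInterfaceState}' \
  --output table
```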
Clearing ARP Cache and Checking MAC Addresses
After confirming VLAN tagging, focus on MAC address resolution by inspecting the ARP table. If the ARP table lacks an entry for the AWS endpoint’s MAC address, it indicates a Layer 2 issue. This could mean your device isn’t learning the Direct Connect endpoint’s MAC address, pointing to ARP resolution failures.
To address this:
- Clear the ARP cache on your customer gateway to force a rebuild of MAC mappings.
- Enable ARP debugging on your customer gateway to monitor ARP request and reply activity.
- Use packet capture tools on the Layer 2 device connected to the Direct Connect endpoint. Verify that ARP broadcasts are being sent to AWS with the correct 802.1Q encapsulation and VLAN tag. These broadcasts should use the broadcast MAC address `FF:FF:FF:FF:FF:FF` and include proper VLAN tagging (see the capture sketch after this list).
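For the packet capture step, a tcpdump invocation along these lines can confirm that tagged ARP broadcasts are actually leaving your device; `eth0` and VLAN 123 are placeholders for your interface and VLAN ID:

```bash
# Capture ARP frames on VLAN 123, printing link-layer headers (-e)
# so the 802.1Q tag and FF:FF:FF:FF:FF:FF destination are visible.
sudo tcpdump -i eth0 -e -nn 'vlan 123 and arp'
```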
Escalating Persistent Layer 2 Problems
If your virtual interface is still down after checking VLAN configurations and troubleshooting ARP issues, you may need to involve AWS Support. Before escalating, ensure you've thoroughly documented your troubleshooting process.
Provide AWS Support with the following:
- ARP table contents
- Packet capture results
- Confirmation that VLAN trunking is enabled on all intermediate devices
Also, include details such as the specific VLAN ID in use, confirmation that it falls within the valid range (1–4094), and verification that the VLAN tag is unique. For hosted connections, keep in mind that your AWS Direct Connect Partner assigns the VLAN value, and it cannot be changed after the virtual interface is created.
Fixing Network and Transport Layer Problems
Once Layer 2 issues are resolved, it's time to focus on the network and transport layers. If Layer 2 connectivity is confirmed but the virtual interface remains down, the problem likely lies in Layer 3 or 4. These issues often stem from BGP routing misconfigurations or firewall rules blocking BGP traffic.
Checking BGP Settings and Authentication
BGP configuration mistakes are a frequent culprit behind Direct Connect network layer failures. Start by confirming that the BGP ASNs match exactly between your router configuration and the AWS Direct Connect console. Even a small mismatch will prevent the session from establishing.
Next, verify the peer IP addresses on both sides. Your customer gateway must be configured with the correct AWS peer IP address, and AWS needs the correct IP for your router.
If you're using MD5 authentication, ensure the key matches perfectly between your router and the AWS console. Any discrepancy here will result in authentication failures. Double-check your router's MD5 key against the one entered in the AWS console when setting up the virtual interface. Also, keep in mind that AWS does not allow 0 as a Hold Time value, so make sure your BGP timer settings use valid non-zero values.
To test connectivity, run a telnet check on TCP port 179 between your router and the AWS peer IP. If this fails, there's a connectivity issue that needs attention. Once you've confirmed BGP settings, review your firewall and ACL rules to ensure they aren't blocking traffic.
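To make the port 179 check concrete, here's what it looks like from a host on your side of the link; 169.254.255.1 is a placeholder for your AWS peer IP:

```bash
# Probe the BGP port on the AWS peer. A "Connected" response means
# TCP 179 is open end to end; a hang or refusal points to a block.
telnet 169.254.255.1 179
```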
Checking Firewall and ACL Rules
Firewall settings are another common reason for BGP traffic issues. BGP uses TCP port 179 for initial connections and high-numbered ephemeral TCP ports for ongoing communication. Any rules blocking these ports will interrupt BGP connectivity.
On the AWS side, check your VPC security groups and Network ACLs. These must allow two-way traffic between your on-premises network CIDR blocks and your AWS resources. Be sure to review both inbound and outbound rules to confirm that the required ports are open for traffic in both directions.
To locate where traffic is being blocked, use traceroute from your on-premises router and an Amazon VPC instance. If the trace halts at your on-premises router's peer IP, your local firewall might be dropping the traffic. If it stops at the AWS peer IP, examine your AWS-side security group and Network ACL rules.
Enabling Amazon VPC Flow Logs can give you a clearer picture of traffic patterns. These logs show whether packets from your on-premises router are reaching specific elastic network interfaces in your VPC. This can help identify if the block is occurring at the network interface level instead of at the router.
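If flow logs aren't enabled yet, a sketch like the following turns them on for a VPC and delivers them to CloudWatch Logs; the VPC ID, log group name, and IAM role ARN are placeholders:

```bash
# Enable flow logs for all traffic in a VPC.
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-EXAMPLE \
  --traffic-type ALL \
  --log-group-name vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
```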
Checking BGP Logs for Errors
To pinpoint negotiation failures, collect packet captures and analyze BGP debug logs. These tools reveal what happens during each step of the session establishment process.
BGP sessions move through distinct states, and log analysis can help identify where the failure occurs. In the Connect state, confirm that the TCP three-way handshake completes successfully. Look for dropped packets or connection errors in the logs. If the handshake doesn’t complete, check that both routers are configured with the correct neighbor IP addresses and that MD5 authentication keys are an exact match.
During the OpenSent state, BGP sends an OPEN message and waits for a response. Review your logs to ensure the message includes the correct parameters, such as the version number, AS number, Hold Timer, and BGP Identifier IP address. Errors like mismatched AS numbers or invalid Hold Timer values can cause failures here.
In the Established state, routers exchange route updates and keepalive messages. If your session reaches Established but then drops back to Idle, check whether you're advertising more than 100 routes. AWS imposes a limit of 100 advertised routes for private virtual interfaces, and exceeding this will terminate the session. You can fix this by summarizing your routes or advertising a default route instead.
Finally, review your BGP logs for any prefix filters or route maps that could be blocking route advertisement. Ensure that the prefixes you're advertising appear in the AWS route tables and that no local filtering is preventing expected routes from being advertised.
Fixing Routing and Traffic Flow Problems
When routing and traffic permissions are not configured correctly, data transmission can fail even if a BGP session is established.
Checking Route Advertisement and Propagation
Start by ensuring your router advertises the correct IP prefixes to AWS. On Cisco devices, a command like `show ip bgp neighbors <AWS peer IP> advertised-routes` confirms this. The output should list all the prefixes from your network that you want AWS to reach.
Next, check the virtual interface BGP route table in the AWS Direct Connect console to verify the routes received from your router. Missing routes could mean your router isn’t advertising them properly, or they might be filtered by route maps or prefix lists on your end.
In the VPC console, ensure route propagation is enabled for your Virtual Private Gateway. Without this, even properly advertised routes won’t show up in your VPC routing tables.
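If propagation turns out to be off, it can be switched on in a single call; the route table and gateway IDs below are placeholders:

```bash
# Propagate routes from the virtual private gateway into a
# VPC route table. Both IDs are placeholders.
aws ec2 enable-vgw-route-propagation \
  --route-table-id rtb-EXAMPLE \
  --gateway-id vgw-EXAMPLE
```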
If you're working with transit gateways, examine the route tables linked to your Direct Connect gateway attachment. Routes must either be explicitly associated or propagated to the correct route table. Connectivity issues often arise due to missing route table associations in transit gateway setups.
Also, make sure that summarized routes cover all necessary subnets to avoid traffic black holes. Lastly, check that security settings are not inadvertently blocking the advertised routes.
Checking VPC Security Groups and Network ACLs
Security groups and Network ACLs play a crucial role in allowing traffic from your on-premises network.
Security groups are stateful and operate at the instance level. This means return traffic is automatically allowed for established connections. Review the security groups attached to your EC2 instances or other resources. Ensure the inbound rules permit traffic from your on-premises IP ranges over the required ports and protocols.
For instance, if your goal is to access a web server in your VPC, the security group must include an inbound rule allowing HTTP (port 80) or HTTPS (port 443) traffic from your corporate network’s IP range. Without this, packets will be dropped regardless of correct routing.
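A sketch of adding that rule with the AWS CLI, using placeholder values for the security group ID and the corporate network's CIDR:

```bash
# Allow HTTPS from the on-premises range into the web server's
# security group. sg-EXAMPLE and 10.0.0.0/8 are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-EXAMPLE \
  --protocol tcp \
  --port 443 \
  --cidr 10.0.0.0/8
```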
Network ACLs are stateless and work at the subnet level, requiring explicit rules for both inbound and outbound traffic. Check the ACLs associated with the subnet hosting your target resources. Ensure traffic is allowed in both directions, including ephemeral port ranges for return traffic.
While the default Network ACL allows all traffic, custom ACLs are often more restrictive. If you’ve implemented custom rules, verify they accommodate your on-premises traffic. Remember, Network ACL rules are processed in numerical order, and the first matching rule determines the action.
If packets are being rejected, use VPC Flow Logs to identify the issue. Look for REJECT entries in the logs, then review the corresponding security groups and Network ACLs for the affected resources.
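Assuming the flow logs land in a CloudWatch Logs group, a search like this surfaces recent REJECT records; the log group name is a placeholder, and timestamps are in milliseconds:

```bash
# Pull REJECT entries from the last hour of flow logs.
aws logs filter-log-events \
  --log-group-name vpc-flow-logs \
  --filter-pattern REJECT \
  --start-time $((($(date +%s) - 3600) * 1000))
```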
Checking AWS Routing Limits
AWS enforces specific routing limits that, if exceeded, can disrupt connectivity. Being aware of these limits is key to avoiding traffic issues.
For private virtual interfaces, AWS supports up to 100 advertised routes. If your router advertises more than this, AWS will shut down the BGP session. Regularly monitor your route advertisements and use route summarization to stay within this limit.
VPC route tables have limits too: 50 non-propagated routes and 100 propagated routes per table. If you’re using multiple Direct Connect or VPN connections, you could hit these limits, leading to ignored routes and connectivity gaps.
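To see how close a table is to those limits, a quick count helps; rtb-EXAMPLE is a placeholder route table ID:

```bash
# Count the routes currently present in a route table.
aws ec2 describe-route-tables \
  --route-table-ids rtb-EXAMPLE \
  --query 'length(RouteTables[0].Routes)'
```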
To manage these limits, consider implementing route summarization. For example, instead of advertising individual /24 subnets, summarize them into larger blocks like /16 or /8 networks. This reduces the number of advertised routes while maintaining network connectivity.
For large networks, advertising a default route from your router to AWS can be a more efficient strategy. This uses just one route slot but requires careful planning to ensure proper traffic flow.
Monitor your route usage through the AWS console, and set up CloudWatch alarms to alert you when you’re nearing routing limits. Proactive monitoring can help you avoid unexpected connectivity issues caused by exceeding these thresholds.
Conclusion
When troubleshooting AWS Direct Connect, a structured, step-by-step approach is key. By examining each network layer from the ground up, you can avoid missing fundamental issues that might ripple across multiple layers.
Starting with the physical layer is critical. Problems like damaged fiber connections or incorrect cross-connects can block all higher-layer protocols from functioning. Once the physical layer is confirmed, move to the data link layer, where misconfigured VLANs or ARP issues could cause sporadic connectivity problems. Proper testing is essential to pinpoint these issues.
BGP configuration errors are another frequent culprit. Double-check details like AS numbers, authentication keys, and IP addresses before assuming hardware is at fault. Even minor configuration mismatches can disrupt BGP sessions entirely.
Don’t overlook AWS limits - these can lead to unexpected outages. For example, exceeding the 100-route limit on private virtual interfaces or hitting VPC route table limits can cause disruptions. Regularly monitor your route advertisements and consider using route summarization to avoid hitting these thresholds.
Leverage monitoring tools throughout the process. These tools provide insights into connection status, packet flows, and the decisions made by security groups. Flow logs, in particular, can help pinpoint whether traffic is being blocked by security groups or Network ACLs, or if routing is the issue.
When reaching out to AWS Support, ensure you document everything thoroughly. Include details like BGP session states, route advertisements, and security configurations. This not only speeds up resolution but also shows that you’ve done your homework.
Finally, collaboration is crucial. Bring together your network, cloud, and security teams to share findings and streamline troubleshooting. This teamwork reduces redundancy and ensures every angle of the issue is thoroughly addressed.
FAQs
What are common physical connection issues with AWS Direct Connect, and how can they be fixed?
Physical connection problems with AWS Direct Connect often stem from damaged or loose fiber optic cables or an incomplete cross-connect setup. Here's how you can address these issues:
- Check the cables: Ensure all fiber optic cables are securely connected and free from damage.
- Verify the cross-connect setup: Contact your colocation provider to confirm that the cross-connect is properly configured and active.
If you notice there's no light coming from the cross-connect panel, it's usually a sign of a hardware or connectivity issue. In this case, inspect the cables, ports, and any related hardware to identify potential faults or misconfigurations. Keeping these components in good condition is essential for restoring your connection.
What are the effects of VLAN misconfigurations on AWS Direct Connect and how can you resolve them?
VLAN misconfigurations can throw a wrench into AWS Direct Connect, leading to virtual interfaces (VIFs) becoming unresponsive or failing to connect. These hiccups often arise from issues like using incorrect VLAN IDs, mismatched settings between your network and AWS, or errors in traffic tagging.
To get things back on track, start by double-checking that the VLAN IDs on your devices align with those provided by AWS. Make sure VLAN tagging is set up correctly and that your network hardware supports 802.1Q encapsulation. Next, review your physical connections and confirm that the cross-connect is properly established. Take a close look at the status of the VIF and BGP session for any irregularities. If problems persist, running traceroutes can help pinpoint potential network path issues. Working through these troubleshooting steps methodically goes a long way toward restoring your connection.
How can I configure BGP settings for a stable AWS Direct Connect connection?
To keep your AWS Direct Connect connection steady, it's important to configure BGP (Border Gateway Protocol) deliberately. For example, tune the hold timer - AWS supports values as low as 3 seconds - to strike the right balance between fast failure detection and session stability. Also, establish two BGP sessions for redundancy; if one session fails, the other can take over without disrupting your connection.
You can also use BGP communities to fine-tune route preferences, making it easier to manage traffic flow. Regularly checking the status of your BGP sessions is key to spotting and fixing potential issues before they cause problems. This proactive monitoring ensures your connection remains reliable and helps reduce the risk of downtime.