Race conditions in CSPM part 2

July 01, 2024

Read part 1 for context and background.

In AWS, there is no native control to prevent establishing a VPC peering connection outside your AWS organization.

Teams have an incentive to initiate VPC peering requests to third parties, because this is how some third-party vendors work¹. Blocking the creation or acceptance of VPC peering connections at the SCP level would take away the application team's power, introduce a new gatekeeping check, and generally slow things down.

In AWS API terms, application teams can call ec2:CreateVpcPeeringConnection and ec2:AcceptVpcPeeringConnection. To keep the scenario simple, we'll assume that the teams can't accept VPC peering connections, just create them. Threat modelling this, you come across the VPC peering connection state diagram from AWS.

What do you do?

You have two options:

Detect the event in "real-time"² - on the initiating-request state, i.e. check for the CreateVpcPeeringConnection event and its parameters in CloudTrail. If the peer is a known AWS account, all is good; otherwise send a DeleteVpcPeeringConnection, which deletes the connection request.

Detecting based on events gives a smaller detection window.

Schedule a recurring state check - on the active state, i.e. run describe_vpc_peering_connections every x minutes. If you detect an unwanted VPC peer, you guessed it, send a DeleteVpcPeeringConnection.

Scheduling a recurring state check has drawbacks. Your worst-case detection window is as long as your scheduling interval. By the time you detect the peering, it may already be active, traffic may be flowing between the two VPCs, and your data could already be on its way out.
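To make the second option concrete, here is a minimal sketch of one scheduled pass in Python. It assumes a boto3 EC2 client is passed in as `ec2`. The `TRUSTED_ACCOUNTS` allowlist and the function names are hypothetical, and the check only looks at states in which the requester is allowed to delete a connection.

```python
from typing import Iterable, List, Set

# Hypothetical allowlist: account IDs that belong to our AWS organization.
TRUSTED_ACCOUNTS = {"111111111111", "222222222222"}

# States handled by this sketch: a pending request and an established peering.
LIVE_STATES = {"pending-acceptance", "active"}


def find_unwanted_peerings(connections: Iterable[dict], trusted: Set[str]) -> List[str]:
    """Return IDs of live peerings that involve an account outside `trusted`.

    `connections` is the "VpcPeeringConnections" list returned by
    describe_vpc_peering_connections.
    """
    unwanted = []
    for conn in connections:
        if conn["Status"]["Code"] not in LIVE_STATES:
            continue  # e.g. deleted, rejected, expired or failed
        owners = {
            conn["RequesterVpcInfo"]["OwnerId"],
            conn["AccepterVpcInfo"]["OwnerId"],
        }
        if not owners <= trusted:  # at least one side is not ours
            unwanted.append(conn["VpcPeeringConnectionId"])
    return unwanted


def sweep(ec2, trusted: Set[str]) -> List[str]:
    """One scheduled pass: list all peerings and delete the unwanted ones."""
    paginator = ec2.get_paginator("describe_vpc_peering_connections")
    deleted = []
    for page in paginator.paginate():
        for pcx_id in find_unwanted_peerings(page["VpcPeeringConnections"], trusted):
            ec2.delete_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
            deleted.append(pcx_id)
    return deleted
```

Running `sweep(boto3.client("ec2"), TRUSTED_ACCOUNTS)` from a scheduled job would perform one check. Anything that happens between two runs goes unnoticed until the next one.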

Intuitively, the first option seems to have no drawbacks. We'll get the connection request event, check the target account, and reject it before the connection gets accepted.
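As a rough illustration of that flow, the sketch below shows what an event-driven check could look like as a Python Lambda handler. It assumes the CloudTrail record arrives under the event's `detail` key, and that the accepter account is available at `responseElements.vpcPeeringConnection.accepterVpcInfo.ownerId`. That field path, the `TRUSTED_ACCOUNTS` allowlist and the function names are assumptions for illustration, so check them against a real event before relying on them.

```python
from typing import Optional, Set

# Hypothetical allowlist: account IDs that belong to our AWS organization.
TRUSTED_ACCOUNTS = {"111111111111", "222222222222"}


def untrusted_peering_id(event: dict, trusted: Set[str]) -> Optional[str]:
    """Return the peering connection ID if the accepter is not a trusted account."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateVpcPeeringConnection":
        return None
    pcx = (detail.get("responseElements") or {}).get("vpcPeeringConnection")
    if not pcx:
        return None  # the API call failed, so no request was created
    # Assumed field path; a missing owner is treated as untrusted (fail closed).
    accepter = pcx.get("accepterVpcInfo", {}).get("ownerId")
    if accepter in trusted:
        return None
    return pcx["vpcPeeringConnectionId"]


def handler(event, context, ec2=None):
    """Lambda entry point: delete the request if it targets an unknown account."""
    pcx_id = untrusted_peering_id(event, TRUSTED_ACCOUNTS)
    if pcx_id is None:
        return {"action": "none"}
    if ec2 is None:
        import boto3  # imported lazily so the check above has no AWS dependency
        ec2 = boto3.client("ec2")
    ec2.delete_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
    return {"action": "deleted", "id": pcx_id}
```

Note that nothing in this handler knows what state the connection is in by the time the delete call is sent.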

Let's use a Cloud Custodian policy as an example. https://cloudcustodian.io/docs/aws/examples/vpcpeeringcrossaccount.html The principles apply to other CSPMs and tooling that operates in a similar manner.

policies:

 - name: vpc-peering-cross-account-checker-real-time
   resource: peering-connection
   mode:
      type: cloudtrail
      events:
         - source: ec2.amazonaws.com
           event: CreateVpcPeeringConnection
           ids: 'responseElements.vpcPeeringConnection.vpcPeeringConnectionId'
      timeout: 90
      memory: 256
      role: arn:aws:iam::{account_id}:role/Cloud_Custodian_EC2_Lambda_Role
   description: |
     When a new peering connection is created the Accepter and Requester account
     numbers are compared and if they aren't both internally owned accounts then the
     cloud and security teams are notified to investigate and delete the peering connection.
   actions:
   - type: invoke-lambda
     function: RejectVPCPeering

...

This policy will check for the CreateVpcPeeringConnection event. It will also invoke a lambda that will issue a RejectVpcPeeringConnection. Imagine there's a filter that checks which accounts are ours and which aren't.

If an attacker manages to call AcceptVpcPeeringConnection before the CSPM rejects the connection, then the connection is established. The remediation Lambda will fail, leading to a new failure mode that depends on your CSPM and monitoring.

You need to monitor your remediation and detection infrastructure, and understand how they operate. You need to understand the failure modes of your CSPM.
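One way to make that failure mode visible is to have the remediation record which state it found the connection in. The sketch below is not how Cloud Custodian or any particular CSPM does it. It assumes a boto3 EC2 client passed in as `ec2`, and the `remediate` function and its return values are hypothetical.

```python
def remediate(ec2, pcx_id: str) -> str:
    """Delete an unwanted peering and report whether we won or lost the race."""
    resp = ec2.describe_vpc_peering_connections(VpcPeeringConnectionIds=[pcx_id])
    state = resp["VpcPeeringConnections"][0]["Status"]["Code"]

    if state == "pending-acceptance":
        # The request has not been accepted yet: remove it.
        ec2.delete_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
        return "deleted-before-acceptance"

    if state == "active":
        # The other side accepted first. Delete it, but treat it as an incident.
        ec2.delete_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
        return "deleted-after-acceptance"

    # Any other state (e.g. deleted, rejected, expired): nothing to delete.
    return "no-action:" + state
```

A result of `deleted-after-acceptance` is the case worth alerting on, since the peering was live for some time. The state can still change between the describe call and the delete call, so this narrows the window without closing it.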

Any AWS API that has a state machine like this, with a reject step, suffers from the same issues.

¹ See MongoDB Atlas and Databricks.

² Can never be real-time.