Troubleshooting High Flow Table Usage Alarm in NSX

If you manage an NSX environment, you might have come across the following alarm:

Enhanced Dp Flow Table Usage Very High

The alarm’s details direct you to Broadcom KB 345796 or KB 345809, which suggest (with several caveats) that you could increase the size of the flow table to alleviate the issue. If your network traffic is very predictable, you know you have a workload that actually needs that many flows, AND you notice an increase in packet drops or latency when the alarms occur, then increasing the flow table size could be a viable solution. However, most environments aren’t like that. A better approach is to identify what’s creating so many flows and resolve that instead.

What we can do is view the host’s enhanced datapath flow table and identify which IP addresses occur significantly more often than others. To do this, log into the shell of the ESXi host identified by the alarm message, either through a local console or SSH. From there, the nsxdp-cli tool can dump the host’s flow table with the following command.

nsxdp-cli ens flow-table dump

ENS here refers to the enhanced network stack, which is interchangeable with enhanced datapath for the most part. This command generates a list of all flows in that host’s flow table. If you have a workload that is establishing or receiving an exceptionally large number of flows, you can use this output to identify the responsible IP address.

[root@esxi-1:~] nsxdp-cli ens flow-table dump
FT  dstMAC             srcMAC             VLAN  srcPort  srcIP         dstIP        proto  VNI      srcPort/type  dstPort/code  swHits      swBytes      hwHits      hwBytes     Actions
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L4  00:0c:29:3a:fa:62  00:0c:29:19:77:a1  0     1        192.168.0.2   192.168.0.1  6      0        32932         5201          16          1386         0           0           bmap:0x80 inval(s):121 cg:12 dp:0 len:704;
L4  00:0c:29:19:77:a1  00:0c:29:3a:fa:62  0     0        192.168.0.1   192.168.0.2  6      0        5201          32932         13          1069         0           0           bmap:0x80 inval(s):122 cg:12 dp:0x1 len:704;
L4  00:0c:29:19:77:a1  00:0c:29:3a:fa:62  0     0        192.168.0.1   192.168.0.2  6      0        5201          32938         200862      13257252     0           0           bmap:0x80 inval(s):115 cg:12 dp:0x1 len:704;
L4  00:0c:29:3a:fa:62  00:0c:29:19:77:a1  0     1        192.168.0.2   192.168.0.1  6      0        32938         5201          608926      39616303361  0           0           bmap:0x80 inval(s):103 cg:12 dp:0 len:704;
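
Before looking at specific addresses, it can help to get a rough sense of how large the table currently is. Counting the lines in the dump is a quick, if approximate, way to do that (the count includes a few header lines, so treat it as a ballpark figure):

nsxdp-cli ens flow-table dump | wc -l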

List the Top 10 Source IP Addresses #

You can use the following command to see the top 10 source IP addresses in the flow table, as well as how many flows are associated with them.

nsxdp-cli ens flow-table dump | awk '{print $6}' | sort | uniq -c | sort -rn | head

Example:

[root@esxi-1:~] nsxdp-cli ens flow-table dump | awk '{print $6}' | sort | uniq -c | sort -rn | head
  15 192.168.0.2
   7 192.168.0.1
   5 192.168.0.3
   4 192.168.0.4
   3 192.168.0.5
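
Once a suspect address stands out, you can pull just its flows out of the dump for a closer look. The address below comes from the example output above; substitute the one you’re investigating:

nsxdp-cli ens flow-table dump | grep '192.168.0.2'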

List the Top 10 Destination IP Addresses #

To see the top 10 destination IP addresses instead, you can use the following slightly modified command.

nsxdp-cli ens flow-table dump | awk '{print $7}' | sort | uniq -c | sort -rn | head

Example:

[root@esxi-1:~] nsxdp-cli ens flow-table dump | awk '{print $7}' | sort | uniq -c | sort -rn | head
  14 192.168.0.1
   8 192.168.0.3
   7 192.168.0.4
   3 192.168.0.2
   2 192.168.0.5
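
If a single destination dominates the list, a small variation on the same pipeline shows which sources are responsible for its flows. The destination IP address below is from the example output, and the column number assumes the layout shown earlier:

nsxdp-cli ens flow-table dump | awk '$7=="192.168.0.1" {print $6}' | sort | uniq -c | sort -rn | head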

List the Top 10 Source and Destination IP Address Pairs #

And to see the top 10 pairs of source and destination IP addresses, you can use the following command. The first IP address in each line is the source IP address.

nsxdp-cli ens flow-table dump | awk '{print $6,$7}' | sort | uniq -c | sort -rn | head

Example:

[root@esxi-1:~] nsxdp-cli ens flow-table dump | awk '{print $6,$7}' | sort | uniq -c | sort -rn | head
  13 192.168.0.2 192.168.0.1
   4 192.168.0.4 192.168.0.3
   3 192.168.0.5 192.168.0.3
   3 192.168.0.3 192.168.0.4
   2 192.168.0.2 192.168.0.4
   2 192.168.0.1 192.168.0.5
   2 192.168.0.1 192.168.0.4
   2 192.168.0.1 192.168.0.2
   1 192.168.0.3 192.168.0.2
   1 192.168.0.3 192.168.0.1
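
To dig one level deeper, you can include the proto and dstPort/code columns in the same pipeline to see which protocol and destination port the busiest pairs are using (again assuming the column layout shown earlier):

nsxdp-cli ens flow-table dump | awk '{print $6,$7,$8,$11}' | sort | uniq -c | sort -rn | head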

It’s important to note that an IP address having a large number of flows doesn’t necessarily mean it’s sending a large amount of traffic. If you have a security appliance that performs port scans (such as Nessus), ESXi will create a flow for every port it scans, even though very little data is actually sent. For example, if the appliance scans 1024 ports on 10 different IP addresses, ESXi has to create 10,240 flows to support that scan.
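
One way to spot that kind of scanning behavior is to count how many distinct destination ports each source IP address touches; a scanner shows up with an unusually high count. This is a rough sketch that reuses the column positions from the dump above:

nsxdp-cli ens flow-table dump | awk '{print $6,$11}' | sort -u | awk '{print $1}' | uniq -c | sort -rn | head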

Write the Flow Table into a File #

If you want to perform a more in-depth analysis, you can dump the flow table into a text file to import into your desired data analysis program. However, keep in mind that this file could be a few hundred megabytes. Not large in the grand scheme of things, but it’s definitely not a trivial text file.

nsxdp-cli ens flow-table dump > /tmp/flow-table.dmp
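
The same pipelines shown above work against the saved file, so you can re-run them without dumping the table again, or copy the file off the host and analyze it elsewhere. For example:

awk '{print $6,$7}' /tmp/flow-table.dmp | sort | uniq -c | sort -rn | head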

View Flow Table Statistics for ESXi Hosts (NSX 4.2+) #

Unfortunately, there isn’t a way to view a host’s flow table outside of the host’s local CLI, so logging into the host itself is the only option for viewing individual flow table entries. However, if you’re running NSX 4.2 or later, you can still view flow table statistics for each host using the API or PowerCLI. You can find an explanation of each statistic in the NSX 4.2 documentation.

Since the NSX Manager UI only tells you when a flow table is 90% or 95% full, viewing these stats through the API at least lets you see the current flow table usage before it hits that point and respond proactively. The main statistics to watch are num_flows, flow_table_occupancy_NN_pct, and insertion_errors. If you see a host with a large number of flows, a large number of flow tables at high occupancy, AND a growing number of insertion errors, that could be a strong indicator of performance issues in your environment. What might also catch you off guard is that a host doesn’t have a single flow table: there are multiple lcores on the host, each with its own flow table of size flow_table_size. So if you notice num_flows is larger than flow_table_size, that’s the reason why: num_flows is the total across all of those per-lcore tables.

Keep in mind that this is an experimental feature and its behavior could change in a future release. The following example script requires PowerCLI and at least version 13.3.0.24145081 of the VMware.Sdk.Nsx.Policy module.

# Login to NSX Manager
$adminCreds = Get-Credential -Message "Enter your NSX admin credentials"
Connect-NsxServer -Server nsx.sddc.lab -Credential $adminCreds

# Retrieve all host nodes in this NSX Manager
$hostNodes = (Invoke-ListHostTransportNodes -SiteId default -EnforcementpointId default).Results

# Retrieve flow table statistics for each host
$fpStats = foreach ($hostNode in $hostNodes) {
    # Retrieve flow table statistics from NSX Manager for this host
    $stats = Invoke-GetObservabilityMonitorStatics -SiteId default -EnforcementpointId default -HostTransportNodeId $hostNode.Id -Type fast_path_sys_stats
    # Add the host's name to the output
    $stats.FastPathSysStats.HostEnhancedFastpath | Add-Member -NotePropertyName "Host" -NotePropertyValue $hostNode.DisplayName
    # Output the statistics
    $stats.FastPathSysStats.HostEnhancedFastpath 
}

# *Windows Only* Display the flow table statistics for each host in a graphical table
$fpStats | Out-GridView

# *Non-Windows* Output the flow table statistics in text format
$fpStats | Format-Table