I have the following baseline:
Dell R630 (2x14 core Xeon, 128GB RAM, 800GB SSD)
x710 4-port NIC, in 10Gbit mode
Latest NIC firmware but default PF/VF drivers (shipped with the OS, v1.3.4)
VF driver blacklisted on hypervisor
Setup according to the official Intel and SUSE documentation, KVM hypervisor
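For reference, the relevant hypervisor-side steps look roughly like this. This is a minimal sketch, assuming the in-tree i40evf VF driver and the standard sysfs SR-IOV interface; the module name, config file path, and PF interface name are examples from my environment, not authoritative:

```shell
# Keep the hypervisor from binding the VF driver to the VFs
# (so they stay free for PCI passthrough to the VM)
echo "blacklist i40evf" > /etc/modprobe.d/blacklist-i40evf.conf

# Create one VF per PF via sysfs (PF interface name is an example)
echo 1 > /sys/class/net/eth4/device/sriov_numvfs
```

The VFs are then handed to the VM as PCI passthrough devices in the usual KVM/libvirt way.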
With the test setup (a single VM with a single VF and untagged traffic) I could achieve essentially line-rate numbers: with MTU 1500, about 770 Kpps and 9.4 Gbps of bandwidth, for both UDP and TCP traffic, with no packet drops. There is plenty of processing power left, the setup is nice and tidy, and everything works as it should.
The production setup is a bit different: the VM uses 3 VFs, one per PF (the 4th PF is unused). All VFs except the first carry untagged traffic. The first VF carries two types of traffic: untagged (VLAN 119) and tagged (VLAN 1108). Tagging is done inside the VM. The setup worked fine for some time, confirming the test numbers. However, after a while the following errors started to appear in the hypervisor logs:
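The in-VM tagging is done the standard way, with an 802.1Q sub-interface on top of the first VF. A minimal sketch (interface names and the address are placeholders, not the real production values):

```shell
# Inside the VM: untagged traffic goes over eth0 directly,
# tagged VLAN 1108 traffic over an 802.1Q sub-interface
ip link add link eth0 name eth0.1108 type vlan id 1108
ip link set eth0.1108 up
ip addr add 192.0.2.10/24 dev eth0.1108   # example address
```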
Mar 11 14:32:52 test_machine1 kernel: [10423.889924] i40e 0000:01:00.1: TX driver issue detected on VF 0
Mar 11 14:32:52 test_machine1 kernel: [10423.889925] i40e 0000:01:00.1: Too many MDD events on VF 0, disabled
And performance became erratic: sometimes it worked perfectly, sometimes it did not. Most importantly, packet drops occurred.
So I reinstalled everything (hypervisor and VMs), configured it exactly as before using automated tools, but upgraded the PF and VF drivers to the latest versions (v2.0.19/v2.0.16). The errors disappeared from the logs, but the issue persists. Now I see this instead:
2017-03-12T11:33:43.356014+01:00 test_machine1 kernel: [ 420.439112] i40e 0000:01:00.1: Unable to add VLAN filter 0 for VF 0, error -22
2017-03-12T11:33:43.376009+01:00 test_machine1 kernel: [ 420.459168] i40e 0000:01:00.0: Unable to add VLAN filter 0 for VF 0, error -22
2017-03-12T11:33:44.352009+01:00 test_machine1 kernel: [ 421.435124] i40e 0000:01:00.2: Unable to add VLAN filter 0 for VF 0, error -22
I've increased the VM CPU count, the VF ring sizes, the VM's Linux software buffers and the netdev_budget kernel parameter (the amount of CPU time assigned to NIC processing), turned off VF spoof checking on the hypervisor, etc., but the situation remains the same: sometimes it works perfectly, sometimes it does not.
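For completeness, the tuning attempts above correspond roughly to the following commands (a sketch; interface names, VF index, and values are examples, not a recommendation):

```shell
# Inside the VM: larger VF ring sizes
ethtool -G eth0 rx 4096 tx 4096

# On the hypervisor: disable spoof checking for VF 0 of the PF
ip link set eth4 vf 0 spoofchk off

# Inside the VM: larger software backlog and NAPI budget
sysctl -w net.core.netdev_max_backlog=30000
sysctl -w net.core.netdev_budget=600
```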
Can you please provide some insight? Since the rx_dropped counter is increasing inside the VM, I suspect a driver/VF issue.
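This is how the drops show up inside the VM (interface name is a placeholder):

```shell
# Standard interface statistics: the "dropped" column under RX keeps growing
ip -s link show dev eth0

# Driver/queue-level counters for anything drop-related
ethtool -S eth0 | grep -i drop
```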
Is there a way to handle this problem without switching to untagged traffic?
Thank you in advance,