Val:~$ whoami

I am Val Glinskiy, network engineer specializing in data center networks. TIME magazine selected me as Person of the Year in 2006.

Tuesday, March 13, 2018

DIY routing to the host

   Cumulus Networks promotes routing to the host via its Host Pack software package as a way to provide host network redundancy without using proprietary MLAG or the mostly incompatible EVPN ESI multihoming solutions from switch vendors. While Host Pack seems to be geared towards hosts running Linux containers, it got me thinking about how I could do routing to a bare-metal host. The routing protocol of choice is BGP. Now I need an IP address on an interface that never goes down, and I need to make sure that my server and client applications use that IP. That same IP will be advertised via BGP from the host. The loopback interface is the obvious choice for this kind of interface.
   srv1 and srv2 are Vagrant minimal/xenial64 boxes. srv1, tor1, tor2 and tor3 run BGP; srv2 is connected to a network hosted on tor3. Let's configure an "always-up" IP address on srv1:
sudo ip addr add dev lo:100
    While binding a server application like Apache to a specific IP address or interface is a pretty straightforward task, selecting the source address for outgoing connections is a bit more complicated. Here is how Linux selects the source IP address:
The application can request a particular IP [20], the kernel will use the src hint from the chosen route path [21], or, lacking this hint, the kernel will choose the first address configured on the interface which falls in the same network as the destination address or the nexthop router.
I want this to be transparent to the applications, but left on its own, Linux will most likely select the IP address of one of the physical interfaces. The only option left is to make sure that the route to the srv2 network on srv1 is programmed with the src hint set to the loopback address.
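To see which source address the kernel would pick for a given destination, one option is to connect a UDP socket and read back its local address. This is only a minimal sketch: the function name kernel_source_ip and the port number are my own choices, not something from the lab, and connecting a UDP socket does not send any packets.

```python
import socket


def kernel_source_ip(destination, port=9):
    """Return the source IP the kernel selects for traffic to `destination`.

    connect() on a UDP socket only performs the route lookup and source
    address selection; nothing is transmitted on the wire.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((destination, port))
        return sock.getsockname()[0]


if __name__ == "__main__":
    # Loopback is used here only so the example runs anywhere;
    # on srv1 you would pass an address in the srv2 network instead.
    print(kernel_source_ip("127.0.0.1"))
```

If the src hint is in place, calling this on srv1 with an srv2 address should return the loopback IP rather than the address of enp0s8 or enp0s9.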
In this lab I am using BIRD 1.6 to run BGP on srv1, but Free Range Routing (FRR) will work too.

router id;

filter my_vip
{
        if net = then accept;
}
filter remote_site
{
        if net ~ [ ] then
           krt_prefsrc =; #set src
}

protocol kernel {
        scan time 60;
        import none;
        export filter remote_site;
        persist;     # routes stay even if bird is down
        merge paths on;  # ECMP
}

protocol device {
        scan time 60;
}

protocol direct {
        interface "enp0s[8|9]", "lo*";
}
protocol bgp host_2rtr1 {
        local as 65499;
        neighbor as 64900;
        export filter my_vip;
        import filter remote_site;
}
protocol bgp host_2rtr2 {
        local as 65499;
        neighbor as 64920;
        export filter my_vip;
        import filter remote_site;
}

Let's see BGP routes we get from tor1 and tor2:
bird> show route     via on enp0s8 [host_2rtr1 16:15:49] * (100) [AS65000i]
                   via on enp0s9 [host_2rtr2 16:15:49] (100) [AS65000i]

Only one route is marked as primary; I could not find an equivalent of "bestpath as-path multipath-relax" in BIRD. It is required because tor1 and tor2 have different AS numbers. But no worries, "merge paths on" under "protocol kernel" takes care of this. Indeed:
vagrant@srv1:~$ ip route show  proto bird  src       
        nexthop via  dev enp0s8 weight 1 
        nexthop via  dev enp0s9 weight 1
Both routes are installed and claim to use the loopback address as the source IP address to reach the srv2 network.
Let's verify. I start pinging srv2 from srv1 and run tcpdump on srv2 side.

vagrant@srv1:~$ ping
Here is tcpdump output on srv2:
02:22:09.753864 IP > ICMP echo request, id 5024, seq 1, length 64
02:22:09.753906 IP > ICMP echo reply, id 5024, seq 1, length 64
02:22:10.750884 IP > ICMP echo request, id 5024, seq 2, length 64
02:22:10.750920 IP > ICMP echo reply, id 5024, seq 2, length 64

As you can see, packets are coming from the loopback address, even though I did not specify a source IP address for the ping.
A similar test with ssh:
vagrant@srv1:~$ ssh

02:31:37.652419 IP > Flags [S], seq 3479345726, win 29200, options [mss 1460,sackOK,TS val 2387063 ecr 0,nop,wscale 7], length 0
02:31:37.652473 IP > Flags [S.], seq 2929355359, ack 3479345727, win 28960, options [mss 1460,sackOK,TS val 2404414 ecr 2387063,nop,wscale 7], length 0
02:31:37.665081 IP > Flags [.], ack 1, win 229, options [nop,nop,TS val 2387066 ecr 2404414], length 0
02:31:37.666605 IP > Flags [P.], seq 1:42, ack 1, win 229, options [nop,nop,TS val 2387066 ecr 2404414], length 41
02:31:37.666621 IP > Flags [.], ack 42, win 227, options [nop,nop,TS val 2404418 ecr 2387066], length 0

And the last test: failover. Since my lab setup is entirely virtual, the goal was to test whether failover works at all, not how fast it is. You need real hardware to check the speed of failover.

vagrant@srv1:~$ iperf -s -B

vagrant@srv2:~$ iperf  -M 1000 -b 80K -i 1 -c -t 120

In my case traffic took the srv2 -> tor3 -> tor1 -> srv1 path. While iperf was running, I shut down the BGP session between tor1 and srv1. Here are the results from iperf:
[  3] 46.0-47.0 sec   384 KBytes  3.15 Mbits/sec
[  3] 47.0-48.0 sec   512 KBytes  4.19 Mbits/sec
[  3] 48.0-49.0 sec   384 KBytes  3.15 Mbits/sec
[  3] 49.0-50.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 50.0-51.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 51.0-52.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 52.0-53.0 sec   256 KBytes  2.10 Mbits/sec
[  3] 53.0-54.0 sec   512 KBytes  4.19 Mbits/sec

That 3-second interval of 0.00 bits/sec is the failover time. Again, since it's a virtual environment, your mileage may vary.
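If you repeat the test several times, picking the gap out of the iperf report by eye gets tedious. Here is a rough sketch that finds the longest run of zero-throughput intervals in iperf's per-second report. The function name longest_outage and the regular expression are my own; they assume the iperf 2 interval format shown above and nothing else.

```python
import re

# Matches iperf 2 interval lines such as:
# [  3] 49.0-50.0 sec  0.00 Bytes  0.00 bits/sec
INTERVAL = re.compile(
    r"\[\s*\d+\]\s+([\d.]+)-\s*([\d.]+)\s+sec\s+[\d.]+\s+\S+\s+([\d.]+)\s+\S+/sec"
)

SAMPLE = """\
[  3] 46.0-47.0 sec   384 KBytes  3.15 Mbits/sec
[  3] 47.0-48.0 sec   512 KBytes  4.19 Mbits/sec
[  3] 48.0-49.0 sec   384 KBytes  3.15 Mbits/sec
[  3] 49.0-50.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 50.0-51.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 51.0-52.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 52.0-53.0 sec   256 KBytes  2.10 Mbits/sec
[  3] 53.0-54.0 sec   512 KBytes  4.19 Mbits/sec
"""


def longest_outage(report):
    """Return the length in seconds of the longest zero-throughput stretch."""
    longest = 0.0
    run_start = None  # start time of the current run of zero intervals
    for line in report.splitlines():
        match = INTERVAL.search(line)
        if not match:
            continue  # skip headers and summary lines
        start, end, rate = (float(value) for value in match.groups())
        if rate == 0.0:
            if run_start is None:
                run_start = start
            longest = max(longest, end - run_start)
        else:
            run_start = None
    return longest


if __name__ == "__main__":
    print(longest_outage(SAMPLE))  # prints 3.0
```

The resolution is limited by the reporting interval (-i 1 here), so this only gives the outage to the nearest second.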

Monday, March 05, 2018

Interface uptime time format

I was looking into why my NAPALM-based script could not validate the state of an SVI interface on a Cisco Nexus and decided to dig into the NAPALM source code. I found something amusing in module line 230:
def _compute_timestamp(stupid_cisco_output):
The code that follows tries to convert Cisco's way of reporting uptime into epoch time. I totally understand the frustration. Let's say you want to find out when an interface last flapped. Here are a few examples:
Last link flapped 5d02h           
Last link flapped 16week(s) 5day(s)
Last link flapped never
Last link flapped 23:39:41
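To give an idea of what a parser has to deal with, here is a small sketch that converts just the four formats above into seconds. It is not NAPALM's code: the function name uptime_to_seconds and the unit table are my own, and I am only assuming that real devices stick to these variants.

```python
import re

# Seconds per unit; long names must come before their one-letter forms
# in the regex below so that "week" is not read as "w".
UNIT_SECONDS = {
    "week": 7 * 86400, "w": 7 * 86400,
    "day": 86400, "d": 86400,
    "h": 3600, "m": 60, "s": 1,
}
UNIT_PATTERN = re.compile(r"(\d+)\s*(week|day|w|d|h|m|s)")


def uptime_to_seconds(text):
    """Convert the value after 'Last link flapped' into seconds, or None for 'never'."""
    text = text.strip()
    if text == "never":
        return None
    if re.fullmatch(r"\d+:\d{2}:\d{2}", text):  # hh:mm:ss
        hours, minutes, seconds = (int(part) for part in text.split(":"))
        return hours * 3600 + minutes * 60 + seconds
    parts = UNIT_PATTERN.findall(text)  # e.g. 5d02h or 16week(s) 5day(s)
    if not parts:
        raise ValueError("unrecognized uptime format: %r" % text)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


if __name__ == "__main__":
    for sample in ("5d02h", "16week(s) 5day(s)", "never", "23:39:41"):
        print(sample, "->", uptime_to_seconds(sample))
```

Even this gives you only "seconds ago"; to get a timestamp you still have to subtract it from the device clock at the moment the command ran.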
I understand that "show" command output is intended for human consumption and is easy to read. Unfortunately, Cisco provides the same kind of time format in its XML output, which is supposed to be consumed by some kind of automation. Good luck parsing it. While Arista and Juniper also display interface uptime in a similar fashion in plain-text output, they do a much better job in structured output. Here is JUNOS output in JSON:

"interface-flapped" : [
{
    "data" : "2017-09-13 14:39:29 EDT (24w0d 20:45 ago)",
    "attributes" : {"junos:seconds" : "14589956"}
}
]
or XML:

<interface-flapped junos:seconds="14590210" > 2017-09-13 14:39:29 EDT (24w0d 20:50 ago)</interface-flapped>
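With the junos:seconds attribute there is nothing to guess. A quick sketch of pulling it out of the element above follows; the function names are hypothetical, and I use a regular expression instead of an XML parser only because a standalone fragment like this one does not declare the junos namespace.

```python
import re
from datetime import datetime, timedelta, timezone

SAMPLE = ('<interface-flapped junos:seconds="14590210" > '
          '2017-09-13 14:39:29 EDT (24w0d 20:50 ago)</interface-flapped>')


def seconds_since_flap(xml_fragment):
    """Return the junos:seconds value as an int, or None if it is absent."""
    match = re.search(r'junos:seconds="(\d+)"', xml_fragment)
    return int(match.group(1)) if match else None


def flap_time(xml_fragment, now):
    """Return the time of the last flap, given the time the output was collected."""
    seconds = seconds_since_flap(xml_fragment)
    return None if seconds is None else now - timedelta(seconds=seconds)


if __name__ == "__main__":
    print(seconds_since_flap(SAMPLE))
    print(flap_time(SAMPLE, datetime.now(timezone.utc)))
```

In a real script you would more likely get the full RPC reply and use a proper XML parser; the point is only that the value is already a plain number of seconds.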
Arista's JSON:

"Ethernet5/1": {                                 
    "lastStatusChangeTimestamp": 1519771449.121221,