Monday, November 28, 2016

Procedures to Deploy RMA device into Juniper SRX Chassis Cluster

The Juniper KB lists the RMA steps for replacing a failed Juniper device, but some of those steps are not clear enough. This post adds the extra configuration steps I needed, for future reference.

There is a fair amount of preparation work before you can add the RMA device to your chassis cluster.




Step 1: Upgrade Junos Remotely
The RMA device is usually shipped straight to the production site for the replacement, so you will have to upgrade Junos remotely first.


login: root
root> 
--- JUNOS 10.0R1.8 built 2009-11-03 10:06:39 UTC
root> 

root> show version 
Model: srx240-hm
JUNOS Software Release [10.0R1.8]

root> configure 
Entering configuration mode

[edit]
root# delete 
This will delete the entire configuration
Delete everything under this level? [yes,no] (no) yes 

[edit]
root# set system root-authentication plain-text-password 
New password:
Retype new password:

[edit]
root# commit and-quit 
commit complete
Exiting configuration mode

root> set chassis cluster cluster-id 4 node 0 reboot 
Successfully enabled chassis cluster. Going to reboot now



Add some basic configuration: an address on the fxp0.0 interface and a default static route. The ssh service also needs to be enabled.
root> show configuration 
## Last commit: 2016-11-29 03:37:32 UTC by root
version 10.0R1.8;
system {
    root-authentication {
        encrypted-password "$1$2eav5HPL$01SUB9SOzDJl007hXhNVj0"; ## SECRET-DATA
    }
    services {
        ssh;
    }
}
interfaces {        
    fxp0 {
        unit 0 {
            family inet {
                address 10.9.1.11/24;
            }
        }
    }
}
routing-options {
    static {
        route 0.0.0.0/0 next-hop 10.9.1.1;
    }
}
{primary:node0}
root> request system software add /var/tmp/junos-srxsme-12.1X46-D55.3-domestic.tgz reboot       
NOTICE: Validating configuration against junos-srxsme-12.1X46-D55.3-domestic.tgz.
NOTICE: Use the 'no-validate' option to skip this if desired.
Formatting alternate root (/dev/da0s2a)...
/dev/da0s2a: 298.0MB (610284 sectors) block size 16384, fragment size 2048
        using 4 cylinder groups of 74.50MB, 4768 blks, 9600 inodes.
super-block backups (for fsck -b #) at:
 32, 152608, 305184, 457760
** /dev/altroot
FILE SYSTEM CLEAN; SKIPPING CHECKS
clean, 150096 free (24 frags, 18759 blocks, 0.0% fragmentation)
Checking compatibility with configuration
Initializing...
Verified manifest signed by PackageProduction_10_0_0
Verified junos-10.0R1.8-domestic signed by PackageProduction_10_0_0
Using junos-12.1X46-D55.3-domestic from /altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic
Copying package ...
veriexec: cannot validate /cf/var/validate/chroot/junos/pkg/manifest.certs: unhandled critical extension: /C=US/ST=CA/L=Sunnyvale/O=Juniper Networks/OU=Juniper CA/CN=PackageProductionRSA_2016/emailAddress=ca@juniper.net
chroot: /usr/bin/hwdb_xml_parser: Authentication error
Unable to regenerate Hardware Database, skipping hardware database checks at install time
chroot: tar: Authentication error
Validating against /config/juniper.conf.gz
cp: /cf/var/validate/chroot/var/etc/resolv.conf and /etc/resolv.conf are identical (not copied).
cp: /cf/var/validate/chroot/var/etc/hosts and /etc/hosts are identical (not copied).
chroot: /usr/sbin/mgd: Authentication error
Validation failed
WARNING: Current configuration not compatible with /altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic

{primary:node0}
root> request system software add /var/tmp/junos-srxsme-12.1X46-D55.3-domestic.tgz reboot no-validate            
Formatting alternate root (/dev/da0s2a)...
/dev/da0s2a: 298.0MB (610284 sectors) block size 16384, fragment size 2048
        using 4 cylinder groups of 74.50MB, 4768 blks, 9600 inodes.
super-block backups (for fsck -b #) at:
 32, 152608, 305184, 457760
** /dev/altroot
FILE SYSTEM CLEAN; SKIPPING CHECKS
clean, 150096 free (24 frags, 18759 blocks, 0.0% fragmentation)
Installing package '/altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic' ...
verify-sig: cannot validate ./certs.pem
unhandled critical extension: /C=US/ST=CA/L=Sunnyvale/O=Juniper Networks/OU=Juniper CA/CN=PackageProductionRSA_2016/emailAddress=ca@juniper.net

Installation failed for package '/altroot/cf/packages/install-tmp/junos-12.1X46-D55.3-domestic'

One reason the installation fails is that the device clock is set to a date earlier than the date on which the jloader was built, so the certificate for the file is not yet valid. Setting the correct date fixes this:



root> set date 201611281600.00    
node0:
--------------------------------------------------------------------------
Mon Nov 28 16:00:00 UTC 2016
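
The argument passed to set date above follows the pattern YYYYMMDDhhmm.ss. As a minimal sketch of how that string maps to the timestamp the device echoes back, and of why an old clock makes a newer certificate look "not yet valid", the Python below may help. The names parse_junos_set_date and clock_is_too_early are made up for illustration, and HYPOTHETICAL_NOT_BEFORE is a placeholder, not the real validity date of the Juniper signing certificate.

```python
from datetime import datetime


def parse_junos_set_date(stamp: str) -> datetime:
    """Parse the YYYYMMDDhhmm.ss argument given to 'set date'."""
    return datetime.strptime(stamp, "%Y%m%d%H%M.%S")


def clock_is_too_early(device_now: datetime, cert_not_before: datetime) -> bool:
    """A certificate is 'not yet valid' while the clock is before its start date."""
    return device_now < cert_not_before


# Placeholder only: the real start date of the signing certificate is not known here.
HYPOTHETICAL_NOT_BEFORE = datetime(2016, 1, 1)

# Stand-in for a wrong clock: the build date shown in the JUNOS 10.0R1.8 banner.
old_clock = datetime(2009, 11, 3)

parsed = parse_junos_set_date("201611281600.00")
print(parsed.strftime("%a %b %d %H:%M:%S %Y"))
print("old clock too early:", clock_is_too_early(old_clock, HYPOTHETICAL_NOT_BEFORE))
print("new clock too early:", clock_is_too_early(parsed, HYPOTHETICAL_NOT_BEFORE))
```

With the clock moved forward, the comparison no longer fails, which matches the behaviour described above once the date is corrected.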




Another reason is that you may have to upgrade to an intermediate version before you can upgrade to a recent release. For example, upgrade from Junos 10 to 12.1X44 first, and then from 12.1X44 to 12.1X46.


Step 2: Follow the Juniper KB's Instructions

Note: The KB procedure does not cover the IDP signature database. If the IDP feature is enabled on your system, you will have to deactivate security idp after loading the configuration in step 5 and before committing in step 6.


  [KB21134]
Perform the following procedure:
  1. Check the following parameters, prior to  deploying a RMA device in a Chassis Cluster environment:

    Make sure that the following parameters on the new RMA device are the same as the active node of the Chassis Cluster.

    • Check the hardware on the active cluster node and ensure that the device, which is being placed in the cluster, has the same hardware setup and all FPCs are present in the same slot and active. The command to check this is show chassis hardware.
    • Check the Junos version on the active node of the cluster and upgrade or downgrade Junos (for more information, refer to KB16652 - SRX Getting Started - Junos Software Installation/Upgrade) on the new device; so that they match. 
    • Save the configuration in a file on the working node and upload the file to the new device in the /var/tmp directory.
    • Note: a FAT-formatted USB key can be used to transfer the file to the new SRX.
    • Command (run from the shell): mount -t msdos /dev/da0s1 /mnt
  2. Console to the isolated RMA device (make sure that no cables are connected, other than console cable) and perform the following procedure:    

    1. Get into the configuration mode.
    2. Execute the # delete command.
    3. Configure the root password:
      # set system root-authentication plain-text-password
    4. Then commit:
      # commit
  3. Configure Chassis Clustering on the isolated RMA device.  Use the following command to enable the chassis cluster (you can execute the show chassis cluster status command on the working node to identify the cluster-id):
    > set chassis cluster cluster-id <id> node <No.>
     <No.> will be 1 or 0, depending on which node is being replaced.
  4. Reboot the new node. The node will come online with the cluster being enabled:
    > request system reboot
  5. Enter the configuration mode and load the configuration from the file, which was copied in the /var/tmp directory in step 1. Use the  following command to load the configuration:
    # load override /var/tmp/<filename>
    Note: if the IDP feature is enabled, deactivate it before committing, using the command: deactivate security idp
  6. When the configuration is completely loaded, commit the configuration:
    # commit and-quit
  7. Halt the new node:
    > request system halt
  8. Now connect the fabric and control ports (make sure that none of the revenue port cables are connected) and reboot the node.
  9. Check the status of the FPC PIC by executing the show chassis fpc pic-status command. In the output, all of the FPCs and PICs should be online.
  10. When the new node comes online, it should join the cluster as the secondary. You can check the status by executing the show chassis cluster status command. In the output, the priority of RG0 should be the configured value, and the priority of the other RGs should be 0 if interface monitoring has been configured.
  11. In the output that is generated in step 10, if the new node is shown as the primary, then contact Juniper support for assistance.
  12. If the output that is generated in step 10 shows the primary and secondary for all RGs, then connect all the revenue port cables and again check the chassis cluster status via the show chassis cluster status command. In this output, you should see the configured values for all of the RGs.

  13. If you can access the internet from the new node, then update the license on the new node or download the license and load it. If you are downloading the license on the PC, then save it in a file and upload it to the new node in the /var/tmp directory:
    > request system license update        (if the new node can reach the internet)
    > request system license add /var/tmp/<filename>        (if adding the license from a file)
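
The console part of the procedure above (steps 2 to 7, performed on the isolated RMA device) can be condensed into one ordered checklist. The Python below is only a sketch that prints that sequence for a given cluster ID and node number. The function name rma_prep_commands and the file name cluster-backup.conf are made up for illustration. The "> " and "# " prefixes stand for operational and configuration mode, and the commands themselves are the ones quoted in this post.

```python
from typing import List


def rma_prep_commands(cluster_id: int, node: int, config_file: str,
                      idp_enabled: bool) -> List[str]:
    """Return the console commands for steps 2 to 7, in order."""
    if node not in (0, 1):
        raise ValueError("node must be 0 or 1")
    steps = [
        "> configure",
        "# delete",
        "# set system root-authentication plain-text-password",
        "# commit and-quit",
        f"> set chassis cluster cluster-id {cluster_id} node {node}",
        "> request system reboot",
        "> configure",
        f"# load override /var/tmp/{config_file}",
    ]
    if idp_enabled:
        # Extra step from this post: deactivate IDP before the commit.
        steps.append("# deactivate security idp")
    steps += ["# commit and-quit", "> request system halt"]
    return steps


# Example values: cluster-id 4 and node 0 as used earlier in this post;
# the file name is hypothetical.
for command in rma_prep_commands(4, 0, "cluster-backup.conf", idp_enabled=True):
    print(command)
```

The list stops at the halt in step 7 on purpose, because the remaining steps involve cabling and checks that have to be done by hand.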
Step 3: Troubleshooting Issues

3.1 Nodes of a cluster go into Primary/Lost or Primary/Primary state
The control link and fabric link send packets but do not receive anything.
I changed the fabric ports on the SRX, but the situation stayed the same. I also tried a different cable, with the same result.

According to KB23929, the cause is the following:

"With codes prior to 10.4, by default, the control port tagging was enabled and it used the 4094 VLAN. For 10.4 and later codes, by default, it is disabled.

So, the upgrade/downgrade makes one node of the control port as tagged and the other node as untagged; so this causes control packets to drop, which in turn causes the Split Brain condition."
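
The rule in the quote above can be written down as a small check. The Python below is a rough sketch that only models the default behaviour described in the quote (tagging on before 10.4, off from 10.4 onward). It does not account for a node where control-link-vlan has been set explicitly, and the function names are made up for illustration.

```python
import re
from typing import Tuple


def junos_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '10.0R1.8' or '12.1X46-D55.3'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised Junos version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def default_control_tagging(version: str) -> bool:
    """True when control port tagging is enabled by default (releases before 10.4)."""
    return junos_major_minor(version) < (10, 4)


def tagging_mismatch(version_a: str, version_b: str) -> bool:
    """True when the two nodes would default to different tagging modes."""
    return default_control_tagging(version_a) != default_control_tagging(version_b)


# Versions taken from the transcripts earlier in this post.
print(tagging_mismatch("10.0R1.8", "12.1X46-D55.3"))
print(tagging_mismatch("12.1X46-D55.3", "12.1X46-D55.3"))
```

A result of True for a pair of nodes means one side would tag control traffic and the other would not, which is the condition the KB blames for the dropped control packets.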

SOLUTION:
To avoid the split brain condition, set both sides of the control link to either tagged or untagged, using the following command in the CLI:

root> set chassis cluster control-link-vlan enable/disable
warning: A reboot is required for control-link-vlan to be disabled

{primary:node1}
test@fw1-2> request system reboot 
Reboot the system ? [yes,no] (no) yes

{primary:node1}
test@fw1-2> show chassis cluster information detail 
node0:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: active-active
Cluster configuration:
    Heartbeat interval: 1000 ms
    Heartbeat threshold: 3
    Control link recovery: Enabled
    Fabric link down timeout: 66 sec
Node health information:
    Local node health: Healthy
    Remote node health: Healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: none
    Events:
        Dec  7 13:57:43.435 : hold->secondary, reason: Hold timer expired
        Dec  7 15:48:17.158 : secondary->primary, reason: Control & Fabric links down
        Dec  7 15:48:34.749 : primary->secondary-hold, reason: Preempt/yield(10/100)
        Dec  7 15:53:34.754 : secondary-hold->secondary, reason: Ready to become secondary
        Dec  7 17:53:56.761 : secondary->primary, reason: Control & Fabric links down
        Dec  7 17:53:59.428 : primary->secondary-hold, reason: Preempt/yield(10/100)
        Dec  7 17:58:59.433 : secondary-hold->secondary, reason: Ready to become secondary

Redundancy group: 1, Threshold: 255, Monitoring failures: none
    Events:
        Dec  7 13:57:43.512 : hold->secondary, reason: Hold timer expired
        Dec  7 15:48:17.134 : secondary->ineligible, reason: Fabric link down
        Dec  7 15:48:17.863 : ineligible->primary, reason: Control & Fabric links down
        Dec  7 15:48:34.753 : primary->secondary-hold, reason: Monitor failed: IF 
        Dec  7 15:48:35.762 : secondary-hold->secondary, reason: Ready to become secondary
        Dec  7 15:51:00.571 : secondary->ineligible, reason: Fabric link down
        Dec  7 17:53:41.929 : ineligible->secondary, reason: fabric link UP
        Dec  7 17:53:56.830 : secondary->primary, reason: Control & Fabric links down
        Dec  7 17:53:59.431 : primary->secondary-hold, reason: Monitor failed: CS 
        Dec  7 17:54:00.434 : secondary-hold->secondary, reason: Ready to become secondary
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 19997
        Heartbeat packets received: 19949
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 20024
    Sequence number of last heartbeat packet received: 20501
Fabric link statistics:
    Child link 0
        Probes sent: 11579
        Probes received: 11575
    Child link 1
        Probes sent: 0
        Probes received: 0
Switch fabric link statistics:
    Probe state : DOWN
    Probes sent: 0
    Probes received: 0
    Probe recv errors: 0
    Probe send errors: 0
    Probe recv dropped: 0
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0

Chassis cluster LED information:
    Current LED color: Green            
    Last LED change reason: No failures
Control port tagging:
    Disabled
............omitted......

node1:
--------------------------------------------------------------------------
Redundancy mode:
    Configured mode: active-active
    Operational mode: active-active
Cluster configuration:
    Heartbeat interval: 1000 ms
    Heartbeat threshold: 3
    Control link recovery: Enabled
    Fabric link down timeout: 66 sec
Node health information:
    Local node health: Healthy          
    Remote node health: Healthy

Redundancy group: 0, Threshold: 255, Monitoring failures: none
    Events:
        Dec  7 13:49:59.220 : hold->secondary, reason: Hold timer expired
        Dec  7 13:53:47.517 : secondary->primary, reason: Remote node reboot

Redundancy group: 1, Threshold: 255, Monitoring failures: none
    Events:
        Dec  7 13:49:59.267 : hold->secondary, reason: Hold timer expired
        Dec  7 13:51:05.382 : secondary->primary, reason: Remote yield (100/0)
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 20475
        Heartbeat packets received: 20172
        Heartbeat packet errors: 0
        Duplicate heartbeat packets received: 0
    Control recovery packet count: 0
    Sequence number of last heartbeat packet sent: 20502
    Sequence number of last heartbeat packet received: 20025
Fabric link statistics:
    Child link 0
        Probes sent: 11740
        Probes received: 11585
    Child link 1
        Probes sent: 0
        Probes received: 0
Switch fabric link statistics:
    Probe state : DOWN
    Probes sent: 0
    Probes received: 0
    Probe recv errors: 0
    Probe send errors: 0
    Probe recv dropped: 0
    Sequence number of last probe sent: 0
    Sequence number of last probe received: 0

Chassis cluster LED information:
    Current LED color: Green
    Last LED change reason: No failures
Control port tagging:
    Disabled
............omitted......
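
When comparing both nodes in an output this long, the fields that matter for this issue are the heartbeat counters and the control port tagging state. The Python below is a minimal sketch that pulls those fields out per node. SAMPLE is a trimmed copy of the output above, the function names are made up for illustration, and the "healthy" rule (heartbeats received on every node, same tagging state everywhere) is my own simplification rather than a Juniper definition.

```python
import re
from typing import Dict

SAMPLE = """\
node0:
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 19997
        Heartbeat packets received: 19949
Control port tagging:
    Disabled
node1:
Control link statistics:
    Control link 0:
        Heartbeat packets sent: 20475
        Heartbeat packets received: 20172
Control port tagging:
    Disabled
"""


def summarize(output: str) -> Dict[str, dict]:
    """Collect heartbeat counters and tagging state for each node section."""
    nodes: Dict[str, dict] = {}
    current = None
    lines = output.splitlines()
    for index, line in enumerate(lines):
        text = line.strip()
        header = re.fullmatch(r"(node\d+):", text)
        if header:
            current = nodes.setdefault(header.group(1), {})
            continue
        if current is None:
            continue
        counter = re.fullmatch(r"Heartbeat packets (sent|received): (\d+)", text)
        if counter:
            current[counter.group(1)] = int(counter.group(2))
        elif text == "Control port tagging:" and index + 1 < len(lines):
            # The state is printed on the line after the label.
            current["tagging"] = lines[index + 1].strip()
    return nodes


def control_link_looks_healthy(summary: Dict[str, dict]) -> bool:
    """Simplified rule: every node receives heartbeats and tagging matches."""
    receiving = all(info.get("received", 0) > 0 for info in summary.values())
    taggings = {info.get("tagging") for info in summary.values()}
    return receiving and len(taggings) == 1


summary = summarize(SAMPLE)
for node, info in summary.items():
    print(node, info)
print("healthy:", control_link_looks_healthy(summary))
```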



