13
June

A new start

I started this blog in 2012 when I was assigned to a new client and had to get up to speed with NetWorker. The purpose was to capture any and all learnings here. Four years and close to 80 posts later, it has been an invaluable tool for capturing and sharing knowledge. It’s not unusual for me to look up past issues here, or to google them only to be redirected to my own blog. I never considered an issue resolved until it was captured here. It’s not unusual for others to find their way here as well. Stats for the last month show 656 hits, mostly from India and the U.S. It’s good to know I’m not here alone, screaming in the dark.

 

 

Today I find myself at the start of a new opportunity, with many great challenges and things to learn. So expect this blog to be more than a place for capturing knowledge about NetWorker and Avamar. It will now also be a repository for new learnings related to my old friend NetBackup, as well as storage and virtualization.


13
June

Getting familiar with cmode

Had a request from my client to upload some logs. The logs were needed for some c-mode systems, and collecting them required access to the systemshell. Having recently completed some troubleshooting on 7-mode, the process was not altogether unfamiliar, but it was slightly different.

The diag user is required, so let’s check its status. Is it locked? Do we know the password? Let’s hope.

blob0::> security login show -username diag

Username        Application    method  Role Name       Acct locked
diag            console        passwd  admin           no

Enter advanced privilege mode:
set -privilege advanced

After confirming the password, we can access the systemshell:

blob0::>system node systemshell -node blob0n01

Fascinating, I know


29
November

Avamar – expiring snapups

 

 


In a perfect world, you should never need this command. My friend Ian Anderson, in a great Ask the Experts session, spoke of achieving “Avamar Zen”.

Avamar Zen is a state of harmony, where you have achieved a steady state of data ingestion versus data expiration and, hopefully, have more data expiring and being cleaned up by garbage collection than new data coming in. However, zen can be hard to achieve. Avamar is an amazing product, and if the SEs have done their job and sized it properly, you should reach steady state. What sometimes happens is that the client is so impressed they begin adding systems and workloads that were outside the initial sizing scope.

Years ago, when I was embedded onsite, we ran into such an issue. It wasn’t so much about adding too many systems as about one system in particular. We had some groups, one configured to cross mount points and another to protect only local data. A co-worker spun up a new system and, instead of checking with me, added it to Avamar himself, and to the wrong group.

The next day I arrived to find my Avamar grid full, and the data had also been replicated over to the secondary. Quite the mess. So just roll the system back to a previous checkpoint, you may think? Unfortunately, when an Avamar system has reached capacity there is not enough space for the overhead required to perform the checkpoint rollback. Now, let’s meet our friend expire-snapups.

What does it do? What do you think it does? It expires snaps! Awesome, right? What is really cool is how it does this. The command takes switches that let you granularly target specific data to remove. For example, if you wanted to remove all data from Nov 30, 2015 and the previous 25 days, you would run the following:

expire-snapups --before='2015-11-30' --days=25 --domain=/ > do-expire.sh

This will create a script, do-expire.sh, in the directory you run the command from (tmp here), which you can then run to wipe out the offending data.

There are other switches available to target specific data and clients. When complete, settle in for a long garbage collect to run and turf the offending client.
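Before running anything destructive, it can help to write down exactly which dates you expect to be covered. Below is a minimal Python sketch of that arithmetic. It assumes, as described above, that the days value counts backwards from the before date; the function name is my own, and the real switch semantics should be confirmed against the command's help on your grid.

```python
from datetime import date, timedelta


def expiry_window(before: str, days: int) -> tuple[date, date]:
    """Return (start, end) of the date window implied by the two values.

    Assumption: 'days' counts backwards from 'before', as the post describes.
    This is only date arithmetic; it does not talk to Avamar.
    """
    end = date.fromisoformat(before)
    start = end - timedelta(days=days)
    return start, end


start, end = expiry_window("2015-11-30", 25)
print(f"Expect data between {start} and {end} to be targeted")
```

If the window printed here does not match what you meant to remove, fix the values before generating the script.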

When complete, I hope you can achieve “Avamar Zen” as I did. I also changed the admin password, so my helpful co-worker could not repeat the same mistake.

 


20
November

NetWorker Exchange failure – The group or resource is not in the correct state to perform the requested operation.

As with any Windows system backup, we sadly have to deal with VSS. VSS is the bane of any backup administrator’s existence. Add in Exchange, and you have another layer of services that must be orchestrated correctly to ensure protection.

In today’s post, we revisit our old friend VSS and provide some additional insight into NetWorker and the associated services that help coordinate backups.

I had some new Exchange systems to add to the backup rotation. These were running Exchange 2013 on W2k12.

We started to see the following errors:

NetWorker error messages:
APPLICATIONS:\Microsoft Exchange 2013: Backup of [APPLICATIONS:\Microsoft Exchange 2013] failed
rsnap_vss_save:NMM .. error caught calling RM to commit the replica.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 024168 ERROR:The VSS shadow copy for Exchange Replication Service was not successful because of an error that occurred earlier.
83394:nsrsnap_vss_save:NMM .. cluster failover or application role change after a replica created may have caused snapshot creation failure, try to restart the backup.
Microsoft DiskPart version 6.3.9600
Copyright (C) 1999-2013 Microsoft Corporation.
On computer: exchange-node
Automatic mounting of new volumes enabled.
37959:nsrsnap_vss_save:The group or resource is not in the correct state to perform the requested operation.
.
63335:nsrsnap_vss_save:NMM backup failed to complete successfully.
Internal error.
102333 1447959921 3 0 0 15084 14984 0 server nsrsnap_vss_save NSR error 21 Exiting with failure. 0
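When a log is this noisy, I find it easier to boil it down to which writer failed, with which VSS error, and how many times. Here is a rough Python sketch that does this for lines shaped like the ones above. The regular expression is my own guess at the pattern based only on this sample, not a documented NMM log format, so treat it as a starting point.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines in this post (an assumption,
# not an official format): "ERROR:<writer> is stable. The error is <name>.
# The code is: <hex>."
VSS_ERROR = re.compile(
    r"ERROR:(?P<writer>.+?) is stable\. "
    r"The error is (?P<name>VSS_E_\w+)\. "
    r"The code is: (?P<code>0x[0-9a-fA-F]+)"
)


def summarize_vss_errors(log_text: str) -> Counter:
    """Count occurrences of each (writer, error name, error code) triple."""
    counts = Counter()
    for match in VSS_ERROR.finditer(log_text):
        counts[(match["writer"], match["name"], match["code"])] += 1
    return counts


sample_log = """\
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
63335:nsrsnap_vss_save:NMM backup failed to complete successfully.
"""

for (writer, name, code), count in summarize_vss_errors(sample_log).items():
    print(f"{writer}: {name} ({code}) x{count}")
```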

Corrective actions taken:
Rebooted Exchange servers
Removed NetWorker client, NMM,

And some of the following from the event viewer:
[RpcHttp] Marking ClientAccess 2010 server server(https://server/rpc/rpcproxy.dll) as unhealthy due to exception: System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. —> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. —> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
**************************************************************************************************************************
[Autodiscover] Marking ClientAccess 2010 server server (https://server) as unhealthy due to exception: System.Net.WebException: The operation has timed out
at System.Net.HttpWebRequest.GetResponse()
at Microsoft.Exchange.HttpProxy.ProtocolPingStrategyBase.Ping(Uri url)
****************************************************************************************************************************

The description for Event ID 5068 from source Replication API Service cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

2015 11 19 00:00:01
exchange-node

The specified resource type cannot be found in the image file
***********************************************************************************************************************

********************************************************************************************
PS: (CPSImportService::GetOperationStatus) ERR: Error for operation 1
*********************************************************************************************
NMM .. Error during import of snapshot: PS .. Error completing snapshot import
**********************************************************************************************.
NMM .. Registration of snapshot set failed — PS .. Error completing snapshot import
****************************************************************************************************
Microsoft Exchange VSS Writer backup failed. No log files were truncated. Instance d93e725e-486d-440f-8397-93061460ee02. Database 8bee9b3c-8fff-405e-a42d-b739d05caadd.
********************************************************************************************************************
The Microsoft Exchange Replication service VSS Writer (Instance d93e725e-486d-440f-8397-93061460ee02) failed with error FFFFFFFC when processing the backup completion event.
********************************************************************************************************************
An internal transport certificate will expire soon. Thumbprint:EE063967EB62D17668AD392286EFF96622F84BFC, hours remaining: 1594
******************************************************************************************************************

What did we do?

We rebooted the system and attempted to restart some of the associated VSS services. No joy. I opened a ticket with EMC, and Anthony gave me a call back. I had worked with him before, so I was glad to hear from him.

On your Exchange Server run the following commands and give their output:

 

  1. vssadmin list writers
  2. vssadmin list shadows
  3. If the above command lists any shadow copies, perform steps 4 to 6:
  4. type diskshadow
  5. type list shadows all
  6. type delete shadows all
  7. Stop all NetWorker services on the client and verify that all are stopped aside from nsrpm.
  8. While the NetWorker services are stopped, restart the Replication Manager RMAgentPS service and make sure the other Replication Manager service is stopped as well.
  9. Restart the NetWorker services.
  10. In a cmd prompt, type tasklist | findstr nsr

We retried the backup and it worked.

A little insight into some of these services and the roles they play. We are familiar with nsrexecd and PowerSnap. We also had to restart the RMAgentPS service. This service executes operations for the Replication Manager Client for RMAgentPS. In addition, we had to stop the Replication Manager Exchange Interface service, which executes Exchange commands for Replication Manager. In the end, there were some stale shadows, and we needed to restart the required services in the correct order. Of course, also remember to make sure the Exchange VSS writer is present.
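Checking the writers by eye gets old quickly. The Python sketch below parses text in the general shape that vssadmin list writers prints and reports any writer that is not stable or has a last error, plus whether a given writer is present at all. The sample text is illustrative only and was not captured from the systems in this post; the function names are my own.

```python
import re


def parse_writers(output: str) -> list[dict]:
    """Turn 'vssadmin list writers' style text into one dict per writer."""
    writers = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name: '(.+)'", line)
        if match:
            current = {"name": match.group(1)}
            writers.append(current)
        elif current is not None and ":" in line:
            # Lines such as "State: [1] Stable" or "Last error: No error"
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return writers


def unhealthy(writers: list[dict]) -> list[str]:
    """Names of writers that are not Stable or report a last error."""
    bad = []
    for writer in writers:
        stable = writer.get("State", "").endswith("Stable")
        clean = writer.get("Last error") == "No error"
        if not (stable and clean):
            bad.append(writer["name"])
    return bad


# Illustrative sample, not output from the post's Exchange servers.
sample = """\
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'Microsoft Exchange Writer'
   State: [8] Failed
   Last error: Retryable error
"""

writers = parse_writers(sample)
print("Needs attention:", unhealthy(writers))
print("Exchange writer present:",
      any(w["name"] == "Microsoft Exchange Writer" for w in writers))
```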

 

 

 


29
July

Troubleshooting Conflicting NSR peer Information errors

Recently, our server team migrated some data to a new server, keeping the same name and IP for the client.

After reinstalling the NetWorker client, I ran a test backup and noticed the following filling up the log:

Error: Conflicting NSR peer information resources detected for host. Please see server log for more information.

This is a pretty easy thing to fix; a quick google of the error will turn up the solution. But what does it mean, and why is this error being produced?

Below is from the man page.

The NSR peer information resource is used by NetWorker authentication daemon nsrexecd. Resources of this type are populated/created by NetWorker. They are used to hold the identity and certificate of remote NetWorker installations that the local installation communicated with in the past. These resources are similar to known_hosts file used by ssh(1). Once a NetWorker installation (client, server, or storage node) communicates with a remote NetWorker install (client, server, or storage node), a NSR peer information resource will be created on each host and will contain information about the peer (i.e. identity and certificate). During this initial communication, each host will send information about itself to the peer. This information includes the NW instance name, NW instance ID, and the certificate. After this initial communication, each NetWorker install will use the registered peer certificate to validate future communications with that peer.

So, it goes without saying that when a system is rebuilt, or a new system is built with a previously used name, this certificate will change and will no longer match the one its peers have on record.
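To make the known_hosts comparison concrete, here is a toy Python model of the idea. It is only an illustration of the concept described in the man page, not how nsrexecd is actually implemented, and every name in it is made up.

```python
import hashlib


class PeerRegistry:
    """Toy model of peer records: one remembered certificate per hostname."""

    def __init__(self) -> None:
        self._peers: dict[str, str] = {}

    def check(self, hostname: str, certificate: bytes) -> str:
        """Register a new peer, accept a known one, or flag a conflict."""
        fingerprint = hashlib.sha256(certificate).hexdigest()
        known = self._peers.get(hostname)
        if known is None:
            self._peers[hostname] = fingerprint
            return "registered"
        return "ok" if known == fingerprint else "conflict"

    def forget(self, hostname: str) -> None:
        """Equivalent in spirit to deleting the stale peer record."""
        self._peers.pop(hostname, None)


registry = PeerRegistry()
print(registry.check("client01", b"certificate from the old build"))  # first contact
print(registry.check("client01", b"certificate from the new build"))  # rebuilt host
registry.forget("client01")
print(registry.check("client01", b"certificate from the new build"))  # after cleanup
```

The middle call is the situation in this post: same name, different certificate. Deleting the stale record lets the next contact register the new one.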

The resolution below is from the following link:

https://community.emc.com/docs/DOC-20085

Delete the NSR Peer Information of the NetWorker Server on the client/storage node.
Then delete the NSR Peer Information for the client/storage node from the NetWorker Server.

Please follow the steps given below to delete the NSR peer information on NetWorker Server and on the Client.

1. At NetWorker server command line, go to the location /nsr/res
2. Type the command:

nsradmin -p nsrexec
print type:nsr peer information; name:client_name
delete
y

Specify the name of the client/storage node in the place of client_name.

1. At the client/storage node command line, go to the location /nsr/res
2. Type the command:

nsradmin -p nsrexec
print type:nsr peer information
delete
y

 

 


22
July

The most recent checkpoint for the VBA appliance is outdated

Recently I found that one of our VBA appliances was producing the following error in vSphere:

error> The most recent checkpoint for the VBA appliance is outdated

cplist
cp.20150714161043 Tue Jul 14 10:10:43 2015 valid hfs --- nodes 1/1 stripes 344
cp.20150714172020 Tue Jul 14 11:20:20 2015 valid hfs --- nodes 1/1 stripes 344

Opened a ticket with EMC and the following procedure was provided.

dpnctl status
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: gsan status: up
dpnctl: INFO: MCS status: up.
dpnctl: INFO: Backup scheduler status: up.
dpnctl: INFO: axionfs status: up.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: enabled.

A status.dpn gives us a little more information on some of the maintenance services:

status.dpn
Tue Jul 21 09:35:44 MDT 2015 [VBA01] Tue Jul 21 15:35:44 2015 UTC (Initialized Tue Nov 4 20:34:05 2014 UTC)
Node IP Address Version State Runlevel Srvr+Root+User Dis Suspend Load UsedMB Errlen %Full Percent Full and Stripe Status by Disk
0.0 10.0.80.7 7.0.62-10 ONLINE fullaccess mhpu+0hpu+0hpu 4 false 0.58 7586 15696646 6.1% 6%(onl:116) 6%(onl:116) 6%(onl:115)
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable
System ID: 1415133245@00:50:56:88:1A:CC
All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)
System-Status: ok
Access-Status: full
Last checkpoint: cp.20150714172020 finished Tue Jul 14 11:20:40 2015 after 00m 20s (OK)
No GC yet
Last hfscheck: finished Tue Jul 14 11:28:07 2015 after 07m 19s >> checked 343 of 343 stripes (OK)
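The output above already tells us why vSphere is complaining: the last checkpoint is a week old. A small Python sketch of that check follows. It assumes the checkpoint tag encodes a UTC timestamp, which is what the cplist and status.dpn output in this post suggest (tag 17:20:20 against 11:20:20 local MDT). The 24 hour threshold is my own placeholder, not the value the appliance uses.

```python
from datetime import datetime, timedelta, timezone


def checkpoint_time(tag: str) -> datetime:
    """Parse a tag such as 'cp.20150714172020', assuming it is in UTC."""
    parsed = datetime.strptime(tag, "cp.%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def checkpoint_age(tag: str, now: datetime) -> timedelta:
    """How long ago the checkpoint was taken, relative to 'now'."""
    return now - checkpoint_time(tag)


def is_outdated(tag: str, now: datetime,
                max_age: timedelta = timedelta(hours=24)) -> bool:
    """True if the checkpoint is older than max_age (placeholder threshold)."""
    return checkpoint_age(tag, now) > max_age


# 'now' taken from the UTC timestamp on the first line of status.dpn above.
now = datetime(2015, 7, 21, 15, 35, 44, tzinfo=timezone.utc)
age = checkpoint_age("cp.20150714172020", now)
print(f"Last checkpoint age: {age}, outdated: {is_outdated('cp.20150714172020', now)}")
```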

 

Although maintenance is running, cp, gc and hfscheck were all suspended.

The status.dpn output, continued:

Maintenance windows scheduler capacity profile is active.
WARNING: cp is suspended permanently.
WARNING: gc is suspended permanently.
WARNING: hfscheck is suspended permanently.
The maintenance window is currently running.
Next backup window start time: Tue Jul 21 20:00:00 2015 MDT
Next maintenance window start time: Wed Jul 22 08:00:00 2015 MDT
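If you look after more than one appliance, it may be worth scripting this check. The Python sketch below pulls the suspended activities out of text like the above. The pattern is based only on the three WARNING lines shown here, so it is an assumption about the wording rather than a documented format.

```python
import re

# Matches lines such as "WARNING: gc is suspended permanently."
# (wording taken from the output in this post, not from documentation).
SUSPENDED = re.compile(r"^WARNING: (\w+) is suspended", re.MULTILINE)


def suspended_activities(status_text: str) -> list[str]:
    """Return the maintenance activities reported as suspended, in order."""
    return [match.group(1) for match in SUSPENDED.finditer(status_text)]


status_text = """\
Maintenance windows scheduler capacity profile is active.
WARNING: cp is suspended permanently.
WARNING: gc is suspended permanently.
WARNING: hfscheck is suspended permanently.
The maintenance window is currently running.
"""

print("Suspended:", ", ".join(suspended_activities(status_text)) or "none")
```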

The following commands will permanently resume hfscheck, gc and cp:

avmaint sched resume hfscheck --permanent --ava
avmaint sched resume gc --permanent --ava
avmaint sched resume cp --permanent --ava

You may also want to create a checkpoint manually:
dpnctl stop maint
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Suspending maintenance windows scheduler…

avmaint checkpoint --ava
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<checkpoint
tag="cp.20150721154550"
isvalid="false"/>

dpnctl start maint
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Resuming maintenance windows scheduler…
dpnctl: INFO: maintenance windows scheduler resumed.


9
July

NetWorker 8.2.1.4 Build 783 released

NetWorker 8.2.1.4 Build 783
Publication Date: 2015-JUN-08

234898 ESC		NW_VSS		Escalation 23776:Large SQLDatabase Backups are not possible(VDI Backup)
234595 ESC		NW_Console	Escalation 23700:VBA: VMware protection policy details shows another days backup information
233518 ESC		NW_VSS		Escalation 23701:NMSQL recovered the different data than what is being requested
232892 ESC		NetWorker	Escalation 23391: device mismatch errors in other devices after device deletion that prevents devices from working.
232110 ESC		NetWorker	Escalation 23548:nsrjobd crashes while performing RMAN recovery
229042 ESC		NW_VSS		Escalation 23085:nsrsnap_vss_save crashes in a 25 node Hyper-V Windows 2012 R2 core setup
225356 ESC		NW_VSS		Escalation 22686:NW00162213-NW162213:EXCHANGE 2010 SP3 RU5 backups failing with error VSS_E_WRITERERROR_RETRYABLE
233253 BUG		NetWorker	VBA stuck in query pending forever in case of problems
232507 BUG		NetWorker	ESC 23654 - VBA Policy status shows as failed but not clients are listed in NMC
232306 BUG		NetWorker	ESC 23393:VBA jobs show nothing in waiting to run in NMC even when there are jobs in queued state on VBA and failed VMs show no error
185641 (NW160391) BUG	NetWorker	No info to user when backups are not run due to No Eligible Proxies during hot-add only backup mode
206681 (NW159953) ESC	NetWorker	Cannot label blank tape after upgrade to NetWorker version 8.1 on AIX - different than NW156909
206213 (NW157573) ESC	NetWorker	Library getting down after upgraded from 8.0.1.1 to 8.1.0.2 even to 8.1.0.3
204138 (NW159899) ESC	NetWorker	auto inventory of HP MSL libraries doesnt work in 8.1
200652 (NW161429) ESC	NetWorker	Skips of scheduled clone jobs show as interrupted in the NW gui After upgrading to NW 8.1.1.6

 


8
July

Cannot access NetWorker VBA GUI

Had an issue recently where the VBA config and FLR GUIs were inaccessible. Stopping and starting tomcat with the emwebapp script was easy enough, but it didn’t fix the problem. EMC then provided this process, which also re-registers the certificate.

 

1. Stop emwebapp
emwebapp.sh --stop

2. Back up existing keystore
cp /root/.keystore /root/.keystore.sav

3.  List tomcat certificate – should see 1 certificate
/usr/java/latest/bin/keytool -list -keystore /root/.keystore -storepass changeit -alias tomcat

4. Delete tomcat certificate from keystore
/usr/java/latest/bin/keytool -delete -alias tomcat -keystore /root/.keystore -storepass changeit

5. List tomcat certificate again – should return empty
/usr/java/latest/bin/keytool -list -keystore /root/.keystore -storepass changeit -alias tomcat

6. Regenerate certificate using SHA256
/usr/java/latest/bin/keytool -genkeypair -v -alias tomcat -keyalg RSA -sigalg SHA256withRSA -keystore /root/.keystore -storepass changeit -keypass changeit -validity 3650 -dname "CN=localhost.localdom, OU=Avamar, O=EMC, L=Irvine, S=California, C=US"

7. List tomcat certificate again – should see 1 certificate
/usr/java/latest/bin/keytool -list -keystore /root/.keystore -storepass changeit -alias tomcat

8. Start emwebapp
emwebapp.sh --start


3
July

Troubleshooting NetWorker Disaster Recovery Backup Failures

Occasionally, I have found that all backup savesets will complete with the exception of the disaster recovery portion.

One thing to check is to ensure all volumes are online. This is required for the disaster recovery backup to complete.

Open a command line on the client and use the diskpart utility.

C:\>diskpart

Always rescan first:

DISKPART>rescan

When complete, list the volumes to see if any are offline:

DISKPART>list vol

Identify the offline volume and put it online:

DISKPART>select volume 1

DISKPART>online volume
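If you end up doing this on several clients, the same commands can be put in a text file and fed to diskpart with its /s switch. Here is a small Python sketch that builds such a script from the commands above. The function name and the file name in the comment are my own, and you still need to read the list output yourself to pick the right volume number.

```python
def build_diskpart_script(volume_number: int) -> str:
    """Build a diskpart script that rescans, lists volumes and onlines one.

    The commands are the same ones typed interactively in this post.
    """
    if volume_number < 0:
        raise ValueError("volume number must not be negative")
    commands = [
        "rescan",
        "list volume",
        f"select volume {volume_number}",
        "online volume",
    ]
    return "\n".join(commands) + "\n"


script = build_diskpart_script(1)
print(script)
# Save the text to a file (for example online-vol.txt, a made-up name) and
# run it on the client with: diskpart /s online-vol.txt
```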

 

 


26
June

Unexpected Connection error with NetWorker and VBA

I was really looking forward to another idyllic day as a Backup Administrator.

Those days usually begin troubleshooting a few backup failures, drinking a lot of coffee and planning world domination. Sadly, I actually had to do some work today instead.

I found all my VMware protection policies had failed the previous evening.

error: Unable to connect to VBA, error Cannot establish session to VBA.
Logged onto vSphere and attempted to browse the backup recovery area, where we found the following error:

error: An unexpected connection error occurred and the cause could not be determined, Please check your EBR configuration screen to troubleshoot, or contact an administrator.

We rebooted the appliance, and it seemed to fix the issue for one. We ended up delving into the nuts and bolts of the various resources required to create a VBA backup. Let’s review these components.

EBR Config GUI

You can access the configuration interface by browsing to:

https://VBA:8543/ebr-configure/

Log in with the root credentials of the appliance. The default password is 8RttoTriz

Note: I don’t know about you, but try as I might, I could not access this via Chrome.

VMUSER

In the GUI you can view running services and restart them if required. You can also configure the connection to NetWorker. To do this you will need to know the password for the vmuser account. Near as I can tell, this is an ID internal to NetWorker that is used to establish communication. The default password is “changeme”.

CHANGING THE VMUSER PASSWORD

We hadn’t changed this password. If you would like to change it, this can be done on the NetWorker server properties page under the misc tab. We also were not sure what the password was. Yeah, I know.

Re-establish communication between the appliance and NetWorker

Go to the NetWorker Config tab. Here you can enter the password and save, which confirms the password and re-establishes the interface with NetWorker. You will need to reboot the appliance after this.


 

 

This did resolve our issue, and we can now browse the Backup Recovery interface in vSphere. With this re-established, backups should run tonight. Fingers crossed.

 

