Avamar – MSG_ERR_DDR_ERROR – hfscheck-finish Backup directory missing for backup

Issue: Checkpoint validation fails with the following error


Check for error in /usr/local/avamar/var/ddrmaintlogs

more ddrmaint.log.1 |grep -i error

Jun 13 12:23:56 cmz-avmr-uti ddrmaint.bin[20520]: Error: hfscheck-finish Client directory missing for backup ddtarget.XYZ.com(1) cp.20120613080724/33868db2a2dbf30208e9cb491867c8d496f77256/1CD3042CC071B8E


avmgr resf --acnt=ref{33868db2a2dbf30208e9cb491867c8d496f77256}
/Servers/SRV/client.XYZ.com 33868db2a2dbf30208e9cb491867c8d496f77256

avmgr getb --format=xml --incpartial --path=Servers/SRV/client.XYZ.com |grep -i 1CD3042CC071B8E
<backuplistrec flags="17825809" labelnum="178282" label="server bu Incr – Top hour-server bu Top-h" created="1544925664" roothash="32d5213382ce43db7539d4361a738d9b66fe14be" totalbytes="11927552.00" ispresentbytes="0.00" pidnum="3006" percentnew="27" expires="1545530464" created_prectime="0x1d494e333611968" partial="1" retentiontype="daily" backuptype="Incremental" ddrindex="1" locked="0" direct_restore="1" tier="0" appconsistent="not_available"/>

avmgr delb --incpartial --path=/Servers/SRV/client.XYZ.com --date=0x1D48E24811015EE
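
A quick sanity check before deleting anything: the created_prectime value in the listing appears to be a count of 100-nanosecond ticks since 1601-01-01 UTC (the same layout as a Windows FILETIME). That is my reading rather than anything from the Avamar docs, but it holds for the record above, where the converted value matches the created field exactly. The sketch below does the conversion; the function names are my own.

```python
from datetime import datetime, timezone

# Seconds between 1601-01-01 and the Unix epoch (1970-01-01).
EPOCH_OFFSET_SECONDS = 11_644_473_600
# Assumption: the hex value counts 100-nanosecond ticks.
TICKS_PER_SECOND = 10_000_000


def prectime_to_unix(prectime_hex):
    """Convert a hex created_prectime string to whole Unix seconds."""
    ticks = int(prectime_hex, 16)  # int() accepts the leading "0x"
    return ticks // TICKS_PER_SECOND - EPOCH_OFFSET_SECONDS


def prectime_to_utc(prectime_hex):
    """Same conversion, returned as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(prectime_to_unix(prectime_hex), tz=timezone.utc)


# Value taken from the backuplistrec shown above.
print(prectime_to_unix("0x1d494e333611968"))  # 1544925664, same as created=
print(prectime_to_utc("0x1d494e333611968"))   # 2018-12-16 02:01:04+00:00
```

Running the same conversion on the value you intend to pass to --date is a cheap way to confirm you are pointing at the backup you think you are.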




AWS – Introduction to Glacier



A new start

I started this blog in 2012 when I was assigned to a new client and had to get up to speed with NetWorker. The purpose was to capture any and all learnings here. Four years and close to 80 posts later, it has been an invaluable tool for capturing and sharing knowledge. It’s not unusual for me to look up past issues here, or to google them only to be redirected to my own blog. I never considered an issue resolved until it was captured here. It’s not unusual for others to find their way here too. Stats for the last month show 656 hits, mostly from India and the U.S. It’s good to know I’m not alone here, screaming in the dark.



Today I find myself at the start of a new opportunity with many great challenges and things to learn. So expect this blog to be more than a place for capturing knowledge about NetWorker and Avamar. It will now also be a repository for new learnings related to my old friend NetBackup, as well as storage and virtualization.



Getting familiar with cmode

I had a request from my client to upload some logs. The logs were needed from some c-mode systems, and collecting them required access to the systemshell. Having recently completed some troubleshooting for 7-mode, I found the process not altogether unfamiliar, but slightly different.

The diag user is required, so let’s check its status. Is it locked? Do we know the password? Let’s hope.

blob0::> security login show -username diag

Username  Application  Method  Role Name  Acct Locked
diag      console      passwd  admin      no

Enter priv mode
set -privilege advanced

After confirming the password we can access the systemshell

blob0::> system node systemshell -node blob0n01

Fascinating, I know



Avamar – expiring snapups




In a perfect world, you should never need this command. My friend Ian Anderson spoke of achieving “Avamar Zen” in a great Ask the Experts session.

Avamar Zen is a state of harmony, where you have achieved a steady state of data ingestion vs data expiration. Ideally, you have more data expiring and being cleaned up by garbage collection than you have new data coming in. However, zen can be hard to achieve. Avamar is an amazing product. If the SEs have done their job and sized it properly, you should reach steady state. What sometimes happens is that the client is so impressed they begin adding more systems and workloads that were outside the initial sizing scope.

Years ago, when I was embedded onsite, we ran into such an issue. It wasn’t so much about adding too many systems, but one in particular. We had some groups, one configured to cross mount points and another to protect only local data. A co-worker spun up a new system and, instead of checking with me, added the system to Avamar himself, and to the wrong group.

The next day I arrived to find my Avamar grid full, and the data had also replicated over to the secondary. Quite the mess. So just roll the system back to a previous checkpoint, you may think? Unfortunately, when an Avamar system has reached capacity there is not enough space for the overhead required to perform the checkpoint rollback. Now, let’s meet our friend expire-snapups.

What does it do? What do you think it does? It expires snapups! Awesome, right? What is really cool is how it does this. The command takes switches that let you granularly target specific data to remove. For example, if you wanted to remove all data from Nov 30, 2015 and the previous 25 days, you would run the following:

expire-snapups --before='2015-11-30' --days=25 --domain=/ > do-expire.sh

This will create a script in tmp, which you can then run to wipe out the offending data.
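
Before running anything destructive, it is worth writing down the date range you think you are targeting. The sketch below just does the calendar arithmetic for the example above, going by my description of the switches (Nov 30, 2015 and the previous 25 days). Whether the tool treats either boundary as inclusive is not something I have confirmed, and the function name is my own.

```python
from datetime import date, timedelta


def expiry_window(before, days):
    """Return (start, end) dates for a window ending at `before`.

    Assumption: the window is simply `days` calendar days counted
    back from the `before` date; boundary inclusiveness is not checked.
    """
    end = date.fromisoformat(before)
    start = end - timedelta(days=days)
    return start, end


start, end = expiry_window("2015-11-30", 25)
print(start, end)  # 2015-11-05 2015-11-30
```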

There are other switches available to target specific data and clients. When the script completes, settle in for a long garbage collection run to turf the offending client’s data.

When complete, I hope you can achieve “Avamar Zen” as I did. I also changed the admin password, so my helpful co-worker could not repeat the same mistake.




NetWorker Exchange failure – The group or resource is not in the correct state to perform the requested operation.

As with any Windows system backup, we sadly have to deal with VSS. VSS is the bane of any backup administrator’s existence. Add in Exchange, and you have another layer of services that must be orchestrated correctly to ensure protection.

In today’s post, we revisit our old friend VSS and provide some additional insight into NetWorker and the associated services that help coordinate backups.

I had some new Exchange systems to add to the backup rotation. These were running Exchange 2013 on Windows Server 2012.

We started to see the following errors:

NetWorker error messages:
APPLICATIONS:\Microsoft Exchange 2013: Backup of [APPLICATIONS:\Microsoft Exchange 2013] failed
rsnap_vss_save:NMM .. error caught calling RM to commit the replica.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 027114 ERROR:Exchange Replication Service is stable. The error is VSS_E_WRITERERROR_RETRYABLE. The code is: 0x800423f3. Check the application event log for more information.
49931:nsrsnap_vss_save:RM .. 024168 ERROR:The VSS shadow copy for Exchange Replication Service was not successful because of an error that occurred earlier.
83394:nsrsnap_vss_save:NMM .. cluster failover or application role change after a replica created may have caused snapshot creation failure, try to restart the backup.
Microsoft DiskPart version 6.3.9600
Copyright (C) 1999-2013 Microsoft Corporation.
On computer: exchange-node
Automatic mounting of new volumes enabled.
37959:nsrsnap_vss_save:The group or resource is not in the correct state to perform the requested operation.
63335:nsrsnap_vss_save:NMM backup failed to complete successfully.
Internal error.
102333 1447959921 3 0 0 15084 14984 0 server nsrsnap_vss_save NSR error 21 Exiting with failure. 0

Corrective actions taken:
Rebooted Exchange servers
Removed networker client , NMM,

And some of the following from the event viewer:
[RpcHttp] Marking ClientAccess 2010 server server(https://server/rpc/rpcproxy.dll) as unhealthy due to exception: System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
[Autodiscover] Marking ClientAccess 2010 server server (https://server) as unhealthy due to exception: System.Net.WebException: The operation has timed out
at System.Net.HttpWebRequest.GetResponse()
at Microsoft.Exchange.HttpProxy.ProtocolPingStrategyBase.Ping(Uri url)

The description for Event ID 5068 from source Replication API Service cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

2015 11 19 00:00:01

The specified resource type cannot be found in the image file

PS: (CPSImportService::GetOperationStatus) ERR: Error for operation 1
NMM .. Error during import of snapshot: PS .. Error completing snapshot import
NMM .. Registration of snapshot set failed — PS .. Error completing snapshot import
Microsoft Exchange VSS Writer backup failed. No log files were truncated. Instance d93e725e-486d-440f-8397-93061460ee02. Database 8bee9b3c-8fff-405e-a42d-b739d05caadd.
The Microsoft Exchange Replication service VSS Writer (Instance d93e725e-486d-440f-8397-93061460ee02) failed with error FFFFFFFC when processing the backup completion event.
An internal transport certificate will expire soon. Thumbprint:EE063967EB62D17668AD392286EFF96622F84BFC, hours remaining: 1594

What did we do?

We rebooted the system and attempted to restart some of the associated VSS services. No joy. I opened a ticket with EMC, and Anthony gave me a call back. I had worked with him before, so I was glad to hear from him.

On your Exchange server, run the following commands and provide their output:


  1. vssadmin list writers
  2. vssadmin list shadows
  3. If the above command lists any shadow copies, perform the following steps:
       a. type diskshadow
       b. type list shadows all
       c. type delete shadows all
  4. Stop all NetWorker services on the client; verify all services are stopped aside from nsrpm.
  5. While the NetWorker services are stopped, restart the Replication Manager RMAgentPS service and make sure the other Replication Manager service is stopped as well.
  6. Restart the NetWorker services.
  7. In a cmd prompt, type tasklist | findstr nsr

We retried the backup and it worked.

A little insight into some of these services and the roles they play. We are familiar with nsrexecd and PowerSnap. We also had to restart the RMAgentPS service, which executes operations for the Replication Manager client. In addition, we had to stop the Replication Manager Exchange Interface service, which executes Exchange commands for Replication Manager. In the end, there were some stale shadows and we needed to restart the required services in the correct order. And of course, remember to make sure the Exchange VSS writer is present.
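
Eyeballing the vssadmin list writers output gets old quickly when you are doing it on several nodes. Below is a minimal sketch that parses that output and reports whether an Exchange writer is present and which writers are not in a Stable, no-error state. The sample text is made up for illustration (it is not output from the servers in this post), and the function names are my own.

```python
import re


def parse_vss_writers(text):
    """Return a list of dicts (name, state, last_error), one per writer."""
    writers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Writer name: '(.+)'", line)
        if m:
            current = {"name": m.group(1), "state": None, "last_error": None}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            # "State: [1] Stable" -> "Stable"
            current["state"] = line.split("]", 1)[-1].strip()
        elif current is not None and line.startswith("Last error:"):
            current["last_error"] = line.split(":", 1)[1].strip()
    return writers


def unhealthy(writers):
    """Writers that are not Stable or that report a last error."""
    return [w for w in writers
            if w["state"] != "Stable" or w["last_error"] != "No error"]


def exchange_writer_present(writers):
    return any("exchange" in w["name"].lower() for w in writers)


# Illustrative sample only; paste real `vssadmin list writers` output here.
SAMPLE = """\
Writer name: 'Microsoft Exchange Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'System Writer'
   State: [8] Failed
   Last error: Retryable error
"""

writers = parse_vss_writers(SAMPLE)
print(exchange_writer_present(writers))        # True
print([w["name"] for w in unhealthy(writers)])  # ['System Writer']
```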






The most recent checkpoint for the VBA appliance is outdated

Recently found one of our VBA appliances was producing the following error in vSphere

error> The most recent checkpoint for the VBA appliance is outdated

cp.20150714161043 Tue Jul 14 10:10:43 2015 valid hfs --- nodes 1/1 stripes 344
cp.20150714172020 Tue Jul 14 11:20:20 2015 valid hfs --- nodes 1/1 stripes 344

Opened a ticket with EMC and the following procedure was provided.

dpnctl status
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: gsan status: up
dpnctl: INFO: MCS status: up.
dpnctl: INFO: Backup scheduler status: up.
dpnctl: INFO: axionfs status: up.
dpnctl: INFO: Maintenance windows scheduler status: enabled.
dpnctl: INFO: Unattended startup status: enabled.

A status.dpn gives us a little more information on some of the maintenance services

Tue Jul 21 09:35:44 MDT 2015 [VBA01] Tue Jul 21 15:35:44 2015 UTC (Initialized Tue Nov 4 20:34:05 2014 UTC)
Node IP Address Version State Runlevel Srvr+Root+User Dis Suspend Load UsedMB Errlen %Full Percent Full and Stripe Status by Disk
0.0 7.0.62-10 ONLINE fullaccess mhpu+0hpu+0hpu 4 false 0.58 7586 15696646 6.1% 6%(onl:116) 6%(onl:116) 6%(onl:115)
Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable
System ID: [email protected]:50:56:88:1A:CC
All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)
System-Status: ok
Access-Status: full
Last checkpoint: cp.20150714172020 finished Tue Jul 14 11:20:40 2015 after 00m 20s (OK)
No GC yet
Last hfscheck: finished Tue Jul 14 11:28:07 2015 after 07m 19s >> checked 343 of 343 stripes (OK)
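
To put a number on "outdated": the checkpoint tag carries its own timestamp, and in the listing above it lines up with UTC (tag 17:20:20 against 11:20:20 MDT local time). The sketch below works out the age of the last checkpoint at the moment status.dpn was run. The 36-hour threshold is a made-up value for illustration, not the limit the VBA actually uses, and the function names are my own.

```python
from datetime import datetime, timedelta, timezone


def checkpoint_time(tag):
    """Parse a tag such as 'cp.20150714172020' as a UTC datetime."""
    stamp = tag.split(".", 1)[1]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def checkpoint_age(tag, now):
    return now - checkpoint_time(tag)


# Hypothetical threshold, for illustration only.
MAX_AGE = timedelta(hours=36)

# Timestamp from the status.dpn header above (15:35:44 UTC on Jul 21).
now = datetime(2015, 7, 21, 15, 35, 44, tzinfo=timezone.utc)
age = checkpoint_age("cp.20150714172020", now)
print(age)            # 6 days, 22:15:24
print(age > MAX_AGE)  # True
```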


Although the maintenance window was running, cp, gc and hfscheck were all suspended:


Maintenance windows scheduler capacity profile is active.
WARNING: cp is suspended permanently.
WARNING: gc is suspended permanently.
WARNING: hfscheck is suspended permanently.
The maintenance window is currently running.
Next backup window start time: Tue Jul 21 20:00:00 2015 MDT
Next maintenance window start time: Wed Jul 22 08:00:00 2015 MDT

The following commands permanently resume hfscheck, gc and cp.

avmaint sched resume hfscheck --permanent --ava
avmaint sched resume gc --permanent --ava
avmaint sched resume cp --permanent --ava
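
If you look after more than one appliance, the mapping from warning to fix is mechanical enough to script. This is a minimal sketch that reads scheduler status text like the output above and prints the matching resume command for each activity reported as suspended permanently. It only prints the commands; running them is left to you. The function name is my own.

```python
import re


def resume_commands(status_text):
    """Build one resume command per 'suspended permanently' warning."""
    commands = []
    for line in status_text.splitlines():
        m = re.match(r"WARNING: (\w+) is suspended permanently\.", line.strip())
        if m:
            commands.append(f"avmaint sched resume {m.group(1)} --permanent --ava")
    return commands


# Status text copied from the output shown earlier in this post.
STATUS = """\
Maintenance windows scheduler capacity profile is active.
WARNING: cp is suspended permanently.
WARNING: gc is suspended permanently.
WARNING: hfscheck is suspended permanently.
The maintenance window is currently running.
"""

for command in resume_commands(STATUS):
    print(command)
```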

You may want to create a checkpoint manually
dpnctl stop maint
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Suspending maintenance windows scheduler…

avmaint checkpoint --ava
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

dpnctl start maint
Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)
dpnctl: INFO: Resuming maintenance windows scheduler…
dpnctl: INFO: maintenance windows scheduler resumed.



Cannot access NetWorker VBA GUI

I had an issue recently where the VBA config and FLR GUIs were inaccessible. Stopping and starting Tomcat with the emwebapp script was easy enough, but it did not fix the problem. EMC provided the following process, which also regenerates the Tomcat certificate.


1. Stop emwebapp
emwebapp.sh --stop

2. Back up existing keystore
cp /root/.keystore /root/.keystore.sav

3.  List tomcat certificate – should see 1 certificate
/usr/java/latest/bin/keytool -list -keystore /root/.keystore -storepass changeit -alias tomcat

4. Delete tomcat certificate from keystore
/usr/java/latest/bin/keytool -delete -alias tomcat -keystore /root/.keystore -storepass changeit

5. List tomcat certificate again – should return empty
/usr/java/latest/bin/keytool -list -keystore /root/.keystore -storepass changeit -alias tomcat

6. Regenerate certificate using SHA256
/usr/java/latest/bin/keytool -genkeypair -v -alias tomcat -keyalg RSA -sigalg SHA256withRSA -keystore /root/.keystore -storepass changeit -keypass changeit -validity 3650 -dname "CN=localhost.localdom, OU=Avamar, O=EMC, L=Irvine, S=California, C=US"

7. List tomcat certificate again – should see 1 certificate
/usr/java/latest/bin/keytool -list -keystore /root/.keystore -storepass changeit -alias tomcat

8. Start emwebapp.sh
emwebapp.sh --start
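
One extra check I find reassuring after regenerating the certificate: adding -v to the keytool -list command prints a "Signature algorithm name" line, so you can confirm the new certificate really is SHA256. The sketch below pulls that field out of the verbose output. The sample text is illustrative rather than captured from the appliance, and the function name is my own.

```python
import re


def signature_algorithm(keytool_verbose_output):
    """Return the signature algorithm named in `keytool -list -v` output."""
    m = re.search(r"Signature algorithm name:\s*(\S+)", keytool_verbose_output)
    return m.group(1) if m else None


# Illustrative sample only; paste the real `keytool -list -v` output here.
SAMPLE = """\
Alias name: tomcat
Entry type: PrivateKeyEntry
Owner: CN=localhost.localdom, OU=Avamar, O=EMC, L=Irvine, ST=California, C=US
Signature algorithm name: SHA256withRSA
"""

algorithm = signature_algorithm(SAMPLE)
print(algorithm)                               # SHA256withRSA
print(algorithm.upper().startswith("SHA256"))  # True
```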



NetWorker Build 774 has been released.

NetWorker Build 774 has been released.

It can be downloaded from ftp://ftp.legato.com/pub/NetWorker/Cumulative_Hotfixes/8.2/
This package contains the following cumulative fixes:

ID Details
(NW161917) ESC	NetWorker Escalation 22509: NSM DB2 PIT restore doesn't restore connecting directories' ACL on AIX
(NW161624) ESC	NetWorker Escalation 22307: Device discovery raises alerts if udev-named library handle already configured: 14249:dvdetect: 'skipped as requested'
ESC	NetWorker Escalation 22567: Error counts not correctly handled with nsrsnmd & cdi changes
(NW159916) ESC 	NetWorker Escalation 21778: Disable "label" operation in AMM functionality: DataDomain devices unmount with "RPC severe Lost connection to media database"



Troubleshooting Cloning – Part 2 – Clone Wars

The clone wars continue. As you may recall from Part 1, my NetWorker optimized clones have been hanging with this misleading error:

Waiting for 1 Writable volume on backup pool 'Device' disk(s) on nsr_server

To further complicate things, control over the clone from the console can be limited. I had been using the jobkill utility; Preston has a great write-up on it. The issue is that after killing the clone, the NMC console still shows it as running. Attempts to restart the clone via NMC resulted in the following error:

1 1429638078 event task manager Task aborted: task 'clone.name Clone' is already running

So is jobkill really murdering all the required processes? Let’s take a look.

My new environment is Windows, so let’s visit our old friend the Task Manager. Here we want to look for nsrtask and nsrclone processes. First, nsrclone. I’ve had to obfuscate the output, but what I found were the expected running clones. The specific job I needed to restart was not listed.



The same cannot be said for the clone job’s associated nsrtask process. There was indeed a process still hanging around.


After killing it, the state in NMC changed from Running to Interrupted. I could then restart the job.
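
If you would rather not click through Task Manager, tasklist can produce the same list in CSV form (tasklist /FO CSV /NH). Below is a minimal sketch that picks the nsrtask and nsrclone entries out of that output. The sample rows and PIDs are invented for illustration, and the script does not tell you which nsrtask belongs to which clone job; you still have to work that out before killing anything. The function name is my own.

```python
import csv
import io


def find_processes(tasklist_csv, names):
    """Return (image name, PID) pairs for the requested image names."""
    wanted = {name.lower() for name in names}
    hits = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) >= 2 and row[0].lower() in wanted:
            hits.append((row[0], int(row[1])))
    return hits


# Illustrative sample only; real input would come from `tasklist /FO CSV /NH`.
SAMPLE = '''"nsrd.exe","2140","Services","0","85,212 K"
"nsrclone.exe","6612","Services","0","24,880 K"
"nsrtask.exe","7024","Services","0","12,304 K"
"nsrtask.exe","7388","Services","0","12,116 K"
'''

for image, pid in find_processes(SAMPLE, ["nsrtask.exe", "nsrclone.exe"]):
    print(image, pid)
```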




All this just to get my clone going again. Still, this is progress: I had previously been restarting NetWorker, which interrupted the service and the other running clones.
