cbbackupmgr cloud
Storing cbbackupmgr archives directly in the cloud
DESCRIPTION
This document gives a basic understanding of how to use native cloud integration, an Enterprise Edition feature of cbbackupmgr.
TUTORIAL
CREDENTIALS
Backing up directly to an external cloud provider requires permission to access the given data store. Each provider has its own method of authentication; see the sections below for how to authenticate with your chosen cloud provider.
AWS
When using AWS S3, there are several ways to supply credentials. Below is a list of the supported techniques:
- When running in EC2, credentials may be obtained via the instance metadata service by setting/supplying:
  - CB_AWS_ENABLE_EC2_METADATA=true
  - --obj-auth-by-instance-metadata
- Providing a set of environment variables including:
  - CB_OBJSTORE_REGION
  - CB_OBJSTORE_ACCESS_KEY_ID
  - CB_OBJSTORE_SECRET_ACCESS_KEY
- Loading credentials from the shared config files located at:
  - $HOME/.aws/config
  - $HOME/.aws/credentials
- Providing static config/credentials using the cli flags:
  - --obj-access-key-id
  - --obj-region
  - --obj-secret-access-key
Setting up cbbackupmgr to interact with AWS should be a very similar process to setting up the aws-cli. The steps to configure the aws-cli can be found at
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration.
For more information about authenticating using the EC2 instance metadata service, please see the Cloud Provider Specific Features section.
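As a small illustration of the environment variable technique listed above, the sketch below checks that the three variables are set before cbbackupmgr is invoked. The function name check_objstore_env is hypothetical and is not part of cbbackupmgr; the values in the usage comment are placeholders.

```shell
# Hypothetical helper (not part of cbbackupmgr): returns 0 only when every
# environment variable named in the AWS list above has a non-empty value.
check_objstore_env() {
    missing=0
    for name in CB_OBJSTORE_REGION CB_OBJSTORE_ACCESS_KEY_ID CB_OBJSTORE_SECRET_ACCESS_KEY; do
        # Indirect lookup that works in POSIX sh: expand the variable called $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (placeholder values):
#   export CB_OBJSTORE_REGION=us-east-1
#   export CB_OBJSTORE_ACCESS_KEY_ID=my-access-key-id
#   export CB_OBJSTORE_SECRET_ACCESS_KEY=my-secret-access-key
#   check_objstore_env && cbbackupmgr config -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging
```

Failing early like this only reports missing variables; it does not validate that the credentials themselves are accepted by AWS.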
Azure (Developer Preview)
As with AWS, there are multiple accepted methods for authorizing against Azure. Below is a list of the supported techniques:
- Providing static credentials using the cli flags:
  - --obj-access-key-id
  - --obj-secret-access-key
- Providing a set of environment variables including:
  - AZURE_STORAGE_ACCOUNT
  - AZURE_STORAGE_KEY
- Providing a connection string via an environment variable:
  - AZURE_STORAGE_CONNECTION_STRING
- Obtaining a service principal token through the environment, using:
  - Client credentials
    - AZURE_CLIENT_ID
    - AZURE_TENANT_ID
    - AZURE_CLIENT_SECRET
  - A client certificate
    - AZURE_CLIENT_ID
    - AZURE_TENANT_ID
    - AZURE_CERTIFICATE_PATH
  - A username/password
    - AZURE_CLIENT_ID
    - AZURE_TENANT_ID
    - AZURE_USERNAME
    - AZURE_PASSWORD
  - A managed identity
When using Azure Active Directory to authenticate, cbbackupmgr may not be able to determine the storage account name and, as a consequence, the URL that it should use to connect to the Azure Blob Storage service. In this case it will return an error such as the one below.
failed to determine account name, checked '--obj-access-key-id', 'AZURE_STORAGE_ACCOUNT' and 'AZURE_STORAGE_CONNECTION_STRING'
The account name will need to be supplied by using one of the following options:
- The --obj-access-key-id flag
- The AZURE_STORAGE_ACCOUNT environment variable
- The AZURE_STORAGE_CONNECTION_STRING environment variable (by specifically setting the AccountName key)
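To check in advance whether the environment supplies an account name, something like the sketch below could be used. The function name azure_account_name is hypothetical, and the order in which it consults the two variables is an assumption made for illustration; it does not claim to mirror the lookup order used inside cbbackupmgr.

```shell
# Hypothetical helper: prints the storage account name that the environment
# would supply, or returns 1 if neither variable provides one.
# Assumption: AZURE_STORAGE_ACCOUNT is consulted before the connection string.
azure_account_name() {
    if [ -n "${AZURE_STORAGE_ACCOUNT:-}" ]; then
        printf '%s\n' "$AZURE_STORAGE_ACCOUNT"
        return 0
    fi
    # Connection strings are ';'-separated key=value pairs; pick out AccountName.
    name=$(printf '%s\n' "${AZURE_STORAGE_CONNECTION_STRING:-}" | tr ';' '\n' | sed -n 's/^AccountName=//p')
    [ -n "$name" ] || return 1
    printf '%s\n' "$name"
}

# Example usage:
#   azure_account_name || echo "account name must be passed via --obj-access-key-id"
```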
GCP
As with AWS/Azure, GCP supports multiple methods of authenticating. Below is a list of the supported techniques:
- Providing static credentials using the cli flags:
  - --obj-access-key-id
  - --obj-secret-access-key
  - --obj-refresh-token
- Providing an SDK style auth file
  - Using a static flag
    - --obj-auth-file
  - Using an environment variable
    - GOOGLE_APPLICATION_CREDENTIALS
- Using a "well known" credentials file
  - $HOME/.config/gcloud/application_default_credentials.json
- Using the first generation app engine runtime
- Using the metadata service when running in Google Compute Engine
THE STAGING DIRECTORY
One of the most important concepts behind how backup to object store works is the staging directory. The staging directory is a location on disk where temporary data is stored during the execution of a sub-command. For a backup/restore this will be DCP metadata and storage indexes.
When creating an archive to store in a cloud provider, you are required to provide a location for the obj-staging-dir. This is a local location where archive metadata will be stored. During a backup, files will be stored here before they are uploaded to the cloud. Note that cbbackupmgr doesn’t store any document values in the staging directory; they are streamed directly to the cloud.
Each cloud archive must have its own staging directory; staging directories can’t be shared. cbbackupmgr will detect cases where the staging directory is being reused across archives.
Modifying the cloud archive with anything other than cbbackupmgr (for example the web-ui or cli tools) is not supported whilst using the same staging directory. If a cloud archive has been modified, the staging directory should be removed and recreated before using cbbackupmgr to interact with the archive again.
The staging directory is only used during operations such as backup/restore and can be safely deleted once an operation completes, because all the files will have been uploaded to the cloud.
The staging directory can become quite large during a normal backup depending on the number of documents being backed up and the size of their keys. See the Disk requirements section for more information about how to provision the staging directory.
CONFIGURING CLOUD BACKUPS
The first step is to create a backup archive in object store. This can be done with the config command and only needs to be done once. All other commands will automatically download the archive metadata into the directory provided via the obj-staging-dir argument prior to performing any operations; this is done regardless of whether the archive exists locally because we must ensure the archive in the staging directory is up to date.
Below is an example of how you would configure an archive in AWS S3.
$ cbbackupmgr config -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging
Backup repository `repo` created successfully in archive `s3://bucket/archive`
Assuming your credentials are correct, the archive should now reside directly in the provided S3 bucket. To verify, you could use the aws-cli to list the contents of the bucket; they should be identical to those which would exist for a local backup.
Although it’s possible to have cbbackupmgr coexist in the same S3 bucket as other general purpose storage, we recommend using a bucket which cbbackupmgr has exclusive access to.
BACKING UP A CLUSTER
Once an archive is configured, performing a backup works in a similar fashion to performing a local backup. It’s important to note that when backing up directly to S3 a certain amount of disk space will be used to stage local metadata files and storage indexes. See the Staging Directory section for more information. Below is an example of doing a backup and storing directly in AWS S3.
$ cbbackupmgr backup -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
  -c http://10.101.101.112:8091 -u Administrator -p password
Copied all data in 1m2.525397237s (Avg. 51.61KB/Sec)       7303 items / 3.12MB
beer-sample             [===========================================] 100.00%
Backup successfully completed
Backed up bucket "beer-sample" succeeded
Mutations backedup; 7303, Mutations failed to backup: 0
Deletions backedup: 0, Deletions failed to backup: 0
Performing incremental backups works exactly as it would if you were performing an incremental locally; simply rerun the command above and an incremental backup will be created.
When choosing the number of threads to use, it’s important to consider that when backing up to the cloud, cbbackupmgr buffers data in memory before uploading it. This means that choosing an extremely large number of threads when using a poor internet connection could lead to a scenario where your machine runs out of memory.
To learn more about backup options see cbbackupmgr-backup.
RESTORING A BACKUP / MULTIPLE INCREMENTAL BACKUPS
Once you have created a backup, restoring it works in a similar way to restoring a local backup. It’s worth noting that restoring a backup to a cluster that’s hosted outside of AWS is likely to be significantly more expensive than performing a backup (depending on the size of your dataset). See Costing for more information.
Below is an example of restoring a backup that is stored in AWS S3.
$ cbbackupmgr restore -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
  -c http://10.101.101.112:8091 -u Administrator -p password
(1/1) Restoring backup 2020-03-19T15_35_00.467487218Z
Copied all data in 28.048019272s (Avg. 103.21KB/Sec)       7303 items / 2.82MB
beer-sample             [==========================================] 100.00%
Restore bucket 'beer-sample' succeeded
Mutations restored: 7303, Mutations failed to restore: 0
Deletions restored: 0, Deletions failed to restore: 0
Skipped due to purge number or conflict resolution: Mutations: 0 Deletions: 0
Restore completed successfully
DISK REQUIREMENTS
As discussed in Staging Directory, you will be required to provision enough disk space to store all the keys for your dataset on disk during a backup. We would recommend doing a simple calculation to determine the approximate size of the staging directory.
Using the formula below, you can calculate the approximate size of the staging directory in gibibytes (GiB):
(${NUMBER_OF_ITEMS} * (${AVERAGE_KEY_SIZE_IN_BYTES} + 30)) / (1024 ^ 3)
Note that this is a rough estimate which doesn’t account for factors such as fragmentation; however, it should be a good starting point. Using this formula and given a dataset with 50 million keys with an average size of 75 bytes, we’d expect to need to provision about 5GiB of disk space (50,000,000 * 105 bytes is roughly 4.9GiB).
When approximating the size of the staging directory, we don’t need to account for the size of the document values because they are never stored on disk; they are uploaded directly to object store.
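As a worked sketch of the formula above, the following shell function evaluates it with awk. The function name staging_size_gib and its argument order are illustrative only and are not part of cbbackupmgr; the 30 byte per-item overhead is taken directly from the formula.

```shell
# Hypothetical helper: approximate staging directory size in GiB.
#   $1 = number of items, $2 = average key size in bytes
staging_size_gib() {
    awk -v items="$1" -v key_size="$2" \
        'BEGIN { printf "%.2f\n", (items * (key_size + 30)) / (1024 ^ 3) }'
}

# The 50 million key, 75 byte example from the text:
staging_size_gib 50000000 75   # prints 4.89
```

Because the estimate ignores fragmentation, it is reasonable to round the result up and leave some headroom when provisioning the disk.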
COSTING
Before using any cbbackupmgr sub-commands, it’s worth ensuring that you understand the costing related to using your chosen cloud provider; the usual pattern is that it’s cheap to upload/store data, but (comparatively) expensive to access/download it (to the wider internet). We recommend using your cloud provider’s pricing calculator to estimate these costs.
BACKUP
Backing up data to AWS S3, whether from inside or outside AWS, is cheap; at the time of writing, it doesn’t cost anything to transfer data into S3 (you only pay for the storage/requests).
RESTORE
Restoring data is another matter: AWS S3 charges users for pulling data from AWS onto the internet. This means that restoring large datasets can become quite costly if your cluster is not in AWS. Before performing a restore, use info (as described below in Interrogating backups) to determine the size of your backup. You can then use this to calculate how much it will cost to restore your backup.
At the time of writing, restoring a backup to a cluster hosted inside AWS will not be significantly costly since AWS does not charge for bandwidth inside AWS. Whether your cluster is hosted inside or outside AWS, it’s worth calculating the costs before performing a restore.
MERGING
One of the main reasons for merging incremental backups is to save disk space. In AWS S3 space is cheap and bandwidth (to the broader internet) is expensive. This means that there isn’t a sound financial reason for merging cloud backups. For this reason merging incremental backups stored in the cloud is not supported.
Restoring will continue to support applying incremental backups in chronological order in the same fashion that it would when merging, i.e. you will end up with the same data in your Couchbase cluster.
INTERROGATING BACKUPS
Several tools have been made available for use with archives stored directly in the cloud. Currently these are:
- examine
- info
EXAMINE
Examine can be used to query whether a document with the given key exists in a given collection (possibly across multiple backups). The examine tool supports directly querying the data in S3.
$ cbbackupmgr examine -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
  --collection-string beer-sample --key '21st_amendment_brewery_cafe'
Key: 21st_amendment_brewery_cafe
SeqNo: 5
Backup: 2020-01-08T17_21_20.232087665Z
Deleted: false
Size: 27B (key), 29B (meta), 666B (value)
Meta: {"flags":33554432,"cas":1578502228728479744,"revseqno":1,"datatype":1}
Value: {"address":["563 Second Street"],"city":"San Francisco","code":"94107","country":"United States","description":"The 21st Amendment Brewery offers a variety of award winning house made brews and American grilled cuisine in a comfortable loft like setting. Join us before and after Giants baseball games in our outdoor beer garden. A great location for functions and parties in our semi-private Brewers Loft. See you soon at the 21A!","geo":{"accuracy":"ROOFTOP","lat":37.7825,"lon":-122.393},"name":"21st Amendment Brewery Cafe","phone":"1-415-369-0900","state":"California","type":"brewery","updated":"2010-10-24 13:54:07","website":"http://www.21st-amendment.com/"}
To learn more about examine options see cbbackupmgr-examine.
INFO
The info command can be used to query a broader archive to understand its structure and to gain an understanding of what data is backed up and where.
$ cbbackupmgr info -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging
Name  | Size  | # Backups  |
rift  | 796B  | 1          |

+  Backup                          | Size  | Type  | Source                  | Range  | Events  | Aliases  | Complete  |
+  2020-01-08T17_21_20.232087665Z  | 796B  | FULL  | http://172.20.1.1:8091  | N/A    | 0       | 0        | true      |

-    Bucket       | Size  | Items  | Mutations  | Tombstones  | Views  | FTS  | Indexes  | CBAS  |
-    beer-sample  | 790B  | 7303   | 7303       | 0           | 1      | 0    | 0        | 0     |
To learn more about info options see cbbackupmgr-info.
ARCHIVE LOCKING
It’s important that only one instance of cbbackupmgr has access to the archive at a time. This is enforced using a lockfile, so most of the time you shouldn’t need to worry about it. However, there are some situations where cbbackupmgr may fail to ensure exclusive access to the archive:
- Another process (on another machine, or the local machine) already has an active lockfile.
- A stale lockfile exists which belongs to a system with a different hostname.
In cases where cbbackupmgr fails to lock an archive, a few simple steps can be taken:
- Manually ensure that nobody else is using the archive
- If you are certain nobody else is using the archive, locate the lockfile in S3 (it has the format lock-${UUID}.lk and is stored in the top-level of the archive).
- Remove the lockfile and try to continue using the archive with your own instance of cbbackupmgr.
It’s extremely important that you only manually remove the lockfile if you are certain that there isn’t another instance of cbbackupmgr using the archive. Having two instances of cbbackupmgr running against the same archive could cause data loss through overlapping key prefixes.
Below is an example of an archive which contains a lockfile from a system that crashed where the lockfile was never cleaned up.
$ aws s3 ls s3://backups --recursive
2020-04-27 09:34:10        120 archive/.backup
2020-04-27 09:34:23         34 archive/lock-14eb923b-60a7-480a-849e-8af48e47f9ea.lk
2020-04-27 09:34:10        520 archive/logs/backup-0.log
2020-04-27 09:34:10        651 archive/repo/backup-meta.json
If we attempt to use cbbackupmgr to create a backup, we should see a message similar to the one below:
$ cbbackupmgr backup -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
  -c 172.20.1.1:8091 -u admin -p password
Error backing up cluster: the process '{PID}' running on '{HOSTNAME}' already holds the lock
In this case, cbbackupmgr will not remove the lock automatically since it cannot safely determine whether the other process is active or not. We can use the information about which machine the other instance of cbbackupmgr is running on to check whether it is active. If this machine has crashed and that instance of cbbackupmgr is no longer using the archive, we can manually remove the lockfile.
$ aws s3 rm s3://backups/archive/lock-14eb923b-60a7-480a-849e-8af48e47f9ea.lk
delete: s3://backups/archive/lock-14eb923b-60a7-480a-849e-8af48e47f9ea.lk
If we attempt to perform the backup once again we will see that it continues successfully; in the case that the other machine failed during a backup you may be asked to purge the previous backup using the --purge flag before you can create a new backup:
$ cbbackupmgr backup -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
  -c 172.20.1.1:8091 -u admin -p password
Copied all data in 1m2.525397237s (Avg. 51.61KB/Sec)       7303 items / 3.12MB
beer-sample             [===========================================] 100.00%
Backup successfully completed
Backed up bucket "beer-sample" succeeded
Mutations backedup; 7303, Mutations failed to backup: 0
Deletions backedup: 0, Deletions failed to backup: 0
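The step of locating a lockfile, described earlier in this section, can be sketched as a filter over the output of aws s3 ls. The function name find_lockfiles is hypothetical; it only matches the lock-${UUID}.lk naming pattern and never deletes anything, so the decision to remove a lockfile stays a manual one.

```shell
# Hypothetical helper: reads `aws s3 ls --recursive` output on stdin and prints
# the keys whose final path component matches the lock-${UUID}.lk format.
# Assumes object keys contain no spaces, so the key is the last field.
find_lockfiles() {
    awk '{ key = $NF } key ~ "(^|/)lock-[^/]+[.]lk$" { print key }'
}

# Example usage:
#   aws s3 ls s3://backups --recursive | find_lockfiles
```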
COMPATIBLE OBJECT STORES
cbbackupmgr is tested against the cloud providers that are supported; however, in some cases it will also work with compatible object stores, e.g. Localstack/Scality. It’s important to note that your experience may be different when interacting with compatible object stores because some have slightly different behaviors which cbbackupmgr may not explicitly handle.
AWS
It should be possible to use cbbackupmgr with S3 compatible object stores; however, there are some things that need to be taken into consideration. First and foremost are the features that cbbackupmgr leverages. Below is a list of S3 API features that cbbackupmgr uses but that not all compatible object stores support:
It’s important that you check whether these features are implemented on your S3 compatible object store because without them cbbackupmgr will not work as expected.
AWS also has a slightly newer virtual addressing style, the documentation for which can be found at
https://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html. Not all S3 compatible object stores support this style of addressing. The errors that are returned by the SDK (and therefore cbbackupmgr) in these cases are not always clear. Before raising a support ticket about cbbackupmgr not working with an S3 compatible object store, you should first try using the --s3-force-path-style argument. This will force cbbackupmgr to use the old path style addressing. From our testing with S3 compatible object stores, it’s very common for this flag to be required.
Azure/GCP
As with AWS S3 support, Azure/GCP should work with compatible object storage solutions. You also shouldn’t need to provide any specific flags when using compatible storage solutions.
Note, however, that Azure/GCP will only work when the underlying compatible object storage solution implements all the required features, and that this behavior is not exhaustively validated by Couchbase.
CLOUD PROVIDER SPECIFIC FEATURES
As stated above in the 'Compatible Object Stores' section, it’s possible to use cbbackupmgr with other providers which expose an S3 compatible API. It’s important to note that some features may only be accessible to those using AWS.
AWS
When running cbbackupmgr in an AWS instance, it may use the EC2 instance metadata to get credentials. This is disabled by default but may be enabled by either supplying the --obj-auth-by-instance-metadata flag or setting the CB_AWS_ENABLE_EC2_METADATA environment variable to true.
For example, if we wanted to use cbbackupmgr with the EC2 instance metadata we would:
1. Create a role with a policy which allows S3 data manipulation (e.g. S3 Full Admin)
2. Attach that role to the instance, then do one of the following:
   a. Run export CB_AWS_ENABLE_EC2_METADATA=true to enable fetching EC2 instance metadata
   b. Add the --obj-auth-by-instance-metadata flag to your cbbackupmgr command.
3. Run cbbackupmgr as described elsewhere in this tutorial
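The flag-based variant of the steps above could be wrapped in a small shell function so that the flag is not forgotten on an EC2 instance. The wrapper name cbbackupmgr_ec2 is hypothetical and not part of the cbbackupmgr suite; it assumes the role has already been attached to the instance.

```shell
# Hypothetical wrapper: runs a cbbackupmgr sub-command with EC2 instance
# metadata authentication enabled via the --obj-auth-by-instance-metadata flag.
cbbackupmgr_ec2() {
    if [ "$#" -lt 1 ]; then
        echo "usage: cbbackupmgr_ec2 SUBCOMMAND [ARGS...]" >&2
        return 2
    fi
    subcommand=$1
    shift
    # Insert the flag directly after the sub-command and pass everything else through.
    cbbackupmgr "$subcommand" --obj-auth-by-instance-metadata "$@"
}

# Example usage:
#   cbbackupmgr_ec2 config -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging
```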
Azure (Developer Preview)
When running in an Azure VM, cbbackupmgr will attempt to fetch a service principal token from the environment which will be used to authenticate against blob storage. In this case, the correct RBAC permissions should be provided to allow cbbackupmgr access to the chosen storage account. As opposed to AWS, this behavior is enabled by default.
GCP
When running in Google Compute, cbbackupmgr will use the instance metadata server to fetch credentials which may be used to authenticate against Google Storage. As with AWS/Azure, the correct RBAC permissions should be provided to allow cbbackupmgr access to the given bucket. As opposed to AWS, this behavior is enabled by default.
RBAC
It’s quite common to run cbbackupmgr with an account with limited permissions; this section covers any cloud provider specific permissions which are required. Please note that any permissions listed in the following sections are subject to change between releases.
AWS
The following is a list of the actions required by cbbackupmgr when interacting with a remote archive in AWS.
- AbortMultipartUpload
- CompleteMultipartUpload
- CreateMultipartUpload
- DeleteObject
- DeleteObjects
- GetObject
- HeadObject
- ListObjectsV2
- ListObjects
- ListParts
- PutObject
CBBACKUPMGR
Part of the cbbackupmgr suite