Following CEPH125R
- CEPH
- Overview
- Architecture
- Commands
- Ansible
- Replica pools
- Erasure pools
- Cephx
- RBD
- RBD snapshots
- RBD mirror for DR
- RBD export/import
- RGW S3 API
- RGW OpenStack Swift
- Multisite RGW
- CephFS
- CRUSH SSD pool configuration
- CRUSH conf on location
- Operations
Red Hat Ceph Storage focuses on providing a unified storage solution for:
- block-based
- object-based
- file-based
Ceph is designed to achieve the following goals:
- Be scalable for every component
- Provide no single point of failure
- Be software-based (not an appliance) and open source (no vendor lock-in)
- Run on readily available hardware
- Be self-managed wherever possible, minimizing user intervention
RADOS (Reliable Autonomous Distributed Object Store) is the object storage back end.
LIBRADOS: A library to directly access RADOS (C, C++, Java, Python, Ruby)
The following are the three available ways to interact with Ceph unless you use LIBRADOS directly:
- RBD (RADOS Block Device): Block storage; a good fit when you use, for example, VMware to create virtual machine disks
- RGW (RADOS Gateway): Object storage, compatible with the S3 and OpenStack Swift APIs
- CephFS (Ceph File System): File-system based storage for POSIX-based systems. It is supported, but the snapshot function is still in technology preview.
RADOS, the Ceph storage back end, is based on the following daemons, which can be scaled out to meet the requirements of the architecture being deployed:
Monitors (MONs), which maintain maps of the cluster state and are used to help the other daemons coordinate with each other.
Object Storage Devices (OSDs), which store data and handle data replication, recovery and rebalancing.
Managers (MGRs), which keep track of runtime metrics and expose cluster information through a web browser-based dashboard and REST API.
Metadata Servers (MDSs), which store metadata used by CephFS (but not object storage or block storage) to allow efficient POSIX command execution by clients.
ceph -s
Shows the cluster status, including the disks used by the OSDs on your nodes
ceph osd tree
Only checks the OSDs, but gives the same info as ceph -s
ceph osd stat
ceph osd df
ceph osd lspools
or
ceph osd pool ls detail
ceph osd pool set-quota myfirstpool max_objects 1000
To remove the quota, set max_objects to 0, as shown below.
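For example, clearing the quota on the pool created further down (a minimal sketch reusing the myfirstpool name):
ceph osd pool set-quota myfirstpool max_objects 0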
ceph osd pool mksnap pool-name snap-name
ceph osd pool rmsnap pool-name snap-name
rados -p pool-name -s snap-name get object-name file
rados -p pool-name rollback object-name snap-name
ceph osd pool set mypool size 3
ceph osd pool ls detail
ceph daemon osd.0 config show
ceph daemon mds.servera config get mds_data
If you have old disks that are "dirty" from old partitions, use the following to wipe everything from them:
ceph-disk zap /dev/vdb
Required RPM package to get the Ansible playbooks:
sudo yum install -y ceph-ansible
Ansible path:
/usr/share/ceph-ansible
NOTE This command will use /etc/ansible/hosts
cd /usr/share/ceph-ansible
ansible-playbook --limit=clients site.yml
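For reference, a minimal sketch of what the /etc/ansible/hosts inventory could look like for ceph-ansible (the group names are ceph-ansible's standard ones; the host names are assumptions based on this classroom setup):
[mons]
serverc
[mgrs]
serverc
[osds]
serverc
serverd
servere
[clients]
servera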
If you get issues with, for example, the keyring not being copied to your client, you can copy it over to the client manually (probably not a good idea). When you run ceph -s,
the real command that is executed is:
ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health
https://docs.ceph.com/docs/mimic/rados/operations/user-management/
Before you can store anything on your brand new Ceph installation, you need to create a pool. A pool spans multiple OSDs, and Ceph, with the help of CRUSH (Controlled Replication Under Scalable Hashing), distributes the objects across the different OSDs.
The number of placement groups in a pool has a major impact on performance. If you configure too few placement groups in a pool, too much data will need to be stored in each PG and Ceph will not perform well. If you configure too many placement groups in a pool, the OSDs will require a large amount of RAM and CPU time and Ceph will not perform well. Typically, a pool should be configured to contain 100 - 200 placement groups per OSD.
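As a rough sketch of the commonly cited rule of thumb (the Red Hat PG calculator linked further down gives more precise numbers):
# pg_num for a pool ≈ (number of OSDs * 100) / replica count, rounded to a power of two
# example: 9 OSDs and size=3  ->  9 * 100 / 3 = 300  ->  use 256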
ceph osd pool create pool-name pg-num [pgp-num] [replicated] [crush-ruleset-name] [expected-num-objects]
Where:
- pool-name is the name of the new pool.
- pg-num is the total number of Placement Groups (PGs) for this pool.
- pgp-num is the effective number of placement groups for this pool. Normally, this should be equal to the total number of placement groups.
- replicated specifies that this is a replicated pool, and is normally the default if not included in the command.
- crush-ruleset-name is the name of the CRUSH rule set you want to use for this pool. The osd_pool_default_crush_replicated_ruleset configuration parameter sets the default value.
- expected-num-objects is the expected number of objects in the pool. If you know this number in advance, Ceph can prepare a folder structure on the OSD's XFS file system at pool creation time. Otherwise, Ceph reorganizes this directory structure at runtime as the number of objects increases. This reorganization has a latency impact.
Example:
ceph osd pool create myfirstpool 50 50
We must assign an application to the pool so Ceph knows what it will be used for:
- cephfs: Ceph File System
- rbd: Ceph Block Device
- rgw: Ceph Object Gateway (S3)
ceph osd pool application enable pool-name app
Example:
ceph osd pool application enable myfirstpool rbd
ceph osd pool delete pool-name pool-name --yes-i-really-really-mean-it
NOTE
In Red Hat Ceph Storage 3, for extra protection, Ceph sets the mon_allow_pool_delete configuration parameter to false. With this directive, and even with the --yes-i-really-really-mean-it option, the ceph osd pool delete command does not result in the deletion of the pool.
You can set the mon_allow_pool_delete parameter to true and restart the mon services to allow pool deletion.
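A minimal sketch of both approaches (the config-file path is the default one; injectargs changes the value at runtime without a restart):
# persistent: on the MON hosts add to /etc/ceph/ceph.conf, then restart the monitors
# [mon]
# mon_allow_pool_delete = true
sudo systemctl restart ceph-mon.target
# runtime only:
ceph tell mon.* injectargs '--mon-allow-pool-delete=true'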
Store the file /etc/services as the srv object, in the system namespace, in the mytestpool pool.
rados -p mytestpool -N system put srv /etc/services
rados -p mytestpool -N system ls
rados -p mytestpool --all ls --format=json | python -m json.tool
https://access.redhat.com/labs/cephpgc/
Erasure coded pools save disk space, but the calculation of the coding chunks adds CPU and memory overhead, reducing performance. In addition, in Red Hat Ceph Storage 3, operations that require partial object writes are not supported for erasure coded pools.
Red Hat Ceph Storage currently only supports erasure coded pools accessed through the Ceph Object Gateway.
ceph osd pool create pool-name pg-num [pgp-num] erasure [erasure-code-profile] [crush-ruleset-name] [expected_num_objects]
Where:
- pool-name is the name for the new pool.
- pg-num is the total number of Placement Groups (PGs) for this pool.
- pgp-num is the effective number of placement groups for this pool. Normally, this should be equal to the total number of placement groups.
- erasure specifies that this is an erasure coded pool.
- erasure-code-profile is the name of the profile to use. You can create new profiles with the ceph osd erasure-code-profile set command as described below. A profile defines the k and m values and the erasure code plug-in to use. By default, Ceph uses the default profile.
- crush-ruleset-name is the name of the CRUSH rule set to use for this pool. If not set, Ceph uses the one defined in the erasure code profile.
- expected-num-objects is the expected number of objects in the pool. If you know this number in advance, Ceph can prepare a folder structure on the OSD's XFS file system when it creates the pool. Otherwise, Ceph reorganizes this directory structure at runtime as the number of objects increases. This reorganization has a latency impact.
Example:
ceph osd pool create mysecondpool 50 50 erasure
ceph osd erasure-code-profile ls
ceph osd erasure-code-profile get default
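If you want other k and m values than the default profile provides, you can create your own profile. A minimal sketch (the profile name, k/m values and failure domain below are just examples, not course requirements):
ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=host
ceph osd pool create myecpool 32 32 erasure myprofile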
Authentication time.
Cephx is enabled by default; it is Ceph's keyring-based authentication solution.
CEPH uses user accounts for several purposes:
- Internal communication between Ceph daemons
- Client applications accessing the Red Hat Ceph Storage cluster through the librados library
- Ceph administrators
ceph auth get-or-create client.formyapp2 mon 'allow r' osd 'allow rw pool=myapp'
ceph auth list
ceph auth get client.admin
ceph auth export client.operator1 > ~/operator1.export
ceph auth import -i ~/operator1.export
ceph auth caps client.application1 mon 'allow r' osd 'allow rw pool=myapp'
How to create a block device for Linux clients.
ceph osd pool create pool-rbd 32
Shortcut:
rbd pool init pool-rbd
Normal way: ceph osd pool application enable pool-rbd rbd
ceph auth get-or-create client.rbd.servera \
mon 'profile rbd' osd 'profile rbd' \
-o /etc/ceph/ceph.client.rbd.servera.keyring
This is so you are not forced to add --id etc. every time you run a command.
export CEPH_ARGS='--id=rbd.servera'
rbd create rbd/test --size=128M
rbd ls
rbd info rbd/test
sudo rbd --id rbd.servera map rbd/test
rbd showmapped
sudo mkfs.ext4 /dev/rbd0
sudo mkdir /mnt/rbd
sudo mount /dev/rbd0 /mnt/rbd
sudo chown ceph:ceph /mnt/rbd
rbd du rbd/test
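To undo the mapping when you are done (a sketch; the device name matches the rbd showmapped output above):
sudo umount /mnt/rbd
sudo rbd --id rbd.servera unmap /dev/rbd0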
Follow these commands if you want to perform a snapshot of a mounted RBD image.
sudo fsfreeze --freeze /mnt/source
rbd snap create rbd/clonetest@clonesnap
sudo fsfreeze --unfreeze /mnt/source
rbd snap protect rbd/clonetest@clonesnap
rbd snap ls rbd/clonetest
rbd clone rbd/clonetest@clonesnap rbd/realclone
rbd snap unprotect rbd/clonetest@clonesnap
rbd snap purge rbd/clonetest
RBD Mirroring supports two configurations:
- One-way mirroring or active-passive
  - The client only needs to reach one cluster; a mirror client acts as the middle man.
- Two-way mirroring or active-active
  - The client needs to reach both clusters.
Supported Mirroring Modes:
- Pool mode
- Image mode
scp over the following keyrings and config files for the different clusters:
serverc:
- /etc/ceph/prod.conf
- /etc/ceph/prod.client.admin.keyring
serverf:
- /etc/ceph/bup.conf
- /etc/ceph/bup.client.admin.keyring
ceph -s --cluster prod
ceph -s --cluster bup
rbd mirror pool enable rbd pool --cluster bup
rbd mirror pool enable rbd pool --cluster prod
rbd mirror pool peer add rbd client.admin@prod --cluster bup
rbd mirror pool info rbd --cluster bup
rbd mirror pool status rbd --cluster bup
rbd create rbd/prod1 --size=128M \
--image-feature=exclusive-lock,journaling --cluster prod
rbd mirror image status rbd/prod1 --cluster bup
rbd rm rbd/prod1 --cluster prod
You can export and import images.
rbd export rbd/test ~/export.dat
rbd import ~/export.dat rbd/test
cat ~/export.dat | ssh ceph@serverf rbd import - rbd/test
radosgw-admin user create --uid="operator" \
--display-name="S3 Operator" --email="operator@example.com" \
--access_key="12345" --secret="67890"
s3cmd --configure
After going through the guide, update the following values to match your server:
[student@servera ~]$ grep -r host_ ~/.s3cfg
host_base = servera
host_bucket = %(bucket)s.servera
s3cmd mb s3://my-bucket
s3cmd ls
s3cmd put --acl-public /tmp/10MB.bin s3://my-bucket/10MB.bin
wget -O /dev/null http://my-bucket.servera/10MB.bin
or
wget -O /dev/null http://servera/my-bucket/10MB.bin
If you upload a big file to S3 it might take a long time. To view the status of the upload and any potentially stuck multipart uploads:
s3cmd multipart s3://nano
radosgw-admin user create --uid=admin \
--display-name="Admin User" \
--caps="users=read,write;usage=read,write;buckets=read,write;zone=read,write" \
--access-key="abcde" --secret="qwerty"
radosgw-admin bucket list
radosgw-admin metadata get bucket:my-bucket
Swift uses a multi-tier design, built around tenants and users for auth,
while S3 uses a single-tier design. A single user account may have multiple access keys and secrets, which are used to provide different types of access to the same account.
Because of this, when creating a Swift user you create "subusers".
radosgw-admin subuser create --uid="operator" \
--subuser="operator:swift" --access="full" \
--secret="opswift"
sudo yum -y install python-swiftclient
swift -V 1.0 -A http://servera/auth/v1 -U operator:swift -K opswift stat
swift -V 1.0 -A http://servera/auth/v1 -U operator:swift -K opswift list
swift -V 1.0 -A http://servera/auth/v1 -U operator:swift -K opswift create my-container
swift -V 1.0 -A http://servera/auth/v1 -U operator:swift -K opswift upload my-container /tmp/swift.dat
radosgw-admin realm create --rgw-realm=ceph125 --default
radosgw-admin zonegroup delete --rgw-zonegroup=default
Create a new zonegroup and make it the default:
radosgw-admin zonegroup create --rgw-zonegroup=classroom \
--endpoints=http://servera:80 --master --default
export SYSTEM_ACCESS_KEY=replication
export SYSTEM_SECRET_KEY=secret
radosgw-admin zone create --rgw-zonegroup=classroom \
--rgw-zone=main --endpoints=http://servera:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
Create a system user named repl.user to access the zone pools. The keys for the repl.user user must match the keys configured for the zone.
radosgw-admin user create --uid="repl.user" \
--display-name="Replication User" \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY \
--system
radosgw-admin period update --commit
From serverf:
radosgw-admin realm pull --url=http://servera:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
From serverf:
radosgw-admin period pull --url=http://servera:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
From serverf:
radosgw-admin realm default --rgw-realm=ceph125
radosgw-admin zonegroup default --rgw-zonegroup=classroom
From serverf:
radosgw-admin zone create --rgw-zonegroup=classroom \
--rgw-zone=fallback --endpoints=http://serverf:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY \
--default
From serverf:
radosgw-admin period update --commit --rgw-zone=fallback
From serverf:
radosgw-admin sync status
From servera:
radosgw-admin user create --uid="s3user" \
--display-name="S3 User" --id rgw.servera \
--access-key="s3user" --secret-key="password"
radosgw-admin user list
From serverf:
radosgw-admin user list
You should now see the s3user on serverf as well.
CephFS isn't NFS.
You can mount CephFS two ways:
- kernel client (available starting with RHEL 7.3)
- FUSE client (available starting with RHEL 7.2)
As always, there are pros and cons with both. For example:
- The kernel client does not support quotas but may be faster.
- The FUSE client supports quotas and ACLs, but they have to be enabled explicitly.
As always, install it using Ansible.
ceph df
You should see among other things:
cephfs_data cephfs_metadata
ceph auth get-key client.admin | sudo tee /root/asecret
sudo mount -t ceph serverc:/ /mnt/cephfs \
-o name=admin,secretfile=/root/asecret
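For comparison, a minimal sketch of mounting with the FUSE client instead (package name and mount point are assumptions; the monitor address mirrors the kernel mount above):
sudo yum -y install ceph-fuse
sudo ceph-fuse -n client.admin -m serverc:6789 /mnt/cephfs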
Experimental feature, do not use it in prod. (I don't know why they teach unsupported features.)
sudo ceph mds set allow_new_snaps true --yes-i-really-mean-it
Just like with EMC and NetApp, users can restore their own files from snapshots. Just go into the .snap folder and you will find all the snapshots. You can even create a snapshot by creating a folder with any name you like.
mkdir /mnt/cephfs/.snap/mysnap
Let's configure a pool to only use SSD disks.
I think these are the different types of disks (device classes) found across your OSDs:
ceph osd crush class ls
output: [ "hdd", "ssd" ]
List your OSD disks and classes, together with their CRUSH weights.
ceph osd crush tree
ceph osd crush rule create-replicated onssd default host ssd
ceph osd crush rule ls
ceph osd pool create myfast 32 32 onssd
List all the pool IDs. In my case myfast has ID 18.
ceph osd lspools
List the PGs (Placement Groups) and see which disks each PG is spread over.
From ceph osd crush tree we know the IDs of the OSDs that have the ssd class. We should see only those OSD IDs used in the output of the command below:
ceph pg dump pgs_brief | grep -F 18.
Let's define how data should be spread over the cluster depending on where it is physically located.
We will create the following:
default-ceph125 (root bucket)
    rackblue (rack bucket)
        hostc (host bucket)
            osd.4
            osd.6
            osd.8
    rackgreen (rack bucket)
        hostd (host bucket)
            osd.0
            osd.1
            osd.2
    rackpink (rack bucket)
        hoste (host bucket)
            osd.3
            osd.5
            osd.7
ceph osd crush add-bucket default-ceph125 root
ceph osd crush add-bucket rackblue rack
ceph osd crush add-bucket hostc host
ceph osd crush add-bucket rackgreen rack
ceph osd crush add-bucket hostd host
ceph osd crush add-bucket rackpink rack
ceph osd crush add-bucket hoste host
ceph osd crush move rackblue root=default-ceph125
ceph osd crush move hostc rack=rackblue
ceph osd crush move rackgreen root=default-ceph125
ceph osd crush move hostd rack=rackgreen
ceph osd crush move rackpink root=default-ceph125
ceph osd crush move hoste rack=rackpink
ceph osd crush tree
In the CEPH125 course we are given a my-crush-location script. It more or less talks to Ceph and asks how we have configured the CRUSH map. We will use this script as a location hook.
/usr/local/bin/my-crush-location --cluster ceph --type osd --id 0
On all OSD hosts, edit /etc/ceph/ceph.conf:
[osd]
crush_location_hook = /usr/local/bin/my-crush-location
Don't forget to restart ceph-osd.target
sudo systemctl restart ceph-osd.target
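For reference, a minimal sketch of what such a location hook could look like (this is not the course-provided script; the host-to-rack mapping is an assumption based on the bucket layout above):
#!/bin/bash
# Called by Ceph as: my-crush-location --cluster <cluster> --type osd --id <id>
# It must print the daemon's CRUSH location on stdout.
case "$(hostname -s)" in
  hostc) rack=rackblue ;;
  hostd) rack=rackgreen ;;
  hoste) rack=rackpink ;;
  *)     rack=rackblue ;;  # fallback, assumption
esac
echo "root=default-ceph125 rack=${rack} host=$(hostname -s)"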
This can of course also be done in Ansible under ceph_conf_overrides.
You can get an easy overview of the cluster by looking at the OSD map.
ceph osd dump
The epoch value changes when an event has happened (like an OSD shutdown). You can also view the status of each OSD.
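For example, to look at a single OSD's entry in the map (plain grep over the dump output; OSD 3 is just an example):
ceph osd dump | grep '^osd.3 '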
To my understanding, primary affinity controls which OSD is preferred as the primary, i.e. the one reads are served from. If you are doing maintenance on a disk you can change this, for example.
The ID of the disk is 3 and is gathered from the ceph osd dump command.
When it is set to 0, that OSD will not be chosen as primary, so it will not serve reads.
NOTE I'm not 100% sure about this
ceph osd primary-affinity 3 1.0
ceph tell osd.* version
ceph osd map rbd file00
ceph osd perf
Ceph has a built-in benchmark tool:
rados -p rbd bench 300 write
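Some related invocations that may be useful (the read benchmarks need the objects left behind by a write run, hence --no-cleanup; the last command removes the benchmark objects again):
rados -p rbd bench 60 write --no-cleanup
rados -p rbd bench 60 seq
rados -p rbd bench 60 rand
rados -p rbd cleanup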
At the bottom of the strace output you will see which keyring you are using.
strace -e stat,open rados -p rbd ls