To facilitate the transition from Ceph-Ansible to Rook, SCS has almost finished developing a migration tool called Rookify. This tool simplifies and streamlines the migration process, making it easier for users to switch from a Ceph-Ansible deployment to Rook. The tool is now available as a first technical preview and is being tested.
Rookify is a Python package that uses a state machine based on the transitions library to migrate the various Ceph resources (such as mons, mgrs, osds, mds and so on) to Rook. Each of these resources has a corresponding module in Rookify, which can be executed independently or in combination with other modules.
It’s important to note that most modules have dependencies on other modules and will implicitly run them as needed. For example, the migrate_mons module needs the analyze_ceph module to run first (as indicated by the REQUIRES variable). This is necessary for Rookify to determine the current location of the mons and where they should be migrated to.
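As a rough illustration of this module concept, here is a minimal sketch in Python of what such a module could look like. The class and method names are hypothetical and do not reflect Rookify’s actual internal API; only the idea of a REQUIRES declaration plus a preflight and an execution step is taken from the description above:

# Hypothetical sketch of a Rookify-style migration module (not the real API).
class MigrateMonsModule:
    # Modules declare their dependencies; Rookify implicitly runs these first.
    REQUIRES = ["analyze_ceph"]

    def __init__(self, state: dict):
        # Shared state gathered by previously executed modules,
        # e.g. the mon locations collected by analyze_ceph.
        self.state = state

    def preflight(self) -> None:
        # Validate that everything needed is in place; raising here
        # aborts the run before any real change is made.
        if "mon_hosts" not in self.state:
            raise RuntimeError("analyze_ceph data missing, cannot migrate mons")

    def execute(self) -> None:
        # Perform the actual migration, one monitor at a time.
        for host in self.state["mon_hosts"]:
            print(f"migrating ceph-mon on {host}")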
Rookify can be configured by editing a comprehensive config.yaml file, such as the provided config.example.yaml. This configuration file specifies various configuration dependencies (like SSH keys, Kubernetes and Ceph configurations) and allows users to easily decide which modules should be run (see the migration_modules section below in config.yaml).
Rookify optionally supports using a pickle file, which is recommended (see the general section at the top of config.example.yaml). Pickle is a format for object serialization that saves the state of progress externally, i.e. which modules have been run and information about the target machine. This means that Rookify can ‘save’ its progress.
⚠️ Important: This also means that if the same Rookify installation is to be used to migrate more than one cluster, or if the cluster has changed significantly in the meantime, the pickle file must be deleted manually.
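As a rough sketch of the idea (this is not Rookify’s actual on-disk format, which is an implementation detail), persisting progress with pickle can look like this:

import os
import pickle

PICKLE_FILE = "data.pickle"

def load_progress() -> dict:
    # Resume from a previous run if a pickle file exists, otherwise start fresh.
    if os.path.exists(PICKLE_FILE):
        with open(PICKLE_FILE, "rb") as fh:
            return pickle.load(fh)
    return {"completed_modules": [], "machine_data": {}}

def save_progress(state: dict) -> None:
    # Persist the state after each module, so a later run can skip
    # modules that already finished successfully.
    with open(PICKLE_FILE, "wb") as fh:
        pickle.dump(state, fh)

state = load_progress()
if "analyze_ceph" not in state["completed_modules"]:
    state["completed_modules"].append("analyze_ceph")
    save_progress(state)

Deleting the file simply discards this saved state, which is why a stale data.pickle has to be removed before migrating a different or significantly changed cluster.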
Currently, Rookify only offers a straightforward CLI interface, which can be displayed by running rookify with -h or --help:
usage: Rookify [-h] [-d] [-m] [-s]
options:
-h, --help show this help message and exit
-d, --dry-run Preflight data analysis and migration validation.
-m, --migrate Run the migration.
-s, --show-states Show states of the modules.
The -m or --migrate argument has been added to actually execute Rookify, while execution without arguments runs in preflight mode, i.e. as if -d or --dry-run had been passed.
The -s or --show-states argument has been added to track the state of progress. The pickle file is read for this purpose, and each module then reports its known status on stdout.
Rookify’s main function is to migrate all of Ceph’s resources to Rook. Let’s take a look at the migration_modules section in the config.yaml file:
# config.yaml
migration_modules:
- migrate_mons
This configuration instructs Rookify to perform the following steps:
1. The migrate_mons module first runs in preflight mode, which can also be triggered manually using the rookify --dry-run command. During this phase, Rookify runs the preflight methods for the configured modules and their dependent modules. If the migration has already been completed successfully, the module stops here.
2. analyze_ceph module: Rookify identifies that the analyze_ceph module needs to be run first in any case. The analyze_ceph module collects data on the running Ceph resources and on the Kubernetes environment with the Rook operator. Note that, like any other module, analyze_ceph first runs in preflight mode to check if the state has already been captured in the pickle file. If no state is found, analyze_ceph gathers the necessary information.
3. k8s_prerequisites_check module: The Kubernetes namespace is validated here, as the cluster namespace is required to have been created manually. Only the CephCluster resource is generated in the next step by create_rook_cluster.
4. create_rook_cluster module: After successfully running the analyze_ceph and k8s_prerequisites_check modules, Rookify will check for further dependencies, such as the create_rook_cluster module. This module creates the clustermap.yaml for Rook based on the information from analyze_ceph and k8s_prerequisites_check.
5. After analyze_ceph and create_rook_cluster have run, the migrate_mons module is executed. Rookify shuts down the first running Ceph monitor on the first worker node (using sudo systemctl disable --now ceph-mon.target) and immediately activates the equivalent monitor in Rook (by setting its metadata in the clustermap.yaml to true).
For both managers and monitors, Rookify uses the approach just described: it only switches off a Ceph resource after it has made sure that it can re-create an equivalent in the Rook cluster.
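Purely to illustrate this node-by-node pattern (the host names, helper functions and clustermap handling below are assumptions for illustration, not Rookify’s actual implementation), the monitor loop could be sketched in Python like this:

# Illustrative sketch of the "one-by-one" approach for the monitors.
import subprocess

MON_NODES = ["testbed-node-0", "testbed-node-1", "testbed-node-2"]

def disable_legacy_mon(host: str) -> None:
    # Shut down the Ceph-Ansible managed monitor on the given host.
    subprocess.run(
        ["ssh", host, "sudo systemctl disable --now ceph-mon.target"],
        check=True,
    )

def enable_rook_mon(host: str) -> None:
    # Placeholder: flip the monitor's entry in the generated cluster
    # definition so that Rook spawns an equivalent mon pod.
    print(f"enable Rook mon for {host} in clustermap.yaml")

def wait_for_quorum(expected: int) -> None:
    # Placeholder: poll the cluster until `expected` mons are in quorum again.
    print(f"waiting for quorum of {expected} mons")

for host in MON_NODES:
    disable_legacy_mon(host)   # take down exactly one legacy mon ...
    enable_rook_mon(host)      # ... and immediately re-create it under Rook
    wait_for_quorum(len(MON_NODES))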
For OSDs and MDS, the migration algorithm is a bit different.
For the OSDs, the “one-by-one” algorithm described for managers and monitors does not work, because Rook has a container called rook-ceph-osd-prepare that always tries to find all OSDs on a path and build all of them at once. Note that there are configuration options that give the impression of handling this, like useAllNodes=false and useAllDevices=false (see the Rook docs). Both variables are set to false by default in the Rook deployment of OSISM; nevertheless, rook-ceph-osd-prepare tries to scan and process all OSDs per node. In effect this means that a “device is busy” error will occur, along with a crash loop. This behavior has been mitigated by shutting down all OSD daemons of a node at once and then letting rook-ceph-osd-prepare process that node, which effectively enforces sequential, node-by-node processing.
For the MDS instances, the “one-by-one” algorithm does not work either, because Rook might want to update instances while the migration is in progress. This might happen, for example, if one MDS instance of the Ceph-Ansible deployment is shut off and Rook is allowed to rebuild this instance within Kubernetes. Rook might then want to update all MDS instances and will consequently try to switch off all of them in order to update them, including the ones that are still running under Ceph-Ansible. Consequently, Rook will not reach these instances and errors will be thrown.
One way to solve this problem would be to switch off all MDS instances under Ceph-Ansible and let Rook rebuild all of them. That would cause some minimal downtime though, and Rookify strives to cause zero downtime. That is why Rookify currently uses a different approach.
To get started with Rookify, make sure to check out the README.md in the repository.
If you’d like to test the current state of Rookify (much appreciated - feel free to report any issues on GitHub), you can use the OSISM testbed.
ℹ️ Info: The OSISM testbed is intended for testing purposes, which means it may be unstable. If Ceph and K3s cannot be deployed without errors, you may need to wait for a fix or find a workaround to test Rookify.
In order to set up the testbed, first consult OSISM’s testbed documentation to ensure you meet all requirements. If everything is in order, clone the repository and use make ceph to set up a Ceph testbed. This command will automatically pull the necessary Ansible roles, prepare a virtual environment, build the infrastructure with OpenStack, create a manager node, and deploy Ceph on three worker nodes:
git clone https://github.com/osism/testbed.git
cd testbed
make ceph
Once the infrastructure for Ceph and the testbed has been deployed, log in with make login and deploy K3s as well as a Rook operator:
make login
osism apply k3s
osism apply rook-operator
If you want to modify any configurations, such as a Rook setting, refer to /opt/configuration/environments/rook/ and check OSISM’s documentation on Rook for various settings.
Start by cloning the Rookify repository, setting it up, and building the Python package in a virtual environment using the Makefile. You can simply run make without any arguments to see a list of helper functions that can assist you in setting up and configuring Rookify. The command make setup downloads the necessary virtual environment libraries and installs the Rookify package into .venv/bin/rookify within the working directory:
git clone https://github.com/SovereignCloudStack/rookify
cd rookify
make setup
ℹ️ Info: If you encounter an error from the python-rados library, you can run make check-radoslib to check if the library is installed locally. If not, install the package manually. The python-rados library should be version 2.0.0 at the time of writing (check the README.md file of Rookify for the most up-to-date documentation). The library could not be integrated into the setup because Ceph currently offers no builds for pip.
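If you want to verify the bindings by hand (just a plain import test, independent of the make check-radoslib target), you can run a short snippet with the virtual environment’s Python interpreter:

# Minimal sanity check for the python-rados bindings (module name: rados).
try:
    import rados
    print("python-rados import OK")
except ImportError as exc:
    print(f"python-rados is missing or broken: {exc}")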
Copy config.example.osism.yaml to config.yaml and modify the various configuration settings as needed. Rookify will require access to an SSH key (e.g., the .id_rsa file in the Terraform directory of the testbed repository), Ceph configuration files (see /etc/ceph/ on one of the worker nodes), and Kubernetes files (e.g., ~/.kube/config from the manager node). Check if the Makefile contains any helper functions to assist you: run make in the root of the working directory to see all options that the Makefile offers.
📝 Note: Ensure that Rookify can connect to the testbed. Refer to the OSISM documentation on how to set up a VPN connection.
general:
  machine_pickle_file: data.pickle

logging:
  level: INFO # level at which logging should start
  format:
    time: "%Y-%m-%d %H:%M.%S" # other example: "iso"
    renderer: console # or: json

ceph:
  config: ./.ceph/ceph.conf
  keyring: ./.ceph/ceph.client.admin.keyring

# fill in correct path to private key
ssh:
  private_key: /home/USER/.ssh/cloud.private
  hosts:
    testbed-node-0:
      address: 192.168.16.10
      user: dragon
    testbed-node-1:
      address: 192.168.16.11
      user: dragon
    testbed-node-2:
      address: 192.168.16.12
      user: dragon

kubernetes:
  config: ./k8s/config

rook:
  cluster:
    name: osism-ceph
    namespace: rook-ceph
    mds_placement_label: node-role.osism.tech/rook-mds
    mgr_placement_label: node-role.osism.tech/rook-mgr
    mon_placement_label: node-role.osism.tech/rook-mon
    osd_placement_label: node-role.osism.tech/rook-osd
    rgw_placement_label: node-role.osism.tech/rook-rgw
  ceph:
    image: quay.io/ceph/ceph:v18.2.1

migration_modules: # this sets the modules that need to be run. Note that some of the modules require other modules to be run as well, this will happen automatically.
  - analyze_ceph
Finally, run Rookify to test it. Rookify allows the use of --dry-run to run modules in preflight mode. Note that Rookify always checks for a successful run of the various modules in preflight mode before starting the migration. Additionally, it is always safe to run the analyze_ceph module, as it will not make any real changes.
📝 Note: By default, Rookify runs in preflight mode. If you execute Rookify without any arguments, --dry-run will be appended.
To be on the safe side, you can now run Rookify in preflight mode. In this case you could also execute it directly, i.e. by adding the argument --migrate, because the analyze_ceph module does not break anything:
# Preflight mode
.venv/bin/rookify --dry-run
# Execution mode
.venv/bin/rookify --migrate
⚠️ Important: It is advised to run any module in preflight mode first, to avoid any real changes.
If everything is set up correctly, you will see an output similar to this:
.venv/bin/rookify --migrate
2024-09-02 15:21.37 [info ] Execution started with machine pickle file
2024-09-02 15:21.37 [info ] AnalyzeCephHandler ran successfully.
Note that there is now a data.pickle file in the root of your working directory. The file should contain data:
du -sh data.pickle
8.0K data.pickle
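If you are curious what has been recorded, you can peek into the file from Python. Keep in mind that the structure of the pickled object is an internal detail of Rookify and may change, so this is only a quick inspection aid:

# Load and pretty-print whatever Rookify stored in its progress file.
import pickle
import pprint

with open("data.pickle", "rb") as fh:
    state = pickle.load(fh)

pprint.pprint(state)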
At this point we can re-edit the config.yaml file to migrate the OSD, MDS, manager and RADOS Gateway resources:
migration_modules:
- analyze_ceph
- create_rook_cluster
- migrate_mons
- migrate_osds
- migrate_osd_pools
- migrate_mds
- migrate_mds_pools
- migrate_mgrs
- migrate_rgws
- migrate_rgw_pools
ℹ️ Info: Some of these are redundant in the sense that their REQUIRES variables already list other modules as dependencies. For example, migrate_osds has the following REQUIRES variable: REQUIRES = ["migrate_mons"], and migrate_mons has REQUIRES = ["analyze_ceph", "create_rook_cluster"]. So it would be fine to skip the first three modules. Still, mentioning the modules explicitly can improve clarity for the reader. In effect, Rookify will run each module only once, so it does not hurt to add them to config.yaml.
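To make the implicit dependency handling tangible, here is a small, hypothetical sketch of how such REQUIRES lists can be resolved so that every module runs exactly once and only after its dependencies. This is an illustration only, not Rookify’s actual resolver, and the dependency table is simplified:

# Simplified REQUIRES declarations, mirroring the examples above;
# the real modules may declare additional dependencies.
REQUIRES = {
    "analyze_ceph": [],
    "create_rook_cluster": ["analyze_ceph"],
    "migrate_mons": ["analyze_ceph", "create_rook_cluster"],
    "migrate_osds": ["migrate_mons"],
}

def resolve(requested: list[str]) -> list[str]:
    # Return an execution order in which dependencies come first
    # and every module appears only once.
    order: list[str] = []

    def visit(module: str) -> None:
        if module in order:
            return
        for dependency in REQUIRES.get(module, []):
            visit(dependency)
        order.append(module)

    for module in requested:
        visit(module)
    return order

# Listing migrate_osds alone implicitly pulls in its whole dependency chain:
print(resolve(["migrate_osds"]))
# ['analyze_ceph', 'create_rook_cluster', 'migrate_mons', 'migrate_osds']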
We can first run Rookify in preflight mode (--dry-run) to check if all is ok and then run it with --migrate. Executing Rookify should then give you an output similar to this:
.venv/bin/rookify --migrate
2024-09-04 08:52.02 [info ] Execution started with machine pickle file
2024-09-04 08:52.04 [info ] AnalyzeCephHandler ran successfully.
2024-09-04 08:52.04 [info ] Validated Ceph to expect cephx auth
2024-09-04 08:52.04 [warning ] Rook Ceph cluster will be configured without a public network and determine it automatically during runtime
2024-09-04 08:52.04 [info ] Rook Ceph cluster will be configured without a cluster network
2024-09-04 08:52.11 [warning ] ceph-mds filesystem 'cephfs' uses an incompatible pool metadata name 'cephfs_metadata' and can not be migrated to Rook automatically
2024-09-04 08:52.16 [info ] Creating Rook cluster definition
2024-09-04 08:52.16 [info ] Waiting for Rook cluster created
2024-09-04 08:52.16 [info ] Migrating ceph-mon daemon 'testbed-node-0'
2024-09-04 08:52.32 [info ] Disabled ceph-mon daemon 'testbed-node-0'
2024-09-04 08:53.45 [info ] Quorum of 3 ceph-mon daemons successful
2024-09-04 08:53.45 [info ] Migrating ceph-mon daemon 'testbed-node-1'
2024-09-04 08:54.07 [info ] Disabled ceph-mon daemon 'testbed-node-1'
2024-09-04 08:54.44 [info ] Quorum of 3 ceph-mon daemons successful
2024-09-04 08:54.44 [info ] Migrating ceph-mon daemon 'testbed-node-2'
2024-09-04 08:55.04 [info ] Disabled ceph-mon daemon 'testbed-node-2'
2024-09-04 08:55.52 [info ] Quorum of 3 ceph-mon daemons successful
2024-09-04 08:55.52 [info ] Migrating ceph-osd host 'testbed-node-0'
2024-09-04 08:55.55 [info ] Disabled ceph-osd daemon 'testbed-node-0@0'
2024-09-04 08:55.57 [info ] Disabled ceph-osd daemon 'testbed-node-0@4'
2024-09-04 08:55.57 [info ] Enabling Rook based ceph-osd node 'testbed-node-0'
2024-09-04 08:57.00 [info ] Rook based ceph-osd daemon 'testbed-node-0@0' available
2024-09-04 08:57.02 [info ] Rook based ceph-osd daemon 'testbed-node-0@4' available
2024-09-04 08:57.02 [info ] Migrating ceph-osd host 'testbed-node-1'
2024-09-04 08:57.05 [info ] Disabled ceph-osd daemon 'testbed-node-1@1'
2024-09-04 08:57.07 [info ] Disabled ceph-osd daemon 'testbed-node-1@3'
2024-09-04 08:57.07 [info ] Enabling Rook based ceph-osd node 'testbed-node-1'
2024-09-04 08:58.46 [info ] Rook based ceph-osd daemon 'testbed-node-1@1' available
2024-09-04 08:58.46 [info ] Rook based ceph-osd daemon 'testbed-node-1@3' available
2024-09-04 08:58.46 [info ] Migrating ceph-osd host 'testbed-node-2'
2024-09-04 08:58.48 [info ] Disabled ceph-osd daemon 'testbed-node-2@2'
2024-09-04 08:58.50 [info ] Disabled ceph-osd daemon 'testbed-node-2@5'
2024-09-04 08:58.50 [info ] Enabling Rook based ceph-osd node 'testbed-node-2'
2024-09-04 09:00.25 [info ] Rook based ceph-osd daemon 'testbed-node-2@2' available
2024-09-04 09:00.27 [info ] Rook based ceph-osd daemon 'testbed-node-2@5' available
2024-09-04 09:00.27 [info ] Migrating ceph-mds daemon at host 'testbed-node-0'
2024-09-04 09:00.27 [info ] Migrating ceph-mds daemon at host 'testbed-node-1'
2024-09-04 09:00.27 [info ] Migrating ceph-mds daemon at host 'testbed-node-2'
2024-09-04 09:00.27 [info ] Migrating ceph-mgr daemon at host'testbed-node-0'
2024-09-04 09:01.03 [info ] Disabled ceph-mgr daemon 'testbed-node-0' and enabling Rook based daemon
2024-09-04 09:01.20 [info ] 3 ceph-mgr daemons are available
2024-09-04 09:01.20 [info ] Migrating ceph-mgr daemon at host'testbed-node-1'
2024-09-04 09:01.51 [info ] Disabled ceph-mgr daemon 'testbed-node-1' and enabling Rook based daemon
2024-09-04 09:02.09 [info ] 3 ceph-mgr daemons are available
2024-09-04 09:02.09 [info ] Migrating ceph-mgr daemon at host'testbed-node-2'
2024-09-04 09:02.41 [info ] Disabled ceph-mgr daemon 'testbed-node-2' and enabling Rook based daemon
2024-09-04 09:03.00 [info ] 3 ceph-mgr daemons are available
2024-09-04 09:03.00 [info ] Migrating ceph-rgw zone 'default'
2024-09-04 09:03.00 [info ] Migrated ceph-rgw zone 'default'
2024-09-04 09:03.00 [info ] Migrating ceph-osd pool 'backups'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'backups'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'volumes'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'volumes'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'images'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'images'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'metrics'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'metrics'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'vms'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'vms'
2024-09-04 09:03.01 [info ] Migrating ceph-rgw daemon at host 'testbed-node-2'
2024-09-04 09:04.27 [info ] Disabled ceph-rgw host 'testbed-node-2'
2024-09-04 09:04.35 [info ] Rook based RGW daemon for node 'testbed-node-2' available
2024-09-04 09:04.35 [info ] Migrating ceph-rgw daemon at host 'testbed-node-1'
2024-09-04 09:04.41 [info ] Disabled ceph-rgw host 'testbed-node-1'
2024-09-04 09:05.09 [info ] Rook based RGW daemon for node 'testbed-node-1' available
2024-09-04 09:05.09 [info ] Migrating ceph-rgw daemon at host 'testbed-node-0'
2024-09-04 09:05.13 [info ] Disabled ceph-rgw host 'testbed-node-0'
2024-09-04 09:05.19 [info ] Rook based RGW daemon for node 'testbed-node-0' available
Wow! You migrated from Ceph-Ansible to Rook!
Now log into the testbed and check your Ceph cluster’s health with ceph -s. Also use kubectl to check whether all Rook pods are running.