Wednesday, April 28, 2010

Ceph install on Fedora

Ceph is yet another distributed file system, grown out of a PhD dissertation (the implementation work behind these theses is impressively strong *_*). Like the others, it is POSIX-compatible. It is still under development and not suitable for production use.

For an introduction to its architecture, Linux Magazine has a short piece: Ceph: The Distributed File System Creature from the Object Lagoon

I chose to install it on Fedora mainly because most of the file systems I have tested so far support RHEL- and SuSE-like distributions better than other operating systems, and I wanted to avoid unnecessary trouble.

OS: Fedora 12
Kernel: 2.6.34-rc3

Step1. Build Ceph Client
Ceph has already been merged into 2.6.34, which makes installation considerably simpler.

Check out new kernel source and compile

# cd /usr/src/kernels
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
# cd ceph-client/
# make menuconfig

Select Ceph as a module or build it into the kernel. Ceph has abandoned its original Ebofs object store in favor of Btrfs, so build Btrfs in as well.
After compiling and rebooting, the system should support both file systems.
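After rebooting into the new kernel, a quick check that both file systems registered. The sample /proc/filesystems content below is inlined as a here-doc so the command is reproducible; on the real machine, grep /proc/filesystems directly:

```shell
# Count how many of the two file systems the kernel knows about.
# (Sample content shown; on a live system: grep -Ec 'btrfs|ceph' /proc/filesystems)
cat <<'EOF' | grep -Ec 'btrfs|ceph'
nodev	sysfs
nodev	proc
	ext4
	btrfs
	ceph
EOF
```

The count should be 2 when both were built in (or their modules are loaded).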

Step2. Prepare the OSD storage
Repeat this step on every node that will act as an OSD.

Partitioning

You can use fdisk, cfdisk (util-linux-ng), or any other tool you prefer to create the partitions.

# fdisk /dev/cciss/c0d2

The number of cylinders for this disk is set to 8716.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): m
Command action
   a   toggle a bootable flag
   b   edit bsd disklabel
   c   toggle the dos compatibility flag
   d   delete a partition
   l   list known partition types
   m   print this menu
   n   add a new partition
   o   create a new empty DOS partition table
   p   print the partition table
   q   quit without saving changes
   s   create a new empty Sun disklabel
   t   change a partition's system id
   u   change display/entry units
   v   verify the partition table
   w   write table to disk and exit
   x   extra functionality (experts only)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-8716, default 1): 1
Last cylinder, +cylinders or +size{K,M,G} (1-8716, default 8716): 8716

Command (m for help): p

Disk /dev/cciss/c0d2: 36.4 GB, 36414750720 bytes
255 heads, 32 sectors/track, 8716 cylinders
Units = cylinders of 8160 * 512 = 4177920 bytes
Disk identifier: 0x0004f893

   Device Boot      Start         End      Blocks   Id  System
/dev/cciss/c0d2p1       1        8716    35561264   83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
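As a sanity check on the fdisk figures above: 8716 cylinders of 8160 sectors each, minus the 32-sector first track that the partition skips, halved to convert 512-byte sectors into 1K blocks, reproduces the Blocks column:

```shell
# 8716 cylinders * 8160 sectors/cylinder, minus the 32-sector first
# track, divided by 2 (512-byte sectors -> 1K blocks)
echo $(( (8716 * 8160 - 32) / 2 ))
```

This prints 35561264, matching the Blocks value reported for /dev/cciss/c0d2p1.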

# mkdir /mnt/btrfs
# cd /mnt/btrfs/
# mkdir osd0 osd1

Format the disks

# yum install btrfs-progs.i686
# mkfs.btrfs /dev/cciss/c0d2p1

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d2p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19

Mount the OSD local disks

# mount -t btrfs /dev/cciss/c0d2p1 /mnt/btrfs/osd0/
# mount -t btrfs /dev/cciss/c0d3p1 /mnt/btrfs/osd1
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_f172-lv_root
                       32G  6.2G   24G  21% /
tmpfs                 503M     0  503M   0% /dev/shm
/dev/cciss/c0d0p1     194M   37M  147M  20% /boot
/dev/cciss/c0d2p1      34G   56K   32G   1% /mnt/btrfs/osd0
/dev/cciss/c0d3p1      34G   56K   32G   1% /mnt/btrfs/osd1

# umount /mnt/btrfs/osd0 /mnt/btrfs/osd1
(The scripts run in Step5 and Step6 will format and mount the OSD disks themselves; the commands above were only to confirm that the disks are OK.)

Step3. Compile the Ceph packages
Repeat this step on every node.

Fetch the Ceph source code

# cd /usr/src
# git clone git://ceph.newdream.net/ceph.git
Initialized empty Git repository in /usr/src/ceph/.git/
remote: Counting objects: 79397, done.
remote: Compressing objects: 100% (19438/19438), done.
remote: Total 79397 (delta 64545), reused 73253 (delta 58751)
Receiving objects: 100% (79397/79397), 15.10 MiB | 3.17 MiB/s, done.
Resolving deltas: 100% (64545/64545), done.

Install the prerequisites

# yum install boost.i686
# yum install boost-devel.i686
# yum install libedit-devel.i686
# yum install openssl.i686
# yum install openssl-devel.i686

Compile Ceph

# cd /usr/src/ceph/
# more INSTALL
# ./autogen.sh
# ./configure
# make
# make install

Step4. Configuration

Make sure every node can log in to every other node over ssh without a password.
For how to set this up, see: SSH Login Without Password
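A minimal sketch of that setup using ssh-keygen and ssh-copy-id. The node names are the ones used in this cluster; the commands are only echoed as a dry run here, so drop the leading echo to actually execute them:

```shell
# Dry run: print the commands needed for passwordless ssh to each node.
# Remove the 'echo' prefixes to really generate and distribute the key.
echo 'ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa'
for n in node172 node173 node181; do
    echo "ssh-copy-id root@$n"
done
```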

ceph.conf — make sure this part of the configuration is identical on every node

# cd /mnt/btrfs/
# mkdir mon0
# cd /usr/local/etc/ceph/
# cat ceph.conf
===
;
; Sample ceph ceph.conf file.
;
; This file defines cluster membership, the various locations
; that Ceph stores data, and any other runtime options.

; If a 'host' is defined for a daemon, the start/stop script will
; verify that it matches the hostname (or else ignore it). If it is
; not defined, it is assumed that the daemon is intended to start on
; the current host (e.g., in a setup with a startup.conf on each
; node).

; global
[global]
pid file = /var/run/ceph/$name.pid

; some minimal logging (just message traffic) to aid debugging
debug ms = 1

; monitor
[mon]
mon data = /mnt/btrfs/mon$id

[mon0]
host = node181
mon addr = 10.0.0.181:6789

; mds
[mds]

[mds.node181]
host = node181

; osd
[osd]
sudo = true
osd data = /mnt/btrfs/osd$id

[osd0]
host = node172

; if 'btrfs devs' is not specified, you're responsible for
; setting up the 'osd data' dir. if it is not btrfs, things
; will behave up until you try to recover from a crash (which
; is usually fine for basic testing).
btrfs devs = /dev/cciss/c0d2p1
osd data = /mnt/btrfs/osd0

[osd1]
host = node172
btrfs devs = /dev/cciss/c0d3p1
osd data = /mnt/btrfs/osd1

[osd2]
host = node173
btrfs devs = /dev/cciss/c0d2p1
osd data = /mnt/btrfs/osd2

[osd3]
host = node173
btrfs devs = /dev/cciss/c0d3p1
osd data = /mnt/btrfs/osd3

; access control
[group everyone]
; you probably want to limit this to a small or a list of
; hosts. clients are fully trusted.
addr = 0.0.0.0/0

[mount /]
allow = %everyone
===
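One way to keep ceph.conf identical everywhere is to edit it on a single node and push it to the rest. A dry-run sketch, assuming the node names used above (drop the echo to actually copy):

```shell
# Dry run: print the scp commands that would distribute ceph.conf.
CONF=/usr/local/etc/ceph/ceph.conf
for n in node172 node173; do
    echo scp "$CONF" "root@$n:$CONF"
done
```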

Step5. Create file system

[root@f181 ceph]# mkcephfs -c /usr/local/etc/ceph/ceph.conf --allhosts --mkbtrfs
/usr/local/bin/monmaptool --create --clobber --add 10.0.0.181:6789 --print /tmp/monmap.2319
/usr/local/bin/monmaptool: monmap file /tmp/monmap.2319
/usr/local/bin/monmaptool: generated fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43
epoch 1
fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43
last_changed 10.04.24 20:10:01.089929
created 10.04.24 20:10:01.089929
mon0 10.0.0.181:6789/0
/usr/local/bin/monmaptool: writing epoch 1 to /tmp/monmap.2319 (1 monitors)
max osd in /usr/local/etc/ceph/ceph.conf is 3, num osd is 4
/usr/local/bin/osdmaptool: osdmap file '/tmp/osdmap.2319'
/usr/local/bin/osdmaptool: writing epoch 1 to /tmp/osdmap.2319
Building admin keyring at /tmp/admin.keyring.2319
creating /tmp/admin.keyring.2319
Building monitor keyring with all service keys
creating /tmp/monkeyring.2319
importing contents of /tmp/admin.keyring.2319 into /tmp/monkeyring.2319
creating /tmp/keyring.mds.node181
importing contents of /tmp/keyring.mds.node181 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.0
importing contents of /tmp/keyring.osd.0 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.1
importing contents of /tmp/keyring.osd.1 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.2
importing contents of /tmp/keyring.osd.2 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.3
importing contents of /tmp/keyring.osd.3 into /tmp/monkeyring.2319
=== mon0 ===
10.04.24 20:10:02.245494 store(/mnt/btrfs/mon0) mkfs
10.04.24 20:10:02.245722 store(/mnt/btrfs/mon0) test -d /mnt/btrfs/mon0 && /bin/rm -rf /mnt/btrfs/mon0 ; mkdir -p /mnt/btrfs/mon0
10.04.24 20:10:02.581627 mon0(starting).class v0 create_initial -- creating initial map
10.04.24 20:10:02.692877 mon0(starting).auth v0 create_initial -- creating initial map
10.04.24 20:10:02.692940 mon0(starting).auth v0 reading initial keyring
/usr/local/bin/mkmonfs: created monfs at /mnt/btrfs/mon0 for mon0
admin.keyring.2319 100% 119 0.1KB/s 00:00
=== mds.node181 ===
WARNING: no keyring specified for mds.node181
=== osd0 ===
umount: /mnt/btrfs/osd0: not mounted
umount: /dev/cciss/c0d2p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d2p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd0 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd0
WARNING: no keyring specified for osd0
=== osd1 ===
umount: /mnt/btrfs/osd1: not mounted
umount: /dev/cciss/c0d3p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d3p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd1 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd1
WARNING: no keyring specified for osd1
=== osd2 ===
umount: /mnt/btrfs/osd2: not mounted
umount: /dev/cciss/c0d2p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d2p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd2 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd2
WARNING: no keyring specified for osd2
=== osd3 ===
umount: /mnt/btrfs/osd3: not mounted
umount: /dev/cciss/c0d3p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d3p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd3 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd3
WARNING: no keyring specified for osd3

Step6. Start the Ceph servers

[root@f181 ceph]# cd /usr/src/ceph/src/
[root@f181 src]# ./init-ceph -a -c /usr/local/etc/ceph/ceph.conf start
=== mon0 ===
Starting Ceph mon0 on node181...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting mon0 at 10.0.0.181:6789/0 mon_data /mnt/btrfs/mon0 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43
=== mds.node181 ===
Starting Ceph mds.node181 on node181...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting mds.node181 at 0.0.0.0:6800/3219
=== osd0 ===
Mounting Btrfs on node172:/mnt/btrfs/osd0
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd0 on node172...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd0 at 0.0.0.0:6800/3484 osd_data /mnt/btrfs/osd0 (no journal)
=== osd1 ===
Mounting Btrfs on node172:/mnt/btrfs/osd1
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd1 on node172...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd1 at 0.0.0.0:6802/3717 osd_data /mnt/btrfs/osd1 (no journal)
=== osd2 ===
Mounting Btrfs on node173:/mnt/btrfs/osd2
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd2 on node173...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd2 at 0.0.0.0:6800/3310 osd_data /mnt/btrfs/osd2 (no journal)
=== osd3 ===
Mounting Btrfs on node173:/mnt/btrfs/osd3
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd3 on node173...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd3 at 0.0.0.0:6802/3548 osd_data /mnt/btrfs/osd3 (no journal)
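In this source tree the daemons are named cmon, cmds and cosd. A quick way to confirm they came up on a node (it prints a message instead when none are found, so treat a silent success list as the good case):

```shell
# List any running Ceph daemons by exact process name; fall back to a
# message when none are running on this node.
ps -eo comm | grep -E '^c(mon|mds|osd)$' || echo "no ceph daemons running"
```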

Step7. Mount!

[root@f181 src]# mount -t ceph 10.0.0.181:/ /mnt/ceph/
[root@f181 src]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_f181-lv_root
                       32G  6.2G   24G  21% /
tmpfs                 503M     0  503M   0% /dev/shm
/dev/cciss/c0d0p1     194M   33M  152M  18% /boot
10.0.0.181:/          136G  7.0M  128G   1% /mnt/ceph

Step8. Testing

[root@f181 src]# dd if=/dev/zero of=/mnt/ceph/bigfile bs=4k count=10000k
10240000+0 records in
10240000+0 records out
41943040000 bytes (42 GB) copied, 1093.68 s, 38.4 MB/s
The data is written out across the OSD disks.
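The dd figures are self-consistent: 41943040000 bytes over 1093.68 seconds works out to the reported rate (dd counts decimal megabytes):

```shell
# bytes / seconds / 10^6 = MB/s, the unit dd reports
awk 'BEGIN { printf "%.1f MB/s\n", 41943040000 / 1093.68 / 1000000 }'
```

This prints 38.4 MB/s, matching dd's summary line.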

Final. Observations

1. The reported space usage fluctuates up and down during the write (my guess is that balancing is taking place).
2. df does not reflect the actual file size, and the individual OSDs do not report their real space usage either.

ref. http://ceph.newdream.net/wiki/Main_Page
