Thursday, February 5, 2015

Percona XtraDB Cluster Installation

I want to store a fairly large amount of data without worrying about DB HA/scaling down the road, so I'm giving this suite a try first.
The installation is actually quite easy, and the official installation guide covers it well, which is a nice touch.

Reference:
http://www.percona.com/doc/percona-xtradb-cluster/5.6/howtos/ubuntu_howto.html

1. Prerequisites:
Ubuntu 14.04 LTS servers: pxc1(192.168.3.12), pxc2(192.168.3.13), pxc3(192.168.3.14)
Watch the SELinux, AppArmor, and firewall settings; open ports 3306, 4444, 4567, and 4568.
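For example, with ufw (just one option on Ubuntu; adjust to whatever firewall you actually run):
$ ufw allow 3306/tcp
$ ufw allow 4444/tcp
$ ufw allow 4567/tcp
$ ufw allow 4568/tcp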

2. Installation:

2.1. Add the signing key
$ apt-key adv --keyserver keys.gnupg.net --recv-keys 1C4CBDCDCD2EFD2A

2.2. prepare percona apt repository
add to /etc/apt/sources.list
deb http://repo.percona.com/apt trusty main
deb-src http://repo.percona.com/apt trusty main

$ apt-get update

2.3. install package
$ apt-get install percona-xtradb-cluster-56
or
$ apt-get install percona-xtradb-cluster-full-56

3. Setup

3.1. stop mysql running on all nodes
$ /etc/init.d/mysql stop

3.2. first node configuration
root@pxc1:~# cat /etc/mysql/my.cnf
===
[mysqld]

datadir=/var/lib/mysql
user=mysql

# Path to Galera library
wsrep_provider=/usr/lib/libgalera_smm.so

# Cluster connection URL; an empty gcomm:// bootstraps a new cluster
# (replaced with the full list of node IPs in 3.8)
wsrep_cluster_address=gcomm://

# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW

# MyISAM storage engine has only experimental support
default_storage_engine=InnoDB

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

# Node #1 address
wsrep_node_address=192.168.3.12

wsrep_node_name=pxc1

# SST method
wsrep_sst_method=xtrabackup-v2

# Cluster name
wsrep_cluster_name=my_PXC

# Authentication for SST method
wsrep_sst_auth="sstuser:s3cretPass"
===

3.3. start first node
root@pxc1:~# /etc/init.d/mysql bootstrap-pxc

3.4. grant privilege
mysql@pxc1> CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cretPass';
mysql@pxc1> GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
mysql@pxc1> FLUSH PRIVILEGES;

3.5. 2nd and 3rd node configuration
===
[mysqld]

datadir=/var/lib/mysql
user=mysql

# Path to Galera library
wsrep_provider=/usr/lib/libgalera_smm.so

# Cluster connection URL contains the IPs of node#1, node#2 and node#3
wsrep_cluster_address=gcomm://192.168.3.12  #first node IP

# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW

# MyISAM storage engine has only experimental support
default_storage_engine=InnoDB

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

# Node #2 address
wsrep_node_address=192.168.3.13 #change as you need

wsrep_node_name=pxc2 #change as you need

# SST method
wsrep_sst_method=xtrabackup-v2

# Cluster name
wsrep_cluster_name=my_PXC

# Authentication for SST method
wsrep_sst_auth="sstuser:s3cretPass"
===

3.6. start the mysql service on the 2nd and 3rd nodes
$ /etc/init.d/mysql start

3.7. check that all three nodes have joined the cluster
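A quick way to verify is to ask any node for the cluster size; once all three nodes have joined, it should look like this:

mysql> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+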

3.8. final configuration

On every node, change the cluster address to the full member list, then restart mysql one node at a time so the change takes effect:
wsrep_cluster_address=gcomm://192.168.3.12,192.168.3.13,192.168.3.14

Sunday, June 6, 2010

FreeBSD 7.3 -> 8.1 remote upgrade

Since the 7.x series is nearing EoL, I took the chance to upgrade while I had time. Honestly, remote upgrades across a major version like this rarely succeed.

This time I tested it in a VM first, and it seemed to work, so I brought all the machines at hand up to date :D

So far, both 7.3-RELEASE and 7.3-Stable have been remotely upgraded to 8.1-PRERELEASE successfully.

The upgrade steps are simple:

Check out the 8-Stable source code
# cat stable-supfile
*default release=cvs tag=RELENG_8
# cvsup -g -L 2 stable-supfile
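The supfile above is abridged; a complete one also needs a mirror host and paths. A minimal sketch (the mirror host below is just an example, pick one near you):
*default host=cvsup.tw.FreeBSD.org
*default base=/var/db
*default prefix=/usr
*default release=cvs tag=RELENG_8
*default delete use-rel-suffix compress
src-all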

Start building the world~
# cd /usr/src ; make buildworld
# make buildkernel
# make installworld
# make installkernel # note: install the new kernel only after installworld.
# reboot
# mergemaster

Basically, the steps above are what worked for me in remote upgrades, but be prepared to go to the console anyway XD And be sure to back up important data first!

After building the world, take the chance to rebuild all your ports too~ :P
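For example with portmaster (one option; portupgrade works just as well):
# portmaster -af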

Friday, May 14, 2010

Hadoop install on Fedora

Recently I've had heavy statistics workloads that need serious computing power, so I gave Hadoop a small trial to see whether it fits.
Installation is quite easy; the hard part is the Map/Reduce programs themselves.

Before starting, I recommend reading HDFS Architecture to understand the roles involved in the configuration.

This install is on Fedora 12, using the latest release, hadoop-0.20.2.
Planned layout: master node: f180, slave nodes: f172, f173

Step1. Package download and configuration

Download the package (run on all nodes)
# mkdir /usr/src/hadoop/
# cd /usr/src/hadoop/
# wget http://ftp.twaren.net/Unix/Web/apache/hadoop/core/stable/hadoop-0.20.2.tar.gz
# tar xvf hadoop-0.20.2.tar.gz
# cd hadoop-0.20.2

Environment setup (run on all nodes)
Set the JAVA directory (required!)
# cat conf/hadoop-env.sh (Fedora 12 currently uses openjdk)
-# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
+export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk
Specify node information
# cat conf/core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/btrfs/hadoop/hadoop-${user.name}</value> <!-- scratch data dir -->
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://f180:9000</value> <!-- HDFS master -->
  </property>
</configuration>

# cat conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value> <!-- block replication factor -->
  </property>
</configuration>

# cat conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>f180:9001</value> <!-- Map/Reduce master -->
  </property>
</configuration>

Configure the master node (run on the master only)
[root@f180 hadoop-0.20.2]# cat conf/masters
f180
Configure the slave nodes (run on the master only)
[root@f180 hadoop-0.20.2]# cat conf/slaves
f172
f173

Create the scratch space
[root@f180 hadoop-0.20.2]# mkfs.btrfs /dev/cciss/c0d1p1
[root@f180 hadoop-0.20.2]# mkdir /mnt/btrfs/hadoop/
[root@f180 hadoop-0.20.2]# mount -t btrfs /dev/cciss/c0d1p1 /mnt/btrfs/hadoop/

[root@f172 hadoop-0.20.2]# mkfs.btrfs /dev/cciss/c0d4p1
[root@f172 hadoop-0.20.2]# mkdir /mnt/btrfs/hadoop/
[root@f172 hadoop-0.20.2]# mount -t btrfs /dev/cciss/c0d4p1 /mnt/btrfs/hadoop/

[root@f173 hadoop-0.20.2]# mkfs.btrfs /dev/cciss/c0d4p1
[root@f173 hadoop-0.20.2]# mkdir /mnt/btrfs/hadoop/
[root@f173 hadoop-0.20.2]# mount -t btrfs /dev/cciss/c0d4p1 /mnt/btrfs/hadoop/

Make sure all nodes can log into each other over ssh without a password.
For how to do that, see: SSH Login Without Password

Step2. Initialize the file system

[root@f180 hadoop-0.20.2]# bin/hadoop namenode -format
10/05/14 17:17:43 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = f180.twaren.net/211.79.x.180
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/14 17:17:43 INFO namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
10/05/14 17:17:43 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/14 17:17:43 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/14 17:17:43 INFO common.Storage: Image file of size 94 saved in 0 seconds.
10/05/14 17:17:44 INFO common.Storage: Storage directory /mnt/btrfs/hadoop/hadoop-root/dfs/name has been successfully formatted.
10/05/14 17:17:44 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at f180.twaren.net/211.79.x.180
************************************************************/

Step3. Start the DFS daemons

[root@f180 hadoop-0.20.2]# bin/start-dfs.sh
starting namenode, logging to /usr/src/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-namenode-f180.twaren.net.out
f173.twaren.net: starting datanode, logging to /usr/src/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-datanode-f173.twaren.net.out
f172.twaren.net: starting datanode, logging to /usr/src/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-datanode-f172.twaren.net.out
f180.twaren.net: starting secondarynamenode, logging to /usr/src/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-secondarynamenode-f180.twaren.net.out
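To double-check that both datanodes registered with the namenode, dfsadmin can print a per-node report:
[root@f180 hadoop-0.20.2]# bin/hadoop dfsadmin -report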

Step4. Start the Map/Reduce daemons

[root@f180 hadoop-0.20.2]# bin/start-mapred.sh
starting jobtracker, logging to /usr/src/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-jobtracker-f180.twaren.net.out
f173.twaren.net: starting tasktracker, logging to /usr/src/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-tasktracker-f173.twaren.net.out
f172.twaren.net: starting tasktracker, logging to /usr/src/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-tasktracker-f172.twaren.net.out

Step5. Run a test computation!
Basically the whole setup is complete after Step4; you can check the records in the logs directory to see whether any errors occurred.

The examples shipped with hadoop include word count; you can download some e-books from gutenberg.org to count.

I grabbed six documents:
[root@f180 hadoop-0.20.2]# ls -al /tmp/gutenberg/
total 6196
drwxr-xr-x 2 root root 4096 2010-05-14 17:29 .
drwxrwxrwt. 8 root root 4096 2010-05-14 17:20 ..
-rw-r--r-- 1 root root 343694 2007-12-03 23:28 132.txt
-rw-r--r-- 1 root root 1945731 2007-04-14 04:34 19699.txt
-rw-r--r-- 1 root root 674762 2007-01-22 18:56 20417.txt
-rw-r--r-- 1 root root 1573044 2008-08-01 20:31 4300.txt
-rw-r--r-- 1 root root 1391706 2009-08-14 07:19 7ldvc10.txt
-rw-r--r-- 1 root root 393995 2009-03-18 19:51 972.txt

Before computing, load the test files into HDFS
[root@f180 hadoop-0.20.2]# bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
[root@f180 hadoop-0.20.2]# bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - root supergroup 0 2010-05-14 17:29 /user/root/gutenberg
[root@f180 hadoop-0.20.2]# bin/hadoop dfs -ls gutenberg
Found 6 items
-rw-r--r-- 2 root supergroup 343694 2010-05-14 17:29 /user/root/gutenberg/132.txt
-rw-r--r-- 2 root supergroup 1945731 2010-05-14 17:29 /user/root/gutenberg/19699.txt
-rw-r--r-- 2 root supergroup 674762 2010-05-14 17:29 /user/root/gutenberg/20417.txt
-rw-r--r-- 2 root supergroup 1573044 2010-05-14 17:29 /user/root/gutenberg/4300.txt
-rw-r--r-- 2 root supergroup 1391706 2010-05-14 17:29 /user/root/gutenberg/7ldvc10.txt
-rw-r--r-- 2 root supergroup 393995 2010-05-14 17:29 /user/root/gutenberg/972.txt

Run Map/Reduce!
[root@f180 hadoop-0.20.2]# bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output
10/05/14 17:33:51 INFO input.FileInputFormat: Total input paths to process : 6
10/05/14 17:33:52 INFO mapred.JobClient: Running job: job_201005141720_0001
10/05/14 17:33:53 INFO mapred.JobClient: map 0% reduce 0%
10/05/14 17:34:05 INFO mapred.JobClient: map 33% reduce 0%
10/05/14 17:34:08 INFO mapred.JobClient: map 66% reduce 0%
10/05/14 17:34:11 INFO mapred.JobClient: map 100% reduce 0%
10/05/14 17:34:14 INFO mapred.JobClient: map 100% reduce 33%
10/05/14 17:34:20 INFO mapred.JobClient: map 100% reduce 100%
10/05/14 17:34:22 INFO mapred.JobClient: Job complete: job_201005141720_0001
10/05/14 17:34:22 INFO mapred.JobClient: Counters: 17
10/05/14 17:34:22 INFO mapred.JobClient: Job Counters
10/05/14 17:34:22 INFO mapred.JobClient: Launched reduce tasks=1
10/05/14 17:34:22 INFO mapred.JobClient: Launched map tasks=6
10/05/14 17:34:22 INFO mapred.JobClient: Data-local map tasks=6
10/05/14 17:34:22 INFO mapred.JobClient: FileSystemCounters
10/05/14 17:34:22 INFO mapred.JobClient: FILE_BYTES_READ=4241310
10/05/14 17:34:22 INFO mapred.JobClient: HDFS_BYTES_READ=6322932
10/05/14 17:34:22 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6936977
10/05/14 17:34:22 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1353587
10/05/14 17:34:22 INFO mapred.JobClient: Map-Reduce Framework
10/05/14 17:34:22 INFO mapred.JobClient: Reduce input groups=123471
10/05/14 17:34:22 INFO mapred.JobClient: Combine output records=185701
10/05/14 17:34:22 INFO mapred.JobClient: Map input records=124099
10/05/14 17:34:22 INFO mapred.JobClient: Reduce shuffle bytes=2695475
10/05/14 17:34:22 INFO mapred.JobClient: Reduce output records=123471
10/05/14 17:34:22 INFO mapred.JobClient: Spilled Records=477165
10/05/14 17:34:22 INFO mapred.JobClient: Map output bytes=10427755
10/05/14 17:34:22 INFO mapred.JobClient: Combine input records=1067656
10/05/14 17:34:22 INFO mapred.JobClient: Map output records=1067656
10/05/14 17:34:22 INFO mapred.JobClient: Reduce input records=185701

The results are written to gutenberg-output
[root@f180 hadoop-0.20.2]# bin/hadoop dfs -ls
Found 2 items
drwxr-xr-x - root supergroup 0 2010-05-14 17:29 /user/root/gutenberg
drwxr-xr-x - root supergroup 0 2010-05-14 17:34 /user/root/gutenberg-output
[root@f180 hadoop-0.20.2]# bin/hadoop dfs -ls gutenberg-output
Found 2 items
drwxr-xr-x - root supergroup 0 2010-05-14 17:33 /user/root/gutenberg-output/_logs
-rw-r--r-- 2 root supergroup 1353587 2010-05-14 17:34 /user/root/gutenberg-output/part-r-00000

Fetch the results back from HDFS
[root@f180 hadoop-0.20.2]# mkdir /tmp/gutenberg-output
[root@f180 hadoop-0.20.2]# bin/hadoop dfs -getmerge gutenberg-output /tmp/gutenberg-output
[root@f180 hadoop-0.20.2]# head /tmp/gutenberg-output/gutenberg-output
" 34
"'Course 1
"'Spells 1
"'Tis 1
"'Twas 1
"'Twere 1
"'army' 1
"(1) 1
"(Lo)cra" 1
"13 4

Finally, Hadoop provides several web UIs for reference.
http://f180:50030/ - web UI for MapReduce job tracker(s)
http://f180:50060/ - web UI for task tracker(s)
http://f180:50070/ - web UI for HDFS name node(s)


Sunday, May 9, 2010

MySQL Replication

What I'm doing here is a simple version of db replication, purely to replace the traditional mysqldump way of backing up the db.
If you want the powerful features, see MySQL Cluster instead.
Configuration is very simple. The mysql version here is 5.1.46, on FreeBSD 7.3-STABLE / 8.0-STABLE.
Start from /usr/local/share/mysql/my-large.cnf as the default:

# cp /usr/local/share/mysql/my-large.cnf /etc/my.cnf

The target setup: one sql master and one slave, with the master pushing updates to the slave in real time.

It's not limited to one master and one slave; you can have one master with many slaves, or a multi-tier hierarchy.

Step1. Master configuration
This is mainly done in my.cnf: tell the SQL server which role it plays, and set which database to replicate.

Make sure the following lines exist in my.cnf (the bare minimum)
# cat /etc/my.cnf
log-bin=mysql-bin # mysql replicates by recording to the binary log.
binlog-do-db = 100mountain # the database to replicate.
server-id = 1 # this machine's role.

Step2. Create the user and privileges needed for replication.

mysql> CREATE USER 'db_syncuser'@'%' IDENTIFIED BY 'syncuser_password';
mysql> GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'db_syncuser'@'%';
mysql> FLUSH PRIVILEGES;
mysql> SHOW MASTER STATUS;

Step3. restart mysql master server

# /usr/local/etc/rc.d/mysql-server restart

Step4. Slave configuration

# cat /etc/my.cnf (the bare minimum)
replicate-do-db = 100mountain # the database to replicate.
server-id = 2 # must be different from the master.

master-host = db_master_ip
master-user = db_syncuser
master-password = syncuser_password
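Note that the master-host/master-user/master-password options only exist on older servers (they were removed in MySQL 5.5). On newer versions you would point the slave at the master from SQL instead, roughly:

mysql> CHANGE MASTER TO
    ->   MASTER_HOST='db_master_ip',
    ->   MASTER_USER='db_syncuser',
    ->   MASTER_PASSWORD='syncuser_password';
mysql> START SLAVE;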

Step5. restart mysql slave server

# /usr/local/etc/rc.d/mysql-server restart

Step6. Test

Confirm the 100mountain database exists on both sides
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| 100mountain |
| mysql |
| test |
+--------------------+
4 rows in set (0.00 sec)

Confirm that both master and slave started properly (in the slave status, Slave_IO_Running and Slave_SQL_Running should both be Yes)
mysql> SHOW MASTER STATUS;
mysql> SHOW SLAVE STATUS;

mysql> use 100mountain; # at master
Database changed
mysql> create table dbtest (col1 INT);
Query OK, 0 rows affected (0.01 sec)

mysql> use 100mountain; # at slave
Database changed
mysql> show tables;
+-----------------------+
| Tables_in_100mountain |
+-----------------------+
| dbtest |
+-----------------------+
1 row in set (0.01 sec)

Wednesday, April 28, 2010

Ceph install on Fedora

Ceph, yet another distributed file system, grew out of someone's PhD thesis (the implementations behind foreign theses are impressively strong *_*). It's also POSIX-compatible, still under development, and not suitable for production use.

For an introduction to its architecture, Linux Magazine has a short piece: Ceph: The Distributed File System Creature from the Object Lagoon

I chose to install it on Fedora mainly because most of the file systems I've tested support RHEL- and SuSE-like operating systems better than other OSes, and I wanted to avoid unnecessary trouble.

OS: Fedora 12
Kernel: 2.6.34-rc3

Step1. Build Ceph Client
Ceph has already been merged into 2.6.34, so installation is relatively simple.

Check out new kernel source and compile

# cd /usr/src/kernels
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
# cd ceph-client/
# make menuconfig

Select Ceph as a module or built into the kernel.
Since Ceph has dropped its original Ebofs in favor of Btrfs, build Btrfs in as well.
After compiling and rebooting, the system should support both file systems.
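For reference, the kernel config symbols involved (here built as modules) are:
CONFIG_BTRFS_FS=m
CONFIG_CEPH_FS=m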

Step2. Prepare the OSD space
Repeat this step on every node meant to be an OSD.

Partitioning

You can use fdisk, cfdisk (util-linux-ng), or any other tool you're handy with to do the partitioning.

# fdisk /dev/cciss/c0d2

The number of cylinders for this disk is set to 8716.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): m
Command action
a toggle a bootable flag
b edit bsd disklabel
c toggle the dos compatibility flag
d delete a partition
l list known partition types
m print this menu
n add a new partition
o create a new empty DOS partition table
p print the partition table
q quit without saving changes
s create a new empty Sun disklabel
t change a partition's system id
u change display/entry units
v verify the partition table
w write table to disk and exit
x extra functionality (experts only)

Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-8716, default 1): 1
Last cylinder, +cylinders or +size{K,M,G} (1-8716, default 8716): 8716

Command (m for help): p

Disk /dev/cciss/c0d2: 36.4 GB, 36414750720 bytes
255 heads, 32 sectors/track, 8716 cylinders
Units = cylinders of 8160 * 512 = 4177920 bytes
Disk identifier: 0x0004f893

Device Boot Start End Blocks Id System
/dev/cciss/c0d2p1 1 8716 35561264 83 Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

# mkdir /mnt/btrfs
# cd /mnt/btrfs/
# mkdir osd0 osd1

Format the disks

# yum install btrfs-progs.i686
# mkfs.btrfs /dev/cciss/c0d2p1

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d2p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19

Mount the OSD local disks

# mount -t btrfs /dev/cciss/c0d2p1 /mnt/btrfs/osd0/
# mount -t btrfs /dev/cciss/c0d3p1 /mnt/btrfs/osd1
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_f172-lv_root
32G 6.2G 24G 21% /
tmpfs 503M 0 503M 0% /dev/shm
/dev/cciss/c0d0p1 194M 37M 147M 20% /boot
/dev/cciss/c0d2p1 34G 56K 32G 1% /mnt/btrfs/osd0
/dev/cciss/c0d3p1 34G 56K 32G 1% /mnt/btrfs/osd1

# umount /mnt/btrfs/osd0 /mnt/btrfs/osd1 (formatting and mounting the osd disks is done by the script in Step5 and Step6; here we just confirm the disks are OK)

Step3. Build the Ceph package
Repeat this step on all nodes

Get the Ceph source code

# cd /usr/src
# git clone git://ceph.newdream.net/ceph.git
Initialized empty Git repository in /usr/src/ceph/.git/
remote: Counting objects: 79397, done.
remote: Compressing objects: 100% (19438/19438), done.
remote: Total 79397 (delta 64545), reused 73253 (delta 58751)
Receiving objects: 100% (79397/79397), 15.10 MiB | 3.17 MiB/s, done.
Resolving deltas: 100% (64545/64545), done.

Install the prerequisites

# yum install boost.i686
# yum install boost-devel.i686
# yum install libedit-devel.i686
# yum install openssl.i686
# yum install openssl-devel.i686

Build Ceph

# cd /usr/src/ceph/
# more INSTALL
# ./autogen.sh
# ./configure
# make
# make install

Step4. Configuration

Make sure all nodes can log into each other over ssh without a password.
For how to do that, see: SSH Login Without Password

ceph.conf: make sure this part of the configuration is identical on every node

# cd /mnt/btrfs/
# mkdir mon0
# cd /usr/local/etc/ceph/
# cat ceph.conf
===
;
; Sample ceph ceph.conf file.
;
; This file defines cluster membership, the various locations
; that Ceph stores data, and any other runtime options.

; If a 'host' is defined for a daemon, the start/stop script will
; verify that it matches the hostname (or else ignore it). If it is
; not defined, it is assumed that the daemon is intended to start on
; the current host (e.g., in a setup with a startup.conf on each
; node).

; global
[global]
pid file = /var/run/ceph/$name.pid

; some minimal logging (just message traffic) to aid debugging
debug ms = 1

; monitor
[mon]
mon data = /mnt/btrfs/mon$id

[mon0]
host = node181
mon addr = 10.0.0.181:6789

; mds
[mds]

[mds.node181]
host = node181

; osd
[osd]
sudo = true
osd data = /mnt/btrfs/osd$id

[osd0]
host = node172

; if 'btrfs devs' is not specified, you're responsible for
; setting up the 'osd data' dir. if it is not btrfs, things
; will behave up until you try to recover from a crash (which
; usually fine for basic testing).
btrfs devs = /dev/cciss/c0d2p1
osd data = /mnt/btrfs/osd0

[osd1]
host = node172
btrfs devs = /dev/cciss/c0d3p1
osd data = /mnt/btrfs/osd1

[osd2]
host = node173
btrfs devs = /dev/cciss/c0d2p1
osd data = /mnt/btrfs/osd2

[osd3]
host = node173
btrfs devs = /dev/cciss/c0d3p1
osd data = /mnt/btrfs/osd3

; access control
[group everyone]
; you probably want to limit this to a small or a list of
; hosts. clients are fully trusted.
addr = 0.0.0.0/0

[mount /]
allow = %everyone
===

Step5. Create file system

[root@f181 ceph]# mkcephfs -c /usr/local/etc/ceph/ceph.conf --allhosts --mkbtrfs
/usr/local/bin/monmaptool --create --clobber --add 10.0.0.181:6789 --print /tmp/monmap.2319
/usr/local/bin/monmaptool: monmap file /tmp/monmap.2319
/usr/local/bin/monmaptool: generated fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43
epoch 1
fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43
last_changed 10.04.24 20:10:01.089929
created 10.04.24 20:10:01.089929
mon0 10.0.0.181:6789/0
/usr/local/bin/monmaptool: writing epoch 1 to /tmp/monmap.2319 (1 monitors)
max osd in /usr/local/etc/ceph/ceph.conf is 3, num osd is 4
/usr/local/bin/osdmaptool: osdmap file '/tmp/osdmap.2319'
/usr/local/bin/osdmaptool: writing epoch 1 to /tmp/osdmap.2319
Building admin keyring at /tmp/admin.keyring.2319
creating /tmp/admin.keyring.2319
Building monitor keyring with all service keys
creating /tmp/monkeyring.2319
importing contents of /tmp/admin.keyring.2319 into /tmp/monkeyring.2319
creating /tmp/keyring.mds.node181
importing contents of /tmp/keyring.mds.node181 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.0
importing contents of /tmp/keyring.osd.0 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.1
importing contents of /tmp/keyring.osd.1 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.2
importing contents of /tmp/keyring.osd.2 into /tmp/monkeyring.2319
creating /tmp/keyring.osd.3
importing contents of /tmp/keyring.osd.3 into /tmp/monkeyring.2319
=== mon0 ===
10.04.24 20:10:02.245494 store(/mnt/btrfs/mon0) mkfs
10.04.24 20:10:02.245722 store(/mnt/btrfs/mon0) test -d /mnt/btrfs/mon0 && /bin/rm -rf /mnt/btrfs/mon0 ; mkdir -p /mnt/btrfs/mon0
10.04.24 20:10:02.581627 mon0(starting).class v0 create_initial -- creating initial map
10.04.24 20:10:02.692877 mon0(starting).auth v0 create_initial -- creating initial map
10.04.24 20:10:02.692940 mon0(starting).auth v0 reading initial keyring
/usr/local/bin/mkmonfs: created monfs at /mnt/btrfs/mon0 for mon0
admin.keyring.2319 100% 119 0.1KB/s 00:00
=== mds.node181 ===
WARNING: no keyring specified for mds.node181
=== osd0 ===
umount: /mnt/btrfs/osd0: not mounted
umount: /dev/cciss/c0d2p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d2p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd0 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd0
WARNING: no keyring specified for osd0
=== osd1 ===
umount: /mnt/btrfs/osd1: not mounted
umount: /dev/cciss/c0d3p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d3p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd1 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd1
WARNING: no keyring specified for osd1
=== osd2 ===
umount: /mnt/btrfs/osd2: not mounted
umount: /dev/cciss/c0d2p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d2p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd2 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd2
WARNING: no keyring specified for osd2
=== osd3 ===
umount: /mnt/btrfs/osd3: not mounted
umount: /dev/cciss/c0d3p1: not mounted
FATAL: Module btrfs not found.

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/cciss/c0d3p1
nodesize 4096 leafsize 4096 sectorsize 4096 size 33.91GB
Btrfs Btrfs v0.19
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
monmap.2319 100% 187 0.2KB/s 00:00
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
created object store for osd3 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43 on /mnt/btrfs/osd3
WARNING: no keyring specified for osd3

Step6. Start the Ceph servers

[root@f181 ceph]# cd /usr/src/ceph/src/
[root@f181 src]# ./init-ceph -a -c /usr/local/etc/ceph/ceph.conf start
=== mon0 ===
Starting Ceph mon0 on node181...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting mon0 at 10.0.0.181:6789/0 mon_data /mnt/btrfs/mon0 fsid ef8475d4-52ea-d6f6-f75b-b331c34b9a43
=== mds.node181 ===
Starting Ceph mds.node181 on node181...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting mds.node181 at 0.0.0.0:6800/3219
=== osd0 ===
Mounting Btrfs on node172:/mnt/btrfs/osd0
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd0 on node172...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd0 at 0.0.0.0:6800/3484 osd_data /mnt/btrfs/osd0 (no journal)
=== osd1 ===
Mounting Btrfs on node172:/mnt/btrfs/osd1
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd1 on node172...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd1 at 0.0.0.0:6802/3717 osd_data /mnt/btrfs/osd1 (no journal)
=== osd2 ===
Mounting Btrfs on node173:/mnt/btrfs/osd2
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd2 on node173...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd2 at 0.0.0.0:6800/3310 osd_data /mnt/btrfs/osd2 (no journal)
=== osd3 ===
Mounting Btrfs on node173:/mnt/btrfs/osd3
FATAL: Module btrfs not found.
Scanning for Btrfs filesystems
failed to read /dev/loop7
failed to read /dev/loop6
failed to read /dev/loop5
failed to read /dev/loop4
failed to read /dev/loop3
failed to read /dev/loop2
failed to read /dev/loop1
failed to read /dev/loop0
Starting Ceph osd3 on node173...
** WARNING: Ceph is still under heavy development, and is only suitable for **
** testing and review. Do not trust it with important data. **
starting osd3 at 0.0.0.0:6802/3548 osd_data /mnt/btrfs/osd3 (no journal)

Step7. Mount!

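If the mount point doesn't exist yet, create it first:
# mkdir -p /mnt/ceph
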
[root@f181 src]# mount -t ceph 10.0.0.181:/ /mnt/ceph/
[root@f181 src]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_f181-lv_root
32G 6.2G 24G 21% /
tmpfs 503M 0 503M 0% /dev/shm
/dev/cciss/c0d0p1 194M 33M 152M 18% /boot
10.0.0.181:/ 136G 7.0M 128G 1% /mnt/ceph

Step8. Test

[root@f181 src]# dd if=/dev/zero of=/mnt/ceph/bigfile bs=4k count=10000k
10240000+0 records in
10240000+0 records out
41943040000 bytes (42 GB) copied, 1093.68 s, 38.4 MB/s
The data gets written across all the disks.

Final. Observations

1. The amount of space written keeps fluctuating (my guess is that it's doing balancing).
2. df can't reflect actual file sizes, and the individual osds don't reflect actual space usage either.

ref. http://ceph.newdream.net/wiki/Main_Page

Thursday, April 22, 2010

SSH Login Without Password

Since this gets referenced from several posts, I'm splitting it out on its own.

When you have a lot of machines, password authentication becomes quite a burden, and key-based authentication naturally becomes the first choice!

Step1. Generate a key pair
Generate the key needed for authentication

[root@nodeB ~] # ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase): # note: this is the passphrase protecting the private key; if you want auto login, don't enter anything.
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
2d:52:2c:9f:d9:f8:7a:35:83:d8:12:7e:cc:4d:2a:ff root@nodeB
The key's randomart image is:
+--[ RSA 2048]----+
| |
| . |
| . o |
| +.* . |
| ..S*o= |
| .=oB = |
| =.o o |
| .o E |
| .. |
+-----------------+

Step2. Deploy the public key
To ssh from nodeB into nodeA without a password,
put nodeB's public key on nodeA, in /UserAccount/.ssh/authorized_keys

[root@nodeA ~] # cat /root/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwLmrP46JukEdUjHhK1tsyhHgTtKd65nlxHjSdVDBtXLvYukYMij3qX1qLOG9x2gQBWt2WBOyacULi5HajexMuU9aQTTTYdHXbA+Vyn/tO26
NkGVXiQm0LFZpRBqQBrhUcvTeOD0yHmMC9iqbEUBpyQXA8XELZNoKrYGOIKiPHpfduaeYfnt7nZ7T7Nou/MoFbpuvXmfpeVg6S01gPMOKXpkgFrigwtcw/W59d7ia5I6rhPEIrkHQjGGN/Q
N6NCifvtkIeWt+UY2lz1f+hhmvtKC6esiY3y7U3U9iLrSHBLHh+Tm5KLcKhJw1qgnDu8vsj7uL5DMg6kk98CZKuKi+== root@nodeB
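If ssh-copy-id is available (it ships with OpenSSH), it does this step for you in one shot:
[root@nodeB ~] # ssh-copy-id root@nodeA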

Step3. Test the connection

[root@nodeB ~] # ssh -2 nodeA
Last login: Thu Apr 22 12:33:07 2010 from nodeB
[root@nodeA ~] #


Tuesday, April 13, 2010

Install Fedora Over HTTP

I need to test a few more distributed file systems lately, so I dug up some old machines to experiment with...

Once there are many machines, you have to think about how to install quickly. It turns out humans really are lazy creatures... zzz...

PXE, simply put, is a protocol that combines several others, including DHCP and TFTP, to let devices boot over the network. Here I'm using Gentoo Linux to provide the boot service and installing Fedora 12 over HTTP.

Of course, first make sure your client's NIC supports PXE.

Step1. Install the packages the boot service needs
gentoo# emerge -uD net-misc/dhcp
gentoo# emerge -uD net-ftp/atftp
gentoo# emerge -uD sys-boot/syslinux

Step2. Configure the boot service
gentoo# cat /etc/dhcp/dhcpd.conf
---
option domain-name "twaren.net";
option domain-name-servers 8.8.8.8, 211.79.61.4;
default-lease-time 600;
max-lease-time 7200;
ddns-update-style interim;
log-facility local7;
allow booting;
allow bootp;
ignore client-updates;

subnet 211.79.x.128 netmask 255.255.255.128 {
range 211.79.x.172 211.79.x.181;
option subnet-mask 255.255.255.128;
option broadcast-address 211.79.x.255;
option routers 211.79.x.254;
next-server 211.79.x.154; #specify the boot server
filename "pxelinux.0"; #specify the boot file
}
---
gentoo# cp /usr/share/syslinux/pxelinux.0 /home/tftpd
gentoo# wget http://ftp.twaren.net/Linux/Fedora/linux/releases/12/\
Fedora/i386/os/images/pxeboot/initrd.img
gentoo# wget http://ftp.twaren.net/Linux/Fedora/linux/releases/12/\
Fedora/i386/os/images/pxeboot/vmlinuz
gentoo# cat /home/tftpd/pxelinux.cfg/default #you can serve different hosts different boot images
---
prompt 1
default pxeboot
timeout 50

label pxeboot
kernel vmlinuz
append initrd=initrd.img

ONERROR LOCALBOOT 0
---
gentoo# find /home/tftpd/ -type f
/home/tftpd/pxelinux.0
/home/tftpd/initrd.img
/home/tftpd/vmlinuz
/home/tftpd/pxelinux.cfg/default

Step3. Start the boot service
gentoo# /etc/init.d/dhcpd start
gentoo# atftpd --logfile /var/log/tftpd.log --daemon /home/tftpd/

Step4. Boot the client and install the OS
Basically, if all of the above works and your client NIC supports PXE, the client should get an IP, download the kernel, and start the installation process.

Choosing URL as the installation source lets you install over ftp or http.

This still isn't complete; it needs manual intervention. Something like Red Hat's Kickstart should be layered on top, roughly as sketched below.
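For reference, wiring in a kickstart file would only take an extra boot parameter in pxelinux.cfg/default (the ks.cfg URL below is hypothetical):

label pxeboot
kernel vmlinuz
append initrd=initrd.img ks=http://211.79.x.154/ks.cfg
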
ref: