[debian][puppet] Trouble with redis and zookeeper


#1

Hi,

I have closely followed the installation guide here:
http://docs.openio.io/install-guide-debian/installation.html

…and everything went quite well until the initialization.

The zk-bootstrap.py script failed partway through:

2016-12-14 20:01:36,579:30994(0x7ff965f5f700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2016-12-14 20:01:36,579:30994(0x7ff965f5f700):ZOO_INFO@log_env@716: Client environment:host.name=openio-test-01
2016-12-14 20:01:36,579:30994(0x7ff965f5f700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2016-12-14 20:01:36,579:30994(0x7ff965f5f700):ZOO_INFO@log_env@724: Client environment:os.arch=4.3.0-0.bpo.1-amd64
2016-12-14 20:01:36,579:30994(0x7ff965f5f700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 4.3.5-1~bpo8+1 (2016-02-23)
2016-12-14 20:01:36,580:30994(0x7ff965f5f700):ZOO_INFO@log_env@733: Client environment:user.name=root
2016-12-14 20:01:36,580:30994(0x7ff965f5f700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2016-12-14 20:01:36,580:30994(0x7ff965f5f700):ZOO_INFO@log_env@753: Client environment:user.dir=/root
2016-12-14 20:01:36,580:30994(0x7ff965f5f700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=172.xx.yy.39:6005,172.xx.yy.13:6005,172.xx.yy.11:6005 sessionTimeout=10000 watcher=(nil) sessionId=0 sessionPasswd=<null> context=(nil) flags=0
2016-12-14 20:01:36,580:30994(0x7ff961909700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [172.xx.yy.11:6005] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
Traceback (most recent call last):
  File "/usr/bin/zk-bootstrap.py", line 167, in 
    main()
  File "/usr/bin/zk-bootstrap.py", line 160, in main
    zookeeper.create(zh, PREFIX, '', acl_openbar, 0)
zookeeper.ConnectionLossException: connection loss
2016-12-14 20:01:36,581:30994(0x7ff961909700):ZOO_INFO@check_events@1703: initiated connection to server [172.xx.yy.13:6005]
2016-12-14 20:01:36,588:30994(0x7ff961909700):ZOO_INFO@check_events@1750: session establishment complete on server [172.xx.yy.13:6005], sessionId=0x258feb5a4930004, negotiated timeout=10000

…resulting in the following cluster state:

+---------+-------------------+---------------------------------+----------------+-------+-------+-------+
| Type    | Id                | Volume                          | Location       | Slots | Up    | Score |
+---------+-------------------+---------------------------------+----------------+-------+-------+-------+
| rdir    | 172.xx.yy.39:6010 | /var/lib/oio/sds/OPENIO/rdir-0  | openio-test-01 | n/a   | True  |     0 |
| rdir    | 172.xx.yy.13:6010 | /var/lib/oio/sds/OPENIO/rdir-0  | openio-test-02 | n/a   | True  |     0 |
| rdir    | 172.xx.yy.11:6010 | /var/lib/oio/sds/OPENIO/rdir-0  | openio-test-05 | n/a   | True  |     0 |
| account | 172.xx.yy.39:6009 | n/a                             | openio-test-01 | n/a   | True  |     0 |
| account | 172.xx.yy.13:6009 | n/a                             | openio-test-02 | n/a   | True  |     0 |
| account | 172.xx.yy.11:6009 | n/a                             | openio-test-05 | n/a   | True  |     0 |
| rawx    | 172.xx.yy.39:6004 | /var/lib/oio/sds/OPENIO/rawx-0  | openio-test-01 | n/a   | True  |     0 |
| rawx    | 172.xx.yy.13:6004 | /var/lib/oio/sds/OPENIO/rawx-0  | openio-test-02 | n/a   | True  |     0 |
| rawx    | 172.xx.yy.11:6004 | /var/lib/oio/sds/OPENIO/rawx-0  | openio-test-05 | n/a   | True  |     0 |
| meta2   | 172.xx.yy.39:6003 | /var/lib/oio/sds/OPENIO/meta2-0 | openio-test-01 | n/a   | True  |     0 |
| meta2   | 172.xx.yy.13:6003 | /var/lib/oio/sds/OPENIO/meta2-0 | openio-test-02 | n/a   | True  |     0 |
| meta2   | 172.xx.yy.11:6003 | /var/lib/oio/sds/OPENIO/meta2-0 | openio-test-05 | n/a   | True  |     0 |
| meta1   | 172.xx.yy.39:6002 | n/a                             | openio-test-01 | n/a   | False |     0 |
| meta1   | 172.xx.yy.13:6002 | n/a                             | openio-test-02 | n/a   | False |     0 |
| meta1   | 172.xx.yy.11:6002 | n/a                             | openio-test-05 | n/a   | False |     0 |
| meta0   | 172.xx.yy.39:6001 | n/a                             | openio-test-01 | n/a   | False |     0 |
| meta0   | 172.xx.yy.13:6001 | n/a                             | openio-test-02 | n/a   | False |     0 |
| meta0   | 172.xx.yy.11:6001 | n/a                             | openio-test-05 | n/a   | False |     0 |
+---------+-------------------+---------------------------------+----------------+-------+-------+-------+

After waiting a bit, I re-ran zk-bootstrap.py, and this time it no longer complained about server 172.xx.yy.11 (it looks like the client just blacklisted that server and doesn’t even try):

2016-12-14 20:03:14,550:31011(0x7fd4806a1700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2016-12-14 20:03:14,550:31011(0x7fd4806a1700):ZOO_INFO@log_env@716: Client environment:host.name=openio-test-01
2016-12-14 20:03:14,550:31011(0x7fd4806a1700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2016-12-14 20:03:14,550:31011(0x7fd4806a1700):ZOO_INFO@log_env@724: Client environment:os.arch=4.3.0-0.bpo.1-amd64
2016-12-14 20:03:14,550:31011(0x7fd4806a1700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 4.3.5-1~bpo8+1 (2016-02-23)
2016-12-14 20:03:14,551:31011(0x7fd4806a1700):ZOO_INFO@log_env@733: Client environment:user.name=root
2016-12-14 20:03:14,551:31011(0x7fd4806a1700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2016-12-14 20:03:14,551:31011(0x7fd4806a1700):ZOO_INFO@log_env@753: Client environment:user.dir=/root
2016-12-14 20:03:14,551:31011(0x7fd4806a1700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=172.xx.yy.39:6005,172.xx.yy.13:6005,172.xx.yy.11:6005 sessionTimeout=10000 watcher=(nil) sessionId=0 sessionPasswd=<null> context=(nil) flags=0
2016-12-14 20:03:14,552:31011(0x7fd47c04b700):ZOO_INFO@check_events@1703: initiated connection to server [172.xx.yy.13:6005]
2016-12-14 20:03:14,559:31011(0x7fd47c04b700):ZOO_INFO@check_events@1750: session establishment complete on server [172.xx.yy.13:6005], sessionId=0x258feb5a4930005, negotiated timeout=10000
 > batch(1,0) in 0.00468993186951s
 > batch(1,0) in 0.00433707237244s
 > batch(1,0) in 0.00395202636719s
 > batch(1,0) in 0.00404500961304s
...
 > batch(2048,0) in 0.0694358348846s
 > batch(2048,0) in 0.067773103714s
 > batch(2048,0) in 0.0778110027313s
 > batch(2047,0) in 0.102826118469s
Created nodes : ok 135689 ko 0
2016-12-14 20:03:22,885:31011(0x7fd4806a1700):ZOO_INFO@zookeeper_close@2505: Closing zookeeper sessionId=0x258feb5a4930005 to [172.xx.yy.13:6005]

Nevertheless, the cluster status afterwards showed everything up:

+---------+-------------------+---------------------------------+----------------+-------+------+-------+
| Type    | Id                | Volume                          | Location       | Slots | Up   | Score |
+---------+-------------------+---------------------------------+----------------+-------+------+-------+
| rdir    | 172.xx.yy.39:6010 | /var/lib/oio/sds/OPENIO/rdir-0  | openio-test-01 | n/a   | True |     0 |
| rdir    | 172.xx.yy.13:6010 | /var/lib/oio/sds/OPENIO/rdir-0  | openio-test-02 | n/a   | True |     0 |
| rdir    | 172.xx.yy.11:6010 | /var/lib/oio/sds/OPENIO/rdir-0  | openio-test-05 | n/a   | True |     0 |
| account | 172.xx.yy.39:6009 | n/a                             | openio-test-01 | n/a   | True |     0 |
| account | 172.xx.yy.13:6009 | n/a                             | openio-test-02 | n/a   | True |     0 |
| account | 172.xx.yy.11:6009 | n/a                             | openio-test-05 | n/a   | True |     0 |
| rawx    | 172.xx.yy.39:6004 | /var/lib/oio/sds/OPENIO/rawx-0  | openio-test-01 | n/a   | True |     0 |
| rawx    | 172.xx.yy.13:6004 | /var/lib/oio/sds/OPENIO/rawx-0  | openio-test-02 | n/a   | True |     0 |
| rawx    | 172.xx.yy.11:6004 | /var/lib/oio/sds/OPENIO/rawx-0  | openio-test-05 | n/a   | True |     0 |
| meta2   | 172.xx.yy.39:6003 | /var/lib/oio/sds/OPENIO/meta2-0 | openio-test-01 | n/a   | True |     0 |
| meta2   | 172.xx.yy.13:6003 | /var/lib/oio/sds/OPENIO/meta2-0 | openio-test-02 | n/a   | True |     0 |
| meta2   | 172.xx.yy.11:6003 | /var/lib/oio/sds/OPENIO/meta2-0 | openio-test-05 | n/a   | True |     0 |
| meta1   | 172.xx.yy.39:6002 | /var/lib/oio/sds/OPENIO/meta1-0 | openio-test-01 | n/a   | True |     0 |
| meta1   | 172.xx.yy.13:6002 | /var/lib/oio/sds/OPENIO/meta1-0 | openio-test-02 | n/a   | True |     0 |
| meta1   | 172.xx.yy.11:6002 | /var/lib/oio/sds/OPENIO/meta1-0 | openio-test-05 | n/a   | True |     0 |
| meta0   | 172.xx.yy.39:6001 | /var/lib/oio/sds/OPENIO/meta0-0 | openio-test-01 | n/a   | True |     0 |
| meta0   | 172.xx.yy.13:6001 | /var/lib/oio/sds/OPENIO/meta0-0 | openio-test-02 | n/a   | True |     0 |
| meta0   | 172.xx.yy.11:6001 | /var/lib/oio/sds/OPENIO/meta0-0 | openio-test-05 | n/a   | True |     0 |
+---------+-------------------+---------------------------------+----------------+-------+------+-------+

The following steps, openio --oio-ns=OPENIO cluster unlockall and openio --oio-ns=OPENIO directory bootstrap --replicas 3, went fine.

After checking node openio-test-05 (172.xx.yy.11), the one zookeeper refused to connect to, and looking into its logs, I figured out that the myid configuration actually has to be sequential:

2016-12-14 20:01:18,951 - ERROR [main:QuorumPeer@171] - Setting LearnerType to PARTICIPANT but 5 not in QuorumPeers. 
2016-12-14 20:01:18,955 - INFO  [main:QuorumPeer@444] - acceptedEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
2016-12-14 20:01:18,969 - ERROR [main:QuorumPeerMain@89] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 5 not in the peer list
        at org.apache.zookeeper.server.quorum.QuorumPeer.startLeaderElection(QuorumPeer.java:479)
        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:411)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

After changing the myid in the puppet manifest from 5 to 3 (i.e. the 3rd node in the cluster) and re-applying the change, zookeeper started to work on that node as well.
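For reference, the corrected resource on openio-test-05 looks roughly like this. This is only a sketch: apart from myid, the parameter names are my guesses, modelled on the redis entry shown further down:

openiosds::zookeeper {'zookeeper-0':
  ns        => 'OPENIO',
  ipaddress => $ipaddress,
  # hypothetical name for the quorum member list; the module numbers these
  # sequentially, so valid myid values here are 1, 2 and 3
  servers   => ['172.xx.yy.39', '172.xx.yy.13', '172.xx.yy.11'],
  myid      => 3, # was 5, which is not in the generated server list
}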

This still left me with a non-working redis on both “slave” nodes (openio-test-02 and openio-test-05).

The zookeeper troubleshooting gave me a hint on where to start looking. In the configuration file /etc/oio/sds/OPENIO/redis-0/redis.conf, the last line is clearly mangled:

...
aof-rewrite-incremental-fsync yes
maxclients 10000
maxmemory 0
protected-mode noslaveof 172.xx.yy.39 6011

After fixing that line and restarting redis via gridinit_cmd restart OPENIO-redis-0, that problem was gone immediately as well:

# gridinit_cmd status
KEY                       STATUS      PID GROUP
OPENIO-account-0          UP        22568 OPENIO,account,account-0
OPENIO-beanstalkd-0       UP        22522 OPENIO,beanstalkd,beanstalkd-0
OPENIO-conscienceagent-0  UP        22363 OPENIO,conscienceagent,conscienceagent-0
OPENIO-meta0-0            UP        22678 OPENIO,meta0,meta0-0
OPENIO-meta1-0            UP        22688 OPENIO,meta1,meta1-0
OPENIO-meta2-0            UP        22574 OPENIO,meta2,meta2-0
OPENIO-oio-blob-indexer-0 UP        22426 OPENIO,oio-blob-indexer,oio-blob-indexer-0
OPENIO-oio-event-agent-0  UP        22585 OPENIO,oio-event-agent,oio-event-agent-0
OPENIO-oioproxy-0         UP        22593 OPENIO,oioproxy,oioproxy-0
OPENIO-rawx-0             UP        22352 OPENIO,rawx,rawx-0
OPENIO-rdir-0             UP        22545 OPENIO,rdir,rdir-0
OPENIO-redis-0            UP        27409 OPENIO,redis,redis-0
OPENIO-redissentinel-0    UP        22558 OPENIO,redissentinel,redissentinel-0
OPENIO-zookeeper-0        UP        27214 OPENIO,zookeeper,zookeeper-0

So, I suspect an error in the puppet module.

The entry for the slaves looks as follows (taken from the documentation):

openiosds::redis {'redis-0':
  ns        => 'OPENIO',
  ipaddress => $ipaddress,
  slaveof   => '172.xx.yy.39 6011',
}
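For comparison, on the master node (openio-test-01) the same resource would presumably just omit the slaveof parameter:

openiosds::redis {'redis-0':
  ns        => 'OPENIO',
  ipaddress => $ipaddress,
  # no slaveof here: my assumption is that without it the module
  # configures this instance as the redis master
}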

So, I think newcomers to OpenIO need a troubleshooting guide that at least points them to where to start looking (log files, configuration files, …). In many cases people will find solutions on their own and contribute them back.


#2

Hi @ancoron

Thanks a lot for your feedback; we appreciate it a lot.

A troubleshooting section on our documentation website is definitely needed.
We are working every day to improve the documentation, and this section will come soon.

To answer your questions, and for people who will face the same issue (see the short example below):

  • Log files are stored in /var/log/oio
  • Configuration files are stored in /etc/oio/sds
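For example, on any node:

# list the log and configuration files for the OPENIO namespace
ls /var/log/oio
ls /etc/oio/sds/OPENIO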

Guillaume.


#3

Hello and thank you for your detailed feedback!

I see that you set ids 1, 2 and 5 (instead of 1, 2 and 3), which caused your problems with zookeeper.

I agree, this part of the puppet install documentation is not very clear; it could really benefit from a complete example manifest for each server.


#4

Hi, thanks for the feedback.

After reading the zookeeper docs in more detail, I came across the exact reason for the problem:

The configuration file lists the contact information for each server identified by its server id. When a ZooKeeper server instance starts, it reads its id from the myid file and then, using that id, reads from the configuration file, looking up the port on which it should listen.

This totally explains it: the server read “5” from /var/lib/oio/sds/OPENIO/zookeeper-0/data/myid but then only saw IP/port configurations for servers with IDs “1”, “2” and “3” in zoo.cfg.
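In other words, the two files have to agree, roughly like this (the peer/election ports below are illustrative; only the 6005 client port appears in the logs above):

# /etc/oio/sds/OPENIO/zookeeper-0/zoo.cfg (excerpt)
server.1=172.xx.yy.39:2888:3888
server.2=172.xx.yy.13:2888:3888
server.3=172.xx.yy.11:2888:3888

# /var/lib/oio/sds/OPENIO/zookeeper-0/data/myid on openio-test-05
3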

Looking into the puppet manifest and the zoo.cfg template, I see that the server IDs are generated sequentially from the given array.

As an improvement, would it be possible to autodetect from the given IP addresses which system we are configuring, and then use that server ID to generate the “myid” content?

In the case of multiple zookeepers on one system (different IP addresses or ports), the override of specifying a “myid” explicitly would still work. So, if “myid” is not specified, match an entry from the array using $ipaddress, as in the sketch below.
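Something like this, perhaps; a rough sketch only, where inline_template is standard Puppet but the variable names are made up:

$servers = ['172.xx.yy.39', '172.xx.yy.13', '172.xx.yy.11']
# myid becomes the 1-based position of this node's address in the list;
# this assumes $ipaddress is actually present, and a real implementation
# should fail (or fall back to an explicit myid parameter) otherwise
$computed_myid = inline_template('<%= @servers.index(@ipaddress).to_i + 1 %>')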

Would that work or am I on the wrong track?


#5

Hi ancoron, and thanks for your feedback.

For the ZooKeeper ids, I’ll take a look at other Puppet modules and update ours to remove the confusion. I have been meaning to do this for a long time but never took the time; I’ll do it next week!

For your Redis problem, I think you have a buggy version of our Puppet module; I recommend updating the package puppet-module-openio-openiosds to version 1.1.49, which is available in our 16.10 repository.
The problem is in the redis.conf template: it lacks a newline after the protected-mode setting, so the slaveof directive gets appended to the same line. The end of the file should read:
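protected-mode no
slaveof 172.xx.yy.39 6011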

Let me know if it helps you!


#6

Sorry for the delay on this.

Just added a fourth node to the cluster using the exact same steps with the new package versions, and everything is working just fine.


#7

OK, nice.

I’ve added auto-detection of the id based on the zookeeper servers list. Hopefully we will release it soon.