Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › Tasking for a Clustered System.
Tasking for a Clustered System.

post #1 of 5
Thread Starter 
I have a TON of data that needs to be processed, so I was going to create a cluster of Raspberry Pi 3s to process all of it. So, the question is ... what's the best approach for telling the 50 or so nodes what to do? I can have them query a DB and pick from items in a table, or I can set up some master/slave relationship where one of the devices tells the others what to do.

I am leaning towards them being independent and querying the DB for their tasks. Add a heartbeat, and the nodes identify themselves by their DHCP-assigned IP addresses. My issue with this approach is: how do you update all the nodes if the server DB IP changes? How can I nicely update all the nodes? I guess that's probably another service/application. This will all be on Linux, not sure what distro yet... so feel free to toss in your 2 cents.
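For illustration, here is a minimal sketch of the "independent nodes pull from a table" idea with a heartbeat. It uses Python's built-in sqlite3 only as a stand-in for whatever SQL server is actually used; the `tasks` table, its columns, and the function names are all assumptions made up for this example, not anything from the thread.

```python
import sqlite3
import time

# Hypothetical schema: one row per unit of work.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id         INTEGER PRIMARY KEY,
    payload    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',
    claimed_by TEXT,
    heartbeat  REAL
)"""


def claim_task(conn, node_id, now=None):
    """Claim the oldest pending task; return (id, payload) or None."""
    now = time.time() if now is None else now
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim safe if two nodes race for one row.
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', claimed_by = ?, heartbeat = ? "
            "WHERE id = ? AND status = 'pending'",
            (node_id, now, row[0]),
        )
        return row if cur.rowcount == 1 else None


def beat(conn, task_id, node_id, now=None):
    """Refresh the heartbeat for a task this node is still working on."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "UPDATE tasks SET heartbeat = ? WHERE id = ? AND claimed_by = ?",
            (now, task_id, node_id),
        )


def requeue_stale(conn, timeout_s, now=None):
    """Put tasks whose node stopped beating back in the queue; return count."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET status = 'pending', claimed_by = NULL, "
            "heartbeat = NULL WHERE status = 'running' AND heartbeat < ?",
            (now - timeout_s,),
        )
        return cur.rowcount
```

Each node would loop on `claim_task`, call `beat` periodically while it works, and some housekeeping job would call `requeue_stale` so that work held by a dead node gets picked up again.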
post #2 of 5
A single Xeon would be faster, cheaper, and way more boring. But if you are dead set on a cluster for lab purposes, you're gonna want static IP addresses for each node to start. Better do 48 instead of 50 so that you can use a reasonably priced switch for your backbone. No gigabit necessary; a 48-port switch with a gigabit uplink should be cheap enough. For that many nodes you're gonna want MPI or Hadoop, or some such. Hadoop should be easier to manage, but the performance loss due to Java will be noticeable in this case. Make sure to use GPU acceleration, as the GPU is actually way more powerful than the CPU cores in a Raspberry Pi. You're gonna have a metric ton of cables to manage, so better plan ahead. Google "raspberry pi cluster" for some ideas.
post #3 of 5
It really depends on how you want to .. 'do' it. Kafka would be an excellent way to introduce messaging in the master / slave paradigm. However, I'm not sure how it would scale down to the size of the Pi3's architecture.

Gamester3333 seems to think the performance hit of Java on the Pi3 would be extensive. I don't know from experience whether the lack of hardware acceleration or other support would validate his claims, or whether he is basing them solely on old anecdotes about the efficiency of Java compared to native code (it's now almost universally within 3%).

The choice of a binary vs. character-based protocol would carry a much heavier penalty, with character encoding coming in roughly 35% less performant due to the encoding and decoding operations in the protocol.

I am not sure if anyone has ever put this kind of tiny system in a virtual cluster, but one way to manage the networking and other issues would be to use 1-to-1 mappings of VMs to hosts, allowing the VM tool to control the cluster.

If you have a known dataset size, you could always partition the DB record IDs with a modulus so that each node selects its own set, and paginate through the result sets via SQL.
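A rough sketch of that modulus-plus-pagination idea, again using sqlite3 purely as a stand-in. The `records` table, its `id` and `payload` columns, and the function name are hypothetical; it also assumes positive integer IDs and that every node knows its own index and the total node count.

```python
import sqlite3


def iter_partition(conn, node_index, node_count, page_size=100):
    """Yield (id, payload) rows where id % node_count == node_index.

    Pages are fetched by remembering the last id seen (keyset pagination),
    so each query only reads one page. Assumes ids are positive integers.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM records "
            "WHERE id % ? = ? AND id > ? ORDER BY id LIMIT ?",
            (node_count, node_index, last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]
```

With 48 nodes, node 7 would call `iter_partition(conn, 7, 48)` and never see a row that belongs to another node, so no claiming or locking is needed. The trade-off is that a dead node's share sits untouched until something else takes over its index.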

Otherwise, you could use Kafka, as stated, to be 'the database': partition the messages via a numbering mechanism in the Kafka partition name, and associate each partition directly with a single node. This could be useful for creating the database: even with an unknown number of records, managing the inserts so they go round-robin across the partitions would distribute the load quasi-evenly.
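To show why round-robin inserts even out the load without knowing the record count up front, here is a small plain-Python simulation. It does not use any Kafka client library; the function name and the dict-of-lists standing in for partitions are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import cycle


def round_robin_assign(records, partition_count):
    """Spread records over numbered partitions one at a time, in rotation.

    `records` can be any iterable, including a stream of unknown length.
    Returns {partition_number: [records...]}.
    """
    partitions = defaultdict(list)
    targets = cycle(range(partition_count))
    for record in records:
        partitions[next(targets)].append(record)
    return dict(partitions)
```

Partition sizes can differ by at most one record, which is the "quasi-even" distribution: with one partition per node, each node gets an almost equal number of items regardless of how many arrive in total. It says nothing about items that take unequal time to process.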

This would require an external Kafka host on your network, as well as a machine to 'insert' the dataset into the queues. These could be one and the same if the machine were strong enough; however, you may run into issues where network capacity or I/O at this service is not sufficient to feed the swarm.

I don't think the Pi3 would be a good option to try to run Hadoop on, or any of its services, but this is just an anecdotal guess based on memory and other consumption habits. You most likely could use the queue as an extremely effective distribution platform outside of the cluster, though.
post #4 of 5
Thread Starter 
I will be transcoding 500+ TB of video, and I was able to get a build working with codecs that use the onboard H.264 chip, which makes the nodes pretty fast and cheap. It's possible that I could build a PC with multiple GPUs and work that way, but I don't think it will be as fast. I will give it a bit more testing and do a bit more googling just in case.

Since I already have a SQL DB on a server, I probably should just use that. I guess my main issue would then be how to update all of the nodes. So, like in my example above, if the IP for the SQL DB changes, how do I tell all the nodes to switch over? Since their IPs will be assigned by DHCP, I would probably have a table set up with the last update from each node, the IP it is using, the version of my software, the config version, and its MAC address as a unique identifier. From there, if the SQL DB changes, I will have a list of IPs of all the devices, but is there a nice way to bulk send a file? Perhaps I could write a separate application to place in init.d that runs before the main application and checks a server and a backup server to see if it needs to update the application or config. Then, if I need to update a device, I would just put a flag in the DB for it to reboot.
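A minimal sketch of that node table, keyed by MAC address, with a query for nodes whose config is out of date. As before, sqlite3 stands in for the real SQL server, and the `nodes` table, its column names, and both function names are hypothetical.

```python
import sqlite3
import time

# Hypothetical registry: one row per node, keyed by MAC address.
NODES_SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (
    mac              TEXT PRIMARY KEY,
    ip               TEXT NOT NULL,
    software_version TEXT NOT NULL,
    config_version   INTEGER NOT NULL,
    last_seen        REAL NOT NULL
)"""


def check_in(conn, mac, ip, software_version, config_version, now=None):
    """Record (or overwrite) what a node reports about itself."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO nodes "
            "(mac, ip, software_version, config_version, last_seen) "
            "VALUES (?, ?, ?, ?, ?)",
            (mac, ip, software_version, config_version, now),
        )


def nodes_needing_update(conn, wanted_config_version):
    """Return [(mac, ip)] for nodes running an older config version."""
    return conn.execute(
        "SELECT mac, ip FROM nodes WHERE config_version < ? ORDER BY mac",
        (wanted_config_version,),
    ).fetchall()
```

Because the MAC is the key, a node that comes back with a new DHCP lease just overwrites its old row instead of showing up twice. The startup updater described above could compare its local config version against the wanted one and fetch the new files before launching the main application.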
post #5 of 5
Those questions are exactly why I was pointing you toward putting a hypervisor on the Pi3s and having Chef / Puppet / Vagrant manage the creation of the VMs, because one of the things they can do is give you host and namespace resolution.

Vagrant has solved this particular problem with the hosts plugin, almost completely.

The other option is to just give your DB a known hostname and put a DNS server in front of the Pi3s so that they resolve the IP instead of directly accessing a raw IP. I'm not sure why A. the IP would change on you if this is all on a local network, or B. each node would need to know about the others.
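On the node side, the hostname approach could look something like this sketch, which resolves the name each time a node is about to connect rather than caching an address. The function name is made up, and the hostname and port in the usage note are placeholders.

```python
import socket


def resolve_db_host(hostname, port):
    """Return the sorted IP addresses the resolver currently gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

A node would call something like `resolve_db_host("transcode-db.lan", 5432)` (both values hypothetical) before opening its DB connection. If the server moves, only the DNS record changes and none of the nodes need a new config file.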