Looking For A Giga-Record Database

post #1 of 15
Thread Starter 
I'm in the process of planning an NNTP robot for KDE 4.x. The plan is for it to have feature parity with NewsLeecher, which I've been using for quite a while. Unfortunately, NewsLeecher has issues.

What I'm planning will not be simply an NZB downloader, but a fully fledged NNTP client that stores headers locally and allows local searching. However, some newsgroups can have header counts reaching into the hundreds of millions and sometimes further - I've seen a header count of over 3 billion for one particular newsgroup.
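
To put rough numbers on that, here is a back-of-the-envelope sketch. The 200 bytes per stored header is purely an assumption for illustration (the real figure depends on which fields are kept and how they are encoded); the 3 billion count is the one mentioned above.

```python
HEADERS = 3_000_000_000      # largest header count mentioned above
BYTES_PER_RECORD = 200       # assumption: group, message ID and a few display fields

total_bytes = HEADERS * BYTES_PER_RECORD
print(f"Raw data: {total_bytes / 10**9:.0f} GB")

# A signed 32-bit counter tops out at 2**31 - 1, which is below 3 billion.
fits_in_int32 = HEADERS <= 2**31 - 1
print(f"Fits in a signed 32-bit integer: {fits_in_int32}")
```

Under that assumption, a single large group would need hundreds of gigabytes before any index overhead, and record counters or row IDs would have to be 64-bit.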

So what I'm looking for is an embeddable database capable of handling that many records, with stability and a decent level of performance. I've been looking at NoSQL databases, but I'm not sure whether key/value or document databases would be a better fit.

Or should I tie the project directly into Akonadi and use that to index the headers instead?

Any advice is appreciated. :)
Edited by parityboy - 9/5/12 at 4:32am
post #2 of 15
Quote:
Originally Posted by parityboy View Post

I'm in the process of planning an NNTP robot for KDE 4.x. The plan is for it to have feature parity with NewsLeecher, which I've been using for quite a while. Unfortunately, NewsLeecher has issues.
What I'm planning will not be simply an NZB downloader, but a fully fledged NNTP client that stores headers locally and allows local searching. However, some newsgroups can have header counts reaching into the hundreds of millions and sometimes further - I've seen a header count of over 3 billion for one particular newsgroup.
So what I'm looking for is an embeddable database capable of handling that many records, with stability and a decent level of performance. I've been looking at NoSQL databases, but I'm not sure whether key/value or document databases would be a better fit.
Or should I tie the project directly into Akonadi and use that to index the headers instead?
Any advice is appreciated. :)

Couchbase isn't technically embeddable (it runs as a service), but the client libraries are... so... maybe...? :)
post #3 of 15
Thread Starter 
Tempting, but I don't really want to call an external service, simply to minimise the dependencies for the user. I was looking at Berkeley DB, Tokyo Cabinet and Kyoto Cabinet. One thing I know for certain is that a relational database will be out of the question - the answer is either NoSQL or an object database.
Code:
Path: news.astraweb.com!router1.astraweb.com!news.glorb.com!feeder.erje.net!feeder2.news.saunalahti.fi!reader1.news.saunalahti.fi!53ab2750!not-for-mail
From: Sender <sender@host.com>
Newsgroups: alt.fan.robert-jordan
Subject: Exciting times
Message-ID: <9arce416oebalh8i19l4s5v4rnjdlegccj@4ax.com>
References: <17811870-955f-420e b0b9-7ca541384048@y71g2000hsa.googlegroups.com>
X-Newsreader: Forte Free Agent 1.93/32.576 English (American)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Lines: 50
Date: 09:57 +0300
NNTP-Posting-Host: xxx.xxx.xxx.xxx
X-Complaints-To: newsmaster@saunalahti.com
X-Trace: 10:04 EEST)
NNTP-Posting-Date: 10:04 EEST
Organization: Saunalahti Customer

A typical set of NNTP headers looks like this. Out of all of those headers, I'll likely need the group and the message ID for retrieval, and a few other obvious data points for display purposes.
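
As a rough illustration of trimming a header block down to the fields worth storing, here is a minimal sketch. The function names and the `WANTED` field list are hypothetical choices, not part of any existing client, and folded (multi-line) headers are not handled.

```python
def parse_headers(raw):
    """Split 'Name: value' lines into a dict keyed by lower-cased header name."""
    headers = {}
    for line in raw.splitlines():
        # Split at the first colon only, so values such as times stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        headers[name.strip().lower()] = value.strip()
    return headers


# Hypothetical selection: group and message ID for retrieval, the rest for display.
WANTED = ("newsgroups", "message-id", "subject", "from", "date", "lines")


def to_record(headers):
    """Keep only the wanted fields that are actually present."""
    return {name: headers[name] for name in WANTED if name in headers}


sample = """\
From: Sender <sender@host.com>
Newsgroups: alt.fan.robert-jordan
Subject: Exciting times
Message-ID: <9arce416oebalh8i19l4s5v4rnjdlegccj@4ax.com>
X-Newsreader: Forte Free Agent 1.93/32.576 English (American)
Lines: 50
"""

record = to_record(parse_headers(sample))
print(record["message-id"])
```

Dropping everything outside the wanted list is what keeps the per-record size small when the count runs into the billions.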

I'm thinking an object database might be too CPU-intensive. I'm also thinking that a JSON-based document database might in fact be the best bet, although this data could map to a K/V database quite well.
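
To make the K/V mapping concrete, here is a minimal sketch using Python's standard-library `dbm.dumb` module purely as a stand-in for an embedded store such as Berkeley DB or Kyoto Cabinet. It is not suited to billions of records, and the key layout (message ID as key, JSON as value) is an assumption for illustration.

```python
import dbm.dumb
import json
import os
import tempfile

# Throwaway location for the demo; a real client would pick a persistent path.
db_path = os.path.join(tempfile.mkdtemp(), "headers")

message_id = "<9arce416oebalh8i19l4s5v4rnjdlegccj@4ax.com>"
record = {
    "newsgroups": "alt.fan.robert-jordan",
    "subject": "Exciting times",
    "from": "Sender <sender@host.com>",
    "lines": 50,
}

# Write: key = message ID, value = the display fields serialised as JSON.
with dbm.dumb.open(db_path, "c") as db:
    db[message_id.encode("utf-8")] = json.dumps(record).encode("utf-8")

# Read: the store only answers "give me the value for this exact key".
with dbm.dumb.open(db_path, "r") as db:
    loaded = json.loads(db[message_id.encode("utf-8")].decode("utf-8"))

print(loaded["subject"])
```

Lookup by message ID is a single key fetch, but searching by subject would mean either scanning every record or maintaining a second key space as a hand-rolled index.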

What do you think? Are there other databases which are known to solve this kind of problem quite well?
Edited by parityboy - 9/5/12 at 8:29am
post #4 of 15
Quote:
Originally Posted by parityboy View Post

Tempting, but I don't really want to call an external service, simply to minimise the dependencies for the user. I was looking at Berkeley DB, Tokyo Cabinet and Kyoto Cabinet. One thing I know for certain is that a relational database will be out of the question - the answer is either NoSQL or an object database.

Berkeley DB! Although something free would be nice :O
post #5 of 15
Can you not use SQLite? Your only record count limit is the file size limit.

@tompsonn: AFAIK Berkeley DB is GPL and BSD compatible, so it's free as long as your project is open source.
Edited by Plan9 - 9/5/12 at 5:16am
post #6 of 15
Quote:
Originally Posted by Plan9 View Post

Can you not use SQLite? Your only record count limit is the file size limit.
@tompsonn: AFAIK Berkeley DB is GPL and BSD compatible, so it's free as long as your project is open source.

SQLite = relational... :o
Back to my original point then... Berkeley DB!!
post #7 of 15
Quote:
Originally Posted by tompsonn View Post

SQLite = relational... :o
Back to my original point then... Berkeley DB!!

Yeah I know SQLite is relational, but if I understand the OP correctly, he was only discounting RDBMS because he didn't want external dependencies and yet still needed multi-billion record databases. SQLite meets both those requirements despite being relational.

I've never played around with non-relational databases though (aside from a few bespoke NoSQL databases I built for specific projects), so I can't comment on performance comparisons or even whether SQLite is any good with datasets that large. But I do know it's commonly used on embedded systems and desktops alike, so its performance with smaller datasets must be pretty good.
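
For comparison, a minimal sketch of what the SQLite route might look like, using Python's built-in `sqlite3` module. The table and column names are assumptions for illustration, and nothing here says how it would behave at billions of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real client would pass a file path
conn.execute(
    """
    CREATE TABLE headers (
        newsgroup  TEXT NOT NULL,
        message_id TEXT NOT NULL,
        subject    TEXT,
        poster     TEXT,
        PRIMARY KEY (newsgroup, message_id)
    )
    """
)

rows = [
    ("alt.fan.robert-jordan", "<9arce416oebalh8i19l4s5v4rnjdlegccj@4ax.com>",
     "Exciting times", "Sender <sender@host.com>"),
    # Hypothetical second article, added only so the search has something to skip.
    ("alt.fan.robert-jordan", "<hypothetical-id-2@example.invalid>",
     "Re: Exciting times", "Sender <sender@host.com>"),
]
with conn:  # commit the whole batch as one transaction
    conn.executemany("INSERT INTO headers VALUES (?, ?, ?, ?)", rows)

# Local search: subjects in one group that start with a given prefix.
found = conn.execute(
    "SELECT message_id FROM headers WHERE newsgroup = ? AND subject LIKE ?",
    ("alt.fan.robert-jordan", "Exciting%"),
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM headers").fetchone()[0]
print(found, total)
```

The database is a single file with no server process, and inserting in large batches inside one transaction is the usual way to keep bulk header imports fast.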
post #8 of 15
Quote:
Originally Posted by Plan9 View Post

Yeah I know SQLite is relational, but if I understand the OP correctly, he was only discounting RDBMS because he didn't want external dependencies and yet still needed multi-billion record databases. SQLite meets both those requirements despite being relational.
I've never played around with non-relational databases though (aside from a few bespoke NoSQL databases I built for specific projects), so I can't comment on performance comparisons or even whether SQLite is any good with datasets that large. But I do know it's commonly used on embedded systems and desktops alike, so its performance with smaller datasets must be pretty good.

Nah, totally agreed - I was going to suggest SQLite myself, but I don't bother when a type has been ruled out early on :P
(We're making a habit out of this in-thread conversation thing... :eek: )
post #9 of 15
Quote:
Originally Posted by tompsonn View Post

Nah, totally agreed - I was going to suggest SQLite myself, but I don't bother when a type has been ruled out early on :P
(We're making a habit out of this in-thread conversation thing... :eek: )

hahaha that we are :thumb:
post #10 of 15
Thread Starter 
lol guys, thanks for the replies (and the in-thread convo, lol). I discounted relational for performance reasons, simply because the data I'm looking to store maps better to document and K/V structures. After looking at the fragment I posted, I'm beginning to think K/V is the way to go, but I know little about them apart from the basic concept.

What's your experience with K/V databases as a developer? Are they easy to use? What do you use to query them - is there a common query language?