Overclock.net - Overclocking.net
     
 
Home Gallery Reviews Blogs Register Today's Posts Mark Forums Read Members List


Go Back   Overclock.net - Overclocking.net > Software, Programming and Coding > Coding and Programming

Reply
 
LinkBack Thread Tools
Old 07-10-08   #1 (permalink)
First Time Build
 
Turnoz's Avatar
 
intel nvidia

Join Date: Dec 2006
Location: Toronto!
Posts: 1,313

Rep: 85 Turnoz is acknowledged by some
Unique Rep: 76
Trader Rating: 4
Default [SQL/Oracle] Searching for duplicates

Hey everyone,

At my work we have this database of around 750 000 constituents and we believe that there is an incredible amount of duplicates. Just searching by postal code and exact first name and last name I managed to get a list of 500 potential duplicates. Out of all of those so far 480 have been duplicates.

I was wondering what would be the best way to code a querry that would give me the potential duplicates.

Backround info:
- Duplicates were created through large database merges
- Constituents have an ID number, first name, last name, address, postal code and much more info that is probably irrelevant (if you want more info just ask)
- The system runs on a 7 year old server with dual pentium 3s so the most efficient search method would be great

So far my best idea is to do a character comparison while keeping lets say postal code exact. The only problem with that so far is that if there is one spelling mistake somewhere then the whooole rest of the string is marked as an error.

Taking jack frost and jck frost for example

Compare first char. exact match
Compare 2nd char. error.
compare 3rd char. error. (I don't want it to say this is an error).
etc.

I was thinking that I could just tell it if it finds an error. mark it and read the next char in that string to see if it matches with this one. If it does then delay the other string by 1 char. However, if it is like jeck frost and jack frost then it would screw up that way.



Anyone have an idea?
__________________
New Build
[||||||||||||||||||||]
0$ 455$ 500$
Quote:
Originally Posted by DigitalSonata View Post
There is an obvious solution to their problem: wear a tinfoil hat

System: Finally an upgrade
CPU
E4300 333x9 @ 1.38v
Motherboard
EP45-DS3L
Memory
A-Data 800mhz
Graphics Card
XFX 8800GS
Hard Drive
250Gig WD IDE
Sound Card
Integrated
Power Supply
650W Cooler Master
Case
Antec 300
CPU cooling
GeminiII + Tricools
GPU cooling
Stock
OS
Vista 32-bit
Monitor
20.1" Metro

Last edited by Turnoz : 07-11-08 at 11:13 AM. Reason: Title change
Turnoz is offline   Reply With Quote
Old 07-10-08   #2 (permalink)
AMD Overclocker
 
decompiled's Avatar
 
amd nvidia

Join Date: Feb 2006
Location: Redsox Nation
Posts: 379

Rep: 29 decompiled is acknowledged by some
Unique Rep: 28
Hardware Reviews: 9
Trader Rating: 0
Default

What kind of database?

Oracle, MS 2000, MS 2005, Access, Btrieve, Cache... please be more specific =)

Jason

System: Slow Poke
CPU
AMD X2 4400
Motherboard
MSI K8N Diamond +
Memory
3gb TCC5
Graphics Card
eVGA 7900GS
Hard Drive
74gb Raptor + Raid 1 640's
Sound Card
Audigy SE
Power Supply
OCZ 520 PowerStream
Case
Antec P180
CPU cooling
Stock AMD
GPU cooling
Stock eVGA
OS
Vista x64
Monitor
Dell 1905FP
decompiled is offline   Reply With Quote
Old 07-11-08   #3 (permalink)
First Time Build
 
Turnoz's Avatar
 
intel nvidia

Join Date: Dec 2006
Location: Toronto!
Posts: 1,313

Rep: 85 Turnoz is acknowledged by some
Unique Rep: 76
Trader Rating: 4
Default

Its "Raiser's Edge". That's all I really know about it. The guy that's in charge of that comes back on monday so I will be able to get more information. All I know is that there's a plugin to run SQL queries on there.

Im really noob at all of this. Just scratched at java and making a database in java.
__________________
New Build
[||||||||||||||||||||]
0$ 455$ 500$
Quote:
Originally Posted by DigitalSonata View Post
There is an obvious solution to their problem: wear a tinfoil hat

System: Finally an upgrade
CPU
E4300 333x9 @ 1.38v
Motherboard
EP45-DS3L
Memory
A-Data 800mhz
Graphics Card
XFX 8800GS
Hard Drive
250Gig WD IDE
Sound Card
Integrated
Power Supply
650W Cooler Master
Case
Antec 300
CPU cooling
GeminiII + Tricools
GPU cooling
Stock
OS
Vista 32-bit
Monitor
20.1" Metro
Turnoz is offline   Reply With Quote
Old 07-11-08   #4 (permalink)
Turing Test is Overrated
 
DuckieHo's Avatar
 
intel nvidia

Join Date: Nov 2006
Location: In a Chair.
Posts: 22,359

Rep: 2572 DuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legend
Unique Rep: 1199
Folding Team Rank: 816
Trader Rating: 33
Default

Your method is going to need to use pointers and will be time consuming... Unfortunately, cleaning up these types of errors are very manual since judgement calls will be needed. What if Jack Frost and Jeck Frost are brothers living in the same house?
__________________
To answer most of your questions: (1) a fridge cannot cool a PC (2) 64-bit OS for over 3GB or so (3) PCIe 2.0 is backwards compatible with PCIe 1.x (4) Resolution, not screen size (5) If you have a question, it is not news (6) Read TOS (7) Report, not respond to Spam (8) Uninstall nTune (9) Single/Non-Modular Rail PSUs are NOT better than Multi-Rail/Modular (10) Edward is the Law!
Quote:
Originally Posted by Brythe View Post
But I still dont get it, whats the thing between the chair and the desk?

edit: got it..haha..its meee

System: My "DF-DIE" Replacement
CPU
Q6600 (3.7GHz)
Motherboard
Asus P5E X38 (MF1208 BIOS)
Memory
2x2GB OCZ Reaper 1096MHz
Graphics Card
8800GT (729/1836/2088)
Hard Drive
PERC 5/i: 3xRAID0 Raptor 74GB + 7200.10 250GB
Sound Card
X-Fi XtremeMusic
Power Supply
Corsair 620HX
Case
Li Lian PC-V2100 [10x120mm fans]
CPU cooling
FuZion V2 + Quad-Heatercore
GPU cooling
MCW60 + Iandh HS + DDC-3.2
OS
Vista Ultimate 64
Monitor
Samsung 226BW "C" + Sceptre 19"

Last edited by DuckieHo : 07-11-08 at 11:05 AM.
DuckieHo is offline I fold for Overclock.net Overclocked Account DuckieHo's Gallery   Reply With Quote
Old 07-11-08   #5 (permalink)
First Time Build
 
Turnoz's Avatar
 
intel nvidia

Join Date: Dec 2006
Location: Toronto!
Posts: 1,313

Rep: 85 Turnoz is acknowledged by some
Unique Rep: 76
Trader Rating: 4
Default

Oh I know :P Im getting payed to do this over the summer and then chances are ill be doing this for the summer after too.

I have already gone through 2000 records... and there is 750 000 in the database atm.

I just need a search query to manage to give me a list of potential duplicates as so far, the best I can get is the built in tool saying perfect last name, perfect postal code and x number of first name letters. Lets just say that its pretty crappy. It gives me allll the family members that have the same first letter. I am just trying to find a way to tune it better.

Also, I found another tool in there that is for queries but using very linear type of criterias

ex: if (Last_name) = exact
{Display Constituent];

Would I be able to do what im trying to do with this tool? The types used for last, middle, first name is varchar.

EDIT: I was looking on the website and pretty much the two types the database can run on is SQL Server® or Oracle®. Im pretty sure we are on Oracle because I heard the consultant saying that word a couple times
__________________
New Build
[||||||||||||||||||||]
0$ 455$ 500$
Quote:
Originally Posted by DigitalSonata View Post
There is an obvious solution to their problem: wear a tinfoil hat

System: Finally an upgrade
CPU
E4300 333x9 @ 1.38v
Motherboard
EP45-DS3L
Memory
A-Data 800mhz
Graphics Card
XFX 8800GS
Hard Drive
250Gig WD IDE
Sound Card
Integrated
Power Supply
650W Cooler Master
Case
Antec 300
CPU cooling
GeminiII + Tricools
GPU cooling
Stock
OS
Vista 32-bit
Monitor
20.1" Metro

Last edited by Turnoz : 07-11-08 at 11:11 AM.
Turnoz is offline   Reply With Quote
Old 07-11-08   #6 (permalink)
Turing Test is Overrated
 
DuckieHo's Avatar
 
intel nvidia

Join Date: Nov 2006
Location: In a Chair.
Posts: 22,359

Rep: 2572 DuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legend
Unique Rep: 1199
Folding Team Rank: 816
Trader Rating: 33
Default

Here's what I would do:
1) Export the data into Excel 2007 (supports up to 2M rows) on a newer machine with 4GB if possible.
2) Create Soundex (http://en.wikipedia.org/wiki/Soundex) keys on named fields (You can find Excel algorithms for it)
3) sort by Soundex + zip code (assuming at least zip code is correct)
4) Use a simple Excel function "if row N matches row N+1, display message"
5) Manually look at each field that has a message.
6) After taking care of the majority of dups, go back again and start character by charcter (let me know when you get to this point and I'll give you the function)






I've done similar work on a few 10K tables and once on a 500K table.
__________________
To answer most of your questions: (1) a fridge cannot cool a PC (2) 64-bit OS for over 3GB or so (3) PCIe 2.0 is backwards compatible with PCIe 1.x (4) Resolution, not screen size (5) If you have a question, it is not news (6) Read TOS (7) Report, not respond to Spam (8) Uninstall nTune (9) Single/Non-Modular Rail PSUs are NOT better than Multi-Rail/Modular (10) Edward is the Law!
Quote:
Originally Posted by Brythe View Post
But I still dont get it, whats the thing between the chair and the desk?

edit: got it..haha..its meee

System: My "DF-DIE" Replacement
CPU
Q6600 (3.7GHz)
Motherboard
Asus P5E X38 (MF1208 BIOS)
Memory
2x2GB OCZ Reaper 1096MHz
Graphics Card
8800GT (729/1836/2088)
Hard Drive
PERC 5/i: 3xRAID0 Raptor 74GB + 7200.10 250GB
Sound Card
X-Fi XtremeMusic
Power Supply
Corsair 620HX
Case
Li Lian PC-V2100 [10x120mm fans]
CPU cooling
FuZion V2 + Quad-Heatercore
GPU cooling
MCW60 + Iandh HS + DDC-3.2
OS
Vista Ultimate 64
Monitor
Samsung 226BW "C" + Sceptre 19"

Last edited by DuckieHo : 07-11-08 at 11:28 AM.
DuckieHo is offline I fold for Overclock.net Overclocked Account DuckieHo's Gallery   Reply With Quote
Old 07-11-08   #7 (permalink)
First Time Build
 
Turnoz's Avatar
 
intel nvidia

Join Date: Dec 2006
Location: Toronto!
Posts: 1,313

Rep: 85 Turnoz is acknowledged by some
Unique Rep: 76
Trader Rating: 4
Default

Quote:
Originally Posted by DuckieHo View Post
Here's what I would do:
1) Export the data into Excel 2007 (supports up to 2M rows) on a newer machine with 4GB if possible.
2) Create Soundex (http://en.wikipedia.org/wiki/Soundex) keys on named fields (You can find Excel algorithms for it)
3) sort by Soundex + zip code (assuming at least zip code is correct)
4) Use a simple Excel function "if row N matches row N+1, display message"
5) Manually look at each field that has a message.
heheheh ... im running on pentium 3 with 256 megs of ram. oh jeez
Ill try it out and report back thanks

EDIT:

Its running! Calculating number of Constituent to export.... jeez.... Maybe I should do this over the weekend instead of now.

Man pentium 3s suck. Just exporting all the data is gonna take AAGES.

EDIT2: Its now at calculating 500 000 records to export. Not even started the export yet >.>

yayy! processing....
__________________
New Build
[||||||||||||||||||||]
0$ 455$ 500$
Quote:
Originally Posted by DigitalSonata View Post
There is an obvious solution to their problem: wear a tinfoil hat

System: Finally an upgrade
CPU
E4300 333x9 @ 1.38v
Motherboard
EP45-DS3L
Memory
A-Data 800mhz
Graphics Card
XFX 8800GS
Hard Drive
250Gig WD IDE
Sound Card
Integrated
Power Supply
650W Cooler Master
Case
Antec 300
CPU cooling
GeminiII + Tricools
GPU cooling
Stock
OS
Vista 32-bit
Monitor
20.1" Metro

Last edited by Turnoz : 07-11-08 at 11:42 AM.
Turnoz is offline   Reply With Quote
Old 07-11-08   #8 (permalink)
Turing Test is Overrated
 
DuckieHo's Avatar
 
intel nvidia

Join Date: Nov 2006
Location: In a Chair.
Posts: 22,359

Rep: 2572 DuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legendDuckieHo is a legend
Unique Rep: 1199
Folding Team Rank: 816
Trader Rating: 33
Default

Quote:
Originally Posted by Turnoz View Post
heheheh ... im running on pentium 3 with 256 megs of ram. oh jeez
Ill try it out and report back thanks

EDIT:

Its running! Calculating number of Constituent to export.... jeez.... Maybe I should do this over the weekend instead of now.

Man pentium 3s suck. Just exporting all the data is gonna take AAGES.


Can you export to a text file instead? Then import into Excel 2007 (it MUST be 2007 since all other Excel only support up to 64K rows).

Don't forget that you get paid for the time it takes to export.
__________________
To answer most of your questions: (1) a fridge cannot cool a PC (2) 64-bit OS for over 3GB or so (3) PCIe 2.0 is backwards compatible with PCIe 1.x (4) Resolution, not screen size (5) If you have a question, it is not news (6) Read TOS (7) Report, not respond to Spam (8) Uninstall nTune (9) Single/Non-Modular Rail PSUs are NOT better than Multi-Rail/Modular (10) Edward is the Law!
Quote:
Originally Posted by Brythe View Post
But I still dont get it, whats the thing between the chair and the desk?

edit: got it..haha..its meee

System: My "DF-DIE" Replacement
CPU
Q6600 (3.7GHz)
Motherboard
Asus P5E X38 (MF1208 BIOS)
Memory
2x2GB OCZ Reaper 1096MHz
Graphics Card
8800GT (729/1836/2088)
Hard Drive
PERC 5/i: 3xRAID0 Raptor 74GB + 7200.10 250GB
Sound Card
X-Fi XtremeMusic
Power Supply
Corsair 620HX
Case
Li Lian PC-V2100 [10x120mm fans]
CPU cooling
FuZion V2 + Quad-Heatercore
GPU cooling
MCW60 + Iandh HS + DDC-3.2
OS
Vista Ultimate 64
Monitor
Samsung 226BW "C" + Sceptre 19"
DuckieHo is offline I fold for Overclock.net Overclocked Account DuckieHo's Gallery   Reply With Quote
Old 07-11-08   #9 (permalink)
Performance...
 
linskingdom's Avatar
 
intel nvidia

Join Date: Jun 2007
Location: In Office
Posts: 8,729

Rep: 2152 linskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legendlinskingdom is a legend
Unique Rep: 931
Trader Rating: 3
Default

DuckieHo's suggestion is good. It is more manual and auto udapte/verification togeher. Or you may try this, for instance in SQL (they are similar)

1. Backup database.
2. Figure our how many those records, let's say.
Select disctinct firstname, lastname,zip, count(*) as itemnumbers into temptbl from [the table name] group by firstname, lastname,zip order by firstname, lastname,zip
3. Once the table created, add three new fields to it to store the correct information. You can manually add them in SQL/oracle server or use create table statement. So you will have this in the new table.
Firstname, lastname, zip, nfirstname, nlastname,nzip
Go through it and assign correct value to these new fields. For instance,

Firstname, lastname, zip, nfirstname, nlastname,nzip

Jack, White, 22301, Jack, White, 22301,
Jeck, White, 22301, Jack, White, 22301,
Jackie, White, 22301, Jack, White, 22301, ……

4. The new table will be your standard table. In relational database, standardization is very important. If you interface can change, you can add a drop down or something to force them to pick the correct name. Back to the topic, use SQL statement to change.

Update [the orginal table name/A] set firstname=t.nfirstname, lastname=t.nlastname where exitsts (select t.firstname, t.lastname,t.zip from [new table/T] where t.firstname= a.firstname and t.lastname=a.lastname and t.zip=a.zip)

5.Verify the updated items number to see if they match the itemnumbers in the new table.

Sorry for the quick type up. Kind of mess.

System: QD65D45
CPU
65Q/D&45D
Motherboard
DFI/Asus/Gigabyte
linskingdom is online now Overclocked Account linskingdom's Gallery   Reply With Quote
Old 07-11-08   #10 (permalink)
First Time Build
 
Turnoz's Avatar
 
intel nvidia

Join Date: Dec 2006
Location: Toronto!
Posts: 1,313

Rep: 85 Turnoz is acknowledged by some
Unique Rep: 76
Trader Rating: 4
Default

Quote:
Originally Posted by DuckieHo View Post
Can you export to a text file instead? Then import into Excel 2007 (it MUST be 2007 since all other Excel only support up to 64K rows).

Don't forget that you get paid for the time it takes to export.
I just decided to cut down on the export. First name = Ronald. That way I can try it out without it going nuts. Also, don't have excell 2007 on this machine. I do on my laptop so ill export it to text and then into my laptop.

Its also my 2nd week here and if im just fiddling with my thumbs its gonna look suspicious :P
__________________
New Build
[||||||||||||||||||||]
0$ 455$ 500$
Quote:
Originally Posted by DigitalSonata View Post
There is an obvious solution to their problem: wear a tinfoil hat

System: Finally an upgrade
CPU
E4300 333x9 @ 1.38v
Motherboard
EP45-DS3L
Memory
A-Data 800mhz
Graphics Card
XFX 8800GS
Hard Drive
250Gig WD IDE
Sound Card
Integrated
Power Supply
650W Cooler Master
Case
Antec 300
CPU cooling
GeminiII + Tricools
GPU cooling
Stock
OS
Vista 32-bit
Monitor
20.1" Metro
Turnoz is offline   Reply With Quote
Reply



Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools



All times are GMT -4. The time now is 03:46 PM.


Overclock.net is a Carbon Neutral Site Creative Commons License Internet Security By ControlScan

Terms of Service / Forum Rules | Privacy Policy | Advertising | Become an Official Vendor
Copyright © 2008 Shogun Interactive Development. Most rights reserved.
Page generated in 0.22399 seconds with 9 queries