|
![]() |
Overclock.net - Overclocking.net > Software, Programming and Coding > Coding and Programming | |
[SQL/Oracle] Searching for duplicates
|
||
![]() |
|
|
LinkBack | Thread Tools |
|
|
#1 (permalink) | |||||||||||||
|
First Time Build
|
Hey everyone,
At my work we have this database of around 750 000 constituents and we believe that there is an incredible amount of duplicates. Just searching by postal code and exact first name and last name I managed to get a list of 500 potential duplicates. Out of all of those so far 480 have been duplicates. I was wondering what would be the best way to code a querry that would give me the potential duplicates. Backround info: - Duplicates were created through large database merges - Constituents have an ID number, first name, last name, address, postal code and much more info that is probably irrelevant (if you want more info just ask) - The system runs on a 7 year old server with dual pentium 3s so the most efficient search method would be great ![]() So far my best idea is to do a character comparison while keeping lets say postal code exact. The only problem with that so far is that if there is one spelling mistake somewhere then the whooole rest of the string is marked as an error. Taking jack frost and jck frost for example Compare first char. exact match Compare 2nd char. error. compare 3rd char. error. (I don't want it to say this is an error). etc. I was thinking that I could just tell it if it finds an error. mark it and read the next char in that string to see if it matches with this one. If it does then delay the other string by 1 char. However, if it is like jeck frost and jack frost then it would screw up that way. Anyone have an idea? ![]()
__________________
New Build [||||||||||||||||||||] 0$ 455$ 500$
Last edited by Turnoz : 07-11-08 at 11:13 AM. Reason: Title change |
|||||||||||||
|
|
|
|
|
#2 (permalink) | |||||||||||||
|
AMD Overclocker
|
What kind of database?
Oracle, MS 2000, MS 2005, Access, Btrieve, Cache... please be more specific =) Jason
|
|||||||||||||
|
|
|
|
|
#3 (permalink) | |||||||||||||
|
First Time Build
|
Its "Raiser's Edge". That's all I really know about it. The guy that's in charge of that comes back on monday so I will be able to get more information. All I know is that there's a plugin to run SQL queries on there.
Im really noob at all of this. Just scratched at java and making a database in java.
__________________
New Build [||||||||||||||||||||] 0$ 455$ 500$
|
|||||||||||||
|
|
|
|
|
#4 (permalink) | |||||||||||||
|
Turing Test is Overrated
|
Your method is going to need to use pointers and will be time consuming... Unfortunately, cleaning up these types of errors are very manual since judgement calls will be needed. What if Jack Frost and Jeck Frost are brothers living in the same house?
__________________
To answer most of your questions: (1) a fridge cannot cool a PC (2) 64-bit OS for over 3GB or so (3) PCIe 2.0 is backwards compatible with PCIe 1.x (4) Resolution, not screen size (5) If you have a question, it is not news (6) Read TOS (7) Report, not respond to Spam (8) Uninstall nTune (9) Single/Non-Modular Rail PSUs are NOT better than Multi-Rail/Modular (10) Edward is the Law!
Last edited by DuckieHo : 07-11-08 at 11:05 AM. |
|||||||||||||
|
|
|
|
#5 (permalink) | |||||||||||||
|
First Time Build
|
Oh I know :P Im getting payed to do this over the summer and then chances are ill be doing this for the summer after too.
I have already gone through 2000 records... and there is 750 000 in the database atm. I just need a search query to manage to give me a list of potential duplicates as so far, the best I can get is the built in tool saying perfect last name, perfect postal code and x number of first name letters. Lets just say that its pretty crappy. It gives me allll the family members that have the same first letter. I am just trying to find a way to tune it better. Also, I found another tool in there that is for queries but using very linear type of criterias ex: if (Last_name) = exact {Display Constituent]; Would I be able to do what im trying to do with this tool? The types used for last, middle, first name is varchar. EDIT: I was looking on the website and pretty much the two types the database can run on is SQL Server® or Oracle®. Im pretty sure we are on Oracle because I heard the consultant saying that word a couple times ![]()
__________________
New Build [||||||||||||||||||||] 0$ 455$ 500$
Last edited by Turnoz : 07-11-08 at 11:11 AM. |
|||||||||||||
|
|
|
|
|
#6 (permalink) | |||||||||||||
|
Turing Test is Overrated
|
Here's what I would do:
1) Export the data into Excel 2007 (supports up to 2M rows) on a newer machine with 4GB if possible. 2) Create Soundex (http://en.wikipedia.org/wiki/Soundex) keys on named fields (You can find Excel algorithms for it) 3) sort by Soundex + zip code (assuming at least zip code is correct) 4) Use a simple Excel function "if row N matches row N+1, display message" 5) Manually look at each field that has a message. 6) After taking care of the majority of dups, go back again and start character by charcter (let me know when you get to this point and I'll give you the function) I've done similar work on a few 10K tables and once on a 500K table.
__________________
To answer most of your questions: (1) a fridge cannot cool a PC (2) 64-bit OS for over 3GB or so (3) PCIe 2.0 is backwards compatible with PCIe 1.x (4) Resolution, not screen size (5) If you have a question, it is not news (6) Read TOS (7) Report, not respond to Spam (8) Uninstall nTune (9) Single/Non-Modular Rail PSUs are NOT better than Multi-Rail/Modular (10) Edward is the Law!
Last edited by DuckieHo : 07-11-08 at 11:28 AM. |
|||||||||||||
|
|
|
|
#7 (permalink) | ||||||||||||||
|
First Time Build
|
Quote:
![]() Ill try it out and report back thanks ![]() EDIT: Its running! Calculating number of Constituent to export.... jeez.... Maybe I should do this over the weekend instead of now. Man pentium 3s suck. Just exporting all the data is gonna take AAGES. EDIT2: Its now at calculating 500 000 records to export. Not even started the export yet >.> yayy! processing....
__________________
New Build [||||||||||||||||||||] 0$ 455$ 500$
Last edited by Turnoz : 07-11-08 at 11:42 AM. |
||||||||||||||
|
|
|
|
|
#8 (permalink) | ||||||||||||||
|
Turing Test is Overrated
|
Quote:
Can you export to a text file instead? Then import into Excel 2007 (it MUST be 2007 since all other Excel only support up to 64K rows). Don't forget that you get paid for the time it takes to export.
__________________
To answer most of your questions: (1) a fridge cannot cool a PC (2) 64-bit OS for over 3GB or so (3) PCIe 2.0 is backwards compatible with PCIe 1.x (4) Resolution, not screen size (5) If you have a question, it is not news (6) Read TOS (7) Report, not respond to Spam (8) Uninstall nTune (9) Single/Non-Modular Rail PSUs are NOT better than Multi-Rail/Modular (10) Edward is the Law!
|
||||||||||||||
|
|
|
|
#9 (permalink) | |||
|
Performance...
|
DuckieHo's suggestion is good. It is more manual and auto udapte/verification togeher. Or you may try this, for instance in SQL (they are similar)
1. Backup database. 2. Figure our how many those records, let's say. Select disctinct firstname, lastname,zip, count(*) as itemnumbers into temptbl from [the table name] group by firstname, lastname,zip order by firstname, lastname,zip 3. Once the table created, add three new fields to it to store the correct information. You can manually add them in SQL/oracle server or use create table statement. So you will have this in the new table. Firstname, lastname, zip, nfirstname, nlastname,nzip Go through it and assign correct value to these new fields. For instance, Firstname, lastname, zip, nfirstname, nlastname,nzip Jack, White, 22301, Jack, White, 22301, Jeck, White, 22301, Jack, White, 22301, Jackie, White, 22301, Jack, White, 22301, …… 4. The new table will be your standard table. In relational database, standardization is very important. If you interface can change, you can add a drop down or something to force them to pick the correct name. Back to the topic, use SQL statement to change. Update [the orginal table name/A] set firstname=t.nfirstname, lastname=t.nlastname where exitsts (select t.firstname, t.lastname,t.zip from [new table/T] where t.firstname= a.firstname and t.lastname=a.lastname and t.zip=a.zip) 5.Verify the updated items number to see if they match the itemnumbers in the new table. Sorry for the quick type up. Kind of mess.
__________________
Push the limit one more time ... ![]()
|
|||
|
|
|
|
#10 (permalink) | ||||||||||||||
|
First Time Build
|
Quote:
Its also my 2nd week here and if im just fiddling with my thumbs its gonna look suspicious :P
__________________
New Build [||||||||||||||||||||] 0$ 455$ 500$
|
||||||||||||||
|
|
|
![]() |
| Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
| Thread Tools | |
|
|