CoderZone.org
Topic: Easy way to de-dupe a large list?  (Read 19178 times)
Max
« on: Apr 07, 2011, 07:33:10 pm »

I've got a list of about 50,000 lines of items, some of which are probably (read: definitely) duplicates. I'd like to de-dupe them either with PHP or possibly a bash script...anyone have any snippets to do this efficiently?

(Yes, I could write a small script to do this, but my guess is that someone has a nifty snippet already tucked away somewhere, and it's probably smarter and faster than what I'd churn out.)
phpMan2010
« Reply #1 on: Apr 07, 2011, 07:57:39 pm »

cut | sort | uniq

:)

If you can post a couple of sample lines, I can give more detail.
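
As a rough illustration of where cut fits in that pipeline: if each line carried extra fields and only the first one mattered, cut would isolate it before sorting. The file name records.csv and the "domain,source" layout are assumptions made up for this sketch, not something from the thread.

```shell
# Hypothetical input: one record per line as "domain,source" (layout assumed for illustration).
printf '%s\n' 'custhelp.com,listA' '1dog.net,listB' 'custhelp.com,listC' > records.csv

# cut keeps only the first comma-separated field, sort brings equal lines together,
# and uniq collapses each run of identical adjacent lines into one.
cut -d, -f1 records.csv | sort | uniq > domains.unique.txt
cat domains.unique.txt
```

If the list already has one item per line, the cut step can simply be dropped.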
Max
« Reply #2 on: Apr 08, 2011, 02:02:28 pm »

Here's a sample:

100ideas.com
1dog.net
1dollargreetings.com
1dollarnoni.com
bevirusproof.org
beyond24.com
beyond30k.com
beyondigital.com
beyondspecials.com
custhelp.com
customadultmalls.com
customer-contact.net
customer-svc.com
customerblast.com
customerparadigm.com
customizejacketsandtees4you.com
customoffers.com
customoffersmail.com
cut-to-the-chase.com
cuteandcuddly.com
emaildealz.net
emaildelivery.org
emaildelvery.org
emaildownunder.com
emailfactory.com
emailgaul.com
emailgids.net
Keith
« Reply #3 on: Apr 08, 2011, 03:07:27 pm »

This may use upwards of a megabyte of memory, but it sounds like a one-and-done thing (possibly on your local machine?), so no worries. :)

Code (PHP):
<?php
// Read the file into an array, one element per line, dropping newlines and blank lines.
$lines = file( './domains.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
// Keep only the first occurrence of each value.
$lines = array_unique( $lines );
$lines = implode( PHP_EOL, $lines );
file_put_contents( './domains.unique.txt', $lines );
 

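...and by replacing array_unique() with array_flip() followed by array_keys() you can cut the processing time roughly in half (array keys are unique, so flipping the values into keys collapses the duplicates):

Code (PHP):
<?php
$lines = file( './domains.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
// Flipping turns each line into an array key; repeated lines map to the same key.
$lines = array_flip( $lines );
// The keys are now the de-duplicated lines, in first-seen order.
$lines = array_keys( $lines );
$lines = implode( PHP_EOL, $lines );
file_put_contents( './domains.unique.txt', $lines );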
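
For comparison, the same keyed-lookup idea can be sketched on the command line with awk, which also keeps lines in first-seen order and needs no sorting. This is only an illustrative equivalent, not something Keith posted; the sample entries and file names are placeholders.

```shell
# Sample input with repeated entries; the file names are placeholders.
printf '%s\n' 'emailfactory.com' '1dog.net' 'emailfactory.com' 'beyond24.com' '1dog.net' > domains.txt

# seen[$0]++ evaluates to 0 (false) the first time a line appears, so !seen[$0]++
# is true only for first occurrences; awk's default action prints those lines.
awk '!seen[$0]++' domains.txt > domains.unique.txt
cat domains.unique.txt
```

Like the array_flip() version, this holds every distinct line in memory, which should be fine for a list of this size.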
 
« Last Edit: Apr 08, 2011, 03:13:32 pm by Keith »
phpMan2010
« Reply #4 on: Apr 08, 2011, 05:46:36 pm »

On the command line:

Code (Bash):
sort listofdomains | uniq > uniqed
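
The sort step matters because uniq only collapses identical lines that are adjacent. A couple of variations on the one-liner above, sketched with made-up sample data (the dupes file name is an assumption):

```shell
# Sample data; "listofdomains" matches the file name used above.
printf '%s\n' 'custhelp.com' '1dog.net' 'custhelp.com' 'beyond24.com' > listofdomains

# sort -u sorts and de-duplicates in one step; for whole-line comparisons it
# typically gives the same result as sort | uniq.
sort -u listofdomains > uniqed

# uniq -d prints one copy of each line that occurred more than once,
# which is handy for checking what was actually duplicated.
sort listofdomains | uniq -d > dupes
cat uniqed dupes
```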
 

With PHP:

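Code (PHP):
<?php
// Backticks run the command through the shell, like shell_exec().
$output = (string) `sort listofdomains | uniq`;
// trim() drops the trailing newline so explode() does not yield an empty last element.
$list = explode(PHP_EOL, trim($output));
var_dump($list);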
 
Max
« Reply #5 on: Apr 10, 2011, 08:28:32 am »

Code (Bash):
sort listofdomains | uniq > uniqed
 

Awesome, exactly what I was looking for: a slick one-liner. Thank you!
Max
« Reply #6 on: Apr 10, 2011, 08:31:35 am »

Keith: phpMan's bash one-liner is perfect for what I need right now, but I'm also going to use one of your PHP snippets for some other de-duping I need to do on a regular basis (via cron). My server restricts non-root users from running eval(), passthru(), or exec(), so I can't call a bash script from within a PHP script, but I can use what you provided. Thank you.
