Over the past year, there have been a couple of times when I've needed to sort a large list of values, more than 100 million lines in one case.
In each case, I was dealing with a data source where there were surely duplicate entries: for example, duplicate usernames, emails, or URLs. To address this, I decided to get the unique values from the file before I ran a final processing script over them. This would require sorting all of the values in the given file and then deduping within the resulting groups of values.
This sorting and deduping can be a bit challenging. There are various algorithms to consider, and if the dataset is large enough, we also need to handle the data in a way that doesn't run out of memory.
Shell commands to the rescue 🙂
Luckily, there are shell commands that make it quite simple to get the unique values in a file. Here's what I ended up using:
cat "$file" | sort | uniq
In this example, we are:
- Opening the file at $file
- Sorting the file so that duplicates end up in a contiguous block
- Deduping so that only one value remains from each contiguous block
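To see why the sort step matters, here is a minimal sketch using made-up sample input (the names are hypothetical, not from my data). It shows that uniq only collapses duplicate lines that are next to each other, and that sort -u should give the same result as sort | uniq for plain line comparisons like this one.

```shell
# Hypothetical sample input with a non-adjacent duplicate ("alice").
input=$(printf 'alice\nbob\nalice\n')

# uniq alone only collapses adjacent duplicates, so "alice" survives twice.
unsorted_unique=$(printf '%s\n' "$input" | uniq)

# Sorting first puts duplicates next to each other, so uniq can collapse them.
sorted_unique=$(printf '%s\n' "$input" | sort | uniq)

# sort -u sorts and dedupes in a single step.
sort_u=$(printf '%s\n' "$input" | sort -u)

# Prints "alice" and "bob", one per line.
printf '%s\n' "$sorted_unique"
```

If you don't need the counts that uniq -c provides, sort -u saves a process in the pipeline.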
Here's another example of this command with piped input:
php -r 'for ( $i = 0; $i < 1000000; $i++ ) { echo sprintf( "%d\n", random_int( 0, 100 ) ); }' | sort -n | uniq
In this example, we are:
- Generating 1,000,000 random numbers (between 0 and 100), each on its own line
- Sorting that output so that like numbers are together
  - Note that we're using -n here to do a numeric sort.
- Deduping that so that we end up with a unique number on each line
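As a small illustration of what -n changes, here is a sketch with a few hypothetical values. Without -n, sort compares lines as text, so "100" lands before "9". LC_ALL=C is set on the text sort only to make that ordering predictable across systems; that is my assumption, not something the commands above need.

```shell
# Hypothetical sample values, one per line, with one duplicate.
numbers=$(printf '10\n9\n100\n9\n')

# Default sort compares lines as text, so "10" and "100" come before "9".
text_sorted=$(printf '%s\n' "$numbers" | LC_ALL=C sort | uniq)

# sort -n compares by numeric value instead.
numeric_sorted=$(printf '%s\n' "$numbers" | sort -n | uniq)

# Prints 9, 10, 100, one per line.
printf '%s\n' "$numeric_sorted"
```

Either ordering dedupes correctly, since both put equal lines next to each other. The -n only affects the order of the final output.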
If we wanted to know how often each number occurred in the input, we could simply add -c to the end of the command above. The resulting command would be:
php -r 'for ( $i = 0; $i < 1000000; $i++ ) { echo sprintf( "%d\n", random_int( 0, 100 ) ); }' | sort -n | uniq -c
and we would get some output that looks like this:
9880 0
10179 1
9725 2
10024 3
9921 4
9893 5
9945 6
9881 7
9707 8
9955 9
9896 10
9845 11
9928 12
10024 13
10005 14
9834 15
9929 16
9764 17
9795 18
9932 19
9735 20
10082 21
9876 22
9835 23
9748 24
9947 25
9975 26
9841 27
9856 28
9751 29
10138 30
10037 31
10026 32
10128 33
9926 34
9821 35
9990 36
9920 37
9696 38
9886 39
9896 40
9815 41
9924 42
9739 43
9854 44
9936 45
9977 46
9873 47
9824 48
10043 49
10054 50
9870 51
9783 52
9901 53
9819 54
9882 55
10022 56
9899 57
9922 58
9922 59
9902 60
10036 61
9830 62
9792 63
9894 64
10008 65
9774 66
9918 67
9986 68
9814 69
9661 70
10117 71
10046 72
9704 73
10016 74
9601 75
9901 76
9923 77
9931 78
9909 79
9895 80
9771 81
10044 82
10059 83
9864 84
9938 85
9799 86
10006 87
9883 88
9880 89
9837 90
9701 91
9870 92
9998 93
9809 94
9883 95
10144 96
9935 97
9979 98
9922 99
9789 100
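Since uniq -c puts the count first on each line, one possible follow-up is to sort that output numerically in reverse to find the most frequent values. This is a sketch on hypothetical sample input, not something I ran on the data above.

```shell
# Hypothetical sample values; "b" appears most often (3 times).
values=$(printf 'a\nb\nb\nc\nb\na\n')

# Count each value, sort the counts in descending numeric order,
# and keep only the first line (the most frequent value).
top=$(printf '%s\n' "$values" | sort | uniq -c | sort -rn | head -n 1)

# Prints the count followed by the value; the count may be padded
# with leading spaces.
printf '%s\n' "$top"
```

Changing head -n 1 to head -n 10 would list the ten most frequent values instead.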