In the past, I’ve often had to generate email lists of users that fit specific conditions. This usually isn’t too difficult with some of our in-house data tools at Automattic. But, when I hit a case where I have to work across systems, it usually results in me dumping the data from each system and then comparing large lists.
Comparing large lists on the command line isn’t that difficult. All it takes is a few commands which I’ll walk you through in this blog post!
So, let’s come up with a theoretical example. Let’s say that I have two separate lists, one of users that have purchased Product A and one of users that have purchased Product B. Second, let’s also agree that these lists contain the user’s email address and the date of purchase. So, in a CSV, the data would look a bit like this:
Prepping the list of users
Later on, we’re going to use the
comm command for the actual comparing of the lists. Before we can use that command, there’s a bit of prep work that we first need to complete. Specifically, we need to:
- Remove the CSV header row if there is one
- Filter down the source data to whatever field we want to compare, email addresses in our theoretical example
- Sort the list
- Unique the list
To cut the CSV header from file, we’re going to use the following:
sed 1d product_a.csv
This will remove the first line and print the rest of the file to standard out.
Next, we’re going to pull out the column that contains email address, or the first column in our dataset above. To do this, we’re going to use the following:
cut -d',' -f1 product_a.csv
This command is setting the
, as the delimeter and then pulling the first column of the file.
From here, we simply need to sort and unique, which we can do with the
uniq commands. If we put all of the above together, we can run it in a single go for each file:
sed 1d product_a.csv | cut -d',' -f1 | sort uniq > product_a_output.csv
At this point, now we just need to take the files and actually compare them. 😄
Actually comparing large lists 🎉
Alright, now that we have two processed files, we can get to the easy part, comparing. The easiest way I’ve found to compare files for these use cases is the
comm command. You can
man comm to get a detailed view of how that command works, but in summary, you use it like this:
comm file1 file2
That command then outputs three columns, where:
- The first column is all values that are in file1 and not file2
- The second column is all values in file2 and not file1
- The third column is all values in both files
If you’re only interested in the lines that are in both files, you can do the following, which will remove the first and second columns of values:
comm -12 file1 file2
You can then write redirect that output to a file or however you’d like.