On comparing large lists

Graphic about lists being the same

In the past, I’ve often had to generate email lists of users that fit specific conditions. This usually isn’t too difficult with some of our in-house data tools at Automattic. But, when I hit a case where I have to work across systems, it usually results in me dumping the data from each system and then comparing large lists.

Comparing large lists on the command line isn’t that difficult. All it takes is a few commands which I’ll walk you through in this blog post!

Graphic about lists being the same

So, let’s come up with a theoretical example. Let’s say that I have two separate lists, one of users that have purchased Product A and one of users that have purchased Product B. Second, let’s also agree that these lists contain the user’s email address and the date of purchase. So, in a CSV, the data would look a bit like this:

email_address,date
"user@example.com","2022-03-01"

Prepping the list of users

Later on, we’re going to use the comm command for the actual comparing of the lists. Before we can use that command, there’s a bit of prep work that we first need to complete. Specifically, we need to:

  • Remove the CSV header row if there is one
  • Filter down the source data to whatever field we want to compare, email addresses in our theoretical example
  • Sort the list
  • Unique the list

To cut the CSV header from file, we’re going to use the following:

sed 1d product_a.csv

This will remove the first line and print the rest of the file to standard out.

Next, we’re going to pull out the column that contains email address, or the first column in our dataset above. To do this, we’re going to use the following:

cut -d',' -f1 product_a.csv

This command is setting the , as the delimeter and then pulling the first column of the file.

From here, we simply need to sort and unique, which we can do with the sort and uniq commands. If we put all of the above together, we can run it in a single go for each file:

sed 1d product_a.csv | cut -d',' -f1 | sort uniq > product_a_output.csv

At this point, now we just need to take the files and actually compare them. ?

Actually comparing large lists ?

Alright, now that we have two processed files, we can get to the easy part, comparing. The easiest way I’ve found to compare files for these use cases is the comm command. You can man comm to get a detailed view of how that command works, but in summary, you use it like this:

comm file1 file2

That command then outputs three columns, where:

  • The first column is all values that are in file1 and not file2
  • The second column is all values in file2 and not file1
  • The third column is all values in both files

If you’re only interested in the lines that are in both files, you can do the following, which will remove the first and second columns of values:

comm -12 file1 file2

You can then write redirect that output to a file or however you’d like.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.