Get unique values in file with shell command

Over the past year, there have been a couple of times where I've needed to sort some large list of values, more than 100 million lines in one case. 😱

In each case, I was dealing with a data source that surely contained duplicate entries: duplicate usernames, emails, or URLs, for example. To address this, I decided to get the unique values from the file before I ran a final processing script over them. This would require sorting all of the values in the given file and then deduping within the resulting groups of values.

This sorting and deduping can be a bit challenging. There are various algorithms to consider and if the dataset is large enough, we also need to ensure that we're handling the data in a way that we don't run out of memory. 

Shell commands to the rescue 🙂

Luckily, there are shell commands that make it quite simple to get the unique values in a file. Here's what I ended up using to get the unique values in a file:

cat "$file" | sort | uniq

In this example, we are:

  • Opening the file at $file
  • Sorting the file so that duplicates end up in a contiguous block
  • Deduping so that only one value remains from each contiguous block

Here's another example of this command with piped input:

php -r 'for ( $i = 0; $i < 1000000; $i++ ) { echo sprintf( "%d\n", random_int( 0, 100 ) ); }' | sort -n | uniq

In this example, we are:

  • Generating 1,000,000 random numbers (between 0 and 100), each on its own line
  • Sorting that output so that like numbers are together
    • Note that we're using -n here to do an integer sort.
  • Deduping that so that we end up with a unique number on each line

If we wanted to know how often each number occurred in the file, we could simply add -c to the end of the command above. The resulting command would be php -r 'for ( $i = 0; $i < 1000000; $i++ ) { echo sprintf( "%d\n", random_int( 0, 100 ) ); }' | sort -n | uniq -c and we would get some output that looked like this:

9880 0
10179 1
9725 2
10024 3
9921 4
9893 5
9945 6
9881 7
9707 8
9955 9
9896 10
9845 11
9928 12
10024 13
10005 14
9834 15
9929 16
9764 17
9795 18
9932 19
9735 20
10082 21
9876 22
9835 23
9748 24
9947 25
9975 26
9841 27
9856 28
9751 29
10138 30
10037 31
10026 32
10128 33
9926 34
9821 35
9990 36
9920 37
9696 38
9886 39
9896 40
9815 41
9924 42
9739 43
9854 44
9936 45
9977 46
9873 47
9824 48
10043 49
10054 50
9870 51
9783 52
9901 53
9819 54
9882 55
10022 56
9899 57
9922 58
9922 59
9902 60
10036 61
9830 62
9792 63
9894 64
10008 65
9774 66
9918 67
9986 68
9814 69
9661 70
10117 71
10046 72
9704 73
10016 74
9601 75
9901 76
9923 77
9931 78
9909 79
9895 80
9771 81
10044 82
10059 83
9864 84
9938 85
9799 86
10006 87
9883 88
9880 89
9837 90
9701 91
9870 92
9998 93
9809 94
9883 95
10144 96
9935 97
9979 98
9922 99
9789 100

What is the JavaScript event loop?

I remember the first time I saw a setTimeout( fn, 0 ) call in some React code. Luckily there was a comment with the code, so I kind of had an idea of why that code was there. Even with the comment though, it was still confusing. 😉
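For anyone who hasn't run into that pattern, here is a minimal sketch of what a zero-delay timeout does. The order array and its labels are made up for illustration. The point it tries to show is that a 0 ms delay does not run the callback immediately: the callback waits until the currently running code has finished.

```javascript
// Record the order in which each piece of code actually runs.
const order = [];

order.push( 'first' );

// A 0 ms delay does not mean "run right now". The callback is queued
// and only runs once the current call stack has emptied.
setTimeout( function () {
    order.push( 'timeout callback' );
    console.log( order.join( ' -> ' ) ); // first -> last -> timeout callback
}, 0 );

order.push( 'last' );
```

Even though the timeout is registered before 'last' is pushed, its callback runs afterwards, which is why setTimeout( fn, 0 ) is sometimes used to defer work until the current code has finished.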

Since then, I've read several articles about the event loop and got to a point where I was fairly comfortable with my understanding. But, after watching this JSConf talk by Philip Roberts, I feel like I've got a much better understanding.

In the talk, Philip uses a slowed down demonstration of the event loop to explain what's going on to his audience. Philip also demonstrates a tool that he built which allows users to type in code and visualize all of the parts that make JavaScript asynchronous actions work.

You can check out the tool, but I'd recommend doing it after watching the video.

How to install Unison 2.48 on Ubuntu

For developing on remote servers, but using a local IDE, I prefer to use Unison over other methods that rely on syncing files via rsync or SFTP.

But, one issue with Unison is that both computers must run the same version to sync. And since Homebrew installs Unison 2.48.4 and apt-get install unison installs something like 2.0.x, this meant I couldn’t sync between my computer and a development machine if I wanted to install Unison via apt-get.

No worries, by following the documentation, and a bit more searching, I was able to figure out how to build Unison 2.48.4 on my development server!

Note: I did run into a warning at the end of the build. But, from what I can tell, the build actually succeeded. The second-to-last step below helps you test if the build succeeded.

  • apt-get install ocaml
  • apt-get install make
  • curl -O [URL of unison-2.48.4.tar.gz]
  • tar -xvzf unison-2.48.4.tar.gz
  • cd src
  • make UISTYLE=text
  • ./unison to make sure it built correctly. You should see something like this:
    Usage: unison [options]
    or unison root1 root2 [options]
    or unison profilename [options]
    For a list of options, type "unison -help".
    For a tutorial on basic usage, type "unison -doc tutorial".
    For other documentation, type "unison -doc topics".
  • mv unison /usr/local/bin

After going through these commands, unison should be in your path, so you should be able to use unison from any directory without specifying the location of the binary.

How to apply a filter to an aggregation in Elasticsearch

When using Elasticsearch for reporting efforts, aggregations have been invaluable. Writing my first aggregation was pretty awesome. But, pretty soon after, I needed to figure out a way to run an aggregation over a filtered data set.

As with learning all new things, I was clueless how to do this. 😄 Turns out, it’s quite easy. Within a few minutes, I came across some articles that recommended using a top-level query with a filtered argument, which seemed cool because I could just copy my filter up.

That’d look something like:

    "query": {
        "filtered": {}
    }

But, one of my coworkers pointed out that filtered queries have been deprecated and removed in 5.x. Womp womp. So, the alternative was to just convert the filter to a bool must query.

Here’s an example:


You can find the Shakespeare data set that I’m using, as well as instructions on how to install it here. Using real data and actually running the query seems to help me learn better, so hopefully you’ll find it helpful.

Once you’ve got the data, let’s run a simple aggregation to get the list of unique plays.

GET shakespeare/_search
{
  "aggs": {
    "play_name": {
      "terms": {
        "field": "play_name",
        "size": 200
      }
    },
    "play_count": {
      "cardinality": {
        "field": "play_name"
      }
    }
  },
  "size": 0
}

Based on this query, we can see that there are 36 plays in the dataset, which is one off from what a Google search suggested. I’ll chalk that up to slightly off data perhaps?

Now, if we were to dig through the buckets, we could list out every single play that Shakespeare wrote, without having to iterate over every single doc in the dataset. Pretty cool, eh?

But, what if we wanted to see all plays that Falstaff was a speaker in? We could easily update the query to be something like the following:

GET shakespeare/_search
{
  "query": {
    "bool": {
      "must": { "term": { "speaker": "FALSTAFF" } }
    }
  },
  "aggs": {
    "play_name": { "terms": { "field": "play_name", "size": 200 } }
  },
  "size": 0
}

In this case, we’ve simply added a top-level query that returns only docs where FALSTAFF is the speaker. Then, we take those docs and run the aggregation. This gives us results like this:

{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1117,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "play_name": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "Henry IV",
               "doc_count": 654
            },
            {
               "key": "Merry Wives of Windsor",
               "doc_count": 463
            }
         ]
      }
   }
}

And based on that, we can see that FALSTAFF was in “Henry IV” and “Merry Wives of Windsor”.


Feel free to leave a comment below if you have critical feedback or if this helped you!

How to retry Selenium Webdriver tests in Mocha

While working on some functional tests for a hosting provider, I kept running into an issue where the login test was failing due to a 500 error. It appeared as if the site hadn’t been fully provisioned by the time my test was trying to log in.

Initially, I attempted adding timeouts to give the installation process more time, but that seemed prone to error as well since the delay was variable. Also, with a timeout, I would’ve had to make the timeout be the longest expected time, and waiting a minute or so in a test suite didn’t seem like a good idea.

Getting it done

You’d think it’d be a quick fix, right? If this errors, do it again.

Within minutes, I had found a setting in Mocha that allowed retrying a test. So, I happily plugged that in, ran the test suite again, and it failed…

The issue? The JS bindings for Selenium Webdriver work off of promises, so they don’t quite mesh with the built-in test retry logic. And not having dug into promises much yet, it definitely took me a bit to wrap my head around a solution.

That being said, there are plenty of articles out there that talk about retries with JavaScript promises, which helped bring me up to speed. But, I didn’t find any that were for specifically retrying promises with Selenium Webdriver in a Mocha test suite.

So, I learned from a couple of examples, and came up with a solution that’d work in my Selenium Webdriver Mocha tests.

The Code

You can find a repo with the code and dependencies here, but for convenience, I’m also copying the relevant snippets below:

The retry logic

This function below recursively calls itself, fetching a promise with the test assertions, and decrementing the number of tries each time.

Each time the function is called, a new promise is created. In that promise, we use catch so that we can hook into the errors and decide whether to retry the test or throw the error.

Note: The syntax looks a bit cleaner in ES6 syntax, but I didn’t want to set that up. 😄

var handleRetries = function ( browser, fetchPromise, numRetries ) {
    numRetries = 'undefined' === typeof numRetries
        ? 1
        : numRetries;
    return fetchPromise().catch( function( err ) {
        if ( numRetries > 0 ) {
            return handleRetries( browser, fetchPromise, numRetries - 1 );
        }
        throw err;
    } );
};

The test

The original test, without retries, looked something like this:

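test.describe( 'Can fetch URL', function() {
    test.it( 'page contains something', function() {
        // NOTE: the locator call was lost from the original post; webdriver.By.linkText is an assumption.
        var selector = webdriver.By.linkText( 'ebinnion' ),
            i = 1;
        browser.get( '' );
        return browser.findElement( selector );
    } );
} );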

After integrating with the retry logic, it now looks like this:

test.describe( 'Can fetch URL', function() {
    test.it( 'page contains something', function() {
        // NOTE: the locator call was lost from the original post; webdriver.By.linkText is an assumption.
        var selector = webdriver.By.linkText( 'ebinnion' ),
            i = 1;
        return handleRetries( browser, function() {
            console.log( 'Trying: ' + i++ );
            browser.get( '' );
            return browser.findElement( selector );
        }, 3 );
    } );
} );

Note that the only thing we did differently in the test was put the Selenium Webdriver calls (which return a promise) inside a callback that gets called from handleRetries. Putting the calls inside this callback allows us to get a new promise each time we retry.
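To see the retry behaviour without a browser, here is a minimal sketch that runs in plain Node. It assumes native promises instead of Selenium Webdriver's, drops the unused browser argument, and uses two hypothetical names that are not part of the original code: retry (the same recursive shape as handleRetries) and flakyLogin (a stand-in for a Webdriver call that fails twice before succeeding).

```javascript
// Same recursive shape as handleRetries, minus the unused browser argument.
function retry( fetchPromise, numRetries ) {
    return fetchPromise().catch( function ( err ) {
        if ( numRetries > 0 ) {
            // Calling fetchPromise again creates a brand new promise for this attempt.
            return retry( fetchPromise, numRetries - 1 );
        }
        throw err;
    } );
}

// Hypothetical stand-in for a Webdriver call: rejects twice, then resolves.
var attempts = 0;
function flakyLogin() {
    attempts++;
    return attempts < 3
        ? Promise.reject( new Error( 'HTTP 500' ) )
        : Promise.resolve( 'logged in' );
}

var result = retry( flakyLogin, 3 );
result.then( function ( value ) {
    console.log( value + ' after ' + attempts + ' attempts' ); // logged in after 3 attempts
} );
```

Because the callback is invoked again on every retry, each attempt starts from a fresh promise instead of re-reading one that has already been rejected.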


Feel free to leave a comment if you have input or questions. Admittedly, I may not be too much help if it’s a very technical testing question, but I can try.

I’m also glad to accept critical feedback if there’s a better approach. Particularly an approach that doesn’t require an external module, although I’m glad to hear of those as well.

PHP – Get methods of a class along with arguments

Lately, I’ve been using the command line a lot more often at work. I found two things hard about using the command line to interact with PHP files:

  1. Figuring out the require path every time I opened an interactive shell
  2. Remembering what methods were available in a class and what arguments each method expected

The first was pretty easy to handle by writing a function that would require often used files. The second one turned out to not be too hard and is the subject of this post.

The code

Below is the code that I used to get the methods of an object as well as the arguments for each method.


function print_object_methods( $mgr ) {
    foreach ( get_class_methods( $mgr ) as $method ) {
        echo $method;
        $r = new ReflectionMethod( $mgr, $method );
        $params = $r->getParameters();
        if ( ! empty( $params ) ) {
            $param_names = array();
            foreach ( $params as $param ) {
                $param_names[] = sprintf( '$%s', $param->getName() );
            }
            echo sprintf( '( %s )', implode( ', ', $param_names ) );
        }
        echo "\n";
    }
}

An example

Let’s use the Jetpack_Options class from Jetpack as an example. You can find it here:

For that class, the above code would output:

get_option_names( $type )
is_valid( $name, $group )
is_network_option( $option_name )
get_option( $name, $default )
get_option_and_ensure_autoload( $name, $default )
update_option( $name, $value, $autoload )
update_options( $array )
delete_option( $names )
delete_raw_option( $name )
update_raw_option( $name, $value, $autoload )
get_raw_option( $name, $default )

As a note, in this case, it could also be nice to print out the docblock for each method instead of just the arguments to add some context. But, I didn’t need too much context for a file that I’m in pretty often. Your mileage may vary.

A Year of Google Maps & Apple Maps

I came across a really great article that compares changes in Google Maps and Apple Maps over a year. It’s really great to see how much Google is experimenting and improving their product.

Similar to how a software engineer refactors their code before expanding it, Google has repeatedly refactored the styling of its map as it has added new datasets. And we see this in the evolution of Google Maps’s cartography:

As Google has added more and more datasets, it has continually rebalanced the colors, weights, and intensities of the items already on its map – each time increasing its map’s capacity for more.

Source: A Year of Google Maps & Apple Maps

What I do as a software developer at Automattic

This past Sunday, an 18-year-old who intends on starting at UNT next Fall and majoring in Computer Science explained to me the difference between a computer scientist and a programmer.

As he explained it, computer scientists are people who conceptualize software and programmers are the people who merely carry out the plan that the computer scientists made.

This kid went on to describe computer scientists as the people who do the thinking and the programmers as “dime a dozen.”

But, that’s not right…

The more the kid spoke, the more irritated I became because I do not perceive my role as a programmer to be one where I simply follow instructions without thought.

So, while I was initially very irritated, I later realized that:

  1. Automattic is my first job out of University. So, perhaps my perception is skewed
  2. This kid likely has no experience and is making assumptions

Thinking some more, it seemed like the best way to move forward would be to simply explain what I do as a programmer.

So, what do I do at Automattic?

Code Wrangler job description
This is the Automattic Code Wrangler job description as of December 21st, 2016.

My title at Automattic is Code Wrangler. But, that title is merely the default title for developers at Automattic and doesn’t mean much since all Automatticians can choose any title they like. 😂

Further, Code Wrangler is a very generic job title at Automattic, so what I do can be very different from what a Code Wrangler working on another product does.

That being said, I’d still like to share the things that I do as a Code Wrangler (aka software developer) at Automattic.

Maker of things

As a Code Wrangler, bringing things to life with code, or fixing broken things, is my “bread and butter”.

I spend a majority of my time on the front-end building interactive UIs with JavaScript, HTML, CSS, PHP, etc., which you can see from the images I share in this post.

But, thanks to being on team Poseidon (the Jetpack platform team), I also get to do other interesting things, such as:

  • Writing a script to loop through all Jetpack sites and backfill missing data
  • Somewhat fixing a several-years-old bug that is lovingly called an Identity Crisis
  • Adding CLI commands to Jetpack’s sync functionality

And while I spend a majority of my time actually working with code, I do spend significant amounts of time doing other things.

Manager of things

Account management section of WordPress.com
Allen Snook, Kevin Conboy, and I were the main people behind getting the account management section into the newer JavaScript-based version of WordPress.com, nicknamed Calypso. You can find more information here.

At Automattic, we don’t really hire for typical project manager roles, at least to my knowledge.

Because of this, I’ve had the opportunity to take on the role of a project manager for several of the projects that I’ve worked on.

I like to get shit done, and I will gladly do whatever is necessary to push a project to the finish line.

This means that I am often involved in much of the planning for the projects I work on, which can include diagramming, white-boarding, creating and assigning tasks, discussing designs, etc.

Maybe I’m good at managing things. Maybe I’m just opinionated and don’t mind sharing those opinions. ¯\_(ツ)_/¯

Breaker of things

Jetpack Secure Sign On
Earlier this year, I refreshed the Secure Sign On module with design input from Michael Arestad. We were able to spruce the design up, but I also fixed many flow issues and bugs.

At Automattic, we have the “Flow” team whose job I understand to be:

  • Manually testing Automattic’s products
  • Implementing automated functional testing of Automattic’s products
  • Working with other teams at Automattic to improve testing processes
  • Generally being awesome 😄

While I am not on this team, testing is definitely one of the things I like most about my job, even though testing isn’t mentioned anywhere in the Code Wrangler job description.

I often find myself asking things such as:

  • I wonder what happens if I rotate the screen here?
  • Does this input have validation?
  • What does the mobile layout look like here?
  • What’s a flow that wasn’t thought about?
  • How do things look in Internet Explorer?

I don’t have a strict testing list, other than the fact that I try and test Internet Explorer every Friday. Yet, I am often able to find bugs in new software as well as regressions in older software.

The thought I’d like to leave you with for this section is that testers are necessary to make sure that we have considered as many things as possible before our users unwittingly become our testers. Sure, some planning ahead can help reduce issues, but no one person can catch them all.


User Management Section of WordPress.com
This is the user management section of WordPress.com, a project which I had the honor of leading. I worked with Miguel Lezama, Rocco Tripaldi, Rick Banister, and Kevin Conboy to bring this together.

I have the honor to work with amazing designers such as Rick Banister, Michael Arestad, Jeff Golenski, and many others.

And I’ll be the first to tell you that I’m definitely not a designer in the same way that they are designers. They’re all badasses.

But, one of the things I learned early on is that, as a developer, I can play a vital role in the design of a project.

For example, if I am a developer on a project, and I am delivered a design, should I simply implement that design? I’m not so sure about that.

WordPress.com invite users form
This is the form to invite users to a site. Worked on by myself, Miguel Lezama, Rocco Tripaldi, and Rick Banister.

On one hand, the designers have likely delivered what the ideal flow should be. But, have the designs taken in to account technical limitations? What about business limitations? If the designers only delivered mobile designs, what does this thing look like on desktops? Vice versa?

As the person who will implement a design, it’s my job to provide feedback to designers so that, together, we can deliver the best product possible in the shortest timeframe possible.

I didn’t always understand my role to be like this. The first project I led, the user management section of WordPress.com (pictured at the top of this section), went very differently actually. At the beginning of the project, Rocco Tripaldi and I were given designs, and we immediately set out to make those designs a reality.

Jetpack Sync Panel
This is the Jetpack sync panel that is displayed within WordPress.com. This allows the user to trigger a full sync of their site and watch the progress.

After Rocco and I spent several weeks quietly working towards implementing the design, we finally came to the conclusion that the design was not then possible without some significant changes to how we stored data, so we decided to tweak the design a bit to be compatible with how we then stored data.

This was an expensive (to the company) lesson for me. Had I provided some feedback with the issues I was running into earlier, we could have made the decision to change sooner and potentially saved a month or more of developer time (two developers times 1-3 weeks).

In my experience, the best work that I’ve done has been a result of great communication and collaboration between development and design.

What do you do as a software developer?

One of the things that I became interested in after talking to this kid was the fact that software developers likely work very differently at different companies.

So, how do you work? Do you wear different hats, or do you tend to just do programming? Feel free to leave a comment or link back to this post.

Interested in wrangling code with me?

Automattic is always hiring, so if you found this post interesting, check out our open positions.

This picture was taken by Jeff Golenski at the 2015 Automattic Grand Meetup in Park City

Tired of Vagrant? Try Laravel Valet

I’m always interested in optimizing my dev environment, and after Thomas’s article about Laravel Valet, I’m going to be giving that a try for local WordPress development.

I’ve used Vagrant for more than a year now and although it was crashing from time to time, I always managed to get it working again. Not last week. I don’t what happened, but enough was enough – I decided to pull the plug and look for a better alternative.

That’s when I found Laravel Valet (or Valet for short)… It’s so easy I wish I’d switched to it a few months ago when it was initially released.

Source: Tired of Vagrant? Try Laravel Valet – ThemeShaper

The Principles of Design: Font Pairing

I find working with fonts to be one of the most difficult aspects of design. Line height, kerning, font pairing, and everything else is confusing to me. 😄 The Daily Post released an article today about font pairing with some great examples of font pairings. While the article is meant for WordPress.com users, the examples of paired fonts make it a great read.

In the past, we’ve discussed some tips for choosing fonts. Today we’ll talk about how to navigate choosing more than one font for your site.

If you look closely at most websites (like The Daily Post), you’ll see that they’re often using more than one font. In most cases, that’s a stylistic choice. The site would probably function fine with just a single font, but the designer has chosen two to introduce a little more visual hierarchy to the typography.

Source: The Principles of Design: Font Pairing | The Daily Post

What do programmers do?

One of my coworkers recently shared what I found to be a great read about what programmers do. Here are some sections that I liked most:

One method for maintaining stability is the maintenance programmer. The longevity of the program is therefore dependent on the capability, comprehension and intelligence of this person. But humans are not omniscient in comprehending programs. As a matter of fact one of the most intellectual endeavors is the analysis and comprehension of an existing program structure.

The ritual of programming is of great consequence because it deals with the communication between the original program author and the programmer responsible for maintaining the structural integrity of the program.

A programmer does not primarily write code; rather, he primarily writes to another programmer about his problem solution. The understanding of this fact is the final step in his maturation as technician.