When using Elasticsearch for reporting efforts, aggregations have been invaluable. Writing my first aggregation was pretty awesome. But, pretty soon after, I needed to figure out a way to run an aggregation over a filtered data set.
As with learning all new things, I was clueless how to do this. Turns out, it’s quite easy. Within a few minutes, I came across some articles that recommended using a top-level query with a filtered argument, which seemed cool because I could just copy my filter up.
That’d look something like:
[code]
{
“query”: {
“filtered”: {}
}
}
[/code]
But, one of my coworkers pointed out that filtered queries have been deprecated and removed in 5.x
. Womp womp. So, the alternative was to just convert the filter to a bool must query.
Here’s an example:
Example
You can find the Shakespeare data set that I’m using, as well as instructions on how to install it here. Using real data and actually running the query seems to help me learn better, so hopefully you’ll find it helpful.
Once you’ve got the data, let’s run a simple aggregation to get the list of unique plays.
[code]
GET shakespeare/_search
{
“aggs”: {
“play_name”: {
“terms”: {
“field”: “play_name”,
“size”: 200
}
},
“play_count”: {
“cardinality”: {
“field”: “play_name”
}
}
},
“size”: 0
}
[/code]
Based on this query, we can see that there are 36 plays in the dataset, which is one off from what a Google search suggested. I’ll chalk that up to slightly off data perhaps?
Now, if we were to dig through the buckets, we could list out every single play that Shakespeare wrote, without having to iterate over every single doc in the dataset. Pretty cool, eh?
But, what if we wanted to see all plays that Falstaff was a speaker in? We could easily update the query to be something like the following:
[code]
GET shakespeare/_search
{
“query”: {
“bool”: {
“must”: {
“term”: {
“speaker”: “FALSTAFF”
}
}
}
},
“aggs”: {
“play_name”: {
“terms”: {
“field”: “play_name”,
“size”: 200
}
}
},
“size”: 0
}
[/code]
In this case, we’ve simply added a top-level query that returns only docs where FALSTAFF
is the speaker. Then, we take those docs and run the aggregation. This gives us results like this:
[code]
{
“took”: 5,
“timed_out”: false,
“_shards”: {
“total”: 5,
“successful”: 5,
“failed”: 0
},
“hits”: {
“total”: 1117,
“max_score”: 0,
“hits”: []
},
“aggregations”: {
“play_name”: {
“doc_count_error_upper_bound”: 0,
“sum_other_doc_count”: 0,
“buckets”: [
{
“key”: “Henry IV”,
“doc_count”: 654
},
{
“key”: “Merry Wives of Windsor”,
“doc_count”: 463
}
]
}
}
}
[/code]
And based on that, we can see that FALSTAFF
was in “Henry IV” and “Merry Wives of Windsor”.
Comments
Feel free to leave a comment below if you have critical feedback or if this helped you!
Leave a Reply