How I use Summary Indexes in Splunk

Posted by David Veuve - 2011-04-13 15:05:17
At the recent San Francisco Splunk Meetup, there was a brief joking exchange about how the secret to using Summary Indexing was to ignore the summary index commands (sistats, etc.). That raised the question of how one should realistically use summary indexing, so I decided to write up how I use it in my environment. There are a lot of ways to use summary indexing (and particularly the si commands); this won't provide full coverage of the topic, but it is definitely a solid way to enter the powerful world of summary indexing.

I'll also try to follow this up in the next week with a post on when it does make sense to use the si commands.

There are two high-level perspectives that I choose between when I do summary indexing:

  1. Know exactly what search you want to run

    Unlike the rest of Splunk, where you've got a ton of flexibility, you want your summary index to be as small as it can be. The good news is that it is generally pretty easy to clear and backfill a summary index -- it just may take a while. If you're indexing 5 TB of logs, that probably isn't true, so it's all the more important to really know your requirements. My general process is to build a summary index only when I'm finally ready to productionalize my app.

    An example of this would be a report on the top 10 src-dst pairs for firewall denies per day. That's the report I'm putting on the dashboard, and that's all the data I'm going to be indexing.
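
    Such a populating search might look like the following sketch -- field names like src_ip and dst_ip, and the action=deny filter, are assumptions; substitute whatever your firewall sourcetype actually extracts:

    sourcetype=firewall action=deny earliest=-1d@d latest=@d
       | stats count as DenyCount by src_ip, dst_ip
       | sort -DenyCount
       | head 10

    Ten small events per day go into the summary index, and the dashboard panel just reads them back out.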

  2. If you don't know, generalize.

    If you're indexing a data source that contains several data points at different time intervals, grab them all. As detailed below, adding additional aggregations for a set time interval (e.g., avg(val), max(val), min(val)) is essentially free. So go wild.

    I follow this approach for daily summaries of csv files, where I can split by only one or two fields, and pull out a huge amount of data.
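
    As a sketch, assuming a csv source with a numeric val field and a single host split field (both names are placeholders), that daily summary might look like:

    sourcetype=my_csv earliest=-1d@d latest=@d
       | stats count as EventCount, avg(val) as AvgVal, max(val) as MaxVal,
               min(val) as MinVal, sum(val) as SumVal by host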

Beyond those high-level guidelines, there are a few critical technical guidelines:

  1. Ignore the si commands.

    This is where you'll first turn for Splunk summary indexing, because the docs use them as examples. But I never, ever use them. There are benefits to using the si commands, and I hope to detail them in a future post, but they only add value in specific scenarios, and they add complexity overhead to the summary indexing process. In essence, they will only work well if you're going to use -exactly- the stats command you used to generate your index. If you change things around, you're going to find yourself trying to understand why on earth you can't read the contents of your index. My advice: don't start with them.

    What should you do instead? Just use a normal stats command. And make sure to...
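
    In practice, that means either enabling summary indexing on a scheduled search in Manager, or piping a plain stats result into the collect command yourself. A minimal sketch (index=summary is Splunk's default summary index; everything else here is a placeholder):

    YourSearch earliest=-1d@d latest=@d
       | stats sum(HourlyTotal) as DailyTotal
       | collect index=summary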

  2. Rename your fields.

    If you're trying to do a summary index of 

    YourSearch earliest=-1d@d latest=@d
       | stats sum(HourlyTotal), avg(HourlyTotal)

    make that:

    YourSearch earliest=-1d@d latest=@d
       | stats sum(HourlyTotal) as DailyTotal, avg(HourlyTotal) as HourlyAverage

    This has two benefits: it lets you consistently give things a logical name that you'll understand later, and, more importantly, it lets you actually reference the field later. When you look through your summary index, Splunk will have turned each sum(HourlyTotal) into something like sum_HourlyTotal_, and you get into all manner of complexity trying to reference it.
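
    The payoff comes when you read the summary back. With renamed fields, the retrieval search stays simple -- this sketch assumes the default summary index and uses search_name, the field Splunk stamps on each summary event with the name of the scheduled search that wrote it:

    index=summary search_name="YourScheduledSearchName"
       | timechart span=1d avg(HourlyAverage)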

  3. Report as much as you can, splitting by as little as possible.

    If I have a datasource that has the number of requests coming in to a web service and I want to archive Daily data, I will probably make my summary index something like: 

        | bucket _time span=1h
        | stats count as Req by _time, server
        | stats sum(Req) as DailyRequestTotal, avg(Req) as HourlyRequestAverage,
                   max(Req) as BusiestHour, min(Req) as SlowestHour by server

    What I will not do is:

        | bucket _time span=1h
        | stats count as Req by  _time, server,status_code,uri
        | stats sum(Req) as DailyRequestTotal by server,status_code,uri

    Adding aggregation functions to the stats portion will at most linearly increase the size of your index, while keeping the same number of events. Adding fields to the by clause will multiply the number of events (and the size of your index) by the cardinality of each new field. If you had 30 different uris, 4 different status codes, and 5 servers, switching from the first query to the second would go from 5 events per day to as many as 600 (5 × 4 × 30).

    A caveat: when I say anything on the aggregation side is free, that's not entirely true. According to Gerald Kanapathy's presentation at the first user conference, the following statistical functions are free: count, avg, sum, stdev, max, min, first, last. The following are not free: median, percXX, dc, mode, top, list, values. He notes, though, that if you've got fewer than 1k values per summary run, it won't be a problem.

  4. Backfill your index to verify success

    Backfilling the index will go through your old logs and fill your index. There used to be a script with the word backfill in the filename, which you'll readily find on Google if you're searching for the command -- that script is now obsolete. The new method is to run:

    cd /opt/splunk/bin/ && ./splunk cmd python fill_summary_index.py -app YourAppName -name "YourScheduledSearchName" -et -1mon@d -lt @d -j 8 -auth admin:changeme -owner YourUsername
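
    Once the backfill finishes, a quick sanity check is to count summary events per day and look for gaps -- again assuming the default summary index and the search_name field that Splunk stamps on summary events:

    index=summary search_name="YourScheduledSearchName" earliest=-1mon@d latest=@d
       | timechart span=1d count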

The above should get you on the road to summary indexing success. I'll plan to do a follow-up post on where the si commands -should- be used (in essence, the areas where you can ignore half of what's above).