Thursday, 28 October 2010

Python Squid Log Analyser

The Python Squid Log Analyser is a small Python script which will read Squids access.log file and produces a report on:
  • Caching performance
  • Data transfer
  • most requested domains (by hits and by data transfer)
  • most requesting hosts (by hits and by data transfer)

My initial reasons for writing this were to take a look into the Squid caching performance and because I wanted to learn a bit more Python. (Being mostly used to PHP and Bash)
As I started to write this, it made sense to include statistics on the other things happening in the access.log file, since the script read through it all anyway, and so I ended up with a nice summary of daily usage. The way this runs on my system, is within a Bash script, which copies the access.log.1.gz file (yesterdays stats) at 5am each morning. It then extracts and renames it, runs the python against it and emails out the result. I hope this can be useful to someone else!


THE SCRIPT

'''
Created on Oct 11, 2010

@author: His Royal Highness

The squid log reader will analyse the squid access.log file and produce stats on its
content. it expects access.log to be in the same directory, 
but this can be changed to an absolute path as requried.
'''

''' VARIABLES '''
filename = "access.log"
file = open(filename)
''' Request Variables '''
total_requests = 0
total_cache_hits = 0
total_cache_misses = 0
''' Byte Total Variables '''
total_requested_bytes = 0
total_cache_bytes = 0
total_miss_bytes = 0
total_other_bytes = 0
host_requests = {}
host_total_bytes = {}
domain_requests = {}
domain_bytes = {}

''' Strip back URL to what we will measure on '''
def striptodomain(url):
    stripped = url.replace("http://", "")
    #stripped = stripped.repace("www.")
    bits = stripped.split('/')
    hostname = bits[0]
    return hostname


''' Rip through the file and generate the stats '''
for line in file:
    bit = line.split()
    total_requests += 1

    ''' Add in the host request and bytes totals '''
    if bit[2] in host_requests:
        host_requests[bit[2]] += 1
        host_total_bytes[bit[2]] = int(host_total_bytes[bit[2]]) + int(bit[1])
    else:
        host_requests[bit[2]] = 1
        host_total_bytes[bit[2]] = int(bit[1])
        
                
    ''' Add in the Domain request and bytes totals '''
    hostname = striptodomain(bit[6])
    if hostname in domain_requests:
        domain_requests[hostname] += 1
        domain_bytes[hostname] = int(domain_bytes[hostname]) + int(bit[1])
    else:
        domain_requests[hostname] = 1
        domain_bytes[hostname] = int(bit[1]) 
    
        
        
    if 'TCP_HIT' in line or 'TCP_MEM_HIT' in line or 'TCP_NEGATIVE_HIT' in line or 'TCP_IMS_HIT' in line or 'TCP_REFRESH_HIT' in line:
        total_cache_hits += 1                       
        total_cache_bytes += int(bit[1])
        total_requested_bytes += int(bit[1])
        
    elif "TCP_MISS" in line or 'TCP_REFRESH_MISS' in line or 'TCP_CLIENT_REFRESH_MISS' in line:
        total_cache_misses += 1
        total_miss_bytes += int(bit[1])
        total_requested_bytes += int(bit[1])
        
    else:
        total_other_bytes += int(bit[1])
        total_requested_bytes += int(bit[1])
file.close()

    
''' Extra file stats '''        
total_other = (total_requests - total_cache_hits) - total_cache_misses


''' Calculate Percent values '''
total_cache_hits_pc = round((1.0 * total_cache_hits / total_requests) * 100,2)
total_cache_misses_pc = round((1.0 * total_cache_misses / total_requests) * 100,2)
total_other_pc = round((1.0 * total_other / total_requests) * 100,2)

total_cache_bytes_pc = round((1.0 * total_cache_bytes / total_requested_bytes) * 100,2)
total_miss_bytes_pc = round((1.0 * total_miss_bytes / total_requested_bytes) * 100,2)
total_other_bytes_pc = round((1.0 * total_other_bytes / total_requested_bytes) * 100,2)

    

''' File type counts'''
print ""
print "Caching Performance"
print "Total requests: \t\t", total_requests
print "Total Cache hits:\t\t",total_cache_hits, total_cache_hits_pc, "%"
print "Total Cache misses:\t" ,total_cache_misses, total_cache_misses_pc, "%"
print "Total Other:\t\t" ,total_other, total_other_pc, "%"
print ""
print "Total bytes: \t\t", total_requested_bytes
print "Total Cache bytes:\t",total_cache_bytes, total_cache_bytes_pc, "%"
print "Total Miss bytes:\t\t",total_miss_bytes, total_miss_bytes_pc, "%"
print "Total Other bytes:\t",total_other_bytes, total_other_bytes_pc, "%"
print ""
print ""
print ""




''' Print Domain statistics '''
#sorted_domain_requests = sorted(domain_requests)
print "Most Requested Domains (hits)"
temp_list = domain_requests.keys()
temp_list.sort( key = domain_requests.__getitem__, reverse = True)
for a in range(25):
    print domain_requests[temp_list[a]]  , "\t\t" , temp_list[a]
print ""
print ""


print "Highest Data Transfer by Domain (bytes)"
temp_list = domain_bytes.keys()
temp_list.sort( key = domain_bytes.__getitem__, reverse = True)
for a in range(25):
    print domain_bytes[temp_list[a]]  , "\t" , temp_list[a]
print ""
print ""

print "Most requesting hosts (hits)"
temp_list = host_requests.keys()
temp_list.sort( key = host_requests.__getitem__, reverse = True)
if len(temp_list) < 25:
    hosts = len(temp_list)
else:
    hosts = 25
for a in range(hosts):
    print host_requests[temp_list[a]]  , "\t" , temp_list[a]
print ""
print ""

print "Highest Data Transfer by Host (bytes)"
temp_list = host_total_bytes.keys()
temp_list.sort( key = host_total_bytes.__getitem__, reverse = True)
for a in range(hosts):
    print host_total_bytes[temp_list[a]]  , "\t" , temp_list[a]
print ""
print ""

EXAMPLE OUTPUT - (Reduced and formatted)
Total requests:             2419883
Total Cache hits:          573516 23.7 %
Total Cache misses:     1845764 76.27 %
Total Other:                 603 0.02 %

Total bytes:                 5314660302
Total Cache bytes:      37366112 0.7 %
Total Miss bytes:         5273482834 99.23 %
Total Other bytes:       3811356 0.07 %

Most Requested Domains (hits)
78374           www.thomascook.com
73540           urs.microsoft.com:443
58916           sis.baa.com:443

Highest Data Transfer by Domain (bytes)
488356114       sitatex1.mconnect.aero:443
406234484       mail.google.com:443
271621761       b.mail.google.com:443

Most requesting hosts (hits)
83045   128.1.197.202
51284   172.16.248.116
38608   128.1.93.30

Highest Data Transfer by Host (bytes)
487945744       172.16.0.41
237554573       172.16.0.154
157165427       128.1.8.251


NOTES
  • The script can take a couple of minutes to run on large log files and is CPU intensive . In my environment, I am analysing ~400M files and it takes just over a minute.
  • The \t's in the output formatting have been configured to display in outlook, as I have this report sent via email. Running from the CLI may show badly indented columns.
  •  The script expects the access.log file to be in its own directory. If this is not the case, you can change the filename variable at the top from "access.log" to whatever is required. (Likely to be /var/log/squid/access.log if you are running on the box where squid is installed)
  • On my setup, logrotate was configured to move the access.log file on a weekly basis. As I wanted to produce a daily report, I changed the settings in /etc/logrotate.d/squid to daily and increased the rotate number.
  • The function strip to domain could definitely be improved, I was experimenting with splitting the host and taking the -2,-1 values to form the domain. If I had more time, I could have entered all TLDs, made a note of which the current URLs came from and strip it back based on that, but then you also run into the problem for sites like maps.google.com, which you may want to distinguish from www.google.com. If anyone has any ideas for this, I’d be interested to hear them.
  • Finally, I wrote this while learning Python, so all criticism is welcomed so as long as it is constructive!