Given a JSON or CSV log file, analyze the data provided and solve each of these scenarios:
- Generate a table of response codes returned for each request.
- Generate a table showing how many times each response code was returned.
- Plot a graph showing how many times each response code was returned.
Analyzing Log Files with Python
To analyze a JSON or CSV log file with Python, we can use Pandas. Pandas DataFrames are two-dimensional tables with labeled axes that are good for analyzing datasets:
- Import the necessary libraries (pandas):
import pandas as pd
- Load the log file as a DataFrame. The pandas library has the functions read_json() and read_csv(), which convert the respective formats into a pandas DataFrame. Passing lines=True tells read_json() to read the log file as one JSON object per line.
data = pd.read_json('nginx.json', lines=True) # read_json() example
data = pd.read_csv('nginx.csv') # read_csv() example
- View and analyze the data.
The info method prints information about a DataFrame, including the index dtype and columns, non-null values, and memory usage. It prints this summary itself and returns None, so wrapping it in print() also prints a trailing None.
print(data.info())
The describe method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
print(data.describe())
To view the first five entries, we can use the head method.
print(data.head())
Let's put it all together and see what we get.
import pandas as pd
# data = pd.read_csv('nginx.csv') # alternative: load a csv file as a pandas dataframe
data = pd.read_json('nginx.json', lines=True) # load json file as a pandas dataframe
print(data.info()) # dataframe info
print(data.describe()) # describe dataframe
print(data.head()) # show first 5 entries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51462 entries, 0 to 51461
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   time         51462 non-null  object
 1   remote_ip    51462 non-null  object
 2   remote_user  51462 non-null  object
 3   request      51462 non-null  object
 4   response     51462 non-null  int64
 5   bytes        51462 non-null  int64
 6   referrer     51462 non-null  object
 7   agent        51462 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.1+ MB
None
           response         bytes
count  51462.000000  5.146200e+04
mean     361.414597  6.595095e+05
std       64.620998  6.518840e+06
min      200.000000  0.000000e+00
25%      304.000000  0.000000e+00
50%      404.000000  3.340000e+02
75%      404.000000  3.380000e+02
max      416.000000  8.637717e+07
                         time     remote_ip  remote_user  \
0  17/May/2015:08:05:32 +0000   93.180.71.3            -
1  17/May/2015:08:05:23 +0000   93.180.71.3            -
2  17/May/2015:08:05:24 +0000  80.91.33.133            -
3  17/May/2015:08:05:34 +0000  217.168.17.5            -
4  17/May/2015:08:05:09 +0000  217.168.17.5            -
                             request  response  bytes  referrer  \
0  GET /downloads/product_1 HTTP/1.1       304      0         -
1  GET /downloads/product_1 HTTP/1.1       304      0         -
2  GET /downloads/product_1 HTTP/1.1       304      0         -
3  GET /downloads/product_1 HTTP/1.1       200    490         -
4  GET /downloads/product_2 HTTP/1.1       200    490         -
                                           agent
0  Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.21)
1  Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.21)
2  Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.17)
3                 Debian APT-HTTP/1.3 (0.8.10.3)
4                 Debian APT-HTTP/1.3 (0.8.10.3)
Prompt 1
Generate a table of response codes returned for each request.
responses = data[['time','request','response']] # extract requests and responses
print(responses.head()) # output responses
                         time                            request  response
0  17/May/2015:08:05:32 +0000  GET /downloads/product_1 HTTP/1.1       304
1  17/May/2015:08:05:23 +0000  GET /downloads/product_1 HTTP/1.1       304
2  17/May/2015:08:05:24 +0000  GET /downloads/product_1 HTTP/1.1       304
3  17/May/2015:08:05:34 +0000  GET /downloads/product_1 HTTP/1.1       200
4  17/May/2015:08:05:09 +0000  GET /downloads/product_2 HTTP/1.1       200
Prompt 2
Generate a table showing how many times each response code was returned.
count_responses = responses['response'].value_counts() # count occurrences of each response code
print(count_responses) # output counts
response
404    33876
304    13330
200     4028
206      186
403       38
416        4
Name: count, dtype: int64
Prompt 3
Plot a graph showing how many times each response code was returned.
import matplotlib.pyplot as plt
# plot graph of count by response code
count_responses.plot.bar(
title='Count of Response Codes',
xlabel='Response Codes',
ylabel='Count'
)
plt.show() # output plot
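To try the loading and counting steps without the nginx log file, the following is a minimal, self-contained sketch. It assumes a tiny in-memory sample in the one-JSON-object-per-line layout that lines=True expects; the three entries reuse rows from the head() output shown earlier, trimmed to three columns. The names SAMPLE_LOG and count_response_codes are hypothetical and exist only for this illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory sample: one JSON object per line, which is the
# layout read_json(..., lines=True) expects. The rows reuse values from the
# head() output above, trimmed to three columns.
SAMPLE_LOG = """\
{"time": "17/May/2015:08:05:32 +0000", "request": "GET /downloads/product_1 HTTP/1.1", "response": 304}
{"time": "17/May/2015:08:05:34 +0000", "request": "GET /downloads/product_1 HTTP/1.1", "response": 200}
{"time": "17/May/2015:08:05:09 +0000", "request": "GET /downloads/product_2 HTTP/1.1", "response": 200}
"""


def count_response_codes(log_text: str) -> pd.Series:
    """Return how many times each response code appears in a JSON-lines log."""
    # Wrap the text in a file-like object so read_json treats it as file content.
    data = pd.read_json(io.StringIO(log_text), lines=True)
    # value_counts() tallies each distinct code, most frequent first.
    return data["response"].value_counts()


counts = count_response_codes(SAMPLE_LOG)
print(counts)
```

The resulting Series is indexed by response code, so it can be passed to plot.bar() in the same way as count_responses above.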