Table of Contents
- Honeypots Used
- Data Analysis
I am taking a cyber security class. This week's assignment had us work with honeypots. A honeypot is a server that pretends to have a vulnerability of some sort (open ports, old software, etc.) and instead collects data on the people who try to hack it.
At the end of the experiment I ended up with data from four honeypots. The data and my write-up can be found here. In this post I want to focus on how I used Pandas and Python to gain some insight into the data I collected.
Pandas is a great library for Python that makes it really easy to explore various kinds of data (JSON, CSV, etc.). It's available via pip install pandas. I recommend using IPython (available via pip install ipython) for all simple Python tasks. If you are not used to the command line, I recommend using Jupyter Notebook instead, which is like IPython in a browser.
Honeypots Used
- Dionaea with HTTP – Network Scanners – https://github.com/DinoTools/dionaea
- p0f – Network Scanners – https://github.com/threatstream/mhn/wiki/p0f-Sensor
- Suricata Sensor – IDS, IPS and NSM engine – https://github.com/threatstream/mhn/wiki/Suricata-Sensor
- ElasticHoney Sensor – Remote Code Execution in ES before 1.3.8 – https://github.com/threatstream/mhn/wiki/ElasticHoney-Sensor
Data Analysis
Load Data
import pandas as pd
import json
# Load data, can be found at https://github.com/akras14/codepath9/blob/master/session.json
with open("session.json") as f:
    data = f.read()
data = data.split("\n")
data.pop()  # Drop last empty element
data = [json.loads(d) for d in data]
Sample output
data
[{'_id': {'$oid': '5ac00385616a1e781bfa54b3'},
'destination_port': 80,
'honeypot': 'dionaea',
'hpfeed_id': {'$oid': '5ac00383616a1e781bfa54b2'},
'identifier': 'e8351d14-352d-11e8-a320-42010a800002',
'protocol': 'httpd',
'source_ip': '199.201.64.145',
'source_port': 38877,
'timestamp': {'$date': '2018-03-31T21:54:11.887+0000'}}, # etc ...
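The split-and-pop loading code above relies on the file ending with exactly one trailing newline. Here is a slightly more defensive sketch, assuming the same one-JSON-object-per-line layout. The helper name load_json_lines is my own invention for illustration, not part of any library.

```python
import json


def load_json_lines(path):
    """Parse a file holding one JSON object per line, skipping blank lines."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # ignore empty lines wherever they appear
                records.append(json.loads(line))
    return records
```

With this in place, data = load_json_lines("session.json") should give the same list of dicts as before.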
Load data into a Pandas DataFrame
df = pd.DataFrame.from_dict(data)
df.iloc[0] # Show first item in dataframe
Sample Output
_id {'$oid': '5ac00385616a1e781bfa54b3'}
destination_ip NaN
destination_port 80
honeypot dionaea
hpfeed_id {'$oid': '5ac00383616a1e781bfa54b2'}
identifier e8351d14-352d-11e8-a320-42010a800002
protocol httpd
sensor NaN
source_ip 199.201.64.145
source_port 38877
suricata NaN
timestamp {'$date': '2018-03-31T21:54:11.887+0000'}
Name: 0, dtype: object
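Notice that _id, hpfeed_id and timestamp come through as nested dicts, which makes them awkward to filter or group on. One way to deal with that, sketched below on two hand-made records shaped like the sample output (the second record and both _id values are made up), is to flatten the records with pd.json_normalize and parse the date string with pd.to_datetime. The "day" column name is my own choice.

```python
import pandas as pd

# Two hand-made records shaped like the sample output above (not real data).
sample = [
    {"_id": {"$oid": "example-id-1"},
     "honeypot": "dionaea",
     "source_ip": "199.201.64.145",
     "timestamp": {"$date": "2018-03-31T21:54:11.887+0000"}},
    {"_id": {"$oid": "example-id-2"},
     "honeypot": "p0f",
     "source_ip": "5.188.9.25",
     "timestamp": {"$date": "2018-04-01T02:10:00.000+0000"}},
]

# Nested keys become dotted column names such as "timestamp.$date".
flat = pd.json_normalize(sample)

# Convert the date strings into timezone-aware datetimes (UTC).
flat["timestamp"] = pd.to_datetime(flat["timestamp.$date"], utc=True)

# Keep only the calendar day, then count rows per day.
flat["day"] = flat["timestamp"].dt.date
per_day = flat.groupby("day").size()
print(per_day)
```

Once the timestamp is a real datetime column, per-day or per-week counts become a one-line groupby.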
OK, that's cool. Let's see which IP hit me the most.
Show top 10 IPs and attack count
most_common = df['source_ip'].value_counts()[:10]
# and
most_common.to_dict()
Sample output
10.128.0.8 4457
199.201.64.145 1965
5.188.11.145 1295
199.201.64.139 992
191.101.167.7 764
5.62.39.237 658
5.62.43.21 657
77.72.85.25 512
5.188.9.25 441
5.188.11.63 410
Name: source_ip, dtype: int64
# As dictionary
{'10.128.0.8': 4457,
'191.101.167.7': 764,
'199.201.64.139': 992,
'199.201.64.145': 1965,
'5.188.11.145': 1295,
'5.188.11.63': 410,
'5.188.9.25': 441,
'5.62.39.237': 658,
'5.62.43.21': 657,
'77.72.85.25': 512}
Note: The dictionary output is not sorted by count. You can also run a for loop on the most_common variable itself, but I wanted to demo the to_dict() conversion.
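For completeness, here is a small sketch of looping over the Series directly, which keeps the count order. The DataFrame below is a tiny made-up stand-in (using reserved documentation IPs) so the snippet runs on its own.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame; the rows are made up.
df = pd.DataFrame({
    "source_ip": ["192.0.2.1", "192.0.2.1", "192.0.2.2",
                  "192.0.2.1", "192.0.2.3", "192.0.2.2"],
})

most_common = df["source_ip"].value_counts().head(10)

# Series.items() yields (index, value) pairs, here (ip, count),
# in the same descending order that value_counts() produced.
rows = []
for ip, count in most_common.items():
    rows.append((ip, int(count)))
    print(ip, count)
```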
I wonder which honeypots those IPs went after?
Show attacks for most common IPs
for ip in most_common.to_dict():
    print(ip, df[df['source_ip'] == ip]['honeypot'].unique())
Sample Output
10.128.0.8 ['suricata']
199.201.64.145 ['dionaea']
5.188.11.145 ['dionaea']
199.201.64.139 ['dionaea']
191.101.167.7 ['dionaea']
5.62.39.237 ['dionaea']
5.62.43.21 ['dionaea']
77.72.85.25 ['dionaea']
5.188.9.25 ['dionaea' 'p0f' 'suricata']
5.188.11.63 ['dionaea' 'p0f' 'suricata']
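The same per-IP lookup can also be done without a Python loop. The sketch below is one possible way, using groupby with named aggregations on a made-up stand-in DataFrame. The column names hits, distinct_honeypots and honeypots are my own choices.

```python
import pandas as pd

# Made-up stand-in data with the same two columns used above.
df = pd.DataFrame({
    "source_ip": ["192.0.2.1", "192.0.2.1", "192.0.2.1",
                  "192.0.2.2", "192.0.2.2"],
    "honeypot": ["dionaea", "p0f", "suricata", "dionaea", "dionaea"],
})

# Restrict to the ten busiest IPs, as in the loop above.
top_ips = df["source_ip"].value_counts().head(10).index
top = df[df["source_ip"].isin(top_ips)]

# One row per IP: total hits, number of distinct honeypots, and their names.
summary = (
    top.groupby("source_ip")["honeypot"]
    .agg(
        hits="count",
        distinct_honeypots="nunique",
        honeypots=lambda s: ", ".join(sorted(s.unique())),
    )
    .sort_values("hits", ascending=False)
)
print(summary)
```

Sorting or filtering on distinct_honeypots then makes the IPs that touched several honeypots easy to spot.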
Cool, it looks like two IPs were able to hit 3 out of the 4 honeypots. BTW, let's check all of the attacks that took place.
Show All attacks that took place
df['honeypot'].value_counts()
Sample Output
dionaea 21657
suricata 5454
p0f 2403
elastichoney 6
Name: honeypot, dtype: int64
Interesting. There were only 6 elastichoney attacks. It’s something that the most common IPs check would have overlooked. Let’s see which IPs they came from.
Show all IPs for elastichoney attack
df[df['honeypot'] == 'elastichoney']['source_ip'].unique()
Sample Output
array(['125.212.217.215', '221.229.204.122', '216.218.206.68',
'211.23.154.138'], dtype=object)
And what kind of attacks did those IPs perform?
Types of attacks for each elastichoney IP found
for ip in df[df['honeypot'] == 'elastichoney']['source_ip'].unique():
    print(ip)
    print(df[df['source_ip'] == ip]['honeypot'].value_counts())
    print("\n")
Sample Output
125.212.217.215
dionaea 9
p0f 5
elastichoney 3
Name: honeypot, dtype: int64
221.229.204.122
dionaea 32
p0f 1
elastichoney 1
Name: honeypot, dtype: int64
216.218.206.68
dionaea 2
elastichoney 1
Name: honeypot, dtype: int64
211.23.154.138
dionaea 9
p0f 4
elastichoney 1
Name: honeypot, dtype: int64
So it looks like every IP that hit the elastichoney pot also hit other honeypots, but each did so only a few times. My guess is that this was to avoid getting detected by the sort of most common IP check that I ran first.
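The per-IP breakdown for the elastichoney attackers can also be laid out as a single table. Below is a sketch using pd.crosstab on made-up stand-in data. The "total" column is my own addition to make low-volume IPs easy to see.

```python
import pandas as pd

# Made-up stand-in data; 192.0.2.3 never touches elastichoney.
df = pd.DataFrame({
    "source_ip": ["192.0.2.1"] * 3 + ["192.0.2.2"] * 2 + ["192.0.2.3"] * 4,
    "honeypot": ["dionaea", "p0f", "elastichoney",
                 "dionaea", "elastichoney",
                 "dionaea", "dionaea", "dionaea", "p0f"],
})

# IPs that hit elastichoney at least once.
es_ips = df.loc[df["honeypot"] == "elastichoney", "source_ip"].unique()

# All activity from those IPs, counted per (IP, honeypot) pair.
subset = df[df["source_ip"].isin(es_ips)]
table = pd.crosstab(subset["source_ip"], subset["honeypot"])

# Row totals show how quiet each of these IPs was overall.
table["total"] = table.sum(axis=1)
print(table)
```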
Hi, how can I change the date format in the data above?
I need to count attacks per day, so I need the timestamp column changed to YYYY-MM-DD only.
That is for counting attacks per day and for comparing attacks between weeks.
Can you help me?
Hi Alex. I really like this task. If I may ask, does the Modern Honey Network produce results in .csv format?
Hello Alex, this sounds interesting. Out of curiosity, how did you capture the data from honeypots?
Thank you for asking. It was collected using the Modern Honey Network (MHN): https://github.com/threatstream/mhn