Learn how to store network information in MongoDB using tshark and Python.
Most of you probably have heard of Wireshark, a very popular and capable network protocol analyzer. What you may not know is that there exists a console version of Wireshark called tshark. The two main advantages of tshark are that it can be used in scripts and on a remote computer through an SSH connection. Its main disadvantage is that it does not have a GUI, which can be really handy when you have to search lots of network data.
You can get tshark either from its Web site and compile it yourself or from your Linux distribution as a precompiled package. The second way is quicker and simpler. To install tshark on a Debian 7 system, you just have to run the following command as root:
# apt-get install tshark Reading package lists... Done Building dependency tree Reading state information... Done The following extra packages will be installed: libc-ares2 libcap2-bin libpam-cap libsmi2ldbl libwireshark-data libwireshark2 libwiretap2 libwsutil2 wireshark-common Suggested packages: libcap-dev snmp-mibs-downloader wireshark-doc The following NEW packages will be installed: libc-ares2 libcap2-bin libpam-cap libsmi2ldbl libwireshark-data libwireshark2 libwiretap2 libwsutil2 tshark wireshark-common 0 upgraded, 10 newly installed, 0 to remove and 0 not upgraded. Need to get 15.6 MB of archives. After this operation, 65.7 MB of additional disk space will be used. Do you want to continue [Y/n]? Y ...
To find out whether tshark is installed properly, as well as its version, execute this command:
$ tshark -v TShark 1.8.2 ...
Note: this article assumes that you already are familiar with network data, TCP/IP, packet capturing and maybe Wireshark, and that you want to know more about tshark.
tshark can do anything Wireshark can do, provided that it does not require a GUI. It also can be used as a replacement for tcpdump, which used to be the industry standard for network data capturing. Apart from the capturing part, where both tools are equivalent, tshark is more powerful than tcpdump; therefore, if you want to learn just one tool, tshark should be your choice.
As you can imagine, tshark has many command-line options. Refer to its man page for the full list.
The first command you should run is sudo tshark -D to get a list of the available network interfaces:
$ sudo tshark -D 1. eth0 2. nflog (Linux netfilter log (NFLOG) interface) 3. any (Pseudo-device that captures on all interfaces) 4. lo
If you run tshark as a normal user, you most likely will get the following output, because normal users do not have direct access to network interface devices:
$ tshark -D tshark: There are no interfaces on which a capture can be done
The simplest way of capturing data is by running tshark without any parameters, which will display all data on screen. You can stop data capturing by pressing Ctrl-C.
The output will scroll very fast on a busy network, so it won't be helpful at all. Older computers could not keep up with a busy network, so programs like tshark and tcpdump used to drop network packets. As modern computers are pretty powerful, this is no longer an issue.
The single-most useful command-line parameter is -w, followed by a filename. This parameter allows you to save network data to a file in order to process it later. The following tshark command captures 500 network packets (-c 500) and saves them into a file called LJ.pcap (-w LJ.pcap):
$ tshark -c 500 -w LJ.pcap
The second-most useful parameter is -r. When followed by a valid filename, it allows you to read and process a previously captured file with network data.
Capture filters are filters that are applied during data capturing; therefore, they make tshark discard network traffic that does not match the filter criteria and avoids the creation of huge capture files. This can be done using the -f command-line parameter, followed by a filter in double quotes.
The most important TCP-related Field Names used in capture filters are tcp.port (which is for filtering the source or the destination TCP port), tcp.srcport (which is for checking the TCP source port) and tcp.dstport (which is for checking the destination port).
Generally speaking, applying a filter after data capturing is considered more practical and versatile than filtering during the capture stage, because most of the time, you do not know in advance what you want to inspect. Nevertheless, if you really know what you're doing, using capture filters can save you time and disk space, and that is the main reason for using them.
Remember that filter strings always should be written in lowercase.
Display filters are filters that are applied after packet capturing; therefore, they just “hide” network traffic without deleting it. You always can remove the effects of a display filter and get all your data back.
Display Filters support comparison and logical operators. The http.response.code == 404 && ip.addr == 192.168.10.1 display filter shows the traffic that either comes from the 192.168.10.1 IP address or goes to the 192.168.10.1 IP address that also has the 404 (Not Found) HTTP response code in it. The !bootp && !ip filter excludes BOOTP and IP traffic from the output. The eth.addr == 01:23:45:67:89:ab && tcp.port == 25 filter displays the traffic to or from the network device with the 01:23:45:67:89:ab MAC address that uses TCP port 25 for its incoming or outgoing connections.
When defining rules, remember that the ip.addr != 192.168.1.5 expression does not mean that none of the ip.addr fields can contain the 192.168.1.5 IP address. It means that one of the ip.addr fields should not contain the 192.168.1.5 IP address! Therefore, the other ip.addr field value can be equal to 192.168.1.5! You can think of it as “there exists one ip.addr field that is not 192.168.1.5”. The correct way of expressing it is by typing !(ip.addr == 192.168.1.5). This is a common misconception with display filters.
Also remember that MAC addresses are truly useful when you want to track a given machine on your LAN, because the IP of a machine can change if it uses DHCP, but its MAC address is more difficult to change.
Display filters are extremely useful tools when used correctly, but you still have to interpret the results, find the problem and think about the possible solutions yourself. It is advisable that you visit the display filters reference site for TCP-related traffic at www.wireshark.org/docs/dfref/t/tcp.html. For the list of all the available field names related to UDP traffic, see www.wireshark.org/docs/dfref/u/udp.html.
Imagine you want to extract the frame number, the relative time of the frame, the source IP address, the destination IP address, the protocol of the packet and the length of the network packet from previously captured network traffic. The following tshark command will do the trick for you:
$ tshark -r login.tcpdump -T fields -e frame.number -e ↪frame.time_relative -e ip.src -e ip.dst -e ↪frame.protocols -e frame.len -E header=y -E ↪quote=n -E occurrence=f
The -E header=y option tells tshark first to print a header line. The -E quote=n dictates that tshark not include the data in quotes, and the -E occurrence=f tells tshark to use only the first occurrence for fields that have multiple occurrences.
Having plain text as output means that you easily can process it the UNIX way. The following command shows the ten most popular IPs using input from the ip.src field:
$ tshark -r ~/netData.pcap -T fields -e ip.src | sort ↪| sed '/^\s*$/d' | uniq -c | sort -rn ↪| awk {'print $2 " " $1'} | head
Now, let's look at two Python scripts that read tshark's text output and process it. I can't imagine doing the same thing with a GUI application, such as Wireshark!
Listing 1 shows the full Python code of the first script that checks the validity of an IP address.
The purpose of the checkIP.py Python script is just to find invalid IP addresses, and it implies that the network data already is captured with tshark. You can use it as follows:
$ tshark -r ~/networkData.pcap -T fields -e ip.src ↪| python checkIP.py Total number of IPs checked: 1000 Valid IPs found: 896 Invalid IPs found: 104
Listing 2 shows the full code of the second Python script (storeMongo.py).
The Python script shown in Listing 2 inserts network data into a MongoDB database for further processing and querying. You can use any database you want. The main reason I used MongoDB is because I like the flexibility it offers when storing structured data that may have some irregular records (records with missing fields).
The name of the Python script is storeMongo.py, and it assumes that the network data already is captured using either tshark or tcpdump. The next shell command runs the Python script with input from tshark:
$ tshark -r ~/var/test.pcap -T fields -e frame.number ↪-e ip.src -e ip.dst -e frame.len -e ↪ip.len -E header=n -E quote=n -E occurrence=f ↪| python storeMongo.py Total number of documents stored: 500
The text output of the tshark command is similar to the following:
5 yy.xx.zz.189 yyy.74.xxx.253 66 52 6 197.224.xxx.145 yyy.74.xxx.253 86 72 7 109.xxx.yyy.253 zzz.224.xxx.145 114 100 8 197.xxx.zzz.145 zzz.xxx.xxx.253 86 72 9 109.zzz.193.yyy 197.224.zzz.145 114 100
Currently, all numerical values are stored as strings, but you easily can convert them to numbers if you want. The following command converts all string values from the IPlength column to their respective integer values:
> db.netdata.find({IPlength : {$exists : true}}).forEach( ↪function(obj) { obj.IPlength = new NumberInt( ↪obj.IPlength ); db.netdata.save(obj); } );
Now you can start querying the MongoDB database. The following commands find all “records” (called documents in NoSQL terminology) that contain a given destination IP address:
> use LJ switched to db LJ > db.netdata.find({ "destIP": "192.168.1.12" }) ... >
The next command finds all entries with a frame.len value that is less than 70:
> use LJ switched to db LJ > db.netdata.find({ "framelength": {"$lt" : "70" }}) ... >
The next command finds all entries with an IPlength value greater than 100 and less than 200:
> use LJ switched to db LJ > db.netdata.find({ "IPlength": {"$lt" : "200", "$gt": "100" }}) ... >
What you should remember is not the actual commands but the fact that you can query the database of your choice, using the query language you want and find useful information without the need to re-run tshark and parse the network data again.
After you test your queries, you can run them as cron jobs. La vie est belle!
Next, let's examine the network traffic that is produced by Nmap when it performs a ping scan. The purpose of the ping scan is simply to find out whether an IP address is up. What is important for Nmap in a ping scan is not the actual data of the received packets but, put simply, the actual existence of a reply packet. Nmap ping scans inside a LAN are using the ARP protocol; whereas hosts outside a LAN are scanned using the ICMP protocol. The performed scan pings IP addresses outside the LAN.
The following Nmap command scans 64 IP addresses, from 2.x.yy.1 to 2.x.yy.64:
# nmap -sP 2.x.yy.1-64 Starting Nmap 6.00 ( http://nmap.org ) at 2014-10-29 11:55 EET Nmap scan report for ppp-4.home.SOMEisp.gr (2.x.yy.4) Host is up (0.067s latency). Nmap scan report for ppp-6.home.SOMEisp.gr (2.x.yy.6) Host is up (0.084s latency). ... Nmap scan report for ppp-64.home.SOMEisp.gr (2.x.yy.64) Host is up (0.059s latency). Nmap done: 64 IP addresses (35 hosts up) scanned in 3.10 seconds
The results show that at execution time only 35 hosts were up, or to be 100% precise, only 35 hosts answered the Nmap scan. Nmap also calculates the round-trip time delay (or latency). This gives a pretty accurate estimate of the time needed for the initial packet (sent by Nmap) to go to the target device plus the time that the response packet took to return back to Nmap.
The following tshark command is used for the capturing and is terminated with Ctrl-C:
# tshark -w nmap.pcap Running as user "root" and group "root". This could be dangerous. Capturing on eth0 2587 ^C 18 packets dropped # ls -l nmap.pcap -rw------- 1 root root 349036 Oct 29 11:55 nmap.pcap
Now, let's analyze the generated traffic using tshark. The following command searches for traffic to or from the 2.x.yy.6 IP address:
$ tshark -r nmap.pcap -R "ip.src == 2.x.yy.6 || ip.dst == 2.x.yy.6" 712 3.237125000 109.zz.yyy.253 -> 2.x.yy.6 ↪ICMP 42 Echo (ping) request id=0xa690, seq=0/0, ttl=54 1420 5.239804000 109.zz.yyy.253 -> 2.x.yy.6 ↪ICMP 42 Echo (ping) request id=0x699a, seq=0/0, ttl=49 1432 5.240111000 109.zz.yyy.253 -> 2.x.yy.6 ↪TCP 58 41242 > https [SYN] Seq=0 Win=1024 Len=0 MSS=1460 1441 5.296861000 2.x.yy.6 -> 109.zz.yyy.253 ICMP 60 ↪Timestamp reply id=0x0549, seq=0/0, ttl=57
As you can see, the existence of a response packet (1441) from 2.x.yy.6 is enough for the host to be considered up by Nmap; therefore, no additional tests are tried on this IP.
Now, let's look at the traffic for an IP that is considered down:
$ tshark -r nmap.pcap -R "ip.src == 2.x.yy.2 || ip.dst == 2.x.yy.2" 708 3.236922000 109.zz.yyy.253 -> 2.x.yy.2 ↪ICMP 42 Echo (ping) request id=0xb194, seq=0/0, ttl=59 1407 5.237255000 109.zz.yyy.253 -> 2.x.yy.2 ↪ICMP 42 Echo (ping) request id=0x24ed, seq=0/0, ttl=47 1410 5.237358000 109.zz.yyy.253 -> 2.x.yy.2 ↪TCP 58 41242 > https [SYN] Seq=0 Win=1024 Len=0 MSS=1460 1413 5.237448000 109.zz.yyy.253 -> 2.x.yy.2 ↪TCP 54 41242 > http [ACK] Seq=1 Ack=1 Win=1024 Len=0 1416 5.237533000 109.zz.yyy.253 -> 2.x.yy.2 ↪ICMP 54 Timestamp request id=0xf7af, seq=0/0, ttl=51 1463 5.348871000 109.zz.yyy.253 -> 2.x.yy.2 ↪ICMP 54 Timestamp request id=0x9d7e, seq=0/0, ttl=39 1465 5.349006000 109.zz.yyy.253 -> 2.x.yy.2 ↪TCP 54 41243 > http [ACK] Seq=1 Ack=1 Win=1024 Len=0 1467 5.349106000 109.zz.yyy.253 -> 2.x.yy.2 ↪TCP 58 41243 > https [SYN] Seq=0 Win=1024 Len=0 MSS=1460
As the ICMP packet did not get a response, Nmap makes more tries on the 2.x.yy.2 IP by sending an HTTP and an HTTPS packet, still without any success. This happens because Nmap adds intelligence to the standard ping (ICMP protocol) by trying some common TCP ports in case the ICMP request is blocked for some reason.
The total number of ICMP packets sent can be found with the help of the following command:
$ tshark -r nmap.pcap -R "icmp" | grep "2.x" | wc -l 233
tshark allows you to display useful statistics about a specific protocol. The following command displays statistics about the HTTP protocol using an existing file with network data:
$ tshark -q -r http.pcap -R http -z http,tree ===================================================== HTTP/Packet Counter value rate percent ----------------------------------------------------- Total HTTP Packets 118 0.017749 HTTP Request Packets 66 0.009928 55.93% GET 66 0.009928 100.00% HTTP Response Packets 52 0.007822 44.07% ???: broken 0 0.000000 0.00% 1xx: Informational 0 0.000000 0.00% 2xx: Success 51 0.007671 98.08% 200 OK 51 0.007671 100.00% 3xx: Redirection 0 0.000000 0.00% 4xx: Client Error 1 0.000150 1.92% 404 Not Found 1 0.000150 100.00% 5xx: Server Error 0 0.000000 0.00% Other HTTP Packets 0 0.000000 0.00% =====================================================
All the work is done by the -z option, which is for calculating statistics, and the -q option, which is for disabling the printing of information per individual packet. The -R option discards all packets that do not match the specified filter before doing any other processing.
Here's another useful command that shows protocol hierarchy statistics:
$ tshark -nr ~/var/http.pcap -qz "io,phs"
Try it yourself to see the output!
If you have an in-depth understanding of display filters and a good knowledge of TCP/IP and networks, with the help of tshark or Wireshark, network-related issues will not longer be a problem.
It takes time to master tshark, but I think it will be time well spent.