File-type Detection Using Naïve Bayes and n-gram Analysis
When hard drives are destroyed, whether deliberately or by accident, reconstructing the lost data can be challenging. Even when file extensions, file tables, and file signatures are lost, the raw data may still be available and can be rebuilt into working files. One of the biggest hurdles in these scenarios is assigning a file type to each block of data, a task known as content-based file type detection. Automating this process would significantly reduce the work needed for data reconstruction.
This paper explores the use of the naïve Bayes classifier combined with n-gram analysis of byte sequences to identify the type of a file. It further examines whether various n-gram levels can increase classification accuracy, and which fragment sizes are needed to reach a given level of accuracy.
The proposed algorithm outperforms the approaches in related work. Most significantly, with our training data, the proposed solution correctly assigns file types based only on file fragments of size 1024, with an accuracy of 98.3%.
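
As a rough illustration of the general idea described above, the following Python sketch trains a multinomial naïve Bayes classifier on overlapping byte n-grams and uses it to label a fragment. It is not the implementation evaluated in this paper: the class name NgramNaiveBayes, the helper byte_ngrams, the default n = 2, and the Laplace smoothing constant alpha are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def byte_ngrams(data: bytes, n: int):
    """Yield every overlapping n-gram (as a bytes object) in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


class NgramNaiveBayes:
    """Multinomial naive Bayes over byte n-grams (hypothetical sketch)."""

    def __init__(self, n: int = 2, alpha: float = 1.0):
        self.n = n                          # n-gram level (assumed default)
        self.alpha = alpha                  # Laplace smoothing (assumption)
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-grams seen
        self.docs = Counter()               # label -> training fragments seen

    def fit(self, fragments, labels):
        for frag, label in zip(fragments, labels):
            grams = Counter(byte_ngrams(frag, self.n))
            self.counts[label].update(grams)
            self.totals[label] += sum(grams.values())
            self.docs[label] += 1
        return self

    def predict(self, fragment: bytes) -> str:
        vocab = 256 ** self.n               # number of possible byte n-grams
        n_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in self.docs:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[label] / n_docs)
            denom = self.totals[label] + self.alpha * vocab
            for gram in byte_ngrams(fragment, self.n):
                score += math.log((self.counts[label][gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy usage: two tiny training fragments with made-up labels.
model = NgramNaiveBayes(n=2).fit(
    [b"hello world, this is plain text", b"\x00\x01\x02\x03\x00\x01\x02\x03"],
    ["txt", "bin"],
)
print(model.predict(b"this is text"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps an n-gram that was never seen during training from forcing a class probability to zero.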