document updated on Apr 11, 2025

auto-detecting a file's type

... looking only at files' contents, not at their file extension.

The relevant Wikipedia article is "content sniffing".

tools available

I really like using file(1), with its magic(5) database, to auto-detect a file's type based on its contents. The database covers MANY different file types, and it generally gets things right. However, it IS still a guess, and sometimes guesses are wrong.

Is this file text or binary?

You should be aware that auto-detecting whether a file is text is a guess/heuristic, and different heuristics often disagree with each other about whether a particular file is text.

Some tools that contain text vs binary heuristics:

Perl's -T and -B operators [algorithm description]
- find -type f | perl -nle 'print if -T'
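file(1)/magic(5) can guess a file's MIME type, and files it considers text conveniently get a MIME type of the form 'text/*'.
- find -type f | perl -nle 'print if qx{file -bi $_} =~ m#^text/#'
- note that qx{} hands the command line to the shell, so this breaks on filenames containing spaces or shell metacharacters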
grep has a heuristic that allows it to ignore binary files [algorithm]
- grep -rlI ^
- using long argument names, that's grep --recursive --files-with-matches --binary-files=without-match ^
- "^" matches the start of every line, so it matches any non-empty file regardless of its contents (an empty file has no lines, so it is not listed)
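
To apply grep's heuristic to one file at a time, a minimal sketch along these lines might help. The function name is_text_by_grep is made up for illustration, and the exact rules for what counts as binary depend on the grep implementation and locale (GNU grep, for example, treats files containing NUL bytes as binary).

```shell
#!/bin/sh
# Succeeds (exit 0) if grep does not consider "$1" binary and the file has
# at least one line. is_text_by_grep is a hypothetical helper name.
is_text_by_grep() {
    # -I: treat binary files as non-matching; -q: print nothing.
    # '^' matches the start of any line, so only the binary check decides.
    grep -Iq '^' -- "$1"
}

# Classify every file named on the command line.
for path in "$@"; do
    if is_text_by_grep "$path"; then
        printf 'text: %s\n' "$path"
    else
        printf 'not text (or empty): %s\n' "$path"
    fi
done
```

Because the filename is passed as a quoted argument, names containing spaces are handled safely.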

What character encoding does this text file use?

The answer is always a heuristic guess; ideally, a file's character encoding should be stated explicitly.

Some tools that can do this:

file(1)/magic(5)
- find -type f | perl -nle 'print if qx{file -bk $_} =~ /\bASCII text\b/'
various Perl modules:
- Encode::Guess and encguess — part of Perl core since v5.8.0
- Encode::Detect
- Encode::Detect::Detector
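
For a quick look at what file(1) guesses, its --mime-encoding option prints only the encoding part of the MIME information. The sketch below assumes a file(1) that supports that option; guess_encoding is a hypothetical helper name, and the result is still only a guess.

```shell
#!/bin/sh
# Print file(1)'s guess at the character encoding of "$1",
# for example "us-ascii" or "utf-8". guess_encoding is a made-up name.
guess_encoding() {
    # -b: omit the filename from the output; "--" ends option parsing.
    file -b --mime-encoding -- "$1"
}

# Report the guessed encoding of every file named on the command line.
for path in "$@"; do
    printf '%s: %s\n' "$path" "$(guess_encoding "$path")"
done
```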

Is this file a Perl source-code file?

Some tools that can auto-detect if a file contains Perl source:

based on file extensions and whether a shebang is present:
- Perl::Metrics::Simple->is_perl_file() [algorithm]
  - find -type f | perl -MPerl::Metrics::Simple -nle 'print if Perl::Metrics::Simple->is_perl_file($_)'
- File::Find::Rule::Perl
  - perl -MFile::Find::Rule::Perl -le 'print join "\n", find(perl_file => 1, in => ".")'
  - the above command uses the interface documented at File::Find::Rule::Procedural
based on file(1)/magic(5):
- find -type f | perl -nle 'print if qx{file -b $_} =~ /^perl(?! Storable)/i'
- or, to avoid having filename characters interpreted by the shell: find -type f | perl -nle 'open(my $f, "-|", "file", "-b", $_) or die $!; print if <$f> =~ /^perl(?! Storable)/i'
- regarding the above commands, the (?! Storable) lookahead is there because, when I'm looking for Perl source files, I usually don't want binary Storable files in the list
- Note that file(1) often misclassifies extremely short Perl files as plain ASCII text.
based on PPI, the parser that makes static analysis of Perl possible without running the code:
- Syntax::Check
  - (This actually malfunctions for me: it reports false positives and throws errors in places where perl -c does not, across a WIDE variety of files)
  - find -type f | xargs -n 1 syncheck
  - (the above errors out A LOT, so instead try...) find -type f | perl -nle 'system "syncheck", $_'
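
As a rough illustration of the extension-and-shebang approach listed first in this section, here is a minimal sketch in plain shell. It is not the algorithm used by Perl::Metrics::Simple or File::Find::Rule::Perl; the helper name looks_like_perl, the extension list (.pl, .pm, .t) and the shebang pattern are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Succeeds if "$1" looks like Perl source, judged by its name or by its
# first line. looks_like_perl is a hypothetical helper name.
looks_like_perl() {
    # Assumed list of Perl-ish extensions; adjust to taste.
    case "$1" in
        *.pl|*.pm|*.t) return 0 ;;
    esac
    # Otherwise read only the first line and look for "perl" in a "#!" line.
    head -n 1 -- "$1" 2>/dev/null | grep -q '^#!.*perl'
}

# List the command-line arguments that look like Perl source.
for path in "$@"; do
    if looks_like_perl "$path"; then
        printf '%s\n' "$path"
    fi
done
```

Like the tools above, this only guesses: a Perl file with no extension and no shebang is missed, and any "#!" line that merely mentions perl is accepted.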