Source Code Identification

When KeyView auto-detects a file that contains source code, it can attempt to identify the programming language that it is written in.

When you do not enable source code identification, files that contain source code are often identified as ASCII text files, which means the application treats them in the same way as ordinary text. However, in many instances, it can be useful to route these files elsewhere or filter them out. For example, indexing source code into an IDOL index has minimal value and might bloat the engine with terms that are of no use in retrieval.

When source code identification is enabled, KeyView reports a file that contains a recognized programming language as a more specific format, rather than as ASCII text.

To detect the format of a file, KeyView uses the content of the file, rather than the file extension. While some file formats, such as PDF, have distinctive signatures that make them easy to identify, source code files tend to be less distinctive. For example, a fragment of JavaScript might look very similar to a fragment of another language such as Java, C#, C, or C++, or even plain text. KeyView uses an analysis of the text that appears in the file to calculate the likelihood of it being in a particular programming language.

You can set source code identification to different levels.

Option Description
KVSOURCECODE_OFF Do not enable source code identification.
KVSOURCECODE_ENABLED Enable source code identification for the most common source code formats. This option can detect KeyView formats 498-545, which would otherwise be detected as ASCII_Text_Fmt.
KVSOURCECODE_EXTENDED Enable source code identification for all supported source code formats. This option might lead to false positives in some cases (for example, a C++ file might get identified as a rarer format). This option can detect KeyView formats 498-545, and 749-907, which would otherwise be detected as ASCII_Text_Fmt.

For the complete list of source code formats supported for both options, see Supported Formats.
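The format number ranges quoted for the two options can be used to route or filter files after detection. The following is a minimal sketch, not part of the KeyView API: the function names and the "skip-index" and "index" labels are hypothetical, and it assumes that every format number in the quoted ranges is a source code format. The Supported Formats list remains the authority.

```python
from typing import Optional

# Format number ranges quoted in the option descriptions above.
COMMON_RANGE = range(498, 546)    # 498-545 inclusive
EXTENDED_RANGE = range(749, 908)  # 749-907 inclusive


def lowest_level_for(format_id: int) -> Optional[str]:
    """Return the lowest level that can detect format_id, or None.

    Hypothetical helper: it only compares the number against the
    ranges above and does not call KeyView.
    """
    if format_id in COMMON_RANGE:
        return "KVSOURCECODE_ENABLED"
    if format_id in EXTENDED_RANGE:
        return "KVSOURCECODE_EXTENDED"
    return None


def route(format_id: int) -> str:
    """Hypothetical routing rule: keep source code out of the text index."""
    return "skip-index" if lowest_level_for(format_id) else "index"
```

A format number outside both ranges is routed as ordinary content, which matches the behavior you get for such files whether or not source code identification is enabled.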

TIP: Source code identification does not correspond exactly to the adSOURCECODE file class. Most files that are detected by source code identification have this file class, but there are exceptions. Also be aware that some formats in the adSOURCECODE class can be detected without enabling source code identification.

To configure source code identification

  • In the Python API, call the method source_code_identification on your session configuration. For example:

    session.config.source_code_identification(kv.SourceCodeIdentificationLevel.Enabled)
  • In formats.ini, set the following parameter to the appropriate level. This is an alternative approach; you do not need to do this if you have already configured the feature through the API.

    [Options]
    SourceCodeDetection=KVSOURCECODE_ENABLED
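To check which level a formats.ini file sets, you could read the parameter back with a standard INI parser. The following is a sketch that uses only the Python standard library; the function name configured_level is hypothetical, and it assumes that the file can be parsed by configparser, which might not hold for every formats.ini.

```python
import configparser
from typing import Optional

# The three levels listed in the option table.
LEVELS = {"KVSOURCECODE_OFF", "KVSOURCECODE_ENABLED", "KVSOURCECODE_EXTENDED"}


def configured_level(ini_text: str) -> Optional[str]:
    """Return the SourceCodeDetection value from [Options], or None if unset.

    Hypothetical helper; it does not call KeyView.
    """
    parser = configparser.ConfigParser(
        strict=False,         # tolerate repeated sections or keys
        interpolation=None,   # treat '%' characters literally
        allow_no_value=True,  # tolerate lines that have no '=' delimiter
    )
    parser.optionxform = str  # match the parameter name exactly as written
    parser.read_string(ini_text)
    value = parser.get("Options", "SourceCodeDetection", fallback=None)
    if value is None:
        return None
    value = value.strip()
    if value not in LEVELS:
        raise ValueError(f"Unrecognized SourceCodeDetection level: {value!r}")
    return value
```

For example, passing the two-line fragment shown above returns the string KVSOURCECODE_ENABLED, and a file with no SourceCodeDetection parameter returns None.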