AboutCode Documentation¶
Aboutcode Projects¶
ScanCode-Toolkit Documentation¶
Getting Started¶
Home¶
ScanCode is a tool to scan code and detect licenses, copyrights and more.
Why ScanCode?¶
Discovering the origin and license for a software component is important, but it is often much harder to accomplish than it should be because:
- A typical software project may reuse tens or hundreds of third-party software components
- Software authors do not always provide copyright and license information
- Copyright and license information that is provided may be hard to find and interpret
ScanCode tries to address this issue by offering:
- A comprehensive code scanner that can detect origin or license information inside codebase files
- A simple command line approach that runs on Windows, Linux, and Mac
- Your choice of JSON or other output formats (SPDX, HTML, CSV) for integration with other tools
- ScanCode workbench for Visualization
- Well-tested, easy to hack, and well-documented code
- Release of the code and reference data under attribution licenses (Apache 2.0 and CC-BY-1.0)
- Plugin System for easily adding new Functionality to Scans.
- Python 3 Unicode Capabilities for better supporting users from 100+ languages.
- Extensive Documentation Support.
What does ScanCode Toolkit do?¶
ScanCode finds the origin history information that is in your codebase with a focus on:
- Copyright and other origin clues (emails, urls, authors etc)
- License notices and license text with reference information about detected licenses.
Using this data you can:
- Discover the origin and license of the open source and third-party software components that you use,
- Create a software component Inventory for your codebase, and
- Use this data to comply with open source license obligations such as attribution and redistribution.
How does it work?¶
Given a codebase in a directory, ScanCode will:
- Collect an inventory of the code files and classify the code using file types
- Extract files from any archive using a general purpose extractor
- Extract texts from binary files if needed
- Use an extensible rules engine to detect open source license text and notices
- Use a specialized parser to capture copyright statements
- Identify packaged code and collect metadata from packages
- Report the results in the formats of your choice (JSON, SPDX, etc.) for integration with other tools
- Browse the results using the ScanCode Workbench companion app to assist your analysis.
ScanCode should enable you to identify the “easy” cases on your own, but a software development team will probably need to build internal expertise or use outside experts (like nexB) in many cases.
ScanCode is written in Python and also uses other open source packages.
Alternatives?¶
There are several utilities that do some of what ScanCode does. For example, you can grep files for copyright and license text. This may work well for simple cases, such as at the single-file level, but we created ScanCode for ourselves because this approach does not help you to see the recurring patterns of licenses and other origin history clues.
Or you can consider other tools such as:
- FOSSology (open source, written in C, Linux only, GPL-licensed)
- Ninka (open source, written in Perl, GPL-licensed)
- Commercially-licensed tools, most of them written in Java
History¶
ScanCode was originally created by nexB to support our software audit consulting services. We have used and continuously enhanced the underlying toolkit for six years. We decided to release ScanCode as open source software to give software development teams the opportunity to perform as much of the software audit function as they like on their own.
If you have questions or are interested in nexB-provided training or support for ScanCode, please send us a note at info@scancode.io or visit http://www.nexb.com/.
We are part of nexB Inc. and most of us are located in the San Francisco Bay Area. Our mission is to provide the tools and services that enable and accelerate component-based software development. Reusing software components is essential for the efficient delivery of software products and systems in every industry.
Thank you for giving ScanCode a try!
Comprehensive Installation¶
ScanCode requires Python 2.7.x and is tested on Linux, Mac, and Windows. Make sure Python 2.7 is installed first.
System Requirements¶
Hardware : ScanCode will run best with a modern X86 processor and at least 2GB of RAM and 250MB of disk.
Supported operating systems : ScanCode should run on these OSes:
- Linux: on most recent 64-bit Linux distributions (32-bit distros are only partially supported),
- Mac: on recent Mac OSX (10.6.8 and up),
- Windows: on Windows 7 and up (32- or 64-bit) using a 32-bit Python.
Prerequisites¶
ScanCode needs a Python 2.7 interpreter.
On Linux: Use your package manager to install python2.7. If Python 2.7 is not available from your package manager, you must compile it from sources. For instance, visit https://github.com/dejacode/about-code-tool/wiki/BuildingPython27OnCentos6 for instructions to compile Python from sources on Centos.
On Ubuntu 12.04, 14.04 and 16.04, you will need to install these packages first:
python-dev bzip2 xz-utils zlib1g libxml2-dev libxslt1-dev
On Debian and Debian-based distros you will need to install these packages first:
python-dev libbz2-1.0 xz-utils zlib1g libxml2-dev libxslt1-dev
On RPM-based distros, you will need to install these packages first:
python-devel zlib bzip2-libs xz-libs libxml2-devel libxslt-devel
On Windows:
Use the Python 2.7 32-bit (e.g. The Windows x86 MSI installer) for X86 regardless of whether you run Windows on 32-bit or 64-bit. DO NOT USE Python X86_64 installer even if you run 64 bit Windows. Download Python from this url: https://www.python.org/ftp/python/2.7.13/python-2.7.13.msi
Install Python on the c: drive and use all default installer options (scancode will try to find python just in c:\python27\python.exe). See the Windows installation section for more installation details.
On Mac: Download and install Python from this url: https://www.python.org/ftp/python/2.7.13/python-2.7.13-macosx10.6.pkg
Do not use Unicode, non-ASCII in your installation Path¶
There is a bug in the underlying libraries that prevents this.
Installation on Linux and Mac¶
Download and extract the latest ScanCode release from: https://github.com/nexB/scancode-toolkit/releases/
Open a terminal in the extracted directory and run:
./scancode --help
This will configure ScanCode and display the command line help.
Installation on Windows¶
- Download the latest ScanCode release zip file from https://github.com/nexB/scancode-toolkit/releases/
- In Windows Explorer (called File Explorer on Windows 10), select the downloaded ScanCode zip and right-click.
- In the pop-up menu select ‘Extract All…’
- In the pop-up window ‘Extract zip folders’ (‘Extract Compressed (Zipped) Folders’ on Windows 10) use the default options to extract.
- Once the extraction is complete, a new Windows Explorer/File Explorer window will pop up.
- In this Explorer window, select the new folder that was created and right-click.
Note
On Windows 10, double-click the new folder, select one of the files inside the folder (e.g., ‘setup.py’), and right-click.
In the pop-up menu select ‘Properties’.
In the pop-up window ‘Properties’, select the Location value. Copy this to the clipboard and close the ‘Properties’ window.
Press the start menu button (On Windows 10, click the search box or search icon in the taskbar.)
In the search box type:
cmd
Select ‘cmd.exe’ listed in the search results. (On Windows 10, you may see ‘Command Prompt’ instead – select that.)
A new ‘cmd.exe’ window (‘Command Prompt’ on Windows 10) pops up.
In this window (aka a ‘command prompt’), type the following (i.e., ‘cd’ followed by a space):
cd
Right-click in this window and select Paste. This will paste the path where you extracted ScanCode.
Press Enter.
This will change the current location of your command prompt to the root directory where ScanCode is installed.
Then type:
scancode -h
Press enter. This will configure your ScanCode installation.
Several messages are displayed followed by the scancode command help.
The installation is complete.
Un-installation¶
- Delete the directory in which you extracted ScanCode.
- Delete any temporary files created in your system temp directory under a ScanCode directory.
IDE Configuration¶
The instructions below assume that you followed Contributing to Code Development, including setting up a Python virtualenv.
PyCharm¶
Open the settings dialog and navigate to “Project Interpreter”. Click on the gear button in the
upper left corner and select “Add Local”. Find the python binary in the virtualenv (bin/python
in the repository root) and confirm. Open a file that contains tests and set a
breakpoint. Right click in the test and select “Debug <name of test>”. Afterwards you can re-run
the same test in the debugger using the appropriate keyboard shortcut (e.g. Shift-F9, depending
on platform and configured layout).
Visual Studio Code¶
Install the Python extension from Microsoft.
The configure script should have created a VSCode workspace directory with a basic settings.json. To do this manually, add to or create the workspace settings file .vscode/settings.json:
"python.pythonPath": "${workspaceRoot}/bin/python",
"python.unitTest.pyTestEnabled": true
If you created the file, also add { and } on the first and last line respectively.
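The manual step above can also be scripted. The following is a minimal sketch, not part of ScanCode: the helper name write_workspace_settings is hypothetical, and it assumes that any existing settings.json is plain JSON without comments. It adds the two settings named above and leaves other keys untouched.

```python
import json
import tempfile
from pathlib import Path

# The two settings named in the text above.
SCANCODE_SETTINGS = {
    "python.pythonPath": "${workspaceRoot}/bin/python",
    "python.unitTest.pyTestEnabled": True,
}


def write_workspace_settings(repo_root):
    """Create or update .vscode/settings.json under repo_root (hypothetical helper)."""
    settings_file = Path(repo_root) / ".vscode" / "settings.json"
    settings_file.parent.mkdir(parents=True, exist_ok=True)
    # Keep any settings that are already present and only add or update ours.
    existing = {}
    if settings_file.exists():
        existing = json.loads(settings_file.read_text(encoding="utf-8"))
    existing.update(SCANCODE_SETTINGS)
    settings_file.write_text(json.dumps(existing, indent=4) + "\n", encoding="utf-8")
    return settings_file


if __name__ == "__main__":
    # Demonstrate against a throwaway directory rather than a real checkout.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_workspace_settings(tmp).read_text(encoding="utf-8"))
```

Because json.dumps always writes a complete object, the surrounding { and } are produced automatically.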
When you open the project root folder in VSCode, the status bar should show the correct python interpreter and, after a while, a “Run Tests” button. If not, try restarting VSCode.
Open a file that contains tests (e.g. tests/cluecode/test_copyrights.py). Above the test
functions you should now see “Run Test” and “Debug Test”. Set a breakpoint in a test function
and click on “Debug Test” above it. The debugger panel should show up on the left and show the
program state at the breakpoint. Stepping over and into code seems not to work. Clicking one of
those buttons just runs the test to completion. As a workaround, navigate to the function you want
to step into, set another breakpoint and click on “continue” instead.
Documentation¶
This page provides an index of current ScanCode user documentation.
Documentation¶
The ScanCode toolkit documentation lives at aboutcode.readthedocs.io/en/latest/scancode-toolkit/.
Contribute to Docs¶
See Contributing to the Documentation for more details.
Google Summer of Docs¶
See GSoD2019 for more details.
What’s New in This Release?¶
A new release of Scancode-Toolkit is here!
Quick Summary¶
- Version - 3.1.1
- Tag - “v.3.1.1”
- Date - 5th September 2019
- Type - Pre-Release
- Comments - Release v3.1.1, which is release candidate 2 of 3.1.x
Main New Features¶
This is the first 3.1 release with the best, fastest and most efficient ScanCode ever released.
This release contains many improvements, fixes and new features including breaking API changes (when compared to 2.2.x). See the CHANGELOG for details.
This release also comes with a Full Documentation hosted at aboutcode.readthedocs.io/en/latest/scancode-toolkit/.
To install, download scancode-toolkit-3.1.1.zip or scancode-toolkit-3.1.1.tar.bz2 from the Downloads section below and follow installation instructions in the README.
This is also available as a Python library from PyPI with pip install scancode-toolkit.
You can also download the corresponding source code for bundled pre-built third-party binaries from these locations:
Explanations¶
Command Line Interface Reference¶
Synopsis¶
ScanCode detects licenses, copyrights, package manifests and direct dependencies and more, both in source code and binary files, by scanning the files. This page introduces you to the ScanCode Toolkit Command Line Interface in the following sections:
- Quickstart
- Type of Options
- Output Formats
- Other Important Documentation
Quickstart¶
The basic usage is:
path/to/scancode [OPTIONS] <OUTPUT FORMAT OPTION(s)> <SCAN INPUT>
To scan the samples directory, the command will be:
path/to/scancode -clpieu --json-pp path/to/output.json path/to/samples
Note
The <OUTPUT FORMAT OPTION(s)> includes both the output option and output file name.
For example, in ./scancode -clpieu --json-pp output.json samples, the part --json-pp output.json is the <OUTPUT FORMAT OPTION(s)>.
Tip
On Windows use scancode instead of path/to/scancode.
Warning
There is no “Default” output option from version 3.x onwards; you have to specify the <OUTPUT FORMAT OPTION(s)> explicitly.
Alternatively, instead of using path/to/scancode (the path from root of file system) we can
go into the scancode directory (like scancode-toolkit-3.1.1) and then use ./scancode.
The same applies for input and output options. To scan a folder samples inside the ScanCode
directory, and output to a file output.json in the same directory, the command will be:
./scancode -clpieu --json-pp output.json samples
While a scan using absolute paths from the file system root will look like:
/home/ayansm/software/scancode-toolkit-3.1.1/scancode -clpieu --json-pp /home/ayansm/scan_scan_results/output.json /home/ayansm/codebases/samples/
Throughout the documentation ./scancode -clpieu --json-pp output.json samples will be used
as an example, where the terminal is at scancode-toolkit-3.1.1 and we are scanning the
default samples folder distributed with Scancode-Toolkit.
This scans the <SCAN INPUT> file or directory for license, origin and packages and saves results to FILE(s) using one or more output format options. Errors and progress are printed to stderr.
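When a scan is launched from a script, the usage line above maps naturally onto an argument list. The sketch below is only an illustration: build_scan_command and run_scan are hypothetical helper names, and the script assumes it is started from inside an extracted ScanCode release so that ./scancode exists.

```python
import subprocess


def build_scan_command(scancode, output_file, scan_input, options="-clpieu"):
    """Return the argument list for a pretty-printed JSON scan (hypothetical helper)."""
    # Order follows the usage line: [OPTIONS] <OUTPUT FORMAT OPTION(s)> <SCAN INPUT>
    return [scancode, options, "--json-pp", output_file, scan_input]


def run_scan(scancode, output_file, scan_input):
    """Run the scan; raises CalledProcessError if scancode exits with an error."""
    command = build_scan_command(scancode, output_file, scan_input)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_scan() from inside an extracted release.
    print(" ".join(build_scan_command("./scancode", "output.json", "samples")))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.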
Type of Options¶
ScanCode Toolkit Command Line options can be divided into these major sections:
Output Formats¶
The output file format is set by using the various output options. There is no default format, so an output option is always required. With --json the entire file is written on one line, without whitespace characters; --json-pp writes the same data pretty-printed.
The following example scans will show you how to run a scan with each of the result formats. For
the scans, we will use the samples directory provided with the ScanCode Toolkit.
Tip
You can also output to stdout instead of a file. For more information refer to
Print to stdout (Terminal).
Scan the samples directory and save the scan to a JSON file:
./scancode -clpieu --json-pp output.json samples
A sample JSON output file structure will look like:
{
"headers": [
{
"tool_name": "scancode-toolkit",
"tool_version": "3.1.1",
"options": {
"input": [
"samples/"
],
"--copyright": true,
"--email": true,
"--info": true,
"--json-pp": "output.json",
"--license": true,
"--package": true,
"--url": true
},
"notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
"start_timestamp": "2019-10-19T191117.292858",
"end_timestamp": "2019-10-19T191219.743133",
"message": null,
"errors": [],
"extra_data": {
"files_count": 36
}
}
],
"files": [
{
"path": "samples",
"type": "directory",
...
...
...
"scan_errors": []
},
{
"path": "samples/README",
"type": "file",
"name": "README",
"base_name": "README",
"extension": "",
"size": 236,
"date": "2019-02-12",
"sha1": "2e07e32c52d607204fad196052d70e3d18fb8636",
"md5": "effc6856ef85a9250fb1a470792b3f38",
"mime_type": "text/plain",
"file_type": "ASCII text",
"programming_language": null,
"is_binary": false,
"is_text": true,
"is_archive": false,
"is_media": false,
"is_source": false,
"is_script": false,
"licenses": [],
"license_expressions": [],
"copyrights": [],
"holders": [],
"authors": [],
"packages": [],
"emails": [],
"urls": [],
"files_count": 0,
"dirs_count": 0,
"size_count": 0,
"scan_errors": []
},
...
...
...
{
"path": "samples/zlib/iostream2/zstream_test.cpp",
"type": "file",
"name": "zstream_test.cpp",
"base_name": "zstream_test",
"extension": ".cpp",
"size": 711,
"date": "2019-02-12",
...
...
...
"scan_errors": []
}
]
}
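To show how this structure can be consumed, here is a minimal sketch that loads a scan and summarizes it. It relies only on the keys visible in the sample above (headers, tool_version, files, type, path, scan_errors). The function name summarize_scan is hypothetical and the embedded data is a trimmed copy of the sample, not real scan output.

```python
import json
from collections import Counter

# A trimmed copy of the structure shown above; real scans carry many more fields.
SAMPLE_SCAN = """
{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "3.1.1",
      "errors": [],
      "extra_data": {"files_count": 36}
    }
  ],
  "files": [
    {"path": "samples", "type": "directory", "scan_errors": []},
    {"path": "samples/README", "type": "file", "scan_errors": []},
    {"path": "samples/zlib/iostream2/zstream_test.cpp", "type": "file", "scan_errors": []}
  ]
}
"""


def summarize_scan(scan):
    """Return tool version, per-type resource counts and paths with scan errors."""
    header = scan["headers"][0]
    type_counts = Counter(resource["type"] for resource in scan["files"])
    failed = [r["path"] for r in scan["files"] if r.get("scan_errors")]
    return {
        "tool_version": header["tool_version"],
        "type_counts": dict(type_counts),
        "paths_with_errors": failed,
    }


if __name__ == "__main__":
    print(summarize_scan(json.loads(SAMPLE_SCAN)))
```

For a real scan, replace json.loads(SAMPLE_SCAN) with json.load() on the opened output.json file.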
A sample JSON output for an individual file will look like:
{
"path": "samples/zlib/iostream2/zstream.h",
"type": "file",
"name": "zstream.h",
"base_name": "zstream",
"extension": ".h",
"size": 9283,
"date": "2019-02-12",
"sha1": "fca4540d490fff36bb90fd801cf9cd8fc695bb17",
"md5": "a980b61c1e8be68d5cdb1236ba6b43e7",
"mime_type": "text/x-c++",
"file_type": "C++ source, ASCII text",
"programming_language": "C++",
"is_binary": false,
"is_text": true,
"is_archive": false,
"is_media": false,
"is_source": true,
"is_script": false,
"licenses": [
{
"key": "mit-old-style",
"score": 100.0,
"name": "MIT Old Style",
"short_name": "MIT Old Style",
"category": "Permissive",
"is_exception": false,
"owner": "MIT",
"homepage_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
"text_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
"reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit-old-style",
"spdx_license_key": null,
"spdx_url": "",
"start_line": 9,
"end_line": 15,
"matched_rule": {
"identifier": "mit-old-style_cmr-no_1.RULE",
"license_expression": "mit-old-style",
"licenses": [
"mit-old-style"
],
"is_license_text": true,
"is_license_notice": false,
"is_license_reference": false,
"is_license_tag": false,
"matcher": "2-aho",
"rule_length": 71,
"matched_length": 71,
"match_coverage": 100.0,
"rule_relevance": 100
}
}
],
"license_expressions": [
"mit-old-style"
],
"copyrights": [
{
"value": "Copyright (c) 1997 Christian Michelsen Research AS Advanced Computing",
"start_line": 3,
"end_line": 5
}
],
"holders": [
{
"value": "Christian Michelsen Research AS Advanced Computing",
"start_line": 3,
"end_line": 5
}
],
"authors": [],
"packages": [],
"emails": [],
"urls": [
{
"url": "http://www.cmr.no/",
"start_line": 7,
"end_line": 7
}
],
"files_count": 0,
"dirs_count": 0,
"size_count": 0,
"scan_errors": []
},
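A per-file entry like the one above can be reduced to the findings that usually matter most. The following sketch uses only keys that appear in the sample (licenses, key, score, start_line, end_line, holders, value). The helper names license_findings and holder_names and the min_score parameter are hypothetical, and the embedded entry is trimmed from the sample.

```python
import json

# Trimmed from the zstream.h entry above; only the fields used below are kept.
FILE_ENTRY = json.loads("""
{
  "path": "samples/zlib/iostream2/zstream.h",
  "licenses": [
    {"key": "mit-old-style", "score": 100.0, "start_line": 9, "end_line": 15}
  ],
  "holders": [
    {"value": "Christian Michelsen Research AS Advanced Computing",
     "start_line": 3, "end_line": 5}
  ]
}
""")


def license_findings(entry, min_score=0.0):
    """Return (key, start_line, end_line) for matches scoring at least min_score."""
    return [
        (match["key"], match["start_line"], match["end_line"])
        for match in entry.get("licenses", [])
        if match["score"] >= min_score
    ]


def holder_names(entry):
    """Return the copyright holder strings reported for one file entry."""
    return [holder["value"] for holder in entry.get("holders", [])]


if __name__ == "__main__":
    print(FILE_ENTRY["path"], license_findings(FILE_ENTRY, min_score=90))
    print(holder_names(FILE_ENTRY))
```

Filtering on score after the fact is independent of the scanner; it simply post-processes whatever matches the scan reported.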
Scan the samples directory for licenses and copyrights and save the scan results to an HTML
file. When the scan is done, open output.html in your web browser.
./scancode -clpieu --html output.html samples


Getting Help from the Command Line¶
The ScanCode-Toolkit Command Line Interface can help you search for specific options or use cases
from the command line itself. The two options for this are --help and --examples, and they are
very helpful if you need a quick glance at the options or use cases, or when you
can’t access the more elaborate online documentation.
All Documentation/Help Options¶
-h, --help | Show the Help text and exit. |
--examples | Show the Command Examples Text and exit. |
--about | Show information about ScanCode and licensing and exit. |
--version | Show the version and exit. |
--list-packages | Show the list of supported package types and exit. |
--plugins | Show the list of available ScanCode plugins and exit. |
--print-options | Show the list of selected options and exit. |
Help text¶
The Scancode-Toolkit Command Line Interface has a Help option displaying all the options. It also
displays basic usage, and some simple examples. The command line option for this is --help.
Tip
You can also use the shorter -h option, which does the same.
For Linux based systems the full command is:
$ ./scancode --help
And for Windows, it will be like:
$ scancode --help
Note
Make sure you are in the Scancode Root Directory before carrying out this command. After
extracting the .zip or .tar.bz2 file, the folder for Scancode-Toolkit version 3.1.1
will be named like “scancode-toolkit-3.1.1”.
The following help text is displayed. This is the help text for Scancode version 3.1.1:
Usage: scancode [OPTIONS] <OUTPUT FORMAT OPTION(s)> <input>...
scan the <input> file or directory for license, origin and packages
and save results to FILE(s) using one or more output format option.
Error and progress are printed to stderr.
Options:
primary scans:
-l, --license Scan <input> for licenses.
-p, --package Scan <input> for package manifests and packages.
-c, --copyright Scan <input> for copyrights.
other scans:
-i, --info Scan <input> for file information (size, checksums, etc).
--generated Classify automatically generated code files with a flag.
-e, --email Scan <input> for emails.
-u, --url Scan <input> for urls.
scan options:
--license-score INTEGER Do not return license matches with a
score lower than this score. A number
between 0 and 100. [default: 0]
--license-text Include the detected licenses matched
text.
--license-text-diagnostics In the matched license text, include
diagnostic highlights surrounding with
square brackets [] words that are not
matched.
--license-url-template TEXT Set the template URL used for the license
reference URLs. Curly braces ({}) are
replaced by the license key. [default: h
ttps://enterprise.dejacode.com/urn/urn:dj
e:license:{}]
--max-email INT Report only up to INT emails found in a
file. Use 0 for no limit. [default: 50]
--max-url INT Report only up to INT urls found in a
file. Use 0 for no limit. [default: 50]
output formats:
--json FILE Write scan output as compact JSON to FILE.
--json-pp FILE Write scan output as pretty-printed JSON to
FILE.
--json-lines FILE Write scan output as JSON Lines to FILE.
--csv FILE Write scan output as CSV to FILE.
--html FILE Write scan output as HTML to FILE.
--custom-output FILE Write scan output to FILE formatted with the
custom Jinja template file.
--custom-template FILE Use this Jinja template FILE as a custom
template.
--spdx-rdf FILE Write scan output as SPDX RDF to FILE.
--spdx-tv FILE Write scan output as SPDX Tag/Value to FILE.
--html-app FILE (DEPRECATED: use the ScanCode Workbench app
instead ) Write scan output as a mini HTML
application to FILE.
output filters:
--ignore-author <pattern> Ignore a file (and all its findings)
if an author contains a match to the
<pattern> regular expression. Note
that this will ignore a file even if
it has other findings such as a
license or errors.
--ignore-copyright-holder <pattern>
Ignore a file (and all its findings)
if a copyright holder contains a match
to the <pattern> regular expression.
Note that this will ignore a file even
if it has other scanned data such as a
license or errors.
--only-findings Only return files or directories with
findings for the requested scans.
Files and directories without findings
are omitted (file information is not
treated as findings).
output control:
--full-root Report full, absolute paths.
--strip-root Strip the root directory segment of all paths. The
default is to always include the last directory segment
of the scanned path such that all paths have a common
root directory.
pre-scan:
--ignore <pattern> Ignore files matching <pattern>.
--include <pattern> Include files matching <pattern>.
--classify Classify files with flags telling if the
file is a legal, or readme or test file,
etc.
--facet <facet>=<pattern> Add the <facet> to files with a path
matching <pattern>.
post-scan:
--consolidate Group resources by Packages or license and
copyright holder and return those groupings
as a list of consolidated packages and a list
of consolidated components. This requires the
scan to have/be run with the copyright,
license, and package options active
--filter-clues Filter redundant duplicated clues already
contained in detected license and copyright
texts and notices.
--is-license-text Set the "is_license_text" flag to true for
files that contain mostly license texts and
notices (e.g over 90% of the content).
[EXPERIMENTAL]
--license-clarity-score Compute a summary license clarity score at
the codebase level.
--license-policy FILE Load a License Policy file and apply it to
the scan at the Resource level.
--mark-source Set the "is_source" to true for directories
that contain over 90% of source files as
children and descendants. Count the number of
source files in a directory as a new
source_file_counts attribute
--summary Summarize license, copyright and other scans
at the codebase level.
--summary-by-facet Summarize license, copyright and other scans
and group the results by facet.
--summary-key-files Summarize license, copyright and other scans
for key, top-level files. Key files are top-
level codebase files such as COPYING, README
and package manifests as reported by the
--classify option "is_legal", "is_readme",
"is_manifest" and "is_top_level" flags.
--summary-with-details Summarize license, copyright and other scans
at the codebase level, keeping intermediate
details at the file and directory level.
core:
--timeout <secs> Stop an unfinished file scan after a timeout
in seconds. [default: 120 seconds]
-n, --processes INT Set the number of parallel processes to use.
Disable parallel processing if 0. Also
disable threading if -1. [default: 1]
--quiet Do not print summary or progress.
--verbose Print progress as file-by-file path instead
of a progress bar. Print verbose scan
counters.
--from-json Load codebase from an existing JSON scan
--max-in-memory INTEGER Maximum number of files and directories scan
details kept in memory during a scan.
Additional files and directories scan details
above this number are cached on-disk rather
than in memory. Use 0 to use unlimited memory
and disable on-disk caching. Use -1 to use
only on-disk caching. [default: 10000]
miscellaneous:
--reindex-licenses Check the license index cache and reindex if
needed and exit.
documentation:
-h, --help Show this message and exit.
--about Show information about ScanCode and licensing and
exit.
--version Show the version and exit.
--examples Show command examples and exit.
--list-packages Show the list of supported package types and exit.
--plugins Show the list of available ScanCode plugins and exit.
--print-options Show the list of selected options and exit.
Examples (use --examples for more):
Scan the 'samples' directory for licenses and copyrights.
Save scan results to the 'scancode_result.json' JSON file:
scancode --license --copyright --json-pp scancode_result.json
samples
Scan the 'samples' directory for licenses and package manifests. Print scan
results on screen as pretty-formatted JSON (using the special '-' FILE to print
to on screen/to stdout):
scancode --json-pp - --license --package samples
Note: when you run scancode, a progress bar is displayed with a
counter of the number of files processed. Use --verbose to display
file-by-file progress.
Command Examples Text¶
The Scancode-Toolkit Command Line Interface has an --examples option which displays some basic
examples (more than the basic synopsis in --help). These examples include the following aspects
of code scanning:
- Scanning Single File/Directory
- Output Scan results to stdout (as JSON) or HTML/JSON file
- Scanning for only Copyrights/Licenses
- Ignoring Files
- Using GLOB Patterns to Scan Multiple Files
- Using Verbose Mode
The command line option for displaying these basic examples is --examples.
For Linux based systems the full command is:
$ ./scancode --examples
And for Windows, it will be like:
$ scancode --examples
The following text is displayed. These are the examples for Scancode version 3.1.1:
Scancode command lines examples:
(Note for Windows: use '\' back slash instead of '/' forward slash for paths.)
Scan a single file for copyrights. Print scan results to stdout as pretty JSON:
scancode --copyright samples/zlib/zlib.h --json-pp -
Scan a single file for licenses, print verbose progress to stderr as each
file is scanned. Save scan to a JSON file:
scancode --license --verbose samples/zlib/zlib.h --json licenses.json
Scan a directory explicitly for licenses and copyrights. Redirect JSON scan
results to a file:
scancode --license --copyright samples/zlib/ --json - > scan.json
Scan a directory while ignoring a single file. Scan for license, copyright and
package manifests. Use four parallel processes.
Print scan results to stdout as pretty formatted JSON.
scancode -lc --package --ignore README --processes 4 --json-pp - samples/
Scan a directory while ignoring all files with .txt extension.
Print scan results to stdout as pretty formatted JSON.
It is recommended to use quotes around glob patterns to prevent pattern
expansion by the shell:
scancode --json-pp - --ignore "*.txt" samples/
Special characters supported in GLOB pattern:
- * matches everything
- ? matches any single character
- [seq] matches any character in seq
- [!seq] matches any character not in seq
For a literal match, wrap the meta-characters in brackets.
For example, '[?]' matches the character '?'.
For details on GLOB patterns see https://en.wikipedia.org/wiki/Glob_(programming).
Note: Glob patterns cannot be applied to path as strings.
For example, this will not ignore "samples/JGroups/licenses".
scancode --json - --ignore "samples*licenses" samples/
Scan a directory while ignoring multiple files (or glob patterns).
Print the scan results to stdout as JSON:
scancode --json - --ignore README --ignore "*.txt" samples/
Scan a directory for licenses and copyrights. Save scan results to an
HTML file:
scancode --license --copyright --html scancode_result.html samples/zlib
To extract archives, see the 'extractcode' command instead.
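The special characters listed in the examples text above can be tried out without running a scan. The sketch below uses Python's standard fnmatch module as a stand-in. This is an assumption for illustration only: it is not confirmed here that ScanCode's --ignore matching behaves identically, especially for patterns that span path separators, so the cases are limited to plain file names.

```python
from fnmatch import fnmatchcase

# (pattern, file name, expected result) for each special character listed above.
CASES = [
    ("*.txt", "notes.txt", True),   # * matches everything
    ("*.txt", "README", False),
    ("?.c", "a.c", True),           # ? matches any single character
    ("?.c", "ab.c", False),
    ("[abc].h", "b.h", True),       # [seq] matches any character in seq
    ("[!abc].h", "b.h", False),     # [!seq] matches any character not in seq
    ("[!abc].h", "z.h", True),
    ("what[?]", "what?", True),     # [?] matches a literal question mark
    ("what[?]", "whatx", False),
]


def check_cases(cases=CASES):
    """Return True if fnmatchcase agrees with every expected result."""
    return all(
        fnmatchcase(name, pattern) == expected
        for pattern, name, expected in cases
    )


if __name__ == "__main__":
    for pattern, name, expected in CASES:
        print(f"{pattern!r} vs {name!r}: {fnmatchcase(name, pattern)}")
```

fnmatchcase is used instead of fnmatch so that results do not depend on the case-folding rules of the operating system.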
Plugins Help Text¶
The command line option for displaying all the plugins is --plugins.
For Linux based systems the full command is:
$ ./scancode --plugins
And for Windows, it will be like:
$ scancode --plugins
Note
Plugins that are shown by using --plugins include the following:
- Post-Scan Plugins
- Pre-Scan Plugins
- Output Options
- Output Control
- Basic Scan Options
The following text is displayed. These are the available plugins for Scancode version 3.1.1:
--------------------------------------------
Plugin: scancode_output:csv class: formattedcode.output_csv:CsvOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: csv: --csv
help: Write scan output as CSV to FILE.
doc: None
--------------------------------------------
Plugin: scancode_output:html class: formattedcode.output_html:HtmlOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: html: --html
help: Write scan output as HTML to FILE.
doc: None
--------------------------------------------
Plugin: scancode_output:html-app class: formattedcode.output_html:HtmlAppOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: html_app: --html-app
help: (DEPRECATED: use the ScanCode Workbench app instead ) Write scan output as a mini HTML application to FILE.
doc:
Write scan output as a mini HTML application.
--------------------------------------------
Plugin: scancode_output:json class: formattedcode.output_json:JsonCompactOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: output_json: --json
help: Write scan output as compact JSON to FILE.
doc: None
--------------------------------------------
Plugin: scancode_output:json-pp class: formattedcode.output_json:JsonPrettyOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: output_json_pp: --json-pp
help: Write scan output as pretty-printed JSON to FILE.
doc: None
--------------------------------------------
Plugin: scancode_output:jsonlines class: formattedcode.output_jsonlines:JsonLinesOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: output_json_lines: --json-lines
help: Write scan output as JSON Lines to FILE.
doc: None
--------------------------------------------
Plugin: scancode_output:spdx-rdf class: formattedcode.output_spdx:SpdxRdfOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: spdx_rdf: --spdx-rdf
help: Write scan output as SPDX RDF to FILE.
doc: None
--------------------------------------------
Plugin: scancode_output:spdx-tv class: formattedcode.output_spdx:SpdxTvOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: spdx_tv: --spdx-tv
help: Write scan output as SPDX Tag/Value to FILE.
doc: None
--------------------------------------------
Plugin: scancode_output:template class: formattedcode.output_html:CustomTemplateOutput
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output formats, name: custom_output: --custom-output
help: Write scan output to FILE formatted with the custom Jinja template file.
help_group: output formats, name: custom_template: --custom-template
help: Use this Jinja template FILE as a custom template.
doc: None
--------------------------------------------
Plugin: scancode_output_filter:ignore-copyrights class: cluecode.plugin_ignore_copyrights:IgnoreCopyrights
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output filters, name: ignore_copyright_holder: --ignore-copyright-holder
help: Ignore a file (and all its findings) if a copyright holder contains a match to the <pattern> regular expression. Note that this will ignore a file even if it has other scanned data such as a license or errors.
help_group: output filters, name: ignore_author: --ignore-author
help: Ignore a file (and all its findings) if an author contains a match to the <pattern> regular expression. Note that this will ignore a file even if it has other findings such as a license or errors.
doc:
Filter findings that match given copyright holder or author patterns.
Has no effect unless the --copyright scan is requested.
--------------------------------------------
Plugin: scancode_output_filter:only-findings class: scancode.plugin_only_findings:OnlyFindings
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: output filters, name: only_findings: --only-findings
help: Only return files or directories with findings for the requested scans. Files and directories without findings are omitted (file information is not treated as findings).
doc:
Filter files or directories without scan findings for the requested scans.
--------------------------------------------
Plugin: scancode_post_scan:classify-package class: summarycode.classify:PackageTopAndKeyFilesTagger
codebase_attributes:
resource_attributes:
sort_order: 0
required_plugins:
options:
doc:
Tag resources as key or top level based on Package-type specific settings.
--------------------------------------------
Plugin: scancode_post_scan:consolidate class: scancode.plugin_consolidate:Consolidator
codebase_attributes: consolidated_components, consolidated_packages
resource_attributes: consolidated_to
sort_order: 8
required_plugins:
options:
help_group: post-scan, name: consolidate: --consolidate
help: Group resources by Packages or license and copyright holder and return those groupings as a list of consolidated packages and a list of consolidated components. This requires the scan to have/be run with the copyright, license, and package options active
doc:
A ScanCode post-scan plugin to return consolidated components and consolidated
packages for different types of codebase summarization.
A consolidated component is a group of Resources that have the same origin.
Currently, consolidated components are created by grouping Resources that have
the same license expression and copyright holders and the files that contain
this license expression and copyright holders combination make up 75% or more of
the files in the directory where they are found.
A consolidated package is a detected package in the scanned codebase that has
been enhanced with data about other licenses and holders found within it.
If a Resource is part of a consolidated component or consolidated package, then
the identifier of the consolidated component or consolidated package it is part
of is in the Resource's ``consolidated_to`` field.
--------------------------------------------
Plugin: scancode_post_scan:filter-clues class: cluecode.plugin_filter_clues:RedundantCluesFilter
codebase_attributes:
resource_attributes:
sort_order: 1
required_plugins:
options:
help_group: post-scan, name: filter_clues: --filter-clues
help: Filter redundant duplicated clues already contained in detected license and copyright texts and notices.
doc:
Filter redundant clues (copyrights, authors, emails, and urls) that are already
contained in another more important scan result.
--------------------------------------------
Plugin: scancode_post_scan:is-license-text class: licensedcode.plugin_license_text:IsLicenseText
codebase_attributes:
resource_attributes: is_license_text
sort_order: 80
required_plugins:
options:
help_group: post-scan, name: is_license_text: --is-license-text
help: Set the "is_license_text" flag to true for files that contain mostly license texts and notices (e.g over 90% of the content). [EXPERIMENTAL]
doc:
Set the "is_license_text" flag to true for at the file level for text files
that contain mostly (as 90% of their size) license texts or notices.
Has no effect unless --license, --license-text and --info scan data
are available.
--------------------------------------------
Plugin: scancode_post_scan:license-clarity-score class: summarycode.score:LicenseClarityScore
codebase_attributes: license_clarity_score
resource_attributes:
sort_order: 110
required_plugins:
options:
help_group: post-scan, name: license_clarity_score: --license-clarity-score
help: Compute a summary license clarity score at the codebase level.
doc:
Compute a License clarity score at the codebase level.
--------------------------------------------
Plugin: scancode_post_scan:license-policy class: licensedcode.plugin_license_policy:LicensePolicy
codebase_attributes:
resource_attributes: license_policy
sort_order: 9
required_plugins:
options:
help_group: post-scan, name: license_policy: --license-policy
help: Load a License Policy file and apply it to the scan at the Resource level.
doc:
Add the "license_policy" attribute to a resource if it contains a
detected license key that is found in the license_policy.yml file
--------------------------------------------
Plugin: scancode_post_scan:mark-source class: scancode.plugin_mark_source:MarkSource
codebase_attributes:
resource_attributes: source_count
sort_order: 8
required_plugins:
options:
help_group: post-scan, name: mark_source: --mark-source
help: Set the "is_source" to true for directories that contain over 90% of source files as children and descendants. Count the number of source files in a directory as a new source_file_counts attribute
doc:
Set the "is_source" flag to true for directories that contain
over 90% of source files as direct children.
Has no effect unless the --info scan is requested.
--------------------------------------------
Plugin: scancode_post_scan:summary class: summarycode.summarizer:ScanSummary
codebase_attributes: summary
resource_attributes:
sort_order: 10
required_plugins:
options:
help_group: post-scan, name: summary: --summary
help: Summarize license, copyright and other scans at the codebase level.
doc:
Summarize a scan at the codebase level.
--------------------------------------------
Plugin: scancode_post_scan:summary-by-facet class: summarycode.summarizer:ScanByFacetSummary
codebase_attributes: summary_by_facet
resource_attributes:
sort_order: 200
required_plugins:
options:
help_group: post-scan, name: summary_by_facet: --summary-by-facet
help: Summarize license, copyright and other scans and group the results by facet.
doc:
Summarize a scan at the codebase level grouping by facets.
--------------------------------------------
Plugin: scancode_post_scan:summary-keeping-details class: summarycode.summarizer:ScanSummaryWithDetails
codebase_attributes: summary
resource_attributes: summary
sort_order: 100
required_plugins:
options:
help_group: post-scan, name: summary_with_details: --summary-with-details
help: Summarize license, copyright and other scans at the codebase level, keeping intermediate details at the file and directory level.
doc:
Summarize a scan at the codebase level and keep file and directory details.
--------------------------------------------
Plugin: scancode_post_scan:summary-key-files class: summarycode.summarizer:ScanKeyFilesSummary
codebase_attributes: summary_of_key_files
resource_attributes:
sort_order: 150
required_plugins:
options:
help_group: post-scan, name: summary_key_files: --summary-key-files
help: Summarize license, copyright and other scans for key, top-level files. Key files are top-level codebase files such as COPYING, README and package manifests as reported by the --classify option "is_legal", "is_readme", "is_manifest" and "is_top_level" flags.
doc:
Summarize a scan at the codebase level for only key files.
--------------------------------------------
Plugin: scancode_pre_scan:classify class: summarycode.classify:FileClassifier
codebase_attributes:
resource_attributes: is_legal, is_manifest, is_readme, is_top_level, is_key_file
sort_order: 50
required_plugins:
options:
help_group: pre-scan, name: classify: --classify
help: Classify files with flags telling if the file is a legal, or readme or test file, etc.
doc:
Classify a file such as a COPYING file or a package manifest with a flag.
--------------------------------------------
Plugin: scancode_pre_scan:facet class: summarycode.facet:AddFacet
codebase_attributes:
resource_attributes: facets
sort_order: 20
required_plugins:
options:
help_group: pre-scan, name: facet: --facet
help: Add the <facet> to files with a path matching <pattern>.
doc:
Assign one or more "facet" to each file (and NOT to directories). Facets are
a way to qualify that some part of the scanned code may be core code vs.
test vs. data, etc.
--------------------------------------------
Plugin: scancode_pre_scan:ignore class: scancode.plugin_ignore:ProcessIgnore
codebase_attributes:
resource_attributes:
sort_order: 100
required_plugins:
options:
help_group: pre-scan, name: ignore: --ignore
help: Ignore files matching <pattern>.
help_group: pre-scan, name: include: --include
help: Include files matching <pattern>.
doc:
Include or ignore files matching patterns.
--------------------------------------------
Plugin: scancode_scan:copyrights class: cluecode.plugin_copyright:CopyrightScanner
codebase_attributes:
resource_attributes: copyrights, holders, authors
sort_order: 4
required_plugins:
options:
help_group: primary scans, name: copyright: -c, --copyright
help: Scan <input> for copyrights.
doc:
Scan a Resource for copyrights.
--------------------------------------------
Plugin: scancode_scan:emails class: cluecode.plugin_email:EmailScanner
codebase_attributes:
resource_attributes: emails
sort_order: 8
required_plugins:
options:
help_group: other scans, name: email: -e, --email
help: Scan <input> for emails.
help_group: scan options, name: max_email: --max-email
help: Report only up to INT emails found in a file. Use 0 for no limit.
doc:
Scan a Resource for emails.
--------------------------------------------
Plugin: scancode_scan:generated class: summarycode.generated:GeneratedCodeDetector
codebase_attributes:
resource_attributes: is_generated
sort_order: 50
required_plugins:
options:
help_group: other scans, name: generated: --generated
help: Classify automatically generated code files with a flag.
doc:
Tag a file as generated.
--------------------------------------------
Plugin: scancode_scan:info class: scancode.plugin_info:InfoScanner
codebase_attributes:
resource_attributes: date, sha1, md5, mime_type, file_type, programming_language, is_binary, is_text, is_archive, is_media, is_source, is_script
sort_order: 0
required_plugins:
options:
help_group: other scans, name: info: -i, --info
help: Scan <input> for file information (size, checksums, etc).
doc:
Scan a file Resource for miscellaneous information such as mime/filetype and
basic checksums.
--------------------------------------------
Plugin: scancode_scan:licenses class: licensedcode.plugin_license:LicenseScanner
codebase_attributes:
resource_attributes: licenses, license_expressions
sort_order: 2
required_plugins:
options:
help_group: primary scans, name: license: -l, --license
help: Scan <input> for licenses.
help_group: scan options, name: license_score: --license-score
help: Do not return license matches with a score lower than this score. A number between 0 and 100.
help_group: scan options, name: license_text: --license-text
help: Include the detected licenses matched text.
help_group: scan options, name: license_text_diagnostics: --license-text-diagnostics
help: In the matched license text, include diagnostic highlights surrounding with square brackets [] words that are not matched.
help_group: scan options, name: license_url_template: --license-url-template
help: Set the template URL used for the license reference URLs. Curly braces ({}) are replaced by the license key.
help_group: scan options, name: license_diag: --license-diag
help: (DEPRECATED: this is always included by default now). Include diagnostic information in license scan results.
help_group: miscellaneous, name: reindex_licenses: --reindex-licenses
help: Check the license index cache and reindex if needed and exit.
doc:
Scan a Resource for licenses.
--------------------------------------------
Plugin: scancode_scan:packages class: packagedcode.plugin_package:PackageScanner
codebase_attributes:
resource_attributes: packages
sort_order: 6
required_plugins: scan:licenses
options:
help_group: primary scans, name: package: -p, --package
help: Scan <input> for package manifests and packages.
help_group: documentation, name: list_packages: --list-packages
help: Show the list of supported package types and exit.
doc:
Scan a Resource for Package manifests and report these as "packages" at the
right file or directory level.
--------------------------------------------
Plugin: scancode_scan:urls class: cluecode.plugin_url:UrlScanner
codebase_attributes:
resource_attributes: urls
sort_order: 10
required_plugins:
options:
help_group: other scans, name: url: -u, --url
help: Scan <input> for urls.
help_group: scan options, name: max_url: --max-url
help: Report only up to INT urls found in a file. Use 0 for no limit.
doc:
Scan a Resource for URLs.
--list-packages Option¶
This shows all the types of packages that can be scanned using ScanCode. These are located in packagedcode, i.e. the code used to parse various package formats.
--print-options Option¶
This option prints the options selected for one specific scan command.
If we run this command:
./scancode -clpieu --json-pp sample.json samples --classify --summary --summary-with-details --print-options
The output will be:
Options:
classify: True
copyright: True
email: True
info: True
license: True
list_packages: None
output_json_pp: <unopened file 'sample.json' wb>
package: True
reindex_licenses: None
summary: True
summary_with_details: True
url: True
All Available Options¶
This section contains an exhaustive list of all Scancode options, arranged in various sections. The sections are as follows:
- Basic Scan Options
- Core Scan Options
- Output Formats
- Controlling Output and Filters
- Pre-Scan Options
- Post-Scan Options
There is also a separate section for extractcode options.
The order of the sections and of their options is the same as in the command line help text.
All “Basic” Scan Options¶
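The options are listed in two columns: the command-line option and its description.
-c, --copyright | Scan <input> for copyrights. Sub-Options: --consolidate |
-l, --license | Scan <input> for licenses. Sub-Options: --consolidate, --license-score, --license-text, --license-url-template, --license-text-diagnostics, --is-license-text |
-p, --package | Scan <input> for package manifests and packages. Sub-Options: --consolidate |
-e, --email | Scan <input> for emails. Sub-Options: --max-email |
-u, --url | Scan <input> for urls. Sub-Options: --max-url |
-i, --info | Include file information such as size, checksums, etc. Sub-Options: --mark-source |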
Note
Unlike previous 2.x versions, -c, -l, and -p are not enabled by default. If any combination of these options is used, ScanCode performs only those specific tasks, and not the others. For example, ./scancode -e only scans for emails, and does not scan for copyrights, licenses, packages, or general file information.
Note
The options -c, -l, -p, -e, -u, and -i can be combined. Instead of ./scancode -c -i -p, you can write ./scancode -cip and the result will be the same.
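The equivalence of separate and combined short options can be illustrated with a small standalone sketch. It uses Python's standard argparse module purely as a stand-in parser; this is an assumption made for illustration and says nothing about how ScanCode parses its own command line. The parse_basic_options helper and the scancode-sketch program name are hypothetical.

```python
import argparse


def parse_basic_options(argv):
    """Parse a few single-letter flags the way POSIX-style CLIs usually do."""
    # Hypothetical stand-in parser, not ScanCode's own implementation.
    parser = argparse.ArgumentParser(prog="scancode-sketch")
    parser.add_argument("-c", "--copyright", action="store_true")
    parser.add_argument("-i", "--info", action="store_true")
    parser.add_argument("-p", "--package", action="store_true")
    parser.add_argument("-e", "--email", action="store_true")
    return vars(parser.parse_args(argv))


separate = parse_basic_options(["-c", "-i", "-p"])
combined = parse_basic_options(["-cip"])

# Both spellings enable the same scans; the email scan stays off.
print(separate == combined)
print(combined)
```

In this sketch, as in the note above, a scan that is not requested (here the email scan) simply stays disabled.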
--generated | Classify automatically generated code files with a flag. |
--max-email INT | Report only up to INT emails found in a file. Use 0 for no limit. [Default: 50] Sub-Option of: --email |
--max-url INT | Report only up to INT urls found in a file. Use 0 for no limit. [Default: 50] Sub-Option of: --url |
--license-score INTEGER | Do not return license matches with scores lower than this score. A number between 0 and 100. [Default: 0] A bigger number means a better match, i.e. setting a higher license score sets a higher threshold (and returns an equal or smaller number of matches). Sub-Option of: --license |
--license-text | Include the matched text for the detected licenses in the output report. Sub-Option of: --license. Sub-Options: --license-text-diagnostics, --is-license-text |
--license-url-template TEXT | Set the template URL used for the license reference URLs. In a template URL, curly braces ({}) are replaced by the license key. [Default: https://enterprise.dejacode.com/urn/urn:dje:license:{}] Sub-Option of: --license |
--license-text-diagnostics | In the matched license text, include diagnostic highlights that surround words that are not matched with square brackets []. Sub-Option of: --license-text |
All Extractcode Options¶
This is intended to be used as an input preparation step, before running the scan. Archives found in an extracted archive are extracted recursively by default. Extraction is done in-place, in a directory placed side-by-side with the archive and named after it with an ‘-extract’ suffix (e.g. zlib.tar.gz-extract).
To extract the packages in the samples directory:
./extractcode samples
This extracts the zlib.tar.gz package:

--shallow | Do not extract recursively nested archives (e.g. not archives in archives). |
--verbose | Print verbose file-by-file progress messages. |
--quiet | Do not print any summary or progress message. |
-h, --help | Show the extractcode help message and exit. |
--about | Show information about ScanCode and licensing and exit. |
--version | Show the version and exit. |
All “Core” Scan Options¶
-n, --processes INTEGER | Scan <input> using n parallel processes. [Default: 1] |
--verbose | Print verbose file-by-file progress messages. |
--quiet | Do not print summary or progress messages. |
--timeout FLOAT | Stop scanning a file if scanning takes longer than a timeout in seconds. [Default: 120] |
--reindex-licenses | Force a check and possible reindexing of the cached license index. |
--from-json | Load codebase from an existing JSON scan. |
--max-in-memory INTEGER | Maximum number of files and directories scan details kept in memory during a scan. Additional files and directories scan details above this number are cached on-disk rather than in memory. Use 0 to use unlimited memory and disable on-disk caching. Use -1 to use only on-disk caching. [Default: 10000] |
Note
All the Core Options are independent options, i.e. they do not depend on other options.
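The three kinds of values accepted by --max-in-memory (0, -1, or a positive count) can be summarized in a short sketch. The describe_max_in_memory helper is hypothetical and only restates the rules from the table above; it is not ScanCode's own code.

```python
def describe_max_in_memory(value):
    """Explain how a --max-in-memory value is meant to be read."""
    if value == 0:
        return "unlimited memory, on-disk caching disabled"
    if value == -1:
        return "on-disk caching only"
    if value > 0:
        return "keep up to %d resources in memory, cache the rest on disk" % value
    # Other negative values are not described in the option help.
    raise ValueError("expected -1, 0 or a positive integer")


for setting in (10000, 0, -1):
    print(setting, "->", describe_max_in_memory(setting))
```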
All Scan Output Options¶
--json FILE | Write scan output as compact JSON to FILE. |
--json-pp FILE | Write scan output as pretty-printed JSON to FILE. |
--json-lines FILE | Write scan output as JSON Lines to FILE. |
--csv FILE | Write scan output as CSV to FILE. |
--html FILE | Write scan output as HTML to FILE. |
--custom-output FILE | Write scan output to FILE formatted with the custom Jinja template file. Mandatory Sub-Option: --custom-template |
--custom-template FILE | Use this Jinja template FILE as a custom template. Sub-Option of: --custom-output |
--spdx-rdf FILE | Write scan output as SPDX RDF to FILE. |
--spdx-tv FILE | Write scan output as SPDX Tag/Value to FILE. |
--html-app FILE | Write scan output as a mini HTML application to FILE. |
Warning
The html-app feature has been deprecated; use ScanCode Workbench instead to visualize scan results. See the official ScanCode Workbench repository, and also refer to How to Visualize Scan Results.
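The difference between compact JSON, pretty-printed JSON and JSON Lines can be seen by serializing the same data three ways with Python's json module. This is only a sketch of the general formats: the records are made up, and the exact layout ScanCode writes to each line of a JSON Lines file is not described here, so each line below simply holds one record.

```python
import json

# Made-up scan-like records, for illustration only.
files = [
    {"path": "samples/README", "type": "file"},
    {"path": "samples/zlib", "type": "directory"},
]

# Compact JSON: a single line without extra whitespace.
compact = json.dumps({"files": files}, separators=(",", ":"))

# Pretty-printed JSON: the same data, indented for human readers.
pretty = json.dumps({"files": files}, indent=2)

# JSON Lines: one standalone JSON document per line.
json_lines = "\n".join(json.dumps(record) for record in files)

print(compact)
print(pretty)
print(json_lines)
```

The compact and pretty-printed forms decode to identical data; JSON Lines trades the single enclosing document for line-by-line processing.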
All “Output Control” Scan Options¶
--strip-root | Strip the root directory segment of all paths. |
--full-root | Report full, absolute paths. |
Note
The options --strip-root and --full-root cannot be used together, i.e. only one of them may be used in a single scan.
Note
The default is to always include the last directory segment of the scanned path such that all paths have a common root directory.
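The three path styles can be compared with a small sketch. The reported_path helper is hypothetical and only mirrors the descriptions above; the exact strings ScanCode emits (for example, how a leading slash is handled) may differ.

```python
import posixpath


def reported_path(scan_root, file_path, mode="default"):
    """Return the path as it might be reported for a file under scan_root.

    mode is one of "default", "strip-root" or "full-root".
    """
    scan_root = posixpath.normpath(scan_root)
    file_path = posixpath.normpath(file_path)
    relative = posixpath.relpath(file_path, scan_root)
    if mode == "strip-root":
        # Drop the root directory segment entirely.
        return relative
    if mode == "full-root":
        # Keep the full, absolute path.
        return file_path
    # Default: keep the last segment of the scanned path as a common root.
    return posixpath.join(posixpath.basename(scan_root), relative)


root = "/home/user/samples"
path = "/home/user/samples/zlib/adler32.c"
for mode in ("default", "strip-root", "full-root"):
    print(mode, "->", reported_path(root, path, mode))
```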
--ignore-author <pattern> | Ignore a file (and all its findings) if an author contains a match to the <pattern> regular expression. |
--ignore-copyright-holder <pattern> | Ignore a file (and all its findings) if a copyright holder contains a match to the <pattern> regular expression. |
Note
Both the --ignore-author and --ignore-copyright-holder options will ignore a file even if it has other scanned data such as a license or errors.
--only-findings | Only return files or directories with findings for the requested scans. Files and directories without findings are omitted (file information is not treated as findings). |
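The effect of --ignore-copyright-holder described above, where a whole file is dropped even if it carries other findings, can be sketched as follows. The record layout (a "holders" list of plain strings next to a "licenses" list) is a simplified assumption, and filter_ignored is a hypothetical helper, not part of ScanCode.

```python
import re


def filter_ignored(files, holder_pattern):
    """Drop every file record that has a copyright holder matching the pattern."""
    regex = re.compile(holder_pattern)
    kept = []
    for record in files:
        holders = record.get("holders", [])
        if any(regex.search(holder) for holder in holders):
            # The whole file is dropped, including its license findings.
            continue
        kept.append(record)
    return kept


# Made-up records for illustration.
files = [
    {"path": "a.c", "holders": ["Example Corp."], "licenses": ["mit"]},
    {"path": "b.c", "holders": ["Jane Doe"], "licenses": ["zlib"]},
    {"path": "c.c", "holders": [], "licenses": []},
]

print([record["path"] for record in filter_ignored(files, r"Example")])
```

Note that a.c disappears from the result even though it has a license finding, which is the behavior the note above warns about.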
All “Pre-Scan” Options¶
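--ignore <pattern> | Ignore files matching <pattern>. |
--include <pattern> | Include files matching <pattern>. |
--classify | Classify files with flags telling if the file is a legal, or readme or test file, etc. Sub-Options: --license-clarity-score, --summary-key-files |
--facet <facet_pattern> | Add the <facet> to files with a path matching <pattern>. Here <facet_pattern> stands for <facet>=<pattern>. Sub-Options: --summary-by-facet |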
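How --ignore and --include patterns might select files can be sketched with glob matching from Python's fnmatch module. This is an approximation under stated assumptions: that patterns are shell-style globs tested against both the file name and the full path, and that an ignore match wins over an include match. ScanCode's actual matching rules may differ, and is_scanned is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matches(path, patterns):
    """Return True if the file name or the full path matches any glob pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatchcase(name, p) or fnmatchcase(path, p) for p in patterns)


def is_scanned(path, ignore=(), include=()):
    """Decide whether a path stays in the scan (assumed precedence: ignore first)."""
    if matches(path, ignore):
        return False
    if include:
        return matches(path, include)
    return True


print(is_scanned("samples/JGroups/src/Foo.java", ignore=["*.java"]))
print(is_scanned("samples/zlib/adler32.c", ignore=["*.java"]))
print(is_scanned("samples/README", include=["*.c"]))
```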
All “Post-Scan” Options¶
--mark-source | Set the “is_source” flag to true for directories that contain over 90% of source files as direct children and descendants. Count the number of source files in a directory as a new “source_file_counts” attribute. Sub-Option of: --info |
--consolidate | Group resources by Packages or license and copyright holder and return those groupings as a list of consolidated packages and a list of consolidated components. Sub-Option of: --copyright, --license, --package |
--filter-clues | Filter redundant duplicated clues already contained in detected licenses, copyright texts and notices. |
--is-license-text | Set the “is_license_text” flag to true for files that contain mostly license texts and notices (e.g. over 90% of the content). Sub-Option of: --info, --license, --license-text |
Warning
--is-license-text is an experimental option.
--license-clarity-score | Compute a summary license clarity score at the codebase level. Sub-Option of: --classify |
--license-policy FILE | Load a License Policy file and apply it to the scan at the Resource level. |
--summary | Summarize license, copyright and other scans at the codebase level. Sub-Options: --summary-by-facet, --summary-key-files |
--summary-by-facet | Summarize license, copyright and other scans and group the results by facet. Sub-Option of: --summary, --facet |
--summary-key-files | Summarize license, copyright and other scans for key, top-level files. Key files are top-level codebase files such as COPYING, README and package manifests as reported by the --classify option “is_legal”, “is_readme”, “is_manifest” and “is_top_level” flags. Sub-Option of: --classify, --summary |
--summary-with-details | Summarize license, copyright and other scans at the codebase level, keeping intermediate details at the file and directory level. |
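The 90% rule used by --mark-source, listed at the start of this table, can be written down as a short sketch. It assumes the "is_source" flag of every file below a directory is already known (this is what the --info scan provides). The mark_source helper is hypothetical, and the name of the count attribute is taken as "source_count" here purely for illustration.

```python
def mark_source(file_flags, threshold=0.9):
    """Flag a directory as source when over 90% of its files are source files.

    file_flags is a list of is_source booleans, one per file below the directory.
    Written for Python 3, where / is true division.
    """
    source_count = sum(1 for is_source in file_flags if is_source)
    total = len(file_flags)
    is_source_dir = total > 0 and source_count / total > threshold
    return {"is_source": is_source_dir, "source_count": source_count}


# 19 source files out of 20 is 95%, which is over the threshold.
print(mark_source([True] * 19 + [False]))
# 9 out of 10 is exactly 90%, which is not "over 90%".
print(mark_source([True] * 9 + [False]))
```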
How to Run a Scan¶
In this simple tutorial example, we perform a basic scan on the samples directory distributed by default with ScanCode.
Warning
This tutorial is for Linux based systems presently. Additional Help for Windows/MacOS will be added.
Setting up a Virtual Environment¶
ScanCode Toolkit 3.1.1 and Workbench 3.1.0 are not compatible with Python 3.x, so we will create a virtual environment using the Virtualenv tool with a Python 2.7 interpreter.
The following commands set up and activate the virtual environment venv-scan3.1.1:
virtualenv -p /usr/bin/python2.7 venv-scan3.1.1
source venv-scan3.1.1/bin/activate
Setting up Scancode Toolkit¶
Get the ScanCode Toolkit version 3.1.1 tarball or .zip archive from the Toolkit GitHub Release Page, under the assets section. Download the archive and extract it from the command line:
For .zip archive:
unzip scancode-toolkit-3.1.1.zip
For .tar.bz2 archive:
tar -xvf scancode-toolkit-3.1.1.tar.bz2
Alternatively, right-click the archive and select “Extract Here”.
Check whether the Prerequisites are installed. Open a terminal in the extracted directory and run:
./scancode --help
This will configure ScanCode and display the command line Help text.
Looking into Files¶
As mentioned previously, we are going to perform the scan on the samples directory distributed by default with ScanCode Toolkit. Here’s the directory structure and respective files:

We notice here that the sample files contain a package, zlib.tar.gz. So we have to extract the archive before running the scan, to also scan the files inside this package.
Performing Extraction¶
To extract the packages inside the samples directory:
./extractcode samples
This extracts the zlib.tar.gz package:

Note
The --shallow option can be used to skip the extraction of nested archives. By default, archives found inside an extracted archive are extracted recursively.
Deciding Scan Options¶
These are some common scan options you should consider using before you start the actual scan, according to your requirements.
- The Basic Scan options, i.e. -c, -l, -p, -e, -u, and -i, are to be decided according to your requirements. If you do not need one specific type of information (say, licenses), consider removing it, because the more things you scan for, the longer the scan will take to complete.
Note
You have to select these options explicitly, as they are no longer enabled by default from versions 3.x onwards, unlike earlier versions which had -clp as default.
- --license-score INTEGER is to be set if license matching accuracy is desired (the default is 0, and increasing it keeps only more accurate matches). Also, using --license-text includes the matched text in the results.
- The -n INTEGER option can be used to speed up the scan using multiple parallel processes.
- The --timeout FLOAT option can be used to skip a file that takes a long time to scan.
- --ignore <pattern> can be used to skip certain groups of files.
- <OUTPUT FORMAT OPTION(s)> is also a very important decision when you want to use the output for specific tasks or have specific requirements. Here we are using json, as ScanCode Workbench imports json files only.
For the complete list of options, refer to All Available Options.
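When scans are launched from a script, the choices above can be collected into an argument list. The sketch below only assembles the list; build_scan_command is a hypothetical helper, and every flag it emits is one of the options already listed in this section.

```python
def build_scan_command(input_path, output_file, scans="clpeui",
                       processes=None, timeout=None, ignore=()):
    """Assemble a scancode argument list from the options chosen above."""
    command = ["./scancode", "-" + scans]
    if processes is not None:
        command += ["-n", str(processes)]
    if timeout is not None:
        command += ["--timeout", str(timeout)]
    for pattern in ignore:
        command += ["--ignore", pattern]
    # Pretty-printed JSON output, followed by the directory to scan.
    command += ["--json-pp", output_file, input_path]
    return command


command = build_scan_command("samples", "sample.json", processes=2,
                             ignore=["*.java"])
print(" ".join(command))
```

Such a list could then be handed to subprocess.run from the extracted ScanCode directory. Passing the arguments as a list avoids shell quoting issues with patterns such as *.java.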
Running The Scan¶
Now, run the scan with the options decided:
./scancode -clpeui -n 2 --ignore "*.java" --json-pp sample.json samples
A Progress report is shown:
Setup plugins...
Collect file inventory...
Scan files for: info, licenses, copyrights, packages, emails, urls with 2 process(es)...
[####################] 29
Scanning done.
Summary: info, licenses, copyrights, packages, emails, urls with 2 process(es)
Errors count: 0
Scan Speed: 1.09 files/sec. 40.67 KB/sec.
Initial counts: 49 resource(s): 36 file(s) and 13 directorie(s)
Final counts: 42 resource(s): 29 file(s) and 13 directorie(s) for 1.06 MB
Timings:
scan_start: 2019-09-24T203514.573671
scan_end: 2019-09-24T203545.649805
setup_scan:licenses: 4.30s
setup: 4.30s
scan: 26.62s
total: 31.14s
Removing temporary files...done.
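Once the scan has finished, the JSON output can be inspected with a few lines of Python. The sketch below counts detected license keys. It assumes that the output has a top-level "files" list whose entries carry a "path" and a "licenses" list with a "key" per match; check this against your own sample.json. The SAMPLE data is made up and stands in for the real file.

```python
import json
from collections import Counter

# Tiny made-up stand-in for the contents of sample.json.
SAMPLE = """
{"files": [
  {"path": "samples/zlib/zlib.h", "licenses": [{"key": "zlib", "score": 100.0}]},
  {"path": "samples/README", "licenses": []},
  {"path": "samples/zlib/adler32.c", "licenses": [{"key": "zlib", "score": 100.0}]}
]}
"""


def count_license_keys(scan):
    """Count how many times each license key was detected across all files."""
    counts = Counter()
    for record in scan.get("files", []):
        for match in record.get("licenses", []):
            counts[match["key"]] += 1
    return counts


print(count_license_keys(json.loads(SAMPLE)))
```

To run this against a real scan, replace json.loads(SAMPLE) with json.load applied to the opened sample.json file.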
Basic Options¶
--generated Options¶
The --generated option classifies automatically generated code files with a flag.
An example of using --generated in a scan:
./scancode -clpieu --json-pp output.json samples --generated
In the results, the following attribute is added for each file, with its corresponding true/false value:
"is_generated": true
In the samples folder, the following files have a true value for their is_generated attribute:
"samples/zlib/dotzlib/LICENSE_1_0.txt"
"samples/JGroups/licenses/apache-2.0.txt"
--max-email Options¶
Dependency
The option --max-email is a sub-option of and requires the option --email.
If individual scanned files contain a lot of emails (i.e. email lists) which are unnecessary and clutter the scan results, the --max-email option can be used to report emails only up to a limit per file.
Some important INTEGER values of the --max-email INTEGER option:
- 0 - No limit, include all emails.
- 50 - Default.
An example usage:
./scancode -clpieu --json-pp output.json samples --max-email 5
This only reports 5 email addresses per file and ignores the rest.
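The capping rule, including the special meaning of 0, can be sketched in a few lines. The cap_findings helper is hypothetical, and keeping the first findings in order of appearance is an assumption. The same rule would read identically for --max-url.

```python
def cap_findings(findings, limit=50):
    """Keep at most `limit` findings per file; 0 means no limit."""
    findings = list(findings)
    if limit == 0:
        return findings
    return findings[:limit]


# Made-up addresses for illustration.
emails = ["user%d@example.com" % n for n in range(8)]

print(len(cap_findings(emails, limit=5)))
print(len(cap_findings(emails, limit=0)))
print(len(cap_findings(emails)))
```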
--max-url Options¶
Dependency
The option --max-url is a sub-option of and requires the option --url.
If individual scanned files contain a lot of links to other websites (i.e. url lists) which are unnecessary and clutter the scan results, the --max-url option can be used to report urls only up to a limit per file.
Some important INTEGER values of the --max-url INTEGER option:
- 0 - No limit, include all urls.
- 50 - Default.
An example usage:
./scancode -clpieu --json-pp output.json samples --max-url 10
This only reports 10 urls per file and ignores the rest.
--license-score Options¶
Dependency
The option --license-score is a sub-option of and requires the option --license.
License matching strictness, i.e. how closely matched licenses are detected in a scan, can be modified by using the --license-score option.
Some important INTEGER values of the --license-score INTEGER option:
- 0 - Default and lowest value. All matches are reported.
- 100 - Highest value. Only licenses with a much better match are reported.
Here, a bigger number means a better match, i.e. setting a higher license score translates to a higher threshold for matching licenses (with an equal or smaller number of license matches).
An example usage:
./scancode -clpieu --json-pp output.json samples --license-score 70
Here are the license results on setting the integer value to 100, vs. the default value 0. This is visualized using ScanCode Workbench in the License Info Dashboard.
[Figure: License scan results of the samples directory, shown for License Score 0 (default) and License Score 100.]
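The threshold behavior can be sketched as a simple filter over license matches. The match records and their scores below are made up, and filter_by_score is a hypothetical helper. It follows the option help, which says matches with a score lower than the threshold are not returned.

```python
def filter_by_score(matches, min_score=0):
    """Keep only the license matches whose score is at least min_score."""
    return [match for match in matches if match["score"] >= min_score]


# Made-up matches for illustration.
matches = [
    {"key": "zlib", "score": 100.0},
    {"key": "apache-2.0", "score": 85.5},
    {"key": "gpl-2.0", "score": 40.0},
]

for threshold in (0, 70, 100):
    kept = [match["key"] for match in filter_by_score(matches, threshold)]
    print(threshold, kept)
```

Raising the threshold never adds matches; it can only keep the same set or shrink it.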
--license-text Options¶
Dependency
The option --license-text is a sub-option of and requires the option --license.
Sub-Option
The options --license-text-diagnostics and --is-license-text are sub-options of --license-text. --is-license-text is a Post-Scan Option.
With the --license-text option, the scan results attribute “matched_text” includes the matched text for the detected license.
An example scan:
./scancode -cplieu --json-pp output.json samples --license-text
An example matched text included in the results is as follows:
"matched_text": " This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. 3. This notice may not be removed or altered from any source distribution. Jean-loup Gailly Mark Adler jloup@gzip.org madler@alumni.caltech.edu"
- The file in which this license was detected:
samples/arch/zlib.tar.gz-extract/zlib-1.2.8/zlib.h
- License name: “ZLIB License”
--license-url-template
Options¶
Dependency
The option --license-url-template is a sub-option of and requires the option --license.
The --license-url-template option sets the template URL used for the license reference URLs.
The default template URL is: https://enterprise.dejacode.com/urn/urn:dje:license:{}
In a template URL, the curly braces ({}) are replaced by the license key.
So, by default, the license reference URL points to the DejaCode page for that license.
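The substitution described above can be sketched in a few lines of Python. This is only an illustration: `str.format` stands in for whatever mechanism ScanCode uses internally, and the function name is hypothetical.

```python
DEFAULT_TEMPLATE = "https://enterprise.dejacode.com/urn/urn:dje:license:{}"


def build_reference_url(template, license_key):
    """Replace the {} placeholder in the template with the license key."""
    # Assumption: plain positional formatting; ScanCode's own code may differ.
    return template.format(license_key)


print(build_reference_url(DEFAULT_TEMPLATE, "zlib"))
print(build_reference_url("https://example.com/licenses/{}.yml", "zlib"))
```

The second call uses a made-up example.com template to show that any URL containing {} works the same way.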
A scan example using the --license-url-template TEXT option:
./scancode -clpieu --json-pp output.json samples --license-url-template https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/{}.yml
In a normal scan, the reference URL for “ZLIB License” is as follows:
"reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:zlib",
After using the option in the following manner:
--license-url-template https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/{}.yml
the reference URL changes to this zlib.yml file:
"reference_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/zlib.yml",
The reference URL changes for all detected licenses in the scan, across the scan result file.
--license-text-diagnostics
Options¶
Dependency
The option --license-text-diagnostics is a sub-option of and requires the options --license and --license-text.
With this option, the matched license text includes diagnostic highlights: words that are not matched are surrounded with square brackets [].
In a normal scan, whole lines of text are included in the matched license text, including parts that are possibly unmatched.
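Because unmatched words are wrapped in square brackets, they can be pulled out of a "matched_text" value with a regular expression. The sketch below is a post-processing illustration and not part of ScanCode; the function name is hypothetical and the sample string is shortened from a diagnostics-style result.

```python
import re


def unmatched_words(matched_text):
    """Return the words wrapped in square brackets, in order of appearance."""
    return re.findall(r"\[([^\[\]]+)\]", matched_text)


sample = "License [Copyright] ([c]) [2000] - [2006] [The] [Legion] Permission is hereby granted"
print(unmatched_words(sample))
```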
An example scan:
./scancode -cplieu --json-pp output.json samples --license-text --license-text-diagnostics
Running a scan on the samples directory with the --license-text --license-text-diagnostics options causes the following difference in the scan result of the file samples/JGroups/licenses/bouncycastle.txt.
Without diagnostics:
"matched_text": "License Copyright (c) 2000 - 2006 The Legion Of The Bouncy Castle (http://www.bouncycastle.org) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction
With diagnostics on:
"matched_text": "License [Copyright] ([c]) [2000] - [2006] [The] [Legion] [Of] [The] [Bouncy] [Castle] ([http]://[www].[bouncycastle].[org]) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction,
Core Options¶
All “Core” Scan Options¶
-n, --processes INTEGER | Scan <input> using n parallel processes. [Default: 1] |
--verbose | Print verbose file-by-file progress messages. |
--quiet | Do not print summary or progress messages. |
--timeout FLOAT | Stop scanning a file if scanning takes longer than a timeout in seconds. [Default: 120] |
--reindex-licenses | Force a check and possible reindexing of the cached license index. |
--from-json | Load codebase from an existing JSON scan. |
--max-in-memory INTEGER | Maximum number of files and directories scan details kept in memory during a scan. Additional files and directories scan details above this number are cached on-disk rather than in memory. Use 0 to use unlimited memory and disable on-disk caching. Use -1 to use only on-disk caching. [Default: 10000] |
Note
All the Core Options are independent options, i.e. they don’t depend on other options.
Comparing Progress Message Options¶
Default Progress Message:
Scanning files for: infos, licenses, copyrights, packages, emails, urls with 1 process(es)...
Building license detection index...Done.
Scanning files...
[####################] 43
Scanning done.
Scan statistics: 43 files scanned in 33s.
Scan options: infos, licenses, copyrights, packages, emails, urls with 1 process(es).
Scanning speed: 1.4 files per sec.
Scanning time: 30s.
Indexing time: 2s.
Saving results.
Progress Message with --verbose:
Scanning files for: infos, licenses, copyrights, packages, emails, urls with 1 process(es)...
Building license detection index...Done.
Scanning files...
Scanned: screenshot.png
Scanned: README
...
Scanned: zlib/dotzlib/ChecksumImpl.cs
Scanned: zlib/dotzlib/readme.txt
Scanned: zlib/gcc_gvmat64/gvmat64.S
Scanned: zlib/ada/zlib.ads
Scanned: zlib/infback9/infback9.c
Scanned: zlib/infback9/infback9.h
Scanned: arch/zlib.tar.gz
Scanning done.
Scan statistics: 43 files scanned in 29s.
Scan options: infos, licenses, copyrights, packages, emails, urls with 1 process(es).
Scanning speed: 1.58 files per sec.
Scanning time: 27s.
Indexing time: 2s.
Saving results.
So, with --verbose enabled, progress messages for individual files are shown.
With the --quiet option enabled, nothing is printed on the command line.
--timeout
Option¶
This option sets the scan timeout for each file (not for the entire scan). If scanning a file exceeds the specified timeout, scanning of that file stops and the scan moves on to the next file. This helps avoid getting stuck on very large or long files, and saves time.
The timeout, in seconds, can be a floating point number, e.g. 1.5467.
--reindex-licenses
Option¶
ScanCode maintains a license index to search for and detect licenses. When ScanCode is configured for the first time, a license index is built and used in every scan thereafter.
The --reindex-licenses option rebuilds the license index. Running a scan with this option displays the following message in the terminal in addition to what it normally shows:
Checking and rebuilding the license index...
--from-json
Option¶
If you want to load scan results from a .json file and process those same files again with other options or a different output format, you can do so using the --from-json option.
An example scan command using --from-json:
./scancode --from-json sample.json --json-pp sample_2.json --classify
This loads the scan results from sample.json, runs the --classify plugin, and writes the results of this scan to sample_2.json.
--max-in-memory
Option¶
During a scan, as individual files are scanned, the scan details for those files are kept in memory until the scan is completed. After the scan is completed, they are written in the specified output format.
If the scan involves a very large number of files, the details might not fit in memory during the scan. For this reason, disk caching can be used for some or all of the files.
Some important INTEGER values of the --max-in-memory INTEGER option:
- 0 - Unlimited memory, store all the file/directory scan results in memory
- -1 - Use only disk caching, store all the file/directory scan results on disk
- 10000 - Default, store 10,000 file/directory scan results in memory and the rest on disk
An example usage:
./scancode -clieu --json-pp sample.json samples --max-in-memory -1
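The three cases listed for --max-in-memory can be restated as a small decision function. This is purely an illustration of the option's documented semantics; it is not ScanCode's caching code, and the function name is hypothetical.

```python
def storage_for(index, max_in_memory=10000):
    """Return "memory" or "disk" for the index-th (0-based) scanned resource."""
    if max_in_memory == 0:    # 0: unlimited memory, on-disk caching disabled
        return "memory"
    if max_in_memory == -1:   # -1: on-disk caching only
        return "disk"
    # Otherwise the first max_in_memory resources stay in memory, the rest go to disk.
    return "memory" if index < max_in_memory else "disk"


print(storage_for(9999))      # within the default limit
print(storage_for(10000))     # first resource past the default limit
print(storage_for(5, -1))     # disk-only mode
```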
Scancode Output Formats¶
Scan results generated by Scancode are available in different formats, to be specified by the following options.
All Scan Output Options¶
--json FILE | Write scan output as compact JSON to FILE. |
--json-pp FILE | Write scan output as pretty-printed JSON to FILE. |
--json-lines FILE | Write scan output as JSON Lines to FILE. |
--csv FILE | Write scan output as CSV to FILE. |
--html FILE | Write scan output as HTML to FILE. |
--custom-output FILE | Write scan output to FILE formatted with the custom Jinja template file. Mandatory Sub-option: --custom-template FILE |
--custom-template FILE | Use this Jinja template FILE as a custom template. Sub-Option of: --custom-output |
--spdx-rdf FILE | Write scan output as SPDX RDF to FILE. |
--spdx-tv FILE | Write scan output as SPDX Tag/Value to FILE. |
--html-app FILE | Write scan output as a mini HTML application to FILE. |
Warning
The html-app feature has been deprecated; use ScanCode Workbench instead to visualize scan results. See the official repository, and also refer to How to Visualize Scan results.
Note
You can output scan results in two different file formats simultaneously in one scan. An example:
./scancode -clpieu --json-pp output.json --html output.html samples
Note
All the examples and snippets that follow have been generated by scanning the samples folder distributed with scancode-toolkit.
Print to stdout
(Terminal)¶
If you want to format the output as JSON and print it to stdout, you can replace the JSON filename with a "-", like --json-pp - instead of --json-pp output.json.
The following command will output the scan results in JSON format to stdout
(In the Terminal):
./scancode -clpieu --json-pp - samples/
--json FILE
¶
Among the ScanCode output formats, json is the most important one, and is recommended over others. ScanCode Workbench and other applications that use ScanCode result data as input accept only the json format.
Note
There isn’t any default output option in ScanCode versions 3.x, unlike 2.x versions (which had json as default).
The following command performs a scan on the samples directory, and publishes the results in json format:
./scancode -clpieu --json output.json samples
Note
The default json format prints the whole report without line breaks, spaces, or indentation, which can be hard to read.
The entire JSON file is structured in the following manner:
At first some general information on the scan, what options were used, the number of files etc. And then all the files follow.
{ "headers": [ { "tool_name": "scancode-toolkit", "tool_version": "3.1.1", "options": { "input": [ "samples/" ], "--copyright": true, "--email": true, "--info": true, "--json-pp": "output.json", "--license": true, "--package": true, "--url": true }, "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.", "start_timestamp": "2019-10-19T191117.292858", "end_timestamp": "2019-10-19T191219.743133", "message": null, "errors": [], "extra_data": { "files_count": 36 } } ], "files": [ { "path": "samples", "type": "directory", ... ... ... "scan_errors": [] }, { "path": "samples/README", "type": "file", "name": "README", "base_name": "README", "extension": "", "size": 236, "date": "2019-02-12", "sha1": "2e07e32c52d607204fad196052d70e3d18fb8636", "md5": "effc6856ef85a9250fb1a470792b3f38", "mime_type": "text/plain", "file_type": "ASCII text", "programming_language": null, "is_binary": false, "is_text": true, "is_archive": false, "is_media": false, "is_source": false, "is_script": false, "licenses": [], "license_expressions": [], "copyrights": [], "holders": [], "authors": [], "packages": [], "emails": [], "urls": [], "files_count": 0, "dirs_count": 0, "size_count": 0, "scan_errors": [] }, ... ... ... { "path": "samples/zlib/iostream2/zstream_test.cpp", "type": "file", "name": "zstream_test.cpp", "base_name": "zstream_test", "extension": ".cpp", "size": 711, "date": "2019-02-12", ... ... ... "scan_errors": [] } ] }
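Since the report is plain JSON with a "headers" list followed by a "files" list, it can be read with any JSON library. The sketch below uses Python's standard json module on a tiny hand-written report with the same top-level layout; the function name and the sample data are hypothetical.

```python
import json


def summarize_scan(report):
    """Return the tool version, the reported file count, and the paths of type "file"."""
    header = report["headers"][0]
    paths = [f["path"] for f in report["files"] if f.get("type") == "file"]
    return header["tool_version"], header["extra_data"]["files_count"], paths


# Hand-written stand-in for a real report; a real one has many more fields.
raw = """
{"headers": [{"tool_version": "3.1.1", "extra_data": {"files_count": 1}}],
 "files": [{"path": "samples", "type": "directory"},
           {"path": "samples/README", "type": "file"}]}
"""
print(summarize_scan(json.loads(raw)))
```

For a real report, `json.load(open("output.json"))` would replace the inline string.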
--json-pp FILE
¶
json-pp stands for JSON Pretty-Print format. In the previous format, i.e. simple json, the whole output is printed on one line, which isn’t well suited for reading the file itself (or its output on stdout). This option formats the results as JSON with proper spacing and indentation, which makes them easy to read.
The following command performs a scan on the samples directory, and publishes the results in json-pp format:
./scancode -clpieu --json-pp output.json samples
A sample JSON output for an individual file will look like:
{ "path": "samples/zlib/iostream2/zstream.h", "type": "file", "name": "zstream.h", "base_name": "zstream", "extension": ".h", "size": 9283, "date": "2019-02-12", "sha1": "fca4540d490fff36bb90fd801cf9cd8fc695bb17", "md5": "a980b61c1e8be68d5cdb1236ba6b43e7", "mime_type": "text/x-c++", "file_type": "C++ source, ASCII text", "programming_language": "C++", "is_binary": false, "is_text": true, "is_archive": false, "is_media": false, "is_source": true, "is_script": false, "licenses": [ { "key": "mit-old-style", "score": 100.0, "name": "MIT Old Style", "short_name": "MIT Old Style", "category": "Permissive", "is_exception": false, "owner": "MIT", "homepage_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style", "text_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style", "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit-old-style", "spdx_license_key": null, "spdx_url": "", "start_line": 9, "end_line": 15, "matched_rule": { "identifier": "mit-old-style_cmr-no_1.RULE", "license_expression": "mit-old-style", "licenses": [ "mit-old-style" ], "is_license_text": true, "is_license_notice": false, "is_license_reference": false, "is_license_tag": false, "matcher": "2-aho", "rule_length": 71, "matched_length": 71, "match_coverage": 100.0, "rule_relevance": 100 } } ], "license_expressions": [ "mit-old-style" ], "copyrights": [ { "value": "Copyright (c) 1997 Christian Michelsen Research AS Advanced Computing", "start_line": 3, "end_line": 5 } ], "holders": [ { "value": "Christian Michelsen Research AS Advanced Computing", "start_line": 3, "end_line": 5 } ], "authors": [], "packages": [], "emails": [], "urls": [ { "url": "http://www.cmr.no/", "start_line": 7, "end_line": 7 } ], "files_count": 0, "dirs_count": 0, "size_count": 0, "scan_errors": [] },
This is the recommended output option for ScanCode Toolkit.
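The difference between compact and pretty-printed JSON can be reproduced with Python's standard json module. This is a generic illustration; the exact indentation and separators ScanCode uses are an assumption here, and the record is a shortened hypothetical file entry.

```python
import json

# Hypothetical, shortened file entry.
record = {"path": "samples/README", "type": "file", "licenses": []}

compact = json.dumps(record, separators=(",", ":"))  # single line, no extra spaces
pretty = json.dumps(record, indent=2)                # one key per line, indented

print(compact)
print(pretty)
```

Both strings parse back to the same data; only the whitespace differs.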
--json-lines FILE
¶
ScanCode also has a --json-lines format option, where the report for each scanned file is formatted on one line.
The following command performs a scan on the samples directory, and publishes the results in json-lines format:
./scancode -clpieu --json-lines output.json samples
Here is a sample line from a report generated by the jsonlines format:
{"files":[{"path":"samples/zlib/ada","licenses":[],"copyrights":[],"packages":[]}]}
The header information is also formatted on one line (i.e. the first line of the file).
The whole output file looks like:
{"headers":[{"tool_name":"scancode-toolkit","tool_version":"3.1.1","options":{"input":["samples/"],"--copyright":true,"--email":true,"--info":true,"--json-lines":"output.json","--license":true,"--package":true,"--url":true},"notice":"Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.","start_timestamp":"2019-10-19T210920.143831","end_timestamp":"2019-10-19T211052.048182","message":null,"errors":[],"extra_data":{"files_count":36}}]}
{"files":[{"path":"samples" ... "scan_errors":[]}]}
{"files":[{"path":"samples/README", ... "scan_errors":[]}]}
{"files":[{"path":"samples/screenshot.png", ... "scan_errors":[]}]}
{"files":[{"path":"samples/arch", ... "scan_errors":[]}]}
{"files":[{"path":"samples/arch/zlib.tar.gz", ... "scan_errors":[]}]}
{"files":[{"path":"samples/arch/zlib.tar.gz-extract", ... "scan_errors":[]}]}
{"files":[{"path":"samples/arch/zlib.tar.gz-extract/zlib-1.2.8", ... "scan_errors":[]}]}
{"files":[{"path":"samples/arch/zlib.tar.gz-extract/zlib-1.2.8/adler32.c", ... "scan_errors":[]}]}
{"files":[{"path":"samples/arch/zlib.tar.gz-extract/zlib-1.2.8/zlib.h", ... "scan_errors":[]}]}
{"files":[{"path":"samples/arch/zlib.tar.gz-extract/zlib-1.2.8/zutil.h", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/EULA", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/LICENSE", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/licenses", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/licenses/apache-1.1.txt", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/licenses/apache-2.0.txt", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/licenses/bouncycastle.txt", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/licenses/cpl-1.0.txt", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/licenses/lgpl.txt", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src/FixedMembershipToken.java", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src/GuardedBy.java", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src/ImmutableReference.java", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src/RATE_LIMITER.java", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src/RouterStub.java", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src/RouterStubManager.java", ... "scan_errors":[]}]}
{"files":[{"path":"samples/JGroups/src/S3_PING.java", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/adler32.c", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/deflate.c", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/deflate.h", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/zlib.h", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/zutil.c", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/zutil.h", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/ada", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/ada/zlib.ads", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/dotzlib", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/dotzlib/AssemblyInfo.cs", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/dotzlib/ChecksumImpl.cs", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/dotzlib/LICENSE_1_0.txt", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/dotzlib/readme.txt", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/gcc_gvmat64" ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/gcc_gvmat64/gvmat64.S" ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/infback9", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/infback9/infback9.c", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/infback9/infback9.h", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/iostream2", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/iostream2/zstream.h", ... "scan_errors":[]}]}
{"files":[{"path":"samples/zlib/iostream2/zstream_test.cpp", ... "scan_errors":[]}]}
Note
This jsonlines format also omits other file information like type, name, date, extension, sha1 and md5 hashes, programming language etc.
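A JSON Lines report can be consumed one line at a time, which is convenient for large scans. The sketch below assumes the layout described in this section (a first line holding "headers", then one "files" object per line); the function name and the sample text are hypothetical.

```python
import json


def read_json_lines(text):
    """Split a JSON Lines report into (headers, files)."""
    headers, files = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        headers.extend(record.get("headers", []))
        files.extend(record.get("files", []))
    return headers, files


# Hand-written stand-in for a real report.
sample = (
    '{"headers":[{"tool_name":"scancode-toolkit"}]}\n'
    '{"files":[{"path":"samples/README","scan_errors":[]}]}\n'
    '{"files":[{"path":"samples/zlib","scan_errors":[]}]}\n'
)
headers, files = read_json_lines(sample)
print(len(headers), [f["path"] for f in files])
```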
Comparing Different json
Output Formats¶
[Figures: screenshots of the default --json output, the --json-pp output, and the --json-lines output.]
--spdx-rdf FILE
¶
SPDX stands for “Software Package Data Exchange” and is an open standard for communicating software bill of materials information (including components, licenses, copyrights, and security references).
The following command performs a scan on the samples directory, and publishes the results in spdx-rdf format:
./scancode -clpieu --spdx-rdf output.spdx samples
Learn more about SPDX specifications here and in this GitHub repository.
Here the file is structured as a dictionary of named properties and classes using W3C’s RDF Technology.
[Figure: data/output_spdx_rdf1.png]
--spdx-tv FILE
¶
This format is another SPDX variant, with the output file structured as described below.
The following command performs a scan on the samples directory, and publishes the results in spdx-tv format:
./scancode -clpieu --spdx-tv output.spdx samples
An SPDX-TV file starts with:
# Document Information
SPDXVersion: SPDX-2.1
DataLicense: CC0-1.0
DocumentComment: <text>Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. No content created from ScanCode should be considered or used as legal advice. Consult an Attorney for any legal advice. ScanCode is a free software code scanning tool from nexB Inc. and others. Visit https://github.com/nexB/scancode-toolkit/ for support and download.</text>
# Creation Info
Creator: Tool: ScanCode 2.2.1
Created: 2019-09-22T21:55:04Z
After a section titled #Packages, a list follows.
Each file’s information is listed under a #File title, with the following fields:
- FileName
- FileChecksum
- LicenseConcluded
- LicenseInfoInFile
- FileCopyrightText
[Figure: an example #File entry.]
After the files section, there’s a section for licenses under a
#Licences
title, with the following information for each license:
- LicenseID
- LicenseComment
- ExtractedText
[Figure: an example license entry.]
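The Tag/Value layout shown for the SPDX-TV header can be read with a very small parser. The sketch below is a simplification under stated assumptions: it handles only single-line "Tag: value" pairs and skips "#" section titles, so multi-line <text> values would need extra handling. The function name is hypothetical.

```python
def parse_tag_value(lines):
    """Collect single-line "Tag: value" pairs, skipping blanks and "#" titles."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons themselves.
        tag, sep, value = line.partition(":")
        if sep:
            pairs.append((tag.strip(), value.strip()))
    return pairs


header = [
    "# Document Information",
    "SPDXVersion: SPDX-2.1",
    "DataLicense: CC0-1.0",
    "# Creation Info",
    "Creator: Tool: ScanCode 2.2.1",
]
print(parse_tag_value(header))
```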
--html FILE
¶
ScanCode supports formatting the output in a simple html format that you can open with your favorite browser. This allows quick visualization of the detected licenses, copyrights, and other main information in the form of tables.
The following command performs a scan on the samples directory, and publishes the results in HTML format:
./scancode -clpieu --html output.html samples
The generated HTML page has the following tables:
- Copyright and Licenses Information
- File Information
- Package Information
- Licenses (Links to Dejacode/License Homepage)
--html-app FILE
¶
ScanCode also supports formatting the output as an HTML visualization tool, which is more helpful than the standard HTML format.
The following command performs a scan on the samples directory, and publishes the results in html-app format:
./scancode -clpieu --html-app output.html samples
The scanned files are shown in the left sidebar, and the section on the right contains separate tabs for the following:
- License Summary
- Copyright Summary
- Clues
- File Details
- Packages
Note
The HTML app also contains a Search option to easily find what you are looking for.
Warning
The html-app feature has been deprecated; use ScanCode Workbench instead to visualize scan results. See the official repository, and also refer to How to Visualize Scan results.
--csv FILE
¶
ScanCode can publish results in the useful .csv format.
The following command performs a scan on the samples directory, and publishes the results in csv format:
./scancode -lpceiu --csv sample.csv samples
The first line of the csv file contains the headings, and they are:
- Resource,
- type,
- name,
- base_name,
- extension,
- date,
- size,
- sha1,
- md5,
- files_count,
- mime_type,
- file_type,
- programming_language,
- is_binary,
- is_text,
- is_archive,
- is_media,
- is_source,
- is_script,
- scan_errors,
- license__key,
- license__score,
- license__short_name,
- license__category,
- license__owner,
- license__homepage_url,
- license__text_url,
- license__reference_url,
- license__spdx_license_key,
- license__spdx_url,
- matched_rule__identifier,
- matched_rule__license_choice,
- matched_rule__licenses,
- copyright,
- copyright_holder,
- author,
- email,
- start_line,
- end_line,
- url,
- package__type,
- package__name,
- package__version,
- package__primary_language,
- package__summary,
- package__description,
- package__size,
- package__release_date,
- package__homepage_url,
- package__notes,
- package__bug_tracking_url,
- package__vcs_repository,
- package__copyright_top_level
Each subsequent line represents one element, which can be any of the following:
- license
- copyright
- package
- url
So if there are multiple elements in a file, each is given its own entry with the details mentioned earlier.
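Because each CSV row carries one element, rows for the same file usually need to be grouped again when processing the results. The following sketch uses Python's standard csv module and a few of the column names listed above; the function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def findings_by_resource(csv_text):
    """Group non-empty license, copyright, and url cells by their Resource path."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in ("license__key", "copyright", "url"):
            if row.get(column):
                grouped[row["Resource"]].append((column, row[column]))
    return dict(grouped)


# Hand-written rows using a small subset of the real headings.
sample = (
    "Resource,license__key,copyright,url\n"
    "/zlib/zlib.h,zlib,,\n"
    "/zlib/zlib.h,,Copyright (c) Example Author,\n"
    "/README,,,\n"
)
print(findings_by_resource(sample))
```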
Custom Output Format¶
While the built-in output formats are convenient for a variety of use-cases, one may wish to create their own output template, using the following arguments:
--custom-output FILE --custom-template TEMP_FILE
ScanCode makes this very easy, as it uses the popular Jinja2 template engine. Simply pass the path to the custom template to the --custom-template argument, or drop it in a folder in the src/scancode/templates directory.
For example, if I wanted a simple CLI output I would create a template.html with the particular data I wish to see. In this case, I am only interested in the license and copyright data for this particular scan.
## template.html:
[
{% if files.license_copyright %}
{% for location, data in files.license_copyright.items() %}
{% for row in data %}
location:"{{ location }}",
{% if row.what == 'copyright' %}copyright:"{{ row.value|escape }}",{% endif %}
{% endfor %}
{% endfor %}
{% endif %}
]
Note
The file name and extension of the template file do not matter.
Now I can run ScanCode using my newly created template:
$ ./scancode -clpeui --custom-output output.json --custom-template template.html samples
Scanning files...
[####################################] 46
Scanning done.
Now our results are saved in output.json and we can easily view them with head output.json:
[
location:"samples/JGroups/LICENSE",
copyright:"Copyright (c) 1991, 1999 Free Software Foundation, Inc.",
location:"samples/JGroups/LICENSE",
copyright:"copyrighted by the Free Software Foundation",
]
For a more elaborate template, refer to the default template provided with ScanCode, which generates HTML output with the --html output format option.
Documentation on Jinja templates.
Controlling Scancode Output and Filters¶
All “Output Control” Scan Options¶
--strip-root | Strip the root directory segment of all paths. |
--full-root | Report full, absolute paths. |
Note
The options --strip-root and --full-root can’t be used together, i.e. only one of them may be used in a single scan.
Note
The default is to always include the last directory segment of the scanned path such that all paths have a common root directory.
--ignore-author <pattern> | Ignore a file (and all its findings) if an author contains a match to the <pattern> regular expression. |
--ignore-copyright-holder <pattern> | Ignore a file (and all its findings) if a copyright holder contains a match to the <pattern> regular expression. |
Note
Both options, --ignore-author and --ignore-copyright-holder, will ignore a file even if it has other scanned data such as a license or errors.
--only-findings | Only return files or directories with findings for the requested scans. Files and directories without findings are omitted (file information is not treated as findings). |
--strip-root
Vs. --full-root
¶
For a scan of the “samples” folder, this is a comparison between the default, strip-root and full-root options.
An example scan:
./scancode -cplieu --json-pp output.json samples --full-root
These two options change only the “path” attribute of the file information. For this comparison we compare the “path” attributes of the file LICENSE inside the JGroups directory.
The default path:
"path": "samples/JGroups/LICENSE",
For the --full-root option, the path is absolute, i.e. relative to the root of your local filesystem:
"path": "/home/ayansm/Desktop/GSoD/scancode-toolkit-versions/scancode-toolkit-2.2.1/samples/JGroups/LICENSE"
For the --strip-root option, the root directory (here samples) is removed from the path:
"path": "JGroups/LICENSE"
Note
The options --strip-root and --full-root can’t be used together, i.e. only one of them may be used in a single scan.
Note
The default is to always include the last directory segment of the scanned path such that all paths have a common root directory.
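The three path styles can be mimicked with pathlib to make the difference concrete. This is an illustration of the documented behavior, not ScanCode's implementation; the function name, the mode strings, and the base directory used below are hypothetical.

```python
from pathlib import PurePosixPath


def reported_path(path, mode="default", base_dir="/"):
    """Render a default scan path the way each output-control option reports it."""
    p = PurePosixPath(path)
    if mode == "strip-root":
        # Drop the leading root segment (e.g. "samples").
        return str(PurePosixPath(*p.parts[1:]))
    if mode == "full-root":
        # Prefix the absolute location of the scanned directory's parent.
        return str(PurePosixPath(base_dir) / p)
    return str(p)


print(reported_path("samples/JGroups/LICENSE"))
print(reported_path("samples/JGroups/LICENSE", "strip-root"))
print(reported_path("samples/JGroups/LICENSE", "full-root", "/home/user/scancode-toolkit"))
```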
--ignore-author <pattern>
Option¶
In a normal scan, all files inside the directory specified as an input argument are scanned and subsequently included in the scan report. If you want to leave out the files that have a specific author, the --ignore-author option can be used to do so.
This scan ignores all files with authors matching the string “Apache Software Foundation”:
./scancode -cplieu --json-pp output.json samples --ignore-author "Apache Software Foundation"
More information on Glob Pattern Matching.
Note
Both options, --ignore-author and --ignore-copyright-holder, will ignore a file even if it has other scanned data such as a license or errors.
--ignore-copyright-holder <pattern>
Option¶
In a normal scan, all files inside the directory specified as an input argument are scanned and subsequently included in the scan report. If you want to leave out the files that have a specific copyright holder, the --ignore-copyright-holder option can be used to do so.
This scan ignores all files with copyright holders matching the string “Free Software Foundation”:
./scancode -cplieu --json-pp output.json samples --ignore-copyright-holder "Free Software Foundation"
More information on Glob Pattern Matching.
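The effect of ignoring files by copyright holder can be reproduced on saved JSON results. The sketch below assumes file entries with a "holders" list of objects carrying a "value" field, as in the JSON samples in this documentation, and treats the pattern as a regular expression as stated in the options table. The function name and sample data are hypothetical, and this is not ScanCode's own code.

```python
import re


def drop_files_by_holder(files, pattern):
    """Keep only the files where no copyright holder matches the regular expression."""
    regex = re.compile(pattern)
    return [
        f for f in files
        if not any(regex.search(h["value"]) for h in f.get("holders", []))
    ]


# Hypothetical file entries.
sample_files = [
    {"path": "samples/JGroups/LICENSE",
     "holders": [{"value": "Free Software Foundation, Inc."}]},
    {"path": "samples/zlib/iostream2/zstream.h",
     "holders": [{"value": "Christian Michelsen Research AS Advanced Computing"}]},
    {"path": "samples/README", "holders": []},
]
kept = drop_files_by_holder(sample_files, "Free Software Foundation")
print([f["path"] for f in kept])
```

Note that the whole file entry is dropped, including any license data it carries, which matches the behavior described in the note above.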
--only-findings
Plugin¶
This option removes from the scan results the files where nothing significant has been detected, i.e. files that don’t contain any licenses, copyrights, emails or URLs (if requested in the scan options) and aren’t packages.
An example scan:
./scancode -cplieu --json-pp output.json samples --only-findings
Note
This also changes the number of files reported in the results.
For example, scanning the samples directory (distributed by default with scancode-toolkit) without this option reports information for 43 files. With this option enabled, the result shows information for only 31 files.
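The idea behind --only-findings can be approximated on saved JSON results by keeping only the entries with at least one non-empty finding list. The set of keys below is an assumption based on the description above, and the function name and sample data are hypothetical; ScanCode's own rule may consider other fields.

```python
# Assumed finding fields, taken from the description of this option.
FINDING_KEYS = ("licenses", "copyrights", "emails", "urls", "packages")


def only_findings(files):
    """Keep the file entries that have at least one non-empty finding list."""
    return [f for f in files if any(f.get(key) for key in FINDING_KEYS)]


# Hypothetical file entries.
sample_files = [
    {"path": "samples/README", "licenses": [], "copyrights": [], "urls": []},
    {"path": "samples/zlib/zlib.h", "licenses": [{"key": "zlib"}], "copyrights": []},
]
print([f["path"] for f in only_findings(sample_files)])
```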
Pre-Scan Options¶
All “Pre-Scan” Options¶
--ignore <pattern> | Ignore files matching <pattern>. |
--include <pattern> | Include files matching <pattern>. |
--classify | Classify files with flags telling if the file is a legal, or readme or test file, etc. Sub-Options: --license-clarity-score, --summary-key-files |
--facet <facet_pattern> | Mark files matching a pattern as belonging to a facet. Sub-Options: --summary-by-facet |
--ignore
Option¶
In a scan, all files inside the directory specified as an input argument are scanned. If there are some files which you don’t want to scan, the --ignore option can be used to skip them.
A sample usage:
./scancode --ignore "*.java" --json-pp samples.json samples
Here, ScanCode ignores files ending with .java, and continues with other files as usual.
More information on Glob Pattern Matching.
--include
Option¶
In a normal scan, all files inside the directory specified as an input argument are scanned. If you want to run the scan on only some selected files, the --include option can be used to do so.
A sample usage:
./scancode --include "*.java" --json-pp samples.json samples
Here, ScanCode selectively scans files that have names ending with .java, and ignores all other files. This is complementary in behavior to the --ignore option.
More information on Glob Pattern Matching.
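The complementary behavior of --include and --ignore can be illustrated with Python's standard fnmatch module. This is a sketch under stated assumptions: patterns are matched against the file name only, and ScanCode's own matcher may treat paths and directories differently. The function name is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_paths(paths, include=None, ignore=None):
    """Filter paths by matching each file name against glob patterns."""
    selected = []
    for path in paths:
        name = PurePosixPath(path).name
        if include is not None and not fnmatchcase(name, include):
            continue  # not part of the included set
        if ignore is not None and fnmatchcase(name, ignore):
            continue  # explicitly ignored
        selected.append(path)
    return selected


paths = ["samples/JGroups/src/RouterStub.java", "samples/zlib/zlib.h"]
print(select_paths(paths, ignore="*.java"))
print(select_paths(paths, include="*.java"))
```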
--classify
¶
Sub-Option
The options --license-clarity-score and --summary-key-files are sub-options of --classify. --license-clarity-score and --summary-key-files are Post-Scan Options.
This option makes ScanCode further classify scanned files/directories, to determine whether they fall into the following categories:
legal
readme
top-level
manifest
A manifest file in computing is a file containing metadata for a group of accompanying files that are part of a set or coherent unit.
key-file
A key file is a top-level codebase file such as COPYING, README or a package manifest. The --summary-key-files option summarizes the scan results for these files.
That is, the following extra attributes are added to the JSON object of each scanned file:
"is_legal": false,
"is_manifest": false,
"is_readme": true,
"is_top_level": true,
"is_key_file": true,
--facet Option¶
Sub-Option
The option --summary-by-facet is a sub-option of --facet. --summary-by-facet is a Post-Scan Option.
Valid <facet> values are:
- core,
- dev,
- tests,
- docs,
- data,
- examples.
You can use the --facet option in the following manner:
./scancode -clpieu --json-pp sample_facet.json samples --facet dev="*.java" --facet dev="*.c"
This adds the following attribute to the header object:
"--facet": [ "dev=*.java", "dev=*.c" ],
In this example, .java and .c files are marked as belonging to the facet dev.
As a result, each .java file has the following attribute added:
"facets": [ "dev" ],
Note
All other files, which are not dev, are assigned to the facet core.
For each facet, the --facet option precedes the <facet>=<pattern> argument. To specify multiple facets, repeat this whole part, including the --facet option.
For users who want to know What is a Facet?.
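The assignment of files to facets described above can be sketched in a few lines of Python. This is an illustration under assumptions, not ScanCode's implementation: the function names are hypothetical, and matching the pattern against the file name only (rather than the full path) is a simplification.

```python
import fnmatch


def parse_facet_args(facet_args):
    """Turn ["dev=*.java", ...] into [("dev", "*.java"), ...]."""
    return [tuple(arg.split("=", 1)) for arg in facet_args]


def facets_for(path, facet_rules):
    """Return the facets whose pattern matches the file name, else ["core"]."""
    name = path.rsplit("/", 1)[-1]
    matched = []
    for facet, pattern in facet_rules:
        # fnmatchcase keeps the match case-sensitive on every platform.
        if fnmatch.fnmatchcase(name, pattern) and facet not in matched:
            matched.append(facet)
    return matched or ["core"]


# The same rules as in the example command above.
rules = parse_facet_args(["dev=*.java", "dev=*.c"])
print(facets_for("demo/src/Main.java", rules))
print(facets_for("demo/README", rules))
```

Files that match no rule fall back to the core facet, which mirrors the Note above.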
Glob Pattern Matching¶
All the Pre-Scan options use pattern matching, so the basics of Glob Pattern Matching are discussed briefly below.
Glob pattern matching is useful for matching a group of files by using patterns in their names. Using these patterns, files are grouped and treated differently as required.
Here are some rules from the Linux Manual on glob patterns. Refer to it for more detailed information.
A string is a wildcard pattern if it contains one of the characters ‘?’, ‘*’ or ‘[‘. Globbing is the operation that expands a wildcard pattern into the list of pathnames matching the pattern. Matching is defined by:
- A ‘?’ (not between brackets) matches any single character.
- A ‘*’ (not between brackets) matches any string, including the empty string.
- An expression “[…]” where the first character after the leading ‘[‘ is not an ‘!’ matches a single character, namely any of the characters enclosed by the brackets.
- There is one special convention: two characters separated by ‘-‘ denote a range.
- An expression “[!…]” matches a single character, namely any character that is not matched by the expression obtained by removing the first ‘!’ from it.
- A ‘/’ in a pathname cannot be matched by a ‘?’ or ‘*’ wildcard, or by a range like “[.-0]”.
Note that wildcard patterns are not regular expressions, although they are a bit similar.
For more information on Glob pattern matching refer to these resources:
You can also import these Python Libraries to practice UNIX style pattern matching:
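For instance, the standard-library fnmatch module can be used to try out the rules listed above. The file names below are made up. Note one difference from the shell rules: in fnmatch, ‘*’ also matches the ‘/’ separator.

```python
import fnmatch

# Made-up names for practice.
paths = ["Foo.java", "src/Bar.java", "zlib.c", "README", "a1.txt", "ab.txt"]

# '*' matches any string, including the empty string (and '/' in fnmatch).
java_files = fnmatch.filter(paths, "*.java")

# '?' matches any single character.
single_char = fnmatch.filter(paths, "a?.txt")

# '[0-9]' matches one character in the range, '[!0-9]' one character outside it.
digit_range = fnmatch.filter(paths, "a[0-9].txt")
not_digit = fnmatch.filter(paths, "a[!0-9].txt")

print(java_files, single_char, digit_range, not_digit)
```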
What is a Facet?¶
A facet is defined as follows (by ClearlyDefined):
A facet of a component is a subset of the files related to the component. It’s really just a grouping that helps us understand the shape of the project. Each facet is described by a set of glob expressions, essentially wildcard patterns that are matched against file names.
Each facet definition can have zero or more glob expressions. A file can be captured by more than one facet. Any file found but not captured by a defined facet is automatically assigned to the core facet.
core - The files that go into making the release of the component. Note that the core facet is not explicitly defined. Rather, it is made up of whatever is not in any other facet. So, by default, all files are in the core facet unless otherwise specified.
data - The files included in any data distribution of the component.
dev - Files primarily used at development time (e.g., build utilities) and not distributed with the component.
docs - Documentation files. Docs may be included with the executable component or separately or not at all.
examples - Like docs, examples may be included in the main component release or separately.
tests - Test files may include code, data and other artifacts.
Important Links:
Post-Scan Options¶
Post-Scan options activate their respective post-scan plugins which execute the task.
All “Post-Scan” Options¶
--mark-source | Set the “is_source” flag to true for directories that contain over 90% of source files as direct children and descendants. Count the number of source files in a directory as a new “source_file_counts” attribute. Sub-Option of --info. |
--consolidate | Group resources by Packages or license and copyright holder and return those groupings as a list of consolidated packages and a list of consolidated components. Sub-Option of --license, --copyright and --package. |
--filter-clues | Filter redundant duplicated clues already contained in detected licenses, copyright texts and notices. |
--is-license-text | Set the “is_license_text” flag to true for files that contain mostly license texts and notices (e.g. over 90% of the content). Sub-Option of --info and --license-text. |
Warning
--is-license-text is an experimental Option.
--license-clarity-score | Compute a summary license clarity score at the codebase level. Sub-Option of --classify. |
--license-policy FILE | Load a License Policy file and apply it to the scan at the Resource level. |
--summary | Summarize license, copyright and other scans at the codebase level. Sub-Options: --summary-by-facet, --summary-key-files, --summary-with-details |
--summary-by-facet | Summarize license, copyright and other scans and group the results by facet. Sub-Option of --summary and --facet. |
--summary-key-files | Summarize license, copyright and other scans for key, top-level files. Key files are top-level codebase files such as COPYING, README and package manifests as reported by the --classify option. Sub-Option of --classify and --summary. |
--summary-with-details | Summarize license, copyright and other scans at the codebase level, keeping intermediate details at the file and directory level. |
To see all plugins available via command line help, use --plugins.
Note
Plugins that are shown by using --plugins include the following:
- Post-Scan Plugins (and, the following)
- Pre-Scan Plugins
- Output Options
- Output Control
- Basic Scan Options
--mark-source Option¶
Dependency
The option --mark-source is a sub-option of and requires the option --info.
The --mark-source option sets the “is_source” attribute of a directory to “true” if more than 90% of the files under that directory are source files, i.e. their “is_source” attribute is “true”.
When the following command is executed to scan the samples directory with this option enabled:
./scancode -clpieu --json-pp output.json samples --mark-source
the following directories are marked as “Source”, i.e. their “is_source” attribute is changed from “false” to “true”:
samples/JGroups/src
samples/zlib/iostream2
samples/zlib/gcc_gvmat64
samples/zlib/ada
samples/zlib/infback9
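The 90% rule can be written down as a small predicate. This is a sketch with a hypothetical function name; treating an empty directory as "not source" is an assumption, and ScanCode's own counting of direct children and descendants is not reproduced here.

```python
def is_source_directory(files_count, source_files_count, threshold=0.9):
    """Return True if more than `threshold` of the files are source files."""
    if files_count == 0:
        # Assumption: a directory without files is not marked as source.
        return False
    return source_files_count / files_count > threshold


print(is_source_directory(20, 19))  # 95% of the files are source files
print(is_source_directory(10, 9))   # exactly 90% is not "more than 90%"
```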
--consolidate Option¶
Dependency
The option --consolidate is a sub-option of and requires the options --license, --copyright and --package.
An example Scan:
./scancode -clpieu --json-pp output.json samples --consolidate
The JSON file containing the scan results after using the --consolidate plugin is structured as follows (note: “…” stands for more data):
{
  "headers": [ {...} ],
  "consolidated_components": [
    {... },
    {
      "type": "license-holders",
      "identifier": "dmitriy_anisimkov_1",
      "consolidated_license_expression": "gpl-2.0-plus WITH ada-linking-exception",
      "consolidated_holders": [ "Dmitriy Anisimkov" ],
      "consolidated_copyright": "Copyright (c) Dmitriy Anisimkov",
      "core_license_expression": "gpl-2.0-plus WITH ada-linking-exception",
      "core_holders": [ "Dmitriy Anisimkov" ],
      "other_license_expression": null,
      "other_holders": [],
      "files_count": 1
    },
    {... }
  ],
  "consolidated_packages": [],
  "files": [ ]
}
Each consolidated component has the following information:
"consolidated_components": [
  {
    "type": "license-holders",
    "identifier": "dmitriy_anisimkov_1",
    "consolidated_license_expression": "gpl-2.0-plus WITH ada-linking-exception",
    "consolidated_holders": [ "Dmitriy Anisimkov" ],
    "consolidated_copyright": "Copyright (c) Dmitriy Anisimkov",
    "core_license_expression": "gpl-2.0-plus WITH ada-linking-exception",
    "core_holders": [ "Dmitriy Anisimkov" ],
    "other_license_expression": null,
    "other_holders": [],
    "files_count": 1
  },
In addition to this, in every file/directory where the consolidated part (i.e. license information) was present, a “consolidated_to” attribute is added pointing to the “identifier” of the corresponding entry in “consolidated_components”:
"consolidated_to": [ "dmitriy_anisimkov_1" ],
Note that multiple files may have the same “consolidated_to” attribute.
--filter-clues Option¶
The --filter-clues plugin filters redundant duplicated clues already contained in detected licenses, copyright texts and notices.
--is-license-text Option¶
Dependency
The option --is-license-text is a sub-option of and requires the options --info and --license-text. Also, the option --license-text is a sub-option of and requires the option --license.
If --is-license-text is used, the “is_license_text” flag is set to true for files that contain mostly license texts and notices. Here, mostly means over 90% of the content of the file.
An example Scan:
./scancode -clpieu --json-pp output.json samples --license-text --is-license-text
If the samples directory is scanned with this plugin, the files containing mostly license texts will have the following attribute set to ‘true’:
"is_license_text": true,
The files in samples that will have “is_license_text” set to true are:
samples/JGroups/EULA
samples/JGroups/LICENSE
samples/JGroups/licenses/apache-1.1.txt
samples/JGroups/licenses/apache-2.0.txt
samples/JGroups/licenses/bouncycastle.txt
samples/JGroups/licenses/cpl-1.0.txt
samples/JGroups/licenses/lgpl.txt
samples/zlib/dotzlib/LICENSE_1_0.txt
Note that the license objects for each detected license in the files already have “is_license_text” attributes by default, but the file objects do not. File objects only have this attribute if the plugin is used.
Warning
--is-license-text is an experimental Option.
--license-clarity-score Option¶
Dependency
The option --license-clarity-score is a sub-option of and requires the option --classify.
The --license-clarity-score plugin, when used in a scan, computes a summary license clarity score at the codebase level.
An example Scan:
./scancode -clpieu --json-pp output.json samples --classify --license-clarity-score
The “license_clarity_score” will have the following attributes:
- “score”
- “declared”
- “discovered”
- “consistency”
- “spdx”
- “license_texts”
The whole JSON file is structured as follows when it has “license_clarity_score”:
{
  "headers": [ { ... } ],
  "license_clarity_score": {
    "score": 17,
    "declared": false,
    "discovered": 0.69,
    "consistency": false,
    "spdx": false,
    "license_texts": false
  },
  "files": [ ... ]
}
--license-policy FILE Option¶
The Policy file is a YAML (.yml) document with the following structure:
license_policies:
- license_key: mit
  label: Approved License
  color_code: '#008000'
  icon: icon-ok-circle
- license_key: agpl-3.0
  label: Approved License
  color_code: '#008000'
  icon: icon-ok-circle
Note
In the policy file only the “license_key” is a required field.
Applying License Policies during a ScanCode scan, using the --license-policy plugin:
./scancode -clipeu --json-pp output.json samples --license-policy policy-file.yml
Note
--license-policy FILE is not a sub-option of --license. It works normally without -l.
This adds to every file/directory an object “license_policy”, whose attributes are the fields specified in the YAML file. According to our example YAML file, the attributes will be:
- “license_key”
- “label”
- “color_code”
- “icon”
Here the samples directory is scanned, and the scan results for a sample file are as follows:
{
  "path": "samples/JGroups/licenses/apache-2.0.txt",
  ...
  ...
  ...
  "licenses": [
    ...
    ...
    ...
  ],
  "license_expressions": [ "apache-2.0" ],
  "copyrights": [],
  "holders": [],
  "authors": [],
  "packages": [],
  "emails": [],
  "license_policy": {
    "license_key": "apache-2.0",
    "label": "Approved License",
    "color_code": "#008000",
    "icon": "icon-ok-circle"
  },
  "urls": [],
  "files_count": 0,
  "dirs_count": 0,
  "size_count": 0,
  "scan_errors": []
},
More information on the License Policy Plugin and usage.
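The matching between detected licenses and policy entries can be pictured as a lookup by license key. The sketch below is an illustration under assumptions, not the plugin's code: the function name is hypothetical, the policy entries are passed as already-parsed dictionaries, and returning the first matching policy (or an empty object when nothing matches) is a simplification.

```python
def apply_policy(resource, policies):
    """Return a copy of `resource` with a "license_policy" attribute added."""
    by_key = {policy["license_key"]: policy for policy in policies}
    for detected in resource.get("licenses", []):
        policy = by_key.get(detected.get("key"))
        if policy:
            # Assumption: the first detected license with a policy wins.
            return dict(resource, license_policy=policy)
    return dict(resource, license_policy={})


# Policy entries as they would look after parsing the YAML file.
policies = [
    {"license_key": "mit", "label": "Approved License"},
    {"license_key": "broadcom-commercial", "label": "Restricted License"},
]

# Made-up resources for illustration only.
annotated = apply_policy({"path": "demo/a.c", "licenses": [{"key": "mit"}]}, policies)
unmatched = apply_policy({"path": "demo/b.c", "licenses": [{"key": "zlib"}]}, policies)
print(annotated["license_policy"], unmatched["license_policy"])
```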
--summary Option¶
Sub-Option
The options --summary-by-facet, --summary-key-files and --summary-with-details are sub-options of --summary. These Sub-Options are all Post-Scan Options.
An example Scan:
./scancode -clpieu --json-pp output.json samples --summary
The whole JSON file is structured as follows when the --summary plugin is applied:
{
  "headers": [ { ... } ],
  "summary": {
    "license_expressions": [ ... ],
    "copyrights": [ ... ],
    "holders": [ ... ],
    "authors": [ ... ],
    "programming_language": [ ... ],
    "packages": []
  },
  "files": [ ... ]
}
The Summary object has the following attributes.
- “license_expressions”
- “copyrights”
- “holders”
- “authors”
- “programming_language”
- “packages”
Each attribute has multiple entries, each containing a “value” and a “count” that together hold the summary information.
A sample summary object generated:
"summary": {
  "license_expressions": [
    { "value": "zlib", "count": 13 }
  ],
  "copyrights": [
    { "value": "Copyright (c) Mark Adler", "count": 4 },
    { "value": "Copyright (c) Free Software Foundation, Inc.", "count": 2 },
    { "value": "Copyright (c) The Apache Software Foundation", "count": 1 },
    { "value": "Copyright Red Hat, Inc. and individual contributors", "count": 1 }
  ],
  "holders": [
    { "value": null, "count": 10 },
    { "value": "Mark Adler", "count": 4 },
    { "value": "Red Hat, Inc. and individual contributors", "count": 1 },
    { "value": "The Apache Software Foundation", "count": 1 }
  ],
  "authors": [
    { "value": "Bela Ban", "count": 4 },
    { "value": "Brian Stansberry", "count": 1 },
    { "value": "the Apache Software Foundation (http://www.apache.org/)", "count": 1 }
  ],
  "programming_language": [
    { "value": "C++", "count": 13 },
    { "value": "Java", "count": 7 }
  ],
  "packages": []
}
--summary-by-facet Option¶
Dependency
The option --summary-by-facet is a sub-option of and requires the options --facet and --summary.
Running the scan with the --summary --summary-by-facet plugins creates individual summaries for all the facets, with the same license, copyright and other scan information, at a codebase level (in addition to the general codebase level summary generated by the --summary plugin).
An example scan using the --summary-by-facet plugin:
./scancode -clieu --json-pp output.json samples --summary --facet dev="*.java" --facet dev="*.c" --summary-by-facet
Note
All other files, which are not dev, are assigned to the facet core.
Warning
Running the same scan with ./scancode -clpieu, i.e. with -p, generates an error. Avoid this.
The JSON file containing scan results is structured as follows:
{
  "headers": [ ... ],
  "summary": { ... },
  "summary_by_facet": [
    { "facet": "core", "summary": { ... } },
    { "facet": "dev", "summary": { ... } },
    { "facet": "tests", "summary": { ... } },
    { "facet": "docs", "summary": { ... } },
    { "facet": "data", "summary": { ... } },
    { "facet": "examples", "summary": { ... } }
  ],
  "files": [ ... ]
}
A sample “summary_by_facet” object generated by the previous scan (shortened):
"summary_by_facet": [
  {
    "facet": "core",
    "summary": {
      "license_expressions": [ { "value": "mit", "count": 1 } ],
      "copyrights": [ { "value": "Copyright (c) Free Software Foundation, Inc.", "count": 2 } ],
      "holders": [ { "value": "The Apache Software Foundation", "count": 1 } ],
      "authors": [ { "value": "Gilles Vollant", "count": 1 } ],
      "programming_language": [ { "value": "C++", "count": 8 } ]
    }
  },
  {
    "facet": "dev",
    "summary": {
      "license_expressions": [ { "value": "zlib", "count": 5 } ],
      "copyrights": [ { "value": "Copyright Red Hat Middleware LLC, and individual contributors", "count": 1 } ],
      "holders": [ { "value": "Mark Adler", "count": 3 } ],
      "authors": [ { "value": "Brian Stansberry", "count": 1 } ],
      "programming_language": [
        { "value": "Java", "count": 7 },
        { "value": "C++", "count": 5 }
      ]
    }
  }
],
Note
Summaries for all the facets are generated by default, even for facets that have no files under them.
For users who want to know What is a Facet?.
--summary-key-files Option¶
Dependency
The option --summary-key-files is a sub-option of and requires the options --classify and --summary.
An example Scan:
./scancode -clpieu --json-pp output.json samples --classify --summary --summary-key-files
Running the scan with the --summary --summary-key-files plugins creates summaries for key files, with the same license, copyright and other scan information, at a codebase level (in addition to the general codebase level summary generated by the --summary plugin).
The resulting JSON file containing the scan results is structured as follows:
{
  "headers": [ ... ],
  "summary": {
    "license_expressions": [ ... ],
    "copyrights": [ ... ],
    "holders": [ ... ],
    "authors": [ ... ],
    "programming_language": [ ... ],
    "packages": []
  },
  "summary_of_key_files": {
    "license_expressions": [ { "value": null, "count": 1 } ],
    "copyrights": [ { "value": null, "count": 1 } ],
    "holders": [ { "value": null, "count": 1 } ],
    "authors": [ { "value": null, "count": 1 } ],
    "programming_language": [ { "value": null, "count": 1 } ]
  },
  "files": [ ... ]
}
The following flags are also present for each file/directory (generated by --classify):
- “is_legal”
- “is_manifest”
- “is_readme”
- “is_top_level”
- “is_key_file”
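Once these flags are present, the key files can be picked out of a loaded scan with a simple filter. This is a sketch with a hypothetical helper name and made-up data. It relies only on the “is_key_file” flag listed above.

```python
def key_file_paths(scan):
    """Return the paths of all resources flagged as key files."""
    return [res["path"] for res in scan.get("files", []) if res.get("is_key_file")]


# Made-up data for illustration only.
demo_scan = {
    "files": [
        {"path": "demo/README", "is_readme": True, "is_top_level": True, "is_key_file": True},
        {"path": "demo/src/main.c", "is_readme": False, "is_top_level": False, "is_key_file": False},
    ]
}
key_paths = key_file_paths(demo_scan)
print(key_paths)
```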
--summary-with-details Option¶
The --summary plugin summarizes license, copyright and other scan information at the codebase level. Running the scan with the --summary-with-details plugin instead also creates summaries with the same license, copyright and other scan information at the individual file/directory level (in addition to the codebase level summary).
An example Scan:
./scancode -clpieu --json-pp output.json samples --summary-with-details
Note
--summary is redundant in a scan when --summary-with-details is already selected.
A sample file object in the scan results (a directory level summary of samples/arch) is structured as follows:
{
  "path": "samples/arch",
  "type": "directory",
  "name": "arch",
  "base_name": "arch",
  "extension": "",
  "size": 0,
  "date": null,
  "sha1": null,
  "md5": null,
  "mime_type": null,
  "file_type": null,
  "programming_language": null,
  "is_binary": false,
  "is_text": false,
  "is_archive": false,
  "is_media": false,
  "is_source": false,
  "is_script": false,
  "licenses": [],
  "license_expressions": [],
  "copyrights": [],
  "holders": [],
  "authors": [],
  "packages": [],
  "emails": [],
  "urls": [],
  "is_legal": false,
  "is_manifest": false,
  "is_readme": false,
  "is_top_level": true,
  "is_key_file": false,
  "summary": {
    "license_expressions": [
      { "value": "zlib", "count": 3 },
      { "value": null, "count": 1 }
    ],
    "copyrights": [
      { "value": null, "count": 1 },
      { "value": "Copyright (c) Jean-loup Gailly", "count": 1 },
      { "value": "Copyright (c) Jean-loup Gailly and Mark Adler", "count": 1 },
      { "value": "Copyright (c) Mark Adler", "count": 1 }
    ],
    "holders": [
      { "value": null, "count": 1 },
      { "value": "Jean-loup Gailly", "count": 1 },
      { "value": "Jean-loup Gailly and Mark Adler", "count": 1 },
      { "value": "Mark Adler", "count": 1 }
    ],
    "authors": [ { "value": null, "count": 4 } ],
    "programming_language": [
      { "value": "C++", "count": 3 },
      { "value": null, "count": 1 }
    ]
  },
  "files_count": 4,
  "dirs_count": 2,
  "size_count": 127720,
  "scan_errors": []
},
The following flags are also present for each file/directory (generated by --classify):
- “is_legal”
- “is_manifest”
- “is_readme”
- “is_top_level”
- “is_key_file”
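To get a feel for how a directory level summary with “value” and “count” entries could be derived from the files below a directory, consider the following sketch. It is a simplified tally under assumptions (hypothetical function name, made-up data, files without a value counted under null), not the algorithm ScanCode uses.

```python
from collections import Counter


def directory_summary(files, directory, attribute):
    """Tally the values of `attribute` for all files below `directory`."""
    prefix = directory.rstrip("/") + "/"
    counter = Counter()
    for res in files:
        if res.get("type") != "file" or not res["path"].startswith(prefix):
            continue
        # Assumption: a file without any value is counted under None (null).
        counter.update(res.get(attribute) or [None])
    return [{"value": value, "count": count} for value, count in counter.most_common()]


# Made-up data for illustration only.
demo_files = [
    {"path": "demo/arch/a.tar.gz", "type": "file", "license_expressions": []},
    {"path": "demo/arch/z/a.c", "type": "file", "license_expressions": ["zlib"]},
    {"path": "demo/arch/z/b.c", "type": "file", "license_expressions": ["zlib"]},
    {"path": "demo/README", "type": "file", "license_expressions": []},
]
arch_summary = directory_summary(demo_files, "demo/arch", "license_expressions")
print(arch_summary)
```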
Plugins¶
Plugin Architecture¶
Notes: this is the initial design for ScanCode plugins. The actual architecture evolved and is different.
Abstract:¶
This project’s purpose is to create a decoupled plugin architecture for scancode such that it can handle plugins at different stages of a scan and can be coupled at runtime. These stages would be
- Pre - scan: Before starting the scan
E.g Plugins to handle extraction of different archive types or instructions on how to handle certain types of files.
- Scan proper: During the scan
E.g Plugins to add more options for the scan, maybe to ignore certain files or add some command line arguments, create new scans (alternative or as a dependency for further scanning) etc.
- Post - scan: After the scan
E.g Plugins for output deduction, formatting or converting output to other formats (such as json, spdx, csv, xml, etc.)
The upside of building a pluggable system is that it allows easier additions and rare modifications to the code, without having to fiddle with the core codebase. It also provides a level of abstraction between the plugins and scancode, so that an erroneous plugin would not affect the functioning of scancode as a whole.
Description:¶
This project aims at making scancode a “pluggable” system, where new functionalities can be added to scancode at runtime as “plugins”. These plugins can be hooked into scancode using some predefined hooks. I would consider pluggy as the way to go for a plugin management system.
Pluggy is well documented and maintained regularly, and has proved its worth in projects such as py.test. Pluggy relies on hook specifications and hook implementations (callbacks) instead of the conventional subclassing approach which may encourage tight-coupling in the overlying framework. Basically a hook specification contains method signatures (no code), these are defined by the application. A hook implementation contains definitions for methods declared in the corresponding hook specification implemented by a plugin.
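The split between hook specifications and hook implementations can be illustrated without any dependency. The following stdlib-only sketch is not pluggy's API and not ScanCode code. The class and method names are hypothetical and only show the idea: the application declares hook names, plugins provide matching methods, and the application calls every implementation that was registered.

```python
class HookRegistry:
    """Tiny stand-in for a plugin manager (illustration only, not pluggy)."""

    def __init__(self, spec_names):
        # The "specification": the hook names the application knows about.
        self._impls = {name: [] for name in spec_names}

    def register(self, plugin):
        """Collect the methods of `plugin` that match a known hook name."""
        for name in self._impls:
            impl = getattr(plugin, name, None)
            if callable(impl):
                self._impls[name].append(impl)

    def call(self, name, **kwargs):
        """Call every implementation of the hook and return all results."""
        return [impl(**kwargs) for impl in self._impls[name]]


class ZipExtractor:
    """Hypothetical plugin implementing only the extract_archive hook."""

    def extract_archive(self, path):
        return "zip:" + path if path.endswith(".zip") else None


registry = HookRegistry(["extract_archive", "format_output"])
registry.register(ZipExtractor())
results = registry.call("extract_archive", path="demo/a.zip")
print(results)
```

The plugin never subclasses anything from the application, which is the loose coupling the paragraph above refers to.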
As mentioned in the abstract, the plugin architecture will have 3 hook specifications (can be increased if required)
- Structure -
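from pluggy import HookspecMarker
prescan_hookspec = HookspecMarker('prescan')
@prescan_hookspec
def extract_archive(args):
    """Extract the archive at the given path before the scan starts."""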
Here the path of the archive to be extracted will be passed as an argument to the extract_archive function which will be called before scan, at the time of extraction. This will process the archive type and extract the contents accordingly. This functionality can be further extended by calling this function if any archive is found inside the scanning tree.
- Structure
scanproper_hookspec = HookspecMarker('scanproper')
@scanproper_hookspec
def add_cmdline_option(args):
    """Return a dict describing an extra command line option."""
This function will be called before starting the scan, without any arguments. It will return a dict containing the click extension details and possibly some help text. If the user supplies this option, the call will be rerouted to the callback defined by the click extension. For instance, if a plugin implements functionality to add regex as a valid ignore pattern, this function will return a dict such as:
{
    'name': '--ignore-regex',
    'options': {
        'default': None,
        'multiple': True,
        'metavar': '<pattern>'
    },
    'help': 'Ignore files matching regex <pattern>',
    'call_after': 'is_ignored'
}
According to the above dict, if the option --ignore-regex is supplied, this function will be called after the is_ignored function, and the data returned by the is_ignored function will be supplied to this function as its argument(s). So if the program flow was:
scancode() ⇔ scan() ⇔ resource_paths() ⇔ is_ignored()
It will now be edited to
scancode() ⇔ scan() ⇔ resource_paths() ⇔ is_ignored() ⇔ add_cmdline_option()
Options such as call_after, call_before, call_first, call_last can be defined to determine when the function is to be executed.
@scanproper_hookspec
def dependency_scan(args):
    """Return file types or attributes that trigger extra processing."""
This function will be called before starting the scan without any arguments, it will return a list of file types or attributes which if encountered in the scanned tree, will call this function with the path to the file as an argument. This function can do some extra processing on those files and return the data to be processed as a dependency for the normal scanning process. E.g. It can return a list such as:
[ 'debian/copyright' ]
Whenever a file matches this pattern, this function will be called and the data returned will be supplied to the main scancode function.
- Structure -
postscan_hookspec = HookspecMarker('postscan')
@postscan_hookspec
def format_output(args):
    """Convert the scan data and store it at the given output path."""
This function will be called after a scan is finished. It will be supplied with path to the ABC data generated from the scan, path to the root of the scanned code and a path where the output is expected to be stored. The function will store the processed data in the output path supplied. This can be used to convert output to other formats such as CSV, SPDX, JSON, etc.
@postscan_hookspec
def summarize_output(args):
    """Summarize the scan data before it is reported to the user."""
This function will be called after a scan is finished. It will be supplied the data to be reported to the user as well as a path to the root of the scanned node. The data returned can then be reported to the user. This can be used to summarize output, maybe encapsulate the data to be reported or omit similar file metadata or even classify files such as tests, code proper, licenses, readme, configs, build scripts etc.
- Identifying or configuring plugins
For python plugins, pluggy supports loading modules from setuptools entrypoints, E.g.
entry_points = {
'scancode_plugins': [
'name_of_plugin = ignore_regex',
]
}
This plugin can be loaded using the PluginManager class’s load_setuptools_entrypoints('scancode_plugins') method which will return a list of loaded plugins.
For non python plugins, all such plugins will be stored in a common directory and each of these plugins will have a manifest configuration in YAML format. This directory will be scanned at startup for plugins. After parsing the config file of a plugin, the data will be supplied to the plugin manager as if it were supplied using setuptools entrypoints.
In case of non python plugins, the plugin executables will be spawned in their own processes and according to their config data, they will be passed arguments and would return data as necessary. In addition to this, the desired hook function can be called from a non python plugin using certain arguments, which again can be mapped in the config file.
A sample config file for an ignore_regex plugin calling the scanproper hook would be:
name: ignore_regex
hook: scanproper
hookfunctions:
  add_cmdline_option: '-aco'
  dependency_scan: '-dc'
data:
  add_cmdline_option:
    - name: '--ignore-regex'
    - options:
        - default: None
        - multiple: True
        - metavar: '<pattern>'
    - help: 'Ignore files matching regex <pattern>'
    - call_after: 'is_ignored'
Existing solutions:¶
An alternate solution to a “pluggable” system would be the more conventional approach of adding functionalities directly to the core codebase, which removes the abstraction layer provided by a plugin management and hook calling system.
License Policy Plugin¶
This plugin allows the user to apply policy details to a scancode scan, depending on which licenses are detected in a particular file. If a license specified in the Policy file is detected by scancode, this plugin will apply that policy information to the Resource as a new attribute: license_policy.
Policy File Specification¶
The Policy file is a YAML (.yml) document with the following structure:
license_policies:
- license_key: mit
  label: Approved License
  color_code: '#008000'
  icon: icon-ok-circle
- license_key: agpl-3.0
  label: Approved License
  color_code: '#008000'
  icon: icon-ok-circle
- license_key: broadcom-commercial
  label: Restricted License
  color_code: '#FFcc33'
  icon: icon-warning-sign
The only required key is license_key, which represents the ScanCode license key to match against the detected licenses in the scan results.
In the above example, a descriptive label is added along with a color code and CSS id name for potential visual display.
Using the Plugin¶
To apply License Policies during a ScanCode scan, specify the --license-policy option.
For example, use the following command to run a File Info and License scan on /path/to/codebase/, using a License Policy file found at ~/path/to/policy-file.yml:
$ scancode -clipeu /path/to/codebase/ --license-policy ~/path/to/policy-file.yml --json-pp ~/path/to/scan-output.json
Example Output¶
Here is an example of the ScanCode output after running --license-policy:
{
  "path": "samples/zlib/deflate.c",
  "type": "file",
  "licenses": [
    {
      "key": "zlib",
      ...
      ...
      ...
    }
  ],
  "license_policy": {
    "license_key": "zlib",
    "label": "Approved License",
    "color_code": "#00800",
    "icon": "icon-ok-circle"
  },
  "scan_errors": []
}
Plugin Tutorials¶
Basic Tutorials¶
How to Run a Scan¶
In this simple tutorial example, we perform a basic scan on the samples directory distributed by default with Scancode.
Warning
This tutorial is for Linux based systems presently. Additional Help for Windows/MacOS will be added.
Setting up a Virtual Environment¶
Scancode Toolkit 3.1.1 and Workbench 3.1.0 are not compatible with python 3.x, so we will create a virtual environment using the Virtualenv tool with a python 2.7 interpreter.
The following commands set up and activate the Virtual Environment venv-scan3.1.1:
virtualenv -p /usr/bin/python2.7 venv-scan3.1.1
source venv-scan3.1.1/bin/activate
Setting up Scancode Toolkit¶
Get the Scancode Toolkit Version 3.1.1 tarball or .zip archive from the Toolkit GitHub Release Page under assets options. Download and extract the Archive from command line:
For .zip archive:
unzip scancode-toolkit-3.1.1.zip
For .tar.bz2 archive:
tar -xvf scancode-toolkit-3.1.1.tar.bz2
Or Right Click and select “Extract Here”.
Check whether the Prerequisites are installed. Open a terminal in the extracted directory and run:
./scancode --help
This will configure ScanCode and display the command line Help text.
Looking into Files¶
As mentioned previously, we are going to perform the scan on the samples directory distributed by default with Scancode Toolkit. Here’s the directory structure and respective files:
We notice here that the sample files contain a package zlib.tar.gz. So we have to extract the archive before running the scan, to also scan the files inside this package.
Performing Extraction¶
To extract the packages inside the samples directory:
./extractcode samples
This extracts the zlib.tar.gz package:
Note
The --shallow option can be used to prevent recursive extraction of nested archives.
Deciding Scan Options¶
These are some common scan options you should consider using before you start the actual scan, according to your requirements.
- The Basic Scan options, i.e. -c, -l, -p, -e, -u, and -i, are to be decided according to your requirements. If you do not need one specific type of information (say, licenses), consider removing it, because the more things you scan for, the longer the scan takes to complete.
Note
You have to select these options explicitly, as they are no longer enabled by default from versions 3.x onwards, unlike earlier versions which had -clp as default.
- --license-score INTEGER is to be set if license matching accuracy is desired (the default is 0, and increasing this means a more accurate match). Also, using --license-text includes the matched text in the result.
- -n INTEGER can be used to speed up the scan using multiple parallel processes.
- --timeout FLOAT can be used to skip a file that takes a long time to scan.
- --ignore <pattern> can be used to skip certain groups of files.
- <OUTPUT FORMAT OPTION(s)> is also a very important decision when you want to use the output for specific tasks or have specific requirements. Here we are using json, as ScanCode Workbench imports json files only.
For the complete list of options, refer to All Available Options.
Running The Scan¶
Now, run the scan with the options decided:
./scancode -clpeui -n 2 --ignore "*.java" --json-pp sample.json samples
A Progress report is shown:
Setup plugins...
Collect file inventory...
Scan files for: info, licenses, copyrights, packages, emails, urls with 2 process(es)...
[####################] 29
Scanning done.
Summary: info, licenses, copyrights, packages, emails, urls with 2 process(es)
Errors count: 0
Scan Speed: 1.09 files/sec. 40.67 KB/sec.
Initial counts: 49 resource(s): 36 file(s) and 13 directorie(s)
Final counts: 42 resource(s): 29 file(s) and 13 directorie(s) for 1.06 MB
Timings:
scan_start: 2019-09-24T203514.573671
scan_end: 2019-09-24T203545.649805
setup_scan:licenses: 4.30s
setup: 4.30s
scan: 26.62s
total: 31.14s
Removing temporary files...done.
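The scan command used above can also be assembled from a script, which helps when the same options are reused across several codebases. The following is only a sketch: the helper name build_scan_command is hypothetical, and it merely builds the argument list for the command line shown in this tutorial.

```python
import shlex


def build_scan_command(input_path, output_path, processes=2, ignore=("*.java",)):
    """Return the argument list for the tutorial scan command.

    Hypothetical helper: it only mirrors the command line shown above.
    """
    command = ["./scancode", "-clpeui", "-n", str(processes)]
    for pattern in ignore:
        # One --ignore option per pattern to skip.
        command += ["--ignore", pattern]
    command += ["--json-pp", output_path, input_path]
    return command


command = build_scan_command("samples", "sample.json")
# Print a shell-safe rendering of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in command))
```

To actually run the scan from a ScanCode checkout, the list could be passed to subprocess.run(command, check=True).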
How to Visualize Scan results¶
In this simple tutorial example, we import results from a basic scan performed on the samples
directory distributed by default with Scancode, and visualize the outputs through
Scancode Workbench.
Warning
This tutorial uses the 3.1.1 version of Scancode Toolkit, and Scancode Workbench 3.1.0 (this beta version of ScanCode Workbench is compatible with scans from any ScanCode Toolkit develop version/branch at or after v3.0.2). If you are using an older version of Scancode Toolkit, check the respective versions of this documentation. Also refer to the Scancode Workbench release highlights.
Warning
This tutorial is for Linux based systems presently. Additional Help for Windows/MacOS will be added.
Setting up Scancode Workbench¶
According to the Install workbench_requirements, we have to install Node.js 6.x or later. Refer to Node.js install instructions here.
You can also run the following commands:
sudo apt-get install -y nodejs
sudo npm install npm@5.2.0 -g
After Node.js and npm are installed, get the Scancode Workbench 3.1.0 tarball from the Workbench Release Page. Extract the package and then launch Scancode Workbench:
./ScanCode-Workbench
This opens the Workbench.
Note
You can also build Scancode Toolkit and Scancode Workbench from source. Clone the repository, check out the specific release using git checkout <release>, and follow the build instructions. You’ll also have to create a Python 2.7 Virtual Environment, or use the same venv-3.1.1 created here at How to Run a Scan.
Importing Data into Scancode Workbench¶
- Click on File -> Import JSON File or press Ctrl + I.
- Select the file from the pop-up window.
- Select a Name and Location (where you want it later) for the .sqlite output file.
Note
You can also import a .sqlite file you’ve saved in the past to load scan results. As it is much faster, once you’ve imported the JSON file and a corresponding SQLite file has been created, you shouldn’t repeat this. Instead, import the SQLite file next time you want to visualize the same scan result.
Visualization¶
Refer to workbench_views for more information on Visualization.
The dashboard has a general overview.

There are 3 principal views (They appear in the same order in the GIFs):
- Chart Summary View,
- Table View,
- Components Summary View.

You can also click any file/directory on the file list located on the right, to filter the results such that it only contains results from that File/Directory.

Refer to workbench_components for more information on Components.
In the table view,
- Apply filters by selecting Files/Directories
- Right Click on the Left Panel
- Select
Edit Component
- A pop-up opens with fields, make necessary edits and Save.
- Go to the Component Summary View to see the Component.

How To Extract Archives¶
ScanCode Toolkit provides archive extraction. This command can be used before running a scan over
a codebase in order to ensure all archives are extracted. Archives found inside an extracted
archive are extracted recursively. Extraction is done in-place in a directory and named after the
archive with '-extract'
appended.
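The naming convention described above can be expressed in a few lines. This is a minimal sketch, not part of extractcode itself: the helper names are hypothetical, and the archive paths are only illustrative.

```python
import os

EXTRACT_SUFFIX = "-extract"


def extraction_target(archive_path):
    """Return the directory an archive is extracted to, side-by-side with it."""
    return archive_path + EXTRACT_SUFFIX


def pending_archives(archive_paths):
    """Return the archives whose '-extract' directory does not exist yet."""
    return [
        path for path in archive_paths
        if not os.path.isdir(extraction_target(path))
    ]


print(extraction_target("samples/zlib.tar.gz"))
```

A check like pending_archives can be handy before a scan, to confirm that no archive in the codebase was left unextracted.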

Usage:¶
./extractcode [OPTIONS] <input>
All Extractcode Options¶
This is intended to be used as an input preparation step, before running the scan. Archives found in an extracted archive are extracted recursively by default. Extraction is done in-place, in a directory named after the archive with ‘-extract’ appended, side-by-side with the archive.
To extract the packages in the samples
directory
./extractcode samples
This extracts the zlib.tar.gz package:

--shallow | Do not extract recursively nested archives (e.g. Not archives in archives). |
--verbose | Print verbose file-by-file progress messages. |
--quiet | Do not print any summary or progress message. |
-h, --help | Show the extractcode help message and exit. |
--about | Show information about ScanCode and licensing and exit. |
--version | Show the version and exit. |
How to specify Scancode Output Format¶
A basic overview of formatting Scancode Output is presented here.
More information on Scancode Output Formats.
JSON¶
If you want JSON output of ScanCode results, you can pass the --json
argument to ScanCode.
The following commands will output scan results in a formatted json file:
./scancode --json /path/to/output.json /path/to/target/dir
./scancode --json-pp /path/to/output.json /path/to/target/dir
./scancode --json-lines /path/to/output.json /path/to/target/dir
To compare the JSON output in different formats, refer to Comparing Different json Output Formats.
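When post-processing the --json-lines output in a script, it can be read one line at a time instead of loading a single large document. The sketch below assumes the file follows the usual JSON Lines convention of one JSON document per line; the function name and the sample keys are illustrative and are not taken from ScanCode.

```python
import json


def read_json_lines(text):
    """Parse JSON Lines text into a list of Python objects, skipping blank lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Illustrative content only; real scan output carries many more fields.
sample = '{"headers": []}\n{"files": [{"path": "samples/README"}]}\n'
for record in read_json_lines(sample):
    print(sorted(record))
```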
Print to stdout (Terminal)¶
If you want to format the output in JSON and print it to stdout, you can replace the JSON filename with a "-", like --json-pp - instead of --json-pp output.json.
The following command will output the scan results in JSON format to stdout (in the terminal):
./scancode -clpieu --json-pp - samples/
HTML¶
If you want HTML output of ScanCode results, you can pass the --html
argument to ScanCode.
The following commands will output scan results in a formatted HTML page or simple web application:
./scancode --html /path/to/output.html /path/to/target/dir
./scancode --html-app /path/to/output.html /path/to/target/dir
For more details on the HTML output format, refer to --html FILE.
Warning
The --html-app
option has been deprecated, use Scancode Workbench instead.
Custom Output Format¶
While the three built-in output formats are convenient for a variety of use-cases, one may wish to create their own output template, using the following arguments:
--custom-output FILE --custom-template TEMP_FILE
ScanCode makes this very easy, as it uses the popular Jinja2 template engine. Simply pass the path to the custom template to the --custom-template argument, or drop it in the src/scancode/templates directory.
For example, if I wanted a simple CLI output I would create a template.html with the particular data I wish to see. In this case, I am only interested in the license and copyright data for this particular scan.
## template.html:
[
{% if files.license_copyright %}
{% for location, data in files.license_copyright.items() %}
{% for row in data %}
location:"{{ location }}",
{% if row.what == 'copyright' %}copyright:"{{ row.value|escape }}",{% endif %}
{% endfor %}
{% endfor %}
{% endif %}
]
Note
The file name and extension do not matter for the template file.
Now I can run ScanCode using my newly created template:
$ ./scancode -clpeui --custom-output output.json --custom-template template.html samples
Scanning files...
[####################################] 46
Scanning done.
Now our results are saved in output.json and we can easily view them with head output.json:
[
location:"samples/JGroups/LICENSE",
copyright:"Copyright (c) 1991, 1999 Free Software Foundation, Inc.",
location:"samples/JGroups/LICENSE",
copyright:"copyrighted by the Free Software Foundation",
]
For a more elaborate template, refer to this default template given with Scancode, used to generate HTML output with the --html output format option.
Documentation on Jinja templates.
How to set what will be detected in Scan¶
ScanCode allows you to scan a codebase for license, copyright and other interesting information that can be discovered in files. The following options are available for detection when using ScanCode Toolkit:
All “Basic” Scan Options¶
Option lists are two-column lists of command-line options and descriptions, documenting a program’s options. For example:
-c, --copyright | |
Scan Sub-Options:
| |
-l, --license | Scan Sub-Options:
|
-p, --package | Scan Sub-Options:
|
-e, --email | Scan Sub-Options:
|
-u, --url | Scan Sub-Options:
|
-i, --info | Include information such as:
Sub-Options:
|
Note
Unlike previous 2.x versions, -c, -l, and -p are not default. If any combination of these options is used, ScanCode only performs those specific tasks, and not the others. For example, ./scancode -e only scans for emails, and doesn’t scan for copyright/license/packages/general information.
Note
These options, i.e. -c, -l, -p, -e, -u, and -i, can be used together. For instance, instead of ./scancode -c -i -p, you can write ./scancode -cip and it will be the same.
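The equivalence of separate and combined short options is a common command-line convention. The sketch below demonstrates it with a stand-in parser built with Python's argparse module; it is not ScanCode's own command-line code, and only the six basic option names are taken from this page.

```python
import argparse


def make_parser():
    """Build a stand-in parser with the six basic scan flags."""
    parser = argparse.ArgumentParser(prog="demo-scan")
    flags = [
        ("-c", "--copyright"), ("-l", "--license"), ("-p", "--package"),
        ("-e", "--email"), ("-u", "--url"), ("-i", "--info"),
    ]
    for short, long_name in flags:
        parser.add_argument(short, long_name, action="store_true")
    return parser


parser = make_parser()
separate = parser.parse_args(["-c", "-i", "-p"])
combined = parser.parse_args(["-cip"])
# Both spellings enable the same flags and leave the others off.
print(vars(separate) == vars(combined))
```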
--generated | Classify automatically generated code files with a flag. |
--max-email INT | |
Report only up to INT emails found in a file. Use 0 for no limit. [Default: 50] Sub-Option of - | |
--max-url INT | Report only up to INT urls found in a file. Use 0 for no limit. [Default: 50] Sub-Option of - |
--license-score INTEGER | |
Do not return license matches with scores lower than this score. A number between 0 and 100. [Default: 0] Here, a bigger number means a better match, i.e. Setting a higher license score translates to a higher threshold (with equal or less number of matches). Sub-Option of - | |
--license-text | Include the matched text for the detected licenses in the output report. Sub-Option of - Sub-Options:
|
--license-url-template TEXT | |
Set the template URL used for the license reference URLs. In a template URL, curly braces ({}) are replaced by the license key. [Default: https://enterprise.dejacode.com/urn/urn:dje:license:{}] Sub-Option of - | |
--license-text-diagnostics | |
In the matched license text, include diagnostic highlights surrounding with square brackets [] words that are not matched. Sub-Option of - |
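The effect of --license-score described in the table above can be pictured as a simple threshold filter. This is a sketch of the semantics only, not ScanCode's matching code; the match dictionaries and their key and score fields are illustrative assumptions.

```python
def filter_by_score(matches, min_score=0):
    """Keep only the matches whose score is at least min_score (0 to 100)."""
    return [match for match in matches if match["score"] >= min_score]


# Illustrative matches; real results carry more fields.
matches = [
    {"key": "mit", "score": 100.0},
    {"key": "gpl-2.0", "score": 45.5},
    {"key": "bsd-new", "score": 80.0},
]
print([match["key"] for match in filter_by_score(matches, min_score=70)])
```

With the default of 0 every match is kept; raising the threshold can only return the same number of matches or fewer.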
Different Scans¶
The following examples will use the samples
directory that is provided with the ScanCode
Toolkit code. All examples will
be saved in the JSON format, which can be loaded into Scancode Workbench for visualization. See
How to Visualize Scan results for more information. Another output format option is a
static html file. See Scancode Output Formats for more information.
To scan for licenses, copyrights, urls, emails, package information, and file information:
./scancode -clipeu --json output.json samples
To scan only for licenses and copyrights:
./scancode -cl --json-pp output.json samples
To scan only for emails and urls:
./scancode -eu --json-pp output.json samples
To scan only for package information:
./scancode -p --json-pp output.json samples
To scan only for file information:
./scancode -i --json-pp output.json samples
To show example commands:
./scancode --examples
For more information, refer to All Available Options.
Add A Post-Scan Plugin¶
Built-In vs. Optional Installation¶
Some post-scan plugins are installed when ScanCode itself is installed, e.g., the License Policy Plugin, whose code is located here:
https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/plugin_license_policy.py
These plugins do not require any additional installation steps and can be used as soon as ScanCode is up and running.
ScanCode is also designed to use post-scan plugins that must be installed separately from the installation of ScanCode. The code for this sort of plugin is located here:
https://github.com/nexB/scancode-toolkit/tree/develop/plugins/
This wiki page will focus on optional post-scan plugins.
Example Post-Scan Plugin: Hello ScanCode¶
To illustrate the creation of a simple post-scan plugin, we’ll create a hypothetical plugin named
Hello ScanCode
, which will print Hello ScanCode!
in your terminal after you’ve run a scan.
Your command will look something like this:
scancode -i -n 2 <path to target codebase> --hello --json <path to JSON output file>
We’ll start by creating three folders:
- Top-level folder –
/scancode-hello/
- 2nd-level folder –
/src/
- 3rd-level folder –
/hello_scancode/
/scancode-hello/¶
- In the /scancode-toolkit/plugins/ directory, add a folder with a relevant name, e.g., scancode-hello. This folder will hold all of your plugin code.
- Inside the /scancode-hello/ folder you’ll need to add a folder named src and 7 files.
/src/
– This folder will contain your primary Python code and is discussed in more detail in the following section.
The 7 Files are:
.gitignore – See, e.g., /plugins/scancode-ignore-binaries/.gitignore
/build/
/dist/
apache-2.0.LICENSE – See, e.g., /plugins/scancode-ignore-binaries/apache-2.0.LICENSE
MANIFEST.in
graft src
include setup.py
include setup.cfg
include .gitignore
include README.md
include MANIFEST.in
include NOTICE
include apache-2.0.LICENSE
global-exclude *.py[co] __pycache__ *.*~
NOTICE – See, e.g., /plugins/scancode-ignore-binaries/NOTICE
README.md
setup.cfg
[metadata]
license_file = NOTICE
[bdist_wheel]
universal = 1
[aliases]
release = clean --all bdist_wheel
setup.py – This is an example of what our setup.py file would look like:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from __future__ import absolute_import
from __future__ import print_function

from glob import glob
from os.path import basename
from os.path import join
from os.path import splitext

from setuptools import find_packages
from setuptools import setup

desc = '''A ScanCode post-scan plugin to illustrate the creation of a simple post-scan plugin.'''

setup(
    name='scancode-hello',
    version='1.0.0',
    license='Apache-2.0 with ScanCode acknowledgment',
    description=desc,
    long_description=desc,
    author='nexB',
    author_email='info@aboutcode.org',
    url='https://github.com/nexB/scancode-toolkit/plugins/scancode-categories',
    packages=find_packages('src'),
    package_dir={'': 'src'},
    py_modules=[splitext(basename(path))[0] for path in glob('src/*.py')],
    include_package_data=True,
    zip_safe=False,
    classifiers=[
        # complete classifier list: http://pypi.python.org/pypi?%3Aaction=list_classifiers
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: Apache Software License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2.7',
        'Topic :: Utilities',
    ],
    keywords=[
        'scancode', 'plugin', 'post-scan'
    ],
    install_requires=[
        'scancode-toolkit',
    ],
    # The entry point registers the plugin class under the scancode_post_scan group.
    entry_points={
        'scancode_post_scan': [
            'hello = hello_scancode.hello_scancode:SayHello',
        ],
    }
)
/src/¶
- Add an __init__.py file inside the src folder. This file can be empty, and is used to indicate that the folder should be treated as a Python package directory.
- Add a folder that will contain our primary code – we’ll name the folder hello_scancode. If you look at the example of the setup.py file above, you’ll see this line in the entry_points section:
'hello = hello_scancode.hello_scancode:SayHello',
- hello refers to the name of the command flag.
- The first hello_scancode is the name of the folder we just created.
- The second hello_scancode is the name of the .py file containing our code (discussed in the next section).
- SayHello is the name of the PostScanPlugin class we create in that file (see sample code below).
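The four parts of the entry point line can be separated mechanically, which may help when checking that the folder, file, and class names all line up. This is a small sketch with a hypothetical helper name; it only splits the string and does not import or load anything.

```python
def parse_entry_point(spec):
    """Split 'name = package.module:Class' into its four parts."""
    name, _, target = spec.partition("=")
    module_path, _, class_name = target.strip().partition(":")
    package, _, module = module_path.rpartition(".")
    return {
        "flag": name.strip(),   # name of the command flag
        "package": package,     # folder holding the code
        "module": module,       # .py file inside that folder
        "class": class_name,    # plugin class defined in the file
    }


parts = parse_entry_point("hello = hello_scancode.hello_scancode:SayHello")
print(parts)
```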
/hello_scancode/¶
- Add an __init__.py file inside the hello_scancode folder. As noted above, this file can be empty.
- Add a hello_scancode.py file.
#
# Copyright (c) 2019 nexB Inc. and others. All rights reserved.
# http://nexb.com and https://github.com/nexB/scancode-toolkit/
# The ScanCode software is licensed under the Apache License version 2.0.
# Data generated with ScanCode require an acknowledgment.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# When you publish or redistribute any data created with ScanCode or any ScanCode
# derivative work, you must accompany this data with the following acknowledgment:
#
# Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
# ScanCode is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/nexB/scancode-toolkit/ for support and download.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from plugincode.post_scan import PostScanPlugin
from plugincode.post_scan import post_scan_impl
from scancode import CommandLineOption
from scancode import POST_SCAN_GROUP
PostScanPlugin class¶
The PostScanPlugin class (see L40-L45 code) inherits from the CodebasePlugin class (see L139-L150 code), which inherits from the BasePlugin class (see L38-L136 code).
@post_scan_impl
class SayHello(PostScanPlugin):
    """
    Illustrate a simple "Hello World" post-scan plugin.
    """

    options = [
        CommandLineOption(('--hello',),
            is_flag=True, default=False,
            help='Generate a simple "Hello ScanCode" greeting in the terminal.',
            help_group=POST_SCAN_GROUP)
    ]

    def is_enabled(self, hello, **kwargs):
        return hello

    def process_codebase(self, codebase, hello, **kwargs):
        """
        Say hello.
        """
        if not self.is_enabled(hello):
            return

        print('\nHello ScanCode!!\n')
Load the plugin¶
- To load and use the plugin in the normal course, navigate to the plugin’s root folder (in this example: /plugins/scancode-hello/) and run pip install . (don’t forget the final .).
- If you’re developing and want to test your work, save your edits and run pip install -e . from the same folder.
More-complex examples¶
This Hello ScanCode example is quite simple. For examples of more-complex structures and functionalities you can take a look at the other post-scan plugins for guidance and ideas.
One good example is the License Policy post-scan plugin. This plugin is installed when ScanCode
is installed and consequently is not located in the /plugins/
directory used for
manually-installed post-scan plugins. The code for the License Policy plugin can be found at
/scancode-toolkit/src/licensedcode/plugin_license_policy.py
and illustrates how a plugin can be used to analyze the results of a ScanCode scan using external
data files and add the results of that analysis as a new field in the ScanCode JSON output file.
How-To Guides¶
How To Add a New License for Detection¶
How to add a new license for detection?¶
To add a new license, you first need to select a new and unique license key (mit and gpl-2.0 are some of the existing license keys). All licenses are stored as plain text files in the src/licensedcode/data/licenses directory using their key as part of the file names.
You need to create a pair of files:
- a file with the text of the license saved in a plain text file named key.LICENSE
- a small text data file (in YAML format) named key.yml that contains license information such as:
key: my-license
name: My License
The key name can contain only these symbols:
- lowercase letters from a to z,
- numbers from 0 to 9, and
- dash (-) and period (.) signs. No spaces.
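The key naming rules above translate directly into a regular expression. The following sketch uses a hypothetical helper name and checks only the allowed characters; it does not check that a key is unique among the existing licenses.

```python
import re

# Lowercase letters, digits, dash and period only; at least one character.
LICENSE_KEY_PATTERN = re.compile(r"[a-z0-9.-]+")


def is_valid_license_key(key):
    """Return True if the key uses only the allowed symbols."""
    return LICENSE_KEY_PATTERN.fullmatch(key) is not None


for candidate in ("mit", "gpl-2.0", "My License", "my_license"):
    print(candidate, is_valid_license_key(candidate))
```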
Save these two files in the src/licensedcode/data/licenses/
directory.
Done!
See the src/licensedcode/data/licenses/
directory for examples.
How to Add New License Rules for Enhanced Detection¶
ScanCode relies on license rules to detect licenses. A rule is a simple text file containing a license text, notice or mention, paired with a small YAML text file that tells ScanCode which licenses to report when the text is detected.
See the FAQ for a high level description of How to Add New License Rules for Enhanced Detection.
How to add a new license detection rule?¶
A license detection rule is a pair of files:
- a plain text rule file that is typically a variant of a license text, notice or license mention.
- a small text data file (in YAML format) documenting which license(s) should be detected for the rule text.
To add a new rule, you need to pick a unique base file name. As a convention, we like to include the license key(s) that should be detected in that name to make it more descriptive. For example: mit_and_gpl-2.0 is a good base name. Add a suffix to make it unique if there is already a rule with this base name. Do not use spaces or special characters in that name.
Then create the rule file in the src/licensedcode/data/rules/ directory using this name, replacing selected_base_name with the base name you selected:
selected_base_name.RULE
Save your rule text in this file.
Then create the YAML data file in the src/licensedcode/data/rules/ directory using this name:
selected_base_name.yml
For a simple detection of the mit and gpl-2.0 license keys, the content of this file can be this YAML snippet:
licenses:
- mit
- gpl-2.0
Save these two files in the src/licensedcode/data/rules/ directory and you are done!
See the src/licensedcode/data/rules/
directory for examples.
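Creating the pair of rule files can be scripted. The sketch below is an illustration under stated assumptions: the helper write_rule is hypothetical, the rule text is a placeholder, and the demo writes into a temporary directory rather than src/licensedcode/data/rules/.

```python
import os
import tempfile


def write_rule(rules_dir, base_name, rule_text, license_keys):
    """Write base_name.RULE and base_name.yml side by side in rules_dir."""
    rule_path = os.path.join(rules_dir, base_name + ".RULE")
    data_path = os.path.join(rules_dir, base_name + ".yml")
    with open(rule_path, "w", encoding="utf-8") as rule_file:
        rule_file.write(rule_text)
    # Minimal YAML: a "licenses" list with one license key per entry.
    lines = ["licenses:"] + ["  - " + key for key in license_keys]
    with open(data_path, "w", encoding="utf-8") as data_file:
        data_file.write("\n".join(lines) + "\n")
    return rule_path, data_path


with tempfile.TemporaryDirectory() as demo_dir:
    rule_path, data_path = write_rule(
        demo_dir, "mit_and_gpl-2.0",
        "Licensed under the MIT license and the GPL version 2.\n",
        ["mit", "gpl-2.0"],
    )
    demo_rule_name = os.path.basename(rule_path)
    with open(data_path, encoding="utf-8") as data_file:
        demo_yaml = data_file.read()

print(demo_rule_name)
print(demo_yaml)
```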
More (advanced) rules options:
- you can use a notes: text field to document this rule.
- if no license should be detected for your .RULE text, do not add a list of license keys, just add a note.
- .RULE text can contain special text regions that can be ignored when scanning for licenses. You can mark a template region in your rule text using {{double curly braces}} and up to five words can vary and still match this rule. You must add this field in your .yml data file to mark this rule as a template
template: yes
By using a number after the opening braces, more than five words can be skipped. With {{10 double curly braces }} ten words would be skipped.
To mark a rule as detecting a choice of licenses, add this field in your .yml file:
license_choice: yes
See the #257 issue and the related #258 pull request for an example: this adds a new rule to detect a combination of MIT or GPL.
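To make the template region notation described above concrete, the sketch below lists the regions found in a rule text together with the number of words each one allows to vary. It only illustrates the notation (five words by default, or the number written after the opening braces); it is not ScanCode's matching code, and the function name is hypothetical.

```python
import re

# '{{', an optional word count, an optional label, then '}}'.
TEMPLATE_REGION = re.compile(r"\{\{\s*(\d+)?\s*(.*?)\s*\}\}")
DEFAULT_WORDS = 5


def template_regions(rule_text):
    """Return (max_words, label) for every {{...}} region in rule_text."""
    regions = []
    for match in TEMPLATE_REGION.finditer(rule_text):
        count, label = match.groups()
        regions.append((int(count) if count else DEFAULT_WORDS, label))
    return regions


text = "Copyright {{holder name}} released under {{10 long notice }} the MIT license"
print(template_regions(text))
```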
How it all Works¶
Overview¶
How does ScanCode work?¶
For license detection, ScanCode uses a (large) number of license texts and license detection ‘rules’ that are compiled in a search index. When scanning, the text of the target file is extracted and used to query the license search index and find license matches.
For copyright detection, ScanCode uses a grammar that defines the most common and less common forms of copyright statements. When scanning, the target file text is extracted and ‘parsed’ with this grammar to extract copyright statements.
ScanCode-Toolkit performs the scan on a codebase in the following steps:
- Collect an inventory of the code files and classify the code using file types,
- Extract files from any archive using a general purpose extractor
- Extract texts from binary files if needed
- Use an extensible rules engine to detect open source license text and notices
- Use a specialized parser to capture copyright statements
- Identify packaged code and collect metadata from packages
- Report the results in the formats of your choice (JSON, CSV, etc.) for integration with other tools
Scan results are provided in various formats:
- a JSON file simple or pretty-printed,
- SPDX tag/value or RDF/XML formats,
- CSV,
- a simple unformatted HTML file that can be opened in browser or as a spreadsheet.
For each scanned file, the result contains:
- its location in the codebase,
- the detected licenses and copyright statements,
- the start and end line numbers identifying where the license or copyright was found in the scanned file, and
- reference information for the detected license.
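The per-file results listed above can be flattened into rows for a quick review. The sketch below assumes a JSON layout with a top-level files list whose entries carry path, licenses, and copyrights keys, with start_line and end_line on each detection; these field names and the sample values are assumptions, so check them against your own output.

```python
import json


def load_scan(path):
    """Load a ScanCode JSON output file into a Python dictionary."""
    with open(path, encoding="utf-8") as scan_file:
        return json.load(scan_file)


def summarize_scan(scan):
    """Return (path, kind, value, start_line, end_line) rows for each detection."""
    rows = []
    for entry in scan.get("files", []):
        path = entry.get("path")
        for detected in entry.get("licenses", []):
            rows.append((path, "license", detected.get("key"),
                         detected.get("start_line"), detected.get("end_line")))
        for detected in entry.get("copyrights", []):
            rows.append((path, "copyright", detected.get("value"),
                         detected.get("start_line"), detected.get("end_line")))
    return rows
```

For example, summarize_scan(load_scan("sample.json")) would list every detection in the scan produced earlier, one row per license or copyright.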
For archive extraction, ScanCode uses a combination of Python modules, 7zip and libarchive/bsdtar to detect archive types and extract these recursively.
Several other utility modules are used such as libmagic for file and mime type detection.
Contribute¶
Contributing to Code Development¶
See CONTRIBUTING.rst for details.
Code layout and conventions¶
Source code is in src/. Tests are in tests/.
There is one Python package for each major feature under src/
and a corresponding directory
with the same name under tests
(but this is not a package by design).
Each test script is named test_XXXX and while we love to use py.test as a test runner, most tests have no dependencies on py.test, only on the unittest module (with the exception of some command line tests that depend on pytest monkeypatching capabilities).
When source or tests need data files, we store these in a data
subdirectory.
We use PEP8 conventions with a relaxed line length that can be up to 90’ish characters long when needed to keep the code clear and readable.
We store pre-built bundled native binaries in bin/
sub-directories of each src/
packages.
These binaries are organized by OS and architecture. This ensures that ScanCode works out of the box
either using a checkout or a download, without needing a compiler and toolchain to be installed.
The corresponding source code for the pre-built binaries is stored in a separate repository at https://github.com/nexB/scancode-thirdparty-src.
We store bundled thirdparty components and libraries in the thirdparty directory. Python libraries are stored as wheels, possibly pre-built if the corresponding wheel is not available in the Pypi repository. Some of these components may be advanced builds with bug fixes or advanced patches.
We write tests, a lot of tests, thousands of tests. Several tests are data-driven and use data files as test input and sometimes data files as test expectation (in this case using either JSON or YAML files). The tests should pass on Linux 64 bits, Windows 32 and 64 bits and on MacOSX 10.6.8 and up. We maintain two CI loops with Travis (Linux) at https://travis-ci.org/nexB/scancode-toolkit and Appveyor (Windows) at https://ci.appveyor.com/project/nexB/scancode-toolkit.
When finding bugs or adding new features, we add tests. See existing test code for examples.
More info:
- Source code and license datasets are in the /src/ directory.
- Test code and test data are in the /tests/ directory.
- Datasets and test data are in /data/ sub-directories.
- Third-party components are vendored in the /thirdparty/ directory. ScanCode is self contained and should not require network access for installation or configuration of third-party libraries.
- Additional pre-compiled vendored binaries are stored in bin/ sub-directories of the /src/ directory with their sources in this repo: https://github.com/nexB/scancode-thirdparty-src/
- Porting ScanCode to other OS (FreeBSD, etc.) is possible. Enter an issue for help.
- Bugs and pull requests are welcomed.
- See the wiki and CONTRIBUTING.rst for more info.
Running tests¶
ScanCode comes with over 13,000 unit tests to ensure detection accuracy and stability across Linux, Windows and macOS: we kinda love tests, don’t we?
We use pytest to run the tests: call the py.test script to run the whole test suite. (This script is installed by pytest, which is bundled with a ScanCode checkout and installed when you run ./configure.)
If you are running from a fresh git clone and you run ./configure
and then
source bin/activate
the py.test
command will be available in your path.
Alternatively, if you have already configured but are not in an activated “virtualenv” the
py.test
command is available under <root of your checkout>/bin/py.test
(Note: paths here are for POSIX, but mostly the same applies to Windows)
If you have a multiprocessor machine you might want to run the tests in parallel (and faster)
For instance: py.test -n4
runs the tests on 4 CPUs. We typically run the tests in
verbose mode with py.test -vvs -n4
.
You can also run a subset of the test suite as shown in the CI configs https://github.com/nexB/scancode-toolkit/blob/develop/appveyor.yml#L6, e.g., py.test -n 2 -vvs tests/scancode runs only the test scripts present in the tests/scancode directory. (You can pass a path to a specific test script file there too.)
See also https://docs.pytest.org for details or use the py.test -h
command to show the many
other options available.
One useful option is to run a select subset of the test functions matching a pattern with the -k option, for instance: py.test -vvs -k tcpdump would only run test functions that contain the string “tcpdump” in their name, their class name or their module name.
Another useful option after a test run with some failures is to re-run only the failed tests with the --lf option, for instance: py.test -vvs --lf would only run the test functions that failed in the previous run.
pip requirements and the configure script¶
ScanCode uses the configure and configure.bat scripts (and etc/configure.py behind the scenes) to install a virtualenv, install required packaged dependencies as pip requirements, and perform other configuration tasks, such that ScanCode can be installed in a self-contained way with no network connectivity required.
Earlier unreleased versions of ScanCode were using buildout to install and configure possibly complex dependencies. We had some improvements that were merged in the upstream buildout to support bootstrapping and installing without a network connection. When we migrated to pip and wheels as a new, improved and faster way to install and configure dependencies, we missed some of the features of buildout, such as recipes, being able to invoke arbitrary Python or shell scripts after installing packages, and having scripts or requirements that are operating system-specific.
ScanCode requirements and third-party Python libraries¶
In a somewhat unconventional way, all the required libraries are bundled, i.e. copied, in the repo itself in the thirdparty/ directory. If ScanCode were only a library it would not make sense. But it is first an application, and having a well defined frozen set of dependent packages is important for an app. The benefit of this approach (combined with the configure script) is that a mere checkout of the repository contains everything needed to run ScanCode except for a Python interpreter.
Using ScanCode as a Python library¶
ScanCode can also be used as a Python library. It is available as a Python wheel on Pypi and can be installed with pip install scancode-toolkit.
How to cut a new release:¶
- Run bumpversion with major, minor or patch to bump the version in:
  - src/scancode/__init__.py
  - setup.py
- Update the CHANGELOG.rst.
- Commit the changes and push them to develop:
git commit -m "commit message"
git push --set-upstream origin develop
- Merge the develop branch into master and tag the release:
git checkout master
git merge develop
git tag -a v1.6.1 -m "Release v1.6.1"
git push --set-upstream origin master
git push --set-upstream origin v1.6.1
- Draft a new release in GitHub, using the previous release blurb as a base. Highlight new and noteworthy changes from the CHANGELOG.rst.
- Run etc/release/release.sh locally.
- Upload the release archives created in the dist/ directory to the GitHub release page.
- Save the release as a draft. Use the previous release notes to create notes in the same style. Ensure that the link to third-party source code is present.
- Test the downloads.
- Publish the release on GitHub.
- Then build and publish the released wheel on Pypi. For this you need your own Pypi credentials (and get authorized to publish Pypi releases: ask @pombredanne) and you need to have the twine package installed and configured.
  - Build a .whl with python setup.py bdist_wheel
  - Run twine with twine upload dist/<path to the built wheel>
  - Once uploaded, check the published release at https://pypi.python.org/pypi/scancode-toolkit/
  - Then create a fresh local virtualenv and test the wheel installation with: pip install scancode-toolkit
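The major, minor and patch choices in the first release step follow the usual three-part version scheme. The sketch below shows what each choice does to a version string; it is a plain illustration with a hypothetical function name, not the implementation of the bumpversion tool.

```python
def bump(version, part):
    """Return the version string with the given part incremented.

    Bumping a part resets every part to its right to zero.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return "{}.0.0".format(major + 1)
    if part == "minor":
        return "{}.{}.0".format(major, minor + 1)
    if part == "patch":
        return "{}.{}.{}".format(major, minor, patch + 1)
    raise ValueError("part must be 'major', 'minor' or 'patch'")


print(bump("1.6.0", "patch"))
```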
Contributing to the Documentation¶
Continuous Integration¶
The documentation is checked on every new commit through Travis-CI, so that common errors are avoided and documentation standards are enforced. Travis-CI presently checks for these 3 aspects of the documentation:
- Successful Builds (by using sphinx-build)
- No Broken Links (by using link-check)
- Linting Errors (by using Doc8)
Style Checks Using Doc8¶
In the project root, run the following command:
$ doc8 --max-line-length 100 docs/source/scancode-toolkit --ignore D000
Note
Only the scancode-toolkit documentation style standards are enforced presently.
A sample output is:
Scanning...
Validating...
docs/source/scancode-toolkit/misc/licence_policy_plugin.rst:37: D002 Trailing whitespace
docs/source/scancode-toolkit/misc/faq.rst:45: D003 Tabulation used for indentation
docs/source/scancode-toolkit/misc/faq.rst:9: D001 Line too long
docs/source/scancode-toolkit/misc/support.rst:6: D005 No newline at end of file
========
Total files scanned = 34
Total files ignored = 0
Total accumulated errors = 326
Detailed error counts:
- CheckCarriageReturn = 0
- CheckIndentationNoTab = 75
- CheckMaxLineLength = 190
- CheckNewlineEndOfFile = 13
- CheckTrailingWhitespace = 47
- CheckValidity = 1
Now fix the errors and run again until there are no style errors left in the documentation.
PyCQA is an organization for code quality tools (and plugins) for the Python programming language. Doc8 is a sub-project of the same organization. Refer to this README for more details.
What is checked:
invalid rst format - D000
lines should not be longer than 100 characters - D001
- RST exception: line with no whitespace except in the beginning
- RST exception: lines with http or https URLs
- RST exception: literal blocks
- RST exception: rst target directives
no trailing whitespace - D002
no tabulation for indentation - D003
no carriage returns (use UNIX newlines) - D004
no newline at end of file - D005
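To make the rule codes above concrete, here is a minimal Python sketch that mimics a few of them. It is not how Doc8 is implemented: the function name check_style is hypothetical, D000 (RST validity) is left out, and only the URL exception for D001 is modelled.

```python
def check_style(text, max_line_length=100):
    """Return (line_number, code) pairs for a simplified subset of the checks.

    Illustrative only: this is not Doc8's implementation.
    """
    errors = []
    lines = text.split("\n")
    # A text ending with a newline produces a trailing empty item.
    ends_with_newline = bool(lines) and lines[-1] == ""
    if ends_with_newline:
        lines.pop()

    for number, line in enumerate(lines, start=1):
        if line.endswith("\r"):
            # D004: carriage return found (use UNIX newlines)
            errors.append((number, "D004"))
            line = line[:-1]
        has_url = "http://" in line or "https://" in line
        if len(line) > max_line_length and not has_url:
            # D001: line too long (lines with URLs are exempt)
            errors.append((number, "D001"))
        if line != line.rstrip():
            # D002: trailing whitespace
            errors.append((number, "D002"))
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            # D003: tabulation used for indentation
            errors.append((number, "D003"))

    if text and not ends_with_newline:
        # D005: no newline at end of file
        errors.append((len(lines), "D005"))
    return errors
```

Running a helper like this on a single file before calling doc8 can give a quick first impression, but the doc8 command above remains the reference check.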
Extra Style Checks¶
Headings
Normally, there are no heading levels assigned to certain characters, as the structure is determined from the succession of headings. However, this convention is used in Python’s Style Guide for documenting, which you may follow:
# with overline, for parts
* with overline, for chapters
=, for sections
-, for subsections
^, for sub-subsections
", for paragraphs
Heading Underlines
Do not use underlines that are longer/shorter than the title headline itself. As in:
Correct :
Extra Style Checks
------------------
Incorrect :
Extra Style Checks
------------------------
Note
Underlines shorter than the title text generate errors on sphinx-build.
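The underline rule can be checked mechanically. The following is a small Python sketch; the helper names make_heading and underline_is_valid are hypothetical and not part of any documentation tool.

```python
def make_heading(title, char="-"):
    """Return the title followed by an underline of exactly the same length."""
    return "{}\n{}".format(title, char * len(title))


def underline_is_valid(title, underline):
    """True when the underline repeats one character and matches the title length."""
    return len(set(underline)) == 1 and len(underline) == len(title)
```

Generating the underline from the title, as make_heading does, avoids counting characters by hand.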
Internal Links
Using :ref: is advised over standard reStructuredText links to sections (like `Section title`_) because it works across files, keeps working when section headings are changed, raises warnings if incorrect, and works for all builders that support cross-references. However, external links are created by using the standard `Section title`_ method.
Eliminate Redundancy
If a section/file has to be repeated somewhere else, do not write the exact same section/file twice. Use .. include:: ../README.rst instead. Here, ../ refers to the documentation root, so the file location can be given accordingly. This enables us to link documents from other upstream folders.
Using :ref: only when necessary
Use :ref: to create internal links only when needed, i.e. when the target is referenced somewhere. Do not create references for all the sections and then only reference some of them, because this creates unnecessary references. This also generates an ERROR in restructuredtext-lint.
Spelling
You should check for spelling errors before you push changes. Aspell is a GNU project command line tool you can use for this purpose. Download and install Aspell, then execute aspell check <file-name> for all the files changed. Be careful not to change commands or other literal text, as Aspell prompts for a lot of them. Also delete the temporary .bak files generated. Refer to the manual for more information on how to use it.
Notes and Warning Snippets
All Note and Warning sections are to be kept in rst_snippets/note_snippets/ and rst_snippets/warning_snippets/ and then included to eliminate redundancy, as these are frequently used in multiple files.
Converting from Markdown¶
If you want to convert a .md file to a .rst file, this tool does it pretty well. You would still have to clean up and check for errors, as the conversion is quite buggy. But this is definitely better than converting everything by yourself.
This will be helpful in converting GitHub wikis (Markdown files) to reStructuredText files for Sphinx/ReadTheDocs hosting.
Roadmap¶
This is a high level list of what we are working on and what is completed.
Legend¶
Work in Progress¶
(see Completed features below)
Docker images base (as part of: https://github.com/pombredanne/conan ) #651
RubyGems base and dependencies #650 (code in https://github.com/nexB/scancode-toolkit-contrib/ )
Perl, CPAN (basic in https://github.com/nexB/scancode-toolkit-contrib/)
Go : parsing for Godep in https://github.com/nexB/scancode-toolkit-contrib/
Windows PE #652
RPMs dependencies #649
Windows Nuget dependencies #648
Bower packages #654
Python dependencies #653
CRAN
Plain packages
other Java-related meta files (SBT, Ivy, Gradle, etc.)
Debian debs
other JavaScript (jspm, etc.)
other Linux distro packages
support and detect license expressions (code in https://github.com/nexB/license-expression)
support and detect composite licenses
support custom licenses
move licenses data set to external separate repository
Improved unknown license detection
sync with external sources (DejaCode, SPDX, etc.)
pre scan filtering (ignore binaries, etc)
pre/post/output plugins! (worked as part of the GSoC by @yadsharaf )
scan plugins (e.g. plugins that run a scan to collect data)
support Python 3 #295
transparent archive extraction (as opposed to on-demand with extractcode)
scancode.yml configuration file for exclusions, defaults, scan failure conditions, etc.
support scan pipelines and rules to organize more complex scans
scan baselining, delta scan and failure conditions (such as license change, etc) (spawned as its own DeltaCode project)
dedupe and similarities to avoid re-scanning. For now only identical files are scanned only once.
Improved logging, tracing and error diagnostics
native support for ABC Data (See aboutcode_data )
symbols : parsing complete in https://github.com/nexB/scancode-toolkit-contrib/
metrics : some elements in https://github.com/nexB/scancode-toolkit-contrib/
ELFs : parsing complete in https://github.com/nexB/scancode-toolkit-contrib/
Java bytecode : parsing complete in https://github.com/nexB/scancode-toolkit-contrib/
Windows PE : parsing complete in https://github.com/nexB/scancode-toolkit-contrib/
Mach-O : parsing complete in https://github.com/nexB/scancode-toolkit-contrib/
Dalvik/dex
Other work in progress¶
ScanCode server: Spawned as its own project: https://github.com/nexB/scancode-server. Will include Integration / webhooks for Github, Bitbucket.
VulnerableCode: NVD and CVE lookups: Spawned as its own project: https://github.com/nexB/vulnerablecode
ScanCode Workbench: desktop app for scan review: Spawned as its own project: https://github.com/nexB/scancode-workbench
DependentCode: dynamic dependencies resolutions: Spawned as its own project: https://github.com/nexB/dependentcode
(Note that this will be spawned in its own project) Some code is in https://github.com/nexB/scancode-toolkit-contrib/
Completed features¶
JSON compact and pretty
plain HTML tables, also usable in a spreadsheet
fancy HTML ‘app’ with a file tree navigation, and scan results filtering, search and sorting
improved scans GUI now its own project: https://github.com/nexB/aboutcode-manager
simple scan summary
SPDX output
Google Summer of Code 2017 - Final report¶
Project: Plugin architecture for ScanCode¶
Yash D. Saraf yashdsaraf@gmail.com
This project’s purpose was to create a decoupled plugin architecture for ScanCode such that it can handle plugins at different stages of a scan and can be coupled at runtime. These stages were:
1. Format :¶
In this stage, the plugins are supposed to run after the scanning is done and the post-scan plugins have been called. These plugins could be used for:
- converting the scanned output to the given format (say csv, json, etc.)
HOWTO
Here, a plugin needs to add an entry in the scancode_output_writers entry point in the following format: '<format> = <module>:<function>'.
- <format> is the format name which will be used as the command line option name (e.g. csv or json).
- <module> is a python module which implements the output hook specification.
- <function> is the function to which the scan output will be passed if this plugin is called.
The <format> name will be automatically added to the --format command line option and (if called) the scanned data will be passed to the plugin.
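As an illustration of the entry format, here is a minimal Python sketch. The plugin package my_plugin.yaml_output, its write_yaml function and the parse_entry helper are hypothetical names invented for this example; only the scancode_output_writers entry point name and the '<format> = <module>:<function>' layout come from the description above.

```python
# Entry points as they might be declared by a plugin package.
# "my_plugin.yaml_output" and "write_yaml" are hypothetical names.
ENTRY_POINTS = {
    "scancode_output_writers": [
        "yaml = my_plugin.yaml_output:write_yaml",
    ],
}


def parse_entry(spec):
    """Split '<format> = <module>:<function>' into its three parts."""
    name, _, target = spec.partition("=")
    module, _, function = target.partition(":")
    return name.strip(), module.strip(), function.strip()
```

With such a declaration, the format name (here yaml) is the value a user would pass to the --format option.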
2. Post-scan :¶
In this stage, the plugins are supposed to run after the scanning is done. Some uses for these plugins were:
- summarization of scan outputs, e.g. a post-scan plugin for marking is_source as true for directories with ~90% of source files.
- simplification of scan outputs, e.g. the --only-findings option to return only files or directories with findings for the requested scans. Files and directories without findings are omitted (basic file information is not considered a finding). This option already existed; I just ported it to a post-scan plugin.
HOWTO
Here, a plugin needs to add an entry in the scancode_post_scan entry point in the following format: '<name> = <module>:<function>'.
- <name> is the command line option name (e.g. only-findings).
- <module> is a python module which implements the post_scan hook specification.
- <function> is the function to which the scanned files will be passed if this plugin is called.
The command line option for this plugin will be automatically created using the <function>’s docstring as its help text and (if called) the scanned files will be passed to the plugin.
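A post-scan function in the spirit of only-findings could look like the sketch below. This is not the actual plugin code: the shape of a scanned file (a dict) and the set of keys treated as basic file information are assumptions made for the example, and option_help is a hypothetical helper showing how a docstring can double as help text.

```python
# Assumption: these keys carry basic file information, not findings.
BASIC_INFO_KEYS = {"path", "type", "size"}


def only_findings(scanned_files):
    """Only return files or directories with findings for the requested scans."""
    for scanned_file in scanned_files:
        findings = [
            value for key, value in scanned_file.items()
            if key not in BASIC_INFO_KEYS
        ]
        # Keep the file when at least one scan reported something.
        if any(findings):
            yield scanned_file


def option_help(function):
    """Return the docstring of a plugin function, usable as option help text."""
    return (function.__doc__ or "").strip()
```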
3. Pre-scan :¶
In this stage, the plugins are supposed to run before the scan starts. So the potential uses for these types of plugins were to:
- ignore files based on a given pattern (glob)
- ignore files based on their info i.e size, type etc.
- extract archives before scanning
HOWTO
Here, a plugin needs to add an entry in the scancode_pre_scan entry point in the following format: '<name> = <module>:<class>'.
- <name> is the command line option name (e.g. ignore).
- <module> is a python module which implements the pre_scan hook specification.
- <class> is the class which is instantiated and whose appropriate method is invoked if this plugin is called. This needs to extend the plugincode.pre_scan.PreScanPlugin class.
The command line option for this plugin will be automatically created using the <class>’s docstring as its help text. Since there isn’t a single spot where pre-scan plugins can be plugged in, more methods can be added to the PreScanPlugin class to represent different hooks; say, to add or delete a scan there might be a method called process_scan.
If a plugin’s option is passed by the user, then the <class> is instantiated with the user input and its appropriate aforementioned methods are called.
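The class-based shape can be sketched as follows. The base class here is only a stand-in so that the example runs on its own; the real plugincode.pre_scan.PreScanPlugin interface is not shown in this report, and the is_ignored method name is hypothetical.

```python
import fnmatch


class PreScanPlugin(object):
    """Stand-in for plugincode.pre_scan.PreScanPlugin (real interface not shown)."""

    def __init__(self, user_input):
        # The class is instantiated with the value the user gave on the command line.
        self.user_input = user_input


class IgnorePlugin(PreScanPlugin):
    """Ignore files matching <pattern>."""

    def is_ignored(self, path):
        # Hypothetical hook: glob-style match of the path against the pattern.
        return fnmatch.fnmatchcase(path, self.user_input)
```

The class docstring is what would be reused as the help text of the generated command line option.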
4. Scan (proper):¶
In this stage, the plugins are supposed to run before the scan starts and after the pre-scan plugins are called. These plugins would have been used for:
- adding or deleting scans
- adding dependency scans (whose data could be used in other scans)
No development has been done for this stage, but it will be quite similar to pre-scan.
5. Other work:¶
Here, the goal was to add command line options to pre-defined groups such that they are displayed in their respective groups when scancode -h or scancode --help is called. This helped to better visually represent the command line options and to determine more easily what context they belong to.
Add a Resource class to hold all scanned info (Ongoing)
Here, the goal was to create a Resource class, such that it holds all the scanned data for a resource (i.e. a file or a directory). This class would go on to eventually encapsulate the caching logic entirely. For now, it just holds the info and path of a resource.
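A class of that kind could be as small as the following sketch. It only illustrates the idea of holding a path and its info together; the constructor signature and the use of a dict for info are assumptions, not the actual ScanCode class.

```python
class Resource(object):
    """Hold the scanned data for one file or directory (illustrative sketch)."""

    def __init__(self, path, info=None):
        self.path = path
        # Copy the mapping so that callers cannot mutate it by accident.
        self.info = dict(info or {})

    def __repr__(self):
        return "Resource(path={!r}, info={!r})".format(self.path, self.info)
```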
Google Summer of Code 2019 - Final report¶
Overview¶
Problem: Python 2.7 will retire in a few months and will no longer be maintained.
Solution: Scancode needs to be ported to Python 3 and all test suites must pass on both versions of Python. The main difference that makes Python 3 better than Python 2.x is that the support for unicode is greatly improved in Python 3. This will also be useful for scancode as scancode has users in more than 100 languages and it’s easy to translate strings from unicode to other languages.
Objective: To make scancode-toolkit installable on Python 3.6 and higher, as presently it installs with Python 2.7 only.
Implementation¶
The work started in development mode (editable mode) and was then moved to virtual environments.
I worked module by module, following the dependency hierarchy of the modules. For example, all modules depend on commoncode, so it had to be ported first. This gave the following porting order:
- commoncode
- plugincode
- typecode
- extractcode
- textcode
- scancode basics (some tests are integration tests and will have to wait to be ported)
- formattedcode, starting with JSON (some tests are integration tests and will have to wait to be ported)
- cluecode
- licensedcode
- packagedcode (depends on licensedcode)
- summarycode
- fixup the remaining bits and tests
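Ordering modules by their dependencies is a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9+) on a deliberately partial graph: it only encodes the two facts stated above (every module depends on commoncode, and packagedcode depends on licensedcode). The remaining edges of the real dependency graph are not given here, so the result is an illustration and not the project's actual porting plan.

```python
from graphlib import TopologicalSorter

# module -> set of modules it depends on (partial, illustrative graph)
DEPENDENCIES = {
    "commoncode": set(),
    "plugincode": {"commoncode"},
    "typecode": {"commoncode"},
    "licensedcode": {"commoncode"},
    "packagedcode": {"commoncode", "licensedcode"},
}


def porting_order(dependencies):
    """Return the modules ordered so that dependencies come before dependents."""
    return list(TopologicalSorter(dependencies).static_order())
```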
After porting each module, I marked it as ported with the scanpy3 marker, with the help of the conftest plugin (created by @pombredanne). The conftest plugin is the heart of this project; without it, the port would have been very difficult. Dependencies were fixed at the time of porting the module where they were used.
Challenging part of Project¶
It is very difficult to deal with paths on different operating systems. The issue is around macOS/Windows/Linux. The first two OSes handle unicode paths comfortably on Python 2 and 3, but not completely on macOS Mojave because its filesystem is APFS. Linux paths are bytes and os.listdir is broken on Python 2. As a result you can only sanely handle Linux paths as bytes on Python 2. But on Python 3, paths seem to be handled correctly as unicode on Linux.
For more details visit here :
We came up with various solutions:
- To use pathlib, which generally handles paths correctly across platforms, with pathlib2 as a backport. But this solution also fails because pathlib2 does not work as expected with regard to unicode vs. bytes, and os.listdir also doesn’t work properly.
- To use path.py, which handles paths across all the platforms, even on macOS Mojave.
- Use bytes on linux and python 3 and unicode everywhere.
We chose the third solution because it is the most fundamental, simple, and easy to use.
Project was tracked in this ticket nexB/scancode-toolkit#295
Project link : Port Scancode to Python 3
My contribution : List of Commits
Note : Please give your feedback here
Outcome¶
Now we have liftoff on Python 3. We are able to run basic scans without errors on the develop branch. You can check it by running scancode -clipeu samples/ --json-pp - -n4.
Finally, I would like to thank my mentor @pombredanne, aka Philippe Ombredanne. He helped a lot in completing this project and was very supportive and responsive. I have learned a lot from him. Thanks to his encouragement and motivation, I am improving day by day, building and developing my skills. I have completed all the tasks that were in the scope of this GSoC project.
Miscellaneous¶
FAQ¶
Why ScanCode?¶
We could not find an existing tool (open source or commercial) meeting our needs:
- usable from the command line or as library
- running on Linux, Mac and Windows
- written in a higher level language such as Python
- easy to extend and evolve
Can licenses be synchronized with the DejaCode license library?¶
The license keys are the same that are used in DejaCode. They are kept in sync by hand in the short term. There is also a ticket to automate that sync with DejaCode and possibly other sources. See https://github.com/nexB/scancode-toolkit/issues/41
How is ScanCode different from licensecheck?¶
At a high level, ScanCode detects more licenses and copyrights than licensecheck does, reporting more details about the matches. It is likely slower.
In more detail: ScanCode is a Python app using a data-driven approach (as opposed to carefully crafted regex):
- for license scan, the detection is based on a (large) number of license full texts (~900) and license notices/rules (~1800) and is data driven as opposed to regex-driven. It detects exactly where in a file a license text is found. Just throw in more license texts to improve the detection.
- for copyright scan, the approach is natural language parsing (using NLTK) with POS tagging and a grammar; it has a few thousand tests.
- licenses and copyrights are detected in texts and binaries
Licensecheck (available here for reference: https://metacpan.org/release/App-Licensecheck ) is a Perl script using hand-crafted regex patterns to find typical copyright statements and about 50 common licenses. There are about 50 license detection tests.
A quick test (in July 2015, before a major refactoring, but for this notice still valid) shows several things that are not detected by licensecheck that are detected by ScanCode.
How can I integrate ScanCode in my application?¶
More specifically, does this tool provide an API which can be used by us for the integration with my system to trigger the license check and to use the result?
In terms of API, there are two stable entry points:
- The JSON output when you use it as a command line tool from any language or when you call the scancode.cli.scancode function from a Python script.
- Otherwise the scancode.cli.api module provides a simple function if you are only interested in calling a certain service on a given file (such as license detection or copyright detection)
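For the command line route, a calling application can run the tool as a subprocess and parse the JSON it prints. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the options are copied from the example invocation scancode -clipeu samples/ --json-pp - -n4 used elsewhere in this documentation and may differ between versions, and it assumes that scancode is on the PATH and that standard output contains only the JSON document.

```python
import json
import subprocess


def build_scan_command(path, processes=4):
    """Build the argument list for a scan that prints pretty JSON to stdout."""
    # Options copied from the documented example invocation; they may vary by version.
    return ["scancode", "-clipeu", path, "--json-pp", "-", "-n{}".format(processes)]


def run_scan(path, processes=4):
    """Run the scan and return the parsed JSON results (requires scancode installed)."""
    completed = subprocess.run(
        build_scan_command(path, processes),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout.decode("utf-8"))
```

Keeping the command construction separate from the execution makes it easy to adapt the options without touching the parsing code.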
Can I install ScanCode in a Unicode path?¶
Not for now. See https://github.com/nexB/scancode-toolkit/issues/867. There is a bug in virtualenv on Python 2: https://github.com/pypa/virtualenv/issues/457. At this stage, and until we have completed the migration to Python 3, there is no way out but to use a path that contains only ASCII characters.
The line numbers for a copyright found in a binary are weird. What do they mean?¶
When scanning binaries, the line numbers are just a relative indication of where a detection was found: there is no such thing as lines in a binary. The numbers reported are based on the strings extracted from the binaries, typically broken as new lines with each NULL character. They can be safely ignored.
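The idea can be illustrated with a simplified Python sketch that splits a binary blob on NULL bytes and numbers the resulting chunks. This is not ScanCode's actual string extraction; the function name numbered_strings and the ASCII-only decoding are assumptions made for the example.

```python
def numbered_strings(data):
    """Yield (line_number, text) for the NULL-separated chunks of a binary blob."""
    for number, chunk in enumerate(data.split(b"\x00"), start=1):
        # Drop non-ASCII bytes and surrounding whitespace; skip empty chunks.
        text = chunk.decode("ascii", errors="ignore").strip()
        if text:
            yield number, text
```

Empty chunks still advance the counter, which is why the reported numbers do not correspond to anything a reader would recognize as lines.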
Support¶
Documentation¶
The ScanCode toolkit documentation lives at aboutcode.readthedocs.io/en/latest/scancode-toolkit/.
Issue Tracker¶
Post questions and bugs as GitHub tickets at: https://github.com/nexB/scancode-toolkit/issues
StackOverflow¶
Ask questions on StackOverflow using the [scancode] tag.
Talk to the Developers¶
Join our Gitter Channel to talk with the developers of ScanCode Toolkit.
Documentation¶
For more information on the documentation or to leave feedback, email aboutCode@groups.io, or leave a message on our Docs Channel.
Runtime Performance Reports¶
These are reports of runtimes for real life scans:
2015-09-03 by @rrjohnston
- On Ubuntu 12.04 x86_64 Python 2.7.3 and ScanCode Version 1.3.1
- Specs: 40 threads (2 processors, 10 cores each, with hyperthreading) 3.1 GHz 128GB RAM 8TB controller RAID5
- scanned 195676 files in about 16.7 hours, or about 3.25 files per second (using the default license and copyright scans)
- notes: this version of ScanCode runs on a single thread, so it does not make good use of the extra processing power.