Configuring MultiQC
Whilst most MultiQC settings can be specified on the command line, MultiQC is also able to parse system-wide and personal config files. At run time, it collects the configuration settings from the following places in this order (overwriting at each step if a conflicting config variable is found):
- Hardcoded defaults in MultiQC code
- System-wide config in <installation_dir>/multiqc_config.yaml
  - Manual installations only, not pip or conda
- User config in $XDG_CONFIG_HOME/multiqc_config.yaml (or ~/.config/multiqc_config.yaml if $XDG_CONFIG_HOME is not set)
- User config in ~/.multiqc_config.yaml
- File path set in environment variable $MULTIQC_CONFIG_PATH
  - For example, define this in your ~/.bashrc file and keep the file anywhere you like
- Environment variables prefixed with MULTIQC_
  - For example, $MULTIQC_TITLE, $MULTIQC_TEMPLATE - see docs below
- Config file in the current working directory: multiqc_config.yaml
- Config file paths specified in the command with --config / -c
  - You can specify multiple files like this, they can have any filename.
- Command line config (--cl-config)
- Specific command line options (e.g. --force)
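The precedence order above amounts to a layered merge in which each later source overrides keys set by an earlier one. The following is a minimal Python sketch of that idea, not MultiQC's actual loading code: the function name `merge_config_layers` and all sample values are hypothetical, and the merge shown is shallow, so nested settings may be handled differently in practice.

```python
def merge_config_layers(layers):
    """Merge config dicts in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        # Conflicting keys are overwritten at each step
        merged.update(layer)
    return merged


# Hypothetical values, listed from lowest to highest precedence
defaults = {"title": None, "template": "default", "force": False}
user_config = {"title": "My report"}
cwd_config = {"title": "Project X report"}
command_line = {"force": True}

config = merge_config_layers([defaults, user_config, cwd_config, command_line])
print(config)
```

Here the title from the working-directory config wins over the user config, and the command line option wins over the hardcoded default.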
Sample name cleaning
MultiQC typically generates sample names by taking the input or log file name, and 'cleaning' it.
Cleaning extensions
To do this, it uses the fn_clean_exts setting and looks for any matches. If it finds a match, the match and everything to its right are removed.
fn_clean_exts:
- ".gz"
- ".fastq"
Input | Cleaned sample name |
---|---|
mysample.fastq.gz | mysample |
secondsample.fastq.gz_trimming_log.txt | secondsample |
thirdsample.vcf.gz_report.txt | thirdsample.vcf |
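The truncation behaviour shown in the table can be approximated in a few lines of Python. This is a sketch under the assumption that each extension is checked in turn and the name is cut at its first occurrence; `clean_by_truncation` is a hypothetical helper, not a MultiQC function.

```python
def clean_by_truncation(name, exts):
    """Cut the name at the first occurrence of each extension in turn."""
    for ext in exts:
        if ext in name:
            # Drop the match and everything to its right
            name = name.split(ext, 1)[0]
    return name


exts = [".gz", ".fastq"]
for filename in [
    "mysample.fastq.gz",
    "secondsample.fastq.gz_trimming_log.txt",
    "thirdsample.vcf.gz_report.txt",
]:
    print(filename, "->", clean_by_truncation(filename, exts))
```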
To add to the MultiQC defaults instead of overwriting them, use extra_fn_clean_exts:
extra_fn_clean_exts:
- ".myformat"
- "_processedFile"
Trimming extensions
To remove a substring only if it is at the start or end of a sample name, rather than trimming it and everything after it, use fn_clean_trim.
fn_clean_trim:
- ".fastq.gz"
- "_report.txt"
Input | Cleaned sample name |
---|---|
mysample.fastq.gz | mysample |
secondsample.fastq.gz_trimming_log.txt | secondsample.fastq.gz_trimming_log.txt |
thirdsample.vcf.gz_report.txt | thirdsample.vcf.gz |
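For comparison with truncation, trimming only touches the two ends of the name. A minimal Python sketch follows, assuming each string is stripped once from the end and once from the start if present; `clean_by_trimming` is a hypothetical helper, not part of MultiQC.

```python
def clean_by_trimming(name, trims):
    """Strip each string only when it sits at the start or end of the name."""
    for trim in trims:
        if not trim:
            continue  # an empty string would otherwise wipe the name
        if name.endswith(trim):
            name = name[: -len(trim)]
        if name.startswith(trim):
            name = name[len(trim):]
    return name


trims = [".fastq.gz", "_report.txt"]
for filename in [
    "mysample.fastq.gz",
    "secondsample.fastq.gz_trimming_log.txt",
    "thirdsample.vcf.gz_report.txt",
]:
    print(filename, "->", clean_by_trimming(filename, trims))
```

The second filename is left unchanged because ".fastq.gz" appears in the middle of the name rather than at either end.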
Again, to add to the MultiQC defaults instead of overwriting them, use extra_fn_clean_trim:
extra_fn_clean_trim:
- "#"
- ".myext"
Other search types
If needed, you can specify different string matching methods in fn_clean_exts and extra_fn_clean_exts for more complex sample name cleaning:
truncate (default)
This is the default method as described above. The two examples below are equivalent:
extra_fn_clean_exts:
- ".fastq"

extra_fn_clean_exts:
- type: "truncate"
  pattern: ".fastq"
remove
The remove type allows you to remove an exact match from the filename.
This includes removing a substring within the middle of a sample name.
extra_fn_clean_exts:
- type: remove
  pattern: .sorted
Input | Cleaned sample name |
---|---|
secondsample.sorted.deduplicated | secondsample.deduplicated |
regex
You can also remove a substring with a regular expression. A useful website for writing and testing regexes is regex101.com.
extra_fn_clean_exts:
- type: regex
  pattern: "^processed."
Input | Cleaned sample name |
---|---|
processed.thirdsample.processed | thirdsample.processed |
regex_keep
If you would rather keep only the match of a regular expression and discard everything else, you can use the regex_keep type.
This is particularly useful if you have predictable sample names or identifiers.
extra_fn_clean_exts:
- type: regex_keep
  pattern: "[A-Z]{3}[1-9]{2}"
Input | Cleaned sample name |
---|---|
merged.recalibrated.XZY97.alignment.bam | XZY97 |
module
This key will tell MultiQC to only apply the pattern to a specific MultiQC module.
This should be a string that matches the module's anchor - the #module bit when you click the main module heading in the sidebar (remove the #).
For example, to truncate all sample names to 5 characters for just Kallisto:
extra_fn_clean_exts:
- type: regex_keep
  pattern: "^.{5}"
  module: kallisto
You can also supply a list of multiple module anchors if you wish:
extra_fn_clean_exts:
- type: regex_keep
  pattern: "^.{5}"
  module:
    - kallisto
    - cutadapt
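To make the differences between the search types concrete, the sketch below applies a single cleaning entry to a sample name in Python. It is an illustration of the behaviour described above, not MultiQC's implementation: the function `apply_clean_pattern` and its `module` argument are hypothetical, and it assumes that a plain string is shorthand for the truncate type.

```python
import re


def apply_clean_pattern(name, entry, module=None):
    """Apply one cleaning entry (a string or a dict) to a sample name."""
    # Plain strings are shorthand for the default "truncate" type
    if isinstance(entry, str):
        entry = {"type": "truncate", "pattern": entry}

    # Skip the entry if it is restricted to other modules
    only_for = entry.get("module")
    if only_for is not None:
        if isinstance(only_for, str):
            only_for = [only_for]
        if module not in only_for:
            return name

    kind = entry.get("type", "truncate")
    pattern = entry["pattern"]
    if kind == "truncate":
        # Drop the match and everything after it
        return name.split(pattern, 1)[0]
    if kind == "remove":
        # Drop every exact occurrence, wherever it sits
        return name.replace(pattern, "")
    if kind == "regex":
        # Drop whatever the regular expression matches
        return re.sub(pattern, "", name)
    if kind == "regex_keep":
        # Keep only the first match, or the whole name if nothing matches
        match = re.search(pattern, name)
        return match.group(0) if match else name
    raise ValueError(f"Unknown cleaning type: {kind}")


print(apply_clean_pattern(
    "merged.recalibrated.XZY97.alignment.bam",
    {"type": "regex_keep", "pattern": "[A-Z]{3}[1-9]{2}"},
))
```

Falling back to the unchanged name when a regex_keep pattern does not match is a choice made for this sketch only.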
Clashing sample names
This process of cleaning sample names can sometimes result in exact duplicates.
A duplicate sample name will overwrite previous results. Warnings showing these events
can be seen with verbose logging using the --verbose / -v flag, or in multiqc_data/multiqc.log.
Problems caused by this will typically be discovered by seeing fewer results than expected. If you're ever
unsure about where the data within a MultiQC report comes from, have a look at
multiqc_data/multiqc_sources.txt, which lists the path to the file used for every section
of the report.
Directory names
One scenario where clashing names can occur is when the same file is processed in different directories.
For example, if sample_1.fastq is processed with four sets of parameters in four different
directories, they will all have the same name - sample_1. Only the last will be shown.
If the directories are different, this can be avoided with the --dirs / -d flag.
For example, given the following files:
├── analysis_1
│   └── sample_1.fastq.gz.aligned.log
├── analysis_2
│   └── sample_1.fastq.gz.aligned.log
└── analysis_3
    └── sample_1.fastq.gz.aligned.log
Running multiqc -d . will give the following sample names:
analysis_1 | sample_1
analysis_2 | sample_1
analysis_3 | sample_1
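The clash and its fix can be reproduced with a small Python sketch. It assumes results are stored in a dictionary keyed by sample name, so a repeated key overwrites the earlier entry; `sample_name` is a hypothetical helper and the `.split(".fastq", 1)` call is only a stand-in for the real cleaning step.

```python
from pathlib import PurePosixPath


def sample_name(path, prepend_dir=False):
    """Derive a sample name from a path, optionally prefixed by its directory."""
    p = PurePosixPath(path)
    name = p.name.split(".fastq", 1)[0]  # stand-in for sample name cleaning
    if prepend_dir:
        name = f"{p.parent.name} | {name}"
    return name


paths = [
    "analysis_1/sample_1.fastq.gz.aligned.log",
    "analysis_2/sample_1.fastq.gz.aligned.log",
    "analysis_3/sample_1.fastq.gz.aligned.log",
]

# Without directory names, every file maps to the same key and the last one wins
plain = {sample_name(p): p for p in paths}
# With directory names prepended, each file keeps its own entry
with_dirs = {sample_name(p, prepend_dir=True): p for p in paths}

print(plain)
print(list(with_dirs))
```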
Filename truncation
If the problem is with filename truncation, you can also use the --fullnames / -s flag,
which disables all sample name cleaning. For example:
├── sample_1.fastq.gz.aligned.log
└── sample_1.fastq.gz.subsampled.fastq.gz.aligned.log
Running multiqc -s . will give the following sample names:
sample_1.fastq.gz.aligned.log
sample_1.fastq.gz.subsampled.fastq.gz.aligned.log
You can turn off sample name cleaning permanently by setting fn_clean_sample_names to false in your config file.
Module search patterns
Many bioinformatics tools have standard output formats, filenames and other
signatures. MultiQC uses these to find output; for example, the FastQC module
looks for files that end in _fastqc.zip.
This works well most of the time, until someone has an automated processing
pipeline that renames things. For this reason, as of version v0.3.2 of MultiQC,
the file search patterns are loaded as part of the main config. This means that
they can be overwritten in <installation_dir>/multiqc_config.yaml or
~/.multiqc_config.yaml. So if you always rename your _fastqc.zip files to
_qccheck.zip, MultiQC can still work.
To see the default search patterns, check a given module in the MultiQC documentation.
Each module has its search patterns listed beneath any free-text docs.
Alternatively, see the search_patterns.yaml file in the MultiQC source code. Copy the section for the program that you want to modify and paste it
into your config file. Make sure you make it part of a dictionary called sp as follows:
sp:
  mqc_module:
    fn: _mysearch.txt
Search patterns can specify a filename match (fn) or a file contents
match (contents), as well as a number of additional search keys.
See below for the full reference.
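As a rough illustration of how a search pattern with fn and contents keys could be tested against a file, consider the Python sketch below. It rests on two assumptions that should be checked against the full reference: that fn is treated as a shell-style glob and that contents is a plain substring match. The function `file_matches` and the example patterns are hypothetical.

```python
import fnmatch


def file_matches(pattern, filename, contents=""):
    """Return True if every key present in the search pattern matches."""
    # Assumption: "fn" is a shell-style glob applied to the filename
    if "fn" in pattern and not fnmatch.fnmatch(filename, pattern["fn"]):
        return False
    # Assumption: "contents" is a substring that must appear in the file
    if "contents" in pattern and pattern["contents"] not in contents:
        return False
    return True


# Hypothetical override for a pipeline that renames FastQC output
renamed_fastqc = {"fn": "*_qccheck.zip"}
print(file_matches(renamed_fastqc, "sample_1_qccheck.zip"))
print(file_matches(renamed_fastqc, "sample_1_fastqc.zip"))
```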
Using log filenames as sample names
A substantial number of MultiQC modules take the sample name identifiers that you
see in the report from the file contents - typically the filename of the input file.
This is because log files can often be called things like mytool.log or even concatenated.
Using the input filename used by the tool is typically safer and more consistent across modules.
However, sometimes this does not work well, for example if the input filename is not relevant (e.g. when using a temporary file, FIFO, process substitution or stdin). In these cases your log files may have useful filenames, but MultiQC will not be using them.
To force MultiQC to use the log filename as the sample identifier, you can use the
--fn_as_s_name command line flag or set the use_filename_as_sample_name config option:
use_filename_as_sample_name: true
This affects all modules and all search patterns. If you want to limit this to one or more specific search patterns, you can do so by giving a list:
use_filename_as_sample_name:
- cutadapt
- picard/gcbias
- picard/markdups
Note that this should be the search pattern key and not just the module name. This is because some modules search for multiple files.
The log filename will still be cleaned. To use the raw log filenames,
combine with the --fullnames / -s flag or the fn_clean_sample_names config option described above.
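The two forms of use_filename_as_sample_name (a boolean or a list of search pattern keys) can be read as the decision sketched below in Python. The helper `use_log_filename` is hypothetical and only mirrors the behaviour described above.

```python
def use_log_filename(setting, search_pattern_key):
    """Decide whether the log filename should be used for this search pattern."""
    # A boolean applies to all modules and all search patterns
    if isinstance(setting, bool):
        return setting
    # A list limits the behaviour to the listed search pattern keys
    return search_pattern_key in (setting or [])


setting = ["cutadapt", "picard/gcbias", "picard/markdups"]
print(use_log_filename(setting, "picard/gcbias"))
print(use_log_filename(setting, "picard"))
```

The second call shows why the search pattern key matters: the module name "picard" alone is not in the list, so it does not match.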
Ignoring Files
MultiQC begins by indexing all of the files that you specified and building a list
of the ones it will use. You can specify files and directories to skip on the command
line using -x / --ignore, or, for a more permanent setting, with the following config file
options: fn_ignore_files, fn_ignore_dirs and fn_ignore_paths (the command line
option simply adds to all of these).
For example, given the following files:
├── analysis_1
│ └── sample_1.fastq.gz.aligned.log
├── analysis_2
│ └── sample_1.fastq.gz.aligned.log