Definition: Fullnames, names and tags of directories and files
In order to keep track of data sets, samples, chip types, etc. between sessions, the aroma.* packages have well-defined rules for how to name and structure data files and data sets. In order to explain how this works, we have to define a few terms.
Pathname
A pathname, which consists of a path followed by a filename, refers to the string identifying a file. Formally, we say the format is:
<pathname> = <path>/<filename>
For instance, for a file with pathname /rawData/BooA_2007/Mapping250K_Nsp/C358,2007-01-24.cel, the path is the directory /rawData/BooA_2007/Mapping250K_Nsp/ and the filename is C358,2007-01-24.cel.
Filename
In turn, we say that a filename consists of a fullname, a dot, and a filename extension, e.g. the above file has fullname "C358,2007-01-24" and filename extension "cel". The format for this is:
<filename> = <fullname>.<extension>
Footnote: The term "fullname" is our convention. In the *nix world the term "basename" is used for the same thing, but in R that term is the same as the filename. To avoid ambiguity, we decided to use the term fullname.
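The two formats above can be illustrated with a minimal sketch in Python. This is not part of the aroma.* packages; the function name split_pathname is hypothetical, and we assume that the extension is whatever follows the last dot of the filename, which the definition above does not spell out for filenames with several dots.

```python
import posixpath

def split_pathname(pathname):
    """Split a pathname into (path, filename, fullname, extension)."""
    # <pathname> = <path>/<filename>
    path, filename = posixpath.split(pathname)
    # <filename> = <fullname>.<extension>
    # Assumption: the extension is the part after the LAST dot.
    fullname, dot, extension = filename.rpartition(".")
    if not dot:
        # No dot at all: treat the whole filename as the fullname.
        fullname, extension = filename, ""
    return path, filename, fullname, extension

parts = split_pathname("/rawData/BooA_2007/Mapping250K_Nsp/C358,2007-01-24.cel")
print(parts)
```

Note that posixpath.split() returns the path without the trailing slash, so the path comes back as /rawData/BooA_2007/Mapping250K_Nsp.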
Sample names, tags and filenames
Data files typically have filenames that reveal the name of the sample plus extra information such as the chip type, the lab, the hybridization date, and so on, followed by the filename extension.
Example
Here are some examples from randomly picked data sets:
- MCF7_Hind.CEL,
- MCF7_Xba.CEL,
- C358 Nsp 24-01-2007 250K.CEL, and
- NA06985_Hind_B5_3005533.CEL.
As humans, we infer that the first two files refer to CEL data for a sample "MCF7" hybridized to chip types "Hind" and "Xba", that the third file is sample "C358" hybridized to a "250K" "Nsp" chip on January 24, 2007, and that the last file is sample "NA06985" hybridized to a "Hind" chip with an extra tag "B5_3005533", whose meaning we do not know.
Fullname
In order for the computer to know which part of the filename refers to the sample name and which part is extra information, we define the terms "name" and "tag", where the name is typically either a sample name or a data set name. We then constrain fullnames to have the following format:
<fullname> = <name>(,<tag>)*
This format is read as "the fullname consists of a name followed by zero or more (the *) comma-separated tags". Because of this format, sample names and tags cannot contain commas.
Footnote: A comma is a legal character in all file systems we know of, and it has been used successfully for a long time in, for instance, the CVS version control system.
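As an illustration of the fullname format, here is a minimal Python sketch that splits a fullname into its name and tags and, conversely, builds a fullname from them. The function names split_fullname and join_fullname are hypothetical and are not part of the aroma.* API.

```python
def split_fullname(fullname):
    """Return (name, tags) for <fullname> = <name>(,<tag>)*."""
    # The first comma-separated part is the name; the rest are tags.
    name, *tags = fullname.split(",")
    return name, tags

def join_fullname(name, tags=()):
    """Build a fullname, rejecting names and tags that contain commas."""
    for part in (name, *tags):
        if "," in part:
            raise ValueError(f"commas are not allowed in names or tags: {part!r}")
    return ",".join((name, *tags))

print(split_fullname("C358,2007-01-24"))
print(join_fullname("MCF7", ["Hind"]))
```

A fullname without any commas, such as "MCF7", yields the name "MCF7" and an empty list of tags.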
Example continued
To follow the above format, we rename the above four CEL files as:
- MCF7,Hind.CEL,
- MCF7,Xba.CEL,
- C358,Nsp,24-01-2007,250K.CEL, and
- NA06985,Hind,B5_3005533.CEL.
With this filename format, the package can identify the sample names unambiguously.
Directory names
Analogously to a filename, a directory name consists of a name followed by optional tags:
<dirname> = <fullname>
<path> = <path>/<dirname>
In aroma.*, each data set has its own unique path, and its fullname is inferred from the directory name. For example, the directory name "Affymetrix_2006-500k,ACT,QN" refers to the data set name "Affymetrix_2006-500k" with tags "ACT" and "QN".
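The same idea applied to directories can be sketched in Python as follows. The function parse_dirname and the parent directory "someRoot" are hypothetical and only serve the illustration. Unlike a filename, a directory name carries no extension, so nothing is stripped at a dot.

```python
import posixpath

def parse_dirname(path):
    """Return (name, tags) inferred from the last directory of a path."""
    # normpath() drops a trailing slash so that basename() returns
    # the last directory name rather than an empty string.
    dirname = posixpath.basename(posixpath.normpath(path))
    # <dirname> = <fullname> = <name>(,<tag>)*
    name, *tags = dirname.split(",")
    return name, tags

# "someRoot" is a made-up parent directory for this example.
print(parse_dirname("someRoot/Affymetrix_2006-500k,ACT,QN/"))
```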
Formal grammar
Generic to all (R.filesets) data sets:
<pathname> = <path>/<filename>
<path> = <path>/<dirname>
<dirname> = <fullname>
<filename> = <fullname>.<extension>
<fullname> = <name>(,<tag>)*
Specific to all aroma.* data files:
<dataPathname> = <rootPath>/<dataSet>/<chipType>/<dataFilename>
<dataFilename> = <sampleName>(,<tag>)*.<extension>
<rootPath> = ./<dirname>
<dataSet> = <dirname>
<chipType> = <dirname>
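To make the aroma.*-specific grammar concrete, the following Python sketch decomposes a data pathname into its components. It is only an illustration under stated assumptions: the function names and the keys of the returned dictionary are hypothetical, the root path is assumed to be a single directory, and the extension is assumed to follow the last dot of the filename.

```python
from pathlib import PurePosixPath

def split_fullname(fullname):
    """Return (name, tags) for <fullname> = <name>(,<tag>)*."""
    name, *tags = fullname.split(",")
    return name, tags

def parse_data_pathname(pathname):
    """Decompose <rootPath>/<dataSet>/<chipType>/<dataFilename>."""
    parts = PurePosixPath(pathname).parts
    if len(parts) < 4:
        raise ValueError("expected <rootPath>/<dataSet>/<chipType>/<dataFilename>")
    # Assumption: the last four components are the ones in the grammar.
    root_path, data_set, chip_type, filename = parts[-4:]
    # Assumption: the extension is the part after the last dot.
    fullname, dot, extension = filename.rpartition(".")
    if not dot:
        fullname, extension = filename, ""
    sample_name, sample_tags = split_fullname(fullname)
    data_set_name, data_set_tags = split_fullname(data_set)
    return {
        "rootPath": root_path,
        "dataSet": data_set_name,
        "dataSetTags": data_set_tags,
        "chipType": chip_type,
        "sampleName": sample_name,
        "tags": sample_tags,
        "extension": extension,
    }

info = parse_data_pathname("rawData/BooA_2007/Mapping250K_Nsp/C358,2007-01-24.cel")
for key, value in info.items():
    print(f"{key}: {value}")
```

For the example pathname used earlier, this gives the data set "BooA_2007", the chip type "Mapping250K_Nsp", the sample name "C358" and the single tag "2007-01-24".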