Files
In Unix, everything is a file; see page 27 of Unix Power Tools.
Line Endings
When manipulating files in Unix, remember that Unix text files have a newline \n at the end of each line. Windows files, in contrast, have a carriage return \r followed immediately by a newline \n. Macintosh files historically (more than 20 years ago) used only a carriage return \r, but the Mac operating system has been Unix based since 2001, so Macs now use a newline \n just like Unix. Sometimes it is necessary to convert files from one type to another and, in particular, to change the characters at the end of each line of text.
For example, to remove the carriage return from a Windows file, so that it can be used on a Mac or Unix machine, you can use:
tr -d '\015' <mypcfile.txt >myunixfile.txt
(See the discussion at the end of section 21.11 in Unix Power Tools. Note that the book was last updated in 2003 and still has outdated information about Mac files: the Mac switched to Unix in 2001, so both now use the newline character \n at the end of lines of text.)
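As a quick check that the tr command above does what is described, here is a minimal sketch. It builds a small Windows-style file in a temporary directory, counts the carriage returns before and after conversion, and then cleans up. The temporary directory and the variable names cr_before and cr_after are only for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small Windows-style file: each line ends in \r\n.
printf 'first line\r\nsecond line\r\n' > mypcfile.txt

# Count the carriage returns before conversion (tr -cd keeps only \r).
cr_before=$(tr -cd '\r' < mypcfile.txt | wc -c | tr -d ' ')

# Remove every carriage return (octal 015), as in the command above.
tr -d '\015' < mypcfile.txt > myunixfile.txt

# Count again: the Unix-style copy should contain none.
cr_after=$(tr -cd '\r' < myunixfile.txt | wc -c | tr -d ' ')

echo "carriage returns before: $cr_before, after: $cr_after"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```

Here '\015' is the octal code for the carriage return, so it names the same character as '\r'.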
Naming files in Unix
It is advisable to use only letters, numbers, underscores, and periods in file names in Unix.
Please try to avoid putting spaces in Unix filenames. A space in a filename leads to difficulties, for instance, needing to put double quotes around the file name, as in "this looks ugly.txt", or needing to put a backslash before each space, as in my\ silly\ file\ name.txt
Filenames are also case sensitive. In other words, thisismyfile.txt and ThisIsMyFile.txt are different files. You will get comfortable with file naming conventions and will develop your own style. Dr Ward likes to use (mostly) lowercase letters in his file names, for instance, myexample.txt. Some people like to use camel case (also called camel back) notation, in which the first word in a file name is not capitalized, but the other words are capitalized, for instance, thisIsMyFile.txt.
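To see why spaces in file names cause trouble, here is a small sketch that assumes sh or bash (other shells, such as zsh, split unquoted variables differently). The helper function count_args and the variable names are made up for this illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: prints how many arguments it received.
count_args() {
    echo "$#"
}

name="this looks ugly.txt"

# Quoted: the shell passes the whole name as one argument.
touch "$name"
quoted_count=$(count_args "$name")

# Unquoted: sh and bash split the name at each space,
# so a command would see three separate (wrong) file names.
unquoted_count=$(count_args $name)

# Only one file was actually created, by the quoted touch.
file_count=$(ls | wc -l | tr -d ' ')

echo "quoted: $quoted_count argument, unquoted: $unquoted_count arguments, files: $file_count"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```

A name such as this_looks_fine.txt avoids the problem entirely, which is the reason for the advice above.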
File Extensions
Unix usually does not treat files differently based on their file extension. Nonetheless, especially when working with data, it can be helpful to name files in such a way that a user can guess what kind of data is inside the file.
For instance:
- .csv for comma-separated text files (i.e., with commas between text fields)
- .tsv for tab-separated text files (i.e., with tabs between text fields)
- .txt for text files with other separators or without any separators
- .db for database files
- .sh for bash shell scripts
- .gz, .tar, .tar.gz, .tgz, .z, .Z, .zip for compressed or archived files
- .html, .htm, .xhtml, .xml for HTML or XML files
- .ps for PostScript files
- .pdf for PDF files
- .c or .cpp for C or C++ source files, respectively
- a.out is the default name for executable files created by C and C++ compilers
- file extensions are usually not used with directory names (in other words, it looks weird to use a period in a directory name)
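The point that the extension is only a hint for people can be shown with a short sketch. The file name notreallydata.csv is invented for this example. It holds a one-line shell script despite its .csv extension, and sh runs it anyway.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a tiny shell script, but give it a misleading .csv extension.
printf 'echo "hello from a script"\n' > notreallydata.csv

# sh runs it anyway: the contents matter, not the extension.
script_output=$(sh notreallydata.csv)
echo "$script_output"

# A sensible extension still helps people guess the contents.
# Here is a file that really is comma-separated.
printf 'state,year\nIN,1999\n' > mystatedata.csv

# Count the fields in the header line by turning commas into newlines.
field_count=$(head -n 1 mystatedata.csv | tr ',' '\n' | wc -l | tr -d ' ')
echo "fields in header: $field_count"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```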
Searching for files
When using ls or mv or grep or other shell commands to work with files, it is possible to identify multiple files using wildcards. A question mark ? matches any one character. For example:
rm drwardfile?.csv
would remove (for instance) the files drwardfile1.csv through drwardfile5.csv and also the files drwardfileA.csv and drwardfileB.csv and drwardfileC.csv etc.
An asterisk * matches any sequence of characters, including no characters at all. For instance, if we type
mv mystate*data.txt staterepository
we could move the files mystateINdata.txt and mystateILdata.txt and mystateOHdata.txt and mystateMIdata.txt etc. into the directory staterepository. Such a command would also move any files named, for instance, mystate1999data.txt or mystatedata.txt or mystateasdFGhJKl1234data.txt into the directory staterepository too.
Similarly, rm *.csv would remove all files with the extension .csv at the end. The command rm * removes all files in the current directory (except hidden files, whose names begin with a period), so use it with great care.
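Because rm and mv act immediately, it can help to preview what a wildcard matches before using it. The sketch below is one way to do that with ls in a temporary directory. The extra file names drwardfile10.csv and otherdata.txt are made up to show what does not match.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir staterepository
touch drwardfile1.csv drwardfileA.csv drwardfile10.csv
touch mystateINdata.txt mystatedata.txt mystate1999data.txt otherdata.txt

# Preview the ? pattern: it matches exactly one character,
# so drwardfile10.csv (two characters before .csv) is not listed.
ls drwardfile?.csv
question_count=$(ls drwardfile?.csv | wc -l | tr -d ' ')

# Preview the * pattern: it matches IN, 1999, and also nothing at all.
ls mystate*data.txt
star_count=$(ls mystate*data.txt | wc -l | tr -d ' ')

# Once the preview looks right, run the real command.
mv mystate*data.txt staterepository
moved_count=$(ls staterepository | wc -l | tr -d ' ')

# otherdata.txt does not start with "mystate", so it stays behind.
left_behind=$(ls *.txt)
echo "moved $moved_count files, left behind: $left_behind"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```

The shell expands the pattern before the command runs, so ls, mv, and rm all see the same list of names.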
It is also possible to match specific characters or ranges of characters by putting them in square brackets. For instance, mv myfiles199[0-9].txt ninetiesfolder will move all files from myfiles1990.txt through myfiles1999.txt into the directory ninetiesfolder.
Recently Dr Ward wanted to remove all files ending with .csv and .CSV and any mix of upper and lower case in between, for instance, .Csv, and he used rm *.[Cc][Ss][Vv] as one way to remove all of these files at once.
See pages 658 through 660 in section 33.2 of Unix Power Tools for more examples.
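The bracket patterns can be tried safely in a temporary directory. In this sketch, the files myfiles1989.txt, myfiles2000.txt, and d.txt are invented to show which names the patterns leave alone.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir ninetiesfolder
touch myfiles1989.txt myfiles1990.txt myfiles1995.txt myfiles1999.txt myfiles2000.txt
touch a.csv b.CSV c.Csv d.txt

# [0-9] matches exactly one digit, so only 1990 through 1999 can match.
mv myfiles199[0-9].txt ninetiesfolder
nineties_count=$(ls ninetiesfolder | wc -l | tr -d ' ')

# Each bracket pair accepts either case of one letter.
rm *.[Cc][Ss][Vv]
csv_left=$(ls | grep -i '\.csv$' | wc -l | tr -d ' ')

# myfiles1989.txt, myfiles2000.txt, and d.txt are untouched.
txt_left=$(ls *.txt | wc -l | tr -d ' ')

echo "moved: $nineties_count, csv files left: $csv_left, txt files left: $txt_left"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```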
Unix stores files in a hierarchical structure, with directories (also called folders) and files or more directories inside, possibly over and over. The Mac and Windows operating systems have similar structures of directories (or folders) and files.
See Figure 1-3, "A Unix filesystem tree," in Section 1.14 of Unix Power Tools
![](./images/httpatomoreillycomsourceoreillyimages142646.png)
and Figure 1-4, "Part of a Unix filesystem tree," in Section 1.16 of Unix Power Tools
![](./images/httpatomoreillycomsourceoreillyimages142648.png)
Some examples from Anvil include:
/anvil/projects/tdm/data
is the directory for the data sets from The Data Mine projects, and
/home/x-mdw
is Dr Ward’s home directory.
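To get a feel for the hierarchy, the following sketch builds a small directory tree in a temporary location and walks through it. The names projects/tdm/data/flights and 1999.csv are hypothetical. They only imitate the shape of the Anvil paths above and do not refer to real data.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# mkdir -p creates every missing directory along the path in one step.
mkdir -p projects/tdm/data/flights
touch projects/tdm/data/flights/1999.csv

# find prints every directory and file beneath the starting point.
find projects | sort

# Count only the directories: projects, tdm, data, and flights.
dir_count=$(find projects -type d | wc -l | tr -d ' ')

# Move into a directory; pwd prints its full path from the root /.
cd projects/tdm/data || exit 1
data_dir=$(pwd)

# dirname strips the last component, giving the parent directory.
parent_dir=$(dirname "$data_dir")
echo "now in: $data_dir"
echo "parent: $parent_dir"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```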
Links to Files
Chapter 10 of the Unix Power Tools book has a great deal of information about how to establish links to files in Unix, including what happens when moving or copying files, and also information about how things work with links in the operating system behind-the-scenes. We do not focus on links to files here, but if you are interested, there are a variety of examples posted there to consider.
Differences between files
The diff and cmp commands can be used to compare files. For instance:
diff myfile1.txt myfile2.txt
shows any differences between these two files. The lines with < at the beginning show the content of myfile1.txt that is missing from myfile2.txt, and similarly, the lines with > at the beginning show the content of myfile2.txt that is missing from myfile1.txt.
If the two files are identical, then diff will not print any output.
On the other hand,
cmp myfile1.txt myfile2.txt
compares the two files byte by byte but only shows the location of the first difference between them. For this reason, cmp is often faster than diff. As with the diff command, the cmp command will not print any output if the two files are identical.
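Here is a small worked example of both commands, with three short files created just for this illustration. myfile3.txt is an extra file (an exact copy of myfile1.txt) added to show the identical case. The sketch uses cmp -s, which prints nothing and reports the result only through its exit status.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ncherry\n' > myfile1.txt
printf 'apple\nblueberry\ncherry\n' > myfile2.txt
cp myfile1.txt myfile3.txt

# diff exits with status 1 when the files differ, hence the "|| true".
diff_output=$(diff myfile1.txt myfile2.txt || true)
echo "$diff_output"
# Prints:
# 2c2
# < banana
# ---
# > blueberry

# Identical files produce no output at all.
identical_output=$(diff myfile1.txt myfile3.txt || true)

# cmp -s is silent; its exit status says whether the files match.
if cmp -s myfile1.txt myfile2.txt; then
    cmp_result="identical"
else
    cmp_result="different"
fi
echo "cmp says the first two files are $cmp_result"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```

The line 2c2 means that line 2 of the first file must be changed to become line 2 of the second file.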
less
The less utility is helpful for reading files when you know that you only want to view the file but not modify it.
quota
Unix systems usually have a way to easily see a user’s quota, i.e., how much space the user has available for saving files. On Anvil, users can check their quota by typing:
myquota
Right now, Dr Ward’s quota looks like this:
Type      Location      Size     Limit     Use     Files    Limit     Use
==============================================================================
home      x-mdw         1.9GB    25.0GB    8%      -        -         -
scratch   anvil         6.9GB    100.0TB   0.01%   0k       1,000k    0.00%
projects  x-cis220051   10.1TB   15.0TB    67%     3,084k   10,485k   29%
Users can store files in their home directories, up to a maximum of 25 GB. They can also store files in their scratch directories, up to a maximum of 100 TB. For instance, Dr Ward’s home directory and scratch directory are located at:
/home/x-mdw
and
/anvil/scratch/x-mdw
Scratch directories are sometimes erased at regular intervals. Users should store files that they want to keep in their home directory, and should store temporary files in their scratch directory (for instance, large data files that can be removed after analyzing them).
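The myquota command is specific to Anvil. On other Unix systems, one portable way to see how much space a directory is using is the standard du command, which is not part of the discussion above and is shown here only as a sketch. The file bigfile.dat is a made-up example, and the exact number reported depends on the filesystem.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Create a 1 MB file of zero bytes to have something to measure.
dd if=/dev/zero of="$workdir/bigfile.dat" bs=1024 count=1024 2>/dev/null

# du -s gives one total for the directory; -k reports it in kilobytes.
# The output is "<size><tab><path>", so cut keeps just the size.
usage_kb=$(du -sk "$workdir" | cut -f1)
echo "directory uses about $usage_kb KB"

# Clean up the throwaway directory.
rm -rf "$workdir"
```

Running du -sk on a home or scratch directory in the same way gives a rough figure to compare against the limits that myquota reports.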