Submission Guidelines
These guidelines provide depositors with the technical information needed to prepare data for submission. If these guidelines and the linked documents do not answer all your questions or otherwise do not meet your needs, please contact the LAC helpdesk. General recommendations can be found in this article.
Ethical and Legal issues
The archive is receptive to complaints from consultants, speech communities and their representatives and expects researchers to respect the cultural norms of the represented individuals and cultures.
The archive reserves the right to temporarily depublish data if their ethical or legal status is disputed, and may permanently remove data if they derive from unethical practices.
Informed Consent
The depositor guarantees that clear, unambiguous and informed consent was collected from the speakers represented in the collection and that the research conforms with the ethical requirements for research procedures of the involved institutions and funding bodies. The depositor ensures that a culturally adequate method for collecting informed consent was used. Written or recorded oral consent are established practices.
Recommendations for File and Data Formats
File Naming Recommendations
We recommend that you follow a consistent naming scheme for all the files in a collection. We do not require a specific naming pattern, but several types of information have proven useful in naming patterns. Typical components of these schemes are:
- the name, the glottocode, or the ISO 639-3 code of the object language, or alternatively a project acronym
- the date of recording (preferably in the form YYYYMMDD or YYYY-MM-DD)
- a running number for all recordings made on the same day (preferably with leading zeroes, e.g. 0001, 0002, ...)

These three components have proven to be sufficient to identify recordings and easy to generate (semi-)automatically. For many projects, a combination of the language name and a running number can be sufficient, but adding the recording date means the researcher has to track the running number only within a single day. In larger research projects, with several teams making recordings in parallel, additional components such as initials of consultants or identifiers for the different teams can be useful.
Naming pattern | File name |
---|---|
language name + running number + file extension | kuvi-0006.wav |
language name + date + running number + file extension | samre-20090304-0023.wav |
ISO 639-3 code + date + running number + file extension | std-20090304-0023.wav |
project acronym + date + running number + file extension | summit-20140304-0002.mp4 |
ISO 639-3 code + consultant + date + running number + file extension | pcj-LK-20070130-0009.wav |
File naming schemes should aim at producing unique identifiers. Ideally, the file name can be generated (semi-)automatically. Adding complex phrases such as titles or full names of participants to the file name is error-prone and cumbersome. This information is better recorded in the metadata.
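As an illustration of (semi-)automatic generation, the following minimal sketch assembles a file name from the components listed above. The function name `make_file_name` and its parameters are hypothetical and not part of any LAC tooling; the pattern shown (prefix, optional consultant initials, date, four-digit running number) is only one of the acceptable schemes.

```python
from datetime import date
from typing import Optional


def make_file_name(prefix: str, recorded_on: date, running_number: int,
                   extension: str, consultant: Optional[str] = None) -> str:
    """Build a name such as 'pcj-LK-20070130-0009.wav' from its components."""
    parts = [prefix]                               # language name, code or acronym
    if consultant:
        parts.append(consultant)                   # optional consultant initials
    parts.append(recorded_on.strftime("%Y%m%d"))   # recording date as YYYYMMDD
    parts.append(f"{running_number:04d}")          # running number with leading zeroes
    return "-".join(parts) + "." + extension.lstrip(".")


print(make_file_name("samre", date(2009, 3, 4), 23, "wav"))
# samre-20090304-0023.wav
print(make_file_name("pcj", date(2007, 1, 30), 9, "wav", consultant="LK"))
# pcj-LK-20070130-0009.wav
```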
For the processing of bundles, it has proven very helpful for all files of a bundle to have similar names. The recommended practice is for all files in a bundle to have an identical name, except for the file extension (i.e. .wav, .mp4, .eaf etc.).
pcj-20090304-0023.wav
pcj-20090304-0023.mp4
pcj-20090304-0023.eaf
If the unit represented, such as a recorded story, is split over two files, an additional component such as part1 and part2 is a good way to indicate that the sequence of the two files represents the whole unit. If files of the same type do not form parts of a whole, adding an additional running number is a simple way to keep the file names unique.
pcj-20090304-0023-part1.wav
pcj-20090304-0023-part2.wav
pcj-20090304-0023.eaf
pcj-20090304-0023-001.jpeg
pcj-20090304-0023-002.jpeg
pcj-20090304-0023-003.jpeg
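The following sketch shows how file names of this kind could be grouped into bundles by their shared base name. It assumes, purely for illustration, that the optional extra component is either `partN` or a three-digit number, and that the regular running number has four digits; the helper names `bundle_name` and `group_by_bundle` are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import PurePath

# Assumption: the optional trailing component is "-partN" or a three-digit "-NNN".
EXTRA_COMPONENT = re.compile(r"-(?:part\d+|\d{3})$")


def bundle_name(file_name: str) -> str:
    """Strip the file extension and any trailing part or image number."""
    stem = PurePath(file_name).stem
    return EXTRA_COMPONENT.sub("", stem)


def group_by_bundle(file_names):
    """Map each bundle name to the list of files that belong to it."""
    bundles = defaultdict(list)
    for name in file_names:
        bundles[bundle_name(name)].append(name)
    return dict(bundles)


files = [
    "pcj-20090304-0023-part1.wav",
    "pcj-20090304-0023-part2.wav",
    "pcj-20090304-0023.eaf",
    "pcj-20090304-0023-001.jpeg",
]
print(group_by_bundle(files))  # all four files fall under 'pcj-20090304-0023'
```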
Any naming scheme is acceptable as long as the file names are unique across a collection and the names as such are valid file names on all common operating systems. This means that most types of whitespace and special characters such as \, /, |, ", ', *, [ ], { }, or ? should be avoided.
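A quick check along these lines can be run over a collection before submission. The sketch below is deliberately stricter than the guidelines require: it assumes a conservative whitelist of ASCII letters, digits, dot, underscore and hyphen, and it compares names case-insensitively because some common file systems do not distinguish case. The function name `find_problems` is hypothetical.

```python
import re

# Assumption: a conservative whitelist; the guidelines only ask to avoid
# whitespace and the special characters listed above.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def find_problems(file_names):
    """Return (file name, problem) pairs for unsafe or duplicate names."""
    problems = []
    seen = {}
    for name in file_names:
        if not SAFE_NAME.fullmatch(name):
            problems.append((name, "unsafe characters"))
        key = name.lower()  # case-insensitive comparison
        if key in seen:
            problems.append((name, f"duplicate of {seen[key]}"))
        else:
            seen[key] = name
    return problems


for name, problem in find_problems(
    ["pcj-20090304-0023.wav", "my story?.wav", "PCJ-20090304-0023.WAV"]
):
    print(name, "->", problem)
```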
Metadata
Metadata should be provided for each bundle as BLAM CMDI (BLAM-bundle-repository-v0.14). BLAM CMDI can be produced with Arbil. We will soon also provide a BLAM profile for CMDI Maker. The metadata can be transferred to us in the form of this Excel sheet.
Format Recommendations
The LAC accepts a list of file types and file formats. A brief list of acceptable file formats can be found in the format whitelist.
Audio Recommendations:
The best common quality for uncompressed, lossless audio recordings is LPCM with a sampling rate of 96 kHz and a bit depth of 24 bit. However, LPCM at 48 kHz and 16 bit ensures that the file is generally playable and processable, independent of framework, platform, and device restrictions. It is a good compromise between high audio quality and practical considerations.
Audio recordings should have two channels (stereo). If a single speaker is recorded, one-channel recordings (mono) are acceptable. More complex setups such as six-channel recordings (5.1) will cause problems with most tools used in language research and should be avoided. The audio recommendations for archiving are:
File format | Encoding | Sampling Rate | Bit Depth |
---|---|---|---|
WAV | LPCM | 48 kHz | 16 bit |
Note: High-quality audio formats such as WAV LPCM 96 kHz/24 bit and WAV LPCM 48 kHz/24 bit can cause problems in annotation tools such as ELAN and EXMARaLDA and may not provide any additional information relevant for research. However, we will accept these formats for archiving, just as we accept CD-quality audio recordings (WAV LPCM 44.1 kHz/16 bit).
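The audio parameters of a WAV file can be inspected before submission. The sketch below uses Python's standard `wave` module, which reads PCM-encoded WAV files; the function name `check_wav` and the three result labels are hypothetical, and the check is not an official LAC validation.

```python
import wave

# (sampling rate in Hz, bit depth) pairs taken from the recommendations above
RECOMMENDED = (48000, 16)
ALSO_ACCEPTED = {(96000, 24), (48000, 24), (44100, 16)}


def check_wav(path):
    """Return 'recommended', 'accepted' or 'check manually' for a WAV file."""
    try:
        with wave.open(path, "rb") as wav:
            fmt = (wav.getframerate(), wav.getsampwidth() * 8)
            channels = wav.getnchannels()
    except wave.Error:
        return "check manually"  # not PCM, or not a readable WAV file
    if channels not in (1, 2):   # mono and stereo only
        return "check manually"
    if fmt == RECOMMENDED:
        return "recommended"
    if fmt in ALSO_ACCEPTED:
        return "accepted"
    return "check manually"
```

For example, `check_wav("pcj-20090304-0023.wav")` would report whether that (hypothetical) file matches the recommended 48 kHz/16 bit format.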
Video Recommendations:
Video formats used in language research and archiving are in many ways determined by the limitations of high-level consumer (“prosumer”) digital cameras. Most modern cameras record video encoded as h.264 (sometimes indicated as AVC or AVCHD in camera interfaces). Although it is proprietary, we currently consider this encoding the most suitable for our purposes because of its widespread support in software and hardware.
The most common audio encodings in video cameras are LPCM, AC3, and AAC. From a linguistic point of view, LPCM encoding (often indicated as PCM) is highly preferable and should be chosen if possible. AAC is a lossy compressed format and is not suitable for all kinds of analysis. It is, however, the preferred option if no PCM audio can be provided. Audio encoded as AC3 can cause issues in ELAN and EXMARaLDA on some platforms and should be re-encoded as AAC before archiving.
The MP4 container specification does not allow LPCM-encoded audio. If your camera can record LPCM audio with h.264 video, we recommend using the MOV container format. MOV files can be problematic for video annotation programs such as ELAN. The best way to deal with this issue is to produce a working file with AAC audio and h.264 video in an MP4 container and use this file with ELAN. For archiving, both the MOV and the MP4 file should be submitted together with the ELAN file. Having two files for one recording increases the storage requirements for project work and archiving, but the advantage of archiving uncompressed audio together with the video in one container outweighs this disadvantage.
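One possible way to produce such a working file is with the ffmpeg command-line tool. This is an assumption: the guidelines do not prescribe a tool. The sketch below builds a command that copies the h.264 video stream unchanged and re-encodes only the audio as AAC; the function names and the 256 kbps default are illustrative choices within the 128-384 kbps range recommended for AAC.

```python
import subprocess
from pathlib import Path


def working_copy_command(mov_path, audio_bitrate="256k"):
    """Build an ffmpeg command that turns a MOV file into an MP4 working copy."""
    source = Path(mov_path)
    target = source.with_suffix(".mp4")  # same base name, different extension
    return [
        "ffmpeg",
        "-i", str(source),
        "-c:v", "copy",        # keep the h.264 video stream as recorded
        "-c:a", "aac",         # re-encode the LPCM audio as AAC
        "-b:a", audio_bitrate,
        str(target),
    ]


def make_working_copy(mov_path):
    """Run the conversion; requires ffmpeg to be installed and on the PATH."""
    subprocess.run(working_copy_command(mov_path), check=True)


print(" ".join(working_copy_command("pcj-20090304-0023.mov")))
```

Because the video stream is copied rather than re-encoded, the frame rate and picture quality of the working copy stay identical to the archived MOV file.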
If your camera can only record compressed audio, we suggest that you also record uncompressed audio with a dedicated audio recorder. If you have no dedicated audio recorder available and can only record compressed audio with your video camera, you might want to derive a WAV file from the lossy AAC audio track of the video file. Despite its appearance as uncompressed LPCM data, this file will not be suitable for some types of analysis. You should indicate the source format of this file in the metadata to inform future users of the provenance of the data. The video recommendations for archiving are:
File format | Video Encoding | Encoding options | Video resolution | Frame rate | Audio encoding | Sampling rate | Bit depth | Bit rate |
---|---|---|---|---|---|---|---|---|
MOV | h.264 | profile: main, level: 4.0 | 1080p | 30fps | LPCM | 48 kHz | 16 bit | |
MP4 | h.264 | profile: main, level: 4.0 | 1080p | 30fps | AAC (LC) | 48 kHz | | 128-384 kbps |
(WAV) | | | | | | 48 kHz | 16 bit | |
Note: While we recommend a frame rate of 30 fps, we discourage any change to the frame rate after recording. Any other common frame rate, such as 24 fps (e.g. used for cinema film), 25 fps (e.g. used in PAL), or a higher frame rate, is acceptable.
Video file formats recommended for film archives such as uncompressed MXF or lossless JPEG2000 are not recommended for most digital video recordings, as these recordings were originally made in a lossy format (mostly h.264 encoded). If you intend to digitize and archive non-digital film, please contact the LAC helpdesk for further arrangements.
Time-aligned Annotations:
ELAN is the recommended format for time-aligned annotations. The Language Archive Cologne provides value-added services for ELAN annotations. Other structured and documented annotation formats can be used, but the archive currently does not provide any value-added services for them.
The archive also accepts Praat TextGrid, EXMARaLDA transcriptions, TEI (in particular ISO 24624:2016), and FLEx XML. Toolbox files are still accepted, but the format itself has proven to be problematic and should be avoided if possible.
Additional Metadata and other Structured Data
Written data and metadata supplementing or accompanying audio-visual data should be archived in established standards or at least in well-documented formats. For metadata, the preferred formats are profiles of the CMDI family. In general, XML and CSV formats are recommended. CSV-encoded (meta)data should be annotated following the W3C Metadata Vocabulary for Tabular Data recommendations.
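To give an impression of what such an annotation looks like, the sketch below generates a minimal metadata description for a CSV file in the JSON format defined by the W3C vocabulary. The file name `pcj-sessions.csv` and its three columns are hypothetical, and the helper `csvw_metadata` is not part of any LAC tooling. By convention, the description is stored next to the CSV file with `-metadata.json` appended to its name.

```python
import json


def csvw_metadata(csv_url, columns):
    """Describe a CSV file; columns is a list of (name, title, datatype)."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical session table with one row per recording
metadata = csvw_metadata(
    "pcj-sessions.csv",
    [
        ("file_name", "File name", "string"),
        ("recording_date", "Recording date", "date"),
        ("speakers", "Number of speakers", "integer"),
    ],
)
# Would be saved as pcj-sessions.csv-metadata.json
print(json.dumps(metadata, indent=2))
```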
Textual Data
Textual data supplementing audio-visual recordings can be archived alongside the recordings. The recommended formats for unstructured textual data are PDF/A, plain UTF-8 encoded text files, or XHTML.
Still Images
Still images supplementing audio-visual data can be archived alongside the recordings. Accepted formats, in order of preference, are TIFF, JPEG2000, PNG, and JPEG.
If you have any remaining questions, please contact the LAC helpdesk. We are also always happy to hear from you with any suggestions or feedback that help us improve our submission guidelines.