This module includes both unit tests, which can run in isolation without connecting to the S3 service, and integration tests, which require a working connection to S3 to interact with a bucket. Unit test suites follow the naming convention Test*.java. Integration tests follow the naming convention ITest*.java.
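For example (the class names here are hypothetical, chosen only to illustrate the convention), the Maven surefire plugin picks up the unit tests in the `test` phase, while failsafe picks up the integration tests in the `verify` phase:

```
src/test/java/.../TestS3AExample.java     # unit test: runs without S3 credentials
src/test/java/.../ITestS3AExample.java    # integration test: needs a live bucket
```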
Due to eventual consistency, integration tests may fail with no obvious cause. Transient failures which do not recur when the test is rerun can therefore be ignored.
The Apache Jenkins infrastructure does not run any S3 integration tests, due to the need to keep credentials secure.
The submitter of any patch is required to run all the integration tests and declare which S3 region or endpoint they tested against. This is important: *patches which do not include this declaration will be ignored*.
This policy has proven to be the only mechanism to guarantee full regression testing of code changes. Why the declaration of region? It helps identify regressions which only surface against specific endpoints or third-party S3 implementations, and it tells reviewers the conditions (such as long-haul connections) under which the tests ran.
You don’t need to test from a VM within the AWS infrastructure; with the `-Dparallel-tests` option the non-scale tests complete in under ten minutes. Because the tests clean up after themselves, they are also designed to be low cost. It’s neither hard nor expensive to run the tests; if you can’t, there’s no guarantee your patch works. The reviewers have enough to do, and don’t have the time to run these tests themselves, especially as every failure simply makes for slow iterative development.
Please: run the tests. And if you don’t, we are sorry for declining your patch, but we have to.
Some of the tests do fail intermittently, especially in parallel runs. If this happens, try to run the test on its own to see if the test succeeds.
If it still fails, include this fact in your declaration. We know some tests are intermittently unreliable.
The tests and the S3A client are designed to be configurable for different timeouts. If you are seeing problems and this configuration isn’t working, that’s a sign that the configuration mechanism isn’t complete. If it’s happening in the production code, that could be a sign of a problem which may surface over long-haul connections. Please help us identify and fix these problems — especially as you are the one best placed to verify the fixes work.
To integration test the S3* filesystem clients, you need to provide two files which pass in authentication details to the test runner: `auth-keys.xml` and `contract-test-options.xml`.
These are both Hadoop XML configuration files, which must be placed into `hadoop-tools/hadoop-aws/src/test/resources`.
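As a sketch, after setup the resources directory would contain something like this (`core-site.xml` already exists in the source tree; the other two are the files you create):

```
hadoop-tools/hadoop-aws/src/test/resources/
  core-site.xml               # pre-existing; sources auth-keys.xml
  auth-keys.xml               # created by you: credentials, kept out of SCM
  contract-test-options.xml   # created by you: contract test bindings
```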
The file `core-site.xml` pre-exists in that directory and sources the configuration created under `auth-keys.xml`. For most purposes you will not need to edit this file unless you need to apply a specific, non-default property change during the tests.
The presence of the `auth-keys.xml` file triggers the testing of the S3 classes. Without this file, *none of the integration tests in this module will be executed*.
The XML file must contain all the ID/key information needed to connect each of the filesystem clients to the object stores, and a URL for each filesystem for its testing.
The contents of each bucket will be destroyed during the test process: do not use the bucket for any purpose other than testing. Furthermore, for s3a, all in-progress multi-part uploads to the bucket will be aborted at the start of a test (by forcing fs.s3a.multipart.purge=true) to clean up the temporary state of previously failed tests.
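For illustration only, the forced purge is equivalent to the following setting; the test runner applies it itself, so there is no need to add it to your configuration:

```xml
<property>
  <name>fs.s3a.multipart.purge</name>
  <value>true</value>
</property>
```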
Example:
```xml
<configuration>
  <property>
    <name>test.fs.s3n.name</name>
    <value>s3n://test-aws-s3n/</value>
  </property>

  <property>
    <name>test.fs.s3a.name</name>
    <value>s3a://test-aws-s3a/</value>
  </property>

  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>DONOTCOMMITTHISKEYTOSCM</value>
  </property>

  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>DONOTEVERSHARETHISSECRETKEY!</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <description>AWS access key ID. Omit for IAM role-based authentication.</description>
    <value>DONOTCOMMITTHISKEYTOSCM</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <description>AWS secret key. Omit for IAM role-based authentication.</description>
    <value>DONOTEVERSHARETHISSECRETKEY!</value>
  </property>

  <property>
    <name>test.sts.endpoint</name>
    <description>Specific endpoint to use for STS requests.</description>
    <value>sts.amazonaws.com</value>
  </property>
</configuration>
```
The file hadoop-tools/hadoop-aws/src/test/resources/contract-test-options.xml must be created and configured for the test filesystems.
If a test filesystem path `fs.contract.test.fs.*` is not defined for a filesystem, the contract tests for that filesystem will be skipped.
The standard S3 authentication details must also be provided. This can be through copy-and-paste of the auth-keys.xml credentials, or it can be through direct XInclude inclusion.
Here is an example `contract-test-options.xml` which places all test options into the `auth-keys.xml` file, offering a single place to keep credentials and define test endpoint bindings.
```xml
<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude"
           href="auth-keys.xml"/>
</configuration>
```
In the file src/test/resources/contract-test-options.xml, the filesystem name must be defined in the property fs.contract.test.fs.s3n. The standard configuration options to define the S3N authentication details must also be provided.
Example:
```xml
<property>
  <name>fs.contract.test.fs.s3n</name>
  <value>s3n://test-aws-s3n/</value>
</property>
```
In the file src/test/resources/contract-test-options.xml, the filesystem name must be defined in the property `fs.contract.test.fs.s3a`. The standard configuration options to define the S3A authentication details must also be provided.
Example:
```xml
<property>
  <name>fs.contract.test.fs.s3a</name>
  <value>s3a://test-aws-s3a/</value>
</property>
```
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~     http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing, software
  ~ distributed under the License is distributed on an "AS IS" BASIS,
  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~ See the License for the specific language governing permissions and
  ~ limitations under the License.
-->
<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude"
           href="auth-keys.xml"/>

  <property>
    <name>fs.contract.test.fs.s3</name>
    <value>s3://test-aws-s3/</value>
  </property>

  <property>
    <name>fs.contract.test.fs.s3a</name>
    <value>s3a://test-aws-s3a/</value>
  </property>

  <property>
    <name>fs.contract.test.fs.s3n</name>
    <value>s3n://test-aws-s3n/</value>
  </property>
</configuration>
```
This example pulls in the auth-keys.xml file for the credentials. This provides one single place to keep the keys up to date —and means that the file contract-test-options.xml does not contain any secret credentials itself. As the auth keys XML file is kept out of the source code tree, it is not going to get accidentally committed.
For the S3A encryption tests to run correctly, `fs.s3a.server-side-encryption-key` must be configured in the S3A contract XML file with an AWS KMS encryption key ARN, as this value is different for each AWS KMS key.
Example:
```xml
<property>
  <name>fs.s3a.server-side-encryption-key</name>
  <value>arn:aws:kms:us-west-2:360379543683:key/071a86ff-8881-4ba0-9230-95af6d01ca01</value>
</property>
```
You can also force all the tests to run with a specific SSE encryption method by configuring the property fs.s3a.server-side-encryption-algorithm in the s3a contract file.
After completing the configuration, execute the test run through Maven.
```bash
mvn clean verify
```
It’s also possible to execute multiple test suites in parallel by passing the parallel-tests property on the command line. The tests spend most of their time blocked on network I/O with the S3 service, so running in parallel tends to complete full test runs faster.
```bash
mvn -Dparallel-tests clean verify
```
Some tests must run with exclusive access to the S3 bucket, so even with the parallel-tests property, several test suites will run in serial in a separate Maven execution step after the parallel tests.
By default, parallel-tests runs 4 test suites concurrently. This can be tuned by passing the testsThreadCount property.
```bash
mvn -Dparallel-tests -DtestsThreadCount=8 clean verify
```
To run just unit tests, which do not require S3 connectivity or AWS credentials, use any of the above invocations, but switch the goal to test instead of verify.
```bash
mvn clean test
mvn -Dparallel-tests clean test
mvn -Dparallel-tests -DtestsThreadCount=8 clean test
```
To run only a specific named subset of tests, pass the test property for unit tests or the it.test property for integration tests.
```bash
mvn clean test -Dtest=TestS3AInputPolicies
mvn clean verify -Dit.test=ITestS3AFileContextStatistics -Dtest=none
mvn clean verify -Dtest=TestS3A* -Dit.test=ITestS3A*
```
Note that when running a specific subset of tests, the patterns passed in test and it.test override the configuration of which tests need to run in isolation in a separate serial phase (mentioned above). This can cause unpredictable results, so the recommendation is to avoid passing parallel-tests in combination with test or it.test. If you know that you are specifying only tests that can run safely in parallel, then it will work. For wide patterns, like ITestS3A* shown above, it may cause unpredictable test failures.
S3A can connect to different regions —the tests support this. Simply define the target region in contract-test-options.xml or any auth-keys.xml file referenced.
```xml
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.eu-central-1.amazonaws.com</value>
</property>
```
This is used for all tests, except for the scale tests that use the public CSV .gz file (see below).
The `TestS3AInputStreamPerformance` tests require read access to a multi-MB text file. The default file for these tests is one published by Amazon, `s3a://landsat-pds/scene_list.gz` (also served over HTTP at landsat-pds.s3.amazonaws.com/scene_list.gz). This is a gzipped CSV index of other files which Amazon serves for open use.
The path to this object is set in the option `fs.s3a.scale.test.csvfile`:
```xml
<property>
  <name>fs.s3a.scale.test.csvfile</name>
  <value>s3a://landsat-pds/scene_list.gz</value>
</property>
```
(To disable these tests, set the value to a single space or newline rather than leaving it empty: an empty `<value/>` would be considered undefined and pick up the default. See the notes on alternate endpoints below.)
If using a test file in an S3 region that requires a different endpoint value from the one set in `fs.s3a.endpoint`, a bucket-specific endpoint must be defined. For the default test dataset, hosted in the `landsat-pds` bucket, this is:
```xml
<property>
  <name>fs.s3a.bucket.landsat-pds.endpoint</name>
  <value>s3.amazonaws.com</value>
  <description>The endpoint for s3a://landsat-pds URLs</description>
</property>
```
Integration test results and logs are stored in target/failsafe-reports/. An HTML report can be generated during site generation, or with the surefire-report plugin:
```bash
mvn surefire-report:failsafe-report-only
```
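The generated HTML typically lands under `target/site/`; the exact location can vary with the plugin version:

```bash
# after the report goal, open the generated report, typically:
# target/site/failsafe-report.html
```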
There are a set of tests designed to measure the scalability and performance of the S3A client at scale: the *Scale Tests*. Tests include: creating and traversing directory trees, uploading large files, renaming them, deleting them, seeking through the files, performing random IO, and others. This makes them a foundational part of the benchmarking.
By their very nature they are slow. And, as their execution time is often limited by bandwidth between the computer running the tests and the S3 endpoint, parallel execution does not speed these tests up.
The tests are enabled if the `scale` property is set in the Maven build; this can be done regardless of whether or not the parallel-tests profile is used.
```bash
mvn verify -Dscale
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8
```
The most bandwidth-intensive tests (those which upload data) always run sequentially; those which are slow due to HTTPS setup costs or server-side actions are included in the set of parallelized tests.
Some of the tests can be tuned from the maven build or from the configuration file used to run the tests.
```bash
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8 -Dfs.s3a.scale.test.huge.filesize=128M
```
The algorithm is as follows: each tunable property is read from the JVM system properties first (where Maven passes down any `-D` definitions), falling back to the value in the test configuration file, and finally to a built-in default. Only a few properties can be set this way; more will be added.
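As a sketch of this precedence, assuming the huge-file property listed below: a value in the test configuration file acts as the default, and a `-D` definition on the Maven command line overrides it for that run.

```xml
<!-- In the test configuration file: the default when no -D override is given. -->
<property>
  <name>fs.s3a.scale.test.huge.filesize</name>
  <value>128M</value>
</property>
```

A command such as `mvn verify -Dscale -Dfs.s3a.scale.test.huge.filesize=256M` then overrides this value for a single run.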
| Property | Meaning |
|---|---|
| `fs.s3a.scale.test.timeout` | Timeout in seconds for scale tests |
| `fs.s3a.scale.test.huge.filesize` | Size for huge file uploads |
| `fs.s3a.scale.test.huge.partitionsize` | Size for partitions in huge file uploads |
The file and partition sizes are numeric values with a k/m/g/t/p suffix depending on the desired size. For example: 128M, 128m, 2G, 2g, 4T or even 1P.
Some scale tests perform multiple operations (such as creating many directories).
The exact number of operations to perform is configurable in the option scale.test.operation.count
```xml
<property>
  <name>scale.test.operation.count</name>
  <value>10</value>
</property>
```
Larger values generate more load, and are recommended when testing locally, or in batch runs. Smaller values result in faster test runs, especially when the object store is a long way away.
Operations which work on directories have a separate option: this controls the width and depth of tests creating recursive directories. Larger values create exponentially more directories, with consequent performance impact.
```xml
<property>
  <name>scale.test.directory.count</name>
  <value>2</value>
</property>
```
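As a rough illustration, assuming a test applies this value d as both width and depth (the exact counts depend on the individual test): a full tree contains d + d^2 + ... + d^d directories, so a value of 2 yields 6 directories while 3 yields 39.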
DistCp tests targeting S3A support a configurable file size. The default is 10 MB, but the configuration value is expressed in KB so that it can be tuned smaller to achieve faster test runs.
```xml
<property>
  <name>scale.test.distcp.file.size.kb</name>
  <value>10240</value>
</property>
```
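For example, to shrink the DistCp test files to 1 MB for a quick run over a slow connection (the value, as above, is in KB):

```xml
<property>
  <name>scale.test.distcp.file.size.kb</name>
  <value>1024</value>
</property>
```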
The S3A-specific scale test properties are described below.
The Huge File tests validate S3A’s ability to handle large files — the property `fs.s3a.scale.test.huge.filesize` declares the file size to use.
```xml
<property>
  <name>fs.s3a.scale.test.huge.filesize</name>
  <value>200M</value>
</property>
```
Amazon S3 handles files larger than 5GB differently than smaller ones. Setting the huge filesize to a number greater than that validates support for huge files.
```xml
<property>
  <name>fs.s3a.scale.test.huge.filesize</name>
  <value>6G</value>
</property>
```
Tests at this scale are slow: they are best executed from hosts running in the cloud infrastructure where the S3 endpoint is based. Otherwise, set a large timeout in `fs.s3a.scale.test.timeout`:
```xml
<property>
  <name>fs.s3a.scale.test.timeout</name>
  <value>432000</value>
</property>
```
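(For reference: 432,000 seconds is five days, which effectively removes the timeout for a single run.)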
The tests are executed in an order such that files created are only cleaned up after all the tests have completed. If the tests are interrupted, the test data will remain.
The S3A filesystem is designed to work with storage endpoints which implement the S3 protocols to the extent that the Amazon S3 SDK is capable of talking to them. We encourage testing against other filesystems and submissions of patches which address issues. In particular, we encourage testing of Hadoop release candidates, as these third-party endpoints get even less testing than the S3 endpoint itself.
If the endpoint doesn’t support server-side encryption, the encryption tests will fail. They can be turned off:
```xml
<property>
  <name>test.fs.s3a.encryption.enabled</name>
  <value>false</value>
</property>
```
Encryption is only used for those specific test suites with Encryption in their classname.
To test on alternate infrastructures supporting the same APIs, the option `fs.s3a.scale.test.csvfile` must either be set to " " (a space), or an object of at least 10MB must be uploaded to the object store and the `fs.s3a.scale.test.csvfile` option set to its path.
```xml
<property>
  <name>fs.s3a.scale.test.csvfile</name>
  <value> </value>
</property>
```
(yes, the space is necessary. The Hadoop Configuration class treats an empty value as “do not override the default”).
The test TestS3ATemporaryCredentials requests a set of temporary credentials from the STS service, then uses them to authenticate with S3.
If an S3 implementation does not support STS, then the functional test cases must be disabled:
```xml
<property>
  <name>test.fs.s3a.sts.enabled</name>
  <value>false</value>
</property>
```
These tests request a temporary set of credentials from the STS service endpoint. An alternate endpoint may be defined in `test.fs.s3a.sts.endpoint`.
```xml
<property>
  <name>test.fs.s3a.sts.endpoint</name>
  <value>https://sts.example.org/</value>
</property>
```
The default is "", meaning "use the Amazon default value".
Logging at debug level is the standard way to provide more diagnostic output; after setting this option, rerun the tests:
```properties
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
```
There are also some logging options for debug logging of the AWS client:

```properties
log4j.logger.com.amazonaws=DEBUG
log4j.logger.com.amazonaws.http.conn.ssl=INFO
log4j.logger.com.amazonaws.internal=INFO
```
There is also the option of enabling logging on a bucket; this could perhaps be used to diagnose problems from that end. This isn’t something actively used, but remains an option. If you are forced to debug this way, consider setting the fs.s3a.user.agent.prefix to a unique prefix for a specific test run, which will enable the specific log entries to be more easily located.
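A sketch of such a setting, with a hypothetical run identifier as the value:

```xml
<property>
  <name>fs.s3a.user.agent.prefix</name>
  <value>test-run-2017-a</value>
</property>
```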
New tests are always welcome. Bear in mind that we need to keep costs and test time down, which is done by:
* *No duplication*: if an operation is tested elsewhere, don’t repeat it. This applies as much for metadata operations as it does for bulk IO. If a new test case is added which completely obsoletes an existing test, it is OK to cut the previous one —after showing that coverage is not worsened.
* *Efficiency*: prefer a single call to getFileStatus() and examining the results, rather than separate calls to exists(), isFile(), etc.
* *Isolating scale tests*: any S3A test doing large amounts of IO MUST extend the class S3AScaleTestBase, so it only runs if scale is defined on a build, and supports test timeouts configurable by the user. Scale tests should also make the actual size of objects and the number of operations configurable, so that behavior at different scales can be verified.
* *Designing for parallel execution*: a key need here is for each test suite to work on isolated parts of the filesystem. Subclasses of AbstractS3ATestBase SHOULD use the path() method, with a base path of the test suite name, to build isolated paths (see the sketch after this list). Tests MUST NOT assume that they have exclusive access to a bucket.
* *Extending existing tests where appropriate*: this recommendation goes against the normal testing best practice of “test one thing per method”. Because it is so slow to create directory trees or upload large files, we do not have that luxury. All the tests against real S3 endpoints are integration tests where sharing test setup and teardown saves time and money.

A standard way to do this is to extend existing tests with some extra predicates, rather than write new tests. When doing this, make sure that the new predicates fail with meaningful diagnostics, so any new problems can be easily debugged from test logs.
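As a minimal sketch of an isolated-path suite, assuming the base-class methods named above (the class name and test body are illustrative, not an existing suite):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.ContractTestUtils;
import org.junit.Test;

public class ITestS3AIsolatedPathExample extends AbstractS3ATestBase {

  @Test
  public void testCreateFileInIsolatedPath() throws Throwable {
    // path() resolves against a per-suite base path, so suites running
    // in parallel never collide on the same keys in the bucket.
    Path testPath = path("testCreateFileInIsolatedPath");
    ContractTestUtils.touch(getFileSystem(), testPath);
    ContractTestUtils.assertPathExists(getFileSystem(),
        "file created under the suite's isolated path", testPath);
  }
}
```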
This is what we expect from new tests; they’re an extension of the normal Hadoop requirements, based on the need to work with remote servers whose use requires the presence of secret credentials, where tests may be slow, and where finding out why something failed from nothing but the test output is critical.
Extend AbstractS3ATestBase or AbstractSTestS3AHugeFiles unless justifiable. These set things up for testing against the object stores, provide good threadnames, help generate isolated paths, and for AbstractSTestS3AHugeFiles subclasses, only run if -Dscale is set.
Key features of AbstractS3ATestBase include binding to the configured test filesystem, generating isolated test paths, and automatically skipping tests when no test bucket is defined.
Having shared base classes may help reduce future maintenance too. Please use them.
Don’t ever log credentials. The credential tests go out of their way to not provide meaningful logs or assertion messages precisely to avoid this.
This means being efficient in test setup/teardown, and, ideally, making use of existing public datasets to save setup time and tester cost.
Strategies of particular note are:
The ordered test case mechanism of AbstractSTestS3AHugeFiles is probably the most elegant way of chaining test setup/teardown.
Regarding reusing existing data, we tend to use the landsat archive of AWS US-East for our testing of input stream operations. This doesn’t work against other regions, or with third party S3 implementations. Thus the URL can be overridden for testing elsewhere.
As well as making file size and operation counts scalable, this includes making test timeouts adequate. The scale tests make this configurable; it’s hard-coded to ten minutes in AbstractS3ATestBase(); subclasses can change this by overriding getTestTimeoutMillis().
Equally importantly: support proxies, as some testers need them.
The ContractTestUtils class contains a whole set of assertions for making statements about the expected state of a filesystem, e.g. assertPathExists(FS, path), assertPathDoesNotExist(FS, path), and others. These do their best to provide meaningful diagnostics on failures (e.g. directory listings, file status, …), helping to make failures easier to understand.
At the very least, do not use assertTrue() or assertFalse() without including error messages.
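For instance (an illustrative fragment, not taken from a real suite), prefer the first form below; the second fails with nothing but an unexplained AssertionError:

```java
// Failure message includes the actual status for diagnosis:
assertEquals("wrong length in " + status, expectedLen, status.getLen());

// Failure tells the reader nothing:
assertTrue(status.getLen() == expectedLen);
```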
The configuration tests must declare the test bucket test.fs.s3.name and the credentials for the s3:// filesystem, as well as the contract test bucket fs.contract.test.fs.s3.
The filesystem name must be defined in the property fs.contract.test.fs.s3. The same bucket name can be used for all tests.
Example:
```xml
<property>
  <name>test.fs.s3.name</name>
  <value>s3://test-aws-s3/</value>
</property>

<property>
  <name>fs.contract.test.fs.s3</name>
  <value>${test.fs.s3.name}</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>DONOTCOMMITTHISKEYTOSCM</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
```
Although the auth-keys.xml file is marked as ignored in git and subversion, it is still in your source tree, and there’s always the risk that it may creep out.
You can avoid this by keeping your keys outside the source tree and using an absolute XInclude reference to it.
```xml
<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude"
           href="file:///users/ubuntu/.auth-keys.xml"/>
</configuration>
```