Medium Article about rebranding yace
YACE is currently in quick iteration mode. Things will probably break in upcoming versions. However, it has been in production use at InVision AG for a couple of months already.
Only the latest version receives security updates. We do not support older versions.
In case of a vulnerability, please contact us directly by email: security@nerdswords.de
Do not disclose any specifics in GitHub issues! Thank you.
We will contact you as soon as possible.
Supported services with auto discovery through tags:
ghcr.io/nerdswords/yet-another-cloudwatch-exporter:x.x.x
e.g. 0.5.0
Option | Description |
---|---|
labels-snake-case | Causes labels on metrics to be output in snake case instead of camel case |
Key | Description |
---|---|
apiVersion | Configuration file version |
sts-region | Use STS regional endpoint (Optional) |
discovery | Auto-discovery configuration |
static | List of static configurations |
customNamespace | List of custom namespace configurations |
Key | Description |
---|---|
exportedTagsOnMetrics | List of tags per service to export to all metrics |
jobs | List of auto-discovery jobs |
exportedTagsOnMetrics example:
exportedTagsOnMetrics:
  ec2:
    - Name
    - type
Note: Only tagged resources are discovered.
Key | Description |
---|---|
regions | List of AWS regions |
type | CloudWatch service alias (“alb”, “ec2”, etc) or namespace name (“AWS/EC2”, “AWS/S3”, etc). |
length (Default 120) | How far back to request data for in seconds |
delay | If set it will request metrics up until current_time - delay |
roles | List of IAM roles to assume (optional) |
searchTags | List of Key/Value pairs to use for tag filtering (all must match), Value can be a regex. |
period | Statistic period in seconds (General Setting for all metrics in this job) |
statistics | List of statistic types, e.g. “Minimum”, “Maximum”, etc (General Setting for all metrics in this job) |
roundingPeriod | Specifies how the current time is rounded before calculating start/end times for CloudWatch GetMetricData requests. This rounding optimizes the performance of the CloudWatch request. The setting only makes sense if, for example, you specify a very long period (such as 1 day) but want your times rounded to a shorter interval (such as 5 minutes). For example, a value of 300 rounds the current time to the nearest 5 minutes. If not specified, roundingPeriod defaults to the shortest period in the job. |
addCloudwatchTimestamp | Export the metric with the original CloudWatch timestamp (General Setting for all metrics in this job) |
customTags | Custom tags to be added as a list of Key/Value pairs |
dimensionNameRequirements | List of metric dimensions to query. Before querying metric values, the total list of metrics will be filtered to only those that contain exactly this list of dimensions. An empty or undefined list results in all dimension combinations being included. |
metrics | List of metric definitions |
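As a rough illustration of how the job-level time settings above fit together, the fragment below sketches a discovery job with a one-day period whose request window is rounded to 5 minutes. The service, region, metric and numeric values are assumptions chosen for the example, not recommended settings.

```yaml
apiVersion: v1alpha1
discovery:
  jobs:
    - type: s3
      regions:
        - eu-west-1
      # One data point per day, looking back two days
      period: 86400
      length: 172800
      # Round the current time to the nearest 5 minutes instead of
      # the default (the shortest period in the job, here 1 day)
      roundingPeriod: 300
      statistics:
        - Average
      metrics:
        - name: BucketSizeBytes
```

Without roundingPeriod, this job would round its start/end times to the one-day period, since that is the shortest period it defines.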
searchTags example:
searchTags:
  - key: env
    value: production
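The dimensionNameRequirements key from the job table can be sketched in the same way. The fragment below assumes the key takes a plain list of dimension names and uses the ALB dimensions LoadBalancer and TargetGroup as an example; both the value format and the chosen dimensions are assumptions for illustration.

```yaml
apiVersion: v1alpha1
discovery:
  jobs:
    - type: alb
      regions:
        - eu-west-1
      # Keep only metrics that carry exactly these two dimensions
      # (assumed format: a list of dimension names)
      dimensionNameRequirements:
        - LoadBalancer
        - TargetGroup
      metrics:
        - name: UnHealthyHostCount
          statistics:
            - Maximum
          period: 60
          length: 600
```

Leaving the list empty or undefined would include every dimension combination, as described in the table.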
Key | Description |
---|---|
name | CloudWatch metric name |
statistics | List of statistic types, e.g. “Minimum”, “Maximum”, etc. |
period | Statistic period in seconds (Overrides job level setting) |
length | How far back to request data for in seconds (for static jobs) |
delay | If set it will request metrics up until current_time - delay (for static jobs) |
nilToZero | Return 0 if CloudWatch returns no data at all. By default NaN is reported |
addCloudwatchTimestamp | Export the metric with the original CloudWatch timestamp (Overrides job level setting) |
Note: Be careful when using addCloudwatchTimestamp for sparse metrics, e.g. from S3, since Prometheus won’t scrape metrics containing timestamps older than 2-3 hours.
Key | Description |
---|---|
regions | List of AWS regions |
roles | List of IAM roles to assume |
namespace | CloudWatch namespace |
name | Must be set with multiple block definitions per namespace |
customTags | Custom tags to be added as a list of Key/Value pairs |
dimensions | CloudWatch metric dimensions as a list of Name/Value pairs |
metrics | List of metric definitions |
apiVersion: v1alpha1
sts-region: eu-west-1
discovery:
  exportedTagsOnMetrics:
    ec2:
      - Name
    ebs:
      - VolumeId
  jobs:
    - type: es
      regions:
        - eu-west-1
      searchTags:
        - key: type
          value: ^(easteregg|k8s)$
      metrics:
        - name: FreeStorageSpace
          statistics:
            - Sum
          period: 60
          length: 600
        - name: ClusterStatus.green
          statistics:
            - Minimum
          period: 60
          length: 600
        - name: ClusterStatus.yellow
          statistics:
            - Maximum
          period: 60
          length: 600
        - name: ClusterStatus.red
          statistics:
            - Maximum
          period: 60
          length: 600
    - type: elb
      regions:
        - eu-west-1
      length: 900
      delay: 120
      statistics:
        - Minimum
        - Maximum
        - Sum
      searchTags:
        - key: KubernetesCluster
          value: production-19
      metrics:
        - name: HealthyHostCount
          statistics:
            - Minimum
          period: 600
          length: 600 #(this will be ignored)
        - name: HTTPCode_Backend_4XX
          statistics:
            - Sum
          period: 60
          length: 900 #(this will be ignored)
          delay: 300 #(this will be ignored)
          nilToZero: true
        - name: HTTPCode_Backend_5XX
          period: 60
    - type: alb
      regions:
        - eu-west-1
      searchTags:
        - key: kubernetes.io/service-name
          value: .*
      metrics:
        - name: UnHealthyHostCount
          statistics: [Maximum]
          period: 60
          length: 600
    - type: vpn
      regions:
        - eu-west-1
      searchTags:
        - key: kubernetes.io/service-name
          value: .*
      metrics:
        - name: TunnelState
          statistics:
            - p90
          period: 60
          length: 300
    - type: kinesis
      regions:
        - eu-west-1
      metrics:
        - name: PutRecords.Success
          statistics:
            - Sum
          period: 60
          length: 300
    - type: s3
      regions:
        - eu-west-1
      searchTags:
        - key: type
          value: public
      metrics:
        - name: NumberOfObjects
          statistics:
            - Average
          period: 86400
          length: 172800
        - name: BucketSizeBytes
          statistics:
            - Average
          period: 86400
          length: 172800
    - type: ebs
      regions:
        - eu-west-1
      searchTags:
        - key: type
          value: public
      metrics:
        - name: BurstBalance
          statistics:
            - Minimum
          period: 600
          length: 600
          addCloudwatchTimestamp: true
    - type: kafka
      regions:
        - eu-west-1
      searchTags:
        - key: env
          value: dev
      metrics:
        - name: BytesOutPerSec
          statistics:
            - Average
          period: 600
          length: 600
    - type: appstream
      regions:
        - eu-central-1
      searchTags:
        - key: saas_monitoring
          value: true
      metrics:
        - name: ActualCapacity
          statistics:
            - Average
          period: 600
          length: 600
        - name: AvailableCapacity
          statistics:
            - Average
          period: 600
          length: 600
        - name: CapacityUtilization
          statistics:
            - Average
          period: 600
          length: 600
        - name: DesiredCapacity
          statistics:
            - Average
          period: 600
          length: 600
        - name: InUseCapacity
          statistics:
            - Average
          period: 600
          length: 600
        - name: PendingCapacity
          statistics:
            - Average
          period: 600
          length: 600
        - name: RunningCapacity
          statistics:
            - Average
          period: 600
          length: 600
        - name: InsufficientCapacityError
          statistics:
            - Average
          period: 600
          length: 600
    - type: backup
      regions:
        - eu-central-1
      searchTags:
        - key: saas_monitoring
          value: true
      metrics:
        - name: NumberOfBackupJobsCompleted
          statistics:
            - Average
          period: 600
          length: 600
static:
  - namespace: AWS/AutoScaling
    name: must_be_set
    regions:
      - eu-west-1
    dimensions:
      - name: AutoScalingGroupName
        value: Test
    customTags:
      - key: CustomTag
        value: CustomValue
    metrics:
      - name: GroupInServiceInstances
        statistics:
          - Minimum
        period: 60
        length: 300
[Source: config_test.yml]
Key | Description |
---|---|
regions | List of AWS regions |
name | The name of your rule. It will be added as a label in Prometheus |
namespace | The custom CloudWatch namespace |
roles | Roles that the exporter will assume |
metrics | List of metric definitions |
statistics | Default value for statistics |
nilToZero | Default value for nilToZero |
period | Default value for period |
length | Default value for length |
delay | Default value for delay |
addCloudwatchTimestamp | Default value for addCloudwatchTimestamp |
apiVersion: v1alpha1
sts-region: eu-west-1
customNamespace:
  - name: customEC2Metrics
    namespace: CustomEC2Metrics
    regions:
      - us-east-1
    metrics:
      - name: cpu_usage_idle
        statistics:
          - Average
        period: 300
        length: 300
        nilToZero: true
      - name: disk_free
        statistics:
          - Average
        period: 300
        length: 300
        nilToZero: true
### Metrics with exportedTagsOnMetrics
aws_ec2_cpuutilization_maximum{dimension_InstanceId="i-someid", name="arn:aws:ec2:eu-west-1:472724724:instance/i-someid", tag_Name="jenkins"} 57.2916666666667
### Info helper with tags
aws_elb_info{name="arn:aws:elasticloadbalancing:eu-west-1:472724724:loadbalancer/a815b16g3417211e7738a02fcc13bbf9",tag_KubernetesCluster="production-19",tag_Name="",tag_kubernetes_io_cluster_production_19="owned",tag_kubernetes_io_service_name="nginx-ingress/private-ext",region="eu-west-1"} 0
aws_ec2_info{name="arn:aws:ec2:eu-west-1:472724724:instance/i-someid",tag_Name="jenkins"} 0
### Track cloudwatch requests to calculate costs
yace_cloudwatch_requests_total 168
# CPUUtilization + Name tag of the instance id - No more instance id needed for monitoring
aws_ec2_cpuutilization_average + on (name) group_left(tag_Name) aws_ec2_info
# Free Storage in Megabytes + tag Type of the elasticsearch cluster
(aws_es_free_storage_space_sum + on (name) group_left(tag_Type) aws_es_info) / 1024
# Add kubernetes / kops tags on 4xx elb metrics
(aws_elb_httpcode_backend_4_xx_sum + on (name) group_left(tag_KubernetesCluster,tag_kubernetes_io_service_name) aws_elb_info)
# Availability Metric for ELBs (Successful requests / Total Requests) + k8s service name
# Use nilToZero on all metrics else it won't work
((aws_elb_request_count_sum - on (name) group_left() aws_elb_httpcode_backend_4_xx_sum) - on (name) group_left() aws_elb_httpcode_backend_5_xx_sum) + on (name) group_left(tag_kubernetes_io_service_name) aws_elb_info
# Forecast your elasticsearch disk size in 7 days and report metrics with tags type and version
predict_linear(aws_es_free_storage_space_minimum[2d], 86400 * 7) + on (name) group_left(tag_type, tag_version) aws_es_info
# Forecast your cloudwatch costs for next 32 days based on last 10 minutes
# 1.000.000 Requests free
# 0.01 Dollar for 1.000 GetMetricStatistics Api Requests (https://aws.amazon.com/cloudwatch/pricing/)
((increase(yace_cloudwatch_requests_total[10m]) * 6 * 24 * 32) - 1000000) / 1000 * 0.01
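If you want to keep the cost forecast around as a series of its own, one option is a Prometheus recording rule. The sketch below wraps the forecast query in a standard rule file; the group name and the recorded metric name are hypothetical, and the expression subtracts the 1,000,000 free requests stated in the comments above.

```yaml
groups:
  - name: yace-cost              # hypothetical group name
    rules:
      # Forecast of CloudWatch API cost in dollars for the next 32 days,
      # extrapolated from the last 10 minutes of requests
      - record: yace:cloudwatch_cost_dollars:forecast32d   # hypothetical name
        expr: |
          ((increase(yace_cloudwatch_requests_total[10m]) * 6 * 24 * 32) - 1000000) / 1000 * 0.01
```

The recorded series goes negative while the projected request count stays inside the free tier, so a dashboard may want to clamp it at zero.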
The following IAM permissions are required for YACE to work.
"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"
The following IAM permissions are required for the transit gateway attachment (tgwa) metrics to work.
"ec2:DescribeTags",
"ec2:DescribeInstances",
"ec2:DescribeRegions",
"ec2:DescribeTransitGateway*"
The following IAM permission is required to discover tagged API Gateway REST APIs:
"apigateway:GET"
The following IAM permissions are required to discover tagged Database Migration Service (DMS) replication instances and tasks:
"dms:DescribeReplicationInstances",
"dms:DescribeReplicationTasks"
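For orientation, the permissions listed above could be bundled into a single managed policy. The sketch below assumes you manage IAM with a CloudFormation template; the resource name, the Sid values and the use of Resource "*" are assumptions made for brevity, and the optional statements can be dropped if you do not use those services.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  YaceReadOnlyPolicy:            # hypothetical resource name
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: Read-only permissions for YACE discovery and scraping
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: YaceCore
            Effect: Allow
            Action:
              - tag:GetResources
              - cloudwatch:GetMetricData
              - cloudwatch:GetMetricStatistics
              - cloudwatch:ListMetrics
            Resource: "*"
          - Sid: YaceTransitGatewayAttachments   # only for tgwa metrics
            Effect: Allow
            Action:
              - ec2:DescribeTags
              - ec2:DescribeInstances
              - ec2:DescribeRegions
              - ec2:DescribeTransitGateway*
            Resource: "*"
          - Sid: YaceApiGateway                  # only for API Gateway discovery
            Effect: Allow
            Action:
              - apigateway:GET
            Resource: "*"
          - Sid: YaceDms                         # only for DMS discovery
            Effect: Allow
            Action:
              - dms:DescribeReplicationInstances
              - dms:DescribeReplicationTasks
            Resource: "*"
```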
YACE will automatically attempt to assume the role associated with a machine within EC2. If this behavior is undesirable, turn off the use of the metadata endpoint by setting the environment variable AWS_EC2_METADATA_DISABLED=true.
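When YACE runs in Kubernetes, the variable can be set on the container. The fragment below is a sketch of the relevant part of a Deployment spec, assuming a container named yace; merge it into your own manifest rather than applying it as-is.

```yaml
spec:
  template:
    spec:
      containers:
        - name: yace
          env:
            # Stop the AWS SDK from querying the EC2 metadata endpoint
            - name: AWS_EC2_METADATA_DISABLED
              value: "true"
```

The value is quoted because Kubernetes expects environment variable values to be strings.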
docker run -d --rm -v $PWD/credentials:/exporter/.aws/credentials -v $PWD/config.yml:/tmp/config.yml \
-p 5000:5000 --name yace ghcr.io/nerdswords/yet-another-cloudwatch-exporter:vx.xx.x # release version as tag - Do not forget the version 'v'
To support local testing, all AWS URLs can be overridden by setting the environment variable AWS_ENDPOINT_URL:
docker run -d --rm -v $PWD/credentials:/exporter/.aws/credentials -v $PWD/config.yml:/tmp/config.yml \
-e AWS_ENDPOINT_URL=http://localhost:4766 -p 5000:5000 --name yace ghcr.io/nerdswords/yet-another-cloudwatch-exporter:vx.xx.x # release version as tag - Do not forget the version 'v'
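Once the container is running, Prometheus needs a scrape job pointing at the published port. The fragment below is a minimal sketch of a prometheus.yml entry, assuming Prometheus can reach the exporter at localhost:5000 and that metrics are served on the default /metrics path; the job name and the scrape interval are illustrative choices.

```yaml
scrape_configs:
  - job_name: yace               # hypothetical job name
    # Illustrative interval; YACE refreshes its data in the background
    scrape_interval: 1m
    static_configs:
      - targets:
          - localhost:5000       # port published by the docker run command
```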
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: yace
data:
  config.yml: |-
    ---
    # Start of config file
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yace
spec:
  replicas: 1
  selector:
    matchLabels:
      name: yace
  template:
    metadata:
      labels:
        name: yace
    spec:
      containers:
        - name: yace
          image: ghcr.io/nerdswords/yet-another-cloudwatch-exporter:vx.x.x # release version as tag - Do not forget the version 'v'
          imagePullPolicy: IfNotPresent
          args:
            - "--config.file=/tmp/config.yml"
          ports:
            - name: app
              containerPort: 5000
          volumeMounts:
            - name: config-volume
              mountPath: /tmp
      volumes:
        - name: config-volume
          configMap:
            name: yace
Multiple roleArns are useful when you monitor a multi-account setup in which all accounts use the same AWS services. For example, you run YACE in a monitoring account and have a number of accounts (for example newspapers, radio and television) running ECS clusters. Each account gives YACE permission to assume a local IAM role, which has all the necessary permissions for CloudWatch metrics. In this kind of setup, you could simply list:
jobs:
  - type: ecs-svc
    regions:
      - eu-north-1
    roles:
      - roleArn: "arn:aws:iam::1111111111111:role/prometheus" # newspaper
      - roleArn: "arn:aws:iam::2222222222222:role/prometheus" # radio
      - roleArn: "arn:aws:iam::3333333333333:role/prometheus" # television
    metrics:
      - name: MemoryReservation
        statistics:
          - Average
          - Minimum
          - Maximum
        period: 600
        length: 600
Additionally, if the IAM role you want to assume requires an External ID you can specify it this way:
roles:
  - roleArn: "arn:aws:iam::1111111111111:role/prometheus"
    externalId: "shared-external-identifier"
The flags ‘cloudwatch-concurrency’ and ‘tag-concurrency’ define the number of concurrent requests to the CloudWatch metrics and tags APIs. Their default value is 5.
Setting a higher value shortens scraping times but can lead to throttling and blocking of the API.
The exporter scrapes CloudWatch metrics in the background at a fixed interval. This protects against excessive API requests that can cause extra billing in the AWS account.
The flag ‘scraping-interval’ defines the number of seconds between scrapes. The default value is 300.
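In a Kubernetes Deployment these flags would go into the container args. The fragment below is a sketch that assumes the flags accept the same `--name=value` form as `--config.file` in the Deployment example; the values are illustrative, not tuned recommendations.

```yaml
args:
  - "--config.file=/tmp/config.yml"
  # Assumed syntax, mirroring --config.file; defaults are 5, 5 and 300
  - "--cloudwatch-concurrency=10"
  - "--tag-concurrency=10"
  - "--scraping-interval=600"
```

Raising the concurrency values trades faster scrapes for a higher risk of API throttling, while a longer scraping interval reduces the number of billed requests.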
It is possible to embed YACE into an external application. This mode might be useful if you would like to scrape on demand or run in a stateless manner.
The entrypoint to use YACE as a library is the UpdateMetrics func in update.go, which requires:
- config: the struct representation of the configuration defined in Top Level Configuration
- registry: any Prometheus-compatible registry where scraped AWS metrics will be written
- metricsPerQuery: controls the same behavior defined by the CLI flag metrics-per-query
- labelsSnakeCase: controls the same behavior defined by the CLI flag labels-snake-case
- cloudwatchSemaphore/tagSemaphore: adjust the concurrency of requests as defined by Requests concurrency. Pass in a different length channel to adjust behavior
- cache: session.NewSessionCache(config, <fips value>) would be the default; <fips value> is defined by the fips CLI flag
- observedMetricLabels: tracks the labels observed on metrics exported to the provided registry; pass the same instance if you intend to reuse the registry between calls
- logger: logger.NewLogrusLogger(log.StandardLogger()) is an acceptable default
The update definition also includes an exported slice of Metrics, which includes AWS API call metrics. These can be registered with the provided registry if you want them included in the AWS scrape results. If you are using multiple instances of registry, it might make more sense to register these metrics in the application using YACE as a library to better track them over the lifetime of the application.