This repository has been archived by the owner on Jul 20, 2024. It is now read-only.

Does this still work? #35

Open
nelg opened this issue Feb 16, 2021 · 21 comments

Comments

@nelg
Contributor

nelg commented Feb 16, 2021

Hi,

I've had issues with this not working, although it used to work.

It seems that when it deletes the default route:

# switch the default route to eth1
ip route del default dev eth0

The NAT instance then loses all internet connectivity.

Does this still work for you?
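
For anyone hitting the same symptom, here is a minimal sketch of a guard around that delete. It assumes the routing table is in the usual ip route text format; the function names has_other_default and safe_del_default are hypothetical and not part of the module.

```sh
#!/bin/sh
# Succeeds if the routing table on stdin (in `ip route` format) contains
# a default route on a device other than the one named in $1.
has_other_default() {
    awk -v skip="$1" '
        $1 == "default" {
            for (i = 2; i < NF; i++)
                if ($i == "dev" && $(i + 1) != skip) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Delete the default route on $1 only when another default route exists.
# Needs root. Hypothetical helper, shown for illustration only.
safe_del_default() {
    if ip route show | has_other_default "$1"; then
        ip route del default dev "$1"
    else
        echo "refusing to delete the only default route ($1)" >&2
        return 1
    fi
}
```

Calling safe_del_default eth0 would then refuse to run when eth1 has no default route yet, instead of cutting the instance off.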

@nelg
Contributor Author

nelg commented Feb 16, 2021

To get mine working, I ended up making the changes in nelg@e4a0b33

@DesAWSume

DesAWSume commented May 1, 2021

Hi nelg

If I just want to set up an Amazon Linux 2 NAT instance, without using IaC to provision all the other infrastructure, which commands should I run to get a working Amazon Linux 2 NAT?

Thanks in advance.

@szromek

szromek commented May 5, 2021

@nelg I am experiencing the same issue as you did and your fix seems to solve the problem. Could you provide @int128 with a PR that could be tested, merged and published to Terraform registry, so the whole module would be operational again?

@nelg
Contributor Author

nelg commented May 29, 2021

> @nelg I am experiencing the same issue as you did and your fix seems to solve the problem. Could you provide @int128 with a PR that could be tested, merged and published to Terraform registry, so the whole module would be operational again?

Sure, will do

@nelg
Contributor Author

nelg commented May 29, 2021

Here is the PR #37

@arjitj2

arjitj2 commented Jan 22, 2022

This issue and your fix solved 5+ hours of debugging work for me. Thank you and I hope it gets merged soon.

@int128
Owner

int128 commented Jan 23, 2022

It seems the NAT connection is lost after the NAT instance is rebooted.

The ip route del default dev eth0 command is needed to switch the default route to eth1 and so pin the source IP, because the public IP of eth0 changes when the instance is recreated by the Auto Scaling Group.

I noticed the route table is broken after reboot as follows:

## When an instance is created

[ssm-user@ip-172-18-138-43 bin]$ ip ro
default via 172.18.128.1 dev eth1 metric 10001
169.254.169.254 dev eth0
172.18.128.0/20 dev eth0 proto kernel scope link src 172.18.138.43
172.18.128.0/20 dev eth1 proto kernel scope link src 172.18.132.145

[ssm-user@ip-172-18-138-43 bin]$ sudo reboot

## After reboot

[ssm-user@ip-172-18-138-43 bin]$ ip ro
default via 172.18.128.1 dev eth0
default via 172.18.128.1 dev eth1 metric 10001
169.254.169.254 dev eth0
172.18.128.0/20 dev eth0 proto kernel scope link src 172.18.138.43
172.18.128.0/20 dev eth1 proto kernel scope link src 172.18.132.145

Finally, I fixed this problem by removing the config of eth0:

sudo rm /etc/sysconfig/network-scripts/ifcfg-eth0

I will add it to the script.
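
To make the before/after comparison above easier to check on a live instance, here is a rough sketch that lists the default routes and reports which device wins. It relies on the kernel preferring the default route with the lowest metric, with a missing metric counting as 0, which is why eth0 takes over after the reboot. The function names default_routes and preferred_default_dev are hypothetical.

```sh
#!/bin/sh
# Print "<dev> <metric>" for every default route on stdin (`ip route`
# format), lowest metric first. A route without a metric counts as 0.
default_routes() {
    awk '
        $1 == "default" {
            dev = ""; metric = 0
            for (i = 2; i < NF; i++) {
                if ($i == "dev") dev = $(i + 1)
                if ($i == "metric") metric = $(i + 1)
            }
            print dev, metric
        }
    ' | sort -k2,2n
}

# Print the device of the default route with the lowest metric.
preferred_default_dev() {
    default_routes | head -n 1 | cut -d' ' -f1
}

# Example on a real instance:
#   ip route show | preferred_default_dev
```

Fed the "After reboot" table, this prints eth0; fed the "When an instance is created" table, it prints eth1.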

@int128
Owner

int128 commented Jan 29, 2022

I think #42 resolved the issue. Please let me know if the issue still occurs.

@nelg
Contributor Author

nelg commented Apr 6, 2022

I have tested the 2.0.1 release from the Terraform registry, and it doesn't work. It still has eth0 as the default route, so the instance can't send traffic to the internet.

Which version should I test?

@nelg
Contributor Author

nelg commented Apr 12, 2022

I'm quite keen to get a working version of this published on the registry. Rather than me publishing a copy of yours, can we work together to get it working, if you have time sometime in the next couple of weeks?

My solution is working for us, but it's not perfect: it ends up with two default routes and two interfaces in the same subnet.
Of the two attached ENIs, one has a public IP and a private IP, and the other has only a private IP. We have to route out through the one with the public IP to reach the internet.

@JulianCBC

Yeah, this latest fix is bogus.

I built this module from the example in README.md and this is my NAT instance's networking details after a reboot:

sh-4.2$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 06:e8:a4:c9:de:f6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 06:33:c7:03:41:92 brd ff:ff:ff:ff:ff:ff
    inet 10.0.128.88/24 brd 10.0.128.255 scope global dynamic eth1
       valid_lft 3401sec preferred_lft 3401sec
    inet6 fe80::433:c7ff:fe03:4192/64 scope link
       valid_lft forever preferred_lft forever
sh-4.2$ ip route
default via 10.0.128.1 dev eth1 metric 10001
10.0.128.0/24 dev eth1 proto kernel scope link src 10.0.128.88
sh-4.2$ sudo iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  anywhere             anywhere

Long story short, it appears this module is broken. I tried downgrading to 2.0.0, but after that I couldn't even connect to the EC2 instance via SSM to debug this.
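
A small sketch for pulling the relevant fact out of a dump like the one above: which interfaces are reported as DOWN. It only parses ip addr (or ip link) text from stdin, and the function name down_interfaces is hypothetical.

```sh
#!/bin/sh
# Print the name of every interface whose operational state is DOWN,
# reading `ip addr` or `ip link` output from stdin.
down_interfaces() {
    awk '
        /^[0-9]+: / {
            name = $2
            sub(/:$/, "", name)          # "eth0:" -> "eth0"
            for (i = 3; i < NF; i++)
                if ($i == "state" && $(i + 1) == "DOWN") print name
        }
    '
}

# Example on a real instance:
#   ip addr | down_interfaces
```

On the output above this reports eth0, which matches the routing table having only an eth1 default route.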

@nelg
Contributor Author

nelg commented Jul 22, 2022 via email

@JulianCBC

Yep, had to reorganise stuff so I could, but I did have an EIP on the NAT instance when I did my first round of testing with version 2.0.1.

My initial testing of this module failed to produce a working internet connection on the NAT instance or an instance on a private subnet, so it looks like something's misconfigured or missing. For the record, it's possible that the "something missing" is entirely my fault.

My understanding is that NAT gateways work like this: private host -> network interface -> NAT -> route table -> internet. So how we get to the internet shouldn't matter, which makes deleting the eth0 configuration script, and so leaving that interface unconfigured after a reboot, seem bogus. That said, all my previous hacking has used separate interfaces for the input and output sides of the NAT gateway, so it's quite possible it'll all work on one interface and leaving eth0 unconfigured is correct.

I suspect that I've made a mistake somewhere here, but I also know that the NAT gateway should have had internet access in my testing, and the fact that it doesn't is concerning. I'm going to try a couple of other options then maybe return to this depending on the outcome. fck-nat seems promising if I can figure out a simple way to Terraformise its setup.

(Another thing that stood out is that the ENI handling needs to be smarter: we should be able to detect whether it's already connected or somehow still in-use (e.g. after an instance is terminated) and respond appropriately.)

@JulianCBC

I've been thinking about this over the past couple of days and worked out why deconfiguring eth0 and requiring an EIP felt so wrong to me, and what I did wrong to break my instance of this module.

Essentially the bit I was missing here is that we need to have a public IP address so we can send stuff through an internet gateway and that the floating ENI (eth1) doesn't get one by default, so we need to assign an EIP to it so it has a public IP and can therefore connect out, otherwise we kill our internet connection when we deconfigure eth0.

This makes sense with the current use cases:

  1. Port forwarding: as we need a static public IP, we need an EIP.
  2. NAT with an EIP: as eth1 has a public IP, all our connectivity can be done on eth1, so we don't need eth0.

The reason why it wasn't working for me initially is because if the EIP isn't available before the EC2 instance starts, it doesn't get the routes it needs and is therefore cut off from the internet.

I'd really like this module to work without an EIP, so I'm going to hack together a patch to always use eth0 for output which should make this more reliable and drop the EIP requirement unless people are doing DNAT. (DNAT should still work even if we're using eth0 for our default route.)
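
As a stopgap for the ordering problem described above, one option is to wait until eth1 actually has a public address before touching eth0. This is only a sketch under assumptions: wait_for and eni_has_public_ip are hypothetical names, and the probe assumes the instance metadata service is reachable without a session token (IMDSv1).

```sh
#!/bin/sh
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        n=$((n + 1))
        [ "$n" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Succeeds if the interface named in $1 has a public IPv4 address
# according to instance metadata. Assumption: IMDSv1 is enabled; with
# IMDSv2 enforced, a session token header would also be required.
eni_has_public_ip() {
    mac=$(cat "/sys/class/net/$1/address") || return 1
    curl -sf "http://169.254.169.254/latest/meta-data/network/interfaces/macs/$mac/public-ipv4s" >/dev/null
}

# Example: give the EIP up to a minute to appear on eth1.
#   wait_for 30 2 eni_has_public_ip eth1 || echo "eth1 has no public IP" >&2
```

This does not remove the EIP requirement; it only avoids deconfiguring eth0 before eth1 can reach the internet.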

@JulianCBC

Ok:

  1. Fix for this not working at all: Fix NAT not working 2022-07 #51
  2. Changes to use eth0 for the upstream connection: Use eth0 for output #52 (Note that this is on top of the previous change)

@int128 these changes are probably overkill and I haven't tested DNAT, but they Work For Me so they should be mergeable.

@int128
Owner

int128 commented Jul 27, 2022

This module uses eth1 with the EIP to pin the source IP address.
If eth0 is used, the source IP address may fluctuate.

I think your change breaks the fixed IP feature. What do you think?

@nelg
Contributor Author

nelg commented Jul 27, 2022

This is what I have been using. It seems to be OK, at least well enough that I haven't had problems.

module "nat" {
  # Pinned to a specific commit of the module
  source = "github.com/int128/terraform-aws-nat-instance?ref=5a3d3f41568d8af145e291067f1e6e9d71fb36fd"

  # Only create the NAT instance when a managed NAT gateway is not in use
  enabled                     = !var.nat_gw
  name                        = "natgw"
  vpc_id                      = module.vpc.vpc_id
  public_subnet               = module.vpc.public_subnets[0]
  private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks
  private_route_table_ids     = var.nat_gw ? [] : module.vpc.private_route_table_ids
}

# Attach an EIP to the module's ENI so that it has a public IP
resource "aws_eip" "nat" {
  network_interface = module.nat.eni_id
  tags = {
    "Name" = "nat-instance-main"
  }
}

@JulianCBC

> This module uses eth1 with the EIP to pin the source IP address. If eth0 is used, the source IP address may fluctuate.
>
> I think your change breaks the fixed IP feature. What do you think?

I guess it depends on your use case.

If you need all your NATed traffic to come from a constant IP, then yeah, this breaks that, but this should be a pretty niche use-case and NAT instances should be pretty long-running and therefore have a relatively constant IP address, just not one known in advance.

If DNAT port forwarding is enabled, it should still work as long as the services inside the private subnet aren't expecting to be able to tell remote services something like "hey, connect to whatever my IP is, but on port 1234", where port 1234 has previously been opened using DNAT. Again, this should be a pretty niche use-case and I think that most common services that do this, e.g. active FTP, already have special case handling in Linux.

I guess that in my opinion, a constant source IP address isn't required for well over 90% of use cases, so this will be fine and removes the need for an EIP, reducing costs and resource usage.

But yeah, we can't ignore those niche cases, so maybe this should be switchable then? No EIP required for the common use cases, and tell the module it'll have an EIP if you absolutely need certainty about the source IP.
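
A rough sketch of what "switchable" could look like on the instance side. This is not the module's actual snat.sh: nat_commands and its true/false argument are hypothetical, and the function only prints the commands so they can be reviewed before being piped to a root shell.

```sh
#!/bin/sh
# Print the commands for a basic source NAT setup.
# $1 = "true"  -> masquerade out of eth1 (the ENI that carries the EIP)
# anything else -> masquerade out of eth0 (no EIP required)
nat_commands() {
    if [ "${1:-false}" = "true" ]; then
        out=eth1
    else
        out=eth0
    fi
    echo "sysctl -q -w net.ipv4.ip_forward=1"
    echo "iptables -t nat -A POSTROUTING -o $out -j MASQUERADE"
}

# Example: review first, then apply as root.
#   nat_commands true
#   nat_commands true | sudo sh
```

With the switch off, the source IP is whatever eth0 has at the time, which is the trade-off discussed above.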

@nelg
Contributor Author

nelg commented Jul 27, 2022

> If you need all your NATed traffic to come from a constant IP, then yeah, this breaks that, but this should be a pretty niche use-case and NAT instances should be pretty long-running and therefore have a relatively constant IP address, just not one known in advance.

I think this case needs to be supported; it's not that uncommon to have a whitelisted external IP.

> I guess that in my opinion, a constant source IP address isn't required for well over 90% of use cases, so this will be fine and removes the need for an EIP, reducing costs and resource usage.

As per the AWS docs:
An Elastic IP address doesn’t incur charges as long as all the following conditions are true:

  • The Elastic IP address is associated with an EC2 instance.
  • The instance associated with the Elastic IP address is running.
  • The instance has only one Elastic IP address attached to it.
  • The Elastic IP address is associated with an attached network interface. For more information, see Network interface basics.

So I don't think having the Elastic IP adds any cost, because the NAT instance is running all of the time.

@JulianCBC

> I think this case needs to be supported; it's not that uncommon to have a whitelisted external IP.

I agree that there are situations where it's needed, so I'll make it configurable.

> As per the AWS docs: An Elastic IP address doesn’t incur charges as long as all the following conditions are true:
>
>   • The Elastic IP address is associated with an EC2 instance.
>   • The instance associated with the Elastic IP address is running.
>   • The instance has only one Elastic IP address attached to it.
>   • The Elastic IP address is associated with an attached network interface. For more information, see Network interface basics.
>
> So I don't think having the Elastic IP adds any cost, because the NAT instance is running all of the time.

True, but you're limited to 5 of them without jumping through hoops with AWS support. I had to change how I was doing stuff in my VPC because I was using all 5 before I deployed this. So for people whose allocation is mostly used up by things they can't change, or who want more than 5 VPCs with NAT instances, it'd be nice not to require one.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html#using-instance-addressing-limit

@Udit-Sharma2020

I am using this module, but the EIP is not attaching to the NAT instance and the snat service is failing.

When I analyzed the repo, I found the following:

This NAT module has a runonce.sh script and a snat.sh script.
When the launch template is created, its user data section contains a command that executes runonce.sh.

runonce.sh is responsible for attaching the ENI to the NAT instance and then starting the snat service,
which in turn calls /opt/nat/snat.sh to apply the NAT configuration.

But this is not working as expected:
runonce.sh is not getting executed.
